Statistics: 16 views · 0 downloads · 0 donations
Uploader: 高宏飞
Shared on 2025-12-21
Author: Max Pumperla, Edward Oakes, Richard Liaw

Get started with Ray, the open source distributed computing framework that simplifies the process of scaling compute-intensive Python workloads. With this practical book, Python programmers, data engineers, and data scientists will learn how to leverage Ray locally and spin up compute clusters. You'll be able to use Ray to structure and run machine learning programs at scale. Authors Max Pumperla, Edward Oakes, and Richard Liaw show you how to build machine learning applications with Ray. You'll understand how Ray fits into the current landscape of machine learning tools and discover how Ray continues to integrate ever more tightly with these tools. Distributed computation is hard, but by using Ray you'll find it easy to get started.

• Learn how to build your first distributed applications with Ray Core
• Conduct hyperparameter optimization with Ray Tune
• Use the Ray RLlib library for reinforcement learning
• Manage distributed training with the Ray Train library
• Use Ray to perform data processing with Ray Datasets
• Learn how to work with Ray Clusters and serve models with Ray Serve
• Build end-to-end machine learning applications with Ray AIR

Tags
No tags
ISBN: 1098117166
Publisher: O'Reilly Media, Inc.
Publish Year: 2022
Language: English
Pages: 160
File Format: PDF
File Size: 4.0 MB
Support Statistics: ¥0.00 · 0 times
Text Preview (First 20 pages)

With Early Release ebooks, you get books in their earliest form—the author's raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles.

Max Pumperla, Edward Oakes, and Richard Liaw
Learning Ray
Flexible Distributed Python for Data Science
Boston · Farnham · Sebastopol · Tokyo · Beijing
978-1-098-11716-0

Learning Ray
by Max Pumperla, Edward Oakes, and Richard Liaw
Copyright © 2023 Max Pumperla and O'Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Jeff Bleiel and Jessica Haberman
Production Editor: Katherine Tozer
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

April 2023: First Edition

Revision History for the Early Release
2022-01-21: First Release
2022-03-11: Second Release
2022-04-21: Third Release
2022-06-06: Fourth Release
2022-07-13: Fifth Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098117221 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Learning Ray, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O'Reilly and Anyscale. See our statement of editorial independence.
Table of Contents

1. An Overview of Ray  7
   What Is Ray?  8
   What Led to Ray?  8
   Flexible Workloads in Python and Reinforcement Learning  10
   Three Layers: Core, Libraries and Ecosystem  11
   A Distributed Computing Framework  11
   A Suite of Data Science Libraries  14
   Machine Learning and the Data Science Workflow  14
   Data Processing with Ray Data  17
   Model Training  19
   Hyperparameter Tuning  22
   Model Serving  24
   A Growing Ecosystem  26
   How Ray Integrates and Extends  26
   Ray as Distributed Interface  27
   Summary  27

2. Getting Started With Ray Core  29
   An Introduction To Ray Core  29
   A First Example Using the Ray API  30
   An Overview of the Ray Core API  40
   Design Principles  41
   Understanding Ray System Components  42
   Scheduling and Executing Work on a Node  42
   The Head Node  45
   Distributed Scheduling and Execution  46
   Summary  48

3. Building Your First Distributed Application  49
   Setting Up A Simple Maze Problem  50
   Building a Simulation  55
   Training a Reinforcement Learning Model  58
   Building a Distributed Ray App  62
   Recapping RL Terminology  67
   Summary  68

4. Reinforcement Learning with Ray RLlib  69
   An Overview of RLlib  70
   Getting Started With RLlib  71
   Building A Gym Environment  71
   Running the RLlib CLI  73
   Using the RLlib Python API  74
   Configuring RLlib Experiments  81
   Resource Configuration  81
   Debugging and Logging Configuration  82
   Rollout Worker and Evaluation Configuration  82
   Environment Configuration  83
   Working With RLlib Environments  84
   An Overview of RLlib Environments  84
   Working with Multiple Agents  85
   Working with Policy Servers and Clients  90
   Advanced Concepts  92
   Building an Advanced Environment  93
   Applying Curriculum Learning  94
   Working with Offline Data  96
   Other Advanced Topics  97
   Summary  98

5. Hyperparameter Optimization with Ray Tune  99
   Tuning Hyperparameters  100
   Building a random search example with Ray  100
   Why is HPO hard?  102
   An introduction to Tune  103
   How does Tune work?  104
   Configuring and running Tune  108
   Machine Learning with Tune  113
   Using RLlib with Tune  113
   Tuning Keras Models  114
   Summary  116

6. Distributed Training with Ray Train  119
   The Basics of Distributed Model Training  119
   Introduction to Ray Train  120
   Creating an End-To-End Example for Ray Train  121
   Preprocessors in Ray Train  123
   Usage of Preprocessors  123
   Serialization of Preprocessors  123
   Trainers in Ray Train  124
   Distributed Training for Gradient Boosted Trees  124
   Distributed Training for Deep Learning  125
   Scaling Out Training with Ray Train Trainers  127
   Connecting Data to Distributed Training  128
   Ray Train Features  130
   Checkpoints  130
   Callbacks  130
   Integration with Ray Tune  131
   Exporting Models  132
   Some Caveats  133

7. Data Processing with Ray  135
   Ray Datasets  136
   An Overview of Datasets  136
   Datasets Basics  137
   Computing Over Datasets  140
   Dataset Pipelines  142
   Example: Parallel SGD from Scratch  145
   External Library Integrations  148
   Overview  148
   Dask on Ray  149
   Building an ML Pipeline  151
   Background  151
   End-to-End Example: Predicting Big Tips in NYC Taxi Rides  152
CHAPTER 1
An Overview of Ray

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.
—Leslie Lamport

A Note for Early Release Readers
With Early Release ebooks, you get books in their earliest form—the author's raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles.

One of the reasons we need efficient distributed computing is that we're collecting ever more data, with a large variety, at increasing speeds. The storage systems, data processing and analytics engines that have emerged in the last decade are crucially important to the success of many companies. Interestingly, most "big data" technologies are built for and operated by (data) engineers, who are in charge of data collection and processing tasks. The rationale is to free up data scientists to do what they're best at. As a data science practitioner you might want to focus on training complex machine learning models, running efficient hyperparameter selection, building entirely new and custom models or simulations, or serving your models to showcase them. At the same time you might simply have to scale them to a compute cluster.

To do that, the distributed system of your choice needs to support all of these fine-grained "big compute" tasks, potentially on specialized hardware. Ideally, it also fits into the big data tool chain you're using and is fast enough to meet your latency requirements. In other words, distributed computing has to be powerful and flexible enough for complex data science workloads — and Ray can help you with that.
Python is likely the most popular language for data science today, and it's certainly the one I find the most useful for my daily work. By now it's over 30 years old, but it has a still-growing and active community. The rich PyData ecosystem is an essential part of a data scientist's toolbox. How can you make sure to scale out your workloads while still leveraging the tools you need? That's a difficult problem, especially since communities can't be forced to just toss out their toolbox or programming language. That means distributed computing tools for data science have to be built for their existing community.

What Is Ray?
What I like about Ray is that it checks all the above boxes. It's a flexible distributed computing framework built for the Python data science community. Ray is easy to get started with and keeps simple things simple. Its core API is as lean as it gets and helps you reason effectively about the distributed programs you want to write. You can efficiently parallelize Python programs on your laptop, and run the code you tested locally on a cluster practically without any changes. Its high-level libraries are easy to configure and can seamlessly be used together. Some of them, like Ray's reinforcement learning library, would have a bright future as standalone projects, distributed or not. While Ray's core is built in C++, it's been a Python-first framework since day one, integrates with many important data science tools, and can count on a growing ecosystem.

Distributed Python is not new, and Ray is not the first framework in this space (nor will it be the last), but it is special in what it has to offer. Ray is particularly strong when you combine several of its modules and have custom, machine learning-heavy workloads that would be difficult to implement otherwise. It makes distributed computing easy enough to run your complex workloads flexibly by leveraging the Python tools you know and want to use. In other words, by learning Ray you get to know flexible distributed Python for data science.

In this chapter you'll get a first glimpse at what Ray can do for you. We will discuss the three layers that make up Ray, namely its core engine, its high-level libraries and its ecosystem. Throughout the chapter we'll show you first code examples to give you a feel for Ray, but we defer any in-depth treatment of Ray's APIs and components to later chapters. You can view this chapter as an overview of the whole book as well.

What Led to Ray?
Programming distributed systems is hard. It requires specific knowledge and experience you might not have. Ideally, such systems get out of your way and provide abstractions to let you focus on your job. But in practice "all non-trivial abstractions, to some degree, are leaky" (Spolsky), and getting clusters of computers to do what you want is undoubtedly difficult.
Many software systems require resources that far exceed what single servers can do. Even if one server were enough, modern systems need to be failsafe and provide features like high availability. That means your applications might have to run on multiple machines, or even datacenters, just to make sure they're running reliably.

Even if you're not too familiar with machine learning (ML) or, more generally, artificial intelligence (AI) as such, you must have heard of recent breakthroughs in the field. To name just two, systems like Deepmind's AlphaFold for solving the protein folding problem, or OpenAI's Codex that's helping software developers with the tedious parts of their job, have made the news lately. You might also have heard that ML systems generally require large amounts of data to be trained. OpenAI has shown exponential growth in the compute needed to train AI models in their paper "AI and Compute". The operations needed for AI systems in their study are measured in petaflops (thousands of trillion operations per second), and have been doubling every 3.4 months since 2012.

Compare this to Moore's Law[1], which states that the number of transistors in computers would double every two years. Even if you're bullish on Moore's Law, you can see how there's a clear need for distributed computing in ML. You should also understand that many tasks in ML can be naturally decomposed to run in parallel. So, why not speed things up if you can?

[1] Moore's Law held for a long time, but there might be signs that it's slowing down. We're not here to argue it, though. What's important is not that our computers generally keep getting faster, but how that relates to the amount of compute we need.

Distributed computing is generally perceived as hard. But why is that? Shouldn't it be realistic to find good abstractions to run your code on clusters, without having to constantly think about individual machines and how they interoperate? What if we specifically focused on AI workloads?

Researchers at RISELab at UC Berkeley created Ray to address these questions. None of the tools existing at the time met their needs. They were looking for easy ways to speed up their workloads by distributing them to compute clusters. The workloads they had in mind were quite flexible in nature and didn't fit into the analytics engines available. At the same time, RISELab wanted to build a system that took care of how the work was distributed. With reasonable default behaviors in place, researchers should be able to focus on their work. And ideally they should have access to all their favorite tools in Python. For this reason, Ray was built with an emphasis on high performance and heterogeneous workloads. Anyscale, the company behind Ray, is building a managed Ray platform and offers hosted solutions for your Ray applications. Let's have a look at an example of what kinds of applications Ray was designed for.
Flexible Workloads in Python and Reinforcement Learning
One of my favorite apps on my phone can automatically classify or "label" individual plants in our garden. It works by simply showing it a picture of the plant in question. That's immensely helpful, as I'm terrible at distinguishing them all. (I'm not bragging about the size of my garden, I'm just bad at it.) In the last couple of years we've seen a surge of impressive applications like that.

Ultimately, the promise of AI is to build intelligent agents that go far beyond classifying objects. Imagine an AI application that not only knows your plants, but can take care of them, too. Such an application would have to

• Operate in dynamic environments (like the change of seasons)
• React to changes in the environment (like a heavy storm or pests attacking your plants)
• Take sequences of actions (like watering and fertilizing plants)
• Accomplish long-term goals (like prioritizing plant health)

By observing its environment, such an AI would also learn to explore the possible actions it could take and come up with better solutions over time. If you feel like this example is artificial or too far out, it's not difficult to come up with examples of your own that share all the above requirements. Think of managing and optimizing a supply chain, strategically restocking a warehouse considering fluctuating demands, or orchestrating the processing steps in an assembly line.

Another well-known example of what you could expect from an AI is Stephen Wozniak's famous "Coffee Test". If you're invited to a friend's house, you can navigate to the kitchen, spot the coffee machine and all necessary ingredients, figure out how to brew a cup of coffee, and sit down to enjoy it. A machine should be able to do the same, except the last part might be a bit of a stretch. What other examples can you think of?

You can frame all the above requirements naturally in a subfield of machine learning called reinforcement learning (RL). We've dedicated all of Chapter 4 to RL. For now, it's enough to understand that it's about agents interacting with their environment by observing it and emitting actions. In RL, agents evaluate their environments by attributing a reward (e.g., how healthy is my plant on a scale from 1 to 10). The term "reinforcement" comes from the fact that agents will hopefully learn to seek out behaviour that leads to good outcomes (high reward), and shy away from punishing situations (low or negative reward). The interaction of agents with their environment is usually modeled by creating a computer simulation of it. These simulations can become complicated quite quickly, as you might imagine from the examples we've given.
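To make the observe-act-reward loop concrete before we get to any real RL tooling, here is a toy sketch in plain Python; the environment, its dynamics and the "water"/"wait" actions are invented purely for illustration, loosely following the plant example above:

    import random

    class PlantEnvironment:
        """A toy simulation: the state is a plant's health on a scale from 1 to 10."""

        def __init__(self):
            self.health = 5

        def step(self, action):
            # Watering helps a little, waiting lets the plant dry out a bit.
            self.health += 1 if action == "water" else -1
            self.health = max(1, min(10, self.health))
            reward = self.health  # the signal the agent tries to maximize
            return self.health, reward

    env = PlantEnvironment()
    total_reward = 0
    for _ in range(10):
        action = random.choice(["water", "wait"])  # a trained agent would learn this choice
        observation, reward = env.step(action)
        total_reward += reward
    print(total_reward)

A real RL library replaces the random choice with a policy that improves from the collected rewards, and the toy class with a full simulation of the problem you actually care about.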
We don't have gardening robots like the one I've sketched yet. And we don't know which AI paradigm will get us there.[2] What I do know is that the world is full of complex, dynamic and interesting examples that we need to tackle. For that we need computational frameworks that help us do so, and Ray was built to do exactly that. RISELab created Ray to build and run complex AI applications at scale, and reinforcement learning has been an integral part of Ray from the start.

[2] For the experts among you: I don't claim that RL is the answer. RL is just a paradigm that naturally fits into this discussion of AI goals.

Three Layers: Core, Libraries and Ecosystem
Now that you know why Ray was built and what its creators had in mind, let's look at the three layers of Ray:

• A low-level, distributed computing framework for Python with a concise core API.[3]
• A set of high-level libraries for data science built and maintained by the creators of Ray.
• A growing ecosystem of integrations and partnerships with other notable projects.

[3] This is a Python book, so we'll exclusively focus on it. But you should at least know that Ray also has a Java API, which at this point is less mature than its Python equivalent.

There's a lot to unpack here, and we'll look into each of these layers individually in the remainder of this chapter. You can imagine Ray's core engine with its API at the center of things, on which everything else builds. Ray's data science libraries build on top of it. In practice, most data scientists will use these higher-level libraries directly and won't often need to resort to the core API. The growing number of third-party integrations for Ray is another great entry point for experienced practitioners. Let's look into each one of the layers one by one.

A Distributed Computing Framework
At its core, Ray is a distributed computing framework. We'll provide you with just the basic terminology here, and talk about Ray's architecture in depth in Chapter 2. In short, Ray sets up and manages clusters of computers so that you can run distributed tasks on them. A Ray cluster consists of nodes that are connected to each other via a network. You program against the so-called driver, the program root, which lives on the head node. The driver can run jobs, that is, collections of tasks, that are run on the nodes in the cluster. Specifically, the individual tasks of a job are run on worker processes on worker nodes.
Figure 1-1 illustrates the basic structure of a Ray cluster.

Figure 1-1. The basic components of a Ray cluster

What's interesting is that a Ray cluster can also be a local cluster, i.e. a cluster consisting just of your own computer. In this case, there's just one node, namely the head node, which has the driver process and some worker processes. The default number of worker processes is the number of CPUs available on your machine.

With that knowledge at hand, it's time to get your hands dirty and run your first local Ray cluster. Installing Ray[4] on any of the major operating systems should work seamlessly using pip:

    pip install "ray[rllib, serve, tune]"==1.9.0

With a simple pip install ray you would have installed just the very basics of Ray. Since we want to explore some advanced features, we installed the "extras" rllib, serve and tune, which we'll discuss in a bit. Depending on your system configuration you may not need the quotation marks in the above installation command.

[4] We're using Ray version 1.9.0 at this point, as it's the latest version available as of this writing.

Next, go ahead and start a Python session. You could use the ipython interpreter, which I find to be the most suitable environment for following along with simple examples. If you don't feel like typing in the commands yourself, you can also jump into the Jupyter notebook for this chapter and run the code there. The choice is up to you, but in any case please remember to use Python version 3.7 or later. In your Python session you can now easily import and initialize Ray as follows:
Example 1-1.

    import ray
    ray.init()

With those two lines of code you've started a Ray cluster on your local machine. This cluster can utilize all the cores available on your computer as workers. In this case you didn't provide any arguments to the init function. If you wanted to run Ray on a "real" cluster, you'd have to pass more arguments to init. The rest of your code would stay the same.

After running this code you should see output of the following form (we use ellipses to remove the clutter):

    ... INFO services.py:1263 -- View the Ray dashboard at http://127.0.0.1:8265
    {'node_ip_address': '192.168.1.41',
     'raylet_ip_address': '192.168.1.41',
     'redis_address': '192.168.1.41:6379',
     'object_store_address': '.../sockets/plasma_store',
     'raylet_socket_name': '.../sockets/raylet',
     'webui_url': '127.0.0.1:8265',
     'session_dir': '...',
     'metrics_export_port': 61794,
     'node_id': '...'}

This indicates that your Ray cluster is up and running. As you can see from the first line of the output, Ray comes with its own pre-packaged dashboard. In all likelihood you can check it out at http://127.0.0.1:8265, unless your output shows a different port. If you want, you can take your time to explore the dashboard for a little while. For instance, you should see all your CPU cores listed and the total utilization of your (trivial) Ray application. We'll come back to the dashboard in later chapters.

We're not quite ready to dive into all the details of a Ray cluster here. To jump ahead just a little: you might see the raylet_ip_address in the output, which is a reference to a so-called Raylet, which is responsible for scheduling tasks on your worker nodes. Each Raylet has a store for distributed objects, which is hinted at by the object_store_address above. Once tasks are scheduled, they get executed by worker processes. In Chapter 2 you'll get a much better understanding of all these components and how they make up a Ray cluster.

Before moving on, we should also briefly mention that the Ray core API is very accessible and easy to use. But since it is also a rather low-level interface, it takes time to build interesting examples with it. Chapter 2 has an extensive first example to get you started with the Ray core API, and in Chapter 3 you'll see how to build a more interesting Ray application for reinforcement learning.
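Still, even a trivial example gives a feel for what the core API looks like. The following sketch only assumes the local cluster started above (the function and its arguments are made up for illustration); it turns an ordinary Python function into a Ray task and runs a few of them in parallel on your worker processes:

    import ray

    ray.init()  # start a local Ray cluster (skip this if one is already running)

    @ray.remote
    def square(x):
        # Each call to square.remote(...) becomes a task scheduled on a worker process.
        return x * x

    # .remote() returns object references immediately instead of results.
    refs = [square.remote(i) for i in range(4)]

    # ray.get() blocks until the results are available in the object store.
    print(ray.get(refs))  # [0, 1, 4, 9]

Anything more interesting than squaring numbers will have to wait until Chapter 2.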
Right now your Ray cluster doesn't do much, but that's about to change. After giving you a quick introduction to the data science workflow in the following section, you'll run your first concrete Ray examples.

A Suite of Data Science Libraries
Moving on to the second layer of Ray, in this section we'll introduce all the data science libraries that Ray comes with. To do so, let's first take a bird's-eye view of what it means to do data science. Once you understand this context, it's much easier to place Ray's higher-level libraries and see how they can be useful to you. If you have a good idea of the data science process, you can safely skip ahead to the section "Data Processing with Ray Data" on page 17.

Machine Learning and the Data Science Workflow
The somewhat elusive term "data science" (DS) has evolved quite a bit in recent years, and you can find many definitions of varying usefulness online.[5] To me, it's the practice of gaining insights and building real-world applications by leveraging data. That's quite a broad definition, and you don't have to agree with me. My point is that data science is an inherently practical and applied field that centers around building and understanding things, which makes fairly little sense in a purely academic context. In that sense, describing practitioners of this field as "data scientists" is about as bad of a misnomer as describing hackers as "computer scientists".[6]

[5] I never liked the categorization of data science as an intersection of disciplines, like maths, coding and business. Ultimately, that doesn't tell you what practitioners do. It doesn't do a cook justice to tell them they sit at the intersection of agriculture, thermodynamics and human relations. It's not wrong, but also not very helpful.
[6] As a fun exercise, I recommend reading Paul Graham's famous "Hackers and Painters" essay on this topic and replacing "computer science" with "data science". What would hacking 2.0 be?

Since you are familiar with Python and hopefully bring a certain craftsmanship attitude with you, we can approach Ray's data science libraries from a very pragmatic angle. Doing data science in practice is an iterative process that goes something like this:

Requirements engineering
You talk to stakeholders to identify the problems you need to solve and clarify the requirements for this project.

Data collection
Then you source, collect and inspect the data.
Data processing
Afterwards you process the data such that you can tackle the problem.

Model building
You then move on to build a model (in the broadest sense) using the data. That could be a dashboard with important metrics, a visualisation, or a machine learning model, among many other things.

Model evaluation
The next step is to evaluate your model against the requirements in the first step.

Deployment
If all goes well (it likely doesn't), you deploy your solution in a production environment. You should understand this as an ongoing process that needs to be monitored, not as a one-off step. Otherwise, you need to circle back and start from the top. The most likely outcome is that you need to improve your solution in various ways, even after initial deployment.

Machine learning is not necessarily part of this process, but you can see how building smart applications or gaining insights might benefit from ML. Building a face detection app into your social media platform, for better or worse, might be one example of that. When the data science process just described explicitly involves building machine learning models, you can further specify some steps:

Data processing
To train machine learning models, you need data in a format that is understood by your ML model. The process of transforming and selecting what data should be fed into your model is often called feature engineering. This step can be messy. You'll benefit a lot if you can rely on common tools to do the job.

Model training
In ML you need to train your algorithms on data that got processed in the last step. This includes selecting the right algorithm for the job, and it helps if you can choose from a wide variety.

Hyperparameter tuning
Machine learning models have parameters that are tuned in the model training step. Most ML models also have another set of parameters, called hyperparameters, that can be modified prior to training. These parameters can heavily influence the performance of your resulting ML model and need to be tuned properly. There are good tools to help automate that process.

Model serving
Trained models need to be deployed. To serve a model means to make it available to whoever needs access by whatever means necessary. In prototypes, you often use simple HTTP servers, but there are many specialised software packages for ML model serving.
This list is by no means exhaustive. Don't worry if you've never gone through these steps or struggle with the terminology, we'll come back to this in much more detail in later chapters. If you want to understand more about the holistic view of the data science process when building machine learning applications, the book Building Machine Learning Powered Applications is dedicated to it entirely. Figure 1-2 gives an overview of the steps we just discussed:

Figure 1-2. An overview of the data science experimentation workflow using machine learning

At this point you might be wondering how any of this relates to Ray. The good news is that Ray has a dedicated library for each of the four ML-specific tasks above, covering data processing, model training, hyperparameter tuning and model serving. And the way Ray is designed, all these libraries are distributed by construction. Let's walk through each of them one by one.
Data Processing with Ray Data
The first high-level library of Ray we talk about is called "Ray Data". This library contains a data structure aptly called Dataset, a multitude of connectors for loading data from various formats and systems, an API for transforming such datasets, a way to build data processing pipelines with them, and many integrations with other data processing frameworks. The Dataset abstraction builds on the powerful Arrow framework. To use Ray Data, you need to install Arrow for Python, for instance by running pip install pyarrow.

We'll now discuss a simple example that creates a distributed Dataset on your local Ray cluster from a Python data structure. Specifically, you'll create a dataset from Python dictionaries, each containing a string "name" and an integer-valued "data" entry, for 10,000 items in total:

Example 1-2.

    import ray

    items = [{"name": str(i), "data": i} for i in range(10000)]
    ds = ray.data.from_items(items)
    ds.show(5)

• Creating a Dataset by using from_items from the ray.data module.
• Printing the first 5 items of the Dataset.

To show a Dataset means to print some of its values. You should see precisely 5 so-called ArrowRow elements on your command line, like this:

    ArrowRow({'name': '0', 'data': 0})
    ArrowRow({'name': '1', 'data': 1})
    ArrowRow({'name': '2', 'data': 2})
    ArrowRow({'name': '3', 'data': 3})
    ArrowRow({'name': '4', 'data': 4})

Great, now you have some distributed rows, but what can you do with that data? The Dataset API bets heavily on functional programming, as it is very well suited for data transformations. Even though Python 3 made a point of hiding some of its functional programming capabilities, you're probably familiar with functionality such as map, filter and others. If not, it's easy enough to pick up: map takes each element of your dataset and transforms it into something else, in parallel. filter removes data points according to a boolean filter function. And the slightly more elaborate flat_map first maps values similarly to map, but then also "flattens" the result. For instance, if map would produce a list of lists, flat_map would flatten out the nested lists and give you just a list.
Equipped with these three functional API calls, let's see how easily you can transform your dataset ds:

Example 1-3. Transforming a Dataset with common functional programming routines

    squares = ds.map(lambda x: x["data"] ** 2)

    evens = squares.filter(lambda x: x % 2 == 0)
    evens.count()

    cubes = evens.flat_map(lambda x: [x, x**3])

    sample = cubes.take(10)
    print(sample)

• We map each row of ds to only keep the square value of its data entry.
• Then we filter the squares to only keep even numbers (a total of 5000 elements).
• We then use flat_map to augment the remaining values with their respective cubes.
• To take a total of 10 values means to leave Ray and return a Python list with these values that we can print.

The drawback of Dataset transformations is that each step gets executed synchronously. In Example 1-3 this is a non-issue, but for complex tasks that, e.g., mix reading files and processing data, you want an execution that can overlap individual tasks. DatasetPipeline does exactly that. Let's rewrite the last example into a pipeline.

Example 1-4.

    pipe = ds.window()
    result = pipe\
        .map(lambda x: x["data"] ** 2)\
        .filter(lambda x: x % 2 == 0)\
        .flat_map(lambda x: [x, x**3])
    result.show(10)

• You can turn a Dataset into a pipeline by calling .window() on it.
• Pipeline steps can be chained to yield the same result as before.
There's a lot more to be said about Ray Data, especially its integration with notable data processing systems, but we'll have to defer an in-depth discussion until Chapter 7.

Model Training
Moving on to the next set of libraries, let's look at the distributed training capabilities of Ray. For that, you have access to two libraries. One is dedicated to reinforcement learning specifically; the other one has a different scope and is aimed primarily at supervised learning tasks.

Reinforcement learning with Ray RLlib
Let's start with Ray RLlib for reinforcement learning. This library is powered by the modern ML frameworks TensorFlow and PyTorch, and you can choose which one to use. Both frameworks seem to converge more and more conceptually, so you can pick the one you like most without losing much in the process. Throughout the book we use TensorFlow for consistency. Go ahead and install it with pip install tensorflow right now.

One of the easiest ways to run examples with RLlib is to use the command line tool rllib, which we've already implicitly installed earlier with pip. Once you run more complex examples in Chapter 4, you will mostly rely on its Python API, but for now we just want to get a first taste of running RL experiments.

We'll look at a fairly classical control problem of balancing a pendulum. Imagine you have a pendulum like the one in Figure 1-3, fixed at a single point and subject to gravity. You can manipulate that pendulum by giving it a push from the left or the right. If you exert just the right amount of force, the pendulum might remain in an upright position. That's our goal, and the question is whether we can teach a reinforcement learning algorithm to do so for us.
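To give a first rough idea of what such an experiment can look like in code, here is a minimal sketch using RLlib's Python API; the PPO algorithm, the classic Gym CartPole-v1 environment as a stand-in for the pendulum task, and all settings are assumptions for illustration, not necessarily the book's exact example:

    import ray
    from ray.rllib.agents.ppo import PPOTrainer

    ray.init()  # start a local Ray cluster (skip if one is already running)

    # Configure a PPO trainer; CartPole-v1 stands in for the pendulum-balancing task.
    trainer = PPOTrainer(config={
        "env": "CartPole-v1",
        "framework": "tf",   # the book uses TensorFlow throughout
        "num_workers": 2,    # rollouts are collected by two parallel worker processes
    })

    # Each call to train() runs one training iteration and returns a result dict.
    for i in range(3):
        result = trainer.train()
        print(i, result["episode_reward_mean"])

The mean episode reward should climb over the iterations as the policy learns to keep the pole balanced.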