Scaling Python with Dask: From Data Science to Machine Learning

Author: Holden Karau, Mika Kimmins


Modern systems contain multi-core CPUs and GPUs that have the potential for parallel computing. But many scientific Python tools were not designed to leverage this parallelism. With this short but thorough resource, data scientists and Python programmers will learn how the Dask open source library for parallel computing provides APIs that make it easy to parallelize PyData libraries including NumPy, pandas, and scikit-learn. Authors Holden Karau and Mika Kimmins show you how to use Dask computations in local systems and then scale to the cloud for heavier workloads. This practical book explains why Dask is popular among industry experts and academics and is used by organizations that include Walmart, Capital One, Harvard Medical School, and NASA.

With this book, you'll learn:
• What Dask is, where you can use it, and how it compares with other tools
• How to use Dask for batch data parallel processing
• Key distributed system concepts for working with Dask
• Methods for using Dask with higher-level APIs and building blocks
• How to work with integrated libraries such as scikit-learn, pandas, and PyTorch
• How to use Dask with GPUs

📄 Page 1
Holden Karau & Mika Kimmins
Scaling Python with Dask: From Data Science to Machine Learning
📄 Page 2
DATA

“Scaling Python with Dask is excellent and a must-read if you’re a new Dask user or considering Dask for a project. Dask offers powerful features along with many subtle considerations to keep in mind, making Holden and Mika your ideal tour guides for exploring this new territory.”
—Adam Breindel, Independent Consultant, Data Engineering and ML/AI

“I’m happy to see a Dask book written by experts in the field.”
—Matthew Rocklin, Original Dask Maintainer and CEO at Coiled Computing

Scaling Python with Dask

Twitter: @oreillymedia
linkedin.com/company/oreilly-media
youtube.com/oreillymedia

Modern systems contain multi-core CPUs and GPUs that have the potential for parallel computing. But many scientific Python tools were not designed to leverage this parallelism. With this short but thorough resource, data scientists and Python programmers will learn how the Dask open source library for parallel computing provides APIs that make it easy to parallelize PyData libraries including NumPy, pandas, and scikit-learn. Authors Holden Karau and Mika Kimmins show you how to use Dask computations in local systems and then scale to the cloud for heavier workloads. This practical book explains why Dask is popular among industry experts and academics and is used by organizations that include Walmart, Capital One, Harvard Medical School, and NASA.

With this book, you’ll learn:
• How to use Dask for batch data parallel processing
• Key distributed system concepts for working with Dask
• Methods for using Dask with higher-level APIs and building blocks
• How to work with integrated libraries
• How to use Dask with GPUs

Holden Karau is a queer transgender Canadian, Apache Spark committer, Apache Software Foundation member, and an active open source contributor. Mika Kimmins is a data engineer, distributed systems researcher, and ML consultant. She’s worked on a variety of NLP projects incorporating language modeling, reinforcement learning, and ML pipelining at scale.
US $79.99 CAN $99.99 ISBN: 978-1-098-11987-4
📄 Page 3
Holden Karau and Mika Kimmins
Scaling Python with Dask: From Data Science to Machine Learning
Beijing • Boston • Farnham • Sebastopol • Tokyo
📄 Page 4
978-1-098-11987-4 [LSI]

Scaling Python with Dask by Holden Karau and Mika Kimmins

Copyright © 2023 Holden Karau and Mika Kimmins. All rights reserved.

Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (https://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Nicole Butterfield
Development Editor: Virginia Wilson
Production Editor: Gregory Hyman
Copyeditor: JM Olejarz
Proofreader: Arthur Johnson
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

July 2023: First Edition

Revision History for the First Edition
2023-07-19: First Release

See https://oreilly.com/catalog/errata.csp?isbn=9781098119874 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Scaling Python with Dask, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
📄 Page 5
Table of Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

1. What Is Dask? . . . . . . . . . . . . . . . . . . . . . . . . 1
   Why Do You Need Dask?  1
   Where Does Dask Fit in the Ecosystem?  2
   Big Data  3
   Data Science  4
   Parallel to Distributed Python  4
   Dask Community Libraries  5
   What Dask Is Not  7
   Conclusion  7

2. Getting Started with Dask . . . . . . . . . . . . . . . . . . 9
   Installing Dask Locally  9
   Hello Worlds  10
   Task Hello World  11
   Distributed Collections  13
   Dask DataFrame (Pandas/What People Wish Big Data Was)  15
   Conclusion  16

3. How Dask Works: The Basics . . . . . . . . . . . . . . . . . 17
   Execution Backends  17
   Local Backends  18
   Distributed (Dask Client and Scheduler)  19
   Dask’s Diagnostics User Interface  21
   Serialization and Pickling  22
   Partitioning/Chunking Collections  24
📄 Page 6
   Dask Arrays  24
   Dask Bags  25
   Dask DataFrames  25
   Shuffles  26
   Partitions During Load  27
   Tasks, Graphs, and Lazy Evaluation  27
   Lazy Evaluation  27
   Task Dependencies  28
   visualize  28
   Intermediate Task Results  30
   Task Sizing  30
   When Task Graphs Get Too Large  30
   Combining Computation  31
   Persist, Caching, and Memoization  31
   Fault Tolerance  32
   Conclusion  33

4. Dask DataFrame . . . . . . . . . . . . . . . . . . . . . . . 35
   How Dask DataFrames Are Built  36
   Loading and Writing  36
   Formats  37
   Filesystems  41
   Indexing  42
   Shuffles  43
   Rolling Windows and map_overlap  43
   Aggregations  44
   Full Shuffles and Partitioning  47
   Embarrassingly Parallel Operations  50
   Working with Multiple DataFrames  51
   Multi-DataFrame Internals  52
   Missing Functionality  53
   What Does Not Work  53
   What’s Slower  54
   Handling Recursive Algorithms  54
   Re-computed Data  54
   How Other Functions Are Different  55
   Data Science with Dask DataFrame: Putting It Together  55
   Deciding to Use Dask  56
   Exploratory Data Analysis with Dask  56
   Loading Data  56
   Plotting Data  57
📄 Page 7
   Inspecting Data  58
   Conclusion  58

5. Dask’s Collections . . . . . . . . . . . . . . . . . . . . . 61
   Dask Arrays  61
   Common Use Cases  61
   When Not to Use Dask Arrays  62
   Loading/Saving  62
   What’s Missing  62
   Special Dask Functions  63
   Dask Bags  63
   Common Use Cases  64
   Loading and Saving Dask Bags  64
   Loading Messy Data with a Dask Bag  64
   Limitations  68
   Conclusion  69

6. Advanced Task Scheduling: Futures and Friends . . . . . . . 71
   Lazy and Eager Evaluation Revisited  72
   Use Cases for Futures  73
   Launching Futures  73
   Future Life Cycle  74
   Fire-and-Forget  75
   Retrieving Results  76
   Nested Futures  78
   Conclusion  79

7. Adding Changeable/Mutable State with Dask Actors . . . . . . 81
   What Is the Actor Model?  82
   Dask Actors  83
   Your First Actor (It’s a Bank Account)  83
   Scaling Dask Actors  85
   Limitations  87
   When to Use Dask Actors  88
   Conclusion  88

8. How to Evaluate Dask’s Components and Libraries . . . . . . 89
   Qualitative Considerations for Project Evaluation  91
   Project Priorities  91
   Community  92
   Dask-Specific Best Practices  94
📄 Page 8
   Up-to-Date Dependencies  94
   Documentation  94
   Openness to Contributions  95
   Extensibility  96
   Quantitative Metrics for Open Source Project Evaluation  96
   Release History  96
   Commit Frequency (and Volume)  97
   Library Usage  97
   Code and Best Practices  99
   Conclusion  100

9. Migrating Existing Analytic Engineering . . . . . . . . . . 101
   Why Dask?  101
   Limitations of Dask  102
   Migration Road Map  103
   Types of Clusters  103
   Development: Considerations  105
   Deployment Monitoring  108
   Conclusion  109

10. Dask with GPUs and Other Special Resources . . . . . . . . 111
   Transparent Versus Non-transparent Accelerators  112
   Understanding Whether GPUs or TPUs Can Help  112
   Making Dask Resource-Aware  113
   Installing the Libraries  114
   Using Custom Resources Inside Your Dask Tasks  115
   Decorators (Including Numba)  116
   GPUs  117
   GPU Acceleration Built on Top of Dask  118
   cuDF  118
   BlazingSQL  118
   cuStreamz  119
   Freeing Accelerator Resources  119
   Design Patterns: CPU Fallback  119
   Conclusion  120

11. Machine Learning with Dask . . . . . . . . . . . . . . . . 121
   Parallelizing ML  122
   When to Use Dask-ML  122
   Getting Started with Dask-ML and XGBoost  123
   Feature Engineering  123
📄 Page 9
   Model Selection and Training  127
   When There Is No Dask-ML Equivalent  128
   Use with Dask’s joblib  129
   XGBoost with Dask  130
   ML Models with Dask-SQL  132
   Inference and Deployment  134
   Distributing Data and Models Manually  135
   Large-Scale Inferences with Dask  136
   Conclusion  138

12. Productionizing Dask: Notebooks, Deployment, Tuning, and Monitoring . . 139
   Factors to Consider in a Deployment Option  140
   Building Dask on a Kubernetes Deployment  142
   Dask on Ray  144
   Dask on YARN  144
   Dask on High-Performance Computing  146
   Setting Up Dask in a Remote Cluster  146
   Connecting a Local Machine to an HPC Cluster  152
   Dask JupyterLab Extension and Magics  153
   Installing JupyterLab Extensions  153
   Launching Clusters  154
   UI  154
   Watching Progress  155
   Understanding Dask Performance  156
   Metrics in Distributed Computing  156
   The Dask Dashboard  157
   Saving and Sharing Dask Metrics/Performance Logs  163
   Advanced Diagnostics  165
   Scaling and Debugging Best Practices  166
   Manual Scaling  166
   Adaptive/Auto-scaling  166
   Persist and Delete Costly Data  166
   Dask Nanny  166
   Worker Memory Management  167
   Cluster Sizing  168
   Chunking, Revisited  168
   Avoid Rechunking  169
   Scheduled Jobs  169
   Deployment Monitoring  170
   Conclusion  171
📄 Page 10
A. Key System Concepts for Dask Users . . . . . . . . . . . . . 173

B. Scalable DataFrames: A Comparison and Some History . . . . . 183

C. Debugging Dask . . . . . . . . . . . . . . . . . . . . . . . 189

D. Streaming with Streamz and Dask . . . . . . . . . . . . . . 193

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
📄 Page 11
Preface

We wrote this book for data scientists and data engineers familiar with Python and pandas who are looking to handle larger-scale problems than their current tooling allows. Current PySpark users will find that some of this material overlaps with their existing knowledge of PySpark, but we hope they still find it helpful, and not just for getting away from the Java Virtual Machine (JVM).

If you are not familiar with Python, some excellent O’Reilly titles include Learning Python and Python for Data Analysis. If you and your team are more frequent users of JVM languages (such as Java or Scala), while we are a bit biased, we’d encourage you to check out Apache Spark along with Learning Spark (O’Reilly) and High Performance Spark (O’Reilly).

This book is primarily focused on data science and related tasks because, in our opinion, that is where Dask excels the most. If you have a more general problem that Dask does not seem to be quite the right fit for, we would (with a bit of bias again) encourage you to check out Scaling Python with Ray (O’Reilly), which has less of a data science focus.

A Note on Responsibility

As the saying goes, with great power comes great responsibility. Dask and tools like it enable you to process more data and build more complex models. It’s essential not to get carried away with collecting data simply for the sake of it, and to stop to ask yourself if including a new field in your model might have some unintended real-world implications. You don’t have to search very hard to find stories of well-meaning engineers and data scientists accidentally building models or tools that had devastating impacts, such as increased auditing of minorities, gender-based discrimination, or subtler things like biases in word embeddings (a way to represent the meanings of words as vectors). Please use your newfound powers with such potential consequences in mind, for one never wants to end up in a textbook for the wrong reasons.
📄 Page 12
Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
   Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
   Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Online Figures

Print readers can find larger, color versions of some figures at https://oreil.ly/SPWD-figures. Links to each figure also appear in their captions.

License

Once published in print and excluding O’Reilly’s distinctive design elements (i.e., cover art, design format, “look and feel”) or O’Reilly’s trademarks, service marks, and trade names, this book is available under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License. We’d like to thank O’Reilly for allowing us to make this book available under a Creative Commons license and hope that you will choose to support this book (and us) by purchasing several copies (it makes an excellent gift for whichever holiday season is coming up next).
📄 Page 13
Using Code Examples

The Scaling Python Machine Learning GitHub repo contains the majority of the examples in this book. They are mainly under the dask directory, with more esoteric parts (such as the cross-platform CUDA container) found in separate top-level directories.

If you have a technical question or a problem using the code examples, please email support@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Scaling Python with Dask by Holden Karau and Mika Kimmins (O’Reilly). Copyright 2023 Holden Karau and Mika Kimmins, 978-1-098-11987-4.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform.

O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.
📄 Page 14
How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-889-8969 (in the United States or Canada)
707-829-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://www.oreilly.com/about/contact.html

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/scaling-python-dask.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://youtube.com/oreillymedia

Acknowledgments

This is a book written by two trans immigrants living in America at a time when the walls can feel like they’re closing in. We choose to dedicate this book to those fighting for a more just world in whichever way, however small—thank you. To all those we lost or didn’t get to meet, we miss you. To those we have yet to meet, we are excited to meet you.

This book would not exist if not for the communities it is built on. From the Dask community to the PyData community, thank you. Thank you to all the early readers and reviewers for your contributions and guidance. These reviewers include Ruben Berenguel, Adam Breindel, Tom Drabas, Joseph Gnanaprakasam, John Iannone, Kevin Kho, Jess Males, and many more. A special thanks to Ann Spencer for reviewing the early proposals of what eventually became this and Scaling Python with Ray. Any remaining mistakes are entirely our fault, sometimes going against reviewers’ advice.¹

¹ We are sometimes stubborn to a fault.
📄 Page 15
Holden would also like to thank her wife and partners for putting up with her long in-the-bathtub writing sessions. A special thank you to Timbit for guarding the house and generally giving Holden a reason to get out of bed (albeit often a bit too early for her taste).

Mika would additionally like to thank Holden for her mentorship and help, and give a shout-out to her colleagues at the Harvard data science department for providing her with unlimited free coffee.
📄 Page 17
CHAPTER 1
What Is Dask?

Dask is a framework for parallelized computing with Python that scales from multiple cores on one machine to data centers with thousands of machines. It has both low-level task APIs and higher-level data-focused APIs. The low-level task APIs power Dask’s integration with a wide variety of Python libraries. Having public APIs has allowed an ecosystem of tools to grow around Dask for various use cases.

Continuum Analytics, now known as Anaconda Inc., started the open source, DARPA-funded Blaze project, which has evolved into Dask. Continuum has participated in developing many essential libraries and even conferences in the Python data analytics space. Dask remains an open source project, with much of its development now being supported by Coiled.

Dask is unique in the distributed computing ecosystem, because it integrates popular data science, parallel, and scientific computing libraries. Dask’s integration of different libraries allows developers to reuse much of their existing knowledge at scale. They can also frequently reuse some of their code with minimal changes.

Why Do You Need Dask?

Dask simplifies scaling analytics, ML, and other code written in Python,¹ allowing you to handle larger and more complex data and problems. Dask aims to fill the space where your existing tools, like pandas DataFrames, or your scikit-learn machine learning pipelines start to become too slow (or do not succeed). While the term “big data” is perhaps less in vogue now than a few years ago, the data size of the problems has not gotten smaller, and the complexity of the computation and models has not

¹ Not all Python code, however; for example, Dask would be a bad choice for scaling a web server (very stateful from the web socket needs).
📄 Page 18
gotten simpler. Dask allows you to primarily use the existing interfaces that you are used to (such as pandas and multi-processing) while going beyond the scale of a single core or even a single machine.

On the other hand, if all your data fits in memory on a laptop, and you can finish your analysis before you’ve had a chance to brew a cup of your favorite warm beverage, you probably don’t need Dask yet.

Where Does Dask Fit in the Ecosystem?

Dask provides scalability to multiple, traditionally distinct tools. It is most often used to scale Python data libraries like pandas and NumPy. Dask extends existing tools for scaling, such as multi-processing, allowing them to exceed their current limits of single machines to multi-core and multi-machine. The following provides a quick look at the ecosystem evolution:

Early “big data” query: Apache Hadoop and Apache Hive
Later “big data” query: Apache Flink and Apache Spark
DataFrame-focused distributed tools: Koalas, Ray, and Dask

From an abstraction point of view, Dask sits above the machines and cluster management tools, allowing you to focus on Python code instead of the intricacies of machine-to-machine communication:

Scalable data and ML tools: Hadoop, Hive, Flink, Spark, TensorFlow, Koalas, Ray, Dask, etc.
Compute resources: Apache Hadoop YARN, Kubernetes, Amazon Web Services, Slurm Workload Manager, etc.

We say a problem is compute-bound if the limiting factor is not the amount of data but rather the work we are doing on the data. Memory-bound problems are problems in which the computation is not the limiting factor; rather, the ability to store all the data in memory is the limiting factor. Some problems can be both compute-bound and memory-bound, as is often the case for large deep-learning problems.
📄 Page 19
Multi-core (think multi-threading) processing can help with compute-bound problems (up to the limit of the number of cores in a machine). Generally, multi-core processing is unable to help with memory-bound problems, as all Central Processing Units (CPUs) have similar access to the memory.²

² With the exception of non-uniform memory access (NUMA) systems.

Accelerated processing, including the use of specialized instruction sets or specialized hardware like Tensor Processing Units or Graphics Processing Units, is generally useful only for compute-bound problems. Sometimes using accelerated processing introduces memory-bound problems, as the amount of memory available to the accelerated computation can be smaller than the “main” system memory.

Multi-machine processing is important for both classes of problems. Since the number of cores you can get in a machine (affordably) is limited, even if a problem is “only” compute-bound at certain scales, you will need to consider multi-machine processing. More commonly, memory-bound problems are a good fit for multi-machine scaling, as Dask can often split the data between the different machines.

Dask has both multi-core and multi-machine scaling, allowing you to scale your Python code as you see fit. Much of Dask’s power comes from the tools and libraries built on top of it, which fit into their parts of the data processing ecosystem (such as BlazingSQL). Your background and interest will naturally shape how you first view Dask, so in the following subsections, we’ll briefly discuss how you can use Dask for different types of problems, as well as how it compares to some existing tools.

Big Data

Dask has better Python library integrations and lower overhead for tasks than many alternatives. Apache Spark (and its Python companion, PySpark) is one of the most popular tools for big data. Existing big data tools, such as PySpark, have more data sources and optimizers (like predicate push-down) but higher overhead per task.
Dask’s lower overhead is due mainly to the fact that much of the rest of the Python big data ecosystem is built primarily on top of the JVM. These tools have advanced features such as query optimizers, but with the cost of copying data between the JVM and Python.

Unlike many other traditional big data tools, such as Spark and Hadoop, Dask considers local mode a first-class citizen. The traditional big data ecosystem focuses on using the local mode for testing, but Dask focuses on good performance when running on a single node.
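To make the local-mode point concrete, here is a minimal sketch (not from the book's own examples) that stands up an in-process Dask "cluster" and submits work to it; the same client API works unchanged against a real multi-machine deployment:

```python
from dask.distributed import Client, LocalCluster

# Start a scheduler and workers entirely on this machine; processes=False
# keeps everything in one process, which is convenient for quick experiments.
cluster = LocalCluster(processes=False, n_workers=2, threads_per_worker=1)
client = Client(cluster)

# Work submitted through the client is scheduled on the local workers
# exactly as it would be on a distributed cluster.
future = client.submit(sum, range(10))
print(future.result())  # 45

client.close()
cluster.close()
```

Because local mode is first-class, this is not a test-only shim: single-node performance is a supported, tuned configuration rather than an afterthought.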
📄 Page 20
Another significant cultural difference comes from packaging, with many projects in big data putting everything together (for example, Spark SQL, Spark Kubernetes, and so on are released together). Dask takes a more modular approach, with its components following their own development and release cadence. Dask’s approach can iterate faster, at the cost of occasional incompatibilities between libraries.

Data Science

One of the most popular Python libraries in the data science ecosystem is pandas. Apache Spark (and its Python companion, PySpark) is also one of the most popular tools for distributed data science. It has support for both Python and JVM languages. Spark’s first attempt at DataFrames more closely resembled SQL than what you may think of as DataFrames. While Spark has started to integrate pandas support with the Koalas project, Dask’s support of data science library APIs is best in class, in our opinion.³ In addition to the pandas APIs, Dask supports scaling NumPy, scikit-learn, and other data science tools. Dask can be extended to support data types besides NumPy and pandas, and this is how GPU support is implemented with cuDF.

³ Of course, opinions vary. See, for example, “Single Node Processing — Spark, Dask, Pandas, Modin, Koalas Vol. 1”, “Benchmark: Koalas (PySpark) and Dask”, and “Spark vs. Dask vs. Ray”.

Parallel to Distributed Python

Parallel computing refers to running multiple operations at the same time, and distributed computing carries this on to multiple operations on multiple machines. Parallel Python encompasses a wide variety of tools ranging from multi-processing to Celery.⁴ Dask gives you the ability to specify an arbitrary graph of dependencies and execute them in parallel.

⁴ Celery, often used for background job management, is an asynchronous task queue that can also split up and distribute work. But it is at a lower level than Dask and does not have the same high-level conveniences as Dask.
Under the hood, this execution can either be backed by a single machine (with threads or processes) or be distributed across multiple workers.
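The dependency-graph idea above can be sketched with dask.delayed (a minimal illustration, not one of the book's own listings): calling a delayed function records a task rather than running it, and compute() executes the resulting graph, running independent tasks in parallel:

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# Nothing runs yet: calling delayed functions only builds a task graph.
a = inc(1)         # independent task
b = inc(2)         # independent task
total = add(a, b)  # depends on both inc tasks

# compute() walks the graph; the two inc calls are free to run in parallel,
# on threads, processes, or a distributed cluster.
print(total.compute())  # 5
```

The same graph can be handed to any backend, which is exactly the single-machine-to-cluster flexibility described above.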