Scaling Machine Learning with Spark
Distributed ML with MLlib, TensorFlow, and PyTorch

Adi Polak
MACHINE LEARNING

“If there is one book the Spark community has been craving for the last decade, it’s this.”
    —Andy Petrella, founder at Kensu and author of Fundamentals of Data Observability

Scaling Machine Learning with Spark

Learn how to build end-to-end scalable machine learning solutions with Apache Spark. With this practical guide, author Adi Polak introduces data and ML practitioners to creative solutions that supersede today’s traditional methods. You’ll learn a more holistic approach that takes you beyond specific requirements and organizational goals—allowing data and ML practitioners to collaborate and understand each other better.

Scaling Machine Learning with Spark examines several technologies for building end-to-end distributed ML workflows based on the Apache Spark ecosystem with Spark MLlib, MLflow, TensorFlow, and PyTorch. If you’re a data scientist who works with machine learning, this book shows you when and why to use each technology.

You will:

• Explore machine learning, including distributed computing concepts and terminology
• Manage the ML lifecycle with MLflow
• Ingest data and perform basic preprocessing with Spark
• Explore feature engineering, and use Spark to extract features
• Train a model with MLlib and build a pipeline to reproduce it
• Build a data system to combine the power of Spark with deep learning
• Get a step-by-step example of working with distributed TensorFlow
• Use PyTorch to scale machine learning and learn about its internal architecture

Adi Polak is an open source technologist who believes in communities and education, and their ability to positively impact the world around us. She is passionate about building a better world through open collaboration and technological innovation. As a seasoned engineer and vice president of developer experience at Treeverse, Adi shapes the future of data and ML technologies for hands-on builders. She serves on multiple program committees and acts as an advisor for conferences like Data & AI Summit by Databricks, Current by Confluent, and Scale by the Bay, among others. Adi previously served as a senior manager for Azure at Microsoft, where she helped build advanced analytics systems and modern data architectures. Adi gained experience in machine learning by conducting research for IBM, Deutsche Telekom, and other Fortune 500 companies.

US $79.99  CAN $99.99
ISBN: 978-1-098-10682-9
Praise for Scaling Machine Learning with Spark

If there is one book the Spark community has been craving for the last decade, it’s this. Writing about the combination of Spark and AI requires broad knowledge, a deep technical skillset, and the ability to break down complex concepts so they’re easy to understand. Adi delivers all of this and more while covering big data, AI, and everything in between.
    —Andy Petrella, founder at Kensu and author of
    Fundamentals of Data Observability (O’Reilly)

Scaling Machine Learning with Spark is a wealth of knowledge for data and ML practitioners, providing a holistic and creative approach to building end-to-end scalable machine learning solutions. The author’s expertise and knowledge, combined with a focus on collaboration and understanding, makes this book a must-read for anyone in the industry.
    —Noah Gift, Duke executive in residence

Adi’s book is without any doubt a good reference and resource to have beside you when working with Spark and distributed ML. You will learn best practices she has to share along with her experience working in the industry for many years. Worth the investment and time reading it.
    —Laura Uzcategui, machine learning engineer at TalentBait
This book is an amazing synthesis of knowledge and experience. I consider it essential reading for both novice and veteran machine learning engineers. Readers will deepen their understanding of general principles for machine learning in distributed systems while simultaneously engaging with the technical details required to integrate and scale the most widely used tools of the trade, including Spark, PyTorch, and TensorFlow.
    —Matthew Housley, CTO and coauthor of
    Fundamentals of Data Engineering (O’Reilly)

Adi’s done a wonderful job at creating a very readable, practical, and insanely detailed deep dive into machine learning with Spark.
    —Joe Reis, coauthor of Fundamentals of Data Engineering (O’Reilly)
    and “recovering data scientist”
Scaling Machine Learning with Spark
Distributed ML with MLlib, TensorFlow, and PyTorch

Adi Polak

Beijing • Boston • Farnham • Sebastopol • Tokyo
Scaling Machine Learning with Spark
by Adi Polak

Copyright © 2023 Adi Polak. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (https://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Nicole Butterfield
Development Editor: Jill Leonard
Production Editor: Jonathon Owen
Copyeditor: Rachel Head
Proofreader: Piper Editorial Consulting, LLC
Indexer: Potomac Indexing, LLC
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

March 2023: First Edition

Revision History for the First Edition
2023-03-02: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098106829 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Scaling Machine Learning with Spark, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-10682-9
[LSI]
Table of Contents

Preface

1. Distributed Machine Learning Terminology and Concepts
    The Stages of the Machine Learning Workflow
    Tools and Technologies in the Machine Learning Pipeline
    Distributed Computing Models
    General-Purpose Models
    Dedicated Distributed Computing Models
    Introduction to Distributed Systems Architecture
    Centralized Versus Decentralized Systems
    Interaction Models
    Communication in a Distributed Setting
    Introduction to Ensemble Methods
    High Versus Low Bias
    Types of Ensemble Methods
    Distributed Training Topologies
    The Challenges of Distributed Machine Learning Systems
    Performance
    Resource Management
    Fault Tolerance
    Privacy
    Portability
    Setting Up Your Local Environment
    Chapters 2–6 Tutorials Environment
    Chapters 7–10 Tutorials Environment
    Summary

2. Introduction to Spark and PySpark
    Apache Spark Architecture
    Intro to PySpark
    Apache Spark Basics
    Software Architecture
    PySpark and Functional Programming
    Executing PySpark Code
    pandas DataFrames Versus Spark DataFrames
    Scikit-Learn Versus MLlib
    Summary

3. Managing the Machine Learning Experiment Lifecycle with MLflow
    Machine Learning Lifecycle Management Requirements
    What Is MLflow?
    Software Components of the MLflow Platform
    Users of the MLflow Platform
    MLflow Components
    MLflow Tracking
    MLflow Projects
    MLflow Models
    MLflow Model Registry
    Using MLflow at Scale
    Summary

4. Data Ingestion, Preprocessing, and Descriptive Statistics
    Data Ingestion with Spark
    Working with Images
    Working with Tabular Data
    Preprocessing Data
    Preprocessing Versus Processing
    Why Preprocess the Data?
    Data Structures
    MLlib Data Types
    Preprocessing with MLlib Transformers
    Preprocessing Image Data
    Save the Data and Avoid the Small Files Problem
    Descriptive Statistics: Getting a Feel for the Data
    Calculating Statistics
    Descriptive Statistics with Spark Summarizer
    Data Skewness
    Correlation
    Summary

5. Feature Engineering
    Features and Their Impact on Models
    MLlib Featurization Tools
    Extractors
    Selectors
    Example: Word2Vec
    The Image Featurization Process
    Understanding Image Manipulation
    Extracting Features with Spark APIs
    The Text Featurization Process
    Bag-of-Words
    TF-IDF
    N-Gram
    Additional Techniques
    Enriching the Dataset
    Summary

6. Training Models with Spark MLlib
    Algorithms
    Supervised Machine Learning
    Classification
    Regression
    Unsupervised Machine Learning
    Frequent Pattern Mining
    Clustering
    Evaluating
    Supervised Evaluators
    Unsupervised Evaluators
    Hyperparameters and Tuning Experiments
    Building a Parameter Grid
    Splitting the Data into Training and Test Sets
    Cross-Validation: A Better Way to Test Your Models
    Machine Learning Pipelines
    Constructing a Pipeline
    How Does Splitting Work with the Pipeline API?
    Persistence
    Summary

7. Bridging Spark and Deep Learning Frameworks
    The Two Clusters Approach
    Implementing a Dedicated Data Access Layer
    Features of a DAL
    Selecting a DAL
    What Is Petastorm?
    SparkDatasetConverter
    Petastorm as a Parquet Store
    Project Hydrogen
    Barrier Execution Mode
    Accelerator-Aware Scheduling
    A Brief Introduction to the Horovod Estimator API
    Summary

8. TensorFlow Distributed Machine Learning Approach
    A Quick Overview of TensorFlow
    What Is a Neural Network?
    TensorFlow Cluster Process Roles and Responsibilities
    Loading Parquet Data into a TensorFlow Dataset
    An Inside Look at TensorFlow’s Distributed Machine Learning Strategies
    ParameterServerStrategy
    CentralStorageStrategy: One Machine, Multiple Processors
    MirroredStrategy: One Machine, Multiple Processors, Local Copy
    MultiWorkerMirroredStrategy: Multiple Machines, Synchronous
    TPUStrategy
    What Things Change When You Switch Strategies?
    Training APIs
    Keras API
    Custom Training Loop
    Estimator API
    Putting It All Together
    Troubleshooting
    Summary

9. PyTorch Distributed Machine Learning Approach
    A Quick Overview of PyTorch Basics
    Computation Graph
    PyTorch Mechanics and Concepts
    PyTorch Distributed Strategies for Training Models
    Introduction to PyTorch’s Distributed Approach
    Distributed Data-Parallel Training
    RPC-Based Distributed Training
    Communication Topologies in PyTorch (c10d)
    What Can We Do with PyTorch’s Low-Level APIs?
    Loading Data with PyTorch and Petastorm
    Troubleshooting Guidance for Working with Petastorm and Distributed PyTorch
    The Enigma of Mismatched Data Types
    The Mystery of Straggling Workers
    How Does PyTorch Differ from TensorFlow?
    Summary

10. Deployment Patterns for Machine Learning Models
    Deployment Patterns
    Pattern 1: Batch Prediction
    Pattern 2: Model-in-Service
    Pattern 3: Model-as-a-Service
    Determining Which Pattern to Use
    Production Software Requirements
    Monitoring Machine Learning Models in Production
    Data Drift
    Model Drift, Concept Drift
    Distributional Domain Shift (the Long Tail)
    What Metrics Should I Monitor in Production?
    How Do I Measure Changes Using My Monitoring System?
    What It Looks Like in Production
    The Production Feedback Loop
    Deploying with MLlib
    Production Machine Learning Pipelines with Structured Streaming
    Deploying with MLflow
    Defining an MLflow Wrapper
    Deploying the Model as a Microservice
    Loading the Model as a Spark UDF
    How to Develop Your System Iteratively
    Summary

Index
Preface

Welcome to Scaling Machine Learning with Spark: Distributed ML with MLlib, TensorFlow, and PyTorch. This book aims to guide you in your journey as you learn more about machine learning (ML) systems. Apache Spark is currently the most popular framework for large-scale data processing. It has numerous APIs implemented in Python, Java, and Scala and is used by many powerhouse companies, including Netflix, Microsoft, and Apple. PyTorch and TensorFlow are among the most popular frameworks for machine learning. Combining these tools, which are already in use in many organizations today, allows you to take full advantage of their strengths.

Before we get started, though, perhaps you are wondering why I decided to write this book. Good question. There are two reasons. The first is to support the machine learning ecosystem and community by sharing the knowledge, experience, and expertise I have accumulated over the last decade working as a machine learning algorithm researcher, designing and implementing algorithms to run on large-scale data. I have spent most of my career working as a data infrastructure engineer, building infrastructure for large-scale analytics with all sorts of formatting, types, schemas, etc., and integrating knowledge collected from customers, community members, and colleagues who have shared their experience while brainstorming and developing solutions. Our industry can use such knowledge to propel itself forward at a faster rate, by leveraging the expertise of others. While not all of this book’s content will be applicable to everyone, much of it will open up new approaches for a wide array of practitioners.

This brings me to my second reason for writing this book: I want to provide a holistic approach to building end-to-end scalable machine learning solutions that extends beyond the traditional approach. Today, many solutions are customized to the specific requirements of the organization and specific business goals. This will most likely continue to be the industry norm for many years to come. In this book, I aim to challenge the status quo and inspire more creative solutions while explaining the pros and cons of multiple approaches and tools, enabling you to leverage whichever tools are used in your organization and get the best of all worlds.
My overall goal is to make it simpler for data and machine learning practitioners to collaborate and understand each other better.

Who Should Read This Book?

This book is designed for machine learning practitioners with previous industry experience who want to learn about Apache Spark’s MLlib and increase their understanding of the overall system and flow. It will be particularly relevant to data scientists and machine learning engineers, but MLOps engineers, software engineers, and anyone interested in learning about or building distributed machine learning models and pipelines with MLlib, distributed PyTorch, and TensorFlow will also find value. Technologists who understand the high-level concepts of working with machine learning and want to dip their feet into the technical side should also find the book interesting and accessible.

Do You Need Distributed Machine Learning?

As with every good thing, it depends. If you have small datasets that fit into your machine’s memory, the answer is no. If at some point you will need to scale out your code and make sure you can train a model on a larger dataset that does not fit into a single machine’s memory, then yes.

It is often better to use the same tools across the software development lifecycle, from the local development environment to staging and production. Take into consideration, though, that this also introduces other complexities involved in managing a distributed system, which typically will be handled by a different team in your organization. It’s a good idea to have a common language to collaborate with your colleagues.

Also, one of the greatest challenges people who create machine learning models face today is moving them from local development all the way to production. Many of us sin with spaghetti code that should be reproducible but often is not and is hard to maintain and collaborate on. I will touch upon that topic as part of the discussion of managing the lifecycle of experiments.

Navigating This Book

This book is designed to build from foundational information in the first few chapters, covering the machine learning workflow using Apache Spark and PySpark and managing the machine learning experiment lifecycle with MLflow, to bridging into a dedicated machine learning platform in Chapters 7, 8, and 9. The book concludes with a look at deployment patterns, inference, and monitoring of models in production. Here’s a breakdown of what you will find in each chapter:
Chapter 1, “Distributed Machine Learning Terminology and Concepts”
    This chapter provides a high-level introduction to machine learning and covers terminology and concepts related to distributed computing and network topologies. I will walk you through various concepts and terms, so you have a strong foundation for the next chapters.

Chapter 2, “Introduction to Spark and PySpark”
    The goal of this chapter is to bring you up to speed on Spark and its Python library, PySpark. We’ll discuss terminology, software abstractions, and more.

Chapter 3, “Managing the Machine Learning Experiment Lifecycle with MLflow”
    This chapter introduces MLflow, a platform that facilitates managing the machine learning lifecycle. We’ll discuss what a machine learning experiment is and why managing its lifecycle is important, and we’ll examine the various components of MLflow that make this possible.

Chapter 4, “Data Ingestion, Preprocessing, and Descriptive Statistics”
    Next, we will dive into working with data. In this chapter, I will discuss how to use Spark to ingest your data, perform basic preprocessing (using image files as an example), and get a feel for the data. I’ll also cover how to avoid the so-called small file problem with image files by leveraging the PySpark API.

Chapter 5, “Feature Engineering”
    Once you’ve performed the steps in the previous chapter, you’re ready to engineer the features you will use to train your machine learning model. This chapter explains in detail what feature engineering is, covering various types, and showcases how to leverage Spark’s functionality for extracting features. We’ll also look at how and when to use applyInPandas and pandas_udf to optimize performance (a brief illustrative sketch follows this chapter list).

Chapter 6, “Training Models with Spark MLlib”
    This chapter walks you through working with MLlib to train a model, evaluate and build a pipeline to reproduce the model, and finally persist it to disk.

Chapter 7, “Bridging Spark and Deep Learning Frameworks”
    This chapter breaks down how to build a data system to combine the power of Spark with deep learning frameworks. It discusses bridging Spark and deep learning clusters and provides an introduction to Petastorm, Horovod, and the Spark initiative Project Hydrogen.

Chapter 8, “TensorFlow Distributed Machine Learning Approach”
    Here, I’ll lead you through a step-by-step example of working with distributed TensorFlow—specifically tf.keras—while leveraging the preprocessing you’ve done with Spark. You will also learn about the various TensorFlow patterns for scaling machine learning and the component architectures that support it.
Chapter 9, “PyTorch Distributed Machine Learning Approach”
    This chapter covers the PyTorch approach to scaling machine learning, including its internal architecture. We will walk through a step-by-step example of working with distributed PyTorch while leveraging the preprocessing you did with Spark in previous chapters.

Chapter 10, “Deployment Patterns for Machine Learning Models”
    In this chapter, I present the various deployment patterns available to us, including batch and streaming inference with Spark and MLflow, and provide examples of using the pyfunc functionality in MLflow that allows us to deploy just about any machine learning model. This chapter also covers monitoring and implementing a production machine learning system in phases.
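As a small, forward-looking illustration of the pandas_udf mentioned in the Chapter 5 description, here is a minimal sketch. It is not the book’s actual example: it assumes PySpark 3.x with pandas and PyArrow installed, and the column and DataFrame names are placeholders.

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # A vectorized (pandas) UDF: Spark passes the column to this function
    # in batches as pandas Series, avoiding per-row Python overhead.
    @pandas_udf("double")
    def standardize(v: pd.Series) -> pd.Series:
        # Center and scale each incoming batch (illustrative logic only).
        return (v - v.mean()) / v.std()

    # Hypothetical usage; "features_df" and "engine_size" are placeholders:
    # scaled = features_df.withColumn("scaled", standardize("engine_size"))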
What Is Not Covered

There are many ways to go about distributed machine learning. Some involve running multiple experiments in parallel, with multiple hyperparameters, on data that has been loaded into memory. You might be able to load the dataset into a single machine’s memory, or it may be so large that it has to be partitioned across multiple machines. We will briefly discuss grid search, a technique for finding the optimal values for a set of hyperparameters, but this book will only extend that far.

This book does not cover the following topics:

An introduction to machine learning algorithms
    There are many wonderful books that go into depth on the various machine learning algorithms and their uses, and this book won’t repeat them.

Deploying models to mobile or embedded devices
    This often requires working with TinyML and dedicated algorithms to shrink the size of the final model (which may initially be created from a large dataset).

TinyML
    TinyML is focused on building relatively small machine learning models that can run on resource-constrained devices. To learn about this topic, check out TinyML by Peter Warden and Daniel Situnayake (O’Reilly).

Online learning
    Online learning is used when the data is generated as a function of time or when the machine learning algorithm needs to adapt dynamically to new patterns in the data. It’s also used when training over the entire dataset is computationally infeasible, requiring out-of-core algorithms. This is a fundamentally different way of approaching machine learning with specialized applications, and it is not covered in this book.

Parallel experiments
    While the tools discussed in this book, such as PyTorch and TensorFlow, enable us to conduct parallel experiments, this book will focus solely on parallel data training, where the logic stays the same, and each machine processes a different chunk of the data.

This is not an exhaustive list—since all roads lead to distribution in one way or another, I might have forgotten to mention some topics here, or new ones may have gained traction in the industry since the time of writing. As mentioned previously, my aim is to share my perspective, given my accumulated experience and knowledge in the field of machine learning, and to equip others with a holistic approach to use in their own endeavors; it is my intention to cover as many of the key points as possible to provide a foundation, and I encourage you to explore further to deepen your understanding of the topics discussed here.

The Environment and Tools

Now that you understand the topics that will (and won’t) be covered, it’s time to set up your tutorial environment. You’ll be using various platforms and libraries together to develop a machine learning pipeline as you work through the exercises in this book.

The Tools

This section briefly introduces the tools that we will use to build the solutions discussed in this book. If you aren’t familiar with them, you may want to review their documentation before getting started. To implement the code samples provided in the book on your own machine, you will need to have the following tools installed locally:

Apache Spark
    A general-purpose, large-scale analytics engine for data processing.

PySpark
    An interface for Apache Spark in Python.

PyTorch
    A machine learning framework developed by Facebook, based on the Torch library, used for computer vision and natural language processing applications. We will use its distributed training capabilities.

TensorFlow
    A platform for machine learning pipelines developed by Google. We will use its distributed training capabilities.

MLflow
    An open source platform for managing the machine learning lifecycle. We will use it to manage the experiments in this book.

Petastorm
    A library that enables distributed training and evaluation of deep learning models using datasets in Apache Parquet format. Petastorm supports machine learning frameworks such as TensorFlow and PyTorch. We will use it to bridge between Spark and a deep learning cluster.

Horovod
    A distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. This project aims to support developers in scaling a single-GPU training script to train across many GPUs in parallel. We will use it both to optimize workloads over multiple GPUs and to coordinate the distributed systems of a Spark cluster and a deep learning cluster, which requires a dedicated distributed system scheduler to manage the cluster resources and enable them to work together using the same hardware.

NumPy
    A Python library for scientific computing that enables efficient performance of various types of operations on arrays (mathematical, logical, shape manipulation, sorting, selecting, I/O, and more). We will use it for various statistical and mathematical operations that can be done on a single machine.

PIL
    The Python Imaging Library, also known as Pillow. We will use this for working with images.
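Before moving on, it can be worth running a quick smoke test to confirm the core tools are importable. The following is a minimal sketch, not from the book itself: it assumes you’ve installed PySpark, PyTorch, TensorFlow, and MLflow locally (for example, with pip), and it simply prints versions, then starts and stops a throwaway local Spark session.

    import pyspark
    import torch
    import tensorflow as tf
    import mlflow
    from pyspark.sql import SparkSession

    # Confirm that each library is importable and report its version.
    print("PySpark:", pyspark.__version__)
    print("PyTorch:", torch.__version__)
    print("TensorFlow:", tf.__version__)
    print("MLflow:", mlflow.__version__)

    # Start a throwaway local Spark session as a smoke test, then stop it.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("smoke-test")
             .getOrCreate())
    print("Spark is up, master =", spark.sparkContext.master)
    spark.stop()

If everything prints cleanly, your local environment is in reasonable shape for the tutorials.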
In today’s ecosystem, new tools in the space of machine learning and distributed data are emerging every day. History has taught us that some of them will stick around and others won’t. Keep an eye out for the tools that are already used in your workplace, and try to exhaust their capabilities before jumping into introducing new ones.

The Datasets

In this book’s examples, we will leverage existing datasets where practical and produce dedicated datasets when necessary to better convey the message. The datasets listed here, all available on Kaggle, are used throughout the book and are included in the accompanying GitHub repository:

Caltech 256 dataset
    Caltech 256 is an extension of the Caltech 101 dataset, which contains pictures of objects in 101 categories. The Caltech 256 dataset contains 30,607 images of objects spanning 257 categories. The categories are extremely diverse, ranging from tennis shoes to zebras, and there are images with and without backgrounds and in horizontal and vertical orientations. Most categories have about 100 images, but some have as many as 800.

CO2 Emission by Vehicles dataset
    The CO2 Emission by Vehicles dataset is based on seven years’ worth of data about vehicular CO2 emissions from the Government of Canada’s Open Data website. There are 7,385 rows and 12 columns (make, model, transmission, etc., as well as CO2 emissions and various fuel consumption measures).

Zoo Animal Classification dataset
    For learning about the statistics functions available in the MLlib library, we will use the Zoo Animal Classification dataset. It consists of 101 animals, with 16 Boolean-valued attributes used to describe them. The animals can be classified into seven types: Mammal, Bird, Reptile, Fish, Amphibian, Bug, and Invertebrate. I chose it because it’s fun and relatively simple to grasp.

If you’re working through the tutorials on your local machine, I recommend using the sample datasets provided in the book’s GitHub repo.
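To give a sense of how these datasets come into play, here is a minimal sketch of ingesting a tabular dataset with PySpark. The file path below is a placeholder, not necessarily the actual filename; check the book’s GitHub repo for the exact location of the CO2 Emission by Vehicles data.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("datasets")
             .getOrCreate())

    # Placeholder path; adjust to wherever you placed the sample data.
    co2_df = (spark.read
              .option("header", "true")       # first row holds column names
              .option("inferSchema", "true")  # let Spark guess column types
              .csv("datasets/co2_emissions_canada.csv"))

    co2_df.printSchema()
    print("Rows:", co2_df.count())  # the full dataset has 7,385 rows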
Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, file and directory names, and file extensions.

Constant width
    Used for command-line input/output and code examples, as well as for code elements that appear in the text, including variable and function names, classes, and modules.

Constant width italic
    Shows text to be replaced with user-supplied values in code examples and commands.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.
Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://oreil.ly/smls-git.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Scaling Machine Learning with Spark, by Adi Polak. Copyright 2023 by Adi Polak, 978-1-098-10682-9.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.