Uploaded by 高宏飞 · Shared on 2025-11-08

Author: Robert Crowe, Hannes Hapke, Emily Caveness, and Di Zhu

Using machine learning for products, services, and critical business processes is quite different from using ML in an academic or research setting—especially for recent ML graduates and those moving from research to a commercial environment. Whether you currently build products and services that use ML, or would like to in the future, this practical book gives you a broad view of the entire field. Authors Robert Crowe, Hannes Hapke, Emily Caveness, and Di Zhu help you identify topics that you can dive into more deeply, along with reference materials and tutorials that teach you the details. You'll learn the state of the art of machine learning engineering, covering both the basics and advanced aspects of the production ML lifecycle, including modeling, deployment, and MLOps.

The book's four in-depth sections cover all aspects of machine learning engineering:

• Data: collecting, labeling, validating, and automating data preprocessing; feature engineering and selection; data journey and storage
• Modeling: high-performance modeling; model resource management techniques; model analysis and interpretability; neural architecture search
• Deployment: model serving patterns and infrastructure for ML models and LLMs; management and delivery; monitoring and logging
• Productionalizing: ML pipelines; classifying unstructured texts and images; GenAI model pipelines

ISBN: 1098156013
Publisher: O'Reilly Media
Publish Year: 2024
Language: English
Pages: 475
File Format: PDF
File Size: 17.8 MB
Text Preview (First 20 pages)

Machine Learning Production Systems: Engineering Machine Learning Models and Pipelines
Robert Crowe, Hannes Hapke, Emily Caveness & Di Zhu
Foreword by D. Sculley
ISBN: 978-1-098-15601-5 · US $79.99 · CAN $99.99

Robert Crowe, product manager for JAX and GenAI at Google, helps developers quickly learn what they need to be productive. Hannes Hapke, principal machine learning engineer at Digits, has coauthored multiple machine learning publications. Emily Caveness, software engineer at Google, currently works on ML data analysis and validation. Di Zhu, software engineer at Google, has worked on a variety of projects, including MLOps infrastructure and applied machine learning solutions.

The world of machine learning (ML) and artificial intelligence (AI) is exploding, with new research, models, and technologies arriving nearly every day. Given this wealth of options, it’s easy for data scientists, ML engineers, and software developers to get lost among the many steps necessary to take an AI/ML model from experiment stage into production. This practical book focuses on production machine learning, a process that enables you to bring ML models into viable products and applications. Production machine learning covers all areas of ML, taking you beyond simple model training. This book places special emphasis on ML pipelines that will help you build the foundation for your ML production systems. You’ll explore a broad range of technologies you need to put ML applications into production, as well as the issues and approaches you need to consider.

Critical ML engineering topics include:
• Data collection, validation, storage, and feature engineering
• Model analysis, serving, monitoring, and logging
• Orchestrating machine learning pipelines using TensorFlow Extended (TFX) and other tools

This publication provides in-depth examples, including end-to-end ML pipelines for NLP and computer vision models.

“A comprehensive book that gives you a holistic view of the entire process of building, deploying, and managing ML systems in production. It takes you through everything you need to know—from getting the most out of your data all the way through training models, deploying those models to scalable infrastructure, and managing the details to keep them running smoothly.” —Laurence Moroney, AI consultant, teacher, and author
Robert Crowe, Hannes Hapke, Emily Caveness, and Di Zhu
Foreword by D. Sculley
Machine Learning Production Systems: Engineering Machine Learning Models and Pipelines
Boston · Farnham · Sebastopol · Tokyo · Beijing
Machine Learning Production Systems
by Robert Crowe, Hannes Hapke, Emily Caveness, and Di Zhu
Copyright © 2025 Robert Crowe, Hannes Hapke, Emily Caveness, and Di Zhu. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
ISBN: 978-1-098-15601-5 [LSI]

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Nicole Butterfield
Development Editor: Jeff Bleiel
Production Editor: Katherine Tozer
Copyeditor: Audrey Doyle
Proofreader: Piper Editorial Consulting, LLC
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

October 2024: First Edition
Revision History for the First Edition: 2024-10-01: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781098156015 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Machine Learning Production Systems, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents

Foreword
Preface

1. Introduction to Machine Learning Production Systems
   What Is Production Machine Learning?
   Benefits of Machine Learning Pipelines
   Focus on Developing New Models, Not on Maintaining Existing Models
   Prevention of Bugs
   Creation of Records for Debugging and Reproducing Results
   Standardization
   The Business Case for ML Pipelines
   When to Use Machine Learning Pipelines
   Steps in a Machine Learning Pipeline
   Data Ingestion and Data Versioning
   Data Validation
   Feature Engineering
   Model Training and Model Tuning
   Model Analysis
   Model Deployment
   Looking Ahead

2. Collecting, Labeling, and Validating Data
   Important Considerations in Data Collection
   Responsible Data Collection
   Labeling Data: Data Changes and Drift in Production ML
   Labeling Data: Direct Labeling and Human Labeling
   Validating Data: Detecting Data Issues
   Validating Data: TensorFlow Data Validation
   Skew Detection with TFDV
   Types of Skew
   Example: Spotting Imbalanced Datasets with TensorFlow Data Validation
   Conclusion

3. Feature Engineering and Feature Selection
   Introduction to Feature Engineering
   Preprocessing Operations
   Feature Engineering Techniques
   Normalizing and Standardizing
   Bucketizing
   Feature Crosses
   Dimensionality and Embeddings
   Visualization
   Feature Transformation at Scale
   Choose a Framework That Scales Well
   Avoid Training–Serving Skew
   Consider Instance-Level Versus Full-Pass Transformations
   Using TensorFlow Transform
   Analyzers
   Code Example
   Feature Selection
   Feature Spaces
   Feature Selection Overview
   Filter Methods
   Wrapper Methods
   Embedded Methods
   Feature and Example Selection for LLMs and GenAI
   Example: Using TF Transform to Tokenize Text
   Benefits of Using TF Transform
   Alternatives to TF Transform
   Conclusion

4. Data Journey and Data Storage
   Data Journey
   ML Metadata
   Using a Schema
   Schema Development
   Schema Environments
   Changes Across Datasets
   Enterprise Data Storage
   Feature Stores
   Data Warehouses
   Data Lakes
   Conclusion

5. Advanced Labeling, Augmentation, and Data Preprocessing
   Advanced Labeling
   Semi-Supervised Labeling
   Active Learning
   Weak Supervision
   Advanced Labeling Review
   Data Augmentation
   Example: CIFAR-10
   Other Augmentation Techniques
   Data Augmentation Review
   Preprocessing Time Series Data: An Example
   Windowing
   Sampling
   Conclusion

6. Model Resource Management Techniques
   Dimensionality Reduction: Dimensionality Effect on Performance
   Example: Word Embedding Using Keras
   Curse of Dimensionality
   Adding Dimensions Increases Feature Space Volume
   Dimensionality Reduction
   Quantization and Pruning
   Mobile, IoT, Edge, and Similar Use Cases
   Quantization
   Optimizing Your TensorFlow Model with TF Lite
   Optimization Options
   Pruning
   Knowledge Distillation
   Teacher and Student Networks
   Knowledge Distillation Techniques
   TMKD: Distilling Knowledge for a Q&A Task
   Increasing Robustness by Distilling EfficientNets
   Conclusion

7. High-Performance Modeling
   Distributed Training
   Data Parallelism
   Efficient Input Pipelines
   Input Pipeline Basics
   Input Pipeline Patterns: Improving Efficiency
   Optimizing Your Input Pipeline with TensorFlow Data
   Training Large Models: The Rise of Giant Neural Nets and Parallelism
   Potential Solutions and Their Shortcomings
   Pipeline Parallelism to the Rescue?
   Conclusion

8. Model Analysis
   Analyzing Model Performance
   Black-Box Evaluation
   Performance Metrics and Optimization Objectives
   Advanced Model Analysis
   TensorFlow Model Analysis
   The Learning Interpretability Tool
   Advanced Model Debugging
   Benchmark Models
   Sensitivity Analysis
   Residual Analysis
   Model Remediation
   Discrimination Remediation
   Fairness
   Fairness Evaluation
   Fairness Considerations
   Continuous Evaluation and Monitoring
   Conclusion

9. Interpretability
   Explainable AI
   Model Interpretation Methods
   Method Categories
   Intrinsically Interpretable Models
   Model-Agnostic Methods
   Local Interpretable Model-Agnostic Explanations
   Shapley Values
   The SHAP Library
   Testing Concept Activation Vectors
   AI Explanations
   Example: Exploring Model Sensitivity with SHAP
   Regression Models
   Natural Language Processing Models
   Conclusion

10. Neural Architecture Search
   Hyperparameter Tuning
   Introduction to AutoML
   Key Components of NAS
   Search Spaces
   Search Strategies
   Performance Estimation Strategies
   AutoML in the Cloud
   Amazon SageMaker Autopilot
   Microsoft Azure Automated Machine Learning
   Google Cloud AutoML
   Using AutoML
   Generative AI and AutoML
   Conclusion

11. Introduction to Model Serving
   Model Training
   Model Prediction
   Latency
   Throughput
   Cost
   Resources and Requirements for Serving Models
   Cost and Complexity
   Accelerators
   Feeding the Beast
   Model Deployments
   Data Center Deployments
   Mobile and Distributed Deployments
   Model Servers
   Managed Services
   Conclusion

12. Model Serving Patterns
   Batch Inference
   Batch Throughput
   Batch Inference Use Cases
   ETL for Distributed Batch and Stream Processing Systems
   Introduction to Real-Time Inference
   Synchronous Delivery of Real-Time Predictions
   Asynchronous Delivery of Real-Time Predictions
   Optimizing Real-Time Inference
   Real-Time Inference Use Cases
   Serving Model Ensembles
   Ensemble Topologies
   Example Ensemble
   Ensemble Serving Considerations
   Model Routers: Ensembles in GenAI
   Data Preprocessing and Postprocessing in Real Time
   Training Transformations Versus Serving Transformations
   Windowing
   Options for Preprocessing
   Enter TensorFlow Transform
   Postprocessing
   Inference at the Edge and at the Browser
   Challenges
   Model Deployments via Containers
   Training on the Device
   Federated Learning
   Runtime Interoperability
   Inference in Web Browsers
   Conclusion

13. Model Serving Infrastructure
   Model Servers
   TensorFlow Serving
   NVIDIA Triton Inference Server
   TorchServe
   Building Scalable Infrastructure
   Containerization
   Traditional Deployment Era
   Virtualized Deployment Era
   Container Deployment Era
   The Docker Containerization Framework
   Container Orchestration
   Reliability and Availability Through Redundancy
   Observability
   High Availability
   Automated Deployments
   Hardware Accelerators
   GPUs
   TPUs
   Conclusion

14. Model Serving Examples
   Example: Deploying TensorFlow Models with TensorFlow Serving
   Exporting Keras Models for TF Serving
   Setting Up TF Serving with Docker
   Basic Configuration of TF Serving
   Making Model Prediction Requests with REST
   Making Model Prediction Requests with gRPC
   Getting Predictions from Classification and Regression Models
   Using Payloads
   Getting Model Metadata from TF Serving
   Making Batch Inference Requests
   Example: Profiling TF Serving Inferences with TF Profiler
   Prerequisites
   TensorBoard Setup
   Model Profile
   Example: Basic TorchServe Setup
   Installing the TorchServe Dependencies
   Exporting Your Model for TorchServe
   Setting Up TorchServe
   Making Model Prediction Requests
   Making Batch Inference Requests
   Conclusion

15. Model Management and Delivery
   Experiment Tracking
   Experimenting in Notebooks
   Experimenting Overall
   Tools for Experiment Tracking and Versioning
   Introduction to MLOps
   Data Scientists Versus Software Engineers
   ML Engineers
   ML in Products and Services
   MLOps
   MLOps Methodology
   MLOps Level 0
   MLOps Level 1
   MLOps Level 2
   Components of an Orchestrated Workflow
   Three Types of Custom Components
   Python Function–Based Components
   Container-Based Components
   Fully Custom Components
   TFX Deep Dive
   TFX SDK
   Intermediate Representation
   Runtime
   Implementing an ML Pipeline Using TFX Components
   Advanced Features of TFX
   Managing Model Versions
   Approaches to Versioning Models
   Model Lineage
   Model Registries
   Continuous Integration and Continuous Deployment
   Continuous Integration
   Continuous Delivery
   Progressive Delivery
   Blue/Green Deployment
   Canary Deployment
   Live Experimentation
   Conclusion

16. Model Monitoring and Logging
   The Importance of Monitoring
   Observability in Machine Learning
   What Should You Monitor?
   Custom Alerting in TFX
   Logging
   Distributed Tracing
   Monitoring for Model Decay
   Data Drift and Concept Drift
   Model Decay Detection
   Supervised Monitoring Techniques
   Unsupervised Monitoring Techniques
   Mitigating Model Decay
   Retraining Your Model
   When to Retrain
   Automated Retraining
   Conclusion

17. Privacy and Legal Requirements
   Why Is Data Privacy Important?
   What Data Needs to Be Kept Private?
   Harms
   Only Collect What You Need
   GenAI Data Scraped from the Web and Other Sources
   Legal Requirements
   The GDPR and the CCPA
   The GDPR’s Right to Be Forgotten
   Pseudonymization and Anonymization
   Differential Privacy
   Local and Global DP
   Epsilon-Delta DP
   Applying Differential Privacy to ML
   TensorFlow Privacy Example
   Federated Learning
   Encrypted ML
   Conclusion

18. Orchestrating Machine Learning Pipelines
   An Introduction to Pipeline Orchestration
   Why Pipeline Orchestration?
   Directed Acyclic Graphs
   Pipeline Orchestration with TFX
   Interactive TFX Pipelines
   Converting Your Interactive Pipeline for Production
   Orchestrating TFX Pipelines with Apache Beam
   Orchestrating TFX Pipelines with Kubeflow Pipelines
   Introduction to Kubeflow Pipelines
   Installation and Initial Setup
   Accessing Kubeflow Pipelines
   The Workflow from TFX to Kubeflow
   OpFunc Functions
   Orchestrating Kubeflow Pipelines
   Google Cloud Vertex Pipelines
   Setting Up Google Cloud and Vertex Pipelines
   Setting Up a Google Cloud Service Account
   Orchestrating Pipelines with Vertex Pipelines
   Executing Vertex Pipelines
   Choosing Your Orchestrator
   Interactive TFX
   Apache Beam
   Kubeflow Pipelines
   Google Cloud Vertex Pipelines
   Alternatives to TFX
   Conclusion

19. Advanced TFX
   Advanced Pipeline Practices
   Configure Your Components
   Import Artifacts
   Use Resolver Node
   Execute a Conditional Pipeline
   Export TF Lite Models
   Warm-Starting Model Training
   Use Exit Handlers
   Trigger Messages from TFX
   Custom TFX Components: Architecture and Use Cases
   Architecture of TFX Components
   Use Cases of Custom Components
   Using Function-Based Custom Components
   Writing a Custom Component from Scratch
   Defining Component Specifications
   Defining Component Channels
   Writing the Custom Executor
   Writing the Custom Driver
   Assembling the Custom Component
   Using Our Basic Custom Component
   Implementation Review
   Reusing Existing Components
   Creating Container-Based Custom Components
   Which Custom Component Is Right for You?
   TFX-Addons
   Conclusion

20. ML Pipelines for Computer Vision Problems
   Our Data
   Our Model
   Custom Ingestion Component
   Data Preprocessing
   Exporting the Model
   Our Pipeline
   Data Ingestion
   Data Preprocessing
   Model Training
   Model Evaluation
   Model Export
   Putting It All Together
   Executing on Apache Beam
   Executing on Vertex Pipelines
   Model Deployment with TensorFlow Serving
   Conclusion

21. ML Pipelines for Natural Language Processing
   Our Data
   Our Model
   Ingestion Component
   Data Preprocessing
   Putting the Pipeline Together
   Executing the Pipeline
   Model Deployment with Google Cloud Vertex
   Registering Your ML Model
   Creating a New Model Endpoint
   Deploying Your ML Model
   Requesting Predictions from the Deployed Model
   Cleaning Up Your Deployed Model
   Conclusion

22. Generative AI
   Generative Models
   GenAI Model Types
   Agents and Copilots
   Pretraining
   Pretraining Datasets
   Embeddings
   Self-Supervised Training with Masks
   Fine-Tuning
   Fine-Tuning Versus Transfer Learning
   Fine-Tuning Datasets
   Fine-Tuning Considerations for Production
   Fine-Tuning Versus Model APIs
   Parameter-Efficient Fine-Tuning
   LoRA
   S-LoRA
   Human Alignment
   Reinforcement Learning from Human Feedback
   Reinforcement Learning from AI Feedback
   Direct Preference Optimization
   Prompting
   Chaining
   Retrieval Augmented Generation
   ReAct
   Evaluation
   Evaluation Techniques
   Benchmarking Across Models
   LMOps
   GenAI Attacks
   Jailbreaks
   Prompt Injection
   Responsible GenAI
   Design for Responsibility
   Conduct Adversarial Testing
   Constitutional AI
   Conclusion

23. The Future of Machine Learning Production Systems and Next Steps
   Let’s Think in Terms of ML Systems, Not ML Models
   Bringing ML Systems Closer to Domain Experts
   Privacy Has Never Been More Important
   Conclusion

Index
Foreword My first big break in AI and machine learning (ML) came about 20 years ago. It was during a time when the internet still felt like a brand new technology. The world was noticing that the power of free communication had drawbacks as well as benefits— with those drawbacks being most notable in the form of email spam. These unwanted messages were clogging up inboxes everywhere with shady offers for pills or scams seeking bank account information. Email spam was a raging problem because the available spam filters (being based largely on hand-crafted rules and patterns) were ineffective. Spammers would fool these filters with all kinds of tricks, like int3nt!onal mi$$pellings or o t h e r h a c k y m e t h o d s that were hard for a fixed rule to adapt to. As a grad student at the time, I became part of the community of researchers that believed a funny technology called machine learning might be the right solution for this set of problems. I was even lucky enough to create a model that won one of the early benchmark competitions for email spam filtering. I remember that early model for two reasons. First, it was kind of cool that it worked well by using a simple but very flexible representation—something that we would now call an early precursor to a one-dimensional convolution on strings. Second, I can look back and say with certainty that it would have been an absolute mess to put into a production environment. It had been designed under the pressures of academic research, in which velocity trumps reliability, and quick fixes and patches that work once are more than good enough. I didn’t know any better. I had never actually met anyone who had run an ML pipeline in production. Back then I don’t think I had ever even heard the words production and machine learning used together in the same sentence. 
The first real production system I got to design and build was an early system at Google for detecting and removing ads that violated policies—basically ads that were scammy or spammy. This was important work, and I felt it was extremely rewarding to protect our users this way. It was also a time when creating an ML production system meant building everything from scratch. There weren’t reliable scalable libraries—this was well before PyTorch or TensorFlow—and infrastructure for data storage, model training, and serving all had to be built from scratch. As you might guess, this meant that I got hit with every single pitfall imaginable in production ML: validation, monitoring, safety checks, rollout plans, update mechanisms, dealing with churn, dealing with noise, handling unreliable labels, encountering unstable data dependencies—the list goes on. It was a hard way to learn these lessons, but the experience definitely made an impression.

A few years later, I was leading Google’s systems for search ads click-through prediction. At the time, this was perhaps one of the largest and—from a business standpoint—most impactful ML systems in the world. Because of that, reliability was of the utmost importance, and much of the work that my colleagues and I did revolved around strengthening the production robustness of our system. This included both system-level robustness from an infrastructure perspective, and statistical robustness to ensure that changes in data over time would be handled well. Because running ML systems at this scale and importance was still quite new, we had to invent much of this for ourselves. We ended up writing a few papers on this experience, one of which was cheerfully titled “Machine Learning: The High Interest Credit Card of Technical Debt,” hoping to share what we had learned with others in the field. And I got to help put some of these thoughts into general practice through some of the early designs of TensorFlow Extended (TFX).

Now here we are in the present day. AI and ML are more important than ever, and the emergent capabilities of large language models (LLMs) and generative AI (GenAI) are incredibly promising.
There is also more awareness of the importance of production-grade safety, reliability, responsibility, and robustness—along with a keen understanding of just how difficult these problems can be. It might feel daunting to be taking on the challenge of building a new AI or ML pipeline. Fortunately, today, you are not alone. The field has come a long way from those early days; we have some incredible benefits now. One incredible benefit is that the level of production-grade infrastructure has advanced considerably, and best practices have been codified into off-the-shelf offerings through TFX and similar offerings that significantly simplify building a robust pipeline. But even more important than the infrastructure is the people in the field. There are folks like the authors of this book—Robert Crowe, Hannes Hapke, Emily Caveness, and Di Zhu—who are willing to serve as your guide through these pipeline jungles, providing painstakingly detailed knowledge. They will ensure you don’t have to learn the way I did—by hitting pitfall after unexpected pitfall—and can put you on a well-lit path to success.
I have known Hannes and Robert for many years. Hannes and I first met at a Google Developer Advisory Board meeting, where he provided a ton of useful feedback on ways that Google could support ML developers even better, and I could tell from the first conversation that he was someone who had lived these problems and their solutions in the trenches for many years. Robert and I have been colleagues at Google for quite some time, and I have always been struck by both his technical expertise and by his ability to articulate clear and simple explanations for complex systems. So you are in good hands, and you are in for an exciting journey. I very much hope that you don’t just read this book—that you also build along with it and create something amazing, something that pushes forward the cutting edge of what AI and ML can do, and most of all, something that will not wake you up at 3 a.m. with a production outage. Very best wishes for your journey!

— D. Sculley
CEO, Kaggle
August 2024