Building Machine Learning Powered Applications
Going from Idea to Product

Emmanuel Ameisen

Beijing · Boston · Farnham · Sebastopol · Tokyo

Building Machine Learning Powered Applications
by Emmanuel Ameisen

Copyright © 2020 Emmanuel Ameisen. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Jonathan Hassell
Development Editor: Melissa Potter
Production Editor: Deborah Baker
Copyeditor: Kim Wimpsett
Proofreader: Christina Edwards
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

February 2020: First Edition

Revision History for the First Edition
2020-01-17: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492045113 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Building Machine Learning Powered Applications, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-04511-3
[LSI]

Table of Contents

Preface

Part I. Find the Correct ML Approach

1. From Product Goal to ML Framing
    Estimate What Is Possible
    Models
    Data
    Framing the ML Editor
    Trying to Do It All with ML: An End-to-End Framework
    The Simplest Approach: Being the Algorithm
    Middle Ground: Learning from Our Experience
    Monica Rogati: How to Choose and Prioritize ML Projects
    Conclusion

2. Create a Plan
    Measuring Success
    Business Performance
    Model Performance
    Freshness and Distribution Shift
    Speed
    Estimate Scope and Challenges
    Leverage Domain Expertise
    Stand on the Shoulders of Giants
    ML Editor Planning
    Initial Plan for an Editor
    Always Start with a Simple Model
    To Make Regular Progress: Start Simple
    Start with a Simple Pipeline
    Pipeline for the ML Editor
    Conclusion

Part II. Build a Working Pipeline

3. Build Your First End-to-End Pipeline
    The Simplest Scaffolding
    Prototype of an ML Editor
    Parse and Clean Data
    Tokenizing Text
    Generating Features
    Test Your Workflow
    User Experience
    Modeling Results
    ML Editor Prototype Evaluation
    Model
    User Experience
    Conclusion

4. Acquire an Initial Dataset
    Iterate on Datasets
    Do Data Science
    Explore Your First Dataset
    Be Efficient, Start Small
    Insights Versus Products
    A Data Quality Rubric
    Label to Find Data Trends
    Summary Statistics
    Explore and Label Efficiently
    Be the Algorithm
    Data Trends
    Let Data Inform Features and Models
    Build Features Out of Patterns
    ML Editor Features
    Robert Munro: How Do You Find, Label, and Leverage Data?
    Conclusion

Part III. Iterate on Models

5. Train and Evaluate Your Model
    The Simplest Appropriate Model
    Simple Models
    From Patterns to Models
    Split Your Dataset
    ML Editor Data Split
    Judge Performance
    Evaluate Your Model: Look Beyond Accuracy
    Contrast Data and Predictions
    Confusion Matrix
    ROC Curve
    Calibration Curve
    Dimensionality Reduction for Errors
    The Top-k Method
    Other Models
    Evaluate Feature Importance
    Directly from a Classifier
    Black-Box Explainers
    Conclusion

6. Debug Your ML Problems
    Software Best Practices
    ML-Specific Best Practices
    Debug Wiring: Visualizing and Testing
    Start with One Example
    Test Your ML Code
    Debug Training: Make Your Model Learn
    Task Difficulty
    Optimization Problems
    Debug Generalization: Make Your Model Useful
    Data Leakage
    Overfitting
    Consider the Task at Hand
    Conclusion

7. Using Classifiers for Writing Recommendations
    Extracting Recommendations from Models
    What Can We Achieve Without a Model?
    Extracting Global Feature Importance
    Using a Model’s Score
    Extracting Local Feature Importance
    Comparing Models
    Version 1: The Report Card
    Version 2: More Powerful, More Unclear
    Version 3: Understandable Recommendations
    Generating Editing Recommendations
    Conclusion

Part IV. Deploy and Monitor

8. Considerations When Deploying Models
    Data Concerns
    Data Ownership
    Data Bias
    Systemic Bias
    Modeling Concerns
    Feedback Loops
    Inclusive Model Performance
    Considering Context
    Adversaries
    Abuse Concerns and Dual-Use
    Chris Harland: Shipping Experiments
    Conclusion

9. Choose Your Deployment Option
    Server-Side Deployment
    Streaming Application or API
    Batch Predictions
    Client-Side Deployment
    On Device
    Browser Side
    Federated Learning: A Hybrid Approach
    Conclusion

10. Build Safeguards for Models
    Engineer Around Failures
    Input and Output Checks
    Model Failure Fallbacks
    Engineer for Performance
    Scale to Multiple Users
    Model and Data Life Cycle Management
    Data Processing and DAGs
    Ask for Feedback
    Chris Moody: Empowering Data Scientists to Deploy Models
    Conclusion

11. Monitor and Update Models
    Monitoring Saves Lives
    Monitoring to Inform Refresh Rate
    Monitor to Detect Abuse
    Choose What to Monitor
    Performance Metrics
    Business Metrics
    CI/CD for ML
    A/B Testing and Experimentation
    Other Approaches
    Conclusion

Index


Preface

The Goal of Using Machine Learning Powered Applications

Over the past decade, machine learning (ML) has increasingly been used to power a variety of products such as automated support systems, translation services, recommendation engines, fraud detection models, and many, many more.

Surprisingly, there aren’t many resources available to teach engineers and scientists how to build such products. Many books and classes will teach how to train ML models or how to build software projects, but few blend both worlds to teach how to build practical applications that are powered by ML.

Deploying ML as part of an application requires a blend of creativity, strong engineering practices, and an analytical mindset. ML products are notoriously challenging to build because they require much more than simply training a model on a dataset. Choosing the right ML approach for a given feature, analyzing model errors and data quality issues, and validating model results to guarantee product quality are all challenging problems that are at the core of the ML building process.

This book goes through every step of this process and aims to help you accomplish each of them by sharing a mix of methods, code examples, and advice from me and other experienced practitioners. We’ll cover the practical skills required to design, build, and deploy ML–powered applications. The goal of this book is to help you succeed at every part of the ML process.

Use ML to Build Practical Applications

If you regularly read ML papers and corporate engineering blogs, you may feel overwhelmed by the combination of linear algebra equations and engineering terms. The hybrid nature of the field leads many engineers and scientists who could contribute their diverse expertise to feel intimidated by ML. Similarly, entrepreneurs and product leaders often struggle to tie together their ideas for a business with what is possible with ML today (and what may be possible tomorrow).

This book covers the lessons I have learned working on data teams at multiple companies and helping hundreds of data scientists, software engineers, and product managers build applied ML projects through my work leading the artificial intelligence program at Insight Data Science.

The goal of this book is to share a step-by-step practical guide to building ML–powered applications. It is practical and focuses on concrete tips and methods to help you prototype, iterate, and deploy models. Because it spans a wide range of topics, we will go into only as much detail as is needed at each step. Whenever possible, I will provide resources to help you dive deeper into the topics covered if you so desire.

Important concepts are illustrated with practical examples, including a case study that will go from idea to deployed model by the end of the book. Most examples will be accompanied by illustrations, and many will contain code. All of the code used in this book can be found in the book’s companion GitHub repository.

Because this book focuses on describing the process of ML, each chapter builds upon concepts defined in earlier ones. For this reason, I recommend reading it in order so that you can understand how each successive step fits into the entire process. If you are looking to explore a subset of the process of ML, you might be better served with a more specialized book. If that is the case, I’ve shared a few recommendations.

Additional Resources

• If you’d like to know ML well enough to write your own algorithms from scratch, I recommend Data Science from Scratch, by Joel Grus. If the theory of deep learning is what you are after, the textbook Deep Learning (MIT Press), by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, is a comprehensive resource.
• If you are wondering how to train models efficiently and accurately on specific datasets, Kaggle and fast.ai are great places to look.
• If you’d like to learn how to build scalable applications that need to process a lot of data, I recommend looking at Designing Data-Intensive Applications (O’Reilly), by Martin Kleppmann.

If you have coding experience and some basic ML knowledge and want to build ML–driven products, this book will guide you through the entire process from product idea to shipped prototype. If you already work as a data scientist or ML engineer, this book will add new techniques to your ML development toolkit. If you do not know how to code but collaborate with data scientists, this book can help you understand the process of ML, as long as you are willing to skip some of the in-depth code examples.

Let’s start by diving deeper into the meaning of practical ML.

Practical ML

For the purpose of this introduction, think of ML as the process of leveraging patterns in data to automatically tune algorithms. This is a general definition, so you will not be surprised to hear that many applications, tools, and services are starting to integrate ML at the core of the way they function.

Some of these tasks are user-facing, such as search engines, recommendations on social platforms, translation services, or systems that automatically detect familiar faces in photographs, follow instructions from voice commands, or attempt to provide useful suggestions to finish a sentence in an email.

Some work in less visible ways, silently filtering spam emails and fraudulent accounts, serving ads, predicting future usage patterns to efficiently allocate resources, or experimenting with personalizing website experiences for each user.

Many products currently leverage ML, and even more could do so. Practical ML refers to the task of identifying practical problems that could benefit from ML and delivering a successful solution to these problems. Going from a high-level product goal to ML–powered results is a challenging task that this book tries to help you accomplish.

Some ML courses will teach students about ML methods by providing a dataset and having them train a model on it, but training an algorithm on a dataset is a small part of the ML process. Compelling ML–powered products rely on more than an aggregate accuracy score and are the results of a long process. This book will start from ideation and continue all the way through to production, illustrating every step on an example application. We will share tools, best practices, and common pitfalls learned from working with applied teams that are deploying these kinds of systems every day.

What This Book Covers

To cover the topic of building applications powered by ML, the focus of this book is concrete and practical. In particular, this book aims to illustrate the whole process of building ML–powered applications.

To do so, I will first describe methods to tackle each step in the process. Then, I will illustrate these methods using an example project as a case study. The book also contains many practical examples of ML in industry and features interviews with professionals who have built and maintained production ML models.

The entire process of ML

To successfully serve an ML product to users, you need to do more than simply train a model. You need to thoughtfully translate your product need to an ML problem, gather adequate data, efficiently iterate between models, validate your results, and deploy them in a robust manner.

Building a model often represents only a tenth of the total workload of an ML project. Mastering the entire ML pipeline is crucial to successfully build projects, succeed at ML interviews, and be a top contributor on ML teams.

A technical, practical case study

While we won’t be re-implementing algorithms from scratch in C, we will stay practical and technical by using libraries and tools providing higher-level abstractions. We will go through this book building an example ML application together, from the initial idea to the deployed product.

I will illustrate key concepts with code snippets when applicable, as well as figures describing our application. The best way to learn ML is by practicing it, so I encourage you to go through the book reproducing the examples and adapting them to build your own ML–powered application.

Real business applications

Throughout this book, I will include conversations and advice from ML leaders who have worked on data teams at tech companies such as StitchFix, Jawbone, and FigureEight. These discussions will cover practical advice garnered after building ML applications with millions of users and will correct some popular misconceptions about what makes data scientists and data science teams successful.

Prerequisites

This book assumes some familiarity with programming. I will mainly be using Python for technical examples and assume that the reader is familiar with the syntax. If you’d like to refresh your Python knowledge, I recommend The Hitchhiker’s Guide to Python (O’Reilly), by Kenneth Reitz and Tanya Schlusser.

In addition, while I will define most ML concepts referred to in the book, I will not cover the inner workings of all ML algorithms used.
Most of these algorithms are standard ML methods that are covered in introductory-level ML resources, such as the ones mentioned in “Additional Resources” earlier in this preface.

Our Case Study: ML–Assisted Writing

To concretely illustrate this idea, we will build an ML application together as we go through this book.

As a case study, I chose an application that can accurately illustrate the complexity of iterating and deploying ML models. I also wanted to cover a product that could produce value. This is why we will be implementing a machine learning–powered writing assistant.

Our goal is to build a system that will help users write better. In particular, we will aim to help people write better questions. This may seem like a very vague objective, and I will define it more clearly as we scope out the project, but it is a good example for a few key reasons.

Text data is everywhere
    Text data is abundantly available for most use cases you can think of and is core to many practical ML applications. Whether we are trying to better understand the reviews of our product, accurately categorize incoming support requests, or tailor our promotional messages to potential audiences, we will consume and produce text data.

Writing assistants are useful
    From Gmail’s text prediction feature to Grammarly’s smart spellchecker, ML–powered editors have proven that they can deliver value to users in a variety of ways. This makes it particularly interesting for us to explore how to build them from scratch.

ML–assisted writing is self-standing
    Many ML applications can function only when tightly integrated into a broader ecosystem, such as ETA prediction for ride-hailing companies, search and recommendation systems for online retailers, and ad bidding models. A text editor, however, even though it could benefit from being integrated into a document editing ecosystem, can prove valuable on its own and be exposed through a simple website.

Throughout the book, this project will allow us to highlight the challenges and associated solutions we suggest to build ML–powered applications.

The ML Process

The road from an idea to a deployed ML application is long and winding. After seeing many companies and individuals build such projects, I’ve identified four key successive stages, which will each be covered in a section of this book.

1. Identifying the right ML approach: The field of ML is broad and often proposes a multitude of ways to tackle a given product goal. The best approach for a given problem will depend on many factors such as success criteria, data availability, and task complexity. The goals of this stage are to set the right success criteria and to identify an adequate initial dataset and model choice.

2. Building an initial prototype: Start by building an end-to-end prototype before working on a model. This prototype should aim to tackle the product goal with no ML involved and will allow you to determine how to best apply ML. Once a prototype is built, you should have an idea of whether you need ML, and you should be able to start gathering a dataset to train a model.

3. Iterating on models: Now that you have a dataset, you can train a model and evaluate its shortcomings. The goal of this stage is to repeatedly alternate between error analysis and implementation. Increasing the speed at which this iteration loop happens is the best way to increase ML development speed.

4. Deployment and monitoring: Once a model shows good performance, you should pick an adequate deployment option. Once deployed, models often fail in unexpected ways. The last two chapters of this book will cover methods to mitigate and monitor model errors.

There is a lot of ground to cover, so let’s dive right in and start with Chapter 1!

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental code examples for this book are available for download at https://oreil.ly/ml-powered-applications.

If you have a technical question or a problem using the code examples, please send email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Building Machine Learning Powered Applications by Emmanuel Ameisen (O’Reilly). Copyright 2020 Emmanuel Ameisen, 978-1-492-04511-3.”

If you feel your use of code examples falls outside fair use or the permission given here, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

    O’Reilly Media, Inc.
    1005 Gravenstein Highway North
    Sebastopol, CA 95472
    800-998-9938 (in the United States or Canada)
    707-829-0515 (international or local)
    707-829-0104 (fax)

You can access the web page for this book, where we list errata, examples, and any additional information, at https://oreil.ly/Building_ML_Powered_Applications.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

The project of writing this book started as a consequence of my work mentoring Fellows and overseeing ML projects at Insight Data Science. For giving me the opportunity to lead this program and for encouraging me to write about the lessons learned doing so, I’d like to thank Jake Klamka and Jeremy Karnowski, respectively. I’d also like to thank the hundreds of Fellows I’ve worked with at Insight for allowing me to help them push the limits of what an ML project can look like.

Writing a book is a daunting task, and the O’Reilly staff helped make it more manageable every step of the way. In particular, I would like to thank my editor, Melissa Potter, who tirelessly provided guidance, suggestions, and moral support throughout the journey that is writing a book. Thank you to Mike Loukides for somehow convincing me that writing a book was a reasonable endeavor.

Thank you to the tech reviewers who combed through early drafts of this book, pointing out errors and offering suggestions for improvement. Thank you Alex Gude, Jon Krohn, Kristen McIntyre, and Douwe Osinga for taking the time out of your busy schedules to help make this book the best version of itself that it could be. To data practitioners whom I asked about the challenges of practical ML they felt needed the most attention, thank you for your time and insights, and I hope you’ll find that this book covers them adequately.

Finally, for their unwavering support during the series of busy weekends and late nights that came with writing this book, I’d like to thank my unwavering partner Mari, my sarcastic sidekick Eliott, my wise and patient family, and my friends who refrained from reporting me as missing. You made this book a reality.
