Data Science on the Google Cloud Platform
Implementing End-to-End Real-Time Data Pipelines: From Ingest to Machine Learning

SECOND EDITION

Valliappa Lakshmanan

Beijing • Boston • Farnham • Sebastopol • Tokyo
Data Science on the Google Cloud Platform, Second Edition
by Valliappa Lakshmanan

Copyright © 2022 Google LLC. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisition Editor: Jessica Haberman
Development Editor: Michele Cronin
Production Editor: Katherine Tozer
Copyeditor: Tom Sullivan
Proofreader: Piper Editorial Consulting, LLC
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

January 2018: First Edition
April 2022: Second Edition

Revision History for the Second Edition
2022-03-29: First Release
2022-04-22: Second Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098118952 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science on the Google Cloud Platform, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-11895-2
[LSI]
Table of Contents

Preface   xi

1. Making Better Decisions Based on Data   1
    Many Similar Decisions   4
    The Role of Data Scientists   5
    Scrappy Environment   7
    Full Stack Cloud Data Scientists   8
    Collaboration   9
    Best Practices   10
    Simple to Complex Solutions   10
    Cloud Computing   11
    Serverless   12
    A Probabilistic Decision   13
    Probabilistic Approach   14
    Probability Density Function   15
    Cumulative Distribution Function   16
    Choices Made   18
    Choosing Cloud   19
    Not a Reference Book   19
    Getting Started with the Code   20
    Agile Architecture for Data Science on Google Cloud   22
    What Is Agile Architecture?   23
    No-Code, Low-Code   23
    Use Managed Services   24
    Summary   25
    Suggested Resources   26

2. Ingesting Data into the Cloud   29
    Airline On-Time Performance Data   29
    Knowability   31
    Causality   31
    Training–Serving Skew   32
    Downloading Data   33
    Hub-and-Spoke Architecture   34
    Dataset Fields   35
    Separation of Compute and Storage   37
    Scaling Up   39
    Scaling Out with Sharded Data   41
    Scaling Out with Data-in-Place   43
    Ingesting Data   46
    Reverse Engineering a Web Form   46
    Dataset Download   48
    Exploration and Cleanup   50
    Uploading Data to Google Cloud Storage   51
    Loading Data into Google BigQuery   55
    Advantages of a Serverless Columnar Database   55
    Staging on Cloud Storage   57
    Access Control   57
    Ingesting CSV Files   61
    Partitioning   62
    Scheduling Monthly Downloads   63
    Ingesting in Python   65
    Cloud Run   71
    Securing Cloud Run   72
    Deploying and Invoking Cloud Run   74
    Scheduling Cloud Run   75
    Summary   76
    Code Break   77
    Suggested Resources   78

3. Creating Compelling Dashboards   81
    Explain Your Model with Dashboards   83
    Why Build a Dashboard First?   84
    Accuracy, Honesty, and Good Design   86
    Loading Data into Cloud SQL   88
    Create a Google Cloud SQL Instance   89
    Create Table of Data   91
    Interacting with the Database   95
    Querying Using BigQuery   96
    Schema Exploration   96
    Using Preview   97
    Using Table Explorer   99
    Creating BigQuery View   100
    Building Our First Model   101
    Contingency Table   101
    Threshold Optimization   103
    Building a Dashboard   106
    Getting Started with Data Studio   107
    Creating Charts   109
    Adding End-User Controls   110
    Showing Proportions with a Pie Chart   112
    Explaining a Contingency Table   117
    Modern Business Intelligence   119
    Digitization   119
    Natural Language Queries   120
    Connected Sheets   122
    Summary   123
    Suggested Resources   123

4. Streaming Data: Publication and Ingest with Pub/Sub and Dataflow   125
    Designing the Event Feed   126
    Transformations Needed   127
    Architecture   128
    Getting Airport Information   129
    Sharing Data   132
    Time Correction   133
    Apache Beam/Cloud Dataflow   135
    Parsing Airports Data   136
    Adding Time Zone Information   139
    Converting Times to UTC   141
    Correcting Dates   144
    Creating Events   146
    Reading and Writing to the Cloud   148
    Running the Pipeline in the Cloud   150
    Publishing an Event Stream to Cloud Pub/Sub   153
    Speed-Up Factor   154
    Get Records to Publish   155
    How Many Topics?   156
    Iterating Through Records   157
    Building a Batch of Events   158
    Publishing a Batch of Events   159
    Real-Time Stream Processing   160
    Streaming in Dataflow   160
    Windowing a Pipeline   162
    Streaming Aggregation   162
    Using Event Timestamps   165
    Executing the Stream Processing   166
    Analyzing Streaming Data in BigQuery   168
    Real-Time Dashboard   169
    Summary   170
    Suggested Resources   171

5. Interactive Data Exploration with Vertex AI Workbench   173
    Exploratory Data Analysis   174
    Exploration with SQL   177
    Reading a Query Explanation   179
    Exploratory Data Analysis in Vertex AI Workbench   184
    Jupyter Notebooks   185
    Creating a Notebook   186
    Jupyter Commands   188
    Installing Packages   188
    Jupyter Magic for Google Cloud   189
    Exploring Arrival Delays   190
    Basic Statistics   191
    Plotting Distributions   191
    Quality Control   194
    Arrival Delay Conditioned on Departure Delay   199
    Evaluating the Model   204
    Random Shuffling   204
    Splitting by Date   205
    Training and Testing   206
    Summary   210
    Suggested Resources   210

6. Bayesian Classifier with Apache Spark on Cloud Dataproc   211
    MapReduce and the Hadoop Ecosystem   211
    How MapReduce Works   212
    Apache Hadoop   214
    Google Cloud Dataproc   214
    Need for Higher-Level Tools   216
    Jobs, Not Clusters   217
    Preinstalling Software   219
    Quantization Using Spark SQL   221
    JupyterLab on Cloud Dataproc   222
    Independence Check Using BigQuery   223
    Spark SQL in JupyterLab   225
    Histogram Equalization   227
    Bayesian Classification   231
    Bayes in Each Bin   231
    Evaluating the Model   232
    Dynamically Resizing Clusters   233
    Comparing to Single Threshold Model   235
    Orchestration   237
    Submitting a Spark Job   238
    Workflow Template   238
    Cloud Composer   239
    Autoscaling   239
    Serverless Spark   240
    Summary   242
    Suggested Resources   243

7. Logistic Regression Using Spark ML   245
    Logistic Regression   246
    How Logistic Regression Works   246
    Spark ML Library   249
    Getting Started with Spark Machine Learning   250
    Spark Logistic Regression   251
    Creating a Training Dataset   252
    Training the Model   256
    Predicting Using the Model   259
    Evaluating a Model   260
    Feature Engineering   263
    Experimental Framework   263
    Feature Selection   267
    Feature Transformations   271
    Feature Creation   274
    Categorical Variables   278
    Repeatable, Real Time   280
    Summary   281
    Suggested Resources   282

8. Machine Learning with BigQuery ML   283
    Logistic Regression   283
    Presplit Data   285
    Interrogating the Model   286
    Evaluating the Model   287
    Scale and Simplicity   289
    Nonlinear Machine Learning   290
    XGBoost   290
    Hyperparameter Tuning   292
    Vertex AI AutoML Tables   294
    Time Window Features   296
    Taxi-Out Time   296
    Compounding Delays   298
    Causality   299
    Time Features   300
    Departure Hour   300
    Transform Clause   302
    Categorical Variable   303
    Feature Cross   303
    Summary   305
    Suggested Resources   306

9. Machine Learning with TensorFlow in Vertex AI   309
    Toward More Complex Models   310
    Preparing BigQuery Data for TensorFlow   314
    Reading Data into TensorFlow   315
    Training and Evaluation in Keras   317
    Model Function   317
    Features   318
    Inputs   320
    Training the Keras Model   320
    Saving and Exporting   322
    Deep Neural Network   322
    Wide-and-Deep Model in Keras   323
    Representing Air Traffic Corridors   323
    Bucketing   324
    Feature Crossing   325
    Wide-and-Deep Classifier   326
    Deploying a Trained TensorFlow Model to Vertex AI   327
    Concepts   328
    Uploading Model   328
    Creating Endpoint   330
    Deploying Model to Endpoint   330
    Invoking the Deployed Model   331
    Summary   332
    Suggested Resources   333

10. Getting Ready for MLOps with Vertex AI   335
    Developing and Deploying Using Python   336
    Writing model.py   337
    Writing the Training Pipeline   338
    Predefined Split   340
    AutoML   341
    Hyperparameter Tuning   343
    Parameterize Model   344
    Shorten Training Run   345
    Metrics During Training   347
    Hyperparameter Tuning Pipeline   347
    Best Trial to Completion   349
    Explaining the Model   350
    Configuring Explanations Metadata   350
    Creating and Deploying Model   352
    Obtaining Explanations   352
    Summary   354
    Suggested Resources   355

11. Time-Windowed Features for Real-Time Machine Learning   357
    Time Averages   357
    Apache Beam and Cloud Dataflow   358
    Reading and Writing   360
    Time Windowing   362
    Machine Learning Training   367
    Machine Learning Dataset   367
    Training the Model   373
    Streaming Predictions   376
    Reuse Transforms   377
    Input and Output   379
    Invoking Model   380
    Reusing Endpoint   381
    Batching Predictions   384
    Streaming Pipeline   385
    Writing to BigQuery   385
    Executing Streaming Pipeline   386
    Late and Out-of-Order Records   387
    Possible Streaming Sinks   393
    Summary   400
    Suggested Resources   401

12. The Full Dataset   403
    Four Years of Data   403
    Creating Dataset   404
    Training Model   409
    Evaluation   411
    Summary   417
    Suggested Resources   417

Conclusion   419

Considerations for Sensitive Data Within Machine Learning Datasets   423

Index   431
Preface

In my current role at Google, I get to work alongside data scientists and data engineers in a variety of industries as they move their data processing and analysis methods to the public cloud. Some try to do the same things they do on premises, the same way they do them, just on rented computing resources. The visionary users, though, rethink their systems, transform how they work with data, and thereby are able to innovate faster.

As early as 2011, an article in Harvard Business Review recognized that some of cloud computing’s greatest successes come from allowing groups and communities to work together in ways that were not previously possible. This is now much more widely recognized. An MIT survey in 2017 found that more respondents (45%) cited increased agility rather than cost savings (34%) as the reason to move to the public cloud. However, it is still not widely achieved. McKinsey estimated in 2021 that companies are leaving behind nearly $1 trillion of value by not looking at the public cloud as a source of transformative value. Therefore, being able to work on a data science project in the cloud is a skill well worth investing in.

In this book, we walk through an example of a cloud-native, transformative, collaborative way of doing data science. You will learn how to implement an end-to-end data pipeline—we will begin with ingesting the data in a serverless way and work our way through data exploration, dashboards, relational databases, and streaming data all the way to training and making an operational machine learning model. I cover all these aspects of data-based services because data engineers will be involved in designing the services, developing the statistical and machine learning models, and implementing them in large-scale production and in real time.

Who This Book Is For

If you use computers to work with data, this book is for you. You might go by the title of data analyst, database administrator, data engineer, data scientist, or systems programmer today. Although your role might be narrower today (perhaps you do only data analysis, or only model building, or only DevOps), you want to stretch your wings a bit—you want to learn how to create data science models as well as how to implement them at scale in production systems.

Google Cloud Platform is designed to make you forget about infrastructure. The marquee data services—Google BigQuery, Cloud Dataflow, Cloud Pub/Sub, and Vertex AI—are all serverless and autoscaling. When you submit a query to BigQuery, it is run on thousands of nodes, and you get your result back; you don’t spin up a cluster or install any software. Similarly, in Cloud Dataflow, when you submit a data pipeline, and in Vertex AI, when you submit a machine learning job, you can process data at scale and train models at scale without worrying about cluster management or failure recovery. Cloud Pub/Sub is a global messaging service that autoscales to the throughput and number of subscribers and publishers without any work on your part. Even when you’re running open source software like Apache Spark that’s designed to operate on a cluster, Google Cloud Platform makes it easy with job-specific clusters and serverless Spark. Because of this job-specific infrastructure, there’s no need to fear overprovisioning hardware or running out of capacity to run a job when you need it. Plus, data is encrypted, both at rest and in transit, and kept secure. As a data scientist, not having to manage infrastructure is incredibly liberating.

These autoscaled, fully managed services make it easier to implement data science models at scale—which is why data scientists no longer need to hand off their models to data engineers. Instead, they can write a data science workload, submit it to the cloud, and have that workload executed automatically in an autoscaled manner. At the same time, data science packages are becoming simpler and simpler. So, it has become extremely easy for an engineer to slurp in data and use a canned model to get an initial (and often very good) model up and running. With well-designed packages and easy-to-consume APIs, you don’t need to know the esoteric details of data science algorithms—only what each algorithm does and how to link algorithms together to solve realistic problems. This convergence between data science and data engineering is why you can stretch your wings beyond your current role.

Rather than simply read this book cover-to-cover, I strongly encourage you to follow along with me by trying out the code. The full source code for the end-to-end pipeline I build in this book is on GitHub. Create a Google Cloud Platform project, and after reading each chapter, try to repeat what I did by referring to the code and to the README.md file in each folder of the GitHub repository.

Follow the instructions in the README.md files in GitHub to try out the code. The code snippets in the book are often incomplete—for example, I may omit some arguments to cloud commands for clarity or conciseness.
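To give a concrete flavor of the serverless model described above, here is a minimal sketch of submitting a SQL query to BigQuery from Python with the google-cloud-bigquery client library. This snippet is not taken from the book’s repository; the project ID is a placeholder, and the public sample table and its column names are assumptions used only for illustration.

    from google.cloud import bigquery

    # Credentials come from the environment, e.g., after running
    # "gcloud auth application-default login". The project ID is a placeholder.
    client = bigquery.Client(project="my-gcp-project")

    # Submit the query. BigQuery allocates the workers behind the scenes;
    # there is no cluster to create, size, or tear down.
    sql = """
        SELECT airline, COUNT(*) AS num_flights
        FROM `bigquery-samples.airline_ontime_data.flights`
        GROUP BY airline
        ORDER BY num_flights DESC
        LIMIT 5
    """
    rows = client.query(sql).result()  # blocks until the query job completes

    for row in rows:
        print(row.airline, row.num_flights)

The same submit-and-let-the-service-scale pattern carries through the Dataflow pipelines and Vertex AI training jobs developed later in the book.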
Note that this is not a reference book—the best reference to Google Cloud is its documentation, and there is very little value to be had by simply reproducing that in a book. Instead, this book shows you how to use a variety of tools together to solve a problem. My goal here is to teach you how to think about a problem in order to solve it using Google Cloud, not to comprehensively cover any particular product. If you find yourself fascinated by a topic in this book and want to dive deeper, you can find a few selected resources at the end of every chapter that provide a deeper dive into topics covered in the chapter. Don’t feel obligated to watch every video or read every article.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.
Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/GoogleCloudPlatform/data-science-on-gcp.

If you have a technical question or a problem using the code examples, please email bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Science on the Google Cloud Platform by Valliappa Lakshmanan (O’Reilly). Copyright 2022 Google LLC, 978-1-098-11895-2.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/data-science-on-gcp.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit https://oreilly.com.

Find us on Facebook: https://facebook.com/oreilly.

Follow us on Twitter: https://twitter.com/oreillymedia.

Watch us on YouTube: https://www.youtube.com/oreillymedia.

Acknowledgments

When I took the job at Google in 2014, I had used the public cloud simply as a way to rent infrastructure—so I was spinning up virtual machines, installing the software I needed on those machines, and then running my data processing jobs using my usual workflow. Fortunately, I realized that Google’s big data stack was different, and so I set out to learn how to take full advantage of all the data and machine learning tools on Google Cloud Platform.

The way I learn best is to write code, and so that’s what I did. When a Python meetup group asked me to talk about Google Cloud Platform, I did a show-and-tell of the code that I had written. It turned out that a walk-through of the code to build an end-to-end system while contrasting different approaches to a data science problem was quite educational for the attendees. I wrote up the essence of my talk as a book proposal and sent it to O’Reilly Media.

A book, of course, needs to have a lot more depth than a 60-minute code walk-through. Imagine that you come to work one day to find an email from a new employee at your company, someone who’s been at the company less than six months. Somehow, he’s decided he’s going to write a book on the pretty sophisticated platform that you’ve had a hand in building and is asking for your help. He is not part of your team, helping him is not part of your job, and he is not even located in the same office as you. What is your response? Would you volunteer?

What makes Google such a great place to work is the people who work here. It is a testament to the company’s culture that so many people—engineers, technical leads, product managers, solutions architects, data scientists, legal counsel, directors—across so many different teams happily gave of their expertise to someone they had never met (in fact, I still haven’t met many of these people in person). This book, thus, is immeasurably better because of (in alphabetical order of last names) William Brockman, Mike Dahlin, Tony DiLoreto, Bob Evans, Roland Hess, Brett Hesterberg, Dennis Huo, Chad Jennings, Puneith Kaul, Dinesh Kulkarni, Manish Kurse, Reuven Lax, Jonathan Liu, James Malone, Dave Oleson, Mosha Pasumansky, Kevin Peterson, Olivia Puerta, Reza Rokni, Karn Seth, Sergei Sokolenko, and Amy Unruh. In particular, thanks to Mike Dahlin, Manish Kurse, and Olivia Puerta for reviewing every single chapter. When the first edition of the book was in early access, I received valuable error reports from Anthonios Partheniou and David Schwantner. Needless to say, I am responsible for any errors that remain.

A few times during the writing of the book, I found myself completely stuck. Sometimes, the problems were technical. Thanks to (in alphabetical order) Ahmet Altay, Eli Bixby, Ben Chambers, Slava Chernyak, Marián Dvorský, Robbie Haertel, Felipe Hoffa, Amir Hormati, Qiming (Bradley) Jiang, Kenneth Knowles, Nikhil Kothari, and Chris Meyers for showing me the way forward. At other times, the problems were related to figuring out company policy or getting access to the right team, document, or statistic. This book would have been a lot poorer had these colleagues not unblocked me at critical points (again alphabetically): Louise Byrne, Apurva Desai, Rochana Golani, Fausto Ibarra, Jason Martin, Neal Mueller, Philippe Poutonnet, Brad Svee, Jordan Tigani, William Vampenebe, and Miles Ward. Thank you all for your help and encouragement.

Five years on, I continue to be humbled by the incredible talent and collaboration of my colleagues. Sagar Baliyara, Filipe Gracio, Polong Lin, and Krishnan Saidapet (in alphabetical order of last names) brought a close eye to the second edition and made many great suggestions.

Thanks also to the O’Reilly team—Marie Beaugureau, Kristen Brown, Ben Lorica, Tim McGovern, Rachel Roumeliotis, and Heather Scherer for believing in me and making the process of moving from draft to the first edition of the book painless. Producing the second edition was greatly streamlined by Katherine Tozer, Michele Cronin, and Tom Sullivan.

The second edition has also greatly benefited from fresh outside perspectives. Colin Dietrich verified much of the code in the book and made numerous pull requests to the GitHub repository. Joy Payton suggested many improvements to make the book more accessible to beginners in data science. Michael Hopkins and Margaret Maynard-Reid scrutinized the book for areas that needed updating. Thanks also to readers of the first edition who left reviews of the book on Amazon, filed issues on GitHub, and reached out to me via email and on Twitter. Your feedback has greatly improved this edition of the book.
Finally, and most important, thanks to Abirami, Sidharth, and Sarada for your understanding and patience even as I became engrossed in writing and coding. You make it all worthwhile.

I am donating 100% of the royalties from this book to United Way of King County, where I live. I strongly encourage you to get involved with a local charity to give, volunteer, and take action to help solve your community’s toughest challenges.