Marco Tranquillin, Valliappa Lakshmanan & Firat Tekiner Architecting Data and Machine Learning Platforms Enable Analytics and AI-Driven Innovation in the Cloud
CLOUD COMPUTING

All cloud architects need to know how to build data platforms that enable businesses to make data-driven decisions and deliver enterprise-wide intelligence in a fast and efficient way. This handbook shows you how to design, build, and modernize cloud native data and machine learning platforms using AWS, Azure, Google Cloud, and multicloud tools like Snowflake and Databricks.

Authors Marco Tranquillin, Valliappa Lakshmanan, and Firat Tekiner cover the entire data lifecycle from ingestion to activation in a cloud environment using real-world enterprise architectures. You’ll learn how to transform, secure, and modernize familiar solutions like data warehouses and data lakes, and you’ll be able to leverage recent AI/ML patterns to get accurate and quicker insights to drive competitive advantage.

You’ll learn how to:
• Design a modern and secure cloud native or hybrid data analytics and machine learning platform
• Accelerate data-led innovation by consolidating enterprise data in a governed, scalable, and resilient data platform
• Democratize access to enterprise data and govern how business teams extract insights and build AI/ML capabilities
• Enable your business to make decisions in real time using streaming pipelines
• Build an MLOps platform to move to a predictive and prescriptive analytics approach

Marco Tranquillin is a seasoned consultant who helps organizations make technology transformations through cloud computing.

Valliappa Lakshmanan is a renowned executive who partners with C-suite and data science teams to build value from data and AI.

Firat Tekiner is an innovative product manager who develops and delivers data products and AI systems for the world’s largest organizations.
US $65.99 CAN $82.99 ISBN: 978-1-098-15161-4 “This book is a great introduction to the concepts, patterns, and components used to design and build a modern cloud data and ML platform aligned with the strategic direction of the organization. I wish I had read it years ago.” —Robert Sahlin Data Platform Lead at Mathem
Praise for Architecting Data and Machine Learning Platforms Becoming a data-driven company requires solid data capabilities that can fit the company strategy. This book offers a 360-degree view of strategies for data transformation with real-life evolutionary architecture scenarios. A must-read for architects and everyone driving a data transformation program. —Mattia Cinquilli, Data and Analytics Director for Telco and Media at Sky Cloudy with a chance of data insights! This book is the Mary Poppins of the data world, making the complex journey of building modern cloud data platforms practically perfect in every way. The authors, a band of seasoned engineers, are like data whisperers, guiding you through the labyrinth of machine learning and analytics. They help turn the abandoned carts of your organization into free-shipping success stories. If you’re looking for a book that simplifies data and ML platforms while making you chuckle, this is your golden ticket! —Priscilla Moraes, PhD in AI and NLP, Director of Applied Sciences at Microsoft The authors’ experience dealing with evolving data and AI/ML practices show throughout the book. It’s a comprehensive collection of wisdom dealing with data at scale using cloud and on-prem technologies. —Bala Natarajan, Former VP, Enterprise Data Platform, PayPal
978-1-098-15161-4 [LSI] Architecting Data and Machine Learning Platforms by Marco Tranquillin, Valliappa Lakshmanan, and Firat Tekiner Copyright © 2024 Marco Tranquillin, Valliappa Lakshmanan, and Firat Tekiner. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com. Acquisitions Editor: Megan Laddusaw Development Editor: Virginia Wilson Production Editor: Gregory Hyman Copyeditor: nSight, Inc. Proofreader: Shannon Turlington Indexer: Potomac Indexing, LLC Interior Designer: David Futato Cover Designer: Karen Montgomery Illustrator: Kate Dullea October 2023: First Edition Revision History for the First Edition 2023-10-12: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781098151614 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Architecting Data and Machine Learning Platforms, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. 
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents

Preface

1. Modernizing Your Data Platform: An Introductory Overview
    The Data Lifecycle
    The Journey to Wisdom
    Water Pipes Analogy
    Collect
    Store
    Process/Transform
    Analyze/Visualize
    Activate
    Limitations of Traditional Approaches
    Antipattern: Breaking Down Silos Through ETL
    Antipattern: Centralization of Control
    Antipattern: Data Marts and Hadoop
    Creating a Unified Analytics Platform
    Cloud Instead of On-Premises
    Drawbacks of Data Marts and Data Lakes
    Convergence of DWHs and Data Lakes
    Hybrid Cloud
    Reasons Why Hybrid Is Necessary
    Challenges of Hybrid Cloud
    Why Hybrid Can Work
    Edge Computing
    Applying AI
    Machine Learning
    Uses of ML
    Why Cloud for AI?
    Cloud Infrastructure
    Democratization
    Real Time
    MLOps
    Core Principles
    Summary

2. Strategic Steps to Innovate with Data
    Step 1: Strategy and Planning
    Strategic Goals
    Identify Stakeholders
    Change Management
    Step 2: Reduce Total Cost of Ownership by Adopting a Cloud Approach
    Why Cloud Costs Less
    How Much Are the Savings?
    When Does Cloud Help?
    Step 3: Break Down Silos
    Unifying Data Access
    Choosing Storage
    Semantic Layer
    Step 4: Make Decisions in Context Faster
    Batch to Stream
    Contextual Information
    Cost Management
    Step 5: Leapfrog with Packaged AI Solutions
    Predictive Analytics
    Understanding and Generating Unstructured Data
    Personalization
    Packaged Solutions
    Step 6: Operationalize AI-Driven Workflows
    Identifying the Right Balance of Automation and Assistance
    Building a Data Culture
    Populating Your Data Science Team
    Step 7: Product Management for Data
    Applying Product Management Principles to Data
    1. Understand and Maintain a Map of Data Flows in the Enterprise
    2. Identify Key Metrics
    3. Agreed Criteria, Committed Roadmap, and Visionary Backlog
    4. Build for the Customers You Have
    5. Don’t Shift the Burden of Change Management
    6. Interview Customers to Discover Their Data Needs
    7. Whiteboard and Prototype Extensively
    8. Build Only What Will Be Used Immediately
    9. Standardize Common Entities and KPIs
    10. Provide Self-Service Capabilities in Your Data Platform
    Summary

3. Designing Your Data Team
    Classifying Data Processing Organizations
    Data Analysis–Driven Organization
    The Vision
    The Personas
    The Technological Framework
    Data Engineering–Driven Organization
    The Vision
    The Personas
    The Technological Framework
    Data Science–Driven Organization
    The Vision
    The Personas
    The Technological Framework
    Summary

4. A Migration Framework
    Modernize Data Workflows
    Holistic View
    Modernize Workflows
    Transform the Workflow Itself
    A Four-Step Migration Framework
    Prepare and Discover
    Assess and Plan
    Execute
    Optimize
    Estimating the Overall Cost of the Solution
    Audit of the Existing Infrastructure
    Request for Information/Proposal and Quotation
    Proof of Concept/Minimum Viable Product
    Setting Up Security and Data Governance
    Framework
    Artifacts
    Governance over the Life of the Data
    Schema, Pipeline, and Data Migration
    Schema Migration
    Pipeline Migration
    Data Migration
    Migration Stages
    Summary

5. Architecting a Data Lake
    Data Lake and the Cloud: A Perfect Marriage
    Challenges with On-Premises Data Lakes
    Benefits of Cloud Data Lakes
    Design and Implementation
    Batch and Stream
    Data Catalog
    Hadoop Landscape
    Cloud Data Lake Reference Architecture
    Integrating the Data Lake: The Real Superpower
    APIs to Extend the Lake
    The Evolution of Data Lake with Apache Iceberg, Apache Hudi, and Delta Lake
    Interactive Analytics with Notebooks
    Democratizing Data Processing and Reporting
    Build Trust in the Data
    Data Ingestion Is Still an IT Matter
    ML in the Data Lake
    Training on Raw Data
    Predicting in the Data Lake
    Summary

6. Innovating with an Enterprise Data Warehouse
    A Modern Data Platform
    Organizational Goals
    Technological Challenges
    Technology Trends and Tools
    Hub-and-Spoke Architecture
    Data Ingest
    Business Intelligence
    Transformations
    Organizational Structure
    DWH to Enable Data Scientists
    Query Interface
    Storage API
    ML Without Moving Your Data
    Summary

7. Converging to a Lakehouse
    The Need for a Unique Architecture
    User Personas
    Antipattern: Disconnected Systems
    Antipattern: Duplicated Data
    Converged Architecture
    Two Forms
    Lakehouse on Cloud Storage
    SQL-First Lakehouse
    The Benefits of Convergence
    Summary

8. Architectures for Streaming
    The Value of Streaming
    Industry Use Cases
    Streaming Use Cases
    Streaming Ingest
    Streaming ETL
    Streaming ELT
    Streaming Insert
    Streaming from Edge Devices (IoT)
    Streaming Sinks
    Real-Time Dashboards
    Live Querying
    Materialize Some Views
    Stream Analytics
    Time-Series Analytics
    Clickstream Analytics
    Anomaly Detection
    Resilient Streaming
    Continuous Intelligence Through ML
    Training Model on Streaming Data
    Streaming ML Inference
    Automated Actions
    Summary

9. Extending a Data Platform Using Hybrid and Edge
    Why Multicloud?
    A Single Cloud Is Simpler and Cost-Effective
    Multicloud Is Inevitable
    Multicloud Could Be Strategic
    Multicloud Architectural Patterns
    Single Pane of Glass
    Write Once, Run Anywhere
    Bursting from On Premises to Cloud
    Pass-Through from On Premises to Cloud
    Data Integration Through Streaming
    Adopting Multicloud
    Framework
    Time Scale
    Define a Target Multicloud Architecture
    Why Edge Computing?
    Bandwidth, Latency, and Patchy Connectivity
    Use Cases
    Benefits
    Challenges
    Edge Computing Architectural Patterns
    Smart Devices
    Smart Gateways
    ML Activation
    Adopting Edge Computing
    The Initial Context
    The Project
    The Final Outcomes and Next Steps
    Summary

10. AI Application Architecture
    Is This an AI/ML Problem?
    Subfields of AI
    Generative AI
    Problems Fit for ML
    Buy, Adapt, or Build?
    Data Considerations
    When to Buy
    What Can You Buy?
    How Adapting Works
    AI Architectures
    Understanding Unstructured Data
    Generating Unstructured Data
    Predicting Outcomes
    Forecasting Values
    Anomaly Detection
    Personalization
    Automation
    Responsible AI
    AI Principles
    ML Fairness
    Explainability
    Summary

11. Architecting an ML Platform
    ML Activities
    Developing ML Models
    Labeling Environment
    Development Environment
    User Environment
    Preparing Data
    Training ML Models
    Deploying ML Models
    Deploying to an Endpoint
    Evaluate Model
    Hybrid and Multicloud
    Training-Serving Skew
    Automation
    Automate Training and Deployment
    Orchestration with Pipelines
    Continuous Evaluation and Training
    Choosing the ML Framework
    Team Skills
    Task Considerations
    User-Centric
    Summary

12. Data Platform Modernization: A Model Case
    New Technology for a New Era
    The Need for Change
    It Is Not Only a Matter of Technology
    The Beginning of the Journey
    The Current Environment
    The Target Environment
    The PoC Use Case
    The RFP Responses Proposed by Cloud Vendors
    The Target Environment
    The Approach on Migration
    The RFP Evaluation Process
    The Scope of the PoC
    The Execution of the PoC
    The Final Decision
    Peroration
    Summary

Index
Preface

What is a data platform? Why do you need it? What does building a data and machine learning (ML) platform involve? Why should you build your data platform on the cloud? This book starts by answering these common questions that arise when dealing with data and ML projects. We then lay out the strategic journey that we recommend you take to build data and ML capabilities in your business, show you how to execute on each step of that strategy, and wrap up all the concepts in a model data modernization case.

Why Do You Need a Cloud Data Platform?

Imagine that the chief technology officer (CTO) of your company wants to build a new mobile-friendly ecommerce website. “We are losing business,” he claims, “because our website is not optimized for mobile phones, especially in Asian languages.” The chief executive officer (CEO) trusts the CTO when he says that the current website’s mobile user experience isn’t great, but she wonders whether customers who access the platform through mobile phones form a profitable segment of the population. She calls the head of operations in Asia and asks, “What are the revenue and profit margin on customers who reach our ecommerce site on mobile phones? How will our overall revenue change over the next year if we increase the number of people making purchases on mobile?”

How would the regional leader in Asia go about answering these questions? Doing so requires the ability to relate customer visits (to determine the origin of HTTP requests), customer purchases (to know what they purchased), and procurement information (to determine the cost of those items). It also requires being able to predict the growth in different segments of the market. Would the regional leader have to reach out to the information technology (IT) department and ask them to pull together the necessary information from all these different sources and write a program to compute these statistics? Does the IT department have the bandwidth to answer these questions and the skills to do predictive analysis?

How much better would it be if the organization had a data platform? In that case, all the data would already have been collected and cleaned up, and it would be available for analysis and synthesis across the organization. A data analyst team could simply run an interactive, ad hoc query. They could also easily create or retrieve forecasts of revenue and traffic patterns by taking advantage of built-in artificial intelligence (AI) capabilities, allowing a data-driven decision to be made on the CTO’s request to invest in a new mobile-friendly website.

One possible way to answer the CEO’s question is to procure and deploy a real user monitoring (RUM) tool. There are lots of specific tools available, one for every one-off decision like this. Having a data platform allows the organization to answer many such one-off questions without having to procure and install a bunch of these specific solutions.

Modern organizations increasingly want to make decisions based on data. Our example focused on a one-time decision. However, in many cases, organizations want to make decisions repeatedly, in an automated manner, for every transaction. For example, the organization might want to determine whether a shopping cart is in danger of being abandoned and immediately show the customer options of low-cost items that can be added to the shopping cart to meet the minimum for free shipping. These items should appeal to the individual shopper and therefore require a solid analytics and ML capability.
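To make the CEO’s question from earlier in this section concrete, here is a minimal sketch of what “relating” visits, purchases, and procurement data might look like once the three sources sit side by side. Everything in it is an assumption made for illustration: the record layouts, the field names (`session_id`, `device`, `sku`, and so on), the sample values, and the function name `revenue_and_margin_by_device` are hypothetical. On a real platform this would more likely be an ad hoc query against governed tables than hand-written code.

```python
from collections import defaultdict

# Hypothetical, hand-made records standing in for three separate source systems.
visits = [  # web logs: which device each session came from
    {"session_id": "s1", "device": "mobile"},
    {"session_id": "s2", "device": "desktop"},
    {"session_id": "s3", "device": "mobile"},
]
purchases = [  # order system: what was bought in each session, and at what price
    {"session_id": "s1", "sku": "A", "quantity": 2, "unit_price": 10.0},
    {"session_id": "s2", "sku": "B", "quantity": 1, "unit_price": 50.0},
    {"session_id": "s3", "sku": "B", "quantity": 1, "unit_price": 50.0},
]
unit_costs = {"A": 6.0, "B": 35.0}  # procurement: what each item cost to acquire


def revenue_and_margin_by_device(visits, purchases, unit_costs):
    """Join the three sources and return {device: (revenue, profit_margin)}."""
    # Map each session to the device it originated from.
    device_of = {v["session_id"]: v["device"] for v in visits}
    revenue = defaultdict(float)
    cost = defaultdict(float)
    for p in purchases:
        # Purchases whose session is missing from the web logs are kept
        # under an "unknown" device rather than silently dropped.
        device = device_of.get(p["session_id"], "unknown")
        revenue[device] += p["quantity"] * p["unit_price"]
        cost[device] += p["quantity"] * unit_costs[p["sku"]]
    return {
        d: (revenue[d], (revenue[d] - cost[d]) / revenue[d] if revenue[d] else 0.0)
        for d in revenue
    }


summary = revenue_and_margin_by_device(visits, purchases, unit_costs)
for device, (rev, margin) in sorted(summary.items()):
    print(f"{device}: revenue={rev:.2f}, margin={margin:.1%}")
```

The arithmetic is trivial. The hard part in practice is that the three inputs usually live in different systems owned by different teams, and that gap is what a data platform is meant to close. The forecasting half of the CEO’s question is deliberately left out of this sketch.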
To make decisions based on data, organizations need a data and ML platform that simplifies:
• Getting access to data
• Running an interactive, ad hoc query
• Creating a report
• Making automated decisions based on data
• Personalizing the business’s services

As you will see in this book, cloud-based data platforms reduce the technical barrier for all these capabilities: it is possible to access data from anywhere, carry out fast, large-scale queries even on edge devices, and take advantage of services that provide many analytics and AI capabilities. However, putting in place all the building blocks needed to achieve that can be a complex journey. The goal of this book is to help readers better understand the main concepts, architectural patterns, and tools available to build modern cloud data platforms so that they can gain better visibility and control of their corporate data to make more meaningful and automated business decisions.

We, the authors of this book, are engineers who have years of experience helping enterprises in a wide variety of industries and geographies build data and ML platforms. These enterprises want to derive insights from their data but often face many challenges with getting all the data they need in a form where it can be quickly analyzed. Therefore, they find themselves having to build a modern data and ML platform.

Who Is This Book For?

This book is for architects who wish to support data-driven decision making in their business by creating a data and ML platform using public cloud technologies. Data engineers, data analysts, data scientists, and ML engineers will find the book useful to gain a conceptual design view of the systems that they might be implementing on top of.

Digitally native companies have been doing this already for several years. As early as 2016, Twitter explained that their data platform team maintains “systems to support and manage the production and consumption of data for a variety of business purposes, including publicly reported metrics, recommendations, A/B testing, ads targeting, etc.” In 2016, this involved maintaining one of the largest Hadoop clusters in the world. By 2019, this was changing to include supporting the use of a cloud-native data warehousing solution.

Etsy, to take another example, says that their ML platform “supports ML experiments by developing and maintaining the technical infrastructure that Etsy’s ML practitioners rely on to prototype, train, and deploy ML models at scale.” Both Twitter and Etsy have built modern data and ML platforms.
The platforms at the two companies differ because each has to support different types of data, personnel, and business use cases, but the underlying approach is similar. In this book, we will show you how to architect a modern data and ML platform that enables engineers in your business to:
• Collect data from a variety of sources such as operational databases, customer clickstream, Internet of Things (IoT) devices, software as a service (SaaS) applications, etc.
• Break down silos between different parts of the organization
• Process data while ingesting it or after loading it while guaranteeing proper processes for data quality and governance
• Analyze the data routinely or ad hoc
• Enrich the data with prebuilt AI models
• Build ML models to carry out predictive analytics
• Act on the data routinely or in response to triggering events or thresholds
• Disseminate insights and embed analytics

This book is a good introduction to architectural considerations if you work with data and ML models in enterprises, because you will be required to do your work on the platform built by your data or ML platform team. Thus, if you are a data engineer, data analyst, data scientist, or ML engineer, you will find this book helpful for gaining a high-level systems design view.

Even though our primary experience is with Google Cloud, we endeavor to maintain a cloud-agnostic vision of the services that underlie the architectures by bringing in examples from, but not limited to, all three major cloud providers (i.e., Amazon Web Services [AWS], Microsoft Azure, and Google Cloud).

Organization of This Book

The book is organized into 12 chapters that map to the strategic steps to innovate with data, which are explained in detail in Chapter 2. The book concludes with a model use case scenario to showcase how an organization might approach its modernization journey. Figure P-1 shows the flow of the book.

Chapter 1 discusses why organizations should build a data platform. It also covers approaches, technology trends, and core principles in data platforms.

In Chapters 2 and 3, we dive more into how to plan the journey, identifying the strategic steps to innovate and how to effect change. Here we will discuss concepts like reduction of the total cost of ownership (TCO), the removal of data silos, and how to leverage AI to unlock innovation. We also analyze the building blocks of a data lifecycle, discuss how to design your data team, and recommend an adoption plan.
In Chapter 4, we consolidate these into a migration framework.

Figure P-1. Book flow diagram

In Chapters 5, 6, and 7, we discuss three of the most common architectures for data platforms: data lakes (Chapter 5), data warehouses (Chapter 6), and lakehouses (Chapter 7). We demonstrate that lakehouses can be built in one of two ways, evolving to this architecture starting from either a data lake or a data warehouse, and discuss how to choose between the two paths.

In Chapters 8 and 9, we discuss two common extensions of the basic lakehouse pattern. We show how to make decisions in context faster and in real time via the introduction of streaming patterns and how to support hybrid architectures by expanding to the edge.

Chapters 10 and 11 cover how to build and use AI/ML in enterprise environments and how to architect platforms for designing, building, serving, and orchestrating innovative models. Those chapters include both predictive ML models and generative ones.

Finally, in Chapter 12, we look at a model data modernization journey with a focus on how to migrate from a legacy architecture to the new one, explaining the process by which an organization can select one specific solution.

If you are a cloud architect tasked with building a data and ML platform for your business, read all the chapters of the book in order.

If you are a data analyst whose task is to create reports, dashboards, and embedded analytics, read Chapters 1, 4 through 7, and 10.

If you are a data engineer who builds data pipelines, read Chapters 5 through 9. Skim the remaining chapters and use them as a reference when you encounter the need for a particular type of application.

If you are a data scientist charged with building ML models, read Chapters 7, 8, 10, and 11.

If you are an ML engineer interested in operationalizing ML models, skim through Chapters 1 through 9 and study Chapters 10 and 11 carefully.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.