Joe Reis & Matt Housley
Fundamentals of Data Engineering
Plan and Build Robust Data Systems
Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you’ll learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available through the framework of the data engineering lifecycle.

Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You’ll understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, and governance that are critical in any data environment regardless of the underlying technology.

US $69.99 CAN $87.99
ISBN: 978-1-098-10830-4
This book will help you:

• Get a concise overview of the entire data engineering landscape
• Assess data engineering problems using an end-to-end framework of best practices
• Cut through marketing hype when choosing data technologies, architecture, and processes
• Use the data engineering lifecycle to design and build a robust architecture
• Incorporate data governance and security across the data engineering lifecycle

Joe Reis is a “recovering data scientist,” and a data engineer and architect.

Matt Housley is a data engineering consultant and cloud specialist.
Praise for Fundamentals of Data Engineering

The world of data has been evolving for a while now. First there were designers. Then database administrators. Then CIOs. Then data architects. This book signals the next step in the evolution and maturity of the industry. It is a must read for anyone who takes their profession and career honestly. —Bill Inmon, creator of the data warehouse

Fundamentals of Data Engineering is a great introduction to the business of moving, processing, and handling data. It explains the taxonomy of data concepts, without focusing too heavily on individual tools or vendors, so the techniques and ideas should outlast any individual trend or product. I’d highly recommend it for anyone wanting to get up to speed in data engineering or analytics, or for existing practitioners who want to fill in any gaps in their understanding. —Jordan Tigani, founder and CEO, MotherDuck, and founding engineer and cocreator of BigQuery

If you want to lead in your industry, you must build the capabilities required to provide exceptional customer and employee experiences. This is not just a technology problem. It’s a people opportunity. And it will transform your business. Data engineers are at the center of this transformation. But today the discipline is misunderstood. This book will demystify data engineering and become your ultimate guide to succeeding with data. —Bruno Aziza, Head of Data Analytics, Google Cloud
What a book! Joe and Matt are giving you the answer to the question, “What must I understand to do data engineering?” Whether you are getting started as a data engineer or strengthening your skills, you are not looking for yet another technology handbook. You are seeking to learn more about the underlying principles and the core concepts of the role, its responsibilities, its technical and organizational environment, its mission—that’s exactly what Joe and Matt offer in this book. —Andy Petrella, founder of Kensu

This is the missing book in data engineering. A wonderfully thorough account of what it takes to be a good practicing data engineer, including thoughtful real-life considerations. I’d recommend all future education of data professionals include Joe and Matt’s work. —Sarah Krasnik, data engineering leader

It is incredible to realize the breadth of knowledge a data engineer must have. But don’t let it scare you. This book provides a great foundational overview of various architectures, approaches, methodologies, and patterns that anyone working with data needs to be aware of. But what is even more valuable is that this book is full of golden nuggets of wisdom, best-practice advice, and things to consider when making decisions related to data engineering. It is a must read for both experienced and new data engineers. —Veronika Durgin, data and analytics leader

I was honored and humbled to be asked by Joe and Matt to help technical review their masterpiece of data knowledge, Fundamentals of Data Engineering. Their ability to break down the key components that are critical to anyone wanting to move into a data engineering role is second to none. Their writing style makes the information easy to absorb, and they leave no stone unturned. It was an absolute pleasure to work with some of the best thought leaders in the data space. I can’t wait to see what they do next. —Chris Tabb, cofounder of LEIT DATA

Fundamentals of Data Engineering is the first book to take an in-depth and holistic look into the requirements of today’s data engineer. As you’ll see, the book dives into the critical areas of data engineering including skill sets, tools, and architectures used to manage, move, and curate data in today’s complex technical environments. More importantly, Joe and Matt convey their master of understanding data engineering and take the time to further dive into the more nuanced areas of data engineering and make it relatable to the reader. Whether you’re a manager, experienced data engineer, or someone wanting to get into the space, this book provides practical insight into today’s data engineering landscape. —Jon King, Principal Data Architect
Two things will remain relevant to data engineers in 2042: SQL and this book. Joe and Matt cut through the hype around tools to extract the slowly changing dimensions of our discipline. Whether you’re starting your journey with data or adding stripes to your black belt, Fundamentals of Data Engineering lays the foundation for mastery. —Kevin Hu, CEO of Metaplane

In a field that is rapidly changing, with new technology solutions popping up constantly, Joe and Matt provide clear, timeless guidance, focusing on the core concepts and foundational knowledge required to excel as a data engineer. This book is jam packed with information that will empower you to ask the right questions, understand trade-offs, and make the best decisions when designing your data architecture and implementing solutions across the data engineering lifecycle. Whether you’re just considering becoming a data engineer or have been in the field for years, I guarantee you’ll learn something from this book! —Julie Price, Senior Product Manager, SingleStore

Fundamentals of Data Engineering isn’t just an instruction manual—it teaches you how to think like a data engineer. Part history lesson, part theory, and part acquired knowledge from Joe and Matt’s decades of experience, the book has definitely earned its place on every data professional’s bookshelf. —Scott Breitenother, founder and CEO, Brooklyn Data Co.

There is no other book that so comprehensively covers what it means to be a data engineer. Joe and Matt dive deep into responsibilities, impacts, architectural choices, and so much more. Despite talking about such complex topics, the book is easy to read and digest. A very powerful combination. —Danny Leybzon, MLOps Architect

I wish this book was around years ago when I started working with data engineers. The wide coverage of the field makes the involved roles clear and builds empathy with the many roles it takes to build a competent data discipline. —Tod Hansmann, VP Engineering

A must read and instant classic for anyone in the data engineering field. This book fills a gap in the current knowledge base, discussing fundamental topics not found in other books. You will gain understanding of foundational concepts and insight into historical context about data engineering that will set up anyone to succeed. —Matthew Sharp, Data and ML Engineer
Data engineering is the foundation of every analysis, machine learning model, and data product, so it is critical that it is done well. There are countless manuals, books, and references for each of the technologies used by data engineers, but very few (if any) resources that provide a holistic view of what it means to work as a data engineer. This book fills a critical need in the industry and does it well, laying the foundation for new and working data engineers to be successful and effective in their roles. This is the book that I’ll be recommending to anyone who wants to work with data at any level. —Tobias Macey, host of The Data Engineering Podcast
Fundamentals of Data Engineering
by Joe Reis and Matt Housley

Copyright © 2022 Joseph Reis and Matthew Housley. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Jessica Haberman
Development Editor: Michele Cronin
Production Editor: Gregory Hyman
Copyeditor: Sharon Wilkey
Proofreader: Amnet Systems, LLC
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

July 2022: First Edition

Revision History for the First Edition
2022-06-22: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098108304 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Fundamentals of Data Engineering, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-10830-4 [LSI]
Table of Contents

Preface

Part I. Foundation and Building Blocks

1. Data Engineering Described
  What Is Data Engineering?
  Data Engineering Defined
  The Data Engineering Lifecycle
  Evolution of the Data Engineer
  Data Engineering and Data Science
  Data Engineering Skills and Activities
  Data Maturity and the Data Engineer
  The Background and Skills of a Data Engineer
  Business Responsibilities
  Technical Responsibilities
  The Continuum of Data Engineering Roles, from A to B
  Data Engineers Inside an Organization
  Internal-Facing Versus External-Facing Data Engineers
  Data Engineers and Other Technical Roles
  Data Engineers and Business Leadership
  Conclusion
  Additional Resources

2. The Data Engineering Lifecycle
  What Is the Data Engineering Lifecycle?
  The Data Lifecycle Versus the Data Engineering Lifecycle
  Generation: Source Systems
  Storage
  Ingestion
  Transformation
  Serving Data
  Major Undercurrents Across the Data Engineering Lifecycle
  Security
  Data Management
  DataOps
  Data Architecture
  Orchestration
  Software Engineering
  Conclusion
  Additional Resources

3. Designing Good Data Architecture
  What Is Data Architecture?
  Enterprise Architecture Defined
  Data Architecture Defined
  “Good” Data Architecture
  Principles of Good Data Architecture
  Principle 1: Choose Common Components Wisely
  Principle 2: Plan for Failure
  Principle 3: Architect for Scalability
  Principle 4: Architecture Is Leadership
  Principle 5: Always Be Architecting
  Principle 6: Build Loosely Coupled Systems
  Principle 7: Make Reversible Decisions
  Principle 8: Prioritize Security
  Principle 9: Embrace FinOps
  Major Architecture Concepts
  Domains and Services
  Distributed Systems, Scalability, and Designing for Failure
  Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
  User Access: Single Versus Multitenant
  Event-Driven Architecture
  Brownfield Versus Greenfield Projects
  Examples and Types of Data Architecture
  Data Warehouse
  Data Lake
  Convergence, Next-Generation Data Lakes, and the Data Platform
  Modern Data Stack
  Lambda Architecture
  Kappa Architecture
  The Dataflow Model and Unified Batch and Streaming
  Architecture for IoT
  Data Mesh
  Other Data Architecture Examples
  Who’s Involved with Designing a Data Architecture?
  Conclusion
  Additional Resources

4. Choosing Technologies Across the Data Engineering Lifecycle
  Team Size and Capabilities
  Speed to Market
  Interoperability
  Cost Optimization and Business Value
  Total Cost of Ownership
  Total Opportunity Cost of Ownership
  FinOps
  Today Versus the Future: Immutable Versus Transitory Technologies
  Our Advice
  Location
  On Premises
  Cloud
  Hybrid Cloud
  Multicloud
  Decentralized: Blockchain and the Edge
  Our Advice
  Cloud Repatriation Arguments
  Build Versus Buy
  Open Source Software
  Proprietary Walled Gardens
  Our Advice
  Monolith Versus Modular
  Monolith
  Modularity
  The Distributed Monolith Pattern
  Our Advice
  Serverless Versus Servers
  Serverless
  Containers
  How to Evaluate Server Versus Serverless
  Our Advice
  Optimization, Performance, and the Benchmark Wars
  Big Data...for the 1990s
  Nonsensical Cost Comparisons
  Asymmetric Optimization
  Caveat Emptor
  Undercurrents and Their Impacts on Choosing Technologies
  Data Management
  DataOps
  Data Architecture
  Orchestration Example: Airflow
  Software Engineering
  Conclusion
  Additional Resources

Part II. The Data Engineering Lifecycle in Depth

5. Data Generation in Source Systems
  Sources of Data: How Is Data Created?
  Source Systems: Main Ideas
  Files and Unstructured Data
  APIs
  Application Databases (OLTP Systems)
  Online Analytical Processing System
  Change Data Capture
  Logs
  Database Logs
  CRUD
  Insert-Only
  Messages and Streams
  Types of Time
  Source System Practical Details
  Databases
  APIs
  Data Sharing
  Third-Party Data Sources
  Message Queues and Event-Streaming Platforms
  Whom You’ll Work With
  Undercurrents and Their Impact on Source Systems
  Security
  Data Management
  DataOps
  Data Architecture
  Orchestration
  Software Engineering
  Conclusion
  Additional Resources

6. Storage
  Raw Ingredients of Data Storage
  Magnetic Disk Drive
  Solid-State Drive
  Random Access Memory
  Networking and CPU
  Serialization
  Compression
  Caching
  Data Storage Systems
  Single Machine Versus Distributed Storage
  Eventual Versus Strong Consistency
  File Storage
  Block Storage
  Object Storage
  Cache and Memory-Based Storage Systems
  The Hadoop Distributed File System
  Streaming Storage
  Indexes, Partitioning, and Clustering
  Data Engineering Storage Abstractions
  The Data Warehouse
  The Data Lake
  The Data Lakehouse
  Data Platforms
  Stream-to-Batch Storage Architecture
  Big Ideas and Trends in Storage
  Data Catalog
  Data Sharing
  Schema
  Separation of Compute from Storage
  Data Storage Lifecycle and Data Retention
  Single-Tenant Versus Multitenant Storage
  Whom You’ll Work With
  Undercurrents
  Security
  Data Management
  DataOps
  Data Architecture
  Orchestration
  Software Engineering
  Conclusion
  Additional Resources

7. Ingestion
  What Is Data Ingestion?
  Key Engineering Considerations for the Ingestion Phase
  Bounded Versus Unbounded Data
  Frequency
  Synchronous Versus Asynchronous Ingestion
  Serialization and Deserialization
  Throughput and Scalability
  Reliability and Durability
  Payload
  Push Versus Pull Versus Poll Patterns
  Batch Ingestion Considerations
  Snapshot or Differential Extraction
  File-Based Export and Ingestion
  ETL Versus ELT
  Inserts, Updates, and Batch Size
  Data Migration
  Message and Stream Ingestion Considerations
  Schema Evolution
  Late-Arriving Data
  Ordering and Multiple Delivery
  Replay
  Time to Live
  Message Size
  Error Handling and Dead-Letter Queues
  Consumer Pull and Push
  Location
  Ways to Ingest Data
  Direct Database Connection
  Change Data Capture
  APIs
  Message Queues and Event-Streaming Platforms
  Managed Data Connectors
  Moving Data with Object Storage
  EDI
  Databases and File Export
  Practical Issues with Common File Formats
  Shell
  SSH
  SFTP and SCP
  Webhooks
  Web Interface
  Web Scraping
  Transfer Appliances for Data Migration
  Data Sharing
  Whom You’ll Work With
  Upstream Stakeholders
  Downstream Stakeholders
  Undercurrents
  Security
  Data Management
  DataOps
  Orchestration
  Software Engineering
  Conclusion
  Additional Resources

8. Queries, Modeling, and Transformation
  Queries
  What Is a Query?
  The Life of a Query
  The Query Optimizer
  Improving Query Performance
  Queries on Streaming Data
  Data Modeling
  What Is a Data Model?
  Conceptual, Logical, and Physical Data Models
  Normalization
  Techniques for Modeling Batch Analytical Data
  Modeling Streaming Data
  Transformations
  Batch Transformations
  Materialized Views, Federation, and Query Virtualization
  Streaming Transformations and Processing
  Whom You’ll Work With
  Upstream Stakeholders
  Downstream Stakeholders
  Undercurrents
  Security
  Data Management
  DataOps
  Data Architecture
  Orchestration
  Software Engineering
  Conclusion
  Additional Resources

9. Serving Data for Analytics, Machine Learning, and Reverse ETL
  General Considerations for Serving Data
  Trust
  What’s the Use Case, and Who’s the User?
  Data Products
  Self-Service or Not?
  Data Definitions and Logic
  Data Mesh
  Analytics
  Business Analytics
  Operational Analytics
  Embedded Analytics
  Machine Learning
  What a Data Engineer Should Know About ML
  Ways to Serve Data for Analytics and ML
  File Exchange
  Databases
  Streaming Systems
  Query Federation
  Data Sharing
  Semantic and Metrics Layers
  Serving Data in Notebooks
  Reverse ETL
  Whom You’ll Work With
  Undercurrents
  Security
  Data Management
  DataOps
  Data Architecture
  Orchestration
  Software Engineering
  Conclusion
  Additional Resources
Part III. Security, Privacy, and the Future of Data Engineering

10. Security and Privacy
  People
  The Power of Negative Thinking
  Always Be Paranoid
  Processes
  Security Theater Versus Security Habit
  Active Security
  The Principle of Least Privilege
  Shared Responsibility in the Cloud
  Always Back Up Your Data
  An Example Security Policy
  Technology
  Patch and Update Systems
  Encryption
  Logging, Monitoring, and Alerting
  Network Access
  Security for Low-Level Data Engineering
  Conclusion
  Additional Resources

11. The Future of Data Engineering
  The Data Engineering Lifecycle Isn’t Going Away
  The Decline of Complexity and the Rise of Easy-to-Use Data Tools
  The Cloud-Scale Data OS and Improved Interoperability
  “Enterprisey” Data Engineering
  Titles and Responsibilities Will Morph...
  Moving Beyond the Modern Data Stack, Toward the Live Data Stack
  The Live Data Stack
  Streaming Pipelines and Real-Time Analytical Databases
  The Fusion of Data with Applications
  The Tight Feedback Between Applications and ML
  Dark Matter Data and the Rise of...Spreadsheets?!
  Conclusion

A. Serialization and Compression Technical Details

B. Cloud Networking

Index
Preface

How did this book come about? The origin is deeply rooted in our journey from data science into data engineering. We often jokingly refer to ourselves as recovering data scientists. We both had the experience of being assigned to data science projects, then struggling to execute these projects due to a lack of proper foundations. Our journey into data engineering began when we undertook data engineering tasks to build foundations and infrastructure.

With the rise of data science, companies splashed out lavishly on data science talent, hoping to reap rich rewards. Very often, data scientists struggled with basic problems that their background and training did not address—data collection, data cleansing, data access, data transformation, and data infrastructure. These are problems that data engineering aims to solve.

What This Book Isn’t

Before we cover what this book is about and what you’ll get out of it, let’s quickly cover what this book isn’t. This book isn’t about data engineering using a particular tool, technology, or platform. While many excellent books approach data engineering technologies from this perspective, these books have a short shelf life. Instead, we focus on the fundamental concepts behind data engineering.

What This Book Is About

This book aims to fill a gap in current data engineering content and materials. While there’s no shortage of technical resources that address specific data engineering tools and technologies, people struggle to understand how to assemble these components into a coherent whole that applies in the real world. This book connects the dots of the end-to-end data lifecycle. It shows you how to stitch together various technologies to serve the needs of downstream data consumers such as analysts, data scientists, and machine learning engineers. This book works as a complement to O’Reilly books that cover the details of particular technologies, platforms, and programming languages.

The big idea of this book is the data engineering lifecycle: data generation, storage, ingestion, transformation, and serving. Since the dawn of data, we’ve seen the rise and fall of innumerable specific technologies and vendor products, but the data engineering lifecycle stages have remained essentially unchanged. With this framework, the reader will come away with a sound understanding for applying technologies to real-world business problems.

Our goal here is to map out principles that reach across two axes. First, we wish to distill data engineering into principles that can encompass any relevant technology. Second, we wish to present principles that will stand the test of time. We hope that these ideas reflect lessons learned across the data technology upheaval of the last twenty years and that our mental framework will remain useful for a decade or more into the future.

One thing to note: we unapologetically take a cloud-first approach. We view the cloud as a fundamentally transformative development that will endure for decades; most on-premises data systems and workloads will eventually move to cloud hosting. We assume that infrastructure and systems are ephemeral and scalable, and that data engineers will lean toward deploying managed services in the cloud. That said, most concepts in this book will translate to non-cloud environments.

Who Should Read This Book

Our primary intended audience for this book consists of technical practitioners, mid- to senior-level software engineers, data scientists, or analysts interested in moving into data engineering; or data engineers working in the guts of specific technologies, but wanting to develop a more comprehensive perspective.
Our secondary target audience consists of data stakeholders who work adjacent to technical practitioners—e.g., a data team lead with a technical background overseeing a team of data engineers, or a director of data warehousing wanting to migrate from on-premises technology to a cloud-based solution.

Ideally, you’re curious and want to learn—why else would you be reading this book? You stay current with data technologies and trends by reading books and articles on data warehousing/data lakes, batch and streaming systems, orchestration, modeling, management, analysis, developments in cloud technologies, etc. This book will help you weave what you’ve read into a complete picture of data engineering across technologies and paradigms.