Data Quality Fundamentals: A Practitioner's Guide to Building Trustworthy Data Pipelines (Barr Moses, Lior Gavish, Molly Vorwerck)
Author: Barr Moses, Lior Gavish, Molly Vorwerck

Do your product dashboards look funky? Are your quarterly reports stale? Is the data set you're using broken or just plain wrong? These problems affect almost every team, yet they're usually addressed on an ad hoc basis and in a reactive manner. If you answered yes to these questions, this book is for you.

Many data engineering teams today face the "good pipelines, bad data" problem. It doesn't matter how advanced your data infrastructure is if the data you're piping is bad. In this book, Barr Moses, Lior Gavish, and Molly Vorwerck, from the data observability company Monte Carlo, explain how to tackle data quality and trust at scale by leveraging best practices and technologies used by some of the world's most innovative companies.

• Build more trustworthy and reliable data pipelines
• Write scripts to make data checks and identify broken pipelines with data observability
• Learn how to set and maintain data SLAs, SLIs, and SLOs
• Develop and lead data quality initiatives at your company
• Learn how to treat data services and systems with the diligence of production software
• Automate data lineage graphs across your data ecosystem
• Build anomaly detectors for your critical data assets

ISBN: 1098112040
Publisher: O'Reilly Media
Publish Year: 2022
Language: English
Pages: 311
File Format: PDF
File Size: 9.5 MB
Text Preview (First 20 pages)

Barr Moses, Lior Gavish & Molly Vorwerck

Data Quality Fundamentals
A Practitioner's Guide to Building Trustworthy Data Pipelines
“A must-read for anyone who cares about data quality.”
—Debashis Saha, Data Leader, AppZen, Intuit, and eBay

Data Quality Fundamentals
US $59.99 | CAN $74.99
ISBN: 978-1-098-11204-2

Barr Moses is CEO and cofounder of Monte Carlo, creator of the data observability category. During her decade-long career in data, she served as commander of a data intelligence unit in the Israeli Air Force, a consultant at Bain & Company, and vice president of operations at Gainsight. She led O’Reilly’s first course on data quality.

Lior Gavish, CTO and cofounder of Monte Carlo, previously cofounded cybersecurity startup Sookasa, acquired by Barracuda in 2016. At Barracuda, he was senior vice president of engineering, launching award-winning ML products for fraud prevention. Lior holds an MBA from Stanford and an MSc in computer science from Tel Aviv University.

Molly Vorwerck, head of content at Monte Carlo, also served as editor-in-chief of the Uber Engineering blog and lead program manager for Uber’s technical brand team. She also led internal communications for Uber’s chief technology officer and strategy for Uber AI Labs’ research review program.
Praise for Data Quality Fundamentals

Data engineers, ETL programmers, and entire data pipeline teams need a reference and testing guide like this! As I did, they will learn the building blocks, processes, and tooling that help ensure the quality of data-intensive applications. This book adds fresh perspectives and practical test scenarios that expand the wisdom to test modern data pipelines.
—Wayne Yaddow, Data and ETL Quality Analyst

Your data investments, infrastructure, and insights don’t matter at all if you can’t trust your data. Barr, Lior, and Molly have done a tremendous job in breaking down the fundamentals of what trusting your data means and have created a very practical framework to implement data quality in enterprises. A must-read for anyone who cares about data quality.
—Debashis Saha, Data Leader, AppZen, Intuit, and eBay

As data architecture becomes increasingly distributed and the accountability for data increasingly decentralized, the focus on data quality will continue to grow. Data Quality Fundamentals provides an important resource for engineering teams that are serious about improving the accuracy, reliability, and trust of their data through some of today’s most significant technologies and processes.
—Mammad Zadeh, Data Leader and Former VP of Engineering at Intuit
Barr Moses, Lior Gavish, and Molly Vorwerck

Data Quality Fundamentals
A Practitioner’s Guide to Building Trustworthy Data Pipelines

Beijing · Boston · Farnham · Sebastopol · Tokyo
978-1-098-11204-2 [LSI]

Data Quality Fundamentals
by Barr Moses, Lior Gavish, and Molly Vorwerck

Copyright © 2022 Monte Carlo Data, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Aaron Black
Development Editor: Jill Leonard
Production Editor: Gregory Hyman
Copyeditor: Charles Roumeliotis
Proofreader: Piper Editorial Consulting, LLC
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

September 2022: First Edition

Revision History for the First Edition
2022-09-01: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098112042 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Quality Fundamentals, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. This work is part of a collaboration between O’Reilly and Monte Carlo Data. See our statement of editorial independence.
To Rae and Robert, who keep things in perspective, no matter where we look. To the Monte Carlo jellyfish and the data reliability pioneers—you know who you are. So grateful to be on this journey with you.
Table of Contents

Preface  xi

1. Why Data Quality Deserves Attention—Now  1
    What Is Data Quality?  4
    Framing the Current Moment  4
    Understanding the “Rise of Data Downtime”  5
    Other Industry Trends Contributing to the Current Moment  8
    Summary  10

2. Assembling the Building Blocks of a Reliable Data System  13
    Understanding the Difference Between Operational and Analytical Data  14
    What Makes Them Different?  15
    Data Warehouses Versus Data Lakes  16
    Data Warehouses: Table Types at the Schema Level  17
    Data Lakes: Manipulations at the File Level  18
    What About the Data Lakehouse?  21
    Syncing Data Between Warehouses and Lakes  21
    Collecting Data Quality Metrics  22
    What Are Data Quality Metrics?  22
    How to Pull Data Quality Metrics  23
    Using Query Logs to Understand Data Quality in the Warehouse  30
    Using Query Logs to Understand Data Quality in the Lake  31
    Designing a Data Catalog  32
    Building a Data Catalog  33
    Summary  38

3. Collecting, Cleaning, Transforming, and Testing Data  39
    Collecting Data  39
    Application Log Data  40
    API Responses  41
    Sensor Data  42
    Cleaning Data  43
    Batch Versus Stream Processing  45
    Data Quality for Stream Processing  47
    Normalizing Data  50
    Handling Heterogeneous Data Sources  50
    Schema Checking and Type Coercion  52
    Syntactic Versus Semantic Ambiguity in Data  52
    Managing Operational Data Transformations Across AWS Kinesis and Apache Kafka  53
    Running Analytical Data Transformations  54
    Ensuring Data Quality During ETL  54
    Ensuring Data Quality During Transformation  55
    Alerting and Testing  55
    dbt Unit Testing  56
    Great Expectations Unit Testing  59
    Deequ Unit Testing  60
    Managing Data Quality with Apache Airflow  63
    Scheduler SLAs  63
    Installing Circuit Breakers with Apache Airflow  66
    SQL Check Operators  67
    Summary  67

4. Monitoring and Anomaly Detection for Your Data Pipelines  69
    Knowing Your Known Unknowns and Unknown Unknowns  70
    Building an Anomaly Detection Algorithm  72
    Monitoring for Freshness  73
    Understanding Distribution  79
    Building Monitors for Schema and Lineage  87
    Anomaly Detection for Schema Changes and Lineage  88
    Visualizing Lineage  92
    Investigating a Data Anomaly  94
    Scaling Anomaly Detection with Python and Machine Learning  99
    Improving Data Monitoring Alerting with Machine Learning  104
    Accounting for False Positives and False Negatives  105
    Improving Precision and Recall  106
    Detecting Freshness Incidents with Data Monitoring  110
    F-Scores  111
    Does Model Accuracy Matter?  112
    Beyond the Surface: Other Useful Anomaly Detection Approaches  116
    Designing Data Quality Monitors for Warehouses Versus Lakes  117
    Summary  118

5. Architecting for Data Reliability  119
    Measuring and Maintaining High Data Reliability at Ingestion  119
    Measuring and Maintaining Data Quality in the Pipeline  123
    Understanding Data Quality Downstream  125
    Building Your Data Platform  127
    Data Ingestion  128
    Data Storage and Processing  129
    Data Transformation and Modeling  129
    Business Intelligence and Analytics  130
    Data Discovery and Governance  131
    Developing Trust in Your Data  132
    Data Observability  132
    Measuring the ROI on Data Quality  133
    How to Set SLAs, SLOs, and SLIs for Your Data  135
    Case Study: Blinkist  138
    Summary  140

6. Fixing Data Quality Issues at Scale  141
    Fixing Quality Issues in Software Development  142
    Data Incident Management  143
    Incident Detection  145
    Response  147
    Root Cause Analysis  148
    Resolution  157
    Blameless Postmortem  157
    Incident Response and Mitigation  159
    Establishing a Routine of Incident Management  160
    Why Data Incident Commanders Matter  165
    Case Study: Data Incident Management at PagerDuty  166
    The DataOps Landscape at PagerDuty  166
    Data Challenges at PagerDuty  166
    Using DevOps Best Practices to Scale Data Incident Management  167
    Summary  168

7. Building End-to-End Lineage  169
    Building End-to-End Field-Level Lineage for Modern Data Systems  170
    Basic Lineage Requirements  172
    Data Lineage Design  173
    Parsing the Data  180
    Building the User Interface  181
    Case Study: Architecting for Data Reliability at Fox  183
    Exercise “Controlled Freedom” When Dealing with Stakeholders  184
    Invest in a Decentralized Data Team  185
    Avoid Shiny New Toys in Favor of Problem-Solving Tech  186
    To Make Analytics Self-Serve, Invest in Data Trust  187
    Summary  188

8. Democratizing Data Quality  189
    Treating Your “Data” Like a Product  190
    Perspectives on Treating Data Like a Product  191
    Convoy Case Study: Data as a Service or Output  192
    Uber Case Study: The Rise of the Data Product Manager  193
    Applying the Data-as-a-Product Approach  194
    Building Trust in Your Data Platform  199
    Align Your Product’s Goals with the Goals of the Business  199
    Gain Feedback and Buy-in from the Right Stakeholders  200
    Prioritize Long-Term Growth and Sustainability Versus Short-Term Gains  201
    Sign Off on Baseline Metrics for Your Data and How You Measure Them  202
    Know When to Build Versus Buy  202
    Assigning Ownership for Data Quality  204
    Chief Data Officer  205
    Business Intelligence Analyst  205
    Analytics Engineer  205
    Data Scientist  206
    Data Governance Lead  206
    Data Engineer  206
    Data Product Manager  207
    Who Is Responsible for Data Reliability?  207
    Creating Accountability for Data Quality  208
    Balancing Data Accessibility with Trust  209
    Certifying Your Data  211
    Seven Steps to Implementing a Data Certification Program  211
    Case Study: Toast’s Journey to Finding the Right Structure for Their Data Team  216
    In the Beginning: When a Small Team Struggles to Meet Data Demands  217
    Supporting Hypergrowth as a Decentralized Data Operation  217
    Regrouping, Recentralizing, and Refocusing on Data Trust  218
    Considerations When Scaling Your Data Team  219
    Increasing Data Literacy  222
    Prioritizing Data Governance and Compliance  224
    Prioritizing a Data Catalog  224
    Beyond Catalogs: Enforcing Data Governance  227
    Building a Data Quality Strategy  228
    Make Leadership Accountable for Data Quality  228
    Set Data Quality KPIs  229
    Spearhead a Data Governance Program  229
    Automate Your Lineage and Data Governance Tooling  229
    Create a Communications Plan  230
    Summary  230

9. Data Quality in the Real World: Conversations and Case Studies  233
    Building a Data Mesh for Greater Data Quality  234
    Domain-Oriented Data Owners and Pipelines  235
    Self-Serve Functionality  236
    Interoperability and Standardization of Communications  236
    Why Implement a Data Mesh?  236
    To Mesh or Not to Mesh? That Is the Question  237
    Calculating Your Data Mesh Score  238
    A Conversation with Zhamak Dehghani: The Role of Data Quality Across the Data Mesh  239
    Can You Build a Data Mesh from a Single Solution?  239
    Is Data Mesh Another Word for Data Virtualization?  239
    Does Each Data Product Team Manage Their Own Separate Data Stores?  240
    Is a Self-Serve Data Platform the Same Thing as a Decentralized Data Mesh?  240
    Is the Data Mesh Right for All Data Teams?  241
    Does One Person on Your Team “Own” the Data Mesh?  241
    Does the Data Mesh Cause Friction Between Data Engineers and Data Analysts?  242
    Case Study: Kolibri Games’ Data Stack Journey  243
    First Data Needs  243
    Pursuing Performance Marketing  245
    2018: Professionalize and Centralize  246
    Getting Data-Oriented  248
    Getting Data-Driven  251
    Building a Data Mesh  254
    Five Key Takeaways from a Five-Year Data Evolution  256
    Making Metadata Work for the Business  257
    Unlocking the Value of Metadata with Data Discovery  260
    Data Warehouse and Lake Considerations  260
    Data Catalogs Can Drown in a Data Lake—or Even a Data Mesh  261
    Moving from Traditional Data Catalogs to Modern Data Discovery  262
    Deciding When to Get Started with Data Quality at Your Company  264
    You’ve Recently Migrated to the Cloud  265
    Your Data Stack Is Scaling with More Data Sources, More Tables, and More Complexity  265
    Your Data Team Is Growing  266
    Your Team Is Spending at Least 30% of Their Time Firefighting Data Quality Issues  266
    Your Team Has More Data Consumers Than They Did One Year Ago  267
    Your Company Is Moving to a Self-Service Analytics Model  267
    Data Is a Key Part of the Customer Value Proposition  267
    Data Quality Starts with Trust  268
    Summary  268

10. Pioneering the Future of Reliable Data Systems  269
    Be Proactive, Not Reactive  271
    Predictions for the Future of Data Quality and Reliability  273
    Data Warehouses and Lakes Will Merge  273
    Emergence of New Roles on the Data Team  274
    Rise of Automation  275
    More Distributed Environments and the Rise of Data Domains  277
    So Where Do We Go from Here?  277

Index  279
Preface

If you’ve experienced any of the following scenarios, raise your hand (or, you can just nod in solidarity—there’s no way we’ll know otherwise):

• Five thousand rows in a critical (and relatively predictable) table suddenly turn into five hundred, with no rhyme or reason.
• A broken pipeline causes an executive dashboard to spit out null values.
• A hidden schema change breaks a downstream pipeline.
• And the list goes on.

This book is for everyone who has suffered from unreliable data, silently or with muffled screams, and wants to do something about it. We expect that these individuals will come from data engineering, data analytics, or data science backgrounds, and be actively involved in building, scaling, and managing their company’s data pipelines.

On the surface, it may seem like Data Quality Fundamentals is a manual about how to clean, wrangle, and generally make sense of data—and it is. But more so, this book tackles best practices, technologies, and processes around building more reliable data systems and, in the process, cultivating data trust with your team and stakeholders.

In Chapter 1, we’ll discuss why data quality deserves attention now, and how architectural and technological trends are contributing to an overall decrease in governance and reliability. We’ll introduce the concept of “data downtime,” and explain how it harkens back to the early days of site reliability engineering (SRE) teams and how these same DevOps principles can apply to your data engineering workflows as well.

In Chapter 2, we’ll highlight how to build more resilient data systems by walking through how you can solve for and measure data quality across several key data pipeline technologies, including data warehouses, data lakes, and data catalogs.
These three foundational technologies store, process, and track data health preproduction, which naturally leads us into Chapter 3, where we’ll walk through how to collect, clean, transform, and test your data with quality and reliability in mind.
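As a small taste of the kind of testing Chapter 3 covers, here is a minimal, hand-rolled sketch of two pipeline checks. The table (represented as a list of dicts) and the column names are illustrative assumptions; the book itself works with dedicated tools such as dbt, Great Expectations, and Deequ:

```python
def check_not_null(rows, column):
    """Return the rows where a required column is missing or null."""
    return [r for r in rows if r.get(column) is None]

def check_row_count(rows, min_rows):
    """Guard against a table silently shrinking below an expected floor."""
    return len(rows) >= min_rows

# Hypothetical sample table with one bad record (null amount).
orders = [
    {"order_id": 1, "amount": 9.99},
    {"order_id": 2, "amount": None},
]
print(check_not_null(orders, "amount"))     # [{'order_id': 2, 'amount': None}]
print(check_row_count(orders, min_rows=1))  # True
```

In a real pipeline, checks like these would run as assertions after each load, failing the job (or firing an alert) rather than printing.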
Next, Chapter 4 will walk through one of the most important aspects of the data reliability workflow—proactive anomaly detection and monitoring—by sharing how to build a data quality monitor using a publicly available data set about exoplanets. This tutorial will give readers the opportunity to directly apply the lessons they’ve learned in Data Quality Fundamentals to their work in the field, albeit at a limited scale.

Chapter 5 will provide readers with a bird’s-eye view into what it takes to put these critical technologies together and architect robust systems and processes that ensure data quality is measured and maintained no matter the use case. We’ll also share how best-in-class data teams at Airbnb, Uber, Intuit, and other companies integrate data reliability into their day-to-day workflows, including setting SLAs, SLIs, and SLOs, and building data platforms that optimize for data quality across five key pillars: freshness, volume, distribution, schema, and lineage.

In Chapter 6, we’ll dive into the steps necessary to actually react to and fix data quality issues in production environments, including data incident management, root cause analysis, postmortems, and establishing incident communication best practices.

Then, in Chapter 7, readers will take their understanding of root cause analysis one step further by learning how to build field-level lineage using popular and widely adopted open source tools that should be in every data engineer’s arsenal.

In Chapter 8, we’ll discuss some of the cultural and organizational barriers data teams must cross when evangelizing and democratizing data quality at scale, including best-in-class principles like treating your data like a product, understanding your company’s RACI matrix for data quality, and how to structure your data team for maximum business impact.
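To make the freshness monitoring of Chapter 4 (and the SLOs of Chapter 5) concrete before you get there, here is a minimal sketch of a freshness check. The table names, timestamps, and the 24-hour threshold are illustrative assumptions, not the book's exoplanet tutorial code:

```python
from datetime import datetime, timedelta

def find_stale_tables(last_updated, now, max_staleness_hours=24):
    """Flag tables whose most recent update breaches a freshness SLO.

    last_updated: dict mapping table name -> datetime of last successful load.
    Returns (table, hours_stale) tuples for breaching tables, worst first.
    """
    stale = []
    for table, ts in last_updated.items():
        hours = (now - ts).total_seconds() / 3600
        if hours > max_staleness_hours:
            stale.append((table, round(hours, 1)))
    return sorted(stale, key=lambda t: -t[1])

now = datetime(2022, 9, 1, 12, 0)
updates = {
    "exoplanets": now - timedelta(hours=2),   # healthy
    "fct_orders": now - timedelta(hours=30),  # breaches the 24h SLO
}
print(find_stale_tables(updates, now))  # [('fct_orders', 30.0)]
```

In practice, the `last_updated` timestamps would come from warehouse metadata (e.g., query logs or information-schema tables), and a breach would page the on-call engineer rather than print.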
In Chapter 9, we’ll share several real-world case studies and conversations with leading minds in the data engineering space, including Zhamak Dehghani, creator of the data mesh, António Fitas, whose team bravely shares their story of how they’re migrating toward a decentralized (and data quality first!) data architecture, and Alex Tverdohleb, VP of Data Services at Fox and a pioneer of the “controlled freedom” data management technique. This patchwork of theory and on-the-ground examples will help you visualize how several of the technical and process-driven data quality concepts we highlight in Chapters 1 through 8 can come to life in stunning color.

And finally, in Chapter 10, we finish our book with a tangible calculation for measuring the financial impact of poor data on your business, in human hours, as a way to help readers (many of whom are tasked with fixing data downtime) make the case with leadership to invest in more tools and processes to solve these problems. We’ll also highlight four of our predictions for the future of data quality as it relates to broader industry trends, such as distributed data management and the rise of the data lakehouse.
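A downtime calculation of the kind Chapter 10 builds toward can be sketched as simple arithmetic. The formula shape (incidents times detection-plus-resolution time, converted to a dollar figure) and every sample number below are illustrative assumptions rather than the book's exact figures:

```python
def data_downtime_hours(incidents, avg_detection_hours, avg_resolution_hours):
    # Total hours of data downtime in a period: each incident costs
    # the time it went unnoticed plus the time it took to fix.
    return incidents * (avg_detection_hours + avg_resolution_hours)

def downtime_cost(downtime_hours, engineers_per_incident, hourly_rate):
    # Translate lost hours into a dollar figure leadership can act on.
    return downtime_hours * engineers_per_incident * hourly_rate

hours = data_downtime_hours(incidents=10, avg_detection_hours=4,
                            avg_resolution_hours=9)
print(hours)                                                # 130
print(downtime_cost(hours, engineers_per_incident=2,
                    hourly_rate=75))                        # 19500
```

Even with made-up inputs, the shape of the calculation shows why reducing time-to-detection (not just time-to-resolution) moves the total so much.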
At the very least, we hope that you walk away from this book with a few tricks up your sleeve when it comes to making the case for prioritizing data quality and reliability across your organization. As any seasoned data leader will tell you, data trust is never built in a day, but with the right approach, incremental progress can be made—pipeline by pipeline.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

This element signifies a tip or suggestion.

This element signifies a general note.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://oreil.ly/data-quality-fundamentals-code.

If you have a technical question or a problem using the code examples, please send an email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Quality Fundamentals by Barr Moses, Lior Gavish, and Molly Vorwerck (O’Reilly). Copyright 2022 Monte Carlo Data, Inc., 978-1-098-11204-2.”

If you feel your use of code examples falls outside fair use or the permission described herein, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/data-quality-fundamentals.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media.

Follow us on Twitter: https://twitter.com/oreillymedia.
Watch us on YouTube: https://www.youtube.com/oreillymedia.

Acknowledgments

This book was a labor of love, and for that reason, we have many people to thank.

First, we’d like to thank Jess Haberman, our fearless acquisitions editor, who believed in us every step of the way. When Jess came to us with the idea for a book on data quality, we were taken aback—in the best way possible. We had no idea that a topic—data reliability—that’s so near and dear to our hearts would find life outside of our personal blog articles. With her dedication and encouragement, we were able to draft a proposal that set itself apart from what was already published in the space and ultimately write a book that would bring value to other data practitioners struggling with data downtime.

We must also thank Jill Leonard, our development editor, who has served as our Yoda of the entire writing process. From providing invaluable guidance on flow and copy, to being available for pep talks and brainstorming sessions (“Should this chapter go here? What about there? What even is a preface?”), Jill was the Jedi who saw us through to the finish line. Our mutual love of cats only helped seal the bond.

We are forever indebted to our technical reviewers, Tristan Baker, Debashis Saha, Wayne Yaddow, Scott Haines, Sam Bail, Joy Payton, and Robert Ansel, for their sharp edits and valuable feedback on multiple drafts of the book. Their passion for bringing DevOps best practices and good data hygiene to the field is an inspiration, and we’ve been grateful to work with them.

We’d like to acknowledge—and thank a million times over—Ryan Kearns, a contributor to this book whose name could have been on the byline. From spearheading several chapters to offering critical insights on the technologies and processes discussed, this book would not have come together without his assistance. We learn from him every day and are so lucky to call him a dear colleague. In the coming years, Ryan will undoubtedly become one of the most important voices in data engineering and data science.

There were several industry experts and trailblazers we interviewed for this book and various other projects we’ve pursued over the past year. In no particular order, we’d like to thank Brandon Beidel, Alex Tverdohleb, António Fitas, Gopi Krishnamurthy, Manu Raj, Zhamak Dehghani, Mammad Zadeh, Greg Waldman, Wendy Turner Williams, Zosia Kossowski, Erik Bernhardsson, Jessica Cherny, Josh Wills, Kyle Shannon, Atul Gupte, Chad Sanderson, Patricia Ho, Michael Celentano, Prateek Chawla, Cindi Howson, Debashis Saha, Melody Chien, Ankush Jain, Maxime Beauchemin, DJ Patil, Bob Muglia, Mauricio de Diana, Shane Murray, Francisco Alberini, Mei Tao, Xuanzi Han, and Helena Munoz.

We’d also like to thank Brandon Gubitosa, Sara Gates, and Michael Segner for their assistance with outlines and drafts—and for always encouraging us to “kill our darlings.”

We’re indebted to our parents, Elisha and Kadia Moses, Motti and Vira Gavish, and Gregg and Barbara Vorwerck, for encouraging us to pursue our passions for data engineering and data quality, from launching a company and category dedicated to the concept, to writing this book. We’d also like to thank Rae Barr Gavish (RBG) for being our number one fan, and Robert Ansel for being our resident SRE, WordPress consultant, and DevOps guru.

And we’re forever indebted to our customers, who are helping us pioneer the data observability category and through the process laying the foundations for the future of reliable data at scale.