Fundamentals of Data Observability
Implement Trustworthy End-to-End Data Solutions
Andy Petrella
DATA

“A must-read for any technologist tired of struggling with data quality, unreliable platforms, and opaque data pipelines.”
—Emily Gorcenski, Principal Data Scientist, Thoughtworks

“This is the book we’ve been waiting for to move beyond the hype and make data observability a reality.”
—Matthew Housley, CTO and coauthor of Fundamentals of Data Engineering

“This book will help guide the future of data-driven decision making.”
—Matthew Weingarten, Data Engineer/Data Passionista

Fundamentals of Data Observability

Quickly detect, troubleshoot, and prevent a wide range of data issues through data observability, a set of best practices that enables data teams to gain greater visibility of data and its usage. If you’re a data engineer, data architect, or machine learning engineer who depends on the quality of your data, this book shows you how to focus on the practical aspects of introducing data observability in your everyday work.

Author Andy Petrella helps you build the right habits to identify and solve data issues, such as data drifts and poor quality, so you can stop their propagation in data applications, pipelines, and analytics. You’ll learn ways to introduce data observability, including setting up a framework for generating and collecting all the information you need.

• Learn the core principles and benefits of data observability
• Use data observability to detect, troubleshoot, and prevent data issues
• Follow the book’s recipes to implement observability in your data projects
• Use data observability to create a trustworthy communication framework with data consumers
• Learn how to educate your peers about the benefits of data observability

Andy Petrella is the founder of Kensu, a real-time data monitoring solution. He has 20 years of experience in the data industry and created the Spark Notebook.

Twitter: @oreillymedia
linkedin.com/company/oreilly-media
youtube.com/oreillymedia

US $65.99 CAN $82.99
ISBN: 978-1-098-13329-0
Fundamentals of Data Observability

Many people talk about data observability, yet only some fully understand it. This is where this book comes in, drawing from Andy’s professional background and vast hands-on experience, it offers a simple-to-implement, smart, and applicable structure for considering and applying data observability.
—Adi Polak, author of Scaling Machine Learning with Spark

Data observability is a topic that hasn’t received the extensive discussion it deserves. Andy Petrella does an excellent job conveying the what, why, and how of achieving successful observability within data products. This will help guide the future of data-driven decision making.
—Matthew Weingarten, Data Engineer/Data Passionista

Data observability is widely discussed and widely misunderstood. In this book, Andy provides an intelligent and practical framework for thinking about and implementing data observability.
—Joe Reis, coauthor of Fundamentals of Data Engineering and “recovering data scientist”

This book is a brilliant manifestation of Andy’s extensive experience in data engineering and architecture in a career that has spanned in-the-trenches pipeline construction, software development, product design, and leadership. He guides the reader through data observability on all levels, bridging the gap between engineering technical mastery and organizational dynamics. This is the book we’ve been waiting for to move beyond the hype and make data observability a reality.
—Matthew Housley, CTO and coauthor of Fundamentals of Data Engineering
I’ve seen businesses of all sizes struggle with data platform reliability time and time again. Andy’s book provides both a theoretical and practical foundation for addressing this challenge head-on. The text explores the principles of data observability and backs these up with technical implementations designed to address the common and repeated patterns that plague data teams. This book is a must-read for any technologist tired of struggling with data quality, unreliable platforms, and opaque data pipelines.
—Emily Gorcenski, Principal Data Scientist, Thoughtworks

Observability has gone mainstream in the past years, but one application area has been missing so far: data. With Fundamentals of Data Observability, you now have the missing piece at hand, explaining what observability means for your data pipelines and how to successfully implement it in your organization. An excellent hands-on guide that should be at the top of your reading list!
—Michael Hausenblas, AWS

This book paints a realistic and accurate view of data observability, shares the key tenets and mental models, and proves that observability is a critical requirement for any modern data organization and practitioner looking to remain competitive.
—Scott Haines, Distinguished Software Engineer, Nike
Andy Petrella
Fundamentals of Data Observability
Implement Trustworthy End-to-End Data Solutions
Beijing Boston Farnham Sebastopol Tokyo
978-1-098-13329-0 [LSI] Fundamentals of Data Observability by Andy Petrella Copyright © 2023 O’Reilly Media. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com. Acquisitions Editor: Aaron Black Development Editor: Gary O’Brien Production Editor: Ashley Stussy Copyeditor: Liz Wheeler Proofreader: Piper Editorial Consulting, LLC Indexer: Ellen Troutman-Zaig Interior Designer: David Futato Cover Designer: Karen Montgomery Illustrator: Kate Dullea August 2023: First Edition Revision History for the First Edition 2023-08-11: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781098133290 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Fundamentals of Data Observability, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 
This work is part of a collaboration between O’Reilly and Kensu. See our statement of editorial independence.
Table of Contents

Preface

Part I. Introducing Data Observability

1. Introducing Data Observability
   Scaling Data Teams
   Challenges of Scaling Data Teams
   Segregated Roles and Responsibilities and Organizational Complexity
   Anatomy of Data Issues and Consequences
   Impact of Data Issues on Data Team Dynamics
   Scaling AI Roadblocks
   Challenges with Current Data Management Practices
   Effects of Data Governance at Scale
   Data Observability to the Rescue
   The Areas of Observability
   How Data Teams Can Leverage Data Observability Now
   Low Latency Data Issues Detection
   Efficient Data Issues Troubleshooting
   Preventing Data Issues
   Decentralized Data Quality Management
   Complementing Existing Data Governance Capabilities
   The Future and Beyond
   Conclusion

2. Components of Data Observability
   Channels of Data Observability Information
   Logs
   Traces
   Metrics
   Observations Model
   Physical Space
   Server
   User
   Static Space
   Dynamic Space
   Expectations
   Rules
   Automatic Anomaly Detection
   Prevent Garbage In, Garbage Out
   Conclusion

3. Roles of Data Observability in a Data Organization
   Data Architecture
   Where Does Data Observability Fit in a Data Architecture?
   Data Architecture with Data Observability
   How Data Observability Helps with Data Engineering Undercurrents
   Security
   Data Management
   Support for Data Mesh’s Data as Products
   Conclusion

Part II. Implementing Data Observability

4. Generate Data Observations
   At the Source
   Generating Data Observations at the Source
   Low-Level API in Python
   Description of the Data Pipeline
   Definition of the Status of the Data Pipeline
   Data Observations for the Data Pipeline
   Generate Contextual Data Observations
   Generate Data-Related Observations
   Generate Lineage-Related Data Observations
   Wrap-Up: The Data-Observable Data Pipeline
   Using Data Observations to Address Failures of the Data Pipeline
   Conclusion

5. Automate the Generation of Data Observations
   Abstraction Strategies
   Event Listeners
   Aspect-Oriented Programming
   High-Level Applications
   No-Code Applications
   Low-Code Applications
   Differences Among Monitoring Alternatives
   Conclusion

6. Implementing Expectations
   Introducing Expectations
   Shift-Left Data Quality
   Corner Cases Discovery
   Lifting Service Level Indicators
   Using Data Profilers
   Maintaining Expectations
   Overarching Practices
   Fail Fast and Fail Safe
   Simplify Tests and Extend CI/CD
   Conclusion

Part III. Data Observability in Action

7. Integrating Data Observability in Your Data Stack
   Ingestion Stage
   Ingestion Stage Data Observability Recipes
   Airbyte Agent
   Transformation
   Transformation Stage Data Observability Recipes
   Apache Spark
   dbt Agent
   Serving
   Recipes
   BigQuery in Python
   Orchestrated SQL with Airflow
   Analytics
   Machine Learning Recipes
   Business Intelligence Recipes
   Conclusion

8. Making Opaque Systems Translucent
   Data Translucence
   Opaque Systems
   SaaS
   Don’t Touch It; It (Kinda) Works
   Inherited Systems
   Strategies for Data Translucence
   Strategies
   The Data Observability Connector
   Example: Building a dbt Data Observability Connector (SaaS)
   Conclusion

Afterword: Future Observations

Index
Preface

Welcome to Fundamentals of Data Observability, a book designed to provide a robust introduction to a crucial, emerging field in data engineering and analytics.

As we venture into an era characterized by unprecedented data growth, the importance of understanding our data—its sources, destinations, usages, and behaviors—has never been greater. Observability, traditionally a term associated with software and systems engineering, has now made its way into the data space, becoming a cornerstone of trustworthy, efficient, and insightful data systems. This book aims to guide readers into the depths of this new and necessary discipline, exploring its principles, techniques, and evolving best practices.

Fundamentals of Data Observability is not just for data engineers or data scientists, but for anyone who interacts with data systems in their daily work life. Whether you’re a chief data officer (CDO), a chief technology officer (CTO), a manager, a leader, a developer, a data analyst, or a business manager, understanding data observability concepts and principles will empower you to make better decisions, build more robust systems, and gain greater insight from your data resources.

This book begins by outlining the core concepts of data observability, drawing parallels to similar concepts in software engineering, and setting the stage for the more advanced material. It subsequently delves into the principles and techniques to achieve data observability, providing practical guidance on how to implement them. The final section discusses how to get started today with the system you are using or have inherited. The book concludes with thoughts about the future of data observability, exploring ongoing research and emerging trends that are set to shape the field in the coming years.

Every chapter in this book is packed with actionable advice to reinforce the topics covered. My aim is not merely to impart knowledge but to facilitate the practical application of data observability concepts in your real-world situations.
I hope that by the end of this book, you not only will understand the “what” and “why” of data observability but will also be armed with the “how”—practical knowledge that you can apply to improve the reliability, usability, and understandability of your data systems.

The field of data observability is still young, and there is much to explore and learn. As you embark on this exciting journey, remember that understanding our data and its usages is not just a technical goal—it’s a foundation for making better decisions, fostering innovation, and driving the success of our enterprises.

Overview of the Book

Fundamentals of Data Observability is organized into three parts and eight chapters, each addressing specific areas of data observability:

Part I: The Foundations of Data Observability

Chapter 1 introduces the concept of data observability, explains why it has become an essential aspect of data management, and sheds light on its role in facilitating accurate, reliable, and actionable insights from data.

Chapter 2 delves deeper into the components of data observability. It provides an understanding of its multifaceted nature and how to ensure its implementation aligns with both short-term and long-term organizational needs.

Chapter 3 describes the roles and responsibilities of a data observability platform within an organization. It covers its relationships with existing systems and practices, and explores the changes in workflow and management that it facilitates.

Part II: Implementing Data Observability

Chapter 4 outlines the APIs and practices required to make a data system observable. It includes a Python-based implementation example that can be adapted to suit any other programming language or system.

Chapter 5 focuses on automating the generation of data observations. It showcases a range of practices to lessen manual effort, thereby enabling data users to concentrate on more strategic tasks.

Chapter 6 covers how to implement expectations within data applications, supporting continuous validation, one of the crucial principles of data observability.

Part III: Get Started Today

Chapter 7 provides ready-to-use recipes for implementing data observability principles into various technologies used in data pipelines. The applications range from traditional data processing systems to machine learning applications and data visualization tools, broadening the scope of data observability.
Chapter 8 provides actionable intelligence to incorporate data observability into systems that are currently opaque or closed, or whose knowledge base has faded over time.

Through these three sections, Fundamentals of Data Observability offers a comprehensive guide to understanding, implementing, and leveraging data observability principles in various data systems, new or existing.

Who Should Read This Book

Fundamentals of Data Observability is a vital guide for anyone who plays a role in the world of data engineering, analytics, and governance. This book provides in-depth insight into the principles of data observability and its role in ensuring efficient and reliable data systems. Here’s who should read this book, and why:

Data engineers and analyst engineers
Whether you are just beginning your career or are an experienced professional, the book will empower you with the knowledge to architect and manage observable data systems. It delves deep into the tools, technologies, and practices that can improve the reliability, usability, and understandability of data systems. The book also provides practical advice and case studies to help you apply these concepts in real-world scenarios.

Lead data engineers and heads of data engineering
This book is a resource for team leaders and managers responsible for designing, implementing, and managing data systems. The chapters guide you on creating a strategy for implementing data observability in your organization and offer advice on managing the change effectively. It provides the knowledge you need to mentor your team and facilitate their growth in this emerging field.

CDOs, CTOs, and heads of data
For those in executive roles, the book offers an overview of the principles of data observability and its significance in the broader data architecture landscape. Understanding this will allow you to make informed decisions about resource allocation, risk management, and strategic direction. It provides a firm grounding in the language and concepts of data observability, enabling you to engage more productively with your technical teams.

Data governance and architecture professionals
For those involved in data governance and architecture, this book provides insights into how data observability principles can contribute to robust, secure, and compliant data practices. It addresses how data observability intersects with other data systems in place, helping you build a more integrated and effective data strategy.
Software engineers
If you are a software engineer working on building data systems, the book’s second part will be especially relevant to you. It provides practical guidance on how to make these systems observable, thereby ensuring they can be effectively maintained and their data properly utilized by data users such as engineers and analysts.

In a world increasingly dominated by data, understanding the principles of data observability is crucial. This book will equip you with the knowledge and skills to make your data systems more reliable, understandable, and usable, driving better decision making and business success. Whether you are a hands-on engineer, a team leader, or a strategic decision maker, Fundamentals of Data Observability is an essential addition to your professional library.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.
Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/Fundamentals-of-Data-Observability/oreilly-fodo-source-code.

If you have a technical question or a problem using the code examples, please send email to support@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Fundamentals of Data Observability by Andy Petrella (O’Reilly). Copyright 2023 O’Reilly Media, 978-1-098-13329-0.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.
How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-889-8969 (in the United States or Canada)
707-829-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://www.oreilly.com/about/contact.html

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/fundamentals-of-data-observability-1e.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://youtube.com/oreillymedia

Acknowledgments

Firstly, I want to express my deepest gratitude to my wife Sandrine, and our sons Noah and Livio. Your unending patience and support throughout the writing of this book made it possible. I may have missed a few rare sunny Belgian weekends tucked away with this project, but your understanding and encouragement never wavered.

To my parents, thank you for providing me with the opportunities to pursue my studies and follow my passions.

My heartfelt thanks to Jess Haberman for entrusting me with writing about this emerging and significant topic. Your assistance in shaping the outline and message of the book was invaluable in getting this project off the ground.

To Gary O’Brien, the keystone of this project, thank you for your unwavering enthusiasm and tireless effort. Your dedication to improving the quality and coherence of this book was a source of inspiration, and your nerdy jokes provided much-needed levity during our countless discussions.
I extend my deepest appreciation to Joe Reis and Matthew Housley. Your continuous thought-provoking comments and suggestions greatly improved the content presented in this book. Your perspectives were particularly insightful when they aligned with the themes of your own work.

Adi Polak, thank you for being a beacon of positivity throughout this process. Your excitement and constructive feedback on my proposals kept me reassured that I was headed in the right direction. Your countless refinements ensured that the content remained crisp yet easily digestible for readers.

To Emily Gorcenski, Matthew Weingarten, Scott Haines, and Simon Späti, thank you for your expertise in enhancing the robustness of the technical statements, and for your efforts to increase the clarity of the material from the reader’s perspective.

I am grateful to Ines Dehaybe, Emanuele Lucchini, and François Pietquin for your friendly support, often provided after hours, which greatly contributed to this book’s success and helped me stay on schedule.

I want to express my gratitude to Eloy Sasot, Doug Laney, and Chris Tabb for your diligent reviews, which helped clarify and tailor chapters for business-oriented readers and stakeholders.

Also, my thanks go to Becky Lawlor and Jenifer Servais for your tremendous help in structuring my thoughts and polishing my English prose.

To everyone who contributed, your collective efforts have shaped this book into what it is today. My sincerest thanks to you all.
PART I Introducing Data Observability