Hubert Dulay & Stephen Mooney
Streaming Data Mesh
A Model for Optimizing Real-Time Data Services
“A comprehensive guide that masterfully explores the transformative potential of streaming data mesh architectures. With practical, actionable insights and step-by-step guidance, this book is an essential read for data professionals seeking to revolutionize data management and processing in real time.”
Yingjun Wu, Founder and CEO, RisingWave Labs

“An amazing resource for anyone looking to gain an understanding of what current data architecture patterns are about.”
Benjamin Djidi, CEO, Popsink

Streaming Data Mesh

Data lakes and warehouses have become increasingly fragile, costly, and difficult to maintain as data gets bigger and moves faster. Data meshes can help your organization decentralize data, giving ownership back to the engineers who produced it. This book provides a concise yet comprehensive overview of data mesh patterns for streaming and real-time data services.

Authors Hubert Dulay and Stephen Mooney examine the vast differences between streaming and batch data meshes. Data engineers, architects, data product owners, and those in DevOps and MLOps roles will learn steps for implementing a streaming data mesh, from defining a data domain to building a good data product. Through the course of the book, you’ll create a complete self-service data platform and devise a data governance system that enables your mesh to work seamlessly.

With this book, you will:
• Design a streaming data mesh using Kafka
• Learn how to identify a domain
• Build your first data product using self-service tools
• Apply data governance to the data products you create
• Learn the differences between synchronous and asynchronous data services
• Implement self-services that support decentralized data

Hubert Dulay is a systems and data engineer at StarTree. He has consulted for many financial institutions, healthcare organizations, and telecommunications companies.

Stephen Mooney is an independent data scientist and data engineer. He has worked for major companies in healthcare, retail, and the public sector.

US $65.99 CAN $82.99
ISBN: 978-1-098-13072-5
Hubert Dulay and Stephen Mooney
Streaming Data Mesh
A Model for Optimizing Real-Time Data Services
FIRST EDITION
Beijing, Boston, Farnham, Sebastopol, Tokyo
978-1-098-13072-5 Streaming Data Mesh by Hubert Dulay and Stephen Mooney Copyright © 2023 Hubert Dulay and Stephen Mooney. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com. Acquisitions Editor: Andy Kwan Development Editor: Jeff Bleiel Production Editor: Beth Kelly Copyeditor: Sonia Saruba Proofreader: Sharon Wilkey Indexer: Judith McConville Interior Designer: David Futato Cover Designer: Karen Montgomery Illustrator: Kate Dullea June 2023: First Edition Revision History for the First Edition 2023-05-11: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781098130725 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Streaming Data Mesh, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 
This work is part of a collaboration between O’Reilly and Confluent. See our statement of editorial independence.
Table of Contents

Preface

1. Data Mesh Introduction
    Data Divide
    Data Mesh Pillars
    Data Ownership
    Data as a Product
    Federated Computational Data Governance
    Self-Service Data Platform
    Data Mesh Diagram
    Other Similar Architectural Patterns
    Data Fabric
    Data Gateways and Data Services
    Data Democratization
    Data Virtualization
    Focusing on Implementation
    Apache Kafka
    AsyncAPI

2. Streaming Data Mesh Introduction
    The Streaming Advantage
    Streaming Enables Real-Time Use Cases
    Streaming Enables Data Optimization Advantages
    Reverse ETL
    The Kappa Architecture
    Lambda Architecture Introduction
    Kappa Architecture Introduction
    Summary

3. Domain Ownership
    Identifying Domains
    Discernible Domains
    Geographic Regions
    Hybrid Architecture
    Multicloud
    Avoiding Ambiguous Domains
    Domain-Driven Design
    Domain Model
    Domain Logic
    Bounded Context
    The Ubiquitous Language
    Data Mesh Domain Roles
    Data Product Engineer
    Data Product Owner or Data Steward
    Streaming Data Mesh Tools and Platforms to Consider
    Domain Charge-Backs
    Summary

4. Streaming Data Products
    Defining Data Product Requirements
    Identifying Data Product Derivatives
    Derivatives from Other Domains
    Ingesting Data Product Derivatives with Kafka Connect
    Consumability
    Synchronous Data Sources
    Asynchronous Data Sources and Change Data Capture
    Debezium Connectors
    Transforming Data Derivatives to Data Products
    Data Standardization
    Protecting Sensitive Information
    SQL
    Extract, Transform, and Load
    Publishing Data Products with AsyncAPI
    Registering the Streaming Data Product
    Building an AsyncAPI YAML Document
    Assigning Data Tags
    Versioning
    Monitoring
    Summary

5. Federated Computational Data Governance
    Data Governance in a Streaming Data Mesh
    Data Lineage Graph
    Streaming Data Catalog to Organize Data Products
    Metadata
    Schemas
    Lineage
    Security
    Scalability
    Generating the Data Product Page from AsyncAPI
    Apicurio Registry
    Access Workflow
    Centralized Versus Decentralized
    Centralized Engineers
    Decentralized (Domain) Engineers
    Summary

6. Self-Service Data Infrastructure
    Streaming Data Mesh CLI
    Resource-Related Commands
    Cluster-Related Commands
    Topic-Related Commands
    The domain Commands
    The connect Commands
    The streaming Commands
    Publishing a Streaming Data Product
    Data Governance-Related Services
    Security Services
    Standards Services
    Lineage Services
    SaaS Services and APIs
    Summary

7. Architecting a Streaming Data Mesh
    Infrastructure
    Two Architecture Solutions
    Dedicated Infrastructure
    Multitenant Infrastructure
    Streaming Data Mesh Central Architecture
    The Domain Agent (aka Sidecar)
    Data Plane
    Control Plane
    Summary

8. Building a Decentralized Data Team
    The Traditional Data Warehouse Structure
    Introducing the Decentralized Team Structure
    Empowering People
    Working Processes
    Fostering Collaboration
    Data-Driven Automation
    New Roles in Data Domains
    New Roles in the Data Plane
    New Roles in Data Science and Business Intelligence

9. Feature Stores
    Separating Data Engineering from Data Science
    Online and Offline Data Stores
    Apache Feast Introduction
    Summary

10. Streaming Data Mesh in Practice
    Streaming Data Mesh Example
    Deploying an On-Premises Streaming Data Mesh
    Installing a Connector
    Deploying Clickstream Connector and Auto-Creating Tables
    Deploying the Debezium Postgres CDC Connector
    Enrichment of Streaming Data
    Publishing the Data Product
    Consuming Streaming Data Products
    Fully Managed SaaS Services
    Summary and Considerations

Index
Preface

Welcome to this first edition of Streaming Data Mesh! This is your guide to understanding and building a streaming data mesh that satisfies all of the pillars of a data mesh. Data mesh is one of the most popular data platform architectures being explored today. This book will help you get a full understanding of this self-service data platform in a streaming context.

Today, batch processing dominates extract, transform, and load (ETL) processes in most businesses. This book offers a different perspective on data pipelines: it takes the concepts you already understand from batch ETL and applies them to streaming ETL in the context of a data mesh.

This book is designed to help you understand the essential concepts, architectures, and technologies at the core of a streaming data mesh. It covers all the essential topics related to streaming data mesh, from the basics of data architecture, to the use of big data tools for data warehousing, to business-oriented approaches for streaming data mesh architectures. Additionally, we will look at a stack of services involved in a successful streaming data mesh project.

This book does not require prior knowledge of the pillars that make up a data mesh. We will briefly introduce the pillars at a very high level and define them with streaming specifically in mind. If you feel you need to understand data mesh in more detail, please refer to Zhamak Dehghani’s book, Data Mesh (O’Reilly).

Who Should Read This Book

This book is written for anyone who is interested in learning more about streaming data mesh, combining the exciting work done in data mesh with real-time streaming for data transformation, data product definition, and data governance. This book is also useful for data engineers, data analysts, data scientists, software architects, and product owners who want to implement a streaming data architecture for their projects, and for those who wish to become familiar with streaming data technologies and best practices for integrating them, at scale, into their projects.

Why We Wrote This Book

We wrote a book on streaming data mesh because we believe it has the potential to revolutionize the way companies manage and process their data. Streaming data mesh provides a platform that unites messaging, storage, and processing capabilities into one comprehensive solution. By increasing data reliability and coverage while reducing costs, this platform enables companies to significantly accelerate their digital transformation and become data-driven organizations.

With this book, we want to make sure our readers understand the key principles, the latest approaches, and the dos and don’ts of streaming data mesh. We also want to provide step-by-step guidance for setting up and operating a streaming data mesh, taking into account best practices.

Navigating This Book

This book is organized as follows:
• Chapters 1 and 2 provide an introduction to data mesh concepts and extend these into a streaming context.
• Chapter 3 goes into detail about domain ownership and the approaches used to identify domains, domain-driven design, the roles associated with a data domain, tools to consider, as well as an approach to domain-centric charge-backs.
• Chapter 4 explores the creation of streaming data products, including data product identification, ingestion, transformation, and publication.
• Chapter 5 examines federated computational data governance within a streaming data mesh.
• Chapter 6 discusses the self-service infrastructure as it relates to streaming data mesh.
• Chapter 7 dives into the architecture of a streaming data mesh and its components, including infrastructure and cloud architecture.
• Chapter 8 discusses the structure, alignment, and roles associated with building a decentralized team.
• Chapter 9 discusses the application of streaming data mesh for creating feature stores to empower data-science model training and inference.
• Chapter 10 provides a concrete example of creating a streaming data mesh.
Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
    Shows commands or other text that should be typed literally by the user.
Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/hdulay/streaming-data-mesh.

If you have a technical question or a problem using the code examples, please send email to support@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Streaming Data Mesh by Hubert Dulay and Stephen Mooney (O’Reilly). Copyright 2023 Hubert Dulay and Stephen Mooney, 978-1-098-13072-5.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-889-8969 (in the United States or Canada)
707-829-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://www.oreilly.com/about/contact.html

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/streaming-data-mesh.
For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://youtube.com/oreillymedia

Acknowledgments

We could not have written this book without Andy Kwan promoting our proposal for it. Thanks also to our production editor, Beth Kelly, and most importantly, Jeff Bleiel. Jeff has been a tremendous help, and we greatly appreciate all that he has done for us.

A special thanks goes out to all of the reviewers who spent countless hours digesting this content and suggesting improvements. Your unwavering support was instrumental to making this book a reality. Ralph Matthias Debusmann, thank you for reaching out and showing your interest in our book early on. Ravneet Singh, thank you too for your help and support. Dr. Ian Buss, thanks again and again! Sharon Xie, Decodable is lucky to have you. Eric Sammer, thanks for the experience.

Hubert

Thanks to my wife, Beth, and kids, Aster and Nico, for supporting me and reminding me to make time for myself and family.

I’d like to specifically thank everyone who influenced me during my time at Cloudera. “Always be building your brand.” Hemal Kanani, I still hear your voice when I read that phrase. BOOM! Thanks to Ben Spivey for always being there as a mentor and friend, and to Ian Buss for showing me that big data is easy. Marlo Carillo and my filip big data brothers, thanks for representing the RP. And of course, the CLDR Illuminatis.

I’d like to also thank everyone at Confluent who journeyed with me to IPO and for giving me the experiences needed to write this book. Dan Elliman, thanks for being Batman to my Robin in the NE. Eric Langan, thanks for having such a great and contagious attitude. Thanks to Paul Earsy for guiding me through muddy waters. Steve Williams, why did you retire? You’re still in your prime. Thanks to Jay Kreps for his leadership and to Gwen Shapira for being a huge influence. Thanks to Yeva Byzek, Ben Stopford, Adam Bellemare, and Travis Hoffman for being there early in the data mesh discourse at Confluent. Thanks, Confluent, for sponsoring this book and for all the other smart people there.

I would also like to thank the many people who provided feedback and helped shape the book: Benjamin Djidi, Ismael Ghalimi, David Yaffe, Hojjat Jafarpour, Yingjun Wu, Zander Matheson, Will Plummer, Ting Wang, Jove Zhong, and Yaniv Ben Hemo.

Stephen

I would like to express my sincere gratitude to everyone who supported me while writing this book. Special thanks to the colleagues who guided me through the process of writing and publishing. I am also grateful to friends and family for their unwavering love and encouragement. Additionally, I am thankful to the editorial team at O’Reilly for their invaluable advice and resources. Finally, I am grateful to the many readers who have been a wonderful source of inspiration for me throughout this journey. Thank you all.
CHAPTER 1
Data Mesh Introduction

Youngsters think that at some point data architectures were easy, and then data volume, velocity, variety grew and we needed new architectures which are hard. In reality, data problems were always organization problems and therefore were never solved.
Gwen (Chen) Shapira, Kafka: The Definitive Guide (O’Reilly)

If you’re working at a growing company, you’ll realize that a positive correlation exists between company growth and the scale of ingress data. The additional data can come from increased usage of existing applications or from newly added applications and features. It’s up to the data engineer to organize, optimize, process, govern, and serve this growing data to consumers while maintaining service-level agreements (SLAs). Most likely, these SLAs were guaranteed to the consumers without the data engineer’s input.

The first thing you learn when working with such a large amount of data is that once data processing starts to encroach on the guarantees made by these SLAs, more focus is put on staying within the SLAs, and things like data governance are marginalized. This in turn generates a lot of distrust in the data being served and ultimately distrust in the analytics, the same analytics that can be used to improve operational applications to generate more revenue or prevent revenue loss.

If you replicate this problem across all lines of business in the enterprise, you get very unhappy data engineers trying to speed up data pipelines within the capacity of the data lake and data processing clusters. This is the position where I found myself more often than not.

So what is a data mesh? The term “mesh” in “data mesh” was taken from the term “service mesh,” which is a means of adding observability, security, discovery, and reliability at the platform level rather than the application layer. A service mesh is typically implemented as a scalable set of network proxies deployed alongside application code (a pattern sometimes called a sidecar). These proxies handle communication between microservices and also act as a point where service mesh features are introduced.

Microservice architecture is at the core of a streaming data mesh architecture. It introduces a fundamental change: monolithic applications are decomposed into loosely coupled, smaller, highly maintainable, agile, and independently scalable services that go beyond the capacity of any monolithic architecture. Figure 1-1 shows this decomposition of a monolithic application into a more scalable microservice architecture without losing the business purpose of the application.

Figure 1-1. Decomposing a monolithic application into microservices that communicate with one another via a service mesh

A data mesh seeks to accomplish the same goals that microservices achieved for monolithic applications. As Figure 1-2 shows, a data mesh tries to create the same loosely coupled, smaller, highly maintainable, agile, and independently scalable data products that go beyond the capacity of any monolithic data lake architecture.

Figure 1-2. Monolithic data lake/warehouse decomposed to data products and domains that communicate via a data mesh

Zhamak Dehghani (whom I refer to as ZD in this book) is the pioneer of the data mesh pattern. If you are not familiar with ZD and her data mesh blog, I highly recommend reading it as well as her very popular book Data Mesh (O’Reilly). I will give a simple overview to help you get a basic understanding of the pillars that make up the data mesh architectural pattern so that I can refer to them throughout the book.

In this chapter we will set up the basics of what a data mesh is before we introduce a streaming data mesh in Chapter 2. This lays a foundation for better understanding as we overlay ideas of streaming. We will then talk about other architectures that share similarities with data mesh to help delineate them. These other architectures tend to confuse data architects when designing a data mesh, and it is best to get clarity before we introduce data mesh to streaming.

Data Divide

ZD’s blog talks about a data divide, illustrated in Figure 1-3, to help describe the movement of data within a business. This foundational concept will help in understanding how data drives business decisions and the monolithic issues that come with it. To summarize, the operational data plane holds the data stores that back the applications that power a business.

Figure 1-3. The data divide separating the operational data plane from the analytical data plane

An extract, transform, and load (ETL) process replicates the operational data to the analytical data plane. The replication exists because executing analytical queries directly on operational data stores would take away compute and storage resources needed to generate revenue for your business. The analytical plane contains data lakes, warehouses, and lakehouses used to derive insights. These insights are fed back to the operational data plane to make improvements and grow the business.
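The movement of data across the divide can be sketched in a few lines. The following is a minimal illustration only, assuming two in-memory SQLite databases stand in for the operational and analytical planes; the `orders` and `customer_revenue` tables and the `run_etl` function are hypothetical names invented for this sketch and do not come from the book.

```python
import sqlite3

# Operational plane: a store backing a hypothetical ordering application.
operational = sqlite3.connect(":memory:")
operational.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER)"
)
operational.executemany(
    "INSERT INTO orders (customer, amount_cents) VALUES (?, ?)",
    [("alice", 1250), ("bob", 800), ("alice", 2750)],
)

# Analytical plane: a separate store, so analytical queries never
# compete with the application for operational resources.
analytical = sqlite3.connect(":memory:")
analytical.execute(
    "CREATE TABLE customer_revenue (customer TEXT PRIMARY KEY, total_cents INTEGER)"
)


def run_etl(source, target):
    """Replicate operational data to the analytical plane (hypothetical helper)."""
    # Extract: read the raw rows from the operational store.
    rows = source.execute("SELECT customer, amount_cents FROM orders").fetchall()
    # Transform: aggregate revenue per customer.
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0) + amount
    # Load: replace the analytical table with the fresh aggregate.
    target.execute("DELETE FROM customer_revenue")
    target.executemany(
        "INSERT INTO customer_revenue VALUES (?, ?)", sorted(totals.items())
    )
    target.commit()
    return len(totals)


loaded = run_etl(operational, analytical)
# The insight is derived on the analytical side only.
insights = analytical.execute(
    "SELECT customer, total_cents FROM customer_revenue ORDER BY total_cents DESC"
).fetchall()
print(insights)
```

The point of the sketch is the separation: the aggregate query runs against the analytical copy, while the operational store is read once during extraction. A real pipeline would run on a schedule or, as later chapters discuss, as a stream.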
With the help of Kubernetes on the operational plane, applications have evolved from sluggish, monolithic applications to agile and scalable microservices that intercommunicate, creating a service mesh. The same cannot be said for the analytical plane. The goal of the data mesh is to do just that: to break up the monolithic analytical plane into a decentralized solution that enables agile, scalable, and easier-to-manage data. We will refer to the operational plane and analytical plane throughout the book, so it’s important to establish this understanding early as we start to build a streaming data mesh example.

Data Mesh Pillars

The foundation of the data mesh architecture is supported by the pillars defined in Table 1-1. We will quickly summarize them in the following sections, covering the salient concepts of each, so that we can focus on the implementation of a streaming data mesh in later chapters.

Table 1-1. Data mesh pillars defined by ZD

Data ownership
    Decentralization and distribution of responsibility to people who are closest to the data in order to support continuous change and scalability.
Data as a product
    Discovering, understanding, trusting, and ultimately using quality data.
Self-service data platform
    Self-service data infrastructure as a platform to enable domain autonomy.
Federated computational data governance
    Governance model that embraces decentralization, domain self-sovereignty, and interoperability through global standardization. A dynamic topology and most importantly automated execution of decisions by the platform.

Data ownership and data as a product make up the core of the data mesh pillars. Self-service data platform and federated computational data governance exist to support the first two pillars. We’ll briefly discuss these four now and devote a whole chapter to each pillar beginning with Chapter 3.
Data Ownership

As mentioned previously, the primary pillar of a data mesh is to decentralize the data so that its ownership is given back to the team that originally produced it (or at least knows and cares about it the most). The data engineers within this team will be assigned a domain, one in which they are experts in the data itself. Some examples of domains are analytics, inventory, and application(s). These are the groups that likely were previously writing to and reading from the monolithic data lake.

Domains are responsible for capturing data from its true source. Each domain transforms, enriches, and ultimately provides that data to its consumers. There are three types of domains:
Producer only
    Domains that only produce data and don’t consume it from other domains
Consumer only
    Domains that only consume data from other domains
Producer and consumer
    Domains that produce and consume data to and from other domains, respectively

Following the domain-driven design (DDD) approach, which models domain objects defined by the business domain’s experts and implemented in software, the domain knows the specific details of its data, such as schema and data types that adhere to these domain objects. Since data is defined at the domain level, it is the best place to define specifics about its definition, transformation, and governance.

Data as a Product

Since data now belongs to a domain, we require a way to serve data between domains. Since this data needs to be consumable and usable, it needs to be treated as any other product so consumers will have a good data experience. From this point forward, we will call any data being served to other domains data products.

Defining what a “good experience” is with data products is a task that has to be agreed upon among the domains in the data mesh. An agreed-to definition will help provide well-defined expectations among the participating domains in the mesh. Table 1-2 lists some ideas to think about that will help create a “good experience” for data product consumers and help with building data products in a domain.

Table 1-2. Considerations that can ease the development of data products and create a “good experience” with them

Data products should be easily consumable.
    Some examples could be:
    • Cleanliness
    • Preparedness
    • High throughput
    • Interoperability
Engineers should have a generalist skill set.
    Engineers need to build data products without needing tools that require hyper-specialized skills. These are a possible minimum set of skills needed to build data products:
    • SQL
    • YAML
    • JSON
Data products should be searchable.
    When publishing a data product to the data mesh, a data catalog will be used for discovery, metadata views (usage, security, schema, and lineage), and access requests to the data product by domains that may want to consume it.

Federated Computational Data Governance

Since domains are used to create data products, and sharing data products across many domains ultimately builds a mesh of data, we need to ensure that the data being served follows some guidelines. Data governance involves creating and adhering to a set of global rules, standards, and policies applied to all data products and their interfaces to ensure a collaborative and interoperable data mesh community. These guidelines must be agreed upon among the participating data mesh domains.

Data mesh is not completely decentralized. The data is decentralized in domains, but the mesh part of data mesh is not. Data governance is critical in building the mesh in a data mesh. Examples of this include building self-services, defining security, and enforcing interoperability practices.

Here are some things to consider when thinking about data governance for a data mesh: authorization, authentication, data and metadata replication methods, schemas, data serialization, and tokenization/encryption.

Self-Service Data Platform

Because following these pillars requires a set of hyper-specialized skills, a set of services must be created in order to build a data mesh and its data products. These tools require compatibility with skills that are accessible to a more generalist engineer. When building a data mesh, it is necessary to enable existing engineers in a domain to perform the tasks required. Domains have to capture data from their operational stores, transform (join or enrich, aggregate, balance) that data, and publish their data products to the data mesh.
Self-services are the “easy buttons” that make a data mesh easy to adopt and highly usable.

In summary, the self-services enable the domain engineers to take on many of the tasks the data engineer was responsible for across all lines of the business. A data mesh not only breaks up the monolithic data lake, but also breaks up the monolithic role of the data engineer into simple tasks the domain engineers can perform. We cannot replicate what was done in the data lake to all of the domains. We instead build self-services so that the domains can build a data mesh and publish data products with simple tools and general skills.
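To make the “easy button” idea concrete, here is a minimal sketch of what a self-service publish step might look like, assuming a domain engineer supplies only a JSON spec and a SQL statement (two of the generalist skills listed in Table 1-2). Everything in it is hypothetical: `publish_data_product`, the `CATALOG` dictionary, the required spec fields, and the `stock` table are illustrative stand-ins, not the book's platform or any real product's API.

```python
import json
import sqlite3

# Hypothetical in-memory stand-in for the mesh's data catalog.
CATALOG = {}

# Hypothetical global rule: every data product must declare these fields.
REQUIRED_FIELDS = {"domain", "name", "owner", "sql"}


def publish_data_product(spec_json, operational_db):
    """Capture, transform, and publish a data product from a declarative spec."""
    spec = json.loads(spec_json)
    missing = REQUIRED_FIELDS - spec.keys()
    if missing:
        raise ValueError(f"spec is missing required fields: {sorted(missing)}")
    # Capture and transform: the domain expresses both as plain SQL.
    cursor = operational_db.execute(spec["sql"])
    columns = [column[0] for column in cursor.description]
    records = [dict(zip(columns, row)) for row in cursor.fetchall()]
    # Publish: register the product so other domains can discover it.
    key = f'{spec["domain"]}.{spec["name"]}'
    CATALOG[key] = {"owner": spec["owner"], "schema": columns, "records": records}
    return key


# Usage: an inventory domain publishes stock on hand per SKU.
inventory_db = sqlite3.connect(":memory:")
inventory_db.execute("CREATE TABLE stock (sku TEXT, warehouse TEXT, quantity INTEGER)")
inventory_db.executemany(
    "INSERT INTO stock VALUES (?, ?, ?)",
    [("A-1", "east", 5), ("A-1", "west", 7), ("B-2", "east", 3)],
)
spec = json.dumps({
    "domain": "inventory",
    "name": "stock_on_hand",
    "owner": "inventory-team",
    "sql": "SELECT sku, SUM(quantity) AS on_hand FROM stock GROUP BY sku ORDER BY sku",
})
product_key = publish_data_product(spec, inventory_db)
print(product_key, CATALOG[product_key]["records"])
```

The design choice worth noticing is that the domain engineer never touches the catalog or the validation logic; those belong to the platform, which is where a globally agreed rule (here, the required fields) can be enforced automatically for every domain.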