Practical Lakehouse Architecture
Designing and Implementing Modern Data Platforms at Scale
Gaurav Ashok Thalpati
Practical Lakehouse Architecture
by Gaurav Ashok Thalpati

Copyright © 2024 Gaurav Ashok Thalpati. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Andy Kwan
Development Editor: Jeff Bleiel
Production Editor: Christopher Faucher
Copyeditor: Nicole Taché
Proofreader: Tove Innis
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

August 2024: First Edition

Revision History for the First Edition
2024-07-24: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098153014 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Practical Lakehouse Architecture, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-15301-4

[LSI]
Preface

It’s 2024—the year of AI! Just like 2023 and 2022, and a few years before that. In today’s world, AI is everywhere. But AI needs data. Data that is of good quality. Data that is discoverable. Data that humans and machines can easily consume.

But how do we ensure that we make such data available? By implementing robust data platforms that ingest, store, and maintain this data to democratize it for all its users.

Today’s best-in-class, data-driven organizations leverage AI and depend heavily on data. They have invested in modern data platforms that support their current and future demands. Modern data platforms need modern data architectures, like lakehouses, to support their BI-to-AI needs. Lakehouse architecture—the main topic of this book—leverages technology advancements to simplify data platform design and enables organizations to build scalable and open platforms.

The lakehouse has gained popularity in the last few years, with several organizations, product vendors, and data practitioners implementing their platforms using this architecture. There’s no better time to explore, understand, and evaluate the lakehouse for your use cases, and this book can help you get started on your journey.

Who Should Read This Book?
This book is for all data practitioners who handle large volumes of data and are responsible for designing and implementing modern data platforms.

This book is a comprehensive guide for data architects and can help them understand key considerations, establish design principles, and make critical decisions when implementing a data platform. For data engineers, this book will help them understand key concepts like open table formats, schema evolution, and time travel, which they can leverage when implementing data pipelines. Other data personas, like data analysts and data scientists, will learn about crucial topics like lakehouse data management, data discovery, access control, and sensitive data handling.

Data practitioners new to lakehouse architecture can read this book to learn the core concepts. Experienced data architects and senior data engineers can use this guide to make key design decisions during the design phase. And data leaders can refer to this book when planning their lakehouse initiatives.

Why I Wrote This Book

When I started working on a lakehouse project a few years back, the open table formats were still evolving, and not all cloud services supported lakehouse technologies like open table formats. Not many data practitioners knew the benefits of lakehouse architecture, either, or understood how it could help simplify their data landscape. There was not much material available offering end-to-end guidance on designing and implementing a lakehouse using different technologies across cloud platforms. That’s when I started blogging about these topics to share what I had learned and explored. When I got the opportunity to write this book
on the same subject, I thought it was the right time to share my knowledge and observations with a larger audience.

This book is my attempt to explain in simple words how to design and implement a lakehouse. I’ve provided several examples across AWS, Azure, GCP, Databricks, Snowflake, and other platforms to explain various data management and governance processes. I hope you will find this book helpful for implementing your data platforms.

Navigating This Book

This book has nine chapters, each covering a different aspect of designing and implementing a lakehouse data platform.

Chapter 1 introduces you to lakehouse architecture and the key concepts, features, and benefits of implementing a data platform using lakehouse architecture. This chapter will also help you understand the fundamental concepts for building data platforms.

Chapter 2 discusses traditional architectures like data warehouses and data lakes and covers how lakehouse architecture stands out compared to these patterns. If you are new to data warehouses or data lakes, this chapter will be a good primer for understanding these architectures.

Chapter 3 explores the storage layer—the heart of the lakehouse. This chapter explains open table formats like Apache Iceberg, Apache Hudi, and Delta Lake. It also describes the key considerations for evaluating different file and table formats in order to select the right one for your use case.
Chapter 4 focuses on data catalogs and will help you understand the overall metadata management process within a lakehouse. It provides an overview of data catalog services across AWS, Azure, and GCP platforms, along with some popular third-party products.

Chapter 5 explores the different compute engine options for data engineering and consumption activities. It describes the factors that will impact your decision-making process when selecting the right compute engine.

Chapter 6 discusses the governance and security aspects of data and AI assets within a lakehouse. It also lists the activities you should perform, based on your role, to maintain the governance and security of data within the lakehouse.

Chapter 7 gives the big-picture view of designing your lakehouse by combining storage, compute, and data catalogs. This chapter is critical for data architects who have to make choices during the design process. At the end of this chapter, you will find a questionnaire you can refer to during talks with different stakeholders.

While all the previous chapters discuss an ideal lakehouse implementation, Chapter 8 provides a reality check by highlighting the challenges you can face while implementing a lakehouse. This chapter gives you ideal versus real-world scenarios and explains how to tackle these challenges to build a lakehouse in the real world.

The final chapter, Chapter 9, explores the future of lakehouses. It introduces some of the new file and table formats, innovative products, and new approaches to implementing a lakehouse platform.
O’Reilly Online Learning

NOTE
For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
TIP
This element signifies a tip or suggestion.

NOTE
This element signifies a general note.

WARNING
This element indicates a warning or caution.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-889-8969 (in the United States or Canada)
707-827-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://oreilly.com/about/contact.html
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/lakehouse-architecture.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media

Watch us on YouTube: https://youtube.com/oreillymedia

Acknowledgments

I accidentally started my data journey a couple of decades ago. While interested in becoming an animator, I landed a job as a trainee ETL developer. These past 20 years have been about learning, understanding, and exploring data in various forms. Many people have helped, supported, and encouraged me during this journey, and this book is the result of their efforts.

I’m deeply grateful to all my colleagues, mentors, and customers for providing me with opportunities to work on some of the most exciting data and analytics projects. A big shout-out to the various data communities, user groups, content creators, and book authors around the globe for sharing their knowledge. You all have inspired me to write this book.

My heartfelt thanks to Shivam Panicker, Sivakumar Ponnusamy, and Ankush Gautam, the tech reviewers of this book, for their insights and suggestions, which have improved the book and genuinely added more value for readers.
Writing a book on my favorite topic is a dream come true. Thanks to the entire O’Reilly team for this once-in-a-lifetime opportunity. I’d like to thank:

Andy Kwan, my acquisitions editor, for trusting me to write this book and helping me through the initial proposal and approval process.

Jeff Bleiel, my development editor, for supporting me throughout my book-writing journey. This book would not have been possible without his edits, suggestions, and encouragement.

Nicole Taché, for copyediting and bringing this book to a better shape and form.

Christopher Faucher, my production editor, for coordinating and managing the production process and providing the final touches to this book.

Finally, I’d like to thank my family—my parents, Ashok and Archana, and my elder sister, Kirti—for their sacrifices to help me reach this stage in my life. Vishakha, my wife, has been my pillar of strength, and Soham, my son, has been my biggest supporter. This book would not have been possible without their continuous encouragement.

Last but not least, a big thanks to you, the reader of this book, for investing your time in reading it.
Chapter 1. Introduction to Lakehouse Architecture

All data practitioners, irrespective of their job profiles, perform two common and foundational activities—asking questions and finding answers! Any data person, whether they’re a data engineer, data architect, data analyst, data scientist, or even a data leader like a chief information officer (CIO) or chief data officer (CDO), must be curious and ask questions. Finding answers to complex questions is difficult. But the more challenging task is to ask the right questions. The “art of the possible” can only be explored by: (1) asking the right questions and (2) uncovering answers by leveraging the data.

However simple this might sound, an organization needs an entire data platform to enable users to perform these tasks. This platform must support data ingestion and storage, provide tools for users to ask and discover new questions, perform advanced analysis, predict and forecast results, and generate insights. The data platform is the infrastructure that enables users to leverage data for business benefits.

To implement such data platforms, you need a robust data architecture—one that can help you define the core components of the data platform and establish the design principles for putting it into practice. Traditionally, organizations have used data warehouse or data lake architectures to implement their data platforms. Both of these architectural approaches have been widely adopted across industries. These architectures have also evolved to
leverage continuously improving modern technologies and patterns. Lakehouse architecture is one such modern architectural pattern that has developed in the last few years, and it has become a popular choice for data architects who are designing data platforms.

In this chapter, I’ll introduce you to the fundamental concepts related to data architecture, data platforms and their core components, and how data architecture helps build a data platform. Then, I’ll explain why there is a need for new architectural patterns like the lakehouse. You’ll learn the fundamentals and characteristics of lakehouse architecture, as well as the benefits of implementing a data platform with it. I’ll conclude the chapter with important takeaways, which summarize everything we’ve discussed and will help you remember the key points while reading the subsequent chapters in this book.

Let’s start with the fundamentals of data architecture.

Understanding Data Architecture

The data platform is the end result of implementing a data architecture using the chosen technology stack. Data architecture is the blueprint that defines the system you aim to build. It helps you visualize the end state of your target system and how you plan to achieve it. Data architecture defines the core components, the interdependencies between these components, the fundamental design principles, and the processes required to implement your data platform.

What Is Data Architecture?
To understand data architecture, consider this real-world analogy of a commercial construction site, such as a shopping mall or large residential development. Building a commercial property requires robust architecture, innovative design, an experienced architect, and an army of construction workers. Architecture plays the most crucial role in development—it ensures that the construction survives all weather conditions, helps people easily access and navigate through various floors, and enables quick evacuation in an emergency. Such architectures are based on certain guiding principles that define the core design and layout of the building blocks. Whether you are constructing a residential property, a commercial complex, or a sports arena, the foundational pillars and the core design principles of the architecture remain the same. However, the design patterns—interiors, aesthetics, and other features catering to the users—differ.

Similar to building a commercial property, data architecture plays the most crucial role when developing robust data platforms that will support various users and various data and analytics use cases. To build a platform that is resilient, scalable, and accessible to all users, the data architecture should be based on core guiding principles. Regardless of the industry or domain, the data architecture fundamentals remain the same. Data architecture, like the design architecture for a construction site, plays a significant role in determining how users adopt the platform.

This section covers the importance of data architecture in the overall process of implementing a data platform.

How Does Data Architecture Help Build a Data Platform?
Architecting the data platform is probably the most critical phase of a data project, and it often impacts key outcomes like the platform’s user adoption, scalability, compliance, and security. Data architecture helps you define the following foundational activities that you need to perform to start building your platform.

Defining core components

The core components of your data platform support daily activities like data ingestion, storage, transformation, consumption, and other common services related to management, operations, governance, and security. Data architecture helps you define these core components of your data platform. These core components are discussed in detail in the next section.

Defining component interdependencies and data flow

After defining the core components of your platform, you need to determine how they will interact. Data architecture defines these dependencies and helps you visualize how the data will flow between producers and consumers. Architecture also helps you determine and address any specific limitations or integration challenges you may face while moving data across these components.

Defining guiding principles

As part of the data architecture design process, you’ll also define the guiding principles for implementing your data platform. These principles help build a shared understanding between the various data teams that are using the platform. They ensure everyone follows the same design approach, common standards, and reusable frameworks. Defining shared guiding principles allows you to implement an optimized, efficient, and reliable data
platform solution. Guiding principles can be applied across various components and are defined based on the data architecture’s capabilities and limitations. For example, if your platform has multiple business intelligence (BI) tools provisioned, a guiding principle should specify which BI tool to use based on the data consumption pattern or use case.

Defining the technology stack

The architecture blueprint also informs the technology stack for the core components in the platform. When architecting the platform, it might be challenging to finalize all the underlying technologies—a detailed study of limitations and benefits, along with a proof of concept (PoC), would be required to finalize them. Data architecture helps to define key considerations for making these technology choices and the desired success factors when carrying out any PoC activities and finalizing the tech stack.

Aligning with overall vision and data strategy

Finally, and most critically, data architecture helps you implement a data platform that is aligned with your overall vision and your organization’s data strategy for achieving its business goals. For example, data governance is integral to any organization’s data strategy. Data architecture defines the components that ensure data governance is at the core of each process. These are components like metadata repositories, data catalogs, access controls, and data sharing principles.
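Component interdependencies and guiding principles like these can be captured declaratively so that every team works from the same source of truth. The Python sketch below is purely illustrative (the component names, the data-flow edges, and the BI tool names are assumptions for the example, not recommendations from this book): it encodes a platform's data flow as a small dependency map and a BI-tool guiding principle as a lookup table.

```python
# A minimal sketch of recording architecture decisions as data.
# All names here (components, edges, "Tool A"/"Tool B"/"Tool C")
# are hypothetical placeholders, not prescribed choices.

# Core components and the components each one feeds data into.
DATA_FLOW = {
    "source_systems": ["ingestion"],
    "ingestion": ["storage"],
    "storage": ["transformation", "consumption"],
    "transformation": ["storage"],  # transformed data lands back in storage
    "consumption": [],
}

# Guiding principle: which provisioned BI tool to use, by consumption pattern.
BI_TOOL_BY_PATTERN = {
    "self_service_dashboards": "Tool A",
    "pixel_perfect_reports": "Tool B",
    "ad_hoc_sql": "Tool C",
}

def downstream(component: str) -> list[str]:
    """Return every component reachable from `component` via the data flow."""
    seen: list[str] = []
    stack = list(DATA_FLOW.get(component, []))
    while stack:
        node = stack.pop()
        if node not in seen:  # guards against cycles (storage <-> transformation)
            seen.append(node)
            stack.extend(DATA_FLOW.get(node, []))
    return seen

print(sorted(downstream("ingestion")))   # ['consumption', 'storage', 'transformation']
print(BI_TOOL_BY_PATTERN["ad_hoc_sql"])  # Tool C
```

Keeping the blueprint in a reviewable artifact like this (or an equivalent config file) lets teams trace which consumers are affected when an upstream component changes, and makes the guiding principles enforceable rather than tribal knowledge.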
NOTE
Data governance is an umbrella term that comprises the various standards, rules, and policies that ensure all data processes follow the same formal guidelines. These guidelines help to ensure compliance with geographic or industry regulations, as well as to ensure the data is trustworthy, high quality, and delivers value. Organizations should follow data governance policies across all data management processes to maintain consumers’ trust in data and to remain compliant. Data governance helps organizations maintain better control over their data, discover data easily, and share data securely with consumers.

Now that you better understand data architecture and its significance in implementing data platforms, let’s discuss the core components of a data platform.

Core Components of a Data Platform

In this section, we’ll look at the core components of a data platform and how their features contribute to a robust data ecosystem. Figure 1-1 shows the core components for implementing a data platform based on a data architecture blueprint.
Figure 1-1. Core components of a data platform

Let’s explore these core components and their associated processes.

Source systems

Source systems provide data to the data platform that can be used for analytics, business intelligence (BI), and machine learning (ML) use cases. These sources include legacy systems, backend online transaction processing (OLTP) systems, IoT devices, clickstreams, and social media. Sources can be categorized based on multiple factors.

Internal and external source systems

Internal sources are the internal applications within an organization that produce data. These include in-house customer relationship management (CRM) systems, transactional databases, and machine-generated logs.
Internal sources are typically owned by the internal domain-specific teams that are responsible for generating the data. Data platforms often need data from external systems to enhance their internal data and gain competitive insights. Examples of data that come from external source systems are exchange rates, weather information, and market research data.

Batch, near real-time, and streaming systems

Until a couple of decades ago, most source systems could send only batch data, meaning that they would generally send the data at the end of the day as a daily batch process. With the increasing demand for near real-time insights and analytics, source systems started sending data on a near real-time basis. These systems can now share data as multiple, smaller micro-batches at a fixed interval that can be as low as a few minutes. Sources like IoT devices, social media feeds, and clickstreams send data as a continuous stream that should be ingested and processed in real time to get the maximum value.

Structured, semi-structured, and unstructured data

Source systems traditionally produced only structured data in tables or fixed-structure files. With advances in data interchange formats, there was increased production of semi-structured data in the form of XML and JSON files. And as organizations started implementing big data solutions, they started generating large volumes of unstructured data in the form of images, videos, and machine logs. Your data platform should support all types of source systems, sending different types of data at various time intervals.

Data ingestion