Delta Lake: Up and Running
Modern Data Lakehouse Architectures with Delta Lake

Authors: Bennie Haelen and Dan Davis

Category: Science

With the surge in big data and AI, organizations can rapidly create data products. However, the effectiveness of their analytics and machine learning models depends on the quality of their data. Delta Lake's open source format offers a robust lakehouse framework over platforms like Amazon S3, ADLS, and GCS. This practical book shows data engineers, data scientists, and data analysts how to get Delta Lake and its features up and running.

The ultimate goal of building data pipelines and applications is to gain insights from data. You'll understand how your choice of storage solution determines the robustness and performance of the data pipeline, from raw data to insights.

You'll learn how to:

• Use modern data management and data engineering techniques
• Understand how ACID transactions bring reliability to data lakes at scale
• Run streaming and batch jobs against your data lake concurrently
• Execute update, delete, and merge commands against your data lake
• Use time travel to roll back and examine previous data versions
• Build a streaming data quality pipeline following the medallion architecture

📄 File Format: PDF
💾 File Size: 1.4 MB

📄 Text Preview (First 20 pages)

Delta Lake: Up & Running
Modern Data Lakehouse Architectures with Delta Lake
Bennie Haelen & Dan Davis
Bennie Haelen is a principal architect with Insight Digital Innovation, a Microsoft and Databricks partner. He focuses on modern data warehousing, machine learning, generative AI, and IoT on commercial cloud platforms.

Dan Davis is a cloud data architect with a decade of experience delivering analytic insights and business value from data. He designs data platforms, frameworks, and processes to support data integration and analytics.

US $65.99 / CAN $82.99
ISBN: 978-1-098-13972-8
Bennie Haelen and Dan Davis

Delta Lake: Up and Running
Modern Data Lakehouse Architectures with Delta Lake

Beijing • Boston • Farnham • Sebastopol • Tokyo
Delta Lake: Up and Running
by Bennie Haelen and Dan Davis

Copyright © 2024 O'Reilly Media, Inc. All rights reserved. Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Aaron Black
Development Editor: Gary O'Brien
Production Editor: Ashley Stussy
Copyeditor: Charles Roumeliotis
Proofreader: Sonia Saruba
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

October 2023: First Edition

Revision History for the First Edition
2023-10-16: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098139728 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Delta Lake: Up and Running, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

The views expressed in this work are those of the authors and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O'Reilly and Databricks. See our statement of editorial independence.

ISBN: 978-1-098-13972-8 [LSI]
Table of Contents

Preface  ix

1. The Evolution of Data Architectures  1
    A Brief History of Relational Databases  2
    Data Warehouses  3
    Data Warehouse Architecture  3
    Dimensional Modeling  7
    Data Warehouse Benefits and Challenges  8
    Introducing Data Lakes  10
    Data Lakehouse  14
    Data Lakehouse Benefits  15
    Implementing a Lakehouse  16
    Delta Lake  18
    The Medallion Architecture  21
    The Delta Ecosystem  22
    Delta Lake Storage  22
    Delta Sharing  23
    Delta Connectors  23
    Conclusion  24

2. Getting Started with Delta Lake  25
    Getting a Standard Spark Image  26
    Using Delta Lake with PySpark  26
    Running Delta Lake in the Spark Scala Shell  27
    Running Delta Lake on Databricks  28
    Creating and Running a Spark Program: helloDeltaLake  29
    The Delta Lake Format  30
    Parquet Files  31
    Writing a Delta Table  34
    The Delta Lake Transaction Log  36
    How the Transaction Log Implements Atomicity  36
    Breaking Down Transactions into Atomic Commits  36
    The Transaction Log at the File Level  37
    Scaling Massive Metadata  44
    Conclusion  48

3. Basic Operations on Delta Tables  49
    Creating a Delta Table  50
    Creating a Delta Table with SQL DDL  50
    The DESCRIBE Statement  53
    Creating Delta Tables with the DataFrameWriter API  54
    Creating a Delta Table with the DeltaTableBuilder API  57
    Generated Columns  58
    Reading a Delta Table  60
    Reading a Delta Table with SQL  60
    Reading a Table with PySpark  63
    Writing to a Delta Table  64
    Cleaning Out the YellowTaxis Table  65
    Inserting Data with SQL INSERT  65
    Appending a DataFrame to a Table  66
    Using the OverWrite Mode When Writing to a Delta Table  68
    Inserting Data with the SQL COPY INTO Command  68
    Partitions  70
    User-Defined Metadata  76
    Using SparkSession to Set Custom Metadata  77
    Using the DataFrameWriter to Set Custom Metadata  78
    Conclusion  79

4. Table Deletes, Updates, and Merges  81
    Deleting Data from a Delta Table  81
    Table Creation and DESCRIBE HISTORY  82
    Performing the DELETE Operation  84
    DELETE Performance Tuning Tips  86
    Updating Data in a Table  87
    Use Case Description  87
    Updating Data in a Table  88
    UPDATE Performance Tuning Tips  90
    Upsert Data Using the MERGE Operation  90
    Use Case Description  90
    The MERGE Dataset  91
    The MERGE Statement  92
    Analyzing the MERGE Operation with DESCRIBE HISTORY  97
    Inner Workings of the MERGE Operation  98
    Conclusion  98

5. Performance Tuning  99
    Data Skipping  99
    Partitioning  102
    Partitioning Warnings and Considerations  108
    Compact Files  109
    Compaction  109
    OPTIMIZE  110
    ZORDER BY  113
    ZORDER BY Considerations  117
    Liquid Clustering  118
    Enabling Liquid Clustering  119
    Operations on Clustered Columns  120
    Liquid Clustering Warnings and Considerations  122
    Conclusion  123

6. Using Time Travel  125
    Delta Lake Time Travel  126
    Restoring a Table  128
    Restoring via Timestamp  129
    Time Travel Under the Hood  129
    RESTORE Considerations and Warnings  131
    Querying an Older Version of a Table  132
    Data Retention  134
    Data File Retention  134
    Log File Retention  136
    Setting File Retention Duration Example  136
    Data Archiving  137
    VACUUM  138
    VACUUM Syntax and Examples  139
    How Often Should You Run VACUUM and Other Maintenance Tasks?  140
    VACUUM Warnings and Considerations  141
    Changing Data Feed  143
    Enabling the CDF  144
    Viewing the CDF  146
    CDF Warnings and Considerations  149
    Conclusion  150

7. Schema Handling  151
    Schema Validation  152
    Viewing the Schema in the Transaction Log Entries  152
    Schema on Write  153
    Schema Enforcement Example  154
    Schema Evolution  157
    Adding a Column  158
    Missing Data Column in Source DataFrame  160
    Changing a Column Data Type  162
    Adding a NullType Column  164
    Explicit Schema Updates  165
    Adding a Column to a Table  166
    Adding Comments to a Column  167
    Changing Column Ordering  168
    Delta Lake Column Mapping  169
    Renaming a Column  171
    Replacing the Table Columns  172
    Dropping a Column  174
    The REORG TABLE Command  177
    Changing Column Data Type or Name  179
    Conclusion  181

8. Operations on Streaming Data  183
    Streaming Overview  184
    Spark Structured Streaming  184
    Delta Lake and Structured Streaming  184
    Streaming Examples  185
    Hello Streaming World  185
    AvailableNow Streaming  195
    Updating the Source Records  197
    Reading a Stream from the Change Data Feed  201
    Conclusion  204

9. Delta Sharing  205
    Conventional Methods of Data Sharing  205
    Legacy and Homegrown Solutions  206
    Proprietary Vendor Solutions  207
    Cloud Object Storage  209
    Open Source Delta Sharing  210
    Delta Sharing Goals  210
    Delta Sharing Under the Hood  211
    Data Providers and Recipients  211
    Benefits of the Design  212
    The delta-sharing Repository  213
    Step 1: Installing the Python Connector  213
    Step 2: Installing the Profile File  213
    Step 3: Reading a Shared Table  214
    Conclusion  215

10. Building a Lakehouse on Delta Lake  217
    Storage Layer  218
    What Is a Data Lake?  218
    Types of Data  218
    Key Benefits of a Cloud Data Lake  219
    Data Management  222
    SQL Analytics  225
    SQL Analytics via Spark SQL  225
    SQL Analytics via Other Delta Lake Integrations  227
    Data for Data Science and Machine Learning  229
    Challenges with Traditional Machine Learning  230
    Delta Lake Features That Support Machine Learning  231
    Putting It All Together  233
    Medallion Architecture  234
    The Bronze Layer (Raw Data)  236
    The Silver Layer  237
    The Gold Layer  237
    The Complete Lakehouse  238
    Conclusion  240

Index  241
Preface

The goal of this book is to provide data practitioners with practical instructions on how to set up Delta Lake and start using its unique features. This book is designed for an audience that fits any of the following profiles:

• Data practitioners with a Spark background
• Data practitioners unfamiliar with or new to Delta Lake needing an introduction to the technology, the problems it solves, its main features and terminology, as well as how to get started using it
• Data practitioners looking to learn about the features and benefits of modern lakehouse architectures

It is important to note that this book and the features discussed apply to the Delta Lake open source framework (Delta Lake OSS). Proprietary features and optimizations that some companies offer around Delta Lake are considered out of scope for this book.

First, we discuss why Delta Lake is an important tool for building modern enterprise data platforms and data science and AI solutions, followed by instructions on how to set up Delta Lake with Spark. Each of the subsequent chapters will walk you through the fundamental functions and operations of Delta Lake using step-by-step instructions and real-world examples.

The code examples in the book range from snippets that can be used in a PySpark shell to those designed to be run with a complete end-to-end notebook. In this book, all code snippets will be in Python, SQL, and, where necessary, shell commands.

A GitHub repository is provided to aid readers in following along throughout the book. Datasets, files, and code samples are provided in the repo and referred to throughout the book. Below are some important things to note about using the GitHub repo:
Code samples
Code samples are organized in the repo by chapter, and for most chapters a chapter initialization script is intended to be executed before any of the related code for that chapter. This initialization code is required in order to set up the appropriate Delta tables and datasets to best demonstrate the topics being discussed. These chapter initialization scripts are explicitly called out in the text of the book before the first set of sample code for a given chapter.

Code sample data files
Data files required to execute the provided code samples live in the GitHub repository. The data files in the GitHub repo come from the popular NYC Yellow and Green taxi trip records. These files were downloaded and curated for effective demonstration throughout this book.

Method for running Delta Lake for this book
The method for running Delta Lake for the purposes of this book and the code in the provided GitHub repo is Databricks Community Edition. Databricks Community Edition was chosen to develop and run the code samples because it is free, simplifies the setup of Spark and Delta Lake, and does not require your own cloud account or for you to supply cloud compute or storage resources. The Delta tables, datasets, and code samples used in this book and the GitHub repo were developed and tested on Databricks Community Edition hosted on Azure, using Azure Data Lake Storage Gen2 as the underlying storage layer and Databricks Runtime 12.2 LTS. Please note that if you are running the code samples on Spark and Delta Lake outside of Databricks (e.g., on your local machine), there will be additional setup and configuration, and potentially different editor syntax options, to account for (see the sketch after these notes).

Notebooks
You will also see the term notebook. A notebook refers to a Databricks notebook, the primary tool for developing code and presenting results throughout the book.

Code languages
Delta Lake supports multiple languages (Scala, Java, Python, and SQL) for a variety of functionality. This book will focus primarily on Python and SQL. Code samples will provide code in the language deemed most appropriate to the topic being discussed. Alternatives for similar functionality in other languages will not always be provided; please refer to the Delta Lake documentation for similar functionality in alternative languages. For code snippets used throughout this book, the default language is Python. To indicate use of a language other than Python in a code snippet, you will see language magic commands, that is, %<language> (e.g., %sql). You can assume that code snippets without a language magic command are using Python.
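To illustrate the local-machine case, here is a minimal sketch of creating a Delta-enabled SparkSession outside of Databricks. This is not the book's own setup script; it assumes you have pip-installed compatible versions of the pyspark and delta-spark packages (see the Delta Lake documentation for the version compatibility matrix), and the application name is illustrative:

    import pyspark
    from delta import configure_spark_with_delta_pip

    # Build a SparkSession with the Delta Lake extensions wired in
    builder = (
        pyspark.sql.SparkSession.builder.appName("helloDeltaLake")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )

    # configure_spark_with_delta_pip adds the delta-spark package
    # to the session before creating it
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

On Databricks itself, none of this is needed: the runtime provides a preconfigured spark session with Delta Lake already enabled.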
How to Contact Us

Please address comments and questions concerning this book to the publisher:

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-889-8969 (in the United States or Canada)
707-829-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://www.oreilly.com/about/contact.html

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/delta-lake-up-and-running-1e.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://youtube.com/oreillymedia

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/benniehaelen/delta-lake-up-and-running.

If you have a technical question or a problem using the code examples, please send email to support@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "Delta Lake: Up and Running by Bennie Haelen and Dan Davis (O'Reilly). Copyright 2024 O'Reilly Media, Inc., 978-1-098-13972-8."

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
O'Reilly Online Learning

For more than 40 years, O'Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O'Reilly's online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O'Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

Acknowledgments

We would like to thank our technical reviewers: Adam Breindel, Andrei Ionescu, and Jobenish Purushothaman. Their attention to detail, feedback, and thoughtful suggestions played a pivotal role in shaping the content of this book while ensuring its accuracy. Their input undoubtedly helped make this a better book and a more valuable resource for readers.

Aside from the technical reviewers, we also received valuable feedback throughout the writing process from other contributors. We would like to extend our thanks to the following: Alex Ott, Anthony Krinsky, Artem Sheiko, Bilal Obeidat, Carlos Morillo, Eli Swanson, Guillermo G. Schiava D'Albano, Jitesh Soni, Joe Widen, Kyle Hale, Marco Scagliola, Nick Karpov, Nouran Younis, Ori Zohar, Sirui Sun, Susan Pierce, and Youssef Mrini. Without your input, this book would not be the valuable resource it is.

Finally, we would like to thank the open source community. Without the community's contributions and collective efforts, Delta Lake would not have the remarkable capabilities it has today. The community's commitment to innovation helps drive Delta Lake's evolution and impact, and we, along with others, cannot express our thanks and appreciation enough.

Bennie Haelen

I would like to thank my wonderful wife Jenny. You have always been there to encourage and motivate me throughout the writing of this book; you are the great inspiration of my life. Thanks to my co-author Dan for being there through difficult periods in my life. Dan, you have a great career ahead of you. Thanks to my friends and colleagues whom I can always reach out to with challenging questions, no matter the time of day.
Dan Davis

I would like to thank my family. Your continued encouragement and support have provided the foundation of my journey to where I am today, including the writing of this book. Thank you for always being a constant source of motivation. I would also like to thank all of the friends and colleagues I have learned from and who have continually supported me along the way.

I cannot thank my co-author, Bennie, enough. Thank you for being the mentor that you are, providing me with support, and presenting me with great opportunities. And last but not least, I would like to thank my beloved companion, who is always by my side whether he enjoys it or not, my dog River.
Chapter 1. The Evolution of Data Architectures

As a data engineer, you want to build large-scale data, machine learning, data science, and AI solutions that offer state-of-the-art performance. You build these solutions by ingesting large amounts of source data, then cleansing, normalizing, and combining the data, and ultimately presenting this data to the downstream applications through an easy-to-consume data model.

As the amount of data you need to ingest and process is ever increasing, you need the ability to scale your storage horizontally. Additionally, you need the ability to dynamically scale your compute resources to address processing and consumption spikes. Since you are combining your data sources into one data model, you not only need to append data to tables, but you often need to insert, update, or delete (i.e., MERGE or UPSERT) records based upon complex business logic. You want to be able to perform these operations with transactional guarantees, and without having to constantly rewrite large data files.

In the past, the preceding set of requirements was addressed by two distinct toolsets. The horizontal scalability and decoupling of storage and compute were offered by cloud-based data lakes, while relational data warehouses offered transactional guarantees. However, traditional data warehouses tightly coupled storage and compute into an on-premises appliance and did not have the degree of horizontal scalability associated with data lakes.

Delta Lake brings capabilities such as transactional reliability and support for UPSERTs and MERGEs to data lakes while maintaining the dynamic horizontal scalability and separation of storage and compute of data lakes. Delta Lake is one solution for building data lakehouses, an open data architecture combining the best of data warehouses and data lakes.
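As a concrete sketch of the kind of operation Delta Lake makes possible on a data lake, here is a minimal PySpark MERGE (an upsert) against a Delta table. The table path, source data, and column names are illustrative assumptions, not taken from the book, and a Delta-enabled SparkSession named spark is assumed:

    from delta.tables import DeltaTable

    # Target Delta table and a DataFrame of incoming changes (illustrative paths and names)
    target = DeltaTable.forPath(spark, "/data/lake/customers")
    updates = spark.read.parquet("/data/landing/customer_updates")

    # Upsert: update matching rows, insert new ones, in a single ACID transaction,
    # with no manual rewriting of the underlying data files
    (target.alias("t")
        .merge(updates.alias("u"), "t.customer_id = u.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

The mechanics of MERGE are covered in detail in Chapter 4.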
In this introduction, we will take a brief look at relational databases and how they evolved into data warehouses. Next, we will look at the key drivers behind the emergence of data lakes. We will address the benefits and drawbacks of each architecture, and finally show how the Delta Lake storage layer combines the benefits of each architecture, enabling the creation of data lakehouse solutions.

A Brief History of Relational Databases

In his historic 1970 paper,[1] E.F. Codd introduced the concept of looking at data as logical relations, independent of physical data storage. This logical relation between data entities became known as a database model or schema. Codd's writings led to the birth of the relational database.

[1] Codd, E.F. (1970). Relational Database: A Practical Foundation for Productivity. San Jose: San Jose Research Laboratory.

The first relational database systems were introduced in the mid-1970s by IBM and UBC. Relational databases and their underlying SQL language became the standard storage technology for enterprise applications throughout the 1980s and 1990s. One of the main reasons behind this popularity was that relational databases offered a concept called transactions. A database transaction is a sequence of operations on a database that satisfies four properties: atomicity, consistency, isolation, and durability, commonly referred to by their acronym ACID.

Atomicity ensures that all changes made to the database are executed as a single operation. This means that the transaction succeeds only when all changes have been performed successfully. For example, when an online banking system is used to transfer money from savings to checking, the atomicity property guarantees that the operation will only succeed when the money is deducted from the savings account and added to the checking account. The complete operation will either succeed or fail as a complete unit.

The consistency property guarantees that the database transitions from one consistent state at the beginning of the transaction to another consistent state at the end of the transaction. In our earlier example, the transfer of money would only happen if the savings account had sufficient funds. If not, the transaction would fail, and the balances would stay in their original, consistent state.

Isolation ensures that concurrent operations happening within the database do not affect each other. This property ensures that when multiple transactions are executed concurrently, their operations do not interfere with each other.

Durability refers to the persistence of committed transactions. It guarantees that once a transaction is completed successfully, it will result in a permanent state even in the event of a system failure. In our money transfer example, durability ensures that updates made to both the savings and checking accounts are persistent and can survive a potential system failure.
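These properties are easy to observe in any ACID-compliant relational database. The following toy sketch uses Python's built-in sqlite3 module (chosen purely for illustration; it is not from the book) to show atomicity and consistency in the bank transfer example: the failed transfer rolls back and leaves both balances untouched.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # The CHECK constraint encodes the consistency rule: no negative balances
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY,"
                 " balance REAL CHECK (balance >= 0))")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("savings", 100.0), ("checking", 50.0)])
    conn.commit()

    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - 200"
                         " WHERE name = 'savings'")
            conn.execute("UPDATE accounts SET balance = balance + 200"
                         " WHERE name = 'checking'")
    except sqlite3.IntegrityError:
        print("Transfer rejected; balances remain unchanged")

    # Atomicity: neither UPDATE took effect
    print(conn.execute("SELECT name, balance FROM accounts").fetchall())

Delta Lake's transaction log brings these same guarantees to files in a data lake, as Chapter 2 explains.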
Database systems continued to mature throughout the 1990s, and the advent of the internet in the mid-1990s led to an explosive growth of data and the need to store it. Enterprise applications were using relational database management system (RDBMS) technology very effectively. Flagship products such as SAP and Salesforce would collect and maintain massive amounts of data.

However, this development was not without its drawbacks. Enterprise applications would store the data in their own proprietary formats, leading to the rise of data silos. These data silos were owned and controlled by one department or business unit. Over time, organizations recognized the need to develop an enterprise view across these different data silos, leading to the rise of data warehouses.

Data Warehouses

While each enterprise application has some type of reporting built in, business opportunities were missed because of the lack of a comprehensive view across the organization. At the same time, organizations recognized the value of analyzing data over longer periods of time. Additionally, they wanted to be able to slice and dice the data over several cross-cutting subject matters, such as customers, products, and other business entities. This led to the introduction of the data warehouse: a central relational repository of integrated, historical data from multiple data sources that presents a single integrated, historical view of the business with a unified schema, covering all perspectives of the enterprise.

Data Warehouse Architecture

A simple representation of a typical data warehouse architecture is shown in Figure 1-1.
Figure 1-1. Data warehouse architecture

When we look at the diagram in Figure 1-1, we start with the data source layer on the left. Organizations need to ingest data from a set of heterogeneous data sources. While the data from the organization's enterprise resource planning (ERP) system(s) forms the backbone of the organizational model, we need to augment this data with the data from the operational systems running the day-to-day operations, such as human resources (HR) systems and workflow management software. Additionally, organizations might want to leverage the customer interaction data covered by their customer relationship management (CRM) and point of sale (POS) systems. In addition to the core data sources listed here, there is a need to ingest data from a wide array of external data sources, in a variety of formats, such as spreadsheets, CSV files, etc.

These different source systems each might have their own data format. Therefore, the data warehouse contains a staging area where the data from the different sources can be combined into one common format. To do this the system must ingest the data from the original data sources. The actual ingestion process varies by data source type: some systems allow direct database access, others allow data to be ingested through an API, while many data sources still rely on file extracts. Next, the data warehouse needs to transform the data into a standardized format, allowing the downstream processes to access the data easily. Finally, the transformed
The above is a preview of the first 20 pages.
