Uploader: 高宏飞
Shared on 2025-12-20

Author: James Densmore

Data pipelines are the foundation for success in data analytics. Moving data from numerous, diverse sources and transforming it to provide context is the difference between having data and actually gaining value from it. This pocket reference defines data pipelines and explains how they work in today's modern data stack. You'll learn common considerations and key decision points when implementing pipelines, such as batch versus streaming data ingestion and build versus buy. This book addresses the most common decisions made by data professionals and discusses foundational concepts that apply to open source frameworks, commercial products, and homegrown solutions. You'll learn:

• What a data pipeline is and how it works
• How data is moved and processed on modern data infrastructure, including cloud platforms
• Common tools and products used by data engineers to build pipelines
• How pipelines support analytics and reporting needs
• Considerations for pipeline maintenance, testing, and alerting

Tags
No tags
Publisher: O'Reilly Media
Publish Year: 2021
Language: English
Pages: 206
File Format: PDF
File Size: 7.6 MB
Text Preview (First 20 pages)

Data Pipelines Pocket Reference: Moving and Processing Data for Analytics
James Densmore
978-1-492-08783-0 [LSI]

Data Pipelines Pocket Reference
by James Densmore

Copyright © 2021 James Densmore. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Jessica Haberman
Developmental Editor: Corbin Collins
Production Editor: Katherine Tozer
Copyeditor: Kim Wimpsett
Proofreader: Abby Wheeler
Indexer: Ellen Troutman
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

March 2021: First Edition

Revision History for the First Edition
2021-02-10: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492087830 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Pipelines Pocket Reference, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents

Preface  vii

Chapter 1: Introduction to Data Pipelines  1
    What Are Data Pipelines?  1
    Who Builds Data Pipelines?  2
    Why Build Data Pipelines?  4
    How Are Pipelines Built?  5

Chapter 2: A Modern Data Infrastructure  7
    Diversity of Data Sources  7
    Cloud Data Warehouses and Data Lakes  14
    Data Ingestion Tools  15
    Data Transformation and Modeling Tools  16
    Workflow Orchestration Platforms  17
    Customizing Your Data Infrastructure  20

Chapter 3: Common Data Pipeline Patterns  21
    ETL and ELT  21
    The Emergence of ELT over ETL  23
    EtLT Subpattern  26
    ELT for Data Analysis  27
    ELT for Data Science  28
    ELT for Data Products and Machine Learning  29

Chapter 4: Data Ingestion: Extracting Data  33
    Setting Up Your Python Environment  34
    Setting Up Cloud File Storage  37
    Extracting Data from a MySQL Database  39
    Extracting Data from a PostgreSQL Database  63
    Extracting Data from MongoDB  67
    Extracting Data from a REST API  74
    Streaming Data Ingestions with Kafka and Debezium  79

Chapter 5: Data Ingestion: Loading Data  83
    Configuring an Amazon Redshift Warehouse as a Destination  83
    Loading Data into a Redshift Warehouse  86
    Configuring a Snowflake Warehouse as a Destination  95
    Loading Data into a Snowflake Data Warehouse  97
    Using Your File Storage as a Data Lake  99
    Open Source Frameworks  101
    Commercial Alternatives  102

Chapter 6: Transforming Data  105
    Noncontextual Transformations  106
    When to Transform? During or After Ingestion?  116
    Data Modeling Foundations  117

Chapter 7: Orchestrating Pipelines  149
    Apache Airflow Setup and Overview  151
    Building Airflow DAGs  161
    Additional Pipeline Tasks  170
    Advanced Orchestration Configurations  171
    Managed Airflow Options  176
    Other Orchestration Frameworks  177

Chapter 8: Data Validation in Pipelines  179
    Validate Early, Validate Often  179
    A Simple Validation Framework  183
    Validation Test Examples  198
    Commercial and Open Source Data Validation Frameworks  209

Chapter 9: Best Practices for Maintaining Pipelines  211
    Handling Changes in Source Systems  211
    Scaling Complexity  216

Chapter 10: Measuring and Monitoring Pipeline Performance  225
    Key Pipeline Metrics  225
    Prepping the Data Warehouse  226
    Logging and Ingesting Performance Data  228
    Transforming Performance Data  239
    Orchestrating a Performance Pipeline  246
    Performance Transparency  248

Index  251
Preface

Data pipelines are the foundation for success in data analytics and machine learning. Moving data from numerous, diverse sources and processing it to provide context is the difference between having data and getting value from it.

I’ve worked as a data analyst, data engineer, and leader in the data analytics field for more than 10 years. In that time, I’ve seen rapid change and growth in the field. The emergence of cloud infrastructure, and cloud data warehouses in particular, has created an opportunity to rethink the way data pipelines are designed and implemented.

This book describes what I believe are the foundations and best practices of building data pipelines in the modern era. I base my opinions and observations on my own experience as well as those of industry leaders who I know and follow. My goal is for this book to serve as a blueprint as well as a reference. While your needs are specific to your organization and the problems you’ve set out to solve, I’ve found success with variations of these foundations many times over. I hope you find it a valuable resource in your journey to building and maintaining data pipelines that power your data organization.
Who This Book Is For

This book’s primary audience is current and aspiring data engineers as well as analytics team members who want to understand what data pipelines are and how they are implemented. Their job titles include data engineers, technical leads, data warehouse engineers, analytics engineers, business intelligence engineers, and director/VP-level analytics leaders.

I assume that you have a basic understanding of data warehousing concepts. To implement the examples discussed, you should be comfortable with SQL databases, REST APIs, and JSON. You should be proficient in a scripting language, such as Python. Basic knowledge of the Linux command line and at least one cloud computing platform is ideal as well.

All code samples are written in Python and SQL and make use of many open source libraries. I use Amazon Web Services (AWS) to demonstrate the techniques described in the book, and AWS services are used in many of the code samples. When possible, I note similar services on other major cloud providers such as Microsoft Azure and Google Cloud Platform (GCP). All code samples can be modified for the cloud provider of your choice, as well as for on-premises use.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://oreil.ly/datapipelinescode.

If you have a technical question or a problem using the code examples, please send email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Pipelines Pocket Reference by James Densmore (O’Reilly). Copyright 2021 James Densmore, 978-1-492-08783-0.”

If you feel your use of code examples falls outside fair use or the permission given above, please feel free to contact us: permissions@oreilly.com.
O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/data-pipelines-pocket-ref.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit http://oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

Thank you to everyone at O’Reilly who helped make this book possible, especially Jessica Haberman and Corbin Collins. The invaluable feedback of three amazing technical reviewers, Joy Payton, Gordon Wong, and Scott Haines, led to critical improvements throughout. Finally, thank you to my wife Amanda for her encouragement from the moment this book was proposed, as well as my dog Izzy for sitting by my side during countless hours of writing.
Chapter 1: Introduction to Data Pipelines

Behind every glossy dashboard, machine learning model, and business-changing insight is data. Not just raw data, but data collected from numerous sources that must be cleaned, processed, and combined to deliver value. The famous phrase “data is the new oil” has proven true. Just like oil, the value of data is in its potential after it’s refined and delivered to the consumer. Also like oil, it takes efficient pipelines to deliver data through each stage of its value chain.

This Pocket Reference discusses what these data pipelines are and shows how they fit into a modern data ecosystem. It covers common considerations and key decision points when implementing pipelines, such as batch versus streaming data ingestion, building versus buying tooling, and more. Though it is not exclusive to a single language or platform, it addresses the most common decisions made by data professionals while discussing foundational concepts that apply to homegrown solutions, open source frameworks, and commercial products.

What Are Data Pipelines?

Data pipelines are sets of processes that move and transform data from various sources to a destination where new value can be derived. They are the foundation of analytics, reporting, and machine learning capabilities.

The complexity of a data pipeline depends on the size, state, and structure of the source data as well as the needs of the analytics project. In their simplest form, pipelines may only extract data from one source, such as a REST API, and load it to a destination, such as a SQL table in a data warehouse. In practice, however, pipelines typically consist of multiple steps including data extraction, data preprocessing, data validation, and at times training or running a machine learning model before delivering data to its final destination. Pipelines often contain tasks from multiple systems and programming languages. What’s more, data teams typically own and maintain numerous data pipelines that share dependencies and must be coordinated. Figure 1-1 illustrates a simple pipeline.

Figure 1-1. A simple pipeline that loads server log data into an S3 bucket, does some basic processing and structuring, and loads the results into an Amazon Redshift database.

Who Builds Data Pipelines?

With the popularization of cloud computing and software as a service (SaaS), the number of data sources organizations need to make sense of has exploded. At the same time, the demand for data to feed machine learning models, data science research, and time-sensitive insights is higher than ever. To keep up, data engineering has emerged as a key role on analytics teams. Data engineers specialize in building and maintaining the data pipelines that underpin the analytics ecosystem.

A data engineer’s purpose isn’t simply to load data into a data warehouse. Data engineers work closely with data scientists and analysts to understand what will be done with the data and help bring their needs into a scalable production state. Data engineers take pride in ensuring the validity and timeliness of the data they deliver. That means testing, alerting, and creating contingency plans for when something goes wrong. And yes, something will eventually go wrong!

The specific skills of a data engineer depend somewhat on the tech stack their organization uses. However, there are some common skills that all good data engineers possess.

SQL and Data Warehousing Fundamentals

Data engineers need to know how to query databases, and SQL is the universal language to do so. Experienced data engineers know how to write high-performance SQL and understand the fundamentals of data warehousing and data modeling. Even if a data team includes data warehousing specialists, a data engineer with warehousing fundamentals is a better partner and can fill more complex technical gaps that arise.

Python and/or Java

The language in which a data engineer is proficient will depend on the tech stack of their team, but either way a data engineer isn’t going to get the job done with “no code” tools even if they have some good ones in their arsenal. Python and Java currently dominate in data engineering, but newcomers like Go are emerging.

Distributed Computing

Solving a problem that involves high data volume and a desire to process data quickly has led data engineers to work with distributed computing platforms. Distributed computing combines the power of multiple systems to efficiently store, process, and analyze high volumes of data.
One popular example of distributed computing in analytics is the Hadoop ecosystem, which includes distributed file storage via the Hadoop Distributed File System (HDFS), processing via MapReduce, data analysis via Pig, and more. Apache Spark is another popular distributed processing framework, which is quickly surpassing Hadoop in popularity.

Though not all data pipelines require the use of distributed computing, data engineers need to know how and when to utilize such a framework.

Basic System Administration

A data engineer is expected to be proficient on the Linux command line and be able to perform tasks such as analyzing application logs, scheduling cron jobs, and troubleshooting firewall and other security settings. Even when working fully on a cloud provider such as AWS, Azure, or Google Cloud, they’ll end up using those skills to get cloud services working together and data pipelines deployed.

A Goal-Oriented Mentality

A good data engineer doesn’t just possess technical skills. They may not interface with stakeholders on a regular basis, but the analysts and data scientists on the team certainly will. The data engineer will make better architectural decisions if they’re aware of the reason they’re building a pipeline in the first place.

Why Build Data Pipelines?

In the same way that the tip of the iceberg is all that can be seen by a passing ship, the end product of the analytics workflow is all that the majority of an organization sees. Executives see dashboards and pristine charts. Marketing shares cleanly packaged insights on social media. Customer support optimizes call center staffing based on the output of a predictive demand model.

What most people outside of analytics often fail to appreciate is that to generate what is seen, there’s a complex machinery that is unseen. For every dashboard and insight that a data analyst generates and for each predictive model developed by a data scientist, there are data pipelines working behind the scenes. It’s not uncommon for a single dashboard, or even a single metric, to be derived from data originating in multiple source systems.

In addition, data pipelines do more than just extract data from sources and load them into simple database tables or flat files for analysts to use. Raw data is refined along the way to clean, structure, normalize, combine, aggregate, and at times anonymize or otherwise secure it. In other words, there’s a lot more going on below the water line.

Supplying Data to Analysts and Data Scientists

Don’t rely on data analysts and data scientists hunting for and procuring data on their own for each project that comes their way. The risks of acting on stale data, multiple sources of truth, and bogging down analytics talent in data acquisition are too great. Data pipelines ensure that the proper data is delivered so the rest of the analytics organization can focus their time on what they do best: delivering insights.

How Are Pipelines Built?

Along with data engineers, numerous tools to build and support data pipelines have emerged in recent years. Some are open source, some commercial, and some are homegrown. Some pipelines are written in Python, some in Java, some in another language, and some with no code at all. Throughout this Pocket Reference I explore some of the most popular products and frameworks for building pipelines, as well as discuss how to determine which to use based on your organization’s needs and constraints.

Though I do not cover all such products in depth, I do provide examples and sample code for some. All code in this book is written in Python and SQL. These are the most common and, in my opinion, the most accessible languages for building data pipelines.

In addition, pipelines are not just built; they are monitored, maintained, and extended. Data engineers are tasked with not just delivering data once, but building pipelines and supporting infrastructure that deliver and process it reliably, securely, and on time. It’s no small feat, but when it’s done well, the value of an organization’s data can truly be unlocked.
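To make the extract-and-load shape described in “What Are Data Pipelines?” concrete, here is a minimal Python sketch. It is not code from the book: the REST endpoint URL and the column names (order_id, customer_id, order_total) are hypothetical, and a local SQLite file stands in for a cloud warehouse destination such as the Redshift or Snowflake targets the book covers in Chapter 5.

import sqlite3

import requests

# Hypothetical source endpoint and local destination; both are stand-ins.
API_URL = "https://api.example.com/orders"
DB_PATH = "warehouse.db"


def extract(url):
    # Extract: fetch raw JSON records from the source system as a list of dicts.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def load(records, db_path):
    # Load: write the records into a destination SQL table.
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS orders (
                   order_id INTEGER PRIMARY KEY,
                   customer_id INTEGER,
                   order_total REAL
               )"""
        )
        conn.executemany(
            "INSERT OR REPLACE INTO orders VALUES (:order_id, :customer_id, :order_total)",
            records,
        )


if __name__ == "__main__":
    load(extract(API_URL), DB_PATH)

In a production pipeline of the kind the book builds toward, the extract and load steps would run as separate tasks coordinated by an orchestrator such as Apache Airflow, with validation and alerting around them; the sketch only shows the two stages side by side.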
The above is a preview of the first 20 pages.