Statistics
27 Views · 0 Downloads · 0 Donations

Uploader: 高宏飞
Shared on 2025-12-19
Author: Bas P. Harenslak, Julian Rutger de Ruiter

A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodgepodge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.

About the Technology
Data pipelines manage the flow of data from initial collection through consolidation, cleaning, analysis, visualization, and more. Apache Airflow provides a single platform you can use to design, implement, monitor, and maintain your pipelines. Its easy-to-use UI, plug-and-play options, and flexible Python scripting make Airflow perfect for any data management task.

About the book
Data Pipelines with Apache Airflow teaches you how to build and maintain effective data pipelines. You’ll explore the most common usage patterns, including aggregating multiple data sources, connecting to and from data lakes, and cloud deployment. Part reference and part tutorial, this practical guide covers every aspect of the directed acyclic graphs (DAGs) that power Airflow, and how to customize them for your pipeline’s needs.

What's inside
• Build, test, and deploy Airflow pipelines as DAGs
• Automate moving and transforming data
• Analyze historical datasets using backfilling
• Develop custom components
• Set up Airflow in production environments

About the reader
For DevOps, data engineers, machine learning engineers, and sysadmins with intermediate Python skills.

About the authors
Bas Harenslak and Julian de Ruiter are data engineers with extensive experience using Airflow to develop pipelines for major companies. Bas is also an Airflow committer.

Tags
No tags
ISBN: 1617296902
Publisher: Manning Publications
Publish Year: 2021
Language: English
Pages: 480
File Format: PDF
File Size: 21.4 MB
Support Statistics
¥0.00 · 0 times
Text Preview (First 20 pages)

MANNING · Bas Harenslak · Julian de Ruiter
[Cover diagram: "Pipeline as DAG". A DAG file (Python) defines Tasks 1 through 4, each representing a task/operation we want to run, the dependencies between them (task 3 must run before task 4), and the schedule interval (@daily) that determines when the DAG runs.]
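The diagram maps almost one-to-one onto a short DAG file. Below is a minimal illustrative sketch in Airflow 2.x syntax (not taken from the book); the dag_id, task names, and bash commands are hypothetical placeholders.

# A minimal, illustrative Airflow 2.x DAG file; dag_id, task names, and
# bash commands are hypothetical and only mirror the cover diagram.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipeline",            # the pipeline, expressed as a DAG
    schedule_interval="@daily",           # which schedule to use for running the DAG
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    task_1 = BashOperator(task_id="task_1", bash_command="echo 'task 1'")
    task_2 = BashOperator(task_id="task_2", bash_command="echo 'task 2'")
    task_3 = BashOperator(task_id="task_3", bash_command="echo 'task 3'")
    task_4 = BashOperator(task_id="task_4", bash_command="echo 'task 4'")

    # Dependencies between tasks: task 3 must run before task 4.
    task_1 >> task_2 >> task_3 >> task_4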
Data Pipelines with Apache Airflow
Data Pipelines with Apache Airflow
BAS HARENSLAK AND JULIAN DE RUITER
MANNING
Shelter Island
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact:

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2021 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Tricia Louvar
Technical development editor: Arthur Zubarev
Review editor: Aleks Dragosavljević
Production editor: Deirdre S. Hiam
Copy editor: Michele Mitchell
Proofreader: Keri Hales
Technical proofreader: Al Krinker
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781617296901
Printed in the United States of America
brief contents

PART 1 GETTING STARTED 1
 1 ■ Meet Apache Airflow 3
 2 ■ Anatomy of an Airflow DAG 20
 3 ■ Scheduling in Airflow 40
 4 ■ Templating tasks using the Airflow context 60
 5 ■ Defining dependencies between tasks 85

PART 2 BEYOND THE BASICS 113
 6 ■ Triggering workflows 115
 7 ■ Communicating with external systems 135
 8 ■ Building custom components 157
 9 ■ Testing 186
10 ■ Running tasks in containers 220

PART 3 AIRFLOW IN PRACTICE 253
11 ■ Best practices 255
12 ■ Operating Airflow in production 281
13 ■ Securing Airflow 322
14 ■ Project: Finding the fastest way to get around NYC 344

PART 4 IN THE CLOUDS 365
15 ■ Airflow in the clouds 367
16 ■ Airflow on AWS 375
17 ■ Airflow on Azure 394
18 ■ Airflow in GCP 412
contents

preface xv
acknowledgments xvii
about this book xix
about the authors xxiii
about the cover illustration xxiv

PART 1 GETTING STARTED 1

1 Meet Apache Airflow 3
  1.1 Introducing data pipelines 4
      Data pipelines as graphs 4 ■ Executing a pipeline graph 6 ■ Pipeline graphs vs. sequential scripts 6 ■ Running pipeline using workflow managers 9
  1.2 Introducing Airflow 10
      Defining pipelines flexibly in (Python) code 10 ■ Scheduling and executing pipelines 11 ■ Monitoring and handling failures 13 ■ Incremental loading and backfilling 15
  1.3 When to use Airflow 17
      Reasons to choose Airflow 17 ■ Reasons not to choose Airflow 17
  1.4 The rest of this book 18

2 Anatomy of an Airflow DAG 20
  2.1 Collecting data from numerous sources 21
      Exploring the data 21
  2.2 Writing your first Airflow DAG 22
      Tasks vs. operators 26 ■ Running arbitrary Python code 27
  2.3 Running a DAG in Airflow 29
      Running Airflow in a Python environment 29 ■ Running Airflow in Docker containers 30 ■ Inspecting the Airflow UI 31
  2.4 Running at regular intervals 33
  2.5 Handling failing tasks 36

3 Scheduling in Airflow 40
  3.1 An example: Processing user events 41
  3.2 Running at regular intervals 42
      Defining scheduling intervals 42 ■ Cron-based intervals 44 ■ Frequency-based intervals 46
  3.3 Processing data incrementally 46
      Fetching events incrementally 46 ■ Dynamic time references using execution dates 48 ■ Partitioning your data 50
  3.4 Understanding Airflow’s execution dates 52
      Executing work in fixed-length intervals 52
  3.5 Using backfilling to fill in past gaps 54
      Executing work back in time 54
  3.6 Best practices for designing tasks 55
      Atomicity 55 ■ Idempotency 57

4 Templating tasks using the Airflow context 60
  4.1 Inspecting data for processing with Airflow 61
      Determining how to load incremental data 61
  4.2 Task context and Jinja templating 63
      Templating operator arguments 64 ■ What is available for templating? 66 ■ Templating the PythonOperator 68 ■ Providing variables to the PythonOperator 73 ■ Inspecting templated arguments 75
  4.3 Hooking up other systems 77

5 Defining dependencies between tasks 85
  5.1 Basic dependencies 86
      Linear dependencies 86 ■ Fan-in/-out dependencies 87
  5.2 Branching 90
      Branching within tasks 90 ■ Branching within the DAG 92
  5.3 Conditional tasks 97
      Conditions within tasks 97 ■ Making tasks conditional 98 ■ Using built-in operators 100
  5.4 More about trigger rules 100
      What is a trigger rule? 101 ■ The effect of failures 102 ■ Other trigger rules 103
  5.5 Sharing data between tasks 104
      Sharing data using XComs 104 ■ When (not) to use XComs 107 ■ Using custom XCom backends 108
  5.6 Chaining Python tasks with the Taskflow API 108
      Simplifying Python tasks with the Taskflow API 109 ■ When (not) to use the Taskflow API 111

PART 2 BEYOND THE BASICS 113

6 Triggering workflows 115
  6.1 Polling conditions with sensors 116
      Polling custom conditions 119 ■ Sensors outside the happy flow 120
  6.2 Triggering other DAGs 122
      Backfilling with the TriggerDagRunOperator 126 ■ Polling the state of other DAGs 127
  6.3 Starting workflows with REST/CLI 131

7 Communicating with external systems 135
  7.1 Connecting to cloud services 136
      Installing extra dependencies 137 ■ Developing a machine learning model 137 ■ Developing locally with external systems 143
  7.2 Moving data from between systems 150
      Implementing a PostgresToS3Operator 151 ■ Outsourcing the heavy work 155

8 Building custom components 157
  8.1 Starting with a PythonOperator 158
      Simulating a movie rating API 158 ■ Fetching ratings from the API 161 ■ Building the actual DAG 164
  8.2 Building a custom hook 166
      Designing a custom hook 166 ■ Building our DAG with the MovielensHook 172
  8.3 Building a custom operator 173
      Defining a custom operator 174 ■ Building an operator for fetching ratings 175
  8.4 Building custom sensors 178
  8.5 Packaging your components 181
      Bootstrapping a Python package 182 ■ Installing your package 184

9 Testing 186
  9.1 Getting started with testing 187
      Integrity testing all DAGs 187 ■ Setting up a CI/CD pipeline 193 ■ Writing unit tests 195 ■ Pytest project structure 196 ■ Testing with files on disk 201
  9.2 Working with DAGs and task context in tests 203
      Working with external systems 208
  9.3 Using tests for development 215
      Testing complete DAGs 217
  9.4 Emulate production environments with Whirl 218
  9.5 Create DTAP environments 219

10 Running tasks in containers 220
  10.1 Challenges of many different operators 221
      Operator interfaces and implementations 221 ■ Complex and conflicting dependencies 222 ■ Moving toward a generic operator 223
  10.2 Introducing containers 223
      What are containers? 223 ■ Running our first Docker container 224 ■ Creating a Docker image 225 ■ Persisting data using volumes 227
  10.3 Containers and Airflow 230
      Tasks in containers 230 ■ Why use containers? 231
  10.4 Running tasks in Docker 232
      Introducing the DockerOperator 232 ■ Creating container images for tasks 233 ■ Building a DAG with Docker tasks 236 ■ Docker-based workflow 239
  10.5 Running tasks in Kubernetes 240
      Introducing Kubernetes 240 ■ Setting up Kubernetes 242 ■ Using the KubernetesPodOperator 245 ■ Diagnosing Kubernetes-related issues 248 ■ Differences with Docker-based workflows 250

PART 3 AIRFLOW IN PRACTICE 253

11 Best practices 255
  11.1 Writing clean DAGs 256
      Use style conventions 256 ■ Manage credentials centrally 260 ■ Specify configuration details consistently 261 ■ Avoid doing any computation in your DAG definition 263 ■ Use factories to generate common patterns 265 ■ Group related tasks using task groups 269 ■ Create new DAGs for big changes 270
  11.2 Designing reproducible tasks 270
      Always require tasks to be idempotent 271 ■ Task results should be deterministic 271 ■ Design tasks using functional paradigms 272
  11.3 Handling data efficiently 272
      Limit the amount of data being processed 272 ■ Incremental loading/processing 274 ■ Cache intermediate data 275 ■ Don’t store data on local file systems 275 ■ Offload work to external/source systems 276
  11.4 Managing your resources 276
      Managing concurrency using pools 276 ■ Detecting long-running tasks using SLAs and alerts 278

12 Operating Airflow in production 281
  12.1 Airflow architectures 282
      Which executor is right for me? 284 ■ Configuring a metastore for Airflow 284 ■ A closer look at the scheduler 286
  12.2 Installing each executor 290
      Setting up the SequentialExecutor 291 ■ Setting up the LocalExecutor 292 ■ Setting up the CeleryExecutor 293 ■ Setting up the KubernetesExecutor 296
  12.3 Capturing logs of all Airflow processes 302
      Capturing the webserver output 303 ■ Capturing the scheduler output 303 ■ Capturing task logs 304 ■ Sending logs to remote storage 305
  12.4 Visualizing and monitoring Airflow metrics 305
      Collecting metrics from Airflow 306 ■ Configuring Airflow to send metrics 307 ■ Configuring Prometheus to collect metrics 308 ■ Creating dashboards with Grafana 310 ■ What should you monitor? 312
  12.5 How to get notified of a failing task 314
      Alerting within DAGs and operators 314 ■ Defining service-level agreements 316 ■ Scalability and performance 318 ■ Controlling the maximum number of running tasks 318 ■ System performance configurations 319 ■ Running multiple schedulers 320

13 Securing Airflow 322
  13.1 Securing the Airflow web interface 323
      Adding users to the RBAC interface 324 ■ Configuring the RBAC interface 327
  13.2 Encrypting data at rest 327
      Creating a Fernet key 328
  13.3 Connecting with an LDAP service 330
      Understanding LDAP 330 ■ Fetching users from an LDAP service 333
  13.4 Encrypting traffic to the webserver 333
      Understanding HTTPS 334 ■ Configuring a certificate for HTTPS 336
  13.5 Fetching credentials from secret management systems 339

14 Project: Finding the fastest way to get around NYC 344
  14.1 Understanding the data 347
      Yellow Cab file share 348 ■ Citi Bike REST API 348 ■ Deciding on a plan of approach 350
  14.2 Extracting the data 350
      Downloading Citi Bike data 351 ■ Downloading Yellow Cab data 353
  14.3 Applying similar transformations to data 356
  14.4 Structuring a data pipeline 360
  14.5 Developing idempotent data pipelines 361

PART 4 IN THE CLOUDS 365

15 Airflow in the clouds 367
  15.1 Designing (cloud) deployment strategies 368
  15.2 Cloud-specific operators and hooks 369
  15.3 Managed services 370
      Astronomer.io 371 ■ Google Cloud Composer 371 ■ Amazon Managed Workflows for Apache Airflow 372
  15.4 Choosing a deployment strategy 372

16 Airflow on AWS 375
  16.1 Deploying Airflow in AWS 375
      Picking cloud services 376 ■ Designing the network 377 ■ Adding DAG syncing 378 ■ Scaling with the CeleryExecutor 378 ■ Further steps 380
  16.2 AWS-specific hooks and operators 381
  16.3 Use case: Serverless movie ranking with AWS Athena 383
      Overview 383 ■ Setting up resources 384 ■ Building the DAG 387 ■ Cleaning up 393

17 Airflow on Azure 394
  17.1 Deploying Airflow in Azure 394
      Picking services 395 ■ Designing the network 395 ■ Scaling with the CeleryExecutor 397 ■ Further steps 398
  17.2 Azure-specific hooks/operators 398
  17.3 Example: Serverless movie ranking with Azure Synapse 400
      Overview 400 ■ Setting up resources 401 ■ Building the DAG 404 ■ Cleaning up 410

18 Airflow in GCP 412
  18.1 Deploying Airflow in GCP 413
      Picking services 413 ■ Deploying on GKE with Helm 415 ■ Integrating with Google services 417 ■ Designing the network 419 ■ Scaling with the CeleryExecutor 419
  18.2 GCP-specific hooks and operators 422
  18.3 Use case: Serverless movie ranking on GCP 427
      Uploading to GCS 428 ■ Getting data into BigQuery 429 ■ Extracting top ratings 432

appendix A Running code samples 436
appendix B Package structures Airflow 1 and 2 439
appendix C Prometheus metric mapping 443
index 445
preface

We’ve both been fortunate to be data engineers in interesting and challenging times. For better or worse, many companies and organizations are realizing that data plays a key role in managing and improving their operations. Recent developments in machine learning and AI have opened a slew of new opportunities to capitalize on. However, adopting data-centric processes is often difficult, as it generally requires coordinating jobs across many different heterogeneous systems and tying everything together in a nice, timely fashion for the next analysis or product deployment.

In 2014, engineers at Airbnb recognized the challenges of managing complex data workflows within the company. To address those challenges, they started developing Airflow: an open source solution that allowed them to write and schedule workflows and monitor workflow runs using the built-in web interface.

The success of the Airflow project quickly led to its adoption under the Apache Software Foundation, first as an incubator project in 2016 and later as a top-level project in 2019. As a result, many large companies now rely on Airflow for orchestrating numerous critical data processes.

Working as consultants at GoDataDriven, we’ve helped various clients adopt Airflow as a key component in projects involving the building of data lakes/platforms, machine learning models, and so on. In doing so, we realized that handing over these solutions can be challenging, as complex tools like Airflow can be difficult to learn overnight. For this reason, we also developed an Airflow training program at GoDataDriven, and have frequently organized and participated in meetings to share our knowledge, views, and even some open source packages. Combined, these efforts have helped us explore the intricacies of working with Airflow, which were not always easy to understand using the documentation available to us.

In this book, we aim to provide a comprehensive introduction to Airflow that covers everything from building simple workflows to developing custom components and designing/managing Airflow deployments. We intend to complement many of the excellent blogs and other online documentation by bringing several topics together in one place, using a concise and easy-to-follow format. In doing so, we hope to kickstart your adventures with Airflow by building on top of the experience we’ve gained through diverse challenges over the past years.
acknowledgments

This book would not have been possible without the support of many amazing people. Colleagues from GoDataDriven and personal friends supported us and provided valuable suggestions and critical insights. In addition, Manning Early Access Program (MEAP) readers posted useful comments in the online forum.

Reviewers from the development process also contributed helpful feedback: Al Krinker, Clifford Thurber, Daniel Lamblin, David Krief, Eric Platon, Felipe Ortega, Jason Rendel, Jeremy Chen, Jiri Pik, Jonathan Wood, Karthik Sirasanagandla, Kent R. Spillner, Lin Chen, Philip Best, Philip Patterson, Rambabu Posa, Richard Meinsen, Robert G. Gimbel, Roman Pavlov, Salvatore Campagna, Sebastián Palma Mardones, Thorsten Weber, Ursin Stauss, and Vlad Navitski.

At Manning, we owe special thanks to Brian Sawyer, our acquisitions editor, who helped us shape the initial book proposal and believed in us being able to see it through; Tricia Louvar, our development editor, who was very patient in answering all our questions and concerns, provided critical feedback on each of our draft chapters, and was an essential guide for us throughout this entire journey; and to the rest of the staff as well: Deirdre Hiam, our project editor; Michele Mitchell, our copyeditor; Keri Hales, our proofreader; and Al Krinker, our technical proofreader.

Bas Harenslak
I would like to thank my friends and family for their patience and support during this year-and-a-half adventure that developed from a side project into countless days, nights, and weekends. Stephanie, thank you for always putting up with me working at the computer. Miriam, Gerd, and Lotte, thank you for your patience and belief in me while writing this book. I would also like to thank the team at GoDataDriven for their support and dedication to always learn and improve; I could not have imagined being the author of a book when I started working five years ago.

Julian de Ruiter
First and foremost, I’d like to thank my wife, Anne Paulien, and my son, Dexter, for their endless patience during the many hours that I spent doing “just a little more work” on the book. This book would not have been possible without their unwavering support. In the same vein, I’d also like to thank our family and friends for their support and trust. Finally, I’d like to thank our colleagues at GoDataDriven for their advice and encouragement, from whom I’ve also learned an incredible amount in the past years.