MANNING
SECOND EDITION
Orchestration for Data and AI
Julian de Ruiter
Ismael Cabral
Kris Geusebroek
Daniel van der Ende
Bas Harenslak
Sponsored by
Foreword by Tamara J. Fingerlin
Praise for the First Edition

"An Airflow bible. Useful for all kinds of users, from novice to expert."
—Rambabu Posa, Sai Aashika Consultancy

"An easy-to-follow exploration of the benefits of orchestrating your data pipeline jobs with Airflow."
—Daniel Lamblin, Coupang

"The one reference you need to create, author, schedule, and monitor workflows with Apache Airflow. Clear recommendation."
—Thorsten Weber, bbv Software Services

"By far the best resource for Airflow."
—Jonathan Wood, LexisNexis
Data Pipelines with Apache Airflow
Orchestration for Data and AI
Second Edition

Julian de Ruiter
Ismael Cabral
Kris Geusebroek
Daniel van der Ende
Bas Harenslak

Foreword by Tamara J. Fingerlin

MANNING
Shelter Island
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

© 2026 Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning's policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine. ∞

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

ISBN 9781633433885
Printed in the United States of America

The author and publisher have made every effort to ensure that the information in this book was correct at press time. The author and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.
Development editor: Elesha Hyde
Technical editor: Arthur Zubarev
Review editor: Radmila Ercegovac
Production editor: Keri Hales
Copy editor: Keir Simpson
Proofreader: Katie Tennant
Technical proofreader: Anant Agarwal
Typesetter: Tamara Švelić Sabljić
Cover designer: Marija Tudor
brief contents

Part 1 Getting started 1
1 ■ Meet Apache Airflow 3
2 ■ Anatomy of an Airflow DAG 21
3 ■ Time-based scheduling 45
4 ■ Asset-aware scheduling 68
5 ■ Templating tasks using the Airflow context 86
6 ■ Defining dependencies between tasks 111

Part 2 Beyond the basics 145
7 ■ Triggering workflows with external input 147
8 ■ Communicating with external systems 164
9 ■ Extending Airflow with custom operators and sensors 188
10 ■ Testing 226
11 ■ Running tasks in containers 257

Part 3 Airflow in practice 289
12 ■ Best practices 291
13 ■ Project: Finding the fastest way to get around NYC 322
14 ■ Project: Keeping family traditions alive with Airflow and generative AI 343
Part 4 Airflow in production 381
15 ■ Operating Airflow in production 383
16 ■ Securing Airflow 424
17 ■ Airflow deployment options 444

appendix A ■ Running code samples 470
appendix B ■ Prometheus metric mapping 474
contents

foreword xv
preface xvii
acknowledgments xix
about this book xxii
about the authors xxvi
about the cover illustration xxviii

Part 1 Getting started 1

1 Meet Apache Airflow 3
1.1 Introducing data pipelines 4
    Drawing a pipeline as a graph 4 ■ Executing a pipeline graph 6 ■ Pipeline graphs vs. sequential scripts 6 ■ Running pipelines using workflow managers 9
1.2 Introducing Airflow 10
    Defining pipelines flexibly in (Python) code 10 ■ Integrating with external systems 11 ■ Scheduling and executing pipelines 11 ■ Monitoring and handling failures 14 ■ Incremental loading and backfilling 17
1.3 When to use Airflow 18
    Reasons to choose Airflow 18 ■ Reasons not to choose Airflow 19
1.4 The rest of this book 19
2 Anatomy of an Airflow DAG 21
2.1 Collecting data from numerous sources 21
2.2 Writing your first Airflow DAG 23
    Tasks vs. operators 27 ■ Running arbitrary Python code 28
2.3 Running a DAG in Airflow 30
    Running Airflow in a Python environment 31 ■ Running Airflow with Docker 32 ■ Inspecting the DAG in Airflow 33
2.4 Running at regular intervals 37
2.5 Handling failing tasks 38
2.6 DAG versioning 41

3 Time-based scheduling 45
3.1 Processing user events 46
3.2 The basic components of an Airflow schedule 47
3.3 Running regularly using trigger-based schedules 48
    Defining a daily schedule 49 ■ Using cron expressions 51 ■ Using shorthand expressions 52 ■ Using frequency-based timetables 53 ■ Summarizing trigger timetables 54
3.4 Incremental processing with data intervals 55
    Processing data incrementally 55 ■ Defining incremental schedules with data intervals 55 ■ Defining intervals using frequencies 58 ■ Summarizing interval-based schedules 59
3.5 Handling irregular intervals 60
3.6 Managing backfilling of historical data 61
3.7 Designing well-behaved tasks 63
    Atomicity 63 ■ Idempotency 65

4 Asset-aware scheduling 68
4.1 Challenges of scaling time-based schedules 69
4.2 Introducing asset-aware scheduling 70
4.3 Producing asset events 71
4.4 Consuming asset events 73
4.5 Adding extra information to events 76
4.6 Skipping updates 78
4.7 Consuming multiple assets 79
4.8 Combining time- and asset-based schedules 84

5 Templating tasks using the Airflow context 86
5.1 Inspecting data for processing with Airflow 87
5.2 Task context and Jinja templating 89
    Templating operator arguments 89 ■ Templating the PythonOperator 91 ■ Passing additional variables to the PythonOperator 96 ■ Inspecting templated arguments 98
5.3 What is available for templating 99
5.4 Bringing it all together 102

6 Defining dependencies between tasks 111
6.1 Basic dependencies 112
    Linear dependencies 112 ■ Fan-in/fan-out dependencies 113
6.2 Branching 115
    Branching within tasks 116 ■ Branching within the DAG 118
6.3 Conditional tasks 123
    Conditions within tasks 123 ■ Making tasks conditional 124 ■ Using built-in operators 127
6.4 Exploring trigger rules 128
    What is a trigger rule? 128 ■ The effect of failures 128 ■ Other trigger rules 130
6.5 Sharing data between tasks 132
    Sharing data using XComs 132 ■ When and when not to use XComs 135 ■ Using custom XCom backends 135 ■ XCom cleanup 136
6.6 Chaining Python tasks with the Taskflow API 136
    Simplifying Python tasks with the Taskflow API 137 ■ Using the Taskflow API to define a new DAG 140 ■ When and when not to use the Taskflow API 140

Part 2 Beyond the basics 145

7 Triggering workflows with external input 147
7.1 Polling conditions with sensors 148
    Polling custom conditions 151 ■ Working with sensors outside the happy flow 152
7.2 Starting workflows with the REST API and CLI 157
7.3 Triggering workflows with messages 160

8 Communicating with external systems 164
8.1 Installing additional operators 165
8.2 Developing a machine learning model 166
    Use case: Classifying handwritten digits 166 ■ Setting up the pipeline 167 ■ Developing locally with external systems 172
8.3 Moving data between systems 180
    Use case: Analyzing Airbnb listings 180 ■ Implementing a PostgresToS3Operator 181 ■ Outsourcing the heavy work 185

9 Extending Airflow with custom operators and sensors 188
9.1 Starting with a PythonOperator 189
    Simulating a movie-rating API 189 ■ Fetching ratings from the API 192 ■ Building the actual DAG 195
9.2 Building a custom hook 197
    Designing a custom hook 197 ■ Building a DAG with the MovielensHook 204
9.3 Building a custom operator 205
    Defining a custom operator 206 ■ Building an operator to fetch ratings 207
9.4 Building custom sensors 210
9.5 Building a custom deferrable operator 213
    Executing asynchronous tasks using the triggerer 214 ■ Running the Movielens sensor asynchronously 215
9.6 Packaging the components 221
    Bootstrapping a Python package 221 ■ Installing the package 223 ■ Sharing the package with others 224

10 Testing 226
10.1 Getting started with testing 227
    Integrity testing all DAGs 227 ■ Setting up a CI/CD pipeline 232 ■ Writing unit tests 234 ■ Creating the pytest project structure 235 ■ Testing with files on disk 240
10.2 Working with external systems 242
10.3 Using tests for development 249
10.4 Testing complete DAGs 251
    Using dag.test() to test the whole DAG 252 ■ Emulating production environments with Whirl 255 ■ Creating DTAP environments 256

11 Running tasks in containers 257
11.1 Challenges of different operators 258
    Operator interfaces and implementations 258 ■ Complex and conflicting dependencies 258 ■ Moving toward a generic operator 259
11.2 Introducing containers 260
    What are containers? 260 ■ Running a first Docker container 261 ■ Creating a Docker image 262 ■ Persisting data using volumes 264
11.3 Containers and Airflow 266
    Tasks in containers 266 ■ Why use containers? 267
11.4 Running tasks in Docker 268
    Introducing the DockerOperator 268 ■ Creating container images for tasks 270 ■ Building a DAG with Docker tasks 273 ■ Docker-based workflow 276
11.5 Running tasks in Kubernetes 277
    Introducing Kubernetes 277 ■ Setting up Kubernetes 278 ■ Using the KubernetesPodOperator 281 ■ Diagnosing Kubernetes-related issues 285 ■ Differences between Kubernetes- and Docker-based workflows 287

Part 3 Airflow in practice 289

12 Best practices 291
12.1 Writing clean DAGs 291
    Using style conventions 292 ■ Managing credentials centrally 296 ■ Specifying configuration details consistently 297 ■ Avoiding computation in your DAG definition 300 ■ Using factories to generate common patterns 302 ■ Grouping related tasks with task groups 305 ■ Being explicit when specifying your DAG schedule 306 ■ Using Dynamic Task Mapping to generate tasks dynamically 307
12.2 Designing reproducible tasks 313
    Requiring tasks to be idempotent 314 ■ Ensuring that task results are deterministic 314 ■ Designing tasks using functional paradigms 314
12.3 Handling data efficiently 315
    Limiting the amount of data being processed 315 ■ Loading/processing data incrementally 317 ■ Caching intermediate data 317 ■ Avoiding storing data on local filesystems 318 ■ Offloading work to external/source systems 319
12.4 Managing concurrency using pools 319

13 Project: Finding the fastest way to get around NYC 322
13.1 Use case: Investigating traffic in New York City 322
13.2 Understanding the data 326
    Yellow Cab file share 326 ■ Citi Bike REST API 326 ■ Deciding on a plan of approach 328
13.3 Extracting the data 329
    Downloading Citi Bike data 329 ■ Downloading Yellow Cab data 331
13.4 Applying similar transformations to data 333
13.5 Structuring a data pipeline 339
13.6 Developing idempotent data pipelines 340

14 Project: Keeping family traditions alive with Airflow and generative AI 343
14.1 Use case: Bringing family recipes to life 344
14.2 Fine-tuning an existing LLM 344
14.3 RAG to the rescue 345
14.4 Uploading recipes to the Recipe Vault UI 349
14.5 Preprocessing the recipes with DockerOperator 351
14.6 Creating a collection to store our recipes 357
    Defining how to vectorize our text 359 ■ Creating a schema for the collection 361 ■ Preparing our collection of recipes 362
14.7 Updating and creating new records in the vector database 363
14.8 Deleting outdated records from the vector database 367
14.9 Adding recipes to the vector database 368
14.10 RAG in action 370
    The R is for retrieving 372 ■ Structuring our questions with prompt templates 373 ■ Searching for recipes 375

Part 4 Airflow in production 381

15 Operating Airflow in production 383
15.1 Revisiting the Airflow architecture 383
15.2 Choosing the executor 385
    Overview of executor types 386 ■ Which executor is right for you? 387 ■ Installing each executor 389
15.3 Configuring the metastore 396
15.4 Configuring the scheduler 399
    Configuring scheduler components 399 ■ Running multiple schedulers 401 ■ Configuring system performance 401 ■ Controlling the maximum number of running tasks 402
15.5 Configuring the DAG processor manager 403
15.6 Capturing logs 405
    Capturing API server output 405 ■ Capturing scheduler output 406 ■ Capturing task logs 407 ■ Sending logs to remote storage 407
15.7 Visualizing and monitoring Airflow metrics 408
    Collecting metrics from Airflow 408 ■ Configuring Airflow to send metrics 410 ■ Configuring Prometheus to collect metrics 411 ■ Creating dashboards with Grafana 413 ■ What should you monitor? 415
15.8 Setting up alerts 417
15.9 Scaling Airflow beyond a single instance 419

16 Securing Airflow 424
16.1 Role-based access in the Airflow UI 425
    Adding users 425 ■ Configuring the RBAC interface 427
16.2 Encrypting data at rest 428
16.3 Connecting with a directory service 430
    Understanding LDAP 431 ■ Fetching users from an LDAP service 433
16.4 Encrypting traffic to the web server 434
    Understanding HTTPS 434 ■ Configuring a certificate for HTTPS 436
16.5 Fetching credentials from secrets-management systems 440

17 Airflow deployment options 444
17.1 Managed Airflow 445
    Astronomer 445 ■ Google Cloud Composer 446 ■ Amazon Managed Workflows for Apache Airflow 447
17.2 Airflow on Kubernetes 447
    Preparing the Kubernetes cluster 448 ■ Connecting to your Kubernetes cluster 449 ■ Deploying with the Apache Airflow Helm Chart 449 ■ Changing the default deployment configuration 451 ■ Changing the apiserver secret key 452 ■ Using an external database for Airflow metadata 453 ■ Deploying DAGs 454 ■ Deploying a Python library 461 ■ Configuring the executor(s) 464
17.3 Choosing a deployment strategy 468

appendix A Running code samples 470
appendix B Prometheus metric mapping 474

index 476
foreword

Apache Airflow® is the open source standard for workflow orchestration. Since its creation in 2014, it has evolved from mostly supporting bread-and-butter ETL pipelines to being the fundamental platform for implementing comprehensive DataOps workflows, ranging from business operations and analytics to revenue-generating data products to multi-agent AI orchestration. These are just a few of the many use cases we've seen our customers and the wider community implement with Airflow.

A watershed moment in this growth story was the release of Apache Airflow 3.0 on April 22, 2025, delivering features eagerly anticipated by the community, such as DAG versioning, improved backfills, and an all-new React-based Airflow UI. The response from the Airflow community to this foundational release has been enthusiastic: according to the 2025 Airflow community survey, 26% of more than 5,500 Airflow users upgraded to Airflow 3 within the first 7 months.

This second edition of Data Pipelines with Apache Airflow brings one of the most comprehensive Airflow books into the 3.0 era. It includes a chapter on data-aware scheduling with Assets, a section on the new EdgeExecutor for running remote workers, and an all-new project demonstrating how to use Airflow to orchestrate GenAI pipelines.

This book is a valuable resource for newcomers and experienced Airflow users alike. Airflow beginners will learn how to write DAGs from the ground up, starting with a conceptual introduction and progressing through two complete Airflow projects. For seasoned practitioners, the later chapters offer a treasure trove of best practices for running Airflow in production, covering code testing, security, and deployment of your Airflow instance.
We at Astronomer are thrilled that Data Pipelines with Apache Airflow, Second Edition is now available for our customers and the Airflow community alike, enabling them to expand their capabilities and Airflow use cases.

—Tamara J. Fingerlin, Senior Developer Advocate, Astronomer
preface

The world of data is never dull, and much has changed since the original release of this book. Whereas large language models (LLMs) used to be a niche research topic, nowadays everyone—even our mothers—has heard of AI tools such as ChatGPT. For better or worse (depending on whom you ask), this shift has led to a huge boom in companies adopting AI to optimize their processes and shift to more data-driven decision making. Besides this technological acceleration, global challenges such as geopolitical strife and climate change add even more pressure to adapt to an ever-changing environment, making high-quality data more critical than ever.

In response to these developments, the landscape of data tooling has not stood still. There is more competition than ever in the space of data orchestrators, trending toward more integrated, secure, and developer-friendly platforms. Accordingly, Airflow has evolved considerably since the first edition of this book was released, adding several new features and culminating in the recent release of a new major milestone: Airflow 3.

Working on this second edition, we found that we needed to make substantial changes to bring the book up to date with all the changes since Airflow 2.0, including the following:

■ An entirely new UI
■ Significant changes in Airflow's scheduling logic
■ New features such as data-aware (event-based) scheduling, new executor types, and the option to combine multiple executors
■ Changes in deployment, shifting to managed solutions or Kubernetes
■ New use cases, such as generative AI (GenAI)–related workloads (e.g., RAG)
To incorporate all these changes, we reworked the book considerably, adding many new sections, screenshots, and chapters and restructuring the code examples to make them easier to use and enable them to run in a generic way across all chapters. We hope that these additions do justice to the incredible amount of work that the community has put into developing this new release.

Altogether, this updated book aims to provide a comprehensive introduction to Airflow 3, covering everything from building simple workflows and developing custom components to designing and managing Airflow deployments. We intend to complement the many excellent blogs and other online documentation by consolidating various topics into a single, concise, easy-to-follow resource. Through our combined years of experience working with Airflow, we hope to give you a strong foundation to begin your journey with this powerful tool.