Digital Library

Data Transformation: The Definitive Guide. Designing Scalable and Efficient Data Pipelines to Power Analytics, Machine… (Andrew Madson, Toby Mao, Iaroslav Zeigerman)

Author: Andrew Madson, Toby Mao, Iaroslav Zeigerman


Data Transformation: The Definitive Guide provides a rigorous and practical roadmap for designing scalable, efficient, and maintainable data pipelines. Written by leaders in the field, this book introduces foundational principles and modern practices that treat data transformation with the same discipline as software development, in equal parts theory and hands-on implementation. With guidance on everything from building reproducible, testable workflows to deploying industrial-grade frameworks, the book equips data professionals with the knowledge to tackle real-world challenges in analytics, machine learning, and AI. Squarely focusing on reliability and scale, the authors deliver essential strategies for turning raw data into fresh, trustworthy insights.

• Structure transformation pipelines for maintainability and reproducibility
• Apply modern data development workflows, including CI/CD and versioning
• Manage complexity through modular pipeline design and best practices
• Evaluate tools and frameworks like SQLMesh and adopt them with confidence
• Troubleshoot data quality issues with robust testing and observability techniques
• Accelerate delivery of analytics and ML products with scalable transformation foundations

Format PDF

Size 3.2 MB


Text Preview (First 20 pages)


With Early Release ebooks, you get books in their earliest form (the author’s raw and unedited content as they write) so you can take advantage of these technologies long before the official release of these titles.

Andrew Madson, Toby Mao, and Iaroslav Zeigerman

Data Transformation: The Definitive Guide
Designing Scalable and Efficient Data Pipelines to Power Analytics, Machine Learning, and AI

979-8-341-66137-0

Data Transformation: The Definitive Guide
by Andrew Madson, Toby Mao, and Iaroslav Zeigerman

Copyright © 2027 O’Reilly Media, Inc. All rights reserved.

Published by O’Reilly Media, Inc., 141 Stony Circle, Suite 195, Santa Rosa, CA 95401.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (https://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Aaron Black
Development Editor: Gary O’Brien
Production Editor: Katherine Tozer
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Interior Illustrator: Kate Dullea

April 2027: First Edition

Revision History for the Early Release
2026-04-08: First Release

See https://oreilly.com/catalog/errata.csp?isbn=9798341661424 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Transformation: The Definitive Guide, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and Fivetran. See our statement of editorial independence.

Table of Contents

Brief Table of Contents (Not Yet Final)

1. Reproducibility
  Reproducibility in Pipelines
  Reproducibility Factors
  Mutable or Unstable Data Sources
  Version Control for Code and Models
  Dependency and Environment Drift
  Non-Deterministic and Time-Sensitive Logic
  Poor Documentation and Discipline
  Configuration Drift
  Techniques for Reproducibility
  Version Control (Git) and Docs-as-Code
  Deterministic Transformation Logic
  Environment Isolation and Dependency Management
  Declarative Frameworks and Pipeline-as-Code
  Metadata and Lineage Capture
  Testing and Assertions
  Raw Data Retention
  Audit Trails and Logging
  Idempotency
  Upserts (Insert/Update or MERGE operations)
  Insert Overwrite
  Deduplication and Constraint Enforcement
  Checkpointing and State Tracking
  Functional Transformation Without Side-Effects
  Application to SQL Sushi Co.
  Versioned Specs and Code
  Environment Isolation and Deterministic Execution
  Spark for Controlled Processing
  Idempotent, Key-Based Upserts
  Metadata, Lineage, and Logging
  Plan-Based Backfills with SQLMesh
  Bringing it all together

2. Backfilling and Reprocessing
  When and Why to Backfill Data
  Common Triggers for Backfilling
  The Business Case for Strategic Backfilling
  Risk Assessment and Mitigation
  Making the Backfill Decision
  Strategies for Reprocessing Large Datasets
  Incremental vs. Full Refresh Strategies
  Partitioning Strategies for Parallel Processing
  Resource Optimization and Scaling Patterns
  Advanced Reprocessing Patterns
  Shadow Table Examples
  Optimizing for Specific Storage Systems
  Ensuring Consistency During Re-runs
  Idempotency: The Foundation of Safe Reprocessing
  Automation for Safe Backfills
  Orchestration Patterns and Tools
  Resource Management and Throttling
  Safety Mechanisms and Circuit Breakers
  Monitoring and Observability
  Version Control and Rollback Strategies
  Advanced Automation Patterns
  Integration with Modern Data Stacks
  Conclusion

Brief Table of Contents (Not Yet Final)

Chapter 1: Business Challenges and the State of Data Today (unavailable)
Chapter 2: Spec Writing (unavailable)
Chapter 3: Reproducibility (available)
Chapter 4: Backfilling and Reprocessing (available)
Chapter 5: Incremental Models (unavailable)
Chapter 6: Streaming Data Transformation (not available)
Chapter 7: Testing and Data Quality – Safeguarding Pipeline Integrity (not available)
Chapter 8: Version Control – Managing Change in Data Pipeline (not available)
Chapter 9: CI/CD for Data Pipeline (not available)
Chapter 10: Observability and Monitoring – Tracking Pipeline Health (not available)
Chapter 11: Scalability and Performance (not available)
Chapter 12: Scheduling SQL Pipelines with Python (not available)
Chapter 13: Workflow Orchestration (not available)
Chapter 14: SQL-Based Transformation Framework (not available)
Chapter 15: Beyond SQL - Spark for Complex Processing (not available)
Chapter 16: Real-Time Data Transformation (not available)
Chapter 17: End-to-End Case Study (not available)

CHAPTER 1
Reproducibility

A Note for Early Release Readers

With Early Release ebooks, you get books in their earliest form (the author’s raw and unedited content as they write) so you can take advantage of these technologies long before the official release of these titles.

This will be the 3rd chapter of the final book. Please note that the GitHub repo will be made active later on.

If you’d like to be actively involved in reviewing and commenting on this draft, please reach out to the editor at gobrien@oreilly.com.

Data transformation pipelines can’t just produce correct results once. They need to produce the same results reliably under the same conditions every single time. That’s reproducibility.

Reproducibility in Pipelines

In data engineering, reproducibility means you can re-run a data transformation process and get identical results, assuming the inputs and conditions stay the same.

Think of the ideal data pipeline as a pure function: input => output (#goals). If you run the same code with the same configuration and environment against the same input data, you should get the same output.

Reproducibility builds trust. It ensures results aren’t a one-off accident but the consistent outcome of a defined process. When you have reproducibility, teams can verify results, debug issues by recreating scenarios, and reliably update data outputs when source data or business logic changes. Reproducibility is closely tied to determinism.
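The pure-function framing can be sketched in a few lines of Python. This is a minimal illustration, not a real pipeline: the function name `daily_totals`, the row layout, and the `min_amount` parameter are all hypothetical. The point is only that the function reads nothing except its arguments (no clock, no globals, no network), so the same inputs give the same output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, float]  # (date, product_id, amount); hypothetical schema


def daily_totals(rows: Iterable[Row], min_amount: float = 0.0) -> List[Row]:
    """Pure transformation: the output depends only on the arguments."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for date, product_id, amount in rows:
        if amount >= min_amount:
            totals[(date, product_id)] += amount
    # Sort so the result does not depend on input order or hash order.
    return sorted((d, p, round(t, 2)) for (d, p), t in totals.items())


rows = [
    ("2024-01-01", "roll", 12.0),
    ("2024-01-01", "roll", 8.0),
    ("2024-01-02", "soup", 5.5),
]

first = daily_totals(rows)
second = daily_totals(list(reversed(rows)))
assert first == second  # same inputs, same output, regardless of row order
```

Everything that could change the result is an explicit argument, which is what makes a rerun comparable to the original run.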

Determinism means that given the same input, a process will always produce the same output, just like the pure function above. There’s no randomness, no variability based on timing, no dependency on external factors that might change. Deterministic practices are why a pipeline behaves predictably. Reproducibility is the goal (being able to recreate results), and determinism is a key principle that makes it possible.

People sometimes conflate reproducibility with consistency, auditability, and data quality. Let’s set clear definitions.

Consistency refers to the uniformity and coherence of data at a given point in time, such as making sure all parts of a dataset follow the same definitions or that data isn’t contradictory across systems. A reproducible pipeline contributes to consistency over time since each run produces consistent results. But consistency can also mean transactional consistency (like ACID databases ensuring operations are applied atomically) or conceptual consistency of metrics. Reproducibility is about being able to repeat the process. Consistency is about the state of the data at one time: no partial updates or conflicting values.

Auditability focuses on traceability. Can you track what happened in a pipeline? Who ran what, when, and how was the data changed? An auditable pipeline keeps detailed logs, version histories, and lineage so every change or result can be traced and examined. Auditability and reproducibility work together. Auditability ensures you can inspect and trace past results. Reproducibility ensures you can rerun and verify those results. Mature data workflows treat auditability as requiring reproducibility.

Data quality refers to the accuracy, completeness, and validity of data. While a reproducible pipeline helps maintain quality by eliminating random errors and making it easier to test and re-run validations, it doesn’t guarantee correct results on its own.
A pipeline can be consistently reproducible yet consistently wrong if the logic is flawed or the input data is bad. Reproducibility just guarantees you’ll get the same result given the same inputs. It doesn’t automatically mean the result is correct.

Reproducibility Factors

Reproducibility can be challenging. There are technical and organizational factors at play. Understanding these factors is the first step to success. Here are common anti-patterns that prevent pipeline reproducibility.

Mutable or Unstable Data Sources

Make friends with software engineers. If your upstream source data changes over time in uncontrolled and unexpected ways, it’s hard to reproduce past results. Imagine a source system that retroactively modifies or deletes historical records without notice. Running the pipeline today on “the same” date range as a month ago might yield different outputs. Without versioned snapshots of input data, pipelines can’t be rerun exactly as before.

Late-arriving data or data that gets corrected at the source can also introduce discrepancies if not handled correctly. When source data isn’t immutable or historically accessible, reproducibility suffers because the inputs aren’t truly the same on each run.

Version Control for Code and Models

If the transformation code (SQL, Python scripts, SQLMesh models, etc.) isn’t rigorously version-controlled, it’s difficult to recreate the exact logic used in a prior run. Teams that modify pipeline code without tracking versions will struggle to reproduce an earlier state of the pipeline. Without history, you can’t roll back to the exact code used at a given time, and past results may be irreproducible.

The same goes for AI and machine learning pipelines. Not versioning models or parameters means you can’t later rerun the pipeline with the same model to get the same outcome.

Dependency and Environment Drift

Data pipelines have many dependencies: libraries, database engines, hardware, OS environments, or configuration settings. If dependencies aren’t controlled, the pipeline might produce different results in different environments or at different times. An update to a Python library or a change in the SQL engine’s behavior could break the pipeline.

Environment drift is a silent killer. It occurs when the production environment slowly diverges from the development/test environment or from what existed when the pipeline was first built. If someone reruns a pipeline on a new server or a container with mismatched packages, results can differ from previous runs.

Without environment isolation and dependency management, you’ll suffer from the “it works on my machine” syndrome. Consistent, reproducible pipelines require stable versions of dependencies and execution environments.
Non-Deterministic and Time-Sensitive Logic

Pipelines that include non-deterministic operations or depend on specific timing can yield inconsistent results. Common examples are using a random sample without a fixed seed, iterating over an unordered set where the order of processing could vary from run to run (we’re looking at you, SELECT without ORDER BY), and relying on the current timestamp inside the transformation logic. All of these can make outputs vary.
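Here is one way the three problems above (unseeded sampling, unordered iteration, and reading the clock) might be avoided in Python. It’s a sketch under stated assumptions: `select_batch`, the customer IDs, and the seed value are hypothetical, and the only library calls used are from the standard `random` and `datetime` modules.

```python
import random
from datetime import datetime, timezone
from typing import Dict, Iterable


def select_batch(customer_ids: Iterable[str], as_of: datetime, seed: int, k: int) -> Dict:
    """Pick a sample of customers in a way that repeats exactly on a rerun."""
    ordered = sorted(set(customer_ids))  # fixed processing order, not set order
    rng = random.Random(seed)            # local, seeded generator
    sample = rng.sample(ordered, k)
    return {"as_of": as_of.isoformat(), "sample": sample}


# The run time is passed in as a parameter instead of calling datetime.now().
run_time = datetime(2024, 1, 15, tzinfo=timezone.utc)
ids = {"c3", "c1", "c2", "c5", "c4"}

first = select_batch(ids, run_time, seed=42, k=2)
second = select_batch(ids, run_time, seed=42, k=2)
assert first == second
```

Sorting before sampling matters: a seeded generator only gives the same sample if the population it draws from is in the same order each time.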

If a pipeline processes “today’s data up to now,” then running it at different times yields different outputs. This complicates reproducibility unless you can fix the execution time or inputs. Any logic that isn’t purely functional (that is, depending only on its inputs) can hurt reproducibility. Even external calls, like hitting an API that might return dynamic results, introduce variability.

Poor Documentation and Discipline

Organizational culture and practices have a huge impact on data pipeline quality. If how the pipeline runs isn’t documented (required configuration, manual steps, special parameters used), reproducing it becomes error-prone for a different team member or after some time has passed (refer to Chapter 2: Spec Writing).

A lack of clear procedures for running or deploying the pipeline, like not recording that a backfill was done with an ad hoc script, or not noting that data was manually adjusted, is a reproducibility nightmare.

DataOps practices are far behind DevOps, but they’re catching up. Why’s that important? If there’s no culture of testing or code review, changes that unintentionally alter outputs might slip in, going unnoticed until much later.

Tribal knowledge is a common anti-pattern. Pipelines that rely on an individual remembering to do X when Y happens can’t be reliably repeated by others. Well-documented steps and strong DataOps processes, like requiring all changes to go through version control and CI/CD, improve reproducibility by ensuring everyone runs the pipeline consistently.

Configuration Drift

Beyond code and data, pipeline components usually have configuration files, environment variables, and infrastructure setups that affect behavior. If these configurations drift, reproducing the pipeline end-to-end might fail or produce different results. Imagine an engineer updates a config in one environment but not another, or secrets/credentials expire.
Without central management of configuration (preferably also version-controlled, which we’ll discuss in the next section) and alignment between transformation stages, the pipeline won’t be portable. If the orchestration (workflow definitions) isn’t versioned or if schedules and triggers change without record, it’ll be hard to know how the pipeline was executed.

Anything that introduces variability in the input data, code, environment, or manual procedure prevents reproducibility. Recognize these pitfalls and address them with engineering and governance best practices, ideally during the pipeline design process.
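Configuration drift in particular is cheap to detect once configuration is treated as data. The sketch below assumes each environment’s settings can be loaded into a plain dictionary. The keys, the values, and the helper names `config_digest` and `config_diff` are hypothetical. It hashes a canonical form of each config and, if the hashes differ, reports which keys diverged.

```python
import hashlib
import json
from typing import Any, Dict, Tuple


def config_digest(config: Dict[str, Any]) -> str:
    """Hash a config so two environments can be compared with one string."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def config_diff(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return keys whose values differ, with both values for inspection."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Hypothetical settings for two environments.
staging = {"warehouse": "analytics", "batch_size": 500, "timezone": "UTC"}
prod = {"timezone": "UTC", "warehouse": "analytics", "batch_size": 1000}

if config_digest(staging) != config_digest(prod):
    print("Configuration drift detected:", config_diff(staging, prod))
```

Serializing with sorted keys means key order never affects the digest, so only real differences in values show up. Storing the digest with each run gives you a quick way to tell whether two runs used the same settings.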

Techniques for Reproducibility

Reproducibility requires a combination of technical practices and design principles to make pipelines deterministic, traceable, and repeatable. Here are best practices that enable reproducible data transformation pipelines.

Version Control (Git) and Docs-as-Code

Use version control for everything: code, configuration, even documentation. Storing pipeline code (SQL queries, scripts, ETL workflows) in a Git repository ensures every change is tracked and historical versions can be retrieved. You can always rerun an older version of the pipeline if needed or pinpoint when a change in logic was introduced.

Version control the specifications and documentation of the pipeline, too. A Docs-as-Code approach keeps the design spec, data model definitions, and business logic documentation in the same repository as the code. Your documentation gives you a record of the intended behavior at each release.

What’s a practical way to do this? Write specs in Markdown or YAML and manage them with Git alongside the code. When spec and code evolve simultaneously under version control, you can reproduce the code state and also understand the why behind that code.

Version control systems facilitate reproducibility through tags or release versions. You can tag a particular pipeline release that produced a specific report, then later check out that tag to reproduce the report. Adopting consistent, disciplined version control provides a strong foundation for reproducibility.

Deterministic Transformation Logic

Make pipeline operations deterministic so repeated runs produce identical outcomes. Eliminate sources of randomness or variability in your code. If your pipeline uses sampling or a random generator (like in data augmentation or splitting), always set a fixed seed so results are the same each run. Don’t rely on system time or the order of unordered data structures in your computations.
If your code iterates over a set or dictionary, which in some languages is unordered, explicitly sort it to ensure consistent processing order.

    # df is assumed to be a pandas DataFrame

    # Bad: Non-deterministic sampling
    sample = df.sample(n=1000)

    # Good: Fixed seed for reproducibility
    sample = df.sample(n=1000, random_state=42)

Deterministic logic also means designing idempotent transformations (we’ll dig into this more in the Idempotency section). Reprocessing the same records shouldn’t duplicate or diverge. Another aspect is handling time-based partitions or incremental logic thoughtfully. If today’s run processes yesterday’s data, make the date window parameterized so re-running for a past date uses the intended fixed window.

    -- Bad: Uses current date, not reproducible
    SELECT * FROM orders WHERE date = CURRENT_DATE - 1

    -- Good: Parameterized date for reproducibility
    SELECT * FROM orders WHERE date = '{{ run_date }}'

Coding with deterministic functions and controlled inputs helps the pipeline behave like a function. Specific inputs equal specific outputs. This makes debugging and run comparisons much easier because the logic itself is deterministic.

Environment Isolation and Dependency Management

Standardize and isolate the execution environment so the pipeline runs the same way everywhere.

Containerization and environment management tools like Docker or Kubernetes encapsulate the pipeline’s runtime (OS, language runtime, libraries, etc.) in a form that can be redeployed consistently. At a minimum, use virtual environments or dependency lock files, like a requirements.txt or lockfile for Python, or environment.yml for Conda. Pin library versions and ensure anyone running the pipeline installs the same versions.

    # Example Dockerfile for reproducible environment
    FROM python:3.9-slim
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    COPY pipeline/ /app/pipeline/
    WORKDIR /app

Infrastructure-as-code (IaC) could also be employed using Terraform or similar, so infrastructure can be reproduced. Many teams maintain dev/test/prod parity: the environments are kept as similar as possible, so a pipeline tested in dev is reproducible in prod.

Environment isolation extends to data platform dependencies. Use a consistent version of the SQL engine or Spark, and be cautious when upgrading those. Test for any changes in results. By eliminating environmental differences, you ensure the only factors influencing pipeline output are the code and data, which are already controlled using version control, CI/CD, etc.
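Pinning only helps if the running environment actually matches the pins. As a rough sketch, the check below compares simple `name==version` lines against what is installed, using the standard library’s `importlib.metadata`. The helper names and the example package are hypothetical. The parser is deliberately naive: it ignores extras, environment markers, and anything other than exact `==` pins.

```python
from importlib import metadata
from typing import Dict, List


def parse_pins(requirements_text: str) -> Dict[str, str]:
    """Read 'name==version' lines; comments and blank lines are skipped."""
    pins: Dict[str, str] = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: Dict[str, str]) -> List[str]:
    """Compare pinned versions with what is installed in this environment."""
    problems: List[str] = []
    for name, wanted in sorted(pins.items()):
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: pinned {wanted}, not installed")
            continue
        if installed != wanted:
            problems.append(f"{name}: pinned {wanted}, installed {installed}")
    return problems


# Hypothetical pin, chosen so the example runs anywhere.
example = """
# pins for illustration only
example-package-that-is-not-installed==1.2.3
"""
print(find_mismatches(parse_pins(example)))
```

Running a check like this at the start of a pipeline run, and failing fast on any mismatch, turns silent environment drift into a visible error.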
Declarative Frameworks and Pipeline-as-Code

Declarative tools let you specify what the outcome should be (the target model or table and how it’s defined) rather than writing detailed instructions for how to do it.

This higher-level approach means the framework itself handles a lot of the reproducibility nuts and bolts for you. SQLMesh, for instance, parses your SQL logic and tracks dependencies between models. It enables features like automated backfills and environment promotion.

Declarative frameworks treat transformations as code (usually SQL files with MODEL DDL blocks), which are easy to version control and test. They also often include built-in capabilities for environment isolation at the data level, like the ability to generate a dev environment where you can run the pipeline without affecting production data.

    -- Example declarative pipeline definition in SQLMesh
    MODEL (
      name example.fact_sales,
      kind FULL,
      owner 'analytics_team',
      description 'Daily sales aggregation'
    );

    SELECT
      transaction_date AS date,
      product_id,
      SUM(amount) AS total_sales
    FROM staging.raw_sales
    GROUP BY transaction_date, product_id;

Since the pipeline is described in a structured way, it’s easier for others to understand and rerun it. Using a declarative, model-driven approach gives your pipeline more consistent, reproducible behavior.

Wait! Where is ORDER BY? Don’t you need that to retrieve the same results every time?

ORDER BY isn’t required in this fact_sales aggregation example because the model produces a complete result set that will be stored as an unordered table in the database. GROUP BY already ensures deterministic results: the same input data will always produce the same aggregated output, regardless of row order. Adding ORDER BY would waste compute resources since the ordering would be discarded when the table is materialized.

However, ORDER BY becomes essential for reproducibility when you’re limiting results or using row-dependent operations. For example:

    -- Example where ORDER BY is needed for reproducible results
    MODEL (
      name example.top_products_daily,
      kind FULL,
      description 'Top 10 products by sales each day'
    );

    WITH daily_sales AS (
      SELECT
        transaction_date AS date,
        product_id,
        SUM(amount) AS total_sales
      FROM staging.raw_sales
      GROUP BY transaction_date, product_id
    ),
    ranked AS (
      SELECT
        date,
        product_id,
        total_sales,
        -- product_id breaks ties so equal sales always rank the same way
        ROW_NUMBER() OVER (
          PARTITION BY date
          ORDER BY total_sales DESC, product_id
        ) AS rank
      FROM daily_sales
    )
    SELECT date, product_id, total_sales, rank
    FROM ranked
    WHERE rank <= 10  -- a window result can only be filtered in an outer query
    ORDER BY date, rank; -- Ensures consistent ordering for downstream consumers

Without the ORDER BY in the window function, ROW_NUMBER() would assign ranks arbitrarily. Ordering by total_sales alone would still leave the ranking non-deterministic when products have identical sales, potentially producing different results across runs, so product_id is included as a tie-breaker.

Metadata and Lineage Capture

Track detailed metadata about pipeline executions and data lineage. Reproducibility isn’t just about getting the same result. It’s also about knowing how that result was produced. By capturing lineage (which data sources and transformations led to a given output), you create a map that can be used to reproduce or troubleshoot outputs. Who doesn’t love a map?

Modern data catalogs and orchestrators often provide lineage graphs, and frameworks like OpenLineage standardize the collection of this information. SQLMesh provides column-level lineage, while dbt provides model-level lineage. Your pipeline should log what input data versions or timestamps it processed and which code version produced the output.

If a pipeline run produces a table, attach metadata like “this table was generated by pipeline X run ID 123 on date Y using commit Z of the code”. This makes it far easier to rerun later or verify that exact scenario.

Lineage metadata helps you answer questions. If this number looks off, which raw files or source records went into it? Storing metadata about row counts, timestamps, and checksums of outputs for each run can help compare runs for differences.

Some teams even implement data versioning systems like lakeFS, or use the time-travel features of Delta Lake and Apache Iceberg, to snapshot data at each run. This way, not only is code versioned, but the data state is too.
True reproducibility of a past state is only possible when you can query the historical snapshot.

Traceability through metadata is a powerful tool for reproducibility. If you know the inputs and what code ran, you can reconstruct the pipeline’s behavior.
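To make the “pipeline X, run ID 123, commit Z” idea concrete, here’s a small sketch of a per-run record with a row count and an output checksum. Everything named here (`RunRecord`, `describe_run`, the field names, and the sample values) is a hypothetical illustration, not an API from SQLMesh, dbt, or OpenLineage.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Sequence, Tuple


def rows_checksum(rows: Sequence[Tuple]) -> str:
    """Order-independent checksum: rows are serialized, sorted, then hashed."""
    serialized = sorted(json.dumps(row, default=str) for row in rows)
    return hashlib.sha256("\n".join(serialized).encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RunRecord:
    pipeline: str
    run_id: str
    code_version: str    # for example, a Git commit hash
    input_snapshot: str  # identifier of the input data version
    row_count: int
    output_checksum: str


def describe_run(pipeline: str, run_id: str, code_version: str,
                 input_snapshot: str, rows: Sequence[Tuple]) -> RunRecord:
    """Build the metadata record to store next to the output table."""
    return RunRecord(pipeline, run_id, code_version, input_snapshot,
                     len(rows), rows_checksum(rows))


output = [("2024-01-01", "roll", 20.0), ("2024-01-02", "soup", 5.5)]
record = describe_run("fact_sales", "run-123", "abc1234", "raw/2024-01-02", output)
rerun = describe_run("fact_sales", "run-124", "abc1234", "raw/2024-01-02", output[::-1])

print(json.dumps(asdict(record), indent=2))
assert record.output_checksum == rerun.output_checksum  # same data, same checksum
```

With a record like this saved for every run, comparing two runs is a matter of comparing a handful of fields. If the code version and input snapshot match but the checksum doesn’t, something non-deterministic is hiding in the pipeline.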

Testing and Assertions

Embed tests and assertions in your pipeline to ensure changes or reruns don’t produce unexpected results. Treat data pipelines with the same rigor as software. This includes testing.

Unit tests can be written for transformation logic. Given an input, does the SQL logic produce the expected output? Data tests or assertions can run as part of the pipeline to validate outputs. Many SQL modeling frameworks support tests. In our SQL Sushi example, the spec defines tests like uniqueness or referential integrity on columns. These run to catch any deviations in the output.

    # Example data quality tests
    # run_query is assumed to be a project helper that returns a list of row tuples
    def test_sales_never_negative():
        result = run_query("SELECT COUNT(*) FROM fact_sales WHERE amount < 0")
        assert result[0][0] == 0, "Found negative sales amounts"

    def test_daily_record_count_stable():
        yesterday_count = run_query(
            "SELECT COUNT(*) FROM fact_sales WHERE date = CURRENT_DATE - 1"
        )
        day_before_count = run_query(
            "SELECT COUNT(*) FROM fact_sales WHERE date = CURRENT_DATE - 2"
        )
        yesterday_value = yesterday_count[0][0]
        day_before_value = day_before_count[0][0]
        # Guard against division by zero when the earlier day has no rows
        assert day_before_value > 0, "No records found for the day before yesterday"
        change = abs(yesterday_value - day_before_value) / day_before_value
        assert change < 0.05, "Daily count varied by more than 5%"

If a pipeline is reproducible, a test failing in a new run usually indicates that either data or logic has changed. Having a suite of tests helps ensure that when you refactor or upgrade the pipeline, it still produces the same results on a known dataset.

Assertions within the pipeline act as canaries in the coal mine if a run diverges from historical patterns. When such assertions fail, you’re alerted that the current run isn’t consistent with previous runs and prompted to investigate.

Testing prevents silent drift. Incorporating continuous integration for your data pipeline code, where every code change triggers a test run on sample data or in a staging environment, can catch non-reproducible changes early.
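Checking that a pipeline “still produces the same results on a known dataset” can be as simple as a baseline test that runs in CI. The sketch below is one possible shape for it. The transformation `aggregate_sales`, the sample rows, and the recorded baseline are all hypothetical stand-ins for your own logic and fixtures.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, float]  # (date, product_id, amount); hypothetical schema


def aggregate_sales(rows: List[Row]) -> Dict[Tuple[str, str], float]:
    """Transformation under test: total amount per (date, product_id)."""
    totals: Dict[Tuple[str, str], float] = {}
    for date, product_id, amount in rows:
        key = (date, product_id)
        totals[key] = totals.get(key, 0.0) + amount
    return totals


# Small, fixed input checked into the repository alongside the code.
KNOWN_INPUT: List[Row] = [
    ("2024-01-01", "roll", 12.0),
    ("2024-01-01", "roll", 8.0),
    ("2024-01-01", "soup", 5.5),
]

# Expected output, recorded once and reviewed by a person.
BASELINE = {
    ("2024-01-01", "roll"): 20.0,
    ("2024-01-01", "soup"): 5.5,
}


def test_matches_baseline() -> None:
    result = aggregate_sales(KNOWN_INPUT)
    assert result == BASELINE, f"Output drifted from baseline: {result}"


test_matches_baseline()
```

When the logic changes on purpose, the baseline changes in the same commit, so the difference shows up in code review instead of in production.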
If a source system developer’s schema change causes the output to differ unexpectedly from a baseline result, tests flag it before it hits production. This is the basic concept behind data contracts. By building in these checks, you ensure each pipeline run remains consistent with the intended behavior, or that intentional changes are surfaced and reviewed.

Raw Data Retention

Keep a copy of your raw input data, or at least be able to access historical inputs. One of the biggest barriers to reproducibility is when the original data is no longer available or has been altered. To counter this, design your data architecture with a raw data retention policy. If you ingest files daily, don’t discard or overwrite those raw files after processing. Store them in a raw archive (data lake or cloud storage) partitioned by date.

If you consume messages from a stream, consider using a technology that retains history, like Kafka with log retention or Delta Lake for a bronze table of raw events. You don’t have to keep records forever, but you should have a purposeful retention policy in place.

By having the raw dataset, you can re-run the pipeline on exactly what was received at that time. This could be implemented with a medallion architecture (Bronze, Silver, Gold layers), where Bronze retains all original data unmodified. Alternatively, you might use data versioning tools to tag snapshots of data.

Raw data retention goes hand-in-hand with backfilling, the ability to reprocess historical periods. Keeping historical raw data (and the ability to isolate it by date/version) means you can backfill old results whenever logic updates or an issue needs investigation. You can produce outputs as if you had run the pipeline back then.

Treat raw data as the system of record and never destructively update it. Append new data or mark corrections separately so the pipeline can always be pointed at a known, unchanging set of input for any given period.

Audit Trails and Logging

Maintain detailed logs and run records for pipeline executions. This includes logging the start/end of each job, configuration settings used, number of records processed, any errors or warnings, and the identity of the code version or person who triggered it.

Many orchestration tools (Apache Airflow, Dagster, Prefect) maintain run histories where you can inspect execution status, duration, and associated logs through their web UIs or APIs.

These logs help you understand the conditions of each run. Compare log parameters between runs to ensure nothing significant has changed. An audit trail might record data quality metrics for each run: null value counts, processing duration, schema violations, and data volume changes.
These metrics can signal if a run was anomalous.

    # Example audit logging with proper error handling and structured logging
    import logging
    import json
    import os
    import subprocess
    from datetime import datetime
    from typing import Dict, Any

    def get_current_git_commit() -> str:
        """Get current git commit hash safely."""
        try:
            return subprocess.check_output(

The above is a preview of the first 20 pages. Register to read the complete e-book.
