Building ETL Pipelines with Python
Authors: Brij Kishore Pandey, Emily Ro Schoof
Create and deploy enterprise-ready ETL pipelines by employing modern methods
Building ETL Pipelines with Python

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Reshma Raman
Publishing Product Managers: Birjees Patel and Heramb Bhavsar
Content Development Editor: Shreya Moharir
Project Coordinator: Hemangi Lotlikar
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Proofreader: Safis Editing
Indexer: Subalakshmi Govindhan
Production Designer: Prashant Ghare
DevRel Marketing Coordinator: Nivedita Singh

First published: September 2023
Production reference: 1250923

Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB

ISBN 978-1-80461-525-6

www.packtpub.com

To my daughter, Yashvi, who lights up my life; to Khushboo, my wife, my rock, and my inspiration; to my parents, Madhwa Nand and Veena, who taught me everything I know; and to my brothers, who have always stood by my side.

– Brij Kishore Pandey
Contributors

About the authors

Brij Kishore Pandey stands as a testament to dedication, innovation, and mastery in the vast domains of software engineering, data engineering, machine learning, and architectural design. His illustrious career, spanning over 14 years, has seen him wear multiple hats, transitioning seamlessly between roles and consistently pushing the boundaries of technological advancement.

Hailing from the renowned SRM Institute of Science and Technology in Chennai, India, Brij’s academic foundation in electrical and electronics engineering has served as the bedrock upon which he built his dynamic career. He has had the privilege of collaborating with industry behemoths such as JP Morgan Chase, American Express, 3M Company, Alaska Airlines, and Cigna Healthcare, contributing immensely with his diverse skill set.

Presently, Brij assumes a dual role, guiding teams as a principal software engineer and providing visionary architectural solutions at ADP (Automatic Data Processing Inc.). A fervent believer in continuous learning and sharing knowledge, Brij has graced various international platforms as a speaker, sharing insights, experiences, and best practices with budding engineers and seasoned professionals alike. His influence doesn’t end there; he has also taken on mentorship roles, guiding the next generation of tech aficionados, in association with Mentor Cruise Inc.
Beyond the world of code, algorithms, and systems, Brij finds profound solace in spiritual pursuits. He devotes time to the ardent practice of meditation and myriad yoga disciplines, echoing his belief in a holistic approach to well-being. Deep spiritual guidance from his revered guru, Avdhoot Shivanand, has been pivotal in shaping his inner journey and perspective. Originally from India, Brij Kishore Pandey resides in Parsippany, New Jersey, USA, with his wife and daughter.

Emily Ro Schoof is a dedicated data specialist with a global perspective, showcasing her expertise as a data scientist and data engineer on both national and international platforms. Drawing from a background rooted in healthcare and experimental design, she brings a unique perspective to her data analytics roles. Emily’s multifaceted career ranges from working with UNICEF to design automated forecasting algorithms that identify conflict anomalies using near real-time media monitoring, to serving as a subject matter expert for General Assembly’s Data Engineering course content and design. Her mission is to empower individuals to leverage data for positive impact. Emily holds the strong belief that providing easy access to resources that merge theory and real-world applications is the essential first step in this process.

About the reviewers

Adonis Castillo Cordero has been working in software engineering, data engineering, and business intelligence for the last five years. He is passionate about systems engineering, data, and leadership. His recent focus areas include the cloud-native landscape, business strategy, and data engineering and analytics. Based in Alajuela, Costa Rica, Adonis currently works as a lead data engineer and has worked for Fortune 500 companies such as Experian and 3M.

I’m grateful for my family and friends’ unwavering support during this project. Thanks to the publisher for their professionalism and guidance. I sincerely hope the book brings joy and is useful to readers.

Dr. Bipul Kumar is an AI consultant who brings over seven years of experience in deep learning and machine learning to the table. His journey in AI has encompassed various domains, including conversational AI, computer vision, and speech recognition. Bipul has had the privilege of working on impactful projects, including contributing to the development of software as a medical device as the head of AI at Kaliber Labs. He also served as an AI consultant at AIX, specializing in developing conversational AI. His academic pursuits led him to earn a PhD from IIM Ranchi and a B.Tech from SRMIST. With a passion for research and innovation, Bipul has authored numerous publications and contributed to a patent application, humbly making his mark on the AI landscape.
Table of Contents

Preface

Part 1: Introduction to ETL, Data Pipelines, and Design Principles

1  A Primer on Python and the Development Environment
   Introducing Python fundamentals
   An overview of Python data structures
   Python if…else conditions or conditional statements
   Python looping techniques
   Python functions
   Object-oriented programming with Python
   Working with files in Python
   Establishing a development environment
   Version control with Git tracking
   Documenting environment dependencies with requirements.txt
   Utilizing module management systems (MMSs)
   Configuring a Pipenv environment in PyCharm
   Summary

2  Understanding the ETL Process and Data Pipelines
   What is a data pipeline?
   How do we create a robust pipeline?
   Pre-work – understanding your data
   Design planning – planning your workflow
   Architecture development – developing your resources
   Putting it all together – project diagrams
   What is an ETL data pipeline?
   Batch processing
   Streaming method
   Cloud-native
   Automating ETL pipelines
   Exploring use cases for ETL pipelines
   Summary
   References

3  Design Principles for Creating Scalable and Resilient Pipelines
   Technical requirements
   Understanding the design patterns for ETL
   Basic ETL design pattern
   ETL-P design pattern
   ETL-VP design pattern
   ELT two-phase pattern
   Preparing your local environment for installations
   Open source Python libraries for ETL pipelines
   Pandas
   NumPy
   Scaling for big data packages
   Dask
   Numba
   Summary
   References

Part 2: Designing ETL Pipelines with Python

4  Sourcing Insightful Data and Data Extraction Strategies
   Technical requirements
   What is data sourcing?
   Accessibility to data
   Types of data sources
   Getting started with data extraction
   CSV and Excel data files
   Parquet data files
   API connections
   Databases
   Data from web pages
   Creating a data extraction pipeline using Python
   Data extraction
   Logging
   Summary
   References

5  Data Cleansing and Transformation
   Technical requirements
   Scrubbing your data
   Data transformation
   Data cleansing and transformation in ETL pipelines
   Understanding the downstream applications of your data
   Strategies for data cleansing and transformation in Python
   Preliminary tasks – the importance of staging data
   Transformation activities in Python
   Creating data pipeline activity in Python
   Summary

6  Loading Transformed Data
   Technical requirements
   Introduction to data loading
   Choosing the load destination
   Types of load destinations
   Best practices for data loading
   Optimizing data loading activities by controlling the data import method
   Creating demo data
   Full data loads
   Incremental data loads
   Precautions to consider
   Tutorial – preparing your local environment for data loading activities
   Downloading and installing PostgreSQL
   Creating data schemas in PostgreSQL
   Summary

7  Tutorial – Building an End-to-End ETL Pipeline in Python
   Technical requirements
   Introducing the project
   The approach
   The data
   Creating tables in PostgreSQL
   Sourcing and extracting the data
   Transformation and data cleansing
   Loading data into PostgreSQL tables
   Making it deployable
   Summary

8  Powerful ETL Libraries and Tools in Python
   Technical requirements
   Architecture of Python files
   Configuring your local environment
   config.ini
   config.yaml
   Part 1 – ETL tools in Python
   Bonobo
   Odo
   Mito ETL
   Riko
   pETL
   Luigi
   Part 2 – pipeline workflow management platforms in Python
   Airflow
   Summary

Part 3: Creating ETL Pipelines in AWS

9  A Primer on AWS Tools for ETL Processes
   Common data storage tools in AWS
   Amazon RDS
   Amazon Redshift
   Amazon S3
   Amazon EC2
   Discussion – Building flexible applications in AWS
   Leveraging S3 and EC2
   Computing and automation with AWS
   AWS Glue
   AWS Lambda
   AWS Step Functions
   AWS big data tools for ETL pipelines
   AWS Data Pipeline
   Amazon Kinesis
   Amazon EMR
   Walk-through – creating a Free Tier AWS account
   Prerequisites for running AWS from your device
   AWS CLI
   Docker
   LocalStack
   AWS SAM CLI
   Summary

10  Tutorial – Creating an ETL Pipeline in AWS
    Technical requirements
    Creating a Python pipeline with Amazon S3, Lambda, and Step Functions
    Setting the stage with the AWS CLI
    Creating a “proof of concept” data pipeline in Python
    Using Boto3 and Amazon S3 to read data
    AWS Lambda functions
    AWS Step Functions
    An introduction to a scalable ETL pipeline using Bonobo, EC2, and RDS
    Configuring your AWS environment with EC2 and RDS
    Creating an RDS instance
    Creating an EC2 instance
    Creating a data pipeline locally with Bonobo
    Adding the pipeline to AWS
    Summary

11  Building Robust Deployment Pipelines in AWS
    Technical requirements
    What is CI/CD and why is it important?
    The six key elements of CI/CD
    Essential steps for CI/CD adoption
    CI/CD is a continual process
    Creating a robust CI/CD process for ETL pipelines in AWS
    Creating a CI/CD pipeline
    Building an ETL pipeline using various AWS services
    Setting up a CodeCommit repository
    Orchestrating with AWS CodePipeline
    Testing the pipeline
    Summary

Part 4: Automating and Scaling ETL Pipelines

12  Orchestration and Scaling in ETL Pipelines
    Technical requirements
    Performance bottlenecks
    Inflexibility
    Limited scalability
    Operational overheads
    Exploring the types of scaling
    Vertical scaling
    Horizontal scaling
    Choose your scaling strategy
    Processing requirements
    Data volume
    Cost
    Complexity and skills
    Reliability and availability