📄 Page
1
Sakti Mishra, Dylan Qu & Anusha Challa
AWS Certified Data Engineer Associate Study Guide
In-Depth Guidance and Practice
📄 Page
2
ISBN: 978-1-098-17007-3
US $59.99 CAN $74.99
CLOUD COMPUTING

Sakti Mishra is a principal data and AI solutions architect at AWS and the author of Simplify Big Data Analytics with Amazon EMR. He has 20 years of experience in cloud, big data, and analytics.

Dylan Qu is a principal solutions architect at AWS with eight years of experience in scalable data solutions and is also a frequent contributor on big data, serverless, and machine learning.

Anusha Challa is a data and analytics expert with over 15 years of experience and a master’s degree in machine learning. She’s known for speaking at AWS events and authoring technical content.

There’s no better time to become a data engineer. And acing the AWS Certified Data Engineer Associate (DEA-C01) exam will help you tackle the demands of modern data engineering and secure your place in the technology-driven future. Authors Sakti Mishra, Dylan Qu, and Anusha Challa equip you with the knowledge and sought-after skills necessary to effectively manage data and excel in your career.

Whether you’re a data engineer, data analyst, or machine learning engineer, you’ll discover in-depth guidance, practical exercises, sample questions, and expert advice you need to leverage AWS services effectively and achieve certification.

By reading, you’ll learn how to:
• Ingest, transform, and orchestrate data pipelines effectively
• Select the ideal data store, design efficient data models, and manage data lifecycles
• Analyze data rigorously and maintain high data quality standards
• Implement robust authentication, authorization, and data governance protocols
• Prepare thoroughly for the DEA-C01 exam with targeted strategies and practices
📄 Page
3
Sakti Mishra, Dylan Qu, and Anusha Challa AWS Certified Data Engineer Associate Study Guide In-Depth Guidance and Practice
📄 Page
4
978-1-098-17007-3
[LSI]

AWS Certified Data Engineer Associate Study Guide
by Sakti Mishra, Dylan Qu, and Anusha Challa

Copyright © 2025 Sakti Mishra, Dylan Qu, and Anusha Challa. All rights reserved.

Published by O’Reilly Media, Inc., 141 Stony Circle, Suite 195, Santa Rosa, CA 95401.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Megan Laddusaw
Development Editor: Shira Evans
Production Editor: Gregory Hyman
Copyeditor: Charles Roumeliotis
Proofreader: Vanessa Moore
Indexer: Potomac Indexing, LLC
Cover Designer: Susan Brown
Interior Designer: David Futato
Cover Illustrator: José Marzan, Jr.
Interior Illustrator: Kate Dullea

September 2025: First Edition

Revision History for the First Edition
2025-08-22: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098170073 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. AWS Certified Data Engineer Associate Study Guide, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
📄 Page
5
Table of Contents

Preface xiii

1. Certification Essentials 1
  Who Is a Data Engineer? 1
  Becoming an AWS Data Engineer Associate 2
  Exam Topics 3
  Exam Format 4
  Registering for the Exam 5
  Exam-Style Questions 5
  Think Like an AWS Solutions Architect: Translating a Real-World Problem-Solving Framework into Certification 5
  The Solutions Architect’s Problem-Solving Framework 6
  Real-World Example: Designing a Serverless Stream Analytics Platform to Detect Fraud 7
  How This Thought Process Applies to Certification Questions 8
  Study Plan 10
  Conclusion 11

2. Prerequisite Knowledge for Aspiring Data Engineers 13
  Databases and Types of Databases 13
  What Is a Database? 13
  What Is a Database Management System? 13
  Types of Databases 14
  Hierarchical Databases 14
  Relational Databases 14
  NoSQL Databases 15
  OLTP Versus OLAP 16
  Overview of Big Data 16
📄 Page
6
  Distributed Processing Frameworks for Big Data 18
  MapReduce 18
  Spark 19
  Flink 20
  Hive 21
  Presto 22
  Trino 22
  What Is a Data Lake? 23
  What Is a Data Warehouse? 23
  Data Warehouse Versus Data Lake 24
  ETL Versus ELT 24
  Different Ways to Process Data 25
  Batch Processing Pipeline 25
  Real-Time Stream Processing 26
  Event-Driven Processing 26
  High-Level Architecture Overview of Data Processing Pipelines 26
  Working with Code Repositories 28
  What Is a Code Repository? 28
  How to Work with Code Repositories 29
  CI/CD 30
  Cloud Computing and AWS 31
  What Is Cloud Computing? 31
  An Overview of Amazon Web Services 32
  Getting Started with AWS 34
  How to Set Up an AWS Account 34
  Configure Access with AWS IAM 35
  Create an IAM User for Authentication 36
  Add Permissions to Authorize the User 36
  What Is an IAM Policy? 36
  What Is an IAM Role? 37
  Best Practices to Follow with AWS IAM 37
  Conclusion 38
  Resources 38

3. Overview of AWS Analytics and Auxiliary Services 39
  AWS Analytics Services 39
  Amazon Kinesis Data Streams 40
  Amazon Data Firehose 41
  Amazon Managed Service for Apache Flink 43
  Amazon Managed Streaming for Apache Kafka 44
📄 Page
7
  Reference Architecture: Streaming Analytics Pattern with Apache Flink and MSK 46
  AWS Glue 47
  AWS Glue DataBrew 50
  Amazon Athena 52
  Amazon EMR 53
  Amazon Redshift 55
  Amazon QuickSight 57
  Reference Architecture: Lakehouse with Glue, Redshift, and Athena 59
  Amazon OpenSearch Service 60
  Amazon DataZone 62
  AWS Lake Formation 63
  Auxiliary Services for Analytics 64
  Application Integration 64
  Compute and Containers 65
  Database 67
  Storage 68
  Machine Learning 69
  Migration and Transfer 70
  Networking and Content Delivery 71
  Security, Identity, and Compliance 72
  Management Governance 73
  Developer Tools 75
  Cloud Financial Management 76
  AWS Well-Architected Tool 76
  Conclusion 77
  Additional Resources 77

4. Data Ingestion and Transformation 79
  Data Ingestion 80
  Real-Time Streaming Data Ingestion 80
  Kinesis Data Streams Versus Amazon MSK 83
  Sample Streaming Ingestion Use Cases 85
  Ingesting Data Using Zero-ETL Integrations 89
  Ingesting Data from Databases with CDC Using AWS Data Migration Service 91
  Supported Sources for AWS DMS 91
  Supported Targets for AWS DMS 92
  Sample Use Cases 92
  Best Practices for Data Ingestion 95
  Best Practices for Streaming Ingestion 96
  Best Practices for Choosing Data Stream Capacity Mode 97
📄 Page
8
  Best Practices for Sharding 98
  Best Practices for Consuming Data from KDS 98
  Best Practices for Amazon MSK 99
  Best Practices for Amazon Data Firehose 103
  Best Practices for AWS DMS Replication Instances and Tasks 104
  Best Practices for AWS DMS Tasks with Amazon Redshift Target 105
  Data Transformation 107
  Batch Data Transformation 107
  Streaming Data Transformation 107
  Data Transformation Using AWS Glue 108
  Glue Connectors 108
  Glue Bookmarks 109
  Data Processing Units 109
  Worker Type 109
  Glue Jobs 110
  Data Sources and Destinations 110
  Best Practices for AWS Glue 113
  Data Transformation Using Amazon EMR 114
  Storage 114
  Deployment Options 115
  Instance Types 116
  Best Practices for Amazon EMR 116
  AWS Glue Versus Amazon EMR Options 117
  SQL-Based Data Transformation Using Amazon Redshift 118
  Amazon Redshift Compute 118
  Amazon Redshift Storage 118
  SQL Data Transformations 121
  Amazon Managed Service for Apache Flink 123
  Amazon Data Firehose for Transformation 125
  AWS Lambda for Transformation 125
  Choosing the Right Streaming Transformation Service 125
  Choosing the Right Batch Transformation Service 127
  Data Preparation for Nontechnical Personas 128
  Fill Missing Values 128
  Identify Duplicate Records 129
  Formatting Functions 130
  Integrating Data from Multiple Sources 130
  Nesting and Unnesting Data Structures 130
  Protecting Sensitive Data 131
  Other Data Preparation Transformations 133
  Orchestrating Data Pipelines 133
📄 Page
9
  AWS Step Functions 133
  Managed Workflows for Apache Airflow 134
  Sample Use Case 136
  AWS Glue Workflows 137
  Sample Use Case 137
  Amazon Redshift Scheduler 138
  Amazon EventBridge 139
  Sample Use Case 140
  Choosing the Right Orchestration Service 142
  Conclusion 143
  Practice Questions 144
  Additional Resources 149

5. Data Store Management 151
  Choosing a Data Store 151
  AWS Core Storage Services 152
  AWS Cloud Databases 153
  Data Storage Formats for Data Lakes 155
  Row-Based File Formats 155
  Column-Based File Formats 156
  Table Formats 156
  Building a Data Strategy with Multiple Data Stores 158
  Data Cataloging Systems 160
  Components of Metadata and Data Catalogs 160
  Populating an AWS Glue Data Catalog 161
  Data Catalog Best Practices 164
  Enriching Data Catalogs with Data Classification 166
  Managing the Lifecycle of Data 167
  Selecting Storage Solutions for Hot and Cold Data 167
  Example: Building a Petabyte-Scale Log Analytics Solution on AWS 169
  Storage Tier Decisions for Different Access Patterns 169
  Defining Data Retention Policy and Archiving Strategies 170
  Performing COPY and UNLOAD Operations to Move Data Between Amazon S3 and Amazon Redshift 171
  Optimizing Data Management with Amazon S3 173
  Overview of S3 Storage Classes 173
  Choosing the Right Storage Class 176
  S3 Intelligent-Tiering 176
  Managing the Data Lifecycle with Amazon S3 Lifecycle 178
  Monitoring the Amazon S3 Data Lifecycle 179
  Expiring Snapshots from Open Table Formats 181
📄 Page
10
  Archiving Data from Amazon DynamoDB to Amazon S3 182
  Ensuring S3 Data Resiliency with S3 Versioning 183
  Enabling Versioning on an S3 Bucket 184
  S3 Versioning and Object Lifecycle Management 184
  Designing Data Models and Schema 185
  Introduction to Data Modeling 185
  Data Modeling Strategies for Amazon Redshift 187
  Data Modeling Strategies for Amazon DynamoDB 192
  Data Modeling Strategies for Data Lakes 199
  Amazon S3 Data Lake Best Practices 200
  Conclusion 201
  Practice Questions 202
  Additional Resources 206

6. Data Operations and Support 207
  Amazon QuickSight 208
  Data Sources 208
  Datasets 210
  Refreshing SPICE Datasets 210
  Visualizations 211
  Presentation Formats 218
  QuickSight GenBI Capabilities (QuickSight Q) 219
  SQL Analytics Using Amazon Athena 222
  Choice of Querying Engine 222
  Workgroups 224
  Capacity Reservations 225
  Athena Federated SQL 225
  Use Cases 226
  DDL Capabilities 228
  Best Practices When Using Amazon Athena 229
  SQL Analytics Using Amazon Redshift 230
  SQL Functions 231
  Semi-Structured Data Analysis 231
  Geospatial Data Analysis 234
  Query Data from Data Lake 235
  Analyzing Data from Operational Data Stores Using Amazon Redshift 235
  Redshift ML and Generative AI 236
  User-Defined Functions 237
  Analyzing Data Using Notebooks 237
  AWS Glue Interactive Sessions 237
  Amazon EMR Notebooks 239
📄 Page
11
  Data Pipeline Resiliency 239
  Monitoring 240
  Alerting 243
  Event-Driven Pipeline Maintenance with EventBridge 245
  Ensuring Data Quality and Reliability: Deequ and DQDL 246
  Automated Data Quality Checks and Error Handling 252
  Troubleshooting and Performance Tuning 252
  CI/CD Pipelines 254
  Version Control and Collaboration 255
  Infrastructure as Code 255
  Disaster Recovery and High Availability 258
  Cost Optimization for Data Pipelines 263
  Leveraging Serverless Services 263
  Autoscaling 264
  Tiered Storage 264
  Columnar Formats 264
  Monitor and Control Data Transfer Costs 264
  Follow Cost Optimization Best Practices 265
  Conclusion 265
  Practice Questions 266
  Additional Resources 270

7. Data Security and Governance 271
  Network Security 271
  Amazon VPC Overview 272
  Security Groups Overview 273
  Best Practices for Configuring Security Groups for Your Workloads 273
  Configuring a VPC and Security Group for an Amazon EMR Cluster 274
  Managed Services Versus Unmanaged Services 275
  VPC Endpoints Overview 276
  User Authentication and Authorization 279
  Authenticating Users with IAM Credentials 279
  IAM Role-Based Authentication and Authorization 279
  Service-Linked Roles 280
  Managed Versus Self-Managed Policies 280
  Enable Single Sign-on with AWS IAM Identity Center 280
  Data Security and Privacy 283
  Secure Data in Amazon S3 283
  Manage Database Credentials 283
  Data Encryption and Decryption and Managing the Encryption Keys 284
  Managing Encryption Keys with AWS KMS 285
📄 Page
12
  Enabling Encryption in AWS Analytics Services 288
  Sensitive Data Detection and Redaction 292
  Fine-Grained Access Control with AWS Lake Formation 296
  Database Security in Amazon Redshift 303
  Fine-Grained Access Control in Amazon QuickSight 304
  Data Governance 304
  Metadata Management and Technical Catalog 305
  Data Sharing 306
  Data Quality 312
  Data Profiling 313
  Data Lifecycle Management 314
  Data Lineage 315
  Logging and Auditing 317
  Analyzing Logs Using AWS Services 319
  Conclusion 321
  Practice Questions 321
  Additional Resources 325

8. Implementing Batch and Streaming Pipelines 327
  Data Processing Pipeline 327
  Implementing a Batch Processing Pipeline 328
  Use Case and Architecture Overview 328
  Overview of Input Dataset 329
  Step-by-Step Implementation Guide 330
  Best Practices and Optimization Techniques 354
  Implementing a Real-Time Streaming Pipeline 354
  Use Case and Architecture Overview 355
  Step-by-Step Implementation Guide 355
  Conclusion 373
  Resources 373

9. Practice Exam 375

10. What’s New in AWS for Data Engineers 397
  Amazon SageMaker Unified Studio 397
  Amazon SageMaker Catalog 398
  Amazon SageMaker Lakehouse 399
  Amazon SageMaker AI 401
  Amazon S3 Tables 402
  Amazon S3 Metadata 403
  Improving the Developer Experience with Generative AI 403
📄 Page
13
  Generative AI–Powered Code Generation with Amazon Q Developer 403
  Automated Script Upgrade in AWS Glue 404
  GenAI-Powered Troubleshooting for Spark in AWS Glue 404
  Conclusion 405
  Resources 405

Appendix: Solutions to the Practice Questions 407

Index 437
📄 Page
14
(This page has no text content)
📄 Page
15
Preface

As Data Analytics Specialist Architects at Amazon Web Services (AWS), we (Sakti, Dylan, and Anusha) spent more than five years collaborating to solve some of the most challenging and innovative data problems for diverse clients. Our collective experience spans a wide range of industries and use cases: helping Chief Data Officers shape organizational data strategies, architecting petabyte-scale lakehouses, and building operationally excellent data platforms through proven best practices in performance, cost optimization, security, and comprehensive data governance.

In today’s landscape, the demand for skilled data professionals has become more critical than ever, with the rise of generative AI compelling companies to leverage their data as a key business differentiator. Throughout our tenure at AWS, we were constantly asked by colleagues from diverse backgrounds how they could break into the dynamic field of data engineering.

Our consistent recommendation was to use the AWS Certified Data Engineer Associate (DEA-C01) certification as a starting point. Our rationale is not simply about acquiring another credential but about leveraging the certification’s curriculum as a structured framework to gain a fundamental understanding of data engineering principles, both in general and specifically on the AWS Cloud. This book is the result of that shared experience and our passion for teaching, created to provide the clear, comprehensive, and practical guide we wished we had.

What This Book Isn’t

Before we detail what this book covers, it’s important to clarify what it isn’t. This book is not an exhaustive deep dive into a single AWS service, nor is it a comprehensive manual for hands-on implementation. While many excellent books approach data engineering from a specific technology perspective, their focus can be narrow. Instead, our goal is to provide comprehensive coverage of the fundamental concepts and architectural patterns for data engineering on AWS.
📄 Page
16
What This Book Is About

This book is designed to be your comprehensive guide to mastering the skills for the AWS Certified Data Engineer Associate (DEA-C01) certification. Our goal is to provide a clear path from foundational concepts to advanced, practical application. By the end of this book, you will understand:

• The format of the DEA-C01 exam, how to prepare effectively, and strategies for success on test day
• The key responsibilities and mindset of an AWS Certified Data Engineer
• How core AWS database, analytics, and auxiliary services function and how to apply them to solve real-world data challenges
• The art of selecting the right services to architect solutions that are optimized for cost, performance, security, and high availability

Who Should Read This Book

Our primary audience is any technical practitioner who wants to prepare for the DEA-C01 certification. This guide is crafted to serve a diverse group of professionals, and you will find this book especially valuable if you are:

• A software engineer, data scientist, or data analyst interested in transitioning into data engineering. We provide the foundational knowledge and practical AWS skills needed to make a successful career pivot.
• A current data engineer focused on specific technologies who wants to broaden their perspective across the entire AWS data ecosystem. This book will help you connect the dots and build a more comprehensive skill set.

How This Book Is Organized

The book is organized into four parts, each building upon the last to create a complete learning journey:

Chapters 1 to 3
This part lays the essential groundwork. We begin by defining the data engineer’s role and breaking down the AWS Certified Data Engineer Associate exam itself: what it covers, how to register, and a recommended study plan. We then cover the prerequisite knowledge every data engineer needs, including foundational concepts in databases, data lakes, distributed processing frameworks like Spark and Flink, and the fundamentals of the AWS Cloud. This part ensures you have the solid base needed to tackle the core technical content.
📄 Page
17
Chapters 4 to 7
This is the heart of the book, diving deep into the four technical domains of the certification. This part is meticulously structured to align with the official exam guide, helping you build a solid understanding of the required knowledge. You will learn to design and implement pipelines for data ingestion and transformation (Chapter 4), select and manage the right data stores for any use case (Chapter 5), maintain and optimize data pipelines for operational excellence (Chapter 6), and secure your data with robust governance and security controls (Chapter 7).

Chapters 8 and 9
Here we transition from theory to practical application and exam readiness. In Chapter 8, we provide a hands-on implementation guide for building both batch and real-time streaming data pipelines, allowing you to apply the concepts learned in previous chapters. To solidify your knowledge and build confidence for the exam, Chapter 9 provides an extensive practice exam with over 40 certification-style questions, complete with detailed explanations and rationales that guide you on how to approach and solve them.

Chapter 10
Finally, we look to the future, covering the latest services and features in the AWS data landscape. While some of these newer capabilities may not yet be on the current exam, understanding them is vital for any forward-looking data engineer. We are committed to keeping this guide relevant and will update this section in future editions as the certification scope and AWS services evolve.

Accessing the Book’s Images Online

Readers of the printed book can access large-format versions of the book’s images at https://oreil.ly/aws-certified-data-engineer-images.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
📄 Page
18
Constant width bold
Used to highlight snippets of special interest in program listings.

This element signifies a tip or suggestion.

This element signifies a general note.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
141 Stony Circle, Suite 195
Santa Rosa, CA 95401
800-889-8969 (in the United States or Canada)
707-827-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://oreilly.com/about/contact.html

We have a web page for this book, where we list errata and any additional information. You can access this page at https://oreil.ly/aws-certified-data-engineer.
📄 Page
19
For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media.

Watch us on YouTube: https://youtube.com/oreillymedia.

Acknowledgments

We would like to extend our deep appreciation to our technical reviewers, Julian Setiawan, Pooja Chitrakar, and Sam Warner, for their invaluable feedback that helped enhance the quality of this work.

Working with the O’Reilly team has been a true pleasure. We extend special thanks to Shira Evans for her excellent organization and assistance; Greg Hyman, our diligent production editor; Kate Dullea, our wonderful technical illustrator; and Megan Laddusaw, our content acquisition editor.

On a personal note:

Sakti extends his heartfelt gratitude to his coauthors, Dylan and Anusha, whose invaluable collaboration and insights were instrumental in bringing this work to fruition. He is deeply thankful to his wife, Soumya Mishra, for her unwavering support and patience throughout this journey. He is deeply grateful to his parents, Asoka and Bijayalaxmi Mishra, and his sister, Sabujima Mishra, who have been constant pillars of strength in his life and instilled in him the value of continuous learning and perseverance.

Dylan would like to express his sincere gratitude to his coauthor Sakti, who first proposed this book and assembled such a dream team to bring it to life. He is also immensely thankful for his coauthor Anusha, whose dedication and deep technical insights were essential to the quality of this guide. A special thanks to his wife, Surui Qu, for her constant support and encouragement throughout this entire process. He is also deeply grateful to his mother, Xin Li, and father, Anjing Qu, for surrounding him with love and inspiring him to strive for excellence from a young age.

Anusha is grateful to her coauthors, Dylan and Sakti, whose collaboration grew into meaningful personal connections. Anusha owes gratitude to her husband, Saravana, whose understated support made the long hours manageable. She is blessed to have her mother, Padmavati, father, Buddha Bhagavan, and sister, Praveena, who take immense pride in her smallest achievements.

Each of these individuals played a vital part in the completion of this work, and their contributions were truly irreplaceable.
📄 Page
20
(This page has no text content)