Analytics Engineering with SQL and dbt: Building Meaningful Data Models at Scale (Rui Machado, Hélder Russa)

Author: Rui Machado, Hélder Russa

Category: Science

With the shift from data warehouses to data lakes, data now lands in repositories before it's been transformed, enabling engineers to model raw data into clean, well-defined datasets. dbt (data build tool) helps you take that data further. This practical book shows data analysts, data engineers, BI developers, and data scientists how to create a true self-service transformation platform through the use of dynamic SQL. Authors Rui Machado from Fraudio and Hélder Russa from Jumia show you how to quickly deliver new data products by focusing more on value delivery and less on architectural and engineering concerns. If you know your business well and have the technical skills to model raw data into clean, well-defined datasets, you'll learn how to design and deliver data models without needing heavy engineering involvement.

With this book, you'll learn:
• What dbt is and how a dbt project is structured
• How dbt fits into the data engineering and analytics worlds
• How to collaborate on building data models
• The main tools and architectures for building useful, functional data models
• How dbt fits into data warehouse and data lake architectures
• How to build tests for data transformations

📄 File Format: PDF
💾 File Size: 2.2 MB

📄 Text Preview (First 20 pages)


📄 Page 1
Rui Machado & Hélder Russa
Analytics Engineering with SQL and dbt
Building Meaningful Data Models at Scale
📄 Page 2
DATA / DATA SCIENCE

“If your team is struggling with inefficient views, tangled stored procedures, low analytics adoption, or a whole host of other problems, this book will help you see a new way forward.”
—Jacob Frackson, Lead Data Architect, Datatonic

“With this book, you will get the essentials on the core principles that will help you become a skilled professional, delivering real value to your organization.”
—Michal Kolacek, Analytics Engineering Lead, Slido

Analytics Engineering with SQL and dbt

With the shift from data warehouses to data lakes, data now lands in repositories before it’s been transformed, enabling engineers to model raw data into clean, well-defined datasets. The data build tool (dbt) helps you take data further. This practical book shows data analysts, data engineers, BI developers, and data scientists how to create a true self-service transformation platform through the use of dynamic SQL. Authors Rui Machado from Fraudio and Hélder Russa from Jumia show you how to quickly deliver new data products by focusing more on value delivery and less on architectural and engineering aspects.

With this book, you’ll learn:
• Essentials of data modeling techniques and their role in analytics engineering
• Essentials of creating and maintaining databases with SQL
• How SQL can be used to deliver data analytics insights and reports
• The main tools and architectures for building useful, functional data models
• What dbt is and how a dbt project is structured
• How dbt fits into the data engineering and analytics worlds
• How to build tests for data transformations

Rui Machado is vice president of technology at Fraudio, with a background in information technologies and data science. Hélder Russa is the head of data engineering at Jumia, with over 10 years of hands-on experience in computer science.

Twitter: @oreillymedia · linkedin.com/company/oreilly-media · youtube.com/oreillymedia
US $65.99 CAN $82.99 | ISBN: 978-1-098-14238-4
📄 Page 3
Rui Machado and Hélder Russa
Analytics Engineering with SQL and dbt
Building Meaningful Data Models at Scale
Beijing · Boston · Farnham · Sebastopol · Tokyo
📄 Page 4
Analytics Engineering with SQL and dbt
by Rui Machado and Hélder Russa
Copyright © 2024 Rui Pedro Machado and Hélder Russa. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Michelle Smith
Development Editor: Angela Rufino
Production Editor: Christopher Faucher
Copyeditor: Piper Editorial Consulting, LLC
Proofreader: Sharon Wilkey
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

December 2023: First Edition
Revision History for the First Edition: 2023-12-08: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781098142384 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Analytics Engineering with SQL and dbt, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

ISBN: 978-1-098-14238-4 [LSI]
📄 Page 5
Table of Contents

Preface  vii

1. Analytics Engineering  1
   Databases and Their Impact on Analytics Engineering  3
   Cloud Computing and Its Impact on Analytics Engineering  5
   The Data Analytics Lifecycle  8
   The New Role of Analytics Engineer  11
   Responsibilities of an Analytics Engineer  12
   Enabling Analytics in a Data Mesh  13
   Data Products  14
   dbt as a Data Mesh Enabler  15
   The Heart of Analytics Engineering  16
   The Legacy Processes  17
   Using SQL and Stored Procedures for ETL/ELT  18
   Using ETL Tools  19
   The dbt Revolution  20
   Summary  22

2. Data Modeling for Analytics  23
   A Brief on Data Modeling  24
   The Conceptual Phase of Modeling  25
   The Logical Phase of Modeling  28
   The Physical Phase of Modeling  30
   The Data Normalization Process  31
   Dimensional Data Modeling  35
   Modeling with the Star Schema  36
   Modeling with the Snowflake Schema  40
   Modeling with Data Vault  42
   Monolith Data Modeling  45
📄 Page 6
   Building Modular Data Models  47
   Enabling Modular Data Models with dbt  49
   Testing Your Data Models  57
   Generating Data Documentation  59
   Debugging and Optimizing Data Models  60
   Medallion Architecture Pattern  63
   Summary  66

3. SQL for Analytics  67
   The Resiliency of SQL  68
   Database Fundamentals  70
   Types of Databases  72
   Database Management System  75
   “Speaking” with a Database  77
   Creating and Managing Your Data Structures with DDL  78
   Manipulating Data with DML  82
   Inserting Data with INSERT  83
   Selecting Data with SELECT  85
   Updating Data with UPDATE  96
   Deleting Data with DELETE  97
   Storing Queries as Views  98
   Common Table Expressions  101
   Window Functions  105
   SQL for Distributed Data Processing  109
   Data Manipulation with DuckDB  113
   Data Manipulation with Polars  117
   Data Manipulation with FugueSQL  122
   Bonus: Training Machine Learning Models with SQL  129
   Summary  133

4. Data Transformation with dbt  135
   dbt Design Philosophy  136
   dbt Data Flow  138
   dbt Cloud  139
   Setting Up dbt Cloud with BigQuery and GitHub  140
   Using the dbt Cloud UI  153
   Using the dbt Cloud IDE  163
   Structure of a dbt Project  165
   Jaffle Shop Database  168
   YAML Files  168
   Models  174
   Sources  184
   Tests  189
📄 Page 7
   Analyses  197
   Seeds  198
   Documentation  200
   dbt Commands and Selection Syntax  209
   Jobs and Deployment  212
   Summary  221

5. dbt Advanced Topics  223
   Model Materializations  223
   Tables, Views, and Ephemeral Models  224
   Incremental Models  227
   Materialized Views  229
   Snapshots  230
   Dynamic SQL with Jinja  233
   Using SQL Macros  236
   dbt Packages  242
   Installing Packages  242
   Exploring the dbt_utils Package  244
   Using Packages Inside Macros and Models  244
   dbt Semantic Layer  246
   Summary  250

6. Building an End-to-End Analytics Engineering Use Case  253
   Problem Definition: An Omnichannel Analytics Case  254
   Operational Data Modeling  254
   Conceptual Model  254
   Logical Model  255
   Physical Model  256
   High-Level Data Architecture  260
   Analytical Data Modeling  265
   Identify the Business Processes  266
   Identify Facts and Dimensions in the Dimensional Data Model  267
   Identify the Attributes for Dimensions  269
   Define the Granularity for Business Facts  270
   Creating Our Data Warehouse with dbt  271
   Tests, Documentation, and Deployment with dbt  280
   Data Analytics with SQL  291
   Conclusion  296

Index  297
📄 Page 8
(This page has no text content)
📄 Page 9
Preface

In the ever-evolving business world, a captivating concept known as analytics engineering has emerged. It quickly became the talk of the town, in demand by managers, presented by IT companies, and admired by users who marveled at the possibilities it offered. But amid the excitement, many didn’t know what analytics engineering was about. They thought it was about creating data pipelines, designing stunning visualizations, and using advanced algorithms. Oh, how wrong they were!

You can imagine this extraordinary world of analytical engineering as a cross between the meticulous investigator Sherlock Holmes, representing the analytical side, and the genius engineer Tony Stark, better known as Iron Man, representing the engineering side. Imagine the remarkable problem-solving skills of Sherlock Holmes combined with the cutting-edge technologies of Iron Man. This combination is what defines the true power and potential of analytical technology. But beware: if you thought analytics engineering was limited to data pipelines and visualizations, you missed the deep deductive thinking that Sherlock Holmes, as a representation of a data analyst or business analyst, brings to the equation. This field is where analytical investigation crosses with the techniques of a software engineer or data engineer, represented by Tony Stark.

Stop for a moment and think about the importance of data in your business. Why do you seek it? The answer lies in the pursuit of knowledge. Analytic technology is used to transform raw data into actionable insights that serve as the basis for informed decisions. It’s a powerful support system that provides facts illuminating your business’s reality. However, it doesn’t make decisions for you but instead provides you with the information you need to make your business a success.

Before you dive into creating an impressive Iron Man suit of analytics technologies, embrace the wisdom of Sherlock Holmes. Use his keen observational skills to identify and understand the core of your challenges. Refrain from succumbing to the lure of visualizations and algorithms just because others are fascinated by them. Remember that analytics engineering is more than just technology: it’s a management tool that
📄 Page 10
will be successful only if it’s aligned with your organization’s strategies and goals. Ensuring that your key performance indicators are aligned with the reality of your business will ensure that the results of your analytics engineering efforts are accurate, impactful, and won’t disappoint you.

The great adventure of analytics engineering doesn’t begin with building data pipelines or selecting advanced algorithms. No, my friend, it starts with a thorough introspection of your organization’s knowledge gaps. Figure out why that knowledge is important and how it can be leveraged to drive your business to success. Use the transformative power of analytics as your compass, pointing the way to success amid the vast sea of data.

In your pursuit of analytics engineering, always remember the story of Sherlock Holmes. Avoid building an extravagant aircraft when a humble bicycle would suffice. Let the complexity of the problem and its contextual nuances guide your efforts. Remember that analytics isn’t just about technology; it’s a beacon of management, an invaluable tool that must be used with purpose and precision. Let it become your constant companion on the road to success.

Why We Wrote This Book

In today’s era of abundant information, it is not uncommon for vital knowledge, concepts, and techniques to become obscured amid the rapid growth of technology and the relentless pursuit of innovation. During this dynamic transformation, several essential concepts can sometimes be inadvertently overlooked. This oversight doesn’t stem from their diminishing relevance but rather from the swift pace of progress. One such fundamental concept that often falls by the wayside is data modeling in the context of data management. It’s worth noting that data modeling encompasses various approaches, including Kimball, conceptual, logical, and physical modeling, among others. We recognize the pressing need to emphasize the significance of data modeling in this diverse landscape, and that’s one of the key reasons we’ve crafted this book. Within these pages, we aim to shed light on the intricacies and various dimensions of data modeling and how it underpins the broader field of analytics engineering.

Over time, the importance of data modeling in guaranteeing a solid data management system has gradually faded from general awareness. This is not because it became outdated but rather due to a shift in the industry’s focus. New words, tools, and methods have emerged, making the fundamental principles less important. A transition occurred from traditional practices to modern solutions that promised quickness and efficiency, sometimes resulting in a loss of foundational strength.
📄 Page 11
The rise of analytics engineering led to a resurgence. It was not just a trend filled with fancy words but also a return to the basics, echoing the principles of the business intelligence sector. The difference is that modern tools, infrastructure, and techniques are now available to implement these principles more efficiently.

So, why did we feel the need to document our thoughts? There are two primary reasons. First and foremost, it is crucial to underscore the enduring value and significance of well-established concepts like data modeling. While these methodologies may have been around for a while, they provide a robust foundation for the development of modern techniques. Our second intention is to emphasize that analytics engineering is not a standalone entity but rather a natural progression from the legacy of business intelligence. By integrating the two, organizations can construct a more resilient data value chain, ensuring that their data is not just extensive but also actionable, ultimately enhancing its utility.

This book is not just a sentimental trip down memory lane or a commentary on the present. It’s a blueprint for the future. Our goal is to help organizations revisit their foundations, appreciate the advantages of old and new technologies, and integrate them for a comprehensive data management approach. We’ll dig deeper into data modeling and transformation details, explain its importance, and examine how it interacts with modern analytics engineering tools. We aim to provide our readers with a complete understanding, enabling them to strengthen their data management processes and utilize the full potential of their data.

Who This Book Is For

This book is designed for professionals, students, and enthusiasts dealing with the complex world of data management and analytics. Whether you’re an experienced veteran reminiscing about the basic principles of data modeling or an aspiring analyst keen to understand the transformation from business intelligence to contemporary analytics engineering, our storytelling assures clearness and direction. Organizations seeking to strengthen their data processes will discover immense value in the combination of well-proven principles and modern tools discussed in this book. In summary, if you wish to take full advantage of your data by combining the strengths of the past with the innovations of the present, this book will guide you.
📄 Page 12
How This Book Is Organized

We’ve structured the book into six chapters:

Chapter 1, “Analytics Engineering”
This chapter traces the evolution of data management from traditional SQL-based systems to innovative tools such as Apache Airflow and dbt, each changing how we handle and view data. The analytics engineer role bridges data engineering and analytics, guaranteeing that our insights are reliable and actionable. Despite the changes in tools and roles, the importance and value of data remain paramount. Nevertheless, challenges endure, such as data quality and efficient storage, as well as optimizing compute resources for tasks like load balancing on platforms such as Redshift or designing efficient jobs with appropriately sized warehouses on Snowflake. Data modeling, which involves structuring data to reflect real-world scenarios, is at the core of these solutions.

Chapter 2, “Data Modeling for Analytics”
This chapter delves into the critical role of data modeling in today’s analytics-driven landscape. We will investigate how it aids in structuring data for efficient analysis and explore the significance of data normalization in reducing duplicity. While we emphasize the importance of normalization, it’s worth noting that various modeling methodologies, such as Kimball and One Big Table, advocate for different approaches, including denormalization, depending on specific use cases. By understanding these basic principles and considering the broader spectrum of modeling methodologies, analysts can effectively explore the data, ensuring substantial insights and informed decisions. Devoid of a robust data model, whether normalized or denormalized as per the context, the analytical process can be inconsistent and inaccurate.

Chapter 3, “SQL for Analytics”
This chapter explores the enduring strength of SQL as a premier analytics language. We will start by outlining the basics of databases and how SQL serves as the primary language for interacting with databases. Our journey will cover the usefulness of views in streamlining queries, the powerful features of window functions for advanced computations, and the flexibility of common table expressions in refining complex queries. We will also discuss SQL’s role in distributed data processing and conclude with an exciting application of SQL in machine learning model training.

Chapter 4, “Data Transformation with dbt”
This chapter provides a detailed exploration of dbt beyond an initial introduction. We will examine dbt’s crucial role in the data analytics lifecycle and demonstrate how it transforms raw data into structured and accessible models. Our exploration will navigate the dbt project structure, addressing features such as
📄 Page 13
model building, documentation, and testing while providing insights into dbt artifacts, including YAML files. At the end of this chapter, you will have a comprehensive understanding of dbt, enabling you to seamlessly incorporate it into your analytics workflows.

Chapter 5, “dbt Advanced Topics”
In this chapter, we’ll dig into the advanced aspects of dbt. Beyond just views or tables, we’ll discuss the range of model materializations in dbt, including the use of ephemeral models, data snapshots, and the implementation of incremental models to sidestep constant full data loads (a minimal sketch of such a model appears after the typographical conventions below). Additionally, we’ll elevate our analytics code, focusing on optimizing its efficiency with techniques such as Jinja, macros, and packages to keep it DRY (Don’t Repeat Yourself). Finally, we will also introduce the dbt semantic layer, which plays the key role of acting as a bridge between raw data and meaningful insights.

Chapter 6, “Building an End-to-End Analytics Engineering Use Case”
This concluding chapter consolidates everything you have learned about analytics engineering using dbt and SQL. After deepening the concepts, techniques, and best practices in prior chapters, we now pivot toward a hands-on approach by crafting a complete analytics engineering use case from scratch. dbt and SQL’s capabilities will be harnessed to design, implement, and deploy an all-encompassing analytics solution. Data modeling for varied purposes will be in the spotlight. The goal is to illustrate a holistic analytics workflow, spanning from data ingestion to reporting, by merging insights from prior chapters. During this process, we will overcome prevalent challenges and provide strategies to navigate them effectively.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
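To make the dbt material previewed in the chapter overviews above a little more concrete, here is a minimal, hypothetical sketch of the kind of incremental model Chapter 5 describes. It is only a sketch under stated assumptions: the file name, column names, and the jaffle_shop source reference are illustrative (the Jaffle Shop database is merely named in the table of contents, and the source would still need to be declared in a YAML file); it is not an example taken from the book.

    -- models/orders_incremental.sql (hypothetical file and model name)
    -- Incremental materialization: after the first full build, only new rows are processed.
    {{ config(materialized='incremental', unique_key='order_id') }}

    with source_orders as (

        select
            order_id,
            customer_id,
            order_date,
            status
        from {{ source('jaffle_shop', 'orders') }}  -- assumes this source is declared in a .yml file

        {% if is_incremental() %}
        -- On incremental runs, pick up only rows newer than what the target table already holds
        where order_date > (select max(order_date) from {{ this }})
        {% endif %}

    )

    select * from source_orders

Running dbt run --select orders_incremental compiles the Jinja, builds the table on the first run, and on later runs loads only the filtered rows, with the exact insert-or-merge behavior depending on the adapter and incremental strategy.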
📄 Page 14
This element signifies a tip or suggestion.

This element signifies a general note.

Using Code Examples

If you have a technical question or a problem using the code examples, please send email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Analytics Engineering with SQL and dbt by Rui Machado and Hélder Russa (O’Reilly). Copyright 2024 Rui Pedro Machado and Hélder Russa, 978-1-098-14238-4.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
📄 Page 15
O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-889-8969 (in the United States or Canada)
707-829-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://www.oreilly.com/about/contact.html

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/analytics-engineering-SQL-dbt.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://youtube.com/oreillymedia
📄 Page 16
Acknowledgments

I want to send a special message to my wife, Ana, and my two wonderful daughters, Mimi and Magui. You inspire me every day to believe in myself and to pursue my dreams unwaveringly because what I achieve for me, I achieve for us. Above all, I want to show my daughters that anything is possible when we set our minds to it. Lastly, I need to thank Hélder, friend and coauthor, for keeping this dream alive and having levels of resilience I have never seen before in anyone.
— Rui Machado

I want to thank my (future) wife for always being by my side. Her patience and words were my rock in the toughest times. Also, a special thank you to my parents. Without them and their efforts to allow me to continue my studies and pursue my dreams, certainly this book wouldn’t be possible. Again, my genuine thank you to them. Finally, to all my anonymous and not-so-anonymous friends, and to my coauthor, Rui, who stood by my side with positivity and constructive feedback and substantially enriched the content of this book.
— Hélder Russa
📄 Page 17
CHAPTER 1
Analytics Engineering

The historical development of analytics includes significant milestones and technologies that have shaped the field into what it is today. It began with the advent of data warehousing in the 1980s, which created the foundational framework for organizing and analyzing business data. Bill Inmon, a computer scientist who continued to publish throughout the 1980s and 1990s, is widely regarded as providing the first solid theoretical foundation for data warehousing.

A subsequent wave of development occurred when Ralph Kimball, another leading contributor to data warehousing and business intelligence (BI), published his influential work, The Data Warehouse Toolkit, in 1996. Kimball’s work laid the foundation for dimensional modeling, marking another crucial milestone in the evolution of analytics. Together, the contributions of Inmon and Kimball, spanning the late 20th century, played pivotal roles in shaping the landscape of data warehousing and analytics.

In the early 2000s, the emergence of tech giants like Google and Amazon created the need for more advanced solutions for processing massive amounts of data, leading to the release of the Google File System and Apache Hadoop. This marked the era of Big Data Engineering, in which professionals used the Hadoop framework to process large amounts of data.

The rise of public cloud providers like Amazon Web Services (AWS) revolutionized the way software and data applications were developed and deployed. One of the pioneering offerings from AWS was Amazon Redshift, introduced in 2012. It represented an interesting blend of online analytical processing (OLAP) and traditional database technologies. In its early days, Redshift required database administrators to manage tasks like vacuuming and scaling to maintain optimal performance.
📄 Page 18
Over time, cloud native technologies have continued to evolve, and Redshift itself has undergone significant enhancements. While retaining its core strengths, newer versions of Redshift, along with cloud native platforms like Google BigQuery and Snowflake, have streamlined many of these administrative tasks, offering advanced data processing capabilities to enterprises of all sizes. This evolution highlights the ongoing innovation within the cloud data processing ecosystem.

The modern data stack, consisting of tools like Apache Airflow, data build tool (dbt), and Looker, further transformed data workflows. With these advances, the term “Big Data engineer” became obsolete, making way for a data engineer’s broader and more inclusive role. This shift was recognized in the influential articles of Maxime Beauchemin—creator of Apache Superset and Airflow and one of the first data engineers at Facebook and Airbnb—particularly in his article “The Rise of the Data Engineer”, which highlighted the growing importance of data engineering in the industry.

All of these rapid developments in the data field have led to significant changes in the role of data professionals. With the advent of data tools, simple tasks are becoming strategic tasks. Today’s data engineers have a multifaceted role that encompasses data modeling, quality assurance, security, data management, architectural design, and orchestration. They are increasingly adopting software engineering practices and concepts, such as functional data engineering and declarative programming, to enhance their workflows. While Python and structured query language (SQL) stand out as indispensable languages for data engineers, it’s important to note that the choice of programming languages can vary widely in this field. Engineers may leverage other languages such as Java (commonly used for managing Apache Spark and Beam), Scala (also prevalent in the Spark and Beam ecosystem), Go, and more, depending on the specific needs and preferences of their projects. The combination of languages like Java and SQL is also common among data engineers at large organizations.

Organizations are increasingly moving toward decentralized data teams, self-service platforms, and alternative data storage options. As data engineers are forced to adapt to all these market changes, we often see some taking on a more technical role, focusing on platform enablement. Other data engineers work closer to the business, designing, implementing, and maintaining systems that turn raw data into high-value information as they adapt to this accelerated industry that is bringing new tools to market every day and spawning the fantastic world of analytics engineering.

In this chapter, we provide an introduction to the field of analytics engineering and its role in the data-driven decision-making process. We discuss the importance of analytics engineering in today’s data-driven world and the primary roles of an analytics engineer. In addition, we will explore how the analytics engineering lifecycle is used to manage the analytics process and how it ensures the quality and accuracy of the data and insights generated. We will also address the current trends and
📄 Page 19
technologies shaping the field of analytics engineering, from history to the present, touching on emerging concepts like data mesh, and discussing the fundamental choices between extract, load, and transform (ELT) and extract, transform, and load (ETL) strategies as well as the many data modeling techniques being adopted around the world.

Databases and Their Impact on Analytics Engineering

For a long time now, data has increasingly become the focus of interest for companies that want to stay one step ahead of the competition, improve their internal processes, or merely understand the behavior of their customers. With new tools, new ways of working, and new areas of knowledge such as data science and BI, it’s becoming increasingly difficult to fully survey and understand the data landscape these days.

The natural progress of technology has caused an oversupply of data analysis, visualization, and storage tools, each offering unique features and capabilities. Nevertheless, an accelerated deployment of those tools has resulted in a fragmented landscape, requiring individuals and organizations to remain up-to-date with the most recent technological developments while at the same time having to make prudent choices on how to use them. Sometimes this abundance creates confusion and requires a continuous cycle of learning and adaptation.

The evolution of work practices is accompanied by a diversification of tools. Dynamic and Agile methodologies have replaced traditional approaches to data management and analysis. Iterative practices and cross-functional collaboration introduce flexibility and speed to data projects, but they also pose a challenge in harmonizing workflows across diverse teams and roles. Effective communication and alignment are crucial as diverse facets of the data process converge, creating a need for a comprehensive understanding of these novel work practices.

Specialized areas such as data science and BI have increased the complexity of the data field as well. Data scientists apply advanced statistical and machine learning techniques to detect complex patterns, whereas BI experts extract valuable information from raw data to produce practical insights. Such specialized areas introduce refined techniques that require regular skill development and learning. A successful adoption of these practices necessitates a dedicated commitment to education and a flexible approach to skill acquisition.

As data spreads across the digital domain, it carries with it unforeseen amounts, varieties, and speeds. The flood of data, along with the complex features of present-day data sources, such as Internet of things (IoT) gadgets and unorganized text, makes data management even more demanding. The details of incorporating, converting, and assessing data precision become more apparent, emphasizing the need for strong methods that guarantee reliable and precise insights.
📄 Page 20
The multifaceted nature of the data world compounds its complexity. As an outcome of converging skills from various domains, including computer science, statistics, and field-specific proficiency, a cooperative and communicative strategy is necessary. This multidisciplinary interaction accentuates the significance of efficient teamwork and knowledge sharing.

But that has not always been the case. For decades, spreadsheets were the standard technology for storing, managing, and analyzing data at all levels, both for business operational management and for analytics to understand it. However, as businesses have become more complex, so has the need for data-related decision making. And the first of these came in the form of a revolution called databases.

Databases can be defined as an organized collection of structured information or data, usually stored electronically in a computer system. This data can be in the form of text, numbers, images, or other types of digital information. Data is stored in a way that facilitates access and retrieval using a set of predefined rules and structures called a schema. Databases are an essential part of analytics because they provide a way to efficiently store, organize, and retrieve large amounts of data, allowing analysts to easily access the data they need to perform complex analyses to gain insights that would otherwise be difficult or impossible to obtain. In addition, databases can be configured to ensure data integrity, which guarantees that the data being analyzed is accurate and consistent and thus makes the analysis more reliable and trustworthy.

One of the most common ways to use databases for analytics is the data warehousing technique, that is, to construct and use a data warehouse. A data warehouse is a large, centralized data store designed to simplify data use. The data in a data warehouse is typically extracted from a variety of sources, such as transactional systems, external data feeds, and other databases. The data is then cleansed, transformed, and integrated into a single, consistent data model that typically follows a dimensional modeling technique such as the star schema or Data Vault.

Another important use of databases in analytics is the process of data mining. Data mining uses statistical and machine learning techniques to uncover patterns and relationships in large datasets. In this way, trends can be identified, future behavior can be predicted, and other types of predictions can be made.

Database technologies and data scientists have thus played a crucial role in the emergence of data science by providing a way to efficiently store, organize, and retrieve large amounts of data, enabling data scientists to work with large datasets and focus on what matters: gaining knowledge from data. The use of SQL and other programming languages, such as Python or Scala, that allow interaction with databases has enabled data scientists to perform complex data queries and manipulations. Also, the use of data visualization tools such as Tableau
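Before the preview ends, it is worth grounding the ideas this page introduces (schemas, star-schema data warehouses, and SQL as the language for querying them) in a short sketch. Everything below is an illustrative assumption rather than an example from the book: one dimension table and one fact table forming a minimal star schema, followed by an analytical query that uses a common table expression and a window function of the kind Chapter 3 covers, with date_trunc syntax as in PostgreSQL, Redshift, or Snowflake.

    -- A minimal star schema: one dimension and one fact table (all names are illustrative)
    create table dim_customer (
        customer_key  integer primary key,
        customer_name varchar(100),
        country       varchar(50)
    );

    create table fct_sales (
        sale_id      integer primary key,
        customer_key integer references dim_customer (customer_key),
        sale_date    date,
        amount       numeric(10, 2)
    );

    -- Analytical query: monthly revenue per country, ranked within each month
    with monthly_revenue as (
        select
            c.country,
            date_trunc('month', f.sale_date) as sale_month,
            sum(f.amount)                    as revenue
        from fct_sales f
        join dim_customer c on c.customer_key = f.customer_key
        group by c.country, date_trunc('month', f.sale_date)
    )
    select
        country,
        sale_month,
        revenue,
        rank() over (partition by sale_month order by revenue desc) as country_rank
    from monthly_revenue;

The fact table records measurable events at a single grain (one row per sale) and carries foreign keys to descriptive dimension tables, which is what lets a query like this slice revenue by any customer attribute without touching the transactional source systems.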
(End of the 20-page preview.)
