Building Real-Time Analytics Systems: From Events to Insights with Apache Kafka and Apache Pinot (Mark Needham)
Mark Needham
Foreword by Gunnar Morling

Building Real-Time Analytics Systems
From Events to Insights with Apache Kafka and Apache Pinot
DATA SCIENCE

Building Real-Time Analytics Systems

“This book provides a well-structured foundation for anyone willing to design, build, and maintain real-time analytics applications, including data engineers, architects, and technology leaders.”
—Dunith Dhanushka, Senior Developer Advocate at Redpanda Data

Gain deep insight into real-time analytics, including the features of these systems and the problems they solve. With this practical book, data engineers at organizations that use event-processing systems such as Kafka, Google Pub/Sub, and AWS Kinesis will learn how to analyze data streams in real time. The faster you derive insights, the quicker you can spot changes in your business and act accordingly.

Author Mark Needham from StarTree provides an overview of the real-time analytics space and an understanding of what goes into building real-time applications. The book’s second part offers a series of hands-on tutorials that show you how to combine multiple software products to build real-time analytics applications for an imaginary pizza delivery service.

You will:
• Learn common architectures for real-time analytics
• Discover how event processing differs from real-time analytics
• Ingest event data from Apache Kafka into Apache Pinot
• Combine event streams with OLTP data using Debezium and Kafka Streams
• Write real-time queries against event data stored in Apache Pinot
• Build a real-time dashboard and order tracking app
• Learn how Uber, Stripe, and Just Eat use real-time analytics

Mark Needham, developer relations engineer at StarTree, helps users learn how to use Apache Pinot to build real-time user-facing analytics applications. He previously worked in developer relations, product engineering, and field engineering at Neo4j.

Twitter: @oreillymedia
linkedin.com/company/oreilly-media
youtube.com/oreillymedia

US $65.99 CAN $82.99
ISBN: 978-1-098-13879-0
Mark Needham
Foreword by Gunnar Morling

Building Real-Time Analytics Systems
From Events to Insights with Apache Kafka and Apache Pinot

Boston Farnham Sebastopol Tokyo Beijing
978-1-098-13879-0
[LSI]

Building Real-Time Analytics Systems
by Mark Needham

Copyright © 2023 Blue Theta and Dunith Dhanushka. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (https://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Michelle Smith
Development Editor: Shira Evans
Production Editor: Jonathon Owen
Copyeditor: Penelope Perkins
Proofreader: Sharon Wilkey
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

September 2023: First Edition

Revision History for the First Edition
2023-09-13: First Release

See https://oreilly.com/catalog/errata.csp?isbn=9781098138790 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Building Real-Time Analytics Systems, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents

Foreword  ix
Preface  xi

1. Introduction to Real-Time Analytics  1
  What Is an Event Stream?  1
  Making Sense of Streaming Data  3
  What Is Real-Time Analytics?  3
  Benefits of Real-Time Analytics  5
  New Revenue Streams  5
  Timely Access to Insights  5
  Reduced Infrastructure Cost  6
  Improved Overall Customer Experience  6
  Real-Time Analytics Use Cases  6
  User-Facing Analytics  7
  Personalization  7
  Metrics  7
  Anomaly Detection and Root Cause Analysis  7
  Visualization  8
  Ad Hoc Analytics  8
  Log Analytics/Text Search  8
  Classifying Real-Time Analytics Applications  9
  Internal Versus External Facing  9
  Machine Versus Human Facing  10
  Summary  11

2. The Real-Time Analytics Ecosystem  13
  Defining the Real-Time Analytics Ecosystem  13
  The Classic Streaming Stack  14
  Complex Event Processing  15
  The Big Data Era  16
  The Modern Streaming Stack  19
  Event Producers  21
  Streaming Data Platform  23
  Stream Processing Layer  25
  Serving Layer  27
  Frontend  31
  Summary  32

3. Introducing All About That Dough: Real-Time Analytics on Pizza  33
  Existing Architecture  34
  Setup  37
  MySQL  38
  Apache Kafka  41
  ZooKeeper  41
  Orders Service  42
  Spinning Up the Components  43
  Inspecting the Data  43
  Applications of Real-Time Analytics  47
  Summary  47

4. Querying Kafka with Kafka Streams  49
  What Is Kafka Streams?  50
  What Is Quarkus?  52
  Quarkus Application  53
  Installing the Quarkus CLI  53
  Creating a Quarkus Application  53
  Creating a Topology  54
  Querying the Key-Value Store  56
  Creating an HTTP Endpoint  59
  Running the Application  60
  Querying the HTTP Endpoint  60
  Limitations of Kafka Streams  60
  Summary  61

5. The Serving Layer: Apache Pinot  63
  Why Can’t We Use Another Stream Processor?  63
  Why Can’t We Use a Data Warehouse?  64
  What Is Apache Pinot?  64
  How Does Pinot Model and Store Data?  66
  Schema  66
  Table  66
  Setup  67
  Data Ingestion  68
  Pinot Data Explorer  71
  Indexes  72
  Updating the Web App  74
  Summary  76

6. Building a Real-Time Analytics Dashboard  77
  Dashboard Architecture  77
  What Is Streamlit?  78
  Setup  78
  Building the Dashboard  79
  Summary  88

7. Product Changes Captured with Change Data Capture  89
  Capturing Changes from Operational Databases  89
  Change Data Capture  90
  Why Do We Need CDC?  90
  What Is CDC?  91
  What Are the Strategies for Implementing CDC?  92
  Log-Based Data Capture  92
  Requirements for a CDC System  93
  Debezium  94
  Applying CDC to AATD  94
  Setup  95
  Connecting Debezium to MySQL  95
  Querying the Products Stream  97
  Updating Products  98
  Summary  99

8. Joining Streams with Kafka Streams  101
  Enriching Orders with Kafka Streams  101
  Adding Order Items to Pinot  108
  Updating the Orders Service  111
  Refreshing the Streamlit Dashboard  115
  Summary  117

9. Upserts in the Serving Layer  119
  Order Statuses  119
  Enriched Orders Stream  121
  Upserts in Apache Pinot  124
  Updating the Orders Service  127
  Creating UsersResource  128
  Adding an allUsers Endpoint  128
  Adding an Orders for User Endpoint  129
  Adding an Individual Order Endpoint  130
  Configuring Cross-Origin Resource Sharing  133
  Frontend App  133
  Order Statuses on the Dashboard  136
  Time Spent in Each Order Status  136
  Orders That Might Be Stuck  138
  Summary  141

10. Geospatial Querying  143
  Delivery Statuses  144
  Updating Apache Pinot  146
  Orders  146
  Delivery Statuses  148
  Updating the Orders Service  154
  Individual Orders  154
  Delayed Orders by Area  157
  Consuming the New API Endpoints  158
  Summary  160

11. Production Considerations  161
  Preproduction  162
  Capacity Planning  162
  Data Partitioning  163
  Throughput  166
  Data Retention  167
  Data Granularity  168
  Total Data Size  168
  Replication Factor  169
  Deployment Platform  169
  In-House Skills  169
  Data Privacy and Security  169
  Cost  170
  Control  170
  Postproduction  171
  Monitoring and Alerting  171
  Data Governance  172
  Summary  173

12. Real-Time Analytics in the Real World  175
  Content Recommendation (Professional Social Network)  175
  The Problem  176
  The Solution  176
  Benefits  178
  Operational Analytics (Streaming Service)  178
  The Problem  179
  The Solution  180
  Benefits  181
  Real-Time Ad Analytics (Online Marketplace)  182
  The Problem  182
  The Solution  183
  Benefits  184
  User-Facing Analytics (Collaboration Platform)  184
  The Problem  185
  The Solution  186
  Benefits  187
  Summary  187

13. The Future of Real-Time Analytics  189
  Edge Analytics  189
  Compute-Storage Separation  190
  Data Lakehouses  192
  Real-Time Data Visualization  194
  Streaming Databases  194
  Streaming Data Platform as a Service  196
  Reverse ETL  197
  Summary  198

Index  199
Foreword

When I started my career in software engineering in the early 2000s, data analytics oftentimes was an afterthought when designing software systems. Batch jobs running once per day would extract data from operational databases and load it into data warehouses, and business analysts typically were happy when they could look at the data from yesterday or last week, creating reports, running once-off queries, etc. Apart from perhaps a few handcrafted, highly optimized queries running within operational databases, the idea of user-facing analytics was pretty much unheard of: serving analytics workloads to thousands of concurrent users, based on the freshest data possible.

Since then, the appetite for real-time analytics has substantially increased. Use cases like fraud detection, resource planning, content recommendations, predictive analytics, and many others require the latest data in order to provide value. If, for instance, your bank detects a pattern of misuse for your credit card because it got stolen, you’d want your card to be blocked right now and not tomorrow, right?

Tools and platforms such as Apache Kafka (for data streaming), Apache Flink (stream processing), Apache Pinot (data analytics) and Apache Superset (data visualization) provide an excellent foundation for real-time analytics and have seen a tremendous uptake over the last years. At the same time, getting started with implementing your first use cases can be challenging, and you might ask yourself questions such as these: Which tools to choose for which purpose? How to put the individual pieces together for a coherent solution? What challenges exist when putting them into production and how to overcome those?

Mark’s book is a treasure trove of guidance around these and many other concerns. Starting with the foundations (What even is real-time analytics?), he provides a comprehensive overview of the software ecosystem in this space, discusses Apache Pinot as one of the leading real-time analytics platforms, and dives into production considerations as well as more specific aspects such as geospatial queries and upsert operations (a notoriously tricky part in most analytics stores).
Having worked on Debezium, an open source platform for change data capture (CDC), for many years, it’s my particular joy to see an entire chapter on that topic. CDC plays a key role in real-time data pipelines, with feeding live data changes from operational databases such as MySQL or PostgreSQL to analytics platforms like Apache Pinot being a core use case, which I’ve seen coming up again and again in the Debezium community. Being an experienced CDC user himself, Mark is doing an excellent job explaining key CDC use cases and implementation approaches and showing how to set up Debezium in a comprehensive example.

The great attention to detail and practical hands-on style are a defining theme of the entire book: any conceptual discussion is always followed by practical examples, showing the reader in detail how to put the different ideas and technologies into action. The book is great for reading end to end, or you can equally well just pick specific chapters if you want to learn more about one particular topic.

The world around us is real-time, and any software systems we build need to account for that fact. As you implement your own analytics use cases for gaining real-time insight into your data, Building Real-Time Analytics Systems will quickly become an invaluable resource, and I am sure it’s going to keep its spot on your desk for quick access for a long time.

— Gunnar Morling
Hamburg, June 2023
Preface

This book is a practical guide for implementing real-time analytics applications on top of existing data infrastructure. It is aimed at data engineers, data architects, and application developers who have some experience working with streaming data or would like to get acquainted with it.

In Chapters 1 and 2, we give an introduction to the topic and an overview of the types of real-time analytics applications that you can build. We also describe the types of products/tools that you’ll likely be using, explaining how to pick the right tool for the job, as well as explaining when a tool might not be necessary.

In Chapter 3, we introduce a fictional pizza company that already has streaming infrastructure set up but hasn’t yet implemented any real-time functionality. The next seven chapters will show how to implement different types of real-time analytics applications for this pizza company. If you’re interested in getting your hands dirty, these chapters will be perfect for you, and hopefully you’ll pick up some ideas (and code!) that you can use in your own projects.

The book will conclude with considerations when putting applications into production, a look at some real-world use cases of real-time analytics, and a gaze into our real-time analytics crystal ball to see what might be coming in this field over the next few years.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://oreil.ly/RTA-github.

If you have a technical question or a problem using the code examples, please send email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Building Real-Time Analytics Systems by Mark Needham (O’Reilly). Copyright 2023 Blue Theta and Dunith Dhanushka, 978-1-098-13879-0.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-889-8969 (in the United States or Canada)
707-829-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://www.oreilly.com/about/contact.html

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/building-RTA.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://youtube.com/oreillymedia

Acknowledgments

Writing this book has been an exhilarating journey, and I am deeply grateful to the countless individuals who have provided their support, wisdom, and encouragement along the way.
First and foremost, I would like to extend my heartfelt appreciation to Dunith Dhanushka, a prominent thought leader in the real-time analytics space. His insightful blog posts and engaging talk at Current 2022 have served as invaluable sources of inspiration, shaping significant portions of this book. The opportunity to engage in thought-provoking conversations with him has not only deepened my understanding of the intricacies of the real-time analytics stack but also guided me in refining the way I presented these concepts throughout the manuscript.

I am also immensely grateful to Hubert Dulay, who generously shared his expertise as a technical reviewer for this book. His keen eyes and astute suggestions have been crucial in ensuring the accuracy and clarity of the content presented. Hubert’s dedication to providing constructive feedback has played a vital role in enhancing the overall quality of the book, and I am truly thankful for his invaluable contributions.
CHAPTER 1
Introduction to Real-Time Analytics

It’s a huge competitive advantage to see in real time what’s happening with your data.
—Hilary Mason, Founder and CEO of Fast Forward Labs

A lot of data in a business environment is considered unbounded because it arrives gradually over time. Customers, employers, and machines produced data yesterday and today and will continue to produce more data tomorrow. This process never ends unless you go out of business, so the dataset is never complete in any meaningful way.

Of the companies that participated in Confluent’s Business Impact of Data Streaming: State of Data in Motion Report 2022, 97% have access to real-time data streams, and 66% have widespread access.

Today, many businesses are adopting streaming data and real-time analytics to make faster, more reliable, and more accurate decisions, allowing them to gain a competitive advantage in their market segment.

This chapter provides an introduction to streaming and real-time analytics. We’ll start with a refresher about streaming data before explaining why organizations want to apply analytics on top of that data. After going through some use cases, we’ll conclude with an overview of the types of real-time analytics applications we can build.

What Is an Event Stream?

The term streaming describes a continuous, never-ending flow of data. The data is made available incrementally over time, which means that you can act upon it without needing to wait for the whole dataset to become available so that you can download it.
A data stream consists of a series of data points ordered in time, that is, chronological order, as shown in Figure 1-1.

Figure 1-1. A data stream

Each data point represents an event, or a change in the state of the business. For example, these might be real-time events like a stream of transactions coming from an organization or Internet of Things (IoT) sensors emitting their readings. One thing event streams have in common is that they keep on producing data for as long as the business exists.

Event streams are generated by different data sources in a business, in various formats and volumes. We can also consider a data stream as an immutable, time-ordered stream of events, carrying facts about state changes that occurred in the business. These sources include, but are not limited to, ecommerce purchases, in-game player activity, information from social networks, clickstream data, activity logs from web servers, sensor data, and telemetry from connected devices or instrumentation in data centers.

An example of an event is the following:

A user with ID 1234 purchased item 567 for $3.99 on 2022/06/12 at 12:23:212

Events are an immutable representation of facts that happened in the past. The facts of this event are shown in Table 1-1.

Table 1-1. Facts in event example

Fact            Value
User ID         1234
Item purchased  567
Price paid      $3.99

By aggregating and analyzing event streams, businesses can uncover insights about their customers and use them to improve their offerings. In the next section, we will discuss different means of making sense of events.
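To make the idea of an immutable, time-ordered event stream more concrete, the following is a minimal Python sketch rather than anything taken from the book. The `PurchaseEvent` class, its field names, the `as_stream` and `revenue_per_user` helpers, and the second event are all hypothetical; the first event mirrors the purchase described above, with the seconds value assumed because the timestamp in the text is ambiguous.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)  # frozen=True rejects attribute assignment after creation
class PurchaseEvent:
    """A fact that happened in the past (hypothetical field names)."""
    user_id: int
    item_id: int
    price: Decimal
    occurred_at: datetime


def as_stream(events):
    """Return the events in chronological order as an immutable tuple."""
    return tuple(sorted(events, key=lambda event: event.occurred_at))


def revenue_per_user(stream):
    """Aggregate the stream: total amount paid by each user."""
    totals = {}
    for event in stream:
        totals[event.user_id] = totals.get(event.user_id, Decimal("0")) + event.price
    return totals


# Illustrative data only. The seconds value of the first timestamp is an assumption.
events = [
    PurchaseEvent(5678, 890, Decimal("12.50"), datetime(2022, 6, 12, 12, 25, 0)),
    PurchaseEvent(1234, 567, Decimal("3.99"), datetime(2022, 6, 12, 12, 23, 21)),
]

stream = as_stream(events)
totals = revenue_per_user(stream)
```

Using a frozen dataclass is one way to reflect that events record facts and are never edited afterward; a correction would normally arrive as a new event rather than as a change to an old one.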
Making Sense of Streaming Data

Events have a shelf life. The business value of events rapidly decreases over time, as shown in Figure 1-2.

Figure 1-2. Event shelf life

The sooner you understand events’ behavior, the sooner you can react and maximize your business outcome. For example, if we have an event that a user abandoned their shopping cart, we can reach out to them via SMS or email to find out why that happened. Perhaps we can offer them a voucher for one of the items in their cart to entice them to come back and complete the transaction.

But that only works if we’re able to react to the cart abandonment in real time. If we detect it tomorrow, the user has probably forgotten what they were doing and will likely ignore our email.

What Is Real-Time Analytics?

Real-time analytics (RTA) describes an approach to data processing that allows us to extract value from events as soon as they are made available.

When we use the term real time in this book, we are referring to soft real time. Delays caused by network latencies and garbage collection pauses, for example, may delay the delivery and processing of events by hundreds of milliseconds or more.
Real-time analytics differs substantially from batch processing, where we collect data in batches and then process it, often with quite a long delay between event time and processing time. Figure 1-3 gives a visual representation of batch processing.

Figure 1-3. Batch processing

In contrast, with real-time analytics we react right after the event happens, as shown in Figure 1-4.

Figure 1-4. Real-time processing

Traditionally, batch processing was the only means of data analysis, but it required us to draw artificial time boundaries to make it easier to divide the data into chunks of fixed duration and process them in batches. For example, we might process a day’s worth of data at the end of every day or an hour’s worth of data at the end of every hour. That was too slow for many users because it produced stale results and didn’t allow them to react to things as they were happening.

Over time the impact of these problems was reduced by decreasing the size of processing batches down to the minute or even the second, which eventually led to events being processed as they arrived and fixed time slices being abandoned. And that is the whole idea behind real-time analytics!

Real-time analytics systems capture, analyze, and act upon events as soon as they become available. They are the unbounded, incrementally processed counterpart to the batch processing systems that have dominated the data analytics space for years.
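The contrast between fixed-window batch processing and incremental processing described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the book's code: the `EVENTS` data, the `batch_totals` and `running_totals` functions, and the one-hour default window are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical (event_time, amount) pairs, already ordered by event time.
EVENTS = [
    (datetime(2023, 1, 1, 9, 15), 10),
    (datetime(2023, 1, 1, 9, 45), 20),
    (datetime(2023, 1, 1, 10, 5), 5),
]


def batch_totals(events, window=timedelta(hours=1)):
    """Batch style: assign each event to a fixed window and total per window.

    In a real batch job, a window's result only becomes available after the
    window has closed and the job has run.
    """
    totals = defaultdict(int)
    for event_time, amount in events:
        # Floor the timestamp to the start of the window that contains it.
        elapsed_windows = (event_time - datetime.min) // window
        window_start = datetime.min + elapsed_windows * window
        totals[window_start] += amount
    return dict(totals)


def running_totals(events):
    """Incremental style: update the result as soon as each event arrives."""
    total = 0
    for event_time, amount in events:
        total += amount
        yield event_time, total


hourly = batch_totals(EVENTS)
incremental = list(running_totals(EVENTS))
```

With this sample data, the batch version produces one total per hour, while the incremental version yields an updated total after every event. That difference in when results become available is the gap between event time and processing time that the text describes.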