
Streaming Databases
Author: Hubert Dulay, Ralph M. Debusmann

Real-time applications are becoming the norm today. But building a model that works properly requires real-time data from the source, in-flight stream processing, and low-latency serving of its analytics. With this practical book, data engineers, data architects, and data analysts will learn how to use streaming databases to build real-time solutions. Authors Hubert Dulay and Ralph M. Debusmann take you through streaming database fundamentals, including how these databases reduce infrastructure for real-time solutions. You'll learn the difference between streaming databases, stream processing, and real-time online analytical processing (OLAP) databases. And you'll discover when to use push queries versus pull queries, and how to serve synchronous and asynchronous data emanating from streaming databases.

This guide helps you:

• Explore stream processing and streaming databases
• Learn how to build a real-time solution with a streaming database
• Understand how to construct materialized views from any number of streams
• Learn how to serve synchronous and asynchronous data
• Get started building low-complexity streaming solutions with minimal setup

Hubert Dulay is a systems and data engineer at StarTree and the coauthor of Streaming Data Mesh. Ralph M. Debusmann, PhD, is a former AI/NLP researcher who currently serves as lead enterprise Kafka engineer at Migros-Genossenschafts-Bund.

ISBN: 1098154835
Publisher: O'Reilly Media
Publish Year: 2024
Language: English
Pages: 260
File Format: PDF
File Size: 8.4 MB
Text Preview (First 20 pages)

“Streaming Databases offers a comprehensive guide to reconciling the streaming revolution with traditional database thinking. This book is an invaluable resource for both experienced streaming practitioners and those embarking on their real-time data journey.”
—Adrian Kosowski, CPO and Founder, Pathway

“Stream processing is hard. You have to handle duplicated or out-of-order events all the time. With this book, you can leave the hard work to streaming databases. It’s the best guide out there for building superior streaming ETL, CDC, or real-time analytics solutions.”
—Jove Zhong, Cofounder and Head of Product, Timeplus Inc.

Streaming Databases

Real-time applications are becoming the norm today. But building a model that works properly requires real-time data from the source, in-flight stream processing, and low-latency serving of its analytics. With this practical book, data engineers, data architects, and data analysts will learn how to use streaming databases to build real-time solutions. Authors Hubert Dulay and Ralph M. Debusmann take you through streaming database fundamentals, including how these databases reduce infrastructure for real-time solutions. You’ll learn the difference between streaming databases, stream processing, and real-time online analytical processing (OLAP) databases. And you’ll discover when to use push queries versus pull queries and how to serve synchronous and asynchronous data processed by streaming databases.

This guide helps you:

• Explore stream processing and streaming databases
• Learn how to build real-time solutions with streaming databases
• Understand how to construct materialized views from any number of streams
• Learn how to serve synchronous and asynchronous data
• Get started building low-complexity streaming solutions with minimal setup

Hubert Dulay is a systems and data engineer at StarTree and the coauthor of Streaming Data Mesh. Ralph M. Debusmann, PhD, is a former AI/NLP researcher who currently serves as lead enterprise Kafka engineer at Migros-Genossenschafts-Bund.

US $79.99 | CAN $99.99
ISBN: 978-1-098-15483-7
Streaming Databases
Unifying Batch and Stream Processing

Hubert Dulay and Ralph M. Debusmann

Beijing • Boston • Farnham • Sebastopol • Tokyo
Streaming Databases
by Hubert Dulay and Ralph M. Debusmann

Copyright © 2024 Hubert Dulay and Ralph M. Debusmann. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Aaron Black
Development Editor: Rita Fernando
Production Editor: Katherine Tozer
Copyeditor: Emily Wydeven
Proofreader: Krsta Technology Solutions
Indexer: BIM Creatives, LLC
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

August 2024: First Edition

Revision History for the First Edition
2024-08-08: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098154837 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Streaming Databases, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-15483-7 [LSI]
Table of Contents

Foreword
Preface

1. Streaming Foundations
    Turning the Database Inside Out
    Externalizing Database Features
    Write-Ahead Log
    Streaming Platforms
    Materialized Views
    Use Case: Clickstream Analysis
    Understanding Transactions and Events
    Domain-Driven Design
    Context Enrichment
    Change Data Capture
    Connectors
    Connector Middleware
    Embedded
    Custom-Built
    Summary

2. Stream Processing Platforms
    Stateful Transformations
    Data Pipelines
    ELT Limitations
    Stream Processing with ELT
    Stream Processors
    Popular Stream Processors
    Newer Stream Processors
    Emulating Materialized Views in Apache Spark
    Two Types of Streams
    Append Stream
    Debezium Change Data
    Materialized Views
    Summary

3. Serving Real-Time Data
    Real-Time Expectations
    Choosing an Analytical Data Store
    Sourcing from a Topic
    Ingestion Transformations
    OLTP Versus OLAP
    ACID
    Row- Versus Column-Based Optimization
    Queries Per Second and Concurrency
    Indexing
    Serving Analytical Results
    Synchronous Queries
    Asynchronous Queries
    Push Versus Pull Queries
    Summary

4. Materialized Views
    Views, Materialized Views, and Incremental Updates
    Change Data Capture
    Push Versus Pull Queries
    CDC and Upsert
    Joining Streams
    Apache Calcite
    Clickstream Use Case
    Summary

5. Introduction to Streaming Databases
    Identifying the Streaming Database
    Column-Based Streaming Database
    Row-Based Streaming Database
    Edge Streaming-Like Databases
    SQL Expressivity
    Streaming Debuggability
    Advantages of Debugging in Streaming Databases
    SQL Is Not a Silver Bullet
    Streaming Database Implementations
    Streaming Database Architecture
    ELT with Streaming Databases
    Summary

6. Consistency
    A Toy Example
    Transactions
    Analyzing the Transactions
    Comparing Consistency Across Stream Processing Systems
    Flink SQL
    ksqlDB
    Proton (Timeplus)
    RisingWave
    Materialize
    Pathway
    Out-of-Order Messages
    Going Beyond Eventual Consistency
    Why Do Eventually Consistent Stream Processors Fail the Toy Example?
    How Do Internally Consistent Stream Processing Systems Pass the Toy Example?
    How Can We Fix Eventually Consistent Stream Processing Systems to Pass the Toy Example?
    Consistency Versus Latency
    Summary

7. Emergence of Other Hybrid Data Systems
    Data Planes
    Hybrid Transactional/Analytical Database
    Other Hybrid Databases
    Motivations for Hybrid Systems
    The Influence of PostgreSQL on Hybrid Databases
    Near-Edge Analytics
    Next-Generation Hybrid Databases
    Next-Generation Streaming OLTP Databases
    Next-Generation Streaming RTOLAP Databases
    Next-Generation HTAP Databases
    Summary

8. Zero-ETL or Near-Zero-ETL
    ETL Model
    Zero-ETL
    Near-Zero-ETL
    PeerDB
    Proton
    Embedded OLAP
    Data Gravity and Replication
    Analytical Data Reduction
    Lambda Architecture
    Apache Pinot Hybrid Tables
    Pipeline Configurations
    Summary

9. The Streaming Plane
    Data Gravity
    Components of the Streaming Plane
    Streaming Plane Infrastructure
    Operational Analytics
    Data Mesh
    Pillars of a Data Mesh
    Challenge of a Data Mesh
    Streaming Data Mesh with Streaming Plane and Streaming Databases
    Data Locality
    Data Replication
    Summary

10. Deployment Models
    Consistent Streaming Database
    Consistent Streaming Processor and RTOLAP
    Eventually Consistent OLAP Streaming Database
    Eventually Consistent Stream Processor and RTOLAP
    Eventually Consistent Stream Processor and HTAP
    ksqlDB
    Incremental View Maintenance
    Postgres Multicorn Foreign Data Wrapper
    When to Use Code-Based Stream Processors
    When to Use Lakehouse/Streamhouse Technologies
    Caching Technologies
    Where to Do Processing and Querying in General?
    The Four “Where” Questions
    An Analytical Use Case
    Consequences
    Summary

11. Future State of Real-Time Data
    The Convergence of the Data Planes
    Graph Databases
    Memgraph
    thatDot/Quine
    Vector Databases
    Milvus 2.x: Streaming as the Central Backbone
    RTOLAP Databases: Adding Vector Search
    Incremental View Maintenance
    pg_ivm
    Hydra
    Epsio
    Feldera
    PeerDB
    Data Wrapping and Postgres Multicorn
    Classical Databases
    Data Warehouses
    BigQuery
    Redshift
    Snowflake
    Lakehouse
    Delta Lake
    Apache Paimon
    Apache Iceberg
    Apache Hudi
    OneTable or XTable
    The Relationship of Streaming and Lakehouses
    Conclusion

Index
Foreword

Pioneering a new category of software systems is the dream of many software engineers. I feel very fortunate for the opportunity to work on ksqlDB early on, even before it was called ksqlDB, and before the category of streaming databases was generally known. When I first heard that Ralph and Hubert were writing a book dedicated to streaming databases, I was naturally interested right away. So what is a streaming database?

Database systems have many different flavors, from traditional relational databases to XML, graph, object, vector, and NoSQL databases. Many of these are well known and have been established for many decades. Streaming, or stream processing, is much less established, although it has seen a steep adoption rate in the industry over the past decade or so, led by the rise of Apache Kafka as the de facto streaming platform. Historically, stream processing was considered difficult, and only larger organizations with dedicated teams of streaming experts could master it. The same was true for data processing and computing 50 years ago, before SQL and relational database systems were invented to allow nontechnical users to work with data stored in computer systems. Now, SQL is the lingua franca of data processing.

Streaming databases are the next step in the evolution of stream processing. They unify well-established techniques from database systems with the new paradigms from the streaming world to simplify stream processing and enable nontechnical users to work with data in motion, similar to what we are used to when we query data at rest.

Database systems are designed to solve specific problems. The two main categories of database systems, online transaction processing (OLTP) and online analytical processing (OLAP) systems, were not originally designed for internet-scale applications.
With the rise of “big data” at the beginning of the third millennium, new systems such as MapReduce were invented to meet the increased scaling requirements. However, those new systems were developed by technical experts for technical experts, and they moved us away from the familiarity of SQL. With the invention of data lakes, the first child of the “big data” era, it was quickly realized that SQL was needed to enable nontechnical users to make the most of these new technologies. As a result, SQL was reintroduced, and nowadays, all modern data lakes use SQL to query the stored data.

Data streaming, as the second child of the “big data” era, followed the same trend: first, stream processing systems were built by experts for experts without the support of SQL. It wasn’t long until SQL and database technologies were introduced to enable nontechnical users to use these new streaming systems. This development led to streaming databases and the waves of innovation that followed.

As more people realize the significance of streaming databases in the world of stream processing and database technology, they will need guidance on how to use them with their existing systems. Stream processing, as this book puts it, adds a new plane between the operational plane (OLTP) and the analytical plane (OLAP). The streaming plane opens up a rich area of possibilities for the future of data systems.

In this book, Hubert and Ralph discuss the three different starting points for streaming databases:

• Stream processing systems that adopt database technologies and SQL
• Database systems that are extended to incorporate streaming concepts
• Data lakes (which already adopted SQL) that are extended to use streaming capabilities

These three gave rise to a variety of different streaming databases, each with its own limitations and optimized for different use cases. This raises the question: which system should we use for what use case, and what are the trade-offs?

Following Jay Kreps’ prediction that “companies are becoming software,” we have an exciting future in data processing ahead of us with streaming databases at its very core. The simplifications that streaming databases and streaming SQL offer allow many more nontechnical users to adopt stream processing, which will lead the way for streaming to become ubiquitous. We are still early in the era of streaming databases, and it’s exciting to observe the current trends and discover newly built systems.
This book provides an excellent entry point for learning about all these cutting-edge innovations and the zoo of options, which is typical for the early days of a new era. If you want to learn even more about streaming databases, check out Hubert and Ralph’s podcast on Spotify, simply called “Hubert’s Podcast.” They interviewed many different people in the streaming and data space in preparation for this book, and it’s a gem by itself.

— Matthias J. Sax
Technical Lead, Kafka Streams Engineering Team at Confluent
Apache Committer and PMC member (Kafka, Flink, Storm)
Reno, NV, May 2024
Preface

In this book, we go beyond the boundaries of traditional batch processing and seamlessly integrate the dynamic world of streaming data. If you come from the streaming world, we provide a database perspective for stream processing. Streaming databases bridge the gap between data at rest and data in motion.

Drawing inspiration from Martin Kleppmann’s seminal work on “turning the database inside out,” we flip the narrative to “bringing streaming systems back into the database.” Through this paradigm shift, we can first unravel the intricate layers of stream processing before we find familiar abstractions that make real-time streaming more accessible and understandable to developers, regardless of their familiarity with streaming technologies.

Our exploration delves into the core principles of streaming databases, exposing how they empower developers to take on real-time data processing use cases within the familiar confines of a database environment. Focusing on practicality and usability, we unveil how streaming databases democratize real-time data analytics, paving the way for innovative applications and insights. Whether you’re a seasoned database engineer or a novice developer, this book guides you to unlocking the full potential of streaming databases and embracing the future of data processing.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/hdulay/streaming-databases.

If you have a technical question or a problem using the code examples, please send email to support@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Streaming Databases by Hubert Dulay and Ralph M. Debusmann (O’Reilly). Copyright 2024 Hubert Dulay and Ralph M. Debusmann, 978-1-098-15483-7.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-889-8969 (in the United States or Canada)
707-827-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://www.oreilly.com/about/contact.html

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/streaming-databases.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media.

Watch us on YouTube: https://youtube.com/oreillymedia.

Hubert’s Acknowledgments

I’d like to first thank my wife, Beth, and the kids, Aster and Nico, for supporting me while I wrote this book. It would not have been easy without them. Second, I’d like to thank Ralph for being a great technologist, teacher, and capable coauthor, making us an excellent writing pair.
When we first started writing this book, we interviewed many experts and leaders in the streaming space who were driving innovation in streaming, real-time analytics, and, more importantly, its adoption. Thanks to Seth Wiesman, Arjun Narayan, and Frank McSherry for the insights, and Nikhil Benesch for seeking me out at Current. Thanks to Will Plummer for initially reaching out, Jove Zhong for reviewing the book, and Gang Tao and Ting Wang for continually supporting us. I’d also like to thank Yingjun Wu for your wisdom and for reviewing the book. Thank you to Adrian Kosowski, Anup Surendran, and Bobur Umurzokov for their continued partnership and support. Thanks to Hojjat Jafarpour, and Monish, for all the enjoyable conversations. Thanks to Mihai Budiu and Leonid Ryzhyk for speaking to us initially and for the quote, “All databases are streaming databases.” Thanks to Micah Wylde, Richard Artoul, and Ryan Worl for your interesting conversations. Thanks to Robin Fehr and Nico Kruber for also reviewing the book. Thank you, Matthias Sax, for writing the foreword to this book. Thanks to Rita Fernando for making writing for O’Reilly easy and fun. Lastly, thanks to the other streaming and database technologists who bring streaming and real-time analytics to customers.

Ralph’s Acknowledgments

I would firstly like to thank Bea for supporting me (not only with finishing this book), my parents, and my kids, Sophie, Stella, and Selene. A huge thank you goes to Hubert for having been able to coauthor this book and for the great time writing it together. I would also like to thank my colleagues at Migros, with whom I have had the pleasure to collaborate on and discuss topics related to this book, especially Martin Muggli, Jason Nguyen, Simon Hofer, Alexander Rovner, André Pechstein, Erik Vido, and Philipp Jud de Capitani.

Thanks also go to all those who further took part in our book’s genesis by providing valuable insights and feedback and engaging in inspirational discussions. In addition to those mentioned already, this includes (in alphabetical order) Jamie Brandon, Pavan Keshavamurthy, Giannis Polyzos, Florent Ramiere, Michael Rosam, and Yaroslav Tkachenko, noting that this list is far from complete.
CHAPTER 1
Streaming Foundations

The hero’s journey always begins with the call. One way or another, a guide must come to say, “Look, you’re in Sleepy Land. Wake. Come on a trip. There is a whole aspect of your consciousness, your being, that’s not been touched. So you’re at home here? Well, there’s not enough of you there.” And so it starts.
—Joseph Campbell, Reflections on the Art of Living: A Joseph Campbell Companion

The streaming database is a concept born from over a decade of processing and serving data. The evolution leading to the advent of streaming databases is rooted in the broader history of database management systems, data processing, and the changing demands of the digital age. To understand this evolution, let’s take a historical journey through the key milestones that have shaped the development of streaming databases.

The rise of the internet and the explosive growth of digital data in the late 20th century led to the need for more scalable and flexible data management solutions. Data warehouses and batch-oriented processing frameworks like Hadoop emerged during this era to address the challenges posed by the sheer size of data. The term “big data” was, and still is, used to refer not only to the size of data but also to all solutions that store and process data that is extremely large. Big data cannot fit on a single computer or server; you need to divide it into smaller, equal-sized parts and store them across multiple computers. Systems like Hadoop and MapReduce became popular because they enabled distributed storage and processing.

This led to the idea of using distributed streaming to move large volumes of data into Hadoop. Apache Kafka emerged as one such messaging service designed to handle big data. Not only did it provide a way to move data from system to system, but it also provided a way to access data in motion—in real time. It was a development that led to a new wave of demand for real-time streaming use cases.
New technologies, such as Apache Flink and Apache Spark, were developed and were able to meet these new expectations. As distributed frameworks for batch processing and streaming, they could process data across many servers and provide analytical results. When coupled with Kafka, the trio provided a solution that could support real-time streaming analytical use cases. We’ll discuss stream processors in more detail in Chapter 2.

In the mid-2010s, simpler and better paradigms in streaming emerged to increase the scale of real-time data processing. These included two new stream processing frameworks, Apache Kafka Streams (KStreams) and Apache Samza. KStreams and Samza were the first to implement materialized views, which made the stream look and feel more like a database.

Martin Kleppmann took the pairing of databases and streaming even further. In his 2015 talk, “Turning the Database Inside-Out”, he described a way to implement stream processing that takes internal database features and externalizes them in real-time streams. This approach led to more scalable, resilient, and real-time stream processing systems.

One of the problems of stream processing was (and still is) that it’s harder to use than batch processing. There are fewer abstractions, and much more deep-down tech is shining through. To implement stream processing for their use case, data engineers now had to consider data order, consistency for accurate processing, fault tolerance, resilience, scalability, and more. This became a hurdle that deterred data teams from attempting to use streaming. As a result, most have opted to continue using databases to transform data and to run the processing in batches, at the expense of not meeting performance requirements.

In this book, we hope to make streaming and stream processing more accessible to those who are used to working with databases. We’ll start, as Kleppmann did, by talking about how to turn the database inside out.

Turning the Database Inside Out

Martin Kleppmann is a distinguished software developer who gave the thought-provoking talk “Turning the Database Inside-Out.” He introduced Apache Samza as a newer way of implementing stream processing that takes internal database features and externalizes them in real-time streams. His thought leadership led to the paradigm shift of introducing materialized views to stream processing.

Really it’s a surreptitious attempt to take the database architecture we know and turn it inside out.
—Martin Kleppmann, “Turning the Database Inside-Out”
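To make the materialized view idea concrete, here is a minimal Kafka Streams sketch (an illustration added by the editor, not one of the book’s own examples). It continuously maintains a count of clicks per user from a Kafka topic; the topic names, state store name, and broker address are assumptions chosen for the example.

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

public class ClicksPerUserApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "clicks-per-user-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Unbounded stream of click events, keyed by user ID (illustrative topic name).
        KStream<String, String> clicks = builder.stream("clicks");

        // Group by key and count. The resulting KTable is Kafka Streams' materialized
        // view: a local state store that is continuously updated as new events arrive.
        KTable<String, Long> clicksPerUser = clicks
            .groupByKey()
            .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("clicks-per-user-store"));

        // The view's change stream can also be written back out as a topic,
        // which is the "database turned inside out" pattern in miniature.
        clicksPerUser.toStream()
            .to("clicks-per-user", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

Because the KTable is backed by a named state store, Kafka Streams’ interactive queries could serve point lookups of the current counts, which foreshadows the pull queries discussed in later chapters.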