Mastering Kafka Streams and ksqlDB
Building Real-Time Data Systems by Example
Mitch Seymour
Mastering Kafka Streams and ksqlDB
by Mitch Seymour

Copyright © 2021 Mitch Seymour. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Jessica Haberman
Development Editor: Jeff Bleiel
Production Editor: Daniel Elfanbaum
Copyeditor: Kim Cofer
Proofreader: JM Olejarz
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

February 2021: First Edition

Revision History for the First Edition
2021-02-04: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492062493 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Mastering Kafka Streams and ksqlDB, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-06249-3
[LSI]
Table of Contents

Foreword
Preface

Part I. Kafka

1. A Rapid Introduction to Kafka
   Communication Model; How Are Streams Stored?; Topics and Partitions; Events; Kafka Cluster and Brokers; Consumer Groups; Installing Kafka; Hello, Kafka; Summary

Part II. Kafka Streams

2. Getting Started with Kafka Streams
   The Kafka Ecosystem; Before Kafka Streams; Enter Kafka Streams; Features at a Glance; Operational Characteristics; Scalability; Reliability; Maintainability; Comparison to Other Systems; Deployment Model; Processing Model; Kappa Architecture; Use Cases; Processor Topologies; Sub-Topologies; Depth-First Processing; Benefits of Dataflow Programming; Tasks and Stream Threads; High-Level DSL Versus Low-Level Processor API; Introducing Our Tutorial: Hello, Streams; Project Setup; Creating a New Project; Adding the Kafka Streams Dependency; DSL; Processor API; Streams and Tables; Stream/Table Duality; KStream, KTable, GlobalKTable; Summary

3. Stateless Processing
   Stateless Versus Stateful Processing; Introducing Our Tutorial: Processing a Twitter Stream; Project Setup; Adding a KStream Source Processor; Serialization/Deserialization; Building a Custom Serdes; Defining Data Classes; Implementing a Custom Deserializer; Implementing a Custom Serializer; Building the Tweet Serdes; Filtering Data; Branching Data; Translating Tweets; Merging Streams; Enriching Tweets; Avro Data Class; Sentiment Analysis; Serializing Avro Data; Registryless Avro Serdes; Schema Registry–Aware Avro Serdes; Adding a Sink Processor; Running the Code; Empirical Verification; Summary

4. Stateful Processing
   Benefits of Stateful Processing; Preview of Stateful Operators; State Stores; Common Characteristics; Persistent Versus In-Memory Stores; Introducing Our Tutorial: Video Game Leaderboard; Project Setup; Data Models; Adding the Source Processors; KStream; KTable; GlobalKTable; Registering Streams and Tables; Joins; Join Operators; Join Types; Co-Partitioning; Value Joiners; KStream to KTable Join (players Join); KStream to GlobalKTable Join (products Join); Grouping Records; Grouping Streams; Grouping Tables; Aggregations; Aggregating Streams; Aggregating Tables; Putting It All Together; Interactive Queries; Materialized Stores; Accessing Read-Only State Stores; Querying Nonwindowed Key-Value Stores; Local Queries; Remote Queries; Summary

5. Windows and Time
   Introducing Our Tutorial: Patient Monitoring Application; Project Setup; Data Models; Time Semantics; Timestamp Extractors; Included Timestamp Extractors; Custom Timestamp Extractors; Registering Streams with a Timestamp Extractor; Windowing Streams; Window Types; Selecting a Window; Windowed Aggregation; Emitting Window Results; Grace Period; Suppression; Filtering and Rekeying Windowed KTables; Windowed Joins; Time-Driven Dataflow; Alerts Sink; Querying Windowed Key-Value Stores; Summary

6. Advanced State Management
   Persistent Store Disk Layout; Fault Tolerance; Changelog Topics; Standby Replicas; Rebalancing: Enemy of the State (Store); Preventing State Migration; Sticky Assignment; Static Membership; Reducing the Impact of Rebalances; Incremental Cooperative Rebalancing; Controlling State Size; Deduplicating Writes with Record Caches; State Store Monitoring; Adding State Listeners; Adding State Restore Listeners; Built-in Metrics; Interactive Queries; Custom State Stores; Summary

7. Processor API
   When to Use the Processor API; Introducing Our Tutorial: IoT Digital Twin Service; Project Setup; Data Models; Adding Source Processors; Adding Stateless Stream Processors; Creating Stateless Processors; Creating Stateful Processors; Periodic Functions with Punctuate; Accessing Record Metadata; Adding Sink Processors; Interactive Queries; Putting It All Together; Combining the Processor API with the DSL; Processors and Transformers; Putting It All Together: Refactor; Summary

Part III. ksqlDB

8. Getting Started with ksqlDB
   What Is ksqlDB?; When to Use ksqlDB; Evolution of a New Kind of Database; Kafka Streams Integration; Connect Integration; How Does ksqlDB Compare to a Traditional SQL Database?; Similarities; Differences; Architecture; ksqlDB Server; ksqlDB Clients; Deployment Modes; Interactive Mode; Headless Mode; Tutorial; Installing ksqlDB; Running a ksqlDB Server; Precreating Topics; Using the ksqlDB CLI; Summary

9. Data Integration with ksqlDB
   Kafka Connect Overview; External Versus Embedded Connect; External Mode; Embedded Mode; Configuring Connect Workers; Converters and Serialization Formats; Tutorial; Installing Connectors; Creating Connectors with ksqlDB; Showing Connectors; Describing Connectors; Dropping Connectors; Verifying the Source Connector; Interacting with the Kafka Connect Cluster Directly; Introspecting Managed Schemas; Summary

10. Stream Processing Basics with ksqlDB
   Tutorial: Monitoring Changes at Netflix; Project Setup; Source Topics; Data Types; Custom Types; Collections; Creating Source Collections; With Clause; Working with Streams and Tables; Showing Streams and Tables; Describing Streams and Tables; Altering Streams and Tables; Dropping Streams and Tables; Basic Queries; Insert Values; Simple Selects (Transient Push Queries); Projection; Filtering; Flattening/Unnesting Complex Structures; Conditional Expressions; Coalesce; IFNULL; Case Statements; Writing Results Back to Kafka (Persistent Queries); Creating Derived Collections; Putting It All Together; Summary

11. Intermediate and Advanced Stream Processing with ksqlDB
   Project Setup; Bootstrapping an Environment from a SQL File; Data Enrichment; Joins; Windowed Joins; Aggregations; Aggregation Basics; Windowed Aggregations; Materialized Views; Clients; Pull Queries; Curl; Push Queries; Push Queries via Curl; Functions and Operators; Operators; Showing Functions; Describing Functions; Creating Custom Functions; Additional Resources for Custom ksqlDB Functions; Summary

Part IV. The Road to Production

12. Testing, Monitoring, and Deployment
   Testing; Testing ksqlDB Queries; Testing Kafka Streams; Behavioral Tests; Benchmarking; Kafka Cluster Benchmarking; Final Thoughts on Testing; Monitoring; Monitoring Checklist; Extracting JMX Metrics; Deployment; ksqlDB Containers; Kafka Streams Containers; Container Orchestration; Operations; Resetting a Kafka Streams Application; Rate-Limiting the Output of Your Application; Upgrading Kafka Streams; Upgrading ksqlDB; Summary

A. Kafka Streams Configuration
B. ksqlDB Configuration
Index
Foreword

Businesses are increasingly built around events—the real-time activity data of what is happening in a company—but what is the right infrastructure for harnessing the power of events? This is a question I have been thinking about since 2009, when I started the Apache Kafka project at LinkedIn. In 2014, I cofounded Confluent to definitively answer it.

Beyond providing a way to store and access discrete events, an event streaming platform needs a mechanism to connect with a myriad of external systems. It also requires global schema management, metrics, and monitoring. But perhaps most important of all is stream processing—continuous computation over never-ending streams of data—without which an event streaming platform is simply incomplete.

Now more than ever, stream processing plays a key role in how businesses interact with the world. In 2011, Marc Andreessen wrote an article titled “Why Software Is Eating the World.” The core idea is that any process that can be moved into software eventually will be. Marc turned out to be prescient. The most obvious outcome is that software has permeated every industry imaginable.

But a lesser understood and more important outcome is that businesses are increasingly defined in software. Put differently, the core processes a business executes—from how it creates a product, to how it interacts with customers, to how it delivers services—are increasingly specified, monitored, and executed in software.

What has changed because of that dynamic? Software, in this new world, is far less likely to be directly interacting with a human. Instead, it is more likely that its purpose is to programmatically trigger actions or react to other pieces of software that carry out business directly. It begs the question: are our traditional application architectures, centered around existing databases, sufficient for this emerging world?

Virtually all databases, from the most established relational databases to the newest key-value stores, follow a paradigm in which data is passively stored and the database waits for commands to retrieve or modify it. This paradigm was driven by human-facing applications in which a user looks at an interface and initiates actions that are translated into database queries.

We think that is only half the problem, and the problem of storing data needs to be complemented with the ability to react to and process events.

Events and stream processing are the keys to succeeding in this new world. Events support the continuous flow of data throughout a business, and stream processing automatically executes code in response to change at any level of detail—doing it in concert with knowledge of all changes that came before it. Modern stream processing systems like Kafka Streams and ksqlDB make it easy to build applications for a world that speaks software.

In this book, Mitch Seymour lucidly describes these state-of-the-art systems from first principles. Mastering Kafka Streams and ksqlDB surveys core concepts, details the nuances of how each system works, and provides hands-on examples for using them for business in the real world. Stream processing has never been a more essential programming paradigm—and Mastering Kafka Streams and ksqlDB illuminates the path to succeeding at it.

— Jay Kreps
Cocreator of Apache Kafka, Cofounder and CEO of Confluent
Preface

For data engineers and data scientists, there’s never a shortage of technologies that are competing for our attention. Whether we’re perusing our favorite subreddits, scanning Hacker News, reading tech blogs, or weaving through hundreds of tables at a tech conference, there are so many things to look at that it can start to feel overwhelming.

But if we can find a quiet corner to just think for a minute, and let all of the buzz fade into the background, we can start to distinguish patterns from the noise. You see, we live in the age of explosive data growth, and many of these technologies were created to help us store and process data at scale. We’re told that these are modern solutions for modern problems, and we sit around discussing “big data” as if the idea is avant-garde, when really the focus on data volume is only half the story.

Technologies that only solve for the data volume problem tend to have batch-oriented techniques for processing data. This involves running a job on some pile of data that has accumulated for a period of time. In some ways, this is like trying to drink the ocean all at once. With modern computing power and paradigms, some technologies actually manage to achieve this, though usually at the expense of high latency.

Instead, there’s another property of modern data that we focus on in this book: data moves over networks in steady and never-ending streams. The technologies we cover in this book, Kafka Streams and ksqlDB, are specifically designed to process these continuous data streams in real time, and provide huge competitive advantages over the ocean-drinking variety. After all, many business problems are time-sensitive, and if you need to enrich, transform, or react to data as soon as it comes in, then Kafka Streams and ksqlDB will help get you there with ease and efficiency.

Learning Kafka Streams and ksqlDB is also a great way to familiarize yourself with the larger concepts involved in stream processing. This includes modeling data in different ways (streams and tables), applying stateless transformations of data, using local state for more advanced operations (joins, aggregations), understanding the different time semantics and methods for grouping data into time buckets/windows, and more. In other words, your knowledge of Kafka Streams and ksqlDB will help you distinguish and evaluate different stream processing solutions that currently exist and may come into existence sometime in the future.

I’m excited to share these technologies with you because they have both made an impact on my own career and helped me accomplish technological feats that I thought were beyond my own capabilities. In fact, by the time you finish reading this sentence, one of my Kafka Streams applications will have processed nine million events. The feeling you’ll get by providing real business value without having to invest exorbitant amounts of time on the solution will keep you working with these technologies for years to come, and the succinct and expressive language constructs make the process feel more like an art form than a labor. And just like any other art form, whether it be a life-changing song or a beautiful painting, it’s human nature to want to share it. So consider this book a mixtape from me to you, with my favorite compilations from the stream processing space available for your enjoyment: Kafka Streams and ksqlDB, Volume 1.

Who Should Read This Book

This book is for data engineers who want to learn how to build highly scalable stream processing applications for moving, enriching, and transforming large amounts of data in real time. These skills are often needed to support business intelligence initiatives, analytic pipelines, threat detection, event processing, and more. Data scientists and analysts who want to upgrade their skills by analyzing real-time data streams will also find value in this book, which is an exciting departure from the batch processing space that has typically dominated these fields. Prior experience with Apache Kafka is not required, though some familiarity with the Java programming language will make the Kafka Streams tutorials easier to follow.

Navigating This Book

This book is organized roughly as follows:

• Chapter 1 provides an introduction to Kafka and a tutorial for running a single-node Kafka cluster.
• Chapter 2 provides an introduction to Kafka Streams, starting with a background and architectural review, and ending with a tutorial for running a simple Kafka Streams application.
• Chapters 3 and 4 discuss the stateless and stateful operators in the Kafka Streams high-level DSL (domain-specific language). Each chapter includes a tutorial that will demonstrate how to use these operators to solve an interesting business problem.
• Chapter 5 discusses the role that time plays in our stream processing applications, and demonstrates how to use windows to perform more advanced stateful operations, including windowed joins and aggregations. A tutorial inspired by predictive healthcare will demonstrate the key concepts.
• Chapter 6 describes how stateful processing works under the hood, and provides some operational tips for stateful Kafka Streams applications.
• Chapter 7 dives into Kafka Streams’ lower-level Processor API, which can be used for scheduling periodic functions, and provides more granular access to application state and record metadata. The tutorial in this chapter is inspired by IoT (Internet of Things) use cases.
• Chapter 8 provides an introduction to ksqlDB, and discusses the history and architecture of this technology. The tutorial in this chapter will show you how to install and run a ksqlDB server instance, and work with the ksqlDB CLI.
• Chapter 9 discusses ksqlDB’s data integration features, which are powered by Kafka Connect.
• Chapters 10 and 11 discuss the ksqlDB SQL dialect in detail, demonstrating how to work with different collection types, perform push queries and pull queries, and more. The concepts will be introduced using a tutorial based on a Netflix use case: tracking changes to various shows/films, and making these changes available to other applications.
• Chapter 12 provides the information you need to deploy your Kafka Streams and ksqlDB applications to production. This includes information on monitoring, testing, and containerizing your applications.

Source Code

The source code for this book can be found on GitHub at https://github.com/mitch-seymour/mastering-kafka-streams-and-ksqldb. Instructions for building and running each tutorial will be included in the repository.

Kafka Streams Version

At the time of this writing, the latest version of Kafka Streams was version 2.7.0. This is the version we use in this book, though in many cases, the code will also work with older or newer versions of the Kafka Streams library. We will make efforts to update the source code when newer versions introduce breaking changes, and will stage these updates in a dedicated branch (e.g., kafka-streams-2.8).
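Project setup for the tutorials is covered in Chapter 2, but if you want to experiment right away, the following is a minimal sketch of a Gradle (Kotlin DSL) build file that pulls in this version of the library. The plugin choices and the main class name are illustrative assumptions; the tutorials in the companion repository ship with their own build scripts.

```kotlin
// build.gradle.kts: a minimal sketch for experimenting with Kafka Streams 2.7.0.
// The companion repository's tutorials define their own build scripts; this file
// only illustrates the dependency coordinates discussed above.
plugins {
    java
    application
}

repositories {
    mavenCentral()
}

dependencies {
    // The Kafka Streams version used throughout this book
    implementation("org.apache.kafka:kafka-streams:2.7.0")
}

application {
    // Hypothetical entry point; replace with your own main class
    mainClass.set("com.example.HelloStreams")
}
```

With the Gradle wrapper in place, running `./gradlew build` in a project containing this file should resolve the dependency from Maven Central.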
ksqlDB Version

At the time of this writing, the latest version of ksqlDB was version 0.14.0. Compatibility with older and newer versions of ksqlDB is less guaranteed due to the ongoing and rapid development of this technology, and the lack of a major version (e.g., 1.0) at the time of this book’s publication. We will make efforts to update the source code when newer versions introduce breaking changes, and will stage these updates in a dedicated branch (e.g., ksqldb-0.15). However, it is recommended to avoid versions older than 0.14.0 when running the examples in this book.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.
Using Code Examples

Supplemental material (code examples, exercises, etc.) can be found on the book’s GitHub page, https://github.com/mitch-seymour/mastering-kafka-streams-and-ksqldb.

If you have a technical question or a problem using the code examples, please email bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Mastering Kafka Streams and ksqlDB by Mitch Seymour (O’Reilly). Copyright 2021 Mitch Seymour, 978-1-492-06249-3.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.
How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/mastering-kafka-streams.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit http://oreilly.com.

Find us on Facebook: http://facebook.com/oreilly.
Follow us on Twitter: http://twitter.com/oreillymedia.
Watch us on YouTube: http://www.youtube.com/oreillymedia.

Acknowledgments

First and foremost, I want to thank my wife, Elyse, and my daughter, Isabelle. Writing a book is a huge time investment, and your patience and support through the entire process helped me immensely. As much as I enjoyed writing this book, I missed you both greatly, and I look forward to having more date nights and daddy-daughter time again.

I also want to thank my parents, Angie and Guy, for teaching me the value of hard work and for being a never-ending source of encouragement. Your support has helped me overcome many challenges over the years, and I am eternally grateful for you both.

This book would not be possible without the following people, who dedicated a lot of their time to reviewing its content and providing great feedback and advice along the way: Matthias J. Sax, Robert Yokota, Nitin Sharma, Rohan Desai, Jeff Bleiel, and Danny Elfanbaum. Thank you all for helping me create this book; it’s just as much yours as it is mine.

Many of the tutorials were informed by actual business use cases, and I owe a debt of gratitude to everyone in the community who openly shared their experiences with Kafka Streams and ksqlDB, whether it be through conferences, podcasts, blogs, or