Designing Event-Driven Systems
Concepts and Patterns for Streaming Services with Apache Kafka

Ben Stopford
Foreword by Sam Newman
Designing Event-Driven Systems
by Ben Stopford

Copyright © 2018 O'Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Brian Foster
Production Editor: Justin Billing
Copyeditor: Rachel Monaghan
Proofreader: Amanda Kersey
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

April 2018: First Edition

Revision History for the First Edition
2018-03-28: First Release

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Designing Event-Driven Systems, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O'Reilly and Confluent. See our statement of editorial independence.

978-1-492-03822-1 [LSI]
Table of Contents

Foreword
Preface

Part I. Setting the Stage

1. Introduction
2. The Origins of Streaming
3. Is Kafka What You Think It Is?
   Kafka Is Like REST but Asynchronous?
   Kafka Is Like a Service Bus?
   Kafka Is Like a Database?
   What Is Kafka Really? A Streaming Platform
4. Beyond Messaging: An Overview of the Kafka Broker
   The Log: An Efficient Structure for Retaining and Distributing Messages
   Linear Scalability
   Segregating Load in Multiservice Ecosystems
   Maintaining Strong Ordering Guarantees
   Ensuring Messages Are Durable
   Load-Balance Services and Make Them Highly Available
   Compacted Topics
   Long-Term Data Storage
   Security
   Summary
Part II. Designing Event-Driven Systems

5. Events: A Basis for Collaboration
   Commands, Events, and Queries
   Coupling and Message Brokers
   Using Events for Notification
   Using Events to Provide State Transfer
   Which Approach to Use
   The Event Collaboration Pattern
   Relationship with Stream Processing
   Mixing Request- and Event-Driven Protocols
   Summary
6. Processing Events with Stateful Functions
   Making Services Stateful
   Summary
7. Event Sourcing, CQRS, and Other Stateful Patterns
   Event Sourcing, Command Sourcing, and CQRS in a Nutshell
   Version Control for Your Data
   Making Events the Source of Truth
   Command Query Responsibility Segregation
   Materialized Views
   Polyglot Views
   Whole Fact or Delta?
   Implementing Event Sourcing and CQRS with Kafka
   Summary

Part III. Rethinking Architecture at Company Scales

8. Sharing Data and Services Across an Organization
   Encapsulation Isn't Always Your Friend
   The Data Dichotomy
   What Happens to Systems as They Evolve?
   Make Data on the Outside a First-Class Citizen
   Don't Be Afraid to Evolve
   Summary
9. Event Streams as a Shared Source of Truth
   A Database Inside Out
   Summary
10. Lean Data
    If Messaging Remembers, Databases Don't Have To
    Take Only the Data You Need, Nothing More
    Rebuilding Event-Sourced Views
    Automation and Schema Migration
    Summary

Part IV. Consistency, Concurrency, and Evolution

11. Consistency and Concurrency in Event-Driven Systems
    Eventual Consistency
    The Single Writer Principle
    Atomicity with Transactions
    Identity and Concurrency Control
    Limitations
    Summary
12. Transactions, but Not as We Know Them
    The Duplicates Problem
    Using the Transactions API to Remove Duplicates
    Exactly Once Is Both Idempotence and Atomic Commit
    How Kafka's Transactions Work Under the Covers
    Store State and Send Events Atomically
    Do We Need Transactions? Can We Do All This with Idempotence?
    What Can't Transactions Do?
    Making Use of Transactions in Your Services
    Summary
13. Evolving Schemas and Data over Time
    Using Schemas to Manage the Evolution of Data in Time
    Handling Schema Change and Breaking Backward Compatibility
    Collaborating over Schema Change
    Handling Unreadable Messages
    Deleting Data
    Segregating Public and Private Topics
    Summary

Part V. Implementing Streaming Services with Kafka

14. Kafka Streams and KSQL
    A Simple Email Service Built with Kafka Streams and KSQL
    Windows, Joins, Tables, and State Stores
    Summary
15. Building Streaming Services
    An Order Validation Ecosystem
    Join-Filter-Process
    Event-Sourced Views in Kafka Streams
    Collapsing CQRS with a Blocking Read
    Scaling Concurrent Operations in Streaming Systems
    Rekey to Join
    Repartitioning and Staged Execution
    Waiting for N Events
    Reflecting on the Design
    A More Holistic Streaming Ecosystem
    Summary
Foreword

For as long as we've been talking about services, we've been talking about data. In fact, before we even had the word microservices in our lexicon, back when it was just good old-fashioned service-oriented architecture, we were talking about data: how to access it, where it lives, who "owns" it. Data is all-important—vital for the continued success of our business—but has also been seen as a massive constraint in how we design and evolve our systems.

My own journey into microservices began with work I was doing to help organizations ship software more quickly. This meant a lot of time was spent on things like cycle time analysis, build pipeline design, test automation, and infrastructure automation. The advent of the cloud was a huge boon to the work we were doing, as the improved automation made us even more productive. But I kept hitting some fundamental issues. All too often, the software wasn't designed in a way that made it easy to ship. And data was at the heart of the problem.

Back then, the most common pattern I saw for service-based systems was sharing a database among multiple services. The rationale was simple: the data I need is already in this other database, and accessing a database is easy, so I'll just reach in and grab what I need. This may allow for fast development of a new service, but over time it becomes a major constraint.

As I expanded upon in my book, Building Microservices, a shared database creates a huge coupling point in your architecture. It becomes difficult to understand what changes can be made to a schema shared by multiple services. David Parnas[1] showed us back in 1971 that the secret to creating software whose parts could be changed independently was to hide information between modules. But at a swoop, exposing a schema to multiple services prohibits our ability to independently evolve our codebases.

[1] D. L. Parnas, On the Criteria to Be Used in Decomposing Systems into Modules (Pittsburgh, PA: Carnegie Mellon University, 1971).
As the needs and expectations of software changed, IT organizations changed with them. The shift from siloed IT toward business- or product-aligned teams helped improve the customer focus of those teams. This shift often happened in concert with the move to improve the autonomy of those teams, allowing them to develop new ideas, implement them, and then ship them, all while reducing the need for coordination with other parts of the organization. But highly coupled architectures require heavy coordination between systems and the teams that maintain them—they are the enemy of any organization that wants to optimize autonomy.

Amazon spotted this many years ago. It wanted to improve team autonomy to allow the company to evolve and ship software more quickly. To this end, Amazon created small, independent teams who would own the whole lifecycle of delivery. Steve Yegge, after leaving Amazon for Google, attempted to capture what it was that made those teams work so well in his infamous (in some circles) "Platform Rant". In it, he outlined the mandate from Amazon CEO Jeff Bezos regarding how teams should work together and how they should design systems. These points in particular resonate for me:

1) All teams will henceforth expose their data and functionality through service interfaces.
2) Teams must communicate with each other through these interfaces.
3) There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's datastore, no shared-memory model, no backdoors whatsoever. The only communication allowed is via service interface calls over the network.

In my own way, I came to the realization that how we store and share data is key to ensuring we develop loosely coupled architectures. Well-defined interfaces are key, as is hiding information. If we need to store data in a database, that database should be part of a service, and not accessed directly by other services. A well-defined interface should guide when and how that data is accessed and manipulated.

Much of my time over the past several years has been taken up with pushing this idea. But while people increasingly get it, challenges remain. The reality is that services do need to work together and do sometimes need to share data. How do you do that effectively? How do you ensure that this is done in a way that is sympathetic to your application's latency and load conditions? What happens when one service needs a lot of information from another?

Enter streams of events, specifically the kinds of streams that technology like Kafka makes possible. We're already using message brokers to exchange events, but Kafka's ability to make that event stream persistent allows us to consider a new way of storing and exchanging data without losing out on our ability to create loosely coupled autonomous architectures.
In this book, Ben talks about the idea of "turning the database inside out"—a concept that I suspect will get as many skeptical responses as I did back when I was suggesting moving away from giant shared databases. But after the last couple of years I've spent exploring these ideas with Ben, I can't help thinking that he and the other people working on these concepts and technology (and there is certainly lots of prior art here) really are on to something.

I'm hopeful that the ideas outlined in this book are another step forward in how we think about sharing and exchanging data, helping us change how we build microservice architecture. The ideas may well seem odd at first, but stick with them. Ben is about to take you on a very interesting journey.

—Sam Newman
Preface

In 2006 I was working at ThoughtWorks, in the UK. There was a certain energy to the office at that time, with lots of interesting things going on. The Agile movement was in full bloom, BDD (behavior-driven development) was flourishing, people were experimenting with Event Sourcing, and SOA (service-oriented architecture) was being adapted to smaller projects to deal with some of the issues we'd seen in larger implementations.

One project I worked on was led by Dave Farley, an energetic and cheerful fellow who managed to transfer his jovial bluster into pretty much everything we did. The project was a relatively standard, medium-sized enterprise application. It had a web portal where customers could request a variety of conveyancing services. The system would then run various synchronous and asynchronous processes to put the myriad of services they requested into action.

There were a number of interesting elements to that particular project, but the one that really stuck with me was the way the services communicated. It was the first system I'd worked on that was built solely from a collaboration of events. Having worked with a few different service-based systems before, all built with RPCs (remote procedure calls) or request-response messaging, I thought this one felt very different. There was something inherently spritely about the way you could plug new services right into the event stream, and something deeply satisfying about tailing the log of events and watching the "narrative" of the system whizz past.

A few years later, I was working at a large financial institution that wanted to build a data service at the heart of the company, somewhere applications could find the important datasets that made the bank work—trades, valuations, reference data, and the like. I find this sort of problem quite compelling: it was technically challenging and, although a number of banks and other large companies had taken this kind of approach before, it felt like the technology had moved on to a point where we could build something really interesting and transformative.
Yet getting the technology right was only the start of the problem. The system had to interface with every major department, and that meant a lot of stakeholders with a lot of requirements, a lot of different release schedules, and a lot of expectations around uptime. I remember discussing the practicalities of the project as we talked our design through in a two-week stakeholder kick-off meeting. It seemed a pretty tall order, not just technically but organizationally, yet it also seemed plausible.

So we pulled together a team, with a bunch of people from ThoughtWorks and Google and a few other places, and the resulting system had some pretty interesting properties. The datastore held queryable data in memory, spread over 35 machines per datacenter, so it could handle being hit from a compute grid. Writes went directly through the query layer into a messaging system, which formed (somewhat unusually for the time) the system of record. Both the query layer and the messaging layer were designed to be sharded so they could scale linearly. So every insert or update was also a published event, and there was no side-stepping it either; it was baked into the heart of the architecture.

The interesting thing about making messaging the system of record is you find yourself repurposing the data stream to do a whole variety of useful things: recording it on a filesystem for recovery, pushing it to another datacenter, hydrating a set of databases for reporting and analytics, and, of course, broadcasting it to anyone with the API who wants to listen.

But the real importance of using messaging as a system of record evaded me somewhat at the time. I remember speaking about the project at QCon, and there were more questions about the lone "messaging as a system of record" slide, which I'd largely glossed over, than there were about the fancy distributed join layer that the talk had focused on. So it slowly became apparent that, for all its features—the data-driven precaching that made joins fast, the SQL-over-Document interface, the immutable data model, and late-bound schema—what most customers needed was really subtly different, and somewhat simpler.

While they would start off making use of the data service directly, as time passed, some requirement would often lead them to take a copy, store it independently, and do their own thing. But despite this, they still found the central dataset useful and would often take a subset, then later come back for more. So, on reflection, it seemed that a messaging system optimized to hold datasets would be more appropriate than a database optimized to publish them. A little while later Confluent formed, and Kafka seemed a perfect solution for this type of problem.

The interesting thing about these two experiences (the conveyancing application and the bank-wide data service) is that they are more closely related than they may initially appear. The conveyancing application had been wonderfully collaborative, yet pluggable. At the bank, a much larger set of applications and services integrated through events, but also leveraged a historic reference they could go back to and query.
So the contexts were quite different—the first was a single application, the second a company—but much of the elegance of both systems came from their use of events.

Streaming systems today are in many ways quite different from both of these examples, but the underlying patterns haven't really changed all that much. Nevertheless, the devil is in the details, and over the last few years we've seen clients take a variety of approaches to solving both of these kinds of problems, along with many others—problems that both distributed logs and stream processing tools are well suited to—and I've tried to extract the key elements of these approaches in this short book.

How to Read This Book

The book is arranged into five sections. Part I sets the scene, with chapters that introduce Kafka and stream processing and should provide even seasoned practitioners with a useful overview of the base concepts. In Part II you'll find out how to build event-driven systems, how such systems relate to stateful stream processing, and how to apply patterns like Event Collaboration, Event Sourcing, and CQRS. Part III is more conceptual, building on the ideas from Part II, but applying them at the level of whole organizations. Here we question many of the common approaches used today, and dig into patterns like event streams as a source of truth. Part IV and Part V are more practical. Part V starts to dip into a little code, and there is an associated GitHub project to help you get started if you want to build fine-grained services with Kafka Streams.

The introduction given in Chapter 1 provides a high-level overview of the main concepts covered in this book, so it is a good place to start.

Acknowledgments

Many people contributed to this book, both directly and indirectly, but a special thanks to Jay Kreps, Sam Newman, Edward Ribeiro, Gwen Shapira, Steve Counsell, Martin Kleppmann, Yeva Byzek, Dan Hanley, Tim Berglund, and of course my ever-patient wife, Emily.
PART I
Setting the Stage

The truth is the log.
—Pat Helland, "Immutability Changes Everything," 2015
CHAPTER 1
Introduction

While the main focus of this book is the building of event-driven systems of different sizes, there is a deeper focus on software that spans many teams. This is the realm of service-oriented architectures: an idea that arose around the start of the century, where a company reconfigures itself around shared services that do commonly useful things.

This idea became quite popular. Amazon famously banned all intersystem communications by anything that wasn't a service interface. Later, upstart Netflix went all in on microservices, and many other web-based startups followed suit. Enterprise companies did similar things, but often using messaging systems, which have a subtly different dynamic. Much was learned during this time, and there was significant progress made, but it wasn't straightforward.

One lesson learned, which was pretty ubiquitous at the time, was that service-based approaches significantly increased the probability of you getting paged at 3 a.m., when one or more services go down. In hindsight, this shouldn't have been surprising. If you take a set of largely independent applications and turn them into a web of highly connected ones, it doesn't take too much effort to imagine that one important but flaky service can have far-reaching implications, and in the worst case bring the whole system to a halt. As Steve Yegge put it in his famous Amazon/Google post, "Organizing into services taught teams not to trust each other in most of the same ways they're not supposed to trust external developers."

What did work well for Amazon, though, was the element of organizational change that came from being wholeheartedly service based. Service teams think of their software as being a cog in a far larger machine. As Ian Robinson put it, "Be of the web, not behind the web." This was a huge shift from the way people built applications previously, where intersystem communication was something teams reluctantly bolted on as an afterthought.
But the services model made interaction a first-class entity. Suddenly your users weren't just customers or businesspeople; they were other applications, and they really cared that your service was reliable. So applications became platforms, and building platforms is hard.

LinkedIn felt this pain as it evolved away from its original, monolithic Java application into 800–1,100 services. Complex dependencies led to instability, versioning issues caused painful lockstep releases, and early on, it wasn't clear that the new architecture was actually an improvement.

One difference in the way LinkedIn evolved its approach was its use of a messaging system built in-house: Kafka. Kafka added an asynchronous publish-subscribe model to the architecture that enabled trillions of messages a day to be transported around the organization. This was important for a company in hypergrowth, as it allowed new applications to be plugged in without disturbing the fragile web of synchronous interactions that drove the frontend.

But this idea of rearchitecting a system around events isn't new—event-driven architectures have been around for decades, and technologies like enterprise messaging are big business, particularly with (unsurprisingly) enterprise companies. Most enterprises have been around for a long time, and their systems have grown organically, over many iterations or through acquisition. Messaging systems naturally fit these complex and disconnected worlds for the same reasons observed at LinkedIn: events decouple, and this means different parts of the company can operate independently of one another. It also means it's easier to plug new systems into the real-time stream of events.

A good example is the regulation that hit the finance industry in January 2018, which states that trading activity has to be reported to a regulator within one minute of it happening. A minute may seem like a long time in computing terms, but it takes only one batch-driven system, on the critical path in one business silo, for that to be unattainable. So the banks that had gone to the effort of installing real-time trade eventing, and plumbed it across all their product-aligned silos, made short work of these regulations. For the majority that hadn't, it was a significant effort, typically resulting in half-hearted, hacky solutions.

So enterprise companies start out complex and disconnected: many separate, asynchronous islands—often with users of their own—operating independently of one another for the most part. Internet companies are different, starting life as simple, front-facing web applications where users click buttons and expect things to happen. Most start as monoliths and stay that way for some time (arguably for longer than they should). But as internet companies grow and their business gets more complex, they see a similar shift to asynchronicity. New teams and departments are introduced and they need to operate independently, freed from the synchronous bonds that tie the frontend.
So ubiquitous desires for online utilities, like making a payment or updating a shopping basket, are slowly replaced by a growing need for datasets that can be used, and evolved, without any specific application lock-in.

But messaging is no panacea. Enterprise service buses (ESBs), for example, have vocal detractors and traditional messaging systems have a number of issues of their own. They are often used to move data around an organization, but the absence of any notion of history limits their value. So, even though recent events typically have more value than old ones, business operations still need historical data—whether it's users wanting to query their account history, some service needing a list of customers, or analytics that need to be run for a management report.

On the other hand, data services with HTTP-fronted interfaces make lookups simple. Anyone can reach in and run a query. But they don't make it so easy to move data around. To extract a dataset you end up running a query, then periodically polling the service for changes. This is a bit of a hack, and typically the operators in charge of the service you're polling won't thank you for it.

But replayable logs, like Kafka, can play the role of an event store: a middle ground between a messaging system and a database. (If you don't know Kafka, don't worry—we dive into it in Chapter 4.) Replayable logs decouple services from one another, much like a messaging system does, but they also provide a central point of storage that is fault-tolerant and scalable—a shared source of truth that any application can fall back to.

A shared source of truth turns out to be a surprisingly useful thing. Microservices, for example, don't share their databases with one another (referred to as the IntegrationDatabase antipattern). There is a good reason for this: databases have very rich APIs that are wonderfully useful on their own, but when widely shared they make it hard to work out if and how one application is going to affect others, be it data couplings, contention, or load. But the business facts that services do choose to share are the most important facts of all. They are the truth that the rest of the business is built on. Pat Helland called out this distinction back in 2006, denoting it "data on the outside."

But a replayable log provides a far more suitable place to hold this kind of data because (somewhat counterintuitively) you can't query it! It is purely about storing data and pushing it to somewhere new. This idea of pure data movement is important, because data on the outside—the data services share—is the most tightly coupled of all, and the more services an ecosystem has, the more tightly coupled this data gets. The solution is to move data somewhere that is more loosely coupled, so that means moving it into your application where you can manipulate it to your heart's content. So data movement gives applications a level of operability and control that is unachievable with a direct, runtime dependency. This idea of retaining control turns out to be important—it's the same reason the shared database pattern doesn't work out well in practice.
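To make this shape of interaction concrete, here is a minimal sketch of a service "falling back" to a replayable log: it replays a topic from the beginning of the log and builds its own in-memory view, rather than querying a shared store. It is illustrative only and not taken from this book's accompanying project; the broker address (localhost:9092), the single-partition "customers" topic, and the assumption that each event is keyed by customer ID and carries the latest customer record as its value are all assumptions made for the example.

// A minimal sketch: rebuild a local view by replaying a hypothetical
// "customers" topic from the start of the log (assumed local broker).
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CustomerViewRebuilder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "customer-view-rebuilder");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");

        // The view this service owns: latest event value per customer ID.
        Map<String, String> customerView = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assume a single-partition topic for simplicity.
            TopicPartition partition = new TopicPartition("customers", 0);
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition));    // replay the whole log

            // Read forward until we catch up with the current end of the log.
            long end = consumer.endOffsets(List.of(partition)).get(partition);
            while (consumer.position(partition) < end) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    customerView.put(record.key(), record.value());  // last event wins
                }
            }
        }
        System.out.println("Rebuilt a view of " + customerView.size() + " customers");
    }
}

The detail of the code matters less than its shape: the service never queries a shared store; it pulls the events into its own process and builds whatever local representation it needs, which is exactly the data movement described above.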