Database Internals A Deep-Dive into How Distributed Data Systems Work (Alex Petrov) (Z-Library)

Author: Alex Petrov

SQL

When it comes to choosing, using, and maintaining a database, understanding its internals is essential. But with so many distributed databases and tools available today, it's often difficult to understand what each one offers and how they differ. With this practical guide, Alex Petrov guides developers through the concepts behind modern database and storage engine internals. Throughout the book, you'll explore relevant material gleaned from numerous books, papers, blog posts, and the source code of several open source databases. These resources are listed at the end of parts one and two. You'll discover that the most significant distinctions among many modern databases reside in subsystems that determine how storage is organized and how data is distributed. This book examines: • Storage engines: Explore storage classification and taxonomy, and dive into B-Tree-based and immutable Log Structured storage engines, with differences and use-cases for each • Storage building blocks: Learn how database files are organized to build efficient storage, using auxiliary data structures such as Page Cache, Buffer Pool and Write-Ahead Log • Distributed systems: Learn step-by-step how nodes and processes connect and build complex communication patterns • Database clusters: Which consistency models are commonly used by modern databases and how distributed storage systems achieve consistency Alex Petrov is a data infrastructure engineer, database and storage systems enthusiast, Apache Cassandra committer, and PMC member. How expertise is in storage, distributed systems, and algorithms.

📄 File Format: PDF

💾 File Size: 12.0 MB

353

Views

141

Downloads

0.00

Total Donations

📖 Read Online ⬇️ Download

📄 Text Preview (First 20 pages)

ℹ️

Registered users can read the full content for free

📄 Page 1

(This page has no text content)

📄 Page 2

Alex Petrov Database Internals A Deep Dive into How Distributed Data Systems Work Boston Farnham Sebastopol TokyoBeijing

📄 Page 3

978-1-492-04034-7 [MBP] Database Internals by Alex Petrov Copyright © 2019 Oleksandr Petrov. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com. Acquisitions Editor: Mike Loukides Development Editor: Michele Cronin Production Editor: Christopher Faucher Copyeditor: Kim Cofer Proofreader: Sonia Saruba Indexer: Judith McConville Interior Designer: David Futato Cover Designer: Karen Montgomery Illustrator: Rebecca Demarest October 2019: First Edition Revision History for the First Edition 2019-09-12: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781492040347 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Database Internals, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

📄 Page 4

To Pieter Hintjens, from whom I got my first ever signed book: an inspiring distributed systems programmer, author, philosopher, and friend.

📄 Page 5

(This page has no text content)

📄 Page 6

Table of Contents Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii Part I. Storage Engines 1. Introduction and Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 DBMS Architecture 8 Memory- Versus Disk-Based DBMS 10 Durability in Memory-Based Stores 11 Column- Versus Row-Oriented DBMS 12 Row-Oriented Data Layout 13 Column-Oriented Data Layout 14 Distinctions and Optimizations 15 Wide Column Stores 15 Data Files and Index Files 17 Data Files 18 Index Files 18 Primary Index as an Indirection 20 Buffering, Immutability, and Ordering 21 Summary 22 2. B-Tree Basics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Binary Search Trees 26 Tree Balancing 27 Trees for Disk-Based Storage 28 Disk-Based Structures 29 Hard Disk Drives 30 Solid State Drives 30 v

📄 Page 7

On-Disk Structures 32 Ubiquitous B-Trees 33 B-Tree Hierarchy 35 Separator Keys 36 B-Tree Lookup Complexity 37 B-Tree Lookup Algorithm 38 Counting Keys 38 B-Tree Node Splits 39 B-Tree Node Merges 41 Summary 42 3. File Formats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Motivation 46 Binary Encoding 47 Primitive Types 47 Strings and Variable-Size Data 49 Bit-Packed Data: Booleans, Enums, and Flags 49 General Principles 50 Page Structure 52 Slotted Pages 52 Cell Layout 54 Combining Cells into Slotted Pages 56 Managing Variable-Size Data 57 Versioning 58 Checksumming 59 Summary 60 4. Implementing B-Trees. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Page Header 61 Magic Numbers 62 Sibling Links 62 Rightmost Pointers 63 Node High Keys 64 Overflow Pages 65 Binary Search 67 Binary Search with Indirection Pointers 67 Propagating Splits and Merges 68 Breadcrumbs 69 Rebalancing 70 Right-Only Appends 71 Bulk Loading 72 Compression 73 vi | Table of Contents

📄 Page 8

Vacuum and Maintenance 74 Fragmentation Caused by Updates and Deletes 75 Page Defragmentation 76 Summary 76 5. Transaction Processing and Recovery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Buffer Management 81 Caching Semantics 83 Cache Eviction 83 Locking Pages in Cache 84 Page Replacement 85 Recovery 88 Log Semantics 90 Operation Versus Data Log 91 Steal and Force Policies 91 ARIES 92 Concurrency Control 93 Serializability 94 Transaction Isolation 95 Read and Write Anomalies 95 Isolation Levels 96 Optimistic Concurrency Control 98 Multiversion Concurrency Control 99 Pessimistic Concurrency Control 99 Lock-Based Concurrency Control 100 Summary 108 6. B-Tree Variants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 Copy-on-Write 112 Implementing Copy-on-Write: LMDB 113 Abstracting Node Updates 113 Lazy B-Trees 114 WiredTiger 114 Lazy-Adaptive Tree 116 FD-Trees 117 Fractional Cascading 118 Logarithmic Runs 119 Bw-Trees 120 Update Chains 121 Taming Concurrency with Compare-and-Swap 121 Structural Modification Operations 122 Consolidation and Garbage Collection 123 Table of Contents | vii

📄 Page 9

Cache-Oblivious B-Trees 124 van Emde Boas Layout 125 Summary 127 7. Log-Structured Storage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 LSM Trees 130 LSM Tree Structure 132 Updates and Deletes 136 LSM Tree Lookups 137 Merge-Iteration 137 Reconciliation 140 Maintenance in LSM Trees 141 Read, Write, and Space Amplification 143 RUM Conjecture 144 Implementation Details 145 Sorted String Tables 145 Bloom Filters 146 Skiplist 148 Disk Access 150 Compression 151 Unordered LSM Storage 152 Bitcask 153 WiscKey 154 Concurrency in LSM Trees 155 Log Stacking 157 Flash Translation Layer 157 Filesystem Logging 159 LLAMA and Mindful Stacking 160 Open-Channel SSDs 161 Summary 162 Part I Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 Part II. Distributed Systems 8. Introduction and Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 Concurrent Execution 171 Shared State in a Distributed System 173 Fallacies of Distributed Computing 174 Processing 175 Clocks and Time 176 viii | Table of Contents

📄 Page 10

State Consistency 177 Local and Remote Execution 178 Need to Handle Failures 178 Network Partitions and Partial Failures 179 Cascading Failures 180 Distributed Systems Abstractions 181 Links 182 Two Generals’ Problem 187 FLP Impossibility 189 System Synchrony 190 Failure Models 191 Crash Faults 191 Omission Faults 192 Arbitrary Faults 193 Handling Failures 193 Summary 193 9. Failure Detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 Heartbeats and Pings 196 Timeout-Free Failure Detector 197 Outsourced Heartbeats 198 Phi-Accural Failure Detector 199 Gossip and Failure Detection 200 Reversing Failure Detection Problem Statement 201 Summary 202 10. Leader Election. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 Bully Algorithm 207 Next-In-Line Failover 208 Candidate/Ordinary Optimization 209 Invitation Algorithm 210 Ring Algorithm 211 Summary 212 11. Replication and Consistency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 Achieving Availability 216 Infamous CAP 216 Use CAP Carefully 217 Harvest and Yield 218 Shared Memory 219 Ordering 221 Consistency Models 222 Table of Contents | ix

📄 Page 11

Strict Consistency 223 Linearizability 223 Sequential Consistency 227 Causal Consistency 229 Session Models 233 Eventual Consistency 234 Tunable Consistency 235 Witness Replicas 236 Strong Eventual Consistency and CRDTs 238 Summary 240 12. Anti-Entropy and Dissemination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243 Read Repair 245 Digest Reads 246 Hinted Handoff 246 Merkle Trees 247 Bitmap Version Vectors 248 Gossip Dissemination 250 Gossip Mechanics 251 Overlay Networks 251 Hybrid Gossip 253 Partial Views 254 Summary 255 13. Distributed Transactions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257 Making Operations Appear Atomic 258 Two-Phase Commit 259 Cohort Failures in 2PC 261 Coordinator Failures in 2PC 262 Three-Phase Commit 264 Coordinator Failures in 3PC 265 Distributed Transactions with Calvin 266 Distributed Transactions with Spanner 268 Database Partitioning 270 Consistent Hashing 271 Distributed Transactions with Percolator 272 Coordination Avoidance 275 Summary 277 14. Consensus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279 Broadcast 280 Atomic Broadcast 281 x | Table of Contents

📄 Page 12

Virtual Synchrony 282 Zookeeper Atomic Broadcast (ZAB) 283 Paxos 285 Paxos Algorithm 286 Quorums in Paxos 287 Failure Scenarios 288 Multi-Paxos 291 Fast Paxos 292 Egalitarian Paxos 293 Flexible Paxos 296 Generalized Solution to Consensus 297 Raft 300 Leader Role in Raft 302 Failure Scenarios 304 Byzantine Consensus 305 PBFT Algorithm 306 Recovery and Checkpointing 309 Summary 309 Part II Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313 A. Bibliography. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317 Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337 Table of Contents | xi

📄 Page 13

(This page has no text content)

📄 Page 14

Preface Distributed database systems are an integral part of most businesses and the vast majority of software applications. These applications provide logic and a user inter‐ face, while database systems take care of data integrity, consistency, and redundancy. Back in 2000, if you were to choose a database, you would have just a few options, and most of them would be within the realm of relational databases, so differences between them would be relatively small. Of course, this does not mean that all data‐ bases were completely the same, but their functionality and use cases were very similar. Some of these databases have focused on horizontal scaling (scaling out)—improving performance and increasing capacity by running multiple database instances acting as a single logical unit: Gamma Database Machine Project, Teradata, Greenplum, Par‐ allel DB2, and many others. Today, horizontal scaling remains one of the most impor‐ tant properties that customers expect from databases. This can be explained by the rising popularity of cloud-based services. It is often easier to spin up a new instance and add it to the cluster than scaling vertically (scaling up) by moving the database to a larger, more powerful machine. Migrations can be long and painful, potentially incurring downtime. Around 2010, a new class of eventually consistent databases started appearing, and terms such as NoSQL, and later, big data grew in popularity. Over the last 15 years, the open source community, large internet companies, and database vendors have created so many databases and tools that it’s easy to get lost trying to understand use cases, details, and specifics. The Dynamo paper [DECANDIA07], published by the team at Amazon in 2007, had so much impact on the database community that within a short period it inspired many variants and implementations. The most prominent of them were Apache Cas‐ sandra, created at Facebook; Project Voldemort, created at LinkedIn; and Riak, cre‐ ated by former Akamai engineers. xiii

📄 Page 15

Today, the field is changing again: after the time of key-value stores, NoSQL, and eventual consistency, we have started seeing more scalable and performant databases, able to execute complex queries with stronger consistency guarantees. Audience of This Book In conversations at technical conferences, I often hear the same question: “How can I learn more about database internals? I don’t even know where to start.” Most of the books on database systems do not go into details of storage engine implementation, and cover the access methods, such as B-Trees, on a rather high level. There are very few books that cover more recent concepts, such as different B-Tree variants and log- structured storage, so I usually recommend reading papers. Everyone who reads papers knows that it’s not that easy: you often lack context, the wording might be ambiguous, there’s little or no connection between papers, and they’re hard to find. This book contains concise summaries of important database systems concepts and can serve as a guide for those who’d like to dig in deeper, or as a cheat sheet for those already familiar with these concepts. Not everyone wants to become a database developer, but this book will help people who build software that uses database systems: software developers, reliability engi‐ neers, architects, and engineering managers. If your company depends on any infrastructure component, be it a database, a mes‐ saging queue, a container platform, or a task scheduler, you have to read the project change-logs and mailing lists to stay in touch with the community and be up-to-date with the most recent happenings in the project. Understanding terminology and knowing what’s inside will enable you to yield more information from these sources and use your tools more productively to troubleshoot, identify, and avoid potential risks and bottlenecks. Having an overview and a general understanding of how data‐ base systems work will help in case something goes wrong. Using this knowledge, you’ll be able to form a hypothesis, validate it, find the root cause, and present it to other project maintainers. This book is also for curious minds: for the people who like learning things without immediate necessity, those who spend their free time hacking on something fun, cre‐ ating compilers, writing homegrown operating systems, text editors, computer games, learning programming languages, and absorbing new information. The reader is assumed to have some experience with developing backend systems and working with database systems as a user. Having some prior knowledge of different data structures will help to digest material faster. xiv | Preface

📄 Page 16

Why Should I Read This Book? We often hear people describing database systems in terms of the concepts and algo‐ rithms they implement: “This database uses gossip for membership propagation” (see Chapter 12), “They have implemented Dynamo,” or “This is just like what they’ve described in the Spanner paper” (see Chapter 13). Or, if you’re discussing the algo‐ rithms and data structures, you can hear something like “ZAB and Raft have a lot in common” (see Chapter 14), “Bw-Trees are like the B-Trees implemented on top of log structured storage” (see Chapter 6), or “They are using sibling pointers like in Blink- Trees” (see Chapter 5). We need abstractions to discuss complex concepts, and we can’t have a discussion about terminology every time we start a conversation. Having shortcuts in the form of common language helps us to move our attention to other, higher-level problems. One of the advantages of learning the fundamental concepts, proofs, and algorithms is that they never grow old. Of course, there will always be new ones, but new algo‐ rithms are often created after finding a flaw or room for improvement in a classical one. Knowing the history helps to understand differences and motivation better. Learning about these things is inspiring. You see the variety of algorithms, see how our industry was solving one problem after the other, and get to appreciate that work. At the same time, learning is rewarding: you can almost feel how multiple puzzle pieces move together in your mind to form a full picture that you will always be able to share with others. Scope of This Book This is neither a book about relational database management systems nor about NoSQL ones, but about the algorithms and concepts used in all kinds of database sys‐ tems, with a focus on a storage engine and the components responsible for distribution. Some concepts, such as query planning, query optimization, scheduling, the rela‐ tional model, and a few others, are already covered in several great textbooks on data‐ base systems. Some of these concepts are usually described from the user’s perspective, but this book concentrates on the internals. You can find some pointers to useful literature in the Part II Conclusion and in the chapter summaries. In these books you’re likely to find answers to many database-related questions you might have. Query languages aren’t discussed, since there’s no single common language among the database systems mentioned in this book. To collect material for this book, I studied over 15 books, more than 300 papers, countless blog posts, source code, and the documentation for several open source Preface | xv

📄 Page 17

databases. The rule of thumb for whether or not to include a particular concept in the book was the question: “Do the people in the database industry and research circles talk about this concept?” If the answer was “yes,” I added the concept to the long list of things to discuss. Structure of This Book There are some examples of extensible databases with pluggable components (such as [SCHWARZ86]), but they are rather rare. At the same time, there are plenty of exam‐ ples where databases use pluggable storage. Similarly, we rarely hear database vendors talking about query execution, while they are very eager to discuss the ways their databases preserve consistency. The most significant distinctions between database systems are concentrated around two aspects: how they store and how they distribute the data. (Other subsystems can at times also be of importance, but are not covered here.) The book is arranged into parts that discuss the subsystems and components responsible for storage (Part I) and distribution (Part II). Part I discusses node-local processes and focuses on the storage engine, the central component of the database system and one of the most significant distinctive factors. First, we start with the architecture of a database management system and present several ways to classify database systems based on the primary storage medium and layout. We continue with storage structures and try to understand how disk-based structures are different from in-memory ones, introduce B-Trees, and cover algorithms for effi‐ ciently maintaining B-Tree structures on disk, including serialization, page layout, and on-disk representations. Later, we discuss multiple variants to illustrate the power of this concept and the diversity of data structures influenced and inspired by B- Trees. Last, we discuss several variants of log-structured storage, commonly used for imple‐ menting file and storage systems, motivation, and reasons to use them. Part II is about how to organize multiple nodes into a database cluster. We start with the importance of understanding the theoretical concepts for building fault-tolerant distributed systems, how distributed systems are different from single-node applica‐ tions, and which problems, constraints, and complications we face in a distributed environment. After that, we dive deep into distributed algorithms. Here, we start with algorithms for failure detection, helping to improve performance and stability by noticing and reporting failures and avoiding the failed nodes. Since many algorithms discussed later in the book rely on understanding the concept of leadership, we introduce sev‐ eral algorithms for leader election and discuss their suitability. xvi | Preface

📄 Page 18

As one of the most difficult things in distributed systems is achieving data consis‐ tency, we discuss concepts of replication, followed by consistency models, possible divergence between replicas, and eventual consistency. Since eventually consistent systems sometimes rely on anti-entropy for convergence and gossip for data dissemi‐ nation, we discuss several anti-entropy and gossip approaches. Finally, we discuss log‐ ical consistency in the context of database transactions, and finish with consensus algorithms. It would’ve been impossible to write this book without all the research and publica‐ tions. You will find many references to papers and publications in the text, in square brackets with monospace font; for example, [DECANDIA07]. You can use these ref‐ erences to learn more about related concepts in more detail. After each chapter, you will find a summary section that contains material for further study, related to the content of the chapter. Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. Constant width Used for program listings, as well as within paragraphs to refer to program ele‐ ments such as variable or function names, databases, data types, environment variables, statements, and keywords. This element signifies a tip or suggestion. This element signifies a general note. This element indicates a warning or caution. Preface | xvii

📄 Page 19

Using Code Examples This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a signifi‐ cant amount of example code from this book into your product’s documentation does require permission. We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Database Internals by Alex Petrov (O’Reilly). Copyright 2019 Oleksandr Petrov, 978-1-492-04034-7.” If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com. O’Reilly Online Learning For almost 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in- depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com. How to Contact Us Please address comments and questions concerning this book to the publisher: O’Reilly Media, Inc. 1005 Gravenstein Highway North Sebastopol, CA 95472 800-998-9938 (in the United States or Canada) 707-829-0515 (international or local) 707-829-0104 (fax) xviii | Preface

📄 Page 20

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/database-internals. To comment or ask technical questions about this book, please send an email to bookquestions@oreilly.com. For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com. Find us on Facebook: http://facebook.com/oreilly Follow us on Twitter: http://twitter.com/oreillymedia Watch us on YouTube: http://www.youtube.com/oreillymedia Acknowledgments This book wouldn’t have been possible without the hundreds of people who have worked hard on research papers and books, which have been a source of ideas, inspi‐ ration, and served as references for this book. I’d like to say thank you to all the people who reviewd manuscripts and provided feedback, making sure that the material in this book is correct and the wording is pre‐ cise: Dmitry Alimov, Peter Alvaro, Carlos Baquero, Jason Brown, Blake Eggleston, Marcus Eriksson, Francisco Fernández Castaño, Heidi Howard, Vaidehi Joshi, Maxi‐ milian Karasz, Stas Kelvich, Michael Klishin, Predrag Knežević, Joel Knighton, Eugene Lazin, Nate McCall, Christopher Meiklejohn, Tyler Neely, Maxim Neverov, Marina Petrova, Stefan Podkowinski, Edward Ribiero, Denis Rytsov, Kir Shatrov, Alex Sorokoumov, Massimiliano Tomassi, and Ariel Weisberg. Of course, this book wouldn’t have been possible without support from my family: my wife Marina and my daughter Alexandra, who have supported me on every step on the way. Preface | xix

The above is a preview of the first 20 pages. Register to read the complete e-book.

💝 Support Author

0.00

Total Amount (¥)

Donation Count

← Back to List