Build Your Own Database From Scratch in Go
2nd Edition: From B+Tree To SQL
James Smith
2024-06-11
build-your-own.org

Learn by doing: from B+tree to SQL in 3000 lines of code.
Contents

00. Introduction
01. From Files To Databases
02. Indexing Data Structures
03. B-Tree & Crash Recovery
04. B+Tree Node and Insertion
05. B+Tree Deletion and Testing
06. Append-Only KV Store
07. Free List: Recycle & Reuse
08. Tables on KV
09. Range Queries
10. Secondary Indexes
11. Atomic Transactions
12. Concurrency Control
13. SQL Parser
14. Query Language
CHAPTER 00 Introduction

Master fundamentals by building your own DB

What to learn?

Complex systems like databases are built on a few simple principles.

1. Atomicity & durability. A DB is more than files!
   • Persist data with fsync.
   • Crash recovery.
2. KV store based on B-tree.
   • Disk-based data structures.
   • Space management with a free list.
3. Relational DB on top of KV.
   • How tables and indexes are mapped to low-level B-trees.
   • SQL-like query language; parser & interpreter.
4. Concurrency control for transactions.

Code a database in 3000 LoC, incrementally

It's amazing that an interesting and broad topic can be captured in 3000 LoC. You may have experience with larger projects, but not all experience is equal.

 LoC   Step
  366  B+tree data structure.
  601  Append-only KV.
  731  Practical KV with a free list.
 1107  Tables on KV.
 1294  Range queries.
 1438  Secondary indexes.
 1461  Transactional interfaces.
 1702  Concurrency control.
 2795  SQL-like query language.
Learn by doing: principles instead of jargon

Database literature is full of confusing, overloaded jargon with no consistent meaning. It's easy to get lost when reading about it. On the other hand, Feynman once said, "what I can't build, I don't understand". Can you build a database by reading about databases? Test your understanding!

While there is a lot to learn, not all knowledge is equally important; it takes only a few principles to build a DB, so anyone can try.

Topic 1: durability and atomicity

More than a data format

Smartphones use SQLite (a file-based DB) heavily. Why store data in SQLite instead of some other format, say JSON? Because you risk data loss if the app crashes during an update. The file can end up half-written, truncated, or even missing. There are techniques to fix this, and they lead to databases.

Durability and atomicity with fsync

Atomicity means that data is either updated or not, nothing in between. Durability means that data is guaranteed to exist after a certain point. They are not separate concerns, because we must achieve both.

The first thing to learn is the fsync syscall. A file write doesn't reach disk synchronously; there are multiple levels of buffering (the OS page cache and on-device RAM). fsync flushes pending data and waits until it's done. This makes writes durable, but what about atomicity?

Topic 2: indexing data structures

Control latency and cost with indexes

A DB turns a query into a result without the user knowing how. But the result is not the only concern; latency and cost (memory, IO, computation) also matter, hence the distinction between analytical (OLAP) and transactional (OLTP) workloads.

• OLAP can involve large amounts of data, with aggregation or join operations. Indexing can be limited or non-existent.
• OLTP touches small amounts of data using indexes. Low latency & cost.

The word "transactional" is not about DB transactions; it's just quirky jargon.
In-memory data structures vs. on-disk data structures

There are extra challenges when putting an indexing data structure on disk. (See my book "Build Your Own Redis" for a much easier in-memory DB.) One of the problems is updating disk data in-place, because you have to deal with corrupted states after a crash.

Disks are not just slower RAM. The R in RAM stands for "random", which is another problem for disk-based data: random access is much slower than sequential access, even on SSDs. So data structures like binary trees are not viable, while B-trees and LSM-trees are OK. Concurrent access to data structures is also a topic.

Topic 3: Relational DB on KV

Two layers of DB interfaces

SQL is almost a synonym for database. But SQL is just a user interface; it's not fundamental to a DB. What's important is the functionality underneath.

Another, much simpler interface is key-value (KV). You can get, set, and delete a single key, and most importantly, list a range of keys in sorted order. KV is simpler than SQL because it's one layer lower: relational DBs are built on top of KV-like interfaces called storage engines.

Query languages: parsers and interpreters

The last step is easy, despite the larger LoC. Both the parser and the interpreter are coded with nothing but recursion! The lesson can be applied to almost any computer language, or to creating your own programming language or DSL. (See my book "From Source Code To Machine Code" for more challenges.)

Build Your Own X book series

X includes Redis, a web server, and a compiler. Read the web version on the website.
https://build-your-own.org
CHAPTER 01 From Files To Databases

Let's start with files, and examine the challenges we face.

1.1 Updating files in-place

Let's say you need to save some data to disk; this is a typical way to do it:

func SaveData1(path string, data []byte) error {
    fp, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0664)
    if err != nil {
        return err
    }
    defer fp.Close()

    _, err = fp.Write(data)
    if err != nil {
        return err
    }
    return fp.Sync() // fsync
}

This code creates the file if it does not exist, or truncates the existing one before writing the content. And most importantly, the data is not persistent unless you call fsync (fp.Sync() in Go).

It has some serious limitations:

1. It updates the content as a whole; it's only usable for tiny data. This is why you don't use Excel as a database.
2. If you need to update the old file, you must read and modify it in memory, then overwrite the old file. What if the app crashes while overwriting the old file?
3. If the app needs concurrent access to the data, how do you prevent readers from getting mixed data and writers from conflicting with each other? That's why most databases are client-server: you need a server to coordinate concurrent clients. (Concurrency is more complicated without a server; see SQLite.)
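For a quick sanity check, here is a minimal sketch of calling it (the path and payload are made up for illustration; this assumes SaveData1 is in the same package and "log" is imported):

func main() {
    if err := SaveData1("data.bin", []byte("hello")); err != nil {
        log.Fatal(err)
    }
    // The write is fsync'd, but note: the whole file was rewritten.
}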
1.2 Atomic renaming

Replacing data atomically by renaming files

Many problems are solved by not updating data in-place. You can write a new file and delete the old file. Not touching the old file data means:

1. If the update is interrupted, you can recover from the old file since it remains intact.
2. Concurrent readers won't get half-written data.

The problem is how readers will find the new file. A common pattern is to rename the new file to the old file path.

func SaveData2(path string, data []byte) error {
    tmp := fmt.Sprintf("%s.tmp.%d", path, randomInt())
    fp, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0664)
    if err != nil {
        return err
    }
    defer func() {
        fp.Close()
        if err != nil {
            os.Remove(tmp)
        }
    }()

    _, err = fp.Write(data)
    if err != nil {
        return err
    }
    err = fp.Sync() // fsync
    if err != nil {
        return err
    }
    return os.Rename(tmp, path)
}

Renaming a file to an existing one replaces it atomically; deleting the old file is not needed (and not correct). Pay attention to the meaning of the jargon: whenever you see "X is atomic", you should ask "X is atomic with respect to what?" In this case:
• Rename is atomic w.r.t. concurrent readers; a reader opens either the old or the new file.
• Rename is NOT atomic w.r.t. power loss; it's not even durable. You need an extra fsync on the parent directory, which is discussed later.

Why does renaming work?

Filesystems keep a mapping from file names to file data, so replacing a file by renaming simply points the file name to the new data without touching the old data. That's why atomic renaming is possible in filesystems. And the operation cost is constant regardless of the data size.

On Linux, the replaced old file may still exist if it's still opened by a reader; it's just no longer accessible via a file name. Readers can safely work on whatever version of the data they got, while the writer won't be blocked by readers. However, there must be a way to prevent concurrent writers. This level of concurrency is multi-reader-single-writer, which is what we will implement.
1.3 Append-only logs

Safe incremental updates with logs

One way to do incremental updates is to just append the updates to a file. This is called a "log" because it's append-only. It's safer than in-place updates because no data is overwritten; you can always recover the old data after a crash.

The reader must consider all log entries when using the log. For example, here is a log-based KV with 4 entries:

     0           1           2          3
| set a=1  | set b=2  | set a=3  |  del b   |

The final state is a=3. Logs are an essential component of many databases. However, a log is only a description of each update, which means:

• It's not an indexing data structure; readers must read all entries.
• It has no way to reclaim space from deleted data.

So logs alone are not enough to build a DB; they must be combined with other indexing data structures.

Atomic log updates with checksums

While a log won't corrupt old data, you still have to deal with the last entry if it gets corrupted after a crash. There are several possibilities:

1. The last append simply does not happen; the log is still good.
2. The last entry is half written.
3. The size of the log is increased, but the last entry is not there.

The way to deal with these cases is to add a checksum to each log entry. If the checksum is wrong, the update did not happen, making log updates atomic (w.r.t. both readers and durability). This scenario is about incomplete writes (torn writes in DB jargon) that occur before a successful fsync. Checksums can also detect other forms of corruption after fsync, but that's not something a DB can recover from.

1.4 fsync gotchas

After renaming files or creating new files, you must call fsync on the parent directory. A directory is a mapping from file names to files, and like file data, it's not durable unless you use fsync. See this example[1] of fsync on the directory.

Another issue with fsync is error handling. If fsync fails, the DB update fails, but what if you read the file afterwards? You may get the new data even if fsync failed (because of the OS page cache)! This behavior is filesystem-dependent[2].
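To make the directory fsync concrete, here is a minimal sketch (the SyncDir helper is my name, not from the book's code; it assumes Linux-like semantics where a directory can be opened read-only and fsync'd):

// Fsync a directory to make renames/creations in it durable.
func SyncDir(dir string) error {
    fp, err := os.Open(dir)
    if err != nil {
        return err
    }
    defer fp.Close()
    return fp.Sync() // fsync the directory's name-to-file mapping
}

For example, SaveData2 could follow os.Rename(tmp, path) with SyncDir(filepath.Dir(path)) to make the rename itself durable.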
1.5 Summary of database challenges

What we have learned:

1. Problems with in-place updates.
   • Avoid in-place updates by renaming files.
   • Avoid in-place updates using logs.
2. Append-only logs.
   • Incremental updates.
   • Not a full solution; no indexing and no space reuse.
3. fsync usage.

What remains a question:

1. Indexing data structures and how to update them.
2. Reusing space from append-only files.
3. Combining a log with an indexing data structure.
4. Concurrency.

[1] https://www.usenix.org/sites/default/files/conference/protected-files/osdi14_slides_pillai.pdf#page=31
[2] https://www.usenix.org/conference/atc20/presentation/rebello
CHAPTER 02 Indexing Data Structures

2.1 Types of queries

Most SQL queries can be broken down into 3 types:

1. Scan the whole data set (no index is used).
2. Point query: query the index by a specific key.
3. Range query: query the index by a range (the index is sorted).

There are ways to make scanning fast, such as column-based storage. But a scan is O(N) no matter how fast it is; our focus is on queries that can be served in O(log N) using data structures.

A range query consists of 2 phases:

1. Seek: find the starting key.
2. Iterate: find the previous/next key in sorted order.

A point query is just a seek without the iterate; a sorting data structure is all we need.

2.2 Hashtables

Hashtables are viable if you only consider point queries (get, set, del), but we will not bother with them because of the lack of ordering. However, coding a hashtable, even an in-memory one, is still a valuable exercise. It's far easier than the B-tree we'll code later, though some challenges remain:

• How to grow a hashtable? Keys must be moved to a larger hashtable when the load factor is too high, but moving everything at once is a prohibitive O(N). Rehashing must be done progressively, even for in-memory apps like Redis.
• Other things mentioned before: in-place updates, space reuse, etc.

2.3 Sorted arrays

Ruling out hashtables, let's start with the simplest sorting data structure: the sorted array. You can binary search on it in O(log N). For variable-length data such as strings (KV), use an array of pointers (offsets) to do binary searches.

Updating a sorted array is O(N), either in-place or not. So it's not practical, but it can be extended to other updatable data structures.
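To make the seek phase concrete, here is a minimal in-memory sketch of binary search over variable-length keys referenced by index (the SortedKV type and its layout are made up for illustration; on disk, the indices would be offsets into encoded data):

import (
    "bytes"
    "sort"
)

// Keys are variable-length, so binary search goes through an
// array of references (here: slice indices; on disk: offsets).
type SortedKV struct {
    keys [][]byte // sorted
    vals [][]byte
}

// Seek returns the position of the first key >= target.
// A point query is Seek + an equality check; a range query is
// Seek + iterating forward from the returned position.
func (s *SortedKV) Seek(target []byte) int {
    return sort.Search(len(s.keys), func(i int) bool {
        return bytes.Compare(s.keys[i], target) >= 0
    })
}

func (s *SortedKV) Get(key []byte) ([]byte, bool) {
    i := s.Seek(key)
    if i < len(s.keys) && bytes.Equal(s.keys[i], key) {
        return s.vals[i], true
    }
    return nil, false
}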
One way to reduce the update cost is to split the array into several smaller non-overlapping arrays — nested sorted arrays. This extension leads to the B+tree (a multi-level n-ary tree), with the additional challenge of maintaining these small arrays (tree nodes).

Another form of "updatable array" is the log-structured merge tree (LSM-tree). Updates are first buffered in a smaller array (or another sorting data structure), then merged into the main array when it becomes too large. The update cost is amortized by propagating smaller arrays into larger arrays.

2.4 B-tree

A B-tree is a balanced n-ary tree, comparable to balanced binary trees. Each node stores a variable number of keys (and branches) up to n, where n > 2.

Reducing random access with shorter trees

A disk can only perform a limited number of IOs per second (IOPS), which is the limiting factor for tree lookups. Each level of the tree is a disk read in a lookup, and n-ary trees are shorter than binary trees for the same number of keys (log_n N vs. log_2 N levels), so n-ary trees are used for fewer disk reads per lookup.

How is n chosen? There is a trade-off:

• A larger n means fewer disk reads per lookup (better latency and throughput).
• A larger n means larger nodes, which are slower to update (discussed later).

IO in the unit of pages

While you can read any number of bytes at any offset from a file, disks do not work that way. The basic unit of disk IO is not bytes but sectors, which are 512-byte contiguous blocks on old HDDs. However, disk sectors are not an application's concern, because regular file IOs do not interact directly with the disk: the OS caches/buffers disk reads/writes in the page cache, which consists of 4K-byte memory blocks called pages. Either way, there is a minimum unit of IO.

DBs can also define their own unit of IO (also called a page), which can be larger than an OS page. The minimum IO unit implies that tree nodes should be allocated in multiples of the unit; a half-used unit is half-wasted IO. Another reason against a small n!
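For illustration, a disk-based node can be a page-sized byte buffer (a sketch with assumed names; 4096 bytes matches the common OS page size but is a tunable choice, not a requirement):

// The DB's unit of IO: 1 node = 1 page.
const BTREE_PAGE_SIZE = 4096

// A node is a page-sized byte slice; its keys and child
// pointers are encoded within, and decoded on access.
type BNode []byte

func NewBNode() BNode {
    return BNode(make([]byte, BTREE_PAGE_SIZE))
}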
The B+tree variant

In the context of databases, B-tree usually means a variant called the B+tree. In a B+tree, internal nodes do not store values; values exist only in leaf nodes. This leads to a shorter tree, because internal nodes have more space for branches.

A B+tree also makes sense as an in-memory data structure, because the minimum IO unit between RAM and CPU caches is 64 bytes (a cache line). The performance benefit is not as great as on disk, because not much fits in 64 bytes.

Data structure space overhead

Another reason why binary trees are impractical is the number of pointers: each key has at least 1 incoming pointer from the parent node, whereas in a B+tree, multiple keys in a leaf node share 1 incoming pointer. Keys in a leaf node can also be packed in a compact format or compressed to further reduce space.

2.5 Log-structured storage

Update by merge: amortize the cost

The most common example of log-structured storage is the log-structured merge tree (LSM-tree). Its main idea is neither "log" nor "tree"; it's "merge"!

Let's start with 2 files: a small file holding the recent updates, and a large file holding the rest of the data. Updates go to the small file first, but it cannot grow forever; it will be merged into the large file when it reaches a size threshold.

writes => | new updates | => | accumulated data |
              file 1               file 2

Merging 2 sorted files results in a newer, larger file that replaces the old large file and shrinks the small file. Merging is O(N), but it can be done concurrently with readers and writers.

Reduce write amplification with multiple levels

Buffering updates is better than rewriting the whole dataset every time. What if we extend this scheme to multiple levels?
|level 1|
    ||
    \/
|------level 2------|
         ||
         \/
|-----------------level 3-----------------|

In the 2-level scheme, the large file is rewritten every time the small file reaches a threshold; the excess disk write is called write amplification, and it gets worse as the large file gets larger. If we use more levels, we can keep the 2nd level small by merging it into the 3rd level, similar to how we keep the 1st level small.

Intuitively, levels grow exponentially, and power-of-two growth (merging similarly sized levels) results in the least write amplification. But there is a trade-off between write amplification and the number of levels (query performance).

LSM-tree indexes

Each level contains indexing data structures, which could simply be a sorted array, since levels are never updated (except for the 1st level). But binary search is not much better than a binary tree in terms of random access, so a sensible choice is to use a B-tree inside a level; that's the "tree" part of the LSM-tree. In any case, the data structures are much simpler because of the lack of updates.

To better understand the idea of "merge", you can try to apply it to hashtables, a.k.a. log-structured hashtables.

LSM-tree queries

Keys can be in any level, so to query an LSM-tree, the results from each level are combined (an n-way merge for range queries). For point queries, Bloom filters can be used as an optimization to reduce the number of levels searched.

Since levels are never updated, there can be old versions of keys in older levels, and deleted keys are marked with a special flag in newer levels (called tombstones). Thus, newer levels have priority in queries. The merge process naturally reclaims space from old or deleted keys, which is why it's also called compaction.
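Here is a minimal sketch of the idea: a 2-way merge of two sorted levels, where newer entries win and tombstones are dropped (the Entry type and names are illustrative; dropping tombstones is only valid when merging into the oldest level):

type Entry struct {
    Key, Val string
    Deleted  bool // tombstone
}

// Merge a newer level into an older one; both are sorted by key.
// Dropping tombstones and old versions is how compaction reclaims space.
func merge(newer, older []Entry) []Entry {
    out := make([]Entry, 0, len(newer)+len(older))
    i, j := 0, 0
    for i < len(newer) || j < len(older) {
        var e Entry
        switch {
        case j == len(older) || (i < len(newer) && newer[i].Key < older[j].Key):
            e, i = newer[i], i+1
        case i == len(newer) || older[j].Key < newer[i].Key:
            e, j = older[j], j+1
        default: // same key: the newer version wins
            e, i, j = newer[i], i+1, j+1
        }
        if !e.Deleted {
            out = append(out, e)
        }
    }
    return out
}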
Real-world LSM-tree: SSTable, MemTable, and log

These are jargon terms for LSM-tree implementation details. You don't need them to build one from principles, but they do solve some real problems.

Levels are split into multiple non-overlapping files called SSTables, rather than one large file, so that merging can be done gradually. This reduces the free-space requirement when merging large levels, and the merging process is spread out over time.

The 1st level is updated directly, so a log becomes a viable choice, because the 1st level is bounded in size. This is the "log" part of the LSM-tree, an example of combining a log with other indexing data structures. But even if the log is small, a proper indexing data structure is still needed: the log data is duplicated in an in-memory index called a MemTable, which can be a B-tree, a skiplist, or whatever. It's a small, bounded amount of in-memory data, and has the added benefit of accelerating reads of recent updates.

2.6 Summary of indexing data structures

There are 2 options: the B+tree and the LSM-tree. The LSM-tree solves many of the challenges from the last chapter, such as how to update disk-based data structures and reuse space. These challenges remain for the B+tree, and will be explored later.
CHAPTER 03 B-Tree & Crash Recovery

3.1 B-tree as a balanced n-ary tree

Height-balanced tree

Many practical binary trees, such as the AVL tree[1] or the RB tree, are called height-balanced trees, meaning that the height of the tree (from root to leaves) is limited to O(log N), so a lookup is O(log N). A B-tree is also height-balanced; the height is the same for all leaf nodes.

Generalizing binary trees

n-ary trees can be generalized from binary trees (and vice versa). An example is the 2-3-4 tree, a B-tree where each node can have either 2, 3, or 4 children. The 2-3-4 tree is equivalent to the RB tree. However, we won't go into the details, because they are not necessary for understanding B-trees.

Here is a 2-level B+tree of the sorted sequence [1, 2, 3, 4, 6, 9, 11, 12]:

     [1,   4,   9]
     /     |     \
    v      v      v
[1, 2, 3] [4, 6] [9, 11, 12]

In a B+tree, only leaf nodes contain values; keys are duplicated in internal nodes to indicate the key range of each subtree. In this example, the node [1, 4, 9] indicates that its 3 subtrees are within the intervals [1, 4), [4, 9), and [9, +∞). However, only 2 keys are needed for 3 intervals, so the first key (1) can be omitted, and the 3 intervals become (-∞, 4), [4, 9), and [9, +∞).

3.2 B-tree as nested arrays

Two-level nested arrays

Without knowing the details of the RB tree or the 2-3-4 tree, the B-tree can be understood via sorted arrays. The problem with sorted arrays is the O(N) update. If we split the array into m smaller non-overlapping ones, the update cost becomes O(N/m). But we first have to find out which small array to update/query, so we need another sorted array of references to the smaller arrays; those references are the internal nodes of a B+tree.

[1] https://build-your-own.org/redis/10_avltree
[[1,2,3], [4,6], [9,11,12]]

The lookup cost is still O(log N) with 2 binary searches. If we choose m as √N, updates become O(√N); that's as good as 2-level sorted arrays can be.

Multiple levels of nested arrays

O(√N) is unacceptable for databases, but if we add more levels by splitting the arrays even further, the cost is reduced further. Say we keep splitting until all arrays are no larger than a constant s: we end up with log(N/s) levels, and the lookup cost is O(log(N/s) + log(s)), which is still O(log N).

For insertion and deletion, after finding the leaf node, updating the leaf node is a constant O(s) most of the time. The remaining problem is maintaining the invariants that nodes are not larger than s and are not empty.

3.3 Maintaining a B+tree

3 invariants to preserve when updating a B+tree:

1. Same height for all leaf nodes.
2. Node size is bounded by a constant.
3. Nodes are not empty.

Growing a B-tree by splitting nodes

The 2nd invariant is violated by inserting into a leaf node; it is restored by splitting the node into smaller ones:

    parent                 parent
   /  |  \       =>      /  |  |  \
 L1  L2*  L6           L1 L3* L4* L6

After splitting a leaf node, its parent node gets a new branch, which may also exceed the size limit, so it may need to be split as well. Node splitting can propagate to the root node, increasing the height by 1:

                         new_root
                          /    \
    root                 N1     N2
   /  |  \       =>     /  |   |  \
 L1  L2   L6          L1  L3  L4   L6

This preserves the 1st invariant, since all leaves gain height by 1 simultaneously.
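A minimal sketch of the split step on an in-memory keys slice (the half-split policy and names are illustrative simplifications, not the book's exact code):

// Split an oversized node into 2 halves.
// maxKeys is the size bound from invariant 2.
func splitNode(keys []string, maxKeys int) (left, right []string) {
    if len(keys) <= maxKeys {
        return keys, nil // within the size bound; no split
    }
    mid := len(keys) / 2
    // The parent replaces its old branch with 2 branches, keyed by
    // left[0] and right[0]; if the parent now exceeds the bound,
    // it splits too, possibly all the way up to the root.
    return keys[:mid], keys[mid:]
}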
Shrinking a B-tree by merging nodes

Deletion may result in empty nodes. The 3rd invariant is restored by merging empty nodes into a sibling node. Merging is the opposite of splitting; it can also propagate to the root node, so the tree height can decrease.

When coding a B-tree, merging can be done earlier to reduce wasted space: you can merge a non-empty node into a sibling when its size reaches a lower bound.

3.4 B-Tree on disk

You can already code an in-memory B-tree using these principles. But a B-tree on disk requires extra considerations.

Block-based allocation

One missing detail is how to limit node size. For an in-memory B+tree, you can limit the maximum number of keys in a node; the node size in bytes is not a concern, because you can allocate as many bytes as needed. For disk-based data structures, there are no malloc/free or garbage collectors to rely on; space allocation and reuse is entirely up to us. Space reuse can be done with a free list if all allocations are of the same size, which we'll implement later. For now, all B-tree nodes are the same size.

Copy-on-write B-tree for safe updates

We've seen 3 crash-resistant ways to update disk data: renaming files, logs, and LSM-trees. The lesson is not to destroy any old data during an update. This idea can be applied to trees: make a copy of the node and modify the copy instead.

Insertion or deletion starts at a leaf node; after making a copy with the modification, its parent node must be updated to point to the new node, which is also done on its own copy. The copying propagates to the root node, resulting in a new tree root.

• The original tree remains intact and is accessible from the old root.
• The new root, with the updated copies all the way down to the leaf, shares all other nodes with the original tree.

    d              d           D*
   / \            / \         /  \
  b   e    ==>   b   e    +  B*   e
 / \            / \         / \
a   c          a   c       a   C*
                 original      updated
This is a visualization of updating the leaf c. The copied nodes are in uppercase (D, B, C), while the shared subtrees are in lowercase (a, e).

This is called a copy-on-write data structure. It's also described as immutable, append-only (not literally), or persistent (not related to durability). Be aware that database jargon does not have consistent meanings.

2 more problems remain for the copy-on-write B-tree:

1. How to find the tree root, as it changes after each update? The crash-safety problem is reduced to a single pointer update, which we'll solve later.
2. How to reuse nodes from old versions? That's the job of the free list.

Copy-on-write B-tree advantages

One advantage of keeping old versions around is that we get snapshot isolation for free. A transaction starts with a version of the tree and won't see changes from other versions. And crash recovery is effortless: just use the last old version.

Another advantage is that it fits the multi-reader-single-writer concurrency model, and readers do not block the writer. We'll explore these later.

Alternative: in-place update with double-write

While crash recovery is obvious in copy-on-write data structures, they can be undesirable due to high write amplification: each update copies the whole path (O(log N) nodes), while most in-place updates touch only 1 leaf node. It's possible to do in-place updates with crash recovery, without copy-on-write:

1. Save a copy of all the nodes to be updated somewhere else. This is like copy-on-write, but without copying the parent node.
2. fsync the saved copies. (Can respond to the client at this point.)
3. Actually update the data structure in-place.
4. fsync the updates.

After a crash, the data structure may be half updated, but we don't really know. What we do is blindly apply the saved copies, so that the data structure ends up in the updated state, regardless of the current state.
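Returning to copy-on-write, here is a minimal in-memory sketch of the path-copying idea (a pointer-based tree purely for illustration; the book's disk-based nodes, key encoding, and splitting logic will differ):

type Node struct {
    keys []string
    kids []*Node // empty for leaf nodes
}

// Which subtree covers the key: the last child whose first key <= key.
// (Simplified; assumes keys[i] is the first key of subtree kids[i].)
func findChild(node *Node, key string) int {
    i := 0
    for i+1 < len(node.keys) && node.keys[i+1] <= key {
        i++
    }
    return i
}

// Copy-on-write insert: copy every node on the root-to-leaf path,
// share everything else, and return the new root. The old root
// still sees the old tree. Node splitting is omitted for brevity.
func insertCOW(node *Node, key string) *Node {
    copied := &Node{ // copy the node itself
        keys: append([]string(nil), node.keys...),
        kids: append([]*Node(nil), node.kids...),
    }
    if len(copied.kids) == 0 { // leaf: the actual update
        copied.keys = append(copied.keys, key) // (sorting omitted)
        return copied
    }
    i := findChild(copied, key)
    copied.kids[i] = insertCOW(copied.kids[i], key) // replace 1 branch only
    return copied
}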