Build Your Own Database From Scratch in Go
From B+Tree To SQL
James Smith
2024-06-04
00. Introduction
Master fundamentals by building your own DB
What to learn?
Code a database in 3000 LoC, incrementally
Learn by doing: principles instead of jargon
Topic 1: durability and atomicity
More than a data format
Durability and atomicity with `fsync`
Topic 2: indexing data structures
Control latency and cost with indexes
In-memory data structures vs. on-disk data structures
Topic 3: Relational DB on KV
Two layers of DB interfaces
Query languages: parsers and interpreters
Build Your Own X book series
01. From Files To Databases
1.1 Updating files in-place
1.2 Atomic renaming
Replacing data atomically by renaming files
Why does renaming work?
1.3 Append-only logs
Safe incremental updates with logs
Atomic log updates with checksums
1.4 `fsync` gotchas
1.5 Summary of database challenges
02. Indexing Data Structures
2.1 Types of queries
2.2 Hashtables
2.3 Sorted arrays
2.4 B-tree
Reducing random access with shorter trees
IO in the unit of pages
The B+tree variant
Data structure space overhead
2.5 Log-structured storage
Update by merge: amortize cost
Reduce write amplification with multiple levels
LSM-tree indexes
LSM-tree queries
Real-world LSM-tree: SSTable, MemTable and log
2.6 Summary of indexing data structures
03. B-Tree & Crash Recovery
3.1 B-tree as a balanced n-ary tree
Height-balanced tree
Generalizing binary trees
3.2 B-tree as nested arrays
Two-level nested arrays
Multiple levels of nested arrays
3.3 Maintaining a B+tree
Growing a B-tree by splitting nodes
Shrinking a B-tree by merging nodes
3.4 B-Tree on disk
Block-based allocation
Copy-on-write B-tree for safe updates
Copy-on-write B-tree advantages
Alternative: In-place update with double-write
The crash recovery principle
3.5 What we learned
04. B+Tree Node and Insertion
4.1 Design B+tree nodes
What we will do
The node format
Simplifications and limits
In-memory data types
Decouple data structure from IO
4.2 Decode the node format
Header
Child pointers
KV offsets and pairs
KV lookups within a node
4.3 Update B+tree nodes
Insert into leaf nodes
Node copying functions
Update internal nodes
4.4 Split B+tree nodes
4.5 B+tree insertion
4.6 What’s next?
05. B+Tree Deletion and Testing
5.1 High-level interfaces
Keep the root node
Sentinel value
5.2 Merge nodes
Node update functions
Merge conditions
5.3 B+tree deletion
5.4 Test the B+tree
06. Append-Only KV Store
6.1 What we will do
6.2 Two-phase update
Atomicity + durability
Alternative: durability with a log
Concurrency of in-memory data
6.3 Database on a file
The file layout
`fsync` on directory
`mmap`, page cache and IO
6.4 Manage disk pages
Invoke `mmap`
`mmap` a growing file
Capture page updates
6.5 The meta page
Read the meta page
Update the meta page
6.6 Error handling
Scenarios after IO errors
Revert to the previous version
Recover from temporary write errors
6.7 Summary of the append-only KV store
07. Free List: Recycle & Reuse
7.1 Memory management techniques
What we will do
List of unused objects
Embedded linked list
External list
7.2 Linked list on disk
Free list requirements
Free list disk layout
Update free list nodes
7.3 Free list implementation
Free list interface
Free list data structure
Consuming from the free list
Pushing into the free list
7.4 KV with a free list
Page management
Update the meta page
7.5 Conclusion of the KV store
08. Tables on KV
8.1 Encode rows as KVs
Indexed queries: point and range
The primary key as the “key”
The secondary indexes as separate tables
Alternative: auto-generated row ID
8.2 Database schemas
The table prefix
Data types
Records
Schemas
Internal tables
8.3 Get, update, insert, delete, create
Point query and update interfaces
Query by primary key
Read the schema
Insert or update a row
Create a table
8.4 Conclusion of tables on KV
09. Range Queries
9.1 B+tree iterator
The iterator interface
Navigate a tree
Seek to a key
9.2 Order-preserving encoding
Sort arbitrary data as byte strings
Numbers
Strings
Tuples
9.3 Range query
9.4 What we learned
10. Secondary Indexes
10.1 Secondary indexes as extra keys
Table schema
KV structures
10.2 Using secondary indexes
Select an index by matching columns
Encode missing columns as infinity
10.3 Maintaining secondary indexes
Sync with the primary data
Atomicity of multi-key updates
10.4 Summary of tables and indexes on KV
11. Atomic Transactions
11.1 The all-or-nothing effect
Commit and rollback
Atomicity via copy-on-write
Alternative: atomicity via logging
11.2 Transactional interfaces
Move tree operations to transactions
Transactional table operations
11.3 Optional optimizations
Reduce copying on multi-key updates
Range delete
Compress common prefixes
12. Concurrency Control
12.1 Levels of concurrency
The problem: interleaved readers and writers
Readers-writer lock (RWLock)
Read-copy-update (RCU)
Optimistic concurrency control
Alternative: pessimistic concurrency control
Comparison of concurrency controls
12.2 Snapshot isolation for readers
Capture local updates
Read back your own write
Version numbers in the free list
12.3 Handle conflicts for writers
Detect conflicts with history
Serialize internal data structures
13. SQL Parser
13.1 Syntax, parser, and interpreter
Tree representation of computer languages
Evaluate by visiting tree nodes
13.2 Query language specification
Statements
Conditions
Expressions
13.3 Recursive descent
Tree node structures
Split the input into smaller parts
Convert infix operators into a binary tree
Operator precedence with recursion
14. Query Language
14.1 Expression evaluation
14.2 Range queries
Set up a range query
Revisit the infinity encoding
14.3 Results iterator
Iterators all the way down
Transform data with iterators
14.4 Conclusions and next steps
00. Introduction
What to learn?
Complex systems like databases are built on a few simple principles.
LoC     Step
366     B+tree data structure.
601     Append-only KV.
731     Practical KV with a free list.
1107    Tables on KV.
1294    Range queries.
1438    Secondary indexes.
1461    Transactional interfaces.
1702    Concurrency control.
2795    SQL-like query language.
The first thing to learn is the fsync syscall. A file write doesn’t reach
the disk synchronously; there are multiple levels of buffering (the OS
page cache and on-device RAM). fsync flushes pending data and waits
until it’s done. This makes writes durable, but what about atomicity?
One of the problems is updating disk data in-place, because you have
to deal with corrupted states after a crash. Disks are not just slower
RAM.
Another much simpler interface is key-value (KV). You can get, set,
and delete a single key, and most importantly, list a range of keys in
sorted order. KV is simpler than SQL because it’s one layer lower.
Relational DBs are built on top of KV-like interfaces called storage
engines.
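To make that concrete, here is a minimal sketch of such a KV interface. The names are illustrative; the actual interfaces are developed in later chapters.

// a sketch of a KV-style storage interface; names are illustrative
type Iterator interface {
	Valid() bool
	Next()
	Key() []byte
	Val() []byte
}
type KVStore interface {
	Get(key []byte) (val []byte, ok bool)
	Set(key []byte, val []byte)
	Del(key []byte) bool
	// list keys in [start, end) in sorted order
	Scan(start []byte, end []byte) Iterator
}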
https://build-your-own.org
01. From Files To Databases
Let’s start with files, and examine the challenges we face.
func SaveData1(path string, data []byte) error {
	fp, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o664)
	if err != nil {
		return err
	}
	defer fp.Close()
	_, err = fp.Write(data)
	if err != nil {
		return err
	}
	return fp.Sync() // fsync
}
This code creates the file if it does not exist, or truncates the existing
one before writing the content. And most importantly, the data is not
persistent unless you call fsync (fp.Sync() in Go).
1. If the update is interrupted, you can recover from the old file
since it remains intact.
2. Concurrent readers won’t get half-written data.
The problem is how readers will find the new file. A common pattern
is to rename the new file to the old file path.
func SaveData2(path string, data []byte) error {
	tmp := path + ".tmp" // write to a temporary file in the same directory
	fp, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o664)
	if err != nil {
		return err
	}
	defer fp.Close()
	_, err = fp.Write(data)
	if err != nil {
		return err
	}
	err = fp.Sync() // fsync
	if err != nil {
		return err
	}
	return os.Rename(tmp, path) // atomically replace the old file
}
On Linux, the replaced old file may still exist if it’s still open in a
reader; it’s just not accessible from a file name. Readers can safely
work on whatever version of the data they got, while writers won’t be
blocked by readers. However, there must be a way to prevent
concurrent writers. The level of concurrency is multi-reader-single-
writer, which is what we will implement.
The reader must consider all log entries when using the log. For
example, here is a log-based KV with 4 entries:
    0          1          2         3
| set a=1 | set b=2 | set a=3 | del b |
It’s not an indexing data structure; readers must read all entries.
It has no way to reclaim space from deleted data.
So logs alone are not enough to build a DB; they must be combined
with indexing data structures.
1. The last append simply does not happen; the log is still good.
2. The last entry is half written.
3. The size of the log is increased but the last entry is not there.
The way to deal with these cases is to add a checksum to each log
entry. If the checksum is wrong, the update did not happen, making
log updates atomic (w.r.t. both readers and durability).
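For illustration (this is not the book’s code), an appender might prefix each entry with its length and a CRC32 checksum; on recovery, an entry with a bad checksum is treated as if the append never happened.

// append one entry: | length (4B) | crc32 (4B) | payload |
func appendEntry(fp *os.File, payload []byte) error {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint32(hdr[4:8], crc32.ChecksumIEEE(payload))
	if _, err := fp.Write(hdr[:]); err != nil {
		return err
	}
	if _, err := fp.Write(payload); err != nil {
		return err
	}
	return fp.Sync() // make the entry durable
}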
Another issue with fsync is error handling. If fsync fails, the DB update
fails, but what if you read the file afterwards? You may get the new
data even if fsync failed (because of the OS page cache)! This behavior
is filesystem dependent.
One way to reduce the update cost is to split the array into several
smaller non-overlapping arrays — nested sorted arrays. This
extension leads to the B+tree (a multi-level n-ary tree), with the
additional challenge of maintaining these small arrays (tree nodes).
Another form of “updatable array” is the log-structured merge tree
(LSM-tree). Updates are first buffered in a smaller array (or another
sorted data structure), then merged into the main array when it
becomes too large. The update cost is amortized by propagating
smaller arrays into larger arrays.
2.4 B-tree
A B-tree is a balanced n-ary tree, comparable to balanced binary
trees. Each node stores a variable number of keys (and branches) up
to n, where n > 2.
Larger n means fewer disk reads per lookup (better latency and
throughput).
Larger n means larger nodes, which are slower to update
(discussed later).
IO in the unit of pages
While you can read any number of bytes at any offset from a file,
disks do not work that way. The basic unit of disk IO is not bytes, but
sectors, which are 512-byte contiguous blocks on old HDDs.
Either way, there is a minimum unit of IO. DBs can also define their
own unit of IO (also called a page), which can be larger than an OS
page.
Let’s start with 2 files: a small file holding the recent updates, and a
large file holding the rest of the data. Updates go to the small file
first, but it cannot grow forever; it will be merged into the large file
when it reaches a threshold.
Merging 2 sorted files results in a newer, larger file that replaces the
old large file and shrinks the small file.
|level 1|
||
\/
|------level 2------|
||
\/
|-----------------level 3-----------------|
In the 2-level scheme, the large file is rewritten every time the small
file reaches a threshold; the excess disk write is called write
amplification, and it gets worse as the large file gets larger. If we use
more levels, we can keep the 2nd level small by merging it into the
3rd level, similar to how we keep the 1st level small.
LSM-tree indexes
Each level contains indexing data structures, which could simply be a
sorted array, since levels are never updated (except for the 1st level).
But binary search is not much better than a binary tree in terms of
random access, so a sensible choice is to use a B-tree inside a level;
that’s the “tree” part of the LSM-tree. Anyway, the data structures are
much simpler because of the lack of updates.
To better understand the idea of “merge”, you can try to apply it to
hashtables, a.k.a. log-structured hashtables.
LSM-tree queries
Keys can be in any level, so to query an LSM-tree, the results from
each level are combined (an n-way merge for range queries).
Since levels are never updated, there can be old versions of keys in
older levels, and deleted keys are marked with a special flag in newer
levels (called tombstones). Thus, newer levels have priority in
queries.
The merge process naturally reclaims space from old or deleted keys.
Thus, it’s also called compaction.
But even if the log is small, a proper indexing data structure is still
needed. The log data is duplicated in an in-memory index called
MemTable, which can be a B-tree, skiplist, or whatever. It’s a small,
bounded amount of in-memory data, and has the added benefit of
accelerating the read-the-recent-updates scenario.
The LSM-tree solves many of the challenges from the last chapter,
such as how to update disk-based data structures and reuse space.
These challenges remain for the B+tree, which will be explored later.
03. B-Tree & Crash Recovery
Height-balanced tree
Many practical binary trees, such as the AVL tree or the RB tree, are
called height-balanced trees, meaning that the height of the tree
(from root to leaves) is limited to O(log N), so a lookup is O(log N).
A B-tree is also height-balanced; the height is the same for all leaf
nodes.
The problem with sorted arrays is the O(N) update. If we split the
array into m smaller non-overlapping ones, the update becomes
O(N/m). But we have to find out which small array to update/query
first. So we need another sorted array of references to the smaller
arrays; that’s the internal nodes in a B+tree.
Let’s say we keep splitting levels until all arrays are no larger than a
constant s; we end up with log(N/s) levels, and the lookup cost is
O(log(N/s) + log(s)), which is still O(log N).
For insertion and deletion, after finding the leaf node, updating the
leaf node is constant O(s) most of the time. The remaining problem
is to maintain the invariants that nodes are not larger than s and are
not empty.
   parent                    parent
   /  |  \        =>        /  |  |  \
  L1  L2  L6               L1  L3  L4  L6
       *                        *   *
After splitting a leaf node, its parent node gets a new branch, which
may also exceed the size limit, so it may need to be split as well. Node
splitting can propagate to the root node, increasing the height by 1.
                            new_root
                             /    \
    root                    N1     N2
   /  |  \       =>        /  |   |  \
  L1  L2  L6              L1  L3  L4  L6
This preserves the 1st invariant (all leaf nodes are at the same
height), since all leaves gain height by 1 simultaneously.
Space reuse can be done with a free list if all allocations are of the
same size, which we’ll implement later. For now, all B-tree nodes are
the same size.
The original tree remains intact and is accessible from the old
root.
The new root, with the updated copies all the way to the leaf,
shares all other nodes with the original tree.
      d                d                 D*
     / \              / \               /  \
    b   e     ==>    b   e      +      B*   e
   / \              / \               /  \
  a   c            a   c             a    C*
                    original           updated
1. How to find the tree root, as it changes after each update? The
crash safety problem is reduced to a single pointer update, which
we’ll solve later.
2. How to reuse nodes from old versions? That’s the job of a free
list.
And crash recovery is effortless; just use the last old version.
After a crash, the data structure may be half updated, but we don’t
really know. What we do is blindly apply the saved copies, so that the
data structure ends with the updated state, regardless of the current
state.
| a=1 b=2 |
     ||        1. Save a copy of the entire updated nodes.
     \/
| a=1 b=2 |  +  | a=2 b=4 |
    data          updated copy
     ||        2. fsync the saved copies.
     \/
| a=1 b=2 |  +  | a=2 b=4 |
    data          updated copy (fsync'ed)
     ||        3. Update the data structure in-place. But we crashed here!
     \/
| ??????? |  +  | a=2 b=4 |
  data (bad)      updated copy (good)
     ||        Recovery: apply the saved copy.
     \/
| a=2 b=4 |  +  | a=2 b=4 |
  data (new)      useless now
What if we save the original nodes instead of the updated nodes with
double-write? That’s the 3rd way to recover from corruption, and it
recovers to the old version like copy-on-write. We can combine the 3
ways into 1 idea: there is enough information for either the
old state or the new state at any point.
We’ll use copy-on-write because it’s simpler, but you can deviate
here.
What we will do
The first big step is just the B+tree data structure; other DB concerns
will be covered in later chapters. We’ll do it from the bottom up.
The same format is used for both leaf nodes and internal nodes. This
wastes some space: leaf nodes don’t need pointers and internal nodes
don’t need values.
const HEADER = 4

const BTREE_PAGE_SIZE = 4096
const BTREE_MAX_KEY_SIZE = 1000
const BTREE_MAX_VAL_SIZE = 3000

func init() {
	node1max := HEADER + 8 + 2 + 4 + BTREE_MAX_KEY_SIZE + BTREE_MAX_VAL_SIZE
	assert(node1max <= BTREE_PAGE_SIZE) // maximum KV
}
The key size limit also ensures that an internal node can always host
2 keys.
Child pointers
// pointers
func (node BNode) getPtr(idx uint16) uint64 {
	assert(idx < node.nkeys())
	pos := HEADER + 8*idx
	return binary.LittleEndian.Uint64(node[pos:])
}
func (node BNode) setPtr(idx uint16, val uint64)

// offset list
func offsetPos(node BNode, idx uint16) uint16 {
	assert(1 <= idx && idx <= node.nkeys())
	return HEADER + 8*node.nkeys() + 2*(idx-1)
}
func (node BNode) getOffset(idx uint16) uint16 {
	if idx == 0 {
		return 0
	}
	return binary.LittleEndian.Uint16(node[offsetPos(node, idx):])
}
func (node BNode) setOffset(idx uint16, offset uint16)

// key-values
func (node BNode) kvPos(idx uint16) uint16 {
	assert(idx <= node.nkeys())
	return HEADER + 8*node.nkeys() + 2*node.nkeys() + node.getOffset(idx)
}
func (node BNode) getKey(idx uint16) []byte {
	assert(idx < node.nkeys())
	pos := node.kvPos(idx)
	klen := binary.LittleEndian.Uint16(node[pos:])
	return node[pos+4:][:klen]
}
func (node BNode) getVal(idx uint16) []byte
It also conveniently returns the node size (used space) with an off-
by-one lookup.
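For example, the node size helper is a one-liner on top of it:

// node size in bytes (used space), via the off-by-one offset lookup
func (node BNode) nbytes() uint16 {
	return node.kvPos(node.nkeys())
}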
// returns the first kid node whose range intersects the key. (kid[i] <= key)
// TODO: binary search
func nodeLookupLE(node BNode, key []byte) uint16 {
	nkeys := node.nkeys()
	found := uint16(0)
	// the first key is a copy from the parent node,
	// thus it's always less than or equal to the key.
	for i := uint16(1); i < nkeys; i++ {
		cmp := bytes.Compare(node.getKey(i), key)
		if cmp <= 0 {
			found = i
		}
		if cmp >= 0 {
			break
		}
	}
	return found
}
// copy multiple KVs into the position from the old node
func nodeAppendRange(
	new BNode, old BNode,
	dstNew uint16, srcOld uint16, n uint16,
)
Note that the tree.new callback is used to allocate the child nodes.
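As an example of how the copying functions compose, inserting a key into a leaf can be written like this (a sketch; setHeader, nodeAppendKV, and the BNODE_LEAF constant come from the node-format sections omitted here):

// add a new key to a leaf node by copying KVs from the old node
func leafInsert(
	new BNode, old BNode, idx uint16, key []byte, val []byte,
) {
	new.setHeader(BNODE_LEAF, old.nkeys()+1)
	nodeAppendRange(new, old, 0, 0, idx)                   // keys before idx
	nodeAppendKV(new, idx, 0, key, val)                    // the new key (pointer is 0)
	nodeAppendRange(new, old, idx+1, idx, old.nkeys()-idx) // keys after idx
}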
4.3 Split B+tree nodes
Due to the size limits we imposed, a node can host at least 1 KV pair.
In the worst case, an oversized node will be split into 3 nodes, with a
large KV in the middle. So we may have to split it 2 times.
// split an oversized node into 2 so that the 2nd node always fits on a page
func nodeSplit2(left BNode, right BNode, old BNode) {
	// code omitted...
}

// split a node if it's too big. the results are 1~3 nodes.
func nodeSplit3(old BNode) (uint16, [3]BNode) {
	if old.nbytes() <= BTREE_PAGE_SIZE {
		old = old[:BTREE_PAGE_SIZE]
		return 1, [3]BNode{old} // not split
	}
	left := BNode(make([]byte, 2*BTREE_PAGE_SIZE)) // might be split later
	right := BNode(make([]byte, BTREE_PAGE_SIZE))
	nodeSplit2(left, right, old)
	if left.nbytes() <= BTREE_PAGE_SIZE {
		left = left[:BTREE_PAGE_SIZE]
		return 2, [3]BNode{left, right} // 2 nodes
	}
	leftleft := BNode(make([]byte, BTREE_PAGE_SIZE))
	middle := BNode(make([]byte, BTREE_PAGE_SIZE))
	nodeSplit2(leftleft, middle, left)
	assert(leftleft.nbytes() <= BTREE_PAGE_SIZE)
	return 3, [3]BNode{leftleft, middle, right} // 3 nodes
}
Note that the returned nodes are allocated from memory; they are
just temporary data until nodeReplaceKidN actually allocates them.
4.4 B+tree insertion
We’ve implemented the individual node operations. Let’s put them
together for a full B+tree insertion, which starts with a key lookup in
the root node and descends until it reaches a leaf.
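A sketch of that recursion (leafUpdate, leafInsert, nodeInsert, and the node type constants are from the sections omitted here):

// insert a KV into a node; the result might be oversized and split later.
func treeInsert(tree *BTree, node BNode, key []byte, val []byte) BNode {
	// the result node, allowed to be oversized before it's split
	new := BNode(make([]byte, 2*BTREE_PAGE_SIZE))
	// where to insert the key?
	idx := nodeLookupLE(node, key)
	switch node.btype() {
	case BNODE_LEAF: // leaf node
		if bytes.Equal(key, node.getKey(idx)) {
			leafUpdate(new, node, idx, key, val) // found, update it
		} else {
			leafInsert(new, node, idx+1, key, val) // insert after the position
		}
	case BNODE_NODE: // internal node, walk into the child node
		nodeInsert(tree, new, node, idx, key, val)
	default:
		panic("bad node!")
	}
	return new
}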
Most of the details are introduced with the tree insertion, so there’s
not much more to learn from the deletion. Skip this chapter if you
know the principle.
Sentinel value
There is a trick when creating the first root: we inserted an empty
key. This is called a sentinel value; it’s used to remove an edge case.
If you examine the lookup function nodeLookupLE, you’ll see that it won’t
work if the key is out of the node range. This is fixed by inserting an
empty key into the tree, which is the lowest possible key by sort
order, so that nodeLookupLE will always find a position.
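A sketch of how the first root is created with the sentinel (setHeader and nodeAppendKV are from chapter 04):

func (tree *BTree) Insert(key []byte, val []byte) {
	if tree.root == 0 {
		// create the first node
		root := BNode(make([]byte, BTREE_PAGE_SIZE))
		root.setHeader(BNODE_LEAF, 2)
		// the sentinel: an empty key covers the whole key space,
		// so nodeLookupLE always finds a position.
		nodeAppendKV(root, 0, 0, nil, nil)
		nodeAppendKV(root, 1, 0, key, val)
		tree.root = tree.new(root)
		return
	}
	// ... the normal insertion path (omitted)
}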
5.2 Merge nodes
Merge conditions
Deleting may result in empty nodes, which can be merged with a
sibling if it has one. shouldMerge returns which sibling (left or right) to
merge with.
// should the updated kid be merged with a sibling?
func shouldMerge(
	tree *BTree, node BNode, idx uint16, updated BNode,
) (int, BNode) {
	if updated.nbytes() > BTREE_PAGE_SIZE/4 {
		return 0, BNode{}
	}
	if idx > 0 {
		sibling := BNode(tree.get(node.getPtr(idx - 1)))
		merged := sibling.nbytes() + updated.nbytes() - HEADER
		if merged <= BTREE_PAGE_SIZE {
			return -1, sibling // left
		}
	}
	if idx+1 < node.nkeys() {
		sibling := BNode(tree.get(node.getPtr(idx + 1)))
		merged := sibling.nbytes() + updated.nbytes() - HEADER
		if merged <= BTREE_PAGE_SIZE {
			return +1, sibling // right
		}
	}
	return 0, BNode{}
}
Deleted keys mean unused space within nodes. In the worst case, a
mostly empty tree can still retain a large number of nodes. We can
improve this by triggering merges earlier — using 1/4 of a page as a
threshold instead of the empty node, which is a soft limit on the
minimum node size.
type C struct {
	tree  BTree
	ref   map[string]string // the reference data
	pages map[uint64]BNode  // in-memory pages
}

func newC() *C {
	pages := map[uint64]BNode{}
	return &C{
		tree: BTree{
			get: func(ptr uint64) []byte {
				node, ok := pages[ptr]
				assert(ok)
				return node
			},
			new: func(node []byte) uint64 {
				assert(BNode(node).nbytes() <= BTREE_PAGE_SIZE)
				ptr := uint64(uintptr(unsafe.Pointer(&node[0])))
				assert(pages[ptr] == nil)
				pages[ptr] = node
				return ptr
			},
			del: func(ptr uint64) {
				assert(pages[ptr] != nil)
				delete(pages, ptr)
			},
		},
		ref:   map[string]string{},
		pages: pages,
	}
}
The test cases are left as an exercise. The next thing is B+tree on
disk.
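As a starting point, helper methods that keep the reference map in sync with the tree make property-based tests straightforward (a sketch, assuming the Insert and Delete methods on BTree):

func (c *C) add(key string, val string) {
	c.tree.Insert([]byte(key), []byte(val))
	c.ref[key] = val // keep the reference data in sync
}
func (c *C) del(key string) bool {
	delete(c.ref, key)
	return c.tree.Delete([]byte(key))
}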
06. Append-Only KV Store
type KV struct {
	Path string // file name
	// internals
	fd   int
	tree BTree
	// more ...
}
func (db *KV) Open() error
We’ll implement the 3 B+tree callbacks that deal with disk pages:
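For reference, these are the callback fields on the BTree type, matching the in-memory test harness from chapter 05:

type BTree struct {
	root uint64 // pointer (a nonzero page number)
	// callbacks for managing on-disk pages
	get func(uint64) []byte // read a page
	new func([]byte) uint64 // allocate a new page with the data
	del func(uint64)        // deallocate a page
}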
Atomicity + durability
As discussed in chapter 03, for a copy-on-write tree, the root pointer
is updated atomically. Then fsync is used to request and confirm
durability.
We won’t use a log as copy-on-write doesn’t need it. But a log still
offers the benefits discussed above; it’s one of the reasons logs are
ubiquitous in databases.
Concurrency of in-memory data
Atomicity for in-memory data (w.r.t. concurrency) can be achieved
with a mutex (lock) or some atomic CPU instructions. There is a
similar problem: memory reads/writes may not appear in order due
to factors like out-of-order execution.
`fsync` on directory
As mentioned in chapter 01, fsync must be used on the parent
directory after a rename. This is also true when creating new files,
because there are 2 things to be made persistent: the file data, and
the directory that references the file.
The directory fd can be used by openat to open the target file, which
guarantees that the file is from the same directory we opened before,
in case the directory path is replaced in between (a race condition),
although this is not our concern since we don’t expect multi-process
operations.
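In Go, fsync on a directory is just opening the directory and calling Sync on it (a small sketch):

// fsync the directory so that a newly created or renamed file is durable
func fsyncDir(dir string) error {
	fp, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer fp.Close()
	return fp.Sync() // fsync the directory fd
}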
func Mmap(fd int, offset int64, length int, ...) (data []byte, err error)
1. The CPU triggers a page fault, which hands control to the OS.
2. The OS then …
1. Reads the swapped data into physical memory.
2. Remaps the virtual address to it.
3. Hands control back to the process.
3. The process resumes with the virtual address mapped to real
RAM.
mmap works in a similar way: the process gets an address range from
mmap, and when it touches a page in that range, it page-faults, and the
OS reads the data into the page cache and remaps the page to it.
That’s the automatic IO in a read-only scenario.
The CPU also takes note (called a dirty bit) when the process
modifies a page so the OS can write the page back to disk later. fsync
is used to request and wait for the IO. This is writing data via mmap, it
is not very different from write on Linux because write goes to the
same page cache.
You don’t have to mmap, but it’s important to understand the basics.
Invoke `mmap`
A file-backed mmap can be either read-only, read-write, or copy-on-
write. To create a read-only mmap, use the PROT_READ and MAP_SHARED flags.
type KV struct {
	// ...
	mmap struct {
		total  int      // mmap size, can be larger than the file size
		chunks [][]byte // multiple mmaps, can be non-continuous
	}
}
Adding a new mapping each time the file is expanded results in lots
of mappings, which is bad for performance because the OS has to
keep track of them. This is avoided with exponential growth, since
mmap can go beyond the file size.
func extendMmap(db *KV, size int) error {
	if size <= db.mmap.total {
		return nil // enough range
	}
	alloc := max(db.mmap.total, 64<<20) // double the current address space
	for db.mmap.total+alloc < size {
		alloc *= 2 // still not enough?
	}
	chunk, err := syscall.Mmap(
		db.fd, int64(db.mmap.total), alloc,
		syscall.PROT_READ, syscall.MAP_SHARED, // read-only
	)
	if err != nil {
		return fmt.Errorf("mmap: %w", err)
	}
	db.mmap.total += alloc
	db.mmap.chunks = append(db.mmap.chunks, chunk)
	return nil
}
You may wonder why not just create a very large mapping (say, 1TB)
and forget about the growing file, since an unrealized virtual address
costs nothing. This is OK for a toy DB on 64-bit systems.
type KV struct {
	// ...
	page struct {
		flushed uint64   // database size in number of pages
		temp    [][]byte // newly allocated pages
	}
}
If fsync fails on the meta page, the meta page on disk can be either the
new or the old version, while the in-memory tree root is the old
version. So the 2nd successful update will overwrite the data pages of
the newer version, which can be left in a corrupted intermediate state
if crashed.
type KV struct {
	// ...
	failed bool // Did the last update fail?
}
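A sketch of how the flag might be used; updateFile and loadMeta stand in for the 2-phase-update and meta-page helpers omitted here:

func updateOrRevert(db *KV, meta []byte) error {
	// ensure the on-disk meta page matches the in-memory tree after an error
	if db.failed {
		// (rewrite and fsync the previous meta page here)
		db.failed = false
	}
	err := updateFile(db) // the 2-phase update
	if err != nil {
		// the on-disk meta page is now in an unknown state; mark it
		db.failed = true
		// the in-memory state is reverted immediately to allow reads
		loadMeta(db, meta)
	}
	return err
}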
B+tree on disk is a major step. We just have to add a free list to make
it practical.
07. Free List: Recycle & Reuse
The last step of the KV store is to reuse deleted pages, which is also a
problem for in-memory data structures.
What we will do
Memory (space) management can be either manual or automatic. A
garbage collector is automatic; it detects unused objects without any
help from the programmer. The next problem is how to deal with
(reuse) unused objects.
head
↓
[ next | space... ] (unused object 1)
↓
[ next | space... ] (unused object 2)
↓
...
External list
The other scheme is to store pointers to unused pages in an external
data structure. The external data structure itself takes up space,
which is a problem we’ll solve.
Let’s say our free list is just a log of unused page numbers; adding
items is just appending. The problem is how to remove items so that
it doesn’t grow infinitely.
7.2 Linked list on disk
If items are removed from the beginning, how do you reclaim the
space from the removed items? We’re back to the original problem.
To solve the problem, the free list should also be page-based, so that
it can manage itself. A page-based list is just a linked list, except that
a page can hold multiple items, like a B+tree node. This is also called
an unrolled linked list.
In summary:
// node format:
// | next | pointers | unused |
// | 8B | n*8B | ... |
type LNode []byte
const FREE_LIST_HEADER = 8
const FREE_LIST_CAP = (BTREE_PAGE_SIZE - FREE_LIST_HEADER) / 8
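Getters and setters for this node format follow the same style as BNode (a sketch):

func (node LNode) getNext() uint64 {
	return binary.LittleEndian.Uint64(node[0:8])
}
func (node LNode) setNext(next uint64) {
	binary.LittleEndian.PutUint64(node[0:8], next)
}
func (node LNode) getPtr(idx int) uint64 {
	return binary.LittleEndian.Uint64(node[FREE_LIST_HEADER+8*idx:])
}
func (node LNode) setPtr(idx int, ptr uint64) {
	binary.LittleEndian.PutUint64(node[FREE_LIST_HEADER+8*idx:], ptr)
}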
We also store pointers to both the head node and the tail node in the
meta page. The pointer to the tail node is needed for O(1) insertion.
                      first_item
                      ↓
head_page -> [ next | xxxxx ]
                ↓
              [ next | xxxxxxxx ]
                ↓
tail_page -> [ NULL | xxxx ]
                         ↑
                         last_item
Following this analysis, the embedded list can also work if the next
pointer is reserved in the B+tree node, although this doubles write
amplification. Here you can deviate from the book.
headSeq and tailSeq are indexes into the head and tail nodes, except
that they are monotonically increasing. So the wrapped-around index
is computed as follows:
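A minimal definition, consistent with FREE_LIST_CAP above:

func seq2idx(seq uint64) int {
	return int(seq % FREE_LIST_CAP)
}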
During an update, the list is both added to and removed from, and
when we remove from the head, we cannot remove what we just
added to the tail. So we need to …
// remove 1 item from the head node, and remove the head node if empty.
func flPop(fl *FreeList) (ptr uint64, head uint64) {
	if fl.headSeq == fl.maxSeq {
		return 0, 0 // cannot advance
	}
	node := LNode(fl.get(fl.headPage))
	ptr = node.getPtr(seq2idx(fl.headSeq)) // item
	fl.headSeq++
	// move to the next one if the head node is empty
	if seq2idx(fl.headSeq) == 0 {
		head, fl.headPage = fl.headPage, node.getNext()
		assert(fl.headPage != 0)
	}
	return
}
The free list self-manages; the removed head node is fed back to
itself.
What if the last node is removed? A linked list with 0 nodes implies
nasty special cases. In practice, it’s easier to design the linked list
to have at least 1 node than to deal with special cases. That’s why
we assert(fl.headPage != 0).
Pushing into the free list
Appending an item to the tail node is simply advancing tailSeq. And
when the tail node is full, we immediately add a new empty tail node
to ensure that there is at least 1 node in case the previous tail node is
removed as a head node.
Again, the free list is self-managing: it will try to get a node from
itself for the new tail node before resorting to appending.
7.4 KV with a free list
Page management
Now that pages can be reused, they are overwritten in place, so a map
is used to capture pending updates.
type KV struct {
	// ...
	page struct {
		flushed uint64            // database size in number of pages
		nappend uint64            // number of pages to be appended
		updates map[uint64][]byte // pending updates, including appended pages
	}
}
Another change is that we may read a page again after it has been
updated, so KV.pageRead should consult the pending updates map first.
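A sketch of that; mmapRead stands in for the chunked mmap lookup described earlier:

// read a page, consulting the pending updates first
func (db *KV) pageRead(ptr uint64) []byte {
	if node, ok := db.page.updates[ptr]; ok {
		return node // an updated or appended page not yet written to disk
	}
	return mmapRead(db, ptr) // read from the mmap'ed file
}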
Remember that the free list always contains at least 1 node; we’ll
assign an empty node to it when initializing an empty DB.
That’s enough for a KV store with get, set, del. But there is more in
part II:
Relational DB on KV store.
Concurrent transactions.
08. Tables on KV
create table t1 (
k1 string,
k2 int,
v1 string,
v2 string,
primary key (k1, k2)
);
      key       value
t1    k1, k2    v1, v2
Some DBs allow tables without a primary key; what they do is add a
hidden, auto-generated primary key.
create table t1 (
k1 string,
k2 int,
v1 string,
v2 string,
primary key (k1, k2),
index idx1 (v1),
index idx2 (v2, v1)
);
This adds extra keys that map to the unique row identifier (the
primary key).
        key        value
t1      k1, k2     v1, v2
idx1    v1         k1, k2
idx2    v2, v1     k1, k2
The primary key is also an index, but with the unique constraint.
Alternative: auto-generated row ID
Some DBs use an auto-generated ID as the “true” primary key, as
opposed to the user-selected primary key. In this case, there is no
distinction between primary and secondary indexes; the user primary
key is also an indirection.
               key        value
t1             ID         k1, k2, v1, v2
primary key    k1, k2     ID
idx1           v1         ID
idx2           v2, v1     ID
For ID keys, internal nodes can store more keys (shorter tree).
Secondary indexes are smaller as they don’t duplicate the user
primary key.
The prefix is a 32-bit auto-incrementing integer. You can also use the
table name instead, with the drawback that it can be arbitrarily long.
Data types
One advantage of relational DBs over KV is that they support more
data types. To reflect this aspect, we’ll support 2 data types: strings
and integers.
const (
	TYPE_BYTES = 1 // string (of arbitrary bytes)
	TYPE_INT64 = 2 // integer; 64-bit signed
)

// table cell
type Value struct {
	Type uint32 // tagged union
	I64  int64
	Str  []byte
}
Records
A Record represents a list of column names and values.
// table row
type Record struct {
	Cols []string
	Vals []Value
}
Schemas
We’ll only consider the primary key in this chapter, leaving indexes
for later.
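A table definition can be as simple as this (a sketch; secondary indexes are added to it in chapter 10):

// table schema
type TableDef struct {
	Name   string
	Types  []uint32 // column types
	Cols   []string // column names
	PKeys  int      // the first PKeys columns are the primary key
	Prefix uint32   // the key prefix of this table in the KV store
}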
Internal tables
Where to store table schemas? Since we’re coding a DB, we know
how to store stuff; we’ll store them in a predefined internal table.
type DB struct {
	Path string
	kv   KV
}
The code to handle the columns is just mundane; we’ll skip it.
	tdef := &TableDef{}
	err = json.Unmarshal(rec.Get("def").Str, tdef)
	assert(err == nil)
	return tdef
}
// update modes
const (
	MODE_UPSERT      = 0 // insert or replace
	MODE_UPDATE_ONLY = 1 // update existing keys
	MODE_INSERT_ONLY = 2 // only add new keys
)
The update function here only deals with a complete row. Partial
updates (read-modify-write) are implemented at a higher level
(query language).
func dbUpdate(db *DB, tdef *TableDef, rec Record, mode int) (bool, error) {
	values, err := checkRecord(tdef, rec, len(tdef.Cols))
	if err != nil {
		return false, err
	}
	key := encodeKey(nil, tdef.Prefix, values[:tdef.PKeys])
	val := encodeValues(nil, values[tdef.PKeys:])
	return db.kv.Update(key, val, mode)
}
Create a table
The process of creating a table is rather boring, so the details are
omitted. What’s next:
Range queries.
Secondary indexes.
09. Range Queries
// find the closest position that is less or equal to the input key
func (tree *BTree) SeekLE(key []byte) *BIter
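The iterator keeps the path from the root to the current position (a sketch consistent with the SeekLE code below):

// a stateful iterator into a B+tree
type BIter struct {
	tree *BTree
	path []BNode  // nodes from the root to the leaf
	pos  []uint16 // indexes into the nodes
}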
For example, the query for a <= key looks like this:
// find the closest position that is less or equal to the input key
func (tree *BTree) SeekLE(key []byte) *BIter {
	iter := &BIter{tree: tree}
	for ptr := tree.root; ptr != 0; {
		node := BNode(tree.get(ptr))
		idx := nodeLookupLE(node, key)
		iter.path = append(iter.path, node)
		iter.pos = append(iter.pos, idx)
		ptr = node.getPtr(idx)
	}
	return iter
}
const (
	CMP_GE = +3 // >=
	CMP_GT = +2 // >
	CMP_LT = -2 // <
	CMP_LE = -3 // <=
)
func (tree *BTree) Seek(key []byte, cmp int) *BIter
Numbers
Let’s start with a simple problem: how to encode unsigned integers
so that they can be compared by bytes.Compare? bytes.Compare works byte
by byte until a difference is met, so the 1st byte is the most significant
in a comparison. If we put the most significant (higher) bits of an
integer first, integers can be compared byte-wise. That’s just
big-endian integers.
0x0000000000000001 -> 00 00 00 00 00 00 00 01
0x0000000000000002 -> 00 00 00 00 00 00 00 02
...
0x00000000000000ff -> 00 00 00 00 00 00 00 ff
0x0000000000000100 -> 00 00 00 00 00 00 01 00
Signed integers are then encoded by flipping the sign bit (adding 1<<63),
so that negative numbers sort before positive ones:

-2          7f ff ff ff ff ff ff fe
-1          7f ff ff ff ff ff ff ff
0           80 00 00 00 00 00 00 00
1           80 00 00 00 00 00 00 01
MaxInt64    ff ff ff ff ff ff ff ff
Strings
The key can be multiple columns. But bytes.Compare only works with a
single string column, because it needs the length. We cannot simply
concatenate string columns, because this creates ambiguity. E.g.,
("a", "bc") vs. ("ab", "c").
The problem with delimiters is that the input cannot contain the
delimiter; this is solved by escaping the delimiter. We’ll use the null
byte 0x00 as the delimiter and the byte 0x01 as the escaping byte, and
the escaping byte itself must be escaped. So we need 2 transformations:
00 -> 01 01
01 -> 01 02
Note that the escape sequences still preserve the sort order.
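A sketch of the escaping transformation (decoding reverses it):

// escape the null byte (the delimiter) and the escaping byte itself,
// so that the encoded string never contains a 0x00 byte.
func escapeString(in []byte) []byte {
	out := make([]byte, 0, len(in))
	for _, ch := range in {
		if ch <= 1 {
			out = append(out, 0x01, ch+1) // 00 -> 01 01, 01 -> 01 02
		} else {
			out = append(out, ch)
		}
	}
	return out
}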
Tuples
A multi-column comparison (tuple) is done column by column until
a difference is met. This is like a string comparison, except that each
item is a typed value instead of a byte. We can simply concatenate
the encoded bytes of each column as long as there is no ambiguity.
The next step is to add secondary indexes, which are just extra tables.
10. Secondary Indexes
Table schema
As mentioned in chapter 08, secondary indexes are just extra KV
pairs containing the primary key. Each index is distinguished by a
key prefix in the B+tree.
The primary key is simply the first index, since it’s also an index.
KV structures
For a secondary index, we could put the primary key in the B+tree
value, which is used to find the full row. However, unlike the primary
key, secondary indexes don’t have the unique constraint, so there can
be duplicate B+tree keys. Instead, the primary key columns are
appended to the index key to make it unique, and the value is left
empty:
create table t1 (
k1 string,
k2 int,
v1 string,
v2 string,
primary key (k1, k2),
index idx1 (v1),
index idx2 (v2, k2)
);
        key                      value
t1      prefix1, k1, k2          v1, v2
idx1    prefix2, v1, k1, k2      (empty)
idx2    prefix3, v2, k2, k1      (empty)
// order-preserving encoding
func encodeValues(out []byte, vals []Value) []byte {
	for _, v := range vals {
		out = append(out, byte(v.Type)) // *added*: doesn't start with 0xff
		switch v.Type {
		case TYPE_INT64:
			var buf [8]byte
			u := uint64(v.I64) + (1 << 63)        // flip the sign bit
			binary.BigEndian.PutUint64(buf[:], u) // big endian
			out = append(out, buf[:]...)
		case TYPE_BYTES:
			out = append(out, escapeString(v.Str)...)
			out = append(out, 0) // null-terminated
		default:
			panic("what?")
		}
	}
	return out
}
We’ll prepend the column type code as the tag. This also makes
debugging easier since we can now decode stuff by looking at the
hexdump.
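The full key is just the table (or index) prefix followed by the encoded columns. A sketch of encodeKey, which encodeKeyPartial below builds on:

// encode the key: the 4-byte table prefix + the order-preserving columns
func encodeKey(out []byte, prefix uint32, vals []Value) []byte {
	var buf [4]byte
	binary.BigEndian.PutUint32(buf[:], prefix)
	out = append(out, buf[:]...)
	out = encodeValues(out, vals)
	return out
}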
// for the input range, which can be a prefix of the index key.
func encodeKeyPartial(
	out []byte, prefix uint32, vals []Value, cmp int,
) []byte {
	out = encodeKey(out, prefix, vals)
	if cmp == CMP_GT || cmp == CMP_LE { // encode missing columns as infinity
		out = append(out, 0xff) // unreachable +infinity
	} // else: -infinity is the empty string
	return out
}
func dbUpdate(db *DB, tdef *TableDef, rec Record, mode int) (bool, error) {
	// ...
	// insert the row
	req := UpdateReq{Key: key, Val: val, Mode: mode}
	if _, err = db.kv.Update(&req); err != nil {
		return false, err
	}
	// maintain secondary indexes
	if req.Updated && !req.Added {
		// use `req.Old` to delete the old indexed keys ...
	}
	if req.Updated {
		// add the new indexed keys ...
	}
	return req.Updated, nil
}
Achieving this with just get, set, del is tricky, which is why simple KV
interfaces are very limiting. Our next step is a transactional KV
interface to allow atomic operations on multiple keys or even
concurrent readers.
We’ll drop the get-set-del interface and add a new one to allow
atomic execution of a group of operations. Concurrency is discussed
in the next chapter.
// begin a transaction
func (kv *KV) Begin(tx *KVTX)
// end a transaction: commit updates; rollback on error
func (kv *KV) Commit(tx *KVTX) error
// end a transaction: rollback
func (kv *KV) Abort(tx *KVTX)
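Typical usage looks like this (a sketch; doSomeUpdates stands in for table operations performed through the transaction):

tx := KVTX{}
kv.Begin(&tx)
if err := doSomeUpdates(&tx); err != nil {
	kv.Abort(&tx) // rollback: none of the updates take effect
	return err
}
return kv.Commit(&tx) // all updates become visible and durable atomically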
Atomicity via copy-on-write
With copy-on-write, both commit and rollback are just updating the
root pointer. This is already implemented as error handling in
chapter 06.
// previous chapter!!!
func (db *KV) Update(req *UpdateReq) (bool, error) {
	meta := saveMeta(db)
	if !db.tree.Update(req) {
		return false, nil
	}
	err := updateOrRevert(db, meta)
	return err == nil, err
}
Note that these functions no longer return errors because the actual
disk update is moved to KVTX.Commit().
Range delete
Although we can now do multi-key updates, deleting a large number
of keys, such as dropping a table, is still problematic w.r.t. resource
usage. The naive approach to dropping a table is to iterate and delete
keys one by one. This reads the entire table into memory and does
useless work, as nodes are updated repeatedly before being deleted.
Some DBs use separate files for each table, so this is not a problem.
In our case, a single B+tree is used for everything, so we can
implement a range delete operation that frees all leaf nodes within a
range without even looking at them.
Read-copy-update (RCU)
To prevent readers and writers from blocking each other, we can
make readers and writers work on their own version of the data.
     TX1              TX2
1    read a
2                     write a
3    write b := a
4    commit
5                     commit

TX1 depends on the same key that TX2 modifies, so they cannot both
succeed.
2 delete a
3 commit
4 commit
1. TX starts.
2. Reads are on the snapshot, but writes are buffered locally.
3. Before committing, verify that there are no conflicts with
committed TXs.
4. TX ends.
If there’s a conflict, abort and rollback.
Otherwise, transfer buffered writes to the DB.
// begin a transaction
func (kv *KV) Begin(tx *KVTX) {
	// read-only snapshot, just the tree root and the page read callback
	tx.snapshot.root = kv.tree.root
	tx.snapshot.get = ... // read from mmap'ed pages ...
	// in-memory tree to capture updates
	pages := [][]byte(nil)
	tx.pending.get = func(ptr uint64) []byte { return pages[ptr-1] }
	tx.pending.new = func(node []byte) uint64 {
		pages = append(pages, node)
		return uint64(len(pages))
	}
	tx.pending.del = func(uint64) {}
}
FLAG_DELETED = byte(1)
FLAG_UPDATED = byte(2)
This works by checking the version when consuming from the list
head. Remember that the free list is FIFO (first-in-first-out), so pages
from the oldest version will be consumed first.
Modification 1: Version numbers in KVTX and KV.
type KV struct {
	// ...
	history []CommittedTX // changed keys; for detecting conflicts
}
type CommittedTX struct {
	version uint64
	writes  []KeyRange // sorted
}
func (kv *KV) Commit(tx *KVTX) error {
	// ...
	if len(writes) > 0 {
		kv.history = append(kv.history, CommittedTX{kv.version, writes})
	}
	return nil
}
type KV struct {
	// ...
	mutex sync.Mutex // serialize TX methods
}

func (kv *KV) Begin(tx *KVTX) {
	kv.mutex.Lock()
	defer kv.mutex.Unlock()
	// ...
}
func (kv *KV) Commit(tx *KVTX) // same
func (kv *KV) Abort(tx *KVTX)  // same
We can use this lock for all KVTX methods. But there are ways to
reduce the locking. For example, we don’t have to serialize
read/write methods because …
This way, read/write methods do not require the lock and can run in
parallel. That’s good, because reads can trigger page faults and block
the thread.
So far, only Begin, Commit, and Abort are serialized. But considering that
Commit involves IO, we can go further by releasing the lock while
waiting for IO to allow other TXs to enter and read-only TXs to exit.
The commit step should still be serialized with other commit steps
via another lock. This is left as an exercise.
13. SQL Parser
SQL is easily parsed by computers while still looking like English.
            select
          /    |    \
   columns   table   condition
      ...     foo       and
                       /   \
                      >     <
                     / \   / \
                    a   b a   c
Example 2: Expression a + b * c:
      +
     / \
    a   *
       / \
      b   c
# pseudo code
def eval(node):
    if is_binary_operator(node):
        left, right = eval(node.left), eval(node.right)
        return node.operator(left, right)
    elif is_value(node):
        return node.value
    ...
In retrospect, this is the reason why trees are relevant, because trees
represent the evaluation order. A programming language also has
control flows, variables, etc., but once you represent it with a tree,
the rest should be obvious.
13.2 Query language specification
Statements
Not exactly SQL, just a look-alike.
Conditions
A SQL DB will choose an index based on the WHERE clause if possible,
and/or fetch and filter the rows if the condition is not fully covered
by the index. This is automatic without any direct user control.
Here we’ll deviate from SQL: instead of WHERE, we’ll use separate
clauses for indexing conditions and filtering conditions:
1. The INDEX BY clause selects an index and controls the sort order.
2. The FILTER clause selects rows without using an index.
Both are optional, and the primary key is selected if INDEX BY is
missing.
Expressions
An expression is either a …
column name,
literal value like numbers or strings,
binary or unary operator,
tuple.
a OR b
a AND b
NOT a
a = b, a < b, ... -- comparisons
a + b, a - b
a * b, a / b
-a
pScan is the last part of SELECT. It’s further divided into 3 smaller parts.
1. Split the input into smaller and smaller parts until it ends as
either an operator, a name, or a literal value.
2. Determine the next part by looking at the next keywords.
pExprOr … … pNum
term
term + term
term + term + term + ...
       +
      / \
  left   right

        +
       / \
      +   R
     / \
   LL   LR
We can add more terms, and it’s still a binary tree. Pseudo-code:
def parse_terms():
    node = parse_column()
    while consume('+'):
        right = parse_column()
        node = QLNode(type='+', kids=[node, right])
    return node
The left subrule expr can expand to the rule itself, but the right
subrule term is the bottommost part that cannot expand any further.
Pseudo-code:
def parse_terms():
    node = parse_factors()
    while consume('+'):
        right = parse_factors()
        node = QLNode(type='+', kids=[node, right])
    return node

def parse_factors():
    node = parse_column()
    while consume('*'):
        right = parse_column()
        node = QLNode(type='*', kids=[node, right])
    return node
      a       +    b   ×   c    -    d
1     term    +    term         -    term
2     factor       factor × factor   factor
a OR b -- pExprOr
a AND b -- pExprAnd
NOT a -- pExprNot
a = b, a < b -- pExprCmp
a + b, a - b -- pExprAdd
a * b, a / b -- pExprMul
-a -- pExprUnop
It has 3 phases: INDEX BY, LIMIT, and FILTER. The Scanner (the range
query interface) implements the INDEX BY phase.
Let’s say the index is (a, b). Queries using a prefix of the index are
already handled by the key encoding in chapter 10. So …
With the use of the empty tuple (), Key1 and Key2 can now have
different sets of columns, so we have to modify the index selection to
allow this.