build-your-own-database-from-scratch-1n
https://build-your-own.org
Made with Xodo PDF Reader and Editor
Persistence, Indexing,
Concurrency
James Smith
build-your-own.org
2023-05-15
Contents
00. Introduction
02. Indexing
00. Introduction
Databases are not black boxes. Understand them by building your own from scratch!
The book focuses on important ideas rather than implementation details. Real-world
databases are complex and hard to grasp. We can learn faster and more easily from a
stripped-down version of a database, and the “from scratch” method forces you to learn
more deeply.
Although the book is short and the implementation is minimal, it aims to cover three
important topics:
1. Persistence. How not to lose or corrupt your data. Recovering from a crash.
2. Indexing. Efficiently querying and manipulating your data. (B-tree).
3. Concurrency. How to handle multiple (possibly a large number of) clients. And transactions.
If you have only vague ideas like “databases store my data” or “indexes are fast”, this book
is for you.
This book takes a step-by-step approach. Each step builds on the previous one and adds a
new concept. The book uses Golang for sample code, but the topics are language agnostic.
Readers are advised to code their own version of a database rather than just read the text.
Why do we need databases? Why not dump the data directly into files? Our first topic is
persistence.
Let’s say your process crashed midway through writing to a file, or you lost power. What’s
the state of the file?
Any outcome is possible. Your data is not guaranteed to persist on a disk when you simply
write to files. This is a concern of databases. And a database will recover to a usable state
when started after an unexpected shutdown.
Dumping the whole dataset to a file on every update is only acceptable when the dataset is
tiny. A database like SQLite can do incremental updates.
There are two distinct types of database queries: analytical (OLAP) and transactional
(OLTP).
• Analytical (OLAP) queries typically involve a large amount of data, with aggregation,
grouping, or join operations.
• In contrast, transactional (OLTP) queries usually only touch a small amount of
indexed data. The most common types of queries are indexed point queries and
indexed range queries.
Note that the word “transactional” is not related to database transactions as you may know.
Computer jargon is often overloaded with different meanings. This book focuses only on
OLTP techniques.
While many applications are not real-time systems, most user-facing software should
respond in a reasonable (small) amount of time, using a reasonable amount of resources
(memory, IO). This falls into the OLTP category. How do we find the data quickly (in
O(log(n))), even if the dataset is large? This is why we need indexes.
If we ignore the persistence aspect and assume that the dataset fits in memory, finding the
data quickly is the problem of data structures. Data structures that persist on a disk to look
up data are called “indexes” in database systems. And database indexes can be larger than
memory. There is a saying: if your problem fits in memory, it’s an easy problem.
Modern applications do not just do everything sequentially, nor do databases. There are
different levels of concurrency:
Even the file-based SQLite supports some concurrency. But concurrency is easier within
a process, which is why most database systems can only be accessed via a “server”.
With the addition of concurrency, applications often need to do things atomically, such as
the read-modify-write operation. This adds a new concept to databases: transactions.
This chapter shows the limitations of simply dumping data to files and the problems that
databases solve.
Let’s say you have some data that needs to be persisted to a file; this is a typical way to do
it:
_, err = fp.Write(data)
return err
}
This approach has several problems:
1. It truncates the file before updating it. What if the file needs to be read concurrently?
2. Writing data to files may not be atomic, depending on the size of the write. Concurrent
readers might get incomplete data.
3. When is the data actually persisted to the disk? The data is probably still in the
operating system’s page cache after the write syscall returns. What’s the state of the
file when the system crashes and reboots?
_, err = fp.Write(data)
if err != nil {
os.Remove(tmp)
return err
}
This approach is slightly more sophisticated: it first dumps the data to a temporary file,
then renames the temporary file to the target file. This seems to be free of the non-atomic
problem of updating a file directly, because the rename operation is atomic. If the system
crashes before renaming, the original file remains intact, and applications have no problem
reading the file concurrently.
However, this is still problematic because it doesn’t control when the data is persisted to
the disk, and the metadata (the size of the file) may be persisted to the disk before the data,
potentially corrupting the file after a system crash. (You may have noticed that some log
files have zeros in them after a power failure. That’s a sign of file corruption.)
1.3 fsync
To fix the problem, we must flush the data to the disk before renaming it. The Linux
syscall for this is “fsync”.
_, err = fp.Write(data)
if err != nil {
os.Remove(tmp)
return err
}
Are we done yet? The answer is no. We have flushed the data to the disk, but what about
the metadata? Should we also call fsync on the directory containing the file?
This rabbit hole is quite deep and that’s why databases are preferred over files for persisting
data to the disk.
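As one possible answer to the directory question above, here is a minimal sketch that combines the steps discussed so far: write to a temporary file, fsync it, rename it, then fsync the containing directory. The names saveAtomic and roundTrip and the ".tmp" suffix are hypothetical, and the sketch assumes a Linux-like system where a directory can be opened and fsync'd.

```go
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// saveAtomic replaces the file at path with data.
// The name and the ".tmp" suffix are hypothetical, chosen for this sketch.
func saveAtomic(path string, data []byte) error {
    tmp := path + ".tmp"
    fp, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
    if err != nil {
        return err
    }
    if _, err = fp.Write(data); err == nil {
        err = fp.Sync() // fsync the data before the rename
    }
    if cerr := fp.Close(); err == nil {
        err = cerr
    }
    if err == nil {
        err = os.Rename(tmp, path)
    }
    if err != nil {
        os.Remove(tmp)
        return err
    }
    // fsync the directory so that the new directory entry is persisted too
    dir, err := os.Open(filepath.Dir(path))
    if err != nil {
        return err
    }
    defer dir.Close()
    return dir.Sync()
}

// roundTrip saves data into a temporary directory and reads it back.
func roundTrip(data []byte) ([]byte, error) {
    dir, err := os.MkdirTemp("", "save-demo")
    if err != nil {
        return nil, err
    }
    defer os.RemoveAll(dir)
    path := filepath.Join(dir, "data.bin")
    if err := saveAtomic(path, data); err != nil {
        return nil, err
    }
    return os.ReadFile(path)
}

func main() {
    got, err := roundTrip([]byte("hello"))
    fmt.Println(string(got), err)
}
```

Even this version rewrites the whole file on every update, so it only illustrates the ordering of the syscalls, not a practical storage engine.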
In some use cases, it makes sense to persist data using an append-only log.
The nice thing about the append-only log is that it does not modify the existing data, nor
does it deal with the rename operation, making it more resistant to corruption. But logs
alone are not enough to build a database.
1. A database uses additional “indexes” to query the data efficiently. There are only
brute-force ways to query a bunch of records of arbitrary order.
2. How do logs handle deleted data? They cannot grow forever.
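To make the append-only idea concrete, here is a small sketch of a log that only ever adds records to the end of a file and fsyncs after each one. The function names and the newline-terminated record format are assumptions made for illustration.

```go
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// logAppend appends one newline-terminated record and flushes it to disk.
func logAppend(fp *os.File, record string) error {
    if _, err := fp.WriteString(record + "\n"); err != nil {
        return err
    }
    return fp.Sync()
}

// writeLog appends the records to a fresh log in a temporary directory
// and returns the resulting file contents.
func writeLog(records []string) (string, error) {
    dir, err := os.MkdirTemp("", "log-demo")
    if err != nil {
        return "", err
    }
    defer os.RemoveAll(dir)
    path := filepath.Join(dir, "db.log")
    // O_APPEND: every write goes to the end of the file
    fp, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_APPEND, 0o644)
    if err != nil {
        return "", err
    }
    defer fp.Close()
    for _, r := range records {
        if err := logAppend(fp, r); err != nil {
            return "", err
        }
    }
    data, err := os.ReadFile(path)
    return string(data), err
}

func main() {
    out, err := writeLog([]string{"set a=1", "set b=2", "del a"})
    fmt.Printf("%q %v\n", out, err)
}
```

Note that finding the current value of a key in such a log means scanning the records, which is exactly the brute-force querying mentioned in the list above.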
We have already seen some of the problems we must handle. Let’s start with indexing first
in the next chapter.
02. Indexing
Although a relational DB supports many types of queries, almost all queries can be broken
down into three types of disk operations:
Database indexes are mostly about range queries and point queries, and it’s easy to see
that a range query is just a superset of point queries. If we extract the functionality of the
database indexes, it is trivial to make a key-value store. But the point is that a database
system can be built on top of a KV store.
We’ll build a KV store before attempting the relational DB, but let’s explore our options
first.
2.2 Hashtables
Hashtables are the first to be ruled out when designing a general-purpose KV store. The
main reason is sorting — many real-world applications do require sorting and ordering.
2.3 B-Trees
Balanced binary trees can be queried and updated in O(log(n)) and can be range-queried.
A B-tree is roughly a balanced n-ary tree. Why use an n-ary tree instead of a binary tree?
There are several reasons:
1. Less space overhead.
Every leaf node in a binary tree is reached via a pointer from a parent node, and
the parent node may also have a parent. On average, each leaf node requires 1~2
pointers.
This is in contrast to B-trees, where multiple data in a leaf node share one parent.
And n-ary trees are also shorter. Less space is wasted on pointers.
2. Faster in memory.
Due to modern CPU memory caching and other factors, n-ary trees can be faster
than binary trees, even if their big-O complexity is the same.
We’ll use B-trees in this book. But B-trees are not the only option.
2.4 LSM-Trees
How to query:
How to update:
5. When updating a key, the key is inserted into a file from the top level first.
6. If the file size exceeds a threshold, merge it with the next level.
7. The file size threshold increases exponentially with each level, which means that
the amount of data also increases exponentially.
1. Each level is sorted, keys can be found via binary search, and range queries are just
sequential file IO. It’s efficient.
For updates:
2. The top-level file size is small, so inserting into the top level requires only a small
amount of IO.
3. Data is eventually merged to a lower level. Merging is sequential IO, which is an
advantage.
4. Higher levels trigger merging more often, but the merge is also smaller.
5. When merging a file into a lower level, any lower files whose range intersects are
replaced by the merged results (which can be multiple files). We can see why levels
are split into multiple files — to reduce the size of the merge.
6. Merging can be done in the background. However, low-level merging can suddenly
cause high IO usage, which can degrade system performance.
Readers can try to use LSM-trees instead of B-trees after finishing this book, and compare
the pros and cons of B-trees and LSM-trees.
Our first intuition comes from balanced binary search trees (BSTs). Binary trees are popular data
structures for sorted data. Keeping a tree in good shape after inserting or removing keys
is what “balancing” means. As stated in a previous chapter, n-ary trees should be used
instead of binary trees to make use of the “page” (minimum unit of IO).
B-trees can be generalized from BSTs. Each node of a B-tree contains multiple keys and
multiple links to its children. When looking up a key in a node, all keys are used to decide
the next child node.
            [1,   4,   9]
            /     |     \
           v      v      v
   [1, 2, 3]   [4, 6]   [9, 11, 12]
The balancing of a B-tree is different from that of a BST. Popular BSTs like RB trees or AVL
trees are balanced on the height of sub-trees (by rotation). In a B-tree, all leaf nodes
have the same height, and the tree is balanced by the size of the nodes:
• If a node is too large to fit on one page, it is split into two nodes. This will increase
the size of the parent node and possibly increase the height of the tree if the root
node was split.
• If a node is too small, try merging it with a sibling.
If you are familiar with RB trees, you may also be aware of 2-3 trees that can be easily
generalized as B-trees.
Even if you are not familiar with the 2-3 tree, you can still gain some intuition using nested
arrays.
Let’s start with a sorted array. Queries can be done by bisection. But, updating the array
is O(n) which we need to tackle. Updating a big array is bad so we split it into smaller
arrays. Let’s say we split the array into sqrt(n) parts, and each part contains sqrt(n) keys
on average.
To query a key, we must first determine which part contains the key. Bisecting on the
sqrt(n) parts is O(log(n)). After that, bisecting the key within the part is again O(log(n)),
so it’s no worse than before. And updating is improved to O(sqrt(n)).
This is a 2-level sorted nested array. What if we add more levels? This is another intuition
behind the B-tree.
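The nested array intuition can be sketched in a few lines. This is an illustrative model only: it uses integer keys, and instead of keeping exactly sqrt(n) parts it splits a part once it exceeds an assumed limit (partMax), which is closer to how B-tree nodes behave. The type and method names are hypothetical.

```go
package main

import (
    "fmt"
    "sort"
)

// Nested is a 2-level sorted array: parts are ordered by their first key,
// and the keys inside each part are sorted. Parts are never empty.
type Nested struct {
    parts   [][]int
    partMax int // assumed size limit of a part
}

// findPart returns the index of the last part whose first key is <= key,
// or 0 if the key is smaller than everything.
func (a *Nested) findPart(key int) int {
    i := sort.Search(len(a.parts), func(i int) bool { return a.parts[i][0] > key })
    if i > 0 {
        i--
    }
    return i
}

// Has bisects on the parts first, then inside the chosen part.
func (a *Nested) Has(key int) bool {
    if len(a.parts) == 0 {
        return false
    }
    part := a.parts[a.findPart(key)]
    j := sort.SearchInts(part, key)
    return j < len(part) && part[j] == key
}

// Insert only shifts keys inside one small part instead of the whole array.
func (a *Nested) Insert(key int) {
    if len(a.parts) == 0 {
        a.parts = [][]int{{key}}
        return
    }
    p := a.findPart(key)
    part := a.parts[p]
    j := sort.SearchInts(part, key)
    if j < len(part) && part[j] == key {
        return // already present
    }
    part = append(part, 0)
    copy(part[j+1:], part[j:])
    part[j] = key
    a.parts[p] = part
    if len(part) > a.partMax { // split an oversized part in two
        mid := len(part) / 2
        right := append([]int(nil), part[mid:]...)
        a.parts[p] = part[:mid:mid]
        a.parts = append(a.parts, nil)
        copy(a.parts[p+2:], a.parts[p+1:])
        a.parts[p+1] = right
    }
}

// Keys flattens the structure into one sorted slice.
func (a *Nested) Keys() []int {
    var out []int
    for _, part := range a.parts {
        out = append(out, part...)
    }
    return out
}

func main() {
    a := &Nested{partMax: 4}
    for _, k := range []int{50, 10, 40, 20, 30, 60, 70} {
        a.Insert(k)
    }
    fmt.Println(a.parts, a.Has(30), a.Has(35))
}
```

Splitting an oversized part is the same move a B-tree makes when a node no longer fits on a page.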
Updating a B-tree is more complicated. From now on we’ll use a variant of B-tree called
“B+ tree”. The B+ tree stores values only in leaf nodes, and internal nodes contain only
keys.
Key insertion starts at a leaf. A leaf is just a sorted list of keys. Inserting the key into the
leaf is trivial. But, the insertion may cause the node size to exceed the page size. In this
case, we need to split the leaf node into 2 nodes, each containing half of the keys, so that
both leaf nodes fit into one page.
After splitting a leaf node into 2 nodes, the parent node replaces the old pointer and key
with the new pointers and keys. The size of the parent node increases, which may trigger
further splitting.
    parent              parent
   /  |  \      =>    / |  | \
 L1  L2  L6         L1 L3  L4 L6
After the root node is split, a new root node is added. This is how a B-tree grows.
                        new_root
                        /      \
     root             N1        N2
   /  |  \     =>    /  |      |  \
 L1  L2  L6        L1   L3    L4   L6
Key deletion is the opposite of insertion. A node is never empty because a small node will
be merged into either its left sibling or its right sibling.
And when a non-leaf root is reduced to a single key, the root can be replaced by its sole
child. This is how a B-tree shrinks.
Immutable means never updating data in place. Some similar terms are “append-only”,
“copy-on-write”, and “persistent data structures” (the word “persistent” has nothing to do
with the “persistence” we talked about earlier).
For example, when inserting a key into a leaf node, do not modify the node in place,
instead, create a new node with all the keys from the to-be-updated node and the new
key. Now the parent node must also be updated to point to the new node.
Likewise, the parent node is duplicated with the new pointer. This continues until we reach
the root node, at which point the entire path has been duplicated. This effectively creates a
new version of the tree that coexists with the old version.
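Path copying is easier to see on a small example. The sketch below uses a plain (unbalanced) binary search tree as a stand-in for the B+ tree, so it is not the book's implementation. It only shows that an insert duplicates the nodes on the path to the root and shares everything else with the old version.

```go
package main

import "fmt"

// node is never modified after it has been created
type node struct {
    key         int
    left, right *node
}

// insert returns the root of a new version; the old version stays valid.
// Only the nodes on the path from the root to the new key are copied.
func insert(n *node, key int) *node {
    if n == nil {
        return &node{key: key}
    }
    cp := *n // duplicate the node instead of updating it in place
    switch {
    case key < n.key:
        cp.left = insert(n.left, key)
    case key > n.key:
        cp.right = insert(n.right, key)
    }
    return &cp
}

func has(n *node, key int) bool {
    for n != nil {
        switch {
        case key < n.key:
            n = n.left
        case key > n.key:
            n = n.right
        default:
            return true
        }
    }
    return false
}

func main() {
    var v1 *node
    for _, k := range []int{5, 2, 8} {
        v1 = insert(v1, k)
    }
    v2 := insert(v1, 9) // copies nodes 5 and 8, shares node 2
    fmt.Println(has(v1, 9), has(v2, 9), v1.left == v2.left)
    // prints: false true true
}
```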
1. Avoid data corruption. Immutable data structures do not modify the existing data,
they merely add new data, so the old version of data remains intact even if the update
is interrupted.
2. Easy concurrency. Readers can operate concurrently with writers since readers can
work on older versions unaffected.
Persistence and concurrency are covered in later chapters. For now, we’ll code an
immutable B+ tree.
Our B-tree will be persisted to the disk eventually, so we need to design the wire format
for the B-tree nodes first. Without the format, we won’t know the size of a node and
when to split a node.
A node consists of:
1. A fixed-sized header containing the type of the node (leaf node or internal node)
and the number of keys.
2. A list of pointers to the child nodes. (Used by internal nodes).
3. A list of offsets pointing to each key-value pair.
4. Packed KV pairs.
| type | nkeys |  pointers  |   offsets  | key-values
|  2B  |   2B  | nkeys * 8B | nkeys * 2B | ...
To keep things simple, both leaf nodes and internal nodes use the same format.
Since we’re going to dump our B-tree to the disk eventually, why not use an array of
bytes as our in-memory data structure as well?
const (
And we can’t use in-memory pointers. The pointers are 64-bit integers referencing
disk pages instead of in-memory nodes. We’ll add some callbacks to abstract away this
aspect so that our data structure code remains pure data structure code.
The page size is defined to be 4K bytes. A larger page size such as 8K or 16K also works.
We also add some constraints on the size of the keys and values so that a node with a
single KV pair always fits on a single page. If you need to support bigger keys or bigger
values, you have to allocate extra pages for them, and that adds complexity.
const HEADER = 4
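func init() {
    // the largest possible node holding a single KV pair:
    // header + 1 pointer + 1 offset + klen/vlen fields + key + value
    node1max := HEADER + 8 + 2 + 4 + BTREE_MAX_KEY_SIZE + BTREE_MAX_VAL_SIZE
    assert(node1max <= BTREE_PAGE_SIZE)
}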
Since a node is just an array of bytes, we’ll add some helper functions to access its content.
// header
func (node BNode) btype() uint16 {
    return binary.LittleEndian.Uint16(node.data)
}
func (node BNode) nkeys() uint16 {
    return binary.LittleEndian.Uint16(node.data[2:4])
}
func (node BNode) setHeader(btype uint16, nkeys uint16) {
    binary.LittleEndian.PutUint16(node.data[0:2], btype)
    binary.LittleEndian.PutUint16(node.data[2:4], nkeys)
}
// pointers
func (node BNode) getPtr(idx uint16) uint64 {
    assert(idx < node.nkeys())
    pos := HEADER + 8*idx
    return binary.LittleEndian.Uint64(node.data[pos:])
}
func (node BNode) setPtr(idx uint16, val uint64) {
    assert(idx < node.nkeys())
    pos := HEADER + 8*idx
    binary.LittleEndian.PutUint64(node.data[pos:], val)
}
// offset list
func offsetPos(node BNode, idx uint16) uint16 {
    assert(1 <= idx && idx <= node.nkeys())
    return HEADER + 8*node.nkeys() + 2*(idx-1)
}
func (node BNode) getOffset(idx uint16) uint16 {
    if idx == 0 {
        return 0
    }
    return binary.LittleEndian.Uint16(node.data[offsetPos(node, idx):])
}
func (node BNode) setOffset(idx uint16, offset uint16) {
    binary.LittleEndian.PutUint16(node.data[offsetPos(node, idx):], offset)
}
// key-values
func (node BNode) kvPos(idx uint16) uint16 {
    assert(idx <= node.nkeys())
    return HEADER + 8*node.nkeys() + 2*node.nkeys() + node.getOffset(idx)
}
func (node BNode) getKey(idx uint16) []byte {
    assert(idx < node.nkeys())
    pos := node.kvPos(idx)
    klen := binary.LittleEndian.Uint16(node.data[pos:])
    return node.data[pos+4:][:klen]
}
func (node BNode) getVal(idx uint16) []byte {
    assert(idx < node.nkeys())
    pos := node.kvPos(idx)
    klen := binary.LittleEndian.Uint16(node.data[pos+0:])
    vlen := binary.LittleEndian.Uint16(node.data[pos+2:])
    return node.data[pos+4+klen:][:vlen]
}
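To see how the header, pointer list, offset list, and packed KV pairs fit together, here is a standalone sketch that encodes a small node into a byte slice and reads one pair back. It follows the layout used by the helpers above (each KV pair is stored as a 2-byte key length, a 2-byte value length, the key, then the value), but the functions encodeNode and getKV are hypothetical and exist only for this illustration.

```go
package main

import (
    "encoding/binary"
    "fmt"
)

const header = 4 // type (2B) + nkeys (2B)

// encodeNode packs KV pairs into one byte slice.
// Pointers are left as zero, as they would be unused in a leaf node.
func encodeNode(btype uint16, keys, vals [][]byte) []byte {
    n := len(keys)
    kvStart := header + 8*n + 2*n
    size := kvStart
    for i := range keys {
        size += 4 + len(keys[i]) + len(vals[i])
    }
    data := make([]byte, size)
    binary.LittleEndian.PutUint16(data[0:], btype)
    binary.LittleEndian.PutUint16(data[2:], uint16(n))
    pos := kvStart
    for i := range keys {
        binary.LittleEndian.PutUint16(data[pos:], uint16(len(keys[i])))
        binary.LittleEndian.PutUint16(data[pos+2:], uint16(len(vals[i])))
        copy(data[pos+4:], keys[i])
        copy(data[pos+4+len(keys[i]):], vals[i])
        pos += 4 + len(keys[i]) + len(vals[i])
        // slot i of the offset list stores the end of the i-th pair,
        // relative to the start of the KV area
        binary.LittleEndian.PutUint16(data[header+8*n+2*i:], uint16(pos-kvStart))
    }
    return data
}

// getKV returns the idx-th key and value.
func getKV(data []byte, idx int) (key, val []byte) {
    n := int(binary.LittleEndian.Uint16(data[2:]))
    off := 0 // the offset of the first pair is always 0 and is not stored
    if idx > 0 {
        off = int(binary.LittleEndian.Uint16(data[header+8*n+2*(idx-1):]))
    }
    pos := header + 8*n + 2*n + off
    klen := int(binary.LittleEndian.Uint16(data[pos:]))
    vlen := int(binary.LittleEndian.Uint16(data[pos+2:]))
    return data[pos+4:][:klen], data[pos+4+klen:][:vlen]
}

func main() {
    keys := [][]byte{[]byte("a"), []byte("bb")}
    vals := [][]byte{[]byte("1"), []byte("22")}
    data := encodeNode(2, keys, vals)
    k, v := getKV(data, 1)
    fmt.Println(len(data), string(k), string(v))
}
```

Because the last offset marks the end of the last pair, the total size of a node can be computed from the offset list alone.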
To insert a key into a leaf node, we need to look up its position in the sorted KV list.
// returns the first kid node whose range intersects the key. (kid[i] <= key)
// TODO: bisect
func nodeLookupLE(node BNode, key []byte) uint16 {
    nkeys := node.nkeys()
    found := uint16(0)
    // the first key is a copy from the parent node,
    // thus it's always less than or equal to the key.
    for i := uint16(1); i < nkeys; i++ {
        cmp := bytes.Compare(node.getKey(i), key)
        if cmp <= 0 {
            found = i
        }
        if cmp >= 0 {
            break
        }
    }
    return found
}
The lookup works for both leaf nodes and internal nodes. Note that the first key is skipped
for comparison, since it has already been compared from the parent node.
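The linear scan in nodeLookupLE is marked as a TODO for bisection. One possible way to do that, sketched here over a plain sorted slice of keys rather than a BNode, is to use sort.Search from the standard library. The function lookupLE is hypothetical and assumes the slice is non-empty and that its first key is less than or equal to the target, mirroring the assumption made in the comment above.

```go
package main

import (
    "bytes"
    "fmt"
    "sort"
)

// lookupLE returns the index of the last key that is <= target.
// Assumes keys is sorted, non-empty, and keys[0] <= target.
func lookupLE(keys [][]byte, target []byte) int {
    // Search keys[1:] for the first key greater than the target.
    // Index i of the search corresponds to keys[i+1], so the result i
    // is also the index of the last key that is <= target.
    return sort.Search(len(keys)-1, func(i int) bool {
        return bytes.Compare(keys[i+1], target) > 0
    })
}

func main() {
    keys := [][]byte{[]byte(""), []byte("b"), []byte("d"), []byte("f")}
    for _, t := range []string{"a", "c", "d", "z"} {
        fmt.Println(t, lookupLE(keys, []byte(t)))
    }
}
```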
After looking up the position to insert, we need to create a copy of the node with the new
key in it.
The nodeAppendRange function copies keys from an old node to a new node.
// copy n KVs from old (starting at srcOld) to new (starting at dstNew)
func nodeAppendRange(
    new BNode, old BNode, dstNew uint16, srcOld uint16, n uint16,
) {
    // pointers
    for i := uint16(0); i < n; i++ {
        new.setPtr(dstNew+i, old.getPtr(srcOld+i))
    }
    // offsets
    dstBegin := new.getOffset(dstNew)
    srcBegin := old.getOffset(srcOld)
    for i := uint16(1); i <= n; i++ { // NOTE: the range is [1, n]
        offset := dstBegin + old.getOffset(srcOld+i) - srcBegin
        new.setOffset(dstNew+i, offset)
    }
    // KVs
    begin := old.kvPos(srcOld)
    end := old.kvPos(srcOld + n)
    copy(new.data[new.kvPos(dstNew):], old.data[begin:end])
}
Inserting keys into a node increases its size, which may cause it to exceed the page size.
In this case, the node is split into multiple smaller nodes.
The maximum allowed key size and value size only guarantee that a single KV pair always
fits on one page. In the worst case, the fat node is split into 3 nodes (one large KV pair in
the middle).
// split a node if it's too big. the results are 1~3 nodes.
func nodeSplit3(old BNode) (uint16, [3]BNode) {
    if old.nbytes() <= BTREE_PAGE_SIZE {
        old.data = old.data[:BTREE_PAGE_SIZE]
        return 1, [3]BNode{old}
    }
    left := BNode{make([]byte, 2*BTREE_PAGE_SIZE)} // might be split later
    right := BNode{make([]byte, BTREE_PAGE_SIZE)}
    nodeSplit2(left, right, old)
    if left.nbytes() <= BTREE_PAGE_SIZE {
        left.data = left.data[:BTREE_PAGE_SIZE]
        return 2, [3]BNode{left, right}
    }
    // the left node is still too large
    leftleft := BNode{make([]byte, BTREE_PAGE_SIZE)}
    middle := BNode{make([]byte, BTREE_PAGE_SIZE)}
    nodeSplit2(leftleft, middle, left)
    assert(leftleft.nbytes() <= BTREE_PAGE_SIZE)
    return 3, [3]BNode{leftleft, middle, right}
}
Inserting a key into a node can result in either 1, 2 or 3 nodes. The parent node must
update itself accordingly. The code for updating an internal node is similar to that for
updating a leaf node.
We have finished the B-tree insertion. Deletion and the rest of the code will be introduced
in the next chapter.
The code for deleting a key from a leaf node is just like other nodeReplace* functions.
case BNODE_NODE:
return nodeDelete(tree, node, idx, key)
default:
panic("bad node!")
}
}
The difference is that we need to merge nodes instead of splitting nodes. A node may be
merged into one of its left or right siblings. The nodeReplace* functions are for updating
links.
case mergeDir == 0:
assert(updated.nkeys() > 0)
nodeReplaceKidN(tree, new, node, idx, updated)
}
return new
}
if idx > 0 {
sibling := tree.get(node.getPtr(idx - 1))
merged := sibling.nbytes() + updated.nbytes() - HEADER
if merged <= BTREE_PAGE_SIZE {
return -1, sibling
}
}
if idx+1 < node.nkeys() {
sibling := tree.get(node.getPtr(idx + 1))
merged := sibling.nbytes() + updated.nbytes() - HEADER
if merged <= BTREE_PAGE_SIZE {
return +1, sibling
}
}
return 0, BNode{}
}
We need to keep track of the root node as the tree grows and shrinks. Let’s start with
deletion.
This is the final interface for B-tree deletion. The height of the tree will be reduced by
one when:
tree.del(tree.root)
if updated.btype() == BNODE_NODE && updated.nkeys() == 1 {
// remove a level
tree.root = updated.getPtr(0)
} else {
tree.root = tree.new(updated)
}
return true
}
// the interface
func (tree *BTree) Insert(key []byte, val []byte) {
    assert(len(key) != 0)
    assert(len(key) <= BTREE_MAX_KEY_SIZE)
    assert(len(val) <= BTREE_MAX_VAL_SIZE)
    if tree.root == 0 {
        // create the first node
        root := BNode{data: make([]byte, BTREE_PAGE_SIZE)}
        root.setHeader(BNODE_LEAF, 2)
        // a dummy key, this makes the tree cover the whole key space.
        // thus a lookup can always find a containing node.
        nodeAppendKV(root, 0, 0, nil, nil)
        nodeAppendKV(root, 1, 0, key, val)
        tree.root = tree.new(root)
        return
    }
    node := tree.get(tree.root)
    tree.del(tree.root)
}
tree.root = tree.new(root)
} else {
tree.root = tree.new(splitted[0])
}
}
1. A new root node is created when the old root is split into multiple nodes.
2. When inserting the first key, create the first leaf node as the root.
There is a little trick here. We insert an empty key into the tree when we create the first
node. The empty key is the lowest possible key by sorting order, so it makes the lookup
function nodeLookupLE always successful, eliminating the case of failing to find a node that
contains the input key.
Since our data structure code is pure data structure code (without IO), the page allocation
code is isolated via 3 callbacks. Below is the container code for testing our B-tree. It keeps
pages in an in-memory hashmap without persisting them to disk. In the next chapter,
we’ll implement persistence without modifying the B-tree code.
type C struct {
    tree  BTree
    ref   map[string]string
    pages map[uint64]BNode
}
func newC() *C {
    pages := map[uint64]BNode{}
    return &C{
        tree: BTree{
            get: func(ptr uint64) BNode {
                node, ok := pages[ptr]
                assert(ok)
                return node
            },
            new: func(node BNode) uint64 {
                assert(node.nbytes() <= BTREE_PAGE_SIZE)
                key := uint64(uintptr(unsafe.Pointer(&node.data[0])))
                assert(pages[key].data == nil)
                pages[key] = node
                return key
            },
            del: func(ptr uint64) {
                _, ok := pages[ptr]
                assert(ok)
                delete(pages, ptr)
            },
        },
        ref:   map[string]string{},
        pages: pages,
    }
}
We use a reference map to record each B-tree update, so that we can verify the correctness
of a B-tree later.
This B-tree implementation is pretty minimal, but minimal is good for the purpose
of learning. Real-world implementations can be much more complicated and contain
practical optimizations.
1. Use different formats for leaf nodes and internal nodes. Leaf nodes do not need
pointers and internal nodes do not need values. This saves some space.
2. One of the lengths of the key or value is redundant — the length of the KV pair can
be inferred from the offset of the next key.
3. The first key of a node is not needed because it’s inherited from a link of its parent.
4. Add a checksum to detect data corruption.
The next step in building a KV store is to persist our B-tree to the disk, which is the topic
of the next chapter.
The B-tree data structure from the previous chapter can be dumped to disk easily. Let’s
build a simple KV store on top of it. Since our B-tree implementation is immutable, we’ll
allocate disk space in an append-only manner. Reusing disk space is deferred to the next
chapter.
As mentioned in previous chapters, persisting data to disk is more than just dumping data
into files. There are a couple of considerations:
1. Crash recovery: This includes database process crashes, OS crashes, and power
failures. The database must be in a usable state after a reboot.
2. Durability: After a successful response from the database, the data involved is
guaranteed to persist, even after a crash. In other words, persistence occurs before
responding to the client.
There are many materials describing databases using the ACID jargon (atomicity,
consistency, isolation, durability), but these concepts are not orthogonal and are hard to
explain, so let’s focus on our practical example instead.
1. The immutable aspect of our B-tree: Updating the B-tree does not touch the
previous version of the B-tree, which makes crash recovery easy. Should the
update go wrong, we can simply recover to the previous version.
2. Durability is achieved via the fsync Linux syscall. Normal file IO via write or mmap
goes to the page cache first, and the system has to flush the page cache to the disk later.
The fsync syscall blocks until all dirty pages are flushed.
How do we recover to the previous version if an update goes wrong? We can split the
update into two phases:
The first phase may involve writing multiple pages to the disk, which is generally not atomic.
But the second phase involves only a single pointer and can be done in an atomic single
page write. This makes the whole operation atomic: the update will simply not happen
if the database crashes.
The first phase must be persisted before the second phase, otherwise, the root pointer
could point to a corrupted (partly persisted) version of the tree after a crash. There should
be an fsync between the two phases (to serve as a barrier).
And the second phase should also be fsync’d before responding to the client.
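The ordering described above (write the new pages, fsync, switch the root pointer, fsync again) can be sketched with ordinary file IO. This is only an illustration: the function names are hypothetical, and the layout of page 0 here (root pointer followed by page count, 16 bytes in total) is an assumption made for the sketch, not the format used by the book's KV store.

```go
package main

import (
    "encoding/binary"
    "fmt"
    "os"
    "path/filepath"
)

const pageSize = 4096

// commit appends new pages after the `used` pages already in the file,
// then publishes them by rewriting the root pointer stored in page 0.
func commit(fp *os.File, used uint64, pages [][]byte, root uint64) error {
    // phase 1: write the new pages; this is not atomic as a whole
    for i, page := range pages {
        off := int64(used+uint64(i)) * pageSize
        if _, err := fp.WriteAt(page, off); err != nil {
            return err
        }
    }
    if err := fp.Sync(); err != nil { // barrier between the two phases
        return err
    }
    // phase 2: one small write that switches to the new version
    var meta [16]byte
    binary.LittleEndian.PutUint64(meta[0:], root)
    binary.LittleEndian.PutUint64(meta[8:], used+uint64(len(pages)))
    if _, err := fp.WriteAt(meta[:], 0); err != nil {
        return err
    }
    return fp.Sync() // durable before responding to the client
}

// demo commits one new page to a fresh file and reads page 0 back.
func demo() (root, used uint64, err error) {
    dir, err := os.MkdirTemp("", "commit-demo")
    if err != nil {
        return 0, 0, err
    }
    defer os.RemoveAll(dir)
    fp, err := os.OpenFile(filepath.Join(dir, "db"), os.O_RDWR|os.O_CREATE, 0o644)
    if err != nil {
        return 0, 0, err
    }
    defer fp.Close()
    // page 0 is reserved, so 1 page is already in use
    if err = commit(fp, 1, [][]byte{make([]byte, pageSize)}, 1); err != nil {
        return 0, 0, err
    }
    var meta [16]byte
    if _, err = fp.ReadAt(meta[:], 0); err != nil {
        return 0, 0, err
    }
    return binary.LittleEndian.Uint64(meta[0:]), binary.LittleEndian.Uint64(meta[8:]), nil
}

func main() {
    fmt.Println(demo())
}
```

If the process crashes before the second phase, page 0 still points at the old root, so the newly written pages are simply ignored.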
6.2 mmap-Based IO
The contents of a disk file can be mapped to a virtual address using the mmap syscall.
Reading from this address initiates transparent disk IO, which is the same as reading the
file via the read syscall, but without the need for a user-space buffer and the overhead of a
syscall. The mapped address is a proxy to the page cache. Modifying data via it is the same
as the write syscall.
mmap is convenient, and we’ll use it for our KV store. However, the use of mmap is not
essential.
if fi.Size()%BTREE_PAGE_SIZE != 0 {
return 0, nil, errors.New("File size is not a multiple of page size.")
}
mmapSize := 64 << 20
assert(mmapSize%BTREE_PAGE_SIZE == 0)
for mmapSize < int(fi.Size()) {
mmapSize *= 2
}
// mmapSize can be larger than the file
syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED,
)
if err != nil {
return 0, nil, fmt.Errorf("mmap: %w", err)
}
The above function creates an initial mapping that is at least the size of the file. The size of
the mapping can be larger than the file size, and the range past the end of the file is not
accessible (SIGBUS), but the file can be extended later.
We may have to extend the range of the mapping as the file grows. The syscall for extending
an mmap range is mremap. Unfortunately, we may not be able to keep the starting address
when extending a range by remapping. Our approach to extending mappings is to use
multiple mappings: create a new mapping for the overflow file range.
type KV struct {
    Path string
    // internals
    fp   *os.File
    tree BTree
    mmap struct {
        file   int      // file size, can be larger than the database size
        total  int      // mmap size, can be larger than the file size
        chunks [][]byte // multiple mmaps, can be non-continuous
    }
    page struct {
        flushed uint64   // database size in number of pages
        temp    [][]byte // newly allocated pages
    }
}
return nil
}
db.mmap.total += db.mmap.total
db.mmap.chunks = append(db.mmap.chunks, chunk)
return nil
}
The size of the new mapping increases exponentially so that we don’t have to call mmap
frequently.
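With multiple mappings, a page number can no longer be turned into an address with a single multiplication. The sketch below shows one way to locate a page across several chunks. It uses ordinary byte slices in place of real mmap regions so that it can run anywhere, and the function name pageAt is hypothetical.

```go
package main

import "fmt"

const pageSize = 4096

// pageAt returns the page numbered ptr from a list of chunks that together
// cover the file contiguously. Chunk sizes are multiples of pageSize.
func pageAt(chunks [][]byte, ptr uint64) ([]byte, bool) {
    start := uint64(0) // page number of the first page in the current chunk
    for _, chunk := range chunks {
        end := start + uint64(len(chunk))/pageSize
        if ptr < end {
            off := (ptr - start) * pageSize
            return chunk[off : off+pageSize], true
        }
        start = end
    }
    return nil, false // the page is beyond the mapped range
}

func main() {
    // sizes follow the doubling scheme: 2 pages, then 2 more, then 4 more
    chunks := [][]byte{
        make([]byte, 2*pageSize),
        make([]byte, 2*pageSize),
        make([]byte, 4*pageSize),
    }
    chunks[2][pageSize] = 7 // first byte of page 5
    page, ok := pageAt(chunks, 5)
    fmt.Println(page[0], ok)
}
```

Because the mapping size doubles each time, the number of chunks grows only logarithmically with the file size, so the linear scan stays short.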
The first page of the file is used to store the pointer to the root. Let’s call it the “master
page”. The total number of pages is needed for allocating new nodes, thus it is also stored
there.
|     the_master_page    | pages... | tree_root | pages... |
| btree_root | page_used |                ^                ^
      |            |                      |                |
      +------------+----------------------+                |
                   |                                       |
                   +---------------------------------------+
The function below reads the master page when initializing a database:
data := db.mmap.chunks[0]
root := binary.LittleEndian.Uint64(data[16:])
used := binary.LittleEndian.Uint64(data[24:])
db.tree.root = root
db.page.flushed = used
return nil
}
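As a standalone illustration of the master page, the sketch below encodes and decodes the two fields at the same offsets the code above reads from (16 and 24). What occupies the first 16 bytes is not shown in this text, so the sketch assumes a signature there, and the signature value is made up. The sanity checks are likewise an assumption about what a careful loader might verify.

```go
package main

import (
    "encoding/binary"
    "errors"
    "fmt"
)

// sig is a made-up 16-byte signature used only in this sketch
const sig = "ExampleDBSig0001"

// masterEncode builds the first 32 bytes of the master page.
func masterEncode(root, used uint64) []byte {
    var data [32]byte
    copy(data[:16], sig)
    binary.LittleEndian.PutUint64(data[16:], root)
    binary.LittleEndian.PutUint64(data[24:], used)
    return data[:]
}

// masterDecode validates and reads the master page.
// filePages is the size of the file in pages.
func masterDecode(data []byte, filePages uint64) (root, used uint64, err error) {
    if len(data) < 32 || string(data[:16]) != sig {
        return 0, 0, errors.New("bad signature")
    }
    root = binary.LittleEndian.Uint64(data[16:])
    used = binary.LittleEndian.Uint64(data[24:])
    // the master page itself counts as used, the used pages must fit
    // in the file, and the root must point inside the used range
    if used < 1 || used > filePages || root >= used {
        return 0, 0, errors.New("bad master page")
    }
    return root, used, nil
}

func main() {
    data := masterEncode(3, 5)
    fmt.Println(masterDecode(data, 8))
}
```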
Below is the function for updating the master page. Unlike the code for reading, it doesn’t
use the mapped address for writing. This is because modifying a page via mmap is not
atomic. The kernel could flush the page midway and corrupt the disk file, while a small
write that doesn’t cross the page boundary is guaranteed to be atomic.
We’ll simply append new pages to the end of the database until we add a free list in the
next chapter.
And new pages are kept temporarily in memory until copied to the file later (after possibly
extending the file).
type KV struct {
// omitted...
page struct {
Before writing the pending pages, we may need to extend the file first. The corresponding
syscall is fallocate.
db.mmap.file = fileSize
return nil
}
// btree callbacks
db.tree.get = db.pageGet
db.tree.new = db.pageNew
db.tree.del = db.pageDel
// done
return nil
fail:
db.Close()
return fmt.Errorf("KV.Open: %w", err)
}
// cleanups
func (db *KV) Close() {
    for _, chunk := range db.mmap.chunks {
        err := syscall.Munmap(chunk)
        assert(err == nil)
    }
    _ = db.fp.Close()
}
Unlike queries, update operations must persist the data before returning.
// read the db
func (db *KV) Get(key []byte) ([]byte, bool) {
    return db.tree.Get(key)
}
// update the db
Our KV store is functional, but the file can’t grow forever as we update the database. We’ll
finish our KV store by reusing disk pages in the next chapter.
Since our B-tree is immutable, every update to the KV store creates new nodes along the path
instead of updating the current nodes, leaving some nodes unreachable from the latest version.
We need to reuse these unreachable nodes from old versions; otherwise, the database file
will grow indefinitely.
To reuse these pages, we’ll add a persistent free list to keep track of unused pages. Update
operations reuse pages from the list before appending new pages, and unused pages from
the current version are added to the list.
The list is used as a stack (first-in-last-out). Each update operation can both remove from
and add to the top of the list.
The free list is also immutable like our B-tree. Each node contains:
const BNODE_FREE_LIST = 3
const FREE_LIST_HEADER = 4 + 8 + 8
const FREE_LIST_CAP = (BTREE_PAGE_SIZE - FREE_LIST_HEADER) / 8
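To make the constants concrete, here is a minimal sketch of accessors for a free list node. It assumes a 4096-byte page and a header laid out as type (2 bytes), size (2 bytes), total (8 bytes), next (8 bytes), followed by 8-byte page pointers; only the header size of 4 + 8 + 8 bytes is given by the constants above, so the field order and the helper names are assumptions.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	pageSize       = 4096 // assumed value of BTREE_PAGE_SIZE
	freeListHeader = 4 + 8 + 8
	freeListCap    = (pageSize - freeListHeader) / 8
)

// Assumed layout: | type 2B | size 2B | total 8B | next 8B | pointers size*8B |

// flnSize returns the number of pointers stored in the node.
func flnSize(node []byte) int { return int(binary.LittleEndian.Uint16(node[2:])) }

// flnNext returns the page number of the next node (0 means none).
func flnNext(node []byte) uint64 { return binary.LittleEndian.Uint64(node[12:]) }

// flnPtr reads the idx-th page pointer.
func flnPtr(node []byte, idx int) uint64 {
	return binary.LittleEndian.Uint64(node[freeListHeader+8*idx:])
}

// flnSetPtr writes the idx-th page pointer.
func flnSetPtr(node []byte, idx int, ptr uint64) {
	binary.LittleEndian.PutUint64(node[freeListHeader+8*idx:], ptr)
}

// flnSetHeader stores the pointer count and the link to the next node.
func flnSetHeader(node []byte, size uint16, next uint64) {
	binary.LittleEndian.PutUint16(node[2:], size)
	binary.LittleEndian.PutUint64(node[12:], next)
}

func main() {
	node := make([]byte, pageSize)
	flnSetHeader(node, 2, 9) // two pointers, next node at page 9
	flnSetPtr(node, 0, 100)
	flnSetPtr(node, 1, 101)
	fmt.Println(freeListCap, flnSize(node), flnNext(node), flnPtr(node, 1))
}
```

With these assumed sizes, one node holds 509 page pointers.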
The FreeList type consists of the pointer to the head node and callbacks for managing
disk pages.
These callbacks are different from the B-tree because the pages used by the list are managed
by the list itself.
• The new callback is only for appending new pages since the free list must reuse pages
from itself.
• There is no del callback because the free list adds unused pages to itself.
• The use callback registers a pending update to a reused page.
Getting the nth item from the list is just a simple list traversal.
Updating the list is tricky. It first removes popn items from the list, then adds the freed
items to the list, which can be divided into 3 phases:
1. If the head node holds no more than popn items, remove the whole node. The node itself
will be added to the list later. Repeat this step until it is no longer possible.
2. We may need to remove some items from the list and possibly add some new items
to the list. Updating the list head requires new pages, and new pages should be
reused from the items of the list itself. Pop some items from the list one by one until
there are enough pages to reuse for the next phase.
3. Modify the list by adding new nodes.
// done
flnSetTotal(fl.get(fl.head), uint64(total+len(freed)))
}
if len(reuse) > 0 {
// reuse a pointer from the list
fl.head, reuse = reuse[0], reuse[1:]
fl.use(fl.head, new)
} else {
// or append a page to house the new node
fl.head = fl.new(new)
}
}
assert(len(reuse) == 0)
}
The data structure is modified. Temporary pages are kept in a map keyed by their assigned
page numbers. Removed page numbers are recorded in the same map with a nil value.
type KV struct {
// omitted...
page struct {
flushed uint64 // database size in number of pages
nfree int // number of pages taken from the free list
nappend int // number of pages to be appended
// newly allocated or deallocated pages keyed by the pointer.
// nil value denotes a deallocated page.
updates map[uint64][]byte
}
}
The pageGet function is modified to also return temporary pages because the free list code
depends on this behavior.
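A minimal sketch of this lookup order, with plain maps standing in for the mmap-ed file and the pending updates. The `pager` type and its `disk` field are hypothetical stand-ins, not the book's types.

```go
package main

import "fmt"

// pager is a simplified stand-in for the page management part of the KV type.
type pager struct {
	disk    map[uint64][]byte // stand-in for pages already written to the file
	updates map[uint64][]byte // pending pages; a nil value marks a deallocated page
}

// pageGet returns a pending page if there is one, else the page on disk.
func (p *pager) pageGet(ptr uint64) []byte {
	if page, ok := p.updates[ptr]; ok {
		if page == nil {
			panic("read of a deallocated page")
		}
		return page // temporary page, not yet written
	}
	return p.disk[ptr] // page that was written earlier
}

func main() {
	p := &pager{
		disk:    map[uint64][]byte{1: []byte("old")},
		updates: map[uint64][]byte{2: []byte("new")},
	}
	fmt.Printf("%s %s\n", p.pageGet(1), p.pageGet(2))
}
```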
The function for allocating a B-tree page is changed to reuse pages from the free list first.
Removed pages are marked for the free list update later.
Callbacks for appending a new page and reusing a page for the free list:
Before extending the file and writing pages to disk, we must update the free list first since
it also creates pending writes.
}
return nil
}
Step 5: Done
The KV store is finished. It is persistent and crash resistant, although it can only be accessed
sequentially.
8.1 Introduction
The first step in building a relational DB on top of a KV store is to add tables. A table is just
a bunch of rows and columns. A subset of the columns is defined as the “primary key”;
primary keys are unique, so they can be used to refer to a row (in queries and secondary
indexes).
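How does a table fit into a KV store? We can split a row into two parts: the primary key
columns go into the “key” part of the KV store, and the remaining columns go into the “value” part.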
This allows us to do both point queries and range queries on the primary key. For now,
we’ll only consider queries on the primary key; the use of secondary indexes is deferred to
a later chapter.
Below is the definition of rows and cells. For now, we only support two data types (int64
and bytes).
const (
TYPE_ERROR = 0
TYPE_BYTES = 1
TYPE_INT64 = 2
)
// table cell
type Value struct {
Type uint32
I64 int64
Str []byte
}
// table row
type Record struct {
Cols []string
Vals []Value
}
type DB struct {
Path string
// internals
kv KV
tables map[string]*TableDef // cached table definition
}
// table definition
type TableDef struct {
// user defined
Name string
Types []uint32 // column types
Cols []string // column names
PKeys int // the first `PKeys` columns are the primary key
// auto-assigned B-tree key prefixes for different tables
Prefix uint32
}
To support multiple tables, the keys in the KV store are prefixed with a unique 32-bit
number.
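A small sketch of what the prefixing could look like. It assumes the prefix is written as 4 big-endian bytes in front of the already-encoded primary key, so that all keys of one table stay contiguous in the B-tree; the byte order and the helper names are assumptions for illustration.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeKey prepends the 32-bit table prefix to an already-encoded primary key.
func encodeKey(prefix uint32, pk []byte) []byte {
	var buf [4]byte
	binary.BigEndian.PutUint32(buf[:], prefix) // big-endian keeps numeric order
	return append(buf[:], pk...)
}

// keyLess compares two KV keys the way the B-tree would.
func keyLess(a, b []byte) bool { return bytes.Compare(a, b) < 0 }

func main() {
	k1 := encodeKey(2, []byte("zzz"))
	k2 := encodeKey(3, []byte("aaa"))
	// Keys of table 2 sort before keys of table 3, whatever the row content.
	fmt.Println(keyLess(k1, k2))
}
```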
Table definitions have to be stored somewhere, so we’ll use an internal table to store them.
We’ll also add an internal table to store the metadata used by the DB itself.
Let’s implement the point query by the primary key. Range queries will be added in the
next chapter.
The method for encoding data into bytes and decoding from bytes will be explained in
the next chapter. For now, any serialization scheme will do for this chapter.
tdef := &TableDef{}
err = json.Unmarshal(rec.Get("def").Str, tdef)
assert(err == nil)
return tdef
}
8.4 Updates
An update can either insert a new row or replace an existing row. The B-tree interface
is modified to support different update modes.
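The three modes can be illustrated with a map standing in for the B-tree. The mode names match the ones used in this chapter, but their numeric values and the `set` helper are assumptions made for this sketch.

```go
package main

import "fmt"

// Update modes; the numeric values are assumed for illustration.
const (
	MODE_UPSERT      = 0 // insert or replace
	MODE_UPDATE_ONLY = 1 // update existing keys only
	MODE_INSERT_ONLY = 2 // add new keys only
)

// set applies one update to a map standing in for the B-tree.
// It reports whether a new key was added and whether the store changed.
func set(kv map[string]string, key, val string, mode int) (added, updated bool) {
	old, exists := kv[key]
	switch {
	case mode == MODE_UPDATE_ONLY && !exists:
		return false, false
	case mode == MODE_INSERT_ONLY && exists:
		return false, false
	}
	kv[key] = val
	return !exists, !exists || old != val
}

func main() {
	kv := map[string]string{}
	added, updated := set(kv, "k", "1", MODE_UPDATE_ONLY) // no such key: rejected
	fmt.Println(added, updated)
	added, updated = set(kv, "k", "1", MODE_INSERT_ONLY) // new key: added
	fmt.Println(added, updated)
	added, updated = set(kv, "k", "2", MODE_UPSERT) // existing key: replaced
	fmt.Println(added, updated)
}
```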
// add a record
func (db *DB) Set(table string, rec Record, mode int) (bool, error) {
tdef := getTableDef(db, table)
if tdef == nil {
return false, fmt.Errorf("table not found: %s", table)
}
return dbUpdate(db, tdef, rec, mode)
}
func (db *DB) Insert(table string, rec Record) (bool, error) {
return db.Set(table, rec, MODE_INSERT_ONLY)
}
func (db *DB) Update(table string, rec Record) (bool, error) {
return db.Set(table, rec, MODE_UPDATE_ONLY)
}
func (db *DB) Upsert(table string, rec Record) (bool, error) {
return db.Set(table, rec, MODE_UPSERT)
}
Three steps:
The prefix numbers are allocated incrementally from the next_prefix key of the TDEF_META
internal table. The table definitions are stored as JSON in the TDEF_TABLE table.
Although we have added table structures, the result is still pretty much a KV store. Some
important aspects are missing:
We have implemented table structures on top of a KV store and we’re able to retrieve
records by primary key. In this chapter, we’ll add the ability to retrieve a range of records
in sorted order.
The first step is to add the range query to the B-tree. The BIter type allows us to traverse
a B-tree iteratively.
// B-tree iterator
type BIter struct {
tree *BTree
path []BNode // from root to leaf
pos []uint16 // indexes into nodes
}
The BIter is a path from the root node to the KV pair in a leaf node. Moving the iterator
is simply moving the positions or nodes to a sibling.
BTree.SeekLE is the function for finding the initial position in a range query. It is just a
normal B-tree lookup with the path recorded.
// find the closest position that is less or equal to the input key
func (tree *BTree) SeekLE(key []byte) *BIter {
iter := &BIter{tree: tree}
for ptr := tree.root; ptr != 0; {
node := tree.get(ptr)
idx := nodeLookupLE(node, key)
iter.path = append(iter.path, node)
iter.pos = append(iter.pos, idx)
if node.btype() == BNODE_NODE {
ptr = node.getPtr(idx)
} else {
ptr = 0
}
}
return iter
}
The nodeLookupLE function only works for the “less than or equal” operator in range
queries. For the other 3 operators (less than; greater than; greater than or equal), the result
may be off by one. We’ll fix this with the BTree.Seek function.
const (
CMP_GE = +3 // >=
CMP_GT = +2 // >
CMP_LT = -2 // <
CMP_LE = -3 // <=
)
// find the closest position to a key with respect to the `cmp` relation
func (tree *BTree) Seek(key []byte, cmp int) *BIter {
iter := tree.SeekLE(key)
if cmp != CMP_LE && iter.Valid() {
cur, _ := iter.Deref()
if !cmpOK(cur, cmp, key) {
// off by one
if cmp > 0 {
iter.Next()
} else {
iter.Prev()
}
}
}
return iter
}
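The Seek function relies on a cmpOK helper that is not listed. One plausible shape, assuming it simply checks whether a key satisfies the comparison against a reference key using bytes.Compare, is sketched below.

```go
package main

import (
	"bytes"
	"fmt"
)

const (
	CMP_GE = +3 // >=
	CMP_GT = +2 // >
	CMP_LT = -2 // <
	CMP_LE = -3 // <=
)

// cmpOK reports whether `key cmp ref` holds, e.g. key >= ref for CMP_GE.
func cmpOK(key []byte, cmp int, ref []byte) bool {
	r := bytes.Compare(key, ref)
	switch cmp {
	case CMP_GE:
		return r >= 0
	case CMP_GT:
		return r > 0
	case CMP_LT:
		return r < 0
	case CMP_LE:
		return r <= 0
	default:
		panic("unknown comparison")
	}
}

func main() {
	// SeekLE lands on the key "b" when asked for "b". That position is
	// fine for <= and >=, but it is off by one for the strict operators.
	cur, key := []byte("b"), []byte("b")
	fmt.Println(cmpOK(cur, CMP_GE, key), cmpOK(cur, CMP_GT, key), cmpOK(cur, CMP_LT, key))
}
```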
To support range queries, the serialized primary key must be compared correctly in the
KV store. One way to do this is to deserialize the primary key and compare it column by
column. We’ll use another way: make the serialized key bytes reflect the sort order of the
original values, so that keys can be compared correctly by bytes.Compare or memcmp
without deserializing them first. Let’s call this technique “order-preserving encoding”.
It can be used without controlling the key comparison function of the underlying KV
store.
For integers, you can easily see that unsigned big-endian integers are order-preserving —
the most significant bits come first in big-endian format. And null-terminated strings are
also order-preserving.
For signed integers, the problem is that negative numbers have the most significant bit (sign
bit) set. We need to flip the sign bit before big-endian encoding them to make negative
numbers lower.
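The effect of flipping the sign bit can be checked with a small experiment. The helper names `encodeInt64`, `decodeInt64`, and `lessEncoded` are made up for this sketch; the encoding itself is the sign-flip plus big-endian scheme described above.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeInt64 flips the sign bit, then writes the value in big-endian order.
func encodeInt64(v int64) []byte {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], uint64(v)+(1<<63))
	return buf[:]
}

// decodeInt64 reverses encodeInt64.
func decodeInt64(b []byte) int64 {
	return int64(binary.BigEndian.Uint64(b) - (1 << 63))
}

// lessEncoded compares two integers by their encoded bytes only.
func lessEncoded(a, b int64) bool {
	return bytes.Compare(encodeInt64(a), encodeInt64(b)) < 0
}

func main() {
	// Without the flip, -1 would be encoded as all 0xff bytes and sort last.
	fmt.Println(lessEncoded(-1, 0), lessEncoded(-5, 3), lessEncoded(3, -5))
	fmt.Println(decodeInt64(encodeInt64(-42)))
}
```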
// order-preserving encoding
func encodeValues(out []byte, vals []Value) []byte {
for _, v := range vals {
switch v.Type {
case TYPE_INT64:
var buf [8]byte
u := uint64(v.I64) + (1 << 63)
binary.BigEndian.PutUint64(buf[:], u)
out = append(out, buf[:]...)
case TYPE_BYTES:
out = append(out, escapeString(v.Str)...)
out = append(out, 0) // null-terminated
default:
panic("what?")
}
}
return out
}
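The listing calls escapeString, which is not shown. The sketch below is one possible version, assuming the escaping rule where the byte 0x00 becomes 0x01 0x01 and the byte 0x01 becomes 0x01 0x02, so that the terminating null byte never appears inside the encoded string. The unescapeString and encodedLess helpers are additions for the demonstration.

```go
package main

import (
	"bytes"
	"fmt"
)

// escapeString replaces 0x00 with 0x01 0x01 and 0x01 with 0x01 0x02.
func escapeString(in []byte) []byte {
	out := make([]byte, 0, len(in))
	for _, ch := range in {
		if ch <= 1 {
			out = append(out, 0x01, ch+1)
		} else {
			out = append(out, ch)
		}
	}
	return out
}

// unescapeString reverses escapeString.
func unescapeString(in []byte) []byte {
	out := make([]byte, 0, len(in))
	for i := 0; i < len(in); i++ {
		if in[i] == 0x01 && i+1 < len(in) {
			i++
			out = append(out, in[i]-1)
		} else {
			out = append(out, in[i])
		}
	}
	return out
}

// encodedLess compares two strings by their escaped, null-terminated encodings.
func encodedLess(a, b []byte) bool {
	ea := append(escapeString(a), 0)
	eb := append(escapeString(b), 0)
	return bytes.Compare(ea, eb) < 0
}

func main() {
	s := []byte("a\x00b\x01c")
	fmt.Printf("%q\n", escapeString(s))
	fmt.Println(encodedLess([]byte("a"), []byte("a\x00")), encodedLess([]byte("a\x00"), []byte("a\x01")))
}
```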
The problem with null-terminated strings is that they cannot contain the null byte. We’ll
fix this by “escaping” the null byte: "\x00" is replaced by "\x01\x01", and the escaping byte
"\x01" itself is replaced by "\x01\x02". This still preserves the sort order.
To wrap things up, we’ll add the Scanner type, which allows us to iterate through a range
of records in sorted order.
Cmp2 int
Key1 Record
Key2 Record
// internal
tdef *TableDef
iter *BIter // the underlying B-tree iterator
keyEnd []byte // the encoded Key2
}
if err != nil {
return err
}
req.tdef = tdef
Point queries are just special cases of range queries, so why not get rid of them?
sc := Scanner{
Cmp1: CMP_GE,
Cmp2: CMP_LE,
Key1: *rec,
Key2: *rec,
}
if err := dbScan(db, tdef, &sc); err != nil {
return false, err
}
if sc.Valid() {
sc.Deref(rec)
return true, nil
} else {
return false, nil
}
}
We only allow range queries on the full primary key, but range queries on a prefix of the
primary key are also legitimate. We’ll fix this in the next chapter, along with secondary
indexes.
In this chapter, we’ll add extra indexes (also known as secondary indexes) to our database.
Queries will no longer be restricted to the primary key.
The Indexes and IndexPrefixes fields are added to the table definition. Like the table
itself, each index is assigned a key prefix in the KV store.
// table definition
type TableDef struct {
// user defined
Name string
Types []uint32 // column types
Cols []string // column names
PKeys int // the first `PKeys` columns are the primary key
Indexes [][]string
// auto-assigned B-tree key prefixes for different tables/indexes
Prefix uint32
IndexPrefixes []uint32
}
To find a row via an index, the index must contain a copy of the primary key. We’ll
accomplish this by appending primary key columns to the index; this also makes the index
key unique, which is assumed by the B-tree lookup code.
index = append(index, c)
}
}
assert(len(index) < len(tdef.Cols))
return index, nil
}
Indexes are checked and have the primary key appended before creating a new table.
After updating a row, we need to remove the old row from the indexes. The B-tree
interface is modified to return the previous value of an update.
Val []byte
Mode int
}
Below is the function for adding or removing a record from the indexes. Here we encounter
a problem: updating a table with secondary indexes involves multiple keys in the KV store,
which should be done atomically. We’ll fix that in a later chapter.
const (
INDEX_ADD = 1
INDEX_DEL = 2
)
// maintain indexes
if req.Updated && !req.Added {
decodeValues(req.Old, values[tdef.PKeys:]) // get the old row
indexOp(db, tdef, Record{tdef.Cols, values}, INDEX_DEL)
}
if req.Updated {
indexOp(db, tdef, rec, INDEX_ADD)
}
return added, nil
}
// maintain indexes
if deleted {
// likewise...
}
return true, nil
}
We’ll also implement range queries using a prefix of an index. For example, we can do x
< a AND a < y on the index [a, b, c], which contains the prefix [a]. Selecting an index
is simply matching columns by the input prefix. The primary key is considered before
secondary indexes.
}
if winner == -2 {
return -2, fmt.Errorf("no index found")
}
return winner, nil
}
We may have to encode extra columns if the input key uses a prefix of an index instead of
the full index. For example, for a query v1 < a with the index [a, b], we cannot use [v1]
< key as the underlying B-tree query, because any key [v1, v2] satisfies [v1] < [v1, v2].
Instead, we can use [v1, MAX] < key in this case, where MAX is the maximum possible
value for column b. Below is the function for encoding a partial query key with additional
columns.
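To see the padding idea in isolation, here is a simplified sketch restricted to an index made of int64 columns only. The function `encodePartial` and the helpers around it are hypothetical and much smaller than a real implementation; they assume that missing columns are padded with 0xff bytes for the > and <= operators and left out entirely for >= and <.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

const (
	CMP_GE = +3 // >=
	CMP_GT = +2 // >
	CMP_LT = -2 // <
	CMP_LE = -3 // <=
)

// encodeInt64 appends the order-preserving encoding of v.
func encodeInt64(out []byte, v int64) []byte {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], uint64(v)+(1<<63))
	return append(out, buf[:]...)
}

// encodePartial encodes the leading columns of an all-int64 index with
// ncols columns, padding the missing columns according to cmp.
func encodePartial(vals []int64, ncols int, cmp int) []byte {
	var out []byte
	for _, v := range vals {
		out = encodeInt64(out, v)
	}
	// For > and <=, missing columns act as the maximum value (all 0xff).
	// For >= and <, nothing is added: the bare prefix already sorts
	// before every full key that extends it.
	if cmp == CMP_GT || cmp == CMP_LE {
		for i := len(vals); i < ncols; i++ {
			out = append(out, bytes.Repeat([]byte{0xff}, 8)...)
		}
	}
	return out
}

// greater reports whether key > bound in byte order.
func greater(key, bound []byte) bool { return bytes.Compare(key, bound) > 0 }

func main() {
	bound := encodePartial([]int64{1}, 2, CMP_GT) // query: a > 1 on index [a, b]
	row1 := encodePartial([]int64{1, 5}, 2, CMP_GE)
	row2 := encodePartial([]int64{2, -7}, 2, CMP_GE)
	fmt.Println(greater(row1, bound), greater(row2, bound))
}
```

The row (1, 5) is rejected because its first column is not greater than 1, while (2, -7) is accepted.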
For the int64 type, the maximum value is encoded as all 0xff bytes. The problem is that
there is no maximum value for strings. What we can do is use "\xff" as the encoding
of the “pseudo maximum string value”, and change the normal string encoding so that it never
starts with "\xff".
The first byte of a string is escaped by the "\xfe" byte if it’s "\xff" or "\xfe". Thus all
string encodings are lower than "\xff".
pos := 0
if len(in) > 0 && in[0] >= 0xfe {
out[0] = 0xfe
out[1] = in[0]
pos += 2
in = in[1:]
}
// omitted...
return out
}
The index key contains all primary key columns so that we can find the full row. The
Scanner type is now aware of the selected index.
tdef := sc.tdef
rec.Cols = tdef.Cols
rec.Vals = rec.Vals[:0]
key, val := sc.iter.Deref()
if sc.indexNo < 0 {
// primary key, decode the KV pair
// omitted...
} else {
// secondary index
// The "value" part of the KV store is not used by indexes
assert(len(val) == 0)
The dbScan function is modified to use secondary indexes. Range queries by index
prefixes now work as well, and it can also scan the whole table without any key at all (the
primary key is selected if no columns are present).
// select an index
req.db = db
req.tdef = tdef
req.indexNo = indexNo
Step 5: Congratulations
We have implemented some major features of our relational DB: tables, range queries, and
secondary indexes. We can start adding more features and a query language to our DB.
However, some major aspects are still missing: transactions and concurrency, which will
be explored in later chapters.
For now, we’ll only consider sequential execution, and leave concurrency for the next
chapter.
// KV transaction
type KVTX struct {
// later...
}
// begin a transaction
func (kv *KV) Begin(tx *KVTX)
// end a transaction: commit updates
func (kv *KV) Commit(tx *KVTX) error
// end a transaction: rollback
func (kv *KV) Abort(tx *KVTX)
The methods for reading and updating the KV store are moved to the transaction type.
Note that these methods can no longer fail because they do not perform IO. IO operations
are performed when committing the transaction, which can fail instead.
// KV operations
func (tx *KVTX) Get(key []byte) ([]byte, bool) {
return tx.db.tree.Get(key)
}
func (tx *KVTX) Seek(key []byte, cmp int) *BIter {
return tx.db.tree.Seek(key, cmp)
}
func (tx *KVTX) Update(req *InsertReq) bool {
tx.db.tree.InsertEx(req)
return req.Added
}
func (tx *KVTX) Del(req *DeleteReq) bool {
return tx.db.tree.DeleteEx(req)
}
Similarly, we’ll also add the transaction type for DB, which is a wrapper around the KV
transaction type.
// DB transaction
type DBTX struct {
kv KVTX
db *DB
}
And the read and update methods are also moved to the transaction type.
Modifications to the DB code are mostly changing the arguments of functions, which will
be omitted in the code listing.
The transaction type saves a copy of the in-memory data structure: the pointer to the tree
root and the pointer to the free list head.
// KV transaction
type KVTX struct {
db *KV
// for the rollback
tree struct {
root uint64
}
free struct {
head uint64
}
}
This is used for rollbacks. Rolling back a transaction is simply pointing to the previous tree
root, which can be done trivially even if there is an IO error while writing B-tree data.
// begin a transaction
func (kv *KV) Begin(tx *KVTX) {
tx.db = kv
tx.tree.root = kv.tree.root
tx.free.head = kv.free.head
}
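The rollback itself can be sketched as the mirror image of Begin. The types below are trimmed to the two saved pointers, and this version of rollbackTX is an assumption: a full implementation would presumably also discard the pending page updates.

```go
package main

import "fmt"

// KV is trimmed down to the two in-memory pointers that matter here.
type KV struct {
	tree struct{ root uint64 }
	free struct{ head uint64 }
}

// KVTX saves a copy of the pointers for the rollback.
type KVTX struct {
	db   *KV
	tree struct{ root uint64 }
	free struct{ head uint64 }
}

// Begin records the current tree root and free list head.
func (kv *KV) Begin(tx *KVTX) {
	tx.db = kv
	tx.tree.root = kv.tree.root
	tx.free.head = kv.free.head
}

// rollbackTX points the KV store back at the saved version.
func rollbackTX(tx *KVTX) {
	kv := tx.db
	kv.tree.root = tx.tree.root
	kv.free.head = tx.free.head
}

func main() {
	kv := &KV{}
	kv.tree.root, kv.free.head = 5, 9

	tx := KVTX{}
	kv.Begin(&tx)
	kv.tree.root, kv.free.head = 6, 10 // updates made inside the transaction
	rollbackTX(&tx)
	fmt.Println(kv.tree.root, kv.free.head)
}
```

No page has to be undone on disk: the new pages are simply never referenced.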
Committing a transaction is not much different from how we persisted data before, except
that we have to roll back on errors in the first phase of a commit.
// the page data must reach disk before the master page.
// the `fsync` serves as a barrier here.
if err := kv.fp.Sync(); err != nil {
rollbackTX(tx)
return fmt.Errorf("fsync: %w", err)
}
kv.page.nfree = 0
kv.page.nappend = 0
kv.page.updates = map[uint64][]byte{}
There are not many changes in this chapter, because we have left out an important aspect
— concurrency — which will be explored in the next chapter.
To support concurrent requests, we can first separate transactions into read-only
transactions and read-write transactions. Readers alone can always run concurrently as they
do not modify the data, while writers have to modify the tree and must be serialized (at
least partially).
There are various degrees of concurrency that we can support. We can use a readers-writer
lock (RWLock). It allows the execution of either multiple concurrent readers or a single
writer, but never both at the same time.
The RWLock is a practical technique and is easy to add. However, thanks to the use of
immutable data structures, we can easily implement a greater degree of concurrency.
With a RWLock, readers can be blocked by a writer and vice versa. But with immutable
data structures, a writer creates a new version of the data instead of overwriting the current
version. This allows concurrency between readers and a writer, which is superior to the
RWLock. This is what we’ll implement.
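The idea can be shown in miniature with an immutable map in place of the B-tree. In this sketch, which is only an analogy, copying the whole map stands in for copying a path of B-tree nodes; the `store` type and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// store publishes immutable versions of the data.
type store struct {
	mu      sync.Mutex        // protects `current`
	writer  sync.Mutex        // serializes writers
	current map[string]string // never mutated after it is published
}

// snapshot returns the latest version. The lock is held only briefly,
// so a reader is never blocked for the duration of a write.
func (s *store) snapshot() map[string]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.current
}

// set builds a new version and publishes it, leaving old versions intact.
func (s *store) set(key, val string) {
	s.writer.Lock()
	defer s.writer.Unlock()
	old := s.snapshot()
	next := make(map[string]string, len(old)+1)
	for k, v := range old {
		next[k] = v
	}
	next[key] = val
	s.mu.Lock()
	s.current = next
	s.mu.Unlock()
}

func main() {
	s := &store{current: map[string]string{"a": "1"}}
	snap := s.snapshot() // a reader starts
	s.set("a", "2")      // a writer commits in the meantime
	fmt.Println(snap["a"], s.snapshot()["a"])
}
```

The reader keeps seeing the version it started with, while new readers see the committed update.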
Note that it’s possible to implement a greater degree of concurrency by only serializing
writers partially. For example, if a read-write transaction reads a subset of data and then
uses that data to determine what to write, we might be able to perform read operations
concurrently and serialize only the final commit operation. However, this introduces
new problems: even if the commit operation is serialized, writers may submit conflicting
commits, so we need extra mechanisms to prevent or detect conflicts. We won’t do that
in this book.
3 major changes:
First, we’ll split the transaction type into two: one for read-only transactions and one for
read-write transactions. The B-tree type is moved to the transaction type (a snapshot). Reads
from one transaction won’t be affected by other transactions.
Next, we’ll consider the use of mutexes. Writers are fully serialized, so a single mutex
for writers would do. Some fields are updated by writers and read by readers, such as the
latest tree root and the mmap chunks. These fields need another mutex to protect them.
type KV struct {
// omitted...
mu sync.Mutex
writer sync.Mutex
// omitted...
}
Lastly, the free list needs a redesign because we cannot reuse a page that is still reachable by
a reader. Our solution is to assign an auto-incrementing version number to each version of
the B-tree, and the free list stores the version number along with the page number when a
page is freed. Only the free page with a version number smaller than all current readers
can be reused.
1. The B-tree type is moved to the transaction type and only a root pointer remains
here. Likewise, the free list is also moved.
2. The data structure and code for managing disk pages are also moved to the transaction
type.
3. Added mutexes. The writer mutex is for serializing writers and the mu mutex is for
protecting data fields.
4. Added version numbers. And a list of ongoing readers for tracking the minimum
active version (for the free list). The reader list is maintained as a heap data structure
so that the minimum version is the first element.
type KV struct {
Path string
// internals
fp *os.File
// mod 1: moved the B-tree and the free list
tree struct {
root uint64
}
free FreeListData
mmap struct {
// same; omitted...
}
// mod 2: moved the page management
page struct {
flushed uint64 // database size in number of pages
}
// mod 3: mutexes
mu sync.Mutex
writer sync.Mutex
// mod 4: version number and the reader list
version uint64
readers ReaderList // heap, for tracking the minimum reader version
}
// implements heap.Interface
type ReaderList []*KVReader
// read-only KV transactions
type KVReader struct {
// the snapshot
version uint64
tree BTree
mmap struct {
chunks [][]byte // copied from struct KV. read-only.
}
// for removing from the heap
index int
}
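The heap.Interface methods for ReaderList are not listed. A minimal sketch using the standard container/heap package could look like the following; KVReader is trimmed to the two fields the heap needs, versions are compared with a plain `<` for simplicity, and the addReader, removeReader, and minVersion wrappers are hypothetical helpers.

```go
package main

import (
	"container/heap"
	"fmt"
)

// KVReader is trimmed to the fields used by the heap.
type KVReader struct {
	version uint64
	index   int // position in the heap, kept up to date by Swap/Push/Pop
}

// ReaderList is a min-heap ordered by version.
type ReaderList []*KVReader

func (h ReaderList) Len() int           { return len(h) }
func (h ReaderList) Less(i, j int) bool { return h[i].version < h[j].version }
func (h ReaderList) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index = i
	h[j].index = j
}
func (h *ReaderList) Push(x interface{}) {
	r := x.(*KVReader)
	r.index = len(*h)
	*h = append(*h, r)
}
func (h *ReaderList) Pop() interface{} {
	old := *h
	r := old[len(old)-1]
	*h = old[:len(old)-1]
	r.index = -1
	return r
}

// addReader registers a reader when its transaction begins.
func addReader(h *ReaderList, r *KVReader) { heap.Push(h, r) }

// removeReader unregisters a reader when its transaction ends.
func removeReader(h *ReaderList, r *KVReader) { heap.Remove(h, r.index) }

// minVersion returns the oldest active version, if any reader is active.
func minVersion(h ReaderList) (uint64, bool) {
	if len(h) == 0 {
		return 0, false
	}
	return h[0].version, true
}

func main() {
	var readers ReaderList
	r5, r3 := &KVReader{version: 5}, &KVReader{version: 3}
	addReader(&readers, r5)
	addReader(&readers, r3)
	fmt.Println(minVersion(readers))
	removeReader(&readers, r3)
	fmt.Println(minVersion(readers))
}
```

The index field is what makes removing an arbitrary reader cheap: heap.Remove needs the element's current position.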
The B-tree type is moved into KVReader, and so is the page management function
pageGetMapped. The version and index fields are for the ReaderList heap. We also take a
copy of the mmap chunks because it’s modified by writers.
The KVTX extends the KVReader so that it gets all the read methods. And like the B-tree
type, the free list and the page management data structure are also moved from the KV
type.
// KV transaction
type KVTX struct {
KVReader
db *KV
free FreeList
page struct {
nappend int // number of pages to be appended
// newly allocated or deallocated pages keyed by the pointer.
// nil value denotes a deallocated page.
updates map[uint64][]byte
}
}
To start a read-write transaction, the writer lock must be acquired. Then we can initialize
the B-tree type and the free list type. No additional locks are needed to access fields from
the KV type because the writer is the only thread that can modify anything, except for the
reader list (which is modified by readers).
// begin a transaction
func (kv *KV) Begin(tx *KVTX) {
tx.db = kv
tx.page.updates = map[uint64][]byte{}
tx.mmap.chunks = kv.mmap.chunks
kv.writer.Lock()
tx.version = kv.version
// btree
tx.tree.root = kv.tree.root
tx.tree.get = tx.pageGet
tx.tree.new = tx.pageNew
tx.tree.del = tx.pageDel
// freelist
tx.free.FreeListData = kv.free
tx.free.version = kv.version
tx.free.get = tx.pageGet
tx.free.new = tx.pageAppend
tx.free.use = tx.pageUse
tx.free.minReader = kv.version
kv.mu.Lock()
if len(kv.readers) > 0 {
tx.free.minReader = kv.readers[0].version
}
kv.mu.Unlock()
}
Rolling back a transaction is now a no-op, because nothing in the KV type is modified until
commit.
The KV type is only modified when a transaction is committed (for the tree root and the free
list head).
The free list is changed from a FILO (first-in-last-out) to a FIFO (first-in-first-out); pages
freed by newer versions are added to the list head, and reused pages are removed from the
list tail. This keeps the free list in sorted order (by version number).
To avoid reusing a page that a reader is still reading, reused pages must be from a version
no newer than any reader. That’s why we design the free list to be sorted by version
number.
Version numbers are added to free list nodes and to the master page:
| type | size | total | next | pointer-version-pairs |
| 2B | 2B | 8B | 8B | size * 16B |
The oldest version of all readers was obtained at the beginning of a transaction.
// begin a transaction
func (kv *KV) Begin(tx *KVTX) {
// omitted...
tx.free.minReader = kv.version
kv.mu.Lock()
if len(kv.readers) > 0 {
tx.free.minReader = kv.readers[0].version
}
kv.mu.Unlock()
}
return ptr
}
// a < b
func versionBefore(a, b uint64) bool {
return int64(a-b) < 0
}
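The subtraction trick in versionBefore deserves a quick check. Comparing through a signed difference keeps the ordering correct even if the version counter wraps around, as long as the two versions are less than 2^63 apart. The demonstration below reuses the function as listed.

```go
package main

import (
	"fmt"
	"math"
)

// versionBefore reports whether version a comes before version b.
func versionBefore(a, b uint64) bool {
	return int64(a-b) < 0
}

func main() {
	fmt.Println(versionBefore(1, 2)) // ordinary case
	fmt.Println(versionBefore(2, 1))
	// The counter has wrapped: the maximum value still comes before 0,
	// which a plain `a < b` comparison would get wrong.
	fmt.Println(versionBefore(math.MaxUint64, 0))
}
```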
Adding things to the list head is more complicated than before, but nothing special is
required. We’ll skip the code listing.
// add some new pointers to the head and finalize the update
func (fl *FreeList) Add(freed []uint64)
We have shown a great advantage of immutable data structures: easy concurrency
between readers and a writer. Read-only transactions can run as long as needed; the
only downside is that long-running readers prevent page reuse. However, read-write
transactions are expected to be short, because writers are fully serialized. This degree of
concurrency may be sufficient for some use cases.
We have explored 3 major topics in this book: persistence, indexing, and concurrency.
Now it’s time to add a user interface to our database — a query language — which is the
topic of the next chapter.
The last thing to add to our database is a query language. A query language exposes all
the functionality we have implemented as a human interface.
13.1.1 Statements
The grammar is designed to look like SQL, which looks like English.
13.1.2 Conditions
However, the conditions in our language differ from SQL. Unlike SQL, which uses the
WHERE clause to select rows, we separate conditions into indexing conditions and non-indexing
conditions.
1. The INDEX BY clause explicitly selects the index for the query. It represents an indexed
point query or an indexed range query, and the range can be either open-ended or
closed. It also controls the order of the rows.
2. The FILTER clause selects rows without using indexes. Both the INDEX BY and the
FILTER clauses are optional.
13.1.3 Expressions
The language also contains arbitrary expressions in the SELECT statement, FILTER conditions,
and the UPDATE statement. Expressions are just recursive binary or unary operations.
-a
a * b, a / b
a + b, a - b
a = b, a < b, ... -- all comparisons
NOT a
a AND b
a OR b
(a, b, c, ...) -- tuple
Let’s start by parsing expressions. Expressions are just trees, so let’s define the tree structure
first.
// syntax tree
type QLNode struct {
Like the structure itself, the process for parsing an expression is also recursive. Let’s start
with simple examples.
Consider a subset of the language consisting only of additions and column names:
a
a + b
a + b + c + ...
def parse_add():
node = parse_column()
while parse('+'):
right = parse_column()
node = QLNode(type='+', kids=[node, right])
return node
Now we add the multiplication operator, which has a different precedence. Let’s revisit
the expression a + b: the subexpression a or b could be a multiplication, which should be
applied before the addition (e.g., when b is c * d). We’ll add a level of recursion to
handle this:
def parse_add():
node = parse_mul()
while parse('+'):
right = parse_mul()
node = QLNode(type='+', kids=[node, right])
return node
def parse_mul():
node = parse_column()
while parse('*'):
right = parse_column()
node = QLNode(type='*', kids=[node, right])
return node
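The pseudocode above translates almost line by line into Go. The following runnable sketch is deliberately tiny: column names are single lowercase letters, and the `exprNode` and `parser` types are simplified stand-ins rather than the types used later in this chapter.

```go
package main

import "fmt"

// exprNode is a simplified syntax tree node.
type exprNode struct {
	Type string // "+", "*", or "col"
	Str  string // column name for "col" nodes
	Kids []exprNode
}

type parser struct {
	input string
	idx   int
}

func (p *parser) skipSpace() {
	for p.idx < len(p.input) && p.input[p.idx] == ' ' {
		p.idx++
	}
}

// accept consumes the operator byte op if it comes next in the input.
func (p *parser) accept(op byte) bool {
	p.skipSpace()
	if p.idx < len(p.input) && p.input[p.idx] == op {
		p.idx++
		return true
	}
	return false
}

// parseColumn reads one lowercase letter as a column name.
func (p *parser) parseColumn() exprNode {
	p.skipSpace()
	if p.idx >= len(p.input) || p.input[p.idx] < 'a' || p.input[p.idx] > 'z' {
		panic("expect a column name")
	}
	p.idx++
	return exprNode{Type: "col", Str: p.input[p.idx-1 : p.idx]}
}

// parseMul handles the higher-precedence operator.
func (p *parser) parseMul() exprNode {
	node := p.parseColumn()
	for p.accept('*') {
		right := p.parseColumn()
		node = exprNode{Type: "*", Kids: []exprNode{node, right}}
	}
	return node
}

// parseAdd handles the lower-precedence operator and recurses into parseMul.
func (p *parser) parseAdd() exprNode {
	node := p.parseMul()
	for p.accept('+') {
		right := p.parseMul()
		node = exprNode{Type: "+", Kids: []exprNode{node, right}}
	}
	return node
}

// show renders the tree with explicit parentheses.
func show(n exprNode) string {
	if n.Type == "col" {
		return n.Str
	}
	return "(" + show(n.Kids[0]) + " " + n.Type + " " + show(n.Kids[1]) + ")"
}

// parseExpr parses s and returns the fully parenthesized form.
func parseExpr(s string) string {
	p := &parser{input: s}
	return show(p.parseAdd())
}

func main() {
	fmt.Println(parseExpr("a + b * c")) // prints "(a + (b * c))"
	fmt.Println(parseExpr("a * b + c")) // prints "((a * b) + c)"
}
```

The parenthesized output makes the precedence visible: multiplication binds tighter regardless of where it appears.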
Notice that the parse_add recurses into the parse_mul for subexpressions, which recurses
into the parse_column. From there we can see the pattern:
The Parser structure stores the current position in the input when parsing. Every parsing
function takes it as an argument.
The highest level of parsing is the tuple expression (e.g.: (a, b, c, ...)), followed by the
OR operator, then followed by the AND operator, etc.
The pKeyword function matches one or more words from the input and advances the
position.
p.idx += len(kw)
}
return true
}
The skipSpace function does what its name says. The isSym thing is explained later.
13.3.2 Generalization
The pExprOr should recurse into the AND operator (pExprAnd) according to the precedence
list. But there are many precedences, so let’s generalize this.
func pExprBinop(
p *Parser, node *QLNode,
ops []string, types []uint32, next func(*Parser, *QLNode),
) {
assert(len(ops) == len(types))
left := QLNode{}
next(p, &left)
left = new
more = true
break
}
}
}
*node = left
}
The pExprBinop is the generalized function for parsing binary operators. It takes a list of
operators of equal precedence and tries to parse with each of them. The parser for the
next precedence is parameterized via the next argument.
List of binary parsers ordered by precedence:
The pExprNot and pExprUnop functions parse unary operators. They are much simpler than
the binary operators.
pExprAtom(p, &node.Kids[0])
default:
pExprAtom(p, node)
}
}
The pExprAtom function is the deepest level. It parses either a column name, an integer,
a string, or a pair of parentheses, which recurses back into the highest level function
pExprTuple.
The pErr function stores an error in the Parser structure. To keep the code concise, the
parser continues execution even after an error. That’s why you don’t see any error handling
here.
The pSym function is for parsing a column name. It’s just matching characters against a
rule. This can also be done with a regular expression.
	end := p.idx
	if !(end < len(p.input) && isSymStart(p.input[end])) {
		return false
	}
	end++
	for end < len(p.input) && isSym(p.input[end]) {
		end++
	}
	if pKeywordSet[strings.ToLower(string(p.input[p.idx:end]))] {
		return false // not allowed
	}
	node.Type = QL_SYM
	node.Str = p.input[p.idx:end]
	p.idx = end
	return true
}
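As a rough illustration of the regular expression alternative mentioned above, the sketch below matches a name at a given position. The character rule (a letter or underscore, followed by letters, digits, or underscores), the small keyword set, and the matchSym helper are assumptions for this example; the actual isSymStart and isSym rules may differ.

```go
package main

import (
	"regexp"
	"strings"
)

// symRe matches a name anchored at the start of the remaining input.
// The exact character classes are an assumption.
var symRe = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*`)

// keywords is a small illustrative subset of reserved words.
var keywords = map[string]bool{
	"select": true, "from": true, "index": true, "by": true,
	"filter": true, "limit": true, "and": true, "or": true, "not": true,
}

// matchSym returns the name that starts at input[idx] and the position after it.
// It fails if there is no name at idx or if the name is a reserved word.
func matchSym(input []byte, idx int) (name string, next int, ok bool) {
	m := symRe.Find(input[idx:])
	if m == nil {
		return "", idx, false
	}
	if keywords[strings.ToLower(string(m))] {
		return "", idx, false // reserved words are not allowed as names
	}
	return string(m), idx + len(m), true
}
```

The hand-written loop and the regular expression accept the same kind of input; the loop avoids a dependency and a few allocations, while the regular expression states the rule in one line.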
// stmt: select
type QLSelect struct {
	QLScan
	Names  []string // expr AS name
	Output []QLNode // expression list
}

// stmt: update
type QLUpdate struct {
	QLScan
	Names  []string
	Values []QLNode
}

// stmt: insert
type QLInsert struct {
	Table  string
	Mode   int
	Names  []string
	Values [][]QLNode
}
// stmt: delete
type QLDelete struct {
	QLScan
}
We’ll use the SELECT statement as the only example. The parser is divided into several
components.
	// SELECT xxx
	pSelectExprList(p, &stmt)
	// FROM table
	if !pKeyword(p, "from") {
		pErr(p, nil, "expect `FROM` table")
	}
	stmt.Table = pMustSym(p)
	if p.err != nil {
		return nil
	}
	return &stmt
}
Let’s zoom into the pSelectExprList function, which consists of finer and finer
components.
The rest of the code should be trivial at this point. We’ll learn how to execute the parsed
statements in the next chapter.
14.1 Introduction
To execute a statement, the statement is translated into function calls to the existing DB
interfaces. For example, the CREATE TABLE statement is translated into a simple function
call:
Other statements are not as simple, but they are still glue code that does not have much
functionality on its own. Executing a SELECT statement is more complicated:
	// output
	for _, irec := range records {
		orec := Record{Cols: req.Names}
		for _, node := range req.Output {
			ctx := QLEvalContex{env: irec}
			qlEval(&ctx, node)
			if ctx.err != nil {
				return nil, ctx.err
			}
			orec.Vals = append(orec.Vals, ctx.out)
		}
		out = append(out, orec)
	}
	return out, nil
}
It does 2 things:
Both of these things are also part of some other statements. We’ll look at expression
evaluation first.
There are 3 places where we need to evaluate expressions against a row. They are the
expression list of the SELECT statement, the conditions in the FILTER clause, and the values
of the UPDATE statement.
The qlEval function is for such tasks. To keep the code concise, the result of an
evaluation (either a scalar value or an error) and the current row are put in the
QLEvalContex structure.
The qlEval function evaluates subexpressions recursively and then performs the calculation
of the operator. Column names and literal values (integers and strings) are straightforward
to handle.
	switch node.Type {
	// refer to a column
	case QL_SYM:
		if v := ctx.env.Get(string(node.Str)); v != nil {
			ctx.out = *v
		} else {
			qlErr(ctx, "unknown column: %s", node.Str)
		}
	// a literal value
	case QL_I64, QL_STR:
		ctx.out = node.Value
	// more; omitted...
	default:
		panic("not implemented")
	}
}
Operators are also easy to handle. We also do type checking when evaluating an expression.
Take the QL_NEG operator as an example:
	// unary ops
	case QL_NEG:
		qlEval(ctx, node.Kids[0])
		if ctx.out.Type == TYPE_INT64 {
			ctx.out.I64 = -ctx.out.I64
		} else {
			qlErr(ctx, "QL_NEG type error")
		}
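The fragments above can be put together into a small self-contained sketch of recursive evaluation with type checking. All names below (Value, Node, EvalContext, eval, evalErr, and the Node* and Type* constants) are hypothetical simplifications; in particular, the row is modeled as a plain map rather than the Record type, and only one binary operator is shown.

```go
package main

import "fmt"

// Value types (hypothetical stand-ins for TYPE_INT64 and friends).
const (
	TypeInt = 1
	TypeStr = 2
)

type Value struct {
	Type int
	I64  int64
	Str  string
}

// Node types (hypothetical stand-ins for QL_SYM, QL_I64, QL_NEG, ...).
const (
	NodeSym = iota // column reference
	NodeLit        // literal value
	NodeNeg        // unary minus
	NodeAdd        // binary plus
)

type Node struct {
	Type int
	Name string // for NodeSym
	Val  Value  // for NodeLit
	Kids []Node // operands
}

// EvalContext carries the current row, the result, and the first error.
type EvalContext struct {
	env map[string]Value
	out Value
	err error
}

// evalErr records only the first error.
func evalErr(ctx *EvalContext, format string, args ...interface{}) {
	if ctx.err == nil {
		ctx.err = fmt.Errorf(format, args...)
	}
}

// eval evaluates subexpressions first, then applies the operator.
func eval(ctx *EvalContext, node Node) {
	if ctx.err != nil {
		return // an earlier error stops further evaluation
	}
	switch node.Type {
	case NodeSym:
		v, ok := ctx.env[node.Name]
		if !ok {
			evalErr(ctx, "unknown column: %s", node.Name)
			return
		}
		ctx.out = v
	case NodeLit:
		ctx.out = node.Val
	case NodeNeg:
		eval(ctx, node.Kids[0])
		if ctx.err != nil {
			return
		}
		if ctx.out.Type != TypeInt {
			evalErr(ctx, "negation type error")
			return
		}
		ctx.out.I64 = -ctx.out.I64
	case NodeAdd:
		eval(ctx, node.Kids[0])
		lhs := ctx.out
		eval(ctx, node.Kids[1])
		rhs := ctx.out
		if ctx.err != nil {
			return
		}
		if lhs.Type != TypeInt || rhs.Type != TypeInt {
			evalErr(ctx, "addition type error")
			return
		}
		ctx.out = Value{Type: TypeInt, I64: lhs.I64 + rhs.I64}
	default:
		evalErr(ctx, "not implemented")
	}
}
```

Evaluating -(a + 4) against a row where a is 3 leaves -7 in ctx.out. Referring to an unknown column sets ctx.err instead, and the caller checks the error once after the call.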
// execute a query
func qlScan(req *QLScan, tx *DBReader, out []Record) ([]Record, error) {
	sc := Scanner{}
	err := qlScanInit(req, &sc)
	if err != nil {
First, we need to translate the INDEX BY clause into the Record type, which is used by the
Scanner iterator. An INDEX BY clause is either:
1. A point query.
2. An open-ended range (a single comparison).
3. A closed range (two comparisons of different directions).
Corresponding examples:
	if req.Key2.Type != 0 {
		// the second comparison must be built from Key2, not Key1
		sc.Key2, sc.Cmp2, err = qlEvalScanKey(req.Key2)
		if err != nil {
			return err
		}
	}
The qlEvalScanKey function is for converting a comparison operator to the Record type
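To illustrate how the three kinds of INDEX BY clause can be reduced to one or two (comparison, key) pairs, here is a simplified sketch over a single integer column. The Cmp* constants, Bound, Range, newRange, and contains are hypothetical names, and treating a point query as the closed range [k, k] is an assumption of this sketch rather than something confirmed by the text.

```go
package main

import "errors"

// Hypothetical comparison codes; the sign encodes the scan direction.
const (
	CmpGE = +3 // >=
	CmpGT = +2 // >
	CmpEQ = 0  // ==
	CmpLT = -2 // <
	CmpLE = -3 // <=
)

// Bound is one comparison against the indexed column.
type Bound struct {
	Cmp int
	Key int64
}

// Range is what a scanner needs: a first bound and an optional second bound.
type Range struct {
	B1, B2 *Bound // B2 == nil means the range is open-ended
}

// match reports whether key satisfies the comparison.
func (b Bound) match(key int64) bool {
	switch b.Cmp {
	case CmpGE:
		return key >= b.Key
	case CmpGT:
		return key > b.Key
	case CmpLT:
		return key < b.Key
	case CmpLE:
		return key <= b.Key
	}
	return false
}

// newRange accepts one or two comparisons and classifies them.
func newRange(bounds ...Bound) (Range, error) {
	switch {
	case len(bounds) == 1 && bounds[0].Cmp == CmpEQ:
		// 1. point query: modeled as k <= key <= k
		k := bounds[0].Key
		return Range{B1: &Bound{CmpGE, k}, B2: &Bound{CmpLE, k}}, nil
	case len(bounds) == 1:
		// 2. open-ended range: a single comparison
		return Range{B1: &bounds[0]}, nil
	case len(bounds) == 2 && bounds[0].Cmp*bounds[1].Cmp < 0:
		// 3. closed range: two comparisons of different directions
		return Range{B1: &bounds[0], B2: &bounds[1]}, nil
	}
	return Range{}, errors.New("bad range")
}

// contains reports whether key falls inside the range.
func (r Range) contains(key int64) bool {
	return r.B1 != nil && r.B1.match(key) && (r.B2 == nil || r.B2.match(key))
}
```

Two comparisons that point in the same direction, such as "> 1" combined with ">= 3", are rejected because they do not describe a closed range.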
We need to deal with the LIMIT and the FILTER clauses when iterating rows.
		rec := Record{}
		if ok {
			sc.Deref(&rec)
		}
		// `FILTER`
		if ok && req.Filter.Type != 0 {
			ctx := QLEvalContex{env: rec}
			qlEval(&ctx, req.Filter)
			if ctx.err != nil {
				return nil, ctx.err
			}
			if ctx.out.Type != TYPE_INT64 {
				return nil, errors.New("filter is not of boolean type")
			}
			ok = (ctx.out.I64 != 0)
		}
		if ok {
			out = append(out, rec)
		}
		sc.Next()
	}
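The beginning of the loop is not shown above, so the following is only a sketch of how the LIMIT and FILTER checks might combine. It assumes that the offset and limit are positions in the scanned sequence, counted before the filter is applied; rows are plain integers and scanRows is a hypothetical name.

```go
package main

// scanRows returns the rows whose position i satisfies offset <= i < limit
// and that pass the filter. A nil filter accepts every row.
func scanRows(rows []int64, offset, limit int, filter func(int64) bool) []int64 {
	var out []int64
	for i := 0; i < len(rows) && i < limit; i++ {
		// `LIMIT`: decided by the position alone (assumption)
		ok := offset <= i
		// `FILTER`: only evaluated for rows inside the window
		if ok && filter != nil {
			ok = filter(rows[i])
		}
		if ok {
			out = append(out, rows[i])
		}
	}
	return out
}
```

With rows 1 through 6, offset 1, limit 5, and a filter that keeps even numbers, the window covers the values 2, 3, 4, 5 and the result is 2 and 4. Counting positions after filtering would be a different design with different results.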
The code for the SELECT statement is already listed. Let’s add the DELETE statement, which
is not much different.
As you can see, this chapter does not add much. The implementation of our query language
is mostly glue code for what we have already implemented. Before reaching this point, it
may have been a mystery to you how a database turns SQL into rows. Now that you have
a better understanding of databases, there are some other aspects that you may want to
explore.
You can try to add more features to our database, such as joins, group bys, and aggregations,
which are common in analytical queries. Getting these things to work should not be
difficult at this point.
You can also build a client API and a server for our database. A server process is needed
anyway for managing concurrent access. To do this, you’ll need to learn network
programming. Although networking in Golang is fairly easy and high-level, there is also the
“from scratch” method if you are willing to learn more. My other book “Build Your Own
Redis From Scratch” is for learning network programming from scratch and some data
structures. The Redis book, along with other “build your own X” books, can be found on
the official website: https://build-your-own.org.