J. Smith - Build Your Own Redis With C-C++. Learn Network Programming and Data Structures by Coding From Scratch (2023)
Redis
with C/C++
Learn network programming and data structures
by coding from scratch
James Smith
build-your-own.org
2023-01-31
Contents
Appendixes
A1: Hints to Exercises
Part 1. Getting Started
Make a functioning server that responds to commands.
01. Introduction
Redis can be considered one of the building blocks of modern computing that has stood the
test of time. The knowledge required to build such a project is broader and deeper
than what typical application-level development demands. Learning from such projects is a
good way for software developers to level up their skills.
Redis is a good target for learning because it covers two important subjects of software
engineering: network programming and data structures.
• While there are many guides on socket APIs or high-level libraries, network programming
is more than calling APIs or libraries. It is important to understand core
concepts such as the event loop, protocols, and timers, which this book will cover.
A lack of understanding can result in fatal mistakes even if you are just using
high-level networking libraries or frameworks in your applications.
• Although many people have learned some basic data structures from textbooks, there is
still more to learn. Data structures implemented in real projects often
come with practical considerations that textbooks do not touch on. Learning
how data structures are used in a non-toy environment (especially in C) is a unique
experience gained from building Redis.
Like most real-world projects, Redis is a complex project built with lots of effort, which
can be hard for beginners to grasp. Instead of studying the finished project, this book takes
the opposite approach: learning by building things from scratch.
A couple of points:
• To learn faster. By building things from scratch, concepts can be introduced gradually.
We start small, add things incrementally, and get the big picture
in the end.
• To learn deeper. While there are many materials explaining how existing things
work, the understanding obtained by reading those materials is often not the same
as building the thing yourself. It is easy to mistake memorization for understanding,
and it’s easier to pick up unimportant details than principles and basics.
• To learn more. The “from scratch” approach forces you to touch every aspect of the
subject; there are no shortcuts to knowledge! And since not every aspect is known
to you beforehand, you may discover “things I don’t know I don’t know” in the
process.
This book follows a step-by-step approach. Each step builds on the previous one, adding a
new concept. The full source code is provided on the web for reference; readers
are advised to tinker with it or to build their own without it.
The code is written as directly and straightforwardly as the author could manage. It’s mostly
plain C with minimal C++ features. Don’t worry if you don’t know C; you simply have the
opportunity to do it in another language by yourself.
The end result is a mini Redis look-alike with only about 1200 lines of code. 1200 LoC seems
low, but it illustrates many important aspects the book attempts to cover.
The techniques and approaches used in the book are not exactly the same as the real Redis.
Some are intentionally simplified, and some are chosen to illustrate a general topic. Readers
can learn even more by comparing different approaches.
The code used in this book is intended to run on Linux only, and can be downloaded at
this URL:
https://build-your-own.org/redis/src.tgz
The contents and the source code of this book can be browsed online at:
https://build-your-own.org
02. Introduction to Sockets
Redis is an example of a server/client system. Multiple clients connect to a single server,
and the server receives requests from TCP connections and sends responses back. There
are several Linux system calls we need to learn before we can start socket programming.
The socket() syscall returns an fd. Here is a rough explanation of “fd” if you are unfamiliar
with Unix systems: an fd is an integer that refers to something in the Linux kernel, like a
TCP connection, a disk file, a listening port, or some other resource.
The bind() and listen() syscalls: bind() associates an address with a socket fd, and
listen() enables us to accept connections to that address.
The accept() syscall takes a listening fd. When a client makes a connection to the listening
address, accept() returns an fd that represents the connection socket. Here is the pseudo-code
that explains the typical workflow of a server:
fd = socket()
bind(fd, address)
listen(fd)
while True:
    conn_fd = accept(fd)
    do_something_with(conn_fd)
    close(conn_fd)
The read() syscall receives data from a TCP connection. The write() syscall sends data.
The close() syscall destroys the resource referred to by the fd and recycles the fd number.
We have introduced the syscalls needed for server-side network programming. For the
client side, the connect() syscall takes a socket fd and address and makes a TCP connection
to that address. Here is the pseudo-code for the client:
fd = socket()
connect(fd, address)
do_something_with(fd)
close(fd)
The next chapter will help you get started using real code.
03. Hello Server/Client
This chapter continues the introduction of socket programming. We’ll write 2 simple
(incomplete and broken) programs to demonstrate the syscalls from the last chapter. The
first program is a server: it accepts connections from clients, reads a single message, and
writes a single reply. The second program is a client: it connects to the server, writes a
single message, and reads a single reply. Let’s start with the server.
First, we need to obtain a socket fd:
int fd = socket(AF_INET, SOCK_STREAM, 0);
AF_INET is for IPv4; use AF_INET6 for IPv6 or dual-stack sockets. For simplicity, we’ll
just use AF_INET throughout this book.
SOCK_STREAM is for TCP. We won’t use anything other than TCP in this book. All
3 arguments of the socket() call are fixed in this book.
Next, we’ll introduce a new syscall:
int val = 1;
setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val));
The setsockopt() call is used to configure various aspects of a socket. This particular call
enables the SO_REUSEADDR option. Without this option, the server won’t be able to bind to the
same address if restarted. Exercise for the reader: find out what exactly SO_REUSEADDR is and
why it is needed.
The next step is bind() and listen(). We’ll bind to the wildcard address
0.0.0.0:1234:
    // code omitted...
    // listen
    rv = listen(fd, SOMAXCONN);
    if (rv) {
        die("listen()");
    }
    while (true) {
        // accept
        struct sockaddr_in client_addr = {};
        socklen_t socklen = sizeof(client_addr);
        int connfd = accept(fd, (struct sockaddr *)&client_addr, &socklen);
        if (connfd < 0) {
            continue;   // error
        }
        do_something(connfd);
        close(connfd);
    }
Note that the read() and write() calls return the number of bytes read or written. A
real programmer must deal with the return values of functions, but in this chapter, I have
omitted lots of things for brevity. And the code in this chapter is not the correct way to
do networking anyway.
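As a rough, self-contained illustration of the syscalls in this chapter, the sketch below runs both the server side and the client side inside one process over the loopback interface. The helper name hello_roundtrip is hypothetical, and unlike the chapter's server it binds to 127.0.0.1 with a kernel-chosen port instead of 0.0.0.0:1234. It also shares the chapter's simplification of assuming that a small message arrives with a single read(), which is not guaranteed in general.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: one "hello"/"world" exchange over loopback TCP.
// Returns the reply seen by the client, or an empty string on any error.
static std::string hello_roundtrip() {
    // server side: socket() -> setsockopt() -> bind() -> listen()
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) return "";
    int val = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val));
    struct sockaddr_in addr = {};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(0);                       // 0: let the kernel pick a port
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  // 127.0.0.1
    socklen_t alen = sizeof(addr);
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(lfd, SOMAXCONN) != 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &alen) != 0) {
        close(lfd);
        return "";
    }
    // client side: socket() -> connect(). The kernel completes the handshake
    // and queues the connection until the server calls accept().
    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0 || connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        if (cfd >= 0) close(cfd);
        close(lfd);
        return "";
    }
    int connfd = accept(lfd, NULL, NULL);
    if (connfd < 0) {
        close(cfd);
        close(lfd);
        return "";
    }
    std::string reply;
    char buf[64];
    if (write(cfd, "hello", 5) == 5) {               // client -> server
        ssize_t n = read(connfd, buf, sizeof(buf));
        if (n == 5 && memcmp(buf, "hello", 5) == 0 &&
            write(connfd, "world", 5) == 5) {        // server -> client
            n = read(cfd, buf, sizeof(buf));
            if (n > 0) reply.assign(buf, (size_t)n);
        }
    }
    close(connfd);
    close(cfd);
    close(lfd);
    return reply;
}
```

Calling hello_roundtrip() from a main() function should yield the same exchange that the two separate programs print.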
Run ./server in a window and then run ./client in another window. You should see the
following results:
$ ./server
client says: hello
$ ./client
server says: world
Exercise for readers: read the manpages of the APIs used in this chapter, or find online
tutorials for them. Make sure you know how to find help on API usage since this book won’t
cover the details of API usage.
• 03_client.cpp
• 03_server.cpp
04. Protocol Parsing
Our server will be able to process multiple requests from a client. To do that, we need to
implement some sort of “protocol”, at least to split requests apart from the TCP byte
stream. The easiest way to split requests apart is to declare the length of the request at the
beginning of the request. Let’s use the following scheme.
+-----+------+-----+------+--------
| len | msg1 | len | msg2 | more...
+-----+------+-----+------+--------
The protocol consists of 2 parts: a 4-byte little-endian integer indicating the length of the
following request, and a variable length request.
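To make the scheme concrete, here is a small in-memory sketch of encoding and decoding such messages. The function names frame_encode and frame_decode are hypothetical, and the value of k_max_msg is an assumption made for this example. Like the rest of the chapter, it assumes a little-endian host and simply copies the integer bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

const size_t k_max_msg = 4096;  // assumed limit, chosen for this example

// Prepend the 4-byte length header to a message body.
static std::string frame_encode(const std::string &body) {
    uint32_t len = (uint32_t)body.size();
    std::string out(4, '\0');
    memcpy(&out[0], &len, 4);   // assume little endian
    out += body;
    return out;
}

// Try to take one message from the front of `buf`.
// Returns the number of bytes consumed, 0 if the message is not complete
// yet, or -1 if the declared length exceeds the limit.
static int32_t frame_decode(const std::string &buf, std::string &body) {
    if (buf.size() < 4) {
        return 0;               // header is incomplete
    }
    uint32_t len = 0;
    memcpy(&len, buf.data(), 4);
    if (len > k_max_msg) {
        return -1;
    }
    if (buf.size() < 4 + (size_t)len) {
        return 0;               // body is incomplete
    }
    body.assign(buf, 4, len);
    return (int32_t)(4 + len);
}
```

Because each message declares its own length, two messages written back to back can be split apart again without looking at their contents.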
Starting from the code from the last chapter, the loop of the server is modified to handle
multiple requests:
    while (true) {
        // accept
        struct sockaddr_in client_addr = {};
        socklen_t socklen = sizeof(client_addr);
        int connfd = accept(fd, (struct sockaddr *)&client_addr, &socklen);
        if (connfd < 0) {
            continue;   // error
        }
The one_request function parses only one request and replies to it; the server calls it
repeatedly until something bad happens or the client connection is gone. Our server can
only handle one connection at once until we introduce the event loop in later chapters.
1. The read() syscall just returns whatever data is available in the kernel, or blocks if
there is none. It’s the application that is responsible for handling insufficient data.
The read_full() function reads from the kernel until it has exactly n bytes.
2. Likewise, the write() syscall can return successfully with only part of the data written
if the kernel buffer is full. We need to keep trying when write() returns fewer bytes
than we need.
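The two points above can be sketched as a pair of helper loops. This is a minimal version written for illustration, not necessarily the book's listing: the name write_all is an assumption, and signal interruption (EINTR) is deliberately not handled here.

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <cstring>

// Read exactly n bytes. Returns 0 on success, -1 on error or early EOF.
static int32_t read_full(int fd, char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = read(fd, buf, n);
        if (rv <= 0) {
            return -1;      // error, or unexpected EOF
        }
        n -= (size_t)rv;    // read() may return fewer bytes than requested
        buf += rv;
    }
    return 0;
}

// Write exactly n bytes. Returns 0 on success, -1 on error.
static int32_t write_all(int fd, const char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = write(fd, buf, n);
        if (rv <= 0) {
            return -1;      // error
        }
        n -= (size_t)rv;    // write() may accept only part of the data
        buf += rv;
    }
    return 0;
}
```

Since read() returns 0 at EOF without touching errno, a caller that sets errno to 0 beforehand can tell EOF apart from a real error when read_full() fails.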
    // code omitted...
    uint32_t len = 0;
    memcpy(&len, rbuf, 4);  // assume little endian
    if (len > k_max_msg) {
        msg("too long");
        return -1;
    }
    // request body
    err = read_full(connfd, &rbuf[4], len);
    if (err) {
        msg("read() error");
        return err;
    }
    // do something
    rbuf[4 + len] = '\0';
    printf("client says: %s\n", &rbuf[4]);
For convenience, we added a limit to the maximum request size and use a large enough
buffer to hold the request. Endianness used to be a consideration when parsing protocols,
but it is less relevant today so we are just memcpy-ing integers.
    // 4 bytes header
    char rbuf[4 + k_max_msg + 1];
    errno = 0;
    int32_t err = read_full(fd, rbuf, 4);
    if (err) {
        if (errno == 0) {
            msg("EOF");
        } else {
            msg("read() error");
        }
        return err;
    }
    // code omitted...
    // reply body
    err = read_full(fd, &rbuf[4], len);
    if (err) {
        msg("read() error");
        return err;
    }
    // do something
    rbuf[4 + len] = '\0';
    printf("server says: %s\n", &rbuf[4]);
    return 0;
}
int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        die("socket()");
    }
    // code omitted...
    // multiple requests
    int32_t err = query(fd, "hello1");
    if (err) {
        goto L_DONE;
    }
    err = query(fd, "hello2");
    if (err) {
        goto L_DONE;
    }
    err = query(fd, "hello3");
    if (err) {
        goto L_DONE;
    }
L_DONE:
    close(fd);
    return 0;
}
$ ./server
client says: hello1
client says: hello2
client says: hello3
EOF
$ ./client
server says: world
server says: world
server says: world
The protocol parsing code requires at least 2 read() syscalls per request. The number of
syscalls can be reduced by using “buffered IO”. That is: read as much as you can into a
buffer at once, then try to parse multiple requests from that buffer. Readers are encouraged
to try this as an exercise as it may be helpful to understand later chapters.
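One possible shape for this exercise is sketched below. The BufReader struct and the read_requests function are hypothetical names, the 4096-byte chunk size is arbitrary, and the maximum-size check from earlier in the chapter is left out to keep the example short.

```cpp
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical buffered reader: `pending` keeps the bytes that do not yet
// form a complete request.
struct BufReader {
    int fd = -1;
    std::string pending;
};

// Performs a single read() and appends every complete request to `out`.
// Returns false on EOF or error.
static bool read_requests(BufReader &r, std::vector<std::string> &out) {
    char chunk[4096];
    ssize_t n = read(r.fd, chunk, sizeof(chunk));   // one syscall
    if (n <= 0) {
        return false;
    }
    r.pending.append(chunk, (size_t)n);
    // split off as many requests as the buffer currently holds
    while (r.pending.size() >= 4) {
        uint32_t len = 0;
        memcpy(&len, r.pending.data(), 4);          // assume little endian
        if (r.pending.size() < 4 + (size_t)len) {
            break;                                  // body is incomplete
        }
        out.push_back(r.pending.substr(4, len));
        r.pending.erase(0, 4 + (size_t)len);
    }
    return true;
}
```

A single call may produce zero, one, or several requests, depending on how much data the kernel had ready; leftover bytes stay in the buffer until the next call.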
Notes on protocols: The protocol used in this chapter is the simplest practical protocol.
Most real-world protocols are more complicated than this. Some use text instead of binary
data. While text protocols have the advantage of being human-readable, they
require more parsing than binary ones, which means more code and more room for errors.
Another thing that complicates protocol parsing is that some protocols don’t have a
straightforward way to split messages apart; those protocols may use delimiters, or require
further parsing to split messages. The use of delimiters in protocols can add another
complication when the protocol is carrying arbitrary data, as the delimiters in data need to
be “escaped”. We’ll stick to the simple binary protocol for later chapters.
• 04_client.cpp
• 04_server.cpp
05. The Event Loop and Nonblocking IO
There are 3 ways to deal with concurrent connections in server-side network programming.
They are: forking, multi-threading, and event loops. Forking creates new processes for
each client connection to achieve concurrency. Multi-threading uses threads instead of
processes. An event loop uses polling and nonblocking IO and usually runs on a single
thread. Due to the overhead of processes and threads, most modern production-grade
software uses event loops for networking.
The simplified pseudo-code for the event loop of our server is:
all_fds = [...]
while True:
    active_fds = poll(all_fds)
    for each fd in active_fds:
        do_something_with(fd)

def do_something_with(fd):
    if fd is a listening socket:
        add_new_client(fd)
    elif fd is a client connection:
        while work_not_done(fd):
            do_something_to_client(fd)

def do_something_to_client(fd):
    if should_read_from(fd):
        data = read_until_EAGAIN(fd)
        process_incoming_data(data)
    while should_write_to(fd):
        write_until_EAGAIN(fd)
    if should_close(fd):
        destroy_client(fd)
Instead of just doing things (reading, writing, or accepting) with fds, we use the poll
operation to tell us which fds can be operated on immediately without blocking.
In blocking mode, read blocks the caller when there is no data in the kernel, write blocks
when the write buffer is full, and accept blocks when there are no new connections in the
kernel queue. In nonblocking mode, those operations either succeed without blocking, or
fail with the errno EAGAIN, which means “not ready”. Nonblocking operations that fail
with EAGAIN must be retried after poll reports that the fd is ready.
The poll is the sole blocking operation in an event loop; everything else must be
nonblocking. Thus, a single thread can handle multiple concurrent connections. All blocking
networking IO APIs, such as read, write, and accept, have a nonblocking mode. APIs
that do not have a nonblocking mode, such as gethostbyname and disk IO, should be
performed in thread pools, which will be covered in later chapters. Also, timers must be
implemented within the event loop since we can’t sleep inside the event loop.
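The active_fds = poll(all_fds) step of the pseudo-code could look roughly like the sketch below. The helper name poll_readable is hypothetical, and the sketch only asks about read readiness.

```cpp
#include <poll.h>
#include <unistd.h>
#include <cassert>
#include <vector>

// Returns the subset of `fds` that can be read without blocking,
// waiting at most `timeout_ms` milliseconds for at least one of them.
static std::vector<int> poll_readable(const std::vector<int> &fds, int timeout_ms) {
    std::vector<struct pollfd> args;
    for (int fd : fds) {
        struct pollfd pfd = {};
        pfd.fd = fd;
        pfd.events = POLLIN;    // we are interested in reading
        args.push_back(pfd);
    }
    std::vector<int> active;
    int rv = poll(args.data(), (nfds_t)args.size(), timeout_ms);
    if (rv <= 0) {
        return active;          // timeout or error: nothing is ready
    }
    for (const struct pollfd &pfd : args) {
        if (pfd.revents & POLLIN) {
            active.push_back(pfd.fd);   // poll reported this fd as ready
        }
    }
    return active;
}
```

With two pipes where only the second one has data, the call returns just the read end of the second pipe.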
The syscall for setting an fd to nonblocking mode is fcntl:
    flags |= O_NONBLOCK;
    errno = 0;
    (void)fcntl(fd, F_SETFL, flags);
    if (errno) {
        die("fcntl error");
    }
}
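For experimenting, a self-contained variant is sketched below. The name set_nonblocking is hypothetical, and unlike the listing above it reports failure through its return value instead of calling die().

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>

// Switch an fd to nonblocking mode. Returns true on success.
static bool set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);      // read the current status flags
    if (flags < 0) {
        return false;
    }
    // write them back with O_NONBLOCK added
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}
```

After this call, a read() on an empty pipe or socket fails with EAGAIN instead of blocking the caller.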
On Linux, besides the poll syscall, there are also select and epoll. The ancient select
syscall is basically the same as poll, except that the maximum fd number is limited to
a small number, which makes it obsolete in modern applications. The epoll API consists
of 3 syscalls: epoll_create, epoll_wait, and epoll_ctl. The epoll API is stateful: instead
of supplying a set of fds as a syscall argument, epoll_ctl is used to manipulate an fd set
created by epoll_create, which epoll_wait operates on.
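The stateful nature of epoll can be seen in the short sketch below. The helper names ep_add_reader and ep_wait_one are hypothetical, and the example uses epoll_create1, the newer variant of epoll_create.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>

// Register `fd` for read readiness in the epoll instance `epfd`.
// The registration stays in the kernel until it is removed.
static bool ep_add_reader(int epfd, int fd) {
    struct epoll_event ev = {};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0;
}

// Wait for one event. Returns the ready fd, or -1 on timeout or error.
// Note that no fd set is passed in: the kernel already knows it.
static int ep_wait_one(int epfd, int timeout_ms) {
    struct epoll_event ev = {};
    int rv = epoll_wait(epfd, &ev, 1, timeout_ms);
    return rv == 1 ? ev.data.fd : -1;
}
```

The fd set is set up once with epoll_ctl, and each later epoll_wait call reuses it, which is why the argument does not grow with the number of connections.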
We’ll use the poll syscall in the next chapter since it’s slightly less code than the stateful
epoll API. However, the epoll API is preferable in real-world projects since the argument
for the poll can become too large as the number of fds increases.
06. The Event Loop Implementation
This chapter walks through the real C++ code of an echo server.
enum {
    STATE_REQ = 0,
    STATE_RES = 1,
    STATE_END = 2,  // mark the connection for deletion
};
struct Conn {
    int fd = -1;
    uint32_t state = 0;     // either STATE_REQ or STATE_RES
    // buffer for reading
    size_t rbuf_size = 0;
    uint8_t rbuf[4 + k_max_msg];
    // buffer for writing
    size_t wbuf_size = 0;
    size_t wbuf_sent = 0;
    uint8_t wbuf[4 + k_max_msg];
};
We need buffers for reading/writing, since in nonblocking mode, IO operations are often
deferred.
The state is used to decide what to do with the connection. There are 2 states for an
ongoing connection. The STATE_REQ is for reading requests and the STATE_RES is for sending
responses.
int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        die("socket()");
    }
    // code omitted...
    return 0;
}
The first thing in our event loop is setting up arguments of poll. The listening fd is polled
with the POLLIN flag. For the connection fd, the state of the struct Conn determines the poll
flag. In this particular case, the poll flag is either reading (POLLIN) or writing (POLLOUT),
never both. If using epoll, the first thing in an event loop is usually updating the fd set
with epoll_ctl.
The poll also takes a timeout argument, which can be used to implement timers. In our
case, this argument doesn’t matter; just set it to a big number. After poll returns, we
are notified of which fds are ready for reading/writing and act accordingly.
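The mapping from connection state to poll flags described above can be sketched as follows. The Conn struct here is a reduced stand-in without buffers, the helper name make_poll_args is hypothetical, and the convention that unused slots in the connection list are NULL is an assumption of this example.

```cpp
#include <poll.h>
#include <cassert>
#include <cstdint>
#include <vector>

enum {
    STATE_REQ = 0,  // waiting to read a request
    STATE_RES = 1,  // waiting to send a response
    STATE_END = 2,
};

// Reduced stand-in for the chapter's struct Conn (buffers left out).
struct Conn {
    int fd = -1;
    uint32_t state = STATE_REQ;
};

// Build the poll() arguments: the listening fd first, then one entry per
// connection, with POLLIN or POLLOUT chosen by the connection state.
static std::vector<struct pollfd> make_poll_args(
    int listen_fd, const std::vector<Conn *> &conns)
{
    std::vector<struct pollfd> args;
    struct pollfd pfd = {};
    pfd.fd = listen_fd;
    pfd.events = POLLIN;
    args.push_back(pfd);
    for (const Conn *conn : conns) {
        if (!conn) {
            continue;   // unused slot
        }
        struct pollfd p = {};
        p.fd = conn->fd;
        p.events = (conn->state == STATE_REQ) ? POLLIN : POLLOUT;
        args.push_back(p);
    }
    return args;
}
```

The resulting vector can be passed to poll() as args.data() and args.size(); rebuilding it on every loop iteration is what the stateful epoll API avoids.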
The accept_new_conn() function accepts a new connection and creates the struct Conn
object:
    // code omitted...
            msg("EOF");
        }
        conn->state = STATE_END;
        return false;
    }
    conn->rbuf_size += (size_t)rv;
    // the filled part of the buffer must not exceed its capacity
    assert(conn->rbuf_size <= sizeof(conn->rbuf));
There are lots of things to unpack here. To understand this function, let’s review the
pseudo-code from the last chapter:
def do_something_to_client(fd):
    if should_read_from(fd):
        data = read_until_EAGAIN(fd)
        process_incoming_data(data)
    # code omitted...
The try_fill_buffer() function fills the read buffer with data. Since the size of the read
buffer is limited, the read buffer could be full before we hit EAGAIN, so we need to process
data immediately after reading to free up some read buffer space. The try_fill_buffer()
function is then called in a loop until we hit EAGAIN.
The read syscall (like other blocking syscalls) needs to be retried after getting the errno
EINTR. EINTR means the syscall was interrupted by a signal; retrying is needed even if our
application does not make use of signals.
The try_one_request function handles the incoming data, but why is this in a loop? Is
there more than one request in the read buffer? The answer is yes. For a request/response
protocol, clients are not limited to sending one request and waiting for the response at
a time. Clients can save some latency by sending multiple requests without waiting for
responses in between; this mode of operation is called “pipelining”. Thus we can’t assume
that the read buffer contains at most one request.
Listing the try_one_request function:
    // code omitted...
    // change state
    conn->state = STATE_RES;
    state_res(conn);
The try_one_request function takes one request from the read buffer, generates a response,
then transitions to the STATE_RES state.
The code for the state STATE_RES:
    // code omitted...
        conn->wbuf_size = 0;
        return false;
    }
    // still got some data in wbuf, could try to write again
    return true;
}
The above code flushes the write buffer until it gets EAGAIN, or transitions back to
STATE_REQ if the flushing is done.
To test our server, we can run the client from chapter 04 since the protocol is identical.
We can also modify the client to demonstrate pipelining:
// the `query` function was simply split into `send_req` and `read_res`.
static int32_t send_req(int fd, const char *text);
static int32_t read_res(int fd);
int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        die("socket()");
    }
    // code omitted...
        }
    }
L_DONE:
    close(fd);
    return 0;
}
Exercises:
1. Try to use epoll instead of poll in the event loop. This should be easy.
2. We are using memmove to reclaim read buffer space. However, memmove on every
request is unnecessary; change the code to perform memmove only before read.
3. In the state_res function, write was performed for a single response. In pipelined
scenarios, we could buffer multiple responses and flush them in the end with a single
write call. Note that the write buffer could be full in the middle.
• 06_client.cpp
• 06_server.cpp
07. Basic Server: get, set, del
With the event loop code from the last chapter, we can finally start adding commands to
our server.
The “command” in our design is a list of strings, like set key val. We’ll encode the
“command” with the following scheme.
+------+-----+------+-----+------+-----+-----+------+
| nstr | len | str1 | len | str2 | ... | len | strn |
+------+-----+------+-----+------+-----+-----+------+
The nstr is the number of strings and the len is the length of the following string. Both
are 32-bit integers.
The response is a 32-bit status code followed by the response string.
+-----+---------+
| res | data... |
+-----+---------+
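Both layouts can be illustrated with a small encoder. The function names append_u32, encode_cmd, and encode_res are hypothetical; the sketch assumes a little-endian host, as in earlier chapters, and leaves out the outer 4-byte length header from chapter 04.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Append a 32-bit integer as 4 raw bytes (assume little endian).
static void append_u32(std::string &out, uint32_t v) {
    out.append((const char *)&v, 4);
}

// Encode a command as: nstr, then (len, bytes) for each string.
static std::string encode_cmd(const std::vector<std::string> &cmd) {
    std::string out;
    append_u32(out, (uint32_t)cmd.size());
    for (const std::string &s : cmd) {
        append_u32(out, (uint32_t)s.size());
        out += s;
    }
    return out;
}

// Encode a response as: 32-bit status code, then the data bytes.
static std::string encode_res(uint32_t rescode, const std::string &data) {
    std::string out;
    append_u32(out, rescode);
    out += data;
    return out;
}
```

For example, the command set k v becomes 4 bytes for nstr followed by three length-prefixed strings, 21 bytes in total.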
    // code omitted...
    // change state
    conn->state = STATE_RES;
    state_res(conn);
The do_request function handles the request. Only 3 commands (get, set, del) are
recognized for now.
    // code omitted...
        *rescode = RES_ERR;
        const char *msg = "Unknown cmd";
        strcpy((char *)res, msg);
        *reslen = strlen(msg);
        return 0;
    }
    return 0;
}
    // code omitted...
        return -1;
    }
    size_t pos = 4;
    while (n--) {
        if (pos + 4 > len) {
            return -1;
        }
        uint32_t sz = 0;
        memcpy(&sz, &data[pos], 4);
        if (pos + 4 + sz > len) {
            return -1;
        }
        out.push_back(std::string((char *)&data[pos + 4], sz));
        pos += 4 + sz;
    }
    if (pos != len) {
        return -1;  // trailing garbage
    }
    return 0;
}
enum {
    RES_OK = 0,
    RES_ERR = 1,
    RES_NX = 2,
};
// The data structure for the key space. This is just a placeholder
// until we implement a hashtable in the next chapter.
static std::map<std::string, std::string> g_map;
// code omitted...
{
    if (!g_map.count(cmd[1])) {
        return RES_NX;
    }
    std::string &val = g_map[cmd[1]];
    assert(val.size() <= k_max_msg);
    memcpy(res, val.data(), val.size());
    *reslen = (uint32_t)val.size();
    return RES_OK;
}
    // code omitted...
        return -1;
    }
    // code omitted...
    std::vector<std::string> cmd;
    for (int i = 1; i < argc; ++i) {
        cmd.push_back(argv[i]);
    }
    int32_t err = send_req(fd, cmd);
    if (err) {
        goto L_DONE;
    }
    err = read_res(fd);
    if (err) {
        goto L_DONE;
    }
L_DONE:
    close(fd);
    return 0;
}
Testing commands:
$ ./client get k
server says: [2]
$ ./client set k v
server says: [0]
$ ./client get k
server says: [0] v
$ ./client del k
server says: [0]
$ ./client get k
server says: [2]
$ ./client aaa bbb
server says: [1] Unknown cmd
• 07_client.cpp
• 07_server.cpp
Part 2. Essential Topics
Learn more network programming and data structures.
08. Data Structure: Hashtables
This chapter fills in the placeholder code in the last chapter’s server. We’ll start by
implementing a hashtable. Hashtables are often the obvious data structure for holding an
unknown amount of key-value data that does not require ordering.
There are two kinds of hashtables: chaining and open addressing. Their primary difference
is collision resolution. Open addressing seeks another free slot in the event of a collision
while chaining simply groups conflicting keys with a linked list. There are many variants
of open addressing due to the need to find free slots, while the chaining hashtable is pretty
much a fixed design. The hashtable used in our server is a chaining one. A chaining
hashtable is easy to code; it doesn’t require much choice-making.
When the size of the hashtable is a power of two, the indexing operation is a simple bit
mask with the hash code.
// n must be a power of 2
static void h_init(HTab *htab, size_t n) {
    assert(n > 0 && ((n - 1) & n) == 0);
    // calloc takes the element count first, then the element size
    htab->tab = (HNode **)calloc(n, sizeof(HNode *));
    htab->mask = n - 1;
    htab->size = 0;
}
// hashtable insertion
static void h_insert(HTab *htab, HNode *node) {
    size_t pos = node->hcode & htab->mask;
    HNode *next = htab->tab[pos];
    node->next = next;
    htab->tab[pos] = node;
    htab->size++;
}
Deleting is easy. Notice how the use of pointers enables succinct code. The from pointer
can point either to a slot of the array or to the next field of a node, yet the code
doesn’t differentiate.
    // code omitted...
    *from = (*from)->next;
    htab->size--;
    return node;
}
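The same pointer-to-pointer trick can be tried out on a plain singly linked list. The Node struct and the list_find and list_detach functions below are hypothetical stand-ins written for this illustration, not the hashtable code itself.

```cpp
#include <cassert>
#include <cstddef>

struct Node {
    int val = 0;
    Node *next = NULL;
};

// Returns the address of the pointer that points to the first node
// holding `val`: either the list head or the `next` field of a node.
static Node **list_find(Node **head, int val) {
    Node **from = head;
    while (*from) {
        if ((*from)->val == val) {
            return from;
        }
        from = &(*from)->next;
    }
    return NULL;
}

// Unlink the node that `from` points to. The same line of code works
// whether the node is the first one or somewhere in the middle.
static Node *list_detach(Node **from) {
    Node *node = *from;
    *from = node->next;
    node->next = NULL;
    return node;
}
```

Because the function modifies the incoming pointer rather than the previous node, there is no special case for removing the head of the list.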
Our hashtable is fixed in size, so we need to migrate to a bigger one when the load factor
is too high. There is an extra consideration when using hashtables in Redis. Resizing a
large hashtable requires moving a lot of nodes to a new table, which can stall the server
for some time. This can be avoided by not moving everything at once; instead, we keep
two hashtables and gradually move nodes between them. Here is the final hashtable
interface:
HNode *hm_lookup(
    HMap *hmap, HNode *key, bool (*cmp)(HNode *, HNode *))
{
    hm_help_resizing(hmap);
    HNode **from = h_lookup(&hmap->ht1, key, cmp);
    if (!from) {
        from = h_lookup(&hmap->ht2, key, cmp);
    }
    return from ? *from : NULL;
}
    // code omitted...
    size_t nwork = 0;
    while (nwork < k_resizing_work && hmap->ht2.size > 0) {
        // scan for nodes from ht2 and move them to ht1
        HNode **from = &hmap->ht2.tab[hmap->resizing_pos];
        if (!*from) {
            hmap->resizing_pos++;
            continue;
        }
        h_insert(&hmap->ht1, h_detach(&hmap->ht2, from));
        nwork++;
    }
    if (hmap->ht2.size == 0) {
        // done
        free(hmap->ht2.tab);
        hmap->ht2 = HTab{};
    }
}
The insertion subroutine will trigger resizing should the table become too full:
    // code omitted...
    h_insert(&hmap->ht1, node);
    if (!hmap->ht2.tab) {
        // check whether we need to resize
        size_t load_factor = hmap->ht1.size / (hmap->ht1.mask + 1);
        if (load_factor >= k_max_load_factor) {
            hm_start_resizing(hmap);
        }
    }
    hm_help_resizing(hmap);
}
HNode *hm_pop(
    HMap *hmap, HNode *key, bool (*cmp)(HNode *, HNode *))
{
    hm_help_resizing(hmap);
    HNode **from = h_lookup(&hmap->ht1, key, cmp);
    if (from) {
        return h_detach(&hmap->ht1, from);
    }
    from = h_lookup(&hmap->ht2, key, cmp);
    if (from) {
        return h_detach(&hmap->ht2, from);
    }
    return NULL;
}
The hashtable implementation is done. Let’s add it to the server. Looking at the struct
HNode again, this structure contains no data, so how do we actually use it? The answer is
called an “intrusive data structure”:
Instead of making our data structure contain data, the hashtable node structure is embedded
into the payload data. This is the standard way of creating generic data structures in C.
Besides making the data structure fully generic, this technique also has the advantage of
reducing unnecessary memory management. The structure node is not separately allocated
but is part of the payload data, and the data structure code does not own the payload
but merely organizes the data. This may be quite a new idea to you if you learned data
structures from textbooks, which probably use void * or C++ templates or even
macros.
Listing the do_get function to see how the intrusive data structure is used:
    // code omitted...
    *reslen = (uint32_t)val.size();
    return RES_OK;
}
The hm_lookup function returns a pointer to HNode, which is a member of the Entry. We need
some pointer arithmetic to convert that pointer to an Entry pointer. The container_of
macro is commonly used in C projects for this purpose.
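A common portable definition of such a macro is shown below as a sketch; it is not necessarily the exact definition used by the book's source code. The reduced HNode and Entry structs and the entry_of helper are assumptions made for this example, and the node member is deliberately placed last to show that a nonzero offset is handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Given a pointer to a member, compute the pointer to the enclosing
// structure by subtracting the member's offset.
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

// Reduced stand-ins for the chapter's structures.
struct HNode {
    HNode *next = NULL;
    uint64_t hcode = 0;
};

struct Entry {
    std::string key;
    std::string val;
    HNode node;     // the embedded hashtable node
};

// The hashtable only knows about HNode; this converts back to the payload.
static Entry *entry_of(HNode *node) {
    return container_of(node, Entry, node);
}
```

The hashtable code never needs to know about Entry; only the caller, which knows what the node is embedded in, performs the conversion.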
    Entry key;
    key.key.swap(cmd[1]);
    key.node.hcode = str_hash((uint8_t *)key.key.data(), key.key.size());
    // code omitted...
    if (node) {
        container_of(node, Entry, node)->val.swap(cmd[2]);
    } else {
        Entry *ent = new Entry();
        ent->key.swap(key.key);
        ent->node.hcode = key.node.hcode;
        ent->val.swap(cmd[2]);
        hm_insert(&g_data.db, &ent->node);
    }
    return RES_OK;
}
    Entry key;
    key.key.swap(cmd[1]);
    key.node.hcode = str_hash((uint8_t *)key.key.data(), key.key.size());
    // code omitted...
Exercises:
1. Our hashtable triggers resizing when the load factor is too high, should we also
shrink the hashtable when the load factor is too low? Can the shrinking be performed
automatically?
• 08_server.cpp
• hashtable.cpp
• hashtable.h
09. Data Serialization
For now, our server protocol response is an error code plus a string. What if we need to
return more complicated data? For example, we might add the keys command that returns
a list of strings. We have already encoded the list-of-strings data in the request protocol.
In this chapter, we will generalize the encoding to handle different types of data. This is
often called “serialization”.
Our serialization protocol consists of five types of data:
enum {
    SER_NIL = 0,
    SER_ERR = 1,
    SER_STR = 2,
    SER_INT = 3,
    SER_ARR = 4,
};
The SER_NIL is like NULL, the SER_ERR is for returning error code and message, the SER_STR
and SER_INT are for string and int64, and the SER_ARR is for arrays.
Code listing starts with the try_one_request function:
// code omitted...
}
For convenience, std::string was used to hold the response data. Production-grade
projects often have more sophisticated ways to manage buffers.
As we can see, our serialization protocol starts with one byte of data type, followed by
various types of payload data. Arrays come with their size first, then their possibly nested
elements.
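The layout described above can be sketched with a few append helpers. The function names out_nil, out_str, out_int, and out_arr are hypothetical here, as is the choice of 4-byte lengths and an 8-byte integer payload; integers are copied in host byte order, as in earlier chapters.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

enum {
    SER_NIL = 0,
    SER_ERR = 1,
    SER_STR = 2,
    SER_INT = 3,
    SER_ARR = 4,
};

// nil: just the type byte
static void out_nil(std::string &out) {
    out.push_back(SER_NIL);
}

// string: type byte, 4-byte length, then the bytes
static void out_str(std::string &out, const std::string &val) {
    out.push_back(SER_STR);
    uint32_t len = (uint32_t)val.size();
    out.append((const char *)&len, 4);
    out.append(val);
}

// integer: type byte, then 8 bytes of int64
static void out_int(std::string &out, int64_t val) {
    out.push_back(SER_INT);
    out.append((const char *)&val, 8);
}

// array: type byte, 4-byte element count; the caller then appends
// exactly `n` elements, which may themselves be arrays
static void out_arr(std::string &out, uint32_t n) {
    out.push_back(SER_ARR);
    out.append((const char *)&n, 4);
}
```

An array holding the string k and the integer 1 is produced by calling out_arr(out, 2), out_str(out, "k"), and out_int(out, 1) in that order.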
static void h_scan(HTab *tab, void (*f)(HNode *, void *), void *arg) {
    if (tab->size == 0) {
        return;
    }
    for (size_t i = 0; i < tab->mask + 1; ++i) {
        HNode *node = tab->tab[i];
        while (node) {
            f(node, arg);
            node = node->next;
        }
    }
}
The del command responds with an integer indicating whether the deletion took place.
The code for other commands is not particularly interesting, so there is no need to list it.
Listing the client “deserialization” code:
    // code omitted...
        {
            uint32_t len = 0;
            memcpy(&len, &data[1], 4);
            printf("(arr) len=%u\n", len);
            size_t arr_bytes = 1 + 4;
            for (uint32_t i = 0; i < len; ++i) {
                int32_t rv = on_response(&data[arr_bytes], size - arr_bytes);
                if (rv < 0) {
                    return rv;
                }
                arr_bytes += (size_t)rv;
            }
            printf("(arr) end\n");
            return (int32_t)arr_bytes;
        }
    default:
        msg("bad response");
        return -1;
    }
}
$ ./client asdf
(err) 1 Unknown cmd
$ ./client get asdf
(nil)
$ ./client set k v
(nil)
$ ./client get k
(str) v
$ ./client keys
(arr) len=1
(str) k
(arr) end
$ ./client del k
(int) 1
$ ./client del k
(int) 0
$ ./client keys
(arr) len=0
(arr) end
• 09_client.cpp
• 09_server.cpp
• hashtable.cpp
• hashtable.h
10. The AVL Tree: Implementation & Testing
While Redis is often referred to as a key-value store, the “value” part of Redis is not
restricted to plain strings; lists, hashmaps, and sorted sets are quite nice things to have.
Redis is also referred to as the “data structure server” due to its rich set of data structures.
Redis is often used as an in-memory cache, and when storing data in memory, there is the
advantage of being free to use data structures. The sorted set data structure in Redis is
quite a unique and useful thing. Not only does it offer the ability to keep your data in
sorted order, it also has the unique feature of querying ordered data by rank. If you put 20M
records into a sorted set, you can get the record that is ranked at 10M without going through
the first 10M records; this is a feat that cannot be emulated by current SQL databases.
As the name “sorted set” implies, it’s a data structure for sorting. Trees, specifically
balanced binary trees, are popular data structures for storing sorted data. Among various
data structures, the author found the AVL tree particularly simple and easy to code, so it
will be used in this book to implement the sorted set. The real Redis project uses a skiplist,
which is also considered easy to code.
The idea of the AVL tree is to restrict the height difference between the left subtree
and the right subtree. The height difference between subtrees is restricted to be at most
one, never reaching two. When inserting/removing nodes from an AVL tree, the height
difference can temporarily reach two, which is then fixed by the node rotations. The
rotation operation is the basis of balanced binary trees, which is also used by other balanced
trees like the RB tree. After the rotation, a node with a subtree height difference of two is
reduced back to be at most one.
Let’s start with the tree node:
struct AVLNode {
uint32_t depth = 0;
uint32_t cnt = 0;
AVLNode *left = NULL;
AVLNode *right = NULL;
AVLNode *parent = NULL;
};
node->depth = 1;
node->cnt = 1;
node->left = node->right = node->parent = NULL;
}
This is a regular binary tree node with extra fields. The depth field is the height of the
subtree rooted at the node. The cnt field is the size of that subtree; this field is not
specific to the AVL tree. It is used to implement the rank-based query, which will be
explained in the next chapter.
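To make the two extra fields concrete, here is a minimal sketch of how they could be
maintained. The names avl_depth and avl_update appear elsewhere in this chapter; avl_init,
avl_cnt and max_u32 are names assumed for this example. An empty subtree counts as
depth 0 and size 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct AVLNode {
    uint32_t depth = 0;     // height of the subtree rooted at this node
    uint32_t cnt = 0;       // number of nodes in the subtree rooted at this node
    AVLNode *left = NULL;
    AVLNode *right = NULL;
    AVLNode *parent = NULL;
};

// a freshly initialized node is a subtree of height 1 and size 1
static void avl_init(AVLNode *node) {
    node->depth = 1;
    node->cnt = 1;
    node->left = node->right = node->parent = NULL;
}

// NULL-safe accessors: an empty subtree has height 0 and size 0
static uint32_t avl_depth(AVLNode *node) {
    return node ? node->depth : 0;
}

static uint32_t avl_cnt(AVLNode *node) {
    return node ? node->cnt : 0;
}

static uint32_t max_u32(uint32_t lhs, uint32_t rhs) {
    return lhs < rhs ? rhs : lhs;
}

// recompute both fields from the children; call this after the children change
static void avl_update(AVLNode *node) {
    node->depth = 1 + max_u32(avl_depth(node->left), avl_depth(node->right));
    node->cnt = 1 + avl_cnt(node->left) + avl_cnt(node->right);
}
```

Because both fields are derived from the children only, updating a node after any change
below it is enough to restore them, as long as updates proceed bottom-up.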
Listing some helper functions:
new_node->left = node;
new_node->parent = node->parent;
node->parent = new_node;
avl_update(node);
avl_update(new_node);
return new_node;
}
The avl_fix_left and avl_fix_right are functions for fixing excess height difference:
If the right subtree is too deep, a left rotation will fix it. Before the left rotation, we may
need a right rotation on the right subtree to ensure the right subtree is leaning in the
correct direction. Here is the visualization:
      b          b           d
     / \        / \         / \
    a   c  ==> a   d  ==>  b   c
       /            \     /
      d              c   a
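The two-step fix in the diagram can be sketched as below. This is a self-contained
approximation, not necessarily the exact listing from avl.cpp: the node layout and avl_update
follow the chapter, while avl_cnt, max_u32 and the bodies of rot_left, rot_right and
avl_fix_right are written here from the description and should be read as assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct AVLNode {
    uint32_t depth = 0;
    uint32_t cnt = 0;
    AVLNode *left = NULL;
    AVLNode *right = NULL;
    AVLNode *parent = NULL;
};

static uint32_t avl_depth(AVLNode *n) { return n ? n->depth : 0; }
static uint32_t avl_cnt(AVLNode *n) { return n ? n->cnt : 0; }
static uint32_t max_u32(uint32_t a, uint32_t b) { return a < b ? b : a; }

static void avl_update(AVLNode *n) {
    n->depth = 1 + max_u32(avl_depth(n->left), avl_depth(n->right));
    n->cnt = 1 + avl_cnt(n->left) + avl_cnt(n->right);
}

// The right child becomes the new subtree root. The caller is responsible
// for re-linking the returned node into the parent's child pointer.
static AVLNode *rot_left(AVLNode *node) {
    AVLNode *new_node = node->right;
    if (new_node->left) {
        new_node->left->parent = node;
    }
    node->right = new_node->left;
    new_node->left = node;
    new_node->parent = node->parent;
    node->parent = new_node;
    avl_update(node);
    avl_update(new_node);
    return new_node;
}

// mirror image of rot_left
static AVLNode *rot_right(AVLNode *node) {
    AVLNode *new_node = node->left;
    if (new_node->right) {
        new_node->right->parent = node;
    }
    node->left = new_node->right;
    new_node->right = node;
    new_node->parent = node->parent;
    node->parent = new_node;
    avl_update(node);
    avl_update(new_node);
    return new_node;
}

// the right subtree is too deep
static AVLNode *avl_fix_right(AVLNode *root) {
    if (avl_depth(root->right->right) < avl_depth(root->right->left)) {
        // the right subtree leans left: straighten it first
        root->right = rot_right(root->right);
    }
    return rot_left(root);
}
```

Applied to the leftmost shape in the diagram, avl_fix_right first rotates the subtree at c
to the right, then rotates b to the left, which leaves d as the new subtree root.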
The avl_fix function fixes everything after an insertion/deletion operation. It goes from
the initially affected node to the root node. Since the rotation may change the root of the
tree, the root node is returned. This is the core of our AVL tree implementation.
// fix imbalanced nodes and maintain invariants until the root is reached
static AVLNode *avl_fix(AVLNode *node) {
    while (true) {
        avl_update(node);
        uint32_t l = avl_depth(node->left);
        uint32_t r = avl_depth(node->right);
        AVLNode **from = NULL;
        if (node->parent) {
            from = (node->parent->left == node)
                ? &node->parent->left : &node->parent->right;
        }
        if (l == r + 2) {
            node = avl_fix_left(node);
        } else if (l + 2 == r) {
            node = avl_fix_right(node);
        }
        if (!from) {
            return node;
        }
        *from = node;
        node = node->parent;
    }
}
Insertion into a binary tree is easy: just walk down from the root until you find an empty
subtree, place the new node there, then call avl_fix for maintenance.
Deletion is more complicated. If the target node has no subtree, just remove it directly;
if it has one subtree, replace the node with that subtree. The problem arises when the
node has both subtrees: we can’t remove it directly. Instead, we detach its successor (the
leftmost node in the right subtree) and put the detached node in the place of the target
node. Here is the function for removing a node:
*victim = *node;
if (victim->left) {
victim->left->parent = victim;
}
if (victim->right) {
victim->right->parent = victim;
}
This is the generic function for removing nodes from a binary tree, with the AVL-tree-
specific avl_fix.
Readers with experience with the RB tree may notice how small and simple the AVL
tree implementation is. The maintenance code for RB tree node deletion is significantly
more complicated than that for insertion, while the AVL tree uses the same function avl_fix
for both insertion and deletion; this symmetry greatly reduces the effort required to code
an AVL tree.
The AVL tree is significantly more complicated than the hashtable we coded before. Thus,
we need to invest more time in testing. The testing code also demonstrates the usage of
those AVL tree functions.
Here are our testing data types. If you are not familiar with intrusive data structures, read
the hashtable chapter.
struct Data {
AVLNode node;
uint32_t val = 0;
};
struct Container {
AVLNode *root = NULL;
};
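As a reminder of how intrusive structures get from the embedded node back to the data, here
is a minimal sketch using a container_of macro. The macro below is a simplified, portable
variant based on offsetof and is an assumption of this example (the project's own definition
may differ); data_val is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Recover a pointer to the enclosing struct from a pointer to one of its members.
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct AVLNode {
    uint32_t depth = 0;
    uint32_t cnt = 0;
    AVLNode *left = NULL;
    AVLNode *right = NULL;
    AVLNode *parent = NULL;
};

struct Data {
    AVLNode node;       // the tree only ever sees this member
    uint32_t val = 0;   // the payload lives next to it
};

// hypothetical helper: read the payload given only the tree node
static uint32_t data_val(AVLNode *node) {
    return container_of(node, Data, node)->val;
}
```

The tree code never needs to know about Data; only the comparison and disposal code,
which live with the container, convert nodes back with container_of.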
if (!c.root) {
c.root = &data->node;
return;
}
*from = &data->node;
data->node.parent = cur;
c.root = avl_fix(&data->node);
break;
}
cur = *from;
}
}
return false;
}
c.root = avl_del(cur);
delete container_of(cur, Data, node);
return true;
}
Here is the function for verifying the correctness of the tree structure:
assert(node->parent == parent);
avl_verify(node, node->left);
avl_verify(node, node->right);
uint32_t l = avl_depth(node->left);
uint32_t r = avl_depth(node->right);
assert(l == r || l + 1 == r || l == r + 1);
assert(node->depth == 1 + max(l, r));
Code for comparing the contents of AVL tree with the expected data:
Container c;
container_verify(c, {});
add(c, 123);
container_verify(c, {123});
assert(!del(c, 124));
assert(del(c, 123));
container_verify(c, {});
// sequential insertion
std::multiset<uint32_t> ref;
for (uint32_t i = 0; i < 1000; i += 3) {
add(c, i);
ref.insert(i);
container_verify(c, ref);
}
// random insertion
for (uint32_t i = 0; i < 100; i++) {
uint32_t val = (uint32_t)rand() % 1000;
add(c, val);
ref.insert(val);
container_verify(c, ref);
}
// random deletion
for (uint32_t i = 0; i < 200; i++) {
uint32_t val = (uint32_t)rand() % 1000;
auto it = ref.find(val);
if (it == ref.end()) {
assert(!del(c, val));
} else {
assert(del(c, val));
ref.erase(it);
}
container_verify(c, ref);
}
Some more targeted tests. Given a tree of a certain size, perform insertion/deletion at
every possible position.
add(c, val);
ref.insert(val);
container_verify(c, ref);
dispose(c);
}
}
assert(del(c, val));
ref.erase(val);
container_verify(c, ref);
dispose(c);
}
With the help of those test cases, the author did find and fix a couple of mistakes
while writing this chapter.
Exercises:
1. While there is not much code for our AVL tree, this AVL tree implementation is
probably not a very efficient one. Our code contains some redundant pointer updates,
which leave room for optimization. Also, we don’t need to store the height
value for balancing; it is possible to store the height difference instead. Research and
explore efficient AVL tree implementations.
2. Can you create more test cases? The test cases presented in this chapter are unlikely
to be sufficient.
• avl.cpp
• test_avl.cpp
11. The AVL Tree and the Sorted Set
Based on the AVL tree in the last chapter, the sorted set data structure can be easily added.
The structure definition:
struct ZSet {
AVLNode *tree = NULL;
HMap hmap;
};
struct ZNode {
AVLNode tree;
HNode hmap;
double score = 0;
size_t len = 0;
char name[0];
};
The sorted set is a sorted list of (score, name) pairs that supports query or update by
the sorting key, or by the name. It’s a combination of the AVL tree and the hashtable, and
the pair node belongs to both, which demonstrates the flexibility of intrusive data
structures. The name string is embedded at the end of the pair node, in the hope of saving
some space overhead.
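Embedding the name at the end of the node means the node must be allocated with extra room
for the string. The sketch below shows the idea on a simplified stand-in struct (Pair,
pair_new and pair_del are hypothetical names, and the tree and hashtable links are left out).
Note that a zero-length array member is a compiler extension (accepted by GCC and Clang)
rather than standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Simplified stand-in for ZNode, without the AVLNode and HNode members.
struct Pair {
    double score;
    size_t len;
    char name[0];   // extension: the bytes of the name follow the struct
};

// allocate the struct and the name in one block
static Pair *pair_new(const char *name, size_t len, double score) {
    Pair *p = (Pair *)malloc(sizeof(Pair) + len);
    if (!p) {
        return NULL;
    }
    p->score = score;
    p->len = len;
    memcpy(&p->name[0], name, len);   // not NUL-terminated; len is authoritative
    return p;
}

static void pair_del(Pair *p) {
    free(p);    // one free() releases both the struct and the name
}
```

Compared with storing a separate std::string, this uses a single allocation per pair and
avoids one pointer indirection when comparing names.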
The function for tree insertion is roughly the same as the testing code in the
previous chapter:
*from = &node->tree;
node->tree.parent = cur;
zset->tree = avl_fix(&node->tree);
break;
}
cur = *from;
}
}
// add a new (score, name) tuple, or update the score of the existing tuple
bool zset_add(ZSet *zset, const char *name, size_t len, double score) {
ZNode *node = zset_lookup(zset, name, len);
if (node) {
zset_update(zset, node, score);
return false;
} else {
node = znode_new(name, len, score);
hm_insert(&zset->hmap, &node->hmap);
tree_add(zset, node);
return true;
}
}
// lookup by name
ZNode *zset_lookup(ZSet *zset, const char *name, size_t len) {
Here is the primary use case of sorted sets: the range query.
// find the (score, name) tuple that is greater or equal to the argument,
// then offset relative to it.
ZNode *zset_query(
    ZSet *zset, double score, const char *name, size_t len, int64_t offset)
{
    AVLNode *found = NULL;
    AVLNode *cur = zset->tree;
    while (cur) {
        if (zless(cur, score, name, len)) {
            cur = cur->right;
        } else {
            found = cur;    // candidate
            cur = cur->left;
        }
    }
    if (found) {
        found = avl_offset(found, offset);
    }
    return found ? container_of(found, ZNode, tree) : NULL;
}
The range query is just a regular binary tree look-up, followed by an offset operation. The
offset operation is what makes the sorted set special; it is not a regular binary tree walk.
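The look-up above depends on a comparison that orders nodes by the (score, name) tuple. The
zless function is not listed in this chapter, so the following is only a sketch of such a
comparison on plain values; tuple_less is a hypothetical name. It compares the scores first,
then the names byte by byte, and finally the lengths.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Returns true if (s1, n1) sorts strictly before (s2, n2).
static bool tuple_less(
    double s1, const char *n1, size_t l1,
    double s2, const char *n2, size_t l2)
{
    if (s1 != s2) {
        return s1 < s2;             // the score is the primary key
    }
    int rv = memcmp(n1, n2, l1 < l2 ? l1 : l2);
    if (rv != 0) {
        return rv < 0;              // then the common prefix of the names
    }
    return l1 < l2;                 // a shorter name sorts first on a tie
}
```

With a strict ordering like this, the loop in zset_query ends with found pointing at the
smallest tuple that is not less than the argument.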
Let’s review the AVLNode:
struct AVLNode {
uint32_t depth = 0;
uint32_t cnt = 0;
AVLNode *left = NULL;
It has an extra cnt field (the size of the subtree), which was not explained in the previous
chapter. It is used by the avl_offset function:
With the size information embedded in the node, we can determine whether the offset
target is inside a subtree or not. The offset operation runs in two phases: first, it walks
up the tree if the target is not in a subtree of the current node; then it walks down the
tree, narrowing the distance until the target is met. The worst case is O(log(n)) regardless
of how long the offset is, which is better than offsetting by walking to the succeeding node
one by one (best case of O(offset)). The real Redis project uses a similar technique for
skip lists.
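The two-phase walk can be sketched as follows. This is an illustrative version written from
the description above, on a stripped-down node type (Node, cnt, offset_node and build are
names assumed for this example), so it may differ in detail from the avl_offset in avl.cpp.
The variable pos tracks the rank of the current node relative to the starting node.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Node {
    uint32_t cnt = 1;   // size of the subtree rooted at this node
    Node *left = NULL;
    Node *right = NULL;
    Node *parent = NULL;
};

static uint32_t cnt(Node *n) { return n ? n->cnt : 0; }

// Move `offset` positions in sorted order from `node`; NULL if out of range.
static Node *offset_node(Node *node, int64_t offset) {
    int64_t pos = 0;    // rank of `node` relative to the starting node
    while (offset != pos) {
        if (pos < offset && pos + cnt(node->right) >= offset) {
            // the target is inside the right subtree
            node = node->right;
            pos += cnt(node->left) + 1;
        } else if (pos > offset && pos - cnt(node->left) <= offset) {
            // the target is inside the left subtree
            node = node->left;
            pos -= cnt(node->right) + 1;
        } else {
            // the target is in neither subtree: go to the parent
            Node *parent = node->parent;
            if (!parent) {
                return NULL;
            }
            if (parent->right == node) {
                pos -= cnt(node->left) + 1;
            } else {
                pos += cnt(node->right) + 1;
            }
            node = parent;
        }
    }
    return node;
}

// Test helper: link nodes[lo, hi) into a balanced tree whose sorted order
// is the array order. Returns the subtree root.
static Node *build(std::vector<Node> &nodes, size_t lo, size_t hi, Node *parent) {
    if (lo >= hi) {
        return NULL;
    }
    size_t mid = lo + (hi - lo) / 2;
    Node *n = &nodes[mid];
    n->parent = parent;
    n->left = build(nodes, lo, mid, n);
    n->right = build(nodes, mid + 1, hi, n);
    n->cnt = 1 + cnt(n->left) + cnt(n->right);
    return n;
}
```

A simple way to exercise it is to build a small tree with build and check that offsetting
from every node to every other node lands on the expected array element.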
It is a good idea to stop and test the new avl_offset function now.
dispose(c.root);
}
We have now implemented the major functionality of the sorted set. Let’s add the sorted
set type to our server.
enum {
T_STR = 0,
T_ZSET = 1,
};
The rest of the code is trivial and is omitted from the code listing.
CASES = r'''
$ ./client zscore asdf n1
(nil)
$ ./client zquery xxx 1 asdf 1 10
(arr) len=0
(arr) end
# more cases...
'''
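import shlex
import subprocess

# split CASES into commands (lines starting with '$ ') and their expected outputs
cmds = []
outputs = []
lines = CASES.splitlines()
for x in lines:
    x = x.strip()
    if not x or x.startswith('#'):
        # skip blank lines and placeholder comments such as "# more cases..."
        continue
    if x.startswith('$ '):
        cmds.append(x[2:])
        outputs.append('')
    else:
        outputs[-1] = outputs[-1] + x + '\n'

# Assumed final step (not part of the excerpt): run each command
# and compare its output with the expected text.
assert len(cmds) == len(outputs)
for cmd, expect in zip(cmds, outputs):
    out = subprocess.check_output(shlex.split(cmd)).decode('utf-8')
    assert out == expect, f'cmd:{cmd} out:{out} expect:{expect}'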
Exercises:
1. The avl_offset function gives us the ability to query the sorted set by rank. Now do the
reverse: given a node in an AVL tree, find its rank, with a worst case of O(log(n)).
2. Another sorted set application: count the number of elements within a range (also
with a worst case of O(log(n))).
3. The 11_server.cpp file already contains some sorted set commands; try adding more.
• 11_client.cpp
• 11_server.cpp
• avl.cpp
• avl.h
• common.h
• hashtable.cpp
• hashtable.h
• test_cmds.py
• test_offset.cpp
• zset.cpp
• zset.h
12. The Event Loop and Timers
There is one major thing missing in our server: timeouts. Every networked application
needs to handle timeouts since the other side of the network can just disappear. Not only
do ongoing IO operations like read/write need timeouts, but it is also a good idea to kick
out idle TCP connections. To implement timeouts, the event loop must be modified since
poll is the only blocking call in it.
Looking at our existing event loop code:
The poll syscall takes a timeout argument, which imposes an upper bound of time spent
on the poll syscall. The timeout value is currently an arbitrary value of 1000 ms. If we set
the timeout value according to the timer, poll should wake up at the time it expires, or
before that; then we have a chance to fire the timer in due time.
The problem is that we might have more than one timer; the timeout value of poll should
be the timeout value of the nearest timer. Some data structure is needed for finding the
nearest timer. The heap data structure is a popular choice for finding the min/max value
and is often used for this purpose. Any data structure for sorting can also be used; for
example, we can use the AVL tree to order timers and possibly augment the tree to keep
track of the minimum value.
Let’s start by adding timers to kick out idle TCP connections. For each connection there
is a timer, set to a fixed timeout into the future; every time there is IO activity on the
connection, the timer is renewed to a fixed timeout. Notice that when we renew a timer,
it becomes the most distant one; therefore, we can exploit this fact to simplify the data
structure. A simple linked list is sufficient to keep the order of timers: the new or updated
timer simply goes to the end of the list, and the list stays sorted. Also, these operations
on linked lists are O(1), which is better than what sorting data structures offer.
Defining the linked list is a trivial task:
struct DList {
DList *prev = NULL;
DList *next = NULL;
};
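For reference, here is a minimal sketch of the list operations the timer code relies on. It
assumes a circular doubly linked list with a dummy head node, which is one common way to get
O(1) insertion and removal without special cases. dlist_init, dlist_empty and
dlist_insert_before are names used later in this chapter; dlist_detach is a name assumed
here, and the bodies are written for this example.

```cpp
#include <cassert>
#include <cstddef>

struct DList {
    DList *prev = NULL;
    DList *next = NULL;
};

// an empty list is a head node linked to itself
inline void dlist_init(DList *node) {
    node->prev = node->next = node;
}

inline bool dlist_empty(DList *node) {
    return node->next == node;
}

// unlink a node from whatever list it is in
inline void dlist_detach(DList *node) {
    DList *prev = node->prev;
    DList *next = node->next;
    prev->next = next;
    next->prev = prev;
}

// insert `rookie` right before `target`;
// inserting before the head appends to the end of the list
inline void dlist_insert_before(DList *target, DList *rookie) {
    DList *prev = target->prev;
    prev->next = rookie;
    rookie->prev = prev;
    rookie->next = target;
    target->prev = rookie;
}
```

Renewing a timer is then a dlist_detach followed by a dlist_insert_before on the head, which
moves the connection to the end of the list.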
get_monotonic_usec is the function for getting the time. Note that the timestamp must
be monotonic. A timestamp that jumps backward can cause all sorts of trouble in computer
systems.
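One way to obtain such a timestamp on POSIX systems is clock_gettime with the
CLOCK_MONOTONIC clock, which is not affected by wall-clock adjustments. The body below is a
sketch under that assumption; error handling is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// microseconds from an arbitrary starting point; never goes backward
static uint64_t get_monotonic_usec() {
    timespec tv = {0, 0};
    clock_gettime(CLOCK_MONOTONIC, &tv);
    return uint64_t(tv.tv_sec) * 1000000 + tv.tv_nsec / 1000;
}
```

The absolute value is meaningless on its own; only the difference between two readings
should be used.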
The next step is adding the list to the server and the connection struct.
// global variables
static struct {
HMap db;
// a map of all client connections, keyed by fd
std::vector<Conn *> fd2conn;
// timers for idle connections
DList idle_list;
} g_data;
struct Conn {
int fd = -1;
uint32_t state = 0; // either STATE_REQ or STATE_RES
// buffer for reading
size_t rbuf_size = 0;
uint8_t rbuf[4 + k_max_msg];
// buffer for writing
size_t wbuf_size = 0;
size_t wbuf_sent = 0;
uint8_t wbuf[4 + k_max_msg];
uint64_t idle_start = 0;
// timer
DList idle_list;
};
int main() {
// some initializations
dlist_init(&g_data.idle_list);
// handle timers
process_timers();
return 0;
}
The next_timer_ms function takes the first (nearest) timer from the list and uses it to
calculate the timeout value of poll.
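The arithmetic behind that calculation can be sketched as a pure function. Here the caller
passes in the current time and the idle_start of the first connection in the list;
timeout_ms is a hypothetical name and the 5-second value of k_idle_timeout_ms is an
assumption for illustration.

```cpp
#include <cassert>
#include <cstdint>

const uint64_t k_idle_timeout_ms = 5 * 1000;    // assumed value

// How long poll may sleep before the nearest idle timer expires.
static uint32_t timeout_ms(uint64_t now_us, uint64_t idle_start_us) {
    uint64_t next_us = idle_start_us + k_idle_timeout_ms * 1000;
    if (next_us <= now_us) {
        return 0;   // already expired: poll should return immediately
    }
    // integer division rounds down, so poll may wake up slightly early
    return (uint32_t)((next_us - now_us) / 1000);
}
```

Since the timestamps are in microseconds and poll takes milliseconds, a wake-up that comes
slightly early simply leads to one more iteration of the event loop.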
At each iteration of the event loop, the list is checked in order to fire timers in due time.
// do the work
if (conn->state == STATE_REQ) {
state_req(conn);
} else if (conn->state == STATE_RES) {
state_res(conn);
} else {
assert(0); // not expected
}
}
}
conn->fd = connfd;
conn->state = STATE_REQ;
conn->rbuf_size = 0;
conn->wbuf_size = 0;
conn->wbuf_sent = 0;
conn->idle_start = get_monotonic_usec();
dlist_insert_before(&g_data.idle_list, &conn->idle_list);
conn_put(g_data.fd2conn, conn);
return 0;
}
Don’t forget to remove the connection from the list when done:
$ ./server
removing idle connection: 4
$ socat tcp:127.0.0.1:1234 -
• 12_server.cpp
• avl.cpp
• avl.h
• common.h
• hashtable.cpp
• hashtable.h
• list.h
• zset.cpp
• zset.h
13. The Heap Data Structure and the TTL
The primary use of Redis is as a cache server, and one way to manage the size of the cache
is by explicitly setting TTLs (time to live). TTLs can be implemented using timers.
Unfortunately, the timers in the last chapter are of a fixed value (using linked lists); thus,
a sorting data structure is needed for implementing arbitrary and mutable timeouts, and the
heap data structure is a popular choice. Compared with the AVL tree we used before, the
heap data structure has the advantage of using less space.
A quick review of the heap data structure:
1. A heap is a binary tree, packed into an array, and the layout of the tree is fixed. The
parent-child relationship is implicit; pointers are not included in heap elements.
2. The only constraint on the tree is that parents are no bigger than their kids.
3. The value of an element can be updated. If the value changes:
• Its value is bigger than before: it may be bigger than its kids, and if so, swap it
with the smallest kid, so that the parent-child constraint is satisfied again. Now
that one of the kids is bigger than before, continue this process until reaching a
leaf.
• Its value is smaller: likewise, swap it with its parent until reaching the root.
4. New elements are added to the end of the array as leaves. Maintain the constraint as
above.
5. When removing an element from a heap, replace it with the last element in the array,
then maintain the constraint as if its value was updated.
struct HeapItem {
uint64_t val = 0;
size_t *ref = NULL;
};
std::string val;
uint32_t type = 0;
ZSet *zset = NULL;
// for TTLs
size_t heap_idx = -1;
};
The heap is used to order the timestamps, and the Entry is mutually linked with the
timestamp: the heap_idx is the index of the corresponding HeapItem, and the ref of the
HeapItem points back to the Entry. We are using the intrusive data structure again; the ref
pointer points to the heap_idx field inside the Entry.
Swap with the parent when a kid is smaller than its parent. Note the heap_idx is updated
through the ref pointer while swapping.
*a[pos].ref = pos;
pos = heap_parent(pos);
}
a[pos] = t;
*a[pos].ref = pos;
}
*a[pos].ref = pos;
pos = min_pos;
}
a[pos] = t;
*a[pos].ref = pos;
}
The heap_update is the heap function for updating a position. It is used for updating,
inserting, and deleting.
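Putting the pieces together, a self-contained sketch of the heap could look like this. The
HeapItem layout and the tails of the two loops follow the fragments above; the index helpers
and the rest of the bodies are filled in from the review list at the start of this chapter
and should be treated as assumptions rather than the exact contents of heap.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct HeapItem {
    uint64_t val = 0;
    size_t *ref = NULL;     // points to the owner's copy of the array index
};

static size_t heap_parent(size_t i) { return (i + 1) / 2 - 1; }
static size_t heap_left(size_t i) { return i * 2 + 1; }
static size_t heap_right(size_t i) { return i * 2 + 2; }

// move the item at `pos` toward the root while it is smaller than its parent
static void heap_up(HeapItem *a, size_t pos) {
    HeapItem t = a[pos];
    while (pos > 0 && a[heap_parent(pos)].val > t.val) {
        a[pos] = a[heap_parent(pos)];   // pull the parent down
        *a[pos].ref = pos;
        pos = heap_parent(pos);
    }
    a[pos] = t;
    *a[pos].ref = pos;
}

// move the item at `pos` toward the leaves while a kid is smaller
static void heap_down(HeapItem *a, size_t pos, size_t len) {
    HeapItem t = a[pos];
    while (true) {
        size_t l = heap_left(pos);
        size_t r = heap_right(pos);
        size_t min_pos = (size_t)-1;
        uint64_t min_val = t.val;
        if (l < len && a[l].val < min_val) {
            min_pos = l;
            min_val = a[l].val;
        }
        if (r < len && a[r].val < min_val) {
            min_pos = r;
        }
        if (min_pos == (size_t)-1) {
            break;                      // no smaller kid
        }
        a[pos] = a[min_pos];            // pull the smallest kid up
        *a[pos].ref = pos;
        pos = min_pos;
    }
    a[pos] = t;
    *a[pos].ref = pos;
}

// restore the heap constraint after the value at `pos` changed
static void heap_update(HeapItem *a, size_t pos, size_t len) {
    if (pos > 0 && a[heap_parent(pos)].val > a[pos].val) {
        heap_up(a, pos);
    } else {
        heap_down(a, pos, len);
    }
}
```

To insert, push the new item onto the end of the array and call heap_update on the last
position; every move along the way writes the new index back through ref, so the owner
always knows where its item is.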
// global variables
static struct {
HMap db;
// a map of all client connections, keyed by fd
std::vector<Conn *> fd2conn;
// timers for idle connections
DList idle_list;
// timers for TTLs
std::vector<HeapItem> heap;
} g_data;
Updating, adding, and removing a timer in the heap all work the same way: just call
heap_update after updating an element of the array.
The next_timer_ms function is modified to use both idle timers and TTL timers.
// idle timers
if (!dlist_empty(&g_data.idle_list)) {
Conn *next = container_of(g_data.idle_list.next, Conn, idle_list);
// ttl timers
if (!g_data.heap.empty() && g_data.heap[0].val < next_us) {
next_us = g_data.heap[0].val;
}
if (next_us == (uint64_t)-1) {
return 10000; // no timer, the value doesn't matter
}
// idle timers
while (!dlist_empty(&g_data.idle_list)) {
// code omitted...
}
// TTL timers
const size_t k_max_works = 2000;
size_t nworks = 0;
while (!g_data.heap.empty() && g_data.heap[0].val < now_us) {
Entry *ent = container_of(g_data.heap[0].ref, Entry, heap_idx);
HNode *node = hm_pop(&g_data.db, &ent->node, &hnode_same);
assert(node == &ent->node);
entry_del(ent);
if (nworks++ >= k_max_works) {
// don't stall the server if too many keys are expiring at once
break;
}
}
}
This is just checking the minimal value of the heap and removing keys. Note that we
put a limit on the number of keys expired per event loop iteration; the limit is needed to
prevent the server from stalling should too many keys expire at once.
The command for updating and querying TTLs is straightforward to add:
Entry key;
key.key.swap(cmd[1]);
key.node.hcode = str_hash((uint8_t *)key.key.data(), key.key.size());
Exercises:
1. The heap-based timer adds O(log(n)) operations to the server, which might be a
bottleneck for a sufficiently large number of keys. Can you think of optimizations
for a large number of timers?
2. The real Redis does not use sorting for expiration, find out how it is done, and list
the pros and cons of both approaches.
• 13_server.cpp
• avl.cpp
• avl.h
• common.h
• hashtable.cpp
• hashtable.h
• heap.cpp
• heap.h
• list.h
• test_heap.cpp
• zset.cpp
• zset.h
14. The Thread Pool & Asynchronous Tasks
There is a flaw in our server since the introduction of the sorted set data type: the deletion
of keys. If the size of a sorted set is huge, it can take a long time to free its nodes and
the server is stalled during the destruction of the key. This can be easily fixed by using
multi-threading to move the destructor away from the main thread.
Firstly, we introduce the “thread pool”, which is literally a pool of threads. The thread
from the pool consumes tasks from a queue and executes them. It is trivial to code a
multi-producer multi-consumer queue using pthread APIs. (Although there is only a
single producer in our case.)
The relevant pthread primitives are pthread_mutex_t and pthread_cond_t; they are called
the mutex and the condition variable respectively. If you are unfamiliar with them, it
is advised to get some education on multi-threading after reading this chapter. (Such as
manpages of the pthread APIs, textbooks on operating systems, online courses, etc.)
• The queue is accessed by multiple threads (both the producer and consumers), so it
needs the protection of a mutex.
• The consumer threads should sleep when idle and only be woken up when
the queue is not empty; this is the job of the condition variable.
struct Work {
void (*f)(void *) = NULL;
void *arg = NULL;
};
struct TheadPool {
std::vector<pthread_t> threads;
std::deque<Work> queue;
pthread_mutex_t mu;
pthread_cond_t not_empty;
};
The thread_pool_init is for initialization and starting threads. pthread types are initialized
by pthread_xxx_init functions and the pthread_create starts a thread with the target
function worker.
tp->threads.resize(num_threads);
for (size_t i = 0; i < num_threads; ++i) {
int rv = pthread_create(&tp->threads[i], NULL, &worker, tp);
assert(rv == 0);
}
}
// do the work
w.f(w.arg);
}
return NULL;
}
pthread_mutex_lock(&tp->mu);
tp->queue.push_back(w);
pthread_cond_signal(&tp->not_empty);
pthread_mutex_unlock(&tp->mu);
}
The explanation:
1. For both the producer and consumers, the queue access code is surrounded by
pthread_mutex_lock and pthread_mutex_unlock, so only one thread can access the
queue at once.
2. After a consumer acquires the mutex, it checks the queue:
• If the queue is not empty, grab a job from the queue, release the mutex and do
the work.
• Otherwise, release the mutex and go to sleep; the sleeper can be woken up later
by the condition variable. This is accomplished via a single pthread_cond_wait
call.
3. After the producer puts a job into the queue, the producer calls pthread_cond_signal
to wake up a potentially sleeping consumer.
4. After a consumer wakes up from pthread_cond_wait, the mutex is held again
automatically. The consumer must check the condition again after waking up; if
the condition (a non-empty queue) is not satisfied, it goes back to sleep.
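The steps above can be sketched as one self-contained piece of code. The Work struct and the
function names follow the fragments in this chapter; the struct is spelled ThreadPool here,
add_one is a demo task invented for this example, and the function bodies are a
reconstruction from the explanation, so treat them as an approximation of thread_pool.cpp.
There is no shutdown logic: the workers run until the process exits.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <pthread.h>
#include <vector>

struct Work {
    void (*f)(void *) = NULL;
    void *arg = NULL;
};

struct ThreadPool {
    std::vector<pthread_t> threads;
    std::deque<Work> queue;
    pthread_mutex_t mu;
    pthread_cond_t not_empty;
};

// consumer: sleep until the queue is non-empty, then run one task
static void *worker(void *arg) {
    ThreadPool *tp = (ThreadPool *)arg;
    while (true) {
        pthread_mutex_lock(&tp->mu);
        // re-check the condition in a loop after every wake-up
        while (tp->queue.empty()) {
            pthread_cond_wait(&tp->not_empty, &tp->mu);
        }
        Work w = tp->queue.front();
        tp->queue.pop_front();
        pthread_mutex_unlock(&tp->mu);
        // do the work without holding the mutex
        w.f(w.arg);
    }
    return NULL;
}

static void thread_pool_init(ThreadPool *tp, size_t num_threads) {
    assert(num_threads > 0);
    int rv = pthread_mutex_init(&tp->mu, NULL);
    assert(rv == 0);
    rv = pthread_cond_init(&tp->not_empty, NULL);
    assert(rv == 0);
    tp->threads.resize(num_threads);
    for (size_t i = 0; i < num_threads; ++i) {
        rv = pthread_create(&tp->threads[i], NULL, &worker, tp);
        assert(rv == 0);
    }
}

// producer: enqueue a task and wake up one sleeping consumer
static void thread_pool_queue(ThreadPool *tp, void (*f)(void *), void *arg) {
    Work w;
    w.f = f;
    w.arg = arg;
    pthread_mutex_lock(&tp->mu);
    tp->queue.push_back(w);
    pthread_cond_signal(&tp->not_empty);
    pthread_mutex_unlock(&tp->mu);
}

// demo task: atomically increment a shared counter
static void add_one(void *arg) {
    static_cast<std::atomic<int> *>(arg)->fetch_add(1);
}
```

A quick way to try it is to start a pool of a few threads, queue add_one a number of times
against one std::atomic<int>, and wait until the counter reaches the number of queued tasks.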
The use of the condition variable needs some more explanation: the pthread_cond_wait
function is always inside a loop checking for the condition. This is because the condition
could be changed by other consumers before the waking consumer grabs the mutex;
the mutex is not transferred from the signaler to the to-be-woken consumer! POSIX also
permits spurious wakeups, which the loop handles as well. It is probably
a mistake if you see a condition variable used without a loop.
A concrete sequence to help you understand the use of condition variables:
Note that pthread_cond_signal doesn’t need to be protected by the mutex; signaling
after releasing the mutex is also correct.
The thread pool is done. Let’s add that to our server:
// global variables
static struct {
HMap db;
// a map of all client connections, keyed by fd
std::vector<Conn *> fd2conn;
// timers for idle connections
DList idle_list;
// timers for TTLs
std::vector<HeapItem> heap;
// the thread pool
TheadPool tp;
} g_data;
// some initializations
dlist_init(&g_data.idle_list);
thread_pool_init(&g_data.tp, 4);
The entry_del function is modified: It will put the destruction of large sorted sets into the
thread pool. And the thread pool is only for the large ones since multi-threading has some
overheads too.
// dispose the entry after it got detached from the key space
static void entry_del(Entry *ent) {
entry_set_ttl(ent, -1);
if (too_big) {
thread_pool_queue(&g_data.tp, &entry_del_async, ent);
} else {
entry_destroy(ent);
}
}
Exercises:
semaphore.
2. Some fun exercises to help you understand these primitives further:
• 14_server.cpp
• avl.cpp
• avl.h
• common.h
• hashtable.cpp
• hashtable.h
• heap.cpp
• heap.h
• list.h
• thread_pool.cpp
• thread_pool.h
• zset.cpp
• zset.h
Appendixes
A1: Hints to Exercises
Q: Our hashtable triggers resizing when the load factor is too high, should we also
shrink the hashtable when the load factor is too low? Can the shrinking be performed
automatically?
Hints:
Hashtable shrinking is not done automatically in practice. Many real-world usage patterns
are periodic, so shrinking is not always clearly beneficial. Besides, shrinking does not
always return the memory to the OS; this depends on many factors such as the malloc
implementation and the level of memory fragmentation, so the outcome of shrinking is not
easily predictable.
Q: Can you create more test cases? The test cases presented in this chapter are unlikely
to be sufficient.
Hints:
Our existing test cases enumerate AVL trees of various sizes. However, given a tree of a
particular size, there are many possible configurations, we can go further by enumerating
tree configurations too.
Also, for more complicated code, it is helpful to use profiling or coverage tools to check
whether the test cases give full coverage of the target code. Incomplete coverage suggests
that the test cases are missing something or that the target code contains unreachable
paths.
Q: The avl_offset function gives us the ability to query sorted set by rank, now do
the reverse, given a node in an AVL tree, find its rank, with a worst-case of O(log(n)).
Hints:
The rank of a node is related to the rank of its parent. And the rank of the root is obvious.
Q: Another sorted set application: count the number of elements within a range. (also
with a worst-case of O(log(n)).)
Hints:
Q: The heap-based timer adds O(log(n)) operations to the server, which might be a
bottleneck for a sufficiently large number of keys. Can you think of optimizations for
a large number of timers?
Hints:
We can make the heap more cache-friendly by using an n-ary tree instead of a binary
tree. Some real-world projects use a 4-ary heap, which fits the 64-byte cache line well.
Also, in our case, the TTL timers don’t have to be fired at the exact time. We can use a
very coarse timestamp (such as round up to 1min resolution) for TTL timers, and keys
with the same timestamp can share the same timer. This reduces the number of timers,
but the timers are delayed so we need to check the real expiration time when accessing
the key.
Q: The real Redis does not use sorting for expiration, find out how it is done, and list
the pros and cons of both approaches.
Hints:
Taking the idea that keys don’t need to be expired at the exact time, the real Redis samples
the key space at random to find dead keys. The higher the ratio of dead keys, the easier it
is to find and eliminate them.
The cons:
1. It requires that keys with a TTL should not be mixed with keys without a TTL,
otherwise, the non-TTL keys interfere with the sampling, making it harder to find
dead keys. This can be a source of surprise for operators.
2. While the concept is simple, the implementation uses some heuristics to determine
the rate of the sampling. If the heuristic is not tuned properly, in the worst case, the
server might not remove dead keys fast enough, leading to excessive memory
usage, which may frustrate the operator.
Hints:
You need to figure out how to sleep and wake up using a mutex first. Then you need to
keep track of a list of sleepers in the condition variable so that you can wake them up
later.