… rarely observed: in an insert test with 8 threads, less than 1 […] next_layer pointers, are updated lazily during later operations.

… implementation of remove [33] breaks the invariants that make this possible. Multi-column puts are atomic: a concurrent get will see either all or none of a put's column modifications.

4.6.5 Removes

Masstree, unlike some prior work [11, 27], includes a full implementation of concurrent remove. Space constraints preclude a full discussion, but we mention several interesting features.

First, remove operations, when combined with inserts, must sometimes cause readers to retry! Consider the following threads running in parallel on a one-node tree:

    get(n, k1):
        locate k1 at n position i
        ⋮                          remove(n, k1):
        ⋮                              remove k1 from n position i
        ⋮                          put(n, k2, v2):
        ⋮                              insert k2, v2 at n position j
        lv ← n.lv[i]; check n.version;
        return lv.value

The get operation may return k1's (removed) value, since the operations overlapped. Remove thus must not clear the memory corresponding to the key or its value: it just changes the permutation. But then if the put operation happened to pick j = i, the get operation might return v2, which isn't a valid value for k1. Masstree must therefore update the version counter's vinsert field when removed slots are reused.

When a border node becomes empty, Masstree removes it and any resulting empty ancestors. This requires the border-node list be doubly-, not singly-, linked. A naive implementation could break the list under concurrent splits and removes; compare-and-swap operations (some including flag bits) are required for both split and remove, which slightly slows down split. As with any state observable by concurrent readers, removed nodes must not be freed immediately. Instead, we mark them as deleted and reclaim them later. Any operation that encounters a deleted node retries from the root. Remove's code for manipulating interior nodes resembles that for split; hand-over-hand locking is used to find the right key to remove. Once that key is found, the deleted node becomes completely unreferenced and future readers will not encounter it.

Removes can delete entire layer-h trees for h ≥ 1. These are not cleaned up right away: normal operations lock at most one layer at a time, and removing a full tree requires locking both the empty layer-h tree and the layer-(h − 1) border node that points to it. Epoch-based reclamation tasks are scheduled as needed to clean up empty and pathologically-shaped layer-h trees.

4.7 Values

The Masstree system stores values consisting of a version number and an array of variable-length strings called columns. Gets can retrieve multiple columns (identified by integer indexes) and puts can modify multiple columns.

Masstree includes several value implementations; we evaluate one most appropriate for small values. Each value is allocated as a single memory block. Modifications don't act in place, since this could expose intermediate states to concurrent readers. Instead, put creates a new value object, copying unmodified columns from the old value object as appropriate. This design uses cache effectively for small values, but would cause excessive data copying for large values; for those, Masstree offers a design that stores each column in a separately-allocated block.

4.8 Discussion

More than 30% of the cost of a Masstree lookup is in computation (as opposed to DRAM waits), mostly due to key search within tree nodes. Linear search has higher complexity than binary search, but exhibits better locality. For Masstree, the performance difference of the two search schemes is architecture dependent. On an Intel processor, linear search can be up to 5% faster than binary search. On an AMD processor, both perform the same.

One important PALM optimization is parallel lookup [34]. This effectively overlaps the DRAM fetches for many operations by looking up the keys for a batch of requests in parallel. Our implementation of this technique did not improve performance on our 48-core AMD machine, but on a 24-core Intel machine, throughput rose by up to 34%. We plan to change Masstree's network stack to apply this technique.

5. Networking and persistence

Masstree uses network interfaces that support per-core receive and transmit queues, which reduce contention when short query packets arrive from many clients. To support short connections efficiently, Masstree can configure per-core UDP ports that are each associated with a single core's receive queue. Our benchmarks, however, use long-lived TCP query connections from few clients (or client aggregators), a common operating mode that is equally effective at avoiding network overhead.

Masstree logs updates to persistent storage to achieve persistence and crash recovery. Each server query thread (core) maintains its own log file and in-memory log buffer. A corresponding logging thread, running on the same core as the query thread, writes out the log buffer in the background. Logging thus proceeds in parallel on each core.

A put operation appends to the query thread's log buffer and responds to the client without forcing that buffer to storage. Logging threads batch updates to take advantage of higher bulk sequential throughput, but force logs to storage at least every 200 ms for safety. Different logs may be on different disks or SSDs for higher total log throughput.
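The per-core logging scheme lends itself to a simple structure: the query thread appends records to an in-memory buffer and returns, and a logger thread on the same core periodically writes the accumulated batch and syncs it. The sketch below is a minimal illustration of that description only; the class name, record format, locking, and error handling are our own simplifications, not Masstree's actual logging code.

    // Minimal sketch of per-core logging: a query thread appends records to an
    // in-memory buffer; a logger thread pinned to the same core flushes the
    // buffer to that core's own log file at least every 200 ms (group commit).
    // Record format, names, and error handling are illustrative only.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <mutex>
    #include <string>
    #include <thread>
    #include <unistd.h>   // fsync, fileno

    class CoreLog {
    public:
        explicit CoreLog(const std::string& path)
            : file_(std::fopen(path.c_str(), "ab")) {}

        // Query thread: append a record and return without forcing it to storage.
        void append(const std::string& key, const std::string& value, uint64_t ts) {
            std::lock_guard<std::mutex> guard(lock_);
            buffer_ += key + '\t' + value + '\t' + std::to_string(ts) + '\n';
        }

        // Logger thread: write out and sync whatever has accumulated so far.
        void flush() {
            std::string batch;
            {
                std::lock_guard<std::mutex> guard(lock_);
                batch.swap(buffer_);            // take the batch; appends continue
            }
            if (!batch.empty() && file_) {
                std::fwrite(batch.data(), 1, batch.size(), file_);
                std::fflush(file_);
                fsync(fileno(file_));           // one sync per batch, not per put
            }
        }

        // Logger thread body: batch updates, but never hold them past ~200 ms.
        void logger_loop(const bool& stop) {
            while (!stop) {
                std::this_thread::sleep_for(std::chrono::milliseconds(200));
                flush();
            }
            flush();
        }

    private:
        std::FILE* file_;
        std::mutex lock_;
        std::string buffer_;
    };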
Value version numbers and log record timestamps aid the process of log recovery. Sequential updates to a value obtain distinct, and increasing, version numbers. Update version numbers are written into the log along with the operation, and each log record is timestamped. When restoring a database from logs, Masstree sorts logs by timestamp. It first calculates the recovery cutoff point, which is the minimum of the logs' last timestamps, τ = min_{ℓ ∈ L} max_{u ∈ ℓ} u.timestamp, where L is the set of available logs and u denotes a single logged update. Masstree plays back the logged updates in parallel, taking care to apply a value's updates in increasing order by version, except that updates with u.timestamp ≥ τ are dropped.

Masstree periodically writes out a checkpoint containing all keys and values. This speeds recovery and allows log space to be reclaimed. Recovery loads the latest valid checkpoint that completed before τ, the log recovery time, and then replays logs starting from the timestamp at which the checkpoint began.

Our checkpoint facility is independent of the Masstree design; we include it to show that persistence need not limit system performance, but do not evaluate it in depth. It takes Masstree 58 seconds to create a checkpoint of 140 million key-value pairs (9.1 GB of data in total), and 38 seconds to recover from that checkpoint. The main bottleneck for both is imbalance in the parallelization among cores. Checkpoints run in parallel with request processing. When run concurrently with a checkpoint, a put-only workload achieves 72% of its ordinary throughput due to disk contention.

6. Tree evaluation

We evaluate Masstree in two parts. In this section, we focus on Masstree's central data structure, the trie of B+-trees. We show the cumulative impact on performance of various tree design choices and optimizations. We show that Masstree scales effectively and that its single shared tree can outperform separate per-core trees when the workload is skewed. We also quantify the costs of Masstree's flexibility. While variable-length key support comes for free, range query support does not: a near-best-case hash table (which lacks range query support) can provide 2.5× the throughput of Masstree.

The next section evaluates Masstree as a system. There, we describe the performance impact of checkpoint and recovery, and compare the whole Masstree system against other high performance storage systems: MongoDB, VoltDB, Redis, and memcached. Masstree performs very well, achieving 26–1000× the throughput of the other tree-based (range-query-supporting) stores. Redis and memcached are based on hash tables; this gives them O(1) average-case lookup in exchange for not supporting range queries. memcached can exceed Masstree's throughput on uniform workloads; on other workloads, Masstree provides up to 3.7× the throughput of these systems.

6.1 Setup

The experiments use a 48-core server (eight 2.4 GHz six-core AMD Opteron 8431 chips) running Linux 3.1.5. Each core has private 64 KB instruction and data caches and a 512 KB private L2 cache. The six cores in each chip share a 6 MB L3 cache. Cache lines are 64 bytes. Each of the chips has 8 GB of DRAM attached to it. The tests use up to 16 cores on up to three chips, and use DRAM attached to only those three chips; the extra cores are disabled. The goal is to mimic the configuration of a machine more like those easily purchasable today. The machine has four SSDs, each with a measured sequential write speed of 90 to 160 MB/sec. Masstree uses all four SSDs to store logs and checkpoints. The server has a 10 Gb Ethernet card (NIC) connected to a switch. Also on that switch are 25 client machines that send requests over TCP. The server's NIC distributes interrupts over all cores. Results are averaged over three runs.

All experiments in this section use small keys and values. Most keys are no more than 10 bytes long; values are always 1–10 bytes long. Keys are distributed uniformly at random over some range (the range changes by experiment). The key space is not partitioned: a border node generally contains keys created by different clients, and sometimes one client will overwrite a key originally inserted by another. One common key distribution is "1-to-10-byte decimal," which comprises the decimal string representations of random numbers between 0 and 2^31. This exercises Masstree's variable-length key support, and 80% of the keys are 9 or 10 bytes long, causing Masstree to create layer-1 trees.

We run separate experiments for gets and puts. Get experiments start with a full store (80–140 million keys) and run for 20 seconds. Put experiments start with an empty store and run for 140 million total puts. Most puts are inserts, but about 10% are updates since multiple clients occasionally put the same key. Puts generally run 30% slower than gets.

6.2 Factor analysis

We analyze Masstree's performance by breaking down the performance gap between a binary tree and Masstree. We evaluate several configurations on 140M-key 1-to-10-byte-decimal get and put workloads with 16 cores. Each server thread generates its own workload: these numbers do not include the overhead of network and logging. Figure 8 shows the results.

Binary  We first evaluate a fast, concurrent, lock-free binary tree. Each 40-byte tree node here contains a full key, a value pointer, and two child pointers. The fast jemalloc memory allocator is used.

+Flow, +Superpage, +IntCmp  Memory allocation often bottlenecks multicore performance. We switch to Flow, our implementation of the Streamflow [32] allocator ("+Flow"). Flow supports 2 MB x86 superpages, which, when introduced ("+Superpage"), improve throughput by 27–37% due to fewer TLB misses and lower kernel overhead for allocation. Integer key comparison (§4.2, "+IntCmp") further improves throughput by 15–24%.
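"+IntCmp" replaces byte-at-a-time key comparison with comparisons of 8-byte key slices treated as integers (§4.2). The idea is sketched below: loading each slice in big-endian byte order makes unsigned integer comparison agree with memcmp order. The helper names are ours, and the handling of key lengths is simplified relative to Masstree.

    // Sketch of comparing keys 8 bytes at a time as integers rather than byte
    // by byte. Helper names are illustrative, not Masstree's.
    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <string>

    // Pack up to 8 bytes starting at key[offset] into a big-endian uint64_t,
    // padding with zero bytes past the end of the key.
    static uint64_t key_slice(const std::string& key, size_t offset) {
        unsigned char buf[8] = {0, 0, 0, 0, 0, 0, 0, 0};
        if (offset < key.size())
            std::memcpy(buf, key.data() + offset,
                        std::min<size_t>(8, key.size() - offset));
        return (uint64_t(buf[0]) << 56) | (uint64_t(buf[1]) << 48)
             | (uint64_t(buf[2]) << 40) | (uint64_t(buf[3]) << 32)
             | (uint64_t(buf[4]) << 24) | (uint64_t(buf[5]) << 16)
             | (uint64_t(buf[6]) << 8)  |  uint64_t(buf[7]);
    }

    // Compare two keys slice by slice; each comparison covers 8 bytes at once.
    int compare_keys(const std::string& a, const std::string& b) {
        for (size_t off = 0; off < std::max(a.size(), b.size()); off += 8) {
            uint64_t sa = key_slice(a, off), sb = key_slice(b, off);
            if (sa != sb)
                return sa < sb ? -1 : 1;
        }
        return a.size() == b.size() ? 0 : (a.size() < b.size() ? -1 : 1);
    }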
[Figure 8 bar chart: get and put throughput for the Binary, +Flow, +Superpage, +IntCmp, 4-tree, B-tree, +Prefetch, +Permuter, and Masstree configurations.]
Figure 8. Contributions of design features to Masstree’s performance (§6.2). Design features are cumulative. Measurements
use 16 cores and each server thread generates its own load (no clients or network traffic). Bar numbers give throughput relative
to the binary tree running the get workload.
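"+Superpage" in Figure 8 backs the allocator's memory with 2 MB x86 superpages to cut TLB misses and kernel allocation overhead. One way to obtain such memory on Linux is mmap with MAP_HUGETLB, as sketched below; this illustrates the technique only, is not how Flow itself is implemented, and assumes the system has huge pages reserved.

    // Illustrative only: request a 2 MB-aligned region backed by 2 MB huge
    // pages on Linux, falling back to ordinary pages if huge pages are
    // unavailable.
    #include <cstddef>
    #include <sys/mman.h>

    void* alloc_superpage_region(size_t bytes) {
        const size_t kSuperpage = 2 << 20;                          // 2 MB
        size_t len = (bytes + kSuperpage - 1) & ~(kSuperpage - 1);  // round up
        void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            // Fall back to normal 4 KB pages; the TLB benefit is lost but the
            // allocator still works.
            p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        }
        return p == MAP_FAILED ? nullptr : p;
    }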
4-tree  A balanced binary tree has log₂ n depth, imposing an average of log₂ n − 1 serial DRAM latencies per lookup. We aim to reduce and overlap those latencies and to pack more useful information into cache lines that must be fetched. "4-tree," a tree with fanout 4, uses both these techniques. Its wider fanout nearly halves average depth relative to the binary tree. Each 4-tree node comprises two cache lines, but usually only the first must be fetched from DRAM. This line contains all data important for traversal—the node's four child pointers and the first 8 bytes of each of its keys. (The binary tree also fetches only one cache line per node, but most of it is not useful for traversal.) All internal nodes are full. Reads are lockless and need never retry; inserts are lock-free but use compare-and-swap. "4-tree" improves throughput by 41–44% over "+IntCmp".

B-tree, +Prefetch, +Permuter  4-tree yields good performance, but would be difficult to balance. B-trees have even wider fanout and stay balanced, at the cost of somewhat less efficient memory usage (nodes average 75% full). "B-tree" is a concurrent B+-tree with fanout 15 that implements our concurrency control scheme from §4. Each node has space for up to the first 16 bytes of each key. Unfortunately this tree reduces put throughput by 12% over 4-tree, and does not improve get throughput much. Conventional B-tree inserts must rearrange a node's keys—4-tree never rearranges keys—and B-tree nodes spend 5 cache lines to achieve average fanout 11, a worse cache-line-to-fanout ratio than 4-tree's. However, wide B-tree nodes are easily prefetched to overlap these DRAM latencies. When prefetching is added, B-tree improves throughput by 9–31% over 4-tree ("+Prefetch"). Leaf-node permutations (§4.6.2, "+Permuter") further improve put throughput by 4%.

Masstree  Finally, Masstree itself improves throughput by 4–8% over "+Permuter" in these experiments. This surprised us. 1-to-10-byte decimal keys can share an 8-byte prefix, forcing Masstree to create layer-1 trie-nodes, but in these experiments such nodes are quite empty. A 140M-key put workload, for example, creates a tree with 33% of its keys in layer-1 trie-nodes, but the average number of keys per layer-1 trie-node is just 2.3. One might expect this to perform worse than a true B-tree, which has better node utilization. Masstree's design, thanks to features such as storing 8 bytes per key per interior node rather than 16, appears efficient enough to overcome this effect.

6.3 System relevance of tree design

Cache-crafty design matters not just in isolation, but also in the context of a full system. We turn on logging, generate load using network clients, and compare "+IntCmp," the fastest binary tree from the previous section, with Masstree. On 140M-key 1-to-10-byte-decimal workloads with 16 cores, Masstree provides 1.90× and 1.53× the throughput of the binary tree for gets and puts, respectively.⁵ Thus, if logging and networking infrastructure are reasonably well implemented, tree design can improve system performance.

⁵ Absolute Masstree throughput is 8.03 Mreq/sec for gets (77% of the Figure 8 value) and 5.78 Mreq/sec for puts (63% of the Figure 8 value).

6.4 Flexibility

Masstree supports several features that not all key-value applications require, including range queries, variable-length keys, and concurrency. We now evaluate how much these features cost by measuring tree variants that do not support them. We include network and logging.
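The "+Prefetch" B-tree (and Masstree itself) issues prefetches for all of a wide node's cache lines before searching it, so the lines travel from DRAM in parallel rather than as one dependent fetch after another. A minimal sketch of that pattern follows, with a placeholder node layout rather than Masstree's real one.

    // Sketch: prefetch every cache line of a wide tree node before key search,
    // so the DRAM fetches overlap instead of being serialized. The node layout
    // is a placeholder, not Masstree's actual structure.
    struct WideNode {
        static const int kCacheLines = 4;     // e.g., a 256-byte node
        char bytes[kCacheLines * 64];
    };

    inline void prefetch_node(const WideNode* n) {
        for (int i = 0; i < WideNode::kCacheLines; ++i)
            __builtin_prefetch(reinterpret_cast<const char*>(n) + i * 64);
        // ...key search touches the node only after all lines are in flight.
    }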
Variable-length keys  We compare Masstree with a concurrent B-tree supporting only fixed-size 8-byte keys (a version of "+Permuter"). When run on a 16-core get workload with 80M 8-byte decimal keys, Masstree supports 9.84 Mreq/sec and the fixed-size B-tree 9.93 Mreq/sec, just 0.8% more. The difference is so small likely because the trie-of-trees design effectively has fixed-size keys in most tree nodes.

Keys with common prefixes  Masstree is intended to preserve good cache performance when keys share common prefixes. However, unlike some designs, such as partial-key B-trees, Masstree can become superficially unbalanced. Figure 9 provides support for Masstree's choice. The workloads use 16 cores and 80M decimal keys. The X axis gives each test's key length in bytes, but only the final 8 bytes vary uniformly. A 0-to-40-byte prefix is the same for every key. Despite the resulting imbalance, Masstree has 3.4× the throughput of "+Permuter" for relatively long keys. This is because "+Permuter" incurs a cache miss for the suffix of every key it compares. However, Masstree has 1.4× the throughput of "+Permuter" even for 16-byte keys, which "+Permuter" stores entirely inline. Here Masstree's performance comes from avoiding repeated comparisons: it examines the key's first 8 bytes once, rather than O(log₂ n) times.

[Figure 9 plot: throughput (req/sec, millions) vs. key length (bytes), for Masstree get and "+Permuter" get.]
Figure 9. Performance effect of varying key length on Masstree and "+Permuter." For each key length, keys differ only in the last 8 bytes. 16-core get workload.

Concurrency  Masstree uses interlocked instructions, such as compare-and-swap, that would be unnecessary for a single-core store. We implemented a single-core version of Masstree by removing locking, node versions, and interlocked instructions. When evaluated on one core using a 140M-key, 1-to-10-byte-decimal put workload, single-core Masstree beats concurrent Masstree by just 13%.

Range queries  Masstree uses a tree to support range queries. If they were not needed, a hash table might be preferable, since hash tables have O(1) lookup cost while a tree has O(log n). To measure this factor, we implemented a concurrent hash table in the Masstree framework and measured a 16-core, 80M-key workload with 8-byte random alphabetical keys.⁶ Our hash table has 2.5× higher total throughput than Masstree. Thus, of these features, only range queries appear inherently expensive.

⁶ Digit-only keys caused collisions and we wanted the test to favor the hash table. The hash table is open-coded and allocated using superpages, and has 30% occupancy. Each hash lookup inspects 1.1 entries on average.

6.5 Scalability

This section investigates how Masstree's performance scales with the number of cores. Figure 10 shows the results for 16-core get and put workloads using 140M 1-to-10-byte decimal keys. The Y axis shows per-core throughput; ideal scalability would appear as a horizontal line. At 16 cores, Masstree scales to 12.7× and 12.5× its one-core performance for gets and puts respectively.

[Figure 10 plot: throughput (req/sec/core, millions) vs. number of cores (1, 2, 4, 8, 16), for get and put.]
Figure 10. Masstree scalability.

The limiting factor for the get workload is high and increasing DRAM fetch cost. Each operation consumes about 1000 cycles of CPU time in computation independent of the number of cores, but average per-operation DRAM stall time varies from 2050 cycles with one core to 2800 cycles with 16 cores. This increase roughly matches the decrease in performance from one to 16 cores in Figure 10, and is consistent with the cores contending for some limited resource having to do with memory fetches, such as DRAM or interconnect bandwidth.

6.6 Partitioning and skew

Some key-value stores partition data among cores in order to avoid contention. We show here that, while partitioning works well for some workloads, sharing data among all cores works better for others. We compare Masstree with 16 separate instances of the single-core Masstree variant described above, each serving a partition of the overall data. The partitioning is static, and each instance holds the same number of keys. Each instance allocates memory from its local DRAM node. Clients send each query to the instance appropriate for the query's key. We refer to this configuration as "hard-partitioned" Masstree.
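In the hard-partitioned configuration, each query must be routed to the one instance that owns its key. A sketch of such static routing for the 1-to-10-byte decimal key space appears below; splitting the numeric range evenly is an assumption made for illustration — the paper requires only that each instance hold the same number of keys.

    // Sketch: static routing of queries to 16 single-core instances, as in the
    // hard-partitioned configuration. Assumes decimal keys drawn from
    // [0, 2^31) split into equal numeric ranges (illustrative choice only).
    #include <cstdint>
    #include <cstdlib>
    #include <string>

    const int kInstances = 16;

    int instance_for_key(const std::string& decimal_key) {
        uint64_t v = std::strtoull(decimal_key.c_str(), nullptr, 10);
        uint64_t range = (uint64_t(1) << 31) / kInstances;  // width of one partition
        int idx = int(v / range);
        return idx < kInstances ? idx : kInstances - 1;     // clamp out-of-range keys
    }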
Tests use 140M-key, 1-to-10-byte decimal get workloads with various partition skewness. Following Hua et al. [22], we model skewness with a single parameter δ. For skewness δ, 15 partitions receive the same number of requests, while the last one receives δ× more than the others. For example, […]

[Figure: throughput (req/sec, millions) of Masstree vs. hard-partitioned Masstree.]

Server            C/C++ client library   Batched query   Range query
VoltDB-2.0        1.3.6.1                Yes             Yes
memcached-1.4.8   1.0.3                  Yes for get     No
Redis-2.4.5       latest hiredis         Yes             No