
Cache Craftiness for Fast Multicore Key-Value Storage

Yandong Mao, Eddie Kohler†, Robert Morris

MIT CSAIL, †Harvard University

Abstract

We present Masstree, a fast key-value database designed for SMP machines. Masstree keeps all data in memory. Its main data structure is a trie-like concatenation of B+-trees, each of which handles a fixed-length slice of a variable-length key. This structure effectively handles arbitrary-length possibly-binary keys, including keys with long shared prefixes. B+-tree fanout was chosen to minimize total DRAM delay when descending the tree and prefetching each tree node. Lookups use optimistic concurrency control, a read-copy-update-like technique, and do not write shared data structures; updates lock only affected nodes. Logging and checkpointing provide consistency and durability. Though some of these ideas appear elsewhere, Masstree is the first to combine them. We discuss design variants and their consequences.

On a 16-core machine, with logging enabled and queries arriving over a network, Masstree executes more than six million simple queries per second. This performance is comparable to that of memcached, a non-persistent hash table server, and higher (often much higher) than that of VoltDB, MongoDB, and Redis.

Categories and Subject Descriptors H.2.4 [Information Systems]: DATABASE MANAGEMENT – Concurrency

Keywords multicore; in-memory; key-value; persistent

1. Introduction

Storage server performance matters. In many systems that use a single storage server, that server is often the performance bottleneck [1, 18], so improvements directly improve system capacity. Although large deployments typically spread load over multiple storage servers, single-server performance still matters: faster servers may reduce costs, and may also reduce load imbalance caused by partitioning data among servers. Intermediate-sized deployments may be able to avoid the complexity of multiple servers by using sufficiently fast single servers. A common route to high performance is to use different specialized storage systems for different workloads [4].

This paper presents Masstree, a storage system specialized for key-value data in which all data fits in memory, but must persist across server restarts. Within these constraints, Masstree aims to provide a flexible storage model. It supports arbitrary, variable-length keys. It allows range queries over those keys: clients can traverse subsets of the database, or the whole database, in sorted order by key. It performs well on workloads with many keys that share long prefixes. (For example, consider Bigtable [12], which stores information about Web pages under permuted URL keys like “edu.harvard.seas.www/news-events”. Such keys group together information about a domain’s sites, allowing more interesting range queries, but many URLs will have long shared prefixes.) Finally, though efficient with large values, it is also efficient when values are small enough that disk and network throughput don’t limit performance. The combination of these properties could free performance-sensitive users to use richer data models than is common for stores like memcached today.

Masstree uses a combination of old and new techniques to achieve high performance [8, 11, 13, 20, 27–29]. It achieves fast concurrent operation using a scheme inspired by OLFIT [11], Bronson et al. [9], and read-copy update [28]. Lookups use no locks or interlocked instructions, and thus operate without invalidating shared cache lines and in parallel with most inserts and updates. Updates acquire only local locks on the tree nodes involved, allowing modifications to different parts of the tree to proceed in parallel. Masstree shares a single tree among all cores to avoid load imbalances that can occur in partitioned designs. The tree is a trie-like concatenation of B+-trees, and provides high performance even for long common key prefixes, an area in which other tree designs have trouble. Query time is dominated by the total DRAM fetch time of successive nodes during tree descent; to reduce this cost, Masstree uses a wide-fanout tree to reduce the tree depth, prefetches nodes from DRAM to overlap fetch latencies, and carefully lays out data in cache lines to reduce the amount of data needed per node. Operations are logged in batches for crash recovery and the tree is periodically checkpointed.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
EuroSys’12, April 10–13, 2012, Bern, Switzerland.
Copyright © 2012 ACM 978-1-4503-1223-3/12/04…$10.00
We evaluate Masstree on a 16-core machine with simple benchmarks and a version of the Yahoo! Cloud Serving Benchmark (YCSB) [16] modified to use small keys and values. Masstree achieves six to ten million operations per second on parts A–C of the benchmark, more than 30× as fast as VoltDB [5] or MongoDB [2].

The contributions of this paper are as follows. First, an in-memory concurrent tree that supports keys with shared prefixes efficiently. Second, a set of techniques for laying out the data of each tree node, and accessing it, that reduces the time spent waiting for DRAM while descending the tree. Third, a demonstration that a single tree shared among multiple cores can provide higher performance than a partitioned design for some workloads. Fourth, a complete design that addresses all bottlenecks in the way of million-query-per-second performance.

2. Related work

Masstree builds on many previous systems. OLFIT [11] is a Blink-tree [27] with optimistic concurrency control. Each update to a node changes the node’s version number. Lookups check a node’s version number before and after observing its contents, and retry if the version number changes (which indicates that the lookup may have observed an inconsistent state). Masstree uses this idea, but, like Bronson et al. [9], it splits the version number into two parts; this, and other improvements, lead to less frequent retries during lookup.

PALM [34] is a lock-free concurrent B+-tree with twice the throughput of OLFIT. PALM uses SIMD instructions to take advantage of parallelism within each core. Lookups for an entire batch of queries are sorted, partitioned across cores, and processed simultaneously, a clever way to optimize cache usage. PALM requires fixed-length keys and its query batching results in higher query latency than OLFIT and Masstree. Many of its techniques are complementary to our work.

Bohannon et al. [8] store parts of keys directly in tree nodes, resulting in fewer DRAM fetches than storing keys indirectly. AlphaSort [29] explores several ideas to minimize cache misses by storing partial keys. Masstree uses a trie [20]-like data structure to achieve the same goal.

Rao et al. [30] propose storing each node’s children in contiguous memory to make better use of cache. Fewer node pointers are required, and prefetching is simplified, but some memory is wasted on nonexistent nodes. Cha et al. report that a fast B+-tree outperforms a CSB+-tree [10]; Masstree improves cache efficiency using more local techniques.

Data-cache stalls are a major bottleneck for database systems, and many techniques have been used to improve caching [14, 15, 21, 31]. Chen et al. [13] prefetch tree nodes; Masstree adopts this idea.

H-Store [25, 35] and VoltDB, its commercial version, are in-memory relational databases designed to be orders of magnitude faster than previous systems. VoltDB partitions data among multiple cores to avoid concurrency, and thus avoids data structure locking costs. In contrast, Masstree shares data among all cores to avoid load imbalances that can occur with partitioned data, and achieves good scaling with lock-free lookups and locally locked inserts.

Shore-MT [24] identifies lock contention as a major bottleneck for multicore databases, and improves performance by removing locks incrementally. Masstree provides high concurrency from the start.

Recent key-value stores [2, 3, 12, 17, 26] provide high performance partially by offering a simpler query and data model than relational databases, and partially by partitioning data over a cluster of servers. Masstree adopts the first idea. Its design focuses on multicore performance rather than clustering, though in principle one could operate a cluster of Masstree servers.

3. System interface

Masstree is implemented as a network key-value storage server. Its requests query and change the mapping of keys to values. Values can be further divided into columns, each of which is an uninterpreted byte string.

Masstree supports four operations: get_c(k), put_c(k, v), remove(k), and getrange_c(k, n). The c parameter is an optional list of column numbers that allows clients to get or set subsets of a key’s full value. The getrange operation, also called “scan,” implements a form of range query. It returns up to n key-value pairs, starting with the next key at or after k and proceeding in lexicographic order by key. Getrange is not atomic with respect to inserts and updates. A single client message can include many queries.

4. Masstree

Our key data structure is Masstree, a shared-memory, concurrent-access data structure combining aspects of B+-trees [6] and tries [20]. Masstree offers fast random access and stores keys in sorted order to support range queries. The design was shaped by three challenges. First, Masstree must efficiently support many key distributions, including variable-length binary keys where many keys might have long common prefixes. Second, for high performance and scalability, Masstree must allow fine-grained concurrent access, and its get operations must never dirty shared cache lines by writing shared data structures. Third, Masstree’s layout must support prefetching and collocate important information on small numbers of cache lines. The second and third properties together constitute cache craftiness.

4.1 Overview

A Masstree is a trie with fanout 2^64 where each trie node is a B+-tree. The trie structure efficiently supports long keys with shared prefixes; the B+-tree structures efficiently support short keys and fine-grained concurrency, and their medium fanout uses cache lines effectively.
Figure 1. Masstree structure: layers of B+-trees form a trie. [diagram omitted]

struct interior_node:
    uint32_t version;
    uint8_t nkeys;
    uint64_t keyslice[15];
    node* child[16];
    interior_node* parent;

struct border_node:
    uint32_t version;
    uint8_t nremoved;
    uint8_t keylen[15];
    uint64_t permutation;
    uint64_t keyslice[15];
    link_or_value lv[15];
    border_node* next;
    border_node* prev;
    interior_node* parent;
    keysuffix_t keysuffixes;

union link_or_value:
    node* next_layer;
    [opaque] value;

Figure 2. Masstree node structures.


Put another way, a Masstree comprises one or more layers of B+-trees, where each layer is indexed by a different 8-byte slice of key. Figure 1 shows an example. The trie’s single root tree, layer 0, is indexed by the slice comprising key bytes 0–7, and holds all keys up to 8 bytes long. Trees in layer 1, the next deeper layer, are indexed by bytes 8–15; trees in layer 2 by bytes 16–23; and so forth.

Each tree contains at least one border node and zero or more interior nodes. Border nodes resemble leaf nodes in conventional B+-trees, but where leaf nodes store only keys and values, Masstree border nodes can also store pointers to deeper trie layers.

Keys are generally stored as close to the root as possible, subject to three invariants. (1) Keys shorter than 8h + 8 bytes are stored at layer ≤ h. (2) Any keys stored in the same layer-h tree have the same 8h-byte prefix. (3) When two keys share a prefix, they are stored at least as deep as the shared prefix. That is, if two keys longer than 8h bytes have the same 8h-byte prefix, then they are stored at layer ≥ h.

Masstree creates layers as needed (as is usual for tries). Key insertion prefers to use existing trees; new trees are created only when insertion would otherwise violate an invariant. Key removal deletes completely empty trees but does not otherwise rearrange keys. For example, if t begins as an empty Masstree:

1. t.put(“01234567AB”) stores key “01234567AB” in the root layer. The relevant key slice, “01234567”, is stored separately from the 2-byte suffix “AB”. A get for this key first searches for the slice, then compares the suffix.

2. t.put(“01234567XY”): Since this key shares an 8-byte prefix with an existing key, Masstree must create a new layer. The values for “01234567AB” and “01234567XY” are stored, under slices “AB” and “XY”, in a freshly allocated B+-tree border node. This node then replaces the “01234567AB” entry in the root layer. Concurrent gets observe either the old state (with “01234567AB”) or the new layer, so the “01234567AB” key remains visible throughout the operation.

3. t.remove(“01234567XY”) traverses through the root layer to the layer-1 B+-tree, where it deletes key “XY”. The “AB” key remains in the layer-1 B+-tree.

Balance  A Masstree’s shape depends on its key distribution. For example, 1000 keys that share a 64-byte prefix generate at least 8 layers; without the prefix they would fit comfortably in one layer. Despite this, Masstrees have the same query complexity as B-trees. Given n keys of maximum length ℓ, query operations on a B-tree examine O(log n) nodes and make O(log n) key comparisons; but since each key has length O(ℓ), the total comparison cost is O(ℓ log n). A Masstree will make O(log n) comparisons in each of O(ℓ) layers, but each comparison considers fixed-size key slices, for the same total cost of O(ℓ log n). When keys have long common prefixes, Masstree outperforms conventional balanced trees, performing O(ℓ + log n) comparisons per query (ℓ for the prefix plus log n for the suffix). However, Masstree’s range queries have higher worst-case complexity than in a B+-tree, since they must traverse multiple layers of tree.

Partial-key B-trees [8] can avoid some key comparisons while preserving true balance. However, unlike these trees, Masstree bounds the number of non-node memory references required to find a key to at most one per lookup. Masstree lookups, which focus on 8-byte key slice comparisons, are also easy to code efficiently. Though Masstree can use more memory on some key distributions, since its nodes are relatively wide, it outperformed our pkB-tree implementation on several benchmarks by 20% or more.

4.2 Layout

Figure 2 defines Masstree’s node structures. At heart, Masstree’s interior and border nodes are internal and leaf nodes of a B+-tree with width 15. Border nodes are linked to facilitate remove and getrange. The version, nremoved, and permutation fields are used during concurrent updates and described below; we now briefly mention other features.

The keyslice variables store 8-byte key slices as 64-bit integers, byte-swapped if necessary so that native less-than comparisons provide the same results as lexicographic string comparison. This was the most valuable of our coding tricks,
improving performance by 13–19%. Short key slices are padded with 0 bytes.

Border nodes store key slices, lengths, and suffixes. Lengths, which distinguish different keys with the same slice, are a consequence of our decision to allow binary strings as keys. Since null characters are valid within key strings, Masstree must for example distinguish the 8-byte key “ABCDEFG\0” from the 7-byte key “ABCDEFG”, which have the same slice representation.

A single tree can store at most 10 keys with the same slice, namely keys with lengths 0 through 8 plus either one key with length > 8 or a link to a deeper trie layer.¹ We ensure that all keys with the same slice are stored in the same border node. This simplifies and slims down interior nodes, which need not contain key lengths, and simplifies the maintenance of other invariants important for concurrent operation, at the cost of some checking when nodes are split. (Masstree is in this sense a restricted type of prefix B-tree [7].)

Border nodes store the suffixes of their keys in keysuffixes data structures. These are located either inline or in separate memory blocks; Masstree adaptively decides how much per-node memory to allocate for suffixes and whether to place that memory inline or externally. Compared to a simpler technique (namely, allocating fixed space for up to 15 suffixes per node), this approach reduces memory usage by up to 16% for workloads with short keys and improves performance by 3%.

Values are stored in link_or_value unions, which contain either values or pointers to next-layer trees. These cases are distinguished by the keylen field. Users have full control over the bits stored in value slots.

Masstree’s performance is dominated by the latency of fetching tree nodes from DRAM. Many such fetches are required for a single put or get. Masstree prefetches all of a tree node’s cache lines in parallel before using the node, so the entire node can be used after a single DRAM latency. Up to a point, this allows larger tree nodes to be fetched in the same amount of time as smaller ones; larger nodes have wider fanout and thus reduce tree height. On our hardware, tree nodes of four cache lines (256 bytes, which allows a fanout of 15) provide the highest total performance.

4.3 Nonconcurrent modification

Masstree’s tree modification algorithms are based on sequential algorithms for B+-tree modification. We describe them as a starting point.

Inserting a key into a full border node causes a split. A new border node is allocated, and the old keys (plus the inserted key) are distributed among the old and new nodes. The new node is then inserted into the old node’s parent interior node; if full, this interior node must itself be split (updating its children’s parent pointers). The split process terminates either at a node with insertion room or at the root, where a new interior node is created and installed. Removing a key simply deletes it from the relevant border node. Empty border nodes are then freed and deleted from their parent interior nodes. This process, like split, continues up the tree as necessary. Though remove in classical B+-trees can redistribute keys among nodes to preserve balance, removal without rebalancing has theoretical and practical advantages [33].

Insert and remove maintain a per-tree doubly linked list among border nodes. This list speeds up range queries in either direction. If only forward range queries were required, a singly linked list could suffice, but the backlinks are required anyway for our implementation of concurrent remove.

We apply common case optimizations. For example, sequential insertions are easy to detect (the item is inserted at the end of a node with no next sibling). If a sequential insert requires a split, the old node’s keys remain in place and Masstree inserts the new item into an empty node. This improves memory utilization and performance for sequential workloads. (Berkeley DB and others also implement this optimization.)

4.4 Concurrency overview

Masstree achieves high performance on multicore hardware using fine-grained locking and optimistic concurrency control. Fine-grained locking means writer operations in different parts of the tree can execute in parallel: an update requires only local locks.² Optimistic concurrency control means reader operations, such as get, acquire no locks whatsoever, and in fact never write to globally-accessible shared memory. Writes to shared memory can limit performance by causing contention—for example, contention among readers for a node’s read lock—or by wasting DRAM bandwidth on writebacks. But since readers don’t lock out concurrent writers, readers might observe intermediate states created by writers, such as partially-inserted keys. Masstree readers and writers must cooperate to avoid confusion. The key communication channel between them is a per-node version counter that writers mark as “dirty” before creating intermediate states, and then increment when done. Readers snapshot a node’s version before accessing the node, then compare this snapshot to the version afterwards. If the versions differ or are dirty, the reader may have observed an inconsistent intermediate state and must retry.

Our optimistic concurrency control design was inspired by read-copy update [28], and borrows from OLFIT [11] and Bronson et al.’s concurrent AVL trees [9].

Masstree’s correctness condition can be summarized as no lost keys: A get(k) operation must return a correct value

¹ At most one key can have length > 8 because of the invariants above: the second such key will create the deeper trie layer. Not all key slices can support 10 keys—any slice whose byte 7 is not null occurs at most twice.
² These data structure locks are often called “latches,” with the word “lock” reserved for transaction locks. We do not discuss transactions or their locks.
stableversion(node n):
v ← n.version
while v.inserting or v.splitting:
v ← n.version
return v
lock(node n):
Figure 3. Version number layout. The locked bit is claimed while n 6= NIL and swap(n.version.locked, 1) = 1:
by update or insert. inserting and splitting are “dirty” bits set // retry
during inserts and splits, respectively. vinsert and vsplit are
unlock(node n): // implemented with one memory write
counters incremented after each insert or split. isroot tells
if n.version.inserting:
whether the node is the root of some B+ -tree. isborder tells
+ + n.version.vinsert
whether the node is interior or border. unused allows more else if n.version.splitting:
efficient operations on the version number. + + n.version.vsplit
n.version.{locked, inserting, splitting} ← 0

for k, regardless of concurrent writers. (When get(k) and lockedparent(node n):


put(k, v) run concurrently, the get can return either the old retry: p ← n.parent; lock(p)
or the new value.) The biggest challenge in preserving cor- if p 6= n.parent: // parent changed underneath us
rectness is concurrent splits and removes, which can shift unlock(p); goto retry
return p
responsibility for a key away from a subtree even as a reader
traverses that subtree.
Figure 4. Helper functions.
4.5 Writer–writer coordination
Masstree writers coordinate using per-node spinlocks. A tial put phase that reaches the node responsible for a key is
node’s lock is stored in a single bit in its version counter. logically a reader and takes no locks.
(Figure 3 shows the version counter’s layout.) It’s simple to design a correct, though inefficient, opti-
Any modification to a node’s keys or values requires mistic writer–reader coordination algorithm using version
holding the node’s lock. Some data is protected by other fields.
nodes’ locks, however. A node’s parent pointer is protected
by its parent’s lock, and a border node’s prev pointer is 1. Before making any change to a node n, a writer operation
protected by its previous sibling’s lock. This minimizes the must mark n.version as “dirty.” After making its change,
simultaneous locks required by split operations; when an it clears this mark and increments the n.version counter.
interior node splits, for example, it can assign its children’s 2. Every reader operation first snapshots every node’s ver-
parent pointers without obtaining their locks. sion. It then computes, keeping track of the nodes it
Splits and node deletions require a writer to hold several examines. After finishing its computation (but before
locks simultaneously. When node n splits, for example, the returning the result), it checks whether any examined
writer must simultaneously hold n’s lock, n’s new sibling’s node’s version was dirty or has changed from the snap-
lock, and n’s parent’s lock. (The simultaneous locking pre- shot; if so, the reader must retry with a fresh snapshot.
vents a concurrent split from moving n, and therefore its sib-
ling, to a different parent before the new sibling is inserted.) Universal before-and-after version checking would clearly
As with Blink -trees [27], lock ordering prevents deadlock: ensure that readers detect any concurrent split (assuming
locks are always acquired up the tree. version numbers didn’t wrap mid-computation3 ). It would
We evaluated several writer–writer coordination proto- equally clearly perform terribly. Efficiency is recovered by
cols on different tree variants, including lock-free algorithms eliminating unnecessary version changes, by restricting the
relying on compare-and-swap operations. The current lock- version snapshots readers must track, and by limiting the
ing protocol performs as well or better. On current cache- scope over which readers must retry. The rest of this sec-
coherent shared-memory multicore machines, the major cost tion describes different aspects of coordination by increasing
of locking, namely the cache coherence protocol, is also in- complexity.
curred by lock-free operations like compare-and-swap, and 4.6.1 Updates
Masstree never holds a lock for very long.
Update operations, which change values associated with ex-
4.6 Writer–reader coordination isting keys, must prevent concurrent readers from observing
intermediate results. This is achieved by atomically updat-
We now turn to writer–reader coordination, which uses opti-
mistic concurrency control. Note that even an all-put work- 3 Our current counter could wrap if a reader blocked mid-computation for
load involves some writer–reader coordination, since the ini- 222 inserts. A 64-bit version counter would never overflow in practice.
split(node n, key k): // precondition: n locked findborder(node root, key k):
n0 ← new border node retry: n ← root; v ← stableversion(n)
n.version.splitting ← 1 if v.isroot is false:
n0 .version ← n.version // n0 is initially locked root ← root.parent; goto retry
split keys among n and n0 , inserting k descend: if n is a border node:
ascend: p ← lockedparent(n) // hand-over-hand locking return hn, vi
if p = NIL: // n was old root n0 ← child of n containing k
create a new interior node p with children n, n0 v0 ← stableversion(n0 )
unlock(n); unlock(n0 ); return if n.version ⊕ v ≤ “locked”: // hand-over-hand validation
else if p is not full: n ← n0 ; v ← v0 ; goto descend
p.version.inserting ← 1 00
v ← stableversion(n)
insert n0 into p if v00 .vsplit 6= v.vsplit:
unlock(n); unlock(n0 ); unlock(p); return goto retry // if split, retry from root
else: v ← v00 ; goto descend // otherwise, retry from n
p.version.splitting ← 1
unlock(n) Figure 6. Find the border node containing a key.
p0 ← new interior node
p0 .version ← p.version
split keys among p and p0 , inserting n0
unlock(n0 ); n ← p; n0 ← p0 ; goto ascend the node; loads the permutation; rearranges the permuta-
tion to shift an unused slot to the correct insertion position
and increment nkeys; writes the new key and value to the
Figure 5. Split a border node and insert a key.
previously-unused slot; and finally writes back the new per-
mutation and unlocks the node. The new key becomes visi-
ing values using aligned write instructions. On modern ma- ble to readers only at this last step.
chines, such writes have atomic effect: any concurrent reader A compiler fence, and on some architectures a machine
will see either the old value or the new value, not some un- fence instruction, is required between the writes of the key
holy mixture. Updates therefore don’t need to increment the and value and the write of the permutation. Our implemen-
border node’s version number, and don’t force readers to tation includes fences whenever required, such as in version
retry. checks.
However, writers must not delete old values until all con-
current readers are done examining them. We solve this 4.6.3 New layers
garbage collection problem with read-copy update tech-
niques, namely a form of epoch-based reclamation [19]. All Masstree creates a new layer when inserting a key k1 into
data accessible to readers is freed using similar techniques.

4.6.2 Border inserts
Insertion into a conventional B-tree leaf rearranges keys into sorted order, which creates invalid intermediate states. One solution is to force readers to retry, but Masstree's border-node permutation field instead makes each insert visible in a single atomic step, so no invalid intermediate state is ever observable. The permutation field compactly represents the correct key order plus the current number of keys, so a writer exposes a new sort order, and a new key, with a single aligned write. Readers see either the old order, without the new key, or the new order, with the new key in its proper place. No key rearrangement, and therefore no version increment, is required.

The 64-bit permutation is divided into 16 four-bit subfields. The lowest 4 bits, nkeys, hold the number of keys in the node (0–15). The remaining bits constitute a fifteen-element array, keyindex[15], containing a permutation of the numbers 0 through 14. Elements keyindex[0] through keyindex[nkeys − 1] store the indexes of the border node's live keys, in increasing order by key. The other elements list currently-unused slots. To insert a key, a writer locks the node, writes the new key and value into a currently-unused slot, and then publishes an updated permutation with a single atomic write.

4.6.3 New layers
Masstree creates a new layer when an insert of some key k1 encounters a border node n that contains a conflicting key k2. It allocates a new empty border node n′, inserts k2's current value into it under the appropriate key slice, and then replaces k2's value in n with the next_layer pointer n′. Finally, it unlocks n and continues the attempt to insert k1, now using the newly created layer n′.

Since this process only affects a single key, there is no need to update n's version or permutation. However, readers must reliably distinguish true values from next_layer pointers. Since the pointer and the layer marker are stored separately, this requires a sequence of writes. First, the writer marks the key as UNSTABLE; readers seeing this marker will retry. It then writes the next_layer pointer, and finally marks the key as a LAYER.

4.6.4 Splits
Splits, unlike non-split inserts, remove active keys from a visible node and insert them in another. Without care, a get concurrent with the split might mistakenly report these shifting keys as lost. Writers must therefore update version fields to signal splits to readers. The challenge is to update these fields in writers, and check them in readers, in such a way that no change is lost.
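The permutation encoding of §4.6.2 can be made concrete in a few lines. This is an illustrative sketch, not Masstree's actual code: the Permutation type, its method names, and the treatment of the free-slot list (elided here; the caller supplies a free slot, where real Masstree keeps unused slots in the tail of keyindex) are our assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the 64-bit border-node permutation: the low 4 bits hold
// nkeys; the remaining fifteen 4-bit subfields hold keyindex[0..14],
// the slots of the live keys in increasing key order.
struct Permutation {
    uint64_t bits;

    int nkeys() const { return bits & 0xF; }

    // keyindex[i]: the slot holding the i-th smallest key.
    int slot(int i) const { return (bits >> (4 + 4 * i)) & 0xF; }

    // New permutation with slot `freeslot` spliced in at sorted
    // position `pos`.  The writer fills the slot's key and value
    // first, then publishes the result with one aligned 64-bit store.
    Permutation insert_at(int pos, int freeslot) const {
        int n = nkeys();                       // assumes n < 15
        Permutation p{(uint64_t)(n + 1)};      // new nkeys
        for (int i = 0, j = 0; i <= n; i++) {
            int s = (i == pos) ? freeslot : slot(j++);
            p.bits |= (uint64_t)s << (4 + 4 * i);
        }
        return p;
    }
};
```

Because the entire key order and count change with the single store of `bits`, a concurrent reader decodes either the pre-insert or the post-insert ordering, never a mixture, which is exactly the property §4.6.2 relies on.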
Figures 5 and 6 present pseudocode for splitting a border node and for traversing down a B+-tree to the border node responsible for a key. (Figure 4 presents some helper functions.) The split code uses hand-over-hand locking and marking [9]: lower levels of the tree are locked and marked as "splitting" (a type of dirty marking) before higher levels. Conversely, the traversal code checks versions hand-over-hand in the opposite direction: higher levels' versions are verified before the traversal shifts to lower levels.

To see why this is correct, consider an interior node B that splits to create a new node B′:

[Diagram omitted: parent A points to child B; after the split, A points to both B and B′, with dashed lines from B indicating child pointers, including X, shifted to B′.]

(Dashed lines from B indicate child pointers that were shifted to B′.) The split procedure changes versions and shifts keys in the following steps.

1. B and B′ are marked splitting.
2. Children, including X, are shifted from B to B′.
3. A (B's parent) is locked and marked inserting.
4. The new node, B′, is inserted into A.
5. A, B, and B′ are unlocked, which increments A's vinsert counter and B's and B′'s vsplit counters.

Now consider a concurrent findborder(X) operation that starts at node A. We show that this operation either finds X or eventually retries. First, if findborder(X) traverses to node B′, then it will find X, which moved to B′ (in step 2) before the pointer to B′ was published (in step 4). Instead, assume findborder(X) traverses to B. Since the findborder operation retries on any version difference, and since findborder loads the child's version before double-checking the parent's ("hand-over-hand validation" in Figure 6), we know that findborder loaded B's version before A was marked as inserting (step 3). This in turn means that the load of B's version happened before step 1. (That step marks B as splitting, which would have caused stableversion to retry.) Then there are two possibilities. If findborder completes before the split operation's step 1, it will clearly locate node X. On the other hand, if findborder is delayed past step 1, it will always detect a split and retry from the root. The B.version ⊕ v check will fail because of B's splitting flag; the following stableversion(B) will delay until that flag is cleared, which happens when the split executes step 5; and at that point, B's vsplit counter has changed.

Masstree readers treat splits and inserts differently. Inserts retry locally, while splits require retrying from the root. Wide B-tree fanout and fast code mean concurrent splits are rarely observed: in an insert test with 8 threads, less than 1 insert in 10⁶ had to retry from the root due to a concurrent split. Other algorithms, such as backing up the tree step by step, were more complex to code but performed no better. However, concurrent inserts are (as one might expect) observed 15× more frequently than splits. It is simple to handle them locally, so Masstree maintains separate split and insert counters to distinguish the cases.

    get(node root, key k):
      retry:
        ⟨n, v⟩ ← findborder(root, k)
      forward:
        if v.deleted:
          goto retry
        ⟨t, lv⟩ ← extract link_or_value for k in n
        if n.version ⊕ v > "locked":
          v ← stableversion(n); next ← n.next
          while !v.deleted and next ≠ NIL and k ≥ lowkey(next):
            n ← next; v ← stableversion(n); next ← n.next
          goto forward
        else if t = NOTFOUND:
          return NOTFOUND
        else if t = VALUE:
          return lv.value
        else if t = LAYER:
          root ← lv.next_layer; advance k to next slice
          goto retry
        else:  // t = UNSTABLE
          goto forward

Figure 7. Find the value for a key.

Figure 7 shows full code for Masstree's get operation. (Puts are similar, but since they obtain locks, the retry logic is simpler.) Again, the node's contents are extracted between checks of its version, and version changes cause retries.

Border nodes, unlike interior nodes, can handle splits using their links.⁴ The key invariant is that nodes split "to the right": when a border node n splits, its higher keys are shifted to its new sibling. Specifically, Masstree maintains the following invariants:

• The initial node in a B+-tree is a border node. This node is not deleted until the B+-tree itself is completely empty, and always remains the leftmost node in the tree.
• Every border node n is responsible for a range of keys [lowkey(n), highkey(n)). (The leftmost and rightmost nodes have lowkey(n) = −∞ and highkey(n) = ∞, respectively.) Splits and deletes can modify highkey(n), but lowkey(n) remains constant over n's lifetime.

⁴ Blink-trees [27] and OLFIT [11] also link interior nodes, but our implementation of remove [33] breaks the invariants that make this possible.

Thus, get can reliably find the relevant border node by comparing the current key and the next border node's lowkey. The first lines of findborder (Figure 6) handle stale roots caused by concurrent splits, which can occur at any layer. When the layer-0 global root splits, we update it immediately, but other roots, which are stored in border nodes'
next_layer pointers, are updated lazily during later operations.

4.6.5 Removes
Masstree, unlike some prior work [11, 27], includes a full implementation of concurrent remove. Space constraints preclude a full discussion, but we mention several interesting features.

First, remove operations, when combined with inserts, must sometimes cause readers to retry! Consider the following threads running in parallel on a one-node tree:

    get(n, k1):
      locate k1 at n position i
                  remove(n, k1):
                    remove k1 from n position i
                  put(n, k2, v2):
                    insert k2, v2 at n position j
      lv ← n.lv[i]; check n.version;
      return lv.value

The get operation may return k1's (removed) value, since the operations overlapped. Remove thus must not clear the memory corresponding to the key or its value: it just changes the permutation. But then if the put operation happened to pick j = i, the get operation might return v2, which isn't a valid value for k1. Masstree must therefore update the version counter's vinsert field when removed slots are reused.

When a border node becomes empty, Masstree removes it and any resulting empty ancestors. This requires that the border-node list be doubly, not singly, linked. A naive implementation could break the list under concurrent splits and removes; compare-and-swap operations (some including flag bits) are required for both split and remove, which slightly slows down split. As with any state observable by concurrent readers, removed nodes must not be freed immediately. Instead, we mark them as deleted and reclaim them later. Any operation that encounters a deleted node retries from the root. Remove's code for manipulating interior nodes resembles that for split; hand-over-hand locking is used to find the right key to remove. Once that key is found, the deleted node becomes completely unreferenced and future readers will not encounter it.

Removes can delete entire layer-h trees for h ≥ 1. These are not cleaned up right away: normal operations lock at most one layer at a time, and removing a full tree requires locking both the empty layer-h tree and the layer-(h − 1) border node that points to it. Epoch-based reclamation tasks are scheduled as needed to clean up empty and pathologically-shaped layer-h trees.

4.7 Values
The Masstree system stores values consisting of a version number and an array of variable-length strings called columns. Gets can retrieve multiple columns (identified by integer indexes) and puts can modify multiple columns. Multi-column puts are atomic: a concurrent get will see either all or none of a put's column modifications.

Masstree includes several value implementations; we evaluate the one most appropriate for small values. Each value is allocated as a single memory block. Modifications are not made in place, since that could expose intermediate states to concurrent readers. Instead, put creates a new value object, copying unmodified columns from the old value object as appropriate. This design uses cache effectively for small values, but would cause excessive data copying for large values; for those, Masstree offers a design that stores each column in a separately-allocated block.

4.8 Discussion
More than 30% of the cost of a Masstree lookup is in computation (as opposed to DRAM waits), mostly due to key search within tree nodes. Linear search has higher complexity than binary search, but exhibits better locality. For Masstree, the performance difference of the two search schemes is architecture dependent. On an Intel processor, linear search can be up to 5% faster than binary search. On an AMD processor, both perform the same.

One important PALM optimization is parallel lookup [34]. This effectively overlaps the DRAM fetches for many operations by looking up the keys for a batch of requests in parallel. Our implementation of this technique did not improve performance on our 48-core AMD machine, but on a 24-core Intel machine, throughput rose by up to 34%. We plan to change Masstree's network stack to apply this technique.

5. Networking and persistence
Masstree uses network interfaces that support per-core receive and transmit queues, which reduce contention when short query packets arrive from many clients. To support short connections efficiently, Masstree can configure per-core UDP ports that are each associated with a single core's receive queue. Our benchmarks, however, use long-lived TCP query connections from few clients (or client aggregators), a common operating mode that is equally effective at avoiding network overhead.

Masstree logs updates to persistent storage to achieve persistence and crash recovery. Each server query thread (core) maintains its own log file and in-memory log buffer. A corresponding logging thread, running on the same core as the query thread, writes out the log buffer in the background. Logging thus proceeds in parallel on each core.

A put operation appends to the query thread's log buffer and responds to the client without forcing that buffer to storage. Logging threads batch updates to take advantage of higher bulk sequential throughput, but force logs to storage at least every 200 ms for safety. Different logs may be on different disks or SSDs for higher total log throughput.

Value version numbers and log record timestamps aid the process of log recovery. Sequential updates to a value
obtain distinct, and increasing, version numbers. Update version numbers are written into the log along with the operation, and each log record is timestamped. When restoring a database from logs, Masstree sorts logs by timestamp. It first calculates the recovery cutoff point, which is the minimum of the logs' last timestamps, τ = min_{ℓ∈L} max_{u∈ℓ} u.timestamp, where L is the set of available logs and u denotes a single logged update. Masstree plays back the logged updates in parallel, taking care to apply a value's updates in increasing order by version, except that updates with u.timestamp ≥ τ are dropped.

Masstree periodically writes out a checkpoint containing all keys and values. This speeds recovery and allows log space to be reclaimed. Recovery loads the latest valid checkpoint that completed before τ, the log recovery time, and then replays logs starting from the timestamp at which the checkpoint began.

Our checkpoint facility is independent of the Masstree design; we include it to show that persistence need not limit system performance, but do not evaluate it in depth. It takes Masstree 58 seconds to create a checkpoint of 140 million key-value pairs (9.1 GB of data in total), and 38 seconds to recover from that checkpoint. The main bottleneck for both is imbalance in the parallelization among cores. Checkpoints run in parallel with request processing. When run concurrently with a checkpoint, a put-only workload achieves 72% of its ordinary throughput due to disk contention.

6. Tree evaluation
We evaluate Masstree in two parts. In this section, we focus on Masstree's central data structure, the trie of B+-trees. We show the cumulative impact on performance of various tree design choices and optimizations. We show that Masstree scales effectively and that its single shared tree can outperform separate per-core trees when the workload is skewed. We also quantify the costs of Masstree's flexibility. While variable-length key support comes for free, range query support does not: a near-best-case hash table (which lacks range query support) can provide 2.5× the throughput of Masstree.

The next section evaluates Masstree as a system. There, we describe the performance impact of checkpoint and recovery, and compare the whole Masstree system against other high performance storage systems: MongoDB, VoltDB, Redis, and memcached. Masstree performs very well, achieving 26–1000× the throughput of the other tree-based (range-query-supporting) stores. Redis and memcached are based on hash tables; this gives them O(1) average-case lookup in exchange for not supporting range queries. memcached can exceed Masstree's throughput on uniform workloads; on other workloads, Masstree provides up to 3.7× the throughput of these systems.

6.1 Setup
The experiments use a 48-core server (eight 2.4 GHz six-core AMD Opteron 8431 chips) running Linux 3.1.5. Each core has private 64 KB instruction and data caches and a 512 KB private L2 cache. The six cores in each chip share a 6 MB L3 cache. Cache lines are 64 bytes. Each of the chips has 8 GB of DRAM attached to it. The tests use up to 16 cores on up to three chips, and use DRAM attached to only those three chips; the extra cores are disabled. The goal is to mimic the configuration of a machine like those easily purchasable today. The machine has four SSDs, each with a measured sequential write speed of 90 to 160 MB/sec. Masstree uses all four SSDs to store logs and checkpoints. The server has a 10 Gb Ethernet card (NIC) connected to a switch. Also on that switch are 25 client machines that send requests over TCP. The server's NIC distributes interrupts over all cores. Results are averaged over three runs.

All experiments in this section use small keys and values. Most keys are no more than 10 bytes long; values are always 1–10 bytes long. Keys are distributed uniformly at random over some range (the range changes by experiment). The key space is not partitioned: a border node generally contains keys created by different clients, and sometimes one client will overwrite a key originally inserted by another. One common key distribution is "1-to-10-byte decimal," which comprises the decimal string representations of random numbers between 0 and 2³¹. This exercises Masstree's variable-length key support, and 80% of the keys are 9 or 10 bytes long, causing Masstree to create layer-1 trees.

We run separate experiments for gets and puts. Get experiments start with a full store (80–140 million keys) and run for 20 seconds. Put experiments start with an empty store and run for 140 million total puts. Most puts are inserts, but about 10% are updates since multiple clients occasionally put the same key. Puts generally run 30% slower than gets.

6.2 Factor analysis
We analyze Masstree's performance by breaking down the performance gap between a binary tree and Masstree. We evaluate several configurations on 140M-key 1-to-10-byte-decimal get and put workloads with 16 cores. Each server thread generates its own workload: these numbers do not include the overhead of network and logging. Figure 8 shows the results.

Binary We first evaluate a fast, concurrent, lock-free binary tree. Each 40-byte tree node here contains a full key, a value pointer, and two child pointers. The fast jemalloc memory allocator is used.

+Flow, +Superpage, +IntCmp Memory allocation often bottlenecks multicore performance. We switch to Flow, our implementation of the Streamflow [32] allocator ("+Flow"). Flow supports 2 MB x86 superpages, which, when introduced ("+Superpage"), improve throughput by 27–37% due
[Figure 8: bar chart of get and put throughput (req/sec, millions) for the cumulative configurations Binary, +Flow, +Superpage, +IntCmp, 4-tree, B-tree, +Prefetch, +Permuter, and Masstree; bar labels give throughput relative to the binary tree on the get workload, ranging from 1.00 to 3.33.]
Figure 8. Contributions of design features to Masstree’s performance (§6.2). Design features are cumulative. Measurements
use 16 cores and each server thread generates its own load (no clients or network traffic). Bar numbers give throughput relative
to the binary tree running the get workload.

to fewer TLB misses and lower kernel overhead for allocation. Integer key comparison (§4.2, "+IntCmp") further improves throughput by 15–24%.

4-tree A balanced binary tree has log2 n depth, imposing an average of log2 n − 1 serial DRAM latencies per lookup. We aim to reduce and overlap those latencies and to pack more useful information into cache lines that must be fetched. "4-tree," a tree with fanout 4, uses both these techniques. Its wider fanout nearly halves average depth relative to the binary tree. Each 4-tree node comprises two cache lines, but usually only the first must be fetched from DRAM. This line contains all data important for traversal—the node's four child pointers and the first 8 bytes of each of its keys. (The binary tree also fetches only one cache line per node, but most of it is not useful for traversal.) All internal nodes are full. Reads are lockless and need never retry; inserts are lock-free but use compare-and-swap. "4-tree" improves throughput by 41–44% over "+IntCmp."

B-tree, +Prefetch, +Permuter 4-tree yields good performance, but would be difficult to balance. B-trees have even wider fanout and stay balanced, at the cost of somewhat less efficient memory usage (nodes average 75% full). "B-tree" is a concurrent B+-tree with fanout 15 that implements our concurrency control scheme from §4. Each node has space for up to the first 16 bytes of each key. Unfortunately this tree reduces put throughput by 12% over 4-tree, and does not improve get throughput much. Conventional B-tree inserts must rearrange a node's keys—4-tree never rearranges keys—and B-tree nodes spend 5 cache lines to achieve average fanout 11, a worse cache-line-to-fanout ratio than 4-tree's. However, wide B-tree nodes are easily prefetched to overlap these DRAM latencies. When prefetching is added, B-tree improves throughput by 9–31% over 4-tree ("+Prefetch"). Leaf-node permutations (§4.6.2, "+Permuter") further improve put throughput by 4%.

Masstree Finally, Masstree itself improves throughput by 4–8% over "+Permuter" in these experiments. This surprised us. 1-to-10-byte decimal keys can share an 8-byte prefix, forcing Masstree to create layer-1 trie-nodes, but in these experiments such nodes are quite empty. A 140M-key put workload, for example, creates a tree with 33% of its keys in layer-1 trie-nodes, but the average number of keys per layer-1 trie-node is just 2.3. One might expect this to perform worse than a true B-tree, which has better node utilization. Masstree's design, thanks to features such as storing 8 bytes per key per interior node rather than 16, appears efficient enough to overcome this effect.

6.3 System relevance of tree design
Cache-crafty design matters not just in isolation, but also in the context of a full system. We turn on logging, generate load using network clients, and compare "+IntCmp," the fastest binary tree from the previous section, with Masstree. On 140M-key 1-to-10-byte-decimal workloads with 16 cores, Masstree provides 1.90× and 1.53× the throughput of the binary tree for gets and puts, respectively.⁵ Thus, if logging and networking infrastructure are reasonably well implemented, tree design can improve system performance.

⁵ Absolute Masstree throughput is 8.03 Mreq/sec for gets (77% of the Figure 8 value) and 5.78 Mreq/sec for puts (63% of the Figure 8 value).

6.4 Flexibility
Masstree supports several features that not all key-value applications require, including range queries, variable-length keys, and concurrency. We now evaluate how much these features cost by evaluating tree variants that do not support them. We include network and logging.
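Much of the fixed-size-key efficiency discussed in this section comes from treating 8-byte key slices as integers (§4.2's "+IntCmp"). The following is a hedged sketch of one way to do this; the helper name and the zero-padding convention are our assumptions, not Masstree's exact code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Pack up to 8 key bytes starting at `offset` big-endian into a
// uint64_t, so that one unsigned integer compare orders slices the
// same way bytewise memcmp would.  Short slices are zero-padded; real
// Masstree also records the slice length to disambiguate keys that
// may contain NUL bytes, which this sketch omits.
static uint64_t key_slice(const std::string& key, size_t offset) {
    uint64_t s = 0;
    for (size_t i = 0; i < 8 && offset + i < key.size(); i++)
        s |= (uint64_t)(unsigned char)key[offset + i] << (56 - 8 * i);
    return s;
}
```

This is also why the trie advances in 8-byte slices: keys that share a slice compare equal at one layer and descend to the next, where the following 8 bytes are compared, so each slice is examined once rather than at every node along the search path.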
Variable-length keys We compare Masstree with a concurrent B-tree supporting only fixed-size 8-byte keys (a version of "+Permuter"). When run on a 16-core get workload with 80M 8-byte decimal keys, Masstree supports 9.84 Mreq/sec and the fixed-size B-tree 9.93 Mreq/sec, just 0.8% more. The difference is likely so small because the trie-of-trees design effectively has fixed-size keys in most tree nodes.

Keys with common prefixes Masstree is intended to preserve good cache performance when keys share common prefixes. However, unlike some designs, such as partial-key B-trees, Masstree can become superficially unbalanced. Figure 9 provides support for Masstree's choice. The workloads use 16 cores and 80M decimal keys. The X axis gives each test's key length in bytes, but only the final 8 bytes vary uniformly; a 0-to-40-byte prefix is the same for every key. Despite the resulting imbalance, Masstree has 3.4× the throughput of "+Permuter" for relatively long keys. This is because "+Permuter" incurs a cache miss for the suffix of every key it compares. However, Masstree has 1.4× the throughput of "+Permuter" even for 16-byte keys, which "+Permuter" stores entirely inline. Here Masstree's performance comes from avoiding repeated comparisons: it examines the key's first 8 bytes once, rather than O(log2 n) times.

[Figure 9: line graph of throughput (req/sec, millions) versus key length (8–48 bytes) for Masstree and "+Permuter."] Figure 9. Performance effect of varying key length on Masstree and "+Permuter." For each key length, keys differ only in the last 8 bytes. 16-core get workload.

Concurrency Masstree uses interlocked instructions, such as compare-and-swap, that would be unnecessary for a single-core store. We implemented a single-core version of Masstree by removing locking, node versions, and interlocked instructions. When evaluated on one core using a 140M-key, 1-to-10-byte-decimal put workload, single-core Masstree beats concurrent Masstree by just 13%.

Range queries Masstree uses a tree to support range queries. If they were not needed, a hash table might be preferable, since hash tables have O(1) lookup cost while a tree has O(log n). To measure this factor, we implemented a concurrent hash table in the Masstree framework and measured a 16-core, 80M-key workload with 8-byte random alphabetical keys.⁶ Our hash table has 2.5× higher total throughput than Masstree. Thus, of these features, only range queries appear inherently expensive.

⁶ Digit-only keys caused collisions and we wanted the test to favor the hash table. The hash table is open-coded and allocated using superpages, and has 30% occupancy. Each hash lookup inspects 1.1 entries on average.

6.5 Scalability
This section investigates how Masstree's performance scales with the number of cores. Figure 10 shows the results for 16-core get and put workloads using 140M 1-to-10-byte decimal keys. The Y axis shows per-core throughput; ideal scalability would appear as a horizontal line. At 16 cores, Masstree scales to 12.7× and 12.5× its one-core performance for gets and puts respectively.

[Figure 10: per-core throughput (req/sec/core, millions) for get and put at 1, 2, 4, 8, and 16 cores.] Figure 10. Masstree scalability.

The limiting factor for the get workload is high and increasing DRAM fetch cost. Each operation consumes about 1000 cycles of CPU time in computation independent of the number of cores, but average per-operation DRAM stall time varies from 2050 cycles with one core to 2800 cycles with 16 cores. This increase roughly matches the decrease in performance from one to 16 cores in Figure 10, and is consistent with the cores contending for some limited resource having to do with memory fetches, such as DRAM or interconnect bandwidth.

6.6 Partitioning and skew
Some key-value stores partition data among cores in order to avoid contention. We show here that, while partitioning works well for some workloads, sharing data among all cores works better for others. We compare Masstree with 16 separate instances of the single-core Masstree variant described above, each serving a partition of the overall data. The partitioning is static, and each instance holds the same number of keys. Each instance allocates memory from its local DRAM node. Clients send each query to the instance appropriate for the query's key. We refer to this configuration as "hard-partitioned" Masstree.

Tests use 140M-key, 1-to-10-byte decimal get workloads with various partition skewness. Following Hua et al. [22], we model skewness with a single parameter δ. For skewness δ, 15 partitions receive the same number of requests, while the last one receives δ× more than the others. For example,
at δ = 9, one partition handles 40% of the requests and each other partition handles 4%.

[Figure 11: line graph of throughput (req/sec, millions) versus skewness δ (0–9) for Masstree and hard-partitioned Masstree.] Figure 11. Throughput of Masstree and hard-partitioned Masstree with various skewness (16-core get workload).

Server            C/C++ client library   Batched query   Range query
MongoDB-2.0       2.0                    No              Yes
VoltDB-2.0        1.3.6.1                Yes             Yes
memcached-1.4.8   1.0.3                  Yes for get     No
Redis-2.4.5       latest hiredis         Yes             No

Figure 12. Versions of tested servers and client libraries.

Figure 11 shows that the throughput of hard-partitioned Masstree decreases with skewness. The core serving the hot partition is saturated for δ ≥ 1. This throttles the entire system, since other partitions' clients must wait for the slow partition in order to preserve skewness, leaving the other cores partially idle. At δ = 9, 80% of total CPU time is idle. Masstree throughput is constant; at δ = 9 it provides 3.5× the throughput of hard-partitioned Masstree. However, for a uniform workload (δ = 0), hard-partitioned Masstree has 1.5× the throughput of Masstree, mostly because it avoids remote DRAM access (and interlocked instructions). Thus Masstree's shared data is an advantage with skewed workloads, but can be slower than hard-partitioning for uniform ones. This problem may diminish on single-chip machines, where all DRAM is local.

7. System evaluation
This section compares the performance of Masstree with that of MongoDB, VoltDB, memcached, and Redis, all systems that have reputations for high performance. Many of these systems support features that Masstree does not, some of which may bottleneck their performance. We disable other systems' expensive features when possible. Nevertheless, the comparisons in this section are not entirely fair. We provide them to put Masstree's throughput in the context of other systems used in practice for key-value workloads.

Figure 12 summarizes the software versions we tested. The client libraries vary in their support for batched or pipelined queries, which reduce networking overheads. The memcached client library does not support batched puts.

Except for Masstree, these systems' storage data structures are not designed to scale well when shared among multiple cores. They are intended to be used with multiple instances on multicore machines, each with a partition of the data. For each system, we use the configuration on 16 cores that yields the highest performance: eight MongoDB processes and one configuration server; four VoltDB processes, each with four sites; 16 Redis processes; and 16 memcached processes. Masstree uses 16 threads.

VoltDB is an in-memory RDBMS. It achieves robustness through replication rather than persistent storage. We turn VoltDB's replication off. VoltDB supports transactions and a richer data and query model than Masstree.

MongoDB is a key-value store. It stores data primarily on disk, and supports named columns and auxiliary indices. We set the MongoDB chunk size to 300 MB, run it on an in-memory file system to eliminate storage I/O, and use the "_id" column as the key, indexed by a B-tree.

Redis is an in-memory key-value store. Like Masstree, it logs to disk for crash recovery. We give each Redis process a separate log, using all four SSDs, and disable checkpointing and log rewriting (log rewriting degrades throughput by more than 50%). Redis uses a hash table internally and thus does not support range queries. To implement columns, we used Redis's support for reading and writing specific byte ranges of a value.

memcached is an in-memory key-value store usually used for caching non-persistent data. Like Redis, memcached uses a hash table internally and does not support range queries. The memcached client library supports batched gets but not batched puts, which limits its performance on workloads involving many puts.

Our benchmarks run against databases initialized with 20M key-value pairs. We use two distinct sets of workloads. The first set's benchmarks resemble those in the previous section: get and put workloads with uniformly-distributed 1-to-10-byte decimal keys and 8-byte values. These benchmarks are run for 60 seconds. The second set uses workloads based on the YCSB cloud serving benchmark [16]. We use a Zipfian distribution for key popularity and set the number of columns to 10 and the size of each column to 4 bytes. The small column size ensures that no workload is bottlenecked by network or SSD bandwidth. YCSB includes a benchmark, YCSB-E, that depends on range queries. We modify this benchmark to return one column per key, rather than all 10, again to prevent the benchmark from being limited by the network. Initial tests were client limited, so we run multiple client processes. Finally, some systems (Masstree) do not yet support named columns, and on others (Redis) named column support proved expensive; for these systems we modified YCSB to identify columns by number rather than name. We call the result MYCSB.
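For concreteness, the skew model of §6.6 can be computed directly. This sketch reads "receives δ× more" as a (1 + δ)-fold weight, which is the reading that reproduces the 40%/4% split quoted above for δ = 9; the function name is ours.

```cpp
#include <cassert>
#include <cmath>

// Skew model after Hua et al. [22]: with 16 partitions, 15 receive
// equal weight and the hot one receives (1 + delta) times that weight.
// Returns the fraction of all requests sent to the hot partition.
static double hot_share(double delta, int nparts = 16) {
    double hot = 1.0 + delta;           // hot partition's weight
    double total = (nparts - 1) + hot;  // cold weights are 1.0 each
    return hot / total;
}
```

At δ = 9 this gives the hot partition 10/25 = 40% of requests and each cold partition (1 − 0.4)/15 = 4%, matching the text; at δ = 0 every partition receives 1/16 of the load.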
                                   Throughput (req/sec, millions, and as % of Masstree)
Workload                           Masstree   MongoDB        VoltDB         Redis          Memcached

Uniform key popularity, 1-to-10-byte decimal keys, one 8-byte column
get                                9.10       0.04    0.5%   0.22    2.4%   5.97   65.6%   9.78   107.4%
put                                5.84       0.04    0.7%   0.22    3.7%   2.97   50.9%   1.21    20.7%
1-core get                         0.91       0.01    1.1%   0.02    2.6%   0.54   59.4%   0.77    84.3%
1-core put                         0.60       0.04    6.8%   0.02    3.6%   0.28   47.2%   0.11    17.7%

Zipfian key popularity, 5-to-24-byte keys, ten 4-byte columns for get, one 4-byte column for update & getrange
MYCSB-A (50% get, 50% put)         6.05       0.05    0.9%   0.20    3.4%   2.13   35.2%   N/A
MYCSB-B (95% get, 5% put)          8.90       0.04    0.5%   0.20    2.3%   2.69   30.2%   N/A
MYCSB-C (all get)                  9.86       0.05    0.5%   0.21    2.1%   2.70   27.4%   5.28    53.6%
MYCSB-E (95% getrange, 5% put)     0.91       0.00    0.1%   0.00    0.1%   N/A            N/A
Figure 13. System comparison results. All benchmarks run against a database initialized with 20M key-value pairs and use 16
cores unless otherwise noted. Getrange operations retrieve one column for n adjacent keys, where n is uniformly distributed
between 1 and 100.
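As an aside on the persistence design of §5: the recovery cutoff τ = min_{ℓ∈L} max_{u∈ℓ} u.timestamp behind these logging-enabled numbers is simple to illustrate. This sketch uses plain integer timestamps and hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// tau is the minimum, over all per-core logs, of that log's last
// (maximum) timestamp.  During replay, updates stamped >= tau are
// dropped, because some other log may be missing operations from
// tau onward.
static uint64_t recovery_cutoff(const std::vector<std::vector<uint64_t>>& logs) {
    uint64_t tau = UINT64_MAX;
    for (const auto& log : logs) {
        uint64_t last = 0;
        for (uint64_t ts : log) last = std::max(last, ts);
        tau = std::min(tau, last);
    }
    return tau;
}
```

For three per-core logs whose last timestamps are 9, 7, and 8, τ = 7: updates at timestamp 7 or later are discarded, since the second log's core may have executed further operations that never reached storage.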
Puts in this section's benchmarks modify existing keys' values, rather than inserting new keys. This made it easier to preserve MYCSB's key popularity distribution with multiple client processes.

We do not run systems on benchmarks they don't support. The hash table stores can't run MYCSB-E, which requires range queries, and memcached can't run MYCSB-A and -B, which require individual-column update. In all cases Masstree includes logging and network I/O.

Figure 13 shows the results. Masstree outperforms the other systems on almost all workloads, usually by a substantial margin. The exception is that on a get workload with uniformly distributed keys and 16 cores, memcached has 7.4% better throughput than Masstree. This is because memcached, being partitioned, avoids remote DRAM access (see §6.6). When run on a single core, Masstree slightly exceeds the performance of this version of memcached (though as we showed above, a hash table could exceed Masstree's performance by 2.5×).

We believe these numbers fairly represent the systems' absolute performance. For example, VoltDB's performance on uniform key distribution workloads is consistent with that reported by the VoltDB developers for a similar benchmark, volt2 [23].7

Several conclusions can be drawn from the data. Masstree has good efficiency even for challenging (non-network-limited) workloads. Batched query support is vital on these benchmarks: memcached's update performance is significantly worse than its get performance, for example. VoltDB's range query support lags behind its support for pure gets. As we would expect given the results in §6.6, partitioned stores perform better on uniform workloads than skewed workloads: compare Redis and memcached on the uniform get workload with the Zipfian MYCSB-C workload.

7 We also implemented volt2; it gave similar results.

8. Conclusions

Masstree is a persistent in-memory key-value database. Its design pays particular attention to concurrency and to efficiency for short and simple queries. Masstree keeps all data in memory in a tree, with fanout chosen to minimize total DRAM delay when descending the tree with prefetching. The tree is shared among all cores to preserve load balance when key popularities are skewed. It maintains high concurrency using optimistic concurrency control for lookup and local locking for updates. For good performance for keys with long shared prefixes, a Masstree consists of a trie-like concatenation of B+-trees, each of the latter supporting only fixed-length keys for efficiency. Logging and checkpointing provide consistency and durability.

On a 16-core machine, with logging enabled and queries arriving over a network, Masstree executes more than six million simple queries per second. This performance is comparable to that of memcached, a non-persistent hash table server, and higher (often much higher) than that of VoltDB, MongoDB, and Redis.

Acknowledgments

We thank the Eurosys reviewers and our shepherd, Eric Van Hensbergen, for many helpful comments. This work was partially supported by the National Science Foundation (awards 0834415 and 0915164) and by Quanta Computer. Eddie Kohler's work was partially supported by a Sloan Research Fellowship and a Microsoft Research New Faculty Fellowship.

References

[1] Sharding for startups. http://www.startuplessonslearned.com/2009/01/sharding-for-startups.html.

[2] MongoDB. http://mongodb.com.

[3] Redis. http://redis.io.
[4] Cassandra @ Twitter: An interview with Ryan King. http://nosql.mypopescu.com/post/407159447/cassandra-twitter-an-interview-with-ryan-king.

[5] VoltDB, the NewSQL database for high velocity applications. http://voltdb.com.

[6] R. Bayer and E. McCreight. Organization and maintenance of large ordered indices. In Proc. 1970 ACM SIGFIDET (now SIGMOD) Workshop on Data Description, Access and Control, SIGFIDET '70, pages 107–141.

[7] R. Bayer and K. Unterauer. Prefix B-Trees. ACM Transactions on Database Systems, 2(1):11–26, Mar. 1977.

[8] P. Bohannon, P. McIlroy, and R. Rastogi. Main-memory index structures with fixed-size partial keys. SIGMOD Record, 30:163–174, May 2001.

[9] N. G. Bronson, J. Casper, H. Chafi, and K. Olukotun. A practical concurrent binary search tree. In Proc. 15th ACM PPoPP Symposium, Bangalore, India, 2010.

[10] S. K. Cha and C. Song. P*TIME: Highly scalable OLTP DBMS for managing update-intensive stream workload. In Proc. 30th VLDB Conference, pages 1033–1044, 2004.

[11] S. K. Cha, S. Hwang, K. Kim, and K. Kwon. Cache-conscious concurrency control of main-memory indexes on shared-memory multiprocessor systems. In Proc. 27th VLDB Conference, 2001.

[12] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems, 26:4:1–4:26, June 2008.

[13] S. Chen, P. B. Gibbons, and T. C. Mowry. Improving index performance through prefetching. In Proc. 2001 SIGMOD Conference, pages 235–246.

[14] J. Cieslewicz and K. A. Ross. Data partitioning on chip multiprocessors. In Proc. 4th International Workshop on Data Management on New Hardware, DaMoN '08, pages 25–34, New York, NY, USA, 2008.

[15] J. Cieslewicz, K. A. Ross, K. Satsumi, and Y. Ye. Automatic contention detection and amelioration for data-intensive operations. In Proc. 2010 SIGMOD Conference, pages 483–494.

[16] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proc. 1st ACM Symposium on Cloud Computing, SoCC '10, pages 143–154, New York, NY, USA, 2010.

[17] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In Proc. 21st ACM SOSP, pages 205–220, 2007.

[18] B. Fitzpatrick. LiveJournal's backend—a history of scaling. http://www.danga.com/words/2005_oscon/oscon-2005.pdf.

[19] K. Fraser. Practical lock-freedom. Technical Report UCAM-CL-TR-579, University of Cambridge Computer Laboratory, 2004.

[20] E. Fredkin. Trie memory. Communications of the ACM, 3:490–499, September 1960.

[21] N. Hardavellas, I. Pandis, R. Johnson, N. G. Mancheril, A. Ailamaki, and B. Falsafi. Database servers on chip multiprocessors: Limitations and opportunities. In 3rd Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, California, USA, January 2007.

[22] K. A. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In Proc. 17th VLDB Conference, pages 525–535, 1991.

[23] J. Hugg. Key-value benchmarking. http://voltdb.com/company/blog/key-value-benchmarking.

[24] R. Johnson, I. Pandis, N. Hardavellas, A. Ailamaki, and B. Falsafi. Shore-MT: A scalable storage manager for the multicore era. In Proc. 12th International Conference on Extending Database Technology: Advances in Database Technology, pages 24–35, New York, NY, USA, 2009.

[25] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi. H-Store: A high-performance, distributed main memory transaction processing system. Proc. VLDB Endowment, 1:1496–1499, August 2008.

[26] A. Lakshman and P. Malik. Cassandra: A decentralized structured storage system. ACM SIGOPS Operating System Review, 44:35–40, April 2010.

[27] P. L. Lehman and S. B. Yao. Efficient locking for concurrent operations on B-trees. ACM Transactions on Database Systems, 6(4):650–670, 1981.

[28] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-copy update. In Proc. 2002 Ottawa Linux Symposium, pages 338–367, 2002.

[29] C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D. Lomet. AlphaSort: A cache-sensitive parallel external sort. The VLDB Journal, 4(4):603–627, 1995.

[30] J. Rao and K. A. Ross. Making B+-trees cache conscious in main memory. SIGMOD Record, 29:475–486, May 2000.

[31] K. A. Ross. Optimizing read convoys in main-memory query processing. In Proc. 6th International Workshop on Data Management on New Hardware, DaMoN '10, pages 27–33, New York, NY, USA, 2010. ACM.

[32] S. Schneider, C. D. Antonopoulos, and D. S. Nikolopoulos. Scalable locality-conscious multithreaded memory allocation. In Proc. 5th International Symposium on Memory Management, ISMM '06, pages 84–94. ACM, 2006.

[33] S. Sen and R. E. Tarjan. Deletion without rebalancing in balanced binary trees. In Proc. 21st SODA, pages 1490–1499, 2010.

[34] J. Sewall, J. Chhugani, C. Kim, N. Satish, and P. Dubey. PALM: Parallel architecture-friendly latch-free modifications to B+ trees on many-core processors. Proc. VLDB Endowment, 4(11):795–806, August 2011.

[35] M. Stonebraker, S. Madden, J. D. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it's time for a complete rewrite). In Proc. 33rd VLDB Conference, pages 1150–1160, 2007.