
Leveraging Glocality for Fast Failure Recovery in Distributed RAM Storage

YIMING ZHANG and DONGSHENG LI, National University of Defense Technology


LING LIU, Georgia Institute of Technology

Distributed RAM storage aggregates the RAM of servers in data center networks (DCN) to provide extremely
high I/O performance for large-scale cloud systems. For quick recovery of storage server failures, MemCube [53] exploits the proximity of the BCube network to limit the recovery traffic to the recovery servers' 1-hop neighborhood. However, the previous design is applicable only to the symmetric BCube(n, k) network with n^{k+1} nodes and has suboptimal recovery performance due to congestion and contention.
To address these problems, in this article, we propose CubeX, which (i) generalizes the “1-hop” principle
of MemCube for arbitrary cube-based networks and (ii) improves the throughput and recovery performance
of RAM-based key-value (KV) store via cross-layer optimizations. At the core of CubeX is to leverage the
glocality (= globality + locality) of cube-based networks: It scatters backup data across a large number of
disks globally distributed throughout the cube and restricts all recovery traffic within the small local range
of each server node. Our evaluation shows that CubeX not only efficiently supports RAM-based KV store for
cube-based networks but also significantly outperforms MemCube and RAMCloud in both throughput and
recovery time.
CCS Concepts: • Information systems → Storage replication; Cloud based storage; • Computer systems organization → Secondary storage organization;
Additional Key Words and Phrases: Globality, locality, cube-based networks, distributed RAM storage, fast
recovery
ACM Reference format:
Yiming Zhang, Dongsheng Li, and Ling Liu. 2019. Leveraging Glocality for Fast Failure Recovery in Dis-
tributed RAM Storage. ACM Trans. Storage 15, 1, Article 3 (February 2019), 24 pages.
https://doi.org/10.1145/3289604

1 INTRODUCTION
In recent years, the role of RAM has become increasingly important for storage in large-scale cloud
systems [38, 39]. In most of these storage systems, however, RAM is only used as a cache for the
primary storage. For instance, memcached [8] has been widely used by various Web services and
applications, and Google’s Bigtable [14] keeps entire column families in RAM.

This work is supported by the National Natural Science Foundation of China (61772541 and 61872376), the National Science
Foundation under Grants NSF SaTC 1564097 and IBM faculty award.
Authors’ addresses: Y. Zhang (corresponding author) and D. Li (corresponding author), PDL, School of Computer, National
University of Defense Technology, Changsha 410073, China; emails: [email protected], [email protected]; L. Liu,
College of Computing, Georgia Institute of Technology, Atlanta, USA.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2019 Association for Computing Machinery.
1553-3077/2019/02-ART3 $15.00
https://doi.org/10.1145/3289604


It is the responsibility of cache-based applications to manage the consistency, making cache


systems (like memcached) vulnerable to consistency problems [1]. Moreover, inevitable cache misses make it difficult for applications to effectively utilize RAM [44]. For example, a 1% cache miss ratio would result in 10× higher I/O latency, because disks are 1,000× slower than RAM. Due to this problem, interactive Internet services such as Facebook and Amazon can currently make at most hundreds of sequential I/O operations (with little locality) when responding to an HTTP request [47], so as to limit the cumulative I/O latency, which is a key factor in overall response time. Clearly, completely
using RAM as storage (instead of only as cache) would boost the performance of a large variety of
online data-intensive applications.
Recently, RAMCloud [44] and MemCube [53] took a step forward by directly using RAM as persistent storage. RAMCloud stores data completely in the RAM of storage servers and keeps
backup copies on the hard disk drives (HDD) or solid state drivers (SSD) of so-called “backup
servers.” The main drawback of RAMCloud is that it completely ignores the network impact
in large-scale RAM storage systems. For example, there are various short-lived network problems [25], like incast and temporary hot spots, which might cause heartbeats to be discarded (in Ethernet) or suspended (in InfiniBand). Consequently, RAMCloud will incorrectly treat them as server failures and issue unnecessary recoveries. In our previous work [53], we proposed MemCube, which exploits the proximity of the BCube network [28] to design the CubicRing structure that limits the recovery to the 1-hop range, so that the recovery can be performed while avoiding most congestion in the network. However, the design of MemCube has two drawbacks. First, the CubicRing structure is applicable only to the BCube(n, k) network, which constructs a symmetric cube topology with n^{k+1} nodes. Second, MemCube has suboptimal recovery performance due to the (last-hop) congestion and contention.
To address these problems, in this article, we propose CubeX, which (i) generalizes the “1-hop”
principle of MemCube for arbitrary cube-based networks, and (ii) improves the throughput and
recovery performance of RAM-based KV stores via cross-layer optimizations. The basic idea of CubeX
is that in data centers, we could have a codesign in which the storage system and the network
collaboratively optimize the performance. Specifically, CubeX leverages the glocality (= globality
+ locality) of cube-based networks to achieve efficient RAM-based storage: It scatters backup data
globally throughout the cube-based network, which allows high aggregate recovery bandwidth
both for disks and for network, and it limits all the recovery traffic to the small local range of each
node, reducing the recovery congestion.
We implement CubeX by incorporating existing reliable KV techniques, such as aggressive data
partitioning [14] for parallel recovery, log-structured storage [44] for simplifying persistence, and
ZooKeeper service [33] for highly available coordination. We evaluate CubeX on cube-based net-
works, including both an 8 × 8 and a 4 × 4 × 2 × 2 generalized hypercube [28], and compare CubeX
with the state-of-the-art RAM-based storage system MemCube.
In this article, we make the following contributions.

• We leverage both the globality and the locality of cube-based networks to realize quick
recovery for RAM-based key-value store, and analyze their tradeoff;
• CubeX realizes cross-layer optimizations, such as recovery server remapping, asynchronous
backup server recovery, and surrogate backup writing, to support quick recovery and high
I/O performance;
• We implement CubeX by modifying MemCube, and evaluate CubeX on an Ethernet testbed
connected using various networks. Evaluation results show that CubeX remarkably outper-
forms MemCube both in throughput and in recovery time.


This article is organized as follows: Section 2 discusses our motivation; Section 3 introduces
the design of the CubeX structure; Section 4 describes the basic recovery mechanism of CubeX by
leveraging the glocality of cubes; Section 5 introduces the optimization for recovery server remap-
ping; Section 6 presents our prototype implementation of CubeX; Section 7 presents experimental
results; Section 8 introduces related work; and Section 9 concludes the article.

2 PRELIMINARIES
2.1 RAM-Based Storage
RAM-based storage promises extremely low I/O latency (10μs-level RTT) and high I/O throughput (millions of reads/writes per second). Its most obvious disadvantage is the high cost and energy usage per bit: RAM-based storage is about 50× worse than pure disk-based storage and 5× worse than flash-based storage on both metrics [45]. Therefore, if an application needs to store a large amount of data
inexpensively and has a low access rate, RAM-based storage might not be the best solution.
However, when considering cost/energy usage per operation, a RAM-based storage is about
1,000× more efficient than a disk-based storage and 10× more efficient than a flash memory-based
storage [45]. Based on Gray’s Rule [26], that the cost per usable bit of disk increases as the desired
access rate to each record increases, Ousterhout et al. deduced that with recent technologies, if a
1KB record is accessed more than once every 30h, then it is cheaper to store it in memory than on
disk (because to enable this access rate only 2% of the disk space can be used) [45].
Furthermore, Andersen et al. generalize Gray’s rule to compare the total cost of ownership (including hardware cost and energy usage) over 3 years given the required data size and access rate [11]. Their analysis shows that (i) RAM is cheapest for high access rates and small data sizes, disk is cheapest for low access rates and large data sizes, and flash is cheapest in between; and (ii) since cost/bit is improving much more rapidly than cost/operation/sec for all three storage technologies, the applicability of RAM-based storage will keep increasing.
Redis [12] is a RAM-based key-value store with a rich data model that keeps all data in RAM. Currently Redis is designed mainly for use as a cache layer. In cluster storage scenarios, Redis cannot support the more efficient primary-backup persistence mechanism of RAMCloud (primary copies in primary servers’ RAM and backup copies on backup servers’ disks), and thus suffers from a severe performance penalty.

2.2 Cube-Based Datacenter Networks


The cube-based network is a server-centric network architecture in which nodes act not only as storage servers but also as relay nodes for routing. For example, Figure 1 shows the topology of a 4 × 4 × 2 generalized hypercube (GHC) [28], one of the most popular cube-based networks. A GHC structure consists of r dimensions with m_i nodes in the ith dimension (1 ≤ i ≤ r). A node (e.g., 000 in Figure 1) in a particular axis connects to all other m_i − 1 nodes in the same axis (e.g., 100, 200, and 300 in one dimension). Due to the limited number of NIC ports, in practice a node connects to one switch (using only one port) in each dimension, instead of directly connecting to all other m_i − 1 nodes (which would require m_i − 1 ports).
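To make the addressing concrete, the following short Python sketch (our illustration using tuple-based digit-per-dimension addresses, not code from CubeX) enumerates a node's neighbors in each dimension of a GHC, matching the labels of Figure 1:

from itertools import product

def ghc_nodes(dims):
    # All node addresses of a GHC with the given per-dimension radixes.
    return list(product(*[range(a) for a in dims]))

def neighbors_in_dim(node, dims, i):
    # Nodes that differ from `node` only in digit i, i.e., the nodes reached
    # through the single switch that `node` attaches to in dimension i.
    return [node[:i] + (v,) + node[i + 1:] for v in range(dims[i]) if v != node[i]]

def label(node):
    return "".join(str(d) for d in node)

if __name__ == "__main__":
    dims = (4, 4, 2)                          # the 4 x 4 x 2 GHC of Figure 1
    n000 = (0, 0, 0)
    for i in range(len(dims)):
        print("dim", i, ":", [label(n) for n in neighbors_in_dim(n000, dims, i)])
    # Node 000 uses one NIC port per dimension (3 ports) to reach its
    # 3 + 3 + 1 = 7 direct neighbors.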

2.3 Fast Failure Recovery


Fast recovery is important for RAM-based storage, not only for improving durability (as discussed
in Reference [44]) but also for providing high availability, which is defined as “mean time to failure”
divided by the sum of “mean time to failure” and “mean time to recovery” [23]. The tree-based
structure of RAMCloud cannot achieve high availability in large-scale clusters with thousands of


Fig. 1. CubeX on generalized hypercube (GHC). Left: a 4 × 4 × 2 GHC. Right: the corresponding network
connection. (It uses a 32-port switch to emulate the 16 2-port mini switches.) When a primary server 000
fails, each of its seven recovery servers (001, 010, 020, 030, 100, 200, 300) becomes a new primary server and
recovers 1/7 of the primary data of 000. For example, 030 recovers by fetching backup data from its four backup
servers 031, 130, 230, 330.

servers [5, 27]. This consequently affects data durability, because more failures may occur before
a previous failure gets recovered. Moreover, as discussed in Reference [16], since the recovery
servers independently choose their new backup servers for storing the backup data without a
global view of the network, there exists substantial imbalance in the usage of bottleneck links,
which is completely ignored by RAMCloud. However, RAMCloud requires that during recovery
all the backup data must be restored by the recovery servers. Thus in a triple-replicated cluster the
recovery bandwidth achieved by a recovery server is only 1/3 the bandwidth of that server [44]. For
example, in RAMCloud the PCI-Express bandwidth of 25Gbps limits the throughput per recovery
server to no more than 800MB per second [44].

2.4 MemCube
MemCube [53] exploits the proximity of the BCube network [28] to limit the recovery to the 1-hop range, so that the recovery can be performed while avoiding most congestion. However, the design of MemCube is applicable only to BCube, and it has suboptimal recovery performance due to the last-hop congestion and contention. For example, the recovery of a backup server is performed after all other recoveries complete, since its recovery flows overlap and contend with
the primary server recovery [53]. Therefore, MemCube’s recovery is suboptimal if we define the
recovery time as the entire time window for recovering all the primary/recovery/backup servers
instead of just the primary server. Suppose that the replication factor is f . In MemCube after the
new primary servers restart the service, there would be a (short) time window in which the number
of backup copies is f − 1 instead of f .

2.5 Our Goal


Sub-second recovery is not practical in large-scale RAM storage systems, because both failure de-
tection and recovery preparation require at least a few hundred milliseconds in a cluster. Therefore,
the goal of CubeX is to recover a failed storage server (with tens or hundreds of GB of RAM) in a few seconds, which is enough to achieve the high data durability and availability desired by users. For example, if a failure can be recovered in several seconds in a large-scale storage system, then the probability of data loss in one year is as low as 10^{−6} ∼ 10^{−5}, and the availability can be as high as four nines (99.99%) [53].


3 CUBEX DESIGN
This section first introduces the design of the generalized primary-backup-recovery structure of
CubeX, and then analyzes its properties.

3.1 Background
We follow the primary-backup model that is preferred by most RAM-based storage systems includ-
ing RAMCloud and MemCube. In a primary-backup RAM-based storage system with replication
factor f, there are one primary replica and f backup replicas for each key-value (KV) pair. The value is a variable-length byte array (up to 1MB). For each KV pair, the primary replica is stored in the RAM of a primary server and the f backup replicas are maintained on the disks of f backup servers.
When a primary server (with tens or hundreds of GB of RAM) fails, to quickly recover all its primary data in a few seconds, the recovery procedure needs to involve hundreds of backup disks (which is easy to achieve in a moderate-size data center). The data is recovered to the recovery servers of the failed primary server, each becoming a new primary server. Considering the currently avail-
able commodity NIC bandwidth (10Gbps), several tens of recovery servers are needed to provide
adequate (aggregate) recovery bandwidth to recover a failed primary server (with tens of GB data
in RAM) in a few seconds.

3.2 Glocalized Primary-Backup Structure


We generalize the “1-hop” principle of MemCube [53], which exploits the proximity of BCube [28]
to limit all the recoveries to the one-hop neighborhood of recovery servers, for all cube-based
networks, and propose the glocalized CubeX structure for the primary-backup storage model.
The basic idea behind CubeX is simple. All the nodes in the cube-based network serve as primary servers, which together are responsible for the whole key space. Emulating the "1-hop" backup and recovery of MemCube, for each primary server P its direct neighbors in all dimensions are P's recovery servers, and for each recovery server R its direct neighbors in all dimensions except the dimension of the link between R and P are P's backup servers corresponding to R. A primary server backs up its data to all its backup servers. Once the primary server fails, each recovery server R recovers a fraction of P's data from its backup servers.
Clearly, this design is simple enough to generalize to all cube-based network topologies such as
generalized hypercube (GHC), MDCube [52], k-ary n-cube [28], hyperbanyan [22], and nearest-
neighbor mesh. For clarity, in the rest of this article, we will focus on the generalized hypercube
topology for describing the design, analysis and evaluation of CubeX, but it is straightforward to
generalize to all the other cube-based networks.
Consider the generalized hypercube (Figure 1). Each node (e.g., 000 in the red dashed circle)
in GHC is a primary server; all its direct neighbors are the recovery servers (001, 010, 020, 030,
100, 200, 300 in the blue dashed circles); and for each recovery server its direct neighbors are its
corresponding backup servers (e.g., for recovery server 030, the backup servers are 031, 130, 230,
330 in the green dashed circles). Clearly, in the symmetric structure, the primary, recovery, and
backup relations are commutative. For example, if node A is node B’s recovery server, then node B
is also node A’s recovery server.
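The primary-recovery-backup relation above can be computed mechanically. Below is a minimal Python sketch (illustrative names, assuming the same digit-per-dimension addressing as Figure 1) that derives a primary server's recovery servers and, for one recovery server, its backup servers; the printed result matches the Figure 1 example:

def neighbors(node, dims):
    # All 1-hop neighbors of `node`; in CubeX these are its recovery servers.
    out = []
    for i, a in enumerate(dims):
        out += [node[:i] + (v,) + node[i + 1:] for v in range(a) if v != node[i]]
    return out

def backup_servers(primary, recovery, dims):
    # R's 1-hop neighbors in every dimension except the dimension of the
    # link between the primary P and the recovery server R.
    diff_dim = next(i for i in range(len(dims)) if primary[i] != recovery[i])
    return [recovery[:i] + (v,) + recovery[i + 1:]
            for i, a in enumerate(dims) if i != diff_dim
            for v in range(a) if v != recovery[i]]

def label(node):
    return "".join(str(d) for d in node)

if __name__ == "__main__":
    dims = (4, 4, 2)
    P = (0, 0, 0)
    print([label(r) for r in neighbors(P, dims)])              # 7 recovery servers
    print([label(b) for b in backup_servers(P, (0, 3, 0), dims)])
    # -> ['130', '230', '330', '031'], the backup servers of 030 in Figure 1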
The primary server uses its local (i.e., directly-connected) neighbors as its recovery servers, and
each recovery server uses its own local neighbors as the backup servers. For two-dimensional
cube-based networks, this design makes all servers globally participate in the recovery of one
failed primary server, as either recovery or backup servers. For cube-based networks with more
than two dimensions, usually it is not necessary to use all servers in the entire network for a


recovery, because the aggregate recovery throughput is already sufficient when using only the servers in a two-dimensional subnetwork, as discussed in the next subsection.
By this means the recovery traffic (from backup servers to their corresponding recovery servers)
is restricted within the local range of directly-connected neighbors. For example, when the primary
server 000 fails, each of the seven recovery servers reads backup data from its directly connected
backup servers (e.g., 030 from 031, 130, 230, and 330). In total there are 4 × 6 + 6 × 1 = 30 parallel recovery flows in the local range (four flows for each of the six recovery servers in the first two dimensions, plus six flows for recovery server 001).
Centralized master. CubeX inherits the key-ring mapping mechanism from MemCube [53],
where a centralized master maintains the mappings from keys to primary servers. The master
equally divides the key space into consecutive subspaces, each being assigned to one primary
server. The key space held by a primary server is further equally divided among its recovery servers; and for each recovery server, its subspace is mapped to its backup servers. To avoid a potential performance bottleneck at the global master, the mapping from a primary server’s key space to its
recovery/backup servers is maintained by itself instead of the master, and the recovery servers
have a cache of the mapping they are involved in. After a server fails, the global master will ask
all its 1-hop neighbors to reconstruct the mapping.
For example, in Figure 1, the master divides the whole key space into 32 shares, each corresponding to a primary server P and denoted S_P (e.g., S_000 for P = 000). Similarly, each recovery server R of P is equally assigned a subspace of S_P, denoted S_{P,R} (e.g., S_{000,030} for R = 030). For a recovery server R, the sub key space S_{P,R} is further equally assigned to each backup server B, denoted S_{P,R,B}. For example, for the four backup servers (nodes 031, 130, 230, 330) for P = 000 and R = 030, S_{000,030} is divided into S_{000,030,031}, S_{000,030,130}, S_{000,030,230}, and S_{000,030,330}, which are stored on the corresponding backup servers’ disks. Other system information, such as the addresses of the servers, is also held by the master.
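As an illustration of this hierarchical division, the sketch below (with made-up key-range arithmetic and placeholder backup lists; CubeX's real mapping structures are not given in the article) splits a primary server's key space S_P evenly among its recovery servers and each S_{P,R} among that recovery server's backup servers:

def split_range(lo, hi, parts):
    # Split [lo, hi) into `parts` consecutive, nearly equal sub-ranges.
    step, rem = divmod(hi - lo, parts)
    ranges, start = [], lo
    for k in range(parts):
        end = start + step + (1 if k < rem else 0)
        ranges.append((start, end))
        start = end
    return ranges

def build_mapping(key_space, recovery_servers, backups_of):
    # S_P -> S_{P,R} -> S_{P,R,B}: the per-primary mapping kept by each primary.
    mapping = {}
    sub_spaces = split_range(*key_space, len(recovery_servers))
    for R, s_pr in zip(recovery_servers, sub_spaces):
        mapping[R] = dict(zip(backups_of[R], split_range(*s_pr, len(backups_of[R]))))
    return mapping

if __name__ == "__main__":
    # Primary P = 000 of Figure 1; backup lists other than 030's are placeholders.
    recovery = ["001", "010", "020", "030", "100", "200", "300"]
    backups = {R: ["b1", "b2", "b3", "b4"] for R in recovery}
    backups["030"] = ["031", "130", "230", "330"]
    mapping = build_mapping((0, 1 << 20), recovery, backups)
    print(mapping["030"])        # S_{000,030,031}, S_{000,030,130}, ...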
Replication factor. In a primary-backup storage system with replication factor f, each KV pair mapped onto a primary server P has one dominant backup replica (stored on a backup server B) and f − 1 secondary backup replicas that are also stored on P’s backup servers other than B. Given B, all its dominant backup data is also stored on f − 1 of P’s other backup servers, which are called the shadow backup servers of B. Shadow backup servers may be chosen to satisfy specific requirements. For example,
if it is required that f backup copies should be stored in different racks, then each backup server
chooses f − 1 shadow backup servers in f − 1 racks (rather than its own rack).

3.3 Analysis
Structure properties. For an m-dimensional Π_{i=1}^{m} a_i = a_1 × a_2 × · · · × a_m GHC structure, all the nodes are primary servers, so the number of primary servers is Π_{i=1}^{m} a_i. For each primary server P (e.g., node 000 in Figure 1), its recovery servers are its direct neighbors, so the number of P’s recovery servers is Σ_{i=1}^{m}(a_i − 1) (3 + 3 + 1 = 7 for node 000). For a recovery server R_j (e.g., node 030) in the jth dimension of P, its backup servers are its direct neighbors that are neither the primary server P nor P’s recovery servers, i.e., R_j’s direct neighbors not in P’s jth dimension. The number of R_j’s backup servers is Σ_{i=1,i≠j}^{m}(a_i − 1). For example, node 030 has 3 + 1 = 4 backup servers. Since P’s backup servers are 2 hops away from node P, each backup server corresponds to two recovery servers (e.g., backup node 031 corresponds to two recovery nodes 030 and 001). Therefore, the number of P’s backup servers is (1/2) Σ_{j=1}^{m}(a_j − 1) Σ_{i=1,i≠j}^{m}(a_i − 1) (e.g., primary server 000 has (1/2) × ((2 − 1) × 6 + (4 − 1) × 4 + (4 − 1) × 4) = 15 backup servers). In total, Σ_{j=1}^{m} Π_{i=1,i≠j}^{m} a_i switches are needed to connect the entire GHC structure. For example, the 4 × 4 × 2 GHC depicted in Figure 1 requires 4 × 2 + 4 × 2 + 4 × 4 = 32 switches.


CubeX has enough backup and recovery servers in real deployment. For example, for an
8 × 8 × 8 GHC network, there are 512 primary servers and 192 8-port switches. For each primary
server, there are 21 recovery nodes and 147 backup nodes. Roughly suppose that the NIC band-
width and disk I/O bandwidth are respectively 10Gbps and 100MB per second and each backup
server has three disks. When a primary server fails in the three-dimensional hypercube, each of its recovery servers has two NIC ports participating in the recovery. Then, the aggregate network bandwidth and aggregate disk bandwidth are respectively 420Gbps (= 10Gbps × 2 × 21) and about 43GB/s (≈ 100MB/s × 3 × 147), which are sufficient to recover tens of GB of data in a few seconds.
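The counts used in these two examples follow directly from the formulas in "Structure properties"; the short Python sketch below (ours, for checking only) recomputes them for the 4 × 4 × 2 and 8 × 8 × 8 GHCs:

from math import prod

def structure_counts(dims):
    m = len(dims)
    primaries = prod(dims)                                   # number of nodes
    recovery_per_primary = sum(a - 1 for a in dims)          # 1-hop neighbors
    backup_per_primary = sum((dims[j] - 1) * sum(dims[i] - 1 for i in range(m) if i != j)
                             for j in range(m)) // 2         # each counted twice
    switches = sum(prod(dims[i] for i in range(m) if i != j) for j in range(m))
    return primaries, recovery_per_primary, backup_per_primary, switches

print(structure_counts((4, 4, 2)))   # (32, 7, 15, 32), matching Section 3.3
print(structure_counts((8, 8, 8)))   # (512, 21, 147, 192), the 8 x 8 x 8 example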
Backup server selection. Optionally, a slightly different design for backup server selection is to
include the recovery server itself as one of the backup servers. CubeX forgoes this asymmetric design, because it adds management complexity while bringing only an insignificant improvement. Consider the above example: by including the recovery servers, the number of backup servers per primary server increases from 147 to 168, an increase of less than 15%.
Incremental scalability. CubeX is incrementally scalable. That is, the unit of upgrading CubeX
is one node. This is because when the number of available servers is less than Π_{i=1}^{m} a_i for an m-dimensional GHC structure, a partial m-dimensional GHC can be built with any number (≤ m) of (m − 1)-dimensional GHC structures using a complete set of dimension-m switches.
Performance degradation under switch failures. Both CubeX and MemCube perform better than RAMCloud (on FatTree) under switch failures [28]. This is because switches at different levels of a FatTree have different impacts on routing, and failures of low-level switches imbalance the traffic and dramatically degrade FatTree’s performance. For example, for a partial 8 × 8 × 8 × 8 GHC structure with 2,048 servers and 1Gbps links, the initial ABT (aggregate bottleneck throughput) is 2,006Gbps, and its ABT is 765Gbps when the switch failure ratio reaches 20%; for a FatTree with the same number of servers and levels of switches, the initial ABT is 1,895Gbps and it drops to only 267Gbps when 20% of the switches fail.
Complexity. The complexity of CubeX mainly comes from its wiring. For an m-dimensional Π_{i=1}^{m} a_i = a_1 × a_2 × · · · × a_m GHC structure with N = Π_{i=1}^{m} a_i servers and M = Σ_{j=1}^{m} Π_{i=1,i≠j}^{m} a_i switches, the total number of wires is N × m (each server uses one port per dimension). Thus CubeX has the same wiring complexity as MemCube (on BCube [28]) and RAMCloud (on FatTree [27]). Moreover, the wiring complexity of CubeX is feasible even for shipping-container-based modular data centers (MDC). For example, it is easy to package a GHC with 512 servers into a 40-foot container. Therefore, CubeX is suitable for normal data center constraints.
3.4 Globality-Locality Tradeoff
For cube-based networks with more than two dimensions, there are some servers that are more than two hops away from a primary server and that will not participate in its recovery. This is not a problem in most cases for large-scale networks. For example, the 8 × 8 × 8 GHC 10GbE network discussed above can achieve aggregate recovery throughput high enough to complete the recovery in a few seconds.
In some rare cases of high-dimensional, small-sized, and low-bandwidth networks, the basic de-
sign of CubeX might not provide enough aggregate recovery throughput. To address this problem,
CubeX trades locality for globality. Consider a 4 × 4 × 4 GHC network with 64 servers constructed
from four 4 × 4 GHCs. In the basic CubeX design, a primary server P has 9 recovery servers and 27
backup servers, leaving 27 3-hop neighbors in the other three 4 × 4 GHCs (referred to as GHC_R) unused in the recovery. If it is necessary to get more servers involved in the recovery, then in each GHC_R we can separately construct a "pseudo" primary-recovery-backup structure, with the "primary" server being a 1-hop recovery server (referred to as R_1) of P. Thus, we could replace each


of the three 1-hop recovery servers (R_1) with six 2-hop recovery servers (referred to as R_2) in the corresponding GHC_R. By this means a primary server P will have 24 recovery servers (6 R_1s and 18 R_2s) and 36 backup servers (9 2-hop neighbors and 27 3-hop neighbors). After a primary server
P fails, in each 4 × 4 GHC there will be 18 concurrent 1-hop local recovery flows. As discussed in
References [44, 53], having more recovery servers improves the aggregate recovery throughput
but leads to more data fragmentation after the recovery.

4 BASIC RECOVERY
In this section, we first briefly introduce how CubeX generalizes MemCube’s recovery procedure
for cube-based networks, and then present the cross-layer recovery optimization of CubeX.

4.1 Localized Detection


Phillipa Gill et al. characterize network failure patterns in data centers and report that data center networks experience a large number of software-induced transient failures and that a high fraction of heartbeat pings and ACKs are lost. Compared to traditional distributed systems, RAM-based storage has to use small timeouts (usually < 1 second), which may mistakenly treat temporary problems as permanent storage server failures and outages.
To address this problem, CubeX generalizes the 1-hop detection mechanism of MemCube as
follows. For each primary server, only its direct neighbors (i.e., recovery servers) are responsible
for checking its status. For example, a primary server 000 in Figure 1 is inspected by its neighbors
(recovery servers 001, 010, 020, 030, 100, 200, 300). The primary server 000 pings its seven recovery
servers periodically. Once 000’s recovery servers cannot receive ping messages from 000, they will
report to the master, which will start the recovery for 000 after a final confirmation. Since all pings
are constrained to the local range, the possibility of false positives is minimized.
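The following sketch illustrates the 1-hop detection loop on a recovery server. The 300 ms ping period and the three-miss threshold are the values used in the evaluation of Section 7.3; the class and callback names are ours, not CubeX's:

import time

PING_PERIOD_S = 0.3        # 300 ms, as used in the evaluation (Section 7.3)
MISS_THRESHOLD = 3         # consecutive misses before reporting to the master

class RecoveryServerMonitor:
    # Runs on a recovery server; watches one directly connected primary.

    def __init__(self, primary_id, report_to_master):
        self.primary_id = primary_id
        self.report_to_master = report_to_master
        self.last_ping = time.monotonic()
        self.misses = 0

    def on_ping(self):
        # Called whenever a ping from the primary arrives over the 1-hop link.
        self.last_ping = time.monotonic()
        self.misses = 0

    def tick(self):
        # Called every PING_PERIOD_S on the recovery server.
        if time.monotonic() - self.last_ping > PING_PERIOD_S:
            self.misses += 1
            if self.misses == MISS_THRESHOLD:
                # The master performs a final confirmation before recovery starts.
                self.report_to_master(self.primary_id)
        else:
            self.misses = 0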
CubeX uses atomic failure recovery [44] to handle rare false detections. That is, once a recovery is started, the old primary server will be considered dead even if it is actually alive and some unexpected factor led to the false positive detection.
Localized failure detection also helps to alleviate network problems caused by operator errors. For example, one of Amazon’s biggest service disruptions [1] occurred because a wrong configuration for upgrading a primary network incorrectly shifted traffic to a low-capacity redundant router, where the unexpectedly high traffic level made almost all in-transit heartbeat pings/ACKs between storage/backup servers unable to reach their destinations, eventually resulting in a catastrophic recovery storm. Clearly, if localized detection had been adopted, then most congestion-induced false positives in Amazon’s storage system might have been avoided.

4.2 Glocalized Recovery


A failed server S is a primary, recovery, and backup server at the same time. CubeX generalizes the recovery procedure of MemCube [53], in which the centralized master coordinates the recovery by asking all of S’s 1-hop neighbors to report their local caches of (parts of) the mapping from S’s key space to S’s recovery/backup servers and by reconstructing an integrated view of the mapping previously maintained by S. The global master then uses the mapping to coordinate the recovery of the failed server S (as a primary, recovery, and backup server, respectively).
• First, the master pauses key-value service and sets up the recovery process, i.e., which sub-
space should be recovered to which nodes.
• Second, recover S as a primary/recovery/backup server.
• Third, restart the service.
The first and third steps are the same as MemCube, and the second step for CubeX is as follows.


Primary server recovery. Recovering a primary server (S) requires recovering both the primary data and the backup data of S. When S fails, each of its r recovery servers becomes a new primary server and recovers 1/r of S’s primary data. For example, suppose that in Figure 1 the primary server 000 fails. The primary data is recovered from the old backup servers of S (e.g., 031, 130, 230, 330) to the corresponding new primary storage server (030). The backup replicas that were previously held by the old backup servers (e.g., 031, 130, 230, 330) of the old recovery server (030) are now transferred to the new backup servers of the new primary storage server (030). That is, 031 → (131, 231, 331, 021, 011, 001); 130 → (131, 120, 110, 100); 230 → (231, 220, 210, 200); 330 → (331, 320, 310, 300).
Recovery server recovery. To handle the failure of a recovery server (S), CubeX generalizes
MemCube’s mechanism of finding new recovery servers to replace S (for each of S’s primary
servers) and moving corresponding backup data. To replace the failed recovery server, CubeX
transfers the backup replicas corresponding to S from the old servers to the current ones corre-
sponding to P’s other recovery servers. For example, suppose that in Figure 1 a recovery server 000
(of the primary server 001) fails. Then the backup servers corresponding to 000 (010, 020, 030, 100,
200, 300) will transfer their backup replicas to the backup servers of 001’s remaining recovery
servers (011, 021, 031, 101, 201, 301). That is, 010 → (010, 111, 211, 311); 020 → (020, 121, 221, 321);
. . . ; 300 → (300, 311, 321, 331).
Backup server recovery. The failed server (S) is also a backup server of other primary servers,
and thus we need to perform the procedure for backup server recovery. Consider the backup server 000 of the primary server 330 in Figure 1. After the backup server 000 fails, intuitively 330
has to copy its primary replicas to other backup servers (010, 020, 100, 200), which are also recovery
servers of the failed primary server 000.
Note that the aggregate network bandwidth of the recovery servers of S is always the bottleneck
compared to the aggregate bandwidth of the backup nodes of S, which is mainly due to the follow-
ing reasons: (i) one recovery server corresponds to many backup servers, and (ii) one backup server
can install many disks. Therefore, the above backup server recovery (e.g., 330 → 030 → {020, 010})
conflicts with the recovery of the primary storage server (e.g., 330 → 030), both of which use the
bottleneck inbound network bandwidth of the recovery servers (e.g., 030). In Section 4.3, we will
introduce how CubeX gracefully handles this problem using asynchronous recovery.

4.3 Asynchronous Backup & Recovery


As discussed in Section 4.2, since the backup server recovery contends with the primary server recovery, CubeX performs it after all the other recoveries for the primary and recovery servers complete, in a way similar to RAMCloud and MemCube. This remarkably increases the recovery time, since a failed server stores roughly the same amount of backup data as its own primary data.
Asynchronous backup. To improve the overall recovery performance, CubeX designs an asynchronous backup mechanism to speed up the recovery of backup servers. Suppose that the replication factor is f. For a primary server P and each of its backup servers B, besides the f − 1 copies on the f − 1 shadow backup servers, P’s backup data has another replica called the standby copy. Suppose that two recovery servers (R_1 and R_2) of P have B as one of their backup servers. The standby copies (of B’s backup data for P) are uniformly distributed across the other backup servers (called the standby servers of B) of R_1 and R_2. Clearly, the standby data is exactly the same as B’s backup data that needs to be recovered after B fails. For instance, in Figure 1, primary 000’s backup data on 031 is replicated to 031’s standby servers (130, 230, 330 for R_1 = 030, and 101, 201, 301, 011, 021 for R_2 = 001).


When the primary node P receives a write request, it stores the new data in its RAM, transfers it to the RAM of one backup server (B) and the f − 1 shadow backup servers, and then returns. An (f + 1)th standby copy is asynchronously replicated to one of B’s standby servers. Therefore, after the backup node B fails, since B’s backup data has already been synchronously replicated (as secondary backup copies) to the f − 1 shadow backup servers, the master only checks whether all the writes of B’s standby copies have completed. If so, then the backup node B’s standby servers can directly change their standby data into backup data with no data transfer, which completes the recovery of the backup server.
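A compact sketch of this write path is given below. The method names (buffer, buffer_standby, and so on) are illustrative stand-ins for CubeX's RPCs, and the standby write is shown as a background thread purely to indicate that it does not delay the acknowledgement:

import threading

class PrimaryServer:
    # Sketch of the asynchronous-backup write path (illustrative names only).

    def __init__(self, ram, backup, shadows, pick_standby):
        self.ram = ram                      # in-RAM primary copies
        self.backup = backup                # dominant backup server B
        self.shadows = shadows              # the f - 1 shadow backup servers of B
        self.pick_standby = pick_standby    # selects one of B's standby servers

    def write(self, key, value):
        self.ram[key] = value
        # Synchronous part: B and the shadow backups buffer the data in RAM
        # (log-structured flush to disk happens later) before we return.
        self.backup.buffer(key, value)
        for s in self.shadows:
            s.buffer(key, value)
        # Asynchronous part: the (f + 1)th standby copy does not delay the ack.
        threading.Thread(target=self.pick_standby(key).buffer_standby,
                         args=(key, value), daemon=True).start()
        return "OK"

def recover_backup_server(master, failed_backup, standby_servers):
    # If every standby write for the failed backup server has completed,
    # its standby servers promote standby data to backup data with no transfer.
    if master.all_standby_writes_completed(failed_backup):
        for s in standby_servers:
            s.promote_standby_to_backup(failed_backup)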
The asynchronous backup mechanism does not affect individual writes, but a potential draw-
back of asynchronous backup is that it incurs extra network traffic. In practice, however, the size
of the values of CubeX’s KV pairs is no larger than 1MB (as discussed in Section 3.1) and the av-
erage size of values is usually smaller than 64KB [53] in RAM-based key-value stores (like CubeX,
MemCube, and RAMCloud), where the normal I/O performance is bounded by CPU and network
latency [44] instead of bandwidth. Therefore, the bandwidth utilization of the extra traffic for
asynchronous backup is negligible, which will be further evaluated in Section 7.2.
Asynchronous recovery. Standby data allows us to perform the recovery of backup servers asynchronously, restoring the backup data (as well as the standby data) in the same way as discussed in Section 4.2. This is because the data is only needed for recovering from future backup server failures and has no effect on durability/availability.
We have two options for restoring the backup and standby data: (i) from the existing secondary backup copies, or (ii) from the primary copy. Since the first method generates too much all-to-all traffic (among all the backup servers of a primary server), which may induce undesired congestion, CubeX chooses the second option. Because the restore process overlaps with the recovery of the primary and backup data of the primary server, CubeX begins to restore the backup and standby servers only after all other recoveries are completed and all the new primary servers have started their storage services.
Since the size of the standby data on a failed backup server is roughly the same as the size of
the primary data on a primary server, the restore process of standby data from the failed backup
server’s (B’s) two-hop neighbors (i.e., the primary servers) to B’s one-hop neighbors (i.e., other
backup servers sharing the same recovery servers) is also similar to the primary data recovery.
So this process can be finished in a period of time roughly the same as a primary server failure
recovery. During the process, the involved primary servers can service read/write requests to this
data as usual, except that they do not write the corresponding standby data but cache it in RAM.
Missed standby data is written in a large transfer after the restore process completes.
If we recover the shadow backup servers in the same way as the asynchronous backup server recovery, then assuming each primary server has α GB of primary data and the replication factor is f, a total of (f + 1)α GB of data is recovered from the primary servers’ RAM to the disks of the new backup servers, shadow backup servers, and standby servers. This process is not urgent and thus can be conducted over a relatively long period of time after all other recoveries complete.

4.4 Multi-Failure Recovery


CubeX treats multiple failures as separate single failures. For simultaneous failures, if two failures happen too close (≤ 2 hops) to each other, which means a recovery/backup server of a failed server may also have failed, CubeX simply excludes the unavailable recovery/backup server and uses another recovery server or the shadow backup servers instead.
For multi-server failure recovery, the recovery servers need to have enough spare RAM to accommodate the recovered data. When a specific server cannot accommodate more recovered data,


CubeX will simply choose another server for the recovery, in which case CubeX "degrades" to RAMCloud [44]. Clearly, the recovery can complete as long as there is at least one copy available. With a disk replication factor of f, in the worst case at least f (or f + 1 if the asynchronous standby copy write has completed) concurrent problems can be tolerated. Domain failures can be recovered as long as, for every lost replica, there is at least one backup replica available outside the domain. CubeX can install a battery on each backup server to ensure that the small amount of buffered data has the same non-volatility as disk-stored data.

4.5 Switch Failure Recovery


CubeX handles switch failures in the same way as MemCube [53] does. For an m-dimensional
Π_{i=1}^{m} a_i = a_1 × a_2 × · · · × a_m GHC structure, there are m disjoint paths between any pair of nodes. When CubeX experiences k < m concurrent switch failures, it still works correctly by leveraging the remaining switches. Therefore, CubeX can easily handle switch failures by replacing the failed switch
without urgent recovery for primary/recovery/backup servers.

5 OPTIMIZATION
5.1 Recovery Server Remapping
We note that recovery servers store no data related to the failed primary server. Therefore, CubeX proposes an optional remapping approach for recovery server recovery, i.e., for each affected primary storage server P, the affected sub key space can be re-designated to P’s other recovery servers without data transfer. Adopting this approach requires more elaborate processing in subsequent failure recoveries, as discussed in Section 4.4. In normal cases a backup server B directly connects to two recovery servers R_1 and R_2; after one recovery server (say, R_1) fails, B can simply register with R_2, i.e., tell the primary server P to redirect R_1’s sub key space S_{P,R_1,B} to R_2, which can be accomplished instantaneously.
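The remapping step itself is just a metadata update, as in the sketch below (the dictionary layout is our illustration of the primary's per-recovery-server mapping, not CubeX's actual structure):

def remap_after_recovery_server_failure(primary_map, B, failed_R, alive_R):
    # Redirect S_{P, failed_R, B} to alive_R in the primary's mapping.
    # primary_map: {recovery_server: {backup_server: key_subspace}}  (illustrative)
    # Returns True if the remapping succeeded without any data transfer.
    subspace = primary_map.get(failed_R, {}).pop(B, None)
    if subspace is None:
        return False                     # B held nothing for failed_R
    primary_map.setdefault(alive_R, {})[B] = subspace
    return True

if __name__ == "__main__":
    # Figure 1 names: P = 000, B = 031 is shared by recovery servers 030 and 001.
    m = {"030": {"031": (0, 1000)}, "001": {}}
    remap_after_recovery_server_failure(m, "031", failed_R="030", alive_R="001")
    print(m)   # {'030': {}, '001': {'031': (0, 1000)}}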
As discussed in Section 3.3, for an m-dimensional Π_{i=1}^{m} a_i = a_1 × a_2 × · · · × a_m GHC structure, each primary server has Σ_{i=1}^{m}(a_i − 1) recovery servers and vice versa. Therefore, if subsequent server failures happen independently at random, the probability that the second failure is affected by the remapping mechanism is (Σ_{i=1}^{m} a_i − m) / (Π_{i=1}^{m} a_i − 1). That is, when a backup server B has no more directly connected recovery servers for the primary server P (due to remapping in the previous failure), P will designate a new recovery server R′ for B’s sub key space, and transfer B’s backup data to a server B′ among the backup servers of R′, which is similar to the processing introduced in Section 4.2. It is likely to find such a B′ that is directly connected to B, because B is two hops away from P while R′ is one hop away from P, which means that in, for instance, an n × n × n GHC structure, there are initially 2(n − 2) recovery servers of P that are two hops away from B, each having a backup server one hop away from B.
The recovery for a single failure of CubeX with remapping is the same as that without remapping
(as described in Section 4). The recovery for multiple failures of CubeX with remapping will be
introduced in the next subsection.

5.2 Multi-Failure Recovery for Remapping


As more servers fail, a backup server may serve one or two recovery servers in the recovery. This is because backup servers may lose one recovery server due to previous failures and the remapping mechanism. Since the aggregate recovery bandwidth of the network is usually smaller than that of the disks, CubeX should finish all the recovery processes at the same time to fully utilize the network bandwidth. This means that all backup servers should have similar recovery bandwidth regardless of the number of their recovery servers.


Fig. 2. Implementation details of on-machine memory management.

We call a backup server with only one recovery server a "lonely backup server." The recovery servers that have some lonely backup servers are called "affected recovery servers," and the recovery servers that have no lonely backup servers are called "free recovery servers." If a backup server connects to one free recovery server and one affected recovery server, then we call it a "free backup server"; and if it connects to two affected recovery servers, then we call it an "affected backup server."
Consider an n × n × n GHC. Suppose that first a recovery server (say, node 001) of a primary
server (say, node 000) fails and then 000 also fails. Let α, β, γ be the optimal bandwidth shares
assigned to the lonely/free/affected backup servers of the affected recovery server with per-port λ
Gbps bandwidth, respectively. To finish all the recoveries at the same time, we should have α = 2γ, α = β + λ/(n − 1), and α + (n − 2)β = λ. Therefore, α = ((2n − 3)/(n − 1)²)λ, β = ((n − 2)/(n − 1)²)λ, and γ = ((n − 1.5)/(n − 1)²)λ. When n ≫ 1, we have α ≈ 2β ≈ 2γ. Therefore, the lonely backup servers (which have only one recovery server R) should use approximately twice the bandwidth of R used by R’s other backup servers. Clearly, for multiple failures, as long as n ≫ 1 and the failures happen randomly, we can
get similar results for α, β, γ . Therefore, as discussed in Orchestra [17], we can get nearly opti-
mal bandwidth allocation by simply creating two separate TCP connections for the lonely backup
servers that are connecting to only one recovery server.
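For completeness, the shares quoted above follow from solving the three constraints stated in the text (a short derivation, nothing beyond the given equations):

\[
\alpha = 2\gamma, \qquad \alpha = \beta + \frac{\lambda}{n-1}, \qquad \alpha + (n-2)\beta = \lambda .
\]
Substituting the second equation into the third gives
\[
\Bigl(\beta + \tfrac{\lambda}{n-1}\Bigr) + (n-2)\beta = \lambda
\;\Longrightarrow\; (n-1)\beta = \frac{(n-2)\lambda}{n-1}
\;\Longrightarrow\; \beta = \frac{n-2}{(n-1)^2}\,\lambda,
\]
and therefore
\[
\alpha = \beta + \frac{\lambda}{n-1} = \frac{2n-3}{(n-1)^2}\,\lambda,
\qquad
\gamma = \frac{\alpha}{2} = \frac{n-1.5}{(n-1)^2}\,\lambda .
\]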

6 IMPLEMENTATION
We implement CubeX on both Linux and Windows. This section introduces the implementation of CubeX on Linux; the Windows version has a similar internal design and is omitted here for conciseness.

6.1 Overview
CubeX is prototyped by modifying MemCube to adapt to arbitrary cube-based networks and
adding a recovery optimization module.
Compared with the original MemCube implementation [53], in our CubeX prototype, we re-
place the BCube-specific failure recovery mechanism with the generalized cube-oriented design.
The primary data and backup data are scattered (globally) across all servers using aggressive data
partitioning [14] for reconstructing lost data in parallel, and the failure detection and recovery of
each server are constrained to that server’s local neighborhood.
Recovery server remapping is implemented by letting the backup server compute the subspace (that belonged to the failed recovery server) and notify the primary server of the remapping result (i.e., the new recovery server). Asynchronous backup server recovery is implemented by restoring data to the backup/standby servers after the KV service is resumed.
CubeX adopts a chain-based structure for memory management on individual storage servers. We use a cache chain to group slabs into caches as well as a hash chain for fast KV access. As shown in Figure 2, we use the slab-based mechanism [3, 8] to alleviate fragmentation problems


Fig. 3. Optimizing data path in processing a write.

caused by frequent alloc/free operations in memory management inside a server. It pre-allocates a


large amount of memory during initialization, and uses fixed-size memory chunks (Cache_chain)
with a series of predetermined size classes (Cache[0], Cache[1], · · · ). We design a hash ta-
ble (Hash_chain) to quickly locate the keys, and implement a reusable stack (Reuse_stack) to
efficiently collect and reuse obsolete memory.
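The sketch below mirrors the layout of Figure 2 at a high level: size-classed caches of pre-allocated fixed-size chunks, a hash chain for lookups, and a reuse stack for recycling freed chunks. The Python class, the particular size classes, and the byte-array chunks are our illustration of the Cache_chain/Hash_chain/Reuse_stack design, not CubeX's actual code:

class SlabStore:
    # Size-classed chunk allocator with a hash index and a reuse stack (sketch).

    SIZE_CLASSES = [64, 256, 1024, 4096, 16384, 65536, 1 << 20]   # Cache[0..6]

    def __init__(self, chunks_per_class=64):
        # Cache_chain: pre-allocated free chunks, grouped by size class.
        self.free_chunks = {c: [bytearray(c) for _ in range(chunks_per_class)]
                            for c in self.SIZE_CLASSES}
        self.hash_chain = {}                                  # key -> (class, length, chunk)
        self.reuse_stack = {c: [] for c in self.SIZE_CLASSES} # Reuse_stack

    def _class_for(self, n):
        return next(c for c in self.SIZE_CLASSES if c >= n)

    def set(self, key, value):
        c = self._class_for(len(value))
        chunk = (self.reuse_stack[c].pop() if self.reuse_stack[c]
                 else self.free_chunks[c].pop())
        chunk[:len(value)] = value
        self.hash_chain[key] = (c, len(value), chunk)

    def get(self, key):
        c, n, chunk = self.hash_chain[key]
        return bytes(chunk[:n])

    def delete(self, key):
        c, _, chunk = self.hash_chain.pop(key)
        self.reuse_stack[c].append(chunk)     # recycle the chunk instead of freeing it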

6.2 Surrogate Writing


CubeX realizes two approaches for processing a normal write. In the first (normal) approach, when performing a write the primary storage server writes the data to its memory and then issues backup operations to the backup server and the f − 1 shadow backup servers. The backup servers acknowledge to the primary after storing the backup replicas in their memory (not on disks), and then the primary server acknowledges to its client. The backup servers adopt log-structured storage [44] to simplify persistence.
Considering that the backup and shadow backup servers are 2 hops away from the primary servers and all backup data passes through intermediate recovery servers, an alternative data path (Figure 3) is proposed in which the primary server returns to the client as soon as it receives acknowledgements from the intermediate recovery servers. Here the recovery servers act as backup surrogates and ensure that the backup data is eventually written to the disks of their backup and shadow backup servers.
This approach reduces the response time of write operations, which may become non-trivial in
low-latency Ethernet. Compared with the first approach, a potential danger of surrogate writing
in CubeX is that the intermediate recovery servers may fail before the write is acknowledged by
the backup and shadow backup servers (Backup_finish_ack() in Figure 3).
To address this problem, we design (but have not yet implemented) a straightforward mechanism for improving durability: the write is simultaneously performed to several recovery servers and is discarded after being acknowledged by the backup and shadow backup servers. We will incorporate this approach into CubeX’s implementation in the future.
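A sketch of the surrogate path is shown below: the recovery server buffers the write (at which point the primary can already acknowledge the client) and then forwards it to the 2-hop backup and shadow backup servers, reporting the Backup_finish_ack of Figure 3 once they have it. The class and method names are illustrative:

class SurrogateRecoveryServer:
    # Buffers a write on the 1-hop path and forwards it to the 2-hop backups.

    def __init__(self, backup, shadows):
        self.backup = backup          # dominant backup server (2 hops from primary)
        self.shadows = shadows        # f - 1 shadow backup servers
        self.inflight = {}            # buffered writes not yet confirmed by backups

    def surrogate_write(self, key, value, on_backup_finish_ack):
        # Once the data is buffered here, the primary may answer the client;
        # the surrogate remains responsible until the backups confirm it.
        self.inflight[key] = value
        for target in [self.backup] + list(self.shadows):
            target.buffer(key, value)           # persisted later via the backup log
        del self.inflight[key]
        on_backup_finish_ack(key)               # Backup_finish_ack in Figure 3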

6.3 Global Master


We implement a logically single global master that manages the configuration information, aggregates failure reports from recovery servers, identifies switch failures, and starts and coordinates failure recoveries. Like RAMCloud [44], the global master assigns tablets (consecutive key ranges) to primary servers. The master maintains the mapping between tablets and primary servers, and the client library maintains a cache of this mapping, retrieving it the first time it reads/writes a tablet. Normal reads/writes are performed without querying the master. If the cache becomes stale due to failures, then the client will ask the master for the up-to-date mapping.
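The client-side behavior can be summarized by the sketch below; the tablet-selection rule and the lookup/get calls are placeholders, since the article does not specify them:

class ClientLibrary:
    # Caches the tablet -> primary-server mapping; asks the master only when needed.

    def __init__(self, master, num_tablets):
        self.master = master
        self.num_tablets = num_tablets
        self.cache = {}                         # tablet index -> primary server

    def _tablet(self, key):
        return hash(key) % self.num_tablets     # placeholder key -> tablet rule

    def read(self, key):
        t = self._tablet(key)
        if t not in self.cache:
            self.cache[t] = self.master.lookup(t)      # first access to this tablet
        try:
            return self.cache[t].get(key)
        except ConnectionError:                        # mapping went stale after a failure
            self.cache[t] = self.master.lookup(t)
            return self.cache[t].get(key)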


We use ZooKeeper [33] to realize the high availability and durability of the master. We add
a /CubeX znode in the ZooKeeper namespace, which contains sub znodes including /ac_master
(active master), /pri_servers (primary servers list), /keys (mapping from key space onto primary
servers), and so on. Several master instances compete for a single lease to ensure that there is one active instance most of the time. The other instances are in standby mode. The active instance periodically increases a version number to signal its liveness. After the active instance fails or disconnects, some standby instance will win the lease and become active to provide services.
Both the clients and the storage servers use the /ac_master znode to locate the active master. If
a server fails concurrently with a master failure, e.g., a recovery server R cannot get a response from the master, then R will ask the ZooKeeper service to locate the new active instance and then
report the failure to it. Afterwards, the normal recovery procedure is performed.

7 EVALUATION
7.1 Configuration
We use 64 servers and 8 Pronto 48-port 1GbE switches to construct the BCube network (for MemCube) and the GHC networks (for CubeX).
We build a two-level BCube(8,1) network, in which four switches together act as 16 virtual 8-port switches (8 virtual switches at each level). Note that the BCube(8,1) network is also an 8 × 8 GHC network. We also use 8 switches to emulate 32 4-port virtual switches and 64 2-port virtual switches and construct a 4 × 4 × 2 × 2 GHC network, which can be viewed as duplicating the 4 × 4 × 2 GHC network (Figure 1) in a fourth dimension.
Each server has six Intel 2.5GHz cores, 12 1TB disks, and two 1GbE 2-port NICs. Sixteen of
the 64 servers have 32GB RAM and the others have 16GB. We have one additional server to run
the client, and three others to run one active and two standby masters for ZooKeeper. The four
additional servers connect to the testbed using a 1GbE control network. The asynchronous backup
and recovery mechanism is adopted as introduced in Section 4.3, and a fourth standby copy is
possibly stored on a backup server’s disk.
Our experiments in this section mainly answer the following questions:

• How fast can CubeX handle normal I/O (write) requests, with/without surrogate writing
(Section 7.2)?
• How much does CubeX outperform MemCube in single storage server recovery, with asyn-
chronous backup server recovery (Section 7.3) and recovery server remapping (Section 7.4)?
• Does CubeX preserve its property for (simulated) large-scale cube-based networks
(Section 7.5)?
• Does CubeX generalize well for different kinds of GHC (Section 7.6)?
• And how well does CubeX perform under multiple failures (Section 7.7)?

7.2 Throughput
Our first experiment evaluates the throughput of a single CubeX server (with/without surrogate
writing) and compares CubeX with MemCube. We run m processes for CubeX and MemCube, respectively, on a server, each being assigned a separate and consecutive sub key space. We adapt the memcached benchmark [12] to CubeX, where the benchmark client has parallel connections asynchronously performing write operations of the form "set key value." The
size of the key-value pairs is 1KB and the key is a random 15-byte string. We measure the number
of set requests handled per second as a function of the number of processes (m) running on the
server. Each point is an average of 20 runs.


Fig. 4. Throughput. Each service process corresponds to 50 connections from the benchmark client.

Fig. 5. Throughput. Each service process corresponds to 100 connections from the benchmark client.

Figures 4 and 5 show the evaluation results, where each service process corresponds to 50 and 100 connections from the benchmark client, respectively. The differences from the mean are less than 5% and thus are omitted for clarity. The results show that the I/O throughput of CubeX (without surrogate) is very close to that of MemCube, demonstrating that the asynchronous backup mechanism introduced in Section 4.3 does not affect normal I/O. This is because for RAM-based key-value stores the normal I/O performance is bounded by CPU and network latency [44] and the network bandwidth never becomes the bottleneck. The small variance mainly comes from uncertain conditions such as CPU, I/O, and network usage.
In contrast, CubeX with surrogate achieves much higher throughput than both CubeX without
surrogate and MemCube, which shows that the 1-hop writing to the surrogate recovery servers is
more efficient than the 2-hop writing to the backup servers. This is mainly because the software-
based routing (on intermediate nodes) in 2-hop writing suffers from high CPU overhead and pro-
cessing latency [37]. Considering that the client connects to the testbed using a 1GbE control network (discussed in Section 7.1), the available bandwidth (1Gbps) is not fully utilized in this test (1KB × 90,000/s ≈ 720Mbps), showing that the throughput is bounded by the forwarding performance at the


Fig. 6. Server failure recovery in CubeX and MemCube (8 × 8 GHC). CubeX uses two mechanisms, "async" and "async + remap," respectively.

primary/recovery servers (which is determined by the underlying BCube implementation [28]),


rather than the network bandwidth.
We also emulate a switch failure by disabling the corresponding ports of a virtual switch and
evaluate the throughput of a CubeX server. The throughput has little change compared with that
before the switch failure, which is not depicted in Figures 4 or 5. This proves that CubeX gracefully
handles switch failures by exploiting the multipath property of cube-based networks.

7.3 Failure Recovery


The original MemCube handles a backup server failure after all other failures are recovered. If we include the backup server recovery time in the overall recovery time, then the recovery performance of the original MemCube design clearly degrades (as discussed in Section 2).
We first evaluate the recovery procedure of both CubeX and MemCube on an 8 × 8 GHC, taking the
backup server failure recovery into account. In this subsection CubeX adopts the asynchronous
recovery mechanism (Section 4.3) but does not incorporate the recovery server remapping mechanism
(Section 5). The replication factor is 1, i.e., there is one primary copy and one backup copy
for each piece of data. We fill each of the 64 servers in BCube(8, 1) with 12GB of data and then
make one server fail. The ping period is 300ms and recovery is started if three consecutive pings
are missed. We concurrently perform all the recoveries for the failed server's
primary/recovery/backup roles.
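The detection rule used in this experiment (a 300ms ping period, with recovery triggered after three consecutive missed pings) can be sketched as follows; this is only an illustration of the configured policy, and the send_ping/start_recovery callbacks are hypothetical, not CubeX's actual interface.

```python
import time

PING_PERIOD_S = 0.3     # ping period used in the experiment (300ms)
MISS_THRESHOLD = 3      # consecutive misses before recovery starts

def monitor_neighbor(neighbor, send_ping, start_recovery):
    """Ping a 1-hop neighbor periodically; trigger recovery after 3 misses."""
    misses = 0
    while True:
        if send_ping(neighbor, timeout=PING_PERIOD_S):
            misses = 0                    # any reply resets the counter
        else:
            misses += 1
            if misses >= MISS_THRESHOLD:  # worst case ~0.9s after the failure
                start_recovery(neighbor)
                return
        time.sleep(PING_PERIOD_S)
```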
Figure 6 demonstrates the recovery procedure, where for clarity the error bars are depicted only
at one-second intervals. CubeX with asynchronous recovery significantly outperforms MemCube (it
is about 40% faster), and consequently CubeX provides higher availability and durability. This is
because it hides the bandwidth-consuming backup server recovery by using standby servers, without
affecting the overall durability. For MemCube, since the backup server recovery is performed
concurrently with the primary recovery, the recovery bandwidth per server is limited to only half
the available bandwidth.
However, in this experiment the recovery bottleneck is the aggregate inbound network bandwidth of
the recovery servers, because we use a relatively low-bandwidth (1GbE) network. We expect roughly
10× faster recovery when using a 10GbE network.


Table 1. Simulated Recovery Time (in Seconds)

Topology             8 × 8    8 × 8 × 8    16 × 16 × 16    24 × 16 × 16
Number of servers    64       512          4,096           6,144
CubeX                10.5     3.6          1.8             1.2
MemCube              23.7     10.2         4.8             N/A
RAMCloud             216.5    247.9        320.4           368.8

7.4 Remapping
In this subsection, we evaluate the effect of CubeX's recovery server remapping mechanism. As
introduced in Section 5, when a recovery server fails in our test, for each affected primary
server P the affected sub key space is re-designated to P's other recovery servers without any
data transfer.
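A minimal sketch of this re-designation, assuming each primary server keeps a table mapping sub key spaces to recovery servers; the data structure and names are illustrative, not CubeX's actual interface.

```python
def remap_failed_recovery_server(sub_space_to_recovery, failed):
    """Re-designate the sub key spaces of a failed recovery server to the
    primary's remaining recovery servers; no backup data is moved."""
    survivors = sorted({r for r in sub_space_to_recovery.values() if r != failed})
    affected = [s for s, r in sub_space_to_recovery.items() if r == failed]
    for i, sub_space in enumerate(affected):
        # Only the mapping changes; the backup copies stay on their disks.
        sub_space_to_recovery[sub_space] = survivors[i % len(survivors)]
    return sub_space_to_recovery
```

For primary server 01 on the 8 × 8 GHC, for example, the sub key spaces previously assigned to the failed recovery server 00 would simply be spread over the remaining 13 recovery servers.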
The configuration is the same as the previous experiments, and the result is also depicted in
Figure 6. Suppose that server 00 fails as a recovery server on the 8 × 8 GHC network. Consider
one of its 14 primary servers (e.g., 01), which originally has 14 recovery servers and 49 backup
servers. Without remapping, the backup data on the old backup servers (10, 20, . . . , 70) of the
failed recovery server 00 would have to be equally reassigned to the backup servers of 01's
remaining 13 recovery servers (11, 21, . . . , 71, 02, 03, . . . , 07), to minimize the recovery
time of the primary server 01's (possible) future failure. For example, an old backup server 10
originally stores 12/49GB of backup data for 01, half of it (β = 6/49GB) corresponding to the
failed recovery server 00 (and half corresponding to the other recovery server 11). To recover
from the failure of 01's recovery server 00 without remapping, 10 would retain β/13GB of backup
data (for recovery server 11), transfer 2β/13GB of backup data to 12 (for recovery servers 11 and
02), transfer 2β/13GB of backup data to 13 (for recovery servers 11 and 03), . . . , and transfer
2β/13GB of backup data to 17 (for recovery servers 11 and 07). The other six backup servers
(20, 30, . . . , 70) corresponding to 00 are processed in the same way, so in total
γ = 2β/13 × 6 × 7GB of backup data would be transferred for primary server 01 of the failed
recovery server 00. Since the failed recovery server 00 has 14 primary servers, the remapping
mechanism saves 14γ ≈ 11.08GB of data transfer in total.
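The arithmetic of this example can be reproduced with a short script; this is a sketch that simply restates the numbers above for the 8 × 8 GHC.

```python
# Data-transfer savings of remapping for the 8x8 GHC example above.
data_per_server_gb = 12.0
beta = (data_per_server_gb / 49) / 2        # backup data on one old backup server (e.g., 10)
                                            # that corresponds to the failed recovery server 00
per_old_backup_gb = 6 * (2 * beta / 13)     # each old backup server sends 2*beta/13 GB
                                            # to each of 6 peers (12, 13, ..., 17)
gamma = 7 * per_old_backup_gb               # 7 old backup servers (10, 20, ..., 70) per primary
saved_gb = 14 * gamma                       # the failed recovery server 00 has 14 primary servers
print(f"data transfer avoided by remapping: {saved_gb:.2f} GB")   # ~11.08 GB
```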
CubeX (async) is about 20% slower than CubeX (async + remap), which is mainly because in
CubeX (async) the recovery of recovery servers competes with the primary server recovery for
CPUs, although they do not contend for the recovery bandwidth (because they occur in different
directions [53]).

7.5 Large-Scale Simulation


We also evaluate the recovery of CubeX at larger GHC scales through simulations, and compare it
with MemCube (on BCube(n, k)) and RAMCloud (on a non-blocking FatTree). Since the bottleneck is
the bandwidth of the recovery servers, we use NS2 to simulate recovering 12GB of data from the
backup servers to their corresponding recovery servers for a failed primary server.
The result is depicted in Table 1, where CubeX significantly outperforms MemCube and RAMCloud.
For a 16 × 16 × 16 GHC with 4,096 servers, CubeX recovers 12GB of data within 1.8s; and for a
24 × 16 × 16 GHC with 6,144 servers, CubeX recovers 12GB of data within 1.2s. In contrast,
MemCube spends 4.8s to recover in the 4,096-node cluster and cannot support the 6,144-node
cluster at all, because that network is not a standard hypercube. RAMCloud is the worst and takes
two orders of magnitude longer to recover than CubeX in the two large-scale (4,096- and
6,144-node) clusters, because its random recovery pattern results in severe recovery traffic
congestion.
Table 1 shows that CubeX outperforms MemCube by about 3× when the cluster has hundreds
or thousands of servers. This is crucial for distributed RAM storage systems both for durability


Fig. 7. CubeX Recovery for various GHCs.

and for availability. First, as discussed in Reference [44], in a 10,000-server RAM storage system
with 3× replication, two failures/year/server with a Poisson distribution, the one-year data loss
probability is about 10^−6 when the failure recovery time is one second; and the probability increases
to 10^−4 when the recovery completes in 10s. Second, as discussed in Reference [7], there are about
1,000 server failures per year in a normal-sized cluster of hundreds of servers. Consequently, for
the 512-server cluster in Table 1, CubeX could achieve a high availability of about four nines
(1 − (3.6 × 1,000)/(365 × 24 × 3,600) ≈ 99.99%), while MemCube has a relatively low availability of
about three nines.
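The availability figures can be reproduced from the recovery times in Table 1 and the roughly 1,000 failures/year estimate from Reference [7]; this is a sketch that assumes the system is unavailable only for the duration of each recovery.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
FAILURES_PER_YEAR = 1000            # rough estimate from Reference [7]

def availability(recovery_time_s):
    """Fraction of the year the system is available, given one recovery per failure."""
    return 1 - FAILURES_PER_YEAR * recovery_time_s / SECONDS_PER_YEAR

print(f"CubeX   (3.6s recovery):  {availability(3.6):.4%}")    # ~99.9886%, about four nines
print(f"MemCube (10.2s recovery): {availability(10.2):.4%}")   # ~99.9677%, about three nines
```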
Note that it is difficult for CubeX to recover several tens or hundreds of GB of data in less than
1s, since both failure detection and recovery preparation require at least a few hundred
milliseconds, which are omitted in our simulation.

7.6 Generalized Hypercube


In this section, we evaluate CubeX's recovery on additional generalized hypercube topologies. The
configuration is the same as in the previous experiments, except that we use both a 4 × 4 × 2 × 2
generalized hypercube and an 8 × 8 generalized hypercube. The result is depicted in Figure 7.
From this figure, we can see that CubeX on the 4 × 4 × 2 × 2 GHC network achieves even higher
aggregate recovery bandwidth than on the 8 × 8 GHC network, although each primary server has
fewer recovery servers (8 vs. 14). This is because on the 8 × 8 GHC network CubeX uses one NIC
port per recovery server, while on the 4 × 4 × 2 × 2 GHC network CubeX uses three ports per
server and thus triples the bottleneck bandwidth for each recovery server.
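The recovery-server counts quoted above follow from the generalized hypercube topology: a node in a d1 × d2 × · · · × dk GHC has Σi (di − 1) one-hop neighbors, and (as in the example of Section 7.4) a primary server's one-hop neighbors act as its recovery servers. A quick check of the two figures, as a sketch under that reading:

```python
def recovery_servers_per_primary(dims):
    """One-hop neighbors of a node in a d1 x d2 x ... x dk generalized hypercube."""
    return sum(d - 1 for d in dims)

print(recovery_servers_per_primary([8, 8]))        # 14
print(recovery_servers_per_primary([4, 4, 2, 2]))  # 8
```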

7.7 Multiple Failures Recovery


In this section, we show how CubeX recovers from a primary server failure that occurs right after
a recovery server failure. In the 8 × 8 GHC network, we first fail a recovery server (01) of a
primary server (00). As described in Section 5, this recovery server failure can be recovered
instantaneously by remapping the affected backup servers to new recovery servers. We then make
the primary storage server (00) fail as well. Note that at this point half of 00's backup servers
(11 ∼ 71) have only one recovery server.

Fig. 8. Second failure recovery of a directly connected neighbor.

The recovery process for the second failure (00) is depicted in Figure 8, where "1-connection"
means that each backup-recovery pair has exactly one TCP connection, while "2-connection" means
that if a backup server has only one recovery server R, then it has two recovery connections to R.
At the beginning, both "1-connection" recovery and "2-connection" recovery have almost the same
aggregate recovery bandwidth. After 6s, however, only "2-connection" can keep up the recovery
speed, while "1-connection" experiences an obvious degradation. This is because in "1-connection"
recovery some backup servers finish their recoveries after this point, leaving part of the network
bandwidth unused. This figure shows that our simple method (discussed in Section 4.4) for handling
backup server heterogeneity achieves a near-optimal result for generalized hypercubes.
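A minimal sketch of the connection-assignment rule compared in Figure 8; this is illustrative only, and the rationale in the comment paraphrases the discussion above.

```python
def plan_recovery_connections(backup_to_recovery, policy="2-connection"):
    """Return (backup, recovery, connections) triples for one primary's recovery.
    Under "2-connection", a backup server that has only one recovery server opens
    two TCP connections to it, so it keeps a larger share of that recovery
    server's inbound bandwidth and the backups finish at roughly the same time."""
    plan = []
    for backup, recovery_servers in backup_to_recovery.items():
        conns = 2 if policy == "2-connection" and len(recovery_servers) == 1 else 1
        for recovery in recovery_servers:
            plan.append((backup, recovery, conns))
    return plan
```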

8 RELATED WORK
In this section, we briefly discuss related work on RAM- and disk-based storage and computing.
The design of CubeX is inspired by large-scale RAM storage systems (like RAMCloud [44] and
MemCube [53]), remote-RAM-based computing techniques (like FaRM [20], Trinity [48], and
MemC3 [21]), and other disk-based storage systems (like Bigtable [14], GFS [24], and URSA [2]).

8.1 Large-Scale RAM Storage Systems


RAMCloud [44] is a RAM-based storage system that scatters backup data across thousands of disks.
It provides scalable and high-performance key-value storage by employing randomized techniques
and managing the system in a scalable and decentralized fashion. It uses a log-structured represen-
tation for key-value data, both in RAM and on disks. RAMCloud relies on the high-performance
but expensive InfiniBand networks to achieve fast failure recovery, and thus cannot be applied
to traditional Ethernet-based cloud storage systems (as demonstrated by the evaluation result in
Section 7.5).
MemCube [53] is focused on Ethernet-based storage and identifies network-related challenges
for RAM storage, including false detection due to transient network problems, traffic congestion
during the recovery, and top-of-rack switch failures. MemCube proposes the CubicRing structure
and exploits network proximity to address the challenges. However, MemCube can only be applied
to the BCube network, and its design is suboptimal due to several problems (e.g., server recovery
contention).
FAWN [11] couples embedded CPUs to flash memory and takes a balance between computa-
tion and I/O to enable efficient data storage. FlashStore [18] uses flash memory as a non-volatile
cache, organizing data in a log-structure and exploiting flash memory’s random write performance.


SkimpyStash [19] is a hybrid key-value store on flash memory and RAM, which uses a hash table
directory in RAM to index key-value pairs stored on flash. HashCache [10] also targets combined
RAM-and-flash storage systems, leveraging an efficient structure to lower the amortized cost of
insertions and updates.
CubeX is inspired by RAMCloud and MemCube. It leverages the glocality (= globality + lo-
cality) to adapt to any cube-based networks (like generalized hypercube (GHC), MDCube [52],
k-ary n-cube [28], and hyperbanyan [22]) and realizes cross-layer optimizations, such as recov-
ery server remapping, asynchronous backup server recovery, and surrogate backup writing, to
improve RAM-based storage performance in both normal I/O operations and failure recovery.

8.2 Remote RAM Based Computing


FaRM [20] is a memory distributed computing system that exploits remote direct memory access
(RDMA) to improve both latency and throughput. FaRM exposes the memory of machines in the
cluster as a shared address space. FaRM supports its applications to use transactions to allocate,
read, write, and free objects in the address space with location transparency. FaRM and CubeX
use remote RAM in different ways: FaRM organizes remote RAM to provide a large, shared memory
address view, while CubeX uses it as persistent storage.
Trinity [48] is a general purpose graph engine over a distributed memory cloud. Trinity sup-
ports efficient graph exploration and parallel computing by leveraging optimized memory storage
management and network communication. Trinity efficiently enables both online query process-
ing and offline analytics on large graphs. Since CubeX is orthogonal to its upper layer applications,
it is also possible to deploy Trinity on CubeX.
MemC3 [21] is a compact and concurrent MemCache with efficient caching and smart hashing.
It provides a set of improvements to Memcached that substantially improve both the memory effi-
ciency and throughput, mainly including optimistic cuckoo hashing, compact LRU-approximating
eviction algorithm, and optimistic locking. MICA [36] is a scalable in-memory key-value store that
handles 65.6 to 76.9 million operations per second on a single multi-core machine. MICA (i) en-
ables parallel access to partitioned data, (ii) maps client requests directly to specific CPU cores at
the NIC level and adopts a light-weight networking stack to bypass the kernel, and (iii) designs
novel data structures (including circular logs, lossy concurrent hash indexes, and bulk chaining) to
handle read-/write-intensive workloads at low overhead. CubeX is implemented on Memcached, so it
is natural to incorporate the optimizations of MemC3 and MICA into CubeX.

8.3 Other Distributed Storage Systems


Bigtable [14] realizes fast crash recovery by using aggressive data partitioning. Bigtable also
uses a log-structured approach for its metadata and buffers new data in memory. GFS [24] serves a
role for Bigtable similar to that of the backup servers in CubeX. The Log-structured File System
(LFS) [46] was the first to propose appending updates sequentially at the end of files to improve
write performance. The Zebra file system [31] integrates RAID and log-based file systems, striping
a file log across a RAID array; Zebra provides no mechanisms for background journal replay. The
Google File System (GFS) [24] adopts the log structure to accelerate its write-once, read-many
workloads. Adaptive LFS [40] improves the performance of log-structured file systems by leveraging
adaptive methods based on workload characteristics. OptFS [15] decouples durability from ordering
and achieves optimistic crash consistency for a journaling file system. Gecko [49] designs a
chained journal to reduce I/O contention and provide contention-oblivious disk arrays. BlueSky [50]
utilizes NFS to provide a cloud-backed file system that allows users to leverage third-party file
storage clusters. Compared with the RAM-based CubeX key-value store, these distributed file/KV
systems are orders of magnitude slower in both IOPS and latency.


Distributed block storage systems provide a block interface [32, 34] to remote clients via pro-
tocols like iSCSI [9] and AoE (ATA-over-Ethernet) [13]. For example, Petal [34] uses redundant
network storage to provide virtual disks. Salus [51] leverages HDFS [4] and HBase [30] to provide
virtual disk service with ordered-commit semantics, using a two-phase commit protocol. It also
provides prefix semantics when failures occur. pNFS (parallel NFS) [32] exports a block/object/file
interface to local cloud storage. Blizzard [41] is built on FDS [43] and exposes parallelism to virtual
disks with crash consistency guarantees. Blizzard leverages a full bisection bandwidth network to
stripe data aggressively, and utilizes delayed durability semantics to increase the rate at which
clients can issue writes while still achieving crash consistency. URSA [2] is an SSD-HDD-hybrid
block storage system that provides virtual disks, which can be mounted like normal physical ones.
It collaboratively stores primary data on SSDs and backup chunks on HDDs, using a journal as a
buffer to bridge the performance gap between SSDs and HDDs. Block storage systems could be used
as secondary storage for CubeX.
F4 [42] designs a Binary Large OBject (blob) storage system for Facebook's corpus of photos and
videos that need to be reliably stored and quickly accessible. It uses temperature zones to
identify hot/warm blobs and effectively reduces the effective replication factor. FDS [43] is a
locality-oblivious blob store built on a full-bisection-bandwidth FatTree network. It multiplexes
an application's aggregate I/O bandwidth across the available throughput. FDS supports fast
failure recovery by simultaneously performing recovery across the network. The design of CubeX
could be ported to these blob storage systems for high (blob) read/write performance.

9 CONCLUSION
CubeX is a network-aware key-value store that supports fast failure recovery on cube-based net-
works. At the core of CubeX is to leverage the glocality (= globality + locality) of cubes. CubeX
also designs cross-layer optimizations for achieving high throughput and low recovery time. It
exploits the globality of cubes to scatter backup data across a large number of disks, and exploits
the locality of cubes to restrict all recovery traffic within the local range.
We plan to improve CubeX in the following aspects. First, since low latency [47] is one primary
advantage of RAM-based storage, in the future CubeX may require a low-latency Ethernet
infrastructure with RTTs at the 10μs level. Second, as high-bandwidth (40 ∼ 100Gbps) networks [53] and
(10 ∼ 30Gbps) SSDs [6] become practical, we will study how to coordinate the RAM, backup disks,
networks, and CPUs to collaboratively achieve even higher recovery speed. Third, we will in-
corporate recent advances in failure detection, such as latency measurements between servers
in Pingmesh [29] and guided probes for potential failures in Everflow [54], to improve CubeX’s
failure detection. Fourth, some practical issues also need to be considered, for example, design-
ing a cleaner for the logging system, replacing replication with erasure coding, and incorporating
superscalar communication [35].

REFERENCES
[1] AWS Team. Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. Retrieved
from http://aws.amazon.com/message/65648/.
[2] NiceX Lab. Ursa Block Store. Retrieved from http://nicexlab.com/ursa/.
[3] RedisLabs. Redis Official Website. Retrieved from http://redis.io/.
[4] Dhruba Borthakur. HDFS Architecture Guide. Retrieved from https://hadoop.apache.org/docs/r1.2.1/hdfs_design.
html.
[5] SOSP 2011 PC meeting. SOSP 2011 Reviews and Comments on RAMCloud. https://ramcloud.stanford.edu/wiki/pages/
viewpage.action?pageId=8355860SOSP-2011-Reviews-and-comments-on-RAMCloud.
[6] Josh Norem. Samsung SSD 960 EVO (500GB). Retrieved from https://www.pcmag.com/review/358847/samsung-
ssd-960-evo-500gb.


[7] Rich Miller. Failure Rates in Google Data Centers. Retrieved from http://www.datacenterknowledge.com/archives/
2008/05/30/failure-rates-in-google-data-centers/.
[8] Dormando. Memcached Official Website. Retrieved from http://www.memcached.org/.
[9] Stephen Aiken, Dirk Grunwald, Andrew R. Pleszkun, and Jesse Willeke. 2003. A performance analysis of the iSCSI
protocol. In Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies
(MSST’03). IEEE, 123–134.
[10] Ashok Anand, Chitra Muthukrishnan, Steven Kappes, Aditya Akella, and Suman Nath. 2010. Cheap and large CAMs
for high performance data-intensive networked systems. In Proceedings of the USENIX Symposium on Networked
Systems Design and Implementation (NSDI’10). USENIX Association, 433–448. Retrieved from http://www.usenix.org/
events/nsdi10/tech/full_papers/anand.pdf.
[11] David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan.
2009. FAWN: A fast array of wimpy nodes. In Proceedings of the ACM Symposium on Operating Systems Principles
(SOSP’09), Jeanna Neefe Matthews and Thomas E. Anderson (Eds.). ACM, 1–14. Retrieved from http://dblp.uni-trier.
de/db/conf/sosp/sosp2009.html#AndersenFKPTV09.
[12] Antirez. [n.d.]. An update on the memcached/redis benchmark. Retrieved from http://antirez.com/post/update-
on-memcached-redis-benchmark.html.
[13] Ed L. Cashin. 2005. Kernel korner: Ata over ethernet: Putting hard drives on the lan. Linux J. 2005, 134 (2005), 10.
[14] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar
Chandra, Andrew Fikes, and Robert Gruber. 2006. Bigtable: A distributed storage system for structured data. In Pro-
ceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI’06). 205–218.
[15] Vijay Chidambaram, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-
Dusseau. 2013. Optimistic crash consistency. In Proceedings of the 24th ACM Symposium on Operating Systems Prin-
ciples. ACM, 228–243.
[16] Mosharaf Chowdhury, Srikanth Kandula, and Ion Stoica. 2013. Leveraging endpoint flexibility in data-intensive clus-
ters. In Proceedings of the Association for Computing Machinery’s Special Interest Group on Data Communications
(SIGCOMM’13), Dah Ming Chiu, Jia Wang, Paul Barford, and Srinivasan Seshan (Eds.). ACM, 231–242.
[17] Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, and Ion Stoica. 2011. Managing data transfers in
computer clusters with orchestra. In ACM SIGCOMM Computer Communication Review, Vol. 41. ACM, 98–109.
[18] Biplob K. Debnath, Sudipta Sengupta, and Jin Li. 2010. FlashStore: High throughput persistent key-value
store. Proc. VLDB Endow. 3, 2 (2010), 1414–1425. Retrieved from http://dblp.uni-trier.de/db/journals/pvldb/pvldb3.
html#DebnathSL10.
[19] Biplob K. Debnath, Sudipta Sengupta, and Jin Li. 2011. SkimpyStash: RAM space skimpy key-value store on flash-
based storage. In Proceedings of the SIGMOD Conference, Timos K. Sellis, Rene J. Miller, Anastasios Kementsiet-
sidis, and Yannis Velegrakis (Eds.). ACM, 25–36. Retrieved from http://dblp.uni-trier.de/db/conf/sigmod/sigmod2011.
html#DebnathSL11.
[20] Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. 2014. FaRM: Fast remote memory.
In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI’14). 401–414.
[21] Bin Fan, David G. Andersen, and Michael Kaminsky. 2013. MemC3: Compact and concurrent memcache with dumber
caching and smarter hashing. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Imple-
mentation (NSDI’13). 371–384.
[22] Clayto S. Ferner and Kyungsook Y. Lee. 1992. Hyperbanyan networks: A new class of networks for distributed mem-
ory multiprocessors. IEEE Trans. Comput. 41, 3 (1992), 254–261.
[23] Armando Fox. 2002. Toward recovery-oriented computing. In Proceedings of the Conference on Very Large Data Bases
(VLDB’02). 873–876.
[24] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. In Proceedings of the ACM
Symposium on Operating Systems Principles (SOSP’03). 29–43.
[25] Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. 2011. Understanding network failures in data centers: Mea-
surement, analysis, and implications. In Proceedings of the Association for Computing Machinery’s Special Interest
Group on Data Communications (SIGCOMM’11), Srinivasan Keshav, Jörg Liebeherr, John W. Byers, and Jeffrey C.
Mogul (Eds.). ACM, 350–361.
[26] Jim Gray and Gianfranco R. Putzolu. 1987. The 5 minute rule for trading memory for disk accesses and the 10 byte
rule for trading memory for CPU time. In Proceedings of the Association for Computing Machinery Special Interest
Group on Management of Data, Umeshwar Dayal and Irving L. Traiger (Eds.). ACM Press, 395–398.
[27] Albert G. Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David
A. Maltz, Parveen Patel, and Sudipta Sengupta. 2011. VL2: A scalable and flexible data center network. Commun. ACM
54, 3 (2011), 95–104.
[28] Chuanxiong Guo, Guohan Lu, Dan Li, Haitao Wu, Xuan Zhang, Yunfeng Shi, Chen Tian, Yongguang Zhang, and
Songwu Lu. 2009. BCube: A high performance, server-centric network architecture for modular data centers. In
Proceedings of the Association for Computing Machinery's Special Interest Group on Data Communications
(SIGCOMM'09). 63–74.
[29] Chuanxiong Guo, Lihua Yuan, Dong Xiang, Yingnong Dang, Ray Huang, Dave Maltz, Zhaoyi Liu, Vin Wang, Bin Pang,
Hua Chen et al. 2015. Pingmesh: A large-scale system for data center network latency measurement and analysis. In
ACM SIGCOMM Computer Communication Review, Vol. 45. ACM, 139–152.
[30] Tyler Harter, Dhruba Borthakur, Siying Dong, Amitanand Aiyer, Liyin Tang, Andrea C. Arpaci-Dusseau, and Remzi
H. Arpaci-Dusseau. 2014. Analysis of hdfs under hbase: A facebook messages case study. In Proceedings of the 12th
USENIX Conference on File and Storage Technologies (FAST’14). 199–212.
[31] John H. Hartman and John K. Ousterhout. 1995. The Zebra striped network file system. ACM Trans. Comput. Syst. 13,
3 (1995), 274–310.
[32] Dean Hildebrand and Peter Honeyman. 2005. Exporting storage systems in a scalable manner with pNFS. In Pro-
ceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST’05). IEEE,
18–27.
[33] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. 2010. ZooKeeper: Wait-free coordination for
Internet-scale systems. In Proceedings of the USENIX Annual Technical Conference (ATC’10). 1–14.
[34] Edward K. Lee and Chandramohan A. Thekkath. 1996. Petal: Distributed virtual disks. In ACM SIGPLAN Notices,
Vol. 31. ACM, 84–92.
[35] HuiBa Li, ShengYun Liu, YuXing Peng, DongSheng Li, HangJun Zhou, and XiCheng Lu. 2010. Superscalar communi-
cation: A runtime optimization for distributed applications. Sci. China Info. Sci. 53, 10 (2010), 1931–1946.
[36] Hyeontaek Lim, Dongsu Han, David G. Andersen, and Michael Kaminsky. 2014. MICA: A holistic approach to fast
in-memory key-value storage. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Imple-
mentation (NSDI’14). 429–444.
[37] Guohan Lu, Chuanxiong Guo, Yulong Li, Zhiqiang Zhou, Tong Yuan, Haitao Wu, Yongqiang Xiong, Rui Gao, and
Yongguang Zhang. 2011. ServerSwitch: A programmable and high performance platform for data center networks.
In Proceedings of the (NSDI’11).
[38] Xicheng Lu, Huaimin Wang, and Ji Wang. 2006. Internet-based virtual computing environment (iVCE): Concepts and
architecture. Sci. China Ser. F: Info. Sci. 49, 6 (2006), 681–701.
[39] Xicheng Lu, Huaimin Wang, Ji Wang, and Jie Xu. 2013. Internet-based virtual computing environment: Beyond the
data center as a computer. Future Gen. Comput. Syst. 29, 1 (2013), 309–322.
[40] Jeanna Neefe Matthews, Drew Roselli, Adam M. Costello, Randolph Y. Wang, and Thomas E. Anderson. 1997. Im-
proving the performance of log-structured file systems with adaptive methods. In Proceedings of the ACM Symposium
on Operating Systems Principles (SOSP’97). ACM.
[41] James Mickens, Edmund B. Nightingale, Jeremy Elson, Darren Gehring, Bin Fan, Asim Kadav, Vijay Chidambaram,
Osama Khan, and Krishna Nareddy. 2014. Blizzard: Fast, cloud-scale block storage for cloud-oblivious applications.
In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI’14). 257–273.
[42] Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva
Shankar, Viswanath Sivakumar, Linpeng Tang et al. 2014. f4: Facebook’s warm BLOB storage system. In Proceed-
ings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). 383–398.
[43] Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue. 2012. Flat data-
center storage. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI’12).
[44] Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John K. Ousterhout, and Mendel Rosenblum. 2011. Fast crash
recovery in RAMCloud. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’11). 29–41.
[45] John K. Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish
Mitra, Aravind Narayanan, Guru M. Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan
Stutsman. 2009. The case for RAMClouds: Scalable high-performance storage entirely in DRAM. Operat. Syst. Rev.
43, 4 (2009), 92–105.
[46] Mendel Rosenblum and John K. Ousterhout. 1992. The design and implementation of a log-structured file system.
ACM Trans. Comput. Syst. 10, 1 (1992), 26–52.
[47] Stephen M. Rumble, Diego Ongaro, Ryan Stutsman, Mendel Rosenblum, and John K. Ousterhout. 2011. It’s time for
low latency. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS’11).
[48] Bin Shao, Haixun Wang, and Yatao Li. 2013. Trinity: A distributed graph engine on a memory cloud. In Proceedings
of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 505–516.
[49] Ji-Yong Shin, Mahesh Balakrishnan, Tudor Marian, and Hakim Weatherspoon. 2013. Gecko: Contention-oblivious
disk arrays for cloud storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’13).
285–298.
[50] Michael Vrable, Stefan Savage, and Geoffrey M. Voelker. 2012. BlueSky: A cloud-backed file system for the enterprise.
In Proceedings of the 10th USENIX Conference on File and Storage Technologies. USENIX Association, 19–19.


[51] Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike
Dahlin. 2013. Robustness in the salus scalable block store. In Proceedings of the 10th USENIX Symposium on Networked
Systems Design and Implementation (NSDI’13). 357–370.
[52] Haitao Wu, Guohan Lu, Dan Li, Chuanxiong Guo, and Yongguang Zhang. 2009. MDCube: A high perfor-
mance network structure for modular data center interconnection. In Proceedings of the International Conference
on Emerging Networking Experiments and Technologies (CoNEXT'09), Jörg Liebeherr, Giorgio Ventre, Ernst W.
Biersack, and Srinivasan Keshav (Eds.). ACM, 25–36. Retrieved from http://dblp.uni-trier.de/db/conf/conext/
conext2009.html#WuLLGZ09.
[53] Yiming Zhang, Chuanxiong Guo, Dongsheng Li, Rui Chu, Haitao Wu, and Yongqiang Xiong. 2015. CubicRing: En-
abling one-hop failure detection and recovery for distributed in-memory storage systems. In Proceedings of the 12th
USENIX Symposium on Networked Systems Design and Implementation (NSDI’15). 529–542.
[54] Yibo Zhu, Nanxi Kang, Jiaxin Cao, Albert Greenberg, Guohan Lu, Ratul Mahajan, Dave Maltz, Lihua Yuan, Ming
Zhang, Ben Y. Zhao et al. 2015. Packet-level telemetry in large datacenter networks. In ACM SIGCOMM Computer
Communication Review, Vol. 45. ACM, 479–491.

Received November 2017; revised July 2018; accepted October 2018
