Leveraging Glocality For Fast Failure Recovery in Distributed RAM Storage
Distributed RAM storage aggregates the RAM of servers in data center networks (DCN) to provide extremely
high I/O performance for large-scale cloud systems. For quick recovery of storage server failures, Mem-
Cube [53] exploits the proximity of the BCube network to limit the recovery traffic to the recovery servers’
1-hop neighborhood. However, the previous design is applicable only to the symmetric BCube(n, k) network
with n^{k+1} nodes and has suboptimal recovery performance due to congestion and contention.
To address these problems, in this article, we propose CubeX, which (i) generalizes the “1-hop” principle
of MemCube for arbitrary cube-based networks and (ii) improves the throughput and recovery performance
of RAM-based key-value (KV) store via cross-layer optimizations. At the core of CubeX is to leverage the
glocality (= globality + locality) of cube-based networks: It scatters backup data across a large number of
disks globally distributed throughout the cube and restricts all recovery traffic within the small local range
of each server node. Our evaluation shows that CubeX not only efficiently supports RAM-based KV store for
cube-based networks but also significantly outperforms MemCube and RAMCloud in both throughput and
recovery time.
CCS Concepts: • Information systems → Storage replication; Cloud based storage; • Computer sys-
tems organization → Secondary storage organization;
Additional Key Words and Phrases: Globality, locality, cube-based networks, distributed RAM storage, fast
recovery
ACM Reference format:
Yiming Zhang, Dongsheng Li, and Ling Liu. 2019. Leveraging Glocality for Fast Failure Recovery in Dis-
tributed RAM Storage. ACM Trans. Storage 15, 1, Article 3 (February 2019), 24 pages.
https://doi.org/10.1145/3289604
1 INTRODUCTION
In recent years, the role of RAM has become increasingly important for storage in large-scale cloud
systems [38, 39]. In most of these storage systems, however, RAM is only used as a cache for the
primary storage. For instance, memcached [8] has been widely used by various Web services and
applications, and Google’s Bigtable [14] keeps entire column family in RAM.
This work is supported by the National Natural Science Foundation of China (61772541 and 61872376), the National Science
Foundation under Grant NSF SaTC 1564097, and an IBM faculty award.
Authors’ addresses: Y. Zhang (corresponding author) and D. Li (corresponding author), PDL, School of Computer, National
University of Defense Technology, Changsha 410073, China; emails: [email protected], [email protected]; L. Liu,
College of Computing, Georgia Institute of Technology, Atlanta, USA.
• We leverage both the globality and the locality of cube-based networks to realize quick
recovery for RAM-based key-value store, and analyze their tradeoff;
• CubeX realizes cross-layer optimizations, such as recovery server remapping, asynchronous
backup server recovery, and surrogate backup writing, to support quick recovery and high
I/O performance;
• We implement CubeX by modifying MemCube, and evaluate CubeX on an Ethernet testbed
connected using various networks. Evaluation results show that CubeX remarkably outper-
forms MemCube both in throughput and in recovery time.
This article is organized as follows: Section 2 discusses our motivation; Section 3 introduces
the design of the CubeX structure; Section 4 describes the basic recovery mechanism of CubeX by
leveraging the glocality of cubes; Section 5 introduces the optimization for recovery server remap-
ping; Section 6 presents our prototype implementation of CubeX; Section 7 presents experimental
results; Section 8 introduces related work; and Section 9 concludes the article.
2 PRELIMINARIES
2.1 RAM-Based Storage
RAM-based storage promises extremely low I/O latency (10μs-level RTT) and high
I/O throughput (millions of reads/writes per second). Its most obvious disadvantage is the high
cost and energy usage per bit. RAM-based storage is about 50× worse
than a pure disk-based storage and 5× worse than a flash memory-based storage for both cost
and energy usage metrics [45]. Therefore, if an application needs to store a large amount of data
inexpensively and has a low access rate, RAM-based storage might not be the best solution.
However, when considering cost/energy usage per operation, a RAM-based storage is about
1,000× more efficient than a disk-based storage and 10× more efficient than a flash memory-based
storage [45]. Based on Gray’s Rule [26], that the cost per usable bit of disk increases as the desired
access rate to each record increases, Ousterhout et al. deduced that with recent technologies, if a
1KB record is accessed more than once every 30h, then it is cheaper to store it in memory than on
disk (because to enable this access rate only 2% of the disk space can be used) [45].
Furthermore, Andersen et al. generalize Gray's rule to compare the total cost of ownership (includ-
ing hardware cost and energy usage) over three years given the required data size and access
rate [11]. The comparison shows that (i) RAM is cheapest for high access rates and small data sizes,
disk is cheapest for low access rates and large data sizes, and flash is cheapest in between; and (ii) since
cost/bit is improving much more rapidly than cost/operation/sec for all three storage media, the appli-
cability of RAM-based storage will keep increasing.
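To make this tradeoff concrete, the sketch below reproduces the Gray's-rule style break-even reasoning [26, 45]; all prices and device parameters in it are our own illustrative assumptions, so it recovers an interval on the order of tens of hours for a 1KB record rather than the exact 30h figure.

```python
# Break-even sketch in the spirit of Gray's rule [26, 45].
# All prices/capacities below are illustrative assumptions, not measured data.
DISK_COST = 100.0            # $ per disk (assumed)
DISK_IOPS = 100.0            # random accesses/s per disk (assumed)
DISK_CAPACITY_GB = 1000.0    # GB per disk (assumed)
DISK_USABLE_FRACTION = 0.02  # only ~2% of capacity usable at high access rates [45]
RAM_COST_PER_GB = 10.0       # $ per GB of DRAM (assumed)
RECORD_GB = 1.0 / 1e6        # a 1KB record

def disk_cost_per_record(access_interval_s):
    """Dollar cost to hold/serve one 1KB record on disk at the given access rate:
    IOPS-bound at high rates, capacity-bound (with low utilization) at low rates."""
    iops_bound = DISK_COST * (1.0 / access_interval_s) / DISK_IOPS
    capacity_bound = DISK_COST * RECORD_GB / (DISK_CAPACITY_GB * DISK_USABLE_FRACTION)
    return max(iops_bound, capacity_bound)

def ram_cost_per_record():
    return RAM_COST_PER_GB * RECORD_GB

interval_s = 1.0
while disk_cost_per_record(interval_s) > ram_cost_per_record():
    interval_s *= 2   # double the interval until disk becomes the cheaper option
print(f"RAM wins if a 1KB record is accessed more often than about "
      f"every {interval_s / 3600:.0f} hours (under the assumed prices)")
```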
Redis [12] is a RAM-based key-value store with a rich data model that keeps all data in
RAM. Currently Redis is designed mainly to be used as a cache layer. In cluster storage scenarios,
Redis cannot support the more efficient primary-backup persistence mechanism of RAMCloud
(primary copies in primary servers' RAM and backup copies on backup servers' disks), and thus
suffers a severe performance penalty.
Fig. 1. CubeX on generalized hypercube (GHC). Left: a 4 × 4 × 2 GHC. Right: the corresponding network
connection. (It uses a 32-port switch to emulate the 16 2-port mini switches.) When a primary server 000
fails, each of its seven recovery servers (001, 010, 020, 030, 100, 200, 300) becomes a new primary server and
recovers 1/7 of the primary data of 000. For example, 030 recovers by fetching backup data from its four backup
servers 031, 130, 230, 330.
servers [5, 27]. This consequently affects data durability, because more failures may occur before
a previous failure gets recovered. Moreover, as discussed in Reference [16], since the recovery
servers independently choose their new backup servers for storing the backup data without a
global view of the network, there exists substantial imbalance in the usage of bottleneck links,
which is completely ignored by RAMCloud. However, RAMCloud requires that during recovery
all the backup data must be restored by the recovery servers. Thus, in a triple-replicated cluster, the
recovery bandwidth achieved by a recovery server is only 1/3 of that server's bandwidth [44]. For
example, in RAMCloud the PCI-Express bandwidth of 25Gbps limits the throughput per recovery
server to no more than 800MB per second [44].
2.4 MemCube
MemCube [53] proposes to exploit the proximity of the BCube network [28] to limit recovery traffic to the 1-
hop range, so that recovery can be performed while avoiding most congestion. However, the design
of MemCube is applicable only to BCube, and it has suboptimal recovery performance due to the
last-hop congestion and contention. For example, the recovery of a backup server is performed
after all other recoveries complete, since its recovery flows have overlap and contention with
the primary server recovery [53]. Therefore, MemCube’s recovery is suboptimal if we define the
recovery time as the entire time window for recovering all the primary/recovery/backup servers
instead of just the primary server. Suppose that the replication factor is f . In MemCube after the
new primary servers restart the service, there would be a (short) time window in which the number
of backup copies is f − 1 instead of f .
3 CUBEX DESIGN
This section first introduces the design of the generalized primary-backup-recovery structure of
CubeX, and then analyzes its properties.
3.1 Background
We follow the primary-backup model that is preferred by most RAM-based storage systems includ-
ing RAMCloud and MemCube. In a primary-backup RAM-based storage system with replication
factor f, there is one primary replica and f backup replicas for each key-value (KV) pair. The
value is a variable-length byte array (up to 1MB). For each KV pair, the primary replica is stored
in the RAM of a primary server and the f backup replicas are maintained on the disks of f backup
servers.
When a primary server (with tens or hundreds of GB RAM) fails, to quickly recover all its pri-
mary data in a few seconds, the recovery procedure needs to involve hundreds of backup disks
(which is easy to achieve in a moderate-size data center). The data is recovered to the recovery servers
of the failed primary server, each becoming a new primary server. Considering the currently avail-
able commodity NIC bandwidth (10Gbps), several tens of recovery servers are needed to provide
adequate (aggregate) recovery bandwidth to recover a failed primary server (with tens of GB data
in RAM) in a few seconds.
recovery, because the aggregate recovery throughput is already enough when only using servers
in the two-dimensional subnetwork, as discussed in the next subsection.
By this means the recovery traffic (from backup servers to their corresponding recovery servers)
is restricted within the local range of directly-connected neighbors. For example, when the primary
server 000 fails, each of the seven recovery servers reads backup data from its directly connected
backup servers (e.g., 030 from 031, 130, 230 and 330). In total there are 4 × 6 + 6 × 1 = 30 parallel
recovery flows in the local range.
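As a small illustration of this 1-hop restriction (a sketch of ours, not CubeX code; nodes are written as coordinate tuples), the recovery flows of Figure 1 can be enumerated directly from the topology:

```python
# Sketch (ours): enumerate the 1-hop recovery flows for a failed primary
# server in an a_1 x ... x a_m GHC. A node is a coordinate tuple.
def neighbors(node, dims):
    """1-hop neighbors: nodes that differ from `node` in exactly one coordinate."""
    for j, aj in enumerate(dims):
        for v in range(aj):
            if v != node[j]:
                yield node[:j] + (v,) + node[j + 1:]

def recovery_flows(failed_primary, dims):
    """Map each recovery server R of the failed primary P to the backup servers
    it fetches from: R's 1-hop neighbors that are neither P nor another
    recovery server of P (i.e., they are 2-hop neighbors of P)."""
    recs = set(neighbors(failed_primary, dims))
    return {R: [B for B in neighbors(R, dims)
                if B != failed_primary and B not in recs]
            for R in recs}

flows = recovery_flows((0, 0, 0), (4, 4, 2))          # primary server 000 fails
print(len(flows))                                     # 7 recovery servers
print(sorted(flows[(0, 3, 0)]))                       # 030 reads from 031, 130, 230, 330
print(sum(len(v) for v in flows.values()))            # 30 parallel recovery flows
```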
Centralized master. CubeX inherits the key-ring mapping mechanism from MemCube [53],
where a centralized master maintains the mappings from keys to primary servers. The master
equally divides the key space into consecutive subspaces, each being assigned to one primary
server. The key space held by a primary server is further equally divided to its recovery servers;
and for each recovery server, its subspace is mapped to its backup servers. To avoid potential per-
formance bottleneck at the global master, the mapping from a primary server’s key space to its
recovery/backup servers is maintained by itself instead of the master, and the recovery servers
have a cache of the mapping they are involved in. After a server fails, the global master will ask
all its 1-hop neighbors to reconstruct the mapping.
For example, in Figure 1 the master divides the whole key space into
32 shares, each corresponding to a primary server P and denoted S_P (e.g., S_{000} for P = 000).
Similarly, each recovery server R of P is equally assigned a subspace of S_P, denoted
S_{P,R} (e.g., S_{000,030} for R = 030). For a recovery server R, the sub key space S_{P,R} is further equally
assigned to each backup server B, denoted S_{P,R,B}. For example, for the 4 backup servers (nodes
031, 130, 230, 330) of P = 000 and R = 030, S_{000,030} is divided into S_{000,030,031}, S_{000,030,130}, S_{000,030,230},
and S_{000,030,330} for the nodes 031, 130, 230, 330, respectively, and stored on the corresponding backup
servers' disks. Other system information, such as the addresses of the servers, is also held by the
master.
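A minimal sketch of this three-level division (ours; it assumes a flat integer key space purely for illustration) walks through the Figure 1 example:

```python
# Sketch (ours) of the three-level key-space division described above,
# using a flat integer key space and the Figure 1 example (P = 000, R = 030).
KEY_SPACE = 2 ** 32

def split(lo, hi, owners):
    """Equally divide the half-open key range [lo, hi) among `owners`,
    returning {owner: (sub_lo, sub_hi)} with consecutive sub-ranges."""
    step = (hi - lo) // len(owners)
    return {o: (lo + i * step, hi if i == len(owners) - 1 else lo + (i + 1) * step)
            for i, o in enumerate(owners)}

primaries = [f"{x}{y}{z}" for x in range(4) for y in range(4) for z in range(2)]
S = split(0, KEY_SPACE, primaries)              # master: S_P for each primary P
lo, hi = S["000"]

recovery_servers_of_000 = ["001", "010", "020", "030", "100", "200", "300"]
S_000 = split(lo, hi, recovery_servers_of_000)  # P: S_{P,R} for each recovery server R
rlo, rhi = S_000["030"]

backups_of_030 = ["031", "130", "230", "330"]
S_000_030 = split(rlo, rhi, backups_of_030)     # R: S_{P,R,B} for each backup server B
print(S_000_030["031"])                          # the sub key space stored on 031's disk
```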
Replication factor. In a primary-backup storage system with replication factor f, each KV pair
mapped onto a primary server P has one dominant backup replica (stored on a backup server B) and
f − 1 secondary backup replicas that are stored on P's backup servers other than B. Given B,
all its dominant backup data is also stored on P’s f − 1 backup servers, which are called shadow
backup servers of B. Shadow backup servers may be chosen for specific requirements. For example,
if it is required that f backup copies should be stored in different racks, then each backup server
chooses f − 1 shadow backup servers in f − 1 racks (rather than its own rack).
3.3 Analysis
Structure properties. For an m-dimensional a_1 × a_2 × · · · × a_m GHC structure, all the
nodes are primary servers and thus the number of primary servers is Π_{i=1}^{m} a_i. For each primary
server P (e.g., node 000 in Figure 1), its recovery servers are its direct neighbors and thus the number
of P's recovery servers is Σ_{i=1}^{m} (a_i − 1) (3 + 3 + 1 = 7 for node 000). For a recovery server R_j
(e.g., node 030) in the jth dimension of P, its backup servers are its direct neighbors that are neither
the primary server P nor P's recovery servers, i.e., R_j's direct neighbors not in P's jth dimension.
The number of R_j's backups is Σ_{i=1, i≠j}^{m} (a_i − 1). For example, node 030 has 3 + 1 = 4 backups.
Since P's backup servers are 2-hop-away from node P, each backup server corresponds to two
recovery servers (e.g., backup node 031 corresponds to two recovery nodes 030 and 001). Therefore,
the number of P's backup servers is (1/2) Σ_{j=1}^{m} (a_j − 1) Σ_{i=1, i≠j}^{m} (a_i − 1) (e.g., primary server 000 has
(1/2) × 30 = 15 backup servers). In addition, Σ_{j=1}^{m} Π_{i=1, i≠j}^{m} a_i switches are
needed to connect the entire GHC structure. For example, the 4 × 4 × 2 GHC depicted in Figure 1
requires 4 × 2 + 4 × 2 + 4 × 4 = 32 switches.
CubeX has enough backup and recovery servers in real deployment. For example, for an
8 × 8 × 8 GHC network, there are 512 primary servers and 192 8-port switches. For each primary
server, there are 21 recovery nodes and 147 backup nodes. Roughly suppose that the NIC band-
width and disk I/O bandwidth are 10Gbps and 100MB per second, respectively, and that each backup
server has three disks. When a primary server fails in the three-dimensional hypercube, each of
its recovery servers has two NIC ports participating in the recovery. Then, the aggregate network
bandwidth and aggregate disk bandwidth are respectively 420Gbps (= 10Gbps ×2 × 21) and 43GB/s
(≈ 100MB/s ×3 × 147), which are sufficient to recover tens of GB data in a few seconds.
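These counts, and the rough bandwidth estimate above, can be checked with a few lines; the helper below is our own sketch, not part of CubeX:

```python
# Sketch (ours) of the counting formulas of Section 3.3, plus the rough
# aggregate-bandwidth estimate for the 8x8x8 example above.
from math import prod

def ghc_counts(a):
    """(#primary servers, #recovery servers per primary,
        #backup servers per primary, #switches) for an a_1 x ... x a_m GHC."""
    m = len(a)
    primaries = prod(a)
    recovery = sum(ai - 1 for ai in a)
    backup = sum((a[j] - 1) * sum(a[i] - 1 for i in range(m) if i != j)
                 for j in range(m)) // 2
    switches = sum(prod(a[i] for i in range(m) if i != j) for j in range(m))
    return primaries, recovery, backup, switches

print(ghc_counts([4, 4, 2]))    # (32, 7, 15, 32)   -- the GHC of Figure 1
print(ghc_counts([8, 8, 8]))    # (512, 21, 147, 192)

# Rough recovery throughput for the 8x8x8 example: 21 recovery servers with two
# participating 10Gbps NIC ports each; 147 backup servers with three 100MB/s disks each.
print(10 * 2 * 21, "Gbps aggregate network bandwidth")        # 420
print(round(0.1 * 3 * 147), "GB/s aggregate disk bandwidth")  # ~44 (the text's ≈43GB/s)
```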
Backup server selection. Optionally, a slightly different design for backup server selection is to
include the recovery server itself as one of the backup servers. CubeX gives up this asymmetric de-
sign, because it adds management complexity while bringing only insignificant improvement.
Consider the above example: by including the recovery servers, the number of backup servers
per primary server increases from 147 to 168, an increase of less than 15%.
Incremental scalability. CubeX is incrementally scalable. That is, the unit of upgrading CubeX
is one node. This is because when the number of available servers is less than Π_{i=1}^{m} a_i for an m-
dimensional GHC structure, a partial m-dimensional GHC can be built with any number (≤ a_m) of
(m − 1)-dimensional GHC structures using a complete set of dimension-m switches.
Performance degradation under switch failures. Both CubeX and MemCube perform better
than RAMCloud (on FatTree) under switch failures [28]. This is because switches at
different levels of a FatTree have different impact on routing, and failures of low-level switches
imbalance the traffic and dramatically degrade FatTree's performance. For example, for a partial
8 × 8 × 8 × 8 GHC structure with 2,048 servers and 1Gbps links the initial ABT (aggregate bot-
tleneck throughput) is 2,006Gbps, and its ABT is 765Gbps when the switch failure ratio reaches
20%; while for a FatTree with the same number of servers and levels of switches the initial ABT is
1,895Gbps, and it drops to only 267Gbps when 20% of the switches fail.
Complexity. The complexity of CubeX mainly comes from its wiring. For an m-dimensional
a_1 × a_2 × · · · × a_m GHC structure with N = Π_{i=1}^{m} a_i servers and M = Σ_{j=1}^{m} Π_{i=1, i≠j}^{m} a_i
switches, the total number of wires is N × m. Thus CubeX has the same wiring complexity as
MemCube (on BCube [28]) and RAMCloud (on FatTree [27]). Moreover, the wiring complexity of
CubeX is feasible even for shipping-container based modular data centers (MDC). For example, it
is easy to package a GHC with 512 servers into a 40-foot container. Therefore, CubeX is suitable
for normal data center constraints.
3.4 Globality-Locality Tradeoff
For cube-based networks with more than two dimensions, there are some servers that are more
than two hops away from a primary server and that will not participate in its recovery. This is not
a problem in most cases of large-scale networks. For example, the above 8 × 8 ×
8 GHC 10GbE network can achieve aggregate recovery throughput high enough to complete the
recovery in a few seconds.
In some rare cases of high-dimensional, small-sized, and low-bandwidth networks, the basic de-
sign of CubeX might not provide enough aggregate recovery throughput. To address this problem,
CubeX trades locality for globality. Consider a 4 × 4 × 4 GHC network with 64 servers constructed
from four 4 × 4 GHCs. In the basic CubeX design, a primary server P has 9 recovery servers and 27
backup servers, leaving 27 3-hop neighbors in the other three 4 × 4 GHCs (referred to as GHC_R)
unused in the recovery. If it is necessary to get more servers involved in the recovery, then in each
GHC_R we can separately construct a "pseudo" primary-recovery-backup structure, with the "pri-
mary" server being a 1-hop recovery server (referred to as R_1) of P. Thus, we could replace each
of the three 1-hop recovery servers (R_1) with 6 2-hop recovery servers (referred to as R_2) in the
corresponding GHC_R. By this means a primary server P will have 24 recovery servers (6 R_1s and
18 R_2s) and 36 backup servers (9 2-hop neighbors and 27 3-hop neighbors). After a primary server
P fails, in each 4 × 4 GHC there will be 18 concurrent 1-hop local recovery flows. As discussed in
References [44, 53], having more recovery servers improves the aggregate recovery throughput
but leads to more data fragmentation after the recovery.
4 BASIC RECOVERY
In this section, we first briefly introduce how CubeX generalizes MemCube’s recovery procedure
for cube-based networks, and then present the cross-layer recovery optimization of CubeX.
Primary server recovery. Primary server (S) recovery requires to recover both the primary data
and the backup data of S. When S fails, each of its r recovery servers becomes a new primary
server recovering 1/r of the primary data of S. For example, suppose that in Figure 1 the primary server 000
fails. The primary data is recovered from the old backup servers of S (e.g., 031, 130, 230, 330) to
the corresponding new primary storage server (030). The backup replicas that were previously
held by the old backup servers (e.g., 031, 130, 230, 330) of the old recovery server (030) are now
transferred to the new backup servers of the new primary storage server (030). That is, 031 →
(131, 231, 331, 021, 011, 001); 130 → (131, 120, 110, 100); 230 → (231, 220, 210, 200); 330 → (331, 320,
310, 300).
Recovery server recovery. To handle the failure of a recovery server (S), CubeX generalizes
MemCube’s mechanism of finding new recovery servers to replace S (for each of S’s primary
servers) and moving corresponding backup data. To replace the failed recovery server, CubeX
transfers the backup replicas corresponding to S from the old servers to the current ones corre-
sponding to P’s other recovery servers. For example, suppose that in Figure 1 a recovery server 000
(of the primary server 001) fails. Then the backup servers corresponding to 000 (010, 020, 030, 100,
200, 300) will transfer their backup replicas to the backup servers of 001’s remaining recovery
servers (011, 021, 031, 101, 201, 301). That is, 010 → (010, 111, 211, 311); 020 → (020, 121, 221, 321);
. . . ; 300 → (300, 311, 321, 331).
Backup server recovery. The failed server (S) is also a backup server of other primary servers,
and thus we need to perform the procedure for backup server recovery. Consider a backup
server 000 of the primary server 330 in Figure 1. After the backup server 000 fails, intuitively 330
has to copy its primary replicas to other backup servers (010, 020, 100, 200), which are also recovery
servers of the failed primary server 000.
Note that the aggregate network bandwidth of the recovery servers of S is always the bottleneck
compared to the aggregate bandwidth of the backup nodes of S, which is mainly due to the follow-
ing reasons: (i) one recovery server corresponds to many backup servers, and (ii) one backup server
can install many disks. Therefore, the above backup server recovery (e.g., 330 → 030 → {020, 010})
conflicts with the recovery of the primary storage server (e.g., 330 → 030), both of which use the
bottleneck inbound network bandwidth of the recovery servers (e.g., 030). In Section 4.3, we will
introduce how CubeX gracefully handles this problem using asynchronous recovery.
When the primary node P receives a write request, it stores the new data in its RAM, and trans-
fers it to one backup server (B) and f − 1 shadow backup servers’ RAM and then returns. An
( f + 1)th standby copy is asynchronously replicated to one of B’s standby servers. Therefore, af-
ter the backup node B fails, since B’s backup data has already been synchronously replicated (as
secondary backup copies) to the f − 1 shadow backup servers, the master will check whether all
the writes of B’s standby copies have completed. If so, then the backup node B’s standby servers
can directly change their standby data to the backup data with no data transfer, which completes
the recovery of a backup server.
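A minimal sketch of this write path (the class, method, and RPC names are ours and only approximate the described behavior) might look as follows:

```python
# Sketch (ours) of the write path: synchronous replication to B and the f-1
# shadow backup servers, plus an asynchronous (f+1)-th standby copy.
import concurrent.futures as cf

class PrimaryServer:
    def __init__(self, ram, backup, shadows, standby_of_backup, f):
        self.ram = ram                      # in-memory KV table
        self.backup = backup                # dominant backup server B
        self.shadows = shadows              # f-1 shadow backup servers of B
        self.standby = standby_of_backup    # one standby server of B
        self.pool = cf.ThreadPoolExecutor(max_workers=4)
        assert len(shadows) == f - 1

    def put(self, key, value):
        # 1. Store the primary replica in RAM.
        self.ram[key] = value
        # 2. Synchronously replicate to B and the f-1 shadow backup servers
        #    (buffered in their RAM, flushed to disk later), then return.
        for server in [self.backup, *self.shadows]:
            server.write_backup(key, value)   # hypothetical RPC
        # 3. Asynchronously write the (f+1)-th standby copy; it is used only
        #    to avoid data transfer when B itself fails later.
        self.pool.submit(self.standby.write_standby, key, value)  # hypothetical RPC
        return "OK"
```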
The asynchronous backup mechanism does not affect individual writes, but a potential draw-
back of asynchronous backup is that it incurs extra network traffic. In practice, however, the size
of the values of CubeX’s KV pairs is no larger than 1MB (as discussed in Section 3.1) and the av-
erage size of values is usually smaller than 64KB [53] in RAM-based key-value stores (like CubeX,
MemCube, and RAMCloud), where the normal I/O performance is bounded by CPU and network
latency [44] instead of bandwidth. Therefore, the bandwidth utilization of the extra traffic for
asynchronous backup is negligible, which will be further evaluated in Section 7.2.
Asynchronous Recovery. Standby data allows us to perform the recovery of backups asyn-
chronously, restoring the backup data (as well as the standby data) in the same way discussed
in Section 4.2. This is because the data is only for recovery of future backup server failures and
has no effect on durability/availability.
We have two options to restore the backup and standby data: (i) from existing secondary backup
copies, or (ii) from the primary copy. Since the first method generates too much all-to-all traffic
(among all the backup servers of a primary server), which may induce undesired
congestion, CubeX chooses the second one. Because the restore process overlaps with the recov-
ery of the primary and backup data of the primary server, only after all other recoveries are completed
and all the new primary servers have started their storage services does CubeX begin to restore
the backup and standby servers.
Since the size of the standby data on a failed backup server is roughly the same as the size of
the primary data on a primary server, the restore process of standby data from the failed backup
server’s (B’s) two-hop neighbors (i.e., the primary servers) to B’s one-hop neighbors (i.e., other
backup servers sharing the same recovery servers) is also similar to the primary data recovery.
So this process can be finished in a period of time roughly the same as a primary server failure
recovery. During the process, the involved primary servers can service read/write requests to this
data as usual, except that they do not write the corresponding standby data but cache it in RAM.
Missed standby data is written in a large transfer after the restore process completes.
If we recover the shadow backup servers in the same way as the asynchronous backup server
recovery, assuming each primary server has α GB of primary data and the replication factor is f,
then in total (f + 1)α GB of data is recovered from the primary servers' RAM to the disks of the new
backup servers, shadow backup servers, and standby servers. This process is not urgent
and thus can be conducted over a relatively long period of time after all other recovery completes.
CubeX will simply choose another server for the recovery, in which case CubeX is “degraded” to
a RAMCloud [44]. Clearly, the recovery could complete as long as there is at least one copy avail-
able. With a disk replication factor of f, in the worst case at least f (or f + 1 if the asynchronous
standby copy write is completed) concurrent failures can be tolerated. Domain failures can be
recovered as long as, for all lost replicas, there is at least one backup replica available outside the
domain. CubeX can install a battery on each backup server to ensure that the small amount of buffered
data has the same non-volatility as disk-stored data.
5 OPTIMIZATION
5.1 Recovery Server Remapping
We note that recovery servers store no data related to the failed primary server. Therefore,
CubeX proposes an optional remapping approach for recovery server recovery, i.e., for each af-
fected primary storage server P, the affected sub key space can be re-designated to P's other
recovery servers without data transfer. Adopting this approach requires more elaborate process-
ing in subsequent failure recoveries, as discussed in Section 4.4. In normal cases a backup server B
directly connects to two recovery servers R_1 and R_2; after one recovery server (say, R_1) fails,
B can simply register with R_2, i.e., tell the primary server P to redirect R_1's sub key space S_{P,R_1,B} to
R_2, which can be accomplished instantaneously.
As discussed in Section 3.3, for an m-dimensional a_1 × a_2 × · · · × a_m GHC structure,
each primary server has Σ_{i=1}^{m} (a_i − 1) recovery servers and vice versa. Therefore, if subsequent
server failures happen independently at random, the probability that the second failure is af-
fected by the remapping mechanism is (Σ_{i=1}^{m} a_i − m) / (Π_{i=1}^{m} a_i − 1). When a backup server B has no more
directly connected recovery servers for the primary server P (due to remapping in the previous
failure), P will designate a new recovery server R' for B's sub key space, and transfer B's backup
data to a server B' among the backup servers of R', which is similar to the processing introduced in
Section 4.2. It is likely to find such a B' that is directly connected with B, because B is two-hop
away from P while R' is one-hop away from P, which means, for instance, in an n × n × n GHC
structure, there are initially 2(n − 2) recovery servers of P that are two-hop away from B, each
having a backup server one-hop away from B.
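As a quick numeric check of this expression (our own helper, for illustration only):

```python
# Probability that a second, independent failure is affected by a previous
# remapping: (sum(a_i) - m) / (prod(a_i) - 1).
from math import prod

def remap_collision_prob(a):
    return (sum(a) - len(a)) / (prod(a) - 1)

print(round(remap_collision_prob([8, 8]), 3))      # 8x8 GHC     -> ~0.222
print(round(remap_collision_prob([8, 8, 8]), 3))   # 8x8x8 GHC   -> ~0.041
```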
The recovery for a single failure of CubeX with remapping is the same as that without remapping
(as described in Section 4). The recovery for multiple failures of CubeX with remapping will be
introduced in the next subsection.
We call backup servers with only one recovery server "lonely backup servers." The re-
covery servers that have some lonely backup servers are called "affected recovery servers," and
the recovery servers that have no lonely backup servers are called "free recovery servers." If a
backup server connects to one free recovery server and one affected recovery server, then we
call it a "free backup server"; and if it connects to two affected recovery servers, then we
call it an "affected backup server."
Consider an n × n × n GHC. Suppose that first a recovery server (say, node 001) of a primary
server (say, node 000) fails and then 000 also fails. Let α, β, γ be the optimal bandwidth shares
assigned to the lonely/free/affected backup servers of the affected recovery server with per-port λ
Gbps bandwidth, respectively. To finish all the recovery at the same time, we should have α = 2γ,
α = β + λ/(n−1), and α + (n − 2)β = λ. Therefore, we have α = ((2n−3)/(n−1)^2) λ, β = ((n−2)/(n−1)^2) λ, and γ = ((n−1.5)/(n−1)^2) λ.
When n ≫ 1, we have α ≈ 2β ≈ 2γ. Therefore, the lonely backup servers (which have only one
recovery server R) should use approximately twice the bandwidth of R used by R's other backup
servers. Clearly, for multiple failures, as long as n ≫ 1 and the failures happen randomly, we can
get similar results for α, β, γ . Therefore, as discussed in Orchestra [17], we can get nearly opti-
mal bandwidth allocation by simply creating two separate TCP connections for the lonely backup
servers that are connecting to only one recovery server.
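These shares can be verified mechanically; the following checking sketch (ours) uses exact fractions:

```python
# Verify the bandwidth shares alpha, beta, gamma for an n x n x n GHC:
# alpha = 2*gamma, alpha = beta + lam/(n-1), alpha + (n-2)*beta = lam.
from fractions import Fraction

def shares(n, lam=1):
    alpha = Fraction(2 * n - 3, (n - 1) ** 2) * lam
    beta = Fraction(n - 2, (n - 1) ** 2) * lam
    gamma = alpha / 2                       # i.e., (n - 1.5)/(n-1)^2 * lam
    # The three "finish at the same time" constraints:
    assert alpha == 2 * gamma
    assert alpha == beta + Fraction(lam, n - 1)
    assert alpha + (n - 2) * beta == lam
    return alpha, beta, gamma

for n in (4, 8, 64):
    a, b, g = shares(n)
    print(n, float(a), float(b), float(g), float(a / b))   # a/b approaches 2 as n grows
```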
6 IMPLEMENTATION
We implement CubeX both on Linux and on Windows. This section introduces the implementation
of CubeX on Linux, while CubeX on Windows has similar internal designs and is omitted here for
conciseness.
6.1 Overview
CubeX is prototyped by modifying MemCube to adapt to arbitrary cube-based networks and
adding a recovery optimization module.
Compared with the original MemCube implementation [53], in our CubeX prototype, we re-
place the BCube-specific failure recovery mechanism with the generalized cube-oriented design.
The primary data and backup data are scattered (globally) across all servers using aggressive data
partitioning [14] for reconstructing lost data in parallel, and the failure detection and recovery of
each server are constrained to that server’s local neighborhood.
Recovery server remapping is implemented by letting the backup server compute the sub-
space (that belonged to the failed recovery server) and notify the primary server of the remapping
result (i.e., the new recovery server). Asynchronous backup server recovery is implemented by
restoring data to the backup/standby servers after the KV service is resumed.
CubeX adopts a chain-based structure for memory management on individual storage servers.
We use a cache chain to group slabs into caches as well as a hash chain for fast KV access. As
shown in Figure 2, we use the slab-based mechanism [3, 8] to alleviate fragmentation problems
We use ZooKeeper [33] to realize the high availability and durability of the master. We add
a /CubeX znode in the ZooKeeper namespace, which contains sub znodes including /ac_master
(active master), /pri_servers (primary servers list), /keys (mapping from key space onto primary
servers), and so on. Several master instances compete for a single lease to ensure there is one
active instance most of the time. Other instances are in standby mode. The active instance
periodically increments a version number to indicate that it is alive. After the active instance fails or
is disconnected, some standby instance will win the lease and become active to provide services.
Both the clients and the storage servers use the /ac_master znode to locate the active master. If
a server fails concurrently with a master failure, e.g., the recovery servers R cannot get response
from the master, then R will ask the ZooKeeper service to locate the new active instance and then
report the failure to it. Afterwards, the normal recovery procedure is performed.
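As an illustration of this lease-based election (a sketch of ours using the kazoo ZooKeeper client; only the znode names come from the description above, and the actual CubeX master is implemented differently):

```python
# Illustrative sketch (ours) of the active/standby master pattern using the
# kazoo ZooKeeper client; not the real CubeX master implementation.
from kazoo.client import KazooClient

def run_master(zk_hosts, my_address, serve_as_active):
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    zk.ensure_path("/CubeX")

    # All master instances contend on the same election path; exactly one
    # runs the callback at a time, the others block as standbys.
    election = zk.Election("/CubeX/ac_master_election", identifier=my_address)

    def on_elected():
        # Publish the active master's address at /CubeX/ac_master so clients
        # and storage servers can locate it; the ephemeral znode disappears
        # automatically if this instance fails or disconnects.
        zk.create("/CubeX/ac_master", my_address.encode(), ephemeral=True)
        serve_as_active()   # blocks for as long as this instance stays active

    election.run(on_elected)  # blocks until elected, then runs on_elected
```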
7 EVALUATION
7.1 Configuration
We use 64 servers and 8 Pronto 1GbE switches (48 ports) to construct the BCube network (for
MemCube) and the GHC network (for CubeX).
We build a two-level BCube(8,1) network. Four of the switches together act as 16 virtual 8-port switches
(8 virtual switches at each level). Note that the BCube(8,1) network is also an 8 × 8 GHC
network. We also use 8 switches to emulate 32 4-port virtual switches and 64 2-port switches and
construct a 4 × 4 × 2 × 2 GHC network, which can be viewed as duplicating the 4 × 4 × 2 GHC
network (Figure 1) in a fourth dimension.
Each server has six Intel 2.5GHz cores, 12 1TB disks, and two 1GbE 2-port NICs. Sixteen of
the 64 servers have 32GB RAM and the others have 16GB. We have one additional server to run
the client, and three others to run one active and two standby masters for ZooKeeper. The four
additional servers connect to the testbed using a 1GbE control network. The asynchronous backup
and recovery mechanism is adopted as introduced in Section 4.3, and a fourth standby copy is
possibly stored on a backup server’s disk.
Our experiments in this section mainly answer the following questions:
• How fast can CubeX handle normal I/O (write) requests, with/without surrogate writing
(Section 7.2)?
• How much does CubeX outperform MemCube in single storage server recovery, with asyn-
chronous backup server recovery (Section 7.3) and recovery server remapping (Section 7.4)?
• Does CubeX preserve its property for (simulated) large-scale cube-based networks
(Section 7.5)?
• Does CubeX generalize well for different kinds of GHC (Section 7.6)?
• And how well does CubeX perform under multiple failures (Section 7.7)?
7.2 Throughput
Our first experiment evaluates the throughput of a single CubeX server (with/without surrogate
writing) and compares CubeX with MemCube. We run m processes for CubeX and
MemCube, respectively, on a server, each assigned a separate, consecutive sub key space.
We adapt the memcached benchmark [12] to CubeX, where the benchmark client has parallel
connections asynchronously performing write operations in the form "set key value." The
size of the key-value pairs is 1KB and the key is a random 15-byte string. We measure the number
of set requests handled per second as a function of the number of processes (m) running on the
server. Each point is an average of 20 runs.
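A stripped-down version of such a client (our own sketch speaking the memcached text protocol; the address, port, and padding are placeholders, and the real experiments use the adapted benchmark [12]) is shown below:

```python
# Simplified workload generator (ours): parallel connections issuing
# "set key value" requests with 15-byte random keys and ~1KB pairs.
import random, socket, string, threading, time

SERVER = ("10.0.0.1", 11211)      # placeholder address/port
VALUE = b"x" * (1024 - 15 - 32)   # pad so key + value + protocol overhead is ~1KB
DURATION_S = 10

def rand_key():
    return "".join(random.choice(string.ascii_lowercase) for _ in range(15))

def worker(counter):
    s = socket.create_connection(SERVER)
    deadline = time.time() + DURATION_S
    while time.time() < deadline:
        key = rand_key()
        s.sendall(b"set %b 0 0 %d\r\n%b\r\n" % (key.encode(), len(VALUE), VALUE))
        s.recv(4096)              # wait for "STORED"
        counter[0] += 1
    s.close()

counts = [[0] for _ in range(50)]    # 50 connections per service process
threads = [threading.Thread(target=worker, args=(c,)) for c in counts]
for t in threads: t.start()
for t in threads: t.join()
print("sets/s:", sum(c[0] for c in counts) / DURATION_S)
```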
Fig. 4. Throughput. Each service process corresponds to 50 connections from the benchmark client.
Fig. 5. Throughput. Each service process corresponds to 100 connections from the benchmark client.
Figures 4 and 5 show the evaluation results, where each service process corresponds to 50 and
100 connections from the benchmark client, respectively. The differences from the mean are less than 5% and thus
are omitted for clarity. It shows that the I/O throughput of CubeX (without surrogate) is very
close to that of MemCube, demonstrating that the asynchronous backup mechanism introduced in
Section 4.3 does not affect normal I/O. This is because for RAM-based key-value stores the normal
I/O performance is bounded by CPU and network latency [44] and the network bandwidth never
becomes the bottleneck. The small variance mainly comes from uncertain conditions like the usage
of CPU, I/O, and network.
In contrast, CubeX with surrogate achieves much higher throughput than both CubeX without
surrogate and MemCube, which shows that the 1-hop writing to the surrogate recovery servers is
more efficient than the 2-hop writing to the backup servers. This is mainly because the software-
based routing (on intermediate nodes) in 2-hop writing suffers from high CPU overhead and pro-
cessing latency [37]. Considering the client connects to the testbed using a 1GbE control network
(discussed in Section 7.1), the available bandwidth (1Gbps) is not fully utilized in this test (1KB ×
90,000/s ≈ 720Mbps), showing the throughput is bounded by the forwarding performance at the
Fig. 6. Server Failure Recovery in CubeX and MemCube (8 × 8 GHC). CubeX, respectively, uses two mech-
anisms of “async” and “async + remap.”
7.4 Remapping
In this subsection, we evaluate the effect of CubeX’s recovery server remapping mechanism. As
introduced in Section 5, when recovering a recovery server failure in our test, for each affected
primary server P the affected sub key space is re-designated to P’s other recovery servers without
data transfer.
The configuration is the same as the previous experiments, and the result is also depicted in
Figure 6. Suppose that server 00 fails as a recovery server on the 8 × 8 GHC network. Consider
one of its 14 primary servers (e.g., 01) originally having 14 recovery servers and 49 backup servers.
Without remapping, the backup data on the old backup servers (10, 20, . . . , 70) of the failed recov-
ery server 00 will be equally assigned to the backup servers of 01’s remaining 13 recovery servers
(11, 21, . . . , 71, 02, 03, . . . , 07), to minimize the recovery time of the primary server 01’s (possible)
future failure. For example, an old backup server 10 originally stores 12/49GB of backup data for
01, half (β = 6/49GB) corresponding to the failed recovery server 00 (and half corresponding to the
other recovery server 11). To recover 01’s recovery server, 10 will remain β/13GB of backup data
(for recovery server 11), transfer 2β/13GB of backup data to 12 (for recovery servers 11 and 02),
transfer 2β/13GB of backup data to 13 (for recovery servers 11 and 03), . . . , and transfer 2β/13GB
of backup data to 17 (for recovery servers 11 and 07). The other six backup servers (20, 30, . . . , 70)
corresponding to 00 are processed similarly, and thus in total we transfer γ = 2β/13 × 6 × 7 GB of
backup data for the primary server 01 of the failed recovery server 00. Since the failed recovery server
00 has 14 primary servers, in total the remapping mechanism saves 14γ ≈ 11.08GB of data transfer.
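This arithmetic can be reproduced in a few lines (assuming, as implied above, 12GB of primary data per primary server):

```python
# Reproduce the data-transfer saving of recovery-server remapping on the
# 8x8 GHC example, assuming 12GB of primary data per primary server.
primary_gb = 12
beta = primary_gb / 49 / 2                   # 6/49 GB held per old backup server for failed 00
transferred_per_old_server = 12 * beta / 13  # six transfers of 2*beta/13 each
gamma = 7 * transferred_per_old_server       # 7 old backup servers per affected primary
saving = 14 * gamma                          # 14 affected primary servers
print(round(saving, 2), "GB")                # ~11.08 GB saved by remapping
```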
CubeX (async) is about 20% slower than CubeX (async + remap), which is mainly because in
CubeX (async) the recovery of recovery servers competes with the primary server recovery for
CPUs, although they do not contend for the recovery bandwidth (because they occur in different
directions [53]).
and for availability. First, as discussed in Reference [44], in a 10,000-server RAM storage system
with 3× replication and two failures/year/server following a Poisson distribution, the one-year data loss
probability is about 10^{-6} when the failure recovery time is one second; and the probability increases
to 10^{-4} when the recovery completes in 10s. Second, as discussed in Reference [7], there are about
1,000 server failures per year in a normal-sized cluster of hundreds of servers. Consequently, for
the 512-server cluster in Table 1, CubeX could achieve a high availability of about four nines
(1 − (3.6 × 1,000)/(365 × 24 × 3,600) ≈ 99.99%), while MemCube has a relatively low availability of about three nines.
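The availability figures follow directly from the per-failure recovery time; the back-of-the-envelope check below is ours, and the 36s case is a hypothetical slower recovery for comparison, not a measured MemCube number:

```python
# Back-of-the-envelope availability from recovery time, assuming ~1,000 server
# failures per year [7], each causing `recovery_s` seconds of unavailability.
FAILURES_PER_YEAR = 1000
SECONDS_PER_YEAR = 365 * 24 * 3600

def availability(recovery_s):
    return 1 - FAILURES_PER_YEAR * recovery_s / SECONDS_PER_YEAR

print(f"{availability(3.6):.4%}")   # 3.6s recovery              -> ~99.99% (four nines)
print(f"{availability(36.0):.4%}")  # hypothetical 36s recovery  -> ~99.89% (three nines)
```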
Note that it is difficult for CubeX to recover several tens or hundreds of GB data within less than
1s, since both failure detection and recovery preparation require at least a few hundred millisec-
onds that are omitted in our simulation.
8 RELATED WORK
In this section, we briefly discuss related work on RAM/disks based storage and computing.
The design of CubeX is inspired by large-scale RAM storage systems (like RAMCloud [44] and
MemCube [53]), Remote RAM based computing techniques (like FaRM [20], Trinity [48], and
MemC3 [21]) and other disk-based storage systems (like Bigtable [14], GFS [24], and URSA [2]).
SkimpyStash [19] is a hybrid key-value store on flash memory and RAM, which uses a hash table
directory in RAM to index key-value pairs stored on flash. HashCache [10] is also targeted at RAM
and flash combined storage system, leveraging an efficient structure to lower the amortized cost
of insertions and updates.
CubeX is inspired by RAMCloud and MemCube. It leverages the glocality (= globality + lo-
cality) to adapt to any cube-based networks (like generalized hypercube (GHC), MDCube [52],
k-ary n-cube [28], and hyperbanyan [22]) and realizes cross-layer optimizations, such as recov-
ery server remapping, asynchronous backup server recovery, and surrogate backup writing, to
improve RAM-based storage performance in both normal I/O operations and failure recovery.
Distributed block storage systems provide a block interface [32, 34] to remote clients via pro-
tocols like iSCSI [9] and AoE (ATA-over-Ethernet) [13]. For example, Petal [34] uses redundant
network storage to provide virtual disks. Salus [51] leverages HDFS [4] and HBase [30] to provide
virtual disk service with ordered-commit semantics, using a two-phase commit protocol. It also
provides prefix semantics when failures occur. pNFS (parallel NFS) [32] exports a block/object/file
interface to local cloud storage. Blizzard [41] is built on FDS [43] and exposes parallelism to virtual
disks with crash consistency guarantees. Blizzard leverages a full bisection bandwidth network to
stripe data aggressively, and utilizes delayed durability semantics to increase the rate at which
clients can issue writes while still achieving crash consistency. URSA [2] is an SSD-HDD-hybrid
block storage system that provides virtual disks, which can be mounted like normal physical ones.
It collaboratively stores primary data on SSDs and backup chunks on HDDs, using a journal as a
buffer to bridge the performance gap between SSDs and HDDs. Block storage systems could be
used as a secondary storage for CubeX.
F4 [42] is a Binary Large OBject (blob) storage system for Facebook's corpus of photos and
videos that need to be reliably stored and quickly accessible. It uses temperature zones to iden-
tify hot/warm blobs and effectively reduces the effective-replication-factor. FDS [43] is a locality-
oblivious blob store built on a full-bisection bandwidth FatTree network. It multiplexes an applica-
tion's aggregate I/O bandwidth across the available throughput. FDS supports fast failure recovery
by simultaneously performing recovery across the network. The design of CubeX could be ported to
these blob storage systems for high (blob) read/write performance.
9 CONCLUSION
CubeX is a network-aware key-value store that supports fast failure recovery on cube-based net-
works. At the core of CubeX is to leverage the glocality (= globality + locality) of cubes. CubeX
also designs cross-layer optimizations for achieving high throughput and low recovery time. It
exploits the globality of cubes to scatter backup data across a large number of disks, and exploits
the locality of cubes to restrict all recovery traffic within the local range.
We plan to improve CubeX in the following aspects. First, since low latency [47] is a primary
advantage of RAM-based storage, in the future CubeX may require a low-latency Ethernet in-
frastructure of 10μs level RTT. Second, as high-bandwidth (40 ∼ 100Gbps) networks [53] and
(10 ∼ 30Gbps) SSDs [6] become practical, we will study how to coordinate the RAM, backup disks,
networks, and CPUs to collaboratively achieve even higher recovery speed. Third, we will in-
corporate recent advances in failure detection, such as latency measurements between servers
in Pingmesh [29] and guided probes for potential failures in Everflow [54], to improve CubeX’s
failure detection. Fourth, some practical issues also need to be considered, for example, design-
ing a cleaner for the logging system, replacing replication with erasure coding, and incorporating
superscalar communication [35].
REFERENCES
[1] AWS Team. Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. Retrieved
from http://aws.amazon.com/message/65648/.
[2] NiceX Lab. Ursa Block Store. Retrieved from http://nicexlab.com/ursa/.
[3] RedisLabs. Redis Official Website. Retrieved from http://redis.io/.
[4] Dhruba Borthakur. HDFS Architecture Guide. Retrieved from https://hadoop.apache.org/docs/r1.2.1/hdfs_design.
html.
[5] SOSP 2011 PC meeting. SOSP 2011 Reviews and Comments on RAMCloud. https://ramcloud.stanford.edu/wiki/pages/
viewpage.action?pageId=8355860SOSP-2011-Reviews-and-comments-on-RAMCloud.
[6] Josh Norem. Samsung SSD 960 EVO (500GB). Retrieved from https://www.pcmag.com/review/358847/samsung-
ssd-960-evo-500gb.
[7] Rich Miller. Failure Rates in Google Data Centers. Retrieved from http://www.datacenterknowledge.com/archives/
2008/05/30/failure-rates-in-google-data-centers/.
[8] Dormando. Memcached Official Website. Retrieved from http://www.memcached.org/.
[9] Stephen Aiken, Dirk Grunwald, Andrew R. Pleszkun, and Jesse Willeke. 2003. A performance analysis of the iSCSI
protocol. In Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies
(MSST’03). IEEE, 123–134.
[10] Ashok Anand, Chitra Muthukrishnan, Steven Kappes, Aditya Akella, and Suman Nath. 2010. Cheap and large CAMs
for high performance data-intensive networked systems. In Proceedings of the USENIX Symposium on Networked
Systems Design and Implementation (NSDI’10). USENIX Association, 433–448. Retrieved from http://www.usenix.org/
events/nsdi10/tech/full_papers/anand.pdf.
[11] David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan.
2009. FAWN: A fast array of wimpy nodes. In Proceedings of the ACM Symposium on Operating Systems Principles
(SOSP’09), Jeanna Neefe Matthews and Thomas E. Anderson (Eds.). ACM, 1–14. Retrieved from http://dblp.uni-trier.
de/db/conf/sosp/sosp2009.html#AndersenFKPTV09.
[12] Antirez. [n.d.]. An update on the memcached/redis benchmark. Retrieved from http://antirez.com/post/update-
on-memcached-redis-benchmark.html.
[13] Ed L. Cashin. 2005. Kernel korner: Ata over ethernet: Putting hard drives on the lan. Linux J. 2005, 134 (2005), 10.
[14] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar
Chandra, Andrew Fikes, and Robert Gruber. 2006. Bigtable: A distributed storage system for structured data. In Pro-
ceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI’06). 205–218.
[15] Vijay Chidambaram, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-
Dusseau. 2013. Optimistic crash consistency. In Proceedings of the 24th ACM Symposium on Operating Systems Prin-
ciples. ACM, 228–243.
[16] Mosharaf Chowdhury, Srikanth Kandula, and Ion Stoica. 2013. Leveraging endpoint flexibility in data-intensive clus-
ters. In Proceedings of the Association for Computing Machinery’s Special Interest Group on Data Communications
(SIGCOMM’13), Dah Ming Chiu, Jia Wang, Paul Barford, and Srinivasan Seshan (Eds.). ACM, 231–242.
[17] Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, and Ion Stoica. 2011. Managing data transfers in
computer clusters with orchestra. In ACM SIGCOMM Computer Communication Review, Vol. 41. ACM, 98–109.
[18] Biplob K. Debnath, Sudipta Sengupta, and Jin Li. 2010. FlashStore: High throughput persistent key-value
store. Proc. VLDB Endow. 3, 2 (2010), 1414–1425. Retrieved from http://dblp.uni-trier.de/db/journals/pvldb/pvldb3.
html#DebnathSL10.
[19] Biplob K. Debnath, Sudipta Sengupta, and Jin Li. 2011. SkimpyStash: RAM space skimpy key-value store on flash-
based storage. In Proceedings of the SIGMOD Conference, Timos K. Sellis, Rene J. Miller, Anastasios Kementsiet-
sidis, and Yannis Velegrakis (Eds.). ACM, 25–36. Retrieved from http://dblp.uni-trier.de/db/conf/sigmod/sigmod2011.
html#DebnathSL11.
[20] Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. 2014. FaRM: Fast remote memory.
In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI’14). 401–414.
[21] Bin Fan, David G. Andersen, and Michael Kaminsky. 2013. MemC3: Compact and concurrent memcache with dumber
caching and smarter hashing. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Imple-
mentation (NSDI’13). 371–384.
[22] Clayton S. Ferner and Kyungsook Y. Lee. 1992. Hyperbanyan networks: A new class of networks for distributed mem-
ory multiprocessors. IEEE Trans. Comput. 41, 3 (1992), 254–261.
[23] Armando Fox. 2002. Toward recovery-oriented computing. In Proceedings of the Conference on Very Large Data Bases
(VLDB’02). 873–876.
[24] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. In Proceedings of the ACM
Symposium on Operating Systems Principles (SOSP’03). 29–43.
[25] Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. 2011. Understanding network failures in data centers: Mea-
surement, analysis, and implications. In Proceedings of the Association for Computing Machinery’s Special Interest
Group on Data Communications (SIGCOMM’11), Srinivasan Keshav, Jörg Liebeherr, John W. Byers, and Jeffrey C.
Mogul (Eds.). ACM, 350–361.
[26] Jim Gray and Gianfranco R. Putzolu. 1987. The 5 minute rule for trading memory for disk accesses and the 10 byte
rule for trading memory for CPU time. In Proceedings of the Association for Computing Machinery Special Interest
Group on Management of Data, Umeshwar Dayal and Irving L. Traiger (Eds.). ACM Press, 395–398.
[27] Albert G. Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David
A. Maltz, Parveen Patel, and Sudipta Sengupta. 2011. VL2: A scalable and flexible data center network. Commun. ACM
54, 3 (2011), 95–104.
[28] Chuanxiong Guo, Guohan Lu, Dan Li, Haitao Wu, Xuan Zhang, Yunfeng Shi, Chen Tian, Yongguang Zhang, and
Songwu Lu. 2009. BCube: A high performance, server-centric network architecture for modular data centers. In
Proceedings of the Association for Computing Machinery’s Special Interest Group on Data Communications (SIG-
COMM’09). 63–74.
[29] Chuanxiong Guo, Lihua Yuan, Dong Xiang, Yingnong Dang, Ray Huang, Dave Maltz, Zhaoyi Liu, Vin Wang, Bin Pang,
Hua Chen et al. 2015. Pingmesh: A large-scale system for data center network latency measurement and analysis. In
ACM SIGCOMM Computer Communication Review, Vol. 45. ACM, 139–152.
[30] Tyler Harter, Dhruba Borthakur, Siying Dong, Amitanand Aiyer, Liyin Tang, Andrea C. Arpaci-Dusseau, and Remzi
H. Arpaci-Dusseau. 2014. Analysis of hdfs under hbase: A facebook messages case study. In Proceedings of the 12th
USENIX Conference on File and Storage Technologies (FAST’14). 199–212.
[31] John H. Hartman and John K. Ousterhout. 1995. The Zebra striped network file system. ACM Trans. Comput. Syst. 13,
3 (1995), 274–310.
[32] Dean Hildebrand and Peter Honeyman. 2005. Exporting storage systems in a scalable manner with pNFS. In Pro-
ceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST’05). IEEE,
18–27.
[33] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. 2010. ZooKeeper: Wait-free coordination for
Internet-scale systems. In Proceedings of the USENIX Annual Technical Conference (ATC’10). 1–14.
[34] Edward K. Lee and Chandramohan A. Thekkath. 1996. Petal: Distributed virtual disks. In ACM SIGPLAN Notices,
Vol. 31. ACM, 84–92.
[35] HuiBa Li, ShengYun Liu, YuXing Peng, DongSheng Li, HangJun Zhou, and XiCheng Lu. 2010. Superscalar communi-
cation: A runtime optimization for distributed applications. Sci. China Info. Sci. 53, 10 (2010), 1931–1946.
[36] Hyeontaek Lim, Dongsu Han, David G. Andersen, and Michael Kaminsky. 2014. MICA: A holistic approach to fast
in-memory key-value storage. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Imple-
mentation (NSDI’14). 429–444.
[37] Guohan Lu, Chuanxiong Guo, Yulong Li, Zhiqiang Zhou, Tong Yuan, Haitao Wu, Yongqiang Xiong, Rui Gao, and
Yongguang Zhang. 2011. ServerSwitch: A programmable and high performance platform for data center networks.
In Proceedings of the (NSDI’11).
[38] Xicheng Lu, Huaimin Wang, and Ji Wang. 2006. Internet-based virtual computing environment (iVCE): Concepts and
architecture. Sci. China Ser. F: Info. Sci. 49, 6 (2006), 681–701.
[39] Xicheng Lu, Huaimin Wang, Ji Wang, and Jie Xu. 2013. Internet-based virtual computing environment: Beyond the
data center as a computer. Future Gen. Comput. Syst. 29, 1 (2013), 309–322.
[40] Jeanna Neefe Matthews, Drew Roselli, Adam M. Costello, Randolph Y. Wang, and Thomas E. Anderson. 1997. Im-
proving the performance of log-structured file systems with adaptive methods. In Proceedings of the ACM Symposium
on Operating Systems Principles (SOSP’97). ACM.
[41] James Mickens, Edmund B. Nightingale, Jeremy Elson, Darren Gehring, Bin Fan, Asim Kadav, Vijay Chidambaram,
Osama Khan, and Krishna Nareddy. 2014. Blizzard: Fast, cloud-scale block storage for cloud-oblivious applications.
In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI’14). 257–273.
[42] Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva
Shankar, Viswanath Sivakumar, Linpeng Tang et al. 2014. f4: Facebook’s warm BLOB storage system. In Proceed-
ings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). 383–398.
[43] Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue. 2012. Flat data-
center storage. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI’12).
[44] Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John K. Ousterhout, and Mendel Rosenblum. 2011. Fast crash
recovery in RAMCloud. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’11). 29–41.
[45] John K. Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish
Mitra, Aravind Narayanan, Guru M. Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan
Stutsman. 2009. The case for RAMClouds: Scalable high-performance storage entirely in DRAM. Operat. Syst. Rev.
43, 4 (2009), 92–105.
[46] Mendel Rosenblum and John K. Ousterhout. 1992. The design and implementation of a log-structured file system.
ACM Trans. Comput. Syst. 10, 1 (1992), 26–52.
[47] Stephen M. Rumble, Diego Ongaro, Ryan Stutsman, Mendel Rosenblum, and John K. Ousterhout. 2011. It’s time for
low latency. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS’11).
[48] Bin Shao, Haixun Wang, and Yatao Li. 2013. Trinity: A distributed graph engine on a memory cloud. In Proceedings
of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 505–516.
[49] Ji-Yong Shin, Mahesh Balakrishnan, Tudor Marian, and Hakim Weatherspoon. 2013. Gecko: Contention-oblivious
disk arrays for cloud storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’13).
285–298.
[50] Michael Vrable, Stefan Savage, and Geoffrey M. Voelker. 2012. BlueSky: A cloud-backed file system for the enterprise.
In Proceedings of the 10th USENIX Conference on File and Storage Technologies. USENIX Association, 19–19.
[51] Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike
Dahlin. 2013. Robustness in the salus scalable block store. In Proceedings of the 10th USENIX Symposium on Networked
Systems Design and Implementation (NSDI’13). 357–370.
[52] Haitao Wu, Guohan Lu, Dan Li, Chuanxiong Guo, and Yongguang Zhang. 2009. MDCube: A high perfor-
mance network structure for modular data center interconnection. In Proceedings of the International Conference
on Emerging Networking Experiments and Technologies (CoNEXT’09), Jörg Liebeherr, Giorgio Ventre, Ernst W.
Biersack, and Srinivasan Keshav (Eds.). ACM, 25–36. Retrieved from http://dblp.uni-trier.de/db/conf/conext/
conext2009.html#WuLLGZ09.
[53] Yiming Zhang, Chuanxiong Guo, Dongsheng Li, Rui Chu, Haitao Wu, and Yongqiang Xiong. 2015. CubicRing: En-
abling one-hop failure detection and recovery for distributed in-memory storage systems. In Proceedings of the 12th
USENIX Symposium on Networked Systems Design and Implementation (NSDI’15). 529–542.
[54] Yibo Zhu, Nanxi Kang, Jiaxin Cao, Albert Greenberg, Guohan Lu, Ratul Mahajan, Dave Maltz, Lihua Yuan, Ming
Zhang, Ben Y. Zhao et al. 2015. Packet-level telemetry in large datacenter networks. In ACM SIGCOMM Computer
Communication Review, Vol. 45. ACM, 479–491.