Demand-Aware Erasure Coding For Distributed Storage Systems
Abstract—Distributed storage systems provide cloud storage services by storing data on commodity storage servers. Conventionally,
data are protected against failures of such commodity servers by replication. Erasure coding consumes less storage overhead than
replication to tolerate the same number of failures and thus has been replacing replication in many distributed storage systems.
However, with erasure coding, the overhead of reconstructing data from failures also increases significantly. Under the ever-changing
workload where data accesses can be highly skewed, it is challenging to deploy erasure coding with appropriate values of parameters
to achieve a good trade-off between storage overhead and reconstruction overhead.
In this paper, we propose Zebra, a framework that encodes data by their demand into multiple tiers that deploy erasure codes with
different values of parameters. Zebra automatically determines the number of such tiers and dynamically assigns erasure codes with
optimal values of parameters into corresponding tiers. With Zebra, a flexible trade-off between storage overhead and reconstruction
overhead is achieved with multiple tiers. When demand changes, Zebra adjusts itself with a marginal amount of network transfer. We
demonstrate that Zebra can work with two representative families of erasure codes in distributed storage systems, Reed-Solomon
codes and local reconstruction codes.
Index Terms—distributed storage system, demand skewness, erasure coding, reconstruction, storage overhead, Reed-Solomon code,
local reconstruction code
1 INTRODUCTION
In a distributed storage system, redundant data must be stored so that server failures can be tolerated without loss of data. The naive and conventional way to store redundant data in a distributed storage system is replication, i.e., saving multiple copies of the same data on different servers. Saving N copies on N servers (N-way replication) can tolerate up to N − 1 server failures without loss of data. However, replication is very expensive in terms of its storage overhead, especially for data at a petabyte scale. For example, with 3-way replication, to store 10 PB of data, we need to spend an additional 20 PB to store the other two copies. In contrast, erasure coding can tolerate the same number of failures with much lower storage overhead. With a (k, r) Reed-Solomon (RS) code, for example, r parity blocks are computed from every k data blocks, such that any r failures among these k + r blocks can be tolerated while the storage overhead is only (k + r)/k.

• Jun Li is with the School of Computing and Information Sciences, Florida International University.
• Baochun Li is with the Department of Electrical and Computer Engineering, University of Toronto.

Fig. 1. CPU and network overhead to reconstruct one block with RS code (r = 2), where each block contains 64 MB.

However, RS codes can incur significantly higher overhead when we need to reconstruct an unavailable block. When a block becomes unavailable after a server failure, we need to reconstruct it on another existing server to maintain
the level of failure tolerance. With a (k = 10, r = 4) RS code, for example, if one block is not available, we need to obtain 10 blocks to reconstruct it. Therefore, when data are not available due to a server failure, a read request of such data needs to be served by a degraded read¹ that reconstructs them from blocks on other available servers. Higher overhead of reconstruction can lead to higher access latency during the degraded read. With typical values of k, the network transfer incurred by reconstruction can be huge: in a Facebook cluster, the daily median of top-of-rack network transfer incurred by reconstruction can be as much as 180 TB [12]. Fig. 1 illustrates the overhead of time and network transfer to reconstruct one block of 64 MB with RS codes². We can see that both the time and the network transfer incurred by reconstruction increase linearly with k. In other words, a smaller value of k means less reconstruction overhead, leading to less network transfer for data reconstruction as well as lower latency for degraded reads.

Therefore, in a distributed storage system, while we desire a small value of k to achieve low reconstruction overhead, it takes a large value of k to save storage overhead. Currently, most distributed storage systems deploy only one erasure code to encode data, optimized either for storage overhead or for reconstruction overhead. However, in practical distributed storage systems, the demand of data can be highly skewed. In Fig. 2a, we show the demand of data in a workload trace measured from a Facebook cluster [14]. This workload contains more than 10^4 files, while we only show the 100 most demanded files in the figure, with all the rest having no more than 11 visits. We can see that a very small portion of data are highly demanded while the rest are barely touched. Even among blocks belonging to the same file, it has also been observed that the demand of such blocks can be skewed when running data analytical jobs [15], [16]. With only one erasure code, we cannot accommodate cold data with low storage overhead while achieving low reconstruction overhead for hot data. Moreover, the demand for data can change dynamically over time. We sort files in the trace by their demand, such that a file with higher overall demand has a lower index. As shown in Fig. 2b, while files No. 1 and 3 have consistent demand over time, the other three files only have transient high demand at some times and no visits at any other time. Therefore, it is also challenging to find an erasure code that can work well adaptively with the ever-changing demand.

In this paper, we propose Zebra, a novel framework for distributed storage systems deploying erasure codes. According to the demand of data, Zebra can split data into multiple tiers such that data in different tiers are encoded with erasure codes with different values of parameters. Although existing works [17], [18], [19] also feature such tiered architectures, they require static configurations of parameters in each tier. In contrast, Zebra offers the flexibility to dynamically configure the parameters of the erasure codes deployed in each tier, and even the number of tiers.

1. As the result of a degraded read can also be used to recover an unavailable block on a replacement server, we assume that there is no or little need to recover a block individually and only consider the reconstruction overhead of the degraded read in this paper.
2. The time of the reconstruction is measured using the zfec library [13] running on an Intel Core i7 processor.

Fig. 2. The demand skewness of files in a Facebook workload trace: (a) demand of the top 100 files; (b) demand of files No. 1, 3, 5, 7, and 9 over time.

By solving geometric programming problems, Zebra determines the parameter values of the erasure codes in each tier, such that hot data can be reconstructed with low overhead and cold data can enjoy low storage overhead at the same time. Meanwhile, Zebra can also assign data to the best fitting tier by their demand, so as to minimize reconstruction overhead. When demand changes, Zebra can dynamically migrate data accordingly into different tiers or even change the parameter values of tiers, while carefully controlling the overhead of the migration. For the hot data in the tier with the highest demand, Zebra can be further extended to achieve better load balance under a high volume of demand. Besides RS codes, we show that Zebra can also be applied with local reconstruction codes [8], another kind of erasure codes for distributed storage systems with lower reconstruction overhead.

We run simulations under various workload traces to evaluate the performance of the Zebra framework. We demonstrate the performance of Zebra working with RS codes and local reconstruction codes. The evaluation results show that Zebra can reduce reconstruction overhead, especially for the hot data (by 89.5%). Moreover, the hot data also see their demand reduced by 63.2% due to the better load balance in Zebra. The cold data, on the other hand, incur less storage overhead to maintain their tolerance against failures. With the ever-changing workload, we demonstrate that the network transfer of migration can be well controlled, such that it occupies no more than 12.6% of the network transfer of demand in the worst case.

2 MOTIVATION AND EXAMPLES

In this section, we present the general idea that motivates the design of the Zebra framework. We assume that in a distributed storage system, data are stored in blocks of 64 MB, where we compute 1 parity block from every 3 data blocks. In other words, a (k = 3, r = 1) RS code is deployed in this distributed storage system.

As a toy example in Fig. 3a, assume that we have 6 data blocks in total, i.e., A1 − A3 and B1 − B3. We can encode A1, A2, and A3 into one parity block P1, and B1, B2, and B3 into the other parity block P2. We then call every three data blocks and the corresponding parity block one stripe, such as A1, A2, A3, and P1. With six data blocks in Fig. 3a, we thus have two stripes. More stripes can be added this way if there are more data blocks. Such two stripes fall into one single tier because they are encoded with the same (3, 1) RS code. In this way, we can tolerate a failure of
any single block, with 1.33x storage overhead. When one of them becomes unavailable, we need to obtain the other three blocks in the same stripe to reconstruct it until it is available on a replacement server. Therefore, the amount of data to be read from disks and transferred through the network during a degraded read is 3 × 64 MB = 192 MB.

Fig. 3. Comparison of data encoded in one and two tiers of RS codes: (a) one tier, with a (3, 1) RS code; (b) two tiers, with a (2, 1) and a (4, 1) RS code. Ai and Bi are data blocks, i = 1, 2, 3. P1 and P2 are parity blocks encoded by the corresponding RS codes. Colored red, A1 and B1 are hot blocks with 100 visits per unit of time, while the other blocks are cold with only 10 visits.

However, as we know that the demand of different blocks can be significantly skewed, we assume that the demand of the 6 data blocks is not equal: A1 and B1 are highly demanded with 100 visits per unit of time, and the other four blocks are visited 10 times per unit of time. Under the Zebra framework, we can then encode these 6 data blocks into two tiers, i.e., we encode the two hot blocks with a (2, 1) RS code and the other four blocks with a (4, 1) RS code. Therefore, there is only one parity block in each stripe, and we still have 2 parity blocks in total, maintaining the same storage overhead as above. Meanwhile, we can still tolerate the failure of any single block. As server failures are independent of the demand of data (especially when the demand of data is balanced across servers), each request has an equal chance to meet a failed server, and such a request will have to be served by a degraded read. Though the cold blocks need to obtain 4 blocks to reconstruct, the hot blocks have dominant demand with much fewer blocks to obtain. Hence, this time we can significantly reduce the average number of blocks to visit per unit of time. On average, per degraded read, we need to read

\[
\frac{(2 \times 2 \times 100 + 4 \times 4 \times 10) \times 64\ \text{MB}}{2 \times 100 + 4 \times 10} = 149.3\ \text{MB},
\]

saving the corresponding disk I/O and network transfer by 22.2%. For the two hot blocks in particular, their reconstruction overhead can simply be reduced from 3 to 2 blocks, i.e., a 33.3% reduction in time, disk I/O, and network transfer.

In this example, we deploy two tiers of RS codes. However, Zebra is not limited to only two tiers. In fact, we can flexibly deploy any number of tiers inside the Zebra framework with different parameter values of RS codes, where the number of tiers and the parameter values can be efficiently calculated according to the demand of data and the requirements of storage overhead and failure tolerance. Zebra can automatically assign data into corresponding tiers by their demand. Besides RS codes, we further show that Zebra can work with local reconstruction codes.

3 RELATED WORK

At the scale of petabyte storage, erasure coding has become more and more attractive to distributed storage systems because of its low storage overhead and high failure tolerance. Hence, many distributed storage systems, such as HDFS [9], [18], OpenStack Swift [5], the Google file system [6], and Windows Azure storage [8], are moving towards or have deployed erasure coding as an alternative to replication, where in most cases RS codes are chosen by these distributed storage systems. However, they all choose only one kind of erasure code with fixed parameters. It is then hard to trade well between the storage overhead and the reconstruction overhead of erasure codes, under a dynamic workload with highly skewed data demand [15], [16], [20].

Traditional RS codes can incur high overhead of reconstruction when some of the data are not available due to failures inside distributed storage systems [12], [21]. There has been growing attention on improving the reconstruction overhead of erasure codes. For example, locally repairable codes [22] can achieve low reconstruction overhead by allowing unavailable data to be reconstructed from a small number of other servers. Similar ideas have been applied in the design of other erasure codes [8], [21], [23], [24], [25]. On the other hand, another family of erasure codes, called regenerating codes, is designed to achieve the optimal network transfer in reconstruction [26]. All these erasure codes, however, are optimized for their own objectives over all encoded data, unaware that the demand of data can be highly skewed. As data with different demand may have different performance objectives, applying one single erasure code over all data may not achieve all their objectives. Different from these erasure codes, in Zebra we propose to dynamically assign data into multiple tiers by their demand so as to achieve a flexible trade-off between storage and reconstruction overhead.

Some distributed storage systems, such as HDFS [27], allow data to be stored under a tiered architecture, where data in different tiers are stored with different erasure codes or replication with preconfigured parameters, and can be automatically migrated between different tiers [17], [18], [19]. However, all these systems require users to configure the parameters of the erasure code in each tier statically, and they cannot adapt themselves well to the ever-changing workload. Besides, as the storage overhead incurred by the corresponding erasure codes differs across tiers, it is hard to control the overall storage overhead. The Zebra framework, however, does not need to specify parameters of each tier or even the number of tiers. According to the demand of data, Zebra can configure itself flexibly, where only the overall storage overhead and failure tolerance need to be manually specified.

Different from the above works, where the reconstruction overhead is evaluated in terms of network traffic or disk I/O, a tree-structured topology can be created, which routes the traffic through the edges of the tree and alleviates the bottleneck of sending data from existing servers to the replacement server [28], [29], [30]. The purpose of such works, instead, is to save the time of reconstruction. Such works can be applied in our framework without affecting the network overhead during reconstruction, and thus we focus on network overhead only in this paper.

4 SYSTEM MODEL

In this paper, we assume that in a distributed storage system, data are stored in blocks with the same size. This
is a common practice in distributed storage systems [3]. Assume that we have N blocks in total, and each block Bi is associated with a demand of di visits per unit of time, i = 1, . . . , N. We also assume that any r block failures should be tolerated without data loss. Hence, if RS codes are deployed, in the Zebra framework each block will be encoded with a (ki, r) RS code, where we call ki the rank of block Bi. For convenience, we let D = (d1, . . . , dN) and K = (k1, . . . , kN). We also want to control the overall storage overhead such that the overall storage space consumed is no more than C times the size of the original data.

In this model, we assume that the erasure codes deployed in the distributed storage system are systematic. In other words, a (k, r) systematic erasure code computes k + r blocks from the original data, in which k blocks are the same as the original data, i.e., data blocks. The other r blocks are known as parity blocks. Such k + r blocks belong to the same stripe and are stored on k + r different servers. From the k data blocks in each stripe, we can always directly obtain any data block without decoding as long as it is available. Hence, we can assume that all demand will go directly to the corresponding data blocks rather than parity blocks, unless the demanded data blocks are unavailable. Moreover, when some data block is not available, only reconstruction, instead of decoding, needs to be performed.

We now use this model to represent the way to encode data in one or multiple tiers with erasure codes. Fig. 3 (without loss of generality, we can rewrite Ai as Bi+3, i = 1, 2, 3) illustrates two examples of this model with systematic RS codes, where N = 6, C = 4/3, and D = {100, 10, 10, 100, 10, 10}. In Fig. 3a, ki = 3 for all i, i.e., all blocks are encoded in one tier with a (3, 1) RS code. On the other hand, in Fig. 3b, we have K = {2, 4, 4, 2, 4, 4} such that the six blocks are encoded into two tiers with a (2, 1) and a (4, 1) RS code.

In this way, we can see that the number of tiers in the model does not need to be explicitly defined, as blocks with the same rank can be categorized into the same tier. Once the rank of each block is equal to each other, for example, it becomes the conventional case of only one tier. Hence, we are not limited to a given number of tiers, and we can easily change the number of tiers when demand changes.

In this paper, our objective is to minimize the average reconstruction overhead of degraded reads, with respect to the constraint of the overall storage overhead C. As shown in Fig. 1, the reconstruction overhead of a block increases linearly with its rank. We assume that each server has the same chance to be unavailable. Thus, the chance of a degraded read should also increase linearly with the demand of the corresponding block. Combining all demand together, we can define the average reconstruction overhead as $\frac{\sum_{i=1}^{N} d_i k_i}{\sum_{i=1}^{N} d_i}$. As the demand D is already given, it is equivalent to minimize $\sum_{i=1}^{N} d_i k_i = D \cdot K$, which we call the overall reconstruction overhead.

Besides the overall reconstruction overhead, we need to control the overall storage overhead, which can be computed with D and K in the model. Since each block has the same size, we assume that the size of each block is 1 for convenience. Thus, the storage space consumed to store block Bi and its parity is $1 + \frac{r}{k_i}$. The sum of the storage space of all blocks is $N + \sum_{i=1}^{N} \frac{r}{k_i}$. Since the total storage space we can use under the constraint of the overall storage overhead C is CN, we can write this constraint as $\sum_{i=1}^{N} \frac{1}{k_i} \le \frac{(C-1)N}{r}$. Therefore, we can solve K to minimize the overall reconstruction overhead with respect to the storage overhead by the following integer geometric programming problem:

\[
\begin{aligned}
\min \quad & D \cdot K && (1)\\
\text{s.t.} \quad & \sum_{i=1}^{N} \frac{1}{k_i} \le \frac{(C-1)N}{r}, && (2)\\
& k_i \in \mathbb{Z}^{+}. && (3)
\end{aligned}
\]
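As a quick sanity check of (1)-(2), the short Python sketch below (our own illustration, not part of the original system) evaluates the objective D · K and the storage constraint for the two encodings of the toy example in Fig. 3, using N = 6, C = 4/3, and r = 1 from above.

# Sanity check of (1)-(2) on the toy example of Fig. 3 (Sec. 2 and Sec. 4).
D = [100, 10, 10, 100, 10, 10]          # demand of B1..B6 (visits per unit time)
K_one_tier = [3, 3, 3, 3, 3, 3]         # Fig. 3a: a single (3, 1) RS code
K_two_tiers = [2, 4, 4, 2, 4, 4]        # Fig. 3b: a (2, 1) and a (4, 1) RS code
N, C, r = 6, 4 / 3, 1

def overall_reconstruction_overhead(D, K):
    return sum(d * k for d, k in zip(D, K))          # objective (1): D . K

def storage_constraint_holds(K, N, C, r):
    return sum(1 / k for k in K) <= (C - 1) * N / r  # constraint (2)

for name, K in [("one tier", K_one_tier), ("two tiers", K_two_tiers)]:
    print(name,
          overall_reconstruction_overhead(D, K),     # 720 vs. 560
          storage_constraint_holds(K, N, C, r))      # True for both: same 1.33x overhead

Both encodings meet the same storage budget, while the two-tier ranks reduce the overall reconstruction overhead from 720 to 560, i.e., the 22.2% saving computed in Sec. 2.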
Typically, a geometric programming problem can be easily converted to a convex optimization problem and solved efficiently [31]. However, there are some other issues that make it challenging to achieve a practical solution by directly solving this problem.

First, in a distributed storage system there can be an extremely large number of data blocks. For example, in HDFS the default block size is 64 MB. If there are 1 PB of data stored in HDFS, there will be over 10^7 blocks in total. No solver of convex optimization problems can solve our model in a reasonable amount of time. Besides, the geometric programming problem in (1)-(3) is an integer programming problem. This also significantly increases the complexity of solving it.

Second, the solution solved from (1)-(3) is an offline solution. In other words, we need to know the demand in advance before we can get the optimal solution, which makes it impractical. We need to find an online algorithm that can solve K in advance of the demand.

Third, given a solution of this problem, we cannot even guarantee that it is feasible. For example, if the solution is K = {8, 8, 8, 8, 8, 8}, we will need to encode 6 blocks with an (8, r) erasure code. This is impossible without rearranging the data blocks into a smaller size. Variable-size blocks, however, will incur significantly more complexity to manage data inside the distributed storage system. In this paper, we retain the assumption of fixed-size blocks and manage to encode data with such solutions without incurring much additional overhead.

In the rest of the paper, we introduce the Zebra framework that solves these practical issues efficiently and then show the encoding scheme under the Zebra framework. We start from introducing Zebra with RS codes, and then extend Zebra to make it work with local reconstruction codes.

5 ZEBRA FRAMEWORK

5.1 Limiting the complexity

In the Zebra framework, we propose a few heuristics to compute the ranks of blocks efficiently. We start from temporarily removing the integer constraint (3) and resolving the complexity issues of the non-integer geometric programming problem in (1)-(2) by studying its properties.

Without loss of generality, we assume that in D, di ≥ dj if i < j. In other words, we sort the elements in D in non-ascending order. In this way, an optimal solution of K in (1)-(2) should also be in non-descending order.

We prove this property by contradiction. Assume that there exist i and j in an optimal solution of (1)-(2) such that i < j (and hence di ≥ dj) but ki > kj. If di > dj, swapping ki and kj yields a solution that still satisfies (2) with a lower overall reconstruction overhead, which is contradictory to the assumption that the original solution is optimal. On the other hand, if di = dj, we can assign a new rank, $\frac{2k_ik_j}{k_i+k_j}$, to both block Bi and Bj to get a lower overall reconstruction overhead, while still satisfying the condition in (2). This is also contradictory to the assumption that the original solution is optimal.

From this property, we can directly get a corollary that ki = kj if di = dj. In other words, if two blocks have the same demand, they will also have the same rank in the optimal solution. Inspired by this property, we can refine the original model to significantly decrease the complexity of solving it, by reducing the number of variables to solve.

First, in this paper, we assume that all blocks of the same file should have the same or similar demand in a distributed storage system. Notice that if all blocks of the same file have the same demand, this step will not hurt the optimality of the solution. In practice, if the demand of blocks in a single file is not the same, we can use the demand of the hottest block of the file as the demand of the file. The intuition of this assumption is that typically a distributed storage system that stores large files will have a distributed data processing system running upon it, such as Hadoop and Spark, which will visit each block of the file distributively. Therefore, once a file is visited, all of its blocks will be visited. Although in some cases [15], [16] only some data in a block will be selected for future processing, leading to different workload in different blocks eventually, we focus on heterogeneous demand at the file level instead of the block level in this paper.

Second, we extend this property by assuming that blocks with similar demand will also have similar ranks. Thus, we can classify the demand of all blocks into discrete categories. The simplest way is to set a parameter t where any demand that falls into the interval (tx − t, tx] will be approximated as tx, ∀x ∈ Z*. Hence, files with similar demand can be temporarily merged into the same one when calculating their ranks, and thus we can further reduce the complexity of solving the model. For the cold data, this is especially useful, as cold data typically occupy a very large portion of all the data, yet with similar demand of very small values. Thus, we can quickly categorize cold data into a few intervals, and then thousands of files can be grouped into a few ones.

We run the simulation on the workload from Facebook that we show in Fig. 2. In this workload, there are 15565 files in total. If we store data into blocks of 64 MB, there will be 1.8 × 10^7 blocks. The result in Fig. 4 shows that we can reduce the number of files to 55 even when t = 1. Notice that when t = 1, we actually do not merge files unless their demand is exactly the same. When we increase the value of t, we can further reduce the complexity of solving K.

Fig. 4. The number of files with various values of t in the Facebook workload.

5.2 Solving the ranks

After the two steps described above, we can refine the model in (1)-(2) such that there are n files, where each file Fi is associated with size wi and demand di. Each file will be encoded with a (ki, r) RS code, and the overall storage overhead should be no more than C. Hence, the problem to solve the optimal K can be redefined as

\[
\begin{aligned}
\min \quad & \sum_{i=1}^{n} w_i d_i k_i && (4)\\
\text{s.t.} \quad & \sum_{i=1}^{n} \frac{w_i}{k_i} \le \frac{(C-1)N}{r}, && (5)\\
& k_i > 0, \ \forall i. && (6)
\end{aligned}
\]

This is still a geometric programming problem. Notice that here we tentatively remove the constraint (3) that each ki must be an integer. In the Zebra framework, we will solve this problem first, and then round ki to a nearby integer. For example, we can always round ki to its ceiling ⌈ki⌉. In this way, we will not break the requirement of storage overhead in (5), while the worst case of additional reconstruction overhead is $\sum_{i=1}^{n} w_i d_i$. Thus the approximation ratio of this ceiling rounding algorithm is $1 + \frac{1}{\min_i k_i}$.

Compared to this naive rounding algorithm, in Zebra we use an iterative rounding algorithm that achieves even lower reconstruction overhead. The idea of this iterative algorithm is, in each round, to solve the geometric programming problem in (4)-(6) and to round the one ki to its nearest positive integer that leads to the minimum change to the overall reconstruction overhead. In the next round, the file corresponding to the ki chosen in the previous round is removed from the problem and the storage overhead it has incurred is also deducted in (5). We keep running this iteration such that in every round one ki will be picked and removed from the model, until there is only one file left and the last ki will be rounded to its ceiling. The details of this algorithm are shown in Fig. 5.

Input: wi and di where i = 1, . . . , N, C > 1, r ∈ Z+, N ∈ Z+
Output: ki where i = 1, . . . , N
 1: S = ∅, T = {1, . . . , N}
 2: R = 0
 3: while T ≠ ∅ do
 4:   solve
        \[
        \begin{aligned}
        \min \quad & \sum_{i \in T} w_i d_i k_i && (7)\\
        \text{s.t.} \quad & \sum_{i \in T} \frac{w_i}{k_i} \le \frac{(C-1)N}{r} - R, && (8)\\
        & k_i > 0, \ \forall i \in T. && (9)
        \end{aligned}
        \]
 5:   ∀i ∈ T, let k'i be the positive integer nearest to ki
 6:   ∀i ∈ T, calculate δi = |wi di (ki − k'i)|
 7:   î = arg min_{i∈T} δi
 8:   R = R + wî / k'î
 9:   kî = k'î (when |T| > 1) or ⌈kî⌉ (when |T| = 1)
10:   S = S + {î}, T = T − {î}
11: end while

Fig. 5. The iterative rounding algorithm used in Zebra.
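For concreteness, the following Python sketch mirrors the relaxed problem (4)-(6) and the iterative rounding of Fig. 5. It is our own illustration rather than the authors' implementation: it assumes the numpy and cvxpy packages (cvxpy's geometric-programming mode, solve(gp=True), handles posynomial objectives such as (4)), and it assumes the remaining storage budget stays positive in every round.

import numpy as np
import cvxpy as cp

def solve_relaxed(w, d, budget):
    """Relaxed GP (4)-(6) over the files still in play:
    minimize sum_i w_i d_i k_i  s.t.  sum_i w_i / k_i <= budget, k_i > 0."""
    k = cp.Variable(len(w), pos=True)
    objective = cp.Minimize(cp.sum(cp.multiply(w * d, k)))
    constraints = [cp.sum(cp.multiply(w, k ** -1.0)) <= budget]
    cp.Problem(objective, constraints).solve(gp=True)
    return k.value

def iterative_rounding(w, d, C, r):
    """Sketch of Fig. 5: fix one rank per round and deduct its storage share."""
    w, d = np.asarray(w, float), np.asarray(d, float)
    N = w.sum()                       # total number of blocks
    T = list(range(len(w)))           # files whose ranks are not yet fixed
    ranks, R = {}, 0.0                # solved integer ranks; storage already used
    while T:
        k = solve_relaxed(w[T], d[T], (C - 1) * N / r - R)
        nearest = np.maximum(1, np.rint(k))           # nearest positive integer (line 5)
        delta = np.abs(w[T] * d[T] * (k - nearest))   # change to objective (4) (line 6)
        j = int(np.argmin(delta))                     # file whose rounding costs least (line 7)
        i = T[j]
        ranks[i] = int(np.ceil(k[j])) if len(T) == 1 else int(nearest[j])
        R += w[i] / ranks[i]                          # deduct its storage in (8)
        T.pop(j)
    return [ranks[i] for i in range(len(w))]

This is only a sketch of the per-round logic; the refined model of Sec. 5.1 keeps the number of files small enough that repeatedly solving the relaxed GP in this loop remains cheap.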
We now run Zebra on the same Facebook workload and illustrate in Fig. 6a the overall reconstruction overhead, which is calculated by (4). Note that in practice not all requests will be served by degraded reads. However, as explained in Sec. 4, the overall reconstruction overhead is linearly proportional to the average reconstruction overhead per degraded read in a given workload, regardless of the actual probability of a degraded read. We assume that any two block failures should be tolerated, i.e., r = 2, and the overall storage overhead should be no more than 1.2x. Compared to a single erasure code, for which we use a (10, 2) RS code in this case to meet the requirement of storage overhead, Zebra with the iterative rounding algorithm can save up to 39.2% of reconstruction overhead. We can also observe that with a smaller value of t, the overall reconstruction overhead can be further saved, as with a smaller t, there will be more files and potentially more tiers.

Fig. 6. Comparisons of the overall reconstruction overhead and the approximation ratio of the two rounding algorithms: (a) overall reconstruction overhead; (b) comparison of the two rounding algorithms in Zebra.

In practice, our observation shows that under a real workload, the approximation ratio of the reconstruction overhead with ranks of blocks solved by Zebra can be even closer to 1. We can see that the performance of the approximated solutions of K is close to the non-integer solution of K (ideal K), where the ceiling rounding algorithm incurs at most 10.0% more reconstruction overhead in the worst case. On the other hand, the iterative rounding algorithm can achieve an even lower approximation ratio, incurring at most 6.0% more reconstruction overhead. The reason is that instead of rounding all ki at the same time, we round different ki iteratively, one per round, in Fig. 5. Therefore, the difference in storage overhead caused by the rounding in one round will be reflected in the next round, reducing the differences in the overall reconstruction overhead of the remaining data. A more detailed comparison of these two rounding algorithms can be found in Fig. 6b, where the iterative rounding algorithm on average incurs 49.1% less additional reconstruction overhead than the ceiling rounding algorithm. The saving tends to be more significant when we have a smaller value of t.

Due to the low number of files in the refined model, the completion time to solve K is swift. In fact, we can always solve K with both of the two rounding algorithms within 2.5 seconds, with any value of t.

5.3 Encoding data in multiple tiers

From the ranks solved above, we can now encode data into multiple tiers. The number of tiers is determined by merging files with the same rank into one tier. We actually encode blocks in each tier with a (ki, r) RS code, where ki is the rank of the file and r is the number of failures to tolerate. In particular, the blocks of rank 1, i.e., ki = 1, will be replicated with 1 + r copies. Since data in each tier are migrated from multiple files, the chance of having a tier with no more than ki blocks is negligible. Inside each tier, we encode every ki blocks into r parity blocks with the (ki, r) RS code, where such ki + r blocks belong to a stripe. All blocks in each stripe will be stored on different servers. If we have remaining blocks or the total number of blocks is less than ki, we will temporarily encode them with an (l, r) RS code, where l is the number of remaining blocks. Since l < ki, this will incur additional storage overhead. However, given the large number of blocks stored in a distributed storage system, this additional storage overhead is marginal.

In terms of the storage overhead of the additional metadata introduced by Zebra, we can store the rank along with the existing metadata of each file. Other metadata, such as r, C, t, and other additional parameters introduced in the rest of this paper, are global information and require only a few bytes to store.

5.4 Balancing load of hot data

So far, when the optimal solution of K solved from (7)-(9) has ki no more than 1, we will use a (1, r) RS code, equivalent to (1 + r)-way replication, to store the corresponding block. As blocks with higher demand will have lower ranks, we can expect that only the hottest blocks will be replicated. This makes perfect sense because the demand is so high that we probably cannot afford the cost to reconstruct them with an RS code if one data block is not available. However, as shown in Fig. 2, the demand of the hot blocks can be much more variable than that of cold blocks. Therefore, even if all of the hottest blocks are stored with (1 + r)-way replication, their demand will still be highly variable.

We notice that in the iterative rounding algorithm, the storage overhead of each file, i.e., 1 + r/ki, is determined by ki. When we choose any integer that is larger than ki, we actually use less storage space than the space assigned to this file in the non-integer solution of (7)-(9). Therefore, if ki ≤ 1, we can store the hot blocks of this file with more copies such that the average load of these hot blocks can be reduced and better balanced, while not affecting the ranks of other blocks. Specifically, the number of copies will be 1 + r⌊1/ki⌋, and thus a block with lower ki will have more copies than r + 1. Meanwhile, the storage overhead of this file still does not exceed the storage space assigned to it by the non-integer solution of (7)-(9).
To implement this load balancing, we only need to change three lines in Fig. 5. If ki < 1, we calculate δi = wi di (1 − ki) in line 6, and R = R + wî⌊1/kî⌋ in line 8. In line 9, we let kî = 1 and meanwhile replicate this file with 1 + r⌊1/kî⌋ copies. If ki ≥ 1, these three lines remain unchanged. In this way, the storage space assigned to the chosen file by the non-integer solution of (7)-(9) in this round is preserved.

We show in Fig. 7 the mean and standard deviation of the demand of blocks of rank 1 under the Facebook workload, with C, the overall storage overhead, no more than 1.5 and 1.8. With the load-balanced replication, we can observe in Fig. 7a that the demand of hot blocks on average can be reduced by 49.8% when the limit of the storage overhead is 1.5, and by 26.9% when the storage overhead increases to C = 1.8. The load of hot blocks can also be significantly better balanced. From Fig. 7b, we can observe that the standard deviation of the demand of all these hot blocks can be reduced by 57.3% (C = 1.5) and 76.4% (C = 1.8).

6 DEPLOYING ZEBRA WITH ONLINE DEMAND

6.1 Online demand

Before we deploy the Zebra framework in any practical scenario, we should be aware that the ranks inside the Zebra framework can only be computed from the demand observed so far, yet they must work well with the demand in the future. To meet this requirement with online demand, the demand used by Zebra is updated at the end of every time interval (one hour in our simulations), where the new estimate combines the previous estimate with the newly measured demand, weighted by a parameter α, and the demand measured in this hour is the most recent input to the estimate.

Fig. 8. Overall reconstruction overhead with demand updated online.

Once again, we run the simulation with the hourly updated demand on the workload from Facebook, with r = 2, C = 1.2, and t = 1. Fig. 8 illustrates the results. Compared to the single RS code, Zebra can work well with online demand, and we can on average save 68.3% of the reconstruction overhead in general. We can observe that in this workload, the transiency is quite significant (from the overhead of the single RS code), and thus with a larger α we can slightly better adapt to the general workload change. In fact, the best choice of α depends on the characteristics of the workload, and we will show the results with more workloads in Sec. 8.
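The exact update rule is not fully legible in this copy; the sketch below therefore assumes a simple exponentially weighted moving average, which is consistent with the role of α described above (α = 1 keeps only the most recent measurement). The function name, the dictionary representation of per-file demand, and the direction of the weighting are our assumptions, not taken from the paper.

def update_demand(prev_estimate, measured, alpha):
    """Assumed demand update between time intervals (a hedged sketch):
    blend the demand measured in the last hour with the previous estimate.
    alpha = 1.0 keeps only the most recent measurement; smaller alpha makes
    the estimate (and hence the ranks) change more conservatively."""
    files = set(prev_estimate) | set(measured)
    return {f: alpha * measured.get(f, 0.0) + (1.0 - alpha) * prev_estimate.get(f, 0.0)
            for f in files}

# Hypothetical hourly measurements for two files, updated with alpha = 0.5.
estimate = {"file-1": 400.0, "file-3": 120.0}
estimate = update_demand(estimate, {"file-1": 380.0, "file-5": 90.0}, alpha=0.5)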
6.2 Data migration

Once the demand changes after each time interval, we will need to update the files whose ranks have changed. This may involve quite a lot of migration overhead, especially network transfer. A naive way to update files with new ranks is to compute new parity blocks with the new RS code and then remove the old parity blocks. To encode r parity blocks from k data blocks with a (k, r) RS code, we need to transfer at least k + r − 1 blocks, by computing all parity blocks on one server and sending r − 1 of them to other servers. Thus, if the rank of a file is updated, the traffic to generate new parity blocks will be even more than the amount of data in this file. Fig. 9 shows the number of blocks that have different ranks after each hour. We can see that there can be 525.9 TB of data to migrate to new RS codes at some hour when α = 1. The migration overhead varies with different values of α, and it is hard to predict (in some hours we have more blocks to migrate when α = 0.75 than when α = 1 or α = 0.5). In other words, we need to generate new parity blocks for almost half of all data. Moreover, when α = 1, the amount of data to migrate can even exceed the amount of data requested by users. This will not only hinder the migration process from finishing quickly, but hurt the performance of data access as well. Hence, though migration is unavoidable to meet the ever-changing demand, our objective is to reduce the network transfer for migration to a marginal level, for any value of α.

Fig. 9. The number of blocks with changed ranks after each hour, for α = 1, 0.75, 0.5, and 0.25.

In this paper, we propose a different way to encode data such that we can control the migration overhead without increasing the overall reconstruction overhead significantly. We use Cauchy RS codes [32], [33] in Zebra, which contain a Cauchy matrix in their generator matrix. With a (k, r) RS code, we have a (k + r) × k matrix G as its generator matrix, such that the encoding operation can be formalized as the multiplication of the generator matrix and the k data blocks, as illustrated in Fig. 10a. In particular, if the first k rows of G are an identity matrix, the corresponding RS code will be systematic. In this way, we can write G as $G = \begin{pmatrix} I \\ \hat{G} \end{pmatrix}$. In a Cauchy RS code, the matrix Ĝ is a Cauchy matrix. A benefit of Cauchy RS codes is that all encoding operations can be converted into XOR operations. More importantly, the Cauchy matrix makes it easy to migrate from one RS code into another RS code with significantly less overhead, since any submatrix of a Cauchy matrix is still a Cauchy matrix (Fig. 10b).

Fig. 10. Construction of the Cauchy RS code and its migration: (a) encoding with a (k = 4, r = 2) Cauchy RS code, computing (D1, D2, D3, D4, P1, P2) by multiplying the generator matrix with the data blocks D1–D4; (b) split of the (4, 2) RS code into (2, 2) RS codes; (d) downgrade from the (2, 2) RS code to the (4, 2) RS code.

Because of this property, we can easily downgrade or upgrade data between an (mk, r) and a (k, r) RS code, m ∈ Z+. We show in Fig. 10c and Fig. 10d how we can migrate between these two RS codes. To downgrade from a (k, r) RS code to an (mk, r) RS code, by applying the property of the Cauchy matrix, we only need to XOR the r parity blocks of the m stripes together. To upgrade from an (mk, r) RS code to a (k, r) RS code, we need to generate the parity blocks of the m − 1 stripes under the (k, r) RS code and XOR the new parity blocks and the existing parity blocks into the parity blocks of the last stripe. We show in Table 1 the network transfer of both upgrade and downgrade, as well as the network transfer of the naive migration scheme. We can see that the Cauchy matrix can help to save network transfer in both cases when $m < 2 + \frac{k-1}{r}$. Apparently, when m = 2 this condition can always be satisfied, and we will use this property to save the migration overhead.

TABLE 1
Network transfer between (mk, r) and (k, r) RS codes.

              without Cauchy matrix     with Cauchy matrix
downgrade     mk + r − 1 blocks         (m − 1)r blocks
upgrade       m(k + r − 1) blocks       (m − 1)(k + 2r − 1) blocks

However, we can rely on the Cauchy matrix only when the rank of the new code is a multiple or a divisor of the rank of the old code. To maximize this effect, we can also change the way we solve ranks. We set kmax, an upper bound of the ranks of all blocks, such that all ki should be divisors of kmax. However, when merging stripes encoded with a (k, r) RS code into one stripe with an (mk, r) RS code, we may also need to move data blocks, because two blocks in different stripes may be stored on the same server. In Zebra, if kmax is set, we store every kmax data blocks on different servers, and compute parity blocks with the corresponding (k, r) RS codes, which will be stored on other servers. Notice that such (k, r) RS codes are constructed by splitting the generator matrix of the (kmax, r) Cauchy RS code, as if they were upgraded from the (kmax, r) RS code before. Therefore, when we need to migrate into an (mk, r) RS code, we will not need to move any data blocks, but only migrate parity blocks as described above. This method can be applied to the upgrade case as well. Apparently, we can maximize the effect of the Cauchy matrix by wisely selecting the value of kmax. For example, when kmax = 16, we have 5 available ranks (1, 2, 4, 8, 16). Thus, when downgrading or upgrading between any neighboring ranks, we can always exploit the Cauchy matrix to save network transfer (with m = 2). For some other values of kmax, some rank may not be an integer multiple of its neighboring rank (e.g., 3 and 4 when kmax = 12), and we will have to remove all existing parity blocks and generate new parity blocks with the new ranks.
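As an illustration of the downgrade path in Table 1 (a sketch of ours, not code from the paper): assuming the (k, r) codes are obtained by splitting the Cauchy generator matrix of the (mk, r) code as described above, the r parity blocks of the merged (mk, r) stripe are simply the bytewise XOR of the corresponding parity blocks of the m smaller stripes, so only (m − 1)r parity blocks need to be transferred and no data blocks are touched.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def downgrade_parities(stripe_parities):
    """Downgrade m stripes of a (k, r) Cauchy RS code into one (mk, r) stripe.

    stripe_parities: a list of m lists, each holding the r parity blocks of one
    (k, r) stripe (as bytes). Under the split-generator construction described
    above, the j-th parity of the merged stripe is the XOR of the j-th parities
    of all m stripes, so only parity blocks are read and combined.
    """
    merged = list(stripe_parities[0])
    for parities in stripe_parities[1:]:
        merged = [xor_blocks(p, q) for p, q in zip(merged, parities)]
    return merged

# Toy usage: m = 2 stripes, r = 2 parity blocks each, 4-byte blocks.
merged = downgrade_parities([[b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"],
                             [b"\x05\x06\x07\x08", b"\x50\x60\x70\x80"]])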
There is also a side effect of kmax, which sets a lower bound on the data reliability: with a higher rank, a stripe contains more blocks and is therefore more likely to suffer more than r concurrent failures. For example, data encoded with a (10, 2) RS code will have lower reliability than data encoded with a (4, 2) RS code, even though they can both tolerate two failures. This can happen in all designs with tiered architectures [17], [18], [19]. By limiting the highest value of k, we can also allow the user to control the data reliability in the worst case.

[Figures: the overall reconstruction overhead and the network transfer of migration (# blocks) for kmax = 12, 16, 20, 24, and no kmax.]
10
use wi and di to denote the size and the demand of file i, Input: wi and di where i = 1, . . . , n, C > 1, r ∈ Z+ , and
i = 1, . . . , n. For each file we encode all blocks in this file N ∈ Z+
with a (ki , li , gi ) local reconstruction code. In other words, Output: li where i = 1, . . . , n
in this file we encode every ki data blocks with the (ki , li , gi ) 1: S = ∅, T = {1, . . . , n}
local reconstruction codes in one stripe, and generate li local 2: R = 0
parity blocks and gi global parity blocks. Assuming that any 3: while T 6= ∅ do
r failures should be tolerated, for all i we can set gi = r − 1. 4: solve the problem
As the demand to data all goes to data blocks, we X
only consider the reconstruction overhead of data blocks. min wi di κi (14)
To apply Zebra to local reconstruction codes, we need to i∈T
X wi r−1
dynamically determine the “rank” of data blocks by their (C − 1 − kmax )N
demand. As the rank is defined as the number of blocks to s.t. ≤ − R (15)
κ
i∈T i
r
visit in the reconstruction in RS codes, in local reconstruction
codes the rank of a data block can be defined as ki /li . Hence κi > 0, ∀i ∈ T (16)
we use the technique in Sec. 6.2 by letting ki = kmax ,
which is naturally an upper bound of ranks of all data 5: for each i ∈ T do
blocks.
PN The overall reconstruction overhead can be written 6: if |T | > 1 then
as i=1 wi di kmax li . 7: let κ0i be the integer that is both nearest to κi
On the other hand, the storage overhead of a local and a divisor of kmax
reconstruction code is 1 + li +g ki . As the value of gi only
i
8: else
depends on the number of failures to tolerate and ki is a 9: let κ0i be the integer that is no less than κi
constant, the overall storage overhead of all files, and also a divisor of kmax
Xn X n 10: end if
li + gi r−1 li
wi 1 + =N 1+ + wi · . 11: calculate δi = |wi di (κi − max(1, κ0i ))|
i=1
k max kmax i=1
k max 12: end for
linearly depends on the inverse of the rank. By letting κi = 13: î = arg mini∈T δi
kmax /li , we can solve the rank of data blocks in each file by 14: if κî ≥ 1 then
w
a geometric programming problem that is similar to (4)-(6). 15: R = R + k0î
î
16: κî = κ0î , li = kmax
κî
n
X 17: else j k
min wi di κi (11) 18: R = R + wî · κ1
i=1 j k î
Xn
(C − 1 − r−1 19: li = kmax · κ1i , κî = 1
wi kmax )N
s.t. ≤ (12) 20: end if
κ
i=1 i
r 21: S = S + {î}, T = T − {î}
κi > 0, ∀i. (13) 22: end while
Based on this problem, we can get the iterative rounding
Fig. 14. The iterative rounding algorithm to solve li of local reconstruc-
algorithm to solve li in each file, as shown in Fig. 14. As tion codes.
local reconstruction codes can be modeled as geometric
programming problems similar to RS codes, this algorithm
is largely based on the iterative rounding algorithm in Fig. 5,
Zebra we can save reconstruction overhead by between
except for the following differences:
49.1% (kmax = 12) and 68.2% (kmax = 30). On the other
First, we always choose li as a divisor of kmax , such that
hand, we can observe in Fig. 16 that with various values
κi is always a divisor of kmax as well. Once a non-integer
of kmax , the traffic incurred in the migration between time
solution κi is obtained after line 4, we will round it to the
intervals can be well controlled, where the total amount of
nearest integer that is a divisor of kmax . In this way, we
traffic occupies no more than 1.2% of the data demand.
can also apply the technique used in Fig. 10 to construct
the generator matrix of the RS codes to compute the local Second, we apply the technique used in Sec. 5.4 to
parity nodes. In other words, in one stripe with kmax data balance the load of hot blocks. When κi is solved to be
blocks, the li generator matrixes that correspond to the li less than 1, we change the way to generate the local parity
local parity blocks can be split from the Cauchy generator blocks in the original local reconstruction
j k codes, by repli-
matrix of a (kmax , 1) RS code, helping us to save traffic in cating each data block with κ1i more copies in line 19. In
the migration when the value of li is updated. Fig. 17, we calculate the mean and the standard deviation
In Fig. 15, we compare the overall reconstruction over- of the demand of such hot blocks with kmax = 24. Similar
head of various values of kmax with one single local re- results as RS codes in Fig. 7 can be observed where both
construction codes, running under the Facebook workload, the mean and the standard deviation of the demand can
with C = 1.2, r = 2, t = 1, α = 0.5, and one hour per be significantly reduced. We also notice that the average
time interval. To meet the requirements of storage overhead demand of hot blocks with local reconstruction codes is
and failure tolerance, we use a (20, 4, 1) local reconstruction much less that that with RS codes in Fig. 7, implying that
codes as the single LRC in Fig. 15. It turns out that with local reconstruction codes allow more blocks to have low
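To make the divisor rounding in lines 7-9 of Fig. 14 concrete, here is a small helper of ours (a sketch, not from the paper) that picks the divisor of kmax nearest to a relaxed solution κi and derives the corresponding number of local parity blocks li = kmax/κi; the numeric values in the example are hypothetical.

def divisors(kmax):
    """All positive divisors of kmax, e.g. kmax = 16 -> [1, 2, 4, 8, 16]."""
    return [x for x in range(1, kmax + 1) if kmax % x == 0]

def round_rank_to_divisor(kappa, kmax, last=False):
    """Lines 7-9 of Fig. 14: round a relaxed rank to a divisor of kmax.
    For the last remaining file (last=True), round up so that the storage
    constraint is not violated; otherwise take the nearest divisor."""
    cands = divisors(kmax)
    if last:
        cands = [x for x in cands if x >= kappa] or [kmax]
        return min(cands)
    return min(cands, key=lambda x: abs(x - kappa))

# Example: kmax = 24, relaxed rank 5.3 -> rank 6, i.e. l = 24 / 6 = 4 local parities.
rank = round_rank_to_divisor(5.3, 24)
l = 24 // rank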
Fig. 15. Comparison of the overall reconstruction overhead of various values of kmax with one single local reconstruction code (LRC), when α = 0.5.

Fig. 17. Comparison of the demand of hot blocks (of rank 1) with and without load balance (LD), when r = 2 and t = 1.

[Figure: ratio of migration traffic to demand traffic, for α = 1, 0.75, 0.5, and 0.25.]

… of kmax that is no less than $\frac{r}{C-1}$, and can hence achieve the …
Fig. 18. The average reconstruction overhead per degraded read versus the top percentage of files (sorted by demand), under overall storage overhead 1.5x and 1.4x: (a) RS codes, comparing Zebra with two static RS codes ((6, 3) and (10, 4)) under the FB1 and FB2 workloads.

… average reconstruction overhead by up to 73.6% (C = 1.5) for the top 15% demanded files in Fig. 18a, and by up to 72.1% when C = 1.4. Comparing RS codes with local reconstruction codes, with the same storage overhead, local reconstruction codes can save the average reconstruction overhead by 80.5% (C = 1.5) and 87.5% (C = 1.4), respectively. This is credited to the low reconstruction overhead of local reconstruction codes.
local reconstruction code, while they (data blocks) both transfer to serve the demand. With various constraints of the
require two blocks to reconstruct. Therefore, with a higher overall storage overhead, we can see that the total migration
kmax , more data can have lower reconstruction overhead. traffic never exceeds 13% of the total demand traffic, thanks
With RS codes, on the other hand, different values of kmax to the Cauchy RS codes used in Zebra. Though more migra-
do not change the reconstruction overhead of data, but tion traffic can be observed with higher storage overhead,
only introduce one more tiers with lower storage overhead. we can see that this increased amount of traffic is marginal.
This tier, as mentioned above, has little demand that cannot Finally, we show the distribution of demand with the
significantly change the overall reconstruction overhead. Zebra framework with Fig. 23. We can observe that in
In Fig. 21, we measure the average storage overhead of files in all time intervals, to compare the storage overhead of data with different demand. This time we sort the files by their demand (from bottom to top) and calculate the average storage overhead of data in each 1% interval of files. For example, data points at 99% on the x-axis indicate the average storage overhead of files whose demand falls in the 99%-100% interval, i.e., the top 1% most demanded files. In fact, 90% of files have very low demand in both FB1 and FB2, and hence they all have very similar storage overhead. Therefore, in Fig. 21 we focus on the top 10% of files. We can see that at most 2% of files have storage overhead higher than the given constraint in Zebra, indicating their extremely high demand. On the other hand, due to the static configuration of the two erasure codes, more data with high demand have to be stored in the cold tier with unnecessarily low storage overhead. Hence, with the two static erasure codes, the storage space cannot be fully utilized, and this also explains why the two static erasure codes incur higher reconstruction overhead.
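The per-interval measurement can be sketched as follows (a minimal sketch with hypothetical names; the input is assumed to be one (demand, storage overhead) pair per file): files are sorted by demand in ascending order, split into 1% intervals, and the storage overhead is averaged within each interval.

# Hypothetical sketch of the per-interval averaging used for a plot like Fig. 21.
def average_overhead_by_interval(files, interval_pct=1):
    # files: list of (demand, storage_overhead) pairs, one per file (assumed input).
    ordered = sorted(files, key=lambda f: f[0])           # sort by demand, ascending
    n = len(ordered)
    points = []
    for start in range(0, 100, interval_pct):
        lo, hi = n * start // 100, n * (start + interval_pct) // 100
        bucket = ordered[lo:hi]                           # files in this 1% interval
        if bucket:
            avg = sum(overhead for _, overhead in bucket) / len(bucket)
            points.append((start + interval_pct, avg))    # x = upper end of interval
    return points

# The point at x = 100 then averages the top 1% most demanded files.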
[Figures: axis labels include "storage overhead", "average demand per block in the 1% interval", and "... to demand traffic"; FB2 panels.]

cannot replicate the same number of copies as RS codes. For

transfer to serve the demand. With various constraints on the overall storage overhead, we can see that the total migration traffic never exceeds 13% of the total demand traffic, thanks to the Cauchy RS codes used in Zebra. Though more migration traffic can be observed with a higher storage overhead, this increase in traffic is marginal.

Finally, we show the distribution of demand under the Zebra framework in Fig. 23. We can observe that in both FB1 and FB2, the load balance in Zebra helps to significantly reduce the peak demand of blocks. With the load balance in Zebra, we achieve 12.1% less demand on average for the top 15% most demanded files with RS codes in FB1, and 34.6% in FB2. With local reconstruction codes, the load on the hottest files can be reduced by 50.9% (FB1). Moreover, the load balance helps to make the demand of blocks more stable among files with different demand. It can also be observed that hotter files enjoy larger savings of demand. From the top 3% to the top 30% most demanded files in Fig. 23b, the saving of demand changes from 63.3% to 20.9%. The reason is easy to infer, as the load balance in Zebra only works for the replicated blocks, i.e., the hottest block of rank 1. Thus, with a higher percentage, we include more cold data, which not only reduces the average demand but also dilutes the load balance achieved by the hot data.
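A rough sketch of this effect (hypothetical names and made-up numbers; not Zebra's implementation): when the rank-1 block of a hot file is replicated, its demand can be spread across the copies, lowering the peak demand that any single block must serve.

# Hypothetical sketch: spreading the demand of the hottest (rank-1) block
# across its replicas and measuring the reduction of peak per-block demand.
def peak_demand_after_replication(block_demands, copies_of_hottest):
    # block_demands[0] is the rank-1 (hottest) block, which may be replicated.
    spread = [block_demands[0] / copies_of_hottest] * copies_of_hottest
    return max(spread + block_demands[1:])

if __name__ == "__main__":
    demands = [900, 60, 25, 15]                    # skewed per-block demand (made up)
    before = max(demands)
    after = peak_demand_after_replication(demands, copies_of_hottest=3)
    print(f"peak demand per block: {before} -> {after:.0f} "
          f"({100 * (before - after) / before:.1f}% saving)")

Files whose rank-1 block is not hot enough to be replicated see no such saving, which is consistent with the smaller savings observed toward the top 30% of files.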
codes, according to their demand. The Zebra framework can achieve a much lower reconstruction overhead for the hot data, while spending less storage space to store the cold data. Zebra can also help to balance the demand, especially for the hot data. With the ever-changing demand, Zebra can update itself accordingly with only a small amount of network transfer.
Jun Li received his Ph.D. degree from the Department of Electrical and Computer Engineering, University of Toronto, in 2017, and his B.S. and M.S. degrees from the School of Computer Science, Fudan University, China, in 2009 and 2012. He is currently an assistant professor in the School of Computing and Information Sciences, Florida International University. His research interests include erasure codes and distributed storage systems.

Baochun Li received his Ph.D. degree from the Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, in 2000. Since then, he has been with the Department of Electrical and Computer Engineering at the University of Toronto, where he is currently a Professor. He has held the Bell Canada Endowed Chair in Computer Engineering since August 2005. His research interests include large-scale distributed systems, cloud computing, peer-to-peer networks, applications of network coding, and wireless networks. He is a member of the ACM and a fellow of the IEEE.