Speeding Up Distributed Machine Learning Using Codes
Lee et al., IEEE Transactions on Information Theory, Vol. 64, No. 3, March 2018
    γ(n) ≜ (cost of unicasting n separate messages) / (cost of multicasting a common message to n workers).   (1)

Clearly, 1 ≤ γ(n) ≤ n: if γ(n) = n, the cost of multicasting is equal to that of unicasting a single message (as in the above example); if γ(n) = 1, there is essentially no advantage of using multicast over unicast.

We now state the main result on Coded Shuffling in the following (informal) theorem.

Theorem 2 (Coded Shuffling): Let α be the fraction of the data matrix that can be cached at each worker, and n be the number of workers. Assume that the advantage of multicasting over unicasting is γ(n). Then, coded shuffling reduces the communication cost by a factor of (α + 1/n)γ(n) compared to uncoded shuffling.

For the formal version of the theorem and its proofs, see Sec. IV-D.

The remainder of this paper is organized as follows. In Sec. II, we provide an extensive review of the related works in the literature. Sec. III introduces the coded matrix multiplication, and Sec. IV introduces the coded shuffling algorithm. Finally, Sec. V presents conclusions and discusses open problems.

II. RELATED WORK

A. Coded Computation and Straggler Mitigation

The straggler problem has been widely observed in distributed computing clusters. Dean and Barroso [5] show that running a computational task at a computing node often involves unpredictable latency due to several factors such as network latency, shared resources, maintenance activities, and power limits. Further, they argue that stragglers cannot be completely removed from a distributed computing cluster. Ananthanarayanan et al. [26] characterize the impact and causes of stragglers that arise due to resource contention,

the output of the asynchronous algorithm can differ from that of a serial execution with an identical number of iterations.

Recently, replication-based approaches have been explored to tackle the straggler problem: by replicating tasks and scheduling the replicas, the runtime of distributed algorithms can be significantly improved [30]–[36]. By collecting outputs of the fast-responding nodes (and potentially canceling all the other slow-responding replicas), such replication-based scheduling algorithms can reduce latency. Lee et al. [35] show that even without replica cancellation, one can still reduce the average task latency by properly scheduling redundant requests. We view these policies as special instances of coded computation: such task replication schemes can be seen as repetition-coded computation. In Sec. III, we describe this connection in detail, and indicate that coded computation can significantly outperform replication (as is usually the case for coding vs. replication in other engineering applications).

Another line of work that is closely related to coded computation is the latency analysis of coded distributed storage systems. Huang et al. [37] and Lee et al. [38] show that the flexibility of erasure-coded distributed storage systems allows for faster data retrieval performance than replication-based distributed storage systems. Joshi et al. [39] show that scheduling redundant requests to an increased number of storage nodes can improve the latency performance, and characterize the resulting storage-latency tradeoff. Sun et al. [40] study the problem of adaptive redundant requests scheduling, and characterize the optimal strategies for various scenarios. Kadhe [41] and Soljanin [42] analyze the latency performance of availability codes, a class of storage codes designed for enhanced availability. Joshi et al. [36] study the cost associated with scheduling of redundant requests, and propose a general scheduling policy that achieves a delicate balance between the latency performance and the cost.
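The latency benefit of redundant requests discussed above can be illustrated in a few lines. The sketch below is our own toy simulation, not from the paper: it assumes shifted-exponential service times (the model used later in Sec. III-D) with unit shift and rate mu, and it ignores the cost of issuing and canceling replicas.

```python
import random

def service_time(mu=1.0):
    # Shifted-exponential toy model: one unit of guaranteed work plus an
    # exponential "straggling" delay with rate mu.
    return 1.0 + random.expovariate(mu)

def redundant_latency(r, mu=1.0):
    # Launch r replicas of the same task and keep only the earliest response.
    return min(service_time(mu) for _ in range(r))

random.seed(1)
N = 50000
means = {r: sum(redundant_latency(r) for _ in range(N)) / N for r in (1, 2, 4)}
print(means)  # theoretical means are 1 + 1/(r*mu): 2.0, 1.5, 1.25
```

With the earliest of r replicas kept, the straggling term is the minimum of r independent exponentials, so the expected latency drops from 1 + 1/mu to 1 + 1/(r*mu): replication trims the random tail but never the constant work.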
We now review some recent works on coded computation, which have been published after our conference publications [1], [2]. In [43], an anytime coding scheme for approximate matrix multiplication is proposed, and it is shown that the proposed scheme can improve the quality of approximation compared with the other existing coded schemes for exact computation. Dutta et al. [44] propose a coded computation scheme called 'Short-Dot'. Short-Dot induces additional sparsity to the encoded matrices at the cost of reduced decoding flexibility, and hence potentially speeds up the computation. Tandon et al. [45] consider the problem of computing gradients in a distributed system, and propose a novel coded computation scheme tailored for computing a sum of functions. In many machine learning problems, the objective function is a sum of per-data loss functions, and hence the gradient of the objective function is the sum of gradients of per-data loss functions. Based on this observation, they propose Gradient Coding, which can reliably compute the exact gradient of any function in the presence of stragglers. While Gradient Coding can be applied to computing gradients of any functions, it usually incurs significant storage and computation overheads. Bitar et al. [46] consider a secure coded computation problem where the input data matrices need to be secured from the workers. They propose a secure computation scheme based on Staircase codes, which can speed up the distributed computation while securing the input data from the workers. Lee et al. [47] consider the problem of large matrix-matrix multiplication, and propose a new coded computation scheme based on product codes. Reisizadehmobarakeh et al. [48] consider the coded computation problem on heterogeneous computing clusters while our work assumes a homogeneous computing cluster. The authors show that by delicately distributing jobs across heterogeneous workers, one can improve the performance of coded computation compared with the symmetric job allocation scheme, which is designed for homogeneous workers in our work. While most of the works focus on the application of coded computation to linear operations, a recent work shows that coding can be used also in distributed computing frameworks involving nonlinear operations [49]. Lee et al. [49] show that by leveraging the multi-core architecture in the worker computing units and "coding across" the multi-core computed outputs, significant (and in some settings unbounded) gains in speed-up in computational time can be achieved between the coded and uncoded schemes.

B. Data Shuffling and Communication Overheads

Distributed learning algorithms on large-scale networked systems have been extensively studied in the literature [50]–[60]. Many of the distributed algorithms that are implemented in practice share a similar algorithmic "anatomy": the data set is split among several cores or nodes, each node trains a model locally, then the local models are averaged, and the process is repeated. While training a model with parallel or distributed learning algorithms, it is common to randomly re-shuffle the data a number of times [29], [61]–[65]. This essentially means that after each shuffling the learning algorithm will go over the data in a different order than before. Although the effects of random shuffling are far from understood theoretically, the large statistical gains have turned it into a common practice. Intuitively, data shuffling before a new pass over the data implies that nodes get a nearly "fresh" sample from the data set, which experimentally leads to better statistical performance. Moreover, bad orderings of the data, known to lead to slow convergence in the worst case [61], [64], [65], are "averaged out". However, the statistical benefits of data shuffling do not come for free: each time a new shuffle is performed, the entire dataset is communicated over the network of nodes. This inevitably leads to performance bottlenecks due to heavy communication.

In this work, we propose to use coding opportunities to significantly reduce the communication cost of some distributed learning algorithms that require data shuffling. Our coded shuffling algorithm is built upon the coded caching scheme by Maddah-Ali and Niesen [66]. Coded caching is a technique to reduce the communication rate in content delivery networks. Mainly motivated by video sharing applications, coded caching exploits the multicasting opportunities between users that request different video files to significantly reduce the communication burden of the server node that has access to the files. Coded caching has been studied in many scenarios such as decentralized coded caching [67], online coded caching [68], hierarchical coded caching for wireless communication [69], and device-to-device coded caching [70]. Recently, Li et al. [71] proposed coded MapReduce that reduces the communication cost in the process of transferring the results of mappers to reducers.

Our proposed approach is significantly different from all related studies on coded caching in two ways: (i) we shuffle the data points among the computing nodes to increase the statistical efficiency of distributed computation and machine learning algorithms; and (ii) we code the data over their actual representation (i.e., over the doubles or floats) unlike the traditional coding schemes over bits. In Sec. IV, we describe how coded shuffling can remarkably speed up the communication phase of large-scale parallel machine learning algorithms, and provide extensive numerical experiments to validate our results.

The coded shuffling problem that we study is related to the index coding problem [72], [73]. Indeed, given a fixed "side information" reflecting the memory content of the nodes, the data delivery strategy for a particular permutation of the data rows induces an index coding problem. However, our coded shuffling framework is different from index coding in at least two significant ways. First, the coded shuffling framework involves multiple iterations of data being stored across all the nodes. Secondly, when the caches of the nodes are updated in coded shuffling, the system is unaware of the upcoming permutations. Thus, the cache update rules need to be designed to target any possible unknown permutation of data in succeeding iterations of the algorithm.

We now review some recent works on coded shuffling, which have been published after our first presentation [1], [2]. Attia and Tandon [74] study the information-theoretic limits of the coded shuffling problem. More specifically, the authors
completely characterize the fundamental limits for the case of 2 workers and the case of 3 workers. Attia and Tandon [75] consider the worst-case formulation of the coded shuffling problem, and propose a two-stage shuffling algorithm. Song and Fragouli [76] propose a new coded shuffling scheme based on pliable index coding. While most of the existing works focus on either coded computation or coded shuffling, one notable exception is [77]. In this work, the authors generalize the original coded MapReduce framework by introducing stragglers to the computation phases. Observing that highly flexible codes are not favorable to coded shuffling while replication codes allow for efficient shuffling, the authors propose an efficient way of coding to mitigate straggler effects as well as reduce the shuffling overheads.

III. CODED COMPUTATION

In this section, we propose a novel paradigm to mitigate the straggler problem. The core idea is simple: we introduce redundancy into subtasks of a distributed algorithm such that the original task's result can be decoded from a subset of the subtask results, treating uncompleted subtasks as erasures. For this specific purpose, we use erasure codes to design coded subtasks.

An erasure code is a method of introducing redundancy to information for robustness to noise [78]. It encodes a message of k symbols into a longer message of n coded symbols such that the original k message symbols can be recovered by decoding a subset of coded symbols [78], [79]. We now show how erasure codes can be applied to distributed computation to mitigate the straggler problem.

A. Coded Computation

A coded distributed algorithm is specified by local functions, local data blocks, decodable sets of indices, and a decoding function: the local functions and data blocks specify the way the original computational task and the input data are distributed across n workers; and the decodable sets of indices and the decoding function are such that the desired computation result can be correctly recovered using the decoding function as long as the local computation results from any of the decodable sets are collected.

The formal definition of coded distributed algorithms is as follows.

Definition 1 (Coded Computation): Consider a computational task f_A(·). A coded distributed algorithm for computing f_A(·) is specified by
• local functions (f_{Ã_i}(·))_{i=1}^n and local data blocks (Ã_i)_{i=1}^n;
• (minimal) decodable sets of indices I ⊂ P([n]) and a decoding function dec(·, ·),
where [n] ≜ {1, 2, ..., n}, and P(·) is the power set of a set. The decodable sets of indices I is minimal: no element of I is a subset of other elements. The decoding function takes a sequence of indices and a sequence of subtask results, and it must correctly output f_A(x) if any decodable set of indices and its corresponding results are given.

A coded distributed algorithm can be run in a distributed computing cluster as follows. Assume that the i-th (encoded) data block Ã_i is stored at the i-th worker for all i. Upon receiving the input argument x, the master node multicasts x to all the workers, and then waits until it receives the responses from any of the decodable sets. Each worker node starts computing its local function when it receives its local input argument, and sends the task result to the master node. Once the master node receives the results from some decodable set, it decodes the received task results and obtains f_A(x).

The algorithm described in Sec. I is an example of coded distributed algorithms: it is a coded distributed algorithm for matrix multiplication that uses an (n, n−1) MDS code. One can generalize the described algorithm using an (n, k) MDS code as follows. For any 1 ≤ k ≤ n, the data matrix A is first divided into k equal-sized submatrices.² Then, by applying an (n, k) MDS code to each element of the submatrices, n encoded submatrices are obtained. We denote these n encoded submatrices by Ã_1, Ã_2, ..., Ã_n. Note that Ã_i = A_i for 1 ≤ i ≤ k if a systematic MDS code is used for the encoding procedure. Upon receiving any k task results, the master node can use the decoding algorithm to decode the k task results. Then, one can find Ax simply by concatenating them.

² If the number of rows of A is not a multiple of k, one can append zero rows to A to make the number of rows a multiple of k.

B. Runtime of Uncoded/Coded Distributed Algorithms

In this section, we analyze the runtime of uncoded and coded distributed algorithms. We first consider the overall runtime of an uncoded distributed algorithm, T_overall^uncoded. Assume that the runtime of each task is identically distributed and independent of others. We denote the runtime of the i-th worker under a computation scheme, say s, by T_i^s. Note that the distributions of T_i's can differ across different computation schemes.

    T_overall^uncoded ≜ T_(n)^uncoded = max{T_1^uncoded, ..., T_n^uncoded},   (2)

where T_(i) is the i-th smallest one in {T_i}_{i=1}^n. From (2), it is clear that a single straggler can slow down the overall algorithm.

A coded distributed algorithm is terminated whenever the master node receives results from any decodable set of workers. Thus, the overall runtime of a coded algorithm is not determined by the slowest worker, but by the first time to collect results from some decodable set in I, i.e.,

    T_overall^coded ≜ T_(I)^coded = min_{i∈I} max_{j∈i} T_j^coded.   (3)

We remark that the runtime of uncoded distributed algorithms (2) is a special case of (3) with I = {[n]}. In the following examples, we consider the runtime of the repetition-coded algorithms and the MDS-coded algorithms.

Example 1 (Repetition Codes): Consider an (n/k)-repetition code where each local task is replicated n/k times. We assume that each group of n/k consecutive workers work on the replicas of one local task. Thus, the decodable sets of indices I are all the minimal sets that have k distinct task results, i.e., I = {1, 2, ..., n/k} × {n/k + 1, n/k + 2, ..., 2n/k} × ... × {n − n/k + 1, n − n/k + 2, ..., n}, where A × B denotes the Cartesian product of sets A and B. Thus,

    T_overall^Repetition-coded = max_{i∈[k]} min_{j∈[n/k]} {T_{(i−1)(n/k)+j}^Repetition-coded}.   (4)

Example 2 (MDS Codes): If one uses an (n, k) MDS code, the decodable sets of indices are the sets of any k indices, i.e., I = {i | i ⊂ [n], |i| = k}. Thus,

    T_overall^MDS-coded = T_(k)^MDS-coded.   (5)

That is, the algorithm's runtime will be determined by the k-th response, not by the n-th response.

C. Probabilistic Model of Runtime

In this section, we analyze the runtime of uncoded/coded distributed algorithms assuming that task runtimes, including times to communicate inputs and outputs, are randomly distributed according to a certain distribution. For analytical purposes, we make a few assumptions as follows. We first assume the existence of the mother runtime distribution F(t): we assume that running an algorithm using a single machine takes a random amount of time T_0, that is a positive-valued, continuous random variable distributed according to F, i.e., Pr(T_0 ≤ t) = F(t). We also assume that T_0 has a probability density function f(t). Then, when the algorithm is distributed into a certain number of subtasks, say ℓ, the runtime distribution of each of the subtasks is assumed to be a scaled distribution of the mother distribution, i.e., Pr(T_i ≤ t) = F(ℓt) for 1 ≤ i ≤ ℓ. Note that we are implicitly assuming a symmetric job allocation scheme, which is the optimal job allocation scheme if the underlying workers have identical computing capabilities, i.e., homogeneous computing nodes are assumed. Finally, the computing times of the k tasks are assumed to be independent of one another.

Remark 1 (Homogeneous Clusters and Heterogeneous Clusters): In this work, we assume homogeneous clusters: that is, all the workers have independent and identically distributed computing time statistics. While our symmetric job allocation is optimal for homogeneous cases, it can be strictly suboptimal for heterogeneous cases. While our work focuses on homogeneous clusters, we refer the interested reader to a recent work [48] for a generalization of our problem setting to that of heterogeneous clusters, for which symmetric allocation strategies are no longer optimal.

We first consider an uncoded distributed algorithm with n (uncoded) subtasks. Due to the assumptions mentioned above, the runtime of each subtask is F(nt). Thus, the runtime distribution of an uncoded distributed algorithm, denoted by F_overall^uncoded(t), is simply [F(nt)]^n.

When repetition codes or MDS codes are used, an algorithm is first divided into k (< n) systematic subtasks, and then n − k coded tasks are designed to provide an appropriate level of redundancy. Thus, the runtime of each task is distributed according to F(kt). Using (4) and (5), one can easily find the runtime distribution of an (n/k)-repetition-coded distributed algorithm, F_overall^Repetition, and the runtime distribution of an (n, k)-MDS-coded distributed algorithm, F_overall^MDS-coded. For an (n/k)-repetition-coded distributed algorithm, one can first find the distribution of

    min_{j∈[n/k]} {T_{(i−1)(n/k)+j}^Repetition-coded},

and then find the distribution of the maximum of k such terms:

    F_overall^Repetition(t) = (1 − [1 − F(kt)]^(n/k))^k.   (6)

The runtime distribution of an (n, k)-MDS-coded distributed algorithm is simply the k-th order statistic:

    F_overall^MDS-coded(t) = ∫_{τ=0}^{t} n k (n−1 choose k−1) f(kτ) F(kτ)^(k−1) [1 − F(kτ)]^(n−k) dτ.   (7)

Remark 2: For the same values of n and k, the runtime distribution of a repetition-coded distributed algorithm strictly dominates that of an MDS-coded distributed algorithm. This can be shown by observing that the decodable sets of the MDS-coded algorithm contain those of the repetition-coded algorithm.

In Fig. 4, we compare the runtime distributions of uncoded and coded distributed algorithms. We compare the runtime distributions of uncoded algorithm, repetition-coded algorithm, and MDS-coded algorithm with n = 10 and k = 5. In Fig. 4a, we use a shifted-exponential distribution as the mother runtime distribution. That is, F(t) = 1 − e^(−(t−1)) for t ≥ 1. In Fig. 4b, we use the empirical task runtime distribution that is measured on an Amazon EC2 cluster.³ Observe that for both cases, the runtime distribution of the MDS-coded distribution has the lightest tail.

³ The detailed description of the experiments is provided in Sec. III-F.

D. Optimal Code Design for Coded Distributed Algorithms: The Shifted-Exponential Case

When a coded distributed algorithm is used, the original task is divided into a fewer number of tasks compared to the case of uncoded algorithms. Thus, the runtime of each task of a coded algorithm, which is F(kt), is stochastically larger than that of an uncoded algorithm, which is F(nt). If the value that we choose for k is too small, then the runtime of each task becomes so large that the overall runtime of the distributed coded algorithm will eventually increase. If k is too large, the level of redundancy may not be sufficient to prevent the algorithm from being delayed by the stragglers.

Given the mother runtime distribution and the code parameters, one can compute the overall runtime distribution of the coded distributed algorithm using (6) and (7). Then, one can optimize the design based on various target metrics, e.g., the expected overall runtime, the 99th percentile runtime, etc. In this section, we show how one can design an optimal coded algorithm that minimizes the expected overall runtime for a shifted-exponential mother distribution. The shifted-exponential distribution strikes a good balance between accuracy and analytical tractability. This model is motivated by the model proposed in [80]: the authors used this distribution to model latency of file queries from cloud storage systems.
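Before the analysis specializes to the shifted-exponential case, the runtime expressions (2), (4), and (5) can be sanity-checked with a short Monte Carlo simulation. This is our own sketch, not the paper's code; it assumes the mother distribution F(t) = 1 − e^(−(t−1)) (i.e., μ = 1) and the n = 10, k = 5 setting of Fig. 4.

```python
import random

def shifted_exp(scale, mu=1.0):
    # Task runtime with CDF F(scale * t) = 1 - exp(-mu * (scale * t - 1)),
    # i.e. (1/scale) * (1 + Exp(mu)): splitting the job scales the time down.
    return (1.0 + random.expovariate(mu)) / scale

def one_trial(n=10, k=5, mu=1.0):
    # Uncoded, eq. (2): wait for the slowest of n subtasks.
    uncoded = max(shifted_exp(n, mu) for _ in range(n))
    # (n/k)-repetition, eq. (4): k groups of n/k replicas each; the job ends
    # when the slowest group's fastest replica finishes.
    rep = max(min(shifted_exp(k, mu) for _ in range(n // k)) for _ in range(k))
    # (n, k)-MDS, eq. (5): the k-th fastest of n workers suffices to decode.
    mds = sorted(shifted_exp(k, mu) for _ in range(n))[k - 1]
    return uncoded, rep, mds

random.seed(0)
N = 20000
totals = [0.0, 0.0, 0.0]
for _ in range(N):
    for j, t in enumerate(one_trial()):
        totals[j] += t
mean_uncoded, mean_rep, mean_mds = (s / N for s in totals)
# Closed forms for n=10, k=5, mu=1 give about 0.393, 0.428, 0.329 respectively.
print(mean_uncoded, mean_rep, mean_mds)
```

The empirical means track the closed-form averages derived in the next subsection, and the MDS-coded scheme has the smallest mean, matching the light tail seen in Fig. 4.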
Fig. 4. Runtime distributions of uncoded/coded distributed algorithms. We plot the runtime distributions of uncoded/coded distributed algorithms. For the uncoded algorithms, we use n = 10, and for the coded algorithms, we use n = 10 and k = 5. In (a), we plot the runtime distribution when the runtimes of tasks are distributed according to the shifted-exponential distribution. Indeed, the curves in (a) are analytically obtainable: see Sec. III-D for more details. In (b), we use the empirical task runtime distribution measured on an Amazon EC2 cluster.

The shifted-exponential distribution is the sum of a constant and an exponential random variable, i.e.,

    Pr(T_0 ≤ t) = 1 − e^(−μ(t−1)), ∀t ≥ 1,   (8)

where the exponential rate μ is called the straggling parameter.

With this shifted-exponential model, we first characterize a lower bound on the fundamental limit of the average runtime.

Proposition 1: The average runtime of any distributed algorithm, in a distributed computing cluster with n workers, is lower bounded by 1/n.

Proof: One can show that the average runtime of any distributed algorithm strictly decreases if the mother runtime distribution is replaced with a deterministic constant 1. Thus, the optimal average runtime with this deterministic mother distribution serves as a strict lower bound on the optimal average runtime with the shifted-exponential mother distribution. The constant mother distribution implies that stragglers do not

H_n ≈ log n and H_{n−k} ≈ log(n − k). We first note that the expected value of the maximum of n independent exponential random variables with rate μ is H_n/μ. Thus, the average runtime of an uncoded distributed algorithm is

    E[T_overall^uncoded] = (1/n)(1 + (1/μ) log n) = Θ(log n / n).   (9)

For the average runtime of an (n/k)-repetition-coded distributed algorithm, we first note that the minimum of n/k independent exponential random variables with rate μ is distributed as an exponential random variable with rate (n/k)μ. Thus,

    E[T_overall^Repetition-coded] = (1/k)(1 + (k/(nμ)) log k) = Θ(log n / n).   (10)

Finally, we note that the expected value of the k-th order statistic of n independent exponential random variables of rate μ is (H_n − H_{n−k})/μ. Therefore,

    E[T_overall^MDS-coded] = (1/k)(1 + (1/μ) log(n/(n−k))) = Θ(1/n).   (11)

Using these closed-form expressions of the average runtime, one can easily find the optimal value of k that achieves the optimal average runtime. The following lemma characterizes the optimal repetition code for the repetition-coded algorithms and their runtime performances.

Lemma 1 (Optimal Repetition-Coded Distributed Algorithms): If μ ≥ 1, the average runtime of an (n/k)-repetition-coded distributed algorithm, in a distributed computing cluster with n workers, is minimized by setting k = n, i.e., not replicating tasks. If μ = 1/v for some integer v > 1, the average runtime is minimized by setting k = μn, and the corresponding minimum average runtime is (1/(nμ))(1 + log(nμ)).

Proof: It is easy to see that (10) as a function of k has a unique extreme point. By differentiating (10) with respect to k and equating it to zero, we have k = μn. Thus, if μ ≥ 1, one should set k = n; if μ = 1/v < 1 for some integer v, one should set k = μn.

The above lemma reveals that the optimal repetition-coded distributed algorithm can achieve a lower average runtime than the uncoded distributed algorithm if μ < 1; however, the optimal repetition-coded distributed algorithm still suffers from the factor of Θ(log n), and cannot achieve the order-optimal performance. The following lemma, on the other hand, shows that the optimal MDS-coded distributed algorithm can achieve the order-optimal average runtime performance.

Lemma 2 (Optimal MDS-Coded Distributed Algorithms): The average runtime of an (n, k)-MDS-coded distributed algorithm, in a distributed computing cluster with n workers, can be minimized by setting k = k* where

    k* = (1 + 1/W_{−1}(−e^(−μ−1))) n,   (12)
Proof: Differentiating (11) with respect to k and equating the derivative to zero yields

    (1/k)(1 + (1/μ) log(n/(n−k))) = (1/μ)(1/(n−k)).

By setting k = αn, we have (1/α)(1 + (1/μ) log(1/(1−α))) = (1/μ)(1/(1−α)), which implies

    μ + 1 = 1/(1−α) − log(1/(1−α)).

By defining β = 1/(1−α) and exponentiating both the sides, we have e^(μ+1) = e^β/β. Note that the solution of e^x/x = t, t ≥ e and x ≥ 1 is x = −W_{−1}(−1/t).⁴ Thus, β = −W_{−1}(−e^(−μ−1)). By plugging the above equation into the definition of β, the claim is proved.

We plot nT* and k*μ as functions of μ in Fig. 5.

⁴ W_{−1}(x), the lower branch of the Lambert W function evaluated at x, is the unique solution of te^t = x and t ≤ −1.

computed in a distributed way by computing partial sums at different worker nodes and then adding all the partial sums at the master node. This distributed algorithm is an uncoded distributed algorithm: in each round, the master node needs to wait for all the task results in order to compute the gradient.⁵

⁵ Indeed, one may apply another coded computation scheme called Gradient Coding [45], which was proposed after our conference publications. By applying Gradient Coding to this algorithm, one can achieve straggler tolerance but at the cost of significant computation and storage overheads. More precisely, it incurs Θ(n) larger computation and storage overheads in order to protect the algorithm from Θ(n) stragglers. Later in this section, we will show that our coded computation scheme, which is tailor-designed for linear regression, incurs Θ(1) overheads to protect the algorithm from Θ(n) stragglers.
Fig. 6. Illustration of a coded gradient descent approach for linear regression. The coded gradient descent computes a gradient of the objective function
using coded matrix multiplication twice: in each iteration, it first computes Ax(t) as depicted in (a) and (b), and then computes AT (Ax(t) − y) as depicted
in (c) and (d).
Thus, the runtime of each update iteration is determined by minimum storage overhead per node is a n1 -fraction of the
the slowest response among all the worker nodes. data matrix, the relative storage overhead of the coded gradient
We now propose the coded gradient descent, a coded dis- descent algorithm is at least about factor of 2, if k1 n and
tributed algorithm for linear regression problems. Note that in each iteration, the following two matrix-vector multiplications are computed:

Ax(t) and A^T(Ax(t) − y) ≜ A^T z(t). (16)

In Sec. III-A, we proposed the MDS-coded distributed algorithm for matrix multiplication. Here, we apply the algorithm twice to compute these two multiplications in each iteration. More specifically, for the first matrix multiplication, we choose 1 ≤ k1 < n and use an (n, k1)-MDS-coded distributed algorithm for matrix multiplication to encode the data matrix A. Similarly, for the second matrix multiplication, we choose 1 ≤ k2 < n and use an (n, k2)-MDS-coded distributed algorithm to encode the transpose of the data matrix. Denoting the i-th row-split of A by A_i and the i-th column-split by A^(i), the i-th worker stores both A_i and A^(i). In the beginning of each iteration, the master node multicasts x(t) to the worker nodes, each of which computes the local matrix multiplication for Ax(t) and sends the result to the master node. Upon receiving any k1 task results, the master node can start decoding the result, obtain Ax(t), and form z(t) = Ax(t) − y. The master node now multicasts z(t) to the workers, and the workers compute local matrix multiplications for A^T z(t). Finally, the master node can decode A^T z(t) as soon as it receives any k2 task results, and can proceed to the next iteration. Fig. 6 illustrates the protocol with k1 = k2 = n − 1.

Remark 4 (Storage Overhead of the Coded Gradient Descent): The coded gradient descent requires each node to store a (1/k1 + 1/k2 − 1/(k1 k2))-fraction of the data matrix. As the

F. Experimental Results

In order to see the efficacy of coded computation, we implement the proposed algorithms and test them on an Amazon EC2 cluster. We first obtain the empirical distribution of task runtime, in order to observe how frequently stragglers appear in our testbed, by measuring round-trip times between the master node and each of 10 worker instances on an Amazon EC2 cluster. Each worker computes a matrix-vector multiplication and passes the computation result to the master node, and the master node measures round-trip times that include both computation time and communication time. Each worker repeats this procedure 500 times, and we obtain the empirical distribution of round-trip times across all the worker nodes.

In Fig. 7, we plot the histogram and complementary CDF (CCDF) of the measured computing times; the average round-trip time is 0.11 seconds, and the 95th percentile latency is 0.20 seconds, i.e., roughly five out of a hundred tasks are roughly two times slower than the average task. Assuming the probability of a worker being a straggler is 5%, if one runs an uncoded distributed algorithm with 10 workers, the probability of not seeing any straggler is only about 60% (0.95^10 ≈ 0.60), so the algorithm is slowed down by a factor of more than 2 with probability about 40%. Thus, this observation strongly emphasizes the necessity of an efficient straggler mitigation algorithm. In Fig. 4a, we plot the runtime distributions of uncoded/coded distributed algorithms using this empirical distribution as the
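The two-stage coded iteration described above (an (n, k1)-MDS code for Ax(t), then an (n, k2)-MDS code for A^T z(t)) can be sketched in simulation. This is an illustrative sketch, not the authors' implementation: the Vandermonde encoder, the in-process "workers", and all names below are our assumptions.

```python
# Sketch of one coded gradient-descent iteration (Sec. III-A applied twice):
# stage 1 recovers A x(t) from any k1 of n worker replies, the master forms
# z(t) = A x(t) - y, and stage 2 recovers A^T z(t) from any k2 replies.
import numpy as np

def mds_encoder(n, k):
    # Vandermonde matrix with distinct nodes: any k of its n rows are
    # invertible, which is exactly the MDS property needed for decoding.
    return np.vander(np.arange(1.0, n + 1.0), k, increasing=True)

def coded_matvec(blocks, E, v, responders):
    # Worker i stores the encoded block sum_j E[i, j] * blocks[j] and
    # returns its product with v; the master decodes from any k replies.
    k = E.shape[1]
    replies = [sum(E[i, j] * blocks[j] for j in range(k)) @ v
               for i in responders[:k]]
    decoded = np.linalg.solve(E[responders[:k], :], np.array(replies))
    return decoded.reshape(-1)  # stack the k decoded sub-results in order

n, k1, k2 = 6, 4, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
y = rng.standard_normal(8)
x = np.zeros(8)

row_blocks = np.split(A, k1, axis=0)    # encodes A, for computing A x
col_blocks = np.split(A.T, k2, axis=0)  # encodes A^T, for computing A^T z
E1, E2 = mds_encoder(n, k1), mds_encoder(n, k2)

fast = np.array([0, 2, 3, 5])           # any k1 = k2 responders suffice
z = coded_matvec(row_blocks, E1, x, fast) - y   # z(t) = A x(t) - y
grad = coded_matvec(col_blocks, E2, z, fast)    # A^T z(t)
assert np.allclose(grad, A.T @ (A @ x - y))     # matches the uncoded result
x = x - 0.05 * grad                             # gradient-descent update
```

In the actual protocol the master would multicast x(t) and z(t) to the workers; here both stages run in one process, and `fast` stands in for the first responders, with the remaining workers treated as stragglers.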
Fig. 8. Comparison of parallel matrix multiplication algorithms. We compare various parallel matrix multiplication algorithms: block, column-partition, row-partition, and coded (row-partition) matrix multiplication. We implement the four algorithms using OpenMPI and test them on an Amazon EC2 cluster of 25 instances. We measure the average and the 95th percentile runtimes of the algorithms. Plotted in (a) and (b) are the results with m1-small instances, and in (c) and (d) are the results with c1-medium instances.
C. Example

The following example illustrates the coded shuffling scheme.

Example 3: Let n = 3. Recall that worker node i needs to obtain A(S_i ∩ C̄_i) for the next iteration of the algorithm. Consider i = 1. The data rows in S_1 ∩ C̄_1 are stored either exclusively in C_2 or C_3 (i.e., in C_{2} or C_{3}), or stored in both C_2 and C_3 (i.e., in C_{2,3}). The transmitted message consists of 4 parts:
• (Part 1) M_{1,2} = A(S_1 ∩ C_{2}) + A(S_2 ∩ C_{1}),
• (Part 2) M_{1,3} = A(S_1 ∩ C_{3}) + A(S_3 ∩ C_{1}),
• (Part 3) M_{2,3} = A(S_2 ∩ C_{3}) + A(S_3 ∩ C_{2}), and
• (Part 4) M_{1,2,3} = A(S_1 ∩ C_{2,3}) + A(S_2 ∩ C_{1,3}) + A(S_3 ∩ C_{1,2}).

We show that worker node 1 can recover the data rows that it does not store, i.e., A(S_1 ∩ C̄_1). First, observe that node 1 stores S_2 ∩ C_{1}. Thus, it can recover A(S_1 ∩ C_{2}) using part 1 of the message, since A(S_1 ∩ C_{2}) = M_{1,2} − A(S_2 ∩ C_{1}). Similarly, node 1 recovers A(S_1 ∩ C_{3}) = M_{1,3} − A(S_3 ∩ C_{1}). Finally, from part 4 of the message, node 1 recovers A(S_1 ∩ C_{2,3}) = M_{1,2,3} − A(S_2 ∩ C_{1,3}) − A(S_3 ∩ C_{1,2}).

D. Main Results

We now present the main result of this section, which characterizes the communication rate of the coded scheme. Let p = (s − q/n) / (q − q/n).

Theorem 3 (Coded Shuffling Rate): Coded shuffling achieves communication rate

R_c = q/(np)^2 · [(1 − p)^{n+1} + (n − 1)p(1 − p) − (1 − p)^2] (18)

(in number of data rows transmitted per iteration from the master node), which is significantly smaller than R_u in (17). The reduction in communication rate is illustrated in Fig. 10 for n = 50 and q = 1000 as a function of s/q, where 1/n ≤ s/q ≤ 1. For instance, when s/q = 0.1, the communication overhead for data-shuffling is reduced by more than 81%. Thus, at a very low storage overhead for caching, the algorithm can be significantly accelerated.

Before we present the proof of the theorem, we briefly compare our main result with similar results shown in [66] and [86]. Our coded shuffling algorithm is related to the coded caching problem [66], since one can design the right cache update rule to reduce the communication rate for an unknown demand or permutation of the data rows. A key difference, though, is that the coded shuffling algorithm is run over many iterations of the machine learning algorithm. Thus, the right cache update rule is required to guarantee the opportunity of coded transmission at every iteration. Furthermore, the coded shuffling problem has some connections to coded MapReduce [86], as both algorithms mitigate the communication bottlenecks in distributed computation and machine learning. However, coded shuffling enables coded transmission of raw data by leveraging the extra memory space available at each node, while coded MapReduce enables coded transmission of processed data in the shuffling phase of the MapReduce algorithm by cleverly introducing redundancy in the computation of the mappers.

We now prove Theorem 3.

Proof: To find the transmission rate of the coded scheme, we first need to find the cardinality of the sets S_i^{t+1} ∩ C_I^t for I ⊂ [n] and i ∉ I. To this end, we first find the probability that a random data row, r, belongs to C_I^t. Denote this probability by Pr(r ∈ C_I^t). Recall the cache content distribution at iteration t: q/n rows of cache j are stored with S_j^t, and the other s − q/n rows are stored uniformly at random. Thus, we can compute Pr(r ∈ C_I^t) as follows:

Pr(r ∈ C_I^t)
= Σ_{i=1}^{n} Pr(r ∈ C_I^t | r ∈ S_i^t) Pr(r ∈ S_i^t) (19)
= (1/n) Σ_{i=1}^{n} Pr(r ∈ C_I^t | r ∈ S_i^t) (20)
= (1/n) Σ_{i∈I} Pr(r ∈ C_I^t | r ∈ S_i^t) (21)
= (1/n) Σ_{i∈I} ((s − q/n)/(q − q/n))^{|I|−1} (1 − (s − q/n)/(q − q/n))^{n−|I|} (22)
= (|I|/n) p^{|I|−1} (1 − p)^{n−|I|}. (23)
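The 81% figure quoted for Theorem 3 can be checked numerically. A minimal sketch, with one assumption: Eq. (17) is not shown in this excerpt, so the uncoded rate is taken to be R_u = q(1 − p)(1 − 1/n), the expected number of needed-but-uncached rows summed over all n workers, which is consistent with the reduction quoted above.

```python
# Numerical check of the coded shuffling rate for n = 50, q = 1000, s/q = 0.1.
# R_u below is an assumption (Eq. (17) is not shown in this excerpt): the
# expected number of needed-but-uncached rows across all n workers.
n, q, s = 50, 1000, 100
p = (s - q / n) / (q - q / n)            # p as defined before Theorem 3

R_c = q / (n * p) ** 2 * (
    (1 - p) ** (n + 1) + (n - 1) * p * (1 - p) - (1 - p) ** 2
)                                        # Eq. (18)
R_u = q * (1 - p) * (1 - 1 / n)          # assumed uncoded rate

print(f"R_c = {R_c:.1f}, R_u = {R_u:.1f}, saving = {1 - R_c / R_u:.1%}")
```

With these numbers, R_c ≈ 170.7 rows versus R_u = 900 rows per iteration, a reduction of about 81%, matching the claim above.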
framework that can significantly speed up existing distributed algorithms, by introducing redundancy through codes into the computation. Further, we propose Coded Shuffling that can significantly reduce the heavy price of data-shuffling, which is required for achieving high statistical efficiency in distributed machine learning algorithms. Our preliminary experimental results validate the power of our proposed schemes in effectively curtailing the negative effects of system bottlenecks, and attaining significant speedups of up to 40%, compared to the current state-of-the-art methods.

There exists a whole host of theoretical and practical open problems related to the results of this paper. For coded computation, instead of the MDS codes, one could achieve different tradeoffs by employing another class of codes. Then, although matrix multiplication is one of the most basic computational blocks in many analytics, it would be interesting to leverage coding for a broader class of distributed algorithms.

For coded shuffling, convergence analysis of distributed machine learning algorithms under shuffling is not well understood. As we observed in the experiments, shuffling significantly reduces the number of iterations required to achieve a target reliability, but missing is a rigorous analysis that compares the convergence performances of algorithms with shuffling or without shuffling. Further, the trade-offs between bandwidth, storage, and the statistical efficiency of the distributed algorithms are not well understood. Moreover, it is not clear how far our achievable scheme, which achieves a bandwidth reduction gain of Θ(1/n), is from the fundamental limit of communication rate for coded shuffling. Therefore, finding an information-theoretic lower bound on the rate of coded shuffling is another interesting open problem.

REFERENCES

[1] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Speeding up distributed machine learning using codes," presented at the Neural Inf. Process. Syst. Workshop Mach. Learn. Syst., Dec. 2015.
[2] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Speeding up distributed machine learning using codes," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2016, pp. 1143–1147.
[3] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," in Proc. 2nd USENIX Workshop Hot Topics Cloud Comput. (HotCloud), 2010, p. 95. [Online]. Available: https://www.usenix.org/conference/hotcloud-10/spark-cluster-computing-working-sets
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proc. 6th Symp. Oper. Syst. Design Implement. (OSDI), 2004, pp. 137–150. [Online]. Available: http://www.usenix.org/events/osdi04/tech/dean.html
[5] J. Dean and L. A. Barroso, "The tail at scale," Commun. ACM, vol. 56, no. 2, pp. 74–80, Feb. 2013.
[6] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," IEEE Trans. Inf. Theory, vol. 56, no. 9, pp. 4539–4551, Sep. 2010.
[7] K. V. Rashmi, N. B. Shah, and P. V. Kumar, "Optimal exact-regenerating codes for distributed storage at the MSR and MBR points via a product-matrix construction," IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 5227–5239, Aug. 2011.
[8] C. Suh and K. Ramchandran, "Exact-repair MDS code construction using interference alignment," IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1425–1442, Mar. 2011.
[9] I. Tamo, Z. Wang, and J. Bruck, "MDS array codes with optimal rebuilding," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Aug. 2011, pp. 1240–1244.
[10] V. R. Cadambe, C. Huang, S. A. Jafar, and J. Li. (2011). "Optimal repair of MDS codes in distributed storage via subspace interference alignment." [Online]. Available: https://arxiv.org/abs/1106.1250
[11] D. S. Papailiopoulos, A. G. Dimakis, and V. R. Cadambe, "Repair optimal erasure codes through Hadamard designs," in Proc. 49th Annu. Allerton Conf. Commun., Control, Comput. (Allerton), 2011, pp. 1382–1389.
[12] P. Gopalan, C. Huang, H. Simitci, and S. Yekhanin, "On the locality of codeword symbols," IEEE Trans. Inf. Theory, vol. 58, no. 11, pp. 6925–6934, Nov. 2012.
[13] F. Oggier and A. Datta, "Self-repairing homomorphic codes for distributed storage systems," in Proc. IEEE INFOCOM, Apr. 2011, pp. 1215–1223.
[14] D. S. Papailiopoulos, J. Luo, A. G. Dimakis, C. Huang, and J. Li, "Simple regenerating codes: Network coding for cloud storage," in Proc. IEEE INFOCOM, Mar. 2012, pp. 2801–2805.
[15] J. Han and L. A. Lastras-Montano, "Reliable memories with subline accesses," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2007, pp. 2531–2535.
[16] C. Huang, M. Chen, and J. Li, "Pyramid codes: Flexible schemes to trade space for access efficiency in reliable data storage systems," in Proc. 6th IEEE Int. Symp. Netw. Comput. Appl. (NCA), Jul. 2007, pp. 79–86.
[17] D. S. Papailiopoulos and A. G. Dimakis, "Locally repairable codes," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2012, pp. 2771–2775.
[18] G. M. Kamath, N. Prakash, V. Lalitha, and P. V. Kumar. (2012). "Codes with local regeneration." [Online]. Available: https://arxiv.org/abs/1211.1932
[19] A. S. Rawat, O. O. Koyluoglu, N. Silberstein, and S. Vishwanath, "Optimal locally repairable and secure codes for distributed storage systems," IEEE Trans. Inf. Theory, vol. 60, no. 1, pp. 212–236, Jan. 2014.
[20] N. Prakash, G. M. Kamath, V. Lalitha, and P. V. Kumar, "Optimal linear codes with a local-error-correction property," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2012, pp. 2776–2780.
[21] N. Silberstein, A. S. Rawat, and S. Vishwanath, "Error resilience in distributed storage via rank-metric codes," in Proc. 50th Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Monticello, IL, USA, 2012, pp. 1150–1157.
[22] C. Huang et al., "Erasure coding in Windows Azure storage," in Proc. USENIX Annu. Tech. Conf. (ATC), Jun. 2012, pp. 15–26.
[23] M. Sathiamoorthy et al., "XORing elephants: Novel erasure codes for big data," Proc. VLDB Endowment, vol. 6, no. 5, pp. 325–336, 2013.
[24] K. V. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran, "A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster," in Proc. USENIX HotStorage, Jun. 2013.
[25] K. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran, "A hitchhiker's guide to fast and efficient data reconstruction in erasure-coded data centers," in Proc. ACM Conf. SIGCOMM, 2014, pp. 331–342.
[26] G. Ananthanarayanan et al., "Reining in the outliers in Map-Reduce clusters using Mantri," in Proc. 9th USENIX Symp. Oper. Syst. Des. Implement. (OSDI), 2010, pp. 265–278. [Online]. Available: http://www.usenix.org/events/osdi10/tech/full_papers/Ananthanarayanan.pdf
[27] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proc. 8th USENIX Symp. Oper. Syst. Des. Implement. (OSDI), 2008, pp. 29–42. [Online]. Available: http://www.usenix.org/events/osdi08/tech/full_papers/zaharia/zaharia.pdf
[28] A. Agarwal and J. C. Duchi, "Distributed delayed stochastic optimization," in Proc. 25th Annu. Conf. Neural Inf. Process. Syst. (NIPS), 2011, pp. 873–881. [Online]. Available: http://papers.nips.cc/paper/4247-distributed-delayed-stochastic-optimization
[29] B. Recht, C. Re, S. Wright, and F. Niu, "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," in Proc. 25th Annu. Conf. Neural Inf. Process. Syst. (NIPS), 2011, pp. 693–701.
[30] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, "Effective straggler mitigation: Attack of the clones," in Proc. 10th USENIX Symp. Netw. Syst. Des. Implement. (NSDI), 2013, pp. 185–198. [Online]. Available: https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/ananthanarayanan
[31] N. B. Shah, K. Lee, and K. Ramchandran, "When do redundant requests reduce latency?" in Proc. 51st Annu. Allerton Conf. Commun., Control, Comput., 2013, pp. 731–738. [Online]. Available: http://dx.doi.org/10.1109/Allerton.2013.6736597
[32] D. Wang, G. Joshi, and G. W. Wornell, "Efficient task replication for fast response times in parallel computation," ACM SIGMETRICS, vol. 42, no. 1, pp. 599–600, 2014.
[33] K. Gardner, S. Zbarsky, S. Doroudi, M. Harchol-Balter, and E. Hyytiä, "Reducing latency via redundant requests: Exact analysis," ACM SIGMETRICS, vol. 43, no. 1, pp. 347–360, 2015.
[34] M. Chaubey and E. Saule, "Replicated data placement for uncertain scheduling," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshop (IPDPS), May 2015, pp. 464–472. [Online]. Available: http://dx.doi.org/10.1109/IPDPSW.2015.50
[35] K. Lee, R. Pedarsani, and K. Ramchandran, "On scheduling redundant requests with cancellation overheads," in Proc. 53rd Annu. Allerton Conf. Commun., Control, Comput., Oct. 2015, pp. 1279–1290.
[36] G. Joshi, E. Soljanin, and G. Wornell, "Efficient redundancy techniques for latency reduction in cloud systems," ACM Trans. Model. Perform. Eval. Comput. Syst., vol. 2, no. 2, pp. 12:1–12:30, Apr. 2017. [Online]. Available: http://doi.acm.org/10.1145/3055281
[37] L. Huang, S. Pawar, H. Zhang, and K. Ramchandran, "Codes can reduce queueing delay in data centers," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2012, pp. 2766–2770.
[38] K. Lee, N. B. Shah, L. Huang, and K. Ramchandran, "The MDS queue: Analysing the latency performance of erasure codes," IEEE Trans. Inf. Theory, vol. 63, no. 5, pp. 2822–2842, May 2017.
[39] G. Joshi, Y. Liu, and E. Soljanin, "On the delay-storage trade-off in content download from coded distributed storage systems," IEEE J. Sel. Areas Commun., vol. 32, no. 5, pp. 989–997, May 2014.
[40] Y. Sun, Z. Zheng, C. E. Koksal, K.-H. Kim, and N. B. Shroff. (2015). "Provably delay efficient data retrieving in storage clouds." [Online]. Available: https://arxiv.org/abs/1501.01661
[41] S. Kadhe, E. Soljanin, and A. Sprintson, "When do the availability codes make the stored data more available?" in Proc. 53rd Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Sep. 2015, pp. 956–963.
[42] S. Kadhe, E. Soljanin, and A. Sprintson, "Analyzing the download time of availability codes," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2015, pp. 1467–1471.
[43] N. Ferdinand and S. Draper, "Anytime coding for distributed computation," presented at the 54th Annu. Allerton Conf. Commun., Control, Comput., Monticello, IL, USA, 2016.
[44] S. Dutta, V. Cadambe, and P. Grover, "Short-dot: Computing large linear transforms distributedly using coded short dot products," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2092–2100.
[45] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis. (2016). "Gradient coding." [Online]. Available: https://arxiv.org/abs/1612.03301
[46] R. Bitar, P. Parag, and S. E. Rouayheb. (2017). "Minimizing latency for secure distributed computing." [Online]. Available: https://arxiv.org/abs/1703.01504
[47] K. Lee, C. Suh, and K. Ramchandran, "High-dimensional coded matrix multiplication," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2017, pp. 1–2.
[48] A. Reisizadehmobarakeh, S. Prakash, R. Pedarsani, and S. Avestimehr. (2017). "Coded computation over heterogeneous clusters." [Online]. Available: https://arxiv.org/abs/1701.05973
[49] K. Lee, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Coded computation for multicore setups," presented at the ISIT, Jun. 2017.
[50] D. P. Bertsekas, Nonlinear Programming. Belmont, MA, USA: Athena Scientific, 1999.
[51] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.
[52] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, Jan. 2011.
[53] R. Bekkerman, M. Bilenko, and J. Langford, Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge, U.K.: Cambridge Univ. Press, 2011.
[54] J. C. Duchi, A. Agarwal, and M. J. Wainwright, "Dual averaging for distributed optimization: Convergence analysis and network scaling," IEEE Trans. Autom. Control, vol. 57, no. 3, pp. 592–606, Mar. 2012.
[55] J. Chen and A. H. Sayed, "Diffusion adaptation strategies for distributed optimization and learning over networks," IEEE Trans. Signal Process., vol. 60, no. 8, pp. 4289–4305, Aug. 2012.
[56] J. Dean et al., "Large scale distributed deep networks," in Proc. 26th Annu. Conf. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1232–1240. [Online]. Available: http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks
[57] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, "Distributed GraphLab: A framework for machine learning and data mining in the cloud," Proc. VLDB Endowment, vol. 5, no. 8, pp. 716–727, 2012.
[58] T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. J. Franklin, and M. I. Jordan, "MLbase: A distributed machine-learning system," in Proc. 6th Biennial Conf. Innov. Data Syst. Res. (CIDR), Jan. 2013, p. 2. [Online]. Available: http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper118.pdf
[59] E. R. Sparks et al., "MLI: An API for distributed machine learning," in Proc. IEEE 13th Int. Conf. Data Mining (ICDM), 2013, pp. 1187–1192. [Online]. Available: http://dx.doi.org/10.1109/ICDM.2013.158
[60] M. Li et al., "Scaling distributed machine learning with the parameter server," in Proc. 11th USENIX Symp. Oper. Syst. Des. Implement. (OSDI), 2014, pp. 583–598. [Online]. Available: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu
[61] B. Recht and C. Ré, "Parallel stochastic gradient algorithms for large-scale matrix completion," Math. Program. Comput., vol. 5, no. 2, pp. 201–226, 2013.
[62] L. Bottou, "Stochastic gradient descent tricks," in Neural Networks: Tricks of the Trade, 2nd ed., 2012, pp. 421–436. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-35289-8_25
[63] C. Zhang and C. Ré, "DimmWitted: A study of main-memory statistical analytics," Proc. VLDB Endowment, vol. 7, no. 12, pp. 1283–1294, 2014.
[64] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. (2015). "Why random reshuffling beats stochastic gradient descent." [Online]. Available: https://arxiv.org/abs/1510.08560
[65] S. Ioffe and C. Szegedy. (2015). "Batch normalization: Accelerating deep network training by reducing internal covariate shift." [Online]. Available: https://arxiv.org/abs/1502.03167
[66] M. A. Maddah-Ali and U. Niesen, "Fundamental limits of caching," IEEE Trans. Inf. Theory, vol. 60, no. 5, pp. 2856–2867, May 2014.
[67] M. A. Maddah-Ali and U. Niesen, "Decentralized coded caching attains order-optimal memory-rate tradeoff," IEEE/ACM Trans. Netw., vol. 23, no. 4, pp. 1029–1040, Aug. 2014. [Online]. Available: http://dx.doi.org/10.1109/TNET.2014.2317316
[68] R. Pedarsani, M. A. Maddah-Ali, and U. Niesen, "Online coded caching," in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2014, pp. 1878–1883. [Online]. Available: http://dx.doi.org/10.1109/ICC.2014.6883597
[69] N. Karamchandani, U. Niesen, M. A. Maddah-Ali, and S. Diggavi, "Hierarchical coded caching," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2014, pp. 2142–2146.
[70] M. Ji, G. Caire, and A. F. Molisch, "Fundamental limits of distributed caching in D2D wireless networks," in Proc. IEEE Inf. Theory Workshop (ITW), Sep. 2013, pp. 1–5. [Online]. Available: http://dx.doi.org/10.1109/ITW.2013.6691247
[71] S. Li, M. A. Maddah-Ali, and S. Avestimehr, "Coded MapReduce," presented at the 53rd Annu. Allerton Conf. Commun., Control, Comput., Monticello, IL, USA, 2015.
[72] Y. Birk and T. Kol, "Coding on demand by an informed source (ISCOD) for efficient broadcast of different supplemental data to caching clients," IEEE Trans. Inf. Theory, vol. 52, no. 6, pp. 2825–2830, Jun. 2006.
[73] Z. Bar-Yossef, Y. Birk, T. S. Jayram, and T. Kol, "Index coding with side information," IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1479–1494, Mar. 2011.
[74] M. A. Attia and R. Tandon, "Information theoretic limits of data shuffling for distributed learning," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Dec. 2016, pp. 1–6.
[75] M. A. Attia and R. Tandon, "On the worst-case communication overhead for distributed data shuffling," in Proc. 54th Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Sep. 2016, pp. 961–968.
[76] L. Song and C. Fragouli. (2017). "A pliable index coding approach to data shuffling." [Online]. Available: https://arxiv.org/abs/1701.05540
[77] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, "A unified coding framework for distributed computing with straggling servers," in Proc. IEEE Globecom Workshops (GC Wkshps), Washington, DC, USA, 2016, pp. 1–6.
[78] T. M. Cover and J. A. Thomas, Elements of Information Theory. Hoboken, NJ, USA: Wiley, 2012.
[79] S. Lin and D. J. Costello, Error Control Coding, vol. 2. Englewood Cliffs, NJ, USA: Prentice-Hall, 2004.
[80] G. Liang and U. C. Kozat, "TOFEC: Achieving optimal throughput-delay trade-off of cloud storage using erasure codes," in Proc. IEEE Conf. Comput. Commun. (INFOCOM), Apr. 2014, pp. 826–834. [Online]. Available: http://dx.doi.org/10.1109/INFOCOM.2014.6848010
[81] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[82] X. Meng et al., "MLlib: Machine learning in Apache Spark," J. Mach. Learn. Res., vol. 17, no. 1, pp. 1235–1241, 2016.
[83] Open MPI: Open Source High Performance Computing. Accessed on Nov. 25, 2015. [Online]. Available: http://www.open-mpi.org
[84] StarCluster. Accessed on Nov. 25, 2015. [Online]. Available: http://star.mit.edu/cluster/
[85] BLAS (Basic Linear Algebra Subprograms). Accessed on Nov. 25, 2015. [Online]. Available: http://www.netlib.org/blas/
[86] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, "Fundamental tradeoff between computation and communication in distributed computing," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2016, pp. 1814–1818.
[87] D. Halperin, S. Kandula, J. Padhye, P. Bahl, and D. Wetherall, "Augmenting data center networks with multi-gigabit wireless links," ACM SIGCOMM Comput. Commun. Rev., vol. 41, no. 4, pp. 38–49, Aug. 2011.
[88] Y. Zhu et al., "Cutting the cord: A robust wireless facilities network for data centers," in Proc. 20th Annu. Int. Conf. Mobile Comput. Netw., 2014, pp. 581–592.
[89] M. Y. Arslan, I. Singh, S. Singh, H. V. Madhyastha, K. Sundaresan, and S. V. Krishnamurthy, "Computing while charging: Building a distributed computing infrastructure using smartphones," in Proc. 8th Int. Conf. Emerg. Netw. Experim. Technol., 2012, pp. 193–204.

Dimitris Papailiopoulos is an Assistant Professor of Electrical and Computer Engineering at the University of Wisconsin-Madison and a Faculty Fellow of the Grainger Institute for Engineering. Between 2014 and 2016, Papailiopoulos was a postdoctoral researcher in EECS at UC Berkeley and a member of the AMPLab. His research interests span machine learning, coding theory, and distributed algorithms, with a current focus on coordination-avoiding parallel machine learning and the use of erasure codes to speed up distributed computation. Dimitris earned his Ph.D. in ECE from UT Austin in 2014, under the supervision of Alex Dimakis. In 2015, he received the IEEE Signal Processing Society Young Author Best Paper Award.