
1514 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 64, NO. 3, MARCH 2018

Speeding Up Distributed Machine Learning Using Codes

Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran, Fellow, IEEE

Abstract— Codes are widely used in many engineering applications to offer robustness against noise. In large-scale systems, there are several types of noise that can affect the performance of distributed machine learning algorithms—straggler nodes, system failures, or communication bottlenecks—but there has been little interaction cutting across codes, machine learning, and distributed systems. In this paper, we provide theoretical insights on how coded solutions can achieve significant gains compared with uncoded ones. We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling. For matrix multiplication, we use codes to alleviate the effect of stragglers and show that if the number of homogeneous workers is n, and the runtime of each subtask has an exponential tail, coded computation can speed up distributed matrix multiplication by a factor of log n. For data shuffling, we use codes to reduce communication bottlenecks, exploiting the excess in storage. We show that when a constant fraction α of the data matrix can be cached at each worker, and n is the number of workers, coded shuffling reduces the communication cost by a factor of (α + 1/n)γ(n) compared with uncoded shuffling, where γ(n) is the ratio of the cost of unicasting n messages to n users to multicasting a common message (of the same size) to n users. For instance, γ(n) ≈ n if multicasting a message to n users is as cheap as unicasting a message to one user. We also provide experimental results corroborating our theoretical gains of the coded algorithms.

Index Terms— Algorithm design and analysis, channel coding, distributed computing, distributed databases, encoding, machine learning algorithms, multicast communication, robustness, runtime.

Fig. 1. Conceptual diagram of the phases of distributed computation. The algorithmic workflow of distributed (potentially iterative) tasks can be seen as receiving input data, storing them in distributed nodes, communicating data around the distributed network, and then computing locally a function at each distributed node. The main bottlenecks in this execution (communication, stragglers, system failures) can all be abstracted away by incorporating a notion of delays between these phases, denoted by Δ boxes.

Manuscript received October 18, 2016; revised May 24, 2017 and July 18, 2017; accepted July 19, 2017. Date of publication August 4, 2017; date of current version February 15, 2018. This work was partly supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00694, Coding for High-Speed Distributed Networks), the Brain Korea 21 Plus Project, and NSF CIF grant (No. 1703678, Foundations of coding for modern distributed computing). This paper was presented in part at the 2015 Neural Information Processing Systems Workshop on Machine Learning Systems [1] and the 2016 IEEE International Symposium on Information Theory [2].
K. Lee is with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea (e-mail: [email protected]).
M. Lam and K. Ramchandran are with the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, CA 94720 USA (e-mail: [email protected]; [email protected]).
R. Pedarsani is with the Department of Electrical and Computer Engineering, University of California at Santa Barbara, Santa Barbara, CA 93106 USA (e-mail: [email protected]).
D. Papailiopoulos is with the Department of Electrical and Computer Engineering, University of Wisconsin–Madison, Madison, WI 53706 USA (e-mail: [email protected]).
Communicated by P. Sadeghi, Associate Editor for Coding Techniques.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIT.2017.2736066
0018-9448 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Yunnan University. Downloaded on March 27,2023 at 04:28:48 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

IN RECENT years, the computational paradigm for large-scale machine learning and data analytics has shifted towards massively large distributed systems, comprising individually small and unreliable computational nodes (low-end, commodity hardware). Specifically, modern distributed systems like Apache Spark [3] and computational primitives like MapReduce [4] have gained significant traction, as they enable the execution of production-scale tasks on data sizes of the order of petabytes. However, it is observed that the performance of a modern distributed system is significantly affected by anomalous system behavior and bottlenecks [5], i.e., a form of "system noise". Given the individually unpredictable nature of the nodes in these systems, we are faced with the challenge of securing fast and high-quality algorithmic results in the face of uncertainty.

In this work, we tackle this challenge using coding theoretic techniques. The role of codes in providing resiliency against noise has been studied for decades in several other engineering contexts, and is part of our everyday infrastructure (smartphones, laptops, WiFi and cellular systems, etc.). The goal of our work is to apply coding techniques to blueprint robust distributed systems, especially for distributed machine learning algorithms. The workflow of distributed machine learning algorithms in a large-scale system can be decomposed into three functional phases: a storage, a communication, and a computation phase, as shown in Fig. 1. In order to develop and deploy sophisticated solutions and tackle large-scale problems in machine learning, science, engineering, and commerce, it is important to understand and optimize novel and complex trade-offs across the multiple dimensions of

computation, communication, storage, and the accuracy of results. Recently, codes have begun to transform the storage layer of distributed systems in modern data centers under the umbrella of regenerating and locally repairable codes for distributed storage [6]–[21], which are also having a major impact on industry [22]–[25].

In this paper, we explore the use of coding theory to remove bottlenecks caused during the other phases: the communication and computation phases of distributed algorithms. More specifically, we identify two core blocks relevant to the communication and computation phases that we believe are key primitives in a plethora of distributed data processing and machine learning algorithms: matrix multiplication and data shuffling.

For matrix multiplication, we use codes to leverage the plethora of nodes and alleviate the effect of stragglers, i.e., nodes that are significantly slower than average. We show analytically that if there are n workers having identically distributed computing time statistics that are exponentially distributed, the optimal coded matrix multiplication is Θ(log n)¹ times faster than the uncoded matrix multiplication on average.

¹For any two sequences f(n) and g(n): f(n) = Θ(g(n)) if there exist positive constants c1 and c2 such that c1 g(n) ≤ f(n) ≤ c2 g(n); f(n) = o(g(n)) if lim_{n→∞} f(n)/g(n) = 0.

Data shuffling is a core element of many machine learning applications, and is well known to improve the statistical performance of learning algorithms. We show that codes can be used in a novel way to trade off excess in available storage for reduced communication cost for data shuffling done in parallel machine learning algorithms. We show that when a constant fraction of the data matrix can be cached at each worker, and n is the number of workers, coded shuffling reduces the communication cost by a factor of Θ(γ(n)) compared to uncoded shuffling, where γ(n) is the ratio of the cost of unicasting n messages to n users to multicasting a common message (of the same size) to n users. For instance, γ(n) ≈ n if multicasting a message to n users is as cheap as unicasting a message to one user.

We would like to remark that a major innovation of our coding solutions is that they are woven into the fabric of the algorithmic design, and coding/decoding is performed over the representation field of the input data (e.g., floats or doubles). In sharp contrast to most coding applications, we do not need to "re-factor code" and modify the distributed system to accommodate our solutions; it is all done seamlessly in the algorithmic design layer, an abstraction that we believe is much more impactful as it is located "higher up" in the system layer hierarchy compared to traditional applications of coding that need to interact with the stored and transmitted "bits" (e.g., as is the case for coding solutions for the physical or storage layer).

Overview of The Main Results

We now provide a brief overview of the main results of this paper. The following toy example illustrates the main idea of Coded Computation. Consider a system with three worker nodes and one master node, as depicted in Fig. 2. The goal is to compute a matrix multiplication AX for data matrix A ∈ R^{q×r} and input matrix X ∈ R^{r×s}. The data matrix A is divided into two submatrices A1 ∈ R^{q/2×r} and A2 ∈ R^{q/2×r} and stored in node 1 and node 2, as shown in Fig. 2. The sum of the two submatrices is stored in node 3. After the master node transmits X to the worker nodes, each node computes the matrix multiplication of the stored matrix and the received matrix X, and sends the computation result back to the master node. The master node can compute AX as soon as it receives any two computation results.

Fig. 2. Illustration of Coded Matrix Multiplication. Data matrix A is partitioned into 2 submatrices: A1 and A2. Node W1 stores A1, node W2 stores A2, and node W3 stores A1 + A2. Upon receiving X, each node multiplies X with the stored matrix, and sends the product to the master node. Observe that the master node can always recover AX upon receiving any 2 products, without needing to wait for the slowest response. For instance, consider a case where the master node has received A1X and (A1 + A2)X. By subtracting A1X from (A1 + A2)X, it can recover A2X and hence AX.

Coded Computation designs parallel tasks for a linear operation using erasure codes such that its runtime is not affected by up to a certain number of stragglers. Matrix multiplication is one of the most basic linear operations and is the workhorse of a host of machine learning and data analytics algorithms, e.g., gradient descent based algorithms for regression problems, power-iteration like algorithms for spectral analysis and graph ranking applications, etc. Hence, we focus on the example of matrix multiplication in this paper. With coded computation, we will show that the runtime of the algorithm can be significantly reduced compared to that of other uncoded algorithms. The main result on Coded Computation is stated in the following (informal) theorem.

Theorem 1 (Coded Computation): If the number of workers is n, and the runtime of each subtask has an exponential tail, the optimal coded matrix multiplication is Θ(log n) times faster than the uncoded matrix multiplication.

For the formal version of the theorem and its proof, see Sec. III-D.

We now overview the main results on coded shuffling. Consider a master-worker setup where a master node holds the entire data set. The generic machine learning task that we wish to optimize is the following: 1) the data set is randomly permuted and partitioned in batches at the master; 2) the master sends the batches to the workers; 3) each worker uses its batch and locally trains a model; 4) the local models are averaged at the master and the process is repeated. To reduce communication overheads between master and workers, Coded Shuffling exploits i) the locally cached data points of previous passes and ii) the "transmission strategy" of the master node.

We illustrate the basics of Coded Shuffling with a toy example. Consider a system with two worker nodes and one master node. Assume that the data set consists of 4 batches A1, . . . , A4, which are stored across two workers as shown in Fig. 3. The sole objective of the master is to transmit A3 to the


first worker and A2 to the second. For this purpose, the master node can simply multicast a coded message A2 + A3 to the worker nodes since the workers can decode the desired batches using the stored batches. Compared to the naïve (or uncoded) shuffling scheme in which the master node transmits A2 and A3 separately, this new shuffling scheme can save 50% of the communication cost, speeding up the overall machine learning algorithm. The Coded Shuffling algorithm is a generalization of the above toy example, which we explain in detail in Sec. IV.

Fig. 3. Illustration of Coded Shuffling. Data matrix A is partitioned into 4 submatrices: A1 to A4. Before shuffling, worker W1 has A1 and A2 and worker W2 has A3 and A4. The master node can send A2 + A3 in order to shuffle the data stored at the two workers.

Note that the above example assumes that multicasting a message to all workers costs exactly the same as unicasting a message to one of the workers. In general, we capture the advantage of using multicasting over unicasting by defining γ(n) as follows:

γ(n) ≜ (cost of unicasting n separate msgs to n workers) / (cost of multicasting a common msg to n workers).  (1)

Clearly, 1 ≤ γ(n) ≤ n: if γ(n) = n, the cost of multicasting is equal to that of unicasting a single message (as in the above example); if γ(n) = 1, there is essentially no advantage of using multicast over unicast.

We now state the main result on Coded Shuffling in the following (informal) theorem.

Theorem 2 (Coded Shuffling): Let α be the fraction of the data matrix that can be cached at each worker, and n be the number of workers. Assume that the advantage of multicasting over unicasting is γ(n). Then, coded shuffling reduces the communication cost by a factor of (α + 1/n)γ(n) compared to uncoded shuffling.

For the formal version of the theorem and its proofs, see Sec. IV-D.

The remainder of this paper is organized as follows. In Sec. II, we provide an extensive review of the related works in the literature. Sec. III introduces the coded matrix multiplication, and Sec. IV introduces the coded shuffling algorithm. Finally, Sec. V presents conclusions and discusses open problems.

II. RELATED WORK

A. Coded Computation and Straggler Mitigation

The straggler problem has been widely observed in distributed computing clusters. Dean and Barroso [5] show that running a computational task at a computing node often involves unpredictable latency due to several factors such as network latency, shared resources, maintenance activities, and power limits. Further, they argue that stragglers cannot be completely removed from a distributed computing cluster. Ananthanarayanan et al. [26] characterize the impact and causes of stragglers that arise due to resource contention, disk failures, varying network conditions, and imbalanced workload.

One approach to mitigate the adverse effect of stragglers is based on efficient straggler detection algorithms. For instance, the default scheduler of Hadoop constantly detects stragglers while running computational tasks. Whenever it detects a straggler, it relaunches the task that was running on the detected straggler at some other available node. Zaharia et al. [27] propose a modification to the existing straggler detection algorithm and show that the proposed solution can effectively reduce the completion time of MapReduce tasks. Ananthanarayanan et al. [26] propose a system that efficiently detects stragglers using real-time progress and cancels those stragglers, and show that the proposed system can further reduce the runtime of MapReduce tasks.

Another line of work is based on breaking the synchronization barriers in distributed algorithms [28], [29]. An asynchronous parallel execution can continuously make progress without having to wait for all the responses from the workers, and hence the overall runtime is less affected by stragglers. However, these asynchronous approaches break the serial consistency of the algorithm to be parallelized, and do not guarantee "correctness" of the end result, i.e., the output of the asynchronous algorithm can differ from that of a serial execution with an identical number of iterations.

Recently, replication-based approaches have been explored to tackle the straggler problem: by replicating tasks and scheduling the replicas, the runtime of distributed algorithms can be significantly improved [30]–[36]. By collecting outputs of the fast-responding nodes (and potentially canceling all the other slow-responding replicas), such replication-based scheduling algorithms can reduce latency. Lee et al. [35] show that even without replica cancellation, one can still reduce the average task latency by properly scheduling redundant requests. We view these policies as special instances of coded computation: such task replication schemes can be seen as repetition-coded computation. In Sec. III, we describe this connection in detail, and indicate that coded computation can significantly outperform replication (as is usually the case for coding vs. replication in other engineering applications).

Another line of work that is closely related to coded computation is the latency analysis of coded distributed storage systems. Huang et al. [37] and Lee et al. [38] show that the flexibility of erasure-coded distributed storage systems allows for faster data retrieval performance than replication-based distributed storage systems. Joshi et al. [39] show that scheduling redundant requests to an increased number of storage nodes can improve the latency performance, and characterize the resulting storage-latency tradeoff. Sun et al. [40] study the problem of adaptive redundant request scheduling, and characterize the optimal strategies for various scenarios. Kadhe [41] and Soljanin [42] analyze the latency performance of availability codes, a class of storage codes designed for enhanced availability. Joshi et al. [36] study the cost associated with scheduling of redundant requests, and propose a general scheduling policy that achieves a delicate balance between the latency performance and the cost.
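The Fig. 3 toy example can be written out directly. The following NumPy sketch (array sizes and variable names are our own illustrative choices, not from the paper) shows how each worker cancels its cached batch from the single multicast message:

```python
import numpy as np

# Setup of Fig. 3: the data set consists of 4 batches A1..A4.
# Worker W1 caches {A1, A2}; worker W2 caches {A3, A4}.
rng = np.random.default_rng(1)
A = {i: rng.standard_normal((2, 3)) for i in (1, 2, 3, 4)}
cache_w1 = {1: A[1], 2: A[2]}
cache_w2 = {3: A[3], 4: A[4]}

# The master wants W1 to obtain A3 and W2 to obtain A2. Instead of two
# unicasts, it multicasts the single coded batch A2 + A3.
coded_msg = A[2] + A[3]

# Each worker subtracts the batch it already caches from the coded message.
A3_at_w1 = coded_msg - cache_w1[2]   # W1 cancels its cached A2
A2_at_w2 = coded_msg - cache_w2[3]   # W2 cancels its cached A3

assert np.allclose(A3_at_w1, A[3])
assert np.allclose(A2_at_w2, A[2])
```

One multicast thus replaces two unicasts, matching the 50% saving noted above; the saving is largest when multicasting is as cheap as unicasting, i.e., when γ(n) ≈ n.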


We now review some recent works on coded computation, which have been published after our conference publications [1], [2]. In [43], an anytime coding scheme for approximate matrix multiplication is proposed, and it is shown that the proposed scheme can improve the quality of approximation compared with the other existing coded schemes for exact computation. Dutta et al. [44] propose a coded computation scheme called 'Short-Dot'. Short-Dot induces additional sparsity to the encoded matrices at the cost of reduced decoding flexibility, and hence potentially speeds up the computation. Tandon et al. [45] consider the problem of computing gradients in a distributed system, and propose a novel coded computation scheme tailored for computing a sum of functions. In many machine learning problems, the objective function is a sum of per-data loss functions, and hence the gradient of the objective function is the sum of gradients of per-data loss functions. Based on this observation, they propose Gradient Coding, which can reliably compute the exact gradient of any function in the presence of stragglers. While Gradient Coding can be applied to computing gradients of any functions, it usually incurs significant storage and computation overheads. Bitar et al. [46] consider a secure coded computation problem where the input data matrices need to be secured from the workers. They propose a secure computation scheme based on Staircase codes, which can speed up the distributed computation while securing the input data from the workers. Lee et al. [47] consider the problem of large matrix-matrix multiplication, and propose a new coded computation scheme based on product codes. Reisizadehmobarakeh et al. [48] consider the coded computation problem on heterogeneous computing clusters while our work assumes a homogeneous computing cluster. The authors show that by delicately distributing jobs across heterogeneous workers, one can improve the performance of coded computation compared with the symmetric job allocation scheme, which is designed for homogeneous workers in our work. While most of the works focus on the application of coded computation to linear operations, a recent work shows that coding can also be used in distributed computing frameworks involving nonlinear operations [49]. Lee et al. [49] show that by leveraging the multi-core architecture in the worker computing units and "coding across" the multi-core computed outputs, significant (and in some settings unbounded) gains in speed-up in computational time can be achieved between the coded and uncoded schemes.

B. Data Shuffling and Communication Overheads

Distributed learning algorithms on large-scale networked systems have been extensively studied in the literature [50]–[60]. Many of the distributed algorithms that are implemented in practice share a similar algorithmic "anatomy": the data set is split among several cores or nodes, each node trains a model locally, then the local models are averaged, and the process is repeated. While training a model with parallel or distributed learning algorithms, it is common to randomly re-shuffle the data a number of times [29], [61]–[65]. This essentially means that after each shuffling the learning algorithm will go over the data in a different order than before. Although the effects of random shuffling are far from understood theoretically, the large statistical gains have turned it into a common practice. Intuitively, data shuffling before a new pass over the data implies that nodes get a nearly "fresh" sample from the data set, which experimentally leads to better statistical performance. Moreover, bad orderings of the data—known to lead to slow convergence in the worst case [61], [64], [65]—are "averaged out". However, the statistical benefits of data shuffling do not come for free: each time a new shuffle is performed, the entire dataset is communicated over the network of nodes. This inevitably leads to performance bottlenecks due to heavy communication.

In this work, we propose to use coding opportunities to significantly reduce the communication cost of some distributed learning algorithms that require data shuffling. Our coded shuffling algorithm is built upon the coded caching scheme by Maddah-Ali and Niesen [66]. Coded caching is a technique to reduce the communication rate in content delivery networks. Mainly motivated by video sharing applications, coded caching exploits the multicasting opportunities between users that request different video files to significantly reduce the communication burden of the server node that has access to the files. Coded caching has been studied in many scenarios such as decentralized coded caching [67], online coded caching [68], hierarchical coded caching for wireless communication [69], and device-to-device coded caching [70]. Recently, Li et al. [71] proposed coded MapReduce, which reduces the communication cost in the process of transferring the results of mappers to reducers.

Our proposed approach is significantly different from all related studies on coded caching in two ways: (i) we shuffle the data points among the computing nodes to increase the statistical efficiency of distributed computation and machine learning algorithms; and (ii) we code the data over their actual representation (i.e., over the doubles or floats), unlike the traditional coding schemes over bits. In Sec. IV, we describe how coded shuffling can remarkably speed up the communication phase of large-scale parallel machine learning algorithms, and provide extensive numerical experiments to validate our results.

The coded shuffling problem that we study is related to the index coding problem [72], [73]. Indeed, given a fixed "side information" reflecting the memory content of the nodes, the data delivery strategy for a particular permutation of the data rows induces an index coding problem. However, our coded shuffling framework is different from index coding in at least two significant ways. First, the coded shuffling framework involves multiple iterations of data being stored across all the nodes. Secondly, when the caches of the nodes are updated in coded shuffling, the system is unaware of the upcoming permutations. Thus, the cache update rules need to be designed to target any possible unknown permutation of data in succeeding iterations of the algorithm.

We now review some recent works on coded shuffling, which have been published after our first presentation [1], [2]. Attia and Tandon [74] study the information-theoretic limits of the coded shuffling problem. More specifically, the authors


completely characterize the fundamental limits for the case of 2 workers and the case of 3 workers. Attia and Tandon [75] consider the worst-case formulation of the coded shuffling problem, and propose a two-stage shuffling algorithm. Song and Fragouli [76] propose a new coded shuffling scheme based on pliable index coding. While most of the existing works focus on either coded computation or coded shuffling, one notable exception is [77]. In this work, the authors generalize the original coded MapReduce framework by introducing stragglers to the computation phases. Observing that highly flexible codes are not favorable to coded shuffling while replication codes allow for efficient shuffling, the authors propose an efficient way of coding to mitigate straggler effects as well as reduce the shuffling overheads.

III. CODED COMPUTATION

In this section, we propose a novel paradigm to mitigate the straggler problem. The core idea is simple: we introduce redundancy into the subtasks of a distributed algorithm such that the original task's result can be decoded from a subset of the subtask results, treating uncompleted subtasks as erasures. For this specific purpose, we use erasure codes to design coded subtasks.

An erasure code is a method of introducing redundancy to information for robustness to noise [78]. It encodes a message of k symbols into a longer message of n coded symbols such that the original k message symbols can be recovered by decoding a subset of the coded symbols [78], [79]. We now show how erasure codes can be applied to distributed computation to mitigate the straggler problem.

A. Coded Computation

A coded distributed algorithm is specified by local functions, local data blocks, decodable sets of indices, and a decoding function: the local functions and data blocks specify the way the original computational task and the input data are distributed across n workers; and the decodable sets of indices and the decoding function are such that the desired computation result can be correctly recovered using the decoding function as long as the local computation results from any of the decodable sets are collected.

The formal definition of coded distributed algorithms is as follows.

Definition 1 (Coded Computation): Consider a computational task f_A(·). A coded distributed algorithm for computing f_A(·) is specified by
• local functions ⟨f_{A_i}(·)⟩_{i=1}^n and local data blocks ⟨A_i⟩_{i=1}^n;
• (minimal) decodable sets of indices I ⊂ P([n]) and a decoding function dec(·, ·),
where [n] ≜ {1, 2, . . . , n} and P(·) is the power set of a set. The decodable sets of indices I are minimal: no element of I is a subset of another element. The decoding function takes a sequence of indices and a sequence of subtask results, and it must correctly output f_A(x) if any decodable set of indices and its corresponding results are given.

A coded distributed algorithm can be run in a distributed computing cluster as follows. Assume that the i-th (encoded) data block A_i is stored at the i-th worker for all i. Upon receiving the input argument x, the master node multicasts x to all the workers, and then waits until it receives the responses from any of the decodable sets. Each worker node starts computing its local function when it receives its local input argument, and sends the task result to the master node. Once the master node receives the results from some decodable set, it decodes the received task results and obtains f_A(x).

The algorithm described in Sec. I is an example of a coded distributed algorithm: it is a coded distributed algorithm for matrix multiplication that uses an (n, n − 1) MDS code. One can generalize the described algorithm using an (n, k) MDS code as follows. For any 1 ≤ k ≤ n, the data matrix A is first divided into k equal-sized submatrices.² Then, by applying an (n, k) MDS code to each element of the submatrices, n encoded submatrices are obtained. We denote these n encoded submatrices by Ã1, Ã2, . . . , Ãn. Note that Ãi = Ai for 1 ≤ i ≤ k if a systematic MDS code is used for the encoding procedure. Upon receiving any k task results, the master node can use the decoding algorithm to decode the k task results. Then, one can find AX simply by concatenating them.

B. Runtime of Uncoded/Coded Distributed Algorithms

In this section, we analyze the runtime of uncoded and coded distributed algorithms. We first consider the overall runtime of an uncoded distributed algorithm, T_overall^uncoded. Assume that the runtime of each task is identically distributed and independent of others. We denote the runtime of the i-th worker under a computation scheme, say s, by T_i^s. Note that the distributions of the T_i's can differ across different computation schemes. Then,

T_overall^uncoded ≜ T_(n)^uncoded = max{T_1^uncoded, . . . , T_n^uncoded},  (2)

where T_(i) is the i-th smallest of {T_i}_{i=1}^n. From (2), it is clear that a single straggler can slow down the overall algorithm.

A coded distributed algorithm is terminated whenever the master node receives results from any decodable set of workers. Thus, the overall runtime of a coded algorithm is not determined by the slowest worker, but by the first time to collect results from some decodable set in I, i.e.,

T_overall^coded ≜ T_(I)^coded = min_{i∈I} max_{j∈i} T_j^coded.  (3)

We remark that the runtime of uncoded distributed algorithms (2) is a special case of (3) with I = {[n]}. In the following examples, we consider the runtime of repetition-coded algorithms and MDS-coded algorithms.

Example 1 (Repetition Codes): Consider an (n/k)-repetition code where each local task is replicated n/k times. We assume that each group of n/k consecutive workers works on the replicas of one local task. Thus, the decodable sets of indices I are all the minimal sets that have k distinct task results, i.e., I = {1, 2, . . . , n/k} × {n/k + 1, n/k + 2, . . . , 2n/k} × · · · × {n − n/k + 1, n − n/k + 2, . . . , n}, where A × B denotes the Cartesian

²If the number of rows of A is not a multiple of k, one can append zero rows to A to make the number of rows a multiple of k.

Authorized licensed use limited to: Yunnan University. Downloaded on March 27,2023 at 04:28:48 UTC from IEEE Xplore. Restrictions apply.
LEE et al.: SPEEDING UP DISTRIBUTED MACHINE LEARNING USING CODES 1519

product of A and B. Thus,

T_overall^Repetition-coded = max_{i∈[k]} min_{j∈[n/k]} T_{(i−1)n/k+j}^Repetition-coded.  (4)

Example 2 (MDS Codes): If one uses an (n, k) MDS code, the decodable sets of indices are the sets of any k indices, i.e., I = {i | i ⊂ [n], |i| = k}. Thus,

T_overall^MDS-coded = T_(k)^MDS-coded.  (5)

That is, the algorithm's runtime will be determined by the k-th response, not by the n-th response.

C. Probabilistic Model of Runtime

In this section, we analyze the runtime of uncoded/coded distributed algorithms assuming that task runtimes, including times to communicate inputs and outputs, are randomly distributed according to a certain distribution. For analytical purposes, we make a few assumptions as follows. We first assume the existence of the mother runtime distribution F(t): we assume that running an algorithm using a single machine takes a random amount of time T_0, which is a positive-valued, continuous random variable distributed according to F, i.e., Pr(T_0 ≤ t) = F(t). We also assume that T_0 has a probability density function f(t). Then, when the algorithm is distributed into a certain number of subtasks, say ℓ, the runtime distribution of each of the ℓ subtasks is assumed to be a scaled version of the mother distribution, i.e., Pr(T_i ≤ t) = F(ℓt) for 1 ≤ i ≤ ℓ. Note that we are implicitly assuming a symmetric job allocation scheme, which is the optimal job allocation scheme if the underlying workers have identical computing capabilities, i.e., homogeneous computing nodes are assumed. Finally, the computing times of the ℓ tasks are assumed to be independent of one another.

Remark 1 (Homogeneous Clusters and Heterogeneous Clusters): In this work, we assume homogeneous clusters: that is, all the workers have independent and identically distributed computing time statistics. While our symmetric job allocation is optimal for homogeneous cases, it can be strictly suboptimal for heterogeneous cases. While our work focuses on homogeneous clusters, we refer the interested reader to a recent work [48] for a generalization of our problem setting to that of heterogeneous clusters, for which symmetric allocation strategies are no longer optimal.

We first consider an uncoded distributed algorithm with n (uncoded) subtasks. Due to the assumptions mentioned above, the runtime distribution of each subtask is F(nt). Thus, the runtime distribution of an uncoded distributed algorithm, denoted by F_overall^uncoded(t), is simply [F(nt)]^n.

When repetition codes or MDS codes are used, an algorithm is first divided into k (< n) systematic subtasks, and then n − k coded tasks are designed to provide an appropriate level of redundancy. Thus, the runtime of each task is distributed according to F(kt). Using (4) and (5), one can easily find the runtime distribution of an n/k-repetition-coded distributed algorithm, F_overall^Repetition, and the runtime distribution of an (n, k)-MDS-coded distributed algorithm, F_overall^MDS-coded. For an n/k-repetition-coded distributed algorithm, one can first find the distribution of

min_{j∈[n/k]} T_{(i−1)n/k+j}^Repetition-coded,

and then find the distribution of the maximum of k such terms:

F_overall^Repetition(t) = (1 − [1 − F(kt)]^{n/k})^k.  (6)

The runtime distribution of an (n, k)-MDS-coded distributed algorithm is simply that of the k-th order statistic:

F_overall^MDS-coded(t) = ∫_{τ=0}^{t} n k f(kτ) (n−1 choose k−1) F(kτ)^{k−1} [1 − F(kτ)]^{n−k} dτ.  (7)

Remark 2: For the same values of n and k, the runtime distribution of a repetition-coded distributed algorithm strictly dominates that of an MDS-coded distributed algorithm. This can be shown by observing that the decodable sets of the MDS-coded algorithm contain those of the repetition-coded algorithm.

In Fig. 4, we compare the runtime distributions of uncoded and coded distributed algorithms. We compare the runtime distributions of the uncoded algorithm, the repetition-coded algorithm, and the MDS-coded algorithm with n = 10 and k = 5. In Fig. 4a, we use a shifted-exponential distribution as the mother runtime distribution. That is, F(t) = 1 − e^{−(t−1)} for t ≥ 1. In Fig. 4b, we use the empirical task runtime distribution that is measured on an Amazon EC2 cluster.3 Observe that for both cases, the runtime distribution of the MDS-coded algorithm has the lightest tail.

3 The detailed description of the experiments is provided in Sec. III-F.

D. Optimal Code Design for Coded Distributed Algorithms: The Shifted-Exponential Case

When a coded distributed algorithm is used, the original task is divided into a fewer number of tasks compared to the case of uncoded algorithms. Thus, the runtime of each task of a coded algorithm, which is distributed as F(kt), is stochastically larger than that of an uncoded algorithm, which is distributed as F(nt). If the value that we choose for k is too small, then the runtime of each task becomes so large that the overall runtime of the distributed algorithm will eventually increase. If k is too large, the level of redundancy may not be sufficient to prevent the algorithm from being delayed by the stragglers.

Given the mother runtime distribution and the code parameters, one can compute the overall runtime distribution of the coded distributed algorithm using (6) and (7). Then, one can optimize the design based on various target metrics, e.g., the expected overall runtime, the 99th percentile runtime, etc.

In this section, we show how one can design an optimal coded algorithm that minimizes the expected overall runtime for a shifted-exponential mother distribution. The shifted-exponential distribution strikes a good balance between accuracy and analytical tractability. This model is motivated by the model proposed in [80]: the authors used this distribution to model latency of file queries from cloud storage systems.
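To make Definition 1 and Example 2 concrete, the following sketch implements an (n, k)-MDS-coded matrix-vector multiplication in NumPy. It is an illustration of the scheme only, not the paper's OpenMPI implementation: the encoding here uses a real-valued Vandermonde generator (any k of its columns are linearly independent for distinct evaluation points), and all function names are ours.

```python
import numpy as np

def mds_encode(A, n, k, nodes=None):
    """Split A into k row blocks and encode them into n coded blocks.

    Uses a real Vandermonde generator matrix, whose k x k submatrices are
    invertible for distinct evaluation points (illustrative only; a
    production system would typically use a finite-field MDS code).
    """
    q, r = A.shape
    assert q % k == 0, "pad A so its row count is a multiple of k"
    blocks = A.reshape(k, q // k, r)                 # A_1, ..., A_k
    nodes = np.arange(1, n + 1) if nodes is None else nodes
    G = np.vander(nodes, k, increasing=True).T       # k x n generator
    # coded block i is sum_j G[j, i] * A_j
    return np.tensordot(G, blocks, axes=(0, 0)), G   # n blocks of size (q/k) x r

def mds_decode(results, idx, G):
    """Recover [A_1 x; ...; A_k x] from any k worker results."""
    k = G.shape[0]
    Gk = G[:, idx[:k]]                               # k x k, invertible
    Y = np.stack([results[i] for i in idx[:k]])      # k x (q/k)
    return np.linalg.solve(Gk.T, Y).reshape(-1)      # undo the encoding

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4)); x = rng.standard_normal(4)
n, k = 5, 3
coded, G = mds_encode(A, n, k)
results = {i: coded[i] @ x for i in range(n)}        # worker i's subtask
y = mds_decode(results, [0, 2, 4], G)                # any k = 3 results suffice
assert np.allclose(y, A @ x)
```

Any k = 3 of the n = 5 worker products recover Ax here, which is exactly why the overall runtime in (5) is the k-th order statistic T_(k) rather than the maximum.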
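The closed-form distributions (6) and (7) can also be sanity-checked by simulation. The sketch below is ours: it draws subtask times from the shifted-exponential model with hypothetical parameters n = 10, k = 5, μ = 1 (the setting of Fig. 4a) and evaluates the overall runtimes (2), (4), and (5) empirically.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, mu, trials = 10, 5, 1.0, 20000

def draw(num_subtasks, shape):
    # Pr(T_i <= t) = F(num_subtasks * t) with mother distribution
    # F(t) = 1 - exp(-mu (t - 1)), i.e. T_i = (1 + Exp(mu)) / num_subtasks
    return (1.0 + rng.exponential(1.0 / mu, shape)) / num_subtasks

# uncoded, eq. (2): wait for all n subtasks
T_unc = draw(n, (trials, n)).max(axis=1)
# n/k-repetition, eq. (4): fastest replica in each of the k groups
T_rep = draw(k, (trials, k, n // k)).min(axis=2).max(axis=1)
# (n, k)-MDS, eq. (5): k-th fastest of the n subtasks
T_mds = np.sort(draw(k, (trials, n)), axis=1)[:, k - 1]

for name, T in [("uncoded", T_unc), ("repetition", T_rep), ("MDS", T_mds)]:
    print(name, round(T.mean(), 3), round(np.quantile(T, 0.99), 3))
```

With these parameters the empirical means land near the analytic values (1/n)(1 + H_n/μ) ≈ 0.39 for the uncoded scheme and (1/k)(1 + (H_n − H_{n−k})/μ) ≈ 0.33 for the MDS-coded scheme, and the MDS-coded scheme shows the lightest tail, as in Fig. 4.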
Fig. 4. Runtime distributions of uncoded/coded distributed algorithms. We plot the runtime distributions of uncoded/coded distributed algorithms. For the uncoded algorithm, we use n = 10, and for the coded algorithms, we use n = 10 and k = 5. In (a), we plot the runtime distributions when the runtimes of the tasks are distributed according to the shifted-exponential distribution. Indeed, the curves in (a) are analytically obtainable: see Sec. III-D for more details. In (b), we use the empirical task runtime distribution measured on an Amazon EC2 cluster.

The shifted-exponential distribution is the sum of a constant and an exponential random variable, i.e.,

Pr(T_0 ≤ t) = 1 − e^{−μ(t−1)}, ∀t ≥ 1,  (8)

where the exponential rate μ is called the straggling parameter.

With this shifted-exponential model, we first characterize a lower bound on the fundamental limit of the average runtime.

Proposition 1: The average runtime of any distributed algorithm, in a distributed computing cluster with n workers, is lower bounded by 1/n.

Proof: One can show that the average runtime of any distributed algorithm strictly decreases if the mother runtime distribution is replaced with a deterministic constant 1. Thus, the optimal average runtime with this deterministic mother distribution serves as a strict lower bound on the optimal average runtime with the shifted-exponential mother distribution. The constant mother distribution implies that stragglers do not exist, and hence the uncoded distributed algorithm achieves the optimal runtime, which is 1/n.

We now analyze the average runtime of uncoded/coded distributed algorithms. We assume that n is large, and k is linear in n. Accordingly, we approximate H_n def= Σ_{i=1}^n 1/i ≈ log n and H_{n−k} ≈ log(n − k). We first note that the expected value of the maximum of n independent exponential random variables with rate μ is H_n/μ. Thus, the average runtime of an uncoded distributed algorithm is

E[T_overall^uncoded] = (1/n)(1 + (1/μ) log n) = Θ(log n / n).  (9)

For the average runtime of an n/k-repetition-coded distributed algorithm, we first note that the minimum of n/k independent exponential random variables with rate μ is distributed as an exponential random variable with rate (n/k)μ. Thus,

E[T_overall^Repetition-coded] = (1/k)(1 + (k/(nμ)) log k) = Θ(log n / n).  (10)

Finally, we note that the expected value of the k-th order statistic of n independent exponential random variables of rate μ is (H_n − H_{n−k})/μ. Therefore,

E[T_overall^MDS-coded] = (1/k)(1 + (1/μ) log(n/(n−k))) = Θ(1/n).  (11)

Using these closed-form expressions of the average runtime, one can easily find the optimal value of k that achieves the optimal average runtime. The following lemma characterizes the optimal repetition code for the repetition-coded algorithms and their runtime performances.

Lemma 1 (Optimal Repetition-Coded Distributed Algorithms): If μ ≥ 1, the average runtime of an n/k-repetition-coded distributed algorithm, in a distributed computing cluster with n workers, is minimized by setting k = n, i.e., not replicating tasks. If μ = 1/v for some integer v > 1, the average runtime is minimized by setting k = μn, and the corresponding minimum average runtime is (1/(nμ))(1 + log(nμ)).

Proof: It is easy to see that (10) as a function of k has a unique extreme point. By differentiating (10) with respect to k and equating it to zero, we have k = μn. Thus, if μ ≥ 1, one should set k = n; if μ = 1/v < 1 for some integer v, one should set k = μn.

The above lemma reveals that the optimal repetition-coded distributed algorithm can achieve a lower average runtime than the uncoded distributed algorithm if μ < 1; however, the optimal repetition-coded distributed algorithm still suffers from the factor of Θ(log n), and cannot achieve the order-optimal performance. The following lemma, on the other hand, shows that the optimal MDS-coded distributed algorithm can achieve the order-optimal average runtime performance.

Lemma 2 (Optimal MDS-Coded Distributed Algorithms): The average runtime of an (n, k)-MDS-coded distributed algorithm, in a distributed computing cluster with n workers, can be minimized by setting k = k* where

k* = (1 + 1/W_{−1}(−e^{−μ−1})) n,  (12)
and W_{−1}(·) is the lower branch of the Lambert W function.4 Thus,

T* def= min_k E[T_overall^MDS-coded] = −W_{−1}(−e^{−μ−1})/(μn) def= γ(μ)/n.  (13)

Proof: It is easy to see that (11) as a function of k has a unique extreme point. By differentiating (11) with respect to k and equating it to zero, we have (1/k)(1 + (1/μ) log(n/(n−k))) = (1/μ)(1/(n−k)). By setting k = αn, we have (1/α)(1 + (1/μ) log(1/(1−α))) = (1/μ)(1/(1−α)), which implies μ + 1 = 1/(1−α) − log(1/(1−α)). By defining β = 1/(1−α) and exponentiating both sides, we have e^{μ+1} = e^β/β. Note that the solution of e^x/x = t, t ≥ e and x ≥ 1, is x = −W_{−1}(−1/t). Thus, β = −W_{−1}(−e^{−μ−1}). By plugging the above equation into the definition of β, the claim is proved.

We plot nT* and k*/n as functions of μ in Fig. 5.

4 W_{−1}(x), the lower branch of the Lambert W function evaluated at x, is the unique solution of te^t = x with t ≤ −1.

Fig. 5. nT* and k*/n as functions of μ. As a function of the straggling parameter, we plot the normalized optimal computing time and the optimal value of k. (a) nT* as a function of μ. (b) k*/n as a function of μ.

In addition to the order-optimality of MDS-coded distributed algorithms, the above lemma precisely characterizes the gap between the achievable runtime and the optimistic lower bound of 1/n. For instance, when μ > 1, the optimal average runtime is only a factor of 3.15 away from the lower bound.

Remark 3 (Storage Overhead): So far, we have considered only the runtime performance of distributed algorithms. Another important metric to be considered is the storage cost. When coded computation is being used, the storage overhead may increase. For instance, the MDS-coded distributed algorithm for matrix multiplication, described in Sec. III-A, requires 1/k of the whole data to be stored at each worker, while the uncoded distributed algorithm requires 1/n. Thus, the storage overhead factor is (1/k − 1/n)/(1/n) = n/k − 1. If one uses the runtime-optimal MDS-coded distributed algorithm for matrix multiplication, the storage overhead is n/k* − 1 = 1/α* − 1, where α* def= k*/n.

E. Coded Gradient Descent: An MDS-Coded Distributed Algorithm for Linear Regression

In this section, as a concrete application of coded matrix multiplication, we propose the coded gradient descent for solving large-scale linear regression problems.

We first describe the (uncoded) gradient-based distributed algorithm. Consider the following linear regression problem,

min_x f(x) def= min_x (1/2) ‖Ax − y‖_2^2,  (14)

where y ∈ R^q is the label vector, A = [a_1, a_2, . . . , a_q]^T ∈ R^{q×r} is the data matrix, and x ∈ R^r is the unknown weight vector to be found. We seek a distributed algorithm to solve this regression problem. Since f(x) is convex in x, the gradient-based distributed algorithm works as follows. We first compute the objective function's gradient: ∇f(x) = A^T(Ax − y). Denoting by x^(t) the estimate of x after the t-th iteration, we iteratively update x^(t) according to the following equation:

x^(t+1) = x^(t) − η∇f(x^(t)) = x^(t) − ηA^T(Ax^(t) − y).  (15)

The above algorithm is guaranteed to converge to the optimal solution if we use a small enough step size η [81], and can be easily distributed. We describe one simple way of parallelizing the algorithm, which is implemented in many open-source machine learning libraries including Spark mllib [82]. As A^T(Ax^(t) − y) = Σ_{i=1}^q a_i(a_i^T x^(t) − y_i), gradients can be computed in a distributed way by computing partial sums at different worker nodes and then adding all the partial sums at the master node. This distributed algorithm is an uncoded distributed algorithm: in each round, the master node needs to wait for all the task results in order to compute the gradient.5

5 Indeed, one may apply another coded computation scheme called Gradient Coding [45], which was proposed after our conference publications. By applying Gradient Coding to this algorithm, one can achieve straggler tolerance but at the cost of significant computation and storage overheads. More precisely, it incurs Θ(n) larger computation and storage overheads in order to protect the algorithm from Θ(n) stragglers. Later in this section, we will show that our coded computation scheme, which is tailor-designed for linear regression, incurs Θ(1) overheads to protect the algorithm from Θ(n) stragglers.
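The optimal pair (12)–(13) is easy to check numerically. The sketch below is ours: it uses SciPy's lambertw for the W_{−1} branch, takes a hypothetical cluster size n = 1000, and compares the analytic optimum k* against a brute-force minimization of the average runtime (11).

```python
import numpy as np
from scipy.special import lambertw

def t_star(mu, n):
    # eq. (13): optimal average runtime gamma(mu)/n, gamma(mu) = -W_{-1}(-e^{-mu-1})/mu
    w = lambertw(-np.exp(-mu - 1.0), k=-1).real
    return -w / (mu * n)

def k_star(mu, n):
    # eq. (12): optimal number of systematic tasks
    w = lambertw(-np.exp(-mu - 1.0), k=-1).real
    return (1.0 + 1.0 / w) * n

def avg_mds_runtime(k, mu, n):
    # eq. (11): E[T] = (1/k)(1 + (1/mu) log(n/(n-k)))
    return (1.0 + np.log(n / (n - k)) / mu) / k

mu, n = 1.0, 1000
ks = np.arange(1, n)                     # brute-force search over k
best = ks[np.argmin(avg_mds_runtime(ks, mu, n))]
print(k_star(mu, n), best)               # analytic optimum vs. grid optimum
print(n * t_star(mu, n))                 # normalized optimal runtime gamma(mu)
```

For μ = 1 this gives nT* = γ(1) ≈ 3.15 and k*/n ≈ 0.68, matching the factor-of-3.15 gap to the 1/n lower bound discussed above and the curves of Fig. 5.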
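The partial-sum parallelization just described is easy to state in code. The following sketch is ours: the sizes, seed, and step size are hypothetical, and the n workers are simulated as function calls rather than MPI processes. It computes the gradient of (14) as a sum of per-worker partial sums and runs the update (15).

```python
import numpy as np

rng = np.random.default_rng(2)
q, r, n = 12, 3, 4
A = rng.standard_normal((q, r))
y = rng.standard_normal(q)

# each of the n workers holds a block of data rows
splits = np.array_split(np.arange(q), n)

def worker_grad(i, x):
    # partial sum over block i:  sum_j a_j (a_j^T x - y_j)
    Ai, yi = A[splits[i]], y[splits[i]]
    return Ai.T @ (Ai @ x - yi)

x = np.zeros(r)
eta = 1.0 / np.linalg.norm(A, 2) ** 2        # a safe (hypothetical) step size
for _ in range(2000):
    grad = sum(worker_grad(i, x) for i in range(n))   # master adds partial sums
    x = x - eta * grad                                # update (15)

x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.linalg.norm(x - x_ls))              # close to the least-squares solution
```

Note that in each round the master must collect all n partial sums before it can update x, which is exactly the straggler exposure that the coded gradient descent below removes.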
Fig. 6. Illustration of a coded gradient descent approach for linear regression. The coded gradient descent computes a gradient of the objective function using coded matrix multiplication twice: in each iteration, it first computes Ax^(t) as depicted in (a) and (b), and then computes A^T(Ax^(t) − y) as depicted in (c) and (d).

Thus, the runtime of each update iteration is determined by the slowest response among all the worker nodes.

We now propose the coded gradient descent, a coded distributed algorithm for linear regression problems. Note that in each iteration, the following two matrix-vector multiplications are computed:

Ax^(t),  A^T(Ax^(t) − y) = A^T z^(t), where z^(t) def= Ax^(t) − y.  (16)

In Sec. III-A, we proposed the MDS-coded distributed algorithm for matrix multiplication. Here, we apply the algorithm twice to compute these two multiplications in each iteration. More specifically, for the first matrix multiplication, we choose 1 ≤ k_1 < n and use an (n, k_1)-MDS-coded distributed algorithm for matrix multiplication to encode the data matrix A. Similarly, for the second matrix multiplication, we choose 1 ≤ k_2 < n and use an (n, k_2)-MDS-coded distributed algorithm to encode the transpose of the data matrix. Denoting the i-th row-split (column-split) of A as A_i (Ã_i), the i-th worker stores both A_i and Ã_i. In the beginning of each iteration, the master node multicasts x^(t) to the worker nodes, each of which computes the local matrix multiplication for Ax^(t) and sends the result to the master node. Upon receiving any k_1 task results, the master node can start decoding the result and obtain z^(t) = Ax^(t). The master node now multicasts z^(t) to the workers, and the workers compute local matrix multiplications for A^T z^(t). Finally, the master node can decode A^T z^(t) as soon as it receives any k_2 task results, and can proceed to the next iteration. Fig. 6 illustrates the protocol with k_1 = k_2 = n − 1.

Remark 4 (Storage Overhead of the Coded Gradient Descent): The coded gradient descent requires each node to store a (1/k_1 + 1/k_2 − 1/(k_1 k_2))-fraction of the data matrix. As the minimum storage overhead per node is a 1/n-fraction of the data matrix, the relative storage overhead of the coded gradient descent algorithm is at least about a factor of 2, if k_1 ≈ n and k_2 ≈ n.

F. Experimental Results

In order to see the efficacy of coded computation, we implement the proposed algorithms and test them on an Amazon EC2 cluster. We first obtain the empirical distribution of task runtime in order to observe how frequently stragglers appear in our testbed by measuring round-trip times between the master node and each of 10 worker instances on an Amazon EC2 cluster. Each worker computes a matrix-vector multiplication and passes the computation result to the master node, and the master node measures round-trip times that include both computation time and communication time. Each worker repeats this procedure 500 times, and we obtain the empirical distribution of round-trip times across all the worker nodes.

In Fig. 7, we plot the histogram and complementary CDF (CCDF) of the measured computing times; the average round-trip time is 0.11 second, and the 95th percentile latency is 0.20 second, i.e., roughly five out of hundred tasks are going to be roughly two times slower than the average tasks. Assuming the probability of a worker being a straggler is 5%, if one runs an uncoded distributed algorithm with 10 workers, the probability of not seeing such a straggler is only about 60%, so the algorithm is slowed down by a factor of more than 2 with probability 40%. Thus, this observation strongly emphasizes the necessity of an efficient straggler mitigation algorithm. In Fig. 4b, we plot the runtime distributions of uncoded/coded distributed algorithms using this empirical distribution as the
mother runtime distribution. When an uncoded distributed algorithm is used, the overall runtime distribution entails a heavy tail, while the runtime distribution of the MDS-coded algorithm has almost no tail.

Fig. 7. Empirical CCDF of the measured round-trip times. We measure round-trip times between the master node and each of 10 worker nodes on an Amazon EC2 cluster. A round-trip time consists of the transmission time of the input vector from the master to a worker, computation time, and the transmission time of the output vector from a worker to the master.

We then implement the coded matrix multiplication in C++ using OpenMPI [83] and benchmark it on a cluster of 26 EC2 instances (25 workers and a master).6 Also, three uncoded matrix multiplication algorithms – block, column-partition, and row-partition – are implemented and benchmarked. We randomly draw a square matrix of size 5750 × 5750, a fat matrix of size 5750 × 11500, and a tall matrix of size 11500 × 5750, and multiply them with a column vector. For the coded matrix multiplication, we choose a (25, 23) MDS code so that the runtime of the algorithm is not affected by any 2 stragglers. Fig. 8 shows that the coded matrix multiplication outperforms all the other parallel matrix multiplication algorithms in most cases. On a cluster of m1-small instances, the most unreliable instances, the coded matrix multiplication achieves about 40% average runtime reduction and about 60% tail reduction compared to the best of the three uncoded matrix multiplication algorithms. On a cluster of c1-medium instances, the coded algorithm achieves the best performance in most of the tested cases: the average runtime is reduced by at most 39.5%, and the 95th percentile runtime is reduced by at most 58.3%. Among the tested cases, we observe one case in which both the uncoded row-partition and the coded row-partition algorithms are outperformed by the uncoded column-partition algorithm. This is the case of a fat matrix multiplication with c1-medium instances. Note that when a row-partition algorithm is used, the size of messages from the master node to the workers is n times larger compared with the case of column-partition algorithms. Thus, when the variability of computational times becomes low compared with that of communication time, the larger communication overhead of row-partition algorithms seems to arise, nullifying the benefits of coding.

6 For the benchmark, we manage the cluster using the StarCluster toolkit [84]. Input data is generated using a Python script, and the input matrix is row-partitioned for each of the workers (with the required encoding as described in the previous sections) in a preprocessing step. The procedure begins by having all of the worker nodes read in their respective row-partitioned matrices. Then, the master node reads the input vector and distributes it to all worker nodes in the cluster through an asynchronous send (MPI_Isend). Upon receiving the input vector, each worker node begins matrix multiplication through a BLAS [85] routine call and, once completed, sends the result back to the master using MPI_Send. The master node waits for a sufficient number of results to be received by continuously polling (MPI_Test) to see if any results are obtained. The procedure ends when the master node decodes the overall result after receiving enough partial results.

We also evaluate the performance of the coded gradient descent algorithm for linear regression. The coded linear regression procedure is also implemented in C++ using OpenMPI, and benchmarked on a cluster of 11 EC2 machines (10 workers and a master). Similar to the previous benchmarks, we randomly draw a square matrix of size 2000 × 2000, a fat matrix of size 400 × 10000, and a tall matrix of size 10000 × 400, and use them as a data matrix. We use a (10, 8)-MDS code for the coded linear regression so that each multiplication of the gradient descent algorithm is not slowed down by up to 2 stragglers. Fig. 9 shows that the gradient algorithm with the coded matrix multiplication significantly outperforms the one with the uncoded matrix multiplication; the average runtime is reduced by 31.3% to 35.7%, and the tail runtime is reduced by 27.9% to 35.6%.

IV. CODED SHUFFLING

We shift our focus from solving the straggler problem to solving the communication bottleneck problem. In this section, we explain the problem of data shuffling, propose the Coded Shuffling algorithm, and analyze its performance.

A. Setup and Notations

We consider a master-worker distributed setup, where the master node has access to the entire data set. Before every iteration of the distributed algorithm, the master node randomly partitions the entire data set into n subsets, say A_1, A_2, . . . , A_n. The goal of the shuffling phase is to distribute each of these partitioned data sets to the corresponding worker so that each worker can perform its distributed task with its own exclusive data set after the shuffling phase.

We let A(J) ∈ R^{|J|×r}, J ⊂ [q], be the concatenation of the |J| rows of matrix A with indices in J. Assume that each worker node has a cache of size s data rows (or s × r real numbers). In order to be able to fully store the data matrix across the worker nodes, we impose the inequality condition q/n ≤ s. Further, clearly if s > q, the data matrix can be fully stored at each worker node, eliminating the need for any shuffling. Thus, without loss of generality we assume that s ≤ q. As explained earlier, working on the same data points at each worker node in all the iterations of the iterative optimization algorithm leads to slow convergence. Thus, to enhance the statistical efficiency of the algorithm, the data matrix is shuffled after each iteration. More precisely, at each iteration t, the set of data rows [q] is partitioned uniformly at random into n subsets S_i^t, 1 ≤ i ≤ n, so that ∪_{i=1}^n S_i^t = [q] and S_i^t ∩ S_j^t = ∅ when i ≠ j; thus, each worker node computes a fresh local function of the data. Clearly, the data set that
worker i works on has cardinality q/n, i.e., |S_i^t| = q/n. Note that the sampling we consider here is without replacement, and hence these data sets are non-overlapping.

Fig. 8. Comparison of parallel matrix multiplication algorithms. We compare various parallel matrix multiplication algorithms: block, column-partition, row-partition, and coded (row-partition) matrix multiplication. We implement the four algorithms using OpenMPI and test them on an Amazon EC2 cluster of 25 instances. We measure the average and the 95th percentile runtimes of the algorithms. Plotted in (a) and (b) are the results with m1-small instances, and in (c) and (d) are the results with c1-medium instances.

Fig. 9. Comparison of parallel gradient algorithms. We compare parallel gradient algorithms for linear regression problems. We implement both the uncoded gradient descent algorithm and the coded gradient descent algorithm using OpenMPI, and test them on an Amazon EC2 cluster of 10 worker instances. Plotted are the average and the 95th percentile runtimes of the algorithms. (a) Average runtime. (b) Tail runtime.

B. Shuffling Schemes

We now present our coded shuffling algorithm, consisting of a transmission strategy for the master node, and caching and decoding strategies for the worker nodes. Let C_i^t be the cache content of node i (the set of row indices stored in cache i) at the end of iteration t. We design a transmission algorithm (by the master node) and a cache update algorithm to ensure that (i) S_i^t ⊂ C_i^t; and (ii) C_i^t \ S_i^t is distributed uniformly at random without replacement in the set [q] \ S_i^t. The first condition ensures that at each iteration, the workers have access to the data set that they are supposed to work on. The second condition provides the opportunity of effective coded transmissions for shuffling in the next iteration, as will be explained later.

1) Cache Update Rule: We consider the following cache update rule: the new cache will contain the subset of the data points used in the current iteration (this is needed for the local computations), plus a random subset of the previous cache contents. More specifically, q/n rows of the new cache are precisely the rows in S_i^{t+1}, and s − q/n rows of the cache are sampled points from the set C_i^t \ S_i^{t+1}, uniformly at random without replacement. Since the permutation π^t is picked uniformly at random, the marginal distribution of the cache contents at iteration t + 1 given S_i^{t+1}, 1 ≤ i ≤ n, is described as follows: S_i^{t+1} ⊂ C_i^{t+1}, and C_i^{t+1} \ S_i^{t+1} is distributed uniformly at random in [q] \ S_i^{t+1} without replacement.

2) Encoding and Transmission Schemes: We now formally describe two transmission schemes of the master node: (1) uncoded transmission and (2) coded transmission. In the following descriptions, we drop the iteration index t (and t + 1) for ease of notation.

The uncoded transmission first finds how many data rows in S_i are already cached in C_i, i.e., |C_i ∩ S_i|. Since the new permutation (partitioning) is picked uniformly at random, an s/q fraction of the data row indices in S_i are cached in C_i, so as q gets large, we have |S_i \ C_i| = (q/n)(1 − s/q) + o(q). Thus, without coding, the master node needs to transmit (q/n)(1 − s/q) data points to each of the n worker nodes. The total communication rate (in data points transmitted per iteration) of the uncoded scheme is then

R_u = n × (q/n)(1 − s/q) = q(1 − s/q).  (17)

We now describe the coded transmission scheme. Define the set of "exclusive" cache content as C̃_I = (∩_{i∈I} C_i) ∩ (∩_{i∈[n]\I} C̄_i), which denotes the set of rows that are stored at the caches of I, and are not stored at the caches of [n] \ I. For each subset I with |I| ≥ 2, the master node will multicast Σ_{i∈I} A(S_i ∩ C̃_{I\{i}}) to the worker nodes. Note that in general, the matrices A(·) differ in their sizes, so one has to zero-pad the shorter matrices and sum the zero-padded matrices. Algorithm 1 provides the pseudocode of the coded encoding and transmission scheme.7

7 Note that for each encoded data row, the master node also needs to transmit tiny metadata describing which data rows are included in the summation. We omit this detail in the description of the algorithm.

3) Decoding Algorithm: The decoding algorithm for the uncoded transmission scheme is straightforward: each worker simply takes the additional data rows that are required for the new iteration, and ignores the other data rows. We now describe the decoding algorithm for the coded transmission scheme. Each worker, say worker i, decodes each encoded data row as follows. Consider an encoded data row for some I that contains i. (All other data rows are discarded.) Such an encoded data row must be the sum of some data row in S_i and |I| − 1 data rows in C̃_{I\{i}}, which are available at worker i by the definition of C̃. Hence, the worker can always subtract
Algorithm 1 Coded Encoding and Transmission Scheme


procedure E NCODING([Ci ]ni=1 )
for each I ∈ [n]n , |I| > 2 do
I = (∩i∈I Ci ) ∩ ∩i ∈[n]\I C 
C i
|I | I \{i} |
 ← maxi=1 |Si ∩ C
for each i ∈ I do
Bi [1 : |Si ∩ C I \{i} |, :] ← A(Si ∩ C
I \{i} )
Bi [|Si ∩ C I \{i} | + 1 : , :] ← 0
end for
broadcast i∈I Bi
end for
end procedure

I \{i} and decode the data row


the data rows corresponding to C
Fig. 10. The achievable rates of coded and uncoded shuffling schemes.
This figure shows the achievable rates of coded and uncoded schemes versus
in Si . the cache size for parallel stochastic gradient descent algorithm.

C. Example

The following example illustrates the coded shuffling scheme.

Example 3: Let n = 3. Recall that worker node i needs to obtain A(S_i ∩ C_i^c) for the next iteration of the algorithm. Consider i = 1. The data rows in S_1 ∩ C_1^c are stored either exclusively in C_2 or C_3 (i.e., C̃_2 or C̃_3), or stored in both C_2 and C_3 (i.e., C̃_{2,3}). The transmitted message consists of 4 parts:
• (Part 1) M_{1,2} = A(S_1 ∩ C̃_2) + A(S_2 ∩ C̃_1),
• (Part 2) M_{1,3} = A(S_1 ∩ C̃_3) + A(S_3 ∩ C̃_1),
• (Part 3) M_{2,3} = A(S_2 ∩ C̃_3) + A(S_3 ∩ C̃_2), and
• (Part 4) M_{1,2,3} = A(S_1 ∩ C̃_{2,3}) + A(S_2 ∩ C̃_{1,3}) + A(S_3 ∩ C̃_{1,2}).
We show that worker node 1 can recover the data rows that it does not store, i.e., A(S_1 ∩ C_1^c). First, observe that node 1 stores S_2 ∩ C̃_1. Thus, it can recover A(S_1 ∩ C̃_2) using part 1 of the message since A(S_1 ∩ C̃_2) = M_{1,2} − A(S_2 ∩ C̃_1). Similarly, node 1 recovers A(S_1 ∩ C̃_3) = M_{1,3} − A(S_3 ∩ C̃_1). Finally, from part 4 of the message, node 1 recovers A(S_1 ∩ C̃_{2,3}) = M_{1,2,3} − A(S_2 ∩ C̃_{1,3}) − A(S_3 ∩ C̃_{1,2}).

D. Main Results

We now present the main result of this section, which characterizes the communication rate of the coded scheme. Let p = (s − q/n)/(q − q/n).

Theorem 3 (Coded Shuffling Rate): Coded shuffling achieves communication rate

R_c = (q/(np)^2) [ (1 − p)^{n+1} + (n − 1)p(1 − p) − (1 − p)^2 ]   (18)

(in number of data rows transmitted per iteration from the master node), which is significantly smaller than R_u in (17).

The reduction in communication rate is illustrated in Fig. 10 for n = 50 and q = 1000 as a function of s/q, where 1/n ≤ s/q ≤ 1. For instance, when s/q = 0.1, the communication overhead for data shuffling is reduced by more than 81%. Thus, at a very low storage overhead for caching, the algorithm can be significantly accelerated.

Before we present the proof of the theorem, we briefly compare our main result with similar results shown in [66] and [86]. Our coded shuffling algorithm is related to the coded caching problem [66], since one can design the right cache update rule to reduce the communication rate for an unknown demand or permutation of the data rows. A key difference, though, is that the coded shuffling algorithm is run over many iterations of the machine learning algorithm. Thus, the right cache update rule is required to guarantee the opportunity of coded transmission at every iteration. Furthermore, the coded shuffling problem has some connections to coded MapReduce [86], as both algorithms mitigate the communication bottlenecks in distributed computation and machine learning. However, coded shuffling enables coded transmission of raw data by leveraging the extra memory space available at each node, while coded MapReduce enables coded transmission of processed data in the shuffling phase of the MapReduce algorithm by cleverly introducing redundancy in the computation of the mappers.

We now prove Theorem 3.

Proof: To find the transmission rate of the coded scheme, we first need to find the cardinality of the sets S_i^{t+1} ∩ C̃_I^t for I ⊂ [n] and i ∉ I. To this end, we first find the probability that a random data row, r, belongs to C̃_I^t. Denote this probability by Pr(r ∈ C̃_I^t). Recall the cache content distribution at iteration t: q/n rows of cache j store S_j^t, and the other s − q/n rows are stored uniformly at random. Thus, we can compute Pr(r ∈ C̃_I^t) as follows:

Pr(r ∈ C̃_I^t)
  = Σ_{i=1}^n Pr(r ∈ C̃_I^t | r ∈ S_i^t) Pr(r ∈ S_i^t)   (19)
  = (1/n) Σ_{i=1}^n Pr(r ∈ C̃_I^t | r ∈ S_i^t)   (20)
  = (1/n) Σ_{i∈I} Pr(r ∈ C̃_I^t | r ∈ S_i^t)   (21)
  = (1/n) Σ_{i∈I} ((s − q/n)/(q − q/n))^{|I|−1} (1 − (s − q/n)/(q − q/n))^{n−|I|}   (22)
  = (|I|/n) p^{|I|−1} (1 − p)^{n−|I|}.   (23)
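As a quick numerical sanity check of this derivation (our own sketch, not part of the original analysis), the snippet below (i) evaluates R_c from (18) at n = 50, q = 1000, s/q = 0.1, taking R_u = q(1 − s/q) as implied by (28), which reproduces the "more than 81%" reduction quoted above, and (ii) Monte-Carlo-estimates Pr(r ∈ C̃_I^t) under the stated cache distribution on a small instance and compares it with the closed form (23).

```python
import random

def rate_coded(n, q, s):
    """R_c from (18), with p = (s - q/n) / (q - q/n)."""
    p = (s - q / n) / (q - q / n)
    return (q / (n * p) ** 2) * ((1 - p) ** (n + 1)
                                 + (n - 1) * p * (1 - p) - (1 - p) ** 2)

# (i) reproduce the ~81% reduction quoted for n = 50, q = 1000, s/q = 0.1
n, q, s = 50, 1000, 100
R_u = q * (1 - s / q)          # uncoded rate, as implied by (28)
reduction = 1 - rate_coded(n, q, s) / R_u

# (ii) Monte-Carlo check of (23) on a small instance
def estimate_prob(n, q, s, I, trials, seed=0):
    """Empirical Pr(r in C~_I) for the fixed row r = 0: cache j holds S_j
    (one block of a random permutation) plus s - q/n uniform rows from
    outside S_j, drawn without replacement."""
    rng, block, hits = random.Random(seed), q // n, 0
    for _ in range(trials):
        perm = list(range(q))
        rng.shuffle(perm)
        caches = []
        for j in range(n):
            S_j = set(perm[j * block:(j + 1) * block])
            rest = [x for x in range(q) if x not in S_j]
            caches.append(S_j | set(rng.sample(rest, s - block)))
        hits += (all(0 in caches[j] for j in I)
                 and all(0 not in caches[j] for j in range(n) if j not in I))
    return hits / trials

n2, q2, s2, I = 4, 40, 20, {0, 1}
p2 = (s2 - q2 / n2) / (q2 - q2 / n2)                     # = 1/3 here
closed = (len(I) / n2) * p2 ** (len(I) - 1) * (1 - p2) ** (n2 - len(I))  # (23)
est = estimate_prob(n2, q2, s2, I, trials=10000)
print(f"reduction = {reduction:.1%}, (23): closed {closed:.4f} vs MC {est:.4f}")
```

The estimate and the closed form of (23) agree to within Monte-Carlo noise, and the computed reduction matches the figure reported for Fig. 10.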


(19) is by the law of total probability. (20) is by the fact that r is chosen uniformly at random. To see (21), note that Pr(r ∈ C̃_I^t | r ∈ S_i^t, i ∉ I) = 0. Thus, the summation can be written only on the indices of I. We now explain (22). Given that r belongs to S_i^t, and i ∈ I, then r ∈ C_i with probability 1. The other |I| − 1 caches with indices in I\{i} contain r with probability (s − q/n)/(q − q/n), independently. Further, the caches with indices in [n]\I do not contain r with probability 1 − (s − q/n)/(q − q/n). By defining p ≜ (s − q/n)/(q − q/n), we have (23).

We now find the cardinality of S_i^{t+1} ∩ C̃_I^t for I ⊂ [n] and i ∉ I. Note that |S_i^{t+1}| = q/n. Thus, as q gets large (and n remains sub-linear in q), by the law of large numbers,

|S_i^{t+1} ∩ C̃_I^t| = (q/n) × (|I|/n) p^{|I|−1} (1 − p)^{n−|I|} + o(q).   (24)

Recall that for each subset I with |I| ≥ 2, the master node will send Σ_{i∈I} A(S_i ∩ C̃_{I\{i}}). Thus, the total rate of coded transmission is

R_c = Σ_{i=2}^n (n choose i) (q/n) ((i − 1)/n) p^{i−2} (1 − p)^{n−(i−1)}.   (25)

To complete the proof, we simplify the above expression. Let x = p/(1 − p). Taking the derivative with respect to x of both sides of the equality Σ_{i=1}^n (n choose i) x^{i−1} = (1/x)[(1 + x)^n − 1], we have

Σ_{i=2}^n (n choose i) (i − 1) x^{i−2} = [1 + (1 + x)^{n−1}(nx − x − 1)] / x².   (26)

Using (26) in (25) completes the proof.

Corollary 1: Consider the case where the cache sizes are just enough to store the data required for processing, that is, s = q/n. Then, R_c = (1/2) R_u. Thus, one gets a factor-2 reduction in communication rate by exploiting coded caching.

Note that when s = q/n, p = 0. Finding the limit lim_{p→0} R_c in (18), after some manipulations, one calculates

R_c = q (1 − s/q) · 1/(1 + ns/q) = R_u/2,   (27)

which shows Corollary 1.

Corollary 2: Consider the regime of interest where n, s, and q get large, s/q → c > 0, and n/q → 0. Then,

R_c → q (1 − s/q) · 1/(ns/q) = R_u/(ns/q).   (28)

Thus, using coding, the communication rate is reduced by Θ(n).

Remark 5 (The Advantage of Using Multicasting Over Unicasting): It is reasonable to assume that γ(n) = n for wireless architectures, which are of great interest with the emergence of wireless data centers, e.g., [87], [88], and mobile computing platforms [89]. However, in many applications the network topology is still based on point-to-point communication, and the multicasting opportunity is not fully available, i.e., γ(n) < n. For these general cases, we have to renormalize the communication cost of coded shuffling, since we have assumed that γ(n) = n in our results. For instance, in the regime considered in Corollary 2, the renormalized communication cost of coded shuffling, R_c^γ, given γ(n) is

R_c^γ = (n/γ(n)) R_c → R_u/(γ(n) s/q).   (29)

Thus, the communication cost of coded shuffling is smaller than that of uncoded shuffling if γ(n) > q/s. Note that s/q is the fraction of the data matrix that can be stored in each worker's cache. Thus, in the regime of interest where s/q is a constant independent of n, and γ(n) scales with n, the reduction gain of coded shuffling in communication cost is still unbounded and increasing in n.

We emphasize that even in point-to-point communication networks, multicasting the same message to multiple nodes is significantly faster than unicasting different messages (of the same size) to multiple nodes, i.e., γ(n) ≫ 1, justifying the advantage of using coded shuffling. For instance, the MPI broadcast API (MPI_Bcast) utilizes a tree multicast algorithm, which achieves γ(n) = Θ(n/log n). Shown in Fig. 11 is the time taken for a data block to be transmitted to an increasing number of workers on an Amazon EC2 cluster, which consists of a point-to-point communication network. We compare the average transmission time taken with MPI scatter (unicast) and that with MPI broadcast. Observe that the average transmission time increases linearly as the number of receivers increases, but with MPI broadcast, the average transmission time increases logarithmically.

Fig. 11. Gains of multicasting over unicasting in distributed systems. We measure the time taken for a data block of size 4.15 MB to be transmitted to a targeted number of workers on an Amazon EC2 cluster, and compare the average transmission time taken with Message Passing Interface (MPI) scatter (unicast) and that with MPI broadcast. Observe that the average transmission time increases linearly as the number of receivers increases, but with MPI broadcast, the average transmission time increases logarithmically.

V. CONCLUSION

In this paper, we have explored the power of coding in order to make distributed algorithms robust to a variety of sources of "system noise" such as stragglers and communication bottlenecks. We propose a novel Coded Computation


framework that can significantly speed up existing distributed algorithms, by introducing redundancy through codes into the computation. Further, we propose Coded Shuffling, which can significantly reduce the heavy price of data shuffling that is required for achieving high statistical efficiency in distributed machine learning algorithms. Our preliminary experimental results validate the power of our proposed schemes in effectively curtailing the negative effects of system bottlenecks, and attaining significant speedups of up to 40%, compared to the current state-of-the-art methods.

There exists a whole host of theoretical and practical open problems related to the results of this paper. For coded computation, instead of the MDS codes, one could achieve different tradeoffs by employing another class of codes. Then, although matrix multiplication is one of the most basic computational blocks in many analytics, it would be interesting to leverage coding for a broader class of distributed algorithms.

For coded shuffling, convergence analysis of distributed machine learning algorithms under shuffling is not well understood. As we observed in the experiments, shuffling significantly reduces the number of iterations required to achieve a target reliability, but missing is a rigorous analysis that compares the convergence performances of algorithms with shuffling or without shuffling. Further, the trade-offs between bandwidth, storage, and the statistical efficiency of the distributed algorithms are not well understood. Moreover, it is not clear how far our achievable scheme, which achieves a bandwidth reduction gain of Θ(1/n), is from the fundamental limit of communication rate for coded shuffling. Therefore, finding an information-theoretic lower bound on the rate of coded shuffling is another interesting open problem.

REFERENCES

[1] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Speeding up distributed machine learning using codes," presented at the Neural Inf. Process. Syst. Workshop Mach. Learn. Syst., Dec. 2015.
[2] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Speeding up distributed machine learning using codes," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2016, pp. 1143–1147.
[3] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," in Proc. 2nd USENIX Workshop Hot Topics Cloud Comput. (HotCloud), 2010, p. 95. [Online]. Available: https://www.usenix.org/conference/hotcloud-10/spark-cluster-computing-working-sets
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proc. 6th Symp. Oper. Syst. Design Implement. (OSDI), 2004, pp. 137–150. [Online]. Available: http://www.usenix.org/events/osdi04/tech/dean.html
[5] J. Dean and L. A. Barroso, "The tail at scale," Commun. ACM, vol. 56, no. 2, pp. 74–80, Feb. 2013.
[6] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," IEEE Trans. Inf. Theory, vol. 56, no. 9, pp. 4539–4551, Sep. 2010.
[7] K. V. Rashmi, N. B. Shah, and P. V. Kumar, "Optimal exact-regenerating codes for distributed storage at the MSR and MBR points via a product-matrix construction," IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 5227–5239, Aug. 2011.
[8] C. Suh and K. Ramchandran, "Exact-repair MDS code construction using interference alignment," IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1425–1442, Mar. 2011.
[9] I. Tamo, Z. Wang, and J. Bruck, "MDS array codes with optimal rebuilding," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Aug. 2011, pp. 1240–1244.
[10] V. R. Cadambe, C. Huang, S. A. Jafar, and J. Li. (2011). "Optimal repair of MDS codes in distributed storage via subspace interference alignment." [Online]. Available: https://arxiv.org/abs/1106.1250
[11] D. S. Papailiopoulos, A. G. Dimakis, and V. R. Cadambe, "Repair optimal erasure codes through Hadamard designs," in Proc. 49th Annu. Allerton Conf. Commun., Control, Comput. (Allerton), 2011, pp. 1382–1389.
[12] P. Gopalan, C. Huang, H. Simitci, and S. Yekhanin, "On the locality of codeword symbols," IEEE Trans. Inf. Theory, vol. 58, no. 11, pp. 6925–6934, Nov. 2011.
[13] F. Oggier and A. Datta, "Self-repairing homomorphic codes for distributed storage systems," in Proc. IEEE INFOCOM, Apr. 2011, pp. 1215–1223.
[14] D. S. Papailiopoulos, J. Luo, A. G. Dimakis, C. Huang, and J. Li, "Simple regenerating codes: Network coding for cloud storage," in Proc. IEEE INFOCOM, Mar. 2012, pp. 2801–2805.
[15] J. Han and L. A. Lastras-Montano, "Reliable memories with subline accesses," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2007, pp. 2531–2535.
[16] C. Huang, M. Chen, and J. Li, "Pyramid codes: Flexible schemes to trade space for access efficiency in reliable data storage systems," in Proc. 6th IEEE Int. Symp. Netw. Comput. Appl. (NCA), Jul. 2007, pp. 79–86.
[17] D. S. Papailiopoulos and A. G. Dimakis, "Locally repairable codes," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Mar. 2012, pp. 2771–2775.
[18] G. M. Kamath, N. Prakash, V. Lalitha, and P. V. Kumar. (2012). "Codes with local regeneration." [Online]. Available: https://arxiv.org/abs/1211.1932
[19] A. S. Rawat, O. O. Koyluoglu, N. Silberstein, and S. Vishwanath, "Optimal locally repairable and secure codes for distributed storage systems," IEEE Trans. Inf. Theory, vol. 60, no. 1, pp. 212–236, Jan. 2014.
[20] N. Prakash, G. M. Kamath, V. Lalitha, and P. V. Kumar, "Optimal linear codes with a local-error-correction property," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2012, pp. 2776–2780.
[21] N. Silberstein, A. S. Rawat, and S. Vishwanath, "Error resilience in distributed storage via rank-metric codes," in Proc. 50th Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Monticello, IL, USA, 2012, pp. 1150–1157.
[22] C. Huang et al., "Erasure coding in Windows Azure storage," in Proc. USENIX Annu. Tech. Conf. (ATC), Jun. 2012, pp. 15–26.
[23] M. Sathiamoorthy et al., "XORing elephants: Novel erasure codes for big data," Proc. VLDB Endowment, vol. 6, no. 5, pp. 325–336, 2013.
[24] K. V. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran, "A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster," in Proc. USENIX HotStorage, Jun. 2013.
[25] K. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran, "A hitchhiker's guide to fast and efficient data reconstruction in erasure-coded data centers," in Proc. ACM Conf. SIGCOMM, 2014, pp. 331–342.
[26] G. Ananthanarayanan et al., "Reining in the outliers in Map-Reduce clusters using Mantri," in Proc. 9th USENIX Symp. Oper. Syst. Des. Implement. (OSDI), 2010, pp. 265–278. [Online]. Available: http://www.usenix.org/events/osdi10/tech/full_papers/Ananthanarayanan.pdf
[27] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in Proc. 8th USENIX Symp. Oper. Syst. Des. Implement. (OSDI), 2008, pp. 29–42. [Online]. Available: http://www.usenix.org/events/osdi08/tech/full_papers/zaharia/zaharia.pdf
[28] A. Agarwal and J. C. Duchi, "Distributed delayed stochastic optimization," in Proc. 25th Annu. Conf. Neural Inf. Process. Syst. (NIPS), 2011, pp. 873–881. [Online]. Available: http://papers.nips.cc/paper/4247-distributed-delayed-stochastic-optimization
[29] B. Recht, C. Re, S. Wright, and F. Niu, "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," in Proc. 25th Annu. Conf. Neural Inf. Process. Syst. (NIPS), 2011, pp. 693–701.
[30] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, "Effective straggler mitigation: Attack of the clones," in Proc. 10th USENIX Symp. Netw. Syst. Des. Implement. (NSDI), 2013, pp. 185–198. [Online]. Available: https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/ananthanarayanan
[31] N. B. Shah, K. Lee, and K. Ramchandran, "When do redundant requests reduce latency?" in Proc. 51st Annu. Allerton Conf. Commun., Control, Comput., 2013, pp. 731–738.
[32] D. Wang, G. Joshi, and G. W. Wornell, "Efficient task replication for fast response times in parallel computation," ACM SIGMETRICS, vol. 42, no. 1, pp. 599–600, 2014.


[33] K. Gardner, S. Zbarsky, S. Doroudi, M. Harchol-Balter, and E. Hyytiä, "Reducing latency via redundant requests: Exact analysis," ACM SIGMETRICS, vol. 43, no. 1, pp. 347–360, 2015.
[34] M. Chaubey and E. Saule, "Replicated data placement for uncertain scheduling," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshop (IPDPS), May 2015, pp. 464–472.
[35] K. Lee, R. Pedarsani, and K. Ramchandran, "On scheduling redundant requests with cancellation overheads," in Proc. 53rd Annu. Allerton Conf. Commun., Control, Comput., Oct. 2015, pp. 1279–1290.
[36] G. Joshi, E. Soljanin, and G. Wornell, "Efficient redundancy techniques for latency reduction in cloud systems," ACM Trans. Model. Perform. Eval. Comput. Syst., vol. 2, no. 2, pp. 12:1–12:30, Apr. 2017.
[37] L. Huang, S. Pawar, H. Zhang, and K. Ramchandran, "Codes can reduce queueing delay in data centers," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2012, pp. 2766–2770.
[38] K. Lee, N. B. Shah, L. Huang, and K. Ramchandran, "The MDS queue: Analysing the latency performance of erasure codes," IEEE Trans. Inf. Theory, vol. 63, no. 5, pp. 2822–2842, May 2017.
[39] G. Joshi, Y. Liu, and E. Soljanin, "On the delay-storage trade-off in content download from coded distributed storage systems," IEEE J. Sel. Areas Commun., vol. 32, no. 5, pp. 989–997, May 2014.
[40] Y. Sun, Z. Zheng, C. E. Koksal, K.-H. Kim, and N. B. Shroff. (2015). "Provably delay efficient data retrieving in storage clouds." [Online]. Available: https://arxiv.org/abs/1501.01661
[41] S. Kadhe, E. Soljanin, and A. Sprintson, "When do the availability codes make the stored data more available?" in Proc. 53rd Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Sep. 2015, pp. 956–963.
[42] S. Kadhe, E. Soljanin, and A. Sprintson, "Analyzing the download time of availability codes," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2015, pp. 1467–1471.
[43] N. Ferdinand and S. Draper, "Anytime coding for distributed computation," presented at the 54th Annu. Allerton Conf. Commun., Control, Comput., Monticello, IL, USA, 2016.
[44] S. Dutta, V. Cadambe, and P. Grover, "Short-dot: Computing large linear transforms distributedly using coded short dot products," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2092–2100.
[45] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis. (2016). "Gradient coding." [Online]. Available: https://arxiv.org/abs/1612.03301
[46] R. Bitar, P. Parag, and S. E. Rouayheb. (2017). "Minimizing latency for secure distributed computing." [Online]. Available: https://arxiv.org/abs/1703.01504
[47] K. Lee, C. Suh, and K. Ramchandran, "High-dimensional coded matrix multiplication," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2017, pp. 1–2.
[48] A. Reisizadehmobarakeh, S. Prakash, R. Pedarsani, and S. Avestimehr, "Coded computation over heterogeneous clusters." [Online]. Available: https://arxiv.org/abs/1701.05973
[49] K. Lee, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Coded computation for multicore setups," presented at the ISIT, Jun. 2017.
[50] D. P. Bertsekas, Nonlinear Programming. Belmont, MA, USA: Athena Scientific, 1999.
[51] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.
[52] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, Jan. 2011.
[53] R. Bekkerman, M. Bilenko, and J. Langford, Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge, U.K.: Cambridge Univ. Press, 2011.
[54] J. C. Duchi, A. Agarwal, and M. J. Wainwright, "Dual averaging for distributed optimization: Convergence analysis and network scaling," IEEE Trans. Autom. Control, vol. 57, no. 3, pp. 592–606, Mar. 2012.
[55] J. Chen and A. H. Sayed, "Diffusion adaptation strategies for distributed optimization and learning over networks," IEEE Trans. Signal Process., vol. 60, no. 8, pp. 4289–4305, Aug. 2012.
[56] J. Dean et al., "Large scale distributed deep networks," in Proc. 26th Annu. Conf. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1232–1240. [Online]. Available: http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks
[57] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, "Distributed GraphLab: A framework for machine learning and data mining in the cloud," Proc. VLDB Endowment, vol. 5, no. 8, pp. 716–727, 2012.
[58] T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. J. Franklin, and M. I. Jordan, "MLbase: A distributed machine-learning system," in Proc. 6th Biennial Conf. Innov. Data Syst. Res. (CIDR), Jan. 2013, p. 2. [Online]. Available: http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper118.pdf
[59] E. R. Sparks et al., "MLI: An API for distributed machine learning," in Proc. IEEE 13th Int. Conf. Data Mining (ICDM), 2013, pp. 1187–1192.
[60] M. Li et al., "Scaling distributed machine learning with the parameter server," in Proc. 11th USENIX Symp. Oper. Syst. Des. Implement. (OSDI), 2014, pp. 583–598. [Online]. Available: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu
[61] B. Recht and C. Ré, "Parallel stochastic gradient algorithms for large-scale matrix completion," Math. Program. Comput., vol. 5, no. 2, pp. 201–226, 2013.
[62] L. Bottou, "Stochastic gradient descent tricks," in Neural Networks: Tricks of the Trade, 2nd ed., 2012, pp. 421–436.
[63] C. Zhang and C. Ré, "DimmWitted: A study of main-memory statistical analytics," Proc. VLDB Endowment, vol. 7, no. 12, pp. 1283–1294, 2014.
[64] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. (2015). "Why random reshuffling beats stochastic gradient descent." [Online]. Available: https://arxiv.org/abs/1510.08560
[65] S. Ioffe and C. Szegedy. (2015). "Batch normalization: Accelerating deep network training by reducing internal covariate shift." [Online]. Available: https://arxiv.org/abs/1502.03167
[66] M. A. Maddah-Ali and U. Niesen, "Fundamental limits of caching," IEEE Trans. Inf. Theory, vol. 60, no. 5, pp. 2856–2867, May 2014.
[67] M. A. Maddah-Ali and U. Niesen, "Decentralized coded caching attains order-optimal memory-rate tradeoff," IEEE/ACM Trans. Netw., vol. 23, no. 4, pp. 1029–1040, Aug. 2014.
[68] R. Pedarsani, M. A. Maddah-Ali, and U. Niesen, "Online coded caching," in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2014, pp. 1878–1883.
[69] N. Karamchandani, U. Niesen, M. A. Maddah-Ali, and S. Diggavi, "Hierarchical coded caching," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2014, pp. 2142–2146.
[70] M. Ji, G. Caire, and A. F. Molisch, "Fundamental limits of distributed caching in D2D wireless networks," in Proc. IEEE Inf. Theory Workshop (ITW), Sep. 2013, pp. 1–5.
[71] S. Li, M. A. Maddah-Ali, and S. Avestimehr, "Coded MapReduce," presented at the 53rd Annu. Allerton Conf. Commun., Control, Comput., Monticello, IL, USA, 2015.
[72] Y. Birk and T. Kol, "Coding on demand by an informed source (ISCOD) for efficient broadcast of different supplemental data to caching clients," IEEE Trans. Inf. Theory, vol. 52, no. 6, pp. 2825–2830, Jun. 2006.
[73] Z. Bar-Yossef, Y. Birk, T. S. Jayram, and T. Kol, "Index coding with side information," IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1479–1494, Mar. 2011.
[74] M. A. Attia and R. Tandon, "Information theoretic limits of data shuffling for distributed learning," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Dec. 2016, pp. 1–6.
[75] M. A. Attia and R. Tandon, "On the worst-case communication overhead for distributed data shuffling," in Proc. 54th Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Sep. 2016, pp. 961–968.
[76] L. Song and C. Fragouli. (2017). "A pliable index coding approach to data shuffling." [Online]. Available: https://arxiv.org/abs/1701.05540
[77] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, "A unified coding framework for distributed computing with straggling servers," in Proc. IEEE Globecom Workshops (GC Wkshps), Washington, DC, USA, 2016, pp. 1–6.
[78] T. M. Cover and J. A. Thomas, Elements of Information Theory. Hoboken, NJ, USA: Wiley, 2012.
[79] S. Lin and D. J. Costello, Error Control Coding, vol. 2. Englewood Cliffs, NJ, USA: Prentice-Hall, 2004.
[80] G. Liang and U. C. Kozat, "TOFEC: Achieving optimal throughput-delay trade-off of cloud storage using erasure codes," in Proc. IEEE Conf. Comput. Commun. (INFOCOM), Apr. 2014, pp. 826–834.
[81] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[82] X. Meng et al., "MLlib: Machine learning in Apache Spark," J. Mach. Learn. Res., vol. 17, no. 1, pp. 1235–1241, 2016.


[83] Open MPI: Open Source High Performance Computing. Accessed: Nov. 25, 2015. [Online]. Available: http://www.open-mpi.org
[84] StarCluster. Accessed: Nov. 25, 2015. [Online]. Available: http://star.mit.edu/cluster/
[85] BLAS (Basic Linear Algebra Subprograms). Accessed: Nov. 25, 2015. [Online]. Available: http://www.netlib.org/blas/
[86] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, "Fundamental tradeoff between computation and communication in distributed computing," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jul. 2016, pp. 1814–1818.
[87] D. Halperin, S. Kandula, J. Padhye, P. Bahl, and D. Wetherall, "Augmenting data center networks with multi-gigabit wireless links," ACM SIGCOMM Comput. Commun. Rev., vol. 41, no. 4, pp. 38–49, Aug. 2011.
[88] Y. Zhu et al., "Cutting the cord: A robust wireless facilities network for data centers," in Proc. 20th Annu. Int. Conf. Mobile Comput. Netw., 2014, pp. 581–592.
[89] M. Y. Arslan, I. Singh, S. Singh, H. V. Madhyastha, K. Sundaresan, and S. V. Krishnamurthy, "Computing while charging: Building a distributed computing infrastructure using smartphones," in Proc. 8th Int. Conf. Emerg. Netw. Experim. Technol., 2012, pp. 193–204.

Kangwook Lee is a postdoctoral researcher in the School of Electrical Engineering, KAIST. Kangwook earned his Ph.D. in EECS from UC Berkeley in 2016, under the supervision of Kannan Ramchandran. He is a recipient of the KFAS Fellowship from 2010 to 2015. His research interests lie in information theory and machine learning.

Maximilian Lam is a computer science student at UC Berkeley whose main research interests are systems and machine learning.

Ramtin Pedarsani is an Assistant Professor in the ECE Department at the University of California, Santa Barbara. He received the B.Sc. degree in electrical engineering from the University of Tehran, Tehran, Iran, in 2009, the M.Sc. degree in communication systems from the Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland, in 2011, and his Ph.D. from the University of California, Berkeley, in 2015. His research interests include networks, machine learning, stochastic systems, information and coding theory, and transportation systems. Ramtin is a recipient of the IEEE International Conference on Communications (ICC) Best Paper Award in 2014.

Dimitris Papailiopoulos is an Assistant Professor of Electrical and Computer Engineering at the University of Wisconsin-Madison and a Faculty Fellow of the Grainger Institute for Engineering. Between 2014 and 2016, Papailiopoulos was a postdoctoral researcher in EECS at UC Berkeley and a member of the AMPLab. His research interests span machine learning, coding theory, and distributed algorithms, with a current focus on coordination-avoiding parallel machine learning and the use of erasure codes to speed up distributed computation. Dimitris earned his Ph.D. in ECE from UT Austin in 2014, under the supervision of Alex Dimakis. In 2015, he received the IEEE Signal Processing Society Young Author Best Paper Award.

Kannan Ramchandran (Ph.D., Columbia University, 1993) is a Professor of Electrical Engineering and Computer Sciences at UC Berkeley, where he has been since 1999. He was on the faculty at the University of Illinois at Urbana-Champaign from 1993 to 1999, and with AT&T Bell Labs from 1984 to 1990. He is an IEEE Fellow, and a recipient of the 2017 IEEE Kobayashi Computers and Communications Award, which recognizes outstanding contributions to the integration of computers and communications. His research awards include an IEEE Information Theory Society and Communication Society Joint Best Paper Award for 2012, an IEEE Communication Society Data Storage Best Paper Award in 2010, two Best Paper Awards from the IEEE Signal Processing Society in 1993 and 1999, an Okawa Foundation Prize for outstanding research at Berkeley in 2001, an Outstanding Teaching Award at Berkeley in 2009, and a Hank Magnuski Scholar Award at Illinois in 1998. His research interests are at the intersection of signal processing, coding theory, communications, and networking with a focus on theory and algorithms for large-scale distributed systems.

