
Foundations and Trends® in Communications and Information Theory
Coded Computing: Mitigating
Fundamental Bottlenecks in
Large-Scale Distributed Computing
and Machine Learning
Suggested Citation: Songze Li and Salman Avestimehr (2020), “Coded Computing:
Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine
Learning”, Foundations and Trends® in Communications and Information Theory: Vol.
17, No. 1, pp 1–148. DOI: 10.1561/0100000103.

Songze Li
University of Southern California
USA
[email protected]
Salman Avestimehr
University of Southern California
USA
[email protected]

This article may be used only for the purpose of research, teaching,
and/or private study. Commercial use or systematic downloading
(by robots or other automatic processes) is prohibited without ex-
plicit Publisher approval.
Boston — Delft
Contents

1 Introduction
1.1 Coding for Bandwidth Reduction
1.2 Coding for Straggler Mitigation
1.3 Coding for Security and Privacy
1.4 Related Works

2 Coding for Bandwidth Reduction
2.1 A Fundamental Tradeoff Between Computation and Communication
2.2 Empirical Evaluations of Coded Distributed Computing
2.3 Extension to Wireless Distributed Computing
2.4 Related Works and Open Problems

3 Coding for Straggler Mitigation
3.1 Optimal Coding for Matrix Multiplications
3.2 Optimal Coding for Polynomial Evaluations
3.3 Related Works and Open Problems

4 Coding for Security and Privacy
4.1 Secure and Private Multiparty Computing
4.2 Privacy Preserving Machine Learning
4.3 Related Works and Open Problems

Acknowledgements

Appendices

A Proof of Lemma 3.6

References
Coded Computing: Mitigating
Fundamental Bottlenecks in
Large-Scale Distributed Computing
and Machine Learning
Songze Li1 and Salman Avestimehr2
1 University of Southern California, USA; [email protected]
2 University of Southern California, USA; [email protected]

ABSTRACT
We introduce the concept of “coded computing”, a novel
computing paradigm that utilizes coding theory to effec-
tively inject and leverage data/computation redundancy
to mitigate several fundamental bottlenecks in large-scale
distributed computing, namely communication bandwidth,
straggler’s (i.e., slow or failing nodes) delay, privacy and
security bottlenecks. More specifically, for MapReduce based
distributed computing structures, we propose the “Coded
Distributed Computing” (CDC) scheme, which injects re-
dundant computations across the network in a structured
manner, such that in-network coding opportunities are en-
abled to substantially slash the communication load to shuf-
fle the intermediate computation results. We prove that
CDC achieves the optimal tradeoff between computation
and communication, and demonstrate its impact on a wide
range of distributed computing systems from cloud-based
datacenters to mobile edge/fog computing platforms.


Secondly, to alleviate the straggler effect that prolongs the


executions of distributed machine learning algorithms, we
utilize the ideas from error correcting codes to develop
“Polynomial Codes” for computing general matrix algebra,
and “Lagrange Coded Computing” (LCC) for computing
arbitrary multivariate polynomials. The core idea of these
proposed schemes is to apply coding to create redundant
data/computation scattered across the network, such that
completing the overall computation task only requires a sub-
set of the network nodes returning their local computation
results. We demonstrate the optimality of Polynomial Codes
and LCC in minimizing the computation latency, by proving
that they require the least number of nodes to return their
results.
Finally, we illustrate the role of coded computing in pro-
viding security and privacy in distributed computing and
machine learning. In particular, we consider the problems of
secure multiparty computing (MPC) and privacy-preserving
machine learning, and demonstrate how coded computing
can be leveraged to provide efficient solutions to these criti-
cal problems and enable substantial improvements over the
state of the art.
To illustrate the impact of coded computing on real world
applications and systems, we implement the proposed coding
schemes on cloud-based distributed computing systems, and
significantly improve the run-time performance of important
benchmarks including distributed sorting, distributed train-
ing of regression models, and privacy-preserving training for
image classification. Throughout this monograph, we also
highlight numerous open problems and exciting research
directions for future work on coded computing.
1 Introduction

Recent years have witnessed a rapid growth of large-scale machine


learning and big data analytics, facilitating the developments of data-
intensive applications like voice/image recognition, real-time mapping
services, autonomous driving, social networks, and augmented/virtual
reality. These applications are supported by cloud infrastructures com-
posed of large datacenters. Within a datacenter, a massive amount of
users’ data are stored distributedly on hundreds of thousands of low-end
commodity servers, and any application of big data analytics has to
be performed in a distributed manner within or across datacenters.
This has motivated the fast development of scalable, interpretable, and
fault-tolerant distributed computing frameworks (see, e.g., [42, 56, 129,
171, 175]) that efficiently utilize the underlying hardware resources (e.g.,
CPUs and GPUs).
In this monograph, we focus on addressing the following three
major performance bottlenecks for large-scale distributed machine learn-
ing/data analytics systems.

• Communication bottleneck: Excessive data shuffling between com-


pute nodes.


• Straggler bottleneck: Delay of computation caused by slow or


failing compute nodes, which are referred to as stragglers.

• Security bottleneck: Vulnerability to eavesdroppers and attackers.

To alleviate these bottlenecks, we take an unorthodox approach by


employing ideas and techniques from coding theory, and propose the
concept of “coded computing”, whose core spirit is described as follows.

Exploiting coding theory to optimally inject and leverage


data/task redundancy in distributed computing systems, creat-
ing coding opportunities to overcome communication, straggler,
and security bottlenecks.

Guided by this core spirit, we propose and evaluate a rich class


of coded distributed computing frameworks, for computation tasks
ranging from general MapReduce primitives to fundamental polynomial
algebra, and for computation systems ranging from conventional cloud-
based datacenters to emerging (mobile) edge/fog computing systems.
In the rest of this section, we describe our contributions on utilizing
coded computing to mitigate the communication, straggler, and security
bottlenecks, and discuss related works.
Before proceeding with the overview of coded computing, we would
like to also point out an important remark. In order to enable redundant
computations in coded computing, we need to also redundantly store
the datasets over which the computations are done. This would impose
a certain communication and storage cost to the system. However, in
many applications this cost can be ignored due to the following two
reasons. First, in many computation scenarios we are interested in many
computations over the same dataset (e.g., database query, keyword
search, loss calculation in machine learning, etc.). In those cases the
cost of encoding and redundantly storing the dataset in the network
can be amortized over many computations. Second, in many scenarios,
the encoding and storage of the dataset can happen at a different time
than the desired computations. For example, one can use the off-peak

network times to properly encode and store the dataset, so as to be


ready for computations during the peak times.

1.1 Coding for Bandwidth Reduction

It is well known that communicating intermediate computation results


(or data shuffling) is one of the major performance bottlenecks for various
distributed computing applications, including self-join [3], TeraSort [58],
and many machine learning algorithms [36]. For instance, in a Facebook
Hadoop cluster, it is observed that 33% of the overall job execution
time is spent on data shuffling [36]. Also as is observed in [174], 70% of
the overall job execution time is spent on data shuffling when running a
self-join job on Amazon EC2 clusters. This bottleneck is becoming worse
for training deep neural networks with millions of model parameters
(e.g., ResNet-50 [65]), where partial gradients with millions of entries
are computed at distributed computing nodes and passed across the
network to update the model parameters [33].
Many optimization methods have been proposed to alleviate the
communication bottleneck in distributed computing systems. For ex-
ample, from the algorithm perspective, when the function that reduces
the final result is commutative and associative, it was proposed to
pre-combine intermediate results before data shuffling, cutting off the
amount of data movement [42, 125]. On the other hand, from the
system perspective, optimal flow scheduling across network paths has
been designed to accelerate the data shuffling process [53, 57], and
distributed cache memories were utilized to speed up the data transfer
between consecutive computation stages [49, 173]. Recently, motivated
by the fact that training algorithms exhibit tolerance to precision loss
of intermediate results, a family of lossy compression (or quantization)
algorithms for distributed learning systems have been developed to com-
press the intermediate results (e.g., gradients), and then the compressed
results are communicated to achieve a smaller bandwidth consumption
(see, e.g., [5, 19, 138, 158]).
The above mentioned approaches are designed for specific compu-
tations and network structures, and difficult to generalize to handle
arbitrary computation tasks. To overcome these difficulties, we focus

on a general MapReduce-type distributed computing model [42], and


propose to utilize coding theory to slash the communication bottleneck
in running MapReduce applications. In particular, in this computing
model, each input file is mapped into multiple intermediate values, one
for each of the output functions, and the intermediate values from all
input files for each output function are collected and reduced to the
final output result. For this model, we propose a coded computing
scheme, named “coded distributed computing” (CDC), which trades
extra local computations for more network bandwidth. For some de-
sign parameter r, which is termed the “computation load”, the CDC
scheme places and maps each of the input files on r carefully chosen
distributed computing nodes, injecting r times more local computations.
In return, the redundant computations produce side information at the
nodes, which enable the opportunities to create coded multicast packets
during data shuffling that are simultaneously useful for r nodes. That
is, the CDC scheme trades r times more redundant computations for
an r times reduction in the communication load. Furthermore, we theo-
retically demonstrate that this inversely proportional tradeoff between
computation and communication achieved by CDC is fundamental, i.e.,
for a given computation load, no other schemes can achieve a lower
communication load than that achieved by CDC.
Having proposed the CDC framework and characterized its opti-
mal performance in trading extra computations for communication
bandwidth, we also empirically demonstrate its impact on speeding up
practical workloads. In particular, we integrate the principle of CDC
into the widely used Hadoop sorting benchmark, TeraSort [62], de-
veloping a novel distributed sorting algorithm, named CodedTeraSort.
At a high level, CodedTeraSort imposes structured redundancy in the
input data, enabling in-network coding opportunities to significantly
slash the load of data shuffling, which is a major bottleneck of the
run-time performance of TeraSort. Through extensive experiments
on Amazon EC2 [7] clusters, we demonstrate that CodedTeraSort
achieves 1.97×∼3.39× speedup over TeraSort, for typical settings of
interest. Despite the extra overhead imposed by coding (e.g., genera-
tion of the coding plan, data encoding and decoding), the practically

achieved performance gain approximately matches the gain theoretically


promised by CodedTeraSort.
Beyond the conventional wireline networks in datacenters, we also
introduce the concept of coded computing to tackle the scenarios of
mobile edge/fog computing, where the communication bottleneck is
even more severe due to the low data rate and the large number of
mobile users. In particular, we consider a wireless distributed computing
platform, which is composed of a cluster of mobile users scattered around
the network edge, connected wirelessly through an access point. Each
user has a limited storage and processing capability, and the users have to
collaborate to satisfy their computational needs that require processing
a large dataset. This ad hoc computing model, in contrast to the
centralized cloud computing model, is becoming increasingly common
in the emerging edge computing paradigm for Internet-of-Things (IoT)
applications [28, 32]. For this model, following the principle of the CDC
scheme, we propose a coded wireless distributed computing (CWDC)
scheme that jointly designs the local storage and computation for each
user, and the communication schemes between the users. The CWDC
scheme achieves a constant bandwidth consumption that is independent
of the number of users in the network, which leads to a scalable design of
the platform that can simultaneously accommodate an arbitrary number
of users. Moreover, for a more practically important decentralized setting,
in which each user needs to decide its local storage and computation
independently without knowing the existence of any other participating
users, we extend the CWDC scheme to achieve a bandwidth consumption
that is very close to that of the centralized setting.

1.2 Coding for Straggler Mitigation

Other than data shuffling, another major performance bottleneck of


distributed computing applications is the effect of stragglers. That is,
the execution time of a computation consisting of multiple parallel tasks
is limited by the slowest task run on the straggling processor. These
stragglers significantly slow down the overall computations, and have
been widely observed in distributed computing systems (see, e.g., [8, 41,
172]). For instance, it was experimentally demonstrated in [172] that

this straggler effect can prolong the job execution time by as much as
five times.
Conventionally, in the original open-source implementation of
Hadoop MapReduce [10], the stragglers are constantly detected and
the slow tasks are speculatively restarted on other available nodes. Fol-
lowing this idea of straggler detection, more timely straggler detection
algorithms and better scheduling algorithms have been developed to
further alleviate the straggler effect (see, e.g., [9, 172]). Apart from
straggler detection and speculative restart, another straggler mitigation
technique is to schedule the clones of the same task (see, e.g., [8, 30,
55, 91, 139]). The underlying idea of cloning is to execute redundant
tasks such that the computation can proceed when the results of the
fast-responding clones have returned. Recently, it has been proposed
to utilize error correcting codes for straggler mitigation in distributed
matrix-vector multiplication [47, 89, 96, 106]. The main idea is to par-
tition the data matrix into K batches, and then generate N coded
batches using the maximum-distance-separable (MDS) code [101], and
assign multiplication with each of the coded batches to a worker node.
Benefiting from the “any K of N ” property of the MDS code, the
computation can be accomplished as long as any K fastest nodes have
finished their computations, providing the system the robustness to
up to N − K arbitrary stragglers. This coded approach was shown to
significantly outperform the state-of-the-art cloning approaches in strag-
gler mitigation capability, and to minimize the overall computation
latency.
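To make this concrete, here is a minimal numerical sketch (our own illustration, not code from the works cited above) of the MDS-coded matrix-vector idea, using a real-valued Vandermonde generator in place of a finite-field MDS code; the function names and toy dimensions are ours.

```python
import numpy as np

def mds_encode(A, K, N):
    """Split the rows of A into K batches and form N coded batches
    A_tilde[i] = sum_k alpha_i**k * A_k  (a real Vandermonde / RS-style code)."""
    batches = np.split(A, K, axis=0)                 # K row-batches of A
    alphas = np.arange(1, N + 1, dtype=float)        # distinct evaluation points
    coded = [sum(a**k * batches[k] for k in range(K)) for a in alphas]
    return coded, alphas

def decode(results, alphas_used, K):
    """Recover [A_0 x; ...; A_{K-1} x] from any K worker results by
    inverting the corresponding K x K Vandermonde submatrix."""
    V = np.vander(alphas_used, K, increasing=True)   # rows: [1, a, a^2, ...]
    blocks = np.linalg.solve(V, np.stack(results))   # undo the encoding
    return np.concatenate(list(blocks))

# Toy run: N = 5 workers, tolerate N - K = 2 stragglers.
K, N = 3, 5
A = np.random.randn(6, 4)                            # 6 rows, split into K = 3 batches
x = np.random.randn(4)
coded, alphas = mds_encode(A, K, N)
worker_out = [Ai @ x for Ai in coded]                # each worker multiplies its coded batch

fastest = [0, 2, 4]                                  # pretend workers 1 and 3 straggle
y = decode([worker_out[i] for i in fastest], alphas[fastest], K)
assert np.allclose(y, A @ x)
```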
Our first contribution on this topic is the development of optimal
codes, named polynomial codes, to deal with stragglers in distributed
high-dimensional matrix–matrix multiplication. More specifically, we
consider a distributed matrix multiplication problem where we aim to
compute C = AᵀB from input matrices A and B. The computation is
carried out using a distributed system with a master node and N worker
nodes, each of which can store a fixed fraction of A and B respectively
(possibly in a coded manner). For this problem, we aim to design
computation strategies that achieve the minimum possible recovery
threshold, which is defined as the minimum number of workers that the
master needs to wait for in order to compute C. While the prior works,

Table 1.1: Comparison of recovery threshold for distributed high-dimensional matrix multiplication, over a system consisting of a master node and N worker nodes

                       1D MDS Code    Product Code    Polynomial Code
Recovery threshold     Θ(N)           Θ(√N)           Θ(1)

i.e., the one dimensional MDS code (1D MDS code) in [89], and the
product code in [90] apply MDS codes on the data matrices, they are
sub-optimal in minimizing the recovery threshold. The main novelty
and advantage of the proposed polynomial code is that, by carefully
designing the algebraic structure of the coded storage at each worker,
we create an MDS structure on the intermediate computations, instead
of only the coded data matrices. This allows polynomial code to achieve
order-wise improvement over the state of the art (see Table 1.1). We also
prove the optimality of polynomial code by showing that it achieves
the information-theoretic lower bound on the recovery threshold. As
a by-product, we also prove the optimality of polynomial code under
several other performance metrics considered in previous literature.
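The following is a minimal numerical sketch (ours) of the polynomial code construction described above: each worker stores one coded column-block of A and one of B, returns the product of the two, and the master interpolates a matrix polynomial whose coefficients are exactly the blocks of AᵀB. Real-valued evaluation points stand in for the finite-field construction, and the block counts m, n and matrix sizes are toy values.

```python
import numpy as np

def poly_encode(A, B, m, n, x):
    """Coded storage of one worker with evaluation point x: A (s x r) is split
    into m column blocks and B (s x t) into n column blocks, and the worker
    stores A_tilde = sum_j A_j x^j and B_tilde = sum_k B_k x^(k*m)."""
    A_blocks = np.split(A, m, axis=1)
    B_blocks = np.split(B, n, axis=1)
    A_tilde = sum(A_blocks[j] * x**j for j in range(m))
    B_tilde = sum(B_blocks[k] * x**(k * m) for k in range(n))
    return A_tilde, B_tilde

m, n, N = 2, 2, 6                       # recovery threshold m*n = 4, N = 6 workers
A, B = np.random.randn(5, 4), np.random.randn(5, 6)
points = np.arange(1, N + 1, dtype=float)

# Each worker multiplies its two coded blocks; the result is an evaluation of the
# matrix polynomial h(x) = sum_{j,k} (A_j^T B_k) x^(j + k*m), of degree m*n - 1.
worker_out = []
for x in points:
    At, Bt = poly_encode(A, B, m, n, x)
    worker_out.append(At.T @ Bt)

# Master: interpolate h from any m*n = 4 returned evaluations (Vandermonde solve),
# then read off the blocks of C = A^T B from the coefficients.
fastest = [0, 2, 3, 5]                  # pretend workers 1 and 4 straggle
V = np.vander(points[fastest], m * n, increasing=True)
coeffs = np.linalg.solve(V, np.stack([worker_out[i].ravel() for i in fastest]))
blk = worker_out[0].shape               # each coefficient is an (r/m) x (t/n) block
C = np.block([[coeffs[j + k * m].reshape(blk) for k in range(n)] for j in range(m)])
assert np.allclose(C, A.T @ B)
```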
Going beyond matrix algebra, we also study the straggler mitigation
strategies for scenarios where the function of interest is an arbitrary
multivariate polynomial of the input dataset. This significantly broadens
the scope of the problem to cover many computations of interest in ma-
chine learning, such as various gradient and loss-function computations
in learning algorithms and tensor algebraic operations (e.g., low-rank
tensor approximation). In particular, we consider a computation task
for which the goal is to compute a function f over a large dataset
X = (X1 , . . . , XK ) to obtain K outputs Y1 = f (X1 ), . . . , YK = f (XK ).
The computation is carried over a system consisting of a master node
and N worker nodes. Each worker i stores a coded dataset X̃i generated
from X, computes f (X̃i ), and sends the obtained result to the master.
The master decodes the output Y1 , . . . , YK from the computation results
of the group of the fastest workers.
For this setting, a naive repetition scheme would repeat the compu-
tation for each data block Xk onto N/K workers, yielding a recovery
threshold of N − N/K + 1 = Θ(N ). We propose the “Lagrange Coded

Computing” (LCC) framework to minimize the recovery threshold. In


particular, denoting the degree of the function f as deg f , LCC promises
the recovery of all output results at the master as soon as it receives com-
putation results from (K − 1) deg f + 1 workers. That is, LCC achieves a
recovery threshold of (K − 1) deg f + 1. Note that the recovery threshold
of LCC is Θ(K), which is independent of the total number of workers N .
Hence, as the network expands (i.e., N grows), compared with the
naive repetition scheme, LCC benefits much more from the abundant
computation resources in alleviating the negative effects caused by slow
or failed nodes, which leads to a much lower computation latency. In
fact, we demonstrate through proving a matching information-theoretic
converse that LCC achieves the minimum possible recovery threshold
among all distributed computing schemes.
The key idea of LCC is to encode the input dataset using the
well-known Lagrange interpolation polynomial, in order to create com-
putation redundancy in a novel coded form across the workers. This
redundancy can then be exploited to provide resiliency to stragglers.
Additionally, we emphasize the following two salient features of the
data encoding of LCC (a minimal encoding sketch follows the list below):

• Universal: The data encoding is oblivious of the output function


f . Therefore, the coded data placement can be performed offline
without knowing which operations will be applied on the data.

• Incremental: When new data become available and coded data


batches need to be updated, we only need to encode the new
data and append them to the previously coded batches, instead
of accessing the entire uncoded data and re-encoding them to
update the coded data.
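As promised above, here is a minimal sketch (ours) of the Lagrange encoding idea in the straggler-only setting: the data blocks are embedded in a polynomial u(z) with u(βk) = Xk, each worker evaluates f on u(αi), and the master interpolates f(u(z)) from any (K − 1) deg f + 1 results. The toy function f, the real-valued interpolation points, and all names are assumptions for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def lagrange_basis(betas):
    """Coefficient vectors (increasing powers) of the Lagrange basis l_k(z),
    with l_k(beta_k) = 1 and l_k(beta_j) = 0 for j != k."""
    basis = []
    for k, bk in enumerate(betas):
        num, den = np.array([1.0]), 1.0
        for j, bj in enumerate(betas):
            if j != k:
                num = P.polymul(num, np.array([-bj, 1.0]))   # multiply by (z - beta_j)
                den *= bk - bj
        basis.append(num / den)
    return basis

def f(x):
    return 3.0 * x**2 + x + 1.0               # toy degree-2 polynomial, applied entrywise

K, N = 3, 7                                   # recovery threshold (K-1)*deg f + 1 = 5
X = [np.random.randn(4) for _ in range(K)]    # K data blocks
betas = np.arange(1.0, K + 1)                 # points encoding the data
alphas = 0.5 + np.arange(float(N))            # distinct points, one per worker

# Encode: u(z) = sum_k X_k * l_k(z), so u(beta_k) = X_k; worker i stores u(alpha_i).
basis = lagrange_basis(betas)
X_tilde = [sum(P.polyval(a, basis[k]) * X[k] for k in range(K)) for a in alphas]

# Workers apply f to their coded blocks; entrywise, f(u(z)) has degree (K-1)*deg f = 4.
results = [f(xt) for xt in X_tilde]

# Master interpolates f(u(z)) from any 5 results and reads off f(X_k) = f(u(beta_k)).
fastest = [0, 2, 3, 5, 6]                     # pretend workers 1 and 4 straggle
coeffs = P.polyfit(alphas[fastest], np.stack([results[i] for i in fastest]),
                   deg=(K - 1) * 2)
recovered = [P.polyval(b, coeffs) for b in betas]
assert all(np.allclose(recovered[k], f(X[k])) for k in range(K))
```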

Finally, we specialize our general theoretical guarantees for LCC


in the context of least-squares linear regression, which is one of the
elemental learning tasks, and demonstrate its performance gain by op-
timally suppressing stragglers. Leveraging the algebraic structure of
gradient computations, several strategies have been developed recently
to exploit data and gradient coding for straggler mitigation in the
training process (see, e.g., [78, 89, 94, 106, 147]). We implement LCC

for regression on Amazon EC2 clusters, and empirically compare its


performance with the conventional uncoded approaches, and two state-
of-the-art straggler mitigation schemes: gradient coding (GC) [63, 127,
147, 164] and matrix-vector multiplication (MVM) based approaches
[89, 106]. Our experiment results demonstrate that compared with the
uncoded scheme, LCC improves the run-time by 6.79×∼13.43×. Com-
pared with the GC scheme, LCC improves the run-time by 2.36×∼4.29×.
Compared with the MVM scheme, LCC improves the run-time by
1.01×∼12.65×.

1.3 Coding for Security and Privacy

Data privacy has become a major concern in the information age. The
immensity of modern datasets has popularized the use of third-party
cloud services, and as a result, the threat of privacy infringement has
increased dramatically. In order to alleviate this concern, techniques
for private computation are essential [25, 38, 102, 114]. Additionally,
third-party service providers often have an interest in the result of the
computation, and might attempt to alter it for their benefit [23, 24].
In particular, we consider a common and important scenario where a
user wishes to disperse computations over a large network of workers,
subject to the following privacy and security constraints.

• Privacy constraint: Sets of colluding workers cannot infer anything


about the input dataset in the information-theoretic sense.

• Security constraint: The computation must be accomplished suc-


cessfully even if some workers return purposefully erroneous
results.

The problem of secure and private distributed computing has been


studied extensively from various perspectives in the past, mainly within
the scope of secure multiparty computation (MPC) [18, 37, 38, 64]. Most
notably, the celebrated BGW scheme [18], which adapts the Shamir
secret sharing scheme [141] to the realm of computation, has been a
reference point for several decades. The key idea of the BGW scheme is to
view any computation task as composed of linear and bilinear functions

to be handled in multiple rounds. It applies the Shamir secret sharing


scheme to generate coded data shares with security guarantees, and
computes the function on the coded shares. We generalize the proposed
Lagrange Coded Computing (LCC) scheme designed for straggler miti-
gation purposes to also provide security and privacy guarantees to MPC
systems. Specifically, similarly as before, we consider the problem of
evaluating a multivariate polynomial f over dataset X = (X1 , . . . , XK ).
We employ a distributed computing network with a master and N
workers, and aim to compute Y1 = f (X1 ), . . . , YK = f (XK ). For this
computing system, we propose modifications to the data encoding and
computation decoding processes of LCC, and demonstrate that LCC
provides a T -private and A-secure computation of f (i.e., keeping the
dataset private amidst collusion of any T workers, and the computation
secure amidst the presence of A Byzantine adversarial workers), for any
pair (T, A) satisfying
N ≥ (K + T − 1) deg f + 2A + 1. (1.1)
Furthermore, we also demonstrate that LCC achieves an optimal tradeoff
between privacy and security, and requires a minimal amount of added
randomness to preserve privacy.
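A tiny utility (ours) that checks condition (1.1) for a given system configuration:

```python
def lcc_feasible(N, K, deg_f, T, A):
    """Check condition (1.1): N >= (K + T - 1) * deg_f + 2*A + 1, and return
    how many straggling workers the remaining budget can absorb."""
    needed = (K + T - 1) * deg_f + 2 * A + 1
    return N >= needed, N - needed

# e.g., a degree-2 polynomial over K = 10 data blocks, T = 2 privacy, A = 1 security:
print(lcc_feasible(N=30, K=10, deg_f=2, T=2, A=1))   # (True, 5): room for 5 stragglers
```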
In the presence of Byzantine workers, a subset of computation re-
sults received at the master can be arbitrarily erroneous. In order to
correctly recover the computation results, during the decoding pro-
cess, instead of mere polynomial interpolation, the master applies an
error correcting decoding algorithm for a Reed–Solomon code of dimen-
sion (K − 1) deg(f ) + 1 and length N . This allows LCC to tolerate A
malicious workers as long as 2A ≤ N − (K − 1) deg f − 1. Obtaining
information-theoretic privacy against colluding workers, i.e., keeping
small sets of workers oblivious to the dataset, does not require altering
the encoding or decoding algorithms. However, prior to encoding, the
dataset X is padded by T random elements R1 , . . . , RT , where T is the
maximum size of sets of workers that cannot infer anything about X.
We note from (1.1) that when N ≥ (K + T − 1) deg f + 2A + 1, the
LCC scheme simultaneously achieves
1. Resiliency against N − ((K + T − 1) deg f + 2A + 1) straggler
workers that prolong computations;

2. Security against A malicious workers, with no computational


restriction, that deliberately send erroneous data in order to affect
the computation for their benefit; and

3. (Information-theoretic) Privacy of the dataset amidst possible


collusion of up to T workers.

We also note that the number of workers the master needs to wait
for does not scale with the total number of workers N. Hence, the key
property of LCC is that adding one additional worker can increase its
resiliency to stragglers by 1, or increase its robustness to malicious workers
by 1/2, while maintaining the privacy constraint. This result
essentially extends the well-known optimal scaling of error-correcting
codes (i.e., adding one parity can provide robustness against 1 erasure
or 1/2 error in optimal maximum distance separable codes) to the
distributed computing paradigm. Compared with the state-of-the-art
BGW-based designs, we also show that LCC significantly improves the
storage, communication, and secret-sharing overhead needed for secure
and private multiparty computing (see Table 1.2).
Finally, we will also discuss the problem of privacy-preserving ma-
chine learning. In particular, we consider an application scenario in
which a data-owner (e.g., a hospital) wishes to train a logistic regression
model by offloading the large volume of data (e.g., healthcare records)
and computationally-intensive training tasks (e.g., gradient computa-
tions) to N machines over a cloud platform, while ensuring that any
collusions between T out of N workers do not leak information about
the dataset. We discuss a recently proposed scheme [145], named
CodedPrivateML, which leverages coded computing for this problem. More
specifically, we show how one can leverage coded computing to both
provide strong information-theoretic privacy guarantees and enable fast
training by distributing the training computation load effectively across
several workers.

Table 1.2: Comparison between BGW-based designs and LCC. The computational
complexity is normalized by that of evaluating f; randomness, which refers to the
number of random entries used in encoding functions, is normalized by the length
of Xi

                          BGW        LCC
Complexity/worker         K          1
Frac. data/worker         1          1/K
Randomness                KT         T
Min. num. of workers      2T + 1     deg f · (K + T − 1) + 1

1.4 Related Works

The problem of characterizing the minimum communication for dis-


tributed computing has been previously considered in several settings
in both computer science and information theory literature. In [163],
a basic computing model is proposed, where two parties have x and
y and aim to compute a Boolean function f (x, y) by exchanging the
minimum number of bits between them. Also, the problem of minimiz-
ing the required communication for computing the modulo-two sum
of distributed binary sources with symmetric joint distribution was
introduced in [85]. Following these two seminal works, a wide range of
communication problems in the scope of distributed computing have
been studied (cf. [16, 88, 116, 120, 121, 126]).
The idea of efficiently creating and exploiting coded multicasting
for bandwidth reduction was initially proposed in the context of cache
networks in [104, 105], and extended in [72, 79], where caches pre-
fetch part of the content in a way to enable coding during the content
delivery, minimizing the network traffic. Generally speaking, we can
also view the data shuffling of the considered distributed computing
framework as an instance of the index coding problem [15, 20], in
which a central server aims to design a broadcast message (code) with
minimum length to simultaneously satisfy the requests of all the clients,
given the clients’ side information stored in their local caches. Note
that while a randomized linear network coding approach (see e.g.,
[2, 66, 83]) is sufficient to implement any multicast communication
where messages are intended for all receivers, it is generally sub-optimal
for index coding problems where every client requests different messages.
Although the index coding problem is still open in general, for the
considered distributed computing scenario where we are given the

flexibility of designing Map computation (thus the flexibility of designing


side information), we can prove tight lower bounds on the minimum
communication loads, demonstrating the optimality of the proposed
Coded Distributed Computing scheme.
We would like to also point out that the main focus of the index
coding problem/literature is to design the optimal delivery scheme for a
given (often fixed) side information at the nodes. On the other hand, the
key novelty of our scheme/framework is the design of side information
(or redundant computations) at the nodes in order to maximize the
index coding (or coded multicast) opportunities. So, while index coding
focused on the design of best delivery strategies, we focus on the design
of best side information structure. In that sense they are complementary
to each other and we can leverage any of the delivery schemes developed
in the index coding literature (e.g., the schemes based on local clique
cover [142], partial and fractional clique cover [1, 20], interference
alignment [107], and many other schemes [11]) in the shuffling phase.
Other than designing coded computing strategies for bandwidth
reduction, there has recently been a surge of interest in developing
coded computing frameworks for straggler mitigation. Initiated in [89],
many follow-up works have focused on designing data encoding strate-
gies, mainly inspired by the concepts of erasure/error correcting codes
for communication systems, to minimize the recovery threshold, in
distributed computation of matrix-vector and matrix–matrix multipli-
cations (e.g., [47, 52, 155, 167, 168]). Coded computing also finds its
application in distributed machine learning, specifically for running
distributed stochastic gradient descent (SGD) on a master/worker ar-
chitecture. For general machine learning tasks, data encoding is not
applicable due to the complicated structure of gradient computation
(e.g., gradients are computed numerically using back-propagation for
deep neural networks). In this scenario, “gradient coding” techniques
[63, 94, 127, 147, 164] have been designed to code across partial gradi-
ents computed from uncoded data, such that the master can recover
the total gradient as the sum of all partial gradients, after receiving the
computation results from the minimum possible number of workers.
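As a minimal sketch of the simplest, repetition-based flavor of gradient coding (a fractional-repetition style assignment; our illustration, not the general constructions in [63, 94, 127, 147, 164]):

```python
import numpy as np

N, s, d = 6, 1, 4                      # 6 workers, robust to any s = 1 straggler, gradients in R^d
assert N % (s + 1) == 0
num_groups = N // (s + 1)

# Fractional repetition: split workers and data partitions into matching groups;
# every worker in group g computes the partial gradients of all partitions in group g.
worker_groups = [list(range(g * (s + 1), (g + 1) * (s + 1))) for g in range(num_groups)]
part_groups = worker_groups            # here: one data partition per worker, same grouping

partial = [np.random.randn(d) for _ in range(N)]      # true partial gradient of each partition

# Each worker sends the sum of its group's partial gradients (its coded message).
msg = {w: sum(partial[p] for p in part_groups[g])
       for g in range(num_groups) for w in worker_groups[g]}

# Master: any one responsive worker per group suffices to recover the full gradient.
responsive = {0, 2, 3, 4, 5}           # worker 1 straggles
picked = [next(w for w in worker_groups[g] if w in responsive) for g in range(num_groups)]
total = sum(msg[w] for w in picked)
assert np.allclose(total, sum(partial))
```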
The proposed Lagrange Coded Computing (LCC) scheme improves
and expands these prior works in a few aspects: Generality – LCC

significantly expands the computation class for which we know how to


design coded computing to go beyond linear and bilinear computations
that have so far been the main research focus. In particular, it can
be applied to more general multivariate polynomial computations that
arise in machine learning applications. Universality – once the data has
been coded, any polynomial up to a certain degree can be computed
distributedly via LCC. In other words, data encoding of LCC can
be universally used for any polynomial computation. This is in stark
contrast to previous task-specific coding techniques in the literature.
Security and privacy – other than straggler mitigation, LCC also extends
the application of coded computing to secure and private computing
for general polynomial computations.
The security and privacy issue of distributed computing has been
extensively studied in the literature of secure multiparty computing
(MPC) and secure machine learning/data mining, [18, 37, 38, 64, 68,
102]. As a representative example, we briefly describe the celebrated
BGW MPC scheme [18]. Given data inputs X1 , . . . , XK , the problem is to
compute the outputs f (X1 ), . . . , f (XK ) using N workers in a privacy-preserving
manner (i.e., colluding workers cannot infer anything about the dataset
using their local data). To do that, BGW first uses Shamir’s scheme [141]
to encode each Xi as a polynomial Pi (z) = Xi + Zi,1 z + · · · + Zi,T z^T ,
where the Zi,j ’s are i.i.d. uniformly random variables and T is the number
of colluding workers that should be tolerated. Then, each worker ℓ
stores the coded data P1 (αℓ ), . . . , PK (αℓ ), for a distinct αℓ , and computes
f (P1 (αℓ )), . . . , f (PK (αℓ )). Hence, for each i, each worker provides the evaluation
of the degree-(deg f · T ) polynomial f (Pi (z)) at a distinct point αℓ . The
polynomial f (Pi (z)) can be interpolated using computation results from
deg f · T + 1 workers, and f (Xi ) is obtained by taking the constant
term of f (Pi (z)).¹ In the proposed LCC scheme, instead of hiding
the Xi ’s individually in data encoding, we code across the Xi ’s together with
some added random inputs. This gives rise to a significant reduction in
storage overhead, computational complexity, and the amount of padded
randomness. However, under the same condition, the LCC scheme requires
N ≥ deg f · (K + T − 1) + 1 workers, which is larger than the number
required by the BGW scheme. So, in some sense, LCC achieves a reduction in
storage overhead, computational complexity, and the amount of padded
randomness at the expense of increasing the number of needed workers
(or reducing the fraction of Byzantine workers that can be tolerated).
We refer to Table 1.2 for a detailed comparison between BGW and
LCC.

¹ It is also possible to use the conventional multi-round BGW, which only requires
N ≥ 2T + 1 workers to ensure T-privacy. However, multiple rounds of computation
and communication (Ω(log(deg f )) rounds) are needed, which further increases its
communication overhead.
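To make the Shamir-style encoding step of BGW concrete, here is a minimal single-round sketch (ours) over a prime field, with a toy degree-2 polynomial f and a single data input; the multi-round structure of the full BGW protocol is not shown, and all helper names are ours.

```python
import random

p = 2_147_483_647                     # a Mersenne prime; all arithmetic is mod p

def shamir_share(x, T, alphas):
    """Hide x in P(z) = x + Z_1 z + ... + Z_T z^T with random Z_j, and hand
    worker l the evaluation P(alpha_l); any T shares reveal nothing about x."""
    coeffs = [x] + [random.randrange(p) for _ in range(T)]
    return [sum(c * pow(a, d, p) for d, c in enumerate(coeffs)) % p for a in alphas]

def interpolate_at_zero(points):
    """Lagrange-interpolate a polynomial from (alpha, value) pairs and return
    its constant term, i.e., its value at z = 0 (all arithmetic mod p)."""
    total = 0
    for i, (ai, yi) in enumerate(points):
        num, den = 1, 1
        for j, (aj, _) in enumerate(points):
            if j != i:
                num = num * (-aj) % p
                den = den * (ai - aj) % p
        total = (total + yi * num * pow(den, p - 2, p)) % p
    return total

def f(x):                              # a toy degree-2 polynomial to be computed
    return (3 * x * x + x + 1) % p

T, N = 2, 7                            # deg f * T + 1 = 5 results suffice to decode
X = 123456                             # a single (secret) data input
alphas = list(range(1, N + 1))
shares = shamir_share(X, T, alphas)    # worker l stores P(alpha_l)
outputs = [f(s) for s in shares]       # worker l returns f(P(alpha_l))

fastest = [0, 2, 3, 5, 6]              # any deg f * T + 1 = 5 honest, fast workers
decoded = interpolate_at_zero([(alphas[i], outputs[i]) for i in fastest])
assert decoded == f(X)
```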
Coding techniques have been recently developed to provide security
and privacy guarantees to distributed computing. Specifically, staircase
codes [21] were proposed to combat stragglers in linear computations
(e.g., matrix-vector multiplications) while preserving data privacy, im-
proving the computation latency of the conventional secure computing
schemes based on secret sharing [110, 141]. The proposed LCC scheme
generalizes the staircase codes beyond linear computations. Even for the
linear case, LCC guarantees data privacy against T colluding workers by
introducing less randomness than [21] (T rather than T K/(K − T )). Be-
yond linear computations, a recent work [117] has combined ideas from
the BGW scheme and the polynomial code [167] to form polynomial shar-
ing, a private coded computing scheme for arbitrary matrix polynomials.
However, polynomial sharing inherits the undesired BGW property of
performing a communication round for every bilinear operation in the
polynomial; a feature that drastically reduces communication efficiency,
and is circumvented by the one-shot approach of LCC. DRACO [31]
was proposed as a secure distributed training algorithm that is robust
to Byzantine workers. Since DRACO is designed for general gradient
computations, it employs a blackbox approach, i.e., the coding is applied
on the gradients computed from uncoded data, but not on the data
itself, which is similar to the gradient coding techniques [63, 94, 127,
147, 164] designed primarily for stragglers. For this approach, it is shown in [31]
that a 2A + 1 multiplicative factor of redundant computations is needed
to be robust to A Byzantine workers. For the proposed LCC however,
the blackbox approach is disregarded in favor of an algebraic one, and
consequently, a 2A additive factor suffices.
2 Coding for Bandwidth Reduction

In this section, we focus on a general distributed computing framework,


motivated by prevalent structures like MapReduce [42] and Spark [171],
in which the overall computation is decomposed into two stages: “Map”
and “Reduce”. Firstly in the Map stage, distributed computing nodes
process parts of the input data locally, generating some intermediate
values according to their designed Map functions. Next, they exchange
the calculated intermediate values among each other (a.k.a. data shuf-
fling), in order to calculate the final output results distributedly using
their designed Reduce functions.
Within this framework, data shuffling often appears to limit the
performance of distributed computing applications, including self-join [3],
TeraSort [58], and machine learning algorithms [36]. For example, in a
Facebook Hadoop cluster, it is observed that 33% of the overall job
execution time is spent on data shuffling [36]. Also as is observed in [174],
70% of the overall job execution time is spent on data shuffling when
running self-join on an Amazon EC2 cluster [7]. This bottleneck becomes
even worse when training deep neural networks (e.g., ResNet-50 [65])
on distributed computing systems, where partial gradients with millions
of entries are shuffled across networks to update model parameters [33].


Thus motivated, we ask a fundamental question: can coding help
distributed computing by reducing the communication load and speeding
up the overall computation? Coding is known to be helpful
in coping with the channel uncertainty in telecommunication systems
and also in reducing the storage cost in distributed storage systems and
cache networks. In this section, we extend the application of coding to
distributed computing and propose a framework to substantially reduce
the load of data shuffling via coding and some extra computing in the
Map phase.
More specifically, we first formalize a MapReduce-type distributed
computing framework, and define the “computation load” as the amount
of local Map computations performed at distributed computing nodes,
and the “communication load” as the amount of information bits shuffled
between nodes. We characterize a fundamental “inversely proportional”
tradeoff relationship between computation load and communication load.
In particular, we propose a coded computing scheme, named “Coded
Distributed Computing” (CDC), which demonstrates that increasing
the computation load of the Map phase by a factor of r (i.e., evaluating
each Map function at r carefully chosen nodes) can create novel coding
opportunities in the data shuffling phase that reduce the communica-
tion load by the same factor. We also show that CDC is optimal, in
the sense that it achieves the best tradeoff between computation and
communication in the proposed MapReduce framework.
Having theoretically characterized the tradeoff between computation
and communication, we exploit this tradeoff to improve the run-time
performance of practical workloads. Particularly, we apply the principles
of the CDC scheme to TeraSort [62], which is a widely used bench-
mark in Hadoop MapReduce [10] for distributedly sorting terabytes
of data [118], and develop a new distributed sorting algorithm, named
CodedTeraSort, which imposes structured redundancy in data to en-
able coding opportunities for efficient data shuffling. We empirically
demonstrate that CodedTeraSort speeds up the state-of-the-art sorting
algorithms by 1.97×–3.39× in typical settings of interest.
Having demonstrated the impact of coding on improving the per-
formance of applications run on wired networks like datacenters, we
also introduce the concept of coded computing to tackle the scenarios

of mobile edge/fog computing, where the communication bottleneck


is even more severe due to the low data rate and the large number of
mobile users. In particular, we consider a wireless distributed computing
platform, which is composed of a cluster of memory-limited mobile users
and an access point at the network edge. The users collaborate with each
other through the access point to satisfy their computational needs that
require processing a large dataset. For this platform we propose a coded
wireless distributed computing (CWDC) scheme that jointly designs the
local storage and computation for each user, and the communication
between users through the access point. The CWDC scheme achieves a
constant bandwidth consumption that is independent of the number
of users in the system, which leads to a scalable design of the platform
that can simultaneously accommodate an arbitrary number of users.
Finally, we end this section with some related works and open
problems along this research direction.

2.1 A Fundamental Tradeoff Between Computation and Communication

In this subsection, we formulate a general distributed computing frame-


work motivated by MapReduce, and characterize the optimal tradeoff
between computation and communication within this framework.

2.1.1 Problem Formulation: A Distributed Computing Framework


We consider the problem of computing Q arbitrary output functions from
N input files using a cluster of K distributed computing nodes (servers),
for some positive integers Q, N, K ∈ N, with N ≥ K.1 More specifically,
given N input files w1 , . . . , wN ∈ F2F , for some F ∈ N, the goal is
to compute Q output functions φ1 , . . . , φQ , where φq : (F2F )N → F2B ,
q ∈ {1, . . . , Q}, maps all input files to an output uq = φq (w1 , . . . , wN ) ∈
F2B , for some B ∈ N.

1
The motivation for considering simultaneous computation of Q functions is
that we consider a common scenario in which many computation requests (over the
same dataset) are continuously submitted (e.g., database queries, web search, loss
computation in machine learning, etc).

Figure 2.1: Illustration of a two-stage distributed computing framework. The overall


computation is decomposed into computing a set of Map and Reduce functions.

Motivated by MapReduce, we assume that as illustrated in Figure 2.1


the computation of the output function φq , q ∈ {1, . . . , Q} can be
decomposed as follows:
φq (w1 , . . . , wN ) = hq (gq,1 (w1 ), . . . , gq,N (wN )), (2.1)
where
• The “Map” functions ~gn = (g1,n , . . . , gQ,n ): F2F → (F2T )Q , n ∈
{1, . . . , N }, maps the input file wn into Q length-T intermediate
values vq,n = gq,n (wn ) ∈ F2T , q ∈ {1, . . . , Q}, for some T ∈ N.2

• The “Reduce” functions hq : (F2T )N → F2B , q ∈ {1, . . . , Q}, maps


the intermediate values of the output function φq in all input files
into the output value uq = hq (vq,1 , . . . , vq,N ).
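To make the decomposition in (2.1) concrete, the following small sketch (our own illustration) casts a Word Count-style job into the framework: the q-th output function counts a keyword across all files, the Map functions emit per-file counts, and the Reduce functions sum them.

```python
# Output function q: total count of the q-th keyword across all files.
keywords = ["coded", "computing", "straggler"]          # Q = 3 output functions
files = ["coded computing for straggler mitigation",    # N = 3 input files
         "coded multicasting reduces shuffling",
         "straggler delays slow computing"]

def map_fn(w_n):
    """g_n: produces one intermediate value per output function, here the
    count v_{q,n} of keyword q in file w_n."""
    return [w_n.split().count(kw) for kw in keywords]

def reduce_fn(q, intermediates):
    """h_q: combines the intermediate values of function q across all files."""
    return sum(v[q] for v in intermediates)

intermediates = [map_fn(w) for w in files]               # Map phase
outputs = [reduce_fn(q, intermediates) for q in range(len(keywords))]  # Reduce phase
print(dict(zip(keywords, outputs)))                      # {'coded': 2, 'computing': 2, 'straggler': 2}
```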

Remark 2.1. Note that for every set of output functions φ1 , . . . , φQ
such a Map-Reduce decomposition exists (e.g., setting gq,n's to identity
functions such that gq,n (wn ) = wn for all n = 1, . . . , N , and hq to φq
in (2.1)). However, such a decomposition is not unique, and in the
distributed computing literature, there has been quite some work on
developing appropriate decompositions of computations like join, sorting
and matrix multiplication (see, e.g., [42, 125]), for them to be performed
efficiently in a distributed manner. Here we do not impose any constraint
on how the Map and Reduce functions are chosen (for example, they
can be arbitrary linear or non-linear functions).

² When mapping a file, we compute Q intermediate values in parallel, one for each
of the Q output functions. The main reason to do this is that parallel processing can be
efficiently performed for applications that fit into the MapReduce framework. In other
words, mapping a file according to one function is only marginally more expensive
than mapping according to all functions. For example, for the canonical Word Count
task, while we are scanning a document to count the number of appearances of one
word, we can simultaneously count the numbers of appearances of other words with
marginally increased computation cost.
The above computation is carried out by K distributed comput-
ing nodes, labelled as Node 1, . . . , Node K. They are interconnected
through a multicast network. Following the above decomposition, the
computation proceeds in three phases: Map, Shuffle and Reduce.
Map Phase: Node k, k ∈ {1, . . . , K}, computes the Map functions of a
set of files Mk , which are stored on Node k, for some design parameter
Mk ⊆ {w1 , . . . , wN }. For each file wn in Mk , Node k computes ~gn (wn ) =
(v1,n , . . . , vQ,n ). We assume that each file is mapped by at least one
node, i.e., ∪ Mk = {w1 , . . . , wN }.
k=1,...,K

Definition 2.1 (Computation Load). We define the computation load,


denoted by r, 1 ≤ r ≤ K, as the total number of Map functions
computed across the K nodes, normalized by the number of files N,
i.e., $r \triangleq \frac{\sum_{k=1}^{K} |M_k|}{N}$. The computation load r can be interpreted as the
average number of nodes that map each file.
Shuffle Phase: Node k, k ∈ {1, . . . , K}, is responsible for computing
a subset of output functions, whose indices are denoted by a set Wk ⊆
{1, . . . , Q}. We focus on the case Q/K ∈ N, and utilize a symmetric task
assignment across the K nodes to maintain load balance. More precisely,
we require (1) |W1 | = · · · = |WK | = Q/K, and (2) Wj ∩ Wk = ∅ for all j ≠ k.
Remark 2.2. Beyond the symmetric task assignment considered in this
section, characterizing the optimal computation-communication tradeoff
allowing general asymmetric task assignments is a challenging open
problem. As the first step to study this problem, the follow-up work [166]
considers the scenario in which the number of output functions Q is fixed
and the computing resources are abundant (e.g., the number of computing
nodes K ≫ Q). It is shown in [166] that asymmetric task assignments
can do better than the symmetric ones in optimizing the overall run-time
performance.

To compute the output uq for some q ∈ Wk , Node k needs the


intermediate values that are not computed locally in the Map phase, i.e.,
{vq,n : q ∈ Wk , wn ∉ Mk }. After Node k, k ∈ {1, . . . , K}, has finished
mapping all the files in Mk , the K nodes proceed to exchange the
needed intermediate values. In particular, each node k creates an input
symbol Xk ∈ F2ℓk , for some ℓk ∈ N, as a function of the intermediate
values computed locally during the Map phase, i.e., for some encoding
function ψk : (F2T )Q|Mk| → F2ℓk at Node k, we have

Xk = ψk ({~gn : wn ∈ Mk }). (2.2)

Having generated the message Xk , Node k multicasts it to all other


nodes.
By the end of the Shuffle phase, each of the K nodes receives
X1 , . . . , XK free of error.

Definition 2.2 (Communication Load). We define the communication


load, denoted by L, 0 ≤ L ≤ 1, as $L \triangleq \frac{\ell_1 + \cdots + \ell_K}{QNT}$. That is, L represents
the (normalized) total number of bits communicated by the K nodes
during the Shuffle phase.³

Reduce Phase: Node k, k ∈ {1, . . . , K}, uses the messages X1 , . . . , XK


communicated in the Shuffle phase, and the local results from the Map
phase {~gn : wn ∈ Mk } to construct inputs to the corresponding Reduce
functions of Wk , i.e., for each q ∈ Wk and some decoding function
χqk : F2ℓ1 × · · · × F2ℓK × (F2T )Q|Mk| → (F2T )N , Node k computes

(vq,1 , . . . , vq,N ) = χqk (X1 , . . . , XK , {~gn : wn ∈ Mk }). (2.3)


3
For notational convenience, we define all variables in binary extension fields.
However, one can consider arbitrary field sizes. For example, we can consider all
intermediate values vq,n , q = 1, . . . , Q, n = 1, . . . , N , to be in the field FpT , for
some prime number p and positive integer T , and the symbol communicated by
Node k (i.e., Xk ), to be in the field Fs`k for some prime number s and positive
integer `k , for all k = 1, . . . , K. In this case, the communication load can be defined
as L , (`1 +···+` K ) log s
QN T log p
.

Finally, Node k, k ∈ {1, . . . , K}, computes the Reduce function


uq = hq (vq,1 , . . . , vq,N ) for all q ∈ Wk .
We say that a computation-communication pair (r, L) ∈ R2 is feasible
if for any δ > 0 and sufficiently large N , there exist M1 , . . . , MK ,
W1 , . . . , WK , a set of encoding functions $\{\psi_k\}_{k=1}^{K}$, and a set of decoding
functions $\{\chi_k^q : q \in W_k\}_{k=1}^{K}$ that achieve a computation-communication
pair (r̃, L̃) ∈ Q2 such that |r − r̃| ≤ δ, |L − L̃| ≤ δ, and Node k can
successfully compute all the output functions whose indices are in Wk ,
for all k ∈ {1, . . . , K}.

Definition 2.3. We define the computation-communication function of


the distributed computing framework

L∗ (r) ≜ inf{L: (r, L) is feasible}. (2.4)

L∗ (r) characterizes the optimal tradeoff between computation and


communication in this framework.

Example (uncoded scheme). In the Shuffle phase of a simple “un-


coded” scheme, each node receives the needed intermediate values sent
uncodedly by some other nodes. Since a total of QN intermediate values
are needed across the K nodes and $rN \cdot \frac{Q}{K} = \frac{rQN}{K}$ of them are already
available after the Map phase, the communication load achieved by the
uncoded scheme is

Luncoded (r) = 1 − r/K. (2.5)

2.1.2 Main Results


Theorem 2.1. The computation-communication function of the dis-
tributed computing framework, L∗ (r), is given by

$L^*(r) = L_{\text{coded}}(r) \triangleq \frac{1}{r} \cdot \left(1 - \frac{r}{K}\right), \quad r \in \{1, \ldots, K\},$  (2.6)

for sufficiently large T . For general 1 ≤ r ≤ K, L∗ (r) is the lower convex
envelope of the above points $\{(r, \frac{1}{r} \cdot (1 - \frac{r}{K})) : r \in \{1, \ldots, K\}\}$.

We prove the achievability of Theorem 2.1 by proposing a coded


scheme, named Coded Distributed Computing, in Subsection 2.1.3. We
demonstrate that no other scheme can achieve a communication load
smaller than the lower convex envelope of the points $\{(r, \frac{1}{r} \cdot (1 - \frac{r}{K})) : r \in \{1, \ldots, K\}\}$
by proving the converse in Subsection 2.1.4.

Figure 2.2: Comparison of the communication load achieved by the proposed coded
scheme in Theorem 2.1 with that of the uncoded scheme in (2.5), for Q = 10 output
functions, N = 2520 input files and K = 10 computing nodes.

Remark 2.3. Theorem 2.1 exactly characterizes the optimal tradeoff


between the computation load and the communication load in the
considered distributed computing framework.

Remark 2.4. For r ∈ {1, . . . , K}, the communication load achieved in


Theorem 2.1 is less than that of the uncoded scheme in (2.5) by a
multiplicative factor of r, which equals the computation load and can
grow unboundedly as the number of nodes K increases if e.g., r = Θ(K).
As illustrated in Figure 2.2, while the communication load of the uncoded
scheme decreases linearly as the computation load increases, Lcoded (r)
achieved in Theorem 2.1 is inversely proportional to the computation
load.
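The comparison in Figure 2.2 can be reproduced numerically from (2.5) and (2.6); the short sketch below (ours) prints both loads and their ratio for K = 10 nodes.

```python
K = 10                                    # number of computing nodes

def L_uncoded(r):
    return 1 - r / K                      # equation (2.5)

def L_coded(r):
    return (1 / r) * (1 - r / K)          # equation (2.6), integer-valued r

for r in range(1, K):                     # r = K gives zero load for both schemes
    print(f"r = {r:2d}   uncoded = {L_uncoded(r):.3f}   "
          f"coded = {L_coded(r):.3f}   gain = {L_uncoded(r) / L_coded(r):.1f}x")
# The gain column is exactly r: the coded load is inversely proportional to r.
```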

Remark 2.5. While increasing the computation load r causes a longer


Map phase, the coded achievable scheme of Theorem 2.1 maximizes
the reduction of the communication load using the extra computations.
Therefore, Theorem 2.1 provides an analytical framework for optimally
trading the computation power in the Map phase for more bandwidth

in the Shuffle phase, which helps to minimize the overall execution time
of applications whose performances are limited by data shuffling. In
the next subsection, we will empirically demonstrate this idea through
experiments on a widely-used practical workload.

Remark 2.6. In [93], we also consider a generalization of the above


distributed computing framework, which we call “cascaded distributed
computing framework”, where after the Map phase, each Reduce func-
tion is computed by s > 1 nodes. This generalized model is motivated by
the fact that many distributed computing jobs require multiple rounds
of Map and Reduce executions, where the Reduce results of the previous
round serve as the inputs to the Map functions of the next round. For
the cascaded distributed computing framework, we generalize our coded
computing scheme to achieve the optimal tradeoff between computation
and communication loads.

2.1.3 Coded Distributed Computing


In this subsection, we formally prove the upper bound in Theorem 2.1
by describing and analyzing the Coded Distributed Computing (CDC)
scheme. Before we present the general CDC scheme, we first illustrate
its key coding ideas via an example.

Illustrative Example
We consider a MapReduce-type problem in Figure 2.3 for distributed
computing of Q = 3 output functions, represented by red/circle,
green/square, and blue/triangle respectively, from N = 6 input files,
using K = 3 computing nodes. Nodes 1, 2, and 3 are respectively respon-
sible for final reduction of red/circle, green/square, and blue/triangle
output functions. We first consider the case where no redundancy is
imposed on the computations, i.e., each file is mapped once and compu-
tation load r = 1. As shown in Figure 2.3(a), Node k maps File 2k − 1
and File 2k for k = 1, 2, 3. In this case, each node maps 2 input files
locally. In Figure 2.3, we represent, for example, the intermediate value
of the red/circle function in File n using a red circle labelled by n, for
all n = 1, . . . , 6. Similar representations follow for the green/square and
the blue/triangle functions.

Figure 2.3: Illustrations of the conventional uncoded distributed computing scheme
with computation load r = 1 (subfigure (a)), and the proposed Coded Distributed
Computing scheme with computation load r = 2 (subfigure (b)), for computing Q = 3
functions from N = 6 inputs on K = 3 nodes.

After the Map phase, each node obtains
2 out of 6 required intermediate values to reduce the output function
it is responsible for (e.g., Node 1 knows the red circles in File 1 and
File 2). Hence, each node needs 4 intermediate values from the other
nodes, yielding a communication load of (4 × 3)/(3 × 6) = 2/3.

Now, we demonstrate how the proposed CDC scheme trades the


computation load to slash the communication load via in-network coding.
As shown in Figure 2.3(b), we double the computation load such that
each file is now mapped on two nodes (r = 2). It is apparent that since
more local computations are performed, each node now only requires
2 other intermediate values, and an uncoded shuffling scheme would
achieve a communication load of (2 × 3)/(3 × 6) = 1/3. However, we can do much
better with coding. As shown in Figure 2.3(b), instead of unicasting


individual intermediate values, every node multicasts a bit-wise XOR,
denoted by ⊕, of 2 locally computed intermediate values to the other
two nodes, simultaneously satisfying their data demands. For example,
knowing the blue/triangle in File 3, Node 2 can cancel it from the
coded packet sent by Node 1, recovering the needed green/square in
File 1. Therefore, this coding incurs a communication load of 3/(3 × 6) = 1/6,
achieving a 2× gain over the uncoded shuffling.

General CDC Scheme


We first consider the integer-valued computation load r ∈ {1, . . . , K},
and then generalize the CDC scheme for any 1 ≤ r ≤ K. When r = K,
every node can map all the input files and compute all the output
functions locally, thus no communication is needed and L∗ (K) = 0. In
what follows, we focus on the case where r < K.
We consider a sufficiently large number of input files N, with \binom{K}{r}(η − 1) < N ≤ \binom{K}{r}η for some η ∈ N. We first inject \binom{K}{r}η − N empty files into the system to obtain a total of N̄ = \binom{K}{r}η files, which is now a multiple of \binom{K}{r}. We note that lim_{N→∞} N̄/N = 1. Next, we proceed to present
the CDC scheme for a system with N̄ input files w1 , . . . , wN̄ .
Map phase design. The N̄ input files are evenly partitioned into \binom{K}{r}
disjoint batches of size η, each corresponding to a subset T ⊂ {1, . . . , K}
of size r, i.e.,

{w1 , . . . , wN̄ } = ∪_{T ⊂{1,...,K}, |T |=r} BT ,   (2.7)

where BT denotes the batch of η files corresponding to the subset T .
Given this partition, Node k, k ∈ {1, . . . , K}, computes the Map
functions of the files in BT if k ∈ T , or equivalently, BT ⊆ Mk if k ∈ T .
Since each node is in \binom{K−1}{r−1} subsets of size r, each node computes
\binom{K−1}{r−1} η = rN̄ /K Map functions, i.e., |Mk | = rN̄ /K for all k ∈ {1, . . . , K}.
After the Map phase, Node k, k ∈ {1, . . . , K}, knows the intermediate
values of all Q output functions in the files in Mk , i.e., {vq,n : q ∈
{1, . . . , Q}, wn ∈ Mk }.
Coded data shuffling. We focus on the case where the number of output
functions Q satisfies Q/K ∈ N, and enforce a symmetric assignment
of the Reduce functions such that every node reduces Q/K functions. That
is, |W1 | = · · · = |WK | = Q/K, and Wj ∩ Wk = ∅ for all j ≠ k.
For any subset P ⊂ {1, . . . , K} and k ∉ P, we denote the set of
intermediate values needed by Node k and known exclusively by the nodes
whose indices are in P as V_P^k. More formally:

V_P^k ≜ {vq,n : q ∈ Wk , wn ∈ ∩_{i∈P} Mi , wn ∉ ∪_{i∉P} Mi }.   (2.8)

For each subset S ⊆ {1, . . . , K} of size |S| = r + 1, we perform the


following three steps to shuffle the intermediate results.
• Step 1: data association. For each k ∈ S, V_{S\{k}}^k is the set of
intermediate values that are requested by Node k, are computed
from the files in the batch B_{S\{k}}, and are exclusively known
at all nodes whose indices are in S\{k}. We evenly and arbitrarily
split V_{S\{k}}^k into r disjoint segments {V_{S\{k},i}^k : i ∈ S\{k}}, where
V_{S\{k},i}^k denotes the segment associated with Node i in S\{k} for
Node k. That is, V_{S\{k}}^k = ∪_{i∈S\{k}} V_{S\{k},i}^k.

• Step 2: coded multicast. Each node i, i ∈ S, computes the
bit-wise XOR, denoted by ⊕, of all the segments associated with
it in S, generating a coded segment X_i^S = ⊕_{k∈S\{i}} V_{S\{k},i}^k. Then,
Node i multicasts X_i^S to all other nodes in S\{i}.
• Step 3: decoding. Having received X_i^S from Node i, Node k
computes the bit-wise XOR of X_i^S with its local data segments
{V_{S\{j},i}^j : j ∈ S\{i, k}} to recover V_{S\{k},i}^k = (⊕_{j∈S\{i,k}} V_{S\{j},i}^j) ⊕ X_i^S.
Having decoded the data segments V_{S\{k},i}^k for all i ∈ S\{k},
Node k concatenates them to recover V_{S\{k}}^k.

After we iterate the above data shuffling process over all subsets
of r + 1 nodes, it is easy to see that for each node k, other than its
locally computed intermediate values, it has recovered all the required
intermediate values, i.e., {V_{S\{k}}^k : S ⊆ {1, . . . , K}, |S| = r + 1, k ∈ S},
to compute the Reduce functions locally.
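The three shuffle steps can be exercised end-to-end in a short, self-contained Python sketch, which we add purely as an illustration. It assumes Q = K (Node k reduces output function k), η = 1 (one file per batch), and toy intermediate values of r·SEG bytes; the variable names are ours. The sketch builds the batches of (2.7), runs Steps 1–3 for every subset S of size r + 1, and checks that each node recovers all of its missing intermediate values.

from itertools import combinations

K, r, SEG = 4, 2, 3                        # nodes, computation load, bytes per segment
batches = list(combinations(range(K), r))  # one input file per subset T of size r

def value(q, n):
    # toy intermediate value v_{q,n}: a deterministic byte string of r*SEG bytes
    return bytes((7 * q + 13 * n + 3 * i) % 256 for i in range(r * SEG))

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Map phase: node k computes v_{q,n} for every function q and every file n it stores (k in T)
mapped = {k: {n: {q: value(q, n) for q in range(K)}
              for n, T in enumerate(batches) if k in T} for k in range(K)}

def segment(node, k, n, i):
    # segment of v_{k,n} associated with node i, read from `node`'s locally mapped values
    T = batches[n]
    j = T.index(i)
    return mapped[node][n][k][j * SEG:(j + 1) * SEG]

# Shuffle phase: iterate over all subsets S of r + 1 nodes (Steps 1-3 above)
received = {k: {} for k in range(K)}       # received[k][(n, i)] = decoded segment i of v_{k,n}
for S in combinations(range(K), r + 1):
    for i in S:
        # Step 2: node i multicasts the XOR of the segments associated with it in S
        coded = bytes(SEG)
        for k in S:
            if k != i:
                n = batches.index(tuple(sorted(set(S) - {k})))
                coded = xor(coded, segment(i, k, n, i))
        # Step 3: every other node k cancels the segments it already knows
        for k in S:
            if k == i:
                continue
            decoded = coded
            for j in S:
                if j not in (i, k):
                    nj = batches.index(tuple(sorted(set(S) - {j})))
                    decoded = xor(decoded, segment(k, j, nj, i))
            nk = batches.index(tuple(sorted(set(S) - {k})))
            received[k][(nk, i)] = decoded

# Reduce-side check: node k reassembles every v_{k,n} it did not compute itself
for k in range(K):
    for n, T in enumerate(batches):
        if k not in T:
            assert b"".join(received[k][(n, i)] for i in T) == value(k, n)
print("every node recovered all of its missing intermediate values")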
Communication load. Since the coded segment X_i^S has a size of
(Q/K) · (ηT/r) bits for each i ∈ S, there are a total of (Q/K) · (ηT/r) · (r + 1) bits
shuffled across the network in each subset S of size r + 1. Therefore,
the communication load achieved by this coded data shuffling scheme,
for r ∈ {1, . . . , K − 1}, is

Lcoded (r) = lim_{N→∞} [\binom{K}{r+1} · (Q/K) · (ηT/r) · (r + 1)] / (QN T) = lim_{N→∞} N̄ (K − r)/(N Kr) = (1/r) · (1 − r/K).   (2.9)

Non-integer valued computation load. For non-integer valued com-


putation load r ≥ 1, we generalize the CDC scheme as follows. We
first expand the computation load r = αr1 + (1 − α)r2 as a convex
combination of r1 ≜ ⌊r⌋ and r2 ≜ ⌈r⌉, for some 0 ≤ α ≤ 1. Then we
partition the set of N̄ input files {w1 , . . . , wN̄ } into two disjoint subsets
I1 and I2 of sizes |I1 | = αN̄ and |I2 | = (1 − α)N̄ . We next apply
the CDC scheme described above respectively to the files in I1 with a
computation load r1 and the files in I2 with a computation load r2 , to
compute each of the Q output functions at the same node. This results
in a communication load of

lim_{N→∞} [QαN̄ Lcoded (r1 )T + Q(1 − α)N̄ Lcoded (r2 )T] / (QN T) = αLcoded (r1 ) + (1 − α)Lcoded (r2 ),   (2.10)

where Lcoded (r) is the communication load achieved by CDC in (2.9)


for integer-valued r.
Using this generalized CDC scheme, for any two integer-valued
computation loads r1 and r2 , the points on the line segment connect-
ing (r1 , Lcoded (r1 )) and (r2 , Lcoded (r2 )) are achievable. Therefore, for
general 1 ≤ r ≤ K, the lower convex envelop of the achievable points
{(r, Lcoded (r)): r ∈ {1, . . . , K}} is achievable. This proves the upper
bound on the computation-communication function in Theorem 2.1.
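As a quick numerical companion to this memory-sharing argument, the following function (our own sketch) evaluates the achievable load for any real-valued 1 ≤ r ≤ K by the same convex combination of ⌊r⌋ and ⌈r⌉; since 1/r − r/(rK) = 1/r − 1/K is convex in r, this interpolation is exactly the lower convex envelop of the integer points.

from math import floor, ceil

def L_coded(r, K):
    # achievable CDC load for real-valued 1 <= r <= K, via the convex combination in (2.10)
    def point(ri):
        return (1 / ri) * (1 - ri / K)
    r1, r2 = floor(r), ceil(r)
    if r1 == r2:
        return point(r1)
    alpha = (r2 - r) / (r2 - r1)      # r = alpha*r1 + (1 - alpha)*r2
    return alpha * point(r1) + (1 - alpha) * point(r2)

print(L_coded(2, 10))    # integer point: (1/2)*(1 - 0.2) = 0.4
print(L_coded(2.5, 10))  # point on the segment between r = 2 and r = 3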

Remark 2.7. The ideas of efficiently creating and exploiting coded multi-
casting opportunities have been introduced in caching problems [72, 104,
105]. Through the above description of the CDC scheme, we illustrated
how coding opportunities can be utilized in distributed computing to
slash the load of communicating intermediate values, by designing a par-
ticular assignment of extra computations across distributed computing
nodes. We note that the calculated intermediate values in the Map phase
mimics the locally stored cache contents in caching problems, providing
the “side information” to enable coding in the following Shuffle phase
(or content delivery).

Remark 2.8. Generally speaking, we can view the Shuffle phase of the
considered distributed computing framework as an instance of the index
coding problem [15, 20], in which a central server aims to design a broad-
cast message (code) with minimum length to simultaneously satisfy the
requests of all the clients, given the clients’ side information stored in
their local caches. Note that while a randomized linear network coding
approach (see, e.g., [2, 66, 83]) is sufficient to implement any multicast
communication where messages are intended for all receivers, it is gener-
ally sub-optimal for index coding problems where every client requests
different messages. Although the index coding problem is still open in
general, for the considered distributed computing scenario where we are
given the flexibility of designing Map computation (thus the flexibility
of designing side information), we next prove tight lower bounds on
the minimum communication load, demonstrating the optimality of the
proposed CDC scheme.

2.1.4 Optimality of CDC


In this subsection, we prove the lower bound on L∗ (r) in Theorem 2.1,
and demonstrate the optimality of CDC in minimizing the communica-
tion load.
For k ∈ {1, . . . , K}, we denote the set of indices of the files mapped
by Node k as Mk , and the set of indices of the Reduce functions
computed by Node k as Wk . As the first step, we consider the commu-
nication load for a given file assignment M ≜ (M1 , M2 , . . . , MK ) in
the Map phase. We denote the minimum communication load under
the file assignment M by L∗M .
We denote the number of files that are mapped at j nodes under a
file assignment M as a_M^j, for all j ∈ {1, . . . , K}:

a_M^j = \sum_{J ⊆{1,...,K}: |J |=j} | (∩_{k∈J} Mk) \ (∪_{i∉J} Mi) |.   (2.11)

For example, for the particular file assignment in Figure 2.4, i.e.,
M = ({1, 3, 5, 6}, {4, 5, 6}, {2, 3, 4, 6}), a_M^1 = 2 since File 1 and File 2
are mapped on a single node (i.e., Node 1 and Node 3 respectively).
Similarly, we have a_M^2 = 3 (Files 3, 4, and 5), and a_M^3 = 1 (File 6).

Figure 2.4: A file assignment for N = 6 files and K = 3 nodes.

For a particular file assignment M, we present a lower bound on


L∗M in the following lemma.
Lemma 2.2. L∗_M ≥ \sum_{j=1}^{K} (a_M^j /N) · (K − j)/(Kj).
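Before proving the lemma, its bound can be checked numerically on the file assignment of Figure 2.4 (a sanity-check snippet we add here), for which N = 6, K = 3 and (a_M^1, a_M^2, a_M^3) = (2, 3, 1).

# Evaluate the lower bound of Lemma 2.2 for the file assignment M of Figure 2.4.
N, K = 6, 3
a = {1: 2, 2: 3, 3: 1}          # a_M^j: number of files mapped at exactly j nodes
bound = sum(a[j] / N * (K - j) / (K * j) for j in range(1, K + 1))
print(bound)                    # = (2/6)*(2/3) + (3/6)*(1/6) + 0 = 0.3056...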

Next, we first demonstrate the converse of Theorem 2.1 using


Lemma 2.2, and then give the proof of Lemma 2.2.
Converse Proof of Theorem 2.1. It is clear that the minimum communi-
cation load L∗ (r) is lower bounded by the minimum value of L∗M over
all possible file assignments which admit a computation load of r:

L∗ (r) ≥ inf_{M: |M1 |+···+|MK |=rN} L∗_M .   (2.12)

Then by Lemma 2.2, we have


L∗ (r) ≥ inf_{M: |M1 |+···+|MK |=rN} \sum_{j=1}^{K} (a_M^j /N) · (K − j)/(Kj).   (2.13)

For every file assignment M such that |M1 | + · · · + |MK | = rN ,


{a_M^j}_{j=1}^{K} satisfy

a_M^j ≥ 0,   j ∈ {1, . . . , K},   (2.14)
\sum_{j=1}^{K} a_M^j = N,   (2.15)
\sum_{j=1}^{K} j a_M^j = rN.   (2.16)

Then since the function (K − j)/(Kj) in (2.13) is convex in j, and by (2.15)
\sum_{j=1}^{K} a_M^j /N = 1, (2.13) becomes

L∗ (r) ≥ inf_{M: |M1 |+···+|MK |=rN} [K − \sum_{j=1}^{K} j a_M^j /N] / [K \sum_{j=1}^{K} j a_M^j /N] \overset{(a)}{=} (K − r)/(Kr),   (2.17)

where (a) is due to the requirement imposed by the computation load


in (2.16).
The lower bound on L∗ (r) in (2.17) holds for general 1 ≤ r ≤ K.
We can further improve the lower bound for non-integer valued r as
follows. For a particular r ∉ N, we first find the line p + qj as a function
of 1 ≤ j ≤ K connecting the two points (⌊r⌋, (K − ⌊r⌋)/(K⌊r⌋)) and (⌈r⌉, (K − ⌈r⌉)/(K⌈r⌉)).
More specifically, we find p, q ∈ R such that

p + qj|_{j=⌊r⌋} = (K − ⌊r⌋)/(K⌊r⌋),   (2.18)
p + qj|_{j=⌈r⌉} = (K − ⌈r⌉)/(K⌈r⌉).   (2.19)

Then by the convexity of the function (K − j)/(Kj) in j, we have for integer-
valued j = 1, . . . , K,

(K − j)/(Kj) ≥ p + qj,   j = 1, . . . , K.   (2.20)
Then (2.13) reduces to

L∗ (r) ≥ inf_{M: |M1 |+···+|MK |=rN} \sum_{j=1}^{K} (a_M^j /N) · (p + qj)   (2.21)
     = inf_{M: |M1 |+···+|MK |=rN} [\sum_{j=1}^{K} a_M^j /N] · p + [\sum_{j=1}^{K} j a_M^j /N] · q   (2.22)
     \overset{(b)}{=} p + qr,   (2.23)

where (b) is due to the constraints on {a_M^j}_{j=1}^{K} in (2.15) and (2.16).

Therefore, L∗ (r) is lower bounded by the lower convex envelop of
the points {(r, (K − r)/(Kr)): r ∈ {1, . . . , K}}. This completes the proof of the
converse part of Theorem 2.1.
We devote the rest of this subsection to the proof of Lemma 2.2.
To prove Lemma 2.2, we develop a lower bound on the number of bits
communicated by any subset of nodes, by induction on the size of the
subset.
Proof of Lemma 2.2. For q ∈ {1, . . . , Q}, n ∈ {1, . . . , N }, we let Vq,n
be i.i.d. random variables uniformly distributed on F_{2^T}. We let the
intermediate values vq,n be the realizations of Vq,n . For some Q ⊆
{1, . . . , Q} and N ⊆ {1, . . . , N }, we define

V_{Q,N} ≜ {Vq,n : q ∈ Q, n ∈ N }.   (2.24)

Since each message Xk is generated as a function of the intermediate


values that are computed at Node k, we have for all k ∈ {1, . . . , K},

H(Xk | V:,Mk ) = 0, (2.25)

where we use “:” to denote the set of all possible indices.


The validity of the shuffling scheme requires that for all k ∈
{1, . . . , K}, the following equation holds:

H(VWk ,: | X: , V:,Mk ) = 0. (2.26)

For a subset S ⊆ {1, . . . , K}, we define

YS ≜ (VWS ,: , V:,MS ),   (2.27)

which contains all the intermediate values required by the nodes in S


and all the intermediate values known locally by the nodes in S after
the Map phase.
For any subset S ⊆ {1, . . . , K} and a file assignment M, we denote
the number of files that are exclusively mapped by j nodes in S as a_M^{j,S}:

a_M^{j,S} ≜ \sum_{J ⊆S: |J |=j} | (∩_{k∈J} Mk) \ (∪_{i∉J} Mi) |,   (2.28)
and the message symbols communicated by the nodes whose indices are
in S as
XS ≜ {Xk : k ∈ S}.   (2.29)
Then we prove the following claim.
Claim 2.2.1. For any subset S ⊆ {1, . . . , K}, we have
|S|
j,S Q |S| − j
H(XS | YS c ) ≥ T (2.30)
X
aM · ,
j=1
K j

where S c , {1, . . . , K}\S denotes the complement of S. 


We prove Claim 2.2.1 by induction.
a. If S = {k} for any k ∈ {1, . . . , K}, obviously
H(Xk | Y_{{1,...,K}\{k}}) ≥ 0 = T a_M^{1,{k}} · (Q/K) · (1 − 1)/1.   (2.31)
b. Suppose the statement is true for all subsets of size S0 .
For any S ⊆ {1, . . . , K} of size |S| = S0 + 1 and any k ∈ S, we have
H(XS | Y_{S^c}) = (1/|S|) \sum_{k∈S} H(XS , Xk | Y_{S^c})   (2.32)
            ≥ (1/|S|) \sum_{k∈S} H(XS | Xk , Y_{S^c}) + (1/|S|) H(XS | Y_{S^c}).   (2.33)

From (2.33), we have


H(XS | Y_{S^c}) ≥ (1/(|S| − 1)) \sum_{k∈S} H(XS | Xk , Y_{S^c})   (2.34)
            ≥ (1/S0) \sum_{k∈S} H(XS | Xk , V:,Mk , Y_{S^c})   (2.35)
            = (1/S0) \sum_{k∈S} H(XS | V:,Mk , Y_{S^c}).   (2.36)

Due to the decodability criterion at Node k, for each k ∈ S, the


term on the RHS of (2.36) can be written as
H(XS | V:,Mk , YS c ) = H(XS , VWk ,: | V:,Mk , YS c ) (2.37)
= H(VWk ,: | V:,Mk , YS c ) + H(XS | VWk ,: , V:,Mk , YS c ). (2.38)

The first term on the RHS of (2.38) can be lower bounded as follows.

H(VWk ,: | V:,Mk , Y_{S^c}) = H(VWk ,: | V:,Mk , VWS c ,: , V:,MS c )   (2.39)
  \overset{(a)}{=} H(VWk ,: | V:,Mk , V:,MS c )   (2.40)
  \overset{(b)}{=} \sum_{q∈Wk} H(V{q},: | V{q},Mk ∪MS c )   (2.41)
  \overset{(c)}{=} T (Q/K) \sum_{j=0}^{S0} a_M^{j,S\{k}} ≥ T (Q/K) \sum_{j=1}^{S0} a_M^{j,S\{k}},   (2.42)

where (a) is due to the independence of intermediate values and the


fact that Wk ∩ WS c = ∅ (different nodes calculate different output
functions), (b) is due to the independence of intermediate values, and
(c) is due to the independence of the intermediate values and the fact
that |Wk | = Q/K.
The second term on the RHS of (2.38) can be lower bounded by the
induction assumption:

H(XS | VWk ,: , V:,Mk , YS c ) = H(XS\{k} | Y(S\{k})c ) (2.43)

H(XS | VWk ,: , V:,Mk , Y_{S^c}) = H(X_{S\{k}} | Y_{(S\{k})^c})   (2.43)
  ≥ T \sum_{j=1}^{S0} a_M^{j,S\{k}} · (Q/K) · (S0 − j)/j.   (2.44)

Thus by (2.36), (2.38), (2.42) and (2.44), we have

H(XS | Y_{S^c}) ≥ (1/S0) \sum_{k∈S} [ T (Q/K) \sum_{j=1}^{S0} a_M^{j,S\{k}} + T \sum_{j=1}^{S0} a_M^{j,S\{k}} · (Q/K) · (S0 − j)/j ]   (2.45)
  = (T/S0) \sum_{k∈S} \sum_{j=1}^{S0} a_M^{j,S\{k}} · (Q/K) · (S0/j) = T \sum_{j=1}^{S0} (Q/K) · (1/j) \sum_{k∈S} a_M^{j,S\{k}}.   (2.46)

By the definition of a_M^{j,S}, we have the following equations:

\sum_{k∈S} a_M^{j,S\{k}} = \sum_{k∈S} \sum_{n=1}^{N} 1(file n is only mapped by some nodes in S\{k}) · 1(file n is mapped by j nodes)   (2.47)
  = \sum_{n=1}^{N} 1(file n is only mapped by j nodes in S) · \sum_{k∈S} 1(file n is not mapped by Node k)   (2.48)
  = \sum_{n=1}^{N} 1(file n is only mapped by j nodes in S) · (|S| − j)   (2.49)
  = a_M^{j,S} (S0 + 1 − j).   (2.50)

Applying (2.50) to (2.46) yields


H(XS | Y_{S^c}) ≥ T \sum_{j=1}^{S0+1} a_M^{j,S} · (Q/K) · (S0 + 1 − j)/j.   (2.51)

c. Thus for all subsets S ⊆ {1, . . . , K}, the following equation holds:
H(XS | Y_{S^c}) ≥ T \sum_{j=1}^{|S|} a_M^{j,S} · (Q/K) · (|S| − j)/j,   (2.52)

which proves Claim 2.2.1.


Then by Claim 2.2.1, let S = {1, . . . , K} be the set of all K nodes,

L∗_M ≥ H(XS | Y_{S^c}) / (QN T) ≥ \sum_{j=1}^{K} (a_M^j /N) · (K − j)/(Kj).   (2.53)

This completes the proof of Lemma 2.2. 

2.2 Empirical Evaluations of Coded Distributed Computing

In this subsection, we apply the Coded Distributed Computing (CDC)


scheme proposed in the previous subsection to a widely-used distributed
sorting algorithm, TeraSort [62], developing a coded distributed sort-


ing algorithm CodedTeraSort. While the run-time performance of
TeraSort is known to be severely limited by the data shuffling time be-
tween distributed computing nodes (see, e.g., [58, 174]), CodedTeraSort
injects and leverages extra local computations to trade for a substan-
tially smaller bandwidth consumption, hence significantly improving
the overall run-time performance over TeraSort.

2.2.1 Execution Time of Coded Distributed Computing


For a MapReduce application whose overall response time is composed
of the time spent executing the Map tasks, denoted by Tmap , the time
spent shuffling intermediate values, denoted by Tshuffle , and the time
spent executing the Reduce tasks, denoted by Treduce , we have

Ttotal, MR = Tmap + Tshuffle + Treduce . (2.54)

Using CDC, we can leverage r× more computations in the Map


phase, in order to reduce the communication load by the same multi-
plicative factor, where r ∈ N is a design parameter that can be optimized
to minimize the overall execution time. Hence, CDC promises that we
can achieve the overall execution time of
Ttotal, CDC ≈ rTmap + (1/r)Tshuffle + Treduce ,   (2.55)
for any 1 ≤ r ≤ K, where K is the total number of nodes on which the
distributed computation is executed. To minimize the above execution
time, one would choose

r = ⌊√(Tshuffle /Tmap )⌋ or ⌈√(Tshuffle /Tmap )⌉,

resulting in an execution time of

Ttotal, CDC ≈ 2√(Tshuffle Tmap ) + Treduce .   (2.56)
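A small helper (our own sketch) makes this choice concrete; it takes stage times measured from an r = 1 run and minimizes the model (2.55) directly. Note that this model ignores the coding overheads quantified later in this section, which in practice cap the useful value of r.

def best_redundancy(T_map, T_shuffle, T_reduce, K):
    # pick the integer computation load r in {1, ..., K} minimizing the model (2.55)
    def total(r):
        return r * T_map + T_shuffle / r + T_reduce
    r_star = min(range(1, K + 1), key=total)
    return r_star, total(r_star)

# Example with the r = 1 TeraSort measurements of Table 2.1 (seconds):
print(best_redundancy(T_map=1.86, T_shuffle=945.72, T_reduce=10.47, K=16))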

2.2.2 TeraSort
TeraSort [118] is a conventional algorithm for distributed sorting of a
large amount of data. The input data to be sorted is in the format of
key-value (KV) pairs, meaning each input KV pair consists of a key and
a value. For example, the domain of the keys can be 10-byte integers,
and the domain of the values can be arbitrary strings. TeraSort aims
to sort the input data according to their keys, e.g., sorting integers.
A TeraSort algorithm run over K nodes, whose indices are denoted by
a set K = {1, . . . , K}, is comprised of the following five components.
File placement. Let F denote the entire KV pairs to be sorted. They
are split into K disjoint input files, denoted by F{1} , . . . , F{K} . File F{k}
is assigned to and locally stored at Node k.
Key domain partitioning. The key domain of the KV pair, denoted
by P , is split into K ordered partitions, denoted by P1 , . . . , PK . Specif-
ically, for any p ∈ Pi and any p0 ∈ Pi+1 , it holds that p < p0 for all
i ∈ {1, . . . , K − 1}. For example, when P = [0, 100] and K = 4, the
partitions can be P1 = [0, 25), P2 = [25, 50), P3 = [50, 75), P4 = [75, 100].
Node k is responsible for sorting all KV pairs in the partition Pk , for
all k ∈ K.
Map stage. Each node hashes each KV pair in the locally stored file
F{k} to the partition its key falls into. For each of the K key partitions,
the hashing procedure on the file F{k} generates an intermediate value
that contains the KV pairs in F{k} whose keys belong to that partition.
More specifically, we denote the intermediate value of the partition
Pj from the file F{k} as I{k}^j , and the hashing procedure on the file F{k}
is defined as

{I{k}^1 , . . . , I{k}^K } ← Hash(F{k} ).

Shuffle stage. The intermediate value I{j}^k calculated at Node j, j ≠ k,
is unicast to Node k from Node j, for all k ∈ K. Since the intermediate
value I{k}^k is computed locally at Node k in the Map stage, by the end of
the Shuffle stage, Node k knows all intermediate values {I{1}^k , . . . , I{K}^k }
of the partition Pk from all K files.
Reduce stage. Node k locally sorts all KV pairs whose keys fall into
the partition Pk , for all k ∈ K. Specifically, it sorts all intermediate
values in the partition Pk into a sorted list Qk as follows

Qk ← Sort({I{1}^k , . . . , I{K}^k }).
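A minimal Python sketch of the hash-and-partition step is given below (our own illustration using the example partitions above; the implementation evaluated later in this section is written in C++).

import bisect

# Key-domain partitioning for K = 4 nodes over keys in [0, 100), as in the example above.
K = 4
boundaries = [25, 50, 75]        # P1 = [0,25), P2 = [25,50), P3 = [50,75), P4 = [75,100)

def hash_file(kv_pairs):
    # split one input file into K intermediate values, one per key partition
    intermediate = [[] for _ in range(K)]
    for key, value in kv_pairs:
        intermediate[bisect.bisect_right(boundaries, key)].append((key, value))
    return intermediate

file_k = [(81, "v1"), (3, "v2"), (42, "v3"), (77, "v4")]
print(hash_file(file_k))   # partition 0 gets key 3, partition 1 gets 42, partition 3 gets 81 and 77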

Table 2.1: Performance of TeraSort sorting 12 GB data with K = 16 nodes and


100 Mbps network speed

Map (sec.)   Pack (sec.)   Shuffle (sec.)   Unpack (sec.)   Reduce (sec.)   Total (sec.)
1.86         2.35          945.72           0.85            10.47           961.25

Performance Evaluation
We performed an experiment on Amazon EC2 to sort 12 GB of data by
running TeraSort on 16 nodes. The breakdown of the total execution
time is shown in Table 2.1.
We observe from Table 2.1 that for a conventional TeraSort exe-
cution, 98.4% of the total execution time was spent in data shuffling,
which is 508.5× of the time spent in the Map stage. This motivates us to
develop a coded distributed sorting algorithm, named CodedTeraSort,
which integrates the coding technique of CDC into TeraSort to trade
extra computation time to significantly reduce the communication time,
as shown in (2.55).

2.2.3 Coded TeraSort


We describe the CodedTeraSort algorithm, which is developed by inte-
grating the coding techniques of the CDC into the TeraSort algorithm.
Structured redundant file placement. For some parameter r ∈
{1, . . . , K}, we first split the entire input KV pairs into N = \binom{K}{r} input
files. Unlike the file placement of TeraSort, CodedTeraSort places each
of the N input files repetitively on r distinct nodes.
We label an input file using a unique subset S of K with size |S| = r,
i.e., the N input files are denoted by

{FS : S ⊆ K, |S| = r}. (2.57)

We repetitively place an input file FS on each of the r nodes in S,


and hence each node now stores N r/K = \binom{K−1}{r−1} files. As illustrated in


a simple example in Figure 2.5 for K = 4 and r = 2, the file F{2,3} is


placed on Nodes 2 and 3. Node 2 has files F{1,2} , F{2,3} , F{2,4} .

Figure 2.5: An illustration of the structured redundant file placement in


CodedTeraSort with K = 4 nodes and r = 2.
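The structured placement itself takes only a few lines to generate; the sketch below is our own, and the choice K = 4, r = 2 reproduces Figure 2.5.

from itertools import combinations

def place_files(K, r):
    # label one input file per subset S of r nodes and store it on every node in S
    subsets = list(combinations(range(1, K + 1), r))            # file F_S for each S, |S| = r
    storage = {k: [S for S in subsets if k in S] for k in range(1, K + 1)}
    return subsets, storage

subsets, storage = place_files(K=4, r=2)
print(len(subsets))      # N = C(4, 2) = 6 input files
print(storage[2])        # Node 2 stores F_{1,2}, F_{2,3}, F_{2,4}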

As is done in the TeraSort, the key domain of the input KV pairs


is split into K ordered partitions P1 , . . . , PK , and Node k is responsible
for sorting all KV pairs in the partition Pk in the Reduce stage, for all
k ∈ K.
Map stage. Each node repeatedly performs the Map stage operation
of TeraSort described above, on each input file placed on that node.
Only relevant intermediate values generated in the Map stage are kept
locally for further processing. In particular, out of the K intermediate
values {I_S^1 , . . . , I_S^K } generated from file FS , only I_S^k and {I_S^i : i ∈ K\S}
are kept at Node k. This is because the intermediate value I_S^i ,
required by Node i ∈ S\{k} in the Reduce stage, is already available at
Node i after the Map stage, so Node k does not need to keep it and
send it to the nodes in S\{k}. For example, as shown in Figure 2.6,
Node 1 does not keep the intermediate value I_{1,2}^2 for Node 2. However,
Node 1 keeps I_{1,2}^1 , I_{1,2}^3 , I_{1,2}^4 , which are required by Nodes 1, 3, and 4
in the Reduce stage.


Encoding to create coded packets. The role of the encoding process
is to exploit the structured data redundancy to create coded multicast
packets that are simultaneously useful for multiple nodes, thus saving
the load of communicating intermediate values. Specifically, in every
subset M ⊆ K of |M| = r + 1 nodes, the encoding operation proceeds
as follows.

Figure 2.6: An illustration of the Map stage at Node 1 in CodedTeraSort with


K = 4, r = 2 and the key partitions [0, 25), [25, 50), [50, 75), [75, 100].

• For each t ∈ M, the intermediate value I_{M\{t}}^t , which is known at all
nodes in M\{t}, is evenly and arbitrarily split into r segments, i.e.,

I_{M\{t}}^t = {I_{M\{t},k}^t : k ∈ M\{t}},   (2.58)

where I_{M\{t},k}^t denotes the segment corresponding to Node k.

• For each k ∈ M, we generate the coded packet of Node k in M,
denoted by E_{M,k} , by XORing all segments corresponding to Node
k in M,4 i.e.,

E_{M,k} = ⊕_{t∈M\{k}} I_{M\{t},k}^t .   (2.59)

By the end of the Encoding stage, for each k ∈ K, Node k has
generated \binom{K−1}{r} coded packets, i.e., {E_{M,k} : k ∈ M, |M| = r + 1}.
In Figure 2.7, we consider a scenario with r = 2, and illustrate the
encoding process in the subset M = {1, 2, 3}. Exploiting the particular
structure imposed in the stage of file placement, each node creates a
coded packet that contains data segments useful for the other 2 nodes.
Multicast shuffling. After all coded packets are created at the K
nodes, the multicast shuffling process takes place within each subset
of r + 1 nodes. Specifically, within each group M ⊆ K of |M| = r + 1
nodes, each Node k ∈ M multicasts its coded packet EM,k to the other
nodes in M\{k}. This coded packet is simultaneously useful for all of
these r nodes.
Decoding. Having received the coded packet E_{M,u} from Node u,
Node k ∈ M\{u} performs the decoding process by XORing the
data segments {I_{M\{t},u}^t : t ∈ M\{u, k}} with E_{M,u} to recover the
desired segment I_{M\{k},u}^k . Similarly, Node k recovers all data segments
{I_{M\{k},u}^k : u ∈ M\{k}} from the received coded packets in M, and
merges them back to obtain the required intermediate value I_{M\{k}}^k .

4 All segments are zero-padded to the length of the longest one.

Figure 2.7: An illustration of the encoding process within a multicast group
M = {1, 2, 3}.
Reduce. After the Decoding stage, Node k has obtained all KV pairs in
the partition Pk , for all k ∈ K. In this final stage, Node k, k = 1, . . . , K,
performs the Reduce process as in the TeraSort algorithm, sorting the
KV pairs in partition Pk locally.

2.2.4 Experiments
We empirically demonstrate the performance gain of CodedTeraSort
through experiments on Amazon EC2 clusters. In this subsection, we
first present the choices we have made for the implementation. Then, we
describe the experiment setup. Finally, we discuss the experiment results.

Figure 2.8: The coordinator-worker system architecture.

Implementation Choices
Data format. All input KV pairs are generated from TeraGen [62] in
the standard Hadoop package. Each input KV pair consists of a 10-byte
key and a 90-byte value. A key is a 10-byte unsigned integer, and the
value is an arbitrary string of 90 bytes. The KV pairs are sorted based
on their keys, using the standard integer ordering.
Platform and library. We choose Amazon EC2 as the evaluation
platform. We implement both TeraSort and CodedTeraSort algorithms
in C++, and use Open MPI library [119] for communications among EC2
instances.
System architecture. As shown in Figure 2.8, we employ a system
architecture that consists of a coordinator node and K worker nodes,
for some K ∈ N. Each node is run as an EC2 instance. The coordinator
node is responsible for creating the key partitions and placing the
input files on the local disks of the worker nodes. The worker nodes
are responsible for distributedly executing the stages of the sorting
algorithms.
In-memory processing. After the KV pairs are loaded from the local
files into the workers’ memories, all intermediate data that are used
for encoding, decoding and local sorting are persisted in the memories,
and hence there is no disk I/O involved during the executions of the
algorithms.
In the TeraSort implementation, each node sequentially steps
through Map, Pack, Shuffle, Unpack, and Reduce stages.

Figure 2.9: (a) Serial unicast in the Shuffle stage of TeraSort; a solid arrow repre-
sents a unicast. (b) Serial multicast in the Multicast Shuffle stage of CodedTeraSort;
a group of solid arrows starting at the same node represents a multicast.

In the Reduce stage, the standard sort std::sort is used to sort each partition locally.
To better interpret the experiment results, we add the Pack and the Un-
pack stages to separate the time of serialization and deserialization from
the other stages. The Pack stage serializes each intermediate value to a
continuous memory array to ensure that a single TCP flow is created for
each intermediate value (which may contain multiple KV pairs) when
MPI_Send is called.5 The Unpack stage deserializes the received data to
a list of KV pairs. In the Shuffle stage, intermediate values are unicast
serially, meaning that there is only one sender node and one receiver
node at any time instance. Specifically, as illustrated in Figure 2.9(a),
Node 1 starts to unicast to Nodes 2, 3, and 4 back-to-back. After Node
1 finishes, Node 2 unicasts back-to-back to Nodes 1, 3, and 4. This
continues until Node 4 finishes.
In the CodedTeraSort implementation, each node sequentially steps
through CodeGen, Map, Encode, Multicast Shuffling, Decode, and Re-
duce stages. In the CodeGen (or code generation) stage, firstly, each
node generates all file indices, as subsets of r nodes. Then each node
uses MPI_Comm_split to initialize \binom{K}{r+1} multicast groups, each containing
r + 1 nodes, on Open MPI, such that multicast communications
will be performed within each of these groups. The serialization and
deserialization are implemented respectively in the Encode and the
Decode stages.

5 Creating a TCP flow per KV pair leads to inefficiency from overhead and convergence issues.

In Multicast Shuffling, MPI_Bcast is called to multicast


a coded packet in a serial manner, so only one node multicasts one of
its encoded packets at any time instance. Specifically, as illustrated in
Figure 2.9(b), Node 1 multicasts to the other 2 nodes in each multicast
group Node 1 is in. For example, Node 1 first multicasts to Node 2 and 3
in the group {1, 2, 3}. After Node 1 finishes, Node 2 starts multicasting
in the same manner. This process continues until Node 4 finishes.

Experiment Setup
We conduct experiments using the following configurations to evaluate
the performance of CodedTeraSort and TeraSort on Amazon EC2:

• The coordinator runs on a r3.large instance with 2 processors,


15 GB memory, and 32 GB SSD.

• Each worker node runs on an m3.large instance with 2 processors,


7.5 GB memory, and 32 GB SSD.

• The incoming and outgoing traffic rates of each instance are


limited to 100 Mbps.6

• 12 GB of input data (equivalently 120 M KV pairs) is sorted.

2.2.5 Results
The breakdowns of the execution times with K = 16 workers and
K = 20 workers are shown in Tables 2.2 and 2.3 respectively. We
observe an overall 1.97×–3.39× speedup of CodedTeraSort as compared
with TeraSort. From the experiment results we make the following
observations:

• For CodedTeraSort, the time spent in the CodeGen stage is


proportional to \binom{K}{r+1}, which is the number of multicast groups.

6 This is to alleviate the effects of the bursty behaviors of the transmission rates in the beginning of some TCP sessions. The rates are limited by traffic control command tc [149].

Table 2.2: Sorting 12 GB data with K = 16 nodes and 100 Mbps network speed

                        CodeGen   Map     Pack/    Shuffle   Unpack/   Reduce   Total     Speedup
                        (sec.)    (sec.)  Encode   (sec.)    Decode    (sec.)   Time
                                          (sec.)             (sec.)             (sec.)
TeraSort                –         1.86    2.35     945.72    0.85      10.47    961.25
CodedTeraSort, r = 3    6.06      6.03    5.79     412.22    2.41      13.05    445.56    2.16×
CodedTeraSort, r = 5    23.47     10.84   8.10     222.83    3.69      14.40    283.33    3.39×

• The Map time of CodedTeraSort is approximately r times higher


than that of TeraSort. This is because each node hashes
r times more KV pairs than in TeraSort. Specifically, the
ratios of the CodedTeraSort’s Map time to the TeraSort’s Map
time from Table 2.2 are 6.03/1.86 ≈ 3.2 and 10.84/1.86 ≈ 5.8,
and from Table 2.3 are 4.68/1.47 ≈ 3.2 and 8.59/1.47 ≈ 5.8.

• While CodedTeraSort theoretically promises a factor of more


than r× reduction in shuffling time, the actual gains observed
in the experiments are slightly less than r. For example, for an
experiment with K = 16 nodes and r = 3, as shown in Table 2.2,
the speedup of the Shuffle stage is 945.72/412.22 ≈ 2.3 < 3. This
phenomenon is caused by the following two factors. (1) Open
MPI’s multicast API (MPI_Bcast) has an inherent overhead per
multicast group, for instance, a multicast tree is constructed before
multicasting to a set of nodes. (2) Using the MPI_Bcast API, the
time of multicasting a packet to r nodes is higher than that of

Table 2.3: Sorting 12 GB data with K = 20 nodes and 100 Mbps network speed

                        CodeGen   Map     Pack/    Shuffle   Unpack/   Reduce   Total     Speedup
                        (sec.)    (sec.)  Encode   (sec.)    Decode    (sec.)   Time
                                          (sec.)             (sec.)             (sec.)
TeraSort                –         1.47    2.00     960.07    0.62      8.29     972.45
CodedTeraSort, r = 3    19.32     4.68    4.89     453.37    1.87      9.73     493.86    1.97×
CodedTeraSort, r = 5    140.91    8.59    7.51     269.42    3.70      10.97    441.10    2.20×

unicasting the same packet to a single node. In fact, as measured


in [89], the multicasting time increases logarithmically with r.

• The sorting times in the Reduce stage of both algorithms depend


on the available memories of the nodes. CodedTeraSort inherently
has a higher memory overhead, e.g., it requires persisting more
intermediate values in the memories than TeraSort for coding
purposes, hence its local sorting process takes slightly longer. This
can be observed from the Reduce column in Tables 2.2 and 2.3.

Further, we observe the following trends from both tables:


Impact of redundancy parameter r: As r increases, the shuffling
time reduces substantially by approximately r times. However, the Map
execution time increases linearly with r, and more importantly the
CodeGen time increases as \binom{K}{r+1}. Hence, for small values of r (r < 6) we
observe overall reduction in execution time, and the speedup increases.
However, as we further increase r, the CodeGen time will dominate the
execution time, and the speedup decreases. Hence, in our evaluations,
we have limited r to be at most 5.7
Impact of worker number K: As K increases, the speedup decreases.
This is due to the following two reasons. (1) The number of multicast
groups, i.e., \binom{K}{r+1}, grows exponentially with K, resulting in a longer
execution time of the CodeGen process. (2) When more nodes participate
in the computation, for a fixed r, fewer KV pairs are hashed
at each node locally in the Map stage, resulting in fewer locally available
intermediate values and a higher communication load.

2.3 Extension to Wireless Distributed Computing

Having theoretically and empirically demonstrated how coding can


help to overcome the communication bottlenecks and significantly im-
prove the performance of applications hosted over wireline networks like
datacenters, we also extend the idea of coded computing into mobile
7 The redundancy parameter r is also limited by the total storage available at the nodes. Since for a choice of redundancy parameter r, each piece of input KV pairs should be stored at r nodes, we cannot increase r beyond (total available storage at the worker nodes)/(input size).

edge computing, in which mobile users participating in the compu-


tation exchange intermediate computation results via the underlying
wireless links. In the mobile edge computing scenario, the communica-
tion bottleneck becomes much worse due to much lower data rates of
wireless networks, which significantly delays the overall computation.
We demonstrate in this subsection that coding can exploit the rather
abundant computation resources in the network to create redundant
computations, and trade these redundant computations for substantial
reduction in the bandwidth requirement. This technology will enable
a scalable mobile computing platform that can accommodate a large
number of users with a fixed communication bandwidth.

2.3.1 System Model


We consider a system that has K mobile users. As illustrated in Fig-
ure 2.10, all users are connected wirelessly to an access point (e.g., a
cellular base station or a Wi-Fi router). The uplink channels of the K
users towards the access point are orthogonal to each other, and the
signals transmitted by the access point on the downlink are received by
all the users.
The system has a dataset (e.g., a feature repository of objects in an
image recognition application) that is evenly partitioned into N files
w1 , . . . , wN ∈ F_{2^F}, for some N, F ∈ N. Each user k has a length-D input

Figure 2.10: A wireless distributed computing system.



dk ∈ F_{2^D} (e.g., the user’s image in the image recognition application) to
process using the N files. To do that, as shown in Figure 2.10, User k
needs to compute

φ(dk ; w1 , . . . , wN ),   (2.60)

where dk is the input, w1 , . . . , wN is the dataset, and φ: F_{2^D} × (F_{2^F})^N → F_{2^B} is an output function that maps the
input dk to an output result (e.g., the returned result after processing
the image) of length B ∈ N.
We assume that every mobile user has a local memory that can
store up to a µ fraction of the dataset (i.e., µN files), for some constant
parameter µ. We focus on the case where 1/K ≤ µ < 1, such that each
user does not have enough storage for the entire dataset, but the entire
dataset can be stored collectively across all the users. We denote the
set of indices of the files stored by User k as Uk . The selections of
Uk s are design parameters, and we denote the design of U1 , . . . , UK as
dataset placement. The dataset placement is performed in prior to the
computation (e.g., users download parts of the feature repository when
installing the image recognition application).
Remark 2.9. The employed physical-layer network model is rather
simple and one can do better using a more detailed model and more
advanced techniques. However we note that any wireless medium can
be converted to our simple model using (1) TDMA on uplink; and
(2) broadcast at the rate of weakest user on downlink. Since our goal
is to introduce a “coded” framework for scalable wireless distributed
computing, we decide to abstract out the physical layer and focus on
the amount of data needed to be communicated.
Distributed computing model. Motivated by prevalent distributed
computing structures like MapReduce [42] and Spark [171], we assume
that the computation for input dk can be decomposed as
φ(dk ; w1 , . . . , wN ) = h(g1 (dk ; w1 ), . . . , gN (dk ; wN )), (2.61)
where
• The “Map” functions gn (dk ; wn ): F_{2^D} × F_{2^F} → F_{2^T}, n ∈ {1, . . . , N },
k ∈ {1, . . . , K}, map the input dk and the file wn into an inter-
mediate value vk,n = gn (dk ; wn ) ∈ F_{2^T}, for some T ∈ N,

• The “Reduce” function h: (F_{2^T})^N → F_{2^B} maps the interme-
diate values for input dk in all files into the output value
φ(dk ; w1 , . . . , wN ) = h(vk,1 , . . . , vk,N ), for all k ∈ {1, . . . , K}.

We focus on the applications in which the size of the users’ inputs


is much smaller than the size of the computed intermediate values,
i.e., D ≪ T . As a result, the overhead of disseminating the inputs is
negligible, and we assume that the users’ inputs d1 , . . . , dK are known
at each user before the computation starts.

Remark 2.10. The above assumption holds for various wireless dis-
tributed computing applications. For example, in a mobile navigation
application, an input is simply the address of the intended destination.
The computed intermediate results contain all possible routes between
the two end locations, from which the fastest one is computed for the
user. Similarly, for a set of “filtering” applications like image recognition
(or similarly augmented reality) and recommendation systems, the in-
puts are light-weight queries (e.g., the feature vector of an image) that
are much smaller than the filtered intermediate results containing all
attributes of related information. For example, an input can be multiple
words describing the type of restaurant a user is interested in, and the
intermediate results returned by a recommendation system application
can be a list of relevant information that include customers’ comments,
pictures, and videos of the recommended restaurants.

Following the decomposition in (2.61), the overall computation


proceeds in three phases: Map, Shuffle, and Reduce.
Map phase: User k, k ∈ {1, . . . , K}, computes the Map functions of
d1 , . . . , dK based on the files in Uk . For each input dk and each file wn
in Uk , User k computes gn (dk , wn ) = vk,n .
Shuffle phase: Users exchange the needed intermediate values via
the access point they all wirelessly connect to. As a result, the Shuffle
phase breaks into two sub-phases: uplink communication and downlink
communication.

On the uplink, user k creates a message Wk as a function of the


intermediate values computed locally, i.e.,

Wk = ψk ({vk,n : k ∈ {1, . . . , K}, n ∈ Uk }), (2.62)

and communicates Wk to the access point.

Definition 2.4 (Uplink Communication Load). We define the uplink


communication load, denoted by Lu , as the total number of bits in all
uplink messages W1 , . . . , WK , normalized by the number of bits in the
N intermediate values required by a user (i.e., N T ).

We assume that the access point does not have access to the dataset.
Upon decoding all the uplink messages W1 , . . . , WK , the access point
generates a message X from the decoded uplink messages, i.e.,

X = ρ(W1 , . . . , WK ), (2.63)

and then broadcasts X to all users on the downlink.

Definition 2.5 (Downlink Communication Load). We define the down-


link communication load, denoted by Ld , as the number of bits in the
downlink message X, normalized by N T .

Reduce phase: User k, k ∈ {1, . . . , K}, uses the locally computed


results {vk,n : n ∈ Uk } and the decoded downlink message X to construct
the inputs to the corresponding Reduce function, and calculates the
output value φ(dk ; w1 , . . . , wN ) = h(vk,1 , . . . , vk,N ).
Example (uncoded scheme). As a benchmark, we consider an un-
coded scheme, where each user receives the needed intermediate values
sent uncodedly by some other users and forwarded by the access point,
achieving the communication loads
L_u^uncoded(µ) = L_d^uncoded(µ) = µK · (1/µ − 1).   (2.64)
The above communication loads of the uncoded scheme grow with
the number of users K, overwhelming the limited spectral resources. In
this subsection, we argue that by utilizing coding at the users and the
access point, we can accommodate any number of users with a constant

communication load. Particularly, we propose in the next subsection


a scalable coded wireless distributed computing (CWDC) scheme that
achieves minimum possible uplink and downlink communication load
simultaneously, i.e.,
1
Lcoded
u = Loptimum
u ≈ − 1, (2.65)
µ
1
Lcoded
d = Loptimum
d ≈ − 1. (2.66)
µ

2.3.2 The Proposed CWDC Scheme


We present the proposed CWDC scheme for the wireless distributed
computing system. We first consider the storage size µ ∈ {1/K, 2/K, . . . , 1}
such that µK ∈ N. We assume that N is sufficiently large such that
N = \binom{K}{µK} η for some η ∈ N.8
Dataset placement and Map phase execution. We evenly parti-
tion the indices of the N files into \binom{K}{µK} disjoint batches, each containing
the indices of η files. We denote a batch of file indices as BT , which is
labelled by a unique subset T ⊂ {1, . . . , K} of size |T | = µK. As such
defined, we have

{1, . . . , N } = {i: i ∈ BT , T ⊂ {1, . . . , K}, |T | = µK}.   (2.67)

User k, k ∈ {1, . . . , K}, stores locally all the files whose indices are in
BT if k ∈ T . That is,

Uk = ∪_{T : |T |=µK, k∈T} BT .   (2.68)

As a result, each of the N files is stored by µK distinct users.


Uplink communication. For any subset W ⊂ {1, . . . , K} and any
k ∉ W, we denote the set of intermediate values needed by User k and
known exclusively by the users in W as V_W^k. More formally:

V_W^k ≜ {vk,n : n ∈ ∩_{i∈W} Ui , n ∉ ∪_{i∉W} Ui }.   (2.69)

8 For a small number of files N < \binom{K}{µK}, we can apply the coded wireless distributed computing scheme to a smaller subset of users, achieving a part of the gain in reducing the communication load.

For all subsets S ⊆ {1, . . . , K} of size µK + 1:

1. For each User k ∈ S, V_{S\{k}}^k is the set of intermediate values that
are requested by User k, are in the files whose indices are
in the batch B_{S\{k}}, and are exclusively known at all users
whose indices are in S\{k}. We evenly and arbitrarily split V_{S\{k}}^k
into µK disjoint segments {V_{S\{k},i}^k : i ∈ S\{k}}, where V_{S\{k},i}^k
denotes the segment associated with User i in S\{k} for User k.
That is, V_{S\{k}}^k = ∪_{i∈S\{k}} V_{S\{k},i}^k.

2. User i, i ∈ S, sends the bit-wise XOR, denoted by ⊕, of all the


segments associated with it in S, i.e., User i sends the coded
segment W_i^S ≜ ⊕_{k∈S\{i}} V_{S\{k},i}^k.

Since the coded message W_i^S contains ηT/(µK) bits9 for all i ∈ S, there
are a total of (µK + 1)ηT/(µK) bits communicated on the uplink in every subset
S of size µK + 1. Therefore, the uplink communication load achieved
by this coded scheme is

L_u^coded(µ) = \binom{K}{µK+1} · (µK + 1) · (ηT/(µK)) / (N T) = 1/µ − 1,   µ ∈ {1/K, 2/K, . . . , 1}.   (2.70)
Downlink communication. For all subsets S ⊆ {1, . . . , K} of size
µK + 1, the access point computes µK random linear combinations of
the uplink messages generated based on the subset S:

CjS ({WiS : i ∈ S}), j = 1, . . . , µK, (2.71)

and multicasts them to all users in S.


Since each linear combination contains ηT/(µK) bits, the coded scheme
achieves a downlink communication load

L_d^coded(µ) = \binom{K}{µK+1} · η · T / (N T) = (µK/(µK + 1)) · (1/µ − 1),   µ ∈ {1/K, 2/K, . . . , 1}.   (2.72)
9 Here we assume that T is sufficiently large such that T/(µK) ∈ N.

After receiving the random linear combinations C_1^S , . . . , C_{µK}^S, User i,
i ∈ S, cancels all segments she knows locally, i.e., ∪_{k∈S\{i}} {V_{S\{k},j}^k : j ∈
S\{k}}. Consequently, User i obtains µK random linear combinations
of the required µK segments {V_{S\{i},j}^i : j ∈ S\{i}}.
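To make the uplink–downlink interplay concrete, the self-contained Python sketch below (our own illustration) walks through one subset S with µK = 2. For brevity it replaces the random linear combinations by one fixed 2 × 3 coefficient matrix over GF(2) whose every column-deleted 2 × 2 submatrix is invertible (random coefficients have this property with high probability), and it uses a tiny GF(2) elimination routine for the decoding step.

import numpy as np

rng = np.random.default_rng(0)
S = [0, 1, 2]                  # one subset of muK + 1 users, with muK = 2
muK, seg_bits = 2, 16

# V[k][i]: segment needed by user k and associated with user i; known to every user in S except k
V = {k: {i: rng.integers(0, 2, seg_bits) for i in S if i != k} for k in S}
# Uplink: user i sends W[i], the XOR of all segments associated with it
W = {i: np.bitwise_xor.reduce([V[k][i] for k in S if k != i]) for i in S}

# Access point: muK linear combinations of the three uplink messages over GF(2)
# (fixed matrix for brevity; every 2x2 submatrix left after deleting one column is invertible)
A = np.array([[1, 1, 0],
              [0, 1, 1]])
C = [np.bitwise_xor.reduce([A[m, c] * W[i] for c, i in enumerate(S)]) for m in range(muK)]

def solve_gf2(B, y):
    # Gauss-Jordan elimination over GF(2); B is n x n and invertible, y is a list of bit-vectors
    B, y, n = B.copy(), [v.copy() for v in y], len(B)
    for col in range(n):
        piv = next(r for r in range(col, n) if B[r, col] == 1)
        B[[col, piv]] = B[[piv, col]]
        y[col], y[piv] = y[piv], y[col]
        for r in range(n):
            if r != col and B[r, col] == 1:
                B[r] = (B[r] + B[col]) % 2
                y[r] = y[r] ^ y[col]
    return y

# Each user k cancels everything it knows from C and solves for its muK missing segments
for k in S:
    others = [i for i in S if i != k]
    y = []
    for m in range(muK):
        known = A[m, S.index(k)] * W[k]            # W[k] is built only from segments user k knows
        for i in others:
            for j in S:
                if j not in (i, k):                # known parts inside the other uplink messages
                    known = known ^ (A[m, S.index(i)] * V[j][i])
        y.append(C[m] ^ known)
    B = np.delete(A, S.index(k), axis=1)           # coefficients of the muK unknown segments
    decoded = solve_gf2(B, y)
    assert all(np.array_equal(decoded[t], V[k][i]) for t, i in enumerate(others))
print("all", len(S), "users decoded their segments from", muK, "downlink transmissions")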
When µK is not an integer, we can first expand µ = αµ1 + (1 − α)µ2
as a convex combination of µ1 ≜ ⌊µK⌋/K and µ2 ≜ ⌈µK⌉/K. Then we
partition the set of the N files into two disjoint subsets I1 and I2 of
sizes |I1 | = αN and |I2 | = (1 − α)N . We next apply the above coded
scheme respectively to the files in I1 and I2 , yielding the following
communication loads:

L_u^coded(µ) = α (1/µ1 − 1) + (1 − α) (1/µ2 − 1),   (2.73)
L_d^coded(µ) = α (µ1 K/(µ1 K + 1)) · (1/µ1 − 1) + (1 − α) (µ2 K/(µ2 K + 1)) · (1/µ2 − 1).   (2.74)
Hence, for general storage size µ, CWDC achieves the following
communication loads.
L_u^coded(µ) = Conv(1/µ − 1),   (2.75)
L_d^coded(µ) = Conv((µK/(µK + 1)) · (1/µ − 1)),   (2.76)

where Conv(f (µ)) denotes the lower convex envelop of the points
{(µ, f (µ)): µ ∈ {1/K, 2/K, . . . , 1}} for the function f (µ).
We summarize the performance of the proposed CWDC scheme in
the following theorem.

Theorem 2.3. For a wireless distributed computing application with a


dataset of N files, and K users that each can store a µ ∈ {1/K, 2/K, . . . , 1}
fraction of the files, the proposed CWDC scheme achieves the following
uplink and downlink communication loads for sufficiently large N :

L_u^coded(µ) = 1/µ − 1,   (2.77)
L_d^coded(µ) = (µK/(µK + 1)) · (1/µ − 1).   (2.78)
For general 1/K ≤ µ ≤ 1, the achieved loads are as stated in (2.75) and
(2.76).

Remark 2.11. Theorem 2.3 implies that, for large K, L_u^coded(µ) ≈
L_d^coded(µ) ≈ 1/µ − 1, which is independent of the number of users. Hence,
we can accommodate any number of users without incurring extra
communication load, and the proposed scheme is scalable. The reason
for this phenomenon is that, as more users join the network, with
an appropriate dataset placement, we can create coded multicasting
opportunities that reduce the communication loads by a factor that scales
linearly with K. Such a phenomenon was also observed in the context of
cache networks (see, e.g., [105]).

Remark 2.12. As illustrated in Figure 2.11, the proposed CWDC


scheme utilizes coding at the mobile users and the access point to
reduce the uplink and downlink communication load by a factor of µK
and µK + 1 respectively, which scale linearly with the aggregated storage
size of the system.

Figure 2.11: Comparison of the communication loads achieved by the uncoded
scheme with those achieved by the proposed CWDC scheme, for a network of K = 20
users: (a) uplink; (b) downlink. The plots mark load reductions of 10× (uplink) and
11× (downlink). Here the storage size µ ≥ 1/K = 0.05 such that the entire dataset can be stored
across the users.
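The curves in Figure 2.11 follow directly from (2.64), (2.77) and (2.78); the small sketch below (our own) evaluates them at the integer-valued storage points µK ∈ N for K = 20.

K = 20                                    # number of users, as in Figure 2.11

def load_uncoded(mu):
    return mu * K * (1 / mu - 1)          # uplink = downlink load of the uncoded scheme, (2.64)

def load_cwdc_uplink(mu):
    return 1 / mu - 1                     # (2.77)

def load_cwdc_downlink(mu):
    return (mu * K / (mu * K + 1)) * (1 / mu - 1)   # (2.78)

for muK in (1, 2, 5, 10, 20):             # integer-valued storage points muK
    mu = muK / K
    print(f"mu = {mu:4.2f}  uncoded = {load_uncoded(mu):5.2f}  "
          f"CWDC uplink = {load_cwdc_uplink(mu):5.2f}  CWDC downlink = {load_cwdc_downlink(mu):5.2f}")
# at mu = 0.5 the reductions are muK = 10x (uplink) and muK + 1 = 11x (downlink)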

Remark 2.13. Compared with distributed computing over wired servers


where we only need to design one data shuffling scheme between servers
in Subsection 2.1, here in the wireless setting we jointly design uplink
and downlink shuffling schemes, which minimize both the uplink and
downlink communication loads.

2.3.3 Optimality of the Proposed CWDC Scheme


In this subsection, we demonstrate in the following theorem that the
proposed CWDC scheme achieves the minimum uplink and downlink
communication loads achievable by any scheme.

Theorem 2.4. For a wireless distributed computing application using


any dataset placement and communication schemes that achieve an
uplink load Lu and a downlink load Ld , Lu and Ld are lower bounded
by L_u^coded(µ) and L_d^coded(µ) as stated in Theorem 2.3, respectively.

Remark 2.14. Using Theorems 2.3 and 2.4, we have completely charac-
terized the minimum achievable uplink and downlink communication
loads, using any dataset placement, uplink and downlink communication
schemes. This implies that the proposed CWDC scheme simultaneously
minimizes both uplink and downlink communication loads required to
accomplish distributed computing, and no other scheme can improve
upon it. This also demonstrates that there is no fundamental tension
between optimizing uplink and downlink communication in wireless
distributed computing.

For a dataset placement U = {Uk}_{k=1}^{K}, we denote the minimum


possible uplink and downlink communication loads, achieved by any
uplink-downlink communication scheme to accomplish wireless dis-
tributed computing, by L∗u (U) and L∗d (U) respectively. We next prove
Theorem 2.4 by deriving lower bounds on L∗u (U) and L∗d (U) respectively.

Lower Bound on L∗u (U)


For a given dataset placement U, we denote the number of files that
are stored at j users as a_U^j, for all j ∈ {1, . . . , K}, i.e.,

a_U^j = \sum_{J ⊆{1,...,K}: |J |=j} | (∩_{k∈J} Uk) \ (∪_{i∉J} Ui) |.   (2.79)

For any U, it is clear that {a_U^j}_{j=1}^{K} satisfy

\sum_{j=1}^{K} a_U^j = N,   (2.80)
\sum_{j=1}^{K} j a_U^j = µN K.   (2.81)

We start the proof with the following lemma, which characterizes


a lower bound on L∗u (U) in terms of the distribution of the files in the
dataset placement U, i.e., a_U^1 , . . . , a_U^K.

Lemma 2.5. L∗_u(U) ≥ \sum_{j=1}^{K} (a_U^j /N) · (K − j)/j.

Lemma 2.5 can be proved following similar steps to the proof of
Lemma 2.2 in Subsection 2.1, after replacing the downlink broadcast
message X with the uplink unicast messages W1 , . . . , WK in conditional
entropy terms (since X is a function of W1 , . . . , WK ).
Next, since the function (K − j)/j in Lemma 2.5 is convex in j, and by
(2.80) \sum_{j=1}^{K} a_U^j /N = 1 and (2.81), we have

L∗_u(U) ≥ [K − \sum_{j=1}^{K} j a_U^j /N] / [\sum_{j=1}^{K} j a_U^j /N] = (K − µK)/(µK) = 1/µ − 1.   (2.82)

We can further improve the lower bound in (2.82) for a particular µ


such that µK ∈ / N. For a given storage size µ, we first find two points
(µ1 , µ1 − 1) and (µ2 , µ12 − 1), where µ1 , bµKc/K and µ2 , dµKe/K.
1
2.3. Extension to Wireless Distributed Computing 59

Then we find the line p + qt connecting these two points as a function


of t, 1/K ≤ t ≤ 1, for some constants p, q ∈ R. We note that p and q are
different for different µ, and

p + qt|_{t=µ1} = 1/µ1 − 1,   (2.83)
p + qt|_{t=µ2} = 1/µ2 − 1.   (2.84)
Then by the convexity of the function 1/t − 1, the function 1/t − 1
cannot be smaller than the function p + qt at the points t = 1/K, 2/K, . . . , 1.
That is, for all t ∈ {1/K, . . . , 1},

1/t − 1 ≥ p + qt.   (2.85)
By Lemma 2.5, we have
L∗_u(U) ≥ \sum_{j=1}^{K} (a_U^j /N) · (K − j)/j   (2.86)
  = \sum_{t=1/K,...,1} (a_U^{tK} /N) · (1/t − 1)   (2.87)
  ≥ \sum_{t=1/K,...,1} (a_U^{tK} /N) · (p + qt)   (2.88)
  = p + qµ.   (2.89)

Therefore, for general 1/K ≤ µ ≤ 1, L∗_u(U) is lower bounded by the
lower convex envelop of the points {(µ, 1/µ − 1): µ ∈ {1/K, 2/K, . . . , 1}}.

Lower Bound on L∗d (U)


The lower bound on the minimum downlink communication load L∗d (U)
can be proved following similar steps to those used for lower bounding the min-
imum uplink communication load L∗u (U), after making the following
enhancements to the downlink communication system:

• We consider the access point as the (K + 1)th user who has stored
all N files and has a virtual input to process. Thus the enhanced
downlink communication system has K + 1 users, and the dataset
placement for the enhanced system is

Ū ≜ {U, UK+1 },   (2.90)
where UK+1 is equal to {1, . . . , N }.

• We assume that every one of the K + 1 users can broadcast to


the rest of the users, where the broadcast message is generated
by mapping the locally stored files.
Apparently the minimum downlink communication load of the sys-
tem cannot increase after the above enhancements. Thus the lower
bound on the minimum downlink communication load of the enhanced
system is also a lower bound for the original system.
Then we can apply the same arguments in the proof of Lemma 2.5
to the enhanced downlink system of K + 1 users, obtaining a lower
bound on L∗d (U), as described in the following corollary:
Corollary 2.6. L∗_d(U) ≥ \sum_{j=1}^{K} (a_U^j /N) · (K − j)/(j + 1).

Proof. Applying Lemma 2.5 to the enhanced downlink system yields


L∗_d(Ū) ≥ \sum_{j=1}^{K+1} (a_Ū^j /N) · (K + 1 − j)/j ≥ \sum_{j=2}^{K+1} (a_Ū^j /N) · (K + 1 − j)/j   (2.91)
  = \sum_{j=1}^{K} (a_Ū^{j+1} /N) · (K − j)/(j + 1).   (2.92)

Since the access point has stored every file, a_Ū^{j+1} = a_U^j for all
j ∈ {1, . . . , K}. Therefore, (2.92) can be re-written as

L∗_d(U) ≥ L∗_d(Ū) ≥ \sum_{j=1}^{K} (a_U^j /N) · (K − j)/(j + 1).   (2.93)


Then following the same arguments as in the proof for the minimum
uplink communication load, we have
L∗_d(U) ≥ (K − µK)/(µK + 1) = (µK/(µK + 1)) · (1/µ − 1).   (2.94)

For general 1/K ≤ µ ≤ 1, L∗_d(U) is lower bounded by the lower convex
envelop of the points {(µ, (µK/(µK + 1)) · (1/µ − 1)): µ ∈ {1/K, 2/K, . . . , 1}}.
This completes the proof of Theorem 2.4.

2.4 Related Works and Open Problems

The problem of characterizing the minimum communication for dis-


tributed computing has been previously considered in several settings in
both computer science and information theory communities. In [163], a
basic computing model is proposed, where two parties have x and y and
aim to compute a boolean function f (x, y) by exchanging the minimum
number of bits between them. Also, the problem of minimizing the re-
quired communication for computing the modulo-two sum of distributed
binary sources with symmetric joint distribution was introduced in [85].
Following these two seminal works, a wide range of communication
problems in the scope of distributed computing have been studied (see,
e.g., [16, 88, 116, 120, 121, 126]). The key differences distinguishing
the setting in this section from most of the prior ones are: (1) We
focus on the flow of communication in a general MapReduce distributed
computing framework, rather than the structures of the functions or
the input distributions. (2) We do not impose any constraint on the
numbers of output results, input data files and computing nodes (they
can be arbitrarily large). (3) We do not assume any special property
(e.g., linearity) of the computed functions.
The idea of efficiently creating and exploiting coded multicasting was
initially proposed in the context of cache networks in [104, 105], and
extended in [72, 79], where caches pre-fetch part of the content in a way
to enable coding during the content delivery, minimizing the network
traffic. In this monograph, we focus on the tradeoff between computation
and communication in distributed computing. We demonstrate that the
coded multicasting opportunities exploited in the caching problems also
exist in the data shuffling of distributed computing frameworks, which
can be created by a strategy of repeating the computations of the Map
functions specified by the proposed scheme.
There are many follow-up works after the formulation and the
characterization of the optimal computation-communication tradeoff
in [93, 97]. In [98], the proposed Coded Distributed Computing (CDC)
was utilized in multi-stage computations that consist of executing a
series of MapReduce jobs described by a directed acyclic graph, individ-
ually minimizing the communication load within each stage. When the
Reduce functions are linear, it was shown in [92] that the pre-combining
technique (see, e.g., [42]) can be combined with CDC to further reduce
the bandwidth requirement. When the distributed computing nodes
have heterogeneous storage/processing capabilities, or the flexibility of
asymmetric computations for the output functions, the optimal task
allocation and coded communication schemes were studied in [50, 81,
131, 166]. Recent works [84, 159] also studied a new tradeoff between the
number of files and the load of communication, under the MapReduce
distributed computing framework. Compared with the CDC scheme, the
new coded computing schemes require an exponentially smaller number of
files (or splits of the entire dataset), at the cost of a slightly increased
communication load.
In another closely related line of works, coded data shuffling schemes
were designed to efficiently move data batches between distributed
workers, in order to improve the statistical efficiency of distributed
iterative algorithms (see, e.g., [12, 89, 146, 152]). In this setting, at the
beginning of each iteration, the data stored in the local cache memories
of the workers are exploited to create coded multicast packets for data
shuffling. By the end of the iteration, the workers update their local
caches such that efficient multicasting opportunities are enabled in the
next iteration.
Finally, we end this section with some open problems along the
direction of coding for bandwidth reduction.
Heterogeneous networks with asymmetric tasks. It is common
to have computing nodes with heterogeneous storage, processing and
communication capacities within computer clusters. In addition, pro-
cessing different parts of the dataset can generate intermediate results
with different sizes (e.g., performing data analytics on highly-clustered
graphs). For computing over heterogeneous nodes, one solution is to
break the more powerful nodes into multiple smaller virtual nodes
that have homogeneous capability, and then apply the proposed CDC
scheme for the homogeneous setting. When intermediate results have
different sizes, the proposed coding scheme still applies, but the coding
operations are not symmetric as in the homogeneous case. Alternatively,
we can employ a low-complexity greedy approach, in which we assign
the Map tasks to maximize the number of multicasting opportunities
that simultaneously deliver useful information to the largest possible
number of nodes. As mentioned before, some preliminary studies along
this direction have been performed to obtain the solutions for some
special cases (see, e.g., [50, 81, 131, 166]). Nevertheless, systematically
characterizing the optimal resource allocation strategies and coding
schemes for general heterogeneous networks with asymmetric tasks
remains an interesting open problem.
Multi-stage computation tasks. Unlike simple computation tasks
like Grep, Join and Sort, many distributed computing applications con-
tain multiple stages of MapReduce computations, whose computation
logic can be expressed as a directed acyclic graph. To speed up multi-stage
computation tasks using codes, one straightforward approach is to apply the
proposed CDC scheme for the cascaded distributed computing framework to
compute each stage locally; however, we expect to achieve a higher reduction
in bandwidth consumption and response time by globally designing codes for
the entire task graph and accounting for interactions between consecutive
stages. A preliminary
exploration along this direction was recently presented in [98].
Optimal code design for underlying algebra. In many machine
learning applications, the Reduce function has specific algebraic proper-
ties that can be exploited in coded computing to substantially reduce
the communication load. For example, many Reduce functions in the ex-
isting MapReduce frameworks are linear, as they essentially provide an
average or a linear summary of the corresponding intermediate compu-
tations. As a preliminary work, a coded distributed computing scheme,
named “compressed coded distributed computing” (compressed CDC),
was proposed in [92], which incorporates the pre-combining compression
techniques into the CDC scheme to further reduce the communication
load for linear Reduce functions. Along this direction, it is of great inter-
est to study the optimal task assignment and communication schemes

Figure 2.12: (a) An overview of “think like a vertex” approach taken in common
parallel graph computing frameworks, in which the intermediate computations only
depend on the neighbors at each node [45]; (b) Illustration of the fundamental
trade-off curve between communication load L and storage size at each server m in
parallel graph processing.

for general non-linear Reduce functions, such as thresholds (max, min,


etc.) and polynomials that are for example used in distributed training.
Large-scale graph analytics. There is an increasing interest in ex-
ecuting complex analyses over very large graphs. Examples include
graph-theoretic problems in social networks (e.g., targeted advertising
and studying the spread of information), bioinformatics (e.g., study of
the interactions between various components in a biological system),
and networks (e.g., network analysis for intelligence and surveillance).
Popular distributed graph computing frameworks, such as Pregel [108]
and GraphLab [103], can be viewed as decomposing the intermediate
computations in such a way that they only depend on the neighbors at
each node (see Figure 2.12). This is commonly referred to as “think like
a vertex” computation model, which makes the parallel computations
very efficient by leveraging graph topology to reduce the communication
load at each iteration of the algorithm.
More formally, we can consider an undirected graph G = (V, E),
where for each graph node v ∈ {1, 2, . . . , |V|}, a file wv is associated with
the node. Let N ∗ (v) denote the neighbourhood of node v (including v).
We can then model the computation at each vertex v as

φv (WN ∗ (v) ) = hv ({gv,j (wj ): wj ∈ WN ∗ (v) }), (2.95)



where the Map function gv,j (wj ) maps the input file wj into an inter-
mediate value, and the Reduce function hv (·) maps the intermediate
values of the neighboring nodes of v into the final output value φv . We
note that a key difference between Equation (2.95) and the general
MapReduce computation in Equation (2.1) is that the computations at
each node now only depend on the neighboring nodes according to the
graph topology.
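To make the computation model in Equation (2.95) concrete, here is a minimal Python sketch; the graph, the files w_v, and the particular Map function g_{v,j} and Reduce function h_v (a scaling and a sum) are placeholder choices used only for illustration.

```python
# Undirected graph as an adjacency list; w[v] is the input file at vertex v.
adjacency = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
w = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}

def g(v, j, w_j):
    """Map function g_{v,j}: placeholder that simply scales the file of neighbor j."""
    return 0.5 * w_j

def h(v, intermediate_values):
    """Reduce function h_v: placeholder that sums the intermediate values."""
    return sum(intermediate_values)

def phi(v):
    """phi_v(W_{N*(v)}) = h_v({g_{v,j}(w_j): j in N*(v)}), with N*(v) containing v."""
    closed_neighborhood = set(adjacency[v]) | {v}
    return h(v, [g(v, j, w[j]) for j in sorted(closed_neighborhood)])

print({v: phi(v) for v in adjacency})   # one output value per graph node
```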
Based on the above abstraction of the computation model, an inter-
esting problem is to design the optimal allocation of the subset of nodes
(or data) to each available server and the coding for data shuffling, such
that the amount of communication between servers is minimized. More
specifically, let us denote the number of available servers by K and
assume that the maximum number of nodes that can be assigned to one
server or the storage size is denoted by m. Our goal is to characterize the
fundamental trade-off curve between communication load and storage
size (L, m) for an arbitrary graph, and how coding can help in achieving
this fundamental limit (see Figure 2.12(b)). A preliminary exploration
for random graphs was recently presented in [124].
3
Coding for Straggler Mitigation

Straggling machines cause a major performance bottleneck as distributed


computing applications continue to scale out (see, e.g., [8, 41, 172]). It
was recently proposed to use techniques from coding theory to alleviate
the effect of stragglers in large-scale data analytics, especially performed
over low-end machines on shared platforms like Amazon EC2. The key
idea is to inject and leverage redundant computation tasks into the
cluster, such that the overall computation task can be accomplished
without waiting for the results from the unknown stragglers, hence
significantly reducing the overall computation latency.
As a motivating example of this concept, we consider a distributed
matrix-vector multiplication problem, which underlies many distributed
machine learning algorithms. Given a data matrix A and a target vector
x, the goal is to compute the product Ax distributedly over 3 workers.
The conventional way of doing this is to first partition the matrix A into
3 sub-matrices A1 , A2 , and A3 , such that A = [A1 ; A2 ; A3 ], and then
compute Ai x at worker i. Using this approach, the overall computation
time is limited by the slowest worker, and can be indefinitely long if one
worker becomes unresponsive. Utilizing the idea of erasure coding, as


Figure 3.1: Coded matrix-vector multiplication. Each worker stores a coded sub-
matrix of the data matrix A. During computation, the master can recover the final
result using the results of any 2 out of the 3 workers.

shown in Figure 3.1, a master node partitions the matrix into two sub-
matrices A1 and A2 , and creates a coded sub-matrix A1 +A2 , and gives
each of these three sub-matrices to one of the workers for computation.
Now, the master can recover the desired computation from the results
of any 2 out of the 3 workers. For example as shown in Figure 3.1, the
missing result A2 x can be recovered by subtracting the result of worker 1
from that of worker 3. This example illustrates that by introducing 50%
redundant computations, we can now tolerate a single straggler. For
a general matrix-vector multiplication problem distributedly executed
over n workers, it was proposed in [89] to first partition the matrix into
k sub-matrices, for some k < n, and then use an (n, k) MDS code (e.g.,
Reed–Solomon code) to generate n coded sub-matrices, each of which
is stored on a worker. During the computation process, each worker
multiplies its local sub-matrix with the target vector and returns the
result to the master. Due to the “k out of n” property of the (n, k)
MDS code, the master can recover the overall computation result using
the results from the fastest k workers, protecting the system from as
many as n − k stragglers.
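The following numpy sketch replays the example of Figure 3.1 (matrix sizes and values are arbitrary placeholders): worker 3 stores the coded sub-matrix A1 + A2, and the master reconstructs Ax from the results of any 2 of the 3 workers.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))        # data matrix (placeholder size)
x = rng.standard_normal(4)             # target vector

A1, A2 = A[:3], A[3:]                  # two uncoded sub-matrices
stored = {1: A1, 2: A2, 3: A1 + A2}    # worker 3 holds the coded sub-matrix

# Each worker multiplies its locally stored sub-matrix with x.
results = {i: Ai @ x for i, Ai in stored.items()}

# Suppose worker 2 straggles: the master decodes from workers 1 and 3.
A1x, sum_x = results[1], results[3]
A2x = sum_x - A1x                      # recover the missing block A2 x
assert np.allclose(np.concatenate([A1x, A2x]), A @ x)
print("recovered Ax from 2 of the 3 workers")
```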
While repeating computation tasks has been demonstrated to be
an effective approach in straggler mitigation (see, e.g., [8, 55, 73, 153]),
many recent works have been focusing on characterizing the optimal
codes to combat the straggler’s effect of distributed linear algebraic
computations like matrix-vector and matrix–matrix multiplication. For a
problem of multiplying a matrix with a long vector, a sparse code was
designed in [47] such that only a subset of the entries of the vector are
needed for local computations, while still maintaining the robustness to


a certain number of stragglers. In the problem of distributed matrix–
matrix multiplication where we want to compute the multiplication of
two large matrices A and B over distributed workers, each of whom
can only store a part of A and B, product code was proposed in [90] to
separately encode A and B using MDS codes, and then each worker
is assigned to compute the product of a coded sub-matrix of A and a
coded sub-matrix of B. Using the product code, the master needs to
wait for a much smaller number of workers before it can recover the final
multiplication results, compared with schemes that only code one of the
two matrices. Finally, in [4, 131], the optimal task/resource allocation
(i.e., where and when to launch the redundant coded tasks) was studied
to minimize the overall computation latency.
In this section, we first consider a distributed matrix–matrix multi-
plication problem, and propose an optimal coded computation scheme,
named “polynomial code” to achieve the minimum possible recovery
threshold, which is defined as the number of workers the master needs
to wait for before recovering the overall computation result. Next, we
go beyond matrix algebra, and consider distributed computing of an
arbitrary multivariate polynomial over a dataset. For this problem,
we propose “Lagrange Coded Computing” (LCC), which leverages the
well-known Lagrange interpolation polynomial to create computation
redundancy in a novel coded form across the workers, and achieves the
minimum recovery threshold during job execution. We apply LCC to a
fundamental machine learning task – least-squares regression, where the
gradient computed in each iteration of the gradient descent algorithm is
a quadratic function of the training data. For the task of training a re-
gression model on big data, we empirically demonstrate a 2.36×∼12.65×
latency reduction over state-of-the-art straggler mitigation techniques.
Finally, we end this section by discussing coded computation schemes
for more general computation tasks (e.g., training neural networks),
and some open problems along this research direction.

3.1 Optimal Coding for Matrix Multiplications

In this subsection, we develop “polynomial code” for distributedly


computing large-scale matrix–matrix multiplication. We show that the
proposed polynomial code requires the minimum number of workers
returning their computation results, and this number does not scale
with the network size.

3.1.1 System Model, Problem Formulation, and Main Result


We consider a problem of matrix multiplication with two input matrices
A ∈ F_q^{s×r} and B ∈ F_q^{s×t}, for some integers r, s, t and a sufficiently large
finite field F_q. We are interested in computing the product C ≜ A^⊤B in
a distributed computing environment with a master node and N worker
nodes, where each worker can store a 1/m fraction of A and a 1/n fraction of
B, for some parameters m, n ∈ ℕ^+ (see Figure 3.2). We assume at least
one of the two input matrices A and B is tall (i.e., s ≥ r or s ≥ t),
because otherwise the output matrix C would be rank deficient and
the problem degenerates.

Figure 3.2: Overview of the distributed matrix multiplication framework. Coded


data are initially stored distributedly at N workers according to data assignment.
Each worker computes the product of the two stored matrices and returns it to the
master. By carefully designing the computation strategy, the master can decode
given the computing results from a subset of workers, without having to wait for the
stragglers (worker 1 in this example).
Specifically, each worker i can store two matrices Ã_i ∈ F_q^{s×(r/m)} and
B̃_i ∈ F_q^{s×(t/n)}, computed based on arbitrary functions of A and B respectively.
Each worker can compute the product C̃_i ≜ Ã_i^⊤ B̃_i, and return it
to the master. The master waits only for the results from a subset of
workers, before proceeding to recover (or compute) the final output C
given these products using certain decoding functions.^1

Problem Formulation
Given the above system model, we formulate the distributed matrix
multiplication problem based on the following terminology: We define
the computation strategy as the 2N functions, denoted by

f = (f0 , f1 , . . . , fN −1 ), g = (g0 , g1 , . . . , gN −1 ), (3.1)

that are used to compute each Ãi and B̃i . Specifically,

Ãi = fi (A), B̃i = gi (B), ∀ i ∈ {0, 1, . . . , N − 1}. (3.2)

For any integer k, we say a computation strategy is k-recoverable if the


master can recover C given the computing results from any k workers.
We define the recovery threshold of a computation strategy, denoted
by k(f , g), as the minimum integer k such that computation strategy
(f , g) is k-recoverable.
Using the above terminology, we define the following concept:

Definition 3.1. For a distributed matrix multiplication problem of
computing A^⊤B using N workers that can each store a 1/m fraction of
A and a 1/n fraction of B, we define the optimum recovery threshold,
denoted by K^*, as the minimum achievable recovery threshold among
all computation strategies, i.e.,

    K^* \triangleq \min_{f,g} k(f, g).   (3.3)

State-of-the-art schemes. There have been two computing schemes


proposed earlier for this problem that leverage ideas from coding theory.
1
Note that we consider the most general model and do not impose any constraints
on the decoding functions. However, any good decoding function should have relatively
low computation complexity.

Figure 3.3: Product code [90] in an example with N = 9 workers that can each
store half of A and half of B.

The first one, referred to as one dimensional MDS code (1D MDS code),
was introduced in [89] and extended in [90]. The 1D MDS code, as
illustrated before in Figure 3.1, injects redundancy in only one of the
input matrices using maximum distance separable (MDS) codes [143].
In general, one can show that the 1D MDS code achieves a recovery
threshold of

    K_{1D-MDS} \triangleq N - \frac{N}{n} + m = \Theta(N).   (3.4)
An alternative computing scheme was recently proposed in [90]
for the case of m = n, referred to as the product code, which instead
injects redundancy in both input matrices. This coding technique has
also been proposed earlier in the context of Fault Tolerant Computing
in [67, 74]. As shown in Figure 3.3, the product code aligns workers in
a √N-by-√N layout. The matrix A is divided along the columns into m
submatrices, encoded using a (√N, m) MDS code into √N coded matrices,
and then assigned to the √N columns of workers. Similarly, √N coded
matrices of B are created and assigned to the √N rows. Given the
property of MDS codes, the master can decode an
entire row after obtaining any m results in that row; likewise for the
columns. Consequently, the master can recover the final output using
a peeling algorithm, iteratively decoding the MDS codes on rows and
columns until the output C is completely available. For example, if
the 5 computing results A|1 B0 , A|1 B1 , (A0 + A1 )| B1 , A|0 (B0 + B1 ), and
A|1 (B0 + B1 ) are received as demonstrated in Figure 3.3, the master can
recover the needed results by computing A_0^⊤B_1 = (A_0 + A_1)^⊤B_1 − A_1^⊤B_1
and then A_0^⊤B_0 = A_0^⊤(B_0 + B_1) − A_0^⊤B_1. In general, one can show that the
product code achieves a recovery threshold of

    K_{product} \triangleq 2(m-1)\sqrt{N} - (m-1)^2 + 1 = \Theta(\sqrt{N}),   (3.5)

which significantly improves over K1D-MDS .

Main Result
Our main result, which demonstrates that the optimum recovery thresh-
old can be far less than what the above two schemes achieve, is stated
in the following theorem:

Theorem 3.1. For a distributed matrix multiplication problem of computing
A^⊤B using N workers that can each store a 1/m fraction of A and
a 1/n fraction of B, the minimum recovery threshold K^* is

    K^* = mn.   (3.6)

Furthermore, there is a computation strategy, referred to as the polyno-


mial code, that achieves the above K ∗ while allowing efficient decoding
at the master node, i.e., with complexity equal to that of polynomial
interpolation given mn points.

We prove Theorem 3.1 in the next subsection, where we first describe


the proposed polynomial code that achieves a recovery threshold Kpoly =
mn, and then develop an information theoretic converse demonstrating
that K ∗ is lower bounded by mn.

Remark 3.1. Compared to the state of the art [89, 90], the polynomial
code provides order-wise improvement in terms of the recovery threshold.
Specifically, the recovery thresholds achieved by the 1D MDS code [89, 90]
and the product code [90] scale linearly with N and √N respectively, while
the proposed polynomial code actually achieves a recovery threshold
that does not scale with N . Furthermore, polynomial code achieves the
optimal recovery threshold.

Remark 3.2. The polynomial code not only improves the state of the
art asymptotically, but also gives strict and significant improvement

Figure 3.4: Comparison of the recovery thresholds achieved by the proposed polynomial
code and the state of the art (1D MDS code [89] and product code [90]),
where each worker can store a 1/10 fraction of each input matrix. The polynomial code
attains the optimum recovery threshold K^*, and significantly improves the state of
the art.

for any parameter values of N , m, and n (see Figure 3.4 for


example).

Remark 3.3. As we will discuss in Subsection 3.1.2, decoding polynomial


code can be mapped to a polynomial interpolation problem, which can
be solved in time almost linear to the input size [80]. This is enabled by
carefully designing the computing strategies at the workers, such that
the computed products form a Reed–Solomon code [133], which can
be decoded efficiently using any polynomial interpolation algorithm or
Reed–Solomon decoding algorithm that provides the best performance
depending on the problem scenario (e.g., [13]).

3.1.2 Polynomial Code and Its Optimality


In this subsection, we formally describe the polynomial code and its
decoding process. We then prove its optimality with an information
theoretic converse, which completes the proof of Theorem 3.1. Finally, we
demonstrate the optimality of polynomial code under other performance
metrics.

Figure 3.5: Example using polynomial code, with N = 5 workers that can each
store half of each input matrix. Computation strategy: each worker i stores A0 + iA1
and B0 + i2 B1 , and computes their product. Decoding: master waits for results from
any 4 workers, and decodes the output using fast polynomial interpolation algorithm.

Motivating Example
We start by demonstrating the key ideas of polynomial code through a
motivating example. Consider a distributed matrix multiplication task
of computing C = A| B using N = 5 workers that can each store half
of the matrices (see Figure 3.5). We evenly divide each input matrix
along the column side into 2 submatrices:

    A = [A_0 \;\; A_1], \quad B = [B_0 \;\; B_1].   (3.7)

Given this notation, we essentially want to compute the following 4
uncoded components:

    C = A^\top B = \begin{bmatrix} A_0^\top B_0 & A_0^\top B_1 \\ A_1^\top B_0 & A_1^\top B_1 \end{bmatrix}.   (3.8)

Now we design a computation strategy to achieve the optimum recovery
threshold of 4. Suppose the elements of A and B are in F_7; let each worker
i ∈ {0, 1, . . . , 4} store the following two coded submatrices:

    \tilde{A}_i = A_0 + iA_1, \quad \tilde{B}_i = B_0 + i^2 B_1.   (3.9)
To prove that this design gives a recovery threshold of 4, we need to
design a valid decoding function for any subset of 4 workers. Without
loss of generality, we assume that the master receives the computation


results from workers 1, 2, 3, and 4, as shown in Figure 3.5.
According to the designed computation strategy, we have

    \begin{bmatrix} \tilde{C}_1 \\ \tilde{C}_2 \\ \tilde{C}_3 \\ \tilde{C}_4 \end{bmatrix}
    =
    \begin{bmatrix}
    1^0 & 1^1 & 1^2 & 1^3 \\
    2^0 & 2^1 & 2^2 & 2^3 \\
    3^0 & 3^1 & 3^2 & 3^3 \\
    4^0 & 4^1 & 4^2 & 4^3
    \end{bmatrix}
    \begin{bmatrix} A_0^\top B_0 \\ A_1^\top B_0 \\ A_0^\top B_1 \\ A_1^\top B_1 \end{bmatrix}.   (3.10)
The coefficient matrix in the above equation is Vandermonde, and hence
invertible since its parameters 1, 2, 3, 4 are distinct in F7 . So one way
to recover C is to directly invert Equation (3.10). However, directly
computing this inverse using the classical inversion algorithm might
be expensive in more general cases. Quite interestingly, the decoding
process can also be viewed as a polynomial interpolation problem (or
equivalently, decoding a Reed–Solomon code subject to erasures).
Specifically, in this example each worker i returns

    \tilde{C}_i = \tilde{A}_i^\top \tilde{B}_i = A_0^\top B_0 + i A_1^\top B_0 + i^2 A_0^\top B_1 + i^3 A_1^\top B_1,   (3.11)

which is essentially the value of the following polynomial at point x = i:

    h(x) \triangleq A_0^\top B_0 + x A_1^\top B_0 + x^2 A_0^\top B_1 + x^3 A_1^\top B_1.   (3.12)

Hence, recovering C using computation results from 4 workers is equiv-


alent to interpolating a degree-3 polynomial given its values at 4 points,
and we will later show that this can be performed with almost-linear
complexity.
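A short numpy sketch of this example follows; it works over the reals rather than F_7 (with placeholder matrix sizes), and recovers the four blocks of C by solving the Vandermonde system (3.10), i.e., by interpolating the cubic polynomial h(x) from the results of workers 1 through 4.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 4))
B = rng.standard_normal((8, 4))
A0, A1 = A[:, :2], A[:, 2:]
B0, B1 = B[:, :2], B[:, 2:]

# Worker i stores A0 + i*A1 and B0 + i^2*B1 and returns their product, cf. (3.9).
def worker(i):
    return (A0 + i * A1).T @ (B0 + i**2 * B1)

workers = [1, 2, 3, 4]                                   # worker 0 straggles
results = np.stack([worker(i) for i in workers])         # shape (4, 2, 2)

# Each result is h(i) with h(x) = A0'B0 + x A1'B0 + x^2 A0'B1 + x^3 A1'B1 (3.12);
# solving the Vandermonde system recovers the four coefficient blocks.
V = np.vander(np.array(workers, dtype=float), 4, increasing=True)
coeffs = np.linalg.solve(V, results.reshape(4, -1)).reshape(4, 2, 2)
A0B0, A1B0, A0B1, A1B1 = coeffs

C = np.block([[A0B0, A0B1], [A1B0, A1B1]])
assert np.allclose(C, A.T @ B)
print("decoded C = A^T B from 4 of the 5 workers")
```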

General Polynomial Code


Now we proceed to present the polynomial code in a general setting
that achieves the optimum recovery threshold stated in Theorem 3.1
for any parameter values of N , m, and n. First of all, we evenly divide
each input matrix along the column side into m and n submatrices
respectively, i.e.,

    A = [A_0 \; A_1 \; \ldots \; A_{m-1}], \quad B = [B_0 \; B_1 \; \ldots \; B_{n-1}].   (3.13)



We then assign each worker i ∈ {0, 1, . . . , N − 1} a distinct number in


Fq , denoted by xi . Under this setting, we define the following class of
computation strategies.

Definition 3.2. Given parameters α, β ∈ ℕ, we define the (α, β)-polynomial code as

    \tilde{A}_i = \sum_{j=0}^{m-1} A_j x_i^{j\alpha}, \quad \tilde{B}_i = \sum_{j=0}^{n-1} B_j x_i^{j\beta}, \quad \forall\, i \in \{0, 1, \ldots, N-1\}.   (3.14)

In an (α, β)-polynomial code, each worker i essentially computes

    \tilde{C}_i = \tilde{A}_i^\top \tilde{B}_i = \sum_{j=0}^{m-1} \sum_{k=0}^{n-1} A_j^\top B_k \, x_i^{j\alpha + k\beta}.   (3.15)

In order for the master to recover the output given any mn results (i.e.,
achieve the optimum recovery threshold), we carefully select the design
parameters α and β, while making sure that no two terms in the above
formula have the same exponent of x. One such choice is (α, β) = (1, m),
i.e.,

    \tilde{A}_i = \sum_{j=0}^{m-1} A_j x_i^{j}, \quad \tilde{B}_i = \sum_{j=0}^{n-1} B_j x_i^{jm}.   (3.16)

Hence, each worker essentially computes the value of the following
degree mn − 1 polynomial at point x = x_i:

    h(x) \triangleq \sum_{j=0}^{m-1} \sum_{k=0}^{n-1} A_j^\top B_k \, x^{j+km},   (3.17)

where the coefficients are exactly the mn uncoded components of C.


Since all xi ’s are selected to be distinct, recovering C given results
from any mn workers is essentially interpolating h(x) using mn distinct
points. Since h(x) has degree mn − 1, the output C can always be
uniquely decoded.
In terms of complexity, this decoding process can be viewed as
interpolating degree mn − 1 polynomials of F_q for rt/(mn) times. It is well
known that polynomial interpolation of degree k has a complexity of
O(k log^2 k log log k) [80]. Therefore, decoding the polynomial code only
requires a complexity of O(rt log^2(mn) log log(mn)). Furthermore, this
complexity can be reduced by simply swapping in any faster polynomial
interpolation algorithm or Reed–Solomon decoding algorithm.
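The general construction can be sketched in a few lines of numpy (real-valued placeholder data; the evaluation points are picked in [−1, 1] purely to keep the floating-point interpolation well behaved, and the decoder below uses a plain Vandermonde solve rather than a fast interpolation routine).

```python
import numpy as np

def polynomial_code_demo(m=2, n=3, N=10, s=12, r=6, t=6, seed=0):
    """Sketch of the (1, m)-polynomial code of (3.16): any mn of the N workers
    suffice to decode C = A^T B, independently of N."""
    rng = np.random.default_rng(seed)
    A, B = rng.standard_normal((s, r)), rng.standard_normal((s, t))
    As, Bs = np.split(A, m, axis=1), np.split(B, n, axis=1)
    xs = np.linspace(-1.0, 1.0, N)            # distinct evaluation points x_i

    def worker(i):                            # computes h(x_i), cf. (3.15)-(3.17)
        Ai = sum(Aj * xs[i] ** j for j, Aj in enumerate(As))
        Bi = sum(Bj * xs[i] ** (j * m) for j, Bj in enumerate(Bs))
        return Ai.T @ Bi

    fastest = rng.choice(N, size=m * n, replace=False)   # any mn workers suffice
    results = np.stack([worker(i) for i in fastest]).reshape(m * n, -1)

    # Interpolate the degree mn-1 polynomial h(x) entry by entry; the coefficient
    # of x^(j+km) is the uncoded block A_j^T B_k.
    V = np.vander(xs[fastest], m * n, increasing=True)
    coeffs = np.linalg.solve(V, results)
    blocks = [[coeffs[j + k * m].reshape(r // m, t // n) for k in range(n)]
              for j in range(m)]
    assert np.allclose(np.block(blocks), A.T @ B)
    return "decoded from any %d of the %d workers" % (m * n, N)

print(polynomial_code_demo())
```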

Remark 3.4. We can naturally extend polynomial code to the scenario


where input matrix elements are real or complex numbers. In practical
implementation, to avoid handling large elements in the coefficient
matrix, we can first quantize input values into numbers of finite digits,
embed them into a finite field that covers the range of possible values of
the output matrix elements, and then directly apply polynomial code.
By embedding into finite fields, we avoid large intermediate computing
results, which effectively saves storage and computation time, and
reduces numerical errors.

Optimality of Polynomial Code for Recovery Threshold


So far we have constructed a computing scheme that achieves a recovery
threshold of mn, which upper bounds K ∗ . To complete the proof of
Theorem 3.1, here we establish a matching lower bound through an
information theoretic converse.
We need to prove that for any computation strategy, the master
needs to wait for at least mn workers in order to recover the output.
Recall that at least one of A and B is a tall matrix. Without loss of
generality, assume A is tall (i.e., s ≥ r). Let A be an arbitrary fixed
full-rank matrix and let B be sampled uniformly at random from F_q^{s×t}. It
is easy to show that C = A^⊤B is uniformly distributed on F_q^{r×t}. This
means that the master essentially needs to recover a random variable
with entropy H(C) = rt log_2 q bits. Note that each worker returns
rt/(mn) elements of F_q, providing at most (rt/(mn)) log_2 q bits of information.

Consequently, using a cut-set bound around the master, we can show


that at least mn results from the workers need to be collected, and thus
we have K ∗ ≥ mn.

Remark 3.5 (Random linear code). We conclude this subsection by


noting that, another computation design is to let each worker store two
random linear combinations of the input submatrices. Although this
design can achieve the optimal recovery threshold with high probability,
it creates a large coding overhead and requires high decoding complexity


(e.g., O(m^3 n^3 + mnrt) using the classical inversion decoding algorithm).
Compared to random linear code, the proposed polynomial code achieves
the optimum recovery threshold deterministically, with a significantly
lower decoding complexity.

Optimality of Polynomial Code for Other Performance Metrics


In the previous subsection, we proved that polynomial code is optimal in
terms of the recovery threshold. As a by-product, we can prove that it is
also optimal in terms of some other performance metrics. In particular,
we consider the following three metrics considered in prior works, and
establish the optimality of polynomial code for each of them.
Computation latency is considered in models where the computation
time Ti of each worker i is a random variable with a certain probability
distribution (e.g., [89, 90]). The computation latency is defined as the
amount of time required for the master to collect enough information
to decode C.

Theorem 3.2. For any computation strategy, the computation latency


T is always no less than the latency achieved by polynomial code,
denoted by Tpoly . Namely,

T ≥ Tpoly . (3.18)

Proof sketch. We know from the converse proof of Theorem 3.1 that,
using an arbitrary computation strategy, in order for the master to
recover the output matrix C at time T, it has to receive the computation
results from at least mn workers. However, using the polynomial code,
the matrix C can be recovered as soon as mn workers return their
results. Therefore, we have T ≥ Tpoly .

Probability of failure given a deadline is defined as the probability


that the master does not receive enough information to decode C at
any time t [48].

Corollary 3.3. For any computation strategy, let T denote its computa-
tion latency, and let Tpoly denote the computation latency of polynomial
code. We have

P(T > t) ≥ P(Tpoly > t) ∀ t ≥ 0. (3.19)

Corollary 3.3 directly follows from Theorem 3.2 since (3.18) implies
(3.19).
Communication load is another important metric in distributed
computing (e.g., [93, 97, 166]), defined as the minimum number of bits
needed to be communicated in order to complete the computation.

Theorem 3.4. Polynomial code achieves the minimum communication


load for distributed matrix multiplication, which is given by

L∗ = rt log2 q. (3.20)

Proof. Recall that in the converse proof of Theorem 3.1, we have shown
that if the input matrices are sampled based on a certain distribution,
then decoding the output C requires that the entropy of the entire
message received by the master is at least rt log_2 q bits. Consequently, it
takes at least rt log_2 q bits to deliver such messages, which lower bounds
the minimum communication load.
On the other hand, the polynomial code requires delivering rt ele-
ments in Fq in total, which achieves this minimum communication load.
Hence, the minimum communication load L∗ equals rt log2 q.

Remark 3.6. While polynomial codes provide the optimal design, with
respect to the above metrics, for straggler mitigation in distributed
matrix multiplication, one can also consider other metrics and variations
of the problem setting for which the problem is still not completely
solved. One variation is “approximate distributed matrix multiplication”,
which has been studied in [59, 70]. Another variation is coded computing
in heterogeneous and dynamic network settings, which has been studied
in [54, 109, 115, 130, 131, 161].

3.2 Optimal Coding for Polynomial Evaluations

In this subsection, we go beyond matrix algebra to study the impact of


coding on minimizing the recovery threshold in distributed computation
of arbitrary multivariate polynomials.

3.2.1 Problem Formulation and Motivating Examples


We consider a problem of evaluating a function f over a dataset X =
(X_1, . . . , X_K). In particular, each X_i is an element in a vector space V
over a field F, and the goal is to compute Y_1 ≜ f(X_1), . . . , Y_K ≜ f(X_K)
given the function f: V → U, where U is a vector space over the same
field F. The function f can be any multivariate polynomial with vector
coefficients, and we define the degree of the chosen f, denoted by deg f,
as the total degree of the polynomial.^2
The computation is carried out in a distributed system consisting
of a master and N workers. Each worker has already stored a
fraction of the dataset prior to the computation, in a possibly coded
manner. Specifically, for each i ∈ [N] ≜ {1, . . . , N}, worker i stores
X̃_i ≜ g_i(X_1, . . . , X_K), where g_i: V^K → V is the encoding function of
worker i. We focus on the class of linear encoding strategies, meaning
that each X̃_i is a linear combination of X_1, . . . , X_K. This class
of encoding designs guarantees low encoding complexity and simple
implementation.

During the computation, each worker i computes Ỹ_i ≜ f(X̃_i), and
returns the result back to the master upon its completion. The master
only waits for a fastest subset of workers, until all the final outputs
Y_1, . . . , Y_K can be decoded from the available results by computing
their linear combinations.^3 Similarly as before, we define the recovery
threshold as the minimum number of responses that guarantees the
completion of the computation task.
This computation model suits the common scenario where a func-
tion of interest is of the form F (X1 , . . . , Xk ) = g(f (X1 ), . . . , f (Xk )),
where f is a “hard to compute” function and g is an “easy to compute”
one. This is in accordance with common distributed computing tasks

2
The total degree of a polynomial f is the maximum among all the total degrees of
its monomials. In the case where F is finite, we resort to the canonical representation
of polynomials, in which the individual degrees within each term is no more than
(|F| − 1).
3
Note that if the number of workers is too small, obviously no valid computation
design exists unless f is a constant. Hence, in the rest of this subsection we focus on
meaningful cases where N is large enough such that there is a valid computation
design for at least one non-trivial function f (i.e., N ≥ K).
like matrix multiplication, the MapReduce algorithm, and gradient


computation.
Based on this setting, the coded computing problem is then formu-
lated as designing the optimal encoding of the dataset (i.e., designing
gi ’s) over which workers carry out their computations, in order to achieve
the minimum recovery threshold, denoted by K ∗ .
The above computation framework encapsulates many computation
tasks of interest, which we highlight in the following examples.
Linear computation. Consider the computation scenario of matrix-vector
multiplication A\vec{b}, for some dataset A = {A_i}_{i=1}^K and some
vector \vec{b}. This scenario naturally arises in many machine learning
algorithms, such as each iteration of linear regression. Our formulation
covers this setting by letting V be the space of matrices of certain
dimensions over F, U be the space of vectors of a certain length over F,
X_i be A_i, and f(X_i) = X_i · \vec{b} for all i ∈ [K]. Coded computing for such
linear computations has also been studied in [47, 78, 89, 157].
Bilinear computation. Another computation task of interest is to
evaluate the element-wise products {A_i · B_i}_{i=1}^K given two lists of matrices
{A_i}_{i=1}^K and {B_i}_{i=1}^K. This computation is the key building block for

various algorithms, such as fast matrix multiplication in distributed


systems [52, 167, 169]. Our formulation covers this setting by letting V
be the space of pairs of two matrices of certain dimensions, U be the
space of matrices of dimension which equals that of the product of the
pairs of matrices, Xi = (Ai , Bi ), and f (Xi ) = Ai · Bi for all i ∈ [K].
General Tensor algebra. Beyond bilinear operations, distributed
computations of multivariate polynomials of larger degree, such as
general tensor algebraic functions (i.e., functions composed of inner
products, outer products, and tensor contractions) [132], also arise in
practice. A specific example is to compute the coordinate transformation
of a third-order tensor field at K locations, where given a list
of matrices {Q^{(i)}}_{i=1}^K and a list of third-order tensors {T^{(i)}}_{i=1}^K with
matching dimension on each index, the goal is to compute another
list of tensors, denoted by {T'^{(i)}}_{i=1}^K, of which each entry is defined as

    T'^{(i)}_{j'k'\ell'} \triangleq \sum_{j,k,\ell} T^{(i)}_{jk\ell} Q^{(i)}_{jj'} Q^{(i)}_{kk'} Q^{(i)}_{\ell\ell'}.

Our formulation covers all functions within this class by letting V be the
space of input tensors, U be the space of output tensors, X_i be the inputs,
and f be the tensor function.
Gradient computation. Another general class of functions arises from
gradient descent algorithms and their variants, which are the workhorse
of today's learning tasks. The computation task for this class of functions
is to consider one iteration of the gradient descent algorithm, and to
evaluate the gradient of the empirical risk ∇L_S(h) ≜ avg_{z∈S} ∇ℓ_h(z), given
a hypothesis h: R^d → R, a respective loss function ℓ_h: R^{d+1} → R, and
a training set S ⊆ R^{d+1}, where d is the number of features. In practice,
this computation is carried out by partitioning S into K subsets {S_i}_{i=1}^K
of equal sizes, evaluating the partial gradients {∇L_{S_i}(h)}_{i=1}^K distributedly,
and computing the final result using ∇L_S(h) = avg_{i∈[K]} ∇L_{S_i}(h).

3.2.2 Main Results and Comparison with Prior Works


We characterize the minimum possible recovery threshold for the above
distributed computing problem in the following theorem.

Theorem 3.5. For the above described problem of distributedly evaluating
a multivariate polynomial f: V → U on a dataset of K inputs by
using N workers, the minimum recovery threshold is given by

    K^* = (K - 1)\deg f + 1

when N ≥ K deg f − 1, and K^* = N − ⌊N/K⌋ + 1 otherwise.


We propose a coded computing scheme, named “Lagrange Coded
Computing” to achieve this minimum value.

To prove Theorem 3.5, we present the proposed Lagrange Coded


Computing (LCC) scheme and characterize its recovery threshold in the
next subsection. Moreover, we complete the proof by demonstrating the
optimality of Lagrange Coded Computing through a matching converse
in Subsection 3.2.4.

Remark 3.7. LCC generalizes several previously studied scenarios. For


example, having V = U and f the identity function reduces to the
well-studied case of distributed storage, in which Theorem 3.5 is well-


known (e.g., the Singleton bound [133, Theorem 4.1]). Further, as
previously mentioned, f can correspond to matrix-vector and matrix–
matrix multiplication, in which the special cases of Theorem 3.5 are
known as well [89, 169]. However, LCC substantially generalizes the
state of the arts to any computation that can be represented as an
arbitrary multivariate polynomial of the input dataset, including many
computation scenarios of interest in machine learning.

Remark 3.8. The key idea of LCC is to encode the input dataset
using the well-known Lagrange polynomial. In particular, the encoding
functions (i.e., gi ’s) amount to evaluations of a Lagrange polynomial
of degree K − 1 at N distinct points. Hence, the computations at the
workers amount to evaluations of a composition of that polynomial with
the desired function f . Therefore, K ∗ may simply be seen as the number
of evaluations that are necessary and sufficient in order to interpolate
the composed polynomial, that is later evaluated at certain point to
finalize the computation.

Remark 3.9. LCC has a number of additional properties of interest.


First, the proposed encoding is identical for all multivariate polynomials,
which allows pre-encoding of the data without knowing the identity
of the computing task. In other words, data encoding of LCC can
be universally used for any polynomial computation. This is in stark
contrast to previous task specific coding techniques in the literature.
Furthermore, workers apply the same computation as if no coding
took place; a feature that reduces computational costs, and prevents
ordinary servers from carrying the burden of outliers. Second, decoding
and encoding build upon polynomial interpolation and evaluation, and
hence efficient off-the-shelf subroutines can be used.

3.2.3 Lagrange Coded Computing


Illustrative Example
Consider a problem of evaluating the quadratic function f(X_i) =
X_i^⊤(X_i w − y), where the input X_i's are real matrices with certain
dimensions, and w, y are constant vectors with matching lengths. This

function naturally appears in gradient computing problems, given that


each f (Xi ) is the gradient of a commonly used quadratic loss function
(Xi w − y)2 (with respect to w).
We demonstrate Lagrange Coded Computing (LCC) in the scenario
where the input data X is partitioned into K = 2 batches, and the
computing system has N = 5 workers. Note that the conventional
uncoded repetition design only achieves a recovery threshold of 4. This
is since in uncoded repetition one essentially must have X̃1 = X̃2 = X1
and X̃3 = X̃4 = X̃5 = X2 , and clearly, the computation cannot be
completed from the results of workers 3, 4, and 5. However, optimal
recovery threshold of 3 is attainable by LCC.
As mentioned earlier, the main idea of LCC is to encode data using
a Lagrange polynomial u. To this end, let

    u(z) \triangleq X_1 \cdot \frac{z-2}{1-2} + X_2 \cdot \frac{z-1}{2-1} = z(X_2 - X_1) + 2X_1 - X_2,

and observe that u(1) = X_1 and u(2) = X_2. Then, node i stores u(i), i.e.,

    (\tilde{X}_1, \ldots, \tilde{X}_5) = (X_1, X_2) \cdot \begin{pmatrix} 1 & 0 & -1 & -2 & -3 \\ 0 & 1 & 2 & 3 & 4 \end{pmatrix}.

Note that when applying f over its stored data, each worker essentially
evaluates a linear combination of 6 possible terms: four quadratic
X_i^⊤X_j w and two linear X_i^⊤y. However, the master only wants two
specific linear combinations of them: X_1^⊤(X_1 w − y) and X_2^⊤(X_2 w − y).
Interestingly, LCC optimally aligns the computation of the workers in
the sense that the linear combinations returned by the workers belong
to a subspace of only 3 dimensions, which can be recovered from the
computing results of any 3 workers, while containing the two needed
linear combinations.
More specifically, each worker i evaluates the polynomial

    f(u(z)) = \big(z(X_2 - X_1) + 2X_1 - X_2\big)^\top \big( (z(X_2 - X_1) + 2X_1 - X_2)\, w - y \big)

at z = i. Since f (u(z)) is a quadratic polynomial, it can be determined


given the computation results from any three nodes. Furthermore,
after decoding the polynomial f (u(z)), the master can obtain f (X1 )
and f (X2 ) by evaluating it at z = 1 and z = 2.
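The following numpy sketch mirrors this example with placeholder dimensions: the two data blocks are encoded with u(z), each of the 5 workers evaluates f on its coded block, and the master recovers f(X_1) and f(X_2) from any 3 returned values by interpolating the quadratic polynomial f(u(z)) (here, from workers 3, 4, and 5, which the uncoded repetition scheme could not use).

```python
import numpy as np

rng = np.random.default_rng(2)
d, p = 6, 3                                    # placeholder dimensions
X1, X2 = rng.standard_normal((2, d, p))
w, y = rng.standard_normal(p), rng.standard_normal(d)

def f(X):                                      # quadratic map f(X) = X^T (Xw - y)
    return X.T @ (X @ w - y)

def u(z):                                      # Lagrange polynomial: u(1)=X1, u(2)=X2
    return z * (X2 - X1) + 2 * X1 - X2

results = {i: f(u(i)) for i in range(1, 6)}    # worker i stores u(i), returns f(u(i))

# Master interpolates the degree-2 (vector-valued) polynomial f(u(z)) from the
# results of workers 3, 4, 5 only, then evaluates it at z = 1 and z = 2.
zs = np.array([3.0, 4.0, 5.0])
V = np.vander(zs, 3, increasing=True)
c = np.linalg.solve(V, np.stack([results[int(z)] for z in zs]))
f_of_u = lambda z: c[0] + c[1] * z + c[2] * z**2

assert np.allclose(f_of_u(1.0), f(X1)) and np.allclose(f_of_u(2.0), f(X2))
print("recovered f(X1) and f(X2) from workers 3, 4, 5 alone")
```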

General Description
When the number of workers is small (i.e., N < K deg f − 1), the optimum
recovery threshold K^* = N − ⌊N/K⌋ + 1 can be easily achieved
by an uncoded repetition design: by replicating every X_i between
⌊N/K⌋ and ⌈N/K⌉ times, it is readily verified that every set
of N − ⌊N/K⌋ + 1 computation results contains at least one copy of f(X_i)
for every i. Hence, we focus on the case where N ≥ K deg f − 1.

First, we select any K distinct elements β_1, . . . , β_K from F, and
find a polynomial u: F → V of degree K − 1 such that u(β_i) = X_i for
any i ∈ [K] = {1, . . . , K}. This is simply accomplished by letting u be
the respective Lagrange interpolation polynomial

    u(z) \triangleq \sum_{j \in [K]} X_j \cdot \prod_{k \in [K] \setminus \{j\}} \frac{z - \beta_k}{\beta_j - \beta_k}.

We then select N distinct elements α_1, . . . , α_N from F, and encode the
input variables by letting X̃_i = u(α_i) for any i ∈ [N]. That is,

    \tilde{X}_i = g_i(X) = u(\alpha_i) \triangleq \sum_{j=1}^{K} X_j \cdot \prod_{k \in [K] \setminus \{j\}} \frac{\alpha_i - \beta_k}{\beta_j - \beta_k}.   (3.21)

When each worker i computes Ỹi = f (X̃i ), it is essentially evaluating


the composition of the two polynomials f and u at point αi (i.e.,
f (u(αi ))). Note that the composition f (u(z)) is also a polynomial,
whose degree is (K − 1) deg f . Hence, any (K − 1) deg f + 1 workers
return the evaluations of this polynomial at (K − 1) deg f + 1 points,
and thus it is recoverable.
Finally, the master aims to recover f (u(βi )) = f (Xi ) for all i ∈ [K],
which is possible given that f (u(z)) is determined.
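A minimal sketch of the encoder in (3.21) is given below, for data blocks represented as real-valued numpy arrays; the interpolation points β_j = j and the evaluation points α_i = K + i are placeholder choices of distinct elements, and the function name is mine rather than the monograph's.

```python
import numpy as np

def lcc_encode(X_blocks, N, betas=None, alphas=None):
    """Sketch of the LCC encoder (3.21): X~_i = u(alpha_i), where u is the degree
    K-1 Lagrange polynomial with u(beta_j) = X_j."""
    K = len(X_blocks)
    betas = np.arange(1.0, K + 1) if betas is None else betas
    alphas = np.arange(K + 1.0, K + N + 1) if alphas is None else alphas
    coded = []
    for a in alphas:
        # Lagrange weights: prod_{k != j} (a - beta_k) / (beta_j - beta_k).
        weights = [np.prod([(a - betas[k]) / (betas[j] - betas[k])
                            for k in range(K) if k != j]) for j in range(K)]
        coded.append(sum(wj * Xj for wj, Xj in zip(weights, X_blocks)))
    return coded

rng = np.random.default_rng(3)
X_blocks = [rng.standard_normal((5, 2)) for _ in range(4)]   # K = 4 blocks
coded = lcc_encode(X_blocks, N=7)                            # N = 7 workers
# Evaluating u at the beta's returns the original blocks (systematic property).
assert all(np.allclose(c, X) for c, X in
           zip(lcc_encode(X_blocks, N=4, alphas=np.arange(1.0, 5)), X_blocks))
print(len(coded), "coded blocks of shape", coded[0].shape)
```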
Remark 3.10. Note that by choosing {β_i}_{i=1}^K = {α_i}_{i=1}^K, the first K
workers exactly compute the K required results respectively. This provides
a systematic coding design in the sense that the first K workers
are the systematic nodes and the rest of the N − K workers are parity
nodes.
Remark 3.11. In our construction, the only restriction imposed on the
underlying field is that we need to be able to select N distinct elements
{αi }i∈[N ] . Hence, LCC can be applied over any infinite field or any finite
field with at least N elements.

Remark 3.12. In terms of encoding and decoding complexities, the


decoding process of LCC is essentially computing the Lagrange in-
terpolation for the polynomial f ◦ u at K points. This interpolation
can be efficiently computed with an almost linear complexity (i.e.,
O(K ∗ log2 K ∗ log log K ∗ ) linear operations in U), using fast polynomial
arithmetic algorithms [80]. Similar to the polynomial code, this de-
coding complexity can be reduced by simply swapping in any faster
interpolation algorithm or Reed–Solomon decoding algorithm.

3.2.4 Optimality of Lagrange Coded Computing


While the analysis of the proposed Lagrange Coded Computing scheme
provides an upper bound on the minimum recovery threshold K ∗ , we
now complete the proof of Theorem 3.5 by establishing a matching
lower bound of K ∗ for any polynomial function f : V → U.
The proof consists of two steps. In Step 1, we prove the converse
for the special case where f is a multilinear function (i.e., f is linear in
each variable, where the rest of the variables are fixed). Then in Step 2,
we generalize this result to arbitrary polynomial functions, by proving
that for any function f , there exists a multilinear function with the
same degree and a recovery threshold no greater than K ∗ .
For Step 1, we consider the scenario where (1) the domain V
of the function f is of the form V = W^d for some vector space
W and some d ∈ ℕ^+, and (2) f is a non-zero function of input
X_i = (X_{i,1}, X_{i,2}, . . . , X_{i,d}) ∈ W^d, and is multilinear with respect to
the elements X_{i,1}, X_{i,2}, . . . , X_{i,d}. In this scenario, we develop a lower
bound on the minimum recovery threshold as stated in the following
lemma.

Lemma 3.6. For any multilinear function f of degree deg f ∈ ℕ^+, its
minimum recovery threshold is lower bounded by (K − 1) deg f + 1
when N ≥ K deg f − 1, and lower bounded by N − ⌊N/K⌋ + 1 when
N < K deg f − 1. Moreover, this recovery threshold cannot be further
reduced even if we allow arbitrary decoding functions.

We present the proof of Lemma 3.6 in Appendix A. The main idea


is to show that for any computing strategy that tries to operate at a
recovery threshold smaller than the lower bound stated in Lemma 3.6,
there would be scenarios where all available computing results are
degenerated (i.e., constants), while the computing results needed by the
master are variable, thus violating the decodability requirement.
Next in Step 2, we prove the matching converse for any polynomial
function. Given any function f with degree d, we first construct a
non-zero, multilinear function f′ with the same degree. Then we let
K_f^*(K, N) denote the minimum recovery threshold for function f, and
prove K_f^*(K, N) ≥ K_{f′}^*(K, N), by constructing a computation design
of f′ that is based on a computation design of f and achieves the same
recovery threshold. The construction and its properties are stated in
the following lemma, whose proof can be found in [170, Appendix E].

Lemma 3.7. Given any function f of degree d, let f′ be a map from
V^d → U such that

    f'(Z_1, \ldots, Z_d) = \sum_{S \subseteq [d]} (-1)^{|S|} f\Big(\sum_{j \in S} Z_j\Big)

for any {Z_j}_{j∈[d]} ∈ V^d. Then f′ is multilinear with respect to the d inputs.
Moreover, if the characteristic of the base field F is 0 or greater than d,
then f′ is non-zero.

Given Lemma 3.7, it suffices to prove that f′ cannot have a greater
recovery threshold than f, i.e., K_f^*(K, N) ≥ K_{f′}^*(K, N) for any K
and N. We prove this fact by constructing computing schemes for f′
given any design for f, which achieve the same recovery threshold.

Note that f′ is defined as a linear combination of functions
f(\sum_{j∈S} Z_j), each of which is a composition of a linear map and f.
Given the linearity assumption of the encoding design, any computation
scheme of f can be directly applied to any of these functions, achieving
the same recovery threshold. Since the decoding functions are linear, the
same scheme also applies to linear combinations of them, which includes
f′. Hence, the minimum recovery threshold of f′ is upper bounded by
the recovery threshold of any computing design of f, which indicates
K_f^*(K, N) ≥ K_{f′}^*(K, N).
To conclude, using the matching converse we proved in Lemma 3.6
for multilinear functions, we showed that the same converse holds in
general. This completes the proof of Theorem 3.5.

3.2.5 Application of LCC to Accelerate Least-Squares Regression


We demonstrate a practical application of LCC in accelerating dis-
tributed least-squares linear regression, whose gradient computation is
a quadratic function of the input dataset, hence matching well the LCC
framework. We also experimentally demonstrate its performance gain
over state-of-the-art straggler mitigation schemes via experiments on
AWS EC2 clusters.

Distributed Gradient Descent for Regression Problems


We focus on linear regression problems with a least-squares objective.
Given a training dataset consisting of m feature inputs x_i ∈ R^d and
labels y_i ∈ R, we wish to find the coefficients w ∈ R^d of a linear function
x ↦ ⟨x, w⟩ that best fits this training data. Minimizing the empirical
risk leads to the following optimization problem:

    \min_{w \in \mathbb{R}^d} L(w) = \frac{1}{m} \sum_{i=1}^{m} (x_i^\top w - y_i)^2 = \frac{1}{m} \|Xw - y\|^2.   (3.22)

Here, X = [x_1 x_2 . . . x_m]^⊤ ∈ R^{m×d} is the feature matrix and y =
[y_1 y_2 . . . y_m]^⊤ ∈ R^m is the output vector obtained by concatenating the
input features and output labels, respectively.
input features and output labels, respectively.
Many nonlinear regression problems can also be written in the form
above. In particular, consider the problem of finding the best function
h belonging to a hypothesis class H that fits the training data

    \min_{h \in \mathcal{H}} L(h) = \frac{1}{m} \sum_{i=1}^{m} (h(x_i) - y_i)^2.   (3.23)

Such nonlinear regression problems can often be cast in the form (3.22),
and be solved efficiently using the so-called kernelization trick [137].
However, for simplicity of exposition we focus on the simpler instance
(3.22).
A popular approach to solve the above problem is via gradient
descent (GD). In particular, GD iteratively refines the weight vector
w by moving along the negative gradient direction via the following

Figure 3.6: An illustration of a master/worker architecture for data-parallel dis-


tributed linear regression.

updates

    w^{(t+1)} = w^{(t)} - \eta^{(t)} \nabla L(w^{(t)}) = w^{(t)} - \eta^{(t)} \frac{2}{m} X^\top (Xw^{(t)} - y).   (3.24)

Here, η^{(t)} is the learning rate in the t-th iteration.
When the size of the training data is too large to store/process on
a single machine, the GD updates can be calculated in a distributed
fashion over many computing nodes. As illustrated in Figure 3.6, we
consider a computing architecture that consists of a master node and n
worker nodes. Using a naive data-parallel distributed regression scheme,
we first partition the input data matrix X into n equal-sized sub-matrices
such that X = [X_0 X_1 . . . X_{n−1}]^⊤, where each sub-matrix X_j ∈
R^{d×(m/n)} contains m/n input data points, and is stored on worker j. Within
each iteration of the GD procedure, the master broadcasts the current
weight vector w to all the workers. Upon receiving w, each worker j
computes X_j X_j^⊤ w, and returns it to the master. The master waits
for the results from all workers and sums them up to obtain the full
gradient

    X^\top X w = \sum_{j=0}^{n-1} X_j X_j^\top w.   (3.25)

Then, the master uses this gradient to update the weight vector
via (3.24).^4

^4 Since the value of X^⊤y does not vary across iterations, it only needs to be
computed once. We assume that it is available at the master for weight updates.
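The naive scheme just described amounts to a few lines of numpy (placeholder data, a single iteration); each "worker" computes X_j X_j^⊤ w, and the master sums the partial results as in (3.25) and applies the update (3.24).

```python
import numpy as np

rng = np.random.default_rng(4)
m, d, n, eta = 12, 4, 3, 0.1                 # samples, features, workers, step size
X, y = rng.standard_normal((m, d)), rng.standard_normal(m)
w = np.zeros(d)

# Worker j stores X_j (shape d x m/n), the transpose of its block of rows of X.
X_parts = [Xj.T for Xj in np.split(X, n)]
Xty = X.T @ y                                # precomputed once at the master

# One iteration: workers return X_j X_j^T w; the master sums them, cf. (3.25).
partial = [Xj @ (Xj.T @ w) for Xj in X_parts]
grad = (2 / m) * (sum(partial) - Xty)
assert np.allclose(grad, (2 / m) * X.T @ (X @ w - y))   # matches the gradient in (3.24)
w = w - eta * grad
print("updated weights:", w)
```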

Coded Computation Schemes and Their Recovery Thresholds


The above naive uncoded scheme requires the master to wait for results
from all the workers. Therefore, even a single straggler can significantly
delay the iteration. One way to combat stragglers is through redundant
data storage/processing. For example, each worker, instead of 1, stores
and processes 1 < r ≤ n sub-matrices. Then, we can partition the
n sub-matrices into n/r batches of size r, and repeatedly store each
batch on r workers. Utilizing this storage/computation redundancy,
in the worst case, the master needs the results returned from the
fastest n − r + 1 workers to compute the final gradient. In general,
for a given storage/computation load, we can design optimal coding
techniques to minimize the number of workers the master needs to wait
for before recovering the gradient. Motivated by this idea, we consider a
general distributed regression framework with an input feature matrix
X = [X1 . . . Xn ]> and n workers. Each worker j stores r (potentially
coded) sub-matrices locally. In each iteration, each worker performs local
computation utilizing the received weight vector w and the locally stored
data. The master waits for the results from a subset N ⊆ [n] of workers,
and uses them to compute the gradient in (3.25). For this framework, a
coded computation scheme consists of the following elements.
• Computation/storage parameter. We characterize the computation/storage
  load at each worker via a parameter r ∈ [n]. Specifically, each worker
  stores some data generated from the feature matrix X whose size is an
  r/n-fraction of the size of X.

• Encoding functions. We encode the data stored at the workers via a set
  of n encoding functions ρ = (ρ_1, . . . , ρ_n), where ρ_j is the encoding
  function of worker j. Each ρ_j maps the input data X into r coded
  sub-matrices X̃_{j,1}, . . . , X̃_{j,r} ∈ R^{d×(m/n)} which are locally stored
  at worker j. In particular, each X̃_{j,k} is a linear combination of the
  sub-matrices X_1, . . . , X_n, i.e.,

      \tilde{X}_{j,k} = \sum_{i=1}^{n} a_{j,k,i} X_i.   (3.26)

  Here, the coefficients a_{j,k,i} are specified by the encoding function
  ρ_j of worker j.

• Computation functions. Each worker uses the r encoded sub-matrices
  along with the weight vector w received from the master to perform its
  computation. We use φ_j: R^{d×(m/n)×r} × R^d → R^{ℓ_j} to denote this
  mapping, whose output is an arbitrary length-ℓ_j vector that is computed
  locally at worker j using X̃_{j,1}, . . . , X̃_{j,r} and w.

• Decoding function. The master uses a decoding function
  ψ: ×_{j∈N} R^{ℓ_j} → R^d to map the computation results of the available
  workers in N to the desired computation X^⊤Xw.

Definition 3.3. We define the recovery threshold of a computation


scheme S with a computation/storage load r at each worker, denoted
by KS (r), as the minimum number of workers the master needs to wait
to accomplish the gradient computation.

Consider a distributed linear regression task executed on n workers


with a local computation/storage load r each. We are interested in
finding the minimum recovery threshold achieved among all computation
schemes along with the corresponding scheme. This optimal recovery
threshold can be formally defined as

K^*(r) := \min_{S} K_S(r).    (3.27)

State-of-the-art schemes. Proposed in [147], and extended in [63,


94, 127, 164], the gradient coding (GC) schemes code across partial
gradients computed from uncoded data batches to mitigate stragglers
for general distributed machine learning problems. In this case, GC
schemes achieve a recovery threshold of KGC (r) = n − r + 1. To see
this first note that each worker stores r uncoded sub-matrices. For
example, using the cyclic repetition scheme in [147], worker j stores
X_j, ..., X_{j+r−1} locally, and sends a linear combination of the computation results X_j X_j^T w, ..., X_{j+r−1} X_{j+r−1}^T w to the master, who can recover the final result X_1 X_1^T w + · · · + X_n X_n^T w by linearly combining the messages received from any subset of n − r + 1 workers.


On the other hand, the matrix-vector multiplication based (MVM) scheme proposed in [89] takes a different decomposition of the computation X^T X w from (3.25). Specifically, the overall computation consists of two rounds. In the first round, an intermediate vector z = Xw is computed distributedly and decoded at the master. In the second round, the master re-distributes z to the workers and has them collaboratively compute the final result X^T z. Each worker stores coded data generated using MDS codes from X and X^T, respectively. MVM achieves a recovery threshold of K_MVM(r) = ⌈2n/r⌉ in each round, when the storage is evenly split between rounds. It was recently proposed in [106] to use one round of matrix-vector multiplication to compute the gradient, given that the second moment X^T X of the feature matrix is known a priori. However, since we focus on the cases where the input X is very large, storing X and computing X^T X on a single machine is prohibitive.

Applying LCC to Minimize Recovery Threshold


We note that the above gradient computation framework can be cast into the computation model in Subsection 3.2.1. To do that, we group the sub-matrices into K = ⌈n/r⌉ data blocks such that X = [X̄_1 . . . X̄_K]^T. Then the gradient computation (3.25) reduces to computing the sum of a degree-2 polynomial f(X̄_k) = X̄_k X̄_k^T w evaluated over the K data blocks X̄_1, ..., X̄_K.

Now, we can directly apply LCC to minimize the recovery threshold in each iteration. We first generate the coded matrix X̃_i stored at worker i as a linear combination of X̄_1, ..., X̄_K as in (3.21). Each worker i computes f(X̃_i) = X̃_i X̃_i^T w and sends it to the master. Based on Theorem 3.5, the master can recover f(X̄_1), ..., f(X̄_K), and hence the gradient by summing them up, after receiving the results from 2(K − 1) + 1 = 2⌈n/r⌉ − 1 workers. We state this result in the following corollary.
Corollary 3.8. Consider the above distributed linear regression problem and assume it is executed over n workers, each storing 2 ≤ r ≤ n coded sub-matrices. In this setting, LCC achieves a recovery threshold of K_LCC(r) = 2⌈n/r⌉ − 1. Furthermore, the recovery threshold achieved by LCC is within a factor of two of the minimum possible recovery threshold K^*(r) achievable by any algorithm. That is,

(1/2) K_LCC(r) < K^*(r) ≤ K_LCC(r) = 2⌈n/r⌉ − 1.    (3.28)

When r = 1, LCC reduces to the uncoded scheme where each worker j computes X_j X_j^T w. The achievability part directly comes from the recovery threshold of LCC. As for the converse part, since here we consider a more general scenario where workers can execute any computation on the data (not necessarily matrix–matrix multiplication), the lower bound in Theorem 3.5 no longer holds. We refer the interested readers to [95] for the proof of a new lower bound on K^*(r) that is no less than half of K_LCC(r).

Remark 3.13. We note that LCC is also directly applicable for non-
linear regression problems using kernel methods. To do that, we simply
replace the data matrix X with the kernel matrix K, whose entry
Kij = k(xi , xj ) is some kernel function of the data points xi and xj .
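To make the LCC gradient scheme described above concrete, the following minimal NumPy sketch (with hypothetical sizes, and real-valued arithmetic rather than an embedding into a finite field, so it ignores the numerical-stability caveat discussed in the experiments below) encodes the K = n/r data blocks with a Lagrange polynomial, has each worker evaluate f(X̃_i) = X̃_i X̃_i^T w, and lets the master recover the gradient from the first 2K − 1 returned results.

```python
import numpy as np

def lagrange_eval(xs, ys, z):
    """Evaluate, at point z, the unique polynomial through (xs[j], ys[j]);
    ys[j] may be vectors or matrices (interpolation is componentwise)."""
    total = np.zeros_like(ys[0], dtype=float)
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        coeff = np.prod([(z - xl) / (xj - xl) for l, xl in enumerate(xs) if l != j])
        total = total + coeff * yj
    return total

# Hypothetical sizes: n workers, each with storage for r blocks' worth of data.
m, d, n, r = 240, 8, 12, 3
K = n // r                                   # number of data blocks
X = np.random.randn(m, d)
w = np.random.randn(d)
blocks = [c.T for c in np.split(X, K)]       # Xbar_k in R^{d x (m/K)}

betas = np.arange(1.0, K + 1)                # interpolation points for the data blocks
alphas = np.arange(K + 1.0, K + n + 1)       # worker evaluation points, disjoint from betas

# Encoding: worker i stores Xtilde_i = u(alpha_i), a Lagrange combination of the blocks.
coded = [lagrange_eval(betas, blocks, a) for a in alphas]

# Computation: worker i returns f(Xtilde_i) = Xtilde_i Xtilde_i^T w (degree 2 in the data).
results = [Xt @ (Xt.T @ w) for Xt in coded]

# Decoding: f(u(z)) has degree 2(K-1), so any 2K-1 results suffice; take the first ones.
fast = list(range(2 * K - 1))
grad = sum(lagrange_eval(alphas[fast], [results[i] for i in fast], b) for b in betas)

assert np.allclose(grad, X.T @ (X @ w))      # equals the full gradient direction X^T X w
```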

Comparison with the state of the art. Compared with the gradient coding (GC) schemes (see, e.g., [63, 127, 147]), LCC directly codes across the raw data before computation, further reducing the recovery threshold by a factor of about r/2. While the amount of computation and
communication at each worker is the same for GC and LCC, LCC is
expected to finish much faster due to its much smaller recovery threshold.
However, GC schemes are applicable for more general learning problems
where the gradient can be arbitrary functions of the data.
Compared with the matrix-vector multiplication based (MVM)
scheme in [89], LCC completes each iteration in only one round of
computation and communication, with a smaller recovery threshold
than that of MVM in each round (assuming even storage split between
two rounds). However, MVM requires a smaller amount of computation at
each worker than LCC. While LCC has each worker send a dimension-d
vector in each iteration, each MVM worker sends two vectors whose
sizes are respectively proportional to m and d.

Experiments on AWS EC2


We run distributed linear regression on Amazon EC2 clusters, and
empirically compare the performance of the proposed LCC scheme with
the conventional uncoded scheme for which there is no data redundancy

among the workers, the GC scheme (specifically, the cyclic repetition


scheme in [147]), and the MVM scheme in [89].
Setup. We train a linear regression model using Nesterov’s accelerated
gradient descent over a distributed computing system, where the master
and worker nodes are implemented on t2.micro instances using Python.
Message passing between instances is implemented using MPI4py [39].
In each iteration, each worker sends its computation result back to the
master asynchronously using Isend().
Data. We create synthetic datasets of m training samples by (1) sampling a true weight vector w^* whose components are i.i.d. and uniformly distributed on [0, 1], and (2) sampling each input point x_i of d features from a normal mixture distribution (1/2) N(µ_1, I) + (1/2) N(µ_2, I), where µ_1 = (1.5/d) w^* and µ_2 = (−1.5/d) w^*, and computing its output label y_i = x_i^T w^*. For each dataset, we run GD for 100 iterations over n = 40 workers. We consider different dimensions of the input matrix X as listed in the following scenarios.

• Scenario 1 & 2: (m, d) = (8000, 7000).

• Scenario 3: (m, d) = (160000, 500).

We let the system run with naturally occurring stragglers in sce-


nario 1. To mimic the effect of slow/failed workers, we artificially intro-
duce stragglers in scenarios 2 and 3, by imposing a 0.5 seconds delay
on each worker with probability 5% in each iteration.
To implement LCC, we set the βi parameters to 1, . . . , n/r, and the αi
parameters to 0, . . . , n − 1. To avoid numerical instability due to large
entries of the decoding matrix, we can embed input data into a large
finite field, and apply LCC in it with exact computations. However, in
all of our experiments the gradients are calculated correctly without
carrying out this step.
Results. For the uncoded scheme, each worker stores and processes
r = 1 data batch. For the GC and LCC schemes, we select the optimal
r subject to the memory size of the t2.micro instance to minimize
the total run-time. For MVM, we further optimized the run-time over
the computation/storage assigned between two rounds of matrix-vector

Figure 3.7: Total run-time (in seconds) comparison of LCC with the other three schemes (conventional uncoded, GC, and MVM) in scenarios 1–3.

multiplications. We plot the run-time performance in all three scenarios


in Figure 3.7, and also list the breakdowns of their run-times in Tables 3.1
to 3.3. The computation time was measured as the summation of the
maximum local processing time among all non-straggling workers, over
100 iterations. The communication time is computed as the difference
between the total run-time and the computation time.
Based on the experimental results, we draw the following conclusions.

• LCC achieves the least run-time in all scenarios. In particular,


LCC speeds up the uncoded scheme by 6.79×–13.43×, the GC scheme by 2.36×–4.29×, and the MVM scheme by 1.01×–12.65×.

• In scenarios 1 & 2, where the number of inputs m is close to the number of features d, LCC achieves a similar performance as MVM. However, when we have many more data points, as in scenario 3, LCC finishes substantially faster than MVM, by as much as 12.65×. The main reason for MVM's subpar performance is that it requires large amounts of data transfer from the workers to the master in the first round and from the master to the workers in the second round (both proportional to m). In contrast, the amount of communication from each worker or the master is proportional to d for all other schemes, which is much smaller than m in scenario 3.

Table 3.1: Breakdowns of the run-times in scenario one

Scheme        # Batches/Worker (r)   Recovery Threshold   Communication Time   Computation Time   Total Run-time
Uncoded       1                      40                   24.125 s             0.237 s            24.362 s
GC            10                     31                   6.033 s              2.431 s            8.464 s
MVM Rd. 1     5                      8                    1.245 s              0.561 s            1.806 s
MVM Rd. 2     5                      8                    1.340 s              0.480 s            1.820 s
MVM total     10                     –                    2.585 s              1.041 s            3.626 s
LCC           10                     7                    1.719 s              1.868 s            3.587 s

Table 3.2: Breakdowns of the run-times in scenario two

Scheme        # Batches/Worker (r)   Recovery Threshold   Communication Time   Computation Time   Total Run-time
Uncoded       1                      40                   7.928 s              44.772 s           52.700 s
GC            10                     31                   14.42 s              2.401 s            16.821 s
MVM Rd. 1     5                      8                    2.254 s              0.475 s            2.729 s
MVM Rd. 2     5                      8                    2.292 s              0.586 s            2.878 s
MVM total     10                     –                    4.546 s              1.061 s            5.607 s
LCC           10                     7                    2.019 s              1.906 s            3.925 s

Table 3.3: Breakdowns of the run-times in scenario three

Scheme        # Batches/Worker (r)   Recovery Threshold   Communication Time   Computation Time   Total Run-time
Uncoded       1                      40                   0.229 s              41.765 s           41.994 s
GC            10                     31                   8.627 s              2.962 s            11.589 s
MVM Rd. 1     5                      8                    3.807 s              0.664 s            4.471 s
MVM Rd. 2     5                      8                    52.232 s             0.754 s            52.986 s
MVM total     10                     –                    56.039 s             1.418 s            57.457 s
LCC           10                     7                    1.962 s              2.597 s            4.541 s

3.3 Related Works and Open Problems

Unified coding. So far, we have demonstrated how coded computing


techniques can inject and leverage redundant computations to minimize
the load of communication, and the effect of stragglers, respectively.
Moving beyond these individual improvements, we have recently pro-
posed in [96] a unified coded framework for distributed computing
with straggling servers, by introducing a tradeoff between “latency of
computation” and “load of communication” for some linear computa-
tion tasks. We showed that the Coded Distributed Computing (CDC)

scheme in Section 2 that repeats the intermediate computations to


create coded multicasting opportunities to reduce communication load,
and the coded scheme of [89] that generates redundant intermediate
computations to combat straggling servers can be viewed as special
instances of the proposed framework, by considering two extremes of
this tradeoff: minimizing either the load of communication or the la-
tency of computation individually. The key idea of this unified coding
scheme is to apply redundant CDC data placement on MDS-coded
data blocks. Then, by tuning the coding rate of the MDS code and
the number of redundant computations for each intermediate task, we
can systematically operate at any point on the latency-load tradeoff to
optimize the run-time performance of distributed computing tasks. We
also proved an information-theoretic lower bound on the latency-load
tradeoff, which was shown to be within a constant multiplicative gap
from the achieved tradeoff at the two end points.
Gradient coding. While coded computing schemes have been designed
to speed up fundamental algebraic computations like matrix multipli-
cation and polynomial evaluation, directly adopting these schemes in
general machine learning algorithms is often not applicable since the
gradient computation may not have any algebraic structure, or can
only be evaluated numerically. One of the most important learning
algorithms is the stochastic gradient descent (SGD), which is currently
the most widely used training method in supervised learning. Recently,
a coding method named “gradient coding” (GC) was proposed in [147],
and extended in [63, 127, 164], to mitigate stragglers in running dis-
tributed SGD. We use the following simple example to illustrate the
idea of GC.
Figure 3.8(a) illustrates a naive way of distributing the computa-
tion of the gradient on three workers. The three workers have disjoint
partitions of the data stored locally (D1 , D2 , D3 ) and all share the
current model. For i = 1, 2, 3, Worker i computes the gradient of the
model on examples in partition Di , denoted by gi . The three gradient
vectors are then communicated to a master node which computes the
full gradient by summing these vectors g1 + g2 + g3 and updates the
model with a gradient step. The new model is then sent to the workers

Figure 3.8: Illustration of gradient coding. (a) Naive synchronous gradient descent: the master needs g1, g2, and g3 from all three workers to form g1 + g2 + g3. (b) Gradient coding: the master can recover g1 + g2 + g3 from the coded messages g1/2 + g2, g2 − g3, and g1/2 + g3 of any two workers.

and the system moves to the next iteration. This setup is, however,
subject to delays introduced by stragglers because the master has to
wait for outputs of all three workers before computing g1 + g2 + g3 .
Figure 3.8(b) illustrates one way to resolve this problem by replicat-
ing data across machines as shown, and sending linear combinations of
the associated gradients. As shown in Figure 3.8(b), each data partition
is replicated twice using a specific placement policy. Each worker is
assigned to compute two gradients on their assigned two data partitions.
For instance, Worker 1 computes vectors g1 and g2, and then sends (1/2)g1 + g2. Interestingly, g1 + g2 + g3 can be constructed from any two out of these three coded vectors. For instance, g1 + g2 + g3 = 2((1/2)g1 + g2) − (g2 − g3).
Therefore, such a scheme is robust to one straggler. This gradient cod-


ing technique makes the computation robust to stragglers albeit at a
computational overhead compared to the naive scheme, while keeping
the communication load the same.
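A minimal NumPy sketch of this three-worker example is given below; the partial gradients g1, g2, g3 are random stand-ins for the true partial gradients, and the decoding coefficients for each pair of surviving workers are the ones implied by the construction above.

```python
import numpy as np

d = 5
g1, g2, g3 = (np.random.randn(d) for _ in range(3))   # partial gradients (stand-ins)

# Coded messages of the scheme in Figure 3.8(b).
msgs = {1: 0.5 * g1 + g2, 2: g2 - g3, 3: 0.5 * g1 + g3}

# Decoding coefficients for each possible pair of surviving (non-straggling) workers.
decode = {(1, 2): (2, -1), (1, 3): (1, 1), (2, 3): (1, 2)}

full = g1 + g2 + g3
for (i, j), (a, b) in decode.items():
    assert np.allclose(a * msgs[i] + b * msgs[j], full)   # robust to any one straggler
```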
In general, GC schemes require processing s + 1 data batches at each worker in order for the system to tolerate s stragglers. We also note that, in contrast to the previously proposed Lagrange Coded Computing scheme that codes over the data batches themselves, the GC schemes code over partial gradients computed from uncoded data; hence they are applicable to arbitrary loss functions whose gradients may not have any algebraic structure or can only be computed numerically (e.g., deep neural networks).

Finally, we end this section with some of the open problems and fu-
ture directions for designing straggler-resilient coded computing systems.
Low-complexity algorithms for coded matrix multiplication.
While the naive multiplication of an M × N matrix A by an N × L
matrix B has complexity O(M N L), there is a rich literature that has
discovered low complexity implementations, especially if the matrices are
restricted to a certain class. When the entries of matrix A come from a
bounded alphabet A (e.g., A is the adjacency matrix of a degree-bounded
graph in common graph algorithms like pagerank, or Laplacian matrix
calculation), the product AB can be computed via the four Russians
algorithm [100, 150] using O(M N L log2 |A|/ log2 N ) operations - an
improvement of a factor of log2 N as compared to the naive approach
for small alphabet. There are some unique challenges for the use of the
four Russians method in coded distributed matrix multiplication due to
the fact that the alphabet size of good codes tends to be large. Consider a
concrete example where A = {0, 1} and B is an N × 1 vector. Surprisingly, a back-of-the-envelope calculation reveals that a natural application of MDS codes to the case of binary multiplication has the same per-node computational complexity as replication, O(\frac{MN}{\frac{n}{s+1}\log_2 N}), for a fixed straggler tolerance s. This is because the alphabet size of a parity matrix, say \sum_{i=1}^{m} g_i A_i, can be as large as 2^m, so the computational complexity of multiplying it with B is O(\frac{MN}{\log_2 N}), whereas an uncoded computation has complexity O(\frac{MN}{m\log_2 N}). Motivated by this observation, we propose to answer the following question: For a matrix multiplication where the matrix entries come from a bounded alphabet A, what is the optimal trade-off between straggler tolerance and the per-node computational complexity? There have been several recent works along this direction in [61, 148].
Developing “master-less” systems for efficient and straggler-
resilient matrix multiplication. Current state of the art in coded
computing largely assumes availability of master/fusion nodes that dis-
tribute and collect data, and perform encoding/decoding computations.
In practice, however, often all nodes are identical, and no single node
may be able to store all the data or perform all the encoding/decoding
operations (see e.g., [71]). Distributed and parallel computing literature

has developed efficient matrix multiplication algorithms for decentralized


architectures, for example, the Scalable Universal Matrix Multiplication
Algorithm (SUMMA) [151], which is implemented in linear algebra
libraries such as ScaLAPACK [22], PLAPACK [6], PB-BLAS [35], and
Elemental [123]. A significant challenge that SUMMA overcomes is
to limit the communication cost in decentralized architectures. The
theoretical basis for practical algorithms like SUMMA comes from the
study of their completion time over a distributed computing model
[14, 43] of n fully connected nodes, where message transmission time
involves a fixed start up time plus a time that is proportional to the
length of the message. The completion time for matrix multiplication
in SUMMA is approximately optimal in this model [14]. The main
idea of SUMMA is that it cleverly schedules the operations performed
by the nodes to minimize the amount of time spent waiting for data
and message communication and startup costs, minimizing the over-
all completion time. SUMMA, however, is not robust to stragglers or
failures. Thus motivated, we propose to solve the following problem:
Develop a matrix multiplication algorithm over n nodes that is robust
to s stragglers, and minimizes the completion time in the model of [14].
In order to emulate the fusion node’s decode/repair functionality over
the master-less decentralized systems, one may refer to related designs
for locally recoverable and regenerating codes in the distributed storage
literature. Also, it would be interesting to try to develop lower bounds
on the completion time using the techniques for proving straggler-aware
bounds in [47, 167].
Gradient coding for partial stragglers. Previous works on gradient coding for distributed learning relied on a simplified assumption: straggling machines perform no work, i.e., they fail catastrophically or simply do not respond to requests. In reality, this rarely happens: machines
are simply slower because of an OS update, moving of virtualization
resources across servers or other issues relating to pooled computing
resources. Furthermore a machine may be a straggler for some iterations
of the learning process but not for others. This allows us to design
methods for iterative learning not one round at a time, but considering
the iterative nature of the whole training process jointly.

One way to tackle this problem is to use a layering of different


gradient codes designed for different numbers of stragglers. Each data
batch of gradient descent can be partitioned into smaller partitions
and combinations of gradient codes can provide good intermediate
performance. A good way to explain our future vision for this problem
is through erasure codes: In classic MDS erasure coding k data blocks
are encoded into n blocks with the guarantee that if someone collects
any k from the encoded blocks they can reconstruct all the original data.
However, even if k − 1 blocks are recovered, there is no guarantee of
recovery. Of course, one could make a systematic MDS code and have
some intermediate performance from the systematic blocks, but it is
highly nontrivial to improve on that. For example, one could ask that
any k/2 blocks suffice to recover a good fraction of the original data
and also recover everything from any k blocks.
This problem is sometimes called intermediate performance for era-
sure codes, see e.g., [44, 82, 136] and the related growth codes [77].
The problem we are proposing here is intermediate performance for
gradient codes: For example, create a code to ensure that a full gradient
is recovered if any n − s − `1 machines each process 2k1 blocks, and `1
partial stragglers process some k2 blocks. In this example we assumed a
slowdown factor of 2. Finding the fundamental limits and designing op-
timal gradient codes for such systems are interesting research problems.
Some progress has been made on this direction in [54, 115].
Gradient coding that produces approximate gradients. Another way to alleviate the computational overhead of gradient coding can be by relaxing the requirement of exactly recovering the full gradient (or of any batch involved in a particular iteration). In other words, one could try to design n sparse vectors {b_1, b_2, ..., b_n} such that the span of any (n − s) of them contains a vector close to the all-ones vector 1. This gives rise to the following open problem. Thinking of the n vectors as rows of a matrix B, and given some constant ε > 0, one form of stating this problem requires constructing the matrix B such that any submatrix of (n − s) rows, say B′, satisfies ‖B′x − 1‖_2 ≤ ε for some vector x. This question has been recently studied in [29, 127, 154, 156]. However, the question of whether these approaches are optimal is still open. Furthermore,

the problem stated as such is agnostic to the actual gradient being


approximated. It is conceivable that building data dependent encoders
that exploit gradients from the previous iteration could lead to better
approximations of the current gradient for the same computational cost.
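As a concrete toy instance of the objective above (following the row-span formulation, with an arbitrary, hypothetical choice of B), the sketch below measures, over all sets of n − s surviving rows, how closely the all-ones vector can be approximated in their span via least squares; designing B so that this worst-case residual stays below a target ε is precisely the stated open problem.

```python
import numpy as np
from itertools import combinations

def worst_case_residual(B, s):
    """For each set of n-s surviving rows of B, find the best approximation of the
    all-ones vector in their span, and return the largest residual over all sets."""
    n, k = B.shape
    ones = np.ones(k)
    worst = 0.0
    for rows in combinations(range(n), n - s):
        Bp = B[list(rows)]                            # (n-s) x k submatrix B'
        x, *_ = np.linalg.lstsq(Bp.T, ones, rcond=None)
        worst = max(worst, np.linalg.norm(Bp.T @ x - ones))
    return worst

# A toy (hypothetical) choice of B: n = 5 workers, k = 5 parts, each worker
# covering two adjacent parts with unit coefficients, and s = 1 straggler.
n = k = 5
B = np.eye(k) + np.roll(np.eye(k), 1, axis=1)
print(worst_case_residual(B, s=1))   # about 0.45: this naive choice does not meet a small eps
```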
Beyond polynomial computations. Polynomial computation is the
most general class of computations for which we know the optimal
design for coded computing via Lagrange coding. Extending the state-
of-the-art in coded computing to go beyond polynomial computations is
a very important and challenging research direction, which is expected
to impact various application domains (in particular, machine learning,
in which non-linear threshold functions are common). There has been some work in this direction in [46, 86, 145, 162]; however, the problem still remains largely unsolved.
4
Coding for Security and Privacy

In the previous sections, we have demonstrated the role of coding


in reducing the bandwidth requirement, and alleviating the stragglers’
delay, for distributed computing applications. In this section, we focus on
addressing another two major concerns of the information age – security
and privacy. The security concern in distributed computation is having
Byzantine (or malicious) workers with no computational restriction,
who can deliberately send erroneous data to affect the computation for
their benefit. Examples for such scenarios include the one described
in [24], where it is shown that a single malicious server in a distributed
execution of gradient descent can cause arbitrary bias in the resulting
hypothesis. In addition to security challenges, distributed computation
and learning schemes are susceptible to privacy infringement. Since
such computations are commonly performed by using third party cloud
services, the concern for personal data leakage is growing. Therefore, in
some cases it is crucial to keep the workers oblivious to the actual data
they are processing.
Security and privacy have been the main research focus in the lit-
erature of multiparty computing (MPC) and secure/private machine


learning (see, e.g., [18, 37, 38, 64, 114]). In this section, we demon-
strate how coding theory can help to maintain security and privacy in
multiparty computing and distributed learning. Specifically, we first
demonstrate how we can extend the Lagrange Coded Computing
framework proposed in the previous section to provide MPC systems
with security and privacy guarantees. We also compare LCC with
state-of-the-art MPC schemes (e.g., the celebrated BGW scheme for
secure/private MPC [18]), and illustrate the substantial reduction in the
amount of randomness, storage overhead, and computational complexity
achieved by LCC.
Second, we demonstrate the application of coded computing for
privacy-preserving machine learning. In particular, we consider an appli-
cation scenario in which a data-owner (e.g., a hospital) wishes to train
a logistic regression model by offloading the large volume of data (e.g.,
healthcare records) and computationally-intensive training tasks (e.g.,
gradient computations) to N machines over a cloud platform, while
ensuring that any collusions between T out of N workers do not leak
information about the dataset. We then discuss a recently proposed
scheme [145], named CodedPrivateML, that leverages coded computing
for this problem. We finally end this section with a discussion on some
related works and open problems.

4.1 Secure and Private Multiparty Computing

We consider the problem of evaluating a multivariate polynomial f : V → U over a dataset X = (X_1, ..., X_K), where V and U are vector spaces of dimensions M and L, respectively, over the finite field^1 F_q. We assume a distributed computing environment with a master and N workers (see Figure 4.1), and the goal is to compute Y_1 ≜ f(X_1), ..., Y_K ≜ f(X_K) given the function f. We define the degree of the chosen f, denoted by deg f, as the total degree of the polynomial.

In this setting each worker has already stored a fraction of the dataset prior to computation, in a possibly coded manner. Specifically, for i ∈ [N] (where [N] ≜ {1, ..., N}), worker i stores X̃_i ≜ g_i(X_1, ..., X_K, Z),

^1 While the results about security hold for any large enough field, privacy is well defined only over finite ones.

Figure 4.1: An illustration of coded computing in the presence of malicious workers (m_1, ..., m_A) who wish to affect the computation for their own benefit, and sets of colluding workers (c_1, ..., c_T) who wish to know the dataset X. The master encodes the dataset {X_j}_{j=1}^{K} into {X̃_i}_{i=1}^{N}, and sends X̃_i to worker i. In turn, the workers compute f(X̃_i) and send the result back to the master. The master needs to retrieve {f(X_i)}_{i=1}^{K} in the presence of at most A malicious workers, and maintain the perfect privacy of the dataset in the face of up to T colluding workers.

where gi is the encoding function of that worker, and Z is a random


variable. We restrict our attention to linear encoding functions, which
guarantee low encoding complexity and simple implementation. Specifi-
cally, each X̃i is a linear combination of X1 , . . . , XK , Z.
Upon starting computation, each worker i ∈ [N] computes Ỹ_i ≜ f(X̃_i) and returns the result to the master. The master waits for all
workers and then decodes outputs Y1 , . . . , YK using a decoding function
given these results.
The procedure described above must satisfy two additional require-
ments. First, the workers must remain oblivious to the content of the
dataset, even if up to T of them collude, where T is the privacy param-
eter of the system. Formally, for every T ⊆ [N ] of size at most T , we

must have that^2


I(X; X̃T ) = 0 (4.1)
where I is mutual information, X̃T represents the encoded dataset that
is stored at the workers in T , and X is seen as chosen uniformly at
random. A scheme which guarantees privacy against T colluding workers
is called T -private.
In addition to privacy, the computing scheme should provide security,
i.e., robustness against malicious workers. Formally, the master must be
able to obtain true values of Y1 , . . . , YK even if up to A workers return
arbitrarily erroneous results, where A is the security parameter of the
system. A scheme that guarantees security against A malicious workers
is called A-secure.
Uncoded repetition scheme. For this setting, a naive uncoded
scheme simply replicates each uncoded data block Xi onto multiple
workers. By replicating each Xi between bN/Kc and dN/Ke times, it
can tolerate at most A adversaries when 2A ≤ bN/Kc − 1. However,
uncoded repetition does not support the privacy requirement.

4.1.1 LCC for Secure and Private Multiparty Computing


We start with an illustrative example of how the LCC scheme that we
described in the previous section can be utilized for secure and private
MPC.

Illustrative Example
Consider the function f(X_i) = X_i^2, where the inputs X_i are √M × √M
square matrices for some square integer M . We demonstrate LCC
in the scenario where the input data X is partitioned into K = 2
batches X1 and X2 , and the computing system has N = 7 workers. In
addition, the scheme guarantees perfect privacy against any individual
worker (i.e., T = 1), and is robust against any single malicious worker
(i.e., A = 1).
^2 Equivalently, Equation (4.1) requires that X̃_T and X are independent. Under this condition, the input data X still appears uniformly random after the colluding workers learn X̃_T, which guarantees the privacy.

The gist of LCC is to pick a uniformly random matrix Z of the same dimensions as the X_i's, and to encode (X_1, X_2, Z) using a Lagrange interpolation polynomial u. To this end, assume that the underlying field is F_q = F_11, and let

u(z) ≜ X_1 · \frac{(z−2)(z−3)}{(1−2)(1−3)} + X_2 · \frac{(z−1)(z−3)}{(2−1)(2−3)} + Z · \frac{(z−1)(z−2)}{(3−1)(3−2)},

and observe that u(1) = X_1 and u(2) = X_2. Then, fix distinct {α_i}_{i=1}^{7} in F_11 such that {α_i}_{i=1}^{7} ∩ [2] = ∅, and have workers 1, ..., 7 store u(α_1), ..., u(α_7), i.e.,

(X̃_1, ..., X̃_7) = (X_1, X_2, Z) · U,

where U ∈ F_11^{3×7} satisfies U_{i,j} = \prod_{ℓ∈[3]\{i}} \frac{α_j − ℓ}{i − ℓ} for (i, j) ∈ [3] × [7].

First, notice that for every j ∈ [7], worker j sees X̃j , which is a
linear combination of X1 and X2 masked by addition of λ · Z for some
nonzero λ ∈ F11 ; since Z is uniformly random, this guarantees perfect
privacy for T = 1. Next, worker j computes f (X̃j ) = f (u(αj )), which
is an evaluation of the composition polynomial f (u(z)), with degree at
most 4, at αj .
Normally, a polynomial of degree 4 can be interpolated from 5 eval-
uations at distinct points. However, the presence of A = 1 malicious
worker requires the master to employ a Reed–Solomon decoder, and have
two additional evaluations at distinct points (in general, two additional
evaluations for every malicious worker). Finally, after decoding polyno-
mial f (u(z)), the master can obtain f (X1 ) and f (X2 ) by evaluating it
at z = 1 and z = 2.
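The following self-contained Python sketch mirrors this example over F_11, with scalar inputs in place of the √M × √M matrices to keep it short; it Lagrange-encodes (X_1, X_2, Z), lets one worker return a corrupted result, and replaces the Reed–Solomon decoder with a brute-force search over size-5 subsets, which suffices at this toy size.

```python
from itertools import combinations
import random

p = 11                                   # the field F_11 of the example
betas = [1, 2, 3]                        # u(1) = X1, u(2) = X2, u(3) = Z
alphas = [0, 4, 5, 6, 7, 8, 9]           # 7 distinct evaluation points avoiding {1, 2}

def f(x):
    return (x * x) % p                   # f(X) = X^2 (scalar stand-in for the matrices)

def lagrange(points, values, z):
    """Evaluate, at z, the unique polynomial over F_p through the given points/values."""
    total = 0
    for j, (xj, yj) in enumerate(zip(points, values)):
        num, den = 1, 1
        for l, xl in enumerate(points):
            if l != j:
                num = num * (z - xl) % p
                den = den * (xj - xl) % p
        total = (total + yj * num * pow(den, p - 2, p)) % p
    return total

# Encoding at the master: workers see u(alpha_i), which hides (X1, X2) behind Z.
X1, X2, Z = 4, 9, random.randrange(p)
shares = [lagrange(betas, [X1, X2, Z], a) for a in alphas]

# Honest computation, then one malicious worker (index 2) corrupts its result.
returned = [f(s) for s in shares]
returned[2] = (returned[2] + 5) % p

# Decoding: f(u(z)) has degree <= 4, so 5 correct evaluations determine it. With at most
# A = 1 error, accept a candidate only if it matches >= 6 of the 7 returned results
# (a brute-force stand-in for Reed-Solomon decoding).
for subset in combinations(range(7), 5):
    pts = [alphas[i] for i in subset]
    vals = [returned[i] for i in subset]
    agree = sum(lagrange(pts, vals, a) == y for a, y in zip(alphas, returned))
    if agree >= 6:
        # Evaluate the decoded polynomial at beta_1 and beta_2 to recover f(X1), f(X2).
        assert [lagrange(pts, vals, b) for b in betas[:2]] == [f(X1), f(X2)]
        break
else:
    raise RuntimeError("decoding failed")
```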

General Description
To start, we first select any K +T distinct elements β1 , . . . , βK+T from F,
and find a polynomial u: F → V of degree K +T −1 such that u(βi ) = Xi
for any i ∈ [K], and u(βi ) = Zi for i ∈ {K + 1, . . . , K + T }, where
all Z_i's are chosen uniformly at random from V. This is accomplished by letting u be the respective Lagrange interpolation polynomial

u(z) ≜ \sum_{j∈[K]} X_j · \prod_{k∈[K+T]\{j}} \frac{z − β_k}{β_j − β_k} + \sum_{j=K+1}^{K+T} Z_j · \prod_{k∈[K+T]\{j}} \frac{z − β_k}{β_j − β_k}.    (4.2)
We then select N distinct elements α_1, ..., α_N from F such that {α_i}_{i=1}^{N} ∩ {β_j}_{j=1}^{K} = ∅, and encode the input variables by letting X̃_i = u(α_i) for any i ∈ [N]. That is, the input variables are encoded as

X̃_i = u(α_i) = (X_1, ..., X_K, Z_{K+1}, ..., Z_{K+T}) · U_i,    (4.3)

where U ∈ F_q^{(K+T)×N} is the encoding matrix with U_{i,j} ≜ \prod_{ℓ∈[K+T]\{i}} \frac{α_j − β_ℓ}{β_i − β_ℓ}, and U_i is its i'th column.


Next we briefly sketch the proof of T-privacy, which relies on the fact that the bottom T × N submatrix U^bottom of U is an MDS matrix (i.e., every T × T submatrix of U^bottom is invertible). Hence, for a colluding set of workers T ⊆ [N] of size T, their encoded data X̃_T satisfies X̃_T = X U_T^top + Z U_T^bottom, where Z ≜ (Z_{K+1}, ..., Z_{K+T}), and U_T^top ∈ F_q^{K×T}, U_T^bottom ∈ F_q^{T×T} are the top and bottom sub-matrices which correspond
to the columns in U that are indexed by T .
Now, the fact that U^bottom is MDS implies that U_T^bottom is invertible, and hence

Z = (X̃_T − X U_T^top) · (U_T^bottom)^{−1}.

Therefore, for every dataset X and every observed encoding X̃T , there
exists a unique value for the randomness Z by which the encoding of X
equals X̃T ; a statement equivalent to the definition of T -privacy.
Following the encoding of (4.3), each worker i applies f on X̃i
and sends the result back to the master. Hence, the master obtains N
evaluations, at most A of which are incorrect, of the polynomial f (u(z)).
Since deg f (u(z)) ≤ deg f · (K + T − 1), and N ≥ (K + T − 1) deg(f ) +
2A + 1, the master can obtain all coefficients of f (u(z)) by applying
Reed–Solomon decoding. Having this polynomial, the master evaluates
it at βi for every i ∈ [K] to obtain f (u(βi )) = f (Xi ). This results in
the following theorem for LCC.

Theorem 4.1. Given a number of workers N and a dataset X =


(X1 , . . . , XK ), LCC scheme provides an A-secure, and T -private com-
putation of {f (Xi )}K i=1 for any polynomial f , as long as

(K + T − 1) deg f + 2A + 1 ≤ N, (4.4)

for any field Fq that is sufficiently large (i.e., q ≥ N + K).

Remark 4.1. This construction is applicable over every finite field


with q ≥ K + N . Moreover, disregarding the privacy constraint (i.e.,
setting T = 0) provides an A-secure scheme over infinite fields as well.

Remark 4.2. Note that the LHS of inequality (4.4) is independent of the number of workers N; hence, a key property of LCC is that adding one worker can increase the number of malicious workers that can be tolerated by 1/2, while
keeping the privacy constraint T the same. This result essentially ex-
tends the well-known optimal scaling of error-correcting codes (i.e.,
adding one parity can provide robustness against one erasure or 1/2
error in optimal maximum distance separable codes) to the distributed
computing paradigm.

4.1.2 Optimality of LCC for Secure and Private MPC


We now discuss the optimality of LCC by proving Theorem 4.2 (optimal
security) and Theorem 4.3 (optimal randomness), stated below.

Theorem 4.2 (Optimal security). For any multilinear function f , security


can be provided against at most A = b(N − (K − 1)deg f − 1)/2c
adversaries when N ≥ Kdeg f − 1, and A = bN/2K − 1/2c adversaries
when N < Kdeg f − 1.

Compared with the result in Theorem 4.1 (for the case of T = 0),
Theorem 4.2 demonstrates that the LCC scheme provides the optimal
security, by protecting against maximum possible number of adversaries.
Proof of Theorem 4.2. We prove Theorem 4.2 by connecting the ad-
versary tolerance problem to the straggler mitigation problem described
in Subsection 3.2.1, using the extended concept of Hamming distance
for coded computing.

As the first step, we define the Hamming distance of a (possibly


random) linear encoding scheme, denoted by d, as the maximum integer,
such that for any two distinct instances of input X that also generate
distinct outputs, and for any two possible realizations of the N encoding
functions, the encoded data differ for at least d workers.
It was shown in [168] that this Hamming distance behaves similar
to its counterpart in classical coding theory: an encoding scheme can
tolerate S stragglers and A adversarial workers (erroneous results)
whenever S + 2A ≤ d − 1. Therefore, for any encoding scheme that is
A secure, it has a Hamming distance of at least 2A + 1. Consequently,
it can tolerate up to 2A stragglers. Now recall from Lemma 3.6 that to
recover a multilinear function f of degree deg f , the maximum number
of stragglers any linear encoding scheme can tolerate is upper bounded
by N − (K − 1) deg f − 1 when N ≥ K deg f − 1, and upper bounded
by bN/Kc − 1 when N < K deg f − 1. Hence, a computation scheme
exists only if A ≤ (N − (K − 1) deg f − 1)/2 when N ≥ K deg f − 1,
and A ≤ N/2K − 1/2 when N < K deg f − 1.
To guarantee T -privacy, the LCC scheme pads the dataset X with
additional T random entries before coding; and this amount of random-
ness is shown to be minimal.

Theorem 4.3 (Optimal randomness). Any computing scheme that uni-


versally achieves the (T, A) tradeoff in (4.4)3 for all linear functions f
must use an amount of randomness no less than that of LCC.

Proof of Theorem 4.3. To prove Theorem 4.3, we demonstrate that


LCC uses the minimum possible randomness among all linear encoding
schemes that achieve the security-privacy tradeoff stated in Theorem
4.1 for linear f . Since the identity map is included in the class of linear
functions, one can employ previous results regarding private storage to
establish a lower bound on the required amount of randomness.
The proof is based on the result in [69, Chapter 3]. In what follows, an (n, k, r, z)_{F_q^t} secure RAID scheme is a storage scheme over F_q^t (where F_q is a field with q elements) in which k message symbols are

^3 That is, when the two sides of (4.4) are equal.

coded into n storage servers, such that the k message symbols are re-
constructible from any n − r servers, and any z servers are information
theoretically oblivious to the message symbols. Further, such a scheme is
assumed to use v random entries as keys, and by [69, Proposition 3.1.1],
must satisfy n − r ≥ k + z.

Theorem 4.4. [69, Theorem 3.2.1] A linear rate-optimal (n, k, r, z)_{F_q^t} secure RAID scheme uses at least zt keys over F_q (i.e., v ≥ z).

Clearly, in our scenario V can be seen as F_q^{dim V} for some q. Further, by setting N = n, T = z, and t = dim V, it follows from Theorem 4.4 that any encoding scheme which guarantees information-theoretic privacy against sets of T colluding workers must use at least T random entries {Z_i}_{i∈[T]}.

4.1.3 Comparison with Prior Works on Multiparty Computing


Providing security and privacy for multiparty computing (MPC) and
machine learning systems is an extensively studied topic. To illustrate
the significant role of LCC in secure and private computing, let us
consider the celebrated BGW MPC scheme [18].4
Given inputs {X_i}_{i=1}^{K}, BGW first uses Shamir's scheme [141] to encode the dataset in a privacy-preserving manner as P_i(z) = X_i + Z_{i,1} z + · · · + Z_{i,T} z^T for every i ∈ [K], where the Z_{i,j}'s are i.i.d. uniformly random variables and T is the number of colluding workers that should be tolerated. The key distinction between the data encoding of the BGW scheme and LCC is that LCC instead uses Lagrange polynomials to encode the data. This results in a significant reduction in the amount of randomness needed in data encoding (BGW needs KT random entries Z_{i,j}, while LCC only needs T).
The BGW scheme will then store {P_i(α_ℓ)}_{i=1}^{K} at worker ℓ for every ℓ ∈ [N], given some distinct values α_1, ..., α_N. The computation
is then carried out by evaluating f over all stored coded data at the
^4 Conventionally, the BGW scheme operates in a multi-round fashion, requiring significantly more communication overhead than one-shot approaches. For simplicity of comparison, we present a modified one-shot version of BGW.

Table 4.1: Comparison between BGW-based designs and LCC. The computational complexity is normalized by that of evaluating f; randomness, which refers to the number of random entries used in the encoding functions, is normalized by the length of X_i.

                          BGW        LCC
Complexity/worker         K          1
Frac. data/worker         1          1/K
Randomness                KT         T
Min. num. of workers      2T + 1     deg f · (K + T − 1) + 1

nodes. In the LCC scheme, on the other hand, each worker ℓ only needs to store one encoded data block X̃_ℓ and compute f(X̃_ℓ). This gives rise to the second key advantage of LCC, which is a factor-of-K reduction in storage overhead and computation complexity at each worker.

After computation, each worker ℓ in the BGW scheme has essentially evaluated the polynomials {f(P_i(z))}_{i=1}^{K} at z = α_ℓ, whose degree is at most deg f · T. Hence, if no adversary appears (i.e., A = 0), the master can recover all required results f(P_i(0)) through polynomial interpolation, as long as N ≥ deg f · T + 1 workers participated in the computation. It is also possible to use the conventional multi-round BGW, which only requires N ≥ 2T + 1 workers to ensure T-privacy. However, multiple rounds of computation and communication (Ω(log(deg f)) rounds) are needed, which further increases its communication overhead. Note that under the same condition, the LCC scheme requires N ≥ deg f · (K + T − 1) + 1 workers, which is larger than that of the BGW scheme.
Hence, in overall comparison with the BGW scheme, LCC results in
a factor of K reduction in the amount of randomness, storage overhead,
and computation complexity, while requiring more workers to guarantee
the same level of privacy. This is summarized in Table 4.1.5

^5 A BGW scheme was also proposed in [18] for secure MPC, however for a substantially different setting. Similarly, a comparison can be made by adapting it to our setting, leading to similar results, which we omit for brevity.

4.2 Privacy Preserving Machine Learning

We now illustrate an application of coded computing, in particular LCC,


for privacy-preserving machine learning. We consider a scenario in which
a data-owner (e.g., a hospital) wishes to train a logistic regression model
by offloading the large volume of data (e.g., healthcare records) and
computationally-intensive training tasks (e.g., gradient computations)
to N machines over a cloud platform, while ensuring that any collusions
between T out of N workers do not leak information about the dataset.
We illustrate a recently proposed scheme, named CodedPri-
vateML [145], that leverages coded computing for this problem. Coded-
PrivateML has three salient features:

1. it provides strong information-theoretic privacy guarantees for


both the training dataset and model parameters;

2. it enables fast training by distributing the training computation


load effectively across several workers;

3. it secret shares the dataset and model parameters using coding


and information theory principles, which significantly reduces the
training time.

4.2.1 The CodedPrivateML Framework


We consider the training of a logistic regression model.^6 The dataset is given by a matrix X = [x_1^T ... x_m^T]^T ∈ R^{m×d} of m data points with d features and a label vector y ∈ {0, 1}^m. The model parameters (weights) w ∈ R^d are obtained by minimizing the cross entropy function

C(w) = \frac{1}{m} \sum_{i=1}^{m} (−y_i \log ŷ_i − (1 − y_i) \log(1 − ŷ_i)),    (4.5)

where ŷ_i = g(x_i · w) ∈ (0, 1) is the estimated probability of label i being equal to 1 and g(z) = 1/(1 + e^{−z}) is the sigmoid function. Problem (4.5) can be solved via gradient descent, through an iterative process that updates the weights in the opposite direction of the gradient

^6 The analysis applies to linear regression with minor modifications.

Figure 4.2: The distributed coded training setup.

∇C(w) = \frac{1}{m} X^T (g(X × w) − y). The update function is given by

w^{(t+1)} = w^{(t)} − \frac{η}{m} X^T (g(X × w^{(t)}) − y),    (4.6)

where w^{(t)} holds the estimated parameters from iteration t, η is the learning rate, and g(·) operates element-wise.
We consider a master-worker distributed compute architecture shown
in Figure 4.2, where the master offloads the gradient computations in
(4.6) to N workers. In doing so, the master also wants to protect the
privacy of the dataset against any potential collusions between up to T
workers, where T is the privacy parameter of the system. Initially, the
dataset is partitioned into K submatrices X = [X_1^T . . . X_K^T]^T. The parameter K ∈ N reflects the amount of parallelization (the computation load at each worker is proportional to a 1/K-th fraction of the dataset). The master then creates N encoded matrices, {X̃_i}_{i∈[N]}, by combining the K parts of the dataset with some random matrices to preserve privacy, and sends X̃_i to worker i. At iteration t, the master also creates an encoded matrix W̃_i^{(t)} to secret share the current estimate of the weights w^{(t)} with worker i ∈ [N], as the weights can also leak substantial information about the dataset [112]. The coding strategy should ensure that any subset of T workers cannot learn any information, in the information-theoretic sense, about the

dataset. Formally, for every subset of workers T ⊆ [N] with |T| ≤ T, we need I(X; X̃_T, {W̃_T^{(t)}}_{t∈[J]}) = 0, where I is the mutual information, J is the number of iterations, and (X̃_T, {W̃_T^{(t)}}_{t∈[J]}) is the collection of coded matrices stored at the workers in T.

At each iteration, worker i ∈ [N] performs its computation locally using X̃_i and W̃_i^{(t)} and sends the result back to the master. After receiving the results from a sufficient number of workers, the master recovers X^T g(X × w^{(t)}) = \sum_{k=1}^{K} X_k^T g(X_k × w^{(t)}) and updates the


CodedPrivateML consists of the following four main phases.
Phase 1: Quantization. In order to guarantee information-theoretic privacy, one has to mask the dataset and weights in a finite field F using uniformly random matrices, so that the added randomness can make each data point appear equally likely. In contrast, the dataset and weights for the training task are defined in the domain of real numbers. Our solution to handle the conversion between the real and finite domains is through the use of stochastic quantization. Accordingly, in the first phase of our system, the master quantizes the dataset and weights from the real domain to the domain of integers, and then embeds them in a field F_p of integers modulo a prime p. The quantized version of the dataset X is denoted by X̄. The quantization of the weight vector w^{(t)}, on the other hand, is represented by a matrix W̄^{(t)}, where each column holds an independent stochastic quantization of w^{(t)}. This structure will be important for the convergence of the model. The parameter p is selected to be sufficiently large to avoid wrap-around in computations. Its value depends on the bitwidth of the machine as well as the number of additive and multiplicative operations. For example, in a 64-bit implementation, we select p = 33554393 (the largest prime with 25 bits) as explained in our experiments.
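A minimal sketch of this kind of stochastic quantization and field embedding is shown below; the number of fractional bits l_x and the sign convention for the embedding are illustrative assumptions rather than the exact CodedPrivateML parameters.

```python
import numpy as np

p = 33554393            # the 25-bit prime used in the 64-bit implementation
l_x = 8                 # number of fractional bits kept (illustrative choice)

def stochastic_quantize(A, l, p):
    """Round A * 2^l up or down at random (unbiased), then embed in F_p,
    mapping negative values to the upper half of the field."""
    scaled = A * (2 ** l)
    low = np.floor(scaled)
    q = low + (np.random.rand(*A.shape) < (scaled - low))   # round up w.p. fractional part
    return np.mod(q, p).astype(np.int64)

def dequantize(Q, l, p):
    """Map field elements back to signed reals (inverse of the embedding)."""
    signed = np.where(Q > p // 2, Q - p, Q)
    return signed / (2 ** l)

X = np.random.randn(4, 3)
Xq = stochastic_quantize(X, l_x, p)
# The quantizer is unbiased (E[dequantize(Xq)] = X) and the error is at most 2^{-l_x}.
print(np.max(np.abs(dequantize(Xq, l_x, p) - X)))   # <= 2 ** -l_x
```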
Phase 2: Encoding and Secret Sharing. In the second phase, the
master partitions the quantized dataset X into K submatrices and
encodes them using the LCC approach that we discussed in the previous
subsection. It then sends to worker i ∈ [N] a coded submatrix X̃_i ∈ F_p^{(m/K)×d}. As we discussed before, this encoding ensures that the coded

matrices do not leak any information about the true dataset, even if
T workers collude. In addition, the master has to ensure the weight
estimations sent to the workers at each iteration do not leak information
about the dataset. This is because the weights updated via (4.6) carry
information about the whole training set, and sending them directly to
the workers may breach privacy. In order to prevent this, at iteration t,
master also quantizes the current weight vector w(t) to the finite field
and encodes it again using Lagrange coding.
Phase 3: Polynomial Approximation and Local Computation.
In the third phase, each worker performs the computations using its
local storage and sends the result back to the master. We note that
the workers perform the computations over the encoded data as if
they were computing over the true dataset. That is, the structure of
the computations are the same for computing over the true dataset
versus computing over the encoded dataset. A major challenge is that
LCC is designed for distributed polynomial computations. However,
the computations in the training phase are not polynomials due to the
sigmoid function. We overcome this by approximating the sigmoid with
a polynomial of a selected degree r. This allows us to represent the
gradient computations in terms of polynomials that can be computed
locally by each worker.
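To illustrate this step, the sketch below fits a degree-r polynomial to the sigmoid by least squares over a fixed interval and plugs it into the gradient expression of (4.6); the fitting interval and the least-squares criterion are illustrative assumptions, not the specific fitting procedure of [145].

```python
import numpy as np

# Fit a degree-r polynomial approximation of the sigmoid over [-5, 5] (illustrative).
r = 1
z = np.linspace(-5, 5, 1001)
sigmoid = 1 / (1 + np.exp(-z))
coeffs = np.polyfit(z, sigmoid, r)          # g(z) ~ c1*z + c0 for r = 1
g_hat = np.poly1d(coeffs)

# Approximate gradient of (4.5): replacing g(X w) with the polynomial surrogate makes
# the per-worker computation a polynomial of the (encoded) data.
m, d = 200, 10
X = np.random.randn(m, d)
y = np.random.randint(0, 2, size=m)
w = np.random.randn(d)
grad_exact = X.T @ (1 / (1 + np.exp(-X @ w)) - y) / m
grad_approx = X.T @ (g_hat(X @ w) - y) / m
print(np.linalg.norm(grad_exact - grad_approx))   # small approximation error
```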
Phase 4: Decoding and Model Update. The master collects the
results from a subset of fastest workers and decodes the gradient. Then,
the master converts the gradient from finite to real domain, updates the
weight vector, and secret shares it with the workers for the next round.
Based on this design, we can obtain the following theoretical guar-
antees for the convergence and privacy of CodedPrivateML. We refer
to [145] for the details.

Lemma 4.5. Let p^{(t)} ≜ \frac{1}{m} X̄^T (ḡ(X̄, W̄^{(t)}) − y) be the gradient computation using the quantized weights W̄^{(t)} and the degree-r polynomial approximation ḡ in CodedPrivateML. Then,

• (Unbiasedness) Vector p^{(t)} is an asymptotically unbiased estimator of the true gradient: E[p^{(t)}] = ∇C(w^{(t)}) + ε(r), and ε(r) → 0 as r → ∞, where the expectation is taken over the quantization errors,

• (Variance bound) E[‖p^{(t)} − E[p^{(t)}]‖_2^2] ≤ \frac{2^{−2l_w}}{m^2} ‖X‖_F^2 ≜ σ^2, where ‖·‖_2 and ‖·‖_F denote the l_2-norm and the Frobenius norm, respectively.

Theorem 4.6. Consider the training of a logistic regression model in a distributed system with N workers with dataset X = (X_1, ..., X_K), initial weights w^{(0)}, and constant step size η = 1/L, where L ≜ \frac{1}{4}‖X‖_2^2. For any N ≥ (2r + 1)(K + T − 1) + 1, CodedPrivateML guarantees

• (Convergence) E[C(\frac{1}{J}\sum_{t=0}^{J} w^{(t)})] − C(w^*) ≤ \frac{‖w^{(0)} − w^*‖^2}{2ηJ} + ησ^2 in J iterations, with σ^2 from Lemma 4.5,

• (Privacy) X remains information-theoretically private against any T colluding workers, i.e., I(X; X̃_T, {W̃_T^{(t)}}_{t∈[J]}) = 0, ∀T ⊂ [N], |T| ≤ T.

Theorem 4.6 reveals an important trade-off between privacy (T )


and parallelization (K); that is, each additional worker can be utilized either for more privacy or for faster training.

4.2.2 Experimental Evaluation of CodedPrivateML


The performance of CodedPrivateML had been experimentally demon-
strated in [145] over Amazon EC2 Cloud Platform for training a logistic
regression model for image classification. In particular, CodedPrivateML
has been used for training the logistic regression model from (4.5) for
binary image classification on the CIFAR-10 [87] and GISETTE [60]
datasets to experimentally examine two things: the accuracy of Coded-
PrivateML and the performance gain in terms of training time over two
MPC-based benchmarks. The first one is based on the well-known BGW
protocol [18], whereas the second one is a more recent protocol from
[17, 40] that trade-offs offline calculations for a more efficient imple-
mentation. Both baselines utilize Shamir’s secret sharing scheme [141]
where the dataset is secret shared among the N workers.
CodedPrivateML parameters. There are several system param-
eters in CodedPrivateML that should be set. Given that a 64-bit
implementation was used in [145], the field size was selected to be

p = 33554393, which is the largest prime with 25 bits to avoid an


overflow on intermediate multiplications.
One needs to also set the parameter r, the degree of the polynomial
for approximating the sigmoid function. Both r = 1 and r = 2 were
considered in [145], and it was empirically observed that the degree
one approximation achieves good accuracy. Finally, one needs to select
T (privacy threshold) and K (amount of parallelization) in CodedPri-
vateML. As stated in Theorem 4.6, these parameters should satisfy
N ≥ (2r + 1)(K + T − 1) + 1. Given the choice of r = 1, two cases can
be considered:

• Case 1 (maximum parallelization). All resources are allocated to parallelization (faster training) by setting K = ⌊(N − 1)/3⌋ and T = 1.

• Case 2 (equal parallelization & privacy). Resources are split almost equally between parallelization and privacy, i.e., T = ⌊(N − 3)/6⌋ and K = ⌊(N + 2)/3⌋ − T.

With these parameters, the training time of CodedPrivateML has


been measured while increasing the number of workers N gradually.
The results are demonstrated in Figure 4.3, which shows the comparison
of CodedPrivateML with the [BH08] protocol from [17], which was the
faster of the two benchmarks. In particular, we make the following
observations.7

• CodedPrivateML provides substantial speedup over the MPC baselines, in particular, up to 4.4× and 5.2× with the CIFAR-10 and GISETTE datasets, respectively, while providing the same privacy threshold as the benchmarks (T = ⌊(N − 3)/6⌋ for Case 2).
Table 4.2 demonstrates the breakdown of the total runtime with
the CIFAR-10 dataset for N = 50 workers. In this scenario,
CodedPrivateML provides significant improvement in all three
categories of dataset encoding and secret sharing; communication
time between the workers and the master; and the computation
^7 For N = 10, all schemes have similar performance because the total amount of data stored at each worker is one third of the size of the whole dataset (K = 3 for CodedPrivateML and G = 3 for the benchmark).

(a) CIFAR-10 (for accuracy 81.35% with 50 iterations)

(b) GISETTE (for accuracy 97.50% with 50 iterations)

Figure 4.3: Performance gain of CodedPrivateML over the MPC baseline ([BH08] from [17]). The plot shows the total training time for different numbers of workers N.

time. The main reason for this is that, in the MPC baselines, the size
of the data processed at each worker is one third of the original
dataset, while in CodedPrivateML it is 1/K-th of the dataset.
This reduces the computational overhead of each worker while
computing matrix multiplications as well as the communication
overhead between the master and workers. We also observe that
a higher amount of speedup is achieved as the dimension of
the dataset becomes larger (CIFAR-10 vs. GISETTE datasets),
120 Coding for Security and Privacy

Table 4.2: (CIFAR-10) Breakdown of total runtime for N = 50

Protocol                    Enc. Time   Comm. Time   Comp. Time   Total
MPC using [BGW88]           202.78 s    31.02 s      7892.42 s    8127.07 s
MPC using [BH08]            201.08 s    30.25 s      1326.03 s    1572.34 s
CodedPrivateML (Case 1)     59.93 s     4.76 s       141.72 s     229.07 s
CodedPrivateML (Case 2)     91.53 s     8.30 s       235.18 s     361.08 s

suggesting CodedPrivateML to be well-suited for data-intensive


training tasks where parallelization is essential.

• The total runtime of CodedPrivateML decreases as the number


of workers increases. This is again due to the parallelization gain
of CodedPrivateML (i.e., increasing K while N increases). This
is not achievable in conventional MPC baselines, since the size of
data processed at each worker is constant for all N .

• CodedPrivateML provides up to 22.5× speedup over the BGW


protocol [18], as shown in Table 4.2 for the CIFAR-10 dataset
with N = 50 workers. This is due to the fact that BGW requires
additional communication between the workers to execute a degree
reduction phase for every multiplication operation.

The accuracy and convergence of CodedPrivateML were also experi-


mentally analyzed in [145]. Figure 4.4(a) illustrates the test accuracy of
the binary classification problem between plane and car images for the
CIFAR-10 dataset. With 50 iterations, the accuracy of CodedPrivateML
with degree one polynomial approximation and conventional logistic
regression are 81.35% and 81.75%, respectively. Figure 4.4(b) shows the
test accuracy for binary classification between digits 4 and 9 for the
GISETTE dataset. With 50 iterations, the accuracy of CodedPrivateML
with degree one polynomial approximation and conventional logistic
regression has the same value of 97.5%. Hence, CodedPrivateML has
comparable accuracy to conventional logistic regression while being
privacy preserving.
Figure 4.4: Comparison of the accuracy of CodedPrivateML (demonstrated for
Case 2 and N = 50 workers) vs. conventional logistic regression that uses the
sigmoid function without quantization: (a) CIFAR-10, binary classification
between car and plane images (using 9019 samples for training and 2000 samples
for testing); (b) GISETTE, binary classification between digits 4 and 9 (using
6000 samples for training and 1000 samples for testing).
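To make the role of the degree-one approximation concrete, the following sketch
(our own illustration; the fitting interval, least-squares fit, and toy data
are assumptions, not the exact CodedPrivateML construction) replaces the
sigmoid in the logistic-regression gradient by a degree-one polynomial.

import numpy as np

# Fit a degree-one polynomial c0 + c1*z to the sigmoid over an assumed interval.
z = np.linspace(-5, 5, 1001)
c1, c0 = np.polyfit(z, 1.0 / (1.0 + np.exp(-z)), deg=1)

def gradient(X, y, w, use_poly=True):
    """One logistic-regression gradient; the surrogate avoids the exponential."""
    s = X @ w
    pred = c0 + c1 * s if use_poly else 1.0 / (1.0 + np.exp(-s))
    return X.T @ (pred - y) / len(y)

# Toy data, only to exercise both variants of the update.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)
w = np.zeros(10)
for _ in range(50):
    w -= 0.1 * gradient(X, y, w, use_poly=True)

With a degree-one surrogate, the end-to-end model update remains a low-degree
polynomial of the data and the weights, which is what allows it to be evaluated
on encoded data; the experiments above indicate that the accuracy penalty of
this substitution is small.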

Figure 4.5 presents the cross-entropy loss of CodedPrivateML versus
the conventional logistic regression model for the GISETTE dataset. The
latter setup uses the sigmoid function and no polynomial approximation;
in addition, no quantization is applied to the dataset or the weight
vectors. We observe that CodedPrivateML converges at a rate comparable
to that of conventional logistic regression, while being privacy preserving.

Figure 4.5: Convergence of CodedPrivateML (demonstrated for Case 2 and N = 50
workers) vs. conventional logistic regression (using the sigmoid function
without polynomial approximation or quantization).

4.3 Related Works and Open Problems

Security and privacy issues have been extensively studied in the
literature on secure multiparty computing and distributed machine
learning/data mining [18, 37, 38, 102, 114]. For instance, the celebrated
BGW scheme [18] employs Shamir's secret sharing scheme [141] to privately
share intermediate results between parties. As we have elaborated in
Subsection 4.1, the proposed LCC scheme significantly improves upon BGW
in the required storage overhead, computational complexity, and the
amount of injected randomness (Table 4.1).
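As a reference point for the discussion above, the following is a minimal
sketch of Shamir's threshold secret sharing over a prime field (a textbook
construction; the prime and function names are our own illustrative choices).

import random

P = 2**61 - 1                                # Mersenne prime used as the field size

def share(secret, n_shares, t):
    """Hide `secret` in the constant term of a random degree-t polynomial;
    any t + 1 shares determine it, while any t shares reveal nothing."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t)]
    return [(x, sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P)
            for x in range(1, n_shares + 1)]

def reconstruct(shares):
    """Lagrange interpolation at 0 over GF(P)."""
    total = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

shares = share(secret=123456789, n_shares=5, t=2)
assert reconstruct(shares[:3]) == 123456789  # any 3 of the 5 shares suffice

In BGW, every intermediate value is kept in such a shared form, and each
multiplication of shared values requires an interactive degree-reduction step,
which is the overhead that the one-shot encoding of LCC avoids.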
There have also been several other recent works on coded computing
under privacy and security constraints. Extending the research
works on secure storage (see, e.g., [122, 140, 128]), staircase codes [21]
have been proposed to combat stragglers in linear computations (e.g.,
matrix-vector multiplications) while preserving data privacy, and were
shown to reduce the computation latency compared with schemes based on
classical secret sharing strategies [110, 141]. The Lagrange Coded
Computing scheme proposed in this section generalizes staircase codes
beyond linear computations. Even for the linear case, LCC guarantees data
privacy against T colluding workers by introducing less randomness
than [21] (T rather than T K/(K − T )).
Beyond linear computations, a secure coded computing scheme was
proposed in [160] to achieve data security for distributed matrix–matrix
multiplication. Leveraging the polynomial code proposed in [167], the
secure computing scheme in [160] achieves the order-optimal recovery
threshold, while preserving data privacy at each worker (i.e., T = 1).
For computing a more general class of matrix polynomials, [117] has
combined ideas from the BGW scheme and [167] to form the so-called
polynomial sharing, a private coded computation scheme for arbitrary
matrix polynomials. However, polynomial sharing inherits the undesired
BGW property of performing a communication round for every linear
and every bilinear operation in the polynomial, a feature that drastically
increases the communication overhead and is circumvented by the one-shot
approach of LCC.
DRACO [31] was proposed as a secure distributed training algorithm
that is robust to Byzantine faults. Since DRACO is designed for general
gradient computations, it employs a blackbox approach, i.e., the coding
is applied to the gradients computed from uncoded data, but not to
the data itself, which is similar to the gradient coding techniques [63,
94, 127, 147, 164] designed primarily for stragglers. Hence, the inherent
algebraic structure of the gradients is ignored. For this approach, [31]
shows that a 2A + 1 multiplicative factor of redundant computations is
necessary to be robust against A Byzantine workers. The proposed LCC,
in contrast, discards the blackbox approach in favor of an algebraic one,
and consequently a 2A additive factor suffices.
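As a rough numerical illustration (our own example, assuming a degree-one
computation and the LCC recovery threshold of (K − 1) deg f + 2A + 1 workers in
the presence of A adversaries), take K = 10 data blocks and A = 2 Byzantine
workers: the blackbox approach then requires (2A + 1)K = 50 worker computations
in total, whereas the algebraic approach needs only (K − 1) + 2A + 1 = 14.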
We end this subsection by highlighting a few other interesting
directions for future research.
Extension to real-field computations. Most of the works that we
have described so far rely on quantizing the data into a finite field, so that
the coded computing approaches for secure and private computing can
then be employed. These approaches, however, can result in substantial
accuracy losses due to quantization, fixed-point representation of the
data, and computation overflows (see, e.g., [51]). An important research
direction would be to develop coded computing techniques for secure
and private computing in the real-field domain.
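The following small sketch (our own illustration; the field size, scaling
factor, and data are arbitrary assumptions) shows the two effects mentioned
above for a single inner product: rounding error from the fixed-point
embedding, and overflow when the result no longer fits inside the field.

import numpy as np

P = 2**31 - 1              # prime field size (assumed)
SCALE = 2**8               # fixed-point scaling factor (assumed)

def quantize(x):
    """Map reals to field elements by rounding; negatives wrap modulo P."""
    return np.round(x * SCALE).astype(np.int64) % P

def dequantize(q, levels=1):
    """Map a field element back to a signed real, undoing `levels` scalings."""
    q = int(q) - P if q > P // 2 else int(q)
    return q / (SCALE ** levels)

rng = np.random.default_rng(0)
x, y = rng.normal(size=1000), rng.normal(size=1000)
exact = float(x @ y)

# Inner product computed entirely inside the finite field.
in_field = int(np.sum((quantize(x) * quantize(y)) % P) % P)
approx = dequantize(in_field, levels=2)     # two multiplications worth of scaling

print(exact, approx)   # close, up to rounding error...
# ...but recovery silently fails (overflow) once |exact| * SCALE**2 >= P / 2.

Choosing SCALE and P involves an accuracy-versus-overflow tradeoff, which is
precisely the issue that a real-field coded computing approach would sidestep.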
Application to deep neural network training. Another barrier in
the theory of secure and private computing and machine learning is
their efficient generalization to non-polynomial computations. Many nonlinear
threshold functions that arise in machine learning, in particular
rectified linear unit (ReLU) functions in deep neural networks, cannot be
approximated well with low degree polynomials. Therefore, finding new
coded computing approaches that enable secure and private computing
for such classes of computations would be of great interest.
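To see the difficulty concretely, the short sketch below (our own illustration;
the interval, grid, and degrees are arbitrary choices) fits least-squares
polynomials of increasing degree to the ReLU function; the error around the
kink at zero decays only slowly with the degree.

import numpy as np

z = np.linspace(-5, 5, 2001)
relu = np.maximum(z, 0.0)

for deg in (1, 2, 3, 5, 9):
    coeffs = np.polyfit(z, relu, deg)            # least-squares polynomial fit
    err = np.abs(np.polyval(coeffs, z) - relu)
    print(f"degree {deg}: max error {err.max():.3f}, mean error {err.mean():.3f}")

In contrast, the smooth sigmoid admitted an accurate degree-one surrogate in
Subsection 4.2, which is why extending coded computing beyond such
polynomial-friendly nonlinearities remains open.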
Application to large-scale federated learning. Another interesting
application of coding for secure and private computing would be the
federated learning problem that has attracted a lot of attention recently.
Federated learning is an emerging approach that enables model training
over a large volume of decentralized data residing in mobile devices,
while protecting the privacy of the individual users [26, 27, 76, 111].
A major bottleneck in scaling secure federated learning to a large number
of users is the overhead of secure model aggregation across many users.
An interesting direction is to understand the role of coded computing
for efficient, scalable, and secure model aggregation in large federated
systems. Some promising work in this direction has been initiated in [144].
Application to blockchain systems. Secure, private, and verifiable
computing are also of great importance in decentralized blockchain
systems. Today’s blockchain designs suffer from a trilemma: no
blockchain system can simultaneously achieve decentralization,
security, and performance scalability. For current blockchain systems, as
more nodes join the network, the efficiency of the system (computation,
communication, and storage) stays constant at best. A leading idea
for enabling blockchains to scale efficiency is the notion of sharding:
different subsets of nodes handle different portions of the blockchain,
thereby reducing the load for each individual node. However, existing
sharding proposals achieve efficiency scaling by compromising on trust –
corrupting the nodes in a given shard will lead to the permanent loss of
the corresponding portion of data. Coded computing can provide an

effective approach for overcoming such barriers in distributed systems.


Coded computing can also provide new approaches for dealing with the
issues of computation verification and data availability in decentralized
blockchain systems. Several recent works in these directions have been
initiated in [34, 75, 99, 113, 134, 135, 165].
Acknowledgements

We would like to thank several colleagues and students whose comments
and collaborations greatly helped this monograph. In particular, we
would like to acknowledge the collaborations with B. Guler, M. Maddah-Ali,
P. Mohassel, R. Pedarsani, N. Raviv, J. So, and Q. Yu. The authors
would also like to thank the anonymous reviewer for her/his valuable
comments.
We would like to also thank several funding agencies for enabling this
research. In particular, the content of this monograph is based upon work
supported by National Science Foundation (grants CCF-1703575 and
CCF-1763673), Defense Advanced Research Projects Agency (DARPA)
(award HR001117C0053), Office of Naval Research (award N00014-16-
1-2189), Army Research Office (award W911NF1810400), and Intel
Labs.

Appendices
A
Proof of Lemma 3.6
Lower Bound on the Recovery Threshold of Computing Multilinear Functions

Before we start the proof, we let Kf∗ (K, N ) denote the minimum recovery
threshold given the function f , the number of computations K, and the
number of workers N .
We now proceed to prove Lemma 3.6 by induction.
(a) When d = 1, f is a linear function, and we aim to prove
Kf∗ (K, N ) ≥ K. Assume the opposite; then we can find a computation
design such that, for a subset N of at most K − 1 workers, there is
a decoding function that computes all f (Xi )’s given the results from
the workers in N .
Because the encoding functions are linear, we can thus find a non-zero
vector (a1 , . . . , aK ) ∈ F^K such that, when Xi = ai V for any V ∈ V,
the coded variable X̃i stored by every worker in N equals 0. This leads
to a fixed output from the decoder. On the other hand, because f is
assumed to be non-zero, the computing results {f (Xi )}i∈[K] vary
with the value of V , which leads to a contradiction. Hence, we have
Kf∗ (K, N ) ≥ K.
(b) Suppose we have a matching converse for any multilinear function
with d = d′. We now prove the lower bound for any non-zero multilinear
function f of degree d′ + 1. The proof idea is to construct a multilinear
function f ′ with degree d′ based on the function f , and to lower bound the
minimum recovery threshold of f using that of f ′. More specifically,
this is done by showing that, given any computation design for the function
f , a computation design can also be developed for the corresponding
f ′, which achieves a recovery threshold that is related to that of the
scheme for f .
In particular, for any non-zero function f (Xi,1 , Xi,2 , . . . , Xi,d′+1 ),
we can find V ∈ V such that f (Xi,1 , Xi,2 , . . . , Xi,d′ , V ), as a function of
(Xi,1 , Xi,2 , . . . , Xi,d′ ), is non-zero. We define f ′(Xi,1 , Xi,2 , . . . , Xi,d′ ) =
f (Xi,1 , Xi,2 , . . . , Xi,d′ , V ), which is a multilinear function of degree d′.
Given parameters K and N , we now develop a computation strategy for
f ′ for a dataset of K inputs and a cluster of N ′ ≜ N − K workers, which
achieves a recovery threshold of Kf∗ (K, N ) − (K − 1). We construct this
computation strategy based on an encoding strategy of f that achieves
the recovery threshold Kf∗ (K, N ). Because the encoding functions are
linear, we consider the encoding matrix, denoted by G ∈ F^{K×N}, which is
defined by the coefficients of the encoding functions X̃i = Σ_{j=1}^{K} Xj Gji .
Following the same arguments we used in the d = 1 case, the left null
space of G must be {0}. Consequently, the rank of G equals K, and we
can find a subset K of K workers such that the corresponding columns
of G form a basis of F^K. We construct a computation scheme for f ′
with N ′ ≜ N − K workers, each of whom stores the coded version
of (Xi,1 , Xi,2 , . . . , Xi,d′ ) that is stored by a unique respective worker in
[N ] \ K in the computation scheme of f .
Now it suffices to prove that the above construction achieves a
recovery threshold of Kf∗ (K, N ) − (K − 1). Equivalently, we need to
prove that, given any subset S of [N ]\K of size Kf∗ (K, N ) − (K − 1),
the values of f (Xi,1 , Xi,2 , . . . , Xi,d′ , V ) for i ∈ [K] are decodable from
the computing results of the workers in S.
We now exploit the decodability of the computation design for the
function f . For any j ∈ K, the set S ∪ K\{j} has size Kf∗ (K, N ).
Consequently, for any vector a = (a1 , . . . , aK ) ∈ F^K, by letting
Xi,d′+1 = ai V , we have that {ai f (Xi,1 , Xi,2 , . . . , Xi,d′ , V )}i∈[K] is
decodable given the computing results from the workers in S ∪ K\{j}.
Moreover, for any j ∈ K, let a^(j) ∈ F^K be a non-zero vector that is
orthogonal to all columns of G with indices in K\{j}; then the workers
in K\{j} store 0 in the Xi,d′+1 entry, and return the constant 0 due
to the multilinearity of f . Consequently, for any j ∈ K,
{ai^(j) f (Xi,1 , Xi,2 , . . . , Xi,d′ , V )}i∈[K] is decodable from the
computing results of the workers in S alone.
Because the columns of G with indices in K form a basis of F^K, the
vectors a^(j) for j ∈ K also form a basis. Consequently,
f ′(Xi,1 , Xi,2 , . . . , Xi,d′ ), which equals f (Xi,1 , Xi,2 , . . . , Xi,d′ , V ),
is also decodable given the results from the workers in S, for any
i ∈ [K]. On the other hand, note that the computing results of each
worker in S given each a^(j) can also be computed using the results
from the same workers when computing f ′. Hence, the decoder for the
function f ′ can first recover the computing results of the workers in
S for the function f , and then proceed to decoding the final result.
Thus we have completed the proof of decodability.
To summarize, we have essentially proved that Kf∗ (K, N ) − (K − 1) ≥
Kf′∗ (K, N − K) when N ≥ 2K, and Kf∗ (K, N ) − (K − 1) > N − K
otherwise. Hence, we verify that the converse bound for Kf∗ (K, N ) holds
for any function f of degree d′ + 1, given the above result and the
induction assumption.
(c) Thus, the converse bound in Lemma 3.6 holds for any d ∈ N+ .
References

[1] Agarwal, A. and A. Mazumdar (2016). “Local partial clique


and cycle covers for index coding”. In: 2016 IEEE Globecom
Workshops (GC Wkshps). 1–6.
[2] Ahlswede, R., N. Cai, S.-Y. R. Li, and R. W. Yeung (2000).
“Network information flow”. IEEE Transactions on Information
Theory. 46(4): 1204–1216.
[3] Ahmad, F., S. T. Chakradhar, A. Raghunathan, and T. Vijayku-
mar (2012). “Tarazu: Optimizing MapReduce on heterogeneous
clusters”. ACM SIGARCH Computer Architecture News. 40(1):
61–74.
[4] Aktas, M. F., P. Peng, and E. Soljanin (2018). “Straggler miti-
gation by delayed relaunch of tasks”. ACM SIGMETRICS Per-
formance Evaluation Review. 45(2): 224–231.
[5] Alistarh, D., D. Grubic, J. Li, R. Tomioka, and M. Vojnovic
(2017). “QSGD: Communication-efficient SGD via gradient quan-
tization and encoding”. Advances in Neural Information Pro-
cessing Systems (NIPS): 1707–1718.
[6] Alpatov, P., G. Baker, C. Edwards, J. Gunnels, G. Morrow,
J. Overfelt, R. van de Geijn, and Y.-J. J. Wu (1997). “PLAPACK:
Parallel linear algebra package design overview”. In: Proceedings
of the 1997 ACM/IEEE Conference on Supercomputing. ACM.
1–16.


[7] “Amazon Elastic Compute Cloud (EC2)” (n.d.). https://aws.amazon.com/ec2/. Accessed on Jan. 30, 2018.
[8] Ananthanarayanan, G., A. Ghodsi, S. Shenker, and I. Stoica
(2013). “Effective straggler mitigation: Attack of the clones”. In:
10th USENIX Symposium on Networked Systems Design and
Implementation. 185–198.
[9] Ananthanarayanan, G., S. Kandula, A. G. Greenberg, I. Stoica,
Y. Lu, B. Saha, and E. Harris (2010). “Reining in the outliers in
map-reduce clusters using Mantri”. In: OSDI. Vol. 10. No. 1. 24.
[10] “Apache Hadoop” (n.d.). http://hadoop.apache.org. Accessed
on Jan. 30, 2018.
[11] Arbabjolfaei, F. and Y. Kim (2018). “Fundamentals of index
coding”. Foundations and Trends R in Communications and
Information Theory. 14(3–4): 163–346.
[12] Attia, M. A. and R. Tandon (2016). “Information theoretic
limits of data shuffling for distributed learning”. In: IEEE Global
Communications Conference (GLOBECOM). 1–6.
[13] Baktir, S. and B. Sunar (2006). “Achieving efficient polynomial
multiplication in fermat fields using the fast fourier transform”.
In: Proceedings of the 44th Annual Southeast Regional Conference.
ACM. 549–554.
[14] Ballard, G., E. Carson, J. Demmel, M. Hoemmen, N. Knight,
and O. Schwartz (2014). “Communication lower bounds and
optimal algorithms for numerical linear algebra”. Acta Numerica.
23: 1–155.
[15] Bar-Yossef, Z., Y. Birk, T. Jayram, and T. Kol (2011). “Index
coding with side information”. IEEE Transactions on Informa-
tion Theory. 57(3): 1479–1494.
[16] Becker, K. and U. Wille (1998). “Communication complexity
of group key distribution”. In: Proceedings of the 5th ACM
Conference on Computer and Communications Security. 1–6.
[17] Beerliova-Trubiniova, Z. and M. Hirt (2008). “Perfectly-secure
MPC with linear communication complexity”. In: Theory of
Cryptography Conference. Springer. 213–230.

[18] Ben-Or, M., S. Goldwasser, and A. Wigderson (1988). “Complete-


ness theorems for non-cryptographic fault-tolerant distributed
computation”. In: Proceedings of the Twentieth Annual ACM
Symposium on Theory of Computing. ACM. 1–10.
[19] Bernstein, J., Y.-X. Wang, K. Azizzadenesheli, and A. Anandku-
mar (2018). “signSGD: Compressed optimisation for non-convex
problems”. In: Proceedings of the 35th International Conference
on Machine Learning. Ed. by J. Dy and A. Krause. Vol. 80.
Proceedings of Machine Learning Research. Stockholmsmässan,
Stockholm Sweden: PMLR. 560–569. url: http://proceedings.mlr.press/v80/bernstein18a.html.
[20] Birk, Y. and T. Kol (2006). “Coding on demand by an informed
source (ISCOD) for efficient broadcast of different supplemental
data to caching clients”. IEEE Transactions on Information
Theory. 52(6): 2825–2830.
[21] Bitar, R., P. Parag, and S. E. Rouayheb (2020). “Minimizing
latency for secure coded computing using secret sharing via
staircase codes”. IEEE Transactions on Communications.
[22] Blackford, L. S., J. Choi, A. Cleary, E. D’Azevedo, J. Demmel,
I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet,
K. Stanley, D. Walker, and R. C. Whaley (1997). ScaLAPACK
Users’ Guide. SIAM.
[23] Blanchard, P., E.-M. El Mhamdi, R. Guerraoui, and
J. Stainer (2017a). “Byzantine-tolerant machine learning”.
preprint arXiv:1703.02757.
[24] Blanchard, P., E.-M. El Mhamdi, R. Guerraoui, and J. Stainer
(2017b). “Machine learning with adversaries: Byzantine tolerant
gradient descent”. In: Advances in Neural Information Processing
Systems. 118–128.
[25] Bogdanov, D., S. Laur, and J. Willemson (2008). “Sharemind:
A framework for fast privacy-preserving computations”. In:
Proceedings of the 13th European Symposium on Research in
Computer Security: Computer Security. ESORICS ’08. Spain:
Springer-Verlag. 192–206.

[26] Bonawitz, K., V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMa-


han, S. Patel, D. Ramage, A. Segal, and K. Seth (2016). “Practi-
cal secure aggregation for federated learning on user-held data”.
In: Conference on Neural Information Processing Systems.
[27] Bonawitz, K., V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMa-
han, S. Patel, D. Ramage, A. Segal, and K. Seth (2017). “Practi-
cal secure aggregation for privacy-preserving machine learning”.
In: ACM SIGSAC Conference on Computer and Communications
Security. ACM. 1175–1191.
[28] Bonomi, F., R. Milito, J. Zhu, and S. Addepalli (2012). “Fog
computing and its role in the internet of things”. In: Proceed-
ings of the 1st Edition of the MCC Workshop on Mobile Cloud
Computing. ACM. 13–16.
[29] Charles, Z., D. Papailiopoulos, and J. Ellenberg (2017). “Ap-
proximate gradient coding via sparse random graphs”. preprint
arXiv:1711.06771.
[30] Chaubey, M. and E. Saule (2015). “Replicated data placement
for uncertain scheduling”. In: IEEE International Parallel and
Distributed Processing Symposium Workshop. 464–472.
[31] Chen, L., H. Wang, Z. Charles, and D. Papailiopoulos (2018).
“DRACO: Robust distributed training via redundant gradients”.
e-print arXiv:1803.09877.
[32] Chiang, M. and T. Zhang (2016). “Fog and IoT: An overview of
research opportunities”. IEEE Internet of Things Journal. 3(6):
854–864.
[33] Chilimbi, T. M., Y. Suzue, J. Apacible, and K. Kalyanaraman
(2014). “Project Adam: Building an efficient and scalable deep
learning training system”. In: 11th USENIX Symposium on Op-
erating Systems Design and Implementation. Vol. 14. 571–582.
[34] Choi, B., J.-Y. Sohn, D.-J. Han, and J. Moon (2019). “Scalable
network-coded PBFT consensus algorithm”. In: 2019 IEEE In-
ternational Symposium on Information Theory (ISIT). IEEE.
857–861.
[35] Choi, J., J. Dongarra, and D. Walker (1996). “PB-BLAS: A set
of parallel block basic linear algebra subprograms”. Concurrency:
Practice and Experience. 8(7): 517–535.

[36] Chowdhury, M., M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica


(2011). “Managing data transfers in computer clusters with or-
chestra”. ACM SIGCOMM Computer Communication Review.
41(4): 98–109.
[37] Cramer, R., I. Damgård, and J. B. Nielsen (2001). “Multiparty
computation from threshold homomorphic encryption”. In: In-
ternational Conference on the Theory and Applications of Cryp-
tographic Techniques. Springer. 280–300.
[38] Cramer, R., I. B. Damgård, and J. B. Nielsen (2015). Secure Mul-
tiparty Computation and Secret Sharing. Cambridge University
Press.
[39] Dalcin, L. D., R. R. Paz, P. A. Kler, and A. Cosimo (2011). “Par-
allel distributed computing using python”. Advances in Water
Resources. 34(9): 1124–1139.
[40] Damgård, I. and J. B. Nielsen (2007). “Scalable and uncondition-
ally secure multiparty computation”. In: International Cryptology
Conference. Springer. 572–590.
[41] Dean, J. and L. A. Barroso (2013). “The tail at scale”. Commu-
nications of the ACM. 56(2): 74–80.
[42] Dean, J. and S. Ghemawat (2004). “MapReduce: Simplified
data processing on large clusters”. Sixth USENIX Symposium
on Operating System Design and Implementation.
[43] Demmel, J., L. Grigori, M. Hoemmen, and J. Langou (2012).
“Communication-optimal parallel and sequential QR and LU
factorizations”. SIAM Journal on Scientific Computing. 34(1):
A206–A239.
[44] Dimakis, A. G., J. Wang, and K. Ramchandran (2007). “Unequal
growth codes: Intermediate performance and unequal error pro-
tection for video streaming”. In: Multimedia Signal Processing,
2007. MMSP 2007. IEEE 9th Workshop on. IEEE. 107–110.
[45] “Distributed Algorithms and Optimization Lecture Notes” (n.d.).
https://stanford.edu/~rezab/classes/cme323/S16/notes/Lecture16/Pregel_GraphX.pdf. Accessed on July 11, 2018.
[46] Dutta, S., Z. Bai, T. M. Low, and P. Grover (2019). “CodeNet:
Training large scale neural networks in presence of soft-errors”.
preprint arXiv:1903.01042.

[47] Dutta, S., V. Cadambe, and P. Grover (2016). “Short-dot: Com-


puting large linear transforms distributedly using coded short dot
products”. Advances in Neural Information Processing Systems
(NIPS): 2100–2108.
[48] Dutta, S., V. Cadambe, and P. Grover (2017). “Coded convolu-
tion for parallel and distributed computing within a deadline”. In:
IEEE International Symposium on Information Theory (ISIT).
IEEE. 2403–2407.
[49] Ekanayake, J., H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu,
and G. Fox (2010). “Twister: A runtime for iterative MapReduce”.
Proceedings of the 19th ACM International Symposium on High
Performance Distributed Computing. June: 810–818.
[50] Ezzeldin, Y. H., M. Karmoose, and C. Fragouli (2017). “Com-
munication vs. distributed computation: An alternative trade-off
curve”. In: IEEE Information Theory Workshop (ITW). IEEE.
279–283.
[51] Fahim, M. and V. R. Cadambe (2019). “Numerically stable
polynomially coded computing”. In: 2019 IEEE International
Symposium on Information Theory (ISIT). 3017–3021.
[52] Fahim, M., H. Jeong, F. Haddadpour, S. Dutta, V. Cadambe, and
P. Grover (2017). “On the optimal recovery threshold of coded
matrix multiplication”. In: 55th Annual Allerton Conference.
IEEE. 1264–1270.
[53] Al-Fares, M., S. Radhakrishnan, B. Raghavan, N. Huang, and
A. Vahdat (2010). “Hedera: Dynamic flow scheduling for data
center networks”. 7th USENIX Symposium on Networked Sys-
tems Design and Implementation. Apr.
[54] Ferdinand, N. and S. C. Draper (2018). “Hierarchical coded
computation”. In: 2018 IEEE International Symposium on In-
formation Theory (ISIT). 1620–1624.
[55] Gardner, K., S. Zbarsky, S. Doroudi, M. Harchol-Balter, and
E. Hyytia (2015). “Reducing latency via redundant requests:
Exact analysis”. ACM SIGMETRICS Performance Evaluation
Review. 43(1): 347–360.

[56] Gemulla, R., E. Nijkamp, P. J. Haas, and Y. Sismanis (2011).


“Large-scale matrix factorization with distributed stochastic gra-
dient descent”. In: Proceedings of the 17th ACM SIGKDD Inter-
national Conference on Knowledge Discovery and Data Mining.
ACM. 69–77.
[57] Greenberg, A., J. R. Hamilton, N. Jain, S. Kandula, C. Kim,
P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta (2009). “VL2:
A scalable and flexible data center network”. ACM SIGCOMM
Computer Communication Review. 39(4): 51–62.
[58] Guo, Y., J. Rao, and X. Zhou (2013). “iShuffle: Improving
Hadoop performance with shuffle-on-write”. In: Proceedings of
the 10th International Conference on Autonomic Computing.
107–117.
[59] Gupta, V., S. Wang, T. Courtade, and K. Ramchandran (2018).
“OverSketch: Approximate matrix multiplication for the cloud”.
In: 2018 IEEE International Conference on Big Data (Big Data).
298–304.
[60] Guyon, I., S. Gunn, A. Ben-Hur, and G. Dror (2005). “Result
analysis of the NIPS 2003 feature selection challenge”. Advances
in Neural Information Processing Systems (NIPS): 545–552.
[61] Haddadpour, F. and V. R. Cadambe (2018). “Codes for dis-
tributed finite alphabet matrix-vector multiplication”. In: 2018
IEEE International Symposium on Information Theory (ISIT).
1625–1629.
[62] “Hadoop TeraSort” (n.d.). https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html. Accessed on Jan. 30, 2018.
[63] Halbawi, W., N. Azizan-Ruhi, F. Salehi, and B. Hassibi (2017).
“Improving distributed gradient descent using Reed–Solomon
codes”. e-print arXiv:1706.05436.
[64] Halpern, J. and V. Teague (2004). “Rational secret sharing and
multiparty computation”. In: Proceedings of the Thirty-Sixth
Annual ACM Symposium on Theory of Computing. ACM. 623–
632.

[65] He, K., X. Zhang, S. Ren, and J. Sun (2016). “Deep residual
learning for image recognition”. IEEE Conference on Computer
Vision and Pattern Recognition: 770–778.
[66] Ho, T., R. Koetter, M. Medard, D. R. Karger, and M. Effros
(2003). “The benefits of coding over routing in a randomized
setting”. IEEE International Symposium on Information Theory.
June: 442.
[67] Huang, K.-H. and J. A. Abraham (1984). “Algorithm-based
fault tolerance for matrix operations”. IEEE Transactions on
Computers. C-33(6): 518–528.
[68] Huang, L., A. D. Joseph, B. Nelson, B. I. Rubinstein, and
J. Tygar (2011). “Adversarial machine learning”. In: Proceedings
of the 4th ACM Workshop on Security and Artificial Intelligence.
ACM. 43–58.
[69] Huang, W. (2017). “Coding for security and reliability in dis-
tributed systems”. PhD thesis. California Institute of Technology.
[70] Jahani-Nezhad, T. and M. A. Maddah-Ali (2019). “CodedSketch:
Coded distributed computation of approximated matrix multipli-
cation”. In: 2019 IEEE International Symposium on Information
Theory (ISIT). 2489–2493.
[71] Jeong, H., T. M. Low, and P. Grover (2018). “Masterless coded
computing: A fully-distributed coded FFT algorithm”. In: 2018
56th Annual Allerton Conference on Communication, Control,
and Computing (Allerton). 887–894.
[72] Ji, M., G. Caire, and A. F. Molisch (2016). “Fundamental limits
of caching in wireless D2D networks”. IEEE Transactions on
Information Theory. 62(2): 849–869.
[73] Joshi, G., E. Soljanin, and G. Wornell (2017). “Efficient re-
dundancy techniques for latency reduction in cloud systems”.
ACM Transactions on Modeling and Performance Evaluation of
Computing Systems (TOMPECS). 2(2): 12.
[74] Jou, J.-Y. and J. A. Abraham (1986). “Fault-tolerant matrix
arithmetic and signal processing on highly concurrent computing
structures”. Proceedings of the IEEE. 74(5): 732–741.

[75] Kadhe, S., J. Chung, and K. Ramchandran (2019). “SeF: A secure


fountain architecture for slashing storage costs in blockchains”.
preprint arXiv:1906.12140.
[76] Kairouz, P., H. B. McMahan, B. Avent, A. Bellet, M. Bennis,
A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cum-
mings, R. G. L. D’Oliveira, S. El Rouayheb, D. Evans, J. Gardner,
Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser,
Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu,
M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konečný, A. Ko-
rolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal,
M. Mohri, R. Nock, A. Özgür, R. Pagh, M. Raykova, H. Qi,
D. Ramage, R. Raskar, D. Song, W. Song, S. U. Stich, Z. Sun,
A. T. Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong,
Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao (2019). “Ad-
vances and open problems in federated learning”. preprint
arXiv:1912.04977.
[77] Kamra, A., V. Misra, J. Feldman, and D. Rubenstein (2006).
“Growth codes: Maximizing sensor network data persistence”.
ACM SIGCOMM Computer Communication Review. 36(4): 255–
266.
[78] Karakus, C., Y. Sun, S. Diggavi, and W. Yin (2017). “Straggler
mitigation in distributed optimization through data encoding”.
Advances in Neural Information Processing Systems (NIPS):
5440–5448.
[79] Karamchandani, N., U. Niesen, M. A. Maddah-Ali, and S. Dig-
gavi (2014). “Hierarchical coded caching”. IEEE International
Symposium on Information Theory. June: 2142–2146.
[80] Kedlaya, K. S. and C. Umans (2011). “Fast polynomial factoriza-
tion and modular composition”. SIAM Journal on Computing.
40(6): 1767–1802.
[81] Kiamari, M., C. Wang, and A. S. Avestimehr (2017). “On het-
erogeneous coded distributed computing”. IEEE GLOBECOM.
Dec.

[82] Kim, S. and S. Lee (2009). “Improved intermediate performance


of rateless codes”. In: Advanced Communication Technology,
2009. ICACT 2009. 11th International Conference on. Vol. 3.
IEEE. 1682–1686.
[83] Koetter, R. and M. Medard (2003). “An algebraic approach
to network coding”. IEEE/ACM Transactions on Networking.
11(5): 782–795.
[84] Konstantinidis, K. and A. Ramamoorthy (2018). “Leveraging
coding techniques for speeding up distributed computing”. e-
print arXiv:1802.03049.
[85] Korner, J. and K. Marton (1979). “How to encode the modulo-
two sum of binary sources”. IEEE Transactions on Information
Theory. 25(2): 219–221.
[86] Kosaian, J., K. Rashmi, and S. Venkataraman (2018). “Learn-
ing a code: Machine learning for approximate non-linear coded
computation”. preprint arXiv:1806.01259.
[87] Krizhevsky, A. and G. Hinton (2009). “Learning multiple layers
of features from tiny images”. Tech. rep. Citeseer.
[88] Kushilevitz, E. and N. Nisan (2006). Communication Complexity.
Cambridge University Press.
[89] Lee, K., M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ram-
chandran (2018). “Speeding up distributed machine learning
using codes”. IEEE Transactions on Information Theory. 64(3):
1514–1529.
[90] Lee, K., C. Suh, and K. Ramchandran (2017). “High-dimensional
coded matrix multiplication”. In: 2017 IEEE International Sym-
posium on Information Theory (ISIT). 2418–2422.
[91] Lee, K., R. Pedarsani, and K. Ramchandran (2015). “On schedul-
ing redundant requests with cancellation overheads”. In: 53rd
Annual Allerton Conference on Communication, Control, and
Computing. IEEE. 99–106.
[92] Li, S., M. A. Maddah-Ali, and A. S. Avestimehr (2018a). “Com-
pressed coded distributed computing”. In: 2018 IEEE Interna-
tional Symposium on Information Theory (ISIT). IEEE. 2032–
2036.

[93] Li, S., M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr (2018b).


“A fundamental tradeoff between computation and communica-
tion in distributed computing”. IEEE Transactions on Informa-
tion Theory. 64(1): 109–128.
[94] Li, S., S. M. M. Kalan, A. S. Avestimehr, and M. Soltanolkotabi
(2018c). “Near-optimal straggler mitigation for distributed gra-
dient methods”. IPDPSW. May.
[95] Li, S., S. M. M. Kalan, Q. Yu, M. Soltanolkotabi, and A. S. Aves-
timehr (2018d). “Polynomially coded regression: Optimal strag-
gler mitigation via data encoding”. e-print arXiv:1805.09934.
[96] Li, S., M. A. Maddah-Ali, and A. S. Avestimehr (2016a). “A uni-
fied coding framework for distributed computing with straggling
servers”. IEEE NetCod. Dec.
[97] Li, S., M. A. Maddah-Ali, and A. S. Avestimehr (2015). “Coded
MapReduce”. 53rd Annual Allerton Conference on Communica-
tion, Control, and Computing. Sept.
[98] Li, S., M. A. Maddah-Ali, and A. S. Avestimehr (2016b).
“Coded distributed computing: Straggling servers and multistage
dataflows”. 54th Allerton Conference. Sept.
[99] Li, S., M. Yu, C.-S. Yang, A. S. Avestimehr, S. Kannan, and
P. Viswanath (2018e). “PolyShard: Coded sharding achieves
linearly scaling efficiency and security simultaneously”. arXiv:
1809.10361 [cs.CR].
[100] Liberty, E. and S. W. Zucker (2009). “The mailman algorithm:
A note on matrix–vector multiplication”. Information Processing
Letters. 109(3): 179–182.
[101] Lin, S. and D. J. Costello (2004). Error Control Coding. Pearson.
[102] Lindell, Y. (2005). “Secure multiparty computation for privacy
preserving data mining”. In: Encyclopedia of Data Warehousing
and Mining. IGI Global. 1005–1009.
[103] Low, Y., D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and
J. M. Hellerstein (2012). “Distributed GraphLab: A framework
for machine learning and data mining in the cloud”. Proceedings
of the VLDB Endowment. 5(8): 716–727.

[104] Maddah-Ali, M. A. and U. Niesen (2014a). “Decentralized


coded caching attains order-optimal memory-rate tradeoff”.
IEEE/ACM Transactions on Networking. Apr.
[105] Maddah-Ali, M. A. and U. Niesen (2014b). “Fundamental limits
of caching”. IEEE Transactions on Information Theory. 60(5):
2856–2867.
[106] Maity, R. K., A. S. Rawat, and A. Mazumdar (2018). “Robust
gradient descent via moment encoding with LDPC codes”. SysML
Conference.
[107] Maleki, H., V. R. Cadambe, and S. A. Jafar (2014). “Index
coding—An interference alignment perspective”. IEEE Transac-
tions on Information Theory. 60(9): 5402–5432.
[108] Malewicz, G., M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn,
N. Leiser, and G. Czajkowski (2010). “Pregel: A system for large-
scale graph processing”. In: Proceedings of the ACM SIGMOD
International Conference on Management of Data. ACM. 135–
146.
[109] Mallick, A., M. Chaudhari, U. Sheth, G. Palanikumar, and
G. Joshi (2019). “Rateless codes for near-perfect load balancing
in distributed matrix-vector multiplication”. Proceedings of the
ACM on Measurement and Analysis of Computing Systems. 3(3):
1–40.
[110] McEliece, R. J. and D. V. Sarwate (1981). “On sharing secrets
and Reed–Solomon codes”. Communications of the ACM. 24(9):
583–584.
[111] McMahan, H. B., E. Moore, D. Ramage, S. Hampson, and
B. A. Y. Arcas (2017). “Communication-efficient learning of
deep networks from decentralized data”. In: International Con-
ference on Artificial Intelligence and Statistics. 1273–1282.
[112] Melis, L., C. Song, E. D. Cristofaro, and V. Shmatikov (2019).
“Exploiting unintended feature leakage in collaborative learning”.
arXiv:1805.04049.
[113] Mitra, D. and L. Dolecek (2019). “Patterned erasure correcting
codes for low storage-overhead blockchain systems”. In: 2019
53rd Asilomar Conference on Signals, Systems, and Computers.
IEEE. 1734–1738.

[114] Mohassel, P. and Y. Zhang (2017). “SecureML: A system for


scalable privacy-preserving machine learning”. In: 2017 IEEE
Symposium on Security and Privacy (SP). 19–38.
[115] Narra, K. G., Z. Lin, M. Kiamari, S. Avestimehr, and M. An-
navaram (2019). “Slack squeeze coded computing for adaptive
straggler mitigation”. In: Proceedings of the International Confer-
ence for High Performance Computing, Networking, Storage and
Analysis. SC ’19. Denver, Colorado: Association for Computing
Machinery.
[116] Nazer, B. and M. Gastpar (2007). “Computation over multiple-
access channels”. IEEE Transactions on Information Theory.
53(10): 3498–3516.
[117] Nodehi, H. A. and M. A. Maddah-Ali (2018). “Limited-sharing
multi-party computation for massive matrix operations”. In:
IEEE International Symposium on Information Theory (ISIT).
1231–1235.
[118] O’Malley, O. (2008). “TeraByte sort on Apache Hadoop”. Tech.
rep. Yahoo.
[119] “Open MPI: Open source high performance computing” (n.d.).
https://www.open-mpi.org/.
[120] Orlitsky, A. and A. El Gamal (1990). “Average and randomized
communication complexity”. IEEE Transactions on Information
Theory. 36(1): 3–16.
[121] Orlitsky, A. and J. Roche (2001). “Coding for computing”. IEEE
Transactions on Information Theory. 47(3): 903–917.
[122] Pawar, S., S. El Rouayheb, and K. Ramchandran (2011). “Secur-
ing dynamic distributed storage systems against eavesdropping
and adversarial attacks”. IEEE Transactions on Information
Theory. 57(10): 6734–6753.
[123] Poulson, J., B. Marker, R. A. van de Geijn, J. R. Hammond,
and N. A. Romero (2013). “Elemental: A new framework for
distributed memory dense matrix computations”. ACM Trans-
actions on Mathematical Software. 39(2): 13:1–13:24.
[124] Prakash, S., A. Reisizadeh, R. Pedarsani, and S. Avestimehr
(2018). “Coded computing for distributed graph analytics”. IEEE
ISIT.

[125] Rajaraman, A. and J. D. Ullman (2011). Mining of Massive


Datasets. Cambridge University Press.
[126] Ramamoorthy, A. and M. Langberg (2013). “Communicating
the sum of sources over a network”. IEEE Journal on Selected
Areas in Communications. 31(4): 655–665.
[127] Raviv, N., I. Tamo, R. Tandon, and A. G. Dimakis (2017).
“Gradient coding from cyclic MDS codes and expander graphs”.
e-print arXiv:1707.03858.
[128] Rawat, A. S., O. O. Koyluoglu, N. Silberstein, and S. Vish-
wanath (2014). “Optimal locally repairable and secure codes for
distributed storage systems”. IEEE Transactions on Information
Theory. 60(1): 212–236.
[129] Recht, B., C. Re, S. Wright, and F. Niu (2011). “Hogwild: A lock-
free approach to parallelizing stochastic gradient descent”. Ad-
vances in Neural Information Processing Systems (NIPS): 693–
701.
[130] Reisizadeh, A., S. Prakash, R. Pedarsani, and A. S. Avestimehr
(2019). “Coded computation over heterogeneous clusters”. IEEE
Transactions on Information Theory. 65(7): 4227–4242.
[131] Reisizadeh, A., S. Prakash, R. Pedarsani, and S. Avestimehr
(2017). “Coded computation over heterogeneous clusters”. IEEE
ISIT : 2408–2412.
[132] Renteln, P. (2013). Manifolds, Tensors, and Forms: An Introduc-
tion for Mathematicians and Physicists. Cambridge University
Press.
[133] Roth, R. (2006). Introduction to Coding Theory. Cambridge
University Press.
[134] Sahraei, S. and A. S. Avestimehr (2019). “INTERPOL: Infor-
mation theoretically verifiable polynomial evaluation”. In: 2019
IEEE International Symposium on Information Theory (ISIT).
IEEE. 1112–1116.
[135] Sahraei, S., M. A. Maddah-Ali, and S. Avestimehr (2019).
“Interactive verifiable polynomial evaluation”. preprint
arXiv:1907.04302.

[136] Sanghavi, S. (2007). “Intermediate performance of rateless codes”.


In: Information Theory Workshop, 2007. ITW’07. IEEE. 478–
482.
[137] Schölkopf, B., R. Herbrich, and A. J. Smola (2001). “A gen-
eralized representer theorem”. In: International Conference on
Computational Learning Theory. Springer. 416–426.
[138] Seide, F., H. Fu, J. Droppo, G. Li, and D. Yu (2014). “1-bit
stochastic gradient descent and its application to data-parallel
distributed training of speech dnns”. In: Fifteenth Annual Con-
ference of the International Speech Communication Association.
[139] Shah, N. B., K. Lee, and K. Ramchandran (2016). “When do
redundant requests reduce latency?” IEEE Transactions on Com-
munications. 64(2): 715–722.
[140] Shah, N. B., K. Rashmi, and P. V. Kumar (2011). “Information-
theoretically secure regenerating codes for distributed storage”.
In: Global Telecommunications Conference (GLOBECOM 2011),
2011. IEEE. 1–5.
[141] Shamir, A. (1979). “How to share a secret”. Communications of
the ACM. 22(11): 612–613.
[142] Shanmugam, K., A. G. Dimakis, and M. Langberg (2013). “Local
graph coloring and index coding”. In: 2013 IEEE International
Symposium on Information Theory. 1152–1156.
[143] Singleton, R. (1964). “Maximum distance q-nary codes”. IEEE
Transactions on Information Theory. 10(2): 116–118.
[144] So, J., B. Guler, and A. S. Avestimehr (2020). “Turbo-aggregate:
Breaking the quadratic aggregation barrier in secure federated
learning”. arXiv: 2002.04156 [cs.LG].
[145] So, J., B. Guler, A. S. Avestimehr, and P. Mohassel (2019).
“CodedPrivateML: A fast and privacy-preserving framework for
distributed machine learning”. CoRR. abs/1902.00641. arXiv:
1902.00641.
[146] Song, L., C. Fragouli, and T. Zhao (2017). “A pliable index
coding approach to data shuffling”. e-print arXiv:1701.05540.

[147] Tandon, R., Q. Lei, A. G. Dimakis, and N. Karampatziakis (2017).


“Gradient coding: Avoiding stragglers in distributed learning”.
In: Proceedings of the 34th International Conference on Machine
Learning. Vol. 70. Proceedings of Machine Learning Research. In-
ternational Convention Centre, Sydney, Australia: PMLR. 3368–
3376.
[148] Tang, L., K. Konstantinidis, and A. Ramamoorthy (2019). “Era-
sure Coding for distributed matrix multiplication for matrices
with bounded entries”. IEEE Communications Letters. 23(1):
8–11.
[149] “tc – show/manipulate traffic control settings” (n.d.). http://lartc.org/manpages/tc.txt.
[150] Ullman, J. D., A. V. Aho, and J. E. Hopcroft (1974). The Design
and Analysis of Computer Algorithms. Vol. 4. Addison-Wesley,
Reading. 1–2.
[151] Van De Geijn, R. A. and J. Watts (1997). “SUMMA: Scalable uni-
versal matrix multiplication algorithm”. Concurrency-Practice
and Experience. 9(4): 255–274.
[152] Wan, K., D. Tuninetti, M. Ji, and P. Piantanida (2018).
“Fundamental limits of distributed data shuffling”. e-print
arXiv:1807.00056.
[153] Wang, D., G. Joshi, and G. Wornell (2014). “Efficient task
replication for fast response times in parallel computation”. ACM
SIGMETRICS Performance Evaluation Review. 42(1): 599–600.
[154] Wang, H., Z. Charles, and D. Papailiopoulos (2019a). “Era-
sureHead: Distributed gradient descent without delays using
approximate gradient coding”. preprint arXiv:1901.09671.
[155] Wang, S., J. Liu, and N. Shroff (2018a). “Coded sparse matrix
multiplication”. e-print arXiv:1802.03430.
[156] Wang, S., J. Liu, and N. Shroff (2019b). “Fundamental limits
of approximate gradient coding”. Proceedings of the ACM on
Measurement and Analysis of Computing Systems. 3(3): 52.
[157] Wang, S., J. Liu, N. Shroff, and P. Yang (2018b). “Fundamental
limits of coded linear transform”. e-print arXiv:1804.09791.

[158] Wen, W., C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li


(2017). “TernGrad: Ternary gradients to reduce communication
in distributed deep learning”. Advances in Neural Information
Processing Systems (NIPS): 1508–1518.
[159] Woolsey, N., R.-R. Chen, and M. Ji (2018). “A new com-
binatorial design of coded distributed computing”. e-print
arXiv:1802.03870.
[160] Yang, H. and J. Lee (2019). “Secure distributed computing with
straggling servers using polynomial codes”. IEEE Transactions
on Information Forensics and Security. 14(1): 141–150.
[161] Yang, Y., M. Interlandi, P. Grover, S. Kar, S. Amizadeh, and
M. Weimer (2019). “Coded elastic computing”. In: 2019 IEEE
International Symposium on Information Theory (ISIT). 2654–
2658.
[162] Yang, Y., P. Grover, and S. Kar (2017). “Coded distributed com-
puting for inverse problems”. In: Advances in Neural Information
Processing Systems 30. Ed. by I. Guyon, U. V. Luxburg, S. Ben-
gio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett.
Curran Associates, Inc. 709–719. url: http://papers.nips.cc/paper/6673-coded-distributed-computing-for-inverse-problems.pdf.
[163] Yao, A. C.-C. (1979). “Some complexity questions related to
distributive computing (preliminary report)”. In: Proceedings of
the Eleventh Annual ACM Symposium on Theory of Computing.
209–213.
[164] Ye, M. and E. Abbe (2018). “Communication-computation effi-
cient gradient coding”. e-print arXiv:1802.03475.
[165] Yu, M., S. Sahraei, S. Li, S. Avestimehr, S. Kannan, and P.
Viswanath (2020). “Coded Merkle tree: Solving data availability
attacks in blockchains”. In: Financial Cryptography and Data
Security (FC).
[166] Yu, Q., S. Li, M. A. Maddah-Ali, and A. S. Avestimehr (2017a).
“How to optimally allocate resources for coded distributed com-
puting?” IEEE International Conference on Communications
(ICC). May: 1–7.

[167] Yu, Q., M. A. Maddah-Ali, and A. S. Avestimehr (2017b). “Poly-


nomial codes: An optimal design for high-dimensional coded
matrix multiplication”. Advances in Neural Information Process-
ing Systems (NIPS): 4406–4416.
[168] Yu, Q., M. A. Maddah-Ali, and A. S. Avestimehr (2018a). “Strag-
gler mitigation in distributed matrix multiplication: Fundamental
limits and optimal coding”. e-print arXiv:1801.07487.
[169] Yu, Q., M. A. Maddah-Ali, and A. S. Avestimehr (2018b). “Strag-
gler mitigation in distributed matrix multiplication: Fundamental
limits and optimal coding”. In: IEEE International Symposium
on Information Theory (ISIT). 2022–2026.
[170] Yu, Q., N. Raviv, J. So, and A. S. Avestimehr (2018c). “Lagrange
coded computing: Optimal design for resiliency, security and
privacy”. e-print arXiv:1806.00939.
[171] Zaharia, M., M. Chowdhury, M. J. Franklin, S. Shenker, and
I. Stoica (2010). “Spark: Cluster computing with working sets”.
In: Proceedings of the 2nd USENIX HotCloud. Vol. 10. 10.
[172] Zaharia, M., A. Konwinski, A. D. Joseph, R. H. Katz, and I. Sto-
ica (2008). “Improving MapReduce performance in heterogeneous
environments”. Operating Systems Design and Implementation.
8(4): 7.
[173] Zhang, S., J. Han, Z. Liu, K. Wang, and S. Feng (2009). “Accel-
erating MapReduce with distributed memory cache”. 15th IEEE
International Conference on Parallel and Distributed Systems
(ICPADS). Dec.: 472–478.
[174] Zhang, Z., L. Cherkasova, and B. T. Loo (2013). “Performance
modeling of MapReduce jobs in heterogeneous cloud environ-
ments”. In: IEEE Sixth International Conference on Cloud Com-
puting. 839–846.
[175] Zhuang, Y., W.-S. Chin, Y.-C. Juan, and C.-J. Lin (2013). “A fast
parallel SGD for matrix factorization in shared memory systems”.
In: Proceedings of the 7th ACM Conference on Recommender
Systems. ACM. 249–256.
