Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning
Suggested Citation: Songze Li and Salman Avestimehr (2020), “Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning”, Foundations and Trends® in Communications and Information Theory: Vol. 17, No. 1, pp 1–148. DOI: 10.1561/0100000103.
Songze Li
University of Southern California
USA
[email protected]
Salman Avestimehr
University of Southern California
USA
[email protected]
Contents
1 Introduction
1.1 Coding for Bandwidth Reduction
1.2 Coding for Straggler Mitigation
1.3 Coding for Security and Privacy
1.4 Related Works
Acknowledgements
Appendices
References
Coded Computing: Mitigating
Fundamental Bottlenecks in
Large-Scale Distributed Computing
and Machine Learning
Songze Li1 and Salman Avestimehr2
1 University of Southern California, USA; [email protected]
2 University of Southern California, USA; [email protected]
ABSTRACT
We introduce the concept of “coded computing”, a novel
computing paradigm that utilizes coding theory to effec-
tively inject and leverage data/computation redundancy
to mitigate several fundamental bottlenecks in large-scale
distributed computing, namely the communication bandwidth, straggler (i.e., slow or failing node) delay, and privacy and security bottlenecks. More specifically, for MapReduce based
distributed computing structures, we propose the “Coded
Distributed Computing” (CDC) scheme, which injects re-
dundant computations across the network in a structured
manner, such that in-network coding opportunities are en-
abled to substantially slash the communication load to shuf-
fle the intermediate computation results. We prove that
CDC achieves the optimal tradeoff between computation
and communication, and demonstrate its impact on a wide
range of distributed computing systems from cloud-based
datacenters to mobile edge/fog computing platforms.
this straggler effect can prolong the job execution time by as much as
five times.
Conventionally, in the original open-source implementation of
Hadoop MapReduce [10], the stragglers are constantly detected and
the slow tasks are speculatively restarted on other available nodes. Fol-
lowing this idea of straggler detection, more timely straggler detection
algorithms and better scheduling algorithms have been developed to
further alleviate the straggler effect (see, e.g., [9, 172]). Apart from
straggler detection and speculative restart, another straggler mitigation
technique is to schedule the clones of the same task (see, e.g., [8, 30,
55, 91, 139]). The underlying idea of cloning is to execute redundant
tasks such that the computation can proceed when the results of the
fast-responding clones have returned. Recently, it has been proposed
to utilize error correcting codes for straggler mitigation in distributed
matrix-vector multiplication [47, 89, 96, 106]. The main idea is to par-
tition the data matrix into K batches, and then generate N coded
batches using the maximum-distance-separable (MDS) code [101], and
assign multiplication with each of the coded batches to a worker node.
Benefiting from the “any K of N” property of the MDS code, the computation can be completed as soon as the fastest K nodes have finished their computations, providing the system robustness against up to N − K arbitrary stragglers. This coded approach was shown to significantly outperform the state-of-the-art cloning approaches in straggler mitigation capability, and to minimize the overall computation latency.
Our first contribution on this topic is the development of optimal
codes, named polynomial codes, to deal with stragglers in distributed
high-dimensional matrix–matrix multiplication. More specifically, we
consider a distributed matrix multiplication problem where we aim to
compute C = A⊤B from input matrices A and B. The computation is carried out using a distributed system with a master node and N worker nodes that can each store a fixed fraction of A and B, respectively (possibly in a coded manner). For this problem, we aim to design
computation strategies that achieve the minimum possible recovery
threshold, which is defined as the minimum number of workers that the
master needs to wait for in order to compute C. While the prior works,
i.e., the one dimensional MDS code (1D MDS code) in [89], and the
product code in [90] apply MDS codes on the data matrices, they are
sub-optimal in minimizing the recovery threshold. The main novelty
and advantage of the proposed polynomial code is that, by carefully
designing the algebraic structure of the coded storage at each worker,
we create an MDS structure on the intermediate computations, instead
of only the coded data matrices. This allows the polynomial code to achieve an order-wise improvement over the state of the art (see Table 1.1). We also
prove the optimality of polynomial code by showing that it achieves
the information-theoretic lower bound on the recovery threshold. As
a by-product, we also prove the optimality of polynomial code under
several other performance metrics considered in previous literature.
Going beyond matrix algebra, we also study the straggler mitigation
strategies for scenarios where the function of interest is an arbitrary
multivariate polynomial of the input dataset. This significantly broadens
the scope of the problem to cover many computations of interest in ma-
chine learning, such as various gradient and loss-function computations
in learning algorithms and tensor algebraic operations (e.g., low-rank
tensor approximation). In particular, we consider a computation task
for which the goal is to compute a function f over a large dataset
X = (X1 , . . . , XK ) to obtain K outputs Y1 = f (X1 ), . . . , YK = f (XK ).
The computation is carried over a system consisting of a master node
and N worker nodes. Each worker i stores a coded dataset X̃i generated
from X, computes f (X̃i ), and sends the obtained result to the master.
The master decodes the output Y1 , . . . , YK from the computation results
of the group of the fastest workers.
For this setting, a naive repetition scheme would repeat the computation for each data block Xk onto N/K workers, yielding a recovery threshold of N − N/K + 1 = Θ(N). We propose the “Lagrange Coded Computing” (LCC) scheme, which achieves the optimal recovery threshold of (K − 1) deg f + 1.
Data privacy has become a major concern in the information age. The
immensity of modern datasets has popularized the use of third-party
cloud services, and as a result, the threat of privacy infringement has
increased dramatically. In order to alleviate this concern, techniques
for private computation are essential [25, 38, 102, 114]. Additionally,
third-party service providers often have an interest in the result of the
computation, and might attempt to alter it for their benefit [23, 24].
In particular, we consider a common and important scenario where a
user wishes to disperse computations over a large network of workers,
subject to the following privacy and security constraints.
We also note that the number of workers the master needs to wait for does not scale with the total number of workers N. The key property of LCC is thus that adding one additional worker increases its resiliency to stragglers by 1, or its robustness to malicious workers by 1/2, while maintaining the privacy constraint. Hence, this result
essentially extends the well-known optimal scaling of error-correcting
codes (i.e., adding one parity can provide robustness against 1 erasure
or 1/2 error in optimal maximum distance separable codes) to the
distributed computing paradigm. Compared with the state-of-the-art
BGW-based designs, we also show that LCC significantly improves the
storage, communication, and secret-sharing overhead needed for secure
and private multiparty computing (see Table 1.2).
Finally, we will also discuss the problem of privacy-preserving ma-
chine learning. In particular, we consider an application scenario in
which a data-owner (e.g., a hospital) wishes to train a logistic regression
model by offloading the large volume of data (e.g., healthcare records)
and computationally-intensive training tasks (e.g., gradient computa-
tions) to N machines over a cloud platform, while ensuring that any collusions between T out of N workers do not leak information about the dataset.
Table 1.2: Comparison between BGW based designs and LCC. The computational
complexity is normalized by that of evaluating f ; randomness, which refers to the
number of random entries used in encoding functions, is normalized by the length
of Xi
                        BGW       LCC
Complexity/worker       K         1
Frac. data/worker       1         1/K
Randomness              KT        T
Min. num. of workers    2T + 1    deg f · (K + T − 1) + 1
1 The motivation for considering simultaneous computation of Q functions is that, in many common scenarios, computation requests over the same dataset are continuously submitted (e.g., database queries, web search, and loss computations in machine learning).
2 When mapping a file, we compute Q intermediate values in parallel, one for each of the Q output functions. The main reason for doing this is that such parallel processing can be performed efficiently for applications that fit the MapReduce framework. In other words, mapping a file according to one function is only marginally more expensive than mapping it according to all functions. For example, in the canonical Word Count task, while scanning a document to count the appearances of one word, we can simultaneously count the appearances of other words at marginally increased computation cost.
Figure 2.2: Comparison of the communication load achieved by the proposed coded
scheme in Theorem 2.1 with that of the uncoded scheme in (2.5), for Q = 10 output
functions, N = 2520 input files and K = 10 computing nodes.
smaller than the lower convex envelope of the points {(r, (1/r)·(1 − r/K)) : r ∈ {1, . . . , K}}, by proving the converse in Subsection 2.1.4.
in the Shuffle phase, which helps to minimize the overall execution time of applications whose performance is limited by data shuffling. In the next subsection, we will empirically demonstrate this idea through experiments on a widely-used practical workload.
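To make the gap in Figure 2.2 concrete, the short Python sketch below (our illustration, not part of the original text) evaluates the two curves for K = 10 nodes, assuming, as in the CDC framework, that the uncoded scheme of (2.5) has load 1 − r/K and that the coded scheme of Theorem 2.1 achieves (1/r)(1 − r/K).

```python
# Minimal sketch: communication loads behind Figure 2.2 (K = 10 nodes).
# Assumption: L_uncoded(r) = 1 - r/K  (the uncoded scheme in (2.5)) and
#             L_coded(r)  = (1/r) * (1 - r/K)  (Theorem 2.1).
K = 10

def load_uncoded(r, K=K):
    return 1 - r / K

def load_coded(r, K=K):
    return (1 / r) * (1 - r / K)

for r in range(1, K + 1):
    gain = load_uncoded(r) / load_coded(r) if r < K else float("nan")
    print(f"r = {r:2d}: uncoded = {load_uncoded(r):.3f}, "
          f"coded = {load_coded(r):.3f}, gain = {gain:.1f}x")
```

The printed gain equals r, i.e., an r-fold increase in computation buys an r-fold reduction in communication load.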
Illustrative Example
We consider a MapReduce-type problem in Figure 2.3 for distributed
computing of Q = 3 output functions, represented by red/circle,
green/square, and blue/triangle respectively, from N = 6 input files,
using K = 3 computing nodes. Nodes 1, 2, and 3 are respectively respon-
sible for final reduction of red/circle, green/square, and blue/triangle
output functions. We first consider the case where no redundancy is
imposed on the computations, i.e., each file is mapped once and compu-
tation load r = 1. As shown in Figure 2.3(a), Node k maps File 2k − 1
and File 2k for k = 1, 2, 3. In this case, each node maps 2 input files
locally. In Figure 2.3, we represent, for example, the intermediate value
of the red/circle function in File n using a red circle labelled by n, for
all n = 1, . . . , 6. Similar representations follow for the green/square and
Figure 2.3: (a) Uncoded Distributed Computing Scheme; (b) Coded Distributed Computing Scheme.
the blue/triangle functions. After the Map phase, each node obtains
2 out of 6 required intermediate values to reduce the output function
it is responsible for (e.g., Node 1 knows the red circles in File 1 and
File 2). Hence, each node needs 4 intermediate values from the other nodes, yielding a communication load of (4 × 3)/(3 × 6) = 2/3.
\binom{K−1}{r−1} η = rN/K Map functions, i.e., |M_k| = rN/K for all k ∈ {1, . . . , K}.
After the Map phase, Node k, k ∈ {1, . . . , K}, knows the intermediate
values of all Q output functions in the files in Mk , i.e., {vq,n : q ∈
{1, . . . , Q}, wn ∈ Mk }.
Coded data shuffling. We focus on the case where the number of output functions Q satisfies Q/K ∈ ℕ, and enforce a symmetric assignment of the Reduce functions such that every node reduces Q/K functions. That is, |W_1| = · · · = |W_K| = Q/K, and W_j ∩ W_k = ∅ for all j ≠ k.
For any subset P ⊂ {1, . . . , K} and k ∉ P, we denote the set of intermediate values needed by Node k and known exclusively by the nodes whose indices are in P as V_P^k. More formally:

V_P^k ≜ {v_{q,n} : q ∈ W_k, w_n ∈ ∩_{i∈P} M_i, w_n ∉ ∪_{i∉P} M_i}.    (2.8)
After we iterate the above data shuffling process over all subsets of r + 1 nodes, it is easy to see that each node k, in addition to its locally computed intermediate values, has recovered all the required intermediate values, i.e., {V_{S\{k}}^k : S ⊆ {1, . . . , K}, |S| = r + 1, k ∈ S}, to compute its Reduce functions locally.
Communication load. Since the coded segment X_i^S has a size of QηT/(Kr) bits for each i ∈ S, a total of (QηT/(Kr)) · (r + 1) bits are communicated for each such subset S of size r + 1.
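To verify that this segment counting yields the claimed tradeoff, the following sketch (ours, using exact rational arithmetic) sums the coded segments over all \binom{K}{r+1} subsets of size r + 1 and normalizes by QNT; it recovers L(r) = (1/r)(1 − r/K), under the scheme's assumptions that N = \binom{K}{r}η and that each coded segment carries QηT/(Kr) bits.

```python
from math import comb
from fractions import Fraction

# Sketch: count the bits sent by the CDC Shuffle phase and normalize by Q*N*T.
# Assumptions (from the scheme description): N = C(K, r) * eta input files,
# and each coded segment X_i^S carries Q*eta*T / (K*r) bits.

def cdc_load(K, r, eta=1, Q=1, T=1):
    N = comb(K, r) * eta
    segment_bits = Fraction(Q * eta * T, K * r)
    # one coded segment per node, in every subset S of size r + 1
    total_bits = comb(K, r + 1) * (r + 1) * segment_bits
    return total_bits / (Q * N * T)

K = 10
for r in range(1, K):
    assert cdc_load(K, r) == Fraction(1, r) * (1 - Fraction(r, K))
print("segment counting matches L(r) = (1/r)(1 - r/K)")
```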
Remark 2.7. The ideas of efficiently creating and exploiting coded multi-
casting opportunities have been introduced in caching problems [72, 104,
105]. Through the above description of the CDC scheme, we illustrated
how coding opportunities can be utilized in distributed computing to
slash the load of communicating intermediate values, by designing a par-
ticular assignment of extra computations across distributed computing
nodes. We note that the intermediate values calculated in the Map phase mimic the locally stored cache contents in caching problems, providing the “side information” that enables coding in the subsequent Shuffle phase (or content delivery).
Remark 2.8. Generally speaking, we can view the Shuffle phase of the
considered distributed computing framework as an instance of the index
coding problem [15, 20], in which a central server aims to design a broad-
cast message (code) with minimum length to simultaneously satisfy the
requests of all the clients, given the clients’ side information stored in
their local caches. Note that while a randomized linear network coding approach (see, e.g., [2, 66, 83]) is sufficient to implement any multicast communication where messages are intended for all receivers, it is generally sub-optimal for index coding problems where every client requests different messages. Although the index coding problem is still open in
general, for the considered distributed computing scenario where we are
given the flexibility of designing Map computation (thus the flexibility
of designing side information), we next prove tight lower bounds on
the minimum communication load, demonstrating the optimality of the
proposed CDC scheme.
For example, for the particular file assignment in Figure 2.4, i.e., M = ({1, 3, 5, 6}, {4, 5, 6}, {2, 3, 4, 6}), a_M^1 = 2 since File 1 and File 2 are each mapped on a single node (i.e., Node 1 and Node 3 respectively). Similarly, we have a_M^2 = 3 (Files 3, 4, and 5), and a_M^3 = 1 (File 6).
Σ_{j=1}^{K} j a_M^j = rN.    (2.16)

L*(r) ≥ inf_{M: |M_1|+···+|M_K|=rN} [K − Σ_{j=1}^{K} j a_M^j/N] / [K Σ_{j=1}^{K} j a_M^j/N] (a)= (K − r)/(Kr),    (2.17)
a_M^{j,S} ≜ Σ_{J⊆S: |J|=j} |(∩_{k∈J} M_k) \ (∪_{i∉J} M_i)|,    (2.28)
and the message symbols communicated by the nodes whose indices are in S as

X_S ≜ {X_k : k ∈ S}.    (2.29)

Then we prove the following claim.

Claim 2.2.1. For any subset S ⊆ {1, . . . , K}, we have

H(X_S | Y_{S^c}) ≥ T Σ_{j=1}^{|S|} a_M^{j,S} · (Q/K) · (|S| − j)/j,    (2.30)
The first term on the RHS of (2.38) can be lower bounded as follows.

(a)= H(V_{W_k,:} | V_{:,M_k}, V_{:,M_{S^c}})    (2.40)
(b)= Σ_{q∈W_k} H(V_{{q},:} | V_{{q},M_k∪M_{S^c}})    (2.41)
(c)= (Q/K) T Σ_{j=0}^{S_0} a_M^{j,S\{k}} ≥ (Q/K) T Σ_{j=1}^{S_0} a_M^{j,S\{k}},    (2.42)

≥ T Σ_{j=1}^{S_0} a_M^{j,S\{k}} · (Q/K) · (S_0 − j)/j.    (2.44)
= (T/S_0) Σ_{k∈S} Σ_{j=1}^{S_0} a_M^{j,S\{k}} · (Q/K) · (1/j) = T Σ_{j=1}^{S_0} (Q/K) · (1/j) · (1/S_0) Σ_{k∈S} a_M^{j,S\{k}}.    (2.46)
Σ_{k∈S} a_M^{j,S\{k}} = Σ_{k∈S} Σ_{n=1}^{N} 1(file n is only mapped by some nodes in S\{k}) · 1(file n is mapped by j nodes)    (2.47)
= Σ_{n=1}^{N} 1(file n is only mapped by j nodes in S) · Σ_{k∈S} 1(file n is not mapped by Node k)    (2.48)
= Σ_{n=1}^{N} 1(file n is only mapped by j nodes in S) · (|S| − j)    (2.49)
= a_M^{j,S} (S_0 + 1 − j).    (2.50)
Thus for all subsets S ⊆ {1, . . . , K}, the following equation holds:

H(X_S | Y_{S^c}) ≥ T Σ_{j=1}^{|S|} a_M^{j,S} · (Q/K) · (|S| − j)/j,    (2.52)

L*_M ≥ H(X_S | Y_{S^c}) / (QNT) ≥ Σ_{j=1}^{K} (a_M^j/N) · (K − j)/(Kj).    (2.53)
2.2.2 TeraSort
TeraSort [118] is a conventional algorithm for distributed sorting of a
large amount of data. The input data to be sorted is in the format of
key-value (KV) pairs, meaning each input KV pair consists of a key and
a value. For example, the domain of the keys can be 10-byte integers,
and the domain of the values can be arbitrary strings. TeraSort aims
to sort the input data according to their keys, e.g., sorting integers.
A TeraSort algorithm run over K nodes, whose indices are denoted by
a set K = {1, . . . , K}, is comprised of the following five components.
File placement. Let F denote the entire KV pairs to be sorted. They
are split into K disjoint input files, denoted by F{1} , . . . , F{K} . File F{k}
is assigned to and locally stored at Node k.
Key domain partitioning. The key domain of the KV pair, denoted
by P, is split into K ordered partitions, denoted by P_1, . . . , P_K. Specifically, for any p ∈ P_i and any p′ ∈ P_{i+1}, it holds that p < p′, for all i ∈ {1, . . . , K − 1}. For example, when P = [0, 100] and K = 4, the
partitions can be P1 = [0, 25), P2 = [25, 50), P3 = [50, 75), P4 = [75, 100].
Node k is responsible for sorting all KV pairs in the partition Pk , for
all k ∈ K.
Map stage. Each node k hashes each KV pair in its locally stored file F_{{k}} into the partition that its key falls into. For each of the K key partitions, the hashing procedure on the file F_{{k}} generates an intermediate value that contains the KV pairs in F_{{k}} whose keys belong to that partition.
More specifically, we denote the intermediate value of the partition P_j from the file F_{{k}} as I_{{k}}^j, and the hashing procedure on the file F_{{k}} is defined as

{I_{{k}}^1, . . . , I_{{k}}^K} ← Hash(F_{{k}}).
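The hashing step is just a bucketing of KV pairs by key range. The following Python sketch is our own illustration of it (variable names are hypothetical), using the example partitioning P_1 = [0, 25), P_2 = [25, 50), P_3 = [50, 75), P_4 = [75, 100] from above.

```python
import random
from bisect import bisect_right

K = 4                              # number of nodes / key partitions
boundaries = [25, 50, 75]          # P1=[0,25), P2=[25,50), P3=[50,75), P4=[75,100]

def hash_file(file_kv):
    """Map stage: split a local file into K intermediate values, one per key partition."""
    intermediate = [[] for _ in range(K)]
    for key, value in file_kv:
        j = bisect_right(boundaries, key)   # index of the partition containing this key
        intermediate[j].append((key, value))
    return intermediate                     # [I^1, ..., I^K] for this file

# a toy local file F_{k}: 10 random KV pairs with integer keys in [0, 100)
local_file = [(random.randrange(100), f"val{i}") for i in range(10)]
for j, I in enumerate(hash_file(local_file), start=1):
    print(f"I^{j}: {I}")
```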
Performance Evaluation
We performed an experiment on Amazon EC2 to sort 12 GB of data by
running TeraSort on 16 nodes. The breakdown of the total execution
time is shown in Table 2.1.
We observe from Table 2.1 that for a conventional TeraSort execution, 98.4% of the total execution time was spent in data shuffling, which is 508.5× the time spent in the Map stage. This motivates us to develop a coded distributed sorting algorithm, named CodedTeraSort, which integrates the coding technique of CDC into TeraSort, trading extra computation time for a significant reduction in communication time, as shown in (2.55).
are kept at Node k. This is because the intermediate value I_S^i, required by Node i ∈ S\{k} in the Reduce stage, is already available at Node i after the Map stage, so Node k does not need to keep it and send it to the nodes in S\{k}. For example, as shown in Figure 2.6, Node 1 does not keep the intermediate value I_{{1,2}}^2 for Node 2. However, Node 1 keeps I_{{1,2}}^1, I_{{1,2}}^3, and I_{{1,2}}^4, which are required by Nodes 1, 3, and 4

{I_{M\{k},u}^k : u ∈ M\{k}} from the received coded packets in M, and merges them back to obtain the required intermediate value I_{M\{k}}^k.
Reduce. After the Decoding stage, Node k has obtained all KV pairs in
the partition Pk , for all k ∈ K. In this final stage, Node k, k = 1, . . . , K,
performs the Reduce process as in the TeraSort algorithm, sorting the
KV pairs in partition Pk locally.
2.2.4 Experiments
We empirically demonstrate the performance gain of CodedTeraSort through experiments on Amazon EC2 clusters. In this subsection, we first present the choices we have made for the implementation. Then, we describe the experiment setup. Finally, we discuss the experiment results.
Implementation Choices
Data format. All input KV pairs are generated from TeraGen [62] in
the standard Hadoop package. Each input KV pair consists of a 10-byte
key and a 90-byte value. A key is a 10-byte unsigned integer, and the
value is an arbitrary string of 90 bytes. The KV pairs are sorted based
on their keys, using the standard integer ordering.
Platform and library. We choose Amazon EC2 as the evaluation
platform. We implement both the TeraSort and CodedTeraSort algorithms in C++, and use the Open MPI library [119] for communication among EC2 instances.
System architecture. As shown in Figure 2.8, we employ a system
architecture that consists of a coordinator node and K worker nodes,
for some K ∈ N. Each node is run as an EC2 instance. The coordinator
node is responsible for creating the key partitions and placing the
input files on the local disks of the worker nodes. The worker nodes
are responsible for distributedly executing the stages of the sorting
algorithms.
In-memory processing. After the KV pairs are loaded from the local
files into the workers’ memories, all intermediate data that are used
for encoding, decoding and local sorting are persisted in the memories,
and hence there is no disk I/O involved during the executions of the
algorithms.
In the TeraSort implementation, each node sequentially steps
through Map, Pack, Shuffle, Unpack, and Reduce stages. In the Reduce
Figure 2.9: (a) Serial unicast in the Shuffle stage of TeraSort; a solid arrow repre-
sents a unicast. (b) Serial multicast in the Multicast Shuffle stage of CodedTeraSort;
a group of solid arrows starting at the same node represents a multicast.
stage, the standard sort std::sort is used to sort each partition locally.
To better interpret the experiment results, we add the Pack and the Un-
pack stages to separate the time of serialization and deserialization from
the other stages. The Pack stage serializes each intermediate value to a
continuous memory array to ensure that a single TCP flow is created for
each intermediate value (which may contain multiple KV pairs) when
MPI_Send is called.5 The Unpack stage deserializes the received data to
a list of KV pairs. In the Shuffle stage, intermediate values are unicast
serially, meaning that there is only one sender node and one receiver
node at any time instance. Specifically, as illustrated in Figure 2.9(a),
Node 1 starts to unicast to Nodes 2, 3, and 4 back-to-back. After Node
1 finishes, Node 2 unicasts back-to-back to Nodes 1, 3, and 4. This
continues until Node 4 finishes.
In the CodedTeraSort implementation, each node sequentially steps
through CodeGen, Map, Encode, Multicast Shuffling, Decode, and Re-
duce stages. In the CodeGen (or code generation) stage, firstly, each
node generates all file indices, as subsets of r nodes. Then each node
uses MPI_Comm_split to initialize \binom{K}{r+1} multicast groups, each containing r + 1 nodes, on Open MPI, such that multicast communications
will be performed within each of these groups. The serialization and
deserialization are implemented respectively in the Encode and the
5 Creating a TCP flow per KV pair leads to inefficiency due to overhead and convergence issues.
Experiment Setup
We conduct experiments using the following configurations to evaluate
the performance of CodedTeraSort and TeraSort on Amazon EC2:
2.2.5 Results
The breakdowns of the execution times with K = 16 workers and
K = 20 workers are shown in Tables 2.2 and 2.3 respectively. We
observe an overall 1.97×–3.39× speedup of CodedTeraSort as compared
with TeraSort. From the experiment results we make the following
observations:
6 This is to alleviate the effects of the bursty behavior of transmission rates at the beginning of some TCP sessions. The rates are limited using the traffic control command tc [149].
Table 2.2: Sorting 12 GB data with K = 16 nodes and 100 Mbps network speed
Table 2.3: Sorting 12 GB data with K = 20 nodes and 100 Mbps network speed
Remark 2.10. The above assumption holds for various wireless dis-
tributed computing applications. For example, in a mobile navigation
application, an input is simply the address of the intended destination.
The computed intermediate results contain all possible routes between
the two end locations, from which the fastest one is computed for the
user. Similarly, for a set of “filtering” applications like image recognition
(or similarly augmented reality) and recommendation systems, the in-
puts are light-weight queries (e.g., the feature vector of an image) that
are much smaller than the filtered intermediate results containing all
attributes of related information. For example, an input can be multiple
words describing the type of restaurant a user is interested in, and the
intermediate results returned by a recommendation system application
can be a list of relevant information that include customers’ comments,
pictures, and videos of the recommended restaurants.
We assume that the access point does not have access to the dataset.
Upon decoding all the uplink messages W1 , . . . , WK , the access point
generates a message X from the decoded uplink messages, i.e.,
X = ρ(W1 , . . . , WK ), (2.63)
8 For a small number of files N < \binom{K}{µK}, we can apply the coded wireless distributed computing scheme to a smaller subset of users, achieving part of the gain in reducing the communication load.
L_u^{coded}(µ) = [\binom{K}{µK+1} · (µK + 1) · η · T] / (µK · NT) = 1/µ − 1,   µ ∈ {1/K, 2/K, . . . , 1}.    (2.70)
Downlink communication. For all subsets S ⊆ {1, . . . , K} of size
µK + 1, the access point computes µK random linear combinations of
the uplink messages generated based on the subset S:
For general 1/K ≤ µ ≤ 1, the achieved loads are as stated in (2.75) and (2.76).
Remark 2.11. Theorem 2.3 implies that, for large K, L_u^{coded}(µ) ≈ L_d^{coded}(µ) ≈ 1/µ − 1, which is independent of the number of users. Hence, we can accommodate any number of users without incurring extra communication load, and the proposed scheme is scalable. The reason for this phenomenon is that, as more users join the network, with an appropriate dataset placement, we can create coded multicasting opportunities that reduce the communication loads by a factor that scales linearly with K. Such a phenomenon was also observed in the context of cache networks (see, e.g., [105]).
Figure: Uplink and downlink communication loads versus storage size µ, comparing the uncoded scheme with the (optimal) CWDC scheme; the annotated gains at the marked storage size are roughly 10× on the uplink and 11× on the downlink.
Remark 2.14. Using Theorems 2.3 and 2.4, we have completely characterized the minimum achievable uplink and downlink communication loads, over all dataset placements and all uplink and downlink communication schemes. This implies that the proposed CWDC scheme simultaneously minimizes both the uplink and downlink communication loads required to accomplish distributed computing, and no other scheme can improve upon it. This also demonstrates that there is no fundamental tension between optimizing uplink and downlink communication in wireless distributed computing.
Σ_{j=1}^{K} a_U^j = N,    (2.80)

Σ_{j=1}^{K} j a_U^j = µNK.    (2.81)

Lemma 2.5. L*_u(U) ≥ Σ_{j=1}^{K} (a_U^j/N) · (K − j)/j.
Lemma 2.5 can be proved following steps similar to those in the proof of Lemma 2.2 in Subsection 2.1, after replacing the downlink broadcast message X with the uplink unicast messages W_1, . . . , W_K in the conditional entropy terms (since X is a function of W_1, . . . , W_K).
Next, since the function (K − j)/j in Lemma 2.5 is convex in j, and since by (2.80) we have Σ_{j=1}^{K} a_U^j/N = 1, together with (2.81), we have

L*_u(U) ≥ [K − Σ_{j=1}^{K} j a_U^j/N] / [Σ_{j=1}^{K} j a_U^j/N] = (K − µK)/(µK) = 1/µ − 1.    (2.82)
≥ Σ_{t=1/K, 2/K, . . . , 1} (a_U^{tK}/N) · (p + qt)    (2.88)
= p + qµ.    (2.89)
• We consider the access point as the (K + 1)th user who has stored
all N files and has a virtual input to process. Thus the enhanced
Then following the same arguments as in the proof for the minimum
uplink communication load, we have
L*_d(U) ≥ (K − µK)/(µK + 1) = [µK/(µK + 1)] · (1/µ − 1).    (2.94)
Figure 2.12: (a) An overview of “think like a vertex” approach taken in common
parallel graph computing frameworks, in which the intermediate computations only
depend on the neighbors at each node [45]; (b) Illustration of the fundamental
trade-off curve between communication load L and storage size at each server m in
parallel graph processing.
where the Map function gv,j (wj ) maps the input file wj into an inter-
mediate value, and the Reduce function hv (·) maps the intermediate
values of the neighboring nodes of v into the final output value φv . We
note that a key difference between Equation (2.95) and the general
MapReduce computation in Equation (2.1) is that the computations at
each node now only depend on the neighboring nodes according to the
graph topology.
Based on the above abstraction of the computation model, an inter-
esting problem is to design the optimal allocation of the subset of nodes
(or data) to each available server and the coding for data shuffling, such
that the amount of communication between servers is minimized. More
specifically, let us denote the number of available servers by K and
assume that the maximum number of nodes that can be assigned to one
server or the storage size is denoted by m. Our goal is to characterize the
fundamental trade-off curve between communication load and storage
size (L, m) for an arbitrary graph, and how coding can help in achieving
this fundamental limit (see Figure 2.12(b)). A preliminary exploration
for random graphs was recently presented in [124].
3 Coding for Straggler Mitigation
Figure 3.1: Coded matrix-vector multiplication. Each worker stores a coded sub-
matrix of the data matrix A. During computation, the master can recover the final
result using the results of any 2 out of the 3 workers.
shown in Figure 3.1, a master node partitions the matrix into two sub-
matrices A1 and A2 , and creates a coded sub-matrix A1 +A2 , and gives
each of these three sub-matrices to one of the workers for computation.
Now, the master can recover the desired computation from the results
of any 2 out of the 3 workers. For example as shown in Figure 3.1, the
missing result A2 x can be recovered by subtracting the result of worker 1
from that of worker 3. This example illustrates that by introducing 50%
redundant computations, we can now tolerate a single straggler. For
a general matrix-vector multiplication problem distributedly executed
over n workers, it was proposed in [89] to first partition the matrix into
k sub-matrices, for some k < n, and then use an (n, k) MDS code (e.g.,
Reed–Solomon code) to generate n coded sub-matrices, each of which
is stored on a worker. During the computation process, each worker
multiplies its local sub-matrix with the target vector and returns the
result to the master. Due to the “k out of n” property of the (n, k) MDS code, the master can recover the overall computation result using the results from the fastest k workers, protecting the system from as many as n − k stragglers.
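The following numpy sketch (our illustration, not the authors' implementation) replays the example of Figure 3.1: the workers hold A1, A2, and the coded sub-matrix A1 + A2, and the master reconstructs Ax from any 2 of the 3 returned products.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(4, 3)).astype(float)
x = rng.integers(0, 10, size=(3,)).astype(float)

A1, A2 = A[:2], A[2:]                       # two uncoded sub-matrices
workers = {1: A1, 2: A2, 3: A1 + A2}        # worker 3 stores the coded sub-matrix

results = {i: W @ x for i, W in workers.items()}   # each worker multiplies locally

def decode(returned):
    """Recover A @ x from the results of any 2 of the 3 workers."""
    if 1 in returned and 2 in returned:
        return np.concatenate([returned[1], returned[2]])
    if 1 in returned and 3 in returned:     # A2 x = (A1 + A2) x - A1 x
        return np.concatenate([returned[1], returned[3] - returned[1]])
    if 2 in returned and 3 in returned:     # A1 x = (A1 + A2) x - A2 x
        return np.concatenate([returned[3] - returned[2], returned[2]])

# e.g. worker 2 straggles:
assert np.allclose(decode({1: results[1], 3: results[3]}), A @ x)
```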
While repeating computation tasks has been demonstrated to be an effective approach to straggler mitigation (see, e.g., [8, 55, 73, 153]), many recent works have focused on characterizing the optimal codes to combat the straggler effect in distributed linear algebraic computations like matrix-vector and matrix–matrix multiplication. For a
problem of multiplying a matrix with a long vector, a sparse code was
designed in [47] such that only a subset of the entries of the vector are
Specifically, each worker i can store two matrices Ã_i ∈ F_q^{s×(r/m)} and B̃_i ∈ F_q^{s×(t/n)}, computed based on arbitrary functions of A and B, respectively. Each worker can compute the product C̃_i ≜ Ã_i^⊤ B̃_i, and return it to the master. The master waits only for the results from a subset of workers, before proceeding to recover (or compute) the final output C given these products using certain decoding functions.1
Problem Formulation
Given the above system model, we formulate the distributed matrix
multiplication problem based on the following terminology: We define
the computation strategy as the 2N functions, denoted by
Figure 3.3: Product code [90] in an example with N = 9 workers that can each
store half of A and half of B.
The first one, referred to as one dimensional MDS code (1D MDS code),
was introduced in [89] and extended in [90]. The 1D MDS code, as
illustrated before in Figure 3.1, injects redundancy in only one of the
input matrices using maximum distance separable (MDS) codes [143].
In general, one can show that the 1D MDS code achieves a recovery
threshold of
K_{1D-MDS} ≜ N − N/n + m = Θ(N).    (3.4)
An alternative computing scheme was recently proposed in [90]
for the case of m = n, referred to as the product code, which instead
injects redundancy in both input matrices. This coding technique has
also been proposed earlier in the context of Fault Tolerant Computing
in [67, 74]. As shown in Figure 3.3, the product code aligns the workers in a √N-by-√N layout. The matrix A is divided along the columns into m submatrices, encoded using a (√N, m) MDS code into √N coded matrices, and then assigned to the √N columns of workers. Similarly, √N coded matrices of B are created and assigned to the √N
rows. Given the property of MDS codes, the master can decode an
entire row after obtaining any m results in that row; likewise for the
columns. Consequently, the master can recover the final output using
a peeling algorithm, iteratively decoding the MDS codes on rows and
columns until the output C is completely available. For example, if the 5 computing results A_1^⊤B_0, A_1^⊤B_1, (A_0 + A_1)^⊤B_1, A_0^⊤(B_0 + B_1), and A_1^⊤(B_0 + B_1) are received as demonstrated in Figure 3.3, the master can
Main Result
Our main result, which demonstrates that the optimum recovery thresh-
old can be far less than what the above two schemes achieve, is stated
in the following theorem:
K ∗ = mn. (3.6)
Remark 3.1. Compared to the state of the art [89, 90], the polynomial code provides an order-wise improvement in terms of the recovery threshold. Specifically, the recovery thresholds achieved by the 1D MDS code [89, 90] and the product code [90] scale linearly with N and √N, respectively, while the proposed polynomial code achieves a recovery threshold that does not scale with N. Furthermore, the polynomial code achieves the optimal recovery threshold.
Remark 3.2. The polynomial code not only improves the state of the
art asymptotically, but also gives strict and significant improvement
Figure 3.4: Comparison of the recovery thresholds achieved by the proposed polynomial code and the state of the art (1D MDS code [89] and product code [90]), where each worker can store a 1/10 fraction of each input matrix. The polynomial code attains the optimum recovery threshold K*, and significantly improves over the state of the art.
Figure 3.5: Example using polynomial code, with N = 5 workers that can each
store half of each input matrix. Computation strategy: each worker i stores A0 + iA1
and B_0 + i^2 B_1, and computes their product. Decoding: the master waits for results from any 4 workers, and decodes the output using a fast polynomial interpolation algorithm.
Motivating Example
We start by demonstrating the key ideas of polynomial code through a
motivating example. Consider a distributed matrix multiplication task
of computing C = A⊤B using N = 5 workers that can each store half
of the matrices (see Figure 3.5). We evenly divide each input matrix
along the column side into 2 submatrices:
A = [A0 A1 ], B = [B0 B1 ]. (3.7)
Given this notation, we essentially want to compute the following 4 uncoded components:

C = A⊤B = \begin{bmatrix} A_0^⊤B_0 & A_0^⊤B_1 \\ A_1^⊤B_0 & A_1^⊤B_1 \end{bmatrix}.    (3.8)
Now we design a computation strategy to achieve the optimum recovery
threshold of 4. Suppose elements of A, B are in F7 , let each worker
i ∈ {0, 1, . . . , 4} store the following two coded submatrices:
Ã_i = A_0 + iA_1,   B̃_i = B_0 + i^2 B_1.    (3.9)
To prove that this design gives a recovery threshold of 4, we need to
design a valid decoding function for any subset of 4 workers. Without
\begin{bmatrix} C̃_1 \\ C̃_2 \\ C̃_3 \\ C̃_4 \end{bmatrix} = \begin{bmatrix} 1^0 & 1^1 & 1^2 & 1^3 \\ 2^0 & 2^1 & 2^2 & 2^3 \\ 3^0 & 3^1 & 3^2 & 3^3 \\ 4^0 & 4^1 & 4^2 & 4^3 \end{bmatrix} \begin{bmatrix} A_0^⊤B_0 \\ A_1^⊤B_0 \\ A_0^⊤B_1 \\ A_1^⊤B_1 \end{bmatrix}.    (3.10)
The coefficient matrix in the above equation is Vandermonde, and hence
invertible since its parameters 1, 2, 3, 4 are distinct in F7 . So one way
to recover C is to directly invert Equation (3.10). However, directly
computing this inverse using the classical inversion algorithm might
be expensive in more general cases. Quite interestingly, the decoding
process can also be viewed as a polynomial interpolation problem (or
equivalently, decoding a Reed–Solomon code subject to erasures).
Specifically, in this example each worker i returns C̃_i = Ã_i^⊤B̃_i = A_0^⊤B_0 + A_1^⊤B_0·i + A_0^⊤B_1·i^2 + A_1^⊤B_1·i^3, i.e., the evaluation at the point x = i of a matrix-valued polynomial of degree 3 whose coefficients are exactly the four desired products.
In order for the master to recover the output given any mn results (i.e., achieve the optimum recovery threshold), we carefully select the design parameters α and β, while making sure that no two terms in the above formula have the same exponent of x. One such choice is (α, β) = (1, m), i.e.,

Ã_i = Σ_{j=0}^{m−1} A_j x_i^j,   B̃_i = Σ_{j=0}^{n−1} B_j x_i^{jm}.    (3.16)
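As an illustration of this construction, the sketch below implements the polynomial code over the reals with numpy (the text works over a finite field F_q; a real-valued Vandermonde solve is only numerically viable for small N, so this is a toy with m = n = 2 and N = 5 assumed).

```python
import numpy as np

# Sketch of the polynomial code of (3.16), over the reals for simplicity.
m, n, N = 2, 2, 5                      # store 1/m of A, 1/n of B, N workers
s, r, t = 6, 4, 4
rng = np.random.default_rng(1)
A, B = rng.standard_normal((s, r)), rng.standard_normal((s, t))
A_blocks = np.split(A, m, axis=1)      # A_0, ..., A_{m-1}
B_blocks = np.split(B, n, axis=1)      # B_0, ..., B_{n-1}

xs = np.arange(1, N + 1, dtype=float)  # distinct evaluation points x_i
A_tilde = [sum(A_blocks[j] * x**j for j in range(m)) for x in xs]
B_tilde = [sum(B_blocks[j] * x**(j * m) for j in range(n)) for x in xs]
C_tilde = [At.T @ Bt for At, Bt in zip(A_tilde, B_tilde)]   # worker results

# Decode from any mn = 4 workers: C~_i evaluates h(x) = sum A_j^T B_l x^{j + l*m}
fast = [0, 2, 3, 4]                    # indices of the mn fastest workers
V = np.vander(xs[fast], m * n, increasing=True)             # Vandermonde system
coeffs = np.linalg.solve(V, np.stack([C_tilde[i].ravel() for i in fast]))
C_hat = np.block([[coeffs[j + l * m].reshape(r // m, t // n)
                   for l in range(n)] for j in range(m)])
assert np.allclose(C_hat, A.T @ B)
```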
T ≥ Tpoly . (3.18)
Proof sketch. We know from the converse proof of Theorem 3.1 that, under an arbitrary computation strategy, in order for the master to recover the output matrix C at time T, it has to receive the computation results from at least mn workers. However, using the polynomial code, the matrix C can be recovered as soon as mn workers return their results. Therefore, we have T ≥ Tpoly.
Corollary 3.3. For any computation strategy, let T denote its computa-
tion latency, and let Tpoly denote the computation latency of polynomial
code. We have
Corollary 3.3 directly follows from Theorem 3.2 since (3.18) implies
(3.19).
Communication load is another important metric in distributed
computing (e.g., [93, 97, 166]), defined as the minimum number of bits
needed to be communicated in order to complete the computation.
L∗ = rt log2 q. (3.20)
Proof. Recall that in the converse proof of Theorem 3.1, we have shown
that if the input matrices are sampled based on a certain distribution,
then decoding the output C requires that the entropy of the entire
message received by the server is at least rt log2 q. Consequently, it takes at least rt log2 q bits to deliver such messages, which lower bounds the minimum communication load.
On the other hand, the polynomial code requires delivering rt ele-
ments in Fq in total, which achieves this minimum communication load.
Hence, the minimum communication load L∗ equals rt log2 q.
Remark 3.6. While polynomial codes provide the optimal design, with
respect to the above metrics, for straggler mitigation in distributed
matrix multiplication, one can also consider other metrics and variations
of the problem setting for which the problem is still not completely
solved. One variation is “approximate distributed matrix multiplication”,
which has been studied in [59, 70]. Another variation is coded computing
in heterogeneous and dynamic network settings, which has been studied
in [54, 109, 115, 130, 131, 161].
2 The total degree of a polynomial f is the maximum among the total degrees of its monomials. In the case where F is finite, we resort to the canonical representation of polynomials, in which the individual degrees within each term are no more than |F| − 1.
3 Note that if the number of workers is too small, obviously no valid computation design exists unless f is a constant. Hence, in the rest of this subsection we focus on meaningful cases where N is large enough such that there is a valid computation design for at least one non-trivial function f (i.e., N ≥ K).
within this class by letting V be the space of input tensors, U be the space of output tensors, X_i be the inputs, and f be the tensor function.
Gradient computation. Another general class of functions arises from
gradient descent algorithms and their variants, which are the workhorse of today’s learning tasks. The computation task for this class of functions is to consider one iteration of the gradient descent algorithm, and to evaluate the gradient of the empirical risk ∇L_S(h) ≜ avg_{z∈S} ∇ℓ_h(z), given a hypothesis h: R^d → R, a respective loss function ℓ_h: R^{d+1} → R, and a training set S ⊆ R^{d+1}, where d is the number of features. In practice,
this computation is carried out by partitioning S into K equal-sized subsets {S_i}_{i=1}^K, evaluating the partial gradients {∇L_{S_i}(h)}_{i=1}^K distributedly, and computing the final result using ∇L_S(h) = avg_{i∈[K]} ∇L_{S_i}(h).
We present a specific example of applying this computing model to
least-squares regression problems in Subsection 3.2.5.
K ∗ = (K − 1) deg f + 1
Remark 3.8. The key idea of LCC is to encode the input dataset
using the well-known Lagrange polynomial. In particular, the encoding
functions (i.e., gi ’s) amount to evaluations of a Lagrange polynomial
of degree K − 1 at N distinct points. Hence, the computations at the
workers amount to evaluations of a composition of that polynomial with
the desired function f . Therefore, K ∗ may simply be seen as the number
of evaluations that are necessary and sufficient in order to interpolate the composed polynomial, which is later evaluated at certain points to finalize the computation.
(X̃_1, . . . , X̃_5) = (X_1, X_2) · \begin{pmatrix} 1 & 0 & −1 & −2 & −3 \\ 0 & 1 & 2 & 3 & 4 \end{pmatrix}.
Note that when applying f over its stored data, each worker essentially evaluates a linear combination of 6 possible terms: four quadratic terms X_i^⊤X_j w and two linear terms X_i^⊤y. However, the master only wants two specific linear combinations of them: X_1^⊤(X_1 w − y) and X_2^⊤(X_2 w − y). Interestingly, LCC optimally aligns the computation of the workers in the sense that the linear combinations returned by the workers belong to a subspace of only 3 dimensions, which can be recovered from the computing results of any 3 workers, while containing the two needed linear combinations.
More specifically, each worker i evaluates the polynomial
General Description
When the number of workers is small (i.e., N < K deg f − 1), the opti-
mum recovery threshold K ∗ = N − bN/Kc + 1 can be easily achieved
by uncoded repetition design – that is, by replicating every Xi be-
tween bN/Kc and dN/Ke times, it is readily verified that every set
of N −bN/Kc+1 computation results contains at least one copy of f (Xi )
for every i. Hence, we focus on the case where N ≥ K deg f − 1.
First, we select any K distinct elements β_1, . . . , β_K from F, and find a polynomial u: F → V of degree K − 1 such that u(β_i) = X_i for any i ∈ [K] = {1, . . . , K}. This is simply accomplished by letting u be the respective Lagrange interpolation polynomial u(z) ≜ Σ_{j∈[K]} X_j · Π_{k∈[K]\{j}} (z − β_k)/(β_j − β_k). We then select N distinct elements α_1, . . . , α_N from F, and encode the input variables by letting X̃_i = u(α_i) for any i ∈ [N]. That is,

X̃_i = g_i(X) = u(α_i) ≜ Σ_{j=1}^{K} X_j · Π_{k∈[K]\{j}} (α_i − β_k)/(β_j − β_k).    (3.21)
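To make the encoding and decoding concrete, here is a small numpy sketch of LCC for the degree-2 function f(X) = X⊤X with K = 2 and N = 5 (our toy example, run over the reals for readability; the text works over a sufficiently large finite field). The master decodes from any (K − 1) deg f + 1 = 3 results by interpolating the composed polynomial f(u(z)) and evaluating it at the β_i.

```python
import numpy as np

# Toy LCC sketch: K = 2 data blocks, N = 5 workers, f(X) = X.T @ X (degree 2).
K, N = 2, 5
rng = np.random.default_rng(2)
X = [rng.standard_normal((4, 3)) for _ in range(K)]   # X_1, ..., X_K
f = lambda Z: Z.T @ Z

betas = np.array([1.0, 2.0])                 # interpolation points for the data
alphas = np.array([3.0, 4.0, 5.0, 6.0, 7.0]) # evaluation points for the workers

def lagrange_coeff(z, j, points):
    # ell_j(z) = prod_{k != j} (z - points[k]) / (points[j] - points[k])
    c = 1.0
    for k, pk in enumerate(points):
        if k != j:
            c *= (z - pk) / (points[j] - pk)
    return c

X_tilde = [sum(lagrange_coeff(a, j, betas) * X[j] for j in range(K)) for a in alphas]
Y_tilde = [f(Xt) for Xt in X_tilde]          # what the workers return

# Master decodes from the fastest (K - 1) * deg(f) + 1 = 3 workers:
fast = [0, 2, 4]
for i, b in enumerate(betas):
    # interpolate the composed polynomial f(u(z)) and evaluate it at z = beta_i
    coeffs = [lagrange_coeff(b, idx, alphas[fast]) for idx in range(len(fast))]
    recovered = sum(c * Y_tilde[w] for c, w in zip(coeffs, fast))
    assert np.allclose(recovered, f(X[i]))
```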
recovery threshold smaller than the lower bound stated in Lemma 3.6,
there would be scenarios where all available computing results are
degenerated (i.e., constants), while the computing results needed by the
master are variable, thus violating the decodability requirement.
Next in Step 2, we prove the matching converse for any polynomial function. Given any function f with degree d, we first construct a non-zero, multilinear function f′ with the same degree. Then we let K_f^*(K, N) denote the minimum recovery threshold for function f, and prove K_f^*(K, N) ≥ K_{f′}^*(K, N), by constructing a computation design of f′ that is based on a computation design of f and achieves the same recovery threshold. The construction and its properties are stated in the following lemma, whose proof can be found in [170, Appendix E].
min_{w∈R^d} L(w) = (1/m) Σ_{i=1}^{m} (x_i^⊤ w − y_i)^2 = (1/m) ||Xw − y||^2.    (3.22)

min_{h∈H} L(h) = (1/m) Σ_{i=1}^{m} (h(x_i) − y_i)^2.    (3.23)
Such nonlinear regression problems can often be cast in the form (3.22), and be solved efficiently using the so-called kernelization trick [137]. However, for simplicity of exposition we focus on the simpler instance (3.22).
A popular approach to solve the above problem is via gradient
descent (GD). In particular, GD iteratively refines the weight vector
w by moving along the negative gradient direction via the following
updates

w^{(t+1)} = w^{(t)} − η^{(t)} ∇L(w^{(t)}) = w^{(t)} − (2η^{(t)}/m) X^⊤(Xw^{(t)} − y).    (3.24)
Here, η (t) is the learning rate in the tth iteration.
When the size of the training data is too large to store/process on
a single machine, the GD updates can be calculated in a distributed
fashion over many computing nodes. As illustrated in Figure 3.6, we
consider a computing architecture that consists of a master node and n
worker nodes. Using a naive data-parallel distributed regression scheme,
we first partition the input data matrix X into n equal-sized sub-
matrices such that X = [X_0 . . . X_{n−1}]^⊤, where each sub-matrix X_j ∈ R^{d×(m/n)} contains m/n input data points and is stored on worker j. Within each iteration, each worker j computes a partial gradient using its locally stored sub-matrix X_j, and the master aggregates the workers’ results to obtain the full gradient ∇L(w^{(t)}).
Then, the master uses this gradient to update the weight vector
via (3.24).4
4 Since the value of X^⊤y does not vary across iterations, it only needs to be computed once. We assume that it is available at the master for weight updates.
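For concreteness, the following sketch (ours, using a row partition of X for simplicity) implements the uncoded data-parallel baseline just described: each worker holds one block of X and returns a partial gradient, which the master aggregates and plugs into the update (3.24). The coded schemes of this section replace the uncoded blocks with coded ones.

```python
import numpy as np

# Naive data-parallel gradient descent for min_w (1/m)||Xw - y||^2, per (3.24).
m, d, n_workers = 1200, 10, 4
rng = np.random.default_rng(3)
X = rng.standard_normal((m, d))
w_true = rng.standard_normal(d)
y = X @ w_true

X_blocks = np.split(X, n_workers)       # one block per worker
y_blocks = np.split(y, n_workers)

w = np.zeros(d)
eta = 0.1
for t in range(200):
    # each worker j computes a partial gradient on its local block
    partials = [Xj.T @ (Xj @ w - yj) for Xj, yj in zip(X_blocks, y_blocks)]
    grad = (2.0 / m) * sum(partials)    # master aggregates
    w = w - eta * grad                  # update (3.24)

assert np.allclose(w, w_true, atol=1e-3)
```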
Remark 3.13. We note that LCC is also directly applicable for non-
linear regression problems using kernel methods. To do that, we simply
replace the data matrix X with the kernel matrix K, whose entry
Kij = k(xi , xj ) is some kernel function of the data points xi and xj .
Figure 3.7: Run-time comparison of LCC with three other schemes: conventional uncoded, GC, and MVM.
Figure 3.8: (a) Naive synchronous gradient descent: each worker j computes the gradient g_j on its data partition D_j, and the master sums g1 + g2 + g3. (b) Gradient coding: each partition is placed on two workers; Worker 1 (D1, D2) sends g1/2 + g2, Worker 2 (D2, D3) sends g2 − g3, and Worker 3 (D3, D1) sends g1/2 + g3, so that g1 + g2 + g3 can be recovered from any 2 of the 3 results.
and the system moves to the next iteration. This setup is, however,
subject to delays introduced by stragglers because the master has to
wait for outputs of all three workers before computing g1 + g2 + g3 .
Figure 3.8(b) illustrates one way to resolve this problem by replicat-
ing data across machines as shown, and sending linear combinations of
the associated gradients. As shown in Figure 3.8(b), each data partition
is replicated twice using a specific placement policy. Each worker is
assigned to compute two gradients on their assigned two data partitions.
For instance, Worker 1 computes vectors g1 and g2, and then sends (1/2)g1 + g2. Interestingly, g1 + g2 + g3 can be constructed from any two out of these three vectors. For instance, g1 + g2 + g3 = 2((1/2)g1 + g2) − (g2 − g3).
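The decoding in this example is just one fixed linear combination per straggler pattern; the snippet below (ours) checks all three two-worker patterns numerically.

```python
import numpy as np

# Gradient coding example of Figure 3.8(b): three workers send
#   w1 = g1/2 + g2,   w2 = g2 - g3,   w3 = g1/2 + g3,
# and the master recovers g1 + g2 + g3 from any two of them.
rng = np.random.default_rng(4)
g1, g2, g3 = (rng.standard_normal(5) for _ in range(3))

w1, w2, w3 = 0.5 * g1 + g2, g2 - g3, 0.5 * g1 + g3
target = g1 + g2 + g3

assert np.allclose(2 * w1 - w2, target)        # workers {1, 2}
assert np.allclose(w1 + w3, target)            # workers {1, 3}
assert np.allclose(2 * w3 + w2, target)        # workers {2, 3}
```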
Finally, we end this section with some of the open problems and fu-
ture directions for designing straggler-resilient coded computing systems.
Low-complexity algorithms for coded matrix multiplication.
While the naive multiplication of an M × N matrix A by an N × L
matrix B has complexity O(M N L), there is a rich literature that has
discovered low complexity implementations, especially if the matrices are
restricted to a certain class. When the entries of matrix A come from a
bounded alphabet A (e.g., A is the adjacency matrix of a degree-bounded
graph in common graph algorithms like pagerank, or Laplacian matrix
calculation), the product AB can be computed via the four Russians
algorithm [100, 150] using O(M N L log2 |A|/ log2 N ) operations - an
improvement of a factor of log2 N as compared to the naive approach
for small alphabet. There are some unique challenges for the use of the
four Russians method in coded distributed matrix multiplication due to
the fact that the alphabet size of good codes tends to be large. Consider a
concrete example where A = {0, 1}, and B is an N × 1 vector. Surprisingly, a back-of-the-envelope calculation reveals that a natural application of MDS codes to the case of binary multiplication has the same per-node computational complexity as replication, O((MN/n) · (log2 N/(s + 1))), for a fixed straggler
4 Coding for Security and Privacy
learning (see, e.g., [18, 37, 38, 64, 114]). In this section, we demon-
strate how coding theory can help to maintain security and privacy in
multiparty computing and distributed learning. Specifically, we first
demonstrate that how we can extend the Lagrange Coded Computing
framework proposed in the previous section to provide MPC systems
with security and privacy guarantees. We also compare LCC with
state-of-the-art MPC schemes (e.g., the celebrated BGW scheme for
secure/private MPC [18]), and illustrate the substantial reduction in the
amount of randomness, storage overhead, and computational complexity
achieved by LCC.
Second, we demonstrate the application of coded computing for
privacy-preserving machine learning. In particular, we consider an appli-
cation scenario in which a data-owner (e.g., a hospital) wishes to train
a logistic regression model by offloading the large volume of data (e.g.,
healthcare records) and computationally-intensive training tasks (e.g.,
gradient computations) to N machines over a cloud platform, while
ensuring that any collusions between T out of N workers do not leak
information about the dataset. We then discuss a recently proposed scheme [145], named CodedPrivateML, that leverages coded computing
for this problem. We finally end this section with a discussion on some
related works and open problems.
workers compute f(X̃_i) and send the result back to the master. The master needs to retrieve {f(X_i)}_{i=1}^K in the presence of at most A malicious workers, and maintain the perfect privacy of the dataset in the face of up to T colluding workers.
Illustrative Example
Consider the function f(X_i) = X_i^2, where the inputs X_i are √M × √M square matrices for some square integer M. We demonstrate LCC in the scenario where the input data X is partitioned into K = 2
in the scenario where the input data X is partitioned into K = 2
batches X1 and X2 , and the computing system has N = 7 workers. In
addition, the scheme guarantees perfect privacy against any individual
worker (i.e., T = 1), and is robust against any single malicious worker
(i.e., A = 1).
2 Equivalently, Equation (4.1) requires that X̃_T and X are independent. Under this condition, the input data X still appears uniformly random after the colluding workers learn X̃_T, which guarantees privacy.
where U ∈ F_{11}^{3×7} satisfies U_{i,j} = Π_{ℓ∈[3]\{i}} (α_j − ℓ)/(i − ℓ) for (i, j) ∈ [3] × [7].
First, notice that for every j ∈ [7], worker j sees X̃j , which is a
linear combination of X1 and X2 masked by addition of λ · Z for some
nonzero λ ∈ F11 ; since Z is uniformly random, this guarantees perfect
privacy for T = 1. Next, worker j computes f (X̃j ) = f (u(αj )), which
is an evaluation of the composition polynomial f (u(z)), with degree at
most 4, at αj .
Normally, a polynomial of degree 4 can be interpolated from 5 eval-
uations at distinct points. However, the presence of A = 1 malicious
worker requires the master to employ a Reed–Solomon decoder, and have
two additional evaluations at distinct points (in general, two additional
evaluations for every malicious worker). Finally, after decoding polyno-
mial f (u(z)), the master can obtain f (X1 ) and f (X2 ) by evaluating it
at z = 1 and z = 2.
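The sketch below (our toy rendering of this example) runs the same construction over F_11 with scalar inputs instead of √M × √M matrices and, for brevity, with no malicious worker; as the text notes, tolerating A = 1 adversary would additionally require a Reed–Solomon decoder and two extra evaluations.

```python
# Toy LCC example over F_11 (scalar inputs, honest-but-curious workers only).
p = 11
inv = lambda a: pow(a, p - 2, p)                 # modular inverse in F_p

def lagrange_eval(points, values, z):
    """Evaluate, at z, the unique polynomial through (points[i], values[i]) over F_p."""
    total = 0
    for j, (xj, yj) in enumerate(zip(points, values)):
        num, den = 1, 1
        for k, xk in enumerate(points):
            if k != j:
                num = num * (z - xk) % p
                den = den * (xj - xk) % p
        total = (total + yj * num * inv(den)) % p
    return total

f = lambda x: x * x % p                          # f(X) = X^2

X1, X2, Z = 5, 9, 7                              # data blocks and a uniformly random mask
betas = [1, 2, 3]                                # u(1) = X1, u(2) = X2, u(3) = Z
alphas = [4, 5, 6, 7, 8, 9, 10]                  # one evaluation point per worker

# each share mixes X1, X2 with a nonzero multiple of the random mask Z
X_tilde = [lagrange_eval(betas, [X1, X2, Z], a) for a in alphas]
Y_tilde = [f(xt) for xt in X_tilde]              # worker results

# deg f(u(z)) <= 4, so any 5 honest evaluations suffice to interpolate it:
fast = [0, 1, 3, 5, 6]
pts = [alphas[i] for i in fast]
vals = [Y_tilde[i] for i in fast]
assert lagrange_eval(pts, vals, 1) == f(X1)
assert lagrange_eval(pts, vals, 2) == f(X2)
```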
General Description
To start, we first select any K +T distinct elements β1 , . . . , βK+T from F,
and find a polynomial u: F → V of degree K +T −1 such that u(βi ) = Xi
for any i ∈ [K], and u(βi ) = Zi for i ∈ {K + 1, . . . , K + T }, where
all Zi ’s are chosen uniformly at random from V. This is accomplished
by selecting N distinct elements α_1, . . . , α_N from F and letting X̃_i = u(α_i) for any i ∈ [N]. That is, the input variables are encoded as

X̃_i = g_i(X) = u(α_i) ≜ Σ_{j=1}^{K+T} u(β_j) · Π_{k∈[K+T]\{j}} (α_i − β_k)/(β_j − β_k).    (4.3)
Therefore, for every dataset X and every observed encoding X̃T , there
exists a unique value for the randomness Z by which the encoding of X
equals X̃T ; a statement equivalent to the definition of T -privacy.
Following the encoding of (4.3), each worker i applies f on X̃i
and sends the result back to the master. Hence, the master obtains N
evaluations, at most A of which are incorrect, of the polynomial f (u(z)).
Since deg f (u(z)) ≤ deg f · (K + T − 1), and N ≥ (K + T − 1) deg(f ) +
2A + 1, the master can obtain all coefficients of f (u(z)) by applying
Reed–Solomon decoding. Having this polynomial, the master evaluates
it at βi for every i ∈ [K] to obtain f (u(βi )) = f (Xi ). This results in
the following theorem for LCC.
4.1. Secure and Private Multiparty Computing 109
(K + T − 1) deg f + 2A + 1 ≤ N, (4.4)
Compared with the result in Theorem 4.1 (for the case of T = 0),
Theorem 4.2 demonstrates that the LCC scheme provides the optimal
security, by protecting against maximum possible number of adversaries.
Proof of Theorem 4.2. We prove Theorem 4.2 by connecting the ad-
versary tolerance problem to the straggler mitigation problem described
in Subsection 3.2.1, using the extended concept of Hamming distance
for coded computing.
3 That is, when the two sides of (4.4) are equal.
coded into n storage servers, such that the k message symbols are re-
constructible from any n − r servers, and any z servers are information
theoretically oblivious to the message symbols. Further, such a scheme is
assumed to use v random entries as keys, and by [69, Proposition 3.1.1],
must satisfy n − r ≥ k + z.
Table 4.1: Comparison between BGW based designs and LCC. The computational
complexity is normalized by that of evaluating f ; randomness, which refers to
the number of random entries used in encoding functions, is normalized by the
length of Xi
                        BGW       LCC
Complexity/worker       K         1
Frac. data/worker       1         1/K
Randomness              KT        T
Min. num. of workers    2T + 1    deg f · (K + T − 1) + 1
nodes. In the LCC scheme, on the other hand, each worker ` only needs
to store one encoded data X̃` and compute f (X̃` ). This gives rise to
the second key advantage of LCC, which is a factor of K in storage
overhead and computation complexity at each worker.
After computation, each worker ` in the BGW scheme has essentially
evaluated the polynomials {f (Pi (z))}K i=1 at z = α` , whose degree is at
most deg f · T . Hence, if no adversary appears (i.e., A = 0), the master
can recover all required results f (Pi (0))’s, through polynomial interpola-
tion, as long as N ≥ deg f ·T +1 workers participated in the computation.
It is also possible to use the conventional multi-round BGW, which only
requires N ≥ 2T + 1 workers to ensure T -privacy. However, multiple
rounds of computation and communication (Ω(log(deg f )) rounds) are
needed, which further increases its communication overhead. Note that
under the same condition, LCC scheme requires N ≥ deg f ·(K+T −1)+1
number of workers, which is larger than that of the BGW scheme.
Hence, in overall comparison with the BGW scheme, LCC results in
a factor of K reduction in the amount of randomness, storage overhead,
and computation complexity, while requiring more workers to guarantee
the same level of privacy. This is summarized in Table 4.1.5
5 A BGW scheme was also proposed in [18] for secure MPC, albeit for a substantially different setting. Similarly, a comparison can be made by adapting it to our setting, leading to similar results, which we omit for brevity.
C(w) = (1/m) Σ_{i=1}^{m} (−y_i log ŷ_i − (1 − y_i) log(1 − ŷ_i))    (4.5)
Figure 4.2: The master-worker architecture for CodedPrivateML: the master holds the dataset X = (X_1, . . . , X_K) and offloads the gradient computations to N workers, up to T of which may collude.
∇C(w) = (1/m) X^⊤(g(X × w) − y). The update function is given by

w^{(t+1)} = w^{(t)} − (η/m) X^⊤(g(X × w^{(t)}) − y),    (4.6)

where w^{(t)} holds the estimated parameters from iteration t, η is the learning rate, and g(·) operates element-wise.
We consider the master-worker distributed computing architecture shown in Figure 4.2, where the master offloads the gradient computations in (4.6) to N workers. In doing so, the master also wants to protect the privacy of the dataset against any potential collusion between up to T workers, where T is the privacy parameter of the system. Initially, the dataset is partitioned into K submatrices X = [X_1^{\top} \cdots X_K^{\top}]^{\top}.
The encoded matrices do not leak any information about the true dataset, even if T workers collude. In addition, the master has to ensure that the weight estimates sent to the workers at each iteration do not leak information about the dataset. This is because the weights updated via (4.6) carry information about the whole training set, and sending them directly to the workers may breach privacy. To prevent this, at iteration t the master also quantizes the current weight vector w(t) to the finite field and encodes it again using Lagrange coding.
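The quantization step can be sketched as follows; the unbiased stochastic rounding, field size, and resolution below are illustrative assumptions rather than the exact construction of [145].

# Illustrative stochastic quantization of real-valued weights into a prime field,
# and the corresponding dequantization back to the reals; parameters are assumptions.
import numpy as np

P = 2**31 - 1            # prime field size (illustrative)
LSB = 2.0 ** -10         # quantization resolution (illustrative)

def quantize(w, rng):
    """Unbiased stochastic rounding of w/LSB, with negatives mapped into F_P."""
    scaled = w / LSB
    low = np.floor(scaled)
    q = low + (rng.random(w.shape) < (scaled - low))     # round up with prob. frac(scaled)
    return np.mod(q.astype(np.int64), P)

def dequantize(q):
    """Map field elements back to signed reals (elements above P/2 represent negatives)."""
    signed = np.where(q > P // 2, q - P, q).astype(np.float64)
    return signed * LSB

rng = np.random.default_rng(0)
w = rng.normal(size=4)
assert np.allclose(dequantize(quantize(w, rng)), w, atol=LSB)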
Phase 3: Polynomial Approximation and Local Computation. In the third phase, each worker performs the computations using its local storage and sends the result back to the master. We note that the workers perform the computations over the encoded data as if they were computing over the true dataset; that is, the structure of the computation is the same whether it is carried out over the true dataset or over the encoded dataset. A major challenge is that LCC is designed for distributed polynomial computations, whereas the computations in the training phase are not polynomials due to the sigmoid function. We overcome this by approximating the sigmoid with a polynomial of a selected degree r (see the sketch below). This allows us to represent the gradient computations in terms of polynomials that can be computed locally by each worker.
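For instance, a degree-r least-squares fit of the sigmoid over a bounded interval can serve as such a surrogate; the interval, the degree, and the use of numpy.polyfit below are illustrative choices, not the specific approximation used in [145].

# Least-squares degree-r polynomial surrogate for the sigmoid on [-5, 5];
# workers would evaluate this polynomial in place of g(.), so the per-iteration
# computation stays polynomial in the encoded data. All choices are illustrative.
import numpy as np

def fit_sigmoid_poly(r=3, lo=-5.0, hi=5.0, n=1001):
    z = np.linspace(lo, hi, n)
    return np.polyfit(z, 1.0 / (1.0 + np.exp(-z)), deg=r)   # coefficients, highest degree first

coeffs = fit_sigmoid_poly(r=3)
z0 = 0.5
print(np.polyval(coeffs, z0), 1.0 / (1.0 + np.exp(-z0)))    # surrogate vs. true sigmoid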
Phase 4: Decoding and Model Update. The master collects the results from a subset of the fastest workers and decodes the gradient. Then, the master converts the gradient from the finite field back to the real domain, updates the weight vector, and secret shares it with the workers for the next round. Based on this design, we can obtain the following theoretical guarantees for the convergence and privacy of CodedPrivateML; we refer to [145] for the details.
Figure 4.3: Performance gain of CodedPrivateML over the MPC baseline ([BH08] from [17]). The plot shows the total training time for different numbers of workers N.
The main reason for this is that, in the MPC baselines, the size of the data processed at each worker is one third of the original dataset, while in CodedPrivateML it is 1/K-th of the dataset. This reduces both the computational overhead of each worker when computing matrix multiplications and the communication overhead between the master and the workers. We also observe that a larger speedup is achieved as the dimension of the dataset becomes larger (CIFAR-10 vs. GISETTE datasets).
Figure 4.5 presents the cross entropy loss for CodedPrivateML versus the conventional logistic regression model for the GISETTE dataset. The latter setup uses the sigmoid function with no polynomial approximation; in addition, no quantization is applied to the dataset or the weight vectors. We observe that CodedPrivateML achieves convergence comparable to that of the conventional model.
4.3 Related Works and Open Problems
Security and privacy issues have been extensively studied in the literature on secure multiparty computing and distributed machine learning/data mining [18, 37, 38, 102, 114]. For instance, the celebrated BGW scheme [18] employs Shamir's scheme [141] to privately share intermediate results between parties. As we have elaborated in Subsection 4.1, the proposed LCC scheme significantly improves upon BGW in terms of the required storage overhead, computational complexity, and the amount of injected randomness (Table 4.1).
There have also been several other recent works on coded computing under privacy and security constraints. Extending the research on secure storage (see, e.g., [122, 140, 128]), staircase codes [21] have been proposed to combat stragglers in linear computations (e.g., matrix-vector multiplications) while preserving data privacy, and were shown to reduce the computation latency compared with schemes based on classical secret sharing strategies [110, 141]. The Lagrange Coded Computing scheme proposed in this section generalizes staircase codes beyond linear computations. Even for the linear case, LCC guarantees data
Appendices
A Proof of Lemma 3.6
Lower Bound on the Recovery Threshold of Computing Multilinear Functions
Before we start the proof, we let K∗_f(K, N) denote the minimum recovery threshold given the function f, the number of computations K, and the number of workers N. We now proceed to prove Lemma 3.6 by induction.
(a) When d = 1, f is a linear function, and we aim to prove K∗_f(K, N) ≥ K. Assuming the opposite, we can find a computation
Following the same arguments we used in the d = 1 case, the left null space of G must be {0}. Consequently, the rank of G equals K, and we can find a subset K of K workers such that the corresponding columns of G form a basis of F^K. We construct a computation scheme for f′ with N′ ≜ N − K workers, each of whom stores the coded version of (X_{i,1}, X_{i,2}, . . . , X_{i,d′}) that is stored by a unique respective worker in [N] \ K in the computation scheme of f.
Now it suffices to prove that the above construction achieves a recovery threshold of K∗_f(K, N) − (K − 1). Equivalently, we need to prove that given any subset S of [N] \ K of size K∗_f(K, N) − (K − 1), the values of f(X_{i,1}, X_{i,2}, . . . , X_{i,d′}, V) for i ∈ [K] are decodable from the computing results of workers in S.
We now exploit the decodability of the computation design for the function f. For any j ∈ K, the set S ∪ K \ {j} has size K∗_f(K, N). Consequently, for any vector a = (a_1, . . . , a_K) ∈ F^K, by letting X_{i,d′+1} = a_i V, we have that {a_i f(X_{i,1}, X_{i,2}, . . . , X_{i,d′}, V)}_{i∈[K]} is decodable given the computing results from workers in S ∪ K \ {j}. Moreover, for any j ∈ [K], letting a^{(j)} ∈ F^K be a non-zero vector that is orthogonal to all columns of G with indices in K \ {j}, workers in K \ {j} would store 0 for the
References
[65] He, K., X. Zhang, S. Ren, and J. Sun (2016). “Deep residual
learning for image recognition”. IEEE Conference on Computer
Vision and Pattern Recognition: 770–778.
[66] Ho, T., R. Koetter, M. Medard, D. R. Karger, and M. Effros
(2003). “The benefits of coding over routing in a randomized
setting”. IEEE International Symposium on Information Theory.
June: 442.
[67] Huang, K.-H. and J. A. Abraham (1984). “Algorithm-based
fault tolerance for matrix operations”. IEEE Transactions on
Computers. C-33(6): 518–528.
[68] Huang, L., A. D. Joseph, B. Nelson, B. I. Rubinstein, and
J. Tygar (2011). “Adversarial machine learning”. In: Proceedings
of the 4th ACM Workshop on Security and Artificial Intelligence.
ACM. 43–58.
[69] Huang, W. (2017). “Coding for security and reliability in dis-
tributed systems”. PhD thesis. California Institute of Technology.
[70] Jahani-Nezhad, T. and M. A. Maddah-Ali (2019). “CodedSketch:
Coded distributed computation of approximated matrix multipli-
cation”. In: 2019 IEEE International Symposium on Information
Theory (ISIT). 2489–2493.
[71] Jeong, H., T. M. Low, and P. Grover (2018). “Masterless coded
computing: A fully-distributed coded FFT algorithm”. In: 2018
56th Annual Allerton Conference on Communication, Control,
and Computing (Allerton). 887–894.
[72] Ji, M., G. Caire, and A. F. Molisch (2016). “Fundamental limits
of caching in wireless D2D networks”. IEEE Transactions on
Information Theory. 62(2): 849–869.
[73] Joshi, G., E. Soljanin, and G. Wornell (2017). “Efficient re-
dundancy techniques for latency reduction in cloud systems”.
ACM Transactions on Modeling and Performance Evaluation of
Computing Systems (TOMPECS). 2(2): 12.
[74] Jou, J.-Y. and J. A. Abraham (1986). “Fault-tolerant matrix
arithmetic and signal processing on highly concurrent computing
structures”. Proceedings of the IEEE. 74(5): 732–741.