Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis
Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Acceler-
ating their training is a major challenge and techniques range from distributed algorithms to low-level circuit
design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for
its parallelization. We present trends in DNN architectures and the resulting implications on parallelization
strategies. We then review and model the different types of concurrency in DNNs: from the single operator,
through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous
stochastic optimization, distributed system architectures, communication schemes, and neural architecture
search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.
CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Neu-
ral networks; Parallel computing methodologies; Distributed computing methodologies;
Additional Key Words and Phrases: Deep learning, distributed computing, parallel algorithms
ACM Reference format:
Tal Ben-Nun and Torsten Hoefler. 2019. Demystifying Parallel and Distributed Deep Learning: An In-depth
Concurrency Analysis. ACM Comput. Surv. 52, 4, Article 65 (August 2019), 43 pages.
https://doi.org/10.1145/3320060
1 INTRODUCTION
Machine Learning, and in particular Deep Learning [149], is rapidly taking over a variety of aspects
in our daily lives. At the core of deep learning lies the Deep Neural Network (DNN), a construct
inspired by the interconnected nature of the human brain. Trained properly, the expressiveness
of DNNs provides accurate solutions for problems previously thought to be unsolvable, merely by
observing large amounts of data. Deep learning has been successfully implemented for a plethora
of fields, ranging from image classification [110], through speech recognition [6] and medical di-
agnosis [44], to autonomous driving [22] and defeating human players in complex games [223].
Since the 1980s, neural networks have attracted the attention of the machine-learning
community [150]. However, DNNs’ rise to prominence was tightly coupled to the available
computational power, which allowed their inherent parallelism to be exploited. Consequently, deep
learning managed to outperform all existing approaches in speech recognition [152] and image
classification [142], where the latter increased the accuracy by a factor of two, sparking academic
and industrial interest.
As datasets increase in size and DNNs in complexity, the computational intensity and memory
demands of deep learning increase proportionally. Training a DNN to competitive accuracy to-
day essentially requires a high-performance computing cluster. To harness such systems, different
aspects of training and inference (evaluation) of DNNs are modified to increase concurrency.
In this survey, we discuss the variety of topics in the context of parallelism and distribution
in deep learning, spanning from vectorization to efficient use of supercomputers. In particular,
we present parallelism strategies for DNN evaluation and implementations thereof, as well as
extensions to training algorithms and systems targeted at supporting distributed environments.
To provide comparative measures on the approaches, we analyze their concurrency and average
parallelism using the Work-Depth model [21].
This article surveys 252 other works, obtained by recursively tracking relevant bibliography
from seminal papers in the field, dating back to the year 1984. We include additional papers result-
ing from keyword searches on Google Scholar1 and arXiv.2 Due to the quadratic increase in deep
learning papers on the latter source (Table 1), some works may not have been included. The full
list of categorized papers in this survey can be found online.3 Figure 1 shows an overview of this
survey.
1 https://scholar.google.com/.
2 https://www.arxiv.org/.
3 https://spcl.inf.ethz.ch/Research/Parallel_Programming/DistDL/.
discuss accelerators for traditional neural networks [117] and the use of FPGAs in deep learning
[144].
Name         | Definition
D            | Data probability distribution
S            | Training dataset
w ∈ H        | Model parameters; w_i^(t) denotes parameter i at SGD iteration t
f_w(z)       | Model function (learned predictor)
h(z)         | Ground-truth label (in Supervised Learning)
ℓ(w, z)      | Per-sample loss function
∇ℓ(w, z)     | Gradient of ℓ
u(g, w, t)   | Parameter update rule. Function of loss gradient g, parameters w, and iteration t

Fig. 2. Multi-class classification loss.
in machine learning is prominently performed via Gradient Descent. Since the full D is, how-
ever, never observed, it is necessary to obtain an unbiased estimator of the gradient. Observe that
∇L_D(w) = E_{z∼D}[∇ℓ(w, z)] (Equation (1), linearity of the derivative). Thus, in expectation, we
can descend using randomly sampled data in each iteration, applying Stochastic Gradient Descent
(SGD) [215].
SGD (Algorithm 1) iteratively optimizes parameters defined by the sequence {w^(t)}_{t=0}^{T}, using
samples from a dataset S sampled from D with replacement. SGD is proven to converge at a rate
of O(1/√T) for convex functions with Lipschitz-continuous and bounded gradient [181].
Prior to running SGD, one must choose an initial estimate for the weights w (0) . Due to the ill-
posed nature of some problems, the selection of w (0) is important and may reflect on the final
quality of the result. The choice of initial weights can originate from random values, informed
decisions (e.g., Xavier initialization [81]), or from pre-trained weights in a methodology called
Transfer Learning [197]. In deep learning, recent works state that the optimization space is riddled
with saddle points [149], and assume that the value of w (0) does not affect the final loss. In practice,
improper initialization may have an adverse effect on generalization as networks become deeper
[93].
In line 1, T denotes the number of steps to run SGD for (known as the stopping condition or
computational budget). Typically, real-world instances of SGD run for a constant number of steps,
for a fixed period of time, or until a desired accuracy is achieved. Line 2 then samples random
elements from the dataset, to provide the unbiased loss estimator. The gradient of the loss function
with respect to the weights w^(t) is subsequently computed (line 3). In deep neural networks, the
gradient is obtained with respect to each layer (w_l^(t)) using backpropagation (Section 4.2). This
gradient is then used for updating the weights, using a weight update rule (line 4).
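To make the procedure concrete, the following minimal Python sketch mirrors the loop just described (stopping condition, random sampling, gradient, and update); grad_fn and update_rule are hypothetical placeholders for the per-sample loss gradient and the update rule u, not part of the original algorithm listing.

import numpy as np

def sgd(w0, dataset, grad_fn, update_rule, T):
    """Run T SGD steps: sample a random element, compute its gradient, apply the update rule."""
    w = w0.copy()
    for t in range(T):                                     # line 1: stopping condition / budget
        z = dataset[np.random.randint(len(dataset))]       # line 2: unbiased loss estimator
        g = grad_fn(w, z)                                  # line 3: gradient of the per-sample loss
        w = w + update_rule(g, w, t)                       # line 4: weight update rule u(g, w, t)
    return w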
2.1.1 Weight Update Rules. The weight update rule, denoted as u in Algorithm 1, can be defined
as a function of the gradient д, the previous weight values w (0) , . . . , w (t ) , and the current iteration
t. Table 3 summarizes the popular u functions used in training. In the table, the basic SGD update
rule is u_sgd(g) = −η · g, where η represents the learning rate. η controls how much the gradient
values will overall affect the next estimate w^(t+1), and in iterative nonlinear optimization methods
finding the correct η is a considerable part of the computation [185]. In machine-learning problems,
it is customary to fix η, or set an iteration-based weight update rule u_alr(g, t) = −η_t · g, where η_t
decreases (decays) over time to bound the modification size and avoid local divergence.
Other popular weight update rules include Momentum, which uses the difference between cur-
rent and past weights w (t ) − w (t −1) to avoid local minima and redundant steps with natural motion
[182, 205]. More recent update rules, such as RMSProp [96] and Adam [136], use the first and sec-
ond moments of the gradient to adapt the learning rate per-weight, enhancing sparser updates
over others.
Factors such as the learning rate and other symbols found in Table 3 are called hyper-parameters,
and are set before the optimization process begins. In the table, μ, β, β 1 , and β 2 represent the mo-
mentum, RMS decay rate, and first and second moment decay rate hyper-parameters, respectively.
To obtain the best results, hyper-parameters must be tuned, which can be performed by value
sweeps or by meta-optimization (Section 7.5.2). The multitude of hyper-parameters and the re-
liance upon them is considered problematic by a part of the community [207].
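As an illustration of such update rules u, the sketch below implements plain SGD, momentum, and Adam in their commonly used forms; the default hyper-parameter values are customary choices and are not prescribed by the survey.

import numpy as np

def u_sgd(g, eta=0.1):
    # basic SGD: step against the gradient, scaled by the learning rate
    return -eta * g

def u_momentum(g, state, eta=0.1, mu=0.9):
    # running velocity combines the current gradient with past steps; returns the weight delta
    state['v'] = mu * state.get('v', 0.0) - eta * g
    return state['v']

def u_adam(g, state, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # first and second moment estimates of the gradient, with bias correction
    state['m'] = beta1 * state.get('m', 0.0) + (1 - beta1) * g
    state['v'] = beta2 * state.get('v', 0.0) + (1 - beta2) * g**2
    m_hat = state['m'] / (1 - beta1**(t + 1))
    v_hat = state['v'] / (1 - beta2**(t + 1))
    return -eta * m_hat / (np.sqrt(v_hat) + eps)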
2.1.2 Minibatch SGD. When performing SGD, it is common to decrease the number of weight
updates by computing the sample loss in minibatches (Algorithm 2), averaging the gradient with
respect to subsets of the data [147]. Minibatches represent a tradeoff between traditional SGD,
which is proven to converge when drawing one sample at a time, and batch methods [185], which
make use of the entire dataset at each iteration.
In practice, minibatch sampling is implemented by shuffling the dataset S, and processing that
permutation by obtaining contiguous segments of size B from it. An entire pass over the dataset is
called an epoch, and a full training procedure usually consists of tens to hundreds of such epochs
[84, 260]. As opposed to the original SGD, shuffle-based processing entails without-replacement
sampling. Nevertheless, minibatch SGD was proven [221] to provide similar convergence
guarantees.
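A minimal sketch of this shuffle-based minibatch scheme follows; grad_fn and update_rule are again hypothetical placeholders.

import numpy as np

def train_epochs(w, S, grad_fn, update_rule, B, epochs):
    """Shuffle-based minibatch SGD: each epoch is one full pass over the dataset S."""
    t = 0
    for _ in range(epochs):
        perm = np.random.permutation(len(S))               # shuffle the dataset
        for start in range(0, len(S), B):
            batch = [S[i] for i in perm[start:start + B]]  # contiguous segment of size B
            g = np.mean([grad_fn(w, z) for z in batch], axis=0)  # averaged minibatch gradient
            w = w + update_rule(g, w, t)                   # apply the update rule u(g, w, t)
            t += 1
    return w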
Figure 4(a) shows a breakdown of the number of nodes used in deep learning research over the
years. It started very high with the large-scale DistBelief run, reduced slightly with the introduc-
tion of powerful accelerators and is on a quick rise again since 2015 with the advent of large-scale
deep learning. Out of the 252 reviewed papers, 80 make use of distributed-memory systems and
provide details about their hardware setup. We observe that large-scale setups, similar to HPC
machines, are commonplace and essential in today’s training.
number of processes to execute the graph with. Furthermore, we can show that the execution time
of such a DAG on p processors is bounded by: max{W/p, D} ≤ T_p ≤ O(W/p + D) [8, 26].
Most of the operations in learning can be modeled as operations on tensors (typically tensors as a
parallel programming model [228]). Such operations are highly data-parallel and only summations
introduce dependencies. Thus, we will focus on parallel reduction operations in the following.
In a reduction, we apply a series of binary operators ⊕ to combine n values into a single value,
e.g., y = x_1 ⊕ x_2 ⊕ x_3 ⊕ · · · ⊕ x_{n−1} ⊕ x_n. If the operation ⊕ is associative, then we can change the order of its ap-
plication, which changes the DAG from a linear-depth line-like graph as shown in Figure 5(a) to
a logarithmic-depth tree graph as shown in Figure 5(b). It is simple to show that the work and
depth for reducing n numbers are W = n − 1 and D = ⌈log₂ n⌉, respectively. In deep learning, one
often needs to reduce (sum) large tables of m independent parameters and return the result to all
processes. This is called allreduce in the MPI specification [86, 173].
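The effect of associativity on depth can be illustrated with a toy recursive reduction (a conceptual sketch, not a parallel implementation): the two halves at every level are independent, so the depth is logarithmic while the work remains n − 1 operator applications.

def tree_reduce(values, op):
    """Pairwise (tree-shaped) reduction: W = n - 1 operator applications, D = ceil(log2 n)."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    # the two halves are independent and could be reduced in parallel
    left = tree_reduce(values[:mid], op)
    right = tree_reduce(values[mid:], op)
    return op(left, right)

# example: summing 8 numbers takes 7 additions arranged in 3 parallel levels
print(tree_reduce(list(range(8)), lambda a, b: a + b))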
In multi-machine environments, these tables are distributed across the machines that partici-
pate in the overall reduction operation. Due to the relatively low bandwidth between the machines
(compared to local memory bandwidths), this operation is often most critical for distributed learn-
ing. Even if we fully overlap communication and computation [102], (nonblocking) allreductions
often turn into a bottleneck. We analyze the algorithms in a simplified LogP model [54], where
we ignore injection rate limitations (o = д = 0), which makes it similar to the simple α-β model:
L = α models the point-to-point latency in the network, G = β models the cost per byte, and P ≤ p
is the number of networked machines. Based on the DAG model from above, it is simple to show
a lower bound for the reduction time Tr ≥ L log2 (P ) in this simplified model. Furthermore, be-
cause each element of the table has to be sent at least once, the second lower bound is Tr ≥ γmG,
where γ represents the size of a single data value and m is the number of values sent. This bound
can be strengthened to Tr ≥ L log2 (P ) + 2γmG (P − 1)/P if we disallow redundant computations
[29].
Several practical algorithms exist for the parallel allreduce operation in different environments
and the best algorithm depends on the system, the number of processes, and the message size. We
refer to Chan et al. [29] and Hoefler and Moor [103] for surveys of collective algorithms. Here,
we summarize key algorithms that have been rediscovered in the context of parallel learning, and
illustrate them in Figure 5(c). The simplest algorithm is to combine two trees, one for summing
the values to one process, similar to Figure 5(b), and one for broadcasting the values back to all
processes; its complexity is Ttree = 2 log2 (P )(L + γmG). Yet, this algorithm is inefficient and can
be optimized with a simple butterfly pattern, reducing the time to Tbfly = log2 (P )(L + γmG). The
butterfly algorithm is efficient (near-optimal) for small γm. For large γm and small P, a simple
linear pipeline that splits the message into P segments is bandwidth-optimal and performs well in
practice, even though it has a linear component in P: T_pipe = 2(P − 1)(L + (γm/P)G). For most ranges
of γm and P, one could use Rabenseifner’s algorithm [206], which combines reduce-scatter with
allgather, running in time Trabe = 2L log2 (P ) + 2γmG (P − 1)/P. This algorithm achieves the lower
bound [198] but may be harder to implement and tune.
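Since all of these costs are closed-form, they can be compared directly; the helper below evaluates the four models for a given latency L, cost per byte G, message size γm, and node count P, and reports the cheapest. It is a modeling aid under the simplified LogP assumptions above, not an implementation of the collectives.

import math

def allreduce_costs(L, G, gamma_m, P):
    logP = math.log2(P)
    return {
        'tree':         2 * logP * (L + gamma_m * G),
        'butterfly':    logP * (L + gamma_m * G),
        'pipeline':     2 * (P - 1) * (L + gamma_m * G / P),
        'rabenseifner': 2 * L * logP + 2 * gamma_m * G * (P - 1) / P,
    }

# example: 100 MB gradient, 1 us latency, 0.1 ns per byte, 64 nodes
costs = allreduce_costs(L=1e-6, G=1e-10, gamma_m=100e6, P=64)
print(min(costs, key=costs.get), costs)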
Other communication problems needed for convolutions and pooling, illustrated in Figure 5(d),
exhibit high spatial locality due to strict neighbor interactions. They can be optimized using well-
known HPC techniques for stencil computations such as MPI Neighborhood Collectives [104] (for-
merly known as sparse collectives [106]) or optimized Remote Memory Access programming [15].
In general, exploring different low-level communication, message scheduling, and topology map-
ping [105] strategies that are well-known in the HPC field could significantly speed up the com-
munication in distributed deep learning.
operator types and their properties, followed by the computational description of deep networks
and the backpropagation algorithm. Then, we study several examples of popular neural networks,
highlighting the computational trends driven by their definition.
4.1 Neurons
The basic building block of a deep neural network is the neuron. Modeled after the brain, an ar-
tificial neuron (Figure 7(a)) accumulates signals from other neurons connected by synapses. An
activation function (or axon) is applied on the accumulated value, which adds nonlinearity to the
network and determines the signal this neuron “fires” to its neighbors. In feed-forward neural net-
works, the neurons are grouped to layers strictly connected to neurons in subsequent layers. In
contrast, recurrent neural networks allow back-connections within the same layer.
4.1.1 Feed-forward Operators. Neural network operators are implemented as weighted sums,
using the synapses as weights. Activations (denoted σ ) can be implemented as different functions,
such as Sigmoid, Softmax, hyperbolic tangents, Rectified Linear Units (ReLU), or variants thereof
[93]. When color images are used as input (as is commonly the case in computer vision), they are
usually represented as a four-dimensional tensor sized N×C×H×W. As shown in Figure 8, N is the
number of images in the minibatch, where each H×W image contains C channels (e.g., image RGB
components). If an operator disregards spatial locality in the image tensor (e.g., a fully connected
layer), then the dimensions are flattened to N × (C · H · W ). In typical DNN and CNN construc-
tions, the number of features (channels in subsequent layers), as well as the width and height of
an image, change from layer to layer using the operators defined below. We denote the input and
output features of a layer by Cin and Cout , respectively.
A fully connected layer (Figure 7(a)) is defined on a group of neurons x (sized N × C_in, disregard-
ing spatial properties) by y_{i,∗} = σ(w x_{i,∗} + b), where w is the weight matrix (sized C_in × C_out) and
b is a per-layer trainable bias vector (sized C_out). While this inner product is usually implemented
with multiplication and addition, some works use other operators, such as similarity [46].
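A NumPy sketch of this operator, with ReLU standing in for the generic activation σ (the shapes follow the definitions above):

import numpy as np

def fully_connected(x, w, b, sigma=lambda v: np.maximum(v, 0.0)):
    """x: (N, C_in), w: (C_in, C_out), b: (C_out,) -> y: (N, C_out)."""
    return sigma(x @ w + b)   # y_{i,*} = sigma(x_{i,*} w + b), vectorized over the minibatch

N, C_in, C_out = 4, 8, 16
y = fully_connected(np.random.randn(N, C_in), np.random.randn(C_in, C_out), np.zeros(C_out))
print(y.shape)  # (4, 16)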
Not all operators in a neural network are fully connected. Sparsely connecting neurons and
sharing weights is beneficial for reducing the number of parameters; as is the case in the popular
convolutional operator. In a convolutional operator, every 3D tensor x (i.e., a slice of the 4D mini-
batch tensor representing one image) is convolved with Cout kernels of size Cin ×Ky ×K x , where
the base formula for a minibatch is given by
y_{i,j,k,l} = Σ_{m=0}^{C_in−1} Σ_{k_y=0}^{K_y−1} Σ_{k_x=0}^{K_x−1} x_{i,m,k+k_y,l+k_x} · w_{j,m,k_y,k_x}.    (2)
The goal of the pooling operator is to reduce the size of a tensor by sub-sampling it while emphasizing
important features. Applying subsequent convolutions of the same kernel size on a sub-sampled
tensor enables learning high-level features that correspond to larger regions in the original data.
Batch Normalization (BN) [121] is an example of an operator that creates inter-dependencies
between samples in the same minibatch. Its role is to center the samples around a zero mean and a
variance of one, which, according to the authors, reduces the internal covariate shift. BN is given
by the following transformation:
y_{i,j,k,l} = (x_{i,j,k,l} − E[x_{∗,j,k,l}]) / √(Var[x_{∗,j,k,l}] + ϵ) · γ + β,
where γ , β are scaling factors, and ϵ is added to the denominator for numerical stability.
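A minimal NumPy sketch of this transformation; it normalizes over the minibatch axis exactly as in the formula, and notes where practical convolutional implementations commonly differ.

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """x: (N, C, H, W). Normalizes each position over the minibatch axis, as in the formula above."""
    mean = x.mean(axis=0, keepdims=True)   # E[x_{*,j,k,l}]
    var = x.var(axis=0, keepdims=True)     # Var[x_{*,j,k,l}]
    # Note: practical implementations for convolutional layers usually also average over the
    # spatial axes (H, W) and learn per-channel gamma and beta.
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

print(batch_norm(np.random.randn(8, 3, 4, 4)).shape)  # (8, 3, 4, 4)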
4.1.2 Recurrent Operators. Recurrent Neural Networks (RNNs) [72] enable connections from a
layer’s output to its own inputs. These connections create “state” in the neurons, retaining per-
sistent information in the network and allowing it to process data sequences instead of a single
tensor. We denote the input tensor at time point t as x (t ) .
The standard Elman RNN layer is defined as y^(t) = w_y · (w_h · h_{t−1} + w_x · x^(t)) (omitting bias,
illustrated in Figure 9(a)), where h_t represents the “hidden” data at time-point t and is carried
over to the next time-point. Despite the initial success of these operators, it was found that they
tend to “forget” information quickly (as a function of sequence length) [19]. To address this issue,
Long Short-Term Memory (LSTM) [100] (Figure 9(b)) units redesign the structure of the recurrent
connection to resemble memory cells. Several variants of LSTM exist, such as the Gated Recurrent
Unit (GRU) [40] (Figure 9(c)), which simplifies the LSTM gates to reduce the number of parameters.
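A sketch of the Elman recurrence as written above (bias omitted as in the text; the usual nonlinearity on the hidden state is also left out for brevity), showing how the hidden state h is carried across time points:

import numpy as np

def elman_step(x_t, h_prev, w_x, w_h, w_y):
    """One time step: the hidden state depends on the previous state and the current input."""
    h_t = w_h @ h_prev + w_x @ x_t   # recurrent "hidden" data carried to the next time point
    y_t = w_y @ h_t                  # output at time t
    return y_t, h_t

def elman_forward(xs, h0, w_x, w_h, w_y):
    h, ys = h0, []
    for x_t in xs:                   # the sequence is processed one time point at a time
        y_t, h = elman_step(x_t, h, w_x, w_h, w_y)
        ys.append(y_t)
    return ys

H, X, Y = 32, 16, 8
ys = elman_forward([np.random.randn(X) for _ in range(5)], np.zeros(H),
                   np.random.randn(H, X), np.random.randn(H, H), np.random.randn(Y, H))
print(len(ys), ys[0].shape)  # 5 (8,)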
Fig. 9. Recurrent Neural Network (RNN) Layers. Sub-figures (b) and (c) adapted from Reference [190].
Property                       | LeNet [151] | AlexNet [142] | GoogLeNet [234] | ResNet [94]     | DenseNet [110]
|w|                            | 60K         | 61M           | 6.8M            | 1.7M–60.2M      | ∼15.3M–30M
Layers (∝ D)                   | 7           | 13            | 27              | 50–152          | 40–250
Operations (∝ W, ImageNet-1k)  | N/A         | 725M          | 1,566M          | ∼1,000M–2,300M  | ∼600M–1,130M
Top-5 Error (ImageNet-1k)      | N/A         | 15.3%         | 9.15%           | 5.71%           | 5.29%
Top-1 Error (CIFAR-10)         | N/A         | N/A           | N/A             | 6.41%           | 3.62%
of sequence length, using the same weights for each time-point. This creates a larger, feed-forward
network that can be trained with the usual means.
Fig. 11. Performance of cublasSgemm on a Tesla K80 GPU for various matrix sizes (adapted from Refer-
ence [193]).
other large computational resources (e.g., the Google Brain cluster), increasing the available pro-
cessing elements toward the average parallelism (W/D).
However, as over-parameterization leads to overfitting, and since the resulting networks were
too large to fit into consumer devices, efforts to decrease resource usage started around 2015, and
so did the average parallelism (see table). Research has since focused on increasing expressiveness,
mostly by producing deeper networks, while also reducing the number of parameters and oper-
ations required to forward-evaluate the given network. Parallelization efforts have thus shifted
toward concurrency within minibatches (data parallelism, see Section 6). By reducing memory
and increasing energy efficiency, the resource conservation trend aims to move neural processing
to the end user, i.e., to embedded and mobile devices. At the same time, smaller networks are faster
to prototype and require less information to communicate when training on distributed platforms.
5 CONCURRENCY IN OPERATORS
Given that neural network layers operate on four-dimensional tensors (Figure 8(a)) and the high
locality of the operations, there are several opportunities for parallelizing layer execution. In most
cases, computations (e.g., in the case of pooling operators) can be directly parallelized. However,
to expose parallelism in other operator types, computations have to be reshaped. Below, we list
efforts to model DNN performance, followed by a concurrency analysis of three popular operators.
5.3 Convolution
Convolutions constitute the majority of computations involved in training and inference of DNNs.
As such, the research community and the industry have invested considerable efforts into optimiz-
ing their computation on all platforms. Figure 12 depicts the convolution methods detailed below,
and Table 6 summarizes their work and depth characteristics.
While a convolution operator (Equation (2)) can be computed directly, it will not fully utilize
the resources of vector processors (e.g., Intel’s AVX registers) and many-core architectures (e.g.,
GPUs), which are geared toward many parallel multiplication-accumulation operations. It is pos-
sible, however, to increase the utilization by ordering operations to maximize data reuse [61],
introducing data redundancy, or via basis transformation.
The first algorithmic change proposed for convolutional operators was the use of the well-
known technique to transform a discrete convolution into matrix multiplication, using Toeplitz
matrices (colloquially known as im2col). The first occurrence of unrolling convolutions in CNNs
[31] used both CPUs and GPUs for training (since the work precedes CUDA, it uses Pixel Shaders
for GPU computations). The method was subsequently popularized by Coates et al. [45], and it
consists of reshaping the images in the minibatch from 3D tensors to 2D matrices. Each 1D row in
the matrix contains an unrolled 2D patch that would usually be convolved (possibly with overlap),
generating redundant information (see Figure 12(a)). The convolution kernels are then stored as a
2D matrix, where each column represents an unrolled kernel (one convolution filter). Multiplying
those two matrices results in a matrix that contains the convolved tensor in 2D format, which can
be reshaped to 3D for subsequent operations. Note that this operation can be generalized to 4D
tensors (an entire minibatch), converting it into a single matrix multiplication. Alternatively, the
kernels can be unrolled to rows (kn2row) for the matrix multiplication [242].
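A sketch of this lowering for a single image with stride 1 and no padding (not the implementation of the cited works): each row of the patch matrix is an unrolled C_in × K_y × K_x patch, so the whole convolution reduces to one matrix product.

import numpy as np

def im2col_conv(x, w):
    """x: (C_in, H, W), w: (C_out, C_in, Ky, Kx) -> y: (C_out, H-Ky+1, W-Kx+1)."""
    C_in, H, W = x.shape
    C_out, _, Ky, Kx = w.shape
    Ho, Wo = H - Ky + 1, W - Kx + 1
    # Toeplitz-like patch matrix: one unrolled patch per output position (redundant data)
    cols = np.empty((Ho * Wo, C_in * Ky * Kx))
    for i in range(Ho):
        for j in range(Wo):
            cols[i * Wo + j] = x[:, i:i + Ky, j:j + Kx].ravel()
    kernels = w.reshape(C_out, -1)      # each unrolled kernel becomes one column of kernels.T
    y = cols @ kernels.T                # the convolution becomes a single GEMM
    return y.T.reshape(C_out, Ho, Wo)

y = im2col_conv(np.random.randn(3, 8, 8), np.random.randn(4, 3, 3, 3))
print(y.shape)  # (4, 6, 6)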
While processor-friendly, the GEMM method (as described above) consumes a considerable
amount of memory, and thus was not scalable. Practical implementations of the GEMM method,
such as in cuDNN [38], implement “implicit GEMM,” in which the Toeplitz matrix is never mate-
rialized. It was also reported [49] that the Strassen matrix multiplication [231] can be used for the
underlying computation, reducing the number of operations by up to 47%.
A second method to compute convolutions is to make use of the Fourier domain, in which
convolution is defined as an element-wise multiplication [172, 241]. In this method, both the data
and the kernels are transformed using FFT, multiplied, and the inverse FFT is applied on the result:
y_{i,j,∗,∗} = F⁻¹( Σ_{m=0}^{C_in} F(x_{i,m,∗,∗}) ◦ F(w_{j,m,∗,∗}) ),
where F denotes the Fourier Transform and ◦ is element-wise multiplication. Note that for a single
minibatch, it is enough to transform w once and reuse the results.
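A single-channel NumPy sketch of the idea (the per-channel products would be summed to obtain the formula above); the operands are zero-padded so that the element-wise product in the frequency domain corresponds to a full linear convolution.

import numpy as np

def fft_conv2d(x, k):
    """Full 2D linear convolution of one channel x (H, W) with kernel k (Ky, Kx) via FFT."""
    H, W = x.shape
    Ky, Kx = k.shape
    s = (H + Ky - 1, W + Kx - 1)                   # pad to avoid circular wrap-around
    Y = np.fft.rfft2(x, s) * np.fft.rfft2(k, s)    # element-wise multiplication in frequency domain
    return np.fft.irfft2(Y, s)

x, k = np.random.randn(16, 16), np.random.randn(3, 3)
y_full = fft_conv2d(x, k)   # shape (18, 18); the "valid" region is y_full[2:16, 2:16]
print(y_full.shape)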
Experimental results [241] have shown that the larger the convolution kernels are, the more
beneficial FFT becomes, yielding up to 16× performance over the GEMM method, which has to
process patches of proportional size to the kernels. Additional optimizations were made to the
FFT and IFFT operations [241], using DNN-specific knowledge: (a) The process uses decimation-
in-frequency for FFT and decimation-in-time for IFFT to mitigate bit-reversal instructions; (b) mul-
tiple FFTs with sizes ≤32 are batched together and performed at the warp-level on the GPU; and
(c) pre-computation of twiddle factors.
Working with DNNs, FFT-based convolution can be optimized further. In ZNNi [278], the au-
thors observed that due to zero-padding, the convolutional kernels, which are considerably smaller
than the images, mostly consist of zeros. Thus, pruned FFT [230] can be executed for transform-
ing the kernels, reducing the number of operations by 3×. In turn, the paper reports 5× and 10×
speedups for CPUs and GPUs, respectively.
The prevalent method used today to perform convolutions is Winograd’s algorithm for minimal
filtering [249]. First proposed by Lavin and Gray [146], the method modifies the original algorithm
for multiple filters (as is the case in convolutions), performing the following computation for one
tile:
y_{i,j,∗,∗} = Aᵀ [ Σ_{m=0}^{C_in} (G w_{j,m,∗,∗} Gᵀ) ◦ (Bᵀ x_{i,m,∗,∗} B) ] A,
with the matrices A, G, B constructed as in Winograd’s algorithm.
Since the number of operations in Winograd convolutions grows quadratically with filter
size, the convolution is decomposed into a sum of tiled, small convolutions, and the method is
strictly used for small kernels (e.g., 3 × 3). Additionally, because the magnitude of elements in the
expression increases with filter size, the numerical accuracy of Winograd convolution is generally
lower than the other methods, and decreases as larger filters are used.
Table 6 lists the concurrency characteristics of the aforementioned convolution implementa-
tions, using the Work-Depth model. From the table, we can see that each method exhibits different
behavior, where the average parallelism (W/D) can be determined by the kernel size or by image
size (e.g., FFT). This coincides with experimental results [38, 146, 241], which show that there is no
“one-size-fits-all” convolution method. We can also see that the Work and Depth metrics are not
always sufficient to reason about absolute performance, as the Direct and im2col methods exhibit
the same concurrency characteristics, even though im2col is faster in many cases, due to high
processor utilization and memory reuse (e.g., caching) opportunities.
Data layout also plays a role in convolution performance. Li et al. [154] assert that convolution
and pooling operators can be computed faster by transposing the data from N ×C×H ×W tensors
to C×H ×W ×N . The paper reports up to 27.9× performance increase over the state-of-the-art for
a single operator, and 5.6× for a full DNN (AlexNet). The paper reports speedup even in the case
of transposing the data during the computation of the DNN, upon inputting the tensor to the
operator.
DNN primitive libraries, such as cuDNN [38] and MKL-DNN [120], provide a variety of convo-
lution methods and data layouts. To assist users in choosing an algorithm, such libraries provide
functions that select the best-performing method, given tensor sizes and memory constraints. In-
ternally, the libraries may run all methods and pick the fastest one.
execute the RNN layers to be “persistent,” performing global synchronization on their own and
circumventing the normal GPU programming model. The approach attains up to ∼30× speedup
over previous state-of-the-art for low minibatch sizes, performing on the order of multiple TFLOP/s
per-GPU, even though it does not execute GEMM operations and loads more memory for each
multi-processor. Additionally, the approach reduces the total memory footprint of RNNs, allowing
users to stack more layers using the same resources.
6 CONCURRENCY IN NETWORKS
The high average parallelism (W/D) in neural networks may not only be harnessed to compute
individual operators efficiently but also to evaluate the whole network concurrently with respect
to different dimensions. Owing to the use of minibatches, the breadth (∝ W) of the layers, and
the depth of the DNN (∝ D), it is possible to partition both the forward evaluation and the back-
propagation phases (lines 4–5 in Algorithm 2) among parallel processors. Below, we discuss three
prominent partitioning strategies, illustrated in Figure 14: partitioning by input samples (data
parallelism), partitioning by network structure (model parallelism), and partitioning by layer
(pipelining).
gradient is updated several times prior to updating the weights, essentially equivalent to minibatch
SGD.
One of the earliest occurrences of mapping DNN computations to data-parallel architectures
(e.g., GPUs) was presented by Raina et al. [208]. The paper focuses on the problem of train-
ing Deep Belief Networks [98], mapping the unsupervised training procedure to GPUs by run-
ning minibatch SGD. The paper shows speedup of up to 72.6× over CPU when training Restricted
Boltzmann Machines. Today, data parallelism is supported by the vast majority of deep learning
frameworks, using a single GPU, multiple GPUs, or a cluster of multi-GPU nodes.
The scaling of data parallelism is naturally defined by the minibatch size (Table 4). Apart from
Batch Normalization (BN) [121], all operators mentioned in Section 4 operate on a single sample
at a time, so forward evaluation and backpropagation are almost completely independent. In the
weight update phase, however, the results of the partitions have to be averaged to obtain the gra-
dient w.r.t. the whole minibatch, which potentially induces an allreduce operation. Furthermore,
in this partitioning method, all DNN parameters have to be accessible for all participating devices,
which means that they should be replicated.
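The induced allreduce can be sketched as follows using mpi4py, which is an assumption for illustration only (the surveyed systems use their own communication layers); local_gradient and apply_update are hypothetical placeholders.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def data_parallel_step(w, local_batch, local_gradient, apply_update):
    g = local_gradient(w, local_batch)              # backpropagation on the local minibatch shard
    comm.Allreduce(MPI.IN_PLACE, g, op=MPI.SUM)     # sum gradients across all workers
    g /= comm.Get_size()                            # average -> gradient w.r.t. the whole minibatch
    return apply_update(w, g)                       # identical update applied on every replica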
6.1.1 Neural Architecture Support for Large Minibatches. By applying various modifications to
the training process, recent works have successfully managed to increase minibatch size to 8k
samples [84], 32k samples [259], and even 64k [226] without losing considerable accuracy. While
the generalization issue still exists (Section 3), it is not as severe as claimed in prior works [218].
One bottleneck that hinders scaling of data parallelism, however, is the BN operator, which re-
quires a full synchronization point upon invocation. Since BN recurs multiple times in some DNN
architectures [94], this is too costly. Thus, popular implementations of BN follow the approach
driven by large-batch papers [84, 107, 259], in which small subsets (e.g., 32 samples) of the mini-
batch are normalized independently. If at least 32 samples are scheduled to each processor, then
this synchronization point is local, which in turn increases scaling.
Another approach to the BN problem is to define a different operator altogether. Weight Nor-
malization (WN) [216] proposes to separate the parameter (w) norm from its directionality by
way of re-parameterization. In WN, the weights are defined as w = (g/‖v‖) · v, where g represents
weight magnitude and v a normalized direction (as changing the magnitude of v will not intro-
duce changes in ∇). WN decreases the depth (D) of the operator from O(log N ) to O(1), removing
inter-dependencies within the minibatch. According to the authors, WN reduces the need for BN,
achieving comparable accuracy using a simplified version of BN (without variance correction).
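The re-parameterization itself is a one-liner; the sketch below assembles the effective weights from the magnitude g and direction v, requiring no minibatch-wide statistics.

import numpy as np

def weight_norm(g, v):
    """w = (g / ||v||) * v; g is a scalar magnitude, v an unnormalized direction vector."""
    return (g / np.linalg.norm(v)) * v

v = np.random.randn(128)
w = weight_norm(2.5, v)
print(np.linalg.norm(w))  # equals |g| = 2.5, independent of the scale of v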
6.1.2 Coarse- and Fine-grained Data Parallelism. Additional approaches for data parallelism
were proposed in literature. In ParallelSGD [277], SGD is run (possibly with minibatches) k times in
parallel, dividing the dataset among the processors. After the convergence of all SGD instances, the
resulting weights are aggregated and averaged to obtain w, exhibiting coarse-grained parallelism.
ParallelSGD, as well as other deep learning implementations [125, 147, 267], were designed with
the MapReduce [58] programming paradigm. Using MapReduce, it is easy to schedule parallel tasks
onto multiple processors, as well as distributed environments. Prior to these works, the potential
scaling of MapReduce was studied [42] on a variety of machine-learning problems, including NNs,
promoting the need to shift from single-processor learning to distributed memory systems.
While the MapReduce model was successful for deep learning at first, its generality hindered
DNN-specific optimizations. Therefore, current implementations make use of high-performance
communication interfaces (e.g., MPI) to implement fine-grained parallelism features, such as
reducing latencies via asynchronous execution and pipelining [79] (Figure 15(a)), sparse com-
munication (see Section 7.3), and exploiting parallelism within a given computational resource
[194, 278]. In the last category, minibatches are fragmented into micro-batches (Figure 15(b))
that are decomposed [278] or computed sequentially [194]. This reduces the required memory
footprint, thus making it possible to choose faster methods that require more memory, as well as
enabling hybrid CPU-GPU inference.
6.3 Pipelining
In deep learning, pipelining can either refer to overlapping computations, i.e., between one layer
and the next (as data becomes ready); or to partitioning the DNN according to depth, assigning
layers to specific processors (Figure 14(c)). Pipelining can be viewed as a form of data parallelism,
since elements (samples) are processed through the network in parallel but also as model paral-
lelism, since the length of the pipeline is determined by the DNN structure.
The first form of pipelining can be used to overlap forward evaluation, backpropagation, and
weight updates. This scheme is widely used in practice [1, 48, 124, 238] and increases utilization
by mitigating processor idle time. In a finer granularity, neural network architectures can be de-
signed around the principle of overlapping layer computations, as is the case with Deep Stacking
Networks (DSN) [63]. In DSNs, each step computes a different fully connected layer of the data.
However, the results of all previous steps are concatenated to the layer inputs (see Figure 17(a)).
This enables each layer to be partially computed in parallel, due to the relaxed data dependencies.
GPipe [111] uses layer partitioning to train many-parameter DNNs, achieving state-of-the-art
accuracy on ImageNet and CIFAR-10. There are several advantages for a multi-processor pipeline
over both data and model parallelism: (a) there is no need to store all parameters on all processors
during forward evaluation and backpropagation (as with model parallelism); (b) there is a fixed
number of communication points between processors (at layer boundaries), and the source and
destination processors are always known. Moreover, since the processors always compute the
same layers, the weights can remain cached to decrease memory round-trips. Two disadvantages
of pipelining, however, are that data (samples) have to arrive at a specific rate to fully utilize the
system, and that latency proportional to the number of processors is incurred.
In the following section, we discuss two implementations of layer partitioning—DistBelief
[57] and Project Adam [39]—which combine the advantages of pipelining with data and model
parallelism.
7 CONCURRENCY IN TRAINING
So far, we have discussed training algorithms where there is only one copy of w, and its up-to-
date value is directly visible to all processors. In distributed environments, there may be multiple
instances of SGD (training agents) running independently, and thus the overall algorithm has to
be adapted. Distribution schemes for deep learning can be categorized along three axes: model
consistency, parameter distribution, and training distribution; where Figures 18 and 19 sum-
marize the applied techniques and optimizations.
Fig. 18. Overview of distributed deep learning methods.
Fig. 19. Section overview.
creates a distributed form of data parallelism (Section 6), where all nodes have to communicate
their updates to the others before fetching a new minibatch. To support distributed, data parallel
SGD, we can modify Algorithm 2 by changing lines 3 and 7 to read (write) weights from (to)
a parameter store, which may be centralized or decentralized (see Section 7.2). This incurs a
substantial overhead on the overall system, which hinders training scaling.
Recent works relax the synchronization restriction, creating an inconsistent model (Figure 20(c)).
As a result, a training agent i at time t contains a copy of the weights, denoted as w (τ ,i ) for τ ≤ t,
where t − τ is called the staleness (or lag). A well-known instance of inconsistent SGD is the HOG-
WILD shared-memory algorithm [213], which allows training agents to read parameters and up-
date gradients at will, overwriting existing progress. HOGWILD has been proven to converge
for sparse learning problems [213], where updates only modify small subsets of w, for general
convex optimization [56], and nonconvex optimization [160]. Based on foundations of distributed
asynchronous SGD [237], the proofs impose that (a) write-accesses (adding gradients) are always
atomic; (b) f_w is Lipschitz-continuously differentiable and strongly convex; and (c) the stale-
ness, i.e., the maximal number of iterations between reading w and writing ∇w, is bounded.
The HOGWILD algorithm was originally designed for shared-memory architectures but has
since been extended [57, 186] to distributed-memory systems, in which it still attains convergence
for deep learning problems. To mitigate the interference effect of overwriting w at each step, the
implementation transfers the gradient ∇w instead of w from the training agents. Asymptotically,
the lack of synchronization in HOGWILD and its gradient-communicating variants admits an op-
timal SGD convergence rate of O(1/√(mT)) for m participating nodes [2, 59, 160], as well as linear
scaling, as every agent can train almost independently.
To provide correctness guarantees in spite of asynchrony, Stale-Synchronous Parallelism (SSP)
[99] proposes a compromise between consistent and inconsistent models. In SSP (Figure 20(d)),
the gradient staleness is enforced to be bounded by performing a global synchronization step after
a maximal staleness may have been reached by one of the nodes. This approach works especially
well in heterogeneous environments, where lagging agents (stragglers) are kept in check. To that
end, distributed asynchronous processing has the additional advantage of adding and removing
nodes on-the-fly, allowing users to add more resources, introduce node redundancy, and remove
straggling nodes [57, 195].
7.2 Centralization
The choice between designing a centralized and a decentralized network architecture for DNN
training depends on multiple factors [159], including the network topology, bandwidth, commu-
nication latency, parameter update frequency, and desired fault tolerance. A centralized network
architecture would typically include a parameter server (PS) infrastructure (e.g., Figures 20(a), 20(c),
and 21), which may consist of one or more specialized nodes; whereas a decentralized architecture
(Figures 20(b) and 20(d)) would rely on allreduce to communicate parameter updates among the
nodes. Following communication, centralized parameter update is performed by the PS, whereas
the decentralized update is computed by each node separately. In the latter case, every node creates
its own optimizer.
The tradeoff of the distribution schemes can be modeled by the communication cost per global
update. While the allreduce operation can be implemented efficiently for different message sizes
and nodes (Section 2.4), the PS scheme requires each training agent to send and receive informa-
tion to/from the PS nodes. Thus, not all network routes are used, and in terms of communication
the operation is equivalent to a reduce-then-broadcast implementation of allreduce, taking Ttree
time. However, the PS can keep track of a “global view” of training, averaging the gradients at
one location and enabling asynchronous operation of the agents. This, in turn, allows nodes to
communicate less information by performing some of the computations on the PS [39], as well as
increases fault tolerance by dynamic spin-up and removal of nodes during training.
The PS infrastructure is an abstract concept, and is not necessarily represented by one physical
server. Sharded parameter servers [39, 57] divide the ownership of w over multiple nodes, each
containing a segment of its elements. In conjunction with model parallelism and layer pipelining
(Sections 6.2 and 6.3), this alleviates some of the congestion at the PS, as shown in Figure 21(a),
in which each portion of a “model replica” (training agent) transmits its gradients and receives
its weights from a different shard. Hierarchical parameter servers [89, 263] (Figure 21(b)) further
alleviate resource contention by assigning training agents with PS “leaves,” propagating weights
and gradients from specific agent groups up to the global parameter store. Rudra [89] also studies
the tradeoff in allowed staleness, number of agents, and minibatch size, showing that SSP performs
better, but requires adapting the learning rate accordingly.
A PS infrastructure is not only beneficial for performance but also for fault tolerance. The sim-
plest form of fault tolerance in machine learning is checkpoint/restart, in which w (t ) is periodically
synchronized and persisted to a non-volatile data store (e.g., a hard drive). This is performed lo-
cally in popular deep learning frameworks, and globally in frameworks such as Poseidon [265].
Besides checkpoints, fault tolerance in distributed deep learning has first been tackled by Dist-
Belief [57, 148]. In the system, training resilience is increased by both introducing computational
redundancy in the training agents (using different nodes that handle the same data), as well as
replicating parameter server shards. In the former, an agent, which is constructed from multi-
ple physical nodes in DistBelief via hybrid parallelism (Section 6.4), is assigned multiple times to
separate groups of nodes. Allocating redundant agents enables handling slow and faulty replicas
(“stragglers”) by cancelling their work upon completion of the faster counterpart. As for the latter
resilience technique, in DistBelief and Project Adam [39], the parameters on the PS are replicated
and persisted on non-volatile memory using a dedicated manager. Project Adam further increases
the resilience of distributed training by using separate communication endpoints for replication
and using Paxos consensus between PS nodes.
Applying weight updates in a distributed environment is another issue to be addressed. In Sec-
tion 2.1, we establish that all popular weight rules are first-order with respect to the required gra-
dients (Table 3). As such, both centralized and decentralized schemes can perform weight updates
by storing the last gradient and parameter values. Since GPUs are commonly used when training
DNNs (Figure 3(a)), frameworks such as GeePS [52] implement a specialized PS for accelerator-
based training agents. In particular, GeePS incorporates additional components over a general CPU
PS, including CPU-GPU memory management components for weight updates.
In addition to reducing local (e.g., CPU-GPU) memory copies, PS infrastructures enable reducing
the amount of information communicated over the network. Project Adam utilizes the fact that the
7.3.1 Quantization. A prominent data representation for gradient (or parameter) compression
is quantization, i.e., mapping continuous information into buckets that represent sets of values
(usually ranges). It has been shown [138] that the distributions of parameter and gradient values
are narrowly dispersed (Figure 22(a)), thus these methods are effective in representing the working
range to reduce the number of bits per parameter. This method has been successfully utilized in
deep learning, both during training [64, 88, 112] and for inference, where values are quantized
post-training [210, 276]. Some papers go so far as to quantize gradients to binary [51, 217] or
ternary [156] values, while still attaining convergence with marginally reduced accuracy.
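As a simple illustration of bucket-based quantization (production schemes such as 1-bit or ternary gradients with error feedback are more elaborate), the sketch below maps a gradient tensor onto 2^b evenly spaced buckets over its observed range and reconstructs an approximation.

import numpy as np

def quantize(g, bits=8):
    """Map gradient values into 2^bits evenly spaced buckets over their observed range."""
    lo, hi = g.min(), g.max()
    scale = max(float(hi - lo), 1e-12) / (2**bits - 1)
    q = np.round((g - lo) / scale).astype(np.uint8 if bits <= 8 else np.uint16)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo

g = np.random.randn(1000).astype(np.float32)
q, lo, scale = quantize(g)
print(np.abs(dequantize(q, lo, scale) - g).max())  # reconstruction error bounded by ~scale/2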
7.3.2 Sparsification. DNNs (and CNNs in particular) exhibit sparse gradients during parameter
updates. This is primarily due to the very large number of parameters that do not necessarily
change all at once; and operators such as convolutions, in which the optimization process may
improve the accuracy of certain convolution kernels. Therefore, the full gradient is not necessary
to retain convergence, and various methods that leverage this feature have been proposed.
The first application of gradient sparsification [232] prunes gradients using a static threshold,
below which an element should not be sent. Results show up to 54× speedup for 80 nodes and up
to 1.8% reduction in error. The authors achieved a compression ratio (which also includes 32-bit
fixed point quantization) of 846–2,871× for a non-convolutional DNN. Subsequent works propose
relative (e.g., top 1%) [3, 32, 222] and adaptive thresholds [69] to transmit only the “important” gra-
dients, based on their absolute value. To counter the accuracy loss as a result of sparsification, some
works suggest to condition gradient values by changing the DNN architecture, adding normaliza-
tion operators [3]; whereas others [163] propose local gradient clipping and warm-up training.
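A sketch of relative-threshold (top-k) sparsification combined with local accumulation of the unsent residual, in the spirit of the works above; the fraction is illustrative.

import numpy as np

def sparsify_topk(g, residual, fraction=0.01):
    """Send only the largest-magnitude fraction of (gradient + residual); keep the rest locally."""
    acc = g + residual                               # fold in previously unsent values
    k = max(1, int(fraction * acc.size))
    idx = np.argpartition(np.abs(acc), -k)[-k:]      # indices of the top-k entries by magnitude
    values = acc[idx]
    new_residual = acc.copy()
    new_residual[idx] = 0.0                          # the transmitted entries are cleared locally
    return idx, values, new_residual

g = np.random.randn(10000)
idx, vals, res = sparsify_topk(g, np.zeros_like(g))
print(idx.size)  # only ~1% of the gradient entries are transmitted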
In a centralized setting (Section 7.2), distributing sparse gradients is straightforward—sparse
messages are sent between the training agents and the PS. However, implementing the necessary
allreduce in a decentralized setting is not as simple, because each agent may contribute different
non-zero indices (dimensions) in its gradient. Kylix [274] implements sparse allreduce in two steps,
first exchanging the indices and then the data. While this is desirable for systems where the sparsity
pattern per node does not change, in deep learning the gradient indices differ with each iteration.
SparCML [214] targets the specifics of deep learning explicitly by supporting arbitrarily changing
indices in a framework for sparse allreduce operations, as illustrated in Figure 22(b). SparCML
combines sending only the top-k most significant indices with quantization and supports sparse
vectors of heterogeneous sizes. The system switches between a sparse and dense representation
automatically, informed by a simple performance model. SparCML achieves a speedup of more
than 20× over a well-tuned CNTK implementation on Ethernet.
7.3.3 Other Techniques. In Section 7.2, we discuss Project Adam sending activations and errors
instead of parameters, decreasing the overall footprint for fully connected layers in favor of redun-
dant computation on the PS. The Poseidon (formerly Petuum) framework [252, 264, 265] extends
the idea of transmitting decomposed outer products u · v T of w, generalizing the concept to other
fields in machine learning as Sufficient Factor Broadcasting (SFB) [251]. With SFB, the activations
are not sent to the PS, but rather disseminated among the training agents for local recomposition.
SFB works best in centralized topologies, as recomposing the gradients in a decentralized environ-
ment causes each agent to process m − 1 additional outer products with each step, where m is the
number of agents.
A different approach to reduce DNN memory footprint is to design them specifically for that
purpose [41, 114, 135, 155]. Such works make use of memory-efficient operators and techniques,
mostly applied to convolutions, to train networks that fit on devices such as FPGAs and mobile
phones. Applied techniques include layers constructed from a multitude of 1 × 1 convolutions
[114], reshaping [155] or applying Tucker Decomposition [135] on convolution tensors, and sepa-
rable convolutions (sequential application of reduced-dimension convolutions) [41, 108]. The pa-
pers show that DNNs can decrease in size (up to 50×) and evaluation time (6.13×), exhibiting minor
reduction in accuracy.
an inconsistent view of w, the former may entirely change the training and inference processes,
depending on the method.
7.4.1 Ensemble Learning and Knowledge Distillation. A widely used technique for post-training
consolidation is ensemble learning [118, 153, 225]. With ensembles, multiple instances of w are
trained separately on the same dataset, and the overall prediction is the average of the predictions
of the ensemble members, i.e., f(x) = (1/m) Σ_{i=0}^{m} f_{w^(T,i)}(x). Ensemble learning has been used exten-
sively in machine learning before the deep learning era [66] as a form of boosting, and typically
increases the overall accuracy over a single model. Thus, it is routinely applied in machine-learning
competitions such as ILSVRC [62] and in industrial applications. Distributed training of ensembles
is a completely parallel process, requiring no communication between the agents. However, works
such as TreeNets [153] (Section 6.2) combine ensemble learning with custom (ensemble-aware) loss
functions to promote diversity between ensemble members.
Given that ensembles consume a factor of m more memory and compute power, another post-
training model consolidation technique is to reduce the size of a DNN using knowledge distillation
[10, 97]. In this scheme, training is performed in two steps: in the first step, a large network or
an ensemble is trained normally; and the second step trains a single neural network to mimic the
output of the large ensemble. Results [97] show that the second network is easier to train on the
ensemble than on a labeled dataset, attaining the same word error rate as an ensemble of 10 DNNs.
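A sketch of the second step under common formulations of distillation: the student is trained against the teacher's (or ensemble's) temperature-softened output distribution. The temperature and the T² scaling follow the usual formulation and are not values taken from the cited experiments.

import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy of the student against the teacher's softened predictions."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    # T^2 keeps the gradient magnitude comparable across temperatures
    return -np.mean(np.sum(p_teacher * log_p_student, axis=-1)) * T**2

t = np.random.randn(32, 10)   # teacher (or ensemble-averaged) logits
s = np.random.randn(32, 10)   # student logits
print(distillation_loss(s, t))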
7.4.2 Model Averaging. Another technique for consolidating models is model averaging [201].
Such methods may separately run m SGD instances on different machines, aggregating the pa-
rameters only once (post-training) [277] or every few iterations [33, 174]. While these methods are
proven to converge, applying stale-synchronous SGD (Section 7.1) leads to higher overall accuracy.
To overcome accuracy degradation as a result of infrequent averaging, more sophisticated con-
solidation methods include Elastic Averaging SGD (EASGD) [268] and Natural Gradient Descent
[202]. EASGD is based on a centralized environment (i.e., including a PS), extending direct aver-
aging by using elastic forces between the training agents’ view of w (w (t,i ) ) and the PS’s view (w̄).
This allows the agents to “explore” further by increasing the possible distance of each agent from
the average, and also allows to communicate sparsely with respect to time (iterations). EASGD
was reported [268] to outperform the DistBelief [57] SGD method in terms of accuracy, shown
to be tolerant in terms of update delay, and was used successfully in practice for communication
reduction by other works [159, 258].
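To make the elastic force concrete, the following sequential simulation sketches a symmetric EASGD-style round: each agent is pulled toward the center variable while the center moves toward the agents. This is a simplified reading, not the authors' implementation.

import numpy as np

def easgd_round(agents_w, center_w, grads, eta=0.01, alpha=0.05):
    """One round: w_i <- w_i - eta*g_i - alpha*(w_i - center); the center is pulled back in turn."""
    new_agents = []
    elastic_sum = np.zeros_like(center_w)
    for w_i, g_i in zip(agents_w, grads):
        diff = w_i - center_w
        new_agents.append(w_i - eta * g_i - alpha * diff)   # elastic force toward the center
        elastic_sum += diff
    new_center = center_w + alpha * elastic_sum             # center moves toward the agents
    return new_agents, new_center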
Natural Gradient Descent (NG-SGD) can also be used to deal with diverging parameters in dif-
ferent agents [202]. NG-SGD modifies SGD to define learning-rate matrices, approximating the
inverse Fisher information matrix and thus natural gradients. By averaging agent parameters only
every k samples (typically in the order of hundreds of thousands), the algorithm allows agents to
gradually diverge and synchronize less than traditional SGD. Natural Gradients were also approx-
imated for distributed deep learning using Kronecker Factorization (K-FAC) [11], where the work
is divided between gradient- and statistics-computing agents (for Fisher matrix blocks).
In distributed settings, algorithms are also inspected w.r.t. fault tolerance. Krum [20] is a Byzan-
tine fault-tolerant [145] SGD algorithm, allowing up to f Byzantine training agents. In particular,
the paper shows that any gradient aggregation rule based on linear combination cannot sustain
a single Byzantine agent. By combining specific m − f gradients (that are closest to each other),
Krum is able to overcome adversarial gradient inputs from the network.
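The selection rule itself is compact. A simplified sketch (assuming each agent's gradient arrives as a flat NumPy vector and that m > f + 2) is:

import numpy as np

def krum(gradients, f):
    # Score each gradient by the sum of squared distances to its m - f - 2
    # closest peers, and return the gradient with the lowest score [20].
    m = len(gradients)
    k = m - f - 2
    dists = np.array([[np.sum((gi - gj) ** 2) for gj in gradients]
                      for gi in gradients])
    scores = [np.sum(np.sort(dists[i])[1:k + 1])   # skip the zero self-distance
              for i in range(m)]
    return gradients[int(np.argmin(scores))]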
7.5.1 Parameter Search. Supervised learning can either be viewed as a stochastic optimization
process that uses one or a minibatch of samples at a time, or it can be expressed as a batch op-
timization problem, where the entire dataset is necessary to obtain gradients for descent. Batch
optimization has been used for deep learning since the inception of DNNs [151], using first- and
second-order methods [185] such as Levenberg-Marquardt, Conjugate Gradient (CG), and L-BFGS.
Although considerably more computationally expensive than SGD, there are several advantages to
such approaches, including increased concurrency (as data-parallelism increases) and better the-
oretical convergence guarantees (e.g., second-order methods converge locally at a quadratic rate).
As mentioned in Sections 3 and 6.1, large-minibatch training represents a middle ground between
SGD and batch methods. Such methods combine the “best of both worlds”: on one hand, they exhibit increased inherent concurrency (as in higher-order methods); on the other hand, they employ stochasticity, which, despite its sublinear rate of convergence, works well in practice.
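To make this trade-off explicit, the standard rates can be summarized informally as follows (stated for $\mu$-strongly convex, $L$-smooth objectives with minimizer $w^\ast$; DNN losses are non-convex, so these serve only as intuition):
\begin{align*}
\text{Newton-type (locally quadratic):} \quad & \|w^{(t+1)} - w^\ast\| \le C\,\|w^{(t)} - w^\ast\|^2,\\
\text{Batch gradient descent (linear):} \quad & f\bigl(w^{(t)}\bigr) - f^\ast \le \Bigl(1 - \tfrac{\mu}{L}\Bigr)^{t}\bigl(f\bigl(w^{(0)}\bigr) - f^\ast\bigr),\\
\text{SGD (sublinear):} \quad & \mathbb{E}\bigl[f\bigl(w^{(t)}\bigr)\bigr] - f^\ast = O(1/t).
\end{align*}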
For distributed deep learning, batch methods [147] (specifically CG and L-BFGS) and Hessian-free second-order optimization [43, 95, 171] were initially favored due to their apparent scalability compared to traditional SGD (Algorithm 1). However, due to the superior generalization properties of first-order stochastic optimization, and the successful DistBelief [57] implementation of inconsistent SGD (called Downpour SGD, based on HOGWILD [213]), it was found that the quadratic growth of batch methods in memory, communication, and computation with the problem's high dimensionality is not desirable. To overcome these issues, stochastic variants of L-BFGS have been proposed [28, 177] that estimate the inverse Hessian matrix and are proven to converge at a linear rate in strongly convex, Lipschitz-continuous settings [177].
Other optimization algorithms applied to deep learning attempt to: (a) reduce the variance of
SGD incurred by random sampling [128], (b) use the Alternating Direction Method of Multipliers
(ADMM) [25] to skip backpropagation altogether [235], (c) use K-FAC to approximate second-
order information [191], or (d) use the Neumann series expansion to approximate the Hessian
matrix [139], scaling to large minibatch sizes (32k with no accuracy loss, 131k with minimal loss)
without substantial computational overhead.
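The Neumann-series approach builds on the classical identity that, for a matrix $A$ whose spectral radius satisfies $\rho(I - A) < 1$,
$$ A^{-1} = \sum_{k=0}^{\infty} (I - A)^k \approx \sum_{k=0}^{K} (I - A)^k, $$
so a truncated sum applied to a suitably scaled curvature matrix yields an approximate second-order direction using only matrix-vector products, without ever forming or inverting the matrix explicitly.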
Gradient-free evolutionary algorithms have also been employed for deep learning, where exam-
ples include Genetic Algorithms [199, 250], Neuro-Evolution [175, 243], and Particle-Swarm Opti-
mization [168]. Apart from recombination/evolution steps, training behavior is similar to ensemble
learning, and thus these algorithms are more amenable to parallelism than traditional gradient de-
scent. As we show in the rest of this section, the gradient-independent nature of such algorithms
enables their use for meta-optimization of both hyper-parameters and DNN architectures.
7.5.2 Hyper-parameter Search. The multitude of hyper-parameters in SGD (e.g., learning rate,
momentum, maximal staleness) and their adverse effect on the resulting accuracy hinder research
efforts into new techniques in machine learning. Until recently, the prominent method for hyper-
parameter search was to perform parameter sweeps (i.e., grid search over feasible ranges). Since
this method increases the overall time exponentially with the number of hyper-parameters, its
effectiveness is limited by the availability of computing power.
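For illustration, the hypothetical sweep below (train_and_validate is a stand-in for a full training-plus-validation run, and the candidate values are arbitrary) makes both the exponential blow-up and the trivially parallel structure visible:

from itertools import product

def train_and_validate(cfg):
    # Placeholder for a complete training run returning validation accuracy.
    return -abs(cfg["lr"] - 0.01)

# Four hyper-parameters with a handful of candidates each already require
# 4 * 3 * 3 * 3 = 108 independent training runs.
space = {
    "lr":        [0.1, 0.01, 0.001, 0.0001],
    "momentum":  [0.0, 0.9, 0.99],
    "batch":     [256, 1024, 4096],
    "staleness": [0, 4, 16],
}

configs = [dict(zip(space, values)) for values in product(*space.values())]
results = [train_and_validate(cfg) for cfg in configs]   # each run is independent
best = configs[max(range(len(results)), key=results.__getitem__)]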
Several methods try to expand beyond simple parameter sweeps by making educated guesses
and tuning hyper-parameters during training. In the former class, methods include Bayesian op-
timization [227], predictive analysis of the learning curves (e.g., training error, validation error)
[13, 137] for dropping undesirable configurations, and sampling the hyper-parameter space effi-
ciently using spectral methods such as Compressed Sensing [92].
As for tuning hyper-parameters during training, Omnivore [90] employs predictive analysis and runs a grid search every predetermined number of minutes to modify the momentum and a hyper-
parameter controlling local gradient staleness. The paper shows that in distributed environments,
controlling the synchronous SGD node-group size during training can increase both accuracy and
performance. YellowFin [266] uses the local gradient curvature and variance to tune momentum,
working especially well on LSTM-based models and in asynchronous environments, performing up
to 3.28× faster than the Adam optimizer (Table 3).
Metaheuristic optimization algorithms can inherently integrate hyper-parameter tuning with
training and are thus used for DNNs. Such methods include Particle Swarm Optimization (PSO)-
based deep learning [168]; and CoDeepNEAT [175], a modification of the NEAT algorithm that
simultaneously searches for hyper-parameter and architecture configurations (see below). Such
methods scale almost linearly, due to the abundance of independent computations.
Last, Population-based Training [122] (PBT) uses a reinforcement learning approach to “ex-
plore” and “exploit” the hyper-parameter space. As illustrated in Figure 24(a), each training agent
independently samples (exploits) information from other agents every few SGD iterations. The
information is then used to select the best configuration (e.g., using a t-test), and hyper-parameters
are in turn perturbed (explored) to continue learning. This creates a decentralized topology where
communication is nondeterministic (i.e., exploitation is performed with a randomly sampled
agent), which may scale better as the number of training agents increases.
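A single agent's exploit/explore decision can be sketched as follows (a simplification: the statistical selection test and the copying of model weights are reduced to a score comparison, and the field names are illustrative):

import random

def pbt_step(agent, population, perturb=0.2):
    # Exploit: sample a random peer and, if it performs better, copy its
    # hyper-parameters (full PBT also copies the peer's model weights).
    peer = random.choice([a for a in population if a is not agent])
    if peer["score"] > agent["score"]:
        agent["hyperparams"] = dict(peer["hyperparams"])
        # Explore: perturb the (numeric) hyper-parameters to keep searching.
        for k in agent["hyperparams"]:
            agent["hyperparams"][k] *= random.choice([1.0 - perturb, 1.0 + perturb])
    return agent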
7.5.3 Architecture Search. Like feature engineering before the era of deep learning, manually
crafting DNN architectures is naturally limited by human resourcefulness and creativity. This lim-
itation promoted a recent rise of research into automated neural architecture search. Architecture
search can be categorized into three approaches: Sequential Model-based Optimization (SMBO),
Reinforcement Learning (RL), and Evolutionary Algorithms (EA).
SMBO-based search methods rely on optimizing an architecture candidate, defining a finite set
of states to explore (e.g., search tree children), and traversing those sets. As a result, concurrency
depends on the number of points in the search space at a given time. Examples of SMBO in-
clude DeepArchitect [180], which proposes a DNN definition language that allows programmers
to explicitly define the space; PNASNet [164], which searches for networks ordered by increasing
complexity using a search algorithm based on A*, evaluating half as many models compared
to an equivalent RL approach [280]; SMASH [27], which assesses optimality (fitness) of candi-
date networks using another CNN that maps the given DNN architecture to weights for testing;
and DARTS [166], which formulates architecture search as a bi-level, differentiable optimization
problem.
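Concretely, the bi-level problem in DARTS can be written as
$$ \min_{\alpha}\; \mathcal{L}_{\mathrm{val}}\bigl(w^{\ast}(\alpha), \alpha\bigr) \quad \text{s.t.} \quad w^{\ast}(\alpha) = \arg\min_{w}\; \mathcal{L}_{\mathrm{train}}(w, \alpha), $$
where $\alpha$ are continuous architecture-mixing weights and $w$ are the network parameters; since both are updated by gradient descent, the search itself becomes differentiable.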
Many recent DNN architectures (Section 4.3) exhibit self-similarity and repeating sub-units
(modules). This observation can be leveraged to dramatically reduce the number of explored archi-
tectures, composing networks hierarchically out of modules and basic blocks (e.g., convolution).
This approach has been used successfully in the community, creating new candidates for both
CNN modules [164, 165, 175, 211, 275, 280] and RNN units [200, 279].
RL-based architecture search uses the accuracy of the resulting network as a reward function,
whereas modifications to the DNN or its hyper-parameters are actions. In Neural Architecture
Search (NAS) [279], the parameters of each layer can be modified, but the number of layers is
fixed. A sharded PS-based distributed system, in conjunction with policy gradient optimization
[248], is used for training. Other examples include MetaQNN [12] and BlockQNN [275], which
operate similarly to NAS, but use Q-learning for optimization; and ENAS [200], which significantly
reduces computational time over NAS (by three orders of magnitude) by sharing parameters across child DNNs (i.e., networks in the immediate search space).
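For the policy-gradient variants, the controller parameters $\theta$ of the architecture-generating policy $\pi_\theta$ are updated with a REINFORCE-style estimator, using the validation accuracy $R$ of the sampled child network as the reward:
$$ \nabla_{\theta} J(\theta) = \mathbb{E}_{a \sim \pi_{\theta}}\bigl[(R - b)\,\nabla_{\theta} \log \pi_{\theta}(a)\bigr], $$
where $a$ denotes the sequence of architectural decisions and $b$ is a baseline used to reduce the variance of the estimate.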
Evolutionary Algorithms (EA) are advantageous for architecture search, as any function (not
necessarily differentiable) can be optimized using these methods. HyperNEAT was the first EA
successfully applied [243] to deep learning, used to evolve both the weights and the DNN architecture at the
same time; and CoDeepNEAT [175] defines a variant of the NEAT algorithm to optimize hyper-
parameters and architecture, using the self-similarity feature of DNNs by optimizing “blueprints”
that are composed of modules. Genetic CNNs [250] uses Genetic Algorithms (GAs) by encoding
the DNN connections as binary genes (as required in GAs, shown in Figure 24(b)), and training
the population of DNNs at every time-step, using the final accuracy as the fitness function. GAs
are highly amenable to parallelism, and have been successfully used for very large-scale training
[261], where 18,000 nodes were used on the Titan supercomputer for 24h to obtain state-of-the-art
accuracy for segmentation and reconstruction problems.
Large-Scale Evolution [212] also uses GAs, but defines a set of specific mutations (e.g., insert
convolution, alter stride) that can be applied. Large-Scale Evolution outperforms some existing
RL-based methods in terms of accuracy, as well as in terms of scalability, as GAs can run the entire
population in parallel (where accuracy increases with population size in expectation). However,
in the general case GAs require synchronous reductive communication between time-steps for
selection of the fittest candidates. To overcome this issue, the paper employs tournament selection
[82], which only performs pairwise comparisons between population members.
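A minimal sketch of this pairwise variant follows (fitness is assumed to be an already-evaluated score per candidate):

import random

def tournament_select(population, fitness, k=2):
    # Sample k candidates uniformly and keep the fittest; for k=2 this is the
    # pairwise comparison mentioned above, avoiding a global reduction.
    contenders = random.sample(population, k)
    return max(contenders, key=fitness)

Because each selection touches only k individuals, many selections can proceed concurrently on different workers without a synchronization point.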
Additional GA architecture search methods include the use of multi-level hierarchical represen-
tations of DNNs [165], which implement an asynchronous distributed tournament selection (cen-
tralized, queue-based implementation) with specialized mutation. Regularized Evolution (Amoe-
baNets) [211] further extends GA with tournament selection by removing the oldest sample from
the population each iteration (akin to death in nature), thus regularizing the optimization process.
AmoebaNets outperform all existing methods [111], including manually engineered DNNs and
RL-based searches, with 3% top-5 error for ImageNet and 1.5% top-1 error for CIFAR-10 (compared
to 5.29% and 3.62% on the best instances of DenseNet, see Table 5).
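The aging mechanism can be sketched as follows (a simplification with tournament size two; mutate and fitness stand for architecture mutation and the, ideally memoized, evaluation of a trained candidate):

import random
from collections import deque

def regularized_evolution(init_population, mutate, fitness, steps):
    # Aging evolution: the population is a FIFO queue; every step, a tournament
    # winner is mutated into a child and the oldest member is removed.
    population = deque(init_population)
    history = list(init_population)
    for _ in range(steps):
        parent = max(random.sample(list(population), 2), key=fitness)
        child = mutate(parent)
        population.append(child)
        population.popleft()            # "death" of the oldest architecture
        history.append(child)
    return max(history, key=fitness)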
8 CONCLUDING REMARKS
The world of deep learning is brimming with concurrency. Nearly every aspect of training, from
the computation of a convolution to the meta-optimization of DNN architectures, is inherently
parallel. Even if an aspect is sequential, its consistency requirements can be relaxed, owing to the robustness of nonlinear optimization, to increase concurrency while still attaining reasonable, and sometimes even better, accuracy. In this article, we give an overview of many of these aspects and the respective approaches documented in the literature, and we provide concurrency analysis using the W-D model where possible.
It is hard to predict what the future holds for this highly active field of research (many have
tried over the years). Below, we highlight potential directions for future research in parallel and
distributed deep learning.
As research progresses, DNN architectures are becoming deeper and more interconnected, be-
tween consecutive and non-consecutive layers (“skip connections”). Apart from accuracy, consid-
erable effort is devoted to reducing the memory footprint and number of operations [108, 211], to
successfully run inference on mobile devices. This also means that post-training DNN compres-
sion [91] will likely be researched further, and training compressible networks will be desirable.
Since mobile hardware is limited in memory capacity and has to be energy efficient, specialized
DNN computational hardware is frequently proposed [233]. We see this trend with the NVIDIA
Tensor Cores [188], the Tensor Processing Unit [129], other ASICs and FPGAs [35, 187], and even
neuromorphic computing [4]. Handling DNN sparsity (e.g., after compression) is a focus for some
ASICs [269], and advances in recurrent networks and attention learning [30, 253] indicate that
training and inference hardware would also need to work efficiently with variable-length inputs.
Computing individual operators is highly optimized today (Section 5), and thus current re-
search is oriented toward inter-layer and whole-DNN optimization. TensorFlow XLA [83], Tensor
Comprehensions [240], Latte [236], and TVM [34] compile entire neural network graphs at once,
performing a variety of transformations (e.g., fusion) to optimize execution time, achieving 4×
speedup over manually tuned individual operators. We expect research to continue in this direction
to the point where DNN evaluation is close to optimal in terms of operations and shared-memory
optimizations.
Techniques applied in distributed deep learning are converging to the point where a standard
programming interface (or framework) can be designed. In the future, ecosystems such as Ease.ml
[158] may make the definition of a training scheme (e.g., with respect to centralization and gradient
consistency) easier, hiding most of the low-level infrastructure setup. Combining the increasing
support for cloud systems [271] and elastic training [195] (where nodes can be spun up and re-
moved at will) with the latest developments in evolutionary algorithms (see Section 7.5), we may
see adaptive and financially viable optimization methods rising to prominence.
Finally, deep learning is being used to solve increasingly complex problems such as routing algo-
rithms [85] and hierarchical task combination [77]. Research toward Artificial General Intelligence
is now focusing on multi-purpose networks [127, 130], which creates new, unexplored opportuni-
ties for model parallelism and different training algorithms. Searching for adequate multi-purpose
networks may be beyond the ingenuity of a human team, and as meta-optimization (specifically, architecture search) and progressive training [132] increase in usability and quality, parameter sweeps and manual DNN architecture engineering will become obsolete. Supporting this claim is
the fact that the current state-of-the-art CNN in computer vision [211] (CIFAR-10 and ImageNet
datasets) is the result of an automated architecture search. Exploiting parallelism is necessary for
such breakthroughs and others, going hand in hand with the advancement of deep learning as a
field.
REFERENCES
[1] M. Abadi et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Retrieved from
http://www.tensorflow.org.
[2] A. Agarwal and J. C. Duchi. 2011. Distributed delayed stochastic optimization. In Advances in Neural Information
Processing Systems 24. MIT Press, 873–881.
[3] A. F. Aji and K. Heafield. 2017. Sparse communication for distributed gradient descent. arxiv:1704.05021
[4] F. Akopyan et al. 2015. TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic
chip. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 34, 10 (2015), 1537–1557.
[5] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. 2017. QSGD: Communication-efficient SGD via gradient
quantization and encoding. In Advances in Neural Information Processing Systems 30. MIT Press, 1709–1720.
[6] D. Amodei et al. 2016. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of
the 33rd International Conference on Machine Learning, vol. 48. 173–182.
[7] J. Appleyard, T. Kociský, and P. Blunsom. 2016. Optimizing performance of recurrent neural networks on GPUs.
arxiv:1604.01946
[8] N. S. Arora, R. D. Blumofe, and C. G. Plaxton. 1998. Thread scheduling for multiprogrammed multiprocessors. In
Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA’98). 119–129.
[9] A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda. 2017. S-Caffe: Co-designing MPI runtimes and caffe for
scalable deep learning on modern GPU clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming (PPoPP’17). 193–205.
[10] J. Ba and R. Caruana. 2014. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems
27. MIT Press, 2654–2662.
[11] J. Ba, R. Grosse, and J. Martens. 2017. Distributed second-order optimization using kronecker-factored approxima-
tions. In Proceedings of the International Conference on Learning Representations (ICLR’17).
[12] B. Baker, O. Gupta, N. Naik, and R. Raskar. 2017. Designing neural network architectures using reinforcement learn-
ing. In Proceedings of the International Conference on Learning Representations (ICLR’17).
[13] B. Baker, O. Gupta, R. Raskar, and N. Naik. 2017. Practical neural network performance prediction for early stopping.
arxiv:1705.10823.
[14] B. W. Barrett et al. 2018. The Portals 4.2 Network Programming Interface. Sandia Report SAND2018-12790. Technical
Report.
[15] R. Belli and T. Hoefler. 2015. Notified access: Extending remote memory access programming models for producer-
consumer synchronization. In Proceedings of the 29th IEEE International Parallel & Distributed Processing Symposium
(IPDPS’15).
[16] T. Ben-Nun, E. Levy, A. Barak, and E. Rubin. 2015. Memory access patterns: The missing piece of the multi-GPU puz-
zle. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
(SC’15). 19:1–19:12.
[17] Y. Bengio. 2013. Deep learning of representations: Looking forward. In Proceedings of the Statistical Language and
Speech Processing (SLSP’13).
[18] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. 2007. Greedy layer-wise training of deep networks. In Advances
in Neural Information Processing Systems 19. MIT Press, 153–160.
[19] Y. Bengio, P. Simard, and P. Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE
Trans. Neural Netw. 5, 2 (1994), 157–166.
[20] P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer. 2017. Machine learning with adversaries: Byzantine
tolerant gradient descent. In Advances in Neural Information Processing Systems 30. MIT Press, 119–129.
[21] R. D. Blumofe and C. E. Leiserson. 1999. Scheduling multithreaded computations by work stealing. J. ACM 46, 5
(1999), 720–748.
[22] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang,
X. Zhang, J. Zhao, and K. Zieba. 2016. End to end learning for self-driving cars. arxiv:1604.07316
[23] L. Bottou, F. E. Curtis, and J. Nocedal. 2016. Optimization methods for large-scale machine learning. arxiv:1606.04838
[24] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. 2005. Gossip algorithms: Design, analysis and applications. In Proceed-
ings of the IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 3. 1653–1664.
[25] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. 2011. Distributed optimization and statistical learning via the
alternating direction method of multipliers. Found. Trends Mach. Learn. 3, 1 (2011), 1–122.
[26] R. P. Brent. 1974. The parallel evaluation of general arithmetic expressions. J. ACM 21, 2 (1974), 201–206.
[27] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. 2017. SMASH: One-shot model architecture search through Hyper-
Networks. arxiv:1708.05344.
[28] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer. 2016. A stochastic quasi-newton method for large-scale optimiza-
tion. SIAM J. Optim. 26, 2 (2016), 1008–1031.
[29] E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn. 2007. Collective communication: Theory, practice, and
experience: Research articles. Concurr. Comput.: Pract. Exper. 19, 13 (2007), 1749–1783.
[30] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary
conversational speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP’16). 4960–4964.
[31] K. Chellapilla, S. Puri, and P. Simard. 2006. High performance convolutional neural networks for document process-
ing. In Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition.
[32] C.-Y. Chen, J. Choi, D. Brand, A. Agrawal, W. Zhang, and K. Gopalakrishnan. 2017. AdaComp: Adaptive residual
gradient compression for data-parallel distributed training. arxiv:1712.02679.
[33] K. Chen and Q. Huo. 2016. Scalable training of deep learning machines by incremental block training with intra-
block parallel optimization and blockwise model-update filtering. In Proceedings of the IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP’16). 5880–5884.
[34] T. Chen et al. 2018. TVM: End-to-end optimization stack for deep learning. arxiv:1802.04799.
[35] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam. 2014. DianNao: A small-footprint high-throughput
accelerator for ubiquitous machine-learning. In Proceedings of the 19th International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS’14). 269–284.
[36] T. Chen, B. Xu, C. Zhang, and C. Guestrin. 2016. Training deep nets with sublinear memory cost. arxiv:1604.06174
[37] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. 2017. Dual path networks. In Advances in Neural Information
Processing Systems 30. MIT Press, 4470–4478.
[38] S. Chetlur et al. 2014. cuDNN: Efficient primitives for deep learning. arxiv:1410.0759.
[39] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. 2014. Project adam: Building an efficient and scalable deep
learning training system. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implemen-
tation. 571–582.
[40] K. Cho et al. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’14). 1724–1734.
[41] F. Chollet. 2016. Xception: Deep learning with depthwise separable convolutions. arxiv:1610.02357
[42] C. Chu, S. K. Kim, Y. Lin, Y. Yu, G. Bradski, K. Olukotun, and A. Y. Ng. 2007. Map-reduce for machine learning on
multicore. In Advances in Neural Information Processing Systems 19. MIT Press, 281–288.
[43] I. H. Chung et al. 2017. Parallel deep neural network training for big data on blue gene/Q. IEEE Trans. Parallel Distrib.
Syst. 28, 6 (2017), 1703–1714.
[44] D. C. Cireşan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. 2013. Mitosis detection in breast cancer histology
images with deep neural networks. In Proceedings of the International Conference on Medical Image Computing and
Computer-Assisted Intervention. 411–418.
[45] A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro. 2013. Deep learning with COTS HPC systems.
In Proceedings of the 30th International Conference on Machine Learning—Volume 28 (ICML’13). III–1337–III–1345.
[46] N. Cohen, O. Sharir, and A. Shashua. 2016. Deep SimNets. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR’16). 4782–4791.
[47] N. Cohen, O. Sharir, and A. Shashua. 2016. On the expressive power of deep learning: A tensor analysis. In Proceed-
ings of the 29th Annual Conference on Learning Theory, vol. 49. 698–728.
[48] R. Collobert, K. Kavukcuoglu, and C. Farabet. 2011. Torch7: A matlab-like environment for machine learning. In
Proceedings of the Workshop on Big Learning: Algorithms, Systems, and Tools for Learning at Scale (BigLearn’11).
[49] J. Cong and B. Xiao. 2014. Minimizing computation in convolutional neural networks. In Proceedings of the Interna-
tional Conference on Artificial Neural Networks (ICANN’14). 281–290.
[50] M. Courbariaux and Y. Bengio. 2016. BinaryNet: Training deep neural networks with weights and activations con-
strained to +1 or −1. arxiv:1602.02830
[51] M. Courbariaux, Y. Bengio, and J.-P. David. 2015. BinaryConnect: Training deep neural networks with binary weights
during propagations. In Proceedings of the 28th International Conference on Neural Information Processing Systems
(NIPS’15), vol. 2. 3123–3131.
[52] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing. 2016. GeePS: Scalable deep learning on distributed
GPUs with a GPU-specialized parameter server. In Proceedings of the European Conference on Computer Systems
(EuroSys’16). 4:1–4:16.
[53] X. Cui, W. Zhang, Z. Tüske, and M. Picheny. 2018. Evolutionary stochastic gradient descent for optimization of deep
neural networks. In Advances in Neural Information Processing Systems 31. MIT Press, 6048–6058.
[54] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. 1993. LogP:
Towards a realistic model of parallel computation. In Proceedings of the 4th ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming (PPoPP’93). 1–12.
[55] J. Daily et al. 2018. GossipGraD: Scalable deep learning using gossip communication-based asynchronous gradient
descent. arxiv:1803.05880.
[56] C. De Sa, C. Zhang, K. Olukotun, and C. Ré. 2015. Taming the wild: A unified analysis of HOGWILD!-style algorithms.
In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS’15), vol. 2. 2674–
2682.
[57] J. Dean et al. 2012. Large scale distributed deep networks. In Proceedings of the 25th International Conference on
Neural Information Processing Systems (NIPS’12), vol. 1. 1223–1231.
[58] J. Dean and S. Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (2008),
107–113.
[59] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. 2012. Optimal distributed online prediction using mini-batches.
J. Mach. Learn. Res. 13, 1 (2012), 165–202.
[60] O. Delalleau and Y. Bengio. 2011. Shallow vs. deep sum-product networks. In Advances in Neural Information Pro-
cessing Systems 24. MIT Press, 666–674.
[61] J. Demmel and G. Dinh. 2018. Communication-optimal convolutional neural nets. arxiv:1802.06905.
[62] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09).
[63] L. Deng, D. Yu, and J. Platt. 2012. Scalable stacking and learning for building deep architectures. In Proceedings of
the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’12). 2133–2136.
[64] T. Dettmers. 2015. 8-bit approximations for parallelism in deep learning. arxiv:1511.04561.
[65] G. Diamos et al. 2016. Persistent RNNs: Stashing recurrent weights on-chip. In Proceedings of the 33rd International
Conference on Machine Learning, vol. 48. 2024–2033.
[66] T. G. Dietterich. 2000. Ensemble methods in machine learning. In Proceedings of the 1st International Workshop on
Multiple Classifier Systems (MCS’00). 1–15.
[67] Z. Drezner and A. Barak. 1986. An asynchronous algorithm for scattering information between the active nodes of
a multicomputer system. J. Parallel Distrib. Comput. 3, 3 (1986), 344–351.
[68] N. Dryden et al. 2019. Improving strong-scaling of CNN training by exploiting finer-grained parallelism. In Proceed-
ings of the 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS’19).
[69] N. Dryden, T. Moon, S. A. Jacobs, and B. V. Essen. 2016. Communication quantization for data-parallel training of
deep neural networks. In Proceedings of the Workshop on Machine Learning in HPC Environments (MLHPC’16). 1–8.
[70] J. Duchi, E. Hazan, and Y. Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimiza-
tion. J. Mach. Learn. Res. 12 (2011), 2121–2159.
[71] V. Dumoulin and F. Visin. 2016. A guide to convolution arithmetic for deep learning. arxiv:1603.07285.
[72] J. L. Elman. 1990. Finding structure in time. Cogn. Sci. 14, 2 (1990), 179–211.
[73] T. Elsken, J.-H. Metzen, and F. Hutter. 2017. Simple and efficient architecture search for convolutional neural net-
works. arxiv:1711.04528.
[74] L. Ericson and R. Mbuvha. 2017. On the performance of network parallel training in artificial neural networks.
arxiv:1701.05130.
[75] P. Farber and K. Asanovic. 1997. Parallel neural network training on multi-spert. In Proceedings of the 3rd Interna-
tional Conference on Algorithms and Architectures for Parallel Processing. 659–666.
[76] B. M. Forrest, D. Roweth, N. Stroud, D. J. Wallace, and G. V. Wilson. 1987. Implementing neural network models on
parallel computers. Comput. J. 30, 5 (1987), 413–419.
[77] K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman. 2017. Meta learning shared hierarchies. arxiv:1710.09767.
[78] M. P. Friedlander and M. W. Schmidt. 2011. Hybrid deterministic-stochastic methods for data fitting. arxiv:1104.2373.
[79] A. Gaunt, M. Johnson, M. Riechert, D. Tarlow, R. Tomioka, D. Vytiniotis, and S. Webster. 2017. AMPNet: Asynchro-
nous model-parallel training for dynamic neural networks. arxiv:1705.09786.
[80] A. Gholami, A. Azad, P. H. Jin, K. Keutzer, and A. Buluç. 2018. Integrated model, batch, and domain parallelism
in training neural networks. In Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures
(SPAA’18). 77–86.
[81] X. Glorot and Y. Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Pro-
ceedings of the 13th International Conference on Artificial Intelligence and Statistics, vol. 9. 249–256.
[82] D. E. Goldberg and K. Deb. 1991. A comparative analysis of selection schemes used in genetic algorithms. Foundations
of Genetic Algorithms, vol. 1. Elsevier, 69–93.
[83] Google. 2017. TensorFlow XLA Overview. Retrieved from https://www.tensorflow.org/performance/xla.
[84] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. 2017. Accurate,
large minibatch SGD: Training ImageNet in 1 Hour. arxiv:1706.02677.
[85] A. Graves et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature 538, 7626
(2016), 471–476.
[86] W. Gropp, T. Hoefler, R. Thakur, and E. Lusk. 2014. Using Advanced MPI: Modern Features of the Message-Passing
Interface. MIT Press.
[87] A. Gruslys, R. Munos, I. Danihelka, M. Lanctot, and A. Graves. 2016. Memory-efficient backpropagation through
time. In Advances in Neural Information Processing Systems 29. MIT Press, 4125–4133.
[88] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. 2015. Deep learning with limited numerical precision.
In Proceedings of the 32nd International Conference on Machine Learning, vol. 37. 1737–1746.
[89] S. Gupta, W. Zhang, and F. Wang. 2016. Model accuracy and runtime tradeoff in distributed deep learning: A sys-
tematic study. In Proceedings of the IEEE 16th International Conference on Data Mining (ICDM’16). 171–180.
[90] S. Hadjis, C. Zhang, I. Mitliagkas, and C. Ré. 2016. Omnivore: An optimizer for multi-device deep learning on CPUs
and GPUs. arxiv:1606.04487.
[91] S. Han, H. Mao, and W. J. Dally. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the International Conference on Learning Representations (ICLR’16).
[92] E. Hazan, A. Klivans, and Y. Yuan. 2018. Hyperparameter optimization: A spectral approach. In Proceedings of the
International Conference on Learning Representations (ICLR’18).
[93] K. He, X. Zhang, S. Ren, and J. Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on
ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’15). 1026–
1034.
[94] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR’16). 770–778.
[95] X. He, D. Mudigere, M. Smelyanskiy, and M. Takac. 2017. Distributed hessian-free optimization for deep neural
network. In Proceedings of the AAAI Workshops.
[96] G. Hinton. 2012. Neural Networks for Machine Learning, Lecture 6a: Overview of Mini-batch Gradient Descent.
[97] G. Hinton, O. Vinyals, and J. Dean. 2015. Distilling the knowledge in a neural network. In Proceedings of the NIPS
Deep Learning and Representation Learning Workshop.
[98] G. E. Hinton, S. Osindero, and Y. W. Teh. 2006. A fast learning algorithm for deep belief nets. Neural Comput. 18, 7
(2006), 1527–1554.
[99] Q. Ho et al. 2013. More effective distributed ML via a stale synchronous parallel parameter server. In Proceedings of
the 26th International Conference on Neural Information Processing Systems, vol. 1 (NIPS’13). 1223–1231.
[100] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Comput. 9, 8 (1997), 1735–1780.
[101] T. Hoefler, A. Barak, A. Shiloh, and Z. Drezner. 2017. Corrected gossip algorithms for fast reliable broadcast on unre-
liable systems. In Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS’17).
[102] T. Hoefler, A. Lumsdaine, and W. Rehm. 2007. Implementation and performance analysis of non-blocking collec-
tive operations for MPI. In Proceedings of the International Conference on High Performance Computing, Networking,
Storage and Analysis (SC’07). IEEE Computer Society/ACM.
[103] T. Hoefler and D. Moor. 2014. Energy, memory, and runtime tradeoffs for implementing collective communication
operations. J. Supercomput. Front. Innovat. 1, 2 (2014), 58–75.
[104] T. Hoefler and T. Schneider. 2012. Optimization principles for collective neighborhood communications. In Proceed-
ings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 98:1–98:10.
[105] T. Hoefler and M. Snir. 2011. Generic topology mapping strategies for large-scale parallel architectures. In Proceed-
ings of the ACM International Conference on Supercomputing (ICS’11). 75–85.
[106] T. Hoefler and J. L. Traeff. 2009. Sparse collective operations for MPI. In Proceedings of the 23rd IEEE International
Parallel and Distributed Processing Symposium (HIPS’09).
[107] E. Hoffer, I. Hubara, and D. Soudry. 2017. Train longer, generalize better: Closing the generalization gap in large
batch training of neural networks. In Advances in Neural Information Processing Systems 30. MIT Press, 1729–1739.
[108] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. 2017. Mo-
bileNets: Efficient convolutional neural networks for mobile vision applications. arxiv:1704.04861.
[109] K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G. R. Ganger, P. B. Gibbons, and O. Mutlu. 2017. Gaia: Geo-
distributed machine learning approaching LAN speeds. In Proceedings of the 14th USENIX Conference on Networked
Systems Design and Implementation (NSDI’17). 629–647.
[110] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. 2017. Densely connected convolutional networks. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[111] Y. Huang et al. 2018. GPipe: Efficient training of giant neural networks using pipeline parallelism. arxiv:1811.06965.
[112] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. 2016. Quantized neural networks: Training neural
networks with low precision weights and activations. arxiv:1609.07061.
[113] D. A. Huffman. 1952. A method for the construction of minimum-redundancy codes. Proc. IRE 40, 9 (1952), 1098–1101.
[114] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer. 2016. SqueezeNet: AlexNet-level
accuracy with 50x fewer parameters and <1MB model size. arxiv:1602.07360.
[115] F. N. Iandola, M. W. Moskewicz, K. Ashraf, and K. Keutzer. 2016. FireCaffe: Near-linear acceleration of deep neu-
ral network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR’16).
[116] IBM. 2019. Engineering and Scientific Subroutine Library (ESSL). Version 6.2 Guide and Reference. Retrieved from
https://www.ibm.com/support/knowledgecenter/SSFHY8_6.2/reference/essl_reference_pdf.pdf.
[117] P. Ienne. 1993. Architectures for Neuro-Computers: Review and Performance Evaluation. Technical Report. EPFL, Lau-
sanne, Switzerland.
[118] D. J. Im, H. Ma, C. D. Kim, and G. W. Taylor. 2016. Generative adversarial parallelization. arxiv:1612.04021.
[119] Intel. 2009. Intel Math Kernel Library. Reference Manual. Intel Corporation.
[120] Intel. 2017. MKL-DNN. Retrieved from https://01.org/mkl-dnn.
[121] S. Ioffe and C. Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate
shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15). 448–456.
[122] M. Jaderberg et al. 2017. Population-based training of neural networks. arxiv:1711.09846.
[123] X. Jia et al. 2018. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four
minutes. arxiv:1807.11205.
[124] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. 2014. Caffe: Convo-
lutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia.
675–678.
[125] J. Jiang, B. Cui, C. Zhang, and L. Yu. 2017. Heterogeneity-aware distributed parameter servers. In Proceedings of the
ACM International Conference on Management of Data (SIGMOD’17). 463–478.
[126] P. H. Jin, Q. Yuan, F. N. Iandola, and K. Keutzer. 2016. How to scale distributed deep learning? In Proceedings of the
ML Systems Workshop at NIPS.
[127] M. Johnson et al. 2016. Google’s multilingual neural machine translation system: Enabling zero-shot translation.
arxiv:1611.04558.
[128] R. Johnson and T. Zhang. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In
Advances in Neural Information Processing Systems 26. MIT Press, 315–323.
[129] N. P. Jouppi et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th
Annual International Symposium on Computer Architecture (ISCA’17). 1–12.
[130] L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit. 2017. One model to learn them
all. arxiv:1706.05137.
[131] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing. 2018. Neural architecture search with bayesian
optimisation and optimal transport. In Advances in Neural Information Processing Systems 31. MIT Press, 2016–2025.
[132] T. Karras, T. Aila, S. Laine, and J. Lehtinen. 2017. Progressive growing of GANs for improved quality, stability, and
variation. arxiv:1710.10196.
[133] J. Keuper and F. Pfreundt. 2015. Asynchronous parallel stochastic gradient descent: A numeric core for scalable
distributed machine learning algorithms. In Proceedings of the Workshop on Machine Learning in HPC Environments
(MLHPC’15). 1:1–1:11.
[134] H. Kim et al. 2016. DeepSpark: Spark-based deep learning supporting asynchronous updates and caffe compatibility.
arxiv:1602.08191.
[135] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. 2016. Compression of deep convolutional neural networks
for fast and low power mobile applications. In Proceedings of the International Conference on Learning Representations
(ICLR’16).
[136] D. P. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Con-
ference on Learning Representations (ICLR’15).
[137] A. Klein, S. Falkner, J. T. Springenberg, and F. Hutter. 2016. Learning curve prediction with Bayesian neural networks.
In Proceedings of the International Conference on Learning Representations (ICLR).
[138] U. Köster et al. 2017. Flexpoint: An adaptive numerical format for efficient training of deep neural networks. In
Advances in Neural Information Processing Systems 30. MIT Press, 1740–1750.
[139] S. Krishnan, Y. Xiao, and R. A. Saurous. 2018. Neumann optimizer: A practical optimization algorithm for deep neural
networks. In Proceedings of the International Conference on Learning Representations (ICLR’18).
[140] A. Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images. Master’s thesis, University of Toronto,
Canada.
[141] A. Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. arxiv:1404.5997.
[142] A. Krizhevsky, I. Sutskever, and G. Hinton. 2012. ImageNet classification with deep convolutional neural networks.
In Advances in Neural Information Processing Systems 25. MIT Press, 1097–1105.
[143] T. Kurth et al. 2017. Deep learning at 15PF: Supervised and semi-supervised classification for scientific data. In
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’17).
7:1–7:11.
[144] G. Lacey, G. W. Taylor, and S. Areibi. 2016. Deep learning on FPGAs: Past, present, and future. arxiv:1602.04283.
[145] L. Lamport, R. Shostak, and M. Pease. 1982. The byzantine generals problem. ACM Trans. Program. Lang. Syst. 4, 3
(1982), 382–401.
[146] A. Lavin and S. Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR’16).
[147] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. 2011. On optimization methods for deep learning.
In Proceedings of the 28th International Conference on Machine Learning (ICML’11). 265–272.
[148] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng. 2012. Building high-level
features using large scale unsupervised learning. In Proceedings of the 29th International Conference on Machine
Learning (ICML’12). 507–514.
[149] Y. LeCun, Y. Bengio, and G. Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[150] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. 1989. Backpropagation
applied to handwritten zip code recognition. Neural Comput. 1, 4 (1989), 541–551.
[151] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc.
IEEE 86, 11 (1998), 2278–2324.
[152] H. Lee, P. Pham, Y. Largman, and A. Y. Ng. 2009. Unsupervised feature learning for audio classification using con-
volutional deep belief networks. In Advances in Neural Information Processing Systems 22. MIT Press, 1096–1104.
[153] S. Lee, S. Purushwalkam, M. Cogswell, D. J. Crandall, and D. Batra. 2015. Why M heads are better than one: Training
a diverse ensemble of deep networks. arxiv:1511.06314.
[154] C. Li, Y. Yang, M. Feng, S. Chakradhar, and H. Zhou. 2016. Optimizing memory efficiency for deep convolutional
neural networks on GPUs. In Proceedings of the International Conference for Supercomputing (SC’16). 54:1–54:12.
[155] D. Li, X. Wang, and D. Kong. 2017. DeepRebirth: Accelerating deep neural network execution on mobile devices.
arxiv:1708.04728.
[156] F. Li and B. Liu. 2016. Ternary weight networks. arxiv:1605.04711.
[157] M. Li et al. 2014. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX
Conference on Operating Systems Design and Implementation (OSDI’14). 583–598.
[158] T. Li, J. Zhong, J. Liu, W. Wu, and C. Zhang. 2017. Ease.ml: Towards multi-tenant resource sharing for machine
learning workloads. arxiv:1708.07308.
[159] X. Lian et al. 2017. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized
parallel stochastic gradient descent. In Advances in Neural Information Processing Systems 30. MIT Press, 5336–5346.
[160] X. Lian, Y. Huang, Y. Li, and J. Liu. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. In
Proceedings of the 28th International Conference on NIPS, vol. 2. 2737–2745.
[161] X. Lian, W. Zhang, C. Zhang, and J. Liu. 2018. Asynchronous decentralized parallel stochastic gradient descent. In
Proceedings of the 35th International Conference on Machine Learning (ICML’18). 3043–3052.
[162] M. Lin, Q. Chen, and S. Yan. 2014. Network in network. In Proceedings of the International Conferecne on Learning
Representations (ICLR’14).
[163] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. 2018. Deep gradient compression: Reducing the communica-
tion bandwidth for distributed training. In Proceedings of the International Conference on Learning Representations
(ICLR’18).
[164] C. Liu, B. Zoph, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. 2017. Progressive neural
architecture search. arxiv:1712.00559.
[165] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. 2018. Hierarchical representations for efficient
architecture search. In Proceedings of the International Conference on Learning Representations (ICLR’18).
[166] H. Liu, K. Simonyan, and Y. Yang. 2018. DARTS: Differentiable architecture search. arxiv:1806.09055.
[167] X. Liu, J. Pool, S. Han, and W. J. Dally. 2018. Efficient sparse-winograd convolutional neural networks. In Proceedings
of the International Conference on Learning Representations (ICLR’18).
[168] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor. 2017. Hyper-parameter selection in deep neural networks
using parallel particle swarm optimization. In Proceedings of the Genetic and Evolutionary Computation Conference
(GECCO’17). 1864–1871.
[169] I. Loshchilov and F. Hutter. 2017. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of the Inter-
national Conference on Learning Representations (ICLR’17).
[170] R. Luo, F. Tian, T. Qin, E. Chen, and T.-Y. Liu. 2018. Neural architecture optimization. In Advances in Neural Infor-
mation Processing Systems 31. MIT Press, 7816–7827.
[171] J. Martens. 2010. Deep learning via hessian-free optimization. In Proceedings of the 27th International Conference on
Machine Learning (ICML’10). 735–742.
[172] M. Mathieu, M. Henaff, and Y. LeCun. 2014. Fast training of convolutional networks through FFTs. In Proceedings of
the International Conference on Learning Representations (ICLR’14).
[173] Message Passing Interface Forum. 2015. MPI: A Message-Passing Interface Standard Version 3.1. Retrieved from
https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf.
[174] Y. Miao, H. Zhang, and F. Metze. 2014. Distributed learning of multilingual DNN feature extractors using GPUs. In
Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH’14).
830–834.
[175] R. Miikkulainen et al. 2017. Evolving deep neural networks. arxiv:1703.00548.
[176] H. Mikami et al. 2018. ImageNet/ResNet-50 training in 224 seconds. arxiv:1811.05233.
[177] P. Moritz, R. Nishihara, and M. Jordan. 2016. A linearly-convergent stochastic L-BFGS algorithm. In Proceedings of
the 19th International Conference on Artificial Intelligence and Statistics, vol. 51. 249–258.
[178] P. Moritz, R. Nishihara, I. Stoica, and M. I. Jordan. 2016. SparkNet: Training deep networks in spark. In Proceedings
of the International Conference on Learning Representations (ICLR’16).
[179] U. A. Muller and A. Gunzinger. 1994. Neural net simulation on parallel computers. In Proceedings of the IEEE Inter-
national Conference on Neural Networks, vol. 6. 3961–3966.
[180] R. Negrinho and G. Gordon. 2017. DeepArchitect: Automatically designing and training deep architectures.
arxiv:1704.08792.
[181] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. 2009. Robust stochastic approximation approach to stochastic
programming. SIAM J. Optim. 19, 4 (2009), 1574–1609.
[182] Y. Nesterov. 1983. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Math.
Doklady 269 (1983), 543–547.
[183] Netlib. 2019. Basic Linear Algebra Subprograms (BLAS). Retrieved from http://www.netlib.org/blas.
[184] J. Ngiam, Z. Chen, D. Chia, P. W. Koh, Q. V. Le, and A. Y. Ng. 2010. Tiled convolutional neural networks. In Advances
in Neural Information Processing Systems 23. MIT Press, 1279–1287.
[185] J. Nocedal and S. Wright. 2006. Numerical Optimization. Springer.
[186] C. Noel and S. Osindero. 2014. Dogwild!—Distributed hogwild for CPU & GPU. In Proceedings of the NIPS Workshop
on Distributed Machine Learning and Matrix Computations.
[187] E. Nurvitadhi et al. 2017. Can FPGAs beat GPUs in accelerating next-generation deep neural networks?. In Proceed-
ings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’17). 5–14.
[188] NVIDIA. 2017. Programming Tensor Cores in CUDA 9. Retrieved from https://devblogs.nvidia.com/programming-tensor-cores-cuda-9.
[189] NVIDIA. 2019. CUBLAS Library Documentation. Retrieved from http://docs.nvidia.com/cuda/cublas.
[190] C. Olah. 2015. Understanding LSTM Networks. Retrieved from http://colah.github.io/posts/2015-08-Understanding-LSTMs.
[191] K. Osawa et al. 2018. Second-order optimization method for large mini-batch: Training ResNet-50 on ImageNet in 35 epochs. arxiv:1811.12019.
[192] M. Ott et al. 2018. Scaling neural machine translation. In Proceedings of the 3rd Conference on Machine Translation:
Research Papers. 1–9.
[193] Y. Oyama et al. 2016. Predicting statistics of asynchronous SGD parameters for a large-scale distributed deep learning
system on GPU supercomputers. In Proceedings of the IEEE International Conference on Big Data (BigData’16). 66–75.
[194] Y. Oyama, T. Ben-Nun, T. Hoefler, and S. Matsuoka. 2018. Accelerating deep learning frameworks with micro-
batches. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER’18).
[195] PaddlePaddle. 2017. Elastic Deep Learning. Retrieved from https://github.com/PaddlePaddle/cloud/tree/develop/doc/edl.
[196] T. Paine et al. 2013. GPU asynchronous stochastic gradient descent to speed up neural network training.
arxiv:1312.6186.
[197] S. J. Pan and Q. Yang. 2010. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 10 (2010), 1345–1359.
[198] P. Patarasuk and X. Yuan. 2009. Bandwidth optimal all-reduce algorithms for clusters of workstations. J. Parallel
Distrib. Comput. 69, 2 (2009), 117–124.
[199] F. Petroski Such et al. 2017. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep
neural networks for reinforcement learning. arxiv:1712.06567.
[200] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. 2018. Efficient neural architecture search via parameter sharing.
arxiv:1802.03268.
[201] B. T. Polyak and A. B. Juditsky. 1992. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim.
30, 4 (1992), 838–855.
[202] D. Povey, X. Zhang, and S. Khudanpur. 2014. Parallel training of deep neural networks with natural gradient and
parameter averaging. arxiv:1410.7455.
[203] R. Puri et al. 2018. Large scale language modeling: Converging on 40GB of text in four hours. arxiv:1808.01371.
[204] H. Qi, E. R. Sparks, and A. Talwalkar. 2017. Paleo: A performance model for deep neural networks. In Proceedings of
the International Conference on Learning Representations (ICLR’17).
[205] N. Qian. 1999. On the momentum term in gradient descent learning algorithms. Neural Netw. 12, 1 (1999).
[206] R. Rabenseifner. 2004. Optimization of collective reduction operations. In Proceedings of the International Conference
on Computational Science. 1–9.
[207] A. Rahimi and B. Recht. 2017. Reflections on random kitchen sinks. NIPS Test of Time Award Talk. Retrieved from http://www.argmin.net/2017/12/05/kitchen-sinks.
[208] R. Raina, A. Madhavan, and A. Y. Ng. 2009. Large-scale deep unsupervised learning using graphics processors. In
Proceedings of the 26th Annual International Conference on Machine Learning (ICML’09). 873–880.
[209] S. Sundhar Ram, A. Nedic, and V. V. Veeravalli. 2009. Asynchronous gossip algorithms for stochastic optimization.
In Proceedings of the International Conference on Game Theory for Networks. 80–81.
[210] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. 2016. XNOR-Net: ImageNet classification using binary convo-
lutional neural networks. arxiv:1603.05279.
[211] E. Real, A. Aggarwal, Y. Huang, and Q. V Le. 2018. Regularized evolution for image classifier architecture search.
arxiv:1802.01548.
[212] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin. 2017. Large-scale evolution of
image classifiers. In Proceedings of the 34th International Conference on Machine Learning. 2902–2911.
[213] B. Recht, C. Re, S. Wright, and F. Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient
descent. In Advances in Neural Information Processing Systems 24. MIT Press, 693–701.
[214] C. Renggli, D. Alistarh, and T. Hoefler. 2018. SparCML: High-performance sparse communication for machine learn-
ing. arxiv:1802.08021.
[215] H. Robbins and S. Monro. 1951. A stochastic approximation method. Ann. Math. Stat. 22, 3 (1951), 400–407.
[216] T. Salimans and D. P. Kingma. 2016. Weight normalization: A simple reparameterization to accelerate training of
deep neural networks. In Advances in Neural Information Processing Systems 29. MIT Press, 901–909.
[217] F. Seide et al. 2014. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech
DNNs. In Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTER-
SPEECH’14).
[218] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 2014. On parallelizability of stochastic gradient descent for speech DNNs.
In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’14). 235–239.
[219] S. Shalev-Shwartz and S. Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge
University Press.
[220] C. J. Shallue et al. 2018. Measuring the effects of data parallelism on neural network training. arxiv:1811.03600.
[221] O. Shamir. 2016. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information
Processing Systems 29. MIT Press, 46–54.
[222] R. Shokri and V. Shmatikov. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Con-
ference on Computer and Communications Security (CCS’15). 1310–1321.
[223] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. 2017. Mastering the game of Go without human knowledge. Nature 550, 7676 (2017), 354.
[224] K. Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR’15).
[225] A. J. R. Simpson. 2015. Instant learning: Parallel deep neural networks and convolutional bootstrapping.
arxiv:1505.05972.
[226] S. L. Smith, P. Kindermans, and Q. V. Le. 2017. Don’t decay the learning rate, increase the batch size. arxiv:1711.00489.
[227] J. Snoek, H. Larochelle, and R. P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25. MIT Press, 2951–2959.
[228] E. Solomonik and T. Hoefler. 2015. Sparse tensor algebra as a parallel programming model. arxiv:1512.00066.
[229] M. Song, Y. Hu, H. Chen, and T. Li. 2017. Towards pervasive and user satisfactory CNN across GPU microarchi-
tectures. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’17).
1–12.
[230] H. V. Sorensen and C. S. Burrus. 1993. Efficient computation of the DFT with only a subset of input or output points.
IEEE Trans. Signal Process. 41, 3 (1993), 1184–1200.
[231] V. Strassen. 1969. Gaussian elimination is not optimal. Numer. Math. 13, 4 (1969), 354–356.
[232] N. Strom. 2015. Scalable distributed DNN training using commodity GPU cloud computing. In Proceedings of the
16th Annual Conference of the International Speech Communication Association.
[233] V. Sze, Y. H. Chen, T. J. Yang, and J. S. Emer. 2017. Efficient processing of deep neural networks: A tutorial and
survey. Proc. IEEE 105, 12 (2017), 2295–2329.
[234] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going
deeper with convolutions. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’15).
[235] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein. 2016. Training neural networks without gradients: A scalable ADMM approach. arxiv:1605.02026.
[236] L. Truong et al. 2016. Latte: A language, compiler, and runtime for elegant and efficient deep neural networks. In
Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’16).
209–223.
[237] J. Tsitsiklis, D. Bertsekas, and M. Athans. 1986. Distributed asynchronous deterministic and stochastic gradient
optimization algorithms. IEEE Trans. Automat. Control 31, 9 (1986), 803–812.
[238] B. Van Essen et al. 2015. LBANN: Livermore big artificial neural network HPC toolkit. In Proceedings of the Workshop
on Machine Learning in HPC Environments.
[239] V. Vanhoucke, A. Senior, and M. Z. Mao. 2011. Improving the speed of neural networks on CPUs. In Proceedings of
the Deep Learning and Unsupervised Feature Learning Workshop (NIPS’11).
[240] N. Vasilache et al. 2018. Tensor comprehensions: Framework-agnostic high-performance machine learning abstrac-
tions. arxiv:1802.04730.
[241] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. 2015. Fast convolutional nets with fbfft:
A GPU performance evaluation. In Proceedings of the International Conference on Learning Representations (ICLR’15).
[242] A. Vasudevan, A. Anderson, and D. Gregg. 2017. Parallel multi channel convolution using general matrix multipli-
cation. arxiv:1704.04428.
[243] P. Verbancsics and J. Harguess. 2015. Image classification using generative neuro evolution for deep learning. In
Proceedings of the IEEE Winter Conference on Applications of Computer Vision. 488–493.
[244] A. Viebke, S. Memeti, S. Pllana, and A. Abraham. 2019. CHAOS: A parallelization scheme for training convolutional neural networks on Intel Xeon Phi. J. Supercomput. 75, 1 (2019), 197–227.
[245] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. 2017. TernGrad: Ternary gradients to reduce communi-
cation in distributed deep learning. In Advances in Neural Information Processing Systems 30. MIT Press, 1509–1519.
[246] P. J. Werbos. 1990. Backpropagation through time: What it does and how to do it. Proc. IEEE 78, 10 (1990), 1550–1560.
[247] J. H. Wilkinson. 1994. Rounding Errors in Algebraic Processes. Dover Publications.
[248] R. J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach.
Learn. 8, 3 (1992), 229–256.
[249] S. Winograd. 1980. Arithmetic Complexity of Computations. Society for Industrial and Applied Mathematics.
[250] L. Xie and A. Yuille. 2017. Genetic CNN. In Proceedings of the IEEE International Conference on Computer Vision
(ICCV’17). 1388–1397.
[251] P. Xie, J. K. Kim, Y. Zhou, Q. Ho, A. Kumar, Y. Yu, and E. Xing. 2016. Lighter-communication distributed machine
learning via sufficient factor broadcasting. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelli-
gence (UAI’16). 795–804.
[252] E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. 2015. Petuum: A new
platform for distributed machine learning on big data. IEEE Trans. Big Data 1, 2 (2015), 49–67.
[253] K. Xu et al. 2015. Show, attend and tell: Neural image caption generation with visual attention. arxiv:1502.03044.
[254] O. Yadan, K. Adams, Y. Taigman, and M. Ranzato. 2013. Multi-GPU training of ConvNets. arxiv:1312.5853.
[255] F. Yan, O. Ruwase, Y. He, and T. Chilimbi. 2015. Performance modeling and scalability optimization of distributed
deep learning systems. In Proceedings of the 21st ACM International Conference on Knowledge Discovery and Data
Mining (KDD’15). 1355–1364.
[256] C. Ying et al. 2018. Image classification at supercomputer scale. arxiv:1811.06992.
[257] Y. You et al. 2019. Large-batch training for LSTM and beyond. arxiv:1901.08256.
[258] Y. You, A. Buluç, and J. Demmel. 2017. Scaling deep learning on GPU and Knights Landing clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’17). 9:1–9:12.
[259] Y. You, I. Gitman, and B. Ginsburg. 2017. Large batch training of convolutional networks. arxiv:1708.03888.
[260] Y. You, Z. Zhang, C. Hsieh, and J. Demmel. 2017. 100-epoch ImageNet training with AlexNet in 24 minutes.
arxiv:1709.05011.
[261] S. R. Young et al. 2017. Evolving deep networks using HPC. In Proceedings of the Workshop on Machine Learning in
HPC Environments (MLHPC’17). 7:1–7:7.
[262] F. Yu and V. Koltun. 2016. Multi-scale context aggregation by dilated convolutions. In Proceedings of the International
Conference on Learning Representations (ICLR’16).
[263] Y. Yu, J. Jiang, and X. Chi. 2016. Using supercomputer to speed up neural network training. In Proceedings of the
IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS’16). 942–947.
[264] H. Zhang et al. 2015. Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines.
arxiv:1512.06216.
[265] H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, Z. Hu, J. Wei, P. Xie, and E. P. Xing. 2017. Poseidon: An efficient
communication architecture for distributed deep learning on GPU clusters. In Proceedings of the USENIX Annual
Technical Conference (ATC’17). 181–193.
[266] J. Zhang, I. Mitliagkas, and C. Ré. 2017. YellowFin and the art of momentum tuning. arxiv:1706.03471.
[267] K. Zhang and X. W. Chen. 2014. Large-scale deep belief nets with MapReduce. IEEE Access 2 (2014), 395–403.
[268] S. Zhang et al. 2015. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems
28. MIT Press, 685–693.
[269] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen. 2016. Cambricon-X: An accelerator for
sparse neural networks. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO’16). 1–12.
[270] S. Zhang, C. Zhang, Z. You, R. Zheng, and B. Xu. 2013. Asynchronous stochastic gradient descent for DNN training.
In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 6660–6663.
[271] W. Zhang et al. 2017. GaDei: On scale-up training as a service for deep learning. In Proceedings of the IEEE Interna-
tional Conference on Data Mining (ICDM’17). 1195–1200.
[272] W. Zhang, S. Gupta, X. Lian, and J. Liu. 2016. Staleness-aware async-SGD for distributed deep learning. In Proceedings
of the 25th International Joint Conference on Artificial Intelligence (IJCAI’16). 2350–2356.
[273] X. Zhang, M. McKenna, J. P. Mesirov, and D. L. Waltz. 1990. An efficient implementation of the back-propagation
algorithm on the connection machine CM-2. In Advances in Neural Information Processing Systems 2. MIT Press,
801–809.
[274] H. Zhao and J. Canny. 2014. Kylix: A sparse allreduce for commodity clusters. In Proceedings of the 43rd International
Conference on Parallel Processing. 273–282.
[275] Z. Zhong, J. Yan, and C.-L. Liu. 2017. Practical network blocks design with Q-Learning. arxiv:1708.05552.
[276] S. Zhou et al. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients.
arxiv:1606.06160.
[277] M. A. Zinkevich, M. Weimer, A. Smola, and L. Li. 2010. Parallelized stochastic gradient descent. In Proceedings of the
23rd International Conference on Neural Information Processing Systems, vol. 2. 2595–2603.
[278] A. Zlateski, K. Lee, and H. S. Seung. 2016. ZNNi: Maximizing the inference throughput of 3D convolutional networks
on CPUs and GPUs. In Proceedings of the International Conference for High Performance Computing, Networking,
Storage and Analysis. 854–865.
[279] B. Zoph and Q. V. Le. 2017. Neural architecture search with reinforcement learning. In Proceedings of the International
Conference on Learning Representations (ICLR’17).
[280] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. 2017. Learning transferable architectures for scalable image recogni-
tion. arxiv:1707.07012.