
2017 46th International Conference on Parallel Processing

An efficient, distributed stochastic gradient descent algorithm for deep-learning applications
Guojing Cong, Onkar Bhardwaj, Minwei Feng
IBM TJ Watson Research Center
1101 Kitchawan Road, Yorktown Heights, NY, 10598
{gcong,obhardw,mfeng}@us.ibm.com

Abstract—Parallel and distributed processing is employed to accelerate training for many deep-learning applications with large models and inputs. As it reduces synchronization and communication overhead by tolerating stale gradient updates, asynchronous stochastic gradient descent (ASGD), derived from stochastic gradient descent (SGD), is widely used. Recent theoretical analyses show ASGD converges with linear asymptotic speedup over SGD.
Oftentimes glossed over in theoretical analysis are communication overhead and practical learning rates that are critical to the performance of ASGD. After analyzing the communication performance and convergence behavior of ASGD using the Downpour algorithm as an example, we demonstrate the challenges for ASGD to achieve good practical speedup over SGD. We propose a distributed, bulk-synchronous stochastic gradient descent algorithm that allows for sparse gradient aggregation from individual learners. The communication cost is amortized explicitly by a gradient aggregation interval, and global reductions are used instead of a parameter server for gradient aggregation. We prove its convergence and show that it has superior communication performance and convergence behavior over popular ASGD implementations such as Downpour and EAMSGD for deep-learning applications.

I. INTRODUCTION

To solve large-scale machine learning problems on modern computer systems, parallel and distributed processing is adopted for stochastic optimization methods (e.g., see [20], [5], [19], [28], [4], [29]). Efficient parallelization becomes critical to accelerating long-running machine learning applications. Asynchronous stochastic gradient descent (ASGD), derived from stochastic gradient descent (SGD), is popular in current deep-learning applications and studies (e.g., see [5], [20]). ASGD exploits data parallelism by employing multiple learners each computing gradient updates on their inputs. The gradients learned by each learner are aggregated typically through a central parameter server (e.g., see [11]). The parameters maintained at the central parameter server are updated asynchronously. Asynchronous gradient updates reduce synchronization and communication overhead that can otherwise become prohibitive on a cluster even with a modest number of learners.

The mathematical foundations of parallel methods including ASGD are established in recent studies. The convergence for SGD has been extensively studied (e.g., see [21], [3], [17], [22], [7]). The convergence rate is O(1/√S) for non-convex problems and O(1/S) for convex problems, with S being the number of samples processed. Regarding parallel variants of SGD, Dekel et al. [6] extend these results to the setting of synchronous SGD with p learners and show that it has a convergence rate of O(1/√(pK)) for non-convex objectives, with K being the number of samples processed by each learner. Note that this agrees with the prior analyses for SGD since in this case S = pK is the number of samples processed. Hogwild! is a lock-free implementation of ASGD, and Niu et al. [20] prove its convergence for strongly convex problems with theoretical linear speedup over SGD. Downpour is another ASGD implementation with resilience against machine failures [5]. Lian et al. [13] show that as long as the gradient staleness is bounded by the number of learners, ASGD converges for non-convex problems (with certain assumptions) with asymptotic linear speedup over SGD.

Unfortunately, theoretical convergence does not guarantee practical efficiency for ASGD. When deployed on a cluster, communication cost can dominate the execution time when the model is large and/or gradient updates are frequent. Although ASGD has the same asymptotic convergence rate as SGD when the staleness of gradient updates is bounded, the learning rate assumed for proving ASGD convergence can be too small for practical purposes. It is also difficult for an ASGD implementation to control the staleness in gradient updates as it is influenced by the relative processing speeds of learners and their positions in the communication network. Furthermore, the parameter server presents performance challenges on platforms with many GPUs. On such platforms, a single parameter server oftentimes does not serve the aggregation requests fast enough. A sharded server alleviates the aggregation speed problem but introduces inconsistencies for parameters distributed on multiple shards. Communication between the parameter server (typically on CPUs) and the learners (on GPUs) is likely to remain a bottleneck in future systems.

We propose a distributed, bulk-synchronous SGD algorithm that allows for sparse gradient aggregation to effectively minimize the communication overhead. We call this algorithm sparse aggregation SGD (SASGD). Instead of a parameter server, the learners in SASGD communicate the gradients learned with each other at regular intervals through global reductions. Rather than relying on asynchrony, which reduces communication overhead but has an adverse impact on practical convergence, we make the communication interval T a parameter in SASGD.

The communication time is amortized among the data samples processed within each interval and becomes negligible if T is large enough. Compared to asynchronous updates to a parameter server, global reduction minimizes the amount of data transported in the system. Also on current and emerging computer platforms that support high bandwidth direct communication among GPUs (e.g., GPU-direct [8]), global reduction does not involve CPUs and avoids multiple costly copies through the software layers. We demonstrate the convergence behavior of SASGD relative to T, and analyze its sample complexity (measured as the number of data samples required to reach certain training quality). We show that sample complexity of SASGD increases with T, and thus the practitioners need to explicitly balance the decrease of communication time and the increase of iterations through an appropriately chosen T. In contrast, neither communication time nor sample complexity can be effectively controlled in most ASGD implementations, even for some recent ASGD variants that also employ a gradient update interval to tolerate more staleness in gradient updates.

Our experiments with two deep-learning applications demonstrate the superior performance of SASGD over two popular ASGD implementations: Downpour [5] and EAMSGD [29]. In EAMSGD, global gradient aggregation among learners simulates an elastic force that links the parameters they compute with a center variable stored by the parameter server. On our target platform, when T is small, SASGD significantly reduces the communication time in comparison to Downpour and EAMSGD while achieving similar training and test accuracies. The training time reduction is up to 50%. When T is large, SASGD achieves much better training and test accuracies than Downpour and EAMSGD after the same amount of data samples are processed. For example, with 16 GPUs and T = 50, SASGD achieves up to 3% more accuracy for the first application and 50% more accuracy for the second application than Downpour and EAMSGD.

The rest of the paper is organized as follows. Section II investigates the practical efficiency of ASGD for two deep-learning applications. It shows the communication overhead is significant and argues that in practice only sublinear speedups may be observed. Section III introduces SASGD, and analyzes its convergence behavior. Section IV shows the performance of SASGD in comparison to Downpour and EAMSGD. Section V presents our conclusion and future work.

II. PRACTICAL EFFICIENCY OF ASGD

We evaluate the practical efficiency of ASGD using Downpour as an example for two deep-learning applications. Training in machine learning applications typically takes many passes of the input data before convergence. One pass of the input is called an epoch. To take advantage of the parallelism in the architecture (e.g., SIMD units or threads in GPUs), the training samples are typically processed in groups called minibatches.

Experiment setup: The two applications in our experiments both employ ASGD for training. They use different deep-learning network models, but the training algorithm is the same. So we consider them as ASGD with two different data sets. One is the CIFAR data set (CIFAR-10) [9]. The other is an in-house natural language processing data set from the finance industry, and we call it NLC-F. For the CIFAR-10 data set, the application trains a model to recognize the input images, and for the NLC-F data set, the application detects the sentiments expressed in the input sentences. CIFAR-10 contains 50,000 training images and 10,000 test images, each associated with 1 out of 10 possible labels, whereas NLC-F consists of 2500 input sentences and 311 output labels.

The multi-layer convolutional neural networks used for CIFAR-10 and NLC-F are shown in Tab. I and Tab. II, respectively.

Input: minibatch of M RGB images
↓
Convolution: (nfeat, nkern, height, width) = (3, 64, 5, 5)
Rectified Linear Unit (ReLU)
Max-Pooling: (height, width) = (2, 2)
Dropout: prob. = 0.5
↓
Convolution: (nfeat, nkern, height, width) = (64, 128, 3, 3)
Rectified Linear Unit (ReLU)
Max-Pooling: (height, width) = (2, 2)
Dropout: prob. = 0.5
↓
Convolution: (nfeat, nkern, height, width) = (128, 256, 3, 3)
Rectified Linear Unit (ReLU)
Max-Pooling: (height, width) = (2, 2)
Dropout: prob. = 0.5
↓
Convolution: (nfeat, nkern, height, width) = (256, 128, 2, 2)
Rectified Linear Unit (ReLU)
Max-Pooling: (height, width) = (2, 2)
Dropout: prob. = 0.5
↓
Fully connected layer: 128 × 10
↓
Cross-entropy error

TABLE I: Convolutional Neural Network for CIFAR10. For convolutional layers nfeat denotes the number of input feature maps and nkern denotes the number of kernels.

Input: minibatch of M sentences translated to their precomputed word2vec representation
↓
Fully connected layer: 100 × 200
Tanh layer
↓
Temporal Convolution: (nkern, window size) = (1000, 2)
Max-Pooling: (height, width) = (2, 1)
Tanh layer
↓
Fully connected layer: 1000 × 1000
Tanh layer
↓
Fully connected layer: 1000 × 311
↓
Cross-entropy error

TABLE II: Neural Network for NLC-F. For the temporal convolutional layer nkern denotes the number of kernels.
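As a concrete illustration of the Table I architecture, the following is a minimal sketch in PyTorch-style Python. It is our reconstruction, not the authors' code: the paper's implementation is written in Torch/Lua [25], and the padding values below are assumptions, since Table I does not list them.

import torch.nn as nn

# Sketch of the Table I network. The padding values (2, 1, 1, 0) are assumptions,
# chosen so that a 3 x 32 x 32 CIFAR-10 image flattens to 128 features before the
# 128 x 10 fully connected layer listed in Table I.
cifar10_net = nn.Sequential(
    nn.Conv2d(3, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2), nn.Dropout(0.5),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), nn.Dropout(0.5),
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), nn.Dropout(0.5),
    nn.Conv2d(256, 128, 2), nn.ReLU(), nn.MaxPool2d(2), nn.Dropout(0.5),
    nn.Flatten(),
    nn.Linear(128, 10),       # logits; cross-entropy loss is applied to these
)
print(sum(p.numel() for p in cifar10_net.parameters()))  # roughly 0.5 million

With these assumed padding values the parameter count comes out to roughly 0.5 million, consistent with the size of the CIFAR-10 network quoted in the text below.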

The network used for CIFAR-10 is a fairly standard convolutional network design [1] consisting of a series of convolutional layers interspersed by max-pooling layers. We choose this network instead of other networks with deeper structures such as AlexNet [10] or GoogLeNet [24] to limit the amount of training time. The approaches discussed in this paper work for these networks also. The outputs of convolutional layers in this network are filtered with rectified linear units before max-pooling is applied. Additionally, Dropout is applied as regularization [23]. The last layer is a fully connected layer. The network for NLC-F contains a temporal convolution layer [2] and a few fully connected layers. Tanh units are used for non-linearities instead of rectified linear units. The number of parameters is about 0.5 million in the CIFAR-10 network and about 2 million in the NLC-F network. Both networks use cross-entropy error between the input labels and the predicted labels as the error measure. In both applications, gradient descent is implemented with Torch [25] and the communication is implemented using CUDA-aware openMPI 2.0 through the mpiT library [15].

We run our experiments on an IBM Power8 host with an OSS high-density compute accelerator [18]. The OSS accelerator contains 8 NVIDIA Tesla K80 GPUs connected by PCIe switches forming a binary tree. The host contains two Power8 chips, each with 12 cores running at 3.1 GHz. In the ASGD implementations, to fully utilize both the host and the accelerators, the learners are run on the GPUs, while the (sharded) parameter server is run on the host Power8 CPUs.

A. ASGD communication overhead

We experiment with 1, 2, 4, and 8 learners (each learner on one GPU) for CIFAR-10 and NLC-F. All learners in Downpour have similar behavior. The amount of computation and communication in each learner remains constant between epochs. Fig. 1 shows the breakdown of epoch time into computation and communication for one learner. Minibatch sizes 64 and 1 are used for training on CIFAR-10 and NLC-F, respectively (minibatch size 1 is used because it gives the best test accuracy for NLC-F). Of the two bars in each group, the one on the left is for NLC-F and the one on the right is for CIFAR-10.

Fig. 1: Breakdown of epoch time (percentage of computation vs. communication for NLC-F (N) and CIFAR-10 (C) with 1, 2, 4, and 8 learners)

In Fig. 1, communication dominates for NLC-F, accounting for more than 60% of the epoch time. For CIFAR-10, with 1 learner the communication time is around 20%, and it increases to about 30% with 8 learners. In Downpour, from the perspective of a learner, communication includes sending its computed gradients to the parameter server, waiting for the server to aggregate the gradients, and receiving parameters from the parameter server. Communication time becomes significant as the number of learners increases and/or the network size increases.

B. ASGD convergence relative to SGD

We analyze the convergence behavior of ASGD in comparison to SGD. We make standard assumptions about the surface properties of the objective function.

• Unbiased gradient: We assume that the partial gradient G(x, z) of f(·) is an unbiased estimator of the true gradient, where x is any parameter vector and z is a mini-batch of M randomly selected samples. In other words, we assume E(G(x, z)) = ∇f(x), where the expectation is with respect to randomly selected mini-batches.
• Bounded variance: We assume that the variance of the partial gradient with respect to randomly selected mini-batches is bounded, i.e., E(‖G(x, z) − ∇f(x)‖²) ≤ σ².
• Lipschitzian gradient: We assume that there exists a constant L such that ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for any two parameter vectors x, y.

The notations used in the convergence behavior analysis and analyses in later sections are introduced in Tab. III. Lian et al. [13] show that with a small enough constant learning rate γ and ignoring communication overhead, linear asymptotic speedup may be achieved for ASGD over SGD after a sufficient number of iterations if the staleness of the gradient is bounded by the number of learners. As it is not measured by comparing wall-clock times, we refer to the speedup as convergence speedup. We analyze the convergence speedup of ASGD and SGD for a finite number of iterations, and show that it can be sublinear for practical purposes.

Let R̄_K denote the average expected gradient norm after the first K updates of ASGD. Then from Theorem 1 in [13] the convergence rate guarantee for ASGD expressed as average gradient norm is

  R̄_K ≤ 2Df/(MKγ) + σ²Lγ + 2σ²L²Mpγ²    (1)
  s.t.  LMγ + 2L²M²p²γ² ≤ 1    (2)

The terms independent of the number of updates K in Equation 1 indicate that with a constant learning rate, there is a limit on how close the algorithm can reach to the optimum without lowering the learning rate.

Analyzing the average gradient norm, we get the following theorem about the gap between the guarantee for one learner and multiple learners.

f(·) := A non-convex objective function
x := Parameter vector
x₁ := Initial parameter vector
x∗ := A local optimum towards which the algorithm converges
Df := f(x₁) − f(x∗)
L := The constant corresponding to the Lipschitzian gradient
M := Minibatch size
T := No. of local updates after which a global aggregation is done
p := Number of learners
γ := Learning rate
σ² := Upper bound on E(‖G(x, zᵢ) − ∇f(x)‖²), where G(x, zᵢ) is the stochastic gradient of f(·) with respect to the ith sample zᵢ
S := The number of total samples processed

TABLE III: Notation

Theorem 1. Let p > 1 be the number of learners and let α = √(Kσ²/(MLDf)) ≤ p. Then the optimal ASGD convergence rate guarantee for 1 learner and p learners can differ by a factor of approximately p/α.

Proof. We have γ = c·√(Df/(MKLσ²)) = c/(αML) from the definition of α. Substituting this in Equation 1, we get

  R̄_K ≤ (2/c + c + 2pc²/α)·√(DfLσ²/(MK))    (3)

From the definition of α, we have K = α²MLDf/σ². Using it in the above equation, we get

  R̄_K ≤ (2/c + c + 2pc²/α)·(1/α)·(σ²/M)    (4)

Similarly, given γ = c·√(Df/(MKLσ²)) = c/(αML), the condition in Equation 2 can also be expressed as:

  c/α + 2p²c²/α² ≤ 1  ⇒  2p²c² + αc − α² ≤ 0

Since the learning rate (and hence c) is always positive, the above equation gives us

  0 ≤ c ≤ (α/(4p²))·(−1 + √(1 + 8p²))

Thus finding the optimal learning rate (within the regime of Equations 1 and 2) is equivalent to solving the following

  minimize  (2/c + c + 2pc²/α)·(1/α)·(σ²/M)    (5)
  s.t.  0 ≤ c ≤ (α/(4p²))·(−1 + √(1 + 8p²))    (6)

Now, by means of solving the above optimization, we will investigate how much the convergence guarantee can differ as the number of learners increases. In particular, we will look at the difference in the guarantee for 1 learner and pmax learners where pmax ≥ 16. Taking the derivative of Equation 5 with respect to c and setting it to 0, we get the following

  4pc³ + αc² − 2α = 0    (7)

Let c∗_1 and c∗_pmax denote the solutions to the above equation for 1 and pmax learners, respectively.

  For p = 1:  R̄_K ≈ 2√2 · σ²/(αM)    (8)

For p = pmax and 16 ≤ α ≤ pmax, the cubic term dominates in Equation 7, and c∗_pmax = α/(√2·pmax). Thus for 16 ≤ α ≤ pmax, Equation 4 becomes

  For p = pmax:  R̄_K ≈ (2√2·pmax/α) · σ²/(αM)    (9)

Thus comparing Equations 8 and 9, we see that the ASGD convergence guarantee for p = 1 and p = pmax learners can differ by a factor of pmax/α for 16 ≤ α ≤ pmax.

Theorem 1 predicts that the convergence guarantees after processing K minibatches can differ significantly between 1 learner and p > 1 learners. For example, when p = 32, α is roughly 16 for 50 epochs of updates with CIFAR-10. The convergence guarantee between SGD and ASGD with p = 32 can differ by 2.

We run Downpour with p = 1, 2, 8, and 16 learners to evaluate its convergence speedup over SGD. For p = 16 we run 2 learners per GPU using CUDA multi-process service [16]. We use the learning rate of γ = 0.1, and compare how test accuracy increases with respect to the number of epochs. The results are shown in Fig. 2.

Fig. 2: ASGD convergence for CIFAR10 with γ = 0.1 (test accuracy vs. #epochs for p = 1, 2, 8, 16)

In Fig. 2, as p increases, with the same number of epochs, the accuracy gap between Downpour and SGD (Downpour with p = 1) increases. Since the accuracy converges slower and slower as p increases, linear convergence speedup is not observed. That is, to reach the same accuracy achieved by SGD, Downpour with p > 1 learners needs to process more data samples.

Linear convergence speedup is observed, however, if we use the learning rate √(Df/(MKLσ²)) derived from the convergence analysis of ASGD by Lian et al. [13]. We estimate the Lipschitz constant L and an upper bound on gradient variance σ² for CIFAR-10. We bound Df as f(x₁) and use MK = 500,000.
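To make the proof tangible, the short Python sketch below (our illustration, not part of the paper) numerically minimizes Equation 5 subject to Equation 6 and compares the resulting guarantees for p = 1 and p = pmax; with α = 16 and pmax = 32 it reproduces the roughly 2× gap predicted by Theorem 1.

import numpy as np

def best_guarantee(p, alpha):
    # Objective of Equation 5 up to the common factor (1/alpha) * (sigma^2 / M),
    # which cancels when comparing two learner counts.
    h = lambda c: 2.0 / c + c + 2.0 * p * c**2 / alpha
    # Admissible range for c from Equation 6.
    c_max = alpha / (4.0 * p**2) * (-1.0 + np.sqrt(1.0 + 8.0 * p**2))
    # Stationary points from Equation 7: 4p c^3 + alpha c^2 - 2 alpha = 0.
    roots = np.roots([4.0 * p, alpha, 0.0, -2.0 * alpha])
    cands = [r.real for r in roots if abs(r.imag) < 1e-9 and 0.0 < r.real <= c_max]
    cands.append(c_max)            # the optimum may sit on the constraint boundary
    return min(h(c) for c in cands)

alpha, p_max = 16.0, 32
gap = best_guarantee(p_max, alpha) / best_guarantee(1, alpha)
print(f"guarantee gap ~ {gap:.2f}; Theorem 1 predicts ~ p_max/alpha = {p_max / alpha:.2f}")

For these values the printed gap is about 2.15, in line with the predicted pmax/α = 2.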

Fig. 3: ASGD convergence for CIFAR10 with γ = 0.005 (test accuracy vs. #epochs for p = 1, 2, 8, 16)

Using our estimated quantities, √(Df/(MKLσ²)) is approximately 0.005, much smaller than 0.1. With γ = 0.005, we experiment with p = 1, 2, 8, and 16 learners. The results are shown in Fig. 3. In Fig. 3, indeed linear convergence speedup is observed for p > 1 (implied by the curves for different p overlapping almost perfectly). However, γ = 0.005 is clearly sub-optimal for CIFAR-10 as it achieves only about 57% accuracy compared to 80% accuracy achieved with γ = 0.1.

III. SPARSE-AGGREGATION SGD

To reduce the communication overhead of distributed training, some recent studies propose using large minibatch sizes (e.g., see [12]). As large batch size has been found to be inefficient for gradient descent with respect to training accuracy [26], additional measures such as reducing the variance among the gradients are needed [14], [27]. Some implementations tolerate even more stale gradient updates. In fact, Downpour itself has a version that processes multiple minibatches before sending gradients asynchronously to the parameter server. However, its practical behavior turns out to be erratic. Zhang et al. [29] propose an ASGD variant called EAMSGD that enforces some constraints on the parameters when the gradient update interval increases. EAMSGD is shown to have superior performance over Downpour.

While providing some degree of fault tolerance, the parameter server in Downpour and EAMSGD also poses performance challenges. A sharded server is used in these implementations for fast gradient aggregation. Although more scalable, a sharded server suffers increased stochasticity and inconsistency in gradient updates. Additionally, the amount of data transported with a parameter server increases linearly with the number of learners. To balance communication with computation for high performance, tremendous bandwidth is needed between the learners and the parameter server.

We propose a bulk-synchronous, distributed SGD algorithm, SASGD, with explicit gradient aggregation intervals that allow for sparse gradient updates to amortize communication cost. The learners themselves aggregate the learned gradients through collective operations without a parameter server.

Alg. 1 gives a formal description of SASGD. In Alg. 1, T is the aggregation interval, p is the number of learners, id is the learner ID (0 ≤ id < p), and K is the total number of global gradient aggregations. Note there are two different learning rates, γ and γp. Learning rate γ determines the step size for local updates within an aggregation interval, while γp is the step size for global aggregation. Each learner accumulates the gradients learned within an interval into gs. Global aggregation aggregates gs from all learners through allreduce. The parameter x is initialized by learner 0, and then broadcast to all learners; x′ denotes each learner's local working copy of the parameters.

Algorithm 1 SASGD (T, p, id, γ, γp, K)
  gs ← 0, i ← 0
  if id = 0 then
    initialize parameters x
  end if
  x ← broadcast(x, p, id)
  x′ ← x
  while i < K do
    j ← 0
    while j < T do
      compute gradient g from a random minibatch
      x′ ← x′ − γ · g, gs ← gs + g
      j ← j + 1
    end while
    gs ← allreduce(gs, p, id)
    x ← x − γp · gs
    x′ ← x, gs ← 0
    i ← i + 1
  end while

Alg. 1 is quite simple, and bears resemblance to existing ASGD and synchronous SGD implementations. One major difference between SASGD and ASGD is the explicit constraint on the staleness of the gradients. In SASGD, the staleness of the gradients is bounded explicitly by T, while in most ASGD implementations, in addition to T, the staleness is also impacted by the relative processing speed of the learners and the position of the learners in the communication network. The amount of data transported per gradient aggregation is O(m log p) in SASGD (with tree-reduction allreduce), where m is the model size. In comparison, the amount of data transported in ASGD is O(mp). Moreover, on current and emerging systems with many GPUs, the communication in SASGD can benefit from the large bandwidth among the GPUs, while the communication in ASGD with parameter servers needs to cross a narrower channel to the host.

Alg. 1 simulates model averaging with γp = 1/p. Model averaging is a heuristic used in some synchronous SGD implementations that computes an average of the parameters from all learners. Some implementations average the parameters at the end of learning once (e.g., [30]), and others average the parameters after each minibatch is processed (e.g., [12]). Neither approach works in our study. The former results in very poor training and test accuracies, and the latter incurs high communication overhead. Moreover, convergence and convergence rate of model averaging relative to T have not yet been shown.
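To make the structure of Alg. 1 concrete, here is a minimal sketch of the loop using mpi4py collectives over a flat numpy parameter vector. It is only an illustration under stated assumptions: grad_minibatch, the initialization, and the hyperparameters are placeholders, and the paper's actual implementation runs Torch learners on GPUs and communicates through the mpiT library [15].

import numpy as np
from mpi4py import MPI

def sasgd(grad_minibatch, dim, T, K, gamma, gamma_p):
    comm = MPI.COMM_WORLD
    p, learner_id = comm.Get_size(), comm.Get_rank()   # p learners, this learner's id
    rng = np.random.default_rng(learner_id)            # per-learner minibatch sampling

    x = np.zeros(dim)
    if learner_id == 0:                                # learner 0 initializes the parameters
        x = 0.01 * np.random.default_rng(0).standard_normal(dim)
    comm.Bcast(x, root=0)                              # broadcast x to all learners
    x_local = x.copy()                                 # working copy (x' in Alg. 1)

    for _ in range(K):                                 # K global gradient aggregations
        gs = np.zeros(dim)                             # accumulated local gradients
        for _ in range(T):                             # T local updates per interval
            g = grad_minibatch(x_local, rng)           # gradient on a random minibatch
            x_local -= gamma * g                       # local step with rate gamma
            gs += g
        comm.Allreduce(MPI.IN_PLACE, gs, op=MPI.SUM)   # global reduction, no parameter server
        x -= gamma_p * gs                              # global step with rate gamma_p
        x_local = x.copy()                             # reset the working copy to the new x
    return x

With T = 1 the loop performs one allreduce per minibatch, which is the traditional synchronous SGD case discussed in Section IV-A; larger T amortizes that allreduce over T local updates, which is the point of the algorithm.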

A. Convergence

Convergence guarantees can be proven for SASGD by bounding the gradient norm average after S = MTKp samples are processed (recall that M is the minibatch size). The following theorem gives the convergence guarantee for SASGD.

Theorem 2. After K global allreduce updates the average gradient norm satisfies the following upper bound

  (1/K)·Σ_{k=1}^K E(‖∇f(x_k)‖²) ≤ 2Df/(Sγp) + 2L²σ²γpγMT + Lσ²γp

when γ and γp satisfy γp·LMTp + 2L²M²T²·γpγ ≤ 1.

Due to limited space, we give a brief outline for the proof of Theorem 2. The key to the proof is to bound the difference in the objective function values between the global updates in terms of accumulated gradients. This bound is used in several prior ASGD convergence proofs (e.g., proof of Theorem 1 in [13]). Let x_{k+1} denote the parameter vector after the kth global update. With the Lipschitzian gradient property, we have (see Table III for notations)

  f(x_{k+1}) − f(x_k) ≤ ⟨∇f(x_k), x_{k+1} − x_k⟩ + (L/2)·‖x_{k+1} − x_k‖²

Note that x_{k+1} − x_k is the same as −γp·gs, where we use the gs that is used to compute the kth allreduce in Algorithm 1. Thus we have

  f(x_{k+1}) − f(x_k) ≤ −γp·⟨∇f(x_k), gs⟩ + (Lγp²/2)·‖gs‖²    (10)

Since the gs used for the kth allreduce in Algorithm 1 is the sum of all the gradients computed by individual learners between the (k−1)th and kth allreduce, the above equation expresses the difference in the objective function values between the global allreduce updates in terms of accumulated gradients.

When we choose gs to be comparable to MTp·∇f(x_k), for finite σ², ‖gs‖² cannot be much larger than ‖∇f(x_k)‖². In expectation, ⟨∇f(x_k), gs⟩ can be expressed in terms of ‖∇f(x_k)‖² and an additive bound in terms of σ². Summing E(f(x_{k+1}) − f(x_k)) after K allreduce updates, we have

  −Df ≤ E(f(x∗) − f(x₁)) ≤ Σ_{k∈[K]} E(f(x_{k+1}) − f(x_k))
      ≤ −(γpMTp/2)·Σ_{k=1}^K E(‖∇f(x_k)‖²) + KL²M²T²pγpγ²σ² + KLMTpγp²σ²/2

Rearranging the terms in the above inequality, we get the desired bound in Theorem 2.

The following corollary establishes the asymptotic convergence guarantee of SASGD.

Corollary 3. If γ = γp = √(2Df/(Sσ²)) and K ≥ (4MLDf/σ²)·(max{p,T}+1)²/(pT), then after K global updates, we have

  (1/K)·Σ_{k=1}^K E(‖∇f(x_k)‖²) ≤ 4·√(DfLσ²/S)

In other words, the asymptotic convergence guarantee of O(1/√S) can also be extended to SASGD. Although the asymptotic convergence does not change regardless of the global update frequency determined by T, we must note that the number of global updates K needed in order to achieve the asymptotic convergence rate can substantially increase with the increase in T (see the bound on K in Corollary 3).

B. Sample complexity relative to T

In practice, the number of samples processed in an application oftentimes does not reach the asymptotic convergence regime, that is, K < (4MLDf/σ²)·(max{p,T}+1)²/(pT). In this case, the following theorem says that increasing T always leads to slower convergence in terms of epochs (or the number of processed samples).

Theorem 4. For SASGD with a constant number of learners p and constant minibatch size M, given γp = γ, the number of samples that needs to be processed in order to achieve the same convergence guarantee increases as the interval T between global updates increases.

Proof outline. We show that, keeping the same number of samples processed (i.e., the same S), the upper bound on the convergence guarantee from Theorem 2 becomes worse as T increases. This implies that in order to reach the same convergence guarantee, a higher number of samples needs to be processed with larger T. S is kept constant by adjusting K.
Let γp = γ. The range of γ permissible by the constraint in Theorem 2 becomes smaller as T increases. The rest of the proof combines the observations that the minimum attained by the convergence guarantee in Theorem 2 must become worse if the range of γ decreases and T increases.

For a given p, increasing T reduces the epoch time but increases the number of epochs needed to reach the target convergence guarantee. Thus there is an optimal T for a specific application in terms of the wall-clock time needed to reach convergence. For a given T, increasing the number of learners will also increase the sample complexity to reach convergence. The analysis for SASGD convergence relative to SGD is similar to the analysis for ASGD in Section II-B. The advantage of SASGD over ASGD implementations from the convergence perspective is that the staleness in gradient updates is explicitly bounded. The impact on performance is discussed in Section IV-C.

IV. PERFORMANCE

We evaluate the performance of SASGD with the NLC-F and CIFAR-10 data sets. We study the impact of T on epoch time and convergence. We also present a detailed comparison of the convergence behavior between SASGD and Downpour and EAMSGD.

A. Impact of T on epoch time

After every T minibatches, a global reduction of the gradients occurs in SASGD among the learners. Communication time is amortized among T minibatches. When T = 1, SASGD becomes the traditional synchronous SGD.

We experiment with T = 1 and T = 50 for NLC-F and CIFAR-10. The results with 1, 2, 4, and 8 learners for CIFAR-10 and NLC-F are shown in Fig. 4 and Fig. 5, respectively. Increasing T from 1 to 50 reduces the epoch time for both data sets. The reduction is more significant for NLC-F than CIFAR-10. With 8 learners, SASGD with T = 50 is 1.3 times faster than with T = 1 for CIFAR-10, and is 9.7 times faster for NLC-F. In both figures, the horizontal line shows the sequential time. The speedups with 8 learners are 4.45 and 5.35 for CIFAR-10 and NLC-F, respectively.

Fig. 4: Impact of T on epoch time for CIFAR-10. In log-log plot. The horizontal line shows the sequential time

Fig. 5: Impact of T on epoch time for NLC-F. In log-log plot. The horizontal line shows the sequential time

Both Downpour and EAMSGD also have a gradient update interval T. As the amount of data transported grows linearly with the number of learners, performance can take a hit when the number of learners increases. Fig. 6 shows the epoch time for Downpour, EAMSGD, and SASGD with CIFAR-10 and NLC-F. With T = 1, gradient update is frequent, and communication takes a significant portion of the epoch time. SASGD is much faster than Downpour and EAMSGD due to its lower communication complexity. With T = 50, communication time in all three approaches is amortized for multiple minibatches, and computation time dominates. All three approaches have similar epoch times.

Fig. 6: Epoch time with T = 50 for Downpour, EAMSGD, and SASGD with 8 learners. "C" is for CIFAR-10, and "N" is for NLC-F

B. Impact of T on convergence

As analyzed in Section III-B, increasing T is likely to increase the number of samples needed to reach convergence. For the same amount of samples processed, larger T will likely result in lower accuracy. We experiment with T = 1, 5, 25, and 50 for p = 2, 4, 8, 16 learners. For p = 16, we run two learners per GPU. Since we are concerned with sample complexity and not wall-clock time, we can run multiple learners per GPU. In the rest of the paper, whenever 16 learners are used, we run two of them per GPU. The test accuracies with CIFAR-10 are shown in Fig. 7(a), 7(b), 7(c), and 7(d).

In each figure, as T increases, the test accuracy achieved at the end of 100 epochs degrades slightly. Thus for a given number of learners, more epochs are needed to reach a target accuracy with larger T. This observation agrees with our analysis in Section III-B.

The degradation in accuracy is negligible when p is small. For example, with p = 2, after 100 epochs, the gap in test accuracy for T = 50 versus T = 1 is 1.32%. As p increases, the gap becomes larger. With p = 16, after 100 epochs, the gap is 3.21%.

The test accuracies with NLC-F are shown in Fig. 8(a), 8(b), 8(c), and 8(d). In comparison to CIFAR-10, for a given p, the degradation in accuracy after 200 epochs when T increases is not as pronounced. The degradation is the most obvious for p = 8 in Fig. 8(c). For p = 16, the best accuracy is actually achieved with T = 50.

Similar impact of T is observed for training accuracies.
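Taken together, the two subsections above describe the two sides of the T trade-off. A back-of-the-envelope model (our illustration, with assumed communication fractions) shows how the epoch-time side behaves: if a fraction f of the T = 1 epoch time is communication, raising the interval to T shrinks the epoch time by roughly a factor of 1/((1 − f) + f/T).

def epoch_time_reduction(f, T):
    # f: assumed fraction of the T = 1 epoch time spent on communication.
    # Communication is amortized over T minibatches; computation is unchanged.
    return 1.0 / ((1.0 - f) + f / T)

# Assumed fractions chosen only to show how different communication shares map
# to the reported reductions with 8 learners (roughly 1.3x for CIFAR-10 and
# 9.7x for NLC-F).
for name, f in (("CIFAR-10-like", 0.25), ("NLC-F-like", 0.915)):
    print(name, round(epoch_time_reduction(f, 50), 1))

A compute-bound workload gains little from a large T, while a communication-bound one gains nearly the full factor, consistent with the reductions reported above for Fig. 4 and Fig. 5.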

Fig. 7: Test accuracy with various T values for CIFAR-10, for (a) p = 2, (b) p = 4, (c) p = 8, and (d) p = 16. For readability, the accuracies for every 10 epochs are shown

Fig. 8: Test accuracy with various T values for NLC-F, for (a) p = 2, (b) p = 4, (c) p = 8, and (d) p = 16. For readability, the accuracies for every 20 epochs are shown

C. Comparison with ASGD

SASGD is much faster than Downpour and EAMSGD in terms of epoch time when T is small. T needs to be large for all three approaches to amortize the communication overhead for a large number of learners. We show SASGD has better convergence behavior for large T. That is, for the same amount of data samples processed, SASGD achieves better accuracy than Downpour and EAMSGD.

In our experiment with CIFAR-10, we run each training algorithm for 100 epochs. That is, all learners collectively make 100 passes of all input data. We use T = 50. Recall in Section IV-A, we have shown that with T = 50, all three approaches have similar epoch time.

The training accuracies for Downpour, EAMSGD, and SASGD are shown in Fig. 9(a), 9(b), 9(c), and 9(d) for p = 2, 4, 8, and 16, respectively. Due to the synchronous nature of SASGD, the accuracy numbers for all learners after each epoch are similar. Both EAMSGD and Downpour are asynchronous, and before they terminate, the accuracy numbers for different learners see a wider range of fluctuation. For consistency, we collect accuracy numbers from one learner after it has made a complete pass of the input data. Thus from the perspective of the total number of samples processed by all learners, Downpour and EAMSGD report accuracy numbers after every p epochs, and SASGD reports accuracy numbers after each epoch. In the plots Downpour and EAMSGD have 1/p as many data points as SASGD.

Comparing the plots in these figures, it is clear that Downpour performs poorly in terms of achieved accuracy with p = 8, 16. This behavior is also reported in the EAMSGD study [29]. EAMSGD performs much better than Downpour, and SASGD in turn performs consistently better than EAMSGD. As p increases, the gap in accuracy between SASGD and EAMSGD increases, suggesting that SASGD tolerates stale updates better than EAMSGD. With p = 16, the accuracy gap between SASGD and EAMSGD after 100 epochs is 2.95%. This gap is significant for CIFAR-10.

In Fig. 9(a), all three approaches converge after 100 epochs with 2 learners. Before convergence (i.e., before 70 epochs), both SASGD and EAMSGD improve training accuracy much faster than Downpour, and SASGD performs slightly better than EAMSGD. In Fig. 9(b), 9(c), and 9(d), none of the three approaches fully converges after 100 epochs. This demonstrates the parallelization overhead for convergence. SASGD is still the best performing algorithm, while Downpour starts to show some erratic behavior with p = 4 and degenerates to almost random guess for p = 8, 16.

Fig. 9(e), 9(f), 9(g), and 9(h) show test accuracies for Downpour, EAMSGD, and SASGD for p = 2, 4, 8, and 16, respectively. The test accuracy curves track the training accuracy curves in Fig. 9(a), 9(b), 9(c), and 9(d) for p = 2, 4, 8, and 16, respectively. Downpour shows the worst performance. With 4 learners Downpour starts to behave erratically, and with 16 learners, the test accuracy degrades to random guess. SASGD consistently ranks as the top performer among the three approaches. The gap between SASGD and EAMSGD increases as the number of learners increases, again suggesting that SASGD is a better choice for more learners.

We also compare the performance of the three approaches for NLC-F with T = 50. With each approach the learners collectively make 200 passes of the input data.

Fig. 9: Training and test accuracies for CIFAR-10 for p = 2, 4, 8, and 16. The top row (a)–(d) is for training, and the bottom row (e)–(h) is for test

Figures 10(a), 10(b), 10(c), and 10(d) show training accuracies for Downpour, EAMSGD, and SASGD for p = 2, 4, 8, and 16, respectively.

SASGD consistently reaches close to 100% training accuracy at the end of 200 epochs. In comparison, Downpour and EAMSGD also reach high training accuracies with 2 and 4 learners, but with 8 learners, the training accuracy for Downpour and EAMSGD starts to degrade. With 16 learners, neither Downpour nor EAMSGD achieves accuracy better than the random guess.

Figures 10(e), 10(f), 10(g), and 10(h) show test accuracies for Downpour, EAMSGD, and SASGD. The test accuracy curves largely track the training accuracy curves except for p = 8. The maximum test accuracies achieved by the three approaches are around 60%. This is also the accuracy achieved by the sequential implementation after 200 epochs. For p = 2 and 4, all three approaches have similar accuracy curves. For p = 8 and 16, SASGD consistently achieves much better accuracies than EAMSGD and Downpour. With 8 learners, the accuracy drops to between 30% and 40% for Downpour and EAMSGD, while the accuracy for SASGD remains close to 60%. With 16 learners, both Downpour and EAMSGD exhibit erratic behavior and achieve accuracies not much better than random guess, while SASGD still achieves close to 60% accuracy.

Fig. 10: Training and test accuracies for NLC-F for p = 2, 4, 8, and 16. The top row (a)–(d) is for training, and the bottom row (e)–(h) is for test

V. CONCLUSION

Parallelization impacts not only the epoch time but also the convergence rate of stochastic gradient descent. Efficient parallelization is critical in reducing the execution time while maintaining the accuracy for the trained model. Although in theory ASGD converges with asymptotic linear speedup over SGD, we show in practice ASGD faces challenges of significant communication overhead and high sample complexity to reach convergence. Stale gradients resulting from asynchronous updates increase the stochasticity in optimization, and can substantially degrade training and test accuracies with even a moderate number of learners. On current and emerging computer platforms, communication bandwidth between the (sharded) parameter server on CPUs and learners on GPUs is a limiting factor for scaling when the neural network models are large and/or gradient aggregation is frequent.

We propose a bulk-synchronous, distributed SGD algorithm, SASGD, that allows sparse gradient aggregation. Instead of reducing communication overhead through asynchronous updates, we adopt an explicit gradient aggregation interval T that amortizes the communication cost. The aggregation is done through collective allreduce that does not rely on a parameter server. In our experiments, from the perspective of epoch time, SASGD achieves significant speedups over SGD with T = 50 while maintaining good convergence; SASGD is much faster than the ASGD implementations when T is small due to its low communication complexity.

We also compare the convergence performance of SASGD with Downpour and EAMSGD. With a small number of learners (e.g., 1 or 2 learners), all three approaches have similar convergence behavior for different T. When the number of learners reaches 8, 16, and beyond, the stochasticity from asynchronous updates in Downpour and EAMSGD substantially degrades accuracy, while SASGD still shows stable convergence behavior. As the number of GPUs in future systems is likely to increase, we expect SASGD to perform better than ASGD implementations for machine learning applications.

REFERENCES

[1] CIFAR10 model. https://github.com/eladhoffer/ConvNet-torch/blob/master/Models/Model.lua, accessed January 11, 2016.
[2] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, et al. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10):1533–1545, 2014.
[3] L. Bottou. Online learning and stochastic approximations, 1998.


[4] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 571–582, 2014.
[5] J. Dean, G. Corrado, R. Monga, et al. Large scale distributed deep networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1223–1231. Curran Associates, Inc., 2012.
[6] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
[7] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[8] NVIDIA GPUDirect, https://developer.nvidia.com/gpudirect.
[9] A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, 2009.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[11] M. Li, D. G. Andersen, J. W. Park, et al. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, Broomfield, CO, October 2014. USENIX Association.
[12] M. Li, T. Zhang, Y. Chen, and A. J. Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 661–670, New York, NY, USA, 2014. ACM.
[13] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.
[14] Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pages 3059–3067, 2014.
[15] mpiT – MPI for Torch, https://github.com/sixin-zh/mpiT.
[16] Multi-Process Service, https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf.
[17] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[18] One Stop Systems High Density Compute Accelerator, http://www.onestopsystems.com/blog-post/one-stop-systems-shows-its-16-gpu-monster-machine-gtc-2015.
[19] T. Paine, H. Jin, J. Yang, et al. GPU asynchronous stochastic gradient descent to speed up neural network training. arXiv preprint arXiv:1312.6186, 2013.
[20] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
[21] H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer, 1985.
[22] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML (1), pages 71–79, 2013.
[23] N. Srivastava, G. E. Hinton, A. Krizhevsky, et al. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[24] C. Szegedy, W. Liu, Y. Jia, et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[25] Torch – A scientific computing framework for LuaJIT, http://torch.ch.
[26] D. R. Wilson and T. R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10):1429–1451, December 2003.
[27] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[28] R. Zhang and J. T. Kwok. Asynchronous distributed ADMM for consensus optimization. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 1701–1709, 2014.
[29] S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems 28, pages 685–693, 2015.
[30] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.
