
Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent

Xiangru Lian†, Ce Zhang∗, Huan Zhang+, Cho-Jui Hsieh+, Wei Zhang#, and Ji Liu†\
†University of Rochester, ∗ETH Zurich, +University of California, Davis, #IBM T. J. Watson Research Center, \Tencent AI lab
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract
Most distributed machine learning systems nowadays, including TensorFlow and
CNTK, are built in a centralized fashion. One bottleneck of centralized algorithms
lies in the high communication cost on the central node. Motivated by this, we ask:
can decentralized algorithms be faster than their centralized counterparts?
Although decentralized PSGD (D-PSGD) algorithms have been studied by the
control community, existing analysis and theory do not show any advantage over
centralized PSGD (C-PSGD) algorithms, simply assuming the application scenario
where only the decentralized network is available. In this paper, we study a D-
PSGD algorithm and provide the first theoretical analysis that indicates a regime
in which decentralized algorithms might outperform centralized algorithms for
distributed stochastic gradient descent. This is because D-PSGD has a total
computational complexity comparable to that of C-PSGD but requires much less
communication cost on the busiest node. We further conduct an empirical study to validate our
theoretical analysis across multiple frameworks (CNTK and Torch), different
network configurations, and computation platforms up to 112 GPUs. On network
configurations with low bandwidth or high latency, D-PSGD can be up to one order
of magnitude faster than its well-optimized centralized counterparts.

1 Introduction
In the context of distributed machine learning, decentralized algorithms have long been treated as a
compromise — when the underlying network topology does not allow centralized communication,
one has to resort to decentralized communication, while, understandably, paying for the “cost of being
decentralized”. In fact, most distributed machine learning systems nowadays, including TensorFlow
and CNTK, are built in a centralized fashion. But can decentralized algorithms be faster than their
centralized counterparts? In this paper, we provide the first theoretical analysis, verified by empirical
experiments, for a positive answer to this question.
We consider solving the following stochastic optimization problem
$$\min_{x\in\mathbb{R}^N} f(x) := \mathbb{E}_{\xi\sim\mathcal{D}}\, F(x;\xi), \qquad (1)$$

where D is a predefined distribution and ξ is a random variable usually referring to a data sample in
machine learning. This formulation summarizes many popular machine learning models including
deep learning [LeCun et al., 2015], linear regression, and logistic regression.
Parallel stochastic gradient descent (PSGD) methods are leading algorithms in solving large-scale
machine learning problems such as deep learning [Dean et al., 2012, Li et al., 2014], matrix completion

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Figure 1: An illustration of different network topologies: (a) centralized topology (with a parameter server); (b) decentralized topology.

Algorithm | communication complexity on the busiest node | computational complexity
C-PSGD (mini-batch SGD) | O(n) | O(n/ε + 1/ε²)
D-PSGD | O(Deg(network)) | O(n/ε + 1/ε²)
Table 1: Comparison of C-PSGD and D-PSGD. The unit of the communication cost is the number
of stochastic gradients or optimization variables. n is the number of nodes. The computational
complexity is the number of stochastic gradient evaluations we need to get an ε-approximation solution,
which is defined in (3).

[Recht et al., 2011, Zhuang et al., 2013] and SVM. Existing PSGD algorithms are mostly designed for
centralized network topology, for example, the parameter server topology [Li et al., 2014], where there
is a central node connected with multiple nodes as shown in Figure 1(a). The central node aggregates
the stochastic gradients computed from all other nodes and updates the model parameter, for example,
the weights of a neural network. The potential bottleneck of the centralized network topology lies in
the communication traffic jam on the central node, because all nodes need to communicate with it
concurrently and iteratively. The performance will be significantly degraded when the network bandwidth
is low.^1 This motivates us to study algorithms for decentralized topologies, where all nodes can only
communicate with their neighbors and there is no central node, as shown in Figure 1(b).
Although decentralized algorithms have been studied as consensus optimization in the control community
and used for preserving data privacy [Ram et al., 2009a, Yan et al., 2013, Yuan et al., 2016], mostly
under the assumption that only the decentralized network is available, it is still an open question
whether decentralized methods can have advantages over centralized algorithms when both types of
communication patterns are feasible: for example, on a supercomputer with thousands of nodes, should
we use decentralized or centralized communication? Existing theory and analysis either do not make
such a comparison [Bianchi et al., 2013, Ram et al., 2009a, Srivastava and Nedic, 2011, Sundhar Ram
et al., 2010] or implicitly indicate that decentralized algorithms are much worse than centralized
algorithms in terms of computational complexity and total communication complexity [Aybat et al.,
2015, Lan et al., 2017, Ram et al., 2010, Zhang and Kwok, 2014]. This paper
gives a positive result for decentralized algorithms by studying a decentralized PSGD (D-PSGD)
algorithm on the connected decentralized network. Our theory indicates that D-PSGD admits similar
total computational complexity but requires much less communication for the busiest node. Table 1
shows a quick comparison between C-PSGD and D-PSGD with respect to the computation and
communication complexity. Our contributions are:
• We theoretically justify the potential advantage of decentralized algorithms over centralized
algorithms. Instead of treating decentralized algorithms as a compromise one has to make, we are the
first to conduct a theoretical analysis that identifies cases in which decentralized algorithms can be
faster than their centralized counterparts.
• We theoretically analyze the scalability behavior of decentralized SGD when more nodes are
used. Surprisingly, we show that, when more nodes are available, decentralized algorithms can bring
speedup, asymptotically linearly, with respect to computational complexity. To the best of our knowledge,
this is the first speedup result for decentralized algorithms.
• We conduct an extensive empirical study to validate our theoretical analysis of D-PSGD and different
C-PSGD variants (e.g., plain SGD, EASGD [Zhang et al., 2015]). We observe computational complexity
similar to what our theory indicates; on networks with low bandwidth or high latency, D-PSGD can be
up to 10× faster than C-PSGD. Our results hold across multiple frameworks (CNTK and Torch), different
network configurations, and computation platforms with up to 112 GPUs. This indicates a promising
future direction in pushing the research horizon of machine learning systems from a purely centralized
topology to a more decentralized fashion.
^1 There has been research on how to address this problem by having multiple parameter servers
communicating with efficient MPI AllReduce primitives. As we will see in the experiments, these methods,
on the other hand, might suffer when the network latency is high.
Definitions and notations Throughout this paper, we use the following notation and definitions:
• ‖ · ‖ denotes the vector ℓ2 norm or the matrix spectral norm, depending on the argument.
• ‖ · ‖F denotes the matrix Frobenius norm.
• ∇f(·) denotes the gradient of a function f.
• 1n denotes the column vector in Rⁿ with 1 for all elements.
• f∗ denotes the optimal value of (1).
• λi(·) denotes the i-th largest eigenvalue of a matrix.

2 Related work
In the following, we use K and n to refer to the number of iterations and the number of nodes.
Stochastic Gradient Descent (SGD) SGD is a powerful approach for solving large-scale machine
learning problems. The well-known convergence rate of SGD is O(1/√K) for convex problems
and O(1/K) for strongly convex problems [Moulines and Bach, 2011, Nemirovski et al., 2009]. SGD
is closely related to online learning algorithms, for example, Crammer et al. [2006], Shalev-Shwartz
[2011], Yang et al. [2014]. For SGD on nonconvex optimization, an ergodic convergence rate of
O(1/√K) is proved in Ghadimi and Lan [2013].
Centralized parallel SGD For CENTRALIZED PARALLEL SGD (C-PSGD) algorithms, the most
popular implementation is based on the parameter server, which is essentially mini-batch SGD
admitting a convergence rate of O(1/√(Kn)) [Agarwal and Duchi, 2011, Dekel et al., 2012, Lian et al.,
2015], where in each iteration n stochastic gradients are evaluated. In this implementation there is a
parameter server communicating with all nodes. The linear speedup is implied by the convergence
rate automatically. More implementation details for C-PSGD can be found in Chen et al. [2016],
Dean et al. [2012], Li et al. [2014], Zinkevich et al. [2010]. The asynchronous version of centralized
parallel SGD is proved to guarantee the linear speedup on all kinds of objectives (including convex,
strongly convex, and nonconvex objectives) if the staleness of the stochastic gradient is bounded
[Agarwal and Duchi, 2011, Feyzmahdavian et al., 2015, Lian et al., 2015, 2016, Recht et al., 2011,
Zhang et al., 2016b,c].
Decentralized parallel stochastic algorithms Unlike centralized algorithms, decentralized algorithms
do not specify any central node; each node maintains its own local model and can only communicate
with its neighbors. Decentralized algorithms can usually be applied to any connected computational
network. Lan et al. [2017] proposed a decentralized stochastic algorithm with computational
complexity O(n/ε²) for general convex objectives and O(n/ε) for strongly convex objectives. Sirb and
Ye [2016] proposed an asynchronous decentralized stochastic algorithm ensuring an O(n/ε²) complexity
for convex objectives. An algorithm similar to our D-PSGD, in both synchronous and asynchronous
fashions, was studied in Ram et al. [2009a, 2010], Srivastava and Nedic [2011], Sundhar Ram et al.
[2010]. The difference is that in their algorithms each node can only perform either communication
or computation at a time, but not both simultaneously. Sundhar Ram et al. [2010]
proposed a stochastic decentralized optimization algorithm for constrained convex optimization
and the algorithm can be used for non-differentiable objectives by using subgradients. Please also
refer to Srivastava and Nedic [2011] for the subgradient variant. The analysis in Ram et al. [2009a,
2010], Srivastava and Nedic [2011], Sundhar Ram et al. [2010] requires the gradients of each term
of the objective to be bounded by a constant. Bianchi et al. [2013] proposed a similar decentralized
stochastic algorithm and provided a convergence rate for the consensus of the local models when
the local models are bounded. The convergence to a solution was also provided by using the central
limit theorem, but the rate is unclear. HogWild++ [Zhang et al., 2016a] uses decentralized model
parameters for parallel asynchronous SGD on multi-socket systems and shows that this algorithm
empirically outperforms some centralized algorithms. Yet the convergence or the convergence rate is
unclear. The common issue among the works above is that the speedup is unclear; that is, we do
not know whether decentralized algorithms (involving multiple nodes) can improve the efficiency over
using only a single node.

Other decentralized algorithms In other areas including control, privacy and wireless sensing
network, decentralized algorithms are usually studied for solving the consensus problem [Aysal et al.,
2009, Boyd et al., 2005, Carli et al., 2010, Fagnani and Zampieri, 2008, Olfati-Saber et al., 2007,
Schenato and Gamba, 2007]. Lu et al. [2010] prove that a gossip algorithm converges to the optimal
solution for convex optimization. Mokhtari and Ribeiro [2016] analyzed decentralized SAG and
SAGA algorithms for minimizing finite-sum strongly convex objectives, but they are not shown
to admit any speedup. The decentralized gradient descent method for convex and strongly convex
problems was analyzed in Yuan et al. [2016]. Nedic and Ozdaglar [2009], Ram et al. [2009b] studied
its subgradient variants. However, this type of algorithm can only converge to a ball around the optimal
solution, whose diameter depends on the steplength. This issue was fixed by Shi et al. [2015] using a
modified algorithm, namely EXTRA, which is guaranteed to converge to the optimal solution. Wu et al.
[2016] analyzed an asynchronous version of decentralized gradient descent with modifications similar
to Shi et al. [2015] and showed that the algorithm converges to a solution when K → ∞. Aybat
et al. [2015], Shi et al., Zhang and Kwok [2014] analyzed decentralized ADMM algorithms, which are
not shown to have speedup. From all of these reviewed papers, it is still unclear whether decentralized
algorithms can have any advantage over their centralized counterparts.
3 Decentralized parallel stochastic gradient descent (D-PSGD)
Algorithm 1 Decentralized Parallel Stochastic Gradient Descent (D-PSGD) on the i-th node
Require: initial point x_{0,i} = x_0, step length γ, weight matrix W, and number of iterations K
1: for k = 0, 1, 2, . . . , K − 1 do
2:   Randomly sample ξ_{k,i} from the local data of the i-th node
3:   Compute the local stochastic gradient ∇F_i(x_{k,i}; ξ_{k,i}), ∀i, on all nodes^a
4:   Compute the neighborhood weighted average by fetching optimization variables from neighbors: x_{k+1/2,i} = Σ_{j=1}^n W_{ij} x_{k,j}^b
5:   Update the local optimization variable x_{k+1,i} ← x_{k+1/2,i} − γ∇F_i(x_{k,i}; ξ_{k,i})^c
6: end for
7: Output: (1/n) Σ_{i=1}^n x_{K,i}
^a Note that the stochastic gradient computed in Line 3 can be replaced with a mini-batch of stochastic gradients, which will not hurt our theoretical results.
^b Note that Line 3 and Line 4 can be run in parallel.
^c Note that Line 4 and Line 5 can be exchanged. That is, we first add the local stochastic gradient to the local optimization variable, and then average the local optimization variable with neighbors. This does not hurt our theoretical analysis. When Line 4 is logically before Line 5, Line 3 and Line 4 can be run in parallel. That is to say, if the communication time used by Line 4 is smaller than the computation time used by Line 3, the communication time can be completely hidden (it is overlapped by the computation time).

This section introduces the D-PSGD algorithm. We represent the decentralized communication
topology with a weighted undirected graph (V, W). V denotes the set of n computational
nodes: V := {1, 2, · · · , n}. W ∈ R^{n×n} is a symmetric doubly stochastic matrix, which means (i)
W_{ij} ∈ [0, 1], ∀i, j, (ii) W_{ij} = W_{ji} for all i, j, and (iii) Σ_j W_{ij} = 1 for all i. We use W_{ij} to encode
how much node j can affect node i, while W_{ij} = 0 means nodes i and j are disconnected.
To design distributed algorithms on a decentralized network, we first distribute the data onto all nodes
such that the original objective defined in (1) can be rewritten as
$$\min_{x\in\mathbb{R}^N} f(x) = \frac{1}{n}\sum_{i=1}^n \underbrace{\mathbb{E}_{\xi\sim\mathcal{D}_i} F_i(x;\xi)}_{=:f_i(x)}. \qquad (2)$$
There are two simple ways to achieve (2), both of which can be captured by our theoretical analysis
and they both imply Fi (·; ·) = F (·; ·), ∀i.
Strategy-1 All distributions Di ’s are the same as D, that is, all nodes can access a shared database;
Strategy-2 The n nodes partition all the data in the database and appropriately define a distribution for
sampling local data; for example, if D is the uniform distribution over all data, D_i can be defined
to be the uniform distribution over the local data (a minimal sketch of this strategy is given right after this list).
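As a concrete illustration of Strategy-2, the following minimal Python sketch (our own illustration; the dataset size and node count are arbitrary) partitions sample indices across the n nodes and defines each node's local sampling distribution D_i as uniform over its shard.

```python
import numpy as np

# Strategy-2 sketch (illustrative): partition a dataset of M samples across n nodes,
# then let node i sample uniformly from its local shard, i.e. xi ~ D_i.
M, n = 10000, 8
rng = np.random.default_rng(0)
perm = rng.permutation(M)
shards = np.array_split(perm, n)          # shards[i] holds the sample indices owned by node i

def sample_local_index(i):
    """Draw one sample index from node i's local distribution D_i (uniform over its shard)."""
    return rng.choice(shards[i])
```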
The D-PSGD algorithm is a synchronous parallel algorithm. All nodes are usually synchronized by a
clock. Each node maintains its own local variable and runs the protocol in Algorithm 1 concurrently,
which includes three key steps at iterate k:

• Each node computes the stochastic gradient^2 ∇F_i(x_{k,i}; ξ_{k,i}) using the current local variable x_{k,i},
where k is the iteration number and i is the node index;
• When the synchronization barrier is met, each node exchanges its local variable with its neighbors
and averages the local variables it receives with its own local variable;
• Each node updates its local variable using the average and the local stochastic gradient.
To view the D-PSGD algorithm from a global perspective, at iteration k we define the concatenation of all
local variables, random samples, and stochastic gradients by the matrix X_k ∈ R^{N×n}, the vector ξ_k ∈ R^n, and
∂F(X_k; ξ_k), respectively:
$$X_k := [\,x_{k,1}\ \cdots\ x_{k,n}\,] \in \mathbb{R}^{N\times n}, \qquad \xi_k := [\,\xi_{k,1}\ \cdots\ \xi_{k,n}\,]^\top \in \mathbb{R}^n,$$
$$\partial F(X_k; \xi_k) := [\,\nabla F_1(x_{k,1};\xi_{k,1})\ \ \nabla F_2(x_{k,2};\xi_{k,2})\ \cdots\ \nabla F_n(x_{k,n};\xi_{k,n})\,] \in \mathbb{R}^{N\times n}.$$
Then the k-th iteration of Algorithm 1 can be viewed as the following update:
$$X_{k+1} \leftarrow X_k W - \gamma\,\partial F(X_k; \xi_k).$$
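To make this matrix-form update concrete, here is a minimal NumPy sketch (our own illustration, not the code used in the experiments) that simulates X_{k+1} ← X_k W − γ∂F(X_k; ξ_k) on a toy least-squares problem with a ring weight matrix; the problem, sizes, and step size are arbitrary assumptions.

```python
import numpy as np

# Toy problem: each node i holds a local least-squares objective built from its own data block.
rng = np.random.default_rng(0)
n, N, K, gamma = 8, 10, 1000, 0.01          # nodes, dimension, iterations, step size (illustrative)
A = rng.normal(size=(n, 200, N))            # A[i], b[i]: the local data of node i
x_true = rng.normal(size=N)
b = A @ x_true + 0.01 * rng.normal(size=(n, 200))

# Ring weight matrix W: symmetric, doubly stochastic, each node mixes with its two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3

X = np.zeros((N, n))                        # column i is the local variable x_{k,i}; X_0 = 0
for k in range(K):
    G = np.empty((N, n))                    # column i is the stochastic gradient of node i
    for i in range(n):
        j = rng.integers(200)               # sample xi_{k,i} from the local data of node i
        a = A[i, j]
        G[:, i] = 2.0 * (a @ X[:, i] - b[i, j]) * a
    X = X @ W - gamma * G                   # the D-PSGD update X_{k+1} = X_k W - gamma * dF(X_k; xi_k)

x_avg = X.mean(axis=1)                      # output: average of all local variables
print("distance to x_true:", np.linalg.norm(x_avg - x_true))
```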

We say the algorithm gives an ε-approximation solution if
$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left\|\nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)\right\|^2 \le \epsilon. \qquad (3)$$

4 Convergence rate analysis


This section provides the analysis of the convergence rate of the D-PSGD algorithm. Our analysis
will show that the convergence rate of D-PSGD w.r.t. iterations is similar to that of C-PSGD (or mini-
batch SGD) [Agarwal and Duchi, 2011, Dekel et al., 2012, Lian et al., 2015], but D-PSGD avoids the
communication traffic jam on the parameter server.
To show the convergence results, we first define
∂f (Xk ) := [ ∇f1 (xk,1 ) ∇f2 (xk,2 ) · · · ∇fn (xk,n ) ] ∈ RN ×n ,
where functions fi (·)’s are defined in (2).
Assumption 1. Throughout this paper, we make the following commonly used assumptions:
1. Lipschitzian gradient: All functions f_i(·) have L-Lipschitzian gradients.
2. Spectral gap: Given the symmetric doubly stochastic matrix W, we define ρ :=
(max{|λ_2(W)|, |λ_n(W)|})². We assume ρ < 1.
3. Bounded variance: Assume the variance of the stochastic gradient, E_{i∼U([n])} E_{ξ∼D_i} ‖∇F_i(x; ξ) −
∇f(x)‖², is bounded for any x, with i uniformly sampled from {1, . . . , n} and ξ sampled from the distribution
D_i. This implies that there exist constants σ and ς such that
E_{ξ∼D_i} ‖∇F_i(x; ξ) − ∇f_i(x)‖² ≤ σ², ∀i, ∀x,  and  E_{i∼U([n])} ‖∇f_i(x) − ∇f(x)‖² ≤ ς², ∀x.
Note that if all nodes can access the shared database, then ς = 0.
4. Start from 0: We assume X_0 = 0. This assumption simplifies the proof w.l.o.g.

Let
$$D_1 := \frac{1}{2} - \frac{9\gamma^2 L^2 n}{(1-\sqrt{\rho})^2 D_2}, \qquad D_2 := 1 - \frac{18\gamma^2 n L^2}{(1-\sqrt{\rho})^2}.$$
Under Assumption 1, we have the following convergence result for Algorithm 1.
Theorem 1 (Convergence of Algorithm 1). Under Assumption 1, we have the following convergence
rate for Algorithm 1:
$$\frac{1}{K}\left(\sum_{k=0}^{K-1}\frac{1-\gamma L}{2}\,\mathbb{E}\left\|\frac{\partial f(X_k)\mathbf{1}_n}{n}\right\|^2 + D_1\sum_{k=0}^{K-1}\mathbb{E}\left\|\nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)\right\|^2\right) \le \frac{f(0)-f^*}{\gamma K} + \frac{\gamma L}{2n}\sigma^2 + \frac{\gamma^2 L^2 n\sigma^2}{(1-\rho)D_2} + \frac{9\gamma^2 L^2 n\varsigma^2}{(1-\sqrt{\rho})^2 D_2}.$$
^2 It can be easily extended to mini-batch stochastic gradient descent.

Noting that X_k 1_n/n = (1/n)Σ_{i=1}^n x_{k,i}, this theorem characterizes the convergence of the average of all
local optimization variables x_{k,i}. To take a closer look at this result, we appropriately choose the step
length in Theorem 1 to obtain the following result:
Corollary 2. Under the same assumptions as in Theorem 1, if we set γ = 1/(2L + σ√(K/n)), then for Algorithm 1
we have the following convergence rate:
$$\frac{\sum_{k=0}^{K-1}\mathbb{E}\left\|\nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)\right\|^2}{K} \le \frac{8(f(0)-f^*)L}{K} + \frac{(8f(0)-8f^*+4L)\sigma}{\sqrt{Kn}}, \qquad (4)$$
if the total number of iterations K is sufficiently large, in particular,
$$K \ge \frac{4L^4 n^5}{\sigma^6 (f(0)-f^*+L)^2}\left(\frac{\sigma^2}{1-\sqrt{\rho}} + \frac{9\varsigma^2}{(1-\sqrt{\rho})^2}\right)^2, \quad\text{and} \qquad (5)$$
$$K \ge \frac{72 L^2 n^2}{\sigma^2\,(1-\sqrt{\rho})^2}. \qquad (6)$$
This result basically suggests that, if K is large enough, the convergence rate for D-PSGD is O(1/K + 1/√(nK)).
We highlight two key observations from this result:
Linear speedup When K is large enough, the 1/K term will be dominated by the 1/√(Kn) term, which
leads to a 1/√(nK) convergence rate.^3 It indicates that the total computational complexity to achieve
an ε-approximation solution (3) is bounded by O(1/ε²). Since the total number of nodes does not
affect the total complexity, a single node only shares a computational complexity of O(1/(nε²)). Thus
linear speedup can be achieved by D-PSGD asymptotically w.r.t. the computational complexity.
D-PSGD can be better than C-PSGD Note that this rate is the same as that of C-PSGD (or mini-batch
SGD with mini-batch size n) [Agarwal and Duchi, 2011, Dekel et al., 2012, Lian et al., 2015].
The advantage of D-PSGD over C-PSGD is to avoid the communication traffic jam. At each
iteration, the maximal communication cost for every single node is O(the degree of the network)
for D-PSGD, in contrast with O(n) for C-PSGD. The degree of the network could be much smaller
than O(n), e.g., it could be O(1) in the special case of a ring.
The key difference from most existing analyses for decentralized algorithms lies in that we do not
use any boundedness assumption on the domain, the gradient, or the stochastic gradient. Such boundedness
assumptions can significantly simplify the proof but lose some subtle structures of the problem.
In Theorem 1 and Corollary 2, we choose a constant steplength for simplicity. Using the diminishing
steplength O(√(n/k)) can achieve a similar convergence rate by following the proof procedure in this
paper. For convex objectives, D-PSGD could be proven to admit the convergence rate O(1/√(nK)),
which is consistent with the nonconvex case. For strongly convex objectives, the convergence rate
for D-PSGD could be improved to O(1/(nK)), which is consistent with the rate for C-PSGD.
The linear speedup indicated by Corollary 2 requires the total number of iterations K to be sufficiently
large. The following special example gives a concrete bound on K for the ring network topology.
Theorem 3 (Ring network). Choose the steplength γ as in Corollary 2 and consider the
ring network topology with the corresponding W of the form
$$W = \begin{pmatrix}
1/3 & 1/3 &     &     & 1/3\\
1/3 & 1/3 & 1/3 &     &    \\
    & 1/3 & 1/3 & \ddots &    \\
    &     & \ddots & \ddots & 1/3\\
1/3 &     &     & 1/3 & 1/3
\end{pmatrix} \in \mathbb{R}^{n\times n}.$$
Under Assumption 1, Algorithm 1 achieves the same convergence rate as in (4), which indicates that a linear
speedup can be achieved, if the number of involved nodes is bounded by
• n = O(K^{1/9}), if strategy-1 is applied for distributing data (ς = 0);
• n = O(K^{1/13}), if strategy-2 is applied for distributing data (ς > 0),
where the capital "O" swallows σ, ς, L, and f(0) − f∗.
^3 The complexity of computing a single stochastic gradient counts as 1.

Figure 2: Comparison between D-PSGD and two centralized implementations (7 and 10 GPUs). Panels: (a) ResNet-20, 7 GPUs, 10Mbps; (b) ResNet-20, 7 GPUs, 5ms; (c) impact of network bandwidth; (d) impact of network latency.
Figure 3: (a) Convergence Rate; (b) D-PSGD Speedup; (c) D-PSGD Communication Patterns. Panels: (a) ResNet-20, 112 GPUs; (b) ResNet-56, 7 GPUs; (c) ResNet-20, 7 GPUs; (d) D-PSGD comm. pattern.
This result considers a special decentralized network topology, the ring network, where each node can
only exchange information with its two neighbors. The linear speedup can be achieved for up to K^{1/9}
and K^{1/13} nodes in the two scenarios, respectively, and these two upper bounds can potentially be
improved. To the best of our knowledge, this is the first work to show a speedup for decentralized algorithms.
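As a quick numerical companion to Theorem 3 and Assumption 1-2, the sketch below (our own illustration, with n = 16 chosen arbitrarily) builds the ring weight matrix W above and computes ρ = (max{|λ_2(W)|, |λ_n(W)|})², confirming that ρ < 1 for the ring.

```python
import numpy as np

def ring_weight_matrix(n):
    """Ring topology: each node mixes equally with itself and its two neighbors."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3
    return W

n = 16
W = ring_weight_matrix(n)

# Sanity checks: W is symmetric and doubly stochastic.
assert np.allclose(W, W.T)
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)

# rho = (max{|lambda_2(W)|, |lambda_n(W)|})^2, with eigenvalues sorted in decreasing order.
eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]
rho = max(abs(eigvals[1]), abs(eigvals[-1])) ** 2
print("rho =", rho)    # rho < 1, so the spectral-gap assumption holds for this ring
```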
In this section, we mainly investigate the convergence rate for the average of all local variables
{x_{k,i}}_{i=1}^n. Actually, one can also obtain a similar rate for each individual x_{k,i}, since all nodes achieve
consensus quickly; in particular, the running average of E‖(1/n)Σ_{i'=1}^n x_{k,i'} − x_{k,i}‖² converges to
0 at an O(1/K) rate, where the "O" swallows n, ρ, σ, ς, L, and f(0) − f∗. See Theorem 6 for more
details in the Supplemental Material.

5 Experiments
We validate our theory with experiments that compare D-PSGD with other centralized implementa-
tions. We run experiments on clusters up to 112 GPUs and show that, on some network configurations,
D-PSGD can outperform well-optimized centralized implementations by an order of magnitude.
5.1 Experiment setting
Datasets and models We evaluate D-PSGD on two machine learning tasks, namely (1) image classification
and (2) natural language processing (NLP). For image classification we train ResNet [He
et al., 2015] with different numbers of layers on CIFAR-10 [Krizhevsky, 2009]; for natural language
processing, we train on both a proprietary and a public dataset using a proprietary CNN model that we
obtained from our industry partner [Feng et al., 2016, Lin et al., 2017, Zhang et al., 2017].
Implementations and setups We implement D-PSGD on two different frameworks, namely Mi-
crosoft CNTK and Torch. We evaluate four SGD implementations:
1. CNTK. We compare with the standard CNTK implementation of synchronous SGD. The imple-
mentation is based on MPI’s AllReduce primitive.
2. Centralized. We implemented the standard parameter server-based synchronous SGD using MPI.
One node will serve as the parameter server in our implementation.
3. Decentralized. We implemented our D-PSGD algorithm using MPI within CNTK.
4. EASGD. We compare with the standard EASGD implementation of Torch.
All three implementations are compiled with gcc 7.1, cuDNN 5.0, OpenMPI 2.1.1. We fork from
CNTK after commit 57d7b9d and enable distributed minibatch reading for all of our experiments.
During training, we keep the local batch size of each node the same as the reference configurations
provided by CNTK. We tune learning rate for each SGD variant and report the best configuration.
Machines/Clusters We conduct experiments on four different machines/clusters:
1. 7GPUs. A single local machine with 8 GPUs, each of which is a Nvidia TITAN Xp.

2. 10GPUs. 10 p2.xlarge EC2 instances, each of which has one Nvidia K80 GPU.
3. 16GPUs. 16 local machines, each of which has two Xeon E5-2680 8-core processors and a
NVIDIA K20 GPU. Machines are connected by Gigabit Ethernet in this case.
4. 112GPUs. 4 p2.16xlarge and 6 p2.8xlarge EC2 instances. Each p2.16xlarge (resp.
p2.8xlarge) instance has 16 (resp. 8) Nvidia K80 GPUs.
In all of our experiments, we use each GPU as a node.
5.2 Results on CNTK
End-to-end performance We first validate that, under certain network configurations, D-PSGD
converges faster, in wall-clock time, to a solution of the same quality as centralized SGD.
Figure 2(a, b) and Figure 3(a) show the results of training ResNet-20 on 7 GPUs. We see that D-PSGD
converges faster than both centralized SGD competitors. This is because when the network is
slow, both centralized SGD competitors take more time per epoch due to communication overheads.
Figure 3(a, b) illustrates the convergence with respect to the number of epochs, and D-PSGD shows a
similar convergence rate to centralized SGD even with 112 nodes.
Speedup The end-to-end speedup of D-PSGD over centralized SGD highly depends on the under-
lying network. We use the tc command to manually vary the network bandwidth and latency and
compare the wall-clock time that all three SGD implementations need to finish one epoch.
Figure 2(c, d) shows the result. We see that, when the network has high bandwidth and low latency,
not surprisingly, all three SGD implementations have similar speed. This is because in this case,
the communication is never the system bottleneck. However, when the bandwidth becomes smaller
(Figure 2(c)) or the latency becomes higher (Figure 2(d)), both centralized SGD implementations
slow down significantly. In some cases, D-PSGD can be even one order of magnitude faster than
its centralized competitors. Compared with Centralized (implemented with a parameter server), D-PSGD
has more balanced communication patterns between nodes and thus outperforms Centralized
in low-bandwidth networks; compared with CNTK (implemented with AllReduce), D-PSGD needs
fewer communication rounds between nodes and thus outperforms CNTK in high-latency
networks. Figure 3(c) illustrates the communication between nodes for one run of D-PSGD.
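To illustrate the balanced, neighbor-only communication pattern of D-PSGD, the following per-node sketch uses mpi4py on a logical ring; it is an illustrative reimplementation under our own assumptions (placeholder gradient, model size, uniform 1/3 weights), not the CNTK/MPI implementation evaluated above.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, n = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % n, (rank + 1) % n     # ring neighbors
N, gamma, K = 1000, 0.01, 100                    # model size, step size, iterations (illustrative)

x = np.zeros(N)                                  # local model x_{k,i}
rng = np.random.default_rng(rank)

def local_stochastic_gradient(x):
    # Placeholder for the local stochastic gradient; a real implementation samples local data.
    return x + 0.01 * rng.normal(size=N)

for k in range(K):
    g = local_stochastic_gradient(x)
    # Exchange local models with the two ring neighbors (Algorithm 1, Line 4).
    from_left = np.empty(N)
    from_right = np.empty(N)
    comm.Sendrecv(sendbuf=x, dest=right, recvbuf=from_left, source=left)
    comm.Sendrecv(sendbuf=x, dest=left, recvbuf=from_right, source=right)
    # Weighted neighborhood average with W_ij = 1/3, then local gradient step (Line 5).
    x = (x + from_left + from_right) / 3.0 - gamma * g

# Output: average of all local models (Algorithm 1, Line 7), via one final all-reduce.
x_avg = np.empty(N)
comm.Allreduce(x, x_avg, op=MPI.SUM)
x_avg /= n
if rank == 0:
    print("final model norm:", np.linalg.norm(x_avg))
```

Launched with mpirun, each rank only ever exchanges data with its two ring neighbors during training, plus one final all-reduce to produce the averaged model for evaluation.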
We also vary the number of GPUs that D-PSGD uses and report the speedup over a single GPU to
reach the same loss. Figure 3(b) shows the result on a machine with 7 GPUs. We see that, up to 4
GPUs, D-PSGD shows a near-linear speedup. When all seven GPUs are used, D-PSGD achieves up
to a 5× speedup. This sublinear speedup for 7 GPUs is due to the synchronization cost, but also to the fact that
our machine only has 4 PCIe channels, so more than two GPUs have to share PCIe bandwidth.
5.3 Results on Torch
Due to space limitations, the results on Torch can be found in the Supplemental Material.

6 Conclusion
This paper studies the D-PSGD algorithm on the decentralized computational network. We prove
that D-PSGD achieves the same convergence rate (or equivalently computational complexity) as the
C-PSGD algorithm, but outperforms C-PSGD by avoiding the communication traffic jam. To the
best of our knowledge, this is the first work to show that decentralized algorithms admit a linear
speedup and can outperform centralized algorithms.
Limitation and Future Work The potential limitation of D-PSGD lies in the cost of synchronization.
Breaking the synchronization barrier could make decentralized algorithms even more efficient, but
requires a more complicated analysis. We leave this direction for future work.
On the system side, one future direction is to deploy D-PSGD to larger clusters beyond 112 GPUs;
one such environment is state-of-the-art supercomputers. In such an environment, we envision
D-PSGD to be one of the necessary building blocks for multiple "centralized groups" to communicate. It is
also interesting to deploy D-PSGD to mobile environments.
Acknowledgements Xiangru Lian and Ji Liu are supported in part by NSF CCF1718513. Ce Zhang gratefully
acknowledges the support from the Swiss National Science Foundation NRP 75 407540_167266, IBM Zurich,
Mercedes-Benz Research & Development North America, Oracle Labs, Swisscom, Chinese Scholarship Council,
the Department of Computer Science at ETH Zurich, the GPU donation from NVIDIA Corporation, and the
cloud computation resources from Microsoft Azure for Research award program. Huan Zhang and Cho-Jui
Hsieh acknowledge the support of NSF IIS-1719097 and the TACC computation resources.

References
A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. NIPS, 2011.
N. S. Aybat, Z. Wang, T. Lin, and S. Ma. Distributed linearized alternating direction method of
multipliers for composite convex consensus optimization. arXiv preprint arXiv:1512.08122, 2015.
T. C. Aysal, M. E. Yildiz, A. D. Sarwate, and A. Scaglione. Broadcast gossip algorithms for consensus.
IEEE Transactions on Signal processing, 57(7):2748–2761, 2009.
P. Bianchi, G. Fort, and W. Hachem. Performance of a distributed stochastic approximation algorithm.
IEEE Transactions on Information Theory, 59(11):7405–7418, 2013.
S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Gossip algorithms: Design, analysis and applications.
In INFOCOM 2005. 24th Annual Joint Conference of the IEEE Computer and Communications
Societies. Proceedings IEEE, volume 3, pages 1653–1664. IEEE, 2005.
R. Carli, F. Fagnani, P. Frasca, and S. Zampieri. Gossip consensus algorithms via quantized commu-
nication. Automatica, 46(1):70–80, 2010.
J. Chen, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous sgd. arXiv
preprint arXiv:1604.00981, 2016.
K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive
algorithms. Journal of Machine Learning Research, 7:551–585, 2006.
J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V.
Le, et al. Large scale distributed deep networks. In Advances in neural information processing
systems, pages 1223–1231, 2012.
O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using
mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
F. Fagnani and S. Zampieri. Randomized consensus algorithms over large scale networks. IEEE
Journal on Selected Areas in Communications, 26(4), 2008.
M. Feng, B. Xiang, and B. Zhou. Distributed deep learning for question answering. In Proceedings
of the 25th ACM International on Conference on Information and Knowledge Management, pages
2413–2416. ACM, 2016.
H. R. Feyzmahdavian, A. Aytekin, and M. Johansson. An asynchronous mini-batch algorithm for
regularized stochastic optimization. arXiv, 2015.
S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic program-
ming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. ArXiv e-prints,
Dec. 2015.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
A. Krizhevsky. Learning multiple layers of features from tiny images. In Technical Report, 2009.
G. Lan, S. Lee, and Y. Zhou. Communication-efficient algorithms for decentralized and stochastic
optimization. arXiv preprint arXiv:1701.03961, 2017.
Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and
B.-Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 14,
pages 583–598, 2014.
X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex
optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.

X. Lian, H. Zhang, C.-J. Hsieh, Y. Huang, and J. Liu. A comprehensive linear speedup analysis for
asynchronous stochastic parallel optimization from zeroth-order to first-order. In Advances in
Neural Information Processing Systems, pages 3054–3062, 2016.
Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A structured self-attentive
sentence embedding. 5th International Conference on Learning Representations, 2017.
J. Lu, C. Y. Tang, P. R. Regier, and T. D. Bow. A gossip algorithm for convex consensus optimization
over networks. In American Control Conference (ACC), 2010, pages 301–308. IEEE, 2010.
A. Mokhtari and A. Ribeiro. Dsa: decentralized double stochastic averaging gradient algorithm.
Journal of Machine Learning Research, 17(61):1–35, 2016.
E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for
machine learning. NIPS, 2011.
A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE
Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to
stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
Nvidia. NCCL: Optimized primitives for collective multi-GPU communication.
https://github.com/NVIDIA/nccl.
R. Olfati-Saber, J. A. Fax, and R. M. Murray. Consensus and cooperation in networked multi-agent
systems. Proceedings of the IEEE, 95(1):215–233, 2007.
S. S. Ram, A. Nedić, and V. V. Veeravalli. Asynchronous gossip algorithms for stochastic optimization.
In Decision and Control, 2009 held jointly with the 2009 28th Chinese Control Conference.
CDC/CCC 2009. Proceedings of the 48th IEEE Conference on, pages 3581–3586. IEEE, 2009a.
S. S. Ram, A. Nedic, and V. V. Veeravalli. Distributed subgradient projection algorithm for convex
optimization. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International
Conference on, pages 3653–3656. IEEE, 2009b.
S. S. Ram, A. Nedić, and V. V. Veeravalli. Asynchronous gossip algorithm for stochastic optimization:
Constant stepsize analysis. In Recent Advances in Optimization and its Applications in Engineering,
pages 51–60. Springer, 2010.
B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic
gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
L. Schenato and G. Gamba. A distributed consensus protocol for clock synchronization in wireless
sensor network. In Decision and Control, 2007 46th IEEE Conference on, pages 2289–2294. IEEE,
2007.
S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in
Machine Learning, 4(2):107–194, 2011.
W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the admm in decentralized
consensus optimization.
W. Shi, Q. Ling, G. Wu, and W. Yin. Extra: An exact first-order algorithm for decentralized consensus
optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
B. Sirb and X. Ye. Consensus optimization with delayed and stochastic gradients on decentralized
networks. In Big Data (Big Data), 2016 IEEE International Conference on, pages 76–85. IEEE,
2016.
K. Srivastava and A. Nedic. Distributed asynchronous constrained stochastic optimization. IEEE
Journal of Selected Topics in Signal Processing, 5(4):772–790, 2011.
S. Sundhar Ram, A. Nedić, and V. Veeravalli. Distributed stochastic subgradient projection algorithms
for convex optimization. Journal of optimization theory and applications, 147(3):516–545, 2010.

T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed. Decentralized consensus optimization with
asynchrony and delays. arXiv preprint arXiv:1612.00150, 2016.
F. Yan, S. Sundaram, S. Vishwanathan, and Y. Qi. Distributed autonomous online learning: Re-
grets and intrinsic privacy-preserving properties. IEEE Transactions on Knowledge and Data
Engineering, 25(11):2483–2493, 2013.
T. Yang, M. Mahdavi, R. Jin, and S. Zhu. Regret bounded by gradual variation for online convex
optimization. Machine learning, 95(2):183–223, 2014.
K. Yuan, Q. Ling, and W. Yin. On the convergence of decentralized gradient descent. SIAM Journal
on Optimization, 26(3):1835–1854, 2016.
H. Zhang, C.-J. Hsieh, and V. Akella. Hogwild++: A new mechanism for decentralized asynchronous
stochastic gradient descent. ICDM, 2016a.
R. Zhang and J. Kwok. Asynchronous distributed admm for consensus optimization. In International
Conference on Machine Learning, pages 1701–1709, 2014.
S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging sgd. In Advances
in Neural Information Processing Systems, pages 685–693, 2015.
W. Zhang, S. Gupta, X. Lian, and J. Liu. Staleness-aware async-sgd for distributed deep learning. In
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI
2016, New York, NY, USA, 9-15 July 2016, 2016b.
W. Zhang, S. Gupta, and F. Wang. Model accuracy and runtime tradeoff in distributed deep learning:
A systematic study. In IEEE International Conference on Data Mining, 2016c.
W. Zhang, M. Feng, Y. Zheng, Y. Ren, Y. Wang, J. Liu, P. Liu, B. Xiang, L. Zhang, B. Zhou, and
F. Wang. Gadei: On scale-up training as a service for deep learning. In Proceedings of the
25th ACM International on Conference on Information and Knowledge Management. The IEEE
International Conference on Data Mining series(ICDM’2017), 2017.
Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel sgd for matrix factorization in shared
memory systems. In Proceedings of the 7th ACM conference on Recommender systems, pages
249–256. ACM, 2013.
M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In
Advances in neural information processing systems, pages 2595–2603, 2010.

Supplemental Materials: Results on Torch
The following comparison is based on the implementation using Torch.
We provide results for the experiments comparing D-PSGD and EASGD. For this set of experiments we use
a 32-layer residual network and the CIFAR-10 dataset. We use up to 16 machines, and each machine
includes two Xeon E5-2680 8-core processors and an NVIDIA K20 GPU. Worker machines are
connected in a logical ring as described in Theorem 3. Connections between D-PSGD nodes are
made via TCP sockets, and EASGD uses MPI for communication. Because D-PSGD does not have a
centralized model, we average the models from all machines as our final model for evaluation. In
practical training, this only needs to be done after the last epoch with an all-reduce operation. For
EASGD, we evaluate the central model on the parameter server.
One remarkable feature of this experiment is that we use inexpensive Gigabit Ethernet to connect
all machines, and we are able to observe network congestion in practice with the centralized parameter-server
approach, even with a relatively small model (ResNet-32). Although networks with much higher bandwidth
(e.g., InfiniBand) are available in practice, one may also want to use larger models or
more machines, so network bandwidth can always become a bottleneck. We show in practice
that D-PSGD has better scalability than centralized approaches when network bandwidth becomes a
constraint.
Comparison to EASGD Elastic Averaging SGD (EASGD) [Zhang et al., 2015] is an improved
parameter-server approach that outperforms the traditional parameter server [Dean et al., 2012]. It makes
each node perform more exploration by allowing the local parameters to fluctuate around the central
variable. We add ResNet-32 [He et al., 2016] with CIFAR-10 to EASGD's Torch experiment
code^4 and also implement our algorithm in Torch. Both algorithms run at the same speed on a single
GPU so there is no implementation bias. Unlike the previous experiments, which use high-bandwidth
PCIe or 10Gbit networks for inter-GPU communication, here we use 9 physical machines (1 as the parameter
server), each with a single K20 GPU, connected by inexpensive Gigabit Ethernet. For D-PSGD
we use a logical ring connection between nodes as in Theorem 3. For EASGD we set the moving rate
β = 0.9 and use its momentum variant (EAMSGD). For both algorithms we set the learning rate to
0.1 and the momentum to 0.9. τ ∈ {1, 4, 16} is a hyper-parameter of EASGD controlling the number of
mini-batches computed before communicating with the server.

Figure 4: Convergence comparison between D-PSGD and EAMSGD (EASGD's momentum variant) on 8 machines, with EAMSGD using τ ∈ {1, 4, 16}: (a) iteration vs. training loss; (b) time vs. training loss.

Figure 4 shows that D-PSGD outperforms EASGD by a large margin in this setting. EASGD with
τ = 1 has good convergence, but its large bandwidth requirement saturates the network and slows
down the nodes. With τ = 4 or 16, EASGD converges more slowly than D-PSGD because there is less communication.
D-PSGD allows more communication in an efficient way without hitting the network bottleneck.
Moreover, D-PSGD is synchronous and shows less convergence fluctuation compared with EASGD.
Accuracy comparison with EASGD We have shown the training loss comparison between D-PSGD
and EASGD; we now show additional figures comparing training error and test error in our
experiment, in Figures 5 and 6. We observe results similar to those above: D-PSGD
can achieve good accuracy noticeably faster than EASGD.
^4 https://github.com/sixin-zh/mpiT.git

Figure 5: Training error comparison between D-PSGD and EAMSGD (EASGD's momentum variant) on 8 machines: (a) iteration vs. training error; (b) time vs. training error.

Figure 6: Test error comparison between D-PSGD and EAMSGD (EASGD's momentum variant) on 8 machines: (a) iteration vs. test error; (b) time vs. test error.

Scalability of D-PSGD In this experiment, we run D-PSGD on 1, 4, 8, and 16 machines and compare
convergence speed and error. For the experiments involving 16 machines, in addition to its two logical
neighbours, each machine also connects to one additional machine that has the largest topological
distance from it on the ring. We found that this helps information flow and gives better convergence.
In Figures 7, 8 and 9 we can observe that D-PSGD scales very well as the number of machines
grows. Also, compared with single-machine SGD, D-PSGD has minimal overhead; we
measure that the per-epoch training time increases by only 3% compared to single-machine SGD, while
D-PSGD's convergence speed is much faster. To reach a training loss of 0.2, we need about 80 epochs
with 1 machine, 20 epochs with 4 machines, 10 epochs with 8 machines, and only 5 epochs with 16
machines. The observed linear speedup justifies the correctness of our theory.

Figure 7: Training loss comparison between D-PSGD on 1, 4, 8 and 16 machines: (a) iteration vs. training loss; (b) time vs. training loss.

Generalization ability of D-PSGD In our previous experiments we fixed the learning rate at 0.1.
To complete the residual network training, we need to decrease the learning rate after some epochs. We
follow the learning rate schedule in the ResNet paper [He et al., 2016] and decrease the learning rate
to 0.01 at epoch 80.
Figure 8: Training error comparison between D-PSGD on 1, 4, 8 and 16 machines: (a) iteration vs. training error; (b) time vs. training error.

Figure 9: Test error comparison between D-PSGD on 1, 4, 8 and 16 machines: (a) iteration vs. test error; (b) time vs. test error.

We observe the training/test loss and error, as shown in Figures 10, 11 and 12. For
D-PSGD, we could tune a better learning rate schedule, but parameter tuning is not the focus of our
experiments; rather, we would like to see whether D-PSGD can achieve the same best ResNet accuracy as
reported in the literature.

Figure 10: Training loss comparison between D-PSGD on 1, 4, 8 and 16 machines: (a) iteration vs. training loss; (b) time vs. training loss.

The test error of D-PSGD after 160 epochs is 0.0715, 0.0746, and 0.0735 for 4, 8, and 16 machines,
respectively. He et al. [2016] report a 0.0751 error for the same 32-layer residual network, and we can
reliably outperform the reported error level regardless of the number of machines used. Thus,
D-PSGD does not negatively affect (and perhaps helps) generalization.
Network utilization During the experiment, we measure the network bandwidth on each machine.
Because every machine is identical on the network, the measured bandwidth is the same on each
machine. For the experiments with 4 and 8 machines, the required bandwidth is about 22 MB/s. With 16
machines the required bandwidth is about 33 MB/s because we have an additional link. The required
bandwidth is related to GPU performance; if the GPU can compute each minibatch faster, the required
bandwidth increases proportionally.
Figure 11: Training error comparison between D-PSGD on 1, 4, 8 and 16 machines: (a) iteration vs. training error; (b) time vs. training error.

Figure 12: Test error comparison between D-PSGD on 1, 4, 8 and 16 machines: (a) iteration vs. test error; (b) time vs. test error.

Considering that the practical bandwidth of Gigabit Ethernet is
about 100~120 MB/s, our algorithm can easily handle a 4~5 times faster GPU (or GPUs), even with
an inexpensive gigabit connection.
Because our algorithm is synchronous, we want each node to compute each minibatch in roughly
the same amount of time. If the machines have different computation power, we can use different minibatch sizes
to compensate for the speed difference, or allow faster machines to process more than one minibatch before
synchronization.

Industrial benchmark
In this section, we evaluate the effectiveness of our algorithm on the IBM Watson Natural Language
Classifier (NLC) workload. The NLC service, IBM's most popular cognitive service offering, is used by
thousands of enterprise-level clients around the globe. The NLC task is to classify input sentences into a
target category from a predefined label set. NLC has been extensively used in many practical applications,
including sentiment analysis, topic classification, and question classification. At the core of NLC training
is a CNN model that has a word-embedding lookup-table layer, a convolutional layer, and a fully connected
layer with a softmax output layer. NLC is implemented using the Torch open-source deep learning framework.
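The NLC model itself is proprietary; the following PyTorch sketch only mirrors the architecture described above (a word-embedding lookup table, one convolutional layer, and a fully connected layer with softmax output), with all sizes chosen by us for illustration (the 311 classes match the Joule dataset introduced below).

```python
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    """Illustrative stand-in for the described architecture: embedding lookup,
    one convolutional layer, and a fully connected softmax classifier."""
    def __init__(self, vocab_size=50000, embed_dim=100, num_filters=128,
                 kernel_size=3, num_classes=311):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size, padding=1)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, tokens):                            # tokens: (batch, seq_len) int64
        e = self.embed(tokens).transpose(1, 2)            # (batch, embed_dim, seq_len)
        h = torch.relu(self.conv(e)).max(dim=2).values    # max-pool over the sequence
        return self.fc(h)                                 # logits; softmax is applied by the loss

model = SentenceClassifier()
logits = model(torch.randint(0, 50000, (2, 40)))          # a batch of 2 sentences of length 40
loss = nn.CrossEntropyLoss()(logits, torch.tensor([3, 7]))
```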
Methodology We use two datasets in our evaluation. The first dataset Joule is an in-house customer
dataset that has 2.5K training samples, 1K test samples, and 311 different classes. The second dataset
Yelp, which is a public dataset, has 500K training samples, 2K test samples and 5 different classes.
The experiments are conducted on an IBM Power server, which has 40 IBM P8 cores; each core
is 4-way SMP with a clock frequency of 2GHz. The server has 128GB memory and is equipped
with 8 K80 GPUs. DataParallelTable (DPT) is an NCCL-based [Nvidia] module in Torch that can
leverage multiple GPUs to carry out the centralized parallel SGD algorithm; NCCL is an all-reduce
based implementation. We implemented the decentralized SGD algorithm in the NLC product.
We now compare the convergence rate of centralized SGD (i.e., DPT) and our decentralized SGD
implementation.

Convergence results and test accuracy First, we examine the Joule dataset. We use 8 nodes,
each node computes with a mini-batch size of 2, and the entire run passes through 200 epochs.
Figure 13 shows that the centralized SGD algorithm and the decentralized SGD algorithm achieve similar
training loss (0.96) at roughly the same convergence rate. Figure 14 shows that the centralized SGD
algorithm and the decentralized SGD algorithm achieve similar test error (43%). Meanwhile, the
communication cost is reduced by 3× in the decentralized SGD case compared to the centralized SGD
algorithm. Second, we examine the Yelp dataset. We use 8 nodes, each node computes with a
mini-batch size of 32, and the entire run passes through 20 epochs. Figure 15 shows that the centralized
SGD algorithm and the decentralized SGD algorithm achieve similar training loss (0.86). Figure 16
shows that the centralized SGD algorithm and the decentralized SGD algorithm achieve similar test error
(39%).

Figure 13: Training loss on Joule dataset (centralized vs. decentralized).
Figure 14: Test error on Joule dataset (centralized vs. decentralized).

Figure 15: Training loss on Yelp dataset (centralized vs. decentralized).
Figure 16: Test error on Yelp dataset (centralized vs. decentralized).

Supplemental Materials: Proofs
We provide the proofs of all theoretical results in this paper in this section.
Lemma 4. Under Assumption 1 we have
$$\left\|\frac{\mathbf{1}_n}{n} - W^k e_i\right\|^2 \le \rho^k, \quad \forall i \in \{1, 2, \ldots, n\},\ \forall k \in \mathbb{N}.$$
Proof. Let $W^\infty := \lim_{k\to\infty} W^k$. Note that from Assumption 1-2 we have $\frac{\mathbf{1}_n}{n} = W^\infty e_i, \forall i$, since
W is doubly stochastic and ρ < 1. Thus
$$\left\|\frac{\mathbf{1}_n}{n} - W^k e_i\right\|^2 = \|(W^\infty - W^k)e_i\|^2 \le \|W^\infty - W^k\|^2\,\|e_i\|^2 = \|W^\infty - W^k\|^2 \le \rho^k,$$
where the last step comes from the diagonalizability of W, completing the proof.
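Lemma 4 is easy to sanity-check numerically; the short sketch below (our own illustration, using a ring W with n = 8 chosen arbitrarily) verifies that ‖1_n/n − W^k e_i‖² ≤ ρ^k over a range of k.

```python
import numpy as np

n = 8
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3   # ring weights

eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]
rho = max(abs(eigvals[1]), abs(eigvals[-1])) ** 2               # spectral-gap constant of Assumption 1-2

e0 = np.zeros(n)
e0[0] = 1.0
ones_over_n = np.ones(n) / n
Wk = np.eye(n)
for k in range(1, 30):
    Wk = Wk @ W
    lhs = np.linalg.norm(ones_over_n - Wk @ e0) ** 2
    assert lhs <= rho ** k + 1e-12   # Lemma 4: ||1_n/n - W^k e_i||^2 <= rho^k
```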
Lemma 5. We have the following inequality under Assumption 1:
\[
\mathbb{E}\|\partial f(X_j)\|^2 \le 3L^2 \sum_{h=1}^{n} \mathbb{E}\left\| \frac{\sum_{i'=1}^{n} x_{j,i'}}{n} - x_{j,h} \right\|^2 + 3n\varsigma^2
+ 3\,\mathbb{E}\left\| \nabla f\!\left(\frac{X_j \mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2, \quad \forall j.
\]
Proof. We consider the upper bound of $\mathbb{E}\|\partial f(X_j)\|^2$ in the following:
\begin{align*}
\mathbb{E}\|\partial f(X_j)\|^2
\le\ & 3\,\mathbb{E}\left\| \partial f(X_j) - \partial f\!\left(\frac{X_j\mathbf{1}_n}{n}\mathbf{1}_n^\top\right) \right\|^2
+ 3\,\mathbb{E}\left\| \partial f\!\left(\frac{X_j\mathbf{1}_n}{n}\mathbf{1}_n^\top\right) - \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2
+ 3\,\mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2 \\
\overset{\text{(Assumption 1-3)}}{\le}\ & 3\,\mathbb{E}\left\| \partial f(X_j) - \partial f\!\left(\frac{X_j\mathbf{1}_n}{n}\mathbf{1}_n^\top\right) \right\|_F^2 + 3n\varsigma^2
+ 3\,\mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2 \\
\overset{\text{(Assumption 1-1)}}{\le}\ & 3L^2\sum_{h=1}^{n}\mathbb{E}\left\| \frac{\sum_{i'=1}^{n}x_{j,i'}}{n} - x_{j,h} \right\|^2 + 3n\varsigma^2
+ 3\,\mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2.
\end{align*}
This completes the proof.
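For completeness, the first step above uses the elementary inequality (a consequence of the convexity of $\|\cdot\|^2$):
\[
\|a + b + c\|^2 \le 3\|a\|^2 + 3\|b\|^2 + 3\|c\|^2,
\]
applied to the decomposition $\partial f(X_j) = \big(\partial f(X_j) - \partial f(\tfrac{X_j\mathbf{1}_n}{n}\mathbf{1}_n^\top)\big) + \big(\partial f(\tfrac{X_j\mathbf{1}_n}{n}\mathbf{1}_n^\top) - \nabla f(\tfrac{X_j\mathbf{1}_n}{n})\mathbf{1}_n^\top\big) + \nabla f(\tfrac{X_j\mathbf{1}_n}{n})\mathbf{1}_n^\top$.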


 
Proof to Theorem 1. We start from $f\!\left(\frac{X_{k+1}\mathbf{1}_n}{n}\right)$:
\begin{align*}
\mathbb{E} f\!\left(\frac{X_{k+1}\mathbf{1}_n}{n}\right)
&= \mathbb{E} f\!\left(\frac{X_k W\mathbf{1}_n}{n} - \gamma\frac{\partial F(X_k;\xi_k)\mathbf{1}_n}{n}\right) \\
&\overset{\text{(Assumption 1-2)}}{=} \mathbb{E} f\!\left(\frac{X_k\mathbf{1}_n}{n} - \gamma\frac{\partial F(X_k;\xi_k)\mathbf{1}_n}{n}\right) \\
&\le \mathbb{E} f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)
- \gamma\,\mathbb{E}\!\left\langle \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right), \frac{\partial f(X_k)\mathbf{1}_n}{n} \right\rangle
+ \frac{\gamma^2 L}{2}\,\mathbb{E}\left\| \frac{\sum_{i=1}^{n}\nabla F_i(x_{k,i};\xi_{k,i})}{n} \right\|^2. \tag{7}
\end{align*}
Note that for the last term we can split it into two terms:
\begin{align*}
\mathbb{E}\left\| \frac{\sum_{i=1}^{n}\nabla F_i(x_{k,i};\xi_{k,i})}{n} \right\|^2
&= \mathbb{E}\left\| \frac{\sum_{i=1}^{n}\nabla F_i(x_{k,i};\xi_{k,i}) - \sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n} + \frac{\sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n} \right\|^2 \\
&= \mathbb{E}\left\| \frac{\sum_{i=1}^{n}\nabla F_i(x_{k,i};\xi_{k,i}) - \sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n} \right\|^2
+ \mathbb{E}\left\| \frac{\sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n} \right\|^2 \\
&\quad + 2\,\mathbb{E}\left\langle \frac{\sum_{i=1}^{n}\nabla F_i(x_{k,i};\xi_{k,i}) - \sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n},\ \frac{\sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n} \right\rangle \\
&= \mathbb{E}\left\| \frac{\sum_{i=1}^{n}\nabla F_i(x_{k,i};\xi_{k,i}) - \sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n} \right\|^2
+ \mathbb{E}\left\| \frac{\sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n} \right\|^2 \\
&\quad + 2\,\mathbb{E}\left\langle \frac{\sum_{i=1}^{n}\mathbb{E}_{\xi_{k,i}}\nabla F_i(x_{k,i};\xi_{k,i}) - \sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n},\ \frac{\sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n} \right\rangle \\
&= \mathbb{E}\left\| \frac{\sum_{i=1}^{n}\nabla F_i(x_{k,i};\xi_{k,i}) - \sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n} \right\|^2
+ \mathbb{E}\left\| \frac{\sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n} \right\|^2,
\end{align*}
since $\mathbb{E}_{\xi_{k,i}}\nabla F_i(x_{k,i};\xi_{k,i}) = \nabla f_i(x_{k,i})$, so the cross term vanishes.

Then it follows from (7) that
\begin{align*}
\mathbb{E} f\!\left(\frac{X_{k+1}\mathbf{1}_n}{n}\right)
\le\ & \mathbb{E} f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)
- \gamma\,\mathbb{E}\!\left\langle \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right), \frac{\partial f(X_k)\mathbf{1}_n}{n} \right\rangle \\
& + \frac{\gamma^2 L}{2}\,\mathbb{E}\left\| \frac{\sum_{i=1}^{n}\nabla F_i(x_{k,i};\xi_{k,i}) - \sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n} \right\|^2
+ \frac{\gamma^2 L}{2}\,\mathbb{E}\left\| \frac{\sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n} \right\|^2. \tag{8}
\end{align*}
The second-to-last term can be bounded using $\sigma$:
\begin{align*}
& \frac{\gamma^2 L}{2}\,\mathbb{E}\left\| \frac{\sum_{i=1}^{n}\nabla F_i(x_{k,i};\xi_{k,i}) - \sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n} \right\|^2 \\
&= \frac{\gamma^2 L}{2n^2}\sum_{i=1}^{n}\mathbb{E}\|\nabla F_i(x_{k,i};\xi_{k,i}) - \nabla f_i(x_{k,i})\|^2 \\
&\quad + \frac{\gamma^2 L}{n^2}\sum_{i=1}^{n}\sum_{i'=i+1}^{n}\mathbb{E}\big\langle \nabla F_i(x_{k,i};\xi_{k,i}) - \nabla f_i(x_{k,i}),\ \nabla F_{i'}(x_{k,i'};\xi_{k,i'}) - \nabla f_{i'}(x_{k,i'}) \big\rangle \\
&= \frac{\gamma^2 L}{2n^2}\sum_{i=1}^{n}\mathbb{E}\|\nabla F_i(x_{k,i};\xi_{k,i}) - \nabla f_i(x_{k,i})\|^2 \\
&\quad + \frac{\gamma^2 L}{n^2}\sum_{i=1}^{n}\sum_{i'=i+1}^{n}\mathbb{E}\big\langle \nabla F_i(x_{k,i};\xi_{k,i}) - \nabla f_i(x_{k,i}),\ \mathbb{E}_{\xi_{k,i'}}\nabla F_{i'}(x_{k,i'};\xi_{k,i'}) - \nabla f_{i'}(x_{k,i'}) \big\rangle \\
&= \frac{\gamma^2 L}{2n^2}\sum_{i=1}^{n}\mathbb{E}\|\nabla F_i(x_{k,i};\xi_{k,i}) - \nabla f_i(x_{k,i})\|^2
\le \frac{\gamma^2 L}{2n}\sigma^2,
\end{align*}
where the last step comes from Assumption 1-3.
Thus it follows from (8) that
\begin{align*}
\mathbb{E} f\!\left(\frac{X_{k+1}\mathbf{1}_n}{n}\right)
\le\ & \mathbb{E} f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)
- \gamma\,\mathbb{E}\!\left\langle \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right), \frac{\partial f(X_k)\mathbf{1}_n}{n} \right\rangle
+ \frac{\gamma^2 L\sigma^2}{2n}
+ \frac{\gamma^2 L}{2}\,\mathbb{E}\left\| \frac{\sum_{i=1}^{n}\nabla f_i(x_{k,i})}{n} \right\|^2 \\
=\ & \mathbb{E} f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)
- \frac{\gamma - \gamma^2 L}{2}\,\mathbb{E}\left\| \frac{\partial f(X_k)\mathbf{1}_n}{n} \right\|^2
- \frac{\gamma}{2}\,\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right) \right\|^2
+ \frac{\gamma^2 L\sigma^2}{2n} \\
& + \frac{\gamma}{2}\underbrace{\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right) - \frac{\partial f(X_k)\mathbf{1}_n}{n} \right\|^2}_{=:T_1}, \tag{9}
\end{align*}
where the last step comes from $2\langle a, b\rangle = \|a\|^2 + \|b\|^2 - \|a-b\|^2$.
We then bound $T_1$:
\begin{align*}
T_1 = \mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right) - \frac{\partial f(X_k)\mathbf{1}_n}{n} \right\|^2
&\le \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left\| \nabla f_i\!\left(\frac{\sum_{i'=1}^{n}x_{k,i'}}{n}\right) - \nabla f_i(x_{k,i}) \right\|^2 \\
&\overset{\text{(Assumption 1-1)}}{\le} \frac{L^2}{n}\sum_{i=1}^{n}\underbrace{\mathbb{E}\left\| \frac{\sum_{i'=1}^{n}x_{k,i'}}{n} - x_{k,i} \right\|^2}_{=:Q_{k,i}}, \tag{10}
\end{align*}
where we define $Q_{k,i}$ as the expected squared distance between the local optimization variable on the $i$-th node and the average of the local optimization variables over all nodes.
In order to bound $T_1$ we bound the $Q_{k,i}$'s as follows:
\begin{align*}
Q_{k,i} &= \mathbb{E}\left\| \frac{\sum_{i'=1}^{n}x_{k,i'}}{n} - x_{k,i} \right\|^2
= \mathbb{E}\left\| \frac{X_k\mathbf{1}_n}{n} - X_k e_i \right\|^2 \\
&= \mathbb{E}\left\| \frac{X_{k-1}W\mathbf{1}_n - \gamma\,\partial F(X_{k-1};\xi_{k-1})\mathbf{1}_n}{n} - \big(X_{k-1}We_i - \gamma\,\partial F(X_{k-1};\xi_{k-1})e_i\big) \right\|^2 \\
&= \mathbb{E}\left\| \frac{X_{k-1}\mathbf{1}_n - \gamma\,\partial F(X_{k-1};\xi_{k-1})\mathbf{1}_n}{n} - \big(X_{k-1}We_i - \gamma\,\partial F(X_{k-1};\xi_{k-1})e_i\big) \right\|^2 \\
&= \mathbb{E}\left\| \frac{X_0\mathbf{1}_n - \sum_{j=0}^{k-1}\gamma\,\partial F(X_j;\xi_j)\mathbf{1}_n}{n}
- \left(X_0 W^k e_i - \sum_{j=0}^{k-1}\gamma\,\partial F(X_j;\xi_j)W^{k-j-1}e_i\right) \right\|^2 \\
&= \mathbb{E}\left\| X_0\!\left(\frac{\mathbf{1}_n}{n} - W^k e_i\right)
- \sum_{j=0}^{k-1}\gamma\,\partial F(X_j;\xi_j)\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i\right) \right\|^2 \\
&\overset{\text{(Assumption 1-4)}}{=} \mathbb{E}\left\| \sum_{j=0}^{k-1}\gamma\,\partial F(X_j;\xi_j)\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i\right) \right\|^2
= \gamma^2\,\mathbb{E}\left\| \sum_{j=0}^{k-1}\partial F(X_j;\xi_j)\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i\right) \right\|^2 \\
&\le 2\gamma^2\underbrace{\mathbb{E}\left\| \sum_{j=0}^{k-1}\big(\partial F(X_j;\xi_j) - \partial f(X_j)\big)\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i\right) \right\|^2}_{=:T_2}
+ 2\gamma^2\underbrace{\mathbb{E}\left\| \sum_{j=0}^{k-1}\partial f(X_j)\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i\right) \right\|^2}_{=:T_3}. \tag{11}
\end{align*}

For T2, we provide the following upper bounds:
\begin{align*}
T_2 &= \mathbb{E}\left\| \sum_{j=0}^{k-1}\big(\partial F(X_j;\xi_j) - \partial f(X_j)\big)\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i\right) \right\|^2
= \sum_{j=0}^{k-1}\mathbb{E}\left\| \big(\partial F(X_j;\xi_j) - \partial f(X_j)\big)\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i\right) \right\|^2 \\
&\le \sum_{j=0}^{k-1}\mathbb{E}\,\|\partial F(X_j;\xi_j) - \partial f(X_j)\|_F^2\left\| \frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i \right\|^2
\overset{\text{(Lemma 4, Assumption 1-3)}}{\le} \sum_{j=0}^{k-1} n\sigma^2\rho^{k-j-1}
\le \frac{n\sigma^2}{1-\rho}.
\end{align*}
For T3, we provide the following upper bounds:
\begin{align*}
T_3 &= \mathbb{E}\left\| \sum_{j=0}^{k-1}\partial f(X_j)\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i\right) \right\|^2 \\
&= \underbrace{\sum_{j=0}^{k-1}\mathbb{E}\left\| \partial f(X_j)\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i\right) \right\|^2}_{=:T_4}
+ \underbrace{\sum_{j\ne j'}\mathbb{E}\left\langle \partial f(X_j)\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i\right),\ \partial f(X_{j'})\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j'-1}e_i\right) \right\rangle}_{=:T_5}.
\end{align*}

To bound $T_3$ we bound $T_4$ and $T_5$ in the following. For $T_4$,
\begin{align*}
T_4 &= \sum_{j=0}^{k-1}\mathbb{E}\left\| \partial f(X_j)\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i\right) \right\|^2
\le \sum_{j=0}^{k-1}\mathbb{E}\,\|\partial f(X_j)\|^2\left\| \frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i \right\|^2 \\
&\overset{\text{(Lemmas 4 and 5)}}{\le} 3\sum_{j=0}^{k-1}\left(L^2\sum_{h=1}^{n}Q_{j,h}\right)\left\| \frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i \right\|^2
+ \frac{3n\varsigma^2}{1-\rho}
+ 3\sum_{j=0}^{k-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2\left\| \frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i \right\|^2.
\end{align*}
We bound $T_5$ using two new terms $T_6$ and $T_7$:
\begin{align*}
T_5 &= \sum_{j\ne j'}\mathbb{E}\left\langle \partial f(X_j)\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i\right),\ \partial f(X_{j'})\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j'-1}e_i\right) \right\rangle \\
&\le \sum_{j\ne j'}\mathbb{E}\left\| \partial f(X_j)\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i\right) \right\|\left\| \partial f(X_{j'})\!\left(\frac{\mathbf{1}_n}{n} - W^{k-j'-1}e_i\right) \right\| \\
&\le \sum_{j\ne j'}\mathbb{E}\,\|\partial f(X_j)\|\left\| \frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i \right\|\|\partial f(X_{j'})\|\left\| \frac{\mathbf{1}_n}{n} - W^{k-j'-1}e_i \right\| \\
&\le \sum_{j\ne j'}\mathbb{E}\left( \frac{\|\partial f(X_j)\|^2}{2} + \frac{\|\partial f(X_{j'})\|^2}{2} \right)\left\| \frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i \right\|\left\| \frac{\mathbf{1}_n}{n} - W^{k-j'-1}e_i \right\| \\
&\overset{\text{(Lemma 4)}}{\le} \sum_{j\ne j'}\mathbb{E}\left( \frac{\|\partial f(X_j)\|^2}{2} + \frac{\|\partial f(X_{j'})\|^2}{2} \right)\rho^{k-\frac{j+j'}{2}-1}
= \sum_{j\ne j'}\mathbb{E}\big(\|\partial f(X_j)\|^2\big)\rho^{k-\frac{j+j'}{2}-1} \\
&\overset{\text{(Lemma 5)}}{\le} \underbrace{3\sum_{j\ne j'}\left( L^2\sum_{h=1}^{n}Q_{j,h} + \mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2 \right)\rho^{k-\frac{j+j'}{2}-1}}_{=:T_6}
+ \underbrace{3n\varsigma^2\sum_{j\ne j'}\rho^{k-1-\frac{j+j'}{2}}}_{=:T_7},
\end{align*}
where $T_7$ can be bounded using $\varsigma$ and $\rho$:
\begin{align*}
T_7 = 6n\varsigma^2\sum_{j>j'}\rho^{k-1-\frac{j+j'}{2}}
= 6n\varsigma^2\,\frac{\big(\rho^{k/2}-1\big)\big(\rho^{k/2}-\sqrt{\rho}\big)}{\big(\sqrt{\rho}-1\big)^2\big(\sqrt{\rho}+1\big)}
\le \frac{6n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2},
\end{align*}
and we bound $T_6$:
\begin{align*}
T_6 &= 3\sum_{j\ne j'}\left( L^2\sum_{h=1}^{n}Q_{j,h} + \mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2 \right)\rho^{k-\frac{j+j'}{2}-1} \\
&= 3\sum_{j=0}^{k-1}\left( L^2\sum_{h=1}^{n}Q_{j,h} + \mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2 \right)\sqrt{\rho}^{\,k-j-1}\sum_{j'\ne j}\sqrt{\rho}^{\,k-j'-1} \\
&\le 6\sum_{j=0}^{k-1}\left( L^2\sum_{h=1}^{n}Q_{j,h} + \mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2 \right)\frac{\sqrt{\rho}^{\,k-j-1}}{1-\sqrt{\rho}},
\end{align*}
since $\sum_{j'\ne j}\sqrt{\rho}^{\,k-j'-1}\le\sum_{m=0}^{\infty}\sqrt{\rho}^{\,m}=\frac{1}{1-\sqrt{\rho}}$.

Plugging $T_6$ and $T_7$ into $T_5$ and then plugging $T_5$ and $T_4$ into $T_3$ yields the upper bound for $T_3$:
\begin{align*}
T_3 \le\ & 3\sum_{j=0}^{k-1}L^2\sum_{h=1}^{n}Q_{j,h}\left\| \frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i \right\|^2
+ 3\sum_{j=0}^{k-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2\left\| \frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i \right\|^2 \\
& + 6\sum_{j=0}^{k-1}\left( L^2\sum_{h=1}^{n}Q_{j,h} + \mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2 \right)\frac{\sqrt{\rho}^{\,k-j-1}}{1-\sqrt{\rho}}
+ \frac{3n\varsigma^2}{1-\rho} + \frac{6n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2} \\
\le\ & 3\sum_{j=0}^{k-1}L^2\sum_{h=1}^{n}Q_{j,h}\left\| \frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i \right\|^2
+ 3\sum_{j=0}^{k-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2\left\| \frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i \right\|^2 \\
& + 6\sum_{j=0}^{k-1}\left( L^2\sum_{h=1}^{n}Q_{j,h} + \mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2 \right)\frac{\sqrt{\rho}^{\,k-j-1}}{1-\sqrt{\rho}}
+ \frac{9n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2},
\end{align*}
where in the last step we use the fact that $\frac{1}{1-\rho}\le\frac{1}{(1-\sqrt{\rho})^2}$.

Putting the bounds for $T_2$ and $T_3$ back into (11) we get the bound for $Q_{k,i}$:
\begin{align*}
Q_{k,i} \le\ & \frac{2\gamma^2 n\sigma^2}{1-\rho}
+ 6\gamma^2 L^2\sum_{j=0}^{k-1}\sum_{h=1}^{n}Q_{j,h}\left\| \frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i \right\|^2
+ 6\gamma^2\sum_{j=0}^{k-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2\left\| \frac{\mathbf{1}_n}{n} - W^{k-j-1}e_i \right\|^2 \\
& + 12\gamma^2\sum_{j=0}^{k-1}\left( L^2\sum_{h=1}^{n}Q_{j,h} + \mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2 \right)\frac{\sqrt{\rho}^{\,k-j-1}}{1-\sqrt{\rho}}
+ \frac{18\gamma^2 n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2} \\
\overset{\text{(Lemma 4)}}{\le}\ & \frac{2\gamma^2 n\sigma^2}{1-\rho} + \frac{18\gamma^2 n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2}
+ 6\gamma^2 L^2\sum_{j=0}^{k-1}\sum_{h=1}^{n}Q_{j,h}\,\rho^{k-j-1}
+ 6\gamma^2\sum_{j=0}^{k-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2\rho^{k-j-1} \\
& + 12\gamma^2\sum_{j=0}^{k-1}\left( L^2\sum_{h=1}^{n}Q_{j,h} + \mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2 \right)\frac{\sqrt{\rho}^{\,k-j-1}}{1-\sqrt{\rho}} \\
=\ & \frac{2\gamma^2 n\sigma^2}{1-\rho} + \frac{18\gamma^2 n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2}
+ 6\gamma^2\sum_{j=0}^{k-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2\left( \rho^{k-j-1} + \frac{2\sqrt{\rho}^{\,k-j-1}}{1-\sqrt{\rho}} \right) \\
& + 6\gamma^2 L^2\sum_{j=0}^{k-1}\sum_{h=1}^{n}Q_{j,h}\left( \frac{2\sqrt{\rho}^{\,k-j-1}}{1-\sqrt{\rho}} + \rho^{k-j-1} \right). \tag{12}
\end{align*}

So far we have the bound for $Q_{k,i}$. We continue by bounding its average $M_k$ over all nodes, which is defined by
\[
M_k := \frac{\sum_{i=1}^{n}Q_{k,i}}{n}. \tag{13}
\]
It follows from (12) that
\begin{align*}
M_k \le\ & \frac{2\gamma^2 n\sigma^2}{1-\rho} + \frac{18\gamma^2 n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2}
+ 6\gamma^2\sum_{j=0}^{k-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2\left( \rho^{k-j-1} + \frac{2\sqrt{\rho}^{\,k-j-1}}{1-\sqrt{\rho}} \right) \\
& + 6\gamma^2 nL^2\sum_{j=0}^{k-1}M_j\left( \frac{2\sqrt{\rho}^{\,k-j-1}}{1-\sqrt{\rho}} + \rho^{k-j-1} \right).
\end{align*}

Summing from $k = 0$ to $K-1$ we get
\begin{align*}
\sum_{k=0}^{K-1}M_k \le\ & \frac{2\gamma^2 n\sigma^2}{1-\rho}K + \frac{18\gamma^2 n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2}K
+ 6\gamma^2\sum_{k=0}^{K-1}\sum_{j=0}^{k-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_j\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2\left( \rho^{k-j-1} + \frac{2\sqrt{\rho}^{\,k-j-1}}{1-\sqrt{\rho}} \right) \\
& + 6\gamma^2 nL^2\sum_{k=0}^{K-1}\sum_{j=0}^{k-1}M_j\left( \frac{2\sqrt{\rho}^{\,k-j-1}}{1-\sqrt{\rho}} + \rho^{k-j-1} \right) \\
\le\ & \frac{2\gamma^2 n\sigma^2}{1-\rho}K + \frac{18\gamma^2 n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2}K
+ 6\gamma^2\sum_{k=0}^{K-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2\left( \sum_{i=0}^{\infty}\rho^i + \frac{2\sum_{i=0}^{\infty}\sqrt{\rho}^{\,i}}{1-\sqrt{\rho}} \right) \\
& + 6\gamma^2 nL^2\sum_{k=0}^{K-1}M_k\left( \frac{2\sum_{i=0}^{\infty}\sqrt{\rho}^{\,i}}{1-\sqrt{\rho}} + \sum_{i=0}^{\infty}\rho^i \right) \\
\le\ & \frac{2\gamma^2 n\sigma^2}{1-\rho}K + \frac{18\gamma^2 n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2}K
+ \frac{18\gamma^2}{\big(1-\sqrt{\rho}\big)^2}\sum_{k=0}^{K-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2
+ \frac{18\gamma^2 nL^2}{\big(1-\sqrt{\rho}\big)^2}\sum_{k=0}^{K-1}M_k,
\end{align*}
where the second step comes from rearranging the summations and the last step comes from summing the geometric sequences and using $\frac{1}{1-\rho}\le\frac{1}{(1-\sqrt{\rho})^2}$.
Simply by rearranging the terms we get the bound for the summation of the $M_k$'s from $k = 0$ to $K-1$:
\begin{align*}
& \left( 1 - \frac{18\gamma^2 nL^2}{\big(1-\sqrt{\rho}\big)^2} \right)\sum_{k=0}^{K-1}M_k
\le \frac{2\gamma^2 n\sigma^2}{1-\rho}K + \frac{18\gamma^2 n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2}K
+ \frac{18\gamma^2}{\big(1-\sqrt{\rho}\big)^2}\sum_{k=0}^{K-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2 \\
\Longrightarrow\quad & \sum_{k=0}^{K-1}M_k
\le \frac{2\gamma^2 n\sigma^2 K}{(1-\rho)\left(1 - \frac{18\gamma^2 nL^2}{(1-\sqrt{\rho})^2}\right)}
+ \frac{18\gamma^2 n\varsigma^2 K}{\big(1-\sqrt{\rho}\big)^2\left(1 - \frac{18\gamma^2 nL^2}{(1-\sqrt{\rho})^2}\right)}
+ \frac{18\gamma^2}{\big(1-\sqrt{\rho}\big)^2\left(1 - \frac{18\gamma^2 nL^2}{(1-\sqrt{\rho})^2}\right)}\sum_{k=0}^{K-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2. \tag{14}
\end{align*}

Recall from (10) that $T_1$ can be bounded using $M_k$:
\[
T_1 \le \frac{L^2}{n}\sum_{i=1}^{n}Q_{k,i} = L^2 M_k. \tag{15}
\]

We are finally able to bound the error by combining all of the above. Starting from (9):
\begin{align*}
\mathbb{E} f\!\left(\frac{X_{k+1}\mathbf{1}_n}{n}\right)
&\le \mathbb{E} f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)
- \frac{\gamma-\gamma^2 L}{2}\,\mathbb{E}\left\| \frac{\partial f(X_k)\mathbf{1}_n}{n} \right\|^2
- \frac{\gamma}{2}\,\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right) \right\|^2
+ \frac{\gamma^2 L}{2n}\sigma^2 + \frac{\gamma}{2}T_1 \\
&\overset{\text{(15)}}{\le} \mathbb{E} f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)
- \frac{\gamma-\gamma^2 L}{2}\,\mathbb{E}\left\| \frac{\partial f(X_k)\mathbf{1}_n}{n} \right\|^2
- \frac{\gamma}{2}\,\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right) \right\|^2
+ \frac{\gamma^2 L}{2n}\sigma^2 + \frac{\gamma}{2}L^2 M_k.
\end{align*}

Summing from $k = 0$ to $k = K-1$ we get
\begin{align*}
& \frac{\gamma-\gamma^2 L}{2}\sum_{k=0}^{K-1}\mathbb{E}\left\| \frac{\partial f(X_k)\mathbf{1}_n}{n} \right\|^2
+ \frac{\gamma}{2}\sum_{k=0}^{K-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right) \right\|^2
\le f(0) - f^* + \frac{\gamma^2 KL}{2n}\sigma^2 + \frac{\gamma L^2}{2}\sum_{k=0}^{K-1}M_k \\
&\overset{\text{(14)}}{\le} f(0) - f^* + \frac{\gamma^2 KL}{2n}\sigma^2
+ \frac{\gamma L^2}{2}\cdot\frac{2\gamma^2 n\sigma^2 K}{(1-\rho)\left(1 - \frac{18\gamma^2 nL^2}{(1-\sqrt{\rho})^2}\right)}
+ \frac{\gamma L^2}{2}\cdot\frac{18\gamma^2 n\varsigma^2 K}{\big(1-\sqrt{\rho}\big)^2\left(1 - \frac{18\gamma^2 nL^2}{(1-\sqrt{\rho})^2}\right)} \\
&\qquad + \frac{\gamma L^2}{2}\cdot\frac{18\gamma^2}{\big(1-\sqrt{\rho}\big)^2\left(1 - \frac{18\gamma^2 nL^2}{(1-\sqrt{\rho})^2}\right)}\sum_{k=0}^{K-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2 \\
&= f(0) - f^* + \frac{\gamma^2 KL}{2n}\sigma^2
+ \frac{\gamma^3 L^2 n\sigma^2 K}{(1-\rho)\left(1 - \frac{18\gamma^2 nL^2}{(1-\sqrt{\rho})^2}\right)}
+ \frac{9\gamma^3 L^2 n\varsigma^2 K}{\big(1-\sqrt{\rho}\big)^2\left(1 - \frac{18\gamma^2 nL^2}{(1-\sqrt{\rho})^2}\right)} \\
&\qquad + \frac{9n\gamma^3 L^2}{\big(1-\sqrt{\rho}\big)^2\left(1 - \frac{18\gamma^2 nL^2}{(1-\sqrt{\rho})^2}\right)}\sum_{k=0}^{K-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right) \right\|^2,
\end{align*}
where the last step uses $\mathbb{E}\big\|\nabla f\big(\tfrac{X_k\mathbf{1}_n}{n}\big)\mathbf{1}_n^\top\big\|^2 = n\,\mathbb{E}\big\|\nabla f\big(\tfrac{X_k\mathbf{1}_n}{n}\big)\big\|^2$.
By rearranging the inequality above, we obtain
\begin{align*}
& \frac{\frac{\gamma-\gamma^2 L}{2}\sum_{k=0}^{K-1}\mathbb{E}\left\| \frac{\partial f(X_k)\mathbf{1}_n}{n} \right\|^2
+ \left( \frac{\gamma}{2} - \frac{9n\gamma^3 L^2}{(1-\sqrt{\rho})^2\left(1 - \frac{18\gamma^2 nL^2}{(1-\sqrt{\rho})^2}\right)} \right)\sum_{k=0}^{K-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right) \right\|^2}{\gamma K} \\
&\qquad \le \frac{f(0)-f^*}{\gamma K} + \frac{\gamma L}{2n}\sigma^2
+ \frac{\gamma^2 L^2 n\sigma^2}{(1-\rho)\left(1 - \frac{18\gamma^2 nL^2}{(1-\sqrt{\rho})^2}\right)}
+ \frac{9\gamma^2 L^2 n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2\left(1 - \frac{18\gamma^2 nL^2}{(1-\sqrt{\rho})^2}\right)},
\end{align*}
which completes the proof.
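As an aside (not part of the original argument), the two geometric-sum estimates used repeatedly above, $\sum_{i\ge 0}\rho^i \le \frac{1}{1-\rho} \le \frac{1}{(1-\sqrt{\rho})^2}$ and $\sum_{0\le j'<j\le k-1}\rho^{k-1-\frac{j+j'}{2}} \le \frac{1}{(1-\sqrt{\rho})^2}$, can be checked numerically; the following is an illustrative sketch.

import numpy as np

def check_geometric_bounds(rho, k=50):
    """Numerically check the geometric-sum estimates used in the proof of Theorem 1."""
    sqrt_rho = np.sqrt(rho)
    # 1/(1 - rho) <= 1/(1 - sqrt(rho))^2
    assert 1.0 / (1.0 - rho) <= 1.0 / (1.0 - sqrt_rho) ** 2 + 1e-12
    # sum over 0 <= j' < j <= k-1 of rho^{k-1-(j+j')/2} <= 1/(1 - sqrt(rho))^2
    double_sum = sum(rho ** (k - 1 - (j + jp) / 2.0)
                     for j in range(k) for jp in range(j))
    bound = 1.0 / (1.0 - sqrt_rho) ** 2
    assert double_sum <= bound + 1e-12
    return double_sum, bound

for rho in (0.1, 0.5, 0.9, 0.99):
    print(rho, check_geometric_bounds(rho))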


Proof to Corollary 2. Substitute $\gamma = \frac{1}{2L + \sigma\sqrt{K/n}}$ into Theorem 1 and remove the $\mathbb{E}\big\|\frac{\partial f(X_k)\mathbf{1}_n}{n}\big\|^2$ terms on the LHS. We get
\begin{align*}
D_1\,\frac{\sum_{k=0}^{K-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right) \right\|^2}{K}
&\le \frac{2(f(0)-f^*)L}{K} + \frac{(f(0)-f^*)\sigma}{\sqrt{Kn}} + \frac{L\sigma^2}{4nL + 2\sigma\sqrt{Kn}}
+ \frac{L^2 n}{\big(2L+\sigma\sqrt{K/n}\big)^2 D_2}\left( \frac{\sigma^2}{1-\rho} + \frac{9\varsigma^2}{\big(1-\sqrt{\rho}\big)^2} \right) \\
&\le \frac{2(f(0)-f^*)L}{K} + \frac{(f(0)-f^* + L/2)\sigma}{\sqrt{Kn}}
+ \frac{L^2 n}{\big(\sigma\sqrt{K/n}\big)^2 D_2}\left( \frac{\sigma^2}{1-\rho} + \frac{9\varsigma^2}{\big(1-\sqrt{\rho}\big)^2} \right). \tag{16}
\end{align*}
We first show that $D_1$ and $D_2$ are approximately constants when (6) is satisfied, where
\[
D_1 := \frac{1}{2} - \frac{9\gamma^2 L^2 n}{\big(1-\sqrt{\rho}\big)^2 D_2}, \qquad
D_2 := 1 - \frac{18\gamma^2 nL^2}{\big(1-\sqrt{\rho}\big)^2}.
\]
Note that
\[
\gamma^2 \le \frac{\big(1-\sqrt{\rho}\big)^2}{36nL^2} \Longrightarrow D_2 \ge 1/2, \qquad
\gamma^2 \le \frac{\big(1-\sqrt{\rho}\big)^2}{72L^2 n} \Longrightarrow D_1 \ge 1/4.
\]
Since $\gamma^2 \le \frac{n}{\sigma^2 K}$, as long as we have
\[
\frac{n}{\sigma^2 K} \le \frac{\big(1-\sqrt{\rho}\big)^2}{36nL^2} \qquad\text{and}\qquad
\frac{n}{\sigma^2 K} \le \frac{\big(1-\sqrt{\rho}\big)^2}{72L^2 n},
\]
$D_2 \ge 1/2$ and $D_1 \ge 1/4$ will be satisfied. Solving the above inequalities we get (6).
Now with (6) we can safely replace $D_1$ and $D_2$ in (16) with $1/4$ and $1/2$ respectively. Thus
\[
\frac{\sum_{k=0}^{K-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right) \right\|^2}{4K}
\le \frac{2(f(0)-f^*)L}{K} + \frac{(f(0)-f^* + L/2)\sigma}{\sqrt{Kn}}
+ \frac{2L^2 n}{\big(\sigma\sqrt{K/n}\big)^2}\left( \frac{\sigma^2}{1-\rho} + \frac{9\varsigma^2}{\big(1-\sqrt{\rho}\big)^2} \right). \tag{17}
\]
Given (5), the last term is bounded by the second term, completing the proof.
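For intuition, the step size prescribed by Corollary 2 can be computed directly; the following sketch uses purely illustrative constants (the values of L, sigma, K, and n below are assumptions for the example, not values from the paper).

import math

def corollary2_stepsize(L, sigma, K, n):
    """Step size gamma = 1 / (2L + sigma * sqrt(K / n)) from Corollary 2."""
    return 1.0 / (2.0 * L + sigma * math.sqrt(K / n))

L, sigma, K = 1.0, 1.0, 10000   # illustrative numbers only
for n in (1, 8, 16, 32):
    print(n, corollary2_stepsize(L, sigma, K, n))

Note that for fixed $K$, the prescribed $\gamma$ grows toward $\frac{1}{2L}$ as the number of workers $n$ increases, since the $\sigma\sqrt{K/n}$ term shrinks.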

Proof to Theorem 3. This can be seen from a simple analysis showing that the $\rho$ and $\sqrt{\rho}$ for this $W$ are asymptotically $1-\frac{16\pi^2}{3n^2}$ and $1-\frac{8\pi^2}{3n^2}$, respectively, when $n$ is large. Then by requiring (6) we need $n \le O(K^{1/6})$. To satisfy (5) we need $n \le O(K^{1/9})$ when $\varsigma = 0$ and $n \le O(K^{1/13})$ when $\varsigma > 0$. This completes the proof.

We have the following theorem showing that the distance between the local optimization variables and their average converges with an $O(1/K)$ rate, where the "$O$" swallows $n$, $\rho$, $\sigma$, $\varsigma$, $L$ and $f(0)-f^*$. This means it is safe to use any worker's local result to get a good estimate of the solution:

Theorem 6. With $\gamma = \frac{1}{2L+\sigma\sqrt{K/n}}$, under the same assumptions as in Corollary 2 we have
\[
(Kn)^{-1}\sum_{k=0}^{K-1}\sum_{i=1}^{n}\mathbb{E}\left\| \frac{\sum_{i'=1}^{n}x_{k,i'}}{n} - x_{k,i} \right\|^2 \le \frac{n\gamma^2 A}{D_2},
\]
where
\[
A := \frac{2\sigma^2}{1-\rho} + \frac{18\varsigma^2}{\big(1-\sqrt{\rho}\big)^2}
+ \frac{L^2}{D_1}\left( \frac{\sigma^2}{1-\rho} + \frac{9\varsigma^2}{\big(1-\sqrt{\rho}\big)^2} \right)
+ \frac{18}{\big(1-\sqrt{\rho}\big)^2}\left( \frac{f(0)-f^*}{\gamma K} + \frac{\gamma L\sigma^2}{2nD_1} \right).
\]
Choosing $\gamma$ as in Corollary 2, we can see that consensus is achieved at the rate $O(1/K)$.
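To make the $O(1/K)$ claim concrete (a one-line check, not part of the original statement): with $\gamma = \frac{1}{2L+\sigma\sqrt{K/n}}$ we have $\gamma^2 \le \frac{n}{\sigma^2 K}$, so
\[
\frac{n\gamma^2 A}{D_2} \le \frac{n^2 A}{\sigma^2 K D_2},
\]
and since $A$ stays bounded as $K$ grows (its $\frac{f(0)-f^*}{\gamma K}$ term vanishes and the remaining terms do not grow with $K$), the right-hand side is $O(1/K)$ once $n$, $\rho$, $\sigma$, $\varsigma$, $L$ and $f(0)-f^*$ are treated as constants.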
Proof to Theorem 6. From (14) with $\gamma = \frac{1}{2L+\sigma\sqrt{K/n}}$ we have
\begin{align*}
\frac{\sum_{k=0}^{K-1}M_k}{K}
&\le \frac{2\gamma^2 n\sigma^2}{(1-\rho)D_2} + \frac{18\gamma^2 n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2 D_2}
+ \frac{18\gamma^2}{\big(1-\sqrt{\rho}\big)^2 D_2}\cdot\frac{\sum_{k=0}^{K-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right)\mathbf{1}_n^\top \right\|^2}{K} \\
&= \frac{2\gamma^2 n\sigma^2}{(1-\rho)D_2} + \frac{18\gamma^2 n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2 D_2}
+ \frac{18\gamma^2 n}{\big(1-\sqrt{\rho}\big)^2 D_2}\cdot\frac{\sum_{k=0}^{K-1}\mathbb{E}\left\| \nabla f\!\left(\frac{X_k\mathbf{1}_n}{n}\right) \right\|^2}{K} \\
&\overset{\text{(Corollary 2)}}{\le} \frac{2\gamma^2 n\sigma^2}{(1-\rho)D_2} + \frac{18\gamma^2 n\varsigma^2}{\big(1-\sqrt{\rho}\big)^2 D_2}
+ \frac{\gamma^2 L^2 n}{D_1 D_2}\left( \frac{\sigma^2}{1-\rho} + \frac{9\varsigma^2}{\big(1-\sqrt{\rho}\big)^2} \right)
+ \frac{18\gamma^2 n}{\big(1-\sqrt{\rho}\big)^2 D_2}\left( \frac{f(0)-f^*}{\gamma K} + \frac{\gamma L\sigma^2}{2nD_1} \right) \\
&= \frac{n\gamma^2}{D_2}A.
\end{align*}
Noting that $(Kn)^{-1}\sum_{k=0}^{K-1}\sum_{i=1}^{n}\mathbb{E}\big\| \frac{\sum_{i'=1}^{n}x_{k,i'}}{n} - x_{k,i} \big\|^2 = \frac{\sum_{k=0}^{K-1}M_k}{K}$, this completes the proof.
