
2017 IEEE 56th Annual Conference on Decision and Control (CDC)

December 12-15, 2017, Melbourne, Australia

Accelerated Distributed Nesterov Gradient Descent for Convex and Smooth Functions

Guannan Qu, Na Li

Abstract— This paper considers the distributed optimization problem over a network, where the objective is to optimize a global function formed by an average of local functions, using only local computation and communication. We develop an Accelerated Distributed Nesterov Gradient Descent (Acc-DNGD) method for convex and smooth objective functions. We show that it achieves a O(1/t^{1.4−ε}) (∀ε ∈ (0, 1.4)) convergence rate when a vanishing step size is used. The convergence rate can be improved to O(1/t²) when we use a fixed step size and the objective functions satisfy a special property. To the best of our knowledge, Acc-DNGD is the fastest among all distributed gradient-based algorithms that have been proposed so far.

I. INTRODUCTION

Given a set of agents N = {1, 2, . . . , n}, each of which has a local convex cost function f_i(x) : R^N → R, the objective of distributed optimization is to find x that minimizes the average of all the functions,

  min_{x ∈ R^N} f(x) ≜ (1/n) Σ_{i=1}^n f_i(x)

using local communication and local computation. The local communication is defined through a connected and undirected communication graph G = (V, E), where the nodes V = N and edges E ⊂ V × V. This problem has found various applications in multi-agent control, distributed state estimation over sensor networks, large scale computation in machine learning, etc. [1]–[3].

There exist many studies on developing distributed algorithms for this problem, e.g., [4]–[15], most of which are distributed gradient descent algorithms, where each iteration is composed of a consensus step and a gradient descent step. These methods have achieved sublinear convergence rates (usually O(1/√t)) for convex functions. When the functions are nonsmooth, the sublinear convergence rates match Centralized Gradient Descent (CGD). More recent work has improved these results for smooth functions, by adding a correction term [16]–[18], or using a gradient estimation sequence [19]–[25]. With these techniques, papers [16], [22] achieve a O(1/t) convergence rate for smooth functions, matching the rate of CGD. Additionally, if strong convexity is further assumed, papers [16]–[18], [22]–[25] achieve a linear convergence rate, matching the rate of CGD as well.

It is known that among all centralized gradient based algorithms, Centralized Nesterov Gradient Descent (CNGD) [26] achieves the optimal convergence rate in terms of first-order oracle complexity. For μ-strongly convex and L-smooth functions, it achieves a O((1 − √(μ/L))^t) convergence rate; for convex and L-smooth functions, it achieves a O(1/t²) convergence rate. These nice convergence rates lead to the question of this paper: how to decentralize the Nesterov Gradient method to achieve similar convergence rates? Our recent work [27] has studied the μ-strongly convex and L-smooth case. This paper will focus on the convex and L-smooth case (without the strong convexity assumption). Previous work in this line includes [28], which develops the Distributed Nesterov Gradient (D-NG) method and shows that it has a convergence rate of O(log t / t),¹ which is not faster than the rate of CGD (O(1/t)).

In this paper, we propose an Accelerated Distributed Nesterov Gradient Descent (Acc-DNGD) method. We show that it achieves a O(1/t^{1.4−ε}) (for any ε ∈ (0, 1.4)) convergence rate when a vanishing step size is used. We further show that the convergence rate can be improved to O(1/t²) when we use a fixed step size and the objective function is a composition of a linear map and a strongly-convex and smooth function. Both rates are faster than what CGD and CGD-based distributed methods can achieve (O(1/t)). To the best of the authors' knowledge, the O(1/t^{1.4−ε}) rate is the fastest among all distributed gradient-based algorithms proposed so far.²

Our algorithm is a combination of CNGD and a gradient estimation scheme. The gradient estimation scheme has been studied under various contexts in [19]–[25]. As [22] has pointed out, when combining the gradient estimation scheme with a centralized algorithm, the resulting distributed algorithm could potentially match the convergence rate of the centralized algorithm. The results in this paper show that, although combining the scheme with CNGD will not give a convergence rate (O(1/t^{1.4−ε})) matching that of CNGD (O(1/t²)), it does improve over previously known CGD-based distributed algorithms (O(1/t)).

In the rest of the paper, Section II formally defines the problem and presents our algorithm and results. Section III proves the convergence rates. Section IV provides numerical simulations and Section V concludes the paper.

Notations. In this paper, n is the number of agents, and N is the dimension of the domain of the f_i's. Notation ‖·‖ denotes the 2-norm for vectors, and the Frobenius norm for matrices. Notation ‖·‖_* denotes the spectral norm for matrices. Notation ⟨·, ·⟩ denotes the inner product for vectors. Notation ρ(·) denotes the spectral radius for square matrices, and 1 denotes the n-dimensional all-one column vector. All vectors, when having dimension N (the dimension of the domain of the f_i's), will be regarded as row vectors. As a special case, all gradients ∇f_i(x) and ∇f(x) are interpreted as N-dimensional row vectors. Notation "≤", when applied to vectors of the same dimension, denotes element-wise "less than or equal to".

Guannan Qu and Na Li are affiliated with the John A. Paulson School of Engineering and Applied Sciences at Harvard University. Email: [email protected], [email protected]. This work is supported under NSF ECCS 1608509 and NSF CAREER 1553407.

¹ Reference [28] also studies an algorithm that uses multiple consensus steps per iteration, and achieves a O(1/t²) convergence rate. In this paper, we focus on algorithms that only use one or a constant number of consensus steps per iteration.

² We only include algorithms that are gradient based (without extra information like Hessian), and use one (or a constant number of) step(s) of consensus after each gradient evaluation.
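The algorithm in Section II will rely on a doubly stochastic consensus matrix W adapted to the graph G. As a concrete illustration, the sketch below builds such a W with lazy Metropolis weights; this is one standard recipe, not necessarily the specific constructions of [29], [30], and the helper name is our own.

```python
import numpy as np

def metropolis_weights(n, edges):
    """Build a doubly stochastic consensus matrix W for an undirected graph.

    `edges` is a list of pairs (i, j) in E. The result is symmetric (hence
    doubly stochastic), has w_ij > 0 exactly on the edges, and w_ii > 0.
    """
    W = np.zeros((n, n))
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    W[np.diag_indices(n)] = 1.0 - W.sum(axis=1)  # positive self-weights
    return W
```

For a connected graph this W satisfies properties (a) and (b) of Section II-C, and σ can be taken as the second largest eigenvalue magnitude of W.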


II. PROBLEM AND ALGORITHM

A. Problem Formulation

Consider n agents, N = {1, 2, . . . , n}, each of which has a function f_i : R^N → R. The objective of distributed optimization is to find x to minimize the average of all the functions, i.e.

  min_{x ∈ R^N} f(x) ≜ (1/n) Σ_{i=1}^n f_i(x)   (1)

using local communication and local computation. The local communication is defined through a connected undirected communication graph G = (V, E), where the nodes V = N and the edges E ⊂ V × V. Agents i and j can send information to each other if and only if (i, j) ∈ E. The local computation means that each agent can only make its decision based on the local function f_i and the information obtained from its neighbors.

Throughout the paper, we assume that f has a minimizer x* with optimal value f*. We will use the following assumptions in the rest of the paper.

Assumption 1. ∀i ∈ N, f_i is convex. As a result, f is also convex.

Assumption 2. ∀i ∈ N, f_i is L-smooth, that is, f_i is differentiable and the gradient is L-Lipschitz continuous, i.e., ∀x, y ∈ R^N, ‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖. As a result, f is L-smooth.

Assumption 3. The set of minimizers of f is compact.

B. Centralized Nesterov Gradient Descent (CNGD)

We briefly introduce a version of centralized Nesterov Gradient Descent (CNGD) that is derived from Section 2.2 of [26]. CNGD keeps updating three variables x(t), v(t), y(t) ∈ R^N, starting from an initial point x(0) = v(0) = y(0) ∈ R^N, and the update equation is given by

  x(t+1) = y(t) − η∇f(y(t))   (2a)
  v(t+1) = v(t) − (η/α_t)∇f(y(t))   (2b)
  y(t+1) = (1 − α_{t+1})x(t+1) + α_{t+1}v(t+1),   (2c)

where (α_t)_{t=0}^∞ is defined by an arbitrarily chosen α_0 ∈ (0, 1) and the update equation α_{t+1}² = (1 − α_{t+1})α_t², in which α_{t+1} always takes the unique solution in (0, 1). The following theorem (adapted from [26, Thm. 2.2.1, Lem. 2.2.4]) gives the convergence rate of CNGD.

Theorem 1. In CNGD (2), under Assumptions 1 and 2, when 0 < η ≤ 1/L, we have f(x(t)) − f* = O(1/t²).

C. Our Algorithm: Accelerated Distributed Nesterov Gradient Descent (Acc-DNGD)

We design our algorithm based on a consensus matrix W = [w_ij] ∈ R^{n×n}. Here w_ij stands for how much agent i weighs its neighbor j's information. W satisfies the following properties:
(a) ∀(i, j) ∈ E, w_ij > 0. ∀i, w_ii > 0. w_ij = 0 elsewhere.
(b) Matrix W is doubly stochastic, i.e. Σ_i w_ij = Σ_j w_ij = 1 for all i, j ∈ N.

As a result, there exists σ ∈ (0, 1), depending on the spectrum of W, such that for any ω ∈ R^{n×1} we have the "averaging property" ‖Wω − 1ω̄‖ ≤ σ‖ω − 1ω̄‖, where ω̄ = (1/n)1^T ω (the average of the entries in ω) [29]. How to select a consensus matrix to satisfy these properties has been intensely studied, e.g. [29], [30].

In our algorithm Acc-DNGD, each agent keeps a copy of the three variables in CNGD, x_i(t), v_i(t), y_i(t), and in addition s_i(t), which serves as a gradient estimator. The initial condition is x_i(0) = v_i(0) = y_i(0) = 0 and s_i(0) = ∇f(0),³ and the algorithm updates as follows:

  x_i(t+1) = Σ_j w_ij y_j(t) − η_t s_i(t)   (3a)
  v_i(t+1) = Σ_j w_ij v_j(t) − (η_t/α_t) s_i(t)   (3b)
  y_i(t+1) = (1 − α_{t+1}) x_i(t+1) + α_{t+1} v_i(t+1)   (3c)
  s_i(t+1) = Σ_j w_ij s_j(t) + ∇f_i(y_i(t+1)) − ∇f_i(y_i(t))   (3d)

where [w_ij]_{n×n} are the consensus weights and η_t ∈ (0, 1/L) are the step sizes. Sequence (α_t)_{t≥0} is generated by first selecting α_0 = √(η_0 L) ∈ (0, 1), then, given α_t ∈ (0, 1), selecting α_{t+1} to be the unique solution in (0, 1) of the following equation,⁴

  α_{t+1}² = (η_{t+1}/η_t)(1 − α_{t+1}) α_t².

We will consider two variants of the algorithm with the following two step size rules.
• Vanishing step size rule: η_t = η/(t + t_0)^β for some η ∈ (0, 1/L), β ∈ (0, 2) and t_0 ≥ 1.
• Fixed step size rule: η_t = η > 0.

Because w_ij = 0 when (i, j) ∉ E, each node i only needs to send x_i(t), v_i(t), y_i(t) and s_i(t) to its neighbors. Therefore, the algorithm can be operated in a fully distributed fashion with only local communication. The additional term s_i(t) allows each agent to obtain an estimate of the global gradient (1/n)Σ_j ∇f_j(y_j(t)) (for more details, see Section II-D). Compared with distributed algorithms without this estimation term, it helps improve the convergence speed. As a result, we call this method the Accelerated Distributed Nesterov Gradient Descent (Acc-DNGD) method.

³ We note that the initial condition s_i(0) = ∇f(0) requires the agents to conduct an initial run of consensus. We impose this initial condition for technical reasons, while we expect the results of this paper to hold for the relaxed initial condition s_i(0) = ∇f_i(0), which does not need initial coordination. We use the relaxed initial condition in numerical simulations.

⁴ Without causing any confusion with the α_t in (2), in the rest of the paper we abuse the notation of α_t.

D. Intuition Behind our Algorithm

Here we briefly explain how the algorithm works. First we note that Eq. (3a)-(3c) is similar to Eq. (2), except for the "weighted average terms" (Σ_j w_ij y_j(t), Σ_j w_ij v_j(t)) and the new term s_i(t) that replaces the gradient terms. (A minimal numerical sketch of update (3) is given below.)
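The following sketch implements a full run of the updates (3a)-(3d) with the vanishing step size rule. It is an illustration under stated assumptions — a precomputed doubly stochastic W (e.g. from the metropolis_weights helper above) and local gradient oracles grads[i] — not the exact simulation code used in Section IV.

```python
import numpy as np

def acc_dngd(grads, W, eta0, L, T, beta=0.61, t0=1, N=4):
    """Sketch of Acc-DNGD, eqs. (3a)-(3d), with eta_t = eta0/(t+t0)^beta.

    grads: list of n callables; grads[i](y) is the length-N row gradient of f_i at y.
    W:     n-by-n doubly stochastic consensus matrix.
    """
    n = len(grads)
    x = np.zeros((n, N)); v = np.zeros((n, N)); y = np.zeros((n, N))
    g = np.array([grads[i](y[i]) for i in range(n)])
    s = g.copy()                          # relaxed initialization s_i(0) = grad f_i(0)
    eta = lambda t: eta0 / (t + t0) ** beta
    alpha = np.sqrt(eta(0) * L)           # alpha_0 = sqrt(eta_0 L)
    for t in range(T):
        x = W @ y - eta(t) * s                        # (3a)
        v = W @ v - (eta(t) / alpha) * s              # (3b)
        # alpha_{t+1}: unique root in (0,1) of a^2 = (eta_{t+1}/eta_t)(1-a)*alpha_t^2
        c = (eta(t + 1) / eta(t)) * alpha ** 2
        alpha = (-c + np.sqrt(c ** 2 + 4 * c)) / 2
        y = (1 - alpha) * x + alpha * v               # (3c)
        g_new = np.array([grads[i](y[i]) for i in range(n)])
        s = W @ s + g_new - g                         # (3d)
        g = g_new
    return x
```

Each row of the matrix operations corresponds to one agent; a truly distributed implementation would replace W @ y by each agent averaging over its own neighbors.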

We have the following circular arguments that explain why algorithm (3) should work.

Argument 1: Assuming s_i(t) ≈ (1/n)Σ_{j=1}^n ∇f_j(y_j(t)), then the algorithm converges.

To see this, notice that the "weighted average terms" (Σ_j w_ij y_j(t), Σ_j w_ij v_j(t)) ensure that different agents reach "consensus", i.e. ∀j, x_i(t) ≈ x_j(t), y_i(t) ≈ y_j(t) and v_i(t) ≈ v_j(t) (as a result, Σ_j w_ij y_j(t) ≈ y_i(t), Σ_j w_ij v_j(t) ≈ v_i(t)). If we further assume that s_i(t) ≈ (1/n)Σ_j ∇f_j(y_j(t)), then since y_j(t) ≈ y_i(t), we have s_i(t) ≈ (1/n)Σ_j ∇f_j(y_i(t)) = ∇f(y_i(t)). Hence (3a)-(3c) can be rewritten as

  x_i(t+1) ≈ y_i(t) − η_t ∇f(y_i(t))   (4a)
  v_i(t+1) ≈ v_i(t) − (η_t/α_t) ∇f(y_i(t))   (4b)
  y_i(t+1) ≈ (1 − α_{t+1}) x_i(t+1) + α_{t+1} v_i(t+1),   (4c)

which is exactly (2) (except for the step size rule), and hence we expect convergence.

Argument 2: Assuming the algorithm converges, then s_i(t) ≈ (1/n)Σ_{j=1}^n ∇f_j(y_j(t)).

To see this, from (3d) and the fact that Σ_i w_ij = 1, we have s̄(t) := (1/n)Σ_{j=1}^n s_j(t) = (1/n)Σ_{j=1}^n ∇f_j(y_j(t)). Assuming the convergence of the algorithm, the input to (3d) vanishes, since ‖∇f_i(y_i(t+1)) − ∇f_i(y_i(t))‖ ≤ L‖y_i(t+1) − y_i(t)‖ → 0. Because of the vanishing input, and the "taking weighted average of neighbors" step (Σ_j w_ij s_j(t)) in (3d), we can expect that eventually s_i(t) ≈ s̄(t) = (1/n)Σ_{j=1}^n ∇f_j(y_j(t)).

Though Arguments 1 and 2 only form a circular argument, they provide a high-level guideline for the rigorous proof in Section III. To give the rigorous proof, it turns out that we need to use a vanishing step size in (3) instead of a fixed step size as in (2) (we can still use a fixed step size if the f_i have special structures, cf. Theorem 3). This slows down the convergence rate of our algorithm (1/t^{1.4−ε}) compared to CNGD (1/t²) (cf. Theorem 2).

An observation from the above circular argument is that s_i(t) acts as a "gradient estimator" that estimates the average gradient (1/n)Σ_j ∇f_j(y_j(t)). This observation can be used to devise a stopping criterion for the algorithm (e.g. the algorithm stops when ‖s_i(t)‖ is sufficiently close to 0).

E. Convergence of the Algorithm

To state the convergence results, we need to define the average sequence x̄(t) = (1/n)Σ_{i=1}^n x_i(t) ∈ R^{1×N}. We summarize our convergence results below.

Theorem 2. Suppose Assumptions 1, 2 and 3 are true, and without loss of generality we assume v̄(0) ≠ x*. Let the step size be η_t = η/(t + t_0)^β with β = 0.6 + ε where ε ∈ (0, 1.4). Suppose the following conditions are met.

(i) t_0 > 1 / [ min( ( (σ+3)/(σ+2) · 3/4 )^{σ/(28β)}, ( 16/(15+σ) )^{1/β} ) − 1 ].

(ii) η < min( σ²/(9³ L), (1−σ)⁴/(36866 L) ).

(iii) η < [ D(β, t_0)(β − 0.6)(1 − σ)² / ( 9216 (t_0+1)^{2−β} L^{2/3} [4 + R²/‖v̄(0) − x*‖²] ) ]^{3/2},

where D(β, t_0) = 1 / ( (t_0+3)² e^{16 + 6/(2−β)} ) and R is the diameter of the (2f(x̄(0)) − f* + 2L‖v̄(0) − x*‖²)-level set of f.⁵

Then, f(x̄(t)) − f* = O(1/t^{1.4−ε}).

⁵ Here we have used the fact that by Assumptions 1 and 3, all level sets of f are bounded. See Proposition B.9 of [31].

In Theorem 2, condition (i) intends to make η_t/η_{t+1} close to 1, which is required in Lemma 6 (iii), and conditions (ii)(iii) intend to make η_t close to 0, which is required in Lemma 6 (ii). While the conditions are needed for the proof, we expect the same result to hold if we simply let t_0 = 1 and η = 1/(2L), which is what we choose in the simulations in Section IV. The reason is that, regardless of the value of η and t_0, we have η_t → 0 and η_t/η_{t+1} → 1, and hence for large enough t, η_t and η_t/η_{t+1} will automatically be close to 0 and 1 respectively.

While in Theorem 2 we require β > 0.6, we conjecture that the algorithm will still converge even if β ∈ [0, 0.6], and that the convergence rate will be O(1/t^{2−β}). We note that β = 0 corresponds to the case of a fixed step size. In Section IV we will use numerical methods to test this conjecture.

In the next theorem, we provide a O(1/t²) convergence result when a fixed step size is used and the objective functions belong to a special class.

Theorem 3. Assume each f_i(x) can be written as f_i(x) = h_i(xA_i), where A_i is a non-zero N × M_i matrix, and h_i(x) : R^{1×M_i} → R is a μ_0-strongly convex and L_0-smooth function. Suppose we use the fixed step size rule η_t = η, with

  0 < η < min( σ²/(9³ L), μ^{1.5}(1 − σ)³/(L^{2.5} · 3456^{1.5}) )

where L = L_0 ν with ν = max_i ‖A_i‖_*² (where ‖A_i‖_* means the spectral norm of A_i), and μ = μ_0 γ with γ being the smallest non-zero eigenvalue of matrix A = (1/n)Σ_{i=1}^n A_i A_i^T. Then, we have f(x̄(t)) − f* = O(1/t²).

An important example of the type of function f_i(x) in Theorem 3 is the square error for linear regression when the sample size is less than the parameter dimension.
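As an illustration of Theorem 3's function class, the sketch below sets up such an underdetermined least-squares instance; the dimensions, data, and names are our own choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, M_i = 20, 50, 5      # M_i < N: fewer local samples than parameters

# Each agent i holds a data matrix A_i (N x M_i) and targets c_i (1 x M_i).
A = [rng.standard_normal((N, M_i)) for _ in range(n)]
c = [rng.standard_normal((1, M_i)) for _ in range(n)]

# f_i(x) = h_i(x A_i) with h_i(u) = ||u - c_i||^2, which is 2-strongly convex
# and 2-smooth in u; f_i itself is convex but not strongly convex in x,
# since A_i has rank at most M_i < N.
def f_i(x, i):
    return float(np.sum((x @ A[i] - c[i]) ** 2))

def grad_f_i(x, i):        # 1-by-N row gradient, matching the paper's convention
    return 2 * (x @ A[i] - c[i]) @ A[i].T
```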

Remark 1. All the step size conditions used in this section are conservative. This is because we have used coarse spectral bounds in the proofs (see Lemmas 10, 11, 12), in order to simplify mathematical calculations. In numerical simulations, we show that large step sizes can be used. When applying the algorithm in practice, this may require trial and error to pre-tune the step size.

III. CONVERGENCE ANALYSIS

In this section, we will provide the proof of the convergence results. We will first provide a proof overview in Section III-A and then defer the detailed proof to the rest of the section. Due to space limit, we omit some proofs, which can be found in the full version of this paper [32].

A. Proof Overview

We introduce matrix notations x(t), v(t), y(t), s(t), ∇(t) ∈ R^{n×N} to simplify the mathematical expressions,⁶

  x(t) = [x_1(t)^T, x_2(t)^T, . . . , x_n(t)^T]^T
  v(t) = [v_1(t)^T, v_2(t)^T, . . . , v_n(t)^T]^T
  y(t) = [y_1(t)^T, y_2(t)^T, . . . , y_n(t)^T]^T
  s(t) = [s_1(t)^T, s_2(t)^T, . . . , s_n(t)^T]^T
  ∇(t) = [∇f_1(y_1(t))^T, ∇f_2(y_2(t))^T, . . . , ∇f_n(y_n(t))^T]^T.

⁶ Without causing any confusion with notations in (2), in this section we abuse the use of notation x(t), v(t), y(t).

Now our algorithm in (3) can be written as

  x(t+1) = W y(t) − η_t s(t)   (5a)
  v(t+1) = W v(t) − (η_t/α_t) s(t)   (5b)
  y(t+1) = (1 − α_{t+1}) x(t+1) + α_{t+1} v(t+1)   (5c)
  s(t+1) = W s(t) + ∇(t+1) − ∇(t).   (5d)

Apart from the average sequence x̄(t) = (1/n)Σ_{i=1}^n x_i(t) ∈ R^{1×N} that we have defined, we also define several other average sequences: v̄(t) = (1/n)Σ_{i=1}^n v_i(t), ȳ(t) = (1/n)Σ_{i=1}^n y_i(t), s̄(t) = (1/n)Σ_{i=1}^n s_i(t), and g(t) = (1/n)Σ_{i=1}^n ∇f_i(y_i(t)).

Overview of the Proof. We derive a series of lemmas (Lemmas 4, 5, 6 and 7) that will work for both the vanishing and the fixed step size case. We firstly derive the update formula for the average sequences (Lemma 4). Then, we show that the update rule for the average sequences is in fact centralized Nesterov Gradient Descent (CNGD) with inexact gradients [33], where the inexactness is characterized by the "consensus error" ‖y(t) − 1ȳ(t)‖ (Lemma 5). The consensus error is bounded in Lemma 6. Then, we apply the proof of CNGD (see e.g. [26]) to the average sequences in spite of the consensus error, and derive an intermediate result in Lemma 7. Lastly, we finish the proof of Theorem 2 in Section III-C. The proof of Theorem 3 can be found in Appendix-F of [32].

Lemma 4. The following equalities hold.

  x̄(t+1) = ȳ(t) − η_t g(t)   (6a)
  v̄(t+1) = v̄(t) − (η_t/α_t) g(t)   (6b)
  ȳ(t+1) = (1 − α_{t+1}) x̄(t+1) + α_{t+1} v̄(t+1)   (6c)
  s̄(t+1) = s̄(t) + g(t+1) − g(t) = g(t+1)   (6d)

Proof: We omit the proof since these equalities can be easily derived using the fact that W is doubly stochastic. For (6d) we also need to use the fact that s̄(0) = g(0). □
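Property (6d), the backbone of the gradient-estimation scheme, is easy to verify numerically: the tracking update (5d) preserves the row average of s(t) no matter how the iterates move. A small sanity-check sketch, where the toy gradient map and mixing matrix are our own illustrative choices:

```python
import numpy as np

# Check of Lemma 4, eq. (6d): if mean(s(0)) = g(0), then under
# s(t+1) = W s(t) + grad(t+1) - grad(t) we keep mean(s(t)) = g(t).
rng = np.random.default_rng(1)
n, N = 5, 3
P = np.eye(n)[rng.permutation(n)]        # a permutation matrix is doubly stochastic
W = 0.5 * np.eye(n) + 0.5 * P            # doubly stochastic with positive diagonal
grad = lambda y: y ** 2                  # stand-in for the stacked gradients in (5d)
y = rng.standard_normal((n, N))
s = grad(y)                              # rows of s(0) average to g(0)
for _ in range(10):
    y_new = rng.standard_normal((n, N))  # arbitrary iterates; (6d) holds regardless
    s = W @ s + grad(y_new) - grad(y)
    y = y_new
    assert np.allclose(s.mean(axis=0), grad(y).mean(axis=0))
```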
From (6a)-(6c) we see that the sequences x̄(t), v̄(t) and ȳ(t) follow an update rule similar to the CNGD in (2). The only difference is that the g(t) in (6a)-(6c) is not the exact gradient ∇f(ȳ(t)) in CNGD. In the following lemma, we show that g(t) is an inexact gradient.⁷

Lemma 5. Under Assumptions 1, 2, ∀t, g(t) is an inexact gradient of f at ȳ(t) with error O(‖y(t) − 1ȳ(t)‖²) in the sense that, ∀ω ∈ R^N,

  f(ω) ≥ f̂(t) + ⟨g(t), ω − ȳ(t)⟩   (7)
  f(ω) ≤ f̂(t) + ⟨g(t), ω − ȳ(t)⟩ + L‖ω − ȳ(t)‖² + (L/n)‖y(t) − 1ȳ(t)‖²,   (8)

where f̂(t) = (1/n)Σ_{i=1}^n [ f_i(y_i(t)) + ⟨∇f_i(y_i(t)), ȳ(t) − y_i(t)⟩ ].

Proof: We omit the proof and refer the readers to [27, Lem. 4]. □

⁷ For more information regarding why (7) (8) define an "inexact gradient", we refer the readers to [33].

The consensus error ‖y(t) − 1ȳ(t)‖ in the previous lemma is bounded by the following lemma, whose proof is given in Section III-B.

Lemma 6. Suppose the step sizes satisfy
(i) η_t ≥ η_{t+1} > 0,
(ii) η_0 < min( σ²/(9³ L), (1−σ)³/(6144 L) ),
(iii) sup_{t≥0} η_t/η_{t+1} ≤ min( ( (σ+3)/(σ+2) · 3/4 )^{σ/28}, 16/(15+σ) ).
Then, under Assumption 2, we have

  ‖y(t) − 1ȳ(t)‖ ≤ κ√n χ₂(η_t) [ L‖ȳ(t) − x̄(t)‖ + (8/(1−σ)) L η_t ‖g(t)‖ ]

where χ₂ : R → R is a function satisfying 0 < χ₂(η_t) ≤ (2/L^{2/3}) η_t^{1/3}, and κ = 6/(1−σ).

We next provide the following intermediate result. The proof roughly follows the same steps as [26, Lemma 2.2.3], and can be found in Appendix-D of [32].

Lemma 7. Define γ_0 = α_0²/(η_0(1−α_0)) = L/(1−α_0). We define a series of functions (Φ_t : R^N → R)_{t≥0}, with

  Φ_0(ω) = f(x̄(0)) + (γ_0/2)‖ω − v̄(0)‖²

and

  Φ_{t+1}(ω) = (1 − α_t)Φ_t(ω) + α_t [ f̂(t) + ⟨g(t), ω − ȳ(t)⟩ ].   (9)

Then, under Assumptions 1 and 2, the following holds.
(i) We have

  Φ_t(ω) ≤ f(ω) + λ_t(Φ_0(ω) − f(ω))   (10)

where λ_t is defined through λ_0 = 1 and λ_{t+1} = (1 − α_t)λ_t.
(ii) Function Φ_t(ω) can be written as

  Φ_t(ω) = φ*_t + (γ_t/2)‖ω − v̄(t)‖²   (11)

where γ_t is defined through γ_{t+1} = γ_t(1 − α_t), and φ*_t is some real number that satisfies φ*_0 = f(x̄(0)) and

  φ*_{t+1} = (1 − α_t)φ*_t + α_t f̂(t) − (1/2)η_t‖g(t)‖² + α_t⟨g(t), v̄(t) − ȳ(t)⟩.   (12)

B. Proof of the Bounded Consensus Error (Lemma 6)

We will frequently use the following lemmas, whose proofs can be found in Appendix-A of [32].

Lemma 8. The following equalities are true.

  ȳ(t+1) − ȳ(t) = α_{t+1}(v̄(t) − ȳ(t)) − η_t ( α_{t+1}/α_t + 1 − α_{t+1} ) g(t)   (13)
  v̄(t+1) − ȳ(t+1) = (1 − α_{t+1})(v̄(t) − ȳ(t)) + η_t (1 − α_{t+1})(1 − 1/α_t) g(t)   (14)

Lemma 9. Under Assumption 2, the following are true.

  ‖∇(t+1) − ∇(t)‖ ≤ L‖y(t+1) − y(t)‖   (15)
  ‖g(t) − ∇f(ȳ(t))‖ ≤ (L/√n)‖y(t) − 1ȳ(t)‖   (16)

Proof of Lemma 6:

Overview of the proof. The proof is divided into three steps. In step 1, we treat the algorithm (5) as a linear system and derive a linear system inequality (17). In step 2, we analyze the state transition matrix in (17) and prove a few spectral properties. In step 3, we further analyze the linear system (17) and bound the state by the input, from which the conclusion of the lemma follows. Throughout the proof, we will frequently use an easy-to-check fact: α_t is a decreasing sequence.

Step 1: A Linear System Inequality. Define z(t) = [ α_t‖v(t) − 1v̄(t)‖, ‖y(t) − 1ȳ(t)‖, ‖s(t) − 1g(t)‖ ]^T ∈ R³ and b(t) = [0, 0, √n a(t)]^T ∈ R³, where

  a(t) ≜ α_t L‖v̄(t) − ȳ(t)‖ + 2λL η_t ‖g(t)‖

in which λ ≜ 4/(1−σ) > 1. The desired inequality is (17). The proof of (17) is similar to that of [27, Eq. (8)]. Due to space limit, it is omitted and can be found in Appendix-B of [32].

  z(t+1) ≤ G(η_t) z(t) + b(t),  where G(η_t) = [ σ, 0, η_t ;  σ, σ, 2η_t ;  L, 2L, σ + 2η_t L ].   (17)

Step 2: Spectral Properties of G(·). When η is positive, G(η) is a nonnegative matrix and G(η)² is a positive matrix. By the Perron-Frobenius Theorem [34, Thm. 8.5.1], G(η) has a unique largest (in magnitude) eigenvalue that is a positive real with multiplicity 1, and the eigenvalue is associated with an eigenvector with positive entries. We let the unique largest eigenvalue be θ(η) = ρ(G(η)) and let its eigenvector be χ(η) = [χ₁(η), χ₂(η), χ₃(η)]^T, normalized such that χ₃(η) = 1. We give bounds on the eigenvalue and the eigenvector in the following lemmas, whose proofs can be found in Appendix-C of [32].

Lemma 10. When 0 < ηL < 1, we have σ < θ(η) < σ + 4(ηL)^{1/3}, and χ₂(η) ≤ (2/L^{2/3}) η^{1/3}.

Lemma 11. When η ∈ (0, √σ/(2√2 L)), θ(η) ≥ σ + (σηL)^{1/3} and χ₁(η) < η/(σηL)^{1/3}.

Lemma 12. When ζ₁, ζ₂ ∈ (0, σ²/(9³ L)), then χ₁(ζ₁)/χ₁(ζ₂) ≤ max( (ζ₂/ζ₁)^{6/σ}, (ζ₁/ζ₂)^{6/σ} ) and χ₂(ζ₁)/χ₂(ζ₂) ≤ max( (ζ₂/ζ₁)^{28/σ}, (ζ₁/ζ₂)^{28/σ} ).

It is easy to check that, under our step size condition (ii), all the conditions of Lemmas 10, 11, 12 are satisfied.
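Lemma 10 can be probed numerically by forming G(η) and computing its spectral radius directly; the sketch below (with σ, L chosen arbitrarily for illustration) checks the bound σ < θ(η) < σ + 4(ηL)^{1/3} for a few step sizes.

```python
import numpy as np

# Numerical illustration of Lemma 10 for the matrix G(eta) in (17).
sigma, L = 0.5, 1.0
for eta in [1e-4, 1e-3, 1e-2]:
    G = np.array([[sigma, 0.0,       eta],
                  [sigma, sigma,     2 * eta],
                  [L,     2 * L,     sigma + 2 * eta * L]])
    theta = max(abs(np.linalg.eigvals(G)))          # spectral radius rho(G(eta))
    upper = sigma + 4 * (eta * L) ** (1 / 3)
    assert sigma < theta < upper
    print(f"eta={eta:.0e}  theta={theta:.4f}  upper bound={upper:.4f}")
```

The θ(η) − σ gap scales like (ηL)^{1/3}, which is exactly why the χ₂(η_t) factor in Lemma 6 carries an η_t^{1/3} dependence.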


Step 3: Bound the state by the input. With the above preparations, we now prove, by induction, the following statement:

  z(t) ≤ √n a(t) κ χ(η_t)   (18)

where κ = 6/(1−σ). Equation (18) is true for t = 0, since the left hand side is zero when t = 0. Assume (18) holds for t. We now show (18) is true for t + 1. We divide the rest of the proof into two sub-steps. Briefly speaking, step 3.1 proves that the input to the system (17), a(t+1), does not decrease too much compared to a(t) (namely a(t+1) ≥ ((σ+3)/4) a(t)); while step 3.2 shows that the state z(t+1), compared to z(t), decreases enough for (18) to hold for t + 1.

Step 3.1: We prove that a(t+1) ≥ ((σ+3)/4) a(t). By (14),

  a(t+1) = α_{t+1}L‖(1 − α_{t+1})(v̄(t) − ȳ(t)) + (1 − α_{t+1})(1 − 1/α_t)η_t g(t)‖ + 2λη_{t+1}L‖g(t+1)‖
  ≥ α_{t+1}(1 − α_{t+1})L‖v̄(t) − ȳ(t)‖ − (α_{t+1}/α_t)(1 − α_{t+1})(1 − α_t)η_t L‖g(t)‖ + 2λη_{t+1}L‖g(t)‖ − 2λη_{t+1}L‖g(t+1) − g(t)‖.

Therefore, we have

  a(t) − a(t+1)
  ≤ [ α_t − α_{t+1}(1 − α_{t+1}) ] L‖v̄(t) − ȳ(t)‖ + [ (α_{t+1}/α_t)(1 − α_{t+1})(1 − α_t)η_t L + 2λη_t L − 2λη_{t+1}L ] ‖g(t)‖ + 2λη_{t+1}L‖g(t+1) − g(t)‖
  ≤ [ α_t − α_{t+1}(1 − α_{t+1}) ] L‖v̄(t) − ȳ(t)‖ + ( η_t + 2λ(η_t − η_{t+1}) ) L‖g(t)‖ + 2λη_{t+1}L‖g(t+1) − g(t)‖
  ≤ max( 1 − α_{t+1}/α_t + α_{t+1}²/α_t , 1/(2λ) + (η_t − η_{t+1})/η_t ) a(t) + 2λη_{t+1}L‖g(t+1) − g(t)‖   (19)

where in the last inequality we have used the elementary fact that for four positive numbers a₁, a₂, a₃, a₄ and x, y ≥ 0, we have a₁x + a₂y = (a₁/a₃)a₃x + (a₂/a₄)a₄y ≤ max(a₁/a₃, a₂/a₄)(a₃x + a₄y).

Next, we expand ‖g(t+1) − g(t)‖:

  ‖g(t+1) − g(t)‖
  ≤ ‖g(t+1) − ∇f(ȳ(t+1))‖ + ‖g(t) − ∇f(ȳ(t))‖ + ‖∇f(ȳ(t+1)) − ∇f(ȳ(t))‖
  (a)≤ (L/√n)‖y(t+1) − 1ȳ(t+1)‖ + (L/√n)‖y(t) − 1ȳ(t)‖ + L‖ȳ(t+1) − ȳ(t)‖
  (b)≤ (L/√n)[ σα_t‖v(t) − 1v̄(t)‖ + 2‖y(t) − 1ȳ(t)‖ + 2η_t‖s(t) − 1g(t)‖ ] + a(t)
  (c)≤ Lσκχ₁(η_t)a(t) + 2Lκχ₂(η_t)a(t) + 2Lη_tκχ₃(η_t)a(t) + a(t)
  (d)≤ a(t)[ Lσκ · η_t/(ση_tL)^{1/3} + 2Lκ · (2/L^{2/3})η_t^{1/3} + 2Lη_tκ + 1 ]
  (e)≤ 8κ a(t).   (20)

Here (a) is due to (16); (b) is due to the second row of (17) and the fact that a(t) ≥ L‖ȳ(t+1) − ȳ(t)‖; (c) is due to the induction assumption (18). In (d), we have used the bound on χ₁(·) (Lemma 11), χ₂(·) (Lemma 10), and χ₃(η_t) = 1. In (e), we have used η_t L < 1, σ < 1 and κ > 1.

Combining (20) with (19), we have

  a(t) − a(t+1)
  ≤ max( 1 − α_{t+1}/α_t + α_{t+1}²/α_t , 1/(2λ) + (η_t − η_{t+1})/η_t ) a(t) + 16κλη_{t+1}L a(t)
  ≤ [ max( 1 − η_{t+1}/η_t + 2α_{t+1} , (1−σ)/8 + (η_t − η_{t+1})/η_t ) + (384/(1−σ)²) η_0 L ] a(t)

where in the last inequality we have used the fact that

  1 − α_{t+1}/α_t + α_{t+1}²/α_t < 1 − α_{t+1}²/α_t² + α_{t+1} = 1 − (η_{t+1}/η_t)(1 − α_{t+1}) + α_{t+1} < 1 − η_{t+1}/η_t + 2α_{t+1}.

By step size condition (iii), η_t/η_{t+1} ≤ 16/(15+σ), and hence 1 − η_{t+1}/η_t ≤ (1−σ)/16. By step size condition (ii), 2α_{t+1} ≤ 2α_0 = 2√(η_0 L) ≤ (1−σ)/16, and (384/(1−σ)²) η_0 L < (1−σ)/16. Combining the above, we have a(t) − a(t+1) ≤ ((1−σ)/4) a(t). Hence a(t+1) ≥ ((3+σ)/4) a(t).

Step 3.2: Finishing the induction. We have

  z(t+1) (a)≤ G(η_t)z(t) + b(t)
  (b)≤ G(η_t)√n a(t)κχ(η_t) + √n a(t)χ(η_t)
  (c)= θ(η_t)√n a(t)κχ(η_t) + √n a(t)χ(η_t)
  = √n a(t)χ(η_t)( κθ(η_t) + 1 )
  (d)≤ √n a(t+1) · (4/(3+σ)) · ( κ(σ+1)/2 + 1 ) · χ(η_{t+1}) · max( χ₁(η_t)/χ₁(η_{t+1}), χ₂(η_t)/χ₂(η_{t+1}), 1 )
  (e)= √n a(t+1)χ(η_{t+1}) · ((σ+2)/3)κ · (4/(σ+3)) · max( χ₁(η_t)/χ₁(η_{t+1}), χ₂(η_t)/χ₂(η_{t+1}), 1 )
  (f)≤ √n a(t+1)κχ(η_{t+1})   (21)

where (a) is due to (17); (b) is due to the induction assumption (18); (c) is because θ(η_t) is an eigenvalue of G(η_t) with eigenvector χ(η_t); (d) is due to step 3.1 and θ(η_t) < σ + 4(η_0L)^{1/3} < (1+σ)/2 (by step size condition (ii) and Lemma 10); in (e), we have used, by the definition of κ, κ(σ+1)/2 + 1 = ((σ+2)/3)κ. For (f), we have used that, by Lemma 12 and step size condition (iii),

  max( χ₁(η_t)/χ₁(η_{t+1}), χ₂(η_t)/χ₂(η_{t+1}), 1 ) ≤ (η_t/η_{t+1})^{28/σ} ≤ ((σ+3)/(σ+2)) · (3/4).

Now, (18) is proven for t + 1, and hence is true for all t. Therefore, we have

  ‖y(t) − 1ȳ(t)‖ ≤ κ√n a(t)χ₂(η_t).

Notice that a(t) = α_tL‖v̄(t) − ȳ(t)‖ + 2λLη_t‖g(t)‖ ≤ L‖x̄(t) − ȳ(t)‖ + (8/(1−σ))Lη_t‖g(t)‖. The statement of the lemma follows. □

C. Proof of Theorem 2

We first introduce Lemma 13 regarding the asymptotic behavior of α_t and λ_t. The proof can be found in Appendix-E of [32].

Lemma 13. When the vanishing step size is used (η_t = η/(t+t_0)^β, t_0 ≥ 1, β ∈ (0, 2)), and η_0 < 1/(4L) (equivalently α_0 < 1/2), we have
(i) α_t ≤ 2/(t+1).
(ii) λ_t = O(1/t^{2−β}).
(iii) λ_t ≥ D(β, t_0)/(t+t_0)^{2−β}, where D(β, t_0) is a constant that only depends on β and t_0, given by D(β, t_0) = 1/((t_0+3)² e^{16 + 6/(2−β)}).
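Lemma 13 is easy to probe numerically by iterating the α_{t+1} recursion directly. The sketch below (parameter values are our own illustrative choices) checks bound (i) and watches λ_t t^{2−β} settle to a constant, consistent with (ii)-(iii).

```python
import numpy as np

eta0, L, beta, t0 = 0.1, 1.0, 0.61, 1      # satisfies eta_0 < 1/(4L)
eta = lambda t: eta0 / (t + t0) ** beta
alpha, lam = np.sqrt(eta(0) * L), 1.0      # alpha_0 and lambda_0
for t in range(100000):
    lam *= 1 - alpha                       # lambda_{t+1} = (1 - alpha_t) lambda_t
    c = (eta(t + 1) / eta(t)) * alpha ** 2
    alpha = (-c + np.sqrt(c ** 2 + 4 * c)) / 2   # root in (0,1) of a^2 = c(1 - a)
    assert alpha <= 2 / (t + 2)            # Lemma 13 (i), evaluated at index t+1
print(f"lambda_t * t^(2-beta) at t=1e5: {lam * 1e5 ** (2 - beta):.4f}")
```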

Now we proceed to prove Theorem 2.

Proof of Theorem 2: It is easy to check that all the conditions of Lemmas 6 and 13 are satisfied, hence the conclusions of Lemmas 6 and 13 hold. The major step of proving the theorem is to show the following inequality:

  λ_t(Φ_0(x*) − f*) + φ*_t ≥ f(x̄(t)).   (22)

If (22) is true, by (22) and (10), we have

  f(x̄(t)) ≤ φ*_t + λ_t(Φ_0(x*) − f*) ≤ Φ_t(x*) + λ_t(Φ_0(x*) − f*) ≤ f* + 2λ_t(Φ_0(x*) − f*).

Hence f(x̄(t)) − f* = O(λ_t) = O(1/t^{2−β}), i.e. the desired result of the theorem follows.

Now we use induction to prove (22). Firstly, (22) is true for t = 0, since φ*_0 = f(x̄(0)) and Φ_0(x*) > f*. Suppose it is true for 0, 1, 2, . . . , t. For 0 ≤ k ≤ t, by (10), Φ_k(x*) ≤ f* + λ_k(Φ_0(x*) − f*). Hence

  φ*_k + (γ_k/2)‖x* − v̄(k)‖² ≤ f* + λ_k(Φ_0(x*) − f*).

Using the induction assumption, we get

  f(x̄(k)) + (γ_k/2)‖x* − v̄(k)‖² ≤ f* + 2λ_k(Φ_0(x*) − f*).   (23)

Since f(x̄(k)) ≥ f* and γ_k = λ_kγ_0, we have ‖x* − v̄(k)‖² ≤ (4/γ_0)(Φ_0(x*) − f*). Since v̄(k) = (1/α_k)(ȳ(k) − x̄(k)) + x̄(k), we have ‖v̄(k) − x*‖² = ‖(1/α_k)(ȳ(k) − x̄(k)) + x̄(k) − x*‖² ≥ (1/(2α_k²))‖ȳ(k) − x̄(k)‖² − ‖x̄(k) − x*‖². By (23), f(x̄(k)) ≤ 2Φ_0(x*) − f* = 2f(x̄(0)) − f* + γ_0‖v̄(0) − x*‖². Also, since γ_0 = L/(1−α_0) < 2L, we have that x̄(k) lies within the (2f(x̄(0)) − f* + 2L‖v̄(0) − x*‖²)-level set of f. By Assumption 3 and Proposition B.9 of [31], the level set is compact. Hence we have ‖x̄(k) − x*‖ ≤ R, where R is the diameter of that level set. Combining the above arguments, we get

  ‖ȳ(k) − x̄(k)‖² ≤ 2α_k²( ‖v̄(k) − x*‖² + ‖x̄(k) − x*‖² )
  ≤ 2α_k²[ R² + (4/γ_0)(f(x̄(0)) − f*) + 2‖v̄(0) − x*‖² ]
  ≤ 2α_k²[ R² + 4‖v̄(0) − x*‖² ] ≜ 2α_k² C₁   (24)

where C₁ is a constant that does not depend on η (in the last step we used x̄(0) = v̄(0), γ_0 ≥ L, and L-smoothness, which gives f(x̄(0)) − f* ≤ (L/2)‖v̄(0) − x*‖²).

Next, we consider (12):

  φ*_{t+1} − f(x̄(t+1))
  = (1 − α_t)(φ*_t − f(x̄(t))) + (1 − α_t)f(x̄(t)) + α_t f̂(t) − (1/2)η_t‖g(t)‖² + α_t⟨g(t), v̄(t) − ȳ(t)⟩ − f(x̄(t+1))
  (a)≥ (1 − α_t)(φ*_t − f(x̄(t))) + α_t f̂(t) − (1/2)η_t‖g(t)‖² + (1 − α_t)[ f̂(t) + ⟨g(t), x̄(t) − ȳ(t)⟩ ] + α_t⟨g(t), v̄(t) − ȳ(t)⟩ − f(x̄(t+1))
  (b)= (1 − α_t)(φ*_t − f(x̄(t))) + f̂(t) − (1/2)η_t‖g(t)‖² − f(x̄(t+1))   (25)

where (a) is due to Lemma 5 and (b) is due to α_t(v̄(t) − ȳ(t)) + (1 − α_t)(x̄(t) − ȳ(t)) = 0. By Lemma 5 and Lemma 6,

  f(x̄(t+1)) ≤ f̂(t) − (η_t − Lη_t²)‖g(t)‖² + 2Lκ²χ₂(η_t)² [ L²‖x̄(t) − ȳ(t)‖² + (64/(1−σ)²) L²η_t²‖g(t)‖² ].   (26)

Combining the above with (25), we get

  φ*_{t+1} − f(x̄(t+1))
  ≥ (1 − α_t)(φ*_t − f(x̄(t))) + [ (1/2)η_t − Lη_t² − 4608L³χ₂(η_t)²η_t²/(1−σ)⁴ ] ‖g(t)‖² − 2κ²χ₂(η_t)²L³‖x̄(t) − ȳ(t)‖²
  ≥ (1 − α_t)(φ*_t − f(x̄(t))) − 2κ²χ₂(η_t)²L³‖x̄(t) − ȳ(t)‖²   (27)

where we have used the fact that, by step size condition (ii),

  (1/2)η_t − Lη_t² − 4608L³χ₂(η_t)²η_t²/(1−σ)⁴ ≥ (1/2)η_t − Lη_t² − 4608L³η_t² · (4η_t^{2/3}/L^{4/3})/(1−σ)⁴ ≥ η_t( 1/2 − 18433Lη/(1−σ)⁴ ) > 0.

Hence, expanding (27) recursively, we get

  φ*_{t+1} − f(x̄(t+1)) ≥ − Σ_{k=0}^t 2κ²χ₂(η_k)²L³‖x̄(k) − ȳ(k)‖² Π_{ℓ=k+1}^t (1 − α_ℓ).

Therefore, to finish the induction, we need to show

  Σ_{k=0}^t 2κ²χ₂(η_k)²L³‖x̄(k) − ȳ(k)‖² Π_{ℓ=k+1}^t (1 − α_ℓ) ≤ (Φ_0(x*) − f*)λ_{t+1}.

Notice that

  [ Σ_{k=0}^t 2κ²χ₂(η_k)²L³‖x̄(k) − ȳ(k)‖² Π_{ℓ=k+1}^t (1 − α_ℓ) ] / [ (Φ_0(x*) − f*)λ_{t+1} ]
  (a)≤ Σ_{k=0}^t ( 4κ²L³ / (L‖v̄(0) − x*‖²) ) χ₂(η_k)²‖x̄(k) − ȳ(k)‖² (1/λ_{k+1})
  (b)≤ Σ_{k=0}^t ( 4(6/(1−σ))²L² / ‖v̄(0) − x*‖² ) ( (2/L^{2/3})η_k^{1/3} )² · 2C₁α_k² (1/λ_{k+1})
  = ( 1152L^{2/3}C₁ / ((1−σ)²‖v̄(0) − x*‖²) ) Σ_{k=0}^t η_k^{2/3}α_k²/λ_{k+1} ≜ C₂ Σ_{k=0}^t η_k^{2/3}α_k²/λ_{k+1}

where C₂ is a constant that does not depend on η; in (a) we have used Φ_0(x*) − f* ≥ (L/2)‖v̄(0) − x*‖² > 0 and Π_{ℓ=k+1}^t (1 − α_ℓ) = λ_{t+1}/λ_{k+1}, and in (b) we have used the bound on χ₂(η_k) (Lemma 6) and the bound on ‖x̄(k) − ȳ(k)‖ (equation (24)). Now by Lemma 13, we get

  Σ_{k=0}^t η_k^{2/3}α_k² (1/λ_{k+1}) ≤ Σ_{k=0}^t ( η^{2/3}/(k+t_0)^{2β/3} ) · ( 4/(k+1)² ) · ( (k+1+t_0)^{2−β}/D(β, t_0) )
  (a)≤ ( 4(t_0+1)^{2−β}/D(β, t_0) ) η^{2/3} Σ_{k=0}^t 1/(k+1)^{5β/3}
  (b)≤ ( 4(t_0+1)^{2−β}/D(β, t_0) ) η^{2/3} · 2/(β − 0.6)

where in (a) we have used k + t_0 ≥ k + 1 and k + 1 + t_0 ≤ (t_0+1)(k+1), and in (b) we have used 5β/3 > 1. So we have

  [ Σ_{k=0}^t 2κ²χ₂(η_k)²L³‖x̄(k) − ȳ(k)‖² Π_{ℓ=k+1}^t (1 − α_ℓ) ] / [ (Φ_0(x*) − f*)λ_{t+1} ] ≤ η^{2/3} · 8(t_0+1)^{2−β}C₂ / ( D(β, t_0)(β − 0.6) ) < 1

where in the last inequality we have simply required η^{2/3} < D(β, t_0)(β − 0.6)/(8(t_0+1)^{2−β}C₂) (i.e. step size condition (iii)), which is possible since the constants C₂ and D(β, t_0) do not depend on η. So the induction is complete, and (22) is true. □

IV. NUMERICAL EXPERIMENTS

We simulate our algorithm and compare it with other algorithms. We choose n = 100 agents, and the graph is generated using the Erdos-Renyi model [35] with connectivity probability 0.3. The weight matrix W is chosen using the Laplacian method [16, Sec. 2.4]. We compare our algorithm Acc-DNGD with Distributed Gradient Descent (DGD) in [6] with a vanishing step size, the "EXTRA" algorithm in [16] (with W̃ = (W+I)/2), the algorithm studied in [19]–[25] (which we name "Acc-DGD"), and the "D-NG" method in [28]. We also compare with two centralized methods that directly optimize f: Centralized Gradient Descent (CGD) and Centralized Nesterov Gradient Descent (CNGD (2)). Each element of the initial point x_i(0) is drawn from an i.i.d. Gaussian with mean 0 and variance 25. The objective functions are given by

  f_i(x) = (1/m)⟨a_i, x⟩^m + ⟨b_i, x⟩   if |⟨a_i, x⟩| ≤ 1,
  f_i(x) = |⟨a_i, x⟩| − (m−1)/m + ⟨b_i, x⟩   if |⟨a_i, x⟩| > 1,

where m = 12, and a_i, b_i ∈ R^N (N = 4) are vectors whose entries are i.i.d. Gaussian with mean 0 and variance 1, with the exception that b_n is set to be b_n = −Σ_{i=1}^{n−1} b_i, s.t. Σ_i b_i = 0. It is easy to check that f_i is convex and smooth, but not strongly convex (around the minimizer).

The selection of the objective functions is intended to test the sublinear convergence rate 1/t^{2−β} (β > 0.6) of our algorithm Acc-DNGD (3) and the conjecture that the 1/t^{2−β} rate still holds even if β ∈ [0, 0.6] (cf. Theorem 2 and the comments following it). Therefore, we do two runs of our algorithm Acc-DNGD, one with β = 0.61 and the other with β = 0. The results are shown in Figure 1, where the x-axis is the iteration t, and the y-axis is the average objective error (1/n)Σ_i f(x_i(t)) − f* for distributed methods, or the objective error f(x(t)) − f* for centralized methods. Notice that Figure 1 is a double log plot. It shows that Acc-DNGD with β = 0.61 performs faster than 1/t^{1.39}, while D-NG, CGD and CGD-based distributed methods (DGD, Acc-DGD, EXTRA) are slower than 1/t^{1.39}. Further, both Acc-DNGD with β = 0 and CNGD are faster than 1/t².
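For reference, the experiment's local objective and its gradient can be written as follows (a sketch; the piecewise form glues the polynomial and linear branches at |⟨a_i, x⟩| = 1):

```python
import numpy as np

m = 12  # the exponent used in the experiments

def f_i(x, a_i, b_i):
    """Local objective: degree-m polynomial inside |<a_i, x>| <= 1, linear outside."""
    u = np.dot(a_i, x)
    if abs(u) <= 1:
        return u ** m / m + np.dot(b_i, x)
    return abs(u) - (m - 1) / m + np.dot(b_i, x)

def grad_f_i(x, a_i, b_i):
    """Gradient; the two branches match at |u| = 1, so f_i is continuously differentiable."""
    u = np.dot(a_i, x)
    if abs(u) <= 1:
        return u ** (m - 1) * a_i + b_i
    return np.sign(u) * a_i + b_i
```

At the boundary both branches give value 1/m + ⟨b_i, x⟩ and gradient ±a_i + b_i, so f_i is smooth, yet its Hessian vanishes at the minimizer of the polynomial branch, which is why it is not strongly convex there.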

Fig. 1: Simulation results. Step sizes: Acc-DNGD with β = 0.61: η_t = 0.0045/(t+1)^{0.61}, α_0 = 0.7071; Acc-DNGD with β = 0: η_t = 0.0045, α_0 = 0.7071; D-NG: η_t = 0.0091/(t+1); DGD: η_t = 0.0091/√t; EXTRA: η = 0.0091; Acc-DGD: η = 0.0045; CGD: η = 0.0091; CNGD: η = 0.0091, α_0 = 0.5.

V. CONCLUSION

In this paper we propose an Accelerated Distributed Nesterov Gradient Descent algorithm for distributed optimization of convex and smooth functions. We show a general O(1/t^{1.4−ε}) (∀ε ∈ (0, 1.4)) convergence rate, and an improved O(1/t²) convergence rate when the objective functions satisfy an additional property. Future work includes giving tighter analysis of the convergence rates.

REFERENCES

[1] B. Johansson, "On distributed optimization in networked systems," 2008.
[2] J. A. Bazerque and G. B. Giannakis, "Distributed spectrum sensing for cognitive radio networks by exploiting sparsity," IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1847–1862, 2010.
[3] P. A. Forero, A. Cano, and G. B. Giannakis, "Consensus-based distributed support vector machines," Journal of Machine Learning Research, vol. 11, no. May, pp. 1663–1707, 2010.
[4] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," in 1984 American Control Conference, 1984, pp. 484–489.
[5] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ, 1989, vol. 23.
[6] A. Nedić and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," Automatic Control, IEEE Transactions on, vol. 54, no. 1, pp. 48–61, 2009.
[7] I. Lobel and A. Ozdaglar, "Convergence analysis of distributed subgradient methods over random networks," in Communication, Control, and Computing, 2008 46th Annual Allerton Conference on. IEEE, 2008, pp. 353–360.
[8] J. C. Duchi, A. Agarwal, and M. J. Wainwright, "Dual averaging for distributed optimization: convergence analysis and network scaling," Automatic Control, IEEE Transactions on, vol. 57, no. 3, pp. 592–606, 2012.
[9] S. S. Ram, A. Nedić, and V. V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516–545, 2010.
[10] A. Nedic and A. Olshevsky, "Stochastic gradient-push for strongly convex functions on time-varying directed graphs," arXiv preprint arXiv:1406.2075, 2014.
[11] ——, "Distributed optimization over time-varying directed graphs," Automatic Control, IEEE Transactions on, vol. 60, no. 3, pp. 601–615, 2015.
[12] I. Matei and J. S. Baras, "Performance evaluation of the consensus-based distributed subgradient method under random communication topologies," Selected Topics in Signal Processing, IEEE Journal of, vol. 5, no. 4, pp. 754–771, 2011.
[13] A. Olshevsky, "Linear time average consensus on fixed graphs and implications for decentralized optimization and multi-agent control," arXiv preprint arXiv:1411.4186, 2014.
[14] M. Zhu and S. Martínez, "On distributed convex optimization under inequality and equality constraints," Automatic Control, IEEE Transactions on, vol. 57, no. 1, pp. 151–164, 2012.
[15] I. Lobel, A. Ozdaglar, and D. Feijer, "Distributed multi-agent optimization with state-dependent communication," Mathematical Programming, vol. 129, no. 2, pp. 255–284, 2011.
[16] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
[17] C. Xi and U. A. Khan, "On the linear convergence of distributed optimization over directed graphs," arXiv preprint arXiv:1510.02149, 2015.
[18] J. Zeng and W. Yin, "ExtraPush for convex smooth decentralized optimization over directed networks," arXiv preprint arXiv:1511.02942, 2015.
[19] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, "Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes," in 2015 54th IEEE Conference on Decision and Control (CDC). IEEE, 2015, pp. 2055–2060.
[20] P. Di Lorenzo and G. Scutari, "Distributed nonconvex optimization over networks," in Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2015 IEEE 6th International Workshop on. IEEE, 2015, pp. 229–232.
[21] P. Di Lorenzo and G. Scutari, "NEXT: In-network nonconvex optimization," IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.
[22] G. Qu and N. Li, "Harnessing smoothness to accelerate distributed optimization," arXiv preprint arXiv:1605.07112, 2016.
[23] A. Nedich, A. Olshevsky, and W. Shi, "Achieving geometric convergence for distributed optimization over time-varying graphs," arXiv preprint arXiv:1607.03218, 2016.
[24] A. Nedic, A. Olshevsky, W. Shi, and C. A. Uribe, "Geometrically convergent distributed optimization with uncoordinated step-sizes," arXiv preprint arXiv:1609.05877, 2016.
[25] C. Xi and U. A. Khan, "ADD-OPT: Accelerated distributed directed optimization," arXiv preprint arXiv:1607.04757, 2016.
[26] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013, vol. 87.
[27] G. Qu and N. Li, "Accelerated distributed Nesterov gradient descent for smooth and strongly convex functions," in Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on. IEEE, 2016, pp. 209–216.
[28] D. Jakovetic, J. Xavier, and J. M. Moura, "Fast distributed gradient methods," Automatic Control, IEEE Transactions on, vol. 59, no. 5, pp. 1131–1146, 2014.
[29] A. Olshevsky and J. N. Tsitsiklis, "Convergence speed in distributed consensus and averaging," SIAM Journal on Control and Optimization, vol. 48, no. 1, pp. 33–55, 2009.
[30] R. Olfati-Saber, J. A. Fax, and R. M. Murray, "Consensus and cooperation in networked multi-agent systems," Proceedings of the IEEE, vol. 95, no. 1, pp. 215–233, Jan 2007.
[31] D. P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[32] G. Qu and N. Li. (2017) Accelerated distributed Nesterov gradient descent for convex and smooth functions. [Online]. Available: http://scholar.harvard.edu/files/gqu/files/cdc2017fullversion.pdf
[33] O. Devolder, F. Glineur, and Y. Nesterov, "First-order methods of smooth convex optimization with inexact oracle," Mathematical Programming, vol. 146, no. 1-2, pp. 37–75, 2014.
[34] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2012.
[35] P. Erdos and A. Renyi, "On random graphs I," Publ. Math. Debrecen, vol. 6, pp. 290–297, 1959.
