Accelerated Distributed Nesterov Gradient Descent For Convex and Smooth Functions

Guannan Qu and Na Li are affiliated with the John A. Paulson School of Engineering and Applied Sciences at Harvard University. Email: [email protected], [email protected]. This work is supported under NSF ECCS 1608509 and NSF CAREER 1553407.

Abstract— This paper considers the distributed optimization problem over a network, where the objective is to optimize a global function formed by an average of local functions, using only local computation and communication. We develop an Accelerated Distributed Nesterov Gradient Descent (Acc-DNGD) method for convex and smooth objective functions. We show that it achieves a O(1/t^{1.4−ε}) (∀ε ∈ (0, 1.4)) convergence rate when a vanishing step size is used. The convergence rate can be improved to O(1/t²) when we use a fixed step size and the objective functions satisfy a special property. To the best of our knowledge, Acc-DNGD is the fastest among all distributed gradient-based algorithms that have been proposed so far.
I. INTRODUCTION

Given a set of agents N = {1, 2, . . . , n}, each of which has a local convex cost function fi(x) : R^N → R, the objective of distributed optimization is to find x that minimizes the average of all the functions,

min_{x ∈ R^N} f(x) ≜ (1/n) Σ_{i=1}^n fi(x)

using local communication and local computation. The local communication is defined through a connected and undirected communication graph G = (V, E), where the nodes V = N and edges E ⊂ V × V. This problem has found various applications in multi-agent control, distributed state estimation over sensor networks, large scale computation in machine learning, etc. [1]–[3].

There exist many studies on developing distributed algorithms for this problem, e.g., [4]–[15], most of which are distributed gradient descent algorithms, where each iteration is composed of a consensus step and a gradient descent step. These methods have achieved sublinear convergence rates (usually O(1/√t)) for convex functions. When the functions are nonsmooth, the sublinear convergence rates match Centralized Gradient Descent (CGD). More recent work has improved these results for smooth functions, by adding a correction term [16]–[18], or by using a gradient estimation sequence [19]–[25]. With these techniques, papers [16], [22] achieve a O(1/t) convergence rate for smooth functions, matching the rate of CGD. Additionally, if strong convexity is further assumed, papers [16]–[18], [22]–[25] achieve a linear convergence rate, matching the rate of CGD as well.

It is known that among all centralized gradient-based algorithms, Centralized Nesterov Gradient Descent (CNGD) [26] achieves the optimal convergence rate in terms of first-order oracle complexity. For μ-strongly convex and L-smooth functions, it achieves a O((1 − √(μ/L))^t) convergence rate; for convex and L-smooth functions, it achieves a O(1/t²) convergence rate. These nice convergence rates lead to the question of this paper: how to decentralize the Nesterov Gradient method to achieve similar convergence rates? Our recent work [27] has studied the μ-strongly convex and L-smooth case. This paper will focus on the convex and L-smooth case (without the strongly convex assumption). Previous work in this line includes [28], which develops the Distributed Nesterov Gradient (D-NG) method and shows that it has a convergence rate of O(log t / t),¹ which is not faster than the rate of CGD (O(1/t)).

In this paper, we propose an Accelerated Distributed Nesterov Gradient Descent (Acc-DNGD) method. We show that it achieves a O(1/t^{1.4−ε}) (for any ε ∈ (0, 1.4)) convergence rate when a vanishing step size is used. We further show that the convergence rate can be improved to O(1/t²) when we use a fixed step size and the objective function is a composition of a linear map and a strongly-convex and smooth function. Both rates are faster than what CGD and CGD-based distributed methods can achieve (O(1/t)). To the best of the authors' knowledge, the O(1/t^{1.4−ε}) rate is the fastest among all distributed gradient-based algorithms proposed so far.²

Our algorithm is a combination of CNGD and a gradient estimation scheme. The gradient estimation scheme has been studied in various contexts in [19]–[25]. As [22] has pointed out, when combining the gradient estimation scheme with a centralized algorithm, the resulting distributed algorithm could potentially match the convergence rate of the centralized algorithm. The results in this paper show that, although combining the scheme with CNGD will not give a convergence rate (O(1/t^{1.4−ε})) matching that of CNGD (O(1/t²)), it does improve over previously known CGD-based distributed algorithms (O(1/t)).

In the rest of the paper, Section II formally defines the problem and presents our algorithm and results. Section III proves the convergence rates. Section IV provides numerical simulations and Section V concludes the paper.

¹ Reference [28] also studies an algorithm that uses multiple consensus steps per iteration, and achieves a O(1/t²) convergence rate. In this paper, we focus on algorithms that only use one or a constant number of consensus steps per iteration.

² We only include algorithms that are gradient based (without extra information like Hessian), and use one (or a constant number of) step(s) of consensus after each gradient evaluation.
Notations. In this paper, n is the number of agents, and N is the dimension of the domain of the fi's. Notation ‖·‖ denotes the 2-norm for vectors and the Frobenius norm for matrices. Notation ‖·‖* denotes the spectral norm for matrices. Notation ⟨·, ·⟩ denotes the inner product for vectors. Notation ρ(·) denotes the spectral radius for square matrices, and 1 denotes the n-dimensional all-one column vector. All vectors, when having dimension N (the dimension of the domain of the fi's), will be regarded as row vectors; in particular, ∇fi(x) and ∇f(x) are interpreted as N-dimensional row vectors. Notation "≤", when applied to vectors of the same dimension, denotes element-wise "less than or equal to".
II. PROBLEM AND ALGORITHM

A. Problem Formulation

Consider n agents, N = {1, 2, . . . , n}, each of which has a function fi : R^N → R. The objective of distributed optimization is to find x to minimize the average of all the functions, i.e.,

min_{x ∈ R^N} f(x) ≜ (1/n) Σ_{i=1}^n fi(x)   (1)

using local communication and local computation. The local communication is defined through a connected undirected communication graph G = (V, E), where the nodes V = N and the edges E ⊂ V × V. Agents i and j can send information to each other if and only if (i, j) ∈ E. The local computation means that each agent can only make its decision based on the local function fi and the information obtained from its neighbors.

Throughout the paper, we assume that f has a minimizer x* with optimal value f*. We will use the following assumptions in the rest of the paper.

Assumption 1. ∀i ∈ N, fi is convex. As a result, f is also convex.

Assumption 2. ∀i ∈ N, fi is L-smooth, that is, fi is differentiable and the gradient is L-Lipschitz continuous, i.e., ∀x, y ∈ R^N, ‖∇fi(x) − ∇fi(y)‖ ≤ L‖x − y‖. As a result, f is L-smooth.

Assumption 3. The set of minimizers of f is compact.
B. Centralized Nesterov Gradient Descent (CNGD)

We briefly introduce a version of centralized Nesterov Gradient Descent (CNGD) that is derived from Section 2.2 of [26]. CNGD keeps updating three variables x(t), v(t), y(t) ∈ R^N, starting from an initial point x(0) = v(0) = y(0) ∈ R^N, and the update equation is given by

x(t+1) = y(t) − η∇f(y(t))   (2a)
v(t+1) = v(t) − (η/αt)∇f(y(t))   (2b)
y(t+1) = (1 − αt+1)x(t+1) + αt+1 v(t+1),   (2c)

where (αt)_{t=0}^∞ is defined by an arbitrarily chosen α0 ∈ (0, 1) and the update equation αt+1² = (1 − αt+1)αt², where αt+1 always takes the unique solution in (0, 1). The following theorem (adapted from [26, Thm. 2.2.1, Lem. 2.2.4]) gives the convergence rate of CNGD.

Theorem 1. In CNGD (2), under Assumptions 1 and 2, when 0 < η ≤ 1/L, we have f(x(t)) − f* = O(1/t²).
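For concreteness, below is a minimal Python sketch of iteration (2); the quadratic test problem and the use of the closed-form positive root of αt+1² = (1 − αt+1)αt² are illustrative choices and are not part of the original text.

```python
import numpy as np

def next_alpha(alpha):
    # Unique root in (0, 1) of a^2 = (1 - a) * alpha^2,
    # i.e. a = (-alpha^2 + alpha * sqrt(alpha^2 + 4)) / 2.
    return 0.5 * (-alpha**2 + alpha * np.sqrt(alpha**2 + 4))

def cngd(grad_f, x0, eta, alpha0, iters):
    """Centralized Nesterov Gradient Descent, update (2a)-(2c)."""
    x, v, y = x0.copy(), x0.copy(), x0.copy()
    alpha = alpha0
    for _ in range(iters):
        g = grad_f(y)
        x_next = y - eta * g                  # (2a)
        v = v - (eta / alpha) * g             # (2b)
        alpha = next_alpha(alpha)
        y = (1 - alpha) * x_next + alpha * v  # (2c)
        x = x_next
    return x

# Example: f(x) = 0.5 * ||A x - b||^2, which is L-smooth with L = ||A^T A||_2.
A = np.array([[2.0, 0.0], [0.0, 1.0]]); b = np.array([1.0, 1.0])
L = np.linalg.norm(A.T @ A, 2)
x_star = np.linalg.solve(A.T @ A, A.T @ b)
x = cngd(lambda y: A.T @ (A @ y - b), np.zeros(2), 1.0 / L, 0.5, 200)
print(np.linalg.norm(x - x_star))  # close to 0, consistent with Theorem 1
```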
C. Our Algorithm: Accelerated Distributed Nesterov Gradient Descent (Acc-DNGD)

We design our algorithm based on a consensus matrix W = [wij] ∈ R^{n×n}. Here wij stands for how much agent i weighs its neighbor j's information. W satisfies the following properties:
(a) ∀(i, j) ∈ E, wij > 0. ∀i, wii > 0. wij = 0 elsewhere.
(b) W is doubly stochastic, i.e., Σ_i wij = Σ_j wij = 1 for all i, j ∈ N.
As a result, there exists σ ∈ (0, 1), which depends on the spectrum of W, such that for any ω ∈ R^{n×1} we have the "averaging property" ‖Wω − 1ω̄‖ ≤ σ‖ω − 1ω̄‖, where ω̄ = (1/n)1^T ω (the average of the entries in ω) [29]. How to select a consensus matrix that satisfies these properties has been intensely studied, e.g., [29], [30].
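The paper leaves the choice of W to the consensus literature [29], [30]. One standard construction that satisfies (a) and (b) on an undirected graph is the Metropolis weighting rule; the sketch below is an illustrative example of such a construction, not a prescription from the paper.

```python
import numpy as np

def metropolis_weights(adj):
    """Build a doubly stochastic W from an undirected adjacency matrix.

    Metropolis-Hastings weights: w_ij = 1 / (1 + max(d_i, d_j)) for each
    edge (i, j), with the remaining mass on the diagonal. This gives
    positive diagonal and edge weights (property (a)) and, by symmetry,
    double stochasticity (property (b)).
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

# Example: path graph on 3 nodes.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
W = metropolis_weights(adj)
assert np.allclose(W.sum(axis=0), 1) and np.allclose(W.sum(axis=1), 1)
# For symmetric W, the contraction factor sigma in the averaging property
# is the second largest singular value.
print(np.linalg.svd(W, compute_uv=False)[1])
```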
In our algorithm Acc-DNGD, each agent keeps a copy of the three variables in CNGD, xi(t), vi(t), yi(t), and in addition si(t), which serves as a gradient estimator. The initial condition is xi(0) = vi(0) = yi(0) = 0 and si(0) = ∇f(0),³ and the algorithm updates as follows:

xi(t+1) = Σ_j wij yj(t) − ηt si(t)   (3a)
vi(t+1) = Σ_j wij vj(t) − (ηt/αt) si(t)   (3b)
yi(t+1) = (1 − αt+1)xi(t+1) + αt+1 vi(t+1)   (3c)
si(t+1) = Σ_j wij sj(t) + ∇fi(yi(t+1)) − ∇fi(yi(t))   (3d)

where [wij]_{n×n} are the consensus weights and ηt ∈ (0, 1/L) are the step sizes. Sequence (αt)_{t≥0} is generated as follows: first select α0 = √(η0 L) ∈ (0, 1); then, given αt ∈ (0, 1), select αt+1 to be the unique solution in (0, 1) of the following equation,⁴

αt+1² = (ηt+1/ηt)(1 − αt+1)αt².

We will consider two variants of the algorithm with the following two step size rules.
• Vanishing step size rule: ηt = η/(t + t0)^β for some η ∈ (0, 1/L), β ∈ (0, 2) and t0 ≥ 1.
• Fixed step size rule: ηt = η > 0.

Because wij = 0 when (i, j) ∉ E, each node i only needs to send xi(t), vi(t), yi(t) and si(t) to its neighbors. Therefore, the algorithm can be operated in a fully distributed fashion with only local communication. The additional term si(t) allows each agent to obtain an estimate of the global average gradient (1/n)Σ_j ∇fj(yj(t)) (for more details, see Section II-D). Compared with distributed algorithms without this estimation term, it helps improve the convergence speed. As a result, we call this method the Accelerated Distributed Nesterov Gradient Descent (Acc-DNGD) method.

³ We note that the initial condition si(0) = ∇f(0) requires the agents to conduct an initial run of consensus. We impose this initial condition for technical reasons, while we expect the results of this paper to hold for a relaxed initial condition, si(0) = ∇fi(0), which does not need initial coordination. We use the relaxed initial condition in numerical simulations.

⁴ Without causing any confusion with the αt in (2), in the rest of the paper we abuse the notation αt.
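Putting update (3) and the step size rules together, a single-machine simulation of Acc-DNGD might look like the sketch below; the quadratic local costs, the parameter values, and the relaxed initialization si(0) = ∇fi(0) from footnote 3 are illustrative assumptions.

```python
import numpy as np

def acc_dngd(grads, W, L, eta_bar, beta, t0, iters, N):
    """Single-machine simulation of the Acc-DNGD update (3).

    grads: list of n per-agent gradient functions R^N -> R^N.
    Vanishing step size rule: eta_t = eta_bar / (t + t0)**beta.
    Relaxed initialization s_i(0) = grad f_i(0) (footnote 3).
    """
    n = W.shape[0]
    eta = lambda t: eta_bar / (t + t0) ** beta
    x = np.zeros((n, N)); v = np.zeros((n, N)); y = np.zeros((n, N))
    g_prev = np.stack([gf(yi) for gf, yi in zip(grads, y)])
    s = g_prev.copy()
    alpha = np.sqrt(eta(0) * L)                      # alpha_0 = sqrt(eta_0 L)
    for t in range(iters):
        x = W @ y - eta(t) * s                       # (3a)
        v = W @ v - (eta(t) / alpha) * s             # (3b)
        c = (eta(t + 1) / eta(t)) * alpha ** 2
        alpha = 0.5 * (-c + np.sqrt(c * c + 4 * c))  # root in (0,1) of a^2 = (1-a)c
        y = (1 - alpha) * x + alpha * v              # (3c)
        g = np.stack([gf(yi) for gf, yi in zip(grads, y)])
        s = W @ s + g - g_prev                       # (3d)
        g_prev = g
    return x

# Example: n = 3 agents with f_i(x) = 0.5*||x - c_i||^2 (so L = 1); the
# minimizer of the average is the mean of the c_i, here (0, 1).
W = np.array([[0.50, 0.25, 0.25], [0.25, 0.50, 0.25], [0.25, 0.25, 0.50]])
cs = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([-1.0, 1.0])]
grads = [lambda z, c=c: z - c for c in cs]
x = acc_dngd(grads, W, L=1.0, eta_bar=0.5, beta=0.61, t0=1, iters=2000, N=2)
print(x.mean(axis=0))  # close to [0, 1]
```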
D. Intuition Behind our Algorithm

Here we briefly explain how the algorithm works. First we note that Eq. (3a)-(3c) is similar to Eq. (2), except for the "weighted average terms" (Σ_j wij yj(t), Σ_j wij vj(t)) and the new term si(t) that replaces the gradient terms. We have the following circular arguments that explain why algorithm (3) should work.

Argument 1: Assuming si(t) ≈ (1/n)Σ_{j=1}^n ∇fj(yj(t)), then (3a)-(3c) essentially run the CNGD update (2) with an approximation of the global average gradient.
Define the stacked variables

x(t) = [x1(t)^T, x2(t)^T, . . . , xn(t)^T]^T
v(t) = [v1(t)^T, v2(t)^T, . . . , vn(t)^T]^T
y(t) = [y1(t)^T, y2(t)^T, . . . , yn(t)^T]^T
s(t) = [s1(t)^T, s2(t)^T, . . . , sn(t)^T]^T
∇(t) = [∇f1(y1(t))^T, ∇f2(y2(t))^T, . . . , ∇fn(yn(t))^T]^T.

Now our algorithm in (3) can be written as

x(t+1) = W y(t) − ηt s(t)   (5a)
v(t+1) = W v(t) − (ηt/αt) s(t)   (5b)
y(t+1) = (1 − αt+1)x(t+1) + αt+1 v(t+1)   (5c)
s(t+1) = W s(t) + ∇(t+1) − ∇(t).   (5d)
Apart from the average sequence x̄(t) = (1/n)Σ_{i=1}^n xi(t) ∈ R^{1×N} that we have defined, we also define several other average sequences: v̄(t) = (1/n)Σ_{i=1}^n vi(t), ȳ(t) = (1/n)Σ_{i=1}^n yi(t), s̄(t) = (1/n)Σ_{i=1}^n si(t), and g(t) = (1/n)Σ_{i=1}^n ∇fi(yi(t)).
Overview of the Proof. We derive a series of lemmas (Lemmas 4, 5, 6 and 7) that will work for both the vanishing and the fixed step size case. We first derive the update formula for the average sequences (Lemma 4). Then, we show that the update rule for the average sequences is in fact Centralized Nesterov Gradient Descent (CNGD) with inexact gradients [33], where the inexactness is characterized by the "consensus error" ‖y(t) − 1ȳ(t)‖ (Lemma 5). The consensus error is bounded in Lemma 6. Then, we apply the proof of CNGD (see e.g. [26]) to the average sequences in spite of the consensus error, and derive an intermediate result in Lemma 7. Lastly, we finish the proof of Theorem 2 in Section III-C. The proof of Theorem 3 can be found in Appendix-F of [32].
Lemma 4. The following equalities hold:

x̄(t+1) = ȳ(t) − ηt g(t)   (6a)
v̄(t+1) = v̄(t) − (ηt/αt) g(t)   (6b)
ȳ(t+1) = (1 − αt+1)x̄(t+1) + αt+1 v̄(t+1)   (6c)
s̄(t+1) = s̄(t) + g(t+1) − g(t) = g(t+1)   (6d)

Proof: We omit the proof since these equalities can be easily derived using the fact that W is doubly stochastic. For (6d) we also need to use the fact that s̄(0) = g(0). □
From (6a)-(6c) we see that the sequences x̄(t), v̄(t) and ȳ(t) follow an update rule similar to the CNGD in (2). The only difference is that the g(t) in (6a)-(6c) is not the exact gradient ∇f(ȳ(t)) in CNGD. In the following lemma, we show that g(t) is an inexact gradient.⁷

Lemma 5. Under Assumptions 1 and 2, ∀t, g(t) is an inexact gradient of f at ȳ(t) with error O(‖y(t) − 1ȳ(t)‖²), in the sense that, ∀ω ∈ R^N,

f(ω) ≥ f̂(t) + ⟨g(t), ω − ȳ(t)⟩   (7)
f(ω) ≤ f̂(t) + ⟨g(t), ω − ȳ(t)⟩ + L‖ω − ȳ(t)‖² + (L/n)‖y(t) − 1ȳ(t)‖²,   (8)

where f̂(t) = (1/n)Σ_{i=1}^n [fi(yi(t)) + ⟨∇fi(yi(t)), ȳ(t) − yi(t)⟩].

Proof: We omit the proof and refer the readers to [27, Lem. 4]. □

⁷ For more information regarding why (7), (8) define an "inexact gradient", we refer the readers to [33].
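Both inequalities are easy to probe numerically. The snippet below builds random L-smooth quadratic fi's, forms f̂(t) and g(t) from scattered local copies yi, and checks (7)-(8) at random test points; the specific costs and parameters are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, L = 5, 3, 2.0
A = [rng.uniform(0.1, L, N) for _ in range(n)]   # f_i(z) = 0.5 sum_k a_k z_k^2
f = lambda z: np.mean([0.5 * a @ (z * z) for a in A])
grad = lambda i, z: A[i] * z

y = rng.normal(size=(n, N))                      # scattered local copies
y_bar = y.mean(axis=0)
g = np.mean([grad(i, y[i]) for i in range(n)], axis=0)
f_hat = np.mean([0.5 * A[i] @ (y[i] * y[i]) + grad(i, y[i]) @ (y_bar - y[i])
                 for i in range(n)])
cons_err2 = np.sum((y - y_bar) ** 2)             # ||y(t) - 1 y-bar(t)||^2

for _ in range(100):
    w = rng.normal(size=N)                       # random test point omega
    lower = f_hat + g @ (w - y_bar)                                   # (7)
    upper = lower + L * (w - y_bar) @ (w - y_bar) + L / n * cons_err2 # (8)
    assert lower <= f(w) <= upper
```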
The consensus error ‖y(t) − 1ȳ(t)‖ in the previous lemma is bounded by the following lemma, whose proof is given in Section III-B.

Lemma 6. Suppose the step sizes satisfy
(i) ηt ≥ ηt+1 > 0,
(ii) η0 < min( σ²/(9³L), (1−σ)³/(6144L) ),
(iii) sup_{t≥0} ηt/ηt+1 ≤ min( ((σ+3)/(σ+2) · 3/4)^{σ/28}, 16/(15+σ) ).
Then, under Assumption 2, we have

‖y(t) − 1ȳ(t)‖ ≤ κ√n χ2(ηt) [ L‖ȳ(t) − x̄(t)‖ + (8/(1−σ)) Lηt‖g(t)‖ ]

where χ2 : R → R is a function satisfying 0 < χ2(ηt) ≤ (2/L^{2/3}) ηt^{1/3}, and κ = 6/(1−σ).

We next provide the following intermediate result. The proof roughly follows the same steps as [26, Lemma 2.2.3], and can be found in Appendix-D of [32].

Lemma 7. Define γ0 = α0²/(η0(1−α0)) = L/(1−α0). We define a series of functions (Φt : R^N → R)_{t≥0}, with

Φ0(ω) = f(x̄(0)) + (γ0/2)‖ω − v̄(0)‖²

and

Φt+1(ω) = (1 − αt)Φt(ω) + αt[f̂(t) + ⟨g(t), ω − ȳ(t)⟩].   (9)

Then, under Assumptions 1 and 2, the following holds.
(i) We have

Φt(ω) ≤ f(ω) + λt(Φ0(ω) − f(ω))   (10)

where λt is defined through λ0 = 1 and λt+1 = (1 − αt)λt.
(ii) Function Φt(ω) can be written as

Φt(ω) = φ*t + (γt/2)‖ω − v̄(t)‖²   (11)

where γt is defined through γt+1 = γt(1 − αt), and φ*t is some real number that satisfies φ*0 = f(x̄(0)) and

φ*t+1 = (1 − αt)φ*t + αt f̂(t) − (1/2)ηt‖g(t)‖² + αt⟨g(t), v̄(t) − ȳ(t)⟩.   (12)

B. Proof of the Bounded Consensus Error (Lemma 6)

We will frequently use the following lemmas, whose proofs can be found in Appendix-A of [32].

Lemma 8. The following equalities are true:

ȳ(t+1) − ȳ(t) = αt+1(v̄(t) − ȳ(t)) − ηt (αt+1/αt + 1 − αt+1) g(t)   (13)
v̄(t+1) − ȳ(t+1) = (1 − αt+1)(v̄(t) − ȳ(t)) + ηt(1 − αt+1)(1 − 1/αt) g(t)   (14)

Lemma 9. Under Assumption 2, the following are true:

‖∇(t+1) − ∇(t)‖ ≤ L‖y(t+1) − y(t)‖   (15)
‖g(t) − ∇f(ȳ(t))‖ ≤ (L/√n)‖y(t) − 1ȳ(t)‖   (16)
Proof of Lemma 6:

Overview of the proof. The proof is divided into three steps. In step 1, we treat the algorithm (5) as a linear system and derive a linear system inequality (17). In step 2, we analyze the state transition matrix in (17) and prove a few spectral properties. In step 3, we further analyze the linear system (17) and bound the state by the input, from which the conclusion of the lemma follows. Throughout the proof, we will frequently use an easy-to-check fact: αt is a decreasing sequence.

Step 1: A Linear System Inequality. Define z(t) = [αt‖v(t) − 1v̄(t)‖, ‖y(t) − 1ȳ(t)‖, ‖s(t) − 1g(t)‖]^T ∈ R³ and b(t) = [0, 0, √n a(t)]^T ∈ R³, where

a(t) ≜ αt L‖v̄(t) − ȳ(t)‖ + 2λLηt‖g(t)‖,

in which λ ≜ 4/(1−σ) > 1. The desired inequality is

z(t+1) ≤ G(ηt) z(t) + b(t),  where  G(ηt) = \begin{bmatrix} σ & 0 & ηt \\ σ & σ & 2ηt \\ L & 2L & σ + 2ηt L \end{bmatrix}.   (17)

The proof of (17) is similar to that of [27, Eq. (8)]. Due to space limit, it is omitted and can be found in Appendix-B of [32].
Step 2: Spectral Properties of G(·). When η is positive, G(η) is a nonnegative matrix and G(η)² is a positive matrix. By the Perron-Frobenius Theorem [34, Thm. 8.5.1], G(η) has a unique largest (in magnitude) eigenvalue that is a positive real number with multiplicity 1, and the eigenvalue is associated with an eigenvector with positive entries. We let the unique largest eigenvalue be θ(η) = ρ(G(η)) and let its eigenvector be χ(η) = [χ1(η), χ2(η), χ3(η)]^T, normalized such that χ3(η) = 1. We give bounds on the eigenvalue and the eigenvector in the following lemmas, whose proofs can be found in Appendix-C of [32].

Lemma 10. When 0 < ηL < 1, we have σ < θ(η) < σ + 4(ηL)^{1/3}, and χ2(η) ≤ (2/L^{2/3}) η^{1/3}.

Lemma 11. When η ∈ (0, σ/(2√2 L)), θ(η) ≥ σ + (σηL)^{1/3} and χ1(η) < η/(σηL)^{1/3}.

Lemma 12. When ζ1, ζ2 ∈ (0, σ²/(9³L)), then χ1(ζ1)/χ1(ζ2) ≤ max((ζ2/ζ1)^{6/σ}, (ζ1/ζ2)^{6/σ}) and χ2(ζ1)/χ2(ζ2) ≤ max((ζ2/ζ1)^{28/σ}, (ζ1/ζ2)^{28/σ}).

It is easy to check that, under our step size condition (ii), all the conditions of Lemmas 10, 11, 12 are satisfied.
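The spectral claims of Step 2 can be checked numerically. The snippet below forms G(η) from (17), extracts the Perron root θ(η) and its eigenvector χ(η), and verifies the bounds of Lemmas 10 and 11 for illustrative values σ = 0.8 and L = 1 (these values are assumptions for the experiment, not from the paper).

```python
import numpy as np

def G(eta, sigma, L):
    # State transition matrix of the linear system inequality (17).
    return np.array([[sigma, 0.0,   eta],
                     [sigma, sigma, 2 * eta],
                     [L,     2 * L, sigma + 2 * eta * L]])

sigma, L = 0.8, 1.0   # illustrative values with eta * L < 1
for eta in [1e-5, 1e-4, 1e-3]:
    vals, vecs = np.linalg.eig(G(eta, sigma, L))
    k = np.argmax(np.abs(vals))
    theta = vals[k].real                    # Perron root: real and simple
    chi = np.abs(vecs[:, k].real)
    chi /= chi[2]                           # normalize so chi_3 = 1
    assert sigma < theta < sigma + 4 * (eta * L) ** (1 / 3)   # Lemma 10
    assert theta >= sigma + (sigma * eta * L) ** (1 / 3)      # Lemma 11
    assert 0 < chi[1] <= 2 / L ** (2 / 3) * eta ** (1 / 3)    # chi_2 bound
    assert chi[0] < eta / (sigma * eta * L) ** (1 / 3)        # chi_1 bound
    print(f"eta={eta:g}: theta={theta:.4f}, chi1={chi[0]:.2e}, chi2={chi[1]:.2e}")
```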
Step 3: Bound the state by the input. With the above preparations, we now prove, by induction, the following statement:

z(t) ≤ √n a(t) κ χ(ηt)   (18)

where κ = 6/(1−σ). Equation (18) is true for t = 0, since the left hand side is zero when t = 0. Assume (18) holds for t. We now show (18) is true for t + 1. We divide the rest of the proof into two sub-steps. Briefly speaking, step 3.1 proves that the input to the system (17), a(t+1), does not decrease too much compared to a(t) (a(t+1) ≥ ((σ+3)/4)a(t)), while step 3.2 shows that the state z(t+1), compared to z(t), decreases enough for (18) to hold for t + 1.

Step 3.1: We prove that a(t+1) ≥ ((σ+3)/4)a(t). By (14),

a(t+1) = αt+1 L‖(1 − αt+1)(v̄(t) − ȳ(t)) + (1 − αt+1)(1 − 1/αt)ηt g(t)‖ + 2ληt+1 L‖g(t+1)‖
≥ αt+1(1 − αt+1)L‖v̄(t) − ȳ(t)‖ − (αt+1/αt)(1 − αt+1)(1 − αt)ηt L‖g(t)‖ + 2ληt+1 L‖g(t)‖ − 2ληt+1 L‖g(t+1) − g(t)‖.

Therefore, we have

a(t) − a(t+1)
≤ [αt − αt+1(1 − αt+1)] L‖v̄(t) − ȳ(t)‖ + [(αt+1/αt)(1 − αt+1)(1 − αt)ηt L + 2ληt L − 2ληt+1 L] ‖g(t)‖ + 2ληt+1 L‖g(t+1) − g(t)‖
≤ [αt − αt+1(1 − αt+1)] L‖v̄(t) − ȳ(t)‖ + (ηt + 2λ(ηt − ηt+1)) L‖g(t)‖ + 2ληt+1 L‖g(t+1) − g(t)‖
≤ max(1 − αt+1/αt + αt+1²/αt, 1/(2λ) + (ηt − ηt+1)/ηt) a(t) + 2ληt+1 L‖g(t+1) − g(t)‖   (19)

where in the last inequality, we have used the elementary fact that for four positive numbers a1, a2, a3, a4 and x, y ≥ 0, we have a1x + a2y = (a1/a3)a3x + (a2/a4)a4y ≤ max(a1/a3, a2/a4)(a3x + a4y).

Next, we expand ‖g(t+1) − g(t)‖:

‖g(t+1) − g(t)‖
≤ ‖g(t+1) − ∇f(ȳ(t+1))‖ + ‖g(t) − ∇f(ȳ(t))‖ + ‖∇f(ȳ(t+1)) − ∇f(ȳ(t))‖
(a)≤ (L/√n)‖y(t+1) − 1ȳ(t+1)‖ + (L/√n)‖y(t) − 1ȳ(t)‖ + L‖ȳ(t+1) − ȳ(t)‖
(b)≤ (L/√n)σαt‖v(t) − 1v̄(t)‖ + (L/√n)·2‖y(t) − 1ȳ(t)‖ + (L/√n)·2ηt‖s(t) − 1g(t)‖ + a(t)
(c)≤ Lσκχ1(ηt)a(t) + 2Lκχ2(ηt)a(t) + 2Lηtκχ3(ηt)a(t) + a(t)
(d)≤ a(t) [ Lσκ · ηt/(σηtL)^{1/3} + 2Lκ · 2ηt^{1/3}/L^{2/3} + 2Lηtκ + 1 ]
(e)≤ 8κ a(t).   (20)

Here (a) is due to (16); (b) is due to the second row of (17) and the fact that a(t) ≥ L‖ȳ(t+1) − ȳ(t)‖; (c) is due to the induction assumption (18). In (d), we have used the bound on χ1(·) (Lemma 11), the bound on χ2(·) (Lemma 10), and χ3(ηt) = 1. In (e), we have used ηtL < 1, σ < 1 and κ > 1.

Combining (20) with (19), we have

a(t) − a(t+1) ≤ max(1 − αt+1/αt + αt+1²/αt, 1/(2λ) + (ηt − ηt+1)/ηt) a(t) + 16κληt+1 L a(t)
≤ [ max(1 − ηt+1/ηt + 2αt+1, (1−σ)/8 + (ηt − ηt+1)/ηt) + (384/(1−σ)²) η0 L ] a(t)

where in the last inequality, we have used the fact that

1 − αt+1/αt + αt+1²/αt < 1 − αt+1²/αt² + αt+1 = 1 − (ηt+1/ηt)(1 − αt+1) + αt+1 < 1 − ηt+1/ηt + 2αt+1.

By the step size condition (iii), ηt/ηt+1 ≤ 16/(15+σ), and hence 1 − ηt+1/ηt ≤ (1−σ)/16. By the step size condition (ii), 2αt+1 ≤ 2α0 = 2√(η0 L) ≤ (1−σ)/16, and η0 L · 384/(1−σ)² < (1−σ)/16. Combining the above, we have a(t) − a(t+1) ≤ ((1−σ)/4) a(t). Hence a(t+1) ≥ ((3+σ)/4) a(t).
(a) ≤ f ∗ + 2λt (Φ0 (x∗ ) − f ∗ ).
z(t + 1) ≤ G(ηt )z(t) + b(t)
1
(b) √ √ Hence f (x̄(t)) − f ∗ = O(λt ) = O( t2−β ), i.e. the desired
≤ G(ηt ) na(t)κχ(ηt ) + na(t)χ(ηt ) result of the theorem follows.
(c) √ √ Now we use induction to prove (22). Firstly, (22) is true
= θ(ηt ) na(t)κχ(ηt ) + na(t)χ(ηt )
√ for t = 0, since φ∗0 = f (x̄(0)) and Φ0 (x∗ ) > f ∗ . Suppose
= na(t)χ(ηt )(κθ(ηt ) + 1) it’s true for 0, 1, 2, . . . , t. For 0 ≤ k ≤ t, by (10), Φk (x∗ ) ≤
(d) √ σ+1 4 f ∗ + λk (Φ0 (x∗ ) − f ∗ ). Hence
≤ na(t + 1)χ(ηt+1 )(κ + 1)
2 3+σ γk
χ1 (ηt ) χ2 (ηt ) φ∗k + x∗ − v̄(k)2 ≤ f ∗ + λk (Φ0 (x∗ ) − f ∗ ).
× max( , , 1) 2
χ1 (ηt+1 ) χ2 (ηt+1 )
(e) √ σ+2 4 Using the induction assumption, we get
= na(t + 1)χ(ηt+1 ) κ
3 σ+3 γk
χ1 (ηt ) χ2 (ηt ) f (x̄(k)) + x∗ − v̄(k)2 ≤ f ∗ + 2λk (Φ0 (x∗ ) − f ∗ ). (23)
× max( , , 1) 2
χ1 (ηt+1 ) χ2 (ηt+1 )
Since f (x̄(k)) ≥ f ∗ and γk = λk γ0 , we have x∗ −
(f ) √
≤ na(t + 1)κχ(ηt+1 ) (21) v̄(k)2 ≤ γ40 (Φ0 (x∗ ) − f ∗ ). Since v̄(k) = α1k (ȳ(k) −
x̄(k)) + x̄(k), we have v̄(k) − x∗ 2 = α1k (ȳ(k) − x̄(k)) +
where (a) is due to (17), and (b) is due to induction
assumption (18), and (c) is because θ(ηt ) is an eigenvalue of x̄(k) − x∗ 2 ≥ 2α1 2 ȳ(k) − x̄(k)2 − x̄(k) − x∗ 2 . By (23),
k
G(ηt ) with eigenvector χ(ηt ), and (d) is due to step 3.1, and f (x̄(k)) ≤ 2Φ0 (x∗ )−f ∗ = 2f (x̄(0))−f ∗ +γ0 v̄(0)−x∗ 2 .
θ(ηt ) < σ + 4(η0 L)1/3 < 1+σ 2 (by step size condition (ii) Also since γ0 = 1−α L
< 2L, we have x̄(k) lies within
0
and Lemma 10), and in (e), we have used by the definition the (2f (x̄(0)) − f + 2Lv̄(0) − x∗ 2 )-level set of f . By
∗
of κ, κ σ+1
2 + 1 =
σ+2
3 κ. For (f), we have used that by Assumption 3 and Proposition B.9 of [31], we have the level
Lemma 12 and step size condition (iii), set is compact. Hence we have x̄(k) − x∗ ≤ R where
R is the diameter of that level set. Combining the above
χ1 (ηt ) χ2 (ηt ) ηt 28/σ σ+33 arguments, we get
max( , , 1) ≤ ( ) ≤ .
χ1 (ηt+1 ) χ2 (ηt+1 ) ηt+1 σ+24
Now, (18) is proven for t + 1, and hence is true for all t. Therefore, we have

‖y(t) − 1ȳ(t)‖ ≤ κ√n a(t) χ2(ηt).

Notice that a(t) = αtL‖v̄(t) − ȳ(t)‖ + 2λLηt‖g(t)‖ ≤ L‖x̄(t) − ȳ(t)‖ + (8/(1−σ))Lηt‖g(t)‖. The statement of the lemma follows. □
C. Proof of Theorem 2

We first introduce Lemma 13 regarding the asymptotic behavior of αt and λt. The proof can be found in Appendix-E of [32].

Lemma 13. When the vanishing step size is used (ηt = η/(t+t0)^β, t0 ≥ 1, β ∈ (0, 2)), and η0 < 1/(4L) (equivalently α0 < 1/2), we have
(i) αt ≤ 2/(t+1).
(ii) λt = O(1/t^{2−β}).
(iii) λt ≥ D(β, t0)/(t+t0)^{2−β}, where D(β, t0) is a constant that only depends on β and t0, given by D(β, t0) = 1/((t0+3)² e^{16 + 6/(2−β)}).
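The recursions behind Lemma 13 are straightforward to tabulate. The sketch below runs the αt and λt recursions under the vanishing step size rule, with illustrative values η = 0.05, t0 = 1, β = 0.61 (assumptions for the experiment), and asserts bounds (i) and (iii); the final print illustrates (ii), since λt · t^{2−β} stays bounded.

```python
import numpy as np

eta_bar, t0, beta, L = 0.05, 1, 0.61, 1.0        # eta_0 = 0.05 < 1/(4L)
eta = lambda t: eta_bar / (t + t0) ** beta
D = 1.0 / ((t0 + 3) ** 2 * np.exp(16 + 6 / (2 - beta)))

alpha = np.sqrt(eta(0) * L)                      # alpha_0 < 1/2
lam = 1.0
for t in range(1, 20001):
    lam *= (1 - alpha)                           # lambda_t = (1 - alpha_{t-1}) lambda_{t-1}
    c = (eta(t) / eta(t - 1)) * alpha ** 2
    alpha = 0.5 * (-c + np.sqrt(c * c + 4 * c))  # alpha_t: root in (0,1) of a^2 = (1-a)c
    assert alpha <= 2.0 / (t + 1)                # Lemma 13 (i)
    assert lam >= D / (t + t0) ** (2 - beta)     # Lemma 13 (iii)
print(lam * 20000 ** (2 - beta))                 # bounded, illustrating (ii)
```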
Now we proceed to prove Theorem 2.

Proof of Theorem 2: It is easy to check that all the conditions of Lemmas 6 and 13 are satisfied, hence the conclusions of Lemmas 6 and 13 hold. The major step of proving the theorem is to show the following inequality:

λt(Φ0(x*) − f*) + φ*t ≥ f(x̄(t)).   (22)

If (22) is true, then by (22) and (10), we have

f(x̄(t)) ≤ φ*t + λt(Φ0(x*) − f*) ≤ Φt(x*) + λt(Φ0(x*) − f*) ≤ f* + 2λt(Φ0(x*) − f*).

Hence f(x̄(t)) − f* = O(λt) = O(1/t^{2−β}), i.e., the desired result of the theorem follows.

Now we use induction to prove (22). Firstly, (22) is true for t = 0, since φ*0 = f(x̄(0)) and Φ0(x*) > f*. Suppose it is true for 0, 1, 2, . . . , t. For 0 ≤ k ≤ t, by (10), Φk(x*) ≤ f* + λk(Φ0(x*) − f*). Hence

φ*k + (γk/2)‖x* − v̄(k)‖² ≤ f* + λk(Φ0(x*) − f*).

Using the induction assumption, we get

f(x̄(k)) + (γk/2)‖x* − v̄(k)‖² ≤ f* + 2λk(Φ0(x*) − f*).   (23)

Since f(x̄(k)) ≥ f* and γk = λkγ0, we have ‖x* − v̄(k)‖² ≤ (4/γ0)(Φ0(x*) − f*). Since v̄(k) = (1/αk)(ȳ(k) − x̄(k)) + x̄(k), we have ‖v̄(k) − x*‖² = ‖(1/αk)(ȳ(k) − x̄(k)) + x̄(k) − x*‖² ≥ (1/(2αk²))‖ȳ(k) − x̄(k)‖² − ‖x̄(k) − x*‖². By (23), f(x̄(k)) ≤ 2Φ0(x*) − f* = 2f(x̄(0)) − f* + γ0‖v̄(0) − x*‖². Also, since γ0 = L/(1−α0) < 2L, we have that x̄(k) lies within the (2f(x̄(0)) − f* + 2L‖v̄(0) − x*‖²)-level set of f. By Assumption 3 and Proposition B.9 of [31], the level set is compact. Hence we have ‖x̄(k) − x*‖ ≤ R, where R is the diameter of that level set. Combining the above arguments, we get

‖ȳ(k) − x̄(k)‖² ≤ 2αk² (‖v̄(k) − x*‖² + ‖x̄(k) − x*‖²)
≤ 2αk² [ R² + (4/γ0)(f(x̄(0)) − f*) + 2‖v̄(0) − x*‖² ]
≤ 2αk² [ R² + 4‖v̄(0) − x*‖² ] ≜ C1 αk²   (24)

where C1 is a constant that does not depend on η, and in the last inequality we used f(x̄(0)) − f* ≤ (L/2)‖v̄(0) − x*‖² (by L-smoothness and x̄(0) = v̄(0)) together with γ0 ≥ L.

Next, we consider (12):

φ*t+1 − f(x̄(t+1))
= (1 − αt)(φ*t − f(x̄(t))) + (1 − αt)f(x̄(t)) + αt f̂(t) − (1/2)ηt‖g(t)‖² + αt⟨g(t), v̄(t) − ȳ(t)⟩ − f(x̄(t+1))
(a)≥ (1 − αt)(φ*t − f(x̄(t))) + αt f̂(t) − (1/2)ηt‖g(t)‖² + (1 − αt){f̂(t) + ⟨g(t), x̄(t) − ȳ(t)⟩} + αt⟨g(t), v̄(t) − ȳ(t)⟩ − f(x̄(t+1))
(b)= (1 − αt)(φ*t − f(x̄(t))) + f̂(t) − (1/2)ηt‖g(t)‖² − f(x̄(t+1))   (25)
where (a) is due to Lemma 5 and (b) is due to αt(v̄(t) − ȳ(t)) + (1 − αt)(x̄(t) − ȳ(t)) = 0. By Lemma 5 and Lemma 6,

(a)≤ (4(t0+1)^{2−β}/D(β, t0)) η^{2/3} Σ_{k=0}^∞ 1/(k+1)^{5β/3}
(b)≤ (4(t0+1)^{2−β}/D(β, t0)) η^{2/3} · 2/(β − 0.6)
Fig. 1: Simulation results. Step sizes: Acc-DNGD with β = 0.61: ηt = 0.0045/(t+1)^{0.61}, α0 = 0.7071; Acc-DNGD with β = 0: ηt = 0.0045, α0 = 0.7071; D-NG: ηt = 0.0091/(t+1); DGD: ηt = 0.0091/√t; EXTRA: η = 0.0091; Acc-DGD: η = 0.0045; CGD: η = 0.0091; CNGD: η = 0.0091, α0 = 0.5.
V. CONCLUSION

In this paper we propose an Accelerated Distributed Nesterov Gradient Descent algorithm for distributed optimization of convex and smooth functions. We show a general O(1/t^{1.4−ε}) (∀ε ∈ (0, 1.4)) convergence rate, and an improved O(1/t²) convergence rate when the objective functions satisfy an additional property. Future work includes giving a tighter analysis of the convergence rates.

REFERENCES

[1] B. Johansson, "On distributed optimization in networked systems," 2008.
[2] J. A. Bazerque and G. B. Giannakis, "Distributed spectrum sensing for cognitive radio networks by exploiting sparsity," IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1847–1862, 2010.
[3] P. A. Forero, A. Cano, and G. B. Giannakis, "Consensus-based distributed support vector machines," Journal of Machine Learning Research, vol. 11, no. May, pp. 1663–1707, 2010.
[4] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," in 1984 American Control Conference, 1984, pp. 484–489.
[5] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ, 1989, vol. 23.
[6] A. Nedić and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," Automatic Control, IEEE Transactions on, vol. 54, no. 1, pp. 48–61, 2009.
[7] I. Lobel and A. Ozdaglar, "Convergence analysis of distributed subgradient methods over random networks," in Communication, Control, and Computing, 2008 46th Annual Allerton Conference on. IEEE, 2008, pp. 353–360.
[8] J. C. Duchi, A. Agarwal, and M. J. Wainwright, "Dual averaging for distributed optimization: convergence analysis and network scaling," Automatic Control, IEEE Transactions on, vol. 57, no. 3, pp. 592–606, 2012.
[9] S. S. Ram, A. Nedić, and V. V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516–545, 2010.
[10] A. Nedic and A. Olshevsky, "Stochastic gradient-push for strongly convex functions on time-varying directed graphs," arXiv preprint arXiv:1406.2075, 2014.
[11] ——, "Distributed optimization over time-varying directed graphs," Automatic Control, IEEE Transactions on, vol. 60, no. 3, pp. 601–615, 2015.
[12] I. Matei and J. S. Baras, "Performance evaluation of the consensus-based distributed subgradient method under random communication topologies," Selected Topics in Signal Processing, IEEE Journal of, vol. 5, no. 4, pp. 754–771, 2011.
[13] A. Olshevsky, "Linear time average consensus on fixed graphs and implications for decentralized optimization and multi-agent control," arXiv preprint arXiv:1411.4186, 2014.
[14] M. Zhu and S. Martínez, "On distributed convex optimization under inequality and equality constraints," Automatic Control, IEEE Transactions on, vol. 57, no. 1, pp. 151–164, 2012.
[15] I. Lobel, A. Ozdaglar, and D. Feijer, "Distributed multi-agent optimization with state-dependent communication," Mathematical Programming, vol. 129, no. 2, pp. 255–284, 2011.
[16] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
[17] C. Xi and U. A. Khan, "On the linear convergence of distributed optimization over directed graphs," arXiv preprint arXiv:1510.02149, 2015.
[18] J. Zeng and W. Yin, "ExtraPush for convex smooth decentralized optimization over directed networks," arXiv preprint arXiv:1511.02942, 2015.
[19] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, "Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes," in 2015 54th IEEE Conference on Decision and Control (CDC). IEEE, 2015, pp. 2055–2060.
[20] P. Di Lorenzo and G. Scutari, "Distributed nonconvex optimization over networks," in Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2015 IEEE 6th International Workshop on. IEEE, 2015, pp. 229–232.
[21] P. Di Lorenzo and G. Scutari, "NEXT: In-network nonconvex optimization," IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.
[22] G. Qu and N. Li, "Harnessing smoothness to accelerate distributed optimization," arXiv preprint arXiv:1605.07112, 2016.
[23] A. Nedich, A. Olshevsky, and W. Shi, "Achieving geometric convergence for distributed optimization over time-varying graphs," arXiv preprint arXiv:1607.03218, 2016.
[24] A. Nedic, A. Olshevsky, W. Shi, and C. A. Uribe, "Geometrically convergent distributed optimization with uncoordinated step-sizes," arXiv preprint arXiv:1609.05877, 2016.
[25] C. Xi and U. A. Khan, "ADD-OPT: Accelerated distributed directed optimization," arXiv preprint arXiv:1607.04757, 2016.
[26] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013, vol. 87.
[27] G. Qu and N. Li, "Accelerated distributed Nesterov gradient descent for smooth and strongly convex functions," in Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on. IEEE, 2016, pp. 209–216.
[28] D. Jakovetic, J. Xavier, and J. M. Moura, "Fast distributed gradient methods," Automatic Control, IEEE Transactions on, vol. 59, no. 5, pp. 1131–1146, 2014.
[29] A. Olshevsky and J. N. Tsitsiklis, "Convergence speed in distributed consensus and averaging," SIAM Journal on Control and Optimization, vol. 48, no. 1, pp. 33–55, 2009.
[30] R. Olfati-Saber, J. A. Fax, and R. M. Murray, "Consensus and cooperation in networked multi-agent systems," Proceedings of the IEEE, vol. 95, no. 1, pp. 215–233, Jan 2007.
[31] D. P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[32] G. Qu and N. Li. (2017) Accelerated distributed Nesterov gradient descent for convex and smooth functions. [Online]. Available: http://scholar.harvard.edu/files/gqu/files/cdc2017fullversion.pdf
[33] O. Devolder, F. Glineur, and Y. Nesterov, "First-order methods of smooth convex optimization with inexact oracle," Mathematical Programming, vol. 146, no. 1-2, pp. 37–75, 2014.
[34] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2012.
[35] P. Erdős and A. Rényi, "On random graphs I," Publ. Math. Debrecen, vol. 6, pp. 290–297, 1959.