3. THE ACCELERATED GRADIENT DESCENT (AGD) METHOD

ZHE (JIMMY) ZHANG, PURDUE IE

Nesterov acceleration is an algebraic miracle. As a miracle, we should not seek to understand it. —Arkadi Nemirovsky
In this lecture, we introduce the breakthrough accelerated gradient algorithm [4] introduced by Nesterov in 1983. The AGD method improves the O(1/ϵ) GD oracle complexity to O(1/√ϵ) by evaluating the gradient at some convex combination of previous points. As illustrated by the above quote from his then mentor, Arkadi, the rationale behind the design of the algorithm is mysterious. Aside from the algebra, there is no transparent understanding of why it should work, nor of how to apply it to new settings. Even today, 40 years after its publication, accelerated algorithms for new problem settings are still nontrivial and deemed highly valuable by the research community. Our goal is to illustrate part of the magic from the perspective of Fenchel conjugates [2].

1. Definitions: Smoothness and Strong Convexity


Definition 1. For a convex C¹ function f : X → R, the following definitions of L-smoothness are equivalent:
• ∥∇f(x) − ∇f(y)∥∗ ≤ L∥x − y∥ ∀x, y ∈ X.
• f(x) − f(y) − ⟨∇f(y), x − y⟩ ≤ (L/2)∥x − y∥² ∀x, y ∈ X.
Definition 2. For a convex function f : X → R, the following definitions of strong convexity with modulus µ are equivalent:
• ∥f′(x) − f′(y)∥∗ ≥ µ∥x − y∥ ∀x, y ∈ X.
• f(x) − f(y) − ⟨f′(y), x − y⟩ ≥ (µ/2)∥x − y∥² ∀x, y ∈ X.

2. Motivation for the Dual Perspective


It is useful to recall how we handled the error terms to arrive at the recursive inequality for the simpler subGD and GD methods:
• SubGD: we used the following inequality to arrive at an error term M²/(2η_t) in each iteration. These error terms accumulate, so we have to choose a diminishing stepsize 1/η_t = O(1)/√t, leading to the O(1/ϵ²) complexity:

⟨x^t − x^{t−1}, f′(x^t)⟩ − (η_t/2)∥x^t − x^{t−1}∥² ≤ ∥x^t − x^{t−1}∥M − (η_t/2)∥x^t − x^{t−1}∥² ≤ M²/(2η_t).

• Gradient descent method: with a constant stepsize 1/η_t = O(1)/L, the error term in each iteration can be canceled out, leading to a clean recursive formula which telescopes effectively:

⟨x^t − x^{t−1}, ∇f(x^t) − ∇f(x^{t−1})⟩ − (η_t/2)∥x^t − x^{t−1}∥² ≤ L∥x^t − x^{t−1}∥² − L∥x^t − x^{t−1}∥² ≤ 0.

• The AGD method: by viewing the gradient as a dual variable that can be controlled in a more refined manner, we can take an even larger stepsize 1/η_t = O(1)t/L, leading to the accelerated convergence rate of O(1/t²):

⟨x^t − x^{t−1}, π^{t−1} − π^{t−2}⟩ − (η_t/2)∥x^t − x^{t−1}∥² − τ_{t−1}D_{f∗}(π^{t−1}; π^{t−2}) ≤ 0.

3. The Fenchel Conjugate Function


Proposed by Fenchel in 1945, the Fenchel conjugate function f∗ : R^d → R̄ associated with a function f : R^d → R̄ is defined as

f∗(π) := max_{x∈R^d} ⟨π, x⟩ − f(x).

Some interpretations:
(1) A closed convex set admits both an inner and an outer description. Inner: it contains every line segment between its points. Outer: it is the intersection of half-spaces. If we consider the epigraph G(f) := {(x, t) : t ≥ f(x)}, then f∗ specifies the supporting half-spaces of G(f). (Draw a picture.)
(2) Specifically, it captures the intercept of the linear support functional, because
min_x f(x) − ⟨x, π⟩ = −max_x ⟨π, x⟩ − f(x) = −f∗(π).
Thus x ↦ ⟨π, x⟩ − f∗(π) is a linear support function for f associated with the dual variable π, and vice versa. (This is made precise in Lemma 3.)
(3) It can be regarded as the Laplace transform of convex analysis, with the integration replaced by maximization. (See Chapter 4 of [1] for calculations for commonly used functions.)
(4) Example: for an indicator function δ_C(x), its conjugate function is the support function σ_C(π) = max_{x∈C} ⟨π, x⟩ (why?). For instance, σ_{∆+}(y) = max_i y_i = (δ_{∆+})∗(y).
(5) And the exponent conjugate (hence the name conjugate; proved using Hölder's inequality), illustrated numerically in the sketch after this list:
(1/p)∥y∥_p^p = max_x ⟨x, y⟩ − (1/q)∥x∥_q^q.
(6) In microeconomics, given a cost function c(Q), the conjugate c∗(P) = max_Q PQ − c(Q) expresses the maximal profit as a function of price. The conjugate is also used in probability, applied to the (log) moment generating function, as part of the Chernoff bound.
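As a quick numerical sanity check of the definition (a hedged sketch: the one-dimensional function, exponents, and grid below are our illustrative choices, not from the text), we can approximate f∗(π) = max_x ⟨π, x⟩ − f(x) on a grid for f(x) = |x|^q/q and compare it with the conjugate |π|^p/p predicted by interpretation (5):

import numpy as np

# Grid approximation of the Fenchel conjugate in one dimension.
# For f(x) = |x|^q / q the conjugate should be f*(pi) = |pi|^p / p
# with 1/p + 1/q = 1 (Hoelder conjugate exponents).
q, p = 3.0, 1.5                                   # 1/1.5 + 1/3 = 1
xs = np.linspace(-10, 10, 200001)                 # grid over the inner variable x

def conjugate(pi):
    """Approximates f*(pi) = max_x pi*x - |x|^q / q by a grid search."""
    return np.max(pi * xs - np.abs(xs) ** q / q)

for pi in [0.5, 1.0, 2.0]:
    exact = abs(pi) ** p / p                      # predicted conjugate value
    print(f"pi = {pi}: grid {conjugate(pi):.4f} vs predicted {exact:.4f}")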
Focusing on CCP (convex, closed, and proper) functions, the Fenchel conjugate admits the following desirable properties.
Lemma 3. Given a convex, closed and proper function f : R^d → R̄, the Fenchel transformation satisfies the following properties.
a) Involution: f∗∗ ≡ f (Theorem 11.1 of [6]).
b) f∗ is a convex, closed and proper function.
c) Linear support functional (Fenchel–Young inequality): f(x) + f∗(π) ≥ ⟨π, x⟩ ∀x, π ∈ R^d.
Since the Fenchel conjugate characterizes the linear support functionals, we can also use it to provide an alternative characterization of the sub-differential.
Theorem 4. Given a CCP function f and its Fenchel conjugate f∗, the following are equivalent (see Section 4.6 of [1]).
a) ⟨π, x⟩ = f(x) + f∗(π).
b.i) π ∈ ∂f(x).
b.ii) x ∈ ∂f∗(π).
c.i) π ∈ arg max_{π̄} ⟨x, π̄⟩ − f∗(π̄).
c.ii) x ∈ arg max_{x̄} ⟨π, x̄⟩ − f(x̄).
Proof. The equivalence between a) and c) follows from the definition of the Fenchel conjugate and Lemma 3: c.ii) ⇒ a) is the definition of the Fenchel conjugate, and a) ⇒ c.ii) follows from Lemma 3.c). Moreover, the equivalence for c.i) follows from involution.

a) ⇒ b.i) follows from the definition of the sub-gradient: for all x̄ ∈ R^d we have

f(x̄) ≥ ⟨π, x̄⟩ − f∗(π)
= ⟨π, x̄⟩ − (⟨π, x⟩ − f(x))
= f(x) + ⟨π, x̄ − x⟩.

b.i) ⇒ a): f(x̄) ≥ ⟨x̄ − x, π⟩ + f(x) ∀x̄ implies that ⟨π, x⟩ − f(x) ≥ ⟨π, x̄⟩ − f(x̄), thus f∗(π) = ⟨π, x⟩ − f(x) by the definition of the Fenchel conjugate. □
Some remarks are in order:
• The above result shows the equivalence between maximizers and the sub-differential.
– Thus ∂σ_{∆+}(y) = arg max_π ⟨y, π⟩ − δ_{∆+}(π) = arg max_{π∈∆+} ⟨y, π⟩.
– Given f(x) = ∥x∥_p = σ_{B_q(0,1)}(x), we have f∗(y) = δ_{B_q(0,1)}(y), and ∂f(x) = arg max_{y∈B_q(0,1)} ⟨x, y⟩ (this shows ∥·∥_p to be non-smooth at the origin for all p). In general, every norm can be represented as the support function of some convex set which is symmetric about zero. (Draw a picture.)
– N_X(x) = ∂δ_X(x) can be calculated by
{π : ⟨π, x⟩ = δ_X(x) + σ_X(π)} = {π : ⟨π, x⟩ = max_{y∈X} ⟨y, π⟩}
= {π : ⟨π, x⟩ ≥ ⟨y, π⟩ ∀y ∈ X} = {π : ⟨π, y − x⟩ ≤ 0 ∀y ∈ X}.
• Legendre: the sub-differential of the Fenchel conjugate is the inverse operation, π ∈ ∂f(x) ⇔ x ∈ ∂f∗(π).
• If we know f∗, since 0 ∈ ∂f(x∗), we can recover an optimal solution by evaluating x∗ ∈ ∂f∗(0). Knowledge of f∗, however, often implies that f is almost known to us as a white box, which can be unrealistic in many situations. In this sense, the Fenchel conjugate is often an abstract tool for obtaining a better understanding. In this lecture, we will use the Fenchel conjugate only implicitly in the convergence analysis; the algorithm can be run without any information about it. However, in some special settings like distributed optimization, explicit knowledge of f∗ can also be reasonable.

4. Some Intuition: Fenchel Conjugate and the Linear Support Function


In this section, we seek to illustrate the somewhat abstract Fenchel conjugate through some pictures. Recall that

−f∗(π̄) = min_x f(x) − ⟨x, π̄⟩.

Fixing π̄, the above definition of the Fenchel conjugate can be interpreted as moving the linear approximation function downward until it just touches f. Specifically, the intercept −f∗(π̄) captures the linear functional l_f(x; π̄) := ⟨x, π̄⟩ − f∗(π̄), which touches f only at the minimizer x̄, i.e.,

l_f(x̄; π̄) = ⟨x̄, π̄⟩ − f∗(π̄) = ⟨x̄, π̄⟩ + f(x̄) − ⟨x̄, π̄⟩ = f(x̄),
l_f(x; π̄) = ⟨x, π̄⟩ − f∗(π̄) ≤ ⟨x, π̄⟩ + f(x) − ⟨x, π̄⟩ = f(x).

Next, fixing x̄, the maximization in Lemma 3 can be interpreted as going upward: finding the lower linear functional l_f(·; π) that touches f at x̄,

π̄ ∈ arg max_π l_f(x̄; π) = arg max_π ⟨π, x̄⟩ − f∗(π).

Clearly, the slope π̄ of the linear support functional corresponds to a subgradient at x̄, π̄ ∈ ∂f(x̄).

Figure 4.1. The Fenchel Conjugate Function Illustrated.

5. The Duality Between Strong Convexity and Smoothness


Lemma 5. Given a distance generating function Φ with domain Ω such that X ⊂ Ω and the associated Bregman distance D_Φ, consider the iterate generated by the following update with η > 0 and x̄ ∈ rint(Ω):

x⁺ ← arg min_{x∈X} ⟨g, x⟩ + r(x) + ηD_Φ(x; x̄).

Then x⁺ ∈ rint(Ω) and

⟨g, x⁺ − x⟩ + ηD_Φ(x; x⁺) + D_r(x; x⁺) + r(x⁺) + ηD_Φ(x⁺; x̄) ≤ ηD_Φ(x; x̄) + r(x) ∀x ∈ X.

Proof. The result follows from the first-order optimality condition

⟨g + η(∇Φ(x⁺) − ∇Φ(x̄)) + r′(x⁺), x − x⁺⟩ ≥ 0 ∀x ∈ X,

combined with the three point identity ⟨∇Φ(x⁺) − ∇Φ(x̄), x⁺ − x⟩ = D_Φ(x; x⁺) + D_Φ(x⁺; x̄) − D_Φ(x; x̄). The only different term is ⟨r′(x⁺), x − x⁺⟩ = r(x) − r(x⁺) − (r(x) − r(x⁺) − ⟨r′(x⁺), x − x⁺⟩) = r(x) − r(x⁺) − D_r(x; x⁺). □
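As a numerical sanity check of Lemma 5 (a hedged sketch: the entropy distance generating function, r = 0, and the random data below are our illustrative choices), recall that for Φ(x) = Σᵢ xᵢ log xᵢ on the simplex the update has the well-known closed form x⁺ᵢ ∝ x̄ᵢ exp(−gᵢ/η); the three point inequality can then be verified directly:

import numpy as np

# Check the three point inequality of Lemma 5 for the entropy Bregman
# distance on the simplex with r = 0, where the prox step is closed form.
rng = np.random.default_rng(2)
d, eta = 5, 2.0
g = rng.standard_normal(d)
x_bar = rng.dirichlet(np.ones(d))                # x̄ in the relative interior

D = lambda x, y: np.sum(x * np.log(x / y))       # Bregman distance of entropy

x_plus = x_bar * np.exp(-g / eta)
x_plus /= x_plus.sum()                           # x+ = argmin <g,x> + eta*D(x; x̄)

for _ in range(5):                               # random feasible test points x
    x = rng.dirichlet(np.ones(d))
    lhs = g @ (x_plus - x) + eta * D(x, x_plus) + eta * D(x_plus, x_bar)
    assert lhs <= eta * D(x, x_bar) + 1e-10      # Lemma 5 with r = 0
print("three point inequality verified on random test points")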
Theorem 6. A convex function f : R^d → R̄ is µ-strongly convex if and only if f∗ is (1/µ)-smooth.
Proof. ⇒: Let π̂, π̄ ∈ dom(∂f∗) be given. Consider any x̄ ∈ ∂f∗(π̄) and x̂ ∈ ∂f∗(π̂), i.e., π̄ ∈ ∂f(x̄) and π̂ ∈ ∂f(x̂). By Theorem 4, we get

x̂ ∈ arg min_x ⟨−π̂, x⟩ + f(x).

Thus it follows from Lemma 5 that (why?)

⟨−π̂, x̂ − x⟩ + D_f(x; x̂) + f(x̂) ≤ f(x) ∀x
⇒ ⟨−π̂, x̂ − x̄⟩ + (µ/2)∥x̄ − x̂∥² + f(x̂) ≤ f(x̄).

Similarly, we get

(5.1) ⟨−π̄, x̄ − x̂⟩ + (µ/2)∥x̂ − x̄∥² + f(x̄) ≤ f(x̂).

Adding up the preceding formulae then leads to

⟨π̂ − π̄, x̂ − x̄⟩ ≥ µ∥x̂ − x̄∥²
⇒ µ∥x̂ − x̄∥² ≤ ⟨π̂ − π̄, x̂ − x̄⟩ ≤ ∥π̂ − π̄∥∗∥x̂ − x̄∥
⇒ ∥x̂ − x̄∥ ≤ (1/µ)∥π̂ − π̄∥∗,

which is exactly the (1/µ)-smoothness of f∗.
⇐: By involution, it is perhaps more illustrative to assume f to be L-smooth and show f∗ to be (1/L)-strongly convex. The following claim, dual to (5.1), is useful for us.
Claim: for a smooth convex f, we have

(5.2) f(y) − f(x̄) − ⟨∇f(x̄), y − x̄⟩ ≥ (1/(2L))∥∇f(y) − ∇f(x̄)∥∗² ∀y, x̄ ∈ dom(∂f).

We will prove the claim later; let us assume it is true for now. Again, let π̂, π̄ ∈ dom(∂f∗) be given. Consider any x̄ ∈ ∂f∗(π̄) and x̂ ∈ ∂f∗(π̂), i.e., π̄ = ∇f(x̄) and π̂ = ∇f(x̂) (why?). We have from (5.2) that

f(x̂) − f(x̄) − ⟨∇f(x̄), x̂ − x̄⟩ ≥ (1/(2L))∥π̂ − π̄∥∗²
f(x̄) − f(x̂) − ⟨∇f(x̂), x̄ − x̂⟩ ≥ (1/(2L))∥π̂ − π̄∥∗².

Adding them up leads to

(1/L)∥π̂ − π̄∥∗² ≤ ⟨π̂ − π̄, x̂ − x̄⟩ ≤ ∥x̂ − x̄∥∥π̂ − π̄∥∗.

Thus we get the desired strong convexity relation ∥(f∗)′(π̂) − (f∗)′(π̄)∥ = ∥x̂ − x̄∥ ≥ (1/L)∥π̂ − π̄∥∗.
Now we show the claim in (5.2). Consider φ(x) = f(x) − ⟨∇f(x̄), x⟩. Clearly, x̄ ∈ arg min_{x∈R^d} φ(x) and we have

φ(x̄) ≤ φ(x) ≤ φ(y) + ⟨∇φ(y), x − y⟩ + (L/2)∥x − y∥² ∀x, y ∈ R^d.

Let v ∈ arg max_{x∈B(0,1)} ⟨∇φ(y), x⟩ so that ⟨v, ∇φ(y)⟩ = ∥∇φ(y)∥∗. Taking x = y − (∥∇φ(y)∥∗/L)v leads to

φ(x̄) ≤ φ(y) − (1/L)∥∇φ(y)∥∗² + (1/(2L))∥∇φ(y)∥∗²
⇒ (1/(2L))∥∇f(y) − ∇f(x̄)∥∗² ≤ f(y) − f(x̄) − ⟨∇f(x̄), y − x̄⟩. □

For a quick sanity check of the theorem: the µ-strongly convex f(x) = (µ/2)∥x∥₂² has conjugate f∗(π) = (1/(2µ))∥π∥₂², which is indeed (1/µ)-smooth.

6. The Primal And Dual Version of the AGD Method


Nesterov's original version is similar to Algorithm 1. Other than the fact that the algebra just works out, the technique of evaluating the gradient at some averaged point makes little sense at first. A more modern re-interpretation of the method from the primal-dual perspective is shown in Algorithm 2: the minimization problem is reformulated instead as a saddle point game.

Algorithm 1 The Primal Version of AGD [2]

• With x⁰ ∈ X. For t = 1, 2, 3, ..., do the following:
(1) Set x̲^t := α_t x^{t−1} + (1 − α_t)x̄^{t−1}.
(2) Compute x^t ← arg min_{x∈X} ⟨∇f(x̲^t), x⟩ + (η_t/2)∥x − x^{t−1}∥².
(3) Set x̄^t = α_t x^t + (1 − α_t)x̄^{t−1}.

Algorithm 2 The Primal-Dual Version of AGD [2]

• With x^{−1} = x⁰ ∈ X and π⁰ ∈ ∂f(x⁰). For t = 1, 2, 3, ..., do the following:
(1) Set x̃^t = x^{t−1} + θ_t(x^{t−1} − x^{t−2}).
(2) Compute π^t ← arg max_π ⟨π, x̃^t⟩ − f∗(π) − τ_t D_{f∗}(π; π^{t−1}).
(3) Compute x^t ← arg min_{x∈X} ⟨x, π^t⟩ + (η_t/2)∥x − x^{t−1}∥².

(6.1) min_{x∈X} f∗∗(x) = min_{x∈X} max_π {L(x; π) := ⟨π, x⟩ − f∗(π)}.

In this zero-sum game, the max-player π seeks to maximize his pay-off L(x; π), which depends on the action of the primal player x, while the primal player seeks to minimize his loss. In the t-th iteration of the AGD Algorithm 2, since his reward depends on the yet-unknown action x^t, the dual player constructs an estimate x̃^t of it by extrapolating from the history, and then maximizes his reward L(x̃^t; π) subject to some prox-penalty (Line 2). After that, seeing the action π^t, the x player responds by minimizing L(x; π^t) subject to some prox-penalty term.
The surprising gradient evaluation in Algorithm 1 can be interpreted as a dual proximal step. Before going into the details, let us see that the π^t update can indeed be reinterpreted as a gradient evaluation step:

π^t ← arg max_π ⟨π, x̃^t⟩ − f∗(π) − τ_t D_{f∗}(π; π^{t−1})
⇒ π^t ← arg max_π ⟨π, x̃^t⟩ − f∗(π) − τ_t(f∗(π) − ⟨(f∗)′(π^{t−1}), π⟩)
⇒ π^t ← arg max_π ⟨π, x̃^t + τ_t(f∗)′(π^{t−1})⟩ − (1 + τ_t)f∗(π)
⇒ π^t = ∇f(x̲^t) with x̲^t := (x̃^t + τ_t(f∗)′(π^{t−1}))/(1 + τ_t). (why? Theorem 4)
Moreover, this general framework allows us to recover some classical algorithms as special cases (a numerical sketch of Algorithm 2 follows the list).
• The gradient descent method corresponds to choosing θ_t = 0 (no extrapolation) and τ_t = 0 (gradient evaluated at x^{t−1}, i.e., π^t = ∇f(x^{t−1})).
• The Frank–Wolfe method corresponds to choosing θ_t = 0 (no extrapolation) and η_t = 0 (no primal stability term).
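To make the mechanics concrete, here is a minimal numerical sketch of Algorithm 2 on an unconstrained least squares problem (so X = R^d and the primal prox step is explicit), using the stepsize choices of Theorem 10 below and the gradient re-interpretation of the dual step. The problem data and iteration count are our illustrative choices, not from the text:

import numpy as np

# Primal-dual AGD (Algorithm 2) for f(x) = 0.5*||Ax - b||^2 on X = R^d.
# The dual prox step is realized as pi^t = grad f(x_md^t), where x_md is the
# averaged (underlined) point x_md^t = (x_tilde^t + tau_t*x_md^{t-1})/(1+tau_t),
# using (f*)'(pi^{t-1}) = x_md^{t-1} (Theorem 4 / Legendre inversion).
rng = np.random.default_rng(0)
n, d = 50, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
L = np.linalg.eigvalsh(A.T @ A).max()            # smoothness constant of f

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

x_prev = x = np.zeros(d)                         # x^{-1} = x^0 = 0
x_md = np.zeros(d)                               # averaged point, x_md^0 = x^0
x_bar, w_sum = np.zeros(d), 0.0                  # weighted average with w_t = t
for t in range(1, 201):
    theta, tau, eta = (t - 1) / t, (t - 1) / 2, 2 * L / t   # Theorem 10
    x_tilde = x + theta * (x - x_prev)           # dual player's extrapolation
    x_md = (x_tilde + tau * x_md) / (1 + tau)    # averaged point of this section
    pi = grad(x_md)                              # dual prox step = gradient step
    x_prev, x = x, x - pi / eta                  # primal prox step on X = R^d
    w_sum += t
    x_bar += (t / w_sum) * (x - x_bar)           # running weighted average

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(f"function value gap: {f(x_bar) - f(x_star):.3e}")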

7. The Convergence Analysis


7.1. The Q-Gap Function. We introduce a so-called Q-gap function to characterize the convergence of the generated iterates to the Nash equilibrium point.
Definition 7. Given a convex-concave saddle point problem L(x; π) and some point z^t := (x^t; π^t), the Q-gap function associated with a reference point z := (x, π) is defined as

Q(z^t; z) := L(x^t; π) − L(x; π^t).


Some properties associated with the gap function are in order:
• The gap associated with z^t is defined as G(z^t) := max_{z∈X×Π} Q(z^t; z). G(z^t) is always non-negative, and G(z^t) ≤ 0 if and only if z^t is a Nash equilibrium point satisfying (why?)
L(x^t; π) ≤ L(x^t; π^t) ∀π ∈ Π
L(x; π^t) ≥ L(x^t; π^t) ∀x ∈ X.
• Fixing the reference point z, the Q-gap function is convex with respect to z^t (it is linear in x^t and convex in π^t). In particular, Jensen's inequality can be applied to obtain
Q(z̄^N; z) ≤ Σ_{t=1}^N p_t Q(z^t; z), with p ∈ ∆_+^N and z̄^N = Σ_{t=1}^N p_t z^t.
• It provides an upper bound on the function value gap.
Lemma 8. Consider the saddle point reformulation L(x; π) in (6.1) associated with some function f. For some feasible solution z^t := (x^t; π^t) ∈ X × dom(f∗), we have

f(x^t) − f_* ≤ Q(z^t; (x∗; π̂)) with π̂ ∈ ∂f(x^t).


Proof. Use the Legendre property of the Fenchel conjugate in Theorem 4. □
The preceding result tells us that we can bound the function value gap by analyzing the convergence of the Q-gap function.
7.2. The Detailed Convergence Bound. The next proposition specifies some stepsize requirements that ensure the convergence of the Q-gap function.
Proposition 9. Consider the iterates generated by Algorithm 2. Suppose the stepsizes satisfy the following requirements for some weights ω_t for all t ≥ 2:

ω_{t−1} = ω_t θ_t
ω_{t−1}(1 + τ_{t−1}) ≥ ω_t τ_t
ω_{t−1}η_{t−1} ≥ ω_t η_t
τ_t η_{t−1} ≥ Lθ_t, (τ_N + 1)η_N ≥ L.

Then the following relation is valid for any z := (x, π) ∈ X × R^d:

Σ_{t=1}^N ω_t Q(z^t; z) + (ω_N η_N/2)∥x^N − x∥² ≤ (ω_1 η_1/2)∥x⁰ − x∥² + ω_1 τ_1 D_{f∗}(π; π⁰).
Proof. The following algebraic identity decomposition is crucial for our analysis. (The extrapolation prediction is crucial for this decomposition to telescope.)

A_t := ⟨π^t − π, x^t − x̃^t⟩ = ⟨π^t − π, x^t − [x^{t−1} + θ_t(x^{t−1} − x^{t−2})]⟩
= ⟨π^t − π, x^t − x^{t−1}⟩ − ⟨π^t − π, θ_t(x^{t−1} − x^{t−2})⟩
= ⟨π^t − π, x^t − x^{t−1}⟩ − θ_t⟨π^{t−1} − π, x^{t−1} − x^{t−2}⟩ − θ_t⟨π^t − π^{t−1}, x^{t−1} − x^{t−2}⟩.

Using the three point inequality (Lemma 5) associated with the primal update leads us to

L(x^t; π^t) − L(x; π^t) + (η_t/2)∥x^t − x∥² + (η_t/2)∥x^{t−1} − x^t∥² ≤ (η_t/2)∥x^{t−1} − x∥².

So the ω_t weighted sum satisfies

Σ_{t=1}^N ω_t[L(x^t; π^t) − L(x; π^t)] + (ω_N η_N/2)∥x^N − x∥² + Σ_{t=1}^N (ω_t η_t/2)∥x^{t−1} − x^t∥² ≤ (ω_1 η_1/2)∥x⁰ − x∥².

Similarly, the three point inequality for the dual update implies that (why?)

L(x^t; π) − L(x^t; π^t) + A_t + (τ_t + 1)D_{f∗}(π; π^t) + τ_t D_{f∗}(π^t; π^{t−1}) ≤ τ_t D_{f∗}(π; π^{t−1}).

So the ω_t weighted sum satisfies

Σ_{t=1}^N ω_t[L(x^t; π) − L(x^t; π^t)] + Ā_N + ω_N(τ_N + 1)D_{f∗}(π; π^N) + Σ_{t=1}^N ω_t τ_t D_{f∗}(π^t; π^{t−1}) ≤ τ_1 ω_1 D_{f∗}(π; π⁰),

where

Ā_N := ω_N⟨π^N − π, x^N − x^{N−1}⟩ − Σ_{t=1}^N θ_t ω_t⟨π^t − π^{t−1}, x^{t−1} − x^{t−2}⟩.

In particular, we have from the 1/L strong convexity of f∗ (see Theorem 6) that

θ_t ω_t⟨π^t − π^{t−1}, x^{t−1} − x^{t−2}⟩ ≤ ω_t τ_t D_{f∗}(π^t; π^{t−1}) + (ω_{t−1}η_{t−1}/2)∥x^{t−1} − x^{t−2}∥²
−ω_N⟨π^N − π, x^N − x^{N−1}⟩ ≤ ω_N(τ_N + 1)D_{f∗}(π; π^N) + (ω_N η_N/2)∥x^N − x^{N−1}∥².

Adding up all the relations leads to the desired convergence guarantee. □

Now we specify some concrete stepsize choices to obtain the desired function value convergence bound.

Theorem 10. With

ω_t = t, τ_t = (t − 1)/2, η_t = 2L/t, and z̄^N = (Σ_{t=1}^N ω_t z^t)/(Σ_{t=1}^N ω_t),

we get f(x̄^N) − f_* ≤ Q(z̄^N; (x∗; ∇f(x̄^N))) ≤ O(1)L∥x⁰ − x∗∥²/N².
Moreover, the distance ∥x^t − x∗∥² is monotonically non-increasing.
Indeed, with θ_t = ω_{t−1}/ω_t = (t − 1)/t, one can check the conditions of Proposition 9: ω_{t−1}(1 + τ_{t−1}) = (t − 1)t/2 = ω_t τ_t; ω_{t−1}η_{t−1} = 2L = ω_t η_t; τ_t η_{t−1} = L ≥ Lθ_t; and (τ_N + 1)η_N = (N + 1)L/N ≥ L. The bound then follows from Proposition 9, Lemma 8, and the Jensen property of the Q-gap function, since Σ_{t=1}^N ω_t = N(N + 1)/2.

Remarks:
• The above convergence bound can also be obtained from a purely primal analysis, see [2]. We can also obtain the desired linear convergence rate O(√κ log(L∥x⁰ − x∗∥²/ϵ)) if f is a strongly convex function. Because of the heavy use of the three point inequality, the extension to non-Euclidean geometry is straightforward.
• Nesterov retired this year. However, the exact reason why this scheme works is still not well-understood.

Function class        Criterion       M-Lipschitz non-smooth      L-smooth
Non-strongly convex   f(x^t) − f_*    M∥x⁰ − x∗∥/√t               L∥x⁰ − x∗∥²/t²
µ-strongly convex     f(x^t) − f_*    4M²/(µt)                    (1 + √(µ/L))^{−t}(f(x⁰) − f_*)

Table 1. The O(1) Optimal Convergence Rates for Convex Optimization

8. Restarting to Take Advantage of Strong Convexity


The (mini-max) optimal convergence rates for typical convex optimization problems are listed in Table 1. A problem being strongly convex improves the convergence rate by at least an order of magnitude. Here we illustrate how to use restarts to bootstrap the non-strongly convex rates into the strongly convex rates (a numerical sketch follows the list).
• For the non-smooth problem, we have

f(x^t) − f_* ≤ M∥x⁰ − x∗∥/√t ≤ (M/√t)·√(2(f(x⁰) − f_*)/µ),

using the strong convexity bound (µ/2)∥x⁰ − x∗∥² ≤ f(x⁰) − f_*. Thus restarting the appropriate sub-gradient descent algorithm after T = 8M²/((f(x⁰) − f_*)µ) iterations leads to a halving of the function value gap. We can calculate the total number of iterations to achieve ϵ-optimality by

N_ϵ ≤ Σ_{i≥1} 8M²/(2^{i−1}ϵµ) = O(1)M²/(ϵµ).

• For the smooth problem, we have

f(x^t) − f_* ≤ L∥x⁰ − x∗∥²/t² ≤ (L/t²)·(2(f(x⁰) − f_*)/µ).

Thus restarting the AGD method after T = 2√(L/µ) iterations leads to a halving of the function value gap. We can calculate the total number of iterations to achieve ϵ-optimality by

N_ϵ ≤ Σ_{i=1}^{log((f(x⁰)−f_*)/ϵ)} 2√(L/µ) = O(1)√(L/µ) log((f(x⁰) − f_*)/ϵ).
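The smooth restart scheme can be sketched in a few lines by wrapping the AGD loop from Section 6 (a hedged sketch: the quadratic test problem, the round length T = ⌈2√(L/µ)⌉, and the helper name agd_round are our illustrative choices, not from the text):

import numpy as np

def agd_round(grad, L, x0, T):
    """One AGD round (Algorithm 2 with the Theorem 10 stepsizes); returns
    the weighted average iterate carrying the O(L*D^2/T^2) guarantee."""
    x_prev = x = x0.copy()
    x_md = x0.copy()
    x_bar, w_sum = np.zeros_like(x0), 0.0
    for t in range(1, T + 1):
        theta, tau, eta = (t - 1) / t, (t - 1) / 2, 2 * L / t
        x_tilde = x + theta * (x - x_prev)
        x_md = (x_tilde + tau * x_md) / (1 + tau)
        x_prev, x = x, x - grad(x_md) / eta
        w_sum += t
        x_bar += (t / w_sum) * (x - x_bar)
    return x_bar

# Test problem: f(x) = 0.5 x^T H x with spectrum in [mu, L]; minimizer x* = 0.
rng = np.random.default_rng(1)
d = 30
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigs = np.linspace(1.0, 100.0, d)                # mu = 1, L = 100
H = Q @ np.diag(eigs) @ Q.T
mu, L = eigs[0], eigs[-1]
grad = lambda x: H @ x
f = lambda x: 0.5 * x @ H @ x

x = rng.standard_normal(d)
T = int(np.ceil(2 * np.sqrt(L / mu)))            # restart period from above
for _ in range(10):                              # each round roughly halves the gap
    x = agd_round(grad, L, x, T)
print(f"function value gap after {10 * T} total iterations: {f(x):.3e}")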

9. The Lower Complexity Bound


In this section, we will construct a quadratic problem to show the above O(1)L∥x⁰ − x∗∥²/N² oracle complexity bound to be un-improvable. Specifically, we are going to show that P_Σ∗(M, F) ≥ Ω(1)√(LD₀²/ϵ), where
• The termination criterion T_ϵ is the ϵ-optimality gap, that is, finding an x̄ s.t. f(x̄) − f_* ≤ ϵ, and the access is through the first order oracle O.
• F: the class of objective functions under consideration are all L-smooth, satisfy ∥x⁰ − x∗∥² ≤ D₀², and the feasible region is X = R^d.
• M: we will focus on deterministic iterative methods satisfying some linear-span assumption [5]. Extension to a more general class of methods is also possible but quite technical (see [3]).

9.1. The Linear Span Assumption. Without loss of generality, we can assume the initial point x⁰ to be 0 (why? hint: apply some shifting to the worst case function).

Definition 11. A first order method is said to satisfy the linear span assumption if the iterates it generates satisfy the following requirement:

(9.1) x^t ∈ span{x⁰, ∇f(x⁰), ..., ∇f(x^{t−1})}.

This definition highlights again that we are proving an information-theoretic lower complexity bound: the possible points that can be returned by any method are limited to the gradient information it has gathered. A few remarks are helpful for clarifying this definition.
• The linear span operation allows us to take any linear combination of the generated gradients. So it clearly covers the gradient descent (GD) method x^t ← x^{t−1} − (1/η_t)∇f(x^{t−1}) with any step-size choice.
• If we take x^t to be the gradient query point x̲^t in Algorithm 1, the AGD method also satisfies the assumption.
• This assumption does not cover non-Euclidean methods, e.g., the mirror descent method or its accelerated version. For instance, the pre-conditioned gradient method x^t ← x^{t−1} − (1/η_t)Σ^{−1}∇f(x^{t−1}) might lead us outside the linear span in (9.1). However, the lower bound can actually be strengthened by a rotational invariance argument to cover this case (see [3]).

9.2. The Lower Bound Gadget. We present a parameterized class of hard functions which can be tuned to provide the desired lower complexity bound. Particularly, the hard instance, parameterized by l > 0, r > 0, n ∈ N₊, is

(9.2) h(x; l, r, n) := (l/4)[(1/2)Σ_{i=1}^{n−1}(x_i − x_{i+1})² + (1/2)x_n² − rx_1].

The next lemma characterizes the properties of h.

Lemma 12. With an even n ≥ 2, the function h in (9.2) has the following properties.
a) It is l-Lipschitz smooth.
b) ∥x⁰ − x∗∥² ≤ r²n³.
c) For any first-order method satisfying the linear span assumption (Definition 11), the iterates generated in k iterations have non-zero entries only in the first k coordinates.
d) For any x̄ that satisfies x̄_{n/2+1} = x̄_{n/2+2} = ... = x̄_n = 0, we have

h(x̄) − h(x∗) ≥ lr²n/16.
Proof. The gradient of h is given by

(9.3) ∇h(x̄) = (l/4)(A x̄ − r e₁),

where e₁ is the first standard basis vector and A is the n×n tridiagonal matrix

A = [  1 −1
      −1  2 −1
          ⋱  ⋱  ⋱
            −1  2 −1
               −1  2 ].

Clearly, h is an l-Lipschitz smooth function (why?). Now we can solve the first order optimality condition ∇h(x∗) = 0 to obtain x∗ = r[n, n − 1, n − 2, ..., 1]ᵀ, which gives ∥x∗ − 0∥² ≤ r²n³ (a rough upper bound for n ≥ 2). Moreover, constrained to having non-zeros only in the first n/2 coordinates, we can calculate the optimal solution to be x̄∗ = r[n/2, n/2 − 1, ..., 1, 0, ..., 0]ᵀ. Thus the function value gap is lower bounded by

h(x̄∗) − h(x∗) = (r²l/4)[n/2 − n/4] = r²ln/16.

Property c) follows from the structure of the hard instance. □

Figure 9.1. Eigenvalues associated with the quadratic hard instance.
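Before moving to the final result, here is a hedged numerical sketch of the gadget (the concrete values of n, l, r are our illustrative choices, not from the text): it builds the matrix from (9.3), verifies the closed form for x∗, and illustrates the "resisting" behaviour of part c) by running gradient descent, which obeys the linear span assumption, from x⁰ = 0:

import numpy as np

# The lower-bound gadget: A is the tridiagonal matrix of (9.3).
n, l, r = 12, 1.0, 1.0
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A[0, 0] = 1.0                                    # first diagonal entry is 1
e1 = np.zeros(n); e1[0] = 1.0

h = lambda x: (l / 4) * (0.5 * x @ A @ x - r * x[0])
grad = lambda x: (l / 4) * (A @ x - r * e1)

x_star = r * np.arange(n, 0, -1).astype(float)   # x* = r[n, n-1, ..., 1]
assert np.allclose(grad(x_star), 0)              # verifies the closed form

L_s = np.linalg.eigvalsh((l / 4) * A).max()      # <= l, since eig(A) lie in [0, 4]
x = np.zeros(n)
for k in range(1, n // 2 + 1):
    x = x - grad(x) / L_s                        # GD obeys the linear span assumption
    support = np.flatnonzero(np.abs(x) > 1e-12)
    assert support.size == 0 or support.max() <= k - 1   # Lemma 12.c)
# After n/2 steps the gap is still large, as predicted by Lemma 12.d):
print(f"gap: {h(x) - h(x_star):.4f} >= l*r^2*n/16 = {l * r**2 * n / 16}")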
9.3. The Lower Bound Result. Now we are ready to utilize the hard-instance gadget to provide the lower complexity bound.
Theorem 13. Consider the task of finding an ϵ-optimal solution (in function value gap) for the class of L-smooth convex functions with ∥x⁰ − x∗∥² ≤ D₀² and ϵ ≤ LD₀²/256. The number of iterations required by any first order method satisfying the linear span assumption to find an ϵ-optimal solution for every problem inside the problem class is lower bounded by ⌊(1/8)√(LD₀²/ϵ)⌋, that is, P_Σ∗(M; F) ≥ Ω(1)√(LD₀²/ϵ).

Proof. Set the parameters in the hard instance gadget to be

n = 2⌊(1/8)√(LD₀²/ϵ)⌋, l = L, r² = D₀²/n³.

Clearly, the function h is inside our problem class (use parts a), b) of Lemma 12). From part c) of Lemma 12, the solution x^k generated in k ≤ ⌊(1/8)√(LD₀²/ϵ)⌋ = n/2 iterations by any method satisfying the linear span assumption has non-zeros only in the first n/2 coordinates. Then part d) of the lemma shows the function value gap is lower bounded by

h(x^k) − h(x∗) ≥ lr²n/16 = LD₀²/(16n²) ≥ LD₀²·(16ϵ)/(16LD₀²) = ϵ,

where the last inequality uses n ≤ (1/4)√(LD₀²/ϵ). □

A few remarks are in order regarding this hard instance.
• The key to this hard instance is to "resist" the progress of any first order method by restricting the iterates to a certain linear subspace, which is clear from part c) of the preceding lemma.
• This is the state of the art for optimal methods for smooth optimization. However, it is not the last word in the search for optimal methods.
• The eigenvalues of the matrix in (9.3) are plotted in Figure 9.1. The hard instance is particularly bad because its eigenvalues spread almost uniformly over the range [0, 4].
• If these eigenvalues are clustered in some way, say most of them are small except for a few, then the conjugate gradient (CG) method could lead to a much faster convergence rate.
• Moreover, if the dimension is small, then the ellipsoid method could also significantly improve the complexity result.
• Going further requires us to define more refined problem classes and to design more adaptive algorithms.

References
[1] Amir Beck. First-Order Methods in Optimization, volume 25. SIAM, 2017.
[2] Guanghui Lan. First-Order and Stochastic Optimization Methods for Machine Learning. Springer Nature, 2020.
[3] Arkadi Semenovich Nemirovsky and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, 1983.
[4] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN USSR, volume 269, pages 543–547, 1983.
[5] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2003.
[6] R. Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.
