3. THE ACCELERATED GRADIENT DESCENT (AGD) METHOD

ZHE (JIMMY) ZHANG, PURDUE IE

Nesterov acceleration is an algebraic miracle. As a miracle, we should not seek to understand it. —Arkadi Nemirovsky
In this lecture, we introduce the breakthrough accelerated gradient algorithm [4] introduced by Nesterov in 1983. The AGD method improves the O(1/ϵ) GD oracle complexity to O(1/√ϵ) by evaluating the gradient at some convex combination of previous points. As illustrated by the above quote from his then mentor, Arkadi, the rationale behind the design of the algorithm is mysterious. Aside from the algebra, there is no transparent understanding of why it should work, nor of how to apply it to new settings. Even today, 40 years after its publication, accelerated algorithms for new problem settings are still nontrivial and deemed highly valuable by the research community. Our goal is to illustrate part of the magic from the perspective of Fenchel conjugates [2].

1. Definitions: Smoothness and Strong Convexity


Definition 1. For a convex C¹ function f : X → R, the following definitions of L-smoothness are equivalent:
• ∥∇f(x) − ∇f(y)∥∗ ≤ L∥x − y∥ ∀x, y ∈ X.
• f(x) − f(y) − ⟨∇f(y), x − y⟩ ≤ (L/2)∥x − y∥² ∀x, y ∈ X.
Definition 2. For a convex function f : X → R, the following definitions of strong convexity with modulus µ are equivalent:
• ∥f′(x) − f′(y)∥∗ ≥ µ∥x − y∥ ∀x, y ∈ X.
• f(x) − f(y) − ⟨f′(y), x − y⟩ ≥ (µ/2)∥x − y∥² ∀x, y ∈ X.

2. Motivation for the Dual Perspective


It is useful to recall how we handled the error terms to arrive at the recursive inequality for the simpler subGD and GD methods:
• SubGD: we used the following inequality to arrive at an error term M²/(2η_t) in each iteration. These error terms accumulate, so we have to choose a diminishing stepsize 1/η_t = O(1)/√t, leading to the O(1/ϵ²) complexity:

⟨x^t − x^{t−1}, f′(x^t)⟩ − (η_t/2)∥x^t − x^{t−1}∥² ≤ ∥x^t − x^{t−1}∥M − (η_t/2)∥x^t − x^{t−1}∥² ≤ M²/(2η_t).

• Gradient descent method: with a constant stepsize 1/η_t = O(1)/L, the error term in each iteration can be canceled out, leading to a clean recursive formula which telescopes effectively:

⟨x^t − x^{t−1}, ∇f(x^t) − ∇f(x^{t−1})⟩ − (η_t/2)∥x^t − x^{t−1}∥² ≤ L∥x^t − x^{t−1}∥² − L∥x^t − x^{t−1}∥² ≤ 0.

• The AGD method: by viewing the gradient as a dual variable that can be controlled in a more refined manner, we can take an even larger stepsize 1/η_t = O(1)t/L, leading to the accelerated convergence rate of O(1/t²):

⟨x^t − x^{t−1}, π^{t−1} − π^{t−2}⟩ − (η_t/2)∥x^t − x^{t−1}∥² − τ_{t−1}D_{f∗}(π^{t−1}; π^{t−2}) ≤ 0.

3. The Fenchel Conjugate Function


Proposed by Fenchel in 1945, the Fenchel conjugate function f∗ : R^d → R̄ associated with a function f : R^d → R̄ is defined as

f∗(π) := max_{x∈R^d} ⟨π, x⟩ − f(x).

Some interpretations:
(1) A closed convex set admits both an inner and an outer description. Inner: it contains every line segment between its points. Outer: it is the intersection of half-spaces. If we consider the epigraph G(f) := {(x, t) : t ≥ f(x)}, then f∗ specifies the supporting half-spaces of G(f). (Draw a picture.)
(2) Specifically, it captures the intercept of the linear support functional, because
min_x f(x) − ⟨x, π⟩ = −max_x ⟨π, x⟩ − f(x) = −f∗(π).
Thus x ↦ ⟨π, x⟩ − f∗(π) is a linear support function for f associated with the dual variable π, and vice versa. (This is made precise in Lemma 3.)
(3) It can be regarded as the Laplace transform of convex analysis, with the integration replaced by maximization. (See Chapter 4 of [1] for calculations for commonly used functions.)
(4) Example: for an indicator function δ_C(x), its conjugate function is the support function σ_C(π) = max_{x∈C} ⟨π, x⟩ (why?). For instance, σ_{∆+}(y) = max_i y_i = (δ_{∆+})∗(y).
(5) And the exponent conjugate (hence the name conjugate; proved using Hölder's inequality), illustrated numerically in the sketch after this list:
(1/p)∥y∥_p^p = max_x ⟨x, y⟩ − (1/q)∥x∥_q^q.
(6) In microeconomics, given a cost function c(Q), the conjugate c∗(P) = max_Q PQ − c(Q) expresses the maximal profit as a function of price. The conjugate is also used in probability, applied to the (log) moment generating function, as part of the Chernoff bound.
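As a quick numerical sanity check of the definition (a hedged sketch: the one-dimensional function, exponents, and grid below are our illustrative choices, not from the text), we can approximate f∗(π) = max_x ⟨π, x⟩ − f(x) on a grid for f(x) = |x|^q/q and compare it with the conjugate |π|^p/p predicted by interpretation (5):

import numpy as np

# Grid approximation of the Fenchel conjugate in one dimension.
# For f(x) = |x|^q / q the conjugate should be f*(pi) = |pi|^p / p
# with 1/p + 1/q = 1 (Hoelder conjugate exponents).
q, p = 3.0, 1.5                                   # 1/1.5 + 1/3 = 1
xs = np.linspace(-10, 10, 200001)                 # grid over the inner variable x

def conjugate(pi):
    """Approximates f*(pi) = max_x pi*x - |x|^q / q by a grid search."""
    return np.max(pi * xs - np.abs(xs) ** q / q)

for pi in [0.5, 1.0, 2.0]:
    exact = abs(pi) ** p / p                      # predicted conjugate value
    print(f"pi = {pi}: grid {conjugate(pi):.4f} vs predicted {exact:.4f}")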
Focusing on CCP (convex, closed, and proper) functions, the Fenchel conjugate admits the following desirable properties.
Lemma 3. Given a convex, closed and proper function f : R^d → R̄, the Fenchel transformation satisfies the following properties.
a) Involution: f∗∗ ≡ f (Theorem 11.1 of [6]).
b) f∗ is a convex, closed and proper function.
c) Linear support functional (Fenchel–Young inequality): f(x) + f∗(π) ≥ ⟨π, x⟩ ∀x, π ∈ R^d.
Since the Fenchel conjugate characterizes the linear support functionals, we can also use it to provide an alternative characterization of the sub-differential.
Theorem 4. Given a CCP function f and its Fenchel conjugate f∗, the following are equivalent (see Section 4.6 of [1]).
a) ⟨π, x⟩ = f(x) + f∗(π).
b.i) π ∈ ∂f(x).
b.ii) x ∈ ∂f∗(π).
c.i) π ∈ arg max_{π̄} ⟨x, π̄⟩ − f∗(π̄).
c.ii) x ∈ arg max_{x̄} ⟨π, x̄⟩ − f(x̄).
Proof. The equivalence between a) and c) follows from the definition of the Fenchel conjugate and Lemma 3: c.ii) ⇒ a) is the definition of the Fenchel conjugate, and a) ⇒ c.ii) follows from Lemma 3.c). Moreover, the equivalence for c.i) follows from involution.

a) ⇒ b.i) follows from the definition of the sub-gradient: for all x̄ ∈ R^d we have

f(x̄) ≥ ⟨π, x̄⟩ − f∗(π)
= ⟨π, x̄⟩ − (⟨π, x⟩ − f(x))
= f(x) + ⟨π, x̄ − x⟩.

b.i) ⇒ a): f(x̄) ≥ ⟨x̄ − x, π⟩ + f(x) ∀x̄ implies that ⟨π, x⟩ − f(x) ≥ ⟨π, x̄⟩ − f(x̄), thus f∗(π) = ⟨π, x⟩ − f(x) by the definition of the Fenchel conjugate. □
Some remarks are in order:
• The above result shows the equivalence between maximizers and the sub-differential.
– Thus ∂σ_{∆+}(y) = arg max_π ⟨y, π⟩ − δ_{∆+}(π) = arg max_{π∈∆+} ⟨y, π⟩.
– Given f(x) = ∥x∥_p = σ_{B_q(0,1)}(x), we have f∗(y) = δ_{B_q(0,1)}(y), and ∂f(x) = arg max_{y∈B_q(0,1)} ⟨x, y⟩ (this shows ∥·∥_p to be non-smooth at the origin for all p). In general, every norm can be represented as the support function of some convex set which is symmetric about zero. (Draw a picture.)
– N_X(x) = ∂δ_X(x) can be calculated by
{π : ⟨π, x⟩ = δ_X(x) + σ_X(π)} = {π : ⟨π, x⟩ = max_{y∈X} ⟨y, π⟩}
= {π : ⟨π, x⟩ ≥ ⟨y, π⟩ ∀y ∈ X} = {π : ⟨π, y − x⟩ ≤ 0 ∀y ∈ X}.
• Legendre: the sub-differential of the Fenchel conjugate is the inverse operation, π ∈ ∂f(x) ⇔ x ∈ ∂f∗(π).
• If we know f∗, since 0 ∈ ∂f(x∗), we can recover an optimal solution by evaluating x∗ ∈ ∂f∗(0). Knowledge of f∗, however, often implies that f is almost known to us as a white box, which can be unrealistic in many situations. In this sense, the Fenchel conjugate is often an abstract tool for obtaining a better understanding. In this lecture, we will use the Fenchel conjugate only implicitly in the convergence analysis; the algorithm can be run without any information about it. However, in some special settings like distributed optimization, explicit knowledge of f∗ can also be reasonable.

4. Some Intuition: Fenchel Conjugate and the Linear Support Function


In this section, we seek to illustrate the somewhat abstract Fenchel conjugate through some pictures. Recall that

−f∗(π̄) = min_x f(x) − ⟨x, π̄⟩.

Fixing π̄, the above definition of the Fenchel conjugate can be interpreted as moving the linear approximation function downward until it just touches f. Specifically, the intercept −f∗(π̄) captures the linear functional l_f(x; π̄) := ⟨x, π̄⟩ − f∗(π̄), which touches f only at the minimizer x̄, i.e.,

l_f(x̄; π̄) = ⟨x̄, π̄⟩ − f∗(π̄) = ⟨x̄, π̄⟩ + f(x̄) − ⟨x̄, π̄⟩ = f(x̄),
l_f(x; π̄) = ⟨x, π̄⟩ − f∗(π̄) ≤ ⟨x, π̄⟩ + f(x) − ⟨x, π̄⟩ = f(x).

Next, fixing x̄, the maximization in Lemma 3 can be interpreted as going upward: finding the lower linear functional l_f(·; π) that touches f at x̄,

π̄ ∈ arg max_π l_f(x̄; π) = arg max_π ⟨π, x̄⟩ − f∗(π).

Clearly, the slope π̄ of the linear support functional corresponds to a subgradient at x̄, π̄ ∈ ∂f(x̄).

Figure 4.1. The Fenchel Conjugate Function Illustrated.

5. The Duality Between Strong Convexity and Smoothness


Lemma 5. Given a distance generating function Φ with domain Ω such that X ⊂ Ω and the associated Bregman distance D_Φ, consider the iterate generated by the following update with η > 0 and x̄ ∈ rint(Ω):

x⁺ ← arg min_{x∈X} ⟨g, x⟩ + r(x) + ηD_Φ(x; x̄).

Then x⁺ ∈ rint(Ω) and

⟨g, x⁺ − x⟩ + ηD_Φ(x; x⁺) + D_r(x; x⁺) + r(x⁺) + ηD_Φ(x⁺; x̄) ≤ ηD_Φ(x; x̄) + r(x) ∀x ∈ X.

Proof. The result follows from the first-order optimality condition

⟨g + η(∇Φ(x⁺) − ∇Φ(x̄)) + r′(x⁺), x − x⁺⟩ ≥ 0 ∀x ∈ X,

combined with the three point identity ⟨∇Φ(x⁺) − ∇Φ(x̄), x⁺ − x⟩ = D_Φ(x; x⁺) + D_Φ(x⁺; x̄) − D_Φ(x; x̄). The only different term is ⟨r′(x⁺), x − x⁺⟩ = r(x) − r(x⁺) − (r(x) − r(x⁺) − ⟨r′(x⁺), x − x⁺⟩) = r(x) − r(x⁺) − D_r(x; x⁺). □
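As a numerical sanity check of Lemma 5 (a hedged sketch: the entropy distance generating function, r = 0, and the random data below are our illustrative choices), recall that for Φ(x) = Σᵢ xᵢ log xᵢ on the simplex the update has the well-known closed form x⁺ᵢ ∝ x̄ᵢ exp(−gᵢ/η); the three point inequality can then be verified directly:

import numpy as np

# Check the three point inequality of Lemma 5 for the entropy Bregman
# distance on the simplex with r = 0, where the prox step is closed form.
rng = np.random.default_rng(2)
d, eta = 5, 2.0
g = rng.standard_normal(d)
x_bar = rng.dirichlet(np.ones(d))                # x̄ in the relative interior

D = lambda x, y: np.sum(x * np.log(x / y))       # Bregman distance of entropy

x_plus = x_bar * np.exp(-g / eta)
x_plus /= x_plus.sum()                           # x+ = argmin <g,x> + eta*D(x; x̄)

for _ in range(5):                               # random feasible test points x
    x = rng.dirichlet(np.ones(d))
    lhs = g @ (x_plus - x) + eta * D(x, x_plus) + eta * D(x_plus, x_bar)
    assert lhs <= eta * D(x, x_bar) + 1e-10      # Lemma 5 with r = 0
print("three point inequality verified on random test points")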
Theorem 6. A convex function f : R^d → R̄ is µ-strongly convex if and only if f∗ is (1/µ)-smooth.
Proof. ⇒: Let π̂, π̄ ∈ dom(∂f∗) be given. Consider any x̄ ∈ ∂f∗(π̄) and x̂ ∈ ∂f∗(π̂), i.e., π̄ ∈ ∂f(x̄) and π̂ ∈ ∂f(x̂). By Theorem 4, we get

x̂ ∈ arg min_x ⟨−π̂, x⟩ + f(x).

Thus it follows from Lemma 5 that (why?)

⟨−π̂, x̂ − x⟩ + D_f(x; x̂) + f(x̂) ≤ f(x) ∀x
⇒ ⟨−π̂, x̂ − x̄⟩ + (µ/2)∥x̄ − x̂∥² + f(x̂) ≤ f(x̄).

Similarly, we get

(5.1) ⟨−π̄, x̄ − x̂⟩ + (µ/2)∥x̂ − x̄∥² + f(x̄) ≤ f(x̂).

Adding up the preceding formulae then leads to

⟨π̂ − π̄, x̂ − x̄⟩ ≥ µ∥x̂ − x̄∥²
⇒ µ∥x̂ − x̄∥² ≤ ⟨π̂ − π̄, x̂ − x̄⟩ ≤ ∥π̂ − π̄∥∗∥x̂ − x̄∥
⇒ ∥x̂ − x̄∥ ≤ (1/µ)∥π̂ − π̄∥∗,

which is exactly the (1/µ)-smoothness of f∗.
⇐: By involution, it is perhaps more illustrative to assume f to be L-smooth and show f∗ to be (1/L)-strongly convex. The following claim, dual to (5.1), is useful for us.
Claim: for a smooth convex f, we have

(5.2) f(y) − f(x̄) − ⟨∇f(x̄), y − x̄⟩ ≥ (1/(2L))∥∇f(y) − ∇f(x̄)∥∗² ∀y, x̄ ∈ dom(∂f).

We will prove the claim later; let us assume it is true for now. Again, let π̂, π̄ ∈ dom(∂f∗) be given. Consider any x̄ ∈ ∂f∗(π̄) and x̂ ∈ ∂f∗(π̂), i.e., π̄ = ∇f(x̄) and π̂ = ∇f(x̂) (why?). We have from (5.2) that

f(x̂) − f(x̄) − ⟨∇f(x̄), x̂ − x̄⟩ ≥ (1/(2L))∥π̂ − π̄∥∗²
f(x̄) − f(x̂) − ⟨∇f(x̂), x̄ − x̂⟩ ≥ (1/(2L))∥π̂ − π̄∥∗².

Adding them up leads to

(1/L)∥π̂ − π̄∥∗² ≤ ⟨π̂ − π̄, x̂ − x̄⟩ ≤ ∥x̂ − x̄∥∥π̂ − π̄∥∗.

Thus we get the desired strong convexity relation ∥(f∗)′(π̂) − (f∗)′(π̄)∥ = ∥x̂ − x̄∥ ≥ (1/L)∥π̂ − π̄∥∗.
Now we show the claim in (5.2). Consider φ(x) = f(x) − ⟨∇f(x̄), x⟩. Clearly, x̄ ∈ arg min_{x∈R^d} φ(x) and we have

φ(x̄) ≤ φ(x) ≤ φ(y) + ⟨∇φ(y), x − y⟩ + (L/2)∥x − y∥² ∀x, y ∈ R^d.

Let v ∈ arg max_{x∈B(0,1)} ⟨∇φ(y), x⟩ so that ⟨v, ∇φ(y)⟩ = ∥∇φ(y)∥∗. Taking x = y − (∥∇φ(y)∥∗/L)v leads to

φ(x̄) ≤ φ(y) − (1/L)∥∇φ(y)∥∗² + (1/(2L))∥∇φ(y)∥∗²
⇒ (1/(2L))∥∇f(y) − ∇f(x̄)∥∗² ≤ f(y) − f(x̄) − ⟨∇f(x̄), y − x̄⟩. □

For a quick sanity check of the theorem: the µ-strongly convex f(x) = (µ/2)∥x∥₂² has conjugate f∗(π) = (1/(2µ))∥π∥₂², which is indeed (1/µ)-smooth.

6. The Primal And Dual Version of the AGD Method


Nesterov's original version is similar to Algorithm 1. Other than the fact that the algebra just works out, the technique of evaluating the gradient at some averaged point makes little sense at first. A more modern re-interpretation of the method from the primal-dual perspective is shown in Algorithm 2: the minimization problem is reformulated instead as a saddle point game.

Algorithm 1 The Primal Version of AGD [2]

• With x⁰ ∈ X. For t = 1, 2, 3, ..., do the following:
(1) Set x̲^t := α_t x^{t−1} + (1 − α_t)x̄^{t−1}.
(2) Compute x^t ← arg min_{x∈X} ⟨∇f(x̲^t), x⟩ + (η_t/2)∥x − x^{t−1}∥².
(3) Set x̄^t = α_t x^t + (1 − α_t)x̄^{t−1}.

Algorithm 2 The Primal-Dual Version of AGD [2]

• With x^{−1} = x⁰ ∈ X and π⁰ ∈ ∂f(x⁰). For t = 1, 2, 3, ..., do the following:
(1) Set x̃^t = x^{t−1} + θ_t(x^{t−1} − x^{t−2}).
(2) Compute π^t ← arg max_π ⟨π, x̃^t⟩ − f∗(π) − τ_t D_{f∗}(π; π^{t−1}).
(3) Compute x^t ← arg min_{x∈X} ⟨x, π^t⟩ + (η_t/2)∥x − x^{t−1}∥².

(6.1) min_{x∈X} f∗∗(x) = min_{x∈X} max_π {L(x; π) := ⟨π, x⟩ − f∗(π)}.

In this zero-sum game, the max-player π seeks to maximize his pay-off L(x; π), which depends on the action of the primal player x, while the primal player seeks to minimize his loss. In the t-th iteration of the AGD Algorithm 2, since his reward depends on the yet-unknown action x^t, the dual player constructs an estimate x̃^t of it by extrapolating from the history, and then maximizes his reward L(x̃^t; π) subject to some prox-penalty (Line 2). After that, seeing the action π^t, the x player responds by minimizing L(x; π^t) subject to some prox-penalty term.
The surprising gradient evaluation in Algorithm 1 can be interpreted as a dual proximal step. Before going into the details, let us see that the π^t update can indeed be reinterpreted as a gradient evaluation step:

π^t ← arg max_π ⟨π, x̃^t⟩ − f∗(π) − τ_t D_{f∗}(π; π^{t−1})
⇒ π^t ← arg max_π ⟨π, x̃^t⟩ − f∗(π) − τ_t(f∗(π) − ⟨(f∗)′(π^{t−1}), π⟩)
⇒ π^t ← arg max_π ⟨π, x̃^t + τ_t(f∗)′(π^{t−1})⟩ − (1 + τ_t)f∗(π)
⇒ π^t = ∇f(x̲^t) with x̲^t := (x̃^t + τ_t(f∗)′(π^{t−1}))/(1 + τ_t). (why? Theorem 4)
Moreover, this general framework allows us to recover some classical algorithms as special cases (a numerical sketch of Algorithm 2 follows the list).
• The gradient descent method corresponds to choosing θ_t = 0 (no extrapolation) and τ_t = 0 (gradient evaluated at x^{t−1}, i.e., π^t = ∇f(x^{t−1})).
• The Frank–Wolfe method corresponds to choosing θ_t = 0 (no extrapolation) and η_t = 0 (no primal stability term).
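To make the mechanics concrete, here is a minimal numerical sketch of Algorithm 2 on an unconstrained least squares problem (so X = R^d and the primal prox step is explicit), using the stepsize choices of Theorem 10 below and the gradient re-interpretation of the dual step. The problem data and iteration count are our illustrative choices, not from the text:

import numpy as np

# Primal-dual AGD (Algorithm 2) for f(x) = 0.5*||Ax - b||^2 on X = R^d.
# The dual prox step is realized as pi^t = grad f(x_md^t), where x_md is the
# averaged (underlined) point x_md^t = (x_tilde^t + tau_t*x_md^{t-1})/(1+tau_t),
# using (f*)'(pi^{t-1}) = x_md^{t-1} (Theorem 4 / Legendre inversion).
rng = np.random.default_rng(0)
n, d = 50, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
L = np.linalg.eigvalsh(A.T @ A).max()            # smoothness constant of f

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

x_prev = x = np.zeros(d)                         # x^{-1} = x^0 = 0
x_md = np.zeros(d)                               # averaged point, x_md^0 = x^0
x_bar, w_sum = np.zeros(d), 0.0                  # weighted average with w_t = t
for t in range(1, 201):
    theta, tau, eta = (t - 1) / t, (t - 1) / 2, 2 * L / t   # Theorem 10
    x_tilde = x + theta * (x - x_prev)           # dual player's extrapolation
    x_md = (x_tilde + tau * x_md) / (1 + tau)    # averaged point of this section
    pi = grad(x_md)                              # dual prox step = gradient step
    x_prev, x = x, x - pi / eta                  # primal prox step on X = R^d
    w_sum += t
    x_bar += (t / w_sum) * (x - x_bar)           # running weighted average

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(f"function value gap: {f(x_bar) - f(x_star):.3e}")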

7. The Convergence Analysis


7.1. The Q-Gap Function. We introduce a so-called Q-gap function to characterize the convergence of the generated iterates to the Nash equilibrium point.
Definition 7. Given a convex-concave saddle point problem L(x; π) and some point z^t := (x^t; π^t), the Q-gap function associated with a reference point z := (x, π) is defined as

Q(z^t; z) := L(x^t; π) − L(x; π^t).


Some properties associated with the gap function are in order:
• The gap associated with z^t is defined as G(z^t) := max_{z∈X×Π} Q(z^t; z). G(z^t) is always non-negative, and G(z^t) ≤ 0 if and only if z^t is a Nash equilibrium point satisfying (why?)
L(x^t; π) ≤ L(x^t; π^t) ∀π ∈ Π
L(x; π^t) ≥ L(x^t; π^t) ∀x ∈ X.
• Fixing the reference point z, the Q-gap function is convex with respect to z^t (it is linear in x^t and convex in π^t). In particular, Jensen's inequality can be applied to obtain
Q(z̄^N; z) ≤ Σ_{t=1}^N p_t Q(z^t; z), with p ∈ ∆_+^N and z̄^N = Σ_{t=1}^N p_t z^t.
• It provides an upper bound on the function value gap.
Lemma 8. Consider the saddle point reformulation L(x; π) in (6.1) associated with some function f. For some feasible solution z^t := (x^t; π^t) ∈ X × dom(f∗), we have

f(x^t) − f_* ≤ Q(z^t; (x∗; π̂)) with π̂ ∈ ∂f(x^t).


Proof. Use the Legendre property of the Fenchel conjugate in Theorem 4. □
The preceding result tells us that we can bound the function value gap by analyzing the convergence of the Q-gap function.
7.2. The Detailed Convergence Bound. The next proposition specifies some stepsize requirements that ensure the convergence of the Q-gap function.
Proposition 9. Consider the iterates generated by Algorithm 2. Suppose the stepsizes satisfy the following requirements for some weights ω_t for all t ≥ 2:

ω_{t−1} = ω_t θ_t
ω_{t−1}(1 + τ_{t−1}) ≥ ω_t τ_t
ω_{t−1}η_{t−1} ≥ ω_t η_t
τ_t η_{t−1} ≥ Lθ_t, (τ_N + 1)η_N ≥ L.

Then the following relation is valid for any z := (x, π) ∈ X × R^d:

Σ_{t=1}^N ω_t Q(z^t; z) + (ω_N η_N/2)∥x^N − x∥² ≤ (ω_1 η_1/2)∥x⁰ − x∥² + ω_1 τ_1 D_{f∗}(π; π⁰).
Proof. The following algebraic identity decomposition is crucial for our analysis. (The extrapolation prediction is crucial for this decomposition to telescope.)

A_t := ⟨π^t − π, x^t − x̃^t⟩ = ⟨π^t − π, x^t − [x^{t−1} + θ_t(x^{t−1} − x^{t−2})]⟩
= ⟨π^t − π, x^t − x^{t−1}⟩ − ⟨π^t − π, θ_t(x^{t−1} − x^{t−2})⟩
= ⟨π^t − π, x^t − x^{t−1}⟩ − θ_t⟨π^{t−1} − π, x^{t−1} − x^{t−2}⟩ − θ_t⟨π^t − π^{t−1}, x^{t−1} − x^{t−2}⟩.

Using the three point inequality (Lemma 5) associated with the primal update leads us to

L(x^t; π^t) − L(x; π^t) + (η_t/2)∥x^t − x∥² + (η_t/2)∥x^{t−1} − x^t∥² ≤ (η_t/2)∥x^{t−1} − x∥².

So the ω_t weighted sum satisfies

Σ_{t=1}^N ω_t[L(x^t; π^t) − L(x; π^t)] + (ω_N η_N/2)∥x^N − x∥² + Σ_{t=1}^N (ω_t η_t/2)∥x^{t−1} − x^t∥² ≤ (ω_1 η_1/2)∥x⁰ − x∥².

Similarly, the three point inequality for the dual update implies that (why?)

L(x^t; π) − L(x^t; π^t) + A_t + (τ_t + 1)D_{f∗}(π; π^t) + τ_t D_{f∗}(π^t; π^{t−1}) ≤ τ_t D_{f∗}(π; π^{t−1}).

So the ω_t weighted sum satisfies

Σ_{t=1}^N ω_t[L(x^t; π) − L(x^t; π^t)] + Ā_N + ω_N(τ_N + 1)D_{f∗}(π; π^N) + Σ_{t=1}^N ω_t τ_t D_{f∗}(π^t; π^{t−1}) ≤ τ_1 ω_1 D_{f∗}(π; π⁰),

where

Ā_N := ω_N⟨π^N − π, x^N − x^{N−1}⟩ − Σ_{t=1}^N θ_t ω_t⟨π^t − π^{t−1}, x^{t−1} − x^{t−2}⟩.

In particular, we have from the 1/L strong convexity of f∗ (see Theorem 6) that

θ_t ω_t⟨π^t − π^{t−1}, x^{t−1} − x^{t−2}⟩ ≤ ω_t τ_t D_{f∗}(π^t; π^{t−1}) + (ω_{t−1}η_{t−1}/2)∥x^{t−1} − x^{t−2}∥²
−ω_N⟨π^N − π, x^N − x^{N−1}⟩ ≤ ω_N(τ_N + 1)D_{f∗}(π; π^N) + (ω_N η_N/2)∥x^N − x^{N−1}∥².

Adding up all the relations leads to the desired convergence guarantee. □

Now we specify some concrete stepsize choices to obtain the desired function value convergence bound.

Theorem 10. With

ω_t = t, τ_t = (t − 1)/2, η_t = 2L/t, and z̄^N = (Σ_{t=1}^N ω_t z^t)/(Σ_{t=1}^N ω_t),

we get f(x̄^N) − f_* ≤ Q(z̄^N; (x∗; ∇f(x̄^N))) ≤ O(1)L∥x⁰ − x∗∥²/N².
Moreover, the distance ∥x^t − x∗∥² is monotonically non-increasing.
Indeed, with θ_t = ω_{t−1}/ω_t = (t − 1)/t, one can check the conditions of Proposition 9: ω_{t−1}(1 + τ_{t−1}) = (t − 1)t/2 = ω_t τ_t; ω_{t−1}η_{t−1} = 2L = ω_t η_t; τ_t η_{t−1} = L ≥ Lθ_t; and (τ_N + 1)η_N = (N + 1)L/N ≥ L. The bound then follows from Proposition 9, Lemma 8, and the Jensen property of the Q-gap function, since Σ_{t=1}^N ω_t = N(N + 1)/2.

Remarks:
• The above convergence bound can also be obtained from a purely primal analysis, see [2]. We can also obtain the desired linear convergence rate O(√κ log(L∥x⁰ − x∗∥²/ϵ)) if f is a strongly convex function. Because of the heavy use of the three point inequality, the extension to non-Euclidean geometry is straightforward.
• Nesterov retired this year. However, the exact reason why this scheme works is still not well-understood.

Function class        Criterion       M-Lipschitz non-smooth      L-smooth
Non-strongly convex   f(x^t) − f_*    M∥x⁰ − x∗∥/√t               L∥x⁰ − x∗∥²/t²
µ-strongly convex     f(x^t) − f_*    4M²/(µt)                    (1 + √(µ/L))^{−t}(f(x⁰) − f_*)

Table 1. The O(1) Optimal Convergence Rates for Convex Optimization

8. Restarting to Take Advantage of Strong Convexity


The (mini-max) optimal convergence rates for typical convex optimization problems are listed in Table 1. A problem being strongly convex improves the convergence rate by at least an order of magnitude. Here we illustrate how to use restarts to bootstrap the non-strongly convex rates into the strongly convex rates (a numerical sketch follows the list).
• For the non-smooth problem, we have

f(x^t) − f_* ≤ M∥x⁰ − x∗∥/√t ≤ (M/√t)·√(2(f(x⁰) − f_*)/µ),

using the strong convexity bound (µ/2)∥x⁰ − x∗∥² ≤ f(x⁰) − f_*. Thus restarting the appropriate sub-gradient descent algorithm after T = 8M²/((f(x⁰) − f_*)µ) iterations leads to a halving of the function value gap. We can calculate the total number of iterations to achieve ϵ-optimality by

N_ϵ ≤ Σ_{i≥1} 8M²/(2^{i−1}ϵµ) = O(1)M²/(ϵµ).

• For the smooth problem, we have

f(x^t) − f_* ≤ L∥x⁰ − x∗∥²/t² ≤ (L/t²)·(2(f(x⁰) − f_*)/µ).

Thus restarting the AGD method after T = 2√(L/µ) iterations leads to a halving of the function value gap. We can calculate the total number of iterations to achieve ϵ-optimality by

N_ϵ ≤ Σ_{i=1}^{log((f(x⁰)−f_*)/ϵ)} 2√(L/µ) = O(1)√(L/µ) log((f(x⁰) − f_*)/ϵ).
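The smooth restart scheme can be sketched in a few lines by wrapping the AGD loop from Section 6 (a hedged sketch: the quadratic test problem, the round length T = ⌈2√(L/µ)⌉, and the helper name agd_round are our illustrative choices, not from the text):

import numpy as np

def agd_round(grad, L, x0, T):
    """One AGD round (Algorithm 2 with the Theorem 10 stepsizes); returns
    the weighted average iterate carrying the O(L*D^2/T^2) guarantee."""
    x_prev = x = x0.copy()
    x_md = x0.copy()
    x_bar, w_sum = np.zeros_like(x0), 0.0
    for t in range(1, T + 1):
        theta, tau, eta = (t - 1) / t, (t - 1) / 2, 2 * L / t
        x_tilde = x + theta * (x - x_prev)
        x_md = (x_tilde + tau * x_md) / (1 + tau)
        x_prev, x = x, x - grad(x_md) / eta
        w_sum += t
        x_bar += (t / w_sum) * (x - x_bar)
    return x_bar

# Test problem: f(x) = 0.5 x^T H x with spectrum in [mu, L]; minimizer x* = 0.
rng = np.random.default_rng(1)
d = 30
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigs = np.linspace(1.0, 100.0, d)                # mu = 1, L = 100
H = Q @ np.diag(eigs) @ Q.T
mu, L = eigs[0], eigs[-1]
grad = lambda x: H @ x
f = lambda x: 0.5 * x @ H @ x

x = rng.standard_normal(d)
T = int(np.ceil(2 * np.sqrt(L / mu)))            # restart period from above
for _ in range(10):                              # each round roughly halves the gap
    x = agd_round(grad, L, x, T)
print(f"function value gap after {10 * T} total iterations: {f(x):.3e}")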

9. The Lower Complexity Bound


In this section, we will construct a quadratic problem to show the above O(1)L∥x⁰ − x∗∥²/N² oracle complexity bound to be un-improvable. Specifically, we are going to show that P_Σ∗(M, F) ≥ Ω(1)√(LD₀²/ϵ), where
• The termination criterion T_ϵ is the ϵ-optimality gap, that is, finding an x̄ s.t. f(x̄) − f_* ≤ ϵ, and the access is through the first order oracle O.
• F: the class of objective functions under consideration are all L-smooth, satisfy ∥x⁰ − x∗∥² ≤ D₀², and the feasible region is X = R^d.
• M: we will focus on deterministic iterative methods satisfying some linear-span assumption [5]. Extension to a more general class of methods is also possible but quite technical (see [3]).

9.1. The Linear Span Assumption. Without loss of generality, we can assume the initial point x⁰ to be 0 (why? hint: apply some shifting to the worst case function).

Definition 11. A first order method is said to satisfy the linear span assumption if the iterates it generates satisfy the following requirement:

(9.1) x^t ∈ span{x⁰, ∇f(x⁰), ..., ∇f(x^{t−1})}.

This definition highlights again that we are proving an information-theoretic lower complexity bound: the possible points that can be returned by any method are limited to the gradient information it has gathered. A few remarks are helpful for clarifying this definition.
• The linear span operation allows us to take any linear combination of the generated gradients. So it clearly covers the gradient descent (GD) method x^t ← x^{t−1} − (1/η_t)∇f(x^{t−1}) with any step-size choice.
• If we take x^t to be the gradient query point x̲^t in Algorithm 1, the AGD method also satisfies the assumption.
• This assumption does not cover non-Euclidean methods, e.g., the mirror descent method or its accelerated version. For instance, the pre-conditioned gradient method x^t ← x^{t−1} − (1/η_t)Σ^{−1}∇f(x^{t−1}) might lead us outside the linear span in (9.1). However, the lower bound can actually be strengthened by a rotational invariance argument to cover this case (see [3]).

9.2. The Lower Bound Gadget. We present a parameterized class of hard functions which can be tuned to provide the desired lower complexity bound. Particularly, the hard instance, parameterized by l > 0, r > 0, n ∈ N₊, is

(9.2) h(x; l, r, n) := (l/4)[(1/2)Σ_{i=1}^{n−1}(x_i − x_{i+1})² + (1/2)x_n² − rx_1].

The next lemma characterizes the properties of h.

Lemma 12. With an even n ≥ 2, the function h in (9.2) has the following properties.
a) It is l-Lipschitz smooth.
b) ∥x⁰ − x∗∥² ≤ r²n³.
c) For any first-order method satisfying the linear span assumption (Definition 11), the iterates generated in k iterations have non-zero entries only in the first k coordinates.
d) For any x̄ that satisfies x̄_{n/2+1} = x̄_{n/2+2} = ... = x̄_n = 0, we have

h(x̄) − h(x∗) ≥ lr²n/16.
Proof. The gradient of h is given by

(9.3) ∇h(x̄) = (l/4)(A x̄ − r e₁),

where e₁ is the first standard basis vector and A is the n×n tridiagonal matrix

A = [  1 −1
      −1  2 −1
          ⋱  ⋱  ⋱
            −1  2 −1
               −1  2 ].

Clearly, h is an l-Lipschitz smooth function (why?). Now we can solve the first order optimality condition ∇h(x∗) = 0 to obtain x∗ = r[n, n − 1, n − 2, ..., 1]ᵀ, which gives ∥x∗ − 0∥² ≤ r²n³ (a rough upper bound for n ≥ 2). Moreover, constrained to having non-zeros only in the first n/2 coordinates, we can calculate the optimal solution to be x̄∗ = r[n/2, n/2 − 1, ..., 1, 0, ..., 0]ᵀ. Thus the function value gap is lower bounded by

h(x̄∗) − h(x∗) = (r²l/4)[n/2 − n/4] = r²ln/16.

Property c) follows from the structure of the hard instance. □

Figure 9.1. Eigenvalues associated with the quadratic hard instance.
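Before moving to the final result, here is a hedged numerical sketch of the gadget (the concrete values of n, l, r are our illustrative choices, not from the text): it builds the matrix from (9.3), verifies the closed form for x∗, and illustrates the "resisting" behaviour of part c) by running gradient descent, which obeys the linear span assumption, from x⁰ = 0:

import numpy as np

# The lower-bound gadget: A is the tridiagonal matrix of (9.3).
n, l, r = 12, 1.0, 1.0
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A[0, 0] = 1.0                                    # first diagonal entry is 1
e1 = np.zeros(n); e1[0] = 1.0

h = lambda x: (l / 4) * (0.5 * x @ A @ x - r * x[0])
grad = lambda x: (l / 4) * (A @ x - r * e1)

x_star = r * np.arange(n, 0, -1).astype(float)   # x* = r[n, n-1, ..., 1]
assert np.allclose(grad(x_star), 0)              # verifies the closed form

L_s = np.linalg.eigvalsh((l / 4) * A).max()      # <= l, since eig(A) lie in [0, 4]
x = np.zeros(n)
for k in range(1, n // 2 + 1):
    x = x - grad(x) / L_s                        # GD obeys the linear span assumption
    support = np.flatnonzero(np.abs(x) > 1e-12)
    assert support.size == 0 or support.max() <= k - 1   # Lemma 12.c)
# After n/2 steps the gap is still large, as predicted by Lemma 12.d):
print(f"gap: {h(x) - h(x_star):.4f} >= l*r^2*n/16 = {l * r**2 * n / 16}")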
9.3. The Lower Bound Result. Now we are ready to utilize the hard-instance gadget to provide the lower complexity bound.
Theorem 13. Consider the task of finding an ϵ-optimal solution (in function value gap) for the class of L-smooth convex functions with ∥x⁰ − x∗∥² ≤ D₀² and ϵ ≤ LD₀²/256. The number of iterations required by any first order method satisfying the linear span assumption to find an ϵ-optimal solution for every problem inside the problem class is lower bounded by ⌊(1/8)√(LD₀²/ϵ)⌋, that is, P_Σ∗(M; F) ≥ Ω(1)√(LD₀²/ϵ).

Proof. Set the parameters in the hard instance gadget to be

n = 2⌊(1/8)√(LD₀²/ϵ)⌋, l = L, r² = D₀²/n³.

Clearly, the function h is inside our problem class (use parts a), b) of Lemma 12). From part c) of Lemma 12, the solution x^k generated in k ≤ ⌊(1/8)√(LD₀²/ϵ)⌋ = n/2 iterations by any method satisfying the linear span assumption has non-zeros only in the first n/2 coordinates. Then part d) of the lemma shows the function value gap is lower bounded by

h(x^k) − h(x∗) ≥ lr²n/16 = LD₀²/(16n²) ≥ LD₀²·(16ϵ)/(16LD₀²) = ϵ,

where the last inequality uses n ≤ (1/4)√(LD₀²/ϵ). □

A few remarks are in order regarding this hard instance.
• The key to this hard instance is to "resist" the progress of any first order method by restricting the iterates to a certain linear subspace, which is clear from part c) of the preceding lemma.
• This is the state of the art for optimal methods for smooth optimization. However, it is not the last word in the search for optimal methods.
• The eigenvalues of the matrix in (9.3) are plotted in Figure 9.1. The hard instance is particularly bad because its eigenvalues spread almost uniformly over the range [0, 4].
• If these eigenvalues are clustered in some way, say most of them are small except for a few, then the conjugate gradient (CG) method could lead to a much faster convergence rate.
• Moreover, if the dimension is small, then the ellipsoid method could also significantly improve the complexity result.
• Going further requires us to define more refined problem classes and to design more adaptive algorithms.

References
[1] Amir Beck. First-Order Methods in Optimization, volume 25. SIAM, 2017.
[2] Guanghui Lan. First-Order and Stochastic Optimization Methods for Machine Learning. Springer Nature, 2020.
[3] Arkadi Semenovich Nemirovsky and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, 1983.
[4] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN USSR, volume 269, pages 543–547, 1983.
[5] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2003.
[6] R. Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.
