
L. Vandenberghe EE236C (Spring 2016)

9. Accelerated proximal gradient methods

• Nesterov’s method

• analysis with fixed step size

• line search

9-1
Proximal gradient method

Results from lecture 6

• each proximal gradient iteration is a descent step:

f(x^{(k)}) < f(x^{(k−1)}),    ‖x^{(k)} − x^⋆‖_2^2 ≤ c ‖x^{(k−1)} − x^⋆‖_2^2

with c = 1 − m/L

• suboptimality after k iterations is O(1/k):

f(x^{(k)}) − f^⋆ ≤ \frac{L}{2k} ‖x^{(0)} − x^⋆‖_2^2

Accelerated proximal gradient methods

• to improve convergence, we add a momentum term


• we relax the descent property
• originated in work by Nesterov in the 1980s

Accelerated proximal gradient methods 9-2
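For reference, a minimal NumPy sketch of the basic proximal gradient iteration recalled above (the callables grad_g and prox_h and their signatures are my own choices, not from the slides):

```python
import numpy as np

def proximal_gradient(grad_g, prox_h, x0, t, num_iters=100):
    """Basic proximal gradient method: x+ = prox_{t h}(x - t * grad g(x)).

    grad_g : callable returning the gradient of the smooth term g
    prox_h : callable prox_h(v, t) returning prox_{t h}(v)
    t      : fixed step size, e.g. t = 1/L
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = prox_h(x - t * grad_g(x), t)
    return x
```

With h = 0 the prox is the identity and this is plain gradient descent; the accelerated variants below add a momentum term to this update.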


Assumptions

we consider the same problem and make the same assumptions as in lecture 6:

minimize   f(x) = g(x) + h(x)

• h is closed and convex (so that prox_{th} is well defined)

• g is differentiable with dom g = R^n

• there exist constants m ≥ 0 and L > 0 such that the functions

g(x) − \frac{m}{2} x^T x,    \frac{L}{2} x^T x − g(x)
are convex

• the optimal value f^⋆ is finite and attained at x^⋆ (not necessarily unique)

Accelerated proximal gradient methods 9-3


Nesterov’s method

Algorithm: choose x^{(0)} = v^{(0)} and γ_0 > 0; for k ≥ 1, repeat the steps

• define γ_k = θ_k^2/t_k where θ_k is the positive root of the quadratic equation

  θ_k^2/t_k = (1 − θ_k)γ_{k−1} + mθ_k

• update x^{(k)} and v^{(k)} as follows:

  y = x^{(k−1)} + \frac{θ_k γ_{k−1}}{γ_{k−1} + mθ_k} (v^{(k−1)} − x^{(k−1)})

  x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))

  v^{(k)} = x^{(k−1)} + \frac{1}{θ_k} (x^{(k)} − x^{(k−1)})

the step size t_k is fixed (t_k = 1/L) or obtained from a line search

Accelerated proximal gradient methods 9-4
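A NumPy sketch of the algorithm above with a fixed step size t (the helper names next_theta, grad_g, and prox_h, and the default γ_0, are my own choices for illustration, not from the slides):

```python
import numpy as np

def next_theta(gamma_prev, m, t):
    """Positive root of theta^2/t = (1 - theta)*gamma_prev + m*theta,
    i.e. of theta^2 + t*(gamma_prev - m)*theta - t*gamma_prev = 0."""
    b = t * (gamma_prev - m)
    return (-b + np.sqrt(b * b + 4.0 * t * gamma_prev)) / 2.0

def nesterov_prox_grad(grad_g, prox_h, x0, t, m=0.0, gamma0=None, num_iters=100):
    """Nesterov's accelerated proximal gradient method with fixed step size t."""
    x = v = np.asarray(x0, dtype=float)
    gamma = gamma0 if gamma0 is not None else 1.0 / t   # any gamma_0 > 0 is allowed
    for _ in range(num_iters):
        theta = next_theta(gamma, m, t)
        y = x + (theta * gamma / (gamma + m * theta)) * (v - x)
        x_next = prox_h(y - t * grad_g(y), t)
        v = x + (x_next - x) / theta        # v_k = x_{k-1} + (x_k - x_{k-1}) / theta_k
        x, gamma = x_next, theta ** 2 / t   # gamma_k = theta_k^2 / t_k
    return x
```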


Momentum interpretation

• first iteration (k = 1) is a proximal gradient step at y = x^{(0)}


• next iterations are proximal gradient steps at extrapolated points y :

  y = x^{(k−1)} + \frac{θ_k γ_{k−1}}{γ_{k−1} + mθ_k} (v^{(k−1)} − x^{(k−1)})

    = x^{(k−1)} + \frac{θ_k γ_{k−1}}{γ_{k−1} + mθ_k} ( \frac{1}{θ_{k−1}} − 1 ) (x^{(k−1)} − x^{(k−2)})

  x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))

  [figure: the points x^{(k−2)}, x^{(k−1)} and the extrapolated point y on a line]

Accelerated proximal gradient methods 9-5


Algorithm parameters

  \frac{θ_k^2}{t_k} = (1 − θ_k)γ_{k−1} + mθ_k,    γ_k = \frac{θ_k^2}{t_k}

• θ_k is the positive root of the quadratic equation

• θ_k < 1 if mt_k < 1
• if t_k is constant, the sequence θ_k is completely determined by the starting value γ_0

Example: L = 1, m = 0.1, t_k = 1/L

[figure: two plots over k = 0, …, 25 for starting values γ_0 = 0.5m, γ_0 = m, γ_0 = 2m; the left panel shows θ_k approaching \sqrt{m/L}, the right panel shows γ_k approaching m]

Accelerated proximal gradient methods 9-6
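A small sketch that reproduces the θ_k and γ_k sequences of this example (L = 1, m = 0.1, t_k = 1/L); printing is used here in place of the plots:

```python
import numpy as np

L, m = 1.0, 0.1                      # example parameters from this slide
t = 1.0 / L
for gamma0 in (0.5 * m, m, 2 * m):
    gamma = gamma0
    for k in range(1, 26):
        b = t * (gamma - m)          # theta_k: positive root of the quadratic
        theta = (-b + np.sqrt(b * b + 4 * t * gamma)) / 2
        gamma = theta ** 2 / t       # gamma_k = theta_k^2 / t_k
    # theta_k approaches sqrt(m/L) and gamma_k approaches m
    print(gamma0, theta, np.sqrt(m / L), gamma, m)
```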


FISTA

if we take m = 0 on page 9-4, the expression for y simplifies:

  y = x^{(k−1)} + θ_k (v^{(k−1)} − x^{(k−1)})

  x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))

  v^{(k)} = x^{(k−1)} + \frac{1}{θ_k} (x^{(k)} − x^{(k−1)})

eliminating the variables v^{(k)} gives the equivalent iteration (for k ≥ 2)

  y = x^{(k−1)} + θ_k ( \frac{1}{θ_{k−1}} − 1 ) (x^{(k−1)} − x^{(k−2)})

  x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))

this is known as FISTA (‘Fast Iterative Shrinkage-Thresholding Algorithm’)

Accelerated proximal gradient methods 9-7
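A minimal sketch of this FISTA iteration with fixed step size t (grad_g and prox_h are assumed callables as in the earlier sketches; any γ_0 > 0 may be used):

```python
import numpy as np

def fista(grad_g, prox_h, x0, t, gamma0=None, num_iters=100):
    """FISTA: accelerated proximal gradient method with m = 0 (slide notation)."""
    x = x_prev = np.asarray(x0, dtype=float)
    gamma = gamma0 if gamma0 is not None else 1.0 / t
    theta_prev = None
    for _ in range(num_iters):
        # theta_k: positive root of theta^2/t = (1 - theta)*gamma_{k-1}  (m = 0)
        theta = (-t * gamma + np.sqrt((t * gamma) ** 2 + 4 * t * gamma)) / 2
        gamma = theta ** 2 / t
        if theta_prev is None:
            y = x                          # k = 1: plain proximal gradient step
        else:
            y = x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
        x_prev, x = x, prox_h(y - t * grad_g(y), t)
        theta_prev = theta
    return x
```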


Example

  minimize   \log \sum_{i=1}^{m} \exp(a_i^T x + b_i)

• two randomly generated problems with m = 2000, n = 1000


• same fixed step size used for gradient method and FISTA
• figures show (f(x^{(k)}) − f^⋆)/f^⋆

[figure: two semilogarithmic plots of (f(x^{(k)}) − f^⋆)/f^⋆ versus k (0 to 200), one per problem instance, comparing the gradient method and FISTA; the vertical axis ranges from 10^0 down to 10^{−6}]

Accelerated proximal gradient methods 9-8
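A sketch that builds a smaller random instance of this example and minimizes it with the fista sketch above; here h = 0 (so the prox is the identity), and the problem sizes and the loose Lipschitz bound L ≤ ‖A‖_2^2 are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m_rows, n = 200, 100                       # smaller than the slide's 2000 x 1000
A = rng.standard_normal((m_rows, n))
b = rng.standard_normal(m_rows)

def f(x):                                  # f(x) = log sum_i exp(a_i^T x + b_i)
    z = A @ x + b
    return z.max() + np.log(np.exp(z - z.max()).sum())

def grad_f(x):                             # gradient: A^T softmax(A x + b)
    z = A @ x + b
    w = np.exp(z - z.max())
    return A.T @ (w / w.sum())

L_bound = np.linalg.norm(A, 2) ** 2        # valid (loose) bound on the Lipschitz constant

def prox_id(v, t):                         # h = 0, so prox_{th} is the identity
    return v

x_hat = fista(grad_f, prox_id, np.zeros(n), t=1.0 / L_bound, num_iters=200)
print(f(np.zeros(n)), f(x_hat))            # objective before and after
```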


Nesterov’s simplest method

• if m > 0 and we choose γ_0 = m, then

  γ_k = m,    θ_k = \sqrt{m t_k}    for all k ≥ 1

• the algorithm on p. 9-4 and p. 9-5 simplifies:


  y = x^{(k−1)} + \frac{\sqrt{t_k} (1 − \sqrt{m t_{k−1}})}{\sqrt{t_{k−1}} (1 + \sqrt{m t_k})} (x^{(k−1)} − x^{(k−2)})

  x^{(k)} = prox_{t_k h}(y − t_k ∇g(y))

• with constant step size t_k = 1/L, the expression for y reduces to

  y = x^{(k−1)} + \frac{1 − \sqrt{m/L}}{1 + \sqrt{m/L}} (x^{(k−1)} − x^{(k−2)})

Accelerated proximal gradient methods 9-9
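A sketch of this simplest variant with constant step size t = 1/L (assumes m > 0 and the same hypothetical grad_g and prox_h callables; the first iteration is a plain proximal gradient step, as on page 9-5):

```python
import numpy as np

def nesterov_simple(grad_g, prox_h, x0, m, L, num_iters=100):
    """Accelerated proximal gradient with gamma_0 = m and t = 1/L:
    constant momentum coefficient (1 - sqrt(m/L)) / (1 + sqrt(m/L))."""
    t = 1.0 / L
    beta = (1.0 - np.sqrt(m / L)) / (1.0 + np.sqrt(m / L))
    x = x_prev = np.asarray(x0, dtype=float)
    for k in range(1, num_iters + 1):
        y = x if k == 1 else x + beta * (x - x_prev)   # no momentum on the first step
        x_prev, x = x, prox_h(y - t * grad_g(y), t)
    return x
```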


Outline

• Nesterov’s method

• analysis with fixed step size

• line search
Overview

• we show that if t_k = 1/L, the following inequality holds at each iteration:

  f(x^{(k)}) − f^⋆ + \frac{γ_k}{2} ‖v^{(k)} − x^⋆‖_2^2 ≤ (1 − θ_k) ( f(x^{(k−1)}) − f^⋆ + \frac{γ_{k−1}}{2} ‖v^{(k−1)} − x^⋆‖_2^2 )

• therefore the rate of convergence is determined by λ_k = \prod_{i=1}^{k} (1 − θ_i):

  f(x^{(k)}) − f^⋆ ≤ f(x^{(k)}) − f^⋆ + \frac{γ_k}{2} ‖v^{(k)} − x^⋆‖_2^2
                  ≤ λ_k ( f(x^{(0)}) − f^⋆ + \frac{γ_0}{2} ‖x^{(0)} − x^⋆‖_2^2 )

(here we assume that x^{(0)} ∈ dom h = dom f)

Accelerated proximal gradient methods 9-10


Notation for one iteration

quantities in iteration i of the algorithm on page 9-4

• define t = t_i, θ = θ_i, γ = γ_{i−1}, and γ^+ = γ_i:

  γ^+ = (1 − θ)γ + mθ,    γ^+ = θ^2/t

• define x = x^{(i−1)}, x^+ = x^{(i)}, v = v^{(i−1)}, and v^+ = v^{(i)}:

  y = \frac{1}{γ + mθ} (γ^+ x + θγ v)

  x^+ = y − t G_t(y)

  v^+ = x + \frac{1}{θ} (x^+ − x)

• v^+, v, and y are related as

  γ^+ v^+ = (1 − θ)γ v + mθ y − θ G_t(y)    (1)

Accelerated proximal gradient methods 9-11


Proof (last identity):

• combine the v and x updates and use γ^+ = θ^2/t:

  v^+ = x + \frac{1}{θ} (y − t G_t(y) − x)
      = \frac{1}{θ} (y − (1 − θ)x) − \frac{θ}{γ^+} G_t(y)

• multiply with γ^+ = γ + mθ − θγ:

  γ^+ v^+ = \frac{γ^+}{θ} (y − (1 − θ)x) − θ G_t(y)
          = \frac{1 − θ}{θ} ((γ + mθ)y − γ^+ x) + mθ y − θ G_t(y)
          = (1 − θ)γ v + mθ y − θ G_t(y)

Accelerated proximal gradient methods 9-12
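A quick numerical sanity check of identity (1) with random data (a verification sketch only; the vector G stands in for G_t(y), which can be arbitrary as far as the algebra is concerned):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, t, gamma = 5, 0.3, 0.1, 2.0              # gamma = gamma_{i-1} > 0
b = t * (gamma - m)                            # theta: positive root of the quadratic
theta = (-b + np.sqrt(b * b + 4 * t * gamma)) / 2
gamma_plus = theta ** 2 / t                    # equals (1 - theta)*gamma + m*theta

x, v, G = rng.standard_normal((3, n))          # G plays the role of G_t(y)
y = (gamma_plus * x + theta * gamma * v) / (gamma + m * theta)
x_plus = y - t * G
v_plus = x + (x_plus - x) / theta

lhs = gamma_plus * v_plus
rhs = (1 - theta) * gamma * v + m * theta * y - theta * G
print(np.allclose(lhs, rhs))                   # True: identity (1) holds
```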


Bounds on objective function

recall the results on the proximal gradient update (page 6-13):

• if 0 < t ≤ 1/L then g(x^+) = g(y − t G_t(y)) is bounded by

  g(x^+) ≤ g(y) − t ∇g(y)^T G_t(y) + \frac{t}{2} ‖G_t(y)‖_2^2    (2)

• if the inequality (2) holds, then mt ≤ 1 and, for all z,

  f(z) ≥ f(x^+) + \frac{t}{2} ‖G_t(y)‖_2^2 + G_t(y)^T (z − y) + \frac{m}{2} ‖z − y‖_2^2

• combine the inequalities for z = x and z = x^⋆:

  f(x^+) − f^⋆ ≤ (1 − θ)(f(x) − f^⋆) − G_t(y)^T ((1 − θ)x + θx^⋆ − y) − \frac{t}{2} ‖G_t(y)‖_2^2 − \frac{mθ}{2} ‖x^⋆ − y‖_2^2
2 2

Accelerated proximal gradient methods 9-13


Progress in one iteration

• the definition of γ^+ and (1) imply that

  \frac{γ^+}{2} ( ‖x^⋆ − v^+‖_2^2 − ‖y − v^+‖_2^2 )
    = \frac{(1 − θ)γ}{2} ( ‖x^⋆ − v‖_2^2 − ‖y − v‖_2^2 ) + \frac{mθ}{2} ‖x^⋆ − y‖_2^2 + θ G_t(y)^T (x^⋆ − y)

• combining this with the last inequality on page 9-13 gives

  f(x^+) − f^⋆ + \frac{γ^+}{2} ‖x^⋆ − v^+‖_2^2
    ≤ (1 − θ) ( f(x) − f^⋆ + \frac{γ}{2} ‖x^⋆ − v‖_2^2 − G_t(y)^T (x − y) − \frac{γ}{2} ‖y − v‖_2^2 )
      − \frac{t}{2} ‖G_t(y)‖_2^2 + \frac{γ^+}{2} ‖y − v^+‖_2^2

Accelerated proximal gradient methods 9-14


• the last term on the right-hand side is

  \frac{γ^+}{2} ‖y − v^+‖_2^2 = \frac{1}{2γ^+} ‖(1 − θ)γ (y − v) + θ G_t(y)‖_2^2

    = \frac{(1 − θ)^2 γ^2}{2γ^+} ‖y − v‖_2^2 + \frac{θ(1 − θ)γ}{γ^+} G_t(y)^T (y − v) + \frac{t}{2} ‖G_t(y)‖_2^2

    = (1 − θ) ( \frac{γ(γ^+ − mθ)}{2γ^+} ‖y − v‖_2^2 + G_t(y)^T (x − y) ) + \frac{t}{2} ‖G_t(y)‖_2^2

last step uses definitions of γ^+ and y (chosen so that θγ(y − v) = γ^+ (x − y))

• substituting this in the last inequality on page 9-14 gives the result on page 9-10

  f(x^+) − f^⋆ + \frac{γ^+}{2} ‖x^⋆ − v^+‖_2^2
    ≤ (1 − θ) ( f(x) − f^⋆ + \frac{γ}{2} ‖x^⋆ − v‖_2^2 ) − \frac{(1 − θ)γ mθ}{2γ^+} ‖y − v‖_2^2
    ≤ (1 − θ) ( f(x) − f^⋆ + \frac{γ}{2} ‖x^⋆ − v‖_2^2 )

Accelerated proximal gradient methods 9-15


Analysis for fixed step size

the product λ_k = \prod_{i=1}^{k} (1 − θ_i) determines the rate of convergence (page 9-10)

• the sequence λ_k satisfies the following bound (proof on next page)

  λ_k ≤ \frac{4}{(2 + \sqrt{γ_0} \sum_{i=1}^{k} \sqrt{t_i})^2}

• for constant step size t_k = 1/L, we obtain

  λ_k ≤ \frac{4}{(2 + k \sqrt{γ_0/L})^2}

• combined with the inequality on p. 9-10, this shows the 1/k^2 convergence rate:

  f(x^{(k)}) − f^⋆ ≤ \frac{4}{(2 + k \sqrt{γ_0/L})^2} ( f(x^{(0)}) − f^⋆ + \frac{γ_0}{2} ‖x^{(0)} − x^⋆‖_2^2 )

Accelerated proximal gradient methods 9-16
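A short sketch that runs the θ_k recursion with t_k = 1/L and checks λ_k against this bound numerically (the parameters L, m, γ_0 are arbitrary choices; the second assert uses the strongly convex bound from page 9-18, which applies here because γ_0 = m):

```python
import numpy as np

L, m = 1.0, 0.1
t, gamma0 = 1.0 / L, m                    # gamma_0 = m, so both bounds apply
gamma, lam = gamma0, 1.0
for k in range(1, 26):
    b = t * (gamma - m)                   # theta_k: positive root of the quadratic
    theta = (-b + np.sqrt(b * b + 4 * t * gamma)) / 2
    gamma = theta ** 2 / t
    lam *= 1.0 - theta                    # lambda_k = prod_{i<=k} (1 - theta_i)
    assert lam <= 4.0 / (2.0 + k * np.sqrt(gamma0 / L)) ** 2 + 1e-12   # this page
    assert lam <= (1.0 - np.sqrt(m / L)) ** k + 1e-12                  # page 9-18
print("both bounds hold for k = 1..25")
```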


Proof.

• recall that γ_k and θ_k are defined by γ_k = (1 − θ_k)γ_{k−1} + θ_k m and γ_k = θ_k^2/t_k


• we first note that λ_k ≤ γ_k/γ_0; this follows from

  λ_k = (1 − θ_k)λ_{k−1} = \frac{γ_k − θ_k m}{γ_{k−1}} λ_{k−1} ≤ \frac{γ_k}{γ_{k−1}} λ_{k−1}

• the inequality follows by combining from i = 1 to i = k the inequalities

  \frac{1}{\sqrt{λ_i}} − \frac{1}{\sqrt{λ_{i−1}}} ≥ \frac{λ_{i−1} − λ_i}{2 λ_{i−1} \sqrt{λ_i}}   (because λ_i ≤ λ_{i−1})

    = \frac{θ_i}{2 \sqrt{λ_i}}

    ≥ \frac{θ_i}{2 \sqrt{γ_i/γ_0}}

    = \frac{1}{2} \sqrt{γ_0 t_i}

  summing from i = 1 to k and using λ_0 = 1 gives 1/\sqrt{λ_k} ≥ 1 + \frac{\sqrt{γ_0}}{2} \sum_{i=1}^{k} \sqrt{t_i}, which is the claimed bound
2

Accelerated proximal gradient methods 9-17


Strongly convex functions

the following bound on λ_k is useful for strongly convex functions (m > 0)

• if γ_0 ≥ m then γ_k ≥ m for all k and

  λ_k ≤ \prod_{i=1}^{k} (1 − \sqrt{m t_i})

(proof on next page)

• for constant step size t_k = 1/L, we obtain

  λ_k ≤ (1 − \sqrt{m/L})^k

• combined with the inequality on p. 9-10, this shows

  f(x^{(k)}) − f^⋆ ≤ (1 − \sqrt{m/L})^k ( f(x^{(0)}) − f^⋆ + \frac{γ_0}{2} ‖x^{(0)} − x^⋆‖_2^2 )

Accelerated proximal gradient methods 9-18


Proof.

• if γ_{k−1} ≥ m, then

  γ_k = (1 − θ_k)γ_{k−1} + θ_k m ≥ m

• since γ_0 ≥ m, we have γ_k ≥ m for all k

• it follows that θ_i = \sqrt{γ_i t_i} ≥ \sqrt{m t_i} and

  λ_k = \prod_{i=1}^{k} (1 − θ_i) ≤ \prod_{i=1}^{k} (1 − \sqrt{m t_i})

Accelerated proximal gradient methods 9-19


Outline

• Nesterov’s method

• analysis with fixed step size

• line search
Line search

• the analysis for fixed step size starts with the inequality (2):

  g(y − t G_t(y)) ≤ g(y) − t ∇g(y)^T G_t(y) + \frac{t}{2} ‖G_t(y)‖_2^2

  this inequality is known to hold for 0 < t ≤ 1/L

• if L is not known, we can satisfy (2) by a backtracking line search:

  start at some t := t̂ > 0 and backtrack (t := βt) until (2) holds

• the step size selected by the line search satisfies t ≥ t_min = min{t̂, β/L}

• for each tentative t_k we need to recompute θ_k, y, x^{(k)} in the algorithm on p. 9-4 (a code sketch follows below)

• this requires evaluations of ∇g, prox_{th}, and g (twice) per line search iteration

Accelerated proximal gradient methods 9-20
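A sketch of one step of the accelerated method with this backtracking line search (g, grad_g, prox_h, the starting guess t_hat, and β are assumed inputs; as noted above, θ_k, y and the candidate x^{(k)} are recomputed for each tentative step size):

```python
import numpy as np

def accel_step_line_search(g, grad_g, prox_h, x, v, gamma, m, t_hat, beta=0.5):
    """One accelerated proximal gradient step; t is reduced until (2) holds:
    g(x+) <= g(y) - t*grad_g(y)^T G_t(y) + (t/2)*||G_t(y)||_2^2."""
    t = t_hat
    while True:
        b = t * (gamma - m)                  # recompute theta for this tentative t
        theta = (-b + np.sqrt(b * b + 4 * t * gamma)) / 2
        y = x + (theta * gamma / (gamma + m * theta)) * (v - x)
        grad_y = grad_g(y)
        x_new = prox_h(y - t * grad_y, t)
        G = (y - x_new) / t                  # G_t(y)
        if g(x_new) <= g(y) - t * (grad_y @ G) + 0.5 * t * (G @ G):
            break
        t *= beta                            # backtrack: t := beta * t
    v_new = x + (x_new - x) / theta
    return x_new, v_new, theta ** 2 / t, t   # new x, v, gamma, accepted step size
```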


Analysis with line search

• from page 9-16:

  λ_k ≤ \frac{4}{(2 + \sqrt{γ_0} \sum_{i=1}^{k} \sqrt{t_i})^2} ≤ \frac{4}{(2 + k \sqrt{γ_0 t_min})^2}
i=1

• from page 9-18, if γ_0 ≥ m:

  λ_k ≤ \prod_{i=1}^{k} (1 − \sqrt{m t_i}) ≤ (1 − \sqrt{m t_min})^k

• therefore the results for fixed step size hold with 1/t_min substituted for L

Accelerated proximal gradient methods 9-21


References

Accelerated gradient methods

• Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (2004).


The material in the lecture is from §2.2 of this book.
• P. Tseng, On accelerated proximal gradient methods for convex-concave optimization (2008).
• S. Bubeck, Convex Optimization: Algorithms and Complexity, Foundations and Trends in
Machine Learning (2015), §3.7.

FISTA

• A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse
problems, SIAM J. on Imaging Sciences (2009).
• A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal recovery, in: Y.
Eldar and D. Palomar (Eds.), Convex Optimization in Signal Processing and Communications
(2009).

Line search strategies

• FISTA papers by Beck and Teboulle.


• D. Goldfarb and K. Scheinberg, Fast first-order methods for composite convex optimization with
line search (2011).
• O. Güler, New proximal point algorithms for convex minimization, SIOPT (1992).
• Yu. Nesterov, Gradient methods for minimizing composite functions (2013).

Accelerated proximal gradient methods 9-22


Interpretation and insight

• Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (2004), §2.2.
• W. Su, S. Boyd, E. Candès, A differential equation for modeling Nesterov's accelerated
gradient method: theory and insights, NIPS (2014).
• H. Lin, J. Mairal, Z. Harchaoui, A universal catalyst for first-order optimization,
arXiv:1506.02186 (2015).

Implementation

• S. Becker, E.J. Candès, M. Grant, Templates for convex cone problems with applications to
sparse signal recovery, Mathematical Programming Computation (2011).
• B. O’Donoghue, E. Candès, Adaptive restart for accelerated gradient schemes, Foundations of
Computational Mathematics (2015).
• T. Goldstein, C. Studer, R. Baraniuk, A field guide to forward-backward splitting with a FASTA
implementation, arXiv:1411.3406 (2016).

Accelerated proximal gradient methods 9-23
