
L. Vandenberghe EE236C (Spring 2016)

9. Accelerated proximal gradient methods

• Nesterov’s method

• analysis with fixed step size

• line search

9-1
Proximal gradient method

Results from lecture 6

• each proximal gradient iteration is a descent step:

      f(x^{(k)}) < f(x^{(k-1)}),  \qquad  \|x^{(k)} - x^\star\|_2^2 \le c \, \|x^{(k-1)} - x^\star\|_2^2

  with c = 1 - m/L

• suboptimality after k iterations is O(1/k):

      f(x^{(k)}) - f^\star \le \frac{L}{2k} \|x^{(0)} - x^\star\|_2^2

Accelerated proximal gradient methods

• to improve convergence, we add a momentum term


• we relax the descent property
• originated in work by Nesterov in the 1980s

Accelerated proximal gradient methods 9-2


Assumptions

we consider the same problem and make the same assumptions as in lecture 6:

      minimize   f(x) = g(x) + h(x)

• h is closed and convex (so that prox_{th} is well defined; see the sketch after this list)

• g is differentiable with dom g = R^n

• there exist constants m ≥ 0 and L > 0 such that the functions

      g(x) - \frac{m}{2} x^T x,  \qquad  \frac{L}{2} x^T x - g(x)

  are convex

• the optimal value f^\star is finite and attained at x^\star (not necessarily unique)
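
as a purely illustrative aside, prox_{th}(z) = argmin_u ( h(u) + \frac{1}{2t}\|u - z\|_2^2 ); a minimal Python sketch for two common choices of h (neither of which is assumed elsewhere in the lecture) looks as follows:

    import numpy as np

    def prox_l1(z, t):
        # prox of t*||.||_1: componentwise soft thresholding
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def prox_nonneg(z, t):
        # prox of the indicator of {x : x >= 0}: projection (t plays no role)
        return np.maximum(z, 0.0)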

Accelerated proximal gradient methods 9-3


Nesterov’s method

Algorithm: choose x^{(0)} = v^{(0)} and \gamma_0 > 0; for k \ge 1, repeat the steps

• define \gamma_k = \theta_k^2 / t_k where \theta_k is the positive root of the quadratic equation

      \theta_k^2 / t_k = (1 - \theta_k)\gamma_{k-1} + m\theta_k

• update x^{(k)} and v^{(k)} as follows:

      y = x^{(k-1)} + \frac{\theta_k \gamma_{k-1}}{\gamma_{k-1} + m\theta_k} (v^{(k-1)} - x^{(k-1)})

      x^{(k)} = prox_{t_k h}(y - t_k \nabla g(y))

      v^{(k)} = x^{(k-1)} + \frac{1}{\theta_k} (x^{(k)} - x^{(k-1)})

the step size t_k is fixed (t_k = 1/L) or obtained from line search
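
a minimal Python sketch of the iteration above; grad_g, prox_h, the fixed step size t, and the starting value gamma0 are supplied by the caller, and the default gamma0 = 1/t used when none is given is an assumption, not part of the algorithm:

    import numpy as np

    def nesterov_apg(grad_g, prox_h, x0, t, m=0.0, gamma0=None, iters=100):
        """Sketch of the accelerated proximal gradient method on this page."""
        x = v = np.asarray(x0, dtype=float)
        gamma = gamma0 if gamma0 is not None else 1.0 / t    # assumed default
        for _ in range(iters):
            # theta_k: positive root of theta^2/t = (1 - theta)*gamma_{k-1} + m*theta
            b = t * (gamma - m)
            theta = (-b + np.sqrt(b * b + 4.0 * t * gamma)) / 2.0
            gamma_next = theta ** 2 / t                      # gamma_k
            y = x + (theta * gamma / (gamma + m * theta)) * (v - x)
            x_next = prox_h(y - t * grad_g(y), t)            # x^(k)
            v = x + (x_next - x) / theta                     # v^(k)
            x, gamma = x_next, gamma_next
        return x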

Accelerated proximal gradient methods 9-4


Momentum interpretation

• first iteration (k = 1) is a proximal gradient step at y = x^{(0)}

• next iterations are proximal gradient steps at extrapolated points y:

      y = x^{(k-1)} + \frac{\theta_k \gamma_{k-1}}{\gamma_{k-1} + m\theta_k} (v^{(k-1)} - x^{(k-1)})

        = x^{(k-1)} + \frac{\theta_k \gamma_{k-1}}{\gamma_{k-1} + m\theta_k} ( \frac{1}{\theta_{k-1}} - 1 ) (x^{(k-1)} - x^{(k-2)})

      x^{(k)} = prox_{t_k h}(y - t_k \nabla g(y))

[Figure: y lies on the line through x^{(k-2)} and x^{(k-1)}, extrapolated beyond x^{(k-1)}]

Accelerated proximal gradient methods 9-5


Algorithm parameters

      \theta_k^2 / t_k = (1 - \theta_k)\gamma_{k-1} + m\theta_k,  \qquad  \gamma_k = \theta_k^2 / t_k

• \theta_k is the positive root of the quadratic equation

• \theta_k < 1 if m t_k < 1

• if t_k is constant, the sequence \theta_k is completely determined by the starting value \gamma_0

Example: L = 1, m = 0.1, t_k = 1/L

[Figure: \theta_k (left plot) and \gamma_k (right plot) versus k, for \gamma_0 = m, \gamma_0 = 0.5m, \gamma_0 = 2m; \theta_k converges to \sqrt{m/L} and \gamma_k converges to m]
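
the two sequences in the figure can be recomputed with a few lines of Python (an illustrative sketch, not code from the lecture):

    import numpy as np

    L, m = 1.0, 0.1
    t = 1.0 / L
    for gamma0 in (m, 0.5 * m, 2.0 * m):
        gamma, thetas = gamma0, []
        for k in range(25):
            b = t * (gamma - m)
            theta = (-b + np.sqrt(b * b + 4.0 * t * gamma)) / 2.0   # positive root
            gamma = theta ** 2 / t
            thetas.append(theta)
        # theta_k approaches sqrt(m/L) ~ 0.316 and gamma_k approaches m = 0.1
        print(gamma0, thetas[-1], gamma)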

Accelerated proximal gradient methods 9-6


FISTA

if we take m = 0 on page 9-4, the expression for y simplifies:

      y = x^{(k-1)} + \theta_k (v^{(k-1)} - x^{(k-1)})

      x^{(k)} = prox_{t_k h}(y - t_k \nabla g(y))

      v^{(k)} = x^{(k-1)} + \frac{1}{\theta_k} (x^{(k)} - x^{(k-1)})

eliminating the variables v^{(k)} gives the equivalent iteration (for k \ge 2)

      y = x^{(k-1)} + \theta_k ( \frac{1}{\theta_{k-1}} - 1 ) (x^{(k-1)} - x^{(k-2)})

      x^{(k)} = prox_{t_k h}(y - t_k \nabla g(y))

this is known as FISTA (‘Fast Iterative Shrinkage-Thresholding Algorithm’)
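
a hedged FISTA sketch in Python, applied to the lasso problem minimize (1/2)\|Ax - b\|_2^2 + \lambda\|x\|_1; the problem instance and the value of \theta_1 (which corresponds to assuming \gamma_0 = 1/t) are illustrative choices, not taken from the slides:

    import numpy as np

    def soft_threshold(z, tau):
        # prox of tau*||.||_1
        return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

    def fista_lasso(A, b, lam, iters=200):
        L = np.linalg.norm(A, 2) ** 2               # Lipschitz constant of grad g
        t = 1.0 / L                                 # fixed step size
        x_prev = x = np.zeros(A.shape[1])
        theta_prev = None
        for k in range(1, iters + 1):
            if k == 1:
                y = x                               # first step: plain proximal gradient step
                theta = (np.sqrt(5.0) - 1.0) / 2.0  # root of theta^2 = 1 - theta (gamma0 = 1/t assumed)
            else:
                # positive root of theta^2 = (1 - theta) * theta_prev^2
                theta = (-theta_prev**2 + np.sqrt(theta_prev**4 + 4.0 * theta_prev**2)) / 2.0
                y = x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
            grad = A.T @ (A @ y - b)
            x_prev, x = x, soft_threshold(y - t * grad, t * lam)
            theta_prev = theta
        return x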

Accelerated proximal gradient methods 9-7


Example
      minimize \quad \log \sum_{i=1}^{m} \exp(a_i^T x + b_i)

• two randomly generated problems with m = 2000, n = 1000


• same fixed step size used for gradient method and FISTA
• figures show (f(x^{(k)}) - f^\star)/f^\star

[Figure: (f(x^{(k)}) - f^\star)/f^\star versus k (log scale, 10^0 down to 10^{-6}; k from 0 to 200) for the gradient method and FISTA on the two problem instances]
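
a sketch of how a comparable experiment could be set up (the random data, the crude Lipschitz bound L <= \|A\|_2^2, and the SciPy helpers are assumptions; this is not the code that produced the figures):

    import numpy as np
    from scipy.special import logsumexp, softmax

    rng = np.random.default_rng(0)
    m, n = 2000, 1000
    A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

    g = lambda x: logsumexp(A @ x + b)
    grad = lambda x: A.T @ softmax(A @ x + b)
    t = 1.0 / np.linalg.norm(A, 2) ** 2             # same fixed step for both methods

    x_gd = np.zeros(n)                              # gradient method iterate
    x = x_prev = np.zeros(n)                        # FISTA iterates (h = 0, prox = identity)
    theta_prev = None
    for k in range(1, 201):
        x_gd = x_gd - t * grad(x_gd)
        if k == 1:
            y, theta = x, (np.sqrt(5.0) - 1.0) / 2.0
        else:
            theta = (-theta_prev**2 + np.sqrt(theta_prev**4 + 4.0 * theta_prev**2)) / 2.0
            y = x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
        x_prev, x = x, y - t * grad(y)
        theta_prev = theta
    print(g(x_gd), g(x))                            # FISTA typically reaches a lower objective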

Accelerated proximal gradient methods 9-8


Nesterov’s simplest method

• if m > 0 and we choose γ0 = m, then



      \gamma_k = m,  \qquad  \theta_k = \sqrt{m t_k}  \quad  for all k \ge 1

• the algorithm on p. 9-4 and p. 9-5 simplifies:


      y = x^{(k-1)} + \frac{\sqrt{t_k}\,(1 - \sqrt{m t_{k-1}})}{\sqrt{t_{k-1}}\,(1 + \sqrt{m t_k})} (x^{(k-1)} - x^{(k-2)})

      x^{(k)} = prox_{t_k h}(y - t_k \nabla g(y))

• with constant stepsize tk = 1/L, the expression for y reduces to


      y = x^{(k-1)} + \frac{1 - \sqrt{m/L}}{1 + \sqrt{m/L}} (x^{(k-1)} - x^{(k-2)})
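
a minimal sketch of this constant-momentum variant (grad_g and prox_h are caller-supplied placeholders, and taking the first iteration as a plain proximal gradient step is an assumption consistent with page 9-5):

    import numpy as np

    def nesterov_simplest(grad_g, prox_h, x0, m, L, iters=100):
        t = 1.0 / L
        beta = (1.0 - np.sqrt(m / L)) / (1.0 + np.sqrt(m / L))   # constant momentum coefficient
        x_prev = x = np.asarray(x0, dtype=float)
        for k in range(iters):
            y = x if k == 0 else x + beta * (x - x_prev)         # first step: no extrapolation
            x_prev, x = x, prox_h(y - t * grad_g(y), t)
        return x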

Accelerated proximal gradient methods 9-9


Outline

• Nesterov’s method

• analysis with fixed step size

• line search
Overview

• we show that if tk = 1/L, the following inequality holds at each iteration:

      f(x^{(k)}) - f^\star + \frac{\gamma_k}{2} \|v^{(k)} - x^\star\|_2^2
          \le (1 - \theta_k) ( f(x^{(k-1)}) - f^\star + \frac{\gamma_{k-1}}{2} \|v^{(k-1)} - x^\star\|_2^2 )

• therefore the rate of convergence is determined by \lambda_k = \prod_{i=1}^{k} (1 - \theta_i):

      f(x^{(k)}) - f^\star \le f(x^{(k)}) - f^\star + \frac{\gamma_k}{2} \|v^{(k)} - x^\star\|_2^2
                           \le \lambda_k ( f(x^{(0)}) - f^\star + \frac{\gamma_0}{2} \|x^{(0)} - x^\star\|_2^2 )

(here we assume that x^{(0)} \in dom h = dom f)

Accelerated proximal gradient methods 9-10


Notation for one iteration

quantities in iteration i of the algorithm on page 9-4

• define t = t_i, \theta = \theta_i, \gamma = \gamma_{i-1}, and \gamma^+ = \gamma_i:

      \gamma^+ = (1 - \theta)\gamma + m\theta,  \qquad  \gamma^+ = \theta^2 / t

• define x = x^{(i-1)}, x^+ = x^{(i)}, v = v^{(i-1)}, and v^+ = v^{(i)}:

      y = \frac{1}{\gamma + m\theta} ( \gamma^+ x + \theta\gamma v )

      x^+ = y - t G_t(y)

      v^+ = x + \frac{1}{\theta} (x^+ - x)

  (here G_t(y) = \frac{1}{t} ( y - prox_{th}(y - t\nabla g(y)) ) is the gradient mapping from lecture 6, so x^+ = prox_{th}(y - t\nabla g(y)))

• v^+, v, and y are related as

      \gamma^+ v^+ = (1 - \theta)\gamma v + m\theta y - \theta G_t(y)    (1)

Accelerated proximal gradient methods 9-11


Proof (last identity):

• combine the v and x updates and use \gamma^+ = \theta^2 / t:

      v^+ = x + \frac{1}{\theta} (y - t G_t(y) - x)
          = \frac{1}{\theta} (y - (1 - \theta)x) - \frac{\theta}{\gamma^+} G_t(y)

• multiply with \gamma^+ = \gamma + m\theta - \theta\gamma:

      \gamma^+ v^+ = \frac{\gamma^+}{\theta} (y - (1 - \theta)x) - \theta G_t(y)
                   = \frac{1 - \theta}{\theta} ( (\gamma + m\theta)y - \gamma^+ x ) + \theta m y - \theta G_t(y)
                   = (1 - \theta)\gamma v + \theta m y - \theta G_t(y)

Accelerated proximal gradient methods 9-12


Bounds on objective function

recall the results on the proximal gradient update (page 6-13):

• if 0 < t \le 1/L then g(x^+) = g(y - t G_t(y)) is bounded by

      g(x^+) \le g(y) - t \nabla g(y)^T G_t(y) + \frac{t}{2} \|G_t(y)\|_2^2    (2)

• if the inequality (2) holds, then mt \le 1 and, for all z,

      f(z) \ge f(x^+) + \frac{t}{2} \|G_t(y)\|_2^2 + G_t(y)^T (z - y) + \frac{m}{2} \|z - y\|_2^2

• combine the inequalities for z = x and z = x^\star (with weights 1 - \theta and \theta):

      f(x^+) - f^\star \le (1 - \theta)(f(x) - f^\star) - G_t(y)^T ( (1 - \theta)x + \theta x^\star - y )
                           - \frac{t}{2} \|G_t(y)\|_2^2 - \frac{m\theta}{2} \|x^\star - y\|_2^2

Accelerated proximal gradient methods 9-13


Progress in one iteration

• the definition of γ + and (1) imply that

      \frac{\gamma^+}{2} ( \|x^\star - v^+\|_2^2 - \|y - v^+\|_2^2 )
          = \frac{(1 - \theta)\gamma}{2} ( \|x^\star - v\|_2^2 - \|y - v\|_2^2 ) + \frac{m\theta}{2} \|x^\star - y\|_2^2 + \theta G_t(y)^T (x^\star - y)

• combining this with the last inequality on page 9-13 gives

      f(x^+) - f^\star + \frac{\gamma^+}{2} \|x^\star - v^+\|_2^2
          \le (1 - \theta) ( f(x) - f^\star + \frac{\gamma}{2} \|x^\star - v\|_2^2 - G_t(y)^T (x - y) - \frac{\gamma}{2} \|y - v\|_2^2 )
              - \frac{t}{2} \|G_t(y)\|_2^2 + \frac{\gamma^+}{2} \|y - v^+\|_2^2

Accelerated proximal gradient methods 9-14


• the last term on the right-hand side is

      \frac{\gamma^+}{2} \|y - v^+\|_2^2 = \frac{1}{2\gamma^+} \| (1 - \theta)\gamma (y - v) + \theta G_t(y) \|_2^2

          = \frac{(1 - \theta)^2 \gamma^2}{2\gamma^+} \|y - v\|_2^2 + \frac{\theta (1 - \theta)\gamma}{\gamma^+} G_t(y)^T (y - v) + \frac{t}{2} \|G_t(y)\|_2^2

          = (1 - \theta) ( \frac{\gamma(\gamma^+ - m\theta)}{2\gamma^+} \|y - v\|_2^2 + G_t(y)^T (x - y) ) + \frac{t}{2} \|G_t(y)\|_2^2

  the last step uses the definitions of \gamma^+ and y (chosen so that \theta\gamma (y - v) = \gamma^+ (x - y))

• substituting this in the last inequality on page 9-14 gives the result on page 9-10

      f(x^+) - f^\star + \frac{\gamma^+}{2} \|x^\star - v^+\|_2^2
          \le (1 - \theta) ( f(x) - f^\star + \frac{\gamma}{2} \|x^\star - v\|_2^2 ) - \frac{(1 - \theta)\gamma m\theta}{2\gamma^+} \|y - v\|_2^2

          \le (1 - \theta) ( f(x) - f^\star + \frac{\gamma}{2} \|x^\star - v\|_2^2 )

Accelerated proximal gradient methods 9-15


Analysis for fixed step size

the product \lambda_k = \prod_{i=1}^{k} (1 - \theta_i) determines the rate of convergence (page 9-10)

• the sequence \lambda_k satisfies the following bound (proof on next page)

      \lambda_k \le \frac{4}{( 2 + \sqrt{\gamma_0} \sum_{i=1}^{k} \sqrt{t_i} )^2}

• for constant step size t_k = 1/L, we obtain

      \lambda_k \le \frac{4}{( 2 + k\sqrt{\gamma_0 / L} )^2}

• combined with the inequality on p. 9-10, this shows the 1/k^2 convergence rate:

      f(x^{(k)}) - f^\star \le \frac{4}{( 2 + k\sqrt{\gamma_0 / L} )^2} ( f(x^{(0)}) - f^\star + \frac{\gamma_0}{2} \|x^{(0)} - x^\star\|_2^2 )
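
a quick numerical sanity check of this bound; the values of L, m, and \gamma_0 below are arbitrary illustrations:

    import numpy as np

    L, m, gamma0 = 1.0, 0.0, 1.0
    t = 1.0 / L
    gamma, lam = gamma0, 1.0
    for k in range(1, 51):
        b = t * (gamma - m)
        theta = (-b + np.sqrt(b * b + 4.0 * t * gamma)) / 2.0   # positive root
        gamma = theta ** 2 / t
        lam *= 1.0 - theta                                      # lambda_k
        assert lam <= 4.0 / (2.0 + k * np.sqrt(gamma0 / L)) ** 2 + 1e-12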

Accelerated proximal gradient methods 9-16


Proof.

• recall that \gamma_k and \theta_k are defined by \gamma_k = (1 - \theta_k)\gamma_{k-1} + \theta_k m and \gamma_k = \theta_k^2 / t_k

• we first note that \lambda_k \le \gamma_k / \gamma_0; this follows from

      \lambda_k = (1 - \theta_k)\lambda_{k-1} = \frac{\gamma_k - \theta_k m}{\gamma_{k-1}} \lambda_{k-1} \le \frac{\gamma_k}{\gamma_{k-1}} \lambda_{k-1}

• the inequality follows by combining from i = 1 to i = k the inequalities

      \frac{1}{\sqrt{\lambda_i}} - \frac{1}{\sqrt{\lambda_{i-1}}} \ge \frac{\lambda_{i-1} - \lambda_i}{2\lambda_{i-1}\sqrt{\lambda_i}}    (because \lambda_i \le \lambda_{i-1})

          = \frac{\theta_i}{2\sqrt{\lambda_i}}

          \ge \frac{\theta_i}{2\sqrt{\gamma_i / \gamma_0}}

          = \frac{1}{2}\sqrt{\gamma_0 t_i}

  (summing over i = 1, \ldots, k and using \lambda_0 = 1 gives 1/\sqrt{\lambda_k} \ge 1 + \frac{1}{2}\sqrt{\gamma_0} \sum_{i=1}^{k} \sqrt{t_i}, which is the bound on the previous page)
2

Accelerated proximal gradient methods 9-17


Strongly convex functions

the following bound on λk is useful for strongly convex functions (m > 0)

• if γ0 ≥ m then γk ≥ m for all k and


      \lambda_k \le \prod_{i=1}^{k} ( 1 - \sqrt{m t_i} )

  (proof on next page)

• for constant step size t_k = 1/L, we obtain

      \lambda_k \le ( 1 - \sqrt{m/L} )^k

• combined with the inequality on p. 9-10, this shows

      f(x^{(k)}) - f^\star \le ( 1 - \sqrt{m/L} )^k ( f(x^{(0)}) - f^\star + \frac{\gamma_0}{2} \|x^{(0)} - x^\star\|_2^2 )

Accelerated proximal gradient methods 9-18


Proof.

• if \gamma_{k-1} \ge m, then

      \gamma_k = (1 - \theta_k)\gamma_{k-1} + \theta_k m \ge m

• since γ0 ≥ m, we have γk ≥ m for all k


• it follows that \theta_i = \sqrt{\gamma_i t_i} \ge \sqrt{m t_i} and

      \lambda_k = \prod_{i=1}^{k} (1 - \theta_i) \le \prod_{i=1}^{k} ( 1 - \sqrt{m t_i} )

Accelerated proximal gradient methods 9-19


Outline

• Nesterov’s method

• analysis with fixed step size

• line search
Line search

• the analysis for fixed step size starts with the inequality (2):

      g(y - t G_t(y)) \le g(y) - t \nabla g(y)^T G_t(y) + \frac{t}{2} \|G_t(y)\|_2^2

  this inequality is known to hold for 0 < t \le 1/L

• if L is not known, we can satisfy (2) by a backtracking line search:


start at some t := t̂ > 0 and backtrack (t := βt, with 0 < β < 1) until (2) holds

• step size selected by the line search satisfies t ≥ tmin = min {t̂, β/L}

• for each tentative t_k we need to recompute \theta_k, y, x^{(k)} in the algorithm on p. 9-4

• requires evaluations of \nabla g, prox_{th}, and g (twice) per line search iteration
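
a hedged sketch of the backtracking test for a single proximal gradient step; g, grad_g, prox_h, t_hat, and beta are caller-supplied placeholders, and the recomputation of \theta_k, y, x^{(k)} needed inside the accelerated method is deliberately left out:

    import numpy as np

    def backtracking_prox_step(g, grad_g, prox_h, y, t_hat=1.0, beta=0.5):
        gy, dgy = g(y), grad_g(y)
        t = t_hat
        while True:
            x_plus = prox_h(y - t * dgy, t)
            G = (y - x_plus) / t                        # gradient mapping G_t(y)
            if g(x_plus) <= gy - t * dgy @ G + 0.5 * t * G @ G:
                return x_plus, t                        # inequality (2) holds
            t *= beta                                   # backtrack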

Accelerated proximal gradient methods 9-20


Analysis with line search

• from page 9-16:

      \lambda_k \le \frac{4}{( 2 + \sqrt{\gamma_0} \sum_{i=1}^{k} \sqrt{t_i} )^2} \le \frac{4}{( 2 + k\sqrt{\gamma_0 t_{\min}} )^2}

• from page 9-18, if \gamma_0 \ge m:

      \lambda_k \le \prod_{i=1}^{k} ( 1 - \sqrt{m t_i} ) \le ( 1 - \sqrt{m t_{\min}} )^k

• therefore the results for fixed step size hold with 1/t_{\min} substituted for L

Accelerated proximal gradient methods 9-21


References

Accelerated gradient methods

• Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (2004).


The material in the lecture is from §2.2 of this book.
• P. Tseng, On accelerated proximal gradient methods for convex-concave optimization (2008).
• S. Bubeck, Convex Optimization: Algorithms and Complexity, Foundations and Trends in
Machine Learning (2015), §3.7.

FISTA

• A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse
problems, SIAM J. on Imaging Sciences (2009).
• A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal recovery, in: Y.
Eldar and D. Palomar (Eds.), Convex Optimization in Signal Processing and Communications
(2009).

Line search strategies

• FISTA papers by Beck and Teboulle.


• D. Goldfarb and K. Scheinberg, Fast first-order methods for composite convex optimization with
line search (2011).
• O. Güler, New proximal point algorithms for convex minimization, SIOPT (1992).
• Yu. Nesterov, Gradient methods for minimizing composite functions (2013).

Accelerated proximal gradient methods 9-22


Interpretation and insight

• Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (2004), §2.2.
• W. Su, S. Boyd, E. Candès, A differential equation for modeling Nesterov's accelerated
gradient method: theory and insights, NIPS (2014).
• H. Lin, J. Mairal, Z. Harchaoui, A universal catalyst for first-order optimization,
arXiv:1506.02186 (2015).

Implementation

• S. Becker, E.J. Candès, M. Grant, Templates for convex cone problems with applications to
sparse signal recovery, Mathematical Programming Computation (2011).
• B. O’Donoghue, E. Candès, Adaptive restart for accelerated gradient schemes, Foundations of
Computational Mathematics (2015).
• T. Goldstein, C. Studer, R. Baraniuk, A field guide to forward-backward splitting with a FASTA
implementation, arXiv:1411.3406 (2016).

Accelerated proximal gradient methods 9-23
