Fast Gradient Method
• Nesterov’s method
• line search
Proximal gradient method
we consider the same problem and make the same assumptions as in lecture 6: minimize f(x) = g(x) + h(x), where h is closed and convex and g is differentiable, with constants m \ge 0 and L > 0 such that

    g(x) - \frac{m}{2} x^T x \quad\text{and}\quad \frac{L}{2} x^T x - g(x)

are convex

recall the convergence result for the proximal gradient method with fixed step size t = 1/L:

    f(x^{(k)}) - f^\star \le \frac{L}{2k}\|x^{(0)} - x^\star\|_2^2

and, when m > 0, linear convergence with rate c = 1 - m/L; this lecture develops an accelerated method with a 1/k^2 rate
Nesterov's method

Algorithm: choose x^{(0)} = v^{(0)} and \gamma_0 > 0; for k \ge 1, repeat the steps

• define \gamma_k = \theta_k^2 / t_k, where t_k > 0 is the step size and \theta_k is the positive root of the quadratic equation

    \frac{\theta_k^2}{t_k} = (1 - \theta_k)\gamma_{k-1} + m\theta_k

• update

    y = x^{(k-1)} + \frac{\theta_k \gamma_{k-1}}{\gamma_{k-1} + m\theta_k}\,(v^{(k-1)} - x^{(k-1)})

    x^{(k)} = prox_{t_k h}(y - t_k \nabla g(y))

    v^{(k)} = x^{(k-1)} + \frac{1}{\theta_k}(x^{(k)} - x^{(k-1)})
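The following is a minimal NumPy sketch of this algorithm (not from the original slides; the lasso test problem, the soft-thresholding prox, and all parameter values are illustrative assumptions):

    import numpy as np

    def nesterov(grad_g, prox_h, x0, t, m, gamma0, iters):
        """Sketch of the algorithm above with fixed step size t."""
        x = v = x0.copy()
        gamma = gamma0
        for _ in range(iters):
            # theta_k: positive root of theta^2/t = (1 - theta)*gamma + m*theta
            b, c = (gamma - m) * t, -gamma * t
            theta = (-b + np.sqrt(b * b - 4 * c)) / 2
            gamma_next = theta ** 2 / t
            y = x + (theta * gamma / (gamma + m * theta)) * (v - x)
            x_next = prox_h(y - t * grad_g(y), t)      # prox-gradient step from y
            v = x + (x_next - x) / theta               # v^(k)
            x, gamma = x_next, gamma_next
        return x

    # example: lasso with strongly convex g(x) = 0.5*||Ax - b||^2, h(x) = lam*||x||_1
    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((200, 50)), rng.standard_normal(200)
    lam = 0.1
    eigs = np.linalg.eigvalsh(A.T @ A)
    m, L = eigs.min(), eigs.max()                      # strong convexity / smoothness constants
    grad_g = lambda x: A.T @ (A @ x - b)
    prox_h = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)
    x_hat = nesterov(grad_g, prox_h, np.zeros(50), t=1.0 / L, m=m, gamma0=m, iters=100)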
• substituting v^{(k-1)} = x^{(k-2)} + \frac{1}{\theta_{k-1}}(x^{(k-1)} - x^{(k-2)}) in the definition of y shows that y extrapolates the two previous iterates:

    y = x^{(k-1)} + \frac{\theta_k \gamma_{k-1}}{\gamma_{k-1} + m\theta_k}\,(v^{(k-1)} - x^{(k-1)})
      = x^{(k-1)} + \frac{\theta_k \gamma_{k-1}}{\gamma_{k-1} + m\theta_k}\Bigl(\frac{1}{\theta_{k-1}} - 1\Bigr)(x^{(k-1)} - x^{(k-2)})

(figure: x^{(k-2)}, x^{(k-1)}, and y shown as points on a line; y lies beyond x^{(k-1)} in the direction x^{(k-1)} - x^{(k-2)})
• the sequences \theta_k and \gamma_k are determined by the recursion

    \frac{\theta_k^2}{t_k} = (1 - \theta_k)\gamma_{k-1} + m\theta_k, \qquad \gamma_k = \frac{\theta_k^2}{t_k}

(figure: two plots of \theta_k and \gamma_k versus k, for k = 0, \ldots, 25 and \gamma_0 = m, \gamma_0 = 0.5m, \gamma_0 = 2m; \theta_k approaches \sqrt{m/L} and \gamma_k approaches m)
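A short numerical sketch (the values m = 1, L = 10 and the fixed step size t = 1/L are arbitrary choices, not from the slides) reproducing the limits indicated in the figure:

    import numpy as np

    m, L = 1.0, 10.0
    t = 1.0 / L
    for gamma0 in (m, 0.5 * m, 2 * m):
        gamma = gamma0
        for k in range(25):
            # positive root of theta^2/t = (1 - theta)*gamma + m*theta
            b, c = (gamma - m) * t, -gamma * t
            theta = (-b + np.sqrt(b * b - 4 * c)) / 2
            gamma = theta ** 2 / t
        print(gamma0, theta, gamma)    # theta -> sqrt(m/L) = 0.316..., gamma -> m = 1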
• for m = 0 the iteration reduces to the FISTA update

    y = x^{(k-1)} + \theta_k\Bigl(\frac{1}{\theta_{k-1}} - 1\Bigr)(x^{(k-1)} - x^{(k-2)})

    x^{(k)} = prox_{t_k h}(y - t_k \nabla g(y))
(figure: two example problems; convergence of the proximal gradient method and FISTA, with the error plotted on a logarithmic scale from 10^0 down to 10^{-6} for k = 0, \ldots, 200)
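A hedged sketch of such a comparison (not the example from the slides; the lasso instance, the value of lam, and the initialization theta_0 = 1 are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    A, b = rng.standard_normal((100, 500)), rng.standard_normal(100)
    lam = 1.0
    t = 1.0 / np.linalg.norm(A, 2) ** 2                # fixed step size t = 1/L
    grad_g = lambda x: A.T @ (A @ x - b)
    prox_h = lambda z: np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)
    f = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()

    # proximal gradient method
    x = np.zeros(500)
    for k in range(200):
        x = prox_h(x - t * grad_g(x))
    f_grad = f(x)

    # FISTA (the m = 0 iteration above), started with theta_0 = 1, x^(0) = x^(-1) = 0
    x = x_prev = np.zeros(500)
    theta_prev = 1.0
    for k in range(200):
        # positive root of theta^2 = (1 - theta)*theta_prev^2  (m = 0, fixed t)
        theta = theta_prev * (np.sqrt(theta_prev ** 2 + 4) - theta_prev) / 2
        y = x + theta * (1 / theta_prev - 1) * (x - x_prev)
        x_prev, x = x, prox_h(y - t * grad_g(y))
        theta_prev = theta
    print(f_grad, f(x))    # FISTA typically attains a much smaller objective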
Overview
• the convergence analysis is based on showing that each iteration satisfies

    f(x^{(k)}) - f^\star + \frac{\gamma_k}{2}\|v^{(k)} - x^\star\|_2^2 \le (1 - \theta_k)\Bigl(f(x^{(k-1)}) - f^\star + \frac{\gamma_{k-1}}{2}\|v^{(k-1)} - x^\star\|_2^2\Bigr)    (1)

• applying (1) recursively and using v^{(0)} = x^{(0)} gives

    f(x^{(k)}) - f^\star \le f(x^{(k)}) - f^\star + \frac{\gamma_k}{2}\|v^{(k)} - x^\star\|_2^2 \le \lambda_k\Bigl(f(x^{(0)}) - f^\star + \frac{\gamma_0}{2}\|x^{(0)} - x^\star\|_2^2\Bigr)

  where \lambda_k = \prod_{i=1}^k (1 - \theta_i)

• inequality (1) is proved below; bounds on \lambda_k follow afterwards
• to prove (1), consider one iteration and simplify notation: write x = x^{(k-1)}, v = v^{(k-1)}, x^+ = x^{(k)}, v^+ = v^{(k)}, \theta = \theta_k, t = t_k, \gamma = \gamma_{k-1}, \gamma^+ = \gamma_k, and let G_t(y) = \frac{1}{t}(y - prox_{th}(y - t\nabla g(y))) denote the gradient mapping, so that x^+ = y - tG_t(y)

• from the definition of v^+,

    v^+ = x + \frac{1}{\theta}(y - tG_t(y) - x)
        = \frac{1}{\theta}(y - (1 - \theta)x) - \frac{\theta}{\gamma^+}G_t(y)

  (the second step uses t = \theta^2/\gamma^+)

• multiply with \gamma^+ = \gamma + m\theta - \theta\gamma:

    \gamma^+ v^+ = \frac{\gamma^+}{\theta}(y - (1 - \theta)x) - \theta G_t(y)
                 = \frac{1 - \theta}{\theta}\bigl((\gamma + m\theta)y - \gamma^+ x\bigr) + \theta m y - \theta G_t(y)
                 = (1 - \theta)\gamma v + \theta m y - \theta G_t(y)

  (the last step uses the definition of y, which gives (\gamma + m\theta)y = \gamma^+ x + \theta\gamma v)
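A quick symbolic check of the last identity (SymPy with scalar variables; not part of the original slides, and representative because the identity is affine in x, v, and G_t(y)):

    import sympy as sp

    x, v, G, theta, gamma, m = sp.symbols('x v G theta gamma m', positive=True)
    gamma_p = (1 - theta) * gamma + m * theta          # gamma^+
    t = theta ** 2 / gamma_p                           # step size, from gamma^+ = theta^2/t
    y = x + theta * gamma / (gamma + m * theta) * (v - x)
    v_p = x + (y - t * G - x) / theta                  # v^+
    # should print 0:
    print(sp.simplify(gamma_p * v_p - ((1 - theta) * gamma * v + theta * m * y - theta * G)))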
• if the step size t satisfies

    g(x^+) \le g(y) - t\nabla g(y)^T G_t(y) + \frac{t}{2}\|G_t(y)\|_2^2,    (2)

  which holds in particular when 0 < t \le 1/L, then, as in the analysis of the proximal gradient method, for all z

    f(z) \ge f(x^+) + \frac{t}{2}\|G_t(y)\|_2^2 + G_t(y)^T(z - y) + \frac{m}{2}\|z - y\|_2^2
• expanding the squares and using \gamma^+ v^+ = (1 - \theta)\gamma v + \theta m y - \theta G_t(y) and \gamma^+ = (1 - \theta)\gamma + m\theta gives the identity

    \frac{\gamma^+}{2}\bigl(\|x^\star - v^+\|_2^2 - \|y - v^+\|_2^2\bigr) = \frac{(1 - \theta)\gamma}{2}\bigl(\|x^\star - v\|_2^2 - \|y - v\|_2^2\bigr) + \frac{m\theta}{2}\|x^\star - y\|_2^2 + \theta G_t(y)^T(x^\star - y)
• combining the lower bound on f(z) at z = x^\star (weighted by \theta) and at z = x (weighted by 1 - \theta) with the identity above gives

    f(x^+) - f^\star + \frac{\gamma^+}{2}\|x^\star - v^+\|_2^2
      \le (1 - \theta)\Bigl(f(x) - f^\star + \frac{\gamma}{2}\|x^\star - v\|_2^2 - G_t(y)^T(x - y) - \frac{\gamma}{2}\|y - v\|_2^2\Bigr) - \frac{t}{2}\|G_t(y)\|_2^2 + \frac{\gamma^+}{2}\|y - v^+\|_2^2
• the last term can be written as

    \frac{\gamma^+}{2}\|y - v^+\|_2^2 = \frac{1}{2\gamma^+}\|(1 - \theta)\gamma(y - v) + \theta G_t(y)\|_2^2
      = \frac{(1 - \theta)^2\gamma^2}{2\gamma^+}\|y - v\|_2^2 + \frac{\theta(1 - \theta)\gamma}{\gamma^+}G_t(y)^T(y - v) + \frac{t}{2}\|G_t(y)\|_2^2
      = (1 - \theta)\Bigl(\frac{\gamma(\gamma^+ - m\theta)}{2\gamma^+}\|y - v\|_2^2 + G_t(y)^T(x - y)\Bigr) + \frac{t}{2}\|G_t(y)\|_2^2

  the last step uses the definitions of \gamma^+ and y (chosen so that \theta\gamma(y - v) = \gamma^+(x - y))

• substituting this in the last inequality above gives inequality (1):
    f(x^+) - f^\star + \frac{\gamma^+}{2}\|x^\star - v^+\|_2^2
      \le (1 - \theta)\Bigl(f(x) - f^\star + \frac{\gamma}{2}\|x^\star - v\|_2^2\Bigr) - \frac{(1 - \theta)\gamma m\theta}{2\gamma^+}\|y - v\|_2^2
      \le (1 - \theta)\Bigl(f(x) - f^\star + \frac{\gamma}{2}\|x^\star - v\|_2^2\Bigr)
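A numerical sanity check of inequality (1) (not from the slides: it uses a randomly generated strongly convex quadratic g with h = 0, so that x^\star and f^\star are computable; all constants are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    P = rng.standard_normal((n, n)); P = P.T @ P + np.eye(n)   # g(x) = 0.5 x'Px + q'x, h = 0
    q = rng.standard_normal(n)
    eigs = np.linalg.eigvalsh(P)
    m, L = eigs.min(), eigs.max()
    t = 1.0 / L
    xstar = -np.linalg.solve(P, q)                              # minimizer of f = g
    f = lambda x: 0.5 * x @ P @ x + q @ x
    fstar = f(xstar)

    x = v = rng.standard_normal(n)
    gamma = m
    V = f(x) - fstar + 0.5 * gamma * np.sum((v - xstar) ** 2)
    for k in range(30):
        b, c = (gamma - m) * t, -gamma * t
        theta = (-b + np.sqrt(b * b - 4 * c)) / 2               # root of theta^2/t = (1-theta)gamma + m theta
        gamma_next = theta ** 2 / t
        y = x + theta * gamma / (gamma + m * theta) * (v - x)
        x_next = y - t * (P @ y + q)                            # prox step with h = 0
        v = x + (x_next - x) / theta
        x, gamma = x_next, gamma_next
        V_next = f(x) - fstar + 0.5 * gamma * np.sum((v - xstar) ** 2)
        assert V_next <= (1 - theta) * V + 1e-9
        V = V_next
    print("inequality (1) holds at every iteration")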
• the product \lambda_k = \prod_{i=1}^k (1 - \theta_i) satisfies (derivation below)

    \lambda_k \le \frac{4}{\bigl(2 + \sqrt{\gamma_0}\sum_{i=1}^k \sqrt{t_i}\bigr)^2}

• for fixed step size t_i = 1/L this gives

    \lambda_k \le \frac{4}{(2 + k\sqrt{\gamma_0/L})^2}

• combined with the inequality in the overview, this shows the 1/k^2 convergence rate:

    f(x^{(k)}) - f^\star \le \frac{4}{(2 + k\sqrt{\gamma_0/L})^2}\Bigl(f(x^{(0)}) - f^\star + \frac{\gamma_0}{2}\|x^{(0)} - x^\star\|_2^2\Bigr)
• from the quadratic equation defining \theta_k,

    \lambda_k = (1 - \theta_k)\lambda_{k-1} = \frac{\gamma_k - \theta_k m}{\gamma_{k-1}}\lambda_{k-1} \le \frac{\gamma_k}{\gamma_{k-1}}\lambda_{k-1}

  hence \lambda_k \le \gamma_k/\gamma_0 (since \lambda_0 = 1)

• therefore

    \frac{1}{\sqrt{\lambda_i}} - \frac{1}{\sqrt{\lambda_{i-1}}} \ge \frac{\lambda_{i-1} - \lambda_i}{2\lambda_{i-1}\sqrt{\lambda_i}} \qquad \text{(because } \lambda_i \le \lambda_{i-1}\text{)}
      = \frac{\theta_i}{2\sqrt{\lambda_i}}
      \ge \frac{\theta_i}{2\sqrt{\gamma_i/\gamma_0}}
      = \frac{1}{2}\sqrt{\gamma_0 t_i}

• summing from i = 1 to k gives 1/\sqrt{\lambda_k} \ge 1 + \frac{\sqrt{\gamma_0}}{2}\sum_{i=1}^k \sqrt{t_i}, which is the bound on \lambda_k stated above
• if \gamma_{k-1} \ge m, then

    \gamma_k = (1 - \theta_k)\gamma_{k-1} + \theta_k m \ge m

  so if \gamma_0 \ge m, then \gamma_k \ge m for all k, and \theta_k = \sqrt{\gamma_k t_k} \ge \sqrt{m t_k}

• therefore, if \gamma_0 \ge m,

    \lambda_k = \prod_{i=1}^k (1 - \theta_i) \le \prod_{i=1}^k \bigl(1 - \sqrt{m t_i}\bigr)
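A short numerical check of the two bounds on \lambda_k (fixed step size t = 1/L; the values of m, L, \gamma_0 below are arbitrary, with \gamma_0 \ge m so that the second bound applies):

    import numpy as np

    m, L = 1.0, 100.0
    t, gamma0 = 1.0 / L, 2.0
    gamma, lam = gamma0, 1.0
    for k in range(1, 51):
        b, c = (gamma - m) * t, -gamma * t
        theta = (-b + np.sqrt(b * b - 4 * c)) / 2
        gamma = theta ** 2 / t
        lam *= 1 - theta
        assert lam <= 4 / (2 + k * np.sqrt(gamma0 * t)) ** 2 + 1e-12
        assert lam <= (1 - np.sqrt(m * t)) ** k + 1e-12
    print("both bounds on lambda_k hold for k = 1, ..., 50")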
Line search
• the analysis for fixed step size starts with the inequality (2):

    g(y - tG_t(y)) \le g(y) - t\nabla g(y)^T G_t(y) + \frac{t}{2}\|G_t(y)\|_2^2

  a backtracking line search uses the same condition: start at t := \hat t > 0 and multiply t by \beta \in (0, 1) until (2) holds

• the step size selected by the line search satisfies t \ge t_{\min} = \min\{\hat t, \beta/L\}

• the two bounds on \lambda_k become

    \lambda_k \le \frac{4}{\bigl(2 + \sqrt{\gamma_0}\sum_{i=1}^k \sqrt{t_i}\bigr)^2} \le \frac{4}{(2 + k\sqrt{\gamma_0 t_{\min}})^2}

  and, for \gamma_0 \ge m,

    \lambda_k \le \prod_{i=1}^k \bigl(1 - \sqrt{m t_i}\bigr) \le \bigl(1 - \sqrt{m t_{\min}}\bigr)^k

• therefore the results for fixed step size hold with 1/t_{\min} substituted for L
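A hedged sketch of one backtracking step of this kind (the function names and the defaults \hat t = 1, \beta = 1/2 are illustrative, not from the slides):

    import numpy as np

    def prox_gradient_step(g, grad_g, prox_h, y, t_hat=1.0, beta=0.5):
        """Backtrack from t_hat until condition (2) holds; return x^+ and the step size."""
        t = t_hat
        while True:
            x_plus = prox_h(y - t * grad_g(y), t)
            G = (y - x_plus) / t                               # gradient mapping G_t(y)
            # condition (2): g(x^+) <= g(y) - t*grad_g(y)'G + (t/2)*||G||^2
            if g(x_plus) <= g(y) - t * grad_g(y) @ G + 0.5 * t * (G @ G):
                return x_plus, t                               # exits with t >= min{t_hat, beta/L}
            t *= beta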
References

FISTA
• A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse
problems, SIAM J. on Imaging Sciences (2009).
• A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal recovery, in: Y.
Eldar and D. Palomar (Eds.), Convex Optimization in Signal Processing and Communications
(2009).
• Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (2004), §2.2.
• W. Su, S. Boyd, E. Candès, A differential equation for modeling Nesterov’s accelerated
gradient method: theory and insights, NIPS (2014).
• H. Lin, J. Mairal, Z. Harchaoui, A universal catalyst for first-order optimization,
arXiv:1506.02186 (2015).
Implementation
• S. Becker, E.J. Candès, M. Grant, Templates for convex cone problems with applications to
sparse signal recovery, Mathematical Programming Computation (2011).
• B. O’Donoghue, E. Candès, Adaptive restart for accelerated gradient schemes, Foundations of
Computational Mathematics (2015).
• T. Goldstein, C. Studer, R. Baraniuk, A field guide to forward-backward splitting with a FASTA
implementation, arXiv:1411.3406 (2016).