Fast (proximal) gradient methods
this lecture: FISTA (the fast iterative shrinkage-thresholding algorithm)
FISTA update:
\[
y = x^{(k-1)} + \frac{k-2}{k+1}\,\bigl(x^{(k-1)} - x^{(k-2)}\bigr), \qquad
x^{(k)} = \operatorname{prox}_{t_k h}\!\bigl(y - t_k \nabla g(y)\bigr)
\]
[figure: the extrapolated point $y$ on the line through $x^{(k-2)}$ and $x^{(k-1)}$]
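A minimal NumPy sketch of this iteration, under the illustrative assumption that $g(x) = \frac{1}{2}\|Ax-b\|_2^2$ and $h(x) = \lambda\|x\|_1$ (so $\operatorname{prox}_{th}$ is elementwise soft-thresholding); the fixed step size $t = 1/L$ uses $L = \|A\|_2^2$:

```python
import numpy as np

def fista_lasso(A, b, lam, num_iters=100):
    """Sketch of FISTA for 0.5*||Ax - b||^2 + lam*||x||_1 (illustrative choice of g, h)."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad g for this g
    t = 1.0 / L                            # fixed step size t = 1/L
    x = x_prev = np.zeros(A.shape[1])      # convention x^(0) = x^(-1)
    for k in range(1, num_iters + 1):
        y = x + (k - 2) / (k + 1) * (x - x_prev)   # extrapolation step
        z = y - t * A.T @ (A @ y - b)              # gradient step on g
        # prox of t*lam*||.||_1 is soft-thresholding
        x_prev, x = x, np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)
    return x
```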
randomly generated data with m = 2000, n = 1000, same fixed step size
[figures: relative error $(f(x^{(k)}) - f^\star)/|f^\star|$ versus iteration number $k$, comparing the gradient method and FISTA]
assumptions
• the FISTA iteration can be written in the equivalent form
\[
y = (1-\theta_k)\,x^{(k-1)} + \theta_k v^{(k-1)}, \qquad
x^{(k)} = \operatorname{prox}_{t_k h}\!\bigl(y - t_k \nabla g(y)\bigr),
\]
with $\theta_k = \dfrac{2}{k+1}$ and $v^{(k)} = x^{(k-1)} + \dfrac{1}{\theta_k}\bigl(x^{(k)} - x^{(k-1)}\bigr)$; these $\theta_k$ satisfy
\[
\frac{1-\theta_k}{\theta_k^2} \le \frac{1}{\theta_{k-1}^2}, \qquad k \ge 2
\]
• quadratic upper bound from Lipschitz continuity of $\nabla g$:
\[
g(u) \le g(z) + \nabla g(z)^T(u-z) + \frac{L}{2}\|u-z\|_2^2 \quad \forall u, z
\]
• prox inequality: for $u = \operatorname{prox}_{th}(w)$,
\[
h(u) \le h(z) + \frac{1}{t}(w-u)^T(u-z) \quad \forall w, z
\]
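For completeness, the prox inequality is just the subgradient optimality condition for the prox minimization:
\[
u = \operatorname{prox}_{th}(w) = \operatorname*{argmin}_x \Bigl(h(x) + \frac{1}{2t}\|x-w\|_2^2\Bigr)
\;\Longrightarrow\; \frac{1}{t}(w-u) \in \partial h(u),
\]
and the definition of a subgradient gives $h(z) \ge h(u) + \frac{1}{t}(w-u)^T(z-u)$ for all $z$, which rearranges to the inequality above.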
apply the two bounds at $x^+ = \operatorname{prox}_{th}\bigl(y - t\nabla g(y)\bigr)$, with $t \le 1/L$:
\[
g(x^+) \le g(y) + \nabla g(y)^T(x^+ - y) + \frac{1}{2t}\|x^+ - y\|_2^2 \tag{1}
\]
\[
h(x^+) \le h(z) + \nabla g(y)^T(z - x^+) + \frac{1}{t}(x^+ - y)^T(z - x^+) \quad \forall z
\]
adding the two inequalities and using convexity of $g$ ($g(y) + \nabla g(y)^T(z-y) \le g(z)$) gives
\[
f(x^+) \le f(z) + \frac{1}{t}(x^+ - y)^T(z - x^+) + \frac{1}{2t}\|x^+ - y\|_2^2 \quad \forall z
\]
• make a convex combination of these upper bounds for $z = x^{(i-1)}$ and $z = x^\star$, with weights $1-\theta_i$ and $\theta_i$:
\[
\frac{t_i}{\theta_i^2}\bigl(f(x^{(i)}) - f^\star\bigr) + \frac{1}{2}\|v^{(i)} - x^\star\|_2^2
\;\le\; \frac{(1-\theta_i)t_i}{\theta_i^2}\bigl(f(x^{(i-1)}) - f^\star\bigr) + \frac{1}{2}\|v^{(i-1)} - x^\star\|_2^2 \tag{2}
\]
take $t_i = t = 1/L$ and apply (2) recursively, using $(1-\theta_i)/\theta_i^2 \le 1/\theta_{i-1}^2$:
\[
\frac{t}{\theta_k^2}\bigl(f(x^{(k)}) - f^\star\bigr) + \frac{1}{2}\|v^{(k)} - x^\star\|_2^2
\;\le\; \frac{(1-\theta_1)t}{\theta_1^2}\bigl(f(x^{(0)}) - f^\star\bigr) + \frac{1}{2}\|v^{(0)} - x^\star\|_2^2
= \frac{1}{2}\|x^{(0)} - x^\star\|_2^2
\]
(the last equality uses $\theta_1 = 1$ and $v^{(0)} = x^{(0)}$)
therefore,
\[
f(x^{(k)}) - f^\star \;\le\; \frac{\theta_k^2}{2t}\,\|x^{(0)} - x^\star\|_2^2 = \frac{2L}{(k+1)^2}\,\|x^{(0)} - x^\star\|_2^2
\]
conclusion: reaches $f(x^{(k)}) - f^\star \le \epsilon$ after $O(1/\sqrt{\epsilon})$ iterations
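Explicitly, rearranging the bound gives the iteration count:
\[
\frac{2L}{(k+1)^2}\,\|x^{(0)} - x^\star\|_2^2 \le \epsilon
\quad\Longleftrightarrow\quad
k \ge \sqrt{\frac{2L}{\epsilon}}\,\|x^{(0)} - x^\star\|_2 - 1,
\]
i.e., $O(1/\sqrt{\epsilon})$ iterations, versus $O(1/\epsilon)$ for the proximal gradient method.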
[figures: relative error $(f(x^{(k)}) - f^\star)/|f^\star|$ versus $k$ for the gradient method and FISTA, over 50 and 100 iterations]
FISTA with line search (method 1)
the analysis above only used inequality (1),
\[
g(x^+) \le g(y) + \nabla g(y)^T(x^+ - y) + \frac{1}{2t}\|x^+ - y\|_2^2 \tag{1}
\]
this inequality is known to hold for $0 < t \le 1/L$, so it can be enforced by a backtracking search on $t$
the same argument as before then gives
\[
\frac{t_i}{\theta_i^2}\bigl(f(x^{(i)}) - f^\star\bigr) + \frac{1}{2}\|v^{(i)} - x^\star\|_2^2
\;\le\; \frac{(1-\theta_i)t_i}{\theta_i^2}\bigl(f(x^{(i-1)}) - f^\star\bigr) + \frac{1}{2}\|v^{(i-1)} - x^\star\|_2^2 \tag{2}
\]
and the recursion telescopes if
\[
\frac{(1-\theta_i)t_i}{\theta_i^2} \le \frac{t_{i-1}}{\theta_{i-1}^2} \qquad (i \ge 2) \tag{3}
\]
under (2) and (3),
\[
f(x^{(k)}) - f^\star \le \frac{\theta_k^2}{2t_k}\,\|x^{(0)} - x^\star\|_2^2,
\]
so a $1/k^2$ rate follows whenever
\[
\frac{\theta_k^2}{t_k} = O\Bigl(\frac{1}{k^2}\Bigr)
\]
• fixed step size: $t_k = 1/L$ and $\theta_k = \dfrac{2}{k+1}$ give
\[
\frac{\theta_k^2}{t_k} = \frac{4L}{(k+1)^2}
\]
• line search (keeping $\theta_k = 2/(k+1)$): since the step sizes satisfy $t_k \ge t_{\min}$,
\[
\frac{\theta_k^2}{t_k} \le \frac{4}{(k+1)^2\, t_{\min}}
\]
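As a sanity check, $\theta_k = 2/(k+1)$ indeed satisfies the condition $(1-\theta_k)/\theta_k^2 \le 1/\theta_{k-1}^2$ used in the recursion:
\[
\frac{1-\theta_k}{\theta_k^2} = \frac{k-1}{k+1}\cdot\frac{(k+1)^2}{4} = \frac{k^2-1}{4} \le \frac{k^2}{4} = \frac{1}{\theta_{k-1}^2}.
\]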
FISTA with line search (method 2)
$t := \hat t > 0$
$\theta :=$ positive root of $t_{k-1}\theta^2 = t\,\theta_{k-1}^2(1-\theta)$
$y := (1-\theta)\,x^{(k-1)} + \theta v^{(k-1)}$
$x := \operatorname{prox}_{th}\bigl(y - t\nabla g(y)\bigr)$
while $g(x) > g(y) + \nabla g(y)^T(x-y) + \frac{1}{2t}\|x-y\|_2^2$
    $t := \beta t$
    $\theta :=$ positive root of $t_{k-1}\theta^2 = t\,\theta_{k-1}^2(1-\theta)$
    $y := (1-\theta)\,x^{(k-1)} + \theta v^{(k-1)}$
    $x := \operatorname{prox}_{th}\bigl(y - t\nabla g(y)\bigr)$
end
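A Python sketch of one iteration of this line search; the callables `g`, `grad_g`, `prox_h` and their signatures are assumptions for illustration (the slides do not fix an interface). The quadratic for $\theta$ is solved in closed form.

```python
import math

def theta_root(t, t_prev, th_prev):
    # positive root of: t_prev * th^2 = t * th_prev^2 * (1 - th)
    b = t * th_prev ** 2
    return (-b + math.sqrt(b * b + 4 * t_prev * b)) / (2 * t_prev)

def fista_ls2_step(x_prev, v_prev, t_prev, th_prev, g, grad_g, prox_h,
                   t_hat=1.0, beta=0.5):
    """One iteration (k >= 2) of FISTA with line search, method 2 -- a sketch.

    prox_h(w, t) should return prox_{t h}(w); at k = 1 take theta = 1 (not shown).
    """
    t = t_hat
    while True:
        th = theta_root(t, t_prev, th_prev)
        y = (1 - th) * x_prev + th * v_prev
        x = prox_h(y - t * grad_g(y), t)
        d = x - y
        if g(x) <= g(y) + grad_g(y) @ d + (d @ d) / (2 * t):
            break                       # sufficient-decrease condition (1) holds
        t *= beta                       # otherwise shrink t and redo the step
    v = x_prev + (x - x_prev) / th      # v-update, as in the v-form of FISTA
    return x, v, t, th
```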
the resulting $t_k$, $\theta_k$ satisfy
\[
\frac{\theta_k^2}{t_k} \;\le\; \frac{1}{\Bigl(\sqrt{t_1} + \frac{1}{2}\sum_{i=2}^{k}\sqrt{t_i}\Bigr)^{2}} \;\le\; \frac{4}{(k+1)^2\, t_{\min}}
\]
method 1
• one evaluation of $g(x)$, one $\operatorname{prox}_{th}$ evaluation per line search iteration ($y$ and $\nabla g(y)$ do not change during the search)
method 2
• recomputes $y$ in every line search iteration, hence one evaluation each of $g(y)$, $\nabla g(y)$, $g(x)$, and $\operatorname{prox}_{th}$
enforcing descent: the FISTA update is modified with a descent test,
\[
\begin{aligned}
y &= (1-\theta_k)\,x^{(k-1)} + \theta_k v^{(k-1)} \\
u &= \operatorname{prox}_{t_k h}\bigl(y - t_k \nabla g(y)\bigr) \\
x^{(k)} &= \begin{cases} u & \text{if } f(u) \le f(x^{(k-1)}) \\ x^{(k-1)} & \text{otherwise} \end{cases} \\
v^{(k)} &= x^{(k-1)} + \frac{1}{\theta_k}\bigl(u - x^{(k-1)}\bigr)
\end{aligned}
\]
note that $v^{(k)}$ is updated using $u$ whether or not the descent test accepts it
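A one-step sketch of this descent variant, with the same illustrative interface as before (`f`, `grad_g`, `prox_h` are assumed callables, not an API from the slides):

```python
def fista_descent_step(x_prev, v_prev, t, th, f, grad_g, prox_h):
    """One step of the descent variant (FISTA-d) -- illustrative sketch."""
    y = (1 - th) * x_prev + th * v_prev
    u = prox_h(y - t * grad_g(y), t)
    x = u if f(u) <= f(x_prev) else x_prev   # accept u only if it decreases f
    v = x_prev + (u - x_prev) / th           # v is updated with u either way
    return x, v
```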
[figure: $(f(x^{(k)}) - f^\star)/|f^\star|$ versus $k$ for the gradient method, FISTA, and FISTA-d (FISTA with enforced descent)]
a related variant applies the prox operator in the update of $v$:
\[
\begin{aligned}
y &= (1-\theta_k)\,x^{(k-1)} + \theta_k v^{(k-1)} \\
v^{(k)} &= \operatorname{prox}_{(t_k/\theta_k)h}\Bigl(v^{(k-1)} - \frac{t_k}{\theta_k}\nabla g(y)\Bigr) \\
x^{(k)} &= (1-\theta_k)\,x^{(k-1)} + \theta_k v^{(k)}
\end{aligned}
\]
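The corresponding one-step sketch (same assumed callables as above):

```python
def fista_vprox_step(x_prev, v_prev, t, th, grad_g, prox_h):
    """One step of the variant with the prox in the v-update -- illustrative sketch."""
    y = (1 - th) * x_prev + th * v_prev
    v = prox_h(v_prev - (t / th) * grad_g(y), t / th)   # prox step with step size t/th
    x = (1 - th) * x_prev + th * v                      # x is a convex combination
    return x, v
```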
assumptions
\[
g(x^+) \le g(y) + \nabla g(y)^T(x^+ - y) + \frac{1}{2t}\|x^+ - y\|_2^2
\]
• from this bound and convexity of $g$ and $h$ (using $x^+ - y = \theta(v^+ - v)$):
\[
g(x^+) \le (1-\theta)\,g(x) + \theta\bigl(g(y) + \nabla g(y)^T(v^+ - y)\bigr) + \frac{\theta^2}{2t}\|v^+ - v\|_2^2
\]
\[
h(x^+) \le (1-\theta)\,h(x) + \theta\, h(v^+)
\]
• from the prox inequality, with $v^+ = \operatorname{prox}_{(t/\theta)h}\bigl(v - (t/\theta)\nabla g(y)\bigr)$:
\[
h(v^+) \le h(z) + \nabla g(y)^T(z - v^+) - \frac{\theta}{t}(v^+ - v)^T(v^+ - z) \quad \forall z
\]
• combining the three bounds, with $z = x^\star$ and $g(y) + \nabla g(y)^T(x^\star - y) \le g(x^\star)$:
\[
\begin{aligned}
f(x^+) &\le (1-\theta)f(x) + \theta f^\star - \frac{\theta^2}{t}(v^+ - v)^T(v^+ - x^\star) + \frac{\theta^2}{2t}\|v^+ - v\|_2^2 \\
&= (1-\theta)f(x) + \theta f^\star + \frac{\theta^2}{2t}\Bigl(\|v - x^\star\|_2^2 - \|v^+ - x^\star\|_2^2\Bigr)
\end{aligned}
\]
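The last equality is completing the square: with $a = v^+ - v$ and $b = v^+ - x^\star$,
\[
-a^T b + \tfrac{1}{2}\|a\|_2^2 = \tfrac{1}{2}\|b - a\|_2^2 - \tfrac{1}{2}\|b\|_2^2
= \tfrac{1}{2}\|v - x^\star\|_2^2 - \tfrac{1}{2}\|v^+ - x^\star\|_2^2 .
\]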
after multiplying by $t/\theta^2$ and rearranging, this is identical to the final inequality (2) in the analysis of FISTA:
\[
\frac{t_i}{\theta_i^2}\bigl(f(x^{(i)}) - f^\star\bigr) + \frac{1}{2}\|v^{(i)} - x^\star\|_2^2
\;\le\; \frac{(1-\theta_i)t_i}{\theta_i^2}\bigl(f(x^{(i-1)}) - f^\star\bigr) + \frac{1}{2}\|v^{(i-1)} - x^\star\|_2^2
\]
so the same $O(1/k^2)$ convergence result follows
references
• A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. on Imaging Sciences (2009)
• A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal recovery, in: Y. Eldar and D. Palomar (Eds.), Convex Optimization in Signal Processing and Communications (2009)