
L. Vandenberghe EE236C (Spring 2013-14)

Fast proximal gradient methods

• fast proximal gradient method (FISTA)

• FISTA with line search

• FISTA as descent method

• Nesterov’s second method

1
Fast (proximal) gradient methods

• Nesterov (1983, 1988, 2005): three gradient projection methods with 1/k² convergence rate

• Beck & Teboulle (2008): FISTA, a proximal gradient version of


Nesterov’s 1983 method

• Nesterov (2004 book), Tseng (2008): overview and unified analysis of fast gradient methods

• several recent variations and extensions

this lecture: FISTA and Nesterov's 2nd method (1988) as presented by Tseng

Fast proximal gradient methods 2


Outline

• fast proximal gradient method (FISTA)

• FISTA with line search

• FISTA as descent method

• Nesterov’s second method


FISTA (basic version)

minimize f (x) = g(x) + h(x)

• g convex, differentiable, with dom g = Rⁿ

• h closed, convex, with inexpensive prox_{th} operator

algorithm: choose any x(0) = x(−1); for k ≥ 1, repeat the steps

    y = x(k−1) + ((k − 2)/(k + 1)) (x(k−1) − x(k−2))
    x(k) = prox_{tk h}(y − tk ∇g(y))

• step size tk fixed or determined by line search


• acronym stands for ‘Fast Iterative Shrinkage-Thresholding Algorithm’
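
The iteration above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the slides: it assumes the caller supplies grad_g (the gradient of g) and prox_h(v, t) (the prox-operator of th evaluated at v), with a fixed step size t ≤ 1/L and NumPy-array iterates.

```python
def fista(grad_g, prox_h, x0, t, num_iters=100):
    """Basic FISTA with fixed step size t (assumed to satisfy t <= 1/L)."""
    x_prev = x0.copy()   # plays the role of x^(k-2); x^(0) = x^(-1)
    x = x0.copy()        # plays the role of x^(k-1)
    for k in range(1, num_iters + 1):
        # extrapolated point y = x^(k-1) + ((k-2)/(k+1)) (x^(k-1) - x^(k-2))
        y = x + ((k - 2) / (k + 1)) * (x - x_prev)
        # proximal gradient step at y
        x_prev, x = x, prox_h(y - t * grad_g(y), t)
    return x
```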

Fast proximal gradient methods 3


Interpretation

• first iteration (k = 1) is a proximal gradient step at y = x(0)

• next iterations are proximal gradient steps at extrapolated points y

x(k) = prox_{tk h}(y − tk ∇g(y))

[figure: the extrapolated point y lies on the line through x(k−2) and x(k−1), beyond x(k−1)]

note: x(k) is feasible (in dom h); y may be outside dom h

Fast proximal gradient methods 4


Example
    minimize   log ∑_{i=1}^m exp(aᵢᵀx + bᵢ)

randomly generated data with m = 2000, n = 1000, same fixed step size
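
For concreteness, g and ∇g for this example could be evaluated as below; the helper name make_logsumexp and the shifted (numerically stable) log-sum-exp are illustrative choices, not part of the slides. With h = 0 (so prox_h(v, t) = v), this plugs directly into the fista sketch above.

```python
import numpy as np

def make_logsumexp(A, b):
    """Return g(x) = log sum_i exp(a_i^T x + b_i) and its gradient."""
    def g(x):
        z = A @ x + b
        zmax = z.max()                       # shift for numerical stability
        return zmax + np.log(np.exp(z - zmax).sum())
    def grad_g(x):
        z = A @ x + b
        w = np.exp(z - z.max())
        w /= w.sum()                         # softmax weights
        return A.T @ w                       # gradient = A^T softmax(Ax + b)
    return g, grad_g
```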

[figure: relative error (f(x(k)) − f⋆)/|f⋆| versus k, log scale from 10⁰ down to 10⁻⁶, k = 0 to 200, for the gradient method and FISTA]

Fast proximal gradient methods 5


another instance

[figure: (f(x(k)) − f⋆)/|f⋆| versus k for a second random instance, log scale from 10⁰ down to 10⁻⁶, k = 0 to 200, gradient method and FISTA]

FISTA is not a descent method

Fast proximal gradient methods 6


Convergence of FISTA

assumptions

• g convex with dom g = Rⁿ; ∇g Lipschitz continuous with constant L:

      ‖∇g(x) − ∇g(y)‖₂ ≤ L ‖x − y‖₂   ∀x, y

• h is closed and convex (so that prox_{th}(u) is well defined)


• optimal value f ⋆ is finite and attained at x⋆ (not necessarily unique)

convergence result: f(x(k)) − f⋆ decreases at least as fast as 1/k²

• with fixed step size tk = 1/L


• with suitable line search

Fast proximal gradient methods 7


Reformulation of FISTA

define θk = 2/(k + 1) and introduce an intermediate variable v(k)

algorithm: choose x(0) = v(0); for k ≥ 1, repeat the steps

    y = (1 − θk) x(k−1) + θk v(k−1)
    x(k) = prox_{tk h}(y − tk ∇g(y))
    v(k) = x(k−1) + (1/θk) (x(k) − x(k−1))

substituting the expression for v(k−1) (from the previous iteration) into the formula for y recovers the FISTA iteration of page 3; a short check follows below
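
A quick check, added here rather than taken from the slide: substituting v(k−1) = x(k−2) + (1/θ_{k−1})(x(k−1) − x(k−2)) into y = (1 − θk)x(k−1) + θk v(k−1) gives

\[
y = x^{(k-1)} + \theta_k\Bigl(\tfrac{1}{\theta_{k-1}} - 1\Bigr)\bigl(x^{(k-1)} - x^{(k-2)}\bigr)
  = x^{(k-1)} + \frac{k-2}{k+1}\bigl(x^{(k-1)} - x^{(k-2)}\bigr),
\]

since θk = 2/(k + 1) and θ_{k−1} = 2/k, which is exactly the extrapolation step of page 3.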

Fast proximal gradient methods 8


Important inequalities

choice of θk : the sequence θk = 2/(k + 1) satisfies θ1 = 1 and

    (1 − θk)/θk² ≤ 1/θ_{k−1}²,   k ≥ 2

upper bound on g from Lipschitz property (page 1-12)

    g(u) ≤ g(z) + ∇g(z)ᵀ(u − z) + (L/2) ‖u − z‖₂²   ∀u, z

upper bound on h from definition of prox-operator (page 6-7)

    h(u) ≤ h(z) + (1/t)(w − u)ᵀ(u − z)   ∀w, z, with u = prox_{th}(w)
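
For θk = 2/(k + 1), the first inequality can be verified directly (a check added here, not on the slide):

\[
\frac{1-\theta_k}{\theta_k^2} \;=\; \frac{(k-1)/(k+1)}{4/(k+1)^2} \;=\; \frac{k^2-1}{4} \;\le\; \frac{k^2}{4} \;=\; \frac{1}{\theta_{k-1}^2}.
\]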

Fast proximal gradient methods 9


Progress in one iteration

define x = x(i−1), x⁺ = x(i), v = v(i−1), v⁺ = v(i), t = ti, θ = θi

• upper bound from Lipschitz property: if 0 < t ≤ 1/L,

    g(x⁺) ≤ g(y) + ∇g(y)ᵀ(x⁺ − y) + (1/(2t)) ‖x⁺ − y‖₂²    (1)

• upper bound from definition of prox-operator:

    h(x⁺) ≤ h(z) + ∇g(y)ᵀ(z − x⁺) + (1/t)(x⁺ − y)ᵀ(z − x⁺)   ∀z

• add the upper bounds and use convexity of g

    f(x⁺) ≤ f(z) + (1/t)(x⁺ − y)ᵀ(z − x⁺) + (1/(2t)) ‖x⁺ − y‖₂²   ∀z
Fast proximal gradient methods 10
• make convex combination of upper bounds for z = x and z = x⋆

    f(x⁺) − f⋆ − (1 − θ)(f(x) − f⋆)
        = f(x⁺) − θf⋆ − (1 − θ)f(x)
        ≤ (1/t)(x⁺ − y)ᵀ(θx⋆ + (1 − θ)x − x⁺) + (1/(2t)) ‖x⁺ − y‖₂²
        = (1/(2t)) ( ‖y − (1 − θ)x − θx⋆‖₂² − ‖x⁺ − (1 − θ)x − θx⋆‖₂² )
        = (θ²/(2t)) ( ‖v − x⋆‖₂² − ‖v⁺ − x⋆‖₂² )

conclusion: if the inequality (1) holds at iteration i, then

    (ti/θi²) ( f(x(i)) − f⋆ ) + (1/2) ‖v(i) − x⋆‖₂²
        ≤ ((1 − θi) ti/θi²) ( f(x(i−1)) − f⋆ ) + (1/2) ‖v(i−1) − x⋆‖₂²     (2)

Fast proximal gradient methods 11


Analysis for fixed step size

take ti = t = 1/L and apply (2) recursively, using (1 − θi)/θi² ≤ 1/θ_{i−1}²:

    (t/θk²) ( f(x(k)) − f⋆ ) + (1/2) ‖v(k) − x⋆‖₂²
        ≤ ((1 − θ1) t/θ1²) ( f(x(0)) − f⋆ ) + (1/2) ‖v(0) − x⋆‖₂²
        = (1/2) ‖x(0) − x⋆‖₂²

therefore,

    f(x(k)) − f⋆ ≤ (θk²/(2t)) ‖x(0) − x⋆‖₂² = (2L/(k + 1)²) ‖x(0) − x⋆‖₂²

conclusion: reaches f(x(k)) − f⋆ ≤ ε after O(1/√ε) iterations

Fast proximal gradient methods 12


Example: quadratic program with box constraints
    minimize    (1/2) xᵀAx + bᵀx
    subject to  0 ⪯ x ⪯ 1

[figure: (f(x(k)) − f⋆)/|f⋆| versus k, log scale from 10¹ down to 10⁻⁷, k = 0 to 50, for the gradient method and FISTA]

n = 3000; fixed step size t = 1/λmax(A)
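
Here h is the indicator of the box [0, 1]ⁿ, so prox_{th} is simply projection onto the box, independent of t. A sketch of how this example could be set up with the fista routine from earlier (A is assumed symmetric positive semidefinite; the setup and names are illustrative):

```python
import numpy as np

def box_qp_fista(A, b, num_iters=50):
    """minimize (1/2) x^T A x + b^T x  subject to 0 <= x <= 1 (elementwise)."""
    grad_g = lambda x: A @ x + b                  # gradient of the quadratic part
    prox_h = lambda v, t: np.clip(v, 0.0, 1.0)    # projection onto the box [0,1]^n
    t = 1.0 / np.linalg.eigvalsh(A).max()         # fixed step size 1/lambda_max(A)
    return fista(grad_g, prox_h, np.zeros(len(b)), t, num_iters)
```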

Fast proximal gradient methods 13


1-norm regularized least-squares
    minimize   (1/2) ‖Ax − b‖₂² + ‖x‖₁

[figure: (f(x(k)) − f⋆)/f⋆ versus k, log scale from 10⁰ down to 10⁻⁷, k = 0 to 100, for the gradient method and FISTA]

randomly generated A ∈ R^{2000×1000}; step tk = 1/L with L = λmax(AᵀA)
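
For this problem prox_{th} is elementwise soft-thresholding, so the same template applies; the sketch below (with illustrative names) reuses the fista routine from earlier.

```python
import numpy as np

def soft_threshold(v, t):
    """prox of t*||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_fista(A, b, num_iters=100):
    """minimize (1/2)||Ax - b||_2^2 + ||x||_1 with FISTA, fixed step 1/L."""
    grad_g = lambda x: A.T @ (A @ x - b)
    L = np.linalg.norm(A, 2) ** 2        # lambda_max(A^T A) = sigma_max(A)^2
    return fista(grad_g, soft_threshold, np.zeros(A.shape[1]), 1.0 / L, num_iters)
```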

Fast proximal gradient methods 14


Outline

• fast proximal gradient method (FISTA)

• FISTA with line search

• FISTA as descent method

• Nesterov’s second method


Key steps in the analysis of FISTA
• the starting point (page 10) is the inequality

    g(x⁺) ≤ g(y) + ∇g(y)ᵀ(x⁺ − y) + (1/(2t)) ‖x⁺ − y‖₂²    (1)
this inequality is known to hold for 0 < t ≤ 1/L

• if (1) holds, then the progress made in iteration i is bounded by

    (ti/θi²) ( f(x(i)) − f⋆ ) + (1/2) ‖v(i) − x⋆‖₂²
        ≤ ((1 − θi) ti/θi²) ( f(x(i−1)) − f⋆ ) + (1/2) ‖v(i−1) − x⋆‖₂²     (2)

• to combine these inequalities recursively, we need

    (1 − θi) ti/θi² ≤ t_{i−1}/θ_{i−1}²    (i ≥ 2)    (3)

Fast proximal gradient methods 15


• if θ1 = 1, combining the inequalities (2) from i = 1 to k gives the bound

    f(x(k)) − f⋆ ≤ (θk²/(2tk)) ‖x(0) − x⋆‖₂²

conclusion: 1/k² convergence rate if (1) and (3) hold with

    θk²/tk = O(1/k²)

FISTA with fixed step size

    tk = 1/L,    θk = 2/(k + 1)

these values satisfy (1) and (3) with

    θk²/tk = 4L/(k + 1)²

Fast proximal gradient methods 16


FISTA with line search (method 1)
replace update of x in iteration k (page 8) with

    t := t_{k−1}   (define t0 = t̂ > 0)
    x := prox_{th}(y − t∇g(y))
    while g(x) > g(y) + ∇g(y)ᵀ(x − y) + (1/(2t)) ‖x − y‖₂²
        t := βt
        x := prox_{th}(y − t∇g(y))
    end

• inequality (1) holds trivially, by the backtracking exit condition


• inequality (3) holds with θk = 2/(k + 1) because tk ≤ tk−1
• Lipschitz continuity of ∇g guarantees tk ≥ tmin = min{t̂, β/L}
• preserves the 1/k² convergence rate because θk²/tk = O(1/k²):

    θk²/tk ≤ 4/((k + 1)² tmin)
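
A sketch of method 1 in Python, using the reformulated iteration of page 8; the function and parameter names (t_hat, beta) are illustrative, not from the slides.

```python
def fista_linesearch1(g, grad_g, prox_h, x0, t_hat=1.0, beta=0.5, num_iters=100):
    """FISTA with line search method 1: step sizes are nonincreasing."""
    x, v, t = x0.copy(), x0.copy(), t_hat
    for k in range(1, num_iters + 1):
        theta = 2.0 / (k + 1)
        y = (1 - theta) * x + theta * v
        gy, grad_y = g(y), grad_g(y)           # evaluated once per outer iteration
        while True:
            x_new = prox_h(y - t * grad_y, t)
            d = x_new - y
            # backtracking exit condition = inequality (1)
            if g(x_new) <= gy + grad_y @ d + (d @ d) / (2 * t):
                break
            t *= beta                          # shrink the step and retry
        v = x + (x_new - x) / theta
        x = x_new
    return x
```

As noted above, each pass through the backtracking loop costs one evaluation of g and one prox evaluation; g(y) and ∇g(y) are reused across the loop.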
Fast proximal gradient methods 17
FISTA with line search (method 2)

replace update of y and x in iteration k (page 8) with

    t := t̂ > 0
    θ := positive root of t_{k−1} θ² = t θ_{k−1}² (1 − θ)
    y := (1 − θ) x(k−1) + θ v(k−1)
    x := prox_{th}(y − t∇g(y))
    while g(x) > g(y) + ∇g(y)ᵀ(x − y) + (1/(2t)) ‖x − y‖₂²
        t := βt
        θ := positive root of t_{k−1} θ² = t θ_{k−1}² (1 − θ)
        y := (1 − θ) x(k−1) + θ v(k−1)
        x := prox_{th}(y − t∇g(y))
    end

assume t0 = 0 in the first iteration (k = 1), i.e., take θ1 = 1, y = x(0)
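
A sketch of method 2, which restarts the step size at t̂ each iteration and recomputes θ, y, and x inside the backtracking loop; the quadratic for θ is solved explicitly, and the names are again illustrative rather than taken from the slides.

```python
import numpy as np

def fista_linesearch2(g, grad_g, prox_h, x0, t_hat=1.0, beta=0.5, num_iters=100):
    """FISTA with line search method 2: step sizes need not be monotonic."""
    x, v = x0.copy(), x0.copy()
    t_prev, theta_prev = 0.0, 1.0          # convention t0 = 0 forces theta_1 = 1
    for k in range(1, num_iters + 1):
        t = t_hat
        while True:
            if t_prev == 0.0:
                theta = 1.0
            else:
                # positive root of t_prev*theta^2 = t*theta_prev^2*(1 - theta)
                b = t * theta_prev ** 2
                theta = (-b + np.sqrt(b ** 2 + 4 * t_prev * b)) / (2 * t_prev)
            y = (1 - theta) * x + theta * v
            grad_y = grad_g(y)
            x_new = prox_h(y - t * grad_y, t)
            d = x_new - y
            if g(x_new) <= g(y) + grad_y @ d + (d @ d) / (2 * t):
                break
            t *= beta
        v = x + (x_new - x) / theta
        x = x_new
        t_prev, theta_prev = t, theta
    return x
```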

Fast proximal gradient methods 18


discussion

• inequality (1) holds trivially, by the backtracking exit condition


• inequality (3) holds trivially, by construction of θk
• Lipschitz continuity of ∇g guarantees tk ≥ tmin = min{t̂, β/L}
• θi is defined as the positive root of θi²/ti = (1 − θi) θ_{i−1}²/t_{i−1}; hence

      √t_{i−1}/θ_{i−1} = √((1 − θi) ti)/θi ≤ √ti/θi − √ti/2

  combining these inequalities from i = 2 to k gives   √t1 ≤ √tk/θk − (1/2) ∑_{i=2}^k √ti

• rearranging shows that θk²/tk = O(1/k²):

      θk²/tk ≤ 1/( √t1 + (1/2) ∑_{i=2}^k √ti )² ≤ 4/((k + 1)² tmin)

Fast proximal gradient methods 19


Comparison of line search methods

method 1

• uses nonincreasing step sizes (enforces tk ≤ tk−1)

• one evaluation of g(x), one prox_{th} evaluation per line search iteration

method 2

• allows non-monotonic step sizes

• one evaluation of g(x), one evaluation of g(y) and ∇g(y), one evaluation of prox_{th} per line search iteration

the two strategies can be combined and extended in various ways

Fast proximal gradient methods 20


Outline

• fast proximal gradient method (FISTA)

• FISTA with line search

• FISTA as descent method

• Nesterov’s second method


Descent version of FISTA

choose x(0) = v (0); for k ≥ 1, repeat the steps

    y = (1 − θk) x(k−1) + θk v(k−1)

    u = prox_{tk h}(y − tk ∇g(y))

    x(k) = u        if f(u) ≤ f(x(k−1))
           x(k−1)   otherwise

    v(k) = x(k−1) + (1/θk) (u − x(k−1))

• step 3 implies f (x(k)) ≤ f (x(k−1))


• use θk = 2/(k + 1) and tk = 1/L, or one of the line search methods
• same iteration complexity as original FISTA
• changes on page 10: replace x+ with u and use f (x+) ≤ f (u)
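
A sketch of the descent variant with fixed step size, under the same helper conventions as the earlier sketches; the full objective f = g + h must be evaluable to test the descent condition.

```python
def fista_descent(f, grad_g, prox_h, x0, t, num_iters=100):
    """Descent version of FISTA: never accepts a point that increases f."""
    x, v = x0.copy(), x0.copy()
    for k in range(1, num_iters + 1):
        theta = 2.0 / (k + 1)
        y = (1 - theta) * x + theta * v
        u = prox_h(y - t * grad_g(y), t)
        v = x + (u - x) / theta            # v update always uses u
        if f(u) <= f(x):                   # accept u only if it does not increase f
            x = u
    return x
```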

Fast proximal gradient methods 21


Example

(from page 6)

[figure: (f(x(k)) − f⋆)/|f⋆| versus k, log scale from 10⁰ down to 10⁻⁶, k = 0 to 200, for the gradient method, FISTA, and the descent variant FISTA-d]

Fast proximal gradient methods 22


Outline

• fast proximal gradient method (FISTA)

• line search strategies

• enforcing descent

• Nesterov’s second method


Nesterov’s second method

algorithm: choose x(0) = v(0); for k ≥ 1, repeat the steps

    y = (1 − θk) x(k−1) + θk v(k−1)
    v(k) = prox_{(tk/θk) h}( v(k−1) − (tk/θk) ∇g(y) )
    x(k) = (1 − θk) x(k−1) + θk v(k)

• use θk = 2/(k + 1) and tk = 1/L, or one of the line search methods

• identical to FISTA if h(x) = 0

• unlike in FISTA, y is feasible (in dom h) if we take x(0) ∈ dom h
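
A sketch of this iteration with a fixed step size, under the same helper conventions as the earlier sketches (illustrative, not code from the slides):

```python
def nesterov2(grad_g, prox_h, x0, t, num_iters=100):
    """Nesterov's second method in proximal form, fixed step t (assumed <= 1/L)."""
    x, v = x0.copy(), x0.copy()
    for k in range(1, num_iters + 1):
        theta = 2.0 / (k + 1)
        y = (1 - theta) * x + theta * v
        # prox step on v with the larger step size t/theta
        v = prox_h(v - (t / theta) * grad_g(y), t / theta)
        x = (1 - theta) * x + theta * v
    return x
```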

Fast proximal gradient methods 23


Convergence of Nesterov’s second method

assumptions

• g convex; ∇g is Lipschitz continuous on dom h ⊆ dom g

      ‖∇g(x) − ∇g(y)‖₂ ≤ L ‖x − y‖₂   ∀x, y ∈ dom h

• h is closed and convex (so that prox_{th}(u) is well defined)


• optimal value f ⋆ is finite and attained at x⋆ (not necessarily unique)

convergence result: f(x(k)) − f⋆ decreases at least as fast as 1/k²

• with fixed step size tk = 1/L


• with suitable line search

Fast proximal gradient methods 24


Analysis of one iteration
define x = x(i−1), x⁺ = x(i), v = v(i−1), v⁺ = v(i), t = ti, θ = θi

• from Lipschitz property if 0 < t ≤ 1/L

      g(x⁺) ≤ g(y) + ∇g(y)ᵀ(x⁺ − y) + (1/(2t)) ‖x⁺ − y‖₂²

• plug in x⁺ = (1 − θ)x + θv⁺ and x⁺ − y = θ(v⁺ − v)


      g(x⁺) ≤ g(y) + ∇g(y)ᵀ((1 − θ)x + θv⁺ − y) + (θ²/(2t)) ‖v⁺ − v‖₂²

• from convexity of g, h

      g(x⁺) ≤ (1 − θ) g(x) + θ ( g(y) + ∇g(y)ᵀ(v⁺ − y) ) + (θ²/(2t)) ‖v⁺ − v‖₂²
      h(x⁺) ≤ (1 − θ) h(x) + θ h(v⁺)

Fast proximal gradient methods 25


• upper bound on h from p. 9 (with u = v⁺, w = v − (t/θ)∇g(y))

      h(v⁺) ≤ h(z) + ∇g(y)ᵀ(z − v⁺) − (θ/t)(v⁺ − v)ᵀ(v⁺ − z)   ∀z

• combine the upper bounds on g(x⁺), h(x⁺), h(v⁺) with z = x⋆

      f(x⁺) ≤ (1 − θ) f(x) + θf⋆ − (θ²/t)(v⁺ − v)ᵀ(v⁺ − x⋆) + (θ²/(2t)) ‖v⁺ − v‖₂²
            = (1 − θ) f(x) + θf⋆ + (θ²/(2t)) ( ‖v − x⋆‖₂² − ‖v⁺ − x⋆‖₂² )

this is identical to the final inequality (2) in the analysis of FISTA on p.11

    (ti/θi²) ( f(x(i)) − f⋆ ) + (1/2) ‖v(i) − x⋆‖₂²
        ≤ ((1 − θi) ti/θi²) ( f(x(i−1)) − f⋆ ) + (1/2) ‖v(i−1) − x⋆‖₂²

Fast proximal gradient methods 26


References

surveys of fast gradient methods


• Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (2004)
• P. Tseng, On accelerated proximal gradient methods for convex-concave optimization
(2008)

FISTA
• A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear
inverse problems, SIAM J. on Imaging Sciences (2009)
• A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal
recovery, in: Y. Eldar and D. Palomar (Eds.), Convex Optimization in Signal
Processing and Communications (2009)

line search strategies


• FISTA papers by Beck and Teboulle
• D. Goldfarb and K. Scheinberg, Fast first-order methods for composite convex
optimization with line search (2011)
• Yu. Nesterov, Gradient methods for minimizing composite objective function (2007)
• O. Güler, New proximal point algorithms for convex minimization, SIOPT (1992)

Fast proximal gradient methods 27


Nesterov’s third method (not covered in this lecture)

• Yu. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming (2005)
• S. Becker, J. Bobin, E.J. Candès, NESTA: a fast and accurate first-order method for
sparse recovery, SIAM J. Imaging Sciences (2011)

Fast proximal gradient methods 28
