Frank-Wolfe Method
Ryan Tibshirani
Convex Optimization 10-725
Last time: ADMM
For the problem

min_{x,z} f(x) + g(z) subject to Ax + Bz = c,

ADMM (in scaled form, with augmented Lagrangian parameter ρ > 0 and scaled dual variable w) repeats, for k = 1, 2, 3, . . .:

x^{(k)} = argmin_x f(x) + (ρ/2) ||A x + B z^{(k−1)} − c + w^{(k−1)}||²_2
z^{(k)} = argmin_z g(z) + (ρ/2) ||A x^{(k)} + B z − c + w^{(k−1)}||²_2
w^{(k)} = w^{(k−1)} + A x^{(k)} + B z^{(k)} − c
Frank-Wolfe method
The Frank-Wolfe method, also called the conditional gradient method, uses a local linear expansion of f: for k = 1, 2, 3, . . .,

s^{(k−1)} ∈ argmin_{s∈C} ∇f(x^{(k−1)})^T s
x^{(k)} = (1 − γ_k) x^{(k−1)} + γ_k s^{(k−1)}

Note there is no projection: for any 0 ≤ γ_k ≤ 1, the new iterate x^{(k)} is a convex combination of points in C, so it remains feasible. The default step sizes are γ_k = 2/(k+1), k = 1, 2, 3, . . .; rewriting the update as

x^{(k)} = x^{(k−1)} + γ_k ( s^{(k−1)} − x^{(k−1)} )

we see that we are moving less and less in the direction of the linearization minimizer as the algorithm proceeds
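A minimal sketch of this loop in Python, assuming we are handed a gradient oracle grad_f and a linear minimization oracle lmo(g) that returns a minimizer of g^T s over C (both names, and the tolerance-based stopping rule, are illustrative rather than from the slides):

import numpy as np

def frank_wolfe(grad_f, lmo, x0, num_iters=100, tol=1e-8):
    """Generic Frank-Wolfe with default step sizes gamma_k = 2/(k+1).

    grad_f(x): gradient of f at x
    lmo(g):    a minimizer of g^T s over the constraint set C
    x0:        feasible starting point (must lie in C)
    """
    x = np.asarray(x0, dtype=float)
    for k in range(1, num_iters + 1):
        g = grad_f(x)
        s = lmo(g)                        # linear minimization step
        gap = float(g @ (x - s))          # duality gap: upper bounds f(x) - fstar
        if gap <= tol:
            break
        gamma = 2.0 / (k + 1.0)           # standard step size
        x = (1 - gamma) * x + gamma * s   # convex combination stays in C
    return x

The norm-ball examples below simply plug different lmo routines into this template.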
[Figure (from Jaggi 2011): one Frank-Wolfe step. The linearization of the objective f at x^{(k)} is minimized over the constraint set D, and the iterate moves toward this linearization minimizer s; the gap between f(x^{(k)}) and the minimum of the linearization is the duality gap g(x^{(k)}), an upper bound on the suboptimality of x^{(k)} relative to x⋆, the solution to (1). The method itself goes back to Frank & Wolfe (1956).]
Norm constraints
When C = {x : ||x|| ≤ t} for a norm ||·||, the linear minimization step is

s^{(k−1)} ∈ argmin_{||s||≤t} ∇f(x^{(k−1)})^T s
          = −t · ( argmax_{||s||≤1} ∇f(x^{(k−1)})^T s )

i.e., s^{(k−1)} ∈ −t · ∂||∇f(x^{(k−1)})||_*, where ||·||_* is the dual norm. So if we know how to compute subgradients of the dual norm, we can easily perform Frank-Wolfe steps; this is often much cheaper than projecting onto C
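For instance, for the ℓ2 ball (which is self-dual), a subgradient of the dual norm at a nonzero gradient is just the normalized gradient; a small sketch (the helper name lmo_l2_ball is illustrative):

import numpy as np

def lmo_l2_ball(grad, t):
    """Frank-Wolfe linear minimization step over {x : ||x||_2 <= t}.

    Returns -t times a subgradient of the dual norm (l2 again)
    evaluated at the gradient, i.e. -t * grad / ||grad||_2.
    """
    nrm = np.linalg.norm(grad)
    if nrm == 0:
        return np.zeros_like(grad)  # any feasible point minimizes a zero linear function
    return -t * grad / nrm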
Outline
Today:
• Examples
• Convergence analysis
• Properties and variants
• Path following
Example: ℓ1 regularization
For the ℓ1-regularized problem

min_x f(x) subject to ||x||_1 ≤ t,

we have s^{(k−1)} ∈ −t · ∂||∇f(x^{(k−1)})||_∞. The Frank-Wolfe update is thus: choose i_{k−1} ∈ argmax_{i=1,...,n} |∇_i f(x^{(k−1)})|, and set

x^{(k)} = (1 − γ_k) x^{(k−1)} − γ_k t · sign( ∇_{i_{k−1}} f(x^{(k−1)}) ) · e_{i_{k−1}}

where e_i denotes the ith standard basis vector.
Note: this is a lot simpler than projection onto the ℓ1 ball, though
both require O(n) operations
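A sketch of this step as an lmo routine for the generic loop above (the name lmo_l1_ball is illustrative):

import numpy as np

def lmo_l1_ball(grad, t):
    """Frank-Wolfe linear minimization step over {x : ||x||_1 <= t}.

    Scans for the largest-magnitude gradient coordinate (O(n) work) and
    puts -t * sign(grad_i) on that single coordinate.
    """
    i = int(np.argmax(np.abs(grad)))
    s = np.zeros_like(grad)
    s[i] = -t * np.sign(grad[i])
    return s

# One Frank-Wolfe step: x = (1 - gamma) * x + gamma * lmo_l1_ball(grad_f(x), t)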
Example: ℓp regularization
For the ℓp-regularized problem

min_x f(x) subject to ||x||_p ≤ t,

with 1 < p < ∞, we have s^{(k−1)} ∈ −t · ∂||∇f(x^{(k−1)})||_q, where q is the conjugate exponent (1/p + 1/q = 1). The Frank-Wolfe update thus uses

s_i^{(k−1)} = −α · sign( ∇_i f(x^{(k−1)}) ) · |∇_i f(x^{(k−1)})|^{q/p},   i = 1, . . . , n,

where α is a constant chosen so that ||s^{(k−1)}||_p = t, followed by the usual convex combination with x^{(k−1)}.
Note: this is a lot simpler than projection onto the ℓp ball, for general p!
Aside from special cases (p = 1, 2, ∞), these projections cannot be
directly computed (they must be treated as an optimization problem in their own right)
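A sketch of the corresponding lmo routine for general 1 < p < ∞ (the name lmo_lp_ball is illustrative; it recovers the ℓ2 step when p = 2):

import numpy as np

def lmo_lp_ball(grad, t, p):
    """Frank-Wolfe linear minimization step over {x : ||x||_p <= t}, 1 < p < inf.

    Uses the closed form s_i = -alpha * sign(g_i) * |g_i|^(q/p), with q the
    conjugate exponent (1/p + 1/q = 1) and alpha set so that ||s||_p = t.
    """
    q = p / (p - 1.0)                 # conjugate exponent
    mag = np.abs(grad) ** (q / p)     # unnormalized magnitudes |g_i|^(q/p)
    norm_p = np.sum(mag ** p) ** (1.0 / p)
    if norm_p == 0:
        return np.zeros_like(grad)
    return -t * np.sign(grad) * mag / norm_p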
Example: trace norm regularization
For the trace-norm-regularized problem

min_X f(X) subject to ||X||_tr ≤ t,

we have S^{(k−1)} ∈ −t · ∂||∇f(X^{(k−1)})||_op, i.e.,

S^{(k−1)} = −t · uv^T

where u, v are leading left and right singular vectors of ∇f(X^{(k−1)}).
Note: this is a lot simpler than projection onto the trace norm ball, which requires a full singular value decomposition; here we only need the top singular pair (e.g., via the power method)
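A sketch using scipy's truncated SVD to get only the top singular pair (the name lmo_trace_ball is illustrative):

import numpy as np
from scipy.sparse.linalg import svds

def lmo_trace_ball(grad, t):
    """Frank-Wolfe linear minimization step over {X : ||X||_tr <= t}.

    Computes only the leading singular pair of the gradient matrix and
    returns the rank-one matrix -t * u v^T.
    """
    u, _, vt = svds(grad, k=1)   # top singular triple of the gradient
    return -t * np.outer(u[:, 0], vt[0, :])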
Constrained and Lagrange forms
Recall the connection between the constrained form

min_x f(x) subject to ||x|| ≤ t

and the Lagrange form

min_x f(x) + λ ||x||

The two are generally equivalent, with their solutions matching up as we let the tuning parameters t and λ vary over [0, ∞]. Typically in statistics and ML problems, we would just solve whichever form is easiest, over a wide range of parameter values, and then use cross-validation
• ℓ1 norm: the Frank-Wolfe update scans for the largest-magnitude component of the gradient; the proximal operator soft-thresholds the gradient step; both use O(n) flops (sketched below)
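To make the comparison concrete: the Frank-Wolfe side is the coordinate scan in lmo_l1_ball sketched earlier, while the proximal side is componentwise soft-thresholding (the helper name prox_l1 is illustrative):

import numpy as np

def prox_l1(x, lam):
    """Proximal operator of lam * ||.||_1: componentwise soft-thresholding, O(n)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Proximal gradient step for the Lagrange form:  x = prox_l1(x - step * grad, step * lam)
# Frank-Wolfe step for the constrained form:     x = (1 - gamma) * x + gamma * lmo_l1_ball(grad, t)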
Example: lasso comparison
Comparing projected and conditional gradient for constrained lasso
problem, with n = 100, p = 500:
[Figure: criterion gap f − fstar (log scale, roughly 1e−01 to 1e+03) versus iteration number, for projected gradient and conditional gradient.]
Note: FW uses standard step sizes, line search would probably help
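A sketch of how such a comparison could be set up (random data, sort-based ℓ1-ball projection; this is illustrative and not the exact experiment behind the figure):

import numpy as np

def project_l1_ball(v, t):
    """Euclidean projection onto {x : ||x||_1 <= t} (sort-based, O(p log p))."""
    if np.sum(np.abs(v)) <= t:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > (css - t))[0][-1]
    theta = (css[rho] - t) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lmo_l1_ball(grad, t):
    """Frank-Wolfe linear minimization over the l1 ball (coordinate scan)."""
    i = int(np.argmax(np.abs(grad)))
    s = np.zeros(len(grad))
    s[i] = -t * np.sign(grad[i])
    return s

# Constrained lasso: minimize 0.5 * ||A x - b||_2^2 subject to ||x||_1 <= t
rng = np.random.default_rng(0)
n, p, t = 100, 500, 1.0
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
grad = lambda x: A.T @ (A @ x - b)
step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1/L step for projected gradient

x_pg = np.zeros(p)   # projected gradient iterates
x_fw = np.zeros(p)   # Frank-Wolfe iterates
for k in range(1, 1001):
    x_pg = project_l1_ball(x_pg - step * grad(x_pg), t)
    gamma = 2.0 / (k + 1.0)
    x_fw = (1 - gamma) * x_fw + gamma * lmo_l1_ball(grad(x_fw), t)

print(0.5 * np.linalg.norm(A @ x_pg - b) ** 2,
      0.5 * np.linalg.norm(A @ x_fw - b) ** 2)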
Duality gap
Frank-Wolfe iterations admit a very natural duality gap:

g(x^{(k−1)}) = max_{s∈C} ∇f(x^{(k−1)})^T ( x^{(k−1)} − s ) = ∇f(x^{(k−1)})^T ( x^{(k−1)} − s^{(k−1)} ),

which is available essentially for free from the linear minimization step. This quantity upper bounds the suboptimality gap f(x^{(k−1)}) − f⋆
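A one-line check of this claim, using only the first-order characterization of convexity:

f⋆ = min_{s∈C} f(s) ≥ min_{s∈C} { f(x^{(k−1)}) + ∇f(x^{(k−1)})^T ( s − x^{(k−1)} ) } = f(x^{(k−1)}) − g(x^{(k−1)}),

so indeed g(x^{(k−1)}) ≥ f(x^{(k−1)}) − f⋆.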
Why do we call it a “duality gap”? Rewrite the original problem as

min_x f(x) + I_C(x),

where I_C is the indicator function of C. The dual of this problem is

max_u −f^*(u) − I_C^*(−u),

where I_C^*(v) = max_{s∈C} s^T v is the support function of C (the conjugate of the indicator). For x ∈ C and the dual point u = ∇f(x), the difference between the primal and dual objectives is

f(x) + f^*(∇f(x)) + I_C^*(−∇f(x)) = ∇f(x)^T x + max_{s∈C} ( −∇f(x)^T s ) = g(x),

where we used f^*(∇f(x)) = ∇f(x)^T x − f(x). So g(x) is exactly a duality gap, evaluated at the primal-dual pair x, ∇f(x)
Convergence analysis
Following Jaggi (2011), define the curvature constant of f over C:
M =  max_{γ∈[0,1], x,s,y∈C, y=(1−γ)x+γs}  (2/γ²) ( f(y) − f(x) − ∇f(x)^T (y − x) )

Theorem (following Jaggi 2011): the Frank-Wolfe method with default step sizes γ_k = 2/(k+1), k = 1, 2, 3, . . ., satisfies

f(x^{(k)}) − f⋆ ≤ 2M / (k + 2)

Hence the number of iterations needed to achieve f(x^{(k)}) − f⋆ ≤ ε is O(M/ε)
This matches the sublinear rate for projected gradient descent for
Lipschitz ∇f , but how do the assumptions compare?
If ∇f is Lipschitz with constant L, then f(y) − f(x) − ∇f(x)^T (y − x) ≤ (L/2) ||y − x||²_2, so

M ≤  max_{γ∈[0,1], x,s,y∈C, y=(1−γ)x+γs}  (2/γ²) · (L/2) ||y − x||²_2

Since y − x = γ(s − x) along the Frank-Wolfe segment, this gives M ≤ max_{x,s∈C} L ||x − s||²_2 = L · diam²(C). So bounded curvature is basically no stronger an assumption than a Lipschitz gradient, plus boundedness of C
Basic inequality
The key inequality used to prove the Frank-Wolfe convergence rate:
f(x^{(k)}) ≤ f(x^{(k−1)}) − γ_k g(x^{(k−1)}) + (γ_k² / 2) M

Here g(x) = max_{s∈C} ∇f(x)^T (x − s) is the duality gap defined earlier. The inequality follows directly from the definition of the curvature constant M, applied with x = x^{(k−1)}, s = s^{(k−1)}, and y = x^{(k)} = (1 − γ_k) x^{(k−1)} + γ_k s^{(k−1)}
The proof of the convergence result is now straightforward. Denote
by h(x) = f(x) − f⋆ the suboptimality gap at x. The basic inequality gives

h(x^{(k)}) ≤ h(x^{(k−1)}) − γ_k g(x^{(k−1)}) + (γ_k² / 2) M
          ≤ h(x^{(k−1)}) − γ_k h(x^{(k−1)}) + (γ_k² / 2) M
          = (1 − γ_k) h(x^{(k−1)}) + (γ_k² / 2) M

where the second line uses the fact that the duality gap upper bounds the suboptimality gap, g(x^{(k−1)}) ≥ h(x^{(k−1)})
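To finish, plug in γ_k = 2/(k+1) and induct on k: assuming h(x^{(k−1)}) ≤ 2M/(k+1), the display above gives

h(x^{(k)}) ≤ (1 − 2/(k+1)) · 2M/(k+1) + (1/2) (2/(k+1))² M
          = 2M(k−1)/(k+1)² + 2M/(k+1)²
          = 2Mk/(k+1)²
          ≤ 2M/(k+2),

using k(k+2) ≤ (k+1)² in the last step; the base case k = 1 holds since γ_1 = 1 gives h(x^{(1)}) ≤ M/2 ≤ 2M/3. This establishes the claimed 2M/(k+2) rate.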
Affine invariance
A nice property of Frank-Wolfe updates is that they are affine invariant. Given nonsingular A, let x = Ax′ and define F(x′) = f(Ax′). Then Frank-Wolfe applied to F over the transformed set A^{−1}C takes the steps

s′ = argmin_{z ∈ A^{−1}C} ∇F(x′)^T z
(x′)^+ = (1 − γ) x′ + γ s′

and multiplying by A recovers exactly the Frank-Wolfe update for f over C. Even the convergence analysis is affine invariant: the curvature constant M of F over A^{−1}C is the same as that of f over C
Inexact updates
Jaggi (2011) also analyzes inexact Frank-Wolfe updates: suppose
we choose s(k−1) so that
∇f(x^{(k−1)})^T s^{(k−1)} ≤ min_{s∈C} ∇f(x^{(k−1)})^T s + (M γ_k / 2) · δ

where δ ≥ 0 is our inaccuracy parameter. Then we basically attain the same rate: in this case,

f(x^{(k)}) − f⋆ ≤ 2M (1 + δ) / (k + 2)
Two variants
Two important variants of Frank-Wolfe:
• Line search: instead of using the standard step sizes, at each iteration k = 1, 2, 3, . . . use

  γ_k = argmin_{γ∈[0,1]} f( x^{(k−1)} + γ (s^{(k−1)} − x^{(k−1)}) )

  (a sketch of this variant follows after this list)

• Fully corrective: directly update according to

  x^{(k)} = argmin_y f(y) subject to y ∈ conv{ x^{(0)}, s^{(0)}, . . . , s^{(k−1)} }

  which makes much better progress per iteration, but at a much higher per-iteration cost
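A minimal sketch of the line-search variant, using scipy's bounded scalar minimizer for the one-dimensional problem (the names frank_wolfe_ls, f, grad_f, lmo are illustrative):

import numpy as np
from scipy.optimize import minimize_scalar

def frank_wolfe_ls(f, grad_f, lmo, x0, num_iters=100):
    """Frank-Wolfe with (approximate) exact line search over gamma in [0, 1]."""
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        s = lmo(grad_f(x))
        d = s - x  # Frank-Wolfe direction
        # 1-D minimization of f(x + gamma * d) over gamma in [0, 1]
        res = minimize_scalar(lambda gamma: f(x + gamma * d),
                              bounds=(0.0, 1.0), method='bounded')
        x = x + res.x * d
    return x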
Path following
Given the norm-constrained problem

min_x f(x) subject to ||x|| ≤ t,

the Frank-Wolfe duality gap gives a simple way to produce an approximate solution path over t: at the current value of t, run Frank-Wolfe until the duality gap is small; then keep the same point x and increase t to the largest value t+ at which the duality gap of x is still at most ε, and repeat from t+
Claim: this produces a (piecewise-constant) path x(t) with

f(x(t)) − f⋆(t) ≤ ε for all t,

i.e., the duality gap remains ≤ ε for the same x, between t and t+
References
• M. Frank and P. Wolfe (1956), “An algorithm for quadratic programming”
• M. Jaggi (2011), “Convex optimization without projection steps”