Frank-Wolfe

The Frank-Wolfe method uses a linear approximation of the objective function to iteratively find updates that remain within the constraint set. At each iteration, it finds the point that minimizes the linear approximation over the constraint set, rather than using projections. This allows for simpler updates than projection-based methods when the constraint set admits efficient linear optimization. The Frank-Wolfe method converges sublinearly, at a rate governed by the curvature of the objective function over the constraint set. It also provides a natural duality gap bound that can be used to assess convergence.

Frank-Wolfe Method

Ryan Tibshirani
Convex Optimization 10-725
Last time: ADMM
For the problem

    min_{x,z}  f(x) + g(z)   subject to   Ax + Bz = c

we form the augmented Lagrangian (scaled form):

    L_ρ(x, z, w) = f(x) + g(z) + (ρ/2) ‖Ax + Bz − c + w‖_2^2 − (ρ/2) ‖w‖_2^2
Alternating direction method of multipliers or ADMM:

    x^(k) = argmin_x  L_ρ(x, z^(k−1), w^(k−1))
    z^(k) = argmin_z  L_ρ(x^(k), z, w^(k−1))
    w^(k) = w^(k−1) + Ax^(k) + Bz^(k) − c

Converges like a first-order method. Very flexible framework


Projected gradient descent
Consider the constrained problem

    min_x  f(x)   subject to   x ∈ C

where f is convex and smooth, and C is convex. Recall projected gradient descent chooses an initial x^(0), repeats for k = 1, 2, 3, ...:

    x^(k) = P_C( x^(k−1) − t_k ∇f(x^(k−1)) )

where P_C is the projection operator onto the set C. Special case of proximal gradient, motivated by local quadratic expansion of f:

    x^(k) = P_C( argmin_y  ∇f(x^(k−1))^T (y − x^(k−1)) + (1/(2t)) ‖y − x^(k−1)‖_2^2 )

Motivation for today: projections are not always easy!

Frank-Wolfe method
The Frank-Wolfe method, also called the conditional gradient method, uses a local linear expansion of f:

    s^(k−1) ∈ argmin_{s∈C}  ∇f(x^(k−1))^T s
    x^(k) = (1 − γ_k) x^(k−1) + γ_k s^(k−1)

Note that there is no projection; update is solved directly over C

Default step sizes: γ_k = 2/(k + 1), k = 1, 2, 3, .... Note for any 0 ≤ γ_k ≤ 1, we have x^(k) ∈ C by convexity. Can rewrite update as

    x^(k) = x^(k−1) + γ_k (s^(k−1) − x^(k−1))

i.e., we are moving less and less in the direction of the linearization minimizer as the algorithm proceeds

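A minimal Python sketch of the generic iteration, to fix ideas. The names grad_f and lmo are placeholders introduced here (not anything defined in the slides); lmo is whatever linear minimization oracle is available for C:

    import numpy as np

    def frank_wolfe(grad_f, lmo, x0, num_iters=1000):
        """Generic Frank-Wolfe sketch.

        grad_f(x) : returns the gradient of f at x
        lmo(g)    : returns some s in argmin_{s in C} g^T s
        x0        : a feasible starting point (must lie in C)
        """
        x = np.asarray(x0, dtype=float).copy()
        for k in range(1, num_iters + 1):
            g = grad_f(x)
            s = lmo(g)                         # minimize the linearization over C
            gamma = 2.0 / (k + 1.0)            # default step size
            x = (1.0 - gamma) * x + gamma * s  # convex combination stays in C
        return x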
[Figure: illustration of a Frank-Wolfe step — the linearization of the objective is minimized over the constraint set, and the iterate moves toward that minimizer. From Jaggi (2011); the algorithm is originally due to Frank & Wolfe (1956).]
Norm constraints

What happens when C = {x : ‖x‖ ≤ t} for a norm ‖·‖? Then

    s^(k−1) ∈ argmin_{‖s‖≤t}  ∇f(x^(k−1))^T s
            = −t · argmax_{‖s‖≤1}  ∇f(x^(k−1))^T s
            = −t · ∂‖∇f(x^(k−1))‖_*

where ‖·‖_* denotes the corresponding dual norm. That is, if we know how to compute subgradients of the dual norm, then we can easily perform Frank-Wolfe steps

A key to Frank-Wolfe: this can often be simpler or cheaper than projection onto C = {x : ‖x‖ ≤ t}

Outline

Today:
• Examples
• Convergence analysis
• Properties and variants
• Path following

Example: ℓ1 regularization

For the ℓ1-regularized problem

    min_x  f(x)   subject to   ‖x‖_1 ≤ t

we have s^(k−1) ∈ −t ∂‖∇f(x^(k−1))‖_∞. Frank-Wolfe update is thus

    i_{k−1} ∈ argmax_{i=1,...,p}  |∇_i f(x^(k−1))|
    x^(k) = (1 − γ_k) x^(k−1) − γ_k t · sign( ∇_{i_{k−1}} f(x^(k−1)) ) · e_{i_{k−1}}

Like greedy coordinate descent! (But with diminishing steps)

Note: this is a lot simpler than projection onto the ℓ1 ball, though both require O(n) operations
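A sketch of this selection step in Python; the function name is ours (hypothetical), and the result plugs directly into a generic Frank-Wolfe loop such as the one sketched earlier:

    import numpy as np

    def lmo_l1_ball(g, t):
        """Linear minimization over {x : ||x||_1 <= t}: put all mass on the
        coordinate of g with largest absolute value, with opposite sign."""
        i = np.argmax(np.abs(g))
        s = np.zeros_like(g, dtype=float)
        s[i] = -t * np.sign(g[i])
        return s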
Example: ℓp regularization

For the ℓp-regularized problem

    min_x  f(x)   subject to   ‖x‖_p ≤ t

for 1 ≤ p ≤ ∞, we have s^(k−1) ∈ −t ∂‖∇f(x^(k−1))‖_q, where p, q are dual, i.e., 1/p + 1/q = 1. Claim: can choose

    s_i^(k−1) = −α · sign( ∇_i f(x^(k−1)) ) · |∇_i f(x^(k−1))|^(q/p),   i = 1, ..., n

where α is a constant such that ‖s^(k−1)‖_p = t (check this!), and then Frank-Wolfe updates are as usual

Note: this is a lot simpler than projection onto the ℓp ball, for general p! Aside from special cases (p = 1, 2, ∞), these projections cannot be directly computed (must be treated as an optimization)
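A sketch of this choice in Python for 1 < p < ∞ (the endpoints p = 1 and p = ∞ have the simpler vertex forms instead); the function name is ours, and it implements the claim above:

    import numpy as np

    def lmo_lp_ball(g, t, p):
        """Linear minimization over {x : ||x||_p <= t} for 1 < p < infinity.
        q is the dual exponent (1/p + 1/q = 1); |s_i| is proportional to
        |g_i|^(q/p), rescaled so that ||s||_p = t."""
        q = p / (p - 1.0)
        s = -np.sign(g) * np.abs(g) ** (q / p)
        norm_p = np.linalg.norm(s, ord=p)
        return t * s / norm_p if norm_p > 0 else s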
Example: trace norm regularization

For the trace-regularized problem

    min_X  f(X)   subject to   ‖X‖_tr ≤ t

we have S^(k−1) ∈ −t ∂‖∇f(X^(k−1))‖_op. Claim: can choose

    S^(k−1) = −t · u v^T

where u, v are leading left and right singular vectors of ∇f(X^(k−1)) (check this!), and then Frank-Wolfe updates are as usual

Note: this is substantially simpler and cheaper than projection onto the trace norm ball, which requires a full singular value decomposition!
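A sketch in Python, using a sparse top-singular-pair routine rather than a full SVD (which is exactly the saving over projection); the function name is ours:

    import numpy as np
    from scipy.sparse.linalg import svds

    def lmo_trace_ball(G, t):
        """Linear minimization over {X : ||X||_tr <= t}: only the leading
        left and right singular vectors of the gradient G are needed."""
        G = np.asarray(G, dtype=float)
        u, _, vt = svds(G, k=1)                # top singular pair of G
        return -t * np.outer(u[:, 0], vt[0, :])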
Constrained and Lagrange forms

Recall that solutions of the constrained problem

    min_x  f(x)   subject to   ‖x‖ ≤ t

are equivalent to those of the Lagrange problem

    min_x  f(x) + λ‖x‖

as we let the tuning parameters t and λ vary over [0, ∞]. Typically in statistics and ML problems, we would just solve whichever form is easiest, over a wide range of parameter values, then use cross-validation

So we should also compare the Frank-Wolfe updates under ‖·‖ to the proximal operator of ‖·‖
• ℓ1 norm: Frank-Wolfe update scans for the maximum (in absolute value) entry of the gradient; proximal operator soft-thresholds the gradient step; both use O(n) flops

• ℓp norm: Frank-Wolfe update raises each entry of the gradient to a power and rescales, in O(n) flops; proximal operator is not generally directly computable

• Trace norm: Frank-Wolfe update computes the top left and right singular vectors of the gradient; proximal operator soft-thresholds the singular values of the gradient step, requiring a full singular value decomposition

Various other constraints yield efficient Frank-Wolfe updates, e.g., special polyhedra or cone constraints, sum-of-norms (group-based) regularization, atomic norms. See Jaggi (2011)

Example: lasso comparison
Comparing projected and conditional gradient methods for the constrained lasso problem, with n = 100, p = 500:

[Figure: f − f⋆ (log scale, 1e−01 to 1e+03) versus iteration number (0 to 1000), for projected gradient and conditional gradient on the constrained lasso.]

Note: FW uses standard step sizes; line search would probably help
Duality gap
Frank-Wolfe iterations admit a very natural duality gap:

    ∇f(x^(k))^T (x^(k) − s^(k))

Claim: this is an upper bound on f(x^(k)) − f⋆

Proof: by the first-order condition for convexity,

    f(s) ≥ f(x^(k)) + ∇f(x^(k))^T (s − x^(k))

Minimizing both sides over all s ∈ C yields

    f⋆ ≥ f(x^(k)) + min_{s∈C}  ∇f(x^(k))^T (s − x^(k))
       = f(x^(k)) + ∇f(x^(k))^T (s^(k) − x^(k))

Rearranged, this gives the duality gap above
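In code, the gap comes for free from quantities already computed at each iteration; a small sketch (with a hypothetical helper name) that can serve as a stopping criterion in a Frank-Wolfe loop:

    import numpy as np

    def fw_gap(grad_x, x, s):
        """Frank-Wolfe duality gap grad_f(x)^T (x - s), where s minimizes
        the linearization over C; upper bounds f(x) - f_star."""
        return float(np.dot(grad_x, x - s))

    # e.g., inside the loop:  if fw_gap(g, x, s) <= 1e-6: break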
Why do we call it a "duality gap"? Rewrite the original problem as

    min_x  f(x) + I_C(x)

where I_C is the indicator function of C. The dual problem is

    max_u  −f^*(u) − I_C^*(−u)

where I_C^* is the support function of C. Duality gap at x, u is

    f(x) + f^*(u) + I_C^*(−u) ≥ x^T u + I_C^*(−u)

Evaluated at x = x^(k), u = ∇f(x^(k)) (where Fenchel's inequality holds with equality), this gives

    ∇f(x^(k))^T x^(k) + max_{s∈C}  −∇f(x^(k))^T s = ∇f(x^(k))^T (x^(k) − s^(k))

which is our gap
Convergence analysis
Following Jaggi (2011), define the curvature constant of f over C:
    M = max_{x,s,y∈C; γ∈[0,1]; y=(1−γ)x+γs}  (2/γ^2) ( f(y) − f(x) − ∇f(x)^T (y − x) )

Note that M = 0 for linear f, and f(y) − f(x) − ∇f(x)^T (y − x) is called the Bregman divergence, defined by f

Theorem: The Frank-Wolfe method using standard step sizes γ_k = 2/(k + 1), k = 1, 2, 3, ... satisfies

    f(x^(k)) − f⋆ ≤ 2M / (k + 2)

Thus the number of iterations needed for f(x^(k)) − f⋆ ≤ ε is O(1/ε)
This matches the sublinear rate for projected gradient descent for
Lipschitz ∇f , but how do the assumptions compare?

For Lipschitz ∇f with constant L, recall

    f(y) − f(x) − ∇f(x)^T (y − x) ≤ (L/2) ‖y − x‖_2^2

Maximizing over all y = (1 − γ)x + γs, and multiplying by 2/γ^2,

    M ≤ max_{x,s,y∈C; γ∈[0,1]; y=(1−γ)x+γs}  (2/γ^2) · (L/2) ‖y − x‖_2^2
      = max_{x,s∈C}  L ‖x − s‖_2^2 = L · diam^2(C)

Hence assuming a bounded curvature is basically no stronger than what we assumed for projected gradient
Basic inequality
The key inequality used to prove the Frank-Wolfe convergence rate:

    f(x^(k)) ≤ f(x^(k−1)) − γ_k g(x^(k−1)) + (γ_k^2 / 2) M

Here g(x) = max_{s∈C} ∇f(x)^T (x − s) is the duality gap defined earlier

Proof: write x+ = x^(k), x = x^(k−1), s = s^(k−1), γ = γ_k. Then

    f(x+) = f( x + γ(s − x) )
          ≤ f(x) + γ ∇f(x)^T (s − x) + (γ^2 / 2) M
          = f(x) − γ g(x) + (γ^2 / 2) M

The second line used the definition of M, and the third line the definition of g
The proof of the convergence result is now straightforward. Denote by h(x) = f(x) − f⋆ the suboptimality gap at x. Basic inequality:

    h(x^(k)) ≤ h(x^(k−1)) − γ_k g(x^(k−1)) + (γ_k^2 / 2) M
             ≤ h(x^(k−1)) − γ_k h(x^(k−1)) + (γ_k^2 / 2) M
             = (1 − γ_k) h(x^(k−1)) + (γ_k^2 / 2) M

where in the second line we used g(x^(k−1)) ≥ h(x^(k−1))

To get the desired result we use induction: with γ_k = 2/(k + 1) and the inductive hypothesis h(x^(k−1)) ≤ 2M/(k + 1),

    h(x^(k)) ≤ ( 1 − 2/(k + 1) ) · 2M/(k + 1) + ( 2/(k + 1) )^2 · (M/2) ≤ 2M / (k + 2)
Affine invariance

Frank-Wolfe updates are affine invariant: for a nonsingular matrix A, define x = Ax′, F(x′) = f(Ax′), and consider Frank-Wolfe on F:

    s′ = argmin_{z ∈ A^(−1)C}  ∇F(x′)^T z
    (x′)+ = (1 − γ) x′ + γ s′

Multiplying by A produces the same Frank-Wolfe update as that from f. Convergence analysis is also affine invariant: the curvature constant

    M = max_{x′,s′,y′∈A^(−1)C; γ∈[0,1]; y′=(1−γ)x′+γs′}  (2/γ^2) ( F(y′) − F(x′) − ∇F(x′)^T (y′ − x′) )

matches that of f, because ∇F(x′)^T (y′ − x′) = ∇f(x)^T (y − x)
Inexact updates
Jaggi (2011) also analyzes inexact Frank-Wolfe updates: suppose
we choose s^(k−1) so that

    ∇f(x^(k−1))^T s^(k−1) ≤ min_{s∈C}  ∇f(x^(k−1))^T s + (M γ_k / 2) · δ

where δ ≥ 0 is our inaccuracy parameter. Then we basically attain the same rate

Theorem: Frank-Wolfe using step sizes γ_k = 2/(k + 1), k = 1, 2, 3, ..., and inaccuracy parameter δ ≥ 0, satisfies

    f(x^(k)) − f⋆ ≤ (2M / (k + 2)) (1 + δ)

Note: the optimization error at step k is (M γ_k / 2) · δ. Since γ_k → 0, we require the errors to vanish
Two variants
Two important variants of Frank-Wolfe:
• Line search: instead of using the standard step sizes, use

      γ_k = argmin_{γ∈[0,1]}  f( x^(k−1) + γ(s^(k−1) − x^(k−1)) )

  at each k = 1, 2, 3, .... Or, we could use backtracking (a sketch of the exact line search step follows below)

• Fully corrective: directly update according to

      x^(k) = argmin_y  f(y)   subject to   y ∈ conv{x^(0), s^(0), ..., s^(k−1)}

Both variants lead to the same O(1/ε) iteration complexity

Another popular variant: away steps, which get linear convergence under strong convexity
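A sketch of the exact line search step in Python, using a bounded scalar minimizer from SciPy; f is the objective and the helper name is ours:

    from scipy.optimize import minimize_scalar

    def fw_line_search_step(f, x, s):
        """Choose gamma in [0, 1] minimizing f(x + gamma (s - x)) and
        return the resulting Frank-Wolfe iterate."""
        res = minimize_scalar(lambda gamma: f(x + gamma * (s - x)),
                              bounds=(0.0, 1.0), method="bounded")
        return x + res.x * (s - x)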
Path following
Given the norm-constrained problem

    min_x  f(x)   subject to   ‖x‖ ≤ t

Frank-Wolfe can be used for path following, i.e., we can produce an approximate solution path x̂(t) that is ε-suboptimal for every t ≥ 0

Let t_0 = 0 and x⋆(0) = 0, fix m > 0, repeat for k = 1, 2, 3, ...:
• Calculate
      t_k = t_{k−1} + ε(1 − 1/m) / ‖∇f(x̂(t_{k−1}))‖_*
  and set x̂(t) = x̂(t_{k−1}) for all t ∈ (t_{k−1}, t_k)
• Compute x̂(t_k) by running Frank-Wolfe at t = t_k, terminating when the duality gap is ≤ ε/m
(This is a simplification of the strategy from Giesen et al., 2012; a code sketch follows below)
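A sketch of this scheme in Python; run_fw_until_gap (a Frank-Wolfe solver on {‖x‖ ≤ t} that stops once the duality gap is at most the given tolerance) and dual_norm are hypothetical user-supplied routines:

    import numpy as np

    def fw_path(run_fw_until_gap, grad_f, dual_norm, dim, eps, m, t_max):
        """Piecewise-constant, eps-suboptimal solution path (sketch)."""
        t, x = 0.0, np.zeros(dim)             # t_0 = 0, x(0) = 0
        knots = [(t, x)]
        while t < t_max:
            t = t + eps * (1.0 - 1.0 / m) / dual_norm(grad_f(x))
            x = run_fw_until_gap(t, x, eps / m)   # warm-start at previous x
            knots.append((t, x))
        return knots   # x_hat(t) is the x stored at the last knot before t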
Claim: this produces a (piecewise-constant) path with

    f(x̂(t)) − f(x⋆(t)) ≤ ε   for all t ≥ 0

Proof: rewrite the Frank-Wolfe duality gap as

    g_t(x) = max_{‖s‖≤t}  ∇f(x)^T (x − s) = ∇f(x)^T x + t ‖∇f(x)‖_*

This is a linear function of t. Hence if g_t(x) ≤ ε/m, then we can increase t until t+ = t + ε(1 − 1/m)/‖∇f(x)‖_*, because

    g_{t+}(x) = ∇f(x)^T x + t ‖∇f(x)‖_* + ε − ε/m ≤ ε/m + ε − ε/m = ε

i.e., the duality gap remains ≤ ε for the same x, between t and t+
References

• K. Clarkson (2010), "Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm"
• J. Giesen, M. Jaggi, and S. Laue (2012), "Approximating parametrized convex optimization problems"
• M. Jaggi (2011), "Sparse convex optimization methods for machine learning"
• M. Jaggi (2011), "Revisiting Frank-Wolfe: projection-free sparse convex optimization"
• M. Frank and P. Wolfe (1956), "An algorithm for quadratic programming"
• R. J. Tibshirani (2015), "A general framework for fast stagewise algorithms"
