Frank-Wolfe

The Frank-Wolfe method uses a linear approximation of the objective function to iteratively find updates that remain within the constraint set. At each iteration, it finds the point that minimizes the linear approximation over the constraint set, rather than using projections. This allows for simpler updates than projection-based methods when the constraint set admits efficient linear optimization. The Frank-Wolfe method converges sublinearly, at a rate governed by the curvature of the objective function over the constraint set. It also provides a natural duality gap bound that can be used to assess convergence.

Frank-Wolfe Method

Ryan Tibshirani
Convex Optimization 10-725
Last time: ADMM
For the problem

    min_{x,z}  f(x) + g(z)   subject to   Ax + Bz = c

we form the augmented Lagrangian (scaled form):

    L_ρ(x, z, w) = f(x) + g(z) + (ρ/2) ‖Ax + Bz − c + w‖_2^2 − (ρ/2) ‖w‖_2^2
Alternating direction method of multipliers or ADMM:

    x^(k) = argmin_x  L_ρ(x, z^(k−1), w^(k−1))
    z^(k) = argmin_z  L_ρ(x^(k), z, w^(k−1))
    w^(k) = w^(k−1) + Ax^(k) + Bz^(k) − c

Converges like a first-order method. Very flexible framework


Projected gradient descent
Consider the constrained problem

    min_x  f(x)   subject to   x ∈ C

where f is convex and smooth, and C is convex. Recall projected gradient descent chooses an initial x^(0), repeats for k = 1, 2, 3, ...:

    x^(k) = P_C( x^(k−1) − t_k ∇f(x^(k−1)) )

where P_C is the projection operator onto the set C. Special case of proximal gradient, motivated by local quadratic expansion of f:

    x^(k) = P_C( argmin_y  ∇f(x^(k−1))^T (y − x^(k−1)) + (1/(2t)) ‖y − x^(k−1)‖_2^2 )

Motivation for today: projections are not always easy!

Frank-Wolfe method
The Frank-Wolfe method, also called the conditional gradient method, uses a local linear expansion of f:

    s^(k−1) ∈ argmin_{s∈C}  ∇f(x^(k−1))^T s
    x^(k) = (1 − γ_k) x^(k−1) + γ_k s^(k−1)

Note that there is no projection; update is solved directly over C

Default step sizes: γ_k = 2/(k + 1), k = 1, 2, 3, .... Note for any 0 ≤ γ_k ≤ 1, we have x^(k) ∈ C by convexity. Can rewrite update as

    x^(k) = x^(k−1) + γ_k (s^(k−1) − x^(k−1))

i.e., we are moving less and less in the direction of the linearization minimizer as the algorithm proceeds

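A minimal Python sketch of the generic iteration, to fix ideas. The names grad_f and lmo are placeholders introduced here (not anything defined in the slides); lmo is whatever linear minimization oracle is available for C:

    import numpy as np

    def frank_wolfe(grad_f, lmo, x0, num_iters=1000):
        """Generic Frank-Wolfe sketch.

        grad_f(x) : returns the gradient of f at x
        lmo(g)    : returns some s in argmin_{s in C} g^T s
        x0        : a feasible starting point (must lie in C)
        """
        x = np.asarray(x0, dtype=float).copy()
        for k in range(1, num_iters + 1):
            g = grad_f(x)
            s = lmo(g)                         # minimize the linearization over C
            gamma = 2.0 / (k + 1.0)            # default step size
            x = (1.0 - gamma) * x + gamma * s  # convex combination stays in C
        return x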
[Figure: illustration of a Frank-Wolfe step — the linearization of the objective is minimized over the constraint set, and the iterate moves toward that minimizer. From Jaggi (2011); the algorithm is originally due to Frank & Wolfe (1956).]
Norm constraints

What happens when C = {x : ‖x‖ ≤ t} for a norm ‖·‖? Then

    s^(k−1) ∈ argmin_{‖s‖≤t}  ∇f(x^(k−1))^T s
            = −t · argmax_{‖s‖≤1}  ∇f(x^(k−1))^T s
            = −t · ∂‖∇f(x^(k−1))‖_*

where ‖·‖_* denotes the corresponding dual norm. That is, if we know how to compute subgradients of the dual norm, then we can easily perform Frank-Wolfe steps

A key to Frank-Wolfe: this can often be simpler or cheaper than projection onto C = {x : ‖x‖ ≤ t}

Outline

Today:
• Examples
• Convergence analysis
• Properties and variants
• Path following

Example: ℓ1 regularization

For the ℓ1-regularized problem

    min_x  f(x)   subject to   ‖x‖_1 ≤ t

we have s^(k−1) ∈ −t ∂‖∇f(x^(k−1))‖_∞. Frank-Wolfe update is thus

    i_{k−1} ∈ argmax_{i=1,...,p}  |∇_i f(x^(k−1))|
    x^(k) = (1 − γ_k) x^(k−1) − γ_k t · sign( ∇_{i_{k−1}} f(x^(k−1)) ) · e_{i_{k−1}}

Like greedy coordinate descent! (But with diminishing steps)

Note: this is a lot simpler than projection onto the ℓ1 ball, though both require O(n) operations
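A sketch of this selection step in Python; the function name is ours (hypothetical), and the result plugs directly into a generic Frank-Wolfe loop such as the one sketched earlier:

    import numpy as np

    def lmo_l1_ball(g, t):
        """Linear minimization over {x : ||x||_1 <= t}: put all mass on the
        coordinate of g with largest absolute value, with opposite sign."""
        i = np.argmax(np.abs(g))
        s = np.zeros_like(g, dtype=float)
        s[i] = -t * np.sign(g[i])
        return s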
Example: ℓp regularization

For the ℓp-regularized problem

    min_x  f(x)   subject to   ‖x‖_p ≤ t

for 1 ≤ p ≤ ∞, we have s^(k−1) ∈ −t ∂‖∇f(x^(k−1))‖_q, where p, q are dual, i.e., 1/p + 1/q = 1. Claim: can choose

    s_i^(k−1) = −α · sign( ∇_i f(x^(k−1)) ) · |∇_i f(x^(k−1))|^(q/p),   i = 1, ..., n

where α is a constant such that ‖s^(k−1)‖_p = t (check this!), and then Frank-Wolfe updates are as usual

Note: this is a lot simpler than projection onto the ℓp ball, for general p! Aside from special cases (p = 1, 2, ∞), these projections cannot be directly computed (must be treated as an optimization)
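A sketch of this choice in Python for 1 < p < ∞ (the endpoints p = 1 and p = ∞ have the simpler vertex forms instead); the function name is ours, and it implements the claim above:

    import numpy as np

    def lmo_lp_ball(g, t, p):
        """Linear minimization over {x : ||x||_p <= t} for 1 < p < infinity.
        q is the dual exponent (1/p + 1/q = 1); |s_i| is proportional to
        |g_i|^(q/p), rescaled so that ||s||_p = t."""
        q = p / (p - 1.0)
        s = -np.sign(g) * np.abs(g) ** (q / p)
        norm_p = np.linalg.norm(s, ord=p)
        return t * s / norm_p if norm_p > 0 else s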
Example: trace norm regularization

For the trace-regularized problem

    min_X  f(X)   subject to   ‖X‖_tr ≤ t

we have S^(k−1) ∈ −t ∂‖∇f(X^(k−1))‖_op. Claim: can choose

    S^(k−1) = −t · u v^T

where u, v are leading left and right singular vectors of ∇f(X^(k−1)) (check this!), and then Frank-Wolfe updates are as usual

Note: this is substantially simpler and cheaper than projection onto the trace norm ball, which requires a full singular value decomposition!
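A sketch in Python, using a sparse top-singular-pair routine rather than a full SVD (which is exactly the saving over projection); the function name is ours:

    import numpy as np
    from scipy.sparse.linalg import svds

    def lmo_trace_ball(G, t):
        """Linear minimization over {X : ||X||_tr <= t}: only the leading
        left and right singular vectors of the gradient G are needed."""
        G = np.asarray(G, dtype=float)
        u, _, vt = svds(G, k=1)                # top singular pair of G
        return -t * np.outer(u[:, 0], vt[0, :])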
Constrained and Lagrange forms

Recall that solutions of the constrained problem

    min_x  f(x)   subject to   ‖x‖ ≤ t

are equivalent to those of the Lagrange problem

    min_x  f(x) + λ‖x‖

as we let the tuning parameters t and λ vary over [0, ∞]. Typically in statistics and ML problems, we would just solve whichever form is easiest, over a wide range of parameter values, then use cross-validation

So we should also compare the Frank-Wolfe updates under ‖·‖ to the proximal operator of ‖·‖
• ℓ1 norm: Frank-Wolfe update scans for the maximum (in absolute value) entry of the gradient; proximal operator soft-thresholds the gradient step; both use O(n) flops

• ℓp norm: Frank-Wolfe update raises each entry of the gradient to a power and rescales, in O(n) flops; proximal operator is not generally directly computable

• Trace norm: Frank-Wolfe update computes the top left and right singular vectors of the gradient; proximal operator soft-thresholds the singular values of the gradient step, requiring a full singular value decomposition

Various other constraints yield efficient Frank-Wolfe updates, e.g., special polyhedra or cone constraints, sum-of-norms (group-based) regularization, atomic norms. See Jaggi (2011)

Example: lasso comparison
Comparing projected and conditional gradient methods for the constrained lasso problem, with n = 100, p = 500:

[Figure: f − f⋆ (log scale, 1e−01 to 1e+03) versus iteration number (0 to 1000), for projected gradient and conditional gradient on the constrained lasso.]

Note: FW uses standard step sizes; line search would probably help
Duality gap
Frank-Wolfe iterations admit a very natural duality gap:

    ∇f(x^(k))^T (x^(k) − s^(k))

Claim: this is an upper bound on f(x^(k)) − f⋆

Proof: by the first-order condition for convexity,

    f(s) ≥ f(x^(k)) + ∇f(x^(k))^T (s − x^(k))

Minimizing both sides over all s ∈ C yields

    f⋆ ≥ f(x^(k)) + min_{s∈C}  ∇f(x^(k))^T (s − x^(k))
       = f(x^(k)) + ∇f(x^(k))^T (s^(k) − x^(k))

Rearranged, this gives the duality gap above
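In code, the gap comes for free from quantities already computed at each iteration; a small sketch (with a hypothetical helper name) that can serve as a stopping criterion in a Frank-Wolfe loop:

    import numpy as np

    def fw_gap(grad_x, x, s):
        """Frank-Wolfe duality gap grad_f(x)^T (x - s), where s minimizes
        the linearization over C; upper bounds f(x) - f_star."""
        return float(np.dot(grad_x, x - s))

    # e.g., inside the loop:  if fw_gap(g, x, s) <= 1e-6: break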
Why do we call it a "duality gap"? Rewrite the original problem as

    min_x  f(x) + I_C(x)

where I_C is the indicator function of C. The dual problem is

    max_u  −f^*(u) − I_C^*(−u)

where I_C^* is the support function of C. Duality gap at x, u is

    f(x) + f^*(u) + I_C^*(−u) ≥ x^T u + I_C^*(−u)

Evaluated at x = x^(k), u = ∇f(x^(k)) (where Fenchel's inequality holds with equality), this gives

    ∇f(x^(k))^T x^(k) + max_{s∈C}  −∇f(x^(k))^T s = ∇f(x^(k))^T (x^(k) − s^(k))

which is our gap
Convergence analysis
Following Jaggi (2011), define the curvature constant of f over C:
    M = max_{x,s,y∈C; γ∈[0,1]; y=(1−γ)x+γs}  (2/γ^2) ( f(y) − f(x) − ∇f(x)^T (y − x) )

Note that M = 0 for linear f, and f(y) − f(x) − ∇f(x)^T (y − x) is called the Bregman divergence, defined by f

Theorem: The Frank-Wolfe method using standard step sizes γ_k = 2/(k + 1), k = 1, 2, 3, ... satisfies

    f(x^(k)) − f⋆ ≤ 2M / (k + 2)

Thus the number of iterations needed for f(x^(k)) − f⋆ ≤ ε is O(1/ε)
This matches the sublinear rate for projected gradient descent for
Lipschitz ∇f , but how do the assumptions compare?

For Lipschitz ∇f with constant L, recall

    f(y) − f(x) − ∇f(x)^T (y − x) ≤ (L/2) ‖y − x‖_2^2

Maximizing over all y = (1 − γ)x + γs, and multiplying by 2/γ^2,

    M ≤ max_{x,s,y∈C; γ∈[0,1]; y=(1−γ)x+γs}  (2/γ^2) · (L/2) ‖y − x‖_2^2
      = max_{x,s∈C}  L ‖x − s‖_2^2 = L · diam^2(C)

Hence assuming a bounded curvature is basically no stronger than what we assumed for projected gradient
Basic inequality
The key inequality used to prove the Frank-Wolfe convergence rate:

    f(x^(k)) ≤ f(x^(k−1)) − γ_k g(x^(k−1)) + (γ_k^2 / 2) M

Here g(x) = max_{s∈C} ∇f(x)^T (x − s) is the duality gap defined earlier

Proof: write x+ = x^(k), x = x^(k−1), s = s^(k−1), γ = γ_k. Then

    f(x+) = f( x + γ(s − x) )
          ≤ f(x) + γ ∇f(x)^T (s − x) + (γ^2 / 2) M
          = f(x) − γ g(x) + (γ^2 / 2) M

The second line used the definition of M, and the third line the definition of g
The proof of the convergence result is now straightforward. Denote by h(x) = f(x) − f⋆ the suboptimality gap at x. Basic inequality:

    h(x^(k)) ≤ h(x^(k−1)) − γ_k g(x^(k−1)) + (γ_k^2 / 2) M
             ≤ h(x^(k−1)) − γ_k h(x^(k−1)) + (γ_k^2 / 2) M
             = (1 − γ_k) h(x^(k−1)) + (γ_k^2 / 2) M

where in the second line we used g(x^(k−1)) ≥ h(x^(k−1))

To get the desired result we use induction: with γ_k = 2/(k + 1) and the inductive hypothesis h(x^(k−1)) ≤ 2M/(k + 1),

    h(x^(k)) ≤ ( 1 − 2/(k + 1) ) · 2M/(k + 1) + ( 2/(k + 1) )^2 · (M/2) ≤ 2M / (k + 2)
Affine invariance

Frank-Wolfe updates are affine invariant: for a nonsingular matrix A, define x = Ax′, F(x′) = f(Ax′), and consider Frank-Wolfe on F:

    s′ = argmin_{z ∈ A^(−1)C}  ∇F(x′)^T z
    (x′)+ = (1 − γ) x′ + γ s′

Multiplying by A produces the same Frank-Wolfe update as that from f. Convergence analysis is also affine invariant: the curvature constant

    M = max_{x′,s′,y′∈A^(−1)C; γ∈[0,1]; y′=(1−γ)x′+γs′}  (2/γ^2) ( F(y′) − F(x′) − ∇F(x′)^T (y′ − x′) )

matches that of f, because ∇F(x′)^T (y′ − x′) = ∇f(x)^T (y − x)
Inexact updates
Jaggi (2011) also analyzes inexact Frank-Wolfe updates: suppose
we choose s^(k−1) so that

    ∇f(x^(k−1))^T s^(k−1) ≤ min_{s∈C}  ∇f(x^(k−1))^T s + (M γ_k / 2) · δ

where δ ≥ 0 is our inaccuracy parameter. Then we basically attain the same rate

Theorem: Frank-Wolfe using step sizes γ_k = 2/(k + 1), k = 1, 2, 3, ..., and inaccuracy parameter δ ≥ 0, satisfies

    f(x^(k)) − f⋆ ≤ (2M / (k + 2)) (1 + δ)

Note: the optimization error at step k is (M γ_k / 2) · δ. Since γ_k → 0, we require the errors to vanish
Two variants
Two important variants of Frank-Wolfe:
• Line search: instead of using the standard step sizes, use

      γ_k = argmin_{γ∈[0,1]}  f( x^(k−1) + γ(s^(k−1) − x^(k−1)) )

  at each k = 1, 2, 3, .... Or, we could use backtracking (a sketch of the exact line search step follows below)

• Fully corrective: directly update according to

      x^(k) = argmin_y  f(y)   subject to   y ∈ conv{x^(0), s^(0), ..., s^(k−1)}

Both variants lead to the same O(1/ε) iteration complexity

Another popular variant: away steps, which get linear convergence under strong convexity
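A sketch of the exact line search step in Python, using a bounded scalar minimizer from SciPy; f is the objective and the helper name is ours:

    from scipy.optimize import minimize_scalar

    def fw_line_search_step(f, x, s):
        """Choose gamma in [0, 1] minimizing f(x + gamma (s - x)) and
        return the resulting Frank-Wolfe iterate."""
        res = minimize_scalar(lambda gamma: f(x + gamma * (s - x)),
                              bounds=(0.0, 1.0), method="bounded")
        return x + res.x * (s - x)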
Path following
Given the norm-constrained problem

    min_x  f(x)   subject to   ‖x‖ ≤ t

Frank-Wolfe can be used for path following, i.e., we can produce an approximate solution path x̂(t) that is ε-suboptimal for every t ≥ 0

Let t_0 = 0 and x⋆(0) = 0, fix m > 0, repeat for k = 1, 2, 3, ...:
• Calculate
      t_k = t_{k−1} + ε(1 − 1/m) / ‖∇f(x̂(t_{k−1}))‖_*
  and set x̂(t) = x̂(t_{k−1}) for all t ∈ (t_{k−1}, t_k)
• Compute x̂(t_k) by running Frank-Wolfe at t = t_k, terminating when the duality gap is ≤ ε/m
(This is a simplification of the strategy from Giesen et al., 2012; a code sketch follows below)
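A sketch of this scheme in Python; run_fw_until_gap (a Frank-Wolfe solver on {‖x‖ ≤ t} that stops once the duality gap is at most the given tolerance) and dual_norm are hypothetical user-supplied routines:

    import numpy as np

    def fw_path(run_fw_until_gap, grad_f, dual_norm, dim, eps, m, t_max):
        """Piecewise-constant, eps-suboptimal solution path (sketch)."""
        t, x = 0.0, np.zeros(dim)             # t_0 = 0, x(0) = 0
        knots = [(t, x)]
        while t < t_max:
            t = t + eps * (1.0 - 1.0 / m) / dual_norm(grad_f(x))
            x = run_fw_until_gap(t, x, eps / m)   # warm-start at previous x
            knots.append((t, x))
        return knots   # x_hat(t) is the x stored at the last knot before t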
Claim: this produces a (piecewise-constant) path with

    f(x̂(t)) − f(x⋆(t)) ≤ ε   for all t ≥ 0

Proof: rewrite the Frank-Wolfe duality gap as

    g_t(x) = max_{‖s‖≤t}  ∇f(x)^T (x − s) = ∇f(x)^T x + t ‖∇f(x)‖_*

This is a linear function of t. Hence if g_t(x) ≤ ε/m, then we can increase t until t+ = t + ε(1 − 1/m)/‖∇f(x)‖_*, because

    g_{t+}(x) = ∇f(x)^T x + t ‖∇f(x)‖_* + ε − ε/m ≤ ε/m + ε − ε/m = ε

i.e., the duality gap remains ≤ ε for the same x, between t and t+
References

• K. Clarkson (2010), "Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm"
• J. Giesen, M. Jaggi, and S. Laue (2012), "Approximating parametrized convex optimization problems"
• M. Jaggi (2011), "Sparse convex optimization methods for machine learning"
• M. Jaggi (2011), "Revisiting Frank-Wolfe: projection-free sparse convex optimization"
• M. Frank and P. Wolfe (1956), "An algorithm for quadratic programming"
• R. J. Tibshirani (2015), "A general framework for fast stagewise algorithms"
