Berkeley Tutorial: Optimization for Machine Learning, Part 2
Elad Hazan
Princeton University
0 ≺ αI ≼ ∇²f(x) ≼ βI   (α-strongly convex and β-smooth)
−βI ≼ ∇²f(x) ≼ βI   (β-smooth, possibly non-convex)
Why do we care?
f(x_{t+1}) − f(x_t) ≤ ∇_tᵀ(x_{t+1} − x_t) + β‖x_{t+1} − x_t‖²
= −(η − βη²)‖∇_t‖² = −(1/(4β)) ‖∇_t‖²   (with η = 1/(2β))

Summing over t and using |f| ≤ M:

−2M ≤ f(x_{T+1}) − f(x_1) = Σ_{t=1}^{T} [f(x_{t+1}) − f(x_t)] ≤ −(1/(4β)) Σ_{t=1}^{T} ‖∇_t‖²

Thus, there exists a t for which

‖∇_t‖² ≤ 8Mβ/T
Smooth gradient descent
Conclusions: for x_{t+1} = x_t − η∇_t and T = Ω(1/ε), GD finds

‖∇_t‖² ≤ ε
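A minimal sketch of this guarantee, assuming a β-smooth objective and the step size η = 1/(2β) used in the analysis above; the quadratic example and function names are illustrative, not from the slides:

```python
import numpy as np

# Minimal sketch of the guarantee above (illustrative, not from the slides):
# with step size eta = 1/(2*beta), some iterate satisfies ||grad||^2 <= 8*M*beta/T.

def gradient_descent(grad, x0, beta, eps, max_iters=10_000):
    """Iterate x_{t+1} = x_t - eta * grad(x_t) until ||grad(x_t)||^2 <= eps."""
    eta = 1.0 / (2.0 * beta)              # step size used in the analysis
    x = x0
    for t in range(max_iters):
        g = grad(x)
        if g @ g <= eps:                  # approximate first-order optimality
            return x, t
        x = x - eta * g
    return x, max_iters

# Example: f(x) = 0.5 * ||A x - b||^2 is beta-smooth with beta = ||A^T A||_2.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
beta = np.linalg.norm(A.T @ A, 2)
x_out, iters = gradient_descent(lambda x: A.T @ (A @ x - b),
                                x0=np.zeros(2), beta=beta, eps=1e-8)
```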
For SGD, x_{t+1} = x_t − η∇̂_t with E[∇̂_t] = ∇_t, the same argument gives

E[f(x_{t+1}) − f(x_t)] ≤ E[−∇_t ⋅ η∇̂_t + β‖η∇̂_t‖²] = −η‖∇_t‖² + η²β E[‖∇̂_t‖²]

T = O(Mβ/ε²)  ⇒  ∃ t ≤ T such that ‖∇_t‖² ≤ ε
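A correspondingly minimal SGD sketch, assuming an unbiased single-example gradient oracle; the least-squares example is illustrative:

```python
import numpy as np

# Illustrative SGD sketch (assumed example, not from the slides):
# x_{t+1} = x_t - eta * g_t, where g_t is an unbiased single-example gradient.

rng = np.random.default_rng(0)

def sgd(grad_i, n_examples, x0, eta, T):
    """grad_i(x, i) returns the gradient of the i-th example's loss at x."""
    x = x0
    for _ in range(T):
        i = rng.integers(n_examples)      # sample one example uniformly
        x = x - eta * grad_i(x, i)        # cheap, unbiased update
    return x

# Example: least squares, f(x) = (1/m) * sum_i 0.5 * (a_i . x - b_i)^2.
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)
x_out = sgd(lambda x, i: (A[i] @ x - b[i]) * A[i],
            n_examples=100, x0=np.zeros(5), eta=0.01, T=5000)
```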
Controlling the variance:
Interpolating GD and SGD
Model: both full and stochastic gradients. Estimator combines
both into lower variance RV:
x_{t+1} = x_t − η (∇̂f(x_t) − ∇̂f(x_0) + ∇f(x_0))
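A hedged sketch of this variance-reduced (SVRG-style) update, assuming a finite-sum objective with per-example gradients; the names, data, and step sizes below are illustrative:

```python
import numpy as np

# Hedged sketch of the variance-reduced update above (SVRG-style; assumed finite-sum
# setup and hypothetical names): combine a stochastic gradient at x_t with a stochastic
# and a full gradient at a fixed snapshot x0.

rng = np.random.default_rng(0)

def variance_reduced_epoch(grad_i, full_grad, x0, eta, n_examples, inner_steps):
    """One epoch: fix snapshot x0, take lower-variance stochastic steps from it."""
    g_full = full_grad(x0)                    # one full gradient per epoch (expensive)
    x = x0.copy()
    for _ in range(inner_steps):
        i = rng.integers(n_examples)
        # unbiased, lower-variance estimator: grad_i(x) - grad_i(x0) + full_grad(x0)
        g = grad_i(x, i) - grad_i(x0, i) + g_full
        x = x - eta * g
    return x

# Example: least squares f(x) = (1/m) * sum_i 0.5 * (a_i . x - b_i)^2.
A = rng.standard_normal((200, 10))
b = rng.standard_normal(200)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
full_grad = lambda x: A.T @ (A @ x - b) / len(b)
x = np.zeros(10)
for epoch in range(20):
    x = variance_reduced_epoch(grad_i, full_grad, x, eta=0.01,
                               n_examples=200, inner_steps=200)
```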
Newton's method: x_{t+1} = x_t − η [∇²f(x_t)]⁻¹ ∇f(x_t)

d³ time per iteration!
Infeasible for ML!!
Till recently…
Speed up the Newton direction computation??
• clearly E[∇̃²f] = ∇²f, but E[(∇̃²f)⁻¹] ≠ (∇²f)⁻¹
Circumvent Hessian creation and inversion!
• 3 steps:
• (1) represent Hessian inverse as infinite series
∇⁻²f = Σ_{i=0}^{∞} (I − ∇²f)^i

• (2) sample: for any distribution over the naturals, i ∼ N,

∇⁻²f = E_{i∼N, j∼[i]} [ ∏_{j=1}^{i} (I − ∇²f_j) ⋅ 1/Pr[i] ]

(each factor uses a single example; applied to ∇f, this needs only vector-vector products)
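A hedged sketch of the resulting estimator, assuming the loss is rescaled so that 0 ≺ ∇²f_i ≼ I; the recursion below is one standard way to truncate the series using only single-example Hessian-vector (here vector-vector) products, and the oracle names are hypothetical:

```python
import numpy as np

# Hedged sketch of the series-based estimator above (LiSSA-style; assumed scaling so
# that the Hessian of each f_i satisfies 0 < hessian <= I, e.g. by dividing the loss
# by a smoothness bound). hess_vec_i(x, i, v) is a hypothetical oracle for (∇²f_i(x)) v.

rng = np.random.default_rng(0)

def estimate_newton_direction(grad, hess_vec_i, x, n_examples, depth):
    """Estimate [∇²f(x)]^{-1} ∇f(x) via the recursion
       v_0 = ∇f(x),   v_k = ∇f(x) + (I - ∇²f_{j_k}(x)) v_{k-1},
       a truncation of the series  ∇^{-2} = sum_{i>=0} (I - ∇²)^i  with sampled examples."""
    g = grad(x)
    v = g.copy()
    for _ in range(depth):
        j = rng.integers(n_examples)          # single random example per factor
        v = g + v - hess_vec_i(x, j, v)       # only (Hessian-)vector products
    return v

# Example: f_i(x) = 0.5 * (a_i . x)^2 + 0.5 * lam * ||x||^2, with rank-one data term,
# so each Hessian-vector product reduces to vector-vector products.
lam = 1e-3
A = rng.standard_normal((100, 5)) / np.sqrt(20)   # scaled so each ∇²f_i ≼ I (roughly)
grad = lambda x: A.T @ (A @ x) / len(A) + lam * x
hess_vec_i = lambda x, i, v: A[i] * (A[i] @ v) + lam * v
direction = estimate_newton_direction(grad, hess_vec_i, np.ones(5), 100, depth=50)
```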
Linear-time Second-order Stochastic Algorithm
(LiSSA)
Runtime: O(dm log(1/ε) + √(γd) ⋅ d ⋅ log(1/ε))
Recommendation systems
[figure: users × movies rating matrix, only a few entries observed]
Recommendation systems
complete missing entries
[figure: users × movies rating matrix with all entries filled in]
Recommendation systems
get new data
[figure: the completed users × movies matrix after new ratings arrive]
Recommendation systems
re-complete missing entries
[figure: users × movies rating matrix re-completed with the new data]
• f is smooth, convex
• linear opt over K is easy
• Set x_1 ∈ K arbitrarily
• For t = 1, 2, …
  1. Use x_t, obtain f_t
  2. Compute x_{t+1} as follows:
v_t = arg min_{x ∈ K} ( Σ_{i=1}^{t} ∇f_i(x_i) + δ_t x_t )ᵀ x

x_{t+1} ← (1 − α_t) x_t + α_t v_t
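For intuition, a minimal sketch of the basic (offline) conditional-gradient step that the update above builds on, assuming K is the probability simplex so that linear optimization over K is trivial; the example objective is illustrative:

```python
import numpy as np

# Hedged sketch of the basic (offline) conditional-gradient / Frank-Wolfe step.
# Assumed setting: K is the probability simplex, where linear optimization just
# picks the coordinate with the smallest gradient entry.

def frank_wolfe(grad, d, T):
    x = np.ones(d) / d                        # start at the center of the simplex
    for t in range(1, T + 1):
        g = grad(x)
        v = np.zeros(d)
        v[np.argmin(g)] = 1.0                 # v_t = argmin_{v in K} g . v  (linear opt)
        alpha = 2.0 / (t + 2)                 # standard step-size schedule
        x = (1 - alpha) * x + alpha * v       # convex combination keeps x in K
    return x

# Example: minimize f(x) = 0.5 * ||x - c||^2 over the simplex.
c = np.array([0.1, 0.7, 0.3, 0.9])
x_fw = frank_wolfe(lambda x: x - c, d=4, T=200)
```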
[diagram: examples {a_i} ∈ R^d → machine → distribution over labels]

arg min_{x ∈ K}  (1/m) Σ_{i=1}^{m} ℓ_i(x, a_i, b_i) + R(x)
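As a concrete (assumed) instance of this regularized ERM objective, a short sketch with logistic loss and an ℓ2 regularizer; the names and data are hypothetical:

```python
import numpy as np

# Assumed concrete instance of the regularized ERM objective above: logistic loss
# and R(x) = lam * ||x||^2 (names and data here are hypothetical).

def erm_objective(x, A, b, lam):
    """(1/m) * sum_i log(1 + exp(-b_i * (a_i . x))) + lam * ||x||^2"""
    margins = b * (A @ x)
    return np.mean(np.log1p(np.exp(-margins))) + lam * (x @ x)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))              # examples a_i in R^d
b = np.sign(rng.standard_normal(50))          # labels b_i in {-1, +1}
value = erm_objective(np.zeros(3), A, b, lam=0.1)
```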
What is Optimization
[figure: surface plot of a non-convex objective]
• Optimization approaches:
• Finding vanishing gradients / local minima efficiently
• Graduated optimization / homotopy method
• Quasi-convexity
• Structure of local optima (probabilistic assumptions that
allow alternating minimization,…)
Gradient/Hessian based methods
Goal: find point x such that
1. ‖∇f(x)‖ ≤ ε (approximate first-order optimality)
2. ∇²f(x) ≽ −εI (approximate second-order optimality)
1. (we've proved) GD algorithm x_{t+1} = x_t − η∇_t finds a point satisfying (1) in O(1/ε) (expensive) iterations
2. (we've proved) SGD algorithm x_{t+1} = x_t − η∇̂_t finds a point satisfying (1) in O(1/ε²) (cheap) iterations
3. SGD algorithm with noise finds a point satisfying (1 & 2) in O(1/ε⁴) (cheap) iterations
[Ge, Huang, Jin, Yuan '15]
4. Recent second-order methods find a point satisfying (1 & 2) in O(1/ε^{7/4}) (expensive) iterations
[Carmon, Duchi, Hinder, Sidford '16]
[Agarwal, Allen-Zhu, Bullins, Hazan, Ma '16]
Recap
1. What is Optimization
   • Learning as optimization (chair/car classification)
2. Regularization
   • AdaGrad and optimal regularization
3. Advanced optimization
   • Frank-Wolfe, acceleration, variance reduction, second-order methods, non-convex optimization
Bibliography & more information, see:
http://www.cs.princeton.edu/~ehazan/tutorial/SimonsTutorial.htm
Thank you!