Berkeley Tutorial: Optimization for Machine Learning, Part 2

This document discusses optimization techniques for machine learning, focusing on methods such as stochastic optimization, adaptive regularization, and gradient descent acceleration. It covers both first-order and second-order optimization methods, including Newton's method and the Conditional Gradient algorithm, highlighting their applications in recommendation systems and matrix completion. The document emphasizes the importance of well-conditioned functions for faster optimization and presents theoretical results on the efficiency of these algorithms.


Tutorial: PART 2

Optimization for Machine Learning

Elad Hazan
Princeton University

+ help from Sanjeev Arora & Yoram Singer


Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Online gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second order methods,
non-convex optimization
Accelerating gradient descent?
1. Adaptive regularization (AdaGrad)
   works for non-smooth & non-convex
2. Variance reduction
   uses special ERM structure
   very effective for smooth & convex
3. Acceleration/momentum
   smooth convex only; general-purpose optimization, since the 80's
Condition number of convex functions
defined as $\gamma = \frac{\beta}{\alpha}$, where (simplified)

$0 \prec \alpha I \preceq \nabla^2 f(x) \preceq \beta I$

$\alpha$ = strong convexity, $\beta$ = smoothness


Non-convex smooth functions: (simplified)

$-\beta I \preceq \nabla^2 f(x) \preceq \beta I$

Why do we care?

well-conditioned functions exhibit much faster optimization!


(but equivalent via reductions)
Examples
Smooth gradient descent
The descent lemma, $\beta$-smooth functions (algorithm: $x_{t+1} = x_t - \eta \nabla_t$):

$f(x_{t+1}) - f(x_t) \le \nabla_t^\top (x_{t+1} - x_t) + \beta \|x_t - x_{t+1}\|^2 = -(\eta - \beta\eta^2)\,\|\nabla_t\|^2 = -\frac{1}{4\beta}\|\nabla_t\|^2$

(using the step size $\eta = \frac{1}{2\beta}$)

Thus, for M-bounded functions ($|f(x_t)| \le M$):

$-2M \le f(x_T) - f(x_1) = \sum_t \big(f(x_{t+1}) - f(x_t)\big) \le -\frac{1}{4\beta}\sum_t \|\nabla_t\|^2$

Thus, there exists a $t$ for which

$\|\nabla_t\|^2 \le \frac{8M\beta}{T}$
Smooth gradient descent
Conclusions: for $x_{t+1} = x_t - \eta \nabla_t$ and $T = \Omega\!\left(\frac{1}{\epsilon}\right)$, gradient descent finds a point with

$\|\nabla_t\|^2 \le \epsilon$

1. Holds even for non-convex functions

2. For convex functions this implies $f(x_t) - f(x^*) \le O(\epsilon)$ (faster for smooth!)
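As a concrete illustration, here is a minimal sketch of this update in Python (not from the slides): `grad_f` and `beta` are assumed inputs standing in for $\nabla f$ and the smoothness constant $\beta$, and the routine returns the iterate with the smallest observed gradient norm, matching the "there exists a t" guarantee above.

```python
import numpy as np

def smooth_gd(grad_f, x0, beta, T):
    """Gradient descent with the fixed step 1/(2*beta) from the descent lemma."""
    x = x0.copy()
    best_x, best_gnorm = x.copy(), np.inf
    for _ in range(T):
        g = grad_f(x)
        gnorm = np.linalg.norm(g)
        if gnorm < best_gnorm:                 # keep the iterate with smallest ||grad||
            best_x, best_gnorm = x.copy(), gnorm
        x = x - (1.0 / (2.0 * beta)) * g       # x_{t+1} = x_t - eta * grad_t
    return best_x, best_gnorm

# Example: f(x) = ||x||^2 / 2 is 1-smooth, so beta = 1.
x, gnorm = smooth_gd(lambda x: x, np.ones(5), beta=1.0, T=100)
```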
Non-convex stochastic gradient descent
The descent lemma, $\beta$-smooth functions (algorithm: $x_{t+1} = x_t - \eta \tilde\nabla_t$, with $E[\tilde\nabla_t] = \nabla_t$):

$E\big[f(x_{t+1}) - f(x_t)\big] \le E\big[\nabla_t^\top(x_{t+1} - x_t) + \beta \|x_t - x_{t+1}\|^2\big]$

$= E\big[-\eta\,\tilde\nabla_t^\top \nabla_t + \beta\eta^2 \|\tilde\nabla_t\|^2\big] = -\eta\|\nabla_t\|^2 + \eta^2\beta\, E\|\tilde\nabla_t\|^2$

$= -\eta\|\nabla_t\|^2 + \eta^2\beta\big(\|\nabla_t\|^2 + \mathrm{var}(\tilde\nabla_t)\big)$

Thus, for M-bounded functions ($|f(x_t)| \le M$):

$T = O\!\left(\frac{M\beta}{\varepsilon^2}\right) \;\Rightarrow\; \exists_{t \le T}:\ \|\nabla_t\|^2 \le \varepsilon$
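A matching sketch for the stochastic case (again an illustration, not the slides' code), using least-squares ERM so that the single-example gradient is an unbiased estimate $\tilde\nabla_t$ of the full gradient; the function name and constants are assumptions.

```python
import numpy as np

def sgd_least_squares(A, b, eta, T, seed=0):
    """SGD on f(x) = (1/2m) * ||Ax - b||^2 using one random row per step."""
    rng = np.random.default_rng(seed)
    m, d = A.shape
    x = np.zeros(d)
    for _ in range(T):
        i = rng.integers(m)                      # sample one example
        g = (A[i] @ x - b[i]) * A[i]             # unbiased gradient estimate
        x -= eta * g                             # x_{t+1} = x_t - eta * grad~_t
    return x
```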
Controlling the variance:
Interpolating GD and SGD
Model: both full and stochastic gradients are available. The estimator combines both into a lower-variance random variable:

$x_{t+1} = x_t - \eta\left(\tilde\nabla f(x_t) - \tilde\nabla f(x_0) + \nabla f(x_0)\right)$

Every so often, compute a full gradient and restart at a new $x_0$.

Theorem [Schmidt, Le Roux, Bach '12; Johnson, Zhang '13; Mahdavi, Zhang, Jin '13]: variance reduction for well-conditioned functions,

$0 \prec \alpha I \preceq \nabla^2 f(x) \preceq \beta I, \qquad \gamma = \frac{\beta}{\alpha}$

produces an $\epsilon$-approximate solution in time

$O\!\left((m + \gamma)\, d \log\frac{1}{\epsilon}\right)$

($\gamma$ should be interpreted as $\frac{1}{\epsilon}$)
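A sketch of this interpolated estimator in SVRG style for the same least-squares ERM as before; the epoch length and step size are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def svrg_least_squares(A, b, eta, epochs, inner_steps, seed=0):
    """SVRG: x_{t+1} = x_t - eta*(g_i(x_t) - g_i(x0) + full_grad(x0))."""
    rng = np.random.default_rng(seed)
    m, d = A.shape
    x0 = np.zeros(d)
    for _ in range(epochs):
        full_grad = A.T @ (A @ x0 - b) / m       # snapshot gradient at x0
        x = x0.copy()
        for _ in range(inner_steps):
            i = rng.integers(m)
            gi_x  = (A[i] @ x  - b[i]) * A[i]    # stochastic grad at x_t
            gi_x0 = (A[i] @ x0 - b[i]) * A[i]    # stochastic grad at snapshot
            x -= eta * (gi_x - gi_x0 + full_grad)
        x0 = x                                   # restart at the new snapshot
    return x0
```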
Acceleration/momentum [Nesterov ‘83]
• Optimal gradient complexity (smooth, convex)

• Modest practical improvements; non-convex "momentum" methods.

• With variance reduction, fastest possible running time of first-order methods:

$O\!\left(\big(m + \sqrt{\gamma m}\big)\, d \log\frac{1}{\epsilon}\right)$

[Woodworth, Srebro '15] – tight lower bound w. gradient oracle
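For reference, a minimal sketch of Nesterov-style acceleration for a $\beta$-smooth convex objective; `grad_f` and `beta` are assumed oracles, and the $(t-1)/(t+2)$ momentum schedule is one standard variant, not necessarily the one analyzed in the cited results.

```python
import numpy as np

def nesterov_agd(grad_f, x0, beta, T):
    """Accelerated gradient descent: take the gradient step at an extrapolated point."""
    x_prev = x0.copy()
    x = x0.copy()
    for t in range(1, T + 1):
        momentum = (t - 1) / (t + 2)
        y = x + momentum * (x - x_prev)          # look-ahead (momentum) point
        x_prev = x
        x = y - (1.0 / beta) * grad_f(y)         # gradient step from y
    return x
```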


Experiments w. convex losses

Improve upon GradientDescent++ ?


Next few slides:

Move from first order (GD++) to second order


Higher Order Optimization

• Gradient Descent – Direction of Steepest Descent


• Second Order Methods – Use Local Curvature
Newton’s method (+ Trust region)

$x_{t+1} = x_t - \eta\, [\nabla^2 f(x)]^{-1} \nabla f(x)$

For non-convex functions: the iterates can move to $\infty$

Solution: solve a quadratic approximation in a local area (trust region)
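A sketch of a damped Newton step with a simple Levenberg-style ridge term as a stand-in for the trust-region safeguard (an illustrative choice, not the slides' exact method); `grad_f` and `hess_f` are assumed dense numpy oracles.

```python
import numpy as np

def damped_newton(grad_f, hess_f, x0, eta=1.0, lam=1e-3, T=50):
    """x_{t+1} = x_t - eta * (H + lam*I)^{-1} grad, with a ridge term to keep
    the step well-defined when the Hessian is indefinite (non-convex case)."""
    x = x0.copy()
    d = x.shape[0]
    for _ in range(T):
        g = grad_f(x)
        H = hess_f(x)
        step = np.linalg.solve(H + lam * np.eye(d), g)   # O(d^3) per iteration
        x = x - eta * step
    return x
```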
Newton’s method (+ Trust region)

$x_{t+1} = x_t - \eta\, [\nabla^2 f(x)]^{-1} \nabla f(x)$

$d^3$ time per iteration!
Infeasible for ML!!

Till recently…
Speed up the Newton direction computation??

• Spielman-Teng '04: diagonally dominant systems of equations in linear time!
  • 2015 Gödel Prize
  • Used by Daitch-Spielman for faster flow algorithms

• Erdogdu-Montanari '15: low-rank approximation & inversion by Sherman-Morrison
  • Allows stochastic information
  • Still prohibitive: rank × d² time
Stochastic Newton?

$x_{t+1} = x_t - \eta\, [\nabla^2 f(x)]^{-1} \nabla f(x)$

• ERM, rank-1 loss: $\arg\min_x\ E_{i \sim U[m]}\left[\ell(x^\top a_i, b_i) + \frac{1}{2}\|x\|^2\right]$

• Unbiased estimator of the Hessian:

$\tilde\nabla^2 = a_i a_i^\top \cdot \ell''(x^\top a_i, b_i) + I, \qquad i \sim U[1, \dots, m]$

• Clearly $E[\tilde\nabla^2] = \nabla^2 f$, but $E\big[(\tilde\nabla^2)^{-1}\big] \ne (\nabla^2 f)^{-1}$
Circumvent Hessian creation and inversion!

• 3 steps:

• (1) Represent the Hessian inverse as an infinite series (for any distribution over the naturals, $i \sim N$):

$\nabla^{-2} f = \sum_{i=0}^{\infty} \left(I - \nabla^2 f\right)^i$

• (2) Sample from the infinite series (Hessian-gradient product), ONCE:

$\nabla^{-2} f\, \nabla f = \sum_{i} \left(I - \nabla^2 f\right)^i \nabla f = E_{i \sim N}\!\left[\left(I - \nabla^2 f\right)^i \nabla f \cdot \frac{1}{\Pr[i]}\right]$

• (3) Estimate the Hessian power by sampling i.i.d. data examples:

$= E_{i \sim N,\ k \sim [i]}\!\left[\prod_{k=1}^{i} \left(I - \nabla^2 f_k\right) \nabla f \cdot \frac{1}{\Pr[i]}\right]$

(each factor uses a single example; vector-vector products only)
Linear-time Second-order Stochastic Algorithm
(LiSSA)

• Use the estimator $\tilde\nabla^{-2} f$ defined previously

• Compute a full (large batch) gradient $\nabla f$

• Move in the direction $\tilde\nabla^{-2} f\, \nabla f$

• Theoretical running time to produce an $\epsilon$-approximate solution for $\gamma$-well-conditioned functions (convex) [Agarwal, Bullins, Hazan '15]:

$O\!\left(dm \log\frac{1}{\epsilon} + \sqrt{\gamma d}\, d \log\frac{1}{\epsilon}\right)$

1. Faster than first-order methods!

2. Indications this is tight [Arjevani, Shamir '16]
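To make the construction concrete, here is a sketch of a LiSSA-style Newton-direction estimate for a ridge-regularized logistic-regression ERM with labels $b_i \in \{\pm 1\}$ (the loss, chain depth, and averaging are illustrative assumptions, and the data is assumed scaled so that $\|\nabla^2 f\| \le 1$); it uses only single-example Hessian-vector products, as in steps (1)-(3) above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lissa_newton_step(A, b, x, lam=0.1, depth=50, samples=10, seed=0):
    """Estimate [grad^2 f(x)]^{-1} grad f(x) for
    f(x) = mean_i log(1 + exp(-b_i a_i.x)) + lam/2 ||x||^2
    via the truncated Neumann recursion u <- grad + (I - H_i) u,
    where H_i is the Hessian on a single random example.
    Assumes the data is scaled so that ||grad^2 f|| <= 1."""
    rng = np.random.default_rng(seed)
    m, d = A.shape
    margins = b * (A @ x)
    grad = -(A * (b * sigmoid(-margins))[:, None]).mean(axis=0) + lam * x

    def hess_vec(i, v):
        s = sigmoid(A[i] @ x)                    # single-example Hessian-vector product
        return A[i] * (s * (1 - s) * (A[i] @ v)) + lam * v

    estimates = []
    for _ in range(samples):                     # average a few independent chains
        u = grad.copy()
        for _ in range(depth):
            i = rng.integers(m)
            u = grad + u - hess_vec(i, u)        # u <- grad + (I - H_i) u
        estimates.append(u)
    return np.mean(estimates, axis=0)            # approximates H^{-1} grad
```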
What about constraints??

Next few slides – projection free


(Frank-Wolfe) methods
Recommendation systems
[Figure: a users × movies matrix with a few observed entries; entry (i, j) is the rating of user i for movie j, and most entries are missing.]
Recommendation systems
[Figure: the same users × movies matrix with the missing entries completed; every cell now holds a predicted rating.]
Recommendation systems
[Figure: the completed matrix again; new observed ratings arrive ("get new data").]
Recommendation systems
[Figure: the matrix is re-completed after the new data arrives.]

Assume the "true matrix" is low rank; convex relaxation: bounded trace norm.
Bounded trace norm matrices

• Trace norm of a matrix = sum of singular values


• K = { X | X is a matrix with trace norm at most D }

• Computational bottleneck: projections onto K require an eigendecomposition: O(n³) operations

• But: linear optimization over K is easier, requiring only a top eigenvector computation: O(sparsity) time
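A sketch of that linear-optimization oracle over K = {X : trace norm ≤ D}: the minimizer of $\langle G, X \rangle$ over K is $-D\, u_1 v_1^\top$, where $(u_1, v_1)$ is the top singular pair of G. The use of `scipy.sparse.linalg.svds` is one possible implementation choice, not mandated by the slides.

```python
import numpy as np
from scipy.sparse.linalg import svds

def linear_opt_trace_ball(G, D):
    """argmin_{||X||_tr <= D} <G, X>  =  -D * u1 v1^T,
    where (u1, v1) are the top singular vectors of G."""
    u, s, vt = svds(G, k=1)                      # top singular triplet only
    return -D * np.outer(u[:, 0], vt[0])
```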
Projections → linear optimization

1. Matrix completion (K = bounded trace norm matrices): projection needs an eigendecomposition; linear optimization needs only a largest-singular-vector computation

2. Online routing (K = flow polytope): projection is conic optimization over the flow polytope; linear optimization is a shortest-path computation

3. Rotations (K = rotation matrices): projection is conic optimization over the rotation set; linear optimization is Wahba's algorithm, which runs in linear time

4. Matroids (K = matroid polytope): projection is convex optimization via the ellipsoid method; linear optimization is a greedy algorithm, nearly linear time!
Conditional Gradient algorithm [Frank, Wolfe ’56]
Convex optimization problem:

$\min_{x \in K} f(x)$

• f is smooth, convex
• linear optimization over K is easy

$v_t = \arg\min_{x \in K}\ \nabla f(x_t)^\top x$

$x_{t+1} = x_t + \eta_t (v_t - x_t)$

1. At iteration t: $x_t$ is a convex combination of at most t vertices (sparsity)

2. No learning rate: $\eta_t \approx \frac{2}{t}$ (independent of diameter, gradients, etc.)
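A generic sketch of this iteration; `grad_f` and `linear_opt` are assumed oracles (for matrix completion, `linear_opt` could be the trace-norm oracle sketched earlier), and $\eta_t = \frac{2}{t+2}$ is the usual step schedule.

```python
def frank_wolfe(grad_f, linear_opt, x0, T):
    """Conditional gradient: v_t = argmin_{v in K} <grad f(x_t), v>,
    then x_{t+1} = x_t + eta_t (v_t - x_t) with eta_t = 2/(t+2)."""
    x = x0
    for t in range(T):
        v = linear_opt(grad_f(x))                # linear optimization over K
        eta = 2.0 / (t + 2.0)                    # no tuned learning rate
        x = x + eta * (v - x)                    # convex combination: stays in K
    return x
```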
FW theorem

$x_{t+1} = x_t + \eta_t(v_t - x_t), \qquad v_t = \arg\min_{x \in K}\ \nabla f(x_t)^\top x$

Theorem: $f(x_t) - f(x^*) = O\!\left(\frac{1}{t}\right)$

Proof, main observation:

$f(x_{t+1}) - f(x^*) = f\big(x_t + \eta_t(v_t - x_t)\big) - f(x^*)$

$\le f(x_t) - f(x^*) + \eta_t (v_t - x_t)^\top \nabla_t + \frac{\beta}{2}\eta_t^2 \|v_t - x_t\|^2$   ($\beta$-smoothness of $f$)

$\le f(x_t) - f(x^*) + \eta_t (x^* - x_t)^\top \nabla_t + \frac{\beta}{2}\eta_t^2 \|v_t - x_t\|^2$   (optimality of $v_t$)

$\le f(x_t) - f(x^*) + \eta_t \big(f(x^*) - f(x_t)\big) + \frac{\beta}{2}\eta_t^2 \|v_t - x_t\|^2$   (convexity of $f$)

$\le (1 - \eta_t)\big(f(x_t) - f(x^*)\big) + \frac{\beta\eta_t^2}{2} D^2.$

Thus, with $h_t = f(x_t) - f(x^*)$:

$h_{t+1} \le (1 - \eta_t) h_t + O(\eta_t^2), \qquad \eta_t \approx \frac{2}{t} \ \Rightarrow\ h_t = O\!\left(\frac{1}{t}\right)$
Online Conditional Gradient

• Set $x_1 \in K$ arbitrarily
• For t = 1, 2, …:
  1. Use $x_t$, obtain $f_t$
  2. Compute $x_{t+1}$ as follows:

$v_t = \arg\min_{x \in K}\ \left(\sum_{i=1}^{t} \nabla f_i(x_i) + t\, x_t\right)^{\!\top} x$

$x_{t+1} \leftarrow (1 - \alpha_t)\, x_t + \alpha_t v_t$

Theorem [Hazan, Kale '12]: $\mathrm{Regret} = O(T^{3/4})$

Theorem [Garber, Hazan '13]: For polytopes, strongly convex and smooth losses,
1. Offline: convergence after t steps: $e^{-\Omega(t)}$
2. Online: $\mathrm{Regret} = O(\sqrt{T})$
Next few slides:
survey state-of-the-art
Non-convex optimization in ML

[Figure: a learning machine mapping inputs $\{a\} \in \mathbb{R}^d$ to a distribution over labels.]

$\arg\min_{x \in \mathbb{R}^d}\ \frac{1}{m}\sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$
Solution concepts for non-convex optimization?

[Figure: embedded slide from Duchi (UC Berkeley), "Convex Optimization for Machine Learning": "But generally speaking... we're screwed" - local (non-global) minima of $f_0$, all kinds of constraints (even restricting to continuous functions, e.g. $h(x) = \sin(2\pi x) = 0$), shown over a 3D plot of a highly non-convex surface.]

• Global minimization is NP-hard, even for degree-4 polynomials. Even local minimization up to 4th-order optimality conditions is NP-hard.

• Algorithmic stability is sufficient [Hardt, Recht, Singer '16]

• Optimization approaches:
• Finding vanishing gradients / local minima efficiently
• Graduated optimization / homotopy method
• Quasi-convexity
• Structure of local optima (probabilistic assumptions that
allow alternating minimization,…)
Gradient/Hessian based methods
Goal: find a point x such that
1. $\|\nabla f(x)\| \le \varepsilon$ (approximate first-order optimality)
2. $\nabla^2 f(x) \succeq -\varepsilon I$ (approximate second-order optimality)

1. (we've proved) The GD algorithm $x_{t+1} = x_t - \eta \nabla_t$ finds a point satisfying (1) in $O\!\left(\frac{1}{\varepsilon^2}\right)$ (expensive) iterations
2. (we've proved) The SGD algorithm $x_{t+1} = x_t - \eta \tilde\nabla_t$ finds a point satisfying (1) in $O\!\left(\frac{1}{\varepsilon^4}\right)$ (cheap) iterations
3. SGD with added noise finds a point satisfying (1 & 2) in $O\!\left(\frac{1}{\varepsilon^4}\right)$ (cheap) iterations (a sketch follows below)
[Ge, Huang, Jin, Yuan '15]
4. Recent second-order methods find a point satisfying (1 & 2) in $O\!\left(\frac{1}{\varepsilon^{7/4}}\right)$ (expensive) iterations
[Carmon, Duchi, Hinder, Sidford '16]
[Agarwal, Allen-Zhu, Bullins, Hazan, Ma '16]
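A sketch of item 3 above: SGD with explicitly injected isotropic Gaussian noise, in the spirit of [Ge, Huang, Jin, Yuan '15]; the noise scale, schedule, and oracle signature are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def noisy_sgd(stoch_grad, x0, eta, noise_std, T, seed=0):
    """SGD plus isotropic Gaussian noise, which helps escape saddle points
    (approximate second-order stationarity under suitable assumptions)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(T):
        g = stoch_grad(x, rng)                   # unbiased stochastic gradient
        xi = noise_std * rng.standard_normal(x.shape)
        x = x - eta * (g + xi)                   # perturbed update
    return x
```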
Recap

[Figures: an image-classification illustration ("chair/car") and the Duchi (UC Berkeley) non-convex optimization slide shown earlier.]

1. Online learning and stochastic optimization
• Regret minimization
• Online gradient descent

2. Regularization
• AdaGrad and optimal regularization

3. Advanced optimization
• Frank-Wolfe, acceleration, variance reduction, second order methods, non-convex optimization
Bibliography & more information, see:
http://www.cs.princeton.edu/~ehazan/tutorial/SimonsTutorial.htm

Thank you!
