Berkeley Tutorial: Optimization for Machine Learning, Part 2

This document discusses optimization techniques for machine learning, focusing on methods such as stochastic optimization, adaptive regularization, and gradient descent acceleration. It covers both first-order and second-order optimization methods, including Newton's method and the Conditional Gradient algorithm, highlighting their applications in recommendation systems and matrix completion. The document emphasizes the importance of well-conditioned functions for faster optimization and presents theoretical results on the efficiency of these algorithms.


Tutorial: PART 2

Optimization for Machine Learning

Elad Hazan
Princeton University

+ help from Sanjeev Arora & Yoram Singer


Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Online gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second order methods,
non-convex optimization
Accelerating gradient descent?
1. Adaptive regularization (AdaGrad)
   works for non-smooth & non-convex
2. Variance reduction
   uses special ERM structure
   very effective for smooth & convex
3. Acceleration/momentum
   smooth convex only; general-purpose optimization, since the 80's
Condition number of convex functions
defined as $\gamma = \frac{\beta}{\alpha}$, where (simplified)

$0 \prec \alpha I \preceq \nabla^2 f(x) \preceq \beta I$

$\alpha$ = strong convexity, $\beta$ = smoothness


Non-convex smooth functions: (simplified)

$-\beta I \preceq \nabla^2 f(x) \preceq \beta I$

Why do we care?

well-conditioned functions exhibit much faster optimization!


(but equivalent via reductions)
Examples
Smooth gradient descent
The descent lemma, $\beta$-smooth functions (algorithm: $x_{t+1} = x_t - \eta \nabla_t$):

$f(x_{t+1}) - f(x_t) \le \nabla_t^\top (x_{t+1} - x_t) + \beta \|x_t - x_{t+1}\|^2 = -(\eta - \beta\eta^2)\,\|\nabla_t\|^2 = -\frac{1}{4\beta}\|\nabla_t\|^2$

(using the step size $\eta = \frac{1}{2\beta}$)

Thus, for M-bounded functions ($|f(x_t)| \le M$):

$-2M \le f(x_T) - f(x_1) = \sum_t \big(f(x_{t+1}) - f(x_t)\big) \le -\frac{1}{4\beta}\sum_t \|\nabla_t\|^2$

Thus, there exists a $t$ for which

$\|\nabla_t\|^2 \le \frac{8M\beta}{T}$
Smooth gradient descent
Conclusions: for $x_{t+1} = x_t - \eta \nabla_t$ and $T = \Omega\!\left(\frac{1}{\epsilon}\right)$, gradient descent finds a point with

$\|\nabla_t\|^2 \le \epsilon$

1. Holds even for non-convex functions

2. For convex functions this implies $f(x_t) - f(x^*) \le O(\epsilon)$ (faster for smooth!)
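As a concrete illustration, here is a minimal sketch of this update in Python (not from the slides): `grad_f` and `beta` are assumed inputs standing in for $\nabla f$ and the smoothness constant $\beta$, and the routine returns the iterate with the smallest observed gradient norm, matching the "there exists a t" guarantee above.

```python
import numpy as np

def smooth_gd(grad_f, x0, beta, T):
    """Gradient descent with the fixed step 1/(2*beta) from the descent lemma."""
    x = x0.copy()
    best_x, best_gnorm = x.copy(), np.inf
    for _ in range(T):
        g = grad_f(x)
        gnorm = np.linalg.norm(g)
        if gnorm < best_gnorm:                 # keep the iterate with smallest ||grad||
            best_x, best_gnorm = x.copy(), gnorm
        x = x - (1.0 / (2.0 * beta)) * g       # x_{t+1} = x_t - eta * grad_t
    return best_x, best_gnorm

# Example: f(x) = ||x||^2 / 2 is 1-smooth, so beta = 1.
x, gnorm = smooth_gd(lambda x: x, np.ones(5), beta=1.0, T=100)
```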
Non-convex stochastic gradient descent
The descent lemma, $\beta$-smooth functions (algorithm: $x_{t+1} = x_t - \eta \tilde\nabla_t$, with $E[\tilde\nabla_t] = \nabla_t$):

$E\big[f(x_{t+1}) - f(x_t)\big] \le E\big[\nabla_t^\top(x_{t+1} - x_t) + \beta \|x_t - x_{t+1}\|^2\big]$

$= E\big[-\eta\,\tilde\nabla_t^\top \nabla_t + \beta\eta^2 \|\tilde\nabla_t\|^2\big] = -\eta\|\nabla_t\|^2 + \eta^2\beta\, E\|\tilde\nabla_t\|^2$

$= -\eta\|\nabla_t\|^2 + \eta^2\beta\big(\|\nabla_t\|^2 + \mathrm{var}(\tilde\nabla_t)\big)$

Thus, for M-bounded functions ($|f(x_t)| \le M$):

$T = O\!\left(\frac{M\beta}{\varepsilon^2}\right) \;\Rightarrow\; \exists_{t \le T}:\ \|\nabla_t\|^2 \le \varepsilon$
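A matching sketch for the stochastic case (again an illustration, not the slides' code), using least-squares ERM so that the single-example gradient is an unbiased estimate $\tilde\nabla_t$ of the full gradient; the function name and constants are assumptions.

```python
import numpy as np

def sgd_least_squares(A, b, eta, T, seed=0):
    """SGD on f(x) = (1/2m) * ||Ax - b||^2 using one random row per step."""
    rng = np.random.default_rng(seed)
    m, d = A.shape
    x = np.zeros(d)
    for _ in range(T):
        i = rng.integers(m)                      # sample one example
        g = (A[i] @ x - b[i]) * A[i]             # unbiased gradient estimate
        x -= eta * g                             # x_{t+1} = x_t - eta * grad~_t
    return x
```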
Controlling the variance:
Interpolating GD and SGD
Model: both full and stochastic gradients are available. The estimator combines both into a lower-variance random variable:

$x_{t+1} = x_t - \eta\left(\tilde\nabla f(x_t) - \tilde\nabla f(x_0) + \nabla f(x_0)\right)$

Every so often, compute a full gradient and restart at a new $x_0$.

Theorem [Schmidt, Le Roux, Bach '12; Johnson, Zhang '13; Mahdavi, Zhang, Jin '13]: variance reduction for well-conditioned functions,

$0 \prec \alpha I \preceq \nabla^2 f(x) \preceq \beta I, \qquad \gamma = \frac{\beta}{\alpha}$

produces an $\epsilon$-approximate solution in time

$O\!\left((m + \gamma)\, d \log\frac{1}{\epsilon}\right)$

($\gamma$ should be interpreted as $\frac{1}{\epsilon}$)
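A sketch of this interpolated estimator in SVRG style for the same least-squares ERM as before; the epoch length and step size are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def svrg_least_squares(A, b, eta, epochs, inner_steps, seed=0):
    """SVRG: x_{t+1} = x_t - eta*(g_i(x_t) - g_i(x0) + full_grad(x0))."""
    rng = np.random.default_rng(seed)
    m, d = A.shape
    x0 = np.zeros(d)
    for _ in range(epochs):
        full_grad = A.T @ (A @ x0 - b) / m       # snapshot gradient at x0
        x = x0.copy()
        for _ in range(inner_steps):
            i = rng.integers(m)
            gi_x  = (A[i] @ x  - b[i]) * A[i]    # stochastic grad at x_t
            gi_x0 = (A[i] @ x0 - b[i]) * A[i]    # stochastic grad at snapshot
            x -= eta * (gi_x - gi_x0 + full_grad)
        x0 = x                                   # restart at the new snapshot
    return x0
```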
Acceleration/momentum [Nesterov ‘83]
• Optimal gradient complexity (smooth, convex)

• Modest practical improvements; non-convex "momentum" methods.

• With variance reduction, fastest possible running time of first-order methods:

$O\!\left(\big(m + \sqrt{\gamma m}\big)\, d \log\frac{1}{\epsilon}\right)$

[Woodworth, Srebro '15] – tight lower bound w. gradient oracle
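For reference, a minimal sketch of Nesterov-style acceleration for a $\beta$-smooth convex objective; `grad_f` and `beta` are assumed oracles, and the $(t-1)/(t+2)$ momentum schedule is one standard variant, not necessarily the one analyzed in the cited results.

```python
import numpy as np

def nesterov_agd(grad_f, x0, beta, T):
    """Accelerated gradient descent: take the gradient step at an extrapolated point."""
    x_prev = x0.copy()
    x = x0.copy()
    for t in range(1, T + 1):
        momentum = (t - 1) / (t + 2)
        y = x + momentum * (x - x_prev)          # look-ahead (momentum) point
        x_prev = x
        x = y - (1.0 / beta) * grad_f(y)         # gradient step from y
    return x
```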


Experiments w. convex losses

Improve upon GradientDescent++ ?


Next few slides:

Move from first order (GD++) to second order


Higher Order Optimization

• Gradient Descent – Direction of Steepest Descent


• Second Order Methods – Use Local Curvature
Newton’s method (+ Trust region)

$x_{t+1} = x_t - \eta\, [\nabla^2 f(x)]^{-1} \nabla f(x)$

For non-convex functions: the iterates can move to $\infty$

Solution: solve a quadratic approximation in a local area (trust region)
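A sketch of a damped Newton step with a simple Levenberg-style ridge term as a stand-in for the trust-region safeguard (an illustrative choice, not the slides' exact method); `grad_f` and `hess_f` are assumed dense numpy oracles.

```python
import numpy as np

def damped_newton(grad_f, hess_f, x0, eta=1.0, lam=1e-3, T=50):
    """x_{t+1} = x_t - eta * (H + lam*I)^{-1} grad, with a ridge term to keep
    the step well-defined when the Hessian is indefinite (non-convex case)."""
    x = x0.copy()
    d = x.shape[0]
    for _ in range(T):
        g = grad_f(x)
        H = hess_f(x)
        step = np.linalg.solve(H + lam * np.eye(d), g)   # O(d^3) per iteration
        x = x - eta * step
    return x
```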
Newton’s method (+ Trust region)

$x_{t+1} = x_t - \eta\, [\nabla^2 f(x)]^{-1} \nabla f(x)$

$d^3$ time per iteration!
Infeasible for ML!!

Till recently…
Speed up the Newton direction computation??

• Spielman-Teng '04: diagonally dominant systems of equations in linear time!
  • 2015 Gödel Prize
  • Used by Daitch-Spielman for faster flow algorithms

• Erdogdu-Montanari '15: low-rank approximation & inversion by Sherman-Morrison
  • Allows stochastic information
  • Still prohibitive: rank × d² time
Stochastic Newton?

$x_{t+1} = x_t - \eta\, [\nabla^2 f(x)]^{-1} \nabla f(x)$

• ERM, rank-1 loss: $\arg\min_x\ E_{i \sim U[m]}\left[\ell(x^\top a_i, b_i) + \frac{1}{2}\|x\|^2\right]$

• Unbiased estimator of the Hessian:

$\tilde\nabla^2 = a_i a_i^\top \cdot \ell''(x^\top a_i, b_i) + I, \qquad i \sim U[1, \dots, m]$

• Clearly $E[\tilde\nabla^2] = \nabla^2 f$, but $E\big[(\tilde\nabla^2)^{-1}\big] \ne (\nabla^2 f)^{-1}$
Circumvent Hessian creation and inversion!

• 3 steps:

• (1) Represent the Hessian inverse as an infinite series (for any distribution over the naturals, $i \sim N$):

$\nabla^{-2} f = \sum_{i=0}^{\infty} \left(I - \nabla^2 f\right)^i$

• (2) Sample from the infinite series (Hessian-gradient product), ONCE:

$\nabla^{-2} f\, \nabla f = \sum_{i} \left(I - \nabla^2 f\right)^i \nabla f = E_{i \sim N}\!\left[\left(I - \nabla^2 f\right)^i \nabla f \cdot \frac{1}{\Pr[i]}\right]$

• (3) Estimate the Hessian power by sampling i.i.d. data examples:

$= E_{i \sim N,\ k \sim [i]}\!\left[\prod_{k=1}^{i} \left(I - \nabla^2 f_k\right) \nabla f \cdot \frac{1}{\Pr[i]}\right]$

(each factor uses a single example; vector-vector products only)
Linear-time Second-order Stochastic Algorithm
(LiSSA)

• Use the estimator $\tilde\nabla^{-2} f$ defined previously

• Compute a full (large batch) gradient $\nabla f$

• Move in the direction $\tilde\nabla^{-2} f\, \nabla f$

• Theoretical running time to produce an $\epsilon$-approximate solution for $\gamma$-well-conditioned functions (convex) [Agarwal, Bullins, Hazan '15]:

$O\!\left(dm \log\frac{1}{\epsilon} + \sqrt{\gamma d}\, d \log\frac{1}{\epsilon}\right)$

1. Faster than first-order methods!

2. Indications this is tight [Arjevani, Shamir '16]
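To make the construction concrete, here is a sketch of a LiSSA-style Newton-direction estimate for a ridge-regularized logistic-regression ERM with labels $b_i \in \{\pm 1\}$ (the loss, chain depth, and averaging are illustrative assumptions, and the data is assumed scaled so that $\|\nabla^2 f\| \le 1$); it uses only single-example Hessian-vector products, as in steps (1)-(3) above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lissa_newton_step(A, b, x, lam=0.1, depth=50, samples=10, seed=0):
    """Estimate [grad^2 f(x)]^{-1} grad f(x) for
    f(x) = mean_i log(1 + exp(-b_i a_i.x)) + lam/2 ||x||^2
    via the truncated Neumann recursion u <- grad + (I - H_i) u,
    where H_i is the Hessian on a single random example.
    Assumes the data is scaled so that ||grad^2 f|| <= 1."""
    rng = np.random.default_rng(seed)
    m, d = A.shape
    margins = b * (A @ x)
    grad = -(A * (b * sigmoid(-margins))[:, None]).mean(axis=0) + lam * x

    def hess_vec(i, v):
        s = sigmoid(A[i] @ x)                    # single-example Hessian-vector product
        return A[i] * (s * (1 - s) * (A[i] @ v)) + lam * v

    estimates = []
    for _ in range(samples):                     # average a few independent chains
        u = grad.copy()
        for _ in range(depth):
            i = rng.integers(m)
            u = grad + u - hess_vec(i, u)        # u <- grad + (I - H_i) u
        estimates.append(u)
    return np.mean(estimates, axis=0)            # approximates H^{-1} grad
```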
What about constraints??

Next few slides – projection free


(Frank-Wolfe) methods
Recommendation systems
[Figure: a users × movies matrix with a few observed entries; entry (i, j) is the rating of user i for movie j, and most entries are missing.]
Recommendation systems
[Figure: the same users × movies matrix with the missing entries completed; every cell now holds a predicted rating.]
Recommendation systems
[Figure: the completed matrix again; new observed ratings arrive ("get new data").]
Recommendation systems
[Figure: the matrix is re-completed after the new data arrives.]

Assume the "true matrix" is low rank; convex relaxation: bounded trace norm.
Bounded trace norm matrices

• Trace norm of a matrix = sum of singular values


• K = { X | X is a matrix with trace norm at most D }

• Computational bottleneck: projections onto K require an eigendecomposition: O(n³) operations

• But: linear optimization over K is easier, requiring only a top eigenvector computation: O(sparsity) time
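A sketch of that linear-optimization oracle over K = {X : trace norm ≤ D}: the minimizer of $\langle G, X \rangle$ over K is $-D\, u_1 v_1^\top$, where $(u_1, v_1)$ is the top singular pair of G. The use of `scipy.sparse.linalg.svds` is one possible implementation choice, not mandated by the slides.

```python
import numpy as np
from scipy.sparse.linalg import svds

def linear_opt_trace_ball(G, D):
    """argmin_{||X||_tr <= D} <G, X>  =  -D * u1 v1^T,
    where (u1, v1) are the top singular vectors of G."""
    u, s, vt = svds(G, k=1)                      # top singular triplet only
    return -D * np.outer(u[:, 0], vt[0])
```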
Projections → linear optimization

1. Matrix completion (K = bounded trace norm matrices): projection needs an eigendecomposition; linear optimization needs only a largest-singular-vector computation

2. Online routing (K = flow polytope): projection is conic optimization over the flow polytope; linear optimization is a shortest-path computation

3. Rotations (K = rotation matrices): projection is conic optimization over the rotation set; linear optimization is Wahba's algorithm, which runs in linear time

4. Matroids (K = matroid polytope): projection is convex optimization via the ellipsoid method; linear optimization is a greedy algorithm, nearly linear time!
Conditional Gradient algorithm [Frank, Wolfe ’56]
Convex optimization problem:

$\min_{x \in K} f(x)$

• f is smooth, convex
• linear optimization over K is easy

$v_t = \arg\min_{x \in K}\ \nabla f(x_t)^\top x$

$x_{t+1} = x_t + \eta_t (v_t - x_t)$

1. At iteration t: $x_t$ is a convex combination of at most t vertices (sparsity)

2. No learning rate: $\eta_t \approx \frac{2}{t}$ (independent of diameter, gradients, etc.)
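A generic sketch of this iteration; `grad_f` and `linear_opt` are assumed oracles (for matrix completion, `linear_opt` could be the trace-norm oracle sketched earlier), and $\eta_t = \frac{2}{t+2}$ is the usual step schedule.

```python
def frank_wolfe(grad_f, linear_opt, x0, T):
    """Conditional gradient: v_t = argmin_{v in K} <grad f(x_t), v>,
    then x_{t+1} = x_t + eta_t (v_t - x_t) with eta_t = 2/(t+2)."""
    x = x0
    for t in range(T):
        v = linear_opt(grad_f(x))                # linear optimization over K
        eta = 2.0 / (t + 2.0)                    # no tuned learning rate
        x = x + eta * (v - x)                    # convex combination: stays in K
    return x
```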
FW theorem

$x_{t+1} = x_t + \eta_t(v_t - x_t), \qquad v_t = \arg\min_{x \in K}\ \nabla f(x_t)^\top x$

Theorem: $f(x_t) - f(x^*) = O\!\left(\frac{1}{t}\right)$

Proof, main observation:

$f(x_{t+1}) - f(x^*) = f\big(x_t + \eta_t(v_t - x_t)\big) - f(x^*)$

$\le f(x_t) - f(x^*) + \eta_t (v_t - x_t)^\top \nabla_t + \frac{\beta}{2}\eta_t^2 \|v_t - x_t\|^2$   ($\beta$-smoothness of $f$)

$\le f(x_t) - f(x^*) + \eta_t (x^* - x_t)^\top \nabla_t + \frac{\beta}{2}\eta_t^2 \|v_t - x_t\|^2$   (optimality of $v_t$)

$\le f(x_t) - f(x^*) + \eta_t \big(f(x^*) - f(x_t)\big) + \frac{\beta}{2}\eta_t^2 \|v_t - x_t\|^2$   (convexity of $f$)

$\le (1 - \eta_t)\big(f(x_t) - f(x^*)\big) + \frac{\beta\eta_t^2}{2} D^2.$

Thus, with $h_t = f(x_t) - f(x^*)$:

$h_{t+1} \le (1 - \eta_t) h_t + O(\eta_t^2), \qquad \eta_t \approx \frac{2}{t} \ \Rightarrow\ h_t = O\!\left(\frac{1}{t}\right)$
Online Conditional Gradient

• Set $x_1 \in K$ arbitrarily
• For t = 1, 2, …:
  1. Use $x_t$, obtain $f_t$
  2. Compute $x_{t+1}$ as follows:

$v_t = \arg\min_{x \in K}\ \left(\sum_{i=1}^{t} \nabla f_i(x_i) + t\, x_t\right)^{\!\top} x$

$x_{t+1} \leftarrow (1 - \alpha_t)\, x_t + \alpha_t v_t$

Theorem [Hazan, Kale '12]: $\mathrm{Regret} = O(T^{3/4})$

Theorem [Garber, Hazan '13]: For polytopes, strongly convex and smooth losses,
1. Offline: convergence after t steps: $e^{-\Omega(t)}$
2. Online: $\mathrm{Regret} = O(\sqrt{T})$
Next few slides:
survey state-of-the-art
Non-convex optimization in ML

[Figure: a learning machine mapping inputs $\{a\} \in \mathbb{R}^d$ to a distribution over labels.]

$\arg\min_{x \in \mathbb{R}^d}\ \frac{1}{m}\sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$
Solution concepts for non-convex optimization?

[Figure: embedded slide from Duchi (UC Berkeley), "Convex Optimization for Machine Learning": "But generally speaking... we're screwed" - local (non-global) minima of $f_0$, all kinds of constraints (even restricting to continuous functions, e.g. $h(x) = \sin(2\pi x) = 0$), shown over a 3D plot of a highly non-convex surface.]

• Global minimization is NP-hard, even for degree-4 polynomials. Even local minimization up to 4th-order optimality conditions is NP-hard.

• Algorithmic stability is sufficient [Hardt, Recht, Singer '16]

• Optimization approaches:
• Finding vanishing gradients / local minima efficiently
• Graduated optimization / homotopy method
• Quasi-convexity
• Structure of local optima (probabilistic assumptions that
allow alternating minimization,…)
Gradient/Hessian based methods
Goal: find a point x such that
1. $\|\nabla f(x)\| \le \varepsilon$ (approximate first-order optimality)
2. $\nabla^2 f(x) \succeq -\varepsilon I$ (approximate second-order optimality)

1. (we've proved) The GD algorithm $x_{t+1} = x_t - \eta \nabla_t$ finds a point satisfying (1) in $O\!\left(\frac{1}{\varepsilon^2}\right)$ (expensive) iterations
2. (we've proved) The SGD algorithm $x_{t+1} = x_t - \eta \tilde\nabla_t$ finds a point satisfying (1) in $O\!\left(\frac{1}{\varepsilon^4}\right)$ (cheap) iterations
3. SGD with added noise finds a point satisfying (1 & 2) in $O\!\left(\frac{1}{\varepsilon^4}\right)$ (cheap) iterations (a sketch follows below)
[Ge, Huang, Jin, Yuan '15]
4. Recent second-order methods find a point satisfying (1 & 2) in $O\!\left(\frac{1}{\varepsilon^{7/4}}\right)$ (expensive) iterations
[Carmon, Duchi, Hinder, Sidford '16]
[Agarwal, Allen-Zhu, Bullins, Hazan, Ma '16]
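A sketch of item 3 above: SGD with explicitly injected isotropic Gaussian noise, in the spirit of [Ge, Huang, Jin, Yuan '15]; the noise scale, schedule, and oracle signature are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def noisy_sgd(stoch_grad, x0, eta, noise_std, T, seed=0):
    """SGD plus isotropic Gaussian noise, which helps escape saddle points
    (approximate second-order stationarity under suitable assumptions)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(T):
        g = stoch_grad(x, rng)                   # unbiased stochastic gradient
        xi = noise_std * rng.standard_normal(x.shape)
        x = x - eta * (g + xi)                   # perturbed update
    return x
```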
Recap

[Figures: an image-classification illustration ("chair/car") and the Duchi (UC Berkeley) non-convex optimization slide shown earlier.]

1. Online learning and stochastic optimization
• Regret minimization
• Online gradient descent

2. Regularization
• AdaGrad and optimal regularization

3. Advanced optimization
• Frank-Wolfe, acceleration, variance reduction, second order methods, non-convex optimization
Bibliography & more information, see:
http://www.cs.princeton.edu/~ehazan/tutorial/SimonsTutorial.htm

Thank you!
