
Tutorial: PART 1

Optimization for machine learning

Elad Hazan
Princeton University

With help from Sanjeev Arora and Yoram Singer


ML paradigm

A machine maps an input to a distribution over labels (e.g., chair/car):

$$a \in \mathbb{R}^d, \qquad b = f_{\text{parameters}}(a)$$

This tutorial - training the machine:
• Efficiency
• Generalization
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Offline/online/stochastic gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second order methods,
non-convex optimization

NOT touched upon:

• Parallelism/distributed computation (asynchronous optimization, HOGWILD!, etc.), Bayesian inference in graphical models, Markov chain Monte Carlo, partial-information and bandit algorithms
Mathematical optimization

Input: function $f: K \mapsto \mathbb{R}$, for $K \subseteq \mathbb{R}^d$

Output: minimizer $x \in K$, such that $f(x) \le f(y) \;\; \forall y \in K$

Accessing f? (values, differentials, …)

Generally NP-hard, given full access to the function.


Learning = optimization over data
(a.k.a. Empirical Risk Minimization)

But generally speaking, we're screwed:
• local (non-global) minima of f
• all kinds of constraints (even restricting to continuous functions, e.g. $h(x) = \sin(2\pi x) = 0$)

[Figure: non-convex surface with many local minima. Source: Duchi (UC Berkeley), Convex Optimization for Machine Learning]

Fitting the parameters of the model ("training") = optimization problem:


$$\arg\min_{x \in \mathbb{R}^d} \; \frac{1}{m}\sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$

m = # of examples, (a, b) = (features, labels), d = dimension
Example: linear classification

Given a sample $S = \{(a_1, b_1), \dots, (a_m, b_m)\}$, find a hyperplane (through the origin, w.l.o.g.) such that:

$$x = \arg\min_{\|x\|=1} \text{\# of mistakes} = \arg\min_{\|x\|=1} \big|\{\, i \;\; \text{s.t.} \;\; \mathrm{sign}(x^\top a_i) \ne b_i \,\}\big|$$

$$= \arg\min_{\|x\|=1} \frac{1}{m}\sum_i \ell(x, a_i, b_i), \quad \text{for } \ell(x, a_i, b_i) = \begin{cases} 1 & \mathrm{sign}(x^\top a_i) \ne b_i \\ 0 & \mathrm{sign}(x^\top a_i) = b_i \end{cases}$$

NP-hard!
Sum of signs → global optimization NP-hard!
but locally verifiable…

Local property that ensures global optimality?


Convexity

A function $f: \mathbb{R}^d \mapsto \mathbb{R}$ is convex if and only if:

$$f\left(\tfrac{1}{2}x + \tfrac{1}{2}y\right) \le \tfrac{1}{2}f(x) + \tfrac{1}{2}f(y)$$

• Informally: smiley ☺
• Alternative (first-order) definition:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x)$$
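As a quick sanity check of the first-order definition (a worked example added here, not from the slides), take $f(x) = x^2$ on $\mathbb{R}$:

$$f(y) - f(x) - f'(x)(y - x) = y^2 - x^2 - 2x(y - x) = (y - x)^2 \ge 0,$$

so every tangent line lies below the graph, exactly as the definition demands.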
Convex sets

Set K is convex if and only if:


𝑥, 𝑦 ∈ 𝐾 ⇒ (½𝑥 + ½𝑦) ∈ 𝐾
Loss functions: $\ell(x, a_i, b_i) = \ell(x^\top a_i \cdot b_i)$
Convex relaxations for linear (& kernel) classification

$$x = \arg\min_{\|x\|=1} \big|\{\, i \;\; \text{s.t.} \;\; \mathrm{sign}(x^\top a_i) \ne b_i \,\}\big|$$

1. Ridge / linear regression: $\ell(x^\top a_i, b_i) = (x^\top a_i - b_i)^2$

2. SVM (hinge loss): $\ell(x^\top a_i, b_i) = \max\{0, 1 - b_i x^\top a_i\}$

3. Logistic regression: $\ell(x^\top a_i, b_i) = \log(1 + e^{-b_i \cdot x^\top a_i})$
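For concreteness, here is a minimal numpy sketch of the three surrogate losses (an illustration added here; the function names are mine, not the tutorial's):

    import numpy as np

    # Ridge / linear regression surrogate: (x^T a - b)^2
    def square_loss(x, a, b):
        return (x @ a - b) ** 2

    # SVM surrogate (hinge): max(0, 1 - b * x^T a)
    def hinge_loss(x, a, b):
        return max(0.0, 1.0 - b * (x @ a))

    # Logistic surrogate: log(1 + exp(-b * x^T a))
    def logistic_loss(x, a, b):
        return np.log1p(np.exp(-b * (x @ a)))

    # Each is convex in x and upper-bounds the 0/1 mistake indicator (up to scaling).
    x, a, b = np.array([0.5, -0.2]), np.array([1.0, 2.0]), -1.0  # labels in {-1, +1}
    print(square_loss(x, a, b), hinge_loss(x, a, b), logistic_loss(x, a, b))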
We have: cast learning as mathematical optimization, and argued that convexity is algorithmically important.

Next ⇒ algorithms!
Gradient descent, constrained set

$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$$
$$x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$

where $[\nabla f(x)]_i = \frac{\partial}{\partial x_i} f(x)$

[Figure: iterates $p_1, p_2, p_3$ projected back onto $K$, approaching $p^*$]
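A minimal sketch of this update in Python (my illustration, not the tutorial's code; I take $K$ to be the unit Euclidean ball so the projection has a closed form):

    import numpy as np

    # Euclidean projection onto the L2 ball of given radius (our choice of K).
    def project_l2_ball(y, radius=1.0):
        norm = np.linalg.norm(y)
        return y if norm <= radius else (radius / norm) * y

    # Constrained GD: gradient step, then project back onto K.
    def projected_gradient_descent(grad, x0, eta, T, project=project_l2_ball):
        x, iterates = x0, [x0]
        for _ in range(T):
            y = x - eta * grad(x)   # y_{t+1} = x_t - eta * grad f(x_t)
            x = project(y)          # x_{t+1} = arg min_{x in K} ||y_{t+1} - x||
            iterates.append(x)
        return np.mean(iterates, axis=0)  # the theorem below bounds f at the average iterate

    # Example: minimize f(x) = ||x - c||^2 over the unit ball, with c outside the ball.
    c = np.array([2.0, 0.0])
    x_bar = projected_gradient_descent(lambda x: 2 * (x - c), np.zeros(2), 0.1, 500)
    print(x_bar)  # close to [1, 0], the projection of c onto the ball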
Convergence of gradient descent

$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$

Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$:

$$f\left(\frac{1}{T}\sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$

Where:
• G = upper bound on the norm of the gradients: $\|\nabla f(x_t)\| \le G$
• D = diameter of the constraint set: $\forall x, y \in K, \;\; \|x - y\| \le D$
Proof: recall $y_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$ and $x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$.

1. Observation 1 (expand the square):

$$\|x^* - y_{t+1}\|^2 = \|x^* - x_t\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 \|\nabla f(x_t)\|^2$$

2. Observation 2 (this is the Pythagorean theorem: projecting onto the convex set $K$ cannot increase the distance to any point of $K$):

$$\|x^* - x_{t+1}\|^2 \le \|x^* - y_{t+1}\|^2$$

Thus:

$$\|x^* - x_{t+1}\|^2 \le \|x^* - x_t\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 G^2$$

And hence, by convexity (Jensen, then the first-order definition) and telescoping:

$$f\left(\frac{1}{T}\sum_t x_t\right) - f(x^*) \le \frac{1}{T}\sum_t \left[f(x_t) - f(x^*)\right] \le \frac{1}{T}\sum_t \nabla f(x_t)^\top (x_t - x^*)$$

$$\le \frac{1}{T}\sum_t \left[\frac{1}{2\eta}\left(\|x^* - x_t\|^2 - \|x^* - x_{t+1}\|^2\right) + \frac{\eta}{2} G^2\right] \le \frac{D^2}{2\eta T} + \frac{\eta}{2} G^2 \le \frac{DG}{\sqrt{T}}$$
Recap

Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$:

$$f\left(\frac{1}{T}\sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$

Thus, to get an $\epsilon$-approximate solution, apply $O\!\left(\frac{1}{\epsilon^2}\right)$ gradient iterations.
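Spelling out the iteration count (a one-line derivation added for completeness): the error term is at most $\epsilon$ once

$$\frac{DG}{\sqrt{T}} \le \epsilon \iff T \ge \frac{D^2 G^2}{\epsilon^2},$$

i.e. $O(1/\epsilon^2)$ iterations, treating $D$ and $G$ as constants.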
Gradient Descent - caveat

For ERM problems

$$\arg\min_{x \in \mathbb{R}^d} \; \frac{1}{m}\sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$

1. The gradient depends on all the data

2. What about generalization?
Next few slides:

Simultaneous optimization and generalization


⇒ Faster optimization! (single example per iteration)
Statistical (PAC) learning

Nature: i.i.d. samples from distribution D over $A \times B = \{(a, b)\}$

[Figure: examples $(a_1, b_1), \dots, (a_m, b_m)$ flow to the learner, which picks one of the hypotheses $h_1, h_2, \dots, h_N$]

Learner: outputs hypothesis $h \in H$

Loss, e.g. $\ell(h, (a, b)) = (h(a) - b)^2$

$$\mathrm{err}(h) = \mathbb{E}_{(a,b) \sim D}\left[\ell(h, (a, b))\right]$$

Hypothesis class $H: X \to Y$ is learnable if $\forall \epsilon, \delta > 0$ there exists an algorithm s.t. after seeing $m$ examples, for $m = \mathrm{poly}(\delta, \epsilon, \mathrm{dimension}(H))$, it finds $h$ s.t. with probability $1 - \delta$:

$$\mathrm{err}(h) \le \min_{h^* \in H} \mathrm{err}(h^*) + \epsilon$$
More powerful setting:
Online Learning in Games

Iteratively, for $t = 1, 2, \dots, T$:
• Player: $h_t \in H$
• Adversary: $(a_t, b_t) \in A \times B$
• Loss $\ell(h_t, (a_t, b_t))$

Goal: minimize (average, expected) regret:

$$\frac{1}{T}\left[\sum_t \ell(h_t, (a_t, b_t)) - \min_{h^* \in H} \sum_t \ell(h^*, (a_t, b_t))\right] \;\xrightarrow{T \to \infty}\; 0$$

Vanishing regret → generalization in the PAC setting! (online2batch)

From this point onwards: $f_t(x) = \ell(x, (a_t, b_t))$ = loss for one example
Can we minimize regret efficiently?
Online gradient descent [Zinkevich '05]

$$y_{t+1} = x_t - \eta \nabla f_t(x_t)$$
$$x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$

Theorem: $\displaystyle \text{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*) = O(\sqrt{T})$
Analysis

Let $\nabla_t := \nabla f_t(x_t)$.

Observation 1:
$$\|y_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$$

Observation 2 (Pythagoras):
$$\|x_{t+1} - x^*\| \le \|y_{t+1} - x^*\|$$

Thus:
$$\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2 - 2\eta \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$$

Convexity:
$$\sum_t \left[f_t(x_t) - f_t(x^*)\right] \le \sum_t \nabla_t^\top (x_t - x^*)$$
$$\le \frac{1}{2\eta} \sum_t \left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right) + \frac{\eta}{2} \sum_t \|\nabla_t\|^2$$
$$\le \frac{1}{2\eta}\|x_1 - x^*\|^2 + \frac{\eta}{2} T G^2 = O(\sqrt{T})$$

Lower bound: $\text{Regret} = \Omega(\sqrt{T})$

• 2 loss functions, T iterations:
• $K = [-1, 1]$, $f_1(x) = x$, $f_2(x) = -x$
• Adversary picks one of the two at random each round (second expert's loss = first × (−1))
• Expected loss = 0 (for any algorithm)
• Regret (compared to the better of −1 and 1):

$$\mathbb{E}\left[\,\big|\#1\text{'s} - \#(-1)\text{'s}\big|\,\right] = \Omega(\sqrt{T})$$

(the expected absolute deviation of a length-T random walk)
Stochastic gradient descent

Learning problem:

$$\arg\min_{x \in \mathbb{R}^d} F(x) = \mathbb{E}_{(a_i, b_i)}\left[\ell_i(x, a_i, b_i)\right]$$

Random example: $f_t(x) = \ell_i(x, a_i, b_i)$

[Figure: non-convex surface, repeated from Duchi (UC Berkeley), Convex Optimization for Machine Learning]

1. We have proved (for any sequence of $\nabla_t$):

$$\frac{1}{T}\sum_t \nabla_t^\top x_t \le \min_{x^* \in K} \frac{1}{T}\sum_t \nabla_t^\top x^* + \frac{DG}{\sqrt{T}}$$

2. Taking (conditional) expectation:

$$\mathbb{E}\left[F\left(\frac{1}{T}\sum_t x_t\right)\right] - \min_{x^* \in K} F(x^*) \le \mathbb{E}\left[\frac{1}{T}\sum_t \nabla_t^\top (x_t - x^*)\right] \le \frac{DG}{\sqrt{T}}$$

One example per step, same convergence as GD, and it gives generalization directly! (formally needs martingales)

$O\!\left(\frac{d}{\epsilon^2}\right)$ vs. $O\!\left(\frac{md}{\epsilon^2}\right)$ total running time for $\epsilon$ generalization error.
Stochastic vs. full gradient descent
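A minimal SGD sketch in Python (an illustration added here, with assumed function names; one uniformly sampled example per step, returning the average iterate as in the bound above):

    import numpy as np

    # loss_grad(x, a, b) returns the gradient of the per-example loss at x.
    def sgd(loss_grad, data, x0, eta, T, project=lambda x: x):
        rng = np.random.default_rng(0)
        x, iterates = x0, [x0]
        for _ in range(T):
            a, b = data[rng.integers(len(data))]      # single random example
            x = project(x - eta * loss_grad(x, a, b))
            iterates.append(x)
        return np.mean(iterates, axis=0)              # average iterate

    # Least squares: the gradient of (x^T a - b)^2 is 2 (x^T a - b) a.
    data = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), -1.0)]
    grad = lambda x, a, b: 2 * (x @ a - b) * a
    print(sgd(grad, data, np.zeros(2), eta=0.05, T=2000))  # approx [1, -1]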
Regularization &
Gradient Descent++

Why "regularize"?

• Statistical learning theory / Occam's razor: the # of examples needed to learn a hypothesis class scales with its "dimension":
  • VC dimension
  • Fat-shattering dimension
  • Rademacher width
  • Margin/norm of linear/kernel classifier

• PAC theory: regularization ↔ reduced complexity

• Regret minimization: regularization ↔ stability
Minimize regret: best-in-hindsight

$$\text{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*)$$

• Most natural ("follow the leader"):

$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x)$$

• Provably works [Kalai-Vempala '05]:

$$x_t' = \arg\min_{x \in K} \sum_{i=1}^{t} f_i(x) = x_{t+1}$$

• So if $x_t \approx x_{t+1}$, we get a regret bound

• But the instability $x_t - x_{t+1}$ can be large! (e.g., on $K = [-1, 1]$ with alternating losses $f_t(x) = \pm x$, the leader flips between the two endpoints every round and suffers linear regret)
Fixing FTL: Follow-The-Regularized-Leader (FTRL)

• Linearize: replace $f_t$ by a linear function, $\nabla f_t(x_t)^\top x$
• Add regularization:

$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla_i^\top x + \frac{1}{\eta} R(x)$$

• R(x) is a strongly convex function; it ensures stability:

$$\nabla_t^\top (x_t - x_{t+1}) = O(\eta)$$

FTRL vs. gradient descent

• $R(x) = \frac{1}{2}\|x\|^2$:

$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) = \Pi_K\!\left(-\eta \sum_{i=1}^{t-1} \nabla f_i(x_i)\right)$$

• Essentially OGD ("lazy" projection): starting with $y_1 = 0$, for $t = 1, 2, \dots$:

$$x_t = \Pi_K(y_t), \qquad y_{t+1} = y_t - \eta \nabla f_t(x_t)$$
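To see the first equality above (a step the slide leaves implicit), complete the square with $R(x) = \frac{1}{2}\|x\|^2$ and $\nabla_i := \nabla f_i(x_i)$:

$$\sum_{i=1}^{t-1} \nabla_i^\top x + \frac{1}{2\eta}\|x\|^2 = \frac{1}{2\eta}\Big\|x + \eta \sum_{i=1}^{t-1} \nabla_i\Big\|^2 - \frac{\eta}{2}\Big\|\sum_{i=1}^{t-1} \nabla_i\Big\|^2,$$

so the minimizer over $K$ is exactly the Euclidean projection of $-\eta \sum_{i=1}^{t-1} \nabla_i$ onto $K$.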
FTRL vs. Multiplicative Weights

• Experts setting: $K = \Delta_d$, distributions over experts
• $f_t(x) = c_t^\top x$, where $c_t$ is the vector of losses
• $R(x) = \sum_i x_i \log x_i$: negative entropy

$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) = \exp\!\left(-\eta \sum_{i=1}^{t-1} c_i\right) \Big/ Z_t$$

(entrywise exponential; $Z_t$ is the normalization constant)

• Gives the Multiplicative Weights method!
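A minimal MW sketch in Python (my illustration; the interface is assumed, not from the tutorial):

    import numpy as np

    # FTRL with the negative-entropy regularizer over the simplex = MW.
    def multiplicative_weights(loss_vectors, eta):
        cumulative = np.zeros(len(loss_vectors[0]))
        plays = []
        for c in loss_vectors:
            w = np.exp(-eta * cumulative)  # entrywise exponential of cumulative losses
            plays.append(w / w.sum())      # divide by Z_t to land on the simplex
            cumulative += c                # adversary reveals this round's losses
        return plays

    # Two experts; expert 0 is better on average, so its weight grows.
    losses = [np.array([0.1, 0.9]), np.array([0.3, 0.7]), np.array([0.2, 0.8])]
    for x_t in multiplicative_weights(losses, eta=0.5):
        print(x_t)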


FTRL ⇔ Online Mirror Descent

$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x)$$

Bregman projection:

$$\Pi_K^R(y) = \arg\min_{x \in K} B_R(x \| y), \qquad B_R(x \| y) := R(x) - R(y) - \nabla R(y)^\top (x - y)$$

Online mirror descent update:

$$x_t = \Pi_K^R(y_t), \qquad y_{t+1} = (\nabla R)^{-1}\left(\nabla R(y_t) - \eta \nabla f_t(x_t)\right)$$
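As a sanity check (added here), the Euclidean regularizer recovers online gradient descent:

$$R(x) = \tfrac{1}{2}\|x\|^2 \;\Rightarrow\; \nabla R = \mathrm{id}, \qquad B_R(x \| y) = \tfrac{1}{2}\|x - y\|^2,$$

so the update becomes $y_{t+1} = y_t - \eta \nabla f_t(x_t)$ with the ordinary Euclidean projection onto $K$.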
Adaptive Regularization: AdaGrad

• Consider a generalized linear model: the prediction is a function of $a^\top x$, so the gradient is a scaled copy of the example,

$$\nabla f_t(x) = \ell'(x, a_t, b_t) \cdot a_t$$

• OGD update: $x_{t+1} = x_t - \eta \nabla_t = x_t - \eta \, \ell'(x, a_t, b_t) \, a_t$

• All features are treated equally when updating the parameter vector

• But in typical text classification tasks, the feature vectors $a_t$ are very sparse →

Slow learning!

• Adaptive regularization: per-feature learning rates
Optimal regularization

• The general RFTL form:

$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x) + \frac{1}{\eta} R(x)$$

• Which regularizer to pick?

• AdaGrad: treat this as a learning problem! Family of regularizations:

$$R(x) = \|x\|_A^2 \;\; \text{s.t.} \;\; A \succeq 0, \;\; \mathrm{Trace}(A) = d$$

• Objective in the matrix world: best regret in hindsight!

AdaGrad (diagonal form)

• Set $x_1 \in K$ arbitrarily
• For $t = 1, 2, \dots$:
  1. use $x_t$, obtain $f_t$
  2. compute $x_{t+1}$ as follows:

$$G_t = \mathrm{diag}\left(\sum_{i=1}^{t} \nabla f_i(x_i) \nabla f_i(x_i)^\top\right)$$
$$y_{t+1} = x_t - \eta \, G_t^{-1/2} \nabla f_t(x_t)$$
$$x_{t+1} = \arg\min_{x \in K} (y_{t+1} - x)^\top G_t (y_{t+1} - x)$$

• Regret bound [Duchi, Hazan, Singer '10]: $O\!\left(\sum_i \sqrt{\sum_t \nabla_{t,i}^2}\right)$, can be $\sqrt{d}$ better than SGD

• Infrequently occurring, or small-scale, features have small influence on regret (and therefore on convergence to the optimal parameter)
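A minimal diagonal-AdaGrad sketch in Python (an illustration added here; unconstrained, so the projection step is trivial, and the small eps guards against division by zero):

    import numpy as np

    # grads_fn(t, x) returns the gradient of f_t at x.
    def adagrad_diagonal(grads_fn, x0, eta, T, eps=1e-8):
        x = x0.copy()
        g_sq = np.zeros_like(x0)    # per-coordinate sum of squared gradients (diag of G_t)
        for t in range(T):
            g = grads_fn(t, x)
            g_sq += g * g
            x -= eta * g / (np.sqrt(g_sq) + eps)   # per-feature learning rates
        return x

    # Sparse-feature example: coordinate 1 appears in only 10% of the rounds,
    # yet AdaGrad still gives it a comparably large effective step size.
    def grad(t, x):
        a = np.array([1.0, 0.0]) if t % 10 else np.array([0.0, 1.0])
        return 2 * (x @ a - 1.0) * a   # gradient of (x^T a - 1)^2

    print(adagrad_diagonal(grad, np.zeros(2), eta=0.5, T=500))  # both coordinates near 1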
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Offline/stochastic/online gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second order methods,
non-convex optimization
