
Tutorial: PART 1

Optimization for machine learning

Elad Hazan
Princeton University

With help from Sanjeev Arora and Yoram Singer


ML paradigm

A machine maps an input to a distribution over labels (e.g., chair/car):

$$a \in \mathbb{R}^d, \qquad b = f_{\text{parameters}}(a)$$

This tutorial - training the machine:
• Efficiency
• Generalization
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Offline/online/stochastic gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second order methods,
non-convex optimization

NOT touched upon:

• Parallelism/distributed computation (asynchronous optimization, HOGWILD!, etc.), Bayesian inference in graphical models, Markov chain Monte Carlo, partial-information and bandit algorithms
Mathematical optimization

Input: function $f: K \mapsto \mathbb{R}$, for $K \subseteq \mathbb{R}^d$

Output: minimizer $x \in K$, such that $f(x) \le f(y) \;\; \forall y \in K$

Accessing f? (values, differentials, …)

Generally NP-hard, given full access to the function.


Learning = optimization over data
(a.k.a. Empirical Risk Minimization)

But generally speaking, we're screwed:
• local (non-global) minima of f
• all kinds of constraints (even restricting to continuous functions, e.g. $h(x) = \sin(2\pi x) = 0$)

[Figure: non-convex surface with many local minima. Source: Duchi (UC Berkeley), Convex Optimization for Machine Learning]

Fitting the parameters of the model ("training") = optimization problem:


$$\arg\min_{x \in \mathbb{R}^d} \; \frac{1}{m}\sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$

m = # of examples, (a, b) = (features, labels), d = dimension
Example: linear classification

Given a sample $S = \{(a_1, b_1), \dots, (a_m, b_m)\}$, find a hyperplane (through the origin, w.l.o.g.) such that:

$$x = \arg\min_{\|x\|=1} \text{\# of mistakes} = \arg\min_{\|x\|=1} \big|\{\, i \;\; \text{s.t.} \;\; \mathrm{sign}(x^\top a_i) \ne b_i \,\}\big|$$

$$= \arg\min_{\|x\|=1} \frac{1}{m}\sum_i \ell(x, a_i, b_i), \quad \text{for } \ell(x, a_i, b_i) = \begin{cases} 1 & \mathrm{sign}(x^\top a_i) \ne b_i \\ 0 & \mathrm{sign}(x^\top a_i) = b_i \end{cases}$$

NP-hard!
Sum of signs → global optimization NP-hard!
but locally verifiable…

Local property that ensures global optimality?


Convexity

A function $f: \mathbb{R}^d \mapsto \mathbb{R}$ is convex if and only if:

$$f\left(\tfrac{1}{2}x + \tfrac{1}{2}y\right) \le \tfrac{1}{2}f(x) + \tfrac{1}{2}f(y)$$

• Informally: smiley ☺
• Alternative (first-order) definition:

$$f(y) \ge f(x) + \nabla f(x)^\top (y - x)$$
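As a quick sanity check of the first-order definition (a worked example added here, not from the slides), take $f(x) = x^2$ on $\mathbb{R}$:

$$f(y) - f(x) - f'(x)(y - x) = y^2 - x^2 - 2x(y - x) = (y - x)^2 \ge 0,$$

so every tangent line lies below the graph, exactly as the definition demands.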
Convex sets

Set K is convex if and only if:


𝑥, 𝑦 ∈ 𝐾 ⇒ (½𝑥 + ½𝑦) ∈ 𝐾
Loss functions: $\ell(x, a_i, b_i) = \ell(x^\top a_i \cdot b_i)$
Convex relaxations for linear (& kernel) classification

$$x = \arg\min_{\|x\|=1} \big|\{\, i \;\; \text{s.t.} \;\; \mathrm{sign}(x^\top a_i) \ne b_i \,\}\big|$$

1. Ridge / linear regression: $\ell(x^\top a_i, b_i) = (x^\top a_i - b_i)^2$

2. SVM (hinge loss): $\ell(x^\top a_i, b_i) = \max\{0, 1 - b_i x^\top a_i\}$

3. Logistic regression: $\ell(x^\top a_i, b_i) = \log(1 + e^{-b_i \cdot x^\top a_i})$
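For concreteness, here is a minimal numpy sketch of the three surrogate losses (an illustration added here; the function names are mine, not the tutorial's):

    import numpy as np

    # Ridge / linear regression surrogate: (x^T a - b)^2
    def square_loss(x, a, b):
        return (x @ a - b) ** 2

    # SVM surrogate (hinge): max(0, 1 - b * x^T a)
    def hinge_loss(x, a, b):
        return max(0.0, 1.0 - b * (x @ a))

    # Logistic surrogate: log(1 + exp(-b * x^T a))
    def logistic_loss(x, a, b):
        return np.log1p(np.exp(-b * (x @ a)))

    # Each is convex in x and upper-bounds the 0/1 mistake indicator (up to scaling).
    x, a, b = np.array([0.5, -0.2]), np.array([1.0, 2.0]), -1.0  # labels in {-1, +1}
    print(square_loss(x, a, b), hinge_loss(x, a, b), logistic_loss(x, a, b))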
We have: cast learning as mathematical optimization, and argued that convexity is algorithmically important.

Next ⇒ algorithms!
Gradient descent, constrained set

$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$$
$$x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$

where $[\nabla f(x)]_i = \frac{\partial}{\partial x_i} f(x)$

[Figure: iterates $p_1, p_2, p_3$ projected back onto $K$, approaching $p^*$]
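A minimal sketch of this update in Python (my illustration, not the tutorial's code; I take $K$ to be the unit Euclidean ball so the projection has a closed form):

    import numpy as np

    # Euclidean projection onto the L2 ball of given radius (our choice of K).
    def project_l2_ball(y, radius=1.0):
        norm = np.linalg.norm(y)
        return y if norm <= radius else (radius / norm) * y

    # Constrained GD: gradient step, then project back onto K.
    def projected_gradient_descent(grad, x0, eta, T, project=project_l2_ball):
        x, iterates = x0, [x0]
        for _ in range(T):
            y = x - eta * grad(x)   # y_{t+1} = x_t - eta * grad f(x_t)
            x = project(y)          # x_{t+1} = arg min_{x in K} ||y_{t+1} - x||
            iterates.append(x)
        return np.mean(iterates, axis=0)  # the theorem below bounds f at the average iterate

    # Example: minimize f(x) = ||x - c||^2 over the unit ball, with c outside the ball.
    c = np.array([2.0, 0.0])
    x_bar = projected_gradient_descent(lambda x: 2 * (x - c), np.zeros(2), 0.1, 500)
    print(x_bar)  # close to [1, 0], the projection of c onto the ball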
Convergence of gradient descent

$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t), \qquad x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$

Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$:

$$f\left(\frac{1}{T}\sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$

Where:
• G = upper bound on the norm of the gradients: $\|\nabla f(x_t)\| \le G$
• D = diameter of the constraint set: $\forall x, y \in K, \;\; \|x - y\| \le D$
Proof: recall $y_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$ and $x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$.

1. Observation 1 (expand the square):

$$\|x^* - y_{t+1}\|^2 = \|x^* - x_t\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 \|\nabla f(x_t)\|^2$$

2. Observation 2 (this is the Pythagorean theorem: projecting onto the convex set $K$ cannot increase the distance to any point of $K$):

$$\|x^* - x_{t+1}\|^2 \le \|x^* - y_{t+1}\|^2$$

Thus:

$$\|x^* - x_{t+1}\|^2 \le \|x^* - x_t\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 G^2$$

And hence, by convexity (Jensen, then the first-order definition) and telescoping:

$$f\left(\frac{1}{T}\sum_t x_t\right) - f(x^*) \le \frac{1}{T}\sum_t \left[f(x_t) - f(x^*)\right] \le \frac{1}{T}\sum_t \nabla f(x_t)^\top (x_t - x^*)$$

$$\le \frac{1}{T}\sum_t \left[\frac{1}{2\eta}\left(\|x^* - x_t\|^2 - \|x^* - x_{t+1}\|^2\right) + \frac{\eta}{2} G^2\right] \le \frac{D^2}{2\eta T} + \frac{\eta}{2} G^2 \le \frac{DG}{\sqrt{T}}$$
Recap

Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$:

$$f\left(\frac{1}{T}\sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$

Thus, to get an $\epsilon$-approximate solution, apply $O\!\left(\frac{1}{\epsilon^2}\right)$ gradient iterations.
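Spelling out the iteration count (a one-line derivation added for completeness): the error term is at most $\epsilon$ once

$$\frac{DG}{\sqrt{T}} \le \epsilon \iff T \ge \frac{D^2 G^2}{\epsilon^2},$$

i.e. $O(1/\epsilon^2)$ iterations, treating $D$ and $G$ as constants.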
Gradient Descent - caveat

For ERM problems

$$\arg\min_{x \in \mathbb{R}^d} \; \frac{1}{m}\sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$$

1. The gradient depends on all the data

2. What about generalization?
Next few slides:

Simultaneous optimization and generalization


⇒ Faster optimization! (single example per iteration)
Statistical (PAC) learning

Nature: i.i.d. samples from distribution D over $A \times B = \{(a, b)\}$

[Figure: examples $(a_1, b_1), \dots, (a_m, b_m)$ flow to the learner, which picks one of the hypotheses $h_1, h_2, \dots, h_N$]

Learner: outputs hypothesis $h \in H$

Loss, e.g. $\ell(h, (a, b)) = (h(a) - b)^2$

$$\mathrm{err}(h) = \mathbb{E}_{(a,b) \sim D}\left[\ell(h, (a, b))\right]$$

Hypothesis class $H: X \to Y$ is learnable if $\forall \epsilon, \delta > 0$ there exists an algorithm s.t. after seeing $m$ examples, for $m = \mathrm{poly}(\delta, \epsilon, \mathrm{dimension}(H))$, it finds $h$ s.t. with probability $1 - \delta$:

$$\mathrm{err}(h) \le \min_{h^* \in H} \mathrm{err}(h^*) + \epsilon$$
More powerful setting:
Online Learning in Games

Iteratively, for $t = 1, 2, \dots, T$:
• Player: $h_t \in H$
• Adversary: $(a_t, b_t) \in A \times B$
• Loss $\ell(h_t, (a_t, b_t))$

Goal: minimize (average, expected) regret:

$$\frac{1}{T}\left[\sum_t \ell(h_t, (a_t, b_t)) - \min_{h^* \in H} \sum_t \ell(h^*, (a_t, b_t))\right] \;\xrightarrow{T \to \infty}\; 0$$

Vanishing regret → generalization in the PAC setting! (online2batch)

From this point onwards: $f_t(x) = \ell(x, (a_t, b_t))$ = loss for one example
Can we minimize regret efficiently?
Online gradient descent [Zinkevich '05]

$$y_{t+1} = x_t - \eta \nabla f_t(x_t)$$
$$x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$

Theorem: $\displaystyle \text{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*) = O(\sqrt{T})$
Analysis

Let $\nabla_t := \nabla f_t(x_t)$.

Observation 1:
$$\|y_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$$

Observation 2 (Pythagoras):
$$\|x_{t+1} - x^*\| \le \|y_{t+1} - x^*\|$$

Thus:
$$\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2 - 2\eta \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$$

Convexity:
$$\sum_t \left[f_t(x_t) - f_t(x^*)\right] \le \sum_t \nabla_t^\top (x_t - x^*)$$
$$\le \frac{1}{2\eta} \sum_t \left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right) + \frac{\eta}{2} \sum_t \|\nabla_t\|^2$$
$$\le \frac{1}{2\eta}\|x_1 - x^*\|^2 + \frac{\eta}{2} T G^2 = O(\sqrt{T})$$

Lower bound: $\text{Regret} = \Omega(\sqrt{T})$

• 2 loss functions, T iterations:
• $K = [-1, 1]$, $f_1(x) = x$, $f_2(x) = -x$
• Adversary picks one of the two at random each round (second expert's loss = first × (−1))
• Expected loss = 0 (for any algorithm)
• Regret (compared to the better of −1 and 1):

$$\mathbb{E}\left[\,\big|\#1\text{'s} - \#(-1)\text{'s}\big|\,\right] = \Omega(\sqrt{T})$$

(the expected absolute deviation of a length-T random walk)
Stochastic gradient descent

Learning problem:

$$\arg\min_{x \in \mathbb{R}^d} F(x) = \mathbb{E}_{(a_i, b_i)}\left[\ell_i(x, a_i, b_i)\right]$$

Random example: $f_t(x) = \ell_i(x, a_i, b_i)$

[Figure: non-convex surface, repeated from Duchi (UC Berkeley), Convex Optimization for Machine Learning]

1. We have proved (for any sequence of $\nabla_t$):

$$\frac{1}{T}\sum_t \nabla_t^\top x_t \le \min_{x^* \in K} \frac{1}{T}\sum_t \nabla_t^\top x^* + \frac{DG}{\sqrt{T}}$$

2. Taking (conditional) expectation:

$$\mathbb{E}\left[F\left(\frac{1}{T}\sum_t x_t\right)\right] - \min_{x^* \in K} F(x^*) \le \mathbb{E}\left[\frac{1}{T}\sum_t \nabla_t^\top (x_t - x^*)\right] \le \frac{DG}{\sqrt{T}}$$

One example per step, same convergence as GD, and it gives generalization directly! (formally needs martingales)

$O\!\left(\frac{d}{\epsilon^2}\right)$ vs. $O\!\left(\frac{md}{\epsilon^2}\right)$ total running time for $\epsilon$ generalization error.
Stochastic vs. full gradient descent
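A minimal SGD sketch in Python (an illustration added here, with assumed function names; one uniformly sampled example per step, returning the average iterate as in the bound above):

    import numpy as np

    # loss_grad(x, a, b) returns the gradient of the per-example loss at x.
    def sgd(loss_grad, data, x0, eta, T, project=lambda x: x):
        rng = np.random.default_rng(0)
        x, iterates = x0, [x0]
        for _ in range(T):
            a, b = data[rng.integers(len(data))]      # single random example
            x = project(x - eta * loss_grad(x, a, b))
            iterates.append(x)
        return np.mean(iterates, axis=0)              # average iterate

    # Least squares: the gradient of (x^T a - b)^2 is 2 (x^T a - b) a.
    data = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), -1.0)]
    grad = lambda x, a, b: 2 * (x @ a - b) * a
    print(sgd(grad, data, np.zeros(2), eta=0.05, T=2000))  # approx [1, -1]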
Regularization &
Gradient Descent++

Why "regularize"?

• Statistical learning theory / Occam's razor: the # of examples needed to learn a hypothesis class scales with its "dimension":
  • VC dimension
  • Fat-shattering dimension
  • Rademacher width
  • Margin/norm of linear/kernel classifier

• PAC theory: regularization ↔ reduced complexity

• Regret minimization: regularization ↔ stability
Minimize regret: best-in-hindsight

$$\text{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*)$$

• Most natural ("follow the leader"):

$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x)$$

• Provably works [Kalai-Vempala '05]:

$$x_t' = \arg\min_{x \in K} \sum_{i=1}^{t} f_i(x) = x_{t+1}$$

• So if $x_t \approx x_{t+1}$, we get a regret bound

• But the instability $x_t - x_{t+1}$ can be large! (e.g., on $K = [-1, 1]$ with alternating losses $f_t(x) = \pm x$, the leader flips between the two endpoints every round and suffers linear regret)
Fixing FTL: Follow-The-Regularized-Leader (FTRL)

• Linearize: replace $f_t$ by a linear function, $\nabla f_t(x_t)^\top x$
• Add regularization:

$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla_i^\top x + \frac{1}{\eta} R(x)$$

• R(x) is a strongly convex function; it ensures stability:

$$\nabla_t^\top (x_t - x_{t+1}) = O(\eta)$$

FTRL vs. gradient descent

• $R(x) = \frac{1}{2}\|x\|^2$:

$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) = \Pi_K\!\left(-\eta \sum_{i=1}^{t-1} \nabla f_i(x_i)\right)$$

• Essentially OGD ("lazy" projection): starting with $y_1 = 0$, for $t = 1, 2, \dots$:

$$x_t = \Pi_K(y_t), \qquad y_{t+1} = y_t - \eta \nabla f_t(x_t)$$
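To see the first equality above (a step the slide leaves implicit), complete the square with $R(x) = \frac{1}{2}\|x\|^2$ and $\nabla_i := \nabla f_i(x_i)$:

$$\sum_{i=1}^{t-1} \nabla_i^\top x + \frac{1}{2\eta}\|x\|^2 = \frac{1}{2\eta}\Big\|x + \eta \sum_{i=1}^{t-1} \nabla_i\Big\|^2 - \frac{\eta}{2}\Big\|\sum_{i=1}^{t-1} \nabla_i\Big\|^2,$$

so the minimizer over $K$ is exactly the Euclidean projection of $-\eta \sum_{i=1}^{t-1} \nabla_i$ onto $K$.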
FTRL vs. Multiplicative Weights

• Experts setting: $K = \Delta_d$, distributions over experts
• $f_t(x) = c_t^\top x$, where $c_t$ is the vector of losses
• $R(x) = \sum_i x_i \log x_i$: negative entropy

$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) = \exp\!\left(-\eta \sum_{i=1}^{t-1} c_i\right) \Big/ Z_t$$

(entrywise exponential; $Z_t$ is the normalization constant)

• Gives the Multiplicative Weights method!
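A minimal MW sketch in Python (my illustration; the interface is assumed, not from the tutorial):

    import numpy as np

    # FTRL with the negative-entropy regularizer over the simplex = MW.
    def multiplicative_weights(loss_vectors, eta):
        cumulative = np.zeros(len(loss_vectors[0]))
        plays = []
        for c in loss_vectors:
            w = np.exp(-eta * cumulative)  # entrywise exponential of cumulative losses
            plays.append(w / w.sum())      # divide by Z_t to land on the simplex
            cumulative += c                # adversary reveals this round's losses
        return plays

    # Two experts; expert 0 is better on average, so its weight grows.
    losses = [np.array([0.1, 0.9]), np.array([0.3, 0.7]), np.array([0.2, 0.8])]
    for x_t in multiplicative_weights(losses, eta=0.5):
        print(x_t)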


FTRL ⇔ Online Mirror Descent

$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x)$$

Bregman projection:

$$\Pi_K^R(y) = \arg\min_{x \in K} B_R(x \| y), \qquad B_R(x \| y) := R(x) - R(y) - \nabla R(y)^\top (x - y)$$

Online mirror descent update:

$$x_t = \Pi_K^R(y_t), \qquad y_{t+1} = (\nabla R)^{-1}\left(\nabla R(y_t) - \eta \nabla f_t(x_t)\right)$$
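As a sanity check (added here), the Euclidean regularizer recovers online gradient descent:

$$R(x) = \tfrac{1}{2}\|x\|^2 \;\Rightarrow\; \nabla R = \mathrm{id}, \qquad B_R(x \| y) = \tfrac{1}{2}\|x - y\|^2,$$

so the update becomes $y_{t+1} = y_t - \eta \nabla f_t(x_t)$ with the ordinary Euclidean projection onto $K$.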
Adaptive Regularization: AdaGrad

• Consider a generalized linear model: the prediction is a function of $a^\top x$, so the gradient is a scaled copy of the example,

$$\nabla f_t(x) = \ell'(x, a_t, b_t) \cdot a_t$$

• OGD update: $x_{t+1} = x_t - \eta \nabla_t = x_t - \eta \, \ell'(x, a_t, b_t) \, a_t$

• All features are treated equally when updating the parameter vector

• But in typical text classification tasks, the feature vectors $a_t$ are very sparse →

Slow learning!

• Adaptive regularization: per-feature learning rates
Optimal regularization

• The general RFTL form:

$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x) + \frac{1}{\eta} R(x)$$

• Which regularizer to pick?

• AdaGrad: treat this as a learning problem! Family of regularizations:

$$R(x) = \|x\|_A^2 \;\; \text{s.t.} \;\; A \succeq 0, \;\; \mathrm{Trace}(A) = d$$

• Objective in the matrix world: best regret in hindsight!

AdaGrad (diagonal form)

• Set $x_1 \in K$ arbitrarily
• For $t = 1, 2, \dots$:
  1. use $x_t$, obtain $f_t$
  2. compute $x_{t+1}$ as follows:

$$G_t = \mathrm{diag}\left(\sum_{i=1}^{t} \nabla f_i(x_i) \nabla f_i(x_i)^\top\right)$$
$$y_{t+1} = x_t - \eta \, G_t^{-1/2} \nabla f_t(x_t)$$
$$x_{t+1} = \arg\min_{x \in K} (y_{t+1} - x)^\top G_t (y_{t+1} - x)$$

• Regret bound [Duchi, Hazan, Singer '10]: $O\!\left(\sum_i \sqrt{\sum_t \nabla_{t,i}^2}\right)$, can be $\sqrt{d}$ better than SGD

• Infrequently occurring, or small-scale, features have small influence on regret (and therefore on convergence to the optimal parameter)
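A minimal diagonal-AdaGrad sketch in Python (an illustration added here; unconstrained, so the projection step is trivial, and the small eps guards against division by zero):

    import numpy as np

    # grads_fn(t, x) returns the gradient of f_t at x.
    def adagrad_diagonal(grads_fn, x0, eta, T, eps=1e-8):
        x = x0.copy()
        g_sq = np.zeros_like(x0)    # per-coordinate sum of squared gradients (diag of G_t)
        for t in range(T):
            g = grads_fn(t, x)
            g_sq += g * g
            x -= eta * g / (np.sqrt(g_sq) + eps)   # per-feature learning rates
        return x

    # Sparse-feature example: coordinate 1 appears in only 10% of the rounds,
    # yet AdaGrad still gives it a comparably large effective step size.
    def grad(t, x):
        a = np.array([1.0, 0.0]) if t % 10 else np.array([0.0, 1.0])
        return 2 * (x @ a - 1.0) * a   # gradient of (x^T a - 1)^2

    print(adagrad_diagonal(grad, np.zeros(2), eta=0.5, T=500))  # both coordinates near 1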
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Offline/stochastic/online gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second order methods,
non-convex optimization
