Berkeley tutorial: Optimization for Machine Learning, Part 1
Elad Hazan
Princeton University
The machine: maps an input $a \in \mathbb{R}^d$ (e.g., an image) to a label $b$ (e.g., chair/car), or to a distribution over labels:
$$b = f_{\mathrm{parameters}}(a)$$
This tutorial: training the machine
• Efficiency
• Generalization
Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Offline/online/stochastic gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second-order methods, non-convex optimization
Learning = optimization over data
(a.k.a. Empirical Risk Minimization)

But generally speaking... we're screwed:
• Local (non-global) minima of $f_0$
• All kinds of constraints (even restricting to continuous functions): $h(x) = \sin(2\pi x) = 0$
[Figure: 3-D surface plot of a highly non-convex function with many local minima]
$$\arg\min_x \sum_i \ell(x, a_i, b_i) \quad \text{for} \quad \ell(x, a_i, b_i) = \begin{cases} 1 & x^\top a_i \ne b_i \\ 0 & x^\top a_i = b_i \end{cases}$$
NP-hard!
Sum of signs → global optimization is NP-hard,
but locally verifiable...
Convexity
$$f\!\left(\tfrac{1}{2}x + \tfrac{1}{2}y\right) \le \tfrac{1}{2}f(x) + \tfrac{1}{2}f(y)$$
• Informally: smiley ☺
• Alternative definition:
$$f(y) \ge f(x) + \nabla f(x)^\top (y - x)$$
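A quick numeric sanity check of the first-order definition (not from the slides; the function $f(x) = \|x\|^2$ is an arbitrary convex choice):

```python
import numpy as np

# Check f(y) >= f(x) + grad_f(x)^T (y - x) on random pairs;
# f(x) = ||x||^2 is just an illustrative convex function.
def f(x):
    return x @ x

def grad_f(x):
    return 2 * x

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-9
print("first-order convexity inequality held on 1000 random pairs")
```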
Convex sets
Next → algorithms!
Gradient descent, constrained set
$$[\nabla f(x)]_i = \frac{\partial}{\partial x_i} f(x)$$
$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$$
$$x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
[Figure: iterates p1, p2, p3 stepping outside K and projected back, converging to p*]
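A minimal Python sketch of this two-step update, assuming for concreteness that $K$ is a Euclidean ball so the projection has a closed form; `grad_f`, the step size, and the horizon are caller-supplied:

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto K = {x : ||x|| <= radius} (closed form)."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def projected_gradient_descent(grad_f, x0, eta, T, project=project_ball):
    """y_{t+1} = x_t - eta * grad f(x_t);  x_{t+1} = argmin_{x in K} ||y_{t+1} - x||."""
    x = np.asarray(x0, dtype=float)
    iterates = [x]
    for _ in range(T):
        y = x - eta * grad_f(x)   # unconstrained gradient step
        x = project(y)            # project back onto the constraint set K
        iterates.append(x)
    return iterates
```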
Convergence of gradient descent
$$y_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$$
$$x_{t+1} = \arg\min_{x \in K} \|y_{t+1} - x\|$$
Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$,
$$f\!\left(\frac{1}{T}\sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$
Where:
• G = upper bound on the norm of the gradients: $\|\nabla f(x_t)\| \le G$
• D = diameter of the constraint set: $\|x - y\| \le D$ for all $x, y \in K$
1. Observation 1:
$$\|x^* - y_{t+1}\|^2 = \|x^* - x_t\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 \|\nabla f(x_t)\|^2$$
2. Observation 2 (Pythagoras):
$$\|x^* - x_{t+1}\| \le \|x^* - y_{t+1}\|$$
Thus:
$$\|x^* - x_{t+1}\|^2 \le \|x^* - x_t\|^2 - 2\eta \nabla f(x_t)^\top (x_t - x^*) + \eta^2 G^2$$
And hence:
$$f\!\left(\frac{1}{T}\sum_t x_t\right) - f(x^*) \le \frac{1}{T}\sum_t \left[f(x_t) - f(x^*)\right] \le \frac{1}{T}\sum_t \nabla f(x_t)^\top (x_t - x^*)$$
$$\le \frac{1}{T}\sum_t \left[\frac{1}{2\eta}\left(\|x^* - x_t\|^2 - \|x^* - x_{t+1}\|^2\right) + \frac{\eta}{2} G^2\right]$$
$$\le \frac{1}{T \cdot 2\eta} D^2 + \frac{\eta}{2} G^2 \le \frac{DG}{\sqrt{T}}$$
Recap
Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$,
$$f\!\left(\frac{1}{T}\sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$$
Thus, to get an $\epsilon$-approximate solution, apply $O\!\left(\frac{1}{\epsilon^2}\right)$ gradient iterations.
Gradient Descent - caveat
[Figure: projection of the iterates p1, p2, p3 onto K, as above]
Next few slides: online learning and regret minimization, where the average regret against the best hypothesis in hindsight vanishes:
$$\frac{1}{T}\sum_t \ell(h_t, (a_t, b_t)) - \min_{h^* \in \mathcal{H}} \frac{1}{T}\sum_t \ell(h^*, (a_t, b_t)) \;\longrightarrow\; 0 \quad \text{as } T \to \infty$$
Theorem: $\mathrm{Regret} = \sum_t f_t(x_t) - \min_{x^* \in K} \sum_t f_t(x^*) = O(\sqrt{T})$
Analysis
Denote $\nabla_t := \nabla f_t(x_t)$.
Observation 1:
$$\|y_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$$
Observation 2 (Pythagoras):
$$\|x_{t+1} - x^*\| \le \|y_{t+1} - x^*\|$$
Thus:
$$\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2 - 2\eta \nabla_t^\top (x_t - x^*) + \eta^2 \|\nabla_t\|^2$$
Convexity:
$$\sum_t \left[f_t(x_t) - f_t(x^*)\right] \le \sum_t \nabla_t^\top (x_t - x^*)$$
$$\le \frac{1}{2\eta} \sum_t \left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right) + \frac{\eta}{2} \sum_t \|\nabla_t\|^2$$
$$\le \frac{1}{2\eta} \|x_1 - x^*\|^2 + \frac{\eta}{2} T G^2 = O(\sqrt{T}) \quad \text{for } \eta = \Theta(1/\sqrt{T})$$
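A sketch of online gradient descent as analyzed above; the per-round gradient callbacks and the projection onto $K$ are assumed to be supplied by the caller:

```python
import numpy as np

def online_gradient_descent(grads, x0, D, G, project):
    """Play x_t, observe grad f_t(x_t), then step and project.
    With eta = D / (G * sqrt(T)), regret is O(DG * sqrt(T)) as in the theorem.
    `grads[t](x)` returns the gradient of f_t at x; `project` maps onto K."""
    T = len(grads)
    eta = D / (G * np.sqrt(T))
    x = np.asarray(x0, dtype=float)
    plays = []
    for grad_ft in grads:
        plays.append(x)                    # commit to x_t before seeing f_t
        x = project(x - eta * grad_ft(x))  # x_{t+1} = Pi_K(x_t - eta * grad_t)
    return plays
```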
Lower bound
$$\mathrm{Regret} = \Omega(\sqrt{T})$$
• 2 loss functions, T iterations:
• $K = [-1, 1]$, $f_1(x) = x$, $f_2(x) = -x$
• Second expert's loss = first $\times\, (-1)$
• Expected loss = 0 (for any algorithm)
• Regret (compared to either $-1$ or $1$):
$$E\left[\,\big|\#1\text{'s} - \#(-1)\text{'s}\big|\,\right] = \Omega(\sqrt{T})$$
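The lower bound is easy to check empirically: simulate the random-sign losses and compare $E[|\sum_t \sigma_t|]$ with $\sqrt{2T/\pi}$, the exact asymptotic constant for a $\pm 1$ random walk:

```python
import numpy as np

# Losses f_t(x) = sigma_t * x on K = [-1, 1] with random signs sigma_t.
# Any player's expected loss is 0, while the best fixed point in hindsight
# earns -|sum_t sigma_t|, so E[Regret] = E[|#(+1)'s - #(-1)'s|] = Omega(sqrt(T)).
rng = np.random.default_rng(0)
for T in (100, 1_000, 10_000):
    signs = rng.choice([-1, 1], size=(5_000, T))   # 5,000 independent runs
    regret = np.abs(signs.sum(axis=1)).mean()
    print(f"T={T:6d}  E[regret] ~ {regret:7.1f}   sqrt(2T/pi) ~ {np.sqrt(2 * T / np.pi):7.1f}")
```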
Stochastic gradient descent
Run online gradient descent where each $f_t$ is the loss on a uniformly random example:
$$f_t(x) = \ell(x, a_i, b_i), \qquad i \sim \mathrm{Uniform}\{1, \dots, m\}$$
1. The regret bound gives:
$$\frac{1}{T}\sum_t \nabla_t^\top x_t \le \min_{x^* \in K} \frac{1}{T}\sum_t \nabla_t^\top x^* + \frac{DG}{\sqrt{T}}$$
2. Taking (conditional) expectation, with $F$ the expected loss:
$$E\left[F\!\left(\frac{1}{T}\sum_t x_t\right)\right] - \min_{x^* \in K} F(x^*) \le E\left[\frac{1}{T}\sum_t \nabla_t^\top (x_t - x^*)\right] \le \frac{DG}{\sqrt{T}}$$
One example per step, same convergence as GD, and gives direct generalization!
(Formally needs martingales.)
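A minimal SGD sketch along these lines; `loss_grad`, the projection, and the data arrays are assumptions of this illustration:

```python
import numpy as np

def sgd(loss_grad, a_all, b_all, x0, D, G, project, T):
    """SGD for ERM: each step samples one example, so each gradient is an
    unbiased estimate of the gradient of the expected loss F.
    loss_grad(x, a, b) returns the gradient of ell(x, a, b) with respect to x."""
    eta = D / (G * np.sqrt(T))
    rng = np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    running_sum = np.zeros_like(x)
    m = len(b_all)
    for _ in range(T):
        i = rng.integers(m)                                    # random example
        x = project(x - eta * loss_grad(x, a_all[i], b_all[i]))
        running_sum += x
    return running_sum / T     # the guarantee is for the averaged iterate
```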
Stochastic vs. full gradient descent
$O\!\left(\frac{1}{\epsilon^2}\right)$ vs. $O\!\left(\frac{m}{\epsilon^2}\right)$ total running time for $\epsilon$ generalization error, where $m$ is the number of training examples.
Regularization & Gradient Descent++
Why "regularize"?
Follow-The-Leader (FTL) plays the best point in hindsight:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x)$$
Follow-The-Regularized-Leader (FTRL) adds a regularizer to stabilize it:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla_i^\top x + \frac{1}{\eta} R(x)$$
• $R(x) = \frac{1}{2}\|x\|^2$:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) = \Pi_K\!\left(-\eta \sum_{i=1}^{t-1} \nabla f_i(x_i)\right)$$
Equivalently, "lazy" projected gradient descent:
$$x_t = \Pi_K(y_t), \qquad y_{t+1} = y_t - \eta \nabla f_t(x_t)$$
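A sketch of this "lazy" equivalence: the FTRL iterate is obtained by projecting the scaled negative gradient sum, never re-solving the arg min:

```python
import numpy as np

def ftrl_l2(grads, dim, eta, project):
    """FTRL with R(x) = (1/2)||x||^2. The argmin has the closed form
    x_t = Pi_K(-eta * sum_{i<t} grad_i), i.e., lazy projected gradient descent."""
    y = np.zeros(dim)              # y_t = -eta * (sum of past gradients)
    plays = []
    for grad_ft in grads:
        x = project(y)             # x_t = Pi_K(y_t)
        plays.append(x)
        y = y - eta * grad_ft(x)   # y_{t+1} = y_t - eta * grad f_t(x_t)
    return plays
```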
FTRL vs. Multiplicative Weights
• Experts setting: $K = \Delta_n$, distributions over $n$ experts
• $f_t(x) = c_t^\top x$, where $c_t$ is the vector of losses
• $R(x) = \sum_i x_i \log x_i$: negative entropy
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x) = \exp\!\left(-\eta \sum_{i=1}^{t-1} c_i\right) \Big/ Z_t$$
(entrywise exponential; $Z_t$ is the normalization constant)
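The resulting multiplicative-weights update, written out as a short sketch (`loss_vectors` holds the $c_t$):

```python
import numpy as np

def multiplicative_weights(loss_vectors, eta):
    """FTRL over the simplex with negative-entropy regularization:
    x_t is the entrywise exponential of the scaled cumulative losses,
    divided by the normalization constant Z_t."""
    cum = np.zeros_like(loss_vectors[0], dtype=float)
    plays = []
    for c in loss_vectors:
        w = np.exp(-eta * cum)
        plays.append(w / w.sum())  # x_t = exp(-eta * sum_{i<t} c_i) / Z_t
        cum += c
    return plays
```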
For general $R$, the same FTRL update is mirror descent:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} \nabla f_i(x_i)^\top x + \frac{1}{\eta} R(x)$$
Bregman projection:
$$\Pi_K^R(y) = \arg\min_{x \in K} B_R(x \,\|\, y)$$
$$x_t = \Pi_K^R(y_t), \qquad y_{t+1} = (\nabla R)^{-1}\!\left(\nabla R(y_t) - \eta \nabla f_t(x_t)\right)$$
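A sketch of one mirror-descent step for the negative-entropy regularizer, where $(\nabla R)^{-1}$ and the Bregman projection both have closed forms:

```python
import numpy as np

def mirror_step_entropy(y, grad, eta):
    """One mirror-descent step with R(x) = sum_i x_i log x_i.
    Here grad R(y) = 1 + log y, so (grad R)^{-1}(z) = exp(z - 1) and the dual
    step multiplies y entrywise by exp(-eta * grad). The Bregman projection
    onto the simplex for this R is plain normalization, recovering the
    multiplicative-weights update above."""
    y_next = y * np.exp(-eta * grad)   # (grad R)^{-1}(grad R(y) - eta * grad)
    x_next = y_next / y_next.sum()     # Bregman projection onto the simplex
    return y_next, x_next
```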
Adaptive Regularization: AdaGrad
Idea: learn the best quadratic regularizer on the fly:
$$x_t = \arg\min_{x \in K} \sum_{i=1}^{t-1} f_i(x) + \frac{1}{\eta} R(x)$$
$$R(x) = \|x\|_A^2 \quad \text{s.t.} \quad A \succeq 0,\ \mathrm{Trace}(A) = d$$
• Set $x_1 \in K$ arbitrarily
• For $t = 1, 2, \dots$:
  1. Play $x_t$, obtain $f_t$
  2. Compute $x_{t+1}$ as follows:
$$G_t = \mathrm{diag}\!\left(\sum_{i=1}^{t} \nabla f_i(x_i)\, \nabla f_i(x_i)^\top\right)$$
$$y_{t+1} = x_t - \eta\, G_t^{-1/2} \nabla f_t(x_t)$$
$$x_{t+1} = \arg\min_{x \in K} (y_{t+1} - x)^\top G_t^{1/2} (y_{t+1} - x)$$
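A sketch of the diagonal version of this update; the `eps` guard and the unconstrained default are additions for numerical convenience, and a constrained use would need a projection in the $G_t^{1/2}$ norm:

```python
import numpy as np

def adagrad(grads, x0, eta, project=None, eps=1e-12):
    """Diagonal AdaGrad: G_t accumulates squared gradient coordinates, giving a
    per-coordinate step size eta / sqrt(G_t). `project(y, g_half)` should
    project in the norm induced by diag(g_half); None means K = R^d."""
    x = np.asarray(x0, dtype=float)
    g_sq = np.zeros_like(x)                       # diagonal of G_t
    plays = []
    for grad_ft in grads:
        plays.append(x)
        g = grad_ft(x)
        g_sq += g * g
        y = x - eta * g / (np.sqrt(g_sq) + eps)   # y_{t+1} = x_t - eta G_t^{-1/2} grad
        x = y if project is None else project(y, np.sqrt(g_sq))
    return plays
```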