5. Scaling Optimization
Source: http://cs231n.github.io/neural-networks-1/
[Figure: a fully-connected network; its size is characterized by its width (neurons per layer) and its depth (number of layers), ending in the output layer.]
Compute Graphs → Neural Networks

[Figure: a one-output and a two-output network drawn as compute graphs from input layer to output layer. Each input 𝑥_k is multiplied by a weight 𝑤_{i,k}, the products are summed, passed through a ReLU activation max(0, 𝑥), and the prediction ŷ_i is compared to the target 𝑦_i; the difference is squared to give the L2 loss/cost (not arguing this is the right choice here). The weights are the unknowns.]

We want to compute gradients w.r.t. all weights 𝑾.
Compute Graphs → Neural Networks

Goal: we want to compute gradients of the loss function 𝐿 w.r.t. all weights 𝑤.

𝐿 = Σ_i 𝐿_i   — the total loss is the sum over the per-sample losses; e.g. for the L2 loss we simply sum up squares: 𝐿_i = (ŷ_i − 𝑦_i)².

The prediction of output neuron i is ŷ_i = 𝐴(𝑏_i + Σ_k 𝑥_k 𝑤_{i,k}), where 𝐴 is the activation function and 𝑏_i the bias.

⟶ use the chain rule to compute the partials:
  𝜕𝐿/𝜕𝑤_{i,k} = 𝜕𝐿/𝜕ŷ_i ⋅ 𝜕ŷ_i/𝜕𝑤_{i,k}

We want to compute gradients w.r.t. all weights 𝑾 AND all biases 𝒃.
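A minimal NumPy sketch of these formulas (not from the lecture): a two-input, two-output layer with ReLU as the activation 𝐴, an L2 loss per output, and the chain rule for 𝜕𝐿/𝜕𝑤_{i,k}; all numerical values are made up for illustration.

```python
import numpy as np

# Illustrative sketch: forward pass y_hat_i = A(b_i + sum_k x_k * w_ik) with a
# ReLU activation, an L2 loss per output, and the chain rule
# dL/dw_ik = dL/dy_hat_i * dy_hat_i/dw_ik. All numbers are made up.

x = np.array([0.5, -1.2])              # inputs x_k
W = np.array([[0.3, 0.7],              # weights w_ik (2 outputs x 2 inputs)
              [-0.4, 0.1]])
b = np.array([0.1, -0.2])              # biases b_i
y = np.array([1.0, 0.0])               # targets y_i

# Forward pass
z = b + W @ x                          # pre-activations
y_hat = np.maximum(0.0, z)             # ReLU activation A
L = np.sum((y_hat - y) ** 2)           # L2 loss summed over outputs

# Backward pass via the chain rule
dL_dyhat = 2.0 * (y_hat - y)           # dL/dy_hat_i
dyhat_dz = (z > 0).astype(float)       # ReLU derivative
dL_dz = dL_dyhat * dyhat_dz
dL_dW = np.outer(dL_dz, x)             # dL/dw_ik = dL/dz_i * x_k
dL_db = dL_dz                          # dL/db_i  = dL/dz_i

print(L, dL_dW, dL_db)
```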
Summary

• We have
  – (Directed) compute graph
  – Structure the graph into layers
  – Compute partial derivatives w.r.t. the weights (unknowns):

  𝛻_𝑾 𝑓_{𝒙,𝒚}(𝑾) = ( 𝜕𝑓/𝜕𝑤_{0,0,0}, …, 𝜕𝑓/𝜕𝑤_{l,m,n}, …, 𝜕𝑓/𝜕𝑏_{l,m} )ᵀ

• Next
  – Find the weights based on the gradients

  Gradient step:  𝑾′ = 𝑾 − 𝛼 𝛻_𝑾 𝑓_{𝒙,𝒚}(𝑾)
Gradient Descent

[Figure: a 1-D loss curve with an initialization point and an optimum; repeated update steps move 𝑥 downhill along the negative gradient.]

𝑥′ = 𝑥 − 𝛼 𝛻_𝑥 𝑓(𝑥),   where 𝛼 is the learning rate.

What is the gradient when we reach a local optimum? It is (close to) zero, so the update stops making progress there: gradient descent is not guaranteed to reach the global optimum.
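A minimal sketch of this update rule on a made-up 1-D function, showing the role of the learning rate 𝛼 and how the iteration settles where the gradient vanishes.

```python
# Minimal sketch of the update x' = x - alpha * grad_x f(x) on a made-up 1-D
# function; at a (local) minimum the gradient is close to zero, so the
# iteration effectively stops moving there.

def f(x):
    return (x - 2.0) ** 2 + 1.0

def grad_f(x):
    return 2.0 * (x - 2.0)

alpha = 0.1                    # learning rate
x = -3.0                       # initialization
for _ in range(100):
    x = x - alpha * grad_f(x)

print(x, f(x), grad_f(x))      # x approaches the optimum at x = 2
```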
Convergence of Gradient Descent

• Convex function: all local minima are global minima.
  A function is convex if the line (or plane) segment between any two points on its graph lies above or on the graph.

Source: https://en.wikipedia.org/wiki/Convex_function#/media/File:ConvexFunction.svg
Source: https://builtin.com/data-science/gradient-descent
Source: A. Geron
Gradient Descent: Multiple Dimensions

[Figure: gradient descent trajectories on 2-D loss surfaces.]

Source: builtin.com/data-science/gradient-descent
Source: http://blog.datumbox.com/wp-content/uploads/2013/10/gradient-descent.png
Gradient Descent for Neural Networks

[Figure: a small network with inputs 𝑥_0, 𝑥_1, 𝑥_2, hidden units ℎ_0 … ℎ_3, and outputs ŷ_0, ŷ_1 compared to targets 𝑦_0, 𝑦_1.]

Loss function (per sample): 𝐿_i = (ŷ_i − 𝑦_i)²

𝛻_{𝑾,𝒃} 𝑓_{𝒙,𝒚}(𝑾) = ( 𝜕𝑓/𝜕𝑤_{0,0,0}, …, 𝜕𝑓/𝜕𝑤_{l,m,n}, …, 𝜕𝑓/𝜕𝑏_{l,m} )ᵀ

• Cost over the training set: 𝐿 = (1/𝑛) Σ_{i=1}^{𝑛} 𝐿_i(𝜽, 𝒙_i, 𝒚_i)
  – 𝜽 = arg min 𝐿

1. Compute the gradient: 𝛻_𝜽 𝐿 = (1/𝑛) Σ_{i=1}^{𝑛} 𝛻_𝜽 𝐿_i
2. Optimize for the optimal step size 𝛼:  arg min_𝛼 𝐿(𝜽^k − 𝛼 𝛻_𝜽 𝐿)
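A hedged sketch of steps 1 and 2 for an illustrative linear least-squares model; the data, the model, and the grid of candidate step sizes are assumptions made only for this example.

```python
import numpy as np

# Illustrative sketch of steps 1 and 2 above for a linear model
# y_hat = X @ theta with L2 loss; data and step-size grid are made up.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # n = 100 samples, 3 features
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

def loss(theta):
    return np.mean((X @ theta - y) ** 2)

def grad(theta):
    return 2.0 / len(X) * X.T @ (X @ theta - y)  # (1/n) sum of per-sample gradients

theta = np.zeros(3)
g = grad(theta)                                  # 1. compute the gradient
candidates = np.logspace(-4, 0, 20)              # 2. pick alpha minimizing L(theta - alpha * g)
alpha = min(candidates, key=lambda a: loss(theta - a * g))
theta = theta - alpha * g
print(alpha, loss(theta))
```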
Minibatch

• Choose a subset of the training set, 𝑚 ≪ 𝑛:
  𝐵_i = {(𝒙_1, 𝒚_1), (𝒙_2, 𝒚_2), …, (𝒙_m, 𝒚_m)},   {𝐵_1, 𝐵_2, …, 𝐵_{n/m}}

Update rule:
  𝜽^{k+1} = 𝜽^k − 𝛼_k 𝐻(𝜽^k, 𝑋)
where 𝛼_1, 𝛼_2, …, 𝛼_n is a sequence of positive step sizes and 𝐻(𝜽^k, 𝑋) is an unbiased estimate of 𝛻𝐹(𝜽^k), i.e. 𝔼[𝐻(𝜽^k, 𝑋)] = 𝛻𝐹(𝜽^k).

Robbins, H. and Monro, S. "A Stochastic Approximation Method", 1951.
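A sketch of minibatch SGD under these definitions, using an illustrative linear model and a slowly decaying step-size sequence; none of the concrete values come from the lecture.

```python
import numpy as np

# Sketch of minibatch SGD: sample batches of size m << n and apply
# theta^{k+1} = theta^k - alpha_k * H(theta^k, X), where H is the gradient of
# the loss on the minibatch (an unbiased estimate of the full gradient).
# The linear model, the data, and the step-size decay are illustrative.

rng = np.random.default_rng(0)
n, m = 1000, 32
X = rng.normal(size=(n, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=n)

theta = np.zeros(3)
alpha0 = 0.1
for k in range(1, 2001):
    idx = rng.choice(n, size=m, replace=False)      # minibatch B_k
    Xb, yb = X[idx], y[idx]
    g = 2.0 / m * Xb.T @ (Xb @ theta - yb)          # unbiased gradient estimate H
    alpha_k = alpha0 / (1.0 + 0.001 * k)            # slowly decaying positive step sizes
    theta = theta - alpha_k * g

print(theta)                                        # close to theta_true
```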
Convergence of SGD

𝜽^{k+1} = 𝜽^k − 𝛼_k 𝐻(𝜽^k, 𝑋) converges to a local (global) minimum if the following conditions are met:

1) 𝛼_n ≥ 0, ∀ n ≥ 0
2) Σ_{n=1}^{∞} 𝛼_n = ∞
3) Σ_{n=1}^{∞} 𝛼_n² < ∞
4) 𝐹(𝜽) is strictly convex

The sequence proposed by Robbins and Monro is 𝛼_n ∝ 𝛼/n, for n > 0.
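A quick check (not on the slide) that this sequence satisfies conditions 1) to 3): it is positive, its sum is the divergent harmonic series, and the sum of its squares converges.

```latex
\alpha_n = \frac{\alpha}{n} > 0, \qquad
\sum_{n=1}^{\infty} \frac{\alpha}{n} = \infty
\;\;\text{(harmonic series)}, \qquad
\sum_{n=1}^{\infty} \frac{\alpha^2}{n^2} = \frac{\alpha^2 \pi^2}{6} < \infty .
```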
Gradient Descent with Momentum

[Figure: a loss surface shaped like a narrow valley. Plain gradient descent makes many steps back and forth along the steep dimension; we would like these oscillations to average out over time, and we would love to go faster along the flat dimension, i.e. accumulate gradients over time. Source: A. Ng]

𝒗^{k+1} = 𝛽 ⋅ 𝒗^k − 𝛼 ⋅ 𝛻_𝜽 𝐿(𝜽^k)   (𝒗: velocity, the accumulated gradients)
𝜽^{k+1} = 𝜽^k + 𝒗^{k+1}   (𝜽: weights of the model)

Hyperparameters are 𝛼, 𝛽; 𝛽 is often set to 0.9.

Source: I. Goodfellow
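A minimal sketch of the momentum update on a made-up, badly conditioned quadratic (steep in one dimension, flat in the other); the velocity averages out the oscillations and accumulates speed along the flat direction. The function and hyperparameter values are illustrative only.

```python
import numpy as np

# Sketch of gradient descent with momentum on an ill-conditioned quadratic:
# the velocity damps the back-and-forth steps along the steep direction and
# accumulates speed along the flat one.

def grad(theta):
    # gradient of f(theta) = 0.5 * (10 * theta_0^2 + 0.1 * theta_1^2)
    return np.array([10.0 * theta[0], 0.1 * theta[1]])

alpha, beta = 0.05, 0.9           # learning rate and momentum (beta often 0.9)
theta = np.array([1.0, 1.0])
v = np.zeros(2)                   # velocity
for _ in range(200):
    v = beta * v - alpha * grad(theta)   # accumulate gradients over time
    theta = theta + v                    # theta^{k+1} = theta^k + v^{k+1}

print(theta)                      # approaches the optimum at (0, 0)
```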
Nesterov Momentum

• Look-ahead momentum:

  𝜽̃^{k+1} = 𝜽^k + 𝛽 ⋅ 𝒗^k
  𝒗^{k+1} = 𝛽 ⋅ 𝒗^k − 𝛼 ⋅ 𝛻_𝜽 𝐿(𝜽̃^{k+1})
  𝜽^{k+1} = 𝜽^k + 𝒗^{k+1}

Source: G. Hinton

Nesterov, Yurii E. "A method for solving the convex programming problem with convergence rate O(1/k²)." Dokl. Akad. Nauk SSSR, Vol. 269, 1983.
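The same illustrative quadratic as before, now with Nesterov momentum: the only change is that the gradient is evaluated at the look-ahead point 𝜽̃^{k+1}.

```python
import numpy as np

# Sketch of Nesterov (look-ahead) momentum: evaluate the gradient at
# theta_tilde = theta + beta * v instead of at theta itself.

def grad(theta):
    return np.array([10.0 * theta[0], 0.1 * theta[1]])   # illustrative quadratic

alpha, beta = 0.05, 0.9
theta = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(200):
    theta_tilde = theta + beta * v               # look-ahead position
    v = beta * v - alpha * grad(theta_tilde)     # gradient at the look-ahead
    theta = theta + v

print(theta)
```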
Root Mean Squared Prop (RMSProp)

[Figure: a loss surface where some dimensions have large gradients and others small gradients; RMSProp rescales the step per dimension. Source: A. Ng]

𝒔^{k+1} = 𝛽 ⋅ 𝒔^k + (1 − 𝛽) [𝛻_𝜽 𝐿 ∘ 𝛻_𝜽 𝐿]      (∘ denotes element-wise multiplication)

𝜽^{k+1} = 𝜽^k − 𝛼 ⋅ 𝛻_𝜽 𝐿 / (√𝒔^{k+1} + 𝜖)

Hyperparameters: 𝛼, 𝛽, 𝜖
– 𝛼 needs tuning, 𝛽 is often 0.9, 𝜖 is typically 10⁻⁸.

Hinton et al. "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural Networks for Machine Learning 4.2 (2012): 26-31.
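A sketch of the RMSProp update on the same kind of illustrative quadratic; the concrete hyperparameter values follow the typical choices above but are otherwise assumptions.

```python
import numpy as np

# Sketch of the RMSProp update: keep a running average s of squared gradients
# and divide the step element-wise by sqrt(s) + eps, so directions with large
# gradients take smaller steps.

def grad(theta):
    return np.array([10.0 * theta[0], 0.1 * theta[1]])   # illustrative quadratic

alpha, beta, eps = 0.01, 0.9, 1e-8
theta = np.array([1.0, 1.0])
s = np.zeros(2)
for _ in range(500):
    g = grad(theta)
    s = beta * s + (1.0 - beta) * g * g              # running average of g ∘ g
    theta = theta - alpha * g / (np.sqrt(s) + eps)

print(theta)
```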
Adam (Adaptive Moment Estimation)

• Hyperparameters: 𝛼, 𝛽_1, 𝛽_2, 𝜖
• First-moment estimate (momentum-like running average of the gradient):
  𝒎^{k+1} = 𝛽_1 ⋅ 𝒎^k + (1 − 𝛽_1) 𝛻_𝜽 𝐿(𝜽^k)

Source: http://ruder.io/optimizing-gradient-descent/
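For completeness, a sketch of the full Adam update; the slide above only shows the first-moment estimate 𝒎, and the second moment, bias correction, and final step shown here follow the standard Adam rule rather than anything on the slide. Test function and hyperparameter values are illustrative.

```python
import numpy as np

# Sketch of the complete Adam update: first and second moment estimates,
# bias correction, and a per-dimension scaled step.

def grad(theta):
    return np.array([10.0 * theta[0], 0.1 * theta[1]])   # illustrative quadratic

alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
theta = np.array([1.0, 1.0])
m = np.zeros(2)                                   # first moment
v = np.zeros(2)                                   # second moment
for k in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1.0 - beta1) * g             # as on the slide
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** k)                # bias correction
    v_hat = v / (1.0 - beta2 ** k)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(theta)
```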
Convergence

[Animations comparing the convergence behavior of the different optimizers on 2-D test surfaces.]

Source: http://ruder.io/optimizing-gradient-descent/
Source: https://github.com/Jaewan-Yun/optimizer-visualization
Jacobian and Hessian

• Derivative   𝒇: ℝ → ℝ    d𝑓(𝑥)/d𝑥 ∈ ℝ
• Gradient    𝒇: ℝ^m → ℝ    𝛻_𝒙 𝑓(𝒙) = ( 𝜕𝑓(𝒙)/𝜕𝑥_1, 𝜕𝑓(𝒙)/𝜕𝑥_2, … ) ∈ ℝ^m
• Jacobian    𝒇: ℝ^m → ℝ^n   𝐉 ∈ ℝ^{n×m}
• Hessian (second derivative)   𝒇: ℝ^m → ℝ   𝐇 ∈ ℝ^{m×m}

More info: https://en.wikipedia.org/wiki/Taylor_series
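A small numerical illustration of these shapes for two made-up functions, with the derivatives written out by hand.

```python
import numpy as np

# Shapes of gradient, Hessian, and Jacobian for made-up functions:
# a scalar f: R^3 -> R (gradient in R^3, Hessian in R^{3x3}) and a
# vector-valued F: R^3 -> R^2 (Jacobian in R^{2x3}).

x = np.array([1.0, 2.0, 0.5])

# f(x) = x0^2 + x0*x1 + sin(x2)
grad_f = np.array([2 * x[0] + x[1], x[0], np.cos(x[2])])          # shape (3,)
hess_f = np.array([[2.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 0.0, -np.sin(x[2])]])                    # shape (3, 3)

# F(x) = (x0*x1, x1 + x2^2)
jac_F = np.array([[x[1], x[0], 0.0],
                  [0.0, 1.0, 2 * x[2]]])                          # shape (2, 3)

print(grad_f.shape, hess_f.shape, jac_F.shape)    # (3,) (3, 3) (2, 3)
```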
Newton's Method

Approximate the loss by its second-order Taylor expansion around 𝜽^0 and jump directly to the minimum of that quadratic approximation:

  𝜽* = 𝜽^0 − 𝐇^{−1} 𝛻_𝜽 𝐿(𝜽^0)      (update step)

Source: https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization
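A sketch of a single Newton step on an illustrative quadratic loss; for a quadratic, one step with the exact Hessian lands on the optimum regardless of conditioning.

```python
import numpy as np

# Newton step theta* = theta0 - H^{-1} grad L(theta0) on a quadratic loss
# L(theta) = 0.5 * theta^T A theta - b^T theta (values made up).

A = np.array([[10.0, 0.0],
              [0.0, 0.1]])
b = np.array([1.0, -2.0])

def grad(theta):
    return A @ theta - b               # gradient of the quadratic loss

H = A                                  # Hessian of the quadratic is constant
theta0 = np.array([5.0, -5.0])
theta_star = theta0 - np.linalg.solve(H, grad(theta0))
print(theta_star, np.linalg.solve(A, b))   # both equal the minimizer A^{-1} b
```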
• Quasi-Newton methods approximate the (inverse) Hessian instead of computing it exactly:
  – BFGS
  – Limited memory: L-BFGS

• For least-squares problems, Gauss-Newton / Levenberg-Marquardt uses 𝐽_F^T 𝐽_F as the Hessian approximation, with damping 𝜆:
  (𝐽_F(𝑥_k)^T 𝐽_F(𝑥_k) + 𝜆 ⋅ diag(𝐽_F(𝑥_k)^T 𝐽_F(𝑥_k))) ⋅ (𝑥_k − 𝑥_{k+1}) = 𝛻𝑓(𝑥_k)
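A sketch of this update for a made-up least-squares residual 𝐹: here 𝛻𝑓 = 𝐽_F^T 𝐹, and the damping 𝜆 is kept fixed for simplicity (a full Levenberg-Marquardt implementation adapts it between iterations).

```python
import numpy as np

# One damped Gauss-Newton (Levenberg-Marquardt style) iteration for
# f(x) = 0.5 * ||F(x)||^2, solving
# (J^T J + lambda * diag(J^T J)) (x_k - x_{k+1}) = J^T F(x_k).
# Residual function, starting point, and lambda are illustrative.

def F(x):                              # residuals F: R^2 -> R^3, zero at (1, 3)
    return np.array([x[0] ** 2 - 1.0,
                     x[0] * x[1] - 3.0,
                     x[1] - 3.0])

def J(x):                              # Jacobian of F, 3 x 2
    return np.array([[2.0 * x[0], 0.0],
                     [x[1], x[0]],
                     [0.0, 1.0]])

lam = 1e-2                             # fixed damping (illustrative)
x = np.array([2.0, 2.0])
for _ in range(20):
    Jx, Fx = J(x), F(x)
    A = Jx.T @ Jx + lam * np.diag(np.diag(Jx.T @ Jx))
    step = np.linalg.solve(A, Jx.T @ Fx)     # = x_k - x_{k+1}
    x = x - step

print(x, F(x))                         # residuals become small near (1, 3)
```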
• When you can, use a 2nd-order method → it is simply faster (far fewer iterations).

• Next lecture
  – Training neural networks