
Scaling Optimization

I2DL: Prof. Dai 1


Lecture 4 Recap

I2DL: Prof. Dai 2


Neural Network

Source: http://cs231n.github.io/neural-networks-1/

I2DL: Prof. Dai 3


Neural Network
[Figure: a fully-connected network with an input layer, three hidden layers, and an output layer. Depth = number of layers; width = number of neurons per layer.]
I2DL: Prof. Dai 4
Compute Graphs → Neural Networks
[Compute graph: the inputs x_0, x_1 are multiplied by the weights w_0, w_1 (the unknowns!), summed, and passed through a ReLU activation \max(0, x) (not arguing this is the right choice here); the output \hat{y}_0 is compared to the target y_0 (e.g., class label / regression target) and squared, giving the L2 loss/cost.]

We want to compute gradients w.r.t. all weights \boldsymbol{W}
I2DL: Prof. Dai 5
Compute Graphs → Neural Networks
[Compute graph with inputs x_0, x_1 and three outputs \hat{y}_0, \hat{y}_1, \hat{y}_2: each output unit multiplies the inputs by its weights w_{i,0}, w_{i,1}, sums them, subtracts the target y_i, and squares the result, giving one loss/cost term per output.]
We want to compute gradients w.r.t. all weights 𝑾
I2DL: Prof. Dai 6
Compute Graphs → Neural Networks
Goal: we want to compute gradients of the loss function L w.r.t. all weights \boldsymbol{W} AND all biases \boldsymbol{b}.

L = \sum_i L_i, i.e., the sum over the per-sample losses. For the L2 loss we simply sum up squares: L_i = (\hat{y}_i - y_i)^2

\hat{y}_i = A(b_i + \sum_k x_k w_{i,k}), with activation function A and bias b_i

→ use the chain rule to compute the partials:
  \frac{\partial L}{\partial w_{i,k}} = \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial w_{i,k}}
I2DL: Prof. Dai 7
Summary
• We have
  – (Directional) compute graph
  – Structure graph into layers
  – Compute partial derivatives w.r.t. the weights (unknowns):
    \nabla_{\boldsymbol{W}} f_{\boldsymbol{x},\boldsymbol{y}}(\boldsymbol{W}) = \left[ \frac{\partial f}{\partial w_{0,0,0}}, \dots, \frac{\partial f}{\partial w_{l,m,n}}, \dots, \frac{\partial f}{\partial b_{l,m}} \right]

• Next
  – Find weights based on gradients. Gradient step:
    \boldsymbol{W}' = \boldsymbol{W} - \alpha \nabla_{\boldsymbol{W}} f_{\boldsymbol{x},\boldsymbol{y}}(\boldsymbol{W})
I2DL: Prof. Dai 8


Optimization

I2DL: Prof. Dai 9


Gradient Descent
x^* = \arg\min_x f(x)

[Figure: a 1D function f(x) with the initialization marked on the curve and the optimum at the bottom of the valley.]
I2DL: Prof. Dai 10


Gradient Descent
x^* = \arg\min_x f(x)

[Figure: the same 1D function; starting from the initialization, follow the slope of the DERIVATIVE down towards the optimum.]
I2DL: Prof. Dai 11


Gradient Descent
• From derivative to gradient:
  \frac{df(x)}{dx} \;\to\; \nabla_x f(x)
  The gradient \nabla_x f(x) points in the direction of greatest increase of the function.
• Gradient descent steps in the direction of the negative gradient:
  x' = x - \alpha \nabla_x f(x)
  where \alpha is the learning rate.
I2DL: Prof. Dai 12
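A minimal sketch of this update rule on a toy 1D objective (the function, the starting point, and the learning rate are illustrative choices, not from the slides):

```python
# Toy 1D objective f(x) = (x - 3)^2 + 1; gradient descent follows -grad_f.
def f(x):
    return (x - 3.0) ** 2 + 1.0

def grad_f(x):
    return 2.0 * (x - 3.0)

x = 10.0          # initialization
alpha = 0.1       # learning rate
for step in range(100):
    x = x - alpha * grad_f(x)   # x' = x - alpha * grad f(x)

print(x, f(x))    # x approaches the optimum x* = 3
```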


Gradient Descent
• From derivative to gradient: \nabla_x f(x) points in the direction of greatest increase of the function.
• Gradient descent steps in the direction of the negative gradient:
  x' = x - \alpha \nabla_x f(x)
  Here: a SMALL learning rate \alpha.
I2DL: Prof. Dai 13


Gradient Descent
• From derivative to gradient: \nabla_x f(x) points in the direction of greatest increase of the function.
• Gradient descent steps in the direction of the negative gradient:
  x' = x - \alpha \nabla_x f(x)
  Here: a LARGE learning rate \alpha.
I2DL: Prof. Dai 14


Gradient Descent
\boldsymbol{x}^* = \arg\min_{\boldsymbol{x}} f(\boldsymbol{x})

[Figure: a non-convex function with several valleys; starting from the initialization, gradient descent settles in the nearest valley (the marked optimum).]

What is the gradient when we reach this point? Gradient descent is not guaranteed to reach the global optimum.
I2DL: Prof. Dai 15
Convergence of Gradient Descent
• Convex function: all local minima are global minima

Source: https://en.wikipedia.org/wiki/Convex_function#/media/File:ConvexFunction.svg

A function is convex if the line/plane segment between any two points on its graph lies above or on the graph.

I2DL: Prof. Dai 16


Convergence of Gradient Descent
• Neural networks are non-convex
– many (different) local minima
– no (practical) way to say which is globally optimal

Source: Li, Qi. (2006). Challenging Registration of Geologic Image Data

I2DL: Prof. Dai 17


Convergence of Gradient Descent

Source: https://builtin.com/data-science/gradient-descent

I2DL: Prof. Dai 18


Convergence of Gradient Descent

Source: A. Geron
I2DL: Prof. Dai 19
Gradient Descent: Multiple Dimensions

Source: builtin.com/data-science/gradient-descent

Various ways to visualize…


I2DL: Prof. Dai 20
Gradient Descent: Multiple Dimensions

Source: http://blog.datumbox.com/wp-content/uploads/2013/10/gradient-descent.png
I2DL: Prof. Dai 21
Gradient Descent for Neural Networks
Loss function: L_i = (\hat{y}_i - y_i)^2

[Figure: a two-layer network with inputs x_0, x_1, x_2, hidden units h_0, \dots, h_3, and outputs \hat{y}_0, \hat{y}_1 compared against targets y_0, y_1.]

\nabla_{\boldsymbol{W},\boldsymbol{b}} f_{\boldsymbol{x},\boldsymbol{y}}(\boldsymbol{W}) = \left[ \frac{\partial f}{\partial w_{0,0,0}}, \dots, \frac{\partial f}{\partial w_{l,m,n}}, \dots, \frac{\partial f}{\partial b_{l,m}} \right]

\hat{y}_i = A(b_{1,i} + \sum_j h_j w_{1,i,j})
h_j = A(b_{0,j} + \sum_k x_k w_{0,j,k})

A is just a simple ReLU: A(x) = \max(0, x)
I2DL: Prof. Dai 22
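A small sketch of this two-layer forward pass and loss; the layer sizes, random weights, and input/target values below are made-up for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Made-up sizes: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
W0, b0 = rng.normal(size=(4, 3)), np.zeros(4)   # first layer weights/biases
W1, b1 = rng.normal(size=(2, 4)), np.zeros(2)   # second layer weights/biases

x = np.array([1.0, -0.5, 2.0])    # input sample
y = np.array([0.5, 1.0])          # regression target

h     = relu(b0 + W0 @ x)         # h_j = A(b_{0,j} + sum_k x_k w_{0,j,k})
y_hat = relu(b1 + W1 @ h)         # y_hat_i = A(b_{1,i} + sum_j h_j w_{1,i,j})
L     = np.sum((y_hat - y) ** 2)  # L = sum_i (y_hat_i - y_i)^2
```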
Gradient Descent: Single Training Sample
• Given a loss function L and a single training sample \{x_i, y_i\}
• Find the best model parameters \theta = (\boldsymbol{W}, \boldsymbol{b})
• Cost: L_i(\theta, x_i, y_i)
  – \theta^* = \arg\min_\theta L_i(\theta, x_i, y_i)
• Gradient Descent:
  – Initialize \theta^1 with ‘random’ values (more on that later)
  – \theta^{k+1} = \theta^k - \alpha \nabla_\theta L_i(\theta^k, x_i, y_i)
  – Iterate until convergence: \|\theta^{k+1} - \theta^k\| < \epsilon

I2DL: Prof. Dai 23


Gradient Descent: Single Training Sample

– \theta^{k+1} = \theta^k - \alpha \nabla_\theta L_i(\theta^k, x_i, y_i)
  where \theta^{k+1} are the weights and biases after the update step, \theta^k are the weights and biases at step k (the current model), \alpha is the learning rate, L_i is the loss function, \{x_i, y_i\} is the training sample, and \nabla_\theta L_i is the gradient w.r.t. \theta.
– \nabla_\theta L_i(\theta^k, x_i, y_i) is computed via backpropagation
– Typically: \dim \nabla_\theta L_i(\theta^k, x_i, y_i) = \dim \theta \gg 1 million

I2DL: Prof. Dai 24


Gradient Descent: Multiple Training Samples

• Given a loss function 𝐿 and multiple (𝑛) training


samples {𝒙𝑖 , 𝒚𝑖 }
• Find best model parameters 𝜽 = 𝑾, 𝒃

1
• Cost 𝐿 = σ𝑛𝑖=1 𝐿𝑖 (𝜽, 𝒙𝑖 , 𝒚𝑖 )
𝑛
– 𝜽 = arg min 𝐿

I2DL: Prof. Dai 25


Gradient Descent: Multiple Training Samples
• Update step for multiple samples:
  \theta^{k+1} = \theta^k - \alpha \nabla_\theta L(\theta^k, x_{\{1..n\}}, y_{\{1..n\}})
• The gradient is the average / sum over the per-sample gradients (reminder: this comes from backprop):
  \nabla_\theta L(\theta^k, x_{\{1..n\}}, y_{\{1..n\}}) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L_i(\theta^k, x_i, y_i)
• Often people are lazy and just write \nabla L = \sum_{i=1}^{n} \nabla_\theta L_i
  – omitting the \frac{1}{n} is not ‘wrong’, it just means rescaling the learning rate
I2DL: Prof. Dai 26
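A minimal sketch of one full-batch update that averages the per-sample gradients; `grad_Li` is a hypothetical function returning \nabla_\theta L_i for one sample (e.g., from backprop):

```python
import numpy as np

def full_batch_step(theta, xs, ys, grad_Li, alpha):
    """One gradient descent step over the whole training set.

    grad_Li(theta, x, y) is assumed to return dL_i/dtheta for one sample.
    """
    grad = np.zeros_like(theta)
    for x, y in zip(xs, ys):
        grad += grad_Li(theta, x, y)
    grad /= len(xs)                  # the 1/n averaging
    return theta - alpha * grad      # theta^{k+1} = theta^k - alpha * grad
```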
Side Note: Optimal Learning Rate
Can compute the optimal learning rate \alpha using line search (optimal for a given set):

1. Compute the gradient: \nabla_\theta L = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L_i
2. Optimize for the optimal step \alpha: \arg\min_\alpha L(\theta^k - \alpha \nabla_\theta L)
3. \theta^{k+1} = \theta^k - \alpha \nabla_\theta L

Not that practical for DL since we need to solve a huge system every step…

I2DL: Prof. Dai 27
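A naive sketch of step 2, searching over a small fixed grid of candidate step sizes (the grid and the helper names `loss`/`grad` are illustrative assumptions; practical line searches use e.g. backtracking):

```python
def line_search_step(theta, loss, grad, candidates=(1e-3, 1e-2, 1e-1, 1.0)):
    """Pick the alpha in `candidates` that minimizes L(theta - alpha * grad)."""
    g = grad(theta)
    best_alpha = min(candidates, key=lambda a: loss(theta - a * g))
    return theta - best_alpha * g
```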


Gradient Descent on Train Set
• Given a large train set with n training samples \{x_i, y_i\}
  – Let’s say 1 million labeled images
  – Let’s say our network has 500k parameters
• The gradient has 500k dimensions
• n = 1 million
→ Extremely expensive to compute
I2DL: Prof. Dai 28


Stochastic Gradient Descent (SGD)
• If we have n training samples, we need to compute the gradient for all of them, which is O(n).
• If we consider the problem as empirical risk minimization, we can express the total loss over the training data as an expectation over the samples:
  \frac{1}{n} \sum_{i=1}^{n} L_i(\theta, x_i, y_i) = \mathbb{E}_{i \sim \{1,\dots,n\}} \left[ L_i(\theta, x_i, y_i) \right]

I2DL: Prof. Dai 29


Stochastic Gradient Descent (SGD)
• The expectation can be approximated with a small subset of the data:
  \mathbb{E}_{i \sim \{1,\dots,n\}} \left[ L_i(\theta, x_i, y_i) \right] \approx \frac{1}{|S|} \sum_{j \in S} L_j(\theta, x_j, y_j)  with  S \subseteq \{1, \dots, n\}

• Minibatch: choose a subset of the train set of size m \ll n:
  B_i = \{ (x_1, y_1), (x_2, y_2), \dots, (x_m, y_m) \},  giving minibatches \{ B_1, B_2, \dots, B_{n/m} \}

I2DL: Prof. Dai 30


Stochastic Gradient Descent (SGD)
• Minibatch size is a hyperparameter
  – Typically a power of 2 → 8, 16, 32, 64, 128…
  – A smaller batch size means greater variance in the gradients → noisy updates
  – Mostly limited by GPU memory (in the backward pass)
  – E.g.,
    • the train set has n = 2^20 (about 1 million) images
    • with batch size m = 64 we get B_1 … B_{n/m} = B_1 … B_{16,384}, i.e., 16,384 minibatches per epoch
  (Epoch = one complete pass through the training set)
I2DL: Prof. Dai 31
Stochastic Gradient Descent (SGD)
\theta^{k+1} = \theta^k - \alpha \nabla_\theta L(\theta^k, x_{\{1..m\}}, y_{\{1..m\}})

k now refers to the k-th iteration, and m is the number of training samples in the current minibatch.

Gradient for the k-th minibatch:
  \nabla_\theta L = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L_i

Note the terminology: iteration vs. epoch


I2DL: Prof. Dai 32
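A sketch of the resulting minibatch SGD loop, making the iteration/epoch distinction explicit. `grad_minibatch` is a hypothetical helper that returns the averaged minibatch gradient (e.g., via backprop), and X, Y are assumed to be NumPy arrays:

```python
import numpy as np

def sgd(theta, X, Y, grad_minibatch, alpha=0.01, batch_size=64, epochs=10):
    n = len(X)
    rng = np.random.default_rng(0)
    for epoch in range(epochs):                  # one epoch = full pass over the data
        order = rng.permutation(n)               # reshuffle each epoch
        for start in range(0, n, batch_size):    # each inner step = one iteration
            idx = order[start:start + batch_size]
            g = grad_minibatch(theta, X[idx], Y[idx])   # (1/m) * sum of per-sample grads
            theta = theta - alpha * g
    return theta
```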
Convergence of SGD
Suppose we want to minimize a function F(\theta) with the stochastic approximation
  \theta^{k+1} = \theta^k - \alpha_k H(\theta^k, X)
where \alpha_1, \alpha_2, \dots, \alpha_n is a sequence of positive step sizes and H(\theta^k, X) is an unbiased estimate of \nabla F(\theta^k), i.e.,
  \mathbb{E}\left[ H(\theta^k, X) \right] = \nabla F(\theta^k)

Robbins, H. and Monro, S. "A Stochastic Approximation Method", 1951.
I2DL: Prof. Dai 33
Convergence of SGD
\theta^{k+1} = \theta^k - \alpha_k H(\theta^k, X)
converges to a local (global) minimum if the following conditions are met:
1) \alpha_n \ge 0, \; \forall n \ge 0
2) \sum_{n=1}^{\infty} \alpha_n = \infty
3) \sum_{n=1}^{\infty} \alpha_n^2 < \infty
4) F(\theta) is strictly convex

The sequence proposed by Robbins and Monro is \alpha_n \propto \frac{\alpha}{n}, for n > 0.

I2DL: Prof. Dai 34
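A minimal sketch of such a Robbins-Monro step-size schedule (\alpha_n \propto \alpha / n); the base rate alpha0 is an illustrative choice. It satisfies conditions 2) and 3) above, since \sum_n 1/n diverges while \sum_n 1/n^2 is finite:

```python
def robbins_monro_schedule(alpha0=0.1):
    """Yield step sizes alpha_n = alpha0 / n for n = 1, 2, 3, ..."""
    n = 1
    while True:
        yield alpha0 / n
        n += 1

# Usage: pair each SGD iteration with the next step size, e.g.
#   steps = robbins_monro_schedule()
#   theta = theta - next(steps) * g
```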


Problems of SGD
• The gradient is scaled equally across all dimensions
  → i.e., we cannot independently scale directions
  → we need a conservatively small learning rate to avoid divergence
  → slower than ‘necessary’

• Finding a good learning rate is an art by itself
  → more next lecture

I2DL: Prof. Dai 35


Gradient Descent with Momentum

Source: A. Ng
[Figure: an elongated loss contour. Along one dimension we’re making many steps back and forth; we would love to track that this averages out over time. Along the other dimension we would love to go faster, i.e., accumulate gradients over time.]

I2DL: Prof. Dai 36


Gradient Descent with Momentum
\boldsymbol{v}^{k+1} = \beta \cdot \boldsymbol{v}^k - \alpha \cdot \nabla_\theta L(\theta^k)
  where \boldsymbol{v} is the velocity, \beta the accumulation rate (‘friction’, momentum), \alpha the learning rate, and \nabla_\theta L(\theta^k) the gradient of the current minibatch.

\theta^{k+1} = \theta^k + \boldsymbol{v}^{k+1}
  i.e., the weights of the model are updated with the velocity.

This is an exponentially-weighted average of the gradients. Important: the velocity \boldsymbol{v}^k is vector-valued!
[Sutskever et al., ICML’13] On the importance of initialization and momentum in deep learning
I2DL: Prof. Dai 37
Gradient Descent with Momentum

The step will be largest when a sequence of gradients all point in the same direction.

Hyperparameters are \alpha and \beta; \beta is often set to 0.9.

Source: I. Goodfellow

\theta^{k+1} = \theta^k + \boldsymbol{v}^{k+1}
I2DL: Prof. Dai 38
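A minimal sketch of this momentum update (the default values below are illustrative; the slide only suggests \beta ≈ 0.9, and \alpha needs tuning):

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    """One SGD-with-momentum update.

    v^{k+1}     = beta * v^k - alpha * grad(theta^k)
    theta^{k+1} = theta^k + v^{k+1}
    """
    v_new = beta * v - alpha * grad(theta)
    return theta + v_new, v_new

# The velocity is typically initialized to zeros, e.g. v = np.zeros_like(theta)
```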


Gradient Descent with Momentum
• Can it overcome local minima?

\theta^{k+1} = \theta^k + \boldsymbol{v}^{k+1}
I2DL: Prof. Dai 39
Nesterov Momentum
• Look-ahead momentum:
  \tilde{\theta}^{k+1} = \theta^k + \beta \cdot \boldsymbol{v}^k
  \boldsymbol{v}^{k+1} = \beta \cdot \boldsymbol{v}^k - \alpha \cdot \nabla_\theta L(\tilde{\theta}^{k+1})
  \theta^{k+1} = \theta^k + \boldsymbol{v}^{k+1}

Nesterov, Yurii E. "A method for solving the convex programming problem with convergence rate O(1/k^2)." Dokl. Akad. Nauk SSSR, Vol. 269, 1983.

I2DL: Prof. Dai 40


Nesterov Momentum

[Figure (Source: G. Hinton): the gradient is evaluated at the look-ahead point \tilde{\theta}^{k+1} rather than at \theta^k.]

\tilde{\theta}^{k+1} = \theta^k + \beta \cdot \boldsymbol{v}^k
\boldsymbol{v}^{k+1} = \beta \cdot \boldsymbol{v}^k - \alpha \cdot \nabla_\theta L(\tilde{\theta}^{k+1})
\theta^{k+1} = \theta^k + \boldsymbol{v}^{k+1}
I2DL: Prof. Dai 41
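A minimal sketch of the Nesterov look-ahead update, for comparison with the plain momentum sketch above (hyperparameter values are typical choices, not prescribed by the slides):

```python
def nesterov_step(theta, v, grad, alpha=0.01, beta=0.9):
    """One Nesterov momentum update: evaluate the gradient at the look-ahead point."""
    theta_lookahead = theta + beta * v                 # ~theta^{k+1} = theta^k + beta * v^k
    v_new = beta * v - alpha * grad(theta_lookahead)   # gradient at the look-ahead point
    return theta + v_new, v_new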
Root Mean Squared Prop (RMSProp)
[Figure (Source: A. Ng): loss contours with large gradients in one direction and small gradients in the other.]

• RMSProp divides the learning rate by an exponentially-decaying average of squared gradients.

Hinton et al. "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural
networks for machine learning 4.2 (2012): 26-31.

I2DL: Prof. Dai 42


RMSProp

\boldsymbol{s}^{k+1} = \beta \cdot \boldsymbol{s}^k + (1 - \beta)[\nabla_\theta L \circ \nabla_\theta L]    (\circ: element-wise multiplication)

\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{\boldsymbol{s}^{k+1}} + \epsilon}

Hyperparameters: \alpha, \beta, \epsilon
  – \alpha needs tuning!
  – \beta is often 0.9
  – \epsilon is typically 10^{-8}
I2DL: Prof. Dai 43
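A minimal sketch of the RMSProp step above (the hyperparameter values are the typical ones named on the slide; the base learning rate is an illustrative choice):

```python
import numpy as np

def rmsprop_step(theta, s, grad, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update; s is the running average of squared gradients."""
    g = grad(theta)
    s_new = beta * s + (1.0 - beta) * g * g                # element-wise square
    theta_new = theta - alpha * g / (np.sqrt(s_new) + eps)
    return theta_new, s_new

# s is typically initialized to zeros, e.g. s = np.zeros_like(theta)
```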


RMSProp

[Figure (Source: A. Ng): loss contours with large gradients in the y-direction and small gradients in the x-direction.]

The (uncentered) variance of the gradients → second momentum:
  \boldsymbol{s}^{k+1} = \beta \cdot \boldsymbol{s}^k + (1 - \beta)[\nabla_\theta L \circ \nabla_\theta L]

We’re dividing by the squared gradients:
  \theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{\boldsymbol{s}^{k+1}} + \epsilon}
  – the division in the y-direction will be large
  – the division in the x-direction will be small
→ Can increase the learning rate!
I2DL: Prof. Dai 44


RMSProp
• Dampens the oscillations in high-variance directions
• Can use a faster learning rate because it is less likely to diverge
  → speeds up learning
  → uses the second moment

I2DL: Prof. Dai 45


Adaptive Moment Estimation (Adam)
Idea: combine Momentum and RMSProp.

First momentum (mean of the gradients):
  \boldsymbol{m}^{k+1} = \beta_1 \cdot \boldsymbol{m}^k + (1 - \beta_1) \nabla_\theta L(\theta^k)
Second momentum (variance of the gradients):
  \boldsymbol{v}^{k+1} = \beta_2 \cdot \boldsymbol{v}^k + (1 - \beta_2)[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]

\theta^{k+1} = \theta^k - \alpha \cdot \frac{\boldsymbol{m}^{k+1}}{\sqrt{\boldsymbol{v}^{k+1}} + \epsilon}
  (Note: this is not yet the update rule of Adam.)

Q: What happens at k = 0?
A: We need bias correction, since \boldsymbol{m}^0 = 0 and \boldsymbol{v}^0 = 0.

[Kingma et al., ICLR’15] Adam: A method for stochastic optimization


I2DL: Prof. Dai 46
Adam : Bias Corrected
• Combines Momentum and RMSProp:
  \boldsymbol{m}^{k+1} = \beta_1 \cdot \boldsymbol{m}^k + (1 - \beta_1) \nabla_\theta L(\theta^k)
  \boldsymbol{v}^{k+1} = \beta_2 \cdot \boldsymbol{v}^k + (1 - \beta_2)[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]

• \boldsymbol{m}^k and \boldsymbol{v}^k are initialized with zero
  → bias towards zero
  → need bias-corrected moment updates:
  \hat{\boldsymbol{m}}^{k+1} = \frac{\boldsymbol{m}^{k+1}}{1 - \beta_1^{k+1}}, \quad \hat{\boldsymbol{v}}^{k+1} = \frac{\boldsymbol{v}^{k+1}}{1 - \beta_2^{k+1}}

• Update rule of Adam:
  \theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{\boldsymbol{m}}^{k+1}}{\sqrt{\hat{\boldsymbol{v}}^{k+1}} + \epsilon}

I2DL: Prof. Dai 47


Adam
• Exponentially-decaying mean and variance of the gradients (combines first and second order momentum)

• Hyperparameters: \alpha, \beta_1, \beta_2, \epsilon
  – \alpha needs tuning!
  – \beta_1 is often 0.9
  – \beta_2 is often 0.999
  – \epsilon is typically 10^{-8}
  (these are the defaults in PyTorch)

  \boldsymbol{m}^{k+1} = \beta_1 \cdot \boldsymbol{m}^k + (1 - \beta_1) \nabla_\theta L(\theta^k)
  \boldsymbol{v}^{k+1} = \beta_2 \cdot \boldsymbol{v}^k + (1 - \beta_2)[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]
  \hat{\boldsymbol{m}}^{k+1} = \frac{\boldsymbol{m}^{k+1}}{1 - \beta_1^{k+1}}, \quad \hat{\boldsymbol{v}}^{k+1} = \frac{\boldsymbol{v}^{k+1}}{1 - \beta_2^{k+1}}
  \theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{\boldsymbol{m}}^{k+1}}{\sqrt{\hat{\boldsymbol{v}}^{k+1}} + \epsilon}

I2DL: Prof. Dai 48
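A sketch of the bias-corrected Adam update above; in practice one would simply use torch.optim.Adam, whose default hyperparameters match the values named on the slide (the learning rate shown here, 1e-3, is PyTorch's default and still needs tuning):

```python
import numpy as np

def adam_step(theta, m, v, grad, k, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One bias-corrected Adam update; k is the (0-based) iteration counter."""
    g = grad(theta)
    m = beta1 * m + (1.0 - beta1) * g            # first moment (mean of gradients)
    v = beta2 * v + (1.0 - beta2) * g * g        # second moment (uncentered variance)
    m_hat = m / (1.0 - beta1 ** (k + 1))         # bias correction
    v_hat = v / (1.0 - beta2 ** (k + 1))
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# In PyTorch this corresponds to:
#   optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
```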


There are a few others…
• ‘Vanilla’ SGD
• Momentum
• RMSProp
• Adagrad
• Adadelta
• AdaMax
• Nadam
• AMSGrad

Adam is mostly the method of choice for neural networks!

It’s actually fun to play around with SGD updates: it’s easy and you get pretty immediate feedback ☺
I2DL: Prof. Dai 50
Convergence

Source: http://ruder.io/optimizing-gradient-descent/
I2DL: Prof. Dai 51
Convergence

Source: http://ruder.io/optimizing-gradient-descent/
I2DL: Prof. Dai 52
Convergence

Source: https://github.com/Jaewan-Yun/optimizer-visualization
I2DL: Prof. Dai 53
Jacobian and Hessian
• Derivative: f: \mathbb{R} \to \mathbb{R}, \quad \frac{df(x)}{dx}

• Gradient: f: \mathbb{R}^m \to \mathbb{R}, \quad \nabla_x f(x) = \left[ \frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \dots \right]

• Jacobian: f: \mathbb{R}^m \to \mathbb{R}^n, \quad \mathbf{J} \in \mathbb{R}^{n \times m}

• Hessian (SECOND DERIVATIVE): f: \mathbb{R}^m \to \mathbb{R}, \quad \mathbf{H} \in \mathbb{R}^{m \times m}

I2DL: Prof. Dai 54


Newton’s Method
• Approximate our function by a second-order Taylor series expansion:
  L(\theta) \approx L(\theta_0) + (\theta - \theta_0)^T \nabla_\theta L(\theta_0) + \frac{1}{2}(\theta - \theta_0)^T \mathbf{H} (\theta - \theta_0)
  with the first derivative \nabla_\theta L(\theta_0) and the second derivative (curvature) \mathbf{H}.

More info: https://en.wikipedia.org/wiki/Taylor_series

I2DL: Prof. Dai 55


Newton’s Method
• Differentiate and equate to zero to get the update step:
  \theta^* = \theta_0 - \mathbf{H}^{-1} \nabla_\theta L(\theta_0)

We got rid of the learning rate!
(Compare with SGD: \theta^{k+1} = \theta^k - \alpha \nabla_\theta L(\theta^k, x_i, y_i))

I2DL: Prof. Dai 56
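A minimal sketch of one Newton step on a small quadratic objective (the matrix A, vector b, and starting point are illustrative assumptions); for a quadratic, a single step lands exactly on the minimum:

```python
import numpy as np

# Illustrative objective: f(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(theta):
    return A @ theta - b

def hessian(theta):
    return A                                   # constant Hessian for a quadratic

theta = np.array([5.0, 5.0])
# Newton step theta* = theta - H^{-1} grad: solve the linear system instead of inverting H
theta = theta - np.linalg.solve(hessian(theta), grad(theta))
print(theta, np.allclose(A @ theta, b))        # exact minimum after one step
```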


Newton’s Method
• Differentiate and equate to zero to get the update step:
  \theta^* = \theta_0 - \mathbf{H}^{-1} \nabla_\theta L(\theta_0)

With millions of parameters in a network, the Hessian has (millions)^2 elements, and the ‘inversion’ has to be computed at every iteration → prohibitive computational complexity.

I2DL: Prof. Dai 57


Newton’s Method
• Gradient Descent (green)

• Newton’s method exploits the curvature to take a more direct route

Source: https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization

I2DL: Prof. Dai 58


Newton’s Method
J(\theta) = (\mathbf{y} - \mathbf{X}\theta)^T (\mathbf{y} - \mathbf{X}\theta)

Can you apply Newton’s method for linear regression? What do you get as a result?

I2DL: Prof. Dai 59


BFGS and L-BFGS
• Broyden-Fletcher-Goldfarb-Shanno algorithm
• Belongs to the family of quasi-Newton methods
• Maintains an approximation of the inverse of the Hessian in
  \theta^* = \theta_0 - \mathbf{H}^{-1} \nabla_\theta L(\theta)

• BFGS
• Limited-memory variant: L-BFGS

I2DL: Prof. Dai 60


Gauss-Newton
• Newton: x_{k+1} = x_k - H_f(x_k)^{-1} \nabla f(x_k)
  – ‘true’ 2nd derivatives are often hard to obtain (e.g., numerics)
  – approximate H_f \approx 2 J_F^T J_F
• Gauss-Newton (GN):
  x_{k+1} = x_k - [2 J_F(x_k)^T J_F(x_k)]^{-1} \nabla f(x_k)
• Solve a linear system instead (again, inverting a matrix is unstable); solve for the delta vector:
  2 [J_F(x_k)^T J_F(x_k)] (x_k - x_{k+1}) = \nabla f(x_k)
I2DL: Prof. Dai 61
Levenberg
• Levenberg
  – “damped” version of Gauss-Newton (Tikhonov regularization):
    [J_F(x_k)^T J_F(x_k) + \lambda \cdot I] (x_k - x_{k+1}) = \nabla f(x_k)
  – The damping factor \lambda is adjusted in each iteration, ensuring f(x_k) > f(x_{k+1})
    • if the condition is not fulfilled, increase \lambda
    • → trust region
  – → “Interpolation” between Gauss-Newton (small \lambda) and Gradient Descent (large \lambda)

I2DL: Prof. Dai 62


Levenberg-Marquardt
• Levenberg-Marquardt (LM):
  [J_F(x_k)^T J_F(x_k) + \lambda \cdot \mathrm{diag}(J_F(x_k)^T J_F(x_k))] (x_k - x_{k+1}) = \nabla f(x_k)
  – Instead of a plain Gradient Descent for large \lambda, scale each component of the gradient according to the curvature.
    • Avoids slow convergence in components with a small gradient

I2DL: Prof. Dai 63
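A minimal sketch of one damped step for both variants, following the systems above (J and g are assumed to be precomputed NumPy arrays, J = J_F(x_k) and g = \nabla f(x_k), and lam is the current damping factor):

```python
import numpy as np

def levenberg_step(x, J, g, lam):
    """Levenberg: solve (J^T J + lam * I) * delta = g, then x_{k+1} = x_k - delta."""
    JtJ = J.T @ J
    delta = np.linalg.solve(JtJ + lam * np.eye(JtJ.shape[0]), g)
    return x - delta

def levenberg_marquardt_step(x, J, g, lam):
    """LM variant: damp with diag(J^T J) instead of the identity."""
    JtJ = J.T @ J
    delta = np.linalg.solve(JtJ + lam * np.diag(np.diag(JtJ)), g)
    return x - delta
```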


Which, What, and When?
• Standard: Adam

• Fallback option: SGD with momentum

• Newton, L-BFGS, GN, LM: only if you can do full batch updates (doesn’t work well for minibatches!!)
  – This practically never happens for DL.
  – Theoretically, it would be nice though, due to fast convergence.
I2DL: Prof. Dai 64
General Optimization
• Linear Systems (Ax = b)
– LU, QR, Cholesky, Jacobi, Gauss-Seidel, CG, PCG, etc.
• Non-linear (gradient-based)
– Newton, Gauss-Newton, LM, (L)BFGS ← second
order
– Gradient Descent, SGD ← first order
• Others
– Genetic algorithms, MCMC, Metropolis-Hastings, etc.
– Constrained and convex solvers (Lagrange, ADMM, Primal-Dual, etc.)
I2DL: Prof. Dai 65
Please Remember!
• Think about your problem and optimization at hand

• SGD is specifically designed for minibatches

• When you can, use a 2nd order method → it’s just faster

• GD or SGD is not a way to solve a linear system!

I2DL: Prof. Dai 66


Next Lecture
• This week:
– Check exercises
– Check office hours ☺

• Next lecture
– Training Neural networks

I2DL: Prof. Dai 72


See you next week ☺

I2DL: Prof. Dai 73


Some References to SGD Updates
• Goodfellow et al. “Deep Learning” (2016),
– Chapter 8: Optimization
• Bishop “Pattern Recognition and Machine Learning”
(2006),
– Chapter 5.2: Network training (gradient descent)
– Chapter 5.4: The Hessian Matrix (second order methods)
• https://ruder.io/optimizing-gradient-descent/index.html
• PyTorch Documentation (with further readings)
  – https://pytorch.org/docs/stable/optim.html

I2DL: Prof. Dai 74
