
Scaling Optimization

I2DL: Prof. Dai 1


Lecture 4 Recap

I2DL: Prof. Dai 2


Neural Network

Source: http://cs231n.github.io/neural-networks-1/

I2DL: Prof. Dai 3


Neural Network
[Figure: a fully-connected network with an input layer, three hidden layers, and an output layer. Depth = number of layers; width = number of neurons per layer.]
I2DL: Prof. Dai 4
Compute Graphs → Neural Networks
[Compute graph: the inputs x_0, x_1 are multiplied by the weights w_0, w_1 (the unknowns!), summed, and passed through a ReLU activation \max(0, x) (not arguing this is the right choice here); the output \hat{y}_0 is compared to the target y_0 (e.g., class label / regression target) and squared, giving the L2 loss/cost.]

We want to compute gradients w.r.t. all weights \boldsymbol{W}
I2DL: Prof. Dai 5
Compute Graphs → Neural Networks
[Compute graph with inputs x_0, x_1 and three outputs \hat{y}_0, \hat{y}_1, \hat{y}_2: each output unit multiplies the inputs by its weights w_{i,0}, w_{i,1}, sums them, subtracts the target y_i, and squares the result, giving one loss/cost term per output.]
We want to compute gradients w.r.t. all weights 𝑾
I2DL: Prof. Dai 6
Compute Graphs → Neural Networks
Goal: we want to compute gradients of the loss function L w.r.t. all weights \boldsymbol{W} AND all biases \boldsymbol{b}.

L = \sum_i L_i, i.e., the sum over the per-sample losses. For the L2 loss we simply sum up squares: L_i = (\hat{y}_i - y_i)^2

\hat{y}_i = A(b_i + \sum_k x_k w_{i,k}), with activation function A and bias b_i

→ use the chain rule to compute the partials:
  \frac{\partial L}{\partial w_{i,k}} = \frac{\partial L}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial w_{i,k}}
I2DL: Prof. Dai 7
Summary
• We have
  – (Directional) compute graph
  – Structure graph into layers
  – Compute partial derivatives w.r.t. the weights (unknowns):
    \nabla_{\boldsymbol{W}} f_{\boldsymbol{x},\boldsymbol{y}}(\boldsymbol{W}) = \left[ \frac{\partial f}{\partial w_{0,0,0}}, \dots, \frac{\partial f}{\partial w_{l,m,n}}, \dots, \frac{\partial f}{\partial b_{l,m}} \right]

• Next
  – Find weights based on gradients. Gradient step:
    \boldsymbol{W}' = \boldsymbol{W} - \alpha \nabla_{\boldsymbol{W}} f_{\boldsymbol{x},\boldsymbol{y}}(\boldsymbol{W})
I2DL: Prof. Dai 8


Optimization

I2DL: Prof. Dai 9


Gradient Descent
x^* = \arg\min_x f(x)

[Figure: a 1D function f(x) with the initialization marked on the curve and the optimum at the bottom of the valley.]
I2DL: Prof. Dai 10


Gradient Descent
x^* = \arg\min_x f(x)

[Figure: the same 1D function; starting from the initialization, follow the slope of the DERIVATIVE down towards the optimum.]
I2DL: Prof. Dai 11


Gradient Descent
• From derivative to gradient:
  \frac{df(x)}{dx} \;\to\; \nabla_x f(x)
  The gradient \nabla_x f(x) points in the direction of greatest increase of the function.
• Gradient descent steps in the direction of the negative gradient:
  x' = x - \alpha \nabla_x f(x)
  where \alpha is the learning rate.
I2DL: Prof. Dai 12
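A minimal sketch of this update rule on a toy 1D objective (the function, the starting point, and the learning rate are illustrative choices, not from the slides):

```python
# Toy 1D objective f(x) = (x - 3)^2 + 1; gradient descent follows -grad_f.
def f(x):
    return (x - 3.0) ** 2 + 1.0

def grad_f(x):
    return 2.0 * (x - 3.0)

x = 10.0          # initialization
alpha = 0.1       # learning rate
for step in range(100):
    x = x - alpha * grad_f(x)   # x' = x - alpha * grad f(x)

print(x, f(x))    # x approaches the optimum x* = 3
```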


Gradient Descent
• From derivative to gradient: \nabla_x f(x) points in the direction of greatest increase of the function.
• Gradient descent steps in the direction of the negative gradient:
  x' = x - \alpha \nabla_x f(x)
  Here: a SMALL learning rate \alpha.
I2DL: Prof. Dai 13


Gradient Descent
• From derivative to gradient: \nabla_x f(x) points in the direction of greatest increase of the function.
• Gradient descent steps in the direction of the negative gradient:
  x' = x - \alpha \nabla_x f(x)
  Here: a LARGE learning rate \alpha.
I2DL: Prof. Dai 14


Gradient Descent
\boldsymbol{x}^* = \arg\min_{\boldsymbol{x}} f(\boldsymbol{x})

[Figure: a non-convex function with several valleys; starting from the initialization, gradient descent settles in the nearest valley (the marked optimum).]

What is the gradient when we reach this point? Gradient descent is not guaranteed to reach the global optimum.
I2DL: Prof. Dai 15
Convergence of Gradient Descent
• Convex function: all local minima are global minima

Source: https://en.wikipedia.org/wiki/Convex_function#/media/File:ConvexFunction.svg

A function is convex if the line/plane segment between any two points on its graph lies above or on the graph.

I2DL: Prof. Dai 16


Convergence of Gradient Descent
• Neural networks are non-convex
– many (different) local minima
– no (practical) way to say which is globally optimal

Source: Li, Qi. (2006). Challenging Registration of Geologic Image Data

I2DL: Prof. Dai 17


Convergence of Gradient Descent

Source: https://builtin.com/data-science/gradient-descent

I2DL: Prof. Dai 18


Convergence of Gradient Descent

Source: A. Geron
I2DL: Prof. Dai 19
Gradient Descent: Multiple Dimensions

Source: builtin.com/data-science/gradient-descent

Various ways to visualize…


I2DL: Prof. Dai 20
Gradient Descent: Multiple Dimensions

Source: http://blog.datumbox.com/wp-content/uploads/2013/10/gradient-descent.png
I2DL: Prof. Dai 21
Gradient Descent for Neural Networks
Loss function: L_i = (\hat{y}_i - y_i)^2

[Figure: a two-layer network with inputs x_0, x_1, x_2, hidden units h_0, \dots, h_3, and outputs \hat{y}_0, \hat{y}_1 compared against targets y_0, y_1.]

\nabla_{\boldsymbol{W},\boldsymbol{b}} f_{\boldsymbol{x},\boldsymbol{y}}(\boldsymbol{W}) = \left[ \frac{\partial f}{\partial w_{0,0,0}}, \dots, \frac{\partial f}{\partial w_{l,m,n}}, \dots, \frac{\partial f}{\partial b_{l,m}} \right]

\hat{y}_i = A(b_{1,i} + \sum_j h_j w_{1,i,j})
h_j = A(b_{0,j} + \sum_k x_k w_{0,j,k})

A is just a simple ReLU: A(x) = \max(0, x)
I2DL: Prof. Dai 22
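A small sketch of this two-layer forward pass and loss; the layer sizes, random weights, and input/target values below are made-up for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Made-up sizes: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
W0, b0 = rng.normal(size=(4, 3)), np.zeros(4)   # first layer weights/biases
W1, b1 = rng.normal(size=(2, 4)), np.zeros(2)   # second layer weights/biases

x = np.array([1.0, -0.5, 2.0])    # input sample
y = np.array([0.5, 1.0])          # regression target

h     = relu(b0 + W0 @ x)         # h_j = A(b_{0,j} + sum_k x_k w_{0,j,k})
y_hat = relu(b1 + W1 @ h)         # y_hat_i = A(b_{1,i} + sum_j h_j w_{1,i,j})
L     = np.sum((y_hat - y) ** 2)  # L = sum_i (y_hat_i - y_i)^2
```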
Gradient Descent: Single Training Sample
• Given a loss function L and a single training sample \{x_i, y_i\}
• Find the best model parameters \theta = (\boldsymbol{W}, \boldsymbol{b})
• Cost: L_i(\theta, x_i, y_i)
  – \theta^* = \arg\min_\theta L_i(\theta, x_i, y_i)
• Gradient Descent:
  – Initialize \theta^1 with ‘random’ values (more on that later)
  – \theta^{k+1} = \theta^k - \alpha \nabla_\theta L_i(\theta^k, x_i, y_i)
  – Iterate until convergence: \|\theta^{k+1} - \theta^k\| < \epsilon

I2DL: Prof. Dai 23


Gradient Descent: Single Training Sample

– \theta^{k+1} = \theta^k - \alpha \nabla_\theta L_i(\theta^k, x_i, y_i)
  where \theta^{k+1} are the weights and biases after the update step, \theta^k are the weights and biases at step k (the current model), \alpha is the learning rate, L_i is the loss function, \{x_i, y_i\} is the training sample, and \nabla_\theta L_i is the gradient w.r.t. \theta.
– \nabla_\theta L_i(\theta^k, x_i, y_i) is computed via backpropagation
– Typically: \dim \nabla_\theta L_i(\theta^k, x_i, y_i) = \dim \theta \gg 1 million

I2DL: Prof. Dai 24


Gradient Descent: Multiple Training Samples

• Given a loss function 𝐿 and multiple (𝑛) training


samples {𝒙𝑖 , 𝒚𝑖 }
• Find best model parameters 𝜽 = 𝑾, 𝒃

1
• Cost 𝐿 = σ𝑛𝑖=1 𝐿𝑖 (𝜽, 𝒙𝑖 , 𝒚𝑖 )
𝑛
– 𝜽 = arg min 𝐿

I2DL: Prof. Dai 25


Gradient Descent: Multiple Training Samples
• Update step for multiple samples:
  \theta^{k+1} = \theta^k - \alpha \nabla_\theta L(\theta^k, x_{\{1..n\}}, y_{\{1..n\}})
• The gradient is the average / sum over the per-sample gradients (reminder: this comes from backprop):
  \nabla_\theta L(\theta^k, x_{\{1..n\}}, y_{\{1..n\}}) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L_i(\theta^k, x_i, y_i)
• Often people are lazy and just write \nabla L = \sum_{i=1}^{n} \nabla_\theta L_i
  – omitting the \frac{1}{n} is not ‘wrong’, it just means rescaling the learning rate
I2DL: Prof. Dai 26
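A minimal sketch of one full-batch update that averages the per-sample gradients; `grad_Li` is a hypothetical function returning \nabla_\theta L_i for one sample (e.g., from backprop):

```python
import numpy as np

def full_batch_step(theta, xs, ys, grad_Li, alpha):
    """One gradient descent step over the whole training set.

    grad_Li(theta, x, y) is assumed to return dL_i/dtheta for one sample.
    """
    grad = np.zeros_like(theta)
    for x, y in zip(xs, ys):
        grad += grad_Li(theta, x, y)
    grad /= len(xs)                  # the 1/n averaging
    return theta - alpha * grad      # theta^{k+1} = theta^k - alpha * grad
```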
Side Note: Optimal Learning Rate
Can compute the optimal learning rate \alpha using line search (optimal for a given set):

1. Compute the gradient: \nabla_\theta L = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L_i
2. Optimize for the optimal step \alpha: \arg\min_\alpha L(\theta^k - \alpha \nabla_\theta L)
3. \theta^{k+1} = \theta^k - \alpha \nabla_\theta L

Not that practical for DL since we need to solve a huge system every step…

I2DL: Prof. Dai 27
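A naive sketch of step 2, searching over a small fixed grid of candidate step sizes (the grid and the helper names `loss`/`grad` are illustrative assumptions; practical line searches use e.g. backtracking):

```python
def line_search_step(theta, loss, grad, candidates=(1e-3, 1e-2, 1e-1, 1.0)):
    """Pick the alpha in `candidates` that minimizes L(theta - alpha * grad)."""
    g = grad(theta)
    best_alpha = min(candidates, key=lambda a: loss(theta - a * g))
    return theta - best_alpha * g
```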


Gradient Descent on Train Set
• Given a large train set with n training samples \{x_i, y_i\}
  – Let’s say 1 million labeled images
  – Let’s say our network has 500k parameters
• The gradient has 500k dimensions
• n = 1 million
→ Extremely expensive to compute
I2DL: Prof. Dai 28


Stochastic Gradient Descent (SGD)
• If we have n training samples, we need to compute the gradient for all of them, which is O(n).
• If we consider the problem as empirical risk minimization, we can express the total loss over the training data as an expectation over the samples:
  \frac{1}{n} \sum_{i=1}^{n} L_i(\theta, x_i, y_i) = \mathbb{E}_{i \sim \{1,\dots,n\}} \left[ L_i(\theta, x_i, y_i) \right]

I2DL: Prof. Dai 29


Stochastic Gradient Descent (SGD)
• The expectation can be approximated with a small subset of the data:
  \mathbb{E}_{i \sim \{1,\dots,n\}} \left[ L_i(\theta, x_i, y_i) \right] \approx \frac{1}{|S|} \sum_{j \in S} L_j(\theta, x_j, y_j)  with  S \subseteq \{1, \dots, n\}

• Minibatch: choose a subset of the train set of size m \ll n:
  B_i = \{ (x_1, y_1), (x_2, y_2), \dots, (x_m, y_m) \},  giving minibatches \{ B_1, B_2, \dots, B_{n/m} \}

I2DL: Prof. Dai 30


Stochastic Gradient Descent (SGD)
• Minibatch size is a hyperparameter
  – Typically a power of 2 → 8, 16, 32, 64, 128…
  – A smaller batch size means greater variance in the gradients → noisy updates
  – Mostly limited by GPU memory (in the backward pass)
  – E.g.,
    • the train set has n = 2^20 (about 1 million) images
    • with batch size m = 64 we get B_1 … B_{n/m} = B_1 … B_{16,384}, i.e., 16,384 minibatches per epoch
  (Epoch = one complete pass through the training set)
I2DL: Prof. Dai 31
Stochastic Gradient Descent (SGD)
\theta^{k+1} = \theta^k - \alpha \nabla_\theta L(\theta^k, x_{\{1..m\}}, y_{\{1..m\}})

k now refers to the k-th iteration, and m is the number of training samples in the current minibatch.

Gradient for the k-th minibatch:
  \nabla_\theta L = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L_i

Note the terminology: iteration vs. epoch


I2DL: Prof. Dai 32
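A sketch of the resulting minibatch SGD loop, making the iteration/epoch distinction explicit. `grad_minibatch` is a hypothetical helper that returns the averaged minibatch gradient (e.g., via backprop), and X, Y are assumed to be NumPy arrays:

```python
import numpy as np

def sgd(theta, X, Y, grad_minibatch, alpha=0.01, batch_size=64, epochs=10):
    n = len(X)
    rng = np.random.default_rng(0)
    for epoch in range(epochs):                  # one epoch = full pass over the data
        order = rng.permutation(n)               # reshuffle each epoch
        for start in range(0, n, batch_size):    # each inner step = one iteration
            idx = order[start:start + batch_size]
            g = grad_minibatch(theta, X[idx], Y[idx])   # (1/m) * sum of per-sample grads
            theta = theta - alpha * g
    return theta
```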
Convergence of SGD
Suppose we want to minimize a function F(\theta) with the stochastic approximation
  \theta^{k+1} = \theta^k - \alpha_k H(\theta^k, X)
where \alpha_1, \alpha_2, \dots, \alpha_n is a sequence of positive step sizes and H(\theta^k, X) is an unbiased estimate of \nabla F(\theta^k), i.e.,
  \mathbb{E}\left[ H(\theta^k, X) \right] = \nabla F(\theta^k)

Robbins, H. and Monro, S. "A Stochastic Approximation Method", 1951.
I2DL: Prof. Dai 33
Convergence of SGD
\theta^{k+1} = \theta^k - \alpha_k H(\theta^k, X)
converges to a local (global) minimum if the following conditions are met:
1) \alpha_n \ge 0, \; \forall n \ge 0
2) \sum_{n=1}^{\infty} \alpha_n = \infty
3) \sum_{n=1}^{\infty} \alpha_n^2 < \infty
4) F(\theta) is strictly convex

The sequence proposed by Robbins and Monro is \alpha_n \propto \frac{\alpha}{n}, for n > 0.

I2DL: Prof. Dai 34
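A minimal sketch of such a Robbins-Monro step-size schedule (\alpha_n \propto \alpha / n); the base rate alpha0 is an illustrative choice. It satisfies conditions 2) and 3) above, since \sum_n 1/n diverges while \sum_n 1/n^2 is finite:

```python
def robbins_monro_schedule(alpha0=0.1):
    """Yield step sizes alpha_n = alpha0 / n for n = 1, 2, 3, ..."""
    n = 1
    while True:
        yield alpha0 / n
        n += 1

# Usage: pair each SGD iteration with the next step size, e.g.
#   steps = robbins_monro_schedule()
#   theta = theta - next(steps) * g
```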


Problems of SGD
• The gradient is scaled equally across all dimensions
  → i.e., we cannot independently scale directions
  → we need a conservatively small learning rate to avoid divergence
  → slower than ‘necessary’

• Finding a good learning rate is an art by itself
  → more next lecture

I2DL: Prof. Dai 35


Gradient Descent with Momentum

Source: A. Ng
[Figure: an elongated loss contour. Along one dimension we’re making many steps back and forth; we would love to track that this averages out over time. Along the other dimension we would love to go faster, i.e., accumulate gradients over time.]

I2DL: Prof. Dai 36


Gradient Descent with Momentum
\boldsymbol{v}^{k+1} = \beta \cdot \boldsymbol{v}^k - \alpha \cdot \nabla_\theta L(\theta^k)
  where \boldsymbol{v} is the velocity, \beta the accumulation rate (‘friction’, momentum), \alpha the learning rate, and \nabla_\theta L(\theta^k) the gradient of the current minibatch.

\theta^{k+1} = \theta^k + \boldsymbol{v}^{k+1}
  i.e., the weights of the model are updated with the velocity.

This is an exponentially-weighted average of the gradients. Important: the velocity \boldsymbol{v}^k is vector-valued!
[Sutskever et al., ICML’13] On the importance of initialization and momentum in deep learning
I2DL: Prof. Dai 37
Gradient Descent with Momentum

The step will be largest when a sequence of gradients all point in the same direction.

Hyperparameters are \alpha and \beta; \beta is often set to 0.9.

Source: I. Goodfellow

\theta^{k+1} = \theta^k + \boldsymbol{v}^{k+1}
I2DL: Prof. Dai 38
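A minimal sketch of this momentum update (the default values below are illustrative; the slide only suggests \beta ≈ 0.9, and \alpha needs tuning):

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    """One SGD-with-momentum update.

    v^{k+1}     = beta * v^k - alpha * grad(theta^k)
    theta^{k+1} = theta^k + v^{k+1}
    """
    v_new = beta * v - alpha * grad(theta)
    return theta + v_new, v_new

# The velocity is typically initialized to zeros, e.g. v = np.zeros_like(theta)
```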


Gradient Descent with Momentum
• Can it overcome local minima?

\theta^{k+1} = \theta^k + \boldsymbol{v}^{k+1}
I2DL: Prof. Dai 39
Nesterov Momentum
• Look-ahead momentum:
  \tilde{\theta}^{k+1} = \theta^k + \beta \cdot \boldsymbol{v}^k
  \boldsymbol{v}^{k+1} = \beta \cdot \boldsymbol{v}^k - \alpha \cdot \nabla_\theta L(\tilde{\theta}^{k+1})
  \theta^{k+1} = \theta^k + \boldsymbol{v}^{k+1}

Nesterov, Yurii E. "A method for solving the convex programming problem with convergence rate O(1/k^2)." Dokl. Akad. Nauk SSSR, Vol. 269, 1983.

I2DL: Prof. Dai 40


Nesterov Momentum

[Figure (Source: G. Hinton): the gradient is evaluated at the look-ahead point \tilde{\theta}^{k+1} rather than at \theta^k.]

\tilde{\theta}^{k+1} = \theta^k + \beta \cdot \boldsymbol{v}^k
\boldsymbol{v}^{k+1} = \beta \cdot \boldsymbol{v}^k - \alpha \cdot \nabla_\theta L(\tilde{\theta}^{k+1})
\theta^{k+1} = \theta^k + \boldsymbol{v}^{k+1}
I2DL: Prof. Dai 41
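A minimal sketch of the Nesterov look-ahead update, for comparison with the plain momentum sketch above (hyperparameter values are typical choices, not prescribed by the slides):

```python
def nesterov_step(theta, v, grad, alpha=0.01, beta=0.9):
    """One Nesterov momentum update: evaluate the gradient at the look-ahead point."""
    theta_lookahead = theta + beta * v                 # ~theta^{k+1} = theta^k + beta * v^k
    v_new = beta * v - alpha * grad(theta_lookahead)   # gradient at the look-ahead point
    return theta + v_new, v_new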
Root Mean Squared Prop (RMSProp)
[Figure (Source: A. Ng): loss contours with large gradients in one direction and small gradients in the other.]

• RMSProp divides the learning rate by an exponentially-decaying average of squared gradients.

Hinton et al. "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural
networks for machine learning 4.2 (2012): 26-31.

I2DL: Prof. Dai 42


RMSProp

\boldsymbol{s}^{k+1} = \beta \cdot \boldsymbol{s}^k + (1 - \beta)[\nabla_\theta L \circ \nabla_\theta L]    (\circ: element-wise multiplication)

\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{\boldsymbol{s}^{k+1}} + \epsilon}

Hyperparameters: \alpha, \beta, \epsilon
  – \alpha needs tuning!
  – \beta is often 0.9
  – \epsilon is typically 10^{-8}
I2DL: Prof. Dai 43
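A minimal sketch of the RMSProp step above (the hyperparameter values are the typical ones named on the slide; the base learning rate is an illustrative choice):

```python
import numpy as np

def rmsprop_step(theta, s, grad, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update; s is the running average of squared gradients."""
    g = grad(theta)
    s_new = beta * s + (1.0 - beta) * g * g                # element-wise square
    theta_new = theta - alpha * g / (np.sqrt(s_new) + eps)
    return theta_new, s_new

# s is typically initialized to zeros, e.g. s = np.zeros_like(theta)
```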


RMSProp

[Figure (Source: A. Ng): loss contours with large gradients in the y-direction and small gradients in the x-direction.]

The (uncentered) variance of the gradients → second momentum:
  \boldsymbol{s}^{k+1} = \beta \cdot \boldsymbol{s}^k + (1 - \beta)[\nabla_\theta L \circ \nabla_\theta L]

We’re dividing by the squared gradients:
  \theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{\boldsymbol{s}^{k+1}} + \epsilon}
  – the division in the y-direction will be large
  – the division in the x-direction will be small
→ Can increase the learning rate!
I2DL: Prof. Dai 44


RMSProp
• Dampens the oscillations in high-variance directions
• Can use a faster learning rate because it is less likely to diverge
  → speeds up learning
  → uses the second moment

I2DL: Prof. Dai 45


Adaptive Moment Estimation (Adam)
Idea: combine Momentum and RMSProp.

First momentum (mean of the gradients):
  \boldsymbol{m}^{k+1} = \beta_1 \cdot \boldsymbol{m}^k + (1 - \beta_1) \nabla_\theta L(\theta^k)
Second momentum (variance of the gradients):
  \boldsymbol{v}^{k+1} = \beta_2 \cdot \boldsymbol{v}^k + (1 - \beta_2)[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]

\theta^{k+1} = \theta^k - \alpha \cdot \frac{\boldsymbol{m}^{k+1}}{\sqrt{\boldsymbol{v}^{k+1}} + \epsilon}
  (Note: this is not yet the update rule of Adam.)

Q: What happens at k = 0?
A: We need bias correction, since \boldsymbol{m}^0 = 0 and \boldsymbol{v}^0 = 0.

[Kingma et al., ICLR’15] Adam: A method for stochastic optimization


I2DL: Prof. Dai 46
Adam : Bias Corrected
• Combines Momentum and RMSProp:
  \boldsymbol{m}^{k+1} = \beta_1 \cdot \boldsymbol{m}^k + (1 - \beta_1) \nabla_\theta L(\theta^k)
  \boldsymbol{v}^{k+1} = \beta_2 \cdot \boldsymbol{v}^k + (1 - \beta_2)[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]

• \boldsymbol{m}^k and \boldsymbol{v}^k are initialized with zero
  → bias towards zero
  → need bias-corrected moment updates:
  \hat{\boldsymbol{m}}^{k+1} = \frac{\boldsymbol{m}^{k+1}}{1 - \beta_1^{k+1}}, \quad \hat{\boldsymbol{v}}^{k+1} = \frac{\boldsymbol{v}^{k+1}}{1 - \beta_2^{k+1}}

• Update rule of Adam:
  \theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{\boldsymbol{m}}^{k+1}}{\sqrt{\hat{\boldsymbol{v}}^{k+1}} + \epsilon}

I2DL: Prof. Dai 47


Adam
• Exponentially-decaying mean and variance of the gradients (combines first and second order momentum)

• Hyperparameters: \alpha, \beta_1, \beta_2, \epsilon
  – \alpha needs tuning!
  – \beta_1 is often 0.9
  – \beta_2 is often 0.999
  – \epsilon is typically 10^{-8}
  (these are the defaults in PyTorch)

  \boldsymbol{m}^{k+1} = \beta_1 \cdot \boldsymbol{m}^k + (1 - \beta_1) \nabla_\theta L(\theta^k)
  \boldsymbol{v}^{k+1} = \beta_2 \cdot \boldsymbol{v}^k + (1 - \beta_2)[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]
  \hat{\boldsymbol{m}}^{k+1} = \frac{\boldsymbol{m}^{k+1}}{1 - \beta_1^{k+1}}, \quad \hat{\boldsymbol{v}}^{k+1} = \frac{\boldsymbol{v}^{k+1}}{1 - \beta_2^{k+1}}
  \theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{\boldsymbol{m}}^{k+1}}{\sqrt{\hat{\boldsymbol{v}}^{k+1}} + \epsilon}

I2DL: Prof. Dai 48
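A sketch of the bias-corrected Adam update above; in practice one would simply use torch.optim.Adam, whose default hyperparameters match the values named on the slide (the learning rate shown here, 1e-3, is PyTorch's default and still needs tuning):

```python
import numpy as np

def adam_step(theta, m, v, grad, k, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One bias-corrected Adam update; k is the (0-based) iteration counter."""
    g = grad(theta)
    m = beta1 * m + (1.0 - beta1) * g            # first moment (mean of gradients)
    v = beta2 * v + (1.0 - beta2) * g * g        # second moment (uncentered variance)
    m_hat = m / (1.0 - beta1 ** (k + 1))         # bias correction
    v_hat = v / (1.0 - beta2 ** (k + 1))
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# In PyTorch this corresponds to:
#   optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
```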


There are a few others…
• ‘Vanilla’ SGD
• Momentum
• RMSProp
• Adagrad
• Adadelta
• AdaMax
• Nadam
• AMSGrad

Adam is mostly the method of choice for neural networks!

It’s actually fun to play around with SGD updates: it’s easy and you get pretty immediate feedback ☺
I2DL: Prof. Dai 50
Convergence

Source: http://ruder.io/optimizing-gradient-descent/
I2DL: Prof. Dai 51
Convergence

Source: http://ruder.io/optimizing-gradient-descent/
I2DL: Prof. Dai 52
Convergence

Source: https://github.com/Jaewan-Yun/optimizer-visualization
I2DL: Prof. Dai 53
Jacobian and Hessian
• Derivative: f: \mathbb{R} \to \mathbb{R}, \quad \frac{df(x)}{dx}

• Gradient: f: \mathbb{R}^m \to \mathbb{R}, \quad \nabla_x f(x) = \left[ \frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \dots \right]

• Jacobian: f: \mathbb{R}^m \to \mathbb{R}^n, \quad \mathbf{J} \in \mathbb{R}^{n \times m}

• Hessian (SECOND DERIVATIVE): f: \mathbb{R}^m \to \mathbb{R}, \quad \mathbf{H} \in \mathbb{R}^{m \times m}

I2DL: Prof. Dai 54


Newton’s Method
• Approximate our function by a second-order Taylor series expansion:
  L(\theta) \approx L(\theta_0) + (\theta - \theta_0)^T \nabla_\theta L(\theta_0) + \frac{1}{2}(\theta - \theta_0)^T \mathbf{H} (\theta - \theta_0)
  with the first derivative \nabla_\theta L(\theta_0) and the second derivative (curvature) \mathbf{H}.

More info: https://en.wikipedia.org/wiki/Taylor_series

I2DL: Prof. Dai 55


Newton’s Method
• Differentiate and equate to zero to get the update step:
  \theta^* = \theta_0 - \mathbf{H}^{-1} \nabla_\theta L(\theta_0)

We got rid of the learning rate!
(Compare with SGD: \theta^{k+1} = \theta^k - \alpha \nabla_\theta L(\theta^k, x_i, y_i))

I2DL: Prof. Dai 56
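A minimal sketch of one Newton step on a small quadratic objective (the matrix A, vector b, and starting point are illustrative assumptions); for a quadratic, a single step lands exactly on the minimum:

```python
import numpy as np

# Illustrative objective: f(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(theta):
    return A @ theta - b

def hessian(theta):
    return A                                   # constant Hessian for a quadratic

theta = np.array([5.0, 5.0])
# Newton step theta* = theta - H^{-1} grad: solve the linear system instead of inverting H
theta = theta - np.linalg.solve(hessian(theta), grad(theta))
print(theta, np.allclose(A @ theta, b))        # exact minimum after one step
```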


Newton’s Method
• Differentiate and equate to zero to get the update step:
  \theta^* = \theta_0 - \mathbf{H}^{-1} \nabla_\theta L(\theta_0)

With millions of parameters in a network, the Hessian has (millions)^2 elements, and the ‘inversion’ has to be computed at every iteration → prohibitive computational complexity.

I2DL: Prof. Dai 57


Newton’s Method
• Gradient Descent (green)

• Newton’s method exploits the curvature to take a more direct route

Source: https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization

I2DL: Prof. Dai 58


Newton’s Method
J(\theta) = (\mathbf{y} - \mathbf{X}\theta)^T (\mathbf{y} - \mathbf{X}\theta)

Can you apply Newton’s method for linear regression? What do you get as a result?

I2DL: Prof. Dai 59


BFGS and L-BFGS
• Broyden-Fletcher-Goldfarb-Shanno algorithm
• Belongs to the family of quasi-Newton methods
• Maintains an approximation of the inverse of the Hessian in
  \theta^* = \theta_0 - \mathbf{H}^{-1} \nabla_\theta L(\theta)

• BFGS
• Limited-memory variant: L-BFGS

I2DL: Prof. Dai 60


Gauss-Newton
• Newton: x_{k+1} = x_k - H_f(x_k)^{-1} \nabla f(x_k)
  – ‘true’ 2nd derivatives are often hard to obtain (e.g., numerics)
  – approximate H_f \approx 2 J_F^T J_F
• Gauss-Newton (GN):
  x_{k+1} = x_k - [2 J_F(x_k)^T J_F(x_k)]^{-1} \nabla f(x_k)
• Solve a linear system instead (again, inverting a matrix is unstable); solve for the delta vector:
  2 [J_F(x_k)^T J_F(x_k)] (x_k - x_{k+1}) = \nabla f(x_k)
I2DL: Prof. Dai 61
Levenberg
• Levenberg
  – “damped” version of Gauss-Newton (Tikhonov regularization):
    [J_F(x_k)^T J_F(x_k) + \lambda \cdot I] (x_k - x_{k+1}) = \nabla f(x_k)
  – The damping factor \lambda is adjusted in each iteration, ensuring f(x_k) > f(x_{k+1})
    • if the condition is not fulfilled, increase \lambda
    • → trust region
  – → “Interpolation” between Gauss-Newton (small \lambda) and Gradient Descent (large \lambda)

I2DL: Prof. Dai 62


Levenberg-Marquardt
• Levenberg-Marquardt (LM):
  [J_F(x_k)^T J_F(x_k) + \lambda \cdot \mathrm{diag}(J_F(x_k)^T J_F(x_k))] (x_k - x_{k+1}) = \nabla f(x_k)
  – Instead of a plain Gradient Descent for large \lambda, scale each component of the gradient according to the curvature.
    • Avoids slow convergence in components with a small gradient

I2DL: Prof. Dai 63
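A minimal sketch of one damped step for both variants, following the systems above (J and g are assumed to be precomputed NumPy arrays, J = J_F(x_k) and g = \nabla f(x_k), and lam is the current damping factor):

```python
import numpy as np

def levenberg_step(x, J, g, lam):
    """Levenberg: solve (J^T J + lam * I) * delta = g, then x_{k+1} = x_k - delta."""
    JtJ = J.T @ J
    delta = np.linalg.solve(JtJ + lam * np.eye(JtJ.shape[0]), g)
    return x - delta

def levenberg_marquardt_step(x, J, g, lam):
    """LM variant: damp with diag(J^T J) instead of the identity."""
    JtJ = J.T @ J
    delta = np.linalg.solve(JtJ + lam * np.diag(np.diag(JtJ)), g)
    return x - delta
```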


Which, What, and When?
• Standard: Adam

• Fallback option: SGD with momentum

• Newton, L-BFGS, GN, LM: only if you can do full batch updates (doesn’t work well for minibatches!!)
  – This practically never happens for DL.
  – Theoretically, it would be nice though, due to fast convergence.
I2DL: Prof. Dai 64
General Optimization
• Linear Systems (Ax = b)
– LU, QR, Cholesky, Jacobi, Gauss-Seidel, CG, PCG, etc.
• Non-linear (gradient-based)
– Newton, Gauss-Newton, LM, (L)BFGS ← second
order
– Gradient Descent, SGD ← first order
• Others
– Genetic algorithms, MCMC, Metropolis-Hastings, etc.
– Constrained and convex solvers (Lagrange, ADMM, Primal-Dual, etc.)
I2DL: Prof. Dai 65
Please Remember!
• Think about your problem and optimization at hand

• SGD is specifically designed for minibatches

• When you can, use a 2nd order method → it’s just faster

• GD or SGD is not a way to solve a linear system!

I2DL: Prof. Dai 66


Next Lecture
• This week:
– Check exercises
– Check office hours ☺

• Next lecture
– Training Neural networks

I2DL: Prof. Dai 72


See you next week ☺

I2DL: Prof. Dai 73


Some References to SGD Updates
• Goodfellow et al. “Deep Learning” (2016),
– Chapter 8: Optimization
• Bishop “Pattern Recognition and Machine Learning”
(2006),
– Chapter 5.2: Network training (gradient descent)
– Chapter 5.4: The Hessian Matrix (second order methods)
• https://ruder.io/optimizing-gradient-descent/index.html
• PyTorch Documentation (with further readings)
  – https://pytorch.org/docs/stable/optim.html

I2DL: Prof. Dai 74
