
Optimization Algorithms in Neural Networks

Complete Guide to Neural Networks with Python: Theory and Applications


The Backpropagation
Remember, our objective is to: minimize the error by changing the weights.

Gradient Descent
We move in the direction opposite to the derivative (opposite to the slope).

Negative slope: when we increase $w$, the loss is decreasing → $-(-) = +$ → the weight increases (moving right).

Positive slope: when we increase $w$, the loss is increasing → $-(+) = -$ → the weight decreases (moving left).

Weight Update Rule:

$w \leftarrow w - \eta \frac{dE}{dw}$

Here $w$ on the right-hand side is the old weight, the negative sign moves us against the slope (the gradient), and $\eta$ is the learning rate: how fast we update the weights, in other words the step size of the update.

A high learning rate leads to overshooting during gradient descent and might never reach the minimum; a small, well-chosen learning rate leads to gradual descent towards the minimum.
Why isn't vanilla gradient descent good enough?

$w \leftarrow w - \eta \frac{dE}{dw}$

• Generally, vanilla backpropagation is slow and doesn't perform well on real-life datasets.
• There is a lot to improve in the "step" the algorithm takes (for example: if it's going the right way, we want it to take larger steps, since it's more confident).
• We need to deal with overshooting of the local minima.
• The learning rate can be adaptive, depending on the path status of gradient descent (for example: parameters with small gradients can take a larger effective learning rate so they keep making progress, while parameters with large gradients can take a smaller effective learning rate to avoid overshooting).
Gradient Descent
• The base algorithm used to minimize the error with respect to the weights of the neural network. The learning rate determines the step size of the update used to reach the minimum.
• An epoch is one complete pass through all the training samples.

https://www.learnopencv.com/understanding-activation-functions-in-deep-learning/
https://sebastianraschka.com/faq/docs/closed-form-vs-gd.html
Batch Gradient Descent
• Takes all the samples in one iteration; in this case, iteration = epoch.
• All the samples are passed to the neural network at one time.
• Computes the gradient of the loss function with respect to the network's weights for the entire training dataset, and then performs one update.
• n = all samples per iteration/epoch.
• In the literature, the weights $w$ are sometimes denoted as $\theta$. As before, $\eta$ is the learning rate: how fast we update the weights, in other words the step size of the update.
• Weight Update Rule:

$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$

where $\theta$ is the old weight vector and $J(\theta)$ is the loss function computed over all samples.

https://towardsdatascience.com/gradient-descent-in-a-nutshell-eaf8c18212f0
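As a concrete illustration (not from the slides), here is a minimal NumPy sketch of batch gradient descent on a toy linear model with a mean-squared-error loss. The sample matrix reuses the values from the example table that follows; the target vector `y`, the "true" weights used to generate it, the learning rate, and the epoch count are all invented for the example:

```python
import numpy as np

# The 8 samples x 3 features from the example table below.
X = np.array([[ 0.32,  0.28,  0.93],
              [-0.01,  0.03,  0.72],
              [ 1.82, -0.46,  0.54],
              [ 0.92,  0.67, -1.43],
              [-0.71,  1.21,  0.37],
              [ 0.01,  0.78,  0.04],
              [-0.42, -0.33,  1.37],
              [ 1.12, -0.06, -0.17]])
y = X @ np.array([1.0, -2.0, 0.5])         # invented targets for the demo

w = np.zeros(3)                            # the weights (theta)
lr = 0.1                                   # learning rate (eta), invented

for epoch in range(100):                   # here one iteration == one epoch
    grad = 2 * X.T @ (X @ w - y) / len(X)  # dE/dw over the ENTIRE dataset
    w -= lr * grad                         # a single update per epoch
```

With the batch method, the 100 epochs above produce exactly 100 weight updates: one per full pass over the data.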
Example of Batch Gradient Descent

Sample 1:  0.32   0.28   0.93
Sample 2: -0.01   0.03   0.72
Sample 3:  1.82  -0.46   0.54
Sample 4:  0.92   0.67  -1.43
Sample 5: -0.71   1.21   0.37
Sample 6:  0.01   0.78   0.04
Sample 7: -0.42  -0.33   1.37
Sample 8:  1.12  -0.06  -0.17

Take ALL samples and feed them to the network. Calculate the error based on all samples and update the weights.
Stochastic Gradient Descent
• Takes in one sample at each iteration.
• Performs the weight update for each sample, one at a time.
• Number of iterations per epoch = number of samples.
• Its updates have high variance, which causes the loss function to fluctuate.

for each sample $x^{(i)}$ and $y^{(i)}$ {
    $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$
}

• n = 1 per iteration

Andrew Ng Machine Learning Tutorial
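A matching sketch of one SGD epoch, under the same toy linear/MSE setup as the batch example above (the learning rate is again an invented value):

```python
def sgd_epoch(w, X, y, lr=0.01):
    """One epoch of stochastic gradient descent: one update per sample."""
    for xi, yi in zip(X, y):            # one sample at a time, as on the slides
        grad = 2 * xi * (xi @ w - yi)   # dE/dw based on this ONE sample (MSE)
        w = w - lr * grad               # update the weights immediately
    return w
```

Note that with 8 samples, one epoch performs 8 weight updates, versus a single update for batch gradient descent.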


Example of Stochastic Gradient Descent

Sample 1:  0.32   0.28   0.93   ← fed to the network (iteration 1)
Sample 2: -0.01   0.03   0.72
Sample 3:  1.82  -0.46   0.54
Sample 4:  0.92   0.67  -1.43
Sample 5: -0.71   1.21   0.37
Sample 6:  0.01   0.78   0.04
Sample 7: -0.42  -0.33   1.37
Sample 8:  1.12  -0.06  -0.17

Take ONE sample and feed it to the network. Calculate the error based on that ONE sample and update the weights. Repeat this step for all samples.
The same step is then repeated for Sample 2 (-0.01, 0.03, 0.72), Sample 3 (1.82, -0.46, 0.54), Sample 4 (0.92, 0.67, -1.43), Sample 5 (-0.71, 1.21, 0.37), Sample 6 (0.01, 0.78, 0.04), Sample 7 (-0.42, -0.33, 1.37), and Sample 8 (1.12, -0.06, -0.17): each sample is fed to the network, the error is calculated on that sample alone, and the weights are updated, giving eight weight updates in one epoch.
Mini-Batch Gradient Descent
• Takes the best of both Batch and Stochastic Gradient Descent.
• Performs a weight update for a batch of samples.
• Take n training samples (the batch size) → feed them to the network.
• Number of iterations per epoch = number of samples / batch size.

for each batch $x^{(i:i+n)}$ and $y^{(i:i+n)}$ {
    $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta; x^{(i:i+n)}, y^{(i:i+n)})$
}

where n = batch size.
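Continuing the same toy setup, a minimal sketch of one mini-batch epoch; with the 8 samples below and `batch_size=4` there are 8 / 4 = 2 updates per epoch:

```python
def minibatch_epoch(w, X, y, lr=0.01, batch_size=4):
    """One epoch of mini-batch gradient descent: one update per batch."""
    for start in range(0, len(X), batch_size):
        Xb = X[start:start + batch_size]            # x^(i:i+n)
        yb = y[start:start + batch_size]            # y^(i:i+n)
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)   # gradient over the batch
        w = w - lr * grad                           # one update per batch
    return w
```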
Example of Mini-Batch Gradient Descent (batch 1 of 2, batch size = 4)

Sample 1:  0.32   0.28   0.93   ← in batch
Sample 2: -0.01   0.03   0.72   ← in batch
Sample 3:  1.82  -0.46   0.54   ← in batch
Sample 4:  0.92   0.67  -1.43   ← in batch
Sample 5: -0.71   1.21   0.37
Sample 6:  0.01   0.78   0.04
Sample 7: -0.42  -0.33   1.37
Sample 8:  1.12  -0.06  -0.17

Take a BATCH of samples (here, Samples 1-4) and feed them to the network. Calculate the error based on the BATCH of samples and update the weights.
Example of Mini-Batch Gradient Descent (batch 2 of 2)

The next iteration feeds the second batch, Samples 5-8 ((-0.71, 1.21, 0.37), (0.01, 0.78, 0.04), (-0.42, -0.33, 1.37), (1.12, -0.06, -0.17)), to the network, calculates the error based on that BATCH of samples, and updates the weights again.
Exponentially Weighted Averages

We are averaging over roughly $\frac{1}{1-\beta}$ points:
• $\beta = 0.9$: averaging over ~10 points.
• $\beta = 0.98$: averaging over ~50 points. The curve is smoother; if the current data points change rapidly, the average won't be much affected.
• $\beta = 0.5$: averaging over ~2 points → more noisy.

Source: Andrew Ng Deep Learning Course


Exponentially Weighted Average

$v_t = \beta v_{t-1} + (1-\beta)\,\theta_t$, with $v_0 = 0$.

If $\beta = 0.9$, then $v_t = 0.9\,v_{t-1} + 0.1\,\theta_t$:
$v_1 = 0.9\,v_0 + 0.1\,\theta_1$
$v_2 = 0.9\,v_1 + 0.1\,\theta_2$
$v_3 = 0.9\,v_2 + 0.1\,\theta_3$
...

Expanding the recursion at, say, $t = 50$:
$v_{50} = 0.1\,\theta_{50} + 0.9\,v_{49}$
$\quad\;\, = 0.1\,\theta_{50} + 0.9(0.1)\,\theta_{49} + (0.9)^2(0.1)\,\theta_{48} + (0.9)^3\,v_{47}$
$\quad\;\, = 0.1\,\theta_{50} + 0.9(0.1)\,\theta_{49} + (0.9)^2(0.1)\,\theta_{48} + (0.9)^3(0.1)\,\theta_{47} + (0.9)^4\,v_{46}$

More weight goes to the current value and exponentially less weight to previous values, and the coefficients sum to 1 (0.1 only for the current point, and all the rest for previous points).

The general idea:
$v_t = (1-\beta)\sum_{n=0}^{t-1} \beta^n\,\theta_{t-n}$
(the $n = 0$ term, $(1-\beta)\,\theta_t$, is the weight on the current point).
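The recursion is one line of code. A minimal sketch (the input series is arbitrary):

```python
def ewa(series, beta=0.9):
    """Exponentially weighted average: v_t = beta*v_{t-1} + (1-beta)*theta_t."""
    v, out = 0.0, []                      # v_0 = 0
    for theta in series:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out
```

With `beta=0.9`, this smooths over roughly 1 / (1 - 0.9) = 10 points, matching the figure.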
Bias Correction of Exponentially Weighted Averages

$v_t = \beta v_{t-1} + (1-\beta)\,\theta_t$

If we start with $v_0 = 0$ (assume $\beta = 0.98$):
$v_1 = \beta v_0 + (1-\beta)\,\theta_1 = 0 + 0.02\,\theta_1$
$v_2 = 0.98(0.02\,\theta_1) + 0.02\,\theta_2 = 0.0196\,\theta_1 + 0.02\,\theta_2$

The initial values come out too small.

To mitigate this problem: divide $v_t$ by $(1-\beta^t)$.

Example, at $t = 2$:
$1-\beta^t = 1-0.98^2 = 0.0396$
$\frac{v_2}{0.0396} = \frac{0.0196\,\theta_1 + 0.02\,\theta_2}{0.0396} \approx 0.495\,\theta_1 + 0.505\,\theta_2$
The values are corrected!

Note that at large values of $t$, $1-\beta^t = 1-0.98^{150} \approx 1$, so the correction has no effect there.

Source: Andrew Ng Deep Learning Course. (Purple curve: without bias correction; green curve: with bias correction. That's why the green and purple lines overlap at large values of t.)
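The same sketch with the correction applied; the division by $(1-\beta^t)$ only matters for small t, exactly as the overlapping curves show:

```python
def ewa_bias_corrected(series, beta=0.98):
    """EWA where each v_t is rescaled by 1 / (1 - beta**t)."""
    v, out = 0.0, []
    for t, theta in enumerate(series, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t))  # at t=2: divide by 1 - 0.98**2 = 0.0396
    return out
```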
Momentum

We can use the exponentially weighted moving average to reduce the oscillations in the vertical direction and speed up movement in the horizontal direction! ($\beta$ is set to 0.9, which is robust.)

Compute dw and db on your mini-batch, and then:

$v_{dw} = \beta v_{dw} + (1-\beta)\,dw$
$v_{db} = \beta v_{db} + (1-\beta)\,db$

Then perform the weight update:

$w = w - \alpha\,v_{dw}$
$b = b - \alpha\,v_{db}$
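A direct transcription of these four equations into a Python step function; the learning rate α = 0.01 is an invented placeholder:

```python
def momentum_step(w, b, dw, db, v_dw, v_db, lr=0.01, beta=0.9):
    """One gradient-descent-with-momentum update, as in the slide."""
    v_dw = beta * v_dw + (1 - beta) * dw   # smooth the weight gradients
    v_db = beta * v_db + (1 - beta) * db   # smooth the bias gradients
    w = w - lr * v_dw                      # step along the smoothed direction
    b = b - lr * v_db
    return w, b, v_dw, v_db
```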
Adding Momentum
• Accelerates SGD in the correct direction and dampens the oscillations by adding a fraction $\gamma$ of the past update vector to the new update vector. The momentum term grows for dimensions whose gradients point in the same direction and shrinks updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation. The history accumulates, and the more we accumulate in one direction, the faster the update in that direction.

However, Momentum still suffers from oscillations.

(Figure: update paths without momentum vs. with momentum; the new update vector adds a fraction of the old update vector.)

http://ruder.io/optimizing-gradient-descent/index.html#momentum
RMSProp

We want to minimize the oscillation in the vertical direction and increase the speed in the horizontal direction. Denote the gradient of Set of Parameters 1 (horizontal) as dw and the gradient of Set of Parameters 2 (vertical) as db. Here dw is small (has a small variation) and db is large (has a large variation).

$s_{dw} = \beta s_{dw} + (1-\beta)\,dw^2$ → small: $(\text{small number})^2$ becomes smaller
$s_{db} = \beta s_{db} + (1-\beta)\,db^2$ → large: $(\text{large number})^2$ becomes larger

($dw^2$ and $db^2$ are element-wise squares; $\epsilon$ is a very small number to prevent dividing by zero.)

$w_{new} = w_{old} - \frac{\alpha}{\sqrt{s_{dw}} + \epsilon}\,dw$ → dividing by a small number, the step gets larger
$b_{new} = b_{old} - \frac{\alpha}{\sqrt{s_{db}} + \epsilon}\,db$ → dividing by a large number, the step gets smaller
RMSProp
• An adaptive learning rate method.
• Proposed by Geoff Hinton in Lecture 6e of his Coursera class.
• RMSProp divides the learning rate by an exponentially decaying average of squared gradients:

$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)\,g_t^2$
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\,g_t$

• $\gamma$ is suggested to be 0.9 and $\eta = 0.001$.
• Each step, the effective learning rate applied to the old weight changes based on $E[g^2]_t$.
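A minimal sketch of one RMSProp step using the suggested γ = 0.9 and η = 0.001; the slide only says ε is "a very small number", so ε = 1e-8 is an assumed choice:

```python
import numpy as np

def rmsprop_step(theta, grad, s, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSProp: scale the step by a decaying average of squared gradients."""
    s = gamma * s + (1 - gamma) * grad ** 2         # E[g^2]_t, element-wise
    theta = theta - lr * grad / (np.sqrt(s) + eps)  # per-parameter step size
    return theta, s
```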
Adam Optimization (Adaptive Moment Estimation)
• The learning rate is adaptive.
• Stores an exponentially decaying average of past gradients ($m_t$); this is the Momentum part:
$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$
• Stores an exponentially decaying average of past squared gradients ($v_t$); this is the RMSProp part:
$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$
• To prevent $m_t$ and $v_t$ from being biased toward zero during the initial steps, an adjustment (bias correction) is made:
$\hat{m}_t = \frac{m_t}{1-\beta_1^t}$, $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$
• And finally the update (we're taking the exponentially weighted average, so the effective learning rate applied to the old weight keeps changing):
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$

Proposed values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
Adam Optimization
• Combines Momentum and RMSProp: perform the weight update with $v_{dw}$ and $v_{db}$ (as in Momentum), and divide by $\sqrt{s_{dw}}$ and $\sqrt{s_{db}}$ (as in RMSProp).
• It uses bias correction: divide $v_{dw}$ and $v_{db}$ by $(1-\beta_1^t)$, and divide $s_{dw}$ and $s_{db}$ by $(1-\beta_2^t)$.

Initialize $v_{dw} = 0$, $s_{dw} = 0$ and $v_{db} = 0$, $s_{db} = 0$.
On iteration t:
Compute dw and db using the current mini-batch, then:
$v_{dw} = \beta_1 v_{dw} + (1-\beta_1)\,dw$ and $v_{db} = \beta_1 v_{db} + (1-\beta_1)\,db$ (Momentum with $\beta_1$)
$s_{dw} = \beta_2 s_{dw} + (1-\beta_2)\,dw^2$ and $s_{db} = \beta_2 s_{db} + (1-\beta_2)\,db^2$ (RMSProp with $\beta_2$)

Compute the bias corrections:
$v_{dw}^{corrected} = \frac{v_{dw}}{1-\beta_1^t}$ and $v_{db}^{corrected} = \frac{v_{db}}{1-\beta_1^t}$
$s_{dw}^{corrected} = \frac{s_{dw}}{1-\beta_2^t}$ and $s_{db}^{corrected} = \frac{s_{db}}{1-\beta_2^t}$

Perform the weight update:
$w_{new} = w_{old} - \alpha\,\frac{v_{dw}^{corrected}}{\sqrt{s_{dw}^{corrected}} + \epsilon}$ and $b_{new} = b_{old} - \alpha\,\frac{v_{db}^{corrected}}{\sqrt{s_{db}^{corrected}} + \epsilon}$

Default values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
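The whole algorithm fits in a few lines. A sketch of one Adam step following the slide's v/s naming; the step size α = 0.001 is an assumed default (the slide leaves it open):

```python
import numpy as np

def adam_step(theta, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; t counts iterations from 1."""
    v = beta1 * v + (1 - beta1) * grad       # Momentum part (v_dw / v_db)
    s = beta2 * s + (1 - beta2) * grad ** 2  # RMSProp part (s_dw / s_db)
    v_hat = v / (1 - beta1 ** t)             # bias-corrected estimates
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * v_hat / (np.sqrt(s_hat) + eps)
    return theta, v, s
```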
In the original paper, Adam was applied to a multilayer perceptron on the MNIST dataset and to convolutional neural networks on the CIFAR-10 image recognition dataset. The authors conclude: "Using large models and datasets, we demonstrate Adam can efficiently solve practical deep learning problems."

https://arxiv.org/abs/1412.6980

In practice, Adam is currently recommended as the default algorithm to use, and it often works slightly better than RMSProp.
Stop and think about it...
Do we need to decay the learning rate if Adam already adapts it for each parameter independently?

• The per-parameter learning rate is computed using the initial learning rate as an upper limit. This means that every individual learning rate can vary from 0 (no update) to the initial learning rate (maximum update); the effective step for a weight is roughly $\frac{\alpha}{\sqrt{s_{dw}^{corrected}} + \epsilon}$.
• If we want to be sure that every update step doesn't exceed the initial learning rate, we can then lower this initial learning rate over time (learning rate scheduling). It is therefore still recommended to decay the learning rate, because Adam might exhibit a high effective learning rate depending on the gradient of a parameter.
So what happened then?
People started noticing that, despite its superior training time, Adam in some areas does not converge to an optimal solution, so for some tasks (such as image classification on the popular CIFAR datasets) state-of-the-art results are still only achieved by applying SGD with momentum.
The paper "Improving Generalization Performance by Switching from Adam to SGD" also showed that by switching to SGD during training, the authors were able to obtain better generalization power than when using Adam alone. They proposed a simple fix based on a simple idea: they noticed that in the earlier stages of training Adam still outperforms SGD, but later the learning saturates. Their strategy, which they called SWATS, starts training the deep neural network with Adam and then switches to SGD when a certain criterion is met. They managed to achieve results comparable to SGD with momentum.

SWATS: Adam → (when the criterion is met) → SGD
Weight Decay and Regularization
• The weight update rule is given as:
$w = w - \eta \frac{d\tilde{E}}{dw}$, where $\frac{d\tilde{E}}{dw} = \frac{dE}{dw} + \frac{dL_2}{dw}$

• Adding the term $\frac{\lambda}{2} w^2$ to the cost function and differentiating it:
$\frac{\lambda}{2}\,\frac{d(w^2)}{dw} = \frac{\lambda}{2}(2w) = \lambda w$

• Substituting this derivative into the weight update rule, we end up with:
$\frac{d\tilde{E}}{dw} = \frac{dE}{dw} + \lambda w$
$w = w - \eta\left(\frac{dE}{dw} + \lambda w\right)$

This new term coming from the regularization causes the weight to decay in proportion to its size. We subtract a little portion of the weight at each step, hence the name "decay". In this view, we directly modify the weight update rule rather than modifying the loss function: we don't want to add more computation by modifying the loss when there is an easier way.
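In code, the equivalence for vanilla SGD is a one-liner (the decay strength λ = 1e-4 is an invented value):

```python
def sgd_l2_step(w, grad, lr=0.01, lam=1e-4):
    """Vanilla SGD, where L2 regularization and weight decay coincide:
    w - lr*(dE/dw + lam*w)  ==  (1 - lr*lam)*w - lr*dE/dw."""
    return w - lr * (grad + lam * w)  # the weight shrinks in proportion to its size
```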
• So when we add the L2 term to the loss function, the technique is called L2 regularization. When we modify the weight update rule directly, it is called weight decay. For vanilla SGD, the two are the same.
• But they aren't the same when adding Momentum or using Adam: L2 regularization and weight decay become different. Weight decay no longer equals L2 regularization, as it does for vanilla SGD.
• One paper therefore proposes to decouple weight decay from the gradient update by adding it after the parameter update, as in the original definition.
Why aren't they the same?
• With L2 regularization, the penalty we apply to large weights gets scaled by the moving average of the past and current squared gradients, so weights with typically large gradient magnitudes are regularized by a smaller relative amount than other weights. In contrast, weight decay regularizes all weights by the same factor. To use weight decay with Adam, we need to modify the update rule as follows:
Decoupling weight decay: adding weight decay to Momentum and Adam.
The SGD with momentum and decoupled weight decay (SGDW) update adds a $-\eta\lambda\theta_t$ term (the decoupled weight decay) directly to the parameter update. Similarly, for Adam with decoupled weight decay (AdamW), the same $-\eta\lambda\theta_t$ term is added to the Adam update, and it never enters the gradient moments.
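A hedged sketch of an AdamW-style step built on the Adam step above; the decay strength λ = 0.01 is an invented placeholder:

```python
import numpy as np

def adamw_step(theta, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=0.01):
    """Adam with DECOUPLED weight decay: the -lr*lam*theta term is applied
    alongside the Adam step and never enters the gradient moments v and s."""
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * v_hat / (np.sqrt(s_hat) + eps) - lr * lam * theta
    return theta, v, s
```

By contrast, plain L2 regularization would add `lam * theta` to `grad` before the moments are computed, which is exactly what the decoupling avoids.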


So did it yield good results?
• The authors show that this substantially improves Adam's generalization performance and allows it to compete with SGD with momentum on image classification datasets.
• Several papers empirically find that a lower $\beta_2$ value (which controls the contribution of the exponential moving average of past squared gradients in Adam), e.g. 0.99 or 0.9 vs. the default 0.999, worked better in their respective applications, indicating that there might be an issue with the exponential moving average.
AMSGrad
Let's start with the problem statement:
• Practitioners noticed that in some cases, e.g. object recognition or machine translation, adaptive methods like Adam fail to converge to an optimal solution and are outperformed by SGD with momentum.
• The authors point out that in Adam, the short-term memory of the gradients becomes an obstacle in some scenarios. In settings where Adam converges to a suboptimal solution, it has been observed that some minibatches provide large and informative gradients, but because these minibatches occur only rarely, exponential averaging diminishes their influence, which leads to poor convergence.
So what did they do?
Since the values of the step size are often decreasing over time, they proposed a fix: keep the maximum of the $v$ values seen so far and use it, instead of the moving average, to update the parameters.

AMSGrad uses the maximum of past squared gradients $v_t$ rather than the exponential average to update the parameters. $v_t$ is defined the same as in Adam:
$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$

Instead of using $v_t$ (or its bias-corrected version $\hat{v}_t$) directly, we now employ the previous $\hat{v}_{t-1}$ if it is larger than the current one:
$\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$

The full AMSGrad update, without bias-corrected estimates, is:
$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,m_t$
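A sketch of one AMSGrad step matching the update above (no bias correction); the hyperparameter values are the usual Adam defaults, assumed here:

```python
import numpy as np

def amsgrad_step(theta, grad, m, v, v_max, lr=0.001, beta1=0.9,
                 beta2=0.999, eps=1e-8):
    """AMSGrad: use the MAX of all past v_t instead of the moving average."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)              # v_hat_t = max(v_hat_{t-1}, v_t)
    theta = theta - lr * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```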
The authors observe improved performance compared to Adam on small datasets and on CIFAR-10. Other experiments, however, show similar or worse performance than Adam, and it remains to be seen whether AMSGrad can consistently outperform Adam in practice. Other authors claimed the convergence problems are actually just signs of poorly chosen hyperparameters. In practice, AMSGrad has often turned out to be disappointing: in the reported experiments it never helped, and even when the minimum found by AMSGrad is sometimes slightly lower (in terms of loss) than the one reached by Adam, the metrics (like accuracy) end up worse.
Another paper...
They demonstrate an interesting connection between the learning rate and the batch size, two hyperparameters that are typically thought to be independent of each other: they show that decaying the learning rate is equivalent to increasing the batch size, while the latter allows for increased parallelism and thus shorter training time.

"DON'T DECAY THE LEARNING RATE, INCREASE THE BATCH SIZE"

(Figure: Wide ResNet on CIFAR-10. Training-set cross-entropy, evaluated as a function of the number of training epochs (a), or the number of parameter updates (b). The three learning curves are identical, but increasing the batch size reduces the number of parameter updates required.)

https://arxiv.org/abs/1711.00489
A lot of things discussed... what's the best choice?
• If we tune Adam well, it works perfectly, matching the state of the art. When you hear people saying that Adam doesn't generalize as well as SGD+Momentum, you'll nearly always find that they're choosing poor hyperparameters for their model. Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyperparameters when switching from SGD to Adam.

So, it all points to AdamW: Adam with (decoupled) weight decay.
A good reference:
https://ruder.io/optimizing-gradient-descent/index.html#adamax
https://arxiv.org/pdf/1609.04747.pdf
