
Optimization Algorithms in Neural Networks

Complete Guide to Neural Networks with Python: Theory and Applications


The Backpropagation
Remember, our objective is to: minimize the error by changing the weights.

Gradient Descent
We move in the direction opposite to the derivative (opposite to the slope).

Negative slope: when we increase $w$, the loss is decreasing → $-(-) = +$ → the weight increases (moving right).

Positive slope: when we increase $w$, the loss is increasing → $-(+) = -$ → the weight decreases (moving left).

Weight Update Rule:

$w \leftarrow w - \eta \frac{dE}{dw}$

Here $w$ on the right-hand side is the old weight, the negative sign moves us against the slope (the gradient), and $\eta$ is the learning rate: how fast we update the weights, in other words the step size of the update.

A high learning rate leads to overshooting during gradient descent and might never reach the minimum; a small, well-chosen learning rate leads to gradual descent towards the minimum.
Why isn't vanilla gradient descent good enough?

$w \leftarrow w - \eta \frac{dE}{dw}$

• Generally, vanilla backpropagation is slow and doesn't perform well on real-life datasets.
• There is a lot to improve in the "step" the algorithm takes (for example: if it's going the right way, we want it to take larger steps, since it's more confident).
• We need to deal with overshooting of the local minima.
• The learning rate can be adaptive, depending on the path status of gradient descent (for example: parameters with small gradients can take a larger effective learning rate so they keep making progress, while parameters with large gradients can take a smaller effective learning rate to avoid overshooting).
Gradient Descent
• The base algorithm used to minimize the error with respect to the weights of the neural network. The learning rate determines the step size of the update used to reach the minimum.
• An epoch is one complete pass through all the training samples.

https://www.learnopencv.com/understanding-activation-functions-in-deep-learning/
https://sebastianraschka.com/faq/docs/closed-form-vs-gd.html
Batch Gradient Descent
• Takes all the samples in one iteration; in this case, iteration = epoch.
• All the samples are passed to the neural network at one time.
• Computes the gradient of the loss function with respect to the network's weights for the entire training dataset, and then performs one update.
• n = all samples per iteration/epoch.
• In the literature, the weights $w$ are sometimes denoted as $\theta$. As before, $\eta$ is the learning rate: how fast we update the weights, in other words the step size of the update.
• Weight Update Rule:

$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$

where $\theta$ is the old weight vector and $J(\theta)$ is the loss function computed over all samples.

https://towardsdatascience.com/gradient-descent-in-a-nutshell-eaf8c18212f0
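As a concrete illustration (not from the slides), here is a minimal NumPy sketch of batch gradient descent on a toy linear model with a mean-squared-error loss. The sample matrix reuses the values from the example table that follows; the target vector `y`, the "true" weights used to generate it, the learning rate, and the epoch count are all invented for the example:

```python
import numpy as np

# The 8 samples x 3 features from the example table below.
X = np.array([[ 0.32,  0.28,  0.93],
              [-0.01,  0.03,  0.72],
              [ 1.82, -0.46,  0.54],
              [ 0.92,  0.67, -1.43],
              [-0.71,  1.21,  0.37],
              [ 0.01,  0.78,  0.04],
              [-0.42, -0.33,  1.37],
              [ 1.12, -0.06, -0.17]])
y = X @ np.array([1.0, -2.0, 0.5])         # invented targets for the demo

w = np.zeros(3)                            # the weights (theta)
lr = 0.1                                   # learning rate (eta), invented

for epoch in range(100):                   # here one iteration == one epoch
    grad = 2 * X.T @ (X @ w - y) / len(X)  # dE/dw over the ENTIRE dataset
    w -= lr * grad                         # a single update per epoch
```

With the batch method, the 100 epochs above produce exactly 100 weight updates: one per full pass over the data.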
Example of Batch Gradient Descent

Sample 1:  0.32   0.28   0.93
Sample 2: -0.01   0.03   0.72
Sample 3:  1.82  -0.46   0.54
Sample 4:  0.92   0.67  -1.43
Sample 5: -0.71   1.21   0.37
Sample 6:  0.01   0.78   0.04
Sample 7: -0.42  -0.33   1.37
Sample 8:  1.12  -0.06  -0.17

Take ALL samples and feed them to the network. Calculate the error based on all samples and update the weights.
Stochastic Gradient Descent
• Takes in one sample at each iteration.
• Performs the weight update for each sample, one at a time.
• Number of iterations per epoch = number of samples.
• Its updates have high variance, which causes the loss function to fluctuate.

for each sample $x^{(i)}$ and $y^{(i)}$ {
    $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$
}

• n = 1 per iteration

Andrew Ng Machine Learning Tutorial
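A matching sketch of one SGD epoch, under the same toy linear/MSE setup as the batch example above (the learning rate is again an invented value):

```python
def sgd_epoch(w, X, y, lr=0.01):
    """One epoch of stochastic gradient descent: one update per sample."""
    for xi, yi in zip(X, y):            # one sample at a time, as on the slides
        grad = 2 * xi * (xi @ w - yi)   # dE/dw based on this ONE sample (MSE)
        w = w - lr * grad               # update the weights immediately
    return w
```

Note that with 8 samples, one epoch performs 8 weight updates, versus a single update for batch gradient descent.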


Example of Stochastic Gradient Descent

Sample 1:  0.32   0.28   0.93   ← fed to the network (iteration 1)
Sample 2: -0.01   0.03   0.72
Sample 3:  1.82  -0.46   0.54
Sample 4:  0.92   0.67  -1.43
Sample 5: -0.71   1.21   0.37
Sample 6:  0.01   0.78   0.04
Sample 7: -0.42  -0.33   1.37
Sample 8:  1.12  -0.06  -0.17

Take ONE sample and feed it to the network. Calculate the error based on that ONE sample and update the weights. Repeat this step for all samples.
The same step is then repeated for Sample 2 (-0.01, 0.03, 0.72), Sample 3 (1.82, -0.46, 0.54), Sample 4 (0.92, 0.67, -1.43), Sample 5 (-0.71, 1.21, 0.37), Sample 6 (0.01, 0.78, 0.04), Sample 7 (-0.42, -0.33, 1.37), and Sample 8 (1.12, -0.06, -0.17): each sample is fed to the network, the error is calculated on that sample alone, and the weights are updated, giving eight weight updates in one epoch.
Mini-Batch Gradient Descent
• Takes the best of both Batch and Stochastic Gradient Descent.
• Performs a weight update for a batch of samples.
• Take n training samples (the batch size) → feed them to the network.
• Number of iterations per epoch = number of samples / batch size.

for each batch $x^{(i:i+n)}$ and $y^{(i:i+n)}$ {
    $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta; x^{(i:i+n)}, y^{(i:i+n)})$
}

where n = batch size.
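Continuing the same toy setup, a minimal sketch of one mini-batch epoch; with the 8 samples below and `batch_size=4` there are 8 / 4 = 2 updates per epoch:

```python
def minibatch_epoch(w, X, y, lr=0.01, batch_size=4):
    """One epoch of mini-batch gradient descent: one update per batch."""
    for start in range(0, len(X), batch_size):
        Xb = X[start:start + batch_size]            # x^(i:i+n)
        yb = y[start:start + batch_size]            # y^(i:i+n)
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)   # gradient over the batch
        w = w - lr * grad                           # one update per batch
    return w
```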
Example of Mini-Batch Gradient Descent (batch 1 of 2, batch size = 4)

Sample 1:  0.32   0.28   0.93   ← in batch
Sample 2: -0.01   0.03   0.72   ← in batch
Sample 3:  1.82  -0.46   0.54   ← in batch
Sample 4:  0.92   0.67  -1.43   ← in batch
Sample 5: -0.71   1.21   0.37
Sample 6:  0.01   0.78   0.04
Sample 7: -0.42  -0.33   1.37
Sample 8:  1.12  -0.06  -0.17

Take a BATCH of samples (here, Samples 1-4) and feed them to the network. Calculate the error based on the BATCH of samples and update the weights.
Example of Mini-Batch Gradient Descent (batch 2 of 2)

The next iteration feeds the second batch, Samples 5-8 ((-0.71, 1.21, 0.37), (0.01, 0.78, 0.04), (-0.42, -0.33, 1.37), (1.12, -0.06, -0.17)), to the network, calculates the error based on that BATCH of samples, and updates the weights again.
Exponentially Weighted Averages

We are averaging over roughly $\frac{1}{1-\beta}$ points:
• $\beta = 0.9$: averaging over ~10 points.
• $\beta = 0.98$: averaging over ~50 points. The curve is smoother; if the current data points change rapidly, the average won't be much affected.
• $\beta = 0.5$: averaging over ~2 points → more noisy.

Source: Andrew Ng Deep Learning Course


Exponentially Weighted Average

$v_t = \beta v_{t-1} + (1-\beta)\,\theta_t$, with $v_0 = 0$.

If $\beta = 0.9$, then $v_t = 0.9\,v_{t-1} + 0.1\,\theta_t$:
$v_1 = 0.9\,v_0 + 0.1\,\theta_1$
$v_2 = 0.9\,v_1 + 0.1\,\theta_2$
$v_3 = 0.9\,v_2 + 0.1\,\theta_3$
...

Expanding the recursion at, say, $t = 50$:
$v_{50} = 0.1\,\theta_{50} + 0.9\,v_{49}$
$\quad\;\, = 0.1\,\theta_{50} + 0.9(0.1)\,\theta_{49} + (0.9)^2(0.1)\,\theta_{48} + (0.9)^3\,v_{47}$
$\quad\;\, = 0.1\,\theta_{50} + 0.9(0.1)\,\theta_{49} + (0.9)^2(0.1)\,\theta_{48} + (0.9)^3(0.1)\,\theta_{47} + (0.9)^4\,v_{46}$

More weight goes to the current value and exponentially less weight to previous values, and the coefficients sum to 1 (0.1 only for the current point, and all the rest for previous points).

The general idea:
$v_t = (1-\beta)\sum_{n=0}^{t-1} \beta^n\,\theta_{t-n}$
(the $n = 0$ term, $(1-\beta)\,\theta_t$, is the weight on the current point).
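The recursion is one line of code. A minimal sketch (the input series is arbitrary):

```python
def ewa(series, beta=0.9):
    """Exponentially weighted average: v_t = beta*v_{t-1} + (1-beta)*theta_t."""
    v, out = 0.0, []                      # v_0 = 0
    for theta in series:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out
```

With `beta=0.9`, this smooths over roughly 1 / (1 - 0.9) = 10 points, matching the figure.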
Bias Correction of Exponentially Weighted Averages

$v_t = \beta v_{t-1} + (1-\beta)\,\theta_t$

If we start with $v_0 = 0$ (assume $\beta = 0.98$):
$v_1 = \beta v_0 + (1-\beta)\,\theta_1 = 0 + 0.02\,\theta_1$
$v_2 = 0.98(0.02\,\theta_1) + 0.02\,\theta_2 = 0.0196\,\theta_1 + 0.02\,\theta_2$

The initial values come out too small.

To mitigate this problem: divide $v_t$ by $(1-\beta^t)$.

Example, at $t = 2$:
$1-\beta^t = 1-0.98^2 = 0.0396$
$\frac{v_2}{0.0396} = \frac{0.0196\,\theta_1 + 0.02\,\theta_2}{0.0396} \approx 0.495\,\theta_1 + 0.505\,\theta_2$
The values are corrected!

Note that at large values of $t$, $1-\beta^t = 1-0.98^{150} \approx 1$, so the correction has no effect there.

Source: Andrew Ng Deep Learning Course. (Purple curve: without bias correction; green curve: with bias correction. That's why the green and purple lines overlap at large values of t.)
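The same sketch with the correction applied; the division by $(1-\beta^t)$ only matters for small t, exactly as the overlapping curves show:

```python
def ewa_bias_corrected(series, beta=0.98):
    """EWA where each v_t is rescaled by 1 / (1 - beta**t)."""
    v, out = 0.0, []
    for t, theta in enumerate(series, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t))  # at t=2: divide by 1 - 0.98**2 = 0.0396
    return out
```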
Momentum

We can use the exponentially weighted moving average to reduce the oscillations in the vertical direction and speed up movement in the horizontal direction! ($\beta$ is set to 0.9, which is robust.)

Compute dw and db on your mini-batch, and then:

$v_{dw} = \beta v_{dw} + (1-\beta)\,dw$
$v_{db} = \beta v_{db} + (1-\beta)\,db$

Then perform the weight update:

$w = w - \alpha\,v_{dw}$
$b = b - \alpha\,v_{db}$
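A direct transcription of these four equations into a Python step function; the learning rate α = 0.01 is an invented placeholder:

```python
def momentum_step(w, b, dw, db, v_dw, v_db, lr=0.01, beta=0.9):
    """One gradient-descent-with-momentum update, as in the slide."""
    v_dw = beta * v_dw + (1 - beta) * dw   # smooth the weight gradients
    v_db = beta * v_db + (1 - beta) * db   # smooth the bias gradients
    w = w - lr * v_dw                      # step along the smoothed direction
    b = b - lr * v_db
    return w, b, v_dw, v_db
```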
Adding Momentum
• Accelerates SGD in the correct direction and dampens the oscillations by adding a fraction $\gamma$ of the past update vector to the new update vector. The momentum term grows for dimensions whose gradients point in the same direction and shrinks updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation. The history accumulates, and the more we accumulate in one direction, the faster the update in that direction.

However, Momentum still suffers from oscillations.

(Figure: update paths without momentum vs. with momentum; the new update vector adds a fraction of the old update vector.)

http://ruder.io/optimizing-gradient-descent/index.html#momentum
RMSProp

We want to minimize the oscillation in the vertical direction and increase the speed in the horizontal direction. Denote the gradient of Set of Parameters 1 (horizontal) as dw and the gradient of Set of Parameters 2 (vertical) as db. Here dw is small (has a small variation) and db is large (has a large variation).

$s_{dw} = \beta s_{dw} + (1-\beta)\,dw^2$ → small: $(\text{small number})^2$ becomes smaller
$s_{db} = \beta s_{db} + (1-\beta)\,db^2$ → large: $(\text{large number})^2$ becomes larger

($dw^2$ and $db^2$ are element-wise squares; $\epsilon$ is a very small number to prevent dividing by zero.)

$w_{new} = w_{old} - \frac{\alpha}{\sqrt{s_{dw}} + \epsilon}\,dw$ → dividing by a small number, the step gets larger
$b_{new} = b_{old} - \frac{\alpha}{\sqrt{s_{db}} + \epsilon}\,db$ → dividing by a large number, the step gets smaller
RMSProp
• An adaptive learning rate method.
• Proposed by Geoff Hinton in Lecture 6e of his Coursera class.
• RMSProp divides the learning rate by an exponentially decaying average of squared gradients:

$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)\,g_t^2$
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\,g_t$

• $\gamma$ is suggested to be 0.9 and $\eta = 0.001$.
• Each step, the effective learning rate applied to the old weight changes based on $E[g^2]_t$.
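A minimal sketch of one RMSProp step using the suggested γ = 0.9 and η = 0.001; the slide only says ε is "a very small number", so ε = 1e-8 is an assumed choice:

```python
import numpy as np

def rmsprop_step(theta, grad, s, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSProp: scale the step by a decaying average of squared gradients."""
    s = gamma * s + (1 - gamma) * grad ** 2         # E[g^2]_t, element-wise
    theta = theta - lr * grad / (np.sqrt(s) + eps)  # per-parameter step size
    return theta, s
```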
Adam Optimization (Adaptive Moment Estimation)
• The learning rate is adaptive.
• Stores an exponentially decaying average of past gradients ($m_t$); this is the Momentum part:
$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$
• Stores an exponentially decaying average of past squared gradients ($v_t$); this is the RMSProp part:
$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$
• To prevent $m_t$ and $v_t$ from being biased toward zero during the initial steps, an adjustment (bias correction) is made:
$\hat{m}_t = \frac{m_t}{1-\beta_1^t}$, $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$
• And finally the update (we're taking the exponentially weighted average, so the effective learning rate applied to the old weight keeps changing):
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$

Proposed values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
Adam Optimization
• Combines Momentum and RMSProp: perform the weight update with $v_{dw}$ and $v_{db}$ (as in Momentum), and divide by $\sqrt{s_{dw}}$ and $\sqrt{s_{db}}$ (as in RMSProp).
• It uses bias correction: divide $v_{dw}$ and $v_{db}$ by $(1-\beta_1^t)$, and divide $s_{dw}$ and $s_{db}$ by $(1-\beta_2^t)$.

Initialize $v_{dw} = 0$, $s_{dw} = 0$ and $v_{db} = 0$, $s_{db} = 0$.
On iteration t:
Compute dw and db using the current mini-batch, then:
$v_{dw} = \beta_1 v_{dw} + (1-\beta_1)\,dw$ and $v_{db} = \beta_1 v_{db} + (1-\beta_1)\,db$ (Momentum with $\beta_1$)
$s_{dw} = \beta_2 s_{dw} + (1-\beta_2)\,dw^2$ and $s_{db} = \beta_2 s_{db} + (1-\beta_2)\,db^2$ (RMSProp with $\beta_2$)

Compute the bias corrections:
$v_{dw}^{corrected} = \frac{v_{dw}}{1-\beta_1^t}$ and $v_{db}^{corrected} = \frac{v_{db}}{1-\beta_1^t}$
$s_{dw}^{corrected} = \frac{s_{dw}}{1-\beta_2^t}$ and $s_{db}^{corrected} = \frac{s_{db}}{1-\beta_2^t}$

Perform the weight update:
$w_{new} = w_{old} - \alpha\,\frac{v_{dw}^{corrected}}{\sqrt{s_{dw}^{corrected}} + \epsilon}$ and $b_{new} = b_{old} - \alpha\,\frac{v_{db}^{corrected}}{\sqrt{s_{db}^{corrected}} + \epsilon}$

Default values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
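The whole algorithm fits in a few lines. A sketch of one Adam step following the slide's v/s naming; the step size α = 0.001 is an assumed default (the slide leaves it open):

```python
import numpy as np

def adam_step(theta, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; t counts iterations from 1."""
    v = beta1 * v + (1 - beta1) * grad       # Momentum part (v_dw / v_db)
    s = beta2 * s + (1 - beta2) * grad ** 2  # RMSProp part (s_dw / s_db)
    v_hat = v / (1 - beta1 ** t)             # bias-corrected estimates
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * v_hat / (np.sqrt(s_hat) + eps)
    return theta, v, s
```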
In the original paper, Adam was applied to a multilayer perceptron on the MNIST dataset and to convolutional neural networks on the CIFAR-10 image recognition dataset. The authors conclude: "Using large models and datasets, we demonstrate Adam can efficiently solve practical deep learning problems."

https://arxiv.org/abs/1412.6980

In practice, Adam is currently recommended as the default algorithm to use, and it often works slightly better than RMSProp.
Stop and think about it...
Do we need to decay the learning rate if Adam already adapts it for each parameter independently?

• The per-parameter learning rate is computed using the initial learning rate as an upper limit. This means that every individual learning rate can vary from 0 (no update) to the initial learning rate (maximum update); the effective step for a weight is roughly $\frac{\alpha}{\sqrt{s_{dw}^{corrected}} + \epsilon}$.
• If we want to be sure that every update step doesn't exceed the initial learning rate, we can then lower this initial learning rate over time (learning rate scheduling). It is therefore still recommended to decay the learning rate, because Adam might exhibit a high effective learning rate depending on the gradient of a parameter.
So what happened then?
People started noticing that, despite its superior training time, Adam in some areas does not converge to an optimal solution, so for some tasks (such as image classification on the popular CIFAR datasets) state-of-the-art results are still only achieved by applying SGD with momentum.
The paper "Improving Generalization Performance by Switching from Adam to SGD" also showed that by switching to SGD during training, the authors were able to obtain better generalization power than when using Adam alone. They proposed a simple fix based on a simple idea: they noticed that in the earlier stages of training Adam still outperforms SGD, but later the learning saturates. Their strategy, which they called SWATS, starts training the deep neural network with Adam and then switches to SGD when a certain criterion is met. They managed to achieve results comparable to SGD with momentum.

SWATS: Adam → (when the criterion is met) → SGD
Weight Decay and Regularization
• The weight update rule is given as:
$w = w - \eta \frac{d\tilde{E}}{dw}$, where $\frac{d\tilde{E}}{dw} = \frac{dE}{dw} + \frac{dL_2}{dw}$

• Adding the term $\frac{\lambda}{2} w^2$ to the cost function and differentiating it:
$\frac{\lambda}{2}\,\frac{d(w^2)}{dw} = \frac{\lambda}{2}(2w) = \lambda w$

• Substituting this derivative into the weight update rule, we end up with:
$\frac{d\tilde{E}}{dw} = \frac{dE}{dw} + \lambda w$
$w = w - \eta\left(\frac{dE}{dw} + \lambda w\right)$

This new term coming from the regularization causes the weight to decay in proportion to its size. We subtract a little portion of the weight at each step, hence the name "decay". In this view, we directly modify the weight update rule rather than modifying the loss function: we don't want to add more computation by modifying the loss when there is an easier way.
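In code, the equivalence for vanilla SGD is a one-liner (the decay strength λ = 1e-4 is an invented value):

```python
def sgd_l2_step(w, grad, lr=0.01, lam=1e-4):
    """Vanilla SGD, where L2 regularization and weight decay coincide:
    w - lr*(dE/dw + lam*w)  ==  (1 - lr*lam)*w - lr*dE/dw."""
    return w - lr * (grad + lam * w)  # the weight shrinks in proportion to its size
```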
• So when we add the L2 term to the loss function, the technique is called L2 regularization. When we modify the weight update rule directly, it is called weight decay. For vanilla SGD, the two are the same.
• But they aren't the same when adding Momentum or using Adam: L2 regularization and weight decay become different. Weight decay no longer equals L2 regularization, as it does for vanilla SGD.
• One paper therefore proposes to decouple weight decay from the gradient update by adding it after the parameter update, as in the original definition.
Why aren't they the same?
• With L2 regularization, the penalty we apply to large weights gets scaled by the moving average of the past and current squared gradients, so weights with typically large gradient magnitudes are regularized by a smaller relative amount than other weights. In contrast, weight decay regularizes all weights by the same factor. To use weight decay with Adam, we need to modify the update rule as follows:
Decoupling weight decay: adding weight decay to Momentum and Adam.
The SGD with momentum and decoupled weight decay (SGDW) update adds a $-\eta\lambda\theta_t$ term (the decoupled weight decay) directly to the parameter update. Similarly, for Adam with decoupled weight decay (AdamW), the same $-\eta\lambda\theta_t$ term is added to the Adam update, and it never enters the gradient moments.
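A hedged sketch of an AdamW-style step built on the Adam step above; the decay strength λ = 0.01 is an invented placeholder:

```python
import numpy as np

def adamw_step(theta, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=0.01):
    """Adam with DECOUPLED weight decay: the -lr*lam*theta term is applied
    alongside the Adam step and never enters the gradient moments v and s."""
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * v_hat / (np.sqrt(s_hat) + eps) - lr * lam * theta
    return theta, v, s
```

By contrast, plain L2 regularization would add `lam * theta` to `grad` before the moments are computed, which is exactly what the decoupling avoids.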


So did it yield good results?
• The authors show that this substantially improves Adam's generalization performance and allows it to compete with SGD with momentum on image classification datasets.
• Several papers empirically find that a lower $\beta_2$ value (which controls the contribution of the exponential moving average of past squared gradients in Adam), e.g. 0.99 or 0.9 vs. the default 0.999, worked better in their respective applications, indicating that there might be an issue with the exponential moving average.
AMSGrad
Let's start with the problem statement:
• Practitioners noticed that in some cases, e.g. object recognition or machine translation, adaptive methods like Adam fail to converge to an optimal solution and are outperformed by SGD with momentum.
• The authors point out that in Adam, the short-term memory of the gradients becomes an obstacle in some scenarios. In settings where Adam converges to a suboptimal solution, it has been observed that some minibatches provide large and informative gradients, but because these minibatches occur only rarely, exponential averaging diminishes their influence, which leads to poor convergence.
So what did they do?
Since the values of the step size are often decreasing over time, they proposed a fix: keep the maximum of the $v$ values seen so far and use it, instead of the moving average, to update the parameters.

AMSGrad uses the maximum of past squared gradients $v_t$ rather than the exponential average to update the parameters. $v_t$ is defined the same as in Adam:
$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$

Instead of using $v_t$ (or its bias-corrected version $\hat{v}_t$) directly, we now employ the previous $\hat{v}_{t-1}$ if it is larger than the current one:
$\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$

The full AMSGrad update, without bias-corrected estimates, is:
$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,m_t$
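A sketch of one AMSGrad step matching the update above (no bias correction); the hyperparameter values are the usual Adam defaults, assumed here:

```python
import numpy as np

def amsgrad_step(theta, grad, m, v, v_max, lr=0.001, beta1=0.9,
                 beta2=0.999, eps=1e-8):
    """AMSGrad: use the MAX of all past v_t instead of the moving average."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)              # v_hat_t = max(v_hat_{t-1}, v_t)
    theta = theta - lr * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```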
The authors observe improved performance compared to Adam on small datasets and on CIFAR-10. Other experiments, however, show similar or worse performance than Adam, and it remains to be seen whether AMSGrad can consistently outperform Adam in practice. Other authors claimed the convergence problems are actually just signs of poorly chosen hyperparameters. In practice, AMSGrad has often turned out to be disappointing: in the reported experiments it never helped, and even when the minimum found by AMSGrad is sometimes slightly lower (in terms of loss) than the one reached by Adam, the metrics (like accuracy) end up worse.
Another paper...
They demonstrate an interesting connection between the learning rate and the batch size, two hyperparameters that are typically thought to be independent of each other: they show that decaying the learning rate is equivalent to increasing the batch size, while the latter allows for increased parallelism and thus shorter training time.

"DON'T DECAY THE LEARNING RATE, INCREASE THE BATCH SIZE"

(Figure: Wide ResNet on CIFAR-10. Training-set cross-entropy, evaluated as a function of the number of training epochs (a), or the number of parameter updates (b). The three learning curves are identical, but increasing the batch size reduces the number of parameter updates required.)

https://arxiv.org/abs/1711.00489
A lot of things discussed... what's the best choice?
• If we tune Adam well, it works perfectly, matching the state of the art. When you hear people saying that Adam doesn't generalize as well as SGD+Momentum, you'll nearly always find that they're choosing poor hyperparameters for their model. Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyperparameters when switching from SGD to Adam.

So, it all points to AdamW: Adam with (decoupled) weight decay.
A good reference:
https://ruder.io/optimizing-gradient-descent/index.html#adamax
https://arxiv.org/pdf/1609.04747.pdf
