Optimization
Neural Networks
Gradient Descent
We move in the direction opposite to the derivative (opposite to the slope):
Negative slope: when we increase w, the loss is decreasing. -(-) = +, so the weight increases (moving right).
Positive slope: when we increase w, the loss is increasing. -(+) = -, so the weight decreases (moving left).
Weight Update Rule:

$w \leftarrow w - \eta \frac{dE}{dw}$

(Terms, left to right: old weight; the minus sign takes the negative of the slope; learning rate; gradient.)

$\eta$ = learning rate: how fast we update the weights. In other words, the step size of the update.
A high learning rate leads to overshooting during gradient descent and might never reach the minimum. A small, well-chosen learning rate leads to gradual descent towards the minimum.
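To make the effect concrete, here is a minimal sketch (a hypothetical 1-D loss $E(w) = w^2$ and made-up learning rates, not taken from the slides):

```python
# Minimal sketch: gradient descent on a toy 1-D loss E(w) = w^2, so dE/dw = 2w.
# The loss function and learning rates are illustrative assumptions.

def gradient_descent(w, lr, steps=10):
    for _ in range(steps):
        grad = 2 * w          # dE/dw for E(w) = w^2
        w = w - lr * grad     # move opposite to the slope
    return w

print(gradient_descent(w=1.0, lr=0.1))   # small lr: gradual descent towards w = 0
print(gradient_descent(w=1.0, lr=1.1))   # too-high lr: overshoots and diverges
```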
Why isn't vanilla gradient descent good enough?

$w \leftarrow w - \eta \frac{dE}{dw}$
• Generally, vanilla backpropagation is slow and doesn't perform well on real-life datasets.
• There is a lot to improve in the "step" that the algorithm takes (for example, if it's going in the right direction, we want it to take larger steps, since it's more confident).
• We need to deal with overshooting of the local minima.
• The learning rate can be adaptive depending on the path status of gradient descent (for example, parameters with small gradients can have a large learning rate, since their weight is already close to optimal, while parameters with large gradients can have a small learning rate, since they still need an update).
Gradient Descent
• The base algorithm that is used to minimize the error with respect to the weights
of the neural network. The learning rate determines the step size of the update
used to reach the minimum.
• An Epoch is one complete pass through all the samples.
https://www.learnopencv.com/understanding-activation-functions-in-deep-learning/
https://sebastianraschka.com/faq/docs/closed-form-vs-gd.html
Batch Gradient Descent
• Take all the samples in one iteration. In this case, iteration = epoch
• All the samples are passed to the neural network at one time.
• Computes the gradient of the loss function with respect to the network’s weights for
the entire training dataset, and then performs one update.
• n = all samples per iteration/epoch
• In literature, weights $w$ are sometimes denoted as $\theta$.
($\eta$ = learning rate: how fast we update the weights; in other words, the step size of the update.)
• Weight Update Rule: $\theta \leftarrow \theta - \eta\, \nabla_\theta J(\theta)$ (one update computed over the entire training set)
Update Weights: take ALL samples and feed them to the network. Calculate the error based on all samples, then update the weights.
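A minimal NumPy sketch of batch gradient descent under illustrative assumptions (a linear model with squared error and random toy data; none of this setup comes from the slides):

```python
import numpy as np

# Illustrative setup: linear model y_hat = X @ w with squared-error loss.
X = np.random.randn(8, 3)   # 8 samples, 3 features (as in the example tables below)
y = np.random.randn(8)
w = np.zeros(3)
lr = 0.01

for epoch in range(100):                  # one iteration == one epoch
    y_hat = X @ w
    grad = X.T @ (y_hat - y) / len(X)     # gradient over ALL samples
    w -= lr * grad                        # exactly one update per epoch
```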
Stochastic Gradient Descent
• Takes in one sample at each iteration.
• Performs the weight update for each sample, one at a time.
• Number of iterations per epoch = number of samples.
• It has high variance in its updates, which causes the loss function to fluctuate.
• n = 1 per iteration
Update Weights: take ONE sample and feed it to the network. Calculate the error based on that one sample and update the weights. Repeat this step for all samples.
Example of Stochastic Gradient Descent

Sample 1:  0.32   0.28   0.93
Sample 2: -0.01   0.03   0.72
Sample 3:  1.82  -0.46   0.54
Sample 4:  0.92   0.67  -1.43
Sample 5: -0.71   1.21   0.37
Sample 6:  0.01   0.78   0.04
Sample 7: -0.42  -0.33   1.37
Sample 8:  1.12  -0.06  -0.17

Each sample in turn is fed to the network on its own; the error is computed on that single sample and the weights are updated before moving on to the next sample, until all eight samples have been used.
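The same toy setup, sketched with stochastic updates (one sample per weight update; the shuffling and the model are my assumptions):

```python
import numpy as np

X = np.random.randn(8, 3)   # 8 samples, as in the table above
y = np.random.randn(8)
w = np.zeros(3)
lr = 0.01

for epoch in range(100):
    for i in np.random.permutation(len(X)):   # one sample per iteration
        y_hat = X[i] @ w
        grad = (y_hat - y[i]) * X[i]          # gradient from ONE sample
        w -= lr * grad                        # 8 updates per epoch
```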
Mini-Batch Gradient Descent
• Takes the best of both Batch and Stochastic Gradient Descent.
• Performs the weight update for a batch of samples.
• Takes n training samples (the batch size) and feeds them to the network.
• Number of iterations per epoch = number of samples / batch size.
• n = batch size per iteration
Example of Mini-Batch Gradient Descent

Sample 1:  0.32   0.28   0.93
Sample 2: -0.01   0.03   0.72
Sample 3:  1.82  -0.46   0.54
Sample 4:  0.92   0.67  -1.43
Sample 5: -0.71   1.21   0.37
Sample 6:  0.01   0.78   0.04
Sample 7: -0.42  -0.33   1.37
Sample 8:  1.12  -0.06  -0.17

With a batch size of 4, the first batch (Samples 1-4) is fed to the network together; the error is computed over that batch and the weights are updated once. The next batch (Samples 5-8) is then processed the same way.
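And the mini-batch variant of the same sketch, with an assumed batch size of 4 (so 8 / 4 = 2 iterations per epoch):

```python
import numpy as np

X = np.random.randn(8, 3)
y = np.random.randn(8)
w = np.zeros(3)
lr, batch_size = 0.01, 4                        # 8 / 4 = 2 iterations per epoch

for epoch in range(100):
    for start in range(0, len(X), batch_size):
        Xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        grad = Xb.T @ (Xb @ w - yb) / len(Xb)   # gradient over ONE batch
        w -= lr * grad                          # one update per batch
```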
Exponentially Weighted Averages

We are averaging over roughly $\frac{1}{1-\beta}$ points.

$v_t = 0.9\, v_{t-1} + 0.1\, \theta_t$

Expanding, for example, at $t = 50$:

$v_{50} = 0.1\,\theta_{50} + 0.9\,\big(0.9\,\big(0.9\,(0.9\, v_{46} + 0.1\,\theta_{47}) + 0.1\,\theta_{48}\big) + 0.1\,\theta_{49}\big)$

$v_{50} = 0.1\,\theta_{50} + 0.9\,(0.1\,\theta_{49}) + (0.9)^2\,(0.1\,\theta_{48}) + (0.9)^3\,(0.1\,\theta_{47}) + (0.9)^4\, v_{46}$

More weight goes to previous values and less to the current value. The coefficients sum up to 1 (0.1 only for the current value, and all the rest for the previous values).
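A minimal sketch of the exponentially weighted average above, run over the illustrative sample values from the earlier tables (the variable names are mine):

```python
# Exponentially weighted average with beta = 0.9:
# averages over roughly 1 / (1 - 0.9) = 10 recent points.
beta = 0.9
data = [0.32, -0.01, 1.82, 0.92, -0.71, 0.01, -0.42, 1.12]  # illustrative values
v = 0.0
averages = []
for theta in data:
    v = beta * v + (1 - beta) * theta   # 0.1 weight to current, 0.9 to the past
    averages.append(v)
```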
RMSProp
• An adaptive learning rate method.
• Proposed by Geoff Hinton in Lecture 6e of his Coursera class.
• RMSProp divides the learning rate by an exponentially decaying average of squared gradients.
• $\gamma$ is suggested to be 0.9 and $\eta$ = 0.001.
• At each step, the effective learning rate changes based on $E[g^2]_t$:

$E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1-\gamma)\, g_t^2$

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$

(Old weight, minus a learning rate that keeps changing, times the gradient.)

Per-weight form:

$s_{dw} = \beta\, s_{dw} + (1-\beta)\, dw^2 \qquad s_{db} = \beta\, s_{db} + (1-\beta)\, db^2$

$w_{new} = w_{old} - \frac{\alpha}{\sqrt{s_{dw}} + \epsilon}\, dw \qquad b_{new} = b_{old} - \frac{\alpha}{\sqrt{s_{db}} + \epsilon}\, db$

The square of a small gradient becomes smaller, and the square of a large gradient becomes larger; dividing by a small number makes the step larger, and dividing by a large number makes the step smaller. $\epsilon$ is a very small number that prevents dividing by zero.
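A minimal sketch of the RMSProp step described above (the function name and calling convention are my own):

```python
import numpy as np

def rmsprop_step(w, grad, s, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp step: divide lr by a decaying average of squared gradients."""
    s = gamma * s + (1 - gamma) * grad**2     # E[g^2]_t
    w = w - lr / (np.sqrt(s) + eps) * grad    # adaptive effective learning rate
    return w, s
```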
Adam Optimization (Adaptive Moment Estimation)
• The learning rate is adaptive.
• Stores an exponentially decaying average of past squared gradients ($v_t$): RMSProp!
• Stores an exponentially decaying average of past gradients ($m_t$): Momentum!

$m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t \qquad v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^2$

• To prevent $m_t$ and $v_t$ from being biased towards zero during the initial steps, an adjustment is made (bias correction):

$\hat{m}_t = \frac{m_t}{1-\beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$

• And finally the update: the old weight, minus a changing learning rate times the exponentially weighted average of the gradient.

$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$

Proposed values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
Adam Optimization
• Combines Momentum and RMSProp: perform the weight update with $v_{dw}$ and $v_{db}$ (as in Momentum), and divide by $\sqrt{s_{dw}}$ and $\sqrt{s_{db}}$ (as in RMSProp).
• It uses bias correction: divide by $(1 - \beta_1^t)$ for $v_{dw}$ and $v_{db}$, and by $(1 - \beta_2^t)$ for $s_{dw}$ and $s_{db}$.

Initialize $v_{dw} = 0$, $s_{dw} = 0$, $v_{db} = 0$, $s_{db} = 0$.
On iteration t, compute $dw$ and $db$ using the current mini-batch, then:

$v_{dw} = \beta_1\, v_{dw} + (1-\beta_1)\, dw$ and $v_{db} = \beta_1\, v_{db} + (1-\beta_1)\, db$ (Momentum with $\beta_1$)

$s_{dw} = \beta_2\, s_{dw} + (1-\beta_2)\, dw^2$ and $s_{db} = \beta_2\, s_{db} + (1-\beta_2)\, db^2$ (RMSProp with $\beta_2$)
https://arxiv.org/abs/1412.6980
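A minimal sketch of one Adam step as described above (the function signature is my own; `t` starts at 1 so the bias correction is well defined):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum + RMSProp, with bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad**2     # decaying average of squared gradients
    m_hat = m / (1 - beta1**t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```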
• The individual learning rate for each parameter is computed using the initial learning rate as an upper limit. This means that each per-parameter learning rate can vary from 0 (no update) to the initial learning rate (maximum update).
• If we want to be sure that every update step doesn't exceed the initial learning rate, we can then lower this initial learning rate over time (learning rate scheduling). It is therefore recommended to decay the learning rate, because Adam might still exhibit a high learning rate depending on the gradient of a parameter.
(The adaptive factor scaling every update is $\frac{\alpha}{\sqrt{s_{dw}^{corrected}} + \epsilon}$.)
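Since decaying the learning rate is recommended, here is one common exponential-decay schedule as an illustrative sketch (the slides do not prescribe a specific schedule):

```python
def decayed_lr(lr0, epoch, decay_rate=0.96, decay_steps=10):
    """Simple exponential learning-rate decay (illustrative schedule)."""
    return lr0 * decay_rate ** (epoch / decay_steps)
```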
So what happened then?
People started noticing that, despite superior training time, Adam in some areas does not converge to an optimal solution, so for some tasks (such as image classification on the popular CIFAR datasets) state-of-the-art results are still only achieved by applying SGD with momentum.
The paper 'Improving Generalization Performance by Switching from Adam to SGD' also showed that by switching to SGD during training, the authors were able to obtain better generalization than when using Adam alone. The fix rests on a very simple observation: in the earlier stages of training Adam still outperforms SGD, but later the learning saturates. They proposed a simple strategy, called SWATS, in which training starts with Adam but switches to SGD when a certain criterion is met. With it, they managed to achieve results comparable to SGD with momentum.
• Practitioners noticed that in some cases, e.g. for object recognition or machine translation, Adam fails to converge to an optimal solution and is outperformed by SGD with momentum.
• They point out that in Adam, the short-term memory of the gradients becomes an obstacle in some scenarios. In settings where Adam converges to a suboptimal solution, it has been observed that some minibatches provide large and informative gradients, but since these minibatches occur only rarely, exponential averaging diminishes their influence, which leads to poor convergence.
So what did they do?
Since the values of the step size are often decreasing over time, they proposed keeping the maximum of the values $v_t$ and using it, instead of the moving average, to update the parameters.
AMSGrad uses the maximum of past squared gradients $v_t$ rather than the exponential average to update the parameters. $v_t$ is defined the same as in Adam:

$v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^2$

Instead of using $v_t$ (or its bias-corrected version $\hat{v}_t$) directly, we now employ the previous $\hat{v}_{t-1}$ if it is larger than the current one:

$\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$

The full AMSGrad update, without bias-corrected estimates, is:

$m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t$
$v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^2$
$\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, m_t$
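A minimal sketch of one AMSGrad step as defined above (the function signature is my own; note that the running maximum `v_max` replaces Adam's $\hat{v}_t$ and there is no bias correction):

```python
import numpy as np

def amsgrad_step(w, grad, m, v, v_max, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step: like Adam, but divides by the max of past v_t."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    v_max = np.maximum(v_max, v)          # keep the maximum of past squared averages
    w = w - lr * m / (np.sqrt(v_max) + eps)
    return w, m, v, v_max
```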
The authors observe improved performance compared to Adam on small datasets and on CIFAR-10. Other experiments, however, show similar or worse performance than Adam. It remains to be seen whether AMSGrad is able to consistently outperform Adam in practice. Authors have also claimed that the convergence problems are actually just signs of poorly chosen hyperparameters. In practice, AMSGrad has turned out to be very disappointing: in none of our experiments did we find that it helped the slightest bit, and even if it's true that the minimum found by AMSGrad is sometimes slightly lower (in terms of loss) than the one reached by Adam, the metrics (like accuracy) always end up worse.
Another paper...
It demonstrates an interesting connection between the learning rate and the batch size, two hyperparameters that are typically thought to be independent of each other: the authors show that decaying the learning rate is equivalent to increasing the batch size, and the latter allows for increased parallelism and thus shorter training time.
https://arxiv.org/abs/1711.00489
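An illustrative reading of that equivalence (hypothetical numbers, not taken from the paper):

```python
# Two hypothetical training schedules illustrating the claimed equivalence:
# the ratio lr / batch_size stays the same in both, phase by phase.
lr_decay     = [{"lr": 0.1 / 2**k, "batch_size": 128}        for k in range(4)]
batch_growth = [{"lr": 0.1,        "batch_size": 128 * 2**k} for k in range(4)]
```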
A lot of things discussed... what's the best choice?
• If we tune Adam well, it works perfectly, just like the state of the art! When you hear people saying that Adam doesn't generalize as well as SGD+Momentum, you'll nearly always find that they're choosing poor hyperparameters for their model. Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyperparameters when switching from SGD to Adam.