
STAT 453: Introduction to Deep Learning and Generative Models

Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching

Lecture 12
Improving Gradient Descent-based Optimization
with Applications in Python

Sebastian Raschka STAT 453: Intro to Deep Learning 1
Overview: Additional Tricks for
Neural Network Training (Part 2/2)

Part 1 (Last Lecture, L10)


• Input Normalization & BatchNorm
• Weight Initialization (Xavier Glorot, Kaiming He)

Part 2 (this lecture)


• Learning Rate Decay
• Momentum Learning
• Adaptive Learning

Sebastian Raschka STAT 453: Intro to Deep Learning 2


Overview: Additional Tricks for
Neural Network Training (Part 2/2)

Part 1 (Last Lecture, L10)


• Input Normalization & BatchNorm
• Weight Initialization (Xavier Glorot, Kaiming He)

Part 2 (this lecture)


• Learning Rate Decay
• Momentum Learning
• Adaptive Learning

(Modifications of the 1st-order SGD optimization algorithm; 2nd-order methods are rarely used in DL)

Sebastian Raschka STAT 453: Intro to Deep Learning 3


Lecture Overview

1. Learning rate decay

2. Learning rate schedulers in PyTorch

3. Training with "momentum"

4. ADAM: Adaptive learning rates & momentum

5. Using optimization algorithms in PyTorch

6. Optimization in deep learning: Additional topics

Sebastian Raschka STAT 453: Intro to Deep Learning 4


Decreasing the learning rate
over the course of training

1. Learning rate decay


2. Learning rate schedulers in PyTorch
3. Training with "momentum"
4. ADAM: Adaptive learning rates & momentum
5. Using optimization algorithms in PyTorch
6. Optimization in deep learning: Additional topics

Sebastian Raschka STAT 453: Intro to Deep Learning 5


Minibatch Learning Recap
• Minibatch learning is a form of
stochastic gradient descent
• Each minibatch can be considered a
sample drawn from the training set
(where the training set is in turn a
sample drawn from the population)
• Hence, the gradient is noisier

• A noisy gradient can be


✦ good: chance to escape local
minima
✦ bad: can lead to extensive
oscillation
• Main advantage: Convergence speed,
because it offers two opportunities for
parallelism (do you recall what these are?)

Sebastian Raschka STAT 453: Intro to Deep Learning 6


Nice Library & Visualization Tool
https://vis.ensmallen.org

Large Learning Rate

Small Learning Rate

Sebastian Raschka STAT 453: Intro to Deep Learning 7


Practical Tip for Minibatch Use
• Reasonable minibatch sizes are usually: 32, 64, 128, 256, 512, 1024 (in the last
lecture, we discussed why powers of 2 are a common convention)
• Usually, you can choose a batch size that is as large as your GPU memory allows
(matrix-multiplication and the size of fully-connected layers are usually the
bottleneck)
• Practical tip: usually, it is a good idea to also make the batch size proportional to
the number of classes in the dataset
[Figure excerpt from Raschka, S. (2018). Model evaluation, model selection, and algorithm selection in machine learning. https://arxiv.org/abs/1811.12808 -- Figure 1: Distribution of Iris flower classes upon random subsampling into training and test sets (dataset before splitting, n = 150; training dataset, n = 100; test dataset, n = 50). In the worst-case scenario, the test set may not contain any instance of a minority class at all. Thus, a recommended practice is to divide the dataset in a stratified fashion. Here, stratification simply means that we randomly split a dataset such that each class is correctly represented in the resulting subsets (the training and the test set). This work by Sebastian Raschka is licensed under a Creative Commons Attribution 4.0 International License.]

Sebastian Raschka STAT 453: Intro to Deep Learning 8
batchsize-1024.ipynb batchsize-64.ipynb

Sebastian Raschka STAT 453: Intro to Deep Learning 9


Learning Rate Decay
• Batch effects -- minibatches are samples of the training set,
hence minibatch loss and gradients are approximations
• Hence, we usually get oscillations
• To dampen oscillations towards the end of the training, we can decay
the learning rate

Sebastian Raschka STAT 453: Intro to Deep Learning 10


Learning Rate Decay
• Batch effects -- minibatches are samples of the training set,
hence minibatch loss and gradients are approximations
• Hence, we usually get oscillations
• To dampen oscillations towards the end of the training, we can decay
the learning rate

A danger of learning rate decay is
to decrease the learning rate too early.

Practical tip: try to train the model
without learning rate decay first,
then add it later.

You can also use the validation
performance (e.g., accuracy) to
judge whether lr decay is useful
(as opposed to using the training loss)

Sebastian Raschka STAT 453: Intro to Deep Learning 11


Learning Rate Decay

Most common variants for learning rate decay:

1) Exponential Decay:

$\eta_t := \eta_0 \cdot e^{-k \cdot t}$

where k is the decay rate

Sebastian Raschka STAT 453: Intro to Deep Learning 12


Learning Rate Decay
Most common variants for learning rate decay:

2) Halving the learning rate:


$\eta_t := \eta_{t-1} / 2$

3) Inverse decay:

$\eta_t := \dfrac{\eta_0}{1 + k \cdot t}$

Sebastian Raschka STAT 453: Intro to Deep Learning 13
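As a small illustration (my own sketch, not code from the slides), the three decay schedules above can be written as plain Python functions; eta0 denotes the initial learning rate, k the decay rate, and t the epoch index:

import math

def exponential_decay(eta0, k, t):
    # eta_t = eta_0 * exp(-k * t)
    return eta0 * math.exp(-k * t)

def halving(eta_prev):
    # eta_t = eta_{t-1} / 2
    return eta_prev / 2.0

def inverse_decay(eta0, k, t):
    # eta_t = eta_0 / (1 + k * t)
    return eta0 / (1.0 + k * t)

# Example: exponential decay over the first 5 epochs with eta0=0.1, k=0.1
print([round(exponential_decay(0.1, 0.1, t), 4) for t in range(5)])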


Learning Rate Decay

There are many, many more variants, e.g., the Cyclical Learning Rate: instead of a stepwise fixed or exponentially decreasing value, one sets minimum and maximum boundaries and lets the learning rate vary cyclically between these bounds. Experiments with numerous functional forms (triangular/linear, Welch/parabolic, and Hann/sinusoidal windows) all produced equivalent results, which led to adopting the triangular policy (linearly increasing, then linearly decreasing learning rate); the input parameter stepsize is the number of iterations in half a cycle.

Smith, Leslie N. "Cyclical learning rates for training neural networks." Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017.

Sebastian Raschka STAT 453: Intro to Deep Learning 14
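For reference, PyTorch ships a built-in scheduler for this policy; a minimal sketch (the model and the commented training loop are placeholders, not from the slides):

import torch

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Triangular policy: the learning rate cycles linearly between base_lr and max_lr;
# step_size_up is the number of iterations in half a cycle.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=0.001, max_lr=0.1,
    step_size_up=2000, mode='triangular')

# CyclicLR is typically stepped after every minibatch, not every epoch:
# for features, targets in train_loader:
#     ...forward, backward, optimizer.step()...
#     scheduler.step()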


Relationship between Learning Rate and Batch Size

Published as a conference paper at ICLR 2018

"Don't Decay the Learning Rate, Increase the Batch Size"
Samuel L. Smith*, Pieter-Jan Kindermans*, Chris Ying & Quoc V. Le
Google Brain
{slsmith, pikinder, chrisying, qvl}@google.com

Abstract
It is common practice to decay the learning rate. Here we show one can usually
obtain the same learning curve on both training and test sets by instead increasing
the batch size during training. This procedure is successful for stochastic gradi-
ent descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It
reaches equivalent test accuracies after the same number of training epochs, but
with fewer parameter updates, leading to greater parallelism and shorter training
times. We can further reduce the number of parameter updates by increasing the
learning rate $\epsilon$ and scaling the batch size $B \propto \epsilon$. Finally, one can increase the momentum coefficient $m$ and scale $B \propto 1/(1 - m)$, although this tends to slightly
reduce the test accuracy. Crucially, our techniques allow us to repurpose existing
training schedules for large batch training with no hyper-parameter tuning. We
train ResNet-50 on ImageNet to 76.1% validation accuracy in under 30 minutes.

Smith, S. L., Kindermans, P. J., Ying, C., & Le, Q. V. (2017). Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489.
Sebastian Raschka STAT 453: Intro to Deep Learning 15
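A minimal sketch of the idea (my own illustration, not code from the paper): at the epochs where one would otherwise decay the learning rate by a factor, multiply the batch size by that factor instead, e.g., by rebuilding the DataLoader:

from torch.utils.data import DataLoader

def increase_batch_size(dataset, batch_size, factor=5):
    # Instead of eta <- eta / factor, use B <- B * factor
    # (similar learning curve, but fewer parameter updates per epoch)
    new_batch_size = batch_size * factor
    loader = DataLoader(dataset, batch_size=new_batch_size, shuffle=True)
    return loader, new_batch_size

# Hypothetical usage at the usual "decay" milestones:
# if epoch in (30, 60, 80):
#     train_loader, batch_size = increase_batch_size(train_dataset, batch_size)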
Relationship between Learning Rate and Batch Size

Published as a conference paper at ICLR 2018

[Figure panels (a) and (b) from the paper; caption below.]

Figure 6: Inception-ResNet-V2 on ImageNet. Increasing the batch size during training achieves
similar results to decaying the learning rate, but it reduces the number of parameter updates from
just over 14000 to below 6000. We run each experiment twice to illustrate the variance.

Smith, S. L., Kindermans, P. J., Ying, C., & Le, Q. V. (2017). Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489.

Sebastian Raschka STAT 453: Intro to Deep Learning 16
Decreasing the learning rate
over the course of training

1. Learning rate decay


2. Learning rate schedulers in PyTorch
3. Training with "momentum"
4. ADAM: Adaptive learning rates & momentum
5. Using optimization algorithms in PyTorch
6. Optimization in deep learning: Additional topics

Sebastian Raschka STAT 453: Intro to Deep Learning 17


Learning Rate Decay in PyTorch

Option 1. Just call your own function at the end of each epoch:

def adjust_learning_rate(optimizer, epoch, initial_lr, decay_rate):
    """Exponential decay every 10 epochs"""
    if not epoch % 10:
        # wrap the exponent in a tensor because torch.exp expects a tensor input
        lr = initial_lr * torch.exp(torch.tensor(-decay_rate * epoch))
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

Sebastian Raschka STAT 453: Intro to Deep Learning 18
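A hedged usage sketch of the function above (train_one_epoch is a hypothetical helper standing in for the usual minibatch loop; NUM_EPOCHS and the other names follow the later examples):

initial_lr, decay_rate = 0.1, 0.1
for epoch in range(NUM_EPOCHS):
    train_one_epoch(model, train_loader, optimizer)  # placeholder for the inner minibatch loop
    adjust_learning_rate(optimizer, epoch, initial_lr, decay_rate)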



Learning Rate Decay in PyTorch


Option 2. Use one of the built-in tools in PyTorch:
(many more available)
(Here, the most generic version.)

Source: https://pytorch.org/docs/stable/optim.html

Sebastian Raschka STAT 453: Intro to Deep Learning 19


Learning Rate Decay in PyTorch
################################# Example, part 1/2
### Model Initialization
#################################

torch.manual_seed(RANDOM_SEED)
model = MLP(num_features=28*28,
            num_hidden=100,
            num_classes=10)

model = model.to(DEVICE)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

#################################
### LEARNING RATE SCHEDULER
#################################

scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer,
                                                   gamma=0.1)

...

Sebastian Raschka STAT 453: Intro to Deep Learning 20


Learning Rate Decay in PyTorch


# Example, part 2/2
for epoch in range(NUM_EPOCHS):
    model.train()
    for batch_idx, (features, targets) in enumerate(train_loader):

        features = features.view(-1, 28*28).to(DEVICE)
        targets = targets.to(DEVICE)

        ### FORWARD AND BACK PROP
        logits, probas = model(features)

        #cost = F.nll_loss(torch.log(probas), targets)
        cost = F.cross_entropy(logits, targets)
        optimizer.zero_grad()

        cost.backward()
        minibatch_cost.append(cost)

        ### UPDATE MODEL PARAMETERS
        optimizer.step()

        ### LOGGING
        if not batch_idx % 50:
            print('Epoch: %03d/%03d | Batch %03d/%03d | Cost: %.4f'
                  % (epoch+1, NUM_EPOCHS, batch_idx,
                     len(train_loader), cost))

    ##########################
    ### Update Learning Rate
    scheduler.step()  # don't have to do it every epoch!
    ##########################

    model.eval()
Sebastian Raschka STAT 453: Intro to Deep Learning 21


[Excerpt from the ResNet paper (architecture diagram of stacked 3x3 conv layers shown alongside Section 3.4, Implementation): ImageNet models are trained with SGD and a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60 x 10^4 iterations, with a weight decay of 0.0001 and a momentum of 0.9. Batch normalization is adopted right after each convolution and before activation; dropout is not used.]

http://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition 2016 (pp. 770-778).

Sebastian Raschka STAT 453: Intro to Deep Learning 22
scheduler.ipynb:

Sebastian Raschka STAT 453: Intro to Deep Learning 23


scheduler.ipynb:

...

Sebastian Raschka STAT 453: Intro to Deep Learning 24


Saving Models in PyTorch

Learning rate schedulers


have the advantage that
we can also simply save
their state for reuse
(e.g., saving and
continuing training later)

Sebastian Raschka STAT 453: Intro to Deep Learning 25
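A minimal sketch of what saving and restoring those states could look like (the checkpoint file name and surrounding variables are assumptions, not from the slides):

import torch

# Save model, optimizer, and scheduler states in one checkpoint
torch.save({
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scheduler': scheduler.state_dict(),
    'epoch': epoch,
}, 'checkpoint.pt')

# ...later, to continue training where we left off:
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
scheduler.load_state_dict(checkpoint['scheduler'])
start_epoch = checkpoint['epoch'] + 1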


Nudging SGD into the right
direction

1. Learning rate decay


2. Learning rate schedulers in PyTorch
3. Training with "momentum"
4. ADAM: Adaptive learning rates & momentum
5. Using optimization algorithms in PyTorch
6. Optimization in deep learning: Additional topics

Sebastian Raschka STAT 453: Intro to Deep Learning 26


Training with "Momentum"

Source: https://en.wikipedia.org/wiki/Momentum

• Concept: In momentum learning, we try to accelerate convergence by


dampening oscillations using "velocity" (the speed of the "movement" from
previous updates)

Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks: The Official Journal of the International Neural Network Society, 12(1), 145–151. http://doi.org/10.1016/S0893-6080(98)00116-6

Sebastian Raschka STAT 453: Intro to Deep Learning 27

Training with "Momentum"

• Concept: In momentum learning, we try to accelerate convergence by


dampening oscillations using "velocity" (the speed of the "movement" from
previous updates)

Without momentum With momentum

Sebastian Raschka STAT 453: Intro to Deep Learning 28


Training with "Momentum"

Without momentum With momentum

Key take-away:
Not only move in the (opposite) direction of the gradient, but also
move in the "averaged" direction of the last few updates

Sebastian Raschka STAT 453: Intro to Deep Learning 29


Training with "Momentum"

Helps with dampening oscillations, but also helps with escaping


local minima traps

Sebastian Raschka STAT 453: Intro to Deep Learning 30


Training with "Momentum"
Often referred to as "velocity" $v$ -- the first term reuses the "velocity" from the previous iteration, the second term is the regular partial derivative (gradient) multiplied by the learning rate at the current time step $t$:

$\Delta w_{i,j}(t) := \alpha \cdot \Delta w_{i,j}(t-1) + \eta \cdot \frac{\partial \mathcal{L}}{\partial w_{i,j}}(t)$

Usually, we choose a momentum rate $\alpha$ between 0.9 and 0.999; you can think of it as a "friction" or "dampening" parameter.

Weight update using the velocity vector:

$w_{i,j}(t+1) := w_{i,j}(t) - \Delta w_{i,j}(t)$

Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks: The Official Journal of the International Neural Network Society, 12(1), 145–151. http://doi.org/10.1016/S0893-6080(98)00116-6

Sebastian Raschka STAT 453: Intro to Deep Learning 31
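To make the update rule above concrete, here is a minimal sketch of one momentum step in PyTorch (my own illustration; not the slides' code and not torch.optim.SGD itself):

import torch

def sgd_momentum_step(params, grads, velocities, lr=0.01, alpha=0.9):
    """One step of the update from the slide:
    delta_w(t) = alpha * delta_w(t-1) + lr * grad(t);  w(t+1) = w(t) - delta_w(t)
    """
    for w, g, v in zip(params, grads, velocities):
        v.mul_(alpha).add_(lr * g)  # v <- alpha * v + eta * grad
        w.sub_(v)                   # w <- w - v

# usage sketch with a single 3-element weight vector
w = [torch.zeros(3)]
v = [torch.zeros(3)]
g = [torch.ones(3)]
sgd_momentum_step(w, g, v, lr=0.1, alpha=0.9)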

Source: https://distill.pub/2017/momentum/
Sebastian Raschka STAT 453: Intro to Deep Learning 32
Combining adaptive learning
rates with momentum

1. Learning rate decay


2. Learning rate schedulers in PyTorch
3. Training with "momentum"
4. ADAM: Adaptive learning rates & momentum
5. Using optimization algorithms in PyTorch
6. Optimization in deep learning: Additional topics

Sebastian Raschka STAT 453: Intro to Deep Learning 33


Adaptive Learning Rates

There are many different flavors of adapting the learning rate
(a bit out of scope for this course to review them all)

Key take-aways:

• decrease the learning rate if the gradient changes its direction
• increase the learning rate if the gradient stays consistent

Sebastian Raschka STAT 453: Intro to Deep Learning 34


Adaptive Learning Rates

Key take-aways:

• decrease the learning rate if the gradient changes its direction
• increase the learning rate if the gradient stays consistent

Step 1: Define a local gain (g) for each weight (initialized with g=1)

$\Delta w_{i,j} := \eta \cdot g_{i,j} \cdot \frac{\partial \mathcal{L}}{\partial w_{i,j}}$

Sebastian Raschka STAT 453: Intro to Deep Learning 35


Adaptive Learning Rates
Step 1: Define a local gain (g) for each weight (initialized with g=1)

$\Delta w_{i,j} := \eta \cdot g_{i,j} \cdot \frac{\partial \mathcal{L}}{\partial w_{i,j}}$

Step 2:

If the gradient is consistent:
$g_{i,j}(t) := g_{i,j}(t-1) + \beta$

else:
$g_{i,j}(t) := g_{i,j}(t-1) \cdot (1 - \beta)$

Note that multiplying by a factor has a larger impact if gains are large, compared to adding a term (dampening effect if updates oscillate in the wrong direction).

Sebastian Raschka STAT 453: Intro to Deep Learning 36
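A small sketch of this local-gain scheme (my own illustration; the clipping bounds g_min and g_max are an added assumption, not from the slide):

import torch

def update_local_gains(grad, prev_grad, gains, beta=0.05, g_min=0.01, g_max=10.0):
    """Add beta to the gain if the gradient keeps its sign, otherwise multiply by (1 - beta)."""
    consistent = (grad * prev_grad) > 0  # did the gradient keep its direction?
    gains = torch.where(consistent, gains + beta, gains * (1 - beta))
    return gains.clamp_(g_min, g_max)

# the weight update would then scale the gradient elementwise:
# w -= lr * gains * grad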


Adaptive Learning Rate via RMSProp

• Unpublished algorithm by Geoff Hinton (but very popular) based on Rprop [1]
• Very similar to another concept called AdaDelta
• Concept: divide learning rate by an exponentially decreasing moving average of
the squared gradients
• This takes into account that gradients can vary widely in magnitude
• Here, RMS stands for "Root Mean Squared"
• Also, damps oscillations like momentum (but in practice, works a bit better)

[1] Igel, Christian, and Michael Hüsken. "Improving the Rprop learning algorithm." Proceedings of the Second
International ICSC Symposium on Neural Computation (NC 2000). Vol. 2000. ICSC Academic Press, 2000.

Sebastian Raschka STAT 453: Intro to Deep Learning 37


Adaptive Learning Rate via RMSProp

$\mathrm{MeanSquare}(w_{i,j}, t) := \beta \cdot \mathrm{MeanSquare}(w_{i,j}, t-1) + (1 - \beta) \left( \frac{\partial \mathcal{L}}{\partial w_{i,j}}(t) \right)^2$

(moving average of the squared gradient for each weight)

$w_{i,j}(t) := w_{i,j}(t) - \eta \cdot \frac{\partial \mathcal{L}}{\partial w_{i,j}}(t) \Big/ \left( \sqrt{\mathrm{MeanSquare}(w_{i,j}, t)} + \epsilon \right)$

where $\beta$ is typically between 0.9 and 0.999, and $\epsilon$ is a small term to avoid division by zero

Sebastian Raschka STAT 453: Intro to Deep Learning 38
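A minimal sketch of one RMSProp step following these equations (my own illustration, not torch.optim.RMSprop):

import torch

def rmsprop_step(w, grad, mean_square, lr=0.01, beta=0.9, eps=1e-8):
    # exponentially decaying moving average of the squared gradient
    mean_square.mul_(beta).add_((1 - beta) * grad**2)
    # scale the update by the root of that moving average
    w.sub_(lr * grad / (mean_square.sqrt() + eps))
    return w, mean_square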


Adaptive Learning Rate via ADAM
• ADAM (Adaptive Moment Estimation) is probably the most widely used
optimization algorithm in DL as of today
• It is a combination of the momentum method and RMSProp

Momentum-like term (compare with the original momentum term $\Delta w_{i,j}(t) := \alpha \cdot \Delta w_{i,j}(t-1) + \eta \cdot \frac{\partial \mathcal{L}}{\partial w_{i,j}}(t)$):

$m_t := \alpha \cdot m_{t-1} + (1 - \alpha) \cdot \frac{\partial \mathcal{L}}{\partial w_{i,j}}(t)$

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Sebastian Raschka STAT 453: Intro to Deep Learning 39


Adaptive Learning Rate via ADAM
Momentum-like term:

$m_t := \alpha \cdot m_{t-1} + (1 - \alpha) \cdot \frac{\partial \mathcal{L}}{\partial w_{i,j}}(t)$

RMSProp term:

$r := \beta \cdot \mathrm{MeanSquare}(w_{i,j}, t-1) + (1 - \beta) \left( \frac{\partial \mathcal{L}}{\partial w_{i,j}}(t) \right)^2$

ADAM update:

$w_{i,j} := w_{i,j} - \eta \cdot \frac{m_t}{\sqrt{r} + \epsilon}$

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Sebastian Raschka STAT 453: Intro to Deep Learning 40
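A minimal sketch of one step of this simplified Adam update (my own illustration; bias correction is omitted here because it only appears in Algorithm 1 on the next slide, and this is not torch.optim.Adam itself):

import torch

def adam_step(w, grad, m, r, lr=0.001, alpha=0.9, beta=0.999, eps=1e-8):
    m.mul_(alpha).add_((1 - alpha) * grad)   # momentum-like term
    r.mul_(beta).add_((1 - beta) * grad**2)  # RMSProp term
    w.sub_(lr * m / (r.sqrt() + eps))        # combined update
    return w, m, r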


Published as a conference paper at ICLR 2015

Adaptive Learning Rate via ADAM


Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 2 for details, and for a slightly more efficient (but less clear) order of computation. $g_t^2$ indicates the elementwise square $g_t \odot g_t$. Good default settings for the tested machine learning problems are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. All operations on vectors are element-wise. With $\beta_1^t$ and $\beta_2^t$ we denote $\beta_1$ and $\beta_2$ to the power $t$.

Require: $\alpha$: Stepsize
Require: $\beta_1, \beta_2 \in [0, 1)$: Exponential decay rates for the moment estimates
Require: $f(\theta)$: Stochastic objective function with parameters $\theta$
Require: $\theta_0$: Initial parameter vector
  $m_0 \leftarrow 0$ (Initialize 1st moment vector)
  $v_0 \leftarrow 0$ (Initialize 2nd moment vector)
  $t \leftarrow 0$ (Initialize timestep)
  while $\theta_t$ not converged do
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (Get gradients w.r.t. stochastic objective at timestep t)
    $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (Update biased first moment estimate)
    $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (Update biased second raw moment estimate)
    $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (Compute bias-corrected first moment estimate)
    $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (Compute bias-corrected second raw moment estimate)
    $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (Update parameters)
  end while
  return $\theta_t$ (Resulting parameters)

Also add a bias correction term for better conditioning in earlier iterations.

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Sebastian Raschka STAT 453: Intro to Deep Learning 41
Experimenting with different
optimization algorithms

1. Learning rate decay


2. Learning rate schedulers in PyTorch
3. Training with "momentum"
4. ADAM: Adaptive learning rates & momentum
5. Using optimization algorithms in PyTorch
6. Optimization in deep learning: Additional topics

Sebastian Raschka STAT 453: Intro to Deep Learning 42


Using Different Optimizers in PyTorch

Usage is the same as for vanilla SGD, which we used before;
you can find an overview at: https://pytorch.org/docs/stable/optim.html

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)


optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

Sebastian Raschka STAT 453: Intro to Deep Learning 43


Using Different Optimizers in PyTorch

Usage is the same as for vanilla SGD, which we used before;
you can find an overview at: https://pytorch.org/docs/stable/optim.html

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)


optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

Remember to save the optimizer state if you are using, e.g., Momentum or
ADAM, and want to continue training later
(see earlier slides on saving states of the learning rate schedulers).

Sebastian Raschka STAT 453: Intro to Deep Learning 44
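For example (a sketch; the file name is an assumption): the momentum buffers and Adam's moment estimates live in the optimizer's state_dict, so resuming without restoring it would restart those statistics from zero.

# Save: optimizer.state_dict() contains the momentum buffers / Adam moment estimates
torch.save(optimizer.state_dict(), 'optimizer_state.pt')

# Resume: recreate the optimizer with the same settings, then restore its state
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
optimizer.load_state_dict(torch.load('optimizer_state.pt'))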


Adaptive Learning Rate via ADAM
$m_t := \alpha \cdot m_{t-1} + (1 - \alpha) \cdot \frac{\partial \mathcal{L}}{\partial w_{i,j}}(t)$

$r := \beta \cdot \mathrm{MeanSquare}(w_{i,j}, t-1) + (1 - \beta) \left( \frac{\partial \mathcal{L}}{\partial w_{i,j}}(t) \right)^2$

The default settings for the
"betas" usually work just fine

Source: https://pytorch.org/docs/stable/optim.html
Sebastian Raschka STAT 453: Intro to Deep Learning 45
sgd-scheduler-momentum.ipynb adam.ipynb

Sebastian Raschka STAT 453: Intro to Deep Learning 46


Decreasing the learning rate
over the course of training

1. Learning rate decay


2. Learning rate schedulers in PyTorch
3. Training with "momentum"
4. ADAM: Adaptive learning rates & momentum
5. Using optimization algorithms in PyTorch
6. Optimization in deep learning: Additional topics

Sebastian Raschka STAT 453: Intro to Deep Learning 47


Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. https://arxiv.org/abs/1412.6980

[Figure 2 from the paper: training of multilayer neural networks on MNIST images. (a) Neural networks with dropout stochastic regularization. (b) Neural networks with a deterministic cost function, compared with the sum-of-functions (SFO) optimizer (Sohl-Dickstein et al., 2014).]
Sebastian Raschka STAT 453: Intro to Deep Learning 48
Wilson AC, Roelofs R, Stern M, Srebro N, Recht B. The marginal value of adaptive gradient
methods in machine learning, https://arxiv.org/abs/1705.08292

Sebastian Raschka STAT 453: Intro to Deep Learning 49


Training Loss vs Generalization Error

Sebastian Raschka STAT 453: Intro to Deep Learning 50


Training Loss vs Generalization Error

[Excerpt from "Improving Generalization Performance by Switching from Adam to SGD" -- Figure 1: Training the DenseNet architecture on the CIFAR-10 data set with four optimizers: SGD, Adam, Adam-Clip(1, 1) and Adam-Clip(0, 1). SGD achieves the best testing accuracy while training with Adam leads to a generalization gap of roughly 2%. Setting a minimum learning rate for each parameter of Adam partially closes the generalization gap.]

Keskar, N. S., & Socher, R. (2017). Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628.

Sebastian Raschka STAT 453: Intro to Deep Learning 51
https://www.lightly.ai/post/which-optimizer-should-i-use-for-my-machine-learning-project

Sebastian Raschka STAT 453: Intro to Deep Learning 52


https://parameterfree.com/2020/12/06/
neural-network-maybe-evolved-to-make-
adam-the-best-optimizer/

"it is known that Adam will not always give you the best performance, yet most of
the time people know that they can use it with its default parameters and get, if not
the best performance, at least the second best performance on their particular deep
learning problem. "

"Usually people try new architectures keeping the optimization algorithm xed, and most
of the time the algorithm of choice is Adam. This happens because, as explained above,
Adam is the default optimizer."

Sebastian Raschka STAT 453: Intro to Deep Learning 53

[Excerpt from https://arxiv.org/abs/1906.03291: over-parameterized neural networks (i.e., those with more parameters than training data) can represent arbitrary, even random, labeling functions on large datasets, so an optimizer can reliably fit such a network to the training data and achieve near zero loss; however, this comes with no guarantee of generalization to unseen test data. To illustrate the difference between model fitting and generalization, the authors train two over-parameterized models on the 50,000-image CIFAR-10 training set: a neural network (ResNet-18) with 269,722 parameters (roughly 6x the number of training images) and a linear model with 298,369 parameters (pixel intensities plus pair-wise products of pixel intensities). Figure 2 (left): both models achieve perfect accuracy on the training data, but the linear model reaches only 49% test accuracy while ResNet-18 reaches 92%. Figure 2 (right): CIFAR-10 trained with various optimizers using VGG13 generalizes well irrespective of the optimizer used.]

Sebastian Raschka STAT 453: Intro to Deep Learning 54
[Excerpt from the AdaBelief paper (Section 2, Methods; Algorithms 1 and 2 compare Adam and AdaBelief side by side): in Adam, the update direction is $m_t / \sqrt{v_t}$, where $v_t$ is the EMA of $g_t^2$; in AdaBelief, the update direction is $m_t / \sqrt{s_t}$, where $s_t$ is the EMA of $(g_t - m_t)^2$. No extra parameters are introduced. Intuitively, viewing $m_t$ as the prediction of $g_t$, AdaBelief takes a large step when the observation $g_t$ is close to the prediction $m_t$, and a small step when the observation greatly deviates from the prediction.]

https://arxiv.org/abs/2010.07468
https://github.com/juntang-zhuang/Adabelief-Optimizer

"uses the exponential moving average of variance of gradient instead of the exponential moving average of square of gradients to calculate the adaptive learning rate"

"trains fast as Adam, generalizes well as SGD, and is stable to train GANs"

Sebastian Raschka STAT 453: Intro to Deep Learning 55
