L5 Training Neural Networks Part 2
1
Outline
• Optimization algorithms for neural networks
• Learning rate schedules
• Anti-overfitting techniques
• Data enrichment (data augmentation)
• Choosing Hyperparameters
• Techniques for combining multiple models (ensemble
methods)
• Transfer learning
2
Optimization algorithms
for neural networks
3
Stochastic gradient descent (SGD)
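A minimal numpy sketch of the vanilla SGD update, using an assumed toy quadratic objective (gradient A @ x) in place of a real minibatch gradient:

```python
import numpy as np

# Toy quadratic objective, assumed only for illustration: grad L(x) = A @ x.
A = np.diag([10.0, 0.1])        # ill-conditioned: one steep, one shallow direction
x = np.array([1.0, 1.0])
learning_rate = 0.05

for step in range(100):
    dx = A @ x                  # in practice: a noisy minibatch gradient
    x -= learning_rate * dx     # vanilla SGD update
```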
4
Problem #1 with SGD
• What if loss changes quickly in one direction and
slowly in another? What does gradient descent do?
5
Problem #1 with SGD (2)
• What if loss changes quickly in one direction and slowly in
another? What does gradient descent do?
• Very slow progress along shallow dimension, jitter along
steep direction
6
Problem #2 with SGD
• What if the loss function has a local minimum or a saddle point?
7
Problem #2 with SGD (2)
• What if the loss function has a local minimum or a saddle point?
• Zero gradient, so gradient descent gets stuck
• Saddle points are very common in high-dimensional objective functions
8
Problem #3 with SGD
• Our gradients come from
minibatches so they can
be noisy!
9
SGD + momentum
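A minimal sketch of the momentum update, again with an assumed toy gradient standing in for the minibatch gradient; rho is the momentum coefficient (typically ~0.9):

```python
import numpy as np

# Toy gradient assumed only for illustration.
A = np.diag([10.0, 0.1])
x = np.array([1.0, 1.0])
vx = np.zeros_like(x)
learning_rate, rho = 0.05, 0.9

for step in range(100):
    dx = A @ x                  # minibatch gradient in practice
    vx = rho * vx + dx          # build up a running "velocity"
    x -= learning_rate * vx     # step along the velocity, not the raw gradient
```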
10
SGD + momentum (2)
11
SGD + momentum (3)
12
Nesterov Momentum (2)
• Annoying: usually we want the update in terms of x_t and ∇L(x_t)
• Change of variables: let x̃_t = x_t + ρ v_t and rearrange the update
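After the change of variables, the update can be written purely in terms of the current point. A sketch of this rearranged form, with the same assumed toy gradient:

```python
import numpy as np

# Toy gradient assumed only for illustration.
A = np.diag([10.0, 0.1])
x = np.array([1.0, 1.0])
v = np.zeros_like(x)
learning_rate, rho = 0.05, 0.9

for step in range(100):
    dx = A @ x                          # gradient at the current (shifted) variable
    old_v = v
    v = rho * v - learning_rate * dx    # velocity update
    x += -rho * old_v + (1 + rho) * v   # rearranged Nesterov step
```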
14
Nesterov Momentum (3)
15
AdaGrad
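AdaGrad keeps a running sum of squared gradients and divides the step element-wise by its square root, so each parameter gets its own effective learning rate. A sketch with the assumed toy gradient:

```python
import numpy as np

# Toy gradient assumed only for illustration.
A = np.diag([10.0, 0.1])
x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate = 1.0

for step in range(100):
    dx = A @ x
    grad_squared += dx * dx                                   # accumulate squared gradients
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)  # per-parameter step size
```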
16
AdaGrad
17
AdaGrad
19
AdaGrad
21
RMSProp
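RMSProp replaces AdaGrad's ever-growing sum with an exponentially decaying average, so the effective step size does not shrink towards zero. A sketch with the assumed toy gradient:

```python
import numpy as np

# Toy gradient assumed only for illustration; decay_rate is typically 0.9-0.99.
A = np.diag([10.0, 0.1])
x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate, decay_rate = 0.05, 0.99

for step in range(100):
    dx = A @ x
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```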
22
Adam (almost)
23
Adam (full form)
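Adam combines a momentum-like first moment with an RMSProp-like second moment, plus bias correction for the first iterations. A sketch with the assumed toy gradient and the usual default hyperparameters:

```python
import numpy as np

# Toy gradient assumed only for illustration.
A = np.diag([10.0, 0.1])
x = np.array([1.0, 1.0])
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)
learning_rate, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-7

for t in range(1, 101):
    dx = A @ x
    first_moment = beta1 * first_moment + (1 - beta1) * dx         # momentum part
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx  # RMSProp part
    first_unbias = first_moment / (1 - beta1 ** t)                 # bias correction
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + eps)
```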
24
Visual examples of the learning process
25
First-order optimization
26
Second-order optimization
• Using the Hessian, which is a square matrix of second-
order partial derivatives of the function.
27
Second-order optimization
• Taylor expansion
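Written out, the second-order Taylor approximation around the current parameters θ₀ and the Newton update that follows from it:

```latex
% Second-order Taylor approximation of the loss around \theta_0
L(\theta) \approx L(\theta_0)
  + (\theta - \theta_0)^{\top} \nabla_{\theta} L(\theta_0)
  + \tfrac{1}{2} (\theta - \theta_0)^{\top} H (\theta - \theta_0)

% Setting the gradient of this approximation to zero gives the Newton step
\theta^{*} = \theta_0 - H^{-1} \nabla_{\theta} L(\theta_0)
```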
28
L-BFGS (Limited memory BFGS)
• Does not form/store the full inverse Hessian
• Usually works very well in full-batch, deterministic mode, i.e. if you have a single, deterministic f(x), then L-BFGS will probably work very nicely
• Does not transfer well to the mini-batch setting and tends to give bad results. Adapting second-order methods to the large-scale, stochastic setting is an active area of research.
SOTA optimizers
• NAdam = Adam + Nesterov's Accelerated
Gradient (NAG)
• RAdam (Rectified Adam)
• LookAhead
• Ranger = RAdam + LookAhead
30
In practice
• Adam is a good default choice in many cases; it often works OK even with a constant learning rate
• SGD+Momentum can outperform Adam but may require more tuning of the learning rate and schedule
• Try a cosine schedule: it has very few hyperparameters!
• If you can afford full-batch updates, try L-BFGS (and don't forget to disable all sources of noise)
31
Learning rate schedules
32
Learning rate
• SGD, SGD+Momentum, Adagrad, RMSProp, Adam all
have learning rate as a hyperparameter.
• Q: Which one of these learning rates is best to use?
• The learning rate usually starts at a large value and decreases over time
33
Learning rate decays over time
• Step: Reduce
learning rate at a
few fixed points.
• E.g., for ResNets,
multiply LR by 0.1
after epochs 30,
60, and 90.
34
Cosine rate decay
35
Linear rate decay
36
Inverse square root of the epoch number
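For reference, the decay rules from the last few slides as small Python functions; alpha0 is the initial learning rate, t the current epoch, T the total number of epochs (standard formulas, written out here as a sketch):

```python
import math

def step_decay(alpha0, t, milestones=(30, 60, 90), gamma=0.1):
    # Multiply the LR by gamma at each milestone epoch (ResNet-style step decay).
    return alpha0 * gamma ** sum(t >= m for m in milestones)

def cosine_decay(alpha0, t, T):
    # alpha_t = 0.5 * alpha0 * (1 + cos(pi * t / T))
    return 0.5 * alpha0 * (1 + math.cos(math.pi * t / T))

def linear_decay(alpha0, t, T):
    # alpha_t = alpha0 * (1 - t / T)
    return alpha0 * (1 - t / T)

def inv_sqrt_decay(alpha0, t):
    # alpha_t = alpha0 / sqrt(t)
    return alpha0 / math.sqrt(max(t, 1))
```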
37
Linear Warmup
• High initial learning rates
can make loss explode;
linearly increasing
learning rate from 0 over
the first ~5000 iterations
can prevent this
• Empirical rule of thumb: If
you increase the batch
size by N, also scale the
initial learning rate by N
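A minimal sketch of linear warmup combined with the batch-size scaling rule above; the function name and the 5000-step default are illustrative:

```python
def warmup_lr(step, base_lr, batch_scale=1, warmup_steps=5000):
    # Scale the target LR by the batch-size factor (rule of thumb above),
    # then ramp it linearly from 0 over the first warmup_steps iterations
    # to avoid an early loss explosion; a decay schedule takes over afterwards.
    target_lr = base_lr * batch_scale
    if step < warmup_steps:
        return target_lr * step / warmup_steps
    return target_lr
```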
Anti-overfitting techniques
39
Beyond Training Error
Early Stopping: Always do this
• Stop training when accuracy on the validation set starts to decrease, or train for a long time but always keep the model snapshot that worked best on the validation set
41
Regularization: Add a term to the loss
• Total loss = data loss + λ · R(W)
• Common regularizers: L2 (weight decay), L1, Elastic net (L1 + L2)
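A minimal numpy sketch of adding an L2 penalty to a data loss; W, data_loss and reg_strength (λ) are illustrative placeholders for your own model:

```python
import numpy as np

# Hypothetical example values: a weight matrix and a precomputed data loss.
W = np.random.randn(10, 4) * 0.01
data_loss = 2.3
reg_strength = 1e-4                 # lambda, the regularization strength

l2_penalty = np.sum(W * W)          # R(W) = sum of squared weights
total_loss = data_loss + reg_strength * l2_penalty
```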
42
Regularization: Dropout
• In each forward pass, randomly set some neurons to
zero
• Probability of dropping is a hyperparameter; 0.5 is
common
43
Regularization: Dropout
• Example forward pass with a 3-layer network using
dropout
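A sketch of such a forward pass in numpy (layer sizes and weights are illustrative; masks are sampled and applied during training only):

```python
import numpy as np

p = 0.5                                    # probability of keeping a unit active
# Illustrative weights for a tiny 3-layer network.
W1, b1 = np.random.randn(20, 10) * 0.01, np.zeros(20)
W2, b2 = np.random.randn(20, 20) * 0.01, np.zeros(20)
W3, b3 = np.random.randn(5, 20) * 0.01, np.zeros(5)

def train_step(x):
    h1 = np.maximum(0, W1 @ x + b1)
    u1 = np.random.rand(*h1.shape) < p     # first dropout mask
    h1 *= u1                               # drop!
    h2 = np.maximum(0, W2 @ h1 + b2)
    u2 = np.random.rand(*h2.shape) < p     # second dropout mask
    h2 *= u2                               # drop!
    return W3 @ h2 + b3

out = train_step(np.random.randn(10))
```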
44
Dropout effects
• Forces the network to have a redundant
representation; Prevents co-adaptation of features
45
Dropout effects (2)
• Dropout is training a large ensemble of models (that
share parameters).
• Each binary mask is one model
• An FC layer with 4096 units has 2^4096 ≈ 10^1233 possible masks!
• … and there are only about 10^82 atoms in the universe!
46
Dropout: Test time
• Dropout makes our output random!
47
Dropout: Test time (2)
• Want to approximate the integral: y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz
48
Dropout: Test time (3)
• At test time all neurons are always active => we must scale the activations so that, for each neuron: output at test time = expected output at training time
• At test time, multiply each activation by the probability that the neuron was kept during training
49
More common: “Inverted dropout”
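A sketch of inverted dropout in the same style: the mask is divided by p already at training time, so the test-time forward pass needs no extra scaling (weights and sizes are illustrative):

```python
import numpy as np

p = 0.5                                        # probability of keeping a unit active
W1, b1 = np.random.randn(20, 10) * 0.01, np.zeros(20)
W2, b2 = np.random.randn(5, 20) * 0.01, np.zeros(5)

def train_step(x):
    h1 = np.maximum(0, W1 @ x + b1)
    u1 = (np.random.rand(*h1.shape) < p) / p   # dropout mask, scaled by 1/p
    h1 *= u1
    return W2 @ h1 + b2

def predict(x):
    h1 = np.maximum(0, W1 @ x + b1)            # no scaling needed at test time
    return W2 @ h1 + b2
```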
Data Augmentation
52
Horizontal Flip
53
Random crops and scales
• Training: sample random crops / scales. ResNet:
– Pick a random L in the range [256, 480]
– Resize the training image so that its short side = L
– Sample a random 224 × 224 patch
• Testing: average over a fixed set of crops. ResNet:
– Resize the image at 5 scales: {224, 256, 384, 480, 640}
– For each size, use 10 crops of 224 × 224: 4 corners + center, plus horizontal flips
54
Color Jitter
Simple:
Randomize
contrast and
brightness
55
Other transformations
- Translation
- Rotation
- Stretching
- Shearing
- lens distortions
- … (go crazy)
56
Mixup
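Mixup trains on random convex combinations of pairs of examples and of their labels. A minimal numpy sketch; alpha and the array shapes are illustrative:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    # Sample a mixing coefficient from a Beta distribution and blend both
    # the inputs and the (one-hot) labels with the same coefficient.
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Illustrative usage with two fake "images" and one-hot labels.
xa, xb = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
ya, yb = np.eye(10)[3], np.eye(10)[7]
x_mix, y_mix = mixup(xa, ya, xb, yb)
```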
57
Some libraries
1. Albumentations
https://github.com/albumentations-team/albumentations
2. Imgaug
https://github.com/aleju/imgaug
3. Augmentor
https://github.com/mdbloice/Augmentor
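A small usage sketch with the first of these libraries, Albumentations; the particular transforms and probabilities are only an example pipeline, not a recommendation from the slides:

```python
import numpy as np
import albumentations as A

# Example pipeline: random crop + horizontal flip + color jitter.
transform = A.Compose([
    A.RandomCrop(height=224, width=224),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
])

image = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)  # stand-in image
augmented = transform(image=image)["image"]                   # augmented HWC array
```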
58
Choosing hyperparameters
59
Hyperparameters
• Network Architecture
• Learning rate, parameters in learning rate change
strategy, optimization algorithm
• Regularization strength (L2 weight decay, dropout rate)
60
Random Search vs Grid Search
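A sketch of random search over two hyperparameters, sampling the learning rate and the weight decay log-uniformly; train_and_eval is a placeholder for your own training + validation routine:

```python
import numpy as np

def train_and_eval(lr, weight_decay):
    # Placeholder: train a model with these hyperparameters and return
    # validation accuracy. Faked here with a random score for illustration.
    return np.random.rand()

best = (None, -1.0)
for _ in range(20):
    lr = 10 ** np.random.uniform(-5, -1)            # log-uniform in [1e-5, 1e-1]
    weight_decay = 10 ** np.random.uniform(-6, -2)  # log-uniform in [1e-6, 1e-2]
    acc = train_and_eval(lr, weight_decay)
    if acc > best[1]:
        best = ((lr, weight_decay), acc)
print("best hyperparameters:", best[0])
```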
61
Techniques for combining multiple
models (ensemble methods)
62
Model Ensembles
• Train multiple independent models
• At test time average their results
• Take average of predicted probability distributions, then
choose argmax
• Enjoy 2% extra performance
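A numpy sketch of the test-time averaging described above; probs stands for a list of per-model probability arrays of shape [num_samples, num_classes] (faked here):

```python
import numpy as np

# Fake predictions from 3 models on 4 samples over 10 classes (illustrative).
probs = [np.random.dirichlet(np.ones(10), size=4) for _ in range(3)]

avg_probs = np.mean(np.stack(probs), axis=0)  # average the probability distributions
predictions = np.argmax(avg_probs, axis=1)    # then choose the argmax class
```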
63
Model Ensembles
• Instead of training multiple models independently, you can use multiple snapshots of a single model taken during training
64
Transfer learning
65
Transfer learning
Pre-train the network on a large available dataset, then fine-tune it with your own dataset
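One common way to do this in PyTorch/torchvision, shown as an illustrative sketch rather than the exact procedure from the slides (assumes torchvision ≥ 0.13 for the weights argument): load an ImageNet-pretrained backbone, freeze it, and replace the final layer for your own classes.

```python
import torch.nn as nn
from torchvision import models

num_classes = 5                          # illustrative: number of classes in your dataset

# Load a ResNet-18 pretrained on a large dataset (ImageNet).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer and train only this part
# (later you can unfreeze some layers and fine-tune with a small learning rate).
model.fc = nn.Linear(model.fc.in_features, num_classes)
```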
66
Transfer learning
67
More tips and tricks
• Machine Learning Yearning by Andrew Ng
https://d2wvfoqc9gyqzf.cloudfront.net/content/uploads/2018/09/Ng-MLY01-13.pdf
68
References
1. http://cs231n.stanford.edu
2. Adam: https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c
3. Stanford lecture notes: http://cs231n.github.io/neural-networks-3/
69
Thank you
for your
attention!!!
70