
Lesson 5:

Training neural networks


(Part 2)
Viet-Trung Tran

1
Outline
• Optimization algorithms for neural networks
• Learning rate schedules
• Anti-overfitting techniques
• Data enrichment (data augmentation)
• Choosing hyperparameters
• Techniques for combining multiple models (ensemble
methods)
• Transfer learning

2
Optimization algorithms
for neural networks

3
Stochastic gradient descent (SGD)
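The SGD update rule on this slide appears only as an image in the original; below is a minimal runnable sketch on a toy ill-conditioned quadratic (the quadratic, learning rate, and step count are illustrative assumptions, not taken from the slides).

```python
import numpy as np

# Toy ill-conditioned quadratic, f(x) = 0.5 * x.T @ A @ x, used only for illustration.
A = np.diag([1.0, 50.0])
grad = lambda x: A @ x          # stands in for a minibatch gradient of the loss

x = np.array([1.0, 1.0])
learning_rate = 1e-2
for _ in range(100):
    dx = grad(x)
    x -= learning_rate * dx     # vanilla SGD: step along the negative gradient
```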

4
Problem #1 with SGD
• What if loss changes quickly in one direction and
slowly in another? What does gradient descent do?

The loss function has a high condition number: the ratio of the largest to the smallest singular value of the Hessian matrix is large. (The Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field; it describes the local curvature of a function of many variables.)

5
Problem #1 with SGD (2)
• What if loss changes quickly in one direction and slowly in
another? What does gradient descent do?
• Very slow progress along shallow dimension, jitter along
steep direction

The loss function has a high condition number: the ratio of the largest to the smallest singular value of the Hessian matrix is large. (The Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field; it describes the local curvature of a function of many variables.)

6
Problem #2 with SGD
• What if the loss function
has a local minimum or
saddle point?

7
Problem #2 with SGD
• What if the loss function
has a local minimum or
saddle point?
• Zero gradient, gradient
descent gets stuck
• Saddle points often appear in multivariable
objective functions

8
Problem #3 with SGD
• Our gradients come from
minibatches so they can
be noisy!

9
SGD + momentum

• Continue moving in the general direction of the previous iterations


• Build up “velocity” as a running mean of gradients
• Rho gives “friction”; typically, rho=0.9 or 0.99
• At the beginning of training, rho may be set to a lower value
(e.g., rho = 0.5) while the update direction is still unclear
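The momentum formulas on this slide are likewise only images; a minimal sketch of the velocity-based update, using the same kind of toy quadratic for illustration:

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy quadratic, for illustration only
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
vx = np.zeros_like(x)
learning_rate, rho = 1e-2, 0.9
for _ in range(100):
    dx = grad(x)
    vx = rho * vx - learning_rate * dx   # velocity: running mean of gradients with "friction" rho
    x += vx
```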

10
SGD + momentum (2)

• Alternative equivalent formulation


• You may see SGD+Momentum formulated in different ways,
but they are equivalent: they give the same sequence of x
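A sketch of this alternative formulation, where the velocity accumulates raw gradients and the learning rate is applied at the update step (toy quadratic assumed as before):

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy quadratic, for illustration only
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
vx = np.zeros_like(x)
learning_rate, rho = 1e-2, 0.9
for _ in range(100):
    vx = rho * vx + grad(x)              # accumulate gradients in the velocity
    x -= learning_rate * vx              # scale by the learning rate at the update
```

This velocity differs from the previous formulation's only by a scaling by the learning rate, which is why both produce the same sequence of x.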

11
SGD + momentum (3)

12
Nesterov Momentum

• Momentum update: combine the gradient at the current point with the velocity to get the step used to update the weights
• Nesterov momentum: “look ahead” to the point where updating using the velocity would take us; compute the gradient there and mix it with the velocity to get the actual update direction
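A sketch of the Nesterov update, evaluating the gradient at the looked-ahead point x + ρ·v (toy quadratic assumed as before):

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy quadratic, for illustration only
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
vx = np.zeros_like(x)
learning_rate, rho = 1e-2, 0.9
for _ in range(100):
    dx_ahead = grad(x + rho * vx)        # gradient at the "looked-ahead" point
    vx = rho * vx - learning_rate * dx_ahead
    x += vx
```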
13
Nesterov Momentum (2)

• Annoying: usually we want the update in terms of x_t and ∇f(x_t)
• Let x̃_t = x_t + ρ·v_t and rearrange:
  v_{t+1} = ρ·v_t − α·∇f(x̃_t)
  x̃_{t+1} = x̃_t + v_{t+1} + ρ·(v_{t+1} − v_t)

14
Nesterov Momentum (3)

15
AdaGrad

• Added element-wise scaling of the gradient based on the historical sum of squares in each dimension
• “Per-parameter learning rates” or “adaptive learning rates”
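A minimal sketch of AdaGrad's per-parameter scaling; the 1e-7 term, toy quadratic, and step count are illustrative assumptions:

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy quadratic, for illustration only
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate = 1e-2
for _ in range(100):
    dx = grad(x)
    grad_squared += dx * dx              # historical sum of squared gradients, per dimension
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```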

16
AdaGrad

• Q1: What happens with AdaGrad?

17
AdaGrad

• Q1: What happens with AdaGrad?


• Progress along “steep” directions is damped
• Progress along “flat” directions is accelerated
18
AdaGrad

• Q2: What happens to the step size over a long time?

19
AdaGrad

• Q2: What happens to the step size over a long time?


• The step size decays to zero
• The effective learning rate gets monotonically smaller
20
RMSProp: “Leaky AdaGrad”

• RMSProp uses a moving average of squared gradients
• Recommended decay_rate values: 0.9, 0.99, or 0.999
• Unlike AdaGrad, the updates do not get monotonically smaller
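A minimal RMSProp sketch: the only change from AdaGrad is the leaky (moving-average) accumulation of squared gradients (toy quadratic and constants are illustrative assumptions):

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy quadratic, for illustration only
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate, decay_rate = 1e-2, 0.99
for _ in range(100):
    dx = grad(x)
    # "Leaky" accumulation: a moving average instead of AdaGrad's ever-growing sum.
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```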

21
RMSProp

22
Adam (almost)

Sort of like RMSProp with momentum

Q: What happens at first timestep?


beta1 = 0.9, beta2 = 0.999
First and second moment start at zero

23
Adam (full form)

• Bias correction for the fact that the first and second moment estimates start at zero; it makes the algorithm more stable during the first few (warm-up) steps.
• Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a good default for many models!
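A sketch of the full Adam update with bias correction, using the default beta1/beta2/learning rate mentioned above (toy quadratic assumed for illustration):

```python
import numpy as np

A = np.diag([1.0, 50.0])                   # toy quadratic, for illustration only
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
m, v = np.zeros_like(x), np.zeros_like(x)
learning_rate, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
for t in range(1, 101):                    # t starts at 1 for the bias correction
    dx = grad(x)
    m = beta1 * m + (1 - beta1) * dx       # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * dx * dx  # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)           # bias correction: moments start at zero
    v_hat = v / (1 - beta2 ** t)
    x -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
```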

24
Visual examples of the learning process

(c) Alec Radford.

25
First-order optimization

26
Second-order optimization
• Using the Hessian, which is a square matrix of second-
order partial derivatives of the function.

Computing (and inverting) the Hessian in its explicit form is a very costly process in both space and time.

27
Second-order optimization
• Second-order Taylor expansion around θ₀:
  J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ ∇J(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀)
• Solving for the critical point, we obtain the Newton parameter update:
  θ* = θ₀ − H⁻¹ ∇J(θ₀)

• Not practical for deep learning (due to the O(N^3) complexity of matrix inversion):
  • The Hessian has O(N^2) elements
  • Inverting it takes O(N^3)
  • N = (tens or hundreds of) millions of parameters
• Quasi-Newton methods (BFGS): instead of inverting the Hessian (O(N^3)), approximate the inverse Hessian with rank-1 updates over time (O(N^2) each)

28
L-BFGS (Limited memory BFGS)
• Does not form/store the full inverse Hessian
• Usually works very well in full-batch, deterministic mode, i.e., if you have a single deterministic f(x), then L-BFGS will probably work very nicely
• Does not transfer very well to the mini-batch setting and tends to give bad results; adapting second-order methods to the large-scale, stochastic setting is an active area of research
SOTA optimizers
• NAdam = Adam + Nesterov's Accelerated
Gradient (NAG)
• RAdam (Rectified Adam)
• LookAhead
• Ranger = RAdam + LookAhead

30
In practice
• Adam is a good default choice in many cases;
it often works OK even with a constant learning rate
• SGD+Momentum can outperform Adam but
may require more tuning of LR and schedule
• Try a cosine schedule: it has very few hyperparameters!
• If you can afford to do full batch updates, then
try out L-BFGS (and don’t forget to disable all
sources of noise)

31
Learning rate schedules

32
Learning rate
• SGD, SGD+Momentum, Adagrad, RMSProp, Adam all
have the learning rate as a hyperparameter.
• Q: Which one of these learning rates is best to use?
• The learning rate usually starts with a large value and decreases over time

33
Learning rate decays over time
• Step: Reduce
learning rate at a
few fixed points.
• E.g., for ResNets,
multiply LR by 0.1
after epochs 30,
60, and 90.

34
Cosine rate decay

35
Linear rate decay

36
Inverse sqrt of total number of epochs
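The decay curves on these slides are shown as figures; as a consolidated sketch, the four schedules can be written as small functions (alpha0 = initial learning rate, T = total epochs, t = current epoch; the step milestones follow the ResNet example above):

```python
import math

def step_decay(alpha0, t, milestones=(30, 60, 90), gamma=0.1):
    # Multiply the LR by gamma at each milestone epoch (ResNet-style step schedule).
    return alpha0 * gamma ** sum(t >= m for m in milestones)

def cosine_decay(alpha0, t, T):
    # Smoothly anneal the LR from alpha0 down to 0 over T epochs.
    return 0.5 * alpha0 * (1 + math.cos(math.pi * t / T))

def linear_decay(alpha0, t, T):
    return alpha0 * (1 - t / T)

def inv_sqrt_decay(alpha0, t):
    return alpha0 / math.sqrt(t + 1)
```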

37
Linear Warmup
• High initial learning rates
can make loss explode;
linearly increasing
learning rate from 0 over
the first ~5000 iterations
can prevent this
• Empirical rule of thumb: If
you increase the batch
size by N, also scale the
initial learning rate by N
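A sketch combining linear warmup with a cosine decay; the 5000-iteration warmup follows the rule of thumb above, while the total iteration count is an illustrative assumption:

```python
import math

def warmup_cosine_lr(alpha0, it, warmup_iters=5000, total_iters=100_000):
    # Linearly ramp the LR from 0 over the first warmup_iters, then cosine-decay it.
    if it < warmup_iters:
        return alpha0 * it / warmup_iters
    progress = (it - warmup_iters) / (total_iters - warmup_iters)
    return 0.5 * alpha0 * (1 + math.cos(math.pi * progress))
```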
Anti-overfitting techniques

39
Beyond Training Error
Early Stopping: Always do this
• Stop training the model when accuracy on the validation set starts to decrease, or train for a long time but always keep track of the model snapshot that worked best on the validation set (see the sketch below)
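A sketch of the keep-the-best-snapshot variant; `train_one_epoch`, `evaluate`, and `val_data` are hypothetical stand-ins for your own training and validation code:

```python
import copy

def train_with_early_stopping(model, num_epochs, train_one_epoch, evaluate, val_data):
    # Keep whichever model snapshot scores best on the validation set.
    best_val_acc, best_model = 0.0, None
    for epoch in range(num_epochs):
        train_one_epoch(model)                 # one pass over the training set
        val_acc = evaluate(model, val_data)    # accuracy on the validation set
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_model = copy.deepcopy(model)
    return best_model, best_val_acc
```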

41
Regularization: Add term to loss

L = (1/N) Σᵢ Lᵢ + λ·R(W), where λ is the regularization strength (a hyperparameter)

Common regularization terms:
• L2 regularization: R(W) = Σ W² (weight decay)
• L1 regularization: R(W) = Σ |W|
• Elastic net (L1 + L2): R(W) = Σ (β·W² + |W|)
42
Regularization: Dropout
• In each forward pass, randomly set some neurons to
zero
• Probability of dropping is a hyperparameter; 0.5 is
common

43
Regularization: Dropout
• Example forward pass with a 3-layer network using
dropout
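The example code on this slide is an image in the original; a sketch in the same spirit, where the weights W1–W3, biases b1–b3, and ReLU nonlinearity are assumptions:

```python
import numpy as np

p = 0.5  # probability of keeping a unit active (so the drop rate is 1 - p)

def train_forward(X, W1, b1, W2, b2, W3, b3):
    # Forward pass of a 3-layer net with dropout applied after each hidden layer.
    H1 = np.maximum(0, X @ W1 + b1)
    U1 = np.random.rand(*H1.shape) < p      # first dropout mask
    H1 = H1 * U1
    H2 = np.maximum(0, H1 @ W2 + b2)
    U2 = np.random.rand(*H2.shape) < p      # second dropout mask
    H2 = H2 * U2
    return H2 @ W3 + b3                     # no dropout on the output layer
```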

44
Dropout effects
• Forces the network to have a redundant
representation; prevents co-adaptation of features

45
Dropout effects (2)
• Dropout is training a large ensemble of models (that
share parameters).
• Each binary mask is one model
• An FC layer with 4096 units has 2^4096 ≈ 10^1233 possible masks!
• … but there are only about 10^82 atoms in the universe!

46
Dropout: Test time
• Dropout makes our output random!

• Want to “average out” the randomness at test time:
  y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz
• But this integral seems hard to evaluate …

47
Dropout: Test time (2)
• Want to approximate the integral E_z[f(x, z)]
• Consider a single neuron with inputs x and y: a = w₁x + w₂y
• At test time we have: E[a] = w₁x + w₂y
• During training (dropping each input with probability 1/2) we have:
  E[a] = ¼(w₁x + w₂y) + ¼(w₁x + 0·y) + ¼(0·x + w₂y) + ¼(0·x + 0·y) = ½(w₁x + w₂y)

48
Dropout: Test time (3)
• At test time all neurons are always active => we must scale the activations so that, for each neuron:
  output at test time = expected output at training time
• At test time, multiply the activations by the keep probability (1/2 in the example above)

49
More common: “Inverted dropout”
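A sketch of inverted dropout: the division by the keep probability is done at training time, so the test-time forward pass needs no change:

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def dropout_train(H):
    # Drop units and rescale by 1/p at training time ("inverted" dropout).
    U = (np.random.rand(*H.shape) < p) / p
    return H * U

def dropout_test(H):
    # Nothing to do at test time: expected activations already match training.
    return H
```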
Data Augmentation

52
Horizontal Flip

53
Random crops and scales
• Training: sample random crops / scales. ResNet:
  • Pick a random L in the range [256, 480]
  • Resize the training image so its short side = L
  • Sample a random 224 x 224 patch
• Testing: average over a fixed set of crops. ResNet:
  • Resize the image at 5 scales: {224, 256, 384, 480, 640}
  • For each size, use 10 crops of 224 x 224: 4 corners + center, plus horizontal flips

54
Color Jitter

Simple: randomize contrast and brightness
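A minimal NumPy sketch of the augmentations above (horizontal flip, random crop, brightness/contrast jitter); the jitter ranges are illustrative assumptions, and the input is assumed to be an H x W x 3 uint8 image no smaller than the crop size:

```python
import numpy as np

def augment(img, crop=224):
    # Random horizontal flip.
    if np.random.rand() < 0.5:
        img = img[:, ::-1]
    # Random crop of size crop x crop.
    H, W, _ = img.shape
    y = np.random.randint(0, H - crop + 1)
    x = np.random.randint(0, W - crop + 1)
    img = img[y:y + crop, x:x + crop]
    # Simple color jitter: random contrast (alpha) and brightness (beta).
    alpha = np.random.uniform(0.8, 1.2)
    beta = np.random.uniform(-20, 20)
    return np.clip(alpha * img.astype(np.float32) + beta, 0, 255).astype(np.uint8)
```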

55
Other transformations
- Translation
- Rotation
- Stretching
- Shearing
- Lens distortions
- … (go crazy)

56
Mixup
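Mixup trains on convex combinations of pairs of examples and their labels; a minimal sketch (one-hot labels and a Beta(0.2, 0.2) mixing distribution are the usual choices, assumed here):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    # Blend two examples and their one-hot labels with a Beta-distributed weight.
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y
```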

57
Some libraries
1. Albumentations
https://github.com/albumentations-team/albumentations
2. Imgaug
https://github.com/aleju/imgaug
3. Augmentor
https://github.com/mdbloice/Augmentor

58
Choosing hyperparameters

59
Hyperparameters
• Network architecture
• Learning rate, learning rate schedule parameters, and the choice of optimization algorithm
• Regularization strength (L2 weight decay, dropout rate)

60
Random Search vs Grid Search
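A sketch of random search over two hyperparameters sampled on a log scale; `train_and_eval` is a hypothetical helper that trains a model and returns its validation accuracy, and the search ranges are illustrative assumptions:

```python
import numpy as np

def random_search(train_and_eval, num_trials=20):
    # Sample learning rate and weight decay log-uniformly; keep the best trial.
    best_config, best_acc = None, -1.0
    for _ in range(num_trials):
        lr = 10 ** np.random.uniform(-5, -1)
        weight_decay = 10 ** np.random.uniform(-6, -2)
        acc = train_and_eval(lr=lr, weight_decay=weight_decay)
        if acc > best_acc:
            best_config, best_acc = (lr, weight_decay), acc
    return best_config, best_acc
```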

61
Techniques for combining multiple
models (ensemble methods)

62
Model Ensembles
• Train multiple independent models
• At test time average their results
• Take average of predicted probability distributions, then
choose argmax
• Enjoy 2% extra performance
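A sketch of test-time averaging; `predict_proba` is a hypothetical per-model method returning class probability distributions:

```python
import numpy as np

def ensemble_predict(models, x):
    # Average the predicted probability distributions, then take the argmax class.
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return np.argmax(probs, axis=-1)
```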

63
Model Ensembles
• Instead of training multiple models independently, you can use multiple snapshots of the same model taken during training

64
Transfer learning

65
Transfer learning
Pre-train the network on a large available dataset, then fine-tune it on your own (typically smaller) dataset
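A sketch using PyTorch/torchvision (a framework choice assumed here, not specified in the slides): load ImageNet-pretrained weights, freeze the backbone, and replace the final layer for a hypothetical 10-class target dataset.

```python
import torch.nn as nn
from torchvision import models

# Start from a network trained on a large dataset (ImageNet), then adapt it to your own data.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                      # freeze the pretrained features
model.fc = nn.Linear(model.fc.in_features, 10)       # new trainable head for 10 classes
# Fine-tune: train only model.fc on your dataset (or unfreeze more layers if you have enough data).
```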

66
Transfer learning

67
More tips and tricks
• Machine Learning Yearning by Andrew Ng
https://d2wvfoqc9gyqzf.cloudfront.net/content/uploads/2018/09/Ng-MLY01-13.pdf

68
References
1. http://cs231n.stanford.edu
2. Adam:
https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c
3. Stanford lecture note:
http://cs231n.github.io/neural-networks-3/

69
Thank you
for your
attention!!!

70
