
Lesson 5:

Training neural networks


(Part 2)
Viet-Trung Tran

1
Outline
• Optimization algorithms for neural networks
• Learning rate schedules
• Anti-overfitting techniques
• Data enrichment (data augmentation)
• Choosing hyperparameters
• Techniques for combining multiple models (ensemble
methods)
• Transfer learning

2
Optimization algorithms
for neural networks

3
Stochastic gradient descent (SGD)
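The SGD update rule on this slide appears only as an image in the original; below is a minimal runnable sketch on a toy ill-conditioned quadratic (the quadratic, learning rate, and step count are illustrative assumptions, not taken from the slides).

```python
import numpy as np

# Toy ill-conditioned quadratic, f(x) = 0.5 * x.T @ A @ x, used only for illustration.
A = np.diag([1.0, 50.0])
grad = lambda x: A @ x          # stands in for a minibatch gradient of the loss

x = np.array([1.0, 1.0])
learning_rate = 1e-2
for _ in range(100):
    dx = grad(x)
    x -= learning_rate * dx     # vanilla SGD: step along the negative gradient
```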

4
Problem #1 with SGD
• What if loss changes quickly in one direction and
slowly in another? What does gradient descent do?

The loss function has a high condition number: the ratio of the largest to the smallest singular value of the Hessian matrix is large. (The Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field; it describes the local curvature of a function of many variables.)

5
Problem #1 with SGD (2)
• What if loss changes quickly in one direction and slowly in
another? What does gradient descent do?
• Very slow progress along shallow dimension, jitter along
steep direction

The loss function has a high condition number: the ratio of the largest to the smallest singular value of the Hessian matrix is large. (The Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field; it describes the local curvature of a function of many variables.)

6
Problem #2 with SGD
• What if the loss function
has a local minimum or
saddle point?

7
Problem #2 with SGD
• What if the loss function
has a local minimum or
saddle point?
• Zero gradient, gradient
descent gets stuck
• Saddle points often appear in multivariable
objective functions

8
Problem #3 with SGD
• Our gradients come from
minibatches so they can
be noisy!

9
SGD + momentum

• Continue moving in the general direction of the previous iterations


• Build up “velocity” as a running mean of gradients
• Rho gives “friction”; typically, rho=0.9 or 0.99
• At the beginning of training, rho may be set to a lower value
(e.g., rho = 0.5) while the update direction is still unclear
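The momentum formulas on this slide are likewise only images; a minimal sketch of the velocity-based update, using the same kind of toy quadratic for illustration:

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy quadratic, for illustration only
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
vx = np.zeros_like(x)
learning_rate, rho = 1e-2, 0.9
for _ in range(100):
    dx = grad(x)
    vx = rho * vx - learning_rate * dx   # velocity: running mean of gradients with "friction" rho
    x += vx
```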

10
SGD + momentum (2)

• Alternative equivalent formulation


• You may see SGD+Momentum formulated in different ways,
but they are equivalent: they give the same sequence of x
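A sketch of this alternative formulation, where the velocity accumulates raw gradients and the learning rate is applied at the update step (toy quadratic assumed as before):

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy quadratic, for illustration only
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
vx = np.zeros_like(x)
learning_rate, rho = 1e-2, 0.9
for _ in range(100):
    vx = rho * vx + grad(x)              # accumulate gradients in the velocity
    x -= learning_rate * vx              # scale by the learning rate at the update
```

This velocity differs from the previous formulation's only by a scaling by the learning rate, which is why both produce the same sequence of x.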

11
SGD + momentum (3)

12
Nesterov Momentum

• Momentum update: combine the gradient at the current point with the velocity to get the step used to update the weights
• Nesterov momentum: “look ahead” to the point where updating using the velocity would take us; compute the gradient there and mix it with the velocity to get the actual update direction
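A sketch of the Nesterov update, evaluating the gradient at the looked-ahead point x + ρ·v (toy quadratic assumed as before):

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy quadratic, for illustration only
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
vx = np.zeros_like(x)
learning_rate, rho = 1e-2, 0.9
for _ in range(100):
    dx_ahead = grad(x + rho * vx)        # gradient at the "looked-ahead" point
    vx = rho * vx - learning_rate * dx_ahead
    x += vx
```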
13
Nesterov Momentum (2)

• Annoying: usually we want the update in terms of x_t and ∇f(x_t)
• Let x̃_t = x_t + ρ·v_t and rearrange:
  v_{t+1} = ρ·v_t − α·∇f(x̃_t)
  x̃_{t+1} = x̃_t + v_{t+1} + ρ·(v_{t+1} − v_t)

14
Nesterov Momentum (3)

15
AdaGrad

• Added element-wise scaling of the gradient based on the historical sum of squares in each dimension
• “Per-parameter learning rates” or “adaptive learning rates”
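A minimal sketch of AdaGrad's per-parameter scaling; the 1e-7 term, toy quadratic, and step count are illustrative assumptions:

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy quadratic, for illustration only
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate = 1e-2
for _ in range(100):
    dx = grad(x)
    grad_squared += dx * dx              # historical sum of squared gradients, per dimension
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```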

16
AdaGrad

• Q1: What happens with AdaGrad?

17
AdaGrad

• Q1: What happens with AdaGrad?


• Progress along “steep” directions is damped
• Progress along “flat” directions is accelerated
18
AdaGrad

• Q2: What happens to the step size over a long time?

19
AdaGrad

• Q2: What happens to the step size over a long time?


• The step size decays to zero
• The effective learning rate gets monotonically smaller
20
RMSProp: “Leaky AdaGrad”

• RMSProp uses a moving average of squared gradients
• Recommended decay_rate values: 0.9, 0.99, or 0.999
• Unlike AdaGrad, the updates do not get monotonically smaller
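A minimal RMSProp sketch: the only change from AdaGrad is the leaky (moving-average) accumulation of squared gradients (toy quadratic and constants are illustrative assumptions):

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy quadratic, for illustration only
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
grad_squared = np.zeros_like(x)
learning_rate, decay_rate = 1e-2, 0.99
for _ in range(100):
    dx = grad(x)
    # "Leaky" accumulation: a moving average instead of AdaGrad's ever-growing sum.
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
```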

21
RMSProp

22
Adam (almost)

Sort of like RMSProp with momentum

Q: What happens at first timestep?


beta1 = 0.9, beta2 = 0.999
First and second moment start at zero

23
Adam (full form)

• Bias correction for the fact that the first and second moment estimates start at zero; it makes the algorithm more stable during the first few (warm-up) steps.
• Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a good default for many models!
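A sketch of the full Adam update with bias correction, using the default beta1/beta2/learning rate mentioned above (toy quadratic assumed for illustration):

```python
import numpy as np

A = np.diag([1.0, 50.0])                   # toy quadratic, for illustration only
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
m, v = np.zeros_like(x), np.zeros_like(x)
learning_rate, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
for t in range(1, 101):                    # t starts at 1 for the bias correction
    dx = grad(x)
    m = beta1 * m + (1 - beta1) * dx       # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * dx * dx  # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)           # bias correction: moments start at zero
    v_hat = v / (1 - beta2 ** t)
    x -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
```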

24
Visual examples of the learning process

(c) Alec Radford.

25
First-order optimization

26
Second-order optimization
• Using the Hessian, which is a square matrix of second-
order partial derivatives of the function.

Computing (and inverting) the Hessian in its explicit form is a very costly process in both space and time.

27
Second-order optimization
• Second-order Taylor expansion around θ₀:
  J(θ) ≈ J(θ₀) + (θ − θ₀)ᵀ ∇J(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀)
• Solving for the critical point, we obtain the Newton parameter update:
  θ* = θ₀ − H⁻¹ ∇J(θ₀)

• Not practical for deep learning (due to the O(N^3) complexity of matrix inversion):
  • The Hessian has O(N^2) elements
  • Inverting it takes O(N^3)
  • N = (tens or hundreds of) millions of parameters
• Quasi-Newton methods (BFGS): instead of inverting the Hessian (O(N^3)), approximate the inverse Hessian with rank-1 updates over time (O(N^2) each)

28
L-BFGS (Limited memory BFGS)
• Does not form/store the full inverse Hessian
• Usually works very well in full-batch, deterministic mode, i.e., if you have a single deterministic f(x), then L-BFGS will probably work very nicely
• Does not transfer very well to the mini-batch setting and tends to give bad results; adapting second-order methods to the large-scale, stochastic setting is an active area of research
SOTA optimizers
• NAdam = Adam + Nesterov's Accelerated
Gradient (NAG)
• RAdam (Rectified Adam)
• LookAhead
• Ranger = RAdam + LookAhead

30
In practice
• Adam is a good default choice in many cases;
it often works OK even with a constant learning rate
• SGD+Momentum can outperform Adam but
may require more tuning of LR and schedule
• Try a cosine schedule: it has very few hyperparameters!
• If you can afford to do full batch updates, then
try out L-BFGS (and don’t forget to disable all
sources of noise)

31
Learning rate schedules

32
Learning rate
• SGD, SGD+Momentum, Adagrad, RMSProp, Adam all
have the learning rate as a hyperparameter.
• Q: Which one of these learning rates is best to use?
• The learning rate usually starts with a large value and decreases over time

33
Learning rate decays over time
• Step: Reduce
learning rate at a
few fixed points.
• E.g., for ResNets,
multiply LR by 0.1
after epochs 30,
60, and 90.

34
Cosine rate decay

35
Linear rate decay

36
Inverse sqrt of total number of epochs
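The decay curves on these slides are shown as figures; as a consolidated sketch, the four schedules can be written as small functions (alpha0 = initial learning rate, T = total epochs, t = current epoch; the step milestones follow the ResNet example above):

```python
import math

def step_decay(alpha0, t, milestones=(30, 60, 90), gamma=0.1):
    # Multiply the LR by gamma at each milestone epoch (ResNet-style step schedule).
    return alpha0 * gamma ** sum(t >= m for m in milestones)

def cosine_decay(alpha0, t, T):
    # Smoothly anneal the LR from alpha0 down to 0 over T epochs.
    return 0.5 * alpha0 * (1 + math.cos(math.pi * t / T))

def linear_decay(alpha0, t, T):
    return alpha0 * (1 - t / T)

def inv_sqrt_decay(alpha0, t):
    return alpha0 / math.sqrt(t + 1)
```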

37
Linear Warmup
• High initial learning rates
can make loss explode;
linearly increasing
learning rate from 0 over
the first ~5000 iterations
can prevent this
• Empirical rule of thumb: If
you increase the batch
size by N, also scale the
initial learning rate by N
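A sketch combining linear warmup with a cosine decay; the 5000-iteration warmup follows the rule of thumb above, while the total iteration count is an illustrative assumption:

```python
import math

def warmup_cosine_lr(alpha0, it, warmup_iters=5000, total_iters=100_000):
    # Linearly ramp the LR from 0 over the first warmup_iters, then cosine-decay it.
    if it < warmup_iters:
        return alpha0 * it / warmup_iters
    progress = (it - warmup_iters) / (total_iters - warmup_iters)
    return 0.5 * alpha0 * (1 + math.cos(math.pi * progress))
```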
Anti-overfitting techniques

39
Beyond Training Error
Early Stopping: Always do this
• Stop training the model when accuracy on the validation set starts to decrease, or train for a long time but always keep track of the model snapshot that worked best on the validation set (see the sketch below)
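A sketch of the keep-the-best-snapshot variant; `train_one_epoch`, `evaluate`, and `val_data` are hypothetical stand-ins for your own training and validation code:

```python
import copy

def train_with_early_stopping(model, num_epochs, train_one_epoch, evaluate, val_data):
    # Keep whichever model snapshot scores best on the validation set.
    best_val_acc, best_model = 0.0, None
    for epoch in range(num_epochs):
        train_one_epoch(model)                 # one pass over the training set
        val_acc = evaluate(model, val_data)    # accuracy on the validation set
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_model = copy.deepcopy(model)
    return best_model, best_val_acc
```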

41
Regularization: Add term to loss

L = (1/N) Σᵢ Lᵢ + λ·R(W), where λ is the regularization strength (a hyperparameter)

Common regularization terms:
• L2 regularization: R(W) = Σ W² (weight decay)
• L1 regularization: R(W) = Σ |W|
• Elastic net (L1 + L2): R(W) = Σ (β·W² + |W|)
42
Regularization: Dropout
• In each forward pass, randomly set some neurons to
zero
• Probability of dropping is a hyperparameter; 0.5 is
common

43
Regularization: Dropout
• Example forward pass with a 3-layer network using
dropout
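The example code on this slide is an image in the original; a sketch in the same spirit, where the weights W1–W3, biases b1–b3, and ReLU nonlinearity are assumptions:

```python
import numpy as np

p = 0.5  # probability of keeping a unit active (so the drop rate is 1 - p)

def train_forward(X, W1, b1, W2, b2, W3, b3):
    # Forward pass of a 3-layer net with dropout applied after each hidden layer.
    H1 = np.maximum(0, X @ W1 + b1)
    U1 = np.random.rand(*H1.shape) < p      # first dropout mask
    H1 = H1 * U1
    H2 = np.maximum(0, H1 @ W2 + b2)
    U2 = np.random.rand(*H2.shape) < p      # second dropout mask
    H2 = H2 * U2
    return H2 @ W3 + b3                     # no dropout on the output layer
```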

44
Dropout effects
• Forces the network to have a redundant
representation; prevents co-adaptation of features

45
Dropout effects (2)
• Dropout is training a large ensemble of models (that
share parameters).
• Each binary mask is one model
• An FC layer with 4096 units has 2^4096 ≈ 10^1233 possible masks!
• … but there are only about 10^82 atoms in the universe!

46
Dropout: Test time
• Dropout makes our output random!

• Want to “average out” the randomness at test time:
  y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz
• But this integral seems hard to evaluate …

47
Dropout: Test time (2)
• Want to approximate the integral E_z[f(x, z)]
• Consider a single neuron with inputs x and y: a = w₁x + w₂y
• At test time we have: E[a] = w₁x + w₂y
• During training (dropping each input with probability 1/2) we have:
  E[a] = ¼(w₁x + w₂y) + ¼(w₁x + 0·y) + ¼(0·x + w₂y) + ¼(0·x + 0·y) = ½(w₁x + w₂y)

48
Dropout: Test time (3)
• At test time all neurons are always active => we must scale the activations so that, for each neuron:
  output at test time = expected output at training time
• At test time, multiply the activations by the keep probability (1/2 in the example above)

49
More common: “Inverted dropout”
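A sketch of inverted dropout: the division by the keep probability is done at training time, so the test-time forward pass needs no change:

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def dropout_train(H):
    # Drop units and rescale by 1/p at training time ("inverted" dropout).
    U = (np.random.rand(*H.shape) < p) / p
    return H * U

def dropout_test(H):
    # Nothing to do at test time: expected activations already match training.
    return H
```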
Data Augmentation

52
Horizontal Flip

53
Random crops and scales
• Training: sample random crops / scales. ResNet:
  • Pick a random L in the range [256, 480]
  • Resize the training image so its short side = L
  • Sample a random 224 x 224 patch
• Testing: average over a fixed set of crops. ResNet:
  • Resize the image at 5 scales: {224, 256, 384, 480, 640}
  • For each size, use 10 crops of 224 x 224: 4 corners + center, plus horizontal flips

54
Color Jitter

Simple: randomize contrast and brightness
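A minimal NumPy sketch of the augmentations above (horizontal flip, random crop, brightness/contrast jitter); the jitter ranges are illustrative assumptions, and the input is assumed to be an H x W x 3 uint8 image no smaller than the crop size:

```python
import numpy as np

def augment(img, crop=224):
    # Random horizontal flip.
    if np.random.rand() < 0.5:
        img = img[:, ::-1]
    # Random crop of size crop x crop.
    H, W, _ = img.shape
    y = np.random.randint(0, H - crop + 1)
    x = np.random.randint(0, W - crop + 1)
    img = img[y:y + crop, x:x + crop]
    # Simple color jitter: random contrast (alpha) and brightness (beta).
    alpha = np.random.uniform(0.8, 1.2)
    beta = np.random.uniform(-20, 20)
    return np.clip(alpha * img.astype(np.float32) + beta, 0, 255).astype(np.uint8)
```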

55
Other transformations
- Translation
- Rotation
- Stretching
- Shearing
- Lens distortions
- … (go crazy)

56
Mixup
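Mixup trains on convex combinations of pairs of examples and their labels; a minimal sketch (one-hot labels and a Beta(0.2, 0.2) mixing distribution are the usual choices, assumed here):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    # Blend two examples and their one-hot labels with a Beta-distributed weight.
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y
```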

57
Some libraries
1. Albumentations
https://github.com/albumentations-team/albumentations
2. Imgaug
https://github.com/aleju/imgaug
3. Augmentor
https://github.com/mdbloice/Augmentor

58
Choosing hyperparameters

59
Hyperparameters
• Network architecture
• Learning rate, learning rate schedule parameters, and the choice of optimization algorithm
• Regularization strength (L2 weight decay, dropout rate)

60
Random Search vs Grid Search
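A sketch of random search over two hyperparameters sampled on a log scale; `train_and_eval` is a hypothetical helper that trains a model and returns its validation accuracy, and the search ranges are illustrative assumptions:

```python
import numpy as np

def random_search(train_and_eval, num_trials=20):
    # Sample learning rate and weight decay log-uniformly; keep the best trial.
    best_config, best_acc = None, -1.0
    for _ in range(num_trials):
        lr = 10 ** np.random.uniform(-5, -1)
        weight_decay = 10 ** np.random.uniform(-6, -2)
        acc = train_and_eval(lr=lr, weight_decay=weight_decay)
        if acc > best_acc:
            best_config, best_acc = (lr, weight_decay), acc
    return best_config, best_acc
```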

61
Techniques for combining multiple
models (ensemble methods)

62
Model Ensembles
• Train multiple independent models
• At test time average their results
• Take average of predicted probability distributions, then
choose argmax
• Enjoy 2% extra performance
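A sketch of test-time averaging; `predict_proba` is a hypothetical per-model method returning class probability distributions:

```python
import numpy as np

def ensemble_predict(models, x):
    # Average the predicted probability distributions, then take the argmax class.
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return np.argmax(probs, axis=-1)
```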

63
Model Ensembles
• Instead of training multiple models independently, you can use multiple snapshots of the same model taken during training

64
Transfer learning

65
Transfer learning
Pre-train the network on a large available dataset, then fine-tune it on your own (typically smaller) dataset
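A sketch using PyTorch/torchvision (a framework choice assumed here, not specified in the slides): load ImageNet-pretrained weights, freeze the backbone, and replace the final layer for a hypothetical 10-class target dataset.

```python
import torch.nn as nn
from torchvision import models

# Start from a network trained on a large dataset (ImageNet), then adapt it to your own data.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                      # freeze the pretrained features
model.fc = nn.Linear(model.fc.in_features, 10)       # new trainable head for 10 classes
# Fine-tune: train only model.fc on your dataset (or unfreeze more layers if you have enough data).
```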

66
Transfer learning

67
More tips and tricks
• Machine Learning Yearning by Andrew Ng
https://d2wvfoqc9gyqzf.cloudfront.net/content/uploads/2018/09/Ng-MLY01-13.pdf

68
References
1. http://cs231n.stanford.edu
2. Adam:
https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c
3. Stanford lecture note:
http://cs231n.github.io/neural-networks-3/

69
Thank you
for your
attention!!!

70
