Training Neural Networks

The document provides an overview of training neural networks, emphasizing the importance of hyperparameters such as network architecture, learning rate, and batch size. It discusses optimization techniques like Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (ADAM), as well as strategies for hyperparameter tuning and model evaluation. Additionally, it highlights the significance of learning curves and generalization in assessing model performance.

Training a neural network

Hyperparameters help achieve an optimised model

• Network architecture — width, depth


• Number of epochs
• Weight Initialisation
• Learning rate
• Regularization
• Batch size
• Momentum
Basic Recipe for training a NN

• You have a dataset with labels
• Take a network f and learn its parameters w, b — what is meant by learning?
• Learn the parameters θ that generalise — empirical risk minimisation
• The empirical risk captures how often fθ(x) gets the approximation wrong
• Empirical Risk = (1/n) Σ_{i=1}^{n} ℓ(x_i, y_i, θ), over samples drawn from the data distribution D
Stochastic Gradient Descent (SGD) is a way to achieve optimization over large
dataset

• If we have n training samples, we need to compute the gradient for all of them
which is O(n)

• We can express the total loss over training data as the expectation of all the
samples — empirical risk minimisation

• Choose a subset of the training set of size m ≪ n — mini-batches


b_i = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)}
{b_1, b_2, …, b_{n/m}}
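
A minimal sketch, assuming NumPy and a toy linear model with squared loss (all names here are illustrative), of how the empirical risk over n samples is approximated with mini-batches of size m ≪ n and optimised with SGD:

import numpy as np

rng = np.random.default_rng(0)
n, d, m, lr = 1000, 5, 32, 0.1                 # n samples, d features, batch size m, learning rate
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.1 * rng.normal(size=n)

w = np.zeros(d)                                 # parameters to learn
for epoch in range(10):
    perm = rng.permutation(n)                   # shuffle, then split into n/m mini-batches b_1 ... b_{n/m}
    for start in range(0, n, m):
        idx = perm[start:start + m]
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # gradient of the mini-batch loss, O(m) not O(n)
        w -= lr * grad                          # SGD update
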
How to select optimal minibatch size

• Minibatch size is a hyperparameter


• Typically in powers of 2 — 4, 8, 16, 32, 64, 128, …
• Smaller batch size — frequent and noisy updates
• Mini-batch of size m — only one backpropagation pass per parameter update
• Optimal batch size — depends on GPU memory (backpropagation)
• Any relation of batch size with output size of the network?

Iterations versus Epochs?


Idea: Using a batch size proportional to the number of classes is useful

[Figure: training curves compared for batch size 64 versus batch size 1024]
Problems with SGD & mini-batches

• Gradients are scaled equally across all dimensions — oscillations


• We want to scale differently across dimensions — use momentum with the gradients. How?
Using Gradient Descent with Momentum

• Exponentially weighted average of gradients — β


• w_ij(t + 1) = w_ij(t) − δw_ij(t); what is δw_ij(t)?
• δw_ij(t) = α · δw_ij(t − 1) + η · ∂ℓ/∂w_ij(t)
Using Gradient Descent with Momentum

• Exponentially weighted average of gradients

• Momentum helps dampen the oscillations in the path of the optimiser towards the minimum

Algorithm — on each iteration t of SGD, for each mini-batch:
• Compute the gradients dw, db as usual
• Compute the exponentially weighted moving averages V_dw and V_db
• Update w and b using V_dw and V_db instead of dw and db:
  w := w − α ∗ V_dw
  b := b − α ∗ V_db

Do you know the commonly used value of momentum?
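
A minimal sketch, assuming NumPy and a caller-supplied gradient (names are illustrative), of the momentum update above: keep an exponentially weighted moving average V_dw of the gradients and step with it; β = 0.9 is the commonly used value of momentum.

import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """V_dw := beta * V_dw + (1 - beta) * dw;  w := w - lr * V_dw."""
    v = beta * v + (1 - beta) * grad
    w = w - lr * v
    return w, v

# usage on a toy quadratic loss sum((w - 0.5)**2): v starts at zero and is carried across iterations
w = np.zeros(3)
v = np.zeros_like(w)
for t in range(200):
    grad = 2 * (w - 0.5)                        # toy gradient supplied by the caller
    w, v = momentum_step(w, v, grad)
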
Learning rate decay adjusts the learning rate over time

• α = α0 / (1 + t · epoch)
• t defines the decay rate
• If α0 = 0.1 and the decay rate t = 1.0, calculate the learning rate for the first 3 epochs:
• Epoch 0 = 0.1, Epoch 1 = 0.05, Epoch 2 = 0.033
• Step decay: α := α − t · α, common value of t = 0.5
• Exponential decay: α = t^epoch · α0, with t < 1.0
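
A minimal sketch of the three schedules above (names illustrative); printing the 1/t schedule reproduces the worked example on this slide.

def inv_decay(alpha0, t, epoch):
    # alpha = alpha0 / (1 + t * epoch), where t is the decay rate
    return alpha0 / (1 + t * epoch)

def step_decay(alpha, t=0.5):
    # alpha := alpha - t * alpha, i.e. halve the learning rate when applied
    return alpha - t * alpha

def exp_decay(alpha0, t, epoch):
    # alpha = t**epoch * alpha0, with t < 1.0
    return t ** epoch * alpha0

print([round(inv_decay(0.1, 1.0, e), 3) for e in range(3)])   # [0.1, 0.05, 0.033]
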
Relationship of Batch size and Learning Rate
Adaptive learning rate adjusts the learning rate as per requirement

• Decrease the learning rate if the gradient changes its direction


• Increase the learning rate if the gradient stays consistent
• Define a local gain for each weight, initially g = 1
• δw_ij = η · g_ij · ∂ℓ/∂w_ij
• If the gradient is consistent: g_ij(t) = g_ij(t − 1) + β, else g_ij(t) = g_ij(t − 1) · (1 − β)
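
A minimal sketch (NumPy, illustrative names) of the local-gain rule above: additively grow the gain g_ij while the gradient keeps its sign, shrink it multiplicatively when the sign flips, and scale each weight change by η · g_ij · ∂ℓ/∂w_ij.

import numpy as np

def local_gain_step(w, grad, gain, prev_grad, eta=0.01, beta=0.05):
    consistent = np.sign(grad) == np.sign(prev_grad)        # did each gradient keep its direction?
    gain = np.where(consistent, gain + beta, gain * (1 - beta))
    w = w - eta * gain * grad                               # delta w_ij = eta * g_ij * dL/dw_ij
    return w, gain, grad                                    # current grad becomes prev_grad next time

# usage: gains start at 1 for every weight
w, gain, prev_grad = np.zeros(4), np.ones(4), np.zeros(4)
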
RMSProp using Adaptive learning rate

• Concept: Divide the learning rate by an exponentially decaying moving average of squared gradients — Geoff Hinton


• Damps oscillations like momentum, but using root mean squares
• MeanSquare(w_ij, t) = β · MeanSquare(w_ij, t − 1) + (1 − β) · (∂ℓ/∂w_ij(t))²

Igel, Christian, and Michael Hüsken. "Improving the Rprop learning algorithm." Proceedings of the Second International ICSC Symposium on Neural Computation (NC 2000). Vol. 2000. ICSC Academic Press, 2000.
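
A minimal sketch (NumPy, illustrative names) of the RMSProp rule above: maintain the moving average of squared gradients and divide the step by its square root.

import numpy as np

def rmsprop_step(w, mean_sq, grad, lr=0.001, beta=0.9, eps=1e-8):
    mean_sq = beta * mean_sq + (1 - beta) * grad ** 2       # MeanSquare(w, t)
    w = w - lr * grad / (np.sqrt(mean_sq) + eps)            # per-dimension damped step
    return w, mean_sq
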
Adaptive Moment Estimation (ADAM)

• Perhaps the most commonly used optimisation algorithm


• Combination of the momentum method and RMSProp
• m_t = α · m_{t−1} + η · ∂ℓ/∂w_ij(t), momentum term
• r_t = β · MeanSquare(w_ij, t − 1) + (1 − β) · (∂ℓ/∂w_ij(t))², RMSProp term
• w_ij := w_ij − η · m_t / (√r_t + ϵ), ADAM optimiser

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
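
A minimal sketch (NumPy, illustrative names) combining the two terms into the ADAM update; the bias correction of the moments from Kingma & Ba (2014) is included here even though the slide omits it.

import numpy as np

def adam_step(w, m, r, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad                      # momentum term
    r = beta2 * r + (1 - beta2) * grad ** 2                 # RMSProp term
    m_hat = m / (1 - beta1 ** t)                            # bias-corrected first moment (t starts at 1)
    r_hat = r / (1 - beta2 ** t)                            # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(r_hat) + eps)
    return w, m, r
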
SGD versus ADAM scheduler
Keskar, N. S., & Socher, R. (2017). Improving generalization performance by switching from Adam to SGD.
An overview of different optimisation algorithms

https://www.lightly.ai/post/which-optimizer-should-i-use-for-my-machine-learning-project
Model learning can be fixed by hyperparameter tuning

• Manual search — most common


• Grid search — structured
• Define ranges over all parameter spaces and select points
• Usually a pseudo-uniform distribution
• Iterate over all possible configurations
• Coarse (random) search — similar to grid search, but points are picked at random
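
A minimal sketch of grid search versus coarse random search over learning rate and weight decay; train_and_evaluate is a stand-in for a real training run, and the ranges are illustrative.

import itertools, math, random

def train_and_evaluate(lr, weight_decay):
    # stand-in for a real training run; returns a toy validation score
    return -abs(math.log10(lr) + 2) - weight_decay

# Grid search: define ranges for each hyperparameter and iterate over every configuration
lr_grid = [1e-1, 1e-2, 1e-3, 1e-4]
wd_grid = [0.0, 1e-4, 1e-2]
grid_best = max(itertools.product(lr_grid, wd_grid),
                key=lambda cfg: train_and_evaluate(*cfg))

# Coarse (random) search: sample points from the same ranges, here log-uniformly
random.seed(0)
samples = [(10 ** random.uniform(-4, -1), 10 ** random.uniform(-5, -1)) for _ in range(10)]
random_best = max(samples, key=lambda cfg: train_and_evaluate(*cfg))
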
Hyperparameter tuning should start from a very small setup

• Start from a single training sample — check whether the model can overfit it (see the sketch after this list)


• Increase the number of samples gradually
• Use all training data with a small weight decay
• Find a learning rate that makes the loss drop significantly within 100 iterations
• Coarse search for the optimal weight decay and learning rate combination
• Train the models for a few epochs and pick the best ones
• Grid search around the selected models
• Train them fully now — keep an eye on the loss curve
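
A minimal sketch (NumPy, toy linear model, illustrative names) of the first step of this recipe: take a handful of samples and confirm the loss can be driven close to zero before scaling up.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))                     # a tiny subset: 4 samples, 5 features
y = rng.normal(size=4)
w = np.zeros(5)

for step in range(500):
    err = X @ w - y
    w -= 0.1 * (2 * X.T @ err / len(y))         # plain gradient descent on the tiny subset
print(f"loss on the tiny subset: {np.mean((X @ w - y) ** 2):.6f}")   # should be ~0 if the setup is healthy
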
Learning curves provide insights into the learning of an NN

[Figure: accuracy and loss curves during training]

What do you think about the following training graphs?
How long does each iteration take?

• Monitor the precise timing of each iteration — only at the start


• If it is taking longer than about half a second — troubleshoot:


• Data loading — resolution, compression, train from an SSD

• Backprop
How to choose the appropriate network architecture

• Start with the simplest possible network — how simple?


• Get debug cycles down — a few minutes


• Use train, validation, and test curves
• Numbers need to be comparable
• Make only one change at a time
• Visualise inputs, predictions, and ground truths
Training graphs provide useful information to fix a NN

• High training error? Yes → bias problem: bigger model, train longer, update the model architecture. No → continue.
• High train–validation error gap? Yes → variance problem: more data, regularization, change the model architecture. No → continue.
• High validation error? Yes → data mismatch: make training data more similar to test data, data synthesis, domain adaptation. No → continue.
• High test error? Yes → overfitting on the validation set: adjust the validation set. No → done.

How to address these problems?
Learning is a process of generalisation

• Hypothesis about learning


• Based on the parameters learnt during training, the model should give similar results on different data (test data)

• Generalised behaviour of your model on an unknown dataset


• Train on a known dataset — parameter optimisation
• Test on an unknown dataset
How to look for an optimised model?
• Additional Resources: Playlists for the Deep Learning course
• https://www.youtube.com/watch?v=GxJe0DZvydM&list=PLTKMiZHVd_2KJtIXOW0zFhFfBaJJilH51&index=52
• https://www.youtube.com/watch?v=PBYWWM9We-0&list=PL_iWQOsE6TfVmKkQHucjPAoRtIJYt8a5A&index=9
• https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
Some material (images, tables, text etc.) in this presentation has been borrowed from different books, lecture notes, and the web. The original contents solely belong to their owners, and are used in this presentation only for clarifying various educational concepts. Any copyright infringement is not at all intended.
