Training Neural Networks

The document provides an overview of training neural networks, emphasizing the importance of hyperparameters such as network architecture, learning rate, and batch size. It discusses optimization techniques like Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (ADAM), as well as strategies for hyperparameter tuning and model evaluation. Additionally, it highlights the significance of learning curves and generalization in assessing model performance.

Training a neural network

Hyperparameters help achieve an optimised model

• Network architecture — width, depth


• Number of epochs
• Weight Initialisation
• Learning rate
• Regularization
• Batch size
• Momentum
Basic Recipe for training a NN

• You have a dataset with labels
• Take a network f and learn its parameters w, b — what is meant by learning?
• Learn the parameters θ that generalise — empirical risk minimisation
• The empirical risk captures how often fθ(x) gets the approximation wrong
• Empirical Risk = (1/n) Σ_{i=1}^{n} ℓ(x_i, y_i, θ), over samples drawn from the data distribution D
Stochastic Gradient Descent (SGD) is a way to achieve optimization over large
dataset

• If we have n training samples, we need to compute the gradient for all of them
which is O(n)

• We can express the total loss over training data as the expectation of all the
samples — empirical risk minimisation

• Choose a subset of the training set of size m ≪ n — mini-batches


b_i = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)}
{b_1, b_2, …, b_{n/m}}
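
A minimal sketch, assuming NumPy and a toy linear model with squared loss (all names here are illustrative), of how the empirical risk over n samples is approximated with mini-batches of size m ≪ n and optimised with SGD:

import numpy as np

rng = np.random.default_rng(0)
n, d, m, lr = 1000, 5, 32, 0.1                 # n samples, d features, batch size m, learning rate
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.1 * rng.normal(size=n)

w = np.zeros(d)                                 # parameters to learn
for epoch in range(10):
    perm = rng.permutation(n)                   # shuffle, then split into n/m mini-batches b_1 ... b_{n/m}
    for start in range(0, n, m):
        idx = perm[start:start + m]
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # gradient of the mini-batch loss, O(m) not O(n)
        w -= lr * grad                          # SGD update
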
How to select optimal minibatch size

• Minibatch size is a hyperparameter


• Typically in powers of 2 — 4, 8, 16, 32, 64, 128, …
• Smaller batch size — frequent and noisy updates
• Mini-batch of size m — only one backpropagation pass per parameter update
• Optimal batch size — depends on GPU memory (backpropagation)
• Any relation of batch size with output size of the network?

Iterations versus Epochs?


Idea: Using a batch size proportional to the number of classes is useful

[Figure: training curves compared for batch size 64 versus batch size 1024]
Problems with SGD & mini-batches

• Gradients are scaled equally across all dimensions — oscillations


• We want to scale differently across dimensions — use momentum with the gradients. How?
Using Gradient Descent with Momentum

• Exponentially weighted average of gradients — β


• w_ij(t + 1) = w_ij(t) − δw_ij(t); what is δw_ij(t)?
• δw_ij(t) = α · δw_ij(t − 1) + η · ∂ℓ/∂w_ij(t)
Using Gradient Descent with Momentum

• Exponentially weighted average of gradients

• Momentum helps dampen the oscillations in the path of the optimiser towards the minimum

Algorithm — on each iteration t of SGD, for each mini-batch:
• Compute the gradients dw, db as usual
• Compute the exponentially weighted moving averages V_dw and V_db
• Update w and b using V_dw and V_db instead of dw and db:
  w := w − α ∗ V_dw
  b := b − α ∗ V_db

Do you know the commonly used value of momentum?
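
A minimal sketch, assuming NumPy and a caller-supplied gradient (names are illustrative), of the momentum update above: keep an exponentially weighted moving average V_dw of the gradients and step with it; β = 0.9 is the commonly used value of momentum.

import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """V_dw := beta * V_dw + (1 - beta) * dw;  w := w - lr * V_dw."""
    v = beta * v + (1 - beta) * grad
    w = w - lr * v
    return w, v

# usage on a toy quadratic loss sum((w - 0.5)**2): v starts at zero and is carried across iterations
w = np.zeros(3)
v = np.zeros_like(w)
for t in range(200):
    grad = 2 * (w - 0.5)                        # toy gradient supplied by the caller
    w, v = momentum_step(w, v, grad)
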
Learning rate decay adjusts the learning rate over time

• α = α0 / (1 + t · epoch)
• t defines the decay rate
• If α0 = 0.1 and the decay rate t = 1.0, calculate the learning rate for the first 3 epochs:
• Epoch 0 = 0.1, Epoch 1 = 0.05, Epoch 2 = 0.033
• Step decay: α := α − t · α, common value of t = 0.5
• Exponential decay: α = t^epoch · α0, with t < 1.0
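
A minimal sketch of the three schedules above (names illustrative); printing the 1/t schedule reproduces the worked example on this slide.

def inv_decay(alpha0, t, epoch):
    # alpha = alpha0 / (1 + t * epoch), where t is the decay rate
    return alpha0 / (1 + t * epoch)

def step_decay(alpha, t=0.5):
    # alpha := alpha - t * alpha, i.e. halve the learning rate when applied
    return alpha - t * alpha

def exp_decay(alpha0, t, epoch):
    # alpha = t**epoch * alpha0, with t < 1.0
    return t ** epoch * alpha0

print([round(inv_decay(0.1, 1.0, e), 3) for e in range(3)])   # [0.1, 0.05, 0.033]
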
Relationship of Batch size and Learning Rate
Adaptive learning rate adjusts the learning rate as per requirement

• Decrease the learning rate if the gradient changes its direction


• Increase the learning rate if the gradient stays consistent
• Define a local gain for each weight, initially g = 1
• δw_ij = η · g_ij · ∂ℓ/∂w_ij
• If the gradient is consistent: g_ij(t) = g_ij(t − 1) + β, else g_ij(t) = g_ij(t − 1) · (1 − β)
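
A minimal sketch (NumPy, illustrative names) of the local-gain rule above: additively grow the gain g_ij while the gradient keeps its sign, shrink it multiplicatively when the sign flips, and scale each weight change by η · g_ij · ∂ℓ/∂w_ij.

import numpy as np

def local_gain_step(w, grad, gain, prev_grad, eta=0.01, beta=0.05):
    consistent = np.sign(grad) == np.sign(prev_grad)        # did each gradient keep its direction?
    gain = np.where(consistent, gain + beta, gain * (1 - beta))
    w = w - eta * gain * grad                               # delta w_ij = eta * g_ij * dL/dw_ij
    return w, gain, grad                                    # current grad becomes prev_grad next time

# usage: gains start at 1 for every weight
w, gain, prev_grad = np.zeros(4), np.ones(4), np.zeros(4)
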
RMSProp using Adaptive learning rate

• Concept: Divide the learning rate by an exponentially decaying moving average of squared gradients — Geoff Hinton


• Damps oscillations like momentum, but using root mean squares
• MeanSquare(w_ij, t) = β · MeanSquare(w_ij, t − 1) + (1 − β) · (∂ℓ/∂w_ij(t))²

Igel, Christian, and Michael Hüsken. "Improving the Rprop learning algorithm." Proceedings of the Second International ICSC Symposium on Neural Computation (NC 2000). Vol. 2000. ICSC Academic Press, 2000.
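
A minimal sketch (NumPy, illustrative names) of the RMSProp rule above: maintain the moving average of squared gradients and divide the step by its square root.

import numpy as np

def rmsprop_step(w, mean_sq, grad, lr=0.001, beta=0.9, eps=1e-8):
    mean_sq = beta * mean_sq + (1 - beta) * grad ** 2       # MeanSquare(w, t)
    w = w - lr * grad / (np.sqrt(mean_sq) + eps)            # per-dimension damped step
    return w, mean_sq
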
Adaptive Moment Estimation (ADAM)

• Perhaps the most commonly used optimisation algorithm


• Combination of the momentum method and RMSProp
• m_t = α · m_{t−1} + η · ∂ℓ/∂w_ij(t), momentum term
• r_t = β · MeanSquare(w_ij, t − 1) + (1 − β) · (∂ℓ/∂w_ij(t))², RMSProp term
• w_ij := w_ij − η · m_t / (√r_t + ϵ), ADAM optimiser

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
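
A minimal sketch (NumPy, illustrative names) combining the two terms into the ADAM update; the bias correction of the moments from Kingma & Ba (2014) is included here even though the slide omits it.

import numpy as np

def adam_step(w, m, r, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad                      # momentum term
    r = beta2 * r + (1 - beta2) * grad ** 2                 # RMSProp term
    m_hat = m / (1 - beta1 ** t)                            # bias-corrected first moment (t starts at 1)
    r_hat = r / (1 - beta2 ** t)                            # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(r_hat) + eps)
    return w, m, r
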
SGD versus ADAM scheduler
Keskar, N. S., & Socher, R. (2017). Improving generalization performance by switching from Adam to SGD.
An overview of different optimisation algorithms

https://www.lightly.ai/post/which-optimizer-should-i-use-for-my-machine-learning-project
Model learning can be fixed by hyperparameter tuning

• Manual search — most common


• Grid search — structured
• Define ranges over all parameter spaces and select points
• Usually a pseudo-uniform distribution
• Iterate over all possible configurations
• Coarse (random) search — similar to grid search, but points are picked at random
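
A minimal sketch of grid search versus coarse random search over learning rate and weight decay; train_and_evaluate is a stand-in for a real training run, and the ranges are illustrative.

import itertools, math, random

def train_and_evaluate(lr, weight_decay):
    # stand-in for a real training run; returns a toy validation score
    return -abs(math.log10(lr) + 2) - weight_decay

# Grid search: define ranges for each hyperparameter and iterate over every configuration
lr_grid = [1e-1, 1e-2, 1e-3, 1e-4]
wd_grid = [0.0, 1e-4, 1e-2]
grid_best = max(itertools.product(lr_grid, wd_grid),
                key=lambda cfg: train_and_evaluate(*cfg))

# Coarse (random) search: sample points from the same ranges, here log-uniformly
random.seed(0)
samples = [(10 ** random.uniform(-4, -1), 10 ** random.uniform(-5, -1)) for _ in range(10)]
random_best = max(samples, key=lambda cfg: train_and_evaluate(*cfg))
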
Hyperparameter tuning should start from a very small setup

• Start from a single training sample — check whether the model can overfit it (see the sketch after this list)


• Increase the number of samples gradually
• Use all training data with a small weight decay
• Find a learning rate that makes the loss drop significantly within 100 iterations
• Coarse search for the optimal weight decay and learning rate combination
• Train the models for a few epochs and pick the best ones
• Grid search around the selected models
• Train them fully now — keep an eye on the loss curve
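
A minimal sketch (NumPy, toy linear model, illustrative names) of the first step of this recipe: take a handful of samples and confirm the loss can be driven close to zero before scaling up.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))                     # a tiny subset: 4 samples, 5 features
y = rng.normal(size=4)
w = np.zeros(5)

for step in range(500):
    err = X @ w - y
    w -= 0.1 * (2 * X.T @ err / len(y))         # plain gradient descent on the tiny subset
print(f"loss on the tiny subset: {np.mean((X @ w - y) ** 2):.6f}")   # should be ~0 if the setup is healthy
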
Learning curves provide insights into the learning of an NN

[Figure: accuracy and loss curves during training]

What do you think about the following training graphs?
How long does each iteration take?

• Monitor the precise timing of each iteration — only at the start


• If it is taking longer than about half a second — troubleshoot:


• Data loading — resolution, compression, train from an SSD

• Backprop
How to choose the appropriate network architecture

• Start with the simplest possible network — how simple?


• Get debug cycles down — a few minutes


• Use train, validation, and test curves
• Numbers need to be comparable
• Make only one change at a time
• Visualise inputs, predictions, and ground truths
Training graphs provide useful information to fix a NN

• High training error? Yes → bias problem: bigger model, train longer, update the model architecture. No → continue.
• High train–validation error gap? Yes → variance problem: more data, regularization, change the model architecture. No → continue.
• High validation error? Yes → data mismatch: make training data more similar to test data, data synthesis, domain adaptation. No → continue.
• High test error? Yes → overfitting on the validation set: adjust the validation set. No → done.

How to address these problems?
Learning is a process of generalisation

• Hypothesis about learning


• Based on the parameters learnt during training, the model should give similar results on different data (test data)

• Generalised behaviour of your model on an unknown dataset


• Train on a known dataset — parameter optimisation
• Test on an unknown dataset
How to look for an optimised model?
• Additional Resources: Playlists for the Deep Learning course
• https://www.youtube.com/watch?v=GxJe0DZvydM&list=PLTKMiZHVd_2KJtIXOW0zFhFfBaJJilH51&index=52
• https://www.youtube.com/watch?v=PBYWWM9We-0&list=PL_iWQOsE6TfVmKkQHucjPAoRtIJYt8a5A&index=9
• https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
Some material (images, tables, text etc.) in this presentation has been borrowed from different books, lecture notes, and the web. The original contents solely belong to their owners, and are used in this presentation only for clarifying various educational concepts. Any copyright infringement is not at all intended.
