Training NNs
• If we have n training samples, full-batch gradient descent computes the gradient over all of them, which costs O(n) per update
• We can express the total loss over the training data as the expectation (mean) of the per-sample losses; minimising it is empirical risk minimisation, and a mini-batch gives an estimate of that expectation
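As a concrete sketch of one mini-batch SGD step, here is a minimal NumPy example for a linear least-squares model; the data, model, step count, and batch size of 64 are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Toy data: n samples of a noisy linear relationship (illustrative only).
rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.1 * rng.normal(size=n)

w = np.zeros(d)
alpha, batch_size = 0.1, 64     # learning rate and mini-batch size (assumed values)

for step in range(500):
    # Estimate the full-data O(n) gradient from a random mini-batch.
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)   # gradient of the mean squared error
    w -= alpha * grad
```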
[Figure: training curves compared at batch sizes 64 and 1024]
Problems with SGD & mini-batches
• On each iteration t of GD, the weights are updated using only the current mini-batch gradient, so successive updates are noisy and can oscillate
Using Gradient Descent with Momentum
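A minimal sketch of a gradient-descent-with-momentum update; the formulation used here (velocity v ← β·v + gradient, then w ← w − α·v) and the default β = 0.9 are assumed common conventions rather than something stated on the slide:

```python
import numpy as np

def momentum_step(w, v, grad, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update.

    v accumulates an exponentially weighted sum of past gradients,
    which damps oscillations across successive noisy updates.
    """
    v = beta * v + grad          # velocity update
    w = w - alpha * v            # parameter update
    return w, v

# Usage: keep v alongside the weights, initialised to zeros.
w = np.zeros(5)
v = np.zeros_like(w)
grad = np.ones_like(w)           # placeholder gradient
w, v = momentum_step(w, v, grad)
```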
• α = α0 / (1 + t · epoch)
• t defines the decay rate
• If α0 = 0.1, decay rate = 1.0
• Calculate the learning rate for the first 3 epochs?
• Epoch 0 = 0.1, Epoch 1 = 0.05, Epoch 2 ≈ 0.033
• Step Decay: α = α − t · α, common value of t = 0.5
• Exponential decay: α = t^epoch · α0, t < 1.0
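These schedules can be written directly as small functions; the sketch below also reproduces the worked example of α0 = 0.1 with decay rate 1.0:

```python
def time_based_decay(alpha0, decay_rate, epoch):
    # alpha = alpha0 / (1 + decay_rate * epoch)
    return alpha0 / (1.0 + decay_rate * epoch)

def step_decay(alpha, t=0.5):
    # alpha <- alpha - t * alpha, i.e. multiply by (1 - t); applied every few epochs
    return alpha - t * alpha

def exponential_decay(alpha0, t, epoch):
    # alpha = t**epoch * alpha0, with t < 1.0
    return t ** epoch * alpha0

# Reproduces the worked example: alpha0 = 0.1, decay rate = 1.0
print([round(time_based_decay(0.1, 1.0, e), 3) for e in range(3)])
# -> [0.1, 0.05, 0.033]
```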
Relationship of Batch size and Learning Rate
An adaptive learning rate adjusts the learning rate automatically as training requires
• Damps oscillations like momentum, but using the root mean square of recent gradients
• MeanSquare(w_{i,j}, t) = β · MeanSquare(w_{i,j}, t − 1) + (1 − β) · (∂ℓ/∂w_{i,j}(t))²
Igel, Christian, and Michael Hüsken. "Improving the Rprop learning algorithm." Proceedings of the Second
International ICSC Symposium on Neural Computation (NC 2000). Vol. 2000. ICSC Academic Press, 2000.
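The MeanSquare recursion above is the running average used in RMSprop-style updates; a minimal per-weight sketch follows, where dividing the gradient by the root mean square (plus a small ε for numerical stability) is the assumed standard step:

```python
import numpy as np

def rmsprop_step(w, mean_square, grad, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSprop-style update for a weight vector w.

    mean_square is the running average of squared gradients from the slide:
    MeanSquare(t) = beta * MeanSquare(t-1) + (1 - beta) * grad**2
    """
    mean_square = beta * mean_square + (1.0 - beta) * grad ** 2
    w = w - alpha * grad / (np.sqrt(mean_square) + eps)   # per-weight scaled step
    return w, mean_square
```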
Adaptive Moment Estimation (Adam)
https://www.lightly.ai/post/which-optimizer-should-i-use-for-my-machine-learning-project
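Adam combines a running mean of the gradients (as in momentum) with a running mean of the squared gradients (as in the MeanSquare recursion above), plus bias correction. A minimal sketch, where the defaults β1 = 0.9, β2 = 0.999, ε = 1e-8 are commonly used values assumed here:

```python
import numpy as np

def adam_step(w, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * grad           # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment: mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early iterations
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```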
Model learning can be fixed by hyperparameter tuning
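One simple way to carry out such hyperparameter tuning is a random search over, say, the learning rate and batch size. In the sketch below, train_and_evaluate is a hypothetical placeholder for the real training and validation routine:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_and_evaluate(alpha, batch_size):
    # Hypothetical stand-in: returns a fake validation score so the loop below
    # is runnable. Replace with actual training + evaluation of the model.
    return -abs(np.log10(alpha) + 2.5) - abs(batch_size - 64) / 256.0

best = None
for trial in range(20):
    alpha = 10 ** rng.uniform(-4, -1)               # sample learning rate on a log scale
    batch_size = int(rng.choice([32, 64, 128, 256]))
    score = train_and_evaluate(alpha, batch_size)
    if best is None or score > best[0]:
        best = (score, alpha, batch_size)

print("best (score, learning rate, batch size):", best)
```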
What do you think about the following training graphs?
[Figure: accuracy and loss curves during training]
How long does each iteration take?
• Backprop: each iteration needs one forward pass plus one backward pass through all layers
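A rough way to answer this empirically is to time one forward pass plus the backprop pass; the tiny two-layer NumPy network below is only a stand-in, with layer sizes and batch size chosen as illustrative assumptions:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_hidden, d_out = 128, 784, 256, 10
X = rng.normal(size=(batch, d_in))
y = rng.normal(size=(batch, d_out))
W1 = rng.normal(size=(d_in, d_hidden)) * 0.01
W2 = rng.normal(size=(d_hidden, d_out)) * 0.01

start = time.perf_counter()
# Forward pass
h = np.maximum(X @ W1, 0.0)          # ReLU hidden layer
out = h @ W2
# Backward pass (gradients of the mean squared error)
d_out_grad = 2.0 * (out - y) / batch
dW2 = h.T @ d_out_grad
dh = d_out_grad @ W2.T
dh[h <= 0] = 0.0                     # ReLU gradient
dW1 = X.T @ dh
elapsed = time.perf_counter() - start
print(f"one forward+backward iteration: {elapsed * 1e3:.2f} ms")
```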
How to choose the appropriate network architecture?