Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization
DL is a highly iterative process: there are a lot of things and hyperparameters to take care of.

Mini-batch gradient descent
- Batch size = 1: stochastic gradient descent.

GD with Momentum
- The larger the value of beta, the smoother the update.

Value of beta
- Common values: 0.8 - 0.999.
Setting up data in train/dev/test sets
- Depends on the size of the data.
- Small dataset? These ratios will be okay. Old: 70/30 or 60/20/20.
- Because DL has a lot of data, 95/5 will also work.
- Make sure the dev and test sets come from the same distribution.

Choosing beta
- Don't feel like tuning? 0.9 is mostly used.
- May need several attempts to find the right value of beta for your model.
- Tuning alpha together with beta is common.
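A minimal sketch of the momentum update described above (function and variable names are mine, not from the notes): v is an exponentially weighted average of past gradients, and a larger beta gives a smoother update.

```python
import numpy as np

def momentum_update(w, dw, v, beta=0.9, alpha=0.01):
    # v: exponentially weighted average of past gradients
    v = beta * v + (1 - beta) * dw
    # step along the smoothed gradient instead of the raw one
    w = w - alpha * v
    return w, v

w, v = np.array([1.0]), np.zeros(1)
w, v = momentum_update(w, np.array([0.5]), v)
```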
Bias and variance
- 2D data: we can plot the decision boundary and see if bias or variance is there.
- Training error - Bayes error small, but train - dev error big? High variance (overfitting).

RMSProp
- S{dW} = beta * S{dW} + (1 - beta) * dW^2 (element-wise square)
- S{db} = beta * S{db} + (1 - beta) * db^2 (element-wise square)
- W = W - alpha * dW / sqrt(S{dW} + epsilon)
- b = b - alpha * db / sqrt(S{db} + epsilon)
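The RMSProp update can be sketched as follows (my own naming, not from the notes): keep an exponentially weighted average S of the squared gradients and divide each step by its root.

```python
import numpy as np

def rmsprop_update(w, dw, s, beta=0.999, alpha=0.01, eps=1e-8):
    s = beta * s + (1 - beta) * dw ** 2      # element-wise square
    w = w - alpha * dw / (np.sqrt(s) + eps)  # damp large-gradient directions
    return w, s

w, s = np.array([1.0]), np.zeros(1)
w, s = rmsprop_update(w, np.array([0.5]), s)
```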
Use an appropriate scale to pick hyperparameters
- Drastically different lower and upper limits? Use a log scale rather than a linear scale.

Dropout Regularization
- Use only at training time.
- Downside: the cost function is no longer clearly defined.
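Inverted dropout, applied only at training time as noted above, can be sketched like this (hypothetical helper; dividing by keep_prob keeps the expected activation unchanged, so test time needs no dropout at all):

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8, training=True, rng=np.random.default_rng(0)):
    if not training:
        return a                              # no dropout at test time
    mask = rng.random(a.shape) < keep_prob    # keep each unit with prob. keep_prob
    return a * mask / keep_prob               # scale to preserve the expectation
```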
Regularizing your neural network
- Getting more data is tough.

Appropriate scale, continued
- The beta of an exponentially weighted average can't be randomly chosen on a linear scale because the result is very sensitive to beta near 1; sample 1 - beta on a log scale.
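Log-scale sampling, as described above, might be sketched like this (helper names are mine): draw the exponent uniformly, and for beta sample 1 - beta on the log scale instead of beta itself.

```python
import numpy as np

def sample_log_scale(low, high, n, rng):
    # draw the base-10 exponent uniformly, then exponentiate
    r = rng.uniform(np.log10(low), np.log10(high), size=n)
    return 10.0 ** r

rng = np.random.default_rng(0)
# e.g. a learning rate between 1e-4 and 1 (limits differ drastically)
alphas = sample_log_scale(1e-4, 1.0, 5, rng)
# beta is very sensitive near 1, so sample 1 - beta on a log scale
betas = 1.0 - sample_log_scale(1 - 0.999, 1 - 0.9, 5, rng)
```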
Data Augmentation
- Flip
- Rotate
- Crop
- Skew, etc.
- Augment the data when dealing with a high variance problem; getting more data is tough.

Pandas vs. Caviar
- Pandas approach: babysit one model.
- Caviar approach: train multiple models simultaneously.
- In practice, approach selection will depend on the availability of computation power.
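A minimal augmentation sketch on a fake 4x4 "image": a flip and a crop, as listed above. Real pipelines also rotate and skew, usually via a library.

```python
import numpy as np

img = np.arange(16).reshape(4, 4)
flipped = np.fliplr(img)   # horizontal flip
cropped = img[1:3, 1:3]    # 2x2 centre crop
```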
Early Stopping
- Downside of early stopping: it breaks orthogonalization (optimizing the cost and avoiding overfitting are no longer handled separately).

Why does regularization help with overfitting?
- If lambda is very large, the weights become smaller, so Z is confined to a smaller range of values and hence the NN becomes much simpler.

Batch Normalization (benefits)
- Normalize the Z terms.
- Makes hyperparameter search easier.
- Makes the NN more robust.
- Enables training large NNs easily.
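The weight-shrinking effect of a large lambda can be sketched as follows (my own names): the cost gains (lam / (2*m)) * ||W||^2, so the gradient gains (lam / m) * W and each step decays the weights.

```python
import numpy as np

def l2_update(w, dw_data, lam, m, alpha=0.1):
    dw = dw_data + (lam / m) * w   # extra gradient term from L2 penalty
    return w - alpha * dw

w = np.array([1.0])
w_small = l2_update(w, np.zeros(1), lam=0.1, m=10)
w_large = l2_update(w, np.zeros(1), lam=50.0, m=10)
# larger lambda -> smaller weights -> effectively simpler network
```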
Hyperparameter Tuning

Normalizing inputs
1. Subtract the mean: mu = (1/m) * np.sum(x^(i)); x = x - mu.
2. Normalize the variance: sigma^2 = (1/m) * np.sum(x^(i) ** 2) (element-wise); x = x / sigma.
- Use the same mean and variance to normalize the test set.

Batch Normalization
- z~(i) = gamma * z(i)_norm + beta
- gamma and beta are learnable parameters, trained using GD/Adam/RMSProp/Momentum.
- Instead of using the unnormalized values of Z you use the normalized values.
- Batch norm is applied with mini-batch gradient descent.
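The two normalization steps above can be sketched like this (helper name is mine): fit mu and sigma on the training set and reuse the same values for the test set.

```python
import numpy as np

def fit_normalizer(x):
    mu = x.mean(axis=0)
    sigma = x.std(axis=0) + 1e-8   # avoid division by zero
    return mu, sigma

x_train = np.array([[1.0, 100.0],
                    [3.0, 300.0]])
mu, sigma = fit_normalizer(x_train)
x_norm = (x_train - mu) / sigma    # apply the SAME mu/sigma to test data
```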
Normalizing Inputs (why)
- The cost function looks elongated when the inputs are not normalized, and hence gradient descent becomes tough to perform.
- When input features are on different scales, normalization becomes very important.
- Anyway, always normalize.

Why does BN work?
- It makes weights deeper in the NN more robust to change.
- BN has a slight regularization effect, though it should not be used as a regularization technique.

Implementation
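A batch-norm forward pass over one mini-batch might look like this (a sketch, not the notes' code): normalize z, then scale and shift with the learnable parameters gamma and beta.

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-8):
    mu = z.mean(axis=0)                    # per-feature mini-batch mean
    var = z.var(axis=0)                    # per-feature mini-batch variance
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta           # learnable scale and shift

z = np.array([[1.0], [3.0]])
out = batchnorm_forward(z, gamma=np.array([2.0]), beta=np.array([5.0]))
```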
Setting up your optimization problem

Vanishing and exploding gradients
- Weight initialization to tackle this problem (random initialization):
- If tanh activation: Xavier initialization.
- If relu activation: He initialization, np.sqrt(2 / n^(l-1)); another variant is np.sqrt(np.divide(2, n^(l-1) + n^(l))).

BN at test time
- mu and sigma^2 are calculated using exponentially weighted averages across mini-batches.

3. Hyperparameter tuning, Batch normalization and Programming frameworks

Softmax Regression
- Generalization of Logistic Regression.
- Recognize 1 out of C classes.
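The initialization rules above can be sketched as one helper (function name is mine): pick the scale from the fan-in (and fan-out, for the variant) of the layer.

```python
import numpy as np

def init_weights(n_prev, n_curr, activation, rng):
    if activation == "relu":     # He initialization
        scale = np.sqrt(2.0 / n_prev)
    elif activation == "tanh":   # Xavier initialization
        scale = np.sqrt(1.0 / n_prev)
    else:                        # variant: sqrt(2 / (n_prev + n_curr))
        scale = np.sqrt(2.0 / (n_prev + n_curr))
    return rng.standard_normal((n_curr, n_prev)) * scale

rng = np.random.default_rng(0)
W = init_weights(1000, 100, "relu", rng)
```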
Gradient Checking
- Helps find bugs in backprop.
- Helps save time.
- In backprop we use this to check whether the implementation is working correctly or not.
- Approximate the gradient numerically via the limit definition, then check the euclidean distance between the gradient calculated manually and the one from backprop.

Training a softmax classifier
- Softmax Regression algorithm.
- Changes in the labels: make each label a (no. of final classes x 1) vector.
- Normally programming frameworks will take care of the complex operations.
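Gradient checking as described above might be sketched like this (my own names): approximate each partial derivative with the two-sided limit definition, then compare against the analytic gradient using a relative euclidean distance.

```python
import numpy as np

def grad_check(f, grad, theta, eps=1e-7):
    num = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        num[i] = (f(plus) - f(minus)) / (2 * eps)   # two-sided approximation
    ana = grad(theta)
    # relative euclidean distance between numeric and analytic gradients
    return np.linalg.norm(num - ana) / (np.linalg.norm(num) + np.linalg.norm(ana))

# toy check: f(theta) = sum(theta^2) has gradient 2 * theta
diff = grad_check(lambda t: np.sum(t ** 2), lambda t: 2 * t,
                  np.array([1.0, -2.0]))
# a value around 1e-7 or smaller suggests backprop is correct
```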
Programming frameworks
- PaddlePaddle
- Theano
- Torch
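Although frameworks handle these operations for you, the core of a softmax classifier is small enough to sketch directly (names are mine, not from the notes):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())    # shift for numerical stability
    return e / e.sum()         # outputs are positive and sum to 1

def one_hot(label, num_classes):
    y = np.zeros(num_classes)  # the (no. of classes x 1)-style label vector
    y[label] = 1.0
    return y

p = softmax(np.array([2.0, 1.0, 0.1]))
```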