Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

Gradient descent optimizes neural networks by iteratively processing training examples to update the network parameters downhill towards a minimum of the cost. Batch gradient descent processes all m examples at once, while stochastic gradient descent processes one example at a time. Mini-batch gradient descent processes examples in batches of a size between 1 and m, balancing the vectorized speed of batch processing with the frequent parameter updates of stochastic gradient descent. Choosing a good mini-batch size gives the fastest learning in practice.

2. Optimization Algorithms

Mini-Batch Gradient Descent
Gradient descent analogy: it "goes downhill", each update moving the parameters toward a minimum of the cost. Implementing it for a neural network requires three for-loops:
1. Over the number of iterations
2. Over the m training examples
3. Over the number of layers in the neural network

The batch size determines the variant:
- Batch size = m (the total training examples): batch gradient descent (BGD)
- Batch size = 1: stochastic gradient descent (SGD)
- Batch size between 1 and m: mini-batch gradient descent

BGD takes too long per update on big data; with SGD you lose the speed-up from vectorization. Mini-batch GD divides huge data into batches for training and updates the parameters after every mini-batch is processed. 5 million data points? Divide into mini-batches of 1000 and you get 5000 updates per pass. Why mini-batch GD is good: in practice it gives the fastest learning.
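A minimal numpy sketch of this update pattern; the synthetic data, the linear model, and all names below are illustrative assumptions rather than anything from the course:

import numpy as np

rng = np.random.default_rng(0)
m = 5000                                   # total training examples
X = rng.normal(size=(m, 3))                # features
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0   # targets from a known linear rule

w, b = np.zeros(3), 0.0
alpha, batch_size = 0.1, 64

for epoch in range(10):                    # loop over passes through the data
    for start in range(0, m, batch_size):  # loop over mini-batches
        Xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        err = Xb @ w + b - yb              # forward pass on one mini-batch
        dw = Xb.T @ err / len(yb)          # MSE gradient w.r.t. w
        db = err.mean()                    # MSE gradient w.r.t. b
        w -= alpha * dw                    # parameters are updated after EVERY
        b -= alpha * db                    # mini-batch, not once per epoch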

Choosing your mini-batch size
- Training set size <= 2000: just use batch gradient descent.
- Bigger? Use mini-batch sizes that are powers of 2 (64, 128, 256, 512).
- Make sure the mini-batch fits in the memory of your GPU or CPU.
- In practice mini-batch GD keeps the vectorized implementation while giving less waiting between parameter updates.
- Shuffling and partitioning are the two steps required to build mini-batches (see the sketch below).
- The cost trend should be going down, but it is noisier than with batch GD and you may not converge every time: the iterates can keep oscillating around the minimum.
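A sketch of the shuffle-and-partition steps; the function name and the column-per-example layout are assumptions:

import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    # X: (n_features, m), Y: (1, m); one column per example (layout assumed).
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                  # step 1: shuffle X and Y together
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    batches = []
    for k in range(0, m, batch_size):          # step 2: partition into batches
        batches.append((X_shuf[:, k:k + batch_size],
                        Y_shuf[:, k:k + batch_size]))
    return batches                             # the last batch may be smaller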
Exponentially Weighted Averages (EWA)
- Key equation: v_t = beta * v_{t-1} + (1 - beta) * theta_t
- Use v_t / (1 - beta^t) for removing the initial bias in the EWA.

Gradient Descent with Momentum
- Use an exponentially weighted average of the gradients, and use those averaged gradients to update your weights.
- Smoothes out the steps of gradient descent; almost always works faster than plain batch GD.
- Value of beta: the larger the value, the smoother the update. Common values: 0.8 - 0.999. Don't feel like tuning? 0.9 is mostly used.
- You may need several attempts to find the right value of beta for your model; tuning alpha together with beta is common.
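A sketch of momentum with bias correction on a toy quadratic cost; momentum_step, the toy gradient, and the constants are illustrative assumptions:

import numpy as np

def momentum_step(theta, grad, v, t, alpha=0.01, beta=0.9):
    v = beta * v + (1 - beta) * grad      # EWA of gradients: v_t = b*v_{t-1} + (1-b)*grad
    v_hat = v / (1 - beta ** t)           # bias correction for the early iterations
    theta = theta - alpha * v_hat         # smoother step than the raw gradient
    return theta, v

theta, v = np.array([1.0, -2.0]), np.zeros(2)
for t in range(1, 101):
    grad = 2 * theta                      # gradient of the toy cost ||theta||^2
    theta, v = momentum_step(theta, grad, v, t)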

RMSProp
- S_dW = beta * S_dW + (1 - beta) * dW^2 (element-wise)
- S_db = beta * S_db + (1 - beta) * db^2 (element-wise)
- Update by dividing the gradient by the root mean square: W := W - alpha * dW / sqrt(S_dW), b := b - alpha * db / sqrt(S_db).

Adam = Adaptive Moment Estimation
- A mix of momentum and RMSProp; one of the most effective optimization techniques.
- Parameters: alpha, beta1, beta2, epsilon.
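One Adam step as a sketch, combining the two exponentially weighted averages above; the names and default values follow common convention and are assumptions, not from the notes:

import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw             # momentum-style EWA of gradients
    s = beta2 * s + (1 - beta2) * dw ** 2        # RMSProp-style EWA of squared gradients
    v_hat = v / (1 - beta1 ** t)                 # bias corrections
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s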
- Advantages of Adam: relatively low memory requirements, and it usually works well even with little tuning of the hyperparameters (except learning_rate).

The problem of local optima
- In high dimensions the real problem is plateaus rather than bad local optima; plateaus can be taken care of by the different optimization algorithms above (momentum, RMSProp, Adam).

Learning Rate Decay
- Reduce alpha by formula as training progresses, e.g. alpha = alpha_0 / (1 + decay_rate * epoch_num), as in the sketch below.
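A tiny sketch of the decay formula; the alpha_0 and decay_rate values are arbitrary examples, and epochs are counted from 1 here:

alpha0, decay_rate = 0.2, 1.0
for epoch_num in range(1, 5):
    alpha = alpha0 / (1 + decay_rate * epoch_num)
    print(epoch_num, alpha)      # 0.1, 0.0667, 0.05, 0.04: alpha shrinks each epoch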
1. Practical Aspects of Deep Learning

DL is a highly iterative process: you have a lot of choices and hyperparameters to take care of, the learning rate for example.

Setting up data in train/dev/test sets
- The ratios depend on the size of the data. Small dataset? The old ratios, 70/30 or 60/20/20, will be okay.
- Because DL has a lot of data, 95/5 will also work.
- Make sure the dev and test set come from the same distribution.

Bias/Variance
- 2D data: we can plot the decision line and see whether bias or variance is there. High-dimensional data? Compare errors instead.
- What is the optimal (Bayes) error? It defines whether you have a bias problem, a variance problem, or both:
- Training error - Bayes error small, train - dev error big? High variance (overfitting).
- Training error - Bayes error big, train - dev error small? High bias (underfitting).
- Training error - Bayes error big, train - dev error big? High bias and high variance (worst of both worlds).
- Training error - Bayes error small, train - dev error small? Low bias and low variance (best of both worlds).

Basic recipe for machine learning: after training an initial model, iterate until bias and variance are both low.
1. Ask if the model has high bias? Try a bigger network (almost always helps), train longer, or try a different NN architecture. More training data will not help here.
2. Ask if the model has high variance? Get more data, use regularization, or try a more appropriate NN architecture.

Regularizing your neural network
- L2 regularization: in an NN application the penalty becomes the Frobenius norm of the weight matrices; in backprop, "weight decay" happens. With L1 the weights W may become sparse, so avoid L1 regularization.
- Why does regularization help with overfitting? If lambda is very large, the weights become smaller, z is confined to a smaller range of values, and hence the NN becomes much simpler.
- Data augmentation: getting more data is tough, so augment the data in practice: flip, rotate, crop, skew, etc. Deals with the high-variance problem and makes the NN more robust.
- Early stopping: makes the hyperparameter search easier, but the downside is that the cost function is no longer clearly optimized; it breaks orthogonalization (one knob for fitting the cost, a separate knob for not overfitting).
- Dropout regularization: inverted dropout is the most common implementation; use it only at training time (see the sketch below).
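A sketch of inverted dropout on one layer's activations; a3, keep_prob = 0.8, and the shapes are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
keep_prob = 0.8                               # keep 80% of the units
a3 = rng.normal(size=(5, 10))                 # activations of some layer 3

d3 = rng.uniform(size=a3.shape) < keep_prob   # random mask per unit
a3 = a3 * d3                                  # zero out the dropped units
a3 = a3 / keep_prob                           # the "inverted" scaling keeps the
                                              # expected activation unchanged, so
                                              # test time needs no rescaling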
Setting up your optimization problem

Normalizing Inputs
1. Subtract the mean: x := x - mu, where mu = (1/m) * np.sum(x^(i)).
2. Normalize the variance: x := x / sigma^2, where sigma^2 = (1/m) * np.sum(x^(i) ** 2) (element-wise, after mean subtraction).
- Use the same mean and variance to normalize the test set.
- The cost function looks elongated when the inputs are not normalized, and hence gradient descent becomes tough to perform. When input features are on different scales, normalization becomes very important; anyway, always normalize.
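A sketch of the two steps on synthetic data; note that dividing by sigma^2 follows these notes (dividing by the standard deviation sigma is the more common variant):

import numpy as np

rng = np.random.default_rng(2)
X_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
X_test = rng.normal(loc=5.0, scale=3.0, size=(200, 4))

mu = X_train.mean(axis=0)                      # (1/m) * sum(x^(i))
Xc = X_train - mu                              # 1. subtract the mean
sigma2 = (Xc ** 2).mean(axis=0)                # (1/m) * sum(x^(i) ** 2), element-wise
X_train = Xc / sigma2                          # 2. normalize the variance
X_test = (X_test - mu) / sigma2                # reuse the SAME mu and sigma^2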
Vanishing and exploding gradients
- In very deep networks gradients can shrink or blow up exponentially; careful random weight initialization helps tackle this problem.
- If tanh activation: Xavier initialization.
- If ReLU activation: He initialization, scaling by np.sqrt(2 / n^(l-1)); a related variant scales by np.sqrt(np.divide(2, n^(l-1) + n^(l))).
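A sketch of He/Xavier scaling for a single layer; the function name and the Gaussian sampling are assumptions:

import numpy as np

def init_layer(n, n_prev, activation="relu", seed=0):
    rng = np.random.default_rng(seed)
    if activation == "relu":
        scale = np.sqrt(2.0 / n_prev)          # He initialization
    else:
        scale = np.sqrt(1.0 / n_prev)          # Xavier, for tanh
    W = rng.normal(size=(n, n_prev)) * scale   # keeps activation variance stable,
    b = np.zeros((n, 1))                       # fighting vanishing/exploding gradients
    return W, b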
Gradient Checking
- Helps find bugs in backprop and helps save time: we use it to check whether the backprop implementation is working correctly or not.
- The derivative is a limit, so approximate it numerically and check the normalized Euclidean distance between the gradient calculated numerically and the one produced by backprop.
- Don't use it in training; it is only for debugging.
- If the algorithm fails grad check, look at the individual components to try to identify the bug.
- Practical tips: remember to include the regularization term in the cost; grad check doesn't work with dropout, so set keep_prob = 1.0; run it at random initialization, and perhaps again after some training.
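A self-contained sketch of gradient checking on a toy cost; J and grad_analytic stand in for a real network's cost and backprop output:

import numpy as np

def J(theta):                      # toy cost
    return np.sum(theta ** 2)

def grad_analytic(theta):          # what "backprop" would return
    return 2 * theta

theta = np.array([1.0, -3.0, 2.0])
eps = 1e-7
grad_approx = np.zeros_like(theta)
for i in range(theta.size):                        # two-sided numerical gradient,
    tp, tm = theta.copy(), theta.copy()            # straight from the limit definition
    tp[i] += eps
    tm[i] -= eps
    grad_approx[i] = (J(tp) - J(tm)) / (2 * eps)

grad = grad_analytic(theta)
diff = (np.linalg.norm(grad_approx - grad)         # normalized Euclidean distance
        / (np.linalg.norm(grad_approx) + np.linalg.norm(grad)))
print(diff)                        # around 1e-7 or smaller suggests backprop is correct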

3. Hyperparameter Tuning, Batch Normalization and Programming Frameworks

Hyperparameter Tuning
- Some hyperparameters are more important than others: the learning rate is the most important, next maybe the momentum term, third the mini-batch size, etc.
- To find the right values you have to sample them randomly (not on a grid), then use a coarse-to-fine scheme.
- Use an appropriate scale to pick hyperparameters: drastically different lower and upper limits? Use a log scale rather than a linear scale. The exponentially weighted average parameter can't be sampled on a linear scale because the model is very sensitive to it near 1.
- In practice: the pandas approach is to babysit one model; the caviar approach is to train multiple models simultaneously. Which approach you select will depend on the availability of computation power.

Batch Normalization
- Normalize the z terms: instead of using unnormalized values of z you use normalized ones. gamma and beta are learnable parameters, and z~(i) = gamma * z_norm(i) + beta.
- Implementation: batch norm is applied using mini-batch gradient descent (or Adam/RMSProp/momentum); see the sketch below.
- Why BN works: it makes the weights deeper in the NN more robust to changes in earlier layers. BN also has a slight regularization effect, though it should not be used as a regularization technique.
- BN at test time: mu and sigma^2 are calculated using exponentially weighted averages across the mini-batches seen during training.
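A sketch of the batch-norm forward pass at training time for one mini-batch; the shapes and function name are assumptions, and the test-time EWA of mu and sigma^2 is not shown:

import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)        # per-unit variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # normalized z
    return gamma * Z_norm + beta              # z~ = gamma * z_norm + beta

Z = np.random.default_rng(3).normal(size=(4, 64))   # 4 units, mini-batch of 64
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))     # learnable, trained with GD/Adam
Z_tilde = batchnorm_forward(Z, gamma, beta)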

Softmax Regression
- A generalization of logistic regression: recognize 1 out of c classes.
- Changes in the labels: make each label a (number of final classes x 1) one-hot vector.
- Training a softmax classifier: normally programming frameworks will take care of the complex operations.
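A sketch of the softmax activation itself; the max subtraction for numerical stability is a standard trick, an assumption rather than something from the notes:

import numpy as np

def softmax(z):
    t = np.exp(z - z.max())       # subtract the max for numerical stability
    return t / t.sum()            # probabilities over the c classes, summing to 1

z = np.array([2.0, 1.0, 0.1, -1.0])   # scores for c = 4 classes
print(softmax(z))                     # the largest score gets the largest probability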

Deep Learning Frameworks
- Frameworks enable you to train large NNs easily: Caffe / Caffe2, CNTK, DL4J, Keras, Lasagne, mxNet, Paddle Paddle, TensorFlow (used in the assignment), Theano, Torch.
