Lecture 01:
Deep feedforward networks
Part II
[Figure: chain rule setup in which y depends on two variables y1 and y2, and y1, y2 each depend on x]
Ø Backpropagation Algorithm Cont…
[Figure: example network for the backpropagation example, showing input x2 = 0.10 feeding hidden unit h2; weights w3 = 0.25, w4 = 0.30, w7 = 0.50, w8 = 0.55; biases b1 = 0.35 and b2 = 0.60 (each with a constant input of 1); and target output o2 = 0.99]
Ø Backpropagation Example Cont…
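The figure above gives only some of the network's values (w3, w4, w7, w8, b1, b2, x2, and the target for o2). The NumPy sketch below runs one forward and one backward pass through such a 2-2-2 sigmoid network; the remaining inputs, weights, targets, and the learning rate are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Values taken from the figure: w3 = 0.25, w4 = 0.30, w7 = 0.50, w8 = 0.55,
# b1 = 0.35, b2 = 0.60, x2 = 0.10, target o2 = 0.99.
# The remaining values (x1, w1, w2, w5, w6, target o1, learning rate) are assumed.
x = np.array([0.05, 0.10])
W1 = np.array([[0.15, 0.20],    # w1, w2 (assumed)  -> h1
               [0.25, 0.30]])   # w3, w4 (figure)   -> h2
b1 = 0.35
W2 = np.array([[0.40, 0.45],    # w5, w6 (assumed)  -> o1
               [0.50, 0.55]])   # w7, w8 (figure)   -> o2
b2 = 0.60
target = np.array([0.01, 0.99])

# Forward pass
h = sigmoid(W1 @ x + b1)                      # hidden activations h1, h2
o = sigmoid(W2 @ h + b2)                      # output activations o1, o2
loss = 0.5 * np.sum((target - o) ** 2)        # squared-error loss

# Backward pass (repeated application of the chain rule)
delta_o = (o - target) * o * (1 - o)          # dL/d(net_o)
grad_W2 = np.outer(delta_o, h)                # dL/dW2
delta_h = (W2.T @ delta_o) * h * (1 - h)      # dL/d(net_h)
grad_W1 = np.outer(delta_h, x)                # dL/dW1 (bias gradients omitted for brevity)

# Gradient-descent update with an assumed learning rate
lr = 0.5
W2 -= lr * grad_W2
W1 -= lr * grad_W1
```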
[Figure: three model fits on the training dataset; the optimal fit achieves high training accuracy and high testing accuracy, while the other two fits show low accuracy on the test dataset]
Ø Solutions for Overfitting
• Increase the size of the dataset – e.g., data augmentation
• Regularization
• L1
• L2
• Dropout
• Bagging/Ensemble models
• Early stopping
Ø Data Augmentation
Data augmentation techniques in computer vision (a typical pipeline is sketched after the list):
• Cropping.
• Flipping.
• Rotation.
• Translation.
• Brightness.
• Contrast.
• Color Augmentation.
• Saturation.
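The listed operations correspond to standard image transforms. The torchvision pipeline below is one possible combination; the crop size, rotation angle, translation range, and jitter strengths are assumptions chosen for illustration.

```python
import torchvision.transforms as T

# One possible augmentation pipeline covering the operations listed above;
# the crop size, angles, translation range, and jitter strengths are illustrative.
augment = T.Compose([
    T.RandomResizedCrop(224),                          # cropping
    T.RandomHorizontalFlip(p=0.5),                     # flipping
    T.RandomRotation(degrees=15),                      # rotation
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # translation
    T.ColorJitter(brightness=0.2, contrast=0.2,
                  saturation=0.2, hue=0.05),           # brightness, contrast, color, saturation
    T.ToTensor(),
])

# Each call produces a differently transformed copy of the same image:
# augmented_tensor = augment(pil_image)
```

Each pass of an image through the pipeline yields a new, randomly transformed version, which effectively increases the size of the training dataset.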
Ø Regularization for deep learning
• Regularization is any modification made to the learning algorithm with the intention of lowering
the generalization error but not the training error.
• In the context of deep learning, most regularization strategies involve regularizing estimators.
This is done by reducing variance at the expense of increasing the estimator's bias.
• An effective regularizer is one that decreases the variance significantly while not overly
increasing the bias.
• Controlling the complexity of the model is not a simple matter of finding the right model size
and the right number of parameters.
• Instead, deep learning relies on finding the best-fitting model, a large model that has been
properly regularized.
Ø L1/L2 Regularization
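Both penalties add a norm of the weights to the training objective. As a brief sketch (the symbols J for the unregularized loss, w for the weight vector, and λ for the penalty strength are notational assumptions here):

```latex
% L2 (weight decay) and L1 regularized objectives; J is the unregularized
% training loss, w the weight vector, and \lambda the penalty strength.
\begin{align*}
  \tilde{J}_{L_2}(\mathbf{w}) &= J(\mathbf{w}) + \frac{\lambda}{2}\,\lVert\mathbf{w}\rVert_2^2
                               = J(\mathbf{w}) + \frac{\lambda}{2}\sum_i w_i^2 \\
  \tilde{J}_{L_1}(\mathbf{w}) &= J(\mathbf{w}) + \lambda\,\lVert\mathbf{w}\rVert_1
                               = J(\mathbf{w}) + \lambda\sum_i \lvert w_i \rvert
\end{align*}
```

The L2 penalty shrinks all weights toward zero (weight decay), while the L1 penalty drives many weights to exactly zero, yielding sparse models.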
Ø Dropout
• Dropout provides a computationally inexpensive but powerful method of
regularizing a broad family of models.
• Dropout provides an inexpensive approximation to training and evaluating a
bagged ensemble of exponentially many neural networks.
• Specifically, dropout trains the ensemble consisting of all sub-networks that can
be formed by removing non-output units from an underlying base network.
Ø Training with Dropout
• To train with dropout, we use a minibatch-based learning algorithm that makes
small steps, such as stochastic gradient descent.
• Each time we load an example into a minibatch, we randomly sample a different
binary mask to apply to all of the input and hidden units in the network.
• The mask for each unit is sampled independently from all of the others.
• Typically, the probability of including a hidden unit is 0.5, while the probability of
including an input unit is 0.8 (a sketch of this masking follows below).
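A minimal NumPy sketch of the mask sampling described above, assuming a single hidden layer with a sigmoid nonlinearity; the shapes and parameter names are illustrative, and deep learning frameworks provide the same behaviour through built-in layers such as torch.nn.Dropout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_with_dropout(x, W1, b1, W2, b2, rng, p_input=0.8, p_hidden=0.5):
    """One training-time forward pass with dropout for a one-hidden-layer network.

    A fresh binary mask is sampled for every example, independently for each
    unit: inputs are kept with probability 0.8 and hidden units with
    probability 0.5, as on the slide. Output units are never dropped.
    """
    input_mask = rng.random(x.shape) < p_input      # independent Bernoulli mask per input unit
    x = x * input_mask

    h = sigmoid(x @ W1 + b1)
    hidden_mask = rng.random(h.shape) < p_hidden    # independent Bernoulli mask per hidden unit
    h = h * hidden_mask

    return h @ W2 + b2                              # output layer (no mask)

# Example with assumed shapes: a minibatch of 32 examples, 784 inputs, 128 hidden units.
rng = np.random.default_rng(0)
x = rng.random((32, 784))
W1, b1 = 0.01 * rng.standard_normal((784, 128)), np.zeros(128)
W2, b2 = 0.01 * rng.standard_normal((128, 10)), np.zeros(10)
out = forward_with_dropout(x, W1, b1, W2, b2, rng)  # a new mask is drawn on every call
```

At test time no masks are sampled; instead the weights (or activations) are scaled by the inclusion probabilities so that the expected input to each unit matches what it saw during training.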
Ø Bagging/Ensemble models
• Bagging (short for bootstrap aggregating) is a technique for reducing
generalization error by combining several models.
• Bagging is defined as follows:
• Train k different models on k different subsets of training data, constructed to
have the same number of examples as the original dataset through random
sampling from that dataset with replacement.
• Have all of the models vote on the output for test examples.
• Techniques employing bagging are called ensemble models; a minimal sketch of the procedure follows below.
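The sketch below follows the recipe above; the base learner (a scikit-learn decision tree) and integer class labels are assumptions for illustration, and with neural networks the same idea means training k networks independently and voting or averaging over their predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # stand-in base model; any learner with fit/predict works

def bagging_fit(X, y, k=10, seed=0):
    """Train k models, each on a bootstrap sample the same size as the original dataset."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)            # sample n examples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Have all models vote and return the majority class for each test example."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)   # shape (k, n_test); integer labels assumed
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```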
Ø Bagging/Ensemble models
• Bagging works because different models will usually not all make the same errors
on the test set.
• This is a direct result of training on k different subsets of the training data, each
subset missing some of the examples from the original dataset.
• Other factors, such as differences in random initialization, random selection of
mini-batches, differences in hyperparameters, or different outcomes of non-
deterministic neural network implementations, are often enough to cause
different members of the ensemble to make partially independent errors.
Ø Early stopping
• When training models with sufficient representational capacity to overfit the
task, we often observe that training error decreases steadily over time while the
error on the validation set begins to rise again.
• This behaviour is almost certain to occur in the applications we care about.
• This means we can obtain a model with better validation set error (and thus,
hopefully, better test set error) by returning to the parameter setting at the
point in time with the lowest validation set error.
• This is termed Early Stopping; a sketch of the procedure follows below.
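A minimal sketch of this procedure; the train_step and validate callables, the patience threshold, and the use of deepcopy to snapshot parameters are assumptions made for illustration.

```python
import copy

def train_with_early_stopping(model, train_step, validate, max_epochs=100, patience=10):
    """Generic early-stopping loop.

    train_step(model) runs one epoch of training and validate(model) returns
    the validation error; both are hypothetical callables assumed here.
    """
    best_error = float("inf")
    best_model = copy.deepcopy(model)       # snapshot of the best parameters seen so far
    epochs_since_improvement = 0

    for epoch in range(max_epochs):
        train_step(model)
        val_error = validate(model)

        if val_error < best_error:
            best_error = val_error
            best_model = copy.deepcopy(model)
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break                       # validation error stopped improving

    # Return to the parameter setting with the lowest validation set error.
    return best_model, best_error
```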
Thank you!