Dropout in Deep Learning
Deep neural networks have many layers, sometimes very many, and try to generalize from the given dataset. But in this pursuit of learning different features from the dataset, they sometimes learn the statistical noise in it. This improves performance on the training dataset but fails badly on new data points (the test dataset). This is the problem of overfitting. To tackle it we have various regularization techniques that penalize the weights of the network, but these alone are often not enough.
The best way to reduce overfitting in a fixed-size model would be to average the predictions from all possible settings of the parameters and aggregate them into a final output. But this is far too computationally expensive and isn't feasible for real-time inference/prediction.
What is Dropout?
The term "dropout" refers to dropping out nodes (in the input and hidden layers) of a neural network (as seen in Figure 1). All the forward and backward connections of a dropped node are temporarily removed, creating a new network architecture out of the parent network. The nodes are dropped with a dropout probability of p.
Let's try to understand this with a given input x = {1, 2, 3, 4, 5} to a fully connected layer, and a dropout layer with probability p = 0.2 (i.e. keep probability = 0.8). During forward propagation (training), 20% of the input nodes are dropped, i.e. set to zero, so x could become {1, 0, 3, 4, 5} or {1, 2, 0, 4, 5}, and so on.
The same applies to the hidden layers. For instance, if a hidden layer has 1000 neurons (nodes) and dropout is applied with drop probability = 0.5, then on average 500 neurons are randomly dropped in every iteration (batch).
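As a minimal sketch of this behaviour (assuming PyTorch's nn.Dropout), the snippet below applies dropout with p = 0.2 to a small input vector. Note that PyTorch also rescales the surviving elements by 1/(1 - p) during training, so the printed values are scaled copies of the originals, and which elements are zeroed is random.

import torch
import torch.nn as nn

x = torch.tensor([1., 2., 3., 4., 5.])
dropout = nn.Dropout(p=0.2)   # drop probability 0.2, keep probability 0.8

dropout.train()               # dropout is only active in training mode
y = dropout(x)
# Each element is zeroed independently with probability 0.2 and the survivors
# are scaled by 1 / (1 - p) = 1.25, e.g. tensor([1.25, 0.00, 3.75, 5.00, 6.25])
print(y)

dropout.eval()                # in evaluation mode dropout is a no-op
print(dropout(x))             # tensor([1., 2., 3., 4., 5.])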
Generally, for the input layer the keep probability, i.e. 1 - drop probability, is kept close to 1, with 0.8 suggested by the authors. For the hidden layers, the greater the drop probability the sparser the model; a drop probability of 0.5, i.e. dropping 50% of the nodes, is what the authors found to work best.
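As a rough sketch of these suggested rates (assuming PyTorch, with hypothetical layer sizes of 784 inputs, 1000 hidden units, and 10 outputs), the model below uses p = 0.2 on the input and p = 0.5 on the hidden layers:

import torch.nn as nn

model = nn.Sequential(
    nn.Dropout(p=0.2),         # input layer: keep probability 0.8
    nn.Linear(784, 1000),
    nn.ReLU(),
    nn.Dropout(p=0.5),         # hidden layer: drop 50% of the 1000 units each batch
    nn.Linear(1000, 1000),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(1000, 10),
)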
In the overfitting problem, the model learns the statistical noise. To be precise, the main objective of training is to decrease the loss function, given all the units (neurons). So, in overfitting, a unit may change in a way that fixes up the mistakes of the other units. This leads to complex co-adaptations, which in turn lead to overfitting, because these complex co-adaptations fail to generalize to the unseen dataset.
Now, if we use dropout, it prevents these units from fixing up the mistakes of other units, and thus prevents co-adaptation, because in every iteration the presence of any given unit is unreliable. By randomly dropping a few units (nodes), dropout forces each unit to take more responsibility for the input on its own, in a probabilistic way.
This makes the model generalize better and hence reduces the overfitting problem.
Figure 2: (a) Hidden layer features without dropout; (b) Hidden layer features with
dropout
From Figure 2, we can see that the hidden layer trained with dropout learns more generalized features than the co-adapted features in the layer trained without dropout. It is quite apparent that dropout breaks such inter-unit relations and pushes the network towards generalization.
Dropout Implementation
Figure 3: (a) A unit (neuron) during training is present with probability p (here p denotes the probability of being retained, as in the original paper) and is connected to the next layer with weights w; (b) a unit during inference/prediction is always present and is connected to the next layer with weights pw
In the standard neural network, during forward propagation we have the following equation for layer (l + 1):

z = w · y + b

where:
z: the vector of outputs from layer (l + 1) before activation
y: the vector of outputs from layer l
w: the weights connecting layer l to layer (l + 1)
b: the corresponding bias
Further, applying the activation function to z gives the output of layer (l + 1).
Figure 6: Comparison of the dropout network with the standard network for a given
layer during forward propagation
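As a minimal sketch of this comparison (using PyTorch tensors, with p denoting the retention probability as in Figure 3, not the article's exact figure), the dropout network multiplies the outputs of layer l by a Bernoulli mask during training, and scales the weights by p at inference so the expected input to the next layer stays the same. (PyTorch's nn.Dropout instead uses "inverted" dropout, scaling by 1/(1 - drop probability) during training, so no scaling is needed at test time.)

import torch

p = 0.8                                   # retention (keep) probability
y = torch.randn(5)                        # outputs of layer l
w = torch.randn(3, 5)                     # weights into layer (l + 1)
b = torch.randn(3)                        # bias of layer (l + 1)

# Standard network: z = w · y + b
z_standard = w @ y + b

# Dropout network, training time: mask the outputs of layer l first
r = torch.bernoulli(torch.full((5,), p))  # r_j = 1 with probability p, else 0
y_tilde = r * y
z_train = w @ y_tilde + b

# Dropout network, inference time: units always present, weights scaled by p
z_test = (p * w) @ y + b                  # equals the expectation of z_train over the random mask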
2. Bank Teller: The tellers at the bank kept changing regularly, presumably because successfully defrauding the bank would require cooperation between employees. This planted the idea of randomly selecting different neurons so that every iteration uses a different set of them. It ensures that neurons cannot learn co-adaptations, and so prevents overfitting, much like preventing conspiracies at the bank.
In the image below, we apply dropout to the second hidden layer of a neural network.
Source: mohcinemadkour.github.io
What’s Dropout?
In machine learning, “dropout” refers to the practice of disregarding certain
nodes in a layer at random during training. Dropout is a regularization approach that prevents overfitting by ensuring that no units are codependent with one another.
Dropout Regularization
If you train your model too much on the training data, it might overfit, and it will then probably not perform well when you make predictions on the actual test data. Dropout regularization is one technique used to tackle overfitting in deep learning.
Dropout Implementation
Using the torch. nn, you can easily add a dropout to your PyTorch models. The
dropout class accepts the dropout rate (the likelihood of a neuron being
deactivated) as a parameter.
self.dropout = nn.Dropout(0.25)
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, input_shape=(3, 32, 32)):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3)
        self.conv2 = nn.Conv2d(32, 64, 3)
        self.conv3 = nn.Conv2d(64, 128, 3)
        self.pool = nn.MaxPool2d(2, 2)
        n_size = self._get_conv_output(input_shape)
        self.fc1 = nn.Linear(n_size, 512)
        self.fc2 = nn.Linear(512, 10)
        self.dropout = nn.Dropout(0.25)    # dropout layer with 25% drop probability

    # Assumed implementation of the conv feature extractor referenced above:
    # conv -> relu -> pool, three times (not defined in the original snippet)
    def _forward_features(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        return x

    # Run a dummy input through the conv stack to get the flattened feature size
    def _get_conv_output(self, shape):
        with torch.no_grad():
            out = self._forward_features(torch.zeros(1, *shape))
        return int(out.view(1, -1).size(1))

    def forward(self, x):
        x = self._forward_features(x)
        x = x.view(x.size(0), -1)
        x = self.dropout(x)                # apply dropout to the flattened conv features
        x = F.relu(self.fc1(x))
        x = self.dropout(x)                # apply dropout again before the output layer
        x = self.fc2(x)
        return x
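Because the dropout layer is only active in training mode, you switch it on and off through the model's mode; a brief usage sketch:

model = Net()

model.train()   # dropout active: 25% of the fully connected inputs are zeroed each forward pass
# ... training loop ...

model.eval()    # dropout disabled: the dropout layers act as the identity at inference
# ... evaluation / prediction ...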
Weight decay: adds a penalty to the loss function to encourage the network to use smaller weights.
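As a minimal sketch (again assuming PyTorch), weight decay can be applied through the optimizer's weight_decay argument, which adds an L2 penalty on the weights:

import torch

# weight_decay adds an L2 penalty on the weights to each update step
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)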