Dropout in Deep Learning
Deep neural networks have many layers, sometimes very many, and try to generalize from the given dataset. But in this pursuit of learning different features from the dataset, they sometimes learn the statistical noise in it. This improves performance on the training dataset but fails badly on new data points (the test dataset). This is the problem of overfitting. To tackle it we have various regularization techniques that penalize the weights of the network, but these alone are often not enough.
The best way to reduce overfitting in a fixed-size model would be to average the predictions from all possible settings of the parameters and aggregate them into a final output. But this is far too computationally expensive and isn't feasible for real-time inference/prediction.
What is Dropout?
The term "dropout" refers to dropping out nodes (in the input and hidden layers) of a neural network (as seen in Figure 1). All the forward and backward connections of a dropped node are temporarily removed, creating a new network architecture out of the parent network. The nodes are dropped with a dropout probability of p.
Let's try to understand this with a given input x = {1, 2, 3, 4, 5} to a fully connected layer, and a dropout layer with probability p = 0.2 (i.e. keep probability = 0.8). During forward propagation (training), 20% of the input nodes are dropped, i.e. set to zero, so x could become {1, 0, 3, 4, 5} or {1, 2, 0, 4, 5}, and so on.
The same applies to the hidden layers. For instance, if a hidden layer has 1000 neurons (nodes) and dropout is applied with drop probability = 0.5, then on average 500 neurons are randomly dropped in every iteration (batch).
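As a minimal sketch of this behaviour (assuming PyTorch's nn.Dropout), the snippet below applies dropout with p = 0.2 to a small input vector. Note that PyTorch also rescales the surviving elements by 1/(1 - p) during training, so the printed values are scaled copies of the originals, and which elements are zeroed is random.

import torch
import torch.nn as nn

x = torch.tensor([1., 2., 3., 4., 5.])
dropout = nn.Dropout(p=0.2)   # drop probability 0.2, keep probability 0.8

dropout.train()               # dropout is only active in training mode
y = dropout(x)
# Each element is zeroed independently with probability 0.2 and the survivors
# are scaled by 1 / (1 - p) = 1.25, e.g. tensor([1.25, 0.00, 3.75, 5.00, 6.25])
print(y)

dropout.eval()                # in evaluation mode dropout is a no-op
print(dropout(x))             # tensor([1., 2., 3., 4., 5.])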
Generally, for the input layer the keep probability, i.e. 1 - drop probability, is kept close to 1, with 0.8 suggested by the authors. For the hidden layers, the greater the drop probability the sparser the model; a drop probability of 0.5, i.e. dropping 50% of the nodes, is what the authors found to work best.
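As a rough sketch of these suggested rates (assuming PyTorch, with hypothetical layer sizes of 784 inputs, 1000 hidden units, and 10 outputs), the model below uses p = 0.2 on the input and p = 0.5 on the hidden layers:

import torch.nn as nn

model = nn.Sequential(
    nn.Dropout(p=0.2),         # input layer: keep probability 0.8
    nn.Linear(784, 1000),
    nn.ReLU(),
    nn.Dropout(p=0.5),         # hidden layer: drop 50% of the 1000 units each batch
    nn.Linear(1000, 1000),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(1000, 10),
)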
In the overfitting problem, the model learns the statistical noise. To be precise, the main objective of training is to decrease the loss function, given all the units (neurons). So, in overfitting, a unit may change in a way that fixes up the mistakes of the other units. This leads to complex co-adaptations, which in turn lead to overfitting, because these complex co-adaptations fail to generalize to the unseen dataset.
Now, if we use dropout, it prevents these units from fixing up the mistakes of other units, and thus prevents co-adaptation, because in every iteration the presence of any given unit is unreliable. By randomly dropping a few units (nodes), dropout forces each unit to take more responsibility for the input on its own, in a probabilistic way.
This makes the model generalize better and hence reduces the overfitting problem.
Figure 2: (a) Hidden layer features without dropout; (b) Hidden layer features with
dropout
From Figure 2, we can see that the hidden layer trained with dropout learns more generalized features than the co-adapted features in the layer trained without dropout. It is quite apparent that dropout breaks such inter-unit relations and pushes the network towards generalization.
Dropout Implementation
Figure 3: (a) A unit (neuron) during training is present with probability p (here p denotes the probability of being retained, as in the original paper) and is connected to the next layer with weights w; (b) a unit during inference/prediction is always present and is connected to the next layer with weights pw
In the standard neural network, during forward propagation we have the following equation for layer (l + 1):

z = w · y + b

where:
z: the vector of outputs from layer (l + 1) before activation
y: the vector of outputs from layer l
w: the weights connecting layer l to layer (l + 1)
b: the corresponding bias
Further, applying the activation function to z gives the output of layer (l + 1).
Figure 6: Comparison of the dropout network with the standard network for a given
layer during forward propagation
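As a minimal sketch of this comparison (using PyTorch tensors, with p denoting the retention probability as in Figure 3, not the article's exact figure), the dropout network multiplies the outputs of layer l by a Bernoulli mask during training, and scales the weights by p at inference so the expected input to the next layer stays the same. (PyTorch's nn.Dropout instead uses "inverted" dropout, scaling by 1/(1 - drop probability) during training, so no scaling is needed at test time.)

import torch

p = 0.8                                   # retention (keep) probability
y = torch.randn(5)                        # outputs of layer l
w = torch.randn(3, 5)                     # weights into layer (l + 1)
b = torch.randn(3)                        # bias of layer (l + 1)

# Standard network: z = w · y + b
z_standard = w @ y + b

# Dropout network, training time: mask the outputs of layer l first
r = torch.bernoulli(torch.full((5,), p))  # r_j = 1 with probability p, else 0
y_tilde = r * y
z_train = w @ y_tilde + b

# Dropout network, inference time: units always present, weights scaled by p
z_test = (p * w) @ y + b                  # equals the expectation of z_train over the random mask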
2. Bank Teller: The tellers at the bank kept changing regularly, presumably because successfully defrauding the bank would require cooperation between employees. This planted the idea of randomly selecting different neurons so that every iteration uses a different set of them. It ensures that neurons cannot learn co-adaptations, and so prevents overfitting, much like preventing conspiracies at the bank.
In the image below, we apply dropout to the second hidden layer of a neural network.
Source: mohcinemadkour.github.io
What’s Dropout?
In machine learning, “dropout” refers to the practice of disregarding certain
nodes in a layer at random during training. Dropout is a regularization approach that prevents overfitting by ensuring that no units are codependent with one another.
Dropout Regularization
If you train your model too much on the training data, it might overfit, and it will then probably not perform well when you make predictions on the actual test data. Dropout regularization is one technique used to tackle overfitting in deep learning.
Dropout Implementation
Using the torch. nn, you can easily add a dropout to your PyTorch models. The
dropout class accepts the dropout rate (the likelihood of a neuron being
deactivated) as a parameter.
self.dropout = nn.Dropout(0.25)
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, input_shape=(3, 32, 32)):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3)
        self.conv2 = nn.Conv2d(32, 64, 3)
        self.conv3 = nn.Conv2d(64, 128, 3)
        self.pool = nn.MaxPool2d(2, 2)
        n_size = self._get_conv_output(input_shape)
        self.fc1 = nn.Linear(n_size, 512)
        self.fc2 = nn.Linear(512, 10)
        self.dropout = nn.Dropout(0.25)    # dropout layer with 25% drop probability

    # Assumed implementation of the conv feature extractor referenced above:
    # conv -> relu -> pool, three times (not defined in the original snippet)
    def _forward_features(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        return x

    # Run a dummy input through the conv stack to get the flattened feature size
    def _get_conv_output(self, shape):
        with torch.no_grad():
            out = self._forward_features(torch.zeros(1, *shape))
        return int(out.view(1, -1).size(1))

    def forward(self, x):
        x = self._forward_features(x)
        x = x.view(x.size(0), -1)
        x = self.dropout(x)                # apply dropout to the flattened conv features
        x = F.relu(self.fc1(x))
        x = self.dropout(x)                # apply dropout again before the output layer
        x = self.fc2(x)
        return x
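Because the dropout layer is only active in training mode, you switch it on and off through the model's mode; a brief usage sketch:

model = Net()

model.train()   # dropout active: 25% of the fully connected inputs are zeroed each forward pass
# ... training loop ...

model.eval()    # dropout disabled: the dropout layers act as the identity at inference
# ... evaluation / prediction ...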
Weight decay: adds a penalty to the loss function to encourage the network to use smaller weights.
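As a minimal sketch (again assuming PyTorch), weight decay can be applied through the optimizer's weight_decay argument, which adds an L2 penalty on the weights:

import torch

# weight_decay adds an L2 penalty on the weights to each update step
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)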