Deep MLPs
1. The biggest problem in training deep networks is the vanishing gradient problem.
2. We had very little labelled data to train the networks, so they could easily overfit.
3. We had too little compute power, so training took a lot of time.
In the 1980s:
These were the main obstacles. By the time we reached 2010, lots of labelled data had been generated by internet companies.
In 2010:
Modern DL:
I also respect theory.
Dropout layers and regularization:
Deep networks have many weights and therefore overfit easily, so we use dropout layers and regularization.
When we build a random forest, each tree looks at only a small random part of the data, is fully grown, and individually overfits.
The random forest uses this randomization as regularization, which reduces the variance of the model.
The core idea is that randomizing the features acts as regularization.
Can we use randomization as regularization for MLPs?
Dropout layers:
Randomly remove neurons between the input and output layers, independently in each training iteration.
Each neuron is retained with a probability p that lies between 0 and 1; with probability 1 - p it is dropped for that iteration. This is very similar to using a random subset of features in a random forest: because of dropout, the set of inputs feeding a given neuron varies at every iteration.
Dropout randomly deactivates parts of the network, and this acts as regularization.
At test time, the network remains intact (no neurons are dropped) and each weight of the network is multiplied by p.
When given a query point, we therefore use the weights scaled by p to compute the prediction.
If we have many more weights than data points, the chances of overfitting are high, so we keep p small, maybe 0.1 or 0.2. Here p is a hyperparameter, and like other hyperparameters it can be tuned by grid search.
A dropout layer is applied after a layer in the neural network; it randomly passes only a subset of that layer's outputs on to the next layer.
People call such a network a dropout network.
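A minimal NumPy sketch of this dropout scheme (keep each neuron with probability p during training, multiply by p at test time); the layer size, seed, and p value here are just illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                      # keep probability (hyperparameter, assumed value)
h = rng.normal(size=(4, 8))  # activations of some hidden layer: 4 samples, 8 neurons

# Training time: randomly deactivate each neuron with probability 1 - p.
mask = rng.random(h.shape) < p   # True = keep the neuron, False = drop it
h_train = h * mask               # dropped neurons output zero for this iteration

# Test time: keep the full network, but scale by p so the expected input
# to the next layer matches what it saw during training.
h_test = h * p
```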
ReLU (Rectified Linear Unit):
ReLU is the default activation function implemented in many modern networks.
The slope of ReLU is always either 1 (for positive inputs) or 0 (for negative inputs), just like the hinge function in SVMs. The derivative at zero is not defined.
(In the reference plot, the solid line is a model trained using ReLU and the dotted line one trained with tanh.)
Because ReLU largely avoids the vanishing/exploding gradient problem, the network converges faster than with tanh.
There is a smooth approximation to ReLU called the softplus function; its derivative is the logistic function, but it is not widely used. For ReLU itself, computing the derivative is also much simpler.
If z is negative, the derivative becomes zero, which makes the whole chain-rule product zero.
The weights then stop changing whenever the input z stays negative, which is not what we want. This is called the dead activation (dead neuron) state.
The input to the NN is always normalized.
The fix for this problem is to give a small non-zero slope to negative inputs instead of making them exactly zero.
Typically people use ReLU; if we find many dead neurons in the NN, we tend to switch to Leaky ReLU.
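A small NumPy sketch of ReLU and Leaky ReLU and their derivatives; the leak slope of 0.01 is just a commonly used assumption, not something fixed by the notes above:

```python
import numpy as np

def relu(z):
    # max(0, z): slope 1 for z > 0, slope 0 for z < 0 (undefined exactly at 0)
    return np.maximum(0.0, z)

def relu_grad(z):
    # zero gradient for z <= 0: this is what causes dead neurons
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # small slope alpha for negative z, so the gradient never becomes exactly zero
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), relu_grad(z))
print(leaky_relu(z), leaky_relu_grad(z))
```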
Advantages of ReLU:
Weight Initialization:
For logistic regression trained with SGD, we initialize the weights randomly, drawing them from a Gaussian (normal) distribution.
What can go wrong?
1. We want asymmetry in the NN, so that different neurons learn different things.
In ensembles, the more different the base models are, the better the overall output.
If all weights start with the same value, every neuron learns the same thing, which we do not want.
2. If the weights are initialized to negative values, there is the problem of dead neurons in the case of ReLU and similar activation functions.
Idea - 1:
Uniform initialization:
The weights are drawn from a uniform distribution whose bounds are a function of fan-in and fan-out.
Idea - 2:
This also works fairly well for sigmoid.
Idea - 3:
Xavier/Glorot initialization:
There are two variations of Xavier initialization: the weights are picked from either a normal distribution or a uniform distribution.
Idea - 4:
He - initialization:
This also has normal and uniform variants. It works well with ReLU and Leaky ReLU.
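A sketch of these initializers using their standard formulas (Xavier/Glorot scales by fan-in + fan-out, He scales by fan-in); the layer sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128   # illustrative layer sizes

# Xavier/Glorot: suited to tanh/sigmoid activations.
W_glorot_normal  = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), (fan_in, fan_out))
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_glorot_uniform = rng.uniform(-limit, limit, (fan_in, fan_out))

# He: suited to ReLU / Leaky ReLU activations.
W_he_normal  = rng.normal(0.0, np.sqrt(2.0 / fan_in), (fan_in, fan_out))
limit = np.sqrt(6.0 / fan_in)
W_he_uniform = rng.uniform(-limit, limit, (fan_in, fan_out))
```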
Batch Normalization:
One of the pre-processing steps for a NN is data normalization.
If the input changes slightly, that small change can produce a large change further inside the network.
Example:
Because the input data is normalized, there is little difference between batches at the input, so the lower layers are not affected very much; but the layers deeper in the network are affected much more.
Ideally we want the inputs to every layer to be normalized, since every layer sees a different distribution. Otherwise the activations at each layer can drift wildly; this problem is called internal covariate shift.
The solution is to add a new layer called a Batch Normalization layer: whenever we get a batch of inputs, we normalize using only that batch.
We are explicitly normalizing the inputs to each layer, so it works even for layers deep in the network.
The batch-norm layer sits between two layers. It has two learnable parameters, gamma (scale) and beta (shift), and we learn these parameters as part of backpropagation.
BN also acts as a regularizer, and with BN we can train deeper neural networks.
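A minimal NumPy sketch of the batch-norm forward pass for one mini-batch: normalize with that batch's mean and variance, then scale and shift with the learnable gamma and beta (the backpropagation through this layer is not shown; the toy batch is an assumption):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features) activations coming out of the previous layer
    mu  = x.mean(axis=0)                    # per-feature mean over this mini-batch only
    var = x.var(axis=0)                     # per-feature variance over this mini-batch only
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero-mean, unit-variance activations
    return gamma * x_hat + beta             # learnable scale (gamma) and shift (beta)

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(32, 4))      # a badly scaled batch of activations
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```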
In the case of deep learning, plain batch and mini-batch SGD do not work very well, so we use other optimizers, designed primarily for deep learning, that help escape regions where the derivative is close to zero.
Because the loss is a non-convex function, depending on the initial weights we can end up in a different minimum.
Gradient descent always moves towards a minimum, using all n points to compute each gradient step.
Using SGD:
If we run lots of iterations of SGD, we can still reach the minimum. However, SGD only estimates the gradient from a small random sample, so each update depends on a noisy derivative; the updates are much noisier than in full gradient descent.
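A toy sketch contrasting the two update rules on a squared loss for a linear model; the data, batch size, and learning rate are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 3)), rng.normal(size=1000)   # toy data
w, lr = np.zeros(3), 0.01

def grad(Xb, yb, w):
    # gradient of mean squared error for a linear model, on the given (mini)batch
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# Gradient descent: one update uses all n points (exact but expensive gradient).
w_gd = w - lr * grad(X, y, w)

# Mini-batch SGD: one update uses a small random sample (noisy but cheap gradient).
idx = rng.choice(len(y), size=32, replace=False)
w_sgd = w - lr * grad(X[idx], y[idx], w)
```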
Adagrad:
In SGD the learning rate is set to a small value and is the same for every weight.
Each weight has a different learning rate in Adagrad.
In real datasets, some features are dense and some are sparse, so a single shared learning rate is not ideal.
The accumulated sum of squared gradients keeps growing, so the effective per-weight learning rate (alpha) can become very small; then the weights barely change and convergence becomes very slow. This problem is fixed by the next algorithms.
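A sketch of the Adagrad update for one weight vector: accumulate squared gradients per weight and divide the learning rate by the square root of that accumulator, which also shows why the updates can shrink towards zero over time; the values are illustrative:

```python
import numpy as np

def adagrad_step(w, g, accum, lr=0.01, eps=1e-8):
    # accum keeps the running sum of squared gradients, per weight
    accum = accum + g * g
    # effective learning rate shrinks as accum grows -> updates can stall over time
    w = w - lr * g / (np.sqrt(accum) + eps)
    return w, accum

w, accum = np.zeros(3), np.zeros(3)
for g in [np.array([0.5, 0.0, 1.0]), np.array([0.4, 0.0, 0.9])]:  # toy gradients
    w, accum = adagrad_step(w, g, accum)
```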
Adadelta and RMSProp:
AdaDelta:
Instead of accumulating the sum of all past squared gradients (as in Adagrad), we take an exponentially decaying average. In a nutshell: take an exponentially weighted average of gradient^2, rather than the sum of squares of all past gradients.
Adadelta therefore has faster convergence.
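A sketch of the RMSProp-style update described above: an exponentially decaying average of squared gradients instead of Adagrad's growing sum; the decay value 0.9 is a common default, assumed here:

```python
import numpy as np

def rmsprop_step(w, g, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    # exponentially decaying average of gradient^2 instead of a running sum
    avg_sq = decay * avg_sq + (1.0 - decay) * g * g
    w = w - lr * g / (np.sqrt(avg_sq) + eps)
    return w, avg_sq
```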
Adam:
It is the most popular optimization algorithm.
What if we also store an exponentially weighted average of the gradients themselves, in addition to the average of the squared gradients? That is the idea behind Adam.
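A sketch of the Adam update: an exponentially weighted average of the gradients (first moment) and of the squared gradients (second moment), with the usual bias correction; the beta values are the commonly used defaults:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g          # EWA of the gradients
    v = beta2 * v + (1.0 - beta2) * g * g      # EWA of the squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```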
Softmax Classifier:
Here y_i belongs to one of K classes, and the predicted probabilities over all K classes sum to 1.
In the case of a softmax classifier, the output of each of the K output neurons is calculated as follows:
Formulation:
P(y_i = k | x_i) = exp(z_k) / sum_j exp(z_j), where z_k is the raw score (logit) for class k.
This satisfies our requirement: the outputs are non-negative and sum to 1.
Softmax is the generalization of logistic regression to the multi-class setting.
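A small sketch of the softmax computation on the raw scores (logits), shifted by the max for numerical stability; the scores are made up:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the outputs sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # raw scores for K = 3 classes
p = softmax(z)                  # class probabilities, largest for class 0
```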
It is very important to always monitor the gradients and to apply gradient clipping when they explode.
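A sketch of gradient clipping by L2 norm; the threshold of 5.0 is an assumed value:

```python
import numpy as np

def clip_by_norm(g, max_norm=5.0):
    # rescale the gradient vector if its L2 norm exceeds max_norm
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g
```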
Autoencoders: These perform dimensionality reduction using a NN.
For example, suppose we want a three-dimensional representation of a six-dimensional input.
The output we train the network to predict is X itself; if it can reconstruct X well, we can conclude that the middle layer preserves the information in the input.
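A minimal sketch of the 6 -> 3 -> 6 autoencoder described above, assuming TensorFlow/Keras is available; the data here is random and only for illustration:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

inputs  = keras.Input(shape=(6,))
code    = layers.Dense(3, activation="relu")(inputs)    # 3-D bottleneck (the representation)
outputs = layers.Dense(6, activation="linear")(code)    # reconstruct the 6-D input

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, 6)                             # made-up data for illustration
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)  # the target is X itself
```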
Autoencoders reference:
https://en.wikipedia.org/wiki/Autoencoder
People intentionally add noise to the input; the network learns the underlying data and ignores the noise, which gives a robust, noise-free (denoising) encoder.
Sparse AE:
We apply a loss function with L1 regularization (typically on the hidden activations); adding this L1 regularization gives a sparse autoencoder.
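A sketch of how the L1 penalty on the bottleneck activations could be added in Keras via an activity regularizer; the penalty weight 1e-4 is an assumption:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

inputs  = keras.Input(shape=(6,))
# L1 penalty on the hidden activations pushes most of them towards zero (sparsity)
code    = layers.Dense(3, activation="relu",
                       activity_regularizer=regularizers.l1(1e-4))(inputs)
outputs = layers.Dense(6, activation="linear")(code)

sparse_ae = keras.Model(inputs, outputs)
sparse_ae.compile(optimizer="adam", loss="mse")
```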
Autoencoders are used for better, unsupervised feature representations and for extracting the important features in the data.