Deep Neural Network
• If performance is not good, you may drop or replace the upper layers for good performance.
Normalization
• In some cases, features will have widely varying values, which may lead to inconsistent results.
• For example, observe the following table, where wine quality is being assessed.
• However, the values of “Alcohol” and “Malic” differ hugely in magnitude.
• In such cases, the system may fail to perform properly.
• Hence, normalizing all the features to the range 0 to 1 is required.
• Min-max normalization is popularly used; observe that the values of Alcohol and Malic are now normalized between 0 and 1 (a minimal sketch appears at the end of this section).
• This is done as soon as the input is fed into the system, before the summation and activation function. It is also called pre-processing in a neural network.
• In the example given below, age and number of miles driven are two parameters.
• They are on different scales.
• If they are used as-is without normalization, this may lead to an imbalance in the neural network.
• To handle this we need to normalize the data.
• On the right-hand side you have the same data, now normalized.
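A minimal sketch of min-max normalization in Python, assuming NumPy and made-up values for two features on very different scales (like age and miles driven above):

import numpy as np

# Hypothetical data: two features on very different scales
# (e.g., age in years and miles driven), values invented for illustration.
X = np.array([
    [25.0, 40_000.0],
    [40.0, 120_000.0],
    [33.0, 65_000.0],
    [55.0, 200_000.0],
])

# Min-max normalization: scale each feature (column) to the range [0, 1].
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)

print(X_norm)  # every column now lies between 0 and 1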
Batch size and epoch
• The weights are updated after every batch of data, i.e., after one iteration.
• For example, if you have 1000 samples and you set a batch size of 200, then the neural network’s weights get updated after every 200 samples.
• An epoch completes after it has seen the full data set, so in the example above, in 1
epoch, the neural network gets updated 5 times.
• Gradient descent is the process of updating the weights with reference to the loss.
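A quick sketch of this arithmetic in Python (the sample count and batch size are the ones from the example above):

# 1000 samples with a batch size of 200 -> 5 weight updates per epoch.
n_samples = 1000
batch_size = 200
updates_per_epoch = n_samples // batch_size
print(updates_per_epoch)  # 5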
Iteration and Epoch
• A forward pass followed by a weight update in backward propagation for one sample is called an ITERATION.
• Say, for example, 10,000 training samples are available; if we decide to update the weights for every sample, we will have 10,000 iterations, and this is called one EPOCH.
General Gradient Descent
and
Stochastic Gradient Descent (SGD)
• In general (batch) Gradient Descent, the loss is collected for all the training samples and the weights are updated by taking the average of all the losses.
• Say we have 10,000 samples to be trained; then we will not have 10,000 iterations. We collect the loss of all 10,000 samples. This completes one epoch, and at the end of one epoch the weights are updated.
• The problem with the above method is that if the training data is too huge, like 10 lakh or 50 lakh (1 or 5 million) samples, then we need huge RAM space to load all the samples, and space is also required to hold the loss values for all the samples. As a solution, researchers found another method called Stochastic Gradient Descent.
• In Stochastic Gradient Descent (SGD), the weights are updated in every iteration, i.e., after every single sample. Though it requires less memory, it is time-consuming to reach the global minimum of the error.
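A minimal sketch contrasting the two update schedules, using an assumed toy one-weight linear model (y ≈ w·x) and NumPy; none of this code comes from the slides:

import numpy as np

# Toy data invented for illustration: y = 3x + noise, model y_hat = w * x.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + 0.1 * rng.normal(size=200)

# General (full-batch) Gradient Descent:
# gradients of ALL samples are averaged, ONE weight update per epoch.
w, lr = 0.0, 0.1
for epoch in range(20):
    grad = np.mean(2 * (w * x - y) * x)  # average gradient of the squared error
    w -= lr * grad                       # single update at the end of the epoch
print("GD weight:", w)

# Stochastic Gradient Descent (SGD):
# the weight is updated after EVERY sample (every iteration).
w, lr = 0.0, 0.05                        # smaller step, since updates are per sample
for epoch in range(20):
    for xi, yi in zip(x, y):
        grad = 2 * (w * xi - yi) * xi    # gradient for one sample
        w -= lr * grad                   # update immediately (noisy path)
print("SGD weight:", w)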
Solution to SGD is the ‘Mini Batch’
• Researchers have introduced a technique called “Mini Batch” or “Mini Batch SGD”.
• In mini-batch SGD, a batch of training data is considered for each weight update; the number of batches per epoch is the iteration count.
• For example, if we have 10,000 training samples, and if batch size is 1000,
weights will be updated after every 1000 samples are trained.
• In this case to complete one epoch, we need to have 10 iterations.
• In other words, in one epoch the weights are updated 10 times, but this might introduce some noise, as shown in the diagram.
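A minimal mini-batch SGD sketch for the numbers above (10,000 samples, batch size 1000, so 10 updates per epoch), again using an assumed toy one-weight model:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = 3.0 * x + 0.1 * rng.normal(size=10_000)

w, lr, batch_size = 0.0, 0.1, 1000

for epoch in range(5):
    order = rng.permutation(len(x))             # shuffle once per epoch
    for start in range(0, len(x), batch_size):  # 10 iterations per epoch
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        grad = np.mean(2 * (w * xb - yb) * xb)  # average gradient over the batch
        w -= lr * grad                          # one update per batch (iteration)
print("Mini-batch SGD weight:", w)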
Contour Plot
• If you draw a contour plot, which is the top view of the gradient descent, the path will be smooth for GD (red colour), not smooth for mini-batch SGD (black colour), and noisiest for SGD (blue colour), which will have more error. The left-hand side picture is an illustration of a contour plot drawn using Python.
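A minimal Matplotlib sketch of such a contour ("top view") plot, using an assumed toy quadratic loss surface rather than the slide's actual data:

import numpy as np
import matplotlib.pyplot as plt

# Toy loss surface, invented for illustration: loss(w1, w2) = w1^2 + 10*w2^2.
w1, w2 = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
loss = w1**2 + 10 * w2**2

plt.contour(w1, w2, loss, levels=20)  # contour lines = top view of the loss surface
plt.xlabel("w1")
plt.ylabel("w2")
plt.title("Contour plot of a toy loss surface")
plt.show()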
Illustration of noise: while descending the hill (the error surface), SGD or mini-batch SGD will have noise, which can be smoothed using techniques called optimizers.
Fast Optimizer – Momentum Optimizer
• Consider the physics-inspired solution to smooth the velocity:
• If β is 0.95, then the weight given to the current step is (1 − β) = 0.05, and hence the velocity of movement will be smoothed.
• Similarly, the weights are updated in momentum optimization:
• The computation below shows the weight calculation for a single weight. It can also be applied to the bias. W(t−1) is the old weight W_old.
• V_dW is the exponentially weighted average of the weight change (gradient).
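The slide's formula image is not reproduced here; a commonly used formulation consistent with the description above is V_dW = β·V_dW + (1 − β)·dW, followed by W_new = W_old − η·V_dW. A minimal sketch on an assumed toy one-weight model:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 3.0 * x + 0.1 * rng.normal(size=500)

w, v_dw = 0.0, 0.0
lr, beta = 0.1, 0.9

for epoch in range(200):
    dw = np.mean(2 * (w * x - y) * x)     # full-batch gradient for the toy model
    v_dw = beta * v_dw + (1 - beta) * dw  # exponentially weighted (smoothed) gradient
    w -= lr * v_dw                        # update with the smoothed gradient
print("Weight with momentum:", w)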
Final concept of GD with Momentum for fast
and smooth optimization
Summary of GD, SGD and Mini Batch SGD
• GD – the weight update is done after all samples are passed through the model.
• SGD – the weight update takes place for every sample.
• Mini Batch – the weight update takes place for every batch.
• Batch size should not be too small or too big. Depending on the available sample size, the programmer should decide the batch size.
AdaGrad Faster Optimizer
• AdaGrad – Adaptive Gradient.
• In Adagrad Optimizer the core idea is that each weight has a different
learning rate (η).
• This modification has great importance.
• In real-world datasets, some features are sparse (for example, in Bag of Words most of the feature values are zero, so it is sparse) and some are dense (most of the feature values are non-zero).
• So keeping the same value of the learning rate for all the weights is not good for optimization. The weight update formula for AdaGrad is given below:
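The slide's formula image is not reproduced here; a commonly stated form of the AdaGrad update accumulates each weight's squared gradients and divides its learning rate by the square root of that accumulator: cache += dW², then W = W − (η / (√cache + ε))·dW. A minimal sketch on made-up data with one dense and one sparse feature:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
X[:, 1] *= rng.random(500) < 0.05          # make the second feature sparse
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=500)

w = np.zeros(2)
cache = np.zeros(2)                        # per-weight accumulator of squared gradients
lr, eps = 0.5, 1e-8

for epoch in range(200):
    dw = 2 * X.T @ (X @ w - y) / len(y)    # full-batch gradient of the squared error
    cache += dw ** 2                       # accumulate squared gradients per weight
    w -= lr * dw / (np.sqrt(cache) + eps)  # each weight gets its own effective rate
print("AdaGrad weights:", w)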