
UNIT - 2

Deep Neural Network


Unit – 2 : Syllabus
• Deep Neural network:
• Introduction,
• Vanishing Gradient problems,
• Reusing Pretrained layers,
• Faster optimizers,
• Avoiding overfitting through regularization
Deep Neural Networks
• Introduction:
• A neural network with two or more hidden layers can be called a Deep Neural Network.
• While handling a complex problem such as detecting hundreds of types of objects in high-resolution images, you may need to train a much deeper DNN, with perhaps 10 layers, each containing hundreds of neurons, connected by hundreds of thousands of connections.
• Training such a network leads to the problem of vanishing gradients.
Training the Neural Network
Back propagation
• Backpropagation is the essence of neural network training. It is the
method of fine-tuning the weights of a neural network based on the
error rate obtained in the previous epoch (i.e., iteration).
• Proper tuning of the weights allows you to reduce error rates and make
the model reliable by increasing its generalization.

• Backpropagation is short for “backward propagation of errors.” It is a standard method of training artificial neural networks. This method helps calculate the gradient of a loss function with respect to all the weights in the network.
Back propagation is the method of adjusting
weights after computing the loss value.
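• As a minimal sketch (not part of the original slides), the NumPy code below backpropagates the error of a single sigmoid neuron through the chain rule and then updates its weights; the inputs, target, initial weights, and learning rate are all assumed values for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy data: 3 inputs, 1 target (illustration only)
x = np.array([1.0, 0.5, -0.2])
y_true = 1.0
w = np.array([0.1, -0.3, 0.2])   # initial weights
b = 0.0                          # initial bias
lr = 0.1                         # learning rate

# Forward pass
z = np.dot(w, x) + b
y_pred = sigmoid(z)
loss = 0.5 * (y_true - y_pred) ** 2

# Backward pass (chain rule): dL/dw = dL/dy * dy/dz * dz/dw
dL_dy = -(y_true - y_pred)
dy_dz = y_pred * (1.0 - y_pred)   # derivative of the sigmoid
grad_w = dL_dy * dy_dz * x        # dz/dw = x
grad_b = dL_dy * dy_dz            # dz/db = 1

# Weight update after computing the loss (gradient descent step)
w -= lr * grad_w
b -= lr * grad_b
```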
Chain rule in backpropagation
Backpropagate to change the weight
Problem (compute the output y): apply the activation function
• Calculate the output ‘y’ of a 3-input neuron with bias, with the data as given in the diagram. Use the sigmoid activation function.
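• The diagram's values are not reproduced in this text, so the snippet below uses assumed inputs, weights, and bias purely to illustrate how such an output would be computed:

```python
import numpy as np

# Assumed example values (the original diagram's data is not available here)
x = np.array([0.8, 0.6, 0.4])      # three inputs
w = np.array([0.5, -0.2, 0.3])     # corresponding weights
b = 0.35                           # bias

z = np.dot(w, x) + b               # weighted sum plus bias
y = 1.0 / (1.0 + np.exp(-z))       # sigmoid activation
print(f"net input z = {z:.3f}, output y = {y:.3f}")
```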
Gradient Descent
Compute the weights
Vanishing Gradient
• Back propagation algorithm works by going from the output layer to
the input layer, propagating the error gradient on the way.
• Gradients often get smaller and smaller as the algorithm progresses
down to the lower layers.
• As a result, the gradient descent update leaves the lower-layer weights virtually unchanged.
• This means training never converges to a good solution.
• This is called the vanishing gradients problem.
Exploding Gradients
• Sometimes the gradients can grow bigger and bigger, so many layers get very large weight updates and the algorithm diverges.
• This is called the exploding gradients problem, which is most commonly seen in recurrent neural networks.
Gradient clipping
• A popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold (this is mostly useful for recurrent neural networks).
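• As a minimal sketch (assuming TensorFlow/Keras is being used), gradient clipping can be requested when creating the optimizer; the threshold values here are arbitrary:

```python
import tensorflow as tf

# Clip each individual gradient value to the range [-1.0, 1.0]
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)

# Or rescale the whole gradient vector if its L2 norm exceeds 1.0
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
```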
Reusing Pretrained Layers
• Pretrained models are saved models, together with their weights, obtained after training.
• They are usually very deep neural network models.
• They are trained on very large, general datasets for large classification tasks.
Reusing Pretrained Layers
• It is generally not a good idea to train a very large DNN from scratch.
• Instead, you should always try to find an existing neural network that accomplishes a task similar to the one you are trying to tackle.
• Then just reuse the lower layers of this network; this is called Transfer Learning.
• It will not only speed up training considerably, but will also require much less training data.
Reusing Pretrained Layers
• For example, suppose that you have access to a DNN that was trained
to classify pictures into 100 different categories, including animals,
plants, vehicles, and everyday objects.
• You now want to train a DNN to classify specific types of vehicles.
These tasks are very similar, so you should try to reuse parts of the
first network.
Observations to be made while training the lower layers
• If the input pictures of your new task don’t have the same size as the ones used in the original task, you will have to add a preprocessing step to resize them to the size expected by the original model. More generally, transfer learning will only work well if the inputs have similar low-level features.
Pre-trained Models – How are they useful?
• Lower layers learn basic features, such as colors and lines at various angles, from very large and general sets of training images.
• These lower-layer features are almost the same for most of our tasks; changes are needed only in the upper layers, hence training is done only for the upper layers.
• Examples of Pre-trained models:
• Xception
• VGG16
• VGG19
• ResNet50
• InceptionV3
• InceptionResNetV2
• MobileNet
• MobileNetV2
• DenseNet
• NASNet
VGG 16 Layers.

Freezing the lower layers
• It is likely that the lower layers of the first DNN have learned to detect low-level features in pictures that will be useful across both image classification tasks, so you can just reuse these layers as they are.

• It is generally a good idea to “freeze” their weights.

• If the lower-layer weights are fixed, the higher-layer weights will be easier to train.
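• As a minimal sketch (assuming TensorFlow/Keras and the VGG16 model listed earlier), the pretrained lower layers can be reused and frozen while a new top is trained; the input shape and the 10 output classes are assumptions for illustration:

```python
import tensorflow as tf

# Reuse the convolutional base of VGG16 pretrained on ImageNet (without its top layers)
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))

# Freeze the pretrained lower layers so their weights are not updated during training
base.trainable = False

# Add a new classification head for the new task (10 classes assumed)
x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs=base.input, outputs=outputs)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```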
Tweaking, Dropping or replacing the upper
layers
• The output layer of the original model should usually be replaced, since it is most likely not useful at all for the new task and may not even have the right number of outputs for the new task.

• If performance is still not good, you may drop or replace more of the upper layers for good performance.
Normalization
• In some cases, features will have widely varying value ranges, which may lead to inconsistent results.
• For example, observe the following table, where wine quality is being assessed.
• However, the values of “Alcohol” and “Malic” differ hugely in scale.
• In such cases, the system may fail to perform properly.
• Hence, normalizing all the features to the range 0 to 1 is required.
• Min-max normalization is popularly used; observe that the values of Alcohol and Malic are now normalized between 0 and 1.
• This is done as soon as the input is fed into the system, before the summation and activation function. It is also called pre-processing in a neural network.
• In the example given below, age and number of miles driven are the two parameters.
• They are on different scales.
• If they are used as they are, without normalization, this may lead to an imbalance in the neural network.
• To handle this we need to normalize the data.
• On the right-hand side you have the same data, now normalized (see the sketch below).
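• As a minimal sketch of min-max normalization (the column values below are assumed purely for illustration):

```python
import numpy as np

# Assumed raw feature matrix: columns are [age, miles_driven]
data = np.array([[25.0,  12000.0],
                 [40.0,  85000.0],
                 [60.0, 230000.0]])

# Min-max normalization: x' = (x - min) / (max - min), applied column-wise
col_min = data.min(axis=0)
col_max = data.max(axis=0)
normalized = (data - col_min) / (col_max - col_min)

print(normalized)   # every column is now scaled to the range [0, 1]
```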
Batch size and epoch
• The weights are updated after every batch of data has been processed (one iteration per batch).

• For example, if you have 1,000 samples and you set a batch size of 200, then the neural network’s weights get updated after every 200 samples.

• A batch is also called a ‘mini-batch’.

• An epoch completes after it has seen the full data set, so in the example above, in 1
epoch, the neural network gets updated 5 times.

• Batch size will be fixed based on the processing capacity.


• In the above example, the weights will not be updated until the network has seen 200 training samples.
Batch size = 10
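• As a small sketch of the relationship between sample count, batch size, and weight updates per epoch (using the numbers from the example above):

```python
import math

n_samples = 1000
batch_size = 200

# Number of weight updates (iterations) performed in one epoch
updates_per_epoch = math.ceil(n_samples / batch_size)
print(updates_per_epoch)   # -> 5
```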
Faster Optimizers
• Training a very large deep neural network can be slow.
• Four ways to speed up training:
• Applying good initialization strategy for the connection weights.
• Using a good activation function
• Using Batch Normalization and
• Reusing parts of a pretrained network
• Another huge speed boost comes from using a faster optimizer than
the gradient descent optimizer.
• Some of the popular fast optimizers are Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and Adam optimization.
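• As a minimal sketch (assuming TensorFlow/Keras), these faster optimizers can be instantiated as shown below; the learning rates and momentum values are common defaults, not prescriptions:

```python
import tensorflow as tf

momentum_opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
nesterov_opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9,
                                       nesterov=True)
adagrad_opt  = tf.keras.optimizers.Adagrad(learning_rate=0.01)
rmsprop_opt  = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
adam_opt     = tf.keras.optimizers.Adam(learning_rate=0.001,
                                        beta_1=0.9, beta_2=0.999)
```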
Why use a fast optimizer?
• Let us consider the following case:
Gradient Descent updating…
• When we pass one sample, a loss is computed, for example the squared error between the prediction and the target.
• Gradient descent then updates the weights with reference to this loss: w(new) = w(old) − η · ∂L/∂w, where η is the learning rate.
Iteration and Epoch
• One forward propagation followed by a weight update in the backward propagation, for one sample, is called an ITERATION.

• When iterations have been completed for all the available training samples, it is called ONE EPOCH.

• For example, if 10,000 training samples are available and we decide to update the weights after every sample, we will have 10,000 iterations, and together these make up one EPOCH.
General Gradient Descent
and
Stochastic Gradient Descent (SGD)
• In general (batch) Gradient Descent, the loss is collected for all the training samples and the weights are updated using the average of all the losses.

• Say we have 10,000 samples to be trained; then we will not have 10,000 iterations. We collect the loss of all 10,000 samples, which completes one epoch, and at the end of the epoch the weights are updated.
• The problem with the above method is that if the training data is very large, say 10 lakh (1 million) or 50 lakh (5 million) samples, then huge RAM space is needed to load all the samples, and space is also required to hold the loss values of all the samples. As a solution, researchers came up with another method called Stochastic Gradient Descent.
• In Stochastic Gradient Descent (SGD), the weights are updated in every iteration, i.e. after every single sample. Though it requires less memory, it is time consuming to reach the global minimum of the error.
Solution to SGD is the ‘Mini Batch’
• Researchers introduced a technique called “Mini-Batch” or “Mini-Batch SGD”.
• In mini-batch SGD, a batch of training samples is considered for each weight update; the batch size determines how many iterations make up one epoch.

• For example, if we have 10,000 training samples and the batch size is 1,000, the weights are updated after every 1,000 samples have been trained.
• In this case, completing one epoch takes 10 iterations.

• In other words, in one epoch the weights are updated 10 times, but these updates may be noisy, as shown in the diagram (and in the sketch below).
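• As a minimal sketch (not from the slides), the NumPy loop below trains a simple linear model with mini-batch SGD on assumed toy data; setting batch_size equal to the dataset size gives plain GD, while batch_size = 1 gives SGD:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy regression data: y = 3*x + 2 plus a little noise
X = rng.uniform(-1, 1, size=(10_000, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, size=10_000)

w, b = 0.0, 0.0
lr = 0.1
batch_size = 1_000          # 10 weight updates (iterations) per epoch

for epoch in range(20):
    indices = rng.permutation(len(X))              # shuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        error = (w * xb + b) - yb
        # Gradients of the mean squared error over this mini-batch
        grad_w = 2 * np.mean(error * xb)
        grad_b = 2 * np.mean(error)
        w -= lr * grad_w
        b -= lr * grad_b

print(w, b)   # should approach 3 and 2
```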
Contour Plot
• If you draw a contour plot, which is the top view of the gradient descent path, it is smooth for GD (red), less smooth for mini-batch SGD (black), and noisiest for SGD (blue), which wanders the most. The left-hand side picture is an illustration of a contour plot drawn using Python.
Illustration of noise: while moving down the hill of the loss surface, SGD or mini-batch SGD will have noise, which can be smoothed using techniques called optimizers.
Fast Optimizer – Momentum Optimizer
• Consider the physics-inspired way of smoothing a velocity with an exponentially weighted average: v(t) = β · v(t−1) + (1 − β) · g(t).
• If β is 0.95, the weight given to the current step becomes 0.05, and hence the velocity of movement is smoothed.
• The weights are updated similarly in momentum optimization.
• The computation below shows the calculation for a single weight; it can also be applied to the bias. W(t−1) is W(old), and Vdw is the exponentially weighted average of the weight gradients:
  Vdw = β · Vdw + (1 − β) · dW,   W(new) = W(old) − η · Vdw
Final concept of GD with Momentum for fast
and smooth optimization
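• As a minimal sketch of the momentum update described above, in NumPy with an assumed quadratic loss and β = 0.9:

```python
import numpy as np

def grad(w):
    # Assumed gradient of the simple quadratic loss L(w) = (w - 5)^2
    return 2 * (w - 5.0)

w = 0.0
v_dw = 0.0        # exponentially weighted average of the gradients
beta = 0.9
lr = 0.1

for step in range(200):
    dW = grad(w)
    v_dw = beta * v_dw + (1 - beta) * dW   # smooth the gradient (velocity)
    w = w - lr * v_dw                      # momentum weight update

print(w)   # converges toward the minimum at w = 5
```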
Summary of GD, SGD and Mini Batch SGD
• GD – the weight update is done after all samples have been passed through the model.
• SGD – a weight update takes place for every sample.
• Mini-batch – a weight update takes place for every batch.

• The batch size should be neither too small nor too big; it should be decided based on the available sample size and processing capacity.
AdaGrad Faster Optimizer
• AdaGrad – Adaptive Gradient.
• In Adagrad Optimizer the core idea is that each weight has a different
learning rate (η).
• This modification has great importance.
• In real-world datasets, some features are sparse (for example, in Bag of Words most of the feature values are zero, so it is sparse) and some are dense (most of the feature values are non-zero).
• So keeping the same value of the learning rate for all the weights is not good for optimization. The weight update formula for AdaGrad is given below.
• The weight update in AdaGrad is given by:

  w(t) = w(t−1) − η(t) · ∂L/∂w(t−1),  where  η(t) = η / √(α(t) + ε)  and  α(t) is the sum of the squared gradients (∂L/∂w)² over all iterations so far.

• Thus α(t) gives each weight a different effective learning rate η(t) at each iteration.
• Here, η is a constant and ε is a small positive number added to avoid a divide-by-zero error in case α(t) is 0.
• Note that as α(t) keeps growing over the iterations, η(t) becomes smaller and smaller, so the update barely changes the weight (w(new) ≈ w(old)) and convergence becomes slow.
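• As a minimal sketch of this update rule in NumPy (an assumed one-dimensional quadratic loss is used for illustration):

```python
import numpy as np

def grad(w):
    # Assumed gradient of the loss L(w) = (w - 5)^2
    return 2 * (w - 5.0)

w = 0.0
alpha = 0.0          # running sum of squared gradients, alpha(t)
eta = 1.0            # constant base learning rate
eps = 1e-8           # small value to avoid division by zero

for step in range(500):
    g = grad(w)
    alpha += g ** 2                            # accumulate squared gradients
    effective_lr = eta / np.sqrt(alpha + eps)  # per-parameter learning rate
    w = w - effective_lr * g

print(w)   # approaches the minimum at w = 5
```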
• Advantages of Adagrad:
• No manual tuning of the learning rate required.
• Faster convergence
• More reliable
Avoiding Overfitting Through Regularization
• Deep neural networks typically have tens of thousands of parameters.
• With so many parameters, the network is prone to overfitting the training set.

• Overfitting can be reduced using “regularization” techniques.

• Some of the popular regularization techniques are:
• Early Stopping
• Dropout
• Max-Norm Regularization and
• Data Augmentation.
Early Stopping
• To avoid overfitting the training set, a good solution is early stopping.
• Interrupt training when the model's performance on the validation set starts dropping.

• Evaluate the model on the validation set at regular intervals.

• If the performance has stopped improving compared to the previous intervals, roll back to the previously best parameter values and stop training.
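• As a minimal sketch (assuming TensorFlow/Keras), early stopping can be attached as a callback; the patience value is arbitrary, and model, X_train, X_val, etc. are placeholder names:

```python
import tensorflow as tf

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch performance on the validation set
    patience=5,                  # stop after 5 epochs without improvement
    restore_best_weights=True)   # roll back to the best parameters seen

# model.fit(X_train, y_train, epochs=100,
#           validation_data=(X_val, y_val),
#           callbacks=[early_stopping])
```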
Dropout
• Arguably one of the most popular regularization techniques for deep neural networks is dropout.
• At every training step, every neuron has a probability p of being temporarily “dropped out”, meaning it is entirely ignored during that training step.
• But it may be active during the next step.
• The hyperparameter p is called the dropout rate, and it is typically set to 50%.
• After training, neurons don't get dropped anymore.
• In practice, this technique has often been found to work very well.
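• As a minimal sketch (assuming TensorFlow/Keras), dropout is added as a layer; the layer sizes are arbitrary, and the rate of 0.5 follows the 50% figure above:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dropout(0.5),                  # drop 50% of the inputs
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dropout(0.5),                  # drop 50% of the activations
    tf.keras.layers.Dense(10, activation="softmax"),
])
# Dropout is active only during training; at inference time it does nothing.
```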
Max-Norm Regularization
• Another regularization technique that is quite popular for neural networks is called max-norm regularization.
• It constrains the weights w of the incoming connections such that ∥w∥2 ≤ r, where r is the max-norm hyperparameter and ∥·∥2 is the ℓ2 norm.
• It is typically implemented by computing ∥w∥2 after each training step and rescaling (clipping) w if needed.
• Reducing r increases the amount of regularization and helps reduce
overfitting.
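• As a minimal sketch (assuming TensorFlow/Keras), a max-norm constraint with r = 2 can be attached to a layer's incoming weights:

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(
    100, activation="relu",
    kernel_constraint=tf.keras.constraints.max_norm(2.0))  # enforce ||w||2 <= 2
# After each training step, Keras rescales the weights if the constraint is violated.
```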
Data Augmentation

• One last regularization technique is data augmentation.

• It consists of generating new training instances from existing ones, artificially boosting the size of the training set.
• This reduces overfitting, which makes it a regularization technique.
• The trick is to generate realistic training instances; ideally, a human should not be able to tell which instances were generated and which ones were not.
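• As a minimal sketch (assuming TensorFlow/Keras), image data augmentation can be done with random preprocessing layers placed at the start of a model; the transformation amounts below are arbitrary examples:

```python
import tensorflow as tf

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # mirror images left-right
    tf.keras.layers.RandomRotation(0.1),        # rotate by up to about ±36 degrees
    tf.keras.layers.RandomZoom(0.1),            # zoom in or out by up to 10%
])
# These layers are active only during training, so every epoch sees slightly
# different versions of the same pictures.
```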
End of Unit - 2
