
UNIT - 2

Deep Neural Network


Unit – 2 : Syllabus
• Deep Neural network:
• Introduction,
• Vanishing Gradient problems,
• Reusing Pretrained layers,
• Faster optimizers,
• Avoiding overfitting through regularization
Deep Neural Networks
• Introduction:
• A neural network with two or more hidden layers can be called a Deep Neural Network.
• While handling a complex problem such as detecting hundreds of types of objects in high-resolution images, you may need to train a much deeper DNN, with perhaps 10 layers, each containing hundreds of neurons, connected by hundreds of thousands of connections.
• Training such a network leads to the problem of vanishing gradients.
Training the Neural Network
Back propagation
• Backpropagation is the essence of neural network training. It is the
method of fine-tuning the weights of a neural network based on the
error rate obtained in the previous epoch (i.e., iteration).
• Proper tuning of the weights allows you to reduce error rates and make
the model reliable by increasing its generalization.

• Backpropagation is short for “backward propagation of errors.” It is a standard method of training artificial neural networks. This method helps calculate the gradient of a loss function with respect to all the weights in the network.
Back propagation is the method of adjusting
weights after computing the loss value.
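• As a minimal sketch (not part of the original slides), the NumPy code below backpropagates the error of a single sigmoid neuron through the chain rule and then updates its weights; the inputs, target, initial weights, and learning rate are all assumed values for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy data: 3 inputs, 1 target (illustration only)
x = np.array([1.0, 0.5, -0.2])
y_true = 1.0
w = np.array([0.1, -0.3, 0.2])   # initial weights
b = 0.0                          # initial bias
lr = 0.1                         # learning rate

# Forward pass
z = np.dot(w, x) + b
y_pred = sigmoid(z)
loss = 0.5 * (y_true - y_pred) ** 2

# Backward pass (chain rule): dL/dw = dL/dy * dy/dz * dz/dw
dL_dy = -(y_true - y_pred)
dy_dz = y_pred * (1.0 - y_pred)   # derivative of the sigmoid
grad_w = dL_dy * dy_dz * x        # dz/dw = x
grad_b = dL_dy * dy_dz            # dz/db = 1

# Weight update after computing the loss (gradient descent step)
w -= lr * grad_w
b -= lr * grad_b
```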
Chain rule in backpropagation
Backpropagate to change the weight
Problem (compute the output y): apply the activation function
• Calculate the output ‘y’ of a 3-input neuron with bias, with the data as given in the diagram. Use the sigmoid activation function.
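• The diagram's values are not reproduced in this text, so the snippet below uses assumed inputs, weights, and bias purely to illustrate how such an output would be computed:

```python
import numpy as np

# Assumed example values (the original diagram's data is not available here)
x = np.array([0.8, 0.6, 0.4])      # three inputs
w = np.array([0.5, -0.2, 0.3])     # corresponding weights
b = 0.35                           # bias

z = np.dot(w, x) + b               # weighted sum plus bias
y = 1.0 / (1.0 + np.exp(-z))       # sigmoid activation
print(f"net input z = {z:.3f}, output y = {y:.3f}")
```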
Gradient Descent
Compute the weights
Vanishing Gradient
• Back propagation algorithm works by going from the output layer to
the input layer, propagating the error gradient on the way.
• Gradients often get smaller and smaller as the algorithm progresses
down to the lower layers.
• As a result, the gradient descent update leaves the lower-layer weights virtually unchanged.
• This means training never converges to a good solution.
• This is called the vanishing gradients problem.
Exploding Gradients
• Sometimes the gradients can grow bigger and bigger, so many layers get very large weight updates and the algorithm diverges.
• This is called the exploding gradients problem, which is most commonly seen in recurrent neural networks.
Gradient clipping
• A popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold (this is mostly useful for recurrent neural networks).
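• As a minimal sketch (assuming TensorFlow/Keras is being used), gradient clipping can be requested when creating the optimizer; the threshold values here are arbitrary:

```python
import tensorflow as tf

# Clip each individual gradient value to the range [-1.0, 1.0]
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)

# Or rescale the whole gradient vector if its L2 norm exceeds 1.0
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
```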
Reusing Pretrained Layers
• Pretrained models are saved models, together with their weights, obtained after training.
• They are usually very deep neural network models.
• They are trained on very large, general datasets for large classification tasks.
Reusing Pretrained Layers
• It is generally not a good idea to train a very large DNN from scratch.
• Instead, you should always try to find an existing neural network that accomplishes a task similar to the one you are trying to tackle.
• Then just reuse the lower layers of this network; this is called Transfer Learning.
• It will not only speed up training considerably, but will also require much less training data.
Reusing Pretrained Layers
• For example, suppose that you have access to a DNN that was trained
to classify pictures into 100 different categories, including animals,
plants, vehicles, and everyday objects.
• You now want to train a DNN to classify specific types of vehicles.
These tasks are very similar, so you should try to reuse parts of the
first network.
Observations to be made while training the lower layers
• If the input pictures of your new task don’t have the same size as the ones used in the original task, you will have to add a preprocessing step to resize them to the size expected by the original model. More generally, transfer learning will only work well if the inputs have similar low-level features.
Pre-trained Models – How are they useful?
• Lower layers learn basic features, such as colors and lines at various angles, from very large and general sets of training images.
• These lower-layer features are almost the same for most of our tasks; changes are needed only in the upper layers, hence training is done only for the upper layers.
• Examples of Pre-trained models:
• Xception
• VGG16
• VGG19
• ResNet50
• InceptionV3
• InceptionResNetV2
• MobileNet
• MobileNetV2
• DenseNet
• NASNet
VGG 16 Layers.

Freezing the lower layers
• It is likely that the lower layers of the first DNN have learned to detect low-level features in pictures that will be useful across both image classification tasks, so you can just reuse these layers as they are.

• It is generally a good idea to “freeze” their weights.

• If the lower-layer weights are fixed, the higher-layer weights will be easier to train.
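• As a minimal sketch (assuming TensorFlow/Keras and the VGG16 model listed earlier), the pretrained lower layers can be reused and frozen while a new top is trained; the input shape and the 10 output classes are assumptions for illustration:

```python
import tensorflow as tf

# Reuse the convolutional base of VGG16 pretrained on ImageNet (without its top layers)
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))

# Freeze the pretrained lower layers so their weights are not updated during training
base.trainable = False

# Add a new classification head for the new task (10 classes assumed)
x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs=base.input, outputs=outputs)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```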
Tweaking, Dropping or replacing the upper
layers
• The output layer of the original model should usually be replaced, since it is most likely not useful at all for the new task and may not even have the right number of outputs for the new task.

• If performance is still not good, you may drop or replace more of the upper layers for good performance.
Normalization
• In some cases, features will have widely varying value ranges, which may lead to inconsistent results.
• For example, observe the following table, where wine quality is being assessed.
• However, the values of “Alcohol” and “Malic” differ hugely in scale.
• In such cases, the system may fail to perform properly.
• Hence, normalizing all the features to the range 0 to 1 is required.
• Min-max normalization is popularly used; observe that the values of Alcohol and Malic are now normalized between 0 and 1.
• This is done as soon as the input is fed into the system, before the summation and activation function. It is also called pre-processing in a neural network.
• In the example given below, age and number of miles driven are the two parameters.
• They are on different scales.
• If they are used as they are, without normalization, this may lead to an imbalance in the neural network.
• To handle this we need to normalize the data.
• On the right-hand side you have the same data, now normalized (see the sketch below).
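• As a minimal sketch of min-max normalization (the column values below are assumed purely for illustration):

```python
import numpy as np

# Assumed raw feature matrix: columns are [age, miles_driven]
data = np.array([[25.0,  12000.0],
                 [40.0,  85000.0],
                 [60.0, 230000.0]])

# Min-max normalization: x' = (x - min) / (max - min), applied column-wise
col_min = data.min(axis=0)
col_max = data.max(axis=0)
normalized = (data - col_min) / (col_max - col_min)

print(normalized)   # every column is now scaled to the range [0, 1]
```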
Batch size and epoch
• The weights are updated after every batch of data has been processed (one iteration per batch).

• For example, if you have 1,000 samples and you set a batch size of 200, then the neural network’s weights get updated after every 200 samples.

• A batch is also called a ‘mini-batch’.

• An epoch completes after it has seen the full data set, so in the example above, in 1
epoch, the neural network gets updated 5 times.

• Batch size will be fixed based on the processing capacity.


• In the above example, the weights will not be updated until the network has seen 200 training samples.
Batch size = 10
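• As a small sketch of the relationship between sample count, batch size, and weight updates per epoch (using the numbers from the example above):

```python
import math

n_samples = 1000
batch_size = 200

# Number of weight updates (iterations) performed in one epoch
updates_per_epoch = math.ceil(n_samples / batch_size)
print(updates_per_epoch)   # -> 5
```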
Faster Optimizers
• Training a very large deep neural network can be slow.
• Four ways to speed up training:
• Applying good initialization strategy for the connection weights.
• Using a good activation function
• Using Batch Normalization and
• Reusing parts of a pretrained network
• Another huge speed boost comes from using a faster optimizer than
the gradient descent optimizer.
• Some of the popular fast optimizers are Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and Adam optimization.
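• As a minimal sketch (assuming TensorFlow/Keras), these faster optimizers can be instantiated as shown below; the learning rates and momentum values are common defaults, not prescriptions:

```python
import tensorflow as tf

momentum_opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
nesterov_opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9,
                                       nesterov=True)
adagrad_opt  = tf.keras.optimizers.Adagrad(learning_rate=0.01)
rmsprop_opt  = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
adam_opt     = tf.keras.optimizers.Adam(learning_rate=0.001,
                                        beta_1=0.9, beta_2=0.999)
```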
Why use a fast optimizer?
• Let us consider the following case:
Gradient Descent updating…
• When we pass one sample, a loss is computed, for example the squared error between the prediction and the target.
• Gradient descent then updates the weights with reference to this loss: w(new) = w(old) − η · ∂L/∂w, where η is the learning rate.
Iteration and Epoch
• One forward propagation followed by a weight update in the backward propagation, for one sample, is called an ITERATION.

• When iterations have been completed for all the available training samples, it is called ONE EPOCH.

• For example, if 10,000 training samples are available and we decide to update the weights after every sample, we will have 10,000 iterations, and together these make up one EPOCH.
General Gradient Descent
and
Stochastic Gradient Descent (SGD)
• In general (batch) Gradient Descent, the loss is collected for all the training samples and the weights are updated using the average of all the losses.

• Say we have 10,000 samples to be trained; then we will not have 10,000 iterations. We collect the loss of all 10,000 samples, which completes one epoch, and at the end of the epoch the weights are updated.
• The problem with the above method is that if the training data is very large, say 10 lakh (1 million) or 50 lakh (5 million) samples, then huge RAM space is needed to load all the samples, and space is also required to hold the loss values of all the samples. As a solution, researchers came up with another method called Stochastic Gradient Descent.
• In Stochastic Gradient Descent (SGD), the weights are updated in every iteration, i.e. after every single sample. Though it requires less memory, it is time consuming to reach the global minimum of the error.
Solution to SGD is the ‘Mini Batch’
• Researchers introduced a technique called “Mini-Batch” or “Mini-Batch SGD”.
• In mini-batch SGD, a batch of training samples is considered for each weight update; the batch size determines how many iterations make up one epoch.

• For example, if we have 10,000 training samples and the batch size is 1,000, the weights are updated after every 1,000 samples have been trained.
• In this case, completing one epoch takes 10 iterations.

• In other words, in one epoch the weights are updated 10 times, but these updates may be noisy, as shown in the diagram (and in the sketch below).
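• As a minimal sketch (not from the slides), the NumPy loop below trains a simple linear model with mini-batch SGD on assumed toy data; setting batch_size equal to the dataset size gives plain GD, while batch_size = 1 gives SGD:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy regression data: y = 3*x + 2 plus a little noise
X = rng.uniform(-1, 1, size=(10_000, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, size=10_000)

w, b = 0.0, 0.0
lr = 0.1
batch_size = 1_000          # 10 weight updates (iterations) per epoch

for epoch in range(20):
    indices = rng.permutation(len(X))              # shuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        error = (w * xb + b) - yb
        # Gradients of the mean squared error over this mini-batch
        grad_w = 2 * np.mean(error * xb)
        grad_b = 2 * np.mean(error)
        w -= lr * grad_w
        b -= lr * grad_b

print(w, b)   # should approach 3 and 2
```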
Contour Plot
• If you draw a contour plot, which is the top view of the gradient descent path, it is smooth for GD (red), less smooth for mini-batch SGD (black), and noisiest for SGD (blue), which wanders the most. The left-hand side picture is an illustration of a contour plot drawn using Python.
Illustration of noise: while moving down the hill of the loss surface, SGD or mini-batch SGD will have noise, which can be smoothed using techniques called optimizers.
Fast Optimizer – Momentum Optimizer
• Consider the physics-inspired way of smoothing a velocity with an exponentially weighted average: v(t) = β · v(t−1) + (1 − β) · g(t).
• If β is 0.95, the weight given to the current step becomes 0.05, and hence the velocity of movement is smoothed.
• The weights are updated similarly in momentum optimization.
• The computation below shows the calculation for a single weight; it can also be applied to the bias. W(t−1) is W(old), and Vdw is the exponentially weighted average of the weight gradients:
  Vdw = β · Vdw + (1 − β) · dW,   W(new) = W(old) − η · Vdw
Final concept of GD with Momentum for fast
and smooth optimization
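• As a minimal sketch of the momentum update described above, in NumPy with an assumed quadratic loss and β = 0.9:

```python
import numpy as np

def grad(w):
    # Assumed gradient of the simple quadratic loss L(w) = (w - 5)^2
    return 2 * (w - 5.0)

w = 0.0
v_dw = 0.0        # exponentially weighted average of the gradients
beta = 0.9
lr = 0.1

for step in range(200):
    dW = grad(w)
    v_dw = beta * v_dw + (1 - beta) * dW   # smooth the gradient (velocity)
    w = w - lr * v_dw                      # momentum weight update

print(w)   # converges toward the minimum at w = 5
```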
Summary of GD, SGD and Mini Batch SGD
• GD – the weight update is done after all samples have been passed through the model.
• SGD – a weight update takes place for every sample.
• Mini-batch – a weight update takes place for every batch.

• The batch size should be neither too small nor too big; it should be decided based on the available sample size and processing capacity.
AdaGrad Faster Optimizer
• AdaGrad – Adaptive Gradient.
• In Adagrad Optimizer the core idea is that each weight has a different
learning rate (η).
• This modification has great importance.
• In real-world datasets, some features are sparse (for example, in Bag of Words most of the feature values are zero, so it is sparse) and some are dense (most of the feature values are non-zero).
• So keeping the same value of the learning rate for all the weights is not good for optimization. The weight update formula for AdaGrad is given below.
• The weight update in AdaGrad is given by:

  w(t) = w(t−1) − η(t) · ∂L/∂w(t−1),  where  η(t) = η / √(α(t) + ε)  and  α(t) is the sum of the squared gradients (∂L/∂w)² over all iterations so far.

• Thus α(t) gives each weight a different effective learning rate η(t) at each iteration.
• Here, η is a constant and ε is a small positive number added to avoid a divide-by-zero error in case α(t) is 0.
• Note that as α(t) keeps growing over the iterations, η(t) becomes smaller and smaller, so the update barely changes the weight (w(new) ≈ w(old)) and convergence becomes slow.
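• As a minimal sketch of this update rule in NumPy (an assumed one-dimensional quadratic loss is used for illustration):

```python
import numpy as np

def grad(w):
    # Assumed gradient of the loss L(w) = (w - 5)^2
    return 2 * (w - 5.0)

w = 0.0
alpha = 0.0          # running sum of squared gradients, alpha(t)
eta = 1.0            # constant base learning rate
eps = 1e-8           # small value to avoid division by zero

for step in range(500):
    g = grad(w)
    alpha += g ** 2                            # accumulate squared gradients
    effective_lr = eta / np.sqrt(alpha + eps)  # per-parameter learning rate
    w = w - effective_lr * g

print(w)   # approaches the minimum at w = 5
```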
• Advantages of Adagrad:
• No manual tuning of the learning rate required.
• Faster convergence
• More reliable
Avoiding Overfitting Through Regularization
• Deep neural networks typically have tens of thousands of parameters.
• With so many parameters, the network is prone to overfitting the training set.

• Overfitting can be reduced using “regularization” techniques.

• Some of the popular regularization techniques are:
• Early Stopping
• Dropout
• Max-Norm Regularization and
• Data Augmentation.
Early Stopping
• To avoid overfitting the training set, a good solution is early stopping.
• Interrupt training when the model's performance on the validation set starts dropping.

• Evaluate the model on the validation set at regular intervals.

• If the performance has stopped improving compared to the previous intervals, roll back to the previously best parameter values and stop training.
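• As a minimal sketch (assuming TensorFlow/Keras), early stopping can be attached as a callback; the patience value is arbitrary, and model, X_train, X_val, etc. are placeholder names:

```python
import tensorflow as tf

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch performance on the validation set
    patience=5,                  # stop after 5 epochs without improvement
    restore_best_weights=True)   # roll back to the best parameters seen

# model.fit(X_train, y_train, epochs=100,
#           validation_data=(X_val, y_val),
#           callbacks=[early_stopping])
```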
Dropout
• Arguably one of the most popular regularization techniques for deep neural networks is dropout.
• At every training step, every neuron has a probability p of being temporarily “dropped out”, meaning it is entirely ignored during that training step.
• But it may be active during the next step.
• The hyperparameter p is called the dropout rate, and it is typically set to 50%.
• After training, neurons don't get dropped anymore.
• In practice, this technique has often been found to work very well.
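• As a minimal sketch (assuming TensorFlow/Keras), dropout is added as a layer; the layer sizes are arbitrary, and the rate of 0.5 follows the 50% figure above:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dropout(0.5),                  # drop 50% of the inputs
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dropout(0.5),                  # drop 50% of the activations
    tf.keras.layers.Dense(10, activation="softmax"),
])
# Dropout is active only during training; at inference time it does nothing.
```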
Max-Norm Regularization
• Another regularization technique that is quite popular for neural networks is called max-norm regularization.
• It constrains the weights w of the incoming connections such that ∥w∥2 ≤ r, where r is the max-norm hyperparameter and ∥·∥2 is the ℓ2 norm.
• It is typically implemented by computing ∥w∥2 after each training step and rescaling (clipping) w if needed.
• Reducing r increases the amount of regularization and helps reduce
overfitting.
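• As a minimal sketch (assuming TensorFlow/Keras), a max-norm constraint with r = 2 can be attached to a layer's incoming weights:

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(
    100, activation="relu",
    kernel_constraint=tf.keras.constraints.max_norm(2.0))  # enforce ||w||2 <= 2
# After each training step, Keras rescales the weights if the constraint is violated.
```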
Data Augmentation

• One last regularization technique is data augmentation.

• It consists of generating new training instances from existing ones, artificially boosting the size of the training set.
• This reduces overfitting, which makes it a regularization technique.
• The trick is to generate realistic training instances; ideally, a human should not be able to tell which instances were generated and which ones were not.
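• As a minimal sketch (assuming TensorFlow/Keras), image data augmentation can be done with random preprocessing layers placed at the start of a model; the transformation amounts below are arbitrary examples:

```python
import tensorflow as tf

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # mirror images left-right
    tf.keras.layers.RandomRotation(0.1),        # rotate by up to about ±36 degrees
    tf.keras.layers.RandomZoom(0.1),            # zoom in or out by up to 10%
])
# These layers are active only during training, so every epoch sees slightly
# different versions of the same pictures.
```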
End of Unit - 2
