
UE21CS343BB2

Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD),
Centre for Data Sciences & Applied Machine
Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Topics in Deep Learning

Introduction to Deep Learning


UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Human Intelligence
Artificial Intelligence
Machine Learning
Types of Machine Learning
Types of Machine Learning

ARTIFICIAL INTELLIGENCE
A broad area which enables computers to mimic human behavior.

MACHINE LEARNING
The use of statistical tools enables machines to learn from experience
(data) – they need to be told what to learn.

DEEP LEARNING
Learns from its own method of
computing – its own "brain".

Why is Deep
Learning useful?

Good at classification,
clustering and
predictive analysis
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Deep Learning

• Deep learning is a branch of machine learning which is based on


artificial neural networks.
• It is capable of learning complex patterns and relationships within
data and does not require us to explicitly program everything.
• It has become increasingly popular in recent years due to the
advances in processing power and the availability of large datasets.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Deep Learning

• An artificial neural network or ANN uses layers of interconnected


nodes called neurons that work together to process and learn from
the input data.
• In a fully connected Deep neural network, there is an input layer and
one or more hidden layers connected one after the other.
• Each neuron receives input from the previous layer neurons or the
input layer.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Deep Learning

• The output of one neuron becomes the input to other neurons in the
next layer of the network, and this process continues until the final
layer produces the output of the network.
• The layers of the neural network transform the input data through a
series of nonlinear transformations, allowing the network to learn
complex representations of the input data.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Machine Learning vs Deep Learning

• Automatic Feature Extraction: Deep learning models, especially


neural networks, automatically learn relevant features from raw
input data during the training process.
• Elimination of Manual Feature Engineering: Unlike traditional
machine learning, deep learning reduces the need for manual
feature extraction, saving time and leveraging the model's ability to
learn complex patterns.
• Hierarchical Representations: Deep neural networks consist of
multiple layers that transform input data into hierarchical and
abstract representations, capturing intricate features without explicit
human guidance.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Machine Learning vs Deep Learning

• Applicability to Complex Data: Deep learning


excels in tasks like image recognition, natural
language processing, and speech recognition,
where the data's complexity may be challenging
to capture through manual feature engineering.
While deep learning does automate feature
extraction, attention to proper data preprocessing
and hyperparameter tuning is crucial for achieving
optimal model performance.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Machine Learning vs Deep Learning

• ML: Applies statistical algorithms to learn the hidden patterns and relationships in the dataset.
  DL: Uses artificial neural network architectures to learn the hidden patterns and relationships in the dataset.
• ML: Can work with smaller datasets.
  DL: Requires a larger volume of data compared to machine learning.
• ML: Better for low-label tasks.
  DL: Better for complex tasks like image processing, natural language processing, etc.
• ML: Takes less time to train the model.
  DL: Takes more time to train the model.
• ML: A model is created from relevant features which are manually extracted from images to detect an object in the image.
  DL: Relevant features are automatically extracted from images; it is an end-to-end learning process.
• ML: Less complex; the results are easy to interpret.
  DL: More complex; it works like a black box, so interpreting the results is not easy.
• ML: Can work on a CPU, or requires less computing power compared to deep learning.
  DL: Requires a high-performance computer with a GPU.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Applications of Deep Learning

• Computer Vision
Deep learning models can enable machines to identify and understand visual data. Some of
the main applications of deep learning in computer vision include:
• Object detection and recognition: Deep learning model can be used to identify and
locate objects within images and videos, making it possible for machines to perform
tasks such as self-driving cars, surveillance, and robotics.
• Image classification: Deep learning models can be used to classify images into
categories such as animals, plants, and buildings. This is used in applications such as
medical imaging, quality control, and image retrieval.
• Image segmentation: Deep learning models can be used for image segmentation into
different regions, making it possible to identify specific features within images.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Applications of Deep Learning

• Computer Vision

Image Segmentation
Image Classification
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Applications of Deep Learning

• Natural Language Processing(NLP)


In NLP, the Deep learning model can enable machines to understand and generate human
language.
Some of the main applications of deep learning in NLP include:
• Automatic Text Generation: Deep learning models can learn from a corpus of text, and new text
such as summaries or essays can then be generated automatically using these trained models.
• Language translation: Deep learning models can translate text from one language to
another, making it possible to communicate with people from different linguistic
backgrounds.
• Sentiment analysis: Deep learning models can analyze the sentiment of a piece of text,
making it possible to determine whether the text is positive, negative, or neutral. This is
used in applications such as customer service, social media monitoring, and political
analysis.
• Speech recognition: Deep learning models can recognize and transcribe spoken words,
making it possible to perform tasks such as speech-to-text conversion, voice search, and
voice-controlled devices.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Applications of Deep Learning

• Natural Language Processing(NLP)

Language Translation Speech Recognition


UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Applications of Deep Learning

• Reinforcement Learning
In reinforcement learning, deep learning works as training agents to take
action in an environment to maximize a reward.
Some of the main applications of deep learning in reinforcement learning
include:
• Game playing: Deep reinforcement learning models have been able to
beat human experts at games such as Go, Chess, and Atari.
• Robotics: Deep reinforcement learning models can be used to train robots
to perform complex tasks such as grasping objects, navigation, and
manipulation.
• Control systems: Deep reinforcement learning models can be used to
control complex systems such as power grids, traffic management, and
supply chain optimization.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Applications of Deep Learning

• Reinforcement Learning

Game Playing Robotics


UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• Convolutional Neural Networks


CNNs are a specialized type of neural network designed for processing
structured grid data, such as images.
A CNN captures the spatial features from an image. Spatial features
refer to the arrangement of pixels and the relationship between them in
an image. They help us in identifying the object accurately, the location
of an object, as well as its relation with other objects in an image.
For example, face and object recognition software.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• Convolutional Neural Networks


UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• Recurrent Neural Networks


Recurrent Neural Networks (RNNs) are a
type of artificial neural network designed to
work with sequential data and handle tasks
where the order of information is crucial.
An RNN remembers every piece of
information throughout time.
Apple's Siri and Google's voice search
algorithm are exemplary applications of
RNNs in machine learning.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• Recurrent Neural Networks


UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• Deep Belief Networks


DBNs consist of multiple layers of stochastic (random), latent variables (often
binary) and are structured as a stack of Restricted Boltzmann Machines (RBMs). Each
layer captures progressively more complex features of the input data.

DBNs are trained layer by layer in an unsupervised manner. The first layer is trained to
capture basic patterns in the data, and subsequent layers are added and trained to
capture higher-level representations.

The architecture of DBNs also makes them good at unsupervised learning, where the
goal is to understand and label input data without explicit guidance. This characteristic
is particularly useful in scenarios where labelled data is scarce or when the goal is to
explore the structure of the data without any preconceived labels.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• Deep Belief Networks Consider a DBN applied to image


data.

The first layer might learn simple


features like edges and corners,
while deeper layers learn more
complex features such as textures
or object parts.

The final layer could represent high-


level features like complete objects.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• Graph Neural Networks


GNNs are a type of neural network architecture designed to process and
analyze graph-structured data.
Unlike traditional neural networks that operate on grid-structured data
like images or sequences, GNNs are tailored for data represented as
graphs, which consist of nodes (vertices) and edges connecting these
nodes.
Some examples of its usage include Recommender Systems, Social
Network Analysis and more.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• Graph Neural Networks


UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• Reinforcement Learning
RL is a type of machine learning paradigm where an agent learns to
make decisions by interacting with an environment.
The goal of reinforcement learning is for the agent to learn a strategy
that maximizes a cumulative reward over time.
The agent receives observations, takes actions, and receives rewards
from the environment, aiming to learn a strategy that maximizes
cumulative rewards over time.
RL is employed in scenarios where optimal decision-making is learned
through trial and error in complex and dynamic environments.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• Reinforcement Learning
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• Generative Adversarial Networks


Generative Adversarial Networks (GANs) are a class of deep learning
models composed of a generator and a discriminator trained in
competition.
The generator creates synthetic data, while the discriminator evaluates
its authenticity, leading to an adversarial learning process.
GANs have achieved remarkable success in generating realistic images,
videos, and other data, with applications ranging from art creation to
data augmentation.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• Generative Adversarial Networks


UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• Transformers
Transformers are a type of deep learning model architecture renowned
for their self-attention mechanism, enabling the capture of intricate
relationships within input sequences.
Their parallelization capabilities and encoder-decoder architecture make
them efficient for various tasks, extending beyond natural language
processing to computer vision.
Notably, models like BERT (Bidirectional Encoder Representations from
Transformers), Google's neural network-based technique for natural
language processing (NLP) pre-training, have demonstrated exceptional
performance.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• Transformers
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• GPT (Generative Pre-trained Transformer) Series:


Models like ChatGPT, part of the GPT series, are transformer-based
language models trained on diverse data, excelling in natural language
generation and understanding.

• Google Gemini
Google Gemini is built with a decoder architecture featuring a 32k
context length (the number of words the model can take into account when
generating responses or predictions) and Multi-Query Attention (MQA).
Gemini is engineered for advanced contextual understanding, setting a
new standard in AI architecture.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures

• GPT4 vs Gemini:
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Acknowledgements & References

• https://www.geeksforgeeks.org/
• https://deeplearning.ai
• https://medium.com/@developer.yasir.pk/unveiling-googles-gemini-a-deep-dive-into-the-next-frontier-of-ai-ee41ffe90a9c
Thank You

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre for
Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Anashua Dattidar,


Teaching Assistant
A Recap of Bias Variance Tradeoff using an example
Say we have some data on heights and weights of rabbits and we wish to
predict the height of a rabbit given its weight. After plotting the data we
realise that the weight and height values have a linear relationship up to a
point, after which the rabbits don't get any taller but only fatter.
Thus we go through the steps of our usual machine learning pipeline:

1. Data Preprocessing

2. Split into training and test set

( Assume here the green points
are training data and the red
points are testing data)
A Recap of Bias Variance Tradeoff using an example
Now we would like to use some machine learning method to approximate this
relationship. So let's use Linear Regression and fit a straight line to the data.
A clear problem with this approach is that
the straight line will never be able to
capture the true relationship between
weight and height.
Thus the issue here lies in the assumption
that the relationship can be modeled using
a straight line: we are highly "biased"
in our model selection and have
chosen a model which is too simple.
Bias is the difference between the average prediction of our model and the correct value
which we are trying to predict. Model with high bias pays very little attention to the training
data and oversimplifies the model.
A Recap of Bias Variance Tradeoff using an example
Now in search of a more complex model for our data we have landed on the following function.
This line passes through each and every data point in the training set (low bias) and does a great
job of fitting the training set, but it does an equally terrible job of fitting the test set. In this case
we have landed on a model which is too complex for the given data and thus has high variance.

Such a model is also said to be overfit as it perfectly fits the training data but fails to
generalize.

(Figure: the same fit shown on the training set vs. the test set)

Variance is the variability of model prediction for a given data point, or a value which tells us
the spread of our data.
A Recap of Bias Variance Tradeoff using an example
If our model is too simple and has very few parameters then it may
have high bias and low variance. On the other hand, if our model
has a large number of parameters then it is going to have high
variance and low bias. So we need to find the right balance, without
overfitting or underfitting the data.

This is known as the bias-variance tradeoff in machine learning:
a model cannot be too complex or too simple but must be just right
for each problem.

To build a good model, we need to find a good balance between bias
and variance such that it minimizes the total error.
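A minimal sketch (not from the slides) of this tradeoff, assuming NumPy and scikit-learn are available: a degree-1 polynomial underfits rabbit-style data (high bias), while a degree-15 polynomial fits the training points almost perfectly but does far worse on the held-out split (high variance).

# Bias-variance sketch: compare a too-simple and a too-complex fit.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 3, 30))
y = np.where(x < 2, x, 2) + rng.normal(0, 0.1, 30)      # height plateaus after a point
x_train, y_train = x[::2].reshape(-1, 1), y[::2]
x_test,  y_test  = x[1::2].reshape(-1, 1), y[1::2]

for degree in (1, 15):                                   # too simple vs. too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(x_train)),   # training error
          mean_squared_error(y_test,  model.predict(x_test)))    # test error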
Say we have encountered an overfit model such as the one in the
previous example. How do we improve the model?

One way to get a better model is to use some form of


regularization to get a more generalised model.

You could alternatively also use bagging or boosting.


What is Regularization?
Regularization is any modification we make to a learning algorithm that
is intended to reduce its generalization error but not its training error.
Machine learning faces a key challenge: ensuring algorithms perform well not just on
training data but on new inputs. To address this, various strategies, collectively known as
regularization, aim to reduce test error, sometimes leading to higher training error. This field
constantly explores newer, more effective regularization methods.
In this course, we'll delve into several such strategies which are:

1. L1 regularization
2. L2 regularization
3. Dropout
4. Early Stopping
Parameter Norm Penalties
Regularization techniques were prevalent well before neural networks came
into the picture. The L1 and L2 family of regularization techniques are based on
adding a parameter norm penalty (let's call it Ω(θ), where θ represents the
parameters) to the objective function J (also known as the cost function).
Thus the new cost function J' becomes:

J'(θ, X, y) = J(θ, X, y) + λ · Ω(θ)

λ ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term.

When our training algorithm minimizes the regularized objective function J', it will decrease both the original
objective J on the training data and some measure of the size of the parameters θ (or some subset of the
parameters). Different choices for the parameter norm Ω can result in different solutions being preferred.
L2 Parameter regularization
The L2 parameter norm penalty is commonly known as weight decay. This regularization strategy drives the
weights closer to the origin by adding a regularization term to the objective function, such as:

Ω(θ) = (1/2) ‖w‖₂²

L2 regularization is also known as ridge regression (when used in linear regression) or Tikhonov
regularization.
Let us understand L2 regularization using a linear
regression example:
Consider a scenario in which we have 2 points in
our training dataset and 2 points in our testing
dataset. Using linear regression we get a
best-fit line which has SSE (sum of squared
errors) = 0 on the training set but a large SSE on
the testing set – an overfit line!
Ridge regression example
The main idea behind techniques like ridge regression is to find a new line with a
little more bias such that there is a significant drop in variance. So with a slightly
worse fit we get a more generalized line.

As we have a straight line, our equation
here is:
y = m*x + c
where we estimate m and c by minimising
the sum of squared errors (SSE).

If we use ridge regression instead, we
shall be minimising
SSE + λ * slope²
which adds a penalty to the traditional least
squares line.
What is λ?
λ ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term. So when λ = 0
we get the same line as the least squares line.

● We know that a line with a higher (steeper) slope is more
sensitive to small changes in x, while on a line with a smaller
slope, y changes slowly compared to large changes in
x and is less sensitive to the value of x.

● As we increase successive values of λ we get a
less steep line, and thus the predicted variable gets
successively less sensitive to the parameter weights.
● Thus ridge regression can reduce the value of the slope
to be asymptotically close to zero.
● One can use techniques like cross validation to get the
best value for λ.
L1 parameter regularisation and lasso regression
Although the L2 form of weight decay is most commonly used, another option which may
be used is L1 regularization. L1 regularization on the model parameter w is defined
as:

Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|

The L1 regularization model with respect to linear regression is also known as lasso
regression. As you may recall, in ridge regression the cost function used was:

SSE + λ * slope²
Here we use,
SSE + λ * |slope|
Difference between L1 and L2
● The main difference between ridge and lasso regression is that while ridge
regression can shrink the slope asymptotically close to zero, lasso regression
can actually reduce the slope to zero.
This is useful for eliminating parameters which have no effect on the prediction.
For example:
size = a*weight + b*height + c*no_of_friends + d*horoscope + …

● In the above equation, although L2 regularization can reduce the values of c and d,
they will never be exactly zero.
● But using lasso regression, c and d can be made zero so that the effects of such
variables are completely removed.
● Thus lasso regression is useful in cases where there are useless variables, while
ridge regression is a better option when most variables are useful.
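A hedged illustration of this difference, assuming scikit-learn is available (Ridge and Lasso below stand in for the L2 and L1 penalties; the feature names mirror the toy equation above):

# Ridge shrinks useless coefficients toward zero; lasso can set them exactly to zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))          # columns: weight, height, no_of_friends, horoscope
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 100)   # last two features are useless

ridge = Ridge(alpha=1.0).fit(X, y)     # alpha plays the role of lambda
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge:", ridge.coef_)           # small but non-zero weights on the useless features
print("lasso:", lasso.coef_)           # weights on the useless features driven to exactly 0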
L2 regularization for Neural networks
For a NN with L layers and n training examples, the new cost function J' would be
written as:

J'(W[1], b[1], …, W[L], b[L]) = (1/n) Σᵢ L(ŷ(i), y(i)) + (λ/2n) Σₗ ‖W[l]‖²_F

where the first term is the sum of losses over the n training
examples and the second term is the Frobenius-norm penalty on the weight matrices.
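As one possible way to wire this penalty into a framework (a sketch, not the course's reference code), Keras lets you attach the L2 term per layer through kernel_regularizer; lambd below plays the role of λ:

# Weight decay on each Dense layer of a small Keras model (illustrative only).
from tensorflow import keras
from tensorflow.keras import layers, regularizers

lambd = 0.01                                   # the lambda hyperparameter above
model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(lambd)),
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(lambd)),
    layers.Dense(1),
])
# Keras adds lambd * sum(W**2) for every penalized layer to the loss during training.
model.compile(optimizer="adam", loss="mse")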
Intuition on how regularization reduce overfitting in NN
(The Frobenius norm penalizes the weight matrices for being too large.)

● Here, if the value of λ is large, then the W matrix values are driven
close to 0. This reduces the impact of a lot of hidden
units.

● The remaining NN is then a much smaller
and sparser network, greatly simplifying
the complexity of the current network
and thus reducing overfitting.
Intuition on how regularization reduce overfitting in NN
Say we have tanh as our activation function:
g(z) = tanh(z)
For small values of z we almost have a linear curve.
When λ is large, the weights W[L] are
small. So as

Z[L] = W[L] * a[L-1] + b[L]

if the weights are small then Z is also relatively small.

As g(z) ends up being roughly linear, every layer in the NN ends up being roughly
linear and so the model is made much simpler.
Why Dropout ?
Why Regularization using Dropout?
● One approach to reduce overfitting is to fit all possible different neural network architectures on the
same dataset and to average the predictions from each model.
This is not feasible in practice, and can be approximated using a small collection of different models,
called an ensemble. However even with the ensemble approximation, multiple models need to be fit
and stored, which can be a challenge if the models are large, requiring days or weeks to train and
tune.
● Another approach could be training multiple instances of the same neural network using different
subsets of the training data. Each instance learns from a different subset, which helps reduce
overfitting. However, this approach also requires significant computational resources.

To overcome these expensive options, dropout emerges as a solution.


Dropout

The term “dropout” refers to dropping out the nodes in a neural network. All the forward and
backwards connections with a dropped node are temporarily removed, thus creating a new network
architecture out of the parent network.
Dropout

For instance, if a hidden layer has 1000 neurons (nodes) and dropout is applied with drop
probability = 0.5, then 500 neurons would (on average) be randomly dropped in every iteration (batch).
For a neural network with n nodes, the total number of thinned networks that can be formed is 2^n.
Of course, this is prohibitively large and we cannot possibly train 2^n networks. So we
● share the weights across all the networks
● sample a different network for each training instance
More on Dropout

● While training, layers might co-adapt to correct mistakes from prior
layers. This results in complex co-adaptations which do not
generalize to unseen data, resulting in overfitting. Performing dropout
prevents these co-adaptations from occurring.
● A side-effect of applying dropout is that the activations of the hidden
units become sparse, even when no sparsity inducing regularizers are
present.
● Because the outputs of a layer under dropout are randomly
subsampled, it has the effect of reducing the capacity or thinning the
network during training. As such, a wider network, e.g. more nodes,
may be required when using dropout.
Dropout Implementation
● Dropout is typically applied to hidden and input layers, and not on nodes of the output layer.
● A hyperparameter defines the probability at which outputs of the layer are dropped out, or
inversely, the probability at which outputs of the layer are retained.
● Dropout is not used after training when making predictions. The weights of the network will be
larger than normal because of dropout. Therefore, before finalizing the network, the weights are
first scaled by the chosen dropout rate.
● The rescaling of the weights can also be performed at training time instead, after each weight
update at the end of the mini-batch. This is called “inverse dropout” and does not require any
modification of weights during training. (Both Keras & PyTorch implement dropout in this way)
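A minimal NumPy sketch of the inverted-dropout idea described above (keep_prob, the retention probability, is the hyperparameter mentioned earlier; this is an illustration, not the Keras or PyTorch source):

import numpy as np

def inverted_dropout(a, keep_prob=0.5, training=True):
    if not training:
        return a                                   # no dropout and no rescaling at test time
    mask = np.random.rand(*a.shape) < keep_prob    # sample a thinned network
    return a * mask / keep_prob                    # scale up so the expected activation is unchanged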
Dropout Training
Steps:
● for the first training instance (or mini-batch), the weights are initialized
and dropout is applied
● loss is computed and backpropagation occurs (only the active
parameters get updated)
● dropout is applied again for the second training instance, resulting in
a different thinned network
● we again compute the loss and backpropagate to the active weights
● if the weight was active for both the training instances then it receives
a total of two updates and if the weight was active for only one of the
training instances then it receives only one update
● each thinned network gets trained rarely (or even never) but the
parameter sharing ensures that no model has untrained parameters
Tips for Using Dropout Regularization

● dropout is a generic approach and can be used with all network types (MLPs, CNNs,
LSTMs etc.)
● a good value for dropout in a hidden layer is between 0.5 and 0.8, and for units in the
input layer a larger dropout rate, such as 0.8
● usually applied in larger networks as they overfit easily
● testing of different rates is important as it results in the best dropout rate for the
problem; along with this it also indicates how sensitive the network is to dropout and an
unstable network can benefit from an increase in size
● dropout is more effective where there is a limited amount of training data and the
model is likely to overfit the training data
Summary

● in dropout, units in the neural network are dropped


● each iteration of the training process has a different set of nodes,
resulting in different outputs each time; resembles ensemble training
● the probability with which a unit is dropped is defined by the
hyperparameter of the dropout function
● a common value for probability is 0.5 for retaining the output of each
node in a hidden layer and a value close to 1.0, such as 0.8, for
retaining inputs from the visible layer.
● the weights of the network are rescaled when dropout is applied
● dropout is used to prevent overfitting and is a popular method of
regularization
Data Augmentation

The word “augmentation” literally means “the action or process of making or becoming greater in
size or amount”, which summarizes the outcome of this technique.
But another important effect is that it increases, or augments, the diversity of the data. The increased
diversity means that at each training stage the model comes across a different version of the original
data.
For images, some common methods of data augmentation are taking cropped portions, zooming in/
out, rotating along the axis, vertical/horizontal flips, and adjusting the brightness and shear intensity. Data
augmentation for audio data involves adding noise and changing speed and pitch.
While data augmentation prevents the model from overfitting, some augmentation combinations
can actually lead to underfitting.
Data Augmentation

The simplest way to prevent overfitting is to increase the size of the training data but collecting more
labelled data is costly. In case of images, as mentioned earlier there are a few ways such as rotating the
image, flipping, scaling, shifting, etc.

In the above image, some transformations have been applied on the MNIST dataset. This provides a
big leap in improving the accuracy of the model. It can be considered as a mandatory trick in order to
improve our predictions.
In Keras, we can perform all of these transformations using ImageDataGenerator. It has a big list of
arguments which you can use to pre-process your training data.
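A sketch of such an ImageDataGenerator setup; the specific argument values below are illustrative choices, not prescribed by the slides:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,              # rotate along the axis
    width_shift_range=0.1,          # horizontal shift
    height_shift_range=0.1,         # vertical shift
    zoom_range=0.1,                 # zoom in/out
    horizontal_flip=True,           # flips
    brightness_range=(0.8, 1.2),    # brightness adjustment
    shear_range=0.1,                # shear intensity
)
# datagen.flow(x_train, y_train, batch_size=32) yields augmented batches for training.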
Early Stopping
Too little training results in an underfit model and too much training results in an overfit
model
If the performance of the model on the validation dataset starts to degrade (e.g. loss begins
to increase or accuracy begins to decrease), then the training process is stopped. The model
at the time that training is stopped is then used and is known to have good generalization
performance.
This simple, effective, and widely used approach to training neural networks is called
early stopping.
When training the network, a larger number of training epochs is used than may normally
be required, to give the network plenty of opportunity to fit, then begin to overfit the
training dataset.
Early Stopping
● To prevent overfitting and avoid driving the
validation error to a high value, it is important to
track the validation error during training.
● A patience parameter (p) is used to determine the
number of steps to wait for improvement in the
validation error.
● At each step (k), the validation error is checked. If
there is no improvement in the validation error for
the previous p steps, training is stopped, and the
model stored at step (k - p) is returned.
● By stopping the training early, we prevent the
model from overfitting and avoid the situation
where the training error approaches 0 while the
validation error increases.
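A framework-agnostic sketch of this patience rule; train_one_epoch, validation_error, copy_weights and restore_weights are assumed helper functions used only for illustration:

best_err, best_weights, wait, patience = float("inf"), None, 0, 5
for k in range(200):                      # train longer than normally needed
    train_one_epoch()
    err = validation_error()
    if err < best_err:                    # validation error improved at step k
        best_err, best_weights, wait = err, copy_weights(), 0
    else:
        wait += 1
        if wait >= patience:              # no improvement for `patience` steps
            break
restore_weights(best_weights)             # return the model stored `patience` steps back
# Keras offers an equivalent callback: keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)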
References

● https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229
● https://www.youtube.com/watch?v=SjQyLhQIXSM&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=2
● https://www.youtube.com/watch?v=EuBBz3bI-aA
● https://www.youtube.com/watch?v=Q81RR3yKn30&t=891s
● https://www.youtube.com/watch?v=D8PJAL-MZv8&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=6
● https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Anashua Dattidar,


Teaching Assistant
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre for
Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Devan Saragoi,


Teaching Assistant
Unit 1: Introduction to Deep Learning

Activation Functions

Devang Saraogi
Teaching Assistant
Unit 1: Introduction to Deep Learning - Activation Functions

Activation Function

As we know, everything from the name to the structure of a Neural Network is inspired by the human brain;
mimicking the way biological neurons signal one another.

Diving into the structure, a node is a replica of a neuron that receives a set of input signals. Depending on the
nature and intensity of these input signals, the brain processes them and decides whether the neuron should be
activated (“fired”) or not.

Similarly, in deep learning, this is the role of the Activation Function—that’s why it’s often referred to as a
Transfer Function in Artificial Neural Network.
Unit 1: Introduction to Deep Learning - Activation Functions

Activation Function

The primary role of the Activation Function is to transform the summed weighted input from the node into an
output value to be fed to the next hidden layer or as output.

• The activation function introduces non-linearity into the output of a neuron.*

• Activation functions also help to normalize the output of each neuron to a range between 1 and 0 or
between -1 and 1.

* A neural network without an activation function is essentially just a linear regression model. The activation function does the
non-linear transformation to the input making it capable to learn and perform more complex tasks.
Unit 1: Introduction to Deep Learning - Activation Functions

Activation Function

Activation Function is a mathematical “gate” in between the input feeding the current neuron
and its output going to the next layer
Unit 1: Introduction to Deep Learning - Activation Functions

Types of Activation Function

Typically the type of problems we consider are either classification or regression


problems.

So we need to know

• What should be the activation function for a regression problem?

• What should be the activation function for a classification problem?


Unit 1: Introduction to Deep Learning - Activation Functions

Types of Activation Function

The most popular types of activation functions include…

Binary Step Function Linear Activation Non-Linear Activation


Unit 1: Introduction to Deep Learning - Activation Functions

Binary Step Function

Input: weighted sum of weights and biases of the neurons in a layer

Working: the input is compared to a certain threshold; if the input exceeds


the threshold then the neuron is activated, else it is deactivated

f(x) = 0 for x < 0
       1 for x ≥ 0

Output: a value of 0 or 1

Disadvantage: the step function does not allow multi-value outputs—for
example, it cannot be used for multi-class classification problems
Unit 1: Introduction to Deep Learning - Activation Functions

Linear Activation Function

Input: weighted sum of weights and biases of the neurons in a layer

Working: the function doesn't do anything to the weighted sum of the


input, it simply spits out the value it was given

f(x) = x

Output: outputs multiple values, not just 1 or 0

Disadvantage: cannot map non-linear data, essentially turning the neural
network into a single layer of neurons
Unit 1: Introduction to Deep Learning - Activation Functions

Types of Activation Function – Non-Linear Activation Function

Linear Activation Functions turn the neural network into a linear regression model. This does not allow the
model to create complex mappings between the inputs and outputs.

Non-linear Activation Functions solve the following limitations of Linear Activation Functions:
• they allow backpropagation; the derivative function is related to the input, and it’s possible to go back
and understand which weights in the input neurons can provide a better prediction

• they allow the stacking of multiple layers as the output would now be a non-linear combination of input
passed through multiple layers; any output can be represented as a functional computation in a neural
network.

Sigmoid Tanh ReLU Softmax


Swish
Function Function Function Function
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Sigmoid Activation Function (Logistic)


Input: weighted sum of weights and biases of the neurons in a layer
Working: the larger the input the closer the output is to 1 whereas smaller
the input closer the output is to 0

f(x) = 1 / (1 + e⁻ˣ)

Output: values between 0 and 1

Advantages:
• suitable for problems involving probabilities
• function is differentiable; smooth gradient
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Sigmoid Activation Function


Disadvantages:
derivative of sigmoid function is

f′(x) = g(x) = sigmoid(x) × (1 − sigmoid(x))

as we can see in the figure, the gradient values are only significant in the
central region; the curve gets flatter in the other regions, which implies that
the values will change very slowly in those regions
• as gradient values approach zero, the network stops learning and
suffers from the Vanishing Gradient problem
• the output of the function is not centered around zero, i.e. the
outputs of all neurons will be of the same sign; this makes training
the neural network hard
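A minimal NumPy sketch of the sigmoid and its derivative as defined above, showing how the gradient vanishes away from zero:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                           # close to 0 for large |x| -> vanishing gradients

print(sigmoid_grad(np.array([-10.0, 0.0, 10.0])))  # gradient is only significant near 0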
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Tanh Function (Hyperbolic Tangent)


Input: weighted sum of weights and biases of the neurons in a layer
Working: the larger the input the closer the output is to 1 whereas smaller
the input closer the output is to -1

f(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)

Output: values bounded between -1 and 1
Advantages:
• Zero centered; easier to model inputs that have strongly negative,
neutral, and strongly positive values
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Tanh Function
Disadvantages:
derivative of tanh function is
f′(x) = g(x) = 1 − tanh²(x)

as we can see, it also faces the problem of vanishing gradients similar to
the sigmoid activation function
• the gradient of the tanh function is much steeper as compared to
the sigmoid function

* although both sigmoid and tanh face the vanishing gradient issue, tanh is zero centered, and the gradients are not restricted to
move in a certain direction; therefore, in practice, tanh nonlinearity is always preferred to sigmoid nonlinearity
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

ReLU Activation (Rectified Linear Unit)


Input: weighted sum of weights and biases of the neurons in a layer

Working: the function outputs 0 if it receives a negative input, but for
positive values it returns the same value (like a Linear Activation Function)

f(x) = max(0, x)

Output: max of 0 and input

Advantages:
• mitigates the vanishing gradient problem
• computationally efficient because of the sparse network, i.e. only those
neurons are activated which have a non-zero value
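A minimal NumPy sketch of ReLU and its gradient; the zero gradient for negative inputs is what leads to the dying-ReLU problem discussed next:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x >= 0).astype(float)    # 0 for negative inputs, 1 otherwise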
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

ReLU Activation
Disadvantages: Dying ReLU Problem

derivative of the function is

f′(x) = g(x) = 0 for x < 0
               1 for x ≥ 0

the negative side of the graph makes the gradient value zero. Due to this
reason, during backpropagation, the weights and biases for some neurons
are not updated. This can create dead neurons which never get activated.
All the negative input values become zero immediately, which decreases
the model’s ability to fit or train from the data properly.
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Leaky ReLU Function


Leaky ReLU is an improved version of ReLU, allowing a fixed small positive slope for negative inputs.

Input: weighted sum of weights and biases of the neurons in a layer

Working: the function has a small positive slope in the negative region
producing non-zero values for negative inputs; the function performs
similarly in positive regions
f(x) = max(0.1·x, x)
Output: max of 0.1*input and input
Advantages:
• same as ReLU
• does not suffer from Dying ReLU
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Leaky ReLU Function


How does Leaky ReLU solve the Dying ReLU problem?

it is an improved version of ReLU, a small positive slope is introduced in the


negative region enabling backpropagation

f′(x) = g(x) = 0.1 for x < 0
               1   for x ≥ 0

making this minor alteration for negative inputs, the gradient of the
negative region results in a non-zero value, hence reducing the occurrences
of dead neurons
Disadvantages: the learning process is time consuming and predictions are not
consistent for negative values
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Parametric ReLU Function (PReLU)


Parametric ReLU is another variant of ReLU that solves the problem of gradient’s becoming zero for negative values

Input: weighted sum of weights and biases of the neurons; learnable
parameter a

Working: a learnable parameter a represents the slope in the negative
region; during training, this parameter is adjusted through backpropagation
to optimize the overall performance of the model

f(x) = max(a·x, x)

Output: max of a*input and input
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Parametric ReLU Function


Advantages:
• adaptive negative slope
• does not suffer from Dying ReLU

Disadvantages:
• limitation is that it may perform differently for different problems,
depending upon the value of the slope parameter a
• there is a risk of overfitting
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Exponential Linear Units (ELUs)


ELU is also a variant of ReLU that modifies the slope of the negative region of the function
Input: weighted sum of weights and biases of the neurons
Working: ELU uses a logarithmic curve to define the negative values unlike
the leaky ReLU and Parametric ReLU functions with a straight line.

f(x) = x            for x ≥ 0
       α(eˣ − 1)    for x < 0

Advantages:
• ELU becomes smooth slowly until its output equals −α
• avoids the dead ReLU problem by introducing a log curve for negative values;
it helps the network nudge weights and biases in the right direction
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Exponential Linear Units


derivative of the activation function

f′(x) = g(x) = 1          for x ≥ 0
               f(x) + α   for x < 0

Disadvantages:
• increases the computational time because of the exponential operation
• α is not a learnable parameter
• exploding gradient problem
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Gaussian Error Linear Unit (GELU)


The Gaussian Error Linear Unit (GELU) activation function is compatible with BERT, ROBERTa, ALBERT, and other top
NLP models. This activation function is motivated by combining properties from dropout, zoneout, and ReLUs
Input: weighted sum of weights and biases of the neurons

f(x) = x · P(X ≤ x) = x · Φ(x)
     ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 x³)))

Working: activation functions (ReLU) activate a neuron by multiplying with 0’s or 1’s; dropout (a regularization
technique) randomly drops neurons by multiplying activations with 0; a new RNN regularizer called Zoneout
stochastically multiplies the inputs by 1.
These together decide the neuron’s output, yet they work independently; GELU aims to combine them.
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Gaussian Error Linear Unit (GELU)


GELU nonlinearity is better than ReLU and ELU activations and finds performance improvements across all tasks in
domains of computer vision, natural language processing, and speech recognition
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Softmax Activation
Input: weighted sum of weights and biases of the neurons in a layer

Working: softmax function is defined as the combination of multiple


sigmoid functions, it calculates the relative probabilities


f(xᵢ) = e^(xᵢ) / Σⱼ₌₁ⁿ e^(xⱼ)

Output: maps the output to a [0,1] range and at the same time makes sure
the summation of the outputs is 1.

The output of Softmax is therefore a probability distribution.
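A minimal NumPy sketch of the softmax formula above (subtracting the maximum is a common numerical-stability trick and does not change the result):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                        # a probability distribution summing to 1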
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Swish Function
The Swish activation performs well in various deep learning tasks. It introduces a non-linearity that allows the
model to capture complex patterns while maintaining some of the desirable properties of ReLU.
Input: weighted sum of weights and biases of the neurons in a layer,
hyperparameter β to control the slope of the swish activation
Working: the function self-gates its output, i.e. the magnitude of the output
is influenced by the input; the sigmoid function acts as a gating mechanism

f(x) = x × sigmoid(βx) = x / (1 + e^(−βx))

Output: the function approaches positive infinity as the input approaches
positive infinity; as the input approaches negative infinity the function
settles down to a constant value (approaching 0)
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Swish Function
Advantages:
• smooth function, does not change value direction abruptly like ReLU; bends smoothly from 0 towards
values < 0 and then upwards again
• lays emphasis on small negative values but zeroes out large negative values (win-win)
• non-monotonous nature enhances the expression of input data and weights to be learnt

Disadvantages:
• beneficial in some cases, but its superiority over other activation functions is not universally established
• finding the optimal value for β may require additional experimentation
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Scaled Exponential Linear Unit (SELU)


SELU is a self-normalizing activation function. It is a variant of the ELU . SELU’s self-normalizing behavior makes
sure that the output is always in a standardized range.
Input: weighted sum of weights and biases of the neurons in a layer, scaling
factor λ and scaling factor α*
Working: the scaling factor λ is applied to the positive region of the function to
ensure the standard deviation of the output is close to 1, and the scaling factor α
is applied to the negative region, contributing to the non-linearity of the
function.

f(x) = λ·x              for x ≥ 0
       λ·α·(eˣ − 1)     for x < 0

* λ (~1.0507) and α (~1.6733) are predefined. These values are based on mathematical considerations to encourage the
self-normalization. When set to these values, SELU helps the network to stabilize.
Unit 1: Introduction to Deep Learning - Activation Functions

Non-Linear Activation Function

Scaled Exponential Linear Unit


Advantages:
• like ReLU, SELU does not have vanishing gradient problem
• compared to ReLU, SELU cannot die
• SELUs learn faster and better than other activation functions without needing further processing; moreover,
other activation functions combined with batch normalization cannot compete with SELUs

Disadvantages:
• SELU is relatively new so it is not yet used widely in practice; ReLU stays as the preferred option
• more research on architectures such as CNNs and RNNs using SELUs is needed for wide-spread industry use
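Minimal NumPy sketches of the GELU (tanh approximation), Swish and SELU formulas above; the SELU constants are the commonly published values (an assumption of this sketch: λ ≈ 1.0507, α ≈ 1.6733):

import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def swish(x, beta=1.0):
    return x / (1 + np.exp(-beta * x))           # x * sigmoid(beta * x)

def selu(x, lam=1.0507, alpha=1.6733):
    return lam * np.where(x >= 0, x, alpha * np.expm1(x))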
Unit 1: Introduction to Deep Learning - Activation Functions

Choice of Activation Function

Depending upon the properties of the problem we might be able to make a better choice for easy and
quicker convergence of the network.
• if we encounter a case of dead neurons in our networks, variants of the ReLU activation function seem to be
the best choice
• ReLU should only be used in hidden layers, and sigmoid/tanh should not be used in hidden layers because
they make the model more susceptible to problems (vanishing gradient)
• sigmoid/tanh functions are generally avoided due to the vanishing gradient problem
• the swish function is used in neural networks having a depth greater than 40 layers

As a rule of thumb, you can begin with the ReLU function and then move over to other activation
functions if ReLU doesn't provide optimum results
Unit 1: Introduction to Deep Learning - Activation Functions

Choice of Activation Function

Depending upon the type of the prediction problem it is advised to choose these activation functions for the
output layer
• Regression – Linear Activation Function
• Binary Classification – Sigmoid Activation Function
• Multiclass Classification – Softmax Activation
• Multilabel Classification – Sigmoid Activation Function
Unit 1: Introduction to Deep Learning - Activation Functions

Recap

• Activation Functions are used to introduce non-linearity in the network.


• A neural network will almost always have the same activation function in all hidden layers. This activation
function should be differentiable so that the parameters of the network are learned in backpropagation.
• While selecting an activation function, you must consider the problems it might face: vanishing and
exploding gradients.
• Use Softmax or Sigmoid function for the classification problems.
Unit 1: Introduction to Deep Learning - Activation Functions

Further Reading

In addition to the popular activation functions mentioned earlier, such as the


Sigmoid and ReLU functions, there are many other activation functions that can be
used in neural networks based on the specific problem statement and desired
optimization of the learning process.
To gain a deeper understanding and familiarize yourself with different activation
functions, you can refer to the following link:

https://www.v7labs.com/blog/neural-networks-activation-functions
Unit 1: Introduction to Deep Learning - Activation Functions

References

• https://www.v7labs.com/blog/neural-networks-activation-functions
• https://pub.towardsai.net/what-is-parametric-relu-2444a2a292de
• https://machinelearningmastery.com/activation-functions-in-pytorch/
• https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/
• https://medium.com/@vinodhb95/activation-functions-and-its-types-8750f1287464
• https://medium.com/@himanshuxd/activation-functions-sigmoid-relu-leaky-relu-and-softmax-basics-for-neural-networks-and-deep-8d9c70eed91e
• https://pytorch.org/docs/stable/nn.html#non-linear-activations-other
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre for
Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Devan Saragoi,


Teaching Assistant
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Aryan Sharma,


Teaching Assistant
UE21CS343BB2: Topics in Deep Learning

Loss Functions
Topics in Deep Learning
What is a loss function?

● To measure how well the neural network is performing on a specific task, we


need loss functions.
● The goal of Loss Function is to measure the error that the model made.
● This function has to be minimized in order to get the most accurate outputs
from our model. Let’s look at how we will do this.
Topics in Deep Learning
What is a loss function?

Let J(w,b) be a loss function called


Mean Squared Error Loss Function.

Thus, our goal is to choose the right


weights/coefficients(w,b) to minimize
J(w,b).
Topics in Deep Learning
Types of Loss Functions

● The loss functions are selected based on


the type of problem.

● Typically, this involves the difference


between the actual value and
approximated(predicted) value.

● Cross-entropy loss is often simply referred


to as “cross-entropy,” “logarithmic loss,”
“logistic loss,” or “log loss” for short.
Topics in Deep Learning
Types of Problems

We will consider 4 main cases to pair up problems with activation and loss
functions.
1. Regression: Predicting a Numerical Value

2. Categorical: Predicting a Binary Outcome

3. Categorical: Predicting a Single Label from Multiple Classes

4. Categorical: Predicting Multiple Labels from Multiple Classes



Topics in Deep Learning
Case 1 : Regression - Predicting a Numerical Value

The final layer of the neural network will have one neuron and the value it returns is a
continuous numerical value.
To evaluate the accuracy of the prediction, it is compared with the true value which is
also a continuous number.
Topics in Deep Learning
Regression Problems

There are three metrics which are generally used for evaluation of Regression
problems (like Linear Regression, Decision Tree Regression, Random Forest
Regression etc.):
1. Mean Absolute Error (MAE):
This measures the absolute average distance between the real data and the
predicted data, but it fails to punish large errors in prediction.
Topics in Deep Learning
Regression Problems

2. Mean Square Error (MSE):


This measures the squared average distance between the real data and the
predicted data. Here, larger errors are well noted (better than MAE).
But the disadvantage is that it also squares the units of the data, so the
error is no longer expressed in the same units as the target variable.
Topics in Deep Learning
Regression Problems

3. Root Mean Squared Error (RMSE):


This is actually the square root of MSE. This metric also solves the problem
of squared units, bringing the error back to the units of the target.
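A minimal NumPy sketch of these three regression metrics:

import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))        # Mean Absolute Error

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)         # Mean Squared Error

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))            # Root Mean Squared Error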
Topics in Deep Learning
Case 2 : Predicting a Binary Outcome

The final layer of the neural network will have one neuron and will return a value
between 0 and 1, which can be inferred as a probability.
To evaluate the accuracy of the prediction, it is compared with the true value, which is
either 0 or 1.
Topics in Deep Learning
Case 2 : Predicting a Binary Outcome

Loss Function for Binary Classification Problems


● Binary Cross Entropy: Cross entropy quantifies the difference between two
probability distribution.
The model predicts a model distribution of {p, 1-p} as we have a binary distribution.
We use binary cross-entropy to compare this with the true distribution {y, 1-y}.
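A minimal NumPy sketch of binary cross-entropy for a single prediction p against a label y; the epsilon clip that avoids log(0) is an implementation detail, not part of the slide's definition:

import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)                   # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))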
Topics in Deep Learning
Case 3 : Predicting a Single Label from Multiple Classes

The final layer of the neural network will have one neuron for each of the classes and
they will return a value between 0 and 1, which will be the probability of it being the
class.
The output is then a probability distribution and will sum to 1.
Topics in Deep Learning
Case 3 : Predicting a Single Label from Multiple Classes

Loss Function for Categorical Problems


● Cross Entropy: Cross entropy quantifies the difference between two probability
distributions. Our model predicts a model distribution of {p1, p2, p3} (where
p1+p2+p3 = 1).
We use cross-entropy to compare this with the true distribution {y1, y2, y3}.
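A corresponding sketch for categorical cross-entropy, assuming one-hot labels and a softmax output (values are made up):

import numpy as np

def categorical_cross_entropy(y_true, p_pred, eps=1e-12):
    # y_true is one-hot; p_pred is the softmax output (each row sums to 1)
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])           # true distributions {y1, y2, y3}
p_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])     # predicted distributions {p1, p2, p3}
print(categorical_cross_entropy(y_true, p_pred))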
Topics in Deep Learning
Case 4 : Predicting Multiple Labels from Multiple Classes

The final layer of the neural network will have one neuron for each of the classes and
each will return a value between 0 and 1, which is the probability of that class being
present.
Unlike the single-label case, these outputs are independent (one sigmoid per class) and do
not need to sum to 1.
Topics in Deep Learning
Case 4 : Predicting Multiple Labels from Multiple Classes

Loss Function for Categorical Problems


● Binary Cross Entropy: Cross entropy quantifies the difference between two
probability distributions. Our model predicts a model distribution of {p, 1-p} (binary
distribution) for each of the classes.
We use binary cross-entropy to compare these with the true distributions {y, 1-y} for
each class and sum up their results.
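A sketch of this multi-label case, where each class gets its own sigmoid output and the per-class binary cross-entropy terms are summed (made-up values):

import numpy as np

def multilabel_bce(y_true, p_pred, eps=1e-12):
    # One independent {p, 1-p} distribution per class; sum the per-class BCE terms
    p = np.clip(p_pred, eps, 1 - eps)
    per_class = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return np.mean(np.sum(per_class, axis=1))

y_true = np.array([[1, 0, 1]])           # the sample belongs to classes 1 and 3
p_pred = np.array([[0.8, 0.3, 0.6]])     # independent sigmoid outputs, need not sum to 1
print(multilabel_bce(y_true, p_pred))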
Topics in Deep Learning
What’s the right loss function?

Although picking a loss function is often overlooked and not given much importance, one must
understand that there is no one-size-fits-all, and choosing a loss function is as important
as choosing the right machine learning model for the problem at hand.

The choice of a loss function is tightly coupled with the choice of output unit in a model.
Topics in Deep Learning
Factors to consider when selecting a loss function
Topics in Deep Learning
Acknowledgements & References

● https://towardsdatascience.com/deep-learning-which-loss-and-activation-functions-should-i-use-
ac02f1c56aa8
● https://www.deeplearningbook.org/
● https://www.analyticsvidhya.com/blog/2022/06/understanding-loss-function-in-deep-learning/
● https://machinelearningknowledge.ai/cost-functions-in-machine-learning/
● https://deeplearning.ai
● https://www.datacamp.com/tutorial/loss-function-in-machine-learning
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Aryan Sharma,


Teaching Assistant
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD),
Centre for Data Sciences & Applied Machine
Learning (CSSAML)
Department of Computer Science and Engineering
[email protected]
Ack: Anirudh Chandrasekar,
Teaching Assistant
Topics in Deep Learning

Batch Normalization
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Problems with NN Training process

• Internal Covariate Shift


• Unstable and Slow training
• Sensitivity to weight initialization
• Limited Generalization
• Increased dependency on Hyper-parameters
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Covariate Shift

• Variables that affect the response variable, but are not of interest in
the study. (according to statistics)
• In ML : They are not variables but essentially features.

Ex:
• Age: Study between physical activity and health.
• Gender: Study between job satisfaction and work-life balance.
• Neural Networks: Activations in Hidden layers and features.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Covariate Shift

• Shift in the distribution of features.


• Usually shift of data/feature distribution between training and
testing.
Ex:
• Image Classification: Day (Train) vs Night (Test)
• Speech Recognition: Trained on different accents but tested on
another.
• Medical Diagnosis: Trained on patients below 20 years and
testing on patients above 50 years.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Internal Covariate Shift

• Shift in the distribution of hidden layer features during training.

• We know that the Input to the


2nd hidden layer is the output of
the 1st hidden layer.
• Similarly all the inputs to the
intermediate layers are outputs
of the previous layer, which are
dependent on the Weights.
• These weights get updated
while training.
• After weight update the input
distributions could change.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Internal Covariate Shift

• The distribution of the input to the kth layer has changed after the weight
update.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Unstable and Slow training

• Since the distribution of inputs is


changing continuously the model has to
re-learn the weights which results in
slow training.
• As the distribution changes the model is
prone to large errors, leading to larger
weight updates.
• Larger weight updates causes
fluctuation and makes the model
unstable as shown in the graph.
• Due to this reaching convergence takes
a lot of time.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Problems with NN Training process

• Weight initialization is done randomly in neural networks.


Initialization must be done with caution as poor initialization can lead to
never attaining convergence.
• Limited Generalization: Since the distribution keeps changing, we cannot
generalize the model as it fits to the data that is already seen.
• Increased dependency on Hyper-parameters: Learning rate is the hyper-
parameter which plays a crucial role in time taken in attaining convergence,
but without normalization there are chances that certain values(higher
values) of learning rates do not lead to convergence at all.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Input Standardization

Convert input data into


a distribution with 0
mean and unit variance.
This takes care of the
input layers.
What about hidden
layers?
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Normalizing Activations in a Network

Normalizing inputs is done to speed up learning

This is done by computing the mean and


variance of the input and then subtracting
the mean and normalizing the data by the
variance.
Considering a deep network, Can we
normalize a[2] (or any hidden layer) so as to
train w[3], b[3] faster?
• This is what batch normalization does.
• But we normalize z[2] instead of a[2] i.e.
before applying the activation function.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Implementing Batch Normalization
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Batch v/s Mini-Batch

Batch - all the data is passed at once. In this case we do not require batch normalization as
weight updates happen only once.

Mini-batch - Data is split into smaller sets and each set is passed in a new iteration. In this
case Batch Normalization is needed since weights are updated before each iteration.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Batch
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Batch Normalization
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Batch Normalization

Z(i)norm = (Z(i) − μ) / √(σ² + ε)

Z̃(i) = γ · Z(i)norm + β
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Implementing Batch Normalization

Given some intermediate values of the NN z(1), z(2), z(3), ..., z(i) for some hidden layer l,
compute

μ = (1/m) Σᵢ z(i)        σ² = (1/m) Σᵢ (z(i) − μ)²        Z(i)norm = (z(i) − μ) / √(σ² + ε)

where
• μ is the mean and σ² is the variance.
• ε is added to ensure mathematical stability (in case the variance turns out to be 0).
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Implementing Batch Normalization

Now we have normalized the values of z such that they have mean 0 and unit variance.
But we do not want this to be the case always and might want them to have a different
distribution.
So we compute,

Z̃(i) = γ · Z(i)norm + β

where γ and β are learnable parameters of the model.
Note: γ scales and β shifts.
Question: For what values of γ and β will Z̃(i) = Z(i)?
Ans: γ = √(σ² + ε) and β = μ
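To make the computation concrete, a minimal NumPy sketch of the batch-norm forward pass for one hidden layer is given below (not from the slides; shapes and values are made up):

import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-8):
    # z: (batch_size, units) pre-activation values of one hidden layer
    mu = z.mean(axis=0)                     # per-unit mean over the mini-batch
    var = z.var(axis=0)                     # per-unit variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * z_norm + beta            # learnable scale (gamma) and shift (beta)

z = np.random.randn(32, 4) * 5 + 3          # hypothetical mini-batch of activations
gamma, beta = np.ones(4), np.zeros(4)
z_tilde = batchnorm_forward(z, gamma, beta)
print(z_tilde.mean(axis=0), z_tilde.std(axis=0))   # approximately 0 and 1 per unit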
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Adding Batch Normalization to a Network

Using Batch Norm in 3 hidden layers NN:

Our NN parameters will be:

If you are using a deep learning framework, you won't have to implement
batch norm yourself:
• Ex. in TensorFlow you can add this line: tf.nn.batch_normalization()
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Why are γ and β needed?

As shown in the figure, once we normalize the inputs, all of them follow a normal distribution.
This leads to the model learning the same distribution over and over again, hindering the
learning capability of the model.

Normalize:        Z(i)norm = (Z(i) − μ) / √(σ² + ε)
Scale and Shift:  Z̃(i) = γ · Z(i)norm + β

Hence we mimic the features of the original distribution by using γ (Scaling) and β (Shifting).
Note: When γ = √(σ² + ε) and β = μ we get the above result.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Why are γ and β learnable?

Z̃(i) = γ · Z(i)norm + β

As shown in the figure, different values of γ and β produce different distributions, and what is
chosen depends on how the distributions are varying across batches (learning factor).
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Why are γ and β learnable?

Z̃(i) = γ · Z(i)norm + β

Using γ and β, the network is trying to learn the overall distribution of the training dataset
even though we are passing the input as batches.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Batch Normalization

Advantages of Batch Normalization

• Networks train faster.


• Allows for higher learning rates.
• Makes weights easier to initialise.
• Provides some regularization.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Acknowledgements & References

• https://deeplearning.ai
• https://medium.com/@rishavsapahia/5-min-recap-for-andrew-ng-deep-learning-specialization-
course-2-8a59fd58ca0d
• https://youtu.be/PRQPyQeq6C4?si=7j_lT2i03mS4O94N
• https://youtu.be/2xChdY2qkmc?si=ten5Owc--g8ZaKOk
Thank You

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre for
Data Sciences & Applied Machine Learning (CSSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Anirudh Chandrasekar,


Teaching Assistant
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Devang Saraogi


Teaching Assistant
Optimization
Optimization is a process of finding optimal parameters for the model, which
significantly reduces the error function.

Optimization algorithms used for training of deep models differ from traditional optimization algorithms in
several ways. In most machine learning scenarios we care about some performance measure, say P, but we
optimise P only indirectly by reducing a different cost function J(θ) in the hope that doing so will improve P.

The following are the optimization algorithms discussed in


this course :

● Mini Batch gradient descent


● Stochastic gradient descent
● Gradient descent with momentum
● RMS Prop
● Adam
A Recap of Gradient Descent
Gradient Descent searches the hypothesis space of all possible weight vectors to find the best fit
for all training examples. Say we have a cost function J(w) and we wish to minimise this cost
function:
1. We randomly initialize the weight vectors
2. We calculate the loss and update the weight vectors in the direction of steepest descent.
   The equation to do so is:  w := w − α · ∂J(w)/∂w

3. We repeat till we converge at a minima


Batch Gradient Descent
Batch gradient descent is a variation of the gradient descent algorithm that calculates the error for each
example in the training dataset, but only updates the model after all training examples have been evaluated
Upsides

● Fewer updates to the model means this variant of gradient descent is more computationally efficient than stochastic
gradient descent.
● The decreased update frequency results in a more stable error gradient and may result in a more stable convergence on
some problems.
● The separation of the calculation of prediction errors and the model update lends the algorithm to parallel processing
based implementations.

Downsides

● The more stable error gradient may result in premature convergence of the model to a less optimal set of parameters.
● The updates at the end of the training epoch require the additional complexity of accumulating prediction errors across all
training examples.
● Commonly, batch gradient descent is implemented in such a way that it requires the entire training dataset in memory
and available to the algorithm.
● Model updates, and in turn training speed, may become very slow for large datasets.
Stochastic Gradient Descent
Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that
calculates the error and updates the model for each example in the training dataset.
Upsides

● The frequent updates immediately give an insight into the performance of the model and the rate of improvement.
● This variant of gradient descent may be the simplest to understand and implement, especially for beginners.
● The increased model update frequency can result in faster learning on some problems.
● The noisy update process can allow the model to avoid local minima (e.g. premature convergence).

Downsides

● Updating the model so frequently is more computationally expensive than other configurations of gradient descent,
taking significantly longer to train models on large datasets.
● The frequent updates can result in a noisy gradient signal, which may cause the model parameters and in turn the
model error to jump around (have a higher variance over training epochs).
● The noisy learning process down the error gradient can also make it hard for the algorithm to settle on an error
minimum for the model.
Mini Batch Gradient Descent
Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into
small batches that are used to calculate model error and update model coefficients. Mini-batch gradient
descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of
batch gradient descent. It is the most common implementation of gradient descent used in the field of deep
learning.
Upsides

● The model update frequency is higher than batch gradient descent which allows for a more robust
convergence, avoiding local minima.
● The batched updates provide a computationally more efficient process than stochastic gradient descent.
● The batching allows both the efficiency of not having all training data in memory and algorithm
implementations.

Downsides

● Mini-batch requires the configuration of an additional “mini-batch size” hyperparameter for the learning
algorithm.
● Error information must be accumulated across mini-batches of training examples like batch gradient
descent
How to choose the best value for Mini Batch size

● If there is a small training set batch gradient descent is preferred.

● Mini batch size is a hyperparameter and it is a good idea to experiment with a range of values
to get the best fit for your model.

● Mini-batch sizes, commonly called “batch sizes” for brevity, are often tuned to an aspect of the
computational architecture on which the implementation is being executed. Such as a power of
two that fits the memory requirements of the GPU or CPU hardware like 32, 64, 128, 256, and
so on.
Convergence in different versions of Gradient Descent
Batch gradient descent takes small steps
towards the minima and will converge to a
minima.
While Stochastic gradient descent is noisy and
oscillates near the minima . It never actually
converges to the minima.
Exponentially Weighted Averages
The Exponentially Weighted Moving Average (EWMA) is commonly used as a smoothing technique in time
series. However, due to several computational advantages (fast, low-memory cost), the EWMA is behind the
scenes of many optimization algorithms in deep learning, including Gradient Descent with Momentum,
RMSprop, Adam, etc.

Let’s understand with an example: say we have some data for temperatures across multiple days in a
city and we wish to approximate the next day’s temperature from this data. Let the estimated
temperature be Vt, the previous estimate be Vt-1, Ot the temperature for day t and β a
hyperparameter. Then Vt = β · Vt-1 + (1 − β) · Ot.

Here , β determines how important the previous


value is (the trend), and (1-β) how important the
current value is.
Exponentially Weighted Averages
Here , Vt is calculated by approximating over 1/ (1-β) days of temperature . Thus if β is
chosen to be 0.9 then we approximate over the last 10 days if 0.5 then the last 2 days
and so on.
Note : As the value of β
increases the curve
becomes smoother and
has less noise.

However with large values of


β as we are averaging over a
larger window the formula
adapts more slowly to
changes in data as high
weightage is given to older
values
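A small Python sketch of the EWMA recurrence (not from the slides; the temperature values are made up):

import numpy as np

def ewma(observations, beta):
    v, averages = 0.0, []
    for o in observations:
        v = beta * v + (1 - beta) * o    # Vt = beta * Vt-1 + (1 - beta) * Ot
        averages.append(v)
    return np.array(averages)

temps = [30, 32, 31, 35, 34, 33, 36, 37]   # hypothetical daily temperatures
print(ewma(temps, beta=0.9))   # smoother, adapts slowly
print(ewma(temps, beta=0.5))   # noisier, tracks recent values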
Gradient Descent with Momentum
In this method we compute the exponentially weighted average of the gradients
and use this new value to calculate the weights instead
A problem with the gradient descent algorithm is that the progression of the search can bounce around the
search space based on the gradient. For example, the search may progress downhill towards the minima, but
during this progression, it may move in another direction, even uphill, depending on the gradient of specific
points (sets of parameters) encountered during the search.

This can slow down the progress of the search, especially for those optimization problems where the broader
trend or shape of the search space is more useful than specific gradients along the way.

One approach to this problem is to add history to the parameter update equation based on the gradient
encountered in the previous updates
Gradient Descent with Momentum
1. First we calculate dW (where W is the weight) for the current mini batch

2. Then we compute  VdW = β · VdW + (1 − β) · dW

3. Then we update our weights as  W = W − α · VdW

Using a very high value for β we can dampen out the oscillations which arise in gradient
descent!
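A minimal sketch of one momentum update step (not from the slides; the gradient values are made up):

# One parameter-update step of gradient descent with momentum (per mini-batch)
def momentum_step(w, dW, vdW, lr=0.01, beta=0.9):
    vdW = beta * vdW + (1 - beta) * dW   # exponentially weighted average of the gradients
    w = w - lr * vdW                     # update the weights with the smoothed gradient
    return w, vdW

w, vdW = 1.0, 0.0
for dW in [0.4, -0.3, 0.5, -0.2]:        # hypothetical noisy gradients
    w, vdW = momentum_step(w, dW, vdW)
print(w, vdW)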
Gradient Descent with Momentum
Upsides:
1. Momentum has the effect of damping down the change in the gradient and, in turn, the step size with each
new point in the search space.

2. Momentum is most useful in optimization problems where the objective function has a large amount of
curvature (e.g. changes a lot), meaning that the gradient may change a lot over relatively small regions of the
search space.

3. It is also helpful when the gradient is estimated, such as from a simulation, and may be noisy, e.g. when the
gradient has a high variance.

4. Finally, momentum is helpful when the search space is flat or nearly flat, e.g. zero gradient. The momentum
allows the search to progress in the same direction as before the flat spot and helpfully cross the flat region.

Downsides :

1. It can overshoot the global minimum and converge to a local minimum instead.
2. Another disadvantage is that the momentum term can cause the optimization process to oscillate around the
global minimum.
Root Mean Square or RMS prop
We know that while implementing gradient descent we end up with lots of oscillations
Let the horizontal direction be w and the vertical direction be b. In this case we wish to
speed up learning in the w direction and slow down learning in the b direction.
for iteration t:
1. Calculate the derivatives dW and db for the current mini batch
2. Calculate an exponentially weighted average of the squares of the derivatives:
   SdW = β₂ · SdW + (1 − β₂) · dW²        Sdb = β₂ · Sdb + (1 − β₂) · db²

NOTE : β₂ and β are separate hyperparameters

3. Then we update the weights as:
   W = W − α · dW / (√SdW + ε)            b = b − α · db / (√Sdb + ε)


RMS prop intuition
● In this example we wish to dampen out oscillations in the b direction and to learn
faster in the x direction.
● Thus SdW is a small number while Sdb is a large number.

● Thus by dividing by a small SdW we can increase the magnitude of update in the
W direction and while dividing with a large Sdb we can dampen the oscillations in
the b direction.

Note : if SdW is very close to zero, then in order to avoid the W update becoming very large we add a small term ε to the denominator.
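A minimal sketch of one RMSprop update step, assuming a single weight tensor W (values made up):

import numpy as np

def rmsprop_step(w, dW, sdW, lr=0.01, beta2=0.999, eps=1e-8):
    sdW = beta2 * sdW + (1 - beta2) * dW ** 2   # EWA of the squared derivatives
    w = w - lr * dW / (np.sqrt(sdW) + eps)      # large sdW damps the step, small sdW enlarges it
    return w, sdW

w, sdW = np.array([1.0, -2.0]), np.zeros(2)
w, sdW = rmsprop_step(w, np.array([0.3, -0.1]), sdW)
print(w, sdW)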
RMS Prop
Upsides:

1. Faster convergence: RMSprop can converge faster than SGD by adapting the learning rate based on the
magnitude of the gradients for each parameter.

2. Robustness to different learning rates: RMSprop is more robust to different learning rates for different
parameters, which can be helpful in deep learning models with many parameters.

3. Adaptive learning rate: RMSprop adaptively scales the learning rate based on the history of the squared
gradients, which can improve the convergence speed and stability of the optimization process

Downsides:

1. RMSProp can sometimes lead to slow convergence.


2. It is sensitive to the learning rate.
Adam optimizer
Adam or Adaptive moment estimation optimizer is a combination of gradient
descent with momentum and the RMSprop optimizer
1. First we initialize VdW = 0, Vdb = 0, SdW = 0, Sdb = 0
2. for iteration t :
   a. Compute dW and db for the current mini batch
   b. Then compute :
      i.  A momentum-like update:   VdW = β₁ · VdW + (1 − β₁) · dW
      ii. An RMSprop-like update:   SdW = β₂ · SdW + (1 − β₂) · dW²

Note: Adam optimization applies bias correction to the first and second moment estimates to
ensure that they are unbiased estimates of the true values.
So, the bias corrected terms are:   VdW_corrected = VdW / (1 − β₁ᵗ),   SdW_corrected = SdW / (1 − β₂ᵗ)


Adam optimizer
Thus the weights will be updated as :
W = W − α · VdW_corrected / (√SdW_corrected + ε)
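A minimal sketch of one Adam step with bias correction (not from the slides; toy values):

import numpy as np

def adam_step(w, dW, vdW, sdW, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    vdW = beta1 * vdW + (1 - beta1) * dW          # momentum-like first moment
    sdW = beta2 * sdW + (1 - beta2) * dW ** 2     # RMSprop-like second moment
    vdW_hat = vdW / (1 - beta1 ** t)              # bias-corrected first moment
    sdW_hat = sdW / (1 - beta2 ** t)              # bias-corrected second moment
    w = w - lr * vdW_hat / (np.sqrt(sdW_hat) + eps)
    return w, vdW, sdW

w, vdW, sdW = np.array([0.5]), np.zeros(1), np.zeros(1)
for t in range(1, 4):                             # t starts at 1 so the corrections are defined
    w, vdW, sdW = adam_step(w, np.array([0.2]), vdW, sdW, t)
print(w)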

Adam Configuration Parameters

● alpha. Also referred to as the learning rate or step size. The proportion that weights are updated (e.g.
0.001). Larger values (e.g. 0.3) results in faster initial learning before the rate is updated. Smaller values
(e.g. 1.0E-5) slow learning right down during training
● beta1. The exponential decay rate for the first moment estimates (e.g. 0.9).
● beta2. The exponential decay rate for the second-moment estimates (e.g. 0.999). This value should be
set close to 1.0 on problems with a sparse gradient (e.g. NLP and computer vision problems).
● epsilon. Is a very small number to prevent any division by zero in the implementation (e.g. 10E-8).

Further, learning rate decay can also be used with Adam.


Adam optimizer
Advantages :

1. Adaptive Learning Rates: Unlike fixed learning rate methods like SGD, Adam optimization
provides adaptive learning rates for each parameter based on the history of gradients. This allows
the optimizer to converge faster and more accurately, especially in high-dimensional parameter
spaces.

2. Momentum: Adam optimization uses momentum to smooth out fluctuations in the optimization
process, which can help the optimizer avoid local minima and saddle points.

3. Bias Correction: Adam optimization applies bias correction to the first and second moment
estimates to ensure that they are unbiased estimates of the true values.

4. Robustness: Adam optimization is relatively robust to hyperparameter choices and works well
across a wide range of deep learning architectures.
Adam optimizer
Disadvantages :

Memory intensive: Adam needs to store moving averages of past gradients for each parameter during training
and hence it requires more memory than some other optimization algorithms, particularly when dealing with very
large neural networks or extensive datasets.

Slower convergence in some cases: While Adam usually converges quickly, it might converge to flawed
solutions in some cases or tasks. In such scenarios, other optimization algorithms like SGD (stochastic gradient
descent) with momentum or Nesterov accelerated gradient (NAG) may perform better.

Hyperparameter sensitivity: Although Adam is less sensitive to hyperparameter choices than some other
algorithms, it still has hyperparameters like the learning rate, beta1, and beta2. Choosing inappropriate values
for these hyperparameters could impact the performance of the algorithm.
Learning Rate Decay
Learning rate decay is a technique used in machine learning models to train modern neural networks. It
involves starting with a large learning rate and then gradually reducing it until local minima is obtained.

Say, we are implementing mini batch gradient descent with a relatively small batch size
and the gradient takes noisy steps as shown below. It also does not converge at the
minima but oscillates around the minimum value.

In such a situation slowly reducing the learning rate (α) value as we approach the
minima could be advantageous as the final value would oscillate closer to the minima if
small steps are taken .
This could be done by setting :

α = α₀ / (1 + decay_rate · epoch_num)

Here decay_rate is yet another hyperparameter, while α₀ is the initial learning rate value.
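A small sketch of this schedule (plus the exponential variant mentioned on the next slide); alpha0 and decay_rate are made-up values:

def inverse_decay(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)      # the schedule above

def exponential_decay(alpha0, decay_rate, epoch_num):
    return alpha0 * (decay_rate ** epoch_num)         # e.g. decay_rate = 0.95

for epoch in range(5):
    print(epoch, inverse_decay(0.2, 1.0, epoch), exponential_decay(0.2, 0.95, epoch))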
Other learning rate decay methods

1. Exponential decay:

2. Epoch number
Based decay:

3. Discrete Step decay : In this method the learning rate is decreased in discrete steps after
every certain interval of time, for example reducing the learning rate to half its value every
10 seconds.

4. Manual decay : In this method practitioners manually examine the performance of the algorithm
and decrease the learning rate manually, day by day or hour by hour, etc.
Resources
● https://www.scaler.com/topics/momentum-based-gradient-descent/
● https://medium.com/mlearning-ai/exponentially-weighted-
average-5eed00181a09
● https://medium.com/nerd-for-tech/optimizers-in-machine-learning-
f1a9c549f8b4
● https://www.analyticsvidhya.com/blog/2021/03/variants-of-gradient-
descent-algorithm/
● https://machinelearningmastery.com/gradient-descent-with-momentum-
from-scratch/
● https://machinelearningmastery.com/gentle-introduction-mini-batch-
gradient-descent-configure-batch-size/
● https://youtube.com/playlist?
list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&si=7Z_y1LQXzn7fZF1I
● https://arxiv.org/pdf/1412.6980.pdf
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack:Anashua Krittika Dastidar,


Teaching Assistant
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Divija L,
Teaching Assistant
Convolutional Neural Network(CNN):
Introduction,Filters, Feature Maps
UE22CS645BC2-TDL-Lecture -CNN
Introduction To Computer Vision(CV)

● Computer vision is a field of artificial intelligence that enables computers and


systems to derive meaningful information from digital images, videos and other
visual inputs — and take actions or make recommendations based on that
information.
● CV is used in various tasks like: Object Identification,
Facial Recognition, Image Segmentation, Image Classification,
Neural Style Transfer etc.
● CV needs lots of data. It runs analyses of data
over and over until it discerns distinctions and
ultimately recognizes images.
UE22CS645BC2-TDL-Lecture -CNN
Introduction To Computer Vision(CV)

A digital image is a binary representation of visual data. It contains a series of pixels


arranged in a grid-like fashion that contains pixel values to denote how bright and what
color each pixel should be.
How does a computer see an image?
UE22CS645BC2-TDL-Lecture -CNN
Introduction To Computer Vision(CV)

But how are Image processed?

● Images are processed in the form of an array of pixels,


where each pixel has a set of values, representing the
presence and intensity of the three primary colors: RGB
red, green, and blue.
● All pixels come together to form a digital image.
UE22CS645BC2-TDL-Lecture -CNN
What’s the problem with Fully Connected NN’s ?
● Fully connected Neural Networks: All neurons in one layer are connected to all neurons in the next layer. Each neuron in a
layer contributes to the activation of every neuron in the subsequent layer, leading to a dense and interconnected
structure.

● Imagine in CIFAR Dataset the images are only of size 32 X 32 pixels and have 3 color channels.
In this case a Single fully connected neuron in a first hidden layer of this neural network would have
32 x 32 x 3 = 3072 weights (it is still manageable.)

But let's consider a bigger image,


300 x 300 x 3 = 270,000 weights
(demands lot of computational power, doesn’t leverage the spatial structures and relations of
each image)

● A huge neural network like that demands a lot of resources but


even then remains prone to overfitting because the large number
of parameters enable it to just memorize the dataset.

● They tend to perform less and aren’t good for feature extraction.
UE22CS645BC2-TDL-Lecture -CNN
Why Convolution Neural Networks(ConvNet / CNN) ?
Convolution is a mathematical operation that allows the merging of two sets of information.
Convolution is applied to the input data to filter the information and produce a feature map.

● A simple CNN is a sequence of layers, and every layer of a CNN transforms one volume of
activations to another through a differentiable function. Unlike Fully connected NNs, a neuron
in a convolutional layer is not connected to the entire input but just some section of the input
data.

● The architecture performs a better fitting to the image dataset due to the reduction in the
number of parameters involved. ConvNet leverage spatial information and are therefore very
well suited for classifying images.
UE22CS645BC2-TDL-Lecture -CNN
Difference between Fully connected NN vs CNN

Feature Fully Connected NN (FCN) Convolutional NN (CNN)

Architecture Dense, fully interconnected layers Convolutional layers, pooling layers

Parameter No parameter sharing Shared weights in convolutional filters


Sharing

Spatial Do not explicitly capture spatial hierarchies Capture spatial hierarchies through convolutional
Hierarchies and pooling layers

Computationa Can be computationally expensive, especially More parameter-efficient, computationally effective,


l Efficiency with high-dimensional input especially for images

Feature May struggle with capturing local patterns and Specialized for feature extraction from local regions,
Extraction spatial relationships suitable for images
UE22CS645BC2-TDL-Lecture -CNN
Idea of how CNNs Function

● The CNN is composed of image processing layers that deduce and pass down information from one
layer to the next.
● At each layer, information of different abstractions is deduced.
● Generally, in the earlier layers, simpler and more basic ideas will be deduced while later layers will use the gathered
information from the previous layers to deduce more complex ideas.
● The following figure lays down an example CNN - in the figure, a boat image is passed from the left end layer to the next
until it reaches the right-most layer, where it classifies the class of the image.
● Due to the way the layers learn more and more complex features the deeper we go into the network, we can call this a
“Hierarchical Learning Structure”.

Classifies image as “Boat”


UE22CS645BC2-TDL-Lecture -CNN
Analogy to Visual Cortex
● Hierarchical layer structure is actually also utilized in the visual cortex ventral pathway.Our vision is
based on multiple cortex levels, each one recognizing more and more structured information.

● First, we see single pixels; then from them, we recognize simple geometric forms. And then more sophisticated elements
such as objects, faces, human bodies, animals, and so on.

● Individual neurons respond to stimuli only in a restricted region of the visual field known as the Receptive Field.

● As we proceed through the visual pathway, the features learned become more complex, just as in the CNN. The receptive
visual field size increases as well, as larger receptive field suggests a more holistic and general feature in the image.
UE22CS645BC2-TDL-Lecture -CNN
Advantages of CNN:
1. Local Receptive Fields:
● In images, there are local patterns and structures that are important for recognizing objects. CNNs use local
receptive fields (small regions of the input data) to capture these patterns that focuses on specific features in the
visual data.

2. Hierarchical Feature Learning:


● Lower layers capture simple features like edges and textures, while higher layers learn more complex and
abstract features, leading to the recognition of objects. This learning makes it highly effective in image-related
tasks.

3. Parameter Sharing:

● CNNs use shared weights and biases across different parts of the input data that allows them to learn a set of
filters that can be applied across the entire input, making them particularly effective for detecting patterns and
features in various locations of an image. This parameter sharing significantly reduces the number of
parameters compared to fully connected networks, making CNN's computationally efficient.
UE22CS645BC2-TDL-Lecture -CNN
Advantages of CNN:

4. Parameter Efficiency:

● CNNs are more parameter-efficient than fully connected networks, which is crucial when dealing with large
images. The use of shared weights and local receptive fields allows CNNs to learn from fewer parameters while
retaining the ability to capture important features.

5. Translation Invariance:

● CNNs can recognize patterns regardless of their position in the input space. This is achieved through the use of
pooling layers that down-sample the spatial dimensions, making the network less sensitive to small changes in
the position of features.
UE22CS645BC2-TDL-Lecture -CNN
Input Image

• The RGB image in the figure has been separated by its three color planes — Red, Green, and Blue.

• There are a number of such color spaces in which images exist — Grayscale, RGB, HSV, CMYK, etc.

• The role of the ConvNet is to reduce the images into


a form which is easier to process, without losing
features which are critical for getting a good
Prediction.

• This is important when we are to design an


architecture which is not only good at learning
features but also is scalable to massive datasets.
UE22CS645BC2-TDL-Lecture -CNN
Local Receptive Fields

• To preserve spatial information, each image is represented as a matrix of pixels.

• To encode the local structure a submatrix of adjacent input neurons is connected into one
single hidden neuron belonging to the next layer. That single hidden neuron represents
one local receptive field. This operation is named convolution.

• Unlike fully-connected layers, where each output unit gathers information from all of the
inputs, the activation of a convolution output cell is determined by the inputs in its
receptive field.
UE22CS645BC2-TDL-Lecture -CNN
Local Receptive Fields

• This principle works best for hierarchically structured data such as images.
• For eg., suppose that the size of each single submatrix is 5 x 5 and that
those submatrices are used with MNIST images of 28 x 28 pixels.
• Then we will be able to generate 24 x 24 local receptive field
neurons (feature maps) in the next hidden layer, since 28 − 5 + 1 = 24.
• It is possible to slide the submatrices by only 23 positions beyond the starting
position before touching the borders of the images.
• There can be multiple feature maps that learn independently from each
hidden layer.
UE22CS645BC2-TDL-Lecture -CNN
CNN in a nutshell

● The first layer is responsible for detecting lines, edges, changes in brightness, and other simple features.

● The information is then passed onto the next layer, which combines simple features to build detectors
that can identify simple shapes.

● The process continues in the next layer and the next, becoming more and more abstract with every layer.

● The deeper layer will be able to extract high-order features such as shapes or specific objects.

● The last layers of the network will integrate all of those complex features and produce classification
predictions.

● The predicted value will be compared to the correct output, where those that are wrongly classified will
cause a large error gap and will cause the learning process to backpropagate to make changes to the
parameter in order to give out a more accurate outcome.

● The network goes back and forth, correcting itself until the satisfying output is achieved (where the error
is minimised).
UE22CS645BC2-TDL-Lecture -CNN
CNN in a nutshell
UE22CS645BC2-TDL-Lecture -CNN
Filters / Kernels
● A kernel in a convolution is an n x n matrix of numbers.
● They are designed to detect and highlight specific patterns or features in the input data.
● The CNN convolution can have multiple filters, highlighting different features, which results in multiple output feature maps
(one for each filter).
● It can also gather input from multiple feature maps, for e.g,
the output of a previous convolution.
● The combination of feature maps (input or output) is called a
volume. In this context, we can also refer to the feature maps as
slices.
● For example, in image processing, filters might be designed to
detect edges, corners, or textures.
● Each number inside the kernel matrix is multiplied with the value
for each pixel it lines up with, and then all the elements of the
Kernel are summed to output one single value. Then, the kernel slides over
to a new position in the image and the process is repeated.
UE22CS645BC2-TDL-Lecture -CNN
Filter Hyperparameters
1. DIMENSIONS OF A FILTER:

A filter of size F x F applied on C channels is a F x F x C volume that performs convolutions on a input


of size I x I x C and produces an output feature map/ activation map of size O x O x 1.

The application of K filters of size F x F results in an output feature map of size O x O x K.


UE22CS645BC2-TDL-Lecture -CNN
Filter Hyperparameters
2. STRIDE:
● For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves
after each operation.
● This shift can be horizontal, vertical, or both, depending on the stride's configuration.

● Importance of strides - The choice of stride affects the model in several ways.
UE22CS645BC2-TDL-Lecture -CNN
Filter Hyperparameters
2. STRIDE:
● Importance of strides -

~ Output Size: A larger stride results in a smaller output spatial dimension, because the filter moves across
the input image in bigger steps, thus reducing the number of positions it can occupy.

~ Computational Efficiency: Increasing the stride can decrease the computational load.Since the filter
moves more pixels per step, it performs fewer operations, which can speed up the training and
inference processes.

~ Field of View: Higher stride in the convolutional operation considers a broader area of the input
image per step, aiding in capturing global features over finer details.

~ Downsampling: Increasing strides can be used as an alternative to max pooling layers for
downsampling the input.
UE22CS645BC2-TDL-Lecture -CNN
Filter Hyperparameters
3. ZERO-PADDING:

● Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can
either be manually specified or automatically set.

● It has 3 modes:

(i) VALID MODE:


● Where
P=0

Purpose:
● No padding
● Drops last convolution if dimensions do not match
UE22CS645BC2-TDL-Lecture -CNN
Filter Hyperparameters
3. ZERO-PADDING:

(ii) SAME MODE:

● Where the padding P is chosen so that the output feature map has size ⌈I/S⌉, with
   -> Filter size F x F
   -> Input size I x I
   -> Stride S

Purpose:
● Padding such that the feature map has size ⌈I/S⌉

● Output size is mathematically convenient

● Also called 'half' padding


UE22CS645BC2-TDL-Lecture -CNN
Filter Hyperparameters
3. ZERO-PADDING:

(iii) FULL MODE:


● Where P = F − 1

Purpose:
● Maximum padding such that end convolutions are applied on the
limits of the input.

● Filter 'sees' the input end-to-end
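These hyperparameters are tied together by the standard output-size relation O = (I − F + 2P)/S + 1 (not stated explicitly on the slides); a small sketch:

# Output size of a convolution for input I x I, filter F x F, stride S, padding P
def conv_output_size(I, F, S, P):
    return (I - F + 2 * P) // S + 1

print(conv_output_size(I=5, F=3, S=1, P=0))   # valid padding -> 3
print(conv_output_size(I=5, F=3, S=1, P=1))   # same padding  -> 5
print(conv_output_size(I=5, F=3, S=1, P=2))   # full padding  -> 7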


UE22CS645BC2-TDL-Lecture -CNN
Typical Layers in a ConvNet
● A CNN typically has three layers: a convolutional layer, a pooling layer, and
fully connected layer.
● Two different types of layers, convolutional and pooling, are typically alternated.
● The depth of each filter increases from left to right in the network.
● The last stage is typically made of one or more fully connected layers
UE22CS645BC2-TDL-Lecture -CNN
Convolution Operation
• Consider an image of dimensions 5x5x1 [5 (Height) x 5 (Breadth) x 1 (Number of
channels, eg. RGB)]
• In the illustration, the green section is the 5x5x1 input image, I.
• The element involved in carrying out the convolution operation in the first part of
a Convolutional Layer is called the Kernel/Filter, K, represented in the color
yellow.
• The filter K is a 3x3x1 matrix.
• The Kernel shifts 9 times because of
Stride Length = 1 (Non-Strided),
every time performing a matrix multiplication
operation between K and the portion P of the
image over which the
kernel is hovering.
UE22CS645BC2-TDL-Lecture -CNN
Convolution Operation
▪ Depending on the number of input dimensions, we have 1D, 2D, or 3D convolutions.
▪ A time series input is a 1D vector, an image input is a 2D matrix, and a 3D point cloud is a 3D
tensor.
▪ The convolution works as follows:
1. We slide the filter along all of the dimensions of the input tensor.
2. At every input position, we multiply each filter weight by its corresponding input tensor cell
at the given location.
3. The input cells, which contribute to a single output cell, are called receptive fields.
4. We sum all of these values to produce the value of a single output cell.
▪ 2D convolution with a 2×2 filter applied over a single 3×3 slice
UE22CS645BC2-TDL-Lecture -CNN
Convolution Operation

● Let us apply this idea to a toy


example and see the results.
UE22CS645BC2-TDL-Lecture -CNN
Convolution Operation

Finally !!
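Since the worked example above relies on figures, here is a minimal NumPy sketch of the same sliding-window operation (as implemented in deep-learning frameworks, i.e. a cross-correlation); the input and kernel values are made up:

import numpy as np

def conv2d(image, kernel, stride=1):
    # 'Valid' (no padding) 2-D convolution of a single-channel image with one filter
    H, W = image.shape
    F = kernel.shape[0]
    out = np.zeros(((H - F) // stride + 1, (W - F) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i * stride:i * stride + F, j * stride:j * stride + F]
            out[i, j] = np.sum(patch * kernel)   # elementwise product, then sum
    return out

image = np.arange(25, dtype=float).reshape(5, 5)                    # toy 5x5 input
kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])    # vertical-edge filter
print(conv2d(image, kernel))                                        # 3x3 feature map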
TOPICS IN DEEP LEARNING

Convolution in 1D, 2D, 3D


UE22CS645BC2-TDL-Lecture -CNN
Effects of Convolutions on Images

Let us see some examples of 2D convolutions applied to


images
UE22CS645BC2-TDL-Lecture -CNN
Effects of Convolutions on Images

INPUT KERNEL
OUTPUT
UE22CS645BC2-TDL-Lecture -CNN
Effects of Convolutions on Images

INPUT KERNEL
OUTPUT
UE22CS645BC2-TDL-Lecture -CNN
Effects of Convolutions on Images

INPUT KERNEL
OUTPUT

Note how all the edges separating different colours and brightness levels have been
identified
UE22CS645BC2-TDL-Lecture -CNN
Effects of Convolutions on Images - Eg: 2
UE22CS645BC2-TDL-Lecture -CNN
Effects of Convolutions on Images

Let’s see a working example of 2D convolution


UE22CS645BC2-TDL-Lecture -CNN
Effects of Convolutions on Images

We can use multifilters to get multiple feature maps.


UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 1D

In the 1D case, we slide a one dimensional filter over a one dimensional input
UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 2D

In the 2D case, we slide a two dimensional filter over a two dimensional input
UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 3D

What would happen in case of 3D?


UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 3D

Suppose we have an RGB image and we want to convolve it with the following 3D filter. As we can see, the depth of our filter consists of three 2D filters. Let's assume that our RGB image is 5 by 5 pixels.

Each color channel corresponds to a two-dimensional array of


pixel values
UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 3D

We will add zero padding to each of these arrays in order to avoid losing information when
performing the convolution.
UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 3D

The convolutions will be carried out in exactly the same way as for grayscale images.
The only difference is that now we have to perform three convolutions instead of 1
UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 3D

Take each
corresponding
pixel and filter
value, multiply
them together,
and sum the
whole thing
up.
UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 3D
UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 3D

Add a bias value, which usually has a value of 1


UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 3D

To get the full convoluted output, we just perform the same operations for all the other pixels in
each of our color channels
UE22CS645BC2-TDL-Lecture 4- Loss Function
Acknowledgements & References

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture5.pdf

https://towardsdatascience.com/cnn-part-i-9ec412a14cb1

https://msail.github.io/previous_material/cnn/
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Divija L,
Teaching Assistant
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Divya K,
Teaching Assistant
TOPICS IN DEEP LEARNING
Overview of lecture

• CNN Example (LeNet-5)


• Conclusion
TOPICS IN DEEP LEARNING
CNN Example

Let us take a look at an example:

We have an input image of 32×32×3, so it is an RGB image, and we are trying
to do handwritten digit recognition. We have the number “7” and we are trying to
recognize which of the 10 digits from 0 to 9 it is.
TOPICS IN DEEP LEARNING
CNN Example

Let us build a neural network for this task, inspired by a classic neural network called LeNet-5:

Here, nf = number of filters; f = filter size; s=stride; p=padding;


CONV and POOL represent the convolution and pooling outputs respectively.
TOPICS IN DEEP LEARNING
CNN Example

• We may notice that as we go deeper into the neural network, the height and width
decreases, while the number of channels increase.
• CNNs usually comprise a combination of one or more convolution layers followed by
pooling layers, and finally a few fully connected layers.
TOPICS IN DEEP LEARNING
CNN Example

Let us go through a few more details for this example:

We may notice:
• Pooling layers don’t have any parameters
• Convolution layers have relatively few parameters
• Fully connected layers have the most parameters
• Activation size decreases gradually as we go deeper in the neural network
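A minimal Keras sketch of a LeNet-5-style network for this task is shown below; the filter counts and sizes (6 and 16 filters of size 5, the 120/84 dense units) are the classic LeNet-5 choices and are meant as illustrative assumptions, not the exact values from the slide:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(6, kernel_size=5, strides=1, activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D(pool_size=2, strides=2),
    layers.Conv2D(16, kernel_size=5, strides=1, activation='relu'),
    layers.MaxPooling2D(pool_size=2, strides=2),
    layers.Flatten(),                          # fully connected part starts here
    layers.Dense(120, activation='relu'),
    layers.Dense(84, activation='relu'),
    layers.Dense(10, activation='softmax'),    # 10 digit classes (0-9)
])
model.summary()   # height/width shrink, channels grow, FC layers hold most parameters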
TOPICS IN DEEP LEARNING
Conclusion

We have now seen all the building blocks of CNN:


convolution layer, pooling layer and fully connected layer

Let’s sum up why we need convolutions:

• Parameter sharing: A feature detector (such as a vertical edge detector) that’s


useful in one part of the image is probably useful in another part of the image.
• Sparsity of connections: In each layer, each output value depends only on a
small number of inputs.
TOPICS IN DEEP LEARNING
Acknowledgements

https://www.deeplearning.ai/courses/deep-learning-specialization/
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Divya K,
Teaching Assistant
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Divya K,
Teaching Assistant
TOPICS IN DEEP LEARNING
Overview of lecture

• Recap: Backpropagation
• Convolution Forward Pass
• Convolution Backward Pass
➢ Calculation of loss gradient w.r.t filter
➢ Calculation of loss gradient w.r.t bias
➢ Calculation of loss gradient w.r.t input

• Conclusion: Backpropagation in Convolution Layer


TOPICS IN DEEP LEARNING
Recap: Backpropagation

➢ Backpropagation is an algorithm used to train neural networks by adjusting the


weights of the network based on the error between the predicted output and the
actual output.

➢ Backpropagation calculates the gradient of the loss function with respect to each
parameter in the network and updates the network parameters in such a way that it
minimizes the loss function.

➢ In CNNs the loss gradient is computed w.r.t the input and also w.r.t the filter, w.r.t the
bias.
TOPICS IN DEEP LEARNING
Convolution Forward Pass

Convolution between Input X and Filter F, gives us an output O. This can be represented as:
TOPICS IN DEEP LEARNING
Convolution Forward Pass

Applying the convolution operation (with a 3 x 3 input X, a 2 x 2 filter F and bias B), we get:

O11 = X11·F11 + X12·F12 + X21·F21 + X22·F22 + B
O12 = X12·F11 + X13·F12 + X22·F21 + X23·F22 + B
O21 = X21·F11 + X22·F12 + X31·F21 + X32·F22 + B
O22 = X22·F11 + X23·F12 + X32·F21 + X33·F22 + B
TOPICS IN DEEP LEARNING
Convolution Backward Pass

In ANNs, we update the weights as


W = W - α*∂L/∂W
In CNNs, we update the network parameters as:
F = F - α*∂L/∂F
B = B - α*∂L/∂B
(where α is the learning rate)
Since X is the output of the previous layer, ∂L/∂X
becomes the loss gradient for the previous layer.
So, we need to calculate ∂L/∂F, ∂L/∂B and ∂L/∂X.
TOPICS IN DEEP LEARNING
Convolution Backward Pass

➢ Calculation of loss gradient w.r.t the filter, ∂L/∂F:

We can use the chain rule to obtain the gradient w.r.t the filter as shown in the equation.
TOPICS IN DEEP LEARNING
Convolution Backward Pass

➢ Calculation of loss gradient w.r.t the filter, ∂L/∂F:

On expanding the chain rule summation, we get:


TOPICS IN DEEP LEARNING
Convolution Backward Pass

➢ Calculation of loss gradient w.r.t the filter, ∂L/∂F:

To calculate ∂O/∂F:

From these equations, We get:


∂O11/∂F11 = X11,   ∂O11/∂F12 = X12,   ∂O11/∂F21 = X21,   ∂O11/∂F22 = X22

∂O12/∂F11 = X12,   ∂O12/∂F12 = X13,   ∂O12/∂F21 = X22,   ∂O12/∂F22 = X23

∂O21/∂F11 = X21,   ∂O21/∂F12 = X22,   ∂O21/∂F21 = X31,   ∂O21/∂F22 = X32

∂O22/∂F11 = X22,   ∂O22/∂F12 = X23,   ∂O22/∂F21 = X32,   ∂O22/∂F22 = X33
TOPICS IN DEEP LEARNING
Convolution Backward Pass

➢ Calculation of loss gradient w.r.t the filter, ∂L/∂F:


On substituting the values of the local gradient, we get:
TOPICS IN DEEP LEARNING
Convolution Backward Pass

➢ Calculation of loss gradient w.r.t the filter, ∂L/∂F:


If we look closely, this can be represented as a convolution operation
between input X and loss gradient ∂L/∂O as shown below:

∂L/∂F = conv(X, ∂L/∂O)
TOPICS IN DEEP LEARNING
Convolution Backward Pass

➢ Calculation of loss gradient w.r.t the bias, ∂L/∂B:


We can use the chain rule to obtain the gradient w.r.t the bias as shown in the equation.
TOPICS IN DEEP LEARNING
Convolution Backward Pass

➢ Calculation of loss gradient w.r.t the bias, ∂L/∂B:


To calculate ∂O/∂B, we just partially derive O11, O12, O21, and O22 with respect to B.
Since there is only one B term in each O term (as shown), the partial differentiation
just returns 1.
∂O11/∂B = 1,   ∂O12/∂B = 1,   ∂O21/∂B = 1,   ∂O22/∂B = 1
TOPICS IN DEEP LEARNING
Convolution Backward Pass

➢ Calculation of loss gradient w.r.t the bias, ∂L/∂B:

So ∂L/∂B is just equal to the summation of ∂L/∂O terms.

∂L/∂B = ∂L/∂O11 + ∂L/∂O12 + ∂L/∂O21 + ∂L/∂O22 = sum(∂L/∂O)
TOPICS IN DEEP LEARNING
Convolution Backward Pass

➢ Calculation of loss gradient w.r.t the input, ∂L/∂X:

We can use the chain rule to obtain the gradient w.r.t the input as shown in the equation.
TOPICS IN DEEP LEARNING
Convolution Backward Pass

➢ Calculation of loss gradient w.r.t the input, ∂L/∂X:

On expanding the chain rule summation and substituting the values of the local gradients,
we get:
TOPICS IN DEEP LEARNING
Convolution Backward Pass

➢ Calculation of loss gradient w.r.t the input, ∂L/∂X:

If we look closely, this can be represented as a “full convolution” operation between


flipped / 180◦ rotated Filter F and loss gradient ∂L/∂O as shown below:

❑ the term "full convolution" is often used interchangeably with "convolution with zero-
padding." In the context of convolutional neural networks (CNNs), "full convolution"
typically means performing a convolution operation with zero-padding applied to the
input. Here, we mean padded loss gradient ∂L/∂O.
TOPICS IN DEEP LEARNING
Convolution Backward Pass

➢ Calculation of loss gradient w.r.t the input, ∂L/∂X:


Now, let us do a ‘full’ convolution between this flipped Filter F and ∂L/∂O, as visualized below:
TOPICS IN DEEP LEARNING
Convolution Backward Pass

➢ Calculation of loss gradient w.r.t the input, ∂L/∂X:


The full convolution above generates the values of ∂L/∂X and hence we can represent ∂L/∂X as:

∂L/∂X = full-conv(180◦ flipped F, ∂L/∂O)

OR

∂L/∂X = conv(180◦ flipped F, padded(∂L/∂O))
TOPICS IN DEEP LEARNING
Conclusion: Backpropagation in Convolution Layer

Note : This is done for a single filter F and stride = 1; real CNNs have many more filters. The CNN backpropagation operation with stride > 1 is identical to a stride = 1 operation.
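A minimal NumPy/SciPy sketch of these three gradient formulas for a single 2 x 2 filter and stride = 1 (toy shapes; scipy.signal provides the convolution/correlation helpers):

import numpy as np
from scipy.signal import correlate2d, convolve2d

def conv_backward(X, F, dL_dO):
    dL_dF = correlate2d(X, dL_dO, mode='valid')   # dL/dF = conv(X, dL/dO)
    dL_dB = dL_dO.sum()                           # dL/dB = sum(dL/dO)
    dL_dX = convolve2d(dL_dO, F, mode='full')     # dL/dX = full-conv(180-rotated F, dL/dO)
    return dL_dF, dL_dB, dL_dX

X = np.random.randn(3, 3)        # toy input
F = np.random.randn(2, 2)        # toy filter
dL_dO = np.random.randn(2, 2)    # loss gradient flowing back from the next layer
dF, dB, dX = conv_backward(X, F, dL_dO)
print(dF.shape, dX.shape)        # (2, 2) and (3, 3)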
TOPICS IN DEEP LEARNING
Acknowledgements

https://youtu.be/Pn7RK7tofPg?si=bcY3RxuXhOa_GZOg

https://deeplearning.cs.cmu.edu/F21/document/recitation/Recitation5/CNN_Backprop_Recitation_5_F21.pdf

https://pavisj.medium.com/convolutions-and-backpropagations-46026a8f5d2c
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Divya K,
Teaching Assistant
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Anashua Krittika Dastidar


Teaching Assistant
Introduction
A convolutional neural network (CNN) is one of the most significant networks in the
deep learning field. Since CNN made impressive achievements in many areas,
including but not limited to computer vision and natural language processing, it
attracted much attention from both industry and academia in the past few years. In this
lecture we shall start by discussing some historically famous CNN models and move up
to state of the art complex CNN architectures.

Looking to study CNN’s from a historical perspective :

● Waibel et al. [1] proposed a time-delay neural network for speech recognition,
which can be viewed as a 1-D CNN.
● Then, Zhang [2] proposed the first 2-D CNN—shift-invariant ANN.
● LeCun et al. [3] also constructed a CNN for handwritten zip code recognition
and first used the term “convolution,” which is the original version of LeNet.
A Timeline of CNN Architectures ….
LeNet
● LeNet5[3] architecture was introduced in 1998. Due to its historical importance, it is known as the first CNN
model.
● It also performed very well on MNIST handwritten digit recognition.
● LeNet model, which is usually composed of 5 layers, accepts grayscale images with 32 x 32 x 1 as input.
● The inputs are transferred to the Conv layer and then to sub-sampling.
● Afterward, there are other Conv layers, followed by a pooling layer, and at the end of the architecture, FC
layers including the output at the last layer are defined.

● This model was the first CNN architecture to reduce the number of parameters and automatically
learn features from raw pixels. It was introduced to recognize handwritten digit patterns and
postal codes in post offices.
AlexNet
● AlexNet [4] was the pioneering deep architecture, with top-5 test
accuracy of 84.6% on ImageNet data( 15 million labeled high resolution
images in over 22,000 categories).
● It utilized those data augmentation methods that are comprised of image
translation, patch extractions(random cropping), colour jittering and
horizontal reflection.
● This CNN model implements dropout layers with a particular end goal to
battle the issue of overfitting to the training data.
● It is trained by batch stochastic gradient descent, along with particular
values for weight decay and momentum. The activation function used
was ReLU.
● They train on the ImageNet 2011 dataset and finetune on the ImageNet 2012
dataset.
● They introduced ideas like local response normalization, which are
out of practice today as batch normalization is preferred.
● It was trained on two GTX 580 GPUs for 5–6 days and contains five
convolutional layers, one max pooling, ReLU as non-linearities, three FC
layers and dropout.

Please note: these things are pretty common to do today, but when the AlexNet paper came out they were not the norm.

The creators of AlexNet pioneered the idea of Group convolution to train on multiple GPUs, let’s see how ..
What is Group Convolution?
Usually convolution filters are applied on an image layer by layer as we have seen till now in LeNet . However to
learn more features and train deeper networks we could create two models that train and backpropagate in
parallel. This method of using different set of convolution filter groups on the same image is called “grouped
convolution”.
As we have learnt in previous courses parallelism in training is two ways:

1. Data parallelism: where we split the dataset into chunks and then we train on each chunk. Intuitively, each
chunk can be understood as a mini batch used in mini batch gradient descent. Smaller the chunks, more
data parallelism we can squeeze out of it. However, smaller chunks also mean that we are doing more like
stochastic than batch gradient descent for training. This would result in slower and sometimes poorer
convergence.
2. Model parallelism: here we try to parallelize the
model such that we can take in as much as data as
possible. Grouped convolutions enable efficient model
parallelism, so much so that Alexnet was trained on
GPUs with only 3GB RAM.

For more details read this article.
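As a rough illustration of grouped convolution, here is a sketch using PyTorch's groups argument (the channel counts below are made up for the example and are not AlexNet's actual layer sizes):

import torch
import torch.nn as nn

# groups=2 splits the 48 input channels into two independent halves of 24,
# each producing 48 of the 96 output channels -- the AlexNet-style two-"GPU" split.
grouped = nn.Conv2d(in_channels=48, out_channels=96, kernel_size=3, padding=1, groups=2)
x = torch.randn(1, 48, 27, 27)
print(grouped(x).shape)   # torch.Size([1, 96, 27, 27])
# The grouped layer has about half the weights of the equivalent groups=1 convolution.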


Group Convolution in AlexNet
● A single GTX 580 GPU had only 3GB of memory, which limits the maximum size of the networks
that can be trained on it. It turns out that 1.2 million training examples are enough to train networks
which are too big to fit on one GPU. Therefore they spread the net across two GPUs.

● This parallelization scheme puts half the neurons on each GPU. However, the GPUs communicate
only in certain layers: for example, the kernels of layer 3 take input from all kernel maps in layer 2.
However, kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the
same GPU.
● Choosing the pattern of
connectivity is a problem
for cross-validation, but
this allowed them to
precisely tune the
amount of
communication until it is
an acceptable fraction of
the amount of
computation.
Group Convolution in AlexNet
AlexNet Architecture
ZFNet
● The paper Visualizing and Understanding Convolutional Networks[5] introduces the notion of Deconvnet
which enables us to visualize each layer.
● By visualizing each layer, we can get more insight into what the model is learning and thus make
adjustments to optimize it further.
● That is how ZFNet was created: a fine-tuned version of AlexNet based on these visualization results.

Differences with AlexNet:


AlexNet consists of eight layers: five convolutional layers followed by three fully connected layers. ZFNet retained the basic architecture of AlexNet but made some architectural adjustments, particularly in the first few layers.

● AlexNet used 11x11, 5x5 and 3x3 filter sizes, while ZFNet used a 7x7 filter size in the first layer and smaller filters in the later layers.
● AlexNet uses a stride of 4 in the first layer, while ZFNet uses a stride of 2.
● AlexNet used Local Response Normalization, while ZFNet used Local Contrast Normalization.
VGGNet
● The VGG model, or VGGNet, that supports 16 layers is also referred to as VGG16, which is a convolutional
neural network model proposed in the research paper titled, “Very Deep Convolutional Networks for Large-
Scale Image Recognition.”
● The VGG16 model achieves almost 92.7% top-5 test accuracy in ImageNet.
● The previous CNNs like AlexNet had a ton of hyperparameters which were filter sizes and sizes of the max
pool layers, moreover the stride value for each layer was also a hyperparameter.
● This network simplifies it by having only 3x3 kernels for the convolution layer with a stride of 1 and 2x2 for
the max pool layers with a stride of 2.
● The VGG16 model was trained using Nvidia Titan Black GPUs for multiple weeks.

(Figure: VGG16 architecture.)

● The concept of the VGG19 model (also VGGNet-19) is the same as the VGG16 except that it supports 19 layers.
Inception Net

The origin of the name


‘Inception Network’ is very
interesting since it comes
from the famous movie
Inception, directed by
Christopher Nolan.

In the movie, we have


dreams within dreams. In an
inception network, we have
networks within networks.

Fun fact, this meme was referenced in the first inception net paper.
What is a 1x1 convolution?
● A 1x1 convolution is nothing but the element-wise product of the 32 numbers in one position of the input and the 32 numbers in the filter, summed up, with a ReLU nonlinearity applied to the result.
● If we have F filters, the result has dimensions 6 x 6 x F, as indicated in the image.

● Coming back to the single-filter case, it can be visualised as a single node taking in the weighted sum of 32 inputs and applying a ReLU nonlinearity to it.
● Thus using F filters is equivalent to a fully connected layer with 32 nodes connected to F nodes in the next layer.
● This idea is also known as "Network in Network", and it underpins the first Inception Net paper.

Doing this adds more nonlinearity to


the network allowing the network to
learn a more complex function.
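A quick sketch of this channel-mixing view in PyTorch (the 6x6x32 input and F = 16 filters are just example sizes):

import torch
import torch.nn as nn

x = torch.randn(1, 32, 6, 6)                 # a 6 x 6 x 32 volume (channels-first layout)
conv1x1 = nn.Conv2d(32, 16, kernel_size=1)   # F = 16 filters of size 1 x 1 x 32
out = torch.relu(conv1x1(x))
print(out.shape)                             # torch.Size([1, 16, 6, 6])
# At every spatial position this acts exactly like a fully connected layer: 32 inputs -> 16 outputs.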
How is a 1x1 convolution useful?

Say we have a 28 x 28 x 192 dimensional volume and we wish to shrink the number of channels to 32. One way to do this is to use 32 filters of dimension 1 x 1 x 192; this would output a 28 x 28 x 32 volume.

Thus 1 x 1 convolutions can be used to shrink the number of channels.


How is a 1x1 convolution useful?
A 1 x 1 convolution can also help us reduce the computational cost.

Say we have a 28 x 28 x 192 volume which is passed through a convolution layer of 5 x 5 kernels with 32 filters. The total number of multiplications for this one layer is:
Total computations = 28 x 28 x 32 x 5 x 5 x 192 ≈ 120 million multiplications, which is a highly expensive operation!

Now if we instead use 1 x 1 convolutions to reduce the volume to 28 x 28 x 16 and then apply the 5 x 5 convolution with 32 filters, the number of multiplications is:
Total computations = 28 x 28 x 16 x 192 + 28 x 28 x 32 x 5 x 5 x 16 ≈ 2.4 million + 10 million ≈ 12.4 million multiplications.

Thus the 1 x 1 "bottleneck" layer reduces the computational cost by roughly a factor of ten.
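A tiny Python check of these multiply counts (just verifying the arithmetic above):

direct     = 28 * 28 * 32 * 5 * 5 * 192                        # ~120.4 million multiplications
bottleneck = 28 * 28 * 16 * 192 + 28 * 28 * 32 * 5 * 5 * 16    # ~2.4M + ~10.0M, about 12.4 million
print(direct, bottleneck, round(direct / bottleneck, 1))        # roughly a 10x reduction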
Inception Net v1 or GoogLeNet
The researchers behind Google's Inception Net, or GoogLeNet, came up with a module which combined various filters at the same level, making the network wider rather than deeper, as they observed that objects in images can show very large variation.

Ex: The three images below


Because of this huge variation
in the location of the
information, choosing the right
kernel size for the convolution
operation becomes tough. A
larger kernel is preferred for
information that is distributed
more globally, and a smaller
kernel is preferred for
information that is distributed
more locally.
What is an Inception Module?
The image is the “naive” inception module. It performs
convolution on an input, with 3 different sizes of filters (1x1, 3x3,
5x5). Additionally, max pooling is also performed. The outputs
are concatenated and sent to the next inception module.

As stated before, deep neural networks


are computationally expensive. To make it
cheaper, the authors limit the number of
input channels by adding an extra 1x1
convolution before the 3x3 and 5x5
convolutions.Do note that however, the
1x1 convolution is introduced after the max
pooling layer, rather than before.
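A compact sketch of such a dimension-reduced inception module in PyTorch (the branch channel counts below are placeholder values, not necessarily the exact GoogLeNet configuration):

import torch
import torch.nn as nn
import torch.nn.functional as F

class InceptionModule(nn.Module):
    def __init__(self, in_ch, ch1, ch3_red, ch3, ch5_red, ch5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, ch1, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, ch3_red, 1), nn.ReLU(),
                                nn.Conv2d(ch3_red, ch3, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, ch5_red, 1), nn.ReLU(),
                                nn.Conv2d(ch5_red, ch5, 5, padding=2))
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, pool_proj, 1))   # 1x1 placed after the max pooling

    def forward(self, x):
        # All four branches preserve the spatial size, so their outputs concatenate channel-wise.
        return torch.cat([F.relu(self.b1(x)), F.relu(self.b3(x)),
                          F.relu(self.b5(x)), F.relu(self.pool(x))], dim=1)

block = InceptionModule(192, ch1=64, ch3_red=96, ch3=128, ch5_red=16, ch5=32, pool_proj=32)
print(block(torch.randn(1, 192, 28, 28)).shape)   # torch.Size([1, 256, 28, 28])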
Using the dimension reduced inception module, a neural network architecture
was built. This was popularly known as GoogLeNet (Inception v1). The
architecture is shown below:
An auxiliary classifier consists of fully connected layers followed by a softmax layer, used to make predictions from the outputs of middle layers in the network.

This has a regularizing effect on the network and is useful in preventing overfitting.
Inception v2
PREMISE:
● The intuition was that, neural networks perform better when convolutions didn’t alter the
dimensions of the input drastically. Reducing the dimensions too much may cause loss of
information, known as a “representational bottleneck”.
● Using smart factorization methods, convolutions can be made more efficient in terms of
computational complexity.

SOLUTION:
A. Factorize 5x5 convolution to two 3x3 convolution operations to improve computational speed.
A 5x5 convolution is 2.78 times more expensive than a 3x3 convolution. So stacking two 3x3
convolutions in fact leads to a boost in performance.
B. Moreover, they factorize convolutions of filter size nxn into a combination of 1xn and nx1
convolutions. For example, a 3x3 convolution is equivalent to first performing a 1x3
convolution, and then performing a 3x1 convolution on its output. They found this method to
be 33% cheaper than the single 3x3 convolution.
C. The filter banks in the module were expanded (made wider instead of deeper) to remove the
representational bottleneck.
Inception v2

(Figures A, B and C correspond to the three factorizations described above.)
Inception v3
PREMISE:

● The authors noted that the auxiliary classifiers didn't contribute much until near the end of the
training process, when accuracies were nearing saturation. They argued that they function as
regularizers, especially if they have BatchNorm or Dropout operations.
● Possibilities to improve on the Inception v2 without drastically changing the modules were to be
investigated.

SOLUTION: Inception Net v3 incorporated all of the above upgrades stated for Inception v2, and in
addition used the following:

1. RMSProp Optimizer.
2. Factorized 7x7 convolutions.
3. BatchNorm in the Auxiliary Classifiers.
4. Label Smoothing (A type of regularizing component added to the loss formula that prevents the
network from becoming too confident about a class. Prevents overfitting).
Inception v4
PREMISE: Make the modules more uniform. The authors also noticed that some of the modules were
more complicated than necessary. This can enable us to boost performance by adding more of these
uniform modules.

SOLUTION: The “stem” of Inception v4 was modified. The stem here, refers to the initial set of
operations performed before introducing the Inception blocks.

Stem in v1 Stem in v4
ResNet
After the AlexNet model was introduced, researchers started building bigger CNNs for increased accuracy. Thus the question arises:

Is learning better networks simply driven by stacking more layers?

However , when deeper networks are able to start converging, a degradation problem was exposed:
with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then
degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers
to a suitably deep model leads to higher training error. Thus, in the following example, a 56-layer network
has much higher test error than the 20-layer network.

Training error (left) and test error (right)


on CIFAR-10 with 20-layer and 56-layer
“plain” networks. The deeper network
has higher training error, and thus test
error.
ResNets
● Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it.
There exists a solution by construction to the deeper model: the added layers are identity
mapping, and the other layers are copied from the learned shallower model.
● The existence of this constructed solution indicates that a deeper model should produce no higher
training error than its shallower counterpart.
● But experiments show that our current solvers on hand are unable to find solutions that are
comparably good or better than the constructed solution

Residual connections

Thus, to have a NN learn the identity function rather than pass the data through more weight layers, the authors proposed adding skip connections in the network.

Here x is carried through by default (the identity path) and F(x) is the required deviation which the model must learn; if there is nothing to be learnt, then F(x) + x reduces to the identity function.
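A minimal sketch of such a residual (skip-connection) block in PyTorch, assuming the input and output of the block have the same shape:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)   # F(x) + x: if the weight layers learn nothing, the block acts as the identity

print(ResidualBlock(64)(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])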
Thus we compare the following network architectures
for ImageNet.

Left: the VGG-19 model [41] (19.6 billion FLOPs) as a


reference.

Middle: a plain network with 34 parameter layers (3.6


billion FLOPs).

Right: a residual network with 34 parameter layers (3.6


billion FLOPs). The dotted shortcuts increase
dimensions
And the results ….

Here we see that training and validation error are higher in the larger network without residual connections. However, in a network with residual connections we can train a 34-layer network with a much lower error.

Using residual connections helps deeper networks, as the error is much lower for the 34-layer network than for the 18-layer network.
ResNets
The authors thus built a family of models as the following :
● ResNet-50: With 50 layers
● ResNet-101 : With 101 layers
● ResNet-152 : With 152 layers
At the time the ResNet 152 model performed the best on the ImageNet dataset and they were declared the
winners of Image Net 2015. Proving that deeper is sometimes indeed better!
They also introduced the idea of bottleneck for deeper networks i.e for each residual function F, we use a stack
of 3 layers instead of 2 (Fig). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are
responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with
smaller input/output dimensions.
As deeper networks are computationally expensive, to save computational power they project the 256 dimensions down to 64, let the 3×3 layer operate on this smaller 64-dimensional volume, and then project upwards again.
Inception-ResNet
Inspired by the performance of the ResNet, a hybrid inception module was proposed. There are two sub-versions of
Inception ResNet, namely v1 and v2. Before we checkout the salient features, let us look at the minor differences
between these two sub-versions.

○ Inception-ResNet v1 has a computational cost that is similar to that of Inception v3.


○ Inception-ResNet v2 has a computational cost that is similar to that of Inception v4.
○ They have different stems, as illustrated in the Inception v4 section.

For residual addition to work, the input and output after convolution must have the same dimensions.
Hence, we use 1x1 convolutions after the original convolutions, to match the depth sizes
Inception-ResNet
The pooling operation inside the main inception modules were replaced in favor of the residual
connections. However, you can still find those operations in the reduction blocks. Reduction block A is
same as that of Inception v4.

Networks with residual units deeper in the architecture caused the network to “die” if the number of
filters exceeded 1000. Hence, to increase stability, the authors scaled the residual activations by a
value around 0.1 to 0.3.
Efficient Net(B0-B7)
The process of scaling up ConvNets has never been well understood, and there are many ways to do it. The most common way is to scale up one of the following parameters: width, depth or resolution. But scaling up arbitrarily requires tedious manual tuning and often leads to suboptimal results.
The key contribution of the EfficientNet paper is to show that it is critical to balance width, depth and image resolution. They show that such balance can simply be achieved by scaling each factor by a common ratio. This method is called Compound Scaling.

They demonstrate this


using the MobileNet and
ResNet architectures.
They also develop a
baseline network and scale
it up to obtain the family of
Efficient Net models (B0-
B7)
Compound Scaling
The compound scaling method uses a compound coefficient φ to uniformly scale network width, depth, and resolution in a principled way:
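The scaling rule itself appears as an image on the original slide; reconstructed here from the EfficientNet paper [12]:

depth:      d = α^φ
width:      w = β^φ
resolution: r = γ^φ
subject to  α · β² · γ² ≈ 2,  with α ≥ 1, β ≥ 1, γ ≥ 1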

where α, β, γ are constants that can be determined by a small grid search. Intuitively,
φ is a user-specified coefficient that controls how many more resources are available
for model scaling, while α, β, γ specify how to assign these extra resources to
network width, depth, and resolution respectively.
Efficient Net B0 (The baseline network)
At the core of Efficient Net, the mobile inverted bottleneck convolution layer is used along with the
squeeze and excitation layers. The following table shows the architecture of Efficient B0:

The creators of Efficient Net started to scale Efficient Net B0 with the
help of their compound scaling method. They applied the grid search
technique to get :

α = 1.2, β = 1.1, γ = 1.15, with φ = 1.

Afterward, they fixed these scaling coefficients and scaled EfficientNet B0 up to EfficientNet B7.
Conclusion

This lecture aims to summarize the most popular CNN architectures , however there
are a plethora of other CNN architectures which the students may wish to explore . A
comprehensive list is provided in the research papers listed in reference [9] and [10].

Reading the research papers cited in the references section is also encouraged.
References
[1] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,”
IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 3, pp. 328–339, Mar. 1989.
[2] W. Zhang, “Shift-invariant pattern recognition neural network and its optical architecture,” in Proc. Annu. Conf. Jpn. Soc. Appl.
Phys., 1988, p. 734, paper 6P-M-14.
[3] Y. LeCun et al., “Backpropagation applied to handwritten zip code recognition,” Neural Comput., vol. 1, no. 4, pp. 541–551,
Dec. 1989
[4]Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in
Neural Information Processing Systems, pp. 1097–1105 (2012)
[5] M. D. Zeiler and R. Fergus, ‘Visualizing and Understanding Convolutional Networks’, arXiv [cs.CV]. 2013.
[6]Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with
convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
[7]He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 770–778 (2016)
[8]https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202
[9]A. Dhillon and G. K. Verma, “Convolutional neural network: A review of models, methodologies and applications to object
detection,” Prog. Artif. Intell., vol. 9, no. 2, pp. 85–112, Jun. 2020.
[10]Z. Li, F. Liu, W. Yang, S. Peng and J. Zhou, "A Survey of Convolutional Neural Networks: Analysis, Applications, and
Prospects," in IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 12, pp. 6999-7019, Dec. 2022, doi:
10.1109/TNNLS.2021.3084827.
[11]https://youtube.com/playlist?list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF&si=yKZ9uniVHpq2EzMq
[12]Tan, M., & Le, Q. v. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. 36th International
Conference on Machine Learning, ICML 2019, 2019-June.
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Anashua Krittika Dastidar


Teaching Assistant
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack:Anashua Krittika Dastidar,


Teaching Assistant
Introduction
1. Object detection is a vital computer vision task focusing on identifying instances like humans, animals, or cars in digital images.

2. Its goal is to provide essential knowledge for computer vision applications by determining what objects are present and where they are located.

3. Key metrics for object detection include accuracy, encompassing both classification and localization accuracy, as well as speed.

4. Object detection serves as the foundation for other computer vision tasks such as instance segmentation, image captioning, and object tracking.

5. Recent years have seen significant progress in object detection, driven by rapid developments in deep learning techniques. Fig. 1 illustrates the growing number of publications associated with "object detection" over the past two decades.

Source [1]
Source [1]

In the past two decades, it is widely accepted that the progress of object detection has
generally gone through two historical periods: “traditional object detection period
(before 2014)” and “deep learning based detection period (after 2014)”.
Traditional Detectors
Most of the early object detection algorithms were built based on handcrafted features. Due to
the lack of effective image representation at that time, people have to design sophisticated
feature representations and a variety of speed-up skills. Some of them are :

1. Viola Jones Detectors: In 2001, P. Viola and M. Jones achieved real-time detection of
human faces for the first time without any constraints (e.g., skin color segmentation) [4,5].

1. HOG Detector: In 2005, N. Dalal and B. Triggs proposed Histogram of Oriented


Gradients (HOG) feature descriptor [6].

1. Deformable Part-based Model (DPM): DPM, as the winners of VOC-07,-08, and-09


detection challenges, was the epitome of the traditional object detection methods. DPM
was originally proposed by P. Felzenszwalb [7] in 2008 as an extension of the HOG
detector.
Datasets for Object Detection
Datasets for Object Detection
Building larger datasets with less bias is essential for developing advanced
detection algorithms. A number of well-known detection datasets have been
released in the past 10 years:

1. Pascal VOC: The PASCAL Visual Object Classes (VOC) Challenges


(from 2005 to 2012) was one of the most important competitions in the
early computer vision community. Two versions of Pascal-VOC are
mostly used in object detection: VOC07 and VOC12, where the former
consists of 5k training images + 12k annotated objects, and the latter
consists of 11k training images + 27k annotated objects. 20 classes of
objects that are common in life are annotated in these two datasets.
a. Link to challenges

(Figure: Example images and annotations for 1. Pascal VOC and 2. ILSVRC. Source [1])

2. ILSVRC: The ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) was organized each year from 2010 to 2017. It contains a
detection challenge using ImageNet images. The ILSVRC detection
dataset contains 200 classes of visual objects. The number of its images/
object instances is two orders of magnitude larger than VOC.
a. Link to challenge
b. More info
Datasets for Object Detection
MS-COCO: MS-COCO is one of the most challenging object detection dataset
available today. It has less number of object categories than ILSVRC, but more
object instances. For example, MS-COCO-17 contains 164k images and 897k
annotated objects from 80 categories.

→ Compared with VOC and ILSVRC, the biggest progress of MS-COCO is that
apart from the bounding box annotations, each object is further labeled using per-
instance segmentation to aid in precise localization.

→ In addition, MS-COCO contains more small objects (whose area is smaller


than 1% of the image) and more densely located objects. Just like ImageNet in its
time, MS-COCO has become the de facto standard for the object detection
community.

→ Link to challenge

→ More information

(Figure: Example images and annotations from MS-COCO. Source1 [1], Source2)
Datasets for Object Detection
Open Images: The year of 2018 sees the introduction of the Open Images Detection (OID) challenge,
following MS-COCO but at an unprecedented scale.
There are two tasks in Open Images:
1. The standard object detection
2. Visual relationship detection which detects paired objects in particular relations.

For the standard detection task, the dataset consists of 1,910k images with 15,440k annotated bounding
boxes on 600 object categories.

→ Link to dataset

Source
Metrics for Object Detection
Intersection over Union (IoU)
Intersection over Union indicates the overlap of the predicted bounding box coordinates to the ground
truth box. Higher IoU indicates the predicted bounding box coordinates closely resembles the ground
truth box coordinates.

To measure the object localization accuracy, the IoU


between the predicted box and the ground truth is used to
verify whether it is greater than a predefined threshold, say,
0.5. If yes, the object will be identified as “detected”,
otherwise, “missed”.

Source
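A minimal Python helper that computes this IoU for two axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates (a sketch, not tied to any particular detection library):

def iou(box_a, box_b):
    # Intersection rectangle.
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # Union = area A + area B - intersection.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175, about 0.14: below a 0.5 threshold -> "missed"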
What is mAP?
Mean Average Precision(mAP) is a metric used to evaluate object detection models such as Fast R-CNN,
YOLO, Mask R-CNN, etc. The mean of average precision(AP) values are calculated over recall values from 0 to
1.
mAP formula is based on the following sub metrics Confusion Matrix, Intersection over Union(IoU), Recall and
Precision

Source
How to calculate AP?
→ AP@α is Area Under the Precision-Recall Curve(AUC-PR) evaluated at α IoU threshold.

→ Notation: AP@α or APα means that AP is evaluated at an IoU threshold of α. If you see
metrics like AP50 and AP75, they mean AP calculated at IoU=0.5 and IoU=0.75, respectively.

→ A high Area Under the PR Curve means high recall and high precision which is preferred.

→ AP is calculated individually for each class. This means that there are as many AP values as
the number of classes.

→ These AP values are averaged to obtain the mean Average Precision (mAP) metric.

Source
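A sketch of the per-class AP computation as area under the precision-recall curve (NumPy, using the usual "precision envelope" trick; the recall/precision arrays are assumed to come from sorting the detections of one class by confidence):

import numpy as np

def average_precision(recall, precision):
    # Add sentinel points at recall 0 and 1.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Precision envelope: make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangular areas wherever recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Toy PR points for one class at one IoU threshold; mAP averages this value over all classes.
print(average_precision(np.array([0.2, 0.4, 0.8]), np.array([1.0, 0.8, 0.6])))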
What is mAP?
The mAP is calculated by finding the Average Precision (AP) for each class and then averaging over the number of classes. The formula for mAP essentially tells us that, for a given class i, we need to calculate its corresponding AP. The mean of these collated AP scores produces the mAP and tells us how well the model performs.

mAP50: Mean average precision calculated at an intersection over union (IoU) threshold of 0.50. It's a measure of the model's
accuracy considering only the "easy" detections.

mAP50-95: The average of the mean average precision calculated at varying IoU thresholds, ranging from 0.50 to 0.95. It gives a
comprehensive view of the model's performance across different levels of detection difficulty.
The 0.5-IoU mAP has then become the de facto metric for object detection. After 2014, due to the introduction of MS-COCO
datasets, researchers started to pay more attention to the accuracy of object localization. Instead of using a fixed IoU threshold,
MS-COCO AP is averaged over multiple IoU thresholds between 0.5 and 0.95, which encourages more accurate object localization
and may be of great importance for some real world applications

Source
Frames Per Second (fps)
● When it comes to object detection algorithms, processing speed is of paramount importance. The most
common metric that is used to measure the detection speed is the number of frames per second (FPS).

● A high FPS value indicates that the model can process frames quickly, making it suitable for time-sensitive
applications like autonomous vehicles, surveillance systems, robotics, and more.

● On the other hand, a low FPS value implies that the model is slower, which might limit its applicability in
certain real-time scenarios.

● For example, Faster R-CNN operates at only 7 frames per second (FPS), whereas SSD operates at 59
FPS.

● In benchmarking experiments, you will see the authors of a paper stating their network results as: “Network
X achieves mAP of Y% at Z FPS”. Where X is the network name, Y is the mAP percentage, and Z is the
FPS.
Two Stage Detectors vs One Stage Detectors
Two-stage Object Detector
The two-stage object detector divides the whole process into 2 steps:
1. It first extracts the features using a CNN.
2. It then extracts a series of regions of interest called object proposals, and the classification and localization happen only on the object proposals.
Two-stage object detectors are very powerful and extremely accurate, having very high values of mAP. Hence, they are mostly used in the medical domain, where classification accuracy is more important than speed. Examples of two-stage object detectors are the R-CNN family, SPP-Net, etc.

Single-stage Object Detector
● In a single-stage object detector, we go directly from the image to classification and bounding box coordinates.
● The images are fed into a feature extractor using a CNN and then the extracted features are directly used for classification and regression for bounding box coordinates.
● Single-stage object detectors are very fast and can be used in real-time object detection, but sometimes their performance is poorer than that of two-stage object detectors.
● Examples are the YOLO family, SSD, RetinaNet, etc.
Models for Object Detection :
Two Stage Detectors
R-CNN
Region-based Convolutional Neural Network (R-CNN) is a type of deep learning architecture used for object
detection in computer vision tasks. R-CNN was one of the pioneering models that helped advance the object
detection field by combining the power of convolutional neural networks and region-based approaches.
Let's dive deeper into how R-CNN works, step by step.

Source[9]
R-CNN- Region Proposal
R-CNN starts by dividing the input image into multiple regions or subregions. These regions are referred to
as "region proposals" or "region candidates." The region proposal step is responsible for generating a set of
potential regions in the image that are likely to contain objects. R-CNN does not generate these proposals
itself; instead, it relies on external methods like Selective Search or EdgeBoxes to generate region
proposals.

Selective Search : The selective search algorithm


[9] works by generating sub-segmentations of the
image that could belong to one object — based on
color, texture, size and shape — and iteratively
combining similar regions to form objects. This
gives ‘object proposals’ of different scales.

The authors use the selective search algorithm


to generate 2000 category independent region
proposals (usually indicated by rectangular
regions or ‘bounding boxes’) for each individual
image.
source
R-CNN- Feature extraction from Region Proposals
● In this stage, each region proposal is warped or cropped into a fixed resolution, and the CNN
module is utilized to extract a 4096-dimensional feature as the final representation from each of the
2000 region proposals for every image.
● The CNN used is of the same architecture as AlexNet by Krizhevsky et al., pre-trained on ILSVRC
2012 classification using image-level annotations only (to achieve results comparable to AlexNet)
and then fine-tuned on warped region proposals only.
● Due to large learning capacity, dominant expressive power, and hierarchical structure of CNNs, a
high-level, semantic, and robust feature representation for each region proposal can be obtained.

source
R-CNN- SVM for object classification
This stage consists of learning an individual linear SVM (Support Vector Machine) classifier for each class, that
detects the presence or absence of an object belonging to a particular class.

Inputs: The 4096-d feature vector for each region proposal.

Labels for training: The features of all region proposals that have an IoU overlap of less than 0.3* with the
ground truth bounding box are considered negatives for that class during training. The positives for that class
are simply the features from the ground truth bounding boxes itself. All other proposals (IoU overlap greater
than 0.3, but not a ground truth bounding box) are ignored for the purpose of training the SVM.

*This number 0.3 was found using grid search on a validation set

source
R-CNN- Bounding Box Regression and Non Maximum Suppression

Bounding Box Regression


In addition to classifying objects, R-CNN also performs bounding box regression. For each class, a separate
regression model is trained to refine the location and size of the bounding box around the detected object. The
bounding box regression helps improve the accuracy of object localization by adjusting the initially proposed
bounding box to better fit the object's actual boundaries.

Non-Maximum Suppression (NMS)


After classifying and regressing bounding boxes for each region proposal, R-CNN applies non-maximum
suppression to eliminate duplicate or highly overlapping bounding boxes. NMS ensures that only the most
confident and non-overlapping bounding boxes are retained as final object detections.

source
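A small NumPy sketch of greedy non-maximum suppression (boxes as (x1, y1, x2, y2) rows with one confidence score per box; the 0.5 IoU threshold is just a typical default, not a value from the paper):

import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    order = scores.argsort()[::-1]          # process boxes from most to least confident
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # IoU of the kept box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]     # drop boxes that overlap the kept one too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))   # [0, 2]: the near-duplicate box 1 is suppressed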
R-CNN - Disadvantages
Disadvantages of R-CNN.

● Multi-stage, expensive training: The separate training processes are required for all the stages of the
network i.e fine-tuning a CNN on object proposals, learning an SVM to classify the feature vector of each
proposal from the CNN and learning a bounding box regressor to fine-tune the object proposals proves to
be a burden in terms of time, computation and resources. This multi-stage process can be slow and
resource-demanding.
● Slow Inference: Due to its sequential processing of region proposals, R-CNN is relatively slow during
inference. Real-time applications may find this latency unacceptable. Detection using a simple VGG network
as the backbone CNN takes 47s/image.
● R-CNN is Not End-to-End: Unlike more modern object detection architectures like Faster R-CNN, R-CNN
is not an end-to-end model. It involves separate modules for region proposal and classification, which can
lead to suboptimal performance compared to models that optimize both tasks jointly.
Fast R-CNN
Published in 2015, one year after the SPPNet paper, Fast R-CNN is a popular successor to these models, improving both the efficiency and performance of R-CNN and SPP networks.

The Fast R-CNN consists of a CNN (usually pre-trained on the ImageNet classification task) with its final
pooling layer replaced by an “ROI pooling” layer and its final FC layer is replaced by two branches — a (K + 1)
category softmax layer branch and a category-specific bounding box regression branch.

Source
Fast R-CNN Architecture
1. The entire image is fed into the backbone CNN and the features from the last convolution layer are
obtained. Depending on the backbone CNN used, the output feature maps are much smaller than the
original image size. This depends on the stride of the backbone CNN, which is usually 16 in the case
of a VGG backbone.
2. Meanwhile, the object proposal windows are obtained from a region proposal algorithm like selective
search.
3. The portion of the backbone feature map that belongs to this window is then fed into the ROI Pooling
layer. The ROI pooling layer is a special case of the spatial pyramid pooling (SPP) layer with just one
pyramid level. The layer basically divides the features from the selected proposal windows (that come
from the region proposal algorithm) into sub-windows of size h/H by w/W and performs a pooling
operation in each of these sub-windows. This gives rise to fixed-size output features of size (H x W)
irrespective of the input size. H and W are chosen such that the output is compatible with the
network’s first fully-connected layer. The chosen values of H and W in the Fast R-CNN paper is 7. Like
regular pooling, ROI pooling is carried out in every channel individually.
4. The output features from the ROI Pooling layer (N x 7 x 7 x 512) where N is the number of proposals)
are then fed into the successive FC layers, and the softmax and BB-regression branches. The softmax
classification branch produces probability values of each ROI belonging to K categories and one
catch-all background category. The BB regression branch output is used to make the bounding boxes
from the region proposal algorithm more precise.
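The ROI pooling operation described in step 3 is available directly in torchvision; a minimal sketch (the feature-map size, proposals and stride of 16 below are illustrative values, not taken from the paper):

import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 38, 50)          # backbone output for one image (stride 16)
# Each proposal row: (batch_index, x1, y1, x2, y2) in input-image coordinates.
proposals = torch.tensor([[0.,  64.,  32., 256., 224.],
                          [0., 100., 100., 400., 500.]])
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                                 # torch.Size([2, 512, 7, 7]) -> fed to the FC layers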
Faster RCNN
In the R-CNN family of papers, the evolution between versions was usually in terms of computational efficiency
(integrating the different training stages), reduction in test time, and improvement in performance (mAP). These
networks usually consist of:
1. A region proposal algorithm to generate "bounding boxes" or locations of possible objects in the image.
2. A feature generation stage to obtain features of these objects, usually using a CNN.
3. A classification layer to predict which class this object belongs to; and
4. A regression layer to make the coordinates of the object bounding box more precise.

Both R-CNN and Fast R-CNN use CPU based region proposal algorithms, Eg- the Selective search
algorithm which takes around 2 seconds per image and runs on CPU computation.
The Faster R-CNN [10] paper fixes this by using another convolutional network (the RPN) to generate
the region proposals. This not only brings down the region proposal time from 2s to 10 ms per image but
also allows the region proposal stage to share layers with the following detection stages, causing an
overall improvement in feature representation.

Source
Region Proposal Network (RPN)
1. The region proposal network (RPN)
starts with the input image being fed
into the backbone convolutional neural
network. The input image is first
resized such that it’s shortest side is
600px with the longer side not
exceeding 1000px.
2. The output features of the backbone
network (indicated by H x W) are
usually much smaller than the input
image depending on the stride of the
backbone network. For both the
possible backbone networks used in
the paper (VGG, ZF-Net) the network
stride is 16.

Source
Region Proposal Network (RPN)
3. For every point in the output feature map, the network has to learn whether an object is present in the
input image at its corresponding location and estimate its size. This is done by placing a set of “Anchors” on
the input image for each location on the output feature map from the backbone network. These anchors
indicate possible objects in various sizes and aspect ratios at this location. The figure below shows 9
possible anchors in 3 different aspect ratios and 3 different sizes placed on the input image for a point A on
the output feature map. For the PASCAL challenge, the anchors used have 3 scales of box area 128², 256²,
512² and 3 aspect ratios of 1:1, 1:2 and 2:1.

4. As the network moves through each


pixel in the output feature map, it has to
check whether these k corresponding
anchors spanning the input image
actually contain objects, and refine
these anchors’ coordinates to give
bounding boxes as “Object proposals”
or regions of interest.

Source
Region Proposal Network (RPN)
5. First, a 3 x 3 convolution with 512 units is
applied to the backbone feature map as
shown in Figure , to give a 512-d feature
map for every location. This is followed by
two sibling layers: a 1 x 1 convolution layer
with 18 units for object classification, and a
1 x 1 convolution with 36 units for bounding
box regression.

6. The 18 units in the classification branch


give an output of size (H, W, 18). This
output is used to give probabilities of
whether or not each point in the backbone
feature map (size: H x W) contains an object
within all 9 of the anchors at that point.

7. The 36 units in the regression branch give an output of size (H, W, 36). This output is used to give the 4
regression coefficients of each of the 9 anchors for every point in the backbone feature map (size: H x W).
These regression coefficients are used to improve the coordinates of the anchors that contain objects.
Source
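A small NumPy sketch of generating the 9 reference anchors described in step 3 (3 scales x 3 aspect ratios for one feature-map location; in the full RPN these anchors are then shifted by the stride of 16 for every cell of the H x W feature map):

import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """9 anchors (x1, y1, x2, y2) centred on one feature-map cell, in input-image coordinates."""
    cx = cy = base_size / 2.0
    anchors = []
    for scale in scales:                      # box areas 128^2, 256^2, 512^2
        for ratio in ratios:                  # aspect ratios 1:2, 1:1, 2:1
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

print(make_anchors().shape)   # (9, 4) -- one set of anchors, replicated at every feature-map location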
Faster R-CNN

Source
Models for Object Detection :
Single shot Detectors
Single-shot object detection is a type of object detection algorithm that is able to detect objects within an
image or video in a single pass without the need for multiple stages or region proposals.

Single-shot object detectors, such as YOLO (You Only Look Once) and SSD (Single Shot MultiBox
Detector), use a single convolutional neural network (CNN) to directly predict the class labels and bounding
boxes of objects within an image or video. These models are trained end-to-end using a large dataset of
labeled images and their associated object-bounding boxes
YOLO v1
YOLO (You Only Look Once) is a real-time object detection algorithm developed by Joseph Redmon and Ali
Farhadi in 2015. It is a single-stage object detector that uses a convolutional neural network (CNN) to predict the
bounding boxes and class probabilities of objects in input images.

Source
YOLO Algorithm
The basic idea behind YOLO is to divide the input image into a grid of cells and, for each cell, predict the probability of the presence
of an object and the bounding box coordinates of the object. The process of YOLO can be broken down into several steps:

1. Input image is passed through a CNN to extract features from the image.

2. The features are then passed through a series of fully connected layers, which predict class probabilities and bounding box
coordinates.

3. The image is divided into a grid of cells, and each cell is responsible for predicting a set of bounding boxes and class
probabilities.

4. The output of the network is a set of bounding boxes and class probabilities for each cell.

5. The bounding boxes are then filtered using a post-processing algorithm called non-max suppression to remove overlapping
boxes and choose the box with the highest probability.

6. The final output is a set of predicted bounding boxes and class labels for each object in the image.

One of the key advantages of YOLO is that it processes the entire image in one pass, making it faster and more efficient than two-
stage object detectors such as R-CNN and its variants.
Source
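For concreteness, a tiny sketch of how YOLO v1's output tensor is laid out for PASCAL VOC (S = 7 grid, B = 2 boxes per cell, C = 20 classes, i.e. a 7 x 7 x 30 tensor); the random tensor below simply stands in for a real network output:

import torch

S, B, C = 7, 2, 20
pred = torch.randn(1, S, S, B * 5 + C)     # 1 x 7 x 7 x 30 prediction volume

cell = pred[0, 3, 4]                       # predictions of one grid cell
boxes = cell[:B * 5].view(B, 5)            # each row: (x, y, w, h, confidence)
class_probs = cell[B * 5:]                 # 20 conditional class probabilities for that cell
print(boxes.shape, class_probs.shape)      # torch.Size([2, 5]) torch.Size([20])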


YOLO v2
● One of the main differences between YOLO v2 and the original YOLO is the use of anchor boxes. In
YOLO v2, the CNN predicts bounding box coordinates as offsets relative to a set of pre-defined anchor
boxes. Anchor boxes are pre-defined boxes of different aspect ratios and scales, which are used to match
the predicted bounding boxes with the actual objects in the image. This allows YOLO v2 to handle objects
of different shapes and sizes better.
● Another key difference is the use of a multi-scale approach. In YOLO v2, the network is trained on input
images at multiple resolutions, which allows the model to detect objects at different sizes, and a
passthrough layer brings finer-grained features from an earlier layer into the detection layers.
● Additionally, YOLO v2 uses a different loss function than the original YOLO, called the sum-squared error
(SSE) loss function. The SSE loss function is more robust and helps the model to converge faster.
● In terms of architecture, YOLO v2 uses a slightly deeper CNN than YOLO, which allows it to extract more
powerful features from the image. The CNN is followed by several fully connected layers, which predict
class probabilities and bounding box coordinates.

Source



YOLO v3
● YOLO v3 is the third version of the YOLO object detection algorithm. The first difference between YOLO
v3 and previous versions is the use of multiple scales in the input image. YOLO v3 uses a technique
called "feature pyramid network" (FPN) to extract features from the image at different scales. This allows
the model to detect objects of different sizes in the image.
● Another important difference is the use of anchor boxes. In YOLO v3, anchor boxes are used to match the
predicted bounding boxes with the actual objects in the image. Anchor boxes are pre-defined boxes of
different aspect ratios and scales, and the model predicts the offset of the anchor boxes relative to the
bounding boxes. This helps the model to handle objects of different shapes and sizes better.
● In terms of architecture, YOLO v3 is built on a deep convolutional neural network (CNN) that is composed
of many layers of filters. The CNN is followed by several fully connected layers, which predict class
probabilities and bounding box coordinates.
● YOLO v3 also uses a different loss function than previous versions. It uses a combination of classification
loss and localization loss, which allows the model to learn both the class probabilities and the bounding
box coordinates.

Source


YOLO v4
● A key distinction between YOLO v4 and previous versions is using a more advanced neural network
architecture. YOLO v4 uses a technique called "Spatial Pyramid Processing" (SPP) to extract
features from the image at different scales and resolutions. This allows the model to detect objects
of different sizes in the image.

● Additionally, YOLO v4 also uses a technique called "Cross-stage partial connection" (CSP) to
improve the model's accuracy. It uses a combination of multiple models with different architectures
and scales and combines their predictions to achieve better accuracy.

Source
YOLO v5
● YOLO v5 was introduced in 2020 with a key difference from the previous versions, which is the use of
a more efficient neural network architecture called EfficientDet, which is based on the EfficientNet
architecture. EfficientDet is a family of image classification models that have achieved state-of-the-art
performance on a number of benchmark datasets. The EfficientDet architecture is designed to be
efficient in terms of computation and memory usage while also achieving high accuracy.

● Another important difference is the use of anchor-free detection, which eliminates the need for anchor
boxes used in previous versions of YOLO. Instead of anchor boxes, YOLO v5 uses a single
convolutional layer to predict the bounding box coordinates directly, which allows the model to be
more flexible and adaptable to different object shapes and sizes.

● YOLO v5 also uses a technique called "Cross mini-batch normalization" (CmBN) to improve the
model's accuracy. CmBN is a variant of the standard batch normalization technique that is used to
normalize the activations of the neural network.

Source
YOLO v6
● A notable contrast between YOLO v6 and previous versions is the use of a more efficient and lightweight
neural network architecture; this allows YOLO v6 to run faster and with fewer computational resources. The
architecture of YOLO v6 is based on the "Efficient Net-Lite" family, which is a set of lightweight models that
can be run on various devices with limited computational resources.

● YOLO v6 also incorporates data augmentation techniques to improve the robustness and generalization of
the model. This is done by applying random transformations to the input images during training, such as
rotation, scaling, and flipping.

Source
YOLO v7
● YOLO v7 boasts a number of enhancements compared to previous versions. A key enhancement is the
implementation of anchor boxes. These anchor boxes, which come in various aspect ratios, are utilized
to identify objects of various shapes. The use of nine anchor boxes in YOLO v7 enables it to detect a
wider range of object shapes and sizes, leading to a decrease in false positives.

● In YOLO v7, a new loss function called "focal loss" is implemented to enhance performance. Unlike the
standard cross-entropy loss function used in previous versions of YOLO, focal loss addresses the
difficulty in detecting small objects by adjusting the weight of the loss on well-classified examples and
placing more emphasis on challenging examples to detect.

Source

Conclusion

This lecture aims to summarize the most popular CNN architectures for object
detection , however there are a plethora of other architectures which may or may not
be CNN based which the students may wish to explore . A comprehensive list is
provided in the research papers listed in reference [1],[2] and [3].

Reading the research papers cited in the references section is also encouraged.
References
[1]Z. Zou, K. Chen, Z. Shi, Y. Guo and J. Ye, "Object Detection in 20 Years: A Survey," in Proceedings of the IEEE, vol. 111, no. 3,
pp. 257-276, March 2023, doi: 10.1109/JPROC.2023.3238524.
keywords: {Object detection;Detectors;Computer vision;Feature extraction;Deep learning;Convolutional neural
networks;Computer vision;convolutional neural networks (CNNs);deep learning;object detection;technical evolution}
[2]Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, ‘Object Detection with Deep Learning: A Review’, arXiv [cs.CV]. 2019.
[3]X. Wu, D. Sahoo, and S. C. H. Hoi, ‘Recent Advances in Deep Learning for Object Detection’, arXiv [cs.CV]. 2019.
[4] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in CVPR, vol. 1. IEEE, 2001, pp.
I–I.
[5] P. Viola and M. J. Jones, “Robust real-time face detection,” International journal of computer vision, vol. 57, no. 2, pp. 137–
154, 2004.
[6] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, vol. 1. IEEE, 2005, pp. 886–893.
[7]P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in CVPR.
IEEE, 2008, pp. 1–8.
[8]R. Girshick, J. Donahue, T. Darrell, and J. Malik, ‘Rich feature hierarchies for accurate object detection and semantic
segmentation’, arXiv [cs.CV]. 2014.
[9]Uijlings, J. R. R. et al. “Selective Search for Object Recognition.” International Journal of Computer Vision 104.2 (2013)
[10]Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks, NIPS’15 Proceedings
UE21CS343BB2
Topics in Deep Learning

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack:Anashua Krittika Dastidar,


Teaching Assistant
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre for
Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Devang Saragoi,


Teaching Assistant
Unit 1: Introduction to Deep Learning

Transfer Learning

Devang Saraogi
Teaching Assistant
Unit 1: Introduction to Deep Learning - Transfer Learning

Introduction to Transfer Learning

Humans have an inherent ability to transfer knowledge across tasks. What we acquire as
knowledge while learning about one task, we utilize in the same way to solve related tasks. The
more related the tasks, the easier it is for us to transfer, or cross-utilize our knowledge.

Some simple examples would be,

Know how to ride a cycle → Learn how to ride a motorbike

Know how to play classic piano → Learn how to play jazz piano

Know math and statistics → Learn machine learning

In each of the above scenarios, we don’t learn everything from scratch when we attempt to
learn new aspects or topics. We transfer and leverage our knowledge from what we have learnt
in the past!
Unit 1: Introduction to Deep Learning - Transfer Learning

Transfer Learning

Transfer Learning is a technique where a model developed for one task is reused or repurposed for a
different but related task.

• In transfer learning, the knowledge of an already trained machine learning model is transferred to a different but closely related problem.

• It can be understood as an optimization strategy that enables accelerated progress and enhanced
performance while modelling the problem.

• Transfer learning is not exclusively an area of study for deep learning but is a popular tool, given that
deep learning demands substantial resources and data to train models.
Unit 1: Introduction to Deep Learning - Transfer Learning

Transfer Learning

For example, if you trained a simple classifier to predict whether an image contains a cat, you could use the model’s
training knowledge to identify other animals such as dogs.
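A minimal sketch of this reuse pattern with a pretrained backbone (PyTorch/torchvision; the 2-class cat-vs-dog head is just an example, and older torchvision versions use pretrained=True instead of the weights argument):

import torch.nn as nn
from torchvision import models

# Load a backbone pretrained on ImageNet (the source task).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the transferred layers so their learned knowledge is kept as-is.
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for the target task (e.g. 2 classes: cat vs. dog).
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head's parameters are trained on the (small) target dataset.
trainable = [p for p in model.parameters() if p.requires_grad]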
Unit 1: Introduction to Deep Learning - Transfer Learning

(Figure: Transfer learning setup. A source model is trained on a large amount of source data and source labels; its learned knowledge is transferred to a target model, which is then trained on a small amount of target data and target labels.)
Unit 1: Introduction to Deep Learning - Transfer Learning

Traditional Machine Learning vs. Transfer Learning

Traditional Machine Learning                  | Transfer Learning
isolated training approach                    | knowledge is transferred
computationally expensive                     | computationally efficient
large amounts of data are required            | a small dataset can be sufficient
takes time to achieve optimal performance     | achieves optimal performance faster
Unit 1: Introduction to Deep Learning - Transfer Learning

Traditional Machine Learning vs. Transfer Learning


Unit 1: Introduction to Deep Learning - Transfer Learning

Formal Definition of Transfer Learning

A Domain D consists of two components:

    D = {𝒳, P(X)}

where,
• feature space: 𝒳
• marginal distribution: P(X), where X = {x_1, ..., x_n}, x_i ∈ 𝒳

For a given domain D, a Task T is defined by two components:

    T = {𝒴, P(Y|X)} = {𝒴, η},  where Y = {y_1, ..., y_n}, y_i ∈ 𝒴

where,
• label space: 𝒴
• a predictive function η(·), learned from feature vector / label pairs (x_i, y_i), x_i ∈ X, y_i ∈ Y
• for each feature vector x in the domain, η(x) predicts its corresponding label: η(x) = y
Unit 1: Introduction to Deep Learning - Transfer Learning

Domain and Task in Transfer Learning

If two domains are different, they may have If two tasks are different, they may have different
different feature spaces or different marginal label spaces or different conditional distributions
distributions
Unit 1: Introduction to Deep Learning - Transfer Learning

Objective of Transfer Learning

Given a source domain D_S, a corresponding source task T_S, as well as a target domain D_T and a target task T_T, the objective of transfer learning is to enable us to learn the target conditional probability distribution P(Y_T | X_T) in D_T with the information gained from D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T.

In most cases, a limited number of labeled target examples, which is exponentially smaller than the number of labeled source examples, is assumed to be available.
Unit 1: Introduction to Deep Learning - Transfer Learning

Objective of Transfer Learning

Given source and target domains D_S and D_T, where D = {𝒳, P(X)}, and source and target tasks T_S and T_T, where T = {𝒴, P(Y|X)}, the source and target conditions can vary in four ways, which we will illustrate in the following using a document classification example:

• 𝒳_S ≠ 𝒳_T - The feature spaces of the source and target domain are different, e.g. the documents are written in two different languages. In the context of natural language processing, this is generally referred to as cross-lingual adaptation.

• P(X_S) ≠ P(X_T) - The marginal probability distributions of the source and target domain are different, e.g. the documents discuss different topics. This scenario is generally known as domain adaptation.

• 𝒴_S ≠ 𝒴_T - The label spaces between the two tasks are different, e.g. documents need to be assigned different labels in the target task. In practice, this scenario usually occurs together with scenario 4, as it is extremely rare for two different tasks to have different label spaces but exactly the same conditional probability distributions.

• P(Y_S | X_S) ≠ P(Y_T | X_T) - The conditional probability distributions of the source and target tasks are different.
Unit 1: Introduction to Deep Learning - Transfer Learning

Transfer Learning Strategies

Different transfer learning strategies and techniques are applied based on the domain of
the application, the task at hand, and the availability of data. Before deciding on the
strategy of transfer learning, it is crucial to have an answer of the following questions:

• Which part of the knowledge can be transferred from the source to the target to
improve the performance of the target task?
• When to transfer and when not to, so that one improves the target task performance/
results and does not degrade them?
• How to transfer the knowledge gained from the source model based on our current
domain/task?
Traditionally, transfer learning strategies fall under three major categories depending
upon the task domain and the amount of labeled/unlabeled data present.
Unit 1: Introduction to Deep Learning - Transfer Learning

Transfer Learning Strategies

Inductive Transfer Learning


Inductive Transfer Learning requires the source and target domains to be the same, though the specific
tasks the model is working on are different.
The algorithms try to use the knowledge from the source model and apply it to improve the target task.
The pre-trained model already has expertise on the features of the domain and is at a better starting point
than if we were to train it from scratch.
Inductive transfer learning is further divided into two subcategories depending upon whether the source
domain contains labeled data or not. These include multi-task learning and self-taught learning,
respectively.
Unit 1: Introduction to Deep Learning - Transfer Learning

Transfer Learning Strategies

Transductive Transfer Learning


Scenarios where the domains of the source and target tasks are not exactly the same but interrelated use
the Transductive Transfer Learning strategy. One can derive similarities between the source and target
tasks. These scenarios usually have a lot of labeled data in the source domain, while the target domain has
only unlabeled data.

Unsupervised Transfer Learning


Unsupervised Transfer Learning is similar to Inductive Transfer learning. The only difference is that the
algorithms focus on unsupervised tasks and involve unlabeled datasets both in the source and target tasks.
Unit 1: Introduction to Deep Learning - Transfer Learning

Transfer Learning Approaches

Transfer Learning approaches can be categorized based on the similarity of domains, independent of the type of data samples present during training.

Homogeneous Transfer Learning

• proposed to handle situations where the source and target domains share the same feature space
• the domains have only a slight difference in marginal distributions, which is adjusted for by correcting the sample selection bias or covariate shift (a minimal reweighting sketch follows below)
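The following is an illustrative Python sketch (not from the slides) of one such correction under covariate shift: a domain classifier estimates how target-like each labelled source example is, and the resulting importance weights are used when training the task model on the source data. The function and variable names (X_src, y_src, X_tgt) are our assumptions.

# Covariate shift correction by importance weighting (illustrative sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(X_source, X_target):
    """Weight each source example by p(target | x) / p(source | x)."""
    X = np.vstack([X_source, X_target])
    # Domain labels: 0 = source, 1 = target
    d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    domain_clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_target = domain_clf.predict_proba(X_source)[:, 1]
    weights = p_target / np.clip(1.0 - p_target, 1e-6, None)
    return weights / weights.mean()  # normalise for stability

# Usage sketch: assume X_src, y_src (labelled source) and X_tgt (unlabelled target) exist.
# w = covariate_shift_weights(X_src, X_tgt)
# task_clf = LogisticRegression(max_iter=1000).fit(X_src, y_src, sample_weight=w)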
Unit 1: Introduction to Deep Learning - Transfer Learning

Transfer Learning Approaches

Heterogeneous Transfer Learning

• the usual approaches do not account for the difference in feature spaces between the source and target domains
• it is challenging to collect labeled source domain data with the same feature space as the target domain
• heterogeneous transfer learning addresses the issue of source and target domains having differing feature spaces, along with other concerns such as differing data distributions and label spaces
• applied in cross-domain tasks such as cross-language text categorization, text-to-image classification, etc.
Unit 1: Introduction to Deep Learning - Transfer Learning

Types of Transferable Components

Instance Transfer
Instance transfer reuses knowledge acquired from the source domain to enhance the performance of the
model on a target task. Specific instances or data points from the source domain are deemed to be
relevant to the target domain.

• the central idea is to transfer information from source domain to target domain, but the direct reuse of
source domain data is not always practical or advantageous

• involves the selective identification and utilization of specific instances from the source domain that
align with the characteristics and requirements of the target task

• this approach acknowledges that certain instances from the source domain can be more relevant when
combined with the target data to achieve improved results
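As a minimal, illustrative sketch of this selective identification (names and the keep_fraction parameter are our assumptions, not from the slides), one simple strategy is to keep only the source instances that lie closest to the target data in feature space and mix them with the small labelled target set:

# Instance selection sketch: keep source examples nearest to the target domain.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_relevant_source(X_source, y_source, X_target, keep_fraction=0.3):
    """Keep the source examples whose nearest target neighbour is closest."""
    nn = NearestNeighbors(n_neighbors=1).fit(X_target)
    dist, _ = nn.kneighbors(X_source)  # distance of each source point to the target data
    n_keep = max(1, int(keep_fraction * len(X_source)))
    keep_idx = np.argsort(dist.ravel())[:n_keep]
    return X_source[keep_idx], y_source[keep_idx]

# Usage sketch: combine the selected source instances with the labelled target data.
# X_sel, y_sel = select_relevant_source(X_src, y_src, X_tgt_labelled)
# X_train = np.vstack([X_sel, X_tgt_labelled]); y_train = np.concatenate([y_sel, y_tgt])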
Unit 1: Introduction to Deep Learning - Transfer Learning

Types of Transferable Components

Instance Transfer
For example, if a model is trained on urban scenes (source domain) and needs to adapt to a new city (target
domain), certain instances from the source domain might include images with architectural features, street layouts,
or objects commonly found in both urban environments.
A traffic light is a common element in urban settings worldwide and can be identified as one such relevant instance.
Unit 1: Introduction to Deep Learning - Transfer Learning

Types of Transferable Components

Feature Representation Transfer


This transfer approach acts as a bridge between domains. Abstract features are extracted from the source
domain data. These features capture general patterns and relationships that are applicable to the target
domains as well.
The approach can be further divided into two subcategories:
Asymmetric Transfer: Source features are transformed to fit the target feature space. Loss of information
might happen due to domain differences.
Symmetric Transfer: A common latent feature space is identified, and both source and target features are
transformed into this shared representation.
Unit 1: Introduction to Deep Learning - Transfer Learning

Types of Transferable Components

Feature Representation Transfer


• state-of-the-art models trained on enormous amounts of data (e.g. VGG16) become the feature extraction engine. These models have mastered extracting generic, robust features relevant to understanding the data.

• instead of using the entire pre-trained model, different layers of the model are utilized; earlier layers capture low-level features like edges and textures while later layers encode more complex patterns

• a custom model is shaped around the borrowed layers of the pre-trained model, typically smaller and simpler, with a focus on learning the relationship between the extracted features and the target outputs (see the sketch below)

• sometimes, fine-tuning is required to align the transferred functionality with the target domain; this could include adjusting the weights of later layers
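A minimal Python sketch of this recipe, assuming a Keras/TensorFlow environment and a hypothetical 5-class target task (the head sizes and input shape are our assumptions): the VGG16 convolutional layers are frozen as the feature-extraction engine and a small custom head is trained on top.

# Feature-representation transfer with a frozen VGG16 backbone (illustrative sketch).
import tensorflow as tf
from tensorflow.keras.applications import VGG16

# Borrow VGG16's convolutional layers (pre-trained on ImageNet) and drop its classifier.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # act purely as a feature extractor

# A small custom model learns the mapping from extracted features to target labels.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(target_images, target_labels, epochs=5)  # small labelled target set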
Unit 1: Introduction to Deep Learning - Transfer Learning

Types of Transferable Components

Feature Representation Transfer


For example, if a model is initially trained to recognize objects in natural images (source domain), the
knowledge gained in learning features like edges, textures, or shapes can be transferred to a new task of
recognizing medical images (target domain) where the underlying visual patterns may also share
similarities.
Unit 1: Introduction to Deep Learning - Transfer Learning

Types of Transferable Components

Parameter Transfer
Parameter-based transfer learning is a technique where knowledge is transferred between source and
target domain models directly through their shared parameters (weights and biases). This leverages the
assumption that models for related tasks often share some underlying structure or patterns captured by
these parameters.

There are two main approaches to parameter sharing…


Hard Weight Sharing: The weights and biases of the pre-trained source model are directly copied for the
target model.
Soft Weight Sharing: The model is expected to be close to the already learned features and is usually
penalized if its weights deviate significantly from a given set of weights.
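As a minimal, illustrative sketch of soft weight sharing (sometimes implemented as an "L2-SP"-style penalty; the function name and strength value are our assumptions), the target model starts from the pre-trained weights and is penalised whenever its parameters drift far from them:

# Soft weight sharing sketch: penalise deviation from the pre-trained weights.
import copy
import torch

def soft_sharing_penalty(model, reference_state, strength=1e-3):
    """Sum of squared distances between current weights and the pre-trained ones."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in reference_state:
            penalty = penalty + ((param - reference_state[name]) ** 2).sum()
    return strength * penalty

# Usage sketch inside a training loop:
# reference = {k: v.clone().detach() for k, v in pretrained_model.state_dict().items()}
# target_model = copy.deepcopy(pretrained_model)   # hard sharing as the starting point
# loss = task_loss(target_model(x), y) + soft_sharing_penalty(target_model, reference)
# loss.backward(); optimizer.step()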
Unit 1: Introduction to Deep Learning - Transfer Learning

Types of Transferable Components

Relational Based Transfer


Relational-based transfer learning focuses on transferring knowledge between source and target domains
by explicitly learning and leveraging the relationships between data points.

• this contrasts with approaches that transfer individual features, instances, or parameters

• relational methods aim to capture logical rules and connections within the data, enabling knowledge
transfer even for non-IID* data with complex relationships

• for example, understanding the relationships between speech elements in a male voice can assist in
analyzing sentences from other voices

*IID - Independent and Identically Distributed


Unit 1: Introduction to Deep Learning - Transfer Learning

Transfer Learning Strategies and Types of Transferable Components

                                      Inductive            Transductive         Unsupervised
                                      Transfer Learning    Transfer Learning    Transfer Learning

Instance Transfer                          ✓                    ✓
Feature Representation Transfer            ✓                    ✓                    ✓
Parameter Transfer                         ✓
Relational Knowledge Transfer              ✓
Unit 1: Introduction to Deep Learning

Transfer Learning for Deep Learning

Devang Saraogi
Teaching Assistant
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

Transfer Learning for Deep Learning

Myth
Deep Learning is not possible unless you have a million labelled examples for the task at hand.

Reality
• Useful representations can be learned from unlabelled data.
• You can train on a nearby surrogate objective for which it is easy to generate labels.
• You can transfer learned representations from a related task.
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

Transfer Learning for Deep Learning

Deep learning models are representative of what is known as inductive learning.


• Inductive learning algorithms infer a mapping from a set of training examples
• For instance, classification models learn a mapping between input features and class labels. In order for a model to generalize well on unseen data, it operates on a set of assumptions related to the distribution of the training data.
• This set of assumptions is known as the inductive bias, and it influences what is learned by the model on the given task and domain.
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

Transfer Learning for Deep Learning

Inductive transfer techniques utilize the inductive biases of the source task to assist the target task.

This can be done in different ways, such as by adjusting the inductive bias of the target task by limiting the model
space, narrowing down the hypothesis space, or making adjustments to the search process itself with the help of
knowledge from the source task.
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

Deep Transfer Learning

Deep learning has made considerable progress in recent years. Pre-trained models form the basis of transfer
learning in the context of deep learning (deep transfer learning). Let’s look at the two most popular strategies for
deep transfer learning.

•Off-the-shelf Pre-trained Models as Feature Extractors


•Fine Tuning Off-the-shelf Pre-trained Models
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

Deep Transfer Learning

Off-the-shelf Pre-trained Models as Feature Extractors


Deep learning systems and models are layered architectures that learn different features at different layers
(hierarchical representations of layered features). These layers are then finally connected to a last layer (usually a
fully connected layer, in the case of supervised learning) to get the final output.

This layered architecture allows us to utilize a pre-trained network (such as Inception v3 or VGG) without its final
layer as a fixed feature extractor for other tasks.
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

[Figure: a source network (conv1, conv2, conv3, fc1, fc2 + softmax + loss) is trained on source data and labels; its conv1–conv3 and fc1 layers are transferred to the target network, where they produce features for a shallow model trained on the target data and labels.]
The key idea here is to just leverage the pre-trained model’s weighted layers to extract features but not to update
the weights of the model’s layers during training with new data for the new task.
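A minimal Python sketch of this key idea (the choice of ResNet-18, input size and the downstream classifier are our assumptions, not the only options): the pre-trained network with its final layer removed acts as a fixed feature extractor, and a shallow model is then trained on the extracted features.

# Pre-trained network as a fixed feature extractor (illustrative sketch).
import torch
from torch import nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()        # drop the final classification layer
backbone.eval()                    # these weights are never updated
for p in backbone.parameters():
    p.requires_grad = False

@torch.no_grad()
def extract_features(images):      # images: (N, 3, 224, 224) tensor
    return backbone(images)        # (N, 512) feature vectors

# Usage sketch: train any shallow classifier on the extracted features.
# feats = extract_features(target_images).numpy()
# from sklearn.linear_model import LogisticRegression
# clf = LogisticRegression(max_iter=1000).fit(feats, target_labels)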
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

Deep Transfer Learning

Fine Tuning Off-the-shelf Pre-trained Models


This technique requires more human intervention and it goes beyond replacing the final layer. In this technique,
some of the previous layers are selectively retrained.

• neural networks are highly configurable architectures with a number of hyperparameters


• initial layers capture more generic features and as we progress towards the final layer, the focus shifts towards
the specific task at hand

• we can freeze (fix weights) certain layers while retraining or fine tune the rest of the layers based on the
requirement

Question: Is freezing layers and using the model as a feature extractor enough? Or finetuning is also required?
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

Freeze or Fine Tuned ?


Bottom n layers can be either frozen or fine-tuned.
• Freeze: not updated during backpropagation
• Fine-Tuned: updated during backpropagation

Based on the target task:
• Freeze: target task labels are scarce; we want to avoid overfitting
• Fine-Tuned: plenty of target task labels are available

Tip: learning rates can be set differently for each layer to navigate the tradeoff between freezing and fine tuning (a minimal sketch follows below).

[Figure: network with conv1 and conv2 frozen; conv3, fc1 and fc2 + softmax fine-tuned on the target data and labels.]
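A minimal Python sketch of freezing the bottom layers and fine-tuning the top ones with per-group learning rates (the choice of ResNet-18, which blocks to unfreeze, the 5-class head and the learning-rate values are all our assumptions):

# Freeze bottom layers, fine-tune the top block and a new head (illustrative sketch).
import torch
from torch import nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 5)   # new head for a 5-class target task

# Freeze everything first (not updated during backpropagation)...
for p in model.parameters():
    p.requires_grad = False
# ...then unfreeze only the last block and the new head for fine-tuning.
for p in model.layer4.parameters():
    p.requires_grad = True
for p in model.fc.parameters():
    p.requires_grad = True

# Different learning rates per group: small for transferred layers, larger for the new head.
optimizer = torch.optim.SGD(
    [
        {"params": model.layer4.parameters(), "lr": 1e-4},
        {"params": model.fc.parameters(), "lr": 1e-2},
    ],
    lr=1e-3,
    momentum=0.9,
)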


Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

Pre-trained Models

The entire concept of transfer learning is dependent on the presence of pre-trained models, more importantly
pre-trained models that perform exceptionally well on source tasks.

Luckily, the deep learning world believes in sharing and advancing together. Today, many state-of-the-art
models have been openly shared.

• pre-trained models usually cater to computer vision or natural language processing related tasks

• they are usually shared in different variants (small, medium, big) but contain millions of parameters
achieved during training

• they can be either downloaded from the internet or called as an object from deep learning related Python
libraries
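A minimal Python sketch of "calling a pre-trained model as an object" from common libraries (the specific models shown, ResNet-50 and bert-base-uncased, are examples of our choosing):

# Loading pre-trained models from Python libraries (illustrative sketch).
from torchvision import models
from transformers import AutoTokenizer, AutoModel

# A computer-vision backbone pre-trained on ImageNet.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# An NLP encoder pre-trained on large text corpora (weights are downloaded on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("transfer learning reuses pre-trained knowledge", return_tensors="pt")
outputs = bert(**inputs)  # contextual embeddings from the pre-trained encoder
print(resnet.fc, outputs.last_hidden_state.shape)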
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

Pre-trained Models

Pre-trained models for Computer Vision
• VGG16, VGG19
• ResNet, ResNet50
• MobileNet
• SAM (Segment Anything Model), FastSAM
• YOLO suite of models
• OpenPose, Google MediaPipe
• DETR (Detection Transformers)
• ViT (Vision Transformer)
• ByteTrack

Pre-trained models for Natural Language Processing
• word2Vec
• GLoVe (Global Vectors for Word Representation)
• BERT (Bidirectional Encoder Representations from Transformers)
• GPT (Generative Pretrained Transformer)
• ELMo (Embeddings from Language Models)
• RoBERTa (Robustly Optimized BERT)
• T5 (Text-to-Text Transfer Transformer)
• XLNet (eXtreme Language understanding Network)

Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

Types of Deep Transfer Learning

Domain Adaptation
Domain adaption is usually referred to in scenarios where the marginal probabilities between the source
and target domains are different, such as P(X_S) ≠ P(X_T).

• there is a shift or drift in the data distribution of the source and target domains that requires tweaks to transfer the learning

• for instance, a corpus of movie reviews labeled as positive or negative would be different from a corpus of product-review sentiments. A classifier trained on movie-review sentiment would see a different distribution if utilized to classify product reviews

• thus, domain adaptation techniques are utilized in transfer learning in these scenarios
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

Types of Deep Transfer Learning

Domain Confusion
Different layers in a model capture different sets of features. This can be utilized to learn domain-invariant
features and improve their transferability across domains.

• instead of allowing the model to learn any representation, we nudge the representations of both
domains to be as similar as possible

• this can be achieved by applying certain pre-processing steps directly to the representations themselves
• the basic idea behind this technique is to add another objective to the source model to encourage
similarity by confusing the domain itself, hence domain confusion.
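One common way this additional objective is implemented is with a gradient-reversal layer feeding a domain classifier (as in DANN-style training). Below is a minimal, illustrative Python sketch; the layer sizes, the lambda value and the module names are our assumptions.

# Domain confusion via a gradient-reversal layer (illustrative sketch).
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients on the way back."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

feature_extractor = nn.Sequential(nn.Linear(100, 64), nn.ReLU())
label_head = nn.Linear(64, 5)    # source-task classifier
domain_head = nn.Linear(64, 2)   # source vs. target discriminator

def losses(x_src, y_src, x_tgt, lamb=0.1):
    ce = nn.CrossEntropyLoss()
    f_src, f_tgt = feature_extractor(x_src), feature_extractor(x_tgt)
    task_loss = ce(label_head(f_src), y_src)
    # The reversed gradient pushes the features to *confuse* the domain classifier.
    feats = torch.cat([f_src, f_tgt])
    domains = torch.cat([torch.zeros(len(x_src)), torch.ones(len(x_tgt))]).long()
    domain_loss = ce(domain_head(GradReverse.apply(feats, lamb)), domains)
    return task_loss + domain_loss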
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

Types of Deep Transfer Learning

Multitask Learning
Multitask learning is a slightly different flavor of the
transfer learning world. In the case of multitask
learning, several tasks are learned simultaneously
without distinction between the source and targets.
In this case, the learner receives information about
multiple tasks at once, as compared to transfer
learning, where the learner initially has no idea
about the target task.
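As a minimal, illustrative sketch of this simultaneous learning (the layer sizes and the two example tasks are our assumptions), a shared trunk learns a common representation while separate heads are optimised together on several tasks:

# Multitask learning with a shared trunk and task-specific heads (illustrative sketch).
import torch
from torch import nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=32, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, 3)   # e.g. a 3-class classification task
        self.head_b = nn.Linear(hidden, 1)   # e.g. a regression task

    def forward(self, x):
        h = self.shared(x)
        return self.head_a(h), self.head_b(h)

# Both task losses are optimised together, so the shared layers receive
# learning signal from every task at once:
# logits_a, pred_b = model(x)
# loss = ce(logits_a, y_a) + mse(pred_b.squeeze(-1), y_b)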
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

Types of Deep Transfer Learning

One-shot Learning
Deep learning systems are data-hungry by nature, such that they need many training examples to learn the
weights. This is one of the limiting aspects of deep neural networks, though such is not the case with
human learning.
For instance, once a child is shown what an apple looks like, they can easily identify a different variety of
apple (with one or a few training examples); this is not the case with ML and deep learning algorithms.
One-shot learning is a variant of transfer learning where the model tries to infer the required output based on just one or a few training examples. This is essentially helpful in real-world scenarios where it is not
possible to have labeled data for every possible class (if it is a classification task), and in scenarios where
new classes can be added often.
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning

Types of Deep Transfer Learning

Zero-shot Learning
Zero-shot learning is another extreme variant of transfer learning, which relies on no labeled examples to learn a task. Zero-data learning or zero-shot learning methods make clever adjustments during the training stage itself to exploit additional information in order to understand unseen data.
In their book on Deep Learning, Goodfellow and his co-authors present zero-shot learning as a scenario where three variables are learned: the traditional input variable (x), the traditional output variable (y), and an additional random variable that describes the task (T). The model is thus trained to learn the conditional probability distribution P(y | x, T).
Zero-shot learning comes in handy in scenarios such as machine translation, where we may not even have labels in the target language.
Unit 1: Introduction to Deep Learning - Transfer Learning

Recap

• Transfer learning models focus on storing knowledge gained while solving one problem and applying it to a
different but related problem.
• Instead of training a neural network from scratch, many pre-trained models can serve as the starting point
for training. These pre-trained models give a more reliable architecture and save time and resources.
• Transfer learning is used in scenarios where there is not enough data for training or when we want better
results in a short amount of time.
• Transfer learning involves selecting a source model trained on a domain similar to the target domain, adapting the source model before transferring the knowledge, and then training the adapted model further to obtain the target model.
• It is common to fine-tune the higher-level layers of the model while freezing the lower levels as the basic
knowledge is the same that is transferred from the source task to the target task of the same domain.
• In tasks with a small amount of data, if the source model is too similar to the target model, there might be an issue of overfitting. To prevent overfitting, tuning the learning rate, freezing some layers of the source model, or adding linear classifiers while training the target model can help.
Unit 1: Introduction to Deep Learning - Transfer Learning

References

• https://machinelearningmastery.com/transfer-learning-for-deep-learning/
• https://www.v7labs.com/blog/transfer-learning-guide
• https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a
• https://medium.com/georgian-impact-blog/transfer-learning-part-1-ed0c174ad6e7
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre for
Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]

Ack: Devang Saraogi,


Teaching Assistant
