Unit1 TDL Compressed
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD),
Centre for Data Sciences & Applied Machine
Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Topics in Deep Learning
ARTIFICIAL INTELLIGENCE
Broad area which enables computers to mimic human behavior
MACHINE LEARNING
Use of statistical tools enables machines to learn from experience
(data) – but the machine still needs to be told which features to use
DEEP LEARNING
Learns using its own method of computing – in effect, its own "brain"
(a neural network)
Why is Deep
Learning useful?
Good at classification,
clustering and
predictive analysis
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Deep Learning
• The output of one neuron becomes the input to other neurons in the
next layer of the network, and this process continues until the final
layer produces the output of the network.
• The layers of the neural network transform the input data through a
series of nonlinear transformations, allowing the network to learn
complex representations of the input data.
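To make this layer-by-layer picture concrete, here is a minimal sketch in Keras (the layer sizes, the 20 input features and the single sigmoid output are illustrative choices, not taken from the slides):

    import tensorflow as tf

    # Each Dense layer takes the previous layer's outputs, forms weighted sums,
    # and applies a nonlinear activation, as described above.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),                     # 20 input features (illustrative)
        tf.keras.layers.Dense(16, activation='relu'),    # hidden layer 1
        tf.keras.layers.Dense(8, activation='relu'),     # hidden layer 2
        tf.keras.layers.Dense(1, activation='sigmoid')   # final layer produces the output
    ])
    model.summary()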
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Applications of Deep Learning
• Computer Vision
Deep learning models can enable machines to identify and understand visual data. Some of
the main applications of deep learning in computer vision include:
• Object detection and recognition: Deep learning models can be used to identify and
locate objects within images and videos, enabling applications such as self-driving cars,
surveillance, and robotics.
• Image classification: Deep learning models can be used to classify images into
categories such as animals, plants, and buildings. This is used in applications such as
medical imaging, quality control, and image retrieval.
• Image segmentation: Deep learning models can be used to segment images into
different regions, making it possible to identify specific features within images.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Applications of Deep Learning
• Computer Vision
Image Segmentation
Image Classification
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Applications of Deep Learning
• Reinforcement Learning
In reinforcement learning, deep learning is used to train agents to take
actions in an environment so as to maximize a reward.
Some of the main applications of deep learning in reinforcement learning
include:
• Game playing: Deep reinforcement learning models have been able to
beat human experts at games such as Go, Chess, and Atari.
• Robotics: Deep reinforcement learning models can be used to train robots
to perform complex tasks such as grasping objects, navigation, and
manipulation.
• Control systems: Deep reinforcement learning models can be used to
control complex systems such as power grids, traffic management, and
supply chain optimization.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures
• Deep Belief Networks (DBNs)
DBNs are trained layer by layer in an unsupervised manner. The first layer is trained to
capture basic patterns in the data, and subsequent layers are added and trained to
capture higher-level representations.
The architecture of DBNs also makes them good at unsupervised learning, where the
goal is to understand and label input data without explicit guidance. This characteristic
is particularly useful in scenarios where labelled data is scarce or when the goal is to
explore the structure of the data without any preconceived labels.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures
• Reinforcement Learning
RL is a type of machine learning paradigm where an agent learns to
make decisions by interacting with an environment.
The agent receives observations, takes actions, and receives rewards
from the environment, with the goal of learning a strategy that
maximizes the cumulative reward over time.
RL is employed in scenarios where optimal decision-making is learned
through trial and error in complex and dynamic environments.
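As a rough illustration of this observe-act-reward loop, here is a minimal tabular Q-learning sketch (the 5-state chain environment, the reward at the last state and all hyperparameters are invented for illustration; deep reinforcement learning replaces the table with a neural network):

    import numpy as np

    n_states, n_actions = 5, 2           # toy chain environment (illustrative)
    Q = np.zeros((n_states, n_actions))  # action-value table
    alpha, gamma, epsilon = 0.1, 0.9, 0.2

    def step(state, action):
        # action 1 moves right, action 0 moves left; reward 1 only at the last state
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        return next_state, reward

    for episode in range(500):
        state = 0
        for _ in range(20):
            # epsilon-greedy action selection: explore sometimes, exploit otherwise
            action = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[state]))
            next_state, reward = step(state, action)
            # Q-learning update: move Q towards reward + discounted best future value
            Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
            state = next_state

    print(Q)  # the learned values favour moving right, towards the reward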
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures
• Reinforcement Learning
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures
• Transformers
Transformers are a type of deep learning model architecture renowned
for their self-attention mechanism, enabling the capture of intricate
relationships within input sequences.
Their parallelization capabilities and encoder-decoder architecture make
them efficient for various tasks, extending beyond natural language
processing to computer vision.
Notably, models like BERT (Bidirectional Encoder Representations from
Transformers), Google's neural network-based technique for natural
language processing (NLP) pre-training, have demonstrated exceptional
performance.
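A minimal NumPy sketch of the self-attention mechanism mentioned above (single head, no masking; the token count and embedding size are illustrative):

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        # Scaled dot-product self-attention for one sequence.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv             # project inputs to queries, keys, values
        scores = Q @ K.T / np.sqrt(K.shape[-1])      # similarity of every position with every other
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
        return weights @ V                            # each output mixes information from all positions

    # toy example: 4 tokens, embedding size 8 (all sizes illustrative)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)   # (4, 8)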
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures
• Transformers
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures
• Google Gemini
Google Gemini is built on a decoder architecture featuring a 32k
context length (the number of tokens the model can take into account when
generating responses or predictions) and Multi-Query Attention (MQA).
Gemini is engineered for advanced contextual understanding, setting a
new standard in AI architecture.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Popular Architectures
• GPT4 vs Gemini:
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Acknowledgements & References
• https://www.geeksforgeeks.org/
• https://deeplearning.ai
• https://medium.com/@developer.yasir.pk/unveiling-googles-gemini-a-deep-dive-into-the-next-frontier-of-ai-ee41ffe90a9c
Thank You
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre for
Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
1. Data Preprocessing
Bias is the difference between the average prediction of our model and the correct value
which we are trying to predict. Model with high bias pays very little attention to the training
data and oversimplifies the model.
A Recap of Bias Variance Tradeoff using an example
Now, in search of a more complex model for our data, we have landed on the following function.
This curve passes through each and every data point in the training set (low bias) and does a great
job of fitting the training set, but it does an equally terrible job of fitting the test set. In this case
we have landed on a model which is too complex for the given data and thus has high variance.
Such a model is also said to be overfit, as it perfectly fits the training data but fails to
generalize.
[Figure: the overfit model on the training set vs. the test set]
Variance is the variability of model prediction for a given data point, or a value which tells us the spread of our data.
A Recap of Bias Variance Tradeoff using an example
If our model is too simple and has very few parameters then it may
have high bias and low variance. On the other hand, if our model
has a large number of parameters then it is going to have high
variance and low bias. So we need to find the right balance, without
overfitting or underfitting the data.
1. L1 regularization
2. L2 regularization
3. Dropout
4. Early Stopping
Parameter Norm Penalties
Regularization techniques were prevalent well before neural networks came
into the picture. The L1 and L2 family of regularization techniques are based on
adding a parameter norm penalty Ω(θ) (where θ represents the parameters)
to the objective function J (also known as the cost function).
Thus J′, the new cost function, becomes:

    J′(θ) = J(θ) + λ Ω(θ)

λ ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term.
When our training algorithm minimizes the regularized objective function J′, it will decrease both the original
objective J on the training data and some measure of the size of the parameters θ (or some subset of the
parameters). Different choices for the parameter norm Ω can result in different solutions being preferred.
L2 Parameter regularization
The L2 parameter norm penalty is commonly known as weight decay. This regularization strategy drives the
weights closer to the origin by adding a regularization term Ω(θ) = ½‖w‖₂² to the objective function.
L2 regularization is also known as ridge regression (when used in linear regression) or Tikhonov
regularization.
Let us understand L2 regularization using a linear
regression example:
Consider a scenario in which we have 2 points in
our training dataset and 2 points in our testing
dataset. Using linear regression we get a best-fit
line which has SSE (sum of squared errors) = 0 on the
training set, but a large SSE on the test set – an overfit line!
Ridge regression example
The main idea behind techniques like ridge regression is to find a new line with a
little more bias such that there is a significant drop in variance. So with a slightly
worse fit we get a more generalized line.
The L1 regularization model with respect to linear regression is also known as lasso
regression. As you may recall, in ridge regression the cost function used was:

    SSE + λ * slope²

Here we use:

    SSE + λ * |slope|
Difference between L1 and L2
● The main difference between ridge and lasso regression is while ridge
regression can shrink the slope asymptotically close to zero lasso regression
can actually reduce the slope to zero .
This is useful while eliminating parameters which have no effect on the prediction.
For example :
size = a*weight + b*height + c*no_of_friends + d*horoscope + so on…
● In the above equation although L2 regularization can reduce the value of c and d
it will never be equal to zero.
● But using lasso regression, c and d can be made zero, so that the effects of such
variables are completely removed.
● Thus Lasso regression is useful in cases where there are useless variables while
ridge regression is a better option when most variables are useful.
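A small scikit-learn sketch of this difference, assuming scikit-learn is available (the synthetic data and the alpha values, which play the role of λ, are illustrative):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))     # columns: weight, height, no_of_friends, horoscope
    y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)   # only the first two matter

    ridge = Ridge(alpha=1.0).fit(X, y)   # alpha plays the role of lambda
    lasso = Lasso(alpha=0.1).fit(X, y)

    print("ridge:", ridge.coef_)   # useless coefficients shrink towards zero but stay non-zero
    print("lasso:", lasso.coef_)   # useless coefficients are driven exactly to zero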
L2 regularization for Neural networks
For a NN with L layers and n training examples, the new cost function J would be
denoted as:

    J = (1/n) Σ_{i=1..n} L(ŷ(i), y(i)) + (λ/2n) Σ_{l=1..L} ||W[l]||²_F

where the first term is the sum of losses over the n training examples and the second
term is the Frobenius-norm penalty on the weight matrices.
Intuition on how regularization reduces overfitting in NN
The Frobenius norm penalizes the weight matrices for being too large.
● Here, if the value of λ is large, then the W matrix values are pushed
close to 0. This reduces the impact of a lot of hidden
units.
● With small weights, z stays in the small range where the activation g(z) is roughly linear;
as g(z) ends up being roughly linear, every layer in the NN ends up being roughly
linear, and so the model is made much simpler.
Why Regularization using Dropout?
● One approach to reduce overfitting is to fit all possible different neural network architectures on the
same dataset and to average the predictions from each model.
This is not feasible in practice, and can be approximated using a small collection of different models,
called an ensemble. However even with the ensemble approximation, multiple models need to be fit
and stored, which can be a challenge if the models are large, requiring days or weeks to train and
tune.
● Another approach could be training multiple instances of the same neural network using different
subsets of the training data. Each instance learns from a different subset, which helps reduce
overfitting. However, this approach also requires significant computational resources.
The term “dropout” refers to dropping out the nodes in a neural network. All the forward and
backwards connections with a dropped node are temporarily removed, thus creating a new network
architecture out of the parent network.
Dropout
For instance, if the hidden layers have 1000 neurons (nodes) and a dropout is applied with drop
probability = 0.5, then 500 neurons would be randomly dropped in every iteration (batch).
For a neural network with n nodes, the total number of thinned networks that can be formed is 2^n.
Of course, this is prohibitively large and we cannot possibly train 2^n networks. So we
● share the weights across all the networks
● sample a different network for each training instance
More on Dropout
● dropout is a generic approach and can be used with all network types (MLPs, CNNs,
LSTMs etc.)
● a good value for dropout in a hidden layer is between 0.5 and 0.8, and for units in the
input layer a larger rate, such as 0.8, is used
● usually applied in larger networks as they overfit easily
● testing of different rates is important as it results in the best dropout rate for the
problem; along with this it also indicates how sensitive the network is to dropout and an
unstable network can benefit from an increase in size
● dropout is more effective where there is a limited amount of training data and the
model is likely to overfit the training data
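A minimal Keras sketch of dropout between hidden layers, matching the 1000-unit example above (the input and output sizes are illustrative):

    import tensorflow as tf

    # rate=0.5 means each hidden unit is dropped with probability 0.5 during training;
    # at test time Keras automatically disables dropout and rescales activations.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(1000, activation='relu'),
        tf.keras.layers.Dropout(0.5),          # roughly 500 of the 1000 units dropped per batch
        tf.keras.layers.Dense(1000, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation='softmax')
    ])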
Summary
The word “augmentation” literally means “the action or process of making or becoming greater in
size or amount”, which summarizes the outcome of this technique.
But another important effect is that it increases or augments the diversity of the data. The increased
diversity means, at each training stage the model comes across a different version of the original
data.
For images, some common methods of data augmentation are taking cropped portions, zooming in/
out, rotating along the axis, vertical/horizontal flips, and adjusting the brightness and shear intensity. Data
augmentation for audio data involves adding noise and changing speed and pitch.
While data augmentation prevents the model from overfitting, some augmentation combinations
can actually lead to underfitting.
Data Augmentation
The simplest way to prevent overfitting is to increase the size of the training data but collecting more
labelled data is costly. In case of images, as mentioned earlier there are a few ways such as rotating the
image, flipping, scaling, shifting, etc.
In the above image, some transformations have been applied on the MNIST dataset. This provides a
big leap in improving the accuracy of the model. It can be considered as a mandatory trick in order to
improve our predictions.
In Keras, we can perform all of these transformations using ImageDataGenerator. It has a big list of
arguments which you can use to pre-process your training data.
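A minimal sketch of ImageDataGenerator with some of the transformations mentioned above (the argument values are illustrative, and x_train/y_train/model are assumed to exist):

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # each argument applies one of the transformations described above (values are illustrative)
    datagen = ImageDataGenerator(
        rotation_range=15,          # random rotation in degrees
        width_shift_range=0.1,      # horizontal shift
        height_shift_range=0.1,     # vertical shift
        zoom_range=0.1,             # zoom in/out
        horizontal_flip=True,       # random horizontal flips
        shear_range=0.1,            # shear intensity
        brightness_range=(0.8, 1.2)
    )

    # x_train, y_train are assumed to be NumPy arrays of images and labels:
    # model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=10)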
Early Stopping
Too little training results in an underfit model and too much training results in an overfit
model
If the performance of the model on the validation dataset starts to degrade (e.g. loss begins
to increase or accuracy begins to decrease), then the training process is stopped. The model
at the time that training is stopped is then used and is known to have good generalization
performance.
This simple, effective, and widely used approach to training neural networks is called
early stopping.
When training the network, a larger number of training epochs is used than may normally
be required, to give the network plenty of opportunity to fit, then begin to overfit the
training dataset.
Early Stopping
● To prevent overfitting and avoid driving the
validation error to a high value, it is important to
track the validation error during training.
● A patience parameter (p) is used to determine the
number of steps to wait for improvement in the
validation error.
● At each step (k), the validation error is checked. If
there is no improvement in the validation error for
the previous p steps, training is stopped, and the
model stored at step (k - p) is returned.
● By stopping the training early, we prevent the
model from overfitting and avoid the situation
where the training error approaches 0 while the
validation error increases.
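In Keras, this patience-based early stopping can be expressed with the EarlyStopping callback; a minimal sketch (the patience value and the commented training call are illustrative):

    import tensorflow as tf

    # monitor the validation loss; stop if it has not improved for `patience` epochs
    # and roll back to the weights of the best epoch (the "step k - p" model above)
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=5,
        restore_best_weights=True
    )

    # model.fit(x_train, y_train, validation_split=0.2,
    #           epochs=200,              # deliberately generous; early stopping cuts it short
    #           callbacks=[early_stop])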
References
● https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229
● https://www.youtube.com/watch?v=SjQyLhQIXSM&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=2
● https://www.youtube.com/watch?v=EuBBz3bI-aA
● https://www.youtube.com/watch?v=Q81RR3yKn30&t=891s
● https://www.youtube.com/watch?v=D8PJAL-MZv8&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc&index=6
● https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Activation Functions
Devang Saraogi
Teaching Assistant
Unit 1: Introduction to Deep Learning - Activation Functions
Activation Function
As we know, everything from the name to the structure of a Neural Network is inspired by the human brain;
mimicking the way biological neurons signal one another.
Diving into the structure, a node is a replica of a neuron that receives a set of input signals. Depending on the
nature and intensity of these input signals, the brain processes them and decides whether the neuron should be
activated (“fired”) or not.
Similarly, in deep learning, this is the role of the Activation Function; that’s why it’s often referred to as a
Transfer Function in artificial neural networks.
Unit 1: Introduction to Deep Learning - Activation Functions
Activation Function
The primary role of the Activation Function is to transform the summed weighted input from the node into an
output value to be fed to the next hidden layer or as output.
• Activation functions also help to normalize the output of each neuron to a range between 0 and 1 or
between -1 and 1.
* A neural network without an activation function is essentially just a linear regression model. The activation function performs the
non-linear transformation of the input, making the network capable of learning and performing more complex tasks.
Unit 1: Introduction to Deep Learning - Activation Functions
Activation Function
Activation Function is a mathematical “gate” in between the input feeding the current neuron
and its output going to the next layer
Unit 1: Introduction to Deep Learning - Activation Functions
Binary Step Function

    f(x) = 0 for x < 0
    f(x) = 1 for x ≥ 0

Output: a value of 0 or 1

Linear Activation Function

    f(x) = x
Linear Activation Functions turn the neural network into a linear regression model. This does not allow the
model to create complex mappings between the inputs and outputs.
Non-linear Activation Functions solve the following limitations of Linear Activation Functions:
• they allow backpropagation; the derivative function is related to the input, and it’s possible to go back
and understand which weights in the input neurons can provide a better prediction
• they allow the stacking of multiple layers as the output would now be a non-linear combination of input
passed through multiple layers; any output can be represented as a functional computation in a neural
network.
Sigmoid Function

    f(x) = 1 / (1 + e^(-x))

the derivative of the sigmoid function is

    f′(x) = g(x) = sigmoid(x) × (1 − sigmoid(x))
as we can see in the figure, the gradient values are only significant in the
central region, the curve gets flatter in the other regions, this implies that
the values will change very slowly in those regions
• as gradient values approach zero, the network stops learning and
suffers from the Vanishing Gradient problem
• the output of the function is not centered around zero i.e. the
outputs of all neurons will be of the same sign; this makes training
the neural network hard
Unit 1: Introduction to Deep Learning - Activation Functions
Tanh Function

    f(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Output: values bounded between -1 and 1
Advantages:
• Zero Centered; easier to model inputs that have strongly negative,
neutral, and strongly positive values
Unit 1: Introduction to Deep Learning - Activation Functions
Tanh Function
Disadvantages:
• like sigmoid, tanh also suffers from the vanishing gradient problem: the curve is flat away from the central region
the derivative of the tanh function is

    f′(x) = g(x) = 1 − tanh²(x)
* although both sigmoid and tanh face the vanishing gradient issue, tanh is zero centered, and the gradients are not restricted to
move in a certain direction, therefore, in practice, tanh nonlinearity is always preferred to sigmoid nonlinearity
Unit 1: Introduction to Deep Learning - Activation Functions
ReLU Activation

    f(x) = max(0, x)

Advantages:
• mitigates the vanishing gradient problem
• computationally efficient because of sparse activation, i.e. only those
neurons are activated which have a non-zero value
Unit 1: Introduction to Deep Learning - Activation Functions
ReLU Activation
Disadvantages: Dying ReLU Problem
    f′(x) = g(x) = 0 for x < 0
    f′(x) = g(x) = 1 for x ≥ 0
the negative side of the graph makes the gradient value zero. Due to this
reason, during backpropagation, the weights and biases for some neurons
are not updated. This can create dead neurons which never get activated.
All the negative input values become zero immediately, which decreases
the model’s ability to fit or train from the data properly.
Unit 1: Introduction to Deep Learning - Activation Functions
Leaky ReLU Activation
Working: the function has a small positive slope in the negative region,
producing non-zero values for negative inputs; the function behaves like
standard ReLU in the positive region

    f(x) = max(0.1x, x)

Output: the maximum of 0.1·input and input
Advantages:
• same as ReLU
• does not suffer from Dying ReLU
Unit 1: Introduction to Deep Learning - Activation Functions
    f′(x) = g(x) = 0.1 for x < 0
    f′(x) = g(x) = 1 for x ≥ 0
making this minor alteration for negative inputs, the gradient of the
negative region results in non-zero value; hence reducing the occurrences
of dead neurons
Disadvantages: the learning process is time consuming and predictions are not
consistent for negative values
Unit 1: Introduction to Deep Learning - Activation Functions
Parametric ReLU Activation

    f(x) = max(ax, x), where a is a learnable slope parameter for the negative region

Disadvantages:
• it may perform differently for different problems,
depending upon the value of the slope parameter a
• there is a risk of overfitting
Unit 1: Introduction to Deep Learning - Activation Functions
ELU (Exponential Linear Unit) Activation

    f(x) = x for x ≥ 0
    f(x) = α(e^x − 1) for x < 0

Advantages:
• ELU becomes smooth slowly until its output equals −α
• avoids the dead ReLU problem by introducing a smooth exponential curve for negative values;
it helps the network nudge weights and biases in the right direction
Unit 1: Introduction to Deep Learning - Activation Functions
    f′(x) = g(x) = 1 for x ≥ 0
    f′(x) = g(x) = f(x) + α for x < 0

Disadvantages:
• increases the computational time because of the exponential operation
• α is not a learnable parameter
• exploding gradient problem
Unit 1: Introduction to Deep Learning - Activation Functions
GELU (Gaussian Error Linear Unit) Activation

    f(x) = x · P(X ≤ x) = x · Φ(x)
         ≈ 0.5x [ 1 + tanh( √(2/π) · (x + 0.044715x³) ) ]

Working: activation functions (ReLU) activate a neuron by multiplying with 0s or 1s; dropout (a regularization
technique) randomly drops neurons by multiplying activations with 0; a newer RNN regularizer called Zoneout
stochastically multiplies the inputs by 1.

Together these decide the neuron’s output, yet they work independently; GELU aims to combine them.
Unit 1: Introduction to Deep Learning - Activation Functions
Softmax Activation
Input: weighted sum of weights and biases of the neurons in a layer
    f(x_i) = e^(x_i) / Σ_{j=1..K} e^(x_j)
Output: maps the output to a [0,1] range and at the same time makes sure
the summation of the output is 1.
Swish Function
The Swish activation performs well in various deep learning tasks. It introduces a non-linearity that allows the
model to capture complex patterns while maintaining some of the desirable properties of ReLU.
Input: weighted sum of weights and biases of the neurons in a layer;
β is a hyperparameter to control the slope of the swish activation
Working: the function self-gates its output, i.e. the magnitude of the output
is influenced by the input; the sigmoid function acts as a gating mechanism

    f(x) = x × sigmoid(βx) = x / (1 + e^(−βx))

Output: the function approaches positive infinity as the input approaches
positive infinity; as the input approaches negative infinity the function
settles down to a constant value (≈ 0)
Unit 1: Introduction to Deep Learning - Activation Functions
Swish Function
Advantages:
• smooth function, does not change value direction abruptly like ReLU; bends smoothly from 0 towards
values < 0 and then upwards again
• lays emphasis on small negative values but zeroes out large negative values (win-win)
• non-monotonous nature enhances the expression of input data and weights to be learnt
Disadvantages:
• while beneficial in some cases, its superiority over other activation functions is not universally established
• finding the optimal value for β may require additional experimentation
Unit 1: Introduction to Deep Learning - Activation Functions
SELU (Scaled Exponential Linear Unit) Activation

    f(x) = λx for x ≥ 0
    f(x) = λα(e^x − 1) for x < 0

* λ (≈ 1.0507) and α (≈ 1.6733) are predefined. These values are based on mathematical considerations to encourage self-normalization. When set to these values, SELU helps the network to stabilize.
Unit 1: Introduction to Deep Learning - Activation Functions
Disadvantages:
• SELU is relatively new so it is not yet used widely in practice; ReLU stays as the preferred option
• more research on architectures such as CNNs and RNNs using SELUs is needed for wide-spread industry use
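The formulas above translate directly into code. A small NumPy sketch of the activation functions covered in this unit (the test vector and the default values of α, β and λ are illustrative; GELU is omitted for brevity):

    import numpy as np

    def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
    def tanh(x):               return np.tanh(x)
    def relu(x):               return np.maximum(0.0, x)
    def leaky_relu(x):         return np.maximum(0.1 * x, x)
    def elu(x, alpha=1.0):     return np.where(x >= 0, x, alpha * (np.exp(x) - 1))
    def selu(x, lam=1.0507, alpha=1.6733):
        return lam * np.where(x >= 0, x, alpha * (np.exp(x) - 1))
    def swish(x, beta=1.0):    return x * sigmoid(beta * x)
    def softmax(x):
        e = np.exp(x - np.max(x))   # subtract the max for numerical stability
        return e / e.sum()

    z = np.array([-2.0, -0.5, 0.0, 1.5])
    for f in (sigmoid, tanh, relu, leaky_relu, elu, selu, swish):
        print(f.__name__, np.round(f(z), 3))
    print("softmax", np.round(softmax(z), 3))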
Unit 1: Introduction to Deep Learning - Activation Functions
Depending upon the properties of the problem we might be able to make a better choice for easy and
quicker convergence of the network.
• if we encounter a case of dead neurons in our networks, variants of the ReLU activation function seem to be
the best choice
• ReLU should only be used in hidden layers, and sigmoid/tanh should not be used in hidden layers because
they make the model more susceptible to problems (vanishing gradient)
• sigmoid/tanh functions are generally avoided due to the vanishing gradient problem
• the swish function is used in neural networks having a depth greater than 40 layers
As a rule of thumb, you can begin with the ReLU function and then move over to other activation
functions in case ReLU doesn’t provide optimum results
Unit 1: Introduction to Deep Learning - Activation Functions
Depending upon the type of the prediction problem it is advised to choose these activation functions for the
output layer
• Regression – Linear Activation Function
• Binary Classification – Sigmoid Activation Function
• Multiclass Classification – Softmax Activation
• Multilabel Classification – Sigmoid Activation Function
Unit 1: Introduction to Deep Learning - Activation Functions
Recap
Further Reading
https://www.v7labs.com/blog/neural-networks-activation-functions
Unit 1: Introduction to Deep Learning - Activation Functions
References
• https://www.v7labs.com/blog/neural-networks-activation-functions
• https://pub.towardsai.net/what-is-parametric-relu-2444a2a292de
• https://machinelearningmastery.com/activation-functions-in-pytorch/
• https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/
• https://medium.com/@vinodhb95/activation-functions-and-its-types-8750f1287464
• https://medium.com/@himanshuxd/activation-functions-sigmoid-relu-leaky-relu-and-softmax-basics-for-neural-networks-and-deep-8d9c70eed91e
• https://pytorch.org/docs/stable/nn.html#non-linear-activations-other
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre for
Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Loss Functions
Topics in Deep Learning
What is a loss function?
A loss function measures how far the model’s predictions are from the true values; training a network means adjusting its parameters to minimize this quantity.
We will consider 4 main cases to pair up problems with activation and loss
functions.
1. Regression: Predicting a Numerical Value
The final layer of the neural network will have one neuron and the value it returns is a
continuous numerical value.
To evaluate the accuracy of the prediction, it is compared with the true value which is
also a continuous number.
Topics in Deep Learning
Regression Problems
There are three metrics which are generally used for evaluation of Regression
problems (like Linear Regression, Decision Tree Regression, Random Forest
Regression etc.):
1. Mean Absolute Error (MAE):
This measures the absolute average distance between the real data and the
predicted data, but it fails to punish large errors in prediction.
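A tiny NumPy sketch of MAE (MSE is shown alongside for contrast, since squaring is what punishes large errors; the sample values are illustrative and not from the slides):

    import numpy as np

    y_true = np.array([3.0, -0.5, 2.0, 7.0])   # illustrative targets
    y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # illustrative predictions

    mae = np.mean(np.abs(y_true - y_pred))     # Mean Absolute Error: average absolute distance
    mse = np.mean((y_true - y_pred) ** 2)      # Mean Squared Error: squares punish large errors harder
    print(mae, mse)   # 0.5  0.375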
Topics in Deep Learning
Regression Problems
The final layer of the neural network will have one neuron and will return a value
between 0 and 1, which can be inferred as a probability.
To evaluate the accuracy of the prediction, it is compared with the true value, which
here is 0 or 1.
Topics in Deep Learning
Case 2 : Predicting a Binary Outcome
The final layer of the neural network will have one neuron for each of the classes and
they will return a value between 0 and 1, which will be the probability of it being the
class.
The output is then a probability distribution and will sum to 1.
Topics in Deep Learning
Case 3 : Predicting a Single Label from Multiple Classes
The final layer of the neural network will have one neuron for each of the classes, and
each will independently return a value between 0 and 1, which is the probability of that
label being present.
The outputs are not a single probability distribution and need not sum to 1, since several labels can apply at once.
Topics in Deep Learning
Case 4 : Predicting Multiple Labels from Multiple Classes
Although picking a loss function is often overlooked and not given much importance, one must
understand that there is no one-size-fits-all, and choosing a loss function is as important
as choosing the right machine learning model for the problem at hand.
The choice of a loss function is tightly coupled with the choice of output unit in a model.
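A minimal Keras sketch of this coupling between the output unit and the loss, covering the four cases above (the helper function, hidden-layer size and class count are invented for illustration; the loss names are standard Keras identifiers):

    import tensorflow as tf
    from tensorflow.keras import layers

    def output_head(case, num_classes=None):
        # Return (output layer, loss) for each of the four cases above.
        if case == "regression":
            return layers.Dense(1, activation='linear'), 'mse'
        if case == "binary":
            return layers.Dense(1, activation='sigmoid'), 'binary_crossentropy'
        if case == "multiclass":        # a single label out of num_classes (one-hot targets)
            return layers.Dense(num_classes, activation='softmax'), 'categorical_crossentropy'
        if case == "multilabel":        # several labels may be active at once
            return layers.Dense(num_classes, activation='sigmoid'), 'binary_crossentropy'

    out_layer, loss = output_head("multiclass", num_classes=5)
    model = tf.keras.Sequential([tf.keras.Input(shape=(16,)),
                                 layers.Dense(32, activation='relu'),
                                 out_layer])
    model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])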
Topics in Deep Learning
Factors to consider when selecting a loss function
Topics in Deep Learning
Acknowledgements & References
● https://towardsdatascience.com/deep-learning-which-loss-and-activation-functions-should-i-use-
ac02f1c56aa8
● https://www.deeplearningbook.org/
● https://www.analyticsvidhya.com/blog/2022/06/understanding-loss-function-in-deep-learning/
● https://machinelearningknowledge.ai/cost-functions-in-machine-learning/
● https://deeplearning.ai
● https://www.datacamp.com/tutorial/loss-function-in-machine-learning
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD),
Centre for Data Sciences & Applied Machine
Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Ack: Anirudh Chandrasekar,
Teaching Assistant
Topics in Deep Learning
Batch Normalization
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Problems with NN Training process
• Covariates (in statistics): variables that affect the response variable but are not of interest in
the study.
• In ML: they are not separate variables but essentially features.
Ex:
• Age: Study between physical activity and health.
• Gender: Study between job satisfaction and work-life balance.
• Neural Networks: Activations in Hidden layers and features.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Covariate Shift
• The distribution of the inputs to the kth layer changes after every weight
update.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Unstable and Slow training
Batch - all the data is passed at once. In this case we do not require batch
normalization, as weight updates happen only once.
Mini-batch - the data is split into smaller sets and each set is passed in a new
iteration. In this case Batch Normalization is needed, since weights are
updated before each iteration.
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Batch
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Batch Normalization
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Batch Normalization

    Z(i)norm = (Z(i) − μ) / √(σ² + ε)
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Batch Normalization

    Z(i)norm = (Z(i) − μ) / √(σ² + ε)

    Z̃(i) = γ · Z(i)norm + β
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Implementing Batch Normalization
where
• μ is the mean and σ² is the variance, computed over the mini-batch:
    μ = (1/m) Σi Z(i)        σ² = (1/m) Σi (Z(i) − μ)²
• ε is added to ensure numerical stability (in case the variance turns out to
be 0).
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Implementing Batch Normalization
Now we have normalized the values of z such that they have mean 0 and
unit variance.
But we do not always want this to be the case, and might want them to have a
different distribution.
So we compute

    Z̃(i) = γ · Z(i)norm + β

where γ and β are learnable parameters of the model.
Note: γ scales and β shifts.
Question: For what values of γ and β will Z̃(i) = Z(i)?
Ans: γ = √(σ² + ε) and β = μ
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Adding Batch Normalization to a Network
If you are using a deep learning framework, you won't have to implement
batch norm yourself:
• Ex. in TensorFlow you can call tf.nn.batch_normalization()
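A minimal Keras sketch of adding batch normalization between a dense layer and its activation (the layer sizes are illustrative):

    import tensorflow as tf

    # BatchNormalization learns gamma (scale) and beta (shift) per unit and keeps running
    # estimates of the batch mean/variance for use at inference time.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(64),
        tf.keras.layers.BatchNormalization(),   # normalizes z, then applies gamma * z_norm + beta
        tf.keras.layers.Activation('relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])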
UE21CS343BB2-TDL-Lecture 1- Introduction to Deep Learning
Why are γ and β needed?
As shown in the figure, once we normalize the inputs, all of them follow
a normal distribution.
This leads to the model learning the same distribution over and over again,
hindering the learning capability of the model.

    Z̃(i) = γ · Z(i)norm + β

allows the network to learn a different scale (γ) and shift (β) for the distribution at each layer.
• https://deeplearning.ai
• https://medium.com/@rishavsapahia/5-min-recap-for-andrew-ng-deep-learning-specialization-
course-2-8a59fd58ca0d
• https://youtu.be/PRQPyQeq6C4?si=7j_lT2i03mS4O94N
• https://youtu.be/2xChdY2qkmc?si=ten5Owc--g8ZaKOk
Thank You
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre for
Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Optimization algorithms used for training deep models differ from traditional optimization algorithms in
several ways. In most machine learning scenarios we care about some performance measure, say P, but we
optimise P only indirectly, by reducing a different cost function J(θ) in the hope that doing so will improve P.
Batch Gradient Descent
Upsides
● Fewer updates to the model means this variant of gradient descent is more computationally efficient than stochastic
gradient descent.
● The decreased update frequency results in a more stable error gradient and may result in a more stable convergence on
some problems.
● The separation of the calculation of prediction errors and the model update lends the algorithm to parallel processing
based implementations.
Downsides
● The more stable error gradient may result in premature convergence of the model to a less optimal set of parameters.
● The updates at the end of the training epoch require the additional complexity of accumulating prediction errors across all
training examples.
● Commonly, batch gradient descent is implemented in such a way that it requires the entire training dataset in memory
and available to the algorithm.
● Model updates, and in turn training speed, may become very slow for large datasets.
Stochastic Gradient Descent
Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that
calculates the error and updates the model for each example in the training dataset.
Upsides
● The frequent updates immediately give an insight into the performance of the model and the rate of improvement.
● This variant of gradient descent may be the simplest to understand and implement, especially for beginners.
● The increased model update frequency can result in faster learning on some problems.
● The noisy update process can allow the model to avoid local minima (e.g. premature convergence).
Downsides
● Updating the model so frequently is more computationally expensive than other configurations of gradient descent,
taking significantly longer to train models on large datasets.
● The frequent updates can result in a noisy gradient signal, which may cause the model parameters and in turn the
model error to jump around (have a higher variance over training epochs).
● The noisy learning process down the error gradient can also make it hard for the algorithm to settle on an error
minimum for the model.
Mini Batch Gradient Descent
Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into
small batches that are used to calculate model error and update model coefficients. Mini-batch gradient
descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of
batch gradient descent. It is the most common implementation of gradient descent used in the field of deep
learning.
Upsides
● The model update frequency is higher than batch gradient descent which allows for a more robust
convergence, avoiding local minima.
● The batched updates provide a computationally more efficient process than stochastic gradient descent.
● The batching allows both the efficiency of not having all training data in memory and algorithm
implementations.
Downsides
● Mini-batch requires the configuration of an additional “mini-batch size” hyperparameter for the learning
algorithm.
● Error information must be accumulated across mini-batches of training examples like batch gradient
descent
How to choose the best value for Mini Batch size
● Mini batch size is a hyperparameter and it is a good idea to experiment with a range of values
to get the best fit for your model.
● Mini-batch sizes, commonly called “batch sizes” for brevity, are often tuned to an aspect of the
computational architecture on which the implementation is being executed. Such as a power of
two that fits the memory requirements of the GPU or CPU hardware like 32, 64, 128, 256, and
so on.
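A minimal NumPy sketch of mini-batch gradient descent for linear regression (the data, learning rate and batch size are illustrative); setting batch_size to the dataset size gives batch gradient descent, and batch_size = 1 gives stochastic gradient descent:

    import numpy as np

    def minibatch_gd(X, y, lr=0.1, batch_size=32, epochs=50):
        # Mini-batch gradient descent with squared loss.
        rng = np.random.default_rng(0)
        w, b = np.zeros(X.shape[1]), 0.0
        n = len(y)
        for _ in range(epochs):
            idx = rng.permutation(n)                    # shuffle once per epoch
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]   # one mini-batch
                Xb, yb = X[batch], y[batch]
                err = Xb @ w + b - yb
                w -= lr * (Xb.T @ err) / len(batch)     # gradient of the mean squared error
                b -= lr * err.mean()
        return w, b

    X = np.random.default_rng(1).normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.3
    print(minibatch_gd(X, y))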
Convergence in different versions of Gradient Descent
Batch gradient descent takes small steps
towards the minima and will converge to a
minima.
While Stochastic gradient descent is noisy and
oscillates near the minima . It never actually
converges to the minima.
Exponentially Weighted Averages
The Exponentially Weighted Moving Average (EWMA) is commonly used as a smoothing technique in time
series. However, due to several computational advantages (fast, low-memory cost), the EWMA is behind the
scenes of many optimization algorithms in deep learning, including Gradient Descent with Momentum,
RMSprop, Adam, etc.
Let’s understand with an example: say we have some data for temperatures across multiple days in a
city and we wish to approximate the next day’s temperature from this data. Let the estimated
temperature be Vt, let the previous estimate be Vt-1, let Ot be the temperature for day t and let β be a
hyperparameter. Then:

    Vt = β · Vt-1 + (1 − β) · Ot
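A tiny NumPy sketch of this recurrence (the temperature values are illustrative):

    import numpy as np

    def ewma(values, beta=0.9):
        # V_t = beta * V_{t-1} + (1 - beta) * O_t, starting from V_0 = 0
        v, out = 0.0, []
        for o in values:
            v = beta * v + (1 - beta) * o
            out.append(v)
        return np.array(out)

    temps = np.array([30, 32, 31, 35, 33, 36, 34], dtype=float)   # illustrative daily temperatures
    print(np.round(ewma(temps, beta=0.9), 2))   # smoothed estimates, averaging over roughly 1/(1-beta) days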
This can slow down the progress of the search, especially for those optimization problems where the broader
trend or shape of the search space is more useful than specific gradients along the way.
One approach to this problem is to add history to the parameter update equation based on the gradient
encountered in the previous updates
Gradient Descent with Momentum
1. First we calculate dW (where W is the weight matrix) for the current mini-batch
2. Then we compute

    VdW = β · VdW + (1 − β) · dW      and update      W := W − α · VdW

Using a very high value for β we can dampen out the oscillations which arise in gradient
descent!
Gradient Descent with Momentum
Upsides:
1. Momentum has the effect of damping down the change in the gradient and, in turn, the step size with each
new point in the search space.
2. Momentum is most useful in optimization problems where the objective function has a large amount of
curvature (e.g. changes a lot), meaning that the gradient may change a lot over relatively small regions of the
search space.
3. It is also helpful when the gradient is estimated, such as from a simulation, and may be noisy, e.g. when the
gradient has a high variance.
4. Finally, momentum is helpful when the search space is flat or nearly flat, e.g. zero gradient. The momentum
allows the search to progress in the same direction as before the flat spot and helpfully cross the flat region.
Downsides :
1. It can overshoot the global minimum and converge to a local minimum instead.
2. Another disadvantage is that the momentum term can cause the optimization process to oscillate around the
global minimum.
Root Mean Square Propagation (RMSProp)
We know that while implementing gradient descent we end up with lots of oscillations
Let the horizontal direction be w and the vertical direction be b. In this case we wish to
speed up learning in the w direction and slow down learning in the b direction.
for iteration t:
1. Calculate the derivatives dW and db for the current mini-batch
2. Calculate an exponentially weighted average of the squares of the derivatives:

    SdW = β · SdW + (1 − β) · dW²        Sdb = β · Sdb + (1 − β) · db²

3. Update the parameters:

    W := W − α · dW / (√SdW + ε)         b := b − α · db / (√Sdb + ε)

● Thus by dividing by a small SdW we can increase the magnitude of the update in the
W direction, while dividing by a large Sdb dampens the oscillations in
the b direction.
Note: if SdW is very close to zero, then in order to avoid the W term becoming very large we add a small term ε to the denominator.
RMS Prop
Upsides:
1. Faster convergence: RMSprop can converge faster than SGD by adapting the learning rate based on the
magnitude of the gradients for each parameter.
2. Robustness to different learning rates: RMSprop is more robust to different learning rates for different
parameters, which can be helpful in deep learning models with many parameters.
3. Adaptive learning rate: RMSprop adaptively scales the learning rate based on the history of the squared
gradients, which can improve the convergence speed and stability of the optimization process
Adam optimizer
Configuration parameters:
● alpha. Also referred to as the learning rate or step size. The proportion that weights are updated (e.g.
0.001). Larger values (e.g. 0.3) results in faster initial learning before the rate is updated. Smaller values
(e.g. 1.0E-5) slow learning right down during training
● beta1. The exponential decay rate for the first moment estimates (e.g. 0.9).
● beta2. The exponential decay rate for the second-moment estimates (e.g. 0.999). This value should be
set close to 1.0 on problems with a sparse gradient (e.g. NLP and computer vision problems).
● epsilon. Is a very small number to prevent any division by zero in the implementation (e.g. 10E-8).
Adam optimizer: Advantages
1. Adaptive Learning Rates: Unlike fixed learning rate methods like SGD, Adam optimization
provides adaptive learning rates for each parameter based on the history of gradients. This allows
the optimizer to converge faster and more accurately, especially in high-dimensional parameter
spaces.
2. Momentum: Adam optimization uses momentum to smooth out fluctuations in the optimization
process, which can help the optimizer avoid local minima and saddle points.
3. Bias Correction: Adam optimization applies bias correction to the first and second moment
estimates to ensure that they are unbiased estimates of the true values.
4. Robustness: Adam optimization is relatively robust to hyperparameter choices and works well
across a wide range of deep learning architectures.
Adam optimizer
Disadvantages :
Memory intensive: Adam needs to store moving averages of past gradients for each parameter during training
and hence it requires more memory than some other optimization algorithms, particularly when dealing with very
large neural networks or extensive datasets.
Slower convergence in some cases: While Adam usually converges quickly, it might converge to flawed
solutions in some cases or tasks. In such scenarios, other optimization algorithms like SGD (stochastic gradient
descent) with momentum or Nesterov accelerated gradient (NAG) may perform better.
Hyperparameter sensitivity: Although Adam is less sensitive to hyperparameter choices than some other
algorithms, it still has hyperparameters like the learning rate, beta1, and beta2. Choosing inappropriate values
for these hyperparameters could impact the performance of the algorithm.
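A minimal NumPy sketch of the three update rules discussed above for a single parameter (the learning rate, β values and toy gradient are illustrative):

    import numpy as np

    def momentum_step(w, dw, v, lr=0.01, beta=0.9):
        v = beta * v + (1 - beta) * dw            # exponentially weighted average of gradients
        return w - lr * v, v

    def rmsprop_step(w, dw, s, lr=0.01, beta2=0.999, eps=1e-8):
        s = beta2 * s + (1 - beta2) * dw**2       # average of squared gradients
        return w - lr * dw / (np.sqrt(s) + eps), s

    def adam_step(w, dw, v, s, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
        v = beta1 * v + (1 - beta1) * dw          # first moment (momentum)
        s = beta2 * s + (1 - beta2) * dw**2       # second moment (RMSprop-style)
        v_hat = v / (1 - beta1**t)                # bias correction
        s_hat = s / (1 - beta2**t)
        return w - lr * v_hat / (np.sqrt(s_hat) + eps), v, s

    # one toy update of a single weight with gradient dw
    w, dw = 1.0, 0.5
    print(momentum_step(w, dw, v=0.0))
    print(rmsprop_step(w, dw, s=0.0))
    print(adam_step(w, dw, v=0.0, s=0.0, t=1))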
Learning Rate Decay
Learning rate decay is a technique used in machine learning models to train modern neural networks. It
involves starting with a large learning rate and then gradually reducing it until local minima is obtained.
Say, we are implementing mini batch gradient descent with a relatively small batch size
and the gradient takes noisy steps as shown below. It also does not converge at the
minima but oscillates around the minimum value.
In such a situation slowly reducing the learning rate (α) value as we approach the
minima could be advantageous as the final value would oscillate closer to the minima if
small steps are taken .
This could be done by setting:
1. Exponential decay: α = k^(epoch_num) · α0 (for some constant k < 1, e.g. α = 0.95^(epoch_num) · α0)
2. Epoch-number-based decay: α = α0 / (1 + decay_rate · epoch_num)
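A tiny sketch of these two schedules (the initial learning rate and decay constants are illustrative):

    def decayed_lr(alpha0, epoch, decay_rate=1.0, k=0.95, mode="epoch"):
        # Two common schedules: epoch-number-based decay and exponential decay.
        if mode == "epoch":
            return alpha0 / (1 + decay_rate * epoch)   # shrinks as 1/(1 + decay_rate * epoch)
        return alpha0 * (k ** epoch)                   # exponential decay

    for epoch in range(5):
        print(epoch,
              round(decayed_lr(0.2, epoch, mode="epoch"), 4),
              round(decayed_lr(0.2, epoch, mode="exp"), 4))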
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Ack: Divija L,
Teaching Assistant
Convolutional Neural Network(CNN):
Introduction,Filters, Feature Maps
UE22CS645BC2-TDL-Lecture -CNN
Introduction To Computer Vision(CV)
● Imagine in the CIFAR dataset the images are only of size 32 x 32 pixels and have 3 color channels.
In this case a single fully connected neuron in the first hidden layer of this neural network would have
32 x 32 x 3 = 3072 weights (which is still manageable).
● However, fully connected networks tend to perform worse on such data and aren’t good at feature extraction.
UE22CS645BC2-TDL-Lecture -CNN
Why Convolution Neural Networks(ConvNet / CNN) ?
Convolution is a mathematical operation that allows the merging of two sets of information.
Convolution is applied to the input data to filter the information and produce a feature map.
● A simple CNN is a sequence of layers, and every layer of a CNN transforms one volume of
activations to another through a differentiable function. Unlike Fully connected NNs, a neuron
in a convolutional layer is not connected to the entire input but just some section of the input
data.
● The architecture performs a better fitting to the image dataset due to the reduction in the
number of parameters involved. ConvNet leverage spatial information and are therefore very
well suited for classifying images.
UE22CS645BC2-TDL-Lecture -CNN
Difference between Fully connected NN vs CNN
Spatial Hierarchies:
  Fully connected NN - do not explicitly capture spatial hierarchies
  CNN - capture spatial hierarchies through convolutional and pooling layers
Feature Extraction:
  Fully connected NN - may struggle with capturing local patterns and spatial relationships
  CNN - specialized for feature extraction from local regions, suitable for images
UE22CS645BC2-TDL-Lecture -CNN
Idea of how CNNs Function
● The CNN is composed of image processing layers that deduce and pass down information from one
layer to the next.
● At each layer, information of different abstractions is deduced.
● Generally, in the earlier layers, simpler and more basic ideas will be deduced while later layers will use the gathered
information from the previous layers to deduce more complex ideas.
● The following figure lays down an example CNN - in the figure, a boat image is passed from the left end layer to the next
until it reaches the right-most layer, where it classifies the class of the image.
● Due to the way the layers learn more and more complex features the deeper we go into the network, we can call
this a “Hierarchical Learning Structure”.
● First, we see single pixels; then from them, we recognize simple geometric forms. And then more sophisticated elements
such as objects, faces, human bodies, animals, and so on.
● Individual neurons respond to stimuli only in a restricted region of the visual field known as the Receptive Field.
● As we proceed through the visual pathway, the features learned become more complex, just as in the CNN. The receptive
visual field size increases as well, as larger receptive field suggests a more holistic and general feature in the image.
UE22CS645BC2-TDL-Lecture -CNN
Advantages of CNN:
1. Local Receptive Fields:
● In images, there are local patterns and structures that are important for recognizing objects. CNNs use local
receptive fields (small regions of the input data) to capture these patterns that focuses on specific features in the
visual data.
3. Parameter Sharing:
● CNNs use shared weights and biases across different parts of the input data that allows them to learn a set of
filters that can be applied across the entire input, making them particularly effective for detecting patterns and
features in various locations of an image. This parameter sharing significantly reduces the number of
parameters compared to fully connected networks, making CNN's computationally efficient.
UE22CS645BC2-TDL-Lecture -CNN
Advantages of CNN:
4. Parameter Efficiency:
● CNNs are more parameter-efficient than fully connected networks, which is crucial when dealing with large
images. The use of shared weights and local receptive fields allows CNNs to learn from fewer parameters while
retaining the ability to capture important features.
5. Translation Invariance:
● CNNs can recognize patterns regardless of their position in the input space. This is achieved through the use of
pooling layers that down-sample the spatial dimensions, making the network less sensitive to small changes in
the position of features.
UE22CS645BC2-TDL-Lecture -CNN
Input Image
• The RGB image in the figure has been separated by its three color planes — Red, Green, and Blue.
• There are a number of such color spaces in which images exist — Grayscale, RGB, HSV, CMYK, etc.
• To encode the local structure a submatrix of adjacent input neurons is connected into one
single hidden neuron belonging to the next layer. That single hidden neuron represents
one local receptive field. This operation is named convolution.
• Unlike fully-connected layers, where each output unit gathers information from all of the
inputs, the activation of a convolution output cell is determined by the inputs in its
receptive field.
UE22CS645BC2-TDL-Lecture -CNN
Local Receptive Fields
• This principle works best for hierarchically structured data such as images.
• For e.g., suppose that the size of each single submatrix is 5 x 5 and that
those submatrices are used with MNIST images of 28 x 28 pixels.
• Then we will be able to generate a 24 x 24 feature map of local-receptive-field
neurons in the next hidden layer (28 − 5 + 1 = 24).
• It is possible to slide the submatrices by only 23 positions before touching
the borders of the images.
• There can be multiple feature maps that learn independently from each
hidden layer.
UE22CS645BC2-TDL-Lecture -CNN
CNN in a nutshell
● The first layer is responsible for detecting lines, edges, changes in brightness, and other simple features.
● The information is then passed onto the next layer, which combines simple features to build detectors
that can identify simple shapes.
● The process continues in the next layer and the next, becoming more and more abstract with every layer.
● The deeper layer will be able to extract high-order features such as shapes or specific objects.
● The last layers of the network will integrate all of those complex features and produce classification
predictions.
● The predicted value will be compared to the correct output, where those that are wrongly classified will
cause a large error gap and will cause the learning process to backpropagate to make changes to the
parameter in order to give out a more accurate outcome.
● The network goes back and forth, correcting itself until the satisfying output is achieved (where the error
is minimised).
UE22CS645BC2-TDL-Lecture -CNN
CNN in a nutshell
UE22CS645BC2-TDL-Lecture -CNN
Filters / Kernels
● A kernel in a convolution is an n x n matrix of numbers.
● They are designed to detect and highlight specific patterns or features in the input data.
● The CNN convolution can have multiple filters, highlighting different features, which results in multiple output feature maps
(one for each filter).
● It can also gather input from multiple feature maps, for e.g,
the output of a previous convolution.
● The combination of feature maps (input or output) is called a
volume. In this context, we can also refer to the feature maps as
slices.
● For example, in image processing, filters might be designed to
detect edges, corners, or textures.
● Each number inside the kernel matrix is multiplied with the value
for each pixel it lines up with, and then all the elements of the
Kernel are summed to output one single value. Then, the kernel slides over
to a new position in the image and the process is repeated.
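A minimal NumPy sketch of this multiply-and-sum sliding operation (the toy image and the edge-detecting kernel are illustrative):

    import numpy as np

    def conv2d(image, kernel, stride=1):
        # Valid 2D convolution (cross-correlation, as used in CNNs): slide the kernel
        # over the image, multiply element-wise and sum to one output value per position.
        kh, kw = kernel.shape
        oh = (image.shape[0] - kh) // stride + 1
        ow = (image.shape[1] - kw) // stride + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
                out[i, j] = np.sum(patch * kernel)
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
    edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)     # simple vertical-edge detector
    print(conv2d(image, edge_kernel))                  # 3x3 feature map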
UE22CS645BC2-TDL-Lecture -CNN
Filter Hyperparameters
1. DIMENSIONS OF A FILTER:
● Importance of strides - The choice of stride affects the model in several ways.
UE22CS645BC2-TDL-Lecture -CNN
Filter Hyperparameters
2. STRIDE:
● Importance of strides -
~ Output Size: Larger stride results in a smaller output spatial dimension because filter covers large
area of the input image with each step, thus reducing the number of positions it can occupy.
~ Computational Efficiency: Increasing the stride can decrease the computational load.Since the filter
moves more pixels per step, it performs fewer operations, which can speed up the training and
inference processes.
~ Field of View: Higher stride in the convolutional operation considers a broader area of the input
image per step, aiding in capturing global features over finer details.
~ Downsampling: Increasing strides can be used as an alternative to max pooling layers for
downsampling the input.
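As a quick check of these effects, the spatial output size of a convolution is (N + 2P − F)/S + 1 for input size N, filter size F, stride S and padding P (padding is introduced on the next slide); a tiny sketch (the sizes are illustrative, apart from the 28/5 MNIST case used earlier):

    def conv_output_size(n, f, stride=1, padding=0):
        # Spatial output size of a convolution: floor((N + 2P - F) / S) + 1
        return (n + 2 * padding - f) // stride + 1

    print(conv_output_size(28, 5, stride=1))   # 24: the MNIST example above
    print(conv_output_size(7, 3, stride=1))    # 5
    print(conv_output_size(7, 3, stride=2))    # 3: larger stride, smaller output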
UE22CS645BC2-TDL-Lecture -CNN
Filter Hyperparameters
3. ZERO-PADDING:
● Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can
either be manually specified or automatically set.
● It has 3 modes: valid, same and full padding.
Valid padding (purpose):
● No padding
● Drops the last convolution if dimensions do not match
UE22CS645BC2-TDL-Lecture -CNN
Filter Hyperparameters
3. ZERO-PADDING:
Same padding (purpose):
● Padding such that the output feature map has the same spatial size as the input (size ⌈I / S⌉)
Full padding (purpose):
● Maximum padding, such that end convolutions are applied on the
limits of the input.
Finally !!
TOPICS IN DEEP LEARNING
[Figure: input image, kernel, and output feature map]
UE22CS645BC2-TDL-Lecture -CNN
Effects of Convolutions on Images
[Figure: input image, kernel, and output feature map]
UE22CS645BC2-TDL-Lecture -CNN
Effects of Convolutions on Images
[Figure: input image, kernel, and output feature map]
Note how all the edges separating different colours and brightness levels have been
identified
UE22CS645BC2-TDL-Lecture -CNN
Effects of Convolutions on Images - Eg: 2
UE22CS645BC2-TDL-Lecture -CNN
Effects of Convolutions on Images
In the 1D case, we slide a one dimensional filter over a one dimensional input
UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 2D
In the 2D case, we slide a two dimensional filter over a two dimensional input
UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 3D
Suppose we have an RGB image and we want to convolve it with the following 3D filter. As we can see, the depth of our filter consists of three 2D filters. Let's assume that our RGB image is 5 by 5 pixels.
We will add zero padding to each of these arrays in order to avoid losing information when
performing the convolution.
UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 3D
The convolutions will be carried out in exactly the same way as for grayscale images.
The only difference is that now we have to perform three convolutions instead of 1
UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 3D
Take each
corresponding
pixel and filter
value, multiply
them together,
and sum the
whole thing
up.
UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 3D
UE22CS645BC2-TDL-Lecture -CNN
Convolutions in 3D
To get the full convoluted output, we just perform the same operations for all the other pixels in
each of our color channels
UE22CS645BC2-TDL-Lecture 4- Loss Function
Acknowledgements & References
http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture5.pdf
https://towardsdatascience.com/cnn-part-i-9ec412a14cb1
https://msail.github.io/previous_material/cnn/
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Ack: Divija L,
Teaching Assistant
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Ack: Divya K,
Teaching Assistant
TOPICS IN DEEP LEARNING
Overview of lecture
Let us build a neural network for this task, inspired by a classic neural network called LeNet-5:
• We may notice that as we go deeper into the neural network, the height and width
decreases, while the number of channels increase.
• CNNs usually comprise a combination of one or more convolution layers followed by
pooling layers, and finally a few fully connected layers.
TOPICS IN DEEP LEARNING
CNN Example
We may notice:
• Pooling layers don’t have any parameters
• Convolution layers have relatively few parameters
• Fully connected layers have the most parameters
• Activation size decreases gradually as we go deeper in the neural network
TOPICS IN DEEP LEARNING
Conclusion
https://www.deeplearning.ai/courses/deep-learning-specialization/
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Ack: Divya K,
Teaching Assistant
TOPICS IN DEEP LEARNING
Overview of lecture
• Recap: Backpropagation
• Convolution Forward Pass
• Convolution Backward Pass
➢ Calculation of loss gradient w.r.t filter
➢ Calculation of loss gradient w.r.t bias
➢ Calculation of loss gradient w.r.t input
➢ Backpropagation calculates the gradient of the loss function with respect to each
parameter in the network and updates the network parameters in such a way that it
minimizes the loss function.
➢ In CNNs, the loss gradient is computed w.r.t. the filter, w.r.t. the bias, and also w.r.t. the input.
TOPICS IN DEEP LEARNING
Convolution Forward Pass
Convolution between an input X and a filter F gives us an output O. This can be represented as:
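The equation on this slide was shown as an image. For the 3x3 input X, 2x2 filter F and bias B used in the following slides (stride 1), it works out to (my reconstruction, consistent with the local gradients derived later):

O₁₁ = X₁₁F₁₁ + X₁₂F₁₂ + X₂₁F₂₁ + X₂₂F₂₂ + B
O₁₂ = X₁₂F₁₁ + X₁₃F₁₂ + X₂₂F₂₁ + X₂₃F₂₂ + B
O₂₁ = X₂₁F₁₁ + X₂₂F₁₂ + X₃₁F₂₁ + X₃₂F₂₂ + B
O₂₂ = X₂₂F₁₁ + X₂₃F₁₂ + X₃₂F₂₁ + X₃₃F₂₂ + B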
TOPICS IN DEEP LEARNING
Convolution Forward Pass
We can use the chain rule to obtain the gradient w.r.t the filter as shown in the equation.
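The chain rule equation referred to here was an image; written out in the slide's notation it is:

∂L/∂F_mn = Σ over all (i, j) of (∂L/∂O_ij) · (∂O_ij/∂F_mn)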
TOPICS IN DEEP LEARNING
Convolution Backward Pass
To calculate ∂O/∂F:
Since each output element is a weighted sum of input elements, the local gradients of the outputs w.r.t. the filter are simply the corresponding input values:

∂O₁₁/∂F₁₁ = X₁₁,  ∂O₁₁/∂F₁₂ = X₁₂,  ∂O₁₁/∂F₂₁ = X₂₁,  ∂O₁₁/∂F₂₂ = X₂₂
∂O₁₂/∂F₁₁ = X₁₂,  ∂O₁₂/∂F₁₂ = X₁₃,  ∂O₁₂/∂F₂₁ = X₂₂,  ∂O₁₂/∂F₂₂ = X₂₃
∂O₂₁/∂F₁₁ = X₂₁,  ∂O₂₁/∂F₁₂ = X₂₂,  ∂O₂₁/∂F₂₁ = X₃₁,  ∂O₂₁/∂F₂₂ = X₃₂
∂O₂₂/∂F₁₁ = X₂₂,  ∂O₂₂/∂F₁₂ = X₂₃,  ∂O₂₂/∂F₂₁ = X₃₂,  ∂O₂₂/∂F₂₂ = X₃₃
TOPICS IN DEEP LEARNING
Convolution Backward Pass
Substituting these local gradients into the chain rule, the loss gradient w.r.t. the filter is itself a convolution:

∂L/∂F = conv(X, ∂L/∂O)
TOPICS IN DEEP LEARNING
Convolution Backward Pass
Since the bias B is added to every output element, the local gradients of the outputs w.r.t. the bias are all 1:

∂O₁₁/∂B = 1,  ∂O₁₂/∂B = 1,  ∂O₂₁/∂B = 1,  ∂O₂₂/∂B = 1
TOPICS IN DEEP LEARNING
Convolution Backward Pass
Applying the chain rule with these local gradients, the loss gradient w.r.t. the bias is:

∂L/∂B = ∂L/∂O₁₁ + ∂L/∂O₁₂ + ∂L/∂O₂₁ + ∂L/∂O₂₂ = sum(∂L/∂O)
TOPICS IN DEEP LEARNING
Convolution Backward Pass
We can use the chain rule to obtain the gradient w.r.t the input as shown in the equation.
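The chain rule equation referred to here was an image; written out in the slide's notation it is:

∂L/∂X_ij = Σ over all (m, n) of (∂L/∂O_mn) · (∂O_mn/∂X_ij)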
TOPICS IN DEEP LEARNING
Convolution Backward Pass
On expanding the chain rule summation and substituting the values of the local gradients,
we get:
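The expanded result was shown as an image. As an example of what the substitution yields (using the forward-pass equations above), for the corner and centre input pixels:

∂L/∂X₁₁ = (∂L/∂O₁₁)·F₁₁
∂L/∂X₂₂ = (∂L/∂O₁₁)·F₂₂ + (∂L/∂O₁₂)·F₂₁ + (∂L/∂O₂₁)·F₁₂ + (∂L/∂O₂₂)·F₁₁

This pattern of flipped filter coefficients is exactly what the "full convolution with the 180° flipped F" on the next slides captures.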
TOPICS IN DEEP LEARNING
Convolution Backward Pass
❑ The term "full convolution" is often used interchangeably with "convolution with zero-padding."
In the context of convolutional neural networks (CNNs), "full convolution" typically means performing the
convolution with enough zero-padding that the filter is applied at every position where it overlaps the input.
Here, it is applied to the padded loss gradient ∂L/∂O.
TOPICS IN DEEP LEARNING
Convolution Backward Pass
∂L/∂X = full-conv(180° flipped F, ∂L/∂O)
OR
∂L/∂X = conv(180° flipped F, padded(∂L/∂O))
TOPICS IN DEEP LEARNING
Conclusion: Backpropagation in Convolution Layer
Note: This derivation is for a single filter F with stride = 1; real CNNs have many more filters. The backpropagation operation with stride > 1 reduces to the stride = 1 case once the loss gradient ∂L/∂O is dilated with zeros between its elements.
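As a consolidated illustration (my own sketch, not part of the slides), the forward and backward passes derived above can be written in a few lines of NumPy/SciPy for a single 2D filter with stride 1; scipy.signal.correlate2d plays the role of conv (CNN-style cross-correlation) and scipy.signal.convolve2d the role of the full convolution with the flipped filter.

import numpy as np
from scipy.signal import correlate2d, convolve2d

# Toy example: 3x3 input, 2x2 filter, scalar bias (single channel, stride 1).
X = np.arange(9, dtype=float).reshape(3, 3)
F = np.array([[1.0, 0.0], [0.0, -1.0]])
B = 0.5

# Forward pass: "convolution" in CNNs is cross-correlation.
O = correlate2d(X, F, mode="valid") + B      # shape (2, 2)

# Assume the upstream loss gradient w.r.t. O is all ones, purely for illustration.
dO = np.ones_like(O)

dF = correlate2d(X, dO, mode="valid")        # ∂L/∂F = conv(X, ∂L/∂O)
dB = dO.sum()                                # ∂L/∂B = sum(∂L/∂O)
dX = convolve2d(dO, F, mode="full")          # ∂L/∂X = full-conv(180° flipped F, ∂L/∂O)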
TOPICS IN DEEP LEARNING
Acknowledgements
https://youtu.be/Pn7RK7tofPg?si=bcY3RxuXhOa_GZOg
https://deeplearning.cs.cmu.edu/F21/document/recitation/Recitation5/CNN_Backprop_Recitation_5_F21.pdf
https://pavisj.medium.com/convolutions-and-backpropagations-46026a8f5d2c
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Ack: Divya K,
Teaching Assistant
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
● Waibel et al. [1] proposed a time-delay neural network for speech recognition,
which can be viewed as a 1-D CNN.
● Then, Zhang [2] proposed the first 2-D CNN—shift-invariant ANN.
● LeCun et al. [3] also constructed a CNN for handwritten zip code recognition
and first used the term “convolution,” which is the original version of LeNet.
A Timeline of CNN Architectures ….
LeNet
● LeNet5[3] architecture was introduced in 1998. Due to its historical importance, it is known as the first CNN
model.
● It also performed well on the MNIST handwritten digit recognition task.
● LeNet model, which is usually composed of 5 layers, accepts grayscale images with 32 x 32 x 1 as input.
● The inputs are transferred to the Conv layer and then to sub-sampling.
● Afterward, there are other Conv layers, followed by a pooling layer, and at the end of the architecture, FC
layers including the output at the last layer are defined.
● This model was the first CNN architecture to reduce the number of parameters and automatically learn
features from raw pixels. It was introduced to recognize handwritten digit patterns and postal codes in post offices.
AlexNet
● AlexNet [4] was the pioneering deep architecture, with a top-5 test
accuracy of 84.6% on ImageNet data (15 million labeled high-resolution
images in over 22,000 categories).
● It utilized those data augmentation methods that are comprised of image
translation, patch extractions(random cropping), colour jittering and
horizontal reflection.
● This CNN model implements dropout layers with a particular end goal to
battle the issue of overfitting to the training data.
● It is trained by batch stochastic gradient descent, along with particular
values for weight decay and momentum. The activation function used
was ReLU.
● They train on the ImageNet 2011 dataset and fine-tune on the ImageNet 2012 dataset.
● They introduced ideas like local response normalization, which are quite out of practice today as batch
normalization is preferred.
● It was trained on two GTX 580 GPUs for 5–6 days and contains five convolutional layers, one max pooling,
ReLU as non-linearities, three FC layers and dropout.
Please note: these things are pretty common to do today, but when the AlexNet paper came out they were not the norm.
The creators of AlexNet pioneered the idea of Group convolution to train on multiple GPUs, let’s see how ..
What is Group Convolution?
Usually convolution filters are applied on an image layer by layer, as we have seen so far with LeNet. However, to
learn more features and train deeper networks, we could create two models that train and backpropagate in
parallel. This method of using different sets of convolution filter groups on the same image is called "grouped
convolution".
As we have learnt in previous courses, parallelism in training comes in two forms:
1. Data parallelism: we split the dataset into chunks and then train on each chunk. Intuitively, each
chunk can be understood as a mini-batch used in mini-batch gradient descent. The smaller the chunks, the more
data parallelism we can squeeze out of it. However, smaller chunks also mean that training behaves more like
stochastic than batch gradient descent, which can result in slower and sometimes poorer convergence.
2. Model parallelism: here we try to parallelize the model itself so that we can take in as much data as
possible. Grouped convolutions enable efficient model parallelism, so much so that AlexNet was trained on
GPUs with only 3GB RAM.
● This parallelization scheme puts half the neurons on each GPU. However, the GPUs communicate only in
certain layers; for example, the kernels of layer 3 take input from all kernel maps in layer 2, whereas kernels in
layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU.
● Choosing the pattern of
connectivity is a problem
for cross-validation, but
this allowed them to
precisely tune the
amount of
communication until it is
an acceptable fraction of
the amount of
computation.
Group Convolution in AlexNet
AlexNet Architecture
ZFNet
● The paper Visualizing and Understanding Convolutional Networks[5] introduces the notion of Deconvnet
which enables us to visualize each layer.
● By visualizing each layer, we can get more insight into what the model is learning and thus make
adjustments to optimize it further.
● That's how ZFNet was created: an AlexNet fine-tuned version based on visualization results.
Fun fact, this meme was referenced in the first inception net paper.
What is an 1x1 convolution?
● A 1x1 convolution is nothing but the element-wise product of the 32 numbers along the channel
dimension of the input and the 32 numbers in each filter, summed up, with a ReLU nonlinearity applied to the result.
● If we have F filters, the result has dimensions 6 x 6 x F, as indicated in the image.
● Coming back to the 1 filter case it can be visualised as a single node taking in the weighted sum of 32
inputs and applying a ReLU nonlinearity to it.
● Thus using F filters is equivalent to a fully connected layer with 32 nodes connected to F nodes in the next
layer.
● This idea is also known as "Network in Network", which was used by the researchers in the first Inception net
paper.
SOLUTION:
A. Factorize the 5x5 convolution into two 3x3 convolution operations to improve computational speed.
A 5x5 convolution uses 25 multiplications per output location versus 9 for a 3x3, so it is 25/9 ≈ 2.78 times
more expensive than a single 3x3 convolution, while two stacked 3x3 convolutions need only 2 x 9 = 18.
Stacking two 3x3 convolutions therefore in fact leads to a boost in performance.
B. Moreover, they factorize convolutions of filter size nxn into a combination of 1xn and nx1
convolutions. For example, a 3x3 convolution is equivalent to first performing a 1x3
convolution, and then performing a 3x1 convolution on its output. They found this method to
be 33% cheaper than the single 3x3 convolution.
C. The filter banks in the module were expanded (made wider instead of deeper) to remove the
representational bottleneck.
Inception v2
(Figure: the factorized Inception modules A, B and C described above.)
Inception v3
PREMISE:
● The authors noted that the auxiliary classifiers didn’t contribute much until near the end of the
training process, when accuracies were nearing saturation. They argued that they function as
regularizers, especially if they have BatchNorm or Dropout operations.
● Possibilities to improve on the Inception v2 without drastically changing the modules were to be
investigated.
SOLUTION: Inception Net v3 incorporated all of the above upgrades stated for Inception v2, and in
addition used the following:
1. RMSProp Optimizer.
2. Factorized 7x7 convolutions.
3. BatchNorm in the Auxiliary Classifiers.
4. Label Smoothing (A type of regularizing component added to the loss formula that prevents the
network from becoming too confident about a class. Prevents overfitting).
Inception v4
PREMISE: Make the modules more uniform. The authors also noticed that some of the modules were
more complicated than necessary. This can enable us to boost performance by adding more of these
uniform modules.
SOLUTION: The “stem” of Inception v4 was modified. The stem here, refers to the initial set of
operations performed before introducing the Inception blocks.
Stem in v1 Stem in v4
ResNet
After the AlexNet model was introduced, researchers started building bigger CNNs for increased accuracy.
Thus the question arises: is learning better networks as easy as stacking more layers?
However , when deeper networks are able to start converging, a degradation problem was exposed:
with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then
degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers
to a suitably deep model leads to higher training error. Thus, in the following example, a 56-layer network
has much higher test error than the 20-layer network.
Residual connections
Thus, to let a neural network learn the identity function rather than pass the data through more weight layers,
the authors proposed adding skip connections in the network.
Here x is passed through by default, and F(x) is the required deviation which the model must learn; if there is
nothing to be learnt, then F(x) + x is simply the identity function.
Thus we compare the following network architectures
for ImageNet.
Here we see that training and validation error are higher in the larger network without residual connections.
However, in a network with residual connections we can train a 34-layer network with a much lower error.
For residual addition to work, the input and output after convolution must have the same dimensions.
Hence, we use 1x1 convolutions after the original convolutions, to match the depth sizes
Inception-ResNet
The pooling operation inside the main Inception modules was replaced in favor of the residual
connections. However, you can still find those operations in the reduction blocks. Reduction block A is
the same as that of Inception v4.
Networks with residual units deeper in the architecture caused the network to “die” if the number of
filters exceeded 1000. Hence, to increase stability, the authors scaled the residual activations by a
value around 0.1 to 0.3.
Efficient Net(B0-B7)
The process of scaling up ConvNets has never been well understood, and there are many ways to do it. The
most common way is to scale up one of the following parameters: width, depth or resolution. But scaling up
arbitrarily requires tedious manual tuning and often leads to suboptimal results.
The key contribution of the EfficientNet paper is to show that it is critical to balance width, depth and
image resolution. They show that such balance can simply be achieved by scaling each factor by a
common ratio. This method is called compound scaling.
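The compound scaling rule itself was shown as an equation image on the slide; reproduced here from the EfficientNet paper [12]:

depth:      d = α^φ
width:      w = β^φ
resolution: r = γ^φ
subject to  α · β² · γ² ≈ 2,  with α ≥ 1, β ≥ 1, γ ≥ 1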
where α, β, γ are constants that can be determined by a small grid search. Intuitively,
φ is a user-specified coefficient that controls how many more resources are available
for model scaling, while α, β, γ specify how to assign these extra resources to
network width, depth, and resolution respectively.
Efficient Net B0 (The baseline network)
At the core of Efficient Net, the mobile inverted bottleneck convolution layer is used along with the
squeeze and excitation layers. The following table shows the architecture of Efficient B0:
The creators of EfficientNet started to scale EfficientNet B0 with the
help of their compound scaling method. They applied the grid search
technique to get α = 1.2, β = 1.1, γ = 1.15.
Afterward, they fixed the scaling coefficients and scaled EfficientNet
B0 up to EfficientNet B7.
Conclusion
This lecture aims to summarize the most popular CNN architectures; however, there are a plethora of other
CNN architectures which the students may wish to explore. A comprehensive list is provided in the research
papers listed in references [9] and [10].
Reading the research papers cited in the references section is also encouraged.
References
[1] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,”
IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 3, pp. 328–339, Mar. 1989.
[2] W. Zhang, “Shift-invariant pattern recognition neural network and its optical architecture,” in Proc. Annu. Conf. Jpn. Soc. Appl.
Phys., 1988, p. 734, paper 6P-M-14.
[3] Y. LeCun et al., “Backpropagation applied to handwritten zip code recognition,” Neural Comput., vol. 1, no. 4, pp. 541–551,
Dec. 1989
[4]Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in
Neural Information Processing Systems, pp. 1097–1105 (2012)
[5] M. D. Zeiler and R. Fergus, ‘Visualizing and Understanding Convolutional Networks’, arXiv [cs.CV]. 2013.
[6]Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with
convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
[7]He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 770–778 (2016)
[8] https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202
[9]A. Dhillon and G. K. Verma, “Convolutional neural network: A review of models, methodologies and applications to object
detection,” Prog. Artif. Intell., vol. 9, no. 2, pp. 85–112, Jun. 2020.
[10]Z. Li, F. Liu, W. Yang, S. Peng and J. Zhou, "A Survey of Convolutional Neural Networks: Analysis, Applications, and
Prospects," in IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 12, pp. 6999-7019, Dec. 2022, doi:
10.1109/TNNLS.2021.3084827.
[11] https://youtube.com/playlist?list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF&si=yKZ9uniVHpq2EzMq
[12]Tan, M., & Le, Q. v. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. 36th International
Conference on Machine Learning, ICML 2019, 2019-June.
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
It is widely accepted that over the past two decades, the progress of object detection has
generally gone through two historical periods: the "traditional object detection period
(before 2014)" and the "deep learning based detection period (after 2014)".
Traditional Detectors
Most of the early object detection algorithms were built based on handcrafted features. Due to
the lack of effective image representations at that time, people had to design sophisticated
feature representations and a variety of speed-up techniques. Some of them are:
1. Viola Jones Detectors: In 2001, P. Viola and M. Jones achieved real-time detection of
human faces for the first time without any constraints (e.g., skin color segmentation) [4,5].
→ Compared with VOC and ILSVRC, the biggest progress of MS-COCO is that
apart from the bounding box annotations, each object is further labeled using per-
instance segmentation to aid in precise localization.
For the standard detection task, the dataset consists of 1,910k images with 15,440k annotated bounding
boxes on 600 object categories.
Source
Metrics for Object Detection
Intersection over Union (IoU)
Intersection over Union indicates the overlap of the predicted bounding box coordinates to the ground
truth box. Higher IoU indicates the predicted bounding box coordinates closely resembles the ground
truth box coordinates.
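A minimal sketch (my own illustration, not from the slides) of computing IoU for two axis-aligned boxes given as (x1, y1, x2, y2):

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two partially overlapping 10x10 boxes: intersection 25, union 175.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ≈ 0.143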
Source
What is mAP?
Mean Average Precision (mAP) is a metric used to evaluate object detection models such as Fast R-CNN,
YOLO, Mask R-CNN, etc. The mean of the average precision (AP) values is calculated over recall values from 0 to 1.
The mAP formula is based on the following sub-metrics: Confusion Matrix, Intersection over Union (IoU), Recall and
Precision.
Source
How to calculate AP?
→ AP@α is Area Under the Precision-Recall Curve(AUC-PR) evaluated at α IoU threshold.
→ Notation: AP@α or APα means that AP is evaluated at the α IoU threshold. If you see
metrics like AP50 and AP75, they mean AP calculated at IoU=0.5 and IoU=0.75, respectively.
→ A high Area Under the PR Curve means high recall and high precision which is preferred.
→ AP is calculated individually for each class. This means that there are as many AP values as
the number of classes.
→ These AP values are averaged to obtain the mean Average Precision (mAP) metric.
Source
What is mAP?
The mAP is calculated by finding the Average Precision (AP) for each class and then averaging over the number of
classes. The formula for mAP essentially tells us that, for a given class i, we need to calculate its corresponding AP.
The mean of these collated AP scores produces the mAP and tells us how well the model performs.
mAP50: Mean average precision calculated at an intersection over union (IoU) threshold of 0.50. It's a measure of the model's
accuracy considering only the "easy" detections.
mAP50-95: The average of the mean average precision calculated at varying IoU thresholds, ranging from 0.50 to 0.95. It gives a
comprehensive view of the model's performance across different levels of detection difficulty.
The 0.5-IoU mAP has then become the de facto metric for object detection. After 2014, due to the introduction of MS-COCO
datasets, researchers started to pay more attention to the accuracy of object localization. Instead of using a fixed IoU threshold,
MS-COCO AP is averaged over multiple IoU thresholds between 0.5 and 0.95, which encourages more accurate object localization
and may be of great importance for some real world applications
Source
Frames Per Second (fps)
● When it comes to object detection algorithms, processing speed is of paramount importance. The most
common metric that is used to measure the detection speed is the number of frames per second (FPS).
● A high FPS value indicates that the model can process frames quickly, making it suitable for time-sensitive
applications like autonomous vehicles, surveillance systems, robotics, and more.
● On the other hand, a low FPS value implies that the model is slower, which might limit its applicability in
certain real-time scenarios.
● For example, Faster R-CNN operates at only 7 frames per second (FPS), whereas SSD operates at 59
FPS.
● In benchmarking experiments, you will see the authors of a paper stating their network results as: “Network
X achieves mAP of Y% at Z FPS”. Where X is the network name, Y is the mAP percentage, and Z is the
FPS.
Two Stage Detectors vs One Stage Detectors
Two-stage Object Detector
The two-stage object detector divides the whole process into 2 steps:
1. It first extracts the features using a CNN
2. It then extracts a series of regions of interest called object proposals, and the classification and
localization happen only on the object proposals.
Two-stage object detectors are very powerful and extremely accurate, having very high values of mAP.
Hence, they are mostly used in the medical domain, where classification accuracy is more important than
speed. Examples of two-stage object detectors are the R-CNN family, SPP-Net, etc.
Single-stage Object Detector
● In a single-stage object detector, we go directly from the image to classification and bounding box coordinates.
● The images are fed into a feature extractor using a CNN and then the extracted features are directly used
for classification and for regression of the bounding box coordinates.
● Single-stage object detectors are very fast and can be used in real-time object detection, but sometimes
their performance is poorer than that of two-stage object detectors.
● Examples are the YOLO family, SSD, RetinaNet, etc.
Models for Object Detection :
Two Stage Detectors
R-CNN
Region-based Convolutional Neural Network (R-CNN) is a type of deep learning architecture used for object
detection in computer vision tasks. R-CNN was one of the pioneering models that helped advance the object
detection field by combining the power of convolutional neural networks and region-based approaches.
Let's dive deeper into how R-CNN works, step by step.
Source[9]
R-CNN- Region Proposal
R-CNN starts by dividing the input image into multiple regions or subregions. These regions are referred to
as "region proposals" or "region candidates." The region proposal step is responsible for generating a set of
potential regions in the image that are likely to contain objects. R-CNN does not generate these proposals
itself; instead, it relies on external methods like Selective Search or EdgeBoxes to generate region
proposals.
source
R-CNN- SVM for object classification
This stage consists of learning an individual linear SVM (Support Vector Machine) classifier for each class, that
detects the presence or absence of an object belonging to a particular class.
Labels for training: The features of all region proposals that have an IoU overlap of less than 0.3* with the
ground truth bounding box are considered negatives for that class during training. The positives for that class
are simply the features from the ground truth bounding boxes itself. All other proposals (IoU overlap greater
than 0.3, but not a ground truth bounding box) are ignored for the purpose of training the SVM.
*This number 0.3 was found using grid search on a validation set
source
R-CNN- Bounding Box Regression and Non Maximum Suppression
source
R-CNN - Disadvantages
Disadvantages of R-CNN.
● Multi-stage, expensive training: The separate training processes are required for all the stages of the
network i.e fine-tuning a CNN on object proposals, learning an SVM to classify the feature vector of each
proposal from the CNN and learning a bounding box regressor to fine-tune the object proposals proves to
be a burden in terms of time, computation and resources. This multi-stage process can be slow and
resource-demanding.
● Slow Inference: Due to its sequential processing of region proposals, R-CNN is relatively slow during
inference. Real-time applications may find this latency unacceptable. Detection using a simple VGG network
as the backbone CNN takes 47s/image.
● R-CNN is Not End-to-End: Unlike more modern object detection architectures like Faster R-CNN, R-CNN
is not an end-to-end model. It involves separate modules for region proposal and classification, which can
lead to suboptimal performance compared to models that optimize both tasks jointly.
Fast R-CNN
Published in 2015, one year after the SPPNet paper, Fast R-CNN is a popular successor to these models, improving
both the efficiency and performance of R-CNN and SPP networks.
The Fast R-CNN consists of a CNN (usually pre-trained on the ImageNet classification task) with its final
pooling layer replaced by an “ROI pooling” layer and its final FC layer is replaced by two branches — a (K + 1)
category softmax layer branch and a category-specific bounding box regression branch.
Source
Fast R-CNN Architecture
1. The entire image is fed into the backbone CNN and the features from the last convolution layer are
obtained. Depending on the backbone CNN used, the output feature maps are much smaller than the
original image size. This depends on the stride of the backbone CNN, which is usually 16 in the case
of a VGG backbone.
2. Meanwhile, the object proposal windows are obtained from a region proposal algorithm like selective
search.
3. The portion of the backbone feature map that belongs to this window is then fed into the ROI Pooling
layer. The ROI pooling layer is a special case of the spatial pyramid pooling (SPP) layer with just one
pyramid level. The layer basically divides the features from the selected proposal windows (that come
from the region proposal algorithm) into sub-windows of size h/H by w/W and performs a pooling
operation in each of these sub-windows. This gives rise to fixed-size output features of size (H x W)
irrespective of the input size. H and W are chosen such that the output is compatible with the
network’s first fully-connected layer. The chosen values of H and W in the Fast R-CNN paper is 7. Like
regular pooling, ROI pooling is carried out in every channel individually.
4. The output features from the ROI Pooling layer (N x 7 x 7 x 512) where N is the number of proposals)
are then fed into the successive FC layers, and the softmax and BB-regression branches. The softmax
classification branch produces probability values of each ROI belonging to K categories and one
catch-all background category. The BB regression branch output is used to make the bounding boxes
from the region proposal algorithm more precise.
Faster RCNN
In the R-CNN family of papers, the evolution between versions was usually in terms of computational efficiency
(integrating the different training stages), reduction in test time, and improvement in performance (mAP). These
networks usually consist of:
1. A region proposal algorithm to generate “bounding boxes” or locations of possible objects in the image.
2. A feature generation stage to obtain features of these objects, usually using a CNN.
3. A classification layer to predict which class this object belongs to.
4. A regression layer to make the coordinates of the object bounding box more precise.
Both R-CNN and Fast R-CNN use CPU based region proposal algorithms, Eg- the Selective search
algorithm which takes around 2 seconds per image and runs on CPU computation.
The Faster R-CNN [10] paper fixes this by using another convolutional network (the RPN) to generate
the region proposals. This not only brings down the region proposal time from 2s to 10 ms per image but
also allows the region proposal stage to share layers with the following detection stages, causing an
overall improvement in feature representation.
Source
Region Proposal Network (RPN)
1. The region proposal network (RPN)
starts with the input image being fed
into the backbone convolutional neural
network. The input image is first
resized such that its shortest side is
600px with the longer side not
exceeding 1000px.
2. The output features of the backbone
network (indicated by H x W) are
usually much smaller than the input
image depending on the stride of the
backbone network. For both the
possible backbone networks used in
the paper (VGG, ZF-Net) the network
stride is 16.
Source
Region Proposal Network (RPN)
3. For every point in the output feature map, the network has to learn whether an object is present in the
input image at its corresponding location and estimate its size. This is done by placing a set of “Anchors” on
the input image for each location on the output feature map from the backbone network. These anchors
indicate possible objects in various sizes and aspect ratios at this location. The figure below shows 9
possible anchors in 3 different aspect ratios and 3 different sizes placed on the input image for a point A on
the output feature map. For the PASCAL challenge, the anchors used have 3 scales of box area 128², 256²,
512² and 3 aspect ratios of 1:1, 1:2 and 2:1.
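As an illustration of how these 9 anchor shapes arise (my own sketch, not from the slides), the anchor widths and heights can be derived from the 3 areas and 3 aspect ratios:

import itertools
import math

# 3 scales (box areas) and 3 aspect ratios (height:width), as for the PASCAL setting above.
areas = [128 ** 2, 256 ** 2, 512 ** 2]
ratios = [1.0, 2.0, 0.5]

anchors = []
for area, ratio in itertools.product(areas, ratios):
    # For area = w * h and h / w = ratio:  w = sqrt(area / ratio), h = w * ratio.
    w = math.sqrt(area / ratio)
    h = w * ratio
    anchors.append((round(w), round(h)))

print(anchors)  # 9 (width, height) pairs, one anchor shape per scale/ratio combination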
Source
Region Proposal Network (RPN)
5. First, a 3 x 3 convolution with 512 units is
applied to the backbone feature map as
shown in the figure, to give a 512-d feature
map for every location. This is followed by
two sibling layers: a 1 x 1 convolution layer
with 18 units for object classification, and a
1 x 1 convolution with 36 units for bounding
box regression.
7. The 36 units in the regression branch give an output of size (H, W, 36). This output is used to give the 4
regression coefficients of each of the 9 anchors for every point in the backbone feature map (size: H x W).
These regression coefficients are used to improve the coordinates of the anchors that contain objects.
Source
Faster R-CNN
Source
Models for Object Detection :
Single shot Detectors
Single-shot object detection is a type of object detection algorithm that is able to detect objects within an
image or video in a single pass without the need for multiple stages or region proposals.
Single-shot object detectors, such as YOLO (You Only Look Once) and SSD (Single Shot MultiBox
Detector), use a single convolutional neural network (CNN) to directly predict the class labels and bounding
boxes of objects within an image or video. These models are trained end-to-end using a large dataset of
labeled images and their associated object-bounding boxes
YOLO v1
YOLO (You Only Look Once) is a real-time object detection algorithm developed by Joseph Redmon and Ali
Farhadi in 2015. It is a single-stage object detector that uses a convolutional neural network (CNN) to predict the
bounding boxes and class probabilities of objects in input images.
Source
YOLO Algorithm
The basic idea behind YOLO is to divide the input image into a grid of cells and, for each cell, predict the probability of the presence
of an object and the bounding box coordinates of the object. The process of YOLO can be broken down into several steps:
1. Input image is passed through a CNN to extract features from the image.
2. The features are then passed through a series of fully connected layers, which predict class probabilities and bounding box
coordinates.
3. The image is divided into a grid of cells, and each cell is responsible for predicting a set of bounding boxes and class
probabilities.
4. The output of the network is a set of bounding boxes and class probabilities for each cell.
5. The bounding boxes are then filtered using a post-processing algorithm called non-max suppression (NMS) to remove
overlapping boxes and choose the box with the highest probability (a sketch of NMS is given after this list).
6. The final output is a set of predicted bounding boxes and class labels for each object in the image.
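A minimal sketch of the non-max suppression step from point 5 (my own illustration, not the exact YOLO implementation):

import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences."""
    order = np.argsort(scores)[::-1]          # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        # IoU of the best box with all remaining boxes.
        x1 = np.maximum(boxes[best, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[best, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[best, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[best, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_best + areas - inter)
        # Keep only the boxes that do not overlap the chosen box too much.
        order = order[1:][iou < iou_threshold]
    return keep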
One of the key advantages of YOLO is that it processes the entire image in one pass, making it faster and more efficient than two-
stage object detectors such as R-CNN and its variants.
Source
YOLO v2
● One of the main differences between YOLO v2 and the original YOLO is the use of anchor boxes. In
YOLO v2, CNN predicts not only the bounding box coordinates but also the anchor boxes. Anchor boxes
are pre-defined boxes of different aspect ratios and scales, which are used to match the predicted
bounding boxes with the actual objects in the image. This allows YOLO v2 to handle objects of different
shapes and sizes better.
● Another key difference is the use of a multi-scale approach. In YOLO v2, the input image is fed through
CNN at multiple scales, which allows the model to detect objects at different sizes. This is achieved by
using a feature pyramid network (FPN), which allows the model to extract features at different scales
from the same image.
● Additionally, YOLO v2 uses a different loss function than the original YOLO, called the sum-squared error
(SSE) loss function. The SSE loss function is more robust and helps the model to converge faster.
● In terms of architecture, YOLO v2 uses a slightly deeper CNN than YOLO, which allows it to extract more
powerful features from the image. The CNN is followed by several fully connected layers, which predict
class probabilities and bounding box coordinates.
Source
YOLO v3
● YOLO v3 is the third version of the YOLO object detection algorithm. The first difference between YOLO
v3 and previous versions is the use of multiple scales in the input image. YOLO v3 uses a technique
called "feature pyramid network" (FPN) to extract features from the image at different scales. This allows
the model to detect objects of different sizes in the image.
● Another important difference is the use of anchor boxes. In YOLO v3, anchor boxes are used to match the
predicted bounding boxes with the actual objects in the image. Anchor boxes are pre-defined boxes of
different aspect ratios and scales, and the model predicts the offset of the anchor boxes relative to the
bounding boxes. This helps the model to handle objects of different shapes and sizes better.
● In terms of architecture, YOLO v3 is built on a deep convolutional neural network (CNN) that is composed
of many layers of filters. The CNN is followed by several fully connected layers, which predict class
probabilities and bounding box coordinates.
● YOLO v3 also uses a different loss function than previous versions. It uses a combination of classification
loss and localization loss, which allows the model to learn both the class probabilities and the bounding
box coordinates.
Source
YOLO v4
● A key distinction between YOLO v4 and previous versions is using a more advanced neural network
architecture. YOLO v4 uses a technique called "Spatial Pyramid Pooling" (SPP) to extract
features from the image at different scales and resolutions. This allows the model to detect objects
of different sizes in the image.
● Additionally, YOLO v4 also uses a technique called "Cross-stage partial connection" (CSP) to
improve the model's accuracy. It uses a combination of multiple models with different architectures
and scales and combines their predictions to achieve better accuracy.
Source
YOLO v5
● YOLO v5 was introduced in 2020 with a key difference from the previous versions, which is the use of
a more efficient neural network architecture called EfficientDet, which is based on the EfficientNet
architecture. EfficientDet is a family of image classification models that have achieved state-of-the-art
performance on a number of benchmark datasets. The EfficientDet architecture is designed to be
efficient in terms of computation and memory usage while also achieving high accuracy.
● Another important difference is the use of anchor-free detection, which eliminates the need for anchor
boxes used in previous versions of YOLO. Instead of anchor boxes, YOLO v5 uses a single
convolutional layer to predict the bounding box coordinates directly, which allows the model to be
more flexible and adaptable to different object shapes and sizes.
● YOLO v5 also uses a technique called "Cross mini-batch normalization" (CmBN) to improve the
model's accuracy. CmBN is a variant of the standard batch normalization technique that is used to
normalize the activations of the neural network.
Source
YOLO v6
● A notable contrast between YOLO v6 and previous versions is the use of a more efficient and lightweight
neural network architecture; this allows YOLO v6 to run faster and with fewer computational resources. The
architecture of YOLO v6 is based on the "Efficient Net-Lite" family, which is a set of lightweight models that
can be run on various devices with limited computational resources.
● YOLO v6 also incorporates data augmentation techniques to improve the robustness and generalization of
the model. This is done by applying random transformations to the input images during training, such as
rotation, scaling, and flipping.
Source
YOLO v7
● YOLO v7 boasts a number of enhancements compared to previous versions. A key enhancement is the
implementation of anchor boxes. These anchor boxes, which come in various aspect ratios, are utilized
to identify objects of various shapes. The use of nine anchor boxes in YOLO v7 enables it to detect a
wider range of object shapes and sizes, leading to a decrease in false positives.
● In YOLO v7, a new loss function called "focal loss" is implemented to enhance performance. Unlike the
standard cross-entropy loss function used in previous versions of YOLO, focal loss addresses the
difficulty in detecting small objects by adjusting the weight of the loss on well-classified examples and
placing more emphasis on challenging examples to detect.
Source
Conclusion
This lecture aims to summarize the most popular CNN architectures for object detection; however, there are
a plethora of other architectures, which may or may not be CNN based, which the students may wish to
explore. A comprehensive list is provided in the research papers listed in references [1], [2] and [3].
Reading the research papers cited in the references section is also encouraged.
References
[1]Z. Zou, K. Chen, Z. Shi, Y. Guo and J. Ye, "Object Detection in 20 Years: A Survey," in Proceedings of the IEEE, vol. 111, no. 3,
pp. 257-276, March 2023, doi: 10.1109/JPROC.2023.3238524.
[2]Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, ‘Object Detection with Deep Learning: A Review’, arXiv [cs.CV]. 2019.
[3]X. Wu, D. Sahoo, and S. C. H. Hoi, ‘Recent Advances in Deep Learning for Object Detection’, arXiv [cs.CV]. 2019.
[4] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in CVPR, vol. 1. IEEE, 2001, pp.
I–I.
[5] P. Viola and M. J. Jones, “Robust real-time face detection,” International journal of computer vision, vol. 57, no. 2, pp. 137–
154, 2004.
[6] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, vol. 1. IEEE, 2005, pp. 886–893.
[7]P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in CVPR.
IEEE, 2008, pp. 1–8.
[8]R. Girshick, J. Donahue, T. Darrell, and J. Malik, ‘Rich feature hierarchies for accurate object detection and semantic
segmentation’, arXiv [cs.CV]. 2014.
[9]Uijlings, J. R. R. et al. “Selective Search for Object Recognition.” International Journal of Computer Vision 104.2 (2013)
[10]Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks, NIPS’15 Proceedings
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]
Transfer Learning
Devang Saraogi
Teaching Assistant
Unit 1: Introduction to Deep Learning - Transfer Learning
Humans have an inherent ability to transfer knowledge across tasks. What we acquire as
knowledge while learning about one task, we utilize in the same way to solve related tasks. The
more related the tasks, the easier it is for us to transfer, or cross-utilize our knowledge.
Know how to play classic piano → Learn how to play jazz piano
In each of the above scenarios, we don’t learn everything from scratch when we attempt to
learn new aspects or topics. We transfer and leverage our knowledge from what we have learnt
in the past!
Unit 1: Introduction to Deep Learning - Transfer Learning
Transfer Learning
Transfer Learning is a technique where a model developed for one task is reused or repurposed for a
different but related task.
• The knowledge of an already trained machine learning model is transferred to a different but closely
linked problem through transfer learning.
• It can be understood as an optimization strategy that enables accelerated progress and enhanced
performance while modelling the problem.
• Transfer learning is not exclusively an area of study for deep learning but is a popular tool, given that
deep learning demands substantial resources and data to train models.
Unit 1: Introduction to Deep Learning - Transfer Learning
Transfer Learning
For example, if you trained a simple classifier to predict whether an image contains a cat, you could use the model’s
training knowledge to identify other animals such as dogs.
Unit 1: Introduction to Deep Learning - Transfer Learning
(Figure: knowledge learned by the Source Model is transferred to the Target Model.)
A domain D is defined as D = {X, P(X)}, where:
• feature space: X
• marginal distribution: P(X), with X = {x₁, …, xₙ}, xᵢ ∈ X
A task T is defined as T = {Y, P(Y|X)}, where:
• label space: Y
• conditional distribution: P(Y|X)
If two domains are different, they may have different feature spaces or different marginal distributions.
If two tasks are different, they may have different label spaces or different conditional distributions.
Unit 1: Introduction to Deep Learning - Transfer Learning
Given a source domain D_S and a target domain D_T, where D = {X, P(X)}, and a source task T_S and a target
task T_T, where T = {Y, P(Y|X)}, the source and target conditions can vary in four ways:
• X_S ≠ X_T : The feature spaces of the source and target domain are different, e.g. the documents are written
in two different languages. In the context of natural language processing, this is generally referred to as cross-
lingual adaptation.
• P(X_S) ≠ P(X_T) : The marginal probability distributions of the source and target domain are different, e.g. the
documents discuss different topics. This scenario is generally known as domain adaptation.
• Y_S ≠ Y_T : The label spaces between the two tasks are different, e.g. documents need to be assigned
different labels in the target task. In practice, this scenario usually occurs with scenario 4, as it is extremely rare
for two different tasks to have different label spaces but exactly the same conditional probability distributions.
• P(Y_S|X_S) ≠ P(Y_T|X_T) : The conditional probability distributions of the source and target tasks are
different.
Unit 1: Introduction to Deep Learning - Transfer Learning
Different transfer learning strategies and techniques are applied based on the domain of
the application, the task at hand, and the availability of data. Before deciding on the
strategy of transfer learning, it is crucial to have an answer of the following questions:
• Which part of the knowledge can be transferred from the source to the target to
improve the performance of the target task?
• When to transfer and when not to, so that one improves the target task performance/
results and does not degrade them?
• How to transfer the knowledge gained from the source model based on our current
domain/task?
Traditionally, transfer learning strategies fall under three major categories depending
upon the task domain and the amount of labeled/unlabeled data present.
Unit 1: Introduction to Deep Learning - Transfer Learning
Transfer Learning approaches can be categorized based on the similarity of domains, independent of the type of
data samples present during training.
Instance Transfer
Instance transfer reuses knowledge acquired from the source domain to enhance the performance of the
model on a target task. Specific instances or data points from the source domain are deemed to be
relevant to the target domain.
• the central idea is to transfer information from source domain to target domain, but the direct reuse of
source domain data is not always practical or advantageous
• involves the selective identification and utilization of specific instances from the source domain that
align with the characteristics and requirements of the target task
• this approach acknowledges that certain instances from the source domain can be more relevant when
combined with the target data to achieve improved results
Unit 1: Introduction to Deep Learning - Transfer Learning
Instance Transfer
For example, if a model is trained on urban scenes (source domain) and needs to adapt to a new city (target
domain), certain instances from the source domain might include images with architectural features, street layouts,
or objects commonly found in both urban environments.
Traffic light is a common element in urban settings worldwide and can be identified as a “certain instance”.
Unit 1: Introduction to Deep Learning - Transfer Learning
Feature Representation Transfer
• instead of using the entire pre-trained model, different layers of the model are utilized; earlier layers
capture low level features like edges and textures while later layers encode more complex patterns
• a custom model is shaped around the borrowed layers of the pre-trained model, typically smaller and
simpler with a focus learning the relationship between the extracted features and target outputs
• sometimes, fine tuning is required to align transferred functionality with the target domain; this could
include adjusting weights of later layers
Unit 1: Introduction to Deep Learning - Transfer Learning
Parameter Transfer
Parameter-based transfer learning is a technique where knowledge is transferred between source and
target domain models directly through their shared parameters (weights and biases). This leverages the
assumption that models for related tasks often share some underlying structure or patterns captured by
these parameters.
Relational Knowledge Transfer
Relational knowledge transfer attempts to transfer the relationships among data from the source domain to the target domain.
• this contrasts with approaches that transfer individual features, instances, or parameters
• relational methods aim to capture logical rules and connections within the data, enabling knowledge
transfer even for non-IID* data with complex relationships
• for example, understanding the relationships between speech elements in a male voice can assist in
analyzing sentences from other voices
(Summary table on the slide: a comparison of what is transferred under the Instance Transfer and Parameter Transfer strategies.)
Devang Saraogi
Teaching Assistant
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning
Myth
Deep Learning is not possible unless you have a million labelled examples for the task at hand.
Reality
• Useful representations can be learned from unlabelled data.
• You can train on a nearby surrogate objective for which it is easy to generate labels.
• You can transfer learned representations from a related task.
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning
Inductive transfer techniques utilize the inductive biases of the source task to assist the target task.
This can be done in different ways, such as by adjusting the inductive bias of the target task by limiting the model
space, narrowing down the hypothesis space, or making adjustments to the search process itself with the help of
knowledge from the source task.
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning
Deep learning has made considerable progress in recent years. Pre-trained models form the basis of transfer
learning in the context of deep learning (deep transfer learning). Let’s look at the two most popular strategies for
deep transfer learning.
This layered architecture allows us to utilize a pre-trained network (such as Inception v3 or VGG) without its final
layer as a fixed feature extractor for other tasks.
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning
(Figure: TRANSFER of the lower layers from a source network (conv1 → conv2 → conv3 → fc1 → fc2 → softmax) to a target network that reuses conv1-conv3 and fc1 to output features.)
The key idea here is to just leverage the pre-trained model’s weighted layers to extract features but not to update
the weights of the model’s layers during training with new data for the new task.
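A minimal PyTorch sketch of this fixed-feature-extractor setup (my own illustration, assuming torchvision ≥ 0.13 and a hypothetical 10-class target task):

import torch
import torch.nn as nn
from torchvision import models

# Load a pre-trained VGG16 (weights are downloaded on first use).
model = models.vgg16(weights=models.VGG16_Weights.DEFAULT)

# Freeze the convolutional feature-extractor layers: their weights will not be updated.
for param in model.features.parameters():
    param.requires_grad = False

# Replace the final classifier layer for the new task (10 classes assumed here).
num_features = model.classifier[-1].in_features
model.classifier[-1] = nn.Linear(num_features, 10)

# Only parameters that still require gradients are handed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)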
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning
• we can freeze (fix the weights of) certain layers while retraining, or fine-tune the rest of the layers, based on the
requirement
Question: Is freezing layers and using the model as a feature extractor enough? Or is fine-tuning also required?
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning
• Frozen: weights are fixed and not updated during backpropagation
• Fine-Tuned: weights are updated during backpropagation
Based on the target task:
• Freeze when target task labels are scarce and we want to avoid overfitting
• Fine-tune when there are plenty of target task labels
(Figure: data → conv1 → conv2 → conv3 → fc1 → fc2 + softmax, with the lower convolutional layers frozen and the upper fully connected layers fine-tuned.)
Tip: Learning rates can be set differently for each layer.
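As a sketch of the tip above (my own illustration, not from the slides), PyTorch parameter groups let the transferred layers and the new head use different learning rates:

import torch
from torchvision import models

model = models.vgg16(weights=models.VGG16_Weights.DEFAULT)

# Small learning rate for the transferred convolutional layers, larger learning rate
# for the classifier head that is being adapted to the new task.
optimizer = torch.optim.SGD(
    [
        {"params": model.features.parameters(), "lr": 1e-4},
        {"params": model.classifier.parameters(), "lr": 1e-2},
    ],
    momentum=0.9,
)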
Pre-trained Models
The entire concept of transfer learning is dependent on the presence of pre-trained models, more importantly
pre-trained models that perform exceptionally well on source tasks.
Luckily, the deep learning world believes in sharing and advancing together. Today, many state-of-the-art
models have been openly shared.
• pre-trained models usually cater to computer vision or natural language processing related tasks
• they are usually shared in different variants (small, medium, big) but contain millions of parameters
achieved during training
• they can be either downloaded from the internet or called as an object from deep learning related Python
libraries
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning
Pre-trained Models
Pre-trained models for Computer Vision Pre-trained models for Natural Language Processing
Domain Adaptation
Domain adaptation is usually referred to in scenarios where the marginal probabilities between the source
and target domains are different, i.e. P(X_S) ≠ P(X_T).
• there is a shift or drift in the data distribution of the source and target domains that requires tweaks to
transfer the learning
• for instance, a corpus of movie reviews labeled as positive or negative would be different from a corpus
of product-review sentiments. A classifier trained on movie-review sentiment would see a different
distribution if utilized to classify product reviews
• thus, domain adaptation techniques are utilized in transfer learning in these scenarios
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning
Domain Confusion
Different layers in a model capture different sets of features. This can be utilized to learn domain-invariant
features and improve their transferability across domains.
• instead of allowing the model to learn any representation, we nudge the representations of both
domains to be as similar as possible
• this can be achieved by applying certain pre-processing steps directly to the representations themselves
• the basic idea behind this technique is to add another objective to the source model to encourage
similarity by confusing the domain itself, hence domain confusion.
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning
Multitask Learning
Multitask learning is a slightly different flavor of the
transfer learning world. In the case of multitask
learning, several tasks are learned simultaneously
without distinction between the source and targets.
In this case, the learner receives information about
multiple tasks at once, as compared to transfer
learning, where the learner initially has no idea
about the target task.
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning
One-shot Learning
Deep learning systems are data-hungry by nature, such that they need many training examples to learn the
weights. This is one of the limiting aspects of deep neural networks, though such is not the case with
human learning.
For instance, once a child is shown what an apple looks like, they can easily identify a different variety of
apple (with one or a few training examples); this is not the case with ML and deep learning algorithms.
One-shot learning is a variant of transfer learning, where the required output is tried and inferred based
on just one or a few training examples. This is essentially helpful in real-world scenarios where it is not
possible to have labeled data for every possible class (if it is a classification task), and in scenarios where
new classes can be added often.
Unit 1: Introduction to Deep Learning - Transfer Learning for Deep Learning
Zero-shot Learning
Zero-shot learning is another extreme variant of transfer learning, which relies on no labeled examples to
learn a task. Zero-data learning, or zero-shot learning, methods make clever adjustments during the
training stage itself to exploit additional information to understand unseen data.
The Deep Learning book by Goodfellow et al. presents zero-shot learning as a scenario where three variables
are learned: the traditional input variable (x), the traditional output variable (y), and an additional random
variable that describes the task (T). The model is thus trained to learn the conditional probability
distribution P(y | x, T).
Zero-shot learning comes in handy in scenarios such as machine translation, where we may not even have
labels in the target language.
Unit 1: Introduction to Deep Learning - Transfer Learning
Recap
• Transfer learning models focus on storing knowledge gained while solving one problem and applying it to a
different but related problem.
• Instead of training a neural network from scratch, many pre-trained models can serve as the starting point
for training. These pre-trained models give a more reliable architecture and save time and resources.
• Transfer learning is used in scenarios where there is not enough data for training or when we want better
results in a short amount of time.
• Transfer learning involves selecting a source model similar to the target domain, adapting the source model
to the target task before transferring the knowledge, and then training it further to obtain the target model.
• It is common to fine-tune the higher-level layers of the model while freezing the lower levels as the basic
knowledge is the same that is transferred from the source task to the target task of the same domain.
• In tasks with a small amount of data, if the source model is too similar to the target model, there might be an
issue of overfitting. Tuning the learning rate, freezing some layers of the source model, or adding linear
classifiers while training the target model can help avoid this issue.
Unit 1: Introduction to Deep Learning - Transfer Learning
References
• https://machinelearningmastery.com/transfer-learning-for-deep-learning/
• https://www.v7labs.com/blog/transfer-learning-guide
• https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a
• https://medium.com/georgian-impact-blog/transfer-learning-part-1-ed0c174ad6e7
UE21CS343BB2
Topics in Deep Learning
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre for
Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
[email protected]