DL Unit 1
Deep Learning
Dr Rajesh Thumma
Assoc Professor
Dept of ECE
Contents
• Course Objectives
• Course Outcomes
• Syllabus
• Fundamentals of deep learning
• Building Block of Neural Networks
• Layers: Single-Layer Perceptron and MLPs
• Forward pass & Backward pass
• Class, trainer and optimizer
• The Vanishing and Exploding Gradient Problems
• Difficulties in Convergence
• Local and Spurious Optima
• Momentum, learning rate Decay, Dropout
• Cross Entropy loss function.
Course Objectives
• To understand the concept of Deep Learning
• To understand various CNN Architectures
• To learn various RNN models
• To familiarize the concept of Autoencoder
• To apply Transfer Learning to solve problems
Course Outcomes
At the end of this course, students will be able to:
• Understand the fundamental issues and basics of deep learning
• Understand the concept of CNN to apply it in the Image classification
problems
• Analyze the various RNN methods for sequences of input and Generative
models for image generation
• Analyze the working of various Autoencoder methods
• Use Transfer Learning to solve problems with high dimensional data
including image and speech
Syllabus
• UNIT-I Deep Learning: Fundamentals, Building Block of Neural Networks, Layers, MLPs,
Forward pass, backward pass, class, trainer and optimizer, The Vanishing and Exploding
Gradient Problems, Difficulties in Convergence, Local and Spurious Optima, Momentum,
learning rate Decay, Dropout, Cross Entropy loss function.
• UNIT-II Deep Learning: Activation functions, initialization, regularization, batch
normalization, model selection, ensembles. Convolutional neural networks: Fundamentals,
architectures, striding and padding, pooling layers, CNN -Case study with MNIST, CNN vs
Fully Connected.
• UNIT-III RNN: Handling Branches, Layers, Nodes, Essential Elements-Vanilla RNNs, GRUs,
LSTM, video to text with LSTM models.
• UNIT-IV Autoencoders and GAN: Basics of auto encoder, comparison between auto encoder
and PCA, variational auto encoders, denoising auto encoder, sparse auto encoder, vanilla auto
encoder, Multilayer autoencoder. Convolutional autoencoder, regularized auto encoder. GAN,
Image generation with GAN.
• UNIT-V Transfer Learning- Types, Methodologies, Diving into Transfer Learning, Challenges
What is Deep Learning?
Why do we need Deep Learning?
When to use Deep Learning over other techniques?
• Deep Learning outperforms other techniques if the data size is large. But with
small data sizes, traditional Machine Learning algorithms are preferable.
• Deep Learning techniques need to have high end infrastructure to train in
reasonable time.
• When there is a lack of domain understanding for feature introspection, Deep
Learning outshines other techniques because you have to worry less about feature
engineering.
• Deep Learning really shines when it comes to complex problems such as
image classification, natural language processing, and speech recognition.
Factors | Machine Learning | Deep Learning
Definition | Machine Learning is a subfield of AI that focuses on machines being able to learn without being explicitly programmed. | Deep Learning is a subfield of ML that focuses on machines being able to mimic the human brain to perform highly complex AI problems.
Data Feeding | We give structured data to the machine that builds the ML model. | We give unstructured data (the raw input) to the neural network.
Volume of Data | ML models deal with datasets having thousands of data rows. | Deep Learning models mostly deal with datasets having millions of data rows.
Training Time | ML models take less time to train because of the small data size. | DL models take a huge amount of time to train because of the massive number of data points.
• Forward Propagation: Forward propagation is the process in which the neural network
makes its predictions. Starting from the input layer, it propagates the input through the
network, layer by layer, until it reaches the output layer. At each neuron, it multiplies the
input by the weights, adds the bias, and applies the activation function to generate the
output.
• Input layer– The first type of layer is the input layer; it receives the raw input
features and passes them into the network.
• Hidden Layer– The second type of layer is called the hidden layer. A neural
network has one or more hidden layers (one, in the case above).
Hidden layers are the ones actually responsible for the excellent performance
and complexity of neural networks. They perform multiple functions at the same time
such as data transformation, automatic feature creation, etc.
• Output layer– The last type of layer is the output layer. The output layer holds the result
or the output of the problem.
Building Blocks of a Neural Network
• A layer consists of small individual units called neurons.
• An artificial neuron is similar to a biological neuron. It receives input from
the other neurons, performs some processing, and produces an output.
• The simplest neural network is the “perceptron”, which consists of a single
neuron.
• In biological neurons, the neuron receives electrical signals from its
dendrites, modulates the signals in various amounts, then fires an
output signal through its synapses only when the total strength of the input
signals exceeds a certain threshold. The output is then fed to another neuron,
and so forth.
Perceptron
Biological Neuron vs Artificial Neuron
• To model the biological neuron phenomenon, the artificial neuron
performs two consecutive functions: a weighted sum and an activation.
1. Make a prediction: ŷ = activation (∑ xi · wi + b)
2. It then compares the prediction with the actual value to calculate the error:
error = y − ŷ
3. Update the weights: if the prediction is too high, it will adjust the weights to
make a lower prediction next time, and vice versa.
4. Repeat! (A minimal code sketch of this loop follows below.)
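A minimal NumPy sketch of this predict / compare / update loop for a single perceptron (the AND data, step activation, and learning rate are illustrative assumptions, not from the slides):

import numpy as np

# Toy linearly separable data: learn the AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.1          # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        # 1. Weighted sum + step activation -> prediction
        y_hat = 1 if np.dot(xi, w) + b > 0 else 0
        # 2. Compare with the actual value
        error = target - y_hat
        # 3. Update: lower the weights if the prediction was too high,
        #    raise them if it was too low
        w += lr * error * xi
        b += lr * error

print(w, b)   # learned weights and bias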
Is one neuron enough to solve complex problems?
• No. The perceptron is a linear function. It works great with simple datasets that
can be separated by a straight line (i.e., linearly separable data).
Multi-Layer Perceptron Architecture
• A common neural network architecture is to stack neurons in layers
on top of each other, called hidden layers. Each layer has n neurons.
Layers are connected to each other by weight connections.
This leads to the Multi-Layer Perceptron (MLP) architecture.
Multi-Layer Perceptron
The learning process is a repetition of three main steps:
1) Feedforward calculations to produce a prediction (weighted sum
and activation),
2) Calculate the error, and
3) Backpropagate the error and update the weights to minimize the
error
Feedforward/Forward Pass
• The process of computing the linear combination and applying the activation
function is called feedforward.
In short, the forward pass is the calculations through the layers to make a
prediction
• Let’s take a look at this simple three-layer neural network and
explore each of its components:
Feedforward calculations
• Calculations at layer 1: a1 = σ(W1 · x + b1)
• Calculations at layer 2: ŷ = σ(W2 · a1 + b2)
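A minimal NumPy sketch of these layer-by-layer calculations (the layer sizes and the sigmoid activation are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                            # input vector (3 features)

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)     # hidden layer: 3 -> 4
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)     # output layer: 4 -> 1

a1 = sigmoid(W1 @ x + b1)      # calculations at layer 1
y_hat = sigmoid(W2 @ a1 + b2)  # calculations at layer 2: the prediction
print(y_hat)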
Tanh Activation Function
• Tanh is usually used in hidden layers of a neural network as its values lie between −1
and +1; therefore, the mean for the hidden layer comes out to be 0 or very close to it.
This helps in centering the data and makes learning for the next layer much easier.
ReLU Activation Function
• ReLU stands for Rectified Linear Unit: f(x) = max(0, x)
• It is simple and computationally efficient.
Leaky ReLU Function
• Leaky ReLU is defined to address the problem of dying/dead neurons:
f(x) = max(0.01·x, x)
• The dying-neuron problem is addressed by introducing a small slope (0.01) that
scales the negative values, which enables the corresponding neurons to “stay alive”.
• The function and its derivative are both monotonic.
• It allows negative values during backpropagation.
• It is efficient and easy to compute.
• The derivative of Leaky ReLU is 1 when f(x) > 0 and 0.01 (a value between 0 and 1)
when f(x) < 0.
Parameterised ReLU
• This is another variant of ReLU that aims to solve the problem of the gradient
becoming zero for the left half of the axis. The parameterised ReLU, as the
name suggests, introduces a new parameter as the slope of the negative part of
the function. Here’s how the ReLU function is modified to incorporate the
slope parameter:
• f(x) = x, x>=0
• = ax, x<0
• When the value of a is fixed to 0.01, the function acts as a Leaky
ReLU function. However, in the case of a parameterised ReLU
function, ‘a‘ is also a trainable parameter: the network learns
the value of ‘a‘ for faster and more optimal convergence.
• The derivative of the function is the same as for the Leaky ReLU
function, except that the value 0.01 is replaced with the value of a.
• f'(x) = 1, x>=0
•       = a, x<0
• The parameterized ReLU function is used when the leaky ReLU
function still fails to solve the problem of dead neurons and the
relevant information is not successfully passed to the next layer
Softmax Function
• Used when trying to handle multiple classes.
The softmax function squeezes the output for each class to between 0 and 1
and divides by the sum of the outputs, i.e. softmax(xi) = e^xi / ∑j e^xj,
so the outputs form a probability distribution over the classes.
Swish
• It is a self-gated activation function developed by researchers at Google,
defined as f(x) = x · sigmoid(x).
• There are perhaps three activation functions you may want to consider for
use in hidden layers; they are:
1. ReLU
2. Logistic (Sigmoid)
3. Tanh
• For the output layer in multiclass classification: Softmax
What is a Good Activation Function?
• A proper choice of activation function has to be made to improve the
results of neural network computing. All activation functions must be
monotonic, differentiable, and quickly converging with respect to the
weights for optimization purposes.
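A minimal NumPy sketch of the activation functions discussed above (function names and the test vector are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)                  # zero-centered, range (-1, 1)

def relu(x):
    return np.maximum(0.0, x)          # f(x) = max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small fixed slope for x < 0

def prelu(x, a):
    return np.where(x >= 0, x, a * x)  # 'a' is a trainable parameter

def swish(x):
    return x * sigmoid(x)              # self-gated

def softmax(x):
    e = np.exp(x - np.max(x))          # subtract max for numerical stability
    return e / e.sum()                 # outputs sum to 1

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(relu(z), softmax(z))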
Vanishing Gradient & Exploding Gradient
• In a network of n hidden layers, n derivatives will be multiplied together.
If the derivatives are large then the gradient will increase exponentially as
we propagate down the model until they eventually explode, and this is
what we call the problem of exploding gradient.
• Alternatively, if the derivatives are small then the gradient will decrease
exponentially as we propagate through the model until it eventually
vanishes, and this is the vanishing gradient problem.
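A tiny numeric illustration of this multiplicative effect (the per-layer derivative magnitudes 0.5 and 1.5 are illustrative, not from a real network):

n_layers = 50
print(0.5 ** n_layers)   # ~8.9e-16: the gradient vanishes
print(1.5 ** n_layers)   # ~6.4e+08: the gradient explodes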
Understanding Exploding Gradient in Deep Learning
• Gradient Descent:
• In deep learning, we use a method called gradient descent to train neural networks.
• Gradient descent adjusts the weights (like knobs) in the network to minimize errors
and improve predictions.
• Gradient Explained:
• The "gradient" measures how much we need to change each weight to make the
network better at its job.
• It’s like figuring out which knobs to turn to make a machine work perfectly.
• What is Exploding Gradient?:
• Exploding gradient happens when these gradients become very large during training.
• Instead of small, manageable changes to the weights, the gradients grow so big that
they make the weights change wildly.
Exploding Gradient
• Why It’s a Problem:
• When gradients explode, the network’s weights can change so much that the
model becomes unstable.
• It can lead to the network making unpredictable predictions or even crashing
during training.
• Causes:
• Exploding gradients often happen in deep networks with many layers (like
many floors in a skyscraper).
• If the gradients get bigger and bigger as they pass through each layer, they
can explode at the end.
• Effects:
• Training becomes very slow and unstable because we have to use very small
steps (learning rates) to keep the gradients from exploding.
• It can also affect how well the network learns and how accurate its
predictions are.
Dealing with Exploding Gradients
• Gradient Clipping:
• One technique to handle exploding gradients is called gradient clipping.
• It limits the size of the gradients during training so they don’t get too big
(see the sketch after this list).
• Choosing Learning Rates:
• We also carefully choose learning rates (step sizes) that are small enough to
keep the gradients from exploding.
• It’s like deciding how big each step should be to climb a mountain safely.
• Normalization Techniques:
• Using normalization methods like batch normalization helps keep the
gradients stable as they pass through each layer of the network.
• It’s like making sure everything is balanced and not too extreme.
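A minimal sketch of norm-based gradient clipping (the threshold and example gradient are illustrative):

import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    # Scale the gradient down if its norm exceeds max_norm.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])            # an "exploding" gradient with norm 50
print(clip_by_norm(g, max_norm=5.0))   # rescaled to norm 5: [3. -4.]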
Exploding gradient
• Summary
• Exploding gradient is when the changes in a neural network
become too big and unstable during training. It happens in
deep learning because of large gradients passing through
many layers. To deal with it, we use techniques like gradient
clipping and careful adjustment of learning rates to keep the
training stable and make sure our networks learn effectively.
Exploding Gradient vs Vanishing Gradient
Exploding Gradient | Vanishing Gradient
There is an exponential growth in the model parameters. | The parameters of the higher layers change significantly, whereas the parameters of the lower layers would not change much (or not at all).
Backpropagation
• The algorithm works by calculating the error rate of the model and
then adjusting the weights and biases accordingly.
• This process is repeated until the error rate is minimized and the
model is optimized; a worked two-layer sketch follows below.
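A minimal NumPy sketch of this repeated error-correction loop for a two-layer network with sigmoid activations and squared error (all sizes and values are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 1))     # one training input
y = np.array([[1.0]])           # its target output

W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))
lr = 0.5

for step in range(100):
    # Forward pass
    a1 = sigmoid(W1 @ x + b1)
    y_hat = sigmoid(W2 @ a1 + b2)
    # Error rate (squared-error loss)
    loss = 0.5 * float((y - y_hat) ** 2)
    # Backward pass: chain rule, output layer first
    d_out = (y_hat - y) * y_hat * (1 - y_hat)
    dW2, db2 = d_out @ a1.T, d_out
    d_hid = (W2.T @ d_out) * a1 * (1 - a1)
    dW1, db1 = d_hid @ x.T, d_hid
    # Adjust weights and biases to reduce the error
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(loss)   # close to 0 after repeated updates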
Why does a neural net fail to converge?
• Having too few nodes may be a reason behind this issue: models with too few
nodes lack the capacity to model the data well, would need a drastic change of
architecture to do better, and so fail to converge.
• The amount of training data is low, or the data we are feeding the model is
corrupted or was not collected with data integrity.
• The activation function we are using with the network often leads to good
results from the model, but if the complexity is higher, the model can fail to
converge.
Why does a neural net fail to converge?
• Inappropriate weight initialization in the network can also cause a
failure in convergence. The weights we apply to the network
should be well calculated according to the activation function.
• Reinitialization of the weights of the network can help in avoiding the failure
of convergence.
• If the training is stuck in a local minimum and the session has exceeded the
maximum iterations, the session has failed and we will get a higher error.
In such a situation, starting another session can be helpful.
Remedies for convergence failure
• A change in the activation function can be helpful. For example, if we are using a
ReLU activation and some neurons become biased towards negative pre-activations,
those neurons may never activate (the “dying ReLU” problem). In such a situation,
changing to another activation function can be helpful.
• While performing classification using neural networks, we can shuffle the
training data to avoid failure in convergence.
• The learning rate and the number of epochs should be kept in proportion while
modelling a network. A lower learning rate makes convergence happen in smaller
steps, and a larger number of epochs means a long wait before convergence
appears. A too-high learning rate or number of epochs should therefore be
avoided if the neural network is to converge properly.
COST FUNCTION VS LOSS FUNCTION
Loss Functions
• Loss functions are one of the most important aspects of neural networks, as they are
directly responsible for fitting the model to the given training data.
• A neural network processes the input data at each layer and eventually produces a
predicted output.
• Each training input is loaded into the neural network in a process called forward
propagation. Once the model has produced an output, this predicted output is
compared against the given target output in a process called backpropagation; the
weights and biases of the model are then adjusted so that it outputs a result
closer to the target.
• The parameters are adjusted to minimize the average loss: we find the
weights, w, and biases, b, that minimize the value of J (the average loss).
• Regression losses measure the distance of the actual y values from the regression
line (predicted values), the goal being to minimize the net distance.
Types of Loss Functions
There are two main types of loss functions:
• Mean Squared Error (MSE): One of the most popular loss functions, MSE finds
the average of the squared differences between the target and the predicted
outputs. It is typically used for regression.
• Cross-Entropy: Used for classification, it measures the difference between the
predicted class probabilities and the true labels. A code sketch of both follows
below.
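A minimal NumPy sketch of both loss types (function names and example values are illustrative):

import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of squared differences.
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Categorical cross-entropy for one-hot targets and predicted
    # class probabilities (e.g., softmax outputs).
    y_pred = np.clip(y_pred, eps, 1.0)     # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1).mean()

print(mse(np.array([2.5, 0.0, 2.0]), np.array([3.0, -0.5, 2.0])))  # ~0.167

t = np.array([[0, 1, 0]])              # one-hot target
p = np.array([[0.1, 0.8, 0.1]])        # predicted probabilities
print(cross_entropy(t, p))             # -ln(0.8) ~ 0.223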
Gradient Descent
• The weights are updated only after the gradient over the whole dataset is
calculated, which slows down the process. It also requires a large amount of
memory to store this temporary data, making it a resource-hungry process.
Advantages
• Simple to implement.
• Can work well with a well-tuned
learning rate.
Disadvantages
• As the method calculates the gradient
for the entire data set in one update, the
calculation is very slow.
• It requires large memory and it is
computationally expensive.
Stochastic Gradient Descent
• This is a variation of the GD, where the model parameters are
updated on every iteration. It means that after every training
sample, the loss function is tested and the model is updated.
• If the dataset has 10K records, SGD will update the model
parameters 10K times per epoch.
Stochastic Gradient Descent
• These frequent updates result in converging to the minima in less
time, but it comes at the cost of increased variance that can make the
model overshoot the required position.
Advantages:
• It can be faster than standard gradient descent, especially for large
datasets.
• Can escape local minima more easily.
Disadvantages:
• It can be noisy, leading to less stability.
• It may require more hyperparameter tuning to get good
performance.
Mini-Batch Gradient Descent
• Mini-batch gradient descent is similar to SGD, but instead of using a
single sample to compute the gradient, it uses a small, fixed-size
"mini-batch" of samples. The update rule is the same as for SGD,
except that the gradient is averaged over the mini-batch. This can
reduce noise in the updates and improve convergence.
Mini-Batch Gradient Descent
Advantages:
• It can be faster than standard gradient descent, especially for large
datasets.
• Can escape local minima more easily.
• Can reduce noise in updates, leading to more stable convergence.
Disadvantages:
• Can be sensitive to the choice of mini-batch size.
Gradient Descent (GD), Stochastic Gradient Descent (SGD), and Mini-batch
Gradient Descent (figures)
Summary GD vs SGD vs Mini-batch GD
• Gradient Descent (GD): Best for small datasets where computational
cost per update is manageable. Provides stable convergence.
• Stochastic Gradient Descent (SGD): Best for very large datasets or
online learning. Provides faster updates but with higher variance.
• Mini-batch Gradient Descent: Combines benefits of both GD and
SGD, suitable for large datasets, balances speed and stability, and
makes efficient use of computational resources.
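A minimal sketch contrasting the three variants on a toy linear-regression problem (the data, learning rate, and epoch count are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=1000)   # true slope = 3

def gradient(w, Xb, yb):
    # Gradient of the MSE loss for a linear model y ~ X @ w.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

def train(batch_size, lr=0.1, epochs=10):
    w = np.zeros(1)
    for _ in range(epochs):
        idx = rng.permutation(len(X))            # shuffle every epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            w -= lr * gradient(w, X[batch], y[batch])
    return w

print(train(batch_size=len(X)))   # GD: one update per epoch over all data
print(train(batch_size=1))        # SGD: one update per sample (noisy)
print(train(batch_size=32))       # Mini-batch: balances speed and stability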
Drawbacks of base optimizers (GD, SGD, mini-batch GD)
• Gradient Descent uses the whole training data to update weight and bias.
Suppose we have millions of records; then training becomes slow and
computationally very expensive.
• SGD solved the Gradient Descent problem by using only a single record to
update parameters. But SGD is still slow to converge because it needs
forward and backward propagation for every record, and the path to reach the
global minimum becomes very noisy.
Epochs
• During each epoch, the model parameters are updated in small steps based on the
loss calculated from the batches, leading to gradual learning and improvement of
the model.
• In summary, an epoch is a critical concept in deep learning that signifies a complete
pass through the training dataset. Training for multiple epochs allows the model to
iteratively learn and refine its parameters, ultimately leading to better performance
and generalization.
SGD with momentum
• SGD with momentum is a variant of SGD that adds a "momentum" term to
the update rule, which helps the optimizer to continue moving in the same
direction even if the local gradient is small. The momentum term is typically
set to a value between 0 and 1.
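A minimal sketch of the momentum update (assuming the common exponentially weighted form with coefficient beta; the toy objective f(w) = w**2 is illustrative):

import numpy as np

def sgd_momentum_step(w, grad, v, lr=0.05, beta=0.9):
    # v accumulates past gradients, so the step keeps moving in the
    # accumulated direction even when the current gradient is small.
    v = beta * v + (1 - beta) * grad
    return w - lr * v, v

w, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    g = 2 * w                      # gradient of f(w) = w**2
    w, v = sgd_momentum_step(w, g, v)
print(w)                           # approaches the minimum at 0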
Adadelta over Adagrad
• The formula for the new weight remains the same as in Adagrad.
• The first modification involves using an exponentially
decaying average of squared gradients instead of their
cumulative sum. This allows the optimizer to adapt to
recent gradients while forgetting the older ones, enabling
more flexibility during training.
• The second modification introduces an additional
parameter, ρ (rho), which controls the ratio between the
update step size and the exponentially decaying average of
squared gradients. By adjusting ρ, Adadelta further
improves its adaptability.
Adadelta
• The learning rate is calculated using a formula in which β and γ are the initial
restricting parameters for SGD with Momentum and Adadelta respectively.
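A minimal sketch of these two modifications (the standard Adadelta recurrences; rho, eps, and the toy objective are illustrative assumptions):

import numpy as np

def adadelta_step(w, grad, Eg2, Edx2, rho=0.95, eps=1e-6):
    # Eg2: exponentially decaying average of squared gradients
    # (recent gradients matter, older ones are forgotten).
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    # Step size is the ratio of the two running averages.
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return w + dx, Eg2, Edx2

w = np.array([5.0])
Eg2, Edx2 = np.zeros(1), np.zeros(1)
for _ in range(1000):
    g = 2 * w                      # gradient of f(w) = w**2
    w, Eg2, Edx2 = adadelta_step(w, g, Eg2, Edx2)
print(w)                           # moves toward the minimum at 0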
Adam
• Using the previous equations, the weight and bias update formula now
looks like:
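For reference, a standard form of the Adam update, combining the momentum estimate m and the squared-gradient average v (β1, β2, η and ε are the usual hyper-parameters):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\hat{m}_t = m_t / (1 - \beta_1^t), \quad \hat{v}_t = v_t / (1 - \beta_2^t)
w_{t+1} = w_t - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)

The same form applies to the bias b.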
Advantages:
• In the second image, where the learning rate is reduced over time (represented
by the green line), the learning rate is large initially, so we still have relatively
fast learning; but as we tend towards the minima the learning rate gets smaller
and smaller, and we end up oscillating in a tighter region around the minima
rather than wandering far away from it.
Learning Rate Decay
Learning rate decay (common method):
α=(1/(1+decayRate×epochNumber))*α0
1 epoch : 1 pass through the data
α or η : learning rate (current iteration)
α0 or η0 : initial learning rate
decayRate : hyper-parameter for the method
Example: Suppose we have α0 = 0.2 and decayRate = 1; then for each
epoch we can examine the fall in learning rate α as:
Epoch 1: alpha 0.1
Epoch 2: alpha 0.067
Epoch 3: alpha 0.05
Epoch 4: alpha 0.04
Learning Rate Decay
Exponential Decay: The decayRate of this method is always less
than 1; 0.95 is the value most commonly used among practitioners.
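A minimal sketch of both schedules (function names are illustrative); the inverse-time schedule reproduces the example values above:

def inverse_time_decay(alpha0, decay_rate, epoch):
    # alpha = alpha0 / (1 + decayRate * epochNumber)
    return alpha0 / (1 + decay_rate * epoch)

def exponential_decay(alpha0, decay_rate, epoch):
    # alpha = decayRate**epochNumber * alpha0, with decayRate < 1
    return (decay_rate ** epoch) * alpha0

for epoch in range(1, 5):
    print(epoch,
          round(inverse_time_decay(0.2, 1, epoch), 3),   # 0.1, 0.067, 0.05, 0.04
          round(exponential_decay(0.2, 0.95, epoch), 3))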
Weight Initialization
• Training the network without a useful weight initialization can lead to very
slow convergence or an inability to converge.
• Two basic schemes: 1. Zero initialization, 2. Random initialization.
Zero Initialization (initialize all weights to 0)
• With a plain 0 assignment, every neuron computes the same output and receives
the same update, so the network cannot break symmetry and learns nothing useful.
• But what happens if weights are initialized to high values? If a function like
sigmoid() is applied, it maps its value near to 1, where the slope of the gradient
changes slowly and learning takes a lot of time.
Random Initialization
• Weights are assigned small random values to break the symmetry between neurons;
a sketch follows below.
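A minimal sketch contrasting the two schemes (layer sizes are illustrative; the Xavier/Glorot scaling shown for random initialization is an assumption, since the slides do not name a specific scheme):

import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128

# Zero initialization: every neuron gets identical gradients
# (symmetry is never broken), so the layer cannot learn useful features.
W_zero = np.zeros((fan_out, fan_in))

# Random initialization: small random values break the symmetry.
# The Xavier/Glorot uniform limit keeps activations from saturating.
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_rand = rng.uniform(-limit, limit, size=(fan_out, fan_in))

print(W_rand.std())   # small spread, roughly sqrt(2 / (fan_in + fan_out))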
Reference: https://www.slideshare.net/slideshow/deep-learning-tutorial-deep-learning-tensor-flow-deep-learning-with-neural-networks-simplilearn/95199538
Thank You!
[email protected]