R21 - A7709 - Deep Learning: Dr. Bhawani Sankar Panigrahi
By
Dr. Bhawani Sankar Panigrahi
Assistant Professor
Department of Information Technology
Vardhaman College of Engineering
Course Overview
This course builds knowledge of deep learning, the area of artificial intelligence that relies on learned data representations rather than task-specific algorithms. It helps students demonstrate supervised, semi-supervised, and unsupervised learning. A convolutional neural network is built using Keras to show how deep learning is used in specialized neural networks. Applications of deep learning help students recognize and process text, images, and speech. An introduction to deep learning models such as RNNs, encoders, and generative models helps students relate the concepts to real-time projects.
Course Pre/co-requisites
A7512 - Machine Learning
A7704 - Foundations of Machine Learning
Course Outcomes (COs)
After the completion of the course, the student will be able to:
• A7709.1 Identify the need for neural networks and deep learning for a given problem.
• A7709.2 Build a CNN model on real-time data.
• A7709.3 Model sequence classification applications using RNNs.
• A7709.4 Build a deep learning model using encoders.
• A7709.5 Make use of generative models in various creative tasks.
SYLLABUS
Reference Books:
1. Bishop, C. M., Pattern Recognition and Machine Learning, Springer, 2006.
2. Yegnanarayana, B., Artificial Neural Networks, PHI Learning Pvt. Ltd., 2009.
3. Golub, G. H., and Van Loan, C. F., Matrix Computations, JHU Press, 2013.
4. Satish Kumar, Neural Networks: A Classroom Approach, Tata McGraw Hill Education, 2004.
Historical Trends in Deep Learning
It is easiest to understand deep learning with some historical context. The
trends are:
• Deep learning has had a long and rich history, but has gone by many names
reflecting different philosophical viewpoints, and has waxed and waned in
popularity.
• Deep learning has become more useful as the amount of available training
data has increased.
• Deep learning models have grown in size over time as computer
infrastructure (both hardware and software) for deep learning has improved.
• Deep learning has solved increasingly complicated applications with
increasing accuracy over time.
AI, ML, DL, NN
AI vs ML vs DL vs DS
[Diagram: Artificial Intelligence is the broadest field; within it sits Machine Learning (1. supervised, 2. unsupervised, 3. semi-supervised, 4. reinforcement learning); within that sits Deep Learning (1. ANN, 2. CNN, 3. RNN); Data Science is shown alongside these fields.]
What is artificial intelligence? Definition
What is Deep Learning?
Deep learning is a subset of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised, or unsupervised (Wikipedia).
Deep learning is essentially a neural network with three or more layers. These neural networks attempt to simulate the behaviour of the human brain.
Deep learning models are able to automatically learn features from the data, which makes them well-suited for tasks such as image recognition, speech recognition, and natural language processing.
What is Deep Learning?
Activation functions are attached to each neuron in the network and help to normalize the output of each neuron to a bounded range (its output space).
Anatomy of a deep neural network
Difference between Machine Learning and Deep Learning:

Machine Learning: Applies statistical algorithms to learn the hidden patterns and relationships in the dataset.
Deep Learning: Uses artificial neural network architectures to learn the hidden patterns and relationships in the dataset.

Machine Learning: Can work on a smaller amount of data.
Deep Learning: Requires a larger volume of data compared to machine learning.

Machine Learning: Takes less time to train the model.
Deep Learning: Takes more time to train the model.

Machine Learning: A model is created from relevant features that are manually extracted from images to detect an object in the image.
Deep Learning: Relevant features are automatically extracted from images; it is an end-to-end learning process.

Machine Learning: Less complex, and it is easy to interpret the results.
Deep Learning: More complex; it works like a black box, so interpretations of the results are not easy.

Machine Learning: Can work on a CPU, or requires less computing power compared to deep learning.
Deep Learning: Requires a high-performance computer with a GPU.
Deep Learning Architectures
The 10 most popular deep learning architectures:
1. Convolutional Neural Networks (CNNs)
2. Long Short-Term Memory Networks (LSTMs)
3. Recurrent Neural Networks (RNNs)
4. Generative Adversarial Networks (GANs)
5. Radial Basis Function Networks (RBFNs)
6. Multilayer Perceptrons (MLPs)
7. Self-Organizing Maps (SOMs)
8. Deep Belief Networks (DBNs)
9. Restricted Boltzmann Machines (RBMs)
10. Autoencoders
When it comes to deep learning, there are various types of neural networks, and deep learning architectures are built on top of them. The six most common deep learning architectures are:
1. RNN
2. LSTM
3. GRU
4. CNN
5. DBN
6. DSN
RNN: Recurrent Neural Networks
The RNN is one of the fundamental network architectures from which other deep learning architectures are built. RNNs form a rich family of deep learning architectures. They can use their internal state (memory) to process variable-length sequences of inputs.
Bidirectional RNN: These work in two directions, so the output layer can get information from past and future states simultaneously [2].
Deep RNN: Multiple layers are present; as a result, the model can extract more hierarchical information.
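As a concrete sketch (not taken from the slides), the Keras model below stacks a simple RNN and a bidirectional RNN over variable-length sequences; the layer sizes and the 8-feature input are illustrative assumptions.

import tensorflow as tf

# Sequences of arbitrary length, 8 features per time step (illustrative shape)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 8)),
    # Internal state (memory) carries information across time steps
    tf.keras.layers.SimpleRNN(32, return_sequences=True),
    # Bidirectional wrapper: the output can use past and future context
    tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(16)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()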
LSTM: Long Short-Term Memory
The LSTM is also a type of RNN; however, it has feedback connections. This means it can process not only single data points, such as images, but also entire sequences of data, such as audio or video files.
LSTM derives from neural network architectures and is based on the concept of a memory cell. The memory cell can retain its value for a short or long time as a function of its inputs, which allows the cell to remember what is essential and not just its last computed value.
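A minimal Keras sketch of an LSTM-based sequence classifier; the vocabulary size, sequence length, and layer widths are illustrative assumptions rather than values from the course material.

import tensorflow as tf

vocab_size, seq_len = 10000, 100   # illustrative values
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len,), dtype="int32"),
    tf.keras.layers.Embedding(vocab_size, 64),
    # The LSTM's memory cell decides what to keep and what to forget at each step
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])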
GRU: Gated Recurrent Unit
This abbreviation stands for Gated Recurrent Unit. It is a variant of the LSTM. The major difference is that a GRU has fewer parameters than an LSTM, as it lacks an output gate [5]. GRUs are used for smaller and less frequent datasets, where they show better performance.
The basic idea behind the GRU is to use gating mechanisms to selectively update the hidden state of the network at each time step. The gating mechanisms control the flow of information in and out of the network. The GRU has two gating mechanisms, called the reset gate and the update gate.
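The reset and update gates can be written out directly. The NumPy function below is a sketch of a single GRU step under one common formulation; the matrix names and sizes are illustrative, not a library implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wr, Ur, br, Wz, Uz, bz, Wh, Uh, bh):
    r = sigmoid(Wr @ x + Ur @ h + br)               # reset gate: how much past state to use
    z = sigmoid(Wz @ x + Uz @ h + bz)               # update gate: how much state to overwrite
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh)   # candidate state from input and reset-scaled history
    return (1 - z) * h + z * h_tilde                # blend old state and candidate state

# Illustrative sizes: input dimension 4, hidden dimension 3
rng = np.random.default_rng(0)
Wr, Wz, Wh = (rng.normal(size=(3, 4)) for _ in range(3))
Ur, Uz, Uh = (rng.normal(size=(3, 3)) for _ in range(3))
br = bz = bh = np.zeros(3)
h_new = gru_step(rng.normal(size=4), np.zeros(3), Wr, Ur, br, Wz, Uz, bz, Wh, Uh, bh)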
CNN: Convolutional Neural Networks
A CNN uses convolutional layers to automatically extract spatial features from grid-like data such as images.
DSN: Deep Stacking Networks
Each network within a DSN has its own hidden layers that process data. This architecture was designed to improve on the training problem, which is quite complicated in traditional deep learning models.
Feed-Forward Neural Networks vs Recurrent Neural Networks
Key characteristics and components of deep feed-forward networks
Layers: A deep feed-forward network typically consists of multiple layers,
including an input layer, one or more hidden layers, and an output layer.
Each layer is composed of one or more neurons (also called nodes or
units).
Neurons: Neurons are the basic computational units within each layer. Each neuron
receives inputs, applies a transformation to these inputs (usually a
weighted sum), and passes the result through an activation function. The
output of a neuron serves as input to the neurons in the subsequent layer.
Weights and Biases: The connections between neurons are represented by weights, which
determine the strength of the connection. Additionally, each neuron
typically has an associated bias term that helps shift the activation
function. These weights and biases are learned during the training process.
Activation Functions: Activation functions introduce non-linearity to the network. Common
activation functions include the sigmoid, hyperbolic tangent (tanh),
and rectified linear unit (ReLU) functions. These functions allow the
network to approximate complex functions.
Forward Propagation: The process of computing the network's output from input data is
known as forward propagation. It involves passing the input through
the layers of neurons, applying weights, biases, and activation
functions to compute the final output.
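A small NumPy sketch of forward propagation through one hidden layer; the layer sizes, random weights, and the choice of ReLU and sigmoid activations are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                            # input layer: 3 features (illustrative)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)     # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)     # output layer: 1 neuron

relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

h = relu(W1 @ x + b1)        # weighted sum plus bias, passed through the activation
y = sigmoid(W2 @ h + b2)     # output of the final neuron
print(y)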
Challenges in Deep Learning
Deep learning has made significant advancements in various fields, but some challenges still need to be addressed. Here are some of the main challenges in deep learning:
1. Data availability: Deep learning requires large amounts of data to learn from, and gathering enough data for training is a major concern.
5. Interpretability: Deep learning models are complex and work like a black box, so it is very difficult to interpret the results.
6. Overfitting: When the model is trained again and again, it becomes too specialized for the training data, leading to overfitting and poor performance on new data.
Advantages of Deep Learning:
Gradient-Descent Learning
1. Gradient descent is an iterative optimization algorithm used to find the minimum (or maximum) of a function.
2. It is commonly used in machine learning and optimization tasks to update the parameters of a model in order to minimize a loss function.
3. There are several variations and strategies based on the basic gradient descent algorithm.
4. A gradient measures how much the output of a function changes if you change the inputs a little bit.
5. In machine learning, a gradient is a derivative of a function that has more than one input variable. Known as the slope of a function in mathematical terms, the gradient simply measures the change in all weights with regard to the change in error.
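To make the idea concrete, the tiny sketch below minimizes an illustrative one-parameter loss L(w) = (w - 3)^2 by repeatedly stepping against its gradient; the loss, starting point, and learning rate are assumptions chosen only for illustration.

# Gradient descent on L(w) = (w - 3)^2, whose derivative is dL/dw = 2*(w - 3)
w, learning_rate = 0.0, 0.1
for step in range(100):
    grad = 2.0 * (w - 3.0)            # slope of the loss at the current w
    w = w - learning_rate * grad      # move a small step downhill
print(w)                              # approaches the minimum at w = 3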
Gradient-Descent Strategies
There are three popular types that mainly differ in the amount of data they use.
• Batch Gradient Descent: In this basic form of gradient descent, the entire training dataset is used in each iteration to calculate the gradient and update the model parameters. It can be computationally expensive for large datasets.
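A minimal NumPy sketch of batch gradient descent for a one-feature linear model; the dataset, learning rate, and number of epochs are illustrative assumptions.

import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # illustrative data generated from y = 2x + 1
y = 2.0 * X + 1.0

w, b, lr = 0.0, 0.0, 0.05
for epoch in range(2000):
    err = (w * X + b) - y
    grad_w = 2.0 * np.mean(err * X)       # gradients of the MSE over the ENTIRE dataset
    grad_b = 2.0 * np.mean(err)
    w, b = w - lr * grad_w, b - lr * grad_b
print(w, b)                               # approaches 2.0 and 1.0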
Stochastic Gradient-Descent strategies
Initialization: Initialize the model parameters (weights and biases) randomly or using some predefined values.
Random Sample Selection: In each iteration, randomly select a single training example from the dataset. This randomness introduces noise into the optimization process.
Calculate Gradient: Compute the gradient of the loss function with respect to the selected training example. This involves calculating the derivatives of the loss function with respect to each model parameter.
Update Parameters: Update the model parameters using the computed gradient. The parameters are updated in the opposite direction of the gradient to minimize the loss. The update rule is: parameter = parameter - learning_rate * gradient
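The four steps above, written as a NumPy sketch for a one-feature linear model trained with squared-error loss; the data, learning rate, and number of steps are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + 0.5 + rng.normal(scale=0.1, size=200)   # illustrative noisy data

w, b, learning_rate = 0.0, 0.0, 0.05                  # Initialization
for step in range(5000):
    i = rng.integers(len(X))                          # Random sample selection
    err = (w * X[i] + b) - y[i]
    grad_w, grad_b = err * X[i], err                  # Calculate gradient of 0.5*err^2
    w = w - learning_rate * grad_w                    # Update parameters:
    b = b - learning_rate * grad_b                    #   parameter = parameter - learning_rate * gradient
print(w, b)                                           # approaches the true values 2.0 and 0.5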
Advantages of SGD:
3. Online Learning: SGD's ability to adapt quickly to new data points makes it suitable for online learning scenarios where data arrives in a streaming fashion.
Disadvantages of SGD:
2. Hyperparameter Sensitivity: The choice of learning rate is crucial in SGD. If the learning rate is too high, the algorithm might diverge; if it is too low, convergence can be slow.
The way hidden units are differentiated from each other is based on their activation function, g(z).
As mentioned, the rectified linear (max) activation function g(z) = max(0, z) is applied on top of the affine transformation z. When plotted, it is zero for negative inputs and the identity line for positive inputs.
What’s Maxout?
Maxout is a flavour of ReLU, which itself is a subset of activation functions, which are a component of a hidden unit. As such, we know that a hidden unit will apply an affine transformation to a vector and then apply a nonlinear element-wise activation function. Since Maxout is a flavour of ReLU, you are right to assume it uses a max(0, z).
The Maxout unit divides the affine outputs z into groups and outputs the maximum element of each group: g(z)_i = max of z_j over all j in group i.
What’s the sigmoid and the hyperbolic tangent (tanh)?
The difference between them is that the sigmoid is 1/2 at 0, whereas tanh is 0 at 0. In that sense, tanh is more like the identity function, at least around 0.
What’s RBF?
This function, the Radial Basis Function, becomes more active as x approaches a certain template vector; it saturates to 0 almost everywhere else, so it can be annoying for gradient descent.
What’s Softplus?
This one is discouraged from use based on empirical evidence, which is counter-intuitive, since it is meant to be an improvement on ReLU by being differentiable everywhere. In practice, however, it does worse.
What’s the hard hyperbolic tangent, or hard tanh?
It looks like the tanh or the rectifier, but unlike the rectifier, it is bounded. It is computationally cheaper than many of the alternatives: it is simply -1 for inputs below -1, the identity line a in between, and 1 for inputs above 1.
What’s Identity?
With a linear (identity) hidden layer, a weight matrix W mapping n inputs to p outputs can be factored into two layers: the first layer has weight matrix U and the second has weight matrix V. If the first layer, U, produces q outputs, together these layers have (n + p)q parameters, whereas W alone would have np parameters. Linear hidden units therefore offer an effective way to reduce the number of parameters in a network when q is small.
What’s Softmax?
These units are often used in architectures where the goal is to learn to manipulate memory, and for classification problems where you need to pick one of multiple categories. Softmax is the one to use there, as it always boosts the max category and drags the other categories down.
What’s GELU?
GELU stands for Gaussian Error Linear Unit, and it is a proposed activation function meant to be an improvement on ReLU and its cousins. Where ReLU gates the inputs by their sign, the GELU gates inputs by their magnitude. The paper presents an empirical evaluation of GELU against the ReLU and ELU activation functions on MNIST, tweet processing, and other tasks, and the authors found that it performed better.
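The NumPy sketch below collects several of the activation functions discussed above; the GELU uses the common tanh approximation, and the maxout group size is an illustrative assumption.

import numpy as np

relu      = lambda z: np.maximum(0.0, z)                 # max(0, z)
sigmoid   = lambda z: 1.0 / (1.0 + np.exp(-z))           # equals 1/2 at z = 0
# np.tanh gives the hyperbolic tangent, which equals 0 at z = 0
softplus  = lambda z: np.log1p(np.exp(z))                # smooth, everywhere-differentiable ReLU
hard_tanh = lambda z: np.clip(z, -1.0, 1.0)              # -1, the line z, or 1
gelu      = lambda z: 0.5 * z * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def softmax(z):
    e = np.exp(z - z.max())               # subtract the max for numerical stability
    return e / e.sum()                    # boosts the largest entry, shrinks the rest

def maxout(z, k=2):
    # Split the affine outputs z into groups of size k and keep each group's maximum
    return z.reshape(-1, k).max(axis=1)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0, 4.0])
print(relu(z), hard_tanh(z), softmax(z), maxout(z))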
Back Propagation Learning
Training a Neural Network with Backpropagation
One way to train our model is called backpropagation. Consider the steps below:
Backpropagation Process & Steps
Steps:
1. Calculate the error – how far the model output is from the actual output.
2. Check for minimum error – check whether the error is minimized or not.
3. Update the parameters – if the error is large, update the parameters (weights and biases), then check the error again. Repeat the process until the error becomes minimal.
4. Model is ready to make a prediction – once the error becomes minimal, you can feed inputs to your model and it will produce the output.
Errors, Training and Test Loss
• Training and test loss is a number indicating how bad (inaccurate) the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater.
• If the loss does not decrease during training, the model is not learning; probably there is something wrong with either the model or the optimization process.
• The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
• Computationally, the training loss is calculated by taking the sum of the errors for each example in the training set.
Errors, Training and Test Loss
For example, consider a high-loss model (A) on the left and a low-loss model (B) on the right, where:
• The arrows represent loss.
• The blue lines represent predictions.
• Notice that the arrows in the left plot are much longer than their counterparts in the right plot. Clearly, the line in the right plot is a much better predictive model than the line in the left plot.
Training a Neural Network with Backpropagation
Example:
Training a neural network using the backpropagation algorithm.
• In this example, we'll build and train a feedforward neural network to solve a basic
binary classification problem. The network will have one hidden layer and will use the sigmoid
activation function. We'll use the mean squared error (MSE) loss function and gradient descent as
the optimization algorithm.
• Input layer: The input layer consists of two neurons, one for each input feature.
• Hidden layer: The hidden layer consists of two neurons. Each neuron will receive inputs from
the input layer, and we'll apply the sigmoid activation function to the weighted sum of inputs.
• Output layer: The output layer consists of one neuron. It will receive inputs from the hidden layer
and also apply the sigmoid activation function.
Step 2: Forward Pass
The formulas for the forward pass, written in vector form for the architecture above, are: z1 = W1·x + b1 and a1 = sigmoid(z1) for the hidden layer, then z2 = W2·a1 + b2 and y_pred = sigmoid(z2) for the output layer, where W1, b1 and W2, b2 are the hidden- and output-layer weights and biases.
Step 3: Compute Loss
Now, we'll compute the mean squared error loss using the predicted outputs and the target values from the dataset: loss = (1/N) * Σ (y_pred - y_target)², where N is the number of training examples.
Step 4: Backpropagation - Compute Gradients
• In the backpropagation step, we compute the gradients of the loss with
respect to the network parameters. These gradients will tell us the
direction in which to adjust the weights and biases to minimize the
loss.
• We use the chain rule to calculate the gradients at each layer.
Step 5: Update Weights and Biases
• With the gradients computed, we can now update the weights and biases using gradient
descent.
• Learning rate (α) is a hyperparameter that controls the step size in the gradient descent
update.
Step 6: Repeat Training
Now, you can repeat steps 2 to 5 (forward pass, compute loss,
backpropagation, update weights and biases) for multiple
iterations (epochs) until the loss converges to a minimum
or reaches a satisfactory value.
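A NumPy sketch of the whole loop (Steps 2 to 6) for the 2-2-1 sigmoid network described earlier; the OR-style dataset, random initial weights, learning rate, and epoch count are illustrative assumptions, not values from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative binary-classification data: two input features per example (OR function)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [1]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))   # hidden layer: 2 sigmoid neurons
W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))   # output layer: 1 sigmoid neuron
alpha = 0.5                                          # learning rate

for epoch in range(5000):
    # Step 2: forward pass
    a1 = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(a1 @ W2 + b2)

    # Step 3: mean squared error loss
    loss = np.mean((y_hat - y) ** 2)

    # Step 4: backpropagation (chain rule) to get gradients for every weight and bias
    d_out = (y_hat - y) * y_hat * (1 - y_hat)              # error signal at the output
    dW2, db2 = a1.T @ d_out, d_out.sum(axis=0, keepdims=True)
    d_hid = (d_out @ W2.T) * a1 * (1 - a1)                 # error signal at the hidden layer
    dW1, db1 = X.T @ d_hid, d_hid.sum(axis=0, keepdims=True)

    # Step 5: gradient-descent update
    W1, b1 = W1 - alpha * dW1, b1 - alpha * db1
    W2, b2 = W2 - alpha * dW2, b2 - alpha * db2

# Step 6: after many epochs the loss is small and the predictions match the targets
print(loss, y_hat.round(2))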
Question: 1
Optimize the weights so that the neural network can learn how to correctly map arbitrary inputs
to outputs
Reference solution
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/