
R21 - A7709 - Deep Learning: Dr. Bhawani Sankar Panigrahi

R21 - A7709 - Deep Learning

By
Dr. Bhawani Sankar Panigrahi
Assistant Professor
Department of Information Technology
Vardhaman College of Engineering
Course Overview

This course builds knowledge of deep neural learning, the branch of artificial
intelligence that depends on learned data representations rather than task-specific
algorithms. It helps students demonstrate supervised, semi-supervised, and unsupervised
learning. A convolutional deep neural network is built using Keras to show how
deep learning is used in specialized neural networks. Applications of deep learning
cover recognizing and processing text, images, and speech. An introduction to
deep learning models such as RNNs, encoders, and generative models helps students
relate the material to real-time projects.

Course Pre/co-requisites
A7512 - Machine Learning
A7704 - Foundations of Machine Learning
Course Outcomes (COs)

After the completion of the course, the student will be able to:

• A7709.1 Identify the need for neural networks and deep learning for a given problem.
• A7709.2 Build a CNN model on real-time data.
• A7709.3 Model sequence classification applications using RNNs.
• A7709.4 Build a deep learning model using encoders.
• A7709.5 Make use of generative models in various creative tasks.
SYLLABUS

Module I: Introduction    Number of hours (12)

• Introduction to Deep Learning architectures,
• Historical Trends in Deep Learning,
• Challenges motivating Deep Learning,
• Gradient-Based Learning,
• Hidden Units, Architecture Design,
• Back-Propagation.
Books and Materials
Text Book(s)
1. Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016.
2. Jeff Heaton, Deep Learning and Neural Networks, Heaton Research Inc., 2015.

Reference Books:
1. Bishop, C. M., Pattern Recognition and Machine Learning, Springer, 2006.
2. Yegnanarayana, B., Artificial Neural Networks, PHI Learning Pvt. Ltd., 2009.
3. Golub, G. H., and Van Loan, C. F., Matrix Computations, JHU Press, 2013.
4. Satish Kumar, Neural Networks: A Classroom Approach, Tata McGraw Hill Education, 2004.
Historical Trends in Deep Learning
It is easiest to understand deep learning with some historical context. The
trends are:

• Deep learning has had a long and rich history, but has gone by many names
reflecting different philosophical viewpoints, and has waxed and waned in
popularity.
• Deep learning has become more useful as the amount of available training
data has increased.
• Deep learning models have grown in size over time as computer
infrastructure (both hardware and software) for deep learning has improved.
• Deep learning has solved increasingly complicated applications with
increasing accuracy over time.
AI, ML, DL, NN

AI vs ML vs DL vs DS

Artificial Intelligence
Data Science

Machine Learning
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning

Deep Learning
1. ANN
2. CNN
3. RNN
What is artificial intelligence? Definition

Artificial intelligence is the ability of a computer to perform tasks commonly
associated with intelligent beings.

What is machine learning? Definition

“Machine learning is the study of algorithms that learn from examples and experience
instead of relying on hard-coded rules and make predictions on new data.”

What is deep learning? Definition

Deep learning is a subfield of machine learning focusing on learning data
representations as successive layers of increasingly meaningful representations.
Introduction to Deep Learning

What is Deep Learning?
• Deep learning is a subset of machine learning methods based on artificial neural
networks with representation learning. Learning can be supervised, semi-supervised,
or unsupervised (Wikipedia).
• Deep learning is essentially a neural network with three or more layers. These
neural networks attempt to simulate the behaviour of the human brain.
• Deep learning models are able to automatically learn features from the data, which
makes them well-suited for tasks such as image recognition, speech recognition, and
natural language processing.
• Deep learning creates many layers of neurons, attempting to learn a structured
representation of big data, layer by layer.
Why Deep Learning?

Note: A key advantage of deep learning networks is that they often continue to
improve as the size of your data increases.

1. Exponential growth of data from social media, YouTube, smartphones, etc. With
this data, complex use cases such as recommendation systems and face detection
can be implemented using deep learning models, which are often more accurate
than other approaches on such tasks.
2. Technology upgrades in both software and hardware. Cloud data centers and
vendors such as NVIDIA provide large numbers of GPUs for processing.
3. Feature extraction (feature engineering) is part of the deep learning model
itself, unlike in traditional machine learning models.
4. Deep learning solves complex problems such as image classification, object
detection, NLP tasks, and chatbots.
Activation Functions (Transfer Functions):

• An activation function helps determine the output of a neural network.
• These functions are attached to each neuron in the network and determine whether
it should be activated or not, based on whether the neuron's input is relevant for
the model's prediction.
• Many activation functions also normalize the output of each neuron to a bounded
range, such as between 0 and 1 or between -1 and 1.

• The activation function is a mathematical “gate” between the input feeding the
current neuron and its output going to the next layer.
• It can be as simple as a step function that turns the neuron output on and off,
depending on a rule or threshold.
• Neural networks use non-linear activation functions, which help the network learn
complex data, approximate almost any function representing a question, and provide
accurate predictions.
• The activation function is also called the “transfer function”.
Note: The value of the net input can be anything from -inf to +inf.

Why do we need activation functions (transfer functions)?
• They introduce non-linearity into the network.
• They help the solution converge, i.e. they help reduce the solution space.
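
As a minimal sketch of the points above, the following Python snippet defines a step function and three common non-linear activations and prints their outputs for a few net-input values. The function names and sample inputs are illustrative choices, not part of the original slides.

```python
import math

def step(z, threshold=0.0):
    """Binary step: the neuron is either on (1.0) or off (0.0)."""
    return 1.0 if z >= threshold else 0.0

def sigmoid(z):
    """Squashes any real net input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes any real net input into the open interval (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Passes positive inputs through unchanged and clips negatives to 0."""
    return max(0.0, z)

# The net input can be any real number; the outputs stay in a known range.
for z in (-5.0, 0.0, 5.0):
    print(z, step(z), round(sigmoid(z), 4), round(tanh(z), 4), relu(z))
```

The sigmoid and tanh rows show the bounded output ranges mentioned above, while ReLU is unbounded above.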
Anatomy of a deep neural network
Difference between Machine Learning and Deep Learning:

Machine Learning | Deep Learning
Applies statistical algorithms to learn the hidden patterns and relationships in the dataset. | Uses an artificial neural network architecture to learn the hidden patterns and relationships in the dataset.
Can work with a smaller dataset. | Requires a larger volume of data than machine learning.
Better for low-label tasks. | Better for complex tasks such as image processing, natural language processing, etc.
Takes less time to train the model. | Takes more time to train the model.
A model is created from relevant features that are manually extracted from images to detect an object in the image. | Relevant features are automatically extracted from images. It is an end-to-end learning process.
Less complex, and the result is easy to interpret. | More complex; it works like a black box, and the result is not easy to interpret.
Can work on a CPU and requires less computing power than deep learning. | Requires a high-performance computer with a GPU.
Deep Learning Architectures
10 most popular deep learning architectures:
1. Convolutional Neural Networks (CNNs)
2. Long Short Term Memory Networks (LSTMs)
3. Recurrent Neural Networks (RNNs)
4. Generative Adversarial Networks (GANs)
5. Radial Basis Function Networks (RBFNs)
6. Multilayer Perceptrons (MLPs)
7. Self Organizing Maps (SOMs)
8. Deep Belief Networks (DBNs)
9. Restricted Boltzmann Machines (RBMs)
10. Autoencoders
When it comes to deep learning, there are various types of neural networks, and
deep learning architectures are based on these networks. Six of the most common
deep learning architectures are:
1. RNN
2. LSTM
3. GRU
4. CNN
5. DBN
6. DSN
RNN: Recurrent Neural Networks
RNN is one of the fundamental network architectures from which other
deep learning architectures are built. RNNs consist of a rich set of deep
learning architectures. They can use their internal state (memory) to
process variable-length sequences of inputs.

There are two types of RNN:

Bidirectional RNN: They work two ways; the output layer can get
information from past and future states simultaneously[2].

Deep RNN: Multiple layers are present. As a result, the DL model can extract
more hierarchical information.
LSTM: Long Short-Term Memory

LSTM is a type of RNN. Because it has feedback connections, it can process not only
single data points such as images but also entire sequences of data such as audio
or video files.

LSTM derives from neural network architectures and is based on the concept
of a memory cell. The memory cell can retain its value for a short or long time
as a function of its inputs, which allows the cell to remember what’s essential
and not just its last computed value.

LSTMs are commonly used in fields such as text compression, handwriting
recognition, speech recognition, gesture recognition, and image captioning.
GRU: Gated Recurrent Unit

GRU stands for Gated Recurrent Unit. It is a gated RNN closely related to the LSTM.
The major difference is that a GRU has fewer parameters than an LSTM, as it lacks
an output gate[5]. GRUs are often used for smaller and less frequent datasets,
where they can show better performance.

The basic idea behind GRU is to use gating mechanisms to selectively update
the hidden state of the network at each time step. The gating mechanisms are
used to control the flow of information in and out of the network. The GRU
has two gating mechanisms, called the reset gate and the update gate.
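
To make the two gates concrete, here is a minimal NumPy sketch of a single GRU time step. It assumes one common formulation of the GRU equations; the parameter names (`Wz`, `Uz`, and so on), the layer sizes, and the convention that the update gate weights the candidate state are illustrative assumptions, and some references swap the roles of `z` and `1 - z`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, p):
    """One GRU time step: returns the new hidden state."""
    # Update gate: how much of the hidden state to overwrite.
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    # Reset gate: how much of the old state feeds the candidate.
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    # Candidate hidden state, built from the input and the reset-scaled old state.
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    # Blend old state and candidate (assumed convention, see lead-in).
    return (1.0 - z) * h + z * h_cand

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4  # hypothetical sizes for illustration
params = {}
for gate in ("z", "r", "h"):
    params["W" + gate] = rng.normal(scale=0.1, size=(n_hid, n_in))
    params["U" + gate] = rng.normal(scale=0.1, size=(n_hid, n_hid))
    params["b" + gate] = np.zeros(n_hid)

# Run the cell over a short random sequence of 5 input vectors.
h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h = gru_step(x, h, params)
print(h.shape)
```

Because each new state is a gated blend of the previous state and a tanh output, the hidden state stays inside (-1, 1) when it starts at zero.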
CNN: Convolutional Neural Networks

This architecture is commonly used for image processing, image recognition,
video analysis, and NLP.

A CNN can take in an input image, assign importance to various aspects/objects
in the image, and differentiate one from the others[6]. The name
‘convolutional’ derives from a mathematical operation involving the
convolution of different functions. CNNs consist of an input and an output
layer, as well as multiple hidden layers. The CNN’s hidden layers typically
consist of a series of convolutional layers.
DBN: Deep Belief Network

A DBN is a multilayer network (typically deep, including many hidden layers) in
which each pair of connected layers is a Restricted Boltzmann Machine (RBM).

DBNs use probabilities and unsupervised learning to produce outputs. Unlike
other models, each layer in a DBN learns the entire input. In CNNs, the first
layers only filter inputs for basic features, and the later layers recombine all
the simple patterns found by the previous layers. DBNs work holistically and
regulate each layer in order.

DBNs can be used in image recognition and NLP.

DSN: Deep Stacking Network

DSNs are also frequently called DCNs (Deep Convex Networks). A DSN/DCN
comprises a deep network, but it is actually a set of individual deep networks.
Each network within a DSN has its own hidden layers that process data. This
architecture was designed to ease the training problem, which is
quite complicated for traditional deep learning models.

Typically, DSNs consist of three or more modules. Each module consists of an
input layer, a hidden layer, and an output layer. These modules are stacked one
on top of another, which means that the input of a given module is based on
the output of prior modules/layers.
Understanding a Common
Deep Feed - Forward Networks

Feed-Forward Neural Networks vs Recurrent Neural Networks

Comparison Attribute | Feed-forward Neural Networks | Recurrent Neural Networks
Signal flow direction | Forward only | Bidirectional
Delay introduced | No | Yes
Complexity | Low | High
Neuron independence in the same layer | Yes | No
Speed | High | Slow
Commonly used for | Pattern recognition, speech recognition, and character recognition | Language translation, speech-to-text conversion, and robotic control
Neural Architectures for Feedforward Neural Networks
• Neural architectures for multiclass models are designed to handle tasks where the input
data needs to be classified into one of several possible classes. These architectures are
commonly used in various machine learning tasks such as image classification, natural
language processing, and speech recognition.
• Feedforward Neural Networks (FNN): Also known as Multi-Layer Perceptrons
(MLPs), FNNs are the simplest form of neural networks. They consist of an input layer,
one or more hidden layers, and an output layer. Each layer contains multiple neurons, and
the information flows only in one direction, from the input layer to the output layer. FNNs
can be used for multiclass classification by using softmax activation in the output layer to
convert raw scores into class probabilities.
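
The softmax step mentioned above can be sketched in a few lines of plain Python. The raw scores below are made-up values for a hypothetical three-class problem, and subtracting the maximum score is a standard numerical-stability trick that does not change the result.

```python
import math

def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max to avoid overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [2.0, 1.0, 0.1]  # hypothetical output-layer scores
probs = softmax(raw_scores)
predicted_class = probs.index(max(probs))  # class with the highest probability
print(probs, predicted_class)
```

The class with the largest raw score always receives the largest probability, so the predicted class is unchanged by the softmax itself.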
[Figure: Neural architectures for feedforward and deep feedforward neural networks]

Deep Feed-Forward Networks
1. Deep feed-forward networks, also known as feedforward neural
networks or multilayer perceptrons (MLPs), are a fundamental type of
artificial neural network architecture in the field of deep learning.
2. These networks are characterized by their layered structure, where
information flows in one direction, from input to output, without any cycles
or feedback loops.
3. The term "feedforward" indicates that the network's connections do not
form any loops.

Key characteristics and components of deep feed-forward networks
Layers: A deep feed-forward network typically consists of multiple layers,
including an input layer, one or more hidden layers, and an output layer.
Each layer is composed of one or more neurons (also called nodes or
units).
Neurons: Neurons are the basic computational units within each layer. Each neuron
receives inputs, applies a transformation to these inputs (usually a
weighted sum), and passes the result through an activation function. The
output of a neuron serves as input to the neurons in the subsequent layer.

Weights and Biases: The connections between neurons are represented by weights, which
determine the strength of the connection. Additionally, each neuron
typically has an associated bias term that helps shift the activation
function. These weights and biases are learned during the training process.
Activation Functions: Activation functions introduce non-linearity to the network. Common
activation functions include the sigmoid, hyperbolic tangent (tanh),
and rectified linear unit (ReLU) functions. These functions allow the
network to approximate complex functions.

Forward Propagation: The process of computing the network's output from input data is
known as forward propagation. It involves passing the input through
the layers of neurons, applying weights, biases, and activation
functions to compute the final output.

Training: Deep feed-forward networks are trained using supervised learning
techniques. During training, the network is presented with a dataset of
input-output pairs, and it adjusts its weights and biases to minimize the
difference between its predictions and the true targets. This is typically
done using gradient descent optimization algorithms and a loss
function that quantifies the prediction error.
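
The components listed above (layers, weights, biases, activation functions, forward propagation) can be tied together in a short NumPy sketch. The layer sizes, the random initialization, and the choice of ReLU for hidden layers with a linear output layer are assumptions made for illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """Forward propagation: pass x through each (W, b) pair in order."""
    a = x
    for i, (W, b) in enumerate(layers):
        z = W @ a + b  # weighted sum plus bias
        is_output_layer = i == len(layers) - 1
        a = z if is_output_layer else relu(z)  # ReLU on hidden layers only
    return a

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # hypothetical: 4 inputs, two hidden layers, 2 outputs
layers = [
    (rng.normal(scale=0.5, size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = rng.normal(size=sizes[0])
y = forward(x, layers)
print(y.shape)
```

Training would then adjust every `W` and `b` in `layers` to reduce a loss computed from `y` and the true target.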

Challenges in Deep Learning

Deep learning has made significant advances in various fields, but some challenges
still need to be addressed. The main ones are:

1. Data availability: Deep learning requires large amounts of data to learn from,
so gathering enough training data is a major concern.
2. Computational resources: Training a deep learning model is computationally
expensive and typically calls for specialized hardware such as GPUs and TPUs.
3. Time-consuming: Training, for example on sequential data, can take days or even
months, depending on the computational resources available.
4. Interpretability: Deep learning models are complex and work like a black box,
so it is very difficult to interpret the result.
5. Overfitting: When the model is trained again and again on the same data, it
becomes too specialized for the training data, leading to poor performance
on new data.
Advantages of Deep Learning:

• High accuracy: Deep Learning algorithms can achieve state-of-the-art
performance in various tasks, such as image recognition and natural language
processing.
• Automated feature engineering: Deep Learning algorithms can
automatically discover and learn relevant features from data without the need
for manual feature engineering.
• Scalability: Deep Learning models can scale to handle large and complex
datasets, and can learn from massive amounts of data.
• Flexibility: Deep Learning models can be applied to a wide range of tasks
and can handle various types of data, such as images, text, and speech.
• Continual improvement: Deep Learning models can continually improve their
performance as more data becomes available.
Disadvantages of Deep Learning:
High computational requirements: Deep Learning models require large amounts of data
and computational resources to train and optimize.
Requires large amounts of labeled data: Deep Learning models often require a large
amount of labeled data for training, which can be expensive and time-consuming to acquire.
Interpretability: Deep Learning models can be challenging to
interpret, making it difficult to understand how they make decisions.
Overfitting: Deep Learning models can sometimes overfit to the training data, resulting in
poor performance on new and unseen data.
Black-box nature: Deep Learning models are often treated as black boxes, making it difficult
to understand how they work and how they arrived at their predictions.
In summary, while Deep Learning offers many advantages, including high accuracy and
scalability, it also has some disadvantages, such as high computational requirements, the
need for large amounts of labeled data, and interpretability challenges. These limitations
need to be carefully considered when deciding whether to use Deep Learning for a specific
task.
Gradient-Descent Learning
1. Gradient descent is an iterative optimization algorithm used to find a minimum
of a function (applied to the negated function, it finds a maximum).
2. It is commonly used in machine learning and optimization tasks to update the
parameters of a model in order to minimize a loss function.
3. There are several variations and strategies based on the basic gradient descent
algorithm.
4. A gradient measures how much the output of a function changes if you change the
inputs a little bit.
5. In machine learning, a gradient is the derivative of a function that has more than
one input variable. Known as the slope of a function in mathematical terms, the
gradient measures how the error changes with respect to a change in each weight.
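
A minimal sketch of these ideas on a one-dimensional toy problem: the function `f`, its minimum at 3, the starting point, and the learning rate are all assumptions chosen for illustration. The numerical gradient illustrates point 4 (change the input a little, observe the change in the output), and the loop illustrates the iterative update.

```python
def f(w):
    """Toy loss with a single minimum at w = 3."""
    return (w - 3.0) ** 2

def grad_f(w):
    """Analytic derivative of f."""
    return 2.0 * (w - 3.0)

def numerical_grad(func, w, eps=1e-6):
    """Estimate the slope by nudging the input slightly in both directions."""
    return (func(w + eps) - func(w - eps)) / (2.0 * eps)

w = 0.0              # starting guess
learning_rate = 0.1  # step size
for _ in range(100):
    w -= learning_rate * grad_f(w)  # step against the gradient

print(w, f(w))
print(numerical_grad(f, 1.0), grad_f(1.0))
```

Each iteration moves `w` a fraction of the way toward the minimum, so the loss shrinks steadily for this learning rate.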
Gradient-Descent strategies
There are three popular types that mainly differ in the amount of data they use.

• Batch Gradient Descent: In this basic form of gradient descent, the entire training
dataset is used in each iteration to calculate the gradient and update the model
parameters. It can be computationally expensive for large datasets.

• Stochastic Gradient Descent (SGD): In SGD, a single randomly chosen training example
is used to compute the gradient and update the parameters in each iteration. It is
faster and can handle large datasets, but the updates can be noisy and may lead to
oscillations.

• Mini-batch Gradient Descent: Mini-batch gradient descent combines the ideas of batch
gradient descent and SGD, and it is the preferred technique. It divides the training
dataset into manageable groups (mini-batches) and performs an update for each one
separately.

Stochastic Gradient-Descent strategies

• Stochastic Gradient Descent (SGD) is a widely used optimization
algorithm in machine learning and deep learning.
• It is a variant of gradient descent that updates the model parameters
using a single randomly selected training example in each iteration.
Here's how SGD works:

1. Initialization: Initialize the model parameters (weights and biases) randomly
or using some predefined values.

2. Random Sample Selection: In each iteration, randomly select a single training
example from the dataset. (An epoch is one full pass over the dataset, so it
consists of many such iterations.) This randomness introduces noise into the
optimization process.

3. Calculate Gradient: Compute the gradient of the loss function for the selected
training example. This involves calculating the derivatives of the loss function
with respect to each model parameter.

4. Update Parameters: Update the model parameters using the computed gradient. The
parameters are updated in the opposite direction of the gradient to minimize the
loss. The update rule is: parameter = parameter - learning_rate * gradient

5. Repeat: Repeat steps 2-4 for a predefined number of iterations or until
convergence. Convergence is typically determined by observing if the loss
function has stopped decreasing or has reached a small enough value.
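
The SGD procedure above can be sketched on a tiny linear-regression problem. The synthetic data (points on the line y = 2x + 1), the squared-error loss, the learning rate, and the number of iterations are all assumptions made for this illustration.

```python
import random

random.seed(0)

# Synthetic, noise-free dataset: y = 2x + 1 (an assumption for illustration).
xs = [i / 10.0 for i in range(-20, 21)]
data = [(x, 2.0 * x + 1.0) for x in xs]

def mean_loss(w, b):
    """Average squared error over the whole dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

# Initialization (predefined values).
w, b = 0.0, 0.0
learning_rate = 0.05
initial_loss = mean_loss(w, b)

for step in range(2000):
    # Random sample selection: one example per iteration.
    x, y = random.choice(data)
    # Gradient of the squared error for this single example.
    error = (w * x + b) - y
    grad_w = 2.0 * error * x
    grad_b = 2.0 * error
    # Parameter update: move against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mean_loss(w, b)
print(w, b, initial_loss, final_loss)
```

With noise-free data, the single-example updates drive `w` and `b` toward the line that generated the data; with noisy data the parameters would instead fluctuate around the best fit.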
Stochastic Gradient-Descent strategies: Parameter Update
• SGD updates the model parameters using the gradient of the loss
function with respect to a single randomly chosen training example.

Parameter Update: θ(t+1) = θ(t) - η * ∇L(θ(t), x_i, y_i)

Where:
• θ(t) is the model parameter at iteration t.
• η (eta) is the learning rate.
• ∇L(θ(t), x_i, y_i) is the gradient of the loss function with respect
to the parameters θ(t), computed using the training example (x_i, y_i).
Advantages of Stochastic Gradient-Descent strategies:

1. Faster Convergence: The noisy updates introduced by the random
selection of training examples can help the algorithm escape local
minima and converge faster, especially in ill-conditioned or complex
optimization landscapes.

2. Memory Efficiency: Since only one training example is used in each
iteration, SGD is memory-efficient and can handle large
datasets that might not fit entirely in memory.

3. Online Learning: SGD's ability to adapt quickly to new data points makes
it suitable for online learning scenarios where data arrives in a streaming
fashion.

4. Regularization: The noise introduced by the random selection of
examples can act as a form of regularization, helping to prevent overfitting.
Disadvantages of Stochastic Gradient-Descent strategies:

1. Noisy Updates: The randomness in SGD can lead to oscillations in
the optimization process, causing the loss to fluctuate instead of steadily
decreasing.

2. Slower Convergence in Certain Cases: Due to the noisy updates,
SGD might take longer to converge to the optimal solution
compared to other methods like batch gradient descent.
However, it can escape shallow local minima more easily.

3. Hyperparameter Sensitivity: The choice of learning rate is crucial in SGD.
If the learning rate is too high, the algorithm might diverge; if it is too low,
convergence can be slow.

4. Variance in Parameter Updates: Since each parameter update is based on a
single training example, the updates can exhibit high variance and might not
accurately represent the true gradient.
The mathematical expressions for the basic Stochastic Gradient
Descent (SGD) algorithm and some of its common variations

• Stochastic Gradient Descent (SGD): SGD updates the model parameters using the
gradient of the loss function with respect to a single randomly chosen training
example.
θ(t+1) = θ(t) - η * ∇L(θ(t), x_i, y_i)
Where:
• θ(t) is the model parameter at iteration t.
• η (eta) is the learning rate.
• ∇L(θ(t), x_i, y_i) is the gradient of the loss function with respect to the
parameters θ(t), computed using the training example (x_i, y_i).

• Stochastic Gradient Descent with Momentum: SGD with Momentum adds a momentum
term to the parameter updates to improve convergence by smoothing out noisy updates.
v(t+1) = β * v(t) + (1 - β) * ∇L(θ(t), x_i, y_i)
θ(t+1) = θ(t) - η * v(t+1)
Where:
• v(t) is the velocity term at iteration t.
• β (beta) is the momentum coefficient (typically between 0 and 1).

• Nesterov Accelerated Gradient (NAG): NAG improves upon the basic momentum method
by considering the gradient at a position slightly ahead in the direction of the
momentum.
v(t+1) = β * v(t) + (1 - β) * ∇L(θ(t) - β * v(t), x_i, y_i)
θ(t+1) = θ(t) - η * v(t+1)
Where:
• v(t) is the velocity term at iteration t.
• β (beta) is the momentum coefficient.
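
As a rough sketch, the momentum update written above can be applied to a one-dimensional toy loss. The loss function, starting point, and the values of β and η are assumptions for illustration, and the full gradient stands in for the single-example gradient to keep the example deterministic. NAG is not shown here.

```python
def grad(w):
    """Gradient of the toy loss (w - 3)^2, which has its minimum at w = 3."""
    return 2.0 * (w - 3.0)

w = 0.0      # parameter, theta in the formulas
v = 0.0      # velocity term
beta = 0.9   # momentum coefficient
eta = 0.1    # learning rate

for _ in range(500):
    v = beta * v + (1.0 - beta) * grad(w)  # v(t+1) = beta*v(t) + (1-beta)*gradient
    w = w - eta * v                        # theta(t+1) = theta(t) - eta*v(t+1)

print(w)
```

The velocity is an exponentially weighted average of past gradients, which is what smooths out noisy single-example gradients in practice.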
• Adagrad (Adaptive Gradient Algorithm): Adagrad adapts the learning rate for each
parameter based on the historical gradient information.
G(t+1) = G(t) + ∇L(θ(t), x_i, y_i)^2
θ(t+1) = θ(t) - (η / √(G(t+1) + ε)) * ∇L(θ(t), x_i, y_i)
Where:
• G(t) is a diagonal matrix of accumulated squared gradients up to iteration t.
• ε (epsilon) is a small constant to prevent division by zero.

• RMSProp (Root Mean Square Propagation): RMSProp modifies Adagrad by using a moving
average of squared gradients to adaptively update the learning rate.
G(t+1) = β * G(t) + (1 - β) * ∇L(θ(t), x_i, y_i)^2
θ(t+1) = θ(t) - (η / √(G(t+1) + ε)) * ∇L(θ(t), x_i, y_i)
Where:
• G(t) is the moving average of squared gradients up to iteration t.
• β (beta) is a decay factor for the moving average.
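
The two adaptive update rules can be sketched for a single scalar parameter, where the diagonal matrix G reduces to one number. The toy loss, the learning rates, and the step counts below are assumptions chosen for illustration, and the full gradient is used in place of a single-example gradient.

```python
import math

def grad(w):
    """Gradient of the toy loss (w - 3)^2, which has its minimum at w = 3."""
    return 2.0 * (w - 3.0)

def adagrad(w, eta=0.5, eps=1e-8, steps=500):
    G = 0.0  # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        G = G + g * g
        w = w - (eta / math.sqrt(G + eps)) * g
    return w

def rmsprop(w, eta=0.01, beta=0.9, eps=1e-8, steps=1000):
    G = 0.0  # moving average of squared gradients
    for _ in range(steps):
        g = grad(w)
        G = beta * G + (1.0 - beta) * g * g
        w = w - (eta / math.sqrt(G + eps)) * g
    return w

w_adagrad = adagrad(0.0)
w_rmsprop = rmsprop(0.0)
print(w_adagrad, w_rmsprop)
```

In Adagrad the accumulated sum only grows, so the effective step size only shrinks; RMSProp's moving average lets the effective step size recover when recent gradients are small.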
Hidden Units
The hidden layers are concerned with extracting progressively higher-order
features from the raw data. Depending on the architecture we’re working
with, we tend to use certain subsets of layer activation functions. A hidden
unit takes in a vector/tensor x, computes an affine transformation z, and then
applies an element-wise non-linear function g(z), where z = W^T x + b, with W the
weight matrix and b the bias vector.

Hidden units are differentiated from each other by their activation
function, g(z). The different types of hidden units include:

1. ReLU  2. ELU  3. GELU  4. Maxout  5. PReLU  6. Absolute value rectification
7. LeakyReLU  8. Logistic Sigmoid  9. Hyperbolic Tangent  10. Hard Hyperbolic Tangent
11. Identity  12. Softplus  13. Softmax  14. RBF, etc.
What’s ReLU?
ReLU stands for Rectified Linear Unit. Rectified linear units are the standard that
most practitioners default to, but they are only one of many options. The activation
function is:

g(z) = max(0, z)

As mentioned, this max activation function is applied on top of the affine transformation, z.
When mapped out it has these properties:
What’s Maxout?
Maxout is a flavour of ReLU, which itself is one of the activation functions that
make up a hidden unit. As such, a maxout hidden unit applies an affine transformation
to a vector and then a nonlinear activation function. Since Maxout is a flavour of
ReLU, you are right to assume it uses a max operation. Instead of taking max(0, z)
element-wise, maxout divides z into groups of k values, and each maxout unit outputs
the maximum element of one of these groups:

g(z)_i = max over j in G(i) of z_j

where G(i) is the set of indices of the inputs of group i.

What’s Logistic Sigmoid?
If the ReLU is the reigning queen of activation functions, then the logistic sigmoid
is the former queen, denoted:

σ(z) = 1 / (1 + e^(-z))
What’s Hyperbolic Tangent?
A close relative of the logistic sigmoid is the hyperbolic tangent, related to the
logistic sigmoid by:

tanh(z) = 2σ(2z) - 1

The difference between them is that sigmoid is 1/2 at 0, whereas tanh is 0 at 0. In that
sense, the tanh is more like the identity function, at least around 0.
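What’s RBF?
This function, the Radial Basis Function, becomes more active as x approaches a
template vector W_:,i. It saturates to 0 for most other x, so it can be difficult
to optimize with gradient descent:

h_i = exp(-(1 / σ_i^2) * ||W_:,i - x||^2)

Here σ_i is a width parameter of unit i, not the logistic sigmoid.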
What’s Softplus?
Softplus, g(a) = log(1 + e^a), is a smooth version of the rectifier. Its use is
generally discouraged on empirical grounds. This is counter-intuitive, since it is
meant to be an improvement on ReLU, being differentiable everywhere. But in
practice, it does worse.
What’s the hard hyperbolic tangent, or hard tanh?
It is shaped like the tanh or the rectifier but, unlike the rectifier, it is bounded:
g(a) = max(-1, min(1, a)). It is computationally cheaper than many of the
alternatives. Its output is -1, the line a, or 1, depending on the input.
What’s Identity?

Having an identity function as the activation function is exactly like having no
activation function. A linear unit can be a useful output unit, but it can also be a
decent hidden unit.

If every layer of the network is a linear transformation, the whole network is
also a linear transformation, because a composition of linear maps is itself linear.

Generally, multiplying and adding vectors and matrices acts as a linear
transformation that stretches, combines, rotates, or compresses the input vector
or matrix. A neural network consists entirely of tensor operations, and all of these
tensor operations are just geometric transformations of the input data.
It follows that neural networks are just geometric transformations of the
input data.

A hidden unit is:

h = g(W^T x + b)

Suppose this layer has n inputs and p outputs. With linear hidden units we replace it with:

h = g(V^T U^T x + b)

The first layer uses weight matrix U and the second uses weight matrix V. If the
first layer, U, produces q outputs, together these layers contain (n+p)q parameters,
whereas W alone contains np parameters. For small q, linear hidden units thus offer
an effective way to reduce the number of parameters in a network.
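
The parameter counts quoted above are easy to check with a little arithmetic. The sizes n, p, and q below are hypothetical values picked only to show the scale of the saving, and bias terms are ignored as in the text.

```python
def dense_params(n, p):
    """Weights in a single layer W mapping n inputs to p outputs."""
    return n * p

def factored_params(n, p, q):
    """Weights in U (n inputs to q outputs) plus V (q inputs to p outputs)."""
    return n * q + q * p  # equals (n + p) * q

n, p, q = 1000, 500, 20  # hypothetical layer sizes
print(dense_params(n, p))        # 500000
print(factored_params(n, p, q))  # 30000
```

The factored form only saves parameters when q is small relative to n and p; for q close to min(n, p) it can use more parameters than the single matrix W.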
What’s Softmax?
Softmax units are usually used as output units: when there is a classification
problem and you need to pick one of multiple categories, this is the one to use,
as it boosts the largest category and drags the other categories down. As hidden
units, they are occasionally used in architectures whose goal is to learn to
manipulate memory.

What’s GELU?
GELU stands for Gaussian Error Linear Unit. It is a proposed activation
function, meant to be an improvement on ReLU and its cousins.
Where ReLU gates inputs by their sign, GELU weights inputs by their value.
The paper empirically evaluates GELU against the ReLU and ELU activation functions
on MNIST, tweet processing, and other tasks, and its authors found that GELU
performed better.
Back Propagation Learning

Training a Neural Network with Backpropagation

1. Backpropagation is a supervised learning algorithm for training multi-layer
perceptrons (artificial neural networks).
2. Backpropagation (backward propagation of errors) is an important mathematical
tool for improving the accuracy of predictions in data mining and machine learning.
3. The backpropagation algorithm looks for the minimum value of the error
function in weight space using a technique called the delta rule or gradient
descent. The weights that minimize the error function are then considered to be a
solution to the learning problem.
4. Backpropagation is one way to train our model.
Backpropagation Process & Steps

Steps:

1. Calculate the error – Measure how far the model output is from the actual output.
2. Check for minimum error – Check whether the error has been minimized.
3. Update the parameters – If the error is still large, update the parameters (weights and biases),
then check the error again. Repeat the process until the error reaches a minimum.
4. Model is ready to make a prediction – Once the error is at a minimum, you can
feed some inputs to your model and it will produce the output.
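
The four steps above can be sketched on a deliberately tiny model. Everything in this snippet is an assumption made for illustration: a one-weight model y = w * x, three data points drawn from y = 2x, the learning rate, and the tolerance that stands in for "the error is minimized".

```python
# Hypothetical one-weight model y = w * x, trained on points from y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0               # initial parameter (assumed)
learning_rate = 0.05  # assumed step size
tolerance = 1e-6      # assumed threshold for "error is minimized"

for step in range(1000):
    # Step 1: calculate the error (mean squared error here).
    error = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    # Step 2: check whether the error is small enough.
    if error < tolerance:
        break
    # Step 3: update the parameter along the negative gradient.
    grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

# Step 4: the model is ready to make a prediction.
prediction = w * 4.0
```

In a real network the single weight becomes many weights and biases, and the gradient in step 3 is what backpropagation computes.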
DL by Dr. B.S.Panigrahi 82
Errors, Training and Test Loss
• Loss is a number indicating how bad (inaccurate) the model's prediction
was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is
greater.
• A loss that does not decrease during training means the model is not learning; probably there's
something wrong with either the model or the optimization process.
• The goal of training a model is to find a set of weights and biases that have low loss, on
average, across all examples.
• Computationally, the training loss is calculated by taking the sum of errors for each
example in the training set; the test loss is the same quantity computed on held-out test examples.

For example,
let's compare a high-loss model (A) on the left and a low-loss model (B) on the right, where:
• The arrows represent loss.
• The blue lines represent predictions.
• Notice that the arrows in the left plot are much longer than their counterparts in the right plot. Clearly,
the line in the right plot is a much better predictive model than the line in the left plot.

Figure: (A) high-loss model, left; (B) low-loss model, right.
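
The comparison between the two plots can be reproduced numerically. The data points and the two candidate lines below are hypothetical, not read from the figure; the squared difference plays the role of the arrow length.

```python
# Hypothetical data points (x, y); not taken from the slide's figure.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 3.9)]

def squared_losses(slope, intercept, data):
    """Per-example squared error of the line y = slope * x + intercept."""
    return [(slope * x + intercept - y) ** 2 for x, y in data]

losses_a = squared_losses(0.2, 3.0, points)  # model A: a poorly fitting line
losses_b = squared_losses(1.0, 0.0, points)  # model B: a closely fitting line

# Training loss as the sum of the per-example errors.
total_a = sum(losses_a)
total_b = sum(losses_b)
```

Model B has the smaller total, which matches the shorter arrows in the right-hand plot.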
Example:
Training a neural network using the backpropagation algorithm.
• In this example, we'll build and train a feedforward neural network to solve a basic
binary classification problem. The network will have one hidden layer and will use the sigmoid
activation function. We'll use the mean squared error (MSE) loss function and gradient descent as
the optimization algorithm.

• Let's take the following training dataset (the XOR function of the two inputs):

Input 1    Input 2    Target
0          0          0
0          1          1
1          0          1
1          1          0
Step 1: Initialize the neural network parameters
• Let's start by initializing the neural network parameters: weights and biases for the
input-to-hidden and hidden-to-output layers.
Let's assume:
• The hidden layer has 2 neurons.
• Weights and biases are initialized randomly.
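
A minimal sketch of this initialization for the 2-2-1 network follows. The variable names, the uniform range [-1, 1], and the fixed seed are assumptions for illustration; the slides only say the values are random.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable (an assumption)

n_inputs, n_hidden = 2, 2

def rand_weight():
    # Assumed initialization range; the slides do not specify one.
    return random.uniform(-1.0, 1.0)

# w_hidden[j][i] connects input i to hidden neuron j.
w_hidden = [[rand_weight() for _ in range(n_inputs)] for _ in range(n_hidden)]
b_hidden = [rand_weight() for _ in range(n_hidden)]
# w_out[j] connects hidden neuron j to the single output neuron.
w_out = [rand_weight() for _ in range(n_hidden)]
b_out = rand_weight()

# 4 hidden weights + 2 hidden biases + 2 output weights + 1 output bias.
n_params = n_inputs * n_hidden + n_hidden + n_hidden + 1
```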

Step 2: Forward Pass


Next, perform a forward pass through the network to compute the predicted outputs.

• Input layer: The input layer consists of two neurons, one for each input feature.
• Hidden layer: The hidden layer consists of two neurons. Each neuron will receive inputs from
the input layer, and we'll apply the sigmoid activation function to the weighted sum of inputs.
• Output layer: The output layer consists of one neuron. It receives inputs from the hidden layer
and also applies the sigmoid activation function.

The formulas for forward pass are as follows:
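
For the network described in Step 2, a standard formulation (with notation assumed here, since the slide does not fix any) is: each hidden neuron j computes h_j = sigmoid(w_j1 * x_1 + w_j2 * x_2 + b_j), and the output neuron computes y_hat = sigmoid(v_1 * h_1 + v_2 * h_2 + c), where sigmoid(z) = 1 / (1 + e^(-z)). The sketch below implements this with hypothetical parameter values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Return (hidden activations, network output) for one input vector x."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(w_row, x)) + b)
        for w_row, b in zip(w_hidden, b_hidden)
    ]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output

# Hypothetical parameter values chosen for illustration.
w_hidden = [[0.5, -0.5], [0.3, 0.8]]
b_hidden = [0.0, 0.1]
w_out = [1.0, -1.0]
b_out = 0.2

# Forward pass for the dataset row (Input 1 = 0, Input 2 = 1).
hidden, y_hat = forward([0.0, 1.0], w_hidden, b_hidden, w_out, b_out)
```

Because the output neuron uses a sigmoid, `y_hat` always lies strictly between 0 and 1 and can be compared directly with the 0/1 targets.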

Step 3: Compute Loss
Now, we'll compute the mean squared error loss using the
predicted outputs and the target values from the dataset.
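
A common form of this loss is MSE = (1/N) * sum over examples of (y_hat - y)^2; some texts add a factor of 1/2, which only rescales the gradients. The sketch below applies it to the four targets of the dataset, with hypothetical predictions standing in for the output of an untrained network.

```python
def mse(predictions, targets):
    """Mean squared error over paired predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

targets = [0.0, 1.0, 1.0, 0.0]      # Target column of the dataset
predictions = [0.5, 0.5, 0.5, 0.5]  # hypothetical outputs of an untrained network
loss = mse(predictions, targets)
```

A network that always answers 0.5 gets the same loss on every row, so training has to move the outputs toward the targets to bring the loss down.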

Step 4: Backpropagation - Compute Gradients
• In the backpropagation step, we compute the gradients of the loss with
respect to the network parameters. These gradients will tell us the
direction in which to adjust the weights and biases to minimize the
loss.
• We use the chain rule to calculate the gradients at each layer.
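
As a sketch of the chain rule for the 2-2-1 sigmoid network, the function below computes the gradients of the squared error (y_hat - target)^2 for a single example. The function and variable names and the parameter values are assumptions for illustration; the derivative of the sigmoid is written as s * (1 - s), where s is the sigmoid output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradients(x, target, w_hidden, b_hidden, w_out, b_out):
    """Gradients of the squared error (y_hat - target)**2 for one example."""
    # Forward pass, keeping the intermediate activations.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    y_hat = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

    # Output layer: dL/dz_out = dL/dy_hat * dy_hat/dz_out.
    delta_out = 2.0 * (y_hat - target) * y_hat * (1.0 - y_hat)
    grad_w_out = [delta_out * h for h in hidden]
    grad_b_out = delta_out

    # Hidden layer: propagate delta_out back through w_out and the sigmoid.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    grad_w_hidden = [[d * xi for xi in x] for d in delta_hidden]
    grad_b_hidden = delta_hidden
    return grad_w_hidden, grad_b_hidden, grad_w_out, grad_b_out

# Hypothetical parameter values chosen for illustration.
w_hidden = [[0.5, -0.5], [0.3, 0.8]]
b_hidden = [0.0, 0.1]
w_out = [1.0, -1.0]
b_out = 0.2
g_wh, g_bh, g_wo, g_bo = gradients([0.0, 1.0], 1.0, w_hidden, b_hidden, w_out, b_out)
```

Weights attached to an input of 0 receive a zero gradient for that example, because the input multiplies the hidden-layer delta.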
Step 5: Update Weights and Biases

• With the gradients computed, we can now update the weights and biases using gradient
descent.
• Learning rate (α) is a hyperparameter that controls the step size in the gradient descent
update.
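
The usual gradient descent rule is w <- w - α * dL/dw, applied to every weight and bias. The sketch below applies one such step; the helper name `sgd_update` and all numeric values are hypothetical.

```python
def sgd_update(params, grads, learning_rate):
    """Return new parameters after one gradient descent step."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

alpha = 0.5            # learning rate (assumed value)
weights = [1.0, -1.0]  # hypothetical current weights
grads = [0.2, -0.4]    # hypothetical gradients of the loss
new_weights = sgd_update(weights, grads, alpha)  # approximately [0.9, -0.8]
```

A positive gradient lowers the weight and a negative gradient raises it; a larger α takes bigger steps and risks overshooting the minimum.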

Step 6: Repeat Training
Now, you can repeat steps 2 to 5 (forward pass, compute loss,
backpropagation, update weights and biases) for multiple
iterations (epochs) until the loss converges to a minimum
or reaches a satisfactory value.
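
Putting steps 2 to 5 together gives a training loop along the following lines. This is a sketch under stated assumptions, not a definitive implementation: the seed, the initialization range, the learning rate, the number of epochs, and the choice of full-batch gradient descent are all illustrative.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# The training dataset from the example: ([Input 1, Input 2], Target).
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

random.seed(0)  # assumed seed for repeatability
w_h = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b_h = [random.uniform(-1, 1) for _ in range(2)]
w_o = [random.uniform(-1, 1) for _ in range(2)]
b_o = random.uniform(-1, 1)
alpha, epochs = 0.5, 2000  # assumed hyperparameters

history = []  # MSE recorded at the start of each epoch
for epoch in range(epochs):
    # Accumulators for the gradients of the mean squared error.
    g_wh = [[0.0, 0.0], [0.0, 0.0]]
    g_bh = [0.0, 0.0]
    g_wo = [0.0, 0.0]
    g_bo = 0.0
    loss = 0.0
    for x, t in data:
        # Step 2: forward pass.
        h = [sigmoid(w_h[j][0] * x[0] + w_h[j][1] * x[1] + b_h[j]) for j in range(2)]
        y = sigmoid(w_o[0] * h[0] + w_o[1] * h[1] + b_o)
        # Step 3: accumulate the loss.
        loss += (y - t) ** 2 / len(data)
        # Step 4: backpropagation via the chain rule.
        d_o = 2.0 * (y - t) * y * (1.0 - y) / len(data)
        g_bo += d_o
        for j in range(2):
            d_h = d_o * w_o[j] * h[j] * (1.0 - h[j])
            g_wo[j] += d_o * h[j]
            g_bh[j] += d_h
            for i in range(2):
                g_wh[j][i] += d_h * x[i]
    history.append(loss)
    # Step 5: gradient descent update.
    b_o -= alpha * g_bo
    for j in range(2):
        w_o[j] -= alpha * g_wo[j]
        b_h[j] -= alpha * g_bh[j]
        for i in range(2):
            w_h[j][i] -= alpha * g_wh[j][i]
```

Inspecting `history` shows how the loss evolves over the epochs. With only two hidden neurons the network can settle in a poor local minimum for some initializations, so a different seed or more hidden neurons may be needed before every row is classified correctly.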

Question 1:
Optimize the weights so that the neural network can learn how to correctly map arbitrary inputs
to outputs.

Reference solution
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/