Deep Learning

The document outlines a syllabus for a Deep Learning course covering various topics such as neural networks, training feedforward deep neural networks, autoencoders, convolutional neural networks, recurrent neural networks, and recent trends like GANs. It includes detailed explanations of fundamental concepts, architectures, activation functions, and learning techniques. The course aims to provide a comprehensive understanding of deep learning methodologies and their applications.

Uploaded by

MUHAMMADISMAIL SHAIKH


Deep Learning

B.E. Computer Sem-VIII

Dr. Shivaji Pawar

Syllabus
Unit One: Fundamentals of Neural Network
1.1 Biological neuron, Mc-Culloch Pitts Neuron, Perceptron, Perceptron learning
1.2 Delta learning, Multilayer Perceptron: Linearly separable, linearly non-
separable classes
1.3 Brief History, Three Classes of Deep Learning, Basic Terminologies of Deep
Learning
Unit:2 Training Feedforward DNN
2.1 Multi Layered Feed Forward Neural Network, Learning Factors,
2.2 Activation functions: Tanh, Logistic, Linear, Softmax, ReLU, Leaky ReLU,
2.3 Loss functions: Squared Error loss, Cross Entropy, Choosing output function
and loss function
2.4 Optimization, Learning with backpropagation, Learning Parameters: Gradient
Descent (GD), Stochastic and Mini Batch GD, Nesterov Accelerated GD,
2.5 AdaGrad, Adam, RMSProp Parameter sharing,
2.6 Dropout, Weight Decay, Batch normalization,
2.7 Early stopping, Data Augmentation, Adding noise to input and output
Unit 3: Autoencoders: Unsupervised Learning

3.1 Introduction, Linear Autoencoder,

3.2 Undercomplete Autoencoder,
3.3 Overcomplete Autoencoders, Regularization in Autoencoders
3.4 Denoising Autoencoders, Sparse Autoencoders,
3.5 Contractive Autoencoders
3.6 Application of Autoencoders: Image Compression
Unit:4 Convolutional Neural Networks (CNN): Supervised
Learning
4.1 Convolution operation, Padding, Stride, Relation between input, output
and filter size, CNN architecture:
4.2 Convolution layer, Pooling Layer, Weight Sharing in CNN,
4.3 Fully Connected NN vs CNN,
4.4 Variants of basic Convolution function
4.5 Modern Deep Learning Architectures:
4.6 LeNET: Architecture,
4.7 AlexNET: Architecture
Unit:5 Recurrent Neural Networks (RNN)
5.1 Sequence Learning Problem, Unfolding Computational graphs,
5.2 Recurrent Neural Network, Bidirectional RNN,
5.3 Long Short Term Memory:
5.4 Selective Read, Selective write, Selective Forget, Gated Recurrent Unit

Unit:6 Recent Trends and Applications

It may be divided into 2 parts. The first part, g, takes an input (ahem dendrite ahem) and
performs an aggregation; based on the aggregated value, the second part, f, makes a decision.
AND Function
An AND function neuron would only fire when ALL the inputs are ON i.e., g(x) ≥ 3 here.
OR Function
I believe this is self explanatory as we know that an OR function neuron would fire if ANY
of the inputs is ON i.e., g(x) ≥ 1 here.
NOR Function
For a NOR neuron to fire, we want ALL the inputs to be 0 so the thresholding parameter
should also be 0 and we take them all as inhibitory input.
NOT Function
For a NOT neuron, 1 outputs 0 and 0 outputs 1. So we take the input as an inhibitory input
and set the thresholding parameter to 0. It works!
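The Boolean units described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `mp_neuron` and the way inhibitory inputs are passed (as a set of indices) are assumptions made for this example.

```python
def mp_neuron(inputs, threshold, inhibitory=()):
    """McCulloch-Pitts unit: g aggregates the Boolean inputs, f thresholds the sum.

    `inhibitory` holds the indices of inhibitory inputs (an assumption of
    this sketch); any inhibitory input that is ON forces the output to 0.
    """
    if any(inputs[i] == 1 for i in inhibitory):
        return 0
    g = sum(inputs)                      # aggregation g(x)
    return 1 if g >= threshold else 0    # decision f(g(x))


def and3(x):
    return mp_neuron(x, threshold=3)     # fires only when all 3 inputs are ON


def or3(x):
    return mp_neuron(x, threshold=1)     # fires when any input is ON


def nor2(x):
    return mp_neuron(x, threshold=0, inhibitory=(0, 1))


def not1(x):
    return mp_neuron(x, threshold=0, inhibitory=(0,))


print(and3((1, 1, 1)), and3((1, 0, 1)))  # 1 0
print(or3((0, 0, 0)), or3((0, 1, 0)))    # 0 1
print(nor2((0, 0)), nor2((1, 0)))        # 1 0
print(not1((1,)), not1((0,)))            # 0 1
```

Note that the only thing that changes between AND and OR is the hand-chosen threshold, which is exactly the limitation the later sections address.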
Analysis of OR Function as a decision boundary
The inputs are Boolean, so only 4 combinations are possible: (0,0), (0,1), (1,0) and (1,1).
Plotting them on a 2D graph and using the OR function's aggregation equation,
x_1 + x_2 ≥ 1, we can draw the decision boundary x_1 + x_2 = 1 as shown in the
graph below.
AND Function

In this case, the decision boundary equation is x_1 + x_2 = 2. Here, the only
input point that lies ON or ABOVE the boundary is (1,1), so it is the only one
that outputs 1 when passed through the AND function M-P neuron.
OR Function With 3 Inputs

Let's generalize this by looking at a 3 input OR function M-P unit. In this
case, the possible inputs are 8 points: (0,0,0), (0,0,1), (0,1,0), (1,0,0),
(1,0,1),… you got the point(s). We can map these on a 3D graph and this time
we draw a decision boundary in 3 dimensions.
The plane that satisfies the decision boundary equation x_1 + x_2 + x_3 = 1
is shown below:

Take your time and convince yourself by looking at the above plot that all the points that
lie ON or ABOVE that plane (positive half space) will result in output 1 when passed
through the OR function M-P unit and all the points that lie BELOW that plane
(negative half space) will result in output 0.
Limitations Of M-P Neuron
1. What about non-boolean (say, real) inputs?
2. Do we always need to hand code the threshold?
3. Are all inputs equal? What if we want to assign more importance to some inputs?
4. What about functions which are not linearly separable? Say XOR function.

Conclusion
1. In this Topic, we briefly looked at biological neurons.
2. We then established the concept of the McCulloch-Pitts neuron, the first ever mathematical
model of a biological neuron.
3. We represented a bunch of Boolean functions using the M-P neuron.
4. We also tried to get a geometric intuition of what is going on with the model, using 3D plots.
5. In the end, we also established a motivation for a more generalized model, the one and only
artificial neuron/perceptron model.
Perceptron model

Perceptron is a Machine Learning algorithm for supervised learning of various
binary classification tasks.
Further, Perceptron is also understood as an Artificial Neuron or neural network unit
that helps to detect certain input data computations in business intelligence.
Basic Components of Perceptron
Input Nodes or Input Layer:
This is the primary component of Perceptron which accepts the initial data into the system
for further processing. Each input node contains a real numerical value.
Weight and Bias:
The weight parameter represents the strength of the connection between units and is one of
the most important parameters of the Perceptron. The larger the magnitude of a weight, the
more the associated input influences the output. Further, the bias can be considered as the
intercept term in a linear equation.

Activation Function:
This is the final important component, which helps to determine whether the neuron
will fire or not. In the Perceptron, the activation function is primarily a step function.
How does Perceptron work?
In Machine Learning, Perceptron is considered as a single-layer neural network that
consists of four main parameters named input values (Input nodes), weights and Bias, net
sum, and an activation function.
The perceptron model begins by multiplying each input value by its weight,
then adds these products together (along with the bias) to create the weighted sum.
The activation function 'f' is then applied to this weighted sum to obtain the desired
output.
This activation function is also known as the step function and is represented by 'f'.
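The weighted sum followed by a step function can be sketched as below. The forward pass follows the description above; the training loop uses the classic perceptron update rule (weight += learning rate × error × input), which is listed in the syllabus but not spelled out in these notes, so treat it, the zero initialization, and the names `predict` and `train_perceptron` as assumptions of this example.

```python
def step(z):
    """Step activation f: fire (1) when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0


def predict(x, w, b):
    """Weighted sum of the inputs plus bias, passed through the step function."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_perceptron(X, y, lr=1.0, epochs=20):
    """Classic perceptron learning rule (assumed here, not taken from the notes)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            error = yi - predict(xi, w, b)      # +1, 0 or -1
            w = [wj + lr * error * xj for wj, xj in zip(w, xi)]
            b += lr * error
    return w, b


# AND is linearly separable, so the rule converges on it.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y_and = [0, 0, 0, 1]
w, b = train_perceptron(X, y_and)
print([predict(x, w, b) for x in X])  # [0, 0, 0, 1]
```

Unlike the M-P neuron, the weights and the threshold (here the bias) are learned from data rather than hand coded.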
Types of Perceptron Models
Based on the layers, Perceptron models are divided into two types. These are as
follows:
1. Single-layer Perceptron Model
2. Multi-layer Perceptron model
1. Single Layer Perceptron Model:
This is one of the simplest types of artificial neural network (ANN). A single-layer
perceptron model consists of a feed-forward network and includes a threshold transfer
function inside the model. The main objective of the single-layer perceptron model is to
classify linearly separable objects with binary outcomes. A single-layer perceptron
has no prior knowledge of the data, so it begins with randomly initialized
weight parameters. It then sums up all the weighted inputs. If the total sum
is more than a pre-determined threshold value, the model gets activated
and shows the output value as +1.
Multi-Layered Perceptron Model

Like a single-layer perceptron model, a multi-layer perceptron model has the same
basic structure but adds one or more hidden layers.

The multi-layer perceptron model is trained with the Backpropagation algorithm,
which executes in two stages as follows:

1. Forward Stage: Activations are computed starting from the input layer
and terminating at the output layer.
2. Backward Stage: In the backward stage, weight and bias values are modified as per
the model's requirement. In this stage, the error between the actual output and the desired
output is propagated backward, starting at the output layer and ending at the input layer.
Advantages of Multi-Layer Perceptron

1. A multi-layered perceptron model can be used to solve complex non-linear problems.

2. It works well with both small and large input data.
3. It helps us to obtain quick predictions after the training.
4. It helps to obtain the same accuracy ratio with large as well as small data.

Disadvantages of Multi-Layer Perceptron

1. In Multi-layer perceptron, computations are difficult and time-consuming.

2. In multi-layer Perceptron, it is difficult to predict how much each independent
variable affects the dependent variable.
3. The model functioning depends on the quality of the training.
Linearly separable, linearly non-separable classes

Linear and non-linear separability. (A) shows two classes of data that can be separated by
a single straight line (i.e. they are 'linearly separable'). (B) shows two classes of data that
require a more complex classification topology (i.e. they are not linearly separable).
Neural networks are capable of solving the complex topography in (B), whereas more
traditional statistical methods (such as binary logistic regression, the obvious comparator in
this case) are not.
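Linear separability can be checked by brute force for two Boolean inputs. The sketch below searches a small grid of weights and thresholds for a single line w1*x1 + w2*x2 ≥ θ that reproduces a target truth table. The grid and the helper name `separable_on_grid` are assumptions of this example; a failed grid search is only an illustration, not a proof, although XOR is in fact known to have no separating line.

```python
from itertools import product

POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def separable_on_grid(targets, grid):
    """Return True if some (w1, w2, theta) on the grid separates the classes."""
    for w1, w2, theta in product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 >= theta) == bool(t)
               for (x1, x2), t in zip(POINTS, targets)):
            return True
    return False


grid = [i / 2 for i in range(-6, 7)]   # -3.0, -2.5, ..., 3.0

print(separable_on_grid([0, 1, 1, 1], grid))  # OR  -> True
print(separable_on_grid([0, 0, 0, 1], grid))  # AND -> True
print(separable_on_grid([0, 1, 1, 0], grid))  # XOR -> False
```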
What is Deep Learning?
It is a technique in which machines are trained to
mimic the behavior of the human brain, i.e. to act or
behave like a human being.
History of deep learning

1943: Threshold logic neuron, the precursor of the Perceptron (Warren McCulloch and Walter Pitts)
1985: Hinton and Rumelhart, Williams
2000: The Vanishing Gradient Problem
From 2012 onwards…

Deep learning architecture is becoming so popular due to:
1. Exponential growth in data
2. Technology upgrades
3. Deep learning architectures are suitable for complex
applications like object detection, face recognition,
recommender systems, NLP and chatbots
Types of Deep Learning architecture
Unit:2 Training Feedforward DNN

Multi Layered Feed Forward Neural Network, Learning Factors,

Activation functions: Tanh, Logistic, Linear, Softmax, ReLU, Leaky ReLU,
Loss functions: Squared Error loss, Cross Entropy, Choosing output function and
loss function
Optimization, Learning with backpropagation, Learning Parameters: Gradient
Descent (GD), Stochastic and Mini Batch GD, Nesterov Accelerated GD,
AdaGrad, Adam, RMSProp Parameter sharing,
Dropout, Weight Decay, Batch normalization,
Early stopping, Data Augmentation, Adding noise to input and output
Elements of a Neural Networks Architecture
Input Layer
The input layer takes raw input from the domain. No computation is performed at this
layer. Nodes here just pass on the information (features) to the hidden layer.

Hidden Layer
As the name suggests, the nodes of this layer are not exposed. They provide an abstraction
to the neural network. The hidden layer performs all kinds of computation on the features
entered through the input layer and transfers the result to the output layer.

Output Layer
It’s the final layer of the network that brings the information learned through the hidden
layer and delivers the final value as a result.
Note: All hidden layers usually use the same activation function. However, the output layer
will typically use a different activation function from the hidden layers. The choice
depends on the goal or type of prediction made by the model.
Feedforward vs. Backpropagation
When learning about neural networks, you will come across two essential terms describing
the movement of information—feedforward and backpropagation.
Feedforward Propagation –
The flow of information occurs in the forward direction. The input is used to calculate
some intermediate function in the hidden layer, which is then used to calculate an output.
In the feedforward propagation, the Activation Function is a mathematical “gate” in
between the input feeding the current neuron and its output going to the next layer.
Backpropagation -
The weights of the network connections are repeatedly adjusted to minimize the difference
between the actual output vector of the net and the desired output vector.
To put it in simple words:
Backpropagation aims to minimize the cost function by adjusting the network’s weights
and biases. The cost function gradients determine the level of adjustment with respect to
parameters like activation function, weights, bias, etc.
Why do Neural Networks Need an Activation Function?
So we know what Activation Function is and what it does, but—

Why do Neural Networks need it?

Well, the purpose of an activation function is to add non-linearity to the neural network.
Activation functions
To put it in simple terms, an artificial neuron calculates the 'weighted sum' of its inputs
and adds a bias, as shown in the figure below by the net input.

Mathematically,
net input = sum(weight_i * input_i) + bias
Types of Neural Networks Activation Functions
1. Binary Step Function
Binary step function depends on a threshold value that decides whether a neuron should be
activated or not.
The input fed to the activation function is compared to a certain threshold; if the input is
greater than it, then the neuron is activated, else it is deactivated, meaning that its output is
not passed on to the next hidden layer.
Mathematically it can be represented as: f(x) = 0 if x < 0, and f(x) = 1 if x ≥ 0 (taking the threshold as 0).
The limitations of binary step function:

1. It cannot provide multi-value outputs; for example, it cannot be used for
multi-class classification problems.

2. The gradient of the step function is zero, which causes problems in the
backpropagation process.
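As a small illustration of the net input and the binary step function, the sketch below computes the weighted sum plus bias and then thresholds it. The function names, the example numbers, and the choice to fire when the net input is greater than or equal to the threshold are assumptions of this example.

```python
def net_input(weights, inputs, bias):
    """Weighted sum of the inputs plus a single bias term."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def binary_step(z, threshold=0.0):
    """Activate (1) when the net input reaches the threshold, else 0."""
    return 1 if z >= threshold else 0


z = net_input(weights=[0.5, -0.2], inputs=[1.0, 2.0], bias=0.1)
print(round(z, 6), binary_step(z))   # 0.2 1
print(binary_step(-0.3))             # 0
```

Because the output jumps between 0 and 1 and is flat everywhere else, the derivative carries no information for backpropagation, which is the second limitation listed above.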
Non-Linear Neural Networks Activation Functions

1. Sigmoid / Logistic Activation Function

This function takes any real value as input and outputs values in the range of 0 to 1.

The larger the input (more positive), the closer the output value will be to 1.0, whereas
the smaller the input (more negative), the closer the output will be to 0.0, as shown
below.
Mathematically it can be represented as: f(x) = 1 / (1 + e^{-x})

Why is the sigmoid/logistic activation function one of the most widely used functions?
These are the advantages:
I. It is commonly used for models where we have to predict the probability as an output.
Since the probability of anything exists only in the range of 0 to 1, sigmoid is the
right choice because of its range.
II. The function is differentiable and provides a smooth gradient, i.e., preventing jumps in
output values. This is represented by the S-shape of the sigmoid activation function.
The limitations of sigmoid function are discussed below:

1. The derivative of the function is f'(x) = sigmoid(x)*(1-sigmoid(x)).

As we can see from the above Figure, the gradient values are only significant for range
-3 to 3, and the graph gets much flatter in other regions.

It implies that for values greater than 3 or less than -3, the function will have very
small gradients. As the gradient value approaches zero, the network ceases to
learn and suffers from the Vanishing gradient problem.

2. The output of the logistic function is not symmetric around zero. So the output of all
the neurons will be of the same sign. This makes the training of the neural
network more difficult and unstable.
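The derivative stated above, sigmoid(x)*(1-sigmoid(x)), can be evaluated numerically to see how quickly the gradient shrinks outside the range -3 to 3. This is a minimal sketch using only the standard library; the sample points are chosen for illustration.

```python
import math


def sigmoid(x):
    """Logistic function: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (0.0, 3.0, 6.0):
    print(x, round(sigmoid(x), 4), round(sigmoid_grad(x), 4))
# The gradient peaks at 0.25 for x = 0 and is already below 0.05 at x = 3.
```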
2. Tanh Function (Hyperbolic Tangent)
Tanh function is very similar to the sigmoid/logistic activation function, and even has the
same S-shape with the difference in output range of -1 to 1. In Tanh, the larger the input
(more positive), the closer the output value will be to 1.0, whereas the smaller the input
(more negative), the closer the output will be to -1.0.

Mathematically it can be represented as: f(x) = (e^x - e^{-x}) / (e^x + e^{-x})

Advantages of using this activation function are:

The output of the tanh activation function is zero centered; hence we can easily map the
output values as strongly negative, neutral, or strongly positive.
It is usually used in hidden layers of a neural network as its values lie between -1 and 1;
therefore, the mean for the hidden layer comes out to be 0 or very close to it. It helps in
centering the data and makes learning for the next layer much easier.
Have a look at the gradient of the tanh activation function to understand its limitations.

As you can see, it also faces the problem of vanishing gradients similar to
the sigmoid activation function. Plus, the gradient of the tanh function is much
steeper as compared to the sigmoid function.
Note: Although both sigmoid and tanh face the vanishing gradient issue, tanh is zero
centered, and the gradients are not restricted to move in a certain direction. Therefore, in
practice, tanh nonlinearity is usually preferred to sigmoid nonlinearity.
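The two properties discussed here, zero-centered outputs and a steeper gradient, can be checked numerically. The sketch below compares tanh with sigmoid on a symmetric set of inputs; the sample inputs are an assumption chosen for illustration, and the tanh derivative used is the standard identity 1 - tanh(x)^2.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]           # inputs symmetric around zero
tanh_mean = sum(math.tanh(x) for x in xs) / len(xs)
sigmoid_mean = sum(sigmoid(x) for x in xs) / len(xs)

print(tanh_mean, sigmoid_mean)             # about 0.0 versus about 0.5
print(tanh_grad(0.0))                      # 1.0, versus 0.25 for sigmoid
print(tanh_grad(3.0))                      # still vanishes for large |x|
```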

3. ReLU Function
ReLU stands for Rectified Linear Unit.

Although it gives an impression of a linear function, ReLU has a derivative function and
allows for backpropagation while simultaneously making it computationally efficient.
The main catch here is that the ReLU function does not activate all the neurons at the same
time. A neuron is deactivated only if the output of the linear transformation is less
than 0.
Mathematically it can be represented as: f(x) = max(0, x)

The advantages of using ReLU as an activation function are as follows:

1. Since only a certain number of neurons are activated, the ReLU function is far more
computationally efficient when compared to the sigmoid and tanh functions.
2. ReLU accelerates the convergence of gradient descent towards a minimum of the loss
function due to its linear, non-saturating property.
The limitations faced by this function are:

The Dying ReLU problem, which is explained below.

I. The negative side of the graph makes the gradient value zero. Due to this reason,
during the backpropagation process, the weights and biases for some neurons are
not updated. This can create dead neurons which never get activated.

II. All the negative input values become zero immediately, which decreases the model’s
ability to fit or train from the data properly.
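A dead neuron can be illustrated with a single ReLU unit whose pre-activation is negative for every input it sees. In the sketch below the weight, bias, and inputs are made-up values chosen so that this happens; the gradient of the output with respect to the weight is then zero for every sample, so gradient descent would never update it.

```python
def relu(z):
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


# Hypothetical neuron: every pre-activation w*x + b is negative.
w, b = -1.0, -0.5
inputs = [0.5, 1.0, 2.0]

outputs = [relu(w * x + b) for x in inputs]
grads_w = [relu_grad(w * x + b) * x for x in inputs]   # d(output)/dw per sample

print(outputs)   # [0.0, 0.0, 0.0]
print(grads_w)   # [0.0, 0.0, 0.0] -> no learning signal reaches w
```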
4. Leaky ReLU Function
Leaky ReLU is an improved version of ReLU function to solve the Dying ReLU problem as
it has a small positive slope in the negative area.

Mathematically it can be represented as: f(x) = max(αx, x), where α is a small positive constant (α = 0.01 is a common choice).

The advantages of Leaky ReLU are same as that of ReLU, in addition to the fact that it
does enable backpropagation, even for negative input values.

By making this minor modification for negative input values, the gradient of the left
side of the graph comes out to be a non-zero value. Therefore, we would no longer
encounter dead neurons in that region

Here is the derivative of the Leaky ReLU function.

The limitations that this function faces include:

1. The predictions may not be consistent for negative input values.
2. The gradient for negative values is a small value that makes the learning
of model parameters time-consuming.
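A minimal sketch of Leaky ReLU and its derivative follows. The slope value 0.01 for the negative side is an assumption (it is a common default, but the notes only say "a small positive slope"), and the parameter name `alpha` is chosen for this example.

```python
def leaky_relu(x, alpha=0.01):
    """Identity for positive inputs, small slope alpha for negative inputs."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Gradient is 1 on the positive side and alpha (non-zero) on the negative side."""
    return 1.0 if x > 0 else alpha


print(leaky_relu(3.0), leaky_relu(-2.0))            # 3.0 -0.02
print(leaky_relu_grad(3.0), leaky_relu_grad(-2.0))  # 1.0 0.01
```

The non-zero gradient on the negative side is what keeps the neuron trainable, but because it is small, updates driven by negative inputs are slow, which matches the second limitation above.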
Softmax Function
Before exploring the ins and outs of the Softmax activation function, we should focus on
its building block—the sigmoid/logistic activation function that works on calculating
probability values.
The output of the sigmoid function was in the range of 0 to 1, which can be thought of as
probability.

But—

This function faces certain problems.

Let’s suppose we have five output values of 0.8, 0.9, 0.7, 0.8, and 0.6, respectively. How
can we move forward with it?

The answer is: We can’t.

The above values don’t make sense as the sum of all the classes/output probabilities should
be equal to 1.
You see, the Softmax function is described as a combination of multiple sigmoids.

It calculates the relative probabilities. Similar to the sigmoid/logistic activation
function, the Softmax function returns the probability of each class.

It is most commonly used as an activation function for the last layer of the neural
network in the case of multi-class classification.

Mathematically it can be represented as: softmax(z_i) = e^{z_i} / \sum_j e^{z_j}

Let’s go over a simple example together.

Assume that you have three classes, meaning that there would be three neurons in the
output layer. Now, suppose that your output from the neurons is [1.8, 0.9, 0.68].

Applying the Softmax function over these values to give a probabilistic view will result in
the following outcome: [0.58, 0.23, 0.19].

The Softmax output itself is a probability for every class. Taking the argmax of it returns
1 for the largest probability index and 0 for the other two array indexes.
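The worked example above can be reproduced with a short Softmax sketch. Subtracting the maximum before exponentiating is a standard numerical-stability trick added here as an assumption; it does not change the result.

```python
import math


def softmax(z):
    """Exponentiate each score and normalize so the outputs sum to 1."""
    m = max(z)                                  # stability shift (assumption of this sketch)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.8, 0.9, 0.68])
rounded = [round(p, 2) for p in probs]
predicted_class = probs.index(max(probs))       # argmax over the probabilities

print(rounded)           # [0.58, 0.23, 0.19]
print(predicted_class)   # 0
```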
Why are deep neural networks hard to train?
There are two challenges you might encounter when training your deep neural networks.
1.Vanishing Gradients
Like the sigmoid function, certain activation functions squish an ample input space into a
small output space between 0 and 1. Therefore, a large change in the input of the sigmoid
function will cause a small change in the output. Hence, the derivative becomes small.
For shallow networks with only a few layers that use these activations, this isn’t a big
problem. However, when more layers are used, it can cause the gradient to be too small
for training to work effectively.
2. Exploding Gradients
Exploding gradients are problems where significant error gradients accumulate and
result in very large updates to neural network model weights during training. An
unstable network can result when there are exploding gradients, and the learning
cannot be completed. The values of the weights can also become so large as to
overflow and result in something called NaN values.
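A rough way to see both problems is to look at the product of per-layer factors that the chain rule multiplies together. The sketch below is a deliberate simplification: it ignores the weights in the vanishing case and uses a made-up per-layer factor of 1.5 in the exploding case, so the numbers only illustrate the trend.

```python
def gradient_scale(per_layer_factor, n_layers):
    """Product of identical per-layer factors across n_layers (simplified model)."""
    return per_layer_factor ** n_layers


# The sigmoid derivative is at most 0.25, so each sigmoid layer can shrink the gradient.
print(gradient_scale(0.25, 2))    # 0.0625
print(gradient_scale(0.25, 10))   # about 9.5e-07 -> vanishing

# Hypothetical factor above 1 per layer -> exploding.
print(gradient_scale(1.5, 50))    # about 6.4e+08
```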
How to choose the right Activation Function?
You need to match the activation function of your output layer to the type of
prediction problem that you are solving; specifically, the type of predicted variable.
Here's what you should keep in mind.
As a rule of thumb, you can begin with using the ReLU activation function and then
move over to other activation functions if ReLU doesn’t provide optimum results.

And here are a few other guidelines to help you out.

1.ReLU activation function should only be used in the hidden layers.

2.Sigmoid/Logistic and Tanh functions should not be used in hidden layers as they make
the model more susceptible to problems during training (due to vanishing gradients).
3. Swish function is used in neural networks having a depth greater than 40 layers.
Finally, a few rules for choosing the activation function for your output layer based on the
type of prediction problem that you are solving:
1.Regression - Linear Activation Function
2.Binary Classification—Sigmoid/Logistic Activation Function
3.Multiclass Classification—Softmax
4.Multilabel Classification—Sigmoid

The activation function used in hidden layers is typically chosen based on the type of neural
network architecture.

5.Convolutional Neural Network (CNN): ReLU activation function.

6.Recurrent Neural Network: Tanh and/or Sigmoid activation function.
Loss functions
In deep learning, “loss function” and “cost function” are often used interchangeably.
They both refer to the same concept of a function that calculates the error or
discrepancy between predicted and actual values. The cost or loss function is
minimized during the model's training process to improve accuracy.

Loss functions are classified into:

1. Regression loss
2. Classification loss
A. Regression Loss
1. Mean Squared Error/Squared loss/ L2 loss
The Mean Squared Error (MSE) is the simplest and most common loss function.
To calculate the MSE, you take the difference between the actual value and model
prediction, square it, and average it across the whole dataset.

MSE = (1/n) \sum_{i=1}^{n} (y_i - ŷ_i)^2

Advantage
I. Easy to interpret.
II. Always differentiable because of the square.
III. Only one local minimum.

Disadvantage
▪ The error unit is squared, so the error is not in the same unit as the output and is harder to understand.
▪ Not robust to outliers.

Note – In regression at the last neuron use linear activation function.

2. Mean Absolute Error/ L1 loss
The Mean Absolute Error (MAE) is another simple loss function. To calculate the
MAE, you take the absolute difference between the actual value and model prediction and
average it across the whole dataset.

MAE = (1/n) \sum_{i=1}^{n} |y_i - ŷ_i|

Advantage
1. Intuitive and easy.
2. Error unit is the same as the output column.
3. Robust to outliers.

Disadvantage
1. The graph is not differentiable (at zero error), so we cannot use gradient descent directly;
we can use subgradient calculation instead.

Note – In regression at the last neuron use linear activation function.

Should I use L1 loss function?

L1 loss is not sensitive to outliers as it is simply the absolute difference, so if you
want to penalize large errors and outliers then L1 is not a great choice and you
should probably use L2 loss instead.
However, if you don't want to punish infrequent large errors, then L1 is most likely
a good choice.
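The different sensitivity of L2 and L1 loss to outliers shows up clearly on a tiny example. In the sketch below the data are made up: three predictions are exact and one is off by 10.

```python
def mse(y_true, y_pred):
    """Mean of squared differences (L2 loss)."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of absolute differences (L1 loss)."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 14.0]   # a single large error of 10

print(mse(y_true, y_pred))   # 25.0 -> the outlier dominates
print(mae(y_true, y_pred))   # 2.5  -> the outlier counts linearly
```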
3. Huber Loss
In statistics, the Huber loss is a loss function used in robust regression that is less
sensitive to outliers in data than the squared error loss.

L_δ(y, ŷ) = (1/2)(y - ŷ)^2 if |y - ŷ| ≤ δ
L_δ(y, ŷ) = δ(|y - ŷ| - δ/2) otherwise
Huber loss = (1/n) \sum_{i=1}^{n} L_δ(y_i, ŷ_i)

n – the number of data points.

y – the actual value of the data point. Also known as true value.
ŷ – the predicted value of the data point. This value is returned by
the model.
δ – defines the point where the Huber loss function transitions from
quadratic to linear.
Advantage
Robust to outlier
It lies between MAE and MSE.

Disadvantage
Its main disadvantage is the associated complexity. In order to maximize model accuracy,
the hyperparameter δ will also need to be optimized which increases the training
requirements.
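A minimal sketch of the Huber loss, using the standard definition: quadratic for errors up to δ and linear beyond it. The default `delta=1.0` and the example values are assumptions chosen for illustration.

```python
def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: 0.5*r^2 for |r| <= delta, delta*(|r| - 0.5*delta) otherwise."""
    total = 0.0
    for a, p in zip(y_true, y_pred):
        r = abs(a - p)
        if r <= delta:
            total += 0.5 * r * r                 # quadratic region (like MSE)
        else:
            total += delta * (r - 0.5 * delta)   # linear region (like MAE)
    return total / len(y_true)


# One small error (0.5) and one outlier error (10).
print(huber([0.0, 0.0], [0.5, 10.0], delta=1.0))   # 4.8125
```

The outlier contributes 9.5 here instead of the 50 it would contribute under half squared error, while small errors are still treated quadratically.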
B. Classification Loss

1. Binary Cross Entropy/log loss

It is used in binary classification problems, i.e. problems with two classes; for example, whether
a person has covid or not, or whether my article gets popular or not. Binary cross entropy compares
each of the predicted probabilities to the actual class output, which can be either 0 or 1. It then
calculates a score that penalizes the probabilities based on their distance from the expected value,
that is, how close or far they are from the actual value.

BCE = -(1/n) \sum_{i=1}^{n} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]

y_i – actual values
ŷ_i – Neural Network prediction
Advantage –
The cost function is differentiable.
Disadvantage –
Multiple local minima
Not intuitive
Note – In classification at last neuron use sigmoid activation function.
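Binary cross entropy can be sketched as follows, using the standard log loss definition. The clipping constant `eps` is an assumption added to avoid taking log(0), and the example predictions are made up.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)          # keep p away from 0 and 1
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


y_true = [1, 0]
confident_right = binary_cross_entropy(y_true, [0.9, 0.1])
confident_wrong = binary_cross_entropy(y_true, [0.1, 0.9])

print(round(confident_right, 4))   # 0.1054
print(round(confident_wrong, 4))   # 2.3026
```

Predictions far from the actual label are penalized much more heavily than predictions close to it.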
2. Categorical Cross Entropy
Categorical cross entropy is used for multiclass classification, where the model outputs a
probability for each class.

CE = distance of the actual probability distribution from the predicted probability
distribution (actual output vs. predicted output).

With predicted probabilities P(D) = P1, P2, P3, …, Pn and actual values Y(D) = Y1, Y2, Y3, …, Yn:

CE = -(Y1 log P1 + Y2 log P2 + Y3 log P3 + … + Yn log Pn)
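The categorical cross entropy formula can be sketched for a single sample as below. The one-hot target and the predicted probabilities are made-up example values, and `eps` is an assumption added to avoid log(0).

```python
import math


def categorical_cross_entropy(y_true, probs, eps=1e-12):
    """-sum(Y_i * log(P_i)) for one sample with per-class probabilities."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, probs))


y_true = [1, 0, 0]                 # actual class is the first one
probs = [0.58, 0.23, 0.19]         # predicted class probabilities

loss = categorical_cross_entropy(y_true, probs)
print(round(loss, 4))              # 0.5447, i.e. -log(0.58)
```

With a one-hot target only the term for the actual class survives, so the loss is simply the negative log of the probability the model assigned to that class.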

Optimization, Learning with backpropagation, Learning Parameters: Gradient Descent (GD), Stochastic and
Mini Batch GD, Nesterov Accelerated GD,
Backpropagation

Backpropagation is a training algorithm used for training feedforward neural
networks. It plays an important part in improving the predictions made by neural
networks. This is because backpropagation is able to improve the output of the neural
network iteratively.

In a feedforward neural network, the input moves forward from the input layer to
the output layer. Backpropagation helps improve the neural network’s output. It
does this by propagating the error backward from the output layer to the input layer.
How Does Backpropagation Work?

To understand how backpropagation works, let’s first understand how a feedforward
network works.
1. When a neural network is first trained, it is first fed with input. Since the neural network
isn’t trained yet, we don’t know which weights to use for each input.
2. And so, each input is randomly assigned a weight. Since the weights are randomly
assigned, the neural network will likely make the wrong predictions.
3. It will give out the incorrect output. When the neural network gives out the incorrect
output, this leads to an output error.
4. This error is the difference between the actual and predicted outputs. A cost function
measures this error.

Role of Cost function

1. The cost function (J) indicates how accurately the model performs.
2. It tells us how far-off our predicted output values are from our actual values.
3. It is also known as the error. Because the cost function quantifies the error, we aim to
minimize the cost function.
What we want is to reduce the output error.
Since the weights affect the error, we will need
to readjust the weights. We have to adjust the
weights such that we have a combination of
weights that minimizes the cost function.

This is where Backpropagation comes in…

• Backpropagation allows us to readjust our weights to reduce output error.
• The error is propagated backward during backpropagation from the output to the input
layer.
• This error is then used to calculate the gradient of the cost function with respect to each
weight.
Essentially, backpropagation aims to calculate the negative gradient of the cost
function. This negative gradient is what helps in adjusting of the weights. It gives us an
idea of how we need to change the weights so that we can reduce the cost function.

Backpropagation uses the chain rule to calculate the gradient of the cost function. The
chain rule lets us compute the derivative of the cost layer by layer, giving the partial
derivative of the cost with respect to each parameter. Each partial derivative is calculated by
differentiating with respect to one weight while treating the other(s) as constants. Collecting
these partial derivatives gives us the gradient.

Since we have calculated the gradients, we will be able to adjust the weights.
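The chain rule computation can be made concrete on a very small network. The sketch below assumes a hypothetical network with one input, one sigmoid hidden unit (weight `w1`), a linear output (weight `w2`), no biases, and the cost J = 0.5 * (ŷ - y)^2; none of these choices come from the notes. The analytic gradients are compared against numerical finite differences as a sanity check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x):
    h = sigmoid(w1 * x)      # hidden activation
    y_hat = w2 * h           # linear output
    return h, y_hat


def cost(w1, w2, x, y):
    _, y_hat = forward(w1, w2, x)
    return 0.5 * (y_hat - y) ** 2


def backward(w1, w2, x, y):
    """Propagate the error from the output back to each weight (chain rule)."""
    h, y_hat = forward(w1, w2, x)
    dJ_dyhat = y_hat - y                   # dJ/dŷ
    dJ_dw2 = dJ_dyhat * h                  # ŷ = w2*h      -> dŷ/dw2 = h
    dJ_dh = dJ_dyhat * w2                  #               -> dŷ/dh  = w2
    dJ_dw1 = dJ_dh * h * (1.0 - h) * x     # h = σ(w1*x)   -> dh/dw1 = h(1-h)x
    return dJ_dw1, dJ_dw2


w1, w2, x, y = 0.5, -0.3, 1.5, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, y)

# Numerical check: vary one weight while holding the other constant.
eps = 1e-6
num_w1 = (cost(w1 + eps, w2, x, y) - cost(w1 - eps, w2, x, y)) / (2 * eps)
num_w2 = (cost(w1, w2 + eps, x, y) - cost(w1, w2 - eps, x, y)) / (2 * eps)

print(grad_w1, num_w1)
print(grad_w2, num_w2)
```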
Significance of Gradient Descent

The weights are adjusted using a process called gradient descent. Gradient descent is
an optimization algorithm that is used to find the weights that minimize the cost
function. Minimizing the cost function means getting to the minimum point of the cost
function. So, gradient descent aims to find a weight corresponding to the cost function’s
minimum point. To find this weight, we must navigate down the cost function until we
find its minimum point.
But first, to navigate the cost function, we need two things:
1. The direction in which to navigate and
2. The size of the steps for navigating.

The Direction
The direction for navigating the cost function is found using the gradient.

The Gradient
To know in which direction to navigate, gradient descent uses backpropagation. More
specifically, it uses the gradients calculated through backpropagation. These gradients are used
for determining the direction to navigate to find the minimum point. Specifically, we aim to
find the negative gradient. This is because a negative gradient indicates a decreasing slope. A
decreasing slope means that moving downward will lead us to the minimum point. For
example:
The Step Size
The step size for navigating the cost function is
determined using the learning rate.
Learning Rate
The learning rate is a tuning parameter that
determines the step size at each iteration of
gradient descent. It determines the speed at
which we move down the slope.
The step size plays an important part in ensuring a balance between optimization time and
accuracy. The step size is measured by a parameter alpha (α). A small α means a small
step size, and a large α means a large step size.
1. If the step sizes are too large, we could miss the minimum point completely. This can
yield inaccurate results.
2. If the step size is too small, the optimization process could take too much time. This
will lead to a waste of computational power.
The actual step taken at each iteration is the product of the
learning rate and the gradient of the cost function. The higher
the gradient, the steeper the slope and the larger the step, so
the faster a model can learn. For a given gradient, a high
learning rate results in a larger step, and a lower learning rate
results in a smaller step. If the gradient of the cost function is
zero, the model stops learning.
Descending the Cost Function
Navigating the cost function consists of adjusting the weights.
The weights are adjusted using the following formula:

$$w_{\text{new}} = w_{\text{old}} - \alpha \, \frac{\partial J}{\partial w}$$

Here J is the cost function, α is the learning rate, and ∂J/∂w is the
gradient of the cost function with respect to the weight.

This is the formula for gradient descent. As we can see, to
obtain the new weight, we use the gradient, the learning rate,
and an initial weight.

Adjusting the weights consists of multiple iterations. We take a
new step down for each iteration and calculate a new weight.
Using the initial weight and the gradient and learning rate, we
can determine the subsequent weights.

Let’s consider a graphical example of this:

From the graph of the cost function, we can see that:
1. To start descending the cost function, we first initialize a
random weight.
2. Then, we take a step down and obtain a new weight using the
gradient and learning rate. With the gradient, we know which
direction to navigate. With the learning rate, we know the step
size for navigating the cost function.
3. We are then able to obtain a new weight using the gradient
descent formula.
4. We repeat this process until we reach the minimum point of
the cost function.
5. Once we’ve reached the minimum point, we find the weights
that correspond to the minimum of the cost function.
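The descent procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cost function J(w) = (w - 3)^2, the starting weight, the learning rate of 0.1 and the number of steps are all made-up values, not taken from these notes.

```python
def cost(w):
    # Hypothetical convex cost function with its minimum at w = 3
    return (w - 3.0) ** 2


def gradient(w):
    # Derivative of the hypothetical cost with respect to the weight
    return 2.0 * (w - 3.0)


def gradient_descent(w_init, alpha, n_steps):
    """Repeatedly apply: new weight = old weight - alpha * gradient."""
    w = w_init
    history = [w]
    for _ in range(n_steps):
        w = w - alpha * gradient(w)
        history.append(w)
    return w, history


# Assumed starting weight, learning rate and step count
w_final, history = gradient_descent(w_init=0.0, alpha=0.1, n_steps=100)
print(w_final, cost(w_final))
```

Each entry of history is one step down the cost function; with these assumed values the weight moves steadily towards 3, where the cost is lowest.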
Backpropagation vs. Gradient Descent
Summarizing Gradient Descent
Gradient descent is an optimization algorithm used to find the
weights corresponding to the minimum of the cost function. To
find these weights, it descends the cost function until it reaches
the minimum point. To descend the cost function, it needs the
gradient and the learning rate. The gradient gives the direction
for reaching the minimum point of the cost function. The learning
rate determines the size of each step, and so the speed at which
the minimum point is reached.
Summarizing Backpropagation
Backpropagation is the algorithm for calculating the gradients of the cost function with
respect to the weights. Backpropagation is used to improve the output of neural networks. It
does this by propagating the error in a backward direction and calculating the gradient of the
cost function for each weight. These gradients are used in the process of gradient descent.
Conclusion
To put it plainly, gradient descent is the process of using gradients to find the minimum value
of the cost function, while backpropagation is calculating those gradients by moving in a
backward direction in the neural network. Judging from this, it would be safe to say that
gradient descent relies on backpropagation.
It would also be plausible to say that the neural network is trained using gradient descent
and that backpropagation is only used to assist in the process of calculating the gradients.
Although gradient descent is often paired with backpropagation to reduce the error in neural
networks, they each perform different functions.
Key Takeaways:
• Gradient descent relies on backpropagation. Gradient descent
uses gradients to help it find the minimum value of the cost
function. Backpropagation calculates these gradients using the
chain rule.
• Gradient descent is used to find a weight combination that
minimizes the cost function. Backpropagation propagates the
error backward and calculates the gradient for each weight.
• Gradient descent requires the learning rate and the
gradient. The gradient helps find the direction to the minimum
point of the cost function. The learning rate helps find the
speed at which to navigate the cost function.
• Together, backpropagation and gradient descent improve the
prediction accuracy of neural networks. Backpropagation
propagates the error backward and calculates the gradient for
each weight. This gradient is used in the process of gradient
descent. Gradient descent involves adjusting the weights of the
neural network. Adjusting the weights helps minimize the
output error of the neural network.
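The division of labour between the two algorithms can be illustrated with a single sigmoid neuron. The sketch below is an assumption-laden toy, not the method of any particular library: the squared-error loss, the input x = 1.5, the target y = 0.8, the initial weight and bias, and the learning rate are all hypothetical. The backward function plays the role of backpropagation (chain rule), and the loop in train plays the role of gradient descent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    # Forward pass of one neuron: weighted input followed by sigmoid
    return sigmoid(w * x + b)


def loss(w, b, x, y):
    # Squared-error loss for a single training example
    return 0.5 * (forward(w, b, x) - y) ** 2


def backward(w, b, x, y):
    # Backpropagation: chain rule from the loss back to w and b
    a = forward(w, b, x)
    dloss_da = a - y             # derivative of the loss w.r.t. the output
    da_dz = a * (1.0 - a)        # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz
    return dloss_dz * x, dloss_dz    # dL/dw, dL/db


def train(x, y, w=0.5, b=0.0, alpha=0.5, n_steps=2000):
    # Gradient descent: use the backpropagated gradients to update w and b
    for _ in range(n_steps):
        grad_w, grad_b = backward(w, b, x, y)
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b


x, y = 1.5, 0.8    # hypothetical input and target
w, b = train(x, y)
print(forward(w, b, x))
```

Backpropagation alone changes nothing in the network; it only supplies the gradients. The weights move only in the gradient descent update inside train.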
Important Deep Learning Terms for Gradient Descent
• Epoch – One complete pass of the algorithm over the whole training dataset. The
number of epochs is the number of such passes.

• Sample – A single row of a dataset.

• Batch – A group of samples used for one update of the model parameters. The
batch size is the number of samples in the group.

• Learning rate – A parameter that gives the model a scale for how much the
model weights should be updated.

• Cost Function/Loss Function – A function used to calculate the cost, which
measures the difference between the predicted value and the actual value.

• Weights/Bias – The learnable parameters of a model. A weight controls the strength
of the signal between two neurons, and a bias shifts the input of a neuron.
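How epoch, sample and batch relate can be shown with a small calculation. The dataset size, batch size and number of epochs below are made-up values chosen only for illustration.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    # One parameter update per batch; the last batch may be smaller
    return math.ceil(n_samples / batch_size)


n_samples = 1000    # hypothetical number of rows in the dataset
batch_size = 32     # hypothetical batch size
n_epochs = 5        # hypothetical number of passes over the dataset

per_epoch = updates_per_epoch(n_samples, batch_size)
total_updates = per_epoch * n_epochs
print(per_epoch, total_updates)    # prints: 32 160
```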
Gradient Descent
Gradient Descent is an optimization algorithm used in Machine/Deep Learning.
The goal of Gradient Descent is to minimize an objective function f(x) iteratively. When
f(x) is convex, this means reaching the global minimum step by step.
Gradient Descent on Cost function.
Steps to implement Gradient Descent
1. Randomly initialize values.
2. Update values using the gradient and the learning rate.
3. Repeat until slope = 0 (in practice, until the slope is close to zero).

Learning rate must be chosen wisely as:
1. if it is too small, then the model will take
a long time to learn.
2. if it is too large, the model will not converge,
as our pointer will overshoot and we’ll not be able
to get to the minima.
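The effect of the learning rate can be checked on a toy problem. The sketch below assumes the function f(w) = w^2 (minimum at w = 0), a starting value of 1.0, and three hand-picked learning rates; none of these values come from the notes.

```python
def descend(alpha, w_init=1.0, n_steps=50):
    # Gradient descent on the hypothetical function f(w) = w**2,
    # whose gradient is 2 * w and whose minimum is at w = 0
    w = w_init
    for _ in range(n_steps):
        w = w - alpha * 2.0 * w
    return w


w_small = descend(alpha=0.001)   # too small: still far from 0 after 50 steps
w_good = descend(alpha=0.1)      # reasonable: very close to 0
w_large = descend(alpha=1.1)     # too large: overshoots and moves away from 0

print(abs(w_small), abs(w_good), abs(w_large))
```

With the too-large rate, every step jumps across the minimum and lands farther away than before, so the distance to the minimum grows instead of shrinking.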
Stochastic Gradient Descent

➢ A better option than plain gradient descent is needed for massive datasets.
➢ To tackle the challenges of large datasets, we have stochastic gradient descent (SGD), a
popular approach among optimizers in deep learning.
➢ The term stochastic denotes the element of randomness upon which the algorithm
relies.
➢ In stochastic gradient descent, instead of processing the entire dataset during each
iteration, we randomly select a small batch of data (in the strictest form, a single sample).
➢ This implies that only a few samples from the dataset are considered at a time,
allowing for more efficient and computationally feasible optimization in deep learning
models.
➢ The procedure is first to select the initial parameters w and learning rate η. Then the
data is randomly shuffled and the parameters are updated iteration by iteration until an
approximate minimum is reached.
➢ Since we are not using the whole dataset but batches of it for each iteration, the path
taken by the algorithm is noisier than that of the gradient descent algorithm.
Thus, SGD needs a higher number of iterations to reach a minimum.
➢ Because of this, the number of updates increases. But since each iteration is much
cheaper, the total computation cost is usually still less than that of the batch gradient
descent optimizer.
➢ So the conclusion is: if the data is enormous and computational time is an essential factor,
stochastic gradient descent should be preferred over the batch gradient descent algorithm.
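A minimal sketch of stochastic gradient descent, using one randomly ordered sample per update, might look as follows. Everything specific here is an assumption made for illustration: the model is a straight line w * x + b, the toy data is generated from y = 2x + 1 without noise, and the learning rate (called eta in the code), epoch count and random seed are arbitrary choices.

```python
import random


def sgd_linear_regression(data, eta=0.05, n_epochs=200, seed=0):
    """Fit y = w * x + b with one-sample-at-a-time updates."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(n_epochs):
        rng.shuffle(data)               # random order in every epoch
        for x, y in data:               # one sample per update
            error = (w * x + b) - y     # prediction minus target
            w -= eta * error * x        # gradient of 0.5 * error**2 w.r.t. w
            b -= eta * error            # gradient of 0.5 * error**2 w.r.t. b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(20)]
w, b = sgd_linear_regression(data)
print(w, b)
```

Each update looks at a single sample, so it is cheap, but the direction it takes is only a noisy estimate of the direction that the full dataset would give.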
Mini Batch Gradient Descent

➢ In this variant of gradient descent, instead of taking all the training data, only a subset of
the dataset is used for calculating the loss function at each update.
➢ Since each update uses a batch of data rather than a single sample, the gradient estimate
is less noisy than in stochastic gradient descent and fewer iterations are needed. Since each
update uses less than the whole dataset, it is also cheaper than an update in batch gradient
descent. That is why the mini-batch gradient descent algorithm is often faster in practice than
both stochastic gradient descent and batch gradient descent algorithms.
➢ This algorithm is more efficient and robust than the earlier variants of gradient descent.
As the algorithm uses batching, all the training data need not be loaded into memory at once,
thus making the process more efficient to implement.
➢ Moreover, the cost function in mini-batch gradient descent is noisier than in the batch
gradient descent algorithm but smoother than in the stochastic gradient descent
algorithm. Because of this, mini-batch gradient descent provides a good
balance between speed and accuracy.
➢ Despite all that, the mini-batch gradient descent algorithm has some downsides too.
It needs a hyperparameter, the “mini-batch size”, which needs to be tuned to
achieve the required accuracy.
➢ A batch size of 32 is often considered a reasonable default, although in some cases
it results in poor final accuracy. Because of this, there is a need to look for other
alternatives too.
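The mini-batch variant can be sketched in the same spirit. The helper name minibatches, the straight-line model, the toy data from y = 2x + 1, the batch size of 4, the learning rate and the seed are all hypothetical choices for this illustration.

```python
import random


def minibatches(data, batch_size, rng):
    # Shuffle a copy of the data and yield it in consecutive chunks
    data = list(data)
    rng.shuffle(data)
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def minibatch_gd(data, batch_size=4, alpha=0.1, n_epochs=500, seed=0):
    """Fit y = w * x + b, averaging the gradient over each mini-batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(n_epochs):
        for batch in minibatches(data, batch_size, rng):
            errors = [(w * x + b) - y for x, y in batch]
            grad_w = sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = sum(errors) / len(batch)
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(20)]
w, b = minibatch_gd(data)
print(w, b)
```

Setting batch_size to 1 gives one-sample stochastic updates, and setting it to len(data) gives batch gradient descent, so the same loop covers all three variants.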
