
Deep Learning

B.E. Computer Sem-VIII

Dr. Shivaji Pawar


Syllabus
Unit 1: Fundamentals of Neural Network
1.1 Biological neuron, Mc-Culloch Pitts Neuron, Perceptron, Perceptron learning
1.2 Delta learning, Multilayer Perceptron: Linearly separable, linearly non-
separable classes
1.3 Brief History, Three Classes of Deep Learning Basic Terminologies of Deep
Learning
Unit 2: Training Feedforward DNN
2.1 Multi Layered Feed Forward Neural Network, Learning Factors,
2.2 Activation functions: Tanh, Logistic, Linear, Softmax, ReLU, Leaky ReLU,
2.3 Loss functions: Squared Error loss, Cross Entropy, Choosing output function
and loss function
2.4 Optimization, Learning with backpropagation, Learning Parameters: Gradient
Descent (GD), Stochastic and Mini Batch GD, Nesterov Accelerated GD,
2.5 AdaGrad, Adam, RMSProp Parameter sharing,
2.6 Dropout, Weight Decay, Batch normalization,
2.7 Early stopping, Data Augmentation, Adding noise to input and output
Unit 3: Autoencoders: Unsupervised Learning

3.1 Introduction, Linear Autoencoder,


3.2 Undercomplete Autoencoder,
3.3 Overcomplete Autoencoders, Regularization in Autoencoders
3.4 Denoising Autoencoders, Sparse Autoencoders,
3.5 Contractive Autoencoders
3.6 Application of Autoencoders: Image Compression
Unit 4: Convolutional Neural Networks (CNN): Supervised Learning
4.1 Convolution operation, Padding, Stride, Relation between input, output
and filter size, CNN architecture:
4.2 Convolution layer, Pooling Layer, Weight Sharing in CNN,
4.3 Fully Connected NN vs CNN,
4.4 Variants of basic Convolution function
4.5 Modern Deep Learning Architectures:
4.6 LeNET: Architecture,
4.7 AlexNET: Architecture
Unit 5: Recurrent Neural Networks (RNN)
5.1 Sequence Learning Problem, Unfolding Computational graphs,
5.2 Recurrent Neural Network, Bidirectional RNN,
5.3 Long Short Term Memory:
5.4 Selective Read, Selective write, Selective Forget, Gated Recurrent Unit

Unit 6: Recent Trends and Applications

6.1 Generative Adversarial Network (GAN):Architecture


6.2 Applications: Image Generation,
6.3 DeepFake
Unit 1: Fundamentals of Neural Network
What is a Neuron? A neuron has three main parts: dendrites, an axon, and a cell body or soma (see image), which can be represented as the branches, roots, and trunk of a tree, respectively.
A dendrite (tree branch) is where a neuron receives input from other cells.
The axon terminals act as the output terminal.
And the cell body is the processing unit for all the input signals.
Mc-Culloch Pitts Neuron
The first computational model of a neuron was proposed by Warren McCulloch (neuroscientist) and Walter Pitts (logician) in 1943.

It may be divided into 2 parts. The first part, g, takes an input (ahem, dendrite, ahem), performs an aggregation, and based on the aggregated value, the second part, f, makes a decision.
AND Function
An AND function neuron would only fire when ALL the inputs are ON, i.e., g(x) ≥ 3 here.

OR Function
I believe this is self-explanatory, as we know that an OR function neuron would fire if ANY of the inputs is ON, i.e., g(x) ≥ 1 here.

NOR Function
For a NOR neuron to fire, we want ALL the inputs to be 0, so the thresholding parameter should also be 0, and we take them all as inhibitory inputs.

NOT Function
For a NOT neuron, 1 outputs 0 and 0 outputs 1. So we take the input as an inhibitory input and set the thresholding parameter to 0. It works!
Analysis of OR Function as a decision boundary
The inputs are obviously Boolean, so only 4 combinations are possible: (0,0), (0,1), (1,0), and (1,1). Plotting them on a 2D graph and using the OR function's aggregation equation, x_1 + x_2 ≥ 1, we can draw the decision boundary as shown in the graph below.
AND Function

In this case, the decision boundary equation is x_1 + x_2 = 2. Here, all the input points that lie ON or ABOVE the line, which is just (1,1), output 1 when passed through the AND function M-P neuron.
OR Function With 3 Inputs

Let's generalize this by looking at a 3-input OR function M-P unit. In this case, the possible inputs are 8 points: (0,0,0), (0,0,1), (0,1,0), (1,0,0), (1,0,1), … you get the point(s). We can map these on a 3D graph, and this time we draw a decision boundary in 3 dimensions.
The plane that satisfies the decision boundary equation x_1 + x_2 + x_3 = 1 is shown below:

Take your time and convince yourself by looking at the above plot that all the points that
lie ON or ABOVE that plane (positive half space) will result in output 1 when passed
through the OR function M-P unit and all the points that lie BELOW that plane
(negative half space) will result in output 0.
Limitations Of M-P Neuron
1. What about non-boolean (say, real) inputs?
2. Do we always need to hand code the threshold?
3. Are all inputs equal? What if we want to assign more importance to some inputs?
4. What about functions which are not linearly separable? Say XOR function.

Conclusion
1. In this Topic, we briefly looked at biological neurons.
2. We then established the concept of the McCulloch-Pitts neuron, the first ever mathematical model of a biological neuron.
3. We represented a bunch of Boolean functions using the M-P neuron.
4. We also tried to get a geometric intuition of what is going on with the model, using 3D plots.
5. In the end, we also established a motivation for a more generalized model, the one and only
artificial neuron/perceptron model.
Perceptron model

The Perceptron is a Machine Learning algorithm for supervised learning of various binary classification tasks.

Further, the Perceptron is also understood as an Artificial Neuron or neural network unit that helps to detect certain input data computations in business intelligence.
Basic Components of Perceptron
Input Nodes or Input Layer:
This is the primary component of Perceptron which accepts the initial data into the system
for further processing. Each input node contains a real numerical value.
Weight and Bias:
The weight parameter represents the strength of the connection between units and is another important parameter of the Perceptron. A weight is directly proportional to the strength of the associated input neuron in deciding the output. Further, the bias can be thought of as the intercept in a linear equation.

Activation Function:
These are the final and important components that help to determine whether the neuron
will fire or not. Activation Function can be considered primarily as a step function.
How does Perceptron work?
In Machine Learning, the Perceptron is considered a single-layer neural network with four main components: input values (input nodes), weights and bias, net sum, and an activation function.
The perceptron model begins with the multiplication of all input values and their weights,
then adds these values together to create the weighted sum.
Then this weighted sum is applied to the activation function 'f' to obtain the desired
output.
This activation function is also known as the step function and is represented by 'f'.
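As a rough illustration of this computation, the sketch below (Python; the weights and bias are arbitrary example values, not prescribed by the slides) multiplies the inputs by the weights, adds the bias, and passes the weighted sum through the step function 'f'.

```python
def step(z):
    """Step activation 'f': fire when the net input is non-negative."""
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    """Weighted sum of the inputs plus the bias, passed through 'f'."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # net input
    return step(z)

# Example: hand-picked weights and bias that realize logical AND
print(perceptron([1, 1], w=[1.0, 1.0], b=-1.5))    # 1
print(perceptron([1, 0], w=[1.0, 1.0], b=-1.5))    # 0
```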
Types of Perceptron Models
Based on the layers, Perceptron models are divided into two types. These are as
follows:
1. Single-layer Perceptron Model
2. Multi-layer Perceptron model
1. Single Layer Perceptron Model:
This is one of the simplest types of Artificial Neural Networks (ANNs). A single-layered perceptron model consists of a feed-forward network and also includes a threshold transfer function inside the model. The main objective of the single-layer perceptron model is to classify linearly separable objects with binary outcomes. In a single-layer perceptron model, the algorithm has no prior recorded data, so it begins with randomly allocated weight parameters. Further, it sums up all the weighted inputs. After adding all inputs, if the total sum is more than a pre-determined value, the model gets activated and shows the output value as +1.
Multi-Layered Perceptron Model

Like a single-layer perceptron model, a multi-layer perceptron model also has the same
model structure but has a greater number of hidden layers.

The multi-layer perceptron model is typically trained with the Backpropagation algorithm, which executes in two stages as follows:

1. Forward Stage: Activations flow from the input layer through the hidden layers and terminate at the output layer.
2. Backward Stage: In the backward stage, weight and bias values are modified as per the model's requirement. The error between the actual and desired output is propagated backward, starting at the output layer and ending at the input layer.
Advantages of Multi-Layer Perceptron

1. A multi-layered perceptron model can be used to solve complex non-linear problems.


2. It works well with both small and large input data.
3. It helps us to obtain quick predictions after the training.
4. It helps to obtain the same accuracy ratio with large as well as small data.

Disadvantages of Multi-Layer Perceptron

1. In Multi-layer perceptron, computations are difficult and time-consuming.


2. In a multi-layer Perceptron, it is difficult to predict how much each independent variable affects the dependent variable.
3. The model functioning depends on the quality of the training.
Linearly separable, linearly non-separable classes

Linear and non-linear separability. (A) shows two classes of data that can be separated by
a single straight line (i.e. they are 'linearly separable'). (B) shows two classes of data that
require a more complex classification topology (i.e. they are not linearly separable).
Neural networks are capable of solving the complex topography in (B), whereas more
traditional statistical methods (such as binary logistic regression, the obvious comparator in
this case) are not.
What is Deep Learning?
It is a technique in which machines are trained to mimic the behavior of the human brain, i.e., to act or behave like a human being.
History of deep learning

• 1943: The McCulloch-Pitts neuron, based on threshold logic (Warren McCulloch and Walter Pitts).
• 1985: Backpropagation (Hinton, Rumelhart, and Williams).
• 2000: The vanishing gradient problem.
• From 2012 onwards…

Deep learning architectures are becoming so popular due to:
1. Exponential growth in data.
2. Technology upgrades.
3. Deep learning architectures are suitable for complex applications like object detection, face recognition, recommender systems, NLP, and chatbots.
Types of Deep Learning architecture
Unit 2: Training Feedforward DNN

Multi Layered Feed Forward Neural Network, Learning Factors,


Activation functions: Tanh, Logistic, Linear, Softmax, ReLU, Leaky ReLU,
Loss functions: Squared Error loss, Cross Entropy, Choosing output function and
loss function
Optimization, Learning with backpropagation, Learning Parameters: Gradient
Descent (GD), Stochastic and Mini Batch GD, Nesterov Accelerated GD,
AdaGrad, Adam, RMSProp Parameter sharing,
Dropout, Weight Decay, Batch normalization,
Early stopping, Data Augmentation, Adding noise to input and output
Elements of a Neural Networks Architecture
Input Layer
The input layer takes raw input from the domain. No computation is performed at this
layer. Nodes here just pass on the information (features) to the hidden layer.

Hidden Layer
As the name suggests, the nodes of this layer are not exposed. They provide an abstraction
to the neural network. The hidden layer performs all kinds of computation on the features
entered through the input layer and transfers the result to the output layer.

Output Layer
It’s the final layer of the network that brings the information learned through the hidden
layer and delivers the final value as a result.
Note: All hidden layers usually use the same activation function. However, the output layer
will typically use a different activation function from the hidden layers. The choice
depends on the goal or type of prediction made by the model.
Feedforward vs. Backpropagation
When learning about neural networks, you will come across two essential terms describing
the movement of information—feedforward and backpropagation.
Feedforward Propagation –
The flow of information occurs in the forward direction. The input is used to calculate
some intermediate function in the hidden layer, which is then used to calculate an output.
In the feedforward propagation, the Activation Function is a mathematical “gate” in
between the input feeding the current neuron and its output going to the next layer.
Backpropagation -
The weights of the network connections are repeatedly adjusted to minimize the difference
between the actual output vector of the net and the desired output vector.
To put it in simple words:
Backpropagation aims to minimize the cost function by adjusting the network's weights and biases. The gradients of the cost function determine the level of adjustment with respect to parameters like the weights and biases.
Why do Neural Networks Need an Activation Function?
So we know what Activation Function is and what it does, but—

Why do Neural Networks need it?


Well, the purpose of an activation function is to add non-linearity to the neural network.
Activation functions
To put it in simple terms, an artificial neuron calculates the 'weighted sum' of its inputs and adds a bias, as shown in the figure below by the net input.

Mathematically,
net input = Σ (weight × input) + bias
Types of Neural Networks Activation Functions
1. Binary Step Function
Binary step function depends on a threshold value that decides whether a neuron should be
activated or not.
The input fed to the activation function is compared to a certain threshold; if the input is
greater than it, then the neuron is activated, else it is deactivated, meaning that its output is
not passed on to the next hidden layer.
Mathematically it can be represented as:

f(x) = 1 if x ≥ threshold, else 0 (with the threshold commonly taken as 0)
The limitations of the binary step function:

1. It cannot provide multi-value outputs; for example, it cannot be used for multi-class classification problems.
2. The gradient of the step function is zero, which causes problems in the backpropagation process.
Non-Linear Neural Networks Activation Functions

1. Sigmoid / Logistic Activation Function


This function takes any real value as input and outputs values in the range of 0 to 1.

The larger the input (more positive), the closer the output value will be to 1.0, whereas
the smaller the input (more negative), the closer the output will be to 0.0, as shown
below.
Mathematically it can be represented as:

f(x) = 1 / (1 + e^(−x))

Why is the sigmoid/logistic activation function one of the most widely used functions? These are the advantages:
I. It is commonly used for models where we have to predict a probability as an output. Since the probability of anything exists only between 0 and 1, sigmoid is the right choice because of its range.
II. The function is differentiable and provides a smooth gradient, i.e., it prevents jumps in output values. This is represented by the S-shape of the sigmoid activation function.
The limitations of the sigmoid function are discussed below:

1. The derivative of the function is f'(x) = sigmoid(x) * (1 − sigmoid(x)). As we can see from the figure above, the gradient values are only significant in the range −3 to 3, and the graph gets much flatter in other regions. This implies that for values greater than 3 or less than −3, the function will have very small gradients. As the gradient value approaches zero, the network ceases to learn and suffers from the vanishing gradient problem.

2. The output of the logistic function is not symmetric around zero, so the outputs of all the neurons will be of the same sign. This makes training the neural network more difficult and unstable.
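A quick numeric check of the first limitation (an illustrative Python sketch, not part of the original notes) evaluates f'(x) = sigmoid(x) * (1 − sigmoid(x)) at a few points; outside roughly −3 to 3 the gradient is nearly zero, which is exactly the vanishing-gradient behavior described above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)        # f'(x) = sigmoid(x) * (1 - sigmoid(x))

for x in [-6, -3, 0, 3, 6]:
    print(f"x = {x:+d}, gradient = {sigmoid_grad(x):.4f}")
# x = -6, gradient = 0.0025   <- nearly zero: the network barely learns
# x = -3, gradient = 0.0452
# x = +0, gradient = 0.2500   <- the maximum gradient, at x = 0
# x = +3, gradient = 0.0452
# x = +6, gradient = 0.0025
```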
2. Tanh Function (Hyperbolic Tangent)
Tanh function is very similar to the sigmoid/logistic activation function, and even has the
same S-shape with the difference in output range of -1 to 1. In Tanh, the larger the input
(more positive), the closer the output value will be to 1.0, whereas the smaller the input
(more negative), the closer the output will be to -1.0.

Mathematically it can be represented as:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))


Advantages of using this activation function are:

The output of the tanh activation function is zero-centered; hence we can easily map the output values as strongly negative, neutral, or strongly positive.
It is usually used in the hidden layers of a neural network, as its values lie between −1 and 1; therefore, the mean for the hidden layer comes out to be 0 or very close to it. This helps in centering the data and makes learning for the next layer much easier.
Have a look at the gradient of the tanh activation function to understand its limitations.

As you can see, it also faces the problem of vanishing gradients, similar to the sigmoid activation function. Plus, the gradient of the tanh function is much steeper compared to the sigmoid function.

Note: Although both sigmoid and tanh face the vanishing gradient issue, tanh is zero-centered, and the gradients are not restricted to move in a certain direction. Therefore, in practice, tanh nonlinearity is always preferred to sigmoid nonlinearity.

3. ReLU (Rectified Linear Unit)
Although it gives the impression of a linear function, ReLU has a derivative function and allows for backpropagation while simultaneously being computationally efficient.
The main catch here is that the ReLU function does not activate all the neurons at the same time. A neuron is deactivated only if the output of the linear transformation is less than 0.
Mathematically it can be represented as:

f(x) = max(0, x)

The advantages of using ReLU as an activation function are as follows:

1. Since only a certain number of neurons are activated, the ReLU function is far more computationally efficient when compared to the sigmoid and tanh functions.
2. ReLU accelerates the convergence of gradient descent towards the global minimum of the loss function due to its linear, non-saturating property.
The limitations faced by this function are:

The Dying ReLU problem, which is explained below.
I. The negative side of the graph makes the gradient value zero. Due to this, during the backpropagation process, the weights and biases of some neurons are not updated. This can create dead neurons that never get activated.
II. All the negative input values become zero immediately, which decreases the model's ability to fit or train from the data properly.
4. Leaky ReLU Function
Leaky ReLU is an improved version of the ReLU function that solves the Dying ReLU problem, as it has a small positive slope in the negative region.

Mathematically it can be represented as:

f(x) = max(0.01x, x)

The advantages of Leaky ReLU are the same as those of ReLU, in addition to the fact that it does enable backpropagation, even for negative input values.

By making this minor modification for negative input values, the gradient of the left side of the graph comes out to be a non-zero value. Therefore, we would no longer encounter dead neurons in that region.

Here is the derivative of the Leaky ReLU function:

f'(x) = 1 for x ≥ 0, and 0.01 for x < 0


The limitations that this function faces include:

• The predictions may not be consistent for negative input values.
• The gradient for negative values is a small value, which makes learning the model parameters time-consuming.
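The sketch below (illustrative Python; the negative slope 0.01 matches the definition above but is a convention, not a fixed standard) contrasts ReLU and Leaky ReLU together with their gradients.

```python
def relu(x):
    return max(0.0, x)                  # f(x) = max(0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0        # zero on the negative side -> dying ReLU

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x    # small positive slope for negative inputs

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha      # non-zero gradient keeps neurons alive

for x in [-2.0, 0.5]:
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
# -2.0: ReLU output 0.0 with gradient 0.0 (dead);
#       Leaky ReLU output -0.02 with gradient 0.01 (still learns)
#  0.5: both output 0.5 with gradient 1.0
```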
Softmax Function
Before exploring the ins and outs of the Softmax activation function, we should focus on
its building block—the sigmoid/logistic activation function that works on calculating
probability values.
The output of the sigmoid function was in the range of 0 to 1, which can be thought of as
probability.

But—

This function faces certain problems.

Let’s suppose we have five output values of 0.8, 0.9, 0.7, 0.8, and 0.6, respectively. How
can we move forward with it?

The answer is: We can’t.

The above values don’t make sense as the sum of all the classes/output probabilities should
be equal to 1.
You see, the Softmax function is described as a combination of multiple sigmoids.

It calculates the relative probabilities. Similar to the sigmoid/logistic activation function, the Softmax function returns the probability of each class.

It is most commonly used as an activation function for the last layer of the neural network in the case of multi-class classification.

Mathematically it can be represented as:

softmax(z_i) = e^(z_i) / Σ_j e^(z_j)


Let’s go over a simple example together.

Assume that you have three classes, meaning that there would be three neurons in the
output layer. Now, suppose that your output from the neurons is [1.8, 0.9, 0.68].

Applying the Softmax function over these values to give a probabilistic view will result in
the following outcome: [0.58, 0.23, 0.19].

The index with the largest probability is then taken as the predicted class; equivalently, the output can be mapped to a one-hot vector with 1 at the largest-probability index and 0 at the other two indexes.
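Here is a small sketch of the worked example (illustrative Python; subtracting the maximum before exponentiating is a common numerical-stability trick added here, not part of the slides):

```python
import math

def softmax(z):
    """softmax(z_i) = e^(z_i) / sum_j e^(z_j); shifting by max(z) avoids overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.8, 0.9, 0.68])
print([round(p, 2) for p in probs])   # [0.58, 0.23, 0.19]
print(sum(probs))                     # ~1.0: the class probabilities sum to one
print(probs.index(max(probs)))        # 0 -> the first class is predicted
```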
Why are deep neural networks hard to train?
There are two challenges you might encounter when training your deep neural networks.
1.Vanishing Gradients
Like the sigmoid function, certain activation functions squish an ample input space into a
small output space between 0 and 1. Therefore, a large change in the input of the sigmoid
function will cause a small change in the output. Hence, the derivative becomes small.
For shallow networks with only a few layers that use these activations, this isn’t a big
problem. However, when more layers are used, it can cause the gradient to be too small
for training to work effectively.
2. Exploding Gradients
Exploding gradients are problems where significant error gradients accumulate and
result in very large updates to neural network model weights during training. An
unstable network can result when there are exploding gradients, and the learning
cannot be completed. The values of the weights can also become so large as to
overflow and result in something called NaN values.
How to choose the right Activation Function?
You need to match the activation function of your output layer to the type of prediction problem that you are solving, specifically the type of predicted variable.
Here's what you should keep in mind.
As a rule of thumb, you can begin with using the ReLU activation function and then
move over to other activation functions if ReLU doesn’t provide optimum results.

And here are a few other guidelines to help you out.

1.ReLU activation function should only be used in the hidden layers.


2.Sigmoid/Logistic and Tanh functions should not be used in hidden layers as they make
the model more susceptible to problems during training (due to vanishing gradients).
3. Swish function is used in neural networks having a depth greater than 40 layers.
Finally, a few rules for choosing the activation function for your output layer based on the
type of prediction problem that you are solving:
1.Regression - Linear Activation Function
2.Binary Classification—Sigmoid/Logistic Activation Function
3.Multiclass Classification—Softmax
4.Multilabel Classification—Sigmoid

The activation function used in hidden layers is typically chosen based on the type of neural
network architecture.

5.Convolutional Neural Network (CNN): ReLU activation function.


6.Recurrent Neural Network: Tanh and/or Sigmoid activation function.
Loss functions
In deep learning, “loss function” and “cost function” are often used interchangeably.
They both refer to the same concept of a function that calculates the error or
discrepancy between predicted and actual values. The cost or loss function is
minimized during the model's training process to improve accuracy.

Loss functions are classified into:

1. Regression loss
2. Classification loss
A. Regression Loss
1. Mean Squared Error / Squared loss / L2 loss
The Mean Squared Error (MSE) is the simplest and most common loss function. To calculate the MSE, you take the difference between the actual value and the model prediction, square it, and average it across the whole dataset:

MSE = (1/n) Σ (y_i − ŷ_i)²

Advantages
I. Easy to interpret.
II. Always differentiable because of the square.
III. Only one local minimum.

Disadvantages
▪ The error unit is the square of the output unit, so it is not interpreted easily.
▪ Not robust to outliers.

Note – In regression, use a linear activation function at the last neuron.


2. Mean Absolute Error / L1 loss
The Mean Absolute Error (MAE) is another simple loss function. To calculate the MAE, you take the absolute difference between the actual value and the model prediction and average it across the whole dataset:

MAE = (1/n) Σ |y_i − ŷ_i|

Advantages
1. Intuitive and easy.
2. Error unit is the same as that of the output column.
3. Robust to outliers.

Disadvantages
1. The graph is not differentiable at zero, so we cannot use gradient descent directly; we can use subgradient calculation instead.

Note – In regression, use a linear activation function at the last neuron.


Should I use the L1 loss function?

L1 loss is not sensitive to outliers, as it is simply the absolute difference, so if you want to penalize large errors and outliers, then L1 is not a great choice and you should probably use L2 loss instead. However, if you don't want to punish infrequent large errors, then L1 is most likely a good choice.
3. Huber Loss
In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers in the data than the squared error loss.

n – the number of data points.
y – the actual value of the data point, also known as the true value.
ŷ – the predicted value of the data point; this value is returned by the model.
δ – defines the point where the Huber loss function transitions from quadratic to linear.

L_δ(y, ŷ) = 0.5 (y − ŷ)² if |y − ŷ| ≤ δ, and δ (|y − ŷ| − 0.5 δ) otherwise.

Advantages
• Robust to outliers.
• It lies between MAE and MSE.

Disadvantages
• Its main disadvantage is the associated complexity. In order to maximize model accuracy, the hyperparameter δ also needs to be optimized, which increases the training requirements.
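For comparison, here is an illustrative Python sketch of the three regression losses side by side (the example data and δ = 1.0 are assumed values, not from the slides):

```python
def mse(y, y_hat):
    """Mean Squared Error: average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y)

def mae(y, y_hat):
    """Mean Absolute Error: average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)

def huber(y, y_hat, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for a, p in zip(y, y_hat):
        e = abs(a - p)
        total += 0.5 * e ** 2 if e <= delta else delta * (e - 0.5 * delta)
    return total / len(y)

y_true = [3.0, 5.0, 2.0, 100.0]   # the last point is an outlier
y_pred = [2.5, 5.0, 2.0, 10.0]
print(mse(y_true, y_pred))        # ~2025.1: blown up by the squared outlier
print(mae(y_true, y_pred))        # ~22.6: grows only linearly with the outlier
print(huber(y_true, y_pred))      # ~22.4: quadratic near zero, linear for the outlier
```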
B. Classification Loss

1. Binary Cross Entropy / Log loss

It is used in binary classification problems with two classes, for example, whether a person has COVID or not, or whether my article gets popular or not. Binary cross entropy compares each of the predicted probabilities to the actual class output, which can be either 0 or 1. It then calculates a score that penalizes the probabilities based on their distance from the expected value, that is, how close or far they are from the actual value.

BCE = −(1/n) Σ [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]

y_i – actual values
ŷ_i – the Neural Network's prediction
Advantage –
The cost function is differentiable.
Disadvantages –
Multiple local minima.
Not intuitive.
Note – In classification, use a sigmoid activation function at the last neuron.
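A minimal sketch of the computation (illustrative Python; the small eps clip that avoids log(0) is an implementation detail added here, not part of the formula):

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """BCE = -(1/n) * sum( y*log(p) + (1-y)*log(1-p) )."""
    total = 0.0
    for yi, pi in zip(y, y_hat):
        pi = min(max(pi, eps), 1.0 - eps)    # clip to avoid log(0)
        total += yi * math.log(pi) + (1.0 - yi) * math.log(1.0 - pi)
    return -total / len(y)

y_true = [1, 0, 1, 1]
y_pred = [0.9, 0.1, 0.8, 0.4]    # sigmoid outputs of the last neuron
print(binary_cross_entropy(y_true, y_pred))   # ~0.34; lower is better
```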
2. Categorical Cross Entropy
Categorical cross entropy is used for multiclass classification with support for probabilities.

CE measures the distance of the actual probability distribution from the predicted one (actual output vs. predicted output).

With predicted probabilities P(D) = p_1, p_2, p_3, … and actual labels Y(D) = y_1, y_2, y_3, …:

CE = −(y_1 log p_1 + y_2 log p_2 + y_3 log p_3 + … + y_n log p_n)
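For one sample, the formula can be sketched as follows (illustrative Python; Y and P are assumed example values, with Y one-hot encoded):

```python
import math

def categorical_cross_entropy(Y, P, eps=1e-12):
    """CE = -sum_i Y_i * log(P_i) for one sample (Y one-hot, P from softmax)."""
    return -sum(yi * math.log(max(pi, eps)) for yi, pi in zip(Y, P))

Y = [0, 1, 0]             # the true class is the second one
P = [0.23, 0.58, 0.19]    # predicted softmax probabilities
print(categorical_cross_entropy(Y, P))   # -log(0.58), roughly 0.54
```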


Optimization, Learning with backpropagation, Learning Parameters: Gradient Descent (GD), Stochastic and
Mini Batch GD, Nesterov Accelerated GD,
Backpropagation

Backpropagation is a training algorithm used for training feedforward neural networks. It plays an important part in improving the predictions made by neural networks, because backpropagation is able to improve the output of the neural network iteratively.

In a feedforward neural network, the input moves forward from the input layer to the output layer. Backpropagation helps improve the neural network's output by propagating the error backward from the output layer to the input layer.
How Does Backpropagation Work?

To understand how backpropagation works, let's first understand how a feedforward network works.
1. When a neural network is first trained, it is first fed with input. Since the neural network
isn’t trained yet, we don’t know which weights to use for each input.
2. And so, each input is randomly assigned a weight. Since the weights are randomly
assigned, the neural network will likely make the wrong predictions.
3. It will give out the incorrect output. When the neural network gives out the incorrect
output, this leads to an output error.
4. This error is the difference between the actual and predicted outputs. A cost function
measures this error.

Role of Cost function

1. The cost function (J) indicates how accurately the model performs.
2. It tells us how far-off our predicted output values are from our actual values.
3. It is also known as the error. Because the cost function quantifies the error, we aim to
minimize the cost function.
What we want is to reduce the output error.
Since the weights affect the error, we will need
to readjust the weights. We have to adjust the
weights such that we have a combination of
weights that minimizes the cost function.

This is where Backpropagation comes in…


• Backpropagation allows us to readjust our weights to reduce output error.
• The error is propagated backward during backpropagation from the output to the input
layer.
• This error is then used to calculate the gradient of the cost function with respect to each
weight.
Essentially, backpropagation aims to calculate the negative gradient of the cost function. This negative gradient is what helps in adjusting the weights. It gives us an idea of how we need to change the weights so that we can reduce the cost function.

Backpropagation uses the chain rule to calculate the gradient of the cost function. The chain rule involves taking derivatives: the partial derivative of the cost with respect to each parameter, calculated by differentiating with respect to one weight while treating the others as constants. As a result, we obtain the gradient.

Since we have calculated the gradients, we will be able to adjust the weights.
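As a minimal sketch of the chain rule at work (illustrative Python on an assumed toy setup: one input, one sigmoid neuron, squared-error loss; all values are made up for illustration):

```python
import math

# Toy setup: y_hat = sigmoid(w*x + b), cost J = (y_hat - y)^2
x, y = 1.5, 0.0     # one training example (assumed values)
w, b = 0.8, 0.1     # current parameters (assumed values)

z = w * x + b                         # forward pass: net input
y_hat = 1.0 / (1.0 + math.exp(-z))    # forward pass: activation
J = (y_hat - y) ** 2                  # cost for this example

# Backward pass, chain rule: dJ/dw = dJ/dy_hat * dy_hat/dz * dz/dw
dJ_dyhat = 2.0 * (y_hat - y)
dyhat_dz = y_hat * (1.0 - y_hat)      # derivative of the sigmoid
grad_w = dJ_dyhat * dyhat_dz * x      # dz/dw = x
grad_b = dJ_dyhat * dyhat_dz * 1.0    # dz/db = 1
print(J, grad_w, grad_b)              # these gradients feed gradient descent
```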
Significance of Gradient Descent

The weights are adjusted using a process called gradient descent. Gradient descent is
an optimization algorithm that is used to find the weights that minimize the cost
function. Minimizing the cost function means getting to the minimum point of the cost
function. So, gradient descent aims to find a weight corresponding to the cost function’s
minimum point. To find this weight, we must navigate down the cost function until we
find its minimum point.
But first, to navigate the cost function, we need two things:
1. The direction in which to navigate and
2. The size of the steps for navigating.

The Direction
The direction for navigating the cost function is found using the gradient.

The Gradient
To know in which direction to navigate, gradient descent uses backpropagation. More
specifically, it uses the gradients calculated through backpropagation. These gradients are used
for determining the direction to navigate to find the minimum point. Specifically, we aim to
find the negative gradient. This is because a negative gradient indicates a decreasing slope. A
decreasing slope means that moving downward will lead us to the minimum point. For
example:
The Step Size
The step size for navigating the cost function is
determined using the learning rate.
Learning Rate
The learning rate is a tuning parameter that
determines the step size at each iteration of
gradient descent. It determines the speed at
which we move down the slope.
The step size plays an important part in ensuring a balance between optimization time and
accuracy. The step size is measured by a parameter alpha (α). A small α means a small
step size, and a large α means a large step size.
1. If the step sizes are too large, we could miss the minimum point completely. This can
yield inaccurate results.
2. If the step size is too small, the optimization process could take too much time. This
will lead to a waste of computational power.
The step size is evaluated and updated according to the
behavior of the cost function. The higher the gradient of the
cost function, the steeper the slope and the faster a model can
learn (high learning rate). A high learning rate results in a
higher step value, and a lower learning rate results in a lower
step value. If the gradient of the cost function is zero, the
model stops learning.
Descending the Cost Function
Navigating the cost function consists of adjusting the weights.
The weights are adjusted using the following formula:

w_new = w_old − α · (∂J/∂w)

This is the formula for gradient descent. As we can see, to obtain the new weight, we use the gradient, the learning rate, and an initial weight.

Adjusting the weights consists of multiple iterations. We take a new step down for each iteration and calculate a new weight. Using the initial weight, the gradient, and the learning rate, we can determine the subsequent weights.

Let's consider a graphical example of this. From the graph of the cost function, we can see that:

• To start descending the cost function, we first initialize a random weight.
• Then, we take a step down and obtain a new weight using the gradient and the learning rate. With the gradient, we know which direction to navigate; with the learning rate, we know the step size for navigating the cost function.
• We are then able to obtain a new weight using the gradient descent formula.
• We repeat this process until we reach the minimum point of the cost function.
• Once we've reached the minimum point, we find the weights that correspond to the minimum of the cost function.
Backpropagation vs. Gradient Descent
Summarizing Gradient Descent
Gradient descent is an optimization algorithm used to find the weights that minimize the cost function. It needs to descend the cost function to its minimum point to find these weights.
It needs the gradient and the learning rate to descend the cost
function. The gradient helps find the direction for reaching the
minimum point of the cost function. The learning rate helps
determine the speed at which to reach the minimum point.
Upon reaching the minimum point, gradient descent finds
weights corresponding to the minimum point.
Summarizing Backpropagation
Backpropagation is the algorithm for calculating the gradients of the cost function with
respect to the weights. Backpropagation is used to improve the output of neural networks. It
does this by propagating the error in a backward direction and calculating the gradient of the
cost function for each weight. These gradients are used in the process of gradient descent.
Conclusion
To put it plainly, gradient descent is the process of using gradients to find the minimum value
of the cost function, while backpropagation is calculating those gradients by moving in a
backward direction in the neural network. Judging from this, it would be safe to say that
gradient descent relies on backpropagation.
It would also be plausible to say that the neural network is trained using gradient descent
and that backpropagation is only used to assist in the process of calculating the gradients.
Although gradient descent is often paired with backpropagation to reduce the error in neural
networks, they each perform different functions.
Key Takeaways:
•Gradient descent relies on backpropagation. Gradient descent
uses gradients to help it find the minimum value of the cost
function. Backpropagation calculates these gradients using the
chain rule.
•Gradient descent is used to find a weight combination that
minimizes the cost function. Backpropagation propagates the
error backward and calculates the gradient for each weight.
•Gradient descent requires the learning rate and the
gradient. The gradient helps find the direction to the minimum
point of the cost function. The learning rate helps find the
speed at which to navigate the cost function.
•Together, backpropagation and gradient descent improve the
prediction accuracy of neural networks. Backpropagation
propagates the error backward and calculates the gradient for
each weight. This gradient is used in the process of gradient
descent. Gradient descent involves adjusting the weights of the
neural network. Adjusting the weights helps minimize the
output error of the neural network.
Important Deep Learning Terms for gradient descent
•Epoch – The number of times the algorithm runs on the whole training dataset.

•Sample – A single row of a dataset.

•Batch – The number of samples to be taken for updating the model parameters.

•Learning rate – A parameter that tells the model the scale at which the model weights should be updated.

•Cost Function/Loss Function – A cost function is used to calculate the cost, which is the difference between the predicted value and the actual value.

•Weights/Bias – The learnable parameters in a model that control the signal between two neurons.
Gradient descent
Gradient Descent is an optimization algorithm used in Machine/Deep Learning algorithms. The goal of Gradient Descent is to minimize the objective convex function f(x) using iteration. In simple words: how to reach the global minimum through iteration.
Gradient Descent on the cost function.
Steps to implement Gradient Descent
1. Randomly initialize values.
2. Update values.
3. Repeat until slope = 0.

The learning rate must be chosen wisely:
1. If it is too small, the model will take a long time to learn.
2. If it is too large, the model will not converge, as our pointer will overshoot and we will not be able to get to the minimum.
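These steps can be sketched as follows (illustrative Python, minimizing the toy convex function f(w) = (w − 3)², whose gradient is 2(w − 3); the learning rate and stopping tolerance are example choices):

```python
def gradient(w):
    return 2.0 * (w - 3.0)     # f(w) = (w - 3)^2 has its minimum at w = 3

w = 10.0                       # step 1: randomly initialize
lr = 0.1                       # learning rate (alpha), an example choice
for _ in range(1000):
    g = gradient(w)
    if abs(g) < 1e-8:          # step 3: repeat until the slope is ~0
        break
    w = w - lr * g             # step 2: update, w_new = w_old - alpha * gradient
print(w)                       # ~3.0, the minimum
```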
Stochastic Gradient Descent

➢ There is a need for better options than using gradient descent on massive data.
➢ To tackle the challenges of large datasets, we have stochastic gradient descent, a popular approach among optimizers in deep learning.
➢ The term stochastic denotes the element of randomness upon which the algorithm relies.
➢ In stochastic gradient descent, instead of processing the entire dataset during each iteration, we randomly select batches of data. This implies that only a few samples from the dataset are considered at a time, allowing for more efficient and computationally feasible optimization in deep learning models.
➢ The procedure is to first select the initial parameters w and the learning rate η, and then randomly shuffle the data at each iteration to reach an approximate minimum.
➢ Since we are not using the whole dataset but the batches of it for each iteration, the path
taken by the algorithm is full of noise as compared to the gradient descent algorithm.
Thus, SGD uses a higher number of iterations to reach the local minima.
➢ Due to an increase in the number of iterations, the overall computation time increases.
But even after increasing the number of iterations, the computation cost is still less than
that of the gradient descent optimizer.
➢ So the conclusion is: if the data is enormous and computational time is an essential factor, stochastic gradient descent should be preferred over the batch gradient descent algorithm.
Mini Batch Gradient Descent

➢ In this variant of gradient descent, instead of taking all the training data, only a subset of
the dataset is used for calculating the loss function.
➢ Since we are using a batch of data instead of taking the whole dataset, fewer iterations
are needed. That is why the mini-batch gradient descent algorithm is faster than both
stochastic gradient descent and batch gradient descent algorithms.
➢ This algorithm is more efficient and robust than the earlier variants of gradient descent.
As the algorithm uses batching, all the training data need not be loaded in the memory,
thus making the process more efficient to implement.
➢ Moreover, the cost function in mini-batch gradient descent is noisier than the batch
gradient descent algorithm but smoother than that of the stochastic gradient descent
algorithm. Because of this, mini-batch gradient descent is ideal and provides a good
balance between speed and accuracy.
➢ Despite all that, the mini-batch gradient descent algorithm has some downsides too. It needs a hyperparameter, the "mini-batch size", which must be tuned to achieve the required accuracy.
➢ Although a batch size of 32 is considered appropriate for almost every case, in some cases it results in poor final accuracy, which gives rise to the need to look for other alternatives. A sketch of the algorithm follows below.
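To tie the pieces together, here is an illustrative Python/NumPy sketch of mini-batch gradient descent for a simple linear model on synthetic data (the batch size of 32 follows the note above; all other values are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 1))
y = 4.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=1024)   # synthetic data

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32       # batch size 32, as noted above

for epoch in range(20):
    idx = rng.permutation(len(X))                 # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]     # one mini-batch of indices
        xb, yb = X[batch, 0], y[batch]
        err = (w * xb + b) - yb                   # prediction error on the batch
        w -= lr * 2.0 * np.mean(err * xb)         # MSE gradient w.r.t. w
        b -= lr * 2.0 * np.mean(err)              # MSE gradient w.r.t. b

print(w, b)    # approaches the true parameters, ~4.0 and ~2.0
```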
