
MORE ON GRADIENT DESCENT
SPS, Summer 2022
GRADIENT DESCENT
An optimization algorithm
Requires a differentiable loss function
Finds the weight and bias values that minimize the loss
First take random initial values for the weights and bias, then apply the update rule (a small sketch follows this slide):
w_new = w_old - lr * dL/dw
b_new = b_old - lr * dL/db
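To make the update rule concrete, here is a minimal sketch in Python that fits a single weight and bias to one data point with a squared-error loss. The data point, learning rate, and step count are illustrative assumptions, not values from the slides.

```python
# Minimal gradient descent sketch for one weight and one bias.
# Loss: L(w, b) = (w*x + b - y)^2 for a single (hypothetical) data point.

x, y = 2.0, 7.0          # one training example (made up for illustration)
w, b = 0.5, 0.0          # arbitrary initial values
lr = 0.05                # learning rate

for step in range(200):
    y_hat = w * x + b            # forward pass
    dL_dw = 2 * (y_hat - y) * x  # dL/dw
    dL_db = 2 * (y_hat - y)      # dL/db
    w = w - lr * dL_dw           # w_new = w_old - lr * dL/dw
    b = b - lr * dL_db           # b_new = b_old - lr * dL/db

print(w, b)  # converges so that w*2 + b is close to 7
```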
ACTIVATION FUNCTIONS

Sigmoid squashes values into the range 0 to 1.

ReLU outputs 0 for negative values and passes positive values through unchanged (see the sketch below).
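A minimal NumPy sketch of the two activation functions described above; the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # 0 for negative inputs; positive inputs pass through unchanged.
    return np.maximum(0.0, z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(sigmoid(z))  # all values between 0 and 1
print(relu(z))     # [0.  0.  0.  0.5 3. ]
```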
ACTIVATION FUNCTIONS

We care about activation functions because they are used to calculate the output of each neuron, and those outputs also appear in the derivatives we compute when updating the weights and bias.
VANISHING GRADIENT PROBLEM
In backpropagation, the gradient of the loss function with respect to an early-layer weight can be written as a product of gradients flowing back through the layers. The weight updates therefore depend on the derivatives of the activation functions at each node.
The derivative of the sigmoid function reaches a maximum value of only 0.25. As more layers are added to the network, the product of these derivatives keeps shrinking, until the partial derivative of the loss function with respect to the early layers approaches zero and effectively vanishes. We call this the vanishing gradient problem (a numerical sketch follows this slide).
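The short sketch below multiplies together sigmoid derivatives for several stacked layers. Even in the best case, where every pre-activation is 0 and each factor takes its maximum value of 0.25, the product collapses quickly as depth grows. The depths shown are arbitrary illustration choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)); maximum value 0.25 at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

# Best case: every factor is exactly 0.25, yet the product still vanishes with depth.
for depth in (2, 5, 10, 20):
    product = np.prod([sigmoid_derivative(0.0) for _ in range(depth)])
    print(depth, product)   # 0.0625, ~9.8e-04, ~9.5e-07, ~9.1e-13
```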
VANISHING GRADIENT PROBLEM
w_new = w_old - lr * dL/dw
b_new = b_old - lr * dL/db
When dL/dw is nearly zero, these updates become negligible, so the early layers effectively stop learning.
SOLUTION FOR VANISHING GRADIENT
In a network with vanishing gradients, the weights cannot be updated, so the network cannot learn and its performance suffers as a result.
The simplest solution is to replace the activation function with ReLU. The derivative of ReLU is 1 for inputs greater than zero and 0 for negative inputs.
The problem with ReLU is that its gradient is exactly 0 for negative inputs, so some neurons can stop updating entirely (dying ReLU).
Other techniques to avoid the vanishing gradient problem:
proper weight initialization,
reduced model complexity,
Leaky ReLU (see the sketch below),
batch normalization,
residual networks (ResNet).
TYPES OF GRADIENT DESCENT
In Batch Gradient Descent, all the training data is taken into consideration to take a single step.
In Stochastic Gradient Descent (SGD), we consider just one example at a time to take a single step.
Mini-Batch Gradient Descent is a mixture of Batch Gradient Descent and SGD: we use a batch of a fixed number of training examples, smaller than the full dataset, and call it a mini-batch.
BATCH GD
Take the whole dataset
Feed it to the neural network
Calculate its gradient
Use the gradient to update the weights
Repeat for the chosen number of epochs

Works well on relatively smooth error surfaces, moving directly towards an optimum solution.
It struggles when the dataset is huge, and it may not reach the global minimum (a sketch follows this slide).
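A minimal batch gradient descent sketch for a linear model on a toy dataset. The data, learning rate, and epoch count are hypothetical; the key point is that every update uses the gradient averaged over the whole dataset.

```python
import numpy as np

# Toy dataset (hypothetical): y is roughly 3x + 1 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3 * X + 1 + 0.1 * rng.normal(size=100)

w, b, lr = 0.0, 0.0, 0.1

for epoch in range(100):
    # One step per epoch, using the mean-squared-error gradient over ALL data.
    error = (w * X + b) - y
    dL_dw = 2 * np.mean(error * X)
    dL_db = 2 * np.mean(error)
    w -= lr * dL_dw
    b -= lr * dL_db

print(w, b)  # close to 3 and 1
```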
STOCHASTIC GD
Take one example
Feed it to the neural network
Calculate its gradient
Use the gradient calculated in step 3 to update the weights
Repeat steps 1-4 for all the examples in the training dataset

SGD can be used on larger datasets and converges faster.
It is good for finding the global minimum, but it never settles exactly at the minimum; it keeps dancing around it (a sketch follows this slide).
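The sketch below changes only the inner loop of the previous batch GD example: each weight update now uses a single example. Shuffling the examples each epoch and the toy data itself are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3 * X + 1 + 0.1 * rng.normal(size=100)

w, b, lr = 0.0, 0.0, 0.05

for epoch in range(20):
    order = rng.permutation(len(X))      # visit examples in random order
    for i in order:
        error = (w * X[i] + b) - y[i]    # gradient from ONE example
        w -= lr * 2 * error * X[i]
        b -= lr * 2 * error
    # Many small, noisy updates per epoch: the parameters "dance around" the minimum.

print(w, b)  # roughly 3 and 1, but noisier than batch GD
```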
MINI BATCH GD
Pick a mini-batch
Feed it to the neural network
Calculate the mean gradient of the mini-batch
Use the mean gradient calculated in step 3 to update the weights
Repeat steps 1-4 for all the mini-batches we created

Example: total data = 500, batch_size = 50, so total batches = 500/50 = 10. For every 50 data points the weights are updated once, and this repeats 10 times per pass over the data.
Mini-batch GD takes the good sides of both SGD and batch GD: it is faster than batch GD, and each mini-batch is small enough to compute at one time (a sketch follows this slide).
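Following the 500 / 50 example on the slide, here is a mini-batch sketch: 500 toy data points, a batch size of 50, so 10 weight updates per pass over the data. The data, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=500)             # total data = 500
y = 3 * X + 1 + 0.1 * rng.normal(size=500)

w, b, lr = 0.0, 0.0, 0.1
batch_size = 50                               # total batches = 500 / 50 = 10

for epoch in range(30):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        error = (w * X[idx] + b) - y[idx]
        # Mean gradient over the 50 points in this mini-batch
        w -= lr * 2 * np.mean(error * X[idx])
        b -= lr * 2 * np.mean(error)

print(w, b)  # close to 3 and 1
```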
RESOURCES
https://towardsdatascience.com/batch-mini-batch-stochastic-gradient-descent-7a62ecba642a
Image sources: Google
