More On Gradient Descent
GRADIENT DESCENT
An optimization algorithm
Needs a differentiable loss function
Finds the parameter values that minimize the loss
Start with random values for the weights and bias, then apply the update rule:
w_new = w_old - lr * dL/dw
b_new = b_old - lr * dL/db
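A minimal sketch of one update step (assuming, for illustration, a 1-D linear model with a mean-squared-error loss; the data, model, and learning rate here are assumptions, not from the slides):

import numpy as np

# Hypothetical 1-D linear model y = w*x + b with mean-squared-error loss (assumed example)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

rng = np.random.default_rng(0)
w, b = rng.standard_normal(), rng.standard_normal()   # first take random values of weight and bias
lr = 0.01                                             # learning rate

y_pred = w * x + b
dL_dw = np.mean(2 * (y_pred - y) * x)                 # dL/dw for the MSE loss
dL_db = np.mean(2 * (y_pred - y))                     # dL/db for the MSE loss

w = w - lr * dL_dw                                    # w_new = w_old - lr * dL/dw
b = b - lr * dL_db                                    # b_new = b_old - lr * dL/db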
ACTIVATION FUNCTIONS
We are interested in activation functions because they are used to compute the output of each neuron, and those outputs are also used when we calculate the derivatives to update the weights and biases.
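For example, with the sigmoid as an illustrative activation (an assumption; the slides do not name a specific function here), the neuron output and the derivative reused in the weight updates could be computed as:

import numpy as np

def sigmoid(z):
    # Neuron output for pre-activation z
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # Derivative reused in the chain rule when updating weights and biases
    s = sigmoid(z)
    return s * (1.0 - s)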
VANISHING GRADIENT PROBLEM
In backpropagation, the gradient of the loss with respect to a weight in an early layer is a product of gradients coming from the later layers (chain rule).
The weight updates of nodes in the network therefore depend on the derivatives of the activation functions of each node.
For the sigmoid, the derivative reaches a maximum value of 0.25. With more layers in the network, the product of these derivatives keeps shrinking until the partial derivative of the loss function approaches zero and effectively vanishes. We call this the vanishing gradient problem.
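A rough numerical sketch of the effect (assuming each layer contributes at most the sigmoid's maximum derivative of 0.25):

# Upper bound on the product of sigmoid derivatives across layers
max_sigmoid_grad = 0.25
for n_layers in (2, 5, 10, 20):
    print(n_layers, max_sigmoid_grad ** n_layers)
# 2 -> 0.0625, 5 -> ~9.8e-4, 10 -> ~9.5e-7, 20 -> ~9.1e-13:
# the gradient reaching the early layers effectively vanishes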
VANISHING GRADIENT PROBLEM
When dL/dw and dL/db are close to zero, the updates
w_new = w_old - lr * dL/dw
b_new = b_old - lr * dL/db
barely change the weights and bias, so learning stalls.
SOLUTIONS FOR VANISHING GRADIENTS
In a network with vanishing gradients, the weights cannot be updated, so the network cannot learn. The performance of the network decreases as a result.
The simplest solution to the problem is to replace the activation function with ReLU.
The derivative of the ReLU function is defined as 1 for inputs greater than zero and 0 for negative inputs.
The problem with the use of ReLU is when the gradient has a value of 0 (dying ReLU).
Other techniques to avoid the vanishing gradient problem are proper weight initialization, reducing model complexity, Leaky ReLU, batch normalization, and residual networks (ResNet).
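A small sketch of the ReLU and Leaky ReLU derivatives described above (the 0.01 slope for Leaky ReLU is a common default, assumed here):

import numpy as np

def relu_derivative(z):
    # 1 for positive inputs, 0 otherwise; a neuron stuck at 0 stops learning (dying ReLU)
    return (z > 0).astype(float)

def leaky_relu_derivative(z, alpha=0.01):
    # Small slope alpha keeps a nonzero gradient for negative inputs
    return np.where(z > 0, 1.0, alpha)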
TYPES OF GRADIENT
DESCENT
In Batch Gradient Descent, all the training data is taken into consideration to take a single step.
In Stochastic Gradient Descent (SGD), we consider just one example at a time to take a single step.
Mini-Batch Gradient Descent is a mixture of Batch Gradient Descent and SGD: we use a batch of a fixed number of training examples, smaller than the full dataset, and call it a mini-batch.
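One way to see the three variants is that they differ only in how many examples are used per step: the full dataset (Batch GD), one example (SGD), or a fixed batch size in between (mini-batch). A minimal sketch (the helper name and shapes are assumptions):

import numpy as np

def iterate_minibatches(X, y, batch_size):
    # batch_size = len(X) gives Batch GD, batch_size = 1 gives SGD,
    # anything in between gives Mini-Batch Gradient Descent
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]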
BATCH GD
Take the whole dataset
Feed it to the neural network
Calculate the gradient of the loss
Use the gradient to update the weights
Repeat for a number of epochs
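A minimal sketch of these steps (assuming, as an illustration, a single-layer linear model with a mean-squared-error loss; the data, learning rate, and epoch count are assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))                # the whole dataset
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.standard_normal(100)

w = np.zeros(3)
lr, epochs = 0.1, 50

for epoch in range(epochs):                      # repeat for a number of epochs
    y_pred = X @ w                               # feed the whole dataset to the model
    grad = (2 / len(X)) * X.T @ (y_pred - y)     # calculate the gradient over the full batch
    w = w - lr * grad                            # use the gradient to update the weights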