Vanishing Gradient and Exploding Gradient: Simple Notes

Vanishing Gradient and Exploding Gradient:

● Neural networks are trained using back propagation and gradient-based learning methods.
● During training, we want to reach the optimum values of the weights that result in minimum loss.
● Each weight gets updated repeatedly during training.
● The update is proportional to the partial derivative of the error function with respect to the current
weight in each training iteration.
● However, sometimes this update becomes too small, so the weight barely changes. This results in very little or practically no training of the network and is referred to as the vanishing gradient problem.

● As shown in the figure, the sigmoid function can lead to the vanishing gradient problem, while ReLU or Leaky ReLU largely avoids this issue (a small numeric sketch follows this bullet).
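The following is a minimal numeric sketch, not part of the original notes, of why deep sigmoid networks suffer from this: the chain rule multiplies one derivative per layer, and the sigmoid derivative is at most 0.25, so the product shrinks rapidly with depth. The depth and pre-activation value are assumed, and weights are ignored for simplicity.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The backpropagated gradient is (roughly) a product of one sigmoid derivative
# per layer; the contribution of the weights is ignored here for simplicity.
grad = 1.0
for layer in range(20):          # 20 hypothetical layers, pre-activation z = 0.5
    grad *= sigmoid_deriv(0.5)
print(grad)                      # ~2.6e-13: the gradient has effectively vanished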

Back Propagation
● Each neuron in the network has an activation function and a bias term.
● It accepts a finite number of input-weight products, adds a bias term, and then applies an activation function. The output is then passed to the next neuron (see the sketch after this list).
● The difference between the expected output and the predicted value is called the error term.
● The error is minimized when we have found the best combination of weights and biases across the layers and neurons.
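As a rough illustration of the neuron computation described above, here is a minimal Python sketch; the inputs, weights, bias, and target value are made-up numbers, and sigmoid is used as the assumed activation.

import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(inputs, weights) + bias    # sum of input-weight products plus bias
    return 1.0 / (1.0 + np.exp(-z))       # assumed sigmoid activation

inputs  = np.array([0.5, -1.2, 0.3])      # made-up inputs
weights = np.array([0.4,  0.7, -0.2])     # made-up weights
bias    = 0.1
output  = neuron(inputs, weights, bias)

expected = 0.6                   # hypothetical target value
error = expected - output        # the error term that training tries to minimize
print(output, error)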
Gradient Descent

● When the error is calculated, gradient descent is applied to the error function.
● The gradient is the partial derivative of the error function with respect to the weights and biases.
● The back propagation algorithm adjusts these weights and biases using a learning rate.
● This is done from the last layer to the first layer in the backward direction or from the
right to the left.
● In each iteration, gradient descent determines the direction of change, updating the weights and biases until the error is minimized or reaches a global minimum, as shown in the figure (a minimal update-rule sketch follows this list).
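A minimal sketch of the update rule described above, assuming a single weight, a squared-error loss, and an arbitrary learning rate of 0.1; the data point (x=2, y=4) is made up.

# Tiny one-parameter "network": prediction = w * x, loss = (prediction - y)^2.
def loss_grad(w, x, y):
    return 2.0 * (w * x - y) * x          # derivative of the loss with respect to w

w, learning_rate = 0.0, 0.1
for step in range(100):
    g = loss_grad(w, x=2.0, y=4.0)        # gradient: direction of steepest increase of the error
    w -= learning_rate * g                # move against the gradient to reduce the error
print(w)                                  # converges toward 2.0, the error-minimizing weight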

Detection:
1. The kernel weight distribution shows weights approaching zero.
2. Weights in the final layers change more than those in the initial layers.
3. Slow or no improvement in the model during training.
4. Training stalls early with no further improvement (a layer-wise gradient-monitoring sketch follows this list).
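One hedged way to check for these signs is to log the average gradient magnitude per layer in each training iteration; the layer names and values below are assumed purely for illustration.

import numpy as np

def gradient_report(grads_per_layer):
    # Print the mean absolute gradient for each layer in one training iteration.
    for name, grad in grads_per_layer.items():
        print(f"{name}: mean |grad| = {np.mean(np.abs(grad)):.2e}")

# Assumed example: early layers receive gradients orders of magnitude smaller
# than the final layers, which points to vanishing gradients.
grads_per_layer = {
    "layer_1": np.full(100, 1e-7),        # made-up values for illustration
    "layer_5": np.full(100, 1e-2),
}
gradient_report(grads_per_layer)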
Solutions:
1. Reduce Network Depth: Simplifies the network but might reduce performance.
2. Use ReLU Activation: ReLU (Rectified Linear Unit) helps maintain gradients better than
sigmoid or tanh functions.
3. Residual Networks (ResNets): Use skip connections to maintain gradient flow, making deep networks train more effectively (see the sketch after this list).
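A rough sketch of the skip-connection idea in point 3, with assumed layer sizes and randomly initialized weights: the block's output is its input plus a learned transformation, so during back propagation the gradient always has a direct identity path around the block.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    h = relu(x @ w1)                      # learned transformation of the input
    return x + h @ w2                     # skip connection: add the untouched input back

x  = np.random.randn(1, 8)                # assumed input and layer sizes
w1 = np.random.randn(8, 8) * 0.1
w2 = np.random.randn(8, 8) * 0.1
y  = residual_block(x, w1, w2)
print(y.shape)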

Exploding Gradient Problem:


In deep networks, error gradients can become very large as they accumulate. The resulting weight updates are then very large, which makes the network unstable. There are a few signs that can help us detect exploding gradients:
1. The model shows poor loss during the training phase (the loss does not decrease or behaves erratically).
2. During training, we might encounter NaN values for the loss or for the weights.
3. The model is generally unstable; in other words, the changes in loss between subsequent iterations are huge, indicating an unstable state.
4. The error gradients are constantly above 1 for each of the layers and neurons in the network (a numeric sketch follows this list).
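A minimal numeric sketch of sign 4, using an assumed per-layer factor of 1.5: when each layer multiplies the gradient by more than 1, the gradient grows exponentially with depth.

grad = 1.0
for layer in range(50):                   # 50 hypothetical layers
    grad *= 1.5                           # each layer multiplies the gradient by more than 1
print(grad)                               # ~6.4e8: huge updates that destabilize training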

Exploding gradients can be resolved using the following approaches:


1. We can reduce the number of layers in the network, or try reducing the batch size during training.
2. L1 and L2 weight regularization can be added, which acts as a penalty on the network's loss function.
3. Gradient clipping can be used: we limit the size of the gradients during training by setting a threshold for the error gradients, and any gradient that exceeds the threshold is clipped to that limit (see the sketch after this list).
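A minimal sketch of norm-based gradient clipping as described in point 3; the threshold value and the example gradient are assumed for illustration.

import numpy as np

def clip_gradient(grad, threshold):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)  # rescale, keeping the direction
    return grad

g = np.array([30.0, -40.0])               # assumed exploding gradient with norm 50
g = clip_gradient(g, threshold=5.0)       # assumed threshold
print(np.linalg.norm(g))                  # 5.0: the update size is now bounded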
