
Lect 7 - Vanishing Gradient Problem

The document discusses the vanishing and exploding gradient problems encountered while training deep neural networks. It explains how these issues arise during backpropagation, particularly with activation functions like sigmoid and tanh, and suggests solutions such as using ReLU, leaky ReLU, residual networks, and batch normalization. Additionally, it addresses the exploding gradient problem caused by large weight values and recommends gradient clipping as a solution.


VANISHING AND EXPLODING GRADIENT PROBLEM

Dr. Umarani Jayaraman


Vanishing and Exploding Gradient Problem

• The problems encountered when training very deep neural networks are vanishing and exploding gradients.
• When training a very deep neural network, the derivatives sometimes become very small (vanishing gradient) or very large (exploding gradient), which makes training difficult.
Vanishing and Exploding Gradient Problem

• This problem occurs during the training of a neural network with backpropagation learning.
Vanishing Gradient Problem
• In earlier days, the most commonly used activation function was the sigmoid activation function.
Vanishing Gradient Problem
• The sigmoid maps its inputs to values between 0 and 1, and its derivative lies between 0 and 0.25.
• The weight update formula is
w_new = w_old - η · (∂L/∂w_old)
Vanishing Gradient Problem

∂L/∂w = 0.2 * 0.15 * 0.05 = 0.0015
w_new = 2.5 - (1)(0.0015) = 2.4985
Vanishing Gradient Problem
• As the number of layers in a neural network increases, this product of derivatives keeps decreasing.
• Hence, adding more layers leads to a derivative that is almost 0; at that point we would see w_new ≈ w_old.
• If the new weight is approximately equal to the old weight, the network effectively stops learning. This is called the vanishing gradient problem.
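To make this concrete, here is a minimal sketch in plain NumPy (the weights and pre-activations below are made-up illustrative values, not taken from the slides) that multiplies the local sigmoid derivatives along a backpropagation path of increasing depth; the product shrinks towards zero very quickly:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # never larger than 0.25

rng = np.random.default_rng(0)
grad = 1.0
for layer in range(1, 21):
    z = rng.normal()              # hypothetical pre-activation at this layer
    w = rng.normal(scale=0.5)     # hypothetical weight on the backprop path
    grad *= sigmoid_deriv(z) * w  # chain rule: multiply the local gradients
    if layer in (1, 5, 10, 20):
        print(f"after {layer:2d} layers: gradient factor = {grad:.2e}")
```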
Vanishing Gradient Problem
• This is the reason why the sigmoid is no longer used as the activation function in hidden layers.
• The tanh activation function is also avoided because its derivative lies between 0 and 1, so it too leads to the vanishing gradient problem.
Vanishing Gradient Problem
• Solutions for the vanishing gradient problem:
• The simplest solution is to use other activation functions, such as ReLU, Leaky ReLU, or Parametric ReLU.
• These activations saturate in only one direction and are thus more resilient to vanishing gradients.
ReLU Activation function

∂L/∂w = 1 * 1 * 1 = 1
w_new = 2.5 - (1)(1) = 1.5
ReLU Activation function

∂L/∂w = 1 * 1 * 0 = 0 (dead neuron)
w_new = 2.5 - (1)(0) = 2.5
Leaky ReLU
• To fix the problem of dead neurons, Leaky ReLU was introduced.
Leaky ReLU - derivatives
• The derivative is no longer zero, and hence there are no dead neurons.
Leaky ReLU Activation function

∂L/∂w = 0.01 * 1 * 1 = 0.01 (no dead neuron)
w_new = 2.5 - (1)(0.01) = 2.49
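A hedged sketch of the two cases above (plain NumPy; the old weight 2.5 and learning rate 1 come from the slide example, while the negative pre-activation is an illustrative made-up value):

```python
import numpy as np

def relu_deriv(z):
    # 1 for positive inputs, 0 otherwise (the 0 case is the "dead neuron" path)
    return np.where(z > 0, 1.0, 0.0)

def leaky_relu_deriv(z, alpha=0.01):
    # small positive slope alpha instead of 0 for negative inputs
    return np.where(z > 0, 1.0, alpha)

w_old, lr = 2.5, 1.0       # numbers used in the slide example
upstream = 1.0 * 1.0       # product of the other local gradients in the chain
z = np.array(-0.7)         # a negative pre-activation (illustrative)

for name, deriv in [("ReLU", relu_deriv), ("Leaky ReLU", leaky_relu_deriv)]:
    grad = upstream * float(deriv(z))
    w_new = w_old - lr * grad
    print(f"{name:10s}: gradient = {grad:.2f}, new weight = {w_new:.2f}")
```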
Vanishing Gradient Problem
• The other solutions are:
• Use Residual networks (ResNets)
• Use Batch Normalization
• Use a multi-level hierarchy
• Use Long Short-Term Memory (LSTM) networks
• Use faster hardware
• Use genetic algorithms for weight search
Residual neural networks (ResNets)

• One of the newest and most effective ways to resolve the vanishing gradient problem is with residual neural networks, or ResNets (not to be confused with recurrent neural networks).
• Before ResNets, it was observed that a deeper network would often have a higher training error than a shallower one.
Residual neural networks (ResNets)

• As the gradient flows backwards to the initial layers, it keeps getting multiplied by each local gradient.
• Hence, the gradient becomes smaller and smaller, making the updates to the initial layers very small or leaving the weights essentially unchanged (learning stops).
• We can solve this problem if the local gradient somehow becomes 1.
Linear or Identity Activation Function

• Equation: f(x) = x
• Derivative: f’(x) = 1
Residual neural networks (ResNets)

• How can the local gradient be 1, i.e., the derivative of which function is always 1?
• The identity function!
• As the gradient is backpropagated, it does not decrease in value because the local gradient is 1.
Residual neural networks (ResNets)

 The residual  This residual


connection connection doesn’t
directly adds the go through
value at the activation functions
that “squashes” the
beginning of the
derivatives,
block, x, to the
resulting in a higher
end of the block overall derivative of
(F(x)+x). the block.
Residual neural networks (ResNets)

• These skip connections act as gradient superhighways, allowing the gradient to flow unhindered (without restriction).
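As a minimal NumPy sketch (the residual branch F and the numbers below are illustrative assumptions, not the slides' exact network), the derivative of a residual block F(x) + x is F'(x) + 1, so even when F'(x) is tiny the skip connection keeps the gradient close to 1:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def F(x, w):
    # a tiny residual branch: one weight followed by a sigmoid
    return sigmoid(w * x)

def F_deriv(x, w):
    s = sigmoid(w * x)
    return w * s * (1.0 - s)

x, w = 4.0, 2.0                      # illustrative values where the sigmoid saturates
plain_grad = F_deriv(x, w)           # local gradient without a skip connection
residual_grad = F_deriv(x, w) + 1.0  # d/dx of F(x) + x: the skip connection adds 1

print(f"plain block gradient   : {plain_grad:.6f}")
print(f"residual block gradient: {residual_grad:.6f}")
```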
Batch Normalization
• Batch normalization layers can also resolve the vanishing gradient problem.
• As stated before, the problem arises when a large input space is mapped to a small one, causing the derivatives to disappear.
Batch Normalization
• This is most clearly seen on the sigmoid curve when |x| is large.
Batch Normalization
• Batch normalization reduces this problem by simply normalizing the input so that |x| doesn’t reach the outer edges of the sigmoid function.
• It normalizes the input so that most of it falls in the region around zero, where the derivative isn’t too small.
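A minimal sketch of the normalization step (plain NumPy, per-feature standardization over a batch; a full batch-norm layer would also apply learned scale and shift parameters, omitted here):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Standardize each feature over the batch so inputs stay near zero,
    # away from the flat (saturated) tails of the sigmoid.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 4))   # made-up batch with large |x|
x_hat = batch_norm(x)
print("input range     :", round(x.min(), 2), "to", round(x.max(), 2))
print("normalized range:", round(x_hat.min(), 2), "to", round(x_hat.max(), 2))
```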
Exploding Gradient Problem

ŷ = ʄ(O1),  O1 = z1·W2 + b

∂ŷ/∂z1 = (∂ʄ(O1)/∂O1) · (∂O1/∂z1)
       = 0.25 · ∂(z1·W2 + b)/∂z1
       = 0.25 · W2 = 0.25 · 500 = 125   (when the weights are large, e.g. W2 = 500)
Exploding Gradient Problem
• The exploding gradient problem is not caused by the sigmoid function.
• It occurs due to large weight values.
• If the weights are initialized with large values, training keeps oscillating instead of converging.
• Hence, we should select the initial weight vectors properly.
Exploding Gradient Problem
• When gradients explode, they can become NaN (Not a Number) because of numerical overflow.
• We might see irregular oscillations in the training cost when we plot the learning curve.
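A hedged numeric illustration (plain NumPy, reusing the W2 = 500 style of large weight from the derivation above; the depth and float32 precision are illustrative choices): multiplying in a local gradient of 0.25 * 500 = 125 at every layer makes the backpropagated gradient overflow.

```python
import numpy as np

np.seterr(over="ignore")          # let the overflow show up as inf rather than a warning

sigmoid_deriv_max = 0.25          # largest possible sigmoid derivative
big_weight = 500.0                # large weight, as in the example above

grad = np.float32(1.0)
for layer in range(1, 31):
    grad *= np.float32(sigmoid_deriv_max * big_weight)   # local gradient = 0.25 * W = 125
    if layer in (1, 5, 15, 30):
        print(f"after {layer:2d} layers: gradient = {grad}")
```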
Dealing with Exploding Gradients

• A solution is to apply gradient clipping, which places a predefined threshold on the gradients to prevent them from getting too large.
• Clipping does not change the direction of the gradient; it only changes its length.
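A minimal sketch of clipping by norm (plain NumPy; the threshold of 5.0 and the example gradient are illustrative): if the gradient’s L2 norm exceeds the threshold, the gradient is rescaled so its norm equals the threshold, which shortens it without changing its direction.

```python
import numpy as np

def clip_by_norm(grad, threshold):
    # Rescale the gradient when its L2 norm exceeds the threshold;
    # the direction is preserved, only the length shrinks.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([300.0, -400.0])                 # exploding gradient, norm = 500
clipped = clip_by_norm(g, threshold=5.0)
print("clipped gradient:", clipped, "norm:", np.linalg.norm(clipped))
```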
Note: Step function
Note: ReLU function
Linear function definition
Non-linear function definition
Source: Activation function
• https://inblog.in/ACTIVATION-FUNCTION-BREAKTHROUGH-VOyvxhTELU
Source: Vanishing gradient problem
• https://www.mygreatlearning.com/blog/the-vanishing-gradient-problem/
• https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484
• https://medium.com/analytics-vidhya/vanishing-and-exploding-gradient-problems-c94087c2e911
THANK YOU
