Lect 7 - Vanishing Gradient Problem
VANISHING GRADIENT PROBLEM
∂L/∂w = 0.2 × 0.15 × 0.05 = 0.0015
w_new = w_old − η · ∂L/∂w = 2.5 − (1)(0.0015) = 2.4985
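The arithmetic can be checked with a short Python sketch; the three derivative values, the old weight 2.5, and the learning rate of 1 are taken from the example above, everything else is just illustration:

```python
# Chain rule: the gradient reaching an early layer is the product of the
# per-layer derivatives, so several small factors shrink it toward zero.
layer_derivatives = [0.2, 0.15, 0.05]    # values from the slide

gradient = 1.0
for d in layer_derivatives:
    gradient *= d
print(gradient)                          # 0.0015 (up to float rounding)

w_old, learning_rate = 2.5, 1.0          # old weight and learning rate from the slide
w_new = w_old - learning_rate * gradient # gradient-descent update
print(w_new)                             # ~2.4985 -> the weight barely changes
```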
Vanishing Gradient Problem
As the number of layers in a neural network increases, the derivative reaching the early layers (a chain-rule product of per-layer derivatives) keeps decreasing.
Hence, adding more layers leads to an almost-zero derivative.
If every derivative in the chain were 1 instead, we would see:
∂L/∂w = 1 × 1 × 1 = 1
w_new = 2.5 − (1)(1) = 1.5
ReLU Activation function
∂L/∂w = 1 × 1 × 0 = 0 (dead neuron)
w_new = 2.5 − (1)(0) = 2.5
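A minimal Python sketch of both ReLU cases above; the specific pre-activation values are assumptions chosen only to force the derivatives to 1 or 0:

```python
import numpy as np

def relu_derivative(x):
    # ReLU'(x) is 1 for positive inputs and 0 otherwise
    return np.where(x > 0, 1.0, 0.0)

w_old, lr = 2.5, 1.0

# Case 1: all pre-activations positive -> 1 * 1 * 1 = 1, strong update
grad_alive = np.prod(relu_derivative(np.array([0.7, 1.2, 0.3])))
print(w_old - lr * grad_alive)   # 2.5 - 1*1 = 1.5

# Case 2: one negative pre-activation -> 1 * 1 * 0 = 0, dead neuron
grad_dead = np.prod(relu_derivative(np.array([0.7, 1.2, -0.3])))
print(w_old - lr * grad_dead)    # 2.5 - 1*0 = 2.5 (no update)
```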
Leaky ReLU
To fix the problem of dead neurons, leaky ReLU is introduced.
Leaky ReLU - derivatives
The derivative is no longer zero, and hence there are no dead neurons.
Leaky ReLU Activation function
∂L/∂w = 0.01 × 1 × 1 = 0.01
w_new = 2.5 − (1)(0.01) = 2.49
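The same sketch with the leaky ReLU derivative; the 0.01 slope is the value used on the slide, while the pre-activation values are again assumed for illustration:

```python
import numpy as np

def leaky_relu_derivative(x, negative_slope=0.01):
    # Leaky ReLU'(x) is 1 for positive inputs and a small slope otherwise
    return np.where(x > 0, 1.0, negative_slope)

w_old, lr = 2.5, 1.0
grad = np.prod(leaky_relu_derivative(np.array([-0.3, 0.7, 1.2])))  # 0.01 * 1 * 1
print(grad)               # 0.01 -> not zero, so no dead neuron
print(w_old - lr * grad)  # 2.5 - 0.01 = 2.49, as on the slide
```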
Vanishing Gradient Problem
The other solutions are:
Use Residual networks (ResNets)
Use Batch Normalization
Use Multi-level hierarchy
Use Long short-term memory (LSTM) networks
Use faster hardware
Use genetic algorithms for weight search
Residual neural networks (ResNets)
Equation: f(x) = x
Derivative: f′(x) = 1
Residual neural networks (ResNets)
These skip connections act as gradient superhighways, allowing the gradient to flow unhindered (without restriction).
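A minimal PyTorch sketch of one residual block, assuming a simple fully connected body (the layer sizes are arbitrary, not the lecture's network). Because the output is F(x) + x, the gradient always has an identity path back to the input:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # skip connection: d(output)/dx = dF/dx + 1, so the gradient
        # cannot be driven to zero by this block alone
        return self.body(x) + x

x = torch.randn(8, 64, requires_grad=True)
y = ResidualBlock()(x).sum()
y.backward()
print(x.grad.shape)   # gradients flow back to the input through the skip path
```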
Batch Normalization
Batch normalization layers can also resolve the vanishing gradient problem.
As stated before, the problem arises when a large input space is mapped to a small one, causing the derivatives to disappear.
Batch Normalization
In the sigmoid plot below, this is most clearly seen when |x| is big.
Batch Normalization
Batch normalization reduces this problem by simply normalizing the input so |x| doesn’t reach the outer edges of the sigmoid function.
It normalizes the input so that most of it falls in the green region, where the derivative isn’t too small.
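A minimal PyTorch sketch of this effect, using made-up input statistics: the sigmoid derivative σ′(x) = σ(x)(1 − σ(x)) is largest near x = 0 and tiny when |x| is big, and batch normalization pulls the inputs back toward that region:

```python
import torch
import torch.nn as nn

def sigmoid_derivative(x):
    s = torch.sigmoid(x)
    return s * (1 - s)                    # largest (0.25) at x = 0, tiny for large |x|

x = torch.randn(256, 10) * 8 + 5          # large, shifted inputs: |x| is often big
print(sigmoid_derivative(x).mean())       # very small average derivative

bn = nn.BatchNorm1d(10)                   # normalizes each feature over the batch
x_norm = bn(x)                            # roughly zero mean, unit variance
print(sigmoid_derivative(x_norm).mean())  # much closer to the 0.25 maximum
```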
Exploding Gradient Problem
ŷ = f(O1), where O1 = z1·W2 + b

∂ŷ/∂z1 = (∂f(O1)/∂O1) × (∂O1/∂z1)
∂ŷ/∂z1 = 0.25 × ∂(z1·W2 + b)/∂z1 = 0.25 × W2

If W2 is large, this gradient is large, and the weight update explodes.
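A short Python sketch of this failure with assumed numbers: the sigmoid term contributes at most 0.25, but if the weight W2 is large the per-layer factor 0.25 × W2 exceeds 1, and repeating it over several layers makes the gradient blow up:

```python
W2 = 500.0                    # assumed large weight (not from the slide)
per_layer_factor = 0.25 * W2  # the 0.25 * W2 term from the derivation above
print(per_layer_factor)       # 125.0 -> already much larger than 1

gradient = 1.0
for _ in range(5):            # the same factor repeated over 5 layers
    gradient *= per_layer_factor
print(gradient)               # ~3.1e10: the gradient explodes
```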