Module 2
1. Explain batch normalization with a relevant example.
Detailed Answer
When to Use:
Formula:
f(x) = {
x if x > 0
α(e^x − 1) if x ≤ 0
Use Case:
Works well with batch normalization.
Variants:
When to Use:
f(x) = {
x if x > 0
αx if x ≤ 0
When to Use:
Formula:
f(x) = max(0, x)
Limitations:
Prone to the dying ReLU problem (neurons stop learning when
stuck in the negative region).
Use Case:
When to Use:
Formula:
f(x) = (e^x − e^(−x)) / (e^x + e^(−x))
Use Case:
When to Use:
Formula:
f(x) = 1 / (1 + e^(−x))
Limitations:
Use Case:
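As a quick illustration of the two saturating activations above (tanh and sigmoid), here is a minimal NumPy sketch; the sample inputs are arbitrary, and the point is simply that both functions flatten out for large |x|, which is what later produces near-zero gradients:

```python
import numpy as np

def sigmoid(x):
    # Squashes inputs into (0, 1); saturates (slope ≈ 0) for large |x|.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes inputs into (-1, 1); zero-centered, but also saturates.
    return np.tanh(x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(x))  # values pinned near 0 or 1 at the extremes
print(tanh(x))     # values pinned near -1 or 1 at the extremes
```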
6. Softmax Function:
When to Use:
Formula:
σ(z_i) = e^(z_i) / Σ_{j=1}^{K} e^(z_j)
where z_i is the output for a specific class, and K is the total number of classes.
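A short NumPy sketch of the softmax formula above; subtracting the maximum before exponentiating is an extra numerical-stability detail, not part of the formula itself:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # raw outputs z_i for K = 3 classes
probs = softmax(scores)
print(probs)        # approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0 -- a valid probability distribution over the classes
```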
Use Case:
4. Discuss how to train the DNN on this training set. For each image pair, you can simultaneously feed the first image to DNN A and the second image to DNN B. The whole network will gradually learn to tell whether two images belong to the same class or not.
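One possible way to set this up is sketched in PyTorch below; the layer sizes, the shared weights between DNN A and DNN B, and the binary cross-entropy head are illustrative assumptions rather than details taken from the question:

```python
import torch
import torch.nn as nn

class BranchDNN(nn.Module):
    """One branch (DNN A or DNN B); here both branches share the same weights."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class SiameseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch = BranchDNN()          # reused for both images
        self.head = nn.Linear(64 * 2, 1)   # outputs a "same class?" logit

    def forward(self, img_a, img_b):
        feat_a = self.branch(img_a)        # first image -> DNN A
        feat_b = self.branch(img_b)        # second image -> DNN B
        return self.head(torch.cat([feat_a, feat_b], dim=1))

model = SiameseNet()
loss_fn = nn.BCEWithLogitsLoss()           # label 1 = same class, 0 = different
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One illustrative training step on a dummy batch of 28x28 image pairs.
img_a, img_b = torch.randn(32, 1, 28, 28), torch.randn(32, 1, 28, 28)
labels = torch.randint(0, 2, (32, 1)).float()
optimizer.zero_grad()
loss = loss_fn(model(img_a, img_b), labels)
loss.backward()
optimizer.step()
```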
As a result:
Exploding Gradients:
These problems are more pronounced in very deep networks and recurrent
neural networks (RNNs).
Causes:
The vanishing/exploding gradients problem can be attributed to factors such
as:
Use of saturating activation functions like sigmoid and tanh.
Solutions:
Exploding gradients: When gradients grow too large and make the
network unstable.
The root cause is how weights are initialized. If weights are too small or too
large, the signal passing through layers is either diminished or amplified
exponentially, causing instability.
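A small NumPy experiment that illustrates this root cause: repeatedly applying linear layers whose weight scale is slightly too small or slightly too large makes the signal magnitude shrink or grow exponentially with depth (the width, depth, and scale values below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, layers = 256, 50
x0 = rng.standard_normal((1, n))

for scale in (0.5, 1.5):                   # weight std below / above the stable point
    x = x0.copy()
    for _ in range(layers):
        W = rng.standard_normal((n, n)) * (scale / np.sqrt(n))
        x = x @ W                          # linear layers only, to isolate the effect
    print(f"scale={scale}: mean |activation| = {np.abs(x).mean():.3e}")

# scale=0.5 -> the magnitude collapses toward 0 (vanishing signal/gradients)
# scale=1.5 -> the magnitude grows by many orders of magnitude (exploding signal/gradients)
```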
Xavier Initialization:
Balances the variance of the activations and gradients to keep them within
a reasonable range across layers.
He Initialization:
Proposed by Kaiming He et al. for ReLU and variants of ReLU (like Leaky
ReLU).
ReLU does not saturate like sigmoid or tanh, but it can suffer from dying
neurons if weights are not initialized properly.
The scaling factor 2/n_in accounts for the fact that ReLU only activates half of
the neurons on average.
W ∼ N(0, 2/n_in)
Where:
n_in is the number of input units (fan-in) of the layer.
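A minimal NumPy sketch of He initialization as defined above, with the common Glorot/Xavier normal form alongside for comparison (the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    # W ~ N(0, 2 / n_in): variance scaled up because ReLU zeroes roughly half its inputs.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def xavier_init(n_in, n_out):
    # W ~ N(0, 2 / (n_in + n_out)): balances forward activation and backward gradient variance.
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

W_he = he_init(784, 256)
W_xavier = xavier_init(784, 256)
print(W_he.std(), np.sqrt(2.0 / 784))              # empirical std ≈ target std
print(W_xavier.std(), np.sqrt(2.0 / (784 + 256)))
```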
8. Explain the variants of the ReLU activation function.
1. ReLU (Rectified Linear Unit):
The original ReLU is the most widely used activation function in deep
learning. It outputs 0 for negative inputs and x for positive inputs.
Formula:
f(x) = max(0, x)
Pros:
Computationally inexpensive.
Cons:
Dying ReLU problem: Neurons can stop learning if they get stuck in the
negative region and always output zero.
2. Leaky ReLU:
Leaky ReLU fixes the dying ReLU problem by allowing a small, non-zero
slope for negative inputs.
Formula:
f(x) = {
x if x > 0
αx if x ≤ 0
Pros:
Cons:
Use Case:
3. Parametric ReLU (PReLU):
Parametric ReLU is a variant of Leaky ReLU where the slope for negative inputs is learnable during training.
Formula:
f(x) = {
x if x > 0
ax if x ≤ 0
Pros:
The network can adapt the slope for negative values based on the data.
Cons:
Use Case:
Effective in large-scale image classification tasks and deep neural
networks.
4. Randomized Leaky ReLU (RReLU):
Randomized Leaky ReLU randomly chooses the slope α for negative inputs from a given range during training.
Formula:
f(x) = {
x if x > 0
αx if x ≤ 0
Where α is randomly sampled from a range [l, u] during training and fixed during testing.
Pros:
Cons:
Use Case:
5. ELU (Exponential Linear Unit):
ELU allows small negative outputs, which helps push the mean activation closer to zero, improving learning.
Formula:
f(x) = {
x if x > 0
α(e^x − 1) if x ≤ 0
Pros:
Cons:
Use Case:
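A compact NumPy sketch of the variants discussed in this answer (ReLU, Leaky ReLU, PReLU, RReLU, ELU); the default slope values and the RReLU sampling range are illustrative choices, not prescribed by the notes:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Fixed small slope alpha for negative inputs.
    return np.where(x > 0, x, alpha * x)

def prelu(x, a):
    # Same shape as Leaky ReLU, but the slope `a` is a learnable parameter.
    return np.where(x > 0, x, a * x)

def rrelu(x, lower=1/8, upper=1/3, training=True, rng=np.random.default_rng()):
    # Slope sampled from [lower, upper] during training, fixed to their mean at test time.
    alpha = rng.uniform(lower, upper) if training else (lower + upper) / 2
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth negative saturation toward -alpha; pushes the mean activation toward zero.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x), elu(x), sep="\n")
```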
He Initialization: Designed for the ReLU activation function and its variants,
it accounts for the fact that ReLU only activates for positive values.
The choice of activation function plays a crucial role in mitigating vanishing
gradients.
The sigmoid activation function, once popular, suffers from saturation for
large input values, leading to gradients close to zero.
ReLU and its variants (Leaky ReLU, ELU, RReLU, PReLU) address this issue
by not saturating for positive values.
3. Batch Normalization:
This technique tackles the issue of internal covariate shift, where the
distribution of each layer's inputs changes during training.
It normalizes the inputs of each layer, stabilizing the gradients and allowing
the use of higher learning rates.
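As a relevant example, here is a small NumPy sketch of the batch normalization computation described above (the batch size, feature count, and the unit gamma / zero beta are illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch, then rescale with the
    # learnable parameters gamma (scale) and beta (shift).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # mini-batch of 32 samples, 4 features
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3))   # ≈ 0 for every feature
print(out.std(axis=0).round(3))    # ≈ 1 for every feature
```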
4. Gradient Clipping: