
Module 2
1. Explain batch normalization with a relevant example.

2. Write a code snippet for transfer learning with Keras.

3. Explain, with relevant illustrations, in which cases you would want to use each of the following activation functions: ELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax.

Detailed Answer (Question 3)

1. ELU (Exponential Linear Unit):

The Exponential Linear Unit (ELU) addresses the issue of dying neurons in ReLU by allowing small negative outputs for negative inputs. It helps improve learning speed and leads to smoother convergence.

When to Use:

When training deep networks to achieve faster convergence.

In cases where small negative outputs are needed to push the mean activation closer to zero.

Formula:

f(x) = x             if x > 0
f(x) = α(e^x − 1)    if x ≤ 0

where α is a positive constant.

Use Case:

Image classification and convolutional neural networks (CNNs) where fast convergence is crucial.

Works well with batch normalization.
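
A minimal NumPy sketch of the ELU formula above (in Keras, the same behaviour is available via activation="elu" or the tf.keras.layers.ELU layer); the test values are illustrative:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs, alpha * (e^x - 1) for non-positive inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

print(elu(np.array([-2.0, -0.5, 0.0, 1.5])))
# approx [-0.865 -0.393  0.     1.5  ]: negative inputs map to small negative
# values instead of 0, which pushes the mean activation closer to zero.
```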

2. Leaky ReLU (and Variants: Parametric ReLU, Randomized Leaky ReLU):

Leaky ReLU addresses the dying ReLU problem by allowing a small, non-zero slope for negative values of x. This keeps neurons active and learning even when the input is negative.

Variants:

Parametric ReLU (PReLU): Allows the slope for negative values to be learned.

Randomized Leaky ReLU (RReLU): Randomizes the slope during training for regularization.

When to Use:

When dealing with deep networks prone to the dying ReLU problem.

For time-series data or speech recognition tasks where some negative values can hold important information.

Formula:

f(x) = x    if x > 0
f(x) = αx   if x ≤ 0

Where α is a small positive constant (e.g., α = 0.01).
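
A minimal Keras sketch (assuming TensorFlow 2.x, where the slope argument is named alpha; newer Keras versions call it negative_slope) that applies a fixed-slope leaky ReLU as a standalone layer. The 20-feature input is an illustrative assumption; the learnable-slope variant is available as tf.keras.layers.PReLU:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(20,)),
    tf.keras.layers.LeakyReLU(alpha=0.01),   # small fixed slope for x <= 0
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```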


3. ReLU (Rectified Linear Unit):

ReLU is the most commonly used activation function due to its simplicity and effectiveness. It outputs 0 for negative inputs and the input itself for positive inputs.

When to Use:

In hidden layers of deep neural networks.

Works well for image-related tasks and object detection.

Formula:

f(x) = max(0, x)

Limitations:

Prone to the dying ReLU problem (neurons stop learning when
stuck in the negative region).

Use Case:

Convolutional Neural Networks (CNNs) for image recognition.

Feedforward networks where computational efficiency is crucial.
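
A minimal NumPy sketch of the ReLU formula; the sample inputs are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))   # [0. 0. 0. 2.] -- all negative inputs are zeroed out
# The gradient is 0 wherever x < 0, which is what makes the "dying ReLU" problem possible.
```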

4. tanh (Hyperbolic Tangent):

The tanh activation function outputs values between -1 and 1, making it zero-centered. This helps to avoid shifting gradients in one direction during backpropagation.

When to Use:

When dealing with classification tasks that require negative values as well as positive values.

In hidden layers of networks where zero-centered outputs are beneficial.

Formula:

f(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Use Case:

Recurrent Neural Networks (RNNs) for time-series predictions.

Binary classification tasks where outputs need to be zero-centered.
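
A quick NumPy check (with illustrative inputs) that the formula above matches the built-in tanh and yields zero-centered outputs in (-1, 1):

```python
import numpy as np

x = np.array([-2.0, 0.0, 2.0])
manual = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
print(manual)        # approx [-0.964  0.     0.964]
print(np.tanh(x))    # same values: outputs lie in (-1, 1), symmetric around 0
```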

5. Logistic (Sigmoid) Function:

The sigmoid function outputs values between 0 and 1, making it suitable for binary classification problems.

When to Use:

For binary classification tasks where the output is a probability (e.g., predicting whether an email is spam or not).

In output layers when a probability score is required.

Formula:

f(x) = 1 / (1 + e^(−x))

Limitations:

Causes vanishing gradients for very large or very small inputs.

The output is not zero-centered, which can slow down learning.

Use Case:

Logistic regression and binary classification tasks.

Probability-based models where outputs need to be interpreted as probabilities.
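
A minimal Keras sketch (assuming TensorFlow 2.x and an illustrative 10-feature input) of a sigmoid output layer for binary classification, where the single output is read as a probability:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # output in (0, 1), e.g. P(spam)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```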

6. Softmax Function:

The softmax function is used to convert the outputs of a neural network into a probability distribution over multiple classes.

When to Use:

In the output layer of a multiclass classification model.

When the task requires predicting one class out of multiple possible classes.

Formula:

σ(z_i) = e^(z_i) / Σ_{j=1}^{K} e^(z_j)

where z_i is the output for a specific class, and K is the total number of classes.

Use Case:

Multiclass classification problems, such as identifying handwritten digits (0-9) in the MNIST dataset.

Natural language processing (NLP) tasks like named entity recognition.
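
A minimal NumPy sketch of the softmax formula, using illustrative logits for K = 3 classes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtracting the max is a standard numerical-stability trick
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())       # roughly [0.659 0.242 0.099], and the probabilities sum to 1
```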

4. Discuss how to train the DNN on this training set. For each image pair, you can simultaneously feed the first image to DNN A and the second image to DNN B. The whole network will gradually learn to tell whether two images belong to the same class or not.
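
A minimal sketch of this idea with the Keras functional API (assuming TensorFlow 2.x; the 28×28 input shape and layer sizes are illustrative assumptions, not part of the question):

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_dnn():
    # One sub-network; DNN A and DNN B are two separate instances of it.
    return tf.keras.Sequential([
        layers.Flatten(input_shape=(28, 28)),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
    ])

dnn_a, dnn_b = make_dnn(), make_dnn()

img_a = layers.Input(shape=(28, 28))   # first image of the pair -> DNN A
img_b = layers.Input(shape=(28, 28))   # second image of the pair -> DNN B
features = layers.concatenate([dnn_a(img_a), dnn_b(img_b)])
same_class = layers.Dense(1, activation="sigmoid")(features)   # 1 = same class, 0 = different

model = tf.keras.Model(inputs=[img_a, img_b], outputs=same_class)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit([images_a, images_b], pair_labels, ...) with binary same/different labels
```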

5. Explain the vanishing gradients problem in neural networks.
OR
Explain the vanishing and exploding gradients problems in neural networks.

Vanishing Gradients:

When training a neural network with backpropagation, the error gradients tend to become smaller and smaller as they are propagated backward through the network's layers.

In deep networks, especially when using activation functions like the sigmoid or hyperbolic tangent, this can cause the gradients to shrink exponentially, making them too small to cause meaningful updates to the weights in the earlier layers.

As a result:

The earlier layers learn very slowly, or not at all.

The model fails to converge to a good solution.

Exploding Gradients:

In contrast, the exploding gradients problem occurs when the gradients become excessively large during backpropagation, resulting in large updates to the network weights. This can cause:

The model parameters to become unstable.

Divergence in the training process, where the loss function increases instead of decreasing.

These problems are more pronounced in very deep networks and recurrent
neural networks (RNNs).

Causes:
The vanishing/exploding gradients problem can be attributed to factors such
as:

Poor initialization of weights.

Use of saturating activation functions like sigmoid and tanh.

The accumulation of small gradients through many layers.

Solutions:

Several techniques can mitigate these issues:

1. Use of Non-saturating Activation Functions: Functions like ReLU (Rectified Linear Unit) do not saturate for positive inputs, reducing the risk of vanishing gradients.

2. Proper Weight Initialization: Xavier and He initialization methods help maintain a balance in gradient propagation.

3. Batch Normalization: Normalizing inputs within the network can reduce internal covariate shift, stabilizing gradients during training.

4. Gradient Clipping: This technique caps the gradients during backpropagation to keep them from becoming too large, addressing exploding gradients (see the sketch below).

These methods have significantly improved the training of deep neural networks, enabling them to handle more complex tasks effectively.
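
As an example of technique 4 above, gradient clipping can be requested directly from a Keras optimizer (assuming TensorFlow 2.x); clipvalue caps each gradient component, while clipnorm rescales gradients whose overall norm exceeds the threshold:

```python
import tensorflow as tf

# Cap every gradient component to the range [-1.0, 1.0]
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)

# Alternative: rescale the whole gradient vector if its L2 norm exceeds 1.0
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
```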

6. Discuss the problem that Glorot initialization and He initialization aim to fix.

Both Glorot initialization and He initialization were proposed to address vanishing and exploding gradients, especially in deep neural networks.

These problems occur because weights are poorly initialized, causing gradients to shrink (vanish) or grow (explode) as they propagate backward through the layers.

The vanishing and exploding gradients problem is particularly severe when:

The network is deep (many layers).

Activations are not properly scaled, leading to either:

Vanishing gradients: When gradients become very small and fail to update the weights of earlier layers.

Exploding gradients: When gradients grow too large and make the
network unstable.

The root cause is how weights are initialized. If weights are too small or too
large, the signal passing through layers is either diminished or amplified
exponentially, causing instability.

Xavier Initialization:

Proposed by Xavier Glorot and Yoshua Bengio, Xavier initialization works well for sigmoid and tanh activation functions, which are prone to saturation (leading to vanishing gradients).

Balances the variance of the activations and gradients to keep them within
a reasonable range across layers.

Formula for weight initialization:

W ∼ N(0, 2 / (n_in + n_out))

where n_in and n_out are the numbers of input and output neurons of the layer.

He Initialization:

Proposed by Kaiming He et al. for ReLU and variants of ReLU (like Leaky
ReLU).

ReLU does not saturate like sigmoid or tanh, but it can suffer from dying
neurons if weights are not initialized properly.

The scaling factor 2/n_in accounts for the fact that ReLU only activates half of the neurons on average, preventing the gradients from vanishing.

Formula for weight initialization:

W ∼ N(0, 2 / n_in)

Where:

n_in = number of input neurons.
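
A minimal Keras sketch (assuming TensorFlow 2.x) showing how the two schemes are selected per layer; Glorot (Xavier) initialization is Keras's default for Dense layers, and He initialization is typically paired with ReLU-family activations:

```python
import tensorflow as tf

glorot_layer = tf.keras.layers.Dense(100, activation="tanh",
                                     kernel_initializer="glorot_uniform")
he_layer = tf.keras.layers.Dense(100, activation="relu",
                                 kernel_initializer="he_normal")
```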


7. Differentiate Non-saturating and Saturating activation functions with examples.

8. Explain the variants of the ReLU activation function.
1. ReLU (Rectified Linear Unit):

The original ReLU is the most widely used activation function in deep
learning. It outputs 0 for negative inputs and x for positive inputs.

Formula:

f(x) = max(0, x)

Pros:

Simple and efficient.

Helps reduce vanishing gradient problems.

Computationally inexpensive.

Cons:

Dying ReLU problem: Neurons can stop learning if they get stuck in the
negative region and always output zero.

2. Leaky ReLU:

Leaky ReLU fixes the dying ReLU problem by allowing a small, non-zero
slope for negative inputs.

Formula:

f(x) = x    if x > 0
f(x) = αx   if x ≤ 0

Where α is a hyperparameter (the slope for negative inputs).

Pros:

Prevents neurons from dying.

Allows negative values to propagate through the network.

Cons:

Choosing the right value of α can be tricky.

Use Case:

Used in GANs (Generative Adversarial Networks), RNNs, and networks prone to the dying ReLU problem.

3. Parametric ReLU (PReLU):

Parametric ReLU is a variant of Leaky ReLU where the slope for negative
inputs is learnable during training.

Formula:

f(x) = x    if x > 0
f(x) = ax   if x ≤ 0

Where a is a learnable parameter.

Pros:

The network can adapt the slope for negative values based on the data.

Reduces the risk of dying neurons.

Cons:

Can lead to overfitting if not regularized.

Use Case:

Effective in large-scale image classification tasks and deep neural
networks.

4. Randomized Leaky ReLU (RReLU):

Randomized Leaky ReLU randomly chooses the slope α for negative inputs from a given range during training.

Formula:

f(x) = x    if x > 0
f(x) = αx   if x ≤ 0

Where α is randomly sampled from a range [l, u] during training and fixed during testing.

Pros:

Helps with regularization.

Reduces the risk of overfitting.

Cons:

The choice of the range [l, u] can impact performance.

Use Case:

Useful in low-resource environments or for noise-tolerant networks.

5. Exponential Linear Unit (ELU):

ELU allows small negative outputs, which helps push the mean activation
closer to zero, improving learning.

Formula:

f(x) = x             if x > 0
f(x) = α(e^x − 1)    if x ≤ 0

Where α is a constant.

Pros:

Zero-centered output, which improves learning speed.

Reduces vanishing gradients more effectively than ReLU.

Cons:

Computationally expensive compared to ReLU.

Use Case:

Used in convolutional neural networks (CNNs) and deep residual networks.
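
A small NumPy comparison of the variants above on the same illustrative inputs (α = 0.1 is an arbitrary choice; PReLU and RReLU use the same form as Leaky ReLU, with α learned or randomly sampled respectively):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.0])
alpha = 0.1

relu       = np.maximum(0.0, x)
leaky_relu = np.where(x > 0, x, alpha * x)
elu        = np.where(x > 0, x, alpha * (np.exp(x) - 1))

print(relu)        # [ 0.      0.      0.      1.    ]
print(leaky_relu)  # [-0.2    -0.05    0.      1.    ]
print(elu)         # approx [-0.0865 -0.0393  0.  1.] -- smooth, bounded negative outputs
```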

9. Discuss the different strategies to fix the vanishing gradient issue.
The vanishing gradients problem poses a significant challenge in training deep
neural networks, hindering the effective learning of lower layers as gradients
diminish during backpropagation.
Here are several strategies to mitigate this issue:

1. Xavier and He Initialization:

These weight initialization techniques aim to maintain consistent variance of both outputs and gradients throughout the network.

Xavier Initialization: Suitable for the logistic activation function, it initializes weights randomly, ensuring the variance of outputs matches the variance of inputs.

Formula for weight initialization:

W ∼ N(0, 2 / (n_in + n_out))

He Initialization: Designed for the ReLU activation function and its variants,
it accounts for the fact that ReLU only activates for positive values.

It typically uses a normal distribution with a mean of 0 and a standard deviation of σ = √(2 / n_inputs).

By employing these initialization methods, training can be accelerated significantly, and deeper networks can be trained effectively.

2. Non-saturating Activation Functions:

The choice of activation function plays a crucial role in mitigating vanishing
gradients.

The sigmoid activation function, once popular, suffers from saturation for
large input values, leading to gradients close to zero.

ReLU and its variants (Leaky ReLU, ELU, RReLU, PReLU) address this issue
by not saturating for positive values.

The ELU activation function, in particular, has shown promising results in speeding up training and improving model performance, although it may be computationally slower than ReLU at test time.

3. Batch Normalization:

This technique tackles the issue of internal covariate shift, where the
distribution of each layer's inputs changes during training.

It normalizes the inputs of each layer, stabilizing the gradients and allowing
the use of higher learning rates.

Batch Normalization has been shown to significantly reduce vanishing gradients, speed up training, and improve the overall performance of deep neural networks (a Keras sketch follows after this list).

4. Gradient Clipping:

Primarily used for recurrent neural networks, gradient clipping involves capping gradients during backpropagation to prevent them from exceeding a certain threshold.

This helps prevent exploding gradients, a problem where gradients become excessively large.
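
A minimal sketch of strategy 3 (Batch Normalization) in Keras, assuming TensorFlow 2.x and an MNIST-sized 28×28 input; placing the BN layer before the activation is one common choice:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),     # normalizes the layer inputs per mini-batch
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```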

10. Discuss the various ways in which we can reuse a pre-trained model.

11. Describe pretraining on an auxiliary task.
