
Module 2
1. Explain batch normalization with a relevant example.

2. Write a code snippet for transfer learning with Keras.

3. Explain, with relevant illustrations, in which cases you would want to use each of the following activation functions: ELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax.

Detailed Answer (Question 3)

1. ELU (Exponential Linear Unit):

The Exponential Linear Unit (ELU) addresses the issue of dying neurons in ReLU by allowing small negative outputs for negative inputs. It helps improve learning speed and leads to smoother convergence.

When to Use:

When training deep networks to achieve faster convergence.

In cases where small negative outputs are needed to push the mean activation closer to zero.

Formula:

f(x) = x             if x > 0
f(x) = α(e^x − 1)    if x ≤ 0

where α is a positive constant.

Use Case:

Image classification and convolutional neural networks (CNNs) where fast convergence is crucial.

Works well with batch normalization.
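
A minimal NumPy sketch of the ELU formula above (in Keras, the same behaviour is available via activation="elu" or the tf.keras.layers.ELU layer); the test values are illustrative:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs, alpha * (e^x - 1) for non-positive inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

print(elu(np.array([-2.0, -0.5, 0.0, 1.5])))
# approx [-0.865 -0.393  0.     1.5  ]: negative inputs map to small negative
# values instead of 0, which pushes the mean activation closer to zero.
```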

2. Leaky ReLU (and Variants: Parametric ReLU, Randomized Leaky ReLU):

Leaky ReLU addresses the dying ReLU problem by allowing a small, non-zero slope for negative values of x. This keeps neurons active and learning even when the input is negative.

Variants:

Parametric ReLU (PReLU): Allows the slope for negative values to be learned.

Randomized Leaky ReLU (RReLU): Randomizes the slope during training for regularization.

When to Use:

When dealing with deep networks prone to the dying ReLU problem.

For time-series data or speech recognition tasks where some negative values can hold important information.

Formula:

f(x) = x    if x > 0
f(x) = αx   if x ≤ 0

Where α is a small positive constant (e.g., α = 0.01).
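
A minimal Keras sketch (assuming TensorFlow 2.x, where the slope argument is named alpha; newer Keras versions call it negative_slope) that applies a fixed-slope leaky ReLU as a standalone layer. The 20-feature input is an illustrative assumption; the learnable-slope variant is available as tf.keras.layers.PReLU:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(20,)),
    tf.keras.layers.LeakyReLU(alpha=0.01),   # small fixed slope for x <= 0
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```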


3. ReLU (Rectified Linear Unit):

ReLU is the most commonly used activation function due to its simplicity and effectiveness. It outputs 0 for negative inputs and the input itself for positive inputs.

When to Use:

In hidden layers of deep neural networks.

Works well for image-related tasks and object detection.

Formula:

f(x) = max(0, x)

Limitations:

Prone to the dying ReLU problem (neurons stop learning when
stuck in the negative region).

Use Case:

Convolutional Neural Networks (CNNs) for image recognition.

Feedforward networks where computational efficiency is crucial.
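
A minimal NumPy sketch of the ReLU formula; the sample inputs are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))   # [0. 0. 0. 2.] -- all negative inputs are zeroed out
# The gradient is 0 wherever x < 0, which is what makes the "dying ReLU" problem possible.
```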

4. tanh (Hyperbolic Tangent):

The tanh activation function outputs values between -1 and 1, making it zero-centered. This helps to avoid shifting gradients in one direction during backpropagation.

When to Use:

When dealing with classification tasks that require negative values as well as positive values.

In hidden layers of networks where zero-centered outputs are beneficial.

Formula:

f(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Use Case:

Recurrent Neural Networks (RNNs) for time-series predictions.

Binary classification tasks where outputs need to be zero-centered.
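
A quick NumPy check (with illustrative inputs) that the formula above matches the built-in tanh and yields zero-centered outputs in (-1, 1):

```python
import numpy as np

x = np.array([-2.0, 0.0, 2.0])
manual = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
print(manual)        # approx [-0.964  0.     0.964]
print(np.tanh(x))    # same values: outputs lie in (-1, 1), symmetric around 0
```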

5. Logistic (Sigmoid) Function:

The sigmoid function outputs values between 0 and 1, making it suitable for binary classification problems.

When to Use:

For binary classification tasks where the output is a probability (e.g., predicting whether an email is spam or not).

In output layers when a probability score is required.

Formula:

f(x) = 1 / (1 + e^(−x))

Limitations:

Causes vanishing gradients for very large or very small inputs.

The output is not zero-centered, which can slow down learning.

Use Case:

Logistic regression and binary classification tasks.

Probability-based models where outputs need to be interpreted as probabilities.
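
A minimal Keras sketch (assuming TensorFlow 2.x and an illustrative 10-feature input) of a sigmoid output layer for binary classification, where the single output is read as a probability:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # output in (0, 1), e.g. P(spam)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```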

6. Softmax Function:

The softmax function is used to convert the outputs of a neural network into a probability distribution over multiple classes.

When to Use:

In the output layer of a multiclass classification model.

When the task requires predicting one class out of multiple possible classes.

Formula:

σ(z_i) = e^(z_i) / Σ_{j=1}^{K} e^(z_j)

where z_i is the output for a specific class, and K is the total number of classes.

Use Case:

Multiclass classification problems, such as identifying handwritten digits (0-9) in the MNIST dataset.

Natural language processing (NLP) tasks like named entity recognition.
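
A minimal NumPy sketch of the softmax formula, using illustrative logits for K = 3 classes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtracting the max is a standard numerical-stability trick
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())       # roughly [0.659 0.242 0.099], and the probabilities sum to 1
```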

4. Discuss how to train the DNN on this training set. For each image pair, you can simultaneously feed the first image to DNN A and the second image to DNN B. The whole network will gradually learn to tell whether two images belong to the same class or not.
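
A minimal sketch of this idea with the Keras functional API (assuming TensorFlow 2.x; the 28×28 input shape and layer sizes are illustrative assumptions, not part of the question):

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_dnn():
    # One sub-network; DNN A and DNN B are two separate instances of it.
    return tf.keras.Sequential([
        layers.Flatten(input_shape=(28, 28)),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
    ])

dnn_a, dnn_b = make_dnn(), make_dnn()

img_a = layers.Input(shape=(28, 28))   # first image of the pair -> DNN A
img_b = layers.Input(shape=(28, 28))   # second image of the pair -> DNN B
features = layers.concatenate([dnn_a(img_a), dnn_b(img_b)])
same_class = layers.Dense(1, activation="sigmoid")(features)   # 1 = same class, 0 = different

model = tf.keras.Model(inputs=[img_a, img_b], outputs=same_class)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit([images_a, images_b], pair_labels, ...) with binary same/different labels
```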

5. Explain the vanishing gradients problem in neural networks.
OR
Explain the vanishing and exploding gradients problems in neural networks.

Vanishing Gradients:

When training a neural network with backpropagation, the error gradients tend to become smaller and smaller as they are propagated backward through the network's layers.

In deep networks, especially when using activation functions like the sigmoid or hyperbolic tangent, this can cause the gradients to shrink exponentially, making them too small to cause meaningful updates to the weights in the earlier layers.

As a result:

The earlier layers learn very slowly, or not at all.

The model fails to converge to a good solution.

Exploding Gradients:

In contrast, the exploding gradients problem occurs when the gradients become excessively large during backpropagation, resulting in large updates to the network weights. This can cause:

The model parameters to become unstable.

Divergence in the training process, where the loss function increases instead of decreasing.

These problems are more pronounced in very deep networks and recurrent
neural networks (RNNs).

Causes:
The vanishing/exploding gradients problem can be attributed to factors such
as:

Poor initialization of weights.

Use of saturating activation functions like sigmoid and tanh.

The accumulation of small gradients through many layers.

Solutions:

Several techniques can mitigate these issues:

1. Use of Non-saturating Activation Functions: Functions like ReLU (Rectified Linear Unit) do not saturate for positive inputs, reducing the risk of vanishing gradients.

2. Proper Weight Initialization: Xavier and He initialization methods help maintain a balance in gradient propagation.

3. Batch Normalization: Normalizing inputs within the network can reduce internal covariate shift, stabilizing gradients during training.

4. Gradient Clipping: This technique caps the gradients during backpropagation to keep them from becoming too large, addressing exploding gradients (see the sketch below).

These methods have significantly improved the training of deep neural networks, enabling them to handle more complex tasks effectively.
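
As an example of technique 4 above, gradient clipping can be requested directly from a Keras optimizer (assuming TensorFlow 2.x); clipvalue caps each gradient component, while clipnorm rescales gradients whose overall norm exceeds the threshold:

```python
import tensorflow as tf

# Cap every gradient component to the range [-1.0, 1.0]
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)

# Alternative: rescale the whole gradient vector if its L2 norm exceeds 1.0
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
```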

6. Discuss the problem that Glorot initialization and He initialization aim to fix.

Both Glorot initialization and He initialization were proposed to address vanishing and exploding gradients, especially in deep neural networks.

These problems occur because weights are poorly initialized, causing gradients to shrink (vanish) or grow (explode) as they propagate backward through the layers.

The vanishing and exploding gradients problem is particularly severe when:

The network is deep (many layers).

Activations are not properly scaled, leading to either:

Vanishing gradients: When gradients become very small and fail to update the weights of earlier layers.

Exploding gradients: When gradients grow too large and make the
network unstable.

The root cause is how weights are initialized. If weights are too small or too
large, the signal passing through layers is either diminished or amplified
exponentially, causing instability.

Xavier Initialization:

Proposed by Xavier Glorot and Yoshua Bengio, Xavier initialization works well for sigmoid and tanh activation functions, which are prone to saturation (leading to vanishing gradients).

Balances the variance of the activations and gradients to keep them within
a reasonable range across layers.

Formula for weight initialization:

W ∼ N(0, 2 / (n_in + n_out))

where n_in and n_out are the numbers of input and output neurons of the layer.

He Initialization:

Proposed by Kaiming He et al. for ReLU and variants of ReLU (like Leaky
ReLU).

ReLU does not saturate like sigmoid or tanh, but it can suffer from dying
neurons if weights are not initialized properly.

The scaling factor 2/n_in accounts for the fact that ReLU only activates half of the neurons on average, preventing the gradients from vanishing.

Formula for weight initialization:

W ∼ N(0, 2 / n_in)

Where:

n_in = number of input neurons.
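
A minimal Keras sketch (assuming TensorFlow 2.x) showing how the two schemes are selected per layer; Glorot (Xavier) initialization is Keras's default for Dense layers, and He initialization is typically paired with ReLU-family activations:

```python
import tensorflow as tf

glorot_layer = tf.keras.layers.Dense(100, activation="tanh",
                                     kernel_initializer="glorot_uniform")
he_layer = tf.keras.layers.Dense(100, activation="relu",
                                 kernel_initializer="he_normal")
```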


7. Differentiate Non-saturating and Saturating activation functions with examples.

8. Explain the variants of the ReLU activation function.
1. ReLU (Rectified Linear Unit):

The original ReLU is the most widely used activation function in deep
learning. It outputs 0 for negative inputs and x for positive inputs.

Formula:

f(x) = max(0, x)

Pros:

Simple and efficient.

Helps reduce vanishing gradient problems.

Computationally inexpensive.

Cons:

Dying ReLU problem: Neurons can stop learning if they get stuck in the
negative region and always output zero.

2. Leaky ReLU:

Leaky ReLU fixes the dying ReLU problem by allowing a small, non-zero
slope for negative inputs.

Formula:

f(x) = x    if x > 0
f(x) = αx   if x ≤ 0

Where α is a hyperparameter (the slope for negative inputs).

Pros:

Prevents neurons from dying.

Allows negative values to propagate through the network.

Cons:

Choosing the right value of α can be tricky.

Use Case:

Used in GANs (Generative Adversarial Networks), RNNs, and networks prone to the dying ReLU problem.

3. Parametric ReLU (PReLU):

Parametric ReLU is a variant of Leaky ReLU where the slope for negative
inputs is learnable during training.

Formula:

f(x) = x    if x > 0
f(x) = ax   if x ≤ 0

Where a is a learnable parameter.

Pros:

The network can adapt the slope for negative values based on the data.

Reduces the risk of dying neurons.

Cons:

Can lead to overfitting if not regularized.

Use Case:

Effective in large-scale image classification tasks and deep neural
networks.

4. Randomized Leaky ReLU (RReLU):

Randomized Leaky ReLU randomly chooses the slope α for negative inputs from a given range during training.

Formula:

f(x) = x    if x > 0
f(x) = αx   if x ≤ 0

Where α is randomly sampled from a range [l, u] during training and fixed during testing.

Pros:

Helps with regularization.

Reduces the risk of overfitting.

Cons:

The choice of the range [l, u] can impact performance.

Use Case:

Useful in low-resource environments or for noise-tolerant networks.

5. Exponential Linear Unit (ELU):

ELU allows small negative outputs, which helps push the mean activation
closer to zero, improving learning.

Formula:

f(x) = x             if x > 0
f(x) = α(e^x − 1)    if x ≤ 0

Where α is a constant.

Pros:

Zero-centered output, which improves learning speed.

Reduces vanishing gradients more effectively than ReLU.

Cons:

Computationally expensive compared to ReLU.

Use Case:

Used in convolutional neural networks (CNNs) and deep residual networks.
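
A small NumPy comparison of the variants above on the same illustrative inputs (α = 0.1 is an arbitrary choice; PReLU and RReLU use the same form as Leaky ReLU, with α learned or randomly sampled respectively):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.0])
alpha = 0.1

relu       = np.maximum(0.0, x)
leaky_relu = np.where(x > 0, x, alpha * x)
elu        = np.where(x > 0, x, alpha * (np.exp(x) - 1))

print(relu)        # [ 0.      0.      0.      1.    ]
print(leaky_relu)  # [-0.2    -0.05    0.      1.    ]
print(elu)         # approx [-0.0865 -0.0393  0.  1.] -- smooth, bounded negative outputs
```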

9. Discuss the different strategies to fix the vanishing gradient issue.
The vanishing gradients problem poses a significant challenge in training deep
neural networks, hindering the effective learning of lower layers as gradients
diminish during backpropagation.
Here are several strategies to mitigate this issue:

1. Xavier and He Initialization:

These weight initialization techniques aim to maintain consistent variance of both outputs and gradients throughout the network.

Xavier Initialization: Suitable for the logistic activation function, it initializes weights randomly, ensuring the variance of outputs matches the variance of inputs.

Formula for weight initialization:

W ∼ N(0, 2 / (n_in + n_out))

He Initialization: Designed for the ReLU activation function and its variants,
it accounts for the fact that ReLU only activates for positive values.

It typically uses a normal distribution with a mean of 0 and a standard deviation of σ = √(2 / n_inputs).

By employing these initialization methods, training can be accelerated significantly, and deeper networks can be trained effectively.

2. Non-saturating Activation Functions:

The choice of activation function plays a crucial role in mitigating vanishing
gradients.

The sigmoid activation function, once popular, suffers from saturation for
large input values, leading to gradients close to zero.

ReLU and its variants (Leaky ReLU, ELU, RReLU, PReLU) address this issue
by not saturating for positive values.

The ELU activation function, in particular, has shown promising results in speeding up training and improving model performance, although it may be computationally slower than ReLU at test time.

3. Batch Normalization:

This technique tackles the issue of internal covariate shift, where the
distribution of each layer's inputs changes during training.

It normalizes the inputs of each layer, stabilizing the gradients and allowing
the use of higher learning rates.

Batch Normalization has been shown to significantly reduce vanishing gradients, speed up training, and improve the overall performance of deep neural networks (a Keras sketch follows after this list).

4. Gradient Clipping:

Primarily used for recurrent neural networks, gradient clipping involves capping gradients during backpropagation to prevent them from exceeding a certain threshold.

This helps prevent exploding gradients, a problem where gradients become excessively large.
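
A minimal sketch of strategy 3 (Batch Normalization) in Keras, assuming TensorFlow 2.x and an MNIST-sized 28×28 input; placing the BN layer before the activation is one common choice:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),     # normalizes the layer inputs per mini-batch
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```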

10. Discuss the various ways in which we can reuse a pre-trained model.

11. Describe pretraining on an auxiliary task.
