AI Notes
Initializing neural networks
Then, given a new data point, you can use the model to predict its class.
The initialization step can be critical to the model’s ultimate performance, and
it requires the right method. To illustrate this, consider the three-layer neural
network below. You can try initializing this network with different methods
and observe the impact on learning.
↑ Back to top
https://www.deeplearning.ai/ai-notes/initialization/index.html 3/15
04.05.2024, 10:59 Initializing neural networks - deeplearning.ai
This legend details the color scheme: labels/predictions are colored on a scale from 0 to 1, and weights/gradients are colored from negative through zero to positive.
[Interactive visualization: the network's predictions over the input plane (X1, X2), with both axes ranging from -4 to 4, shown across training epochs.]

What do you notice about the gradients and weights when the initialization method is zero?
Initializing all the weights with zeros leads the neurons to learn the same features during training.
In fact, any constant initialization scheme will perform very poorly. Consider a
neural network with two hidden units, and assume we initialize all the biases
to 0 and the weights with some constant α. If we forward propagate an input
(x1 , x2 ) in this network, the output of both hidden units will be relu(αx1 + αx2 ).
Thus, both hidden units will have identical influence on the cost, which will
lead to identical gradients. Both neurons will therefore evolve symmetrically
throughout training, effectively preventing different neurons from learning
different things.
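To see this symmetry numerically, here is a minimal sketch (NumPy) of such a two-hidden-unit ReLU network with every weight set to the same constant α. The sigmoid output, logistic loss, and the specific input values are illustrative assumptions, not part of the article.

```python
# A minimal sketch: constant initialization gives both hidden units identical gradients.
import numpy as np

np.random.seed(0)
x = np.random.randn(2, 1)          # one input sample with features (x1, x2)
y = np.array([[1.0]])              # its label

alpha = 0.5                        # the constant used for every weight
W1 = np.full((2, 2), alpha)        # both hidden units get the same weights
b1 = np.zeros((2, 1))
W2 = np.full((1, 2), alpha)
b2 = np.zeros((1, 1))

# Forward pass: ReLU hidden layer, sigmoid output (logistic loss assumed).
z1 = W1 @ x + b1
a1 = np.maximum(z1, 0)             # both entries equal relu(alpha*x1 + alpha*x2)
z2 = W2 @ a1 + b2
y_hat = 1.0 / (1.0 + np.exp(-z2))

# Backward pass (standard logistic-loss gradients).
dz2 = y_hat - y
dW2 = dz2 @ a1.T
da1 = W2.T @ dz2
dz1 = da1 * (z1 > 0)
dW1 = dz1 @ x.T

print(a1.ravel())   # identical activations for the two hidden units
print(dW1)          # identical gradient rows
```

Because the two rows of dW1 come out identical, a gradient step keeps the two hidden units equal, so the symmetry is never broken.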
What do you notice about the cost plot when you initialize weights with values
too small or too large?
Despite breaking the symmetry, initializing the weights with values (i) too
small or (ii) too large leads respectively to (i) slow learning or (ii) divergence.
Assume all the activation functions are linear (identity function). Then the output prediction is

$$\hat{y} = a^{[L]} = W^{[L]} W^{[L-1]} \cdots W^{[2]} W^{[1]} x$$

where $L = 10$ and $W^{[1]}, W^{[2]}, \ldots, W^{[L-1]}$ are all matrices of size $(2, 2)$ because layers $[1]$ to $[L-1]$ have 2 neurons and receive 2 inputs. With this in mind, and for illustrative purposes, if we assume $W^{[1]} = W^{[2]} = \cdots = W^{[L-1]} = W$, the output prediction is $\hat{y} = W^{[L]} W^{L-1} x$ (where $W^{L-1}$ takes the matrix $W$ to the power $L-1$, while $W^{[L]}$ denotes the $L$-th matrix).

Consider first a too-large initialization, say weights slightly larger than the identity matrix, $W = 1.5\,I$. This simplifies to $\hat{y} = W^{[L]}\, 1.5^{L-1} x$, and the values of $a^{[l]}$ increase exponentially with $l$. When these activations are used in backward propagation, this leads to the exploding gradient problem. That is, the gradients of the cost with respect to the parameters are too big. This leads the cost to oscillate around its minimum value.

Now consider a too-small initialization, say $W = 0.5\,I$. This simplifies to $\hat{y} = W^{[L]}\, 0.5^{L-1} x$, and the values of the activations $a^{[l]}$ decrease exponentially with $l$. When these activations are used in backward propagation, this leads to the vanishing gradient problem: the gradients of the cost with respect to the parameters are too small, and the cost converges before reaching its minimum value.
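A quick numerical illustration of this case study, as a minimal NumPy sketch. The depth and the 1.5/0.5 scaling come from the text above; the input values and the use of the norm as a summary are assumptions made for illustration.

```python
# Deep linear network with every hidden weight matrix set to 1.5*I or 0.5*I.
import numpy as np

L = 10                              # depth used in the text
x = np.array([[1.0], [1.0]])        # a (2, 1) input, values assumed

for scale in (1.5, 0.5):
    W = scale * np.eye(2)           # W^[1] = ... = W^[L-1] = scale * I
    a = x
    for _ in range(L - 1):          # identity activations, b^[l] = 0
        a = W @ a
    print(f"scale={scale}: ||a^[L-1]|| = {np.linalg.norm(a):.6f}")
# scale=1.5 grows like 1.5^(L-1) (exploding); scale=0.5 shrinks like 0.5^(L-1) (vanishing).
```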
To prevent this, we stick to two rules of thumb: the mean of the activations should be zero, and the variance of the activations should stay the same across every layer. Under these two assumptions, the backpropagated gradient signal should not be multiplied by values too small or too large in any layer. It should travel to the input layer without exploding or vanishing.
More concretely, consider a layer l. Its forward propagation is:
$$a^{[l-1]} = g^{[l-1]}(z^{[l-1]})$$
$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$
$$a^{[l]} = g^{[l]}(z^{[l]})$$
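Read concretely, here is the same layer in code: a minimal NumPy sketch in which tanh stands in for $g^{[l]}$ and the layer sizes are illustrative assumptions.

```python
# One layer's forward propagation, matching the equations above.
import numpy as np

def layer_forward(a_prev, W, b, g=np.tanh):
    """z^[l] = W^[l] a^[l-1] + b^[l],  a^[l] = g^[l](z^[l])."""
    z = W @ a_prev + b
    return z, g(z)

rng = np.random.default_rng(0)
a_prev = rng.normal(size=(3, 1))                       # n^[l-1] = 3 activations
W = rng.normal(scale=np.sqrt(1.0 / 3), size=(4, 3))    # n^[l] = 4 units, Xavier-style scale
b = np.zeros((4, 1))
z, a = layer_forward(a_prev, W, b)
print(z.shape, a.shape)                                # (4, 1) (4, 1)
```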
Ensuring zero mean and maintaining the value of the variance of the input of
every layer guarantees no exploding/vanishing signal, as we'll explain in a
moment. This method applies both to the forward propagation (for
activations) and to the backward propagation (for gradients of the cost with respect
to activations). The recommended initialization is Xavier initialization (or one
of its derived methods), which for every layer $l$ sets:

$$W^{[l]} \sim \mathcal{N}\!\left(0, \frac{1}{n^{[l-1]}}\right)$$
$$b^{[l]} = 0$$

In other words, all the weights of layer $l$ are picked randomly from a normal
distribution with mean $\mu = 0$ and variance $\sigma^2 = \frac{1}{n^{[l-1]}}$, where $n^{[l-1]}$ is the number of neurons in layer $l-1$. Biases are initialized with zeros.
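As a minimal sketch of that recipe in NumPy (the helper name and the layer sizes in the example are illustrative assumptions, not from the article):

```python
# Xavier initialization: W^[l] ~ N(0, 1/n^[l-1]), b^[l] = 0, for every layer l.
import numpy as np

def xavier_init(layer_sizes, seed=0):
    """Return a list of (W^[l], b^[l]) pairs with Var(W^[l]) = 1 / n^[l-1]."""
    rng = np.random.default_rng(seed)
    params = []
    for n_prev, n_curr in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(loc=0.0, scale=np.sqrt(1.0 / n_prev), size=(n_curr, n_prev))
        b = np.zeros((n_curr, 1))
        params.append((W, b))
    return params

params = xavier_init([64, 32, 32, 1])          # a hypothetical 3-layer network
print([W.shape for W, _ in params])            # [(32, 64), (32, 32), (1, 32)]
print([round(W.var(), 4) for W, _ in params])  # each approximately 1/n^[l-1]
```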
[Interactive visualization: per-batch and per-epoch histograms of the activations $A^{[1]}$, $A^{[2]}$, $A^{[3]}$, $A^{[4]}$, each plotted over the range $-1$ to $1$.]
You can find the theory behind this visualization in Glorot et al. (2010). The
next section presents the mathematical justification for Xavier initialization
and explains more precisely why it is an effective initialization.
In this section, we will show that Xavier Initialization⁴ keeps the variance the
same across every layer. We will assume that our layer’s activations are
normally distributed around zero. Sometimes it helps to understand the
mathematical justification to grasp the concept, but you can understand the
fundamental idea without the math.
Let’s work on the layer l described in part (III) and assume the activation
function is tanh. The forward propagation is:
$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$
$$a^{[l]} = \tanh(z^{[l]})$$
Assume we initialized our network with appropriate values and the input is
normalized. Early on in the training, we are in the linear regime of tanh. Values
are small enough and thus $\tanh(z^{[l]}) \approx z^{[l]}$,⁵ meaning that:

$$\mathrm{Var}(a^{[l]}) = \mathrm{Var}(z^{[l]})$$
We would like the variance to stay the same across layers, $\mathrm{Var}(a^{[l-1]}) = \mathrm{Var}(a^{[l]})$ (this will end up being true given the choice of initialization we will choose). Looking element-wise at the forward propagation, with $b^{[l]} = 0$, this now gives:

$$\mathrm{Var}(a_k^{[l]}) = \mathrm{Var}(z_k^{[l]}) = \mathrm{Var}\!\left(\sum_{j=1}^{n^{[l-1]}} w_{kj}^{[l]}\, a_j^{[l-1]}\right)$$
To expand this, we make three assumptions:⁶ (1) the weights are independent and identically distributed, drawn from a distribution centered at zero; (2) the inputs are independent and identically distributed, normalized around zero; and (3) the weights and inputs are mutually independent. The terms of the sum are then independent, so the variance of the sum is the sum of the variances, and the variance of each product $w_{kj}^{[l]} a_j^{[l-1]}$ expands (for independent factors) as:

$$\mathrm{Var}(z_k^{[l]}) = \sum_{j=1}^{n^{[l-1]}} \mathrm{Var}\!\left(w_{kj}^{[l]}\, a_j^{[l-1]}\right) = \sum_{j=1}^{n^{[l-1]}} \left( E\!\left[w_{kj}^{[l]}\right]^2 \mathrm{Var}\!\left(a_j^{[l-1]}\right) + \mathrm{Var}\!\left(w_{kj}^{[l]}\right) E\!\left[a_j^{[l-1]}\right]^2 + \mathrm{Var}\!\left(w_{kj}^{[l]}\right) \mathrm{Var}\!\left(a_j^{[l-1]}\right) \right)$$
We're almost done! The first assumption leads to $E[w_{kj}^{[l]}]^2 = 0$ and the second leads to $E[a_j^{[l-1]}]^2 = 0$. Thus:

$$\mathrm{Var}(z_k^{[l]}) = \sum_{j=1}^{n^{[l-1]}} \mathrm{Var}\!\left(w_{kj}^{[l]}\right) \mathrm{Var}\!\left(a_j^{[l-1]}\right) = n^{[l-1]}\, \mathrm{Var}\!\left(W^{[l]}\right) \mathrm{Var}\!\left(a^{[l-1]}\right)$$

The equality above results from our first assumption stating that:

$$\mathrm{Var}\!\left(w_{kj}^{[l]}\right) = \mathrm{Var}\!\left(w_{11}^{[l]}\right) = \mathrm{Var}\!\left(w_{12}^{[l]}\right) = \cdots = \mathrm{Var}\!\left(W^{[l]}\right)$$

and from our second assumption stating that:

$$\mathrm{Var}\!\left(a_j^{[l-1]}\right) = \mathrm{Var}\!\left(a_1^{[l-1]}\right) = \mathrm{Var}\!\left(a_2^{[l-1]}\right) = \cdots = \mathrm{Var}\!\left(a^{[l-1]}\right)$$

Putting it together, $\mathrm{Var}(a^{[l]}) = n^{[l-1]}\, \mathrm{Var}(W^{[l]})\, \mathrm{Var}(a^{[l-1]})$, and applying this relation recursively from the input to the output layer gives:

$$\mathrm{Var}(a^{[L]}) = \left[\prod_{l=1}^{L} n^{[l-1]}\, \mathrm{Var}\!\left(W^{[l]}\right)\right] \mathrm{Var}(x)$$

For the variance to stay the same across every layer, each factor $n^{[l-1]}\, \mathrm{Var}(W^{[l]})$ must equal 1; if it is smaller the variance vanishes with depth, and if it is larger it explodes. Choosing $\mathrm{Var}(W^{[l]}) = \frac{1}{n^{[l-1]}}$, i.e. Xavier initialization, is exactly what enforces this.
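A quick numerical check of this result, as a minimal NumPy sketch. The width, depth, batch size, and the gains tried are illustrative assumptions; gain 1.0 corresponds to Xavier initialization.

```python
# With tanh units in their linear regime and Xavier-initialized weights,
# Var(a^[l]) stays close to Var(x) across layers; rescaling the weights breaks this.
import numpy as np

rng = np.random.default_rng(0)
n, depth, batch = 256, 10, 2000
x = 0.01 * rng.normal(size=(n, batch))       # small, normalized inputs (linear regime)

for gain in (1.0, 3.0, 0.3):
    a = x
    for _ in range(depth):
        W = gain * rng.normal(scale=np.sqrt(1.0 / n), size=(n, n))
        a = np.tanh(W @ a)                   # b^[l] = 0
    print(f"gain={gain}: Var(x) = {x.var():.2e}, Var(a^[{depth}]) = {a.var():.2e}")
# gain=1.0 keeps the variance roughly constant; gain=3.0 grows it until tanh
# saturates; gain=0.3 makes it vanish.
```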
Conclusion
In practice, Machine Learning Engineers using Xavier initialization would either
initialize the weights as $\mathcal{N}\!\left(0, \frac{1}{n^{[l-1]}}\right)$ or as $\mathcal{N}\!\left(0, \frac{2}{n^{[l-1]} + n^{[l]}}\right)$. The variance term of the
latter distribution is the harmonic mean of $\frac{1}{n^{[l-1]}}$ and $\frac{1}{n^{[l]}}$.
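A minimal sketch of these two variance choices in NumPy; the helper names and the layer sizes in the example are mine, chosen for illustration.

```python
# The two Xavier variance choices: 1/n^[l-1] and 2/(n^[l-1] + n^[l]).
import numpy as np

def xavier_normal(n_prev, n_curr, rng):
    """W^[l] ~ N(0, 1 / n^[l-1])."""
    return rng.normal(scale=np.sqrt(1.0 / n_prev), size=(n_curr, n_prev))

def xavier_avg_normal(n_prev, n_curr, rng):
    """W^[l] ~ N(0, 2 / (n^[l-1] + n^[l])) -- the harmonic-mean variant."""
    return rng.normal(scale=np.sqrt(2.0 / (n_prev + n_curr)), size=(n_curr, n_prev))

rng = np.random.default_rng(0)
W1 = xavier_normal(300, 100, rng)
W2 = xavier_avg_normal(300, 100, rng)
print(round(W1.var(), 5), round(1 / 300, 5))          # empirical vs target variance
print(round(W2.var(), 5), round(2 / (300 + 100), 5))
```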
AUTHORS
ACKNOWLEDGMENTS
1. The template for the article was designed by Jingru Guo and inspired by Distill.
2. The first visualization adapted code from Mike Bostock's visualization of the Goldstein-Price
function.
3. The banner visualization adapted code from deeplearn.js's implementation of a CPPN.
FOOTNOTES
1. All bias parameters are initialized to zero and weight parameters are drawn from a normal
distribution with zero mean and selected variance.
2. Under the hypothesis that all entries of the weight matrix $W^{[l]}$ are picked from the same distribution, $\mathrm{Var}(w_{11}) = \mathrm{Var}(w_{12}) = \cdots = \mathrm{Var}(w_{n^{[l]} n^{[l-1]}})$. Thus, $\mathrm{Var}(W^{[l]})$ indicates the variance of any entry of $W^{[l]}$ (they're all the same!). Similarly, we will denote $\mathrm{Var}(x)$ (resp. $\mathrm{Var}(a^{[l]})$) the variance of any entry of $x$ (resp. $a^{[l]}$). It is a fair approximation to consider that every pixel of a "real-world image" $x$ is distributed according to the same distribution.
3. All bias parameters are initialized to zero and weight parameters are drawn from either a "Zero" distribution ($w_{ij} = 0$), a "Uniform" distribution ($w_{ij} \sim U\!\left(-\frac{1}{n^{[l-1]}}, \frac{1}{n^{[l-1]}}\right)$), or a "Xavier" distribution ($w_{ij} \sim \mathcal{N}\!\left(0, \frac{1}{n^{[l-1]}}\right)$).
4. Concretely, it means we pick every weight randomly and independently from a normal distribution centered at $\mu = 0$ and with variance $\sigma^2 = \frac{1}{n^{[l-1]}}$.
5. We assume that $W^{[l]}$ is initialized with small values and $b^{[l]}$ is initialized with zeros. Hence, $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ is small and we are in the linear regime of tanh. Remember the slope of tanh around zero is one, thus $\tanh(Z^{[l]}) \approx Z^{[l]}$.
6. The first assumption will end up being true given our initialization scheme (we pick weights
randomly according to a normal distribution centered at zero). The second assumption is not
always true. For instance in images, inputs are pixel values, and pixel values in the same
region are highly correlated with each other. On average, it’s more likely that a green pixel is
surrounded by green pixels than by any other pixel color, because this pixel might be
representing a grass field, or a green object. Although it’s not always true, we assume that
inputs are distributed identically (let's say from a normal distribution centered at zero). The
third assumption is generally true at initialization, given that our initialization scheme makes
our weights independent and identically distributed (i.i.d.).
REFERENCE
To reference this article in an academic context, please cite this work as:
© Deeplearning.ai 2021