AI Notes
Initializing neural networks
Then, given a new data point, you can use the model to predict its class.
The initialization step can be critical to the model’s ultimate performance, and
it requires the right method. To illustrate this, consider the three-layer neural
network below. You can try initializing this network with different methods
and observe the impact on learning.
↑ Back to top
https://www.deeplearning.ai/ai-notes/initialization/index.html 3/15
04.05.2024, 10:59 Initializing neural networks - deeplearning.ai
This legend details the color scheme: labels/predictions are colored on a scale from 0 to 1, and weights/gradients are colored from negative through zero to positive.
[Interactive visualization: the network's predictions over the input plane (X1, X2), with both axes ranging from -4 to 4, shown across training epochs.]

What do you notice about the gradients and weights when the initialization method is zero?
Initializing all the weights with zeros leads the neurons to learn the same features during training.
In fact, any constant initialization scheme will perform very poorly. Consider a
neural network with two hidden units, and assume we initialize all the biases
to 0 and the weights with some constant α. If we forward propagate an input
(x1 , x2 ) in this network, the output of both hidden units will be relu(αx1 + αx2 ).
Thus, both hidden units will have identical influence on the cost, which will
lead to identical gradients. Both neurons will therefore evolve symmetrically
throughout training, effectively preventing different neurons from learning
different things.
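To see this symmetry numerically, here is a minimal sketch (NumPy) of such a two-hidden-unit ReLU network with every weight set to the same constant α. The sigmoid output, logistic loss, and the specific input values are illustrative assumptions, not part of the article.

```python
# A minimal sketch: constant initialization gives both hidden units identical gradients.
import numpy as np

np.random.seed(0)
x = np.random.randn(2, 1)          # one input sample with features (x1, x2)
y = np.array([[1.0]])              # its label

alpha = 0.5                        # the constant used for every weight
W1 = np.full((2, 2), alpha)        # both hidden units get the same weights
b1 = np.zeros((2, 1))
W2 = np.full((1, 2), alpha)
b2 = np.zeros((1, 1))

# Forward pass: ReLU hidden layer, sigmoid output (logistic loss assumed).
z1 = W1 @ x + b1
a1 = np.maximum(z1, 0)             # both entries equal relu(alpha*x1 + alpha*x2)
z2 = W2 @ a1 + b2
y_hat = 1.0 / (1.0 + np.exp(-z2))

# Backward pass (standard logistic-loss gradients).
dz2 = y_hat - y
dW2 = dz2 @ a1.T
da1 = W2.T @ dz2
dz1 = da1 * (z1 > 0)
dW1 = dz1 @ x.T

print(a1.ravel())   # identical activations for the two hidden units
print(dW1)          # identical gradient rows
```

Because the two rows of dW1 come out identical, a gradient step keeps the two hidden units equal, so the symmetry is never broken.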
What do you notice about the cost plot when you initialize weights with values
too small or too large?
Despite breaking the symmetry, initializing the weights with values (i) too
small or (ii) too large leads respectively to (i) slow learning or (ii) divergence.
Assume all the activation functions are linear (identity function). Then the output prediction is

$$\hat{y} = a^{[L]} = W^{[L]} W^{[L-1]} \cdots W^{[2]} W^{[1]} x$$

where $L = 10$ and $W^{[1]}, W^{[2]}, \ldots, W^{[L-1]}$ are all matrices of size $(2, 2)$ because layers $[1]$ to $[L-1]$ have 2 neurons and receive 2 inputs. With this in mind, and for illustrative purposes, if we assume $W^{[1]} = W^{[2]} = \cdots = W^{[L-1]} = W$, the output prediction is $\hat{y} = W^{[L]} W^{L-1} x$ (where $W^{L-1}$ takes the matrix $W$ to the power $L-1$, while $W^{[L]}$ denotes the $L$-th matrix).

Consider first a too-large initialization, say weights slightly larger than the identity matrix, $W = 1.5\,I$. This simplifies to $\hat{y} = W^{[L]}\, 1.5^{L-1} x$, and the values of $a^{[l]}$ increase exponentially with $l$. When these activations are used in backward propagation, this leads to the exploding gradient problem. That is, the gradients of the cost with respect to the parameters are too big. This leads the cost to oscillate around its minimum value.

Now consider a too-small initialization, say $W = 0.5\,I$. This simplifies to $\hat{y} = W^{[L]}\, 0.5^{L-1} x$, and the values of the activations $a^{[l]}$ decrease exponentially with $l$. When these activations are used in backward propagation, this leads to the vanishing gradient problem: the gradients of the cost with respect to the parameters are too small, and the cost converges before reaching its minimum value.
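A quick numerical illustration of this case study, as a minimal NumPy sketch. The depth and the 1.5/0.5 scaling come from the text above; the input values and the use of the norm as a summary are assumptions made for illustration.

```python
# Deep linear network with every hidden weight matrix set to 1.5*I or 0.5*I.
import numpy as np

L = 10                              # depth used in the text
x = np.array([[1.0], [1.0]])        # a (2, 1) input, values assumed

for scale in (1.5, 0.5):
    W = scale * np.eye(2)           # W^[1] = ... = W^[L-1] = scale * I
    a = x
    for _ in range(L - 1):          # identity activations, b^[l] = 0
        a = W @ a
    print(f"scale={scale}: ||a^[L-1]|| = {np.linalg.norm(a):.6f}")
# scale=1.5 grows like 1.5^(L-1) (exploding); scale=0.5 shrinks like 0.5^(L-1) (vanishing).
```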
To prevent this, we stick to two rules of thumb: the mean of the activations should be zero, and the variance of the activations should stay the same across every layer. Under these two assumptions, the backpropagated gradient signal should not be multiplied by values too small or too large in any layer. It should travel to the input layer without exploding or vanishing.
More concretely, consider a layer l. Its forward propagation is:
$$a^{[l-1]} = g^{[l-1]}(z^{[l-1]})$$
$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$
$$a^{[l]} = g^{[l]}(z^{[l]})$$
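Read concretely, here is the same layer in code: a minimal NumPy sketch in which tanh stands in for $g^{[l]}$ and the layer sizes are illustrative assumptions.

```python
# One layer's forward propagation, matching the equations above.
import numpy as np

def layer_forward(a_prev, W, b, g=np.tanh):
    """z^[l] = W^[l] a^[l-1] + b^[l],  a^[l] = g^[l](z^[l])."""
    z = W @ a_prev + b
    return z, g(z)

rng = np.random.default_rng(0)
a_prev = rng.normal(size=(3, 1))                       # n^[l-1] = 3 activations
W = rng.normal(scale=np.sqrt(1.0 / 3), size=(4, 3))    # n^[l] = 4 units, Xavier-style scale
b = np.zeros((4, 1))
z, a = layer_forward(a_prev, W, b)
print(z.shape, a.shape)                                # (4, 1) (4, 1)
```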
Ensuring zero mean and maintaining the value of the variance of the input of
every layer guarantees no exploding/vanishing signal, as we'll explain in a
moment. This method applies both to the forward propagation (for
activations) and to the backward propagation (for gradients of the cost with respect
to activations). The recommended initialization is Xavier initialization (or one
of its derived methods), which for every layer $l$ sets:

$$W^{[l]} \sim \mathcal{N}\!\left(0, \frac{1}{n^{[l-1]}}\right)$$
$$b^{[l]} = 0$$

In other words, all the weights of layer $l$ are picked randomly from a normal
distribution with mean $\mu = 0$ and variance $\sigma^2 = \frac{1}{n^{[l-1]}}$, where $n^{[l-1]}$ is the number of neurons in layer $l-1$. Biases are initialized with zeros.
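As a minimal sketch of that recipe in NumPy (the helper name and the layer sizes in the example are illustrative assumptions, not from the article):

```python
# Xavier initialization: W^[l] ~ N(0, 1/n^[l-1]), b^[l] = 0, for every layer l.
import numpy as np

def xavier_init(layer_sizes, seed=0):
    """Return a list of (W^[l], b^[l]) pairs with Var(W^[l]) = 1 / n^[l-1]."""
    rng = np.random.default_rng(seed)
    params = []
    for n_prev, n_curr in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(loc=0.0, scale=np.sqrt(1.0 / n_prev), size=(n_curr, n_prev))
        b = np.zeros((n_curr, 1))
        params.append((W, b))
    return params

params = xavier_init([64, 32, 32, 1])          # a hypothetical 3-layer network
print([W.shape for W, _ in params])            # [(32, 64), (32, 32), (1, 32)]
print([round(W.var(), 4) for W, _ in params])  # each approximately 1/n^[l-1]
```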
[Interactive visualization: per-batch and per-epoch histograms of the activations $A^{[1]}$, $A^{[2]}$, $A^{[3]}$, $A^{[4]}$, each plotted over the range $-1$ to $1$.]
You can find the theory behind this visualization in Glorot et al. (2010). The
next section presents the mathematical justification for Xavier initialization
and explains more precisely why it is an effective initialization.
In this section, we will show that Xavier Initialization⁴ keeps the variance the
same across every layer. We will assume that our layer’s activations are
normally distributed around zero. Sometimes it helps to understand the
mathematical justification to grasp the concept, but you can understand the
fundamental idea without the math.
Let’s work on the layer l described in part (III) and assume the activation
function is tanh. The forward propagation is:
$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$
$$a^{[l]} = \tanh(z^{[l]})$$
Assume we initialized our network with appropriate values and the input is
normalized. Early on in the training, we are in the linear regime of tanh. Values
are small enough and thus $\tanh(z^{[l]}) \approx z^{[l]}$,⁵ meaning that:

$$\mathrm{Var}(a^{[l]}) = \mathrm{Var}(z^{[l]})$$
We would like the variance to stay the same across layers, $\mathrm{Var}(a^{[l-1]}) = \mathrm{Var}(a^{[l]})$ (this will end up being true given the choice of initialization we will choose). Looking element-wise at the forward propagation, with $b^{[l]} = 0$, this now gives:

$$\mathrm{Var}(a_k^{[l]}) = \mathrm{Var}(z_k^{[l]}) = \mathrm{Var}\!\left(\sum_{j=1}^{n^{[l-1]}} w_{kj}^{[l]}\, a_j^{[l-1]}\right)$$
To expand this, we make three assumptions:⁶ (1) the weights are independent and identically distributed, drawn from a distribution centered at zero; (2) the inputs are independent and identically distributed, normalized around zero; and (3) the weights and inputs are mutually independent. The terms of the sum are then independent, so the variance of the sum is the sum of the variances, and the variance of each product $w_{kj}^{[l]} a_j^{[l-1]}$ expands (for independent factors) as:

$$\mathrm{Var}(z_k^{[l]}) = \sum_{j=1}^{n^{[l-1]}} \mathrm{Var}\!\left(w_{kj}^{[l]}\, a_j^{[l-1]}\right) = \sum_{j=1}^{n^{[l-1]}} \left( E\!\left[w_{kj}^{[l]}\right]^2 \mathrm{Var}\!\left(a_j^{[l-1]}\right) + \mathrm{Var}\!\left(w_{kj}^{[l]}\right) E\!\left[a_j^{[l-1]}\right]^2 + \mathrm{Var}\!\left(w_{kj}^{[l]}\right) \mathrm{Var}\!\left(a_j^{[l-1]}\right) \right)$$
We're almost done! The first assumption leads to $E[w_{kj}^{[l]}]^2 = 0$ and the second leads to $E[a_j^{[l-1]}]^2 = 0$. Thus:

$$\mathrm{Var}(z_k^{[l]}) = \sum_{j=1}^{n^{[l-1]}} \mathrm{Var}\!\left(w_{kj}^{[l]}\right) \mathrm{Var}\!\left(a_j^{[l-1]}\right) = n^{[l-1]}\, \mathrm{Var}\!\left(W^{[l]}\right) \mathrm{Var}\!\left(a^{[l-1]}\right)$$

The equality above results from our first assumption stating that:

$$\mathrm{Var}\!\left(w_{kj}^{[l]}\right) = \mathrm{Var}\!\left(w_{11}^{[l]}\right) = \mathrm{Var}\!\left(w_{12}^{[l]}\right) = \cdots = \mathrm{Var}\!\left(W^{[l]}\right)$$

and from our second assumption stating that:

$$\mathrm{Var}\!\left(a_j^{[l-1]}\right) = \mathrm{Var}\!\left(a_1^{[l-1]}\right) = \mathrm{Var}\!\left(a_2^{[l-1]}\right) = \cdots = \mathrm{Var}\!\left(a^{[l-1]}\right)$$

Putting it together, $\mathrm{Var}(a^{[l]}) = n^{[l-1]}\, \mathrm{Var}(W^{[l]})\, \mathrm{Var}(a^{[l-1]})$, and applying this relation recursively from the input to the output layer gives:

$$\mathrm{Var}(a^{[L]}) = \left[\prod_{l=1}^{L} n^{[l-1]}\, \mathrm{Var}\!\left(W^{[l]}\right)\right] \mathrm{Var}(x)$$

For the variance to stay the same across every layer, each factor $n^{[l-1]}\, \mathrm{Var}(W^{[l]})$ must equal 1; if it is smaller the variance vanishes with depth, and if it is larger it explodes. Choosing $\mathrm{Var}(W^{[l]}) = \frac{1}{n^{[l-1]}}$, i.e. Xavier initialization, is exactly what enforces this.
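A quick numerical check of this result, as a minimal NumPy sketch. The width, depth, batch size, and the gains tried are illustrative assumptions; gain 1.0 corresponds to Xavier initialization.

```python
# With tanh units in their linear regime and Xavier-initialized weights,
# Var(a^[l]) stays close to Var(x) across layers; rescaling the weights breaks this.
import numpy as np

rng = np.random.default_rng(0)
n, depth, batch = 256, 10, 2000
x = 0.01 * rng.normal(size=(n, batch))       # small, normalized inputs (linear regime)

for gain in (1.0, 3.0, 0.3):
    a = x
    for _ in range(depth):
        W = gain * rng.normal(scale=np.sqrt(1.0 / n), size=(n, n))
        a = np.tanh(W @ a)                   # b^[l] = 0
    print(f"gain={gain}: Var(x) = {x.var():.2e}, Var(a^[{depth}]) = {a.var():.2e}")
# gain=1.0 keeps the variance roughly constant; gain=3.0 grows it until tanh
# saturates; gain=0.3 makes it vanish.
```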
Conclusion
In practice, Machine Learning Engineers using Xavier initialization would either
initialize the weights as $\mathcal{N}\!\left(0, \frac{1}{n^{[l-1]}}\right)$ or as $\mathcal{N}\!\left(0, \frac{2}{n^{[l-1]} + n^{[l]}}\right)$. The variance term of the
latter distribution is the harmonic mean of $\frac{1}{n^{[l-1]}}$ and $\frac{1}{n^{[l]}}$.
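A minimal sketch of these two variance choices in NumPy; the helper names and the layer sizes in the example are mine, chosen for illustration.

```python
# The two Xavier variance choices: 1/n^[l-1] and 2/(n^[l-1] + n^[l]).
import numpy as np

def xavier_normal(n_prev, n_curr, rng):
    """W^[l] ~ N(0, 1 / n^[l-1])."""
    return rng.normal(scale=np.sqrt(1.0 / n_prev), size=(n_curr, n_prev))

def xavier_avg_normal(n_prev, n_curr, rng):
    """W^[l] ~ N(0, 2 / (n^[l-1] + n^[l])) -- the harmonic-mean variant."""
    return rng.normal(scale=np.sqrt(2.0 / (n_prev + n_curr)), size=(n_curr, n_prev))

rng = np.random.default_rng(0)
W1 = xavier_normal(300, 100, rng)
W2 = xavier_avg_normal(300, 100, rng)
print(round(W1.var(), 5), round(1 / 300, 5))          # empirical vs target variance
print(round(W2.var(), 5), round(2 / (300 + 100), 5))
```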
AUTHORS
ACKNOWLEDGMENTS
1. The template for the article was designed by Jingru Guo and inspired by Distill.
2. The first visualization adapted code from Mike Bostock's visualization of the Goldstein-Price
function.
3. The banner visualization adapted code from deeplearn.js's implementation of a CPPN.
FOOTNOTES
1. All bias parameters are initialized to zero and weight parameters are drawn from a normal
distribution with zero mean and selected variance.
2. Under the hypothesis that all entries of the weight matrix $W^{[l]}$ are picked from the same distribution, $\mathrm{Var}(w_{11}) = \mathrm{Var}(w_{12}) = \cdots = \mathrm{Var}(w_{n^{[l]} n^{[l-1]}})$. Thus, $\mathrm{Var}(W^{[l]})$ indicates the variance of any entry of $W^{[l]}$ (they're all the same!). Similarly, we will denote $\mathrm{Var}(x)$ (resp. $\mathrm{Var}(a^{[l]})$) the variance of any entry of $x$ (resp. $a^{[l]}$). It is a fair approximation to consider that every pixel of a "real-world image" $x$ is distributed according to the same distribution.
3. All bias parameters are initialized to zero and weight parameters are drawn from either a "Zero" distribution ($w_{ij} = 0$), a "Uniform" distribution ($w_{ij} \sim U\!\left(-\frac{1}{n^{[l-1]}}, \frac{1}{n^{[l-1]}}\right)$), or a "Xavier" distribution ($w_{ij} \sim \mathcal{N}\!\left(0, \frac{1}{n^{[l-1]}}\right)$).
4. Concretely, it means we pick every weight randomly and independently from a normal distribution centered at $\mu = 0$ and with variance $\sigma^2 = \frac{1}{n^{[l-1]}}$.
5. We assume that $W^{[l]}$ is initialized with small values and $b^{[l]}$ is initialized with zeros. Hence, $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ is small and we are in the linear regime of tanh. Remember the slope of tanh around zero is one, thus $\tanh(Z^{[l]}) \approx Z^{[l]}$.
6. The first assumption will end up being true given our initialization scheme (we pick weights
randomly according to a normal distribution centered at zero). The second assumption is not
always true. For instance in images, inputs are pixel values, and pixel values in the same
region are highly correlated with each other. On average, it’s more likely that a green pixel is
surrounded by green pixels than by any other pixel color, because this pixel might be
representing a grass field, or a green object. Although it’s not always true, we assume that
inputs are distributed identically (let's say from a normal distribution centered at zero). The
third assumption is generally true at initialization, given that our initialization scheme makes
our weights independent and identically distributed (i.i.d.).
REFERENCE
To reference this article in an academic context, please cite this work as:
© Deeplearning.ai 2021