Initializers (Advanced) - Update

The document discusses initialization techniques for neural networks. It provides examples of how ReLU and sigmoid activations can affect gradient vanishing and explosion when used with different data normalization techniques. It also introduces the Xavier and Kaiming He initialization methods to address these issues.


AI VIETNAM

All-in-One Course

Multi-layer Perceptron
Initialization (Advanced)

Quang-Vinh Dinh
Ph.D. in Computer Science

Year 2023
Outline
➢ Case Studies
➢ Gradient Vanishing
➢ Gradient Explosion
➢ Xavier Glorot Initialization
➢ Kaiming He Initialization
Case Study 1: ReLU activation, inputs kept in [0, 255]

X ∈ [0, 255]
Normalize(mean, std): Image = (Image − mean) / std

Architecture: a 28×28 image is flattened into 784 values, passed through a fully connected layer (784 → 256 nodes, ReLU), then a fully connected output layer (256 → 10 nodes); softmax over the logits z₁ … z₁₀ produces the class probabilities.
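The Normalize(mean, std) step can be sketched in NumPy as follows; the mean = std = 127.5 constants are an illustrative choice for mapping [0, 255] into [−1, 1], not values taken from the slides:

```python
import numpy as np

def normalize(image: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Normalize(mean, std): Image = (Image - mean) / std."""
    return (image - mean) / std

# Map raw pixel values in [0, 255] into roughly [-1, 1]
# using mean = 127.5 and std = 127.5 (illustrative choice).
raw = np.array([0.0, 127.5, 255.0])
print(normalize(raw, mean=127.5, std=127.5))  # -> [-1.  0.  1.]
```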
Case Study 2: ReLU activation, inputs normalized to [−1, 1]

X ∈ [−1, 1]
Normalize(mean, std): Image = (Image − mean) / std

Same architecture: flatten 28×28 → 784 nodes → fully connected + ReLU (256 nodes) → fully connected output (10 nodes) → softmax over z₁ … z₁₀.
Experimental Results

[Figure: training curves for ReLU + [0, 255] vs. ReLU + [−1, 1]]
Case Study 3: Sigmoid activation, inputs kept in [0, 255]

X ∈ [0, 255]
Normalize(mean, std): Image = (Image − mean) / std

Same architecture: flatten 28×28 → 784 nodes → fully connected + Sigmoid (256 nodes) → fully connected output (10 nodes) → softmax over z₁ … z₁₀.
Case Study 4: Sigmoid activation, inputs normalized to [−1, 1]

X ∈ [−1, 1]
Normalize(mean, std): Image = (Image − mean) / std

Same architecture: flatten 28×28 → 784 nodes → fully connected + Sigmoid (256 nodes) → fully connected output (10 nodes) → softmax over z₁ … z₁₀.
Experimental Results

[Figure: training curves for Sigmoid + [0, 255] vs. Sigmoid + [−1, 1]]
Outline
➢ Case Studies
➢ Gradient Vanishing
➢ Gradient Explosion
➢ Xavier Glorot Initialization
➢ Kaiming He Initialization
Gradient Vanishing
Large weight initialization

[Diagram: input X → (w₁, b₁) → z₁ → sigmoid s (layer 1); layer 2 computes z₂ via (w₂, b₂) and z₃ via (w₃, b₃); softmax over (z₂, z₃) gives (ŷ₀, ŷ₁), scored with cross-entropy.]
Gradient Vanishing: large weight initialization

Input x = 2.4, large initial weights w₁ = 6.74, w₂ = 9.808, w₃ = 13.3, biases 0.0. The sigmoid in layer 1 saturates, so almost no gradient reaches it:

L′w₁ = 9·10⁻⁷, L′b₁ = 4·10⁻⁷ (layer 1), while L′w₂ = −0.972 (layer 2)

With η = 0.01, the layer-1 updates are negligible:

η·L′w₁ = 9·10⁻⁹, η·L′b₁ = 4·10⁻⁹
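The saturation behind these numbers can be checked directly: with the slide's large weight w₁ = 6.74 and input x = 2.4, the pre-activation lands deep in the flat tail of the sigmoid, where the local derivative is nearly zero (a minimal sketch):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w1 = 2.4, 6.74                       # input and large initial weight (slide values)
z1 = w1 * x                             # 16.176: deep in the saturated region
local_grad = sigmoid(z1) * (1.0 - sigmoid(z1))
print(local_grad)                       # ≈ 9.4e-08: almost no gradient flows back
```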
Gradient Vanishing

MLP with 5 layers

[Diagram: X → (w₁, b₁) → z₁ → s → (w₂, b₂) → z₂ → s → (w₃, b₃) → z₃ → s → (w₄, b₄) → z₄ → s → layer 5, which computes z₅ via (w₅, b₅) and z₆ via (w₆, b₆); softmax over (z₅, z₆) gives (ŷ₀, ŷ₁), scored with cross-entropy against the label y. s denotes the sigmoid function.]
Gradient Vanishing: forward pass

For input x = 2.4 with initial weights w₁ = 0.919, w₂ = −0.812, w₃ = 1.471, w₄ = −0.776, w₅ = −0.309, w₆ = 1.133 and all biases 0.0, the forward pass through the 5-layer MLP yields logits z₅ = −0.118 and z₆ = 0.433; with label y = 0, the cross-entropy loss is 1.0066.
Gradient Vanishing: backward pass

The gradients shrink toward the early layers:

L′w₁ = −0.0002, L′w₂ = −0.011, L′w₃ = −0.012, L′w₄ = 0.133
L′b₁ = 0.0009, L′b₂ = −0.012, L′b₃ = −0.039, L′b₄ = 0.216

Derivative values are too small: with η = 0.01,

w₁ = w₁ − η·L′w₁ = 0.919 − 0.01·(−0.0002) = 0.919002
b₁ = b₁ − η·L′b₁ = 0 − 0.01·0.0009 = −9·10⁻⁶

so layer 1 barely moves.
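One way to see why the early-layer gradients shrink: each sigmoid layer contributes a factor s(z)(1 − s(z)) ≤ 0.25 to the chain rule, and the product of several such factors decays quickly. A small sketch (the pre-activation values are made up, not the slide's):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)               # at most 0.25, attained at z = 0

# Backprop through 5 sigmoid layers multiplies 5 such factors together:
grad = 1.0
for z in [0.5, 1.2, -0.8, 2.0, 1.5]:   # made-up pre-activations, one per layer
    grad *= sigmoid_grad(z)
print(grad)                            # < 0.25**5, i.e. well below 1e-3
```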
Gradient Vanishing

MLP with 8 layers

[Diagram: X → (w₁, b₁) → z₁ → s → (w₂, b₂) → z₂ → s → (w₃, b₃) → z₃ → s → … → layer 8, which produces two logits z₅ and z₆ via (w₅, b₅) and (w₆, b₆); softmax gives (ŷ₀, ŷ₁), scored with cross-entropy against y. s denotes the sigmoid function.]
Gradient Vanishing: 8 layers

With weights w₁ = −0.358, w₂ = −1.683, w₃ = −0.1407 and biases 0.0, the gradients reaching layer 1 are even smaller than in the 5-layer case:

L′w₁ = 7·10⁻⁷, L′b₁ = 3·10⁻⁷
η·L′w₁ = 7·10⁻⁹, η·L′b₁ = 3·10⁻⁹

Derivative values are super small.
Gradient Explosion
Large weight initialization and large learning rate

Replacing the sigmoid with PReLU (denoted p) removes the saturation that shrank the gradients. Input x = 2.4, weights w₁ = 2.68, w₂ = −3.27, w₃ = 1.58, biases 0.0:

L′w₁ = 99.2, L′w₂ = −54.6, L′b₁ = 4.86

With η = 10, the updates explode:

η·L′w₁ = 992, η·L′b₁ = 48.6
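Plugging the slide's numbers into a single gradient step shows the blow-up (a sketch; only the quoted gradients and η are from the slide):

```python
eta = 10.0                        # the slide's large learning rate
w1, grad_w1 = 2.68, 99.2          # weight and gradient from the slide
b1, grad_b1 = 0.0, 4.86

w1_new = w1 - eta * grad_w1       # 2.68 - 992 = -989.32: the weight explodes
b1_new = b1 - eta * grad_b1       # -48.6
print(w1_new, b1_new)
```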
Outline
➢ Case Studies
➢ Gradient Vanishing
➢ Gradient Explosion
➢ Xavier Glorot Initialization
➢ Kaiming He Initialization
Mean

Data: X = {X₁, …, X_N}

Formula: E[X] = Σᵢ₌₁ᴺ Xᵢ·P_X(Xᵢ)

Given the data X = {2, 8, 5, 4, 1, 4}, N = 6:

P_X(X=2) = 1/6, P_X(X=8) = 1/6, P_X(X=5) = 1/6, P_X(X=4) = 2/6, P_X(X=1) = 1/6

E[X] = 2·(1/6) + 8·(1/6) + 5·(1/6) + 4·(2/6) + 1·(1/6)
     = 2/6 + 8/6 + 5/6 + 8/6 + 1/6 = 4
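The worked example can be reproduced by estimating each value's probability from its frequency (a minimal sketch):

```python
from collections import Counter

def expectation(data):
    """E[X] = sum_i X_i * P(X_i), with P estimated from frequencies."""
    n = len(data)
    counts = Counter(data)
    return sum(x * c / n for x, c in counts.items())

print(expectation([2, 8, 5, 4, 1, 4]))  # ≈ 4.0, matching the worked example
```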
Mean of a product of independent variables:

E[XY] = Σᵢ Σⱼ Xᵢ·Yⱼ·P(Xᵢ, Yⱼ)
      = Σᵢ Σⱼ Xᵢ·Yⱼ·P(Xᵢ)·P(Yⱼ)    (independence)
      = (Σᵢ Xᵢ·P(Xᵢ)) · (Σⱼ Yⱼ·P(Yⱼ))
      = E[X]·E[Y]
Variance

mean: E[X] = Σᵢ Xᵢ·P_X(Xᵢ)

variance: var(X) = E[(X − E[X])²] = Σᵢ (Xᵢ − E[X])²·P_X(Xᵢ)

standard deviation: σ = √var(X)

Example: X = {5, 3, 6, 7, 4}

E[X] = 5·(1/5) + 3·(1/5) + 6·(1/5) + 7·(1/5) + 4·(1/5) = 5

var(X) = (1/5)[(5−5)² + (3−5)² + (6−5)² + (7−5)² + (4−5)²] = (0+4+1+4+1)/5 = 2

σ = √var(X) = √2 ≈ 1.41
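A direct translation of the three formulas, checked on the example data (a minimal sketch):

```python
import math

def mean(data):
    return sum(data) / len(data)

def variance(data):
    m = mean(data)
    return sum((x - m) ** 2 for x in data) / len(data)

data = [5, 3, 6, 7, 4]
print(mean(data), variance(data), math.sqrt(variance(data)))  # 5.0 2.0 1.414...
```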
Variance: the shortcut formula

var(X) = Σᵢ (Xᵢ − E[X])²·P_X(Xᵢ)
       = Σᵢ (Xᵢ² − 2Xᵢ·E[X] + E[X]²)·P_X(Xᵢ)
       = Σᵢ Xᵢ²·P_X(Xᵢ) − 2E[X]·Σᵢ Xᵢ·P_X(Xᵢ) + E[X]²·Σᵢ P_X(Xᵢ)
       = E[X²] − 2E[X]·E[X] + E[X]²
       = E[X²] − E[X]²
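The identity can be verified numerically on the same example data (a sketch):

```python
data = [5, 3, 6, 7, 4]
n = len(data)

e_x = sum(data) / n                      # E[X] = 5.0
e_x2 = sum(x * x for x in data) / n      # E[X^2] = 27.0
var_direct = sum((x - e_x) ** 2 for x in data) / n

print(e_x2 - e_x ** 2, var_direct)       # both 2.0
```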
Variance of a product of independent variables:

var(XY) = E[X²Y²] − E[XY]²
        = E[X²]·E[Y²] − (E[X]·E[Y])²
        = (var(X) + E[X]²)(var(Y) + E[Y]²) − E[X]²·E[Y]²
        = var(X)·var(Y) + var(X)·E[Y]² + var(Y)·E[X]²
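For independent variables, the product formula can be checked exactly by enumerating a small joint distribution with `fractions.Fraction` (the variables and values are illustrative):

```python
from itertools import product
from fractions import Fraction

# Two small independent uniform discrete variables (illustrative values).
X = [1, 2, 3]
Y = [2, 5]
pX = Fraction(1, len(X))
pY = Fraction(1, len(Y))

def E(vals, p):
    """Expectation over a uniform discrete variable with per-value probability p."""
    return sum(v * p for v in vals)

def var(vals, p):
    return E([v * v for v in vals], p) - E(vals, p) ** 2

# var(XY) straight from the joint distribution P(X_i, Y_j) = pX * pY:
prods = [x * y for x, y in product(X, Y)]
var_xy = E([v * v for v in prods], pX * pY) - E(prods, pX * pY) ** 2

# ...and from var(X)var(Y) + var(X)E[Y]^2 + var(Y)E[X]^2:
formula = var(X, pX) * var(Y, pY) + var(X, pX) * E(Y, pY) ** 2 + var(Y, pY) * E(X, pX) ** 2

print(var_xy, formula)  # both 56/3
```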
Initialization Methods
Xavier Initialization

Uniform Distribution: X ~ U(a, b)

f(x) = 1/(b − a),    E[X] = (a + b)/2,    var(X) = (b − a)²/12
Mean of the uniform distribution X ~ U(a, b):

E[X] = ∫ x·f(x) dx = ∫ₐᵇ x/(b − a) dx = x²/(2(b − a)) |ₐᵇ = (b² − a²)/(2(b − a)) = (a + b)/2
Variance of the uniform distribution X ~ U(a, b):

var(X) = E[(X − E[X])²] = ∫ₐᵇ (x − (a + b)/2)² · 1/(b − a) dx
       = 1/(b − a) · [∫ₐᵇ x² dx − ∫ₐᵇ 2x·(a + b)/2 dx + ∫ₐᵇ ((a + b)/2)² dx]
       = 1/(b − a) · [x³/3 |ₐᵇ − x²·(a + b)/2 |ₐᵇ + ((a + b)/2)²·x |ₐᵇ]
       = (a² + ab + b²)/3 − (a + b)²/2 + (a + b)²/4
       = [4(a² + ab + b²) − 3(a² + 2ab + b²)] / 12
       = (b − a)²/12
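The closed form (b − a)²/12 can be sanity-checked by numerical integration of the defining integral (a sketch with illustrative endpoints a = 2, b = 7):

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule for the integral of y over the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

a, b = 2.0, 7.0
x = np.linspace(a, b, 200_001)
f = np.full_like(x, 1.0 / (b - a))              # uniform density on [a, b]
mean = trapezoid(x * f, x)                      # E[X]
var = trapezoid((x - mean) ** 2 * f, x)         # var(X)

print(mean, var, (b - a) ** 2 / 12)             # 4.5, ≈2.0833, 2.0833...
```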
Xavier Initialization

Gaussian Distribution: X ~ N(μ, σ²)

f(x) = 1/(σ√(2π)) · exp(−(1/2)·((x − μ)/σ)²)
Maclaurin series: approximating f(x) for values x ≈ 0

f(x) = Σₙ₌₀^∞ f⁽ⁿ⁾(0)·xⁿ/n! = f(0) + f′(0)·x + f″(0)·x²/2! + f⁽³⁾(0)·x³/3! + ⋯

Applied to tanh(x) = (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ) = 2/(1 + e⁻²ˣ) − 1:

tanh(0) = 0
tanh′(0) = 1 − tanh²(0) = 1
tanh″(0) = (1 − tanh²(x))′ |₀ = −2·tanh(0)·tanh′(0) = 0
tanh⁽³⁾(0) = (−2·tanh(x)·tanh′(x))′ |₀ = −2[tanh′(0)·tanh′(0) + tanh″(0)·tanh(0)] = −2

tanh(x) = f(0) + f′(0)·x + f″(0)·x²/2! + f⁽³⁾(0)·x³/3! + ⋯ = x − 2x³/3! + ⋯ = x − x³/3 + ⋯

So for x ≈ 0: tanh(x) ≈ x
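The quality of tanh(x) ≈ x, and its leading error term x³/3, can be checked numerically (a minimal sketch):

```python
import math

# tanh(x) = x - x^3/3 + ..., so the error of tanh(x) ≈ x shrinks like x^3/3
for x in [0.5, 0.1, 0.01]:
    print(x, math.tanh(x), abs(x - math.tanh(x)))
```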
Maclaurin series: approximating f(x) for values x ≈ 0

Applied to sigmoid(x) = 1/(1 + e⁻ˣ):

sigmoid(0) = 1/2
sigmoid′(0) = sigmoid(0)·(1 − sigmoid(0)) = 1/4
sigmoid″(0) = [sigmoid(x)·(1 − sigmoid(x))]′ |₀ = sigmoid′(0) − 2·sigmoid(0)·sigmoid′(0) = 0

sigmoid(x) = f(0) + f′(0)·x + f″(0)·x²/2! + ⋯ = 1/2 + x/4 + ⋯

So for x ≈ 0: sigmoid(x) ≈ 1/2 + x/4
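Likewise for sigmoid(x) ≈ 1/2 + x/4 (a minimal sketch):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Near 0, sigmoid(x) is well approximated by the linear term 1/2 + x/4
for x in [0.5, 0.1, 0.01]:
    print(x, sigmoid(x), 0.5 + x / 4.0)
```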
Initialization Methods
Xavier Initialization (activation = tanh)

A neuron computes zᵢ = x₁w₁ + ⋯ + xₙwₙ + b and aᵢ = activation(zᵢ), with the assumptions E[X] = 0, E[W] = 0, b = 0, and all xⱼ, wⱼ i.i.d.

Using E[XY] = E[X]·E[Y] and var(XY) = var(X)·var(Y) + var(X)·E[Y]² + var(Y)·E[X]²:

var(zᵢ) = var(x₁w₁ + ⋯ + xₙwₙ + b) = n·var(xᵢwᵢ) = n·var(xᵢ)·var(wᵢ)

With tanh: aᵢ = tanh(zᵢ) ≈ zᵢ, so var(aᵢ) = var(zᵢ). Keeping the activation variance equal to the input variance, var(aᵢ) ≈ var(xᵢ), requires n·var(wᵢ) ≈ 1, i.e.

var(wᵢ) ≈ 1/n
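The relation var(zᵢ) = n·var(xᵢ)·var(wᵢ), and the choice var(wᵢ) = 1/n that makes var(zᵢ) ≈ var(xᵢ), can be observed by sampling (a sketch; the fan-in and sample counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                                    # fan-in of one neuron
samples = 20_000                           # independent draws of z_i

x = rng.normal(0.0, 1.0, size=(samples, n))                # var(x) = 1
w = rng.normal(0.0, np.sqrt(1.0 / n), size=(samples, n))   # var(w) = 1/n
z = np.sum(x * w, axis=1)                  # z_i = x_1 w_1 + ... + x_n w_n (b = 0)

print(z.var())  # ≈ n * var(x) * var(w) = 1
```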
Xavier Initialization (activation = tanh), uniform version: draw wᵢ ~ U(−r, r), so

var(wᵢ) = r²/3

Setting r²/3 = 1/n gives r = √(3/n), hence

Wᵢ ~ U(−√3/√n, √3/√n)
Xavier Initialization (activation = tanh), Gaussian version: draw wᵢ ~ N(0, σ²). Setting σ² = 1/n gives σ = 1/√n, hence

Wᵢ ~ N(0, 1/n)
Xavier Initialization (activation = tanh): summary

Uniform Distribution: Wᵢⱼ ~ U(−√3/√n, √3/√n)        Gaussian Distribution: Wᵢⱼ ~ N(0, 1/n)
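Both Xavier variants are easy to sketch in NumPy (layer sizes are illustrative; note that framework versions such as PyTorch's `xavier_uniform_` average fan-in and fan-out, while the derivation above uses fan-in only):

```python
import numpy as np

rng = np.random.default_rng(42)

def xavier_uniform(n_in, n_out):
    """W ~ U(-sqrt(3/n), sqrt(3/n)) with n = fan-in, so var(W) = 1/n."""
    r = np.sqrt(3.0 / n_in)
    return rng.uniform(-r, r, size=(n_in, n_out))

def xavier_normal(n_in, n_out):
    """W ~ N(0, 1/n) with n = fan-in."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

W = xavier_uniform(784, 256)
print(W.var(), 1.0 / 784)  # empirical variance ≈ the 1/n target
```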
Initialization Methods
Xavier Initialization (activation = sigmoid)

As before, var(zᵢ) = n·var(xᵢ)·var(wᵢ). With sigmoid: aᵢ = sigmoid(zᵢ) ≈ 1/2 + zᵢ/4, so 16·var(aᵢ) = var(zᵢ). Keeping var(aᵢ) ≈ var(xᵢ) requires n·var(wᵢ) ≈ 16, i.e.

var(wᵢ) ≈ 16/n
Xavier Initialization (activation = sigmoid), uniform version: wᵢ ~ U(−r, r) with var(wᵢ) = r²/3. Setting r²/3 = 16/n gives r = 4√3/√n, hence

Wᵢ ~ U(−4√3/√n, 4√3/√n)
Xavier Initialization (activation = sigmoid), Gaussian version: wᵢ ~ N(0, σ²) with σ² = 16/n, hence

Wᵢ ~ N(0, 16/n)
Initialization Methods
Kaiming He Initialization (activation = ReLU)

As before, var(zᵢ) = n·var(xᵢ)·var(wᵢ). With ReLU: aᵢ = max(0, zᵢ). Since zᵢ is symmetric around zero, ReLU discards half of its variance, so 2·var(aᵢ) = var(zᵢ). Keeping var(aᵢ) ≈ var(xᵢ) requires n·var(wᵢ) ≈ 2, i.e.

var(wᵢ) ≈ 2/n
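The factor 2 can be checked by sampling. Strictly, for zero-mean symmetric zᵢ, ReLU halves the second moment, E[aᵢ²] = var(zᵢ)/2; the slide's 2·var(aᵢ) = var(zᵢ) reads var(aᵢ) as that second moment. A sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(0.0, 1.0, size=1_000_000)   # zero-mean pre-activations, var(z) = 1
a = np.maximum(0.0, z)                     # ReLU keeps only the positive half

second_moment = np.mean(a ** 2)
print(second_moment, z.var() / 2)          # E[a^2] ≈ var(z)/2
```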
He Initialization (activation = ReLU), uniform version: wᵢ ~ U(−r, r) with var(wᵢ) = r²/3. Setting r²/3 = 2/n gives r = √(6/n), hence

Wᵢ ~ U(−√6/√n, √6/√n)
He Initialization (activation = ReLU), Gaussian version: wᵢ ~ N(0, σ²) with σ² = 2/n, hence

Wᵢ ~ N(0, 2/n)
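A sketch of He-normal initialization and the property it is designed for: activation magnitudes staying stable through a stack of ReLU layers (widths and depth are illustrative; PyTorch's `kaiming_normal_` implements the same rule):

```python
import numpy as np

rng = np.random.default_rng(7)

def he_normal(n_in, n_out):
    """W ~ N(0, 2/n) with n = fan-in."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

x = rng.normal(0.0, 1.0, size=(1024, 256))     # a batch of zero-mean inputs
for _ in range(10):                            # 10 ReLU layers of width 256
    x = np.maximum(0.0, x @ he_normal(256, 256))

print(np.mean(x ** 2))  # stays near 1 instead of vanishing or exploding
```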
Summary
Recommendation

➢ Data Preparation: normalize inputs to [−1, 1] or apply z-score normalization
➢ Model (Network) Construction: ReLU activation + Batch norm; Glorot uniform or He normal parameter initialization
➢ Optimizer Selection: Adam
➢ Loss function and metric selection: chosen to match the task
Further Reading

Dying ReLU
https://towardsdatascience.com/the-dying-relu-problem-clearly-explained-42d0c54e0d24

Initialization
https://www.deeplearning.ai/ai-notes/initialization/index.html