Initializers (Advanced) - Update
All-in-One Course
Multi-layer Perceptron
Initialization (Advanced)
Quang-Vinh Dinh
Ph.D. in Computer Science
Year 2023
Outline
➢ Case Studies
➢ Gradient Vanishing
➢ Gradient Explosion
➢ Xavier Glorot Initialization
➢ Kaiming He Initialization
Input pixel values: $X \in [0, 255]$
Normalize(mean, std):
$$\text{Image} = \frac{\text{Image} - \text{mean}}{\text{std}}$$
[Figure, repeated for each case study: 28×28 image → normalization → flatten (784 values) → fully connected → activation → fully connected → outputs $z_1, \dots, z_{10}$ → softmax → output; a bias input of 1 feeds each fully connected layer]
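The normalization step above can be sketched as follows. This is a minimal illustration, not the course's code: the random image, the helper name `normalize`, and the two choices of mean and std are assumptions made for the example.

```python
import numpy as np

def normalize(image, mean, std):
    """Apply (image - mean) / std element-wise."""
    return (image.astype(np.float64) - mean) / std

# Hypothetical 28x28 grayscale image with integer pixel values in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))

# Option 1 (assumed): z-score using the image's own mean and std.
z = normalize(image, image.mean(), image.std())

# Option 2 (assumed): map [0, 255] to [-1, 1] with mean = std = 127.5.
scaled = normalize(image, 127.5, 127.5)

# Flatten to the 784-dimensional vector fed to the first fully connected layer.
x = z.reshape(-1)
print(x.shape, z.mean(), z.std(), scaled.min(), scaled.max())
```

Either option keeps the inputs centered near zero, which is the assumption $E[X] = 0$ used later when deriving the initialization rules.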
AI VIETNAM
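Gradient Vanishing
Large weight initialization
[Figure: two-layer network. Layer 1: input $X$ with weight $w_1$ and bias $b_1$ gives $z_1$, followed by a sigmoid $s$. Layer 2: weights $w_2$, $w_3$ and biases $b_2$, $b_3$ give the two logits, followed by softmax (outputs $\hat{y}_0$, $\hat{y}_1$) and a cross-entropy loss.]
Worked example (values as read from the figure): $X = 2.4$, $w_1 = 6.74$, $w_2 = 9.808$, $w_3 = 13.3$, all biases $0.0$.
$$L'_{w_1} = 9 \times 10^{-7}, \qquad L'_{w_2} = -0.972$$
With $\eta = 0.01$, the update to $w_1$ is
$$\eta L'_{w_1} = 9 \times 10^{-9}$$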
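The two-layer example can be reproduced approximately in a few lines. This is a minimal sketch, assuming the weights read from the figure and a one-hot target on $\hat{y}_0$ (the slide does not state the target), so the computed values need not match the slide's digits exactly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Values read from the figure; the target y is an assumption.
x = 2.4
w1, b1 = 6.74, 0.0
w2, b2 = 9.808, 0.0
w3, b3 = 13.3, 0.0
y = np.array([1.0, 0.0])

# Forward pass.
z1 = w1 * x + b1
s = sigmoid(z1)
logits = np.array([w2 * s + b2, w3 * s + b3])
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Backward pass: softmax + cross-entropy gives dL/dlogits = probs - y.
d_logits = probs - y
grad_w2 = d_logits[0] * s
grad_s = d_logits[0] * w2 + d_logits[1] * w3
grad_w1 = grad_s * s * (1.0 - s) * x  # s * (1 - s) is the sigmoid derivative

eta = 0.01
print(f"dL/dw1 = {grad_w1:.3e}, dL/dw2 = {grad_w2:.3f}")
print(f"update to w1 = {eta * grad_w1:.3e}")
```

With $w_1 = 6.74$ the pre-activation is about 16, so the sigmoid saturates and its derivative $s(1 - s)$ is tiny. The gradient reaching $w_1$ nearly vanishes, while $w_2$ still receives a gradient of order one.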
Gradient Vanishing
[Figure: MLP with 5 layers. Layers 1 to 4 each apply a weight $w_k$ and bias $b_k$ to produce $z_k$, followed by the sigmoid function $s$. Layer 5 uses $w_5$, $w_6$ and $b_5$, $b_6$ to produce $z_5$, $z_6$, which pass through softmax to give $\hat{y}_0$, $\hat{y}_1$. Loss computation: cross-entropy against $y$.]
[Figure: the same network evaluated at $X = 2.4$; the values 1.0066 and 0 appear in the figure.]
Gradient Vanishing
[Figure: MLP with 8 layers, with the sigmoid function $s$ after each hidden layer, softmax outputs $\hat{y}_0$, $\hat{y}_1$, and a cross-entropy loss against $y$. Input $X = 2.4$.]
$$L'_{w_1} = -0.002, \quad L'_{w_2} = -0.011, \quad L'_{w_3} = -0.012, \quad L'_{w_4} = 0.133$$
[Figure: MLP with 8 layers, with the values $-0.358$, $-1.683$, $-0.1407$ shown at layers 1 to 3.]
$$L'_{w_1} = 7 \times 10^{-7}, \qquad L'_{b_1} = 3 \times 10^{-7}$$
$$\eta L'_{w_1} = 7 \times 10^{-9}, \qquad \eta L'_{b_1} = 3 \times 10^{-9}$$
The derivative values are extremely small, so the first-layer parameters barely change.
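The effect of depth can be illustrated with a scalar chain of sigmoid layers. This is a simplified sketch, not the slide's network: the helper `chain_gradients`, the weights of 1.0, and the omission of biases, softmax, and loss are all assumptions, and the gradient shown is that of the final activation with respect to each weight.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def chain_gradients(x, weights):
    """Return d(out)/d(w_k) for a_k = sigmoid(w_k * a_{k-1}), with a_0 = x."""
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    grads = [0.0] * len(weights)
    upstream = 1.0  # d(out)/d(a_L)
    for k in reversed(range(len(weights))):
        a_prev, a = activations[k], activations[k + 1]
        local = a * (1.0 - a)             # sigmoid derivative, at most 0.25
        grads[k] = upstream * local * a_prev
        upstream *= local * weights[k]    # propagate to the previous layer
    return grads

weights = [1.0] * 8  # hypothetical weights for an 8-layer chain
grads = chain_gradients(2.4, weights)
for k, g in enumerate(grads, start=1):
    print(f"layer {k}: d(out)/d(w{k}) = {g:.3e}")
```

Each backward step multiplies the gradient by a sigmoid derivative of at most 0.25, so the gradient at layer 1 is several orders of magnitude smaller than at layer 8.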
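Gradient Explosion
Large weight initialization and large learning rate
[Figure: two-layer network with a PReLU function $p$ after layer 1, softmax outputs $\hat{y}_0$, $\hat{y}_1$, and a cross-entropy loss. Values as read from the figure: $X = 2.4$, $w_1 = 2.68$, $w_2 = -3.27$, $w_3 = 1.58$, all biases $0.0$.]
$$L'_{w_1} = 99.2, \qquad L'_{w_2} = -54.6$$
With $\eta = 10$, the update to $w_1$ is
$$\eta L'_{w_1} = 10 \times 99.2 = 992$$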
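The opposite failure mode can be sketched with a scalar chain of PReLU layers. This is an illustration under assumptions, not the slide's network: the helper names, the fixed negative slope of 0.25 (in a real PReLU the slope is learned), and the weights of 2.0 are all hypothetical.

```python
def prelu(z, alpha=0.25):
    return z if z > 0 else alpha * z

def prelu_grad(z, alpha=0.25):
    return 1.0 if z > 0 else alpha

def first_weight_gradient(x, weights):
    """d(out)/d(w_1) for a scalar chain a_k = prelu(w_k * a_{k-1}), a_0 = x."""
    a = x
    zs = []
    for w in weights:
        z = w * a
        zs.append(z)
        a = prelu(z)

    grad = prelu_grad(zs[0]) * x            # d(a_1)/d(w_1)
    for z, w in zip(zs[1:], weights[1:]):
        grad *= prelu_grad(z) * w           # d(a_k)/d(a_{k-1})
    return grad

x = 2.4
shallow = first_weight_gradient(x, [2.0] * 2)  # hypothetical weights
deep = first_weight_gradient(x, [2.0] * 8)
eta = 10.0
print(f"2 layers: gradient = {shallow:.1f}")
print(f"8 layers: gradient = {deep:.1f}, update = {eta * deep:.1f}")
```

Because PReLU passes positive values with slope 1, every layer multiplies the gradient by its weight. Weights larger than 1 make the gradient grow with depth, and a large learning rate amplifies the update further.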
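Mean
Data: $X = \{X_1, \dots, X_N\}$
Formula:
$$E[X] = \sum_{i=1}^{N} X_i P_X(X_i)$$
Example distribution:
$$P_X(X=2) = \tfrac{1}{6}, \quad P_X(X=4) = \tfrac{2}{6}, \quad P_X(X=8) = \tfrac{1}{6}, \quad P_X(X=1) = \tfrac{1}{6}, \quad P_X(X=5) = \tfrac{1}{6}$$
Expectation of a product, for independent $X$ and $Y$:
$$E[XY] = \sum_{i=1}^{N}\sum_{j=1}^{N} X_i Y_j P(X_i) P(Y_j) = \sum_{i=1}^{N} X_i P(X_i) \sum_{j=1}^{N} Y_j P(Y_j) = E[X]\,E[Y]$$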
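Both formulas can be checked exactly with rational arithmetic. The sketch below uses the example distribution from the slide for $X$; the second variable `py` and the helper `expectation` are hypothetical, introduced only to illustrate the product rule for independent variables.

```python
from fractions import Fraction

def expectation(dist):
    """E[X] = sum of value * probability over a {value: probability} dict."""
    return sum(value * prob for value, prob in dist.items())

# Distribution of X from the slide.
px = {
    2: Fraction(1, 6),
    4: Fraction(2, 6),
    8: Fraction(1, 6),
    1: Fraction(1, 6),
    5: Fraction(1, 6),
}
# Hypothetical independent variable Y.
py = {0: Fraction(1, 2), 3: Fraction(1, 2)}

e_x = expectation(px)
e_y = expectation(py)

# Under independence the joint probability is the product P(X_i) P(Y_j).
e_xy = sum(x * y * p * q for x, p in px.items() for y, q in py.items())

print(e_x, e_y, e_xy)
```

Here $E[X] = (2 + 2 \cdot 4 + 8 + 1 + 5)/6 = 4$, and the double sum equals $E[X]\,E[Y]$ exactly.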
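Variance
Formulas
Mean:
$$E[X] = \sum_{i=1}^{N} X_i P_X(X_i)$$
Variance:
$$\mathrm{var}(X) = E\big[(X - E[X])^2\big] = \sum_{i=1}^{N} (X_i - E[X])^2 P_X(X_i)$$
Standard deviation:
$$\sigma = \sqrt{\mathrm{var}(X)}$$
Example: $X = \{5, 3, 6, 7, 4\}$
$$E[X] = 5 \times \tfrac{1}{5} + 3 \times \tfrac{1}{5} + 6 \times \tfrac{1}{5} + 7 \times \tfrac{1}{5} + 4 \times \tfrac{1}{5} = 5$$
$$\mathrm{var}(X) = \tfrac{1}{5}\big[(5-5)^2 + (3-5)^2 + (6-5)^2 + (7-5)^2 + (4-5)^2\big] = \tfrac{1}{5}(0 + 4 + 1 + 4 + 1) = 2$$
$$\sigma = \sqrt{\mathrm{var}(X)} \approx 1.41$$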
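The worked example can be verified with NumPy. This is a small sketch; it relies on `np.var` and `np.std` using the population formula (division by $N$) by default, which matches the equal weights $\tfrac{1}{5}$ in the example.

```python
import numpy as np

data = np.array([5, 3, 6, 7, 4], dtype=np.float64)

mean = data.mean()
variance = np.mean((data - mean) ** 2)  # definition: E[(X - E[X])^2]
std = np.sqrt(variance)

print(mean, variance, round(std, 2))
print(np.var(data), np.std(data))       # same values via NumPy helpers
```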
Variance
Starting from the definition and expanding the square:
$$\begin{aligned}
\mathrm{var}(X) &= \sum_{i=1}^{N} (X_i - E[X])^2 P_X(X_i) \\
&= \sum_{i=1}^{N} \big(X_i^2 - 2X_i E[X] + E[X]^2\big) P_X(X_i) \\
&= \sum_{i=1}^{N} X_i^2 P_X(X_i) - \sum_{i=1}^{N} 2X_i E[X] P_X(X_i) + \sum_{i=1}^{N} E[X]^2 P_X(X_i) \\
&= E[X^2] - 2E[X]\sum_{i=1}^{N} X_i P_X(X_i) + E[X]^2 \\
&= E[X^2] - E[X]^2
\end{aligned}$$
Variance of a product, for independent $X$ and $Y$:
$$\begin{aligned}
\mathrm{var}(X) &= E[X^2] - E[X]^2 \\
\mathrm{var}(XY) &= E[X^2Y^2] - E[XY]^2 \\
&= E[X^2]E[Y^2] - \big(E[X]E[Y]\big)^2 \\
&= \big(\mathrm{var}(X) + E[X]^2\big)\big(\mathrm{var}(Y) + E[Y]^2\big) - E[X]^2E[Y]^2 \\
&= \mathrm{var}(X)\mathrm{var}(Y) + \mathrm{var}(X)E[Y]^2 + \mathrm{var}(Y)E[X]^2
\end{aligned}$$
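The product-variance identity can be checked exactly on small discrete distributions. This is a sketch with hypothetical distributions `px` and `py` and hypothetical helpers `expectation` and `variance`; independence is imposed by using the product of the marginal probabilities as the joint probability.

```python
from fractions import Fraction

def expectation(dist, f=lambda v: v):
    """E[f(X)] for a {value: probability} dict."""
    return sum(f(value) * prob for value, prob in dist.items())

def variance(dist):
    """var(X) = E[X^2] - E[X]^2."""
    return expectation(dist, lambda v: v * v) - expectation(dist) ** 2

# Hypothetical independent variables.
px = {-1: Fraction(1, 4), 2: Fraction(1, 2), 5: Fraction(1, 4)}
py = {0: Fraction(1, 3), 3: Fraction(2, 3)}

# Moments of the product XY under independence.
e_xy = sum(x * y * p * q for x, p in px.items() for y, q in py.items())
e_x2y2 = sum((x * y) ** 2 * p * q for x, p in px.items() for y, q in py.items())
var_xy = e_x2y2 - e_xy ** 2

rhs = (
    variance(px) * variance(py)
    + variance(px) * expectation(py) ** 2
    + variance(py) * expectation(px) ** 2
)
print(var_xy, rhs)
```

When both means are zero, the last two terms vanish and $\mathrm{var}(XY) = \mathrm{var}(X)\mathrm{var}(Y)$, which is the form used in the initialization derivations.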
Initialization Methods
Xavier Initialization
Uniform Distribution
$$X \sim U(a, b), \qquad f(x) = \frac{1}{b-a} \ \text{for } a \le x \le b, \qquad E[X] = \frac{a+b}{2}, \qquad \mathrm{var}(X) = \frac{(b-a)^2}{12}$$
[Figure: density of $U(a, b)$, constant at height $\frac{1}{b-a}$ between $a$ and $b$]
Mean:
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_a^b \frac{x}{b-a}\,dx = \frac{x^2}{2(b-a)}\Big|_a^b = \frac{b^2 - a^2}{2(b-a)} = \frac{a+b}{2}$$
Variance:
$$\begin{aligned}
\mathrm{var}(X) &= E\big[(X - E[X])^2\big] = \int_{-\infty}^{\infty} (x - E[X])^2 f(x)\,dx \\
&= \int_a^b \Big(x - \frac{a+b}{2}\Big)^2 \frac{1}{b-a}\,dx \\
&= \frac{1}{b-a}\left[\int_a^b x^2\,dx - \int_a^b 2x\,\frac{a+b}{2}\,dx + \int_a^b \Big(\frac{a+b}{2}\Big)^2 dx\right] \\
&= \frac{1}{b-a}\left[\frac{x^3}{3}\Big|_a^b - \frac{x^2(a+b)}{2}\Big|_a^b + \Big(\frac{a+b}{2}\Big)^2 x\Big|_a^b\right] \\
&= \frac{1}{b-a}\left[\frac{b^3 - a^3}{3} - \frac{(b^2 - a^2)(a+b)}{2} + \Big(\frac{a+b}{2}\Big)^2 (b-a)\right] \\
&= \frac{a^2 + ab + b^2}{3} - \frac{a^2 + 2ab + b^2}{2} + \frac{a^2 + 2ab + b^2}{4} \\
&= \frac{4(a^2 + ab + b^2) - 3(a^2 + 2ab + b^2)}{12} = \frac{(b-a)^2}{12}
\end{aligned}$$
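The closed forms for the uniform distribution can be compared against sampled data. This is a minimal sketch; the interval $[-2, 4]$, the seed, and the sample size are arbitrary choices for the illustration.

```python
import numpy as np

a, b = -2.0, 4.0  # hypothetical interval
rng = np.random.default_rng(0)
samples = rng.uniform(a, b, size=1_000_000)

mean_formula = (a + b) / 2          # (a + b) / 2
var_formula = (b - a) ** 2 / 12     # (b - a)^2 / 12

print(f"mean: sample {samples.mean():.4f}, formula {mean_formula:.4f}")
print(f"var:  sample {samples.var():.4f}, formula {var_formula:.4f}")
```

For a symmetric interval $U(-r, r)$ the same formula gives $\mathrm{var}(X) = (2r)^2/12 = r^2/3$.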
Gaussian Distribution
$$X \sim \mathcal{N}(\mu, \sigma^2), \qquad f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
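Maclaurin series
Approximates the value of $f(x)$ for values $x \approx 0$:
$$f(x) = \sum_{n=0}^{\infty} f^{(n)}(0)\,\frac{x^n}{n!} = f(0) + f'(0)\,x + \frac{f''(0)}{2!}\,x^2 + \frac{f^{(3)}(0)}{3!}\,x^3 + \cdots$$
For tanh:
$$\tanh x = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 1 - \frac{2}{e^{2x} + 1} = \frac{2}{e^{-2x} + 1} - 1$$
$$\begin{aligned}
\tanh(0) &= 0 \\
\tanh'(0) &= 1 - \tanh^2(0) = 1 \\
\tanh''(0) &= \big(1 - \tanh^2\big)'(0) = -2\tanh(0)\tanh'(0) = 0 \\
\tanh^{(3)}(0) &= \big(-2\tanh\,\tanh'\big)'(0) = -2\big(\tanh'(0)^2 + \tanh(0)\tanh''(0)\big) = -2
\end{aligned}$$
$$\tanh x = x - \frac{2x^3}{3!} + \cdots = x - \frac{x^3}{3} + \cdots$$
$$\tanh x \approx x$$
For sigmoid:
$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$
$$\begin{aligned}
\mathrm{sigmoid}(0) &= \frac{1}{2} \\
\mathrm{sigmoid}'(0) &= \mathrm{sigmoid}(0)\big(1 - \mathrm{sigmoid}(0)\big) = \frac{1}{4} \\
\mathrm{sigmoid}''(0) &= \big(\mathrm{sigmoid}\,(1 - \mathrm{sigmoid})\big)'(0) = \mathrm{sigmoid}'(0)\big(1 - 2\,\mathrm{sigmoid}(0)\big) = 0
\end{aligned}$$
$$\mathrm{sigmoid}(x) = \frac{1}{2} + \frac{x}{4} + \cdots$$
$$\mathrm{sigmoid}(x) \approx \frac{1}{2} + \frac{x}{4}$$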
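The quality of both first-order approximations near zero can be checked numerically. This is a small sketch; the evaluation point $x = 0.1$ is an arbitrary choice, and the cubic term $-x^3/3$ for tanh follows from $\tanh^{(3)}(0) = -2$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.1  # hypothetical point close to 0

tanh_exact = math.tanh(x)
tanh_first = x                   # first-order Maclaurin approximation
tanh_third = x - x ** 3 / 3      # including the cubic term

sig_exact = sigmoid(x)
sig_first = 0.5 + x / 4          # first-order Maclaurin approximation

print(f"tanh:    exact {tanh_exact:.6f}, x {tanh_first:.6f}, x - x^3/3 {tanh_third:.6f}")
print(f"sigmoid: exact {sig_exact:.6f}, 1/2 + x/4 {sig_first:.6f}")
```

The approximations degrade as $|x|$ grows, which is why the derivations that follow assume small pre-activations.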
Initialization Methods
Xavier Initialization
[Figure: a single neuron with inputs $x_0, x_1, \dots, x_n$, weights $w_0, w_1, \dots, w_n$, pre-activation $z_i$ and output $a_i = \text{activation}(z_i)$]
Assumptions: $E[X] = 0$, $E[W] = 0$, $b = 0$, with iid inputs and iid weights.
Identities used, for independent $X$ and $Y$:
$$E[XY] = E[X]E[Y]$$
$$\mathrm{var}(XY) = \mathrm{var}(X)\mathrm{var}(Y) + \mathrm{var}(X)E[Y]^2 + \mathrm{var}(Y)E[X]^2$$
Variance of the pre-activation:
$$z_i = x_1 w_1 + \cdots + x_n w_n + b$$
$$\mathrm{var}(z_i) = \mathrm{var}(x_1 w_1 + \cdots + x_n w_n + b) = n\,\mathrm{var}(x_i w_i) = n\,\mathrm{var}(x_i)\,\mathrm{var}(w_i)$$
With activation = tanh, $a_i = \tanh(z_i) \approx z_i$, so $\mathrm{var}(a_i) = \mathrm{var}(z_i)$. Requiring the output variance to match the input variance, $\mathrm{var}(x_i) \approx \mathrm{var}(a_i)$, gives
$$n\,\mathrm{var}(w_i) \approx 1 \quad\Rightarrow\quad \mathrm{var}(w_i) \approx \frac{1}{n}$$
Uniform distribution: for $w_i \sim U(-r, r)$, $\mathrm{var}(w_i) = \frac{(b-a)^2}{12} = \frac{r^2}{3}$. Setting $\frac{r^2}{3} = \frac{1}{n}$ gives $r = \sqrt{3/n}$:
$$W_i \sim U\left(-\sqrt{\frac{3}{n}},\ \sqrt{\frac{3}{n}}\right)$$
Gaussian distribution: for $w_i \sim \mathcal{N}(0, \sigma^2)$, $\sigma^2 = \frac{1}{n}$ and $\sigma = \frac{1}{\sqrt{n}}$:
$$W_i \sim \mathcal{N}\left(0,\ \frac{1}{n}\right)$$
Xavier initialization, activation = tanh:
$$W_{ij} \sim U\left(-\sqrt{\frac{3}{n}},\ \sqrt{\frac{3}{n}}\right) \qquad \text{or} \qquad W_{ij} \sim \mathcal{N}\left(0,\ \frac{1}{n}\right)$$
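The tanh rule can be sketched in NumPy and checked on one layer. This follows the fan-in form $\mathrm{var}(w) = 1/n$ derived on the slides; library implementations of Xavier (Glorot) initialization commonly use both fan-in and fan-out instead. The function names, layer width, and input scale are assumptions for the illustration.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """W ~ U(-sqrt(3/n), sqrt(3/n)) with n = n_in, so var(W) = 1/n."""
    r = np.sqrt(3.0 / n_in)
    return rng.uniform(-r, r, size=(n_in, n_out))

def xavier_normal(n_in, n_out, rng):
    """W ~ N(0, 1/n) with n = n_in."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
n = 512
# Zero-mean inputs with a small scale keep tanh in its near-linear range.
x = rng.normal(0.0, 0.1, size=(1000, n))

ratios = {}
for init in (xavier_uniform, xavier_normal):
    w = init(n, n, rng)
    z = x @ w            # pre-activations (bias b = 0)
    a = np.tanh(z)       # activations
    ratios[init.__name__] = z.var() / x.var()
    print(f"{init.__name__}: var(x)={x.var():.4f} var(z)={z.var():.4f} var(a)={a.var():.4f}")
```

Both variants keep $\mathrm{var}(z)$ close to $\mathrm{var}(x)$, so the signal neither shrinks nor grows as it passes through the layer.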
Initialization Methods
Xavier Initialization, activation = sigmoid
Under the same assumptions ($E[X] = 0$, $E[W] = 0$, $b = 0$, iid inputs and weights):
$$z_i = x_1 w_1 + \cdots + x_n w_n + b$$
$$\mathrm{var}(z_i) = n\,\mathrm{var}(x_i w_i) = n\,\mathrm{var}(x_i)\,\mathrm{var}(w_i)$$
With activation = sigmoid, $a_i = \mathrm{sigmoid}(z_i) \approx \frac{1}{2} + \frac{z_i}{4}$, so
$$16\,\mathrm{var}(a_i) = \mathrm{var}(z_i)$$
Requiring $\mathrm{var}(x_i) \approx \mathrm{var}(a_i)$ gives
$$n\,\mathrm{var}(w_i) \approx 16 \quad\Rightarrow\quad \mathrm{var}(w_i) \approx \frac{16}{n}$$
Uniform distribution: for $w_i \sim U(-r, r)$, $\mathrm{var}(w_i) = \frac{r^2}{3}$. Setting $\frac{r^2}{3} = \frac{16}{n}$ gives $r = 4\sqrt{3/n}$:
$$W_i \sim U\left(-4\sqrt{\frac{3}{n}},\ 4\sqrt{\frac{3}{n}}\right)$$
Gaussian distribution: for $w_i \sim \mathcal{N}(0, \sigma^2)$, $\sigma^2 = \frac{16}{n}$:
$$W_i \sim \mathcal{N}\left(0,\ \frac{16}{n}\right)$$
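The sigmoid variant differs from the tanh rule only by the factor 16 in the variance. The sketch below checks one layer under the slide's linear approximation; the function names, layer width, and the small input scale (chosen so that sigmoid stays near-linear) are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_xavier_uniform(n_in, n_out, rng):
    """W ~ U(-4*sqrt(3/n), 4*sqrt(3/n)), so var(W) = 16/n."""
    r = 4.0 * np.sqrt(3.0 / n_in)
    return rng.uniform(-r, r, size=(n_in, n_out))

def sigmoid_xavier_normal(n_in, n_out, rng):
    """W ~ N(0, 16/n)."""
    return rng.normal(0.0, 4.0 / np.sqrt(n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
n = 512
x = rng.normal(0.0, 0.05, size=(1000, n))  # zero-mean, small-scale inputs

ratios = {}
for init in (sigmoid_xavier_uniform, sigmoid_xavier_normal):
    w = init(n, n, rng)
    z = x @ w
    a = sigmoid(z)
    ratios[init.__name__] = a.var() / x.var()
    print(f"{init.__name__}: var(z)/var(x)={z.var() / x.var():.2f} var(a)/var(x)={ratios[init.__name__]:.3f}")
```

The pre-activation variance is about 16 times the input variance, and the sigmoid's slope of $\tfrac{1}{4}$ at zero brings the activation variance back to roughly the input variance.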
Initialization Methods
Kaiming He Initialization
Under the same assumptions ($E[X] = 0$, $E[W] = 0$, $b = 0$, iid inputs and weights):
$$z_i = x_1 w_1 + \cdots + x_n w_n + b$$
$$\mathrm{var}(z_i) = n\,\mathrm{var}(x_i w_i) = n\,\mathrm{var}(x_i)\,\mathrm{var}(w_i)$$
With activation = relu, $a_i = \max(0, z_i)$ zeroes out half of a symmetric zero-mean $z_i$, which the slides summarize as
$$2\,\mathrm{var}(a_i) = \mathrm{var}(z_i)$$
(more precisely, the second moment satisfies $E[a_i^2] = \frac{1}{2}\mathrm{var}(z_i)$, and it is this second moment that enters the variance of the next layer).
Requiring $\mathrm{var}(x_i) \approx \mathrm{var}(a_i)$ gives
$$n\,\mathrm{var}(w_i) \approx 2 \quad\Rightarrow\quad \mathrm{var}(w_i) \approx \frac{2}{n}$$
He Initialization, activation = relu
Uniform distribution: for $w_i \sim U(-r, r)$, $\mathrm{var}(w_i) = \frac{r^2}{3}$. Setting $\frac{r^2}{3} = \frac{2}{n}$ gives $r = \sqrt{6/n}$:
$$W_i \sim U\left(-\sqrt{\frac{6}{n}},\ \sqrt{\frac{6}{n}}\right)$$
Gaussian distribution: for $w_i \sim \mathcal{N}(0, \sigma^2)$, $\sigma^2 = \frac{2}{n}$:
$$W_i \sim \mathcal{N}\left(0,\ \frac{2}{n}\right)$$
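The He rule can be checked by pushing data through a stack of ReLU layers and watching the pre-activation variance. This is a minimal sketch under assumptions: the function names, the width of 512, the depth of 5, and the standard normal inputs are chosen for the illustration only.

```python
import numpy as np

def he_normal(n_in, n_out, rng):
    """W ~ N(0, 2/n) with n = n_in."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def he_uniform(n_in, n_out, rng):
    """W ~ U(-sqrt(6/n), sqrt(6/n)), so var(W) = 2/n."""
    r = np.sqrt(6.0 / n_in)
    return rng.uniform(-r, r, size=(n_in, n_out))

rng = np.random.default_rng(0)
n, depth = 512, 5
a = rng.normal(0.0, 1.0, size=(1000, n))  # zero-mean, unit-variance inputs

z_vars = []
for layer in range(1, depth + 1):
    z = a @ he_normal(n, n, rng)   # pre-activations (bias b = 0)
    a = np.maximum(0.0, z)         # ReLU
    z_vars.append(z.var())
    print(f"layer {layer}: var(z) = {z.var():.3f}, E[a^2] = {np.mean(a ** 2):.3f}")

# The uniform variant has the same variance 2/n.
w_u = he_uniform(n, n, rng)
print(f"var of he_uniform weights * n = {w_u.var() * n:.3f}")
```

The pre-activation variance stays roughly constant from layer to layer, because the factor 2 in $\mathrm{var}(w) = 2/n$ compensates for ReLU discarding the negative half of $z$.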
Summary
Recommendation
➢ Data Preparation: data normalization to [-1, 1] or z-score
➢ Model (Network) Construction: ReLU activation, batch norm
   https://towardsdatascience.com/the-dying-relu-problem-clearly-explained-42d0c54e0d24
➢ Initialization
   https://www.deeplearning.ai/ai-notes/initialization/index.html
➢ Loss Function Selection
➢ Optimizer Selection: Adam