Chapter 2 - 2: Shallow Neural Network
• If the learning rate is too large, gradient descent overshoots the minimum and can diverge.
• If the learning rate is too small, gradient descent needs too many epochs to converge and is more easily trapped in local minima.
• If the features are scaled to a common range, gradient descent converges faster and the weights for different features stay on comparable scales instead of some shrinking toward very small values.
$$x_{j,\text{std}} = \frac{x_j - \mu_j}{\sigma_j}$$
where $\mu_j$ is the sample mean of feature $x_j$ and $\sigma_j$ its standard deviation.
• After standardization, each feature has zero mean and unit variance.
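A minimal NumPy sketch of this standardization (the example matrix X and its values are made up for illustration):

```python
import numpy as np

# X: (m, n) matrix of m samples and n features (hypothetical example data)
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

mu = X.mean(axis=0)        # sample mean of each feature
sigma = X.std(axis=0)      # sample standard deviation of each feature
X_std = (X - mu) / sigma   # standardized features: zero mean, unit variance

print(X_std.mean(axis=0))  # ~[0, 0]
print(X_std.std(axis=0))   # ~[1, 1]
```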
$$\mathcal{J}(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)} \log \hat{y}^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big)\,\Big]$$
• Goal:
• Find the parameters $w$ and $b$ that minimize the cost function (total loss) $\mathcal{J}(w, b)$
[Figure: surface plot of the cost $\mathcal{J}(w, b)$ over the $(w, b)$ plane]
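A short NumPy sketch of evaluating this cost (the array names Y and Y_hat, the epsilon clipping, and the example values are assumptions, not from the slides):

```python
import numpy as np

def cross_entropy_cost(Y_hat, Y, eps=1e-12):
    """Binary cross-entropy cost averaged over m examples.

    Y_hat: (1, m) predicted probabilities, Y: (1, m) labels in {0, 1}.
    eps avoids log(0); it is a common safeguard, not from the slides.
    """
    m = Y.shape[1]
    Y_hat = np.clip(Y_hat, eps, 1 - eps)
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    return losses.sum() / m

# hypothetical example
Y = np.array([[1, 0, 1]])
Y_hat = np.array([[0.9, 0.2, 0.7]])
print(cross_entropy_cost(Y_hat, Y))   # average loss over the 3 examples
```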
• A graph that depicts all the computations required to evaluate a function along the forward pass
• For example: $J(x, y, z) = 4(x + yz)$
[Computation graph: $u = y \cdot z$, then $v = x + u$, then $J = 4v$]
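A tiny sketch of this graph evaluated forward and then differentiated backward with the chain rule (the input values are made up):

```python
# Forward pass through the graph J(x, y, z) = 4(x + y*z)
x, y, z = 5.0, 3.0, 2.0
u = y * z          # u = 6
v = x + u          # v = 11
J = 4 * v          # J = 44

# Backward pass: propagate dJ/d(.) from the output back to the inputs
dJ_dv = 4.0                # J = 4v
dJ_du = dJ_dv * 1.0        # v = x + u
dJ_dx = dJ_dv * 1.0
dJ_dy = dJ_du * z          # u = y*z
dJ_dz = dJ_du * y

print(J, dJ_dx, dJ_dy, dJ_dz)   # 44.0 4.0 8.0 12.0
```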
$z = w^T x + b$
$\hat{y} = a = \sigma(z)$
$\mathcal{L}(a, y) = -\big(y \log(a) + (1 - y) \log(1 - a)\big)$
[Computation graph: inputs $x_1, x_2$ and parameters $w_1, w_2, b$ feed $z = w_1 x_1 + w_2 x_2 + b$, then $a = \sigma(z)$, then $\mathcal{L}(a, y)$]
• $\dfrac{\partial \mathcal{L}(a, y)}{\partial a} = -\dfrac{y}{a} + \dfrac{1 - y}{1 - a}$
• $\dfrac{\partial \mathcal{L}(a, y)}{\partial z} = \dfrac{\partial \mathcal{L}(a, y)}{\partial a} \cdot \dfrac{\partial a}{\partial z} = \left(-\dfrac{y}{a} + \dfrac{1 - y}{1 - a}\right) a(1 - a) = a - y$
• $\dfrac{\partial \mathcal{L}(a, y)}{\partial w_1} = \dfrac{\partial \mathcal{L}(a, y)}{\partial a} \cdot \dfrac{\partial a}{\partial z} \cdot \dfrac{\partial z}{\partial w_1} = \left(-\dfrac{y}{a} + \dfrac{1 - y}{1 - a}\right) a(1 - a)\, x_1 = x_1 (a - y) = x_1 \dfrac{\partial \mathcal{L}(a, y)}{\partial z}$
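A sketch of one gradient-descent step for this single-example logistic-regression unit, using dL/dz = a - y from the derivation above (the feature values, initial parameters, and learning rate are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical example: 2 features, 1 training example
x = np.array([1.5, -0.5])
y = 1.0
w = np.array([0.01, -0.02])
b = 0.0
lr = 0.1                   # learning rate (illustrative value)

# forward pass
z = w @ x + b
a = sigmoid(z)

# backward pass, using dL/dz = a - y
dz = a - y
dw = dz * x                # dL/dw_j = x_j * (a - y)
db = dz

# gradient-descent update
w = w - lr * dw
b = b - lr * db
```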
• 2-layer NN (1 hidden layer)
• Input layer: $a^{[0]} = x = (x_1, x_2, x_3)$
• Hidden layer (Layer 1): $a^{[1]} = \begin{bmatrix} a_1^{[1]} \\ a_2^{[1]} \\ a_3^{[1]} \\ a_4^{[1]} \end{bmatrix}$
• Output layer (Layer 2): $\hat{y} = a^{[2]}$
[Figure: network diagram with an input layer ($x_1, x_2, x_3$), a hidden layer of four units, and an output layer producing $\hat{y}$]
[Figure: each unit carries out two computations in sequence]
$z = w^T x + b$
$a = \sigma(z)$
with the output unit producing $a = \hat{y}$.
Given input $x$:
$z^{[1]} = W^{[1]} x + b^{[1]} \qquad a^{[1]} = \sigma(z^{[1]})$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} \qquad a^{[2]} = \sigma(z^{[2]})$
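A minimal NumPy forward pass for a single example, following these equations (the 3-4-1 layer sizes match the figure above; the random weights and input values are only illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# layer sizes: 3 inputs, 4 hidden units, 1 output
W1 = rng.standard_normal((4, 3)) * 0.01   # W[1]: (4, 3)
b1 = np.zeros((4, 1))                     # b[1]: (4, 1)
W2 = rng.standard_normal((1, 4)) * 0.01   # W[2]: (1, 4)
b2 = np.zeros((1, 1))                     # b[2]: (1, 1)

x = np.array([[0.5], [-1.2], [3.0]])      # one example, shape (3, 1)

z1 = W1 @ x + b1        # (4, 1)
a1 = sigmoid(z1)        # (4, 1)
z2 = W2 @ a1 + b2       # (1, 1)
a2 = sigmoid(z2)        # (1, 1) = y_hat
```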
Looping over the $m$ training examples:
for i = 1 to m:
    $z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$
    $a^{[1](i)} = \sigma(z^{[1](i)})$
    $z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$
    $a^{[2](i)} = \sigma(z^{[2](i)})$
Vectorized, with $X = \begin{bmatrix} x^{(1)} & x^{(2)} & \dots & x^{(m)} \end{bmatrix}$ and $A^{[1]} = \begin{bmatrix} a^{[1](1)} & a^{[1](2)} & \dots & a^{[1](m)} \end{bmatrix}$:
$Z^{[1]} = W^{[1]} X + b^{[1]}$
$A^{[1]} = \sigma(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
$A^{[2]} = \sigma(Z^{[2]})$
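A vectorized counterpart in NumPy: the m examples are stacked as columns of X, so one matrix product per layer replaces the loop (shapes and values are illustrative):

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

rng = np.random.default_rng(1)

m = 5                               # number of training examples
X = rng.standard_normal((3, m))     # columns are x(1) ... x(m)

W1 = rng.standard_normal((4, 3)) * 0.01
b1 = np.zeros((4, 1))
W2 = rng.standard_normal((1, 4)) * 0.01
b2 = np.zeros((1, 1))

Z1 = W1 @ X + b1        # (4, m); b1 is broadcast over the m columns
A1 = sigmoid(Z1)        # (4, m)
Z2 = W2 @ A1 + b2       # (1, m)
A2 = sigmoid(Z2)        # (1, m): one prediction per example
```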
• Sigmoid: $a = \dfrac{1}{1 + e^{-z}}$, derivative $g'(z) = a(1 - a)$
• tanh: $a = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, derivative $g'(z) = 1 - a^{2}$
• ReLU: $a = \max(0, z)$, derivative $g'(z) = 0$ if $z < 0$, $1$ if $z \ge 0$
• Leaky ReLU: $a = \max(0.01z, z)$, derivative $g'(z) = 0.01$ if $z < 0$, $1$ if $z \ge 0$
[Figure: plots of the sigmoid, tanh, ReLU, and Leaky ReLU activations as functions of $z$]
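A NumPy sketch of these activations and their derivatives (the convention g'(0) = 1 for ReLU and Leaky ReLU follows the table above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    a = sigmoid(z)
    return a * (1 - a)          # a(1 - a)

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1 - np.tanh(z) ** 2  # 1 - a^2

def relu(z):
    return np.maximum(0, z)

def d_relu(z):
    return np.where(z < 0, 0.0, 1.0)

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

def d_leaky_relu(z):
    return np.where(z < 0, 0.01, 1.0)
```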
1. Shallow neural network: Why non-linear activation functions?
• If every layer used a linear (identity) activation, the composition of layers would itself be a linear function of the input, so the hidden layer would add no representational power; a non-linear $g(z)$ is therefore needed.
Backpropagation through the 2-layer network (one training example, with $*$ denoting the element-wise product):
$dz^{[2]} = a^{[2]} - y$
$dW^{[2]} = dz^{[2]} \, a^{[1]T}$
$db^{[2]} = dz^{[2]}$
$dz^{[1]} = W^{[2]T} dz^{[2]} * g^{[1]\prime}(z^{[1]})$
$dW^{[1]} = dz^{[1]} \, x^{T}$
$db^{[1]} = dz^{[1]}$
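A sketch of these gradients vectorized over m examples, assuming sigmoid hidden activations (the slides leave g[1] general) and the usual 1/m averaging of the cost:

```python
import numpy as np

def d_sigmoid_from_a(A):
    # derivative of sigmoid expressed via its output: g'(z) = a(1 - a)
    return A * (1 - A)

def backward(X, Y, W2, A1, A2):
    """Gradients of the cost for a 2-layer network with sigmoid activations.

    X: (n_x, m), Y: (1, m), A1: (n_h, m), A2: (1, m), W2: (1, n_h).
    """
    m = X.shape[1]
    dZ2 = A2 - Y                                   # (1, m)
    dW2 = (dZ2 @ A1.T) / m                         # (1, n_h)
    db2 = dZ2.sum(axis=1, keepdims=True) / m       # (1, 1)
    dZ1 = (W2.T @ dZ2) * d_sigmoid_from_a(A1)      # (n_h, m), element-wise product
    dW1 = (dZ1 @ X.T) / m                          # (n_h, n_x)
    db1 = dZ1.sum(axis=1, keepdims=True) / m       # (n_h, 1)
    return dW1, db1, dW2, db2
```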
1. Shallow neural network: Vectorizing gradient descent
• W[1] = np.random.randn(2, 2) * 0.01
• Small random values are suggested!
• If the initial weights are too large, Z[1] = W[1]X + b[1] will also be large, so a[1] = g[1](Z[1]) lands in the flat (saturated) regions of sigmoid or tanh, the gradients there are close to zero, and gradient descent becomes very slow.
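A sketch of full parameter initialization along these lines (the layer sizes, the zero biases, and the helper name initialize_parameters are assumptions; the 0.01 scale is from the slide):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01, seed=0):
    """Small random weights and zero biases for a 2-layer network.

    n_x: input features, n_h: hidden units, n_y: output units.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((n_h, n_x)) * scale   # small values keep sigmoid/tanh out of the flat regions
    b1 = np.zeros((n_h, 1))
    W2 = rng.standard_normal((n_y, n_h)) * scale
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2

W1, b1, W2, b2 = initialize_parameters(n_x=3, n_h=4, n_y=1)
```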