Neural Network Training
Neural Network Representation

[Figure: a neural network computing $h_w(x)$]
Neuron model: Logistic unit
[Figure: a single neuron with inputs $x_1, x_2, x_3$]

$$z = w^T x + b, \qquad a = \sigma(z) = \hat{y}$$

with input vector $x = (x_0, x_1, x_2, x_3)^T$ and weight vector $w = (w_0, w_1, w_2, w_3)^T$, where the bias is absorbed into the weights by fixing $x_0 = 1$ and $w_0 = b$.
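As a quick illustration, here is this logistic unit written out in NumPy; the input and weight values below are made up for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.5, -0.3, 2.0])   # inputs x1, x2, x3 (illustrative)
w = np.array([0.2, 0.8, -0.5])   # weights w1, w2, w3 (illustrative)
b = 0.1                          # bias

z = w @ x + b                    # z = w^T x + b
a = sigmoid(z)                   # a = sigma(z) = y_hat
print(a)
```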
Neural Network Representation
[Figure: the single logistic unit on the left becomes one node of a network mapping $x_1, x_2, x_3$ to $\hat{y}$]

$$z = w^T x + b, \qquad a = \sigma(z)$$
Neural Network Representation
For a first hidden layer with four units, each unit $i$ computes

$$z_i^{[1]} = w_i^{[1]T} x + b_i^{[1]}, \qquad a_i^{[1]} = \sigma(z_i^{[1]}), \qquad i = 1, \dots, 4$$

where, for example, $w_1^{[1]} = (w_{11}^{[1]}, w_{12}^{[1]}, w_{13}^{[1]})^T$. Stacking the four weight vectors as rows gives the layer's parameters:

$$W^{[1]} = \begin{bmatrix} w_{11}^{[1]} & w_{12}^{[1]} & w_{13}^{[1]} \\ w_{21}^{[1]} & w_{22}^{[1]} & w_{23}^{[1]} \\ w_{31}^{[1]} & w_{32}^{[1]} & w_{33}^{[1]} \\ w_{41}^{[1]} & w_{42}^{[1]} & w_{43}^{[1]} \end{bmatrix}, \qquad x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad b^{[1]} = \begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \\ b_3^{[1]} \\ b_4^{[1]} \end{bmatrix}$$
Parameters $W^{[l]}$ and $b^{[l]}$

$$z^{[1]} = W^{[1]} x + b^{[1]}: \quad (n^{[1]}, 1) = (n^{[1]}, n^{[0]})\,(n^{[0]}, 1) + (n^{[1]}, 1), \quad \text{e.g. } (3,1) = (3,2)(2,1) + (3,1)$$

$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}: \quad (5,1) = (5,3)(3,1) + (5,1)$$

In general: $W^{[l]}: (n^{[l]}, n^{[l-1]})$ and $b^{[l]}: (n^{[l]}, 1)$, and likewise $dW^{[l]}: (n^{[l]}, n^{[l-1]})$ and $db^{[l]}: (n^{[l]}, 1)$.
Vectorizing across multiple examples
For a single example $x$:

$$z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = \sigma(z^{[1]}), \quad z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = \sigma(z^{[2]})$$

Stacking all $m$ examples as columns of $X$:

$$Z^{[1]} = W^{[1]} X + b^{[1]}, \quad A^{[1]} = \sigma(Z^{[1]}), \quad Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[2]} = \sigma(Z^{[2]})$$

where $A^{[1]} = \left[ a^{[1](1)} \; a^{[1](2)} \; \cdots \; a^{[1](m)} \right]$.
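A minimal NumPy sketch of this vectorized forward pass, assuming a small two-layer network with sigmoid activations; the layer sizes and random data below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n0, n1, n2, m = 3, 4, 1, 5        # layer sizes and number of examples (illustrative)

X = rng.normal(size=(n0, m))      # each column is one training example
W1 = rng.normal(size=(n1, n0)) * 0.01
b1 = np.zeros((n1, 1))
W2 = rng.normal(size=(n2, n1)) * 0.01
b2 = np.zeros((n2, 1))

Z1 = W1 @ X + b1                  # (n1, m); b1 broadcasts across the m columns
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2                 # (n2, m)
A2 = sigmoid(Z2)                  # one prediction per column of X
print(A2.shape)                   # (1, 5)
```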
Vectorized implementation
$$z^{[1]} = W^{[1]} x + b^{[1]}: \quad (n^{[1]}, 1) = (n^{[1]}, n^{[0]})\,(n^{[0]}, 1) + (n^{[1]}, 1)$$

$$Z^{[1]} = W^{[1]} X + b^{[1]}: \quad (n^{[1]}, m) = (n^{[1]}, n^{[0]})\,(n^{[0]}, m) + (n^{[1]}, 1), \text{ with } b^{[1]} \text{ broadcast across the } m \text{ columns}$$

Shapes: $z^{[l]}, a^{[l]}: (n^{[l]}, 1)$; $Z^{[l]}, A^{[l]}: (n^{[l]}, m)$; for $l = 0$, $A^{[0]} = X: (n^{[0]}, m)$; $dZ^{[l]}, dA^{[l]}: (n^{[l]}, m)$.
Forward propagation
Repeating the two-step layer computation through all $L$ layers produces the network's output $h_w(x)$:

$$A^{[L]} = g^{[L]}(Z^{[L]}) = \hat{Y}$$
Why do you need non-linear activation functions?
Applying a linear function to another linear function yields a function that is still linear in the original input, so stacking linear layers adds no representational power. This loses much of what makes a neural network useful, since the output we are trying to predict often has a non-linear relationship with the inputs. Without non-linear activation functions, the network simply performs linear regression; see the sketch below.
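A small NumPy check of this argument: two layers with identity (linear) activations collapse exactly to a single linear map. All values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))

W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))

# Two "layers" with identity activations...
H = W1 @ X + b1
Y = W2 @ H + b2

# ...are exactly one linear map: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
print(np.allclose(Y, W @ X + b))  # True: the extra layer added no expressive power
```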
Credit to MIT’s Intro to Deep Learning
Derivatives of activation functions
Sigmoid activation function:

$$g(z) = \frac{1}{1 + e^{-z}}, \qquad g'(z) = g(z)\,(1 - g(z))$$

Tanh activation function:

$$g(z) = \tanh(z), \qquad g'(z) = 1 - \tanh^2(z)$$
Derivative of ReLU and Leaky ReLU
[Figure: graphs of $a = g(z)$ for ReLU and Leaky ReLU]

ReLU: $g(z) = \max(0, z)$

$$g'(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \\ \text{undefined} & \text{if } z = 0 \end{cases}$$

Leaky ReLU: $g(z) = \max(0.01z,\; z)$

$$g'(z) = \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \\ \text{undefined} & \text{if } z = 0 \end{cases}$$

For simplicity, take $g'(0) = 1$:

$$\text{ReLU: } g'(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \ge 0 \end{cases} \qquad \text{Leaky ReLU: } g'(z) = \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z \ge 0 \end{cases}$$
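The activation functions and derivatives above, written as NumPy helpers; this sketch uses the simplified convention $g'(0) = 1$ from the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # g'(z) = g(z)(1 - g(z))

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2         # g'(z) = 1 - tanh^2(z)

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z >= 0).astype(float)        # convention: g'(0) = 1

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

def d_leaky_relu(z):
    return np.where(z >= 0, 1.0, 0.01)

z = np.array([-2.0, 0.0, 3.0])
print(d_relu(z), d_leaky_relu(z))        # [0. 1. 1.] [0.01 1. 1.]
```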
Loss function
Logistic regression:

$$J(w) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log h_w(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_w(x^{(i)})\right) \right] + \frac{\lambda}{2n} \sum_{j=1}^{d} w_j^2$$

Neural networks (with $K$ output units and $s_l$ units in layer $l$):

$$J(w) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\left(h_w(x^{(i)})\right)_k + (1 - y_k^{(i)}) \log\left(1 - (h_w(x^{(i)}))_k\right) \right] + \frac{\lambda}{2n} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (w_{ji}^{[l]})^2$$

Training minimizes this cost: $\min_w J(w)$
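A sketch of the regularized cross-entropy cost above in NumPy; the helper name `cost`, the toy values, and the `eps` guard against `log(0)` are illustrative, and `lam` plays the role of the regularization strength $\lambda$:

```python
import numpy as np

def cost(Y, Y_hat, weights, lam, n):
    """Regularized cross-entropy following the slide's J(w) (a sketch)."""
    eps = 1e-12  # numerical guard so log() never sees exactly 0
    data_term = -np.sum(Y * np.log(Y_hat + eps)
                        + (1 - Y) * np.log(1 - Y_hat + eps)) / n
    reg_term = (lam / (2 * n)) * sum(np.sum(W ** 2) for W in weights)
    return data_term + reg_term

Y = np.array([[1, 0, 1]])               # labels for n = 3 examples
Y_hat = np.array([[0.9, 0.2, 0.7]])     # network outputs
W1 = np.array([[0.5, -0.3]])            # weights to regularize
print(cost(Y, Y_hat, [W1], lam=0.1, n=3))
```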
Computation graph

With $a = 5$, $b = 3$, $c = 2$, the function $J = 3(a + bc)$ is computed in three steps:

$$u = bc = 6, \qquad v = a + u = 11, \qquad J = 3v = 33$$
Derivatives with a Computation Graph
Working backward through the same graph ($a = 5$, $b = 3$, $c = 2$, $u = bc$, $v = a + u$, $J = 3v$):

$$\frac{dJ}{dv} = 3, \qquad \frac{dJ}{da} = \frac{dJ}{dv}\frac{dv}{da} = 3 \cdot 1 = 3, \qquad \frac{dJ}{du} = \frac{dJ}{dv}\frac{dv}{du} = 3 \cdot 1 = 3$$

$$\frac{dJ}{db} = \frac{dJ}{dv}\frac{dv}{du}\frac{du}{db} = 3 \cdot 1 \cdot c = 6, \qquad \frac{dJ}{dc} = \frac{dJ}{dv}\frac{dv}{du}\frac{du}{dc} = 3 \cdot 1 \cdot b = 9$$
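These hand-derived gradients can be verified numerically with centered differences, which also previews the gradient-checking idea used later; a minimal sketch using the example's values:

```python
def J(a, b, c):
    u = b * c          # forward pass through the computation graph
    v = a + u
    return 3 * v

a, b, c, eps = 5.0, 3.0, 2.0, 1e-6
dJ_db = (J(a, b + eps, c) - J(a, b - eps, c)) / (2 * eps)
dJ_dc = (J(a, b, c + eps) - J(a, b, c - eps)) / (2 * eps)
print(dJ_db, dJ_dc)    # ~6.0 (= 3*c) and ~9.0 (= 3*b), matching the chain rule
```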
Logistic regression derivatives
Forward:

$$z = w^T x + b, \qquad \hat{y} = a = \sigma(z)$$

$$\mathcal{L}(a, y) = -\left( y \log(a) + (1 - y) \log(1 - a) \right)$$

Backward propagation, using the shorthand $dz = \frac{\partial \mathcal{L}}{\partial z}$, $dw_1 = \frac{\partial \mathcal{L}}{\partial w_1}$, and so on:

$$\frac{\partial \mathcal{L}}{\partial a} = -\left( y \times \frac{1}{a} - (1 - y) \times \frac{1}{1 - a} \right) = -\frac{y}{a} + \frac{1 - y}{1 - a}$$

$$\frac{da}{dz} = a(1 - a)$$

$$\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial a} \times \frac{da}{dz} = \left( -\frac{y}{a} + \frac{1 - y}{1 - a} \right) \times a(1 - a) = a - y \quad (\text{chain rule})$$

$$\frac{\partial \mathcal{L}}{\partial w_1} = x_1 \, dz, \qquad \frac{\partial \mathcal{L}}{\partial w_2} = x_2 \, dz, \qquad \frac{\partial \mathcal{L}}{\partial b} = dz$$

Gradient descent then updates the parameters:

$$w_1 := w_1 - \alpha \, dw_1, \qquad w_2 := w_2 - \alpha \, dw_2, \qquad b := b - \alpha \, db$$
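Putting the pieces together, one gradient-descent step for logistic regression on a single example might look like this in NumPy; the data and `alpha` (learning rate) values are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])   # one training example (illustrative)
y = 1.0                    # its label
w = np.array([0.1, -0.2])  # current parameters
b = 0.0
alpha = 0.1                # learning rate

a = sigmoid(w @ x + b)     # forward: y_hat
dz = a - y                 # dL/dz = a - y (the chain-rule result above)
dw = x * dz                # dL/dw_i = x_i * dz
db = dz                    # dL/db = dz

w = w - alpha * dw         # w_i := w_i - alpha * dw_i
b = b - alpha * db         # b := b - alpha * db
```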
Forward and Backward propagation
Forward propagation:

$$Z^{[1]} = W^{[1]} X + b^{[1]}, \quad A^{[1]} = g^{[1]}(Z^{[1]}), \quad Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[2]} = g^{[2]}(Z^{[2]}), \quad \dots, \quad A^{[L]} = g^{[L]}(Z^{[L]}) = \hat{Y}$$

Backward propagation:

$$dZ^{[L]} = A^{[L]} - Y$$
$$dW^{[L]} = \frac{1}{m} \, dZ^{[L]} A^{[L-1]T}$$
$$db^{[L]} = \frac{1}{m} \, \text{np.sum}(dZ^{[L]}, \text{axis}=1, \text{keepdims}=\text{True})$$
$$dZ^{[L-1]} = W^{[L]T} dZ^{[L]} * g^{[L-1]\prime}(Z^{[L-1]})$$
$$\vdots$$
$$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$$
$$dW^{[1]} = \frac{1}{m} \, dZ^{[1]} A^{[0]T} \quad (A^{[0]} = X)$$
$$db^{[1]} = \frac{1}{m} \, \text{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims}=\text{True})$$
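As a sketch, the equations above translate line by line into NumPy for a two-layer network with sigmoid activations; the function name and shapes are illustrative, not a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_step(X, Y, W1, b1, W2, b2, alpha=0.1):
    """One forward/backward pass plus a gradient-descent update (a sketch)."""
    m = X.shape[1]
    # Forward propagation
    Z1 = W1 @ X + b1
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)                        # A^[L] = Y_hat
    # Backward propagation
    dZ2 = A2 - Y                            # dZ^[L] = A^[L] - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)      # W^[2]T dZ^[2] * g'(Z^[1]), sigmoid g
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    # Gradient-descent update
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
    return W1, b1, W2, b2
```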
Forward propagation for layer l
Input: $a^{[l-1]}$. Output: $a^{[l]}$, plus a cache of $(z^{[l]})$ for the backward pass.

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \qquad a^{[l]} = g^{[l]}(z^{[l]})$$

Chaining this step from $l = 1$ to $L$ reproduces the full forward pass: $Z^{[1]} = W^{[1]} X + b^{[1]}$, $A^{[1]} = g^{[1]}(Z^{[1]})$, $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$, $A^{[2]} = g^{[2]}(Z^{[2]})$, ..., $A^{[L]} = g^{[L]}(Z^{[L]}) = \hat{Y}$.
Backward propagation for layer l
Input: $da^{[l]}$. Output: $da^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$.

$$dZ^{[l]} = dA^{[l]} * g^{[l]\prime}(Z^{[l]}), \qquad dW^{[l]} = \frac{1}{m} \, dZ^{[l]} A^{[l-1]T}$$

$$db^{[l]} = \frac{1}{m} \, \text{np.sum}(dZ^{[l]}, \text{axis}=1, \text{keepdims}=\text{True}), \qquad dA^{[l-1]} = W^{[l]T} dZ^{[l]}$$

Chaining this step from $l = L$ (where $dZ^{[L]} = A^{[L]} - Y$) down to $l = 1$ reproduces the full backward pass of the previous slide.
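These per-layer rules suggest a pair of helper functions, each passing a cache from the forward step to the matching backward step; a sketch with illustrative names:

```python
import numpy as np

def layer_forward(A_prev, W, b, g):
    """Forward for one layer: input a^[l-1], output a^[l] and a cache."""
    Z = W @ A_prev + b
    A = g(Z)
    cache = (A_prev, W, Z)      # saved for the backward pass
    return A, cache

def layer_backward(dA, cache, dg, m):
    """Backward for one layer: input da^[l], output da^[l-1], dW^[l], db^[l]."""
    A_prev, W, Z = cache
    dZ = dA * dg(Z)                              # dZ^[l] = dA^[l] * g'(Z^[l])
    dW = dZ @ A_prev.T / m                       # dW^[l] = (1/m) dZ^[l] A^[l-1]T
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ                           # passed on to layer l-1
    return dA_prev, dW, db
```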
Forward and backward propagation
Gradient checking
Numerical estimation of gradients
http://ufldl.stanford.edu/tutorial/supervised/DebuggingGradientChecking/
Gradient check for a neural network
Take $W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}$ and reshape into a big vector $\theta$.

Take $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ and reshape into a big vector $d\theta$.
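One way to do this reshaping in NumPy; the helper name `params_to_theta` and the toy shapes are illustrative:

```python
import numpy as np

def params_to_theta(params):
    """Flatten a list of W and b arrays into one big vector theta (a sketch)."""
    return np.concatenate([p.ravel() for p in params])

W1, b1 = np.ones((4, 3)), np.zeros((4, 1))
theta = params_to_theta([W1, b1])
print(theta.shape)  # (16,)
```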
Gradient checking (Grad check)

$$J(\theta) = J(\theta_1, \theta_2, \theta_3, \dots)$$

For each $i$:

$$d\theta_{approx}^{(i)} = \frac{J(\theta_1, \theta_2, \dots, \theta_i + \varepsilon, \dots) - J(\theta_1, \theta_2, \dots, \theta_i - \varepsilon, \dots)}{2\varepsilon} \approx \frac{\partial J}{\partial \theta_i} = d\theta^{(i)}, \qquad \text{so } d\theta_{approx} \approx d\theta$$

Check the relative difference:

$$\frac{\left\| d\theta_{approx} - d\theta \right\|_2}{\left\| d\theta_{approx} \right\|_2 + \left\| d\theta \right\|_2} \approx \begin{cases} 10^{-7} & \text{great} \\ 10^{-5} & \text{normal} \\ 10^{-3} & \text{worry} \end{cases}$$
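A minimal implementation of this check; the `grad_check` helper and the toy quadratic cost at the end are illustrative:

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta with a two-sided numerical
    estimate. J maps a flat parameter vector theta to a scalar cost."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    # Relative difference: ~1e-7 great, ~1e-5 normal, ~1e-3 worry.
    num = np.linalg.norm(approx - dtheta)
    den = np.linalg.norm(approx) + np.linalg.norm(dtheta)
    return num / den

theta = np.array([1.0, -2.0])
print(grad_check(lambda t: np.sum(t ** 2), theta, 2 * theta))  # tiny value
```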
Implementation Note:
- Implement backpropagation to compute the partial derivatives.
- Implement a numerical gradient check to compute approximate derivatives.
- Make sure the two give similar values.
- Turn off gradient checking and use the backprop code for learning.

Important:
- Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent, your code will be very slow.
Random initialization and Symmetry
• Before we start training the neural network, we must select an initial value
for these parameters. We do not use the value zero as the initial value.
This is because the output of the first layer will always be the same.
• This will cause problems later on when we try to update these parameters (i.e., the
gradients will all be the same). The solution is to randomly initialize the parameters
to small values (e.g., normally distributed around zero;
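A common way to do this in NumPy; the 0.01 scale is a conventional choice rather than a requirement, and the layer sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr = 3, 4            # illustrative layer sizes

# Small random weights break the symmetry between units; zero biases are
# fine because the random weights already make each unit's output differ.
W1 = rng.normal(size=(n_curr, n_prev)) * 0.01
b1 = np.zeros((n_curr, 1))
```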
[Figures: cost surface $J(w_0, w_1)$ plotted over $w_0$ and $w_1$]
Credit to Andrew Ng
Intuition about deep representation
[Figure: a multi-layer network producing $\hat{y}$]
What are hyperparameters?
Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, W^{[3]}, b^{[3]}, \dots$

Hyperparameters are the settings chosen before training that control how these parameters are learned, e.g. the learning rate $\alpha$, the number of iterations, the number of hidden layers $L$, the number of hidden units $n^{[l]}$, and the choice of activation function.
Experiment Code
Reading material:
• Deep Learning. Andrew Ng.
http://cs229.stanford.edu/notes2019fall/cs229-notes-deep_learning.pdf
• Chapter 5 - Zhang, Aston & Lipton, Zachary & Li, Mu & Smola,
Alexander. (2023). Dive into Deep Learning, Cambridge University
Press. https://d2l.ai/chapter_multilayer-perceptrons/index.html
• Chapter 6 - Goodfellow, I.; Bengio, Y. & Courville, A. (2016), Deep Learning, MIT Press. deeplearningbook.org/contents/mlp.html