Neural Network Training

Deep Learning

Neural Network Representation

Credit to Stanford’s Machine Learning Course


Logistic Regression (Notation)

Neuron model: the logistic unit. Inputs $x_1, x_2, x_3$ feed a single unit that computes

$z = w^T x + b, \quad a = \sigma(z) = \hat{y}$

The bias can be absorbed into the weight vector by setting $x_0 = 1$ and $w_0 = b$:

$x = \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{pmatrix}, \quad w = \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \end{pmatrix}$
Neural Network Representation

A single logistic unit maps inputs $x_1, x_2, x_3$ to a prediction:

$z = w^T x + b$
$a = \sigma(z) = \hat{y}$

A neural network repeats this computation: the same inputs feed a layer of such units, whose activations in turn feed the output unit that produces $\hat{y}$.
Neural Network Representation

For a hidden layer with four units, each unit $i$ computes

$z_i^{[1]} = w_i^{[1]T} x + b_i^{[1]}, \quad a_i^{[1]} = \sigma(z_i^{[1]}), \quad i = 1, \dots, 4$

where, for example, $w_1^{[1]} = \begin{pmatrix} w_{11}^{[1]} \\ w_{12}^{[1]} \\ w_{13}^{[1]} \end{pmatrix}$.

Stacking the row vectors $w_i^{[1]T}$ gives the layer's parameters:

$W^{[1]} = \begin{pmatrix} w_{11}^{[1]} & w_{12}^{[1]} & w_{13}^{[1]} \\ w_{21}^{[1]} & w_{22}^{[1]} & w_{23}^{[1]} \\ w_{31}^{[1]} & w_{32}^{[1]} & w_{33}^{[1]} \\ w_{41}^{[1]} & w_{42}^{[1]} & w_{43}^{[1]} \end{pmatrix}, \quad x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}, \quad b^{[1]} = \begin{pmatrix} b_1^{[1]} \\ b_2^{[1]} \\ b_3^{[1]} \\ b_4^{[1]} \end{pmatrix}$
Parameters $W^{[l]}$ and $b^{[l]}$

For a network with $n^{[0]} = 2$ inputs, $n^{[1]} = 3$ hidden units, and $n^{[2]} = 5$ units in the next layer:

$z^{[1]} = W^{[1]} x + b^{[1]}$, with shapes $(3,1) = (3,2)(2,1) + (3,1)$, i.e. $(n^{[1]}, 1) = (n^{[1]}, n^{[0]})(n^{[0]}, 1) + (n^{[1]}, 1)$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$, with shapes $(5,1) = (5,3)(3,1) + (5,1)$

In general:
$W^{[l]}: (n^{[l]}, n^{[l-1]})$
$b^{[l]}: (n^{[l]}, 1)$
$dW^{[l]}: (n^{[l]}, n^{[l-1]})$
$db^{[l]}: (n^{[l]}, 1)$
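These shape rules can be checked directly in NumPy. The layer sizes (2, 3, 5) below match the slide's example; the random values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, n2 = 2, 3, 5                  # layer sizes from the slide

W1 = rng.standard_normal((n1, n0))    # W[l] : (n[l], n[l-1])
b1 = np.zeros((n1, 1))                # b[l] : (n[l], 1)
W2 = rng.standard_normal((n2, n1))
b2 = np.zeros((n2, 1))

x = rng.standard_normal((n0, 1))      # one example, shape (n[0], 1)
z1 = W1 @ x + b1                      # (3,1) = (3,2)(2,1) + (3,1)
a1 = 1 / (1 + np.exp(-z1))
z2 = W2 @ a1 + b2                     # (5,1) = (5,3)(3,1) + (5,1)

print(z1.shape, z2.shape)             # (3, 1) (5, 1)
```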
Vectorizing across multiple examples

For a single example $x$:

$z^{[1]} = W^{[1]} x + b^{[1]}$
$a^{[1]} = \sigma(z^{[1]})$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
$a^{[2]} = \sigma(z^{[2]})$

$m$ – number of examples in the dataset

Vectorizing across multiple examples

Looping over examples:

for $i$ = 1 to $m$:
    $z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$
    $a^{[1](i)} = \sigma(z^{[1](i)})$
    $z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$
    $a^{[2](i)} = \sigma(z^{[2](i)})$

Stacking the examples as columns, $X = \begin{pmatrix} x^{(1)} & x^{(2)} & \dots & x^{(m)} \end{pmatrix}$ and $A^{[1]} = \begin{pmatrix} a^{[1](1)} & a^{[1](2)} & \dots & a^{[1](m)} \end{pmatrix}$, the loop becomes:

$Z^{[1]} = W^{[1]} X + b^{[1]}$
$A^{[1]} = \sigma(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
$A^{[2]} = \sigma(Z^{[2]})$
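A minimal NumPy sketch of this equivalence (layer sizes and random data are made up for illustration): the explicit per-example loop and the single matrix-product version produce identical activations.

```python
import numpy as np

rng = np.random.default_rng(1)
n0, n1, n2, m = 3, 4, 1, 5            # illustrative sizes
sigmoid = lambda z: 1 / (1 + np.exp(-z))

W1, b1 = rng.standard_normal((n1, n0)), np.zeros((n1, 1))
W2, b2 = rng.standard_normal((n2, n1)), np.zeros((n2, 1))
X = rng.standard_normal((n0, m))      # examples stacked as columns

# Explicit loop over the m examples
A2_loop = np.zeros((n2, m))
for i in range(m):
    x_i = X[:, i:i+1]
    a1 = sigmoid(W1 @ x_i + b1)
    A2_loop[:, i:i+1] = sigmoid(W2 @ a1 + b2)

# Vectorized: the whole batch in two matrix products
A1 = sigmoid(W1 @ X + b1)             # b1 broadcasts across the m columns
A2 = sigmoid(W2 @ A1 + b2)

print(np.allclose(A2, A2_loop))       # True
```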
Vectorized implementation

Single example:
$z^{[1]} = W^{[1]} x + b^{[1]}$, with shapes $(n^{[1]}, 1) = (n^{[1]}, n^{[0]})(n^{[0]}, 1) + (n^{[1]}, 1)$
$z^{[l]}, a^{[l]}: (n^{[l]}, 1)$

All $m$ examples at once:
$Z^{[1]} = W^{[1]} X + b^{[1]}$, with shapes $(n^{[1]}, m) = (n^{[1]}, n^{[0]})(n^{[0]}, m) + (n^{[1]}, 1)$, where $b^{[1]}$ broadcasts across the $m$ columns
$Z^{[l]}, A^{[l]}: (n^{[l]}, m)$
For $l = 0$: $A^{[0]} = X : (n^{[0]}, m)$
$dZ^{[l]}, dA^{[l]}: (n^{[l]}, m)$
Forward propagation

$Z^{[1]} = W^{[1]} X + b^{[1]}$
$A^{[1]} = g^{[1]}(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
$A^{[2]} = g^{[2]}(Z^{[2]})$
$\quad \vdots$
$A^{[L]} = g^{[L]}(Z^{[L]}) = \hat{Y} = h_w(X)$
Why do you need non-linear activation functions?

Applying a linear function to another linear function yields a linear function of the original input. This loses much of the representational power of the neural network, since the output we are trying to predict often has a non-linear relationship with the inputs. Without non-linear activation functions, the neural network simply performs linear regression.
Credit to MIT’s Intro to Deep Learning
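A quick numerical illustration of this point (random sizes and data, identity "activations" assumed): two stacked linear layers are exactly equivalent to a single linear layer with composed parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
X = rng.standard_normal((3, 10))

# Two stacked linear layers (no non-linearity between them)...
out = W2 @ (W1 @ X + b1) + b2

# ...collapse to one linear layer with composed parameters
W = W2 @ W1
b = W2 @ b1 + b2
print(np.allclose(out, W @ X + b))    # True
```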
Derivatives of activation functions

Sigmoid activation function:
$g(z) = \frac{1}{1 + e^{-z}}, \qquad g'(z) = g(z)\left(1 - g(z)\right)$

Tanh activation function:
$g(z) = \tanh(z), \qquad g'(z) = 1 - \tanh^2(z)$
Derivative of ReLU and Leaky ReLU

ReLU: $g(z) = \max(0, z)$

$g'(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \\ \text{undefined} & \text{if } z = 0 \end{cases}$

Leaky ReLU: $g(z) = \max(0.01 z,\; z)$

$g'(z) = \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \\ \text{undefined} & \text{if } z = 0 \end{cases}$

For simplicity, take the derivative at $z = 0$ to be 1:

ReLU: $g'(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$

Leaky ReLU: $g'(z) = \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$
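A sketch of these four activations and their derivatives in NumPy, taking $g'(0) = 1$ for ReLU and Leaky ReLU as above. Each derivative is sanity-checked against a centered finite difference (away from the kink at $z = 0$).

```python
import numpy as np

def sigmoid(z):      return 1 / (1 + np.exp(-z))
def d_sigmoid(z):    a = sigmoid(z); return a * (1 - a)
def tanh(z):         return np.tanh(z)
def d_tanh(z):       return 1 - np.tanh(z) ** 2
def relu(z):         return np.maximum(0.0, z)
def d_relu(z):       return (z >= 0).astype(float)       # g'(0) taken as 1
def leaky_relu(z):   return np.maximum(0.01 * z, z)
def d_leaky_relu(z): return np.where(z >= 0, 1.0, 0.01)  # g'(0) taken as 1

# Check each derivative with a centered finite difference
z = np.linspace(-3, 3, 13)
z = z[np.abs(z) > 1e-6]               # avoid the kink at z = 0
eps = 1e-6
for g, dg in [(sigmoid, d_sigmoid), (tanh, d_tanh),
              (relu, d_relu), (leaky_relu, d_leaky_relu)]:
    numeric = (g(z + eps) - g(z - eps)) / (2 * eps)
    assert np.allclose(dg(z), numeric, atol=1e-4)
print("all derivatives match the numeric check")
```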
Credit to MIT’s Intro to Deep Learning
Loss function

Logistic regression:

$J(w) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log h_w(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_w(x^{(i)})\right) \right] + \frac{\lambda}{2n} \sum_{j=1}^{d} w_j^2$

Neural networks:

$J(w) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} \left[ y_k^{(i)} \log h_w(x^{(i)})_k + \left(1 - y_k^{(i)}\right) \log\left(1 - h_w(x^{(i)})_k\right) \right] + \frac{\lambda}{2n} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (w_{ji}^{[l]})^2$

$(x^{(i)}, y^{(i)}),\; i = 1, \dots, n$ – training set
$L$ – total number of layers in the network
$s_l$ – number of units (not counting the bias unit) in layer $l$
$K$ – number of output units
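A minimal NumPy sketch of this regularized cross-entropy cost. The function name `cost` and the small `eps` clamp (to avoid $\log 0$) are illustrative choices, not from the slides.

```python
import numpy as np

def cost(A_L, Y, weights, lam, n):
    """Cross-entropy over n examples plus an L2 penalty on all weight matrices.

    A_L : (K, n) network outputs, Y : (K, n) labels, weights : list of W[l].
    """
    eps = 1e-12                        # numerical guard against log(0)
    ce = -np.sum(Y * np.log(A_L + eps)
                 + (1 - Y) * np.log(1 - A_L + eps)) / n
    l2 = lam / (2 * n) * sum(np.sum(W ** 2) for W in weights)
    return ce + l2

# Tiny check: perfect predictions with no regularization give ~0 cost
Y = np.array([[1.0, 0.0]])
A = np.array([[1.0, 0.0]])
print(cost(A, Y, weights=[], lam=0.0, n=2))   # ~0.0
```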
Gradient Descent algorithm

$J(w) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} \left[ y_k^{(i)} \log h_w(x^{(i)})_k + \left(1 - y_k^{(i)}\right) \log\left(1 - h_w(x^{(i)})_k\right) \right] + \frac{\lambda}{2n} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (w_{ji}^{[l]})^2$

Goal: $\min_w J(w)$

Need code to compute:
- $J(w)$
- $\frac{\partial}{\partial w_{ji}^{[l]}} J(w)$
Neural Network Training

Derivatives with a Computation Graph

Computation Graph

$J = 3(a + bc)$

With $a = 5$, $b = 3$, $c = 2$, the graph computes left to right:

$u = bc = 6, \quad v = a + u = 11, \quad J = 3v = 33$
Derivatives with a Computation Graph

With $a = 5$, $b = 3$, $c = 2$: $u = bc = 6$, $v = a + u = 11$, $J = 3v = 33$.

$\frac{dJ}{dv} = 3$

$\frac{dJ}{da} = \frac{dJ}{dv} \frac{dv}{da} = 3 \times 1 = 3 \qquad \frac{dJ}{du} = \frac{dJ}{dv} \frac{dv}{du} = 3 \times 1 = 3$

$\frac{dJ}{db} = \frac{dJ}{dv} \frac{dv}{du} \frac{du}{db} = 3 \times 1 \times c = 6 \qquad \frac{dJ}{dc} = \frac{dJ}{dv} \frac{dv}{du} \frac{du}{dc} = 3 \times 1 \times b = 9$
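These chain-rule results can be confirmed numerically with centered finite differences on the same graph:

```python
# Numerically verify the computation-graph derivatives of J = 3(a + bc)
# at a = 5, b = 3, c = 2 using centered finite differences.
def J(a, b, c):
    u = b * c        # u = bc
    v = a + u        # v = a + u
    return 3 * v     # J = 3v

a, b, c, eps = 5.0, 3.0, 2.0, 1e-6
dJ_da = (J(a + eps, b, c) - J(a - eps, b, c)) / (2 * eps)   # 3
dJ_db = (J(a, b + eps, c) - J(a, b - eps, c)) / (2 * eps)   # 3c = 6
dJ_dc = (J(a, b, c + eps) - J(a, b, c - eps)) / (2 * eps)   # 3b = 9
print(round(dJ_da), round(dJ_db), round(dJ_dc))             # 3 6 9
```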
Logistic regression derivatives

$z = w^T x + b$
$\hat{y} = a = \sigma(z)$
$\mathcal{L}(a, y) = -\left( y \log(a) + (1 - y) \log(1 - a) \right)$

Backward propagation:

$\frac{\partial \mathcal{L}}{\partial a} = -\left( y \times \frac{1}{a} - (1 - y) \times \frac{1}{1 - a} \right) = -\frac{y}{a} + \frac{1 - y}{1 - a}$

$\frac{da}{dz} = a(1 - a)$

$\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial a} \times \frac{da}{dz} = \left( -\frac{y}{a} + \frac{1 - y}{1 - a} \right) \times a(1 - a) = a - y \quad (\text{chain rule})$

Writing $dz = \frac{\partial \mathcal{L}}{\partial z}$, $dw_1 = \frac{\partial \mathcal{L}}{\partial w_1}$, etc.:

$\frac{\partial \mathcal{L}}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial z} \times \frac{dz}{dw_1} = x_1 \, dz, \qquad \frac{\partial \mathcal{L}}{\partial w_2} = x_2 \, dz, \qquad \frac{\partial \mathcal{L}}{\partial b} = dz$

Gradient descent updates:
$w_1 := w_1 - \alpha \, dw_1$
$w_2 := w_2 - \alpha \, dw_2$
$b := b - \alpha \, db$
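A single gradient-descent step for this logistic unit, sketched in NumPy. The example values of $x$, $y$, and the learning rate $\alpha$ are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.array([0.5, -1.0, 2.0])   # one example with features x1, x2, x3
y = 1.0
w = rng.standard_normal(3)
b = 0.0
alpha = 0.1                      # learning rate

# Forward pass
z = w @ x + b
a = 1 / (1 + np.exp(-z))
loss_before = -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Backward pass: dz = a - y, dw_j = x_j * dz, db = dz
dz = a - y
dw = x * dz
db = dz

# Gradient descent update
w -= alpha * dw
b -= alpha * db

z = w @ x + b
a = 1 / (1 + np.exp(-z))
loss_after = -(y * np.log(a) + (1 - y) * np.log(1 - a))
print(loss_after < loss_before)   # True: one step reduced the loss
```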
Forward and Backward propagation

Forward:
$Z^{[1]} = W^{[1]} X + b^{[1]}$
$A^{[1]} = g^{[1]}(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
$A^{[2]} = g^{[2]}(Z^{[2]})$
$\quad \vdots$
$A^{[L]} = g^{[L]}(Z^{[L]}) = \hat{Y}$

Backward:
$dZ^{[L]} = A^{[L]} - Y$
$dW^{[L]} = \frac{1}{m} \, dZ^{[L]} A^{[L-1]T}$
$db^{[L]} = \frac{1}{m} \, \text{np.sum}(dZ^{[L]}, \text{axis}=1, \text{keepdims}=\text{True})$
$dZ^{[L-1]} = W^{[L]T} dZ^{[L]} * g^{[L-1]\prime}(Z^{[L-1]})$
$\quad \vdots$
$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$
$dW^{[1]} = \frac{1}{m} \, dZ^{[1]} X^T$
$db^{[1]} = \frac{1}{m} \, \text{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims}=\text{True})$
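A NumPy sketch of these equations for a two-layer network, with sigmoid assumed for both activations (the sizes, the 0.01 weight scaling, and the random data are illustrative). Each gradient comes out with the same shape as the parameter it updates.

```python
import numpy as np

rng = np.random.default_rng(4)
n0, n1, n2, m = 3, 4, 1, 8            # illustrative sizes
sigmoid = lambda z: 1 / (1 + np.exp(-z))

W1, b1 = rng.standard_normal((n1, n0)) * 0.01, np.zeros((n1, 1))
W2, b2 = rng.standard_normal((n2, n1)) * 0.01, np.zeros((n2, 1))
X = rng.standard_normal((n0, m))
Y = (rng.random((n2, m)) > 0.5).astype(float)

# Forward propagation
Z1 = W1 @ X + b1
A1 = sigmoid(Z1)                      # g[1] = sigmoid here
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)                      # g[2] = sigmoid (output layer)

# Backward propagation
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)    # g[1]'(Z1) = A1(1 - A1)
dW1 = dZ1 @ X.T / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# Gradients match the shapes of the parameters they update
print(dW1.shape == W1.shape, db1.shape == b1.shape,
      dW2.shape == W2.shape, db2.shape == b2.shape)
```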
Forward propagation for layer $l$

$Z^{[1]} = W^{[1]} X + b^{[1]}$
$A^{[1]} = g^{[1]}(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$
$A^{[2]} = g^{[2]}(Z^{[2]})$
$\quad \vdots$
$A^{[L]} = g^{[L]}(Z^{[L]}) = \hat{Y}$

Each layer's forward step has:
Input: $a^{[l-1]}$
Output: $a^{[l]}$, cache $(z^{[l]})$
Backward propagation for layer $l$

$dZ^{[L]} = A^{[L]} - Y$
$dW^{[L]} = \frac{1}{m} \, dZ^{[L]} A^{[L-1]T}$
$db^{[L]} = \frac{1}{m} \, \text{np.sum}(dZ^{[L]}, \text{axis}=1, \text{keepdims}=\text{True})$
$dZ^{[L-1]} = W^{[L]T} dZ^{[L]} * g^{[L-1]\prime}(Z^{[L-1]})$
$\quad \vdots$
$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$
$dW^{[1]} = \frac{1}{m} \, dZ^{[1]} X^T$
$db^{[1]} = \frac{1}{m} \, \text{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims}=\text{True})$

Each layer's backward step has:
Input: $da^{[l]}$
Output: $da^{[l-1]}, dW^{[l]}, db^{[l]}$
Forward and backward propagation

Gradient checking

Numerical estimation of gradients

http://ufldl.stanford.edu/tutorial/supervised/DebuggingGradientChecking/
Gradient check for a neural network

Take $W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}$ and reshape into a big vector $\theta$.

Take $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ and reshape into a big vector $d\theta$.
Gradient checking (Grad check)

$J(\theta) = J(\theta_1, \theta_2, \theta_3, \dots)$

for each $i$:

$d\theta_{approx}^{(i)} = \frac{J(\theta_1, \theta_2, \dots, \theta_i + \varepsilon, \dots) - J(\theta_1, \theta_2, \dots, \theta_i - \varepsilon, \dots)}{2\varepsilon} \approx \frac{\partial J}{\partial \theta_i} = d\theta^{(i)}$

so that $d\theta_{approx} \approx d\theta$.

Check the normalized difference:

$\frac{\lVert d\theta_{approx} - d\theta \rVert_2}{\lVert d\theta_{approx} \rVert_2 + \lVert d\theta \rVert_2} \approx \begin{cases} 10^{-7} & \text{great} \\ 10^{-5} & \text{normal} \\ 10^{-3} & \text{worry} \end{cases}$
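A sketch of this recipe in NumPy, applied to a simple cost $J(\theta) = \sum_i \theta_i^3$ whose analytic gradient $3\theta_i^2$ stands in for the backprop output (the cost and values are illustrative).

```python
import numpy as np

def J(theta):
    return np.sum(theta ** 3)

theta = np.array([0.5, -1.2, 2.0])
dtheta = 3 * theta ** 2                     # analytic ("backprop") gradient

# Centered finite-difference estimate, one component at a time
eps = 1e-7
dtheta_approx = np.zeros_like(theta)
for i in range(theta.size):
    plus, minus = theta.copy(), theta.copy()
    plus[i] += eps
    minus[i] -= eps
    dtheta_approx[i] = (J(plus) - J(minus)) / (2 * eps)

# Normalized difference from the slide's check
diff = (np.linalg.norm(dtheta_approx - dtheta)
        / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)))
print(diff < 1e-6)    # True: analytic and numeric gradients agree
```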
Implementation Note:
- Implement backpropagation to compute partial derivative.
- Implement numerical gradient check to compute approximate
derivative.
- Make sure they give similar values.
- Turn off gradient checking; use the backprop code for learning.

Important:
- Be sure to disable your gradient checking code before training
your classifier. If you run numerical gradient computation on
every iteration of gradient descent your code will be very slow.
Random initialization and Symmetry

• Before we start training the neural network, we must select initial values for the parameters. We do not use zero as the initial value, because then the output of the first layer would always be the same for every unit.
• This causes problems later when we try to update these parameters (i.e., the gradients will all be the same). The solution is to randomly initialize the parameters to small values (e.g., normally distributed around zero).

• What if we had initialized all parameters to the same non-zero value?

• Each element of the activation vector a[1] will be the same (because W[1] contains all the same values). This behavior occurs at every layer of the neural network. As a result, when we compute the gradient, all neurons in a layer will be equally responsible for whatever is contributed to the final loss. We call this property symmetry. It means each neuron (within a layer) receives exactly the same gradient update (i.e., all neurons will learn the same thing).
Random initialization
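A small NumPy demonstration of the symmetry problem (sizes and values are illustrative): with identical weights, every hidden unit computes the same activation, while small random weights break the tie.

```python
import numpy as np

rng = np.random.default_rng(5)
n0, n1 = 3, 4
X = rng.standard_normal((n0, 5))
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Symmetric init: every weight identical -> every hidden unit identical
W_sym = np.full((n1, n0), 0.5)
A_sym = sigmoid(W_sym @ X)        # all rows equal: units can never diverge
symmetric = np.allclose(A_sym, A_sym[0])

# Random init: small values break the symmetry
W_rand = rng.standard_normal((n1, n0)) * 0.01
A_rand = sigmoid(W_rand @ X)
broken = not np.allclose(A_rand, A_rand[0])

print(symmetric, broken)          # True True
```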
Putting it together. Training a neural network
Pick a network architecture (connectivity pattern between neurons):

No. of input units: dimension of the features
No. of output units: number of classes
Training a neural network

1. Randomly initialize weights
2. Implement forward propagation
3. Implement code to compute the cost function $J(w)$
4. Implement backprop to compute the partial derivatives $\frac{\partial}{\partial w_{ji}^{[l]}} J(w)$
5. Use gradient checking to compare $\frac{\partial}{\partial w_{ji}^{[l]}} J(w)$ computed using backpropagation vs. a numerical estimate of the gradient of $J(w)$
6. Use gradient descent or an advanced optimization method with backpropagation to try to minimize $J(w)$ as a function of the parameters
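The steps above, sketched end to end in NumPy on a toy binary problem. The one-hidden-layer architecture, the data, and the learning rate are all made up for illustration; gradient checking (step 5) is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(6)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Toy binary problem
n0, n1, m, alpha = 2, 8, 200, 0.5
X = rng.standard_normal((n0, m))
Y = (X[0:1] * X[1:2] > 0).astype(float)   # XOR-like labels

# 1. Randomly initialize weights
W1, b1 = rng.standard_normal((n1, n0)) * 0.5, np.zeros((n1, 1))
W2, b2 = rng.standard_normal((1, n1)) * 0.5, np.zeros((1, 1))

losses = []
for step in range(1000):
    # 2.-3. Forward propagation and cost J(w)
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    losses.append(-np.mean(Y * np.log(A2 + 1e-12)
                           + (1 - Y) * np.log(1 - A2 + 1e-12)))
    # 4. Backprop (tanh'(z) = 1 - A1**2)
    dZ2 = A2 - Y
    dW2, db2 = dZ2 @ A1.T / m, np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1, db1 = dZ1 @ X.T / m, np.sum(dZ1, axis=1, keepdims=True) / m
    # 6. Gradient descent update
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2

print(losses[-1] < losses[0])   # True: the cost decreased during training
```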
Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a function.

[Surface plots of the cost $J(w_0, w_1)$ over the parameters $(w_0, w_1)$, illustrating gradient descent on the cost surface.]
Andrew Ng
Intuition about deep representation

What are hyperparameters?

Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, W^{[3]}, b^{[3]}$

Hyperparameters:
- learning rate
- # iterations
- # hidden layers
- # hidden units
- choice of activation function
- minibatch size
- regularization parameter
Applied deep learning is a very empirical process: a repeating cycle of Idea → Code → Experiment, where each experiment suggests the next idea.
Reading material:
• Deep Learning. Andrew Ng. http://cs229.stanford.edu/notes2019fall/cs229-notes-deep_learning.pdf
• Chapter 5 - Zhang, Aston & Lipton, Zachary & Li, Mu & Smola, Alexander. (2023). Dive into Deep Learning, Cambridge University Press. https://d2l.ai/chapter_multilayer-perceptrons/index.html
• Chapter 6 - Goodfellow, I.; Bengio, Y. & Courville, A. (2016), Deep Learning, MIT Press. deeplearningbook.org/contents/mlp.html
