CS460 - Deep Learning - W02 & W03
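$$\hat{y} = g\left(b_0 + \sum_{i=1}^{m} x_i w_i\right)$$
Can be expressed as:
$$\hat{y} = g\left(b_0 + X^T W\right)$$
Where:
$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}, \quad W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}$$
[Figure: a single unit with inputs $x_1, \dots, x_m$, weights $w_1, \dots, w_m$, and a bias input 1 weighted by $b_0$, feeding a sum $\sum$ followed by a non-linear activation function that produces $\hat{y}$.]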
Hidden Unit Types (Activation Functions)
• The choice of hidden units also affects the training of the model.
• A hidden unit, in general, takes an input vector $x$, calculates an affine transformation of it, $z = b_0 + X^T W$, and then applies a nonlinear activation function $g(z)$, so that $\hat{y} = g(z)$.
• The non-linear activation function is what introduces non-linearity into the network.
[https://playground.tensorflow.org/]
• For multiple hidden units and layers:
$$z_i = b_{0,i}^{(1)} + X^T W_i^{(1)}$$
$$\hat{y}_i = g\left(b_{0,i}^{(2)} + Z^T W_i^{(2)}\right)$$
[Figure: a unit with inputs $x_1, \dots, x_m$, weights $w_1, \dots, w_m$, and bias $b_0$, computing $z$ and then $\hat{y} = g(z)$.]
Rectified Linear Unit (ReLU) Function
• There is no consensus on which activation function to use; a good default, however, is the Rectified Linear Unit (ReLU), given by:
$$g(z) = \max\{0, z\}$$
Sigmoidal Functions
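• ReLU:
$$g(z) = \max\{0, z\}, \qquad g'(z) = \begin{cases} 1, & z > 0 \\ 0, & \text{otherwise} \end{cases}$$
• Logistic sigmoid:
$$g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad g'(z) = g(z)\,(1 - g(z))$$
• Hyperbolic tangent:
$$g(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad g'(z) = 1 - g(z)^2$$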
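The three activation functions and their derivatives can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation; the function names (`relu`, `relu_grad`, and so on) are chosen here for readability and do not come from the slides.

```python
import numpy as np

def relu(z):
    # g(z) = max{0, z}, applied element-wise
    return np.maximum(0.0, z)

def relu_grad(z):
    # g'(z) = 1 for z > 0, and 0 otherwise
    return (z > 0).astype(float)

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # g'(z) = g(z) * (1 - g(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    # g'(z) = 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

z = np.array([-2.0, 0.0, 3.0])
print("relu        :", relu(z))
print("relu'       :", relu_grad(z))
print("sigmoid     :", sigmoid(z))
print("sigmoid'    :", sigmoid_grad(z))
print("tanh, tanh' :", np.tanh(z), tanh_grad(z))
```

Note how the sigmoid and tanh derivatives shrink towards zero for large |z|, while the ReLU derivative stays at 1 for every positive input.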
Output Units
• The choice of an output unit type depends on the task at hand and affects
the choice of the cost function, for example:
• For regression tasks, linear output units are suitable for the output layer.
• Given the hidden features from the last hidden layer ℎ = 𝑓(𝑥; 𝜃), the linear output units
produce an output vector 𝑦ො = 𝑊 𝑇 ℎ + 𝑏.
• A proper cost function for such units is the Mean-Squared Error (MSE)
• For binary classification tasks, Sigmoid output units are used.
• The produced output vector is 𝑦ො = 𝜎(𝑊 𝑇 ℎ + 𝑏), where 𝜎 is the logistic sigmoid function.
• For multiclass classification problems, we use softmax output units.
• The required output vector becomes $\hat{y}_i = P(y = i \mid x)$, such that $\hat{y}_i \in [0, 1]$ and the entire output vector sums to 1:
$$\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$
where $z_i = \log \tilde{P}(y = i \mid x)$ is the unnormalized log-probability of class $i$.
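A small sketch of the softmax computation follows. Subtracting the maximum before exponentiating is a common numerical-stability trick that is assumed here and is not part of the slides; it leaves the result unchanged because the shift cancels in the ratio.

```python
import numpy as np

def softmax(z):
    # Convert a vector of unnormalized log-probabilities into probabilities.
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)      # stability shift (assumption, cancels in the ratio)
    e = np.exp(shifted)
    return e / e.sum()

p = softmax([2.0, 1.0, 0.1])
print(p, p.sum())

# Large inputs would overflow exp() without the shift.
big = softmax([1000.0, 1000.0])
print(big)
```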
Training – II
2. The parameters of the network (weights and biases) are updated by backpropagating the
error measured using some cost function between the network’s output and the actual targets
of the dataset, such that the error is minimized.
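• Cost function:
$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f(x^{(i)}; W),\, y^{(i)}\right)$$
• For example, $\mathcal{L}$ can be the Cross-Entropy function for classification problems:
$$J(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log f(x^{(i)}; W) + (1 - y^{(i)}) \log\left(1 - f(x^{(i)}; W)\right) \right]$$
• Or it can be the Mean-Squared Error (MSE) for regression problems:
$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - f(x^{(i)}; W)\right)^2$$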
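Both example cost functions are easy to evaluate directly on arrays of targets and predictions. The sketch below assumes binary targets in {0, 1} and predictions in (0, 1); the clipping constant `eps` is an added safeguard against `log(0)` and is not part of the slides.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # J = -(1/n) * sum( y*log(f) + (1-y)*log(1-f) )
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    terms = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-np.mean(terms))

def mean_squared_error(y_true, y_pred):
    # J = (1/n) * sum( (y - f)^2 )
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

print(binary_cross_entropy([1, 0], [0.9, 0.2]))   # confident and correct: small loss
print(binary_cross_entropy([1, 0], [0.5, 0.5]))   # uninformative prediction: log(2)
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
```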
Cost Function
• In most cases, the network defines a distribution 𝑃(𝑦 | 𝑥; 𝜃).
• To train the neural network, we therefore apply the maximum likelihood principle: the cost function is the negative log-likelihood, i.e., the cross-entropy between the training data and the model distribution.
• A general form of the cross-entropy loss function is given by:
$$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{data}} \log p_{model}(y \mid x)$$
• The choice of cost function affects how well a neural network can learn. Cost functions that saturate are a particular problem: they make the gradient very small, which is not ideal for gradient-based learning.
Cost Function Optimization
• The network is optimized using gradient-descent-based algorithms such as Stochastic Gradient Descent (SGD), or adaptive learning-rate methods such as Adam, Adagrad, and RMSprop, to find the set of parameters that gives the minimum loss.
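$$W^{*} = \underset{W}{\operatorname{argmin}} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f(x^{(i)}; W),\, y^{(i)}\right)$$
$$W^{*} = \underset{W}{\operatorname{argmin}} \; J(W)$$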
Gradient Descent Algorithm
1. Randomly initialize weights $\sim N(0, \sigma^2)$
2. Repeat until convergence:
• Compute the gradient $\frac{\partial J(W)}{\partial W}$
• Take a step of size $\eta$ in the direction opposite to the gradient.
• Weight update: $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
• Where $\eta$ is the learning rate
3. Return weights (model)
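The loop above can be sketched as follows. The objective used here, $J(w) = (w_0 - 3)^2 + (w_1 + 1)^2$, is a toy example chosen only because its minimum is known; the helper names `gradient_descent` and `grad_J`, the stopping tolerance, and the initial scale are all assumptions made for illustration.

```python
import numpy as np

def gradient_descent(grad_fn, w0, eta=0.1, tol=1e-8, max_iters=10_000):
    # Repeat W <- W - eta * dJ/dW until the gradient norm is below tol.
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad_fn(w)
        if np.linalg.norm(g) < tol:   # simple convergence test (assumption)
            break
        w = w - eta * g
    return w

# Toy objective with known minimizer (3, -1); its gradient is 2 * (w - w_min).
w_min = np.array([3.0, -1.0])

def grad_J(w):
    return 2.0 * (w - w_min)

rng = np.random.default_rng(0)
w_init = rng.normal(0.0, 0.1, size=2)   # step 1: random initialization ~ N(0, sigma^2)
w_star = gradient_descent(grad_J, w_init)
print(w_star)
```

With a learning rate that is too large the same loop diverges, which is why $\eta$ is usually treated as a hyperparameter to tune.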
Stochastic Gradient Descent
1. Randomly initialize weights $\sim N(0, \sigma^2)$
2. Repeat until convergence:
• Choose a batch of $B$ data points.
• Compute the gradient $\frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}$
• Take a step of size $\eta$ in the direction opposite to the gradient.
• Weight update: $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
• Where $\eta$ is the learning rate
3. Return weights (model)
Note: Mini-batch SGD is faster to compute than vanilla GD and more accurate than single-sample SGD (it gives a better estimate of the true gradient).
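A minimal mini-batch SGD sketch for a linear model with an MSE cost is shown below. The synthetic noise-free data, the linear model, the fixed number of epochs used in place of a convergence test, and the function name `minibatch_sgd` are assumptions made for illustration.

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.1, batch_size=32, epochs=50, seed=0):
    # Fit y ~ X @ w by averaging the per-example MSE gradients over each batch.
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = rng.normal(0.0, 0.01, size=m)            # step 1: random initialization
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # (1/B) * sum_k dJ_k/dw for J_k = (x_k . w - y_k)^2
            grad = (2.0 / len(idx)) * Xb.T @ (Xb @ w - yb)
            w -= eta * grad                      # W <- W - eta * gradient
    return w

# Synthetic regression problem with known weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true

w_hat = minibatch_sgd(X, y)
print(w_hat)
```

Setting `batch_size=1` gives single-sample SGD, and `batch_size=len(X)` gives vanilla gradient descent, so the same loop covers all three variants.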
Backpropagation
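[Figure: computation chain $x \rightarrow z_1 \rightarrow \hat{y} \rightarrow J(W)$, and a network with inputs $x_1, x_2$, hidden units $h_1, h_2$ (pre-activations $Z_{h1}, Z_{h2}$), outputs $y_1, y_2$, weights $W_1$ to $W_8$, and biases $b_1, b_2$.]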
Credits: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example
Backpropagation Example
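Activation Function: Sigmoid for all units
Learning Rate ($\eta$): 0.5
Inputs: $x_1 = 0.05$, $x_2 = 0.1$
Targets: $Target_1 = 0.01$, $Target_2 = 0.99$
Input-to-hidden weights: $w_1 = 0.15$, $w_2 = 0.20$, $w_3 = 0.25$, $w_4 = 0.30$
Hidden-to-output weights: $w_5 = 0.40$, $w_6 = 0.45$, $w_7 = 0.50$, $w_8 = 0.55$
Biases: $b_1 = 0.35$, $b_2 = 0.60$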
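$$Z_{h1} = x_1 w_1 + x_2 w_2 + b_1 = 0.05 \times 0.15 + 0.1 \times 0.2 + 0.35 = 0.3775$$
$$h_1 = \sigma(Z_{h1}) = \frac{1}{1 + e^{-Z_{h1}}} = \frac{1}{1 + e^{-0.3775}} = 0.59327$$
$$Z_{h2} = x_1 w_3 + x_2 w_4 + b_1 = 0.05 \times 0.25 + 0.1 \times 0.3 + 0.35 = 0.3925$$
$$h_2 = \sigma(Z_{h2}) = \frac{1}{1 + e^{-Z_{h2}}} = \frac{1}{1 + e^{-0.3925}} = 0.59688$$
$$Z_{y1} = h_1 w_5 + h_2 w_6 + b_2 = 0.59327 \times 0.4 + 0.59688 \times 0.45 + 0.6 = 1.105904$$
$$y_1 = \sigma(Z_{y1}) = \frac{1}{1 + e^{-Z_{y1}}} = \frac{1}{1 + e^{-1.105904}} = 0.751365$$
$$Z_{y2} = h_1 w_7 + h_2 w_8 + b_2 = 0.59327 \times 0.5 + 0.59688 \times 0.55 + 0.6 = 1.224919$$
$$y_2 = \sigma(Z_{y2}) = \frac{1}{1 + e^{-Z_{y2}}} = \frac{1}{1 + e^{-1.224919}} = 0.772928$$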
Error/Cost Function
$$E_y = \frac{1}{2}(Target - Predicted)^2$$
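The forward pass and the error of the example network can be reproduced with a short NumPy sketch. The matrix layout is an assumption made here for compactness: row $i$ of `W1` holds the weights into hidden unit $h_i$, and row $i$ of `W2` holds the weights into output $y_i$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.05, 0.10])          # inputs x1, x2
target = np.array([0.01, 0.99])     # Target1, Target2
W1 = np.array([[0.15, 0.20],        # w1, w2 -> h1
               [0.25, 0.30]])       # w3, w4 -> h2
W2 = np.array([[0.40, 0.45],        # w5, w6 -> y1
               [0.50, 0.55]])       # w7, w8 -> y2
b1, b2 = 0.35, 0.60                 # one shared bias per layer, as in the example

z_h = W1 @ x + b1                   # Z_h1, Z_h2
h = sigmoid(z_h)                    # h1, h2
z_y = W2 @ h + b2                   # Z_y1, Z_y2
y = sigmoid(z_y)                    # y1, y2

E = 0.5 * (target - y) ** 2         # per-output error E_y
E_total = E.sum()
print("h =", h)
print("y =", y)
print("E_total =", E_total)
```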
Derivatives – I
$$E_y = \frac{1}{2}(Target - Predicted)^2$$
$$\frac{\partial E_{Total}}{\partial y_1} = y_1 - Target_1 = 0.751365 - 0.01 = 0.741365$$
Derivatives – II
$$y_1 = \sigma(Z_{y1}) = \frac{1}{1 + e^{-Z_{y1}}}$$
$$\frac{\partial y_1}{\partial Z_{y1}} = \sigma(Z_{y1})\,(1 - \sigma(Z_{y1})) = y_1 (1 - y_1) = 0.751365\,(1 - 0.751365) = 0.186815$$
Derivation:
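One standard way to obtain this identity, sketched here from the definition of $\sigma$ alone (the intermediate steps are a reconstruction, not taken from the slides):

```latex
\begin{aligned}
\sigma(z) &= \left(1 + e^{-z}\right)^{-1} \\
\frac{d\sigma}{dz} &= -\left(1 + e^{-z}\right)^{-2} \cdot \left(-e^{-z}\right)
                    = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} \\
                   &= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} \\
                   &= \sigma(z)\,\bigl(1 - \sigma(z)\bigr),
\qquad \text{since } 1 - \sigma(z) = \frac{e^{-z}}{1 + e^{-z}}.
\end{aligned}
```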
Derivatives – III
$$Z_{y1} = h_1 w_5 + h_2 w_6 + b_2$$
$$\frac{\partial Z_{y1}}{\partial w_5} = h_1 = 0.59327$$
Combined
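$$\frac{\partial E_{Total}}{\partial w_5} = \frac{\partial E_{Total}}{\partial y_1} \cdot \frac{\partial y_1}{\partial Z_{y1}} \cdot \frac{\partial Z_{y1}}{\partial w_5} = 0.741365 \times 0.186815 \times 0.59327 = 0.082167$$
$$w_5^{+} = w_5 - \eta \frac{\partial E_{Total}}{\partial w_5} = 0.4 - 0.5 \times 0.082167 = 0.358916$$
Similarly:
$w_6 = 0.45 \rightarrow w_6^{+} = 0.40866$
$w_7 = 0.50 \rightarrow w_7^{+} = 0.51130$
$w_8 = 0.55 \rightarrow w_8^{+} = 0.56137$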
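The output-layer updates can be checked with plain Python by multiplying the three chain-rule factors. The variable names (`dE_dy1`, `delta_y1`, and so on) are illustrative; the numbers are the ones from the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

eta = 0.5
x1, x2 = 0.05, 0.10
t1, t2 = 0.01, 0.99
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60

# Forward pass
h1 = sigmoid(x1 * w1 + x2 * w2 + b1)
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)
y1 = sigmoid(h1 * w5 + h2 * w6 + b2)
y2 = sigmoid(h1 * w7 + h2 * w8 + b2)

# Chain rule for w5: dE/dw5 = dE/dy1 * dy1/dZ_y1 * dZ_y1/dw5
dE_dy1 = y1 - t1
dy1_dz = y1 * (1.0 - y1)
dz_dw5 = h1
dE_dw5 = dE_dy1 * dy1_dz * dz_dw5
w5_new = w5 - eta * dE_dw5

# The first two factors are shared by all weights feeding the same output.
delta_y1 = dE_dy1 * dy1_dz
delta_y2 = (y2 - t2) * y2 * (1.0 - y2)
w6_new = w6 - eta * delta_y1 * h2
w7_new = w7 - eta * delta_y2 * h1
w8_new = w8 - eta * delta_y2 * h2

print(dE_dw5, w5_new, w6_new, w7_new, w8_new)
```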
Backward Pass – II
• To calculate the gradient for 𝑤1 , we use the chain rule like before as
follows:
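$$\frac{\partial E_{Total}}{\partial w_1} = \frac{\partial E_{Total}}{\partial h_1} \cdot \frac{\partial h_1}{\partial Z_{h1}} \cdot \frac{\partial Z_{h1}}{\partial w_1}$$
Where:
$$\frac{\partial E_{Total}}{\partial h_1} = \frac{\partial E_{y1}}{\partial h_1} + \frac{\partial E_{y2}}{\partial h_1}$$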
Derivatives – I
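$$\frac{\partial E_{Total}}{\partial h_1} = \frac{\partial E_{y1}}{\partial h_1} + \frac{\partial E_{y2}}{\partial h_1}$$
Contribution through output $y_1$:
$$\frac{\partial E_{y1}}{\partial h_1} = \frac{\partial E_{y1}}{\partial y_1} \cdot \frac{\partial y_1}{\partial Z_{y1}} \cdot \frac{\partial Z_{y1}}{\partial h_1}$$
$$\frac{\partial E_{y1}}{\partial y_1} = y_1 - Target_1 = 0.751365 - 0.01 = 0.741365$$
$$\frac{\partial y_1}{\partial Z_{y1}} = y_1 (1 - y_1) = 0.751365\,(1 - 0.751365) = 0.186815$$
$$Z_{y1} = h_1 w_5 + h_2 w_6 + b_2 \;\Rightarrow\; \frac{\partial Z_{y1}}{\partial h_1} = w_5 = 0.4$$
$$\frac{\partial E_{y1}}{\partial h_1} = 0.741365 \times 0.186815 \times 0.4 = 0.055399$$
Contribution through output $y_2$:
$$\frac{\partial E_{y2}}{\partial h_1} = \frac{\partial E_{y2}}{\partial y_2} \cdot \frac{\partial y_2}{\partial Z_{y2}} \cdot \frac{\partial Z_{y2}}{\partial h_1}$$
$$\frac{\partial E_{y2}}{\partial y_2} = y_2 - Target_2 = 0.772928 - 0.99 = -0.217072$$
$$\frac{\partial y_2}{\partial Z_{y2}} = y_2 (1 - y_2) = 0.772928\,(1 - 0.772928) = 0.17551$$
$$Z_{y2} = h_1 w_7 + h_2 w_8 + b_2 \;\Rightarrow\; \frac{\partial Z_{y2}}{\partial h_1} = w_7 = 0.5$$
$$\frac{\partial E_{y2}}{\partial h_1} = -0.217072 \times 0.17551 \times 0.5 = -0.019049$$
Total:
$$\frac{\partial E_{Total}}{\partial h_1} = 0.055399 + (-0.019049) = 0.03635$$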
Derivatives – II
$$h_1 = \sigma(Z_{h1}) = \frac{1}{1 + e^{-Z_{h1}}}$$
$$\frac{\partial h_1}{\partial Z_{h1}} = \sigma(Z_{h1})\,(1 - \sigma(Z_{h1})) = h_1 (1 - h_1) = 0.59327\,(1 - 0.59327) = 0.2413$$
Derivatives – III
$$Z_{h1} = x_1 w_1 + x_2 w_2 + b_1$$
$$\frac{\partial Z_{h1}}{\partial w_1} = x_1 = 0.05$$
Combined
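$$\frac{\partial E_{Total}}{\partial w_1} = \frac{\partial E_{Total}}{\partial h_1} \cdot \frac{\partial h_1}{\partial Z_{h1}} \cdot \frac{\partial Z_{h1}}{\partial w_1} = 0.03635 \times 0.2413 \times 0.05 = 0.00043856$$
$$w_1^{+} = w_1 - \eta \frac{\partial E_{Total}}{\partial w_1} = 0.15 - 0.5 \times 0.00043856 = 0.14978$$
Similarly:
$w_2 = 0.20 \rightarrow w_2^{+} = 0.19956$
$w_3 = 0.25 \rightarrow w_3^{+} = 0.24975$
$w_4 = 0.30 \rightarrow w_4^{+} = 0.29950$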
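The hidden-layer updates follow the same pattern, with one extra step: the error reaching $h_1$ (or $h_2$) is the sum of the contributions from both outputs. The sketch below uses the original values of $w_5$ to $w_8$ when propagating the error backwards, as the worked numbers above do; the variable names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

eta = 0.5
x1, x2 = 0.05, 0.10
t1, t2 = 0.01, 0.99
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60

# Forward pass
h1 = sigmoid(x1 * w1 + x2 * w2 + b1)
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)
y1 = sigmoid(h1 * w5 + h2 * w6 + b2)
y2 = sigmoid(h1 * w7 + h2 * w8 + b2)

# Error signal at each output pre-activation: dE/dy * dy/dZ
delta_y1 = (y1 - t1) * y1 * (1.0 - y1)
delta_y2 = (y2 - t2) * y2 * (1.0 - y2)

# dE_Total/dh = sum over outputs of delta * (weight from that hidden unit)
dE_dh1 = delta_y1 * w5 + delta_y2 * w7
dE_dh2 = delta_y1 * w6 + delta_y2 * w8

# Multiply by dh/dZ_h = h * (1 - h), then by dZ_h/dw = the corresponding input
delta_h1 = dE_dh1 * h1 * (1.0 - h1)
delta_h2 = dE_dh2 * h2 * (1.0 - h2)
dE_dw1 = delta_h1 * x1

w1_new = w1 - eta * dE_dw1
w2_new = w2 - eta * delta_h1 * x2
w3_new = w3 - eta * delta_h2 * x1
w4_new = w4 - eta * delta_h2 * x2

print(dE_dh1, dE_dw1)
print(w1_new, w2_new, w3_new, w4_new)
```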
Forward Pass – After Updates
Updated weights:
$w_1 = 0.15 \rightarrow w_1^{+} = 0.14978$
$w_2 = 0.20 \rightarrow w_2^{+} = 0.19956$
$w_3 = 0.25 \rightarrow w_3^{+} = 0.24975$
$w_4 = 0.30 \rightarrow w_4^{+} = 0.29950$
$w_5 = 0.40 \rightarrow w_5^{+} = 0.35891$
$w_6 = 0.45 \rightarrow w_6^{+} = 0.40866$
$w_7 = 0.50 \rightarrow w_7^{+} = 0.51130$
$w_8 = 0.55 \rightarrow w_8^{+} = 0.56137$
Using the same inputs $x_1 = 0.05$, $x_2 = 0.1$ and biases $b_1 = 0.35$, $b_2 = 0.60$:
$$Z_{h1} = x_1 w_1 + x_2 w_2 + b_1 = 0.05 \times 0.14978 + 0.1 \times 0.19956 + 0.35 = 0.377445$$
$$h_1 = \sigma(Z_{h1}) = \frac{1}{1 + e^{-Z_{h1}}} = \frac{1}{1 + e^{-0.377445}} = 0.59325$$
$$Z_{h2} = x_1 w_3 + x_2 w_4 + b_1 = 0.05 \times 0.24975 + 0.1 \times 0.2995 + 0.35 = 0.392437$$
$$h_2 = \sigma(Z_{h2}) = \frac{1}{1 + e^{-Z_{h2}}} = \frac{1}{1 + e^{-0.392437}} = 0.59686$$
$$Z_{y1} = h_1 w_5 + h_2 w_6 + b_2 = 0.59325 \times 0.35891 + 0.59686 \times 0.40866 + 0.6 = 1.056836$$
$$y_1 = \sigma(Z_{y1}) = \frac{1}{1 + e^{-Z_{y1}}} = \frac{1}{1 + e^{-1.056836}} = 0.742085$$
$$Z_{y2} = h_1 w_7 + h_2 w_8 + b_2 = 0.59325 \times 0.5113 + 0.59686 \times 0.56137 + 0.6 = 1.238388$$
$$y_2 = \sigma(Z_{y2}) = \frac{1}{1 + e^{-Z_{y2}}} = \frac{1}{1 + e^{-1.238388}} = 0.775283$$
$$E_y = \frac{1}{2}(Target - Predicted)^2$$
$$E_{Total} = E_{y1} + E_{y2} = \frac{1}{2}(0.01 - 0.742085)^2 + \frac{1}{2}(0.99 - 0.775283)^2 = 0.26797 + 0.02305 = 0.29102$$
The error decreased from 0.2983711 (first pass) to 0.29102 after one pass.
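The drop in error can be confirmed by evaluating the network with the original and the updated weights. The helper `total_error` and the matrix layout (one row of weights per receiving unit) are assumptions made for this sketch; the updated weights are the rounded values listed above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def total_error(W1, W2, x, target, b1=0.35, b2=0.60):
    # Forward pass followed by E_Total = sum of 0.5 * (Target - Predicted)^2.
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    return float(np.sum(0.5 * (target - y) ** 2))

x = np.array([0.05, 0.10])
target = np.array([0.01, 0.99])

W1_old = np.array([[0.15, 0.20], [0.25, 0.30]])
W2_old = np.array([[0.40, 0.45], [0.50, 0.55]])
W1_new = np.array([[0.14978, 0.19956], [0.24975, 0.29950]])
W2_new = np.array([[0.35891, 0.40866], [0.51130, 0.56137]])

E_before = total_error(W1_old, W2_old, x, target)
E_after = total_error(W1_new, W2_new, x, target)
print(E_before, E_after)
```

The biases are left unchanged between the two evaluations, matching the worked example.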
Using Matrices
The Backpropagation Algorithm
1. Input $x$:
Set the corresponding activation $a^0$ for the input layer.
2. Feedforward:
For each $l = 1, 2, 3, \dots, L$ compute $z^l = W^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
3. Output error $\delta^L$:
Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
4. Backpropagate the error:
For each $l = L-1, L-2, \dots, 1$ compute $\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.
5. Output:
The gradient of the cost function is given by $\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l$ and $\frac{\partial C}{\partial b_j^l} = \delta_j^l$.
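The five steps translate almost line for line into NumPy. The sketch below assumes sigmoid activations in every layer and the quadratic cost $C = \frac{1}{2}\lVert a^L - y \rVert^2$, so that $\nabla_a C = a^L - y$ and $\sigma'(z^l) = a^l \odot (1 - a^l)$; the function name `backprop` and the list-based layout are illustrative choices, not part of the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, target, weights, biases):
    """Return (dC/dW, dC/db) lists for C = 0.5 * ||a^L - target||^2."""
    # Steps 1-2: input and feedforward, keeping every activation a^l.
    a = np.asarray(x, dtype=float)
    activations = [a]
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
        activations.append(a)

    # Step 3: output error delta^L = (a^L - target) * sigma'(z^L).
    a_out = activations[-1]
    delta = (a_out - target) * a_out * (1.0 - a_out)

    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    # Steps 4-5: walk backwards; weights[i] is W^{i+1}, activations[i] is a^i.
    for i in range(len(weights) - 1, -1, -1):
        grads_W[i] = np.outer(delta, activations[i])   # dC/dw_jk = a_k * delta_j
        grads_b[i] = delta                             # dC/db_j  = delta_j
        if i > 0:
            a_prev = activations[i]
            delta = (weights[i].T @ delta) * a_prev * (1.0 - a_prev)
    return grads_W, grads_b

# Usage on the two-layer example network (biases repeated per unit).
weights = [np.array([[0.15, 0.20], [0.25, 0.30]]),
           np.array([[0.40, 0.45], [0.50, 0.55]])]
biases = [np.array([0.35, 0.35]), np.array([0.60, 0.60])]
grads_W, grads_b = backprop([0.05, 0.10], np.array([0.01, 0.99]), weights, biases)
print(grads_W[1])
print(grads_W[0])
```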
Forward Pass – Hidden Units
$$H = \sigma(Z^1) = \sigma\left(\begin{bmatrix} 0.3775 \\ 0.3925 \end{bmatrix}\right) = \begin{bmatrix} 0.59327 \\ 0.59688 \end{bmatrix}$$
Forward Pass – Output Units
$$Y = \sigma(Z^2) = \sigma\left(\begin{bmatrix} 1.105904 \\ 1.224919 \end{bmatrix}\right) = \begin{bmatrix} 0.751365 \\ 0.772928 \end{bmatrix}$$
Backward Pass – I
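• $\delta^L = \nabla_a C \odot \sigma'(z^L) \;\Rightarrow\; \delta^2 = \nabla_a C \odot \sigma'(z^2)$
• $\nabla_a C = Y - Target = \begin{bmatrix} 0.751365 \\ 0.772928 \end{bmatrix} - \begin{bmatrix} 0.01 \\ 0.99 \end{bmatrix} = \begin{bmatrix} 0.741365 \\ -0.217072 \end{bmatrix}$
• $\sigma'(z^2) = Y \odot (1 - Y) = \begin{bmatrix} 0.751365 \\ 0.772928 \end{bmatrix} \odot \begin{bmatrix} 0.248635 \\ 0.227072 \end{bmatrix} = \begin{bmatrix} 0.18681 \\ 0.17551 \end{bmatrix}$
• $\delta^2 = \begin{bmatrix} 0.741365 \\ -0.217072 \end{bmatrix} \odot \begin{bmatrix} 0.18681 \\ 0.17551 \end{bmatrix} = \begin{bmatrix} 0.13849 \\ -0.03809 \end{bmatrix}$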
Backward Pass – II
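• $\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \;\Rightarrow\; \delta^1 = ((W^2)^T \delta^2) \odot \sigma'(z^1)$
• $\sigma'(z^1) = H \odot (1 - H) = \begin{bmatrix} 0.59327 \\ 0.59688 \end{bmatrix} \odot \begin{bmatrix} 0.40673 \\ 0.40312 \end{bmatrix} = \begin{bmatrix} 0.2413 \\ 0.24061 \end{bmatrix}$
• $\delta^1 = \left(\begin{bmatrix} 0.4 & 0.45 \\ 0.5 & 0.55 \end{bmatrix}^T \begin{bmatrix} 0.13849 \\ -0.03809 \end{bmatrix}\right) \odot \begin{bmatrix} 0.2413 \\ 0.24061 \end{bmatrix}$
• $\delta^1 = \left(\begin{bmatrix} 0.4 & 0.5 \\ 0.45 & 0.55 \end{bmatrix} \begin{bmatrix} 0.13849 \\ -0.03809 \end{bmatrix}\right) \odot \begin{bmatrix} 0.2413 \\ 0.24061 \end{bmatrix}$
• $\delta^1 = \begin{bmatrix} 0.036351 \\ 0.041371 \end{bmatrix} \odot \begin{bmatrix} 0.2413 \\ 0.24061 \end{bmatrix} = \begin{bmatrix} 0.00877 \\ 0.00995 \end{bmatrix}$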
Backward Pass – III
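• $\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l \;\Rightarrow\; \frac{\partial C}{\partial W^2} = a^1 \delta^2 = \delta^2 H^T$
• $\frac{\partial C}{\partial W^2} = \begin{bmatrix} 0.13849 \\ -0.03809 \end{bmatrix} \begin{bmatrix} 0.59327 & 0.59688 \end{bmatrix} = \begin{bmatrix} 0.08216 & 0.08266 \\ -0.02259 & -0.02273 \end{bmatrix}$
• $\frac{\partial C}{\partial W^1} = a^0 \delta^1 = \delta^1 X^T$
• $\frac{\partial C}{\partial W^1} = \begin{bmatrix} 0.00877 \\ 0.00995 \end{bmatrix} \begin{bmatrix} 0.05 & 0.1 \end{bmatrix} = \begin{bmatrix} 0.0004385 & 0.000877 \\ 0.0004975 & 0.000995 \end{bmatrix}$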
Weight Updates
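• $W^l = W^l - \eta \frac{\partial C}{\partial W^l} \;\Rightarrow\; W^2 = W^2 - \eta \frac{\partial C}{\partial W^2}$
• $W^2 = \begin{bmatrix} 0.4 & 0.45 \\ 0.5 & 0.55 \end{bmatrix} - 0.5 \begin{bmatrix} 0.08216 & 0.08266 \\ -0.02259 & -0.02273 \end{bmatrix} = \begin{bmatrix} 0.35892 & 0.40867 \\ 0.51129 & 0.56136 \end{bmatrix}$
• $W^1 = \begin{bmatrix} 0.15 & 0.2 \\ 0.25 & 0.3 \end{bmatrix} - 0.5 \begin{bmatrix} 0.0004385 & 0.000877 \\ 0.0004975 & 0.000995 \end{bmatrix} = \begin{bmatrix} 0.14978 & 0.19956 \\ 0.24975 & 0.29950 \end{bmatrix}$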
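The whole matrix-form pass fits in a few lines when inputs, targets, and activations are kept as column vectors. This sketch assumes the same example values and the $\delta$ notation used above; variable names such as `delta2` and `dW2` are illustrative. Because it carries full floating-point precision, the last digit can differ slightly from the hand-rounded matrices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta = 0.5
X = np.array([[0.05], [0.10]])              # a^0 as a column vector
T = np.array([[0.01], [0.99]])              # targets
W1 = np.array([[0.15, 0.20], [0.25, 0.30]])
W2 = np.array([[0.40, 0.45], [0.50, 0.55]])
b1, b2 = 0.35, 0.60

# Forward pass
H = sigmoid(W1 @ X + b1)
Y = sigmoid(W2 @ H + b2)

# Backward pass
delta2 = (Y - T) * Y * (1.0 - Y)            # grad_a C (.) sigma'(z^2)
delta1 = (W2.T @ delta2) * H * (1.0 - H)    # ((W^2)^T delta^2) (.) sigma'(z^1)
dW2 = delta2 @ H.T                          # delta^2 H^T
dW1 = delta1 @ X.T                          # delta^1 X^T

# Weight updates
W2_new = W2 - eta * dW2
W1_new = W1 - eta * dW1
print(W2_new)
print(W1_new)
```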
References
• http://neuralnetworksanddeeplearning.com/chap2.html
• https://arxiv.org/pdf/1802.01528
• https://www.deeplearningbook.org/