
CS460 - Deep Learning - W02 & W03

The document provides an overview of Artificial Neural Networks (ANNs), detailing their architecture, training methods, and activation functions. It discusses the importance of choosing appropriate hidden and output units based on the task, as well as the optimization techniques used to minimize error during training. Key concepts such as backpropagation, cost functions, and issues related to model training are also highlighted.

Uploaded by

Abdelrhman Adel
Copyright © All Rights Reserved

CS460 – Deep Learning

W02 & W03


Artificial Neural Networks Foundation

TA: Mahmoud ElMorshedy


Artificial Neural Networks (ANNs)
• An artificial neural network (ANN) is a data processing paradigm modelled after how the human brain processes information.
• ANNs can model non-linear relationships between data samples and targets, which makes them a powerful candidate for many machine learning tasks.
Architecture
• The network is a feedforward network: data flows through it in one direction, from the input layer to the output layer (forward).
• The width and depth of the network are hyperparameters.
• The basic architecture of an ANN consists of:
• An input layer (with 𝑛 nodes corresponding to 𝑛-dimensional input vector)
• One or more hidden layers (with one or more hidden nodes per layer)
• An output layer (with one or more nodes)
Training – I
• The network is trained using the popular backpropagation algorithm.
• The main idea of the algorithm is:
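1. The network's output is calculated using forward propagation:
$$\hat{y} = g\left(b_0 + \sum_{i=1}^{m} x_i w_i\right)$$
This can be expressed in vector form as:
$$\hat{y} = g\left(b_0 + X^T W\right)$$
where:
$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}, \quad W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}$$
[Figure: a single neuron. The inputs $x_1, \dots, x_m$ are multiplied by the weights $w_1, \dots, w_m$ and summed (∑) together with the bias $b_0$, which is attached to a constant input of 1. The sum passes through a non-linear activation function to produce $\hat{y}$.]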
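The forward-propagation formula for a single neuron can be sketched in a few lines of Python. This is only an illustration: the function names `sigmoid` and `neuron_forward`, the choice of the logistic sigmoid as $g$, and the numeric inputs are assumptions made for the example, not part of the slides.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, used here as an example activation g."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(x, w, b0, g=sigmoid):
    """Compute y_hat = g(b0 + sum_i x_i * w_i) for one neuron."""
    z = b0 + sum(xi * wi for xi, wi in zip(x, w))
    return g(z)

# Hypothetical inputs, weights, and bias chosen only for illustration.
y_hat = neuron_forward(x=[1.0, 2.0], w=[0.5, -0.25], b0=0.0)
print(y_hat)  # z = 0.5 - 0.5 + 0.0 = 0.0, and sigmoid(0) = 0.5
```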
Hidden Unit Types (Activation Functions)
• The choice of hidden units also affects the training of the model.
• A hidden unit, in general, takes an input vector $x$, calculates an affine transformation of it, $z = b_0 + X^T W$, and then applies a nonlinear activation function: $\hat{y} = g(z)$.
• The purpose of the non-linear activation function is to introduce non-linearity into the network.
[https://playground.tensorflow.org/]
• For multiple hidden units and layers:
$$z_i = b_{0,i}^{(1)} + X^T W_i^{(1)}$$
$$\hat{y}_i = g\left(b_{0,i}^{(2)} + Z^T W_i^{(2)}\right)$$
[Figure: a single neuron with bias $b_0$, inputs $x_1, \dots, x_m$, and weights $w_1, \dots, w_m$; the weighted sum is labelled $z$ and the output is $\hat{y} = g(z)$.]
Rectified Linear Unit (ReLU) Function
• There is no consensus on which activation function to use; however, a good default is the Rectified Linear Unit (ReLU), given by:
$$g(z) = \max\{0, z\}$$
Sigmoidal Functions

• Prior to ReLU, sigmoidal functions such as the Logistic Sigmoid and Hyperbolic Tangent functions were used.
• Such functions suffer from a saturation problem: for inputs of large magnitude they produce low gradient values, which slows learning.
Sigmoid Function:
$$g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$$
Hyperbolic Tangent (tanh) Function:
$$g(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Derivatives

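• ReLU Function: $g(z) = \max(0, z)$, with $g'(z) = 1$ if $z > 0$ and $g'(z) = 0$ otherwise.
• Sigmoid Function: $g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$, with $g'(z) = g(z)\,(1 - g(z))$.
• Hyperbolic Tangent Function: $g(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, with $g'(z) = 1 - g(z)^2$.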
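The three activation functions and their derivatives can be checked numerically. The sketch below uses only Python's standard library; the helper `numeric_derivative` (a central finite difference) and all function names are illustrative assumptions, not something defined in the slides.

```python
import math

def relu(z):
    return max(0.0, z)

def relu_prime(z):
    # 1 for z > 0, 0 otherwise (the value at z = 0 is a convention).
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # g'(z) = g(z) (1 - g(z))

def tanh_prime(z):
    t = math.tanh(z)
    return 1.0 - t * t            # g'(z) = 1 - g(z)^2

def numeric_derivative(f, z, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

for z in (-2.0, 0.5, 3.0):
    print(z, sigmoid_prime(z), numeric_derivative(sigmoid, z))
    print(z, tanh_prime(z), numeric_derivative(math.tanh, z))
```

Evaluating `sigmoid_prime(10.0)` gives a value below 1e-4, which illustrates the saturation problem: far from zero, the gradient of a sigmoidal unit is close to zero.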
Output Units
• The choice of an output unit type depends on the task at hand and affects the choice of the cost function, for example:
• For regression tasks, linear output units are suitable for the output layer.
• Given the hidden features from the last hidden layer $h = f(x; \theta)$, the linear output units produce an output vector $\hat{y} = W^T h + b$.
• A proper cost function for such units is the Mean-Squared Error (MSE).
• For binary classification tasks, sigmoid output units are used.
• The produced output vector is $\hat{y} = \sigma(W^T h + b)$, where $\sigma$ is the logistic sigmoid function.
• For multi-class classification problems, we use softmax output units.
• The required output vector becomes $\hat{y}_i = P(y = i \mid x)$, such that $\hat{y}_i \in [0, 1]$ and the entire output vector sums to 1.
$$\mathrm{softmax}(z_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$
where $z_i = \log \tilde{P}(y = i \mid x)$ is the unnormalized log-probability of class $i$.
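A minimal sketch of the softmax formula follows. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not mentioned in the slides; it leaves the result unchanged because the factor cancels in the ratio. The input scores are hypothetical.

```python
import math

def softmax(z):
    """Map a list of scores z to probabilities that sum to 1."""
    m = max(z)                               # stability shift (assumption, not in the slides)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```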
Training – II
2. The parameters of the network (weights and biases) are updated by backpropagating the error, measured by a cost function between the network's output and the actual targets of the dataset, such that the error is minimized.
• Cost function:
$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f(x^{(i)}; W),\, y^{(i)}\right)$$
• For example, $\mathcal{L}$ can be the Cross-Entropy function for classification problems:
$$J(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log f(x^{(i)}; W) + (1 - y^{(i)}) \log\left(1 - f(x^{(i)}; W)\right) \right]$$
• Or it can be the Mean-Squared Error (MSE) for regression problems:
$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - f(x^{(i)}; W)\right)^2$$
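The two cost functions can be written directly from their definitions. In this sketch the binary cross-entropy carries the conventional leading minus sign, so that the loss is non-negative and minimizing it maximizes the likelihood; predictions are assumed to lie strictly between 0 and 1. The function names and sample values are illustrative only.

```python
import math

def mse(y_true, y_pred):
    """Mean-squared error: (1/n) * sum (y - f(x))^2."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def binary_cross_entropy(y_true, y_pred):
    """-(1/n) * sum [y log p + (1 - y) log(1 - p)], with 0 < p < 1."""
    n = len(y_true)
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred))
    return -total / n

# Hypothetical targets and predictions.
print(mse([1.0, 2.0], [1.0, 4.0]))               # (0 + 4) / 2 = 2.0
print(binary_cross_entropy([1, 0], [0.5, 0.5]))  # equals log(2)
```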
Cost Function
• In most cases, the network defines a distribution $P(y \mid x; \theta)$.
• To train the neural network, we employ the maximum likelihood principle: the cost function is the negative log-likelihood (i.e., the cross-entropy) between the training data and the model distribution.
• A general form of the cross-entropy loss function is given by:
$$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{data}} \log p_{model}(y \mid x)$$
• The choice of cost function affects how well a neural network learns. Cost functions that saturate make the gradient very small, which is not ideal for gradient-based learning.
Cost Function Optimization
• The network is optimized using gradient-descent-based algorithms such as Stochastic Gradient Descent (SGD), or adaptive learning-rate methods such as Adam, Adagrad, and RMSprop, to find the set of parameters that gives the minimum loss.
$$W^{*} = \underset{W}{\operatorname{argmin}}\; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f(x^{(i)}; W),\, y^{(i)}\right)$$
$$W^{*} = \underset{W}{\operatorname{argmin}}\; J(W)$$
Gradient Descent Algorithm
1. Randomly initialize weights $\sim \mathcal{N}(0, \sigma^2)$
2. Repeat until convergence:
• Compute the gradient $\frac{\partial J(W)}{\partial W}$
• Take a step of size $\eta$ in the direction opposite to the gradient.
• Weight Update: $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
• Where $\eta$ is the learning rate
3. Return Weights (model)
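The loop above can be sketched on a one-parameter toy problem. Everything specific here is an assumption for illustration: the cost $J(w) = (w - 3)^2$, the fixed starting point (used instead of a random draw so the run is reproducible), and the convergence test on the gradient magnitude.

```python
def gradient_descent(grad, w0, eta=0.1, tol=1e-10, max_iters=10_000):
    """Repeat w <- w - eta * dJ/dw until the gradient is close to zero."""
    w = w0
    for _ in range(max_iters):
        g = grad(w)
        if abs(g) < tol:          # simple convergence criterion (assumption)
            break
        w = w - eta * g           # step against the gradient
    return w

# Toy cost J(w) = (w - 3)^2 with gradient dJ/dw = 2 (w - 3); its minimum is at w = 3.
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_star)
```

With $\eta = 0.1$ each step shrinks the distance to the minimum by a factor of 0.8. A learning rate above 1.0 would make this particular problem diverge, which is why $\eta$ has to be tuned.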
Stochastic Gradient Descent
1. Randomly initialize weights $\sim \mathcal{N}(0, \sigma^2)$
2. Repeat until convergence:
• Choose a batch of $B$ data points.
• Compute the gradient $\frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}$
• Take a step of size $\eta$ in the direction opposite to the gradient.
• Weight Update: $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
• Where $\eta$ is the learning rate
3. Return Weights (model)
Note: Mini-batch SGD is faster to compute than vanilla GD and more accurate than single-sample SGD (a better estimate of the true gradient).
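A minimal sketch of mini-batch SGD, assuming a one-parameter linear model $f(x) = w x$ with a squared-error loss. The data set, batch size, learning rate, and number of epochs are hypothetical, and a fixed number of epochs stands in for "repeat until convergence".

```python
import random

def minibatch_sgd(xs, ys, eta=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x by mini-batch SGD on the squared error."""
    rng = random.Random(seed)
    w = rng.gauss(0.0, 0.1)                 # step 1: w ~ N(0, sigma^2)
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                # new random batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average of the per-sample gradients d/dw (w*x - y)^2 over the batch.
            grad = sum(2.0 * (w * xs[k] - ys[k]) * xs[k] for k in batch) / len(batch)
            w -= eta * grad                 # weight update
    return w

# Hypothetical noise-free data generated from y = 2x.
xs = [0.1 * k for k in range(1, 21)]
ys = [2.0 * x for x in xs]
w_fit = minibatch_sgd(xs, ys)
print(w_fit)
```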
Backpropagation

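[Figure: computation graph $x \rightarrow z_1 \rightarrow \hat{y} \rightarrow J(W)$.]

• The Chain Rule:
$$\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$
• Do this for every weight in the network.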
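The chain rule can be verified on a tiny network. The concrete choices below are assumptions for the sake of a runnable example: $z_1 = w_1 x$, $\hat{y} = \sigma(z_1)$, and $J = \frac{1}{2}(\hat{y} - y)^2$. The analytic gradient is compared against a central finite difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, x, y):
    """J = 0.5 * (sigmoid(w1 * x) - y)^2 (assumed toy network)."""
    return 0.5 * (sigmoid(w1 * x) - y) ** 2

def grad_w1(w1, x, y):
    """dJ/dw1 as the product of the three chain-rule factors."""
    z1 = w1 * x
    y_hat = sigmoid(z1)
    dJ_dyhat = y_hat - y                  # dJ/dy_hat
    dyhat_dz1 = y_hat * (1.0 - y_hat)     # dy_hat/dz1
    dz1_dw1 = x                           # dz1/dw1
    return dJ_dyhat * dyhat_dz1 * dz1_dw1

# Hypothetical weight, input, and target.
w1, x, y = 0.3, 1.5, 1.0
eps = 1e-6
numeric = (loss(w1 + eps, x, y) - loss(w1 - eps, x, y)) / (2.0 * eps)
analytic = grad_w1(w1, x, y)
print(analytic, numeric)
```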
Architecture Design
• Another choice that has to be made is the architecture of the
network, that is the depth (number of layers) and the width (number
of nodes per layer).
• However, the ideal architecture can only be found through
experimentation and observing the validation error.
MLP Issues
• Training set size: the number of training samples should be 5 to 10 times larger than the number of weights in the network.
• Overfitting: mostly happens when the network is much larger than the number of training samples warrants (i.e., the network has enough weights to memorize the whole training input [low training loss] while lacking generalization on the test set [high validation loss]).
• Black-box syndrome: lack of explainability (the subject of the Explainable AI (XAI) research area).
• Vanishing gradient: backpropagation produces partial derivatives that become much smaller the further back we go through the layers (a consequence of the chain rule).
Backpropagation Example

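[Figure: a 2-2-2 feedforward network. Input layer: $x_1$, $x_2$. Hidden layer: $h_1$, $h_2$, with pre-activations $Z_{h_1}$, $Z_{h_2}$ and bias $b_1$. Output layer: $y_1$, $y_2$, with bias $b_2$. Weights $w_1, \dots, w_4$ connect the input layer to the hidden layer, and weights $w_5, \dots, w_8$ connect the hidden layer to the output layer.]

Credits: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example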
Backpropagation Example
Activation Function: Sigmoid for all units
Learning Rate ($\eta$): 0.5
Inputs: $x_1 = 0.05$, $x_2 = 0.1$
Targets: $Target_1 = 0.01$, $Target_2 = 0.99$
Input-to-hidden weights: $w_1 = 0.15$, $w_2 = 0.2$, $w_3 = 0.25$, $w_4 = 0.3$
Hidden-to-output weights: $w_5 = 0.4$, $w_6 = 0.45$, $w_7 = 0.5$, $w_8 = 0.55$
Biases: $b_1 = 0.35$ (hidden layer), $b_2 = 0.6$ (output layer)


Forward Pass – Hidden Units
$$Z_{h_1} = x_1 w_1 + x_2 w_2 + b_1 = 0.05 \times 0.15 + 0.1 \times 0.2 + 0.35 = 0.3775$$
$$h_1 = \sigma(Z_{h_1}) = \frac{1}{1 + e^{-Z_{h_1}}} = \frac{1}{1 + e^{-0.3775}} = 0.59327$$

$$Z_{h_2} = x_1 w_3 + x_2 w_4 + b_1 = 0.05 \times 0.25 + 0.1 \times 0.3 + 0.35 = 0.3925$$
$$h_2 = \sigma(Z_{h_2}) = \frac{1}{1 + e^{-Z_{h_2}}} = \frac{1}{1 + e^{-0.3925}} = 0.59688$$

Sigmoid Calculator: https://www.tinkershop.net/ml/sigmoid_calculator.html


Forward Pass – Output Units
$$Z_{y_1} = h_1 w_5 + h_2 w_6 + b_2 = 0.59327 \times 0.4 + 0.59688 \times 0.45 + 0.6 = 1.105904$$
$$y_1 = \sigma(Z_{y_1}) = \frac{1}{1 + e^{-Z_{y_1}}} = \frac{1}{1 + e^{-1.105904}} = 0.751365$$

$$Z_{y_2} = h_1 w_7 + h_2 w_8 + b_2 = 0.59327 \times 0.5 + 0.59688 \times 0.55 + 0.6 = 1.224919$$
$$y_2 = \sigma(Z_{y_2}) = \frac{1}{1 + e^{-Z_{y_2}}} = \frac{1}{1 + e^{-1.224919}} = 0.772928$$
Error/Cost Function
$$E_y = \frac{1}{2}(Target - Predicted)^2$$
$$E_{Total} = E_{y_1} + E_{y_2}$$
$$E_{Total} = \frac{1}{2}(0.01 - 0.751365)^2 + \frac{1}{2}(0.99 - 0.772928)^2 = 0.274811 + 0.02356 = 0.2983711$$
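The forward pass and the total error of the worked example can be reproduced with a short script. The variable names are illustrative; the numbers are the inputs, weights, biases, and targets given in the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values taken from the worked example.
x1, x2 = 0.05, 0.10
t1, t2 = 0.01, 0.99
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60

# Hidden units.
h1 = sigmoid(x1 * w1 + x2 * w2 + b1)
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)

# Output units.
y1 = sigmoid(h1 * w5 + h2 * w6 + b2)
y2 = sigmoid(h1 * w7 + h2 * w8 + b2)

# Total squared error.
e_total = 0.5 * (t1 - y1) ** 2 + 0.5 * (t2 - y2) ** 2
print(h1, h2, y1, y2, e_total)
```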
Backward Pass – I
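• To update each weight, we use the backpropagation algorithm, where we calculate the partial derivative of the total error $E_{Total}$ with respect to each weight.
• For example, to update weight $w_5$, we calculate $\frac{\partial E_{Total}}{\partial w_5}$ using the chain rule of calculus as follows:
$$\frac{\partial E_{Total}}{\partial w_5} = \frac{\partial E_{Total}}{\partial y_1} \cdot \frac{\partial y_1}{\partial Z_{y_1}} \cdot \frac{\partial Z_{y_1}}{\partial w_5}$$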
Derivation:

Derivatives – I
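$$E_y = \frac{1}{2}(Target - Predicted)^2, \qquad E_{Total} = E_{y_1} + E_{y_2}$$
Only $E_{y_1}$ depends on $y_1$, so differentiating $\frac{1}{2}(Target_1 - y_1)^2$ gives:
$$\frac{\partial E_{Total}}{\partial y_1} = -(Target_1 - y_1) = y_1 - Target_1 = 0.751365 - 0.01 = 0.741365$$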
Derivatives – II
$$y_1 = \sigma(Z_{y_1}) = \frac{1}{1 + e^{-Z_{y_1}}}$$
$$\frac{\partial y_1}{\partial Z_{y_1}} = \sigma(Z_{y_1})\,(1 - \sigma(Z_{y_1})) = y_1 (1 - y_1) = 0.751365\,(1 - 0.751365) = 0.186815$$
Derivation:

Derivatives – III
$$Z_{y_1} = h_1 w_5 + h_2 w_6 + b_2$$
$$\frac{\partial Z_{y_1}}{\partial w_5} = h_1 = 0.59327$$
Combined
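$$\frac{\partial E_{Total}}{\partial w_5} = \frac{\partial E_{Total}}{\partial y_1} \cdot \frac{\partial y_1}{\partial Z_{y_1}} \cdot \frac{\partial Z_{y_1}}{\partial w_5} = 0.741365 \times 0.186815 \times 0.59327 = 0.082167$$

$$w_5' = w_5 - \eta \frac{\partial E_{Total}}{\partial w_5} = 0.4 - 0.5 \times 0.082167 = 0.358916$$
Similarly:
$w_6 = 0.45 \rightarrow w_6' = 0.40866$
$w_7 = 0.50 \rightarrow w_7' = 0.51130$
$w_8 = 0.55 \rightarrow w_8' = 0.56137$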
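The output-layer updates can be reproduced as follows. The script repeats the forward pass so that it runs on its own. The names `delta_y1` and `delta_y2` are illustrative shorthand for the product $\frac{\partial E_{Total}}{\partial y} \cdot \frac{\partial y}{\partial Z_y}$ and do not appear in the example itself.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values taken from the worked example.
x1, x2 = 0.05, 0.10
t1, t2 = 0.01, 0.99
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2, eta = 0.35, 0.60, 0.5

# Forward pass.
h1 = sigmoid(x1 * w1 + x2 * w2 + b1)
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)
y1 = sigmoid(h1 * w5 + h2 * w6 + b2)
y2 = sigmoid(h1 * w7 + h2 * w8 + b2)

# dE/dy * dy/dZ for each output unit.
delta_y1 = (y1 - t1) * y1 * (1.0 - y1)
delta_y2 = (y2 - t2) * y2 * (1.0 - y2)

# dZ/dw is the hidden activation feeding each weight.
grad_w5 = delta_y1 * h1
w5_new = w5 - eta * grad_w5
w6_new = w6 - eta * delta_y1 * h2
w7_new = w7 - eta * delta_y2 * h1
w8_new = w8 - eta * delta_y2 * h2
print(grad_w5, w5_new, w6_new, w7_new, w8_new)
```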
Backward Pass – II
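• To calculate the gradient for $w_1$, we use the chain rule as before:
$$\frac{\partial E_{Total}}{\partial w_1} = \frac{\partial E_{Total}}{\partial h_1} \cdot \frac{\partial h_1}{\partial Z_{h_1}} \cdot \frac{\partial Z_{h_1}}{\partial w_1}$$
Where:
$$\frac{\partial E_{Total}}{\partial h_1} = \frac{\partial E_{y_1}}{\partial h_1} + \frac{\partial E_{y_2}}{\partial h_1}$$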
Derivatives – I
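$$\frac{\partial E_{Total}}{\partial h_1} = \frac{\partial E_{y_1}}{\partial h_1} + \frac{\partial E_{y_2}}{\partial h_1}$$
For the $y_1$ branch:
$$\frac{\partial E_{y_1}}{\partial h_1} = \frac{\partial E_{y_1}}{\partial y_1} \cdot \frac{\partial y_1}{\partial Z_{y_1}} \cdot \frac{\partial Z_{y_1}}{\partial h_1}$$
$$\frac{\partial E_{y_1}}{\partial y_1} = y_1 - Target_1 = 0.751365 - 0.01 = 0.741365$$
$$\frac{\partial y_1}{\partial Z_{y_1}} = y_1 (1 - y_1) = 0.751365\,(1 - 0.751365) = 0.186815$$
$$Z_{y_1} = h_1 w_5 + h_2 w_6 + b_2 \;\Rightarrow\; \frac{\partial Z_{y_1}}{\partial h_1} = w_5 = 0.4$$
$$\frac{\partial E_{y_1}}{\partial h_1} = 0.741365 \times 0.186815 \times 0.4 = 0.055399$$
For the $y_2$ branch:
$$\frac{\partial E_{y_2}}{\partial h_1} = \frac{\partial E_{y_2}}{\partial y_2} \cdot \frac{\partial y_2}{\partial Z_{y_2}} \cdot \frac{\partial Z_{y_2}}{\partial h_1}$$
$$\frac{\partial E_{y_2}}{\partial y_2} = y_2 - Target_2 = 0.772928 - 0.99 = -0.217072$$
$$\frac{\partial y_2}{\partial Z_{y_2}} = y_2 (1 - y_2) = 0.772928\,(1 - 0.772928) = 0.17551$$
$$Z_{y_2} = h_1 w_7 + h_2 w_8 + b_2 \;\Rightarrow\; \frac{\partial Z_{y_2}}{\partial h_1} = w_7 = 0.5$$
$$\frac{\partial E_{y_2}}{\partial h_1} = -0.217072 \times 0.17551 \times 0.5 = -0.019049$$
Adding the two branches:
$$\frac{\partial E_{Total}}{\partial h_1} = 0.055399 + (-0.019049) = 0.03635$$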
Derivatives – II
$$h_1 = \sigma(Z_{h_1}) = \frac{1}{1 + e^{-Z_{h_1}}}$$
$$\frac{\partial h_1}{\partial Z_{h_1}} = \sigma(Z_{h_1})\,(1 - \sigma(Z_{h_1})) = h_1 (1 - h_1) = 0.59327\,(1 - 0.59327) = 0.2413$$
Derivatives – III
$$Z_{h_1} = x_1 w_1 + x_2 w_2 + b_1$$
$$\frac{\partial Z_{h_1}}{\partial w_1} = x_1 = 0.05$$
Combined
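$$\frac{\partial E_{Total}}{\partial w_1} = \frac{\partial E_{Total}}{\partial h_1} \cdot \frac{\partial h_1}{\partial Z_{h_1}} \cdot \frac{\partial Z_{h_1}}{\partial w_1} = 0.03635 \times 0.2413 \times 0.05 = 0.00043856$$

$$w_1' = w_1 - \eta \frac{\partial E_{Total}}{\partial w_1} = 0.15 - 0.5 \times 0.00043856 = 0.14978$$
Similarly:
$w_2 = 0.20 \rightarrow w_2' = 0.19956$
$w_3 = 0.25 \rightarrow w_3' = 0.24975$
$w_4 = 0.30 \rightarrow w_4' = 0.29950$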
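The hidden-layer updates follow the same pattern, with one extra step: each hidden activation receives error from both output units. The sketch below repeats the forward pass so that it is self-contained. Note that the gradients use the original values of $w_5, \dots, w_8$, not the updated ones. Variable names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values taken from the worked example.
x1, x2 = 0.05, 0.10
t1, t2 = 0.01, 0.99
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2, eta = 0.35, 0.60, 0.5

# Forward pass.
h1 = sigmoid(x1 * w1 + x2 * w2 + b1)
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)
y1 = sigmoid(h1 * w5 + h2 * w6 + b2)
y2 = sigmoid(h1 * w7 + h2 * w8 + b2)

# Error signal at each output unit: dE/dy * dy/dZ.
delta_y1 = (y1 - t1) * y1 * (1.0 - y1)
delta_y2 = (y2 - t2) * y2 * (1.0 - y2)

# dE_total/dh sums the contributions from both output units.
dE_dh1 = delta_y1 * w5 + delta_y2 * w7
dE_dh2 = delta_y1 * w6 + delta_y2 * w8

# Multiply by dh/dZ_h and then by dZ_h/dw (the input feeding the weight).
delta_h1 = dE_dh1 * h1 * (1.0 - h1)
delta_h2 = dE_dh2 * h2 * (1.0 - h2)
w1_new = w1 - eta * delta_h1 * x1
w2_new = w2 - eta * delta_h1 * x2
w3_new = w3 - eta * delta_h2 * x1
w4_new = w4 - eta * delta_h2 * x2
print(dE_dh1, w1_new, w2_new, w3_new, w4_new)
```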
Forward Pass – After Updates
Updated weights:
$w_1 = 0.15 \rightarrow w_1' = 0.14978$
$w_2 = 0.20 \rightarrow w_2' = 0.19956$
$w_3 = 0.25 \rightarrow w_3' = 0.24975$
$w_4 = 0.30 \rightarrow w_4' = 0.29950$
$w_5 = 0.40 \rightarrow w_5' = 0.35891$
$w_6 = 0.45 \rightarrow w_6' = 0.40866$
$w_7 = 0.50 \rightarrow w_7' = 0.51130$
$w_8 = 0.55 \rightarrow w_8' = 0.56137$

Using the same inputs $x_1 = 0.05$, $x_2 = 0.1$ and biases $b_1 = 0.35$, $b_2 = 0.60$:

Hidden units:
$$Z_{h_1} = x_1 w_1 + x_2 w_2 + b_1 = 0.05 \times 0.14978 + 0.1 \times 0.19956 + 0.35 = 0.377445$$
$$h_1 = \sigma(Z_{h_1}) = \frac{1}{1 + e^{-Z_{h_1}}} = \frac{1}{1 + e^{-0.377445}} = 0.59325$$
$$Z_{h_2} = x_1 w_3 + x_2 w_4 + b_1 = 0.05 \times 0.24975 + 0.1 \times 0.2995 + 0.35 = 0.392437$$
$$h_2 = \sigma(Z_{h_2}) = \frac{1}{1 + e^{-Z_{h_2}}} = \frac{1}{1 + e^{-0.392437}} = 0.59686$$

Output units:
$$Z_{y_1} = h_1 w_5 + h_2 w_6 + b_2 = 0.59325 \times 0.35891 + 0.59686 \times 0.40866 + 0.6 = 1.056836$$
$$y_1 = \sigma(Z_{y_1}) = \frac{1}{1 + e^{-Z_{y_1}}} = \frac{1}{1 + e^{-1.056836}} = 0.742085$$
$$Z_{y_2} = h_1 w_7 + h_2 w_8 + b_2 = 0.59325 \times 0.5113 + 0.59686 \times 0.56137 + 0.6 = 1.238388$$
$$y_2 = \sigma(Z_{y_2}) = \frac{1}{1 + e^{-Z_{y_2}}} = \frac{1}{1 + e^{-1.238388}} = 0.775283$$

Error (same definition as in the first pass):
$$E_y = \frac{1}{2}(Target - Predicted)^2, \qquad E_{Total} = E_{y_1} + E_{y_2}$$
$$E_{Total} = \frac{1}{2}(0.01 - 0.742085)^2 + \frac{1}{2}(0.99 - 0.775283)^2 = 0.26797 + 0.02305 = 0.29102$$
The error decreased from 0.2983711 to 0.29102 after one pass.
Using Matrices
The Backpropagation Algorithm

1. Input $x$:
Set the corresponding activation $a^0$ for the input layer.
2. Feedforward:
For each $l = 1, 2, 3, \dots, L$ compute $z^l = W^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
3. Output error $\delta^L$:
Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
4. Backpropagate the error:
For each $l = L-1, L-2, \dots, 1$ compute $\delta^l = \left((W^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$.
5. Output:
The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.
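The five steps can be sketched with NumPy for a network of any depth. This sketch assumes sigmoid activations in every layer, a quadratic cost $C = \frac{1}{2}\lVert a^L - target \rVert^2$ (so that $\nabla_a C = a^L - target$), column-vector inputs, and one bias per unit. The function names and the randomly generated test network are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights, biases):
    """Steps 1-2: return the activations [a^0, a^1, ..., a^L]."""
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations

def cost(x, target, weights, biases):
    a_L = feedforward(x, weights, biases)[-1]
    return 0.5 * float(np.sum((a_L - target) ** 2))

def backprop(x, target, weights, biases):
    """Steps 3-5: return the gradients (dC/dW^l, dC/db^l) for every layer."""
    activations = feedforward(x, weights, biases)
    n_layers = len(weights)
    grad_w = [None] * n_layers
    grad_b = [None] * n_layers
    a_L = activations[-1]
    # Step 3: output error; for the sigmoid, sigma'(z) = a * (1 - a).
    delta = (a_L - target) * a_L * (1.0 - a_L)
    for l in range(n_layers - 1, -1, -1):
        # Step 5: gradients for the layer whose error is `delta`.
        grad_w[l] = delta @ activations[l].T
        grad_b[l] = delta
        if l > 0:
            # Step 4: propagate the error one layer back.
            a = activations[l]
            delta = (weights[l].T @ delta) * a * (1.0 - a)
    return grad_w, grad_b

# Hypothetical 2-3-2 network with random parameters.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(2, 3))]
biases = [rng.normal(size=(3, 1)), rng.normal(size=(2, 1))]
x = np.array([[0.05], [0.10]])
target = np.array([[0.01], [0.99]])
grad_w, grad_b = backprop(x, target, weights, biases)
print([g.shape for g in grad_w])
```

Comparing an entry of `grad_w` with a finite-difference estimate of `cost` is a quick way to catch indexing mistakes in an implementation like this.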
Forward Pass – Hidden Units

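$$X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0.05 \\ 0.1 \end{bmatrix}, \qquad W^1 = \begin{bmatrix} w_1 & w_2 \\ w_3 & w_4 \end{bmatrix} = \begin{bmatrix} 0.15 & 0.2 \\ 0.25 & 0.3 \end{bmatrix}$$

$$Z^1 = W^1 X + b_1 = \begin{bmatrix} 0.15 & 0.2 \\ 0.25 & 0.3 \end{bmatrix} \begin{bmatrix} 0.05 \\ 0.1 \end{bmatrix} + 0.35 = \begin{bmatrix} 0.3775 \\ 0.3925 \end{bmatrix}$$

$$H = \sigma(Z^1) = \sigma\left(\begin{bmatrix} 0.3775 \\ 0.3925 \end{bmatrix}\right) = \begin{bmatrix} 0.59327 \\ 0.59688 \end{bmatrix}$$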
Forward Pass – Output Units

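$$H = \begin{bmatrix} h_1 \\ h_2 \end{bmatrix} = \begin{bmatrix} 0.59327 \\ 0.59688 \end{bmatrix}, \qquad W^2 = \begin{bmatrix} w_5 & w_6 \\ w_7 & w_8 \end{bmatrix} = \begin{bmatrix} 0.4 & 0.45 \\ 0.5 & 0.55 \end{bmatrix}$$

$$Z^2 = W^2 H + b_2 = \begin{bmatrix} 0.4 & 0.45 \\ 0.5 & 0.55 \end{bmatrix} \begin{bmatrix} 0.59327 \\ 0.59688 \end{bmatrix} + 0.6 = \begin{bmatrix} 1.105904 \\ 1.224919 \end{bmatrix}$$

$$Y = \sigma(Z^2) = \sigma\left(\begin{bmatrix} 1.105904 \\ 1.224919 \end{bmatrix}\right) = \begin{bmatrix} 0.751365 \\ 0.772928 \end{bmatrix}$$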
Backward Pass – I
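• $\delta^L = \nabla_a C \odot \sigma'(z^L)$, which for this network ($L = 2$) gives $\delta^2 = \nabla_a C \odot \sigma'(z^2)$
• $\nabla_a C = Y - Target = \begin{bmatrix} 0.751365 \\ 0.772928 \end{bmatrix} - \begin{bmatrix} 0.01 \\ 0.99 \end{bmatrix} = \begin{bmatrix} 0.741365 \\ -0.217072 \end{bmatrix}$
• $\sigma'(z^2) = Y \odot (1 - Y) = \begin{bmatrix} 0.751365 \\ 0.772928 \end{bmatrix} \odot \begin{bmatrix} 0.248635 \\ 0.227072 \end{bmatrix} = \begin{bmatrix} 0.18681 \\ 0.17551 \end{bmatrix}$
• $\delta^2 = \begin{bmatrix} 0.741365 \\ -0.217072 \end{bmatrix} \odot \begin{bmatrix} 0.18681 \\ 0.17551 \end{bmatrix} = \begin{bmatrix} 0.13849 \\ -0.03809 \end{bmatrix}$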
Backward Pass – II
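• $\delta^l = \left((W^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$, which for $l = 1$ gives $\delta^1 = \left((W^2)^T \delta^2\right) \odot \sigma'(z^1)$
• $\sigma'(z^1) = H \odot (1 - H) = \begin{bmatrix} 0.59327 \\ 0.59688 \end{bmatrix} \odot \begin{bmatrix} 0.40673 \\ 0.40312 \end{bmatrix} = \begin{bmatrix} 0.2413 \\ 0.24061 \end{bmatrix}$
• $\delta^1 = \left(\begin{bmatrix} 0.4 & 0.45 \\ 0.5 & 0.55 \end{bmatrix}^T \begin{bmatrix} 0.13849 \\ -0.03809 \end{bmatrix}\right) \odot \begin{bmatrix} 0.2413 \\ 0.24061 \end{bmatrix}$
• $\delta^1 = \left(\begin{bmatrix} 0.4 & 0.5 \\ 0.45 & 0.55 \end{bmatrix} \begin{bmatrix} 0.13849 \\ -0.03809 \end{bmatrix}\right) \odot \begin{bmatrix} 0.2413 \\ 0.24061 \end{bmatrix}$
• $\delta^1 = \begin{bmatrix} 0.036351 \\ 0.041371 \end{bmatrix} \odot \begin{bmatrix} 0.2413 \\ 0.24061 \end{bmatrix} = \begin{bmatrix} 0.00877 \\ 0.00995 \end{bmatrix}$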
Backward Pass – III
𝜕𝐶 𝜕𝐶
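• $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$, which in matrix form gives $\frac{\partial C}{\partial W^2} = \delta^2 (a^1)^T = \delta^2 H^T$
• $\frac{\partial C}{\partial W^2} = \begin{bmatrix} 0.13849 \\ -0.03809 \end{bmatrix} \begin{bmatrix} 0.59327 & 0.59688 \end{bmatrix} = \begin{bmatrix} 0.08216 & 0.08266 \\ -0.02259 & -0.02273 \end{bmatrix}$

• $\frac{\partial C}{\partial W^1} = \delta^1 (a^0)^T = \delta^1 X^T$
• $\frac{\partial C}{\partial W^1} = \begin{bmatrix} 0.00877 \\ 0.00995 \end{bmatrix} \begin{bmatrix} 0.05 & 0.1 \end{bmatrix} = \begin{bmatrix} 0.0004385 & 0.000877 \\ 0.0004975 & 0.000995 \end{bmatrix}$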
Weight Updates
𝜕𝐶 𝜕𝐶
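• $W^l = W^l - \eta \frac{\partial C}{\partial W^l}$, for example $W^2 = W^2 - \eta \frac{\partial C}{\partial W^2}$
• $W^2 = \begin{bmatrix} 0.4 & 0.45 \\ 0.5 & 0.55 \end{bmatrix} - 0.5 \begin{bmatrix} 0.08216 & 0.08266 \\ -0.02259 & -0.02273 \end{bmatrix} = \begin{bmatrix} 0.35892 & 0.40867 \\ 0.51129 & 0.56136 \end{bmatrix}$
• $W^1 = \begin{bmatrix} 0.15 & 0.2 \\ 0.25 & 0.3 \end{bmatrix} - 0.5 \begin{bmatrix} 0.0004385 & 0.000877 \\ 0.0004975 & 0.000995 \end{bmatrix} = \begin{bmatrix} 0.14978 & 0.19956 \\ 0.24975 & 0.29950 \end{bmatrix}$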
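The matrix form of the whole example fits in a few NumPy lines, which makes it easy to re-check the numbers. As in the worked example, only the weights are updated and each layer shares a single scalar bias. The helper `forward` and the variable names are illustrative. Small differences in the last decimal place compared with the hand calculation come from rounding the intermediate values there.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Values taken from the worked example, as column vectors and matrices.
X = np.array([[0.05], [0.10]])
T = np.array([[0.01], [0.99]])
W1 = np.array([[0.15, 0.20], [0.25, 0.30]])
W2 = np.array([[0.40, 0.45], [0.50, 0.55]])
b1, b2, eta = 0.35, 0.60, 0.5

def forward(W1, W2):
    H = sigmoid(W1 @ X + b1)
    Y = sigmoid(W2 @ H + b2)
    return H, Y

H, Y = forward(W1, W2)
delta2 = (Y - T) * Y * (1 - Y)            # output error
delta1 = (W2.T @ delta2) * H * (1 - H)    # error propagated to the hidden layer

W2_new = W2 - eta * (delta2 @ H.T)        # dC/dW2 = delta2 H^T
W1_new = W1 - eta * (delta1 @ X.T)        # dC/dW1 = delta1 X^T

error_before = 0.5 * np.sum((T - Y) ** 2)
_, Y_new = forward(W1_new, W2_new)
error_after = 0.5 * np.sum((T - Y_new) ** 2)
print(W1_new, W2_new, error_before, error_after)
```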
References
• http://neuralnetworksanddeeplearning.com/chap2.html
• https://arxiv.org/pdf/1802.01528
• https://www.deeplearningbook.org/