CS460 - Deep Learning - W02 & W03

The document provides an overview of Artificial Neural Networks (ANNs), detailing their architecture, training methods, and activation functions. It discusses the importance of choosing appropriate hidden and output units based on the task, as well as the optimization techniques used to minimize error during training. Key concepts such as backpropagation, cost functions, and issues related to model training are also highlighted.

Uploaded by

Abdelrhman Adel


CS460 – Deep Learning

W02 & W03

Artificial Neural Networks Foundation

TA: Mahmoud ElMorshedy

Artificial Neural Networks (ANNs)
• An artificial neural network (ANN) is a data processing paradigm modelled after how the human brain processes information.
• ANNs can model non-linear relationships between data samples and targets, which makes them a powerful candidate for many machine learning tasks.
Architecture
• The network is a feedforward network: data flows through it in one direction, from the input layer to the output layer (forward).
• The width and depth of the network are hyperparameters.
• The basic architecture of an ANN consists of:
• An input layer (with 𝑛 nodes corresponding to 𝑛-dimensional input vector)
• One or more hidden layers (with one or more hidden nodes per layer)
• An output layer (with one or more nodes)
Training – I
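• The network is trained using the popular backpropagation algorithm.
• The main idea of the algorithm is:
1. The network's output is calculated using forward propagation:
$$\hat{y} = g\left(b_0 + \sum_{i=1}^{m} x_i w_i\right)$$
Can be expressed as:
$$\hat{y} = g\left(b_0 + X^T W\right)$$
Where:
$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}, \quad W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}$$
[Figure: a single neuron. The inputs $x_1, \dots, x_m$ are multiplied by the weights $w_1, \dots, w_m$, summed ($\sum$) together with the bias $b_0$ (weight on a constant input of 1), and passed through a non-linear activation function to produce $\hat{y}$.]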
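The forward-propagation formula above can be sketched in a few lines of Python. This is only an illustration: the function name `neuron_forward`, the choice of the sigmoid as $g$, and the sample inputs are assumptions, not part of the slides.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, used here as one possible choice for g."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_forward(x, w, b0, g=sigmoid):
    """Compute y_hat = g(b0 + sum_i x_i * w_i) for a single neuron."""
    z = b0 + sum(xi * wi for xi, wi in zip(x, w))
    return g(z)


# Hypothetical inputs and weights chosen so that z = 0.
y_hat = neuron_forward(x=[1.0, 2.0], w=[0.5, -0.25], b0=0.0)
print(y_hat)
```

Any other activation can be passed through the `g` argument, for example `neuron_forward(x, w, b0, g=math.tanh)`.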
Hidden Unit Types (Activation Functions)
• The choice of hidden units also affects the training of the model.
• A hidden unit, in general, takes an input vector $x$ and calculates an affine transformation of the given input vector, $z = b_0 + X^T W$, then applies a nonlinear activation function $g(z)$:
$$\hat{y} = g(z)$$
• The use of a non-linear activation function is to introduce non-linearity in the network. [https://playground.tensorflow.org/]
• For multiple hidden units and layers:
$$z_i = b_{0,i}^{(1)} + X^T W_i^{(1)}$$
$$\hat{y}_i = g\left(b_{0,i}^{(2)} + Z^T W_i^{(2)}\right)$$
[Figure: a hidden unit. Inputs $x_1, \dots, x_m$ with weights $w_1, \dots, w_m$ and bias $b_0$ produce $z$, which is passed through $g$.]
Rectified Linear Unit (ReLU) Function
• There is no consensus on which activation function to use; however, a good default is the Rectified Linear Unit (ReLU), given by:
$$g(z) = \max\{0, z\}$$
Sigmoidal Functions

• Prior to ReLU, sigmoidal functions such as the Logistic Sigmoid and Hyperbolic Tangent functions were used.
• Such functions suffer from a saturation problem that produces low gradient values, which slows learning.
Sigmoid Function:
$$g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$$
Hyperbolic Tangent (tanh) Function:
$$g(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Derivatives

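ReLU Function:
$$g(z) = \max(0, z), \qquad g'(z) = \begin{cases} 1, & z > 0 \\ 0, & \text{otherwise} \end{cases}$$
Sigmoid Function:
$$g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad g'(z) = g(z)\,(1 - g(z))$$
Hyperbolic Tangent Function:
$$g(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad g'(z) = 1 - g(z)^2$$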
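The three activation functions and their derivatives can be checked numerically. The sketch below compares each closed-form derivative with a central finite difference; the helper `numeric_derivative` and the test points are illustrative assumptions.

```python
import math


def relu(z):
    return max(0.0, z)


def relu_prime(z):
    # Derivative is 1 for z > 0 and 0 otherwise (the convention used on the slide).
    return 1.0 if z > 0 else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2


def numeric_derivative(f, z, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)


for z in (-2.0, 0.3, 1.5):
    print(z, sigmoid_prime(z), numeric_derivative(sigmoid, z))
    print(z, tanh_prime(z), numeric_derivative(math.tanh, z))

# Saturation: far from zero the sigmoid gradient becomes very small.
print(sigmoid_prime(10.0))
```

The last line illustrates the saturation problem: the sigmoid derivative peaks at 0.25 for $z = 0$ and shrinks quickly as $|z|$ grows.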
Output Units
• The choice of an output unit type depends on the task at hand and affects the choice of the cost function, for example:
• For regression tasks, linear output units are suitable for the output layer.
• Given the hidden features from the last hidden layer $h = f(x; \theta)$, the linear output units produce an output vector $\hat{y} = W^T h + b$.
• A proper cost function for such units is the Mean-Squared Error (MSE).
• For binary classification tasks, sigmoid output units are used.
• The produced output is $\hat{y} = \sigma(W^T h + b)$, where $\sigma$ is the logistic sigmoid function.
• For multiclass classification problems, we use softmax output units.
• The required output vector becomes $\hat{y}_i = P(y = i \mid x)$, such that $\hat{y}_i \in [0, 1]$ and the entire output vector sums to 1:
$$\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$
where $z_i = \log \tilde{P}(y = i \mid x)$ is the unnormalized log-probability of class $i$.
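A minimal sketch of the softmax formula follows. Subtracting the maximum logit before exponentiating is a common numerical-stability trick that is not on the slide; it does not change the result because the factor cancels between numerator and denominator. The input values are made up.

```python
import math


def softmax(z):
    """Return exp(z_i) / sum_j exp(z_j) for every entry of z."""
    m = max(z)  # stability shift (an addition, not part of the slide's formula)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical unnormalized log-probabilities
print(probs, sum(probs))
```

Each entry of `probs` lies in $[0, 1]$ and the entries sum to 1, matching the requirement on $\hat{y}$.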
Training – II
2. The parameters of the network (weights and biases) are updated by backpropagating the error, measured using some cost function between the network's output and the actual targets of the dataset, such that the error is minimized.
• Cost function:
$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f(x^{(i)}; W),\ y^{(i)}\right)$$
• For example, $\mathcal{L}$ can be the (binary) Cross-Entropy function for classification problems:
$$J(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log f(x^{(i)}; W) + (1 - y^{(i)}) \log\left(1 - f(x^{(i)}; W)\right) \right]$$
• Or it can be the Mean-Squared Error (MSE) for regression problems:
$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}; W) \right)^2$$
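The two example cost functions can be written directly from their definitions. In this sketch the cross-entropy carries the conventional leading minus sign so that the loss is non-negative; the function names and sample values are assumptions for illustration.

```python
import math


def binary_cross_entropy(y_true, y_pred):
    """Mean binary cross-entropy; y_pred entries must lie strictly in (0, 1)."""
    n = len(y_true)
    total = sum(
        y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
        for y, p in zip(y_true, y_pred)
    )
    return -total / n


def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


print(binary_cross_entropy([1, 0], [0.5, 0.5]))  # equals log(2)
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
```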
Cost Function
• In most cases, the network defines a distribution $P(y \mid x; \theta)$.
• So, to train the neural network we employ the maximum likelihood principle: the cost function is the negative log-likelihood (i.e., the cross-entropy) between the training data and the model distribution.
• A general form of the cross-entropy loss function is given by:
$$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{data}} \log p_{model}(y \mid x)$$
• The choice of a cost function affects the learning ability of a neural network model, especially for cost functions that saturate: they make the gradient very small, which is not ideal for gradient-based learning.
Cost Function Optimization
• The network is optimized using gradient-descent-based algorithms such as Stochastic Gradient Descent (SGD), or adaptive learning-rate methods (Adam, Adagrad, RMSprop, …), to find the set of parameters that gives the minimum loss.
$$W^* = \underset{W}{\operatorname{argmin}}\ \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f(x^{(i)}; W),\ y^{(i)}\right)$$
$$W^* = \underset{W}{\operatorname{argmin}}\ J(W)$$
Gradient Descent Algorithm
1. Randomly initialize weights $\sim \mathcal{N}(0, \sigma^2)$
2. Repeat until convergence:
• Compute the gradient $\frac{\partial J(W)}{\partial W}$
• Take a step of size $\eta$ in the direction opposite to the gradient.
• Weight update: $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
• Where $\eta$ is the learning rate
3. Return weights (model)
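The loop above can be sketched on a one-dimensional toy problem. The objective $J(w) = (w - 3)^2$, the fixed starting point (used instead of a random draw so the run is reproducible), and the stopping tolerance are all assumptions made for this illustration.

```python
def gradient_descent(grad, w0, lr=0.1, tol=1e-8, max_iters=10_000):
    """Repeat w <- w - lr * grad(w) until the gradient is close to zero."""
    w = w0
    for _ in range(max_iters):
        g = grad(w)
        if abs(g) < tol:  # simple convergence test on the gradient magnitude
            break
        w = w - lr * g
    return w


# Toy objective J(w) = (w - 3)^2 with gradient dJ/dw = 2 * (w - 3).
w_star = gradient_descent(grad=lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_star)  # close to the minimizer w = 3
```

With a learning rate that is too large (for this objective, above 1.0) the iterates diverge instead of converging, which is why $\eta$ is treated as a hyperparameter.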
Stochastic Gradient Descent
1. Randomly initialize weights $\sim \mathcal{N}(0, \sigma^2)$
2. Repeat until convergence:
• Choose a batch of $B$ data points.
• Compute the gradient $\frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}$
• Take a step of size $\eta$ in the direction opposite to the gradient.
• Weight update: $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
• Where $\eta$ is the learning rate
3. Return weights (model)
Note: Mini-batch SGD is faster to compute than vanilla GD and more accurate than single-sample SGD (a better estimate of the true gradient).
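A minimal mini-batch SGD sketch, assuming a one-parameter linear model $f(x; w) = w x$ with squared-error loss and noise-free synthetic data generated from $w = 2$. The function name `sgd_linear`, the data, the seed, and the fixed number of epochs (used in place of a convergence test) are illustrative choices.

```python
import random


def sgd_linear(xs, ys, batch_size=4, lr=0.05, epochs=50, seed=0):
    """Fit y ~ w * x by mini-batch SGD on the squared error."""
    rng = random.Random(seed)
    w = rng.gauss(0.0, 0.1)  # random initialization ~ N(0, sigma^2)
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the per-sample gradients d/dw (y - w*x)^2 over the batch.
            grad = sum(-2.0 * xs[k] * (ys[k] - w * xs[k]) for k in batch) / len(batch)
            w -= lr * grad
    return w


xs = [i / 10 for i in range(1, 21)]  # 0.1, 0.2, ..., 2.0
ys = [2.0 * x for x in xs]           # targets generated with w = 2
w_fit = sgd_linear(xs, ys)
print(w_fit)
```

Setting `batch_size=len(xs)` recovers full-batch gradient descent, and `batch_size=1` gives single-sample SGD.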
Backpropagation

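[Figure: computation graph $x \rightarrow z_1 \rightarrow \hat{y} \rightarrow J(W)$, where $w_1$ is the weight applied to $x$ to produce $z_1$.]
• The Chain Rule:
$$\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$
• Do this for every weight in the network.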
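The chain-rule factorization can be verified on the smallest possible case. The sketch assumes $z_1 = w_1 x$, a sigmoid activation, and the loss $J = \frac{1}{2}(\hat{y} - y)^2$; none of these specifics are given with the diagram, and the numbers are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w1, x, y):
    """J = 0.5 * (y_hat - y)^2 with y_hat = sigmoid(w1 * x)."""
    return 0.5 * (sigmoid(w1 * x) - y) ** 2


def grad_chain_rule(w1, x, y):
    z1 = w1 * x
    y_hat = sigmoid(z1)
    dJ_dyhat = y_hat - y               # dJ / dy_hat
    dyhat_dz1 = y_hat * (1.0 - y_hat)  # dy_hat / dz1 (sigmoid derivative)
    dz1_dw1 = x                        # dz1 / dw1
    return dJ_dyhat * dyhat_dz1 * dz1_dw1


w1, x, y = 0.5, 2.0, 1.0  # hypothetical values
analytic = grad_chain_rule(w1, x, y)
eps = 1e-6
numeric = (loss(w1 + eps, x, y) - loss(w1 - eps, x, y)) / (2.0 * eps)
print(analytic, numeric)
```

The two printed numbers agree to many decimal places, which is the usual way to sanity-check a hand-derived gradient.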
Architecture Design
• Another choice that has to be made is the architecture of the network, that is, the depth (number of layers) and the width (number of nodes per layer).
• However, the ideal architecture can only be found through experimentation and by observing the validation error.
MLP Issues
• Training set size: as a rule of thumb, the number of training samples should be 5 to 10 times the number of weights in the network.
• Overfitting: mostly happens when the network is much larger than the number of training samples warrants (i.e., the network has enough weights to memorize the whole training input [low training loss] while lacking generalization on the test set [high validation loss]).
• Black-box syndrome: lack of explainability (the subject of the Explainable AI (XAI) research area).
• Vanishing gradient: backpropagation gives partial derivatives which become much smaller the further back we go (chain rule).
Backpropagation Example

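[Figure: a 2-2-2 network. Input layer: $x_1$, $x_2$. Hidden layer: $h_1$ (weights $w_1$ from $x_1$ and $w_2$ from $x_2$) and $h_2$ (weights $w_3$ from $x_1$ and $w_4$ from $x_2$), sharing bias $b_1$. Output layer: $y_1$ (weights $w_5$ from $h_1$ and $w_6$ from $h_2$) and $y_2$ (weights $w_7$ from $h_1$ and $w_8$ from $h_2$), sharing bias $b_2$. Each hidden unit is drawn as its pre-activation $Z_h$ followed by its activation $h$.]

Credits: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example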
Backpropagation Example
Activation Function: Sigmoid for all units
Learning Rate ($\eta$): 0.5
Inputs: $x_1 = 0.05$, $x_2 = 0.1$
Targets: $Target_1 = 0.01$, $Target_2 = 0.99$
Hidden-layer weights: $w_1 = 0.15$, $w_2 = 0.2$, $w_3 = 0.25$, $w_4 = 0.3$
Output-layer weights: $w_5 = 0.4$, $w_6 = 0.45$, $w_7 = 0.5$, $w_8 = 0.55$
Biases: $b_1 = 0.35$, $b_2 = 0.6$


Forward Pass – Hidden Units
$$Z_{h1} = x_1 w_1 + x_2 w_2 + b_1 = 0.05 \times 0.15 + 0.1 \times 0.2 + 0.35 = 0.3775$$
$$h_1 = \sigma(Z_{h1}) = \frac{1}{1 + e^{-Z_{h1}}} = \frac{1}{1 + e^{-0.3775}} = 0.59327$$

$$Z_{h2} = x_1 w_3 + x_2 w_4 + b_1 = 0.05 \times 0.25 + 0.1 \times 0.3 + 0.35 = 0.3925$$
$$h_2 = \sigma(Z_{h2}) = \frac{1}{1 + e^{-Z_{h2}}} = \frac{1}{1 + e^{-0.3925}} = 0.59688$$

Sigmoid Calculator: https://www.tinkershop.net/ml/sigmoid_calculator.html

Forward Pass – Output Units
$$Z_{y1} = h_1 w_5 + h_2 w_6 + b_2 = 0.59327 \times 0.4 + 0.59688 \times 0.45 + 0.6 = 1.105904$$
$$y_1 = \sigma(Z_{y1}) = \frac{1}{1 + e^{-Z_{y1}}} = \frac{1}{1 + e^{-1.105904}} = 0.751365$$

$$Z_{y2} = h_1 w_7 + h_2 w_8 + b_2 = 0.59327 \times 0.5 + 0.59688 \times 0.55 + 0.6 = 1.224919$$
$$y_2 = \sigma(Z_{y2}) = \frac{1}{1 + e^{-Z_{y2}}} = \frac{1}{1 + e^{-1.224919}} = 0.772928$$
Error/Cost Function
$$E_y = \frac{1}{2}(Target - Predicted)^2$$
$$E_{Total} = E_{y1} + E_{y2} = \frac{1}{2}(0.01 - 0.751365)^2 + \frac{1}{2}(0.99 - 0.772928)^2 = 0.274811 + 0.02356 = 0.2983711$$
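The forward pass and the total error of this example can be reproduced with a short script. The variable names mirror the symbols on the slides; the script itself is only a sketch added for checking the arithmetic.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Values from the worked example.
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

# Hidden units.
h1 = sigmoid(x1 * w1 + x2 * w2 + b1)
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)

# Output units.
y1 = sigmoid(h1 * w5 + h2 * w6 + b2)
y2 = sigmoid(h1 * w7 + h2 * w8 + b2)

# Squared-error cost summed over both outputs.
e_total = 0.5 * (t1 - y1) ** 2 + 0.5 * (t2 - y2) ** 2

print(round(h1, 5), round(h2, 5))
print(round(y1, 6), round(y2, 6))
print(round(e_total, 6))
```

Small differences in the last digit relative to the slides can appear because the slides round intermediate values before reusing them.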
Backward Pass – I
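• To update each weight, we use the backpropagation algorithm, where we calculate the partial derivative of the total error $E_{Total}$ with respect to each weight.
• For example, to update weight $w_5$, we calculate $\frac{\partial E_{Total}}{\partial w_5}$ using the chain rule of calculus as follows:
$$\frac{\partial E_{Total}}{\partial w_5} = \frac{\partial E_{Total}}{\partial y_1} \cdot \frac{\partial y_1}{\partial Z_{y1}} \cdot \frac{\partial Z_{y1}}{\partial w_5}$$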
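Derivatives – I
$$E_y = \frac{1}{2}(Target - Predicted)^2$$
$$E_{Total} = E_{y1} + E_{y2}$$
Derivation: only $E_{y1}$ depends on $y_1$, so
$$\frac{\partial E_{Total}}{\partial y_1} = \frac{\partial}{\partial y_1} \frac{1}{2}(Target_1 - y_1)^2 = -(Target_1 - y_1) = y_1 - Target_1$$
$$= 0.751365 - 0.01 = 0.741365$$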
Derivatives – II
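$$y_1 = \sigma(Z_{y1}) = \frac{1}{1 + e^{-Z_{y1}}}$$
$$\frac{\partial y_1}{\partial Z_{y1}} = \sigma(Z_{y1})\,(1 - \sigma(Z_{y1})) = y_1 (1 - y_1) = 0.751365\,(1 - 0.751365) = 0.186815$$
Derivation:
$$\frac{d\sigma}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\,(1 - \sigma(z))$$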

Derivatives – III
$$Z_{y1} = h_1 w_5 + h_2 w_6 + b_2$$
$$\frac{\partial Z_{y1}}{\partial w_5} = h_1 = 0.59327$$
Combined
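$$\frac{\partial E_{Total}}{\partial w_5} = \frac{\partial E_{Total}}{\partial y_1} \cdot \frac{\partial y_1}{\partial Z_{y1}} \cdot \frac{\partial Z_{y1}}{\partial w_5} = 0.741365 \times 0.186815 \times 0.59327 = 0.082167$$

$$w_5' = w_5 - \eta \frac{\partial E_{Total}}{\partial w_5} = 0.4 - 0.5 \times 0.082167 = 0.358916$$
Similarly:
$w_6 = 0.45 \rightarrow w_6' = 0.40866$
$w_7 = 0.50 \rightarrow w_7' = 0.51130$
$w_8 = 0.55 \rightarrow w_8' = 0.56137$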
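The four output-layer updates follow the same three-factor pattern, so they can be computed with one helper. The function name `output_weight_grad` and the dictionary layout are assumptions made for this sketch; the numbers come from the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x1, x2, b1, b2 = 0.05, 0.10, 0.35, 0.60
t1, t2, eta = 0.01, 0.99, 0.5
w = {"w1": 0.15, "w2": 0.20, "w3": 0.25, "w4": 0.30,
     "w5": 0.40, "w6": 0.45, "w7": 0.50, "w8": 0.55}

# Forward pass.
h1 = sigmoid(x1 * w["w1"] + x2 * w["w2"] + b1)
h2 = sigmoid(x1 * w["w3"] + x2 * w["w4"] + b1)
y1 = sigmoid(h1 * w["w5"] + h2 * w["w6"] + b2)
y2 = sigmoid(h1 * w["w7"] + h2 * w["w8"] + b2)


def output_weight_grad(y, target, h_in):
    """dE/dw = (y - target) * y * (1 - y) * h_in for an output-layer weight."""
    return (y - target) * y * (1.0 - y) * h_in


grads = {
    "w5": output_weight_grad(y1, t1, h1),
    "w6": output_weight_grad(y1, t1, h2),
    "w7": output_weight_grad(y2, t2, h1),
    "w8": output_weight_grad(y2, t2, h2),
}
updated = {name: w[name] - eta * g for name, g in grads.items()}

for name in sorted(updated):
    print(name, round(grads[name], 6), round(updated[name], 5))
```

Note that $w_7$ and $w_8$ increase because $y_2$ is below its target, which makes their gradients negative.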
Backward Pass – II
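• To calculate the gradient for $w_1$, we use the chain rule as before:
$$\frac{\partial E_{Total}}{\partial w_1} = \frac{\partial E_{Total}}{\partial h_1} \cdot \frac{\partial h_1}{\partial Z_{h1}} \cdot \frac{\partial Z_{h1}}{\partial w_1}$$
Where:
$$\frac{\partial E_{Total}}{\partial h_1} = \frac{\partial E_{y1}}{\partial h_1} + \frac{\partial E_{y2}}{\partial h_1}$$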
Derivatives – I
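$$\frac{\partial E_{Total}}{\partial h_1} = \frac{\partial E_{y1}}{\partial h_1} + \frac{\partial E_{y2}}{\partial h_1}$$
First term (through $y_1$):
$$\frac{\partial E_{y1}}{\partial h_1} = \frac{\partial E_{y1}}{\partial y_1} \cdot \frac{\partial y_1}{\partial Z_{y1}} \cdot \frac{\partial Z_{y1}}{\partial h_1}$$
$$\frac{\partial E_{y1}}{\partial y_1} = y_1 - Target_1 = 0.751365 - 0.01 = 0.741365$$
$$\frac{\partial y_1}{\partial Z_{y1}} = y_1 (1 - y_1) = 0.751365\,(1 - 0.751365) = 0.186815$$
$$Z_{y1} = h_1 w_5 + h_2 w_6 + b_2 \ \Rightarrow\ \frac{\partial Z_{y1}}{\partial h_1} = w_5 = 0.4$$
$$\frac{\partial E_{y1}}{\partial h_1} = 0.741365 \times 0.186815 \times 0.4 = 0.055399$$
Second term (through $y_2$):
$$\frac{\partial E_{y2}}{\partial h_1} = \frac{\partial E_{y2}}{\partial y_2} \cdot \frac{\partial y_2}{\partial Z_{y2}} \cdot \frac{\partial Z_{y2}}{\partial h_1}$$
$$\frac{\partial E_{y2}}{\partial y_2} = y_2 - Target_2 = 0.772928 - 0.99 = -0.217072$$
$$\frac{\partial y_2}{\partial Z_{y2}} = y_2 (1 - y_2) = 0.772928\,(1 - 0.772928) = 0.17551$$
$$Z_{y2} = h_1 w_7 + h_2 w_8 + b_2 \ \Rightarrow\ \frac{\partial Z_{y2}}{\partial h_1} = w_7 = 0.5$$
$$\frac{\partial E_{y2}}{\partial h_1} = -0.217072 \times 0.17551 \times 0.5 = -0.019049$$
Sum:
$$\frac{\partial E_{Total}}{\partial h_1} = 0.055399 + (-0.019049) = 0.03635$$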
Derivatives – II
$$h_1 = \sigma(Z_{h1}) = \frac{1}{1 + e^{-Z_{h1}}}$$
$$\frac{\partial h_1}{\partial Z_{h1}} = \sigma(Z_{h1})\,(1 - \sigma(Z_{h1})) = h_1 (1 - h_1) = 0.59327\,(1 - 0.59327) = 0.2413$$
Derivatives – III
$$Z_{h1} = x_1 w_1 + x_2 w_2 + b_1$$
$$\frac{\partial Z_{h1}}{\partial w_1} = x_1 = 0.05$$
Combined
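$$\frac{\partial E_{Total}}{\partial w_1} = \frac{\partial E_{Total}}{\partial h_1} \cdot \frac{\partial h_1}{\partial Z_{h1}} \cdot \frac{\partial Z_{h1}}{\partial w_1} = 0.03635 \times 0.2413 \times 0.05 = 0.00043856$$

$$w_1' = w_1 - \eta \frac{\partial E_{Total}}{\partial w_1} = 0.15 - 0.5 \times 0.00043856 = 0.14978$$
Similarly:
$w_2 = 0.20 \rightarrow w_2' = 0.19956$
$w_3 = 0.25 \rightarrow w_3' = 0.24975$
$w_4 = 0.30 \rightarrow w_4' = 0.29950$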
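The hidden-layer updates can be checked the same way. This sketch assumes, as in the derivation above, that the hidden-layer gradients are computed with the original output weights $w_5, \dots, w_8$ (before their update); the variable names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x1, x2, b1, b2 = 0.05, 0.10, 0.35, 0.60
t1, t2, eta = 0.01, 0.99, 0.5
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55

# Forward pass.
h1 = sigmoid(x1 * w1 + x2 * w2 + b1)
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)
y1 = sigmoid(h1 * w5 + h2 * w6 + b2)
y2 = sigmoid(h1 * w7 + h2 * w8 + b2)

# dE/dZ for each output unit: (y - target) * y * (1 - y).
delta_y1 = (y1 - t1) * y1 * (1.0 - y1)
delta_y2 = (y2 - t2) * y2 * (1.0 - y2)

# dE_total/dh sums the contributions flowing back from both outputs.
dE_dh1 = delta_y1 * w5 + delta_y2 * w7
dE_dh2 = delta_y1 * w6 + delta_y2 * w8

# dE/dZ for each hidden unit.
delta_h1 = dE_dh1 * h1 * (1.0 - h1)
delta_h2 = dE_dh2 * h2 * (1.0 - h2)

# Gradient of each hidden weight is delta * (its input); then one SGD step.
new_w1 = w1 - eta * delta_h1 * x1
new_w2 = w2 - eta * delta_h1 * x2
new_w3 = w3 - eta * delta_h2 * x1
new_w4 = w4 - eta * delta_h2 * x2

print(round(dE_dh1, 5))
print(round(new_w1, 5), round(new_w2, 5), round(new_w3, 5), round(new_w4, 5))
```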
Forward Pass – After Updates
Updated weights:
$w_1 = 0.15 \rightarrow w_1' = 0.14978$
$w_2 = 0.20 \rightarrow w_2' = 0.19956$
$w_3 = 0.25 \rightarrow w_3' = 0.24975$
$w_4 = 0.30 \rightarrow w_4' = 0.29950$
$w_5 = 0.40 \rightarrow w_5' = 0.35891$
$w_6 = 0.45 \rightarrow w_6' = 0.40866$
$w_7 = 0.50 \rightarrow w_7' = 0.51130$
$w_8 = 0.55 \rightarrow w_8' = 0.56137$

Using the same inputs $x_1 = 0.05$, $x_2 = 0.1$ and biases $b_1 = 0.35$, $b_2 = 0.60$:

$$Z_{h1} = x_1 w_1 + x_2 w_2 + b_1 = 0.05 \times 0.14978 + 0.1 \times 0.19956 + 0.35 = 0.377445$$
$$h_1 = \sigma(Z_{h1}) = \frac{1}{1 + e^{-Z_{h1}}} = \frac{1}{1 + e^{-0.377445}} = 0.59325$$

$$Z_{h2} = x_1 w_3 + x_2 w_4 + b_1 = 0.05 \times 0.24975 + 0.1 \times 0.2995 + 0.35 = 0.392437$$
$$h_2 = \sigma(Z_{h2}) = \frac{1}{1 + e^{-Z_{h2}}} = \frac{1}{1 + e^{-0.392437}} = 0.59686$$

$$Z_{y1} = h_1 w_5 + h_2 w_6 + b_2 = 0.59325 \times 0.35891 + 0.59686 \times 0.40866 + 0.6 = 1.056836$$
$$y_1 = \sigma(Z_{y1}) = \frac{1}{1 + e^{-Z_{y1}}} = \frac{1}{1 + e^{-1.056836}} = 0.742085$$

$$Z_{y2} = h_1 w_7 + h_2 w_8 + b_2 = 0.59325 \times 0.5113 + 0.59686 \times 0.56137 + 0.6 = 1.238388$$
$$y_2 = \sigma(Z_{y2}) = \frac{1}{1 + e^{-Z_{y2}}} = \frac{1}{1 + e^{-1.238388}} = 0.775283$$

$$E_y = \frac{1}{2}(Target - Predicted)^2$$
$$E_{Total} = E_{y1} + E_{y2} = \frac{1}{2}(0.01 - 0.742085)^2 + \frac{1}{2}(0.99 - 0.775283)^2 = 0.26797 + 0.02305 = 0.29102$$
The error decreased from 0.2983711 (first pass) to 0.29102 after one pass.
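The whole update-then-re-evaluate cycle can be wrapped in one function. This is a sketch under the same assumptions as the slides (sigmoid units, squared error, biases left fixed); the function names and the flat weight list `[w1, ..., w8]` are illustrative choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, x, b):
    """w = [w1..w8], x = (x1, x2), b = (b1, b2); returns (h1, h2, y1, y2)."""
    h1 = sigmoid(x[0] * w[0] + x[1] * w[1] + b[0])
    h2 = sigmoid(x[0] * w[2] + x[1] * w[3] + b[0])
    y1 = sigmoid(h1 * w[4] + h2 * w[5] + b[1])
    y2 = sigmoid(h1 * w[6] + h2 * w[7] + b[1])
    return h1, h2, y1, y2


def total_error(w, x, b, t):
    _, _, y1, y2 = forward(w, x, b)
    return 0.5 * (t[0] - y1) ** 2 + 0.5 * (t[1] - y2) ** 2


def train_step(w, x, b, t, eta):
    """One backpropagation step; all gradients use the weights before the update."""
    h1, h2, y1, y2 = forward(w, x, b)
    d1 = (y1 - t[0]) * y1 * (1.0 - y1)
    d2 = (y2 - t[1]) * y2 * (1.0 - y2)
    dh1 = (d1 * w[4] + d2 * w[6]) * h1 * (1.0 - h1)
    dh2 = (d1 * w[5] + d2 * w[7]) * h2 * (1.0 - h2)
    grads = [dh1 * x[0], dh1 * x[1], dh2 * x[0], dh2 * x[1],
             d1 * h1, d1 * h2, d2 * h1, d2 * h2]
    return [wi - eta * gi for wi, gi in zip(w, grads)]


x, b, t = (0.05, 0.10), (0.35, 0.60), (0.01, 0.99)
weights = [0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55]

error_before = total_error(weights, x, b, t)
weights = train_step(weights, x, b, t, eta=0.5)
error_after = total_error(weights, x, b, t)
print(round(error_before, 6), round(error_after, 5))
```

Calling `train_step` repeatedly in a loop keeps lowering the error on this single training example; the last printed digit can differ slightly from the slides because they round intermediate values.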
Using Matrices
The Backpropagation Algorithm

1. Input $x$:
Set the corresponding activation $a^0$ for the input layer.
2. Feedforward:
For each $l = 1, 2, 3, \dots, L$ compute $z^l = W^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
3. Output error $\delta^L$:
Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
4. Backpropagate the error:
For each $l = L-1, L-2, \dots, 1$ compute $\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.
5. Output:
The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.
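The five steps can be sketched for a network of any depth using plain Python lists as matrices. The sketch assumes sigmoid activations in every layer and the quadratic cost $C = \frac{1}{2}\lVert a^L - y \rVert^2$, so that $\nabla_a C = a^L - y$; the function names and the list-of-rows layout are illustrative, and Python's 0-based index `l` corresponds to layer $l + 1$ in the notation above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def backprop(weights, biases, x, target):
    """Return (grad_W, grad_b) for one training example."""
    # Steps 1-2: feedforward, storing every activation a^l.
    activations = [list(x)]
    for W, b in zip(weights, biases):
        z = [zi + bi for zi, bi in zip(matvec(W, activations[-1]), b)]
        activations.append([sigmoid(zi) for zi in z])

    # Step 3: output error; for the sigmoid, sigma'(z) = a * (1 - a).
    a_out = activations[-1]
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(a_out, target)]

    grad_W = [None] * len(weights)
    grad_b = [None] * len(weights)
    # Steps 4-5: walk backwards, emitting gradients and propagating delta.
    for l in range(len(weights) - 1, -1, -1):
        a_prev = activations[l]
        grad_W[l] = [[dj * ak for ak in a_prev] for dj in delta]
        grad_b[l] = list(delta)
        if l > 0:
            W = weights[l]
            back = [sum(W[j][k] * delta[j] for j in range(len(delta)))
                    for k in range(len(a_prev))]  # (W^T delta)_k
            delta = [bk * ak * (1.0 - ak) for bk, ak in zip(back, a_prev)]
    return grad_W, grad_b


# The 2-2-2 example network; the shared biases are repeated per unit.
W1 = [[0.15, 0.20], [0.25, 0.30]]
W2 = [[0.40, 0.45], [0.50, 0.55]]
grad_W, grad_b = backprop([W1, W2], [[0.35, 0.35], [0.60, 0.60]],
                          x=[0.05, 0.10], target=[0.01, 0.99])
print(grad_W[1])
print(grad_W[0])
```

Subtracting $\eta$ times each gradient entry from the corresponding weight gives the same updated weights as the scalar derivation.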
Forward Pass – Hidden Units

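$$X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0.05 \\ 0.1 \end{bmatrix}, \quad W^1 = \begin{bmatrix} w_1 & w_2 \\ w_3 & w_4 \end{bmatrix} = \begin{bmatrix} 0.15 & 0.2 \\ 0.25 & 0.3 \end{bmatrix}$$

$$Z^1 = W^1 X + b_1 = \begin{bmatrix} 0.15 & 0.2 \\ 0.25 & 0.3 \end{bmatrix} \begin{bmatrix} 0.05 \\ 0.1 \end{bmatrix} + 0.35 = \begin{bmatrix} 0.3775 \\ 0.3925 \end{bmatrix}$$

$$H = \sigma(Z^1) = \sigma\left(\begin{bmatrix} 0.3775 \\ 0.3925 \end{bmatrix}\right) = \begin{bmatrix} 0.59327 \\ 0.59688 \end{bmatrix}$$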
Forward Pass – Output Units

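$$H = \begin{bmatrix} h_1 \\ h_2 \end{bmatrix} = \begin{bmatrix} 0.59327 \\ 0.59688 \end{bmatrix}, \quad W^2 = \begin{bmatrix} w_5 & w_6 \\ w_7 & w_8 \end{bmatrix} = \begin{bmatrix} 0.4 & 0.45 \\ 0.5 & 0.55 \end{bmatrix}$$

$$Z^2 = W^2 H + b_2 = \begin{bmatrix} 0.4 & 0.45 \\ 0.5 & 0.55 \end{bmatrix} \begin{bmatrix} 0.59327 \\ 0.59688 \end{bmatrix} + 0.6 = \begin{bmatrix} 1.105904 \\ 1.224919 \end{bmatrix}$$

$$Y = \sigma(Z^2) = \sigma\left(\begin{bmatrix} 1.105904 \\ 1.224919 \end{bmatrix}\right) = \begin{bmatrix} 0.751365 \\ 0.772928 \end{bmatrix}$$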
Backward Pass – I
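• $\delta^L = \nabla_a C \odot \sigma'(z^L)$, which for this network is $\delta^2 = \nabla_a C \odot \sigma'(z^2)$
• $\nabla_a C = Y - Target = \begin{bmatrix} 0.751365 \\ 0.772928 \end{bmatrix} - \begin{bmatrix} 0.01 \\ 0.99 \end{bmatrix} = \begin{bmatrix} 0.741365 \\ -0.217072 \end{bmatrix}$
• $\sigma'(z^2) = Y \odot (1 - Y) = \begin{bmatrix} 0.751365 \\ 0.772928 \end{bmatrix} \odot \begin{bmatrix} 0.248635 \\ 0.227072 \end{bmatrix} = \begin{bmatrix} 0.18681 \\ 0.17551 \end{bmatrix}$
• $\delta^2 = \begin{bmatrix} 0.741365 \\ -0.217072 \end{bmatrix} \odot \begin{bmatrix} 0.18681 \\ 0.17551 \end{bmatrix} = \begin{bmatrix} 0.13849 \\ -0.03809 \end{bmatrix}$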
Backward Pass – II
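• $\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$, which for this network is $\delta^1 = ((W^2)^T \delta^2) \odot \sigma'(z^1)$
• $\sigma'(z^1) = H \odot (1 - H) = \begin{bmatrix} 0.59327 \\ 0.59688 \end{bmatrix} \odot \begin{bmatrix} 0.40673 \\ 0.40312 \end{bmatrix} = \begin{bmatrix} 0.2413 \\ 0.24061 \end{bmatrix}$
• $\delta^1 = \left( \begin{bmatrix} 0.4 & 0.45 \\ 0.5 & 0.55 \end{bmatrix}^T \begin{bmatrix} 0.13849 \\ -0.03809 \end{bmatrix} \right) \odot \begin{bmatrix} 0.2413 \\ 0.24061 \end{bmatrix}$
• $\delta^1 = \left( \begin{bmatrix} 0.4 & 0.5 \\ 0.45 & 0.55 \end{bmatrix} \begin{bmatrix} 0.13849 \\ -0.03809 \end{bmatrix} \right) \odot \begin{bmatrix} 0.2413 \\ 0.24061 \end{bmatrix}$
• $\delta^1 = \begin{bmatrix} 0.036351 \\ 0.041371 \end{bmatrix} \odot \begin{bmatrix} 0.2413 \\ 0.24061 \end{bmatrix} = \begin{bmatrix} 0.00877 \\ 0.00995 \end{bmatrix}$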
Backward Pass – III
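• $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$, so in matrix form $\frac{\partial C}{\partial W^2} = \delta^2 (a^1)^T = \delta^2 H^T$
• $\frac{\partial C}{\partial W^2} = \begin{bmatrix} 0.13849 \\ -0.03809 \end{bmatrix} \begin{bmatrix} 0.59327 & 0.59688 \end{bmatrix} = \begin{bmatrix} 0.08216 & 0.08266 \\ -0.02259 & -0.02273 \end{bmatrix}$

• $\frac{\partial C}{\partial W^1} = \delta^1 (a^0)^T = \delta^1 X^T$
• $\frac{\partial C}{\partial W^1} = \begin{bmatrix} 0.00877 \\ 0.00995 \end{bmatrix} \begin{bmatrix} 0.05 & 0.1 \end{bmatrix} = \begin{bmatrix} 0.0004385 & 0.000877 \\ 0.0004975 & 0.000995 \end{bmatrix}$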
Weight Updates
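• $W^l = W^l - \eta \frac{\partial C}{\partial W^l}$, for example $W^2 = W^2 - \eta \frac{\partial C}{\partial W^2}$
• $W^2 = \begin{bmatrix} 0.4 & 0.45 \\ 0.5 & 0.55 \end{bmatrix} - 0.5 \begin{bmatrix} 0.08216 & 0.08266 \\ -0.02259 & -0.02273 \end{bmatrix} = \begin{bmatrix} 0.35892 & 0.40867 \\ 0.51129 & 0.56136 \end{bmatrix}$
• $W^1 = \begin{bmatrix} 0.15 & 0.2 \\ 0.25 & 0.3 \end{bmatrix} - 0.5 \begin{bmatrix} 0.0004385 & 0.000877 \\ 0.0004975 & 0.000995 \end{bmatrix} = \begin{bmatrix} 0.14978 & 0.19956 \\ 0.24975 & 0.29950 \end{bmatrix}$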
References
• http://neuralnetworksanddeeplearning.com/chap2.html
• https://arxiv.org/pdf/1802.01528
• https://www.deeplearningbook.org/
