Backpropagation Algorithm

The document explains how neural networks operate, focusing on the backpropagation algorithm used to train them by adjusting weights to minimize output error. It details the process of calculating net values, activations, and the application of gradient descent for weight updates. Additionally, it covers concepts like local and global minima in optimization, and the chain rule for calculating gradients in more complex networks.


How Neural Networks and Backpropagation Work
We have this input data:

Feature 1   Feature 2
0.5         -0.5
0.3         0.4
0.7         0.9

We wish to map it to these target outputs:

Target 1    Target 2
0.9         0.1
0.9         0.9
0.1         0.1
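As a minimal sketch (assuming NumPy; the array names X and D are ours, not from the slides), the three samples and their targets can be written as:

```python
# A minimal sketch of the training data above (assumes NumPy; X and D are our names).
import numpy as np

X = np.array([[0.5, -0.5],   # sample 1
              [0.3,  0.4],   # sample 2
              [0.7,  0.9]])  # sample 3

D = np.array([[0.9, 0.1],    # target for sample 1
              [0.9, 0.9],    # target for sample 2
              [0.1, 0.1]])   # target for sample 3
```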
Let's take our first sample: input (0.5, -0.5), with target (0.9, 0.1).
Consider this neural network (example taken from: Neural Networks: A Classroom Approach by Satish Kumar).

[Network diagram: 2 inputs → 2 hidden neurons → 2 output neurons]
Inputs: x1 = 0.5, x2 = -0.5
Hidden neuron 1: weights 0.1 (from x1) and -0.2 (from x2), bias 0.01
Hidden neuron 2: weights 0.3 (from x1) and 0.55 (from x2), bias -0.02
Output neuron 1: weights 0.37 (from hidden 1) and 0.9 (from hidden 2), bias 0.31
Output neuron 2: weights -0.22 (from hidden 1) and -0.12 (from hidden 2), bias 0.27
Targets: d1 = 0.9, d2 = 0.1
Let's start by moving forward.

The net value is the total input coming to the neuron.

Net value of the first neuron in the hidden layer:

z1 = x1(0.1) + x2(-0.2) + bias
z1 = 0.5(0.1) + (-0.5)(-0.2) + 0.01
z1 = 0.16
Net value of the second neuron in the hidden layer:

z2 = x1(0.3) + x2(0.55) + bias
z2 = 0.5(0.3) + (-0.5)(0.55) + (-0.02)
z2 = -0.145
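As a quick check, here is a minimal plain-Python sketch of these two net-value calculations (variable names are ours):

```python
# Net values of the two hidden neurons for the first sample (plain Python).
x1, x2 = 0.5, -0.5

# Hidden neuron 1: weights 0.1 and -0.2, bias 0.01
z1 = x1 * 0.1 + x2 * (-0.2) + 0.01    # ≈ 0.16

# Hidden neuron 2: weights 0.3 and 0.55, bias -0.02
z2 = x1 * 0.3 + x2 * 0.55 + (-0.02)   # ≈ -0.145

print(z1, z2)
```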
The activation of the neuron

Activation scales the input (net) value to a value between 0 and 1.
For example, the sigmoid function:
Activating the two neurons at the hidden layer:

δ(z) = 1 / (1 + e^(-λz)), where z is the input (net) value. For simplicity, we will consider λ = 1.

δ(z1) = 1 / (1 + e^(-0.16)) = 0.5399
δ(z2) = 1 / (1 + e^(0.145)) = 0.4638
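A small sketch of the sigmoid activation (with λ = 1) applied to the two net values; the helper name `sigmoid` is an assumption:

```python
import math

def sigmoid(z, lam=1.0):
    """Sigmoid activation: squashes the net value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-lam * z))

a1 = sigmoid(0.16)    # ≈ 0.5399
a2 = sigmoid(-0.145)  # ≈ 0.4638
print(a1, a2)
```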
Let's continue with the output neurons.

Now, the hidden neurons' outputs become the inputs to the output neurons.

Net value of the first output neuron:
y1 = 0.5399(0.37) + 0.4638(0.9) + 0.31
y1 = 0.9271

Similarly, for the second output neuron:
y2 = 0.5399(-0.22) + 0.4638(-0.12) + 0.27
y2 = 0.0955
Now, activating the output neurons:

δ(y1) = 1 / (1 + e^(-0.9271)) = 0.7164
δ(y2) = 1 / (1 + e^(-0.0955)) = 0.5238

So the output layer produces 0.7164 and 0.5238 for this sample.
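Continuing the same sketch through the output layer (weights and biases taken from the example network; names are ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

a1, a2 = 0.5399, 0.4638   # hidden activations from the previous step

# Output neuron 1: weights 0.37 and 0.9, bias 0.31
y1 = a1 * 0.37 + a2 * 0.9 + 0.31          # ≈ 0.9271
# Output neuron 2: weights -0.22 and -0.12, bias 0.27
y2 = a1 * (-0.22) + a2 * (-0.12) + 0.27   # ≈ 0.0955

out1, out2 = sigmoid(y1), sigmoid(y2)     # ≈ 0.7164, 0.5238
print(out1, out2)
```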

Complete Guide to Neural Networks with Python: Theory and Applications


Definition of Backpropagation
• A method to train the neural network by adjusting the weights of the neurons in order to reduce the output error.
Gradient Descent

• The base algorithm used to minimize the error with respect to the weights of the neural network. The learning rate determines the step size of the update used to reach the minimum.
• An epoch is one complete pass through all the samples.

https://www.learnopencv.com/understanding-activation-functions-in-deep-learning/
https://sebastianraschka.com/faq/docs/closed-form-vs-gd.html
The Backpropagation

Remember, our objective is to minimize the error by changing the weights.

We move in the direction opposite to the derivative (opposite to the slope):

• Negative slope: when we increase w, the loss decreases → -(-) = + → the weight increases (moving right).
• Positive slope: when we increase w, the loss increases → -(+) = - → the weight decreases (moving left).
Weight Update Rule:

w ← w - η (dE/dw)

(old weight, minus the learning rate times the gradient, i.e. the slope)

η = learning rate: how fast we update the weights; in other words, the step size of the update.

https://towardsdatascience.com/gradient-descent-in-a-nutshell-eaf8c18212f0
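A minimal sketch of this update rule (the names `update_weight`, `grad`, and `eta` are ours):

```python
# Gradient descent step: move the weight opposite to the slope.
def update_weight(w, grad, eta):
    """w <- w - eta * (dE/dw)"""
    return w - eta * grad

print(update_weight(0.5,  0.2, 0.1))  # positive slope -> weight decreases (≈ 0.48)
print(update_weight(0.5, -0.2, 0.1))  # negative slope -> weight increases (≈ 0.52)
```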
Local Minimum and Global Minimum

Convex vs. non-convex optimization:
• Convex: one global/local minimum.
• Non-convex: one or more local minima and a global minimum; non-convex optimization can have multiple local minima.

Image credits: https://www.oreilly.com/radar/the-hard-thing-about-deep-learning/
Suppose z = w + 4 and y = z + 2. There is no w term in y, so dy/dw (and likewise dE/dw in the network) cannot be computed directly: w feeds the net value z, z feeds the activation a, and a feeds the error E.

Writing y = f(z) and z = g(w), the chain rule gives:

dy/dw = (dy/dz) · (dz/dw)
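A tiny sketch (using SymPy; purely illustrative) confirming that the chain rule and direct substitution agree for this toy example:

```python
# Verifying the toy example with SymPy: dy/dw via the chain rule equals
# dy/dw computed by substituting z = w + 4 directly into y.
import sympy as sp

w, z = sp.symbols("w z")
g = w + 4                                  # z = g(w)
f = z + 2                                  # y = f(z)

dy_dz = sp.diff(f, z)                      # 1
dz_dw = sp.diff(g, w)                      # 1
dy_dw_chain = dy_dz * dz_dw                # chain rule: 1
dy_dw_direct = sp.diff(f.subs(z, g), w)    # direct substitution: 1
print(dy_dz, dz_dw, dy_dw_chain, dy_dw_direct)
```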
What should be done is to follow the same chain backwards, w → net (z) → activation (a) → error (E), multiplying the local derivatives dz/dw, da/dz, and dE/da.

The Chain Rule

dE/dw = (dE/da) · (da/dz) · (dz/dw)
More Complex

For a deeper chain x → z1 → a1 → z2 → a2 → E, with weights w1 (into z1) and w2 (into z2):

dE/dw1 = (dE/da2) · (da2/dz2) · (dz2/da1) · (da1/dz1) · (dz1/dw1)
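A minimal sketch of this longer chain for a scalar pipeline with sigmoid activations and squared error; the numeric values are illustrative only, not the slide's 2-2-2 network:

```python
# A scalar x -> z1 -> a1 -> z2 -> a2 -> E pipeline with sigmoid activations
# and squared error E = 0.5 * (d - a2)**2; values are illustrative only.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

x, w1, b1, w2, b2, d = 0.5, 0.1, 0.01, 0.37, 0.31, 0.9

# Forward pass
z1 = w1 * x + b1
a1 = sigmoid(z1)
z2 = w2 * a1 + b2
a2 = sigmoid(z2)

# Local derivatives, multiplied right to left (the chain rule)
dE_da2  = -(d - a2)          # dE/da2
da2_dz2 = a2 * (1 - a2)      # sigmoid derivative at z2
dz2_da1 = w2                 # z2 = w2*a1 + b2
da1_dz1 = a1 * (1 - a1)      # sigmoid derivative at z1
dz1_dw1 = x                  # z1 = w1*x + b1

dE_dw1 = dE_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1
print(dE_dw1)
```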
Consider these neurons to work with: the same network, inputs, initial weights, and targets as before.
Adjusting the weight of the output neuron

We start with the weight 0.37 connecting the first hidden neuron to the first output neuron (the targets are d1 = 0.9 and d2 = 0.1).
How much is the error changing with respect to the output?

The error (expected vs. actual), assuming one training sample per iteration (batch size of 1):

E = (1/2)[(d1 - δ(y1))² + (d2 - δ(y2))²]

dE/dδ(y1) = -(d1 - δ(y1)) = -(0.9 - 0.7164) = -0.1836

How much is the output changing with respect to the input?

dδ(y1)/dy1 = δ(y1)[1 - δ(y1)] = 0.7164(1 - 0.7164) = 0.2031

How much is the input changing with respect to the weight?

dy1/dw = δ(z1) = 0.5399

All together:

dE/dw = (-0.1836)(0.2031)(0.5399) = -0.0201
Weight update for the output neuron

The gradient dE/dw = -0.0201 was found from the chain rule. With the old weight 0.37 and a learning rate (how fast we move) assumed to be 1.2:

w ← 0.37 - 1.2(-0.0201) = 0.37 + 1.2(0.0201) = 0.3941

The new weight is 0.3941.
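Putting the three factors and the update together in a short sketch (numbers from the example; variable names are ours):

```python
# Gradient and update for the weight 0.37 (hidden neuron 1 -> output neuron 1).
d1   = 0.9       # target
out1 = 0.7164    # δ(y1), the actual output
a1   = 0.5399    # δ(z1), activation of hidden neuron 1
eta  = 1.2       # learning rate

dE_dout = -(d1 - out1)        # ≈ -0.1836
dout_dy = out1 * (1 - out1)   # ≈ 0.2031
dy_dw   = a1                  # ≈ 0.5399

dE_dw = dE_dout * dout_dy * dy_dw   # ≈ -0.0201
w_new = 0.37 - eta * dE_dw          # ≈ 0.3941
print(dE_dw, w_new)
```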
Adjusting the weight for the hidden layer

Now consider the weight 0.1 connecting input x1 to the first hidden neuron (call it w1). Its gradient is again a chain:

∂E/∂w1 = [∂E/∂(δ(z1))] · [∂(δ(z1))/∂z1] · [∂z1/∂w1]

How much is the net changing with respect to the weight?
z1 = x1(0.1) + x2(-0.2) + bias, so ∂z1/∂w1 = x1 = 0.5

How much is the activation changing with respect to the net?
δ(z1) = 1 / (1 + e^(-z1)), so ∂(δ(z1))/∂z1 = δ(z1)[1 - δ(z1)] = 0.5399(1 - 0.5399) = 0.2484
How much is the error changing with respect to the hidden activation?

The activation δ(z1) feeds both output neurons (in our case, p = 2 output paths), so its effect on the error is the sum of two chains:

∂E/∂(δ(z1)) = [∂E/∂(δ(y1))]·[∂(δ(y1))/∂y1]·[∂y1/∂(δ(z1))] + [∂E/∂(δ(y2))]·[∂(δ(y2))/∂y2]·[∂y2/∂(δ(z1))]

Path through the first output neuron:
∂E/∂(δ(y1)) = -(d1 - δ(y1)) = -0.1836
∂(δ(y1))/∂y1 = δ(y1)[1 - δ(y1)] = 0.2031
y1 = δ(z1)(0.37) + δ(z2)(0.9) + 0.31, so ∂y1/∂(δ(z1)) = 0.37

Path through the second output neuron:
∂E/∂(δ(y2)) = -(d2 - δ(y2)) = -(0.1 - 0.5238) = 0.4238
∂(δ(y2))/∂y2 = δ(y2)[1 - δ(y2)] = 0.5238(1 - 0.5238) = 0.2494
y2 = δ(z1)(-0.22) + δ(z2)(-0.12) + 0.27, so ∂y2/∂(δ(z1)) = -0.22

All together:
∂E/∂(δ(z1)) = (-0.1836)(0.2031)(0.37) + (0.4238)(0.2494)(-0.22) = -0.0370

And the full gradient for the hidden weight:
∂E/∂w1 = (-0.0370)(0.2484)(0.5) = -0.0045954
Weight update for the hidden neuron

The gradient ∂E/∂w1 = -0.0045954 was found from the chain rule. With the old weight 0.1 and a learning rate again assumed to be 1.2:

w1 ← 0.1 - 1.2(-0.0045954) = 0.1 + 1.2(0.0045954) = 0.1055

The new weight is 0.1055.
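The same calculation for the hidden weight, as a short sketch (numbers from the example; variable names are ours):

```python
# Gradient and update for the hidden weight 0.1 (x1 -> hidden neuron 1).
x1   = 0.5
a1   = 0.5399                  # δ(z1)
out1, out2 = 0.7164, 0.5238    # δ(y1), δ(y2)
d1, d2     = 0.9, 0.1          # targets
eta  = 1.2

# Path through output neuron 1 (weight from δ(z1) to y1 is 0.37)
path1 = -(d1 - out1) * out1 * (1 - out1) * 0.37
# Path through output neuron 2 (weight from δ(z1) to y2 is -0.22)
path2 = -(d2 - out2) * out2 * (1 - out2) * (-0.22)

dE_da1  = path1 + path2        # ≈ -0.0370
da1_dz1 = a1 * (1 - a1)        # ≈ 0.2484
dz1_dw1 = x1                   # 0.5

dE_dw1 = dE_da1 * da1_dz1 * dz1_dw1   # ≈ -0.0046
w1_new = 0.1 - eta * dE_dw1           # ≈ 0.1055
print(dE_dw1, w1_new)
```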
A final diagram to wrap it up:

https://www.jeremyjordan.me/neural-networks-training/

Weight updates for the network: in the linked diagram, the gradient for each weight is obtained by combining the paths (the blue path and the orange path) that flow back from the error to that weight.
Continue:
• A similar procedure is followed for all the other neurons.

Complete Guide to Neural Networks with Python: Theory and Applications


Take the second sample (iteration 2): input (0.3, 0.4), target (0.9, 0.9).

Take the third sample (iteration 3): input (0.7, 0.9), target (0.1, 0.1).
• That was ONE EPOCH. An epoch is one complete pass through all the samples. After repeating that for many epochs (e.g., 25), our neural network is expected to reach the minimum error and be considered trained (a minimal training-loop sketch follows below). We'll learn about optimization later!
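As a wrap-up, here is a compact sketch (assuming NumPy; all names, and the choice of 25 epochs, follow the example rather than any fixed recipe) of the whole procedure, one sample at a time, repeated for several epochs:

```python
# One epoch = one pass over all three samples; repeat for several epochs.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.5, -0.5], [0.3, 0.4], [0.7, 0.9]])   # inputs
D = np.array([[0.9, 0.1], [0.9, 0.9], [0.1, 0.1]])    # targets

# Initial weights and biases from the example network (row i = neuron i)
W1 = np.array([[0.1, -0.2], [0.3, 0.55]]);    b1 = np.array([0.01, -0.02])
W2 = np.array([[0.37, 0.9], [-0.22, -0.12]]); b2 = np.array([0.31, 0.27])
eta = 1.2

for epoch in range(25):                     # e.g. 25 epochs
    for x, d in zip(X, D):                  # batch size of 1
        # Forward pass
        a1 = sigmoid(W1 @ x + b1)
        a2 = sigmoid(W2 @ a1 + b2)
        # Backward pass (chain rule)
        delta2 = -(d - a2) * a2 * (1 - a2)          # output-layer error term
        delta1 = (W2.T @ delta2) * a1 * (1 - a1)    # hidden-layer error term
        # Gradient-descent updates
        W2 -= eta * np.outer(delta2, a1); b2 -= eta * delta2
        W1 -= eta * np.outer(delta1, x);  b1 -= eta * delta1

print(sigmoid(W2 @ sigmoid(W1 @ X[0] + b1) + b2))   # network output for sample 1
```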
