7 - Feedforward and Backpropagation
Backpropagation
© Nisheeth Joshi
Training a Neural Network
• Forward Propagation
• Backpropagation
What is training?

Student | X1 Physics (%) | X2 Chemistry (%) | X3 Hours Studied | Y Mathematics (%)
1       | 60             | 80               | 5                | 82
2       | 70             | 75               | 7                | 94
3       | 50             | 55               | 10               | 45
4       | 40             | 56               | 7                | 43
What kind of a problem is this? Why?
Regression or classification? Since Y (the Mathematics score) is a continuous value, this is a regression problem.
[Network diagram: three inputs (X1, X2, X3) feed a hidden layer of two neurons (weights w1–w6, biases b1 and b2); each hidden neuron performs a summation (∑) followed by an activation (∫). The two hidden outputs feed a single output neuron (weights w7 and w8, bias b3) that produces the prediction y'.]
Input Layer
[Same network diagram, highlighting the input nodes X1, X2, X3.]
Edges
[Same network diagram, highlighting the edges that carry the weights.]
Weights and Biases – Most Important
[Same network diagram, highlighting the weights (w1–w8) and biases (b1, b2, b3).]
Hidden Layer (Contains Neurons)
[Same network diagram, highlighting the two hidden-layer neurons.]
Each neuron performs two operations:
1. Summation (linear)
2. Activation (non-linear)
Output Layer
[Same network diagram, highlighting the output neuron that produces y'.]
Forward Propagation
• We feed in the input data.
• It gets transmitted to the neurons by multiplying the inputs with the weights.
• Linear and non-linear operations are performed.
  • Non-linear, because we want our network to fit all kinds of interesting relationships (between input and output).
• The results of the neuron computations are then multiplied with the output-layer weights and the output is generated (a code sketch of one full pass follows).
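To make the flow concrete, here is a minimal sketch of one forward pass through this 3-input, 2-hidden-neuron, 1-output network using NumPy. The weight and bias values are simply the illustrative ones used in the worked example on the slides below, not learned values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([60.0, 80.0, 5.0])            # X1, X2, X3 for student 1

# Hidden layer: 2 neurons, each with 3 weights and a bias
# (values taken from the worked example on the slides below).
W_hidden = np.array([[0.10, 0.10,  0.10],
                     [0.15, 0.05, -0.20]])
b_hidden = np.array([-15.0, -15.0])

# Output layer: 1 neuron with 2 weights and a bias.
w_out = np.array([12.0, 9.0])
b_out = 20.0

z = W_hidden @ x + b_hidden                # linear step: z1, z2
g = sigmoid(z)                             # non-linear step: g1, g2
y_pred = w_out @ g + b_out                 # linear output

print(np.round(z, 3))    # [-0.5 -3. ]
print(np.round(g, 3))    # [0.378 0.047]
print(round(y_pred, 2))  # 24.96, shown as 24.95 in the worked example below
```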
Sample Dataset (same table as above)
The worked example that follows feeds in Student 1: X1 = 60, X2 = 80, X3 = 5, with actual Y = 82.
[Network diagram with the first student's data fed in: X1 = 60, X2 = 80, X3 = 5.]
[Zoom in on the first hidden neuron: inputs 60, 80, 5 with their weights and bias b1, producing g1.]
Inside the Neuron
• Two operations:
  • Linear: a summation of the weighted inputs, plus a bias.
  • Non-linear: an activation applied to that sum (sketched below).
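A minimal sketch of these two operations for a single neuron, assuming the sigmoid activation used on the slides that follow:

```python
import math

def neuron(inputs, weights, bias):
    # 1. Summation (linear): weighted sum of the inputs plus the bias.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # 2. Activation (non-linear): sigmoid squashes z into (0, 1).
    g = 1.0 / (1.0 + math.exp(-z))
    return z, g

# First hidden neuron of the worked example:
z1, g1 = neuron([60, 80, 5], [0.1, 0.1, 0.1], -15)
print(round(z1, 2), round(g1, 2))   # -0.5 0.38 (the worked example shows g1 as 0.37)
```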
Linear Operation
[Inside the first hidden neuron: the weighted inputs plus bias b1 are summed to give z1, which then passes through the activation to give g1.]
Linear Operation (worked example, with w1 = w2 = w3 = 0.1 and b1 = -15)
z1 = w1*X1 + w2*X2 + w3*X3 + b1
   = 0.1*60 + 0.1*80 + 0.1*5 + (-15)
   = -0.5
Non-Linear Operation
The sigmoid activation is applied to z1:
g1 = 1 / (1 + e^(-z1))
Non-Linear Operation (worked example)
g1 = 1 / (1 + e^(-z1)) = 1 / (1 + e^(-(-0.5))) ≈ 0.37
[First hidden neuron computed: with weights of 0.1 and bias -15, its output is g1 = 0.37.]
[Second hidden neuron: the same two operations, with its own weights and bias b2, giving z2 and then g2.]
Worked example for the second hidden neuron, with weights 0.15, 0.05, -0.2 and bias -15:
z2 = 0.15*60 + 0.05*80 + (-0.2)*5 + (-15) = -3
g2 = 1 / (1 + e^(-(-3))) ≈ 0.047
[Full network diagram after the hidden layer: g1 = 0.37 and g2 = 0.047 now feed the output neuron.]
Output Layer: Linear Summation
[The output neuron sums g1 = 0.37 and g2 = 0.047, weighted by w7 and w8, plus bias b3.]
Output Layer: Linear Summation (worked example, with w7 = 12, w8 = 9, b3 = 20)
y' = w7*g1 + w8*g2 + b3
   = 12*0.37 + 9*0.047 + 20
   ≈ 24.95
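Note that with the rounded activations (0.37 and 0.047) the sum is 24.86; the 24.95 above comes from carrying the unrounded values of g1 and g2. A quick check:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

g1, g2 = sigmoid(-0.5), sigmoid(-3.0)        # 0.3775..., 0.0474...
print(round(12 * g1 + 9 * g2 + 20, 2))       # 24.96, the 24.95 above up to rounding
print(round(12 * 0.37 + 9 * 0.047 + 20, 2))  # 24.86 with the rounded activations
```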
[Full network diagram after the forward pass: the output neuron produces y' = 24.95.]
Compare the Y-predicted (24.95) with the Y-actual (82).
What to do?
Backpropagation
Backpropagation
• Problem: the prediction (24.95) is far from the actual value (82). How should the weights and biases be changed to reduce this error?
The best way
• Apply gradient descent.
• Move along the negative direction of the slope of an error (cost/loss) function until we find a minimum value (a tiny sketch of the idea follows).
• We want to understand how sensitive the cost function is to changes in w7.
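The gradient-descent idea as a minimal sketch on a made-up one-variable cost, cost(w) = (w - 3)^2; the cost, starting point, and learning rate here are purely illustrative:

```python
# Toy gradient descent on cost(w) = (w - 3)**2 (illustrative values only).
def d_cost(w):
    return 2 * (w - 3)                       # slope of the cost at w

w = 0.0                                      # arbitrary starting point
learning_rate = 0.1
for _ in range(50):
    w = w - learning_rate * d_cost(w)        # step against the slope

print(round(w, 3))                           # approaches 3.0, the minimum of the cost
```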
Cost/Loss Function
Cost = (Y_pred - Y_actual)^2
∂cost/∂w7 = (∂cost/∂Y_pred) * (∂Y_pred/∂w7)    (chain rule)
y' = w7*g1 + w8*g2 + b3
Since ∂cost/∂Y_pred = 2(Y_pred - Y_actual) and, from y' = w7*g1 + w8*g2 + b3, ∂Y_pred/∂w7 = g1, this gives:
∂cost/∂w7 = 2(Y_pred - Y_actual) * g1
• Similarly, the partial derivative of the cost function with respect to w8 is 2(Y_pred - Y_actual) * g2, and so on for the other weights and biases. (A quick numeric check follows.)
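Here is a minimal numeric check of these output-layer gradients, using the rounded values from the worked example (y_pred = 24.95, y_actual = 82, g1 = 0.37, g2 = 0.047):

```python
# Output-layer gradients for Cost = (y_pred - y_actual)**2,
# using the rounded values from the worked example.
y_pred, y_actual = 24.95, 82.0
g1, g2 = 0.37, 0.047

d_cost_d_ypred = 2 * (y_pred - y_actual)   # ∂cost/∂y_pred
d_cost_d_w7 = d_cost_d_ypred * g1          # ∂cost/∂w7
d_cost_d_w8 = d_cost_d_ypred * g2          # ∂cost/∂w8
d_cost_d_b3 = d_cost_d_ypred * 1.0         # ∂cost/∂b3 (∂y_pred/∂b3 = 1)

print(round(d_cost_d_ypred, 1), round(d_cost_d_w7, 2),
      round(d_cost_d_w8, 2), round(d_cost_d_b3, 1))
# -114.1 -42.22 -5.36 -114.1 (the updates below use -42.2 and -5.36)
```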
Once Done…
Updating weights and biases (learning rate = 0.01; see the sketch below)
• w7+ = w7 - 0.01 * ∂cost/∂w7 = 12 - 0.01*(-42.2) = 12.42
• w8+ = w8 - 0.01 * ∂cost/∂w8 = 9 - 0.01*(-5.36) ≈ 9.05
• b3+ = b3 - 0.01 * ∂cost/∂b3 = 20 - 0.01*(-114.1) ≈ 21.1
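A minimal sketch of the same updates, with the gradient values from the check above:

```python
# Gradient-descent updates for the output-layer parameters,
# with learning rate 0.01 and the gradients computed above.
lr = 0.01
w7, w8, b3 = 12.0, 9.0, 20.0
grad_w7, grad_w8, grad_b3 = -42.2, -5.36, -114.1

w7 -= lr * grad_w7
w8 -= lr * grad_w8
b3 -= lr * grad_b3

print(round(w7, 2), round(w8, 2), round(b3, 2))
# 12.42 9.05 21.14 (compare w7 = 12.42, w8 = 9.05, b3 = 21.1 above)
```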
Previous Weights
[Output neuron before the update: w7 = 12, w8 = 9, b3 = 20, with inputs g1 = 0.37 and g2 = 0.047.]
Updated Weights
[Output neuron after the update: w7 = 12.42, w8 = 9.05, b3 = 21.1, with inputs g1 = 0.37 and g2 = 0.047.]
All the weights and biases are updated in the same way (each gradient again scaled by the learning rate):
w1+ = w1 - ∂cost/∂w1        w7+ = w7 - ∂cost/∂w7
w2+ = w2 - ∂cost/∂w2        w8+ = w8 - ∂cost/∂w8
w3+ = w3 - ∂cost/∂w3        b3+ = b3 - ∂cost/∂b3
w4+ = w4 - ∂cost/∂w4
w5+ = w5 - ∂cost/∂w5
w6+ = w6 - ∂cost/∂w6
b1+ = b1 - ∂cost/∂b1
b2+ = b2 - ∂cost/∂b2
[Full network diagram shown again, before tracing the gradient back to the hidden-layer parameters.]
Let’s consider w1
• w1+ = w1 - ∂cost/∂w1
• Trace back…
[Zoom in again on the first hidden neuron: inputs 60, 80, 5 with weights of 0.1 and bias -15, giving z1 and then g1.]
Let’s consider w1
∂cost/∂w1 = (∂cost/∂Y_pred) * (∂Y_pred/∂g1) * (∂g1/∂z1) * (∂z1/∂w1)
• ∂cost/∂Y_pred = 2(Y_pred - Y_actual)
• ∂Y_pred/∂g1 = w7
• ∂g1/∂z1 = g1 * (1 - g1) = (1 / (1 + e^(-z1))) * (1 - 1 / (1 + e^(-z1)))
• ∂z1/∂w1 = x1
Putting in the numbers:
∂cost/∂w1 = 2(24.95 - 82) * 12 * [(1 / (1 + e^(-(-0.5)))) * (1 - 1 / (1 + e^(-(-0.5))))] * 60 ≈ -19303
Update w1
• w1+ = w1 - 0.01 * ∂cost/∂w1 = 0.1 - 0.01*(-19303) ≈ 193 (checked in the sketch below)
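A minimal sketch of the same chain-rule computation for w1; using the exact sigmoid value it lands within rounding of the figures above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values from the worked example (rounded as above).
y_pred, y_actual = 24.95, 82.0
w7, x1, z1, w1, lr = 12.0, 60.0, -0.5, 0.1, 0.01

g1 = sigmoid(z1)
# Chain rule: dcost/dw1 = dcost/dy_pred * dy_pred/dg1 * dg1/dz1 * dz1/dw1
grad_w1 = 2 * (y_pred - y_actual) * w7 * (g1 * (1 - g1)) * x1
w1_new = w1 - lr * grad_w1

print(round(grad_w1), round(w1_new, 1))
# about -19306 and 193.2 (the worked example reports -19303 and 193)
```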
[Full network diagram once more: the same trace-back is repeated for the remaining weights and biases of the hidden layer.]
Repeat
Forward propagation, backpropagation, and the weight updates are repeated until the cost reaches a minimum.