
Introduction to Training a Network

Perceptron

ŷ = 1 / (1 + e^-(w0 + w1*x1 + w2*x2 + ...))

ŷ = f(x_i; W)
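As a quick illustration, here is a minimal Python sketch of this forward pass (the function names and the zero weights are my own choices for the example, not taken from the slides):

import math

def sigmoid(z):
    # Logistic activation: squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_predict(x, w):
    # w[0] is the bias w0; w[1], w[2], ... pair up with the inputs x1, x2, ...
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return sigmoid(z)

print(perceptron_predict([2.78, 2.55], [0.0, 0.0, 0.0]))    # prints 0.5 when all weights are zero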
Line Learning
Filter Learning
Filter Visualization
Probability

ŷ = f(x_i; W)   (Predicted Value: the model outputs a probability)

Class Labels: y = 1 or y = 0   (Actual Value)
Training a Perceptron

ŷ = f(x_i; W)   (Predicted Value)
Training Data

...
Random Initialization
Feed Forward

ŷ = f(x_i; W) = 0.8   (Predicted)
Loss Function

ŷ = 0.8 (Predicted),  y = 0 (Actual).  Error = ?

SSE = (1/2n) * Σ_{i=1}^{n} (ŷ_i - y_i)^2

J(W) = (1/2n) * Σ_{i=1}^{n} (f(x_i; W) - y_i)^2

J(W) = (1/2n) * Σ_{i=1}^{n} (1 / (1 + e^-(w0 + w1*x1 + w2*x2 + ...)) - y_i)^2
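A minimal Python sketch of this loss, assuming the perceptron_predict helper sketched earlier and a simple list-of-pairs data layout (both are assumptions for illustration):

def sse_loss(data, w):
    # data: list of (features, label) pairs; returns J(W) = (1/2n) * sum((y_hat - y)^2)
    n = len(data)
    total = 0.0
    for x, y in data:
        y_hat = perceptron_predict(x, w)    # predicted value f(x_i; W)
        total += (y_hat - y) ** 2           # squared error for this instance
    return total / (2.0 * n)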
Loss Optimization Function

ŷ = 0.01 (Predicted),  y = 0 (Actual)

SSE = (1/2n) * Σ_{i=1}^{n} (ŷ_i - y_i)^2

J(W) = (1/2n) * Σ_{i=1}^{n} (f(x_i; W) - y_i)^2

J(W) = (1/2n) * Σ_{i=1}^{n} (1 / (1 + e^-(w0 + w1*x1 + w2*x2 + ...)) - y_i)^2
Idea of Gradient Descent

[Figure: loss J(w) plotted against the weight w]
Stochastic Gradient Descent

• We can apply stochastic gradient descent to the problem of finding the coefficients for our
  model as follows:

  – For each training instance:
    – Calculate a prediction using the current values of the coefficients.
    – Calculate new coefficient values based on the error in the prediction.

• We continue to update the model, correcting errors on one training instance at a time, until
  the model is accurate enough or cannot be made any more accurate. It is often a good idea to
  randomize the order of the training instances shown to the model to mix up the corrections
  made.
Stochastic Gradient Descent
ŷ = 1 / (1 + e^-(w0 + w1*x1 + w2*x2))

J(w) = J(w0, w1, w2) = (1/2n) * Σ_{i=1}^{n} (ŷ_i - y_i)^2

w0 = w0 - lambda * dJ/d(w0)
w1 = w1 - lambda * dJ/d(w1)
w2 = w2 - lambda * dJ/d(w2)

w0 = w0 - lambda * (ŷ_i - y_i) * ŷ_i * (1 - ŷ_i)
w1 = w1 - lambda * (ŷ_i - y_i) * ŷ_i * (1 - ŷ_i) * x1
w2 = w2 - lambda * (ŷ_i - y_i) * ŷ_i * (1 - ŷ_i) * x2

https://towardsdatascience.com/derivative-of-the-sigmoid-function-536880cf918e
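A hedged Python sketch of one such update on a single training instance, following the three update rules above (the perceptron_predict helper and the data layout are assumptions carried over from the earlier sketches):

def sgd_update(x, y, w, lam=0.3):
    # One stochastic gradient descent step for the sigmoid perceptron.
    y_hat = perceptron_predict(x, w)            # prediction with the current weights
    grad = (y_hat - y) * y_hat * (1 - y_hat)    # common factor (y_hat - y) * y_hat * (1 - y_hat)
    w[0] = w[0] - lam * grad                    # w0 = w0 - lambda * grad
    for i, xi in enumerate(x, start=1):
        w[i] = w[i] - lam * grad * xi           # wi = wi - lambda * grad * xi
    return w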
Iteration 1

• Let's start with values of 0.0 for the coefficients and 0.3 for the learning rate.

w0 = 0.0, w1 = 0.0, w2 = 0.0, lambda = 0.3

• We can now calculate the predicted value ŷ using our starting-point coefficients for the
  first training instance:

ŷ = 1 / (1 + e^-(w0 + w1*x1 + w2*x2))
  = 1 / (1 + e^-(0 + 0*2.78 + 0*2.55))
  = 0.5

• We can now use this prediction in our gradient descent equations to update the weights.

w0 = w0 - lambda * (ŷ_i - y_i) * ŷ_i * (1 - ŷ_i)
   = 0.0 - 0.3 * (0.5 - 0) * 0.5 * (1 - 0.5)
   = -0.0375

w1 = w1 - lambda * (ŷ_i - y_i) * ŷ_i * (1 - ŷ_i) * x1
   = 0.0 - 0.3 * (0.5 - 0) * 0.5 * (1 - 0.5) * 2.78
   = -0.104290635

w2 = w2 - lambda * (ŷ_i - y_i) * ŷ_i * (1 - ŷ_i) * x2
   = 0.0 - 0.3 * (0.5 - 0) * 0.5 * (1 - 0.5) * 2.55
   = -0.09564513761
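The iteration-1 numbers can be reproduced with the sgd_update sketch from the previous slide (the x values here are the rounded 2.78 and 2.55 shown above, so the last digits differ slightly from the full-precision results):

w = [0.0, 0.0, 0.0]
w = sgd_update([2.78, 2.55], 0, w, lam=0.3)
print(w)    # approximately [-0.0375, -0.10425, -0.095625]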
Iteration 2

• We now repeat the same procedure, starting from the coefficients updated in iteration 1 and
  keeping the learning rate at 0.3.

w0 = -0.0375, w1 = -0.1042, w2 = -0.0956, lambda = 0.3

• We calculate a new predicted value ŷ from these coefficients for the next training instance:

ŷ = 1 / (1 + e^-(w0 + w1*x1 + w2*x2))

• We then use this prediction, and its error against the actual label, in the same gradient
  descent equations to update w0, w1 and w2 again.
Iteration 100

• You can repeat this process 100 times: that is 10 complete epochs of the training data being
  exposed to the model and updating the coefficients. The accompanying graph shows a plot of the
  model's accuracy over the 10 epochs.

• Here are the final values of the coefficients after the 100 iterations:

w0 = -0.4066054641

w1 = 0.8525733164

w2 = -1.104746259
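A hedged sketch of the whole loop (10 epochs over the training instances), reusing the sgd_update sketch; the dataset variable is a placeholder, since the slides do not list the training values in text form:

def train(dataset, lam=0.3, epochs=10):
    # dataset: list of (features, label) pairs; returns the trained weight list
    w = [0.0] * (len(dataset[0][0]) + 1)    # one bias plus one weight per input
    for _ in range(epochs):
        for x, y in dataset:                # one SGD update per training instance
            w = sgd_update(x, y, w, lam)
    return w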
Trained

w0 = -0.4066054641

w1 = 0.8525733164

w2 = -1.104746259

ŷ = 1 / (1 + e^-(w0 + w1*x1 + w2*x2))

ŷ = 1 / (1 + e^-(-0.4066054641 + 0.8525733164*x1 + -1.104746259*x2))
Testing a Perceptron

• Let's plug the final coefficients into our model and make a prediction for each point in our
  training dataset.

ŷ = 1 / (1 + e^-(w0 + w1*x1 + w2*x2))

ŷ = 1 / (1 + e^-(-0.4066054641 + 0.8525733164*x1 + -1.104746259*x2))

Crisp class prediction: IF (Predicted < 0.5) THEN 0 ELSE 1

Accuracy = Correct Predictions / Total Predictions

= (10 /10) * 100

= 100%
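A minimal Python sketch of this test step, reusing the earlier perceptron_predict sketch; dataset is again a placeholder for the training points, whose values are not listed in the text:

w = [-0.4066054641, 0.8525733164, -1.104746259]    # trained coefficients from the slide

def classify(x, w):
    # Threshold the sigmoid output at 0.5 to get a crisp 0/1 class label
    return 0 if perceptron_predict(x, w) < 0.5 else 1

def accuracy(dataset, w):
    # Percentage of instances whose crisp prediction matches the actual label
    correct = sum(1 for x, y in dataset if classify(x, w) == y)
    return correct / len(dataset) * 100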
Trained

w0 = -0.4066054641

w1 = 0.8525733164

w2 = -1.104746259, . . .

ŷ = 1 / (1 + e^-(w0 + w1*x1 + w2*x2 + . . . ))

ŷ = 1 / (1 + e^-(-0.4066054641 + 0.8525733164*x1 + -1.104746259*x2 + . . . ))
Testing a Perceptron

ŷ = f(x_i; W)   (Predicted Value)
Stochastic Gradient Descent
ŷ = 1 / (1 + e^-(w0 + w1*x1 + w2*x2))

J(w) = J(w0, w1, w2) = (1/2n) * Σ_{i=1}^{n} (ŷ_i - y_i)^2

w0 = w0 - lambda * dJ/d(w0)
w1 = w1 - lambda * dJ/d(w1)
w2 = w2 - lambda * dJ/d(w2)

w0 = w0 - lambda * (ŷ_i - y_i) * ŷ_i * (1 - ŷ_i)
w1 = w1 - lambda * (ŷ_i - y_i) * ŷ_i * (1 - ŷ_i) * x1
w2 = w2 - lambda * (ŷ_i - y_i) * ŷ_i * (1 - ŷ_i) * x2

https://towardsdatascience.com/derivative-of-the-sigmoid-function-536880cf918e
Composite Function

x → f1 → y1 → f2 → y2

f2(f1(x)) = (f2 ∘ f1)(x)

∂f2/∂x = ∂f2/∂f1 * ∂f1/∂x     (Apply Chain Rule)
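A small numerical check of the chain rule, using two simple functions chosen arbitrarily for this example:

def f1(x):
    return x * x            # y1 = x^2, so dy1/dx = 2x

def f2(y1):
    return 3.0 * y1 + 1.0   # y2 = 3*y1 + 1, so dy2/dy1 = 3

x = 2.0
analytic = 3.0 * (2.0 * x)                      # chain rule: dy2/dx = dy2/dy1 * dy1/dx = 3 * 2x
h = 1e-6
numeric = (f2(f1(x + h)) - f2(f1(x))) / h       # finite-difference estimate of dy2/dx
print(analytic, numeric)                        # both are close to 12.0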
Idea of Back Propagation

x → f1(w1) → y1 → J(w)

∂J(w)/∂w1 = ∂J(w)/∂y1 * ∂y1/∂w1
Idea of Back Propagation

x → f1(w1) → y1 → f2(w2) → y2 → J(w)

∂J(w)/∂w1 = ∂J(w)/∂y1 * ∂y1/∂w1          ∂J(w)/∂w2 = ∂J(w)/∂y2 * ∂y2/∂w2
Idea of Back Propagation

x → f1(w1) → y1 → f2(w2) → y2 → J(w)

∂J(w)/∂w1 = ∂J(w)/∂y2 * ∂y2/∂y1 * ∂y1/∂w1
            (the first two factors together are ∂J(w)/∂y1)
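A hedged Python sketch of this idea on a concrete two-stage chain; the choice of sigmoid stages and a squared-error loss is mine for illustration, the slide itself only fixes the chain-rule structure:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_two_stage(x, y, w1, w2):
    # Forward pass: x -> f1 -> y1 -> f2 -> y2 -> J(w)
    y1 = sigmoid(w1 * x)
    y2 = sigmoid(w2 * y1)
    J = 0.5 * (y2 - y) ** 2

    # Backward pass, reusing intermediate derivatives exactly as in the chain rule above
    dJ_dy2 = y2 - y
    dJ_dw2 = dJ_dy2 * y2 * (1 - y2) * y1    # dJ/dw2 = dJ/dy2 * dy2/dw2
    dJ_dy1 = dJ_dy2 * y2 * (1 - y2) * w2    # the grouped term dJ/dy1
    dJ_dw1 = dJ_dy1 * y1 * (1 - y1) * x     # dJ/dw1 = dJ/dy1 * dy1/dw1
    return J, dJ_dw1, dJ_dw2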
CNN
Training
Learning Rate
Training
Training
Summary
