14 Introduction To Training A Network
Perceptron
ŷ = 1 / (1 + e^{−(w0 + w1·x1 + w2·x2 + …)})
ŷ = f(x_i; W)
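A minimal sketch of this model in code (Python; the function name, example inputs, and weights below are illustrative assumptions, not values from the slides). The perceptron takes a weighted sum of its inputs and squashes it through the sigmoid:

import math

def perceptron(x, w):
    # w[0] is the bias w0; w[1], w[2], ... multiply the inputs x1, x2, ...
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid squashes z into (0, 1)

# Example: two inputs with made-up weights; the output is a value between 0 and 1
print(perceptron([2.0, 3.0], [0.1, 0.5, -0.5]))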
Line Learning
Filter Learning
Filter Visualization
Probability and Class Labels
ŷ = f(x_i; W) is the predicted value, a probability between 0 and 1.
The actual value is the class label: y = 1 or y = 0.
Training a Perceptron
ŷ = f(x_i; W)   (predicted value)
Training Data (…) → Random Initialization of the weights → Feed Forward:
ŷ = f(x_i; W) = 0.8   (predicted value)
Loss Function
Predicted ŷ = 0.8, actual y = 0 → Error = ?
Loss Function
Predicted ŷ = 0.8, actual y = 0
SSE = (1/(2n)) Σ_{i=1}^{n} (ŷ_i − y_i)²
Loss Function
Predicted ŷ = 0.8, actual y = 0
SSE = (1/(2n)) Σ_{i=1}^{n} (ŷ_i − y_i)²
SSE = (1/(2n)) Σ_{i=1}^{n} (f(x_i; W) − y_i)²
SSE = (1/(2n)) Σ_{i=1}^{n} (1 / (1 + e^{−(w0 + w1·x1 + w2·x2 + …)}) − y_i)²
Loss Function
Predicted ŷ = 0.8, actual y = 0
SSE = (1/(2n)) Σ_{i=1}^{n} (ŷ_i − y_i)²
J(W) = (1/(2n)) Σ_{i=1}^{n} (f(x_i; W) − y_i)²
J(W) = (1/(2n)) Σ_{i=1}^{n} (1 / (1 + e^{−(w0 + w1·x1 + w2·x2 + …)}) − y_i)²
Loss Optimization Function
Predicted ŷ = 0.01, actual y = 0 (after optimizing the loss)
SSE = (1/(2n)) Σ_{i=1}^{n} (ŷ_i − y_i)²
J(W) = (1/(2n)) Σ_{i=1}^{n} (f(x_i; W) − y_i)²
J(W) = (1/(2n)) Σ_{i=1}^{n} (1 / (1 + e^{−(w0 + w1·x1 + w2·x2 + …)}) − y_i)²
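To make the loss concrete, here is a rough check in Python (the function name is mine; the 0.8 and 0.01 predictions and the label 0 come from the slides above, treated as a batch of one):

import numpy as np

def sse_loss(y_hat, y):
    # J(W) = (1/(2n)) * sum_i (y_hat_i - y_i)^2
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((y_hat - y) ** 2) / (2 * y.size)

print(sse_loss([0.8], [0]))    # (1/2) * 0.8**2  = 0.32   (before optimization)
print(sse_loss([0.01], [0]))   # (1/2) * 0.01**2 = 5e-05  (after optimization)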
Idea of Gradient Descent
(Figure: the loss J(w) plotted against a weight w; gradient descent steps w downhill along the curve toward the minimum.)
Stochastic Gradient Descent
• We can apply stochastic gradient descent to the problem of finding the coefficients for our model as follows: for each training instance, make a prediction and then nudge each coefficient in the direction that reduces the error.
• Continue updating the model on training instances and correcting errors until the model is accurate enough or cannot be improved any further. It is often a good idea to randomize the order of the training instances shown to the model, so the corrections are mixed up. (A code sketch of this loop follows the update equations below.)
Stochastic Gradient Descent
ŷ = 1 / (1 + e^{−(w0 + w1·x1 + w2·x2)})

J(w) = J(w0, w1, w2) = (1/(2n)) Σ_{i=1}^{n} (ŷ_i − y_i)²

w0 = w0 − lambda · ∂J/∂w0
w1 = w1 − lambda · ∂J/∂w1
w2 = w2 − lambda · ∂J/∂w2

For a single training instance, using the derivative of the sigmoid, ŷ(1 − ŷ):
w0 = w0 − lambda · (ŷ_i − y_i) · ŷ_i · (1 − ŷ_i)
w1 = w1 − lambda · (ŷ_i − y_i) · ŷ_i · (1 − ŷ_i) · x1
w2 = w2 − lambda · (ŷ_i − y_i) · ŷ_i · (1 − ŷ_i) · x2

https://towardsdatascience.com/derivative-of-the-sigmoid-function-536880cf918e
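A rough sketch of these updates in code (Python; the function and variable names are mine, and the learning rate is written lr because lambda is a reserved word in Python):

import math
import random

def predict(row, w):
    # row = [x1, x2], w = [w0, w1, w2]: sigmoid of the weighted sum
    z = w[0] + w[1] * row[0] + w[2] * row[1]
    return 1.0 / (1.0 + math.exp(-z))

def sgd_update(row, y, w, lr):
    # One stochastic gradient descent step for a single (row, y) instance:
    # w_j = w_j - lr * (y_hat - y) * y_hat * (1 - y_hat) * x_j, with x_0 = 1 for the bias
    y_hat = predict(row, w)
    grad = (y_hat - y) * y_hat * (1.0 - y_hat)
    w[0] -= lr * grad
    w[1] -= lr * grad * row[0]
    w[2] -= lr * grad * row[1]
    return w

def train(data, labels, lr=0.3, epochs=10):
    # Repeated epochs over the training data, shuffling the instance order each epoch
    w = [0.0, 0.0, 0.0]
    order = list(range(len(data)))
    for _ in range(epochs):
        random.shuffle(order)
        for i in order:
            w = sgd_update(data[i], labels[i], w, lr)
    return w

With the two-feature training set used in the iterations below (not reproduced on these slides) and lr = 0.3 for 10 epochs, this loop corresponds to the 100 updates discussed in the following slides.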
Iteration 1
• Let's start with values of 0.0 for all coefficients and 0.3 for the learning rate.
• With all coefficients at zero, the first prediction is ŷ = 1 / (1 + e^0) = 0.5. We can now use this prediction in our equation for gradient descent to update the weights; the first training instance has x1 ≈ 2.78, x2 ≈ 2.55 and actual label y = 0.
w0 = 0.0 − 0.3 · (0.5 − 0) · 0.5 · (1 − 0.5) = −0.0375
w1 = 0.0 − 0.3 · (0.5 − 0) · 0.5 · (1 − 0.5) · 2.78 ≈ −0.104290635
w2 = 0.0 − 0.3 · (0.5 − 0) · 0.5 · (1 − 0.5) · 2.55 ≈ −0.09564513761
(x1 and x2 are shown rounded to two decimals; the results use their full precision.)
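A quick arithmetic check of iteration 1 (Python; the full-precision inputs are recovered from the printed results, since 2.78 and 2.55 are rounded values):

# All weights start at 0.0 and the learning rate is 0.3.
# With zero weights the sigmoid output is 0.5 for any input.
lr = 0.3
y_hat, y = 0.5, 0
x1, x2 = 2.7810836, 2.550537003   # full precision behind the rounded 2.78 and 2.55

grad = (y_hat - y) * y_hat * (1 - y_hat)   # 0.5 * 0.5 * 0.5 = 0.125
w0 = 0.0 - lr * grad                       # -0.0375
w1 = 0.0 - lr * grad * x1                  # about -0.104290635
w2 = 0.0 - lr * grad * x2                  # about -0.09564513761
print(w0, w1, w2)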
Iteration 2
• Repeat the same procedure with the coefficients produced by iteration 1 (w0 = −0.0375, w1 ≈ −0.1043, w2 ≈ −0.0956) and the next training instance.
• Compute the new prediction ŷ for that instance, then apply the same update equations:
w0 = w0 − lambda · (ŷ_i − y_i) · ŷ_i · (1 − ŷ_i)
w1 = w1 − lambda · (ŷ_i − y_i) · ŷ_i · (1 − ŷ_i) · x1
w2 = w2 − lambda · (ŷ_i − y_i) · ŷ_i · (1 − ŷ_i) · x2
Iteration 100
• You can repeat this process 100 times: 10 complete epochs of the training data being exposed to the model, with the coefficients updated after each instance. The accompanying graph plots the accuracy of the model over the 10 epochs.
• Here are the coefficient values after the 100 iterations:
w0 = -0.4066054641
w1 = 0.8525733164
w2 = -1.104746259
Trained
w0 = -0.4066054641
w1 = 0.8525733164
w2 = -1.104746259
• Let's plug the final coefficients into our model and make a prediction for each point in our training dataset.
• Accuracy = 100%: every training point is classified correctly.
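A sketch of that final check for a single point (Python; the 0.5 classification threshold and the use of the first training instance from iteration 1 are my assumptions, since the full dataset is not reproduced on these slides):

import math

# Final trained coefficients from the slides
w0, w1, w2 = -0.4066054641, 0.8525733164, -1.104746259

def predict(x1, x2):
    # Sigmoid of the weighted sum; rounding gives the class label (threshold 0.5)
    z = w0 + w1 * x1 + w2 * x2
    return 1.0 / (1.0 + math.exp(-z))

p = predict(2.7810836, 2.550537003)   # first training instance, actual label 0
print(p, round(p))                    # roughly 0.30 -> class 0, matching the label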
Trained
w0 = -0.4066054641
w1 = 0.8525733164
w2 = -1.104746259, …
ŷ = f(x_i; W)   (predicted value)
Composite Function
x → f1 → y1 → f2 → y2, i.e. f2(f1(x)) = (f2 ∘ f1)(x)
With a weight w1 on f1 and a weight w2 on f2, the loss is computed from the final output:
x → f1(·; w1) → y1 → f2(·; w2) → y2 → J(w)
To update w1 we need ∂J(w)/∂y1, obtained by passing the gradient from the loss back through the graph.
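A minimal numerical sketch of this chain rule (Python; choosing f1 and f2 to be sigmoids and using a squared-error loss are illustrative assumptions, not taken from the slides):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass through the composite function f2(f1(x)) with scalar weights
x, y_target = 1.5, 0.0
w1, w2 = 0.4, -0.7
y1 = sigmoid(w1 * x)            # y1 = f1(x; w1)
y2 = sigmoid(w2 * y1)           # y2 = f2(y1; w2)
J = 0.5 * (y2 - y_target) ** 2  # loss J(w)

# Backward pass: chain rule, working from the loss back toward w1
dJ_dy2 = y2 - y_target
dy2_dy1 = y2 * (1 - y2) * w2          # sigmoid derivative of f2 times its weight
dJ_dy1 = dJ_dy2 * dy2_dy1             # the quantity written on the slide
dJ_dw1 = dJ_dy1 * y1 * (1 - y1) * x   # continue the chain down to w1
print(J, dJ_dy1, dJ_dw1)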
CNN
Training
Learning Rate
Summary