Lecture 5 NN
• y′ = +1 if Σ wi xi ≥ 1, −1 if Σ wi xi < 1   or   y′ = 1 if Σ wi xi > 0, 0 if Σ wi xi ≤ 0 (summing i = 0 to n)

Samples  x1   x2   y   y′
S1       0.1  0.7  0   0
S2       0.5  1.0  0   0
S3       1.3  0.9  0   1
S4       2.1  1.6  1   0

Perceptron Learning Rule Algorithm
• Learns an acceptable weight vector for a linearly separable dataset.
• Initially, assign the weights randomly.
• While misclassified training examples exist:
• Apply the perceptron to compute the predicted output for each training sample.
• Compare the actual output (y) with the predicted output (y′).
• Modify the weights whenever the perceptron misclassifies a training sample.
• Perceptron rule: wi = wi + Δwi, where Δwi = α(Actual output − Predicted output) × xi
• Repeat until the perceptron classifies all training examples correctly.
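The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the threshold unit, learning rate (0.5), threshold (1), and dataset follow the worked example in these notes, and the function names are our own.

```python
# Minimal sketch of the perceptron learning rule described above.

def predict(weights, x, threshold=1.0):
    """Threshold unit: output 1 if the weighted sum reaches the threshold, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x))
    return 1 if total >= threshold else 0

def train_perceptron(samples, weights, alpha=0.5, threshold=1.0, max_epochs=100):
    """Repeat until every sample is classified correctly (linearly separable data)."""
    for _ in range(max_epochs):
        errors = 0
        for x, y in samples:
            y_pred = predict(weights, x, threshold)
            if y_pred != y:  # misclassified: apply the perceptron rule
                weights = [w + alpha * (y - y_pred) * xi
                           for w, xi in zip(weights, x)]
                errors += 1
        if errors == 0:      # all samples correct: stop
            break
    return weights

# AND-like dataset from the worked example: (x1, x2) -> y
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(data, [1.2, 0.6]))   # converges to [0.7, 0.6]
```

Starting from w1 = 1.2, w2 = 0.6, only S3 is misclassified, so one update yields the final weights, matching the walkthrough that follows.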
Perceptron Working Principle for Linear Dataset
• Consider the given dataset with positive- and negative-class training examples. Assume the initial weights w1 = 1.2, w2 = 0.6, learning rate α = 0.5, and threshold = 1. Find the updated weights that classify the given dataset correctly.

Samples  x1  x2  y
S1       0   0   0
S2       0   1   0
S3       1   0   0
S4       1   1   1

• Here w0 (the bias b) is not given; take it as 0.
• Predicted output y′ = w0∗x0 + w1∗x1 + w2∗x2 + ⋯ + wn∗xn = Σ wi xi
• y′ = 1 if Σ wi xi ≥ 1, 0 if Σ wi xi < 1
For S1: Predicted output y1′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 1.2(0) + 0.6(0) = 0 < 1. So y1′ = 0.
• Predicted output y1′ = 0, actual output y1 = 0. Correctly classified, so the weights need not be updated.
For S2:
• Predicted output y2′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 1.2(0) + 0.6(1) = 0.6 < 1. So y2′ = 0.
• Predicted output y2′ = 0, actual output y2 = 0. Correctly classified, so the weights need not be updated.
For S3:
• Predicted output y3′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 1.2(1) + 0.6(0) = 1.2 ≥ 1. So y3′ = 1.
• Predicted output y3′ = 1, but actual output y3 = 0. Misclassified, so the weights need to be updated.
• Update the weights using the perceptron rule wi = wi + Δwi, where
Δwi = α(Actual output − Predicted output) × xi
Perceptron Working Principle for Linear Dataset

Samples  x1  x2  y  y′
S1       0   0   0  0
S2       0   1   0  0
S3       1   0   0  1
S4       1   1   1  –

• y′ = 1 if Σ wi xi ≥ 1, 0 if Σ wi xi < 1
• Update the weights using the perceptron rule wi = wi + Δwi, where Δwi = α(Actual output − Predicted output) × xi
For S3: w1 = 1.2, w2 = 0.6 and x1 = 1, x2 = 0; learning rate α = 0.5, threshold = 1.
Weight w1: Δw1 = 0.5(0 − 1) × x1 = 0.5(0 − 1) × 1 = −0.5; w1 = w1 + Δw1 = 1.2 − 0.5 = 0.7
Weight w2: Δw2 = 0.5(0 − 1) × x2 = 0.5(0 − 1) × 0 = 0; w2 = w2 + Δw2 = 0.6 − 0 = 0.6
Now we have the updated weights w1 = 0.7 and w2 = 0.6.
Apply these weights to all the samples to verify perfect classification.
For S1: Predicted output y1′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 0.7(0) + 0.6(0) = 0 < 1. So y1′ = 0.
• Predicted output y1′ = 0, actual output y1 = 0. Correctly classified; no update needed.
For S2: Predicted output y2′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 0.7(0) + 0.6(1) = 0.6 < 1. So y2′ = 0.
• Predicted output y2′ = 0, actual output y2 = 0. Correctly classified; no update needed.
Perceptron Working Principle for Linear Dataset
For S3: Predicted output y3′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 0.7(1) + 0.6(0) = 0.7 < 1. So y3′ = 0.
• Predicted output y3′ = 0, actual output y3 = 0. Correctly classified; no update needed.
For S4: Predicted output y4′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 0.7(1) + 0.6(1) = 1.3 ≥ 1. So y4′ = 1.
• Predicted output y4′ = 1, actual output y4 = 1. Correctly classified; no update needed.
Samples  x1  x2  y  y′
S1       0   0   0  0
S2       0   1   0  0
S3       1   0   0  0
S4       1   1   1  1

• y′ = 1 if Σ wi xi ≥ 1, 0 if Σ wi xi < 1
• All training examples are correctly classified by the weights w1 = 0.7, w2 = 0.6.
• So these are the weights selected for this dataset by perceptron learning.
• Stop the algorithm.
Implement OR Gate using Perceptron (Linear Dataset)
• Consider the given dataset with positive- and negative-class training examples. Assume the initial weights w1 = 0, w2 = 0, learning rate α = 0.5, and threshold = 1.

Samples  x1  x2  y
S1       0   0   0
S2       0   1   1
S3       1   0   1
S4       1   1   1

• After repeatedly applying the perceptron rule to the misclassified samples, the weights reach w1 = 1, w2 = 1. For S3, the predicted output y3′ = 0(1) + 1(1) + 1(0) = 1 ≥ 1, so y3′ = 1.
• Predicted output y3′ = 1, actual output y3 = 1. Correctly classified; no update needed.
• Apply these weights to the rest of the samples to verify perfect classification.
For S4: Predicted output y4′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 1(1) + 1(1) = 2 ≥ 1. So y4′ = 1.
• Predicted output y4′ = 1, actual output y4 = 1. Correctly classified; no update needed.
For S1: Predicted output y1′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 1(0) + 1(0) = 0 < 1. So y1′ = 0.
• Predicted output y1′ = 0, actual output y1 = 0. Correctly classified; no update needed.
For S2: Predicted output y2′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 1(0) + 1(1) = 1 ≥ 1. So y2′ = 1.
• Predicted output y2′ = 1, actual output y2 = 1. Correctly classified; no update needed.
All training examples are correctly classified with w1 = 1 and w2 = 1, so the OR gate is implemented.
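The OR-gate run above can be reproduced with a short script. This is an illustrative sketch under the notes' settings (initial weights 0, α = 0.5, threshold = 1); the function names are our own.

```python
# Sketch: train a perceptron on the OR gate starting from zero weights.

def step(weighted_sum, threshold=1.0):
    """Threshold activation: 1 if the sum reaches the threshold, else 0."""
    return 1 if weighted_sum >= threshold else 0

def train(samples, w, alpha=0.5, threshold=1.0):
    """Loop over the data, applying the perceptron rule until no errors remain."""
    while True:
        misclassified = 0
        for x, y in samples:
            y_pred = step(sum(wi * xi for wi, xi in zip(w, x)), threshold)
            if y_pred != y:
                w = [wi + alpha * (y - y_pred) * xi for wi, xi in zip(w, x)]
                misclassified += 1
        if misclassified == 0:
            return w

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(train(or_data, [0.0, 0.0]))   # -> [1.0, 1.0]
```

Two passes of 0.5-sized updates on S2 and S3 drive both weights to 1, matching the verification above.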
Delta rule learning
• The perceptron rule works perfectly for linearly separable data, but it fails to converge for data that are not linearly separable.
• The delta rule overcomes this limitation and learns the weights for a non-linear dataset.
• It converges towards a best-fit approximation of the target output.
• To achieve this, it applies gradient descent to search the hypothesis space of possible weight vectors for the weights that best fit the training examples.
• The delta rule is the basis for backpropagation, which trains networks with many interconnected neuron units.
Delta rule learning
• The delta rule trains the unthresholded perceptron (linear unit) to learn the weights.
• Predicted output for the linear unit: y′ = w0∗x0 + w1∗x1 + w2∗x2 + ⋯ + wn∗xn = Σ wi xi
• Predicted output of the linear unit without threshold: y′(x) = w · x
• y′(x) represents the linear unit at the first stage (phase) of a perceptron, without the threshold.
• Then compute the cost: Error(w) = (1/2) Σ (Actual output − Predicted output)², summed over the m training samples.
• m — number of training samples.

Samples  x1  x2  y
S1       0   0   0
S2       0   1   1
S3       1   0   1
S4       1   1   1

Calculate the direction of steepest descent along the error surface:
• The direction is found by computing the derivative of the error E w.r.t. each weight w.
• This vector of derivatives, ∇E(w), is the gradient of E w.r.t. w: ∇E(w) = (∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn).
• Since the gradient specifies the direction of steepest increase of the error, the gradient-descent training rule is
wi = wi + Δwi;  Δwi = −α · ∇E(wi) = −α · (∂E/∂wi)
• The ‘−’ sign moves the weight vector w in the direction that decreases the error; α is the learning rate (a positive constant).
Delta rule learning
• Written in component form, the training rule is
wi = wi + Δwi;  Δwi = −α · ∇E(wi) = −α · (∂E/∂wi);  ∇E(w) = (∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn)
• Differentiating the cost (the sum runs over the m training samples, indexed by d; xid is the i-th input of sample d):
∂E/∂wi = ∂/∂wi [ (1/2) Σd (Actual output − Predicted output)² ]
= (1/2) Σd ∂/∂wi (Actual output − Predicted output)²
= (1/2) Σd 2 (Actual output − Predicted output) · ∂/∂wi (Actual output − w · xd)
= Σd (Actual output − Predicted output) · (−xid)
∇E(wi) = ∂E/∂wi = Σd (Actual output − Predicted output) · (−xid)
Gradient Descent Algorithm
• Gradient descent and the delta rule are applied to classify datasets that are not linearly separable.
• Weights are updated using the delta rule:
wi = wi + Δwi;  Δwi = −α · ∇E(wi) = −α · (∂E/∂wi)
• Since ∇E(wi) = Σd (Actual output − Predicted output) · (−xid), this gives Δwi = α Σd (Actual output − Predicted output) · xid.
GD searches through a large or infinite hypothesis space and can be applied whenever
• the hypothesis space contains continuously parameterized hypotheses (e.g., the weights of a linear unit), and
• the error can be differentiated w.r.t. the hypothesis parameters.
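The batch gradient-descent update above can be sketched on the OR-style dataset from these notes. This is an illustrative implementation of the delta rule on an unthresholded linear unit; the learning rate (0.1) and epoch count are our own choices, and the unit converges to the least-squares fit rather than a hard classifier.

```python
# Sketch: batch gradient descent with the delta rule on a linear unit.

def predict(w, b, x):
    """Unthresholded linear unit: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def delta_rule_epoch(samples, w, b, alpha=0.1):
    """One batch update: dE/dwi = -sum_d (y_d - y'_d) * x_di."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in samples:
        err = y - predict(w, b, x)        # (actual - predicted)
        for i, xi in enumerate(x):
            grad_w[i] += -err * xi
        grad_b += -err
    # w <- w - alpha * gradE  (move against the gradient)
    w = [wi - alpha * g for wi, g in zip(w, grad_w)]
    b = b - alpha * grad_b
    return w, b

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = [0.0, 0.0], 0.0
for _ in range(500):
    w, b = delta_rule_epoch(data, w, b)
print(w, b)   # approaches the least-squares fit w = [0.5, 0.5], b = 0.25
```

The error decreases every epoch; because OR is not exactly representable by a linear unit, the weights settle on the best-fit approximation rather than zero error.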
f(x) = max(0, x)   (ReLU)
4. Leaky ReLU (Rectified Linear Unit) activation function
• Leaky ReLU is an enhanced version of ReLU that solves the dying-ReLU problem, since it has a small positive slope in the negative region.
f(x) = max(0.01·x, x)
• This function returns x for any positive input.
• For any negative value of x, it returns a very small value: 0.01 times x.
• Thus the gradient on the left side of the graph is non-zero,
• so no dead neurons are encountered in that region.
• Range: −∞ to +∞.
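The two activations above are one-liners; this small sketch (names our own) contrasts their behavior on negative inputs:

```python
# ReLU vs. Leaky ReLU as defined above (negative-side slope 0.01).

def relu(x: float) -> float:
    """f(x) = max(0, x): negative inputs are clipped to 0."""
    return max(0.0, x)

def leaky_relu(x: float, slope: float = 0.01) -> float:
    """f(x) = max(slope * x, x): negative inputs keep a small slope."""
    return max(slope * x, x)

# Negative input: ReLU outputs 0 (zero gradient), Leaky ReLU outputs 0.01 * x.
print(relu(-3.0), leaky_relu(-3.0))
# Positive input: both are the identity.
print(relu(2.0), leaky_relu(2.0))
```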
Architecture of Neural Network
A neural network contains different layers.
Input layer - the first layer; it picks up the input signals and passes them to the next layer.
Hidden layer - the next layer; it does all kinds of calculations and feature extraction.
There may be more than one hidden layer.
Output layer - provides the final result.
Classification of Neural Networks
1. Shallow neural network: has only one hidden layer between the input and output.
2. Deep neural network: has more than one hidden layer. For instance, Google's GoogLeNet model for image recognition counts 22 layers.
Building Neural Network Model
1. Weight initialization
2. Forward propagation
3. Loss (or cost) computation
4. Backpropagation
5. Weight update using optimization algorithm.
Weight initialization
1. Zero initialization (symmetry problem)
• When all weights of a neural network are initialized to zero, all the partial derivatives in backpropagation are the same for every neuron, so the neurons never differentiate from one another and the network behaves like a linear model; the parameters are not usefully updated. Hence, avoid initializing all weights to zero.
2. Weights initialized with too-small values (vanishing gradient problem)
• The model appears to converge too early: performance improves very slowly during training and stops improving early.
• If the initial weights of a neuron are too small relative to the inputs, the gradients of the hidden layers diminish exponentially during backpropagation, i.e., they vanish through the layers.
• Use ReLU or Leaky ReLU to mitigate vanishing gradients.
3. Too-large initialization (exploding gradient)
• The loss oscillates around the minimum but is unable to converge.
• The model's loss changes drastically on each update due to instability, and may even reach infinity.
• Use ReLU or Leaky ReLU to mitigate exploding gradients.
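The symmetry problem can be seen directly in a tiny network. This sketch (shapes, values, and names are our own illustrative choices) backpropagates through a 2-2-1 network whose weights are all zero and shows that both hidden units receive identical gradients, so they can never become different:

```python
# Demonstration of the zero-initialization symmetry problem.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# 2 inputs -> 2 hidden units -> 1 output, every weight initialized to zero
w_hidden = [[0.0, 0.0], [0.0, 0.0]]
w_out = [0.0, 0.0]
x, y = [0.5, -1.0], 1.0

# Forward pass
h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
y_pred = sigmoid(sum(w * hi for w, hi in zip(w_out, h)))

# Backward pass (squared error, chain rule)
d_out = (y_pred - y) * y_pred * (1 - y_pred)
d_hidden = [d_out * w_out[j] * h[j] * (1 - h[j]) for j in range(2)]
grads = [[d_hidden[j] * xi for xi in x] for j in range(2)]

print(grads[0] == grads[1])   # True: both hidden units get the same gradient
```

The same holds for any initialization that makes the neurons identical, which is why random (asymmetric) initialization is used.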
Working of Neural Network
The complete training process of a neural network involves two repeated steps.
1. Forward propagation - receive input data, process the information, and generate output.
• Data are fed into the input layer in the form of numbers. If an image is the input, these numerical values denote pixel intensities.
• The neurons in the hidden layers apply mathematical operations on these values to learn the features.
• To perform these mathematical operations, certain parameter values are randomly initialized.
• The operations performed at the hidden and output layers are:
z[1] = W[1]·x + b[1];  A[1] = σ(z[1]) = 1 / (1 + e^(−z[1]))
z[2] = W[2]·A[1] + b[2];  A[2] = σ(z[2]) = 1 / (1 + e^(−z[2]))
Predicted output Ŷ = A[2]
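The two-layer forward pass above can be sketched directly. The layer sizes and weight values below are illustrative, not from the notes; only the structure (linear map, then sigmoid, twice) follows the equations:

```python
# Sketch of the forward pass: A[1] = sigma(W[1]x + b[1]), A[2] = sigma(W[2]A[1] + b[2]).
import math

def sigmoid(z):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]

def linear(W, x, b):
    """z = W x + b for a weight matrix W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def forward(x, W1, b1, W2, b2):
    A1 = sigmoid(linear(W1, x, b1))    # hidden layer
    A2 = sigmoid(linear(W2, A1, b2))   # output layer: predicted output Y-hat
    return A2

W1 = [[0.5, 0.2], [0.3, -0.4]]   # 2 inputs -> 2 hidden units (illustrative values)
b1 = [0.1, 0.1]
W2 = [[0.7, -0.2]]               # 2 hidden units -> 1 output
b2 = [0.05]
print(forward([0.1, 0.3], W1, b1, W2, b2))
```

Because the output neuron is a sigmoid, the prediction always lies in (0, 1).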
Working of Neural Network
3. Loss (or cost) computation
• Once the output is generated, compare the predicted output with the actual (ground-truth/label) value y.
• The optimal error value is zero (no error), i.e., the desired (actual) and predicted results are identical.
• In this context, apply the mean squared error (MSE) function:
Error (cost) J = (1/2)(Actual output − Predicted output)²
• E.g., Error (cost) J = (1/2)(0.03 − 0.874352143)² = 0.356465271
Image: towardsdatascience.com
Working of Neural Network
4. Backward propagation
• Calculate the derivatives (gradients) and update the parameters after each iteration to reduce the error.
Calculate the derivatives (gradients):
z[1] = W[1]·x + b[1];  A[1] = σ(z[1]) = 1 / (1 + e^(−z[1]))
z[2] = W[2]·A[1] + b[2];  A[2] = σ(z[2]) = 1 / (1 + e^(−z[2]))
• Based on the chain rule, compute the product of the derivatives along all paths connecting the variables.
• Consider the output-layer parameters w[2] and b[2]. The first factor is
∂J/∂A[2] = ∂/∂A[2] [ (1/2)(Y − A[2])² ] = (1/2) · 2(Y − A[2]) · (−1) = −(Y − A[2])
Backward Propagation
• Next, compute the w[1] and b[1] gradients for the hidden layer:
z[1] = W[1]·x + b[1]
A[1] = σ(z[1]) = 1 / (1 + e^(−z[1]))
Working of Neural Network
5. Update the weights
• After the gradients are computed, update the model's parameters as w = w − α(∂J/∂w), then iterate (i.e., apply forward propagation) all over again until the model converges.
• Alpha (α) is the learning rate, which determines the convergence rate.
Forward Propagation - Example
• Z1 is the sum of products (SOP) of each input and its corresponding weight, plus the bias:
Z1 = W1·X1 + W2·X2 + b·1
Z1 = 0.5(0.1) + 0.2(0.3) + 1.83
Z1 = 1.94
A1 = σ(Z1) = 1 / (1 + e^(−Z1)) = 0.874352143
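The arithmetic above is easy to check in Python; this snippet just replays the example's numbers (w1 = 0.5, w2 = 0.2, b = 1.83, inputs 0.1 and 0.3):

```python
# Numeric check of the forward-pass example: Z1 = 1.94, A1 = sigmoid(1.94).
import math

w1, w2, b = 0.5, 0.2, 1.83
x1, x2 = 0.1, 0.3

z1 = w1 * x1 + w2 * x2 + b * 1       # sum of products plus bias
a1 = 1.0 / (1.0 + math.exp(-z1))     # sigmoid activation
print(z1, a1)                        # approximately 1.94 and 0.874352
```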
Backward Propagation- Example
• Training steps n = 0, 1, 2, …; wi denotes a parameter value at the i-th step.
• Alpha (α) denotes the learning rate; Y = actual output = 0.03.
• Ŷ (= A1) denotes the predicted output; n = 0; b = 1.83, w1 = 0.5, w2 = 0.2.
• α = 0.01; Ŷ = 0.874352143.
• Inputs: x0 (bias input) = +1, x1 = 0.1, x2 = 0.3.
• Z1 = W1·X1 + W2·X2 + b·1 = 0.5(0.1) + 0.2(0.3) + 1.83(1) = 1.94
Predicted output = A1 = σ(Z1) = 1 / (1 + e^(−Z1)) = 1 / (1 + e^(−1.94)) = 0.874352143
• Compute ∂J/∂w1, ∂J/∂w2, ∂J/∂b.
Backward Propagation- Example
• Compute ∂J/∂w1, ∂J/∂w2, ∂J/∂b.
∂J/∂w1 = (∂J/∂A1) · (∂A1/∂Z1) · (∂Z1/∂w1)
∂J/∂A1 = ∂/∂A1 [ (1/2)(Y − A1)² ] = (1/2) · 2(Y − A1) · (−1)   (the −1 comes from differentiating (−A1))
= A1 − Y = 0.874352143 − 0.03 = 0.844352143
∂A1/∂Z1 = 0.013803732;  ∂Z1/∂w1 = X1 = 0.1
∂J/∂w1 = (∂J/∂A1) · (∂A1/∂Z1) · (∂Z1/∂w1) = 0.844352143 × 0.013803732 × 0.1 = 0.001165
Parameter updation - Example
∂J/∂w2 = (∂J/∂A1) · (∂A1/∂Z1) · (∂Z1/∂w2)
• ∂J/∂A1 = 0.844352143
• ∂A1/∂Z1 = 0.013803732
• ∂Z1/∂w2 = X2 = 0.3
• ∂J/∂w2 = (∂J/∂A1) · (∂A1/∂Z1) · (∂Z1/∂w2) = 0.844352143 × 0.013803732 × 0.3 = 0.003496
Parameter updation - Example
• ∂J/∂b = (∂J/∂A1) · (∂A1/∂Z1) · (∂Z1/∂b)
∂J/∂b = 0.844352143 × 0.013803732 × 1 = 0.011655
Parameter updation:
• w1 = w1 − α(∂J/∂w1) = 0.5 − 0.01(0.001165) = 0.499988
• w2 = w2 − α(∂J/∂w2) = 0.2 − 0.01(0.003496) = 0.199965
• b = b − α(∂J/∂b) = 1.83 − 0.01(0.011655) = 1.829883
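The update step is the same formula applied to every parameter; this small sketch (names our own) replays it with the gradient values from this worked example:

```python
# Gradient-descent update step: each parameter moves against its gradient,
# scaled by the learning rate alpha = 0.01.

def update(params, grads, alpha=0.01):
    """w <- w - alpha * dJ/dw for each named parameter."""
    return {name: value - alpha * grads[name] for name, value in params.items()}

params = {"w1": 0.5, "w2": 0.2, "b": 1.83}
grads = {"w1": 0.001165, "w2": 0.003496, "b": 0.011655}
print(update(params, grads))
```

With α = 0.01 the parameters barely move each step, which is why the error decreases only slightly per iteration.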
• Apply these updated parameters to the input and run forward propagation again for the 2nd iteration, repeating until convergence (e.g., three consecutive iterations giving the same error value).
New forward-pass calculations:
s = X1·W1 + X2·W2 + b
s = 0.1(0.499988) + 0.3(0.199965) + 1.829883
s ≈ 1.939872
f(s) = 1 / (1 + e^(−s)) ≈ 0.874338
E = (1/2)(0.03 − 0.874338)² ≈ 0.356453
New error (≈ 0.356453) vs. old error (0.356465): a reduction of about 0.000012.
As long as there is a reduction, we are moving in the right direction.
The error reduction is small because we are using a small learning rate (0.01).
The forward and backward passes should be repeated until the error reaches 0 or for a set number of epochs (i.e., iterations).
Forward Propagation – Example 2
Forward Propagation – Digit Image Example 3
Parameters
Learning Rate
• The learning rate (step size) is a hyperparameter, typically ranging between 0.0 and 1.0.
• It is the amount by which the weights are updated during training.
• The learning rate controls how quickly the model adapts to the problem.
• Smaller learning rates require more training epochs, given the smaller changes made to the weights at each update, whereas larger learning rates produce rapid changes and require fewer training epochs.
Bias
• The bias is a node that is always 'on', i.e., its value is set to 1.
• Bias is like the intercept in a linear equation. It is an additional parameter in the neural network, used to adjust the output along with the weighted sum of the inputs to the neuron.
Output = sum(W · X) + bias
• Therefore, bias is a constant that helps the model fit the given data best.
• Weights decide how fast the activation function triggers, whereas bias is used to delay the triggering of the activation function.
Problems in neural network -
Overfitting
• Optimizing a model requires finding the best parameters that minimize the loss on the training set.
• Generalization refers to how the model behaves on unseen data.
• The best remedy is a balanced dataset with a sufficient amount of data. Overfitting can also be reduced by regularization.
Network size
• A neural network with too many layers and hidden units is highly complex (expensive), so reduce the complexity of the model by reducing its size. There is no best practice for choosing the number of layers: start with a small number of layers and increase the size until the model overfits.
Weight regularization
• Prevent overfitting by adding constraints on the weights. The constraint forces the weights to stay small and is added to the loss function. There are two kinds of regularization:
• L1 (Lasso): the cost is proportional to the absolute value of the weight coefficients.
• L2 (Ridge): the cost is proportional to the square of the weight coefficients.
J_reg(θ) = J(θ) + λ Σk θk²   (the L2 case)
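The two penalty terms are a one-line sum each. In this sketch, the base loss and λ are illustrative values, not from the notes:

```python
# L1 (Lasso) and L2 (Ridge) penalty terms added to a base loss J(theta).

def l1_penalty(weights, lam=0.01):
    """Lasso term: lam * sum |w_k|."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam=0.01):
    """Ridge term: lam * sum w_k^2."""
    return lam * sum(w * w for w in weights)

def regularized_loss(base_loss, weights, lam=0.01, kind="l2"):
    """J_reg = J + penalty; large weights now cost extra loss."""
    penalty = l2_penalty(weights, lam) if kind == "l2" else l1_penalty(weights, lam)
    return base_loss + penalty

w = [0.1, 1.7, 0.7, -0.9]
print(regularized_loss(0.356465, w, kind="l2"))
```

Because the penalty grows with the weight magnitudes, gradient descent on J_reg is pushed toward smaller weights, which is the constraint described above.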
Problems in neural network -
Early stopping
• Use the validation error to decide when to stop training.
• Stop when the monitored quantity has not improved after n subsequent epochs.
Dropout
• Dropout randomly sets some weights to zero.
• E.g., weights = [0.1, 1.7, 0.7, -0.9].
• After applying dropout, they become [0.1, 0, 0, -0.9], with the zeros randomly distributed.
• The dropout-rate parameter controls how many weights are set to zero.
• A rate between 0.2 and 0.5 is common.
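Dropout as described above can be sketched in a couple of lines; a fixed seed keeps the example reproducible, and the weight values follow the notes' example:

```python
# Sketch of dropout: zero each entry independently with probability `rate`.
import random

def dropout(weights, rate=0.5, rng=random):
    """Return a copy of `weights` with each entry zeroed with probability `rate`."""
    return [0.0 if rng.random() < rate else w for w in weights]

random.seed(0)
print(dropout([0.1, 1.7, 0.7, -0.9], rate=0.5))   # -> [0.1, 1.7, 0.0, 0.0]
```

In frameworks, dropout is applied to activations during training only, and the surviving values are usually rescaled by 1/(1 − rate); this sketch shows just the random zeroing described in the notes.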
Implement OR Gate using Perceptron (Linear Dataset)
• Consider the given dataset with positive- and negative-class training examples. Assume the initial weights w1 = 1.2, w2 = 0.6, learning rate α = 0.5, and threshold = 1.
• Find the updated weights that classify the given dataset correctly.

Samples  x1  x2  y
S1       0   0   0
S2       0   1   1
S3       1   0   1
S4       1   1   1

• Predicted output y′ = w0∗x0 + w1∗x1 + w2∗x2 + ⋯ + wn∗xn = Σ wi xi
• y′ = 1 if Σ wi xi ≥ 1, 0 if Σ wi xi < 1
For S1:
• Predicted output y1′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 1.2(0) + 0.6(0) = 0 < 1. So y1′ = 0.
• Predicted output y1′ = 0, actual output y1 = 0. Correctly classified; no update needed.
For S2: Predicted output y2′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 1.2(0) + 0.6(1) = 0.6 < 1. So y2′ = 0.
• Predicted output y2′ = 0, but actual output y2 = 1. Misclassified, so the weights need to be updated.
• Update the weights using the perceptron rule wi = wi + Δwi, where
Δwi = α(Actual output − Predicted output) × xi
Implement OR Gate using Perceptron (Linear Dataset)
For S2: w1 = 1.2, w2 = 0.6 and x1 = 0, x2 = 1; learning rate α = 0.5, threshold = 1.
Weight w1: Δw1 = 0.5(1 − 0) × x1 = 0.5(1 − 0) × 0 = 0; w1 = w1 + Δw1 = 1.2 + 0 = 1.2
Weight w2: Δw2 = 0.5(1 − 0) × x2 = 0.5(1 − 0) × 1 = 0.5; w2 = w2 + Δw2 = 0.6 + 0.5 = 1.1
Now we have the updated weights w1 = 1.2 and w2 = 1.1.
• Apply these weights to all the samples to verify perfect classification.
For S1: Predicted output y1′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 1.2(0) + 1.1(0) = 0 < 1. So y1′ = 0.
• Predicted output y1′ = 0, actual output y1 = 0. Correctly classified; no update needed.
For S2: Predicted output y2′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 1.2(0) + 1.1(1) = 1.1 ≥ 1. So y2′ = 1.
• Predicted output y2′ = 1, actual output y2 = 1. Correctly classified; no update needed.
For S3: Predicted output y3′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 1.2(1) + 1.1(0) = 1.2 ≥ 1. So y3′ = 1.
• Predicted output y3′ = 1, actual output y3 = 1. Correctly classified; no update needed.
For S4: Predicted output y4′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 1.2(1) + 1.1(1) = 2.3 ≥ 1. So y4′ = 1.
• Predicted output y4′ = 1, actual output y4 = 1. Correctly classified; no update needed.
All training examples are correctly classified with w1 = 1.2 and w2 = 1.1, so the OR gate is implemented. Stop the algorithm.
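This run can also be checked in code. Starting from w1 = 1.2, w2 = 0.6 with α = 0.5 and threshold 1, the only misclassified sample is S2, and the single update Δw2 = +0.5 yields w2 = 1.1; the sketch below (names our own) replays exactly that:

```python
# Verify the OR-gate training walkthrough above.

def predict(w, x, threshold=1.0):
    """Threshold unit: 1 if w . x reaches the threshold, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= threshold else 0

def train(samples, w, alpha=0.5, threshold=1.0):
    """Apply the perceptron rule until a full pass makes no mistakes."""
    converged = False
    while not converged:
        converged = True
        for x, y in samples:
            y_pred = predict(w, x, threshold)
            if y_pred != y:
                w = [wi + alpha * (y - y_pred) * xi for wi, xi in zip(w, x)]
                converged = False
    return w

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(train(or_data, [1.2, 0.6]))   # final weights: w1 = 1.2, w2 = 1.1
```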
References
2. Ethem Alpaydin, Introduction to Machine Learning (Adaptive Computation and Machine Learning), The MIT Press.
3. Wikipedia
4. https://www.indowhiz.com/articles/en/the-simple-concept-of-expectation-maximization-em-algorithm/
5. www.towardsdatascience.com