Lecture 5 NN

The document provides an overview of artificial neural networks (ANNs) and their functioning, drawing parallels with the human brain's neuron structure. It explains the construction and working principles of perceptrons, including their learning rules for both linear and non-linear datasets. Additionally, it discusses the significance of weights, activation functions, and the process of updating weights during training to achieve accurate classifications.

Uploaded by

Ashwin K.L

Module-1

Linear Algebra: Scalars, Vectors, Matrices and Tensors – Operations – Special Kinds of Matrices and Vectors – Probability Basics – Gradient-based Optimization – Linear Classifiers, Linear Machines with Hinge Loss – Introduction to Neural Networks – Activation Functions – TensorFlow and Keras Operations
Artificial Neural Networks — Mapping the Human Brain
• The human brain consists of neurons or nerve cells which transmit and
process the information received from our senses like ear, nose, etc.
• Many such nerve cells are arranged together in the brain to form a
network of nerves.
• These nerves pass electrical impulses i.e., the excitation from one
neuron to the other.
• The dendrites receive the impulse from the terminal button or synapse
of an adjoining neuron. (input)
• Dendrites carry the impulse to the nucleus of the nerve cell (soma).
• Here, the electrical impulse is processed and then passed on to the
axon. (output)
• The axon is the longest branch of the neuron; it carries the
impulse from the soma to the synapse.
• The synapse then, passes the impulse to dendrites of the second
neuron.
• Thus, a complex network of neurons is created in the human brain.
Artificial Neural Networks
• Neural networks imitate the behavior of the human brain. NNs allow computer programs to recognize patterns
and solve problems in the AI, machine learning, and deep learning domains.
• Artificial neural networks (ANNs) consist of node layers: an input layer, one or more hidden layers,
and an output layer.
Artificial Neuron Construction
• Neurons are created artificially on a computer; connecting many such artificial neurons creates an artificial
neural network.
• An artificial neuron works like a neuron in the human brain.
• Data flows through the network from neuron to neuron via connections.
• Every connection has a specific weight, by which the flow of data is regulated.
• If the output of an individual node is greater than a specified threshold value, that node is activated and
passes data on to the next layer of the network.
• Otherwise, the node is not activated, and data is not passed to the next layer.
Perceptron (Artificial Neuron) Working Principle
• A perceptron is an artificial neuron, a neural-network unit that performs
certain computations to recognize features or business intelligence in the
input data.
• The perceptron model is a supervised learning algorithm for binary classifiers. It
is a single-layer neural network with four main components: input
values, weights and bias, net sum, and an activation function.
• Let’s consider two inputs, x1 and x2. They could be integers, floats, etc. Assume
they are 1 and 0 respectively.
• When these inputs pass through the connections (through W1 and W2),
they are scaled by the connection weights.
• Let’s assume W1 = 0.5, W2 = 0.6 and W3 = 0.2. The weighted sum is
z = x1 * W1 + x2 * W2 = 1 * 0.5 + 0 * 0.6 = 0.5
• Hence, 0.5 is the value produced by the weights of the connections.
• The connections play the role of the dendrites in an artificial neuron.
• To process the information, we need an activation function (the soma).
Perceptron Working Principle for Non-Linear Dataset
• Here, use the simple sigmoid function: f(z) = 1 / (1 + e^(−z))
• f(0.5) = 0.6224
• The above value is the output of the neuron ( axon ).
• This value needs to be multiplied by W3 .
• 0.6224 * W3 = 0.6224 * 0.2 = 0.12448
• Now apply the activation function again to this value:
• y = f(0.12448) = 0.5310
• Hence, the final prediction y = 0.5310.
• In this example the weights were randomly generated.
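The two-stage computation above can be checked with a few lines of Python; this is a minimal sketch (values match the slide up to rounding):

```python
import math

def sigmoid(z):
    # Activation function: squashes any real value into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Inputs and the example's randomly chosen weights
x1, x2 = 1, 0
w1, w2, w3 = 0.5, 0.6, 0.2

z = x1 * w1 + x2 * w2        # weighted sum at the neuron: 0.5
a = sigmoid(z)               # neuron output (axon): ~0.6224
y = sigmoid(a * w3)          # scale by W3, then activate again: ~0.5310

print(round(z, 4), round(a, 4), round(y, 4))
```

The small differences in the last digit versus the slide come from rounding.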
Two types of Perceptron:
Single layer and Multilayer.
• Single layer - Single layer perceptron learns only linearly separable patterns.
• Multilayer - Multilayer perceptron or feedforward NN with two or more layers (more processing power).
• The Perceptron algorithm learns the weights for the input data to draw a linear decision boundary.
• It distinguishes the two linearly separable classes +1 and -1.
Perceptron Learning Rule
• A perceptron is an artificial neuron, the neural-network unit used to build an ANN.
• A perceptron receives a vector of real-valued inputs and computes a linear combination of these inputs.
• It outputs 1 if the result is greater than some threshold value.
• Otherwise its output is −1.
• Linear sum = w0∗x0 + w1∗x1 + w2∗x2 + ⋯ + wn∗xn = Σᵢ wi xi
• y′ = +1 if Σᵢ wi xi ≥ 1, −1 otherwise; or equivalently y′ = 1 if Σᵢ wi xi > 0, 0 otherwise.

Samples  x1   x2   y   y′
S1       0.1  0.7  0   0
S2       0.5  1.0  0   0
S3       1.3  0.9  0   1
S4       2.1  1.6  1   0

Perceptron Learning Rule Algorithm
• Learns an acceptable weight vector for a linearly separable dataset.
• Initially, assign the weights randomly.
• While misclassified training examples exist:
• Apply the perceptron to compute the predicted output for each training sample.
• Compare the actual output (y) with the predicted output (y′).
• Modify the weights whenever the perceptron misclassifies a training sample.
• Perceptron rule: wi = wi + Δwi, where
Δwi = α(Actual output − Predicted output) × xi
• Repeat until the perceptron classifies all training examples precisely.
Perceptron Working Principle for Linear Dataset
• Consider the given dataset with positive- and negative-class training examples. Assume the initial weights
w1 = 1.2, w2 = 0.6, learning rate α = 0.5, threshold = 1. Find the updated weights that classify the given
dataset precisely.
• Here, w0 (the bias b) is not given; take it as 0.
• Predicted output y′ = w0∗x0 + w1∗x1 + w2∗x2 + ⋯ + wn∗xn = Σᵢ wi xi
• y′ = 1 if Σᵢ wi xi ≥ 1, else 0.

Samples  x1  x2  y
S1       0   0   0
S2       0   1   0
S3       1   0   0
S4       1   1   1
For S1: predicted output y1′ = w0∗x0 + w1∗x1 + w2∗x2 = 0(1) + 1.2(0) + 0.6(0) = 0 < 1. So y1′ = 0.
• Predicted output y1′ = 0, actual output y1 = 0. Correctly classified, so the weights need not be updated.
For S2: predicted output y2′ = 0(1) + 1.2(0) + 0.6(1) = 0.6 < 1. So y2′ = 0.
• Predicted output y2′ = 0, actual output y2 = 0. Correctly classified, so the weights need not be updated.
For S3: predicted output y3′ = 0(1) + 1.2(1) + 0.6(0) = 1.2 ≥ 1. So y3′ = 1.
• Predicted output y3′ = 1, but actual output y3 = 0. Misclassified, so the weights need to be updated.
• Update weights using the perceptron rule wi = wi + Δwi, where
Δwi = α(Actual output − Predicted output) × xi
Perceptron Working Principle for Linear Dataset
• y′ = 1 if Σᵢ wi xi ≥ 1, else 0.
• Update weights using the perceptron rule wi = wi + Δwi, where
Δwi = α(Actual output − Predicted output) × xi

Samples  x1  x2  y
S1       0   0   0
S2       0   1   0
S3       1   0   0
S4       1   1   1
For S3: w1 = 1.2, w2 = 0.6 and x1 = 1, x2 = 0; learning rate α = 0.5, threshold = 1.
Weight w1: Δw1 = 0.5(0 − 1) × x1 = 0.5 × (−1) × 1 = −0.5; w1 = w1 + Δw1 = 1.2 − 0.5 = 0.7
Weight w2: Δw2 = 0.5(0 − 1) × x2 = 0.5 × (−1) × 0 = 0; w2 = w2 + Δw2 = 0.6 − 0 = 0.6
Now we have updated weights w1 = 0.7 and w2 = 0.6.
Apply these weights to all the samples to verify the perfect classification.
For S1: y1′ = 0(1) + 0.7(0) + 0.6(0) = 0 < 1. So y1′ = 0; actual y1 = 0. Correctly classified, no update needed.
For S2: y2′ = 0(1) + 0.7(0) + 0.6(1) = 0.6 < 1. So y2′ = 0; actual y2 = 0. Correctly classified, no update needed.
Perceptron Working Principle for Linear Dataset
For S3: y3′ = 0(1) + 0.7(1) + 0.6(0) = 0.7 < 1. So y3′ = 0; actual y3 = 0. Correctly classified, no update needed.
For S4: y4′ = 0(1) + 0.7(1) + 0.6(1) = 1.3 ≥ 1. So y4′ = 1; actual y4 = 1. Correctly classified, no update needed.

Samples  x1  x2  y  y′
S1       0   0   0  0
S2       0   1   0  0
S3       1   0   0  0
S4       1   1   1  1

• All training examples are correctly classified by the weights w1 = 0.7, w2 = 0.6.
• So these are the weights selected for this dataset by perceptron learning.
• Stop the algorithm.
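The training loop described above can be sketched in Python (the function name and structure are illustrative, not part of the slides). Run on this worked example it recovers the same weights, w1 = 0.7 and w2 = 0.6:

```python
def train_perceptron(samples, w, alpha=0.5, threshold=1.0, max_epochs=100):
    """Perceptron learning rule: w_i <- w_i + alpha * (y - y_pred) * x_i."""
    for _ in range(max_epochs):
        errors = 0
        for x, y in samples:
            y_pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= threshold else 0
            if y_pred != y:
                errors += 1
                w = [wi + alpha * (y - y_pred) * xi for wi, xi in zip(w, x)]
        if errors == 0:          # every sample classified correctly: stop
            return w
    return w

# Dataset from the worked example, with initial weights w1 = 1.2, w2 = 0.6
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(data, [1.2, 0.6])
print(w)   # -> [0.7, 0.6]
```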
Implement OR Gate using Perceptron (Linear Dataset)
• Consider the OR-gate dataset below. Assume the initial weights
w1 = 0, w2 = 0, learning rate α = 0.5, threshold = 1.
• Here, w0 (the bias b) is not given; take it as 0.
• Find the updated weights that classify the given dataset precisely.
• Predicted output y′ = w0∗x0 + w1∗x1 + w2∗x2 + ⋯ + wn∗xn = Σᵢ wi xi
• y′ = 1 if Σᵢ wi xi ≥ 1, else 0.

Samples  x1  x2  y
S1       0   0   0
S2       0   1   1
S3       1   0   1
S4       1   1   1
For S1: y1′ = 0(1) + 0(0) + 0(0) = 0 < 1. So y1′ = 0; actual y1 = 0. Correctly classified, no update needed.
For S2: y2′ = 0(1) + 0(0) + 0(1) = 0 < 1. So y2′ = 0, but actual y2 = 1. Misclassified, so the weights need to be updated.
• Update weights using the perceptron rule wi = wi + Δwi, where
Δwi = α(Actual output − Predicted output) × xi
Implement OR Gate using Perceptron (Linear Dataset)
For S2: w1 = 0, w2 = 0 and x1 = 0, x2 = 1; learning rate α = 0.5, threshold = 1.
Weight w1: Δw1 = 0.5(1 − 0) × x1 = 0.5 × 1 × 0 = 0; w1 = 0 + 0 = 0
Weight w2: Δw2 = 0.5(1 − 0) × x2 = 0.5 × 1 × 1 = 0.5; w2 = 0 + 0.5 = 0.5
Now we have updated weights w1 = 0 and w2 = 0.5.
Apply them to the same sample S2:
For S2: y2′ = 0(1) + 0(0) + 0.5(1) = 0.5 < 1. So y2′ = 0, but actual y2 = 1. Still misclassified, so update again.
For S2: w1 = 0, w2 = 0.5 and x1 = 0, x2 = 1.
Weight w1: Δw1 = 0.5(1 − 0) × 0 = 0; w1 = 0 + 0 = 0
Weight w2: Δw2 = 0.5(1 − 0) × 1 = 0.5; w2 = 0.5 + 0.5 = 1
Now we have updated weights w1 = 0 and w2 = 1.
For S2: y2′ = 0(1) + 0(0) + 1(1) = 1 ≥ 1. So y2′ = 1; actual y2 = 1. Correctly classified, no update needed.
Implement OR Gate using Perceptron (Linear Dataset)
Now we have updated weights w1 = 0 and w2 = 1.
• Apply these weights to the rest of the samples to verify the perfect classification.
For S1: y1′ = 0(1) + 0(0) + 1(0) = 0 < 1. So y1′ = 0; actual y1 = 0. Correctly classified, no update needed.
For S3: y3′ = 0(1) + 0(1) + 1(0) = 0 < 1. So y3′ = 0, but actual y3 = 1. Misclassified, so the weights need to be updated.
For S3: w1 = 0, w2 = 1 and x1 = 1, x2 = 0; learning rate α = 0.5, threshold = 1.
Weight w1: Δw1 = 0.5(1 − 0) × 1 = 0.5; w1 = 0 + 0.5 = 0.5
Weight w2: Δw2 = 0.5(1 − 0) × 0 = 0; w2 = 1 + 0 = 1
Now we have updated weights w1 = 0.5 and w2 = 1. Apply them to the same sample S3:
For S3: y3′ = 0(1) + 0.5(1) + 1(0) = 0.5 < 1. So y3′ = 0, but actual y3 = 1. Still misclassified, so update again.
Implement OR Gate using Perceptron (Linear Dataset)
For S3: w1 = 0.5, w2 = 1 and x1 = 1, x2 = 0; learning rate α = 0.5, threshold = 1.
Weight w1: Δw1 = 0.5(1 − 0) × 1 = 0.5; w1 = 0.5 + 0.5 = 1
Weight w2: Δw2 = 0.5(1 − 0) × 0 = 0; w2 = 1 + 0 = 1
Now we have updated weights w1 = 1 and w2 = 1. Apply them to the same sample S3:
For S3: y3′ = 0(1) + 1(1) + 1(0) = 1 ≥ 1. So y3′ = 1; actual y3 = 1. Correctly classified, no update needed.
• Apply these weights to the rest of the samples to verify the perfect classification.
For S4: y4′ = 0(1) + 1(1) + 1(1) = 2 ≥ 1. So y4′ = 1; actual y4 = 1. Correctly classified, no update needed.
For S1: y1′ = 0(1) + 1(0) + 1(0) = 0 < 1. So y1′ = 0; actual y1 = 0. Correctly classified, no update needed.
For S2: y2′ = 0(1) + 1(0) + 1(1) = 1 ≥ 1. So y2′ = 1; actual y2 = 1. Correctly classified, no update needed.

Samples  x1  x2  y  y′
S1       0   0   0  0
S2       0   1   1  1
S3       1   0   1  1
S4       1   1   1  1

All training examples are correctly classified with w1 = 1 and w2 = 1. So the OR gate is implemented.
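The OR-gate run can be sketched the same way. Note that this loop revisits a misclassified sample on the next epoch rather than immediately, as the slides do, but it reaches the same final weights, w1 = 1 and w2 = 1:

```python
def step(z, threshold=1.0):
    # Threshold unit: fire (1) when the weighted sum reaches the threshold
    return 1 if z >= threshold else 0

def train(samples, w, alpha=0.5, epochs=10):
    # Perceptron rule, applied sample by sample until an error-free epoch
    for _ in range(epochs):
        done = True
        for x, y in samples:
            y_pred = step(sum(wi * xi for wi, xi in zip(w, x)))
            if y_pred != y:
                done = False
                w = [wi + alpha * (y - y_pred) * xi for wi, xi in zip(w, x)]
        if done:
            break
    return w

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train(or_data, [0.0, 0.0])
print(w)   # -> [1.0, 1.0]
```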
Delta rule learning
• The perceptron rule works perfectly for linearly separable data, but it fails to converge on data that is
not linearly separable.
• The delta rule overcomes this limitation by learning weights even for a non-linear dataset.
• It converges towards a best-fit approximation of the target output.
• To attain this, apply gradient descent to search hypothesis space of possible weight vectors to
find weights that best fit with the training examples.
• Delta rule is the basis for backpropagation that learns the network with many interconnected
neuron units.
Delta rule learning
• The delta rule is best suited to training the unthresholded perceptron to learn the weights.
• Predicted output for a linear unit: y′ = w0∗x0 + w1∗x1 + w2∗x2 + ⋯ + wn∗xn = Σᵢ wi xi
• Predicted output of the linear unit without a threshold: y′(x) = w · x
• y′(x) represents the linear unit at the first stage (phase) of a perceptron, without thresholding.
• Then compute the cost: Error E(w) = ½ Σ (Actual output − Predicted output)², summed over the m training samples.

Samples  x1  x2  y
S1       0   0   0
S2       0   1   1
S3       1   0   1
S4       1   1   1

Calculate the direction of steepest descent along the error surface:
• The direction is found by calculating the derivative of the error E w.r.t. each weight w.
• This vector of derivatives, ∇E(w) = (∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn), is the gradient of E w.r.t. w.
• Since the gradient specifies the direction of steepest increase of the error, the training rule for GD is
wi = wi + Δwi; Δwi = −α × ∂E/∂wi
• The '−' moves w in the direction that decreases the error; α is the learning rate (a positive constant).
Delta rule learning
• The training rule can be written component-wise as
wi = wi + Δwi; Δwi = −α × ∂E/∂wi; ∇E(w) = (∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn)

∂E/∂wi = ∂/∂wi [ ½ Σ (Actual output − Predicted output)² ]   (sum over the m training samples)
= ½ Σ 2 (Actual output − Predicted output) × ∂/∂wi (Actual output − w · x)
= Σ (Actual output − Predicted output) × (−xi)

∇E(wi) = ∂E/∂wi = Σ (Actual output − Predicted output) × (−xi)
Gradient Descent Algorithm
• Gradient descent and the delta rule are applied to classify non-linear datasets.
• Weights are updated using the delta rule:
wi = wi + Δwi; Δwi = −α × ∂E/∂wi = α Σ (Actual output − Predicted output) × xi
• GD searches a large or infinite hypothesis space; it can be applied whenever
• the hypothesis space contains continuously parameterized hypotheses (e.g., the weights of a linear unit), and
• the error can be differentiated w.r.t. the hypothesis parameters.

• Practical challenges in applying GD are


• Converging to a local minimum can sometimes be quite slow i.e., need more iterations
• If there are multiple local minima in the error surface, then there is no guarantee that the procedure
will find the global minimum.
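The delta-rule update above can be sketched as batch gradient descent on an unthresholded linear unit. The dataset is the OR table from the slides; the learning rate and iteration count are illustrative choices:

```python
def predict(w, x):
    # Unthresholded linear unit: y' = sum_i w_i * x_i
    return sum(wi * xi for wi, xi in zip(w, x))

def gd_step(samples, w, alpha):
    # Delta rule: delta_w_i = alpha * sum over samples of (y - y') * x_i
    grads = [0.0] * len(w)
    for x, y in samples:
        err = y - predict(w, x)
        for i, xi in enumerate(x):
            grads[i] += err * xi
    return [wi + alpha * g for wi, g in zip(w, grads)]

def sse(samples, w):
    # E(w) = 1/2 * sum of squared errors
    return 0.5 * sum((y - predict(w, x)) ** 2 for x, y in samples)

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = [0.0, 0.0]
for _ in range(200):
    w = gd_step(data, w, alpha=0.1)
print([round(wi, 2) for wi in w], round(sse(data, w), 3))
```

Because the OR data is not exactly fitted by a linear unit with no bias, GD converges to the best-fit weights (about 2/3 each) with a small residual error, illustrating "best-fit approximation" rather than perfect classification.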
Activation function
• An activation function defines the output of a node for a given input or set of inputs.
• It activates or deactivates neurons to obtain the desired output.
• It also performs a nonlinear transformation on the input, giving better results on complex neural networks.
• An activation function also helps normalize the output for any input, e.g., into the range −1 to 1.
• An activation function must be efficient and should reduce computation time, because a neural network is
sometimes trained on millions of data points.
• In any neural network, the activation function decides whether the given input or received information is
relevant or irrelevant.
• Two types: 1. Linear activation functions. 2. Non-linear activation functions.
1. Linear or Identity Activation Function
• The linear activation function (identity, or "no" activation) is proportional to the input.
• The function does nothing to the weighted sum of the input; it simply returns the value it was given.
• A linear function has the equation of a straight line, i.e., y = f(x) = x, and its range is −∞ to +∞.
• The linear activation function is used in just one place: the output layer.
Limitations:
• Backpropagation cannot be used, as the derivative of the function is a constant and has no relation to the
input x.
• If a linear activation function is used in all layers of the neural network, the entire network acts like a single layer.
1. Sigmoid Activation Function
• The sigmoid activation function takes a probabilistic approach to decision making; its output range is
0 to 1, i.e., it normalizes each neuron's output.
• Because outputs are squashed into this small range, they can be read as probabilities for the decision.
• The equation for the sigmoid function is f(z) = 1 / (1 + e^(−z)).
• Disadvantage: the vanishing gradient problem. The sigmoid squashes large inputs into the range 0 to 1,
so its derivatives become very small, which can prevent satisfactory training.
• To solve this problem, ReLU is used, which does not have the small-derivative problem.
2. Hyperbolic Tangent Activation Function(Tanh)
• This activation function is similar to the sigmoid but usually performs slightly better.
• Zero-centered: this makes it easier to model inputs that are strongly negative, neutral, or strongly positive.
• Its range is −1 to 1.
3. Softmax Activation Function
• Softmax is used at the last (output) layer to classify inputs into multiple classes. Its range is 0 to 1.
• Softmax assigns each input a value proportional to its weight such that the outputs sum to one, i.e., the
probabilities over all classes add up to 1.
• For multi-class classification problems, use softmax together with cross-entropy.
• Negative inputs are converted into non-negative values by the exponential function.
4. ReLU( Rectified Linear unit) Activation function
• The rectified linear unit (ReLU) is the most widely used activation function. Its range is 0 to infinity.
f(x) = max(0, x)
• It converges quickly, so computation is fast.
Disadvantage:
• Dying ReLU problem: when inputs approach zero or are negative, the gradient of the function becomes zero.
• As a result, the weights and biases of some neurons are not updated during backpropagation. This can
create dead neurons that never get activated.
• To avoid this problem, use the Leaky ReLU function instead of ReLU.
• Leaky ReLU's range is expanded to enhance performance.
4. Leaky ReLU( Rectified Linear unit) Activation function
• Leaky ReLU is an enhanced version of ReLU that solves the dying ReLU problem: it has a small positive slope in
the negative region.
f(x) = max(0.01x, x)
• This function returns x for any positive input.
• For any negative value of x, it returns a very small value: 0.01 times x.
• Thus the gradient on the left side of the graph is nonzero.
• Hence there are no dead neurons in that region.
• Its range is −∞ to +∞.
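The activation functions above can be sketched in plain Python (scalar versions, for illustration only):

```python
import math

def sigmoid(z):                      # range (0, 1)
    return 1 / (1 + math.exp(-z))

def tanh(z):                         # zero-centred, range (-1, 1)
    return math.tanh(z)

def relu(z):                         # range [0, +inf)
    return max(0.0, z)

def leaky_relu(z, slope=0.01):       # small negative-side slope avoids dead neurons
    return z if z > 0 else slope * z

def softmax(zs):                     # probabilities over classes, summing to 1
    m = max(zs)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))                  # 0.5
print(relu(-2.0), leaky_relu(-2.0))  # 0.0 -0.02
print(sum(softmax([1.0, 2.0, 3.0]))) # 1.0 (up to floating-point rounding)
```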
Architecture of Neural Network
 A neural network contains different layers.
 Input layer – the first layer; it picks up the input signals and passes
them to the next layer.
 Hidden layer – the next layer; it performs all kinds of calculations and feature
extraction.
 There may be more than one hidden layer.
 Output layer – provides the final result.
Classification of Neural Networks
1. Shallow neural network: has only one
hidden layer between the input and output.
2. Deep neural network: has more than one hidden
layer. For instance, Google's GoogLeNet model for image recognition has 22
layers.
Building Neural Network Model
1. Weight initialization
2. Forward propagation
3. Loss (or cost) computation
4. Backpropagation
5. Weight update using optimization algorithm.
Weight initialization
1. Zero Initialization (Symmetry Problem)
• If all weights of a neural network are initialized to zero, every neuron computes the same output and receives
the same partial derivatives in backpropagation, so all neurons stay identical and the network behaves like a
linear model. Hence, avoid initializing all weights to zero.
2. Too-small Initialization (Vanishing Gradient Problem)
• The model appears to converge early: performance improves very slowly during training, and training stalls.
• If the initial weights of a neuron are too small relative to the inputs, the gradients of the hidden layers
diminish exponentially during backpropagation, i.e., they vanish through the layers.
• Use ReLU or Leaky ReLU to help avoid vanishing gradients.
3. Too-large Initialization (Exploding Gradient)
• The loss value oscillates around the minimum but is unable to converge.
• The model's loss changes drastically on each update due to instability, and may even reach infinity.
• Use ReLU or Leaky ReLU to help avoid exploding gradients.
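The symmetry problem can be demonstrated with a tiny sketch: a hypothetical 2-2-1 network with made-up inputs. With every weight initialized to zero, both hidden units compute identical activations and their output weights receive identical gradients, so the units can never learn different features:

```python
import math

def hidden(x, w_in):
    # Sigmoid activations of the two hidden units
    return [1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1]))) for w in w_in]

x, y = [0.5, -1.0], 1.0
w_in = [[0.0, 0.0], [0.0, 0.0]]    # zero initialization for both hidden units
w_out = [0.0, 0.0]

h = hidden(x, w_in)
out = sum(wo * hi for wo, hi in zip(w_out, h))
d_out = -(y - out)                 # dJ/d(out) for J = 0.5 * (y - out)^2
g_out = [d_out * hi for hi in h]   # gradients of the two output weights

print(h[0] == h[1], g_out[0] == g_out[1])   # True True: the units are indistinguishable
```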
Working of Neural Network
The complete training process of a neural network involves two steps.
1. Forward Propagation - Receive input data, process the information, and generate output
• Data are fed into the input layer as numbers. If an image is the input, these numerical values denote
pixel intensities.
• The neurons in the hidden layers apply mathematical operations on these values to learn the features.
• To perform these mathematical operations, certain parameter values are randomly initialized.
• The operations at the hidden and output layers are:
z[1] = w[1] x + b[1];  A[1] = σ(z[1]) = 1 / (1 + e^(−z[1]))
z[2] = w[2] A[1] + b[2];  A[2] = σ(z[2]) = 1 / (1 + e^(−z[2]))
• Predicted output Ŷ = A[2]
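The layer equations above can be sketched in plain Python; the weights, biases, and input below are made-up illustrative values, not taken from the slides:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Forward pass for one input through a 2-unit hidden layer and one output unit,
# mirroring z[1] = w[1]x + b[1], A[1] = sigma(z[1]), z[2] = w[2]A[1] + b[2], A[2] = sigma(z[2]).
x = [0.5, 0.1]
W1 = [[0.2, -0.4], [0.7, 0.1]]   # hidden-layer weights (2 units x 2 inputs)
b1 = [0.1, -0.2]
W2 = [0.3, 0.6]                  # output-layer weights
b2 = 0.05

z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
a1 = [sigmoid(z) for z in z1]
z2 = sum(w * a for w, a in zip(W2, a1)) + b2
y_hat = sigmoid(z2)              # predicted output A[2], always in (0, 1)
print(round(y_hat, 4))
```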
Working of Neural Network
3. Loss (or cost) computation
• Once the output is generated, compare the predicted output with the actual (ground-truth label) value y.
• The optimal error value is zero (no error), i.e., the desired (actual) and predicted results are identical.
• In this context, apply the Mean Squared Error (MSE) function:
Error (cost) J = ½ (Actual output − Predicted output)²
• E.g., Error (cost) J = ½ (0.03 − 0.874352143)² = 0.356465271

Working of Neural Network
4. Backward Propagation
• Calculate the derivatives (gradients) and update the parameters after each iteration to reduce the error.
Calculate the derivatives (gradients):
z[1] = w[1] x + b[1];  A[1] = σ(z[1]) = 1 / (1 + e^(−z[1]))
z[2] = w[2] A[1] + b[2];  A[2] = σ(z[2]) = 1 / (1 + e^(−z[2]))
• Based on the chain rule, compute the product of the derivatives along all paths connecting the variables.
• Consider the output-layer gradients w.r.t. w[2] and b[2]; first,
∂J/∂A[2] = ∂/∂A[2] [ ½ (Y − A[2])² ] = ½ × 2 (Y − A[2]) × (−1) = −(Y − A[2])

Backward Propagation
• Next, compute the w[1] and b[1] gradients for the hidden layer:
z[1] = w[1] x + b[1];  A[1] = σ(z[1]) = 1 / (1 + e^(−z[1]))

Working of Neural Network
5. Update the weights
• After the gradients are computed, update the model's parameters and iterate (i.e., apply forward
propagation) all over again until the model converges.
• α (alpha) is the learning rate, which determines the convergence rate.
Forward Propagation - Example
• Z1 is the sum of products (SOP) of each input and its corresponding weight, plus the bias:
Z1 = W1 * X1 + W2 * X2 + b * 1
• Z1 = 0.5 × 0.1 + 0.2 × 0.3 + 1.83 = 1.94
• A1 = σ(Z1) = 1 / (1 + e^(−Z1)) = 1 / (1 + e^(−1.94)) = 0.874352143
Backward Propagation- Example
• Training steps n = 0, 1, 2, …; wᵢ is a parameter at the iᵗʰ step; α denotes the learning rate.
• Actual output Yⁿ = 0.03; predicted output Ŷⁿ = A1; n = 0; b = 1.83, w1 = 0.5, w2 = 0.2; α = 0.01.
• Inputs: x0 (bias input) = +1, x1 = 0.1, x2 = 0.3.
• Z1 = W1 * X1 + W2 * X2 + b * 1 = 0.5 × 0.1 + 0.2 × 0.3 + 1.83 × 1 = 1.94
• Predicted output A1 = σ(Z1) = 1 / (1 + e^(−Z1)) = 1 / (1 + e^(−1.94)) = 0.874352143
• Error J = (Actual output − Predicted output)² / 2 = (0.03 − 0.874352143)² / 2 = 0.356465271
• Compute ∂J/∂w1, ∂J/∂w2, ∂J/∂b.
Backward Propagation- Example
• Compute ∂J/∂w1, ∂J/∂w2, ∂J/∂b.

∂J/∂w1 = (∂J/∂A1) × (∂A1/∂Z1) × (∂Z1/∂w1)

∂J/∂A1 = ∂/∂A1 [ ½ (Y − A1)² ] = ½ × 2 (Y − A1) × (−1)   (the −1 comes from differentiating −A1)
= −(Y − A1) = −(0.03 − 0.874352143) = 0.844352143

∂A1/∂Z1 = ∂/∂Z1 [ (1 + e^(−Z1))^(−1) ] = e^(−Z1) / (1 + e^(−Z1))² = A1 × (1 − A1)
= 0.874352143 × (1 − 0.874352143) = 0.109860

∂Z1/∂w1 = ∂(W1·X1 + W2·X2 + b·1)/∂w1 = X1 + 0 + 0 = 0.1

• ∂J/∂w1 = (∂J/∂A1) × (∂A1/∂Z1) × (∂Z1/∂w1) = 0.844352143 × 0.109860 × 0.1 = 0.009276

Parameter Update – Example
∂J/∂w2 = (∂J/∂A1) × (∂A1/∂Z1) × (∂Z1/∂w2)

• ∂J/∂A1 = 0.844352143
• ∂A1/∂Z1 = 0.109860
• ∂Z1/∂w2 = ∂(W1·X1 + W2·X2 + b·1)/∂w2 = 0 + X2 + 0 = 0.3

• ∂J/∂w2 = 0.844352143 × 0.109860 × 0.3 = 0.027828
Parameter Update – Example
∂J/∂b = (∂J/∂A1) × (∂A1/∂Z1) × (∂Z1/∂b)

• ∂Z1/∂b = ∂(W1·X1 + W2·X2 + b·1)/∂b = 0 + 0 + 1 = 1
• ∂J/∂b = 0.844352143 × 0.109860 × 1 = 0.092760

Parameter update:
• w1 = w1 − α(∂J/∂w1) = 0.5 − 0.01 × 0.009276 = 0.499907
• w2 = w2 − α(∂J/∂w2) = 0.2 − 0.01 × 0.027828 = 0.199722
• b = b − α(∂J/∂b) = 1.83 − 0.01 × 0.092760 = 1.829072

• Apply these updated parameters to the input and run forward propagation again for the 2nd iteration,
repeating until convergence (e.g., the error value is unchanged for three consecutive iterations).
New forward pass calculations:
s = X1 × W1 + X2 × W2 + b = 0.1 × 0.499907 + 0.3 × 0.199722 + 1.829072 = 1.938980
f(s) = 1 / (1 + e^(−s)) = 1 / (1 + e^(−1.938980)) = 0.87424
E = ½ (0.03 − 0.87424)² = 0.35637
New error (0.35637) vs. old error (0.356465): a reduction of about 0.0001.
As long as there is a reduction, we are moving in the right direction.
The error reduction is small because we are using a small learning rate (0.01).
The forward and backward passes should be repeated until the error reaches 0, or for a set
number of epochs (i.e., iterations).
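The whole worked example can be checked numerically. This is a minimal sketch of one forward/backward pass for the single neuron, using the standard sigmoid derivative A1 × (1 − A1):

```python
import math

# Values from the worked example
x1, x2 = 0.1, 0.3
w1, w2, b = 0.5, 0.2, 1.83
y, alpha = 0.03, 0.01

# Forward pass
z = w1 * x1 + w2 * x2 + b * 1            # 1.94
a = 1 / (1 + math.exp(-z))               # 0.874352...
err = 0.5 * (y - a) ** 2                 # 0.356465...

# Backward pass (chain rule) and parameter update
dj_da = -(y - a)                         # 0.844352...
da_dz = a * (1 - a)                      # 0.109860...
w1 -= alpha * dj_da * da_dz * x1
w2 -= alpha * dj_da * da_dz * x2
b  -= alpha * dj_da * da_dz * 1

# Second forward pass with the updated parameters
z_new = w1 * x1 + w2 * x2 + b
a_new = 1 / (1 + math.exp(-z_new))
err_new = 0.5 * (y - a_new) ** 2
print(round(err, 6), round(err_new, 6))  # the error decreases slightly
```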
Forward Propagation – Example 2
Forward Propagation – Digit Image Example 3
Parameters
Learning Rate
• The learning rate (step size) is a hyper-parameter, typically in the range 0.0 to 1.0.
• It is the amount by which the weights are updated during training.
• The learning rate controls how quickly the model adapts to the problem.
• Smaller learning rates require more training epochs, given the smaller changes made to the weights on
each update, whereas larger learning rates produce rapid changes and require fewer training epochs.
Bias
• The bias is a node that is always 'on', i.e., its value is set to 1.
• Bias is like the intercept in a linear equation. It is an additional parameter in the neural network,
used to adjust the output along with the weighted sum of the inputs to the neuron:
Output = sum(W * X) + bias
• Therefore, bias is a constant that helps the model fit the given data as well as possible.
• Weights decide how fast the activation function triggers, whereas the bias
is used to delay the triggering of the activation function.
Problems in neural network -
Overfitting
• Optimizing a model requires finding the best parameters, those that minimize the loss on the training set.
• Generalization: how the model behaves on unseen data.
• The best method is to have a balanced dataset with a sufficient amount of data. Reduce overfitting by regularization.
Network size
• A neural network with too many layers and hidden units is highly complex (expensive). So reduce the
complexity of the model by reducing its size. There is no best practice for choosing the number of layers. Start with a small
number of layers and increase the size until the model overfits.
Weight regularization
• Prevent overfitting by adding constraints to the weights. The constraint forces the weights to stay small and is
added to the loss function. Two kinds of regularization:
• L1 (Lasso): the cost is proportional to the absolute value of the weight coefficients.
• L2 (Ridge): the cost is proportional to the square of the weight coefficients, e.g.
J_reg(θ) = J(θ) + λ Σ_k θ_k²
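A minimal sketch of the regularized cost, with made-up weights, an illustrative unregularized loss, and an illustrative λ:

```python
# J_reg = J + lambda * penalty(weights)
weights = [0.5, -1.2, 0.3]
base_loss = 0.25          # illustrative unregularized loss J(theta)
lam = 0.01                # regularization strength lambda

l1 = lam * sum(abs(w) for w in weights)   # L1 (Lasso) penalty
l2 = lam * sum(w ** 2 for w in weights)   # L2 (Ridge) penalty

print(round(base_loss + l1, 4))   # 0.27
print(round(base_loss + l2, 4))   # 0.2678
```

Both penalties pull the weights toward zero during training; L1 tends to zero some weights out entirely, while L2 shrinks them smoothly.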
Problems in neural network -
Early-stopping
• Use validation error to decide when to stop training
• Stop when monitored quantity has not improved after n subsequent epochs

Dropout
• Dropout randomly sets some weights to zero.
• E.g., weights = [0.1, 1.7, 0.7, -0.9].
• After applying dropout, this becomes, e.g., [0.1, 0, 0, -0.9], with the zeros placed at random.
• The dropout-rate parameter controls how many weights are set to zero.
• A rate between 0.2 and 0.5 is common.
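A simplified dropout sketch (real frameworks drop activations during training and rescale the survivors; the hypothetical function below just zeroes a random fraction of a weight list):

```python
import random

def dropout(weights, rate, seed=None):
    # Zero out each value independently with probability `rate`
    rng = random.Random(seed)
    return [0.0 if rng.random() < rate else w for w in weights]

weights = [0.1, 1.7, 0.7, -0.9]
dropped = dropout(weights, rate=0.5, seed=0)
print(dropped)   # some entries replaced by 0.0, depending on the seed
```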
Implement OR Gate using Perceptron (Linear Dataset)
• Consider the OR-gate dataset below. Assume the initial weights
w1 = 1.2, w2 = 0.6, learning rate α = 0.5, threshold = 1.
• Here, w0 (the bias b) is not given; take it as 0.
• Find the updated weights that classify the given dataset precisely.
• Predicted output y′ = w0∗x0 + w1∗x1 + w2∗x2 + ⋯ + wn∗xn = Σᵢ wi xi
• y′ = 1 if Σᵢ wi xi ≥ 1, else 0.

Samples  x1  x2  y
S1       0   0   0
S2       0   1   1
S3       1   0   1
S4       1   1   1

For S1: y1′ = 0(1) + 1.2(0) + 0.6(0) = 0 < 1. So y1′ = 0; actual y1 = 0. Correctly classified, no update needed.
For S2: y2′ = 0(1) + 1.2(0) + 0.6(1) = 0.6 < 1. So y2′ = 0, but actual y2 = 1. Misclassified, so the weights need to be updated.
• Update weights using the perceptron rule wi = wi + Δwi, where
Δwi = α(Actual output − Predicted output) × xi
Implement OR Gate using Perceptron (Linear Dataset)
For S2: w1 = 1.2, w2 = 0.6 and x1 = 0, x2 = 1; learning rate α = 0.5, threshold = 1.
Weight w1: Δw1 = 0.5(1 − 0) × x1 = 0.5 × 1 × 0 = 0; w1 = 1.2 + 0 = 1.2
Weight w2: Δw2 = 0.5(1 − 0) × x2 = 0.5 × 1 × 1 = 0.5; w2 = 0.6 + 0.5 = 1.1
Now we have updated weights w1 = 1.2 and w2 = 1.1.
• Apply these weights to all the samples to verify the perfect classification.
For S1: y1′ = 0(1) + 1.2(0) + 1.1(0) = 0 < 1. So y1′ = 0; actual y1 = 0. Correctly classified.
For S2: y2′ = 0(1) + 1.2(0) + 1.1(1) = 1.1 ≥ 1. So y2′ = 1; actual y2 = 1. Correctly classified.
For S3: y3′ = 0(1) + 1.2(1) + 1.1(0) = 1.2 ≥ 1. So y3′ = 1; actual y3 = 1. Correctly classified.
For S4: y4′ = 0(1) + 1.2(1) + 1.1(1) = 2.3 ≥ 1. So y4′ = 1; actual y4 = 1. Correctly classified.
All training examples are correctly classified under w1 = 1.2 and w2 = 1.1. So the OR gate is implemented.
References

1. Tom M. Mitchell, Machine Learning, McGraw Hill, 2017.
2. Ethem Alpaydin, Introduction to Machine Learning (Adaptive Computation and Machine Learning), The MIT Press, 2017.
3. Wikipedia
4. https://www.indowhiz.com/articles/en/the-simple-concept-of-expectation-maximization-em-algorithm/
5. www.towardsdatascience.com
