Module 3: Similarity-based Learning (11 March 2024)
Similarity-based Learning
Introduction to Similarity or Instance-based Learning
Similarity-based classifiers use similarity measures to locate the nearest
neighbors and classify a test instance.
It is also called instance-based learning or just-in-time learning because it does not
build an abstract model of the training instances; it performs lazy learning, deferring
work until a new instance has to be classified.
This learning mechanism simply stores all data and uses it only when it needs to
classify an unseen instance.
The advantage of this learning is that processing occurs only when a request to
classify a new instance is received.
This methodology is useful when the whole dataset is not available at the
beginning but is collected incrementally.
The drawback of this learning is that it requires large memory to store the data,
since no global abstract model is constructed initially from the training data.
Classification of instances is done based on the measure of similarity in the form
of distance functions over data instances.
Differences Between Instance-based and Model-based Learning:
An instance is an entity or an example in the training dataset.
It is described by a set of features or attributes.
One attribute describes the class label or category of an instance.
Instance-based methods predict the class label of a test instance only when a new
instance is given for classification; until then, they delay processing of the training
dataset.
They are also called lazy learning methods since they do not generalize a model
from the training dataset but simply keep the training dataset as a knowledge base
until a new instance is given.
In contrast, model-based learning, also called eager learning, generalizes the
training data into a model before receiving any test instances.
Model-based learning describes all assumptions about the problem domain in
the form of a model.
These algorithms learn in two phases, called the training phase and the testing
phase.
In the training phase, a model is built from the training dataset; it is then used to
classify a test instance during the testing phase.
Examples of models constructed in this way are decision trees, neural networks and
Support Vector Machines (SVMs).
The main differences between instance-based and model-based learning follow from
these points: instance-based methods store the training data and postpone all
processing until a test instance arrives, whereas model-based methods generalize
the training data into a model before any test instance is seen.
Examples of Instance-based learning algorithms are:
1. k-Nearest Neighbor (k-NN)
2. Variants of Nearest Neighbor learning
3. Locally Weighted Regression
4. Learning Vector Quantization (LVQ)
5. Self-Organizing Map (SOM)
6. Radial Basis Function (RBF) networks
Instance-based methods are sensitive to the range (scale) of the feature values, so
features usually need to be normalized before distances are computed.
They are also sensitive to irrelevant and correlated features, which can lead to
misclassification of instances.
Nearest-Neighbor Learning
k-Nearest-Neighbors (k-NN) is a natural approach used in similarity-based
classification.
It is a non-parametric method used for both classification and regression problems.
It is a simple and powerful algorithm that predicts the category of the test instance
according to the 'k' training samples closest to the test instance and assigns it the
category that has the largest probability (i.e., the most common class) among those
neighbors.
A visual representation of this learning is shown in Figure 4.1.
There are two classes of objects called C1 and C2 in the given figure.
For any given test instance T, the category of this test instance is determined by
looking at the class of k = 3 nearest neighbors.
Thus, the class of this test instance T is predicted as C2.
Note:
Nonparametric methods, or distribution-free methods, are statistical methods that do not
rely on assumptions that the data are drawn from a given probability distribution.
Nonparametric methods are often applied when less is known about the data.
The algorithm relies on the assumption that similar objects are close to each
other in the feature space.
k-NN performs instance-based learning: it simply stores the training data instances
and learns from them case by case.
This model is also memory-based as it uses training data at the time when
predictions need to be made.
It is a lazy learning algorithm since no prediction model is built earlier with
training instances and classification happens only after getting the test instance.
The algorithm classifies a new instance by determining the ‘k’ most similar
instances (i.e., k nearest neighbors) and summarizing the output of those ‘k’
instances.
If the target variable is discrete, then it is a classification problem.
So, it selects the most common class value among the ‘k’ instances by a majority
vote.
If the target variable is continuous, then it is a regression problem, and hence
the mean output variable of the ‘k’ instances is the output of the test instance.
Euclidean distance, the most popular distance measure, is used in k-NN to
determine the 'k' instances that are most similar to the test instance.
The value of ‘k’ is determined by tuning with different ‘k’ values and choosing the
‘k’ which classifies the test instance more accurately.
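A minimal Python sketch of this procedure is given below (not part of the original slides; the function names euclidean and knn_predict and the use of plain lists are illustrative choices). It returns the majority class for a discrete target and the mean of the neighbors' outputs for a continuous target.

import math
from collections import Counter

def euclidean(a, b):
    # distance between two equal-length feature vectors
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(train_X, train_y, test_x, k, regression=False):
    # distance from the test instance to every training instance
    distances = sorted(((euclidean(x, test_x), y) for x, y in zip(train_X, train_y)),
                       key=lambda t: t[0])
    # keep only the k nearest neighbors
    neighbors = [y for _, y in distances[:k]]
    if regression:
        # continuous target: mean output of the k neighbors
        return sum(neighbors) / k
    # discrete target: most common class among the k neighbors (majority vote)
    return Counter(neighbors).most_common(1)[0][0]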
Example 4.1:
Consider the student performance training dataset of 8 data instances shown in Table 4.2
which describes the performance of individual students in a course and their CGPA
obtained in the previous semesters. The independent attributes are CGPA, Assessment
and Project. The target variable is ‘Result’ which is a discrete valued variable that takes
two values ‘Pass’ or ‘Fail’. Based on the performance of a student, classify whether a
student will pass or fail in that course.
Sl. No. CGPA Assessment Project Submitted Result
1 9.2 85 8 Pass
2 8 80 7 Pass
3 8.5 81 8 Pass
4 6 45 5 Fail
5 6.5 50 4 Fail
6 8.2 72 7 Pass
7 5.8 38 5 Fail
8 8.9 91 9 Pass
Table 4.2: Training Dataset T
Solution:
Given a test instance (6.1, 40, 5) and a set of categories {Pass, Fail}, also called
classes, we need to use the training set to classify the test instance using Euclidean
distance.
The task of classification is to assign a category or class to an arbitrary
instance.
Assign k = 3.
Step 1:
Calculate the Euclidean distance between the test instance (6.1, 40, 5) and each of
the training instances.
Step 2:
Sort the distances in ascending order and select the first 3 training instances nearest
to the test instance.
The three nearest neighbors turn out to be instances 7, 4 and 5, all labelled 'Fail',
so the test instance is classified as 'Fail'; the computation is worked out in the
sketch below.
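The computation for this example can be reproduced with the short Python sketch below (illustrative code, not from the slides; the names train and test are arbitrary).

import math
from collections import Counter

# training dataset T of Table 4.2: (CGPA, Assessment, Project) -> Result
train = [((9.2, 85, 8), "Pass"), ((8.0, 80, 7), "Pass"), ((8.5, 81, 8), "Pass"),
         ((6.0, 45, 5), "Fail"), ((6.5, 50, 4), "Fail"), ((8.2, 72, 7), "Pass"),
         ((5.8, 38, 5), "Fail"), ((8.9, 91, 9), "Pass")]
test = (6.1, 40, 5)

# Step 1: Euclidean distance from the test instance to every training instance
dists = [(math.dist(x, test), label, i + 1) for i, (x, label) in enumerate(train)]

# Step 2: sort ascending and keep the k = 3 nearest neighbors
nearest = sorted(dists)[:3]
print(nearest)   # instances 7, 4 and 5, all labelled 'Fail'

# majority vote among the 3 nearest neighbors gives the predicted class
print(Counter(label for _, label, _ in nearest).most_common(1)[0][0])   # 'Fail'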
Weighted K-Nearest-Neighbor Algorithm
Algorithm 4.2: Weighted k-NN
Inputs: Training dataset ‘T’, Distance metric ‘d’, Weighting function w(i), Test
instance ‘t’, the number of nearest neighbors ‘k’.
Output: Predicted class or category
Prediction: For test instance t,
1. For each instance ‘i’ in Training dataset T, compute the distance between the
test instance t and every other instance ‘i’ using a distance metric (Euclidean
distance).
[Continuous attributes: the Euclidean distance between two points in the plane
with coordinates (x1, y1) and (x2, y2) is given as:
dist((x1, y1), (x2, y2)) = √((x2 − x1)² + (y2 − y1)²) ]
[Categorical attributes (Binary): Hamming Distance: If the values of two
instances are the same, the distance d will be equal to 0. Otherwise, d = 1.]
2. Sort the distances in the ascending order and select the first 'k' nearest
training data instances to the test instance.
3. Predict the class of the test instance by a weighted-voting technique: weight each
of the 'k' selected neighbors (for example, by the inverse of its distance) and assign
the class whose neighbors have the largest total weight.
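A minimal Python sketch of Algorithm 4.2 is given below (illustrative names; step 3 uses the inverse-distance weighting technique applied in Example 4.2).

import math
from collections import defaultdict

def weighted_knn(train, test_x, k=3):
    # train is a list of (feature_vector, class_label) pairs
    # Step 1: distance from the test instance to every training instance
    dists = sorted(((math.dist(x, test_x), label) for x, label in train),
                   key=lambda t: t[0])
    # Step 2: keep the k nearest training instances
    nearest = dists[:k]
    # Step 3: weighted vote -- each neighbor votes with weight 1 / distance
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / d if d > 0 else float("inf")
    return max(votes, key=votes.get)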
Example 4.2:
Consider the same training dataset given in Table 4.2. Use Weighted k-NN and
determine the class.
Solution:
Step 1:
Given a test instance (7.6, 60, 8) and a set of classes {Pass, Fail}, use the
training dataset to classify the test instance using Euclidean distance and a
weighting function.
Assign k = 3.
The distance calculation is shown in Table 4.5.
Sl. No. CGPA Assessment Project Submitted Result Euclidean Distance
1 9.2 85 8 Pass √((9.2 − 7.6)² + (85 − 60)² + (8 − 8)²) = 25.05115
2 8 80 7 Pass √((8 − 7.6)² + (80 − 60)² + (7 − 8)²) = 20.02898
Step 2:
Sort the distances in ascending order and select the 3 nearest training instances to
the test instance.
Step 3:
Predict the class of the test instance by a weighted-voting technique over the 3
selected nearest instances.
Compute the inverse of each selected neighbor's distance and use it as that
neighbor's weight; the class with the largest total weight is the prediction. Here the
three nearest instances are instances 5, 6 and 4, and the 'Fail' class accumulates
the larger total weight, so the test instance is classified as 'Fail'. The full
computation is worked out in the sketch below.
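The numbers in this example can be reproduced with the standalone sketch below (the dataset literal is Table 4.2 retyped; the variable names are illustrative). It computes all eight distances, selects the three nearest instances and sums the inverse-distance weights per class.

import math
from collections import defaultdict

train = [((9.2, 85, 8), "Pass"), ((8.0, 80, 7), "Pass"), ((8.5, 81, 8), "Pass"),
         ((6.0, 45, 5), "Fail"), ((6.5, 50, 4), "Fail"), ((8.2, 72, 7), "Pass"),
         ((5.8, 38, 5), "Fail"), ((8.9, 91, 9), "Pass")]
test = (7.6, 60, 8)

# Step 1: Euclidean distances (instances 1 and 2 give 25.05 and 20.03, as in Table 4.5)
dists = [(math.dist(x, test), label) for x, label in train]

# Step 2: sort ascending and keep the k = 3 nearest instances
nearest = sorted(dists)[:3]

# Step 3: weighted vote, using the inverse of each distance as its weight
votes = defaultdict(float)
for d, label in nearest:
    votes[label] += 1.0 / d
print(votes)                        # total weight accumulated by each class
print(max(votes, key=votes.get))    # predicted class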
Nearest Centroid Classifier
Another variant of k-NN classifiers used for similarity-based classification is the
Nearest Centroid Classifier.
It is a simple classifier, which is also called the Mean Difference classifier.
In this method, we classify a test instance to the class whose centroid/mean is
closest to that instance.
Example 4.3:
Consider a sample dataset with two features X and Y; the feature values of its six
instances appear in the centroid computation below. The target classes are 'A' or
'B'. Predict the class of the test instance (6, 5) using the Nearest Centroid Classifier.
Solution:
Step 1:
Compute the mean/centroid of each class.
In this example, we have two classes ‘A’ and ‘B’.
Centroid of class ‘A’ = (3 + 5 + 4, 1 + 2 + 3) / 3 = (12, 6)/3 = (4, 2)
Centroid of class ‘B’ = (7 + 6 + 8, 6 + 7 + 5) / 3 = (21, 18)/3 = (7, 6)
Now for the given test instance (6, 5), we can predict the class as explained
below.
Step 2:
Calculate the Euclidean distance between the test instance (6, 5) and each
centroid.
Euc_Dist[(6, 5), (4, 2)] = √((6 − 4)² + (5 − 2)²) = √13 = 3.6
Euc_Dist[(6, 5), (7, 6)] = √((6 − 7)² + (5 − 6)²) = √2 = 1.414
The test instance is closer to the centroid of class 'B'.
Hence, the class of this test instance is predicted as 'B'.
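A minimal Python sketch of this classifier applied to the example is given below; the individual (X, Y) pairs are one pairing consistent with the sums used in the centroid computation above (only the centroids matter for the result), and all names are illustrative.

import math

# sample points per class, consistent with the centroid computation above
data = {"A": [(3, 1), (5, 2), (4, 3)],
        "B": [(7, 6), (6, 7), (8, 5)]}
test = (6, 5)

# Step 1: centroid (feature-wise mean) of each class
centroids = {c: tuple(sum(col) / len(pts) for col in zip(*pts))
             for c, pts in data.items()}    # A -> (4.0, 2.0), B -> (7.0, 6.0)

# Step 2: assign the class whose centroid is nearest to the test instance
pred = min(centroids, key=lambda c: math.dist(centroids[c], test))
print(pred)                                 # 'B'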
Locally Weighted Regression (LWR)
Locally Weighted Regression (LWR) is a non-parametric supervised learning
algorithm that performs local regression by combining a regression model with the
nearest-neighbor model.
LWR is also referred to as a memory-based method: it needs the training data at
prediction time but uses only the training instances that lie locally around the
point of interest.
Using the nearest-neighbor algorithm, we find the 'k' instances closest to the test
instance and fit a linear function to those 'k' nearest instances in the local
regression model.
Because a separate local fit, minimizing the error over its 'k' neighbors, is made
around each query point, the overall prediction is no longer a single straight line
but a curve.
Ordinary linear regression finds a linear relationship between the input x and the
output y.
For the given training dataset T, the hypothesis function hβ(x), the predicted target
output, is a linear function, where β₀ is the intercept and β₁ is the coefficient of x.
It is given by,
hβ(x) = β₀ + β₁x
The cost function minimizes the error difference between the predicted value hβ(x)
and the true value 'y', and it is given by,
J(β) = Σᵢ (hβ(xᵢ) − yᵢ)²
where the sum runs over i = 1, …, m and 'm' is the number of instances in the
training dataset.
Now the cost function is modified for locally weighted linear regression by including
weights only for the nearest-neighbor points.
Hence, the cost function is given by,
J(β) = Σᵢ wᵢ (hβ(xᵢ) − yᵢ)²
where wᵢ is the weight associated with each xᵢ.
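A small Python sketch of how this weighted cost can be evaluated is shown below; the function name and the numeric values in the call are purely illustrative.

def weighted_cost(beta0, beta1, xs, ys, ws):
    # J(beta) = sum over i of w_i * (h_beta(x_i) - y_i)^2
    return sum(w * ((beta0 + beta1 * x) - y) ** 2 for x, y, w in zip(xs, ys, ws))

# illustrative call: three neighbor points, the nearer point carrying the larger weight
print(weighted_cost(4.72, 0.62, xs=[1, 2, 1], ys=[5, 7, 8], ws=[0.5, 1.0, 0.5]))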
Example 4.4:
Consider a simple example with four instances shown in Table 4.10 and apply locally
weighted regression.
Sl. No Salary (in Lakhs) Expenditure (in Thousands)
1 5 25
2 1 5
3 2 7
4 1 8
Table 4.10 Sample Table
Solution:
Using the linear regression model, assume we have computed the parameters
β₀ = 4.72 and β₁ = 0.62.
Given a test instance with x = 2, the predicted y′ is:
y′ = β₀ + β₁x = 4.72 + 0.62 × 2 = 5.96.
Applying the nearest-neighbor model, we choose the k = 3 closest instances.
For the test point x = 2, the distances to the four training instances (Salary values
5, 1, 2 and 1) are 3, 1, 0 and 1 respectively, so the closest instances are instances
2, 3 and 4.
Each nearest instance is then assigned a weight with a kernel function; a common
choice is the Gaussian kernel,
wᵢ = e^(−(xᵢ − x)² / (2τ²))
where τ is a bandwidth parameter that controls how quickly the weight falls off with
distance.
Hence the weights of the closest instances are computed as follows:
Weight of Instance 2:
w₂ = e^(−(x₂ − x)² / (2τ²)) = e^(−(1 − 2)² / (2τ²)) = 0.043
Weight of Instance 4:
w₄ = e^(−(x₄ − x)² / (2τ²)) = e^(−(1 − 2)² / (2τ²)) = 0.043
(Instance 3 coincides with the test point x = 2, so its distance is 0 and its weight is
e⁰ = 1.)
The predicted outputs for the 3 closest instances are given as follows:
The predicted output of Instance 2 is:
y₂′ = hβ(x₂) = β₀ + β₁x₂ = 4.72 + 0.62 × 1 = 5.34
The predicted output of Instance 3 is:
y₃′ = hβ(x₃) = β₀ + β₁x₃ = 4.72 + 0.62 × 2 = 5.96
The predicted output of Instance 4 is:
y₄′ = hβ(x₄) = β₀ + β₁x₄ = 4.72 + 0.62 × 1 = 5.34
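The whole local computation can be sketched in Python as below. This is an illustrative sketch of standard locally weighted least squares rather than the slides' exact remaining steps: the bandwidth τ = 0.4 is an assumed value chosen so that the kernel weights come out close to the 0.043 obtained above, and the final step refits β₀ and β₁ with those weights and predicts at x = 2.

import math

# Table 4.10: Salary (in lakhs) -> Expenditure (in thousands)
X = [5.0, 1.0, 2.0, 1.0]
Y = [25.0, 5.0, 7.0, 8.0]
x_query, tau = 2.0, 0.4              # tau (bandwidth) is an assumed value

# Gaussian-kernel weight of every training point relative to the query point
w = [math.exp(-(xi - x_query) ** 2 / (2 * tau ** 2)) for xi in X]
print([round(wi, 3) for wi in w])    # instances 2 and 4 come out near 0.043

# refit beta0, beta1 by weighted least squares around the query point
sw = sum(w)
xm = sum(wi * xi for wi, xi in zip(w, X)) / sw      # weighted mean of x
ym = sum(wi * yi for wi, yi in zip(w, Y)) / sw      # weighted mean of y
num = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, X, Y))
den = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, X))
b1 = num / den
b0 = ym - b1 * xm
print(b0 + b1 * x_query)             # locally weighted prediction at x = 2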