Similarity-based Learning

"Anyone who stops learning is old, whether at twenty or eighty."
— Henry Ford
Similarity-based Learning is a supervised learning technique that predicts the class label of a test instance by gauging the similarity of this test instance with training instances. Similarity-based classification computes the similarities between the test instance and a specific set of training instances local to the test instance in an incremental process. In contrast to other learning mechanisms, it considers only the nearest instance or instances to predict the class of unseen instances. This learning methodology improves the performance of classification as an incremental learning task. Similarity-based classification is useful in various fields such as image processing, text classification, pattern recognition, bioinformatics, data mining, information retrieval, natural language processing, etc. A practical application of this learning is predicting daily stock index price changes. This chapter provides an insight into how different similarity-based models predict the class of a new instance.
Learning Objectives
* Understand the fundamentals of Instance-based learning
* Know about the concepts of Nearest-Neighbor Learning using the algorithm called k-Nearest-Neighbors (k-NN)
* Learn about the Weighted k-Nearest-Neighbor classifier that chooses the neighbors by using the weighted distance
* Gain knowledge about the Nearest Centroid Classifier, a simple alternative to k-NN classifiers
* Understand Locally Weighted Regression (LWR), which approximates the linear functions of all k neighbors to minimize the error while predicting
4.1 INTRODUCTION TO SIMILARITY OR INSTANCE-BASED LEARNING
Similarity-based classifiers use similarity measures to locate the nearest neighbors and classify a test instance, which works in contrast with other learning mechanisms such as decision trees or neural networks. Similarity-based learning is also called Instance-based learning or Just-in-time learning since it does not build an abstract model of the training instances and performs lazy learning when classifying a new instance. This learning mechanism simply stores all data and uses it only when it needs to classify an unseen instance. The advantage of using this learning is that processing occurs only when a request to classify a new instance is given. This methodology is particularly useful when the whole dataset is not available in the beginning but is collected in an incremental manner. The drawback of this learning is that it requires a large memory to store the data since a global abstract model is not constructed initially with the training data. Classification of instances is done based on the measure of similarity in the form of distance functions over data instances. Several distance metrics are used to estimate the similarity or dissimilarity between instances required for clustering, nearest-neighbor classification, anomaly detection, and so on. Popular distance metrics used are Hamming distance, Euclidean distance, Manhattan distance, Minkowski distance, Cosine similarity, Mahalanobis distance, Pearson's correlation (correlation similarity), Mean squared difference, Jaccard coefficient, Tanimoto coefficient, etc.
Generally, similarity-based classification problems formulate the features of the test instance and training instances in Euclidean space to learn the similarity or dissimilarity between instances.
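A minimal sketch of a few of these metrics computed with NumPy may make the definitions concrete; the sample vectors below are illustrative only and are not taken from the text.

```python
# Sketch: a few of the distance/similarity measures listed above, computed with
# NumPy between two illustrative feature vectors x and y (values are made up).
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))                  # straight-line distance
manhattan = np.sum(np.abs(x - y))                          # city-block distance
minkowski = np.sum(np.abs(x - y) ** 3) ** (1 / 3)          # Minkowski distance with p = 3
cosine_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))  # cosine similarity

# Hamming distance for categorical attributes: count of mismatching positions
a = np.array(list("ABCD"))
b = np.array(list("ABED"))
hamming = np.sum(a != b)

print(euclidean, manhattan, minkowski, cosine_sim, hamming)
```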
4.1.1 Differences Between Instance- and Model-based Learning
An instance is an entity or an example in the training dataset. It is described by a set of features or attributes, one of which describes the class label or category of the instance. Instance-based methods learn or predict the class label of a test instance only when a new instance is given for classification, and until then they delay the processing of the training dataset.
They are also referred to as lazy learning methods since they do not generalize any model from the training dataset but just keep the training dataset as a knowledge base until a new instance is given. In contrast, model-based learning, generally referred to as eager learning, tries to generalize the training data to a model before receiving test instances. Model-based machine learning describes all assumptions about the problem domain in the form of a model. These algorithms basically learn in two phases, called the training phase and the testing phase. In the training phase, a model is built from the training dataset and is used to classify a test instance during the testing phase. Some examples of models constructed are decision trees, neural networks, Support Vector Machines (SVM), etc.
The differences between Instance-based Learning and Model-based Learning are listed in Table 4.1.
Table 4.1: Differences between Instance-based Learning and Model-based Learning

Instance-based Learning (Lazy Learners) | Model-based Learning (Eager Learners)
Processing of training instances is done only during the testing phase | Processing of training instances is done during the training phase
No model is built with the training instances before it receives a test instance | Generalizes a model with the training instances before it receives a test instance
Predicts the class of the test instance directly from the training data | Predicts the class of the test instance from the model built
Slow in testing phase | Fast in testing phase
Learns by making many local approximations | Learns by creating a global approximation
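The contrast is visible in libraries such as scikit-learn, where a k-NN classifier's fit() essentially just stores the training instances while a decision tree's fit() constructs the whole model up front. The following is only a sketch under that assumption; the toy data is made up for illustration.

```python
# Sketch: lazy (instance-based) vs eager (model-based) learners in scikit-learn.
# KNeighborsClassifier defers almost all work to prediction time, whereas
# DecisionTreeClassifier builds its model entirely during fit().
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_train = [[6.8, 1], [7.5, 0], [5.1, 1], [8.9, 1]]   # toy features (values are illustrative)
y_train = ['Pass', 'Pass', 'Fail', 'Pass']

lazy = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)   # stores the instances
eager = DecisionTreeClassifier().fit(X_train, y_train)             # builds a tree now

X_test = [[6.0, 1]]
print(lazy.predict(X_test))    # distances are computed only at this point
print(eager.predict(X_test))   # traverses the already-built tree
```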
Instance-based learning also comes under the category of memory-based models, which normally compare the given test instance with the trained instances that are stored in memory. Memory-based models classify a test instance by checking its similarity with the training instances.
Some examples of Instance-based learning algorithms are:
1. k-Nearest Neighbor (k-NN)
2. Variants of Nearest Neighbor learning
3. Locally Weighted Regression
4. Learning Vector Quantization (LVQ)
5. Self-Organizing Map (SOM)
6. Radial Basis Function (RBF) networks
In this chapter, we will discuss certain instance-based learning algorithms such as k-Nearest Neighbor (k-NN), Variants of Nearest Neighbor learning, and Locally Weighted Regression.
Self-Organizing Map (SOM) and Radial Basis Function (RBF) networks are discussed along with the concepts of artificial neural networks in Chapter 10, since they can be understood only after learning neural networks.
These instance-based methods have serious limitations regarding the range of feature values they can handle. Moreover, they are sensitive to irrelevant and correlated features, which can lead to misclassification of instances.
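Because distance computations can be dominated by an attribute with a large numeric range, a common precaution (an addition here, not from the text) is to standardize the features before applying a nearest-neighbor method. A sketch with scikit-learn and illustrative data:

```python
# Sketch: rescaling features before k-NN so that no single attribute's range
# dominates the distance calculation (the data values are illustrative).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X_train = [[9.1, 85000], [6.2, 42000], [8.4, 78000], [5.9, 39000]]  # two attributes on very different scales
y_train = ['Pass', 'Fail', 'Pass', 'Fail']

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)
print(model.predict([[7.8, 70000]]))
```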
4.2 NEAREST-NEIGHBOR LEARNING
A natural approach to similarity-based classification is k-Nearest-Neighbors (k-NN), which is a non-parametric method used for both classification and regression problems. It is a simple and powerful non-parametric algorithm that predicts the category of the test instance according to the 'k' training samples which are closest to the test instance, and classifies it to the category which has the largest probability. A visual representation of this learning is shown in Figure 4.1. There are two classes of objects, called C1 and C2, in the given figure. When given a test instance T, the category of this test instance is determined by looking at the class of its k = 3 nearest neighbors. Thus, the class of this test instance T is predicted as C1.

Figure 4.1: Visual Representation of k-Nearest Neighbor Learning
The algorithm relies on the assumption that similar objects are close to each other in the feature space. k-NN performs instance-based learning: it just stores the training data instances and learns the instances case by case. The model is also 'memory-based' as it uses the training data at the time when predictions need to be made. It is a lazy learning algorithm since no prediction model is built earlier with the training instances, and classification happens only after getting the test instance.
The algorithm classifies a new instance by determining the 'k' most similar instances (i.e., k nearest neighbors) and summarizing the output of those 'k' instances. If the target variable is discrete, then it is a classification problem, so it selects the most common class value among the 'k' instances by a majority vote. However, if the target variable is continuous, then it is a regression problem, and hence the mean output variable of the 'k' instances is the output of the test instance.
A popular distance measure such as Euclidean distance is used in k-NN to determine the 'k' instances which are most similar to the test instance. The value of 'k' is best determined by tuning with different 'k' values and choosing the 'k' which classifies the test instance most accurately.
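One common way to perform this tuning is cross-validation over candidate values of 'k'. The following sketch uses scikit-learn with the Iris dataset purely as a stand-in for whatever training data is at hand.

```python
# Sketch: choosing 'k' by cross-validated accuracy over a range of candidate values.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # stand-in dataset for illustration

best_k, best_score = None, -1.0
for k in range(1, 16):
    # mean accuracy of a k-NN classifier over 5 cross-validation folds
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k}, cross-validated accuracy = {best_score:.3f}")
```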
Inputs: Training dataset T, distance metric d, test instance t, the number of nearest neighbors k
Output: Predicted class or category
Prediction: For test instance t,
1. For each instance i in T, compute the distance between the test instance t and instance i using the distance metric (Euclidean distance).
[Continuous attributes - Euclidean distance between two points in the plane with coordinates (x1, y1) and (x2, y2) is given as dist((x1, y1), (x2, y2)) = √((x1 − x2)² + (y1 − y2)²).]
[Categorical attributes (Binary) - Hamming distance: if the values of the two instances are the same, the distance d will be equal to 0; otherwise d = 1.]
2. Sort the distances in ascending order and select the first k nearest training data instances to the test instance.
3. Predict the class of the test instance by majority voting (if the target attribute is discrete valued) or by the mean (if the target attribute is continuous valued) of the k selected nearest instances.
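A minimal from-scratch sketch of these prediction steps follows, assuming continuous attributes with Euclidean distance; the function names and sample data are ours for illustration, not from the text.

```python
# Sketch of the k-NN prediction steps above: compute distances, sort them,
# take the k nearest training instances, and vote on (or average) their labels.
import math
from collections import Counter

def euclidean(a, b):
    # Step 1: Euclidean distance between two continuous-valued instances
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_X, train_y, test_x, k, discrete=True):
    # Step 2: sort training instances by their distance to the test instance
    distances = sorted(
        ((euclidean(x, test_x), y) for x, y in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    nearest_labels = [y for _, y in distances[:k]]
    if discrete:
        # Step 3a: majority vote for a discrete-valued target attribute
        return Counter(nearest_labels).most_common(1)[0][0]
    # Step 3b: mean of the neighbors for a continuous-valued target attribute
    return sum(nearest_labels) / k

# Illustrative usage with made-up [CGPA, assessment] features
train_X = [[9.2, 85], [8.0, 80], [8.5, 81], [6.0, 45], [6.5, 50]]
train_y = ['Pass', 'Pass', 'Pass', 'Fail', 'Fail']
print(knn_predict(train_X, train_y, [7.8, 78], k=3))   # expected: 'Pass'
```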
Example 4.1:
Consider the student performance training dataset of 8 data instances shown in Table 4.2, which describes the performance of individual students in a course along with the CGPA obtained in the previous semesters. The independent attributes are CGPA, Assessment and Project. The target variable is 'Result', which is a discrete valued variable that takes two values, 'Pass' or 'Fail'. Based on the performance of a student, classify whether the student will pass or fail in that course.
Table 4.2: Training Dataset T

CGPA | Assessment | Project | Result
(Continued)