Similarity-based Learning

"Anyone who stops learning is old, whether at twenty or eighty." — Henry Ford

Similarity-based learning is a supervised learning technique that predicts the class label of a test instance by gauging the similarity of this test instance with the training instances. Similarity-based learning measures the similarities between the test instance and a specific set of training instances local to the test instance in an incremental process. In contrast to other learning mechanisms, it considers only the nearest instance or instances to predict the class of unseen instances. This learning methodology improves the performance of classification as an incremental learning task. Similarity-based classification is useful in various fields such as image processing, text classification, pattern recognition, bioinformatics, data mining, information retrieval, natural language processing, etc. A practical application of this learning is predicting daily stock index price changes. This chapter provides an insight into how different similarity-based models predict the class of a new instance.

* Understand the fundamentals of instance-based learning
* Know about the concepts of Nearest-Neighbor learning using the algorithm called k-Nearest-Neighbors (k-NN)
* Learn about the Weighted k-Nearest-Neighbor classifier that chooses the neighbors by using the weighted distance
* Gain knowledge about the Nearest Centroid classifier, a simple alternative to k-NN classifiers
* Understand Locally Weighted Regression (LWR), which approximates the linear functions of all k neighbors to minimize the error while predicting

4.1 INTRODUCTION TO SIMILARITY OR INSTANCE-BASED LEARNING

Similarity-based classifiers use similarity measures to locate the nearest neighbors and classify a test instance, which works in contrast with other learning mechanisms such as decision trees or neural networks. Similarity-based learning is also called instance-based learning or just-in-time learning, since it does not build an abstract model of the training instances and performs lazy learning when classifying a new instance. This learning mechanism simply stores all the data and uses it only when it needs to classify an unseen instance. The advantage of using this learning is that processing occurs only when a request to classify a new instance is given. This methodology is particularly useful when the whole dataset is not available at the beginning but is collected in an incremental manner. The drawback of this learning is that it requires a large memory to store the data, since a global abstract model is not constructed initially from the training data.

Classification of instances is done based on a measure of similarity in the form of distance functions over data instances. Several distance metrics are used to estimate the similarity or dissimilarity between instances, as required for clustering, nearest neighbor classification, anomaly detection, and so on. Popular distance metrics are Hamming distance, Euclidean distance, Manhattan distance, Minkowski distance, Cosine similarity, Mahalanobis distance, Pearson's correlation (or correlation similarity), Mean squared difference, Jaccard coefficient, Tanimoto coefficient, etc. Generally, similarity-based classification problems formulate the features of the test instance and training instances in Euclidean space to learn the similarity or dissimilarity between instances.
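To make these metrics concrete, here is a minimal sketch (not from the textbook) of how a few of the distance measures listed above can be computed over numeric or categorical feature vectors; the sample vectors are hypothetical.

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Count of positions where the two instances disagree (categorical attributes)
    return sum(1 for x, y in zip(a, b) if x != y)

def cosine_similarity(a, b):
    # Dot product normalized by the two vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]          # hypothetical numeric instances
print(euclidean(a, b), manhattan(a, b), cosine_similarity(a, b))
print(hamming(["red", "S"], ["red", "M"]))       # hypothetical categorical instances
```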
4.1.1 Differences Between Instance- and Model-based Learning

An instance is an entity or an example in the training dataset. It is described by a set of features or attributes, one of which describes the class label or category of the instance. Instance-based methods learn or predict the class label of a test instance only when a new instance is given for classification, and until then they delay the processing of the training dataset. They are also referred to as lazy learning methods, since they do not generalize any model from the training dataset but just keep the training dataset as a knowledge base until a new instance is given. In contrast, model-based learning, generally referred to as eager learning, tries to generalize the training data to a model before receiving test instances. Model-based machine learning describes all assumptions about the problem domain in the form of a model. These algorithms basically learn in two phases, called the training phase and the testing phase. In the training phase, a model is built from the training dataset and is then used to classify a test instance during the testing phase. Some examples of the models constructed are decision trees, neural networks, Support Vector Machines (SVM), etc. The differences between instance-based learning and model-based learning are listed in Table 4.1.

Table 4.1: Differences between Instance-based Learning and Model-based Learning

| Instance-based Learning | Model-based Learning |
| Lazy learners | Eager learners |
| Processing of training instances is done only during the testing phase | Processing of training instances is done during the training phase |
| Generalizes only after it receives a test instance | Generalizes before it receives a test instance |
| Predicts the class of the test instance directly from the training data | Predicts the class of the test instance from the model built |
| Slow in the testing phase | Fast in the testing phase |
| Learns by making many local approximations | Learns by creating a global approximation |

Instance-based learning also comes under the category of memory-based models, which normally compare the given test instance with the trained instances that are stored in memory. Memory-based models classify a test instance by checking its similarity with the training instances. Some examples of instance-based learning algorithms are:

1. k-Nearest Neighbor (k-NN)
2. Variants of Nearest Neighbor learning
3. Locally Weighted Regression
4. Learning Vector Quantization (LVQ)
5. Self-Organizing Map (SOM)
6. Radial Basis Function (RBF) networks

In this chapter, we will discuss certain instance-based learning algorithms such as k-Nearest Neighbor (k-NN), variants of Nearest Neighbor learning, and Locally Weighted Regression. Self-Organizing Map (SOM) and Radial Basis Function (RBF) networks are discussed along with the concepts of artificial neural networks in Chapter 10, since they can be taken up only after an understanding of neural networks. These instance-based methods have serious limitations regarding the range of feature values taken. Moreover, they are sensitive to irrelevant and correlated features, leading to misclassification of instances.

4.2 NEAREST-NEIGHBOR LEARNING

A natural approach to similarity-based classification is k-Nearest-Neighbors (k-NN), which is a non-parametric method used for both classification and regression problems. It is a simple and powerful non-parametric algorithm that predicts the category of the test instance according to the 'k' training samples which are closest to the test instance, and classifies it to the category which has the largest probability. A visual representation of this learning is shown in Figure 4.1. There are two classes of objects, C1 and C2, in the figure. When given a test instance T, the category of this test instance is determined by looking at the class of its k = 3 nearest neighbors. Since the majority of those three neighbors belong to C1, the class of this test instance T is predicted as C1.

[Figure 4.1: Visual representation of k-Nearest Neighbor learning]
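The decision rule in Figure 4.1 is a simple majority vote. A minimal sketch, assuming the three hypothetical neighbor labels below stand in for the k = 3 training instances nearest to T:

```python
from collections import Counter

# Hypothetical class labels of the k = 3 nearest neighbors of T
neighbor_labels = ["C1", "C1", "C2"]

# Majority vote: the most frequent label among the k neighbors wins
predicted_class = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted_class)  # -> "C1"
```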
The algorithm relies on the assumption that similar objects are close to each other in the feature space. k-NN performs instance-based learning, which just stores the training data instances and learns the instances case by case. The model is also 'memory-based', as it uses the training data at the time when predictions need to be made. It is a lazy learning algorithm, since no prediction model is built earlier with the training instances and classification happens only after getting the test instance.

The algorithm classifies a new instance by determining the 'k' most similar instances (i.e., k nearest neighbors) and summarizing the output of those 'k' instances. If the target variable is discrete, then it is a classification problem, so the algorithm selects the most common class value among the 'k' instances by a majority vote. However, if the target variable is continuous, then it is a regression problem, and hence the mean output value of the 'k' instances is the output of the test instance. A popular distance measure such as Euclidean distance is used in k-NN to determine the 'k' instances which are similar to the test instance. The value of 'k' is best determined by tuning with different 'k' values and choosing the 'k' which classifies the test instance most accurately.

Inputs: Training dataset T, distance metric d, test instance t, the number of nearest neighbors k
Output: Predicted class or category

Prediction: For test instance t,
1. For each instance i in T, compute the distance between the test instance t and instance i using the distance metric (Euclidean distance).
   [Continuous attributes — Euclidean distance between two points in the plane with coordinates (x₁, y₁) and (x₂, y₂) is given as dist((x₁, y₁), (x₂, y₂)) = √((x₂ − x₁)² + (y₂ − y₁)²).]
   [Categorical attributes (binary) — Hamming distance: if the values of the two instances are the same, the distance d is equal to 0; otherwise d = 1.]
2. Sort the distances in ascending order and select the first k nearest training data instances to the test instance.
3. Predict the class of the test instance by majority voting (if the target attribute is discrete-valued) or by the mean (if the target attribute is continuous-valued) of the k selected nearest instances.

Example: Consider the student performance training dataset of 8 data instances shown in Table 4.2, which describes the performance of individual students in a course together with the CGPA obtained in the previous semesters. The independent attributes are CGPA, Assessment and Project. The target variable is 'Result', which is a discrete-valued variable that takes two values, 'Pass' or 'Fail'. Based on the performance of a student, classify whether the student will pass or fail that course.

Table 4.2: Training Dataset T
[Table truncated in this excerpt]
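The three-step procedure above translates almost directly into code. Below is a minimal sketch of a k-NN classifier; since Table 4.2 is truncated in this excerpt, the training rows are made-up stand-ins with the same shape (CGPA, Assessment, Project → Result), and the function name knn_predict is ours, not the textbook's.

```python
import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    """Predict the class of test_point from labelled (features, label) rows."""
    # Step 1: compute the Euclidean distance from the test instance
    # to every instance in the training dataset.
    distances = [
        (math.sqrt(sum((f - t) ** 2 for f, t in zip(features, test_point))), label)
        for features, label in train
    ]
    # Step 2: sort the distances in ascending order and keep the first k.
    k_nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote among the k selected nearest instances
    # (for a continuous target, the mean of the k outputs would be used instead).
    labels = [label for _, label in k_nearest]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical training rows in the shape of Table 4.2: (CGPA, Assessment, Project) -> Result
train = [
    ((9.2, 85, 8), "Pass"), ((8.0, 80, 7), "Pass"), ((8.5, 81, 8), "Pass"),
    ((6.0, 45, 5), "Fail"), ((6.5, 50, 4), "Fail"), ((5.8, 38, 5), "Fail"),
    ((8.9, 91, 9), "Pass"), ((7.4, 66, 6), "Pass"),
]
print(knn_predict(train, (6.1, 40, 5), k=3))  # -> "Fail"
```

One practical caveat: because raw Euclidean distance lets attributes with large ranges (here, Assessment) dominate, features are usually normalized before applying k-NN.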
