Unit-4 AML (1. Basics and K-NN)
• Labelled training data containing past information comes as an input. Based on the
training data, the machine builds a predictive model that can be used on test data to
assign a label for each record in the test data.
• Some examples of supervised learning are
• Predicting the results of a game
• Predicting whether a tumour is malignant or benign
• Predicting the price of domains like real estate, stocks, etc.
• Classifying texts such as classifying a set of emails as
spam or non-spam
• When we are trying to predict a categorical or nominal variable, the problem is known as
a classification problem, whereas when we are trying to predict a real-valued variable,
the problem falls under the category of regression.
• Some typical classification problems include: Image classification, Prediction of disease,
Recognition of handwriting etc.
• Typical applications of regression can be seen in demand forecasting in retail, weather
forecasting, etc.
• Note: Supervised machine learning is as good as the data used to train it. If the training
data is of poor quality, the prediction will also be far from being precise.
• In the kNN algorithm, the class label of the test data elements is decided
by the class label of the training data elements which are neighbouring,
i.e. similar in nature. But there are two challenges:
1. What is the basis of this similarity or when can we say that two data
elements are similar?
2. How many similar elements should be considered for deciding the class
label of each test data element?
• To answer the first question: though there are many measures of similarity, the
most common approach adopted by kNN to measure similarity between two data
elements is Euclidean distance. Considering a very simple data set having two
features (say f1 and f2), the Euclidean distance between two data
elements d1 and d2 can be measured by

distance(d1, d2) = sqrt( (f1(d1) − f1(d2))² + (f2(d1) − f2(d2))² )
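The distance formula above can be written as a small Python function; the points are assumed to be (f1, f2) tuples, matching the two-feature data set in the text.

```python
import math

def euclidean_distance(d1, d2):
    """Euclidean distance between two data elements,
    each given as a (f1, f2) feature tuple."""
    return math.sqrt((d1[0] - d2[0]) ** 2 + (d1[1] - d2[1]) ** 2)

# Points (1, 2) and (4, 6) form a 3-4-5 triangle, so the distance is 5.0
euclidean_distance((1, 2), (4, 6))
```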
• The answer to the second question, i.e. how many similar elements should be considered,
lies in the value of ‘k’, a user-defined parameter given as an input to the algorithm.
• In the kNN algorithm, the value of ‘k’ indicates the number of neighbours that need to be
considered.
• For example, if the value of k is 3, only three nearest neighbours or three training data
elements closest to the test data element are considered. Out of the three data elements, the
class which is predominant is considered as the class label to be assigned to the test data.
• In case the value of k is 1, only the closest training data element is considered. The class
label of that data element is directly assigned to the test data element.
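The majority-vote step described above can be sketched with `collections.Counter`; with k = 1 the list holds a single label, so the same function covers both cases. The label strings here are illustrative.

```python
from collections import Counter

def majority_class(neighbour_labels):
    """Return the class label that is predominant among the
    k nearest neighbours; with k = 1 this is simply the one label."""
    return Counter(neighbour_labels).most_common(1)[0][0]

majority_class(["malignant", "benign", "malignant"])  # → "malignant"
majority_class(["benign"])                            # k = 1 case → "benign"
```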
• But deciding the value of k is often tricky. The reasons are as follows:
• If the value of k is very large (in the extreme case equal to the total number of records in the training
data), the class label of the majority class of the training data set will be assigned to the test data
regardless of the class labels of the neighbours nearest to the test data.
• If the value of k is very small (in the extreme case equal to 1), the class value of a noisy data point or
outlier in the training data set which happens to be the nearest neighbour to the test data will be assigned to the
test data.
• The best k value is somewhere between these two extremes.
• A few strategies are adopted by machine learning practitioners to arrive at a value for k.
• One common practice is to set k equal to the square root of the number of training records.
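The square-root rule of thumb is easy to implement. A sketch follows; note that rounding k to an odd number, so that majority votes cannot tie between two classes, is a common refinement and an assumption beyond the text.

```python
import math

def default_k(n_training_records):
    """Heuristic: k ≈ sqrt(number of training records).
    Forcing k to be odd (so two-class votes cannot tie) is a
    common refinement, not part of the rule as stated in the text."""
    k = max(1, round(math.sqrt(n_training_records)))
    return k if k % 2 == 1 else k + 1

default_k(100)  # sqrt(100) = 10, bumped to 11 to stay odd
```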
• Input: Training data set, test data set (or data points), value of ‘k’ (i.e. number of nearest neighbours to be considered)
• Steps:
• Do for all test data points
• Calculate the distance (usually Euclidean distance) of the test data point from the different training data points.
• Find the closest ‘k’ training data points, i.e. training data points whose distances are least from the test data point.
• If k = 1
• Then assign class label of the training data point to the test data point
• Else
• Whichever class label is predominantly present in the training data points, assign that class label to the test data point
• End do
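The steps above can be sketched end to end in Python. The data layout (a list of `(features, label)` pairs for training data, feature tuples for test points) is an assumption for illustration; `math.dist` computes the Euclidean distance from the earlier formula.

```python
import math
from collections import Counter

def knn_classify(training_data, test_points, k):
    """kNN following the steps above.
    training_data: list of (features, label) pairs,
    test_points:   list of feature tuples,
    k:             number of nearest neighbours to consider."""
    predictions = []
    for test_point in test_points:  # Do for all test data points
        # Calculate the Euclidean distance of the test data point
        # from the different training data points
        dists = [
            (math.dist(test_point, features), label)
            for features, label in training_data
        ]
        # Find the closest 'k' training data points (least distances)
        nearest = sorted(dists)[:k]
        if k == 1:
            # Assign the class label of the single closest training point
            predictions.append(nearest[0][1])
        else:
            # Assign the predominant class label among the k neighbours
            labels = [label for _, label in nearest]
            predictions.append(Counter(labels).most_common(1)[0][0])
    return predictions

training = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((5, 6), "B")]
knn_classify(training, [(0, 0.5)], k=3)  # → ["A"]
knn_classify(training, [(5, 5)], k=1)    # → ["B"]
```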