Decision Tree and KNN
Given a labelled training dataset, we follow the steps below to classify any new data point; a short Python sketch of these steps appears right after the list.
1. First, we choose the number k and a distance metric. You can take any distance metric, such as Euclidean, Minkowski, or Manhattan distance, for numerical attributes in the dataset. You can also specify your own distance metric if the dataset has categorical or mixed attributes.
2. For a new data point P, calculate its distance to all the existing data points.
3. Select the k-nearest data points, where k is a user-specified parameter.
4. Among the k-nearest neighbors, count the number of data points in each class. We do this so that we can pick the class label that holds the majority among the selected neighbors.
5. Assign the new data point to the class with the majority class label among the k-nearest neighbors.
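The steps above can be sketched in plain Python. This is only an illustrative sketch; the names euclidean and knn_classify are chosen for this example and Euclidean distance is used as the metric from step 1.

```python
from collections import Counter
import math

def euclidean(a, b):
    # Step 1: the distance metric (Euclidean here; Manhattan or Minkowski also work)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training_points, labels, query, k=3, distance=euclidean):
    # Step 2: distance from the new point to every existing data point
    distances = [(distance(query, x), label) for x, label in zip(training_points, labels)]
    # Step 3: keep the k nearest points
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Steps 4-5: count class labels among the neighbours and return the majority
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```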
Now that we have discussed the basic intuition and the algorithm for KNN classification, let us discuss a KNN classification
numerical example using a small dataset.
Minkowski distance
Euclidean distance can be generalised using the Minkowski norm, also known as the p-norm. The formula for the Minkowski distance between two points x and y with n attributes is:

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

The formula differs from the Euclidean distance in that, instead of squaring the differences, we raise them to the power of p and take the p-th root of the sum. The biggest advantage of using such a distance metric is that we can change the value of p to obtain different distance metrics.
With p = 2, the Minkowski distance reduces to the Euclidean distance, and with p = 1 it becomes the Manhattan distance.
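As a quick illustration (the function name minkowski is just for this sketch), the same code yields both metrics simply by changing p:

```python
def minkowski(a, b, p=2):
    # p = 2 gives the Euclidean distance, p = 1 the Manhattan distance
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

print(minkowski((2, 10), (5, 7), p=2))  # Euclidean distance: ~4.24
print(minkowski((2, 10), (5, 7), p=1))  # Manhattan distance: 6.0
```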
Example
To solve a numerical example of the K-nearest neighbors (KNN) classification algorithm, we will use the following dataset.
Point Coordinates Class Label
A1 (2,10) C2
A2 (2, 6) C1
A3 (11,11) C3
A4 (6, 9) C2
A5 (6, 5) C1
A6 (1, 2) C1
A7 (5, 10) C2
A8 (4, 9) C2
A9 (10, 12) C3
A10 (7, 5) C1
A12 (4, 6) C1
A15 (3, 8) C2
For this, we will first specify the number of nearest neighbors, i.e. k. Let us take k to be 3. Suppose the new data point to classify is P = (5, 7). Now, we will find the distance of P to each data point in the dataset. For this KNN classification numerical example, we will use the Euclidean distance metric. The following table shows the Euclidean distance of P to each data point in the dataset.
Point Coordinates Distance from P = (5, 7)
A1 (2, 10) 4.24
A2 (2, 6) 3.16
A3 (11, 11) 7.21
A4 (6, 9) 2.24
A5 (6, 5) 2.24
A6 (1, 2) 6.40
A7 (5, 10) 3.00
A8 (4, 9) 2.24
A9 (10, 12) 7.07
A10 (7, 5) 2.83
A12 (4, 6) 1.41
A15 (3, 8) 2.24
The three smallest distances are those of A12 (1.41), A4 (2.24), and A5 (2.24); note that A4, A5, A8, and A15 are tied at 2.24, and we keep the first two in dataset order.
Now, points A12, A4, and A5 have the class labels C1, C2, and C1, respectively. Among these points, the majority class label is C1.
Therefore, we will specify the class label of point P = (5, 7) as C1. Hence, we have successfully used KNN classification to classify
point P according to the given dataset.
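Reusing the knn_classify sketch from earlier in this section, the same result can be checked in a few lines; the dictionaries below simply mirror the table above.

```python
points = {"A1": (2, 10), "A2": (2, 6), "A3": (11, 11), "A4": (6, 9),
          "A5": (6, 5), "A6": (1, 2), "A7": (5, 10), "A8": (4, 9),
          "A9": (10, 12), "A10": (7, 5), "A12": (4, 6), "A15": (3, 8)}
classes = {"A1": "C2", "A2": "C1", "A3": "C3", "A4": "C2", "A5": "C1",
           "A6": "C1", "A7": "C2", "A8": "C2", "A9": "C3", "A10": "C1",
           "A12": "C1", "A15": "C2"}

# Classify P = (5, 7) with k = 3; ties at equal distance keep dataset order.
prediction = knn_classify(list(points.values()),
                          [classes[name] for name in points],
                          (5, 7), k=3)
print(prediction)  # C1
```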
Q.2 We have data from a questionnaire survey (asking people's opinions) and from objective testing with two attributes (acid durability and strength) to classify whether a special paper tissue is good or not. Here are four training samples:
X1 = Acid Durability (seconds)  X2 = Strength (kg/square meter)  Y = Classification
7  7  Bad
7  4  Bad
3  4  Good
1  4  Good
Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without another expensive survey, can we guess what the classification of this new tissue is?
1. Determine the parameter K = number of nearest neighbors. Suppose we use K = 3.
2. Calculate the distance between the query instance and all the training samples.
The coordinate of the query instance is (3, 7). Instead of calculating the distance, we compute the squared distance, which is faster to calculate because it avoids the square root and does not change the ordering.
X1 = Acid Durability (seconds)  X2 = Strength (kg/square meter)  Square Distance to query instance (3, 7)
7  7  (7 - 3)^2 + (7 - 7)^2 = 16
7  4  (7 - 3)^2 + (4 - 7)^2 = 25
3  4  (3 - 3)^2 + (4 - 7)^2 = 9
1  4  (1 - 3)^2 + (4 - 7)^2 = 13
3. Sort the distances and determine the nearest neighbors based on the K-th minimum distance.
X1 = Acid Durability (seconds)  X2 = Strength (kg/square meter)  Square Distance to query instance (3, 7)  Rank of minimum distance  Is it included in 3-Nearest neighbors?
7  7  16  3  Yes
7  4  25  4  No
3  4  9   1  Yes
1  4  13  2  Yes
4. Gather the category (Y) of the nearest neighbors. Notice in the second row, last column, that the category (Y) is not included because the rank of this data point is greater than 3 (= K).
X1 = Acid Durability (seconds)  X2 = Strength (kg/square meter)  Square Distance to query instance (3, 7)  Rank of minimum distance  Is it included in 3-Nearest neighbors?  Y = Category of nearest neighbor
7  7  16  3  Yes  Bad
7  4  25  4  No   -
3  4  9   1  Yes  Good
1  4  13  2  Yes  Good
5. Use the simple majority of the categories of the nearest neighbors as the prediction for the query instance.
We have 2 Good and 1 Bad; since 2 > 1, we conclude that the new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 is classified as Good.
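A minimal Python check of this result, using squared distances exactly as in step 2 (the names samples and query are just for this sketch):

```python
from collections import Counter

samples = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
K = 3

# Squared distance is enough for ranking: taking the square root never changes the order.
sq_dist = [((x1 - query[0]) ** 2 + (x2 - query[1]) ** 2, y) for (x1, x2), y in samples]
nearest = sorted(sq_dist)[:K]          # [(9, 'Good'), (13, 'Good'), (16, 'Bad')]
print(Counter(y for _, y in nearest).most_common(1)[0][0])  # Good
```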
Decision tree:
Introduction
Decision Trees are a type of Supervised Machine Learning (that is, you explain what the input is and what the corresponding output is in the training data) where the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions or the final outcomes, and the decision nodes are where the data gets split.
• Entropy:
Entropy, also called Shannon entropy and denoted by H(S) for a finite set S, is the measure of the amount of uncertainty or randomness in the data.
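Concretely, if the members of S fall into classes with proportions p_i, the entropy used throughout the example below is:

H(S) = -\sum_{i} p_i \log_2 p_i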
• Information Gain:
Information gain measures the reduction in entropy achieved by splitting the data on a particular attribute; the attribute with the highest gain gives the most useful split.
Let’s understand this with the help of an example. Consider a piece of data collected over the
course of 14 days where the features are Outlook, Temperature, Humidity, Wind and the outcome
variable is whether Golf was played on the day. Now, our job is to build a predictive model which
takes in the above 4 parameters and predicts whether Golf will be played on the day. We'll build a decision tree to do this. The class counts over the 14 days are:
Yes No Total
9 5 14
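Plugging these counts into the entropy formula gives:

H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94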
Remember that the Entropy is 0 if all members belong to the same class, and 1 when half of them belong to one class and the other half belong to the other class, which is perfect randomness. Here it is 0.94, which means the distribution is fairly random. Now, the next step is to choose the attribute that gives us the highest possible Information Gain, and we will choose that attribute as the root node.
Let’s start with ‘Wind’
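One standard way to write the information gain being computed here (the exact notation in the original may differ) is:

IG(S, A) = H(S) - \sum_{x \in \text{values}(A)} P(x)\, H(S_x)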
where x ranges over the possible values of the attribute A, P(x) is the fraction of examples taking the value x, and S_x is the subset of examples with that value. Here, the attribute 'Wind' takes two possible values in the sample data, hence x = {Weak, Strong}. We'll therefore have to calculate H(S_Weak) and H(S_Strong).
Amongst all the 14 examples we have 8 places where the wind is weak and 6 where the wind is
Strong.
Now, out of the 8 Weak examples, 6 of them were 'Yes' for Play Golf and 2 of them were 'No' for 'Play Golf'. So, we have:

H(S_{\text{Weak}}) = -\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8} \approx 0.811
Similarly, out of 6 Strong examples, we have 3 examples where the outcome was ‘Yes’ for Play
Golf and 3 where we had ‘No’ for Play Golf.
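The Strong subset splits evenly between the two classes, so:

H(S_{\text{Strong}}) = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1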
Remember, here half of the items belong to one class while the other half belong to the other, hence we have perfect randomness. Now we have all the pieces required to calculate the Information Gain:
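Substituting the values computed above:

IG(S, \text{Wind}) = 0.94 - \frac{8}{14}(0.811) - \frac{6}{14}(1) \approx 0.048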
This is the Information Gain obtained by considering 'Wind' as the feature: 0.048. Now we must similarly calculate the Information Gain for all of the features.
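Assuming the standard 14-day golf (PlayTennis) dataset that this walkthrough follows, those calculations work out to approximately:

IG(S, \text{Outlook}) \approx 0.246,\quad IG(S, \text{Temperature}) \approx 0.029,\quad IG(S, \text{Humidity}) \approx 0.151,\quad IG(S, \text{Wind}) \approx 0.048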
We can clearly see that IG(S, Outlook) has the highest information gain of 0.246, hence we choose the Outlook attribute as the root node. At this point, the decision tree looks like this:
Here we observe that whenever the outlook is Overcast, Play Golf is always 'Yes'. This is no coincidence; the simple subtree results precisely because Outlook gives the highest information gain. Now, how do we proceed from this point? We can simply apply recursion; you might want to look at the algorithm steps described earlier. Now that we've used Outlook, we have three features remaining: Humidity, Temperature, and Wind. And we had three possible values of Outlook: Sunny, Overcast, and Rain. The Overcast node already ended up as the leaf node 'Yes', so we're left with two subtrees to compute: Sunny and Rain.
The table restricted to rows where the value of Outlook is Sunny looks like this:
As we can see, the highest Information Gain here is given by Humidity. Proceeding in the same way with the subset where Outlook is Rain will give us Wind as the attribute with the highest information gain. The final Decision Tree looks something like this.
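To make the splitting criterion concrete, here is a small Python sketch of entropy and information gain. The golf data below is the standard 14-day PlayTennis dataset this walkthrough appears to follow, so treat the exact rows (and the helper names entropy and information_gain) as assumptions of this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    # IG(S, A) = H(S) - sum(|S_v| / |S| * H(S_v)) over the values v of attribute A
    total = len(labels)
    gain = entropy(labels)
    for value in set(row[attribute_index] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute_index] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Assumed standard PlayTennis data: (Outlook, Temperature, Humidity, Wind) -> Play Golf
rows = [("Sunny", "Hot", "High", "Weak"), ("Sunny", "Hot", "High", "Strong"),
        ("Overcast", "Hot", "High", "Weak"), ("Rain", "Mild", "High", "Weak"),
        ("Rain", "Cool", "Normal", "Weak"), ("Rain", "Cool", "Normal", "Strong"),
        ("Overcast", "Cool", "Normal", "Strong"), ("Sunny", "Mild", "High", "Weak"),
        ("Sunny", "Cool", "Normal", "Weak"), ("Rain", "Mild", "Normal", "Weak"),
        ("Sunny", "Mild", "Normal", "Strong"), ("Overcast", "Mild", "High", "Strong"),
        ("Overcast", "Hot", "Normal", "Weak"), ("Rain", "Mild", "High", "Strong")]
labels = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Wind"]):
    print(name, round(information_gain(rows, labels, i), 3))
# Outlook ~0.246, Temperature ~0.029, Humidity ~0.151, Wind ~0.048
```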