
INTRODUCTION TO CLASSIFICATION
K - NEAREST NEIGHBOUR
Prepared By
Fariha Jahan
Lecturer, Department of Computer Science & Engineering
Daffodil International University (DIU)
Nearest Neighbour
• Mainly used when all attribute values are continuous
• It can be modified to deal with categorical attributes
• The idea is to estimate the classification of an unseen instance using the classification of the instance or instances that are closest to it, in some sense that we need to define (i.e. it classifies new cases based on a similarity measure)
Nearest Neighbour
[Example: two labelled training instances with six attribute values each, together with an unseen instance]
• What should its classification be?
• Even without knowing what the six attributes represent, it seems intuitively obvious that the unseen instance is nearer to the first instance than to the second.
K - Nearest Neighbour (KNN)
• In practice there are likely to be many more instances in the
training set but the same principle applies.
• It is usual to base the classification on those of the k nearest neighbours, not just the nearest one.
• The method is then known as k-Nearest Neighbour or just k-NN classification.
KNN
• We can illustrate k-NN classification diagrammatically when the
dimension (i.e. the number of attributes) is small.
• Next we will see an example which illustrates the case where the dimension is just 2.
• In real-world data mining applications it can of course be considerably larger.
KNN
• A training set with 20 instances, each giving
the values of two attributes and an associated
classification
• How can we estimate the classification for an
‘unseen’ instance where the first and second
attributes are 9.1 and 11.0, respectively?
KNN
• For this small number of attributes we can
represent the training set as 20 points on a
two-dimensional graph with values of the first
and second attributes measured along the
horizontal and vertical axes, respectively.
• Each point is labelled with a + or − symbol to
indicate that the classification is positive or
negative, respectively.
KNN

• A circle has been added to enclose the five nearest neighbours of the unseen instance, which is shown as a small circle close to the centre of the larger one.
KNN

• The five nearest neighbours are labelled with three + signs and two − signs.
• So a basic 5-NN classifier would classify the unseen instance as ‘positive’ by a form of majority voting.
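This majority-voting step can be sketched in a few lines of Python (an illustrative sketch, not code from the slides; the function names and the (point, label) data layout are ours):

from collections import Counter
from math import sqrt

def euclidean(a, b):
    """Straight-line distance between two points with the same number of attributes."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training_set, unseen, k=5):
    """Classify 'unseen' by a majority vote among its k nearest neighbours.

    training_set is a list of (point, label) pairs, e.g. ((9.0, 10.5), '+').
    """
    # Sort the training instances by their distance from the unseen instance.
    nearest = sorted(training_set, key=lambda item: euclidean(item[0], unseen))
    # Vote among the labels of the k closest instances.
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

With k = 5 and the circled neighbours above (three ‘+’ and two ‘−’), the vote returns ‘+’.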
KNN
• We can represent two points in two dimensions (‘in two-
dimensional space’ is the usual term) as (a1, a2) and (b1, b2)
• When there are three attributes we can represent the points by
(a1, a2, a3) and (b1, b2, b3)
• When there are n attributes, we can represent the instances by the
points (a1, a2, . . . , an) and (b1, b2, . . . , bn) in ‘n-dimensional
space’
Distance Measures
• There are many possible ways of measuring the distance between
two instances with n attribute values, or equivalently between two
points in n-dimensional space.
• A distance measure is usually required to satisfy three conditions (let dist(X, Y) denote the distance between two points X and Y):
• The distance of any point A from itself is zero, i.e. dist(A, A) = 0
• The distance from A to B is the same as the distance from B to A, i.e. dist(A, B) = dist(B, A) (the symmetry condition)
• The third condition is called the triangle inequality. It corresponds to the intuitive idea that ‘the shortest distance between any two points is a straight line’. The condition says that for any points A, B and Z:
dist(A, B) ≤ dist(A, Z) + dist(Z, B)
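As a small illustrative check (not from the slides), any candidate dist function can be spot-tested against these three conditions on a sample of points:

def check_distance_conditions(dist, points, eps=1e-12):
    """Spot-check the three conditions for a distance measure on sample points."""
    for a in points:
        assert dist(a, a) == 0                       # dist(A, A) = 0
        for b in points:
            assert dist(a, b) == dist(b, a)          # symmetry: dist(A, B) = dist(B, A)
            for z in points:
                # triangle inequality: dist(A, B) <= dist(A, Z) + dist(Z, B)
                assert dist(a, b) <= dist(a, z) + dist(z, b) + eps
    return True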
Distance Measures
• There are many possible distance measures:
• Euclidean Distance
• Manhattan Distance or City Block Distance
• Hamming Distance …
Distance Measures: Euclidean Distance
• If we denote an instance in the training set by (a1, a2) and the unseen instance by (b1, b2), the length of the straight line joining the points is
√((a1 − b1)² + (a2 − b2)²)
• If there are two points (a1, a2, a3) and (b1, b2, b3) in a three-dimensional space the corresponding formula is
√((a1 − b1)² + (a2 − b2)² + (a3 − b3)²)
• The formula for Euclidean distance between points (a1, a2, . . . , an) and (b1, b2, . . . , bn) in n-dimensional space is a generalisation of these two results. The Euclidean distance is given by the formula
√((a1 − b1)² + (a2 − b2)² + . . . + (an − bn)²)
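A direct translation of the n-dimensional formula into Python (a sketch; the function name is ours):

from math import sqrt

def euclidean_distance(a, b):
    """Euclidean distance between (a1, ..., an) and (b1, ..., bn)."""
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Two dimensions: distance between (4, 2) and (12, 9)
print(euclidean_distance((4, 2), (12, 9)))   # sqrt(8**2 + 7**2) ≈ 10.63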
Distance Measures: Manhattan Distance
• The City Block (Manhattan) distance is the sum of the absolute differences of the attribute values, i.e. |a1 − b1| + |a2 − b2| + . . . + |an − bn|
• For example, the City Block distance between the points (4, 2) and (12, 9) is (12 − 4) + (9 − 2) = 8 + 7 = 15
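The corresponding City Block calculation as a sketch (again, the function name is ours):

def manhattan_distance(a, b):
    """City Block (Manhattan) distance: the sum of absolute attribute differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

print(manhattan_distance((4, 2), (12, 9)))   # |12 - 4| + |9 - 2| = 15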
KNN
• A training set with 20 instances, each giving
the values of two attributes and an associated
classification
• How can we estimate the classification for an
‘unseen’ instance where the first and second
attributes are 9.1 and 11.0, respectively?
• Use Euclidean Distance
Normalisation
• A major problem when using the Euclidean distance formula (and many other distance measures) is that large values frequently swamp the small ones.
• For example, suppose each instance describes a car by attributes such as its mileage (in miles), its age (in years) and its number of doors.
• When the distance of these instances from an unseen one is calculated, the mileage attribute will almost certainly contribute a value of several thousands squared, i.e. several millions, to the sum of squares total.
Normalisation
• It is clear that in practice the only attribute that will matter when deciding which neighbours are the nearest using the Euclidean distance formula is the mileage.
• We could have chosen an alternative unit of distance travelled such as millimetres or perhaps light years. Similarly we might have measured age in some other unit such as milliseconds or millennia. The units chosen should not affect the decision on which are the nearest neighbours.
Normalisation
• To overcome this problem we generally normalise the values of continuous attributes.
• The idea is to make the values of each attribute run from 0 to 1.
• In general, if the lowest value of attribute A is min and the highest value is max, we convert each value of A, say a, to (a − min)/(max − min).
• Using this approach all continuous attributes are converted to small numbers from 0 to 1, so the effect of the choice of unit of measurement on the outcome is greatly reduced.
Normalisation
• Note that it is possible that an unseen instance may have a value of A that is less than min or greater than max. If we want to keep the adjusted numbers in the range from 0 to 1 we can simply convert any values of A that are less than min or greater than max to 0 or 1, respectively.
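A minimal sketch of this min-max normalisation, including the clipping of out-of-range values just described (the function name is ours; the example range matches the Loan attribute used in the exercises below):

def normalise(value, min_val, max_val):
    """Scale an attribute value into [0, 1] using (a - min) / (max - min).

    Unseen values below min or above max are clipped to 0 or 1 respectively.
    """
    scaled = (value - min_val) / (max_val - min_val)
    return min(1.0, max(0.0, scaled))

# Loan values in the training data range from 18000 to 220000
print(normalise(40000, 18000, 220000))    # ≈ 0.11
print(normalise(250000, 18000, 220000))   # out of range, clipped to 1.0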
Normalisation
• Another issue that occurs with measuring the distance between
two points is the weighting of the contributions of the different
attributes.
• We may believe that the mileage of a car is more important than
the number of doors it has.
• To achieve this we can adjust the formula for Euclidean distance to
√(w1(a1 − b1)² + w2(a2 − b2)² + . . . + wn(an − bn)²)
where w1, w2, . . . , wn are the weights. It is customary to scale the weight values so that the sum of all the weights is one.
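A sketch of the weighted formula (the weights shown are made up for illustration and sum to one):

from math import sqrt

def weighted_euclidean(a, b, weights):
    """Euclidean distance with per-attribute weights w1, ..., wn."""
    return sqrt(sum(w * (ai - bi) ** 2 for w, ai, bi in zip(weights, a, b)))

# Weight the first (normalised) attribute three times as heavily as the second.
print(weighted_euclidean((0.4, 0.5), (0.6, 0.25), (0.75, 0.25)))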
Dealing with Categorical Attributes
• One of the weaknesses of the nearest neighbour approach to
classification is that there is no entirely satisfactory way of dealing
with categorical attributes.
• One possibility is to say that the difference between any two
identical values of the attribute is zero and that the difference
between any two different values is 1. (Hamming Distance)
• Effectively this amounts to saying (for a colour attribute) red − red = 0, red − blue = 1, blue − green = 1, etc.
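This 0/1 rule (a Hamming-style difference) can be written down as a trivial sketch:

def categorical_difference(a, b):
    """Difference between two categorical values: 0 if identical, 1 otherwise."""
    return 0 if a == b else 1

print(categorical_difference('red', 'red'))    # 0
print(categorical_difference('red', 'blue'))   # 1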
Dealing with Categorical Attributes
• Sometimes there is an ordering (or a partial ordering) of the values
of an attribute (Ordinal Attribute), for example we might have
values good, average and bad.
• We could treat the difference between good and average or
between average and bad as 0.5 and the difference between good
and bad as 1.
• This still does not seem completely right, but may be the best we can do in practice.
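One way to code this (a sketch; the good < average < bad ordering and the 0/0.5/1 spacing are the assumptions stated above):

# Assumed ordering: good < average < bad, spaced evenly over [0, 1]
ORDINAL_POSITION = {'good': 0.0, 'average': 0.5, 'bad': 1.0}

def ordinal_difference(a, b):
    """Difference between two ordinal values based on their assumed positions."""
    return abs(ORDINAL_POSITION[a] - ORDINAL_POSITION[b])

print(ordinal_difference('good', 'average'))   # 0.5
print(ordinal_difference('good', 'bad'))       # 1.0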
Exercise-1
Compute the distance of each training instance from the unseen instance (Age = 48, Loan = 142000) using the raw, unnormalised values, and estimate its class.

Age   Loan     Default   Distance
25    40000    N
35    60000    N
45    80000    N
20    20000    N
35    120000   N
52    18000    N
23    95000    Y
40    62000    Y
60    100000   Y
48    220000   Y
33    150000   Y
48    142000   ??
Exercise-1
Age   Loan     Default   Distance
25    40000    N         102000
35    60000    N         82000
45    80000    N         62000
20    20000    N         122000
35    120000   N         22000
52    18000    N         124000
23    95000    Y         47000
40    62000    Y         80000
60    100000   Y         42000
48    220000   Y         78000
33    150000   Y         8000
48    142000   ??
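A sketch that reproduces the Exercise-1 distances. Because the Loan values are so much larger than the Age values, the Euclidean distance is effectively |Loan − 142000|, which is exactly the swamping problem described earlier:

from math import sqrt

training = [
    (25, 40000, 'N'), (35, 60000, 'N'), (45, 80000, 'N'), (20, 20000, 'N'),
    (35, 120000, 'N'), (52, 18000, 'N'), (23, 95000, 'Y'), (40, 62000, 'Y'),
    (60, 100000, 'Y'), (48, 220000, 'Y'), (33, 150000, 'Y'),
]
unseen_age, unseen_loan = 48, 142000

for age, loan, label in training:
    d = sqrt((age - unseen_age) ** 2 + (loan - unseen_loan) ** 2)
    # First row: sqrt(23**2 + 102000**2) ≈ 102000 -- Loan completely swamps Age.
    print(label, round(d))

On these unnormalised values the single nearest neighbour is the Y instance at distance 8000, so a 1-NN classifier would predict Y.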
Exercise-2
Repeat the calculation using the same data with Age and Loan normalised to the range 0 to 1.

Age     Loan   Default   Distance
0.125   0.11   N
0.375   0.21   N
0.625   0.31   N
0       0.01   N
0.375   0.5    N
0.8     0      N
0.075   0.38   Y
0.5     0.22   Y
1       0.41   Y
0.7     1      Y
0.325   0.65   Y
0.7     0.61   ??
Exercise-2
Age     Loan   Default   Distance
0.125   0.11   N         0.762
0.375   0.21   N         0.5154
0.625   0.31   N         0.3092
0       0.01   N         0.922
0.375   0.5    N         0.3431
0.8     0      N         0.6181
0.075   0.38   Y         0.666
0.5     0.22   Y         0.4383
1       0.41   Y         0.3606
0.7     1      Y         0.39
0.325   0.65   Y         0.3771
0.7     0.61   ??
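A sketch that verifies the Exercise-2 distances on the normalised data and applies a k-NN vote (the choice of k = 3 is ours, for illustration):

from collections import Counter
from math import sqrt

training = [
    (0.125, 0.11, 'N'), (0.375, 0.21, 'N'), (0.625, 0.31, 'N'), (0.0, 0.01, 'N'),
    (0.375, 0.5, 'N'), (0.8, 0.0, 'N'), (0.075, 0.38, 'Y'), (0.5, 0.22, 'Y'),
    (1.0, 0.41, 'Y'), (0.7, 1.0, 'Y'), (0.325, 0.65, 'Y'),
]
unseen = (0.7, 0.61)

distances = sorted(
    (sqrt((a - unseen[0]) ** 2 + (l - unseen[1]) ** 2), label)
    for a, l, label in training
)
print(distances[:3])                                                 # three nearest
print(Counter(label for _, label in distances[:3]).most_common(1))   # 3-NN vote

The three nearest neighbours are at 0.3092 (N), 0.3431 (N) and 0.3606 (Y), so a 3-NN classifier predicts N on the normalised data; note that a 5-NN vote (N, N, Y, Y, Y) would instead predict Y.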
