Lecture - 46
K-Nearest Neighbors (kNN)
I just want to make sure that we get the terminology right. We will
see later that in k nearest neighbors there is one parameter that we use
for classifying, which is the number of neighbors that I am going to
look at. So, I do not want you to wonder why I am calling this
nonparametric when we are anyway using a parameter in k nearest
neighbors. The distinction here is subtle, but I want you to remember
this. The number of neighbors that we use in the k nearest neighbor
algorithm, as you will see later, is actually a tuning parameter for the
algorithm; it is not a parameter that is derived from the data.
So, what we mean by this is the following. If I am given training data,
for example for logistic regression, I have to do some work to get these
parameters before I can do any classification for a test data point. So,
without these parameters I can never classify test data points. However,
in k nearest neighbors, just give me the data and a test data point and I
will classify it.
So, we will see how that is done, but no work needs to be done
before I am able to classify a test data point. That is another
important difference between k nearest neighbors and logistic regression,
for example. kNN is also called instance-based learning, where the
function is approximated locally. We will come back to this notion
of local as I describe this algorithm.
So, there are ways to address this, but when we say the amount of data
is large, all that we are saying is that since there is no explicit
training phase, there is no optimization over a large number of data
points to identify parameters that are used later in classification. In
other words, in other algorithms you do all the effort a priori, and once
you have the parameters, classification on a test data point becomes
easier. However, since kNN is a lazy algorithm, all the calculations are
deferred until you actually have to do something, and at that point there
might be a lot more calculation if the data is large.
So, what we are basically saying is, if there is a particular data point
and I want to find out which class this data point belongs to, all I need
to do is look at the neighboring data points, find which classes they
belong to, and take a majority vote; that is the class that is assigned
to this data point. It is something like saying that if you want to know
a person, look at his neighbors; that is the idea behind using k nearest
neighbors.
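To make the majority-vote idea concrete, here is a minimal sketch in Python; the neighbor labels are made up purely for illustration:

```python
from collections import Counter

# Hypothetical class labels of the k nearest neighbors of a query point
neighbor_labels = ["class 1", "class 2", "class 1"]

# Majority vote: the most common label among the neighbors is the prediction
predicted_label, votes = Counter(neighbor_labels).most_common(1)[0]
print(predicted_label, votes)  # "class 1" with 2 votes
```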
We really need a distance metric for this algorithm to work, and this
distance metric would basically say what the proximity is between any
two data points. The distance metric could be Euclidean distance,
Mahalanobis distance, Hamming distance and so on. So, there are
several distance metrics that you could use with k nearest neighbors.
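These metrics are easy to write down directly; below is a small sketch of Euclidean, Hamming and Mahalanobis distances using NumPy (for the Mahalanobis distance, the inverse covariance matrix VI is assumed to have been estimated from the data):

```python
import numpy as np

def euclidean(u, v):
    """Straight-line distance between two real-valued feature vectors."""
    return float(np.sqrt(np.sum((np.asarray(u) - np.asarray(v)) ** 2)))

def hamming(u, v):
    """Number of positions at which two equal-length vectors differ (useful for categorical features)."""
    return int(np.sum(np.asarray(u) != np.asarray(v)))

def mahalanobis(u, v, VI):
    """Mahalanobis distance; VI is the inverse covariance matrix of the data."""
    d = np.asarray(u) - np.asarray(v)
    return float(np.sqrt(d @ VI @ d))
```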
So, there might be many classes; multi-class problems are also very
easy to solve using the kNN algorithm, but let us anyway stick to the binary
problem. Then what you are going to do is, let us say I have a new test
point which I call Xnew, and I want to find out how I classify it. The
very first step, which is what we talk about here, is that we find the
distance between this new test point and each of the labelled data
points in the data set. So for example, there is a distance d1 between
Xnew and X1, a distance d2 between Xnew and X2, and so on up to dn. Once
you calculate these distances, you have n distances, and this is the
reason why we said in the last slide that you need a distance metric
for kNN to work.
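As a rough sketch of this first step, one could compute all n distances from Xnew and sort them; the Euclidean distance and the small data set here are just placeholders:

```python
import numpy as np

# Labelled data set: n points with two features each (placeholder values)
X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
x_new = np.array([1.5, 1.5])

# Step 1: Euclidean distance from x_new to every labelled point
distances = np.sqrt(np.sum((X - x_new) ** 2, axis=1))

# Sort so that the closest points come first
order = np.argsort(distances)
print(distances[order])   # smallest distance first
print(order)              # indices of the labelled points, nearest first
```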
So, if a distance is zero then the point is Xnew itself. The smallest
distance corresponds to the point closest to Xnew, and as you go down
the sorted list the points are further and further away. Now the next
step is very simple. If, let us say, you are looking at k nearest
neighbors with k = 3, then what you are going to do is find the first
three distances in this sorted list; say this distance is from Xn, this
distance is from X5 and this distance is from X3.
So, the rule says to find the class label that the majority of these k
labelled data points have and assign it to the test data point, very
simple. Now, I also said this algorithm, with minor modifications, can
be used for function approximation. For example, if you so choose, let
us say you want to predict what the output will be for a new point; you
could find the output values for these three points and take an average,
very trivial, and then say that is the output corresponding to this
point. That becomes an adaptation of this for function approximation
problems and so on. Nonetheless, for classification this is the basic
idea. Now if you said k = 5, then what you do is go down to 5 distances
and then do the majority vote, so that is all we do. So, let us look at
this very simple idea here.
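Putting the steps together, here is a minimal self-contained sketch of the procedure just described; the toy data, the choice of Euclidean distance, and k = 3 are assumptions made only for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, x_new, k=3, task="classify"):
    """Predict for x_new using the k nearest labelled points in (X, y)."""
    # Step 1: distance from x_new to every labelled point (Euclidean here)
    distances = np.sqrt(np.sum((X - x_new) ** 2, axis=1))
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    if task == "classify":
        # Step 3: majority vote among the k neighbors' class labels
        return Counter(y[nearest]).most_common(1)[0][0]
    # Function-approximation variant: average the k neighbors' outputs
    return float(np.mean(y[nearest]))

# Toy labelled data (placeholder): class 1 near the origin, class 2 far away
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array([1, 1, 1, 2, 2])
print(knn_predict(X, y, np.array([0.5, 0.5]), k=3))  # majority vote -> class 1
```

With k = 5 you would simply take the first five indices instead of three before voting.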
(Refer Slide Time: 15:05)
Let us say this is actually the training data itself, and I want to
look at k = 3 and see what the labels will be for the training data
itself. The points are actually labelled, so this is supervised: the
blue points all belong to class 1 and the red points all belong to
class 2. Let us say, for example, I want to figure out this blue point
here. Though I know the label is blue, what class would the k nearest
neighbor algorithm say this point belongs to? If I take k = 3, then
basically I have to find the three nearest points, which are these
three, so this is what is represented.
And since the majority is blue, this will be blue. So, basically if you
think about it, this point would be classified correctly, and so on. Now
even in the training set, for example, if you take this red point, I know
the label is red; however, if I were to run k nearest neighbors with three
data points, when you find the three closest points, they all belong to
blue. So, this would be misclassified as blue even in the training data
set. You will notice one general principle: there is a possibility of
data points getting misclassified only in this kind of region where
there is a mix of both of these data points.
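This kind of self-check on the training data can be written down directly; a possible sketch (leaving each training point out of its own neighbor list, with made-up data) is:

```python
import numpy as np
from collections import Counter

def knn_label(X, y, i, k=3):
    """Label training point i by voting among its k nearest other training points."""
    distances = np.sqrt(np.sum((X - X[i]) ** 2, axis=1))
    distances[i] = np.inf                      # exclude the point itself
    nearest = np.argsort(distances)[:k]
    return Counter(y[nearest]).most_common(1)[0][0]

# Placeholder training data: two groups that overlap slightly
X = np.array([[0.0, 0.0], [0.5, 0.2], [0.4, 0.6], [1.0, 1.0], [1.1, 0.9], [0.6, 0.5]])
y = np.array([1, 1, 1, 2, 2, 2])

predictions = np.array([knn_label(X, y, i) for i in range(len(X))])
print(predictions)                 # compare against y to spot training-set misclassifications
print((predictions == y).mean())   # fraction of training points kNN labels correctly
```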
Now, you do not have to do anything until you get a data point, and
you could verify how well the algorithm will do on the training set
itself. However, suppose you give me a new test data point here, which
is what is shown by this data point. If you want to do a classification,
there is no label for it. Remember the other red and blue data points
already have a label from prior knowledge; this one does not have a label.
So, I want to find out a label for it. If I were to use k = 3, then
for this data point I will find the three closest neighbors, which happen
to be these three data points. Then I will notice that two out of these
are red, so this point will get a label of class 2. If, on the other hand,
the test data point is here and you are using k = 5, then you look at the
five closest neighbors to this point, you see that two of them are class 2
and three are class 1, and so by majority voting this will be put into
class 1. So, you will get a label of class 1 for this data point. This is
the basic idea of k nearest neighbors, very simple.
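The same sensitivity to the choice of k can be seen on a small made-up data set; using the knn_predict sketch from earlier (repeated here so the snippet runs on its own), the predicted class changes as k grows:

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, x_new, k):
    distances = np.sqrt(np.sum((X - x_new) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    return Counter(y[nearest]).most_common(1)[0][0]

# Placeholder data: the two closest points to the test point are class 2,
# but the next three closest points are all class 1
X = np.array([[1.0, 0.0], [1.2, 0.0], [1.5, 0.0], [1.6, 0.0], [1.7, 0.0], [5.0, 0.0]])
y = np.array([2, 2, 1, 1, 1, 2])
x_new = np.array([0.0, 0.0])

print(knn_predict(X, y, x_new, k=3))  # two of the three nearest are class 2 -> 2
print(knn_predict(X, y, x_new, k=5))  # three of the five nearest are class 1 -> 1
```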
(Refer Slide Time: 19:17)
So, it is always a good idea to scale your data in some form before
computing these distances. Otherwise, while one variable might be
important from a classification viewpoint, it will never show up,
because the variables with bigger numbers will simply dominate the ones
with small numbers in the distance. So, feature selection and scaling
are things to keep in mind.
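One common way to do this is to standardize each feature to zero mean and unit variance before computing distances; a minimal sketch, assuming the column statistics are taken from the training data only:

```python
import numpy as np

# Placeholder training data: the second feature has a much larger scale than the first
X_train = np.array([[0.1, 1000.0], [0.2, 2000.0], [0.3, 1500.0]])
x_test = np.array([0.15, 1200.0])

# Standardize each feature using training-set statistics
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_scaled = (X_train - mean) / std
x_test_scaled = (x_test - mean) / std   # apply the same transformation to the test point

# Distances are now computed on comparable scales
print(np.sqrt(np.sum((X_train_scaled - x_test_scaled) ** 2, axis=1)))
```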
And the last thing is the curse of dimensionality. I told you that this
is a very nice algorithm to apply because there is not much computation
that is done at the beginning. However, if you notice, when I get a test
data point and I have to find, let us say, the 5 closest neighbors, it
looks like there is no way in which I can do this unless I calculate all
the distances.
So, that can become a serious problem if the number of data points
in my database is very large. Let us say I have 10,000 data points, and
let us assume that I have a k nearest neighbor algorithm with k = 5.
What I am really looking for is the 5 closest data points in this
database to this data point. However, it looks like I have to calculate
all 10,000 distances, sort them, and then pick the top 5. In other
words, to get this top 5 I have to do so much work. There are smarter
ways of doing it, but nonetheless, keeping in mind the number of data
points and the number of features, one has to think carefully about how
to apply this algorithm.
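For instance, the full sort can be avoided: np.argpartition picks out the indices of the 5 smallest distances without ordering the rest, and tree-based structures such as KD-trees can avoid computing every distance in low dimensions. A rough sketch of the brute-force version, with random placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))   # placeholder database of 10,000 points with 4 features
x_new = rng.normal(size=4)

# Brute force: all 10,000 distances, then pick the 5 smallest without a full sort
distances = np.sqrt(np.sum((X - x_new) ** 2, axis=1))
top5 = np.argpartition(distances, 5)[:5]
top5 = top5[np.argsort(distances[top5])]   # order just those 5 by distance
print(top5, distances[top5])
```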
So, if let us say there are two classes like this, then for this data
point, if you take a large number of neighbors, you might pick many
neighbors from the other class also, and that can make the boundaries
less crisp and more diffuse. On the flip side, if you use smaller values
of k, then your algorithm is likely to be affected by noise and
outliers; however, your decision boundaries, as a rule of thumb, are
likely to become crisper. So, these are some things to keep in mind.
Thanks.