The Nearest Neighbour Algorithm
We now introduce a concrete learning algorithm for classification. This algorithm differs from ERM
because it does not minimize the training error within a given class of predictors. For now, we restrict our
attention to binary classification tasks with numerical features, namely X = R^d and Y = {−1, 1}.
Given a training set, the classifier generated by this algorithm is based on the following simple rule:
predict every point in the training set with its own label, and predict any other point with the label
of the point in the training set which is closest to it.
More formally, given a training set S ≡ (x1 , y1 ), . . . , (xm , ym ), the nearest neighbour algorithm
(NN) generates a classifier hNN : R^d → {−1, 1} defined by

hNN(x) = y_{t(x)},   where t(x) ∈ argmin_{t=1,...,m} ∥x − x_t∥

is the index of a training point closest to x.
If there is more than one point in S with smallest distance to x, then the algorithm predicts with
the majority of the labels of these closest points. If there is an equal number of closest points with
positive and negative labels, then the algorithm predicts a default value in {−1, 1} (for instance,
the most frequent label in the training set).
Note that hNN (xt ) = yt for every training example (xt , yt ). The distance between x = (x1 , . . . , xd )
and xt = (xt,1 , . . . , xt,d ), denoted by ∥x − xt ∥, is computed using the Euclidean distance,
∥x − x_t∥ = \sqrt{ \sum_{i=1}^{d} (x_i − x_{t,i})^2 }.
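As a concrete illustration, the rule above can be written in a few lines of NumPy. The following is a minimal sketch (function and argument names such as nn_predict and default_label are our own, not part of the notes); it computes the Euclidean distances and applies the tie-breaking convention described above.

import numpy as np

def nn_predict(x, X_train, y_train, default_label=1):
    """1-NN prediction: the label of the training point closest to x.

    X_train has shape (m, d); y_train has shape (m,) with labels in {-1, +1}.
    """
    # Euclidean distances ||x - x_t|| to every training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # indices of all points attaining the minimum distance (usually just one)
    closest = np.flatnonzero(np.isclose(dists, dists.min()))
    vote = y_train[closest].sum()   # majority vote among the closest points
    if vote > 0:
        return 1
    if vote < 0:
        return -1
    return default_label            # exact tie: fall back to a default label

For example, with X_train = np.array([[0., 0.], [2., 2.], [2., 0.]]) and y_train = np.array([-1, 1, 1]), the call nn_predict(np.array([0.4, 0.1]), X_train, y_train) returns -1, the label of the closest training point (0, 0).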
The classifier generated by NN induces a partition of R^d into Voronoi cells, where each training
instance xt (here called a “center”) is contained in its own cell, and the border between two cells is the
set of points in R^d that are equidistant from the two cell centers (see Figure 1).
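One way to see this partition is to evaluate hNN on a fine grid of points and colour each grid point with the predicted label; the coloured regions are exactly the Voronoi cells of the centers, merged whenever neighbouring centers share a label. The snippet below is only a sketch (the random data, matplotlib, and all names are our additions), and it ignores distance ties for brevity.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(20, 2))                    # 20 random centers in the plane
y_train = np.where(X_train[:, 0] + X_train[:, 1] > 0, 1, -1)  # arbitrary +/- labelling

# evaluate h_NN on a grid: each grid point gets the label of its nearest center
xs, ys = np.meshgrid(np.linspace(-1, 1, 200), np.linspace(-1, 1, 200))
grid = np.c_[xs.ravel(), ys.ravel()]
preds = np.array([y_train[((X_train - p) ** 2).sum(axis=1).argmin()] for p in grid])

plt.contourf(xs, ys, preds.reshape(xs.shape), levels=[-2, 0, 2], alpha=0.3)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)          # the cell centers
plt.title("Partition of the plane induced by the NN classifier")
plt.show()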
As NN typically stores the entire training set, the algorithm does not scale well with the number
|S| = m of training points. Moreover, given any test point x, computing hNN (x) is costly, as it
requires computing the distance between x and every point of the training set, which in R^d takes
time Θ(dm) (shorter running times are possible when distances are approximated rather than
computed exactly). Finally, note that NN always generates a classifier hNN such that ℓS (hNN ) = 0.
This is not surprising because, as we already said, NN stores the entire training set and predicts each training point with its own label.
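To make the Θ(dm) cost concrete, here is a sketch (again with our own names) that classifies a whole batch of n test points by brute force; the cost per test point is exactly one pass over the (m, d) training matrix.

import numpy as np

def nn_predict_batch(X_test, X_train, y_train):
    """Brute-force 1-NN for a batch of n test points (ties ignored)."""
    # the intermediate difference tensor has shape (n, m, d), so this costs
    # Theta(n*m*d) time, i.e. Theta(d*m) per test point, on top of storing the training set
    diffs = X_test[:, None, :] - X_train[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)       # squared distances, shape (n, m)
    nearest = sq_dists.argmin(axis=1)         # index of the closest training point
    return y_train[nearest]                   # its label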
Figure 2: Plot of the hk−NN classifier for k = 1, 3, 5 on a 1-dimensional training set. As k increases,
the classifier becomes simpler and the number of mistaken points in the training set increases.
Starting from NN, we can obtain a family of algorithms denoted by k-NN for k = 1, 3, 5, . . ., where
k cannot be taken larger than
the size of the training set. These algorithms are defined as follows:
given a training set S = (x1 , y1 ), . . . , (xm , ym ) , k-NN generates a classifier hk−NN such that
hk−NN (x) is the label yt ∈ {−1, 1} appearing in the majority of the k points xt ∈ S which are
closest to x.¹ Hence, in order to compute hk−NN (x), we perform the following operations (a code sketch is given after the two steps):
1. Find the k training points xt1 , . . . , xtk closest to x. Let yt1 , . . . , ytk be their labels.
2. If the majority of the labels yt1 , . . . , ytk is +1, then hk−NN (x) = +1; if the majority is −1,
then hk−NN (x) = −1.
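The two steps translate directly into code. As before, this is only a sketch with our own naming (knn_predict), and it does not implement the refined tie-breaking rule described in the footnote.

import numpy as np

def knn_predict(x, X_train, y_train, k=3, default_label=1):
    """k-NN prediction by majority vote among the k closest training points."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # step 1: the k training points closest to x (argsort breaks distance ties arbitrarily)
    k_closest = np.argsort(dists)[:k]
    # step 2: majority vote among their labels
    vote = y_train[k_closest].sum()
    if vote > 0:
        return 1
    if vote < 0:
        return -1
    return default_label   # never reached when k is odd and labels are in {-1, +1}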
Note that, for each k ≥ 1 and for each xt in the training set, xt is always included in the k points
that are closest to xt .
It is important to note that, unlike 1-NN, for k > 1 we have in general that ℓS (hk−NN ) > 0. Moreover, in
Figure 2 we see that, as k grows, the classifiers generated by k-NN become simpler. In particular,
when k is equal to the size of the training set, hk−NN becomes a constant classifier that always
predicts the most common label in the training set.
¹ Just like in the case of 1-NN, there could be training points at the same distance from x, so that more than k
points are closest to x. In this case we proceed by ranking the training points based on their distance from x and
then taking the k′ closest points, where k′ is the smallest integer greater than or equal to k such that the (k′ + 1)-th point in
the ranking has distance from x strictly larger than the k′-th point. If no such k′ exists, then we take all the points
in the training set. If k′ is strictly bigger than k and even, and there is an equal number of closest points with positive
and negative labels, then the algorithm predicts a default value in {−1, 1}.
The figure above shows the typical trend of the training error (orange curve) and the test error (blue curve)
of the k-NN classifier for increasing values of the parameter k on a real dataset (Breast Cancer
Wisconsin) for binary classification with the zero-one loss. Note that the minimum of the test error
is attained at a value of k whose corresponding hk−NN classifier has training error generally bigger than
zero. The learning algorithm suffers from high test error for small values of k (overfitting) and for
large values of k (underfitting).
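A curve of this kind can be reproduced, at least qualitatively, with a few lines of scikit-learn; the library, the train/test split, and the range of k below are our own choices, not part of the notes (the dataset ships with labels in {0, 1} rather than {−1, +1}, which does not affect the error curves).

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in range(1, 100, 2):                       # odd values of k
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_err = 1 - knn.score(X_tr, y_tr)        # zero-one loss on the training set
    test_err = 1 - knn.score(X_te, y_te)         # zero-one loss on the test set
    print(f"k={k:2d}  train error={train_err:.3f}  test error={test_err:.3f}")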
In addition to binary classification, k-NN can be used to solve multiclass classification problems
(where Y contains more than two symbols) and also regression problems (where Y = R). In the first
case, we proceed as in the binary case and predict with the most frequent label among the k closest
training points. In the second case, the prediction is the average of the labels of the k closest
training points.
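Only the aggregation step changes in these two variants, as the following sketch shows (the names are again our own; ties in distance and in the vote are ignored for brevity).

import numpy as np
from collections import Counter

def knn_multiclass(x, X_train, y_train, k=3):
    """k-NN multiclass prediction: majority vote over an arbitrary finite label set."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    k_closest = np.argsort(dists)[:k]
    # most_common(1) returns the most frequent label among the k closest points
    return Counter(y_train[k_closest].tolist()).most_common(1)[0][0]

def knn_regress(x, X_train, y_train, k=3):
    """k-NN regression: average of the real-valued labels of the k closest points."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    k_closest = np.argsort(dists)[:k]
    return y_train[k_closest].mean()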