ML DSBA Lab4
1 Description
The goal of this lab is to study the k-Nearest Neighbors (k-NN) classification algorithm. We first discuss the
basic characteristics of the k-NN classifier, and then examine how it can be applied to the handwritten
digit classification problem.
The training examples used by the k-NN algorithm are vectors in a multidimensional feature space,
each one associated with a class label. Let X = {(x1, y1), (x2, y2), . . . , (xm, ym)} be the m × n training
dataset, where xi = (xi1, xi2, . . . , xin), i = 1, . . . , m, is a feature vector and yi its class label. The training
phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.
In the classification phase, k is a user-defined constant, and an unlabeled instance x = (x1, x2, . . . , xn)
is classified by assigning the label that is most frequent among the k training samples nearest to that
query point. In order to find the nearest neighbors of the new instance, a similarity (or distance) measure
between the instance and the training examples should be defined. Typically, the choice of a similarity
measure depends on the type of the features in the data. In the case of real-valued features (i.e.,
xif ∈ R, f = 1, . . . , n), the Euclidean distance is the most commonly used measure:
d(x_i, x_j) = √( Σ_{f=1}^{n} (x_{if} − x_{jf})² ).
In the case of discrete variables, such as in text classification, the Hamming distance1 can be used.
Other measures of the similarity between instances include correlation coefficients (e.g., the Pearson
correlation coefficient).
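As an illustration (not part of the lab script; the function names are ours), the two distance measures above can be computed directly with NumPy:

```python
import numpy as np

def euclidean_distance(xi, xj):
    # d(xi, xj) = sqrt( sum over features f of (xi_f - xj_f)^2 )
    return np.sqrt(np.sum((xi - xj) ** 2))

def hamming_distance(xi, xj):
    # Number of feature positions at which the two vectors differ
    return np.sum(xi != xj)

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 4.0, 3.0])
print(euclidean_distance(a, b))  # 2.0
print(hamming_distance(np.array([0, 1, 1]), np.array([1, 1, 0])))  # 2
```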
Algorithm 1 provides the pseudocode of the k-NN classifier.
Note that, when computing the Euclidean distance between instance vectors, the features should be
on the same scale. Although this is part of the preprocessing task, we stress that if the data is not
normalized, the performance of the k-NN classifier can be heavily affected. One way to normalize the
values of the features is min-max normalization, where the value v of a numeric attribute
x is transformed to v′ in the range [0, 1] by computing v′ = (v − min(x))/(max(x) − min(x)), where
min(x) and max(x) are the minimum and maximum values of attribute x. Another way to normalize the
data is by computing the z-score zv = (v − µx)/σx, where µx is the mean value of attribute x and σx its
standard deviation.
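The two normalization schemes above can be sketched as follows (an illustration with our own function names, not part of the lab script):

```python
import numpy as np

def min_max_normalize(x):
    # Maps each value v to (v - min(x)) / (max(x) - min(x)), i.e., into [0, 1]
    return (x - x.min()) / (x.max() - x.min())

def z_score_normalize(x):
    # Maps each value v to (v - mean(x)) / std(x); result has mean 0 and std 1
    return (x - x.mean()) / x.std()

x = np.array([0.0, 5.0, 10.0])
print(min_max_normalize(x))  # [0.  0.5 1. ]
print(z_score_normalize(x))
```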
Although the k-NN algorithm is very simple, it typically performs well in practice and is easy to imple-
ment. However, it has been observed that when the class distribution is skewed, the majority
voting rule does not perform well: instances of a more frequent class tend to dominate the
prediction of the new instance, because they tend to be common among the k nearest neighbors due
to their large number. One way to overcome this problem is to weight the classification, taking into
account the distance from the test instance to each of its k nearest neighbors. The class of each of the
1 Wikipedia’s lemma for Hamming distance: http://en.wikipedia.org/wiki/Hamming_distance.
k nearest neighbors is multiplied by a weight proportional to the inverse of the distance from that
instance to the test instance. The algorithm is also sensitive to noisy features and may perform badly
in high dimensions (curse of dimensionality). In these cases, the performance of the algorithm can be
improved by applying feature selection or dimensionality reduction techniques. Additionally, the running
time of the k-NN algorithm is high: for each test instance, we have to search through all the training data
to find the nearest neighbors. This can be improved using appropriate data structures that support
fast nearest neighbor search and make k-NN computationally tractable even for large datasets (these
generally seek to reduce the number of distance evaluations actually performed).
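The distance-weighted vote described above can be sketched as follows (a minimal illustration with our own function name, not the lab's kNN() function): each neighbor's vote counts with weight inversely proportional to its distance.

```python
import numpy as np

def weighted_vote(distances, labels):
    # distances, labels: arrays for the k nearest neighbors of one test instance
    eps = 1e-10                       # avoid division by zero for exact matches
    weights = 1.0 / (distances + eps)
    classes = np.unique(labels)
    # Sum the weights per class and return the class with the largest total
    scores = [weights[labels == c].sum() for c in classes]
    return classes[np.argmax(scores)]

# Three distant neighbors of class 0 lose to one very close neighbor of class 1
d = np.array([5.0, 5.0, 5.0, 0.5])
y = np.array([0, 0, 0, 1])
print(weighted_vote(d, y))  # 1
```

With a plain majority vote, class 0 would win 3-to-1 here; weighting by inverse distance lets the much closer neighbor dominate.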
Choice of parameter k
The value of the parameter k often depends on the properties of the dataset. Generally, larger values of k
reduce the effect of noise on the classification, but make the boundaries between classes less distinct. On
the other hand, small values of k create many small regions for each class and may lead to overfitting. In
practice, we can apply cross-validation in order to choose an appropriate value of k2. A rule of thumb
in machine learning is to pick k near the square root of the size of the training set.
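As a sketch of how cross-validation can select k, the following uses scikit-learn (which is otherwise not assumed by this lab) on its small built-in 8 × 8 digits dataset:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)  # 1797 images of 8x8 pixels
scores = {}
for k in [1, 3, 5, 7, 9]:
    clf = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```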
All digit images have been size-normalized and centered in a fixed-size image of 28 × 28 pixels. Each
pixel of the image is represented by a value in the range [0, 255], where 0 corresponds to black, 255
to white, and anything in between is a shade of grey. In our case, the pixels are the features of
our dataset; therefore, each image (instance) has 784 features. Thus, the training set has dimensions
60, 000 × 784 and the test set 10, 000 × 784. Regarding the class labels, each image (digit) belongs to the
category that this digit represents (e.g., digit 2 belongs to category 2). Due to time constraints, in the
2 A technique based on cross-validation for the selection of k is described here: https://www.quora.com/How-can-I-choose-the-best-K-in-KNN-K-nearest-neighbour-classification.
3 The MNIST database: http://yann.lecun.com/exdb/mnist/.
experiments that will be performed in the lab, we will use subsets of the above training and test sets.
The code that imports the MNIST dataset has been implemented in the loadMnist.py Python script.
Since the dataset is relatively large, we keep only a subset of the training and test data, due to the time
constraints of the lab and the fact that the k-NN algorithm is computationally expensive.
# Keep a subset of the training (60,000 images) and test (10,000 images) data
trainingImages = trainingImages[:2000, :]
trainingLabels = trainingLabels[:2000]
# Test on a subset of the dataset (e.g., 20 images) to keep the running time relatively low
testImages = testImages[:20, :]
testLabels = testLabels[:20]
The next commands are for illustration purposes; they depict the first ten digits (images) of the test
data.
# Show the first ten digits
fig = plt.figure('First 10 Digits')
for i in range(10):
    a = fig.add_subplot(2, 5, i + 1)
    plt.imshow(testImages[i, :].reshape(28, 28), cmap=cm.gray)
    plt.axis('off')
plt.show()
The next part of the code performs the classification of the test dataset using the k-NN algorithm. The
kNN() function implements the k-Nearest Neighbors algorithm, and the body of the function should
be filled in during the lab. It takes as input the parameter k (i.e., the number of neighbors), the training
data and their class labels, as well as the test data. In this case, we use the k = 5 nearest neighbors. As
we have already discussed, the k-NN classifier is not based on a model built from the training data; the
prediction of the class labels of new instances occurs during the classification phase, based on the
training set.
# Run kNN algorithm
k = 5
predictedDigits = zeros(testImages.shape[0])
for i in range(testImages.shape[0]):
    print("Current Test Instance: " + str(i + 1))
    predictedDigits[i] = kNN(k, trainingImages, trainingLabels, testImages[i, :])
Finally, we compute the accuracy of the k-NN classifier. In particular, we compare the predicted labels
of the test data with the true class labels contained in the testLabels variable.
# Calculate accuracy
successes = 0
for i in range(testImages.shape[0]):
    if predictedDigits[i] == testLabels[i]:
        successes += 1
accuracy = successes / float(testImages.shape[0])
print("Accuracy: " + str(accuracy))
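Equivalently, the accuracy can be computed in a single vectorized NumPy expression (an aside, not part of the lab script; the toy arrays below are for illustration only):

```python
import numpy as np

predictedDigits = np.array([1, 2, 3, 4])
testLabels = np.array([1, 2, 0, 4])
# Fraction of test instances whose predicted label matches the true label
accuracy = np.mean(predictedDigits == testLabels)
print(accuracy)  # 0.75
```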
The skeleton of the kNN() function, to be completed during the lab (the parameter names mirror the call above):
def kNN(k, trainingImages, trainingLabels, testInstance):
    # Add your code here
    return label
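One possible way to fill in the body is sketched below (a minimal NumPy solution using a plain majority vote, to be treated as one option rather than the official answer):

```python
import numpy as np

def kNN(k, trainingImages, trainingLabels, testInstance):
    # Euclidean distance from the test instance to every training instance
    distances = np.sqrt(np.sum((trainingImages - testInstance) ** 2, axis=1))
    # Indices of the k training instances closest to the test instance
    nearest = np.argsort(distances)[:k]
    # Majority vote among the class labels of the k nearest neighbors
    values, counts = np.unique(trainingLabels[nearest], return_counts=True)
    label = values[np.argmax(counts)]
    return label

# Toy example: two clusters of two points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
y = np.array([0, 0, 1, 1])
print(kNN(3, X, y, np.array([9.0, 10.0])))  # 1
```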
• Change the variable k and compute the accuracy of the algorithm. What do you observe?
• Consider the size of the training set (recall that we have 60, 000 training instances) and examine
the performance of the classifier for different cases. What do you observe? Is there any trade-off
between the accuracy and the running time?