
FOUNDATIONS OF MACHINE LEARNING
M.SC. IN DATA SCIENCES AND BUSINESS ANALYTICS
CENTRALESUPÉLEC

Lab 4: k-Nearest Neighbors Classifier

Instructor: Fragkiskos Malliaros
TA: Benjamin Maheu
November 4, 2021

1 Description
The goal of this lab is to study the k-Nearest Neighbors classification algorithm. Initially, we discuss the basic characteristics of the k-NN classifier, and then we examine how it can be applied to the handwritten digit classification problem.

2 k-Nearest Neighbors Classification Algorithm


The k-Nearest Neighbors algorithm (k-NN) is a simple, intuitive, and commonly used method for the classification task. Recall that, in classification, the goal is to identify the class of a new instance (observation) based on a training dataset in which the class label of each instance is known.
In the k-NN algorithm, the predicted class label of a new instance is based on the already known classes of the k most similar neighbors. That way, the class of a new instance is specified by the majority rule between the classes of the k nearest neighbors. The variable k is the parameter of the algorithm and it can take positive integer values. In the extreme case where k = 1, the object is simply assigned to the class of that single nearest neighbor. In the case where k > 1 is an even number and the majority rule does not hold (i.e., equal number of neighbors from each class), the label is selected randomly.

Note that neighbors-based classification is a type of instance-based learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.

Figure 1: Example of k-NN classification. The test sample (green circle) can be classified either to the first class of blue squares or to the second class of red triangles. If k = 3 (solid line circle) it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the first class. (Source: Wikipedia).

The training examples used by the k-NN algorithm are vectors in a multidimensional feature space, each one associated with a class label. Let X = {(x1, y1), (x2, y2), . . . , (xm, ym)}, where xi = (xi1, xi2, . . . , xin), i = 1, . . . , m, is a feature vector and yi its class label, be the m × n training dataset. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.
In the classification phase, k is a user-defined constant, and an unlabeled instance x = (x1 , x2 , . . . , xn )
is classified by assigning the label which is most frequent among the k training samples nearest to that
query point. In order to find the nearest neighbors of the new instance, a similarity (or distance) measure
between the instance and the training examples should be defined. Typically, the choice of a similar-
ity measure depends on the type of the features in the data. In the case of real-valued features (i.e.,
xi ∈ R, i = 1, . . . , n), the Euclidean distance is the most commonly used measure:
d(x_i, x_j) = \sqrt{\sum_{f=1}^{n} (x_{if} - x_{jf})^2}.

In the case of discrete variables, such as in text classification, the Hamming distance (see http://en.wikipedia.org/wiki/Hamming_distance) can be used. Other measures of the similarity between instances include the correlation coefficient (e.g., the Pearson correlation coefficient).
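As a minimal illustration (assuming each instance is stored as a one-dimensional NumPy array; the function names are only illustrative), the two distance measures can be computed as follows:

import numpy as np

def euclidean_distance(xi, xj):
    # Square root of the sum of squared feature-wise differences
    return np.sqrt(np.sum((xi - xj) ** 2))

def hamming_distance(xi, xj):
    # Number of positions at which the two vectors differ
    return np.sum(xi != xj)

# Small usage example with two 4-dimensional instances
a = np.array([1.0, 0.0, 2.0, 3.0])
b = np.array([1.0, 1.0, 0.0, 3.0])
print(euclidean_distance(a, b))   # sqrt(5) ~ 2.236
print(hamming_distance(a, b))     # 2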
Algorithm 1 provides the pseudocode of the k-NN classifier.

Algorithm 1 k-Nearest Neighbors Classification

Input: Training data X = {(x1, y1), (x2, y2), . . . , (xm, ym)}, where xi = (xi1, xi2, . . . , xin), i = 1, . . . , m
       New unlabeled instance x
       Parameter k
Output: Class label y of x

1: Compute the distance of the test instance x to each training instance
2: Sort the distances in ascending order
3: Use the sorted distances to select the k nearest neighbors of x
4: Assign x to a class based on the majority rule of the k nearest neighbors
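For concreteness, a minimal NumPy sketch of Algorithm 1 is given below. It assumes the training data is a NumPy matrix with one instance per row; the function name and the use of collections.Counter for the majority vote are illustrative choices, not the required implementation for the lab.

import numpy as np
from collections import Counter

def knn_predict(k, X_train, y_train, x_test):
    # Step 1: distances from the test instance to every training instance
    distances = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    # Steps 2-3: indices of the k smallest distances (ascending order)
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the labels of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]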

Note that, while computing the Euclidean distance between instance vectors, the features should be on the same scale. Although this is part of the preprocessing task, we stress that if the data is not normalized, the performance of the k-NN classifier can be heavily affected. One way to normalize the values of the features is by applying min-max normalization, where the value v of a numeric attribute x is transformed to v' in the range [0, 1] by computing v' = (v − min(x))/(max(x) − min(x)), where min(x) and max(x) are the minimum and maximum values of attribute x. Another way to normalize the data is by computing the z-score z_v = (v − µ_x)/σ_x, where µ_x is the mean value of attribute x and σ_x its standard deviation.
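A minimal sketch of both normalization schemes, assuming the features are stored column-wise in a NumPy array X (function names are illustrative):

import numpy as np

def min_max_normalize(X):
    # Rescale each feature (column) to the range [0, 1]
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_range = X.max(axis=0) - x_min
    x_range[x_range == 0] = 1.0   # constant features (e.g., always-black pixels) stay at 0
    return (X - x_min) / x_range

def z_score_normalize(X):
    # Center each feature at 0 with unit standard deviation
    X = np.asarray(X, dtype=float)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0       # avoid division by zero for constant features
    return (X - X.mean(axis=0)) / sigma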

Additional properties of k-NN

Although the k-NN algorithm is very simple, it typically performs well in practice and is easy to implement. However, it has been observed that when the class distribution is skewed, the majority voting rule does not perform well: instances of a more frequent class tend to dominate the prediction of the new instance, because they tend to be common among the k nearest neighbors due to their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test instance to each of its k nearest neighbors: the vote of each of the k nearest neighbors is multiplied by a weight proportional to the inverse of the distance from that instance to the test instance. The algorithm is also sensitive to noisy features and may perform badly in high dimensions (curse of dimensionality). In these cases, the performance of the algorithm can be improved by applying feature selection or dimensionality reduction techniques. Additionally, the running time of the k-NN algorithm is high; for each test instance, we have to search through all training data to find the nearest neighbors. This can be improved using appropriate data structures that support fast nearest neighbor search and make k-NN computationally tractable even for large datasets (these generally seek to reduce the number of distance evaluations actually performed).
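A minimal sketch of the distance-weighted variant described above, again assuming NumPy arrays (the small eps term is an illustrative guard against division by zero when a neighbor coincides with the test instance):

import numpy as np
from collections import defaultdict

def weighted_knn_predict(k, X_train, y_train, x_test, eps=1e-8):
    distances = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    # Each neighbor votes with weight 1/distance instead of weight 1
    scores = defaultdict(float)
    for i in nearest:
        scores[y_train[i]] += 1.0 / (distances[i] + eps)
    # Return the class with the largest total weighted vote
    return max(scores, key=scores.get)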

Choice of parameter k

The value of the parameter k often depends on the properties of the dataset. Generally, larger values of k reduce the effect of noise on the classification, but make the boundaries between classes less distinct. On the other hand, small values of k create many small regions for each class and may lead to overfitting. In practice, we can apply cross-validation in order to choose an appropriate value of k (see, e.g., https://www.quora.com/How-can-I-choose-the-best-K-in-KNN-K-nearest-neighbour-classification). A rule of thumb in machine learning is to pick k near the square root of the size of the training set.
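As an example of this model selection step, the sketch below uses scikit-learn's cross_val_score and KNeighborsClassifier (an assumption made for illustration; scikit-learn is not required by the lab) to estimate the validation accuracy of several candidate values of k and keep the best one:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X_train, y_train, candidate_ks=(1, 3, 5, 7, 9, 11)):
    # 5-fold cross-validated accuracy for each candidate value of k
    mean_scores = []
    for k in candidate_ks:
        clf = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
        mean_scores.append(scores.mean())
    # Keep the k with the highest mean validation accuracy
    return candidate_ks[int(np.argmax(mean_scores))]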

3 Handwritten Digit Recognition with k-NN


In this lab, we will implement and apply the k-NN classifier to recognize handwritten digits from the MNIST database (http://yann.lecun.com/exdb/mnist/).

3.1 Description of the Dataset


The MNIST dataset consists of handwritten digit images (0–9) and it is divided into 60,000 examples for the training set and 10,000 examples for testing. Figure 2 depicts the first 10 images of the dataset.

Figure 2: Example of 10 digits of the training set.

All digit images have been size-normalized and centered in a fixed-size image of 28 × 28 pixels. Each pixel of the image is represented by a value in the range [0, 255], where 0 corresponds to black, 255 to white, and anything in between is a different shade of grey. In our case, the pixels are the features of our dataset; therefore, each image (instance) has 784 features. That way, the training set has dimensions 60,000 × 784 and the test set 10,000 × 784. Regarding the class labels, each image (digit) belongs to the category that this digit represents (e.g., digit 2 belongs to category 2). Due to time constraints, in the experiments that will be performed in the lab, we will use subsets of the above training and test sets.
The code that imports the MNIST dataset has been implemented in the loadMnist.py Python script.

3.2 Pipeline of the Task


Here we describe the basic steps of the pipeline for the classification task, as given in the kNN/main.py
Python script.
Initially, the data is loaded; the variables trainingImages and trainingLabels contain the training instances and their class labels, respectively. In a similar way, the test data and their class labels are loaded. Note that the loadMnist() function has already been implemented in the loadMnist.py script. The data is stored in the Data directory. Recall that each instance is a digit with 28 × 28 = 784 features (pixels).
# Load training and test data
trainingImages, trainingLabels = loadMnist('training')
testImages, testLabels = loadMnist('testing')

Since the dataset is relatively large, we keep a subset of the training and test data. This is done due to the time constraints of the lab and the fact that the k-NN algorithm is computationally expensive.
# Keep a subset of the training (60,000 images) and test (10,000 images) data
trainingImages = trainingImages[:2000, :]
trainingLabels = trainingLabels[:2000]

# Test on a subset of the dataset (e.g., 20 images) to keep the running time relatively low
testImages = testImages[:20, :]
testLabels = testLabels[:20]

The next commands are for illustration purposes; they depict the first ten digits (images) of the test
data.
# Show the first ten digits
fig = plt.figure('First 10 Digits')
for i in range(10):
    a = fig.add_subplot(2, 5, i+1)
    plt.imshow(testImages[i, :].reshape(28, 28), cmap=cm.gray)
    plt.axis('off')

plt.show()

The next part of the code performs the classification of the test dataset using the k-NN algorithm. The kNN() function implements the k-Nearest Neighbors algorithm and its body should be filled in during the lab. It takes as input the parameter k (i.e., the number of neighbors), the training data and their class labels, as well as the test instance to be classified. In this case, we use the k = 5 nearest neighbors. As we have already discussed, the k-NN classifier is not based on a model that has been built upon the training data. The prediction of the class labels of new instances occurs during the classification phase based on the training set.
# Run kNN algorithm
k = 5
predictedDigits = zeros(testImages.shape[0])

for i in range(testImages.shape[0]):
    print "Current Test Instance: " + str(i+1)
    predictedDigits[i] = kNN(k, trainingImages, trainingLabels, testImages[i, :])

Finally, we compute the accuracy of the k-NN classifier. In particular, we compare the predicted labels of the test data with the true class labels contained in the testLabels variable.
# Calculate accuracy
successes = 0

for i in range(testImages.shape[0]):
    if predictedDigits[i] == testLabels[i]:
        successes += 1

accuracy = successes / float(testImages.shape[0])
print
print "Accuracy: " + str(accuracy)

3.3 Tasks to be Performed


• Fill in the file kNN/KNN.py that implements the k-NN algorithm. As a distance function, you can use the Euclidean distance.
def kNN(k, X, labels, y):
    # k: number of nearest neighbors
    # X: training data
    # labels: class labels of the training data
    # y: test instance to be classified

    # Add your code here

    return label

• Change the variable k and compute the accuracy of the algorithm. What do you observe?

• Consider the size of the training set (recall that we have 60,000 training instances) and examine the performance of the classifier for different cases; a possible way to organize this experiment is sketched below. What do you observe? Is there any trade-off between the accuracy and the running time?
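The following sketch is one possible way to time such an experiment. It assumes the variables trainingImages, trainingLabels, testImages, testLabels and your completed kNN() function from main.py are already in scope; the chosen training sizes are only examples.

import time

for size in [500, 1000, 2000, 5000]:
    X_sub = trainingImages[:size, :]
    y_sub = trainingLabels[:size]

    # Classify all test instances with k = 5 and measure the wall-clock time
    start = time.time()
    predictedDigits = zeros(testImages.shape[0])
    for i in range(testImages.shape[0]):
        predictedDigits[i] = kNN(5, X_sub, y_sub, testImages[i, :])
    elapsed = time.time() - start

    accuracy = (predictedDigits == testLabels).sum() / float(testImages.shape[0])
    print("Training size: " + str(size) + ", accuracy: " + str(accuracy) +
          ", time: " + str(round(elapsed, 1)) + "s")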

