ML DSBA Lab4
1 Description
The goal of this lab is to study the k-Nearest Neighbors (k-NN) classification algorithm. We first discuss the
basic characteristics of the k-NN classifier, and then examine how it can be applied to the handwritten
digit classification problem.
The training examples used by the k-NN algorithm are vectors in a multidimensional feature space,
each one associated with a class label. Let X = {(x1, y1), (x2, y2), . . . , (xm, ym)} be the m × n training
dataset, where xi = (xi1, xi2, . . . , xin), i = 1, . . . , m, is a feature vector and yi its class label. The training
phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.
In the classification phase, k is a user-defined constant, and an unlabeled instance x = (x1, x2, . . . , xn)
is classified by assigning the label that is most frequent among the k training samples nearest to that
query point. In order to find the nearest neighbors of the new instance, a similarity (or distance) measure
between the instance and the training examples should be defined. Typically, the choice of a similarity
measure depends on the type of the features in the data. In the case of real-valued features (i.e.,
xif ∈ R, f = 1, . . . , n), the Euclidean distance is the most commonly used measure:
d(x_i, x_j) = √( Σ_{f=1}^{n} (x_{if} − x_{jf})² ).
In the case of discrete variables, such as in text classification, the Hamming distance1 can be used.
Other measures of the similarity between instances include correlation coefficients (e.g., the Pearson
correlation coefficient).
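As an illustration (not part of the lab script; the function names are ours), the two distance measures above can be computed directly with NumPy:

```python
import numpy as np

def euclidean_distance(xi, xj):
    # d(xi, xj) = sqrt( sum over features f of (xi_f - xj_f)^2 )
    return np.sqrt(np.sum((xi - xj) ** 2))

def hamming_distance(xi, xj):
    # Number of feature positions at which the two vectors differ
    return np.sum(xi != xj)

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 4.0, 3.0])
print(euclidean_distance(a, b))  # 2.0
print(hamming_distance(np.array([0, 1, 1]), np.array([1, 1, 0])))  # 2
```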
Algorithm 1 provides the pseudocode of the k-NN classifier.
Note that, when computing the Euclidean distance between instance vectors, the features should be
on the same scale. Although this is part of the preprocessing task, we stress that if the data is not
normalized, the performance of the k-NN classifier can be heavily affected. One way to normalize the
values of the features is min-max normalization, where the value v of a numeric attribute
x is transformed to v′ in the range [0, 1] by computing v′ = (v − min(x))/(max(x) − min(x)), where
min(x) and max(x) are the minimum and maximum values of attribute x. Another way to normalize the
data is by computing the z-score zv = (v − µx)/σx, where µx is the mean value of attribute x and σx its
standard deviation.
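The two normalization schemes above can be sketched as follows (an illustration with our own function names, not part of the lab script):

```python
import numpy as np

def min_max_normalize(x):
    # Maps each value v to (v - min(x)) / (max(x) - min(x)), i.e., into [0, 1]
    return (x - x.min()) / (x.max() - x.min())

def z_score_normalize(x):
    # Maps each value v to (v - mean(x)) / std(x); result has mean 0 and std 1
    return (x - x.mean()) / x.std()

x = np.array([0.0, 5.0, 10.0])
print(min_max_normalize(x))  # [0.  0.5 1. ]
print(z_score_normalize(x))
```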
Although the k-NN algorithm is very simple, it typically performs well in practice and is easy to imple-
ment. However, it has been observed that when the class distribution is skewed, the majority
voting rule does not perform well: instances of a more frequent class tend to dominate the
prediction of the new instance, because they tend to be common among the k nearest neighbors due
to their large number. One way to overcome this problem is to weight the classification, taking into
account the distance from the test instance to each of its k nearest neighbors. The class of each of the
1 Wikipedia’s lemma for Hamming distance: http://en.wikipedia.org/wiki/Hamming_distance.
k nearest neighbors is multiplied by a weight proportional to the inverse of the distance from that
instance to the test instance. The algorithm is also sensitive to noisy features and may perform badly
in high dimensions (curse of dimensionality). In these cases, the performance of the algorithm can be
improved by applying feature selection or dimensionality reduction techniques. Additionally, the running
time of the k-NN algorithm is high: for each test instance, we have to search through all the training data
to find the nearest neighbors. This can be improved using appropriate data structures that support
fast nearest neighbor search and make k-NN computationally tractable even for large datasets (these
generally seek to reduce the number of distance evaluations actually performed).
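The distance-weighted vote described above can be sketched as follows (a minimal illustration with our own function name, not the lab's kNN() function): each neighbor's vote counts with weight inversely proportional to its distance.

```python
import numpy as np

def weighted_vote(distances, labels):
    # distances, labels: arrays for the k nearest neighbors of one test instance
    eps = 1e-10                       # avoid division by zero for exact matches
    weights = 1.0 / (distances + eps)
    classes = np.unique(labels)
    # Sum the weights per class and return the class with the largest total
    scores = [weights[labels == c].sum() for c in classes]
    return classes[np.argmax(scores)]

# Three distant neighbors of class 0 lose to one very close neighbor of class 1
d = np.array([5.0, 5.0, 5.0, 0.5])
y = np.array([0, 0, 0, 1])
print(weighted_vote(d, y))  # 1
```

With a plain majority vote, class 0 would win 3-to-1 here; weighting by inverse distance lets the much closer neighbor dominate.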
Choice of parameter k
The value of the parameter k often depends on the properties of the dataset. Generally, larger values of k
reduce the effect of noise on the classification, but make the boundaries between classes less distinct. On
the other hand, small values of k create many small regions for each class and may lead to overfitting. In
practice, we can apply cross-validation in order to choose an appropriate value of k2. A rule of thumb
in machine learning is to pick k near the square root of the size of the training set.
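As a sketch of how cross-validation can select k, the following uses scikit-learn (which is otherwise not assumed by this lab) on its small built-in 8 × 8 digits dataset:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)  # 1797 images of 8x8 pixels
scores = {}
for k in [1, 3, 5, 7, 9]:
    clf = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```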
All digit images have been size-normalized and centered in a fixed-size image of 28 × 28 pixels. Each
pixel of the image is represented by a value in the range [0, 255], where 0 corresponds to black, 255
to white, and anything in between is a shade of grey. In our case, the pixels are the features of
our dataset; therefore, each image (instance) has 784 features. Thus, the training set has dimensions
60, 000 × 784 and the test set 10, 000 × 784. Regarding the class labels, each image (digit) belongs to the
category that this digit represents (e.g., digit 2 belongs to category 2). Due to time constraints, in the
2 A technique based on cross-validation for the selection of k is described here: https://www.quora.com/How-can-I-choose-the-best-K-in-KNN-K-nearest-neighbour-classification.
3 The MNIST database: http://yann.lecun.com/exdb/mnist/.
experiments that will be performed in the lab, we will use subsets of the above training and test sets.
The code that imports the MNIST dataset has been implemented in the loadMnist.py Python script.
Since the dataset is relatively large, we keep only a subset of the training and test data, due to the time
constraints of the lab and the fact that the k-NN algorithm is computationally expensive.
# Keep a subset of the training (60,000 images) and test (10,000 images) data
trainingImages = trainingImages[:2000, :]
trainingLabels = trainingLabels[:2000]
# Test on a subset of the dataset (e.g., 20 images) to keep the running time relatively low
testImages = testImages[:20, :]
testLabels = testLabels[:20]
The next commands are for illustration purposes; they depict the first ten digits (images) of the test
data.
# Show the first ten digits
fig = plt.figure('First 10 Digits')
for i in range(10):
    a = fig.add_subplot(2, 5, i + 1)
    plt.imshow(testImages[i, :].reshape(28, 28), cmap=cm.gray)
    plt.axis('off')
plt.show()
The next part of the code performs the classification of the test dataset using the k-NN algorithm. The
kNN() function implements the k-Nearest Neighbors algorithm, and the body of the function should
be filled in during the lab. It takes as input the parameter k (i.e., the number of neighbors), the training
data and their class labels, as well as the test data. In this case, we use the k = 5 nearest neighbors. As
we have already discussed, the k-NN classifier is not based on a model built from the training data; the
prediction of the class labels of new instances occurs during the classification phase, based on the
training set.
# Run kNN algorithm
k = 5
predictedDigits = zeros(testImages.shape[0])
for i in range(testImages.shape[0]):
    print("Current Test Instance: " + str(i + 1))
    predictedDigits[i] = kNN(k, trainingImages, trainingLabels, testImages[i, :])
Finally, we compute the accuracy of the k-NN classifier. In particular, we compare the predicted labels
of the test data with the true class labels contained in the testLabels variable.
# Calculate accuracy
successes = 0
for i in range(testImages.shape[0]):
    if predictedDigits[i] == testLabels[i]:
        successes += 1
accuracy = successes / float(testImages.shape[0])
print("Accuracy: " + str(accuracy))
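Equivalently, the accuracy can be computed in a single vectorized NumPy expression (an aside, not part of the lab script; the toy arrays below are for illustration only):

```python
import numpy as np

predictedDigits = np.array([1, 2, 3, 4])
testLabels = np.array([1, 2, 0, 4])
# Fraction of test instances whose predicted label matches the true label
accuracy = np.mean(predictedDigits == testLabels)
print(accuracy)  # 0.75
```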
The skeleton of the kNN() function, to be completed during the lab (the parameter names mirror the call above):
def kNN(k, trainingImages, trainingLabels, testInstance):
    # Add your code here
    return label
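One possible way to fill in the body is sketched below (a minimal NumPy solution using a plain majority vote, to be treated as one option rather than the official answer):

```python
import numpy as np

def kNN(k, trainingImages, trainingLabels, testInstance):
    # Euclidean distance from the test instance to every training instance
    distances = np.sqrt(np.sum((trainingImages - testInstance) ** 2, axis=1))
    # Indices of the k training instances closest to the test instance
    nearest = np.argsort(distances)[:k]
    # Majority vote among the class labels of the k nearest neighbors
    values, counts = np.unique(trainingLabels[nearest], return_counts=True)
    label = values[np.argmax(counts)]
    return label

# Toy example: two clusters of two points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
y = np.array([0, 0, 1, 1])
print(kNN(3, X, y, np.array([9.0, 10.0])))  # 1
```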
• Change the variable k and compute the accuracy of the algorithm. What do you observe?
• Consider the size of the training set (recall that we have 60, 000 training instances) and examine
the performance of the classifier for different cases. What do you observe? Is there any trade-off
between the accuracy and the running time?