Lab 1 1.2
Name of the Experiment: Implementation of the Nearest Neighbor classification algorithm with
and without distorted patterns.
Dataset: MNIST dataset
The MNIST (Modified National Institute of Standards and Technology) dataset is a widely
used dataset for handwritten digit recognition and image classification tasks. It consists of
70,000 grayscale images of handwritten digits (0-9) and is commonly used as a benchmark for
testing and comparing the performance of various machine learning algorithms.
Some Key Characteristics of the MNIST Dataset:
➢ Image size: The images in the MNIST dataset are 28×28 pixels in size, making them
relatively small and easy to work with.
➢ Image format: The images are stored in a 28×28 array of pixel values, with each
pixel having a value between 0 and 255, representing the intensity of the pixel.
➢ Labels: The MNIST dataset includes labels for each image, indicating the digit that
the image represents.
➢ Training and Test Sets: 60,000 training images and 10,000 testing images.
➢ Balance: The MNIST dataset is well-balanced, with roughly equal numbers of
images for each digit.
➢ Features: Number of total features is 784.
➢ Classes: Total 10 classes.
Ratio of training and Test Dataset:
The MNIST dataset is typically split into two sets: a training set of 60,000 images and a
test set of 10,000 images. So, the ratio of the training set to the test set is
60,000/10,000 = 6:1.
Implementation:
First, all the dependencies are loaded.
Code:
import numpy as np
import pandas as pd
import statistics
from statistics import mode
import tensorflow as tf
Then the dataset is loaded, and the training data (x_train), training class labels (y_train),
test data (x_test), and test class labels (y_test) are extracted from it.
(x_train,y_train),(x_test,y_test) = tf.keras.datasets.mnist.load_data(path='mnist.npz')
The data and label arrays are reshaped into 2-D arrays for flexibility, i.e. each 28×28 image becomes a 784-element row, and the training and test portions are then stacked into the combined arrays MN_image and MN_label.
x_train = x_train.reshape(x_train.shape[0],784)
y_train = y_train.reshape(y_train.shape[0],1)
x_test = x_test.reshape(x_test.shape[0],784)
y_test = y_test.reshape(y_test.shape[0],1)
MN_image = np.vstack((x_train,x_test))
MN_label = np.vstack((y_train,y_test))
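As a quick sanity check (the expected shapes below follow directly from the reshape and stack above), the combined arrays can be inspected:
print(MN_image.shape)   # expected: (70000, 784), 70,000 images with 784 pixel features each
print(MN_label.shape)   # expected: (70000, 1), one class label per image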
After that, I have defined a function that implements the KNN algorithm. It performs the
following operations:
➢ Takes a test image and a k-value.
➢ Measures the Manhattan distance between two images, where the features are the pixel
values. The Manhattan distance between two points (X1, Y1) and (X2, Y2) is given by
|X1 − X2| + |Y1 − Y2|; for the 784-pixel images it is the sum of the absolute pixel-wise differences.
sort_distance.append((np.absolute(MN_image[i,] - MN_image[j,]).sum(), i, j, MN_label[j][0]))
sort_distance.sort(key=lambda tup: tup[0])
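The full body of the function is not reproduced in this report; the sketch below shows one way it might look, assuming the signature KNN(k, i) used in the evaluation loop (i being the index of the test image in MN_image) and a majority vote with mode() from the statistics module imported above. The exact neighbour count per call and the voting details are assumptions, not the verbatim code.
def KNN(k, i):
    # Sketch: classify test image i using its k nearest training images
    # under the Manhattan (L1) distance.
    sort_distance = []
    for j in range(0, 60000):                         # training images only
        dist = np.absolute(MN_image[i,] - MN_image[j,]).sum()
        sort_distance.append((dist, i, j, MN_label[j][0]))
    sort_distance.sort(key=lambda tup: tup[0])
    neighbour_labels = [t[3] for t in sort_distance[:k]]
    # statistics.mode raises StatisticsError on a tied vote (Python < 3.8),
    # which the evaluation loop below catches before retrying with a larger k.
    return mode(neighbour_labels)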
KNN has been applied to all test images, and then the accuracy of the model is
calculated. It should be noted that when more than one majority label exists (a tied vote), the
situation has been resolved by increasing the k-value; a small sketch of this tie check is given
after the list below. The value of K can greatly impact the performance of a KNN model, and
choosing the right value can be crucial. The value of K depends on various attributes, including:
➢ Number of classes
➢ Data distribution
➢ Outliers
➢ Noise
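As a hedged illustration of the tie-breaking idea (not the exact code used in the experiment), the vote among the k nearest labels can be inspected with collections.Counter, and the query repeated with a larger k whenever the two most common labels receive the same number of votes:
from collections import Counter

def vote_with_tie_check(neighbour_labels):
    # Returns (label, tied); tied is True when the two most common
    # labels received the same number of votes.
    counts = Counter(neighbour_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return counts[0][0], True        # tie: retry with a larger k
    return counts[0][0], False

# Example: a 4-neighbour vote that ends in a tie between labels 3 and 5
label, tied = vote_with_tie_check([3, 5, 3, 5])
print(label, tied)                        # tied is True, so k should be increased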
c = 0
for i in range(60000, 70000):        # indices of the test images in MN_image
    for k in range(0, 2):            # retry with a larger k when the vote is tied
        try:
            t = KNN(k, i)
            if t == MN_label[i][0]:  # prediction matches the true label
                c = c + 1
            break
        except:
            a = 0                    # tied vote raised an exception; try the next k
print("Accuracy:", (c * 100) / 10000)
Analysis Of Accuracy:
Table 1.1: Accuracy of KNN on MNIST dataset
K-value    Accuracy (%)
1          76.39
7          81.76
Limitations of KNN:
➢ The dataset is randomly distributed, so this classifier does not perform well on it.
➢ The dataset has many outliers, and KNN is very sensitive to outliers.
➢ If the labels of some data points are missing, performance degrades.
➢ When the dataset has more classes, performance degrades further.
➢ On a dataset that has no labels, KNN performs poorly as a classifier.
Conclusion:
KNN performs well when classifying data between two classes. As the number of classes
increases, it becomes harder to classify the data accurately, because the shortest distance
from a point to be classified may be the same for more than one class. Though it has some
drawbacks, KNN is used as a simple, lightweight classifier.