MachineLearning-Spring24 - KNN Implementation For Classification
K-Nearest Neighbors (KNN) is a simple and intuitive supervised machine learning algorithm used for classification and regression tasks. In classification, KNN predicts the class label of a new data point based on the majority class of its 'k' nearest neighbors in the feature space. The 'k' neighbors are determined by measuring distances, typically the Euclidean distance, between the new data point and all data points in the training set. The algorithm makes no assumptions about the underlying data distribution, which makes it versatile and suitable for many types of datasets. However, its performance can degrade with high-dimensional or noisy data, and it can be computationally expensive, especially on large datasets, since prediction requires storing all training samples and computing distances to each of them.
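To make the voting procedure concrete, here is a minimal from-scratch sketch (not the scikit-learn implementation used later in this notebook); the function name knn_predict is illustrative, and the toy arrays mirror the brightness/saturation dataset introduced below.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Euclidean distance from the query point to every training sample
    dists = np.linalg.norm(X_train - query, axis=1)
    # Positions of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[40, 20], [50, 50], [60, 90], [10, 25]])
y_train = np.array(['Red', 'Blue', 'Blue', 'Red'])
print(knn_predict(X_train, y_train, np.array([20, 35])))  # -> Red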
Minkowski Distance:
The Minkowski distance is a generalization of other distance measures such as Euclidean distance and
Manhattan distance. It calculates the distance between two points in a multi-dimensional space.
The Minkowski distance between two points $P$ and $Q$ in $n$-dimensional space is given by:

$$D(P, Q) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
Where:
- $x_i$ and $y_i$ are the $i$-th coordinates of points $P$ and $Q$ respectively.
- $p$ is a parameter that defines the order of the Minkowski distance. When $p = 1$, it is equivalent to the Manhattan distance, and when $p = 2$, it is equivalent to the Euclidean distance.
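As a quick sketch of the formula in code (the helper name minkowski and the two sample points are illustrative; the points are the two Red rows of the toy dataset used later):

import numpy as np

def minkowski(a, b, p):
    # Minkowski distance of order p between points a and b
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.array([40, 20]), np.array([10, 25])
print(minkowski(a, b, p=1))  # 35.0, the Manhattan distance
print(minkowski(a, b, p=2))  # ~30.41, the Euclidean distance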
Euclidean Distance:
The Euclidean distance is a measure of the straight-line distance between two points in Euclidean space.
The Euclidean distance between two points $P$ and $Q$ in $n$-dimensional space is given by:

$$D(P, Q) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Where:
- $x_i$ and $y_i$ are the $i$-th coordinates of points $P$ and $Q$ respectively.
These distances are commonly used in various machine learning algorithms and data analysis tasks to
quantify the similarity or dissimilarity between data points in a dataset.
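In practice these measures rarely need to be hand-coded; SciPy ships reference implementations in scipy.spatial.distance. A brief sketch using the same illustrative pair of points as above:

import numpy as np
from scipy.spatial import distance

a, b = np.array([40, 20]), np.array([10, 25])
print(distance.cityblock(a, b))       # 35.0, Manhattan (Minkowski with p = 1)
print(distance.euclidean(a, b))       # ~30.41, Euclidean (Minkowski with p = 2)
print(distance.minkowski(a, b, p=3))  # a higher-order Minkowski distance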
In [1]:
import pandas as pd

# Toy dataset: four color samples described by brightness and saturation
df = pd.DataFrame({'brightness': [40, 50, 60, 10],
                   'saturation': [20, 50, 90, 25],
                   'Class': ['Red', 'Blue', 'Blue', 'Red']})
print(df)
brightness saturation Class
0 40 20 Red
1 50 50 Blue
2 60 90 Blue
3 10 25 Red
What is scikit-learn?
Scikit-learn, often abbreviated as sklearn, is one of the most popular and widely used machine learning
libraries in Python. It provides simple and efficient tools for data mining and data analysis, built on top of
NumPy, SciPy, and matplotlib.
In [2]:
# import kNN from sklearn
from sklearn.neighbors import KNeighborsClassifier
Each row of the feature matrix corresponds to a single sample or data point (a feature vector), and each column corresponds to a feature or attribute of that sample. Therefore, the dimensions of the feature matrix X are typically m × n, where m is the number of samples and n is the number of features.
For example, in a dataset with m samples and n features, the feature matrix X would look like this:
$$X = \begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1n} \\
x_{21} & x_{22} & \dots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \dots & x_{mn}
\end{bmatrix}$$

where $x_{ij}$ represents the value of the $j$-th feature of the $i$-th sample.
So, in the context of scikit-learn or any machine learning library, when you refer to the feature matrix X,
you are essentially talking about the data matrix containing the features of your dataset.
In [3]:
# Select the two feature columns from df as the feature matrix
X = df[['brightness', 'saturation']]
X
Out[3]:
   brightness  saturation
0          40          20
1          50          50
2          60          90
3          10          25
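As a quick illustrative follow-up, the dimensions can be checked against the m × n description above:

print(X.shape)       # (4, 2): m = 4 samples, n = 2 features
print(X.to_numpy())  # the underlying data matrix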
For classification problems, y contains the discrete class labels or categories assigned to each sample, e.g. Red, Blue. In regression problems, y contains continuous values representing the target variable to be predicted for each sample, e.g. 2500, 35000, 400000, 200000.
In [4]:
# Select the class-label column from df as the target vector
y = df['Class']
y
Out[4]: 0 Red
1 Blue
2 Blue
3 Red
Name: Class, dtype: object
In [5]:
# Creating and training the KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Classify an unseen sample (its feature values appear in the output below)
X_test = pd.DataFrame({'brightness': [20], 'saturation': [35]})
y_pred = knn.predict(X_test)
print("The predicted label for sample: \n\n", X_test, " \n\nis ", y_pred)

The predicted label for sample:

   brightness  saturation
0          20          35

is  ['Red']
/home/abdullah/anaconda3/lib/python3.9/site-packages/sklearn/neighbors/_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
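Beyond the hard label, KNeighborsClassifier can also report the vote proportions among the k neighbors through predict_proba; a short illustrative follow-up (with this toy data, 2 of the 3 nearest neighbors are Red):

print(knn.classes_)               # class order used for the columns below
print(knn.predict_proba(X_test))  # fraction of the k = 3 neighbors in each class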
In [6]:
# Finding the distances and indices of the k nearest neighbors
distances, indices = knn.kneighbors(X_test)
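Both returned arrays have shape (n_queries, k). A small illustrative follow-up that maps the neighbor indices back to their training labels, i.e. the votes behind the prediction above:

print("Distances:", distances)  # distance from the test point to each of its 3 neighbors
print("Indices:", indices)      # row positions of those neighbors in the training set
print("Neighbor labels:", y.iloc[indices[0]].tolist())  # the labels that were voted on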