MachineLearning-Spring24 - KNN Implementation For Classification

K-Nearest Neighbors (KNN)

KNN is a simple and intuitive supervised machine learning algorithm used for classification and regression
tasks. In classification, KNN predicts the class label of a new data point from the majority class of its
'k' nearest neighbors in the feature space. The 'k' neighbors are found by measuring distances,
typically Euclidean distance, between the new data point and every data point in the training
set. The algorithm makes no assumptions about the underlying data distribution, which makes it
versatile and suitable for many types of datasets. However, its performance may degrade with high-
dimensional or noisy data, and it can be computationally expensive on large datasets, since it must
store all training samples and compute distances to them at prediction time.
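
As a rough sketch of this procedure (illustrative only, not the notebook's code; the function name knn_predict and the toy arrays are assumptions for demonstration), the classification step could be written as:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Row indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[40, 20], [50, 50], [60, 90], [10, 25]])
y_train = np.array(['Red', 'Blue', 'Blue', 'Red'])
print(knn_predict(X_train, y_train, np.array([20, 35]), k=3))  # -> Red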

Minkowski Distance:
The Minkowski distance is a generalization of other distance measures such as Euclidean distance and
Manhattan distance. It calculates the distance between two points in a multi-dimensional space.

The Minkowski distance between two points $P$ and $Q$ in $n$-dimensional space is given by:

$$D(P, Q) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

Where:

$x_i$ and $y_i$ are the $i$-th dimensions of points $P$ and $Q$ respectively.
$p$ is a parameter that defines the order of the Minkowski distance. When $p = 1$, it is equivalent
to Manhattan distance, and when $p = 2$, it is equivalent to Euclidean distance.

Euclidean Distance:
The Euclidean distance is a measure of the straight-line distance between two points in Euclidean space.

The Euclidean distance between two points $P$ and $Q$ in $n$-dimensional space is given by:

$$D(P, Q) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Where:

$x_i$ and $y_i$ are the $i$-th dimensions of points $P$ and $Q$ respectively.

These distances are commonly used in various machine learning algorithms and data analysis tasks to
quantify the similarity or dissimilarity between data points in a dataset.
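
For example, both metrics can be computed directly with NumPy; this is a small illustrative snippet, not part of the notebook, and the points P and Q are arbitrary:

import numpy as np

P = np.array([40, 20])
Q = np.array([20, 35])

def minkowski(a, b, p=2):
    # Minkowski distance of order p between points a and b
    return (np.abs(a - b) ** p).sum() ** (1 / p)

print(minkowski(P, Q, p=1))   # Manhattan distance: 35.0
print(minkowski(P, Q, p=2))   # Euclidean distance: 25.0
print(np.linalg.norm(P - Q))  # the same Euclidean distance via NumPy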

In [1]:
import pandas as pd

# Creating the data for training
# (same example as we saw in the previous class)
data = {
    'brightness': [40, 50, 60, 10],
    'saturation': [20, 50, 90, 25],
    'Class': ['Red', 'Blue', 'Blue', 'Red']
}

# Creating the DataFrame
df = pd.DataFrame(data)

print(df)
   brightness  saturation Class
0          40          20   Red
1          50          50  Blue
2          60          90  Blue
3          10          25   Red

What is scikit-learn?
Scikit-learn, often abbreviated as sklearn, is one of the most popular and widely used machine learning
libraries in Python. It provides simple and efficient tools for data mining and data analysis, built on top of
NumPy, SciPy, and matplotlib.

In [2]:
# Import kNN from sklearn
from sklearn.neighbors import KNeighborsClassifier

# Creating the DataFrame for testing
X_test = pd.DataFrame({'brightness': [20], 'saturation': [35]})

# Splitting the data into features and target variable
X = df[['brightness', 'saturation']]
y = df['Class']

Feature Matrix (X)


The term "X" typically refers to the feature matrix, also known as the design matrix or predictor matrix.
This matrix contains the features or attributes of the dataset used for training the model.

Each row of the feature matrix corresponds to a single sample or data point (Feature vector), and each
column corresponds to a feature or attribute of that sample. Therefore, the dimensions of the feature
matrix X are typically m × n, where m is the number of samples and n is the number of features.

For example, in a dataset with m samples and n features, the feature matrix X would look like this:

$$X = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1n} \\ x_{21} & x_{22} & \dots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \dots & x_{mn} \end{bmatrix}$$

Where $x_{ij}$ represents the value of the $j$-th feature of the $i$-th sample.

So, in the context of scikit-learn or any machine learning library, when you refer to the feature matrix X,
you are essentially talking about the data matrix containing the features of your dataset.

In [3]:
X # feature matrix

Out[3]:    brightness  saturation
0          40          20
1          50          50
2          60          90
3          10          25
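
The dimensions of this feature matrix can be checked directly; this is an optional illustrative snippet that assumes the X defined above:

# m = 4 samples, n = 2 features
print(X.shape)       # (4, 2)
print(X.to_numpy())  # the underlying 4 x 2 array of feature values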

Target Variable (y)


The term "y" typically refers to the target variable or the response variable. It represents the labels or
outcomes associated with the samples in the dataset.
The target variable y is usually a one-dimensional array or a column vector containing the labels or
outcomes corresponding to each sample in the feature matrix X.

For classification problems, y contains discrete class labels or categories assigned to each sample
(e.g., Red, Blue). In regression problems, y contains continuous values representing the target variable to be
predicted for each sample (e.g., 2500, 35000, 400000, 200000).

In [4]:
y

Out[4]: 0 Red
1 Blue
2 Blue
3 Red
Name: Class, dtype: object
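
As an optional check (illustrative only, assuming the y defined above), the shape and class balance of the target variable can be inspected:

print(y.shape)           # (4,), one label per sample
print(y.value_counts())  # two 'Red' and two 'Blue' samples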

In [5]:
# Creating and training the KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Predicting the labels for the test set
y_pred = knn.predict(X_test)

print("The predicted label for sample: \n\n", X_test, " \n\nis ", y_pred)

The predicted label for sample:

brightness saturation
0 20 35

is ['Red']
/home/abdullah/anaconda3/lib/python3.9/site-packages/sklearn/neighbors/_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)

In [6]:
# Finding the distances and indices of the k nearest neighbors
distances, indices = knn.kneighbors(X_test)
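
To relate the prediction back to the raw distances, the two returned arrays can be printed (an illustrative follow-up, not part of the original notebook). With the four training points above, the three nearest neighbors of (20, 35) are samples 3, 0 and 1 (two 'Red' and one 'Blue'), which is why the predicted label is 'Red':

# distances: Euclidean distances to the 3 nearest training points (closest first)
# indices:   row positions of those points in X
print(distances)  # approximately [[14.14 25.   33.54]]
print(indices)    # [[3 0 1]]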
