Vertopal.com Lab4 KNN

The document outlines a step-by-step guide for implementing a K-Nearest Neighbors (KNN) algorithm using the IRIS dataset. It covers data loading, exploratory data analysis, distance calculation, neighbor finding, voting on labels, and model evaluation. The document includes code snippets and explanations for each part of the process.

Uploaded by

ammarkusow2
Copyright
© All Rights Reserved

Imports

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

Load IRIS dataset

iris = datasets.load_iris()

print(iris)

As you can see, the dataset is in the form of a dictionary. What are the keys of the
dictionary?

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

What is the value of the key data? Assign the value to a variable X

What is the shape of X?

What is the value of the key target? Assign the value to a variable y

What is the shape of y?

What is the value of the key target_names? Assign the value to a variable
target_names

What is the value of the key feature_names? Assign the value to a variable
feature_names

#Solution
X = iris['data']
y = iris['target']
feature_names = iris['feature_names']
target_names = iris['target_names']

# Note: you can also access the elements with the dot (.) operator, e.g.
# X = iris.data

print(type(X))
print(type(y))
print(X.shape)
print(y.shape)
print(feature_names)
print(target_names)

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
(150, 4)
(150,)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']

The figure below illustrates the features and target labels of the Iris
dataset.

Print the 5th datapoint in your dataset X

Print the features and target labels of flowers 1 to 5.

Iterate over all datapoints in X and calculate the area of Sepal and Petal for each
flower in the dataset.
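A minimal sketch of the three exercises above. To stay runnable on its own it uses a small stand-in array holding the first five Iris rows (the names X_demo and y_demo are made up for this example); with the lab's X and y the same indexing should apply. It also assumes that "area" simply means length times width.

```python
import numpy as np

# Stand-in for the first five rows of X and y (assumption: hard-coded here
# so the sketch runs without loading the dataset).
X_demo = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])
y_demo = np.array([0, 0, 0, 0, 0])

# The 5th datapoint has index 4 because NumPy indexing starts at 0.
fifth = X_demo[4]
print(fifth)

# Features and target labels of flowers 1 to 5 (indices 0 to 4).
for features, label in zip(X_demo[:5], y_demo[:5]):
    print(features, label)

# Area taken as length * width (an assumption), one value per datapoint.
sepal_areas = []
petal_areas = []
for sepal_length, sepal_width, petal_length, petal_width in X_demo:
    sepal_areas.append(sepal_length * sepal_width)
    petal_areas.append(petal_length * petal_width)
print(sepal_areas)
print(petal_areas)
```

The loop mirrors the wording of the exercise; the vectorized form X_demo[:, 0] * X_demo[:, 1] would give the same sepal areas without a loop.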

Exploratory Data Analysis

Box plot of all features

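plt.figure()
plt.boxplot(X)
# Label each box with its feature name (boxes sit at positions 1..4)
plt.xticks(range(1, len(feature_names) + 1), feature_names)
plt.ylabel("[cm]")
plt.show()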


Scatter plot for each pair of features

Plot the scatter plot for the pair of first and second features

(X[:,0], X[:,1])

Don't forget to label your axes.

Hint: use c=y inside the scatter plot to color the points based on the
target labels.

#your code here
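One possible way to draw this plot, shown as a sketch rather than the expected solution. It reloads the dataset so the block runs on its own; in the notebook, X, y and feature_names already exist.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names

fig, ax = plt.subplots()
# c=y colors each point by its target label (0, 1 or 2).
points = ax.scatter(X[:, 0], X[:, 1], c=y)
ax.set_xlabel(feature_names[0])
ax.set_ylabel(feature_names[1])
plt.show()
```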

Write a function called plot_pair that takes a pair of features and their
labels and plots the scatter plot.

def plot_pair(X1, X2, x1_label, x2_label, y):
    ...

Use the plot_pair function to plot the scatter plot for all pairs of features.

X[:,0], X[:,1], 'Sepal Length', 'Sepal Width'
X[:,0], X[:,2], 'Sepal Length', 'Petal Length'
X[:,0], X[:,3], 'Sepal Length', 'Petal Width'
X[:,1], X[:,2], 'Sepal Width', 'Petal Length'
X[:,1], X[:,3], 'Sepal Width', 'Petal Width'
X[:,2], X[:,3], 'Petal Length', 'Petal Width'

#your code here
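A hedged sketch of plot_pair and of a loop over all six feature pairs. Returning the axes object is an addition made here so the result can be inspected; the lab's stub does not require it. The axis labels come from feature_names instead of the hand-written strings listed above.

```python
from itertools import combinations

import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names


def plot_pair(X1, X2, x1_label, x2_label, y):
    """Scatter plot of two feature columns, colored by target label."""
    fig, ax = plt.subplots()
    ax.scatter(X1, X2, c=y)
    ax.set_xlabel(x1_label)
    ax.set_ylabel(x2_label)
    plt.show()
    return ax  # returned only for inspection (an addition in this sketch)


# combinations(range(4), 2) yields the six index pairs (0, 1), (0, 2), ..., (2, 3).
pairs = list(combinations(range(X.shape[1]), 2))
for i, j in pairs:
    plot_pair(X[:, i], X[:, j], feature_names[i], feature_names[j], y)
```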

(Optional) The plots shown above do not have a legend. To add a legend to
the plot, you can use the following code snippet.

def plot_pair_with_legned(x1, x2, x1_label, x2_label, y):
    plt.figure()
    # Plot each class separately so that it gets its own legend entry
    for i, target_name in enumerate(iris.target_names):
        plt.scatter(x1[y == i], x2[y == i], label=target_name)

    plt.xlabel(x1_label)
    plt.ylabel(x2_label)
    plt.legend()
    plt.show()

plot_pair_with_legned(X[:,0], X[:,1], feature_names[0], feature_names[1], y)


Histogram of each feature

Plot the histogram of each feature.

#your code here
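One way to do this, as a sketch: a 2 x 2 grid with one histogram per feature. The grid layout and bins=20 are arbitrary choices made for this example, not requirements of the lab.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
feature_names = iris.feature_names

# One subplot per feature; X.T iterates over the four feature columns.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, column, name in zip(axes.ravel(), X.T, feature_names):
    ax.hist(column, bins=20)
    ax.set_xlabel(name)
    ax.set_ylabel("count")
fig.tight_layout()
plt.show()
```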

K Nearest Neighbors (KNN)

Euclidean Distance (2D)

In geometry, the Euclidean distance is the straight-line distance
between two points.

Given two points $ P(x_1, y_1) $ and $ Q(x_2, y_2)$ in a 2D plane, the
Euclidean distance between them is calculated as follows:

$ d(P, Q) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} $

Example (2D)

Let's say we have two points:

- $ P(2, 2) $
- $ Q(5, 5) $

$ d(P, Q) = \sqrt{(5 - 2)^2 + (5 - 2)^2} = \sqrt{18} \approx 4.24 $

We can calculate the distance between these two points with NumPy.

P = np.array([2, 2])
Q = np.array([5, 5])
distance = np.sqrt(np.sum((P - Q)**2))
distance

np.float64(4.242640687119285)

Example (3 Dimensions)

Consider two points in 3D space:

- $ P_1(1, 2, 3) $
- $ P_2(4, 0, 8) $

We can calculate the Euclidean distance as follows:

$ d(P_1, P_2) = \sqrt{(4 - 1)^2 + (0 - 2)^2 + (8 - 3)^2} $

$ d(P_1, P_2) = \sqrt{3^2 + (-2)^2 + 5^2} = \sqrt{9 + 4 + 25} = \sqrt{38} \approx 6.16 $

# Define two points in 3D space
P1 = np.array([1, 2, 3])
P2 = np.array([4, 0, 8])

# Calculate the Euclidean distance
distance = np.sqrt(np.sum((P2 - P1)**2))

print(f'The Euclidean distance between P1 and P2 is: {distance:.2f}')

The Euclidean distance between P1 and P2 is: 6.16
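The 2D and 3D examples are instances of one formula. For two points with n coordinates each, written here as $ P(p_1, \dots, p_n) $ and $ Q(q_1, \dots, q_n) $ (notation chosen for this note), the distance is:

$ d(P, Q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} $

With n = 2 this reduces to the 2D formula, and np.sqrt(np.sum((P - Q)**2)) evaluates it for arrays of any length, which is why the same expression also applies to the four features of an Iris sample.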

Write a function that takes two NumPy arrays P and Q and returns the Euclidean distance
between them.

def straight_line_distance(P, Q):
    ...
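A minimal sketch of such a function, reusing the expression from the examples above. The conversion with np.asarray is an addition so that plain Python lists also work.

```python
import numpy as np


def straight_line_distance(P, Q):
    """Return the Euclidean distance between two equal-length arrays."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    return np.sqrt(np.sum((P - Q) ** 2))


# The two worked examples: sqrt(18) and sqrt(38).
print(straight_line_distance(np.array([2, 2]), np.array([5, 5])))
print(straight_line_distance(np.array([1, 2, 3]), np.array([4, 0, 8])))
```

np.linalg.norm(P - Q) computes the same quantity and could be used instead.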

KNN Algorithm

KNN from scratch

0 - Look at the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=42)

Explain each term in the cell above. What are X_train, X_test, y_train, and y_test?

????
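As a starting point for the answer, the sketch below inspects the shapes returned by train_test_split. It uses placeholder arrays with the same shapes as the Iris X and y (the _demo names are made up here), so it runs without the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like the Iris data: 150 samples, 4 features.
X_demo = np.arange(600).reshape(150, 4)
y_demo = np.arange(150) % 3

# test_size=0.5 holds out half of the samples; random_state fixes the shuffle.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42
)

print(X_train_demo.shape, y_train_demo.shape)  # features and labels to learn from
print(X_test_demo.shape, y_test_demo.shape)    # held-out features and their true labels
```

Each of the 150 samples ends up in exactly one of the two halves, and the rows of the feature arrays stay aligned with the entries of the matching label arrays.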

1 - Calculate distances

Take one sample from the test set and find the distance between this sample and all
samples in the training set. In addition to the distance, you need to store the
index of the sample in the training set.

So, for example, if the distance between the test sample and the training sample at
index 5 is 3.5, you need to store (5, 3.5).

test_instance = X_test[0]

distances = []  # append the (index, distance) tuples to this list
# your code here

Write a function called calculate_distances that takes the test sample and the
training set and returns the distances together with the indices of the training samples.

def calculate_distances(test_instance, X_train):
    # return distances
    ...
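One possible implementation, sketched with toy data (the _demo arrays are made up for this example). It returns a list of (index, distance) tuples, which is the format the sorting step later in the lab expects.

```python
import numpy as np


def calculate_distances(test_instance, X_train):
    """Return a list of (index, distance) tuples, one per training sample."""
    distances = []
    for index, train_instance in enumerate(X_train):
        distance = np.sqrt(np.sum((test_instance - train_instance) ** 2))
        distances.append((index, distance))
    return distances


# Toy data (made up for this example): three training points in 2D.
X_train_demo = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
test_instance_demo = np.array([0.0, 0.0])
demo_distances = calculate_distances(test_instance_demo, X_train_demo)
print(demo_distances)
```

The input is one 1D sample plus the 2D training array, and the output has one tuple per training row.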

What do you pass as input to the function calculate_distances? What do you get as output
when you call this function?

????

What is the shape of the input arrays to the function calculate_distances? What is the
shape of the output?

???

2 - Find neighbors

Step 1: Sort the (index, distance) tuples by distance value in
ascending order.

distances = calculate_distances(test_instance, X_train)
distances.sort(key=lambda x: x[1])
distances

[(34, np.float64(0.22360679774997896)),
(45, np.float64(0.30000000000000027)),
(28, np.float64(0.5099019513592785)),
(35, np.float64(0.5099019513592788)),
(66, np.float64(0.5196152422706639)),
(47, np.float64(0.5291502622129183)),
(17, np.float64(0.5830951894845297)),
(36, np.float64(0.6164414002968978)),
(65, np.float64(0.6244997998398398)),
(41, np.float64(0.6480740698407859)),
(48, np.float64(0.6999999999999995)),
(70, np.float64(0.7071067811865478)),
(63, np.float64(0.728010988928052)),
(23, np.float64(0.741619848709566)),
(14, np.float64(0.754983443527075)),
(68, np.float64(0.774596669241483)),
(73, np.float64(0.7874007874011811)),
(0, np.float64(0.8124038404635955)),
(50, np.float64(0.8124038404635965)),
(9, np.float64(0.8602325267042631)),
(60, np.float64(0.9273618495495711)),
(18, np.float64(0.9433981132056598)),
(67, np.float64(0.9643650760992956)),
(20, np.float64(0.9746794344808962)),
(5, np.float64(0.9746794344808963)),
(37, np.float64(1.0049875621120894)),
(42, np.float64(1.0440306508910553)),
(2, np.float64(1.0535653752852738)),
(64, np.float64(1.0954451150103324)),
(62, np.float64(1.1045361017187258)),
(8, np.float64(1.1575836902790226)),
(44, np.float64(1.224744871391589)),
(43, np.float64(1.296148139681572)),
(11, np.float64(1.2999999999999998)),
(71, np.float64(1.3490737563232036)),
(38, np.float64(1.3490737563232043)),
(31, np.float64(1.407124727947029)),
(40, np.float64(1.4247806848775015)),
(1, np.float64(1.438749456993816)),
(52, np.float64(1.5556349186104048)),
(56, np.float64(1.6186414056238647)),
(29, np.float64(1.6278820596099706)),
(58, np.float64(1.6431676725154982)),
(16, np.float64(1.7349351572897476)),
(74, np.float64(1.8138357147217057)),
(55, np.float64(1.8165902124584952)),
(24, np.float64(1.8493242008906932)),
(4, np.float64(1.8601075237738276)),
(54, np.float64(1.8973665961010275)),
(32, np.float64(1.9157244060668017)),
(15, np.float64(1.997498435543818)),
(61, np.float64(2.0346989949375804)),
(51, np.float64(2.090454496036687)),
(19, np.float64(2.4020824298928627)),
(69, np.float64(3.2939338184001206)),
(3, np.float64(3.3674916480965473)),
(13, np.float64(3.4161381705077445)),
(39, np.float64(3.551056180912941)),
(49, np.float64(3.5623026261113755)),
(53, np.float64(3.5623026261113755)),
(10, np.float64(3.5735136770411273)),
(12, np.float64(3.5791060336346563)),
(26, np.float64(3.6318039594669758)),
(6, np.float64(3.6537651812890224)),
(59, np.float64(3.6565010597564442)),
(25, np.float64(3.685105154537656)),
(57, np.float64(3.765634076752546)),
(30, np.float64(3.782856063875548)),
(7, np.float64(3.823610858861032)),
(33, np.float64(3.8314488121336034)),
(72, np.float64(3.844476557348217)),
(21, np.float64(3.845776904605882)),
(46, np.float64(3.8961519477556315)),
(27, np.float64(3.9357337308308855)),
(22, np.float64(4.177319714841085))]

Step 2: Select the first k elements of the sorted list and store the
indices of these k elements in a list.

k = 5
distances[:k]

[(34, np.float64(0.22360679774997896)),
(45, np.float64(0.30000000000000027)),
(28, np.float64(0.5099019513592785)),
(35, np.float64(0.5099019513592788)),
(66, np.float64(0.5196152422706639))]

Extract the index of the k nearest neighbors from (index, distance) tuples.

neighbor_index = []
# your code here

Step 3: Find the labels of these top k samples from y_train array.

neighbor_label = []
#your code here
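Steps 2 and 3 can be sketched with two list comprehensions. The inputs below are toy values made up for this example (sorted_distances stands in for the sorted distances list, y_train_demo for y_train).

```python
import numpy as np

# Toy inputs (made up for this example): already sorted (index, distance)
# tuples and a small label array standing in for y_train.
sorted_distances = [(2, 0.1), (0, 0.4), (3, 0.9), (1, 1.5)]
y_train_demo = np.array([0, 1, 1, 2])
k = 3

# Step 2: keep only the index from each of the first k tuples.
neighbor_index = [index for index, _ in sorted_distances[:k]]

# Step 3: look up the label of each neighbor in the training labels.
neighbor_label = [y_train_demo[index] for index in neighbor_index]

print(neighbor_index)
print(neighbor_label)
```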

Now write a function find_neighbors to do all the steps above from 1 to 3.

def find_neighbors(test_instance, X_train, y_train, k):
    """
    Inputs
        test_instance: one data point from the test set
        X_train: train dataset
        y_train: train labels
        k: number of neighbours

    Output
        neighbor_label: list of the k neighbours' labels
    """
    # your code here
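A sketch of find_neighbors that follows steps 1 to 3 in order. To stay self-contained it computes the distances inline instead of calling calculate_distances; the toy arrays are made up for this example.

```python
import numpy as np


def find_neighbors(test_instance, X_train, y_train, k):
    """Return the labels of the k training samples closest to test_instance."""
    # Step 1: (index, distance) for every training sample, sorted by distance.
    distances = [
        (index, np.sqrt(np.sum((test_instance - row) ** 2)))
        for index, row in enumerate(X_train)
    ]
    distances.sort(key=lambda pair: pair[1])
    # Step 2: indices of the k nearest samples.
    neighbor_index = [index for index, _ in distances[:k]]
    # Step 3: labels of those samples.
    neighbor_label = [y_train[index] for index in neighbor_index]
    return neighbor_label


# Toy data (made up for this example): two clusters with labels 0 and 1.
X_train_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train_demo = np.array([0, 0, 1, 1])
demo_labels = find_neighbors(np.array([5.0, 6.0]), X_train_demo, y_train_demo, k=2)
print(demo_labels)
```

The output is a plain list of length k, so its "shape" is simply (k,) once converted to an array.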

What do you pass as input to the function find_neighbors? What do you get as output when
you call this function?

???

What is the shape of the input arrays to the function find_neighbors? What is the shape of
the output?

???

Explain what operations are done inside the function find_neighbors to find
the labels of the k nearest neighbors.

???

3 - Vote on labels

You have this function to vote on labels of the k nearest neighbors.

def vote_on_labels(neighbor_label):
    prediction_dict = {}
    for label in neighbor_label:
        if label in prediction_dict:
            prediction_dict[label] += 1
        else:
            prediction_dict[label] = 1
    prediction = max(prediction_dict, key=prediction_dict.get)
    return prediction

y_pred = vote_on_labels(neighbor_label)
y_pred

np.int64(1)
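For comparison, the same majority vote can be written with collections.Counter from the standard library. This is an alternative sketch, not part of the lab; the name vote_with_counter is hypothetical.

```python
from collections import Counter


def vote_with_counter(neighbor_label):
    """Majority vote over a list of labels (hypothetical alternative)."""
    counts = Counter(neighbor_label)
    # most_common(1) returns a one-element list holding the (label, count)
    # pair with the highest count.
    label, _ = counts.most_common(1)[0]
    return label


print(vote_with_counter([1, 1, 2, 1, 0]))
```

When two labels receive the same number of votes, which one wins depends on implementation details, which is worth keeping in mind for even values of k.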

What do you pass as input to the function vote_on_labels? What do you get as output when
you call this function?

????

What is the shape of the input to the function vote_on_labels? What is the shape of
the output?

???

4 - Put it all together

Now iterate over all datapoints of X_test and predict their labels.

y_pred = []
#your code here

Turn your code into a function KNN that takes the training set, the target labels of the
training set, the test set, and the value of k, and returns the predicted labels of
the test set.

def KNN(X_train, y_train, X_test, k):
    ...
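A compact, self-contained sketch of KNN. Instead of calling the helper functions from the earlier steps, it uses vectorized NumPy operations (np.argsort for the neighbor search and np.unique for the vote); this is a swapped-in technique chosen for brevity, and the toy arrays are made up for this example.

```python
import numpy as np


def KNN(X_train, y_train, X_test, k):
    """Predict a label for every row of X_test by majority vote of k neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    y_pred = []
    for test_instance in np.asarray(X_test, dtype=float):
        # Distance from this test point to every training point.
        distances = np.sqrt(np.sum((X_train - test_instance) ** 2, axis=1))
        # Indices of the k smallest distances.
        neighbor_index = np.argsort(distances, kind="stable")[:k]
        neighbor_label = y_train[neighbor_index]
        # Majority vote: count each label and take the most frequent one.
        labels, counts = np.unique(neighbor_label, return_counts=True)
        y_pred.append(labels[np.argmax(counts)])
    return np.array(y_pred)


# Toy data (made up for this example): two well-separated clusters.
X_train_demo = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train_demo = np.array([0, 0, 0, 1, 1, 1])
X_test_demo = np.array([[0.2, 0.2], [5.5, 5.5]])
y_pred_demo = KNN(X_train_demo, y_train_demo, X_test_demo, k=3)
print(y_pred_demo)
```

In this version a tied vote goes to the smallest label, because np.unique returns labels in sorted order. A dictionary-based vote may break ties differently, so the two approaches can disagree on a few samples.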

5 - Evaluate the model

Finally, calculate the accuracy of the KNN algorithm.

y_test == y_pred

array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, False, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, False, True, True, True, True, True, True, True,
True, True, True, True, True, True, False, True, True,
True, True, True, True, True, True, True, True, True,
False, True, True])

accuracy = sum(y_test == y_pred) / len(y_test)  # counts True as 1 and False as 0
print(f"accuracy: {accuracy * 100} %")

accuracy: 94.66666666666667 %

Turn your code into a function evaluate that takes the true labels and the
predicted labels and returns the accuracy of the model.

def evaluate(y_test, y_pred):
    # your code here
    ...
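A minimal sketch of evaluate. It uses np.mean over the element-wise comparison, which is equivalent to dividing the number of matches by the number of samples; the short label lists are made up for this example.

```python
import numpy as np


def evaluate(y_test, y_pred):
    """Return the fraction of predictions that match the true labels."""
    y_test = np.asarray(y_test)
    y_pred = np.asarray(y_pred)
    return np.mean(y_test == y_pred)


# Toy labels (made up): three of four predictions are correct.
print(evaluate([0, 1, 2, 1], [0, 1, 1, 1]))
```

As a sanity check against the output above, the comparison array contains 4 False values out of 75, and 71 / 75 is approximately 0.9467, which matches the printed accuracy.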

KNN in Scikit-Learn

knn_model = KNeighborsClassifier(n_neighbors=4)  # You can change the value of 'k' as needed.
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 93.33%

(Optional) 6 - Hyperparameter tuning

So far we have used a single fixed value of k (k = 5 when finding neighbors,
n_neighbors=4 in scikit-learn). Now, we are going to find the best value of k
for the KNN algorithm.

K = [1, 2, 3, 4, 5, 6, 7, 8]
my_accs = []
# your code here

Plot the accuracy of the model for different values of k with
scikit-learn and compare the results with the results from the from-scratch
implementation.

K = [1, 2, 3, 4, 5, 6, 7, 8]
sklearn_accs = []
#your code here
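A sketch of the scikit-learn loop, assuming the same split as earlier in the lab (test_size=0.5, random_state=42). It reloads the data so the block runs on its own and prints the accuracies instead of plotting them.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=42
)

K = [1, 2, 3, 4, 5, 6, 7, 8]
sklearn_accs = []
for k in K:
    # Fit a fresh classifier for each candidate value of k.
    knn_model = KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(X_train, y_train)
    sklearn_accs.append(accuracy_score(y_test, knn_model.predict(X_test)))

for k, acc in zip(K, sklearn_accs):
    print(f"k={k}: accuracy={acc:.4f}")
```

To plot the curve, plt.plot(K, sklearn_accs, marker="o") with labeled axes would be one option. The same loop structure applies to the from-scratch KNN function, with the results appended to my_accs.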

Can you justify the difference between the results of the two
implementations?
