Vertopal.com Lab4 KNN
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
print(iris)
As you can see, the dataset is in the form of a dictionary. What are the keys of the
dictionary?
What is the value of the key data? Assign the value to a variable X.
What is the value of the key target? Assign the value to a variable y.
What is the value of the key target_names? Assign the value to a variable
target_names.
What is the value of the key feature_names? Assign the value to a variable
feature_names.
#Solution
X = iris['data']
y = iris['target']
feature_names = iris['feature_names']
target_names = iris['target_names']
# Note: you can also access the elements with the dot (.) operator, e.g.,
# X = iris.data
print(type(X))
print(type(y))
print(X.shape)
print(y.shape)
print(feature_names)
print(target_names)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
(150, 4)
(150,)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width
(cm)']
['setosa' 'versicolor' 'virginica']
The figure below illustrates the features and target labels of the iris
dataset.
Iterate over all datapoints in X and calculate the area of Sepal and Petal for each
flower in the dataset.
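A minimal sketch of this exercise, assuming that "area" means the simple rectangle approximation length × width (the lab does not define it). The names sepal_areas and petal_areas are illustrative choices.

```python
import numpy as np
from sklearn import datasets

# Columns: sepal length, sepal width, petal length, petal width (all in cm).
X = datasets.load_iris()["data"]

sepal_areas = []
petal_areas = []
for sample in X:
    sepal_length, sepal_width, petal_length, petal_width = sample
    # Assumption: area is approximated as length * width.
    sepal_areas.append(sepal_length * sepal_width)
    petal_areas.append(petal_length * petal_width)

sepal_areas = np.array(sepal_areas)
petal_areas = np.array(petal_areas)
print(sepal_areas.shape, petal_areas.shape)
```

The same result can be obtained without a loop as X[:, 0] * X[:, 1] and X[:, 2] * X[:, 3]; the explicit loop is kept here because the task asks for iteration.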
plt.figure()
plt.boxplot(X)
plt.ylabel("[cm]")
plt.xticks(range(1, len(feature_names) + 1), feature_names)  # one tick label per box
plt.show()
Plot the scatter plot for the pair of first and second features
(X[:,0], X[:,1]).
Hint: use c=y inside the scatter plot to color the points based on the
target labels.
Write a function called plot_pairwise that takes a pair of features and their
labels and plots the scatter plot.
Use the plot_pairwise function to plot the scatter plot for all pairs of features.
(Optional) The plots shown above do not have legend. To add legend to
the plot, you can use the following code snippet.
plt.xlabel(x1_label)
plt.ylabel(x2_label)
plt.legend()
plt.show()
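One possible sketch of plot_pairwise with a legend. The signature is an assumption: the lab only says the function takes a pair of features and their labels, so the axis-label and class-name parameters are additions made here. Drawing one scatter call per class gives plt.legend() an entry for each class.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris["data"], iris["target"]
feature_names, target_names = iris["feature_names"], iris["target_names"]

def plot_pairwise(x1, x2, labels, x1_label, x2_label, class_names):
    """Scatter plot of two features with one colour and legend entry per class."""
    fig, ax = plt.subplots()
    for class_id, class_name in enumerate(class_names):
        mask = labels == class_id  # boolean mask selecting one class
        ax.scatter(x1[mask], x2[mask], label=class_name)
    ax.set_xlabel(x1_label)
    ax.set_ylabel(x2_label)
    ax.legend()
    return ax

# Plot every unordered pair of distinct features (6 plots for 4 features).
axes = []
for i in range(X.shape[1]):
    for j in range(i + 1, X.shape[1]):
        axes.append(plot_pairwise(X[:, i], X[:, j], y,
                                  feature_names[i], feature_names[j], target_names))
```

Outside a notebook, call plt.show() after the loop to display the figures.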
Given two points $ P(x_1, y_1) $ and $ Q(x_2, y_2) $ in a 2D plane, the
Euclidean distance between them is calculated as follows:
$$ d(P, Q) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} $$
Example (2D)
- $ P(2, 2) $
- $ Q(5, 5) $
P = np.array([2, 2])
Q = np.array([5, 5])
distance = np.sqrt(np.sum((P - Q)**2))
distance
np.float64(4.242640687119285)
Example (3 Dimensions)
- $ P_1(1, 2, 3) $
- $ P_2(4, 0, 8) $
Write a function that takes two NumPy arrays P and Q and returns the Euclidean distance
between them.
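A minimal sketch of such a function; the name euclidean_distance is an assumption, since the lab does not fix one. It works for any number of dimensions as long as both arrays have the same length.

```python
import numpy as np

def euclidean_distance(P, Q):
    """Return the Euclidean distance between two points of equal dimension."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    return np.sqrt(np.sum((P - Q) ** 2))

# 3D example from above: P_1(1, 2, 3) and P_2(4, 0, 8).
# The squared differences are 9 + 4 + 25 = 38, so the distance is sqrt(38).
print(euclidean_distance(np.array([1, 2, 3]), np.array([4, 0, 8])))
```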
KNN Algorithm
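The cell that creates the train/test split is not reproduced in this document. The sketch below shows what such a cell might look like. It assumes a 50/50 split, which is consistent with the 75 training indices and 75 test comparisons that appear in the outputs further down, and an arbitrary random_state chosen only for reproducibility. Neither value is confirmed by the lab.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris["data"], iris["target"]

# Assumptions: test_size=0.5 and random_state=42 are illustrative,
# not the lab's confirmed settings.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

print(X_train.shape, X_test.shape)  # features for training / testing
print(y_train.shape, y_test.shape)  # labels for training / testing
```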
Explain each term in the cell above: X_train, X_test, y_train, and y_test.
????
1 - Calculate distances
Take one sample from test set and find the distance between this sample and all
samples in the training set. In addition to the distance, you need to store the
index of the sample in the training set.
So, for example, if the distance between the test sample and the 5th sample in the
training set is 3.5, you need to store (5, 3.5).
test_instance = X_test[0]
Write a function called calculate_distances that takes the test sample and the
training set and returns the distances and the indices of the training samples.
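A minimal sketch of calculate_distances, assuming the output is a list of (index, distance) tuples as in the (5, 3.5) example above. The tiny demo arrays are made up to illustrate the output format.

```python
import numpy as np

def calculate_distances(test_instance, X_train):
    """Return a list of (train_index, distance) tuples for one test sample."""
    distances = []
    for index, train_instance in enumerate(X_train):
        # Euclidean distance between the test sample and one training sample.
        distance = np.sqrt(np.sum((test_instance - train_instance) ** 2))
        distances.append((index, distance))
    return distances

# Made-up data: three 2D training samples and one test sample.
X_train_demo = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
test_instance_demo = np.array([0.0, 0.0])
distances_demo = calculate_distances(test_instance_demo, X_train_demo)
print(distances_demo)
```

The list has one tuple per training sample, in the original training order; sorting by distance is left to the next step.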
What do you pass as input to the function calculate_distances? What do you get as output
when you call this function?
????
What is the shape of the input arrays to the function calculate_distances? What is the
shape of the output?
???
2 - Find neighbors
[(34, np.float64(0.22360679774997896)),
(45, np.float64(0.30000000000000027)),
(28, np.float64(0.5099019513592785)),
(35, np.float64(0.5099019513592788)),
(66, np.float64(0.5196152422706639)),
(47, np.float64(0.5291502622129183)),
(17, np.float64(0.5830951894845297)),
(36, np.float64(0.6164414002968978)),
(65, np.float64(0.6244997998398398)),
(41, np.float64(0.6480740698407859)),
(48, np.float64(0.6999999999999995)),
(70, np.float64(0.7071067811865478)),
(63, np.float64(0.728010988928052)),
(23, np.float64(0.741619848709566)),
(14, np.float64(0.754983443527075)),
(68, np.float64(0.774596669241483)),
(73, np.float64(0.7874007874011811)),
(0, np.float64(0.8124038404635955)),
(50, np.float64(0.8124038404635965)),
(9, np.float64(0.8602325267042631)),
(60, np.float64(0.9273618495495711)),
(18, np.float64(0.9433981132056598)),
(67, np.float64(0.9643650760992956)),
(20, np.float64(0.9746794344808962)),
(5, np.float64(0.9746794344808963)),
(37, np.float64(1.0049875621120894)),
(42, np.float64(1.0440306508910553)),
(2, np.float64(1.0535653752852738)),
(64, np.float64(1.0954451150103324)),
(62, np.float64(1.1045361017187258)),
(8, np.float64(1.1575836902790226)),
(44, np.float64(1.224744871391589)),
(43, np.float64(1.296148139681572)),
(11, np.float64(1.2999999999999998)),
(71, np.float64(1.3490737563232036)),
(38, np.float64(1.3490737563232043)),
(31, np.float64(1.407124727947029)),
(40, np.float64(1.4247806848775015)),
(1, np.float64(1.438749456993816)),
(52, np.float64(1.5556349186104048)),
(56, np.float64(1.6186414056238647)),
(29, np.float64(1.6278820596099706)),
(58, np.float64(1.6431676725154982)),
(16, np.float64(1.7349351572897476)),
(74, np.float64(1.8138357147217057)),
(55, np.float64(1.8165902124584952)),
(24, np.float64(1.8493242008906932)),
(4, np.float64(1.8601075237738276)),
(54, np.float64(1.8973665961010275)),
(32, np.float64(1.9157244060668017)),
(15, np.float64(1.997498435543818)),
(61, np.float64(2.0346989949375804)),
(51, np.float64(2.090454496036687)),
(19, np.float64(2.4020824298928627)),
(69, np.float64(3.2939338184001206)),
(3, np.float64(3.3674916480965473)),
(13, np.float64(3.4161381705077445)),
(39, np.float64(3.551056180912941)),
(49, np.float64(3.5623026261113755)),
(53, np.float64(3.5623026261113755)),
(10, np.float64(3.5735136770411273)),
(12, np.float64(3.5791060336346563)),
(26, np.float64(3.6318039594669758)),
(6, np.float64(3.6537651812890224)),
(59, np.float64(3.6565010597564442)),
(25, np.float64(3.685105154537656)),
(57, np.float64(3.765634076752546)),
(30, np.float64(3.782856063875548)),
(7, np.float64(3.823610858861032)),
(33, np.float64(3.8314488121336034)),
(72, np.float64(3.844476557348217)),
(21, np.float64(3.845776904605882)),
(46, np.float64(3.8961519477556315)),
(27, np.float64(3.9357337308308855)),
(22, np.float64(4.177319714841085))]
Step 2: Select the first k elements of the sorted list. And, store the
index of these k elements in a list.
k = 5
distances[:k]
[(34, np.float64(0.22360679774997896)),
(45, np.float64(0.30000000000000027)),
(28, np.float64(0.5099019513592785)),
(35, np.float64(0.5099019513592788)),
(66, np.float64(0.5196152422706639))]
Extract the index of the k nearest neighbors from (index, distance) tuples.
neighbor_index = []
# your code here
Step 3: Find the labels of these top k samples from y_train array.
neighbor_label = []
#your code here
Output
neighbor_label: list of the labels of the k nearest neighbors
"""
#your code here
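The three steps above can be combined into one function. This is a sketch only: the signature find_neighbors(distances, y_train, k) is an assumption, because the lab's function header is not shown, and the demo values are made up.

```python
import numpy as np

def find_neighbors(distances, y_train, k):
    """Return the labels of the k training samples closest to the test sample."""
    # Step 1: sort the (index, distance) tuples by distance, smallest first.
    sorted_distances = sorted(distances, key=lambda pair: pair[1])
    # Step 2: keep the indices of the first k tuples.
    neighbor_index = [index for index, _ in sorted_distances[:k]]
    # Step 3: look up the labels of those training samples.
    neighbor_label = [y_train[index] for index in neighbor_index]
    return neighbor_label

# Made-up distances and labels, used only to illustrate the call.
distances_demo = [(0, 2.0), (1, 0.5), (2, 1.0), (3, 3.0)]
y_train_demo = np.array([0, 1, 1, 2])
neighbor_label_demo = find_neighbors(distances_demo, y_train_demo, k=3)
print(neighbor_label_demo)
```

In the demo, the three nearest training samples are those with indices 1, 2 and 0, so the returned labels are those stored at these positions of y_train_demo.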
What do you pass as input to the function find_neighbors? What do you get as output when
you call this function?
???
What is the shape of the input arrays to the function find_neighbors? What is the shape of
the output?
???
Explain which operations are performed inside the function find_neighbors to obtain
the labels of the k nearest neighbors.
???
3 - Vote on labels
def vote_on_labels(neighbor_label):
    # Count how many times each label occurs among the neighbors.
    prediction_dict = {}
    for label in neighbor_label:
        if label in prediction_dict:
            prediction_dict[label] += 1
        else:
            prediction_dict[label] = 1
    # The label with the highest count wins the vote.
    prediction = max(prediction_dict, key=prediction_dict.get)
    return prediction
y_pred = vote_on_labels(neighbor_label)
y_pred
np.int64(1)
What do you pass as input to the function vote_on_labels? What do you get as output when
you call this function?
????
What is the shape of the input to the function vote_on_labels? What is the shape of
the output?
???
Now iterate over all data points of X_test and predict their labels.
y_pred = []
#your code here
Turn your code into a function KNN that takes the training set, the target labels of the
training set, the test set, and the value of k and returns the predicted labels of
the test set.
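A compact sketch of such a KNN function. It is not the lab's reference solution: instead of calling the helper functions from the previous steps, it uses NumPy broadcasting and collections.Counter so that the block is self-contained. The demo arrays are made up.

```python
import numpy as np
from collections import Counter

def KNN(X_train, y_train, X_test, k):
    """Predict a label for every row of X_test by majority vote among k neighbors."""
    y_pred = []
    for test_instance in X_test:
        # Distance from this test sample to every training sample (broadcasting).
        distances = np.sqrt(np.sum((X_train - test_instance) ** 2, axis=1))
        # Indices of the k smallest distances.
        neighbor_index = np.argsort(distances)[:k]
        neighbor_label = y_train[neighbor_index]
        # Majority vote; on a tie, the label encountered first wins.
        prediction = Counter(neighbor_label.tolist()).most_common(1)[0][0]
        y_pred.append(prediction)
    return np.array(y_pred)

# Made-up data: two well-separated clusters with labels 0 and 1.
X_train_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train_demo = np.array([0, 0, 1, 1])
X_test_demo = np.array([[0.0, 0.5], [5.0, 5.5]])
y_pred_demo = KNN(X_train_demo, y_train_demo, X_test_demo, k=3)
print(y_pred_demo)
```

Returning a NumPy array rather than a list makes the element-wise comparison y_test == y_pred work directly.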
y_test == y_pred
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, False, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, False, True, True, True, True, True, True, True,
True, True, True, True, True, True, False, True, True,
True, True, True, True, True, True, True, True, True,
False, True, True])
accuracy: 94.66666666666667 %
Turn your code into a function evaluate that takes the predicted labels and the
true labels and returns the accuracy of the model.
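A minimal sketch of evaluate, assuming the accuracy is reported in percent as in the output above. The demo reproduces the situation shown there, 71 correct predictions out of 75, with made-up label arrays.

```python
import numpy as np

def evaluate(y_pred, y_true):
    """Return the accuracy in percent: correct predictions / all predictions * 100."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    return np.mean(y_pred == y_true) * 100

# Made-up labels: 75 samples, 4 of them deliberately predicted wrongly.
y_true_demo = np.zeros(75, dtype=int)
y_pred_demo = y_true_demo.copy()
y_pred_demo[:4] = 1
print(f"accuracy: {evaluate(y_pred_demo, y_true_demo)} %")
```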
KNN in Scikit-Learn
Accuracy: 93.33%
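The cell that produced this accuracy is not reproduced in this document. The sketch below shows how the scikit-learn classifier is typically used, with n_neighbors=3 to match the k=3 mentioned in the text. The split settings (test_size=0.5, random_state=42) are assumptions, so the printed accuracy may differ from the value above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
# Assumption: illustrative 50/50 split; the lab's actual split settings are not shown.
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], test_size=0.5, random_state=42)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)              # store the training samples
y_pred_sklearn = model.predict(X_test)   # majority vote among the 3 nearest neighbors

print(f"Accuracy: {accuracy_score(y_test, y_pred_sklearn) * 100:.2f}%")
```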
So far we have used k=3. Now, we are going to find the best value of k
for the KNN algorithm.
K = [1, 2, 3, 4, 5, 6, 7, 8]
my_accs = []
# your code here
K = [1, 2, 3, 4, 5, 6, 7, 8]
sklearn_accs = []
#your code here
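One way the scikit-learn loop could be filled in is sketched below. The split settings are again assumptions, and best_k is a hypothetical helper variable that is not part of the lab. The loop for my_accs would have the same structure, calling your own KNN function instead of the classifier.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
# Assumption: illustrative 50/50 split; the lab's actual split settings are not shown.
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], test_size=0.5, random_state=42)

K = [1, 2, 3, 4, 5, 6, 7, 8]
sklearn_accs = []
for k in K:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test)) * 100
    sklearn_accs.append(acc)

# First k that reaches the highest accuracy.
best_k = K[sklearn_accs.index(max(sklearn_accs))]
print(best_k, sklearn_accs)
```

When comparing the two implementations, even values of k are worth a close look, because that is where tied votes can occur and the tie-breaking rule of each implementation matters.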
Can you justify the difference between the results of the two
implementations?