K-Nearest Neighbor Algorithm: Dataset Preparation

Knn algorithm

Uploaded by

zarifahmed180
Copyright
© All Rights Reserved

K-Nearest Neighbor Algorithm

Code for loading the dataset into a 2D Python list: here
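The linked loading code is not reproduced in this document. As a rough sketch only, and assuming the dataset is stored as a comma-separated text file with one sample per line and the label or target in the last column, a loader using only the standard library might look like this. The names parse_rows, load_dataset and to_number are hypothetical and are not taken from the linked code.

```python
import csv


def to_number(field):
    """Convert a CSV field to float when possible; otherwise keep it as a string."""
    try:
        return float(field)
    except ValueError:
        return field.strip()


def parse_rows(lines):
    """Turn an iterable of CSV text lines into a 2D list, skipping empty lines."""
    dataset = []
    for row in csv.reader(lines):
        if not row:  # csv.reader yields [] for a blank line
            continue
        dataset.append([to_number(field) for field in row])
    return dataset


def load_dataset(path):
    """Read a CSV file (assumed to have no header row) into a 2D list."""
    with open(path, newline="") as f:
        return parse_rows(f)
```

Numeric columns become floats and a text class label stays a string, so each row keeps the shape [feature_1, ..., feature_N-1, output].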

Dataset preparation:
Randomly split the dataset into Training (70%), Validation (15%) and Test (15%) sets.
Train_set=[], Val_set=[], Test_set=[]
//The following code randomly assigns each sample to one of the three sets, so the split sizes are approximate
1. for each sample S in the dataset:
2.     generate a random number R in the range of [0,1]
3.     if R>=0 and R<=0.7:
4.         append S to Train_set
5.     elif R>0.7 and R<=0.85:
6.         append S to Val_set
7.     else:
8.         append S to Test_set
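The split pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a reference solution: the function name split_dataset and the optional seed parameter are assumptions added so that a run can be repeated.

```python
import random


def split_dataset(dataset, train_frac=0.70, val_frac=0.15, seed=None):
    """Randomly assign each sample to the train, validation or test list."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train_set, val_set, test_set = [], [], []
    for sample in dataset:
        r = rng.random()  # uniform in [0, 1)
        if r <= train_frac:
            train_set.append(sample)
        elif r <= train_frac + val_frac:
            val_set.append(sample)
        else:
            test_set.append(sample)
    return train_set, val_set, test_set
```

Because every sample is assigned independently, the three sets are only about 70%, 15% and 15% of the data; the exact sizes change from run to run unless the seed is fixed.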

KNN Classification:
Use the Iris dataset (iris).
K=5
1. for each sample V in the VALIDATION set:
2.     for each sample T in the TRAINING set:
3.         Find the Euclidean distance between Vx and Tx (the features, i.e. the first N-1 columns of a sample with N columns)
4.         Store T and the distance in list L (L starts empty for each V)
5.     Sort L in ascending order of distance
6.     Take the first K samples
7.     Take the majority class from the K samples (this is the detected class for sample V)
8.     Check whether this class matches the true class of V
9. Calculate validation_accuracy = (correct VALIDATION samples)/(total VALIDATION samples) * 100
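The classification steps above can be sketched in plain Python as follows. This is one possible reading of the pseudocode, assuming each sample is a list whose last element is the class label; the function names are hypothetical, and the tie-breaking rule (the class met first among the nearest neighbours wins) is an assumption, since the pseudocode does not specify one.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train_set, features, k):
    """Return the majority class among the k training samples nearest to features."""
    # Build a fresh (distance, label) list for this query sample.
    distances = [(euclidean_distance(features, t[:-1]), t[-1]) for t in train_set]
    distances.sort(key=lambda pair: pair[0])  # ascending order of distance
    nearest_labels = [label for _, label in distances[:k]]
    # On a tie, Counter returns the class that appears first in nearest_labels.
    return Counter(nearest_labels).most_common(1)[0][0]


def accuracy(train_set, eval_set, k):
    """Percentage of samples in eval_set whose predicted class matches the true class."""
    correct = sum(1 for v in eval_set if knn_classify(train_set, v[:-1], k) == v[-1])
    return correct / len(eval_set) * 100
```

Passing the validation set as eval_set gives the validation accuracy; passing the test set gives the test accuracy with no other change.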
Calculate validation accuracy in the same way for K = 1, 3, 5, 10, 15.
Make a table with 2 columns: K and Validation Accuracy (report template).
Now, take the K with the highest Validation Accuracy.
Use this best K to determine Test Accuracy (simply replace the VALIDATION set with the TEST set).
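The K-tuning procedure described above might be organised as in the sketch below. The helper names tune_k and format_table are hypothetical, and score_fn stands for any function you supply that takes K and returns the validation score for that K. When several K values tie, this sketch keeps the first one in k_values, which is an assumption rather than a rule stated in the assignment.

```python
def tune_k(score_fn, k_values=(1, 3, 5, 10, 15), higher_is_better=True):
    """Score every candidate K and return (table, best_k)."""
    table = [(k, score_fn(k)) for k in k_values]
    pick = max if higher_is_better else min
    # max/min return the first best row, so ties go to the earlier K.
    best_k, _ = pick(table, key=lambda row: row[1])
    return table, best_k


def format_table(table, score_name):
    """Render the (K, score) rows as a two-column text table."""
    lines = [f"{'K':>3} | {score_name}"]
    for k, score in table:
        lines.append(f"{k:>3} | {score:.2f}")
    return "\n".join(lines)
```

For an accuracy score the default higher_is_better=True picks the largest value; for an error measure such as mean squared error, higher_is_better=False picks the smallest.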

KNN Regression:
Use the Diabetes dataset (diabetes).
K = 5, Error = 0
1. for each sample V in the VALIDATION set:
2.     for each sample T in the TRAINING set:
3.         Find the Euclidean distance between Vx and Tx
4.         Store T and the distance in list L (L starts empty for each V)
5.     Sort L in ascending order of distance
6.     Take the first K samples
7.     Take the average output of the K samples (this is the determined output for sample V)
8.     Error = Error + (V true output - V determined output)^2
9. Calculate Mean_Squared_Error = Error/(total number of samples in VALIDATION set)
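The regression steps above can be sketched in plain Python as follows, again assuming each sample is a list whose last element is the numeric output. The function names are hypothetical and the code is an illustration of the pseudocode, not a reference solution.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_regress(train_set, features, k):
    """Return the average output of the k training samples nearest to features."""
    distances = [(euclidean_distance(features, t[:-1]), t[-1]) for t in train_set]
    distances.sort(key=lambda pair: pair[0])  # ascending order of distance
    nearest_outputs = [output for _, output in distances[:k]]
    return sum(nearest_outputs) / len(nearest_outputs)


def mean_squared_error(train_set, eval_set, k):
    """Average squared difference between true and predicted outputs on eval_set."""
    error = 0.0  # reset for every call, i.e. for every K that is evaluated
    for v in eval_set:
        predicted = knn_regress(train_set, v[:-1], k)
        error += (v[-1] - predicted) ** 2
    return error / len(eval_set)
```

Passing the validation set as eval_set gives the validation error; passing the test set gives the test error for the chosen K.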

Calculate Mean_Squared_Error in the same way for K = 1, 3, 5, 10, 15 (reset Error to 0 for each K).
Make a table with 2 columns: K and Mean_Squared_Error (report template).
Now, take the K with the minimum Mean_Squared_Error.
Use this best K to determine Mean_Squared_Error for the Test set (simply replace the VALIDATION set with the TEST set).
Instruction
● Submit the .ipynb file and the report (report template) as a .pdf file.
● DO NOT USE LIBRARIES SUCH AS "sklearn" (scikit-learn) or "pandas" for this assignment
● Copying will result in -100% penalty

Marks Distribution
(1) Dataset loading: 1.5
(2) Train, Validation, Test split: 2.5
(3) KNN classification algorithm + K tuning (table) + test accuracy : 5 + 1.5 + 1.5
(4) KNN regression algorithm + K tuning (table) + test mean squared error : 5 + 1.5 + 1.5

Dataset description:

Diabetes
[source: Diabetes dataset, sklearn.datasets.load_diabetes — scikit-learn 1.1.1 documentation]
Number of Instances: 442
Number of Attributes: First 10 columns are numeric predictive values
Target: Column 11 is a quantitative measure of disease progression one year after baseline
Attribute Information:
● age: age in years
● sex
● bmi: body mass index
● bp: average blood pressure
● s1: tc, total serum cholesterol
● s2: ldl, low-density lipoproteins
● s3: hdl, high-density lipoproteins
● s4: tch, total cholesterol / HDL
● s5: ltg, possibly log of serum triglycerides level
● s6: glu, blood sugar level

Iris
[source: 7.1. Toy datasets — scikit-learn 1.1.1 documentation]
Number of Instances: 150 (50 in each of three classes)
Number of Attributes: 4 numeric, predictive attributes and the class
Attribute Information:
● sepal length in cm
● sepal width in cm
● petal length in cm
● petal width in cm
● class:
○ Iris-Setosa
○ Iris-Versicolour
○ Iris-Virginica
Resources
7.1. Toy datasets — scikit-learn 1.0.2 documentation

● Dataset (samples, features/attributes, label/classes)
○ iris, diabetes
● Model: high-level concept from the perspective of supervised learning
● Supervised learning, classification, regression
● Dataset -> train, val, test
● KNN high-level overview
● KNN pseudocode
● Instructions
● Classification: majority
● Regression: squared error
https://www.quora.com/What-are-industry-applications-of-the-K-nearest-neighbor-algorithm
https://stackoverflow.com/questions/53704811/is-k-nearest-neighbors-algorithm-used-a-lot-in-real-life
