K-Nearest Neighbor Algorithm: Dataset Preparation
Dataset preparation:
Randomly split the dataset into training (70%), validation (15%), and test (15%) sets.
Train_set=[], Val_set=[], Test_set=[]
// The following pseudocode randomly splits your dataset list (it does not shuffle it):
1. for each sample S in the dataset:
2.     generate a random number R in the range [0, 1]
3.     if R >= 0 and R <= 0.7:
4.         append S to Train_set
5.     elif R > 0.7 and R <= 0.85:
6.         append S to Val_set
7.     else:
8.         append S to Test_set
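The split above can be sketched in Python as follows. The function and variable names are illustrative, not prescribed by the assignment, and the fixed seed is an assumption added for reproducibility:

```python
import random

def split_dataset(dataset, seed=42):
    # Seed is an assumption: fix it so the split is reproducible across runs.
    rng = random.Random(seed)
    train_set, val_set, test_set = [], [], []
    for sample in dataset:
        r = rng.random()  # uniform random number in [0, 1)
        if r <= 0.7:          # ~70% of samples
            train_set.append(sample)
        elif r <= 0.85:       # ~15% of samples
            val_set.append(sample)
        else:                 # remaining ~15%
            test_set.append(sample)
    return train_set, val_set, test_set
```

Note that the split is only approximately 70/15/15: each sample lands in a set independently, so the exact counts vary from run to run.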
KNN Classification:
Use the Iris dataset (iris)
K=5
1. for each sample V in the VALIDATION set:
2.     for each sample T in the TRAINING set:
3.         find the Euclidean distance between Vx and Tx (the first N-1 columns, i.e. the features)
4.         store T and the distance in list L (L starts empty for each V)
5.     sort L in ascending order of distance
6.     take the first K samples
7.     take the majority class among the K samples (this is the predicted class for sample V)
8.     check whether this predicted class matches V's true class
9. Calculate validation_accuracy = (correct VALIDATION samples) / (total VALIDATION samples) * 100
Calculate the validation accuracy in the same way for K = 1, 3, 5, 10, 15.
Make a table with 2 columns: K and Validation Accuracy (see the report template).
Now take the K with the highest validation accuracy.
Use this best K to determine the test accuracy (simply replace the VALIDATION set with the TEST set).
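The classification loop and the K-tuning step can be sketched as below. Samples are assumed to be `(features, label)` pairs; all names are illustrative:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two equal-length feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_set, query_features, k):
    # Distance from the query to every training sample; L starts empty each call.
    dists = [(euclidean(query_features, tx), ty) for tx, ty in train_set]
    dists.sort(key=lambda d: d[0])                 # ascending by distance
    top_k = [label for _, label in dists[:k]]      # labels of the K nearest
    return Counter(top_k).most_common(1)[0][0]     # majority class

def accuracy(train_set, eval_set, k):
    correct = sum(1 for vx, vy in eval_set
                  if knn_classify(train_set, vx, k) == vy)
    return correct / len(eval_set) * 100
```

K tuning is then a loop such as `for k in (1, 3, 5, 10, 15): print(k, accuracy(train_set, val_set, k))`, after which the best K is reused with the test set in place of the validation set.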
KNN Regression:
Use the Diabetes dataset (diabetes)
K = 5, Error = 0
1. for each sample V in the VALIDATION set:
2.     for each sample T in the TRAINING set:
3.         find the Euclidean distance between Vx and Tx
4.         store T's output (Ty) and the distance in list L (L starts empty for each V)
5.     sort L in ascending order of distance
6.     take the first K samples
7.     take the average output of the K samples (this is the predicted output for sample V)
8.     Error = Error + (V's true output - V's predicted output)^2
9. Calculate Mean_Squared_Error = Error / (total number of samples in the VALIDATION set)
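The regression variant replaces the majority vote with an average of the K nearest targets. A minimal sketch, again assuming `(features, target)` pairs and illustrative names:

```python
import math

def euclidean(a, b):
    # Euclidean distance between two equal-length feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_regress(train_set, query_features, k):
    # Sort all training samples by distance to the query (L is rebuilt each call).
    nearest = sorted(train_set, key=lambda t: euclidean(query_features, t[0]))
    # Average the targets of the K nearest neighbors.
    return sum(ty for _, ty in nearest[:k]) / k

def mean_squared_error(train_set, eval_set, k):
    error = sum((vy - knn_regress(train_set, vx, k)) ** 2
                for vx, vy in eval_set)
    return error / len(eval_set)
```

As with classification, K is tuned on the validation set and the best K is then used to report the test mean squared error.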
Marks Distribution
(1) Dataset loading: 1.5
(2) Train, Validation, Test split: 2.5
(3) KNN classification algorithm + K tuning (table) + test accuracy : 5 + 1.5 + 1.5
(4) KNN regression algorithm + K tuning (table) + test mean squared error : 5 + 1.5 + 1.5
Dataset description:
Diabetes
[source: Diabetes dataset, sklearn.datasets.load_diabetes — scikit-learn 1.1.1 documentation]
Number of Instances: 442
Number of Attributes: First 10 columns are numeric predictive values
Target: Column 11 is a quantitative measure of disease progression one year after baseline
Attribute Information:
● age in years
● sex
● bmi body mass index
● bp average blood pressure
● s1 tc, total serum cholesterol
● s2 ldl, low-density lipoproteins
● s3 hdl, high-density lipoproteins
● s4 tch, total cholesterol / HDL
● s5 ltg, possibly log of serum triglycerides level
● s6 glu, blood sugar level
Iris:
Source: [7.1. Toy datasets — scikit-learn 1.1.1 documentation]
Number of Instances 150 (50 in each of three classes)
Number of Attributes 4 numeric, predictive attributes and the class
Attribute Information
● sepal length in cm
● sepal width in cm
● petal length in cm
● petal width in cm
● class:
○ Iris-Setosa
○ Iris-Versicolour
○ Iris-Virginica
Resources
7.1. Toy datasets — scikit-learn 1.0.2 documentation
● Classification: majority vote
● Regression: mean squared error
https://www.quora.com/What-are-industry-applications-of-the-K-nearest-neighbor-algorithm
https://stackoverflow.com/questions/53704811/is-k-nearest-neighbors-algorithm-used-a-lot-in-real-life