Knn-Experiments - Jupyter Notebook
In this notebook we run a bunch of tests to see how KNN is affected by the choice of k, distance
function, scaling of the predictors, presence of useless predictors, and other things.
One experiment we do not run, and which would be interesting, is to see how KNN performance
changes as a function of the size of the training set.
INSTRUCTIONS
Enter code wherever you see # YOUR CODE HERE in code cells, or YOUR TEXT HERE in markup
cells.
The diamonds dataset is good for testing KNN because it has many numeric features. See
https://www.kaggle.com/shivam2503/diamonds (https://www.kaggle.com/shivam2503/diamonds)
for information on the dataset.
<class 'pandas.core.frame.DataFrame'>
localhost:8889/notebooks/knn-experiments.ipynb# 1/10
2/14/22, 9:07 PM knn-experiments - Jupyter Notebook
Note that numeric features have different ranges. For example, the median value of carat is 0.7,
while the median value of depth is about 62. Price has a much greater median value, but we will be
using it as the target variable.
Out[170]: (DataFrame preview; columns shown include Unnamed: 0, carat, depth, table, price, x)
We will use KNN regression to predict the price of a diamond from its physical features.
We use a subset of the data set for our training and test data. Note that we keep an unscaled
version of the data for one of the experiments we will run.
(4900, 6)
Baseline performance
For regression problems, our baseline is the "blind" prediction that is just the average value of the
target variable. The blind prediction must be calculated using the training data. Calculate and print
the test set root mean squared error (test RMSE) using this blind prediction.
I have provided a function you can use for RMSE.
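As a sketch of what that looks like (the numbers below are made-up stand-ins for the diamond prices, and this `rmse` helper is a hypothetical version of the one the notebook provides):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Blind prediction: the mean of the *training* targets, applied to every test row.
y_train = np.array([100.0, 200.0, 300.0])   # stand-in values, not the real prices
y_test = np.array([150.0, 250.0])

blind = y_train.mean()                       # 200.0
baseline_rmse = rmse(y_test, np.full_like(y_test, blind))
print(f'baseline test RMSE: {baseline_rmse:.1f}')   # 50.0
```

The key point is that the blind prediction comes from the training targets only; the test targets are used solely for scoring.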
Using the training set, train a KNN regression model using the ScikitLearn KNeighborsRegressor,
and report on the test RMSE. The test RMSE is the RMSE computed using the test data set.
When using the KNN algorithm, use algorithm='brute' to get the basic KNN algorithm.
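A minimal sketch of the fit-and-score step, using synthetic stand-in data in place of the scaled diamond features:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled features and prices.
X_train = rng.normal(size=(200, 5))
y_train = 1000 * X_train[:, 0] + rng.normal(scale=50, size=200)
X_test = rng.normal(size=(50, 5))
y_test = 1000 * X_test[:, 0] + rng.normal(scale=50, size=50)

# algorithm='brute' forces the plain exhaustive-search KNN.
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
knn.fit(X_train, y_train)
test_rmse = np.sqrt(np.mean((y_test - knn.predict(X_test)) ** 2))
print(f'test RMSE: {test_rmse:.1f}')
```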
Impact of K
I provided code to test KNN on k=1, k=3, k=5, ..., k=29. For each value of k, compute the training
RMSE and test RMSE. The training RMSE is the RMSE computed using the training data. Use the
'brute' algorithm, and Euclidean distance, which is the default. You need to add the
get_train_test_rmse() function.
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 done
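One plausible way to write the missing helper — a sketch, with synthetic stand-in data so it runs on its own (in the notebook you would pass in the scaled diamond splits instead):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def get_train_test_rmse(k, X_train, y_train, X_test, y_test, **knn_kwargs):
    """Fit a KNN regressor with n_neighbors=k; return (train_rmse, test_rmse)."""
    knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute', **knn_kwargs)
    knn.fit(X_train, y_train)
    def rmse(y, pred):
        return np.sqrt(np.mean((y - pred) ** 2))
    return rmse(y_train, knn.predict(X_train)), rmse(y_test, knn.predict(X_test))

# Tiny synthetic demo so the function can be exercised stand-alone.
rng = np.random.default_rng(1)
X_tr, X_te = rng.normal(size=(100, 4)), rng.normal(size=(30, 4))
y_tr, y_te = X_tr.sum(axis=1), X_te.sum(axis=1)

results = {k: get_train_test_rmse(k, X_tr, y_tr, X_te, y_te)
           for k in range(1, 30, 2)}
```

Note that at k=1 the training RMSE is zero (each training point is its own nearest neighbor), which is why training RMSE alone is a misleading guide.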
Using the training and test RMSE values you got for each value of k, find the k associated with the
lowest test RMSE value. Print this k value and the associated lowest test RMSE value. In other
words, if you found that k=11 gave the lowest test RMSE, then print the value 11 and the test
RMSE value obtained when k=11.
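Finding the best k is a one-liner once the RMSE values are collected; the values below are hypothetical placeholders:

```python
# Hypothetical test-RMSE values per k; in the notebook these come from
# running KNN for k = 1, 3, ..., 29.
test_rmse_by_k = {1: 1450.0, 3: 1210.0, 5: 1150.0, 7: 1130.0,
                  9: 1125.0, 11: 1118.0, 13: 1121.0, 15: 1124.0}

# min() over the dict keys, ranked by their test RMSE.
best_k = min(test_rmse_by_k, key=test_rmse_by_k.get)
print(f'best k = {best_k}, test RMSE = {test_rmse_by_k[best_k]:.1f}')
```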
Plot the test and training RMSE as a function of k, for all the k values you tried.
Comments
In the markup cell below, write about what you learned from your plot. I would expect two or three
sentences, but what's most important is that you write something thoughtful.
From the plot above, I can clearly see that the test RMSE is much lower than the baseline
RMSE, which is a good sign; however, because the difference on the y axis is so large, it's hard to
see the details of the plotted test RMSE values. The general trend is that the test RMSE drops
dramatically between k=1 and k=3, then gradually decreases until around k=15, where it
plateaus.
Repeat what you did to test the impact of k, but this time use Manhattan distance as your distance
metric. Look at the options for KNeighborsRegressor() to see how to use Manhattan distance.
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 done
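A sketch of the change needed, on synthetic stand-in data; `p=1` and `metric='manhattan'` are equivalent ways to ask KNeighborsRegressor for Manhattan distance:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X_train, X_test = rng.normal(size=(150, 4)), rng.normal(size=(40, 4))
y_train, y_test = X_train[:, 0], X_test[:, 0]

# The default metric is Minkowski with p=2 (Euclidean); p=1 makes it
# Manhattan (L1). metric='manhattan' is an equivalent spelling.
knn = KNeighborsRegressor(n_neighbors=10, algorithm='brute', p=1)
knn.fit(X_train, y_train)
test_rmse = np.sqrt(np.mean((y_test - knn.predict(X_test)) ** 2))
```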
Print the value of k that gives the best test RMSE, and the test RMSE associated with that k, just
as you did in the previous section.
Plot the training and test RMSE as a function of k, just as you did in the previous section. Be sure
your plot title notes the use of Manhattan distance.
Comments
Consider what you learned from your experiment, and write a little about it. Think about how the
results changed as a result of changing the distance function.
Based on the graphs, there is very little difference between using Euclidean and Manhattan
distance on the test data I used. I do notice that the sharp decline in test RMSE at the
beginning is broken into two segments that last until about k=8 for Manhattan distance,
whereas with Euclidean distance the RMSE declines gradually starting at about k=3.
The difference is so small that I almost thought I hadn't changed the power parameter of the
Minkowski metric until I checked the best test RMSE values.
In class we heard that KNN performance goes down if useless "noisy predictors" are present.
These are predictors that don't help in making predictions. In this section, run KNN regression by
adding one noise predictor to the data, then two, then three, and then four. For each,
compute the training and test RMSE. In every case, use k=10 as the k value and use Euclidean
distance as the distance function.
The add_noise_predictor() method makes it easy to add a predictor variable of random values to
X_train or X_test.
Hint: In each iteration of your loop, add a noise predictor to copies of X_train and X_test. You don't
need to worry about rescaling the data, as the new noise predictor is already scaled. Don't modify
X_train and X_test themselves, however, as you will be using them again.
0 1 2 3 4 done
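A sketch of the loop, with a hypothetical stand-in for the provided `add_noise_predictor()` helper and synthetic data so it runs on its own:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def add_noise_predictor(X, rng):
    """Append one column of standard-normal noise (a stand-in approximation
    of the helper the notebook provides)."""
    return np.hstack([X, rng.normal(size=(X.shape[0], 1))])

rng = np.random.default_rng(3)
X_train, X_test = rng.normal(size=(200, 4)), rng.normal(size=(60, 4))
y_train, y_test = X_train @ np.ones(4), X_test @ np.ones(4)

test_rmses = []
X_tr, X_te = X_train, X_test           # grow copies; the originals stay intact
for n_noise in range(5):               # 0, 1, 2, 3, 4 noise predictors
    knn = KNeighborsRegressor(n_neighbors=10, algorithm='brute')
    knn.fit(X_tr, y_train)
    test_rmses.append(np.sqrt(np.mean((y_test - knn.predict(X_te)) ** 2)))
    X_tr = add_noise_predictor(X_tr, rng)
    X_te = add_noise_predictor(X_te, rng)
```

Each pass scores the model first and adds a noise column afterward, so `test_rmses[n]` is the RMSE with exactly n noise predictors.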
Plot the percent increase in test RMSE as a function of the number of noise predictors. The x axis
will range from 0 to 4. The y axis will show a percent increase in test RMSE.
To compute percent increase in RMSE for n noise predictors, compute 100 * (rmse -
base_rmse)/base_rmse, where base_rmse is the test RMSE with no noise predictors, and rmse is
the test RMSE when n noise predictors have been added.
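The percent-increase calculation itself, shown with hypothetical placeholder RMSE values:

```python
# Hypothetical test-RMSE values for 0..4 noise predictors.
rmses = [1100.0, 1210.0, 1180.0, 1195.0, 1450.0]

base_rmse = rmses[0]   # test RMSE with no noise predictors
pct_increase = [100 * (r - base_rmse) / base_rmse for r in rmses]
print(pct_increase)    # first entry is 0.0 by construction
```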
Comments
Look at the results you obtained and add some thoughtful commentary.
To be completely honest, I'm not sure I did this part correctly. It seems that having one noise
predictor is slightly worse than having two or three noise predictors, but with four noise predictors
the RMSE almost triples.
Impact of scaling
In class we learned that we should scale the training data before using KNN. How important is
scaling with KNN? Repeat the experiments you ran before (like in the impact of distance metric
section), but this time use unscaled data.
Run KNN as before but use the unscaled version of the data. You will vary k as before. Use
algorithm='brute' and Euclidean distance.
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 done
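To see why scaling matters so much here, a small self-contained sketch contrasting scaled and unscaled features; the synthetic data mimics the very different ranges of carat and depth, and `StandardScaler` is one common scaling choice (the notebook may use a different one):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Two features on very different scales, mimicking carat (~1) vs depth (~60).
carat = rng.uniform(0.2, 2.0, size=240)
depth = rng.uniform(55, 70, size=240)
X = np.column_stack([carat, depth])
y = 1000 * carat                      # target depends only on the small-range feature
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

def fit_rmse(X_train, X_test):
    knn = KNeighborsRegressor(n_neighbors=10, algorithm='brute')
    knn.fit(X_train, y_tr)
    return np.sqrt(np.mean((y_te - knn.predict(X_test)) ** 2))

scaler = StandardScaler().fit(X_tr)   # fit on training data only
rmse_unscaled = fit_rmse(X_tr, X_te)
rmse_scaled = fit_rmse(scaler.transform(X_tr), scaler.transform(X_te))
```

Unscaled, the wide-range but irrelevant feature dominates the distance computation, so the chosen neighbors are nearly useless and the RMSE is much worse.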
Print the best k and the test RMSE associated with the best k.
Plot training and test RMSE as a function of k. Your plot title should note the use of unscaled data.
Comments
Reflect on what happened and provide some short commentary, as in previous sections.
It's difficult to see the small changes in the test RMSE in the graph above because the training
RMSE is much higher, which zooms the graph out.
Impact of algorithm
We didn't discuss in class that there are variants of the KNN algorithm. The main purpose of the
variants is to be faster and to reduce the amount of training data that needs to be stored.
Run experiments where you test each of the three KNN algorithms supported by Scikit-Learn:
ball_tree, kd_tree, and brute. In each case, use k=10 and use Euclidean distance.
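A sketch of the comparison on synthetic stand-in data. Worth noting: all three are exact algorithms, so with the same k and metric they should find the same neighbors and produce (distance ties aside) identical RMSE; they differ in speed and memory use, not accuracy:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(5)
X_tr, X_te = rng.normal(size=(200, 4)), rng.normal(size=(50, 4))
y_tr, y_te = X_tr[:, 0], X_te[:, 0]

rmse_by_algo = {}
for algo in ('ball_tree', 'kd_tree', 'brute'):
    knn = KNeighborsRegressor(n_neighbors=10, algorithm=algo)
    knn.fit(X_tr, y_tr)
    rmse_by_algo[algo] = np.sqrt(np.mean((y_te - knn.predict(X_te)) ** 2))
```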
Print the name of the best algorithm, and the test RMSE achieved with the best algorithm.
Plot the test RMSE for each of the three algorithms as a bar plot.
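A minimal bar-plot sketch; the RMSE values here are made-up placeholders:

```python
import matplotlib
matplotlib.use('Agg')             # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Made-up placeholder values, one per algorithm.
rmse_by_algo = {'ball_tree': 1132.4, 'kd_tree': 1132.4, 'brute': 1132.4}

fig, ax = plt.subplots()
ax.bar(list(rmse_by_algo), list(rmse_by_algo.values()))
ax.set_ylabel('test RMSE')
ax.set_title('Test RMSE by KNN algorithm (k=10, Euclidean)')
fig.savefig('rmse_by_algorithm.png')
```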
Comments
It appears the brute algorithm was best, but I'm a little confused by my plot, and I'm not sure
I'm interpreting the data correctly.
Impact of weighting
It was briefly mentioned in lecture that there is a variant of KNN in which training points are given
more weight when they are closer to the point for which a prediction is to be made. The 'weights'
parameter of KNeighborsRegressor() has two possible values: 'uniform' and 'distance'. Uniform is
the basic algorithm.
Run an experiment similar to the previous one. Compute the test RMSE for uniform and distance
weighting, using k=10, the brute algorithm, and Euclidean distance.
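A sketch of the comparison, again on synthetic stand-in data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(6)
X_tr, X_te = rng.normal(size=(200, 3)), rng.normal(size=(50, 3))
y_tr, y_te = X_tr.sum(axis=1), X_te.sum(axis=1)

rmse_by_weights = {}
for w in ('uniform', 'distance'):
    # weights='distance' averages the k neighbors weighted by 1/distance.
    knn = KNeighborsRegressor(n_neighbors=10, algorithm='brute', weights=w)
    knn.fit(X_tr, y_tr)
    rmse_by_weights[w] = np.sqrt(np.mean((y_te - knn.predict(X_te)) ** 2))

best = min(rmse_by_weights, key=rmse_by_weights.get)
print(f'best weighting: {best}, test RMSE: {rmse_by_weights[best]:.3f}')
```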
Print the weighting that gave the lowest test RMSE, and the test RMSE it achieved.
Create a bar plot showing the test RMSE for the uniform and distance weighting options.
Comments
The results show the best weighting achieved a test RMSE of 55817.065, and changing the
weighting gives a smaller RMSE. I did feel that the results were very similar, and plotting the
data helped me visualize them better.
Conclusions
Please provide at least a few sentences of commentary on the main things you've learned from the
experiments you've run.
This lab was definitely trial and error to get values to show correctly. In a few instances I
expected different results but tried to learn from what the data showed. For a few questions, I was
not sure if I was obtaining the correct data, and I had to go back to the modules to check that I was
doing things correctly.