
Knn-Experiments - Jupyter Notebook

The document describes experiments conducted to test how K-nearest neighbors (KNN) regression is affected by various hyperparameters and settings. It tests the impact of the number of neighbors k, distance metrics, scaling of predictors, and presence of useless predictors. The best k value was found to be 13 when using Euclidean distance, and also 13 when using Manhattan distance. Plotting the training and test error against k showed error decreasing until around k=15 before plateauing.


2/14/22, 9:07 PM knn-experiments - Jupyter Notebook

KNN regression experiments


In class we learned about how KNN regression works, and tips for using KNN. For example, we
learned that data should be scaled when using KNN, and that extra, useless predictors should not
be used with KNN. Are these tips really correct?

In this notebook we run a series of tests to see how KNN is affected by the choice of k, distance
function, scaling of the predictors, presence of useless predictors, and other things.

One experiment we do not run, and which would be interesting, is to see how KNN performance
changes as a function of the size of the training set.

INSTRUCTIONS
Enter code wherever you see # YOUR CODE HERE in code cells, or YOU TEXT HERE in markup
cells.


Read the data and take a first look at it

The diamonds dataset is good for testing KNN because it has many numeric features. See
https://www.kaggle.com/shivam2503/diamonds for information on the dataset.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  53940 non-null  int64
 1   carat       53940 non-null  float64
 2   cut         53940 non-null  object
 3   color       53940 non-null  object
 4   clarity     53940 non-null  object
 5   depth       53940 non-null  float64
 6   table       53940 non-null  float64
 7   price       53940 non-null  int64
 8   x           53940 non-null  float64
 9   y           53940 non-null  float64
 10  z           53940 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.5+ MB

localhost:8889/notebooks/knn-experiments.ipynb# 1/10

Note that numeric features have different ranges. For example, the median value of carat is 0.7,
while the median value of depth is about 62. Price has a much greater median value, but we will be
using it as the target variable.

Out[170]:
         Unnamed: 0         carat         depth         table         price             x
count  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000
mean   26970.500000      0.797940     61.749405     57.457184   3932.799722      5.731157
std    15571.281097      0.474011      1.432621      2.234491   3989.439738      1.121761
min        1.000000      0.200000     43.000000     43.000000    326.000000      0.000000
25%    13485.750000      0.400000     61.000000     56.000000    950.000000      4.710000
50%    26970.500000      0.700000     61.800000     57.000000   2401.000000      5.700000
75%    40455.250000      1.040000     62.500000     59.000000   5324.250000      6.540000
max    53940.000000      5.010000     79.000000     95.000000  18823.000000     10.740000

Prepare data for machine learning

We will use KNN regression to predict the price of a diamond from its physical features.

We use a subset of the data set for our training and test data. Note that we keep an unscaled
version of the data for one of the experiments we will run.

(4900, 6)

[[-1.04847699 -0.73702623 -1.10709561 -1.23038202 -1.23117462 -1.27781454]
 [ 0.55549967 -0.45992212 -0.66210689  0.7593329   0.7736844   0.6973562 ]
 [-0.79521751  0.30211416 -0.66210689 -0.81102507 -0.84458746 -0.78762618]]

Baseline performance

For regression problems, our baseline is the "blind" prediction that is just the average value of the
target variable. The blind prediction must be calculated using the training data. Calculate and print
the test set root mean squared error (test RMSE) using this blind prediction. I have provided a
function you can use for RMSE.

test RMSE, baseline: 3948.9
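The baseline step above can be sketched as follows. This is a minimal illustration with made-up numbers, and the rmse() function shown is an assumption, not necessarily the helper the notebook provides:

```python
import numpy as np

def rmse(y_true, y_pred):
    # root mean squared error
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# the "blind" prediction is the mean of the *training* targets
y_train = np.array([100.0, 200.0, 300.0, 400.0])
y_test = np.array([150.0, 350.0])

blind = np.full_like(y_test, y_train.mean())  # predict 250.0 for every test row
print(rmse(y_test, blind))  # -> 100.0
```

The key point is that the mean comes from the training targets only; using the test targets would leak information.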

Performance with default hyperparameters

Using the training set, train a KNN regression model using the ScikitLearn KNeighborsRegressor,

and report on the test RMSE. The test RMSE is the RMSE computed using the test data set.

When using the KNN algorithm, use algorithm='brute' to get the basic KNN algorithm.

test RMSE, default hyperparameters: 1507.8
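A sketch of this step, using synthetic data in place of the diamonds features (the variable names and data are assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = X_train @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(50, 3))
y_test = X_test @ np.array([3.0, -2.0, 1.0])

# the default is k=5; algorithm='brute' forces the basic exhaustive search
knn = KNeighborsRegressor(algorithm='brute')
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
test_rmse = np.sqrt(np.mean((y_test - pred) ** 2))
print(round(test_rmse, 1))
```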

Impact of K

In class we discussed the relationship of the hyperparameter k to overfitting.

I provided code to test KNN on k=1, k=3, k=5, ..., k=29. For each value of k, compute the training
RMSE and test RMSE. The training RMSE is the RMSE computed using the training data. Use the
'brute' algorithm, and Euclidean distance, which is the default. You need to add the
get_train_test_rmse() function.

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 done

Test RMSE when k = 5: 1507.8

Using the training and test RMSE values you got for each value of k, find the k associated with the
lowest test RMSE value. Print this k value and the associated lowest test RMSE value. In other
words, if you found that k=11 gave the lowest test RMSE, then print the value 11 and the test
RMSE value obtained when k=11.

best k = 13, best test RMSE: 1439.7
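The k sweep and best-k selection described above might look like this. The get_train_test_rmse() shown here is a hypothetical version of the function the notebook asks for, run on synthetic data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def get_train_test_rmse(model, X_train, X_test, y_train, y_test):
    # fit the model, then return (train RMSE, test RMSE)
    model.fit(X_train, y_train)
    tr = np.sqrt(np.mean((y_train - model.predict(X_train)) ** 2))
    te = np.sqrt(np.mean((y_test - model.predict(X_test)) ** 2))
    return tr, te

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

ks = range(1, 30, 2)           # k = 1, 3, 5, ..., 29
results = {}
for k in ks:
    knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
    results[k] = get_train_test_rmse(knn, X_train, X_test, y_train, y_test)

best_k = min(ks, key=lambda k: results[k][1])   # lowest *test* RMSE
print(best_k, round(results[best_k][1], 1))
```

Note that with k=1 the training RMSE is zero, since every training point is its own nearest neighbor; that is why the training curve alone cannot be used to pick k.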

Plot the test and training RMSE as a function of k, for all the k values you tried.


Comments

In the markup cell below, write about what you learned from your plot. I would expect two or three
sentences, but what's most important is that you write something thoughtful.

From just the plot above, I can clearly see that the test RMSE is much lower than the baseline
RMSE, which is a good sign; however, because the difference in the y values is so big, it's hard to
see the details of the plotted test RMSE values. The general pattern is that the test RMSE drops
dramatically between k=1 and k=3, then gradually gets smaller until around k=15, where it
plateaus.

Impact of distance metric

Repeat what you did to test the impact of k, but this time use Manhattan distance as your distance
metric. Look at the options for KNeighborsRegressor() to see how to use Manhattan distance.

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 done

Print the value of k that gives the best test RMSE, and the test RMSE associated with that k, just
as you did in the previous section.

best k = 13, best test RMSE: 1438.5
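In scikit-learn the switch to Manhattan distance is a one-parameter change: KNeighborsRegressor uses the Minkowski metric with p=2 (Euclidean) by default, and p=1 gives Manhattan. A tiny sketch with made-up points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0.0, 10.0, 20.0])

# p=1 switches the default Minkowski metric to Manhattan (L1) distance;
# metric='manhattan' is equivalent
knn_l1 = KNeighborsRegressor(n_neighbors=1, algorithm='brute', p=1)
knn_l1.fit(X, y)
print(knn_l1.predict([[0.9, 0.9]]))  # nearest point under L1 is (1, 1) -> [10.]
```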

Plot the training and test RMSE as a function of k, just as you did in the previous section. Be sure

to note that Manhattan distance was used in your plot title.

Comments

Consider what you learned from your experiment, and write a little about it. Think about how the
results changed as a result of changing the distance function.

Based on the graphs, there is very little difference between using Euclidean and Manhattan
distance for the test data that I used. I do notice that the sharp decline in test RMSE at the
beginning is broken into two sections that last until about k=8 for Manhattan distance, whereas
with Euclidean distance the RMSE declines gradually starting at about k=3. The difference is so
small that I almost thought I hadn't changed the power parameter of the Minkowski metric until I
checked the best test RMSE values.

Impact of noise predictors

In class we heard that KNN performance goes down if useless "noise predictors" are present.
These are predictors that don't help in making predictions. In this section, run KNN regression after
adding one noise predictor to the data, then two, then three, and then four. For each, compute the
training and test RMSE. In every case, use k=10 as the k value and use Euclidean distance as the
distance function.

The add_noise_predictor() method makes it easy to add a predictor variable of random values to
X_train or X_test.


Hint: In each iteration of your loop, add a noise predictor to both X_train and X_test. You don't
need to worry about rescaling the data, as the new noise predictor is already scaled. Don't modify
the original X_train and X_test, however, as you will be using them again.

0 1 2 3 4 done

Plot the percent increase in test RMSE as a function of the number of noise predictors. The x axis
will range from 0 to 4. The y axis will show a percent increase in test RMSE.

To compute percent increase in RMSE for n noise predictors, compute 100 * (rmse -
base_rmse)/base_rmse, where base_rmse is the test RMSE with no noise predictors, and rmse is
the test RMSE when n noise predictors have been added.

Comments

Look at the results you obtained and add some thoughtful commentary.

To be completely honest, I'm not sure I did this part correctly. It seems like having one noise
predictor is slightly worse than having 2 or 3 noise predictors, but with 4 noise predictors the
RMSE almost triples.

Impact of scaling

In class we learned that we should scale the training data before using KNN. How important is
scaling with KNN? Repeat the experiments you ran before (as in the impact of distance metric
section), but this time use unscaled data.


Run KNN as before but use the unscaled version of the data. You will vary k as before. Use
algorithm='brute' and Euclidean distance.

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 done

Print the best k and the test RMSE associated with the best k.

best k = 9, best test RMSE: 1469.2
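The difference between the scaled and unscaled runs comes down to one preprocessing step. A sketch of the usual approach (StandardScaler is an assumption; the notebook may scale differently), with two made-up features on carat-like and depth-like ranges:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
# one feature on a carat-like range, one on a depth-like range
X_train = np.column_stack([rng.uniform(0.2, 5.0, 200), rng.uniform(43, 79, 200)])
y_train = 1000 * X_train[:, 0]          # target driven by the small-range feature

scaler = StandardScaler().fit(X_train)  # fit on the training data only
X_train_scaled = scaler.transform(X_train)

# unscaled: the wide-range feature dominates Euclidean distance
knn_unscaled = KNeighborsRegressor(n_neighbors=9, algorithm='brute').fit(X_train, y_train)
# scaled: both features contribute comparably to the distance
knn_scaled = KNeighborsRegressor(n_neighbors=9, algorithm='brute').fit(X_train_scaled, y_train)

print(X_train_scaled.std(axis=0))       # each column now has unit variance
```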

Plot training and test RMSE as a function of k. Your plot title should note the use of unscaled data.

Comments

Reflect on what happened and provide some short commentary, as in previous sections.

It's difficult to see the minute changes in the test RMSE in the plot above because the training
RMSE is much higher, which forces the plot to be zoomed out.

Impact of algorithm

We didn't discuss in class that there are variants of the KNN algorithm. The main purpose of the
variants is to be faster and to reduce the amount of training data that needs to be stored.

Run experiments where you test each of the three KNN algorithms supported by Scikit-Learn:
ball_tree, kd_tree, and brute. In each case, use k=10 and use Euclidean distance.

brute ball_tree kd_tree done

Print the name of the best algorithm, and the test RMSE achieved with the best algorithm.

The best algorithm: 'kd_tree' | RMSE: 1454.407
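One useful fact here: all three of scikit-learn's neighbor-search options are exact, so they should find the same neighbors and give essentially identical predictions; only speed and memory use differ. A sketch with synthetic data (the variable names are assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(4)
X_train = rng.normal(size=(300, 3))
y_train = X_train @ np.array([3.0, -2.0, 1.0])
X_test = rng.normal(size=(50, 3))

preds = {}
for algo in ['brute', 'ball_tree', 'kd_tree']:
    knn = KNeighborsRegressor(n_neighbors=10, algorithm=algo)
    knn.fit(X_train, y_train)
    preds[algo] = knn.predict(X_test)

# exact algorithms -> matching predictions (up to floating-point noise)
print(np.allclose(preds['brute'], preds['kd_tree']))  # -> True
```

This explains why any RMSE differences between the three bars should be negligible.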

Plot the test RMSE for each of the three algorithms as a bar plot.

Comments

As usual, reflect on the results and add comments.

It appears the best was the brute algorithm, but I'm a little confused by my plot; I'm not sure if I'm
interpreting this data correctly.

Impact of weighting

It was briefly mentioned in lecture that there is a variant of KNN in which training points are given
more weight when they are closer to the point for which a prediction is to be made. The 'weights'
parameter of KNeighborsRegressor() has two possible values: 'uniform' and 'distance'. 'uniform' is
the basic algorithm.


Run an experiment similar to the previous one. Compute the test RMSE for uniform and distance
weighting, using k = 10, the brute algorithm, and Euclidean distance.

uniform distance done

Print the weighting that gave the lowest test RMSE, and the test RMSE it achieved.

The best weight: 'distance' | RMSE: 1407.531
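A tiny worked example of the two weighting modes, with made-up one-dimensional data so the arithmetic is checkable by hand:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 10.0, 20.0])

# note the parameter is 'weights' (plural) in scikit-learn
uni = KNeighborsRegressor(n_neighbors=2, algorithm='brute', weights='uniform').fit(X, y)
dist = KNeighborsRegressor(n_neighbors=2, algorithm='brute', weights='distance').fit(X, y)

q = [[0.25]]
print(uni.predict(q))   # plain average of the 2 nearest targets -> [5.]
print(dist.predict(q))  # neighbors weighted by 1/distance, so the closer
                        # point x=0 dominates -> [2.5]
```

For the query 0.25 the neighbors are x=0 (distance 0.25) and x=1 (distance 0.75); distance weighting uses weights 1/0.25 = 4 and 1/0.75 = 4/3, giving (4·0 + (4/3)·10) / (16/3) = 2.5.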

Create a bar plot showing the test RMSE for the uniform and distance weighting options.

Comments

As usual, reflect and comment.

The results show 'distance' as the best weighting; switching to distance weighting gives a smaller
RMSE. I did feel that the results were very similar, and plotting the data helped me visualize them
better.

Conclusions

Please provide at least a few sentences of commentary on the main things you've learned from the
experiments you've run.

This lab definitely involved trial and error to get values to show correctly. In a few instances I was
expecting different results, but I tried to learn from what the data showed. For a few questions I
was not sure if I was obtaining the correct data, and I had to go back to the modules to see if I was
doing things correctly.

