Knn-Experiments - Jupyter Notebook
In this notebook we run a bunch of tests to see how KNN is affected by the choice of k, distance
function, scaling of the predictors, presence of useless predictors, and other things.
One experiment we do not run, and which would be interesting, is to see how KNN performance
changes as a function of the size of the training set.
INSTRUCTIONS
Enter code wherever you see # YOUR CODE HERE in code cells, or YOUR TEXT HERE in markup
cells.
The diamonds dataset is good for testing KNN because it has many numeric features. See
https://www.kaggle.com/shivam2503/diamonds (https://www.kaggle.com/shivam2503/diamonds)
for information on the dataset.
<class 'pandas.core.frame.DataFrame'>
localhost:8889/notebooks/knn-experiments.ipynb# 1/10
2/14/22, 9:07 PM knn-experiments - Jupyter Notebook
Note that numeric features have different ranges. For example, the median value of carat is 0.7,
while the median value of depth is about 62. Price has a much greater median value, but we will be
using it as the target variable.
Out[170]: (DataFrame preview; columns shown include Unnamed: 0, carat, depth, table, price, x)
We will use KNN regression to predict the price of a diamond from its physical features.
We use a subset of the data set for our training and test data. Note that we keep an unscaled
version of the data for one of the experiments we will run.
(4900, 6)
Baseline performance
For regression problems, our baseline is the "blind" prediction that is just the average value of the
target variable. The blind prediction must be calculated using the training data. Calculate and print
the test set root mean squared error (test RMSE) using this blind prediction.
I have provided a function you can use for RMSE.
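As a sketch of what that looks like (the numbers below are made-up stand-ins for the diamond prices, and this `rmse` helper is a hypothetical version of the one the notebook provides):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Blind prediction: the mean of the *training* targets, applied to every test row.
y_train = np.array([100.0, 200.0, 300.0])   # stand-in values, not the real prices
y_test = np.array([150.0, 250.0])

blind = y_train.mean()                       # 200.0
baseline_rmse = rmse(y_test, np.full_like(y_test, blind))
print(f'baseline test RMSE: {baseline_rmse:.1f}')   # 50.0
```

The key point is that the blind prediction comes from the training targets only; the test targets are used solely for scoring.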
Using the training set, train a KNN regression model using the ScikitLearn KNeighborsRegressor,
and report on the test RMSE. The test RMSE is the RMSE computed using the test data set.
When using the KNN algorithm, use algorithm='brute' to get the basic KNN algorithm.
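A minimal sketch of the fit-and-score step, using synthetic stand-in data in place of the scaled diamond features:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled features and prices.
X_train = rng.normal(size=(200, 5))
y_train = 1000 * X_train[:, 0] + rng.normal(scale=50, size=200)
X_test = rng.normal(size=(50, 5))
y_test = 1000 * X_test[:, 0] + rng.normal(scale=50, size=50)

# algorithm='brute' forces the plain exhaustive-search KNN.
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
knn.fit(X_train, y_train)
test_rmse = np.sqrt(np.mean((y_test - knn.predict(X_test)) ** 2))
print(f'test RMSE: {test_rmse:.1f}')
```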
Impact of K
I provided code to test KNN on k=1, k=3, k=5, ..., k=29. For each value of k, compute the training
RMSE and test RMSE. The training RMSE is the RMSE computed using the training data. Use the
'brute' algorithm, and Euclidean distance, which is the default. You need to add the
get_train_test_rmse() function.
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 done
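One plausible way to write the missing helper — a sketch, with synthetic stand-in data so it runs on its own (in the notebook you would pass in the scaled diamond splits instead):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def get_train_test_rmse(k, X_train, y_train, X_test, y_test, **knn_kwargs):
    """Fit a KNN regressor with n_neighbors=k; return (train_rmse, test_rmse)."""
    knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute', **knn_kwargs)
    knn.fit(X_train, y_train)
    def rmse(y, pred):
        return np.sqrt(np.mean((y - pred) ** 2))
    return rmse(y_train, knn.predict(X_train)), rmse(y_test, knn.predict(X_test))

# Tiny synthetic demo so the function can be exercised stand-alone.
rng = np.random.default_rng(1)
X_tr, X_te = rng.normal(size=(100, 4)), rng.normal(size=(30, 4))
y_tr, y_te = X_tr.sum(axis=1), X_te.sum(axis=1)

results = {k: get_train_test_rmse(k, X_tr, y_tr, X_te, y_te)
           for k in range(1, 30, 2)}
```

Note that at k=1 the training RMSE is zero (each training point is its own nearest neighbor), which is why training RMSE alone is a misleading guide.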
Using the training and test RMSE values you got for each value of k, find the k associated with the
lowest test RMSE value. Print this k value and the associated lowest test RMSE value. In other
words, if you found that k=11 gave the lowest test RMSE, then print the value 11 and the test
RMSE value obtained when k=11.
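Finding the best k is a one-liner once the RMSE values are collected; the values below are hypothetical placeholders:

```python
# Hypothetical test-RMSE values per k; in the notebook these come from
# running KNN for k = 1, 3, ..., 29.
test_rmse_by_k = {1: 1450.0, 3: 1210.0, 5: 1150.0, 7: 1130.0,
                  9: 1125.0, 11: 1118.0, 13: 1121.0, 15: 1124.0}

# min() over the dict keys, ranked by their test RMSE.
best_k = min(test_rmse_by_k, key=test_rmse_by_k.get)
print(f'best k = {best_k}, test RMSE = {test_rmse_by_k[best_k]:.1f}')
```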
Plot the test and training RMSE as a function of k, for all the k values you tried.
Comments
In the markup cell below, write about what you learned from your plot. I would expect two or three
sentences, but what's most important is that you write something thoughtful.
From the plot above, I can clearly see that the test RMSE is much lower than the baseline
RMSE, which is a good sign; however, because the difference on the y axis is so large, it's hard to
see the details of the plotted test RMSE values. The general trend is that the test RMSE drops
dramatically between k=1 and k=3, then gradually decreases until around k=15, where it
plateaus.
Repeat what you did to test the impact of k, but this time use Manhattan distance as your distance
metric. Look at the options for KNeighborsRegressor() to see how to use Manhattan distance.
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 done
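A sketch of the change needed, on synthetic stand-in data; `p=1` and `metric='manhattan'` are equivalent ways to ask KNeighborsRegressor for Manhattan distance:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X_train, X_test = rng.normal(size=(150, 4)), rng.normal(size=(40, 4))
y_train, y_test = X_train[:, 0], X_test[:, 0]

# The default metric is Minkowski with p=2 (Euclidean); p=1 makes it
# Manhattan (L1). metric='manhattan' is an equivalent spelling.
knn = KNeighborsRegressor(n_neighbors=10, algorithm='brute', p=1)
knn.fit(X_train, y_train)
test_rmse = np.sqrt(np.mean((y_test - knn.predict(X_test)) ** 2))
```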
Print the value of k that gives the best test RMSE, and the test RMSE associated with that k, just
as you did in the previous section.
Plot the training and test RMSE as a function of k, just as you did in the previous section. Be sure
your plot title notes the use of Manhattan distance.
Comments
Consider what you learned from your experiment, and write a little about it. Think about how the
results changed as a result of changing the distance function.
Based on the graphs, there is very little difference between using Euclidean and Manhattan
distance on the test data I used. I do notice that the sharp decline in test RMSE at the
beginning is broken into two segments that last until about k=8 for Manhattan distance,
whereas with Euclidean distance the RMSE declines gradually starting at about k=3.
The difference is so small that I almost thought I hadn't changed the power parameter of the
Minkowski metric until I checked the best test RMSE values.
In class we heard that KNN performance goes down if useless "noisy predictors" are present.
These are predictors that don't help in making predictions. In this section, run KNN regression by
adding one noise predictor to the data, then two, then three, and then four. For each,
compute the training and test RMSE. In every case, use k=10 as the k value and use Euclidean
distance as the distance function.
The add_noise_predictor() method makes it easy to add a predictor variable of random values to
X_train or X_test.
Hint: In each iteration of your loop, add a noise predictor to copies of X_train and X_test. You don't
need to worry about rescaling the data, as the new noise predictor is already scaled. Don't modify
X_train and X_test themselves, however, as you will be using them again.
0 1 2 3 4 done
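A sketch of the loop, with a hypothetical stand-in for the provided `add_noise_predictor()` helper and synthetic data so it runs on its own:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def add_noise_predictor(X, rng):
    """Append one column of standard-normal noise (a stand-in approximation
    of the helper the notebook provides)."""
    return np.hstack([X, rng.normal(size=(X.shape[0], 1))])

rng = np.random.default_rng(3)
X_train, X_test = rng.normal(size=(200, 4)), rng.normal(size=(60, 4))
y_train, y_test = X_train @ np.ones(4), X_test @ np.ones(4)

test_rmses = []
X_tr, X_te = X_train, X_test           # grow copies; the originals stay intact
for n_noise in range(5):               # 0, 1, 2, 3, 4 noise predictors
    knn = KNeighborsRegressor(n_neighbors=10, algorithm='brute')
    knn.fit(X_tr, y_train)
    test_rmses.append(np.sqrt(np.mean((y_test - knn.predict(X_te)) ** 2)))
    X_tr = add_noise_predictor(X_tr, rng)
    X_te = add_noise_predictor(X_te, rng)
```

Each pass scores the model first and adds a noise column afterward, so `test_rmses[n]` is the RMSE with exactly n noise predictors.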
Plot the percent increase in test RMSE as a function of the number of noise predictors. The x axis
will range from 0 to 4. The y axis will show a percent increase in test RMSE.
To compute percent increase in RMSE for n noise predictors, compute 100 * (rmse -
base_rmse)/base_rmse, where base_rmse is the test RMSE with no noise predictors, and rmse is
the test RMSE when n noise predictors have been added.
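The percent-increase calculation itself, shown with hypothetical placeholder RMSE values:

```python
# Hypothetical test-RMSE values for 0..4 noise predictors.
rmses = [1100.0, 1210.0, 1180.0, 1195.0, 1450.0]

base_rmse = rmses[0]   # test RMSE with no noise predictors
pct_increase = [100 * (r - base_rmse) / base_rmse for r in rmses]
print(pct_increase)    # first entry is 0.0 by construction
```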
Comments
Look at the results you obtained and add some thoughtful commentary.
To be completely honest, I'm not sure I did this part correctly. It seems that having one noise
predictor is slightly worse than having two or three noise predictors, but with four noise predictors
the RMSE almost triples.
Impact of scaling
In class we learned that we should scale the training data before using KNN. How important is
scaling with KNN? Repeat the experiments you ran before (like in the impact of distance metric
section), but this time use unscaled data.
Run KNN as before but use the unscaled version of the data. You will vary k as before. Use
algorithm='brute' and Euclidean distance.
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 done
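To see why scaling matters so much here, a small self-contained sketch contrasting scaled and unscaled features; the synthetic data mimics the very different ranges of carat and depth, and `StandardScaler` is one common scaling choice (the notebook may use a different one):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Two features on very different scales, mimicking carat (~1) vs depth (~60).
carat = rng.uniform(0.2, 2.0, size=240)
depth = rng.uniform(55, 70, size=240)
X = np.column_stack([carat, depth])
y = 1000 * carat                      # target depends only on the small-range feature
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

def fit_rmse(X_train, X_test):
    knn = KNeighborsRegressor(n_neighbors=10, algorithm='brute')
    knn.fit(X_train, y_tr)
    return np.sqrt(np.mean((y_te - knn.predict(X_test)) ** 2))

scaler = StandardScaler().fit(X_tr)   # fit on training data only
rmse_unscaled = fit_rmse(X_tr, X_te)
rmse_scaled = fit_rmse(scaler.transform(X_tr), scaler.transform(X_te))
```

Unscaled, the wide-range but irrelevant feature dominates the distance computation, so the chosen neighbors are nearly useless and the RMSE is much worse.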
Print the best k and the test RMSE associated with the best k.
Plot training and test RMSE as a function of k. Your plot title should note the use of unscaled data.
Comments
Reflect on what happened and provide some short commentary, as in previous sections.
It's difficult to see the small changes in the test RMSE in the graph above because the training
RMSE is much higher, which zooms the graph out.
Impact of algorithm
We didn't discuss in class that there are variants of the KNN algorithm. The main purpose of the
variants is to be faster and to reduce the amount of training data that needs to be stored.
Run experiments where you test each of the three KNN algorithms supported by Scikit-Learn:
ball_tree, kd_tree, and brute. In each case, use k=10 and use Euclidean distance.
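A sketch of the comparison on synthetic stand-in data. Worth noting: all three are exact algorithms, so with the same k and metric they should find the same neighbors and produce (distance ties aside) identical RMSE; they differ in speed and memory use, not accuracy:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(5)
X_tr, X_te = rng.normal(size=(200, 4)), rng.normal(size=(50, 4))
y_tr, y_te = X_tr[:, 0], X_te[:, 0]

rmse_by_algo = {}
for algo in ('ball_tree', 'kd_tree', 'brute'):
    knn = KNeighborsRegressor(n_neighbors=10, algorithm=algo)
    knn.fit(X_tr, y_tr)
    rmse_by_algo[algo] = np.sqrt(np.mean((y_te - knn.predict(X_te)) ** 2))
```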
Print the name of the best algorithm, and the test RMSE achieved with the best algorithm.
Plot the test RMSE for each of the three algorithms as a bar plot.
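A minimal bar-plot sketch; the RMSE values here are made-up placeholders:

```python
import matplotlib
matplotlib.use('Agg')             # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Made-up placeholder values, one per algorithm.
rmse_by_algo = {'ball_tree': 1132.4, 'kd_tree': 1132.4, 'brute': 1132.4}

fig, ax = plt.subplots()
ax.bar(list(rmse_by_algo), list(rmse_by_algo.values()))
ax.set_ylabel('test RMSE')
ax.set_title('Test RMSE by KNN algorithm (k=10, Euclidean)')
fig.savefig('rmse_by_algorithm.png')
```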
Comments
It appears the brute algorithm was best, but I'm a little confused by my plot, and I'm not sure
I'm interpreting the data correctly.
Impact of weighting
It was briefly mentioned in lecture that there is a variant of KNN in which training points are given
more weight when they are closer to the point for which a prediction is to be made. The 'weights'
parameter of KNeighborsRegressor() has two possible values: 'uniform' and 'distance'. Uniform is
the basic algorithm.
Run an experiment similar to the previous one. Compute the test RMSE for uniform and distance
weighting, using k=10, the brute algorithm, and Euclidean distance.
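A sketch of the comparison, again on synthetic stand-in data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(6)
X_tr, X_te = rng.normal(size=(200, 3)), rng.normal(size=(50, 3))
y_tr, y_te = X_tr.sum(axis=1), X_te.sum(axis=1)

rmse_by_weights = {}
for w in ('uniform', 'distance'):
    # weights='distance' averages the k neighbors weighted by 1/distance.
    knn = KNeighborsRegressor(n_neighbors=10, algorithm='brute', weights=w)
    knn.fit(X_tr, y_tr)
    rmse_by_weights[w] = np.sqrt(np.mean((y_te - knn.predict(X_te)) ** 2))

best = min(rmse_by_weights, key=rmse_by_weights.get)
print(f'best weighting: {best}, test RMSE: {rmse_by_weights[best]:.3f}')
```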
Print the weighting that gave the lowest test RMSE, and the test RMSE it achieved.
Create a bar plot showing the test RMSE for the uniform and distance weighting options.
Comments
The results show the best weighting achieved a test RMSE of 55817.065, and changing the
weighting gives a smaller RMSE. I did feel that the results were very similar, and plotting the
data helped me visualize them better.
Conclusions
Please provide at least a few sentences of commentary on the main things you've learned from the
experiments you've run.
This lab was definitely trial and error to get values to show correctly. In a few instances I
expected different results but tried to learn from what the data showed. For a few questions, I was
not sure if I was obtaining the correct data, and I had to go back to the modules to check that I was
doing things correctly.