The kNN Algorithm in Python
In the graph that you’ve seen before and the following graphs in
this section, the target variable is the shape of the data point, and
the independent variables are height and width. You can see the
idea behind supervised learning in the following graph:
In this graph, the data points each have a height, a width, and a
shape. There are crosses, stars, and triangles. On the right is a
decision rule that a machine learning model might have learned.
In this case, observations marked with a cross are tall but not
wide. Stars are both tall and wide. Triangles are short but can be
wide or narrow. Essentially, the model has learned a decision rule
to decide whether an observation is more likely to be a cross, a
star, or a triangle based only on its height and width.
You can see the idea behind unsupervised learning in the
following graph:
In this graph, the observations don’t have different shapes
anymore. They’re all circles. Yet they can still be grouped into
three groups based on the distance between points. In this
particular example, there are three clusters of points that can be
separated based on the empty space between them.
As you can see in this example, you can never be certain that
grouped data points fundamentally belong together, but as long
as the grouping makes sense, it can be very valuable in practice.
Nonlinear models are models that use any approach other than
a line to separate their cases. A well-known example is
the decision tree, which is basically a long list of if … else
statements. In the nonlinear graph, if … else statements would
allow you to draw squares or any other form that you wanted to
draw. The following graph depicts a nonlinear model applied to
the example data:
This graph shows how a decision can be nonlinear. The decision
rule is made up of three squares. The box in which a new data
point falls will define its predicted shape. Note that it’s not
possible to capture this decision rule with a single straight line:
two lines are needed. This model could be re-created with
if … else statements as follows:
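The code for that rule isn’t reproduced here, but a minimal sketch
could look like the following. The predict_shape name and the
height and width thresholds are made up for illustration; they
just mirror the three regions described above:
>>> def predict_shape(height, width):
...     # Hypothetical thresholds: "tall" means height > 5, "wide" means width > 3
...     if height > 5 and width <= 3:
...         return "cross"      # tall but not wide
...     elif height > 5 and width > 3:
...         return "star"       # tall and wide
...     else:
...         return "triangle"   # short, wide or narrow
...
>>> predict_shape(6, 2)
'cross'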
You’ll also need much more data to fit a more complex model,
and data is not always available. Last but not least, more complex
models are more difficult for us humans to interpret, and
sometimes this interpretation can be very valuable.
This is where the strength of the kNN model lies. It allows its users to
understand and interpret what’s happening inside the model, and
it’s very fast to develop. This makes kNN a great model for many
machine learning use cases that don’t require highly complex
techniques.
Drawbacks of kNN
It’s only fair to also be honest about the drawbacks of the kNN
algorithm. As touched upon before, the real drawback of kNN is its
limited capacity to adapt to highly complex relationships between
independent and dependent variables. kNN is less likely to
perform well on advanced tasks like computer vision and natural
language processing.
Abalones are small sea snails that look a bit like mussels. If you
want to learn more about them, you can check the abalone
Wikipedia page for more information.
The Abalone Problem Statement
The age of an abalone can be found by cutting its shell and
counting the number of rings on the shell. In the Abalone Dataset,
you can find the age measurements of a large number of
abalones along with a lot of other physical measurements.
The goal of the project is to develop a model that can predict the
age of an abalone based purely on the other physical
measurements. This would allow researchers to estimate the
abalone’s age without having to cut its shell and count the rings.
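The import itself isn’t reproduced above. A minimal sketch using
pandas could look like this; the download URL is an assumption
based on the UCI Machine Learning Repository’s usual location for
the Abalone Dataset:
>>> import pandas as pd
>>> url = (
...     "https://archive.ics.uci.edu/ml/machine-learning-databases"
...     "/abalone/abalone.data"
... )
>>> # The file has no header row, so let pandas number the columns
>>> abalone = pd.read_csv(url, header=None)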
To make sure that you’ve imported the data correctly, you can do
a quick check as follows:
>>> abalone.head()
   0      1      2      3       4       5       6      7   8
0  M  0.455  0.365  0.095  0.5140  0.2245  0.1010  0.150  15
1  M  0.350  0.265  0.090  0.2255  0.0995  0.0485  0.070   7
2  F  0.530  0.420  0.135  0.6770  0.2565  0.1415  0.210   9
3  M  0.440  0.365  0.125  0.5160  0.2155  0.1140  0.155  10
4  I  0.330  0.255  0.080  0.2050  0.0895  0.0395  0.055   7
This should show you the first five lines of the Abalone Dataset,
imported in Python as a pandas DataFrame. You can see that the
column names are still missing. You can find those names in
the abalone.names file on the UCI machine learning repository. You
can add them to your DataFrame as follows:
>>> abalone.columns = [
... "Sex",
... "Length",
... "Diameter",
... "Height",
... "Whole weight",
... "Shucked weight",
... "Viscera weight",
... "Shell weight",
... "Rings",
... ]
The imported data should now be more understandable. But
there’s one other thing that you should do: You should remove
the Sex column. The goal of the current exercise is to use physical
measurements to predict the age of the abalone. Since sex is not
a purely physical measure, you should remove it from the
dataset. You can delete the Sex column using .drop:
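The exact call isn’t shown above, but a one-line sketch could look
like this, reassigning the result so that abalone no longer
contains the column:
>>> # Drop the Sex column; axis=1 means a column is dropped, not a row
>>> abalone = abalone.drop("Sex", axis=1)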
When a new data point arrives, the kNN algorithm, as the name
indicates, will start by finding the nearest neighbors of this new
data point. Then it takes the values of those neighbors and uses
them as a prediction for the new data point.
The kNN algorithm is based on the notion that you can predict the
features of a data point based on the features of its neighbors. In
some cases, this method of prediction may be successful, while in
other cases it may not. Next, you’ll look at the mathematical
description of “nearest” for data points and the methods to
combine multiple neighbors into one prediction.
Now, to apply this to your data, you must understand that your
data points are actually vectors. You can then compute the
distance between them by computing the norm of the difference
vector.
In case you want to get more details on the math, you can have a
look at the Pythagorean theorem to understand how the
Euclidean distance formula is derived.
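For two vectors a and b, the Euclidean distance is the square root
of the sum of the squared differences of their coordinates, which
is exactly the norm of the difference vector a - b. The original
snippet isn’t shown here, but a small sketch with two made-up
two-dimensional points could look like this:
>>> import numpy as np
>>> a = np.array([2, 2])
>>> b = np.array([4, 4])
>>> # Euclidean distance between a and b: sqrt((4 - 2)**2 + (4 - 2)**2)
>>> np.linalg.norm(a - b)
2.8284271247461903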
Now you can apply a kNN with k = 3 on a new abalone that has
the following physical measurements:
Variable Value
Length 0.569552
Diameter 0.446407
Height 0.154437
You can create the NumPy array for this data point as follows:
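The snippet that builds this array isn’t reproduced above. A sketch
under a few assumptions: X holds the feature columns of the
DataFrame as a NumPy array, y holds the ring counts, and only the
three measurements listed in the table are used here, whereas the
full example presumably uses all of the physical measurements:
>>> import numpy as np
>>> # Assumed setup: features and target as NumPy arrays
>>> X = abalone.drop("Rings", axis=1).to_numpy()
>>> y = abalone["Rings"].to_numpy()
>>> # The new abalone's measurements, in the same order as the
>>> # first three feature columns (Length, Diameter, Height)
>>> new_data_point = np.array([0.569552, 0.446407, 0.154437])
>>> # Euclidean distance from every abalone to the new data point
>>> distances = np.linalg.norm(X[:, :3] - new_data_point, axis=1)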
>>> k = 3
>>> nearest_neighbor_ids = distances.argsort()[:k]
>>> nearest_neighbor_ids
array([4045, 1902, 1644], dtype=int32)
This tells you which three neighbors are closest to
your new_data_point. In the next paragraph, you’ll see how to
combine those neighbors into an estimate.
As a first step, you need to find the ground truths for those three
neighbors:
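Those snippets aren’t reproduced above, but assuming y still holds
the ring counts in the same order as X, a sketch looks like this.
Because predicting rings is a regression task, the neighbors are
combined by taking their mean:
>>> # Ring counts (the ground truth) of the three nearest neighbors
>>> nearest_neighbor_rings = y[nearest_neighbor_ids]
>>> # For regression, the prediction is the average of the neighbors' values
>>> prediction = nearest_neighbor_rings.mean()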
For classification problems, you would combine the neighbors by
taking the mode, the most common value, instead of the mean. You
can compute the mode using the SciPy mode() function. Since the
abalone example is not a case of classification, the following code
shows how you can compute the mode for a toy example:
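The toy example itself isn’t shown above. A minimal sketch, using
made-up numeric class labels for three hypothetical neighbors:
>>> import numpy as np
>>> from scipy import stats
>>> # Made-up class labels of three hypothetical neighbors
>>> class_neighbors = np.array([1, 2, 2])
>>> # The most common label, 2, would be the predicted class
>>> predicted_class = stats.mode(class_neighbors).mode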
Splitting Data Into Training and Test Sets for Model Evaluation
In this section, you’ll evaluate the quality of your abalone kNN
model. In the previous sections, you had a technical focus, but
you’re now going to have a more pragmatic and results-oriented
point of view.
1. Training data is used to fit the model. For kNN, this means
that the training data will be used as neighbors.
2. Test data is used to evaluate the model. It means that
you’ll make predictions for the number of rings of each of
the abalones in the test data and compare those results to
the known true number of rings.
You can split the data into training and test sets in Python
using scikit-learn’s built-in train_test_split():
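The original code isn’t reproduced above. A sketch of both steps,
splitting the data and then fitting and scoring a kNN regressor
with the same arbitrary k = 3 used earlier, could look like this.
The test_size and random_state values are example choices, not
ones confirmed by the text:
>>> from math import sqrt
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.neighbors import KNeighborsRegressor
>>> from sklearn.metrics import mean_squared_error
>>> # Hold out 20% of the abalones as a test set
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.2, random_state=12345
... )
>>> # Fit a kNN regressor with an arbitrary k, then compute the RMSE on both sets
>>> knn_model = KNeighborsRegressor(n_neighbors=3)
>>> knn_model.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=3)
>>> train_rmse = sqrt(mean_squared_error(y_train, knn_model.predict(X_train)))
>>> test_rmse = sqrt(mean_squared_error(y_test, knn_model.predict(X_test)))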
Until now, you’ve only used the scikit-learn kNN algorithm out of
the box: you haven’t tuned any hyperparameters yet, and the choice
of k was arbitrary. You can observe a relatively large difference
between the RMSE on the training data and the RMSE on the test
data. This means that the model suffers from overfitting on the
training data: it does not generalize well.
This is nothing to worry about at this point. In the next part, you’ll
see how to optimize the prediction error or test error using
various tuning methods.
When you use few neighbors, you have a prediction that will be
much more variable than when you use more neighbors:
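The grid search setup isn’t shown at this point in the text, but a
sketch of how it could look follows. It tries every k from 1
through 49 with cross-validation on the training data; the range
mirrors the later snippet in this section, so treat the exact
values as an assumption:
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.neighbors import KNeighborsRegressor
>>> # Cross-validate a kNN regressor for every candidate value of k
>>> parameters = {"n_neighbors": range(1, 50)}
>>> gridsearch = GridSearchCV(KNeighborsRegressor(), parameters)
>>> gridsearch.fit(X_train, y_train)
GridSearchCV(estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': range(1, 50)})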
In the end, it will retain the best performing value of k, which you
can access with .best_params_:
>>> gridsearch.best_params_
{'n_neighbors': 25}
In this code, you print the parameters that have the lowest error
score. With .best_params_, you can see that choosing 25 as the value
for k will yield the best predictive performance. Now that you
know the best value of k, you can see how it affects your
training and test performance:
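The evaluation code isn’t reproduced here, but it can follow the
same pattern as before, using the refitted grid search to predict
on both sets:
>>> # RMSE of the tuned model on the training and test data
>>> train_preds_grid = gridsearch.predict(X_train)
>>> train_rmse = sqrt(mean_squared_error(y_train, train_preds_grid))
>>> test_preds_grid = gridsearch.predict(X_test)
>>> test_rmse = sqrt(mean_squared_error(y_test, test_preds_grid))
As a next step, you can also let the grid search decide whether
weighting the neighbors by their distance improves the prediction: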
>>> parameters = {
... "n_neighbors": range(1, 50),
... "weights": ["uniform", "distance"],
... }
>>> gridsearch = GridSearchCV(KNeighborsRegressor(), parameters)
>>> gridsearch.fit(X_train, y_train)
GridSearchCV(estimator=KNeighborsRegressor(),
param_grid={'n_neighbors': range(1, 50),
'weights': ['uniform', 'distance']})
>>> gridsearch.best_params_
{'n_neighbors': 25, 'weights': 'distance'}
>>> test_preds_grid = gridsearch.predict(X_test)
>>> test_mse = mean_squared_error(y_test, test_preds_grid)
>>> test_rmse = sqrt(test_mse)
>>> test_rmse
2.163426558494748
Here, you test whether it makes sense to use a different weighting
with your GridSearchCV. Applying a weighted average rather than
a regular average has reduced the prediction error
from 2.17 to 2.1634. Although this isn’t a huge improvement, it’s
still better, which makes it worth using.
As a third step for kNN tuning, you can use bagging. Bagging is
an ensemble method, or a method that takes a relatively
straightforward machine learning model and fits a large number
of those models with slight variations in each fit. Bagging often
uses decision trees, but kNN works perfectly well too.
Ensemble methods are often more performant than single
models. One model can be wrong from time to time, but the
average of a hundred models should be wrong less often. The
errors of different individual models are likely to average each
other out, and the resulting prediction will be less variable.
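The bagging code isn’t shown above. A sketch, reusing the best k
and weights found by the grid search and fitting a hundred bagged
kNN regressors, could look like this; n_estimators=100 mirrors the
hundred models mentioned above and is otherwise an example choice:
>>> from sklearn.ensemble import BaggingRegressor
>>> # Reuse the hyperparameters that GridSearchCV selected
>>> best_k = gridsearch.best_params_["n_neighbors"]
>>> best_weights = gridsearch.best_params_["weights"]
>>> bagged_knn = KNeighborsRegressor(n_neighbors=best_k, weights=best_weights)
>>> # Fit 100 kNN regressors, each on a bootstrap sample of the training data
>>> bagging_model = BaggingRegressor(bagged_knn, n_estimators=100)
>>> bagging_model.fit(X_train, y_train)
>>> # Evaluate the ensemble on the held-out test set
>>> test_preds_bagging = bagging_model.predict(X_test)
>>> test_rmse_bagging = sqrt(mean_squared_error(y_test, test_preds_bagging))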
Model                                        Test RMSE
Arbitrary k                                  2.37
GridSearchCV for k                           2.17
GridSearchCV for k and weights               2.1634
Bagging and GridSearchCV for k and weights   (lowest of the four)
In this table, you see the four models from simplest to most
complex. The order of complexity corresponds with the order of
the error metrics. The model with an arbitrary k performed the
worst, and the model with bagging and GridSearchCV performed
the best.
Conclusion
Now that you know all about the kNN algorithm, you’re ready to
start building performant predictive models in Python. It takes a
few steps to move from a basic kNN model to a fully tuned model,
but the performance increase is totally worth it!