KNN Cookbook
K-Nearest Neighbors
15.0 Introduction
The K-Nearest Neighbors classifier (KNN) is one of the simplest yet most commonly
used classifiers in supervised machine learning. KNN is often considered a lazy
learner; it doesn’t technically train a model to make predictions. Instead, an
observation is predicted to belong to the class held by the largest proportion of its
k nearest observations. For example, if an observation with an unknown class is
surrounded by observations of class 1, then the observation is classified as class 1.
In this chapter we will explore how to use scikit-learn to create and use a KNN classifier.
Solution
Use scikit-learn’s NearestNeighbors:
# Load libraries
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
# Load data
iris = datasets.load_iris()
features = iris.data
# Create standardizer
standardizer = StandardScaler()
# Standardize features
features_standardized = standardizer.fit_transform(features)
# Create an observation
new_observation = [ 1, 1, 1, 1]
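The listing above stops before the search itself. A minimal sketch of the remaining steps follows, assuming a NearestNeighbors model fitted on the standardized features with n_neighbors=2 (the Discussion refers to the two closest observations). The name nearest_neighbors is an assumption; distances and indices match the names used in the Discussion.

```python
# Load libraries
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Load and standardize the Iris features
iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)

# Fit a nearest-neighbor index that returns two neighbors (assumed value of k)
nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)

# Create an observation
new_observation = [1, 1, 1, 1]

# Find the distances to, and row indices of, the two nearest neighbors
distances, indices = nearest_neighbors.kneighbors([new_observation])

# View the standardized feature values of those neighbors
nearest_values = features_standardized[indices]
print(nearest_values)
```

Note that kneighbors expects a 2D input, which is why the single observation is wrapped in a list.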
Discussion
In our solution we used the dataset of Iris flowers. We created an observation,
new_observation, with some values and then found the two observations that are
closest to our observation. indices contains the locations of the observations in our
dataset that are closest, so features_standardized[indices] displays the values of
those observations.
Intuitively, distance can be thought of as a measure of similarity, so the two closest
observations are the two flowers most similar to the flower we created.
How do we measure distance? scikit-learn offers a wide variety of distance metrics, d,
including Euclidean:
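$$d_{\text{euclidean}} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Manhattan:
$$d_{\text{manhattan}} = \sum_{i=1}^{n} |x_i - y_i|$$
and Minkowski:
$$d_{\text{minkowski}} = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$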
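The relationship between these metrics can be checked numerically. The sketch below uses plain NumPy and two made-up observations (x and y are hypothetical values, and minkowski_distance is a helper written for this example, not a scikit-learn function). It illustrates that Minkowski distance reduces to Manhattan distance when p = 1 and to Euclidean distance when p = 2.

```python
import numpy as np


def minkowski_distance(x, y, p):
    """Minkowski distance between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)


# Two hypothetical observations with four features each
x = np.array([1.0, 1.0, 1.0, 1.0])
y = np.array([0.0, 2.0, 1.0, 3.0])

# Euclidean and Manhattan distances computed directly from their definitions
euclidean = np.sqrt(np.sum((x - y) ** 2))
manhattan = np.sum(np.abs(x - y))

# Minkowski with p=2 and p=1 should agree with the two values above
print(euclidean, minkowski_distance(x, y, p=2))
print(manhattan, minkowski_distance(x, y, p=1))
```

In scikit-learn, the metric is selected through the metric and p parameters of NearestNeighbors; the defaults are metric="minkowski" and p=2, which is Euclidean distance.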
The distances variable we created contains the actual distance measurement to each
of the two nearest neighbors:
# View distances
distances
array([[ 0.48168828, 0.73440155]])
Solution
If the dataset is not very large, use KNeighborsClassifier:
# Load libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Create standardizer
standardizer = StandardScaler()
# Standardize features
X_std = standardizer.fit_transform(X)
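The listing above ends before the classifier is trained, while the Discussion refers to knn and new_observations. A minimal sketch of those steps follows. It assumes n_neighbors=5 (scikit-learn's default, and consistent with the probabilities of 0.6 and 0.4 shown in the Discussion), and the values in new_observations are hypothetical.

```python
# Load libraries
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load data and standardize features
iris = datasets.load_iris()
X, y = iris.data, iris.target
X_std = StandardScaler().fit_transform(X)

# Train a KNN classifier; n_neighbors=5 is an assumed value
knn = KNeighborsClassifier(n_neighbors=5).fit(X_std, y)

# Two hypothetical observations, expressed in standardized units
new_observations = [[0.75, 0.75, 0.75, 0.75],
                    [1, 1, 1, 1]]

# Predict the class of each observation
predictions = knn.predict(new_observations)
print(predictions)
```

Because the classifier is trained on standardized features, new observations must be on the same standardized scale before prediction.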
Discussion
In KNN, given an observation, x_u, with an unknown target class, the algorithm first
identifies the k closest observations (sometimes called x_u’s neighborhood) based on
some distance metric (e.g., Euclidean distance), then these k observations “vote” on
the class of x_u. The probability that x_u belongs to class j is the proportion of
its neighbors that belong to class j:
$$\frac{1}{k} \sum_{i \in \nu} I(y_i = j)$$
where ν is the set of k observations in x_u’s neighborhood, y_i is the class of the ith
observation, and I is an indicator function (i.e., 1 if the condition is true, 0
otherwise). In scikit-learn we can see these probabilities using predict_proba:
# View probability each observation is one of three classes
knn.predict_proba(new_observations)
array([[ 0. , 0.6, 0.4],
[ 0. , 0. , 1. ]])
The class with the highest probability becomes the predicted class. For example, in
the preceding output, the first observation should be class 1 (Pr = 0.6) while the
second observation should be class 2 (Pr = 1), and this is just what we see:
knn.predict(new_observations)
array([1, 2])
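The vote can also be reproduced by hand, which makes the link between the neighborhood and predict_proba concrete. The sketch below assumes k = 5 with uniform weights (scikit-learn's defaults) and a hypothetical observation. It looks up the neighbors with kneighbors, counts their classes, and divides by k.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load data and standardize features
iris = datasets.load_iris()
X_std = StandardScaler().fit_transform(iris.data)
y = iris.target

k = 5  # assumed neighborhood size
knn = KNeighborsClassifier(n_neighbors=k).fit(X_std, y)

# A hypothetical observation in standardized units
new_observation = [[0.75, 0.75, 0.75, 0.75]]

# Row indices of the k nearest training observations
_, neighbor_indices = knn.kneighbors(new_observation)

# Count the votes per class among the neighbors and convert to proportions
neighbor_classes = y[neighbor_indices[0]]
manual_proba = np.bincount(neighbor_classes, minlength=len(knn.classes_)) / k

# The manual proportions should match scikit-learn's probabilities
print(manual_proba)
print(knn.predict_proba(new_observation)[0])
```

With weights="distance" the votes would no longer count equally, so this simple proportion would not match predict_proba.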
Solution
Use model selection techniques like GridSearchCV:
# Load libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create standardizer
standardizer = StandardScaler()
# Standardize features
features_standardized = standardizer.fit_transform(features)
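# Create a KNN classifier (5 neighbors is scikit-learn's default)
knn = KNeighborsClassifier(n_neighbors=5)
# Create a pipeline
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])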
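The listing above stops at the pipeline, before the search over k. One way the search might look is sketched below. The candidate grid of 1 through 10 and the names search_space and classifier are assumptions made for illustration; cv=5 matches the five-fold cross-validation mentioned in the Discussion.

```python
# Load libraries
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load data
iris = datasets.load_iris()
features, target = iris.data, iris.target

# Standardization happens inside the pipeline, so each cross-validation
# split is scaled using only its own training portion
pipe = Pipeline([("standardizer", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

# Candidate values of k (an assumed grid, chosen for illustration)
search_space = [{"knn__n_neighbors": list(range(1, 11))}]

# Five-fold cross-validated grid search over k
classifier = GridSearchCV(pipe, search_space, cv=5).fit(features, target)

# The value of k with the best mean cross-validated accuracy
best_k = classifier.best_params_["knn__n_neighbors"]
print(best_k)
```

The sketch passes the raw features to fit because the pipeline already contains the standardizer. The mean score for every candidate k is available in classifier.cv_results_["mean_test_score"].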
Discussion
The size of k has real implications in KNN classifiers. In machine learning we are
trying to find a balance between bias and variance, and in few places is that as
explicit as the value of k. If k = n, where n is the number of observations, then we
have high bias but low variance. If k = 1, we will have low bias but high variance.
The best model will come from finding the value of k that balances this bias-variance
trade-off. In our solution, we used GridSearchCV to conduct five-fold cross-validation
on KNN classifiers with different values of k.
Solution
Use RadiusNeighborsClassifier:
# Load libraries
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create standardizer
standardizer = StandardScaler()
# Standardize features
features_standardized = standardizer.fit_transform(features)
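The listing above ends before the classifier is created. A minimal sketch of the remaining steps follows, assuming radius=0.5 and a hypothetical new observation; the name rnn is also an assumption. The outlier_label parameter is set so that an observation with no training points inside the radius still receives a label instead of raising an error.

```python
# Load libraries
from sklearn import datasets
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load data and standardize features
iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target

# radius=0.5 is an assumed value; outlier_label="most_frequent" assigns the
# most common training class to observations with no neighbors in the radius
rnn = RadiusNeighborsClassifier(radius=0.5, outlier_label="most_frequent")
rnn.fit(features_standardized, target)

# A hypothetical observation in standardized units
new_observations = [[1, 1, 1, 1]]

# Predict the class of the observation
predictions = rnn.predict(new_observations)
print(predictions)
```

The radius is measured in standardized units here, so a sensible value depends on how the features were scaled.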
Discussion
In KNN classification, an observation’s class is predicted from the classes of its k
neighbors. A less common technique is classification in a radius-based (RNN) classi‐