05 K-Nearest Neighbors
https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
K-Nearest Neighbors
• The model for kNN is the entire training dataset.
• When a prediction is required for an unseen data instance, the kNN
algorithm will search through the training dataset for the k-most
similar instances.
• The prediction attribute of the most similar instances is summarized
and returned as the prediction for the unseen instance.
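A minimal sketch of the neighbor search described above, assuming numeric input attributes with the class label in the last column; the helper names (euclidean_distance, get_neighbors) are illustrative, not taken verbatim from the linked tutorial.

from math import sqrt

def euclidean_distance(row1, row2):
    # Similarity over the input attributes; the last column is assumed to be the label.
    return sqrt(sum((a - b) ** 2 for a, b in zip(row1[:-1], row2[:-1])))

def get_neighbors(train, test_row, k):
    # The "model" is just the training data: rank every row by its distance
    # to the unseen instance and keep the k most similar ones.
    return sorted(train, key=lambda row: euclidean_distance(row, test_row))[:k]

# Tiny worked example with two classes (label in the last column).
train = [[1.0, 1.1, 0], [1.2, 0.9, 0], [3.0, 3.2, 1], [3.1, 2.9, 1]]
print(get_neighbors(train, [1.1, 1.0, None], k=3))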
K-Nearest Neighbors
• The similarity measure depends on the type of data.
• For real-valued data, the Euclidean distance can be used.
• For other types of data, such as categorical or binary data, the
Hamming distance can be used.
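A small sketch of the two similarity measures mentioned above; the function names are illustrative.

from math import sqrt

def euclidean_distance(a, b):
    # Suitable for real-valued attributes.
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming_distance(a, b):
    # Suitable for categorical or binary attributes: the number of positions that differ.
    return sum(x != y for x, y in zip(a, b))

print(euclidean_distance([2.0, 4.0], [5.0, 8.0]))                   # 5.0
print(hamming_distance(["red", "small", 1], ["red", "large", 0]))   # 2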
K-Nearest Neighbors
• In the case of regression problems, the average of the prediction
attribute may be returned.
• In the case of classification, the most prevalent class may be
returned.
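A sketch of how the neighbors' prediction attribute might be summarized in each case, assuming the neighbors' target values have already been collected into a list; the function names are illustrative.

def summarize_regression(neighbor_targets):
    # Regression: return the average of the neighbors' values.
    return sum(neighbor_targets) / len(neighbor_targets)

def summarize_classification(neighbor_targets):
    # Classification: return the most prevalent class among the neighbors.
    return max(set(neighbor_targets), key=neighbor_targets.count)

print(summarize_regression([3.1, 2.9, 3.4]))        # ~3.13
print(summarize_classification(["a", "b", "a"]))    # 'a'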
K-Nearest Neighbors
• The kNN algorithm belongs to the family of
• instance-based,
• competitive learning and
• lazy learning algorithms.
Competitive learning algorithms
• It is a competitive learning algorithm, because it internally uses
competition between model elements (data instances) in order to
make a predictive decision.
• The objective similarity measure between data instances causes each
data instance to compete to “win” or be most similar to a given
unseen data instance and contribute to a prediction.
Instance-based algorithms
• Instance-based algorithms are those algorithms that model the
problem using data instances (or rows) in order to make predictive
decisions.
• The kNN algorithm is an extreme form of instance-based methods
because all training observations are retained as part of the model.
Lazy learning
• Lazy learning refers to the fact that the algorithm does not build a
model until the time that a prediction is required.
• It is lazy because it only does work at the last second.
• This has the benefit of using only the data relevant to the unseen
instance, resulting in a localized model.
• A disadvantage is that it can be computationally expensive to repeat
the same or similar searches over larger training datasets.
K-nearest neighbors
• Finally, kNN is powerful because it does not assume anything about
the data, other than that a distance measure can be calculated
consistently between any two instances.
• As such, it is called non-parametric or non-linear as it does not
assume a functional form.
Extensions (Asynchronous) By Pairs Sept 23
• Tune KNN. Try larger and larger k values to see if you can improve
the performance of the algorithm on the Iris dataset.
• Regression. Adapt the example and apply it to a regression
predictive modeling problem (e.g., predict a numerical value).
• More Distance Measures. Implement other distance measures that
you can use to find similar historical data, such as Hamming
distance, Manhattan distance and Minkowski distance (a brief
sketch follows this list).
• Data Preparation. Distance measures are strongly affected by the
scale of the input data. Experiment with standardization and other
data preparation methods in order to improve results.
• More Problems. As always, experiment with the technique on more
and different classification and regression problems.
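A hedged sketch of the additional distance measures named in the More Distance Measures extension (Manhattan and Minkowski); the function names are illustrative.

def manhattan_distance(a, b):
    # Sum of absolute differences; equivalent to Minkowski with p = 1.
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski_distance(a, b, p=2):
    # Generalizes Manhattan (p = 1) and Euclidean (p = 2).
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

print(manhattan_distance([2.0, 4.0], [5.0, 8.0]))        # 7.0
print(minkowski_distance([2.0, 4.0], [5.0, 8.0], p=2))   # 5.0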
Expected Output (By Pairs, Sept 23)
• Tune KNN. Try larger and larger k values to see if you can
improve the performance of the algorithm on the Iris dataset.
• Use at least 5 different values for k.
• Using a table, present the accuracy and briefly explain.
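A sketch of the kind of experiment expected, assuming scikit-learn is available for loading the Iris dataset and for a reference kNN classifier; the from-scratch predictor from the linked tutorial could be substituted for KNeighborsClassifier, and the chosen k values and split are only examples.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print("k  accuracy")
for k in (1, 3, 5, 7, 9):   # at least 5 different values of k
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{k:<2} {acc:.3f}")

The printed k/accuracy pairs can be copied into the required table and used as the basis for the brief explanation of how k affects performance.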