05 K-Nearest Neighbors

K-Nearest Neighbors is a simple machine learning algorithm that classifies new data points based on the majority class of their k nearest neighbors. It searches the training dataset for the k most similar instances to make predictions, using distance measures such as Euclidean distance. For classification, it returns the most common class among the neighbors; for regression, it averages their values. KNN is non-parametric, instance-based, and performs lazy learning, building no model until prediction time.

K-Nearest Neighbors

Reference: https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
K-Nearest Neighbors
• The model for kNN is the entire training dataset.
• When a prediction is required for an unseen data instance, the kNN algorithm searches the training dataset for the k most similar instances.
• The prediction attribute of the most similar instances is summarized and returned as the prediction for the unseen instance.
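A minimal sketch of this neighbor search, assuming each training row is a tuple of numeric features followed by a class label (the function name and data layout are illustrative, not taken from the linked tutorial):

import math

def get_neighbors(train, test_row, k):
    # Distance from the unseen row to every training row; row[-1] holds
    # the label, so it is excluded from the distance calculation.
    distances = [(row, math.dist(row[:-1], test_row)) for row in train]
    distances.sort(key=lambda pair: pair[1])
    return [row for row, _ in distances[:k]]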
K-Nearest Neighbors
• The similarity measure depends on the type of data.
• For real-valued data, the Euclidean distance can be used.
• For other types of data, such as categorical or binary data, the Hamming distance can be used.
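For categorical data, the Hamming distance is simply the count of mismatched positions; a minimal sketch (illustrative, not the tutorial's code):

def hamming_distance(row1, row2):
    # Count the positions where two equal-length categorical rows disagree.
    return sum(a != b for a, b in zip(row1, row2))

# hamming_distance(('red', 'small'), ('red', 'large')) -> 1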
K-Nearest Neighbors
• In the case of regression problems, the average of the predicted attribute may be returned.
• In the case of classification, the most prevalent class may be returned.
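Assuming the k nearest neighbors have already been collected, with the predicted attribute in the last column, the two summaries might look like this sketch (function names are illustrative):

from collections import Counter

def predict_classification(neighbors):
    # Majority vote over the neighbors' class labels.
    labels = [row[-1] for row in neighbors]
    return Counter(labels).most_common(1)[0][0]

def predict_regression(neighbors):
    # Mean of the neighbors' predicted attribute.
    values = [row[-1] for row in neighbors]
    return sum(values) / len(values)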
K-Nearest Neighbors
• The kNN algorithm belongs to the family of
• instance-based,
• competitive learning, and
• lazy learning algorithms.
Competitive learning algorithm
• It is a competitive learning algorithm because it internally uses competition between model elements (data instances) in order to make a predictive decision.
• The objective similarity measure between data instances causes each data instance to compete to “win”, or be most similar, to a given unseen data instance and contribute to a prediction.
Instance-based algorithms
• Instance-based algorithms are those algorithms that model the problem using data instances (or rows) in order to make predictive decisions.
• The kNN algorithm is an extreme form of instance-based method because all training observations are retained as part of the model.
Lazy learning
• Lazy learning refers to the fact that the algorithm does not build a model until the time that a prediction is required.
• It is lazy because it only does work at the last second.
• This has the benefit of using only the data relevant to the unseen instance, called a localized model.
• A disadvantage is that it can be computationally expensive to repeat the same or similar searches over larger training datasets.
K-Nearest Neighbors
• Finally, kNN is powerful because it assumes nothing about the data other than that a distance measure can be calculated consistently between any two instances.
• As such, it is called non-parametric or non-linear, as it does not assume a functional form.
Extensions (Asynchronous) By Pairs Sept 23
• Tune KNN. Try larger and larger k values to see if you can improve the performance of the algorithm on the Iris dataset.
• Regression. Adapt the example and apply it to a regression predictive modeling problem (e.g., predict a numerical value).
• More Distance Measures. Implement other distance measures that you can use to find similar historical data, such as Hamming distance, Manhattan distance, and Minkowski distance (see the sketch after this list).
• Data Preparation. Distance measures are strongly affected by the scale of the input data. Experiment with standardization and other data preparation methods in order to improve results.
• More Problems. As always, experiment with the technique on more and different classification and regression problems.
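For the distance-measures extension, Manhattan and Minkowski distances might be sketched as follows (function names are illustrative; Hamming distance is sketched earlier in these notes):

def manhattan_distance(row1, row2):
    # Sum of absolute coordinate differences (the L1 norm).
    return sum(abs(a - b) for a, b in zip(row1, row2))

def minkowski_distance(row1, row2, p=2):
    # Generalizes Manhattan (p=1) and Euclidean (p=2).
    return sum(abs(a - b) ** p for a, b in zip(row1, row2)) ** (1 / p)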
By Pairs Sept 23 Expected Output
• Tune KNN. Try larger and larger k values to see if you can improve the performance of the algorithm on the Iris dataset.
• Use at least 5 different values for k.
• Using a table, present the accuracy and briefly explain.
• Regression. Adapt the example and apply it to a regression predictive modeling problem (e.g., predict a numerical value).
• Explain what the predictive modeling problem is about.
• Explain the result or what changes were made (if any).
By Pairs Sept 23 Expected Output
• More Distance Measures. Implement other distance measures that you can use to find similar historical data, such as Hamming distance, Manhattan distance, and Minkowski distance.
• Identify what distance measure was used, and explain the result.
By Pairs Sept 23 Expected Output
• Data Preparation. Distance measures are strongly affected by the scale of the input data. Experiment with standardization and other data preparation methods in order to improve results.
• Define a problem.
• Identify what data preparation method can be used.
• Explain how the data can be evaluated.
• Reference on data preparation: https://machinelearningmastery.com/data-preparation-techniques-for-machine-learning/
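As a rough illustration of the standardization idea above, each numeric column can be rescaled to zero mean and unit variance before distances are computed (a minimal sketch; the helper name is invented here):

def standardize(dataset):
    # Rescale each numeric column to zero mean and unit variance so that
    # no single feature dominates the distance calculation.
    cols = list(zip(*dataset))
    means = [sum(col) / len(col) for col in cols]
    stdevs = [(sum((x - m) ** 2 for x in col) / len(col)) ** 0.5
              for col, m in zip(cols, means)]
    return [tuple((x - m) / s if s else 0.0
                  for x, m, s in zip(row, means, stdevs))
            for row in dataset]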
For Example:
• Problem: Email Spam or not Spam
• Data Preparation:
• Gather data, Clean data, Identify Features of Spam emails…
• Evaluation:
• Measure Accuracy of Classification
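For the evaluation step, classification accuracy is simply the share of correct predictions; a minimal sketch (the function name is illustrative):

def accuracy_metric(actual, predicted):
    # Percentage of predictions that match the true labels.
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual) * 100.0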
September 23 Session
• Review Linear and Logistic Regression (Asynchronous)
• Summative Exam #1 activates at 2 pm.
• Download and Install Respondus Lockdown Browser.
