
K Nearest Neighbor

The document explains the K-Nearest Neighbors (kNN) algorithm, which classifies unlabeled examples based on the most similar labeled examples from a training dataset. It discusses the importance of choosing the appropriate number of neighbors (k) to balance overfitting and underfitting, as well as techniques for calculating distance, such as Euclidean distance. Additionally, it highlights the lazy learning nature of kNN and provides resources for further learning.


K-Nearest Neighbors

Nearest Neighbor Classification

• In a single sentence, nearest neighbor classifiers are defined by their characteristic of classifying unlabeled examples by assigning them the class of the most similar labeled examples. Despite the simplicity of this idea, nearest neighbor methods are extremely powerful. They have been used successfully for:
  – Computer vision applications, including optical character recognition and facial recognition in both still images and video
  – Predicting whether a person will enjoy a movie that has been recommended to them (as in the Netflix challenge)
  – Identifying patterns in genetic data, for use in detecting specific proteins or diseases
The kNN Algorithm

• The kNN algorithm begins with a training dataset made up of examples that are classified into several categories, as labeled by a nominal variable.
• Assume that we have a test dataset containing unlabeled examples that otherwise have the same features as the training data.
• For each record in the test dataset, kNN identifies k records in the training data that are the "nearest" in similarity, where k is an integer specified in advance.
• The unlabeled test instance is assigned the class of the majority of its k nearest neighbors.
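To make the procedure concrete, here is a minimal sketch of these steps in Python; the helper name knn_predict and the toy food data are illustrative, not taken from the slides:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new instance to every training record
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k nearest training records
    nearest = np.argsort(dists)[:k]
    # assign the class held by the majority of those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy data: [sweetness, crunchiness] with labels (illustrative values)
X_train = np.array([[7, 3], [8, 5], [3, 6], [3, 7]])
y_train = np.array(["fruit", "fruit", "protein", "vegetable"])
print(knn_predict(X_train, y_train, np.array([6, 4]), k=3))   # "fruit"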
Example: "Classify me now!" – classifying a tomato by its nearest neighbors

Reference: Machine Learning with R, Brett Lantz, Packt Publishing
Calculating Distance

• Locating the tomato's nearest neighbors requires a distance function, or a formula that measures the similarity between two instances.
• There are many different ways to calculate distance.
• Traditionally, the kNN algorithm uses Euclidean distance, which is the distance one would measure if you could use a ruler to connect two points, illustrated in the previous figure by the dotted lines connecting the tomato to its neighbors.
Calculating Distance

Reference: Super Data Science


Distance

• Euclidean distance is specified by the following formula, where p and q are the examples to be compared, each having n features. The term p1 refers to the value of the first feature of example p, while q1 refers to the value of the first feature of example q:

  dist(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

• The distance formula involves comparing the values of each feature. For example, to calculate the distance between the tomato (sweetness = 6, crunchiness = 4) and the green bean (sweetness = 3, crunchiness = 7), we can use the formula as follows:

  dist(tomato, green bean) = sqrt((6 - 3)^2 + (4 - 7)^2) = sqrt(18) ≈ 4.24
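A quick hand check of this calculation in Python (the variable names tomato and green_bean are illustrative); the Manhattan distance compared on the next slides is included for contrast:

import math

tomato = (6, 4)       # (sweetness, crunchiness)
green_bean = (3, 7)

# Euclidean: sqrt((6 - 3)^2 + (4 - 7)^2) = sqrt(18) ≈ 4.24
euclidean = math.sqrt(sum((p - q) ** 2 for p, q in zip(tomato, green_bean)))

# Manhattan: |6 - 3| + |4 - 7| = 6
manhattan = sum(abs(p - q) for p, q in zip(tomato, green_bean))

print(round(euclidean, 2), manhattan)   # 4.24 6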
Distance

Manhattan Distance vs. Euclidean Distance

Closest Neighbors
Choosing appropriate k

• Deciding how many neighbors to use for kNN determines how well the model will generalize to future data.
• The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff.
• Choosing a large k reduces the impact or variance caused by noisy data, but can bias the learner such that it runs the risk of ignoring small, but important patterns.
Choosing appropriate k

• In practice, choosing k depends on the difficulty of the concept to be learned and the number of records in the training data.
• Typically, k is set somewhere between 3 and 10. One common practice is to set k equal to the square root of the number of training examples.
• In the classifier, we might set k = 4, because there were 15 example ingredients in the training data and the square root of 15 is 3.87.
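A small illustration of this square-root rule of thumb (the variable names are illustrative):

import math

n_train = 15                    # number of training examples, as in the slide
k = round(math.sqrt(n_train))   # sqrt(15) ≈ 3.87, rounded to k = 4
print(k)                        # 4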
Min-Max normalization

• The traditional method of rescaling features for kNN is min-max normalization.
• This process transforms a feature such that all of its values fall in a range between 0 and 1. The formula for normalizing a feature is as follows. Essentially, the formula subtracts the minimum of feature X from each value and divides by the range of X:

  X_new = (X - min(X)) / (max(X) - min(X))

• Normalized feature values can be interpreted as indicating how far, from 0 percent to 100 percent, the original value fell along the range between the original minimum and maximum.
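A minimal sketch of min-max normalization, applying the formula directly and cross-checking with scikit-learn's MinMaxScaler (the sample feature values are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [3.0], [6.0], [10.0]])   # one feature, illustrative values

# Direct use of the formula: (X - min) / (max - min)
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same rescaling via scikit-learn
X_scaled = MinMaxScaler().fit_transform(X)

print(np.allclose(X_norm, X_scaled))   # True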
Lazy Learning

• Using the strict definition of learning, a lazy learner is not really learning anything.
• Instead, it merely stores the training data. This allows the training phase to occur very rapidly, with a potential downside being that the process of making predictions tends to be relatively slow.
• Due to the heavy reliance on the training instances, lazy learning is also known as instance-based learning or rote learning.
Few Lazy Learning Algorithms

• K Nearest Neighbors
• Local Regression
• Lazy Naive Bayes
The kNN Algorithm
Python Packages needed

• pandas – Data Analytics
• numpy – Numerical Computing
• matplotlib.pyplot – Plotting graphs
• sklearn – Classification and Regression
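Typical imports for these packages, as a small sketch (the aliases pd, np and plt are common conventions, not mandated by the slides):

import pandas as pd                                    # data analytics
import numpy as np                                     # numerical computing
import matplotlib.pyplot as plt                        # plotting graphs
from sklearn.neighbors import KNeighborsClassifier    # kNN classification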
Sample Application
KNN – Classification : Dataset

• The best small project to start with on a new tool is the classification of iris flowers (e.g. the iris dataset).
• This is a good project because it is so well understood.
  – Attributes are numeric so you have to figure out how to load and handle data.
  – It is a classification problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
  – It is a multi-class classification problem (multi-nominal) that may require some specialized handling.
  – It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a screen or A4 page).
  – All of the numeric attributes are in the same units and the same scale, not requiring any special scaling or transforms to get started.
Dataset
KNN – Classification : Dataset
Pre-processing
Characterize
Finding error with changed k
Error rate
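The slides above show screenshots of the application; a self-contained sketch along the same lines is given below. Details such as test_size=0.3, random_state=42 and the range of k values are assumptions, not read from the screenshots:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# Dataset: 150 rows, 4 numeric attributes, 3 iris species
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Pre-processing: min-max normalization fitted on the training data only
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Finding error with changed k: test-set error rate for each k
ks = range(1, 26)
errors = [np.mean(KNeighborsClassifier(n_neighbors=k)
                  .fit(X_train, y_train)
                  .predict(X_test) != y_test) for k in ks]

plt.plot(ks, errors, marker='o')
plt.xlabel('k')
plt.ylabel('error rate')
plt.title('Error rate vs. k (iris test set)')
plt.show()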
Useful resources

• www.mitu.co.in
• www.pythonprogramminglanguage.com
• www.scikit-learn.org
• www.towardsdatascience.com
• www.medium.com
• www.analyticsvidhya.com
• www.kaggle.com
• www.stephacking.com
• www.github.com
Thank you
This presentation was created using LibreOffice Impress 5.1.6.2 and can be used freely as per the GNU General Public License.

mITuSkillologies   @mitu_group   /company/mitu-skillologies   MITUSkillologies

Web Resources
https://mitu.co.in
http://tusharkute.com

[email protected]
[email protected]
