K Nearest Neighbour Classifier

The document discusses the k-Nearest Neighbors (k-NN) algorithm, highlighting the differences between eager and lazy learners, with k-NN categorized as a lazy learner that stores training data for classification. It explains the importance of selecting an appropriate value for k, the use of distance measures like Euclidean distance, and the algorithm's steps for classification. Additionally, it outlines the pros and cons of k-NN, its applications, and compares it with other classifiers.


Learning from Nearest Neighbors
Eager Learners vs Lazy Learners
• Eager learners, when given a set of training tuples, construct a generalization model before receiving new (e.g., test) tuples to classify.

• Lazy learners simply store the data (or do only minor processing) and wait until they are given a test tuple.

• Because lazy learners store the training tuples, or "instances," they are also referred to as instance-based learners, even though all learning is essentially based on instances.

• Lazy learners spend less time in training but more time in predicting.

Examples of lazy learners:
- k-Nearest Neighbor classifier
- Case-based classifier
What is k-NN?

• k-Nearest Neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function).

• k-NN has been used in statistical estimation and pattern recognition since the beginning of the 1970s as a non-parametric technique.
Choosing K
When K is small, we restrict the region used for a given prediction and force our classifier to be "more blind" to the overall distribution.

A small value of K provides the most flexible fit, which has low bias but high variance.

Larger values of K give smoother decision boundaries, which means lower variance but increased bias.
Remarks!!
• k-NN is similarity-function based.

• Choose an odd value of k for 2-class problems.

• k must not be a multiple of the number of classes.
Closeness

• The Euclidean distance between two points or tuples, say X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is

  dist(X1, X2) = sqrt((x11 - x21)² + (x12 - x22)² + ... + (x1n - x2n)²)

• Min-max normalization can be used to transform a value v of a numeric attribute A to v′ in the range [0, 1] by computing

  v′ = (v - minA) / (maxA - minA)
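The two formulas above can be sketched directly in Python; this is a minimal illustration (the function names are mine, not from the slides, and the sample values simply reuse points from the worked example later in the deck):

```python
import math

def euclidean_distance(x1, x2):
    """Euclidean distance between two numeric tuples of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def min_max_normalize(v, min_a, max_a):
    """Min-max normalization of a value v of attribute A into the range [0, 1]."""
    return (v - min_a) / (max_a - min_a)

# Distance between two 2-D points (values taken from the worked example below).
print(euclidean_distance((3, 7), (7, 7)))      # 4.0
# Normalizing v = 3 when attribute A ranges from 1 to 7.
print(min_max_normalize(3, min_a=1, max_a=7))  # 0.333...
```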
How to determine a good value for k?

• Starting with k = 1, we use a test set to estimate the error rate of the classifier.

• The process is repeated, each time incrementing k, and the k value that gives the minimum error rate may be selected.
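As a hedged sketch of this selection procedure, assuming scikit-learn is available and using its bundled iris dataset purely as a stand-in for a train/test split (none of these names come from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data: the bundled iris dataset stands in for "a test set".
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Starting with k = 1, estimate the error rate on the test set for each k,
# then select the k that gives the minimum error rate.
error_rate = {}
for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate[k] = 1 - clf.score(X_test, y_test)

best_k = min(error_rate, key=error_rate.get)
print(best_k, error_rate[best_k])
```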
KNN Algorithm and Example
Distance Measures

Which distance measure should we use?

We use the standard Euclidean distance, as it treats each feature as equally important.
The KNN Algorithm
1. Load the data
2. Initialize K to your chosen number of neighbors
3. For each example in the data
3.1 Calculate the distance between the query example and
the current example from the data.
3.2 Add the distance and the index of the example to an
ordered collection
4. Sort the ordered collection of distances and indices from
smallest to largest (in ascending order) by the distances
5. Pick the first K entries from the sorted collection
6. Get the labels of the selected K entries
7. If regression, return the mean of the K labels
8. If classification, return the mode of the K labels
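The eight steps above map almost line-for-line onto code. Below is a minimal Python sketch (function and variable names are illustrative, not from the slides) covering both the regression and classification cases in steps 7 and 8:

```python
import math
from collections import Counter

def knn_predict(training_data, query, k, mode="classification"):
    """k-NN prediction. training_data is a list of (feature_tuple, label) pairs."""
    # Steps 3-3.2: distance from the query to every training example,
    # stored together with the example's index in an ordered collection.
    distances = []
    for index, (features, _) in enumerate(training_data):
        d = math.sqrt(sum((f - q) ** 2 for f, q in zip(features, query)))
        distances.append((d, index))

    # Step 4: sort the collection by distance, in ascending order.
    distances.sort(key=lambda pair: pair[0])

    # Steps 5-6: take the first K entries and collect their labels.
    k_labels = [training_data[index][1] for _, index in distances[:k]]

    # Step 7: for regression, return the mean of the K labels.
    if mode == "regression":
        return sum(k_labels) / k
    # Step 8: for classification, return the mode of the K labels.
    return Counter(k_labels).most_common(1)[0][0]
```

With the paper-tissue data from the following example, knn_predict(training_data, (3, 7), k=3) returns "Good", matching the worked result.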
KNN Classifier Algorithm Example
• We have data from a questionnaire survey and objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Y = Classification
7                                7                                 Bad
7                                4                                 Bad
3                                4                                 Good
1                                4                                 Good

Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Guess the classification of this new tissue.

• Step 1: Initialize and define k.
Let's say k = 3.
(Always choose k as an odd number when the number of classes is even, to avoid a tie in the class prediction.)

• Step 2: Compute the distance between the input sample and each training sample.
- The coordinate of the input sample is (3, 7).
- Instead of calculating the Euclidean distance, we calculate the squared Euclidean distance.

X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Squared Euclidean distance
7                                7                                 (7-3)² + (7-7)² = 16
7                                4                                 (7-3)² + (4-7)² = 25
3                                4                                 (3-3)² + (4-7)² = 9
1                                4                                 (1-3)² + (4-7)² = 13

• Step 3: Sort the distances and determine the nearest neighbours based on the K-th minimum distance.

X1 = Acid Durability   X2 = Strength   Squared Euclidean distance   Rank (by minimum distance)   Included in 3-Nearest Neighbours?
7                      7               16                           3                            Yes
7                      4               25                           4                            No
3                      4               9                            1                            Yes
1                      4               13                           2                            Yes

• Step 4: Take the 3 nearest neighbours and gather their category Y.

X1 = Acid Durability   X2 = Strength   Squared Euclidean distance   Rank (by minimum distance)   Included in 3-Nearest Neighbours?   Y = Category of the nearest neighbour
7                      7               16                           3                            Yes                                 Bad
7                      4               25                           4                            No                                  -
3                      4               9                            1                            Yes                                 Good
1                      4               13                           2                            Yes                                 Good

• Step 5: Apply a simple majority vote.

• Use the simple majority of the categories of the nearest neighbours as the predicted value for the query instance.

• We have 2 "Good" and 1 "Bad". Thus we conclude that the new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 falls into the "Good" category.
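The five steps can also be checked with a few lines of Python; this is just an illustrative sketch that hard-codes the four training samples and the query point (3, 7) from the example:

```python
from collections import Counter

training = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 3

# Step 2: squared Euclidean distance from the query to each training sample.
dists = [(sum((a - b) ** 2 for a, b in zip(x, query)), label) for x, label in training]

# Steps 3-4: sort by distance and keep the 3 nearest neighbours.
nearest = sorted(dists)[:k]   # [(9, 'Good'), (13, 'Good'), (16, 'Bad')]

# Step 5: simple majority vote over the neighbours' categories.
print(Counter(label for _, label in nearest).most_common(1)[0][0])   # prints "Good"
```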
Pros:
• Makes no assumptions about the data, which is useful, for example, for nonlinear data
• A simple algorithm that is easy to explain and interpret
• No need to build a model or tune several parameters
• Versatile: useful for both classification and regression
Cons:
• Computationally expensive, because the algorithm stores all of the training data
• High memory requirement
• Stores all (or almost all) of the training data
• The prediction stage can be slow when N is large
Applications of the KNN Classifier
• Used in classification
• Used to impute missing values
• Used in pattern recognition
• Used in gene expression analysis
• Used in protein-protein interaction prediction
• Used to predict the 3D structure of proteins
• Used to measure document similarity
Comparison of various classifiers

C4.5 Algorithm
  Features: Models built can be easily interpreted; easy to implement; can use both discrete and continuous values; deals with noise.
  Limitations: A small variation in the data can lead to different decision trees; does not work very well on a small training dataset; prone to over-fitting.

ID3 Algorithm
  Features: Produces more accurate results than C4.5; detection rate is increased and space consumption is reduced.
  Limitations: Requires a large searching time; sometimes generates very long rules which are difficult to prune; requires a large amount of memory to store the tree.

K-Nearest Neighbour Algorithm
  Features: Classes need not be linearly separable; zero cost of the learning process; sometimes robust with regard to noisy training data; well suited for multimodal classes.
  Limitations: The time to find the nearest neighbours in a large training dataset can be excessive; sensitive to noisy or irrelevant attributes; performance depends on the number of dimensions used.

Naïve Bayes Algorithm
  Features: Simple to implement; great computational efficiency and classification rate; predicts accurate results for most classification and prediction problems.
  Limitations: The precision of the algorithm decreases if the amount of data is small; obtaining good results requires a very large number of records.

Support Vector Machine Algorithm
  Features: High accuracy; works well even if the data is not linearly separable in the base feature space.
  Limitations: Speed and size requirements are high in both training and testing; high complexity and extensive memory requirements for classification in many cases.

Artificial Neural Networks Algorithm
  Features: Easy to use, with few parameters to adjust; a neural network learns, so reprogramming is not needed; easy to implement; applicable to a wide range of real-life problems.
  Limitations: Requires a high processing time if the neural network is large; difficult to know how many neurons and layers are necessary; learning can be slow.
Conclusion
• KNN is what we call lazy learning (vs. eager learning)
• Conceptually simple, easy to understand and explain
• Very flexible decision boundaries
• Not much learning at all!
• It can be hard to find a good distance measure
• Irrelevant features and noise can be very detrimental
• Typically cannot handle more than a few dozen attributes
• Computational cost: requires a lot of computation
