K-Nearest Neighbors (KNN) and Predictive Accuracy
Dr. Ammar Mohammed
Associate Professor of Computer Science
Fall 2021
Nearest Neighbors
- Store the training samples as a set of stored cases (attributes Atr1 … AtrN plus a class label).
- Use the stored training samples to predict the class label of unseen cases.
[Figure: a stored-cases table (Atr1 … AtrN, Class) next to an unseen case whose class is to be predicted]
This approach is called instance-based learning.
Instance-Based Learning
- Approximates real-valued or discrete-valued target functions.
- Learning in this algorithm consists of storing the presented training data (no model is generated).
- When a new query instance (unseen data) is encountered, a set of similar related instances is retrieved from memory and used to classify the new query instance.
- A disadvantage of instance-based methods is that the cost of classifying new instances can be high: nearly all computation takes place at classification time rather than at learning time.
K-Nearest Neighbors (KNN)
- The most basic instance-based method; a supervised learning algorithm.
- Basic idea of KNN: classify an object based on the closest examples in the training data. Compute the distance from the test sample to the training samples, then choose the k "nearest" samples.
Nearest Neighbors
Classifying an unknown record requires three inputs:
1. The set of stored samples (the training data)
2. A distance metric to compute the distance between samples
3. The value of k, the number of nearest neighbors to retrieve
Nearest-neighbor classifiers are lazy learners: no models are pre-constructed for classification.
Nearest Neighbors Example

    Food(3)    Chat(2)  Fast(2)  Price(3)  Bar(2)  BigTip
1   great      yes      yes      normal    no      yes
2   great      no       yes      normal    no      yes
3   mediocre   yes      no       high      no      no
4   great      yes      yes      normal    yes     yes

Similarity metric: number of matching attributes (k = 2)

New examples:
Example 1: (great, no, no, normal, no) → Yes
- Most similar: number 2 (1 mismatch, 4 matches) → yes
- Second most similar: number 1 (2 mismatches, 3 matches) → yes
Example 2: (mediocre, yes, no, normal, no) → Yes/No (tie)
- Most similar: number 3 (1 mismatch, 4 matches) → no
- Second most similar: number 1 (2 mismatches, 3 matches) → yes
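The matching-attributes similarity above can be sketched in Python. The table rows and the two query examples come from the slide; the function names and the use of `Counter` are our own choices:

```python
from collections import Counter

# Training data from the example table: (food, chat, fast, price, bar) -> BigTip
train = [
    (("great", "yes", "yes", "normal", "no"), "yes"),
    (("great", "no", "yes", "normal", "no"), "yes"),
    (("mediocre", "yes", "no", "high", "no"), "no"),
    (("great", "yes", "yes", "normal", "yes"), "yes"),
]

def matches(a, b):
    """Similarity = number of attribute values that match."""
    return sum(x == y for x, y in zip(a, b))

def knn_matching(query, train, k=2):
    """Vote counts among the k most similar stored samples."""
    ranked = sorted(train, key=lambda s: matches(query, s[0]), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common()

print(knn_matching(("great", "no", "no", "normal", "no"), train))
# both nearest rows (2 and 1) vote "yes"
print(knn_matching(("mediocre", "yes", "no", "normal", "no"), train))
# rows 3 ("no") and 1 ("yes") tie, matching the Yes/No result above
```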
Nearest Neighbors
Compute the distance between two points x = (x1, x2, …, xn) and y = (y1, y2, …, yn):
- Euclidean distance:
  d(x, y) = sqrt( Σ_i (x_i − y_i)² )
Options for determining the class from the nearest-neighbor list:
- Take a majority vote of the class labels among the k nearest neighbors.
Example
In 5-nearest neighbors, what is the class of the new instance x = (9.1, 11.0)?

d(x, y1) = sqrt((9.1 − 0.8)² + (11.0 − 6.3)²) ≈ 9.5
d(x, y2) = sqrt((9.1 − 1.4)² + (11.0 − 8.1)²) ≈ 8.2
⋮
d(x, yn) = sqrt((9.1 − 19.6)² + (11.0 − 11.1)²) ≈ 10.5

Select the 5 instances having minimum distance. You will find:
- 3 instances classified as +
- 2 instances classified as −
We conclude that x = (9.1, 11.0) is classified as +.
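This procedure can be sketched in Python. Only the three coordinate pairs from the distance computations above are from the slide; the classifier skeleton and any full training set are assumptions:

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(query, samples, k=5):
    """Majority vote over the k nearest training samples.
    samples: list of (point, label) pairs."""
    nearest = sorted(samples, key=lambda s: euclidean(query, s[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# The distances from the worked example
x = (9.1, 11.0)
print(round(euclidean(x, (0.8, 6.3)), 1))    # 9.5
print(round(euclidean(x, (1.4, 8.1)), 1))    # 8.2
print(round(euclidean(x, (19.6, 11.1)), 1))  # 10.5
```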
Feature Normalization
Scaling issues:
- Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
- Example:
  - height of a person may vary from 1.5 m to 1.8 m
  - weight of a person may vary from 55 kg to 120 kg
  - income of a person may vary from $10K to $1M
Feature Normalization
The distance is dominated by the attribute Loan, while the attribute Age has almost no impact. How can we solve this problem?
KNN Standardized Distance
Rescale each attribute with min-max normalization before computing distances:

X_s = (X − Min) / (Max − Min)
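A quick sketch of this min-max formula applied per attribute. The age and loan values are invented for illustration; after scaling, both attributes contribute comparably to the distance:

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [25, 35, 45, 55]
loans = [20_000, 60_000, 100_000, 140_000]
print(min_max_scale(ages))   # [0.0, 0.333..., 0.666..., 1.0]
print(min_max_scale(loans))  # the same scaled values, despite the raw magnitudes
```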
How to Determine a Good Value of k
[Figure: a query point "?" surrounded by square and triangle training points]
- k = 1: the query belongs to the square class.
- k = 3: the query belongs to the triangle class.
- k = 7: the query belongs to the square class.
Choosing the value of k:
- If k is too small, the classifier is sensitive to noise points.
- If k is too large, the neighborhood may include points from other classes.
- Choose an odd value for k to eliminate ties.
How to Determine a Good Value of k
Determine k experimentally:
- Start with k = 1 and use a test set to validate the error rate of the classifier.
- Repeat with k = k + 2.
- Choose the value of k for which the error rate is minimum.
Note: k should be an odd number to avoid ties.
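The experimental search above might look like the following sketch. The simple Euclidean k-NN predictor and the toy data in the demo are assumptions, not from the slides:

```python
import math
from collections import Counter

def knn_predict(query, train, k):
    """k-NN majority vote with Euclidean distance."""
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    nearest = sorted(train, key=lambda s: dist(query, s[0]))[:k]
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]

def best_k(train, val, max_k=9):
    """Start at k=1, step by 2 (odd k avoids ties), keep the k with minimum error."""
    best, best_err = 1, float("inf")
    for k in range(1, max_k + 1, 2):
        err = sum(knn_predict(x, train, k) != y for x, y in val) / len(val)
        if err < best_err:
            best, best_err = k, err
    return best

train = [((0,), "a"), ((1,), "a"), ((10,), "b"), ((11,), "b")]
val = [((0.5,), "a"), ((10.5,), "b")]
print(best_k(train, val))  # 1 (already zero validation error at k=1)
```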
Predictive Accuracy
Classification step 1: splitting the data
[Figure: the available data, whose results (+/−) are known from the past, is split into a training set and a testing set]
Classification step 2: train and evaluate
[Figure: a model builder learns from the training set (results known); its predictions (Y/N) on the testing set are compared against the true labels to evaluate the model]
Methods for Evaluation
- Predictive accuracy: the most obvious method for estimating the performance of a classifier:
  Accuracy = (number of correct classifications) / (total number of test cases)
- Efficiency:
  - time to construct the model
  - time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency on disk-resident databases
- Interpretability: the understandability and insight provided by the model
Predictive Accuracy:
P = C / N
where P is the accuracy, C the number of correctly classified instances, and N the total number of instances.
- The available data is split into two parts, called the training set and the test set.
- If the dataset is a single file, we need to divide it into a training set and a test set before applying the method.
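The P = C / N computation as a small helper (the labels in the demo are invented):

```python
def accuracy(y_true, y_pred):
    """P = C / N: the fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy(["+", "+", "-", "-"], ["+", "-", "-", "-"]))  # 0.75
```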
Splitting the Data
- Holdout set: the available dataset D is divided into two disjoint subsets:
  - the training set Dtrain (for learning a model)
  - the test set Dtest (for testing the model)
- Important: the training set should not be used in testing, and the test set should not be used in learning.
  - An unseen test set provides an unbiased estimate of accuracy.
- The test set is also called the holdout set. (The examples in the original dataset D are all labeled with classes.)
- This method is mainly used when the dataset D is large.
Splitting Data using n-fold cross Validation
n-fold Cross-Validation
- n-fold cross-validation: the available data is partitioned into n equal-size disjoint subsets.
- Each subset in turn is used as the test set, with the remaining n − 1 subsets combined as the training set to learn a classifier.
- The procedure runs n times, giving n accuracies.
- The final estimated accuracy of learning is the average of the n accuracies.
- 10-fold and 5-fold cross-validation are commonly used.
- This method is used when the available data is not large.
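The n-fold procedure can be sketched generically. Here `train_and_score` is a stand-in we invented for any learner plus its accuracy evaluation; the demo scorer just reports fold sizes to show the mechanics:

```python
def cross_validate(data, n, train_and_score):
    """n-fold CV: each fold serves as the test set once; return the mean score.
    train_and_score(train, test) -> accuracy on the test fold."""
    folds = [data[i::n] for i in range(n)]  # n disjoint, near-equal subsets
    scores = []
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / n

# Shape check: 10 items, 5 folds -> each test fold has 2 items
print(cross_validate(list(range(10)), 5, lambda train, test: len(test)))  # 2.0
```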
Accuracy Paradox
Accuracy is not suitable in some applications:
- With class imbalance, accuracy alone cannot be trusted to select a well-trained model.
- In text mining, we may only be interested in the documents of a particular topic, which are only a small portion of a big document collection.
- In classification involving skewed or highly imbalanced data, e.g., network intrusion and financial fraud detection, we are interested only in the minority class.
  - High accuracy does not mean any intrusion is detected.
  - E.g., with 1% intrusions, a classifier achieves 99% accuracy by doing nothing (always predicting "no intrusion").
- The class of interest is commonly called the positive class, and the rest the negative class.
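The 1%-intrusion example can be checked numerically with toy labels (the data here is invented to match the slide's proportions):

```python
# Accuracy paradox: with 1% positives, a "do nothing" classifier that always
# predicts negative scores 99% accuracy while detecting nothing.
y_true = ["pos"] * 10 + ["neg"] * 990
y_pred = ["neg"] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == p == "pos" for t, p in zip(y_true, y_pred)) / y_true.count("pos")
print(accuracy)  # 0.99
print(recall)    # 0.0 -- not a single intrusion detected
```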
Confusion Matrix
[Figure: a classifier's predicted labels compared against the actual labels in a 2x2 matrix with cells a, b, c, d]
Confusion Matrix Performance
A confusion matrix is a way of describing the breakdown of the errors
in predictions. It shows the number of correct and incorrect
predictions made by the classification model compared to the actual
outcomes (target value) in the data. The matrix is NxN, where N is
the number of target values (classes). Performance of such models is
commonly evaluated using the data in the matrix. The following table
displays a 2x2 confusion matrix for two classes (Positive and
Negative)
Confusion Matrix Performance
- Accuracy: the proportion of correct classifications out of the overall number of cases.
- Positive Predictive Value, or Precision: the proportion of predicted-positive cases that were correctly identified.
- Negative Predictive Value: the proportion of predicted-negative cases that were correctly identified.
- Sensitivity, or Recall: the proportion of actual positive cases that are correctly identified.
- Specificity: the proportion of actual negative cases that are correctly identified.
Confusion Matrix

                      Actual P (positive)   Actual N (negative)
Predicted Positive    TP                    FP
Predicted Negative    FN                    TN

- TP (True Positive): the number of correct classifications of the positive examples.
- FN (False Negative): the number of incorrect classifications of the positive examples.
- FP (False Positive): the number of incorrect classifications of the negative examples.
- TN (True Negative): the number of correct classifications of the negative examples.
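Counting the four cells from predicted/actual label lists might be sketched as follows (the label values in the demo are invented):

```python
def confusion_counts(y_true, y_pred, positive):
    """Count (TP, FP, FN, TN) for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = ["P", "P", "N", "N"]
y_pred = ["P", "N", "P", "N"]
print(confusion_counts(y_true, y_pred, positive="P"))  # (1, 1, 1, 1)
```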
Precision and Recall Measures
- Used in information retrieval and text classification.
- We use a confusion matrix to introduce them.
Precision and Recall Measures

p = TP / (TP + FP)        r = TP / (TP + FN)

- Precision p: the number of true positives divided by the total number of true positives and false positives, i.e., the number of correctly classified positive examples divided by the total number of examples classified as positive.
- Recall r: the number of true positives divided by the total number of true positives and false negatives, i.e., the number of correctly classified positive examples divided by the total number of actual positive examples in the test set.
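These two formulas in code, as a tiny sketch (the counts 1 true positive, 0 false positives, 99 false negatives are invented to produce an extreme precision/recall gap):

```python
def precision_recall(tp, fp, fn):
    """p = TP / (TP + FP), r = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

p, r = precision_recall(1, 0, 99)
print(p)  # 1.0  (100% precision)
print(r)  # 0.01 (1% recall)
```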
Example
- This confusion matrix gives precision p = 100% and recall r = 1%, because we classified only one positive example correctly and no negative examples wrongly.
- Note: precision and recall only measure classification on the positive class.
F-Score (F1-Score)
- It is hard to compare two classifiers using two measures; the F1 score combines precision and recall into one measure:

  F1 = 2pr / (p + r)

- The F1-score is the harmonic mean of precision and recall:

  F1 = 2 / (1/p + 1/r)

- The harmonic mean of two numbers tends to be closer to the smaller of the two.
- For the F1 value to be large, both p and r must be large.
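A sketch of the F1 computation, showing how the harmonic mean is pulled toward the smaller of the two values (the inputs reuse the p = 100%, r = 1% example):

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall: 2pr / (p + r)."""
    return 2 * p * r / (p + r)

print(round(f1_score(1.0, 0.01), 4))  # 0.0198 -- dragged toward the tiny recall
print(round(f1_score(0.8, 0.8), 3))   # 0.8
```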
Question
Suppose we train a model to predict whether a patient is Covid-19 positive or negative. After training the model, we apply it to a test set of 1000 labeled patients and the model produces the following confusion matrix:

                              Actual Covid_Positive   Actual Covid_Negative
Predicted Covid_Positive      450                     70
Predicted Covid_Negative      90                      390

- Calculate the predictive accuracy and the precision of the classifier.
- Calculate the recall (sensitivity) and specificity of the classifier.
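A worked check of this exercise in Python, reading TP = 450, FP = 70, FN = 90, TN = 390 off the matrix above:

```python
tp, fp, fn, tn = 450, 70, 90, 390  # from the confusion matrix

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)           # sensitivity
specificity = tn / (tn + fp)

print(f"accuracy    = {accuracy:.3f}")    # 0.840
print(f"precision   = {precision:.3f}")   # 0.865
print(f"recall      = {recall:.3f}")      # 0.833
print(f"specificity = {specificity:.3f}") # 0.848
```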