K-Nearest Neighbors (KNN) and Predictive Accuracy
Dr. Ammar Mohammed
Associate Professor of Computer Science
Fall 2021
Nearest Neighbors
- Store the training samples as a set of stored cases (attributes Atr1 … AtrN plus a class label).
- Use the stored training samples to predict the class label of unseen cases.
[Figure: a stored-cases table (Atr1 … AtrN, Class) next to an unseen case whose class is to be predicted]
This approach is called instance-based learning.
Instance-Based Learning
- Approximates real-valued or discrete-valued target functions.
- Learning in this algorithm consists of storing the presented training data (no model is generated).
- When a new query instance (unseen data) is encountered, a set of similar related instances is retrieved from memory and used to classify the new query instance.
- A disadvantage of instance-based methods is that the cost of classifying new instances can be high: nearly all computation takes place at classification time rather than at learning time.
K-Nearest Neighbors (KNN)
- The most basic instance-based method; a supervised learning algorithm.
- Basic idea of KNN: classify an object based on the closest examples in the training data. Compute the distance from the test sample to the training samples, then choose the k "nearest" samples.
Nearest Neighbors
Classifying an unknown record requires three inputs:
1. The set of stored samples (the training data)
2. A distance metric to compute the distance between samples
3. The value of k, the number of nearest neighbors to retrieve
Nearest-neighbor classifiers are lazy learners: no models are pre-constructed for classification.
Nearest Neighbors Example

    Food(3)    Chat(2)  Fast(2)  Price(3)  Bar(2)  BigTip
1   great      yes      yes      normal    no      yes
2   great      no       yes      normal    no      yes
3   mediocre   yes      no       high      no      no
4   great      yes      yes      normal    yes     yes

Similarity metric: number of matching attributes (k = 2)

New examples:
Example 1: (great, no, no, normal, no) → Yes
- Most similar: number 2 (1 mismatch, 4 matches) → yes
- Second most similar: number 1 (2 mismatches, 3 matches) → yes
Example 2: (mediocre, yes, no, normal, no) → Yes/No (tie)
- Most similar: number 3 (1 mismatch, 4 matches) → no
- Second most similar: number 1 (2 mismatches, 3 matches) → yes
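The matching-attributes similarity above can be sketched in Python. The table rows and the two query examples come from the slide; the function names and the use of `Counter` are our own choices:

```python
from collections import Counter

# Training data from the example table: (food, chat, fast, price, bar) -> BigTip
train = [
    (("great", "yes", "yes", "normal", "no"), "yes"),
    (("great", "no", "yes", "normal", "no"), "yes"),
    (("mediocre", "yes", "no", "high", "no"), "no"),
    (("great", "yes", "yes", "normal", "yes"), "yes"),
]

def matches(a, b):
    """Similarity = number of attribute values that match."""
    return sum(x == y for x, y in zip(a, b))

def knn_matching(query, train, k=2):
    """Vote counts among the k most similar stored samples."""
    ranked = sorted(train, key=lambda s: matches(query, s[0]), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common()

print(knn_matching(("great", "no", "no", "normal", "no"), train))
# both nearest rows (2 and 1) vote "yes"
print(knn_matching(("mediocre", "yes", "no", "normal", "no"), train))
# rows 3 ("no") and 1 ("yes") tie, matching the Yes/No result above
```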
Nearest Neighbors
Compute the distance between two points x = (x1, x2, …, xn) and y = (y1, y2, …, yn):
- Euclidean distance:
  d(x, y) = sqrt( Σ_i (x_i − y_i)² )
Options for determining the class from the nearest-neighbor list:
- Take a majority vote of the class labels among the k nearest neighbors.
Example
In 5-nearest neighbors, what is the class of the new instance x = (9.1, 11.0)?

d(x, y1) = sqrt((9.1 − 0.8)² + (11.0 − 6.3)²) ≈ 9.5
d(x, y2) = sqrt((9.1 − 1.4)² + (11.0 − 8.1)²) ≈ 8.2
⋮
d(x, yn) = sqrt((9.1 − 19.6)² + (11.0 − 11.1)²) ≈ 10.5

Select the 5 instances having minimum distance. You will find:
- 3 instances classified as +
- 2 instances classified as −
We conclude that x = (9.1, 11.0) is classified as +.
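This procedure can be sketched in Python. Only the three coordinate pairs from the distance computations above are from the slide; the classifier skeleton and any full training set are assumptions:

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(query, samples, k=5):
    """Majority vote over the k nearest training samples.
    samples: list of (point, label) pairs."""
    nearest = sorted(samples, key=lambda s: euclidean(query, s[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# The distances from the worked example
x = (9.1, 11.0)
print(round(euclidean(x, (0.8, 6.3)), 1))    # 9.5
print(round(euclidean(x, (1.4, 8.1)), 1))    # 8.2
print(round(euclidean(x, (19.6, 11.1)), 1))  # 10.5
```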
Feature Normalization
Scaling issues:
- Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
- Example:
  - height of a person may vary from 1.5 m to 1.8 m
  - weight of a person may vary from 55 kg to 120 kg
  - income of a person may vary from $10K to $1M
Feature Normalization
The distance is dominated by the attribute Loan, while the attribute Age has almost no impact. How can we solve this problem?
KNN Standardized Distance
Rescale each attribute with min-max normalization before computing distances:

X_s = (X − Min) / (Max − Min)
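A quick sketch of this min-max formula applied per attribute. The age and loan values are invented for illustration; after scaling, both attributes contribute comparably to the distance:

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [25, 35, 45, 55]
loans = [20_000, 60_000, 100_000, 140_000]
print(min_max_scale(ages))   # [0.0, 0.333..., 0.666..., 1.0]
print(min_max_scale(loans))  # the same scaled values, despite the raw magnitudes
```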
How to Determine a Good Value of k
[Figure: a query point "?" surrounded by square and triangle training points]
- k = 1: the query belongs to the square class.
- k = 3: the query belongs to the triangle class.
- k = 7: the query belongs to the square class.
Choosing the value of k:
- If k is too small, the classifier is sensitive to noise points.
- If k is too large, the neighborhood may include points from other classes.
- Choose an odd value for k to eliminate ties.
How to Determine a Good Value of k
Determine k experimentally:
- Start with k = 1 and use a test set to validate the error rate of the classifier.
- Repeat with k = k + 2.
- Choose the value of k for which the error rate is minimum.
Note: k should be an odd number to avoid ties.
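The experimental search above might look like the following sketch. The simple Euclidean k-NN predictor and the toy data in the demo are assumptions, not from the slides:

```python
import math
from collections import Counter

def knn_predict(query, train, k):
    """k-NN majority vote with Euclidean distance."""
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    nearest = sorted(train, key=lambda s: dist(query, s[0]))[:k]
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]

def best_k(train, val, max_k=9):
    """Start at k=1, step by 2 (odd k avoids ties), keep the k with minimum error."""
    best, best_err = 1, float("inf")
    for k in range(1, max_k + 1, 2):
        err = sum(knn_predict(x, train, k) != y for x, y in val) / len(val)
        if err < best_err:
            best, best_err = k, err
    return best

train = [((0,), "a"), ((1,), "a"), ((10,), "b"), ((11,), "b")]
val = [((0.5,), "a"), ((10.5,), "b")]
print(best_k(train, val))  # 1 (already zero validation error at k=1)
```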
Predictive Accuracy
Classification step 1: splitting the data
[Figure: the available data, whose results (+/−) are known from the past, is split into a training set and a testing set]
Classification step 2: train and evaluate
[Figure: a model builder learns from the training set (results known); its predictions (Y/N) on the testing set are compared against the true labels to evaluate the model]
Methods for Evaluation
- Predictive accuracy: the most obvious method for estimating the performance of a classifier:
  Accuracy = (number of correct classifications) / (total number of test cases)
- Efficiency:
  - time to construct the model
  - time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency on disk-resident databases
- Interpretability: the understandability and insight provided by the model
Predictive Accuracy:
P = C / N
where P is the accuracy, C the number of correctly classified instances, and N the total number of instances.
- The available data is split into two parts, called the training set and the test set.
- If the dataset is a single file, we need to divide it into a training set and a test set before applying the method.
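The P = C / N computation as a small helper (the labels in the demo are invented):

```python
def accuracy(y_true, y_pred):
    """P = C / N: the fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy(["+", "+", "-", "-"], ["+", "-", "-", "-"]))  # 0.75
```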
Splitting the Data
- Holdout set: the available dataset D is divided into two disjoint subsets:
  - the training set Dtrain (for learning a model)
  - the test set Dtest (for testing the model)
- Important: the training set should not be used in testing, and the test set should not be used in learning.
  - An unseen test set provides an unbiased estimate of accuracy.
- The test set is also called the holdout set. (The examples in the original dataset D are all labeled with classes.)
- This method is mainly used when the dataset D is large.
Splitting Data using n-fold cross Validation
n-fold Cross-Validation
- n-fold cross-validation: the available data is partitioned into n equal-size disjoint subsets.
- Each subset in turn is used as the test set, with the remaining n − 1 subsets combined as the training set to learn a classifier.
- The procedure runs n times, giving n accuracies.
- The final estimated accuracy of learning is the average of the n accuracies.
- 10-fold and 5-fold cross-validation are commonly used.
- This method is used when the available data is not large.
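The n-fold procedure can be sketched generically. Here `train_and_score` is a stand-in we invented for any learner plus its accuracy evaluation; the demo scorer just reports fold sizes to show the mechanics:

```python
def cross_validate(data, n, train_and_score):
    """n-fold CV: each fold serves as the test set once; return the mean score.
    train_and_score(train, test) -> accuracy on the test fold."""
    folds = [data[i::n] for i in range(n)]  # n disjoint, near-equal subsets
    scores = []
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / n

# Shape check: 10 items, 5 folds -> each test fold has 2 items
print(cross_validate(list(range(10)), 5, lambda train, test: len(test)))  # 2.0
```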
Accuracy Paradox
Accuracy is not suitable in some applications:
- With class imbalance, accuracy alone cannot be trusted to select a well-trained model.
- In text mining, we may only be interested in the documents of a particular topic, which are only a small portion of a big document collection.
- In classification involving skewed or highly imbalanced data, e.g., network intrusion and financial fraud detection, we are interested only in the minority class.
  - High accuracy does not mean any intrusion is detected.
  - E.g., with 1% intrusions, a classifier achieves 99% accuracy by doing nothing (always predicting "no intrusion").
- The class of interest is commonly called the positive class, and the rest the negative class.
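The 1%-intrusion example can be checked numerically with toy labels (the data here is invented to match the slide's proportions):

```python
# Accuracy paradox: with 1% positives, a "do nothing" classifier that always
# predicts negative scores 99% accuracy while detecting nothing.
y_true = ["pos"] * 10 + ["neg"] * 990
y_pred = ["neg"] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == p == "pos" for t, p in zip(y_true, y_pred)) / y_true.count("pos")
print(accuracy)  # 0.99
print(recall)    # 0.0 -- not a single intrusion detected
```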
Confusion Matrix
[Figure: a classifier's predicted labels compared against the actual labels in a 2x2 matrix with cells a, b, c, d]
Confusion Matrix Performance
A confusion matrix is a way of describing the breakdown of the errors
in predictions. It shows the number of correct and incorrect
predictions made by the classification model compared to the actual
outcomes (target value) in the data. The matrix is NxN, where N is
the number of target values (classes). Performance of such models is
commonly evaluated using the data in the matrix. The following table
displays a 2x2 confusion matrix for two classes (Positive and
Negative)
Confusion Matrix Performance
- Accuracy: the proportion of correct classifications out of the overall number of cases.
- Positive Predictive Value, or Precision: the proportion of predicted-positive cases that were correctly identified.
- Negative Predictive Value: the proportion of predicted-negative cases that were correctly identified.
- Sensitivity, or Recall: the proportion of actual positive cases that are correctly identified.
- Specificity: the proportion of actual negative cases that are correctly identified.
Confusion Matrix

                      Actual P (positive)   Actual N (negative)
Predicted Positive    TP                    FP
Predicted Negative    FN                    TN

- TP (True Positive): the number of correct classifications of the positive examples.
- FN (False Negative): the number of incorrect classifications of the positive examples.
- FP (False Positive): the number of incorrect classifications of the negative examples.
- TN (True Negative): the number of correct classifications of the negative examples.
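Counting the four cells from predicted/actual label lists might be sketched as follows (the label values in the demo are invented):

```python
def confusion_counts(y_true, y_pred, positive):
    """Count (TP, FP, FN, TN) for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = ["P", "P", "N", "N"]
y_pred = ["P", "N", "P", "N"]
print(confusion_counts(y_true, y_pred, positive="P"))  # (1, 1, 1, 1)
```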
Precision and Recall Measures
- Used in information retrieval and text classification.
- We use a confusion matrix to introduce them.
Precision and Recall Measures

p = TP / (TP + FP)        r = TP / (TP + FN)

- Precision p: the number of true positives divided by the total number of true positives and false positives, i.e., the number of correctly classified positive examples divided by the total number of examples classified as positive.
- Recall r: the number of true positives divided by the total number of true positives and false negatives, i.e., the number of correctly classified positive examples divided by the total number of actual positive examples in the test set.
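These two formulas in code, as a tiny sketch (the counts 1 true positive, 0 false positives, 99 false negatives are invented to produce an extreme precision/recall gap):

```python
def precision_recall(tp, fp, fn):
    """p = TP / (TP + FP), r = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

p, r = precision_recall(1, 0, 99)
print(p)  # 1.0  (100% precision)
print(r)  # 0.01 (1% recall)
```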
Example
- This confusion matrix gives precision p = 100% and recall r = 1%, because we classified only one positive example correctly and no negative examples wrongly.
- Note: precision and recall only measure classification on the positive class.
F-Score (F1-Score)
- It is hard to compare two classifiers using two measures; the F1 score combines precision and recall into one measure:

  F1 = 2pr / (p + r)

- The F1-score is the harmonic mean of precision and recall:

  F1 = 2 / (1/p + 1/r)

- The harmonic mean of two numbers tends to be closer to the smaller of the two.
- For the F1 value to be large, both p and r must be large.
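A sketch of the F1 computation, showing how the harmonic mean is pulled toward the smaller of the two values (the inputs reuse the p = 100%, r = 1% example):

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall: 2pr / (p + r)."""
    return 2 * p * r / (p + r)

print(round(f1_score(1.0, 0.01), 4))  # 0.0198 -- dragged toward the tiny recall
print(round(f1_score(0.8, 0.8), 3))   # 0.8
```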
Question
Suppose we train a model to predict whether a patient is Covid-19 positive or negative. After training the model, we apply it to a test set of 1000 labeled patients and the model produces the following confusion matrix:

                              Actual Covid_Positive   Actual Covid_Negative
Predicted Covid_Positive      450                     70
Predicted Covid_Negative      90                      390

- Calculate the predictive accuracy and the precision of the classifier.
- Calculate the recall (sensitivity) and specificity of the classifier.
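A worked check of this exercise in Python, reading TP = 450, FP = 70, FN = 90, TN = 390 off the matrix above:

```python
tp, fp, fn, tn = 450, 70, 90, 390  # from the confusion matrix

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)           # sensitivity
specificity = tn / (tn + fp)

print(f"accuracy    = {accuracy:.3f}")    # 0.840
print(f"precision   = {precision:.3f}")   # 0.865
print(f"recall      = {recall:.3f}")      # 0.833
print(f"specificity = {specificity:.3f}") # 0.848
```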