Lecture 4
Compute the distance between the test sample and every training sample:

d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
Options for determining the class from the nearest-neighbor list:
- Take a majority vote of the class labels among the k nearest neighbors
Example: classify the test sample x = (9.1, 11.0). Compute its distance to each training sample; for example, for the training sample y = (1.4, 8.1):

d(x, y) = √( (9.1 − 1.4)² + (11.0 − 8.1)² ) ≈ 8.23

[Table of distances from x to the remaining training samples omitted]
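As a sketch of the full procedure, the distance computation and the majority vote can be combined as follows (the helper names `euclidean` and `knn_predict` are illustrative, not from the slides):

```python
import math
from collections import Counter

def euclidean(x, y):
    # d(x, y) = sqrt( sum_i (x_i - y_i)^2 )
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train, test_point, k):
    # train: list of (features, label) pairs;
    # take a majority vote of the labels among the k nearest neighbors
    neighbors = sorted(train, key=lambda s: euclidean(s[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# the distance from the slide's test sample to the training sample (1.4, 8.1)
print(round(euclidean((9.1, 11.0), (1.4, 8.1)), 2))  # → 8.23
```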
Scaling issues:
- Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
- Example:
  - the height of a person may vary from 1.5 m to 1.8 m
  - the weight of a person may vary from 55 kg to 120 kg
In the accompanying example, the distance is dominated by the attribute Loan, while the attribute Age has almost no impact. How can this problem be solved?
KNN Standardized Distance

Xs = (X − Min) / (Max − Min)
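A minimal sketch of this min–max standardization, scaling each attribute to [0, 1] independently (the function name is illustrative):

```python
def min_max_scale(values):
    # Xs = (X - Min) / (Max - Min): maps every value of one attribute to [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights = [1.5, 1.6, 1.8]      # metres
weights = [55, 80, 120]        # kilograms
# after scaling, both attributes share the range [0, 1],
# so neither dominates the distance computation
print(min_max_scale(weights))
```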
How to Determine a Good Value of K

- k = 1: the sample belongs to the square class
- k = 3: the sample belongs to the triangle class
- k = 7: the sample belongs to the square class

k is determined experimentally:
- Start with k = 1 and use a test set to validate the error rate of the classifier
- Repeat with k = k + 2
- Choose the value of k for which the error rate is minimum
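The experimental search above can be sketched as follows (function names are illustrative; `math.dist` computes the Euclidean distance):

```python
import math
from collections import Counter

def knn_error_rate(train, test, k):
    # fraction of test samples the k-NN majority vote misclassifies
    errors = 0
    for x, true_label in test:
        neighbors = sorted(train, key=lambda s: math.dist(s[0], x))[:k]
        vote = Counter(lbl for _, lbl in neighbors).most_common(1)[0][0]
        errors += (vote != true_label)
    return errors / len(test)

def best_k(train, test, k_max=15):
    # start with k = 1, repeat with k = k + 2, keep the k with minimum error
    rates = {k: knn_error_rate(train, test, k) for k in range(1, k_max + 1, 2)}
    return min(rates, key=rates.get)
```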
Predictive Accuracy

Classification step 1: Splitting data

[Figure: the available data ("the past", with results known) is split into a training set and a testing set]
Classification Step 2: Train and Evaluate

[Figure: the model builder learns from the training set; the model's predictions (Y/N) on the testing set are compared with the known results to evaluate the model]
Methods for Evaluation

P = C / N

where P is the accuracy, N the total number of instances, and C the number of correctly classified instances.

- The available data is split into two parts, called the training set and the test set
- If the dataset is a single file, it must first be divided into a training set and a test set before using method 1
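A minimal sketch of the single-file split and the accuracy formula P = C/N (the helper names are illustrative):

```python
import random

def holdout_split(data, test_fraction=0.3, seed=0):
    # shuffle a single dataset, then split it into training and test sets
    rows = data[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def accuracy(predicted, actual):
    # P = C / N: correctly classified instances over total instances
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

print(accuracy(["+", "-", "+", "+"], ["+", "-", "-", "+"]))  # → 0.75
```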
n-fold Cross Validation

[Figure: the data is divided into n folds; in each round one fold serves as the test set while the remaining n − 1 folds train the classifier, and the resulting error rates are averaged]
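A sketch of the n-fold procedure, assuming an `evaluate(train, test)` callback (a hypothetical name) that returns the error rate of a classifier trained on `train` and tested on `test`:

```python
def n_fold_cross_validation(data, n, evaluate):
    # split the data into n folds; each fold is the test set exactly once,
    # the remaining n-1 folds form the training set; average the error rates
    folds = [data[i::n] for i in range(n)]
    errors = []
    for i in range(n):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        errors.append(evaluate(train, test))
    return sum(errors) / n
```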
Confusion Matrix Performance

A confusion matrix is a way of describing the breakdown of the errors in predictions. It shows the number of correct and incorrect predictions made by the classification model compared with the actual outcomes (target values) in the data. The matrix is N×N, where N is the number of target values (classes). The performance of such models is commonly evaluated using the data in the matrix. The following table displays a 2×2 confusion matrix for two classes (Positive and Negative).
Confusion Matrix Performance

                            Actual Class
                     P = Positive   N = Negative
Classifier   P            TP             FP
(Predicted)  N            FN             TN

p = TP / (TP + FP)        r = TP / (TP + FN)
- Precision p is the number of true positives divided by the sum of true positives and false positives; that is, the number of correctly classified positive examples divided by the total number of examples classified as positive.
- Recall r is the number of true positives divided by the sum of true positives and false negatives; that is, the number of correctly classified positive examples divided by the total number of actual positive examples in the test set.
F-Score (F1-Score)

F1 = 2pr / (p + r)

The F1-score is the harmonic mean of precision and recall:

F1 = 2 / (1/p + 1/r)

- The harmonic mean of two numbers tends to be closer to the smaller of the two.
- For the F1 value to be large, both p and r must be large.
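The two formulas can be checked with a small helper (illustrative names, not from the slides); note how the harmonic mean is pulled toward the smaller of p and r:

```python
def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp)            # precision
    r = tp / (tp + fn)            # recall
    f1 = 2 * p * r / (p + r)      # harmonic mean of p and r
    return p, r, f1

# high precision but low recall: F1 stays close to the smaller value
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=90)
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.9 0.5 0.64
```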
Question

Suppose we train a model to predict whether a patient is Covid-19 positive or Covid-19 negative. After training the model, we apply it to a test set of 1000 classified patients, and the model produces the following confusion matrix.

                                  Actual Class
                          Covid_Positive   Covid_Negative
Predicted  Covid_Positive      450               70
Class      Covid_Negative       90              390
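As a worked sketch (not part of the original slides), the metrics defined earlier can be computed directly from this matrix:

```python
# confusion matrix from the question (rows = predicted, columns = actual)
tp, fp = 450, 70   # predicted Covid_Positive
fn, tn = 90, 390   # predicted Covid_Negative

accuracy  = (tp + tn) / (tp + fp + fn + tn)          # (450+390)/1000 = 0.84
precision = tp / (tp + fp)                           # 450/520 ≈ 0.865
recall    = tp / (tp + fn)                           # 450/540 ≈ 0.833
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.849
print(accuracy, round(precision, 3), round(recall, 3), round(f1, 3))
```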