Lecture 4

1. K-Nearest Neighbors (KNN) is a simple machine learning algorithm that classifies new data points based on their distance to labeled examples in the training set.
2. The KNN algorithm works by finding the k closest training examples in the feature space and assigning the new data point to the most common class among its k nearest neighbors.
3. The accuracy of KNN can be evaluated using methods like predictive accuracy, n-fold cross-validation, and measuring the percentage of correctly classified examples in a test set.


K-Nearest Neighbors (KNN)

And Predictive Accuracy

Dr. Ammar Mohammed


Associate Professor of Computer Science
Fall 2021
Nearest Neighbors

[Figure: a set of stored cases, shown as a table with attributes Atr1 ... AtrN and a Class column (A, B, B, C, A, C, B), next to an unseen case with attributes Atr1 ... AtrN but no class label]

- Store the training samples
- Use the training samples to predict the class labels of unseen samples

This approach is called instance-based learning.
Instance Based Learning

- Approximates real-valued or discrete-valued target functions
- Learning in this algorithm consists of storing the presented training data (no model is generated)
- When a new query instance (unseen data) is encountered, a set of similar related instances is retrieved from memory and used to classify the new query instance
- A disadvantage of instance-based methods is that the cost of classifying new instances can be high
- Nearly all computation takes place at classification time rather than at learning time
K-Nearest Neighbors (KNN)

- The most basic instance-based method
- Supervised learning

Basic idea of KNN:
- Used to classify objects based on the closest training examples in the training data

[Figure: a test sample is compared against the training samples: compute the distance to each training sample, then choose the k "nearest" samples]
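As a quick illustration (not part of the original slides), here is a minimal sketch of this idea using scikit-learn's KNeighborsClassifier; the toy data is invented for illustration:

    # A minimal KNN sketch with scikit-learn (toy data invented for illustration).
    from sklearn.neighbors import KNeighborsClassifier

    # Training samples: two numeric attributes each, with a class label.
    X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
    y_train = ["A", "A", "B", "B"]

    # k = 3: classify by majority vote among the 3 nearest training samples.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)          # "training" just stores the samples

    print(knn.predict([[1.2, 1.9]]))   # -> ['A']: closest to the first two samples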
Nearest Neighbors
[Figure: an unknown record placed among labeled training points]

Requires three inputs:
1. The set of stored samples (training data)
2. A distance metric to compute the distance between samples
3. The value of k, the number of nearest neighbors to retrieve

Nearest-neighbor classifiers are lazy learners:
- No pre-constructed model is built for classification
Nearest Neighbors Example

    Food      Chat  Fast  Price   Bar  BigTip
    (3)       (2)   (2)   (3)     (2)
1   great     yes   yes   normal  no   yes
2   great     no    yes   normal  no   yes
3   mediocre  yes   no    high    no   no
4   great     yes   yes   normal  yes  yes

Similarity metric: number of matching attributes (k = 2)

New examples:

Example 1: (great, no, no, normal, no) → yes
- Most similar: number 2 (1 mismatch, 4 matches) → yes
- Second most similar: number 1 (2 mismatches, 3 matches) → yes

Example 2: (mediocre, yes, no, normal, no) → yes/no (tie)
- Most similar: number 3 (1 mismatch, 4 matches) → no
- Second most similar: number 1 (2 mismatches, 3 matches) → yes
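A minimal sketch of this matching-attributes similarity in Python (the data is the table above; the helper name is my own):

    # Rank training examples by the number of matching attributes.
    train = [
        (("great",    "yes", "yes", "normal", "no"),  "yes"),   # 1
        (("great",    "no",  "yes", "normal", "no"),  "yes"),   # 2
        (("mediocre", "yes", "no",  "high",   "no"),  "no"),    # 3
        (("great",    "yes", "yes", "normal", "yes"), "yes"),   # 4
    ]

    def matches(a, b):
        return sum(x == y for x, y in zip(a, b))

    query = ("great", "no", "no", "normal", "no")       # Example 1
    ranked = sorted(train, key=lambda ex: matches(query, ex[0]), reverse=True)
    print([label for _, label in ranked[:2]])           # -> ['yes', 'yes'], so: yes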


Nearest Neighbors

Compute the distance between two points
x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn):

- Euclidean distance:

  d(x, y) = √( Σᵢ (xᵢ - yᵢ)² )

Options for determining the class from the nearest-neighbor list:
- Take the majority vote of class labels among the k nearest neighbors (see the sketch below)
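A minimal pure-Python sketch of Euclidean-distance KNN with majority voting (the helper names and the toy points are my own):

    # Euclidean-distance KNN with a majority vote over the k nearest samples.
    from collections import Counter
    from math import sqrt

    def euclidean(x, y):
        return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def knn_classify(train, query, k):
        # train: list of (point, label) pairs
        neighbors = sorted(train, key=lambda ex: euclidean(query, ex[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]   # the majority class label

    train = [((0.8, 6.3), "-"), ((1.4, 8.1), "+"), ((19.6, 11.1), "-")]
    print(knn_classify(train, (9.1, 11.0), k=1))   # -> '+'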
Example

In 5-nearest neighbors, what is the class of the new instance x = (9.1, 11.0)?

d(x, y) = √( (9.1 - 0.8)² + (11.0 - 6.3)² ) = 9.6
d(x, y) = √( (9.1 - 1.4)² + (11.0 - 8.1)² )
...
d(x, y) = √( (9.1 - 19.6)² + (11.0 - 11.1)² ) = 10.5

Select the 5 instances having the minimum distance. You will find:
- 3 instances classified as +
- 2 instances classified as -

We conclude that x = (9.1, 11.0) is classified as +.
Features Normalization

Scaling issues:
- Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
- Example:
  - the height of a person may vary from 1.5 m to 1.8 m
  - the weight of a person may vary from 55 kg to 120 kg
  - the income of a person may vary from $10K to $1M


Features Normalization

[Figure: a distance computation on a dataset with Age and Loan attributes]

The distance is dominated by the Loan attribute, while the Age attribute has no impact.
How do we solve this problem?
KNN Standardized Distance

Xs = (X - min) / (max - min)

where X is an original attribute value and Xs is its standardized (min-max scaled) value.
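A minimal sketch of this min-max scaling in Python (the function name is my own):

    # Min-max scaling: map each attribute value into [0, 1].
    def min_max_scale(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    ages = [25, 35, 45, 20, 60]
    print(min_max_scale(ages))   # -> [0.125, 0.375, 0.625, 0.0, 1.0]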
How to Determine a Good Value of k

[Figure: an unknown point "?" surrounded by training points from a square class and a triangle class]

- k = 1: belongs to the square class
- k = 3: belongs to the triangle class
- k = 7: belongs to the square class

Choosing the value of k:
- If k is too small, the classifier is sensitive to noise points
- If k is too large, the neighborhood may include points from other classes
- Choose an odd value for k to eliminate ties
How to Determine a Good Value of k

The value of k is determined experimentally:
- Start with k = 1 and use a test set to validate the error rate of the classifier
- Repeat with k = k + 2
- Choose the value of k for which the error rate is minimum (a code sketch follows below)

Note: k should be an odd number to avoid ties
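A minimal sketch of this search with scikit-learn (the iris dataset is used only as a stand-in):

    # Try k = 1, 3, 5, ... and keep the k with the lowest test error.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    best_k, best_err = 1, 1.0
    for k in range(1, 20, 2):                      # odd values only, to avoid ties
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        err = 1.0 - knn.score(X_test, y_test)      # error rate = 1 - accuracy
        if err < best_err:
            best_k, best_err = k, err
    print(best_k, best_err)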
Predictive Accuracy
Classification Step 1: Splitting Data

[Figure: "the past": data with results known (labeled + and -) is split into a training set and a testing set]
Classification Step 2: Train and Evaluate

[Figure: the training set (results known) feeds a model builder; the model's predictions (Y/N) on the testing set are compared with the known labels to evaluate the classifier]
Methods for Evaluation

- Predictive accuracy: the most obvious method for estimating the performance of the classifier

  Accuracy = (number of correct classifications) / (total number of test cases)

- Efficiency:
  - time to construct the model
  - time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency on disk-resident databases
- Interpretability: the understandability of, and insight provided by, the model
Predictive Accuracy:

P = C / N

- P: accuracy
- N: number of instances in the test set
- C: number of correctly classified instances

- The available data is split into two parts, called the training set and the test set.
- If the dataset is a single file, we need to divide it into a training set and a test set before using this method.
Splitting the data

- Holdout set: the available dataset D is divided into two disjoint subsets:
  - the training set Dtrain (for learning a model)
  - the test set Dtest (for testing the model)
- Important: the training set should not be used in testing, and the test set should not be used in learning.
  - An unseen test set provides an unbiased estimate of accuracy.
- The test set is also called the holdout set. (The examples in the original dataset D are all labeled with classes.)
- This method is mainly used when the dataset D is large (a small sketch follows below).
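A minimal sketch of the P = C / N computation on a holdout test set (the two toy label lists are invented for illustration):

    # Holdout accuracy computed by hand: P = C / N on the test set.
    def accuracy(predicted, actual):
        correct = sum(p == a for p, a in zip(predicted, actual))  # C
        return correct / len(actual)                              # C / N

    print(accuracy(["yes", "no", "yes", "yes"],
                   ["yes", "no", "no",  "yes"]))  # -> 0.75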
Splitting Data using n-fold cross Validation

n- fold Cross Validation

- n-fold cross-validation: the available data is partitioned into n equal-size disjoint subsets.
- Use each subset as the test set and combine the remaining n - 1 subsets as the training set to learn a classifier.
- The procedure is run n times, which gives n accuracies.
- The final estimated accuracy of learning is the average of the n accuracies.
- 10-fold and 5-fold cross-validation are commonly used.
- This method is used when the available data is not large (a code sketch follows below).
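A minimal sketch with scikit-learn's cross_val_score; the iris dataset is used only as a stand-in:

    # 10-fold cross-validation: 10 accuracies, averaged into one estimate.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=10)
    print(scores.mean())   # the final estimated accuracy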
Accuracy Paradox
Accuracy is not suitable in some applications:
- With class imbalance, accuracy alone cannot be trusted to select a good model.
- In text mining, we may only be interested in documents on a particular topic, which are only a small portion of a big document collection.
- In classification involving skewed or highly imbalanced data, e.g., network intrusion and financial fraud detection, we are interested only in the minority class.
  - High accuracy does not mean any intrusion is detected.
  - E.g., with 1% intrusions, a classifier achieves 99% accuracy by doing nothing.
- The class of interest is commonly called the positive class, and the rest the negative classes.
Confusion Matrix

[Figure: a 2x2 confusion matrix relating the classifier's predicted labels to the actual labels, with cells a, b, c, d]
Confusion Matrix Performance
A confusion matrix is a way of describing the breakdown of the errors in predictions. It shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target values) in the data. The matrix is N x N, where N is the number of target values (classes). The performance of such models is commonly evaluated using the data in the matrix. The following table displays a 2x2 confusion matrix for two classes (Positive and Negative).
Confusion Matrix Performance

- Accuracy: the proportion of correct classifications out of the overall number of cases.
- Positive predictive value or precision: the proportion of predicted positive cases that were correctly identified.
- Negative predictive value: the proportion of predicted negative cases that were correctly identified.
- Sensitivity or recall: the proportion of actual positive cases that are correctly identified.
- Specificity: the proportion of actual negative cases that are correctly identified.
Confusion Matrix

                     Actual label
                     P = positive   N = negative
Predicted positive   TP             FP
Predicted negative   FN             TN

- TP (True Positive): the number of correct classifications of the positive examples.
- FN (False Negative): the number of incorrect classifications of the positive examples.
- FP (False Positive): the number of incorrect classifications of the negative examples.
- TN (True Negative): the number of correct classifications of the negative examples.
Precision and Recall Measures

- Used in information retrieval and text classification.
- We use a confusion matrix to introduce them.

Precision and Recall Measures

p = TP / (TP + FP)        r = TP / (TP + FN)

- Precision p: the number of true positives divided by the sum of true positives and false positives. Equivalently, the number of correctly classified positive examples divided by the total number of examples that are classified as positive.
- Recall r: the number of true positives divided by the sum of true positives and false negatives. Equivalently, the number of correctly classified positive examples divided by the total number of actual positive examples in the test set. (A sketch follows below.)
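A minimal sketch computing precision and recall from raw confusion-matrix counts (the counts are invented for illustration):

    # Precision and recall from confusion-matrix counts (invented counts).
    TP, FP, FN = 45, 7, 9

    precision = TP / (TP + FP)   # correct positives / all predicted positives
    recall    = TP / (TP + FN)   # correct positives / all actual positives
    print(round(precision, 3), round(recall, 3))   # -> 0.865 0.833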
Example

[Figure: a confusion matrix in which only one positive example is classified correctly and no negative examples are misclassified]

This confusion matrix gives:
- precision p = 100% and
- recall r = 1%
because we only classified one positive example correctly and no negative examples wrongly.

Note: precision and recall only measure classification performance on the positive class.
F-Score (F1-Score)

- It is hard to compare two classifiers using two measures. The F1-score combines precision and recall into one measure:

  F1 = 2pr / (p + r)

- The F1-score is the harmonic mean of precision and recall:

  F1 = 2 / (1/p + 1/r)

- The harmonic mean of two numbers tends to be closer to the smaller of the two.
- For the F1-value to be large, both p and r must be large (see the sketch below).
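A minimal sketch, reusing the earlier example (p = 100%, r = 1%) to show how the harmonic mean stays near the smaller value:

    # F1 is the harmonic mean of precision and recall.
    def f1_score(p, r):
        return 2 * p * r / (p + r)

    print(f1_score(1.0, 0.01))   # -> ~0.0198: dragged down by the tiny recall
    print(f1_score(0.9, 0.8))    # -> ~0.847: both large, so F1 is large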
Question
Suppose we train a model to predict whether a patient is Covid_19 positive or Covid_19 negative. After training the model, we apply it to a test set of 1000 classified patients, and the model produces the following confusion matrix:

                              Actual Class
                              Covid_Positive   Covid_Negative
Predicted  Covid_Positive     450              70
Class      Covid_Negative     90               390

- Calculate the predictive accuracy and the precision of the classifier.
- Calculate the recall (sensitivity) and specificity of the classifier.
(A numeric check follows below.)
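A quick numeric check using the formulas defined above (the counts are read directly from the matrix):

    # Metrics for the Covid_19 confusion matrix above.
    TP, FP = 450, 70    # predicted positive: correct / incorrect
    FN, TN = 90, 390    # predicted negative: incorrect / correct

    accuracy    = (TP + TN) / (TP + FP + FN + TN)   # -> 0.84
    precision   = TP / (TP + FP)                    # -> ~0.865
    recall      = TP / (TP + FN)                    # -> ~0.833
    specificity = TN / (TN + FP)                    # -> ~0.848
    print(accuracy, precision, recall, specificity)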
