Module 8

The document discusses the k-Nearest Neighbors (kNN) algorithm, focusing on its classification and regression applications, effects of outliers, and the importance of choosing an appropriate k value. It also covers performance metrics for classification models, including confusion matrices, accuracy, precision, recall, F1-score, and the area under the ROC curve (AUC). Additionally, it highlights challenges in kNN computation and optimization techniques such as k-d trees.

Learning Objectives

• Continuing discussion on kNN

• Performance metrics for Classification

• Significance of different metrics

1
kNN: Classification

Effect of Outliers:

● Consider k = 1.
● Sensitive to outliers: the decision boundary changes drastically in their presence.
● Solution?
○ Increase k

3
kNN: Classification

Effect of k:

● Low k: overfitting, highly unstable decision boundary
● Good k: smooth boundary, no overfitting/underfitting
● Higher k: everything classified as the most probable class
● How to find a good k?

[Figure: decision boundaries for k = 1 and k = 15]
Cross validation is our friend!
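To make "cross validation is our friend" concrete, here is a minimal scikit-learn sketch (my own illustration, not part of the slides; the dataset and the candidate k values are arbitrary assumptions) that picks k by 5-fold cross-validated accuracy:

# Choosing k for kNN classification by cross-validation (illustrative sketch).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)            # stand-in dataset (an assumption, not from the slides)

best_k, best_score = None, -np.inf
for k in [1, 3, 5, 7, 9, 15, 25]:            # candidate values of k
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5, scoring="accuracy").mean()   # 5-fold CV accuracy
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k}, cross-validated accuracy = {best_score:.3f}")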

5
kNN: Classification

What if we have the same number of votes from both classes?

Potential solutions for tie-breaking:
● Take k odd
● Randomly select
● Use the class with the larger prior

[Figure: decision boundaries for k = 1 and k = 15]

6
kNN: Classification

A probabilistic variant: Probabilistic kNN

E.g., k = 4, c = 3 classes: three of the four nearest neighbours belong to class y = 1 and one to class y = 3, so

P = [3/4, 0, 1/4]   (for y = 1, y = 2, y = 3)

8
kNN: Classification

A probabilistic variant: Probabilistic kNN

E.g., k = 4, c = 3. Variant with pseudo counts (add 1 to each class count and c to the denominator):

P = [(3+1)/(4+3), (0+1)/(4+3), (1+1)/(4+3)]
  = [4/7, 1/7, 2/7]   (for y = 1, y = 2, y = 3)
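A small numpy sketch of this smoothing (my own illustration; the function name and the 0-indexed class labels are assumptions, not from the slides):

# Probabilistic kNN with pseudo counts (illustrative sketch).
import numpy as np

def knn_class_probabilities(neighbor_labels, num_classes, pseudo_count=1):
    # Smoothed class distribution from the labels of the k nearest neighbours.
    k = len(neighbor_labels)
    counts = np.bincount(neighbor_labels, minlength=num_classes)
    return (counts + pseudo_count) / (k + pseudo_count * num_classes)

# Slide's example: k = 4 neighbours, three of class y=1 and one of class y=3 (0-indexed labels [0, 0, 0, 2])
print(knn_class_probabilities(np.array([0, 0, 0, 2]), num_classes=3))   # [4/7, 1/7, 2/7]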

9
kNN: Regression

A simple regression algorithm:

● Training examples: (x_1, y_1), …, (x_n, y_n), where each y_i is a continuous real-valued target
● Given a test input x
● Find the distances from x to the n training examples using a distance metric
● Select the k closest training examples and their target values
● The output is the mean of the target values of the k neighbours

Can be used for interpolation (see the sketch below).
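The procedure above fits in a few lines of numpy; a minimal sketch (the function name and the toy sine data are assumptions of mine, not from the slides):

# kNN regression: predict the mean target of the k nearest neighbours (illustrative sketch).
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distances to all n training examples
    nearest = np.argsort(dists)[:k]                     # indices of the k closest examples
    return y_train[nearest].mean()                      # mean of their target values

# Toy 1-D example: noisy samples of y = sin(x)
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 6, size=(50, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.1, size=50)
print(knn_regress(X_train, y_train, np.array([1.5]), k=5))   # roughly sin(1.5) ≈ 1.0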

12
kNN: Challenges

Computationally expensive:
● Need to store all training examples
● Need to compute distances to all n training examples for every query: O(n · d) for d-dimensional inputs

There are ways to optimize kNN computation:
● Reduce dimensionality using dimensionality reduction techniques
● Reduce the number of comparisons:
○ k-d tree implementation
○ Locality-sensitive hashing

14
kNN: Computational Complexity
Brute force method
● Training time complexity: O(1)
● Training space complexity: O(1)
● Prediction time complexity: O(k * n * d)
● Prediction space complexity: O(1)

15
kNN: Computational Complexity

k-d tree method


● Training time complexity: O(d * n * log(n))
● Training space complexity: O(d * n)
● Prediction time complexity: O(k * log(n))
● Prediction space complexity: O(1)

16
kNN: k-d Tree

● A k-dimensional tree (or k-d tree) is a tree data structure used to represent points in a k-dimensional space (here k is the number of dimensions, not the number of neighbours).
● Used for various applications like nearest-point search (in k-dimensional space), efficient storage of spatial data, range search, etc. (see the sketch below).
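In practice the tree is rarely implemented by hand; a minimal sketch using scipy's cKDTree (my choice of library, not one named in the slides; the random 2-D data is made up):

# Nearest-neighbour queries through a k-d tree (illustrative sketch).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((60_000, 2))                   # n = 60k points in a 2-D space

tree = cKDTree(X_train)                             # built once, at "training" time: O(d * n * log n)
dists, idx = tree.query(rng.random((10, 2)), k=3)   # 3 nearest neighbours of 10 query points
print(idx.shape)                                    # (10, 3): neighbour indices per query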

18
kNN: k-d Tree

Example: [figures omitted: a worked k-d tree construction and nearest-neighbour search]

19-20
kNN: Computational Complexity

The more “traditional” application of kNN is classification, and the data often contains quite a lot of points; e.g., MNIST has 60k training images and 10k test images. Classification is done offline: we first run the training phase and then simply reuse its results during prediction, so the data structure only needs to be constructed once. The next slide compares, per query, brute force (which computes all distances every time) against a k-d tree for 3 neighbours, plugging in n = 10,000.

21
kNN: Computational Complexity

● Brute force (O(k * n)): 3 * 10,000 = 30,000
● k-d tree (O(k * log(n))): 3 * log2(10,000) ≈ 3 * 13 = 39

22
Classification Metrics

How to measure the performance of a classification model?

24
Classification Metrics

Most widely used metrics and tools to assess classification models:

● Confusion matrix
● Accuracy
● Precision/Recall/F1-score
● Area under the ROC curve

25
Classification Metrics

Confusion Matrix

A table to summarize how successful the classification model is at predicting examples belonging to various classes.

26
Classification Metrics

Confusion Matrix

E.g., for binary classification, a model predicts one of two classes, “spam” or “not_spam”, for a given email.

                        prediction
                  spam                  not_spam
actual  spam      True Positive (TP)    False Negative (FN)
        not_spam  False Positive (FP)   True Negative (TN)
27
Classification Metrics

Confusion Matrix

Exercise 1: Consider a cricket tournament. Find the mapping of each statement to TP, FN, FP, or TN.

1. You had predicted that India would win and it won.
2. You had predicted that England would not win and it lost.
3. You had predicted that England would win, but it lost.
4. You had predicted that India would not win, but it won.

28
Classification Metrics

Confusion Matrix

Exercise 2:

                  prediction
                  1          0
actual  1         TP = ?     FN = ?
        0         FP = ?     TN = ?

29
Classification Metrics

Confusion Matrix

Exercise 2:

                  prediction
                  1          0
actual  1         TP = 6     FN = 2
        0         FP = 1     TN = 3

30
Classification Metrics

Confusion Matrix
Multiclass Classification: e.g., emotion classification

A 6 x 6 table with actual classes as rows and predicted classes as columns, both over the classes Happy, Sad, Angry, Surprise, Disgust, and Neutral. [Matrix entries shown in the original figure are omitted.]

31
Classification Metrics

Accuracy
Accuracy is given by the number of correctly classified examples divided by the
total number of classified examples.
Acc = (TP + TN) / (TP + TN + FP + FN)

                        prediction
                  spam                  not_spam
actual  spam      True Positive (TP)    False Negative (FN)
        not_spam  False Positive (FP)   True Negative (TN)

33
Classification Metrics

Accuracy
Accuracy is given by the number of correctly classified examples divided by the
total number of classified examples.

                  prediction
                  1          0
actual  1         TP = 6     FN = 2
        0         FP = 1     TN = 3

Accuracy = ?
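(One worked answer, not shown on this slide: Accuracy = (TP + TN) / (TP + TN + FP + FN) = (6 + 3) / (6 + 2 + 1 + 3) = 9/12 = 0.75.)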
34
Classification Metrics

Precision
Precision is the ratio of correct positive predictions to the overall number of positive predictions.

Precision = TP / (TP + FP)

FP is costly!

                        prediction
                  spam                  not_spam
actual  spam      True Positive (TP)    False Negative (FN)
        not_spam  False Positive (FP)   True Negative (TN)

35
Classification Metrics

Recall
Recall is the ratio of correct positive predictions to the overall number of positive
examples.
Recall = TP / (TP + FN)

FN is costly!

                        prediction
                  spam                  not_spam
actual  spam      True Positive (TP)    False Negative (FN)
        not_spam  False Positive (FP)   True Negative (TN)

36
Classification Metrics

F1-Score

● The standard F1-score is the harmonic mean of precision and recall:
  F1 = 2 · Precision · Recall / (Precision + Recall)
● Best of both worlds
● A perfect model has an F1-score of 1.
● FP & FN both are costly! (see the sketch below)
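A minimal scikit-learn sketch of all three metrics (my own illustration; the labels are made up so that they reproduce the TP = 6, FN = 2, FP = 1, TN = 3 matrix used in the exercises):

# Precision, recall, and F1 for a binary classifier (illustrative sketch).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # 8 actual positives, 4 actual negatives
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0]   # TP=6, FN=2, FP=1, TN=3

print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 6/7 ≈ 0.857
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 6/8 = 0.75
print(f1_score(y_true, y_pred))          # 2PR / (P + R) = 0.8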

37
Classification Metrics

Visualizing Precision/Recall

[Figure omitted: the standard diagram of relevant vs. retrieved elements illustrating precision and recall, from Wikipedia]

38
Classification Metrics

Precision/Recall/F1-score

                  prediction
                  1          0
actual  1         TP = 6     FN = 2
        0         FP = 1     TN = 3

Precision = ?
Recall = ?
F1-score = ?
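(Worked answers, not shown on this slide: Precision = 6/7 ≈ 0.86, Recall = 6/8 = 0.75, F1 = 2 · (6/7) · (6/8) / (6/7 + 6/8) = 0.80.)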

39
Classification Metrics

Examples: It all depends on the problem!

Diagnosis of cancer.

                        prediction
                  cancer       no_cancer
actual  cancer    Perfect      X
        no_cancer OK           Perfect

In medical cases it doesn’t matter much if we raise a false alarm, but the actual positive cases should not go undetected! What metric would you pick?

Recall = TP / (TP + FN)

42
Classification Metrics

Examples: It all depends on the problem!

Detecting if an email is spam or not spam.

                        prediction
                  spam         no_spam
actual  spam      Perfect      OK
        no_spam   X            Perfect

For email it is more important not to lose any important message by marking it as spam than to let an occasional spam message through. What metric would you pick?

Precision = TP / (TP + FP)

44
Classification Metrics

Multiclass Classification

Consider the 6 x 6 emotion confusion matrix (actual classes as rows, predicted classes as columns; classes: Happy, Sad, Angry, Surprise, Disgust, Neutral).

● Can you define recall (Happy)?
● Can you define precision (Happy)?

45
Classification Metrics

Multiclass Classification

recall(Happy) = (number of examples correctly predicted as Happy) / (total number of actual Happy examples), i.e. the diagonal Happy entry divided by the sum of the Happy row.

46
Classification Metrics

Multiclass Classification

precision(Happy) = (number of examples correctly predicted as Happy) / (total number of examples predicted as Happy), i.e. the diagonal Happy entry divided by the sum of the Happy column.

47
Classification Metrics

Multiclass Classification

● Can you define accuracy?

48
Classification Metrics

Multiclass Classification

accuracy = (sum of the diagonal entries of the confusion matrix) / (sum of all entries), i.e. the fraction of examples whose predicted class matches the actual class (see the sketch below).
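A minimal numpy sketch of these multiclass definitions (my own illustration; the 3 x 3 matrix values are made up and stand in for the 6 x 6 emotion matrix):

# Per-class precision/recall and overall accuracy from a confusion matrix (illustrative sketch).
import numpy as np

cm = np.array([[50,  3,  2],    # rows = actual class, columns = predicted class
               [ 5, 40,  5],
               [ 2,  4, 44]])

diag = np.diag(cm)
recall_per_class = diag / cm.sum(axis=1)      # diagonal entry / row sum
precision_per_class = diag / cm.sum(axis=0)   # diagonal entry / column sum
accuracy = diag.sum() / cm.sum()              # trace / total

print(recall_per_class, precision_per_class, accuracy)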

49
Classification Metrics

Area under the ROC Curve (AUC)


● The ROC curve (“receiver operating characteristic”; the term comes from radar engineering, where the method was originally developed for operators of military radar receivers starting in 1941) is a commonly used method to assess the performance of binary classification models.

● ROC curves use a combination of:


(1) true positive rate (the proportion of positive examples predicted correctly,
defined exactly as recall) and
(2) false positive rate (the proportion of negative examples predicted
incorrectly)
to build up a summary picture of the classification performance.
51
Classification Metrics

Area under the ROC Curve (AUC)

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

52
Classification Metrics

Area under the ROC Curve (AUC)

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

Sensitivity / Recall = TPR
Specificity = 1 - FPR = TN / (TN + FP)

                        prediction
                  spam                  not_spam
actual  spam      True Positive (TP)    False Negative (FN)
        not_spam  False Positive (FP)   True Negative (TN)
53
Classification Metrics

Area under the ROC Curve (AUC)

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

54
Classification Metrics

Area under the ROC Curve (AUC)

● We used a threshold for classification in many classification models
● Typically for models that give a probabilistic output score

55
Classification Metrics

Area under the ROC Curve (AUC)

● To compare different classifiers, it can be useful to summarize the performance of each classifier into a single measure.
● One common approach is to calculate the area under the ROC curve, abbreviated AUC.

56
Classification Metrics

Area under the ROC Curve (AUC)

● AUC ranges in value from 0 to 1
● A model whose predictions are 100% wrong has an AUC of 0.0
● One whose predictions are 100% correct has an AUC of 1.0
● AUC is classification-threshold-invariant and suitable for comparing models (see the sketch below)
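A minimal scikit-learn sketch of an ROC curve and its AUC (my own illustration; the labels and scores are made up):

# ROC curve points and AUC from probabilistic scores (illustrative sketch).
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]                       # actual classes
y_score = [0.1, 0.35, 0.4, 0.8, 0.45, 0.6, 0.7, 0.9]     # model's probabilistic scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
print(list(zip(fpr, tpr)))                          # points tracing the ROC curve
print(roc_auc_score(y_true, y_score))               # area under that curve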

58
Classification Metrics

Area under the ROC Curve (AUC)

                  prediction
                  spam       not_spam
actual  spam      10          0
        not_spam  10          0

All predictions say “spam”.

(1) TPR = ?
(2) FPR = ?
(3) Where is the point on the ROC curve?

59
Classification Metrics

Area under the ROC Curve (AUC)

                  prediction
                  spam       not_spam
actual  spam       0         10
        not_spam   0         10

All predictions say “not_spam”.

(1) TPR = ?
(2) FPR = ?
(3) Where is the point on the ROC curve?

60
Classification Metrics

Area under the ROC Curve (AUC)

                  prediction
                  spam       not_spam
actual  spam      10          0
        not_spam   0         10

All predictions are perfect.

(1) TPR = ?
(2) FPR = ?
(3) Where is the point on the ROC curve?

61
Classification Metrics

Area under the ROC Curve (AUC)

                  prediction
                  spam       not_spam
actual  spam       5          5
        not_spam   5          5

Some random predictions.

(1) TPR = ?
(2) FPR = ?
(3) Where is the point on the ROC curve?

62
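(Worked answers, not given in the slides: (1) everything predicted “spam”: TPR = 10/10 = 1, FPR = 10/10 = 1, the point (1, 1); (2) everything predicted “not_spam”: TPR = 0, FPR = 0, the point (0, 0); (3) perfect predictions: TPR = 1, FPR = 0, the top-left corner (0, 1); (4) the random 5/5 split: TPR = 0.5, FPR = 0.5, the point (0.5, 0.5) on the diagonal.)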
