
Machine Learning

EE514 – CS535

Analysis and Evaluation of Classifier Performance
and Multi-class Classification

Zubair Khalid

School of Science and Engineering


Lahore University of Management Sciences

https://www.zubairkhalid.org/ee514_2023.html
Outline

- Classification Accuracy (0/1 Loss)


- TP, TN, FP and FN
- Confusion Matrix
- Sensitivity, Specificity, Precision Trade-offs, ROC, AUC
- F1-Score and Matthews Correlation Coefficient
- Multi-class Classification, Evaluation, Micro, Macro Averaging
Evaluation of Classification Performance
Classification Accuracy, Misclassification Rate (0/1 Loss):

- For each test-point, the loss is either 0 or 1, depending on whether the prediction is correct or incorrect.
- Averaged over n data-points, this loss gives the 'Misclassification Rate'.

Interpretation:
- Misclassification Rate: Estimate of the probability that a point is incorrectly classified.
- Accuracy = 1- Misclassification rate

Issue:
- Not meaningful when the classes are imbalanced or skewed.
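A minimal sketch in Python (the labels below are hypothetical, not from the slides): the misclassification rate is simply the average 0/1 loss over the test set, and accuracy is its complement.

```python
# 0/1 loss per test point, averaged into the misclassification rate.
y_true = [1, 1, 1, 0, 1, 0, 1, 1, 1, 1]   # hypothetical test labels
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   # a model that always predicts class 1

losses = [0 if yp == yt else 1 for yp, yt in zip(y_pred, y_true)]
misclassification_rate = sum(losses) / len(losses)   # 2 errors / 10 points = 0.2
accuracy = 1 - misclassification_rate                # 0.8

print(misclassification_rate, accuracy)
```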
Evaluation of Classification Performance
Classification Accuracy (0/1 Loss):
Example:
- Predict if a bowler will not bowl a no-ball?
- Assuming 15 no-balls in an innings, a model that always answers 'Yes' (not a no-ball) achieves roughly 95% accuracy.
- Using accuracy as the performance metric, this model appears very accurate, yet it is of no practical use.

Why?
- Total balls: 315 (assuming all other balls are legal ☺)
- No-ball label: Class 0 (4.76% of the data)
- Not a no-ball label: Class 1 (95.24% of the data)
- The classes are imbalanced.
Evaluation of Classification Performance
TP, TN, FP and FN:
- Consider a binary classification problem.
Evaluation of Classification Performance
TP, TN, FP and FN:
Evaluation of Classification Performance
TP, TN, FP and FN:
Example:
- Predict if a bowler will not bowl a no-ball?
- 15 no-balls in an inning (Total balls: 315)
- Bowl no-ball (Class 0), Bowl regular ball (Class 1)
- Model(*) predicted 10 no-balls (8 correct predictions, 2 incorrect)

* Assume you have a model that has been observing the bowlers for the last 15 years
and used these observations for learning.
Evaluation of Classification Performance
Confusion Matrix (Contingency Table):
- The counts (TP, TN, FP, FN) are usefully summarized in a table referred to as the confusion matrix:
- the rows correspond to the predicted class (ŷ),
- and the columns to the true class (y).

                                Actual Labels
                                1 (Positive)    0 (Negative)    Total
Predicted    1 (Positive)       TP              FP              Predicted Total Positives
Labels       0 (Negative)       FN              TN              Predicted Total Negatives
             Total              P = TP + FN     N = FP + TN
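As a small sketch (Python; the labels and the helper function below are our own, not from the slides), the four counts can be tallied directly and arranged with rows = predicted class and columns = actual class, as in the table above.

```python
# Tally TP, FP, FN, TN for binary labels (1 = positive, 0 = negative).
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 0)
    return tp, fp, fn, tn

y_true = [1, 1, 0, 1, 0, 1, 0, 1]   # hypothetical actual labels
y_pred = [1, 0, 0, 1, 1, 1, 0, 1]   # hypothetical predictions

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print([[tp, fp],    # row: predicted positive
       [fn, tn]])   # row: predicted negative
```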
Evaluation of Classification Performance
Confusion Matrix:
Example: Disease Detection
- Given pathology reports and scans, predict heart disease (Yes: 1, No: 0).

                                Actual Labels
                                1 (Positive)    0 (Negative)    Total
Predicted    1 (Positive)       TP = 100        FP = 10         110
Labels       0 (Negative)       FN = 5          TN = 50         55
             Total              P = 105         N = 60

Interpretation:

Out of 165 cases,
- Predicted: "Yes" 110 times, and "No" 55 times
- In reality: "Yes" 105 times, and "No" 60 times


Evaluation of Classification Performance
Confusion Matrix:
Example:
- Predict if a bowler will not bowl a no-ball?

                                Actual Labels
                                1 (Positive)    0 (Negative)    Total
Predicted    1 (Positive)       TP = 298        FP = 7          305
Labels       0 (Negative)       FN = 2          TN = 8          10
             Total              P = 300         N = 15

Interpretation:
Out of 315 balls, we had 15 no-balls.
- The model predicted 305 regular balls and 10 no-balls (8 correct no-ball predictions, 2 incorrect).
Evaluation of Classification Performance
Confusion Matrix:
Metrics using Confusion Matrix:

- Accuracy: Overall, how frequently is the classifier correct?
  Accuracy = (TP + TN) / (TP + TN + FP + FN)

- Misclassification or Error Rate: Overall, how frequently is it wrong?
  Error Rate = (FP + FN) / (TP + TN + FP + FN) = 1 − Accuracy

- Sensitivity, Recall or True Positive Rate (TPR): When it is actually positive, how often does it predict positive?
  TPR = TP / (TP + FN)
Evaluation of Classification Performance
Confusion Matrix:
Metrics using Confusion Matrix:

- False Positive Rate (FPR): When it is actually negative, how often does it predict positive?
  FPR = FP / (FP + TN)

- Specificity or True Negative Rate (TNR): When it is actually negative, how often does it predict negative?
  TNR = TN / (TN + FP) = 1 − FPR

- Precision: When it predicts positive, how often is it correct?
  Precision = TP / (TP + FP)


Evaluation of Classification Performance
Confusion Matrix Metrics:

- Negative Predictive Value (NPV): When it predicts negative, how often is it correct?
  NPV = TN / (TN + FN)


Evaluation of Classification Performance
Confusion Matrix:
Metrics using Confusion Matrix (Example: Disease Prediction):

- Accuracy: Disease/Healthy prediction accuracy

= (100+50)/165 = 0.91

- Misclassification or Error Rate: How often is the Disease/Healthy prediction wrong?

= (10+5)/165 = 0.09

- Sensitivity or Recall or True Positive Rate (TPR): When the patient actually has the disease, how often does the model detect it?

= 100/105 = 0.95
Evaluation of Classification Performance
Confusion Matrix:
Metrics using Confusion Matrix (Example: Disease Prediction):

- False Positive Rate: When the patient is actually healthy, how often does the model predict disease?

= 10/60 = 0.17

- Specificity or True Negative Rate (TNR): When the patient is actually healthy, how often does the model predict healthy?
= 50/60 = 0.83

- Precision: When it predicts disease, how often is it correct?

= 100/110 = 0.91
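For completeness, a short Python sketch (variable names are ours) recomputing the metrics above from the disease-detection confusion matrix:

```python
# Disease-detection example: TP = 100, FP = 10, FN = 5, TN = 50.
TP, FP, FN, TN = 100, 10, 5, 50
total = TP + FP + FN + TN                # 165 cases

accuracy    = (TP + TN) / total          # 150/165 ≈ 0.91
error_rate  = (FP + FN) / total          # 15/165  ≈ 0.09
sensitivity = TP / (TP + FN)             # 100/105 ≈ 0.95  (recall, TPR)
fpr         = FP / (FP + TN)             # 10/60   ≈ 0.17
specificity = TN / (TN + FP)             # 50/60   ≈ 0.83  (TNR)
precision   = TP / (TP + FP)             # 100/110 ≈ 0.91

print(accuracy, error_rate, sensitivity, fpr, specificity, precision)
```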
Evaluation of Classification Performance
Confusion Matrix:
Metrics using Confusion Matrix:
When to use which?

- Disease detection: we do not want false negatives (missing a diseased patient is costly), so recall/sensitivity matters most.

- Fraud detection: we do not want false positives (flagging a legitimate transaction is costly), so precision matters most.


Outline

- Classification Accuracy (0/1 Loss)


- TP, TN, FP and FN
- Confusion Matrix
- Sensitivity, Specificity, Precision Trade-offs, ROC, AUC
- F1-Score and Matthews Correlation Coefficient
- Multi-class Classification, Evaluation, Micro, Macro Averaging
Evaluation of Classification Performance
Confusion Matrix:
Precision and Sensitivity (Recall) Trade-off:
Sensitivity (Recall) = TP / (TP + FN)          Precision = TP / (TP + FP)

- Disease Detection:

- Recall or Sensitivity (Se): how good we are at detecting diseased people.

- Precision: of those diagnosed as unhealthy, how many actually are unhealthy.

- If we diagnose everyone as unhealthy, Se = 1 (all unhealthy people are detected correctly), but precision may be low (TN = 0, so every healthy person becomes a false positive).

- We want high Precision and high Se (ideally both equal to 1).

- We should combine precision and sensitivity into a single measure to evaluate the performance of the classifier:
- F1-Score
Evaluation of Classification Performance
Confusion Matrix:
Sensitivity and Specificity Trade-off:
Sensitivity (Recall) = TP / (TP + FN)          Specificity = TN / (TN + FP)
- Disease Detection:

- Sp and Se; how good we are at detecting healthy and diseased people, respectively.

- If we have diagnosed everyone healthy, Sp=1 (diagnose all healthy people correctly) but
Se=0 (diagnose all unhealthy people incorrectly)

- Ideally, we want Sp = Se = 1 (perfect sensitivity and specificity), but this is usually unrealistic.
Evaluation of Classification Performance
Confusion Matrix:
Sensitivity and Specificity Trade-off:
How optimal is a given pair of sensitivity and specificity values?
- Is Sp = 0.8, Se = 0.7 better than Sp = 0.7, Se = 0.8?

[Figure: sweeping the decision threshold trades sensitivity against specificity, from Se = 1 at one extreme to Sp = 1 at the other.]

- The answer depends on the application.

- In disease diagnosis, we are happy to reduce Sp in order to increase Se.

- In other applications, we may have different requirements.

- Trade-off is better explained by ROC curve and AUC.


Evaluation of Classification Performance
Confusion Matrix:
ROC (Receiver Operating Characteristic) Curve:
- Plot of TPR (Sensitivity) against FPR (1 – Specificity)
for different values of threshold.

- Also referred to as the Sensitivity vs (1 − Specificity) plot.

- At a threshold of 0.0, every case is diagnosed as positive.


- Se= TPR = 1
- FPR = 1
- Sp= 0

- At a threshold of 1.0, every case is diagnosed as negative.


- Se= TPR = 0
- FPR = 0
- Sp= 1
Evaluation of Classification Performance
Confusion Matrix:
ROC Curve and AUC:

- TPR (Sensitivity): how many correct positive results


occur among all positive samples.

- FPR (1 – Specificity): how many incorrect positive


results occur among all negative samples.

- The best possible prediction method: Se = Sp = 1 (the upper-left corner of the ROC space).

- Random guessing: a point along the diagonal line (the so-called line of no discrimination), i.e., no discriminative power.

- The Area Under the ROC Curve (AUC) quantifies the discriminative power of the classifier.
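A hedged sketch of how an ROC curve and its AUC might be computed in practice, assuming scikit-learn (not prescribed by the slides) and synthetic scores:

```python
# ROC curve (TPR vs FPR over a threshold sweep) and AUC with scikit-learn.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true  = rng.integers(0, 2, size=200)            # hypothetical binary labels
y_score = y_true * 0.5 + 0.8 * rng.random(200)    # noisy classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) pair per threshold
auc = roc_auc_score(y_true, y_score)               # area under the ROC curve

print(auc)   # 1.0 = perfect ranking, 0.5 = no discriminative power
```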
Outline

- Classification Accuracy (0/1 Loss)


- TP, TN, FP and FN
- Confusion Matrix
- Sensitivity, Specificity, Precision Trade-offs, ROC, AUC
- F1-Score and Matthews Correlation Coefficient
- Multi-class Classification, Evaluation, Micro, Macro Averaging
Evaluation of Classification Performance
F1-Score:
- We observed trade-off between recall and precision.

- Higher levels of recall may be obtained at the price of lower values of precision.

- We need to define a single measure that combines recall and precision or other
metrics to evaluate the performance of a classifier.

- Some combined measures:


- F1 Score
- Matthews Correlation Coefficient
- 11-point average precision
- The Breakeven point
Evaluation of Classification Performance
F1 Score:

- One measure that assesses the recall-precision trade-off is the weighted harmonic mean (HM) of recall and precision, that is,

  F = 1 / ( α · (1/Precision) + (1 − α) · (1/Recall) ),

  which for equal weights (α = 1/2) gives the F1-score: F1 = 2 · Precision · Recall / (Precision + Recall).
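A minimal Python sketch of this weighted harmonic mean; the helper name and the example numbers are our own:

```python
# Weighted harmonic mean of precision and recall; alpha = 0.5 gives the F1-score.
def f_measure(precision, recall, alpha=0.5):
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

precision, recall = 0.91, 0.95
print(f_measure(precision, recall))         # F1 ≈ 0.93
print(f_measure(precision, recall, 0.75))   # weights precision more heavily
```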
Evaluation of Classification Performance
F1 Score:
Why harmonic mean?
- We could also use arithmetic mean (AM) or geometric mean (GM).

- HM is preferred because it penalizes the model the most; it is a conservative average, that is, for two positive real numbers a and b,

  min(a, b) ≤ HM ≤ GM ≤ AM ≤ max(a, b).

- Since HM lower-bounds GM and AM, a high HM guarantees that GM and AM are at least as high.

[Figure: the different means (HM, GM, AM), along with the minimum and maximum, plotted against precision, with recall fixed at 70%.]
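A small numerical sketch (Python, with illustrative precision values of our choosing) echoing the figure: with recall fixed at 0.7, the harmonic mean is always the most conservative of the three averages.

```python
# Compare arithmetic, geometric and harmonic means with recall fixed at 0.7.
import math

recall = 0.7
for precision in [0.1, 0.3, 0.5, 0.7, 0.9]:
    am = (precision + recall) / 2
    gm = math.sqrt(precision * recall)
    hm = 2 * precision * recall / (precision + recall)
    print(f"P={precision:.1f}  AM={am:.2f}  GM={gm:.2f}  HM={hm:.2f}")
```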
Evaluation of Classification Performance
Matthews Correlation Coefficient (MCC):
- Precision, Recall and F1-score are asymmetric: they give a different result if the positive and negative classes are switched.

- The Matthews correlation coefficient measures the correlation between the true class and the predicted class. The higher the correlation between true and predicted values, the better the prediction.

- Defined as

  MCC = (TP · TN − FP · FN) / sqrt( (TP + FP)(TP + FN)(TN + FP)(TN + FN) )

- MCC = 1 when FP = FN = 0 (perfect classification)
- MCC = −1 when TP = TN = 0 (perfect misclassification)
- MCC = 0: the classifier performs no better than a random classifier (coin flip)
- MCC is symmetric by design
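A short Python sketch of MCC computed directly from the four counts; the function name is ours and the example numbers are taken from the disease-detection example above:

```python
# Matthews correlation coefficient from the confusion-matrix counts.
import math

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:          # degenerate case (a whole row or column is zero)
        return 0.0
    return (tp * tn - fp * fn) / denom

print(mcc(tp=100, tn=50, fp=10, fn=5))   # ≈ 0.80 for the disease example
```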
Evaluation of Classification Performance
11-point Average Precision:
- Adjust the threshold of the classifier so that recall takes the following 11 values: 0.0, 0.1, …, 0.9, 1.0.

- For each value of the recall, determine the precision and find the average value of precision,
referred to as average precision (AP).

- This is simply uniformly spaced sampling of the precision-recall curve, followed by averaging.

The Breakeven Point:


- Compute precision as a function of recall for different values of thresholds.

- The point where Precision = Recall is called the breakeven point.

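A possible sketch of 11-point average precision, assuming numpy and scikit-learn's precision_recall_curve (neither is specified in the slides) and hypothetical scores; at each recall level r the precision is taken as the best precision achievable at recall ≥ r (the usual interpolated form):

```python
# 11-point (interpolated) average precision from a precision-recall curve.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])                         # hypothetical labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.7, 0.9, 0.2, 0.6, 0.55])  # hypothetical scores

precision, recall, _ = precision_recall_curve(y_true, y_score)

# Best precision at recall >= r, averaged over r = 0.0, 0.1, ..., 1.0.
ap_11 = np.mean([precision[recall >= r].max() for r in np.linspace(0.0, 1.0, 11)])
print(ap_11)
```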

Outline

- Classification Accuracy (0/1 Loss)


- TP, TN, FP and FN
- Confusion Matrix
- Sensitivity, Specificity, Precision Trade-offs, ROC, AUC
- F1-Score and Matthews Correlation Coefficient
- Multi-class Classification, Evaluation, Micro, Macro Averaging
Multi-Class Classification
Formulation:
- Given training data {(x_i, y_i)}, i = 1, …, n, learn a classifier f : X → {1, 2, …, C} that assigns each input to one of C > 2 classes.

Examples:
- Emotion Detection.

- Vehicle type, make, model and color of a vehicle from images streamed by safe-city cameras.

- Speaker Identification from Speech Signal.

- State (rest, ramp-up, normal, ramp-down) of the process machine in the plant.

- Sentiment Analysis (Categories: Positive, Negative, Neutral), Text Analysis.

- Take an image of the sky and determine the pollution level (healthy, moderate, hazardous).

- Record Home WiFi signals and identify the type of appliance being operated.
Multi-Class Classification
Implementation (possible options using binary classifiers):
Option 1: Build a one-vs-all (OvA), also called one-vs-rest (OvR), classifier: train one binary classifier per class, separating that class from all the others.

Option 2: Build an all-vs-all (AvA), also called one-vs-one (OvO), classifier: train one binary classifier for every pair of classes.

There can be other options… A sketch of the first two options is shown below.
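As referenced above, a brief sketch of both options using scikit-learn's wrappers (an assumption; the slides do not prescribe any library), with logistic regression as the underlying binary learner:

```python
# One-vs-rest and one-vs-one multi-class classification from binary learners.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # 3-class toy dataset

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # C binary models
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # C(C-1)/2 models

print(ovr.predict(X[:5]))
print(ovo.predict(X[:5]))
```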


Evaluation of Classification Performance
Multiclass Classification:

- How do we define the measures for the evaluation of the performance of multi-class classifier?

- Macro-averaging: We compute performance for each class and then average.

- Micro-averaging: Compute confusion matrix after collecting decisions for all classes and then
evaluate.
Evaluation of Classification Performance
Multiclass Classification:
Confusion Matrix:
- Predict if a bowler will bowl a no-ball, a wide ball, or a regular ball?
- 15 no-balls and 20 wide-balls in an innings (total balls: 335)
- Model Predictions:

                              Actual
                              No-ball    Wide-ball    Regular ball
Classifier    No-ball             8           5            20
Output        Wide-ball           2          10            10
              Regular ball        5           5           270

(Precision is computed along each row of the classifier output; recall along each column of the actual classes.)
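As a sketch (Python with numpy; the variable names are ours), the per-class recall and precision can be read directly off this matrix:

```python
# Per-class recall/precision for the no-ball / wide-ball / regular-ball example.
import numpy as np

# Rows = classifier output, columns = actual class.
C = np.array([[  8,   5,  20],
              [  2,  10,  10],
              [  5,   5, 270]])

recall    = np.diag(C) / C.sum(axis=0)   # per actual class: 8/15, 10/20, 270/300
precision = np.diag(C) / C.sum(axis=1)   # per predicted class: 8/33, 10/22, 270/280
accuracy  = np.trace(C) / C.sum()        # 288/335 ≈ 0.86

print(recall, precision, accuracy)
```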
Evaluation of Classification Performance
Multiclass Classification:
Confusion Matrix – Recall and Precision:

Recall
- For the i-th class, recall is the fraction of data-points of class i that are classified correctly, that is,
  Recall_i = C_ii / Σ_k C_ki   (diagonal entry divided by the column total of class i)

Precision
- For the i-th class, precision is the fraction of data-points predicted to be in class i that are actually in the i-th class, that is,
  Precision_i = C_ii / Σ_k C_ik   (diagonal entry divided by the row total of class i)

Accuracy
- Fraction of data-points classified correctly, that is,
  Accuracy = Σ_i C_ii / Σ_i Σ_j C_ij
Evaluation of Classification Performance
Multiclass Classification:
Confusion Matrix – Macro-Averaging:
- We compute performance for each class and then average.

Confusion Matrix – Each Class:

No-ball class:
                              Actual No-ball    Actual Not No-ball
Classifier    No-ball               8                  25
Output        Not No-ball           7                 295

Wide class:
                              Actual Wide       Actual Not Wide
Classifier    Wide                 10                  12
Output        Not Wide             10                 303

Regular class:
                              Actual Regular    Actual Not Regular
Classifier    Regular             270                  10
Output        Not Regular          30                  25

Per-class Recall: 8/15 ≈ 0.53, 10/20 = 0.50, 270/300 = 0.90

Macro-average Recall = (0.53 + 0.50 + 0.90) / 3 ≈ 0.64
Evaluation of Classification Performance
Multiclass Classification:
Confusion Matrix – Micro-Averaging:
- Compute a pooled confusion matrix after collecting the decisions for all classes, and then evaluate.

Pooled confusion matrix (sums of the per-class TP, FP, FN, TN):
                              Actual Positive    Actual Negative
Predicted Positive                 288                 47
Predicted Negative                  47                623

Micro-average Recall = 288 / (288 + 47) ≈ 0.86

Confusion Matrix – Each Class: the same per-class matrices as on the macro-averaging slide above.
Evaluation of Classification Performance
Multiclass Classification:
Micro-Averaging vs Macro Averaging:
- Note: Micro-average recall = Micro-average precision = Micro F1-Score = Accuracy (computed from the pooled confusion matrix).
- Micro-averaging is therefore a global metric.
- Consequently, it is not a good measure when the classes are imbalanced, since the frequent classes dominate the average.

- Macro-averaging is relatively better in this respect, as we see a zoomed-in, per-class picture before averaging.

- Note, however, that macro-averaging does not take class imbalance into account (every class gets equal weight).

- Weighted averaging: similar to macro-averaging, but takes a weighted mean instead, where the weight of each class is the number of data-points of that class (normalized by the total).

Weighted-average Recall = (15 · (8/15) + 20 · (10/20) + 300 · (270/300)) / 335 = 288 / 335 ≈ 0.86
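Finally, a sketch (Python with numpy) of the macro-, micro- and weighted-average recall for this example; note that for recall the weighted average coincides with the micro average, since each class's recall multiplied by its count is just the corresponding diagonal entry.

```python
# Macro-, micro- and weighted-average recall from the 3x3 confusion matrix.
import numpy as np

C = np.array([[  8,   5,  20],   # rows = classifier output
              [  2,  10,  10],   # columns = actual class
              [  5,   5, 270]])

per_class_recall = np.diag(C) / C.sum(axis=0)   # [0.53, 0.50, 0.90]
class_counts     = C.sum(axis=0)                # [15, 20, 300]

macro_recall    = per_class_recall.mean()                              # ≈ 0.64
micro_recall    = np.trace(C) / C.sum()                                # 288/335 ≈ 0.86
weighted_recall = (per_class_recall * class_counts).sum() / C.sum()    # ≈ 0.86

print(macro_recall, micro_recall, weighted_recall)
```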
Evaluation of Classification Performance
References:

- KM 5.7.2
