Evaluation Metrics
Wei Li
Figure: The training data is used to build the model; the test data is used to evaluate it.
Standard Approach: Measuring Misclassification Rate on a Hold-out Test Set
(a) A 50:20:30 split. (b) A 40:20:40 split.
Figure: Hold-out sampling can divide the full data into training, validation, and test sets.
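A minimal sketch of hold-out sampling in Python, assuming the dataset is a simple list; the helper name and the fixed seed are illustrative choices, not part of the lecture:

```python
import random

def holdout_split(data, train_frac=0.5, valid_frac=0.2, seed=42):
    """Shuffle the data and divide it into training, validation,
    and test sets; the remainder becomes the test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# A 50:20:30 split of 100 examples gives 50 / 20 / 30 instances.
train, valid, test = holdout_split(list(range(100)))
```

Changing the fractions to 0.4 and 0.2 would produce the 40:20:40 split from panel (b).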
Figure: Misclassification rate (y-axis, 0.1 to 0.5) on the training set versus the validation set.
In this example, spam is the positive class and ham is the negative class.
For binary prediction problems there are 4 possible outcomes:
1. True Positive (TP): Target = positive AND Pred = positive
2. True Negative (TN): Target = negative AND Pred = negative
3. False Positive (FP): Target = negative AND Pred = positive
4. False Negative (FN): Target = positive AND Pred = negative
Table: A sample test set with model predictions.
ID Target Pred. Outcome ID Target Pred. Outcome
1 spam ham FN 11 ham ham TN
2 spam ham FN 12 spam ham FN
3 ham ham TN 13 ham ham TN
4 spam spam TP 14 ham ham TN
5 ham ham TN 15 ham ham TN
6 spam spam TP 16 ham ham TN
7 ham ham TN 17 ham spam FP
8 spam spam TP 18 spam spam TP
9 spam spam TP 19 ham ham TN
10 spam spam TP 20 ham spam FP
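The four outcome counts can be tallied with a short Python sketch; the (target, prediction) pairs are transcribed from the table above, and the helper name is my own:

```python
# (target, prediction) pairs for test instances 1..20 from the table.
results = [
    ("spam", "ham"), ("spam", "ham"), ("ham", "ham"), ("spam", "spam"),
    ("ham", "ham"), ("spam", "spam"), ("ham", "ham"), ("spam", "spam"),
    ("spam", "spam"), ("spam", "spam"), ("ham", "ham"), ("spam", "ham"),
    ("ham", "ham"), ("ham", "ham"), ("ham", "ham"), ("ham", "ham"),
    ("ham", "spam"), ("spam", "spam"), ("ham", "ham"), ("ham", "spam"),
]

def outcome(target, pred, positive="spam"):
    """Classify one prediction as TP, TN, FP, or FN."""
    if pred == positive:
        return "TP" if target == positive else "FP"
    return "FN" if target == positive else "TN"

counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
for target, pred in results:
    counts[outcome(target, pred)] += 1
# counts == {"TP": 6, "TN": 9, "FP": 2, "FN": 3}
```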
classification accuracy = (TP + TN) / (TP + TN + FP + FN) = (6 + 9) / (6 + 9 + 2 + 3) = 0.75
Performance Measures for Categorical Targets
precision = 6 / (6 + 2) = 0.75
recall = 6 / (6 + 3) = 0.667
Recall and precision often trade off against each other.
precision = TP / (TP + FP)
recall = TP / (TP + FN)

Extreme cases:
- FP = 0: Precision = 100%, Recall = 10%
- FN = 0: Precision = 60%, Recall = 100%
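These two measures can be sketched as small Python functions; the raw counts used to reproduce the extreme-case percentages (e.g. TP = 10, FN = 90) are illustrative values I chose, not from the slides:

```python
def precision(tp, fp):
    """Fraction of positive predictions that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that the model finds."""
    return tp / (tp + fn)

# Extreme case FP = 0, e.g. TP = 10, FN = 90:
p1, r1 = precision(10, 0), recall(10, 90)   # 1.0, 0.1

# Extreme case FN = 0, e.g. TP = 60, FP = 40:
p2, r2 = precision(60, 40), recall(60, 0)   # 0.6, 1.0
```

The first model is very cautious (everything it flags is spam, but it misses most spam); the second flags aggressively (it catches all spam at the cost of false alarms).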
F1-measure = 2 × (precision × recall) / (precision + recall)

For the two extreme cases:
- Precision = 100%, Recall = 10% → F1-measure = 18%
- Precision = 60%, Recall = 100% → F1-measure = 75%
precision = 6 / (6 + 2) = 0.75
recall = 6 / (6 + 3) = 0.667
F1-measure = 0.71
Making the most of the data