Comprehensive Guide On Confusion Matrix
             0 (Predicted)   1 (Predicted)
0 (Actual)   2               3
1 (Actual)   1               4
Here, the algorithm has made a total of 10 predictions, and this confusion matrix describes whether these
predictions are correct or not. To summarise this table in words:
there were 2 cases in which the algorithm predicted a 0 and the actual label was 0 (correct).
there were 3 cases in which the algorithm predicted a 1 and the actual label was 0 (incorrect).
there was 1 case in which the algorithm predicted a 0 and the actual label was 1 (incorrect).
there were 4 cases in which the algorithm predicted a 1 and the actual label was 1 (correct).
In terms of true negatives (TN), false positives (FP), false negatives (FN) and true positives (TP), the same table looks like this:
             0 (Predicted)   1 (Predicted)
0 (Actual)   TN = 2          FP = 3
1 (Actual)   FN = 1          TP = 4
Here, you can interpret 0 as being negative and 1 as being positive. Using these values, we can compute the
following key metrics:
1. Accuracy
2. Misclassification rate
3. Precision
4. Recall (or Sensitivity)
5. F-Score
6. Specificity
Classification metrics
Accuracy
Accuracy is simply the proportion of correct classifications:
$$\text{Accuracy} = \frac{TN + TP}{TN + FP + FN + TP}$$
The denominator here represents the total predictions made. For our example, the accuracy is as follows:
$$\text{Accuracy} = \frac{2 + 4}{2 + 3 + 1 + 4} = 0.6$$
Of course, we want the accuracy to be as high as possible. However, as we shall see later, a classifier with a high accuracy does not necessarily make for a good classifier.
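As a quick sanity check, the same value can be reproduced with sklearn's accuracy_score(~). The labels below are hypothetical - they are not part of the original example, but are chosen so that they yield the confusion matrix above (TN = 2, FP = 3, FN = 1, TP = 4):

from sklearn.metrics import accuracy_score

# Hypothetical labels that reproduce the confusion matrix above
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # actual labels
y_pred = [0, 0, 1, 1, 1, 0, 1, 1, 1, 1]   # predicted labels

print(accuracy_score(y_true, y_pred))   # 0.6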
Misclassification rate
Misclassification rate is the proportion of incorrect classifications:

$$\text{Misclassification Rate} = \frac{FP + FN}{TN + FP + FN + TP}$$
Note that this is basically one minus the accuracy. For our example, the misclassification rate is as follows:
$$\text{Misclassification Rate} = \frac{3 + 1}{2 + 3 + 1 + 4} = 0.4$$
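Since the misclassification rate is just one minus the accuracy, it can be reproduced from the same hypothetical labels used in the accuracy sketch above:

from sklearn.metrics import accuracy_score

# Hypothetical labels that reproduce the confusion matrix above
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1, 1, 1, 1]

# Misclassification rate = 1 - accuracy
print(1 - accuracy_score(y_true, y_pred))   # 0.4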
Precision
Precision is the proportion of correct predictions given the prediction was positive:
$$\text{Precision} = \frac{TP}{FP + TP}$$
For our example, the precision is as follows:

$$\text{Precision} = \frac{4}{3 + 4} = \frac{4}{7}$$
The metric of precision is important in cases when you want to minimise false positives and maximise true
positives. For instance, consider a binary classifier that predicts whether an e-mail is legitimate (0) or spam
(1). From the users' perspective, they do not want any legitimate e-mail to be identified as spam since e-mails
identified as spam usually do not end up in the regular inbox. Therefore, in this case, we want to avoid
predicting spam for legitimate e-mails (false positive), and obviously aim to predict spam for actual spam e-
mails (true positive).
As a numerical example, consider two binary classifiers - the 1st classifier has a precision of 0.2, while the 2nd
classifier has a precision of 0.9. In words, for the 1st classifier, out of all e-mails predicted to be spam, only
20% of these e-mails were actually spam. This means that 80% of all e-mails identified as spam were actually
legitimate. On the other hand, for the 2nd classifier, out of all e-mails predicted to be spam, 90% of them were
actually spam. This means that only 10% of all e-mails identified as spam were legitimate. Therefore, we would
opt to use the 2nd classifier in this case.
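In sklearn, this quantity is exposed as precision_score(~). Sticking with the hypothetical labels that reproduce the guide's confusion matrix (TP = 4, FP = 3):

from sklearn.metrics import precision_score

# Hypothetical labels that reproduce the confusion matrix above
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1, 1, 1, 1]

print(precision_score(y_true, y_pred))   # 0.5714... (4/7)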
Recall
Recall is the proportion of correct predictions given the actual label was positive:

$$\text{Recall} = \frac{TP}{FN + TP}$$

For our example, the recall is as follows:

$$\text{Recall} = \frac{4}{1 + 4} = \frac{4}{5}$$
The metric of recall is important in cases when you want to minimise false negatives and maximise true
positives. For instance, consider a binary classifier that predicts whether a transaction is legitimate (0) or
fraudulent (1). From the bank's perspective, they want to be able to correctly identify all fraudulent transactions
since missing fraudulent transactions would be extremely costly for the bank. This would mean that we want to
minimise false negatives (i.e. missing actual fraudulent transactions), and maximise true positives (i.e.
correctly identifying fraudulent transactions).
As a numerical example, let's compare 2 binary classifiers where the 1st classifier has a recall of 0.2, while the
2nd has a recall of 0.9. In words, the 1st classifier is only able to correctly detect 20% of all actual fraudulent
cases, whereas the 2nd classifier is capable of correctly detecting 90% of all actual fraudulent cases. In this
case, the bank would therefore opt for the 2nd classifier.
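Similarly, recall is exposed as sklearn's recall_score(~). Using the same hypothetical labels as before (TP = 4, FN = 1):

from sklearn.metrics import recall_score

# Hypothetical labels that reproduce the confusion matrix above
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1, 1, 1, 1]

print(recall_score(y_true, y_pred))   # 0.8 (4/5)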
F-score
F-score is a metric that combines both the precision and recall into a single value:
$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
A low precision or recall value will reduce the F-score. For instance, consider the following confusion matrix:

             0 (Predicted)   1 (Predicted)
0 (Actual)   10000           10
1 (Actual)   900             1000

$$\text{Accuracy} = \frac{10000 + 1000}{10000 + 10 + 900 + 1000} \approx 0.92$$

$$\text{Precision} = \frac{1000}{10 + 1000} \approx 0.99$$

$$\text{Recall} = \frac{1000}{900 + 1000} \approx 0.53$$

$$F = \frac{2 \cdot 0.99 \cdot 0.53}{0.99 + 0.53} \approx 0.69$$
As we can see, the accuracy here is very high (≈0.92). However, the F-score is much lower (≈0.69) since the number of false negatives is quite high, which reduces the recall. This demonstrates how looking only at the accuracy would mislead you into thinking the model is excellent, when in fact there may be some abnormality in the prediction errors it makes.
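To make the arithmetic above concrete, here is a short sketch that recomputes these metrics directly from the four cell counts of the imbalanced confusion matrix (TN = 10000, FP = 10, FN = 900, TP = 1000):

# Cell counts of the imbalanced confusion matrix above
TN, FP, FN, TP = 10000, 10, 900, 1000

accuracy = (TN + TP) / (TN + FP + FN + TP)
precision = TP / (FP + TP)
recall = TP / (FN + TP)
f_score = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f_score, 2))
# 0.92 0.99 0.53 0.69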
WARNING
It is bad practice to quote only the accuracy for the classification performance of an algorithm. As demonstrated here, quoting the recall, precision and F-score is also extremely important depending on the context. For instance, for a binary classifier detecting fraudulent transactions, the recall (or F-score) is more relevant than the accuracy.
Specificity
Specificity represents the proportion of correct predictions for negative labels:
$$\text{Specificity} = \frac{TN}{TN + FP}$$
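Using the confusion matrix from the start of this guide (TN = 2, FP = 3), the specificity works out as follows:

$$\text{Specificity} = \frac{2}{2 + 3} = 0.4$$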
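The snippet that produced the output below is not included here, but a minimal sketch using sklearn's confusion_matrix(~) would look like the following; the labels are hypothetical ones chosen to reproduce the same output:

from sklearn.metrics import confusion_matrix

# Hypothetical labels - any data with 2 TN, 3 FP, 1 FN and 5 TP gives this result
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # actual labels
y_pred = [0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1]   # predicted labels

confusion_matrix(y_true, y_pred)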
array([[2, 3],
       [1, 5]])
Here, many people get confused as to what the numbers represent. This confusion matrix corresponds to the
following:
             0 (Predicted)   1 (Predicted)
0 (Actual)   2               3
1 (Actual)   1               5
Computing classification metrics
To compute the classification metrics, use sklearn's classification_report(~) method like so:
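The original snippet and its printed report are not included here; a minimal sketch, reusing the hypothetical labels from the confusion_matrix(~) example above, would be:

from sklearn.metrics import classification_report

# Hypothetical labels that reproduce the confusion matrix computed above
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1]

# Prints precision, recall, f1-score and support for each label (0 and 1)
print(classification_report(y_true, y_pred))

The report contains one row per label, each showing its precision, recall, f1-score and support (the number of times that label actually occurs).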
Here, the 0th support represents the number of actual negative (0) labels.