F1 Score
Note: Skip this section if you are already familiar with the concepts of
precision, recall, and F1 score.
Precision
If you compare the formulas for precision and recall, you will notice
that they look similar. The only difference is the second term of the
denominator: False Positives for precision, but False Negatives for
recall.
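This difference can be sketched in Python as follows. The two functions mirror the standard definitions, and the TP/FP/FN counts are made up purely for illustration:

```python
def precision(tp, fp):
    # Of all positive predictions, the fraction that were correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all actual positives, the fraction the model found.
    return tp / (tp + fn)

# Illustrative counts: 8 true positives, 2 false positives, 4 false negatives
print(precision(8, 2))  # 0.8
print(recall(8, 4))     # ~0.667
```

Note that only the second term of each denominator differs, just as in the formulas.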
F1 Score
The columns (in orange) containing the per-class scores (i.e., the
score for each class) and the average scores are the focus of our
discussion.
We can see from the above that the dataset is imbalanced (only one
out of ten test set instances is ‘Boat’). Thus the proportion of
correct matches (i.e., accuracy) would be an ineffective measure of
model performance.
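A quick sketch makes the point concrete. Here the labels are hypothetical, matching the one-in-ten ‘Boat’ split described above; a model that never predicts ‘Boat’ at all still scores high accuracy:

```python
# Hypothetical imbalanced test set: 1 'Boat' out of 10 instances.
y_true = ['Boat'] + ['Car'] * 9

# A degenerate model that always predicts the majority class.
y_pred = ['Car'] * 10

# Accuracy: proportion of correct matches.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.9, despite never detecting a single 'Boat'
```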
Calculated TP, FP, and FN values from confusion matrix | Image by author
The above table sets us up nicely to compute precision, recall, and
F1 score for each of the three classes.
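The per-class computation can be sketched like this. The TP/FP/FN counts below are hypothetical stand-ins, not the values from the author's confusion matrix:

```python
def f1(tp, fp, fn):
    # F1 is the harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical (TP, FP, FN) counts for three classes
counts = {'Boat': (1, 0, 0), 'Car': (5, 1, 2), 'Plane': (2, 2, 1)}

# One F1 score per class
per_class_f1 = {cls: f1(*c) for cls, c in counts.items()}
print(per_class_f1)
```

Averaging these per-class scores with equal weight would give the macro F1 score.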
With weighted averaging, each class contributes to the output average
in proportion to the number of examples of that class (its support).
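A minimal sketch of weighted averaging, assuming some hypothetical per-class F1 scores and class supports (the numbers are illustrative only):

```python
# Hypothetical per-class F1 scores and supports (instance counts per class)
f1_scores = {'Boat': 1.0, 'Car': 0.77, 'Plane': 0.57}
support = {'Boat': 1, 'Car': 6, 'Plane': 3}

# Weight each class's F1 by its share of the total instances.
total = sum(support.values())
weighted_f1 = sum(f1_scores[c] * support[c] / total for c in f1_scores)
print(weighted_f1)
```

Because the weights follow the class sizes, the majority classes dominate the weighted average, unlike macro averaging, which treats every class equally.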
We first sum the respective TP, FP, and FN values across all classes
and then plug them into the F1 equation to get our micro F1 score.
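This global pooling can be sketched as follows, again with hypothetical per-class counts:

```python
# Hypothetical (TP, FP, FN) counts per class: Boat, Car, Plane
counts = [(1, 0, 0), (5, 1, 2), (2, 2, 1)]

# Sum the raw counts across all classes first...
tp = sum(c[0] for c in counts)
fp = sum(c[1] for c in counts)
fn = sum(c[2] for c in counts)

# ...then apply the precision/recall/F1 formulas once, globally.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
micro_f1 = 2 * precision * recall / (precision + recall)
print(micro_f1)
```

In single-label multiclass classification, every false positive for one class is a false negative for another, so the pooled FP and FN totals coincide and the micro F1 score equals the overall accuracy.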