Performance
Binary Classification
                     Predicted Output
                       1         0
True Output     1      TP        FN
(Target)        0      FP        TN
Accuracy = (TP+TN)/(TP+TN+FP+FN)
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
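A minimal Python sketch of these three measures computed from the four confusion-matrix counts (the counts below are placeholders, not values from the slides):

```python
# Placeholder confusion-matrix counts for illustration.
TP, FN = 38, 12   # actual positives: predicted 1 vs. predicted 0
FP, TN = 2, 48    # actual negatives: predicted 1 vs. predicted 0

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```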
Precision
(Confusion matrix as above: Predicted Output 1/0 vs. True Output (Target))
Precision = TP/(TP+FP)
The percentage of predicted positives that are actual (target) positives
Recall
(Confusion matrix as above: Predicted Output 1/0 vs. True Output (Target))
Recall = TP/(TP+FN)
The percentage of actual (target) positives that were predicted as positive
Other measures - Precision vs. Recall
• Considering precision and recall lets us choose an ML approach that maximizes what we are most interested in (precision or recall), not just accuracy.
• Tradeoff - ML parameters can also be adjusted to accomplish the goal of the application
• Break-even point: precision = recall
• F1 (F-score) = 2·(precision·recall)/(precision + recall) - the harmonic mean of precision and recall (see the sketch below)
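A minimal sketch of the F1 computation, using made-up precision/recall values rather than anything from the slides:

```python
# Hypothetical precision and recall values for illustration only.
precision, recall = 0.95, 0.76

# F1 is the harmonic mean: it sits between the two, closer to the smaller one.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")
```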
ROC Curves and Area under curve
• Receiver Operating Characteristic
• Developed in WWII to statistically model false positive and false
negative detections of radar operators
• Standard measure in medicine and biology
• True positive rate (sensitivity) vs. false positive rate (1 - specificity)
• True positive rate (probability of predicting true when it is true):
TPR = P(Pred:T|T) = Sensitivity = Recall = TP/P = TP/(TP+FN)
• False positive rate (probability of predicting true when it is false):
FPR = P(Pred:T|F) = FP/N = FP/(TN+FP) = 1 - specificity, where specificity (the true negative rate) = TN/N = TN/(TN+FP) (see the sketch after this list)
• Want to maximize TPR and minimize FPR
• How would you do each independently?
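A sketch of computing TPR and FPR at a single decision threshold; the scores, labels, and threshold below are hypothetical:

```python
# Hypothetical classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]
labels = [1,   1,   0,   1,   1,   0,    0,   0]
threshold = 0.5

preds = [1 if s >= threshold else 0 for s in scores]
TP = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
FP = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
FN = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
TN = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)

tpr = TP / (TP + FN)   # sensitivity = recall = TP/P
fpr = FP / (FP + TN)   # FP/N = 1 - specificity
print(f"TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Raising the threshold lowers both rates; lowering it raises both, which is the tradeoff the next slide addresses.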
ROC Curves and ROC Area
• Want to find the right balance
• But the right balance/threshold can differ for each task considered
• How do we know which algorithms are robust and accurate across
many different thresholds? – ROC curve
• Each point on the ROC curve represents a different tradeoff (cost
ratio) between true positive rate and false positive rate
• The standard measures show accuracy for only one setting of the cost-ratio threshold, whereas the ROC curve shows accuracy across all settings, letting us compare how robust one algorithm is to different thresholds relative to another (a sketch of such a threshold sweep follows below)
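A sketch of tracing a ROC curve by sweeping the threshold over a classifier's scores and estimating the area under it with the trapezoid rule; the data are hypothetical and not from the slides:

```python
# Hypothetical scores and labels; sweep thresholds to collect (FPR, TPR) points.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]
labels = [1,   1,   0,   1,   1,   0,    0,   0]
P = sum(labels)           # number of actual positives
N = len(labels) - P       # number of actual negatives

points = []
for t in sorted(set(scores), reverse=True):
    preds = [1 if s >= t else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    points.append((fp / N, tp / P))            # (FPR, TPR) at this threshold

points = [(0.0, 0.0)] + points + [(1.0, 1.0)]  # endpoints: threshold = 1 and 0
points.sort()

# Trapezoid-rule estimate of the area under the ROC curve.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(f"AUC ≈ {auc:.3f}")
```

Sorting the (FPR, TPR) points by FPR before applying the trapezoid rule keeps the area estimate well defined even when several thresholds map to the same point.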
Example: ROC curve with points labeled by threshold. Assume thresholds:
• Threshold = 1, ROC point (0,0): all outputs are 0, so TPR = P(T|T) = 0 and FPR = P(T|F) = 0
• Threshold = 0, ROC point (1,1): TPR = 1, FPR = 1
• Threshold = .8: TPR = .38, FPR = .02 - better precision
• Threshold = .5: TPR = .82, FPR = .18 - better accuracy
• Threshold = .3: TPR = .95, FPR = .43 - better recall
Accuracy is maximized at the point closest to the top left corner.
Note that sensitivity = recall, and the lower the false positive rate, the higher the precision.
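A sketch of the "closest to the top-left corner" rule applied to the threshold points listed above (the (FPR, TPR) values are the ones from this example):

```python
# (threshold, FPR, TPR) triples taken from the example above.
candidates = [(1.0, 0.00, 0.00), (0.8, 0.02, 0.38), (0.5, 0.18, 0.82),
              (0.3, 0.43, 0.95), (0.0, 1.00, 1.00)]

# Squared distance of each ROC point to the ideal corner (FPR=0, TPR=1).
best = min(candidates, key=lambda c: c[1] ** 2 + (c[2] - 1.0) ** 2)
print(f"threshold={best[0]} lies closest to the top-left corner "
      f"(FPR={best[1]}, TPR={best[2]})")
```

On these values the rule picks threshold = .5, the point labeled "better accuracy" above.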
ROC Properties
• Area Properties
• 1.0 - Perfect prediction
• .9 - Excellent
• .7 - Moderate
• .5 - Random
• ROC area represents performance over all possible thresholds
• If two ROC curves do not intersect then one method dominates over
the other
• If they do intersect then one method is better for some thresholds,
and is worse for others
• In the example curves, the blue algorithm is better for precision, the yellow for recall, and the red for neither
• Can choose method and balance based on goals
Performance Measurement Summary
• Some of these measures (ROC, F-score) gaining popularity
• Can allow you to look at a range of thresholds
• However, they do not extend to multi-class situations, which are very common
• However, medicine, finance, etc. have lots of two-class problems
• Could always cast the problem as a set of two-class problems, but that can be inconvenient
• Accuracy handles multi-class outputs and is still the most common measure, but it is often combined with other measures such as ROC, etc.