PRcurves ISL
December 3, 2015
Talk outline
Precision-Recall-Gain Curves
Baseline
Linearity and optimality
Area
Calibration
Practical examples
Concluding remarks
[Figure: (left) ROC curve (false positive rate vs true positive rate) with non-dominated points (red circles) and convex hull (red dotted line). (right) Corresponding Precision-Recall curve (recall vs precision) with non-dominated points (red circles).]
Universal baselines: the major diagonal of an ROC plot depicts the line of random performance, which can be achieved without training; it is universal in the sense that it doesn't depend on the class distribution.
Linear interpolation: any point on a straight line between two points representing the performance of two classifiers (or thresholds) A and B can be achieved by making a suitably biased random choice between A and B. The slope of the connecting line determines the trade-off between the classes under which any linear combination of A and B would yield equivalent performance. In particular, test set accuracy assuming uniform misclassification costs is represented by accuracy isometrics with slope (1 − π)/π, where π is the proportion of positives.
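To make the isometric claim concrete, here is a minimal Python sketch (illustrative only: the operating points and the value of π are made up, not taken from the talk). It checks that two ROC points on a line with slope (1 − π)/π achieve the same accuracy, and shows that a biased random choice between classifiers A and B lands on the line connecting them.

```python
pi = 0.3  # proportion of positives (illustrative)

def accuracy(fpr, tpr, pi):
    """Accuracy of an ROC operating point under uniform misclassification costs."""
    return pi * tpr + (1 - pi) * (1 - fpr)

# Two points on a line with slope (1 - pi)/pi in ROC space.
fpr_a, tpr_a = 0.10, 0.40
slope = (1 - pi) / pi
fpr_b = 0.22
tpr_b = tpr_a + slope * (fpr_b - fpr_a)

# They lie on the same accuracy isometric (both approximately 0.75 here).
assert abs(accuracy(fpr_a, tpr_a, pi) - accuracy(fpr_b, tpr_b, pi)) < 1e-12

# A random choice of A with probability lam (and B otherwise) interpolates linearly.
lam = 0.25
fpr_mix = lam * fpr_a + (1 - lam) * fpr_b
tpr_mix = lam * tpr_a + (1 - lam) * tpr_b
```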
[Figure, repeated from above: (left) ROC curve with non-dominated points and convex hull. (right) Corresponding Precision-Recall curve with non-dominated points.]
Uninterpretable area: although many authors report the area under the PR curve (AUPR), it doesn't have a meaningful interpretation beyond the geometric one of expected precision when uniformly varying the recall (and even then the use of the arithmetic average cannot be justified). Furthermore, PR plots have unachievable regions at the lower right-hand side, the size of which depends on the class distribution.
No calibration: although some results exist regarding the relationship between calibrated scores and the F1 score, these are unrelated to the PR curve. To the best of our knowledge there is no published procedure to output scores that are calibrated for Fβ – that is, which give the value of β for which the instance is on the Fβ decision boundary.
The Fβ score
The F1 score is defined as the harmonic mean of precision and recall:

F1 ≜ 2 / (1/prec + 1/rec) = 2·prec·rec / (prec + rec) = TP / (TP + (FP + FN)/2)    (1)

Equivalently, F1 is the accuracy of a modified contingency table in which TN is replaced by TP:

              Predicted ⊕    Predicted ⊖
Actual ⊕      TP             FN               Pos
Actual ⊖      FP             TP               Neg − (TN − TP)
              TP + FP        Pos              2TP + FP + FN

More generally, for any β > 0:

Fβ ≜ 1 / ( (1/(1+β²))·(1/prec) + (β²/(1+β²))·(1/rec) ) = (1 + β²)·TP / ((1 + β²)·TP + FP + β²·FN)    (2)
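The two forms of Fβ above are easy to cross-check numerically; a minimal Python sketch (our own illustration, with made-up counts):

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F_beta from counts, as in (2)."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + fp + b2 * fn)

def f_beta_from_prec_rec(prec, rec, beta=1.0):
    """Weighted harmonic mean of precision and recall, the left-hand form of (2)."""
    b2 = beta ** 2
    return 1.0 / ((1 / (1 + b2)) / prec + (b2 / (1 + b2)) / rec)

tp, fp, fn = 60, 20, 30  # illustrative counts
prec, rec = tp / (tp + fp), tp / (tp + fn)
for beta in (0.5, 1.0, 2.0):
    assert abs(f_beta(tp, fp, fn, beta) - f_beta_from_prec_rec(prec, rec, beta)) < 1e-12
```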
Related work
There is a range of recent results regarding the F-score:
(i) non-decomposability of the Fβ score, meaning it is not an average over instances;
(ii) estimators exist that are consistent, i.e. unbiased in the limit;
(iii) given a model, operating points that are optimal for Fβ can be achieved by thresholding the model's scores;
(iv) a classifier yielding perfectly calibrated posterior probabilities has the property that the optimal threshold for F1 equals half the optimal F1 value.
The latter results tell us that optimal thresholds for Fβ are lower than optimal thresholds for accuracy (or equal only in the case of the perfect model). They don't, however, tell us how to find such thresholds other than by tuning. We demonstrate how to identify all Fβ-optimal thresholds for any β in a single calibration procedure.
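For contrast with the calibration procedure introduced later in the talk, the "tuning" route just mentioned amounts to a brute-force search over candidate thresholds. A minimal sketch (ours; the data are synthetic and best_f1_threshold is a hypothetical helper name):

```python
import numpy as np

def best_f1_threshold(y_true, scores):
    """Brute-force F1-optimal threshold: try every distinct score as a cutoff."""
    best_t, best_f1 = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Synthetic scores loosely correlated with synthetic labels.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
s = np.clip(0.3 * y + rng.normal(0.4, 0.25, size=200), 0, 1)
print(best_f1_threshold(y, s))
```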
Baseline
A random classifier that predicts positive with probability p has Fβ score (1 + β²)pπ/(p + β²π). Hence the baseline to beat is the always-positive classifier rather than any random classifier. Any model with prec < π or rec < π loses against this baseline, hence it makes sense to consider only precision and recall values in the interval [π, 1]. Any real-valued variable x ∈ [min, max] on a harmonic scale can be linearised by the mapping (1/x − 1/min) / (1/max − 1/min) = max·(x − min) / ((max − min)·x).
precG = (prec − π) / ((1 − π)·prec) = 1 − (π/(1 − π))·(FP/TP)        recG = (rec − π) / ((1 − π)·rec) = 1 − (π/(1 − π))·(FN/TP)    (3)
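In code, the linearising map and the count-based identities in (3) can be checked directly; a minimal sketch (ours, with made-up counts and a helper named gain):

```python
def gain(value, pi):
    """Map a quantity in [pi, 1] from the harmonic scale to a linear [0, 1] scale."""
    return (value - pi) / ((1 - pi) * value)

tp, fp, fn, tn = 60, 20, 30, 90  # illustrative counts
pi = (tp + fn) / (tp + fp + fn + tn)   # proportion of positives
prec, rec = tp / (tp + fp), tp / (tp + fn)

prec_gain = gain(prec, pi)
rec_gain = gain(rec, pi)
assert abs(prec_gain - (1 - pi / (1 - pi) * fp / tp)) < 1e-12
assert abs(rec_gain - (1 - pi / (1 - pi) * fn / tp)) < 1e-12
```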
[Figure: (left) Conventional PR curve (recall vs precision) with hyperbolic F1 isometrics (dotted lines) and the baseline performance of the always-positive classifier (solid hyperbola). (right) Precision-Recall-Gain curve (recall gain vs precision gain) with the minor diagonal as baseline, parallel F1 isometrics and a convex Pareto front.]
Theorem
precG + β²·recG = (1 + β²)·FGβ, with FGβ = (Fβ − π) / ((1 − π)·Fβ) = 1 − (π/(1 − π))·(FP + β²·FN) / ((1 + β²)·TP).
FGβ measures the gain in performance (on a linear scale) relative to a classifier with both precision and recall – and hence Fβ – equal to π.
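The identity in the theorem is easy to verify numerically; a short sketch (ours, reusing the illustrative counts and gain helper from above):

```python
def gain(value, pi):
    return (value - pi) / ((1 - pi) * value)

tp, fp, fn, tn = 60, 20, 30, 90
pi = (tp + fn) / (tp + fp + fn + tn)
prec, rec = tp / (tp + fp), tp / (tp + fn)

for beta in (0.5, 1.0, 2.0):
    b2 = beta ** 2
    f_beta = (1 + b2) * tp / ((1 + b2) * tp + fp + b2 * fn)
    fg_beta = gain(f_beta, pi)                      # (F_beta - pi) / ((1 - pi) * F_beta)
    lhs = gain(prec, pi) + b2 * gain(rec, pi)       # precG + beta^2 * recG
    assert abs(lhs - (1 + b2) * fg_beta) < 1e-12    # equals (1 + beta^2) * FG_beta
```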
Area
Define AUPRG = ∫₀¹ precG d(recG) and ∆ = recG/π − precG/(1 − π). Hence, −y0/(1 − π) ≤ ∆ ≤ 1/π, where y0 denotes the precision gain at the operating point where recall gain is zero.
Theorem
Let the operating points of a model with area under the Precision-Recall-Gain curve AUPRG be chosen such that ∆ is uniformly distributed within [−y0/(1 − π), 1/π]. Then the expected FG1 score is determined by AUPRG, y0 and π; in the special case where y0 = 1 the expected FG1 score is AUPRG/2 + 1/4.
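The general expression can be reconstructed under an extra assumption; the following is a sketch of our own, not a quote from the paper. Assuming the PRG curve runs from (0, y0) to (1, 0), so that ∆ increases monotonically along it, integrating FG1 = (precG + recG)/2 against the uniform distribution of ∆ gives

E[FG1] = (AUPRG + (1 − π)/2 + π·y0²/2) / (2·(1 − π + π·y0)),

which reduces to AUPRG/2 + 1/4 when y0 = 1, consistent with the special case above.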
The expected reciprocal F1 score can be calculated from the relationship E[1/F1] = (1 − (1 − π)·E[FG1])/π, which follows from the definition of FGβ.
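A minimal Python sketch (ours, not the authors' reference implementation) that computes AUPRG from a set of PRG operating points by trapezoidal integration and, assuming y0 = 1 so the special case applies, converts the expected FG1 into an expected reciprocal F1 (the points and prior are made up):

```python
import numpy as np

def auprg(rec_gains, prec_gains):
    """Trapezoidal area under a PRG curve, given operating points in gain coordinates."""
    rg = np.asarray(rec_gains, dtype=float)
    pg = np.asarray(prec_gains, dtype=float)
    order = np.argsort(rg)
    rg, pg = rg[order], pg[order]
    return float(np.sum(0.5 * (pg[1:] + pg[:-1]) * np.diff(rg)))

rg = [0.0, 0.3, 0.6, 1.0]        # illustrative recall gains (y0 = 1 at rg = 0)
pg = [1.0, 0.8, 0.5, 0.0]        # illustrative precision gains
area = auprg(rg, pg)

pi = 0.2                                   # illustrative proportion of positives
e_fg1 = area / 2 + 0.25                    # expected FG1 (special case y0 = 1)
e_recip_f1 = (1 - (1 - pi) * e_fg1) / pi   # E[1/F1] from the definition of FG_beta
```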
Calibration
Theorem
Let two classifiers be such that prec1 > prec2 and rec1 < rec2; then these two classifiers have the same Fβ score if and only if

β² = −(1/prec1 − 1/prec2) / (1/rec1 − 1/rec2) = −s_PRG    (5)

where s_PRG is the slope of the connecting segment in the PRG plot.
This suggests calibrating scores for Fβ via cF = 1 / (1 − s_PRG). Notice that this cannot be obtained from the accuracy-calibrated score c = 1 / (1 + (1 − π)/(π·s)), where s denotes the corresponding slope in the ROC plot.
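Both mappings are one-liners; a small sketch (our own illustration, where s_roc and s_prg denote the slope of a convex-hull segment in ROC space and in PRG space, respectively):

```python
def accuracy_calibrated_score(s_roc, pi):
    """Accuracy-calibrated score from an ROC segment slope s_roc > 0."""
    return 1.0 / (1.0 + (1.0 - pi) / (pi * s_roc))

def f_calibrated_score(s_prg):
    """F_beta-calibrated score from a PRG segment slope (s_prg <= 0 on the Pareto front)."""
    return 1.0 / (1.0 - s_prg)

# A PRG segment with slope -1 corresponds to beta^2 = 1, i.e. the F1 decision
# boundary, and hence to a calibrated score of 0.5.
assert f_calibrated_score(-1.0) == 0.5
```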
[Figure: (left) ROC curve (false positive rate vs true positive rate) with scores empirically calibrated for accuracy; the green dots correspond to a regular grid in Precision-Recall-Gain space. (right) Precision-Recall-Gain curve (recall gain vs precision gain) with scores calibrated for Fβ; the green dots correspond to a regular grid in ROC space, clearly indicating that ROC analysis over-emphasises the high-recall region.]
Practical examples
[Figure: (left) Comparison of AUPRG-ranks vs AUPR-ranks. Each cell shows how many models across 886 OpenML tasks have these ranks among the 30 models in the same task. (right) Comparison of AUPRG vs AUPR in OpenML tasks with IDs 3872 (white-clover) and 3896 (ada-agnostic), with 30 models in each task. Some models perform worse than random (AUPRG < 0) and are not plotted. The models represented by the two encircled triangles are shown in detail in the next figure.]
[Figure: (left) ROC curves for AdaBoost (solid line) and Logistic Regression (dashed line) on the white-clover dataset (OpenML run IDs 145651 and 267741, respectively). (middle) Corresponding PR curves: the solid curve is on average lower with AUPR = 0.724, whereas the dashed curve has AUPR = 0.773. (right) Corresponding PRG curves, where the situation has reversed: the solid curve has AUPRG = 0.714 while the dashed curve has a lower AUPRG of 0.687.]
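To run this kind of AUPR-vs-AUPRG comparison on one's own models, here is a rough end-to-end sketch (ours, not the authors' reference implementation; y_true, scores_model_a and scores_model_b are placeholders for your own data). It gain-transforms the PR operating points with recall ≥ π and integrates, a simplification of the construction used in the paper:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

def gain(v, pi):
    return (v - pi) / ((1 - pi) * v)

def auprg_from_scores(y_true, scores):
    """Rough AUPRG: gain-transform PR points with recall >= pi and integrate."""
    pi = float(np.mean(y_true))
    prec, rec, _ = precision_recall_curve(y_true, scores)
    keep = (rec >= pi) & (prec > 0)
    pg, rg = gain(prec[keep], pi), gain(rec[keep], pi)
    order = np.argsort(rg)
    pg, rg = pg[order], rg[order]
    return float(np.sum(0.5 * (pg[1:] + pg[:-1]) * np.diff(rg)))

# With y_true and two score vectors at hand, compare the two area measures:
# print(average_precision_score(y_true, scores_model_a), auprg_from_scores(y_true, scores_model_a))
# print(average_precision_score(y_true, scores_model_b), auprg_from_scores(y_true, scores_model_b))
```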
Concluding remarks
Methodological recommendations
We recommend that practitioners use the F-Gain score instead of the F-score, to make sure baselines are taken into account properly and averaging is done on the appropriate scale. If required, the FGβ score can be converted back to an Fβ score at the end.
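The back-conversion mentioned here follows directly from the definition FGβ = (Fβ − π)/((1 − π)·Fβ); a one-line sketch (ours):

```python
def fg_to_f(fg_beta, pi):
    """Convert an F-Gain score back to an F_beta score (inverse of the gain map)."""
    return pi / (1 - (1 - pi) * fg_beta)

def f_to_fg(f_beta, pi):
    """Gain map: F_beta score to F-Gain score."""
    return (f_beta - pi) / ((1 - pi) * f_beta)

assert abs(fg_to_f(f_to_fg(0.7, 0.25), 0.25) - 0.7) < 1e-12
```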
Closing comments
Acknowledgments
This work was supported by the REFRAME project granted by the European
Coordinated Research on Long-Term Challenges in Information and
Communication Sciences & Technologies ERA-Net (CHIST-ERA), and funded by
the Engineering and Physical Sciences Research Council in the UK under grant
EP/K018728/1. Discussions with Hendrik Blockeel helped to clarify the intuitions
underlying this work.
References I
Kendrick Boyd, Vitor Santos Costa, Jesse Davis, and C. David Page.
Unachievable region in precision-recall space and its effect on empirical
evaluation.
In International Conference on Machine Learning, page 349, 2012.
T. Fawcett.
An introduction to ROC analysis.
Pattern Recognition Letters, 27(8):861–874, 2006.
T. Fawcett and A. Niculescu-Mizil.
PAV and the ROC convex hull.
Machine Learning, 68(1):97–106, July 2007.
References II
P. A. Flach.
The geometry of ROC space: understanding machine learning metrics
through ROC isometrics.
In Machine Learning, Proceedings of the Twentieth International Conference
(ICML 2003), pages 194–201, 2003.
Peter A. Flach.
ROC analysis.
In Claude Sammut and Geoffrey I. Webb, editors, Encyclopedia of Machine
Learning, pages 869–875. Springer US, 2010.
D. J. Hand and R. J. Till.
A simple generalisation of the area under the ROC curve for multiple class
classification problems.
Machine Learning, 45(2):171–186, 2001.
References III
José Hernández-Orallo, Peter Flach, and Cesar Ferri.
A unified view of performance metrics: Translating threshold choice into
expected classification loss.
Journal of Machine Learning Research, 13:2813–2869, 2012.
Oluwasanmi O. Koyejo, Nagarajan Natarajan, Pradeep K. Ravikumar, and
Inderjit S. Dhillon.
Consistent binary classification with generalized performance metrics.
In Advances in Neural Information Processing Systems, pages 2744–2752,
2014.
Zachary C. Lipton, Charles Elkan, and Balakrishnan Naryanaswamy.
Optimal thresholding of classifiers to maximize F1 measure.
In Machine Learning and Knowledge Discovery in Databases, volume 8725
of Lecture Notes in Computer Science, pages 225–239. Springer Berlin
Heidelberg, 2014.