A Review of the F-Measure: Its History, Properties, Criticism, and Alternatives
PETER CHRISTEN, The Australian National University, Australia
DAVID J. HAND, Imperial College London, UK
NISHADI KIRIELLE, The Australian National University, Australia
Methods to classify objects into two or more classes are at the core of various disciplines. When a set of objects
with their true classes is available, a supervised classifier can be trained and employed to decide if, for example,
a new patient has cancer or not. The choice of performance measure is critical in deciding which supervised
method to use in any particular classification problem. Different measures can lead to very different choices, so
the measure should match the objectives. Many performance measures have been developed, and one of them
is the F-measure, the harmonic mean of precision and recall. Originally proposed in information retrieval, the
F-measure has gained increasing interest in the context of classification. However, the rationale underlying
this measure appears weak, and unlike other measures, it does not have a representational meaning. The use
of the harmonic mean also has little theoretical justification. The F-measure also stresses one class, which
seems inappropriate for general classification problems. We provide a history of the F-measure and its use in computational disciplines, describe its properties, and discuss criticism of the F-measure. We conclude with alternatives to the F-measure, and recommendations on how to use it effectively.
CCS Concepts: • Computing methodologies → Supervised learning by classification; • General and
reference → Measurement; Evaluation; • Information systems → Retrieval effectiveness;
Additional Key Words and Phrases: Supervised classification, performance assessment, F1-score, F1-measure,
F*-measure, representational measure, pragmatic measure
1 INTRODUCTION
The central challenge of various computational disciplines is the construction of algorithms that
automatically improve with experience [32]. An important class of such algorithms is concerned
with supervised classification, in which an algorithm is presented with a training set of objects,
each of which has a known descriptive vector of measurements, and each of which has a known
class membership. The aim is to use this training set to create a classification method (also known
as a classifier, classification model, classification algorithm, or classification rule) that can classify
Authors’ addresses: P. Christen and N. Kirielle, School of Computing, The Australian National University, North Rd, Can-
berra ACT 2600, Australia; emails: {peter.christen, nishadi.kirielle}@anu.edu.au; D. J. Hand, Department of Mathematics,
Imperial College, London SW7 2AZ, United Kingdom; email: [email protected].
This work is licensed under a Creative Commons Attribution International 4.0 License.
© 2023 Copyright held by the owner/author(s).
0360-0300/2023/10-ART73 $15.00
https://doi.org/10.1145/3606367
future objects based solely on their descriptive feature vectors. In what follows, for convenience,
we will assume just two classes of objects (binary classification), labelled 0 and 1 (commonly named
the negative and positive class, respectively).
A general high-level strategy for constructing classification methods is to begin by assigning
scores to the objects to be classified. In particular, with two class problems, we might score the
objects by their estimated probability of belonging to class 1, say. Given such classification scores,
the appropriate way to proceed will depend on the problem. If the set of objects to be classified
is known and finite (for example, if we wished to classify a given set of patients as having cancer
or not), then all objects can be ranked and the top x or the top y% are assigned to class 1 (having
cancer). However, if the set of objects is not fully known, or is of arbitrary and unknown size (for
example classifying future patients to have cancer or not), then it is not possible to rank them all
and choose the ones with the highest scores.
In such a case, a classification threshold must be chosen, so that those with scores larger than the
threshold are assigned to class 1. This threshold could be chosen so as to classify a given percentage
to class 1 (based on a known or estimated distribution of scores). Alternatively, it could be chosen
in absolute terms: all objects with estimated class 1 probability greater than a threshold of 0.9 could
be assigned to class 1, for example. Note an important distinction between these two cases: if the
top-scoring y% are to be assigned to class 1, then whether or not a particular object is assigned to
class 1 depends on the scores of other objects. In contrast, if an absolute threshold is used, then
objects can be assigned independently of the scores of other objects.
Central to the challenge of creating classification methods is the ability to evaluate them. This
is needed so that one can decide if a method is good enough for some purpose, and to choose
between competing methods (which possibly have been created using different algorithms, or the
same algorithm but different parameter settings). “Choosing between” methods includes notions
of algorithm selection, parameter estimation, choice of descriptive features, transformations of
features, and so on, as these implicitly select between alternative methods.
Criteria for evaluating classification methods and algorithms are of two kinds: problem-based
and “accuracy”-based. Problem-based measures are specific to particular applications, and include
aspects such as speed of classification (such as for data streams), speed of adaptation and updat-
ing (for example, in spam and fraud detection), ability to handle incomplete data, interpretability
(mandatory in certain legal frameworks), how easy they are for a non-expert to use, how effective
they are in the hands of an expert, and so on.
“Accuracy”-based measures, on the other hand, are concerned with the accuracy with which
objects are assigned to the correct class. In two-class problems, there are two ways in which such
assignments may be in error: class 0 objects may be incorrectly assigned to class 1 (known as
false positives or Type I errors), and class 1 objects may be incorrectly assigned to class 0 (false
negatives or Type II errors). To produce a one-dimensional performance measure, which can be
used to compare classification methods, the extent of these two types of error must be aggregated
into a single value. This can be done in various ways—hence the quotation marks around “accu-
racy” above when describing this class of criteria: accuracy may mean different things. Various
reviews which look at different aspects of classification performance measures have been written,
including [11, 15, 19, 23, 27, 50, 57], and others.
In this article, we review and examine one particular measure in detail: the F-measure, which is
generally calculated as the harmonic mean of precision and recall. The F-measure was originally
proposed in the domain of information retrieval [59] to evaluate the quality of ranked documents
retrieved by a search engine. In this discipline, only one class is of real interest (the relevant docu-
ments; without loss of generality, call it the positive class), and we are concerned with the propor-
tion of that class misclassified and the proportion classified into this class which are, in fact, from
the other class. These two aspects are captured by recall and precision, respectively. In particular,
we are not concerned with the number or proportion correctly classified from the class of no in-
terest. The F-measure is then a way of combining recall and precision, to yield a single number
through which classification methods can be assessed and compared.
The appropriateness of the F-measure is illustrated by considering very imbalanced situa-
tions [33] in which the uninteresting class is very large (such as all individuals who do not have
cancer). In such cases, measures such as error rate will be heavily influenced by a large number of
irrelevant correct classifications to the uninteresting class, and therefore can be poor indicators of
the usefulness of the classification method used [7, 46, 51, 52]. However, for more general classifi-
cation problems, misclassifications (and correct classifications) of both classes are of concern. This
means that the F-measure is an inappropriate measure of performance for these situations, as we
discuss further in Section 4.1.
Apart from the pragmatic benefit of reducing the two measures, recall and precision, to a single
number, there does not seem to be a proper justification for why the F-measure is appropriate for
evaluating general supervised classification methods (where both classes are of interest). In many
publications, no reasons are provided why the F-measure is used for an evaluation of classification
methods, or alternatively rationales such as “because it is a commonly used measure in our re-
search discipline”, or “because it has been used by previous relevant work” are given. Others have
discussed problematic issues with the F-measure earlier [42, 51, 60].
Contributions and outline: In the following section, we trace the use of the F-measure back to
its original development in information retrieval, and its increasing use over time in diverse com-
putational disciplines. In Section 3, we then describe the properties of the F-measure, including its
generalisation, the Fβ-measure. From this discussion, we see that the F-measure has characteristics
that can be seen as conceptual weaknesses. We discuss these and the resulting criticism of the F-
measure in Section 4. In Section 5, we then describe alternatives to the F-measure, and we conclude
our work in Section 6 with a discussion and recommendations on how to use the F-measure in an
appropriate way.
We have previously explored the shortcomings of the F-measure in the context of record link-
age [29], the task of identifying records that refer to the same entities across databases [5, 13]. More
recently, we proposed a variant of the F-measure that overcomes some of its weaknesses [30]. Here,
we go further and provide a broader discussion of the use of the F-measure and its application in
general classification problems. Our work is targeted at a broad audience, and we aim to provide the reader with a deeper understanding of the F-measure. Our objective is also to highlight
that using a certain performance measure simply because it is being commonly used in a research
community might not lead to an appropriate evaluation of classification methods.
truth data in classification). It is then possible to identify from the set of retrieved documents those
that are relevant. The measures of precision and recall are then defined as follows [44]:
— Precision, P, is the fraction of retrieved documents that are relevant.
— Recall, R, is the fraction of relevant documents that are retrieved.
For classification tasks, as we will discuss in Section 3, precision and recall can be defined simi-
larly. A good retrieval system ideally achieves high precision as well as high recall on a wide range
of queries [44].
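To make these definitions concrete, the following short Python sketch (using hypothetical document identifiers, chosen purely for illustration) computes precision and recall for a single query from the sets of retrieved and relevant documents:

# A minimal sketch with hypothetical document identifiers.
retrieved = {"d1", "d2", "d3", "d5", "d8"}        # documents returned for a query
relevant = {"d2", "d3", "d4", "d8", "d9", "d10"}  # documents judged relevant

relevant_retrieved = retrieved & relevant         # relevant documents that were retrieved

precision = len(relevant_retrieved) / len(retrieved)  # fraction of retrieved that are relevant
recall = len(relevant_retrieved) / len(relevant)      # fraction of relevant that are retrieved

print(f"P = {precision:.3f}, R = {recall:.3f}")       # P = 0.600, R = 0.500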
Van Rijsbergen [59], in chapter 6 (which became chapter 7 in the second edition of his book),
extensively discusses the reasoning why precision and recall are useful measures for information
retrieval, and the desire to combine them into a single composite number. He also describes various
previous attempts, some going back to the early 1960s, to develop single number performance mea-
sures for information retrieval. He then develops a measure of retrieval effectiveness, E, defined
as

E = 1 − 1 / (α/P + (1 − α)/R),    (1)
where α = 1/(β² + 1), and β is a factor specifying how much more importance a user gives to recall
over precision. Setting α = 0 (β = ∞) corresponds to giving no importance to precision, setting
α = 1 (β = 0) corresponds to giving no importance to recall, and setting α = 1/2 (β = 1)
corresponds to giving the same importance to precision and recall.
As we discuss in more detail in Section 3, the F-measure, or more precisely its weighted generalisation, Fβ, corresponds to Fβ = 1 − E (with α substituted by β), defined as

Fβ = (β² + 1) P · R / (β² P + R).    (2)

For β = 1 this simplifies to F1 = 2P · R/(P + R), the commonly used default version of the F-measure that is also known as the F1-measure or the F1-score [44].
Van Rijsbergen [59] also motivates the use of the harmonic mean to calculate E, and therefore F ,
through the principle of decreasing marginal relevance, where at some point a user will be unwilling
to sacrifice a unit of precision for an added unit of recall [44, page 173]. We return to this argument
for the use of the harmonic mean in Section 3.
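To illustrate the relationship between E and the weighted F-measure numerically, the following sketch (with arbitrary example values for P, R, and β, chosen purely for illustration) computes E as in Equation (1) and verifies that 1 − E equals Fβ as in Equation (2):

# A minimal sketch verifying F_beta = 1 - E for arbitrary example values.
def effectiveness(p, r, beta):
    # Van Rijsbergen's E, Equation (1), with alpha = 1 / (beta^2 + 1).
    alpha = 1.0 / (beta ** 2 + 1.0)
    return 1.0 - 1.0 / (alpha / p + (1.0 - alpha) / r)

def f_beta(p, r, beta):
    # The weighted F-measure, Equation (2).
    return (beta ** 2 + 1.0) * p * r / (beta ** 2 * p + r)

p, r = 0.75, 0.60
for beta in (0.5, 1.0, 2.0):
    e = effectiveness(p, r, beta)
    print(f"beta={beta}: E={e:.4f}, 1-E={1.0 - e:.4f}, F_beta={f_beta(p, r, beta):.4f}")
# For beta = 1 this reproduces F1 = 2PR / (P + R) = 0.6667 for these example values.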
or document collection) refer to both the MUC-4 work by Chinchor [12] and the book by Van
Rijsbergen [59]. They provide the general formulation from Equation (2), however, besides de-
scribing that the F-measure combines precision and recall, they do not justify further why it is a
suitable measure for coreference resolution.
In the context of classifying text, Lewis [41] (1995) discusses how to optimise systems when data
are time-varying, such as news feeds or electronic mail, with the aim of measuring performance
without the need to continue collecting true class labels. He suggests using the expected value
of the F-measure, assuming the true class labels are binary random variables with probability equal
to the estimated class 1 probability. But he then gives an example showing that no fixed probability
threshold can be used to optimise the F-measure when the score distribution is unknown.
In their highly influential work on sampling for imbalanced classification problems, Chawla
et al. [10] in 2002 interestingly do refer to the book by Van Rijsbergen [59] and describe precision
and recall as measures used in information retrieval. However, they do not mention the F-measure,
and proceed to employ ROC curves [18] for their experimental evaluation. While the F-measure
is calculated for a single specific classification threshold, ROC curves are based on varying such a
threshold. Neither of the books by Macmillan and Creelman [43] and Krzanowski and Hand [40], which both cover the topic of ROC curves in detail, mentions the F-measure.
The 2009 survey by He and Garcia [33] on learning from imbalanced data also discusses the
F-measure, however, without any references. The authors do raise the issue of the F-measure be-
ing sensitive to the class imbalance of a given problem. This is because precision considers both
the positive and negative classes, while recall is only based on the positive class. Branco et al. [7],
in their 2016 survey on predictive modelling of imbalanced data, refer to the book by Van Rijs-
bergen [59] and provide Equation (2) as the definition of the F-measure. They also state that this
measure is more informative than accuracy about the effectiveness of a classifier on predicting cor-
rectly the cases that matter to the user, and that the majority of the articles that use F β for performance
evaluation under imbalanced domains adopt β = 1, which corresponds to giving the same importance
to precision and recall.
Joachims [37] (2005) discusses various performance measures in the context of learning support
vector machine classifiers optimised for a certain measure. He mentions the common use of the
F-measure for binary classification problems in natural language processing applications such as
text classification, and describes how the F-measure is preferable to error rate in imbalanced clas-
sification problems. However, he does not refer to any earlier work on the F-measure, nor does he
discuss its origin in information retrieval. Similarly, Onan et al. [49] (2016) use the F-measure for
their evaluation of keyword extraction methods for text classification, without any discussion why
this measure is suitable for this task. In their 2021 survey on deep learning based text classification
methods, Minaee et al. [48] briefly mention that the F-measure is mainly used for imbalanced clas-
sification problems. However, no references to its origin nor any discussion on why it is a suitable
measure in such situations are provided.
In the context of text and Web data mining, Kosala and Blockeel [39] in their 2000 survey men-
tion precision and recall but not the F-measure, while Han and Kamber [21], in the first edition
(2000) of their influential data mining text book, discuss precision, recall, and the F-measure in the
context of text data mining. They justify the use of the harmonic mean in the F-measure as the
harmonic mean discourages a system that sacrifices one measure for another too drastically (Section
9.4). Hand et al. [31] in their 2001 data mining book also discuss precision and recall in the context
of text retrieval, but they do not mention the F-measure.
In computer vision, the F-measure is commonly used to evaluate image recognition and clas-
sification tasks. Achanta et al. [1] (2009), Epshtein et al. [17] (2010), and Brutzer et al. [8] (2011),
Fig. 1. Google Scholar number of matches (publications) over time for different performance measures (left-
side plot), and percentage changes over time for the different variations of F-measure names over all measures
from the left-side plot (right-side plot).
are three examples of influential publications that employ the F-measure for such tasks without
providing any justifications or references to its origin.
In the context of evaluating diagnostic instruments in medicine, where problems with imbal-
anced class distributions are common, sensitivity and specificity are widely used measures [27].
Only more recently has the F-measure been employed in the health sciences, as data science
based approaches are increasingly being used in this discipline. As an example, Janowczyk and
Madabhushi [36] (2016) review deep learning methods for classifying digital pathology images,
where they use the F-measure for evaluation without any discussion why this is a suitable
measure for this task.
As this discussion has shown, while the F-measure has been used in a variety of computational
disciplines in the context of evaluating classification tasks, there seems to be a common lack of
justification why it is a suitable performance measure for a given problem domain. In the following
we present a bibliographic analysis to better understand how the F-measure has been used over
time in comparison to other performance measures.
1 See: http://scholar.google.com/; our code and data of this analysis are available at: https://github.com/nishadi/f-measure.
time. However, since the late 1990s, the F-measure has seen a stronger increase compared to some
of the other measures. Since 1980, the annual numbers of publications referring to the F-measure have increased by two orders of magnitude, while the numbers referring to precision-recall have increased by over three orders of magnitude. The right-side plot in Figure 1 shows the percentage of matches for the
different F-measure names over the total number of matches for all measures we queried (as listed
in the left-side plot). A clear increase can be seen since the year 2000 and especially in the past ten
years, after an initial drop between the years 1980 and 2000.
To summarise our bibliographic analysis, since its development in information retrieval over
forty years ago, the F-measure has been used in a variety of computational disciplines, with a no-
ticeable increase in the past twenty or so years. This coincides with the increased popularity of
data based disciplines, including data mining, machine learning (especially deep learning), com-
puter vision, and natural language processing, since the year 2000.
The F-measure then combines these two using their harmonic mean, to yield a univariate (single
number) performance measure:
F = 2 / (1/P + 1/R) = 2P · R / (P + R) = 2TP / (2TP + FP + FN).    (3)
In numerical taxonomy, this measure is called the Dice coefficient or the Sørensen-Dice coeffi-
cient [2, 56, 58] and it also goes under other names. In particular, the F-measure as defined above
is commonly known as the F1 -measure (or balanced F-measure), being a particular case of a more
general weighted version [54], defined as
Fα = 1 / (α/P + (1 − α)/R).    (4)
This can be rewritten as Equation (2), with β² = (1 − α)/α, where with α = 1/2 (β = 1), we get the F1-measure, short for Fβ=1 [44]. As we discussed in Section 2.1, the Fβ-measure was derived from the effectiveness measure, E (Equation (1)), as Fβ = 1 − E, developed by Van Rijsbergen [59], who writes that it measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision (page 123; in the second edition of Van Rijsbergen's book, this quote is given on page 133). However, as we show in Section 4.3, the F-
measure can be reformulated as a weighted arithmetic mean. In this reformulation, the weights
assigned to precision and recall in the Fβ-measure depend not only upon β but also on the
actual classification outcomes.
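The following sketch (using hypothetical confusion matrix counts, chosen only for illustration) computes the F-measure via the three equivalent forms in Equation (3), and checks that the weighted versions in Equations (4) and (2) agree when β² = (1 − α)/α:

# A minimal sketch with hypothetical confusion matrix counts.
tp, fp, fn = 80, 30, 20

p = tp / (tp + fp)   # precision
r = tp / (tp + fn)   # recall

f_harmonic = 2.0 / (1.0 / p + 1.0 / r)     # harmonic mean form of Equation (3)
f_pr = 2.0 * p * r / (p + r)               # precision-recall form
f_counts = 2.0 * tp / (2 * tp + fp + fn)   # count-based form
assert abs(f_harmonic - f_pr) < 1e-12 and abs(f_pr - f_counts) < 1e-12

alpha = 0.2                                # weight given to precision in Equation (4)
beta = ((1.0 - alpha) / alpha) ** 0.5      # corresponding beta, with beta^2 = (1 - alpha) / alpha
f_alpha = 1.0 / (alpha / p + (1.0 - alpha) / r)        # Equation (4)
f_b = (beta ** 2 + 1.0) * p * r / (beta ** 2 * p + r)  # Equation (2)
assert abs(f_alpha - f_b) < 1e-12
print(f"P={p:.3f}, R={r:.3f}, F1={f_pr:.3f}, F_alpha={f_alpha:.3f}")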
Precision and recall reflect different aspects of a classification method’s performance, so com-
bining them is natural. Moreover, both are proportions, and both have a representational meaning,
a topic we return to in Section 4.5. Precision can be seen as an empirical estimate of the conditional probability of a correct classification given predicted class 1 (Prob(True = 1 | Pred = 1)), and recall as an empirical estimate of the conditional probability of a correct classification given true class 1 (Prob(Pred = 1 | True = 1)). An average of these, however, has no interpretation as a probability,
and unlike many other performance measures also has no representational meaning [24].
The mean of precision and recall does not correspond to any objective feature of classification
performance; it is not, for example, an empirical estimate of any probability associated with a
classifier method. Formally, and as we discuss further in Section 4.5, the F-measure, as a harmonic
mean, is a pragmatic measurement [22, 24, 34, 47]: it is a useful numerical summary but does
not represent any objective feature of the classifier method being evaluated. This is in contrast
with representational measures which correspond to real objective features: precision and recall
separately are examples, since they correspond to empirical estimates of probabilities of certain
kinds of misclassification.
There is also no straightforward justification for using the harmonic mean to combine precision
and recall. A formal argument is sometimes made that for averaging rates the harmonic mean is
more natural than, say, the arithmetic mean, but this is misleading. One might argue that the
harmonic mean of precision and recall is equivalent to (the reciprocal of) the arithmetic mean of
the number of true class 1 cases per class 1 case correctly classified, and the number of predicted
class 1 cases per class 1 case correctly classified. But this simply drives home the fact that precision
and recall are non-commensurate.
A different argument in favour of the F-measure has been made by Van Rijsbergen [59] us-
ing conjoint measurement. The essence of his argument is first to show that there exist non-
intersecting isoeffectiveness curves in the (P, R)-space (sometimes called indifference curves:
curves showing combinations of P and R which are regarded as equally effective), then to de-
termine the shape of these curves, and hence to decide how to combine P and R to identify which
curve any particular (P, R) pair lies on. In particular, he arrives at the conclusion that the harmonic
mean (weighted if necessary) determines the shapes of the curves. To explore reasonable shapes
for these curves, and noting that P and R are proportions, Van Rijsbergen [59] (pages 122 and 123)
makes the assumption of decreasing marginal effectiveness: the user of the system is willing to
sacrifice one unit of precision for an increase of one unit of recall, but will not sacrifice another unit of
precision for a further unit increase in recall. For P and R values near zero, this leads to isoeffective-
ness curves which are convex towards the origin. Curves based on the harmonic mean of P and R
have this shape.
As we noted above, one way to look at the harmonic mean is that it is the arithmetic mean on the
reciprocal of the original scale. That is, it is the reciprocal of the arithmetic mean of 1/P and 1/R,
as can be seen in Equation (3). But the reciprocal transformation is not the only transformation of
the scale which will produce isoeffectiveness curves of this shape. For example, transforming to
log(P) and log(R) will also yield convex isoeffectiveness curves (and results in the geometric mean
of P and R, which is known as the Fowlkes-Mallows index [20] of classifier performance). In short,
the choice of reciprocal transformation, and hence the harmonic mean, seems arbitrary.
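As a small numerical illustration of these different choices (using an arbitrary example pair of values), the harmonic mean (the F-measure), the geometric mean (the Fowlkes-Mallows index), and the arithmetic mean can be compared directly:

# A minimal sketch comparing three means for an arbitrary example (P, R) pair.
import math

p, r = 0.9, 0.4
harmonic = 2.0 * p * r / (p + r)   # the F-measure
geometric = math.sqrt(p * r)       # the Fowlkes-Mallows index
arithmetic = (p + r) / 2.0
print(f"harmonic={harmonic:.3f}, geometric={geometric:.3f}, arithmetic={arithmetic:.3f}")
# harmonic=0.554, geometric=0.600, arithmetic=0.650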
As typically used in numerical taxonomy [56], the F-measure has more justification. Here, it is
used as a measure of similarity between two objects, counting the agreement between multiple
binary characteristics of the objects. Thus, referring to Figure 2 above, TP is the number of characteristics that are present for both objects, FN is the number of characteristics that are present for object A but absent for B, and so on. Since the number of potential descriptive characteristics of objects can be made as large as you like, the number of characteristics not possessed by either object, that is the count TN, should not be included in the measure. But this interpretation seems to be
irrelevant in the situation when classification methods are being evaluated, as we discuss below.
Fig. 3. Three confusion matrices with different numbers of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN), all yielding the same F-measure result. All three matrices cover 1,000 objects, with 300 in class 1 and 700 in class 0.
(class 0) correctly classified, while in matrix (b) over 98% of them (688 out of 700) are correctly
classified. Note that, however, the classifier that resulted in matrix (a) was able to classify more
positive objects (class 1) correctly compared to the classifier that resulted in matrix (b).
It is, therefore, important to understand that the F-measure is only a suitable measure for clas-
sification problems when the negative (generally the majority) class is not of interest at all for a
given problem or application [51].
4.2 The Same F-Measure can be Obtained for Different Pairs of Precision and Recall
A common aspect of all performance measures that combine the numbers in a confusion matrix
into a single value is that different counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) can result in the same value for a certain measure. This is the rationale behind isoeffectiveness curves [59]. For example, for a given n = TP + FP + FN + TN, any pair of TP and TN values that sum to the same total will lead to the same accuracy result.
Specifically for the F-measure, from the right-hand side of Equation (3), we can see that for any triplets (TPa, FPa, FNa) and (TPb, FPb, FNb) arising from classifiers applied to the same dataset (so that TP + FN = n1 is fixed), any pair (TP, FP) satisfying FP = k · TP − n1 will yield the same F-measure, even though precision and recall may differ between the classifiers. Here k is a constant (specifically, k = 2/F − 1 for a given value of F), and all three matrices in Figure 3 have k = 1.6. This means that classification methods that achieve very different results when evaluated using precision and recall can potentially provide the same F-measure result.
An example can be seen in Figure 3(b) and (c), where the confusion matrix (b) results in
P = 0.942 and R = 0.650, matrix (c) results in P = 0.654 and R = 0.933, while for both these
matrices F = 0.769.
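This effect can be reproduced directly from the counts in Figure 3; in the following sketch, the counts for matrix (a) are inferred from the precision, recall, and class sizes reported in the text, while those for matrices (b) and (c) follow from the values given above:

# A minimal sketch reproducing Figure 3: three confusion matrices with
# different precision and recall values but the same F-measure.
# The counts for matrix (a) are inferred from the values reported in the text.
matrices = {
    "a": (250, 100, 50),    # (TP, FP, FN)
    "b": (195, 12, 105),
    "c": (280, 148, 20),
}

for name, (tp, fp, fn) in matrices.items():
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2.0 * p * r / (p + r)
    print(f"({name}) P={p:.3f}, R={r:.3f}, F={f:.3f}")
# (a) P=0.714, R=0.833, F=0.769
# (b) P=0.942, R=0.650, F=0.769
# (c) P=0.654, R=0.933, F=0.769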
The F-measure should, therefore, never be reported in isolation without also reporting preci-
sion and recall results. In situations where a single measure is evaluated, such as for hyperparam-
eter tuning for automated machine learning [38], using the F-measure can be dangerous because
the performances of classification methods are being compared in a way that potentially can pro-
vide very different outcomes (of course, this is true whenever measures are summarised into a
single number).
4.3 The Weights Assigned to Precision and Recall Depend Not Only on Alpha (or Beta)
As we discussed in Section 3, a generalised version of the F-measure allows assigning weights to
precision and recall using the parameter α, see Equation (4), or equivalently β, see Equation (2).
In an effort to understand the use of the F-measure in the context of record linkage (also known
as entity resolution) [5, 13], the process of identifying records that refer to the same entities within
or across databases, Hand and Christen [29] showed that the harmonic mean representation of the F-measure can be reformulated as a weighted arithmetic mean of precision and recall as F = pR + (1 − p)P, where p = (TP + FN)/(2TP + FP + FN) = P/(R + P). In this weighted arithmetic mean
reformulation, however, the value of the weight p given to recall depends upon the outcome of the
evaluated classification method. When several classification methods are compared, the weight
p assigned to recall can be different if the numbers of false positives and false negatives obtained by
these methods differ. From the example confusion matrices in Figure 3, we can calculate p = 0.462
for matrix (a), p = 0.592 for matrix (b), and p = 0.412 for matrix (c).
As a result, in this weighted arithmetic mean reformulation of the F-measure, the weights as-
signed to precision and recall depend not only upon the values of α or β, but also upon the
actual classification outcomes. We describe this property of the F-measure, including an extension
of the work by Hand and Christen [29] for the generalised Fβ -measure from Equation (2), in more
detail in Appendix A.
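Continuing the Figure 3 example (with the counts for matrix (a) again inferred from the reported values), the following sketch verifies the weighted arithmetic mean reformulation F = pR + (1 − p)P and shows how the weight p assigned to recall differs between the three matrices:

# A minimal sketch of the weighted arithmetic mean reformulation of the
# F-measure, F = p*R + (1 - p)*P with p = (TP + FN) / (2*TP + FP + FN),
# for the three confusion matrices of Figure 3 (matrix (a) counts inferred).
matrices = {
    "a": (250, 100, 50),    # (TP, FP, FN)
    "b": (195, 12, 105),
    "c": (280, 148, 20),
}

for name, (tp, fp, fn) in matrices.items():
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    p_weight = (tp + fn) / (2 * tp + fp + fn)        # weight given to recall
    f_harmonic = 2.0 * precision * recall / (precision + recall)
    f_weighted = p_weight * recall + (1.0 - p_weight) * precision
    assert abs(f_harmonic - f_weighted) < 1e-12
    print(f"({name}) p={p_weight:.3f}, F={f_weighted:.3f}")
# (a) p=0.462, (b) p=0.592, (c) p=0.412, all with F=0.769.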
based on the highest value of F will give the same result as when selecting using F*. It should be noted, however, that the F*-measure is no longer an average of precision and recall, because it holds that F* ≤ min(P, R).
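A small sketch of this relationship, assuming the definition of the F*-measure from [30] as F* = TP/(TP + FP + FN), which is the monotonic transformation F/(2 − F) of the F-measure (the counts below are those of matrix (b) from Figure 3):

# A minimal sketch, assuming the definition of F* from [30]:
# F* = TP / (TP + FP + FN) = F / (2 - F).
tp, fp, fn = 195, 12, 105   # matrix (b) from Figure 3

p = tp / (tp + fp)
r = tp / (tp + fn)
f = 2.0 * tp / (2 * tp + fp + fn)

f_star = tp / (tp + fp + fn)
assert abs(f_star - f / (2.0 - f)) < 1e-12   # monotonic transformation of F
assert f_star <= min(p, r)                   # F* is not an average of P and R
print(f"F={f:.3f}, F*={f_star:.3f}, min(P, R)={min(p, r):.3f}")
# F=0.769, F*=0.625, min(P, R)=0.650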
combinations of the variables which were almost orthogonal: the “best” classification methods
obtained using these two measures could hardly have been more different. Things are complicated
yet further by the need to choose a threshold to yield a confusion matrix or misclassification table.
This has led to measures such as the AUC-PR [14], the area under the ROC curve [18, 40] (with
its known conceptual weakness [28]) and the H-measure [26] (which we discuss in Appendix C).
All these measures average over a distribution of possible thresholds. And yet further issues are
described by Hand [25].
The implication of all of this is that just as much thought should be given to how classifier per-
formance is to be measured as to the choice of what classification method(s) is/are to be employed
for a given classification problem. Software tools which provide a wide variety of classification
methods and their easy use are now readily available, but far less emphasis has been placed on the
choice of measure of classification performance. And yet a classification method which appears to
be good under one measure may be poor under another. The critical issue is to match the measure
to the objectives of a given classification problem.
The F-measure, as widely used, is based on an ad hoc notion of combining two aspects of clas-
sifier performance, precision and recall, using the harmonic mean. This results in a pragmatic
measure that has a poor theoretical base: it seems not to correspond to any fundamental aspect of
classifier performance. With the aim of helping researchers improve the evaluation of classification methods, we conclude our work with a set of recommendations on how to use the F-measure in an appropriate way:
— The first aspect to consider is whether the F-measure is really a suitable performance measure for a given classification problem. Specifically, is there clear evidence that incorrect classifications
of one of the classes are irrelevant to the problem? Only if this question can be answered
affirmatively should the F-measure be considered. Otherwise, a performance measure that
considers both classes should be employed.
— As we discussed in Section 4.2, different pairs of precision and recall values can yield the
same F-measure result. It is, therefore, important to not only report the F-measure but also
precision and recall when evaluating classification methods [42]. Only when assessing and
comparing the values of all three measures can a valid picture of the comparative perfor-
mance of classification methods be made.
— If a researcher prefers an interpretable measure, then the F ∗ -measure [30] discussed in
Section 5.1, a monotonic transformation of the F-measure, can be used.
— If a researcher knows what importance they want to assign to precision and recall, the general weighted Fβ or Fα versions from Equations (2) or (4), respectively, can be used. For the weighted arithmetic mean reformulation of the F-measure we discussed in Section 4.3 (and also Appendix A), we recommend specifying the weight p assigned to recall in Equations (7) and (11), and correspondingly setting the individual classification thresholds, t, for each classification method being compared, such that the same required number, TP + FP, of objects are classified to be in class 1. Alternatively, as we discuss in Appendix A.2, a researcher can set the weight for recall as w and for precision as (1 − w), and instead of the F-measure explicitly calculate the weighted arithmetic mean of precision and recall, wR + (1 − w)P, for all classification methods being compared.
— If it is not possible to specify a specific classification threshold, then we recommend using
a measure such as the H-measure [26, 28], which averages the performance of classification
methods over a distribution of threshold values, t, as we discuss in Appendix C. Alternatively,
precision-recall curves should be provided which illustrate the performance of a classifica-
tion method over a range of values for the threshold t.
The critical issue for any classification problem is to decide what aspects of classification
performance matter, and to then select an evaluation measure which reflects those aspects.
APPENDICES
A AN ARITHMETIC MEAN FORMULATION OF THE F-MEASURE
As we discussed in Section 3, given that the justification for the harmonic mean interpretation
of the F-measure seems weak, perhaps a more intuitive strategy would be to use the arithmetic
mean. Fortunately, it is possible to interpret the F-measure in this way; however, doing so reveals
a property of the measure which might be regarded as a weakness.
This appendix builds on our earlier work [29], where we explored the weaknesses of the F-
measure in the context of record linkage (also known as entity resolution) [5, 13]. Here, we extend
this work to general classification problems and also cover the more general weighted version of
the F β measure.
Fig. 4. Weights, p, assigned to recall for balanced (ci = 1, top row) and imbalanced (ci = 4, bottom row)
classes for three different classification outcomes assuming 100 data objects (with 50 in class 1 and 50 in
class 0 for the balanced classes in the top row, and 20 in class 1 and 80 in class 0 for the imbalanced classes
in the bottom row), for β = 1, β = 1/2, and β = 2.
Generally, β is seen as a way to assign weights (importance) to recall and precision [54, 59],
where values β < 1 are seen to give higher weight to precision and β > 1 give higher weight to
recall [44]. The reformulation in Equations (8) and (9), however, shows that the weights assigned
to precision and recall not only depend on β but also upon the outcome of a classification method
with regard to the values of FP, FN, and TP, when given the arithmetic mean interpretation.
From Equation (7) we can see that if FN = FP, then p = 1/2, because the equation can be rewritten as:

p = (TP + FN) / ((TP + FP) + (TP + FN)).    (10)
From this one can immediately see that p < 1/2 if FP > FN and p > 1/2 if FP < FN. As a result, for the special (and most commonly used) case β = 1, the F1-measure or F1-score, with the arithmetic mean formulation of the F-measure, the weights given to recall and precision directly depend upon the ratio of the number of false positives (FP) to the number of false negatives (FN).
If there are more false positives than false negatives, then more weight is given to precision than
recall (p < 1/2); conversely if there are more false negatives than false positives, then more weight
is given to recall than precision (p > 1/2). Note that this holds independently of any class imbalance and is only affected by the values of FP and FN in the confusion matrix.
In Figure 4, we illustrate this issue with several example confusion matrices for both a balanced
and an imbalanced class distribution, ci, for β = 1 (p), β = 2 (pβ=2), and β = 1/2 (pβ=1/2), for three
situations with different numbers of false positives and false negatives. As can clearly be seen,
depending upon the outcomes of a classification method, different weights are assigned to recall,
and therefore to precision, when the arithmetic mean formulation of the F-measure is being used.
The values of p obtained on the datasets and classifiers we used in our experimental evaluation in
Appendix B are shown in Figure 6 for varying classification thresholds.
Below we explore the consequences of this problem of different weights being assigned to preci-
sion and recall when the arithmetic mean formulation of the F-measure is used, and how to resolve
it. First, we examine some other properties of the F-measure.
It is straightforward to see that the harmonic mean of two positive values lies closer to the smaller value than to the larger (if the two values are different, the harmonic mean is less than the arithmetic mean). This means that the smaller of precision and recall dominates the F-measure; for example, with P = 0.9 and R = 0.1 the harmonic mean is 0.18 while the arithmetic mean is 0.5. An
alternative, even more extreme, measure of this kind would be min(P, R). The value of P can vary between 0 (if TP = 0 and FP ≠ 0) and 1 (if FP = 0 while TP ≠ 0). However, when R = 1, then P lies between the proportion of objects belonging to class 1 and 1. The value of R can also vary between 0 and 1, taking the value 0 if TP = 0 and the value 1 if FN = 0 (assuming TP + FN ≠ 0, of course).
By the property of the harmonic mean noted above, this means that if R = 1 then the value of F is close to that of precision, P. The F-measure is thus asymmetric with regard to the extreme values of P and R, as we have also discussed in Section 4.4.
Since TP + FN is fixed (at the number of class 1 objects in the test set), we can ensure equality of the p values by choosing individual thresholds for all classification methods so that all methods being compared classify the same number of objects, TP + FP, to class 1. That is, if we arrange things
so that each classification method assigns the same number of objects to class 1, R (and hence also P)
is being given the same weight p by each classification method, so that the F-measure is a legitimate
measure of performance. Note, however, that this is more restricted than the situation described in
Appendix A.2. There we assumed that different classification methods generally assign different
numbers of test objects to class 1, with the weight w being chosen independently of this number.
Assigning the same number of test objects to class 1 ensures that all classification methods
are using the same p value. As discussed previously, the choice of an appropriate p should be
made on the basis of the problem and aims, but this can be difficult. If, instead, we wish to adopt
the convention of weighting recall and precision equally (p = 1/2), then the relationship p/(1 − p) = (TP + FN)/(TP + FP) shows that we must choose the classification thresholds so that each classification method being compared assigns TP + FN = TP + FP objects to class 1: that is, all methods must assign to class 1 the same number of objects as there are in class 1.
A.3.2 Calibrated Scores. Classification performance measures based on the values of TP, FN, FP, and TN are invariant to monotonic increasing transformations of the score continuum, since
these counts are obtained by comparing score values, with a particular score value chosen as the
threshold t. In particular, this means that we can transform the score so that it is calibrated. That
is, we can transform the score so that it is the probability of a data object to belong to class 1.
Technically, of course, we should speak of estimated probabilities and also deal with potential
overfitting issues arising from using limited data to calibrate [32], as well as estimate parameters
and evaluate performance. Here, however, we are concerned with conceptual issues of classifier
performance rather than practicalities, so we shall ignore these complications. Calibration means
that classification has the attractive interpretation that an object is assigned to class 1 if its (esti-
mated) probability of belonging to class 1 is greater than the threshold t. This leads to two possible
strategies:
(1) We might choose a probability threshold (the calibrated score being interpretable as a proba-
bility) t, which is the same for all classification methods being compared. This is conceptually
attractive: we would hardly want to say that we will classify an object to class 1 if its esti-
mated class 1 probability is greater than 0.9 when using a neural network, but greater than
0.7 when using a random forest.
(2) We might choose a weight w that is the same for all classification methods being compared.
This is also conceptually attractive. We would hardly want to say that we will weight recall as
0.7 (and precision as 0.3) when using a neural network, but weight recall as 0.4 (and precision
as 0.6) when using a random forest. As noted above, the weights represent an aspect of the
problem, not the classification method.
These two strategies can be simultaneously applied if the weights applied to precision and recall
are chosen independently of the threshold, as described in Appendix A.2. That is, a probability
threshold t, the same for all classification methods being compared, is chosen, and this yields R
and P values for each of the methods. These are then weighted using w, again the same for all
methods, to yield the overall measure wR + (1 − w )P.
However, from the weighted arithmetic mean perspective, using the F-measure implies a funda-
mental contradiction between the two strategies (1) and (2). This is easily seen by supposing that
we used strategy (1) and chose a probability threshold t so that all objects with estimated class 1
probability greater than t are assigned to class 1. Now, if the classification methods do not have
identical performance, then, in general, they will produce different values of TP, FN, FP, and TN
for any given threshold t. This means that the weights p and 1 − p implied by the F-measure will be different for different methods.

Table 1. Details of the three Datasets Used in the Empirical Evaluation Discussed in Appendix B
The converse also applies. If we chose the thresholds so that p was the same for all classification
methods (corresponding to strategy (2) of using F-measure values where all methods give the
same weight to recall), it is likely that this would correspond to different estimated probabilities
of belonging to class 1; that is, different thresholds t.
A practical comment is worth making. Except for certain pathological situations, as the threshold t increases, TP + FP will decrease (while TP + FN remains constant). This means that if all test object scores are different it will usually be possible to choose a threshold t such that TP + FN = TP + FP. In other situations, it might not be possible (for example, if groups of test objects have the same score), but then a simple averaging can be adopted. To do this, find the threshold t1 for which TP + FP is closest to TP + FN while TP + FP > TP + FN, and also the threshold t2 for which TP + FP is closest to TP + FN while TP + FP < TP + FN. Calculate the weights p1 and p2 for these two thresholds, and also the corresponding precision and recall values, P1, P2, R1, and R2, respectively. Then, calculate a weighted average of the recall values for these two thresholds, and similarly of the precision values, as R = wR1 + (1 − w)R2 and P = wP1 + (1 − w)P2, respectively. The weight, w, is chosen such that wp1 + (1 − w)p2 = 1/2, that is, w = (1/2 − p2)/(p1 − p2).
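A sketch of this averaging step, using hypothetical example scores and true class labels (and hypothetical helper functions, not from any library), could look as follows; it is a sketch under these assumptions, not a definitive implementation:

# A minimal sketch of the averaging described above, with hypothetical scores
# and labels; objects with a score greater than the threshold go to class 1.
def counts(scores, labels, t):
    tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 1)
    return tp, fp, fn

def p_r_weight(tp, fp, fn):
    # Precision, recall, and the recall weight p from Equation (7).
    return tp / (tp + fp), tp / (tp + fn), (tp + fn) / (2 * tp + fp + fn)

scores = [0.9, 0.7, 0.7, 0.7, 0.7, 0.5, 0.3, 0.2]   # tied scores, so TP + FP = TP + FN is unreachable
labels = [1, 1, 0, 1, 0, 1, 0, 0]
n1 = sum(labels)                                    # number of class 1 objects (TP + FN)

thresholds = sorted(set(scores))
# t1: TP + FP closest to n1 from above; t2: TP + FP closest to n1 from below.
t1 = min((t for t in thresholds if sum(counts(scores, labels, t)[:2]) > n1),
         key=lambda t: sum(counts(scores, labels, t)[:2]))
t2 = max((t for t in thresholds if sum(counts(scores, labels, t)[:2]) < n1),
         key=lambda t: sum(counts(scores, labels, t)[:2]))

prec1, rec1, p1 = p_r_weight(*counts(scores, labels, t1))
prec2, rec2, p2 = p_r_weight(*counts(scores, labels, t2))
w = (0.5 - p2) / (p1 - p2)                          # so that w*p1 + (1 - w)*p2 = 1/2
recall_avg = w * rec1 + (1.0 - w) * rec2
precision_avg = w * prec1 + (1.0 - w) * prec2
print(f"t1={t1}, t2={t2}, w={w:.3f}, P={precision_avg:.3f}, R={recall_avg:.3f}")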
We have shown that the F-measure can be reinterpreted as a weighted arithmetic mean, which
does have a straightforward intuitive interpretation. In particular, it allows a researcher to specify
the relative importance they accord to precision and recall. When used in this way, with a user
specified choice of the weight p assigned to recall, the F-measure makes sense. Alternatively, it
might be better to abandon the F-measure altogether, and go directly for a weighted combination of
precision and recall, with weights reflecting the researcher’s attitudes to their relative importance,
with equal weights as an important special case.
B REAL-WORLD EXAMPLES
To illustrate the properties of the F-measure we discussed in Section 4 and Appendix A, in this
appendix we show actual classification results obtained when applying different classification
methods on different datasets, where we change the classification threshold from t = 0 to t = 1.
Specifically, we selected three datasets from the UCI Machine Learning Repository,3 as detailed
in Table 1. We applied four classification methods (decision tree, logistic regression, random for-
est, and support vector machine) as implemented in the Python sklearn machine learning library.4
We randomly split the datasets into training and testing sets, in proportions of 80% / 20%, and
estimated the parameters for the four classification methods by minimising misclassification rate
using a grid search over a variety of settings for the relevant parameters. Note that our experiments are aimed at illustrating issues with the F-measure; we are not focused on obtaining the highest classifier performance.
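A minimal sketch of this pipeline, shown for a single classifier (a random forest) on sklearn's copy of the Wisconsin Breast Cancer data; the parameter grid, random seeds, and evaluated thresholds are illustrative assumptions, not the exact settings used in our experiments:

# A minimal sketch of the experimental setup described above; the parameter
# grid and thresholds are illustrative assumptions, not the settings we used.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# sklearn's copy of the UCI dataset; its class encoding may differ from Table 1.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)     # 80% / 20% split

# Grid search minimising misclassification rate (i.e., maximising accuracy).
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5, 10]},
    scoring="accuracy")
grid.fit(X_train, y_train)

# Vary the classification threshold t on the estimated class 1 probabilities.
scores = grid.predict_proba(X_test)[:, 1]
for t in (0.25, 0.5, 0.75):
    pred = (scores > t).astype(int)
    print(f"t={t}: P={precision_score(y_test, pred):.3f}, "
          f"R={recall_score(y_test, pred):.3f}, F={f1_score(y_test, pred):.3f}")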
3 See: https://archive.ics.uci.edu/ml/index.php.
4 See: https://scikit-learn.org.
Fig. 5. Four different classification methods applied on three datasets from the UCI Machine Learning repos-
itory, showing F-measure results for a classification threshold varied from 0 ≤ t ≤ 1, as described in
Appendix B.
Fig. 6. Weights, p, assigned to recall for a classification threshold varied from 0 ≤ t ≤ 1 for the four classifi-
cation methods and three datasets described in Appendix B.
In Figure 5, we illustrate the asymmetric behaviour of the F-measure, calculated using Equa-
tion (3), with varying classification thresholds, t. As we discussed in Section 4.4, for all three
datasets the discussed asymmetric behaviour is visible, even for the Wisconsin Breast Cancer data
set which seems to be easy to classify. For all classification methods used (and on all datasets),
with t = 1 the values for precision (P), recall (R), and, therefore, the F-measure, are 0, while the
corresponding values for t = 0 are shown in Table 1.
As can also be seen in Figure 5, for the two more difficult to classify data sets (German Credit
and PIMA Diabetes) the best F-measure results are not obtained with the t = 0.5 classification
threshold that is often used as the default threshold, but rather with much lower thresholds [42] of
around t = 0.25 in our examples. With such thresholds, recall is much higher than precision (in the
range of 0.683 to 0.867 for German Credit and 0.736 to 0.906 PIMA Diabetes), which might not be a
desirable classification outcome. While the precise shapes of F-measure curves depend both upon
the classification method employed as well as the distribution of the scores of the classified objects,
as these examples illustrate, the asymmetric behaviour of the F-measure needs to be considered
when it is used to evaluate classification methods.
Figure 6 shows the values of the weight p, as calculated using Equation (7), for the four classifica-
tion methods and varying thresholds shown in Figure 5. As can be seen, the different classification
methods lead to different values of p as the threshold t is changing. This holds even for different
classification methods applied on the same dataset. With low thresholds the value of p is below
0.5, which means less weight is assigned to recall compared to precision. For higher thresholds the
value of p increases to above 0.5, resulting in higher weights being assigned to recall compared to precision. This general pattern means that low classification thresholds give more weight to precision (if FP > FN) while high classification thresholds give more weight to recall (if FP < FN), as we discussed in Appendix A.1.

Fig. 7. Precision results for a varying classification threshold for the three datasets and four classification methods.

Fig. 8. Precision-recall curves for a varying classification threshold for the three datasets and four classification methods.
In Figure 7, we show precision results for varying classification thresholds, as we discussed in
Section 4.4. As can clearly be seen, precision does not necessarily increase or decrease monoton-
ically as the classification threshold changes. These plots illustrate the actual distribution of clas-
sification scores of objects in the positive and negative classes as obtained with the four different
classifiers.
Finally, Figure 8 shows precision-recall curves for varying classification thresholds, as we dis-
cussed in Section 5.2. As is clearly visible, the resulting areas under the precision-recall curves (AUC-PR) are much larger for the Wisconsin Breast Cancer dataset, which seems to be easier to classify accurately than the other two datasets. What can also be seen is that for different ranges of the classification threshold, one classification method can show higher performance than the others, in that its precision-recall curve lies above those of the other methods.
C THE H-MEASURE
Another widely used measure of classifier performance is the Area Under the Receiver Operating
Characteristic Curve (AUC). Instead of being based on a single choice of classification threshold,
the AUC is equivalent to an average misclassification loss, where the average is taken over a dis-
tribution of classification thresholds. Since any choice of classification threshold is equivalent to a
choice of relative severities of the two kinds of misclassification (class 0 objects to class 1, or class 1
objects to class 0), the AUC is equivalent to averaging the misclassification loss over a distribution
of the ratio of severities of the two kinds of misclassification.
However, it has been shown that for the AUC, this distribution depends on the score distribution
produced by the classifier [26, 28]. This means it is fundamentally incoherent: it is equivalent to
using a stretchy ruler when measuring length, with the measuring instrument differing between
the classifiers being assessed. Put another way, the AUC implies that researchers using different
classifiers have different distributions of severity-ratios for the two kinds of misclassification. This
is nonsensical: misclassifying a class 0 object as a class 1 object cannot be twice as serious as the
reverse when using a random forest but three times as serious when using deep learning. The
relative seriousness depends on the problem and the aims, not the classifier.
The H-measure overcomes this fundamental incoherence by specifying a fixed distribution for
the severity-ratio. In its most common form, a beta distribution is used [28].
Another way of describing the problem with the AUC is that it is equivalent to using different
probability scoring functions when evaluating the accuracy of the estimates of the probability of
belonging to class 1 produced by different classifiers. Buja et al. [9] show that the H-measure, using
the beta distribution, overcomes this.
REFERENCES
[1] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Susstrunk. 2009. Frequency-tuned salient re-
gion detection. In Conference on Computer Vision and Pattern Recognition. IEEE, Miami, 1597–1604.
[2] Brian Austin and Rita R. Colwell. 1977. Evaluation of some coefficients for use in numerical taxonomy of microorgan-
isms. International Journal of Systematic and Evolutionary Microbiology 27, 3 (1977), 204–210.
[3] Bart Baesens, Veronique Van Vlasselaer, and Wouter Verbeke. 2015. Fraud Analytics using Descriptive, Predictive, and
Social Network Techniques: A Guide to Data Science for Fraud Detection. John Wiley and Sons, Hoboken, New Jersey.
[4] Thomas Benton. 2001. Theoretical and empirical models. Ph. D. Dissertation. Department of Mathematics, Imperial
College, London.
[5] Olivier Binette and Rebecca C. Steorts. 2022. (Almost) all of entity resolution. Science Advances 8, 12 (2022), eabi8021.
[6] Kendrick Boyd, Kevin H. Eng, and C. David Page. 2013. Area under the precision-recall curve: Point estimates and con-
fidence intervals. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer,
451–466.
[7] Paula Branco, Luís Torgo, and Rita P. Ribeiro. 2016. A survey of predictive modeling on imbalanced domains. Com-
puting Surveys 49, 2 (2016), 1–50.
[8] Sebastian Brutzer, Benjamin Höferlin, and Gunther Heidemann. 2011. Evaluation of background subtraction tech-
niques for video surveillance. In Conference on Computer Vision and Pattern Recognition. IEEE, 1937–1944.
[9] Andreas Buja, Werner Stuetzle, and Yi Shen. 2005. Loss Functions for Binary Class Probability Estimation and Classifi-
cation: Structure and Applications. Technical Report. The Wharton School, University of Pennsylvania.
[10] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority
over-sampling technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
[11] Davide Chicco and Giuseppe Jurman. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1
score and accuracy in binary classification evaluation. BMC Genomics 21, 1 (2020), 6.
[12] Nancy Chinchor. 1992. MUC-4 evaluation metrics. In Fourth Message Understanding Conference. ACL, 22–29.
[13] Peter Christen, Thilina Ranbaduge, and Rainer Schnell. 2020. Linking Sensitive Data. Springer.
[14] Jesse Davis and Mark Goadrich. 2006. The relationship between precision-recall and ROC curves. In International
Conference on Machine Learning. ACM, 233–240.
[15] Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research
7, Jan (2006), 1–30.
[16] Robert P. W. Duin. 1996. A note on comparing classifiers. Pattern Recognition Letters 17, 5 (1996), 529–536.
[17] Boris Epshtein, Eyal Ofek, and Yonatan Wexler. 2010. Detecting text in natural scenes with stroke width transform.
In Conference on Computer Vision and Pattern Recognition. IEEE, 2963–2970.
[18] Tom Fawcett. 2006. An introduction to ROC analysis. Pattern Recognition Letters 27, 8 (2006), 861–874.
[19] César Ferri, José Hernández-Orallo, and R. Modroiu. 2009. An experimental comparison of performance measures for
classification. Pattern Recognition Letters 30, 1 (2009), 27–38.
[20] Edward B. Fowlkes and Colin L. Mallows. 1983. A method for comparing two hierarchical clusterings. Journal of the
American Statistical Association 78, 383 (1983), 553–569.
[21] Jiawei Han and Micheline Kamber. 2000. Data Mining: Concepts and Techniques (1st ed.). Morgan Kaufmann, San
Francisco.
[22] David J. Hand. 1996. Statistics and the theory of measurement. Journal of the Royal Statistical Society: Series A 159,
3 (1996), 445–473.
[23] David J. Hand. 1997. Construction and Assessment of Classification Rules. John Wiley and Sons, Chichester.
[24] David J. Hand. 2004. Measurement Theory and Practice: The World Through Quantification. Edward Arnold, London.
[25] David J. Hand. 2006. Classifier technology and the illusion of progress. Statistical Science 21, 1 (2006), 1–14.
[26] David J. Hand. 2009. Measuring classifier performance: A coherent alternative to the area under the ROC curve. Ma-
chine Learning 77, 1 (2009), 103–123.
[27] David J. Hand. 2012. Assessing the performance of classification methods. International Statistical Review 80, 3 (2012),
400–414.
[28] David J. Hand and Christoforos Anagnostopoulos. 2022. Notes on the H-measure of classifier performance. Advances
in Data Analysis and Classification 17, 1 (2022), 109–124.
[29] David J. Hand and Peter Christen. 2018. A note on using the F-measure for evaluating record linkage algorithms.
Statistics and Computing 28, 3 (2018), 539–547.
[30] David J. Hand, Peter Christen, and Nishadi Kirielle. 2021. F*: An interpretable transformation of the F-measure. Ma-
chine Learning 110, 3 (2021), 451–456.
[31] David J. Hand, Heikki Mannila, and Padhraic Smyth. 2001. Principles of Data Mining. MIT Press.
[32] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Infer-
ence, and Prediction (2nd ed.). Springer, New York.
[33] Haibo He and Edwardo A. Garcia. 2009. Learning from imbalanced data. Transactions on Knowledge and Data Engi-
neering 21, 9 (2009), 1263–1284.
[34] David Torres Irribarra. 2021. A Pragmatic Perspective of Measurement. Springer.
[35] Paul Jaccard. 1908. Nouvelles recherches sur la distribution florale. Bulletin de la Société Vaudoise des Sciences Naturelles
44, 163 (1908), 223–270.
[36] Andrew Janowczyk and Anant Madabhushi. 2016. Deep learning for digital pathology image analysis: A comprehen-
sive tutorial with selected use cases. Journal of Pathology Informatics 7, 1 (2016), 29.
[37] Thorsten Joachims. 2005. A support vector method for multivariate performance measures. In International Conference
on Machine Learning. ACM, 377–384.
[38] Shubhra Kanti Karmaker, Md Mahadi Hassan, Micah J. Smith, Lei Xu, Chengxiang Zhai, and Kalyan Veeramachaneni.
2021. AutoML to date and beyond: Challenges and opportunities. Computing Surveys 54, 8 (2021), 1–36.
[39] Raymond Kosala and Hendrik Blockeel. 2000. Web mining research: A survey. ACM SIGKDD Explorations Newsletter
2, 1 (2000), 1–15.
[40] Wojtek J. Krzanowski and David J. Hand. 2009. ROC Curves for Continuous Data. CRC Press, New York.
[41] David D. Lewis. 1995. Evaluating and optimizing autonomous text classification systems. In Conference on Research
and Development in Information Retrieval. ACM, 246–254.
[42] Zachary C. Lipton, Charles Elkan, and Balakrishnan Naryanaswamy. 2014. Optimal thresholding of classifiers to max-
imize F1 measure. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in
Databases. Springer, 225–239.
[43] Neil A. Macmillan and C. Douglas Creelman. 2005. Detection Theory: A User’s Guide (2nd ed.). Lawrence Erlbaum
Associates, New York.
[44] Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cam-
bridge University Press, Cambridge.
[45] Joseph F. McCarthy and Wendy G. Lehnert. 1995. Using decision trees for coreference resolution. In International Joint
Conference on Artificial Intelligence. AAAI, 1050–1055.
[46] Giovanna Menardi and Nicola Torelli. 2014. Training and assessing classification rules with imbalanced data. Data
Mining and Knowledge Discovery 28, 1 (2014), 92–122.
[47] Ave Mets. 2019. A philosophical critique of the distinction of representational and pragmatic measurements on the
example of the periodic system of chemical elements. Foundations of Science 24, 1 (2019), 73–93.
[48] Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2021. Deep
learning–based text classification: A comprehensive review. Computing Surveys 54, 3 (2021), 1–40.
[49] Aytuğ Onan, Serdar Korukoğlu, and Hasan Bulut. 2016. Ensemble of keyword extraction methods and classifiers in
text classification. Expert Systems with Applications 57 (2016), 232–247.
[50] David M. W. Powers. 2011. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and
correlation. International Journal of Machine Learning Technology 2, 1 (2011), 37–63.
[51] David M. W. Powers. 2019. What the F-measure doesn't measure: Features, flaws, fallacies and fixes. https://arxiv.org/
abs/1503.06410v2
[52] Troy Raeder, George Forman, and Nitesh V. Chawla. 2012. Learning from imbalanced data: Evaluation matters. In
Data Mining: Foundations and Intelligent Paradigms. Springer, Berlin, 315–331.
[53] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. 2019. Generalized
intersection over union: A metric and a loss for bounding box regression. In Conference on Computer Vision and Pattern
Recognition. IEEE, 658–666.
[54] Gerard Salton and Michael J. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill, Singapore.
[55] Yutaka Sasaki. 2007. The truth of the F-measure. University of Manchester, MIB-School of Computer Science.
[56] Robert R. Sokal and Peter H. A. Sneath. 1963. Numerical Taxonomy. W.H. Freeman and Co., San Francisco.
[57] Marina Sokolova and Guy Lapalme. 2009. A systematic analysis of performance measures for classification tasks.
Information Processing and Management 45, 4 (2009), 427–437.
[58] Thorvald Julius Sørensen. 1948. A Method of Establishing Groups of Equal Amplitude in Plant Sociology based on Sim-
ilarity of Species Content and Its Application to Analyses of the Vegetation on Danish Commons. I Kommission Hos E.
Munksgaard, Copenhagen.
[59] Cornelius J. Van Rijsbergen. 1979. Information Retrieval. Butterworth and Co., London.
[60] Adam Yedidia. 2016. Against the F-score. Retrieved from https://adamyedidia.files.wordpress.com/2014/11/f_score.pdf.
Blogpost, accessed 12 April 2023.