Precision and Recall
Suppose a computer program for recognizing dogs in photographs identifies eight dogs
in a picture containing 12 dogs and some cats. Of the eight dogs identified, five actually
are dogs (true positives), while the rest are cats (false positives). The program's
precision is 5/8 while its recall is 5/12. When a search engine returns 30 pages only 20
of which were relevant while failing to return 40 additional relevant pages, its precision
is 20/30 = 2/3 while its recall is 20/60 = 1/3. So, in this case, precision is "how useful
the search results are", and recall is "how complete the results are".
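The arithmetic in the two examples above can be checked with a short sketch. The helper name `precision_recall` and its argument names are illustrative choices, not part of any library; exact fractions are used so the results match the figures in the text.

```python
from fractions import Fraction

def precision_recall(true_positives, retrieved, relevant):
    """Return (precision, recall) as exact fractions.

    true_positives: items that were selected and are actually relevant
    retrieved:      all items the system selected
    relevant:       all items that are actually relevant
    """
    precision = Fraction(true_positives, retrieved)
    recall = Fraction(true_positives, relevant)
    return precision, recall

# Dog recognizer: 8 animals identified as dogs, 5 correct, 12 dogs in the picture.
dog_precision, dog_recall = precision_recall(5, 8, 12)

# Search engine: 30 pages returned, 20 relevant, 40 relevant pages missed.
search_precision, search_recall = precision_recall(20, 30, 20 + 40)

print(dog_precision, dog_recall)        # 5/8 5/12
print(search_precision, search_recall)  # 2/3 1/3
```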
In statistics, if the null hypothesis is that all items are irrelevant (where the hypothesis
is accepted or rejected based on the number selected compared with the sample size),
the absence of type I and type II errors corresponds, respectively, to maximum precision (no
false positives) and maximum recall (no false negatives). The above pattern recognition
example contained 8 − 5 = 3 type I errors and 12 − 5 = 7 type II errors. Precision can be
seen as a measure of exactness or quality, whereas recall is a measure of completeness
or quantity.
In simple terms, high precision means that an algorithm returned substantially more
relevant results than irrelevant ones, while high recall means that an algorithm
returned most of the relevant results.
Contents
1 Introduction
2 Definition (information retrieval context)
2.1 Precision
2.2 Recall
3 Definition (classification context)
4 Probabilistic interpretation
5 F-measure
6 Limitations as goals
7 See also
8 References
9 External links
Introduction
In an information retrieval scenario, the instances are documents and the task is to return a set of relevant documents given a search term; or
equivalently, to assign each document to one of two categories, "relevant" and "not relevant". In this case, the "relevant" documents are simply those that
belong to the "relevant" category. Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing
relevant documents, while precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents
retrieved by that search.
In a classification task, the precision for a class is the number of true positives (i.e. the number of items correctly labeled as belonging to the positive
class) divided by the total number of elements labeled as belonging to the positive class (i.e. the sum of true positives and false positives, which are
items incorrectly labeled as belonging to the class). Recall in this context is defined as the number of true positives divided by the total number of
elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labeled as belonging
to the positive class but should have been).
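The classification definitions above can be sketched as a small counting routine. The function names, the toy label lists, and the convention of returning 0.0 when a denominator is zero are all assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # correctly labeled as positive
        elif predicted == positive:
            fp += 1   # labeled positive but actually negative
        elif actual == positive:
            fn += 1   # should have been labeled positive but was not
        else:
            tn += 1   # correctly labeled as negative
    return tp, fp, fn, tn

def precision_recall_from_labels(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical ground truth and predictions for eight items.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]

counts = confusion_counts(y_true, y_pred)
label_precision, label_recall = precision_recall_from_labels(y_true, y_pred)
print(counts, label_precision, label_recall)
```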
In information retrieval, a perfect precision score of 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all
relevant documents were retrieved) whereas a perfect recall score of 1.0 means that all relevant documents were retrieved by the search (but says
nothing about how many irrelevant documents were also retrieved).
Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other. Brain surgery
provides an illustrative example of the tradeoff. Consider a brain surgeon tasked with removing a cancerous tumor from a patient’s brain. The surgeon
needs to remove all of the tumor cells since any remaining cancer cells will regenerate the tumor. Conversely, the surgeon must not remove healthy brain
cells since that would leave the patient with impaired brain function. The surgeon may be more liberal in the area of the brain she removes to ensure she
has extracted all the cancer cells. This decision increases recall but reduces precision. On the other hand, the surgeon may be more conservative in the
brain she removes to ensure she extracts only cancer cells. This decision increases precision but reduces recall. That is to say, greater recall increases the
chances of removing healthy cells (negative outcome) and increases the chances of removing all cancer cells (positive outcome). Greater precision
decreases the chances of removing healthy cells (positive outcome) but also decreases the chances of removing all cancer cells (negative outcome).
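The same tradeoff appears whenever a scoring classifier is turned into a yes/no decision by a threshold: lowering the threshold is the "liberal" choice and raising it is the "conservative" one. The sketch below uses made-up scores and labels, and the helper name and the value returned when nothing is selected are assumptions.

```python
def precision_recall_at_threshold(scores, labels, threshold):
    """Label an item positive when its score is >= threshold, then evaluate."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    # Assumed convention: precision is 1.0 when nothing is selected.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical classifier scores (higher = more likely positive) and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

thresholds = (0.85, 0.50, 0.25)  # from conservative to liberal
results = [precision_recall_at_threshold(scores, labels, t) for t in thresholds]

for t, (p, r) in zip(thresholds, results):
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, each step toward a more liberal threshold raises recall and lowers precision.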
Usually, precision and recall scores are not discussed in isolation. Instead, either values for one measure are compared for a fixed level at the other
measure (e.g. precision at a recall level of 0.75) or both are combined into a single measure. Examples of measures that are a combination of precision
and recall are the F-measure (the weighted harmonic mean of precision and recall), or the Matthews correlation coefficient, which is a geometric mean
of the chance-corrected variants: the regression coefficients Informedness (DeltaP') and Markedness (DeltaP).[1][2] Accuracy is a weighted arithmetic
mean of Precision and Inverse Precision (weighted by Bias) as well as a weighted arithmetic mean of Recall and Inverse Recall (weighted by
Prevalence).[1] Inverse Precision and Inverse Recall are simply the Precision and Recall of the inverse problem where positive and negative labels are
exchanged (for both real classes and prediction labels). Recall and Inverse Recall, or equivalently true positive rate and false positive rate, are frequently
plotted against each other as ROC curves and provide a principled mechanism to explore operating point tradeoffs. Outside of Information Retrieval, the
application of Recall, Precision and F-measure is argued to be flawed, as these measures ignore the true negative cell of the contingency table and are easily
manipulated by biasing the predictions.[1] The first problem is 'solved' by using Accuracy and the second problem is 'solved' by discounting the chance
component and renormalizing to Cohen's kappa, but this no longer affords the opportunity to explore tradeoffs graphically. However, Informedness and
Markedness are Kappa-like renormalizations of Recall and Precision,[3] and their geometric mean Matthews correlation coefficient thus acts like a
debiased F-measure.
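The relationships stated above can be verified numerically. The following is a minimal sketch that assumes the usual definitions Informedness = Recall + Inverse Recall − 1 and Markedness = Precision + Inverse Precision − 1; the function name and the contingency counts (in particular the 85 true negatives) are hypothetical.

```python
import math

def summary_measures(tp, fp, fn, tn):
    """Compute the measures discussed above from a 2x2 contingency table."""
    total = tp + fp + fn + tn
    m = {
        "recall": tp / (tp + fn),             # true positive rate
        "inverse_recall": tn / (tn + fp),     # true negative rate
        "precision": tp / (tp + fp),          # positive predictive value
        "inverse_precision": tn / (tn + fn),  # negative predictive value
        "prevalence": (tp + fn) / total,      # share of real positives
        "bias": (tp + fp) / total,            # share of predicted positives
        "accuracy": (tp + tn) / total,
    }
    m["informedness"] = m["recall"] + m["inverse_recall"] - 1
    m["markedness"] = m["precision"] + m["inverse_precision"] - 1
    # Direct formula for the Matthews correlation coefficient.
    m["mcc"] = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return m

m = summary_measures(tp=5, fp=3, fn=7, tn=85)

# Geometric mean of informedness and markedness (both share the sign of MCC).
geometric_mean = math.copysign(
    math.sqrt(m["informedness"] * m["markedness"]), m["informedness"]
)

# Accuracy as a weighted arithmetic mean, weighted by prevalence or by bias.
acc_from_recall = (m["prevalence"] * m["recall"]
                   + (1 - m["prevalence"]) * m["inverse_recall"])
acc_from_precision = (m["bias"] * m["precision"]
                      + (1 - m["bias"]) * m["inverse_precision"])

print(m["mcc"], geometric_mean)
print(m["accuracy"], acc_from_recall, acc_from_precision)
```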
Definition (information retrieval context)
Precision
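In the field of information retrieval, precision is the fraction of retrieved documents that are relevant to the query:
\[ \text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} \]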
For example, for a text search on a set of documents, precision is the number of correct results divided by the number of all returned results.
Precision takes all retrieved documents into account, but it can also be evaluated at a given cut-off rank, considering only the topmost results returned
by the system. This measure is called precision at n or P@n.
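Precision at n can be sketched as follows, assuming the search results are already ordered by rank and each one carries a relevance judgment. The function name, the toy ranking, and the choice to divide by n even when fewer than n results exist are assumptions.

```python
def precision_at_n(ranked_relevance, n):
    """Fraction of the top-n results that are relevant.

    ranked_relevance: list of booleans, best-ranked result first.
    Assumed convention: always divide by n, so a result list shorter
    than n is penalized for the missing positions.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return sum(1 for relevant in ranked_relevance[:n] if relevant) / n

# Hypothetical relevance judgments for ten ranked results.
ranking = [True, False, True, True, False, False, True, False, False, False]

p_at_1 = precision_at_n(ranking, 1)
p_at_5 = precision_at_n(ranking, 5)
p_at_10 = precision_at_n(ranking, 10)
print(p_at_1, p_at_5, p_at_10)  # 1.0 0.6 0.4
```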
Precision is used with recall, the fraction of all relevant documents that is returned by the search. The two measures are sometimes combined in the
F1 score (or F-measure) to provide a single measurement for a system.
Note that the meaning and usage of "precision" in the field of information retrieval differs from the definition of accuracy and precision within other
branches of science and technology.
Recall
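In information retrieval, recall is the fraction of the relevant documents that are successfully retrieved:
\[ \text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|} \]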
For example, for a text search on a set of documents, recall is the number of correct results divided by the number of results that should have been
returned.
In binary classification, recall is called sensitivity. It can be viewed as the probability that a relevant document is retrieved by the query.
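For a text search, both definitions reduce to set operations on document identifiers. The sketch below uses hypothetical document IDs, and the function name and the 0.0 fallback for empty sets are assumptions.

```python
def retrieval_precision_recall(relevant, retrieved):
    """Set-based precision and recall for a single query."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant documents that were retrieved
    # Assumed convention: 0.0 when the denominator set is empty.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: four relevant documents exist, three were returned.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d7"}

ir_precision, ir_recall = retrieval_precision_recall(relevant_docs, retrieved_docs)
print(ir_precision, ir_recall)
```

Here two of the three returned documents are relevant (precision 2/3), and two of the four relevant documents were returned (recall 1/2).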
Definition (classification context)
Let us define an experiment with P positive instances and N negative instances for some condition. The four outcomes can be formulated in a 2×2
contingency table or confusion matrix, as follows:
Confusion matrix (columns: true condition; rows: predicted condition)

                                Condition positive                Condition negative
Predicted condition positive    True positive                     False positive (Type I error)
Predicted condition negative    False negative (Type II error)    True negative

Measures derived from the confusion matrix

Prevalence = Σ Condition positive / Σ Total population
Accuracy (ACC) = (Σ True positive + Σ True negative) / Σ Total population
Positive predictive value (PPV), Precision = Σ True positive / Σ Predicted condition positive
False discovery rate (FDR) = Σ False positive / Σ Predicted condition positive
False omission rate (FOR) = Σ False negative / Σ Predicted condition negative
Negative predictive value (NPV) = Σ True negative / Σ Predicted condition negative
True positive rate (TPR), Recall, Sensitivity, probability of detection = Σ True positive / Σ Condition positive
False positive rate (FPR), Fall-out, probability of false alarm = Σ False positive / Σ Condition negative
False negative rate (FNR), Miss rate = Σ False negative / Σ Condition positive
True negative rate (TNR), Specificity (SPC) = Σ True negative / Σ Condition negative
Positive likelihood ratio (LR+) = TPR / FPR
Negative likelihood ratio (LR−) = FNR / TNR
Diagnostic odds ratio (DOR) = LR+ / LR−
F1 score = 2 / (1/Recall + 1/Precision)
Recall in this context is also referred to as the true positive rate or sensitivity, and precision is also referred to as positive predictive value (PPV); other
related measures used in classification include true negative rate and accuracy.[6] True negative rate is also called specificity.
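The measures derived from the confusion matrix can be collected in one small class. This is a sketch only: the class and property names are invented for illustration, the counts are hypothetical, and every denominator is assumed to be nonzero.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # true positives
    fp: int  # false positives (type I errors)
    fn: int  # false negatives (type II errors)
    tn: int  # true negatives

    @property
    def total(self):
        return self.tp + self.fp + self.fn + self.tn

    @property
    def prevalence(self):
        return (self.tp + self.fn) / self.total

    @property
    def accuracy(self):
        return (self.tp + self.tn) / self.total

    @property
    def precision(self):  # positive predictive value (PPV)
        return self.tp / (self.tp + self.fp)

    @property
    def fdr(self):  # false discovery rate
        return self.fp / (self.tp + self.fp)

    @property
    def npv(self):  # negative predictive value
        return self.tn / (self.tn + self.fn)

    @property
    def false_omission_rate(self):
        return self.fn / (self.tn + self.fn)

    @property
    def recall(self):  # true positive rate (TPR), sensitivity
        return self.tp / (self.tp + self.fn)

    @property
    def fnr(self):  # false negative rate, miss rate
        return self.fn / (self.tp + self.fn)

    @property
    def fpr(self):  # false positive rate, fall-out
        return self.fp / (self.fp + self.tn)

    @property
    def specificity(self):  # true negative rate (TNR)
        return self.tn / (self.fp + self.tn)

    @property
    def lr_plus(self):  # positive likelihood ratio
        return self.recall / self.fpr

    @property
    def lr_minus(self):  # negative likelihood ratio
        return self.fnr / self.specificity

    @property
    def dor(self):  # diagnostic odds ratio
        return self.lr_plus / self.lr_minus

    @property
    def f1(self):  # harmonic mean of precision and recall
        return 2 / (1 / self.recall + 1 / self.precision)

# Hypothetical counts: 5 TP, 3 FP, 7 FN, 85 TN.
cm = ConfusionMatrix(tp=5, fp=3, fn=7, tn=85)
print(cm.precision, cm.recall, cm.specificity, cm.f1, cm.dor)
```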
Probabilistic interpretation
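Precision and recall can be interpreted not only as ratios but also as probabilities:
Precision is the probability that a (randomly selected) retrieved document is relevant.
Recall is the probability that a (randomly selected) relevant document is retrieved in a search.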
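Read as probabilities, precision is P(relevant | retrieved) and recall is P(retrieved | relevant) when a document is drawn uniformly at random. The sketch below checks this with exact fractions on a hypothetical collection of 20 documents; the document IDs and the helper `probability` are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical collection: 20 documents identified by integers 0..19.
collection = set(range(20))
relevant = set(range(12))       # 12 relevant documents
retrieved = set(range(7, 15))   # 8 retrieved documents, 5 of them relevant

def probability(event, universe):
    """Probability that a uniformly drawn document from universe lies in event."""
    return Fraction(len(event & universe), len(universe))

# Conditional probabilities via P(A | B) = P(A and B) / P(B).
p_both = probability(relevant & retrieved, collection)
p_relevant_given_retrieved = p_both / probability(retrieved, collection)
p_retrieved_given_relevant = p_both / probability(relevant, collection)

# The usual ratio definitions, for comparison.
ratio_precision = Fraction(len(relevant & retrieved), len(retrieved))
ratio_recall = Fraction(len(relevant & retrieved), len(relevant))

print(p_relevant_given_retrieved, p_retrieved_given_relevant)  # 5/8 5/12
```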
F-measure
The general F-measure for a non-negative real weight $\beta$ is
\[ F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \]
Two other commonly used F measures are the $F_2$ measure, which weights recall higher than precision, and the $F_{0.5}$ measure, which puts more emphasis on precision than recall.
The F-measure was derived by van Rijsbergen (1979) so that $F_\beta$ "measures the effectiveness of retrieval with respect to a user who attaches $\beta$ times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure.
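The effect of the weight β can be sketched with the standard formula F_β = (1 + β²) · P · R / (β² · P + R). The function names are illustrative, and the effectiveness measure is written here in its commonly cited form E = 1 − 1 / (α/P + (1 − α)/R) with α = 1 / (1 + β²), which should be checked against van Rijsbergen (1979) before relying on it.

```python
import math

def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # assumed convention for the degenerate case
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def effectiveness(precision, recall, alpha):
    """Effectiveness measure E (assumed form, see lead-in); lower is better."""
    return 1 - 1 / (alpha / precision + (1 - alpha) / recall)

# Precision and recall from the dog-recognition example: 5/8 and 5/12.
p, r = 5 / 8, 5 / 12

f1 = f_beta(p, r, beta=1.0)
f2 = f_beta(p, r, beta=2.0)      # recall counts more
f_half = f_beta(p, r, beta=0.5)  # precision counts more

print(f"F1={f1:.4f} F2={f2:.4f} F0.5={f_half:.4f}")

# Relationship F_beta = 1 - E_alpha with alpha = 1 / (1 + beta^2).
beta = 2.0
e_alpha = effectiveness(p, r, alpha=1 / (1 + beta ** 2))
```

Because recall is lower than precision in this example, the recall-weighted F2 is the smallest of the three scores and the precision-weighted F0.5 is the largest.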
Limitations as goals
For web document retrieval, if the user's objectives are not clear, precision and recall cannot be optimized. This point is summarized by Lopresti.[8]
Terminology sources: Fawcett (2006),[4] Powers (2011),[1] and Ting (2011).[5]
See also
Uncertainty coefficient, also called proficiency
Sensitivity and specificity
References
1. Powers, David M. W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation" (http://www.flinders.edu.au/science_engineering/fms/School-CSEM/publications/tech_reps-research_artfcts/TRRA_2007.pdf) (PDF). Journal of Machine Learning Technologies. 2 (1): 37–63.
2. Perruchet, P.; Peereman, R. (2004). "The exploitation of distributional information in syllable processing". J. Neurolinguistics. 17 (2–3): 97–119. doi:10.1016/s0911-6044(03)00059-9 (https://doi.org/10.1016%2Fs0911-6044%2803%2900059-9).
3. Powers, David M. W. (2012). "The Problem with Kappa". Conference of the European Chapter of the Association for Computational Linguistics (EACL2012) Joint ROBUS-UNSUP Workshop.
4. Fawcett, Tom (2006). "An Introduction to ROC Analysis" (http://people.inf.elte.hu/kiss/11dwhdm/roc.pdf) (PDF). Pattern Recognition Letters. 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010 (https://doi.org/10.1016%2Fj.patrec.2005.10.010).
5. Ting, Kai Ming (2011). Encyclopedia of machine learning (http://link.springer.com/referencework/10.1007%2F978-0-387-30164-8). Springer. ISBN 978-0-387-30164-8.
6. Olson, David L.; Delen, Dursun (2008). Advanced Data Mining Techniques. Springer, 1st edition (February 1, 2008), p. 138. ISBN 3-540-76916-1.
7. Zając, Zygmunt. What you wanted to know about AUC. http://fastml.com/what-you-wanted-to-know-about-auc/
8. Lopresti, Daniel (2001). WDA 2001 panel (http://www.csc.liv.ac.uk/~wda2001/Panel_Presentations/Lopresti/Lopresti_files/v3_document.htm).
Baeza-Yates, Ricardo; Ribeiro-Neto, Berthier (1999). Modern Information Retrieval. New York, NY: ACM Press, Addison-Wesley, pp. 75 ff. ISBN 0-201-39829-X.
Hjørland, Birger (2010). "The foundation of the concept of relevance". Journal of the American Society for Information Science and Technology. 61 (2): 217–237.
Makhoul, John; Kubala, Francis; Schwartz, Richard; Weischedel, Ralph (1999). "Performance measures for information extraction" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.27.4637). Proceedings of DARPA Broadcast News Workshop, Herndon, VA, February 1999.
Perry, James W.; Kent, Allen; Berry, Madeline M. (1955). "Machine literature searching X. Machine language; factors underlying its design and development". American Documentation. 6 (4): 242. doi:10.1002/asi.5090060411 (https://doi.org/10.1002%2Fasi.5090060411).
van Rijsbergen, Cornelis Joost "Keith" (1979). Information Retrieval. London, GB; Boston, MA: Butterworth, 2nd edition. ISBN 0-408-70929-4.
External links
Information Retrieval – C. J. van Rijsbergen 1979 (http://www.dcs.gla.ac.uk/Keith/Preface.html)
Computing Precision and Recall for a Multi-class Classification Problem (http://www.text-analytics101.com/2014/10/computing-precision-and-recall-for.html)