Gaël Varoquaux, Olivier Colliot
Abstract
This chapter describes model validation, a crucial part of machine learn-
ing whether it is to select the best model or to assess performance of a
given model. We start by detailing the main performance metrics for dif-
ferent tasks (classification, regression), and how they may be interpreted,
including in the face of class imbalance, varying prevalence, or asymmet-
ric cost-benefit trade-offs. We then explain how to estimate these metrics
in an unbiased manner using training, validation, and test sets. We de-
scribe cross-validation procedures –to use a larger part of the data for
both training and testing– and the dangers of data leakage –optimism
bias due to training data contaminating the test set. Finally, we discuss
how to obtain confidence intervals of performance metrics, distinguishing
two situations: internal validation or evaluation of learning algorithms,
and external validation or evaluation of resulting prediction models.
To appear in
O. Colliot (Ed.), Machine Learning for Brain Disorders, Springer
1. Introduction
A machine learning (ML) model is validated by evaluating its prediction
performance. Ideally, this evaluation should be representative of how
the model would perform when deployed in a real life setting. This is
an ambitious goal that goes beyond the setting of academic research.
Indeed, a perfect validation would probe robustness to any possible vari-
ation of the input data which may include different acquisition devices
and protocols, different practices that vary from one country to another,
from one hospital to another and even from one physician to another. A
less ambitious goal for validation is to provide an unbiased estimate of
the model performance on new –never before seen– data similar to that
used for training (but not the same data!). By similar, we mean data
that has similar clinical or socio-demographic characteristics and which
has been acquired using similar devices and protocols. To go beyond
such internal validity, external validation would evaluate generalization
to data from different sources (for example another dataset, data from
another hospital).
This chapter addresses the following questions. How to quantify the
performance of the model? This will lead us to present, in Section 2,
different performance metrics that are adequate for different ML tasks
(classification, regression . . . ). How to estimate these performance met-
rics? This will lead to the presentation of different validation strategies
(Section 3). We will also explain how to derive confidence intervals for the
estimated performance metrics, drawing the distinction between evaluat-
ing a learning algorithm or a resulting prediction model. We will present
various caveats that pertain to the use of performance metrics on medical
data as well as to data leakage, which can be particularly insidious.
2. Performance metrics
Metrics allow one to quantify the performance of an ML model. In this
section, we describe metrics for classification and regression tasks. Other
tasks (segmentation, generation, detection. . . ) can use some of these but
will often require other metrics which are specific to these tasks. The
reader may refer to Chapter 13 for metrics dedicated to segmentation
and to Section 6 of Chapter 23 for metrics dedicated to segmentation,
classification and detection.
Figure 1: Confusion matrix. The confusion matrix represents the results of a classification task. In the case of binary classification (two classes), it divides the test samples into four categories, depending on their true (e.g. disease status, D) and predicted (test output, T) labels: true positives (TP), true negatives (TN), false positives (FP), false negatives (FN). Rows correspond to the predicted label (positive, T+; negative, T−) and columns to the true label (positive, D+; negative, D−): the T+ row contains TP and FP, the T− row contains FN and TN.
• True Positives (TP): samples for which the true and predicted
labels are both 1. Example: the patient has cancer (1) and the
model classifies this sample as cancer (1)
• True Negatives (TN): samples for which the true and predicted
labels are both 0. Example: the patient does not have cancer (0)
and the model classifies this sample as non-cancer (0)
• False Positives (FP): samples for which the true label is 0 and
the predicted label is 1. Example: the patient does not have cancer
(0) and the model classifies this sample as cancer (1)
• False Negatives (FN): samples for which the true label is 1 and
the predicted label is 0. Example: the patient has cancer (1) and
the model classifies this sample as non-cancer (0)
Are false positives and false negatives equally problematic? This depends on the application. For instance, consider the case of detecting brain tumors. In a screening application, detected positive cases would subsequently be reviewed by a human expert; one can thus consider that false negatives (missed brain tumors) lead to more dramatic consequences than false positives. On the contrary, if a detected tumor leads the patient to be sent to brain surgery without a complementary exam, false positives are problematic, as brain surgery is not a benign operation. For automatic volumetry from magnetic resonance imaging (MRI), one could argue that false positives and false negatives are equally problematic.
Multiple performance metrics can be derived from the confusion ma-
trix, all easily computed using sklearn.metrics from scikit-learn [1].
They are summarized in Box 1. One can distinguish between basic met-
rics which only focus on false positives or false negatives and summary
metrics which aim at providing an overview of the performance with a
single metric.
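As an illustration, the following sketch computes the confusion matrix and the basic metrics with scikit-learn; the simulated labels and variable names are ours, for illustration only.

import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, balanced_accuracy_score

# Simulated ground-truth disease status (D) and classifier output (T)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                          # 0 = D-, 1 = D+
y_pred = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)   # ~80% correct

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # estimates P(T+ | D+), also called recall
specificity = tn / (tn + fp)   # estimates P(T- | D-)
ppv = tp / (tp + fp)           # positive predictive value, P(D+ | T+)
npv = tn / (tn + fn)           # negative predictive value, P(D- | T-)

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
print(f"PPV={ppv:.2f} NPV={npv:.2f}")
print("accuracy:", accuracy_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))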
The performance of a classifier is characterized by pairs of basic met-
rics: either sensitivity and specificity, or PPV and NPV, which charac-
terize respectively the probability of the test given the diseased status or
vice versa (see Box 1). Note that each basic metric characterizes only the
behavior of the classifier on the positive class (D+) or the negative class
(D−); thus measuring both sensitivity and specificity, or PPV and NPV
is important. Indeed a classifier always reporting a positive prediction
would have a perfect sensitivity, but a disastrous specificity.
Basic metrics
T denotes test: classifier output; D denotes diseased status.
• Sensitivity (also called recall): fraction of positive samples actually retrieved.
  Sensitivity = TP / (TP + FN)    Estimates P(T+ | D+)
Summary metrics
• Accuracy: fraction of the samples correctly classified.
  Accuracy = (TP + TN) / (TP + FP + TN + FN)
• Markedness = PPV + NPV − 1 = TP / (TP + FP) − FN / (FN + TN)
but a low NPV. A solution can be to exchange the two classes. The F1
score becomes informative again. Those shortcomings are fundamental,
as the F1 score is completely blind to the number of true negatives, TN. This is probably one of the reasons why it is a popular metric for segmentation (usually called Dice rather than F1), as in this task TN is almost meaningless (TN can be made arbitrarily large by just changing the field of view of the image). In addition, this metric has no simple link to the probabilities of interest, even more so after switching classes.
Another option is to use Matthews Correlation Coefficient (MCC).
The MCC makes full use of the confusion matrix and can remain infor-
mative even when prevalence is very low or very high. However, its in-
terpretation may be less intuitive than that of the other metrics. Finally,
markedness [2] is a less widely known summary metric that deals well with low-prevalence situations, as it is built from the PPV and NPV (Box 1).
Its drawback is that it is as much related to the population under study
as to the classifier.
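A possible way to compute these summary metrics is sketched below; markedness is not provided by scikit-learn, so we derive it from PPV and NPV as in Box 1, and the toy labels are ours.

import numpy as np
from sklearn.metrics import matthews_corrcoef, confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])   # imbalanced toy labels
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 0])

mcc = matthews_corrcoef(y_true, y_pred)

# Markedness = PPV + NPV - 1, computed by hand from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
markedness = tp / (tp + fp) + tn / (tn + fn) - 1
print(f"MCC={mcc:.2f}  markedness={markedness:.2f}")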
As we have seen, it is important to distinguish metrics which are
intrinsic characteristics of the classifier (sensitivity, specificity, balanced
accuracy) from those which depend on the target population and in particular on its prevalence (PPV, NPV, MCC, markedness). The former are independent of the situation in which the model is going to be used. The latter inform on the probability of the condition (the
output label) given the output of the classifier; but they depend on the
operational situation, and in particular on the prevalence. The prevalence
can be variable (for instance the prevalence of an infectious disease will
be variable across time, the prevalence of a neurodegenerative disease will
depend on the age of the target population) and a given classifier may
be intended to be applied in various situations. This is why the intrinsic
characteristics (sensitivity and specificity) need to be judged according to
the different intended uses of the classifier (e.g. a specificity of 90% may
be considered excellent for some applications while it would be considered
unacceptable if the intended use is in a low prevalence situation).
LR+ = O(D+ | T+) / O(D+) = sensitivity / (1 − specificity). This quantity depends only on sensitivity and specificity, properties of the classifier alone, and not on the prevalence in the study population. Yet, given a target population, post-test odds can easily be obtained by multiplying LR+ by pre-test odds, itself given by the prevalence: O(D+) = prevalence / (1 − prevalence). The larger the LR+, the more useful the classifier; a classifier with LR+ = 1 or less brings no additional information on the likelihood of the disease. An equivalent to LR+ characterizes the negative class: conditioning on T− instead of T+ gives the negative likelihood ratio, LR− = (1 − sensitivity) / specificity; low values of LR− (below 1) denote more useful predictions. These metrics, LR+ and LR−, are very
useful in a situation common in biomedical settings where the only data
available to learn and evaluate a classifier is a study population with
nearly balanced classes, such as a case-control study, while the target
application –the general population– is one with a different prevalence
(e.g. a very low prevalence) or when the intended use considers variable
prevalences.
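As a sketch of how these quantities combine in practice, the snippet below converts classifier characteristics into a post-test probability for two populations; the sensitivity, specificity, and prevalences are made-up numbers for illustration.

# Converting classifier characteristics to post-test probability in a target
# population -- a small sketch with made-up numbers
sensitivity, specificity = 0.90, 0.95
lr_plus = sensitivity / (1 - specificity)          # positive likelihood ratio = 18

for prevalence in (0.5, 0.01):                     # case-control vs. general population
    pre_test_odds = prevalence / (1 - prevalence)
    post_test_odds = lr_plus * pre_test_odds
    ppv = post_test_odds / (1 + post_test_odds)    # P(D+ | T+) in that population
    print(f"prevalence={prevalence:.2f} -> PPV={ppv:.2f}")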
Figure 3: ROC curve for different classifiers. AUC denotes the Area Under the Curve, typically used to extract a number summarizing the ROC curve. (Axes: false positive rate against true positive rate, i.e. sensitivity or recall; example curves: Excellent, AUC=0.95; Good, AUC=0.88; Poor, AUC=0.75; Chance, AUC=0.50.)

Figure 4: Precision-Recall curve for different classifiers. AUC denotes the Area Under the Curve, often called average precision here. Note that the chance level depends on the class imbalance (or prevalence), here 0.57. (Axes: recall, i.e. TPR or sensitivity, against precision, i.e. PPV; example curves: Excellent, AUC=0.96; Good, AUC=0.92; Poor, AUC=0.78; Chance, AUC=0.57.)
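Curves such as those of Figures 3 and 4 are computed from the continuous scores of a classifier; a minimal scikit-learn sketch, with simulated scores of our own, could look as follows.

import numpy as np
from sklearn.metrics import (roc_curve, roc_auc_score,
                             precision_recall_curve, average_precision_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
# Simulated continuous scores: higher on average for the positive class
scores = y_true + rng.normal(scale=1.0, size=500)

fpr, tpr, roc_thresholds = roc_curve(y_true, scores)
print("ROC AUC:", roc_auc_score(y_true, scores))

precision, recall, pr_thresholds = precision_recall_curve(y_true, scores)
print("Average precision:", average_precision_score(y_true, scores))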
controls average error rates while Brier score controls individual proba-
bilities, which is much more stringent and more useful to the practitioner
[5]. Accurate probabilities of individual predictions can be used for opti-
mal decision making, e.g. opting for brain surgery only for individuals for whom a diagnostic model predicts cancer with high confidence.
A given value of ECE is easy to interpret, as it qualifies probabili-
ties mostly independently of prediction performance. On the other hand,
the Brier score accounts for both the quality of the probabilities and the corresponding binary decisions: a low Brier score captures the ability to give good probabilistic predictions of the output. For any classification problem, there exist many classifiers with zero expected calibration error, including some with very poor predictions. On the other hand, even the best possible prediction has a non-zero Brier score, unless the output is a deterministic function of the data. The Brier skill score, a variant of the Brier score, is often used to assess how far a predictor is from the best possible prediction, more independently of the intrinsic uncertainty in the data. The Brier skill score is a rescaled version of the Brier score taking as a reference a reasonable baseline: 1 is a perfect prediction, while negative values mean predictions worse than guessing from class prevalence.
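As a rough sketch of how these two notions can be probed with scikit-learn: the simulated probabilities below are ours, and the binned gap is only a crude stand-in for the expected calibration error.

import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
# Simulated "true" event probabilities and corresponding binary outcomes
p_true = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p_true).astype(int)

# A well-calibrated predictor and an over-confident one
p_calibrated = p_true
p_overconfident = np.clip(p_true * 1.5 - 0.25, 0, 1)

for name, p in [("calibrated", p_calibrated), ("over-confident", p_overconfident)]:
    print(name, "Brier score:", round(brier_score_loss(y, p), 3))
    frac_pos, mean_pred = calibration_curve(y, p, n_bins=10)
    # Average gap between predicted probability and observed frequency per bin
    # (an unweighted, simplified analogue of the expected calibration error)
    print(name, "mean calibration gap:", round(np.mean(np.abs(frac_pos - mean_pred)), 3))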
• Always look at all the individual metrics: false positives and false
negatives are seldom equivalent. Understand the medical problem
to know the right trade-off [4]
Figure 5: Multi-class confusion matrix: number of test samples predicted as a given class, knowing the actual class. A perfect prediction would give non-zero entries only on the diagonal. (Visible rows of the example matrix: C2: 0, 107, 36; C3: 0, 0, 92.)
true instances for each class, while micro averaging computes the metric
by adding the number of TP (resp. TN, FP, FN) across all classes.
Inspecting the confusion matrix extended to multi-class settings gives an interesting tool to understand errors: it displays how many times a
given true class is predicted as another (Figure 5). A perfect prediction
has non-zero entries only on the diagonal. The confusion matrix may be
interesting to reveal which classes are commonly confused, as its name
suggests. In our example, instances that are actually of class C2 are often
predicted as of class C3.
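A small sketch of the multi-class confusion matrix and of macro versus micro averaging with scikit-learn, on toy labels of our own:

import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Toy multi-class labels (classes C1, C2, C3 encoded as 0, 1, 2)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])

# Rows: true class, columns: predicted class
print(confusion_matrix(y_true, y_pred))

# Macro averaging: mean of per-class F1 scores; micro: pooled TP/FP/FN counts
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print("micro F1:", f1_score(y_true, y_pred, average="micro"))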
Note that if the error was uniformly equal to the same value (10, for
instance), both measures would give the same result.
model has a larger prediction error, but also that it tends to undershoot:
predict a value that underestimates the observed value. This aspect of
the prediction error is not well captured by the summary metrics because
there are comparatively far fewer observations with large y.
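A small sketch with made-up numbers illustrating this remark, here using the mean absolute error (MAE) and the root mean squared error (RMSE):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 20.0, 30.0, 40.0])

# Uniform errors of 10: MAE and RMSE coincide
y_pred_uniform = y_true + 10
print(mean_absolute_error(y_true, y_pred_uniform),             # 10.0
      np.sqrt(mean_squared_error(y_true, y_pred_uniform)))     # 10.0

# Same MAE, but one large error: the RMSE is inflated much more
y_pred_outlier = y_true + np.array([1.0, 1.0, 1.0, 37.0])
print(mean_absolute_error(y_true, y_pred_outlier),             # 10.0
      np.sqrt(mean_squared_error(y_true, y_pred_outlier)))     # ~18.5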
3. Evaluation strategies
The previous section detailed metrics for assessing the performance of a
ML model. We now focus on how to estimate the expected prediction
performance of the model with these metrics. Importantly, we draw
the difference between evaluating a learning procedure, or learner, and
a learned model. While these two questions are often conflated in the
literature, the first one must account for uncontrolled fluctuations in the
learning procedure, while the second one controls a given model on a
target external population. The first question is typically of interest to
the methods researcher, to conclude on learning procedures, while the
second is central to medical research, to conclude on the clinical
application of a model.
Figure 6: Visualizing prediction: predicted y plotted against the observed y.
3.1.1. Cross-validation
The split between train and test set is arbitrary. With the same machine-
learning algorithm, two different data splits will lead to two different
observed performances, both of which are noisy estimates of the ex-
pected generalization performance of prediction models built with this
learning procedure. A common strategy to obtain better estimates con-
sists in performing multiple splits of the whole dataset into training
and testing set: a so called cross-validation loop. For each split, a
model is trained using the training set and the performances are com-
puted using the testing set. The performances over all the testing sets
are then aggregated. Figure 7 displays different cross-validation meth-
ods. k-fold cross-validation consists in splitting the data into k sets
(called folds) of approximately equal size. It ensures that each sam-
ple in the data set is used exactly once for testing. For classification,
sklearn.model_selection.StratifiedKFold performs stratified k-fold
cross-validation.
In each split, ideally, one would want to have a large training set,
because it usually allows training better performing models, and a large
testing set, because it allows a more accurate estimation of the perfor-
mance. But the dataset size is not infinite. Splitting out 10 to 20%
for the test set is a good trade-off [9], which amounts to k=5 or 10 in
a k-fold. With small datasets, to maximize the amount of train data,
it may be tempting to leave out only one observation, in a so-called
leave-one-out cross-validation. However, such depletion of the test set
gives overall worse estimates of the generalization performance. Increas-
ing the number of splits is however useful, thus another strategy con-
sists in performing a large number of random splits of the data, break-
ing from the regularity of the k-fold. If the number of splits is suf-
ficiently large, all samples will be approximately used the same num-
ber of times for training and testing. This strategy can be implemented using sklearn.model_selection.StratifiedShuffleSplit(n_splits) and is called “Repeated hold-out” or “Monte-Carlo cross-validation”. Beyond giving a good estimate of the generalization performance, an important benefit of this strategy is that it makes it possible to study the variability of the
performances. However, running many splits may be computationally
expensive with models that are slow to train.
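A minimal scikit-learn sketch of these two cross-validation strategies; the synthetic data and the choice of a logistic regression model are ours, for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (cross_val_score, StratifiedKFold,
                                     StratifiedShuffleSplit)

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold stratified cross-validation: each sample is tested exactly once
cv_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv_kfold, scoring="roc_auc")
print("k-fold AUCs:", np.round(scores, 2))

# Repeated hold-out (Monte-Carlo cross-validation): many random 80/20 splits
cv_shuffle = StratifiedShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
scores = cross_val_score(model, X, y, cv=cv_shuffle, scoring="roc_auc")
print("repeated hold-out AUC: %.2f +/- %.2f" % (scores.mean(), scores.std()))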
for a given patient. In such a case, one should never put data from
the same patient in both the training and validation sets. For instance,
one should not, for a given patient, put the visit at month 0 in the
training set and the visit at month 6 in the validation set. Similarly,
one should not use the magnetic resonance imaging (MRI) data of a
given patient for training and the positron emission tomography (PET)
image for validation. A similar situation arises when dealing with 3D
medical images. It is absolutely mandatory to avoid putting some of the
2D slices of a given patient in the training set and the rest of the slices
in the validation set. More generally, in medical applications, the split
between training and test set should always be done at the patient level.
Unfortunately, data leakage is still prevalent in many machine learning
studies on brain disorders. For instance, a literature review identified
that up to 40% of studies on convolutional neural networks for automatic
classification of Alzheimer’s disease from T1-weighted MRI potentially
suffered from data leakage [11].
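In scikit-learn, such patient-level splits can be obtained with group-aware cross-validation iterators; the sketch below uses synthetic data and a hypothetical patient_id grouping variable.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Toy data: 50 patients, 4 samples (e.g. visits or slices) per patient
X, y = make_classification(n_samples=200, random_state=0)
patient_id = np.repeat(np.arange(50), 4)

# GroupKFold keeps all samples of a given patient in the same fold,
# so no patient appears in both the training and the test set
cv = GroupKFold(n_splits=5)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, groups=patient_id, scoring="roc_auc")
print(np.round(scores, 2))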
• For a given patient, put some of its visits in the training set
and some in the validation set
Model                          Accuracy
Logistic regression            0.72
Support vector machine         0.75
Convolutional neural network   0.95
tion: the standard error is the standard deviation divided by the square root of the number of runs. The number of runs can be made arbitrarily large given enough
compute power, thus making the standard error arbitrarily small. But
in no way does the uncertainty due to the limited test data vanish. This
uncertainty can be quantified for a fixed test set –see 3.2, but in repeated
splits or cross-validation it is difficult to derive confidence intervals be-
cause the runs are not independent [15, 14]. In particular, it is invalid
to use a standard hypothesis test –such as a T-test– across the different
folds of a cross-validation. There are some valid options to perform hy-
pothesis testing in a cross-validation setting [16, 14], but they must be
implemented with care.
Another reason not to rely on null-hypothesis tests is that their
statistical significance only asserts that the expected performance –or
being acquired using different protocols and at different sites (Figure 9).
Most often, these datasets come from other research studies (different
from the one used for training). However, research studies do not usually reflect clinical routine data well. Indeed, in research studies, the
acquisition protocols are often standardized and rigorous data quality
control is applied. Moreover, participants may not be representative of
the target population. This can be due to inclusion/exclusion criteria
(for instance excluding patients with vascular abnormalities in a study
on Alzheimer’s disease) or due to uncontrolled biases. For instance, par-
ticipants in research studies tend to have a higher socioeconomic status
than the general population. Therefore, it is highly valuable to also per-
form validation on clinical routine data, whenever possible, as it is more
likely to reflect “real-life” situations. One should nevertheless be aware
that a given clinical routine dataset may come with specificities that may
not generalize to all settings. For instance, data collected within a spe-
cialized center of a university hospital may substantially differ from that
seen by a general practitioner.
from a binomial law. Table 2 gives such confidence intervals for different sizes of the test set and different values of the ground-truth accuracy. These can be easily adapted to other counts of errors as follows:

Accuracy: N is the size of the test set
Sensitivity: N is the number of positive samples in the test set
Specificity: N is the number of negative samples in the test set
PPV: N is the number of positively classified test samples
NPV: N is the number of negatively classified test samples
We believe it is very important to have in mind the typical orders
of magnitude reported in Table 2. It is not uncommon to find medical
classification studies where the test set size is about a hundred or less.
In such a situation, the uncertainty on the estimation of the performance
is very high.
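A minimal sketch of such a binomial (normal-approximation) confidence interval, which reproduces the orders of magnitude discussed here; the accuracy value of 0.80 is an arbitrary example, and Table 2 itself is not reproduced.

import numpy as np

def binomial_confidence_interval(p_hat, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion
    (e.g. an accuracy estimated on n test samples)."""
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Accuracy of 0.80 estimated on test sets of different sizes
for n in (30, 100, 1000, 10000):
    low, high = binomial_confidence_interval(0.80, n)
    print(f"N={n:5d}: [{low:.2f}, {high:.2f}]")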
These parametric confidence intervals are easy to compute and re-
fer to. But actual confidence intervals may be wider if the samples are
not i.i.d. In addition, some interesting metrics, such as AUC ROC, do
not come with such parametric confidence interval. A general and good
option, applicable to all situations, is to approximate the sampling dis-
tribution of the metric of interest by bootstrapping the test set [8].
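A possible implementation of such a percentile bootstrap on a fixed test set is sketched below; the scores are simulated and the number of bootstrap resamples is an arbitrary choice.

import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_confidence_interval(y_true, scores, metric=roc_auc_score,
                                  n_bootstrap=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric on a fixed test set."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    values = []
    for _ in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)      # resample test samples with replacement
        if len(np.unique(y_true[idx])) < 2:   # skip degenerate resamples (one class only)
            continue
        values.append(metric(y_true[idx], scores[idx]))
    return np.percentile(values, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Example with simulated scores
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
scores = y_true + rng.normal(scale=1.0, size=200)
print("AUC:", roc_auc_score(y_true, scores),
      "95% CI:", bootstrap_confidence_interval(y_true, scores))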
Finally, note that all these confidence intervals assume that the avail-
able labels are the ground truth. In practice, medical truth is difficult to
establish, and label error may bias the estimation of error rates.
When comparing two classifiers, McNemar's test is useful to test whether the observed difference in errors can be explained solely by sampling noise [21, 22]. The test is based on the number of samples misclassified by one classifier and not the other, n01, and vice versa, n10. The test statistic is then written (|n01 − n10| − 1)² / (n01 + n10); it is distributed under the null as a χ² with 1 degree of freedom. To compare classifiers while
scanning the tradeoff between specificity and sensitivity without choosing
a specific threshold on their score, one option is to compare areas under
the curve of the ROC, using the DeLong test [23] or a permutation scheme
to define the null [24].
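A minimal sketch of this test statistic, implemented by hand from the formula above with toy error indicators of our own; dedicated implementations also exist, e.g. in statsmodels.

import numpy as np
from scipy.stats import chi2

def mcnemar_test(errors_a, errors_b):
    """McNemar test from boolean arrays indicating, for each test sample,
    whether classifier A (resp. B) misclassified it."""
    n01 = int(np.sum(~errors_a & errors_b))   # correct for A, wrong for B
    n10 = int(np.sum(errors_a & ~errors_b))   # wrong for A, correct for B
    statistic = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    p_value = chi2.sf(statistic, df=1)        # chi-squared with 1 degree of freedom
    return statistic, p_value

# Toy example: two classifiers disagreeing on a handful of test samples
rng = np.random.default_rng(0)
errors_a = rng.random(200) < 0.20             # classifier A wrong on ~20% of samples
errors_b = rng.random(200) < 0.15             # classifier B wrong on ~15%
print(mcnemar_test(errors_a, errors_b))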
4. Conclusion
Evaluating machine learning models is crucial. Can we claim that a new
model outperforms an existing one? Is a given model trustworthy enough
to be “deployed”, making decisions in actual clinical settings? A good
answer to these questions requires model-evaluation experiments adapted
to the application settings. There is no one-size-fits-all solution. Mul-
tiple performance metrics are often important, chosen to reflect target
population and cost-benefit trade-offs of decisions, as discussed in sec-
tion 2. The prediction model must always be evaluated on unseen “test”
data, but different evaluation goals lead to different procedures for choosing this test
data. Evaluating a “learner” –a model-construction algorithm– leads to
cross-validation, while evaluating the fitness of a given prediction rule
–as output by model fitting– calls for left-out data representative of the
target population. In all settings, accounting for uncertainty or variance
of the performance estimate is important, for instance to avoid investing
in models that bring no reliable improvements.
Acknowledgments
This work was supported by the French government under the management
of Agence Nationale de la Recherche as part of the “Investissements
d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Insti-
tute), ANR-10-IAIHU-06 (Agence Nationale de la Recherche-10-IA Insti-
tut Hospitalo-Universitaire-6), ANR-20-CHIA-0026 (LearnI). We thank
Sebastian Raschka for detailed feedback.
References
[1] Pedregosa F, et al (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12(85):2825–2830

[2] Powers D (2011) Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies 2(1):37–63

[3] Naeini MP, Cooper G, Hauskrecht M (2015) Obtaining well calibrated probabilities using Bayesian binning. In: Twenty-Ninth AAAI Conference on Artificial Intelligence

[4] Vickers AJ, Van Calster B, Steyerberg EW (2016) Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ 352

[5] Perez-Lebel A, Morvan ML, Varoquaux G (2023) Beyond calibration: estimating the grouping loss of modern neural networks. ICLR

[6] Poldrack RA, Huckins G, Varoquaux G (2020) Establishment of best practices for evidence for prediction: a review. JAMA Psychiatry 77(5):534–540

[7] Barocas S, Hardt M, Narayanan A (2019) Fairness and Machine Learning. fairmlbook.org, http://www.fairmlbook.org

[8] Raschka S (2018) Model evaluation, model selection, and algorithm selection in machine learning. arXiv preprint arXiv:1811.12808

[9] Varoquaux G, Raamana PR, Engemann DA, Hoyos-Idrobo A, Schwartz Y, Thirion B (2017) Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage 145:166–179

[10] Varoquaux G (2018) Cross-validation failure: Small sample sizes lead to large error bars. NeuroImage 180:68–77

[11] Wen J, Thibeau-Sutre E, Diaz-Melo M, Samper-González J, Routier A, Bottani S, Dormont D, Durrleman S, Burgos N, Colliot O, et al (2020) Convolutional neural networks for classification of Alzheimer's disease: Overview and reproducible evaluation. Medical Image Analysis 63:101694

[12] Bouthillier X, Laurent C, Vincent P (2019) Unreproducible research is reproducible. In: International Conference on Machine Learning, PMLR, pp 725–734

[13] Bouthillier X, Delaunay P, Bronzi M, Trofimov A, Nichyporuk B, Szeto J, Mohammadi Sepahvand N, Raff E, Madan K, Voleti V, et al (2021) Accounting for variance in machine learning benchmarks. Proceedings of Machine Learning and Systems 3:747–769

[14] Bates S, Hastie T, Tibshirani R (2021) Cross-validation: what does it estimate and how well does it do it? arXiv preprint arXiv:2104.00673

[15] Bengio Y, Grandvalet Y (2004) No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research 5(Sep):1089–1105

[16] Nadeau C, Bengio Y (2003) Inference for the generalization error. Machine Learning 52(3):239–281

[17] Perezgonzalez JD (2015) Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Frontiers in Psychology p 223

[18] Moons KG, Altman DG, Reitsma JB, Ioannidis JP, Macaskill P, Steyerberg EW, Vickers AJ, Ransohoff DF, Collins GS (2015) Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Annals of Internal Medicine 162(1):W1–W73

[19] Dockès J, Varoquaux G, Poline JB (2021) Preventing dataset shift from breaking machine-learning biomarkers. GigaScience 10(9):giab055

[20] Shapiro DE (1999) The interpretation of diagnostic tests. Statistical Methods in Medical Research 8(2):113–134

[21] Leisenring W, Pepe MS, Longton G (1997) A marginal regression modelling framework for evaluating medical diagnostic tests. Statistics in Medicine 16(11):1263–1281

[22] Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10(7):1895–1923

[23] DeLong ER, DeLong DM, Clarke-Pearson DL (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics pp 837–845

[24] Bandos AI, Rockette HE, Gur D (2005) A permutation test sensitive to differences in areas for comparing ROC curves from a paired design. Statistics in Medicine 24(18):2873–2893
A. Appendix
A.1.1. Odds
Odds are a measure of likelihood of an outcome: the ratio of the number of events that produce that outcome to the number that do not. Writing n++, n+−, n−+, n−− for the number of samples with each combination of values of two binary variables a and b (the first index giving the value of a, the second that of b), the odds are written O(a|b = +) = n++ / n−+ and O(a|b = −) = n+− / n−−, hence the odds ratio reads

    OR(a, b) = (n++ n−−) / (n−+ n+−).    (4)
Note that this expression is unchanged when swapping the roles of a and b; the odds ratio is symmetric, OR(a, b) = OR(b, a).
This property is one reason why odds and odds ratios are so central to biostatistics and epidemiology: sampling or recruitment biases are an important concern in these fields. For instance, a case-control study has a very different prevalence from the target population, where the frequency
of the disease is typically very low.
Confusion with risk ratio The odds ratio is often wrongly inter-
preted as a risk ratio –or relative risk–, which is more easily understood.
The risk ratio is the ratio of the probability of an outcome in a group
where the property holds to the probability of this outcome in a group
where this property does not hold. The risk ratio thus differs from the
odds ratio in that it is expressed for probabilities and not odds. Even
though the values of the odds ratio and risk ratio are often close because, for most diseases, being diseased is much less likely than not, they are fun-
damentally different because the odds ratio does not depend on sampling
whereas the risk ratio does.
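A small numerical sketch with made-up counts, illustrating why the two ratios are close when the disease is rare:

# Odds ratio vs. risk ratio on a toy 2x2 table (made-up counts)
# rows: exposed / unexposed, columns: diseased / not diseased
n_exposed_diseased, n_exposed_healthy = 30, 970
n_unexposed_diseased, n_unexposed_healthy = 10, 990

risk_exposed = n_exposed_diseased / (n_exposed_diseased + n_exposed_healthy)
risk_unexposed = n_unexposed_diseased / (n_unexposed_diseased + n_unexposed_healthy)
risk_ratio = risk_exposed / risk_unexposed                     # = 3.0

odds_ratio = (n_exposed_diseased * n_unexposed_healthy) / (
    n_exposed_healthy * n_unexposed_diseased)                  # ~3.06

print(f"risk ratio={risk_ratio:.2f}  odds ratio={odds_ratio:.2f}")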
    LR+ = P(T+ | D+) / P(T+ | D−)    (5)
2 Indeed, thankfully, many diseases have a prevalence much lower than 50%, e.g. 1%,
which is already considered a frequent disease. Therefore, in order to have a sufficient
number of diseased individuals in the sample without dramatically increasing the
cost of the study, diseased participants will be oversampled. One extreme example,
but very common in medical research, is a case-control study where the number of
diseased and healthy individuals is equal.
Using the expressions in Box 1 and the fact that P(T+ | D−) = 1 − P(T− | D−), the LR+ can be written as:

    LR+ = Sensitivity / (1 − Specificity)    (6)
    LR+ = O(D+ | T+) / O(D+)    (9)

Indeed, O(D+ | T+) = P(D+ | T+) / (1 − P(D+ | T+)) = P(D+ | T+) / P(D− | T+), and O(D+) = P(D+) / (1 − P(D+)) = P(D+) / P(D−).
O(D+) is called the pre-test odds (the odds of having the disease in
the absence of test information). O(D + |T +) is called the post-test odds
(the odds of having the disease once the test result is known).
Equation 9 shows how the LR+ relates pre- and post-test odds, an
important aspect of its practical interpretation.
The factors f and (1 − f ) cancel out, and thus the expression of LR+ is
unchanged under a change of the pre-test frequency of the label (prevalence
of the test population). This is akin to odds ratios, although the likelihood ratio is not an odds ratio (and does not share all their properties; for instance
it is not symmetric).