Understanding Sources of Bias in Diagnostic Accuracy Studies
Context.—Accuracy is an important feature of any diagnostic test. There has been an increasing awareness of deficiencies in study design that can create bias in estimates of test accuracy. Many pathologists are unaware of these sources of bias.

Objective.—To explain the causes and increase awareness of several common types of bias that result from deficiencies in the design of diagnostic accuracy studies.

Data Sources.—We cite examples from the literature and provide calculations to illustrate the impact of study design features on estimates of diagnostic accuracy. In a companion article by Schmidt et al in this issue, we use these principles to evaluate diagnostic studies associated with a specific diagnostic test for risk of bias and reporting quality.

Conclusions.—There are several sources of bias that are unique to diagnostic accuracy studies. Because pathologists are both consumers and producers of such studies, it is important that they be aware of the risk of bias.

(Arch Pathol Lab Med. 2013;137:558–565; doi: 10.5858/arpa.2012-0198-RA)
Figure 1. Framework for critical appraisal of diagnostic accuracy studies. The figure provides a framework for critical appraisal of the value of a study.
A study can only have value relative to a specific clinical problem. Value is determined by internal validity and external validity, which in turn depend
on the ability to assess these factors. Assessment depends upon quality of reporting. Internal validity depends on bias and precision, which are
determined by quality of design. Abbreviation: PICO, population, index test, comparator, outcome.
Figure 2. Study applicability using the population, index test,
comparator, outcome (PICO) framework. Assessment of applicability
(external validity) depends on a point-by-point comparison of the
components of the clinical problem: the population, index test,
reference test, and outcomes.
The other determinant of study value is applicability. This is assessed by comparing the conditions of the study under evaluation (population, index test, reference test, outcomes) with those of the clinical question (Figure 2). Changes in any of these study parameters can cause changes in test accuracy. Such changes reflect true variability in test conditions and are not due to bias. For example, the accuracy of fine-needle aspiration cytology (FNAC) might depend on the experience of the pathologist. The accuracy obtained in a study with an experienced cytologist would be higher than the accuracy obtained with a relatively inexperienced cytologist. If the difference were large, the results of 2 studies conducted by pathologists with different levels of experience would not be directly comparable without accounting for the difference in experience. Because differences in methodology can lead to different outcomes, it is important for studies to fully report all of the methodology associated with both the index test and the reference test so that sources of variation can be appreciated and applicability can be evaluated.

Since differences in methodology, patients, or other factors can lead to differences in accuracy measurements, studies conducted at different sites might show different levels of accuracy due to differences in the conditions at each site. Such differences cause difficulties in comparing studies, but this is again distinct from issues of bias.

Diagnostic studies often show considerable variation in outcomes. As discussed above, there are 3 possible reasons for variation: differences in study parameters, imprecision, and bias. As an example, Figure 3 depicts the results from a recent meta-analysis on the diagnostic accuracy of FNAC for parotid gland lesions. The results show considerable variability in accuracy and are quite heterogeneous. The heterogeneity implies that the studies differ owing to differences in study design (bias) or to real differences in the study parameters (PICO). Clearly, only a subset of these studies would be likely to provide valuable information with respect to a particular clinical question (eg, What is the accuracy of fine-needle aspiration [FNA] in a 1-cm lesion, presenting in a US hospital, that appears benign on magnetic resonance imaging, is sampled with a 22-gauge needle with 4 passes, and is evaluated by a pathologist with 10 years of experience who specializes in head and neck tumors?). To make this determination, one would have to assess the reliability (risk of bias) and applicability of each study. This example illustrates the role of different types of variation in study appraisal.

Figure 3. Summary ROC curve showing heterogeneity in FNAC study accuracy. The figure shows a summary ROC curve for the diagnostic accuracy of FNAC for diagnosis of salivary gland lesions. Each circle represents a study. The size of the circle is proportional to the weight given to the study in the meta-analysis, and each circle is centered at the point corresponding to the sensitivity and specificity of the study. The figure shows considerable variability in accuracy across studies. Abbreviations: FNAC, fine-needle aspiration cytology; ROC, receiver operating characteristic.

In discussing issues of bias, we will refer to the QUADAS-2 framework.12 QUADAS-2 is a survey that is used to assess the risk of bias in diagnostic studies and is organized into 4 domains: patient selection, index test, reference test, and patient flow. QUADAS-2 is closely aligned with the PICO format and assesses risk of bias and applicability in each domain. Methodologic deficiencies can also give rise to subtle issues of applicability. We will discuss applicability that arises from design deficiencies but will not discuss applicability due to real differences in study conditions. QUADAS-2 is designed to assess the value of a study with respect to a clinical question, but is not designed to assess reporting. The STARD guidelines are designed to ensure high-quality reporting and can also be used to assess reporting quality. The assessment of reliability and applicability requires good reporting (Figure 1). QUADAS-2 and STARD therefore serve 2 related but distinct functions.

PATIENT SELECTION

Differences in patient populations can affect accuracy, and the comparison of studies conducted in different populations raises questions of applicability.

Population Selection and Applicability

Study participants are obtained from a process of selection that starts from a target population and ends with the study participants (Figure 4).
Figure 4. Population concepts. The figure shows the relationship of several different populations that are used in population descriptions. The
populations are related by a hierarchy of selection and application. Moving downwards, each population is selected from the population above. For
example, the eligible population is obtained from the source population by the application of inclusion/exclusion criteria. Moving upwards, the results
obtained with study participants are successively applied to each level, eventually reaching the target population. The results of the study can only be
applied directly to the study participants, and applicability of the results obtained from the study participants must be successively inferred for each
level to apply the results to a target population.
The study target population is conceptual and is obtained from a clinical problem (PICO). In a given study, the target population describes the patients to whom the results of the study are intended to apply. The extent to which the results of the study apply to the study target population depends on how well the actual study participants match the target population defined in the study question. The internal validity of a study depends on the applicability of the study participants to the study target population (ie, the population defined by the clinical question in the current study). Assessment of applicability requires an evaluation of each step of the selection process. For example, the study participants must be representative of the study entrants in order for the results obtained from the study participants to apply to the study entrants. External validity depends on the applicability of the study participants to other target populations (ie, to clinical questions other than those posed by the present study) as shown in Figure 2.

Spectrum Bias

It is generally easier to detect advanced disease than early-stage disease, for which the signs are often subtle and difficult to distinguish from normal (Figure 5). The key parameter in patient spectrum is the difference in the measured test parameter between the disease and nondisease cases. We would expect diagnostic accuracy to be greater in a study conducted in a population with advanced disease than in a population with less severe disease and, for this reason, studies may not be comparable if they are conducted on populations with significant differences in disease severity. Disease severity can be influenced by many factors such as the setting, referral patterns, and prior testing. All of these factors could give rise to differences in test performance that reflect actual differences in disease severity. Thus, it is important to fully describe the severity of disease in the patient population along with other factors that could be associated with disease severity. Although differences in disease severity are often referred to as "spectrum bias," we view these differences as issues of applicability because they reflect real differences in populations.

Figure 5. Illustration of disease spectrum. The figure illustrates 2 studies that differ with respect to disease spectrum. In the upper panel, the patients with disease are widely separated from those without disease, and one would expect high diagnostic accuracy in this situation. In the lower panel, the severity of disease shows overlap, and diagnostic performance would be lower than in the upper panel.
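The effect of disease spectrum on accuracy can be illustrated with a small calculation. The sketch below is ours, not from the article: it assumes the measured test parameter is normally distributed in the diseased and nondiseased groups, and it shows that widening the separation between the 2 distributions (the situation in the upper panel of Figure 5) raises both sensitivity and specificity at a fixed midpoint cutoff.

    # Hypothetical sketch: patient spectrum and diagnostic accuracy.
    # The separation between the diseased and nondiseased means plays
    # the role of disease severity; all numbers are illustrative.
    from statistics import NormalDist

    def accuracy(disease_mean, nondisease_mean, sd=1.0):
        """Sensitivity and specificity at the midpoint cutoff."""
        cutoff = (disease_mean + nondisease_mean) / 2
        sensitivity = 1 - NormalDist(disease_mean, sd).cdf(cutoff)
        specificity = NormalDist(nondisease_mean, sd).cdf(cutoff)
        return sensitivity, specificity

    print(accuracy(3.0, 0.0))  # wide separation (advanced disease): ~0.93, ~0.93
    print(accuracy(1.0, 0.0))  # overlapping spectra (early disease): ~0.69, ~0.69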
Figure 6. Effect of referral pattern on patient spectrum. The figure illustrates how prior testing can change patient spectrum and complicate diagnosis. The initial distribution of diseased and nondiseased patients is shown in (A). The initial tests remove the easy-to-diagnose cases from the distribution, which creates the distributions in (B). Additional tests create the distributions in (C). The populations in (A) and (C) are not comparable because the patient spectrum is much narrower in (C), and test accuracy would be expected to be lower in (C) than in (A).

Referral patterns are an important determinant of patient spectrum (Figure 6). Each stage in the referral process can produce diagnoses that remove cases from the initial distribution. Thus, the patient spectrum is altered by each referral. In general, one would expect the spectrum to narrow with each referral as "easy" cases are removed from the tails of the distribution. Test performance increases when the distribution is wide and, for that reason, diagnosis is more challenging at later stages than in the initial stages of the process, and a given test would be expected to be less accurate in later stages. Thus, prior testing and referral patterns can be an important factor when comparing test performance.

INDEX TEST

Diagnostic tests are often complex, multistep processes that can be performed in many different ways. For example, even for a simple procedure such as FNAC, there are many parameters involving the sample acquisition (needle size, number of passes, use of guidance techniques, experience of the aspirator, use of rapid on-site evaluation, etc), sample processing (type of stain, use of ancillary techniques), and interpretation (number of pathologists who read the slide, experience level of the pathologist, availability of clinical information, etc). Each of these factors has the potential to affect test accuracy, and one can think of each variation as a different test with different performance characteristics. As indicated above, differences in accuracy that reflect differences in test conditions are not a source of bias but do give rise to issues of comparability. For example, is the accuracy of FNAC performed with a 22-gauge needle and ultrasound guidance equivalent to the accuracy obtained with a 26-gauge needle without guidance? Because differences in test conditions have the potential to affect results, it is important for studies to fully specify the methods.

We recently conducted a meta-analysis on the accuracy of FNAC for diagnosis of salivary gland lesions and found considerable heterogeneity in the results (Figure 3).13 The question arose as to whether the variation in accuracy could be explained by differences in methodology. Unfortunately, the methods were insufficiently reported, so the effects of differences in methods could not be explored. In a subsequent study, we looked at the way in which FNAC studies described methods and found significant variation in reporting.14 These examples illustrate the possible impact of test conditions on test results and why it is vital for studies to provide detailed descriptions of methods.

REFERENCE TEST

Classification Bias

No test is perfect. Errors in the reference test cause classification bias. There are 2 types of classification bias: differential misclassification and nondifferential misclassification. In differential misclassification, the error rate is associated with the index test result. In the case of FNAC, positive FNA results may have a higher misclassification rate than negative FNA results, owing to error rates in histologic diagnosis. In nondifferential misclassification, the error rate is independent of the index test result, but this can underestimate sensitivity and specificity. An example is provided in Table 1. The magnitude of the bias depends on the disease prevalence, the accuracy of the index test, and the degree of misclassification. As shown in the example, misclassification can have significant effects. The misclassification rate can vary from site to site depending on the methodology associated with the reference test (eg, skill of the pathologist, use of ancillary tests).

Table 1. Effect of Classification Bias on Observed Sensitivity and Specificity^a

                  Perfect Reference Test     Imperfect Reference Test
                                             (10% Misclassification)
Index Test        Positive    Negative       Positive    Negative
Positive          900         100            820         180
Negative          100         900            180         820
Total             1000        1000           1000        1000
Sensitivity       0.90                       0.82
Specificity       0.90                       0.82

^a The left-hand columns show hypothetical results for an index test with 90% sensitivity and 90% specificity when evaluated by a perfect reference test. The right-hand columns show the same index test evaluated with an imperfect reference test. The imperfect reference test has a nondifferential misclassification rate of 10%. The number of true positives with the imperfect test is calculated as follows: True Positive = 900 × (1 − 0.1) + 100 × (0.1). The calculation shows that misclassifications cause observations to "move" across columns. Ten percent of the true positives are misclassified as false positives, and 10% of the false positives are misclassified as true positives. Sensitivity decreases if the actual number of true positives is higher than the number of false positives.
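The calculation in Table 1 can be verified with a few lines of code. The sketch below is ours; the counts (900 and 100 in each index-test row) and the 10% nondifferential misclassification rate are taken from the table.

    # Sketch: nondifferential reference-test misclassification
    # (reproduces Table 1). Within each index-test row, 10% of cases
    # have their reference-test label flipped, so counts "move"
    # across the reference-test columns.

    def observed_counts(tp, fp, fn, tn, error_rate):
        """Apply a nondifferential reference-test error rate to a 2 x 2 table."""
        obs_tp = tp * (1 - error_rate) + fp * error_rate  # 820
        obs_fp = fp * (1 - error_rate) + tp * error_rate  # 180
        obs_fn = fn * (1 - error_rate) + tn * error_rate  # 180
        obs_tn = tn * (1 - error_rate) + fn * error_rate  # 820
        return obs_tp, obs_fp, obs_fn, obs_tn

    # Index test with 90% sensitivity/specificity against a perfect reference.
    obs_tp, obs_fp, obs_fn, obs_tn = observed_counts(900, 100, 100, 900, 0.10)
    print(obs_tp / (obs_tp + obs_fn))  # observed sensitivity: 0.82
    print(obs_tn / (obs_tn + obs_fp))  # observed specificity: 0.82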
Misclassification can be estimated by interrater reliability studies, but such studies are rarely referenced in FNAC diagnostic accuracy studies.

Diagnostic Review Bias and Incorporation Bias

These types of bias occur when the interpretation of the reference test is not independent of the index test, which weakens the results of retrospective studies.

Diagnostic review bias occurs when the pathologist interpreting the final histopathology is aware of the FNA result. This can affect results in that a pathologist might search more carefully for evidence of cancer if the FNA result is positive, and a strong FNA result might influence the interpretation of a borderline histologic result. Clinically, while it is important to use all information when making a diagnosis, the resulting bias weakens studies of diagnostic accuracy. A rigorous study would require either reporting that the results were blinded or reviewing the cases again to obtain a blinded diagnosis. In our experience, reporting of blinding in FNAC studies is quite poor.

In some cases, the result of the index test is explicitly used as a criterion for the reference test. Incorporation bias is best exemplified by clinical laboratory testing, specifically in the evaluation of β-D-glucan for diagnosis of invasive fungal infections. Invasive fungal infections are traditionally diagnosed by culture, imaging, and biopsy. β-D-Glucan is a blood-based test that offers an opportunity for an earlier diagnosis. By the European Organisation for Research and Treatment of Cancer criteria, the gold standard for invasive fungal infections includes a positive β-D-glucan test result.15 In this case, the index test comprises part of the gold standard. In FNAC studies, incorporation bias occurs when a positive FNAC result is accepted as the gold standard, as sometimes occurs in FNAC accuracy studies of the lung and mediastinum. While this criterion may be reasonable in clinical practice, it is a source of bias in diagnostic studies.
Figure 7. Flow diagram for partial verification. The figure shows the effect of partial verification on the observed results of a diagnostic accuracy study. The assumptions are: number of cases presenting for testing, N = 1000; disease prevalence = 0.20; actual sensitivity (Sn) = 0.80; actual specificity (Sp) = 0.90; positive verification rate, a = 0.80; negative verification rate, b = 0.20. The flow diagram shows the number of cases in each category that will be observed. The bias in the observed accuracy statistics is shown in Table 2. Abbreviation: FNA, fine-needle aspiration.
PATIENT FLOW AND OUTCOMES

Partial Verification

Ideally, all those who are tested with the index test should receive verification by the reference test (gold standard). Failure to do so can cause bias in accuracy estimates and is known as partial verification bias. Partial verification can arise from different causes. A study may be designed so that positive cases are sampled more intensively than negative cases. Or a study may be designed so that all patients are referred for verification but, for various reasons, some patients do not present for verification. The first case represents a problem in design and the second represents a problem in study implementation. We discuss both types below.

Partial verification bias is common in FNAC accuracy studies, for which the usual gold standard (histopathology) is invasive or expensive. Furthermore, most of these studies are retrospective, for which cases are identified from surgery or histopathology records. Such studies fail to record the results of those patients who received the index test but who did not receive surgery and histopathologic verification.

The example in Figure 7 demonstrates the effect of partial verification bias. In 1000 cases with a disease prevalence of 20%, there is an assumed sensitivity of 80% and specificity of 90%. Positive cases (ie, those with a positive FNA result) are verified at a higher rate (80%) than negative cases (20%). The results from the example are presented in Table 2, where the observed accuracy statistics are compared to the actual accuracy (ie, the accuracy that would be obtained with full verification). The table shows that sensitivity is falsely elevated from 80% to 94% and specificity is falsely decreased from 90% to 69%. In our experience, these numbers and the associated bias are typical for FNA studies.

Table 2. Effect of Partial Verification on Accuracy Statistics^a

                  Actual Reference Test      Observed Reference Test
Index Test        Disease    No Disease      Disease    No Disease
Positive          160        80              128        64
Negative          40         720             8          144
Total             200        800             136        208
Sensitivity       0.80                       0.94
Specificity       0.90                       0.69

^a The table summarizes the results from the example in Figure 7. The observed results were obtained with partial verification and differ from the actual results (ie, the results that would have been obtained without partial verification). The example shows the bias that arises in observed results when partial verification is present.
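The flow in Figure 7 and the resulting Table 2 can be reproduced directly. The code below is our sketch; every parameter value (N = 1000, prevalence 0.20, sensitivity 0.80, specificity 0.90, verification rates 0.80 and 0.20) comes from the example.

    # Sketch: partial verification bias (reproduces Figure 7 / Table 2).
    # Only a fraction of index-positive and index-negative cases are
    # verified by the reference test; unverified cases are never observed.
    n, prevalence = 1000, 0.20
    sn, sp = 0.80, 0.90                  # actual accuracy of the index test
    verify_pos, verify_neg = 0.80, 0.20  # verification rates by index result

    diseased = n * prevalence            # 200
    nondiseased = n - diseased           # 800

    # Actual 2 x 2 table (what full verification would show).
    tp, fn = diseased * sn, diseased * (1 - sn)        # 160, 40
    tn, fp = nondiseased * sp, nondiseased * (1 - sp)  # 720, 80

    # Observed 2 x 2 table: only verified cases are counted.
    obs_tp, obs_fp = tp * verify_pos, fp * verify_pos  # 128, 64
    obs_fn, obs_tn = fn * verify_neg, tn * verify_neg  # 8, 144

    print(obs_tp / (obs_tp + obs_fn))  # observed sensitivity: ~0.94 (actual 0.80)
    print(obs_tn / (obs_tn + obs_fp))  # observed specificity: ~0.69 (actual 0.90)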
It is important to note that partial verification only creates bias when the verification rate depends on the index test result. Partial verification bias would not occur if the positive and negative cases were randomly sampled at the same rate. Thus, if verification is limited by cost considerations, one can prevent bias by changing the sampling plan to make the verification rate independent of the outcome of the index test.

Withdrawals can have a similar impact if the withdrawal rate depends on the result of the index test. Withdrawals are common and occur for a variety of reasons. For example, patients initially screened at a community clinic may go to a tertiary care hospital for follow-up. Withdrawals can have the same effect as partial verification due to design; however, the magnitude of partial verification bias is generally less when it is due to withdrawals.

Differential Verification

Obviously, partial verification bias can be eliminated by verifying all cases; however, this would not be practical or ethical for invasive procedures. An alternative is to verify the remaining cases with a different reference test (a "brass standard") such as clinical follow-up. The problem with this solution is that the accuracy of the 2 reference standards may differ, and the accuracy of the cases referred to the inferior test will suffer from classification bias. The overall accuracy estimates will be obtained from a combination of the biased and unbiased results. The resulting bias is called differential verification bias. To illustrate the effects of differential verification bias, we continue the FNA example from above but apply a different reference standard (eg, clinical follow-up) to the cases that were previously unobserved. We assume that the alternative brass standard has a 10% nondifferential misclassification rate. Differential verification can have a substantial effect, as shown in Table 3.
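Table 3 itself is not reproduced here, but the stated assumptions are enough to sketch the computation: the cases left unverified in the Figure 7 example are resolved by a brass standard with a 10% nondifferential misclassification rate, and the 2 sets of counts are pooled. The code below is our illustration of that procedure, not the authors' published table.

    # Sketch: differential verification bias. Unverified cases from the
    # partial verification example are classified by an imperfect
    # "brass standard" and pooled with the gold standard results.
    err = 0.10  # assumed brass-standard misclassification rate

    # Verified by the gold standard (from Figure 7 / Table 2).
    gold_tp, gold_fp, gold_fn, gold_tn = 128, 64, 8, 144

    # Unverified remainders, sent to the brass standard.
    un_tp, un_fp = 160 - 128, 80 - 64  # 32 diseased, 16 nondiseased (index+)
    un_fn, un_tn = 40 - 8, 720 - 144   # 32 diseased, 576 nondiseased (index-)

    # Brass standard flips 10% of reference labels in each index row.
    brass_tp = un_tp * (1 - err) + un_fp * err  # 30.4
    brass_fp = un_fp * (1 - err) + un_tp * err  # 17.6
    brass_fn = un_fn * (1 - err) + un_tn * err  # 86.4
    brass_tn = un_tn * (1 - err) + un_fn * err  # 521.6

    tp, fp = gold_tp + brass_tp, gold_fp + brass_fp  # 158.4, 81.6
    fn, tn = gold_fn + brass_fn, gold_tn + brass_tn  # 94.4, 665.6

    print(tp / (tp + fn))  # sensitivity in this sketch: ~0.63 (actual 0.80)
    print(tn / (tn + fp))  # specificity in this sketch: ~0.89 (actual 0.90)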
These examples illustrate why documentation of flows is so critical in diagnostic accuracy studies. In our experience, withdrawals are often poorly documented in FNA diagnostic accuracy studies. The impact of partial verification bias can be estimated if the flows are well documented,16 but it is better to prevent partial verification by good study design and management.

Inconclusive Results

Inconclusive results affect the applicability of a study to a population. Studies often aggregate test results or exclude