
Review Articles

Understanding Sources of Bias in Diagnostic Accuracy Studies

Robert L. Schmidt, MD, PhD, MBA; Rachel E. Factor, MD, MHS

Context.—Accuracy is an important feature of any diagnostic test. There has been an increasing awareness of deficiencies in study design that can create bias in estimates of test accuracy. Many pathologists are unaware of these sources of bias.

Objective.—To explain the causes and increase awareness of several common types of bias that result from deficiencies in the design of diagnostic accuracy studies.

Data Sources.—We cite examples from the literature and provide calculations to illustrate the impact of study design features on estimates of diagnostic accuracy. In a companion article by Schmidt et al in this issue, we use these principles to evaluate diagnostic studies associated with a specific diagnostic test for risk of bias and reporting quality.

Conclusions.—There are several sources of bias that are unique to diagnostic accuracy studies. Because pathologists are both consumers and producers of such studies, it is important that they be aware of the risk of bias.

(Arch Pathol Lab Med. 2013;137:558–565; doi: 10.5858/arpa.2012-0198-RA)

Accuracy is an important feature of any diagnostic test. Accuracy estimates play an important role in evidence-based medicine. They guide clinical decisions and are used to develop diagnostic algorithms and clinical guidelines. Poor estimates of accuracy can contribute to mistreatment, increased costs, or patient injury. Thus, it is important for accuracy estimates to be reliable.

There has been increasing awareness of deficiencies in study design and reporting in diagnostic test accuracy studies1–5 and it is now recognized that diagnostic accuracy studies are subject to unique sources of bias. Pathologists are often involved in diagnostic accuracy studies and, as specialists in test methodology, play a key role in the generation of data on diagnostic accuracy. It is important for pathologists to understand the limitations of diagnostic studies and the methodologic issues that can lead to bias in accuracy estimates.

Over the years, there have been several efforts to make researchers aware of the methodologic issues associated with diagnostic tests. In 1999, the Cochrane Diagnostic and Screening Test Methods Working Group first convened to reduce deficiencies in diagnostic test reporting. Since then, the STARD (Standards for Reporting of Diagnostic Accuracy) checklist,6,7 the QUADAS (Quality Assessment of Diagnostic Accuracy Studies) instrument,8,9 and the QAREL (Quality Appraisal of Reliability Studies)10 instrument have been introduced as evidence-based quality assessment tools for use in the systematic review of diagnostic accuracy studies. The STARD initiative alone has been adopted by more than 200 journals, spanning basic research to medicine. QUADAS has been widely adopted11 and has been cited more than 500 times. It is recommended for use in systematic reviews of diagnostic accuracy by the Agency for Healthcare Research and Quality, the Cochrane Collaboration, and the National Institute for Health and Clinical Excellence12 in the United Kingdom.

See also p 566.

The problems associated with diagnostic tests are well recognized; however, the concepts involved are often subtle and unfamiliar to many pathologists. Because they play a key role in the production and interpretation of information on diagnostic test accuracy, it is important for pathologists to understand the types of bias that arise in diagnostic accuracy studies and their impact on accuracy estimates. Accuracy estimates are increasingly obtained from meta-analysis and, as noted above, an assessment of the risk of bias is now a standard part of any review of diagnostic accuracy. Owing to the increasing emphasis on evidence-based medicine, pathologists will be required to produce or interpret findings on the risk of bias in diagnostic studies. Our objective is to provide an explanation of the common sources of bias in diagnostic studies. This information should help pathologists to identify risks of bias in diagnostic studies, to predict the impact of bias on study outcomes and, as producers of diagnostic studies, to avoid some of the methodologic issues that commonly cause bias in diagnostic studies.

Accepted for publication May 18, 2012.
From the Department of Pathology, University of Utah School of Medicine and ARUP Laboratories, Salt Lake City, Utah.
The authors have no relevant financial interest in the products or companies described in this article.
Reprints: Robert L. Schmidt, MD, PhD, MBA, Department of Pathology, University of Utah School of Medicine, 15 N Medical Dr E, Salt Lake City, UT 84112 (e-mail: [email protected]).
FRAMEWORK FOR APPRAISAL

To be useful, a study must address a clinical question. Such questions are formulated in the familiar PICO format. For a diagnostic study, the PICO parameters are population, index test (the test under examination), comparator or reference test (the gold standard), and outcomes. The value of a study is a function of its capacity to resolve a clinical question. A clinical question can arise in the context of clinical work (Can this study help me to diagnose this patient's condition?), or in meta-analysis (Can this study help to resolve the question of the meta-analytic study?). To answer a clinical question correctly, a study must provide information that is both reliable (internal validity) and applicable (external validity). Internal validity is a function of bias and precision. A framework for appraisal is presented in Figure 1.

Bias is defined as a systematic difference in an observed measurement from the true value. For example, miscalibration causes systematic measurement errors that lead to analytic bias. In the context of a diagnostic accuracy study, bias occurs when the overall estimates of sensitivity or specificity systematically deviate from the real value. If bias exists, a study would consistently overestimate or underestimate the true accuracy parameters if the study were repeated. Thus, bias is error that does not "balance out" upon repetition. Unlike bias, imprecision is a function of random error and "balances out" upon repetition. Both bias and imprecision can render measurements unreliable. When measurements are unreliable, they may fail to represent the true value of the phenomenon being measured. A study is said to lack internal validity when it fails to measure what it purports to measure. Bias and random error (imprecision) are both sources of variation that can cause a measurement to differ from the true value. Both of these sources of variation negatively affect internal validity.

Bias and precision are a function of study design. Random error (imprecision) is determined by sample size and sound experimental design. Bias can occur at several different levels. In a diagnostic accuracy study it can arise from individual test measurements (analytic bias) or from methodologic issues related to study design. Evaluation of internal validity involves an assessment of the risk of bias and the level of variability caused by imprecision. This, in turn, requires an assessment of study design features that could lead to bias or imprecision. Our discussion will focus on bias; however, it is important to recognize that internal validity is a function of both bias and imprecision. Risk of bias can only be evaluated if sufficient information about the study design is provided to allow for an assessment. Quality of reporting has a significant impact on internal validity which, in turn, is a determinant of study value.
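The distinction between bias (systematic error) and imprecision (random error) can be made concrete with a short simulation. The following is a minimal sketch of our own, not part of the original article; the true sensitivity of 0.80, the hypothetical design flaw that inflates every estimate by 0.05, and the study sizes are all assumed values chosen for illustration. Averaging over many repetitions removes the random error but not the bias.

```python
# Illustrative simulation (not from the article): repeated studies estimate a
# sensitivity whose true value is 0.80. Random error "balances out" across
# repetitions; a systematic bias of +0.05 does not.
import random

TRUE_SN = 0.80    # assumed true sensitivity
BIAS = 0.05       # assumed systematic error from a design flaw
N_DISEASED = 200  # diseased patients per study (assumption)
N_STUDIES = 1000  # hypothetical repetitions of the study

random.seed(1)

def run_study(bias):
    """One study: each diseased case tests positive with probability TRUE_SN + bias."""
    hits = sum(random.random() < TRUE_SN + bias for _ in range(N_DISEASED))
    return hits / N_DISEASED

unbiased = [run_study(0.0) for _ in range(N_STUDIES)]
biased = [run_study(BIAS) for _ in range(N_STUDIES)]

print(f"mean estimate, random error only: {sum(unbiased) / N_STUDIES:.3f}")  # ~0.80
print(f"mean estimate, with bias:         {sum(biased) / N_STUDIES:.3f}")    # ~0.85
```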

Figure 1. Framework for critical appraisal of diagnostic accuracy studies. The figure provides a framework for critical appraisal of the value of a study. A study can only have value relative to a specific clinical problem. Value is determined by internal validity and external validity, and the ability to assess these factors depends upon quality of reporting. Internal validity depends on bias and precision, which are determined by quality of design. Abbreviation: PICO, population, index test, comparator, outcome.
Figure 2. Study applicability using the population, index test,
comparator, outcome (PICO) framework. Assessment of applicability
(external validity) depends on a point-by-point comparison of the
components of the clinical problem: the population, index test,
reference test, and outcomes.

The other determinant of study value is applicability. This is assessed by comparing the conditions of the study under evaluation (population, index test, reference test, outcomes) with those of the clinical question (Figure 2). Changes in any of these study parameters can cause changes in test accuracy. Such changes reflect true variability in test conditions and are not due to bias. For example, the accuracy of fine-needle aspiration cytology (FNAC) might depend on the experience of the pathologist. The accuracy obtained in a study with an experienced cytologist would be higher than the accuracy obtained with a relatively inexperienced cytologist. If the difference were large, results from 2 studies conducted by pathologists with different levels of experience would not be comparable unless the difference in experience was taken into account. Because differences in methodology can lead to different outcomes, it is important for studies to fully report all of the methodology associated with both the index test and reference test so that sources of variation can be appreciated and applicability can be evaluated.

Since differences in methodology, patients, or other factors can lead to differences in accuracy measurements, studies conducted at different sites might show different levels of accuracy due to differences in the conditions at each site. Such differences cause difficulties in comparing studies, but this is again distinct from issues of bias. Diagnostic studies often show considerable variation in outcomes. As discussed above, there are 3 possible reasons for variation: differences in study parameters, imprecision, and bias. As an example, Figure 3 depicts the results from a recent meta-analysis on the diagnostic accuracy of FNAC for parotid gland lesions. The results show considerable variability in accuracy and are quite heterogeneous. The heterogeneity implies that the studies differ owing to differences in study design (bias) or to real differences in the study parameters (PICO). Clearly, only a subset of these studies would be likely to provide valuable information with respect to a particular clinical question (eg, What is the accuracy of fine-needle aspiration [FNA] in a 1-cm lesion presenting in a US hospital, which appears benign on magnetic resonance imaging, was sampled with a 22-gauge needle with 4 passes, and was evaluated by a pathologist with 10 years of experience who specializes in head and neck tumors?). To make this determination, one would have to assess the reliability (risk of bias) and applicability of each study. This example illustrates the role of different types of variation in study appraisal.

Figure 3. Summary ROC curve showing heterogeneity in FNAC study accuracy. The figure shows a summary ROC curve for the diagnostic accuracy of FNAC for diagnosis of salivary gland lesions. Each circle represents a study. The size of the circle is proportional to the weight given to the study in meta-analysis, and each circle is centered at the point corresponding to the sensitivity and specificity of the study. The figure shows considerable variability in accuracy across studies. Abbreviations: FNAC, fine-needle aspiration cytology; ROC, receiver operating characteristic.

In discussing issues of bias, we will refer to the QUADAS-2 framework.12 QUADAS-2 is a survey that is used to assess the risk of bias in diagnostic studies and is organized into 4 domains: patient selection, index test, reference test, and patient flow. QUADAS-2 is closely aligned with the PICO format. QUADAS-2 assesses risk of bias and applicability in each domain. Methodologic deficiencies can also give rise to subtle issues of applicability. We will discuss applicability that arises from design deficiencies but will not discuss applicability due to real differences in study conditions. QUADAS-2 is designed to assess the value of a study with respect to a clinical question, but is not designed to assess reporting. The STARD guidelines are designed to ensure high-quality reporting and can also be used to assess reporting quality. The assessment of reliability and applicability requires good reporting (Figure 1). QUADAS and STARD therefore serve 2 related but distinct functions.

PATIENT SELECTION

Differences in patient populations can affect accuracy, and the comparison of studies conducted in different populations raises questions of applicability.

Population Selection and Applicability
Figure 4. Population concepts. The figure shows the relationship of several different populations that are used in population descriptions. The
populations are related by a hierarchy of selection and application. Moving downwards, each population is selected from the population above. For
example, the eligible population is obtained from the source population by the application of inclusion/exclusion criteria. Moving upwards, the results
obtained with study participants are successively applied to each level, eventually reaching the target population. The results of the study can only be
applied directly to the study participants, and applicability of the results obtained from the study participants must be successively inferred for each
level to apply the results to a target population.

Study participants are obtained from a process of selection that starts from a target population and ends with the study participants (Figure 4). The study target population is conceptual and is obtained from a clinical problem (PICO). In a given study, the target population describes the patients for whom the results of the study are intended to apply. The extent to which the results of the study apply to the study target population depends on how well the actual study participants match the target population defined in the study question. The internal validity of a study depends on the applicability of the study participants to the study target population (ie, the population defined by the clinical question in the current study). Assessment of applicability requires an evaluation of each step of the selection process. For example, the study participants must be representative of the study entrants in order for the results to apply to the study entrants. External validity depends on the applicability of the study participants to other target populations (ie, to clinical questions other than those posed by the present study) as shown in Figure 2.

Spectrum Bias

It is generally easier to detect advanced disease than early-stage disease, for which the signs are often subtle and difficult to distinguish from normal (Figure 5). The key parameter in patient spectrum is the difference in the measured test parameter between the disease and nondisease cases. We would expect diagnostic accuracy to be greater in a study conducted in a population with advanced disease than in a population with less severe disease and, for this reason, studies may not be comparable if they are conducted on populations with significant differences in disease severity. Disease severity can be influenced by many factors such as the setting, referral patterns, and prior testing. All of these factors could give rise to differences in test performance that reflect actual differences in disease severity. Thus, it is important to fully describe the severity of disease in the patient population along with other factors that could be associated with disease severity. Although differences in disease severity are often referred to as "spectrum bias," we view these differences as issues of applicability because they reflect real differences in populations.

Figure 5. Illustration of disease spectrum. The figure illustrates 2 studies that differ with respect to disease spectrum. In the upper panel, the patients with disease are widely separated from those without disease and one would expect high diagnostic accuracy in this situation. In the lower panel, the severity of disease shows overlap and diagnostic performance would be lower than in the upper panel.
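The effect illustrated in Figure 5 can be quantified. The sketch below is our illustration, not part of the original article: it models the measured test parameter as normally distributed in diseased and nondiseased patients (the means, standard deviations, and cutoff are all assumed values) and computes sensitivity and specificity at a fixed cutoff. Widening the separation between the distributions raises accuracy with no change in the test itself.

```python
# Illustrative sketch (assumed values, not data from the article): sensitivity
# and specificity at a fixed cutoff depend on how far apart the diseased and
# nondiseased score distributions lie (the patient spectrum).
from statistics import NormalDist

non_diseased = NormalDist(mu=0.0, sigma=1.0)  # assumed nondiseased scores
cutoff = 2.0                                  # assumed decision threshold

for label, mu in [("advanced disease (wide separation)", 4.0),
                  ("early disease (overlapping)", 2.5)]:
    diseased = NormalDist(mu=mu, sigma=1.0)   # assumed diseased scores
    sn = 1 - diseased.cdf(cutoff)             # P(score > cutoff | disease)
    sp = non_diseased.cdf(cutoff)             # P(score <= cutoff | no disease)
    print(f"{label}: Sn = {sn:.2f}, Sp = {sp:.2f}")
```

With these assumptions, the widely separated population yields a sensitivity near 0.98, while the overlapping population yields about 0.69 at the same specificity, consistent with the qualitative point of the figure.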

Figure 6. Effect of referral pattern on patient spectrum. The figure illustrates how prior testing can change patient spectrum and complicate diagnosis. The initial distribution of diseased and nondiseased patients is shown in (A). The initial tests remove the easy-to-diagnose cases from the distribution, which creates the distributions in (B). Additional tests create the distributions in (C). The populations in (A) and (C) are not comparable because the patient spectrum is much narrower in (C), and test accuracy would be expected to be lower in (C) than in (A).

Referral patterns are an important determinant of patient spectrum (Figure 6). Each stage in the referral process can produce diagnoses that remove cases from the initial distribution. Thus, the patient spectrum is altered by each referral. In general, one would expect the spectrum to narrow with each referral as "easy" cases are removed from the tails of the distribution. Test performance increases when the distribution is wide and, for that reason, diagnosis is more challenging at later stages than in the initial stages of the process, and a given test would be expected to be less accurate in later stages. Thus, prior testing and referral patterns can be an important factor when comparing test performance.

INDEX TEST

Diagnostic tests are often complex, multistep processes that can be performed in many different ways. For example, even for a simple procedure such as FNAC, there are many parameters involving the sample acquisition (needle size, number of passes, use of guidance techniques, experience of the aspirator, use of rapid on-site evaluation, etc), sample processing (type of stain, use of ancillary techniques), and interpretation (number of pathologists who read the slide, experience level of the pathologist, availability of clinical information, etc). Each of these factors has the potential to affect test accuracy, and one can think of each variation as a different test with different performance characteristics. As indicated above, differences in accuracy that reflect differences in test conditions are not a source of bias but do give rise to issues of comparability. For example, is the accuracy of FNAC performed with a 22-gauge needle and ultrasound guidance equivalent to the accuracy obtained with a 26-gauge needle without guidance? Because differences in test conditions have the potential to affect results, it is important for studies to fully specify the methods.

We recently conducted a meta-analysis on the accuracy of FNAC for diagnosis of salivary gland lesions and found considerable heterogeneity in the results (Figure 3).13 The question arose as to whether the variation in accuracy could be explained by differences in methodology. Unfortunately, the methods were insufficiently reported, so the effects of differences in methods could not be explored. In a subsequent study, we looked at the way in which FNAC studies described methods and found significant variation in reporting.14 These examples illustrate the possible impact of test conditions on test results and why it is vital for studies to provide detailed descriptions of methods.

REFERENCE TEST

Classification Bias

No test is perfect. Errors in the reference test cause classification bias. There are 2 types of classification bias: differential misclassification and nondifferential misclassification. In differential misclassification, the error rate is associated with the index test result. In the case of FNAC, positive FNA results may have a higher misclassification rate than negative FNA results, owing to error rates in histologic diagnosis. In nondifferential misclassification, the error rate is independent of the index test result, but it can still cause sensitivity and specificity to be underestimated. An example is provided in Table 1. The magnitude of the bias depends on the disease prevalence, the accuracy of the index test, and the degree of misclassification. As shown in the example, misclassification can have significant effects. The misclassification rate can vary from site to site depending on the methodology associated with the reference test (eg, skill of the pathologist, use of ancillary tests). Misclassification can be estimated by interrater reliability studies, but such studies are rarely referenced in FNAC diagnostic accuracy studies.

Table 1. Effect of Classification Bias on Observed Sensitivity and Specificity^a

              Perfect Reference Test      Imperfect Reference Test (10% Misclassification)
Index Test    Positive      Negative      Positive      Negative
Positive      900           100           820           180
Negative      100           900           180           820
Total         1000          1000          1000          1000
Sensitivity   0.90                        0.82
Specificity   0.90                        0.82

^a The left-hand columns show hypothetical results for an index test with 90% sensitivity and 90% specificity when evaluated by a perfect reference test. The right-hand columns show the same index test evaluated with an imperfect reference test. The imperfect reference test has a nondifferential misclassification rate of 10%. The number of true positives in the imperfect test is calculated as follows: True Positives = 900 × (1 − 0.1) + 100 × (0.1) = 820. The calculation shows that misclassifications cause observations to "move" across columns. Ten percent of the true positives are misclassified as false positives, and 10% of the false positives are misclassified as true positives. Sensitivity decreases if the actual number of true positives is higher than the number of false positives.
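The arithmetic behind Table 1 can be written out directly. The following minimal sketch is our rendering of the table's calculation, not code from the article; it applies the 10% nondifferential misclassification rate to each row of the perfect-reference table and recomputes the observed accuracy.

```python
# Reproduces the arithmetic of Table 1: a 10% nondifferential misclassification
# rate in the reference test moves 10% of each row across reference columns.
def observed_row(ref_pos, ref_neg, err):
    """Observed counts in one index-test row after reference-test errors."""
    return (ref_pos * (1 - err) + ref_neg * err,
            ref_neg * (1 - err) + ref_pos * err)

ERR = 0.10
tp, fp = observed_row(900, 100, ERR)  # index-positive row -> 820, 180
fn, tn = observed_row(100, 900, ERR)  # index-negative row -> 180, 820

print(f"observed sensitivity: {tp / (tp + fn):.2f}")  # 0.82 (true value 0.90)
print(f"observed specificity: {tn / (tn + fp):.2f}")  # 0.82 (true value 0.90)
```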

Diagnostic Review Bias and Incorporation Bias

These types of bias occur when the interpretation of the reference test is not independent of the index test, which weakens the results of retrospective studies.

Diagnostic review bias occurs when the pathologist interpreting the final histopathology is aware of the FNA result. This can affect results in that a pathologist might search more carefully for evidence of cancer if the FNA result is positive, and a strong FNA result might influence the interpretation of a borderline histologic result. Clinically, while it is important to use all information when making a diagnosis, the bias that results weakens studies of diagnostic accuracy. A rigorous study would require either reporting that the results were blinded or rereviewing the cases to obtain a blinded diagnosis. In our experience, reporting of blinding in FNAC studies is quite poor.

In some cases, the result of the index test is explicitly used as a criterion for the reference test. Incorporation bias is best exemplified by clinical laboratory testing, specifically the evaluation of β-D-glucan for diagnosis of invasive fungal infections. Invasive fungal infections are traditionally diagnosed by culture, imaging, and biopsy. β-D-Glucan is a blood-based test that offers an opportunity for an earlier diagnosis. By the European Organisation for Research and Treatment of Cancer criteria, the gold standard for invasive fungal infections includes a positive β-D-glucan test result.15 In this case, the index test comprises part of the gold standard. In FNAC studies, incorporation bias occurs when a positive FNAC result is accepted as the gold standard, as sometimes occurs in FNAC accuracy studies of the lung and mediastinum. While this criterion may be reasonable in clinical practice, it is a source of bias in diagnostic studies.

Figure 7. Flow diagram for partial verification. The figure shows the effect of partial verification on the observed results of a diagnostic accuracy study. The assumptions are: number of cases presenting for testing, N = 1000; disease prevalence = 0.20; actual sensitivity (Sn) = 0.80; actual specificity (Sp) = 0.90; positive verification rate = 0.80; negative verification rate = 0.20. The flow diagram shows the number of cases in each category that will be observed. The bias in the observed accuracy statistics is shown in Table 2. Abbreviation: FNA, fine-needle aspiration.
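The flow in Figure 7 can be traced in a few lines. The sketch below is ours, not the authors'; it uses only the assumptions stated in the figure and reproduces the biased estimates reported in Table 2 (sensitivity inflated to 0.94, specificity depressed to 0.69).

```python
# Traces the partial-verification example of Figure 7 / Table 2 using only the
# assumptions stated in the figure.
N, PREV = 1000, 0.20          # cases presenting; disease prevalence
SN, SP = 0.80, 0.90           # actual sensitivity and specificity
V_POS, V_NEG = 0.80, 0.20     # verification rates for FNA+ and FNA- cases

diseased = N * PREV                                  # 200
tp, fn = diseased * SN, diseased * (1 - SN)          # 160, 40
tn = (N - diseased) * SP                             # 720
fp = (N - diseased) * (1 - SP)                       # 80

# Only verified cases reach the 2 x 2 table; FNA-positive cases are verified
# far more often than FNA-negative cases.
otp, ofp = tp * V_POS, fp * V_POS                    # 128, 64
ofn, otn = fn * V_NEG, tn * V_NEG                    # 8, 144

print(f"actual   Sn = {tp / (tp + fn):.2f}, Sp = {tn / (tn + fp):.2f}")        # 0.80, 0.90
print(f"observed Sn = {otp / (otp + ofn):.2f}, Sp = {otn / (otn + ofp):.2f}")  # 0.94, 0.69
```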
PATIENT FLOW AND OUTCOMES

Partial Verification

Ideally, all those who are tested with the index test should receive verification by the reference test (gold standard). Failure to do so can cause bias in accuracy estimates and is known as partial verification bias. Partial verification can arise from different causes. A study may be designed so that positive cases are sampled more intensively than negative cases. Or a study may be designed so that all patients are referred for verification but, for various reasons, some patients do not present for verification. The first case represents a problem in design and the second represents a problem in study implementation. We discuss both types below.

Partial verification bias is common in FNAC accuracy studies, for which the usual gold standard (histopathology) is invasive or expensive. Furthermore, most of these studies are retrospective, with cases identified from surgery or histopathology records. Such studies fail to record the results of those patients who received the index test but who did not receive surgery and histopathologic verification.

The example in Figure 7 demonstrates the effect of partial verification bias. In 1000 cases with a disease prevalence of 20%, there is an assumed sensitivity of 80% and specificity of 90%. Positive cases (ie, those with a positive FNA result) are verified at a higher rate (80%) than negative cases (20%). The results from the example are presented in Table 2, where the observed accuracy statistics are compared to the actual accuracy (ie, the accuracy that would be obtained with full verification). The table shows that sensitivity is falsely elevated from 80% to 94% and specificity is falsely decreased from 90% to 69%. In our experience, these numbers and the associated bias are typical for FNA studies.

Table 2. Effect of Partial Verification on Accuracy Statistics^a

              Actual                     Observed
              Reference Test             Reference Test
Index Test    Disease    No Disease      Disease    No Disease
Positive      160        80              128        64
Negative      40         720             8          144
Total         200        800             136        208
Sensitivity   0.80                       0.94
Specificity   0.90                       0.69

^a The table summarizes the results from the example in Figure 7. The observed results were obtained with partial verification and differ from the actual results (ie, the results that would have been obtained without partial verification). The example shows the bias that arises in observed results when partial verification is present.

It is important to note that partial verification only creates bias when the verification rate depends on the index test result. Partial verification bias would not occur if the positive and negative cases were randomly sampled at the same rate. Thus, if verification is limited by cost considerations, one can prevent bias by changing the sampling plan to make the verification rate independent of the outcome of the index test.

Withdrawals can have a similar impact if the withdrawal rate depends on the result of the index test. Withdrawals are common and occur for a variety of reasons. For example, patients initially screened at a community clinic may go to a tertiary care hospital for follow-up. Withdrawals can have the same effect as partial verification due to design; however, the magnitude of the bias is generally less when it is due to withdrawals.

Differential Verification

Obviously, partial verification bias can be eliminated by verifying all cases; however, this would not be practical or ethical for invasive procedures. An alternative is to verify the remaining cases with a different reference test (a "brass standard") such as clinical follow-up. The problem with this solution is that the accuracy of the 2 reference standards may differ, and the accuracy estimates for the cases referred to the inferior test will suffer from classification bias. The overall accuracy estimates will be obtained from a combination of the biased and unbiased results. The resulting bias is called differential verification bias. To illustrate the effects of differential verification bias, we continue the FNA example from above but apply a different reference standard (eg, clinical follow-up) to the cases that were previously unobserved. We assume that the alternative brass standard has a 10% nondifferential misclassification rate. Differential verification can have a substantial effect, as shown in Table 3.

Table 3. Effect of Differential Verification^a

              Observed Cases          Unobserved              Observed Cases          Combined
              (Gold Standard)         Cases                   (Brass Standard)        (Gold and Brass)
              Reference Test          Reference Test          Reference Test          Reference Test
Index Test    Positive   Negative     Positive   Negative     Positive   Negative     Positive   Negative
Positive      128        8            32         16           30.4       17.6         158        26
Negative      8          144          32         576          86.4       521.6        94         666
Total         136        152          64         592          116.8      539.2        252        692
Sensitivity   0.94                    0.50                    0.26                    0.63
Specificity   0.95                    0.97                    0.97                    0.96

^a The table is a continuation of the example in Figure 7 and Table 2. In the previous example, a significant number of cases were not verified (unobserved cases). The unobserved cases were mainly composed of fine-needle aspiration–negative cases. In this example, the unobserved cases are verified by an alternative reference standard (clinical follow-up). We assume that the alternative standard has a nondifferential misclassification rate of 10%.
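The combination step can likewise be checked by calculation. The sketch below is ours, not from the article; it applies the assumed 10% nondifferential misclassification rate to the unobserved cases and combines them with the gold-standard cells as printed in Table 3, recovering the combined estimates of 0.63 and 0.96.

```python
# Continues the example: the unobserved cases are verified with a "brass
# standard" (clinical follow-up) carrying a 10% nondifferential
# misclassification rate, then combined with the gold-standard cases.
def brass_row(ref_pos, ref_neg, err=0.10):
    """Apply nondifferential reference error within one index-test row."""
    return (ref_pos * (1 - err) + ref_neg * err,
            ref_neg * (1 - err) + ref_pos * err)

b_tp, b_fp = brass_row(32, 16)    # unobserved index-positive row -> 30.4, 17.6
b_fn, b_tn = brass_row(32, 576)   # unobserved index-negative row -> 86.4, 521.6

g_tp, g_fp, g_fn, g_tn = 128, 8, 8, 144  # gold-standard cells as printed in Table 3

tp, fp = g_tp + b_tp, g_fp + b_fp
fn, tn = g_fn + b_fn, g_tn + b_tn
print(f"combined Sn = {tp / (tp + fn):.2f}")  # 0.63
print(f"combined Sp = {tn / (tn + fp):.2f}")  # 0.96
```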
These examples illustrate why documentation of flows is so critical in diagnostic accuracy studies. In our experience, withdrawals are often poorly documented in FNA diagnostic accuracy studies. The impact of partial verification bias can be estimated if the flows are well documented,16 but it is better to prevent partial verification by good study design and management.

Inconclusive Results

Inconclusive results affect the applicability of one study to a population. Studies often aggregate test results or exclude results in particular categories. In FNA studies, common diagnostic categories include inadequate, negative for malignancy, atypical, suspicious, and positive for malignancy. As a first step, it is important that an article provide definitions for each of the indeterminate categories. Second, to maintain applicability, it is important that researchers report all results before aggregating results into categories. We often see articles in which results are grouped in different ways. For example, one article may count inadequate results in accuracy calculations and another article may exclude them. The different assumptions may be valid in the context of individual articles, but may not be applicable to other study populations.

It should be noted that the magnitude of the indeterminate rate can also affect applicability. The indeterminate rate can show significant variation between study sites. Differences in the indeterminate rate can reflect differences in criteria, differences in the sample population, or differences in methodology. Paradoxically, a study in which a cytopathologist classifies 15% of cases as "indeterminate" may show better accuracy than a study with an indeterminate rate of 1%, because the study with the high rate is only making a diagnosis on the easy cases.

SUMMARY

We have explained the basis of several common types of bias that are unique to diagnostic studies. Our objective has been to provide a framework to assist consumers of diagnostic accuracy studies in critically appraising results and to assist producers of diagnostic accuracy studies in avoiding many common sources of bias. It is important to recognize that no study is perfect and that bias and applicability are a matter of degree. Assessment of a study depends on quality of reporting. One cannot assess risk of bias or applicability of a study unless the details of the population, methods, and outcomes are fully reported. Thus, high-quality reporting is vital. These issues are likely to become more important in the future as evidence-based medicine increasingly relies upon systematic reviews and meta-analysis to study test performance.

References

1. Smidt N, Rutjes AWS, van der Windt DAWM, et al. The quality of diagnostic accuracy studies since the STARD statement: has it improved? Neurology. 2006;67(5):792–797.
2. Smidt N, Rutjes AWS, van der Windt DAWM, et al. Quality of reporting of diagnostic accuracy studies. Radiology. 2005;235(2):347–353.
3. Whiting P, Rutjes AWS, Dinnes J, Reitsma JB, Bossuyt PMM, Kleijnen J. A systematic review finds that diagnostic reviews fail to incorporate quality despite available tools. J Clin Epidemiol. 2005;58(1):1–12.
4. Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests [erratum in JAMA. 2000;283(15):1963]. JAMA. 1999;282(11):1061–1066.
5. Whiting P, Rutjes AWS, Reitsma JB, Glas AS, Bossuyt PMM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med. 2004;140(3):189–202.
6. Bossuyt PM, Reitsma JB, Bruns DE, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative—standards for reporting of diagnostic accuracy. Clin Chem. 2003;49(1):1–6.
7. Bossuyt PM, Reitsma JB, Bruns DE, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Clin Chem. 2003;49(1):7–18.
8. Whiting P, Rutjes AWS, Reitsma JB, Bossuyt PMM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol. 2003;3:25.
9. Whiting PF, Weswood ME, Rutjes AWS, Reitsma JB, Bossuyt PNM, Kleijnen J. Evaluation of QUADAS, a tool for the quality assessment of diagnostic accuracy studies. BMC Med Res Methodol. 2006;6:9.
10. Lucas NP, Macaskill P, Irwig L, Bogduk N. The development of a quality appraisal tool for studies of diagnostic reliability (QAREL). J Clin Epidemiol. 2010;63(8):854–861.
11. Willis BH, Quigley M. Uptake of newer methodological developments and the deployment of meta-analysis in diagnostic test research: a systematic review. BMC Med Res Methodol. 2011;11:27.
12. Whiting PF, Rutjes AWS, Westwood ME, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529–536.
13. Schmidt RL, Hall BJ, Wilson AR, Layfield LJ. A systematic review and meta-analysis of the diagnostic accuracy of fine-needle aspiration cytology for parotid gland lesions. Am J Clin Pathol. 2011;136(1):45–59.
14. Schmidt RL, Factor RE, Affolter KE, et al. Methods specification for diagnostic test accuracy studies in fine-needle aspiration cytology: a survey of reporting practice. Am J Clin Pathol. 2012;137(1):132–141.
15. De Pauw B, Walsh TJ, Donnelly JP, et al. Revised definitions of invasive fungal disease from the European Organization for Research and Treatment of Cancer/Invasive Fungal Infections Cooperative Group and the National Institute of Allergy and Infectious Diseases Mycoses Study Group (EORTC/MSG) Consensus Group. Clin Infect Dis. 2008;46(12):1813–1821.
16. Zhou X-H, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. 2nd ed. Hoboken, NJ: John Wiley & Sons; 2011.
