How does multiple testing correction work?
Consider a shotgun proteomics experiment designed to identify proteins involved in a particular biological process. The experiment successfully identifies most of the proteins that you already know to be involved in the process and implicates a few more. Each of these novel candidates will need to be verified with a follow-up assay. How do you decide how many candidates to pursue?

The answer lies in the tradeoff between the cost associated with a false positive versus the benefit of identifying a novel participant in the biological process that you are studying. False positives tend to be particularly problematic in genomic or proteomic studies where many candidates must be statistically tested. Such studies may include identifying genes that are differentially expressed on the basis of microarray or RNA-Seq experiments, scanning a genome for occurrences of candidate transcription factor binding sites, searching a protein database for homologs of a query protein or evaluating the results of a genome-wide association study. In a nutshell, the property that makes these experiments so attractive—their massive scale—also creates many opportunities for spurious discoveries, which must be guarded against.

In assessing the cost-benefit tradeoff, it is helpful to associate with each discovery a statistical confidence measure. These measures may be stated in terms of P-values, false discovery rates or q-values. The goal of this article is to provide an intuitive understanding of these confidence measures, a sense for how they are computed and some guidance for deciding which measure to use in a given experiment.

As a motivating example, suppose that you are studying CTCF, a highly conserved zinc-finger DNA-binding protein that exhibits diverse regulatory functions and that may play a major role in the global organization of the chromatin architecture of the human genome (ref. 1). To better understand this protein, you want to identify candidate CTCF binding sites in human chromosome 21. Using a previously published model of the CTCF binding motif (Fig. 1a; ref. 2), each 20 nucleotide (nt) sub-sequence of chromosome 21 can be scored for its similarity to the CTCF motif. Considering both DNA strands, there are 68 million such subsequences. Figure 1b lists the top 20 scores from such a search.

Interpreting scores with the null hypothesis and the P-value
How biologically meaningful are these scores? One way to answer this question is to assess the probability that a particular score would occur by chance. This probability can be estimated by defining a 'null hypothesis' that represents, essentially, the scenario that we are not interested in (that is, the random occurrence of 20 nucleotides that match the CTCF binding site).

The first step in defining the null hypothesis might be to shuffle the bases of chromosome 21. After this shuffling procedure, high-scoring occurrences of the CTCF motif will only appear because of random chance. Then, the shuffled chromosome can be rescanned with the same CTCF matrix. Performing this procedure results in the distribution of scores shown in Figure 1c. Although it is not visible in Figure 1c, out of the 68 million 20-nt sequences in the shuffled chromosome, only one had a score ≥26.30. In statistics, we say that the probability of observing a score ≥26.30 under the null model is approximately 1.5 × 10^-8 (1 out of 68 million). This probability—the probability that a score at least as large as the observed score would occur in data drawn according to the null hypothesis—is called the P-value.

Likewise, the P-value of a candidate CTCF binding site with a score of 17.0 is equal to the percentage of scores in the null distribution that are ≥17.0. Among the 68 million null scores shown in Figure 1c, 35 are ≥17.0, leading to a P-value of 5.5 × 10^-7 (35/68 million). The P-value associated with score x corresponds to the area under the null distribution to the right of x (Fig. 1d).

Shuffling the human genome and rescanning with the CTCF motif is an example of an 'empirical null model'. Such an approach can be inefficient because a large number of scores must be computed. In some cases, however, it is possible to analytically calculate the form of the null distribution and calculate corresponding P-values (that is, by defining the null distribution with mathematical formulae rather than by estimating it from measured data).

In the case of scanning for CTCF motif occurrences, an analytic null distribution (gray line in Fig. 1d) can be calculated using a dynamic programming algorithm, assuming that the sequence being scanned is generated randomly with a specified frequency of each of the four nucleotides (ref. 3). This distribution allows us to compute, for example, that the P-value associated with the top score in Figure 1b is 2.3 × 10^-10 (compared to 1.5 × 10^-8 under the empirical null model). This P-value is more accurate and much cheaper to compute than the P-value estimated from the empirical null model.

William S. Noble is at the Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington, Seattle, Washington, USA. e-mail: [email protected]
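The shuffle-and-rescan procedure described above can be sketched in a few lines of Python. Everything here is a toy stand-in for illustration: the 4-position weight matrix, the sequence length and the score threshold are made up, not the published 20-position CTCF model.

```python
import random

# Toy position weight matrix (PWM): a log-odds score for each base at each
# motif position. Illustrative only -- not the real CTCF model.
PWM = {
    'A': [ 1.2, -0.5, -1.0,  0.3],
    'C': [-0.8,  1.5, -0.2, -1.1],
    'G': [-0.3, -0.9,  1.4,  0.8],
    'T': [-1.0, -0.4, -0.7, -1.3],
}
MOTIF_LEN = 4

def score(window):
    """Log-odds similarity of one subsequence to the motif."""
    return sum(PWM[base][i] for i, base in enumerate(window))

def scan(seq):
    """Score every length-MOTIF_LEN window of seq."""
    return [score(seq[i:i + MOTIF_LEN]) for i in range(len(seq) - MOTIF_LEN + 1)]

def empirical_p_value(observed_score, seq, rng):
    """P-value = fraction of null (shuffled-sequence) scores >= observed."""
    shuffled = list(seq)
    rng.shuffle(shuffled)                  # shuffling destroys real motif occurrences
    null_scores = scan(''.join(shuffled))  # empirical null distribution
    hits = sum(1 for s in null_scores if s >= observed_score)
    return hits / len(null_scores)

rng = random.Random(0)
genome = ''.join(rng.choice('ACGT') for _ in range(10000))  # hypothetical sequence
p = empirical_p_value(observed_score=4.0, seq=genome, rng=rng)
print(p)
```

On chromosome-scale data this naive loop would be far too slow; the point is only the logic: the empirical P-value is the fraction of null scores at least as large as the observed one.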
[Figure 1a: sequence logo of the CTCF binding motif (graphic not reproducible in text).]

Figure 1b: the 20 top-scoring occurrences of the CTCF motif in human chromosome 21.

Rank  Position   Str  Sequence              Score
1     19390631   +    TTGACCAGCAGGGGGCGCCG  26.30
2     32420105   +    CTGGCCAGCAGAGGGCAGCA  26.30
3     27910537   −    CGGTGCCCCCTGCTGGTCAG  26.18
4     21968106   +    GTGACCACCAGGGGGCAGCA  25.81
5     31409358   +    CGGGCCTCCAGGGGGCGCTC  25.56
6     19129218   −    TGGCGCCACCTGCTGGTCAC  25.44
7     21854623   +    CTGGCCAGCAGAGGGCAGGG  24.95
8     12364895   +    CCCGCCAGCAGAGGGAGCCG  24.71
9     13406383   +    CTAGCCACCAGGTGGCGGTG  24.71
10    18613020   +    CCCGCCAGCAGAGGGAGCCG  24.71
11    31980801   +    ACGCCCAGCAGGGGGCGCCG  24.71
12    32909754   −    TGGCTCCCCCTGGCGGCCGG  24.71
13    25683654   +    TCGGCCACTAGGGGGCACTA  24.58
14    31116990   −    GGCCGCCACCTTGTGGCCAG  24.58
15    29615421   −    CTCTGCCCTCTGGTGGCTGC  24.46
16    6024389    +    GTTGCCACCAGAGGGCACTA  24.46
17    26610753   −    CACTGCCCTCTGCTGGCCCA  24.34
18    26912791   −    GGGCGCCACCTGGCGGTCAC  24.34
19    20446267   +    CTGCCCACCAGGGGGCAGCG  24.22
20    21872506   −    TGGCGCCACCTGGCGGCAGC  24.22

[Figure 1c–e: histograms of motif scores (frequency versus score) showing the empirical null distribution, the analytic null and the observed scores; see the caption below.]
Figure 1 Associating confidence measures with CTCF binding motifs scanned along human chromosome 21. (a) The binding preference of CTCF2
represented as a sequence logo9, in which the height of each letter is proportional to the information content at that position. (b) The 20 top-scoring
occurrences of the CTCF binding site in human chromosome 21. Coordinates of the starting position of each occurrence are given with respect to
human genome assembly NCBI 36.1. (c) A histogram of scores produced by scanning a shuffled version of human chromosome 21 with the CTCF motif.
(d) This panel zooms in on the right tail of the distribution shown in c. The blue histogram is the empirical null distribution of scores observed from
scanning a shuffled chromosome. The gray line is the analytic distribution. The P-value associated with an observed score of 17.0 is equal to the area
under the curve to the right of 17.0 (shaded pink). (e) The false discovery rate is estimated from the empirical null distribution for a score threshold
of 17.0. There are 35 null scores >17.0 and 519 observed scores >17.0, leading to an estimate of 6.7%. This procedure assumes that the number of
observed scores equals the number of null scores.
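The estimate described in panel e of Figure 1 reduces to a ratio of two tail counts. A minimal sketch, in which the score lists in the comments are hypothetical placeholders:

```python
def estimated_fdr(observed_scores, null_scores, threshold):
    """Simple FDR estimate: (# null scores >= t) / (# observed scores >= t).

    Assumes the total number of observed scores equals the total number of
    null scores, as in the Figure 1e procedure.
    """
    s_null = sum(1 for s in null_scores if s >= threshold)
    s_obs = sum(1 for s in observed_scores if s >= threshold)
    if s_obs == 0:
        return 0.0  # nothing passes the threshold, so no false discoveries
    return s_null / s_obs

# With the counts from Figure 1e -- 35 null scores and 519 observed scores
# past a threshold of 17.0 -- the estimate is 35/519, roughly 6.7%.
```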
In practice, determining whether an observed score is statistically significant requires comparing the corresponding statistical confidence measure (the P-value) to a confidence threshold α. For historical reasons, many studies use thresholds of α = 0.01 or α = 0.05, though there is nothing magical about these values. The choice of the significance threshold depends on the costs associated with false positives and false negatives, and these costs may differ from one experiment to the next.

Why P-values are problematic in a high-throughput experiment
Unfortunately, in the context of an experiment that produces many scores, such as scanning a chromosome for CTCF binding sites, reporting a P-value is inappropriate. This is because the P-value is only statistically valid when a single score is computed. For instance, if a single 20-nt sequence had been tested as a match to the CTCF binding site, rather than scanning all of chromosome 21, the P-value could be used directly as a statistical confidence measure.

In contrast, in the example above, 68 million 20-nt sequences were tested. In the case of a score of 17.0, even though it is associated with a seemingly small P-value of 5.5 × 10^-7 (the chance of obtaining such a P-value from null data is less than one in a million), scores of 17.0 or larger were in fact observed in a scan of the shuffled genome, owing to the large number of tests performed. We therefore need a 'multiple testing correction' procedure to adjust our statistical confidence measures based on the number of tests performed.

Correcting for multiple hypothesis tests
Perhaps the simplest and most widely used method of multiple testing correction is the Bonferroni adjustment. If a significance threshold of α is used, but n separate tests are performed, then the Bonferroni adjustment deems a score significant only if the corresponding P-value is ≤α/n. In the CTCF example, we considered 68 million distinct 20-mers as candidate CTCF sites, so achieving statistical significance at α = 0.01 according to the Bonferroni criterion would require a P-value <0.01/(68 × 10^6) = 1.5 × 10^-10. Because the smallest observed P-value in Figure 1b is 2.3 × 10^-10, no scores are deemed significant after correction.

The Bonferroni adjustment, when applied using a threshold of α to a collection of n scores, controls the 'family-wise error rate'. That is, the adjustment ensures that, for a given score threshold, one or more larger scores would be expected to be observed in the null distribution with a probability of α. Practically speaking, this means that, given a set of CTCF sites with a Bonferroni-adjusted significance threshold of α = 0.01, we can be 99% sure that none of the scores would be observed by chance when drawn according to the null hypothesis.

In many multiple testing settings, minimizing the family-wise error rate is too strict. Rather than saying that we want to be 99% sure that none of the observed scores is drawn according to the null, it is frequently sufficient to identify a set of scores for which a specified percentage is drawn according to the null. This is the basis of multiple testing correction using false discovery rate (FDR) estimation.
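The Bonferroni rule amounts to a one-line cutoff computation. A sketch, with made-up P-values in the usage example:

```python
def bonferroni_significant(p_values, alpha=0.01):
    """Return indices of tests that survive the Bonferroni adjustment:
    a test is significant only if its P-value is <= alpha / n,
    where n is the total number of tests performed."""
    n = len(p_values)
    cutoff = alpha / n
    return [i for i, p in enumerate(p_values) if p <= cutoff]

# With n = 68 million tests and alpha = 0.01, the per-test cutoff is
# 0.01 / 68e6, about 1.5e-10, matching the article's cutoff -- so even a
# seemingly small P-value of 5.5e-7 is rejected.
n = 68_000_000
print(0.01 / n)
```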
The simplest form of FDR estimation is illustrated in Figure 1e, again using an empirical null distribution for the CTCF scan. For a specified score threshold t = 17.0, we count the number s_obs of observed scores ≥ t and the number s_null of null scores ≥ t. Assuming that the total numbers of observed scores and null scores are equal, the estimated FDR is simply s_null/s_obs. In the case of our CTCF scan, the FDR associated with a score of 17.0 is 35/519 = 6.7%.

Note that, in Figure 1e, FDR estimates were computed directly from the score. It is also possible to compute FDRs from P-values using the Benjamini-Hochberg procedure, which relies on the P-values being uniformly distributed under the null hypothesis (ref. 4). For example, if the P-values are uniformly distributed, then the P-value 5% of the way down the sorted list should be ~0.05. Accordingly, the procedure consists of sorting the P-values in ascending order and then dividing each observed P-value by its percentile rank to get an estimated FDR. In this way, small P-values that appear far down the sorted list will result in small FDR estimates, and vice versa.

In general, when an analytic null model is available, you should use it to compute P-values and then use the Benjamini-Hochberg procedure, because the resulting estimated FDRs will be more accurate. However, if you only have an empirical null model, then there is no need to estimate P-values in an intermediate step; instead, you may directly compare your score distribution to the empirical null, as in Figure 1e.

These simple FDR estimation methods are sufficient for many studies, and the resulting estimates are provably conservative with respect to a specified null hypothesis; that is, if the simple method estimates that the FDR associated with a collection of scores is 5%, then on average the true FDR is ≤5%. However, a variety of more sophisticated methods have been developed for achieving more accurate FDR estimates (reviewed in ref. 5). Most of these methods focus on estimating a parameter π0, which represents the percentage of the observed scores that are drawn according to the null distribution. Depending on the data, applying such methods may make a big difference or almost no difference at all. For the CTCF scan, one such method (ref. 6) assigns slightly lower estimated FDRs to each observed score, but the number of sites identified at a 5% FDR threshold remains unchanged relative to the simpler method.

Complementary to the FDR, Storey (ref. 6) proposed defining the q-value as an analog of the P-value that incorporates FDR-based multiple testing correction. The q-value is motivated, in part, by a somewhat unfortunate mathematical property of the FDR: when considering a ranked list of scores, it is possible for the FDR associated with the first m scores to be higher than the FDR associated with the first m + 1 scores. For example, the FDR associated with the first 84 candidate CTCF sites in our ranked list is 0.0119, but the FDR associated with the first 85 sites is 0.0111. Unfortunately, this property (called nonmonotonicity, meaning that the FDR does not consistently get bigger) can make the resulting FDR estimates difficult to interpret. Consequently, Storey proposed defining the q-value as the minimum FDR attained at or above a given score. If we use a score threshold of T, then the q-value associated with T is the expected proportion of false positives among all of the scores above the threshold. This definition yields a well-behaved measure that is a function of the underlying score. We saw, above, that the Bonferroni adjustment yielded no significant matches at α = 0.05. If we use FDR analysis instead, then we are able to identify a collection of 519 sites at a q-value threshold of 0.05.

In general, for a fixed significance threshold and fixed null hypothesis, performing multiple testing correction by means of FDR estimation will always yield at least as many significant scores as the Bonferroni adjustment. In most cases, FDR analysis will yield many more significant scores, as in our CTCF analysis. The question naturally arises, then, whether a Bonferroni adjustment is ever appropriate.

Costs and benefits help determine the best correction method
Like choosing a significance threshold, choosing which multiple testing correction method to use depends upon the costs associated with false positives and false negatives. In particular, FDR analysis is appropriate if follow-up analyses will depend upon groups of scores. For example, if you plan to perform a collection of follow-up experiments and are willing to tolerate having a fixed percentage of those experiments fail, then FDR analysis may be appropriate. Alternatively, if follow-up will focus on a single example, then the Bonferroni adjustment is more appropriate.

It is worth noting that the statistics literature describes a related probability score, known as the 'local FDR' (ref. 7). Unlike the FDR, which is calculated with respect to a collection of scores, the local FDR is calculated with respect to a single score. The local FDR is the probability that a particular test gives rise to a false positive. In many situations, especially if we are interested in following up on a single gene or protein, this score may be precisely what is desired. However, in general, the local FDR is quite difficult to estimate accurately.

Furthermore, all methods for calculating P-values or for performing multiple testing correction assume a valid statistical model—either analytic or empirical—that captures dependencies in the data. For example, scanning a chromosome with the CTCF motif leads to dependencies among overlapping 20-nt sequences. Also, the simple null model produced by shuffling assumes that nucleotides are independent. If these assumptions are not met, we risk introducing inaccuracies in our statistical confidence measures.

In summary, in any experimental setting in which multiple tests are performed, P-values must be adjusted appropriately. The Bonferroni adjustment controls the probability of making one false positive call. In contrast, false discovery rate estimation, as summarized in a q-value, controls the error rate among a set of tests. In general, multiple testing correction can be much more complex than is implied by the simple methods described here. In particular, it is often possible to design strategies that minimize the number of tests performed for a particular hypothesis or set of hypotheses. For a more in-depth treatment of multiple testing issues, see reference 8.

Acknowledgments
National Institutes of Health award P41 RR0011823.

1. Phillips, J.E. & Corces, V.G. Cell 137, 1194–1211 (2009).
2. Kim, T.H. et al. Cell 128, 1231–1245 (2007).
3. Staden, R. Methods Mol. Biol. 25, 93–102 (1994).
4. Benjamini, Y. & Hochberg, Y. J. R. Stat. Soc. B 57, 289–300 (1995).
5. Kerr, K.F. Bioinformatics 25, 2035–2041 (2009).
6. Storey, J.D. J. R. Stat. Soc. Ser. B 64, 479–498 (2002).
7. Efron, B., Tibshirani, R., Storey, J.D. & Tusher, V. J. Am. Stat. Assoc. 96, 1151–1161 (2001).
8. Dudoit, S. & van der Laan, M.J. Multiple Testing Procedures with Applications to Genomics (Springer, New York, 2008).
9. Schneider, T.D. & Stephens, R.M. Nucleic Acids Res. 18, 6097–6100 (1990).