
Large-Scale Simultaneous Hypothesis Testing:
The Choice of a Null Hypothesis

Bradley EFRON

Current scientific techniques in genomics and image processing routinely produce hypothesis testing problems with hundreds or thousands of cases to consider simultaneously. This poses new difficulties for the statistician, but also opens new opportunities. In particular, it allows empirical estimation of an appropriate null hypothesis. The empirical null may be considerably more dispersed than the usual theoretical null distribution that would be used for any one case considered separately. An empirical Bayes analysis plan for this situation is developed, using a local version of the false discovery rate to examine the inference issues. Two genomics problems are used as examples to show the importance of correctly choosing the null hypothesis.

KEY WORDS: Empirical Bayes; Empirical null hypothesis; Local false discovery rate; Microarray analysis; Unobserved covariates.

1. INTRODUCTION

Until recently, "simultaneous inference" meant considering two or five or perhaps even 10 hypothesis tests at the same time, as in Miller's classic text (Miller 1981). Rapid progress in technology, particularly in genomics and imaging, has vastly upped the ante for simultaneous inference problems. Now 500 or 5,000 or even 50,000 tests may need to be evaluated simultaneously, raising new problems for the statistician, but also opening new analytic opportunities. This article explores choosing an appropriate null hypothesis in large-scale testing situations, and how this choice affects well-known inference methods, such as the false discovery rate (FDR).

Simultaneous hypothesis testing begins with a collection of null hypotheses,

    H_1, H_2, ..., H_N;   (1)

corresponding test statistics, possibly not independent,

    Y_1, Y_2, ..., Y_N;   (2)

and their p values, P_1, P_2, ..., P_N, with P_i measuring how strongly y_i, the observed value of Y_i, contradicts H_i; for instance, P_i = Pr_{H_i}{|Y_i| > |y_i|}. "Large-scale" means that N is a big number, say at least N > 100.

It is convenient, although not necessary, to work with z-values instead of the Y_i's or P_i's,

    z_i = Φ^{-1}(P_i),  i = 1, 2, ..., N,   (3)

with Φ indicating the standard normal cumulative distribution function (cdf), for example, Φ^{-1}(.95) = 1.645. If H_i is exactly true, then z_i will have a standard normal distribution,

    z_i | H_i ~ N(0, 1).   (4)

I call (4) the theoretical null hypothesis.

Our motivating example concerns a study of 1,391 patients with human immunodeficiency virus (HIV) infection, investigating which of 6 protease inhibitor (PI) drugs cause mutations at which of 74 sites on the viral genome. Each patient provided a vector of predictors,

    x = (x_1, x_2, ..., x_6),   (5)

with x_j = 1 or 0 indicating whether or not the patient used PI_j, 1 ≤ Σ x_j ≤ 6; and a vector of responses,

    v = (v_1, v_2, ..., v_74),   (6)

v_k = 1 or 0 indicating whether or not a mutation occurred at site k. Remark A of Section 7 describes the study in more detail.

For each of the 74 genomic sites, a separate logistic regression analysis was run using all 1,391 cases, with that site's mutation indicators as responses and the PI indicators as predictors. Together these yielded 444 = 6 × 74 z-values, one for testing each null hypothesis that drug j does not cause mutations at site k, j = 1, 2, ..., 6 and k = 1, 2, ..., 74. The z-values were based on the usual approximation

    z_i = y_i / se_i,  i = 1, 2, ..., 444,   (7)

[using a single subscript i in place of (j, k)], where y_i is the maximum likelihood estimate (MLE) of the logistic regression coefficient and se_i is its approximate large-sample standard error.

Figure 1 shows a histogram of the 444 z-values, with negative z_i's indicating greater mutational effects. The smooth curve, f(z), is a natural spline with 7 df, fit to the histogram counts by Poisson regression. It emphasizes the central peak near z = 0, presumably the large majority of uninteresting drug–site combinations that have negligible mutation effects. Near its center, the peak is well described by a normal density with mean −.35 and standard deviation 1.20, which will be called the empirical null hypothesis,

    z_i | H_i ~ N(−.35, 1.20²).   (8)

Section 3 describes the estimation methodology for (8), with a brief discussion of the normality assumption in Remark D of Section 7.

The difference between the theoretical null N(0, 1) and the empirical null N(−.35, 1.20²) may not seem worrisome here, but it will be shown that it substantially affects any simultaneous inference procedure. A more dramatic example is given in Section 6, for a microarray analysis in which going from the theoretical to the empirical null totally negates any findings of significance. Situations going in the reverse direction can also occur.

Bradley Efron is Professor, Department of Statistics, Stanford University, Stanford, CA 94305 (E-mail: [email protected]). The author thanks Robert Shafer, David Katzenstein, and Rami Kantor for bringing the drug mutation data to his attention, and Robert Tibshirani for several helpful discussions.

© 2004 American Statistical Association. Journal of the American Statistical Association, March 2004, Vol. 99, No. 465, Theory and Methods. DOI 10.1198/016214504000000089

Figure 1. Histogram of 444 z-Values From the Drug Mutation Analysis. The smooth curve f(z) is a natural spline fit to the histogram counts. The central peak near z = 0 is approximately N(−.35, 1.20²), the "empirical null hypothesis." Simultaneous hypothesis tests for the 444 cases depend critically on the choice between the empirical or theoretical N(0, 1) null.

In classic situations involving only a single hypothesis test, one must, out of necessity, use the theoretical null hypothesis, z ~ N(0, 1). The main point of this article is that large-scale testing situations permit empirical estimation of the null distribution. Sections 3–5 explore reasons why the empirical and theoretical null might differ, and which might be preferable in different situations.

There are scientific as well as statistical differences between small-scale and large-scale hypothesis testing situations. A single hypothesis test is most often run with the expectation and hope of rejecting the null, "with 80% power" in a typical clinical trial. Nobody wants to reject 80% of N = 5,000 null hypotheses. The usual point of large-scale testing is to identify a small percentage of interesting cases that deserve further investigation. Although we are not exactly looking for a needle in a haystack, we do not want the whole haystack either. An important assumption of what follows is that the proportion of interesting cases is small, perhaps 1% or 5% of N, but not more than 10%. This is made explicit in Section 2, in the description of the local false discovery rate as an analytic tool for large-scale testing. There are situations in which the 10% limit is irrelevant (e.g., in constructing prediction models), but these lie outside our purpose here.

The terminology "Interesting/Uninteresting," used in this article in preference to "Significant/Nonsignificant," is discussed near the end of Section 5. We conclude in Sections 7 and 8 with remarks, including most of the technical details, and a summary.

2. THE LOCAL FALSE DISCOVERY RATE

It is convenient to discuss large-scale testing problems in terms of the local false discovery rate (fdr), an empirical Bayes version of Benjamini and Hochberg's (1995) methodology focusing on densities rather than tail areas (see Efron et al. 2001; Efron and Tibshirani 2002; Storey 2002, 2003).

We begin with a simple Bayes model. Suppose that each of the N z-values falls into one of two classes, "Uninteresting" or "Interesting," corresponding to whether or not z_i is generated according to the null hypothesis, with prior probabilities p_0 and p_1 = 1 − p_0 for the classes. Assume that z_i has density either f_0(z) or f_1(z), depending on its class,

    p_0 = Pr{Uninteresting},  f_0(z) density if Uninteresting (Null);
    p_1 = Pr{Interesting},  f_1(z) density if Interesting (Nonnull).   (9)

The smooth curve in Figure 1 estimates the mixture density, f(z),

    f(z) = p_0 f_0(z) + p_1 f_1(z).   (10)

According to Bayes theorem, the a posteriori probability of being in the Uninteresting class given z is

    Pr{Uninteresting | z} = p_0 f_0(z)/f(z).   (11)

Here I define the fdr as

    fdr(z) ≡ f_0(z)/f(z),   (12)

ignoring the factor p_0 in (11), so fdr(z) is an upper bound on Pr{Uninteresting | z}. In fact, p_0 can be roughly estimated (see Remark B in Sec. 7), but I am assuming that p_0 is near 1, say p_0 ≥ .90, so fdr(z) is not a flagrant overestimator.

The fdr provides a useful methodology for identifying Interesting cases in a situation like that of Figure 1: (1) estimate f(z) from the observed ensemble of z-values, for example, by the natural spline fit to the histogram counts; (2) assign a null density f_0(z); (3) calculate fdr(z) = f_0(z)/f(z); and (4) report as Interesting those cases with fdr(z_i) less than some threshold value, perhaps fdr(z_i) ≤ .10. Remark B discusses the close connection between this algorithm and Benjamini and Hochberg's (1995) method.
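As a concrete illustration of steps (1)–(4), here is a minimal Python sketch (not from the original article). It estimates f(z) by Poisson regression of histogram counts on a sixth-degree polynomial basis, which Remark C notes is roughly equivalent to the 7-df natural spline of Figure 1; the simulated z-values and all function names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def local_fdr(z, f0_pdf, K=60, degree=6, thresh=0.10):
    """Steps (1)-(4) of the fdr recipe: estimate the mixture density f(z)
    by Poisson regression on histogram counts, then flag cases with
    fdr(z_i) = f0(z_i)/f(z_i) <= thresh as Interesting."""
    counts, edges = np.histogram(z, bins=K)
    mids = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]

    # Standardized polynomial basis for log lambda_k.
    mu, sd = z.mean(), z.std()
    B = np.vander((mids - mu) / sd, degree + 1)

    # Poisson regression by Newton's method: counts_k ~ Poi(lambda_k),
    # with log lambda_k = B_k @ beta.
    beta = np.zeros(degree + 1)
    beta[-1] = np.log(counts.mean() + 1.0)   # start from a flat fit
    for _ in range(100):
        lam = np.exp(B @ beta)
        step = np.linalg.solve(B.T @ (B * lam[:, None]), B.T @ (counts - lam))
        beta += step
        if np.abs(step).max() < 1e-8:
            break

    # lambda(z) estimates N * width * f(z); rescale and evaluate at the z_i.
    f_hat = np.exp(np.polyval(beta, (z - mu) / sd)) / (len(z) * width)
    fdr = f0_pdf(z) / f_hat                  # definition (12), p0 omitted
    return fdr, fdr <= thresh

# Hypothetical data in the spirit of Figure 1: a dominant null peak plus
# a small Interesting component at the far left; the N(-.35, 1.20^2)
# empirical null is supplied directly here for illustration.
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(-0.35, 1.20, 420), rng.normal(-6.0, 1.0, 24)])
fdr, interesting = local_fdr(z, lambda t: norm.pdf(t, -0.35, 1.20))
print(interesting.sum(), "cases reported as Interesting")
```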
This article concerns the choice of f_0(z), the null hypothesis density. In the drug mutation example, it is crucial to determine whether f_0 is taken to be the theoretical null, N(0, 1), or the empirical null, N(−.35, 1.20²). This is illustrated in Figure 2, a close-up view of Figure 1 focusing on the bin containing z = −3. The expected number of the 444 z_i values falling into this bin is 6.37 for f(z), and either .62 or 3.90 as f_0(z) is N(0, 1) or N(−.35, 1.20²). Thus fdr(z) = f_0(z)/f(z) at z = −3 is estimated to be either

    fdr(−3) = .097 using the theoretical null N(0, 1), or
    fdr(−3) = .612 using the empirical null N(−.35, 1.20²).   (13)

In this bin, changing from the theoretical null to the empirical null changes the inferences from Interesting to definitely Uninteresting.

Figure 2. Close-Up View of the Bin Containing z = −3 in Figure 1. Expected numbers in the bin are 6.37 for f(z), .62 for f_0 = N(0, 1), and 3.90 for f_0 = N(−.35, 1.20²), the empirical null. Corresponding estimates of fdr(−3) are .097 for N(0, 1) versus .612 for N(−.35, 1.20²). Should we report the cases in this bin as Interesting?
Figure 3 compares the two estimates of log fdr(z) over most of the z scale. As shown, 18 of the 444 z-values have fdr(z) < .10 for f_0 = N(0, 1) but > .10 for f_0 = N(−.35, 1.20²), with 17 of these at the left end of the scale. All told, the empirical null yields only two-thirds as many cases with fdr < .10 as the theoretical null (35 versus 53).

Figure 3. Comparison of Estimates of log fdr(z) for the Drug Mutation Data. The empirical null estimate (——) declines more slowly than the theoretical null estimate (· · · · ·). Dashes indicate the 444 z-values. A total of 17 cases on the left have fdr(z) < 1/10 for the theoretical null but > 1/10 for the empirical null.

3. ESTIMATING THE EMPIRICAL NULL DISTRIBUTION

The empirical null distribution for the drug mutation data is estimated in two steps: (1) Fit the curve f(z) shown in Figure 1 to the histogram counts by Poisson regression, and (2) Obtain the center and half-width of the central peak, say δ_0 and σ_0, from f(z),

    δ_0 = arg max_z f(z)  and  σ_0 = [−(d²/dz²) log f(z) |_{z=δ_0}]^{−1/2},   (14)

yielding (δ_0, σ_0) = (−.35, 1.20). Details are given in Remark D (Sec. 7), which briefly discusses the possibility of a nonnormal empirical null distribution.

More direct estimation methods for f_0 seem possible; for example, estimating δ_0 by the median of the z-values. Suppose, however, that 10% of the z-values came from the nonnull distribution and that all of these were located at the far left end of Figure 1. Then the median of all the z's would be the 4/9 quantile of the actual null distribution, not its median, yielding a badly biased estimate of δ_0. Similar comments apply to estimating σ_0 (see Remark D). Method (14) does not require preliminary estimates of the proportion p_0 in the null population of (9), a considerable practical advantage.
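A minimal sketch of step (2), combining definition (14) with the extra quadratic smoothing described in Remark D: fit log f(z) near its maximum by a_0 + a_1 z + a_2 z², take δ_0 as the vertex, and σ_0 = [−2a_2]^{−1/2}. The grid of log-density values log_f is assumed to come from a fit such as the Poisson regression sketched in Section 2; the 1.5-unit window follows Remark D.

```python
import numpy as np

def empirical_null(mids, log_f, half_width=1.5):
    """Method (14): locate the central peak of log f(z) and measure its
    center and half-width by a local quadratic fit (Remark D)."""
    peak = mids[np.argmax(log_f)]                 # rough arg max of f(z)
    keep = np.abs(mids - peak) <= half_width      # grid points near the peak
    a2, a1, a0 = np.polyfit(mids[keep], log_f[keep], 2)
    delta0 = -a1 / (2.0 * a2)                     # vertex: d/dz log f = 0
    sigma0 = (-2.0 * a2) ** -0.5                  # curvature measure, as in (14)
    return delta0, sigma0
```

Applied to the spline or polynomial fit of the previous sketch, this reproduces the two-step recipe of this section.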
How accurate are the estimates (−.35, 1.20)? The usual standard error approximations for a Poisson regression fit are not appropriate here, because the z_i's are not independent of each other. A nonparametric bootstrap analysis was performed instead, with the 1,391 80-dimensional vectors (x, v), (5) and (6), as the resampling units. This yielded .09 and .08 for the bootstrap standard errors of δ_0 and σ_0, that is,

    (δ_0, σ_0) = (−.35, 1.20) ± (.09, .08).   (15)

It seems quite unlikely that estimation error alone accounts for the difference between the empirical null and the theoretical values (δ_0, σ_0) = (0, 1). (Note that this type of bootstrap analysis, which requires independent sampling units, is not applicable to the microarray example of Sec. 6, where correlations among the genes are present.)

The next two sections concern other possible causes for empirical/theoretical differences, diagnostics for these causes, and their interpretations. This list is not exhaustive, and in fact the microarray example of Section 6 demonstrates another form of pathology.

4. PERMUTATION TESTS AND UNOBSERVED COVARIATES

The theoretical N(0, 1) null hypothesis (4) is usually based on asymptotic approximations, like those for the logistic regression coefficients in the drug mutation study. Permutation methods can be used to avoid these approximations, perhaps in the hope that an improved theoretical null will more closely match the empirical.

This was not the case for the drug mutation data, for which permutation testing was implemented by randomly pairing the 1,391 predictor vectors x, (5), with the 1,391 response vectors v, (6), and recalculating the 444 z-values. This whole process was repeated independently 20 times, yielding a total of 20 × 444 permutation z's. Their distribution was well approximated by a N(0, .965²) density (the "permutation null"), except for a prominent spike near z = .3. In this case, the permutation-improved theoretical null differs more, rather than less, from the empirical null N(−.35, 1.20²).
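A sketch of this permutation scheme; the helper z_values_from(x_rows, v_rows), standing in for the 74 per-site logistic regressions, is a hypothetical name, not code from the study.

```python
import numpy as np

def permutation_null(x_rows, v_rows, z_values_from, n_reps=20, seed=0):
    """Randomly re-pair the predictor vectors x with the response vectors v,
    recompute all 444 z-values each time, and pool the results."""
    rng = np.random.default_rng(seed)
    n = len(x_rows)                        # 1,391 patients
    pooled = []
    for _ in range(n_reps):
        perm = rng.permutation(n)          # breaks any x-v association
        pooled.append(z_values_from(x_rows, v_rows[perm]))
    return np.concatenate(pooled)          # 20 x 444 permutation z's
```

Applying the center–width estimate (14) to the pooled permutation z's gives the permutation null; for the drug mutation data the text reports approximately N(0, .965²).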
Permutation methods are popular in the microarray literature as a way of avoiding assumptions and approximations (see Efron, Tibshirani, Storey, and Tusher 2001; Dudoit, Shaffer, and Boldrick 2003), but they do not automatically resolve the question of an appropriate null hypothesis. This can be seen in the following hypothetical example, which is a stylized version of the two-sample microarray testing problem discussed in Section 6. The data, x_ij, come from N simultaneous two-sample experiments, each comparing 2n subjects,

    x_ij (i = 1, ..., N):  Controls for j = 1, 2, ..., n;  Treatments for j = n + 1, n + 2, ..., 2n.   (16)

The ith test statistic, Y_i, is the usual two-sample t statistic, comparing Treatments versus Controls for the ith experiment.

Suppose that, unknown to the statistician, the data were actually generated from

    x_ij = u_ij + β_i I_j/2,  with u_ij ~ N(0, 1) and β_i ~ N(0, σ_β²),   (17)

with the u_ij and β_i mutually independent and

    I_j = −1 for j = 1, 2, ..., n;  I_j = +1 for j = n + 1, ..., 2n.   (18)

Then it is easy to show that the statistics Y_i follow a dilated t distribution with 2n − 2 df,

    Y_i ~ (1 + n σ_β²/2)^{1/2} · t_{2n−2},   (19)

whereas the permutation distribution, permuting Treatments and Controls within each experiment, has nearly a standard t_{2n−2} null distribution. So, for example, if σ_β² = 2/n, then the empirical density of the Y_i's will be √2 times as wide as the permutation null.
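The dilation claim (19) is easy to check by simulation; in the sketch below, with σ_β² = 2/n, the ratio of the observed spread to the permutation spread should come out near √2. All sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 5000, 10                        # N experiments, n subjects per group
sigma_beta2 = 2.0 / n                  # makes the dilation factor sqrt(2)

beta = rng.normal(0.0, np.sqrt(sigma_beta2), size=(N, 1))
I = np.r_[-np.ones(n), np.ones(n)]     # (18): -1 for Controls, +1 for Treatments
x = rng.normal(size=(N, 2 * n)) + beta * I / 2.0     # model (17)

def two_sample_t(x):
    c, t = x[:, :n], x[:, n:]
    sp2 = (c.var(axis=1, ddof=1) + t.var(axis=1, ddof=1)) / 2.0
    return (t.mean(axis=1) - c.mean(axis=1)) / np.sqrt(sp2 * 2.0 / n)

Y = two_sample_t(x)
x_perm = rng.permuted(x, axis=1)       # permute within each experiment
Y_perm = two_sample_t(x_perm)
print(Y.std() / Y_perm.std())          # close to sqrt(2) = 1.414, per (19)
```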
The quantity β_i in (17) and (18) produces the only consistent differences between Treatments and Controls in experiment i. If β_i is a dependable feature of the ith experiment, and would appear again with the same value in a replication of the study, then the permutation null t_{2n−2} is a reasonable basis for inference. With n large and σ_β² = 2/n, this results in fdr(y_i) < .10 for the most extreme 2% of the observed t statistics, favoring those with the largest values of |β_i|.

Suppose, however, that β_i is not inherent to experiment i, but rather is a purely random effect that would have a different value and perhaps a different sign if the study were repeated; that is, β_i is part of the noise and not part of the signal. In this case, the appropriate choice is the empirical null (19). The equivalent of Figure 1 would be all central peak, with no interesting outliers, and no cases with small values of fdr(y_i). This is appropriate, because now there is no real Treatment effect.

In this latter context β_i acts as an unobserved covariate, a quantity that the statistician would use to correct the Treatment–Control comparison if it were observable. Unobserved covariates are ubiquitous in observational studies. There are several obvious ones in the drug mutation study, including personal patient characteristics, such as age and gender, previous use of AZT and other non-PI drugs, years since infection, geographic location, and so on.

The effect of important unobserved covariates is to dilate the null hypothesis density f_0(z), as happens in (19). Unobserved covariates will also dilate the Interesting density f_1(z) in (9), and the mixture density f(z), (10). However, an empirical fitting method for estimating f(z), such as the spline fit in Figure 1, automatically includes any dilation effects. In estimating fdr(z) = f_0(z)/f(z), it is important to also allow for dilation of the numerator f_0. This is a strong argument for preferring the empirical null hypothesis in observational studies.

5. A STRUCTURAL MODEL FOR THE z-VALUES

The Bayesian specifications (9) underlying the fdr results have the advantage of not requiring a structural model for the z-values; in particular, it is not necessary to motivate, or even describe, the nonnull density f_1(z). There is, however, a simple structural model that can help elucidate the Interesting–Uninteresting distinction in (9).

The structural model assumes that z_i, the ith z-value, is normally distributed around a "true value" μ_i, its expectation,

    z_i ~ N(μ_i, 1)  for i = 1, 2, ..., N,   (20)

with μ_i having some prior distribution g(μ),

    μ_i ~ g(μ)  for i = 1, 2, ..., N.   (21)

Structure (20) is often a good approximation (see Efron 1988, sec. 4), and in fact proved reasonably accurate in the bootstrap experiment yielding (15). Together, (20) and (21) say that the mixture density f(z), (10), is a convolution of g(μ) with the standard normal density φ(z),

    f(z) = ∫_{−∞}^{∞} φ(z − μ) g(μ) dμ   (22)

[with the understanding that g(μ) may include discrete probability atoms].

As a first application of the structural model, suppose that we insist that g(μ) put probability p_0 on μ = 0,

    Pr_g{μ = 0} = p_0,   (23)

for some fixed value of p_0 between 0 and 1. This amounts to the original Bayes model (9) with p_0 = Pr{Uninteresting}, f_0(z) the theoretical null hypothesis N(0, 1), and

    f_1(z) = ∫_{μ≠0} φ(z − μ) g(μ) dμ / (1 − p_0).   (24)

In the context of this article, p_0 should be .90 or greater.

For any f(z) of the convolution form (22), let (δ_g, σ_g) be the center and width parameters (δ_0, σ_0) defined by (14). Figure 4 answers the following question: For a given choice of p_0 in constraint (23), what are the maximum possible values of |δ_g| and of σ_g,

    δ_max = max{|δ_g| given p_0}  and  σ_max = max{σ_g given p_0}?   (25)

Three curves appear for σ_max: for the general case just described, for the case where the nonzero component of g(μ) is required to be symmetric around 0, and for the case where it is also required to be normal. Here only the general case will be discussed. Remark F (Sec. 7) discusses the solution of (25), which turns out to have a simple "single-point" form.

The notable feature of Figure 4 is that for p_0 ≥ .90, my preferred realm for large-scale hypothesis testing, (δ_max, σ_max) must be quite near the theoretical null values (0, 1),

    δ_max ≤ .07  and  σ_max ≤ 1.04.   (26)

Figure 4. Maximum Possible Values of the Center and Width Parameters (δ_0, σ_0), (14), When the Structural Model (20)–(22) Is Constrained to Put Probability p_0 on μ = 0. For 1 − p_0 ≤ .10, the maxima are not much greater than the theoretical null values (0, 1), as shown in Table 1.
Table 1. Values of σ_max and δ_max as a Function of 1 − p_0, (23)

    1 − p_0:    .05     .10     .20     .30    (Drug mutation)
    σ_max:     1.02    1.04    1.11    1.22    (1.20)
    δ_max:      .03     .07     .15     .27    (−.35)

Table 1 shows (δ_max, σ_max) for various choices of p_0. It shows that the "Interesting" probability 1 − p_0 would have to be nearly .30, very large by the standards of large-scale testing, to obtain the observed drug mutation values (δ_0, σ_0) = (−.35, 1.20). The inference is that Uninteresting effects, such as the unobserved covariates of Section 4, are dilating the null hypothesis.
The main point here is that the measures (14) of center and width are quite robust to the arrangement of Interesting values μ_i, as long as the Interesting percentage does not exceed 10%. If (δ_0, σ_0) for the central peak is much different than (0, 1), as it is in Figure 1, then using the theoretical null is bound to result in identifying an uncomfortably large percentage of supposedly Interesting cases.

We can pursue this last point for the drug mutation data by removing constraint (23). Figure 5 shows an unconstrained estimate of g(μ). For computational simplicity, g(μ) was assumed to be discrete, with at most J = 8 support points μ_1, μ_2, ..., μ_J, so that (22) becomes

    f(z) = Σ_{j=1}^{J} π_j φ(z − μ_j),   (27)

π_j being the probability g puts on μ_j, with π_j ≥ 0 and Σ π_j = 1. A nonlinear minimization program was employed to find the best-fitting curve of form (27) to the histogram counts in Figure 1, using Poisson deviance as the fitting criterion. The vertical bars in Figure 5 are located at the resulting eight values μ_j, with each bar's height proportional to π_j. For example, the little bar at far left represents an atom of probability π_1 = .015 at μ_1 = −10.9. The resulting f(z) estimate, (27), closely resembles the natural spline fit of Figure 1. Table 2 shows all eight (π_j, μ_j) pairs.

Figure 5. Best-Fit Discrete Mixing Function g(μ), (21), for the Drug Mutation Data. The bars are located at the support points μ_j, with heights proportional to the weights π_j; the tall bar at μ_j = 0 has weight π_j = .61. The solid curve is the best-fit estimate f(z) = Σ π_j φ(z − μ_j); it closely matches the natural spline fit from Figure 1 (- - - -).
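A sketch of this unconstrained fit: minimize the Poisson deviance between the K = 60 histogram counts and scaled values of the mixture (27). The text does not name its nonlinear minimization program; scipy.optimize.minimize and the softmax parameterization enforcing π_j ≥ 0, Σ π_j = 1 are choices made here for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_discrete_g(z, K=60, J=8, seed=0):
    """Fit f(z) = sum_j pi_j phi(z - mu_j), (27), to histogram counts by
    minimizing the Poisson deviance."""
    counts, edges = np.histogram(z, bins=K)
    mids = 0.5 * (edges[:-1] + edges[1:])
    scale = len(z) * (edges[1] - edges[0])   # counts_k ~ Poi(scale * f(mid_k))

    def unpack(theta):
        mu = theta[:J]
        w = np.exp(theta[J:])
        return mu, w / w.sum()               # softmax weights pi_j

    def deviance(theta):
        mu, pi = unpack(theta)
        lam = np.maximum(scale * (norm.pdf(mids[:, None], mu, 1.0) @ pi), 1e-12)
        # Poisson deviance 2 * sum[s log(s/lam) - (s - lam)], with 0 log 0 = 0
        term = np.where(counts > 0, counts * np.log(counts / lam), 0.0)
        return 2.0 * np.sum(term - (counts - lam))

    rng = np.random.default_rng(seed)
    theta0 = np.r_[np.linspace(z.min(), z.max(), J), rng.normal(0, 0.1, J)]
    res = minimize(deviance, theta0, method="Nelder-Mead",
                   options={"maxiter": 20000})
    return unpack(res.x)

# mu_hat, pi_hat = fit_discrete_g(z)   # z-values as in the earlier sketches
```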
Suppose for a moment that the estimated g(μ) is exactly correct, so 1.5% of the 444 cases have their μ_i's equal to −10.9, 1.3% to −7.0, and so on, and that an oracle has told us the eight (π_j, μ_j) values. Given an observed z_i, we can now calculate Pr{Uninteresting | z}, (11), exactly, once the scientist specifies the definition of Uninteresting versus Interesting. It seems obvious that the 60.8% at μ_j = 0 are Uninteresting, and that the 10.6% at μ_j = −10.9, −7.0, −4.9, and 6.1 deserve Interesting status. However, the status of the 28.6% at μ_j = −1.8, −1.1, and 2.4 is less clear.

If the 28.6% are deemed Interesting, then this leaves only the 60.8% at μ_j = 0 as Uninteresting. In terms of the Bayes model (9), this yields p_0 = .608 and f_0(z) ~ N(0, 1), the theoretical null. About 174 of the 444 cases will be identified as Interesting, too many for a typical screening exercise. Shifting the 28.6% to the Uninteresting classification increases p_0 to .608 + .286 = .894, a more manageable value, and changes f_0(z) to the version of (27) supported on the four Uninteresting μ_j's,

    f_0(z) = Σ_{j=4}^{7} π_j φ(z − μ_j) / Σ_{j=4}^{7} π_j.   (28)

This is approximately N(−.34, 1.19²), almost the same as the empirical null (8).

In other words, the definition of "Interesting" determines the relevant choice of the null hypothesis f_0. If the goal is to keep the proportion of Interesting cases manageably small, then f_0(z) must grow wider than N(0, 1).

Use of the term "Interesting" rather than "Significant" reflects a difference in intent between large-scale and classical testing. In the hypothetical context of Figure 5 and Table 2, all of the 39.2% of the cases with nonzero μ_i's would eventually be declared "significantly different from 0" if the sample size of patients was vastly increased. Section 4 suggests that minor deviations from N(0, 1) might arise from scientifically uninteresting causes, such as unobserved covariates. However, even if a modestly nonzero μ_i is genuine in some sense, it may still be Uninteresting when viewed in comparison with an ensemble of more dramatic possibilities. Nonsignificant implies Uninteresting, but not conversely.

Table 2. Weights π_j and Locations μ_j for the Eight-Point Best-Fit Estimate g(μ) of Figure 5

              Interesting  Interesting  Interesting    ?      ?    Uninteresting    ?    Interesting
    100·π_j:      1.5%        1.3%         5.6%       12.3%  13.6%     60.8%       2.7%     2.2%
    μ_j:        −10.9        −7.0         −4.9        −1.8   −1.1       0           2.4      6.1

NOTE: Which locations are deemed Interesting versus Uninteresting determines the choice between the theoretical or empirical null hypothesis. (Numerical results accurate to one decimal place.)
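As a numerical check on the claim following (28), one can build f_0 from the Table 2 values and apply the center–width measure (14) with Remark D's quadratic smoothing. Note that (δ_0, σ_0) are peak measures, not moments: the mixture's actual mean and standard deviation are about (−.34, 1.30), while the peak width is what comes out near 1.19. A minimal sketch:

```python
import numpy as np
from scipy.stats import norm

# The four Uninteresting support points of (28), from Table 2,
# with their weights renormalized to sum to 1 (they total 89.4%).
mu = np.array([-1.8, -1.1, 0.0, 2.4])
pi = np.array([12.3, 13.6, 60.8, 2.7]) / 89.4

grid = np.linspace(-4.0, 4.0, 2001)
log_f0 = np.log(norm.pdf(grid[:, None], mu, 1.0) @ pi)

# Quadratic fit to log f0 within 1.5 units of its peak, as in Remark D.
keep = np.abs(grid - grid[np.argmax(log_f0)]) <= 1.5
a2, a1, _ = np.polyfit(grid[keep], log_f0[keep], 2)
delta0, sigma0 = -a1 / (2 * a2), (-2 * a2) ** -0.5   # definition (14)
print(delta0, sigma0)   # near the N(-.34, 1.19^2) approximation in the text
```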
6. A MICROARRAY EXAMPLE

Microarrays have become a prime source of large-scale simultaneous testing problems. Figure 6 relates to a well-known microarray experiment concerning differences between two types of genetic mutations causing increased breast cancer risk, BRCA1 and BRCA2 (see Hedenfalk, Duggen, and Chen 2001; Efron and Tibshirani 2002; Efron 2003).

The experiment included 15 breast cancer patients, 7 with BRCA1 and 8 with BRCA2. Each woman's tumor was analyzed on a separate microarray, each microarray reporting on the same set of N = 3,226 genes. For each gene, the two-sample t statistic y_i comparing the seven BRCA1 responses with the eight BRCA2's was computed. The y_i's were then converted to z-values,

    z_i = Φ^{-1}[F_13(y_i)],   (29)

where F_13 is the cdf of a standard t distribution with 13 df. Figure 6 displays the histogram of the 3,226 z-values.
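Transformation (29) in code, for an assumed 3,226 × 15 expression matrix X with the seven BRCA1 columns first (a hypothetical layout):

```python
import numpy as np
from scipy.stats import norm, t as t_dist

def t_to_z(X, n1=7, n2=8):
    """Per-gene two-sample t statistics, mapped to z-values by (29):
    z_i = Phi^{-1}[F_13(y_i)], with 13 = n1 + n2 - 2 df."""
    g1, g2 = X[:, :n1], X[:, n1:]
    sp2 = ((n1 - 1) * g1.var(axis=1, ddof=1) +
           (n2 - 1) * g2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    y = (g2.mean(axis=1) - g1.mean(axis=1)) / np.sqrt(sp2 * (1/n1 + 1/n2))
    return norm.ppf(t_dist.cdf(y, df=n1 + n2 - 2))
```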
Figure 6. Histogram of N = 3,226 z-Values From the Breast Cancer Study. The theoretical N(0, 1) null is much narrower than the central peak, which has (δ_0, σ_0) = (−.02, 1.58). In this case the central peak seems to include the entire histogram.

The central peak is wider here than in Figure 1, with center–width estimates (δ_0, σ_0) = (−.02, 1.58). More importantly, the histogram seems to be all central peak, with no interesting outliers such as those seen at the left of Figure 1. This was reflected in the local fdr calculations; using the theoretical N(0, 1) null yielded 35 genes with fdr(z_i) < .1, those with |z_i| > 3.35; using the empirical N(−.02, 1.58²) null, no genes at all had fdr < .1, or, for that matter, fdr < .9, the histogram in fact being a little short-tailed compared with N(−.02, 1.58²).

There is ample reason to distrust the theoretical null in this case. The microarray experiment, for all its impressive technology, is still an observational study, with a wide range of unobserved covariates possibly distorting the BRCA1–BRCA2 comparison.

Another reason for doubt can be found in the data itself. The fdr methodology does not require independence of the y_i's or z_i's across genes. However, it does require that the 15 measurements for each gene be independent across the microarrays. Otherwise, the two-sample t statistic y_i will not have a t_13 null distribution, not even approximately.

Unfortunately, the experimental methodology used in the breast cancer study seems to have induced substantial correlations among the various microarrays. In particular, as discussed in Remark G, the first four microarrays in the BRCA2 group were mutually correlated, and likewise the last four. Correlations reduce the effective sample size for a two-sample t statistic, just the type of effect that would induce overdispersion in (29).

This does not say that there are no BRCA1–BRCA2 differences, only that it is dangerous to compare the t statistics with a standard t_13 null distribution, even if simultaneous inference is accounted for.

7. REMARKS

Remark A (Drug mutation study). The database for the drug mutation study (Wu et al. 2003) included 2,497 patients having HIV subtype B, of whom 1,391 had received at least one of six popular protease inhibitor (PI) drugs: amprenavir, indinavir, lopinavir, nelfinavir, ritonavir, or saquinavir. Among the 1,391, the mean number of PI drugs taken was 2.05 per patient. Amino acid sequences were obtained at all 99 positions on the HIV protease gene, and mutations from wild-type recorded; 25 positions showed 3 or fewer mutations among the 1,391 patients, deemed too few for analysis, leaving 74 positions for the investigation here. Each of the 74 individual logistic regressions included an intercept term as well as the six PI main effects, but no other covariates.
Remark B (The local false discovery rate). The local fdr, (11) or (12), is closely related to Benjamini and Hochberg's (1995) "tail-area" FDR, as discussed by Efron et al. (2001), Storey (2002), and Efron and Tibshirani (2002). Substituting cdf's F_0 and F for the densities f_0 and f, Bayes's theorem gives a tail-area version of (11),

    Pr{Uninteresting | z ≤ z_0} = p_0 F_0(z_0)/F(z_0) ≡ FDR(z_0).   (30)

FDR(z_0) turns out to be the conditional expectation of fdr(z) ≡ p_0 f_0(z)/f(z) given z ≤ z_0,

    FDR(z_0) = ∫_{−∞}^{z_0} fdr(z) f(z) dz / ∫_{−∞}^{z_0} f(z) dz.   (31)

Benjamini and Hochberg worked in a frequentist framework, but their FDR control rule can be stated in empirical Bayes terms. Given F_0, which they usually took to be what has been called here the theoretical null, they estimate FDR(z_0) by

    F̂DR(z_0) = p_0 F_0(z_0)/F̂(z_0),   (32)

where F̂ is the empirical cdf of the z_i's. For a desired control level α, say α = .05, define

    z_0 = arg max_z {F̂DR(z) ≤ α};   (33)

then rejecting all cases with z_i ≤ z_0 gives an expected (frequentist) rate of false discoveries no greater than α.

With z_0 as in (33), relation (31) (applied to the estimated versions of FDR, fdr, and f) says that the weighted average of fdr(z_i) for the cases rejected by the FDR level-α rule is itself α. As an example, take α = .05 and f_0 equal to the theoretical N(0, 1) null. Applying the FDR control rule to the negative side of Figure 1's drug mutation data rejects the null hypothesis for the 56 cases having z_i ≤ −2.61; the corresponding 56 values of fdr(z_i) have weighted average α = .05. They vary from nearly 0 at the far left to .19 at the boundary value z = −2.61, justifying the term "local"; z_i's near the boundary are more likely to be false discoveries than the overall .05 rate suggests.

Our concern with a correct choice of null hypothesis applies to FDR just as well as to fdr. In the microarray study, FDR with F_0 = N(0, 1) gives 24 significant genes at α = .05, whereas F_0 = N(−.02, 1.58²) gives none. In fact, any simultaneous testing procedure, for example, the popular Westfall–Young method (Westfall and Young 1993), will depend on a correct assessment of p values for the individual cases, that is, on the choice of F_0.
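A sketch of the empirical Bayes statement (32)–(33) of the control rule, for left-tail rejection as in the drug mutation example; F0_cdf might be the theoretical null cdf Φ(z) or an empirical null cdf Φ((z − δ_0)/σ_0).

```python
import numpy as np

def fdr_control(z, F0_cdf, alpha=0.05, p0=1.0):
    """Estimate FDR(z0) = p0 * F0(z0) / Fhat(z0), (32), at each observed
    z_i, then reject all cases at or below the largest z0 with
    FDRhat(z0) <= alpha, as in (33)."""
    z = np.sort(z)
    F_hat = np.arange(1, len(z) + 1) / len(z)   # empirical cdf at the z_i
    FDR_hat = p0 * F0_cdf(z) / F_hat
    ok = np.nonzero(FDR_hat <= alpha)[0]
    if ok.size == 0:
        return None, np.zeros(len(z), dtype=bool)
    z0 = z[ok.max()]
    return z0, z <= z0

# Example with the theoretical null:
# from scipy.stats import norm
# z0, rejected = fdr_control(z, norm.cdf, alpha=0.05)
```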
Remark C (Estimating f(z)). The Poisson regression method used in Figure 1 to estimate the mixture density f(z), (10), originates in an idea of Lindsey described by Efron and Tibshirani (1996, sec. 2). The range of the sample z_1, z_2, ..., z_N is partitioned into K equal intervals, with interval k having midpoint x_k and containing count s_k of the N z-values; the expectation λ_k of s_k is nearly proportional to f_k ≡ f(x_k), and if the z_i's are independent, then the counts approximate independent Poisson variates,

    s_k ~ Poi(λ_k) independently, with λ_k = c f_k,  k = 1, 2, ..., K,   (34)

where c is a constant depending on N and the interval length. Lindsey's method is to estimate the λ_k's with a Poisson regression, which because of (34) amounts to estimating a scaled version of the f_k's; in other words, estimating f(z). K equals 60 in Figure 1, with the regression model being a natural spline with 7 df, roughly equivalent to a sixth-degree polynomial fit in z.

Poisson regression based on (34) is almost fully efficient for estimating f(z) if the z_i's are independent. Here one does not expect independence, but we still have the expectation of s_k proportional to f_k. The Poisson regression method will still tend to unbiasedly estimate f(z), assuming the regression model is sufficiently flexible, though we may lose estimating efficiency.

I also used the bootstrap analysis that gave the standard errors in (15) to check (34). This turned out to be surprisingly accurate for the drug mutation data. If it had not, then I might have used the bootstrap estimate of covariance for the s_k's to motivate a more efficient estimation procedure, though this is unlikely to be important for large values of N. In any case, bootstrap analyses as in (15) will provide legitimate standard errors for the Poisson regression whether or not (34) is valid.
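A sketch of the bootstrap just described, resampling whole patient vectors so that dependence among the 444 z-values is preserved; z_values_from and center_width are hypothetical wrappers for the 74 logistic regressions and for the (14) pipeline sketched in Sections 2–3.

```python
import numpy as np

def bootstrap_se(x_rows, v_rows, z_values_from, center_width, B=200, seed=0):
    """Nonparametric bootstrap for (delta0, sigma0), as in (15): the
    resampling units are the 1,391 patient vectors (x, v), (5)-(6)."""
    rng = np.random.default_rng(seed)
    n = len(x_rows)
    stats = []
    for _ in range(B):
        idx = rng.integers(0, n, n)           # sample patients with replacement
        z_star = z_values_from(x_rows[idx], v_rows[idx])
        stats.append(center_width(z_star))    # (delta0, sigma0) via (14)
    return np.std(stats, axis=0, ddof=1)      # bootstrap standard errors
```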
Remark D (Estimating the empirical null distribution). The main tactic of this article is to estimate the null distribution f_0(z) in (9) from the central peak in the z-values' histogram. Assuming normality for f_0 gives

    log f(z) = −(1/2)[(z − δ_0)/σ_0]² + constant   (35)

for z near 0, so that δ_0 and σ_0 can be estimated by differentiating log f(z) as in (14). The constant depends on N and p_0, but the constant has no effect on the derivatives in (14).

Directly differentiating the spline estimate of log f(z) can give an overly variable estimate of σ_0. One more smoothing step was used here, fitting a quadratic curve a_0 + a_1 x_k + a_2 x_k² by ordinary least squares to the estimated values log f_k, for x_k within 1.5 units of the maximum δ_0, yielding σ_0 = [−2a_2]^{−1/2} as in (14). This procedure gave the small bootstrap standard error estimate in (15).

None of this methodology is crucial, although it is important that the estimates δ_0 and σ_0 relate directly to f_0(z) and are not much affected by the nonnull distribution f_1(z) in (9). As an example of what can go wrong, suppose that one tries to estimate σ_0 by a "robust" scale measure, such as (84th quantile minus 16th quantile)/2. This gives σ_0 = 1.47 for the drug mutation data, reflecting long tails due to the Interesting cases in Figure 1. Similar difficulties arise using the central slope of a qq plot. Basically, a density estimate of the central peak is required, and then some assessment of its center and width.

More ambitiously, one might try extending the estimation of f_0(z) to third moments, permitting a skew null distribution. Expression (35) could be generalized to

    −log f(z) = c_0 + c_1 z + c_2 z²/2 + c_3 z³/6,   (36)

now requiring three derivatives to estimate the coefficients rather than the two of (14). This is an unexplored path, and in particular Table 1 has not been extended to include skewness bounds. Familiarity was the only reason for using z-values instead of t-values in Figures 1 and 6.

Remark E (Estimating p_0). One can obtain reasonable upper bounds for p_0 in (9) from estimates of

    π(c) ≡ Pr_f{z_i ∈ δ_0 ± c σ_0}.   (37)

Supposing that f_0(z) = N(δ_0, σ_0²), define

    G_0(c) = 2Φ(c) − 1  and  G_1(c) = ∫_{δ_0 − cσ_0}^{δ_0 + cσ_0} f_1(z) dz,   (38)

the probabilities that z_i ∈ δ_0 ± c σ_0 under f_0 and f_1. Then

    p_0 = [π(c) − G_1(c)] / [G_0(c) − G_1(c)] ≤ π(c)/G_0(c),   (39)

the inequality following from the assumption that G_1(c) ≤ G_0(c); that is, the f_1 density is more dispersed than f_0. This leads to the estimated upper bound for p_0,

    p̂_0 = π̂(c)/G_0(c),  with π̂(c) = #{z_i ∈ δ_0 ± c σ_0}/N.   (40)

In particular, if it is assumed that G_1(c) = 0, in other words, that the Interesting z_i's always fall outside δ_0 ± c σ_0, then p̂_0 = π̂(c)/G_0(c) is unbiased. (This is the same estimate suggested in remark F of Efron et al. 2001 and Storey 2002.) Choosing (δ_0, σ_0) = (−.35, 1.20) and c = 1.5 gave p̂_0 = .88 for the drug mutation data, with bootstrap standard error .024.
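Estimate (40) is essentially a one-liner; a sketch with the drug mutation choices c = 1.5 and (δ_0, σ_0) = (−.35, 1.20):

```python
import numpy as np
from scipy.stats import norm

def p0_upper_bound(z, delta0, sigma0, c=1.5):
    """Estimate (40): the fraction of z_i within delta0 +/- c*sigma0,
    divided by the null probability G0(c) = 2*Phi(c) - 1 of that interval."""
    pi_hat = np.mean(np.abs(z - delta0) <= c * sigma0)
    return pi_hat / (2.0 * norm.cdf(c) - 1.0)

# p0_hat = p0_upper_bound(z, -0.35, 1.20)   # about .88 for the drug data
```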
Remark F (Single-point solutions for (δ_max, σ_max)). The distributions g(μ) providing (δ_max, σ_max) in (25), as graphed in Figure 4, have their nonzero components supported at a single point μ_1. For example, the g(μ) for the entry giving σ_max = 1.04 in Table 1 puts probability .90 at μ = 0 and .10 at μ_1 = 1.47. Single-point optimality was proved for three of the four cases in Figure 4, and verified by numerical maximization for the "General" case. Here is the proof for the σ_max "Symmetric" case; the other two proofs are similar.

Consider symmetric distributions putting probability p_0 on μ = 0 and probabilities p_j on symmetric pairs (−μ_j, μ_j), j = 1, 2, ..., J, so (22) becomes

    f(z) = p_0 φ(z) + Σ_{j=1}^{J} p_j [φ(z − μ_j) + φ(z + μ_j)]/2.   (41)

Defining c_0 = p_0/(1 − p_0), r_j = p_j/p_0, and r_+ = Σ_{j=1}^{J} r_j = 1/c_0, σ_0 in (14) can be expressed as

    σ_0 = (1 − Q)^{−1/2},  where  Q = Σ_{j=1}^{J} r_j μ_j² e^{−μ_j²/2} / [c_0 r_+ + Σ_{j=1}^{J} r_j e^{−μ_j²/2}].   (42)

Here δ_0 = 0, which is true by symmetry assuming that p_0 ≥ 1/2. Then σ_max in (25) can be found by maximizing Q.

It will be shown that with p_0 (and c_0) and μ_1, μ_2, ..., μ_J held fixed in (41), Q is maximized by a choice of p_1, p_2, ..., p_J having J − 1 zero values; this is a stronger version of the single-point result. Because Q is homogeneous in r = (r_1, r_2, ..., r_J) in (42), the unconstrained maximization of Q(r), subject only to r_j ≥ 0 for j = 1, 2, ..., J, can be considered. Differentiation gives

    ∂Q/∂r_j = (1/den)[μ_j² e^{−μ_j²/2} − Q · (c_0 + e^{−μ_j²/2})],   (43)

with "den" the denominator of Q. At a maximizing point r, we must have

    ∂Q(r)/∂r_j ≤ 0,  with equality if r_j > 0.   (44)

Defining R_j = μ_j²/(1 + c_0 e^{μ_j²/2}), (43) and (44) yield

    Q(r) ≥ R_j,  with equality if r_j > 0.   (45)

Because Q(r) is the maximum, this says that r_j, and p_j, can be nonzero only if j maximizes R_j. In case of ties, one of the maximizing j's can be arbitrarily chosen. All of this shows that only J = 1 need be considered in (41). The globally maximized value of σ_0 in (41) is σ_max = (1 − R_max)^{−1/2}, where

    R_max = max_{μ_1} {μ_1² / (1 + c_0 e^{μ_1²/2})}.   (46)

The maximizing argument μ_1 ranges from 1.43 for p_0 = .95 to 1.51 for p_0 = .70. The corresponding result for δ_max is simpler, μ_1 = δ_max + 1.
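Result (46) is easy to verify numerically. The sketch below evaluates the symmetric-case bound by grid search; at p_0 = .95 and .90 it reproduces the Table 1 values σ_max = 1.02 and 1.04 (Table 1's general-case maxima can differ slightly for smaller p_0).

```python
import numpy as np

def sigma_max_symmetric(p0):
    """sigma_max = (1 - R_max)^{-1/2}, with R_max from (46), by grid search."""
    c0 = p0 / (1.0 - p0)
    mu1 = np.linspace(0.01, 5.0, 100000)
    R = mu1**2 / (1.0 + c0 * np.exp(mu1**2 / 2.0))
    return (1.0 - R.max()) ** -0.5

for p0 in (0.95, 0.90):
    print(p0, round(sigma_max_symmetric(p0), 2))   # 1.02 and 1.04, as in Table 1
```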
Remark G (Microarray correlation in the breast cancer study). It is easy to spot an unwanted correlation structure among the eight BRCA2 microarrays. Let X be the 3,226 × 8 matrix of BRCA2 data, with the columns of X standardized to have mean 0 and variance 1. A "de-gened" matrix X̃ was formed by subtracting row-wise averages from each element of X,

    x̃_ij = x_ij − Σ_{k=1}^{8} x_ik / 8.   (47)

Table 3 shows the 8 × 8 correlation matrix of X̃. With genuine gene effects subtracted out, the correlations should vary around −1/7 = −.14 if the columns of X are independent. Instead, the columns are correlated in blocks of four, with the off-diagonal blocks too negative and the on-diagonal blocks too positive.

Table 3. Correlation Matrix for the BRCA2 Data With Row-Wise Means Subtracted Off, (47), Indicating Positive Correlations Within the Two Blocks of Four

           1      2      3      4      5      6      7      8
    1    1.00    .02    .02    .23   −.36   −.35   −.39   −.34
    2     .02   1.00    .10   −.08   −.30   −.30   −.23   −.33
    3     .02    .10   1.00   −.17   −.21   −.26   −.31   −.27
    4     .23   −.08   −.17   1.00   −.30   −.23   −.27   −.32
    5    −.36   −.30   −.21   −.30   1.00   −.02    .11    .22
    6    −.35   −.30   −.26   −.23   −.02   1.00    .15    .13
    7    −.39   −.23   −.31   −.27    .11    .15   1.00    .07
    8    −.34   −.33   −.27   −.32    .22    .13    .07   1.00
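A sketch of this diagnostic, for an assumed 3,226 × 8 matrix X of BRCA2 expression values:

```python
import numpy as np

def degened_corr(X):
    """Diagnostic (47): standardize columns, subtract row means ("de-gene"),
    and examine the 8 x 8 column correlation matrix."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # columns: mean 0, var 1
    Xt = Xs - Xs.mean(axis=1, keepdims=True)           # (47): remove gene effects
    return np.corrcoef(Xt, rowvar=False)               # compare with -1/7 = -.14
```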
Remark H (Scaling properties). The associate editor pointed out that the combination of empirical null hypotheses with false discovery rate methodology "scales up" nicely, in terms of both the number of tests and the amount of information per test. Consider the structural model (20), (21) with g(μ) a mixture of 99% μ ~ N(0, .01) and 1% of μ = 5. For N, the number of tests, large enough, methods like Bonferroni bounds that control the family-wise error rate will eventually accept all N null hypotheses; fdr methods, using either the empirical or theoretical null, will correctly identify most of the Interesting 1%.

Suppose now that the amount of information per test increases by a factor of n, so that each μ_i → √n μ_i in (21). Using the theoretical N(0, 1) null makes fdr reject all N cases for n sufficiently large, whereas the empirical null continues to identify only the Interesting 1%. In this context, the fdr/empirical combination avoids the standard criticism of hypothesis testing, that rejection becomes certain for large sample sizes.

8. SUMMARY

Large-scale simultaneous hypothesis testing, where the number of cases exceeds, say, 100, permits the empirical estimation of a null hypothesis distribution. The empirical null may be wider (more dispersed) than the theoretical null distribution that would ordinarily be used for a single hypothesis test. The choice between empirical and theoretical nulls can greatly influence which cases are identified as "Significant" or "Interesting," as opposed to "Null" or "Uninteresting," this being true no matter which simultaneous hypothesis testing method is used.

This article presents an analysis plan for large-scale testing situations:

• A density fitting technique is used to estimate the null hypothesis distribution f_0 (Fig. 1 and Sec. 3).
• The local false discovery rate (fdr), an empirical Bayes version of standard FDR theory, provides inferences for the N cases (Fig. 3 and Sec. 2).

There are many possible reasons for overdispersion of the empirical null distribution that would lead to the empirical null being preferred for simultaneous testing, including:

• Unobserved covariates in an observational study (Sec. 4)
• Hidden correlations (Sec. 6)
• A large proportion of genuine but uninterestingly small effects (Fig. 5).
Large-scale testing differs in scientific intent from an individual hypothesis test. The latter is most often designed to reject the null hypothesis with high probability. Large-scale testing is usually more of a screening operation, intended to identify a small percentage of Interesting cases, assumed to be on the order of 10% or less in this article. The empirical null hypothesis methodology is designed to be accurate under this constraint (Fig. 4). More traditional estimation methods, involving permutations or quantiles, give incorrect f_0 estimates (Sec. 4 and Remark D).

[Received June 2003. Revised August 2003.]

REFERENCES

Benjamini, Y., and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Ser. B, 57, 289–300.

Dudoit, S., Shaffer, J., and Boldrick, J. (2003), "Multiple Hypothesis Testing in Microarray Experiments," Statistical Science, 18, 71–103.

Efron, B. (1988), "Three Examples of Computer-Intensive Statistical Inference," Sankhyā, 50, 338–362.

Efron, B. (2003), "Robbins, Empirical Bayes, and Microarrays," The Annals of Statistics, 31, 366–378.

Efron, B., and Tibshirani, R. (1996), "Using Specially Designed Exponential Families for Density Estimation," The Annals of Statistics, 24, 2431–2461.

Efron, B., and Tibshirani, R. (2002), "Empirical Bayes Methods and False Discovery Rates for Microarrays," Genetic Epidemiology, 23, 70–86.

Efron, B., Tibshirani, R., Storey, J., and Tusher, V. (2001), "Empirical Bayes Analysis of a Microarray Experiment," Journal of the American Statistical Association, 96, 1151–1160.

Hedenfalk, I., Duggen, D., Chen, Y., et al. (2001), "Gene Expression Profiles in Hereditary Breast Cancer," New England Journal of Medicine, 344, 539–548.

Miller, R. (1981), Simultaneous Statistical Inference (2nd ed.), New York: Springer-Verlag.

Storey, J. (2002), "A Direct Approach to False Discovery Rates," Journal of the Royal Statistical Society, Ser. B, 64, 479–498.

Storey, J. (2003), "The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value," The Annals of Statistics, 31, to appear.

Westfall, P., and Young, S. (1993), Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustments, New York: Wiley.

Wu, T., Schiffer, C., Shafer, R., et al. (2003), "Mutation Patterns and Structural Correlates in Human Immunodeficiency Virus Type 1 Protease Following Different Protease Inhibitor Treatments," Journal of Virology, 77, 4836–4847.