Sample Size and Optimal Design For Logistic Regression With Binary Interaction - Eugene Demidenko
Eugene Demidenko∗, †
Dartmouth Medical School, Hanover, NH 03755, U.S.A.
SUMMARY
There is no consensus on what test to use as the basis for sample size determination and power analysis.
Some authors advocate the Wald test and some the likelihood-ratio test. We argue that the Wald test
should be used because the Z-score is commonly applied for regression coefficient significance testing
and therefore the same statistic should be used in the power function. We correct a widespread mistake on
sample size determination when the variance of the maximum likelihood estimate (MLE) is estimated at null
value. In our previous paper, we developed a correct sample size formula for logistic regression with single
exposure (Statist. Med. 2007; 26(18):3385–3397). In the present paper, closed-form formulas are derived
for interaction studies with binary exposure and covariate in logistic regression. The formula for the optimal
control–case ratio is derived such that it maximizes the power function given other parameters. Our sample
size and power calculations with interaction can be carried out online at www.dartmouth.edu/~eugened.
Copyright © 2007 John Wiley & Sons, Ltd.
KEY WORDS: binary data; gene–gene interaction; gene–environment interaction; information matrix;
likelihood-ratio test; optimal allocation problem; Wald test
1. INTRODUCTION
Recent advances in genomics and the declining cost of microarray analysis have increased the popularity
of gene–environment and gene–gene interaction epidemiological studies. ‘How many controls
and cases must there be to achieve the desired power and what should be their proportion’ are
essential questions at the very early stage of study design.
A vast literature on sample size determination for logistic regression with interaction can be
roughly divided into four groups with respect to the statistical test used: test on proportions,
assuming that all variables are binary, and general statistical tests, such as likelihood-ratio, score,
and Wald tests.
∗ Correspondence to: Eugene Demidenko, Dartmouth Medical School, Hanover, NH 03755, U.S.A.
† E-mail: [email protected]
When all variables are binary, the logistic regression with interaction can be expressed as a
2 × 2 × 2 contingency table. Then the test on interaction reduces to the Z-test on proportions,
sometimes called the Woolf test [1, 2]. Smith and Day [3] were, perhaps, the first
authors to apply this idea to sample size determination with interaction. A limitation of their
method is that they assumed an equal number of cases and controls. Hwang et al. [4] eliminated
that assumption, but assumed that the exposure and covariate are independent. The same approach
under the assumption of an equal number of cases and controls was later used by Yang et al. [5].
Whittemore [6] was the first to use the Wald test for sample size determination with logistic
regression. To avoid the exact computation of the Fisher information matrix, she suggested an
approximation when the response rate is small. Hsieh et al. [7] extended that approach and
Shieh [8] compared this approximation with the likelihood-ratio test by Monte Carlo simulations.
Foppa and Spiegelman [9] used the Wald test to determine the sample size in logistic regression
with interaction when the gene (exposure) variable is categorical (the executable program can be
downloaded at http://www.hsph.harvard.edu/faculty/spiegelman/ge_trend_v2.html).
Lubin and Gail [10] used the score test in the framework of logistic regression. Self and
Mauritsen [11] and Shieh [12] applied the likelihood-ratio test for power analysis and sample
size determination in a generalized linear model. Although general, these approaches need to
be adapted to interaction studies because they require the specification of the joint distribution
with the interaction variable being the product of the exposure and covariate. Gauderman [13, 14]
developed an algorithm to derive the sample size for logistic regression with interaction using the
likelihood-ratio test (one can freely download Windows-based software, QUANTO, based on this
approach at http://hydra.usc.edu/gxe).
We, however, advocate the Wald test as the basis for power analysis and sample size determination.
Our motivation is as follows. By definition, power is the probability of rejecting the
null hypothesis evaluated at the alternative. Since the method for testing the significance of the
regression coefficient, $\beta$, uses the Z-statistic,
$$Z = \frac{\hat{\beta}_{ML}}{\mathrm{SE}(\hat{\beta}_{ML})} \qquad (1)$$
the power is $\Pr(|Z| > Z_{1-\alpha/2})$, where $Z_{1-\alpha/2}$ is the $(1-\alpha/2)$th quantile of the standard normal
distribution. But this is the Wald test! Thus, the same test should be used for sample size determi-
nation. Although the Wald and likelihood-ratio tests are equivalent in the neighborhood of the null,
these tests are different globally in large samples [15]. Consequently, the power and the respective
sample size derived by the two tests will differ.
In the literature mentioned above, as well as in many other articles that use the Wald test, the
sample size was calculated using the following formula for the total number of observations:
$$n = \frac{\left(Z_{1-\alpha/2}\sqrt{V_0} + Z_P\sqrt{V}\right)^2}{\beta^2} \qquad (2)$$
where $Z_P$ is the $P$th quantile of the standard normal distribution for the desired power $P$,
$V_0$ is the variance of the maximum likelihood estimate (MLE) evaluated at the null hypothesis,
$H_0: \beta = 0$, and $V$ is the variance evaluated at the MLE, $\hat{\beta}_{ML}$. We shall refer to this formula
as the null-variance formula. We guess that formula (2) emerged in connection with testing
of the proportion, $H_0: p = p_0$, where $V_0 = p_0(1 - p_0)/n$ is the variance of the proportion in n
Copyright q 2007 John Wiley & Sons, Ltd. Statist. Med. 2008; 27:36–46
DOI: 10.1002/sim
Bernoulli trials under the null. But in the existing software, the variance and the test statistic, Z,
are never evaluated at the null but at the MLE, as in (1). As another argument, we draw a parallel
to the coefficient significance testing, $H_0: \beta_i = 0$, in the linear model using the t-test by computing
$\hat{\beta}_i / \sqrt{s^2 (X'X)^{-1}_{ii}}$. Following (2), to estimate $s^2$, we should compute the residual sum of squares
at $\beta_i = 0$ but we never do this. Remarkably, nobody derives the null-variance formula although it
travels from one paper to another.
Instead, we suggest
$$n = \frac{(Z_{1-\alpha/2} + Z_P)^2}{\beta^2}\, V \qquad (3)$$
for our sample size calculation. Formulas (2) and (3) usually produce close results, especially
when the alternative odds ratio (OR) is close to 1. However, they may produce differences up to
30 per cent otherwise, as was shown in the previous paper [15].
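Formula (3) is simple enough to evaluate directly. The following is a minimal sketch in Python, assuming only the standard library; the function name `wald_sample_size` is hypothetical, the variance 169.9 is the value quoted in the example later in the paper, and the odds ratio of 10 is an illustrative assumption rather than a value stated in this text.

```python
from math import ceil, log
from statistics import NormalDist


def wald_sample_size(log_or, variance, alpha=0.05, power=0.80):
    """Total sample size from formula (3): n = (Z_{1-alpha/2} + Z_P)^2 * V / beta^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile matching the target power
    return (z_alpha + z_power) ** 2 * variance / log_or ** 2


# Illustrative inputs: V = 169.9 is quoted in the paper's example;
# the alternative OR of 10 is an assumption made for this sketch.
n = wald_sample_size(log_or=log(10), variance=169.9)
print(ceil(n))
```

Because n is proportional to V, any reduction in the variance translates into the same relative reduction in the required sample size.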
The sample size formula (3) is general and can be applied in many settings and models. For
example, formula (3) is widely used when comparing means of two groups with normal distribution
[16], but for some reason the null-variance formula (2) is used for logistic regression.
In our previous paper, we developed a correct sample size formula for logistic regression
with single exposure [15]. The goal of the present paper is to derive the Wald-based, closed-form
power and sample size formulas for logistic regression with binary exposure and covariate and
their interaction with no limitation on the design specification, such as an equal number of cases
and controls or independence of the environment and gene factors. Alternatively, one
could employ a categorical or continuous covariate design if the information on their distribution
is available; however, a closed-form solution would then not be available.
There is, in principle, no difference between cohort and case–control studies in terms of parameter
estimation and testing, as was shown by Prentice and Pyke [17]. Thus, we do not distinguish between the two
designs. Since the intercept term in a case–control study determines the ratio of cases and normal
subjects in the normal group (zero exposure and zero covariate), we seek an optimal study with
minimum total sample size that yields the predefined power.
In this section, we derive the Wald power and sample size for the interaction coefficient, $\delta$, in
logistic regression defined by
$$\Pr(y = 1 \mid x, z) = \frac{e^{\beta_0 + \beta x + \gamma z + \delta (xz)}}{1 + e^{\beta_0 + \beta x + \gamma z + \delta (xz)}} \qquad (4)$$
Variables x and z, and therefore the interaction term, xz, are binary. To be specific, we may
assume that y codes the disease status, x is the exposure indicator, and z is a genotype at a
disease-susceptibility locus with a single allele. Note that variables x and z may be interpreted in
various ways that obviously will not affect the required sample size. For example, x may represent
other genetic binary information, say a genotype at a secondary disease-susceptibility locus with a
different allele. Then xz reflects the gene–gene interaction. To shorten the interpretation, we simply
say that z = 0 corresponds to a good gene and z = 1 to a bad gene. For example, if exposure is a
smoking status and y is cancer occurrence, the interaction xz reflects an elevated risk of cancer
for a smoker with a bad gene.
The null hypothesis is $H_0: \delta = 0$ with the alternative $H_A: \delta \neq 0$. The same model was used
by Gauderman [13, 14], although he applied the likelihood-ratio test and we apply the Wald test.
The power function of the Wald test, as the probability of rejecting the null when the alternative
is true, can be well approximated as
$$\mathrm{Power} = \Phi\left(-Z_{1-\alpha/2} + \frac{\sqrt{n}\,|\delta|}{\sqrt{V}}\right) \qquad (5)$$
where $Z_{1-\alpha/2}$ is the $(1-\alpha/2)$th quantile of the standard normal distribution function, $\Phi$; $\alpha$ is the
test size (typically, $\alpha = 0.05$); n is the sample size; and V is the asymptotic variance of $\sqrt{n}\,\hat{\delta}_{ML}$.
One can use (5) to find the minimum detectable difference, $\delta$, or the sample size given power P
(typically, P = 80 per cent or P = 90 per cent). For example, the sample size required to detect
the interaction log OR, $\delta$, with power P and significance level $\alpha$ is given by (3), with $\delta$ in place of $\beta$.
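The power approximation (5) and its inversion for the minimum detectable log OR can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names are hypothetical, the absolute value of the log OR is used so that odds ratios below 1 are handled symmetrically (an assumption about the two-sided form), and the inputs n = 252 and V = 169.9 are the figures quoted in the paper's example, paired with an assumed odds ratio of 10.

```python
from math import log, sqrt
from statistics import NormalDist


def wald_power(n, log_or, variance, alpha=0.05):
    """Approximate Wald power from formula (5)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(-z_alpha + sqrt(n) * abs(log_or) / sqrt(variance))


def min_detectable_log_or(n, variance, alpha=0.05, power=0.80):
    """Smallest |log OR| detectable with the given power, obtained by inverting (5)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sqrt(variance / n)


# Illustrative inputs: n and V are quoted in the paper's example; OR = 10 is assumed.
print(wald_power(n=252, log_or=log(10), variance=169.9))
print(min_detectable_log_or(n=252, variance=169.9))
```

Solving (5) for n instead of the log OR gives back formula (3), so the two functions and the sample size formula are mutually consistent.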
The major step in obtaining the power is computation of the variance V as a function of the
regression coefficients. To shorten the notation, we use capital letters to denote the exponents of
the regression coefficients. For example, $A = e^{\beta_0}$, $B = e^{\beta}$ is the OR of the individual effect of the
exposure, $G = e^{\gamma}$ is the OR of the gene, and $K = e^{\delta}$ is the interaction OR. To define probabilities
for x and z, we use the notations $p_x = \Pr(x = 1)$ and $p_z = \Pr(z = 1)$. The relationship between x
and z can be expressed as a conditional probability specified by logistic regression, namely,
$$P(x = 1 \mid z) = \frac{e^{c + \theta z}}{1 + e^{c + \theta z}}$$
Parameter $\theta$ is defined via the OR $D = e^{\theta}$, and a positive parameter $C = e^{c}$ is found from the
quadratic equation
$$1 - p_x = \frac{p_z}{1 + CD} + \frac{1 - p_z}{1 + C}$$
The asymptotic variance of $\sqrt{n}\,\hat{\delta}_{ML}$ is then
$$V = \frac{1}{L} + \frac{1}{R} + \frac{1}{F} + \frac{1}{J} \qquad (6)$$
where
$$L = \frac{A(1 - p_z)}{(1 + A)^2 (1 + C)}, \qquad R = \frac{ABCDGK p_z}{(1 + ABGK)^2 (1 + CD)},$$
$$F = \frac{ABC(1 - p_z)}{(1 + AB)^2 (1 + C)}, \qquad J = \frac{AG p_z}{(1 + AG)^2 (1 + CD)} \qquad (7)$$
Given V, the power and sample size are computed by formulas (5) and (3). Since we give the
complete inverse information matrix, the sample size may be determined not only for the interaction
term but also for any other coefficient in regression (4).
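The computation of V from formulas (6) and (7) can be sketched in a few lines. This is a minimal sketch, not the author's program: the function names are hypothetical, and the closed-form positive root for C comes from multiplying the quadratic equation through by (1 + C)(1 + CD). The inputs px = 0.25, pz = 0.4 and K = 10 are illustrative assumptions.

```python
from math import sqrt


def solve_c(px, pz, d):
    """Positive root C of 1 - px = pz / (1 + C*D) + (1 - pz) / (1 + C)."""
    qa = (1 - px) * d
    qb = (1 - px) * (1 + d) - pz - (1 - pz) * d
    # The constant term of the quadratic is -px, so exactly one root is positive.
    return (-qb + sqrt(qb * qb + 4 * qa * px)) / (2 * qa)


def interaction_variance(a, b, g, k, d, px, pz):
    """Asymptotic variance V of the interaction estimate, formulas (6) and (7)."""
    c = solve_c(px, pz, d)
    l = a * (1 - pz) / ((1 + a) ** 2 * (1 + c))
    r = a * b * c * d * g * k * pz / ((1 + a * b * g * k) ** 2 * (1 + c * d))
    f = a * b * c * (1 - pz) / ((1 + a * b) ** 2 * (1 + c))
    j = a * g * pz / ((1 + a * g) ** 2 * (1 + c * d))
    return 1 / l + 1 / r + 1 / f + 1 / j


# Assumed illustrative inputs: no main effects, independent x and z, A = 1.
v = interaction_variance(a=1, b=1, g=1, k=10, d=1, px=0.25, pz=0.4)
print(round(v, 1))
```

With these assumed inputs the function returns about 169.9, which agrees with the variance the paper quotes for the design with A = 1 in its example.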
To compute the required sample size, we need to specify four groups of parameters (nine
parameters in total):
1. Three parameters in formula (3): the significance level $\alpha$; the power P; and the alternative
log OR, $\delta = \ln K$.
2. Three parameters that specify the joint distribution of the exposure and the gene: the propor-
tion of subjects in the general population with exposure, px = Pr(x = 1); the proportion of
subjects in the general population with the bad gene, pz = Pr(z = 1); and the OR between
exposure and gene D.
3. Two ORs as individual effects of the exposure, $B = e^{\beta}$, and the gene, $G = e^{\gamma}$, with ORs
computed based on coefficients from model (4).
4. The proportion of diseased subjects in the general population with no exposure and a good
gene (baseline prevalence), $p_y = \Pr(y = 1 \mid x = 0, z = 0) = A/(1 + A)$. In a case–control study,
this corresponds to the proportion of cases among subjects with zero exposure and covariate.
Note that the above specifications of px and pz assume marginal prevalence probabilities.
Alternatively, we could specify conditional probabilities, such as probability of exposure in controls,
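as in [9]. For example, A may be determined using the ratio of total number of controls to subjects
from the equation
$$\frac{n_0}{n} = \frac{p_{00}}{1 + A} + \frac{p_{10}}{1 + AB} + \frac{p_{01}}{1 + AG} + \frac{p_{11}}{1 + ABGK} \qquad (8)$$
where four probabilities, $\{p_{ij} = \Pr(x = i, z = j),\ i = 0, 1;\ j = 0, 1\}$, are defined in the Appendix. Vice versa, if A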
is determined, as in optimal design (Section 3), the proportion of controls to n is equal to the
left-hand side of equation (8). Note that A is the expected ratio of cases to controls in the normal
group (no exposure, good gene), while $(n - n_0)/n_0$ is the ratio of cases to controls in the entire
sample. For example, an equal number of controls and cases in the entire sample does not imply
an equal design in the normal group (A = 1). Formula (8) should be used for the back and forth
calculation.
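The back and forth calculation between A and the overall proportion of controls can be sketched as below. The sketch assumes that equation (8) equates the proportion of controls to the sum over the four (x, z) cells of Pr(x, z) times Pr(y = 0 | x, z) under model (4); that form is an assumption here, as are the function names, the cell probabilities (independent x and z with px = 0.25 and pz = 0.4) and the odds ratio K = 10. The inverse direction uses plain bisection because the proportion of controls decreases monotonically in A.

```python
from math import sqrt


def control_fraction(a, b, g, k, probs):
    """Expected proportion of controls n0/n; probs maps (x, z) to Pr(x, z)."""
    odds_ratio = {(0, 0): 1.0, (1, 0): b, (0, 1): g, (1, 1): b * g * k}
    return sum(p / (1 + a * odds_ratio[cell]) for cell, p in probs.items())


def solve_a(target, b, g, k, probs, lo=1e-9, hi=1e9):
    """Find A that yields the target proportion of controls (bisection on a log scale)."""
    for _ in range(200):
        mid = sqrt(lo * hi)
        if control_fraction(mid, b, g, k, probs) > target:
            lo = mid  # too many controls, so A must be larger
        else:
            hi = mid
    return sqrt(lo * hi)


# Assumed illustrative cell probabilities: independent x and z, px = 0.25, pz = 0.4.
probs = {(0, 0): 0.45, (1, 0): 0.15, (0, 1): 0.30, (1, 1): 0.10}
print(control_fraction(1.0, 1, 1, 10, probs))  # equal design in the normal group
print(solve_a(0.5, 1, 1, 10, probs))           # A giving equal cases and controls overall
```

The second call illustrates the point made in the text: an equal number of cases and controls in the entire sample corresponds to a value of A different from 1.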
Another comment is regarding setting up the value for the alternative OR of interaction, K.
Following the line of common reasoning, we compare the effect of two groups, x = 0, z = 0 and
x = 1, z = 1, which on the logit scale is $\ln K = \ln B + \ln G + \ln K_S$ with the interaction effect
$\ln K_S = \ln K - (\ln B + \ln G)$.
Thus, the pure interaction or synergistic effect on the OR scale is K/(BG). Consequently, if we
want to detect a synergistic OR, $K_S$, we set $K = K_S BG$.
Figure 1. The required sample size for a gene–gene interaction study as a function of the alternative OR
computed using the Wald test with the nominal power 80 per cent. There is good agreement between
theoretical power and empirical power from simulations.
As mentioned in the Introduction, the standard maximum likelihood theory for unmatched case–
control studies holds and therefore the variance of the MLE and the sample size formulas derived
above remain valid. The only difference is that $p_y = A/(1 + A)$ is not interpreted as the disease
prevalence rate but simply as the proportion of cases in the sample. Since the proportion of cases
and controls in a case–control study can be chosen as part of the study design, we may find
an optimal $p_y$ that minimizes the sample size, n, given the power probability, P. This problem
is known as the ‘optimal allocation problem’ and several authors have studied it. For example,
Brittain and Schlesselman [19] suggest an optimal $p_y$ for the test on proportion that minimizes
the variance or maximizes the power. For optimal design/proportion of cases, we seek the $p_y$ that
minimizes the required sample size to achieve a given power P. Thus, following formula (3), the
problem of optimal design of case–control interaction studies reduces to minimization of V as a
function of A. As shown in the Appendix, the optimal ratio of cases to controls is
$$A_{opt} = \sqrt{\frac{\mu(1 + BC) + \nu(1 + BCDK)}{\mu B(C + B) + \nu BG^2 K(BK + CD)}} \qquad (9)$$
where
$$\mu = (1 + C)DGK p_z, \qquad \nu = (1 + CD)(1 - p_z)$$
It is elementary to check that when individual effects of exposure and gene are zero (B = G = 1)
and exposure and gene are independent (D = 1), we have
$$A_{opt} = \sqrt{\frac{1 + (K - 1)(p_x + p_z - p_x p_z)}{K[K - (K - 1)(p_x + p_z - p_x p_z)]}} \qquad (10)$$
If K = 1 we have $A_{opt} = 1$. This means that the 50/50 design is optimal only when the alternative
OR is 1. It should be noted that $A_{opt}$ gives the optimal ratio of cases to controls in the group with
zero exposure and good gene (x = z = 0). To obtain the optimal number of controls, $n_0$, in the total
sample, formula (8) should be used.
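The optimal ratio can be checked numerically. Written as a function of A, the variance in (6) and (7) has the form a/A + b + cA, so its minimizer is the square root of a/c; the sketch below assumes that square-root form for formula (9) and compares it with a direct evaluation of V. The function names are hypothetical, and the inputs (B = G = D = 1, K = 10, C = 1/3, pz = 0.4, that is px = 0.25) are illustrative assumptions.

```python
from math import sqrt


def optimal_a(b, g, k, c, d, pz):
    """Optimal case-to-control ratio in the normal group, formula (9)."""
    mu = (1 + c) * d * g * k * pz
    nu = (1 + c * d) * (1 - pz)
    num = mu * (1 + b * c) + nu * (1 + b * c * d * k)
    den = mu * b * (c + b) + nu * b * g ** 2 * k * (b * k + c * d)
    return sqrt(num / den)


def variance_at(a, b, g, k, c, d, pz):
    """Variance V from formulas (6) and (7) for a given A and C."""
    l = a * (1 - pz) / ((1 + a) ** 2 * (1 + c))
    r = a * b * c * d * g * k * pz / ((1 + a * b * g * k) ** 2 * (1 + c * d))
    f = a * b * c * (1 - pz) / ((1 + a * b) ** 2 * (1 + c))
    j = a * g * pz / ((1 + a * g) ** 2 * (1 + c * d))
    return 1 / l + 1 / r + 1 / f + 1 / j


# Assumed illustrative design: no main effects, independent x and z.
design = dict(b=1, g=1, k=10, c=1 / 3, d=1, pz=0.4)
a_opt = optimal_a(**design)
print(round(a_opt, 3), round(variance_at(a_opt, **design), 1))
```

Under these assumed inputs the sketch gives a ratio of about 0.343 and a minimum variance of about 121.5, the same figures the paper reports in its example.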
In Figure 2, we show the optimal ratio of cases to controls in the normal group (x = z = 0),
computed by formula (10), and in the total sample, using formula (8), with several probabilities of
x assuming that the probability of z is 0.5. As shown in the figure, for K > 1 there should be fewer cases and
more controls in the normal group, but when K < 1 the reverse is true. When $p_x$ is close to 1 this
ratio is almost 1, but it changes dramatically with the alternative OR for small probability, $p_x$.
The relationship between the optimal proportion of cases and controls in the total sample depends
on the prevalence of the exposure. If px <0.5 it resembles the previous case, but for large px it
reverses. As evident from this graph, the ratio of cases to controls in the normal and entire group
may be quite different, especially under extreme probabilities of exposure. This fact should be
remembered when specifying the value of A for computing the power and n.
This approach can be extended to the design of cost-effective epidemiological studies that
minimize the cost function $p_0 n_0 + p_1 n_1$, where $p_0$ and $p_1$ are the costs of a control and a case.
Figure 2. Optimal proportion of cases to controls in two groups for logistic regression with interaction
for different values of $p_x$ in the range from 0.1 to 0.9, and $p_z = 1/2$.
the optimal case–control ratio. Since B = G = D = 1, we can apply formula (10), which yields
$A_{opt} = 0.343$ with the minimum variance V = 121.5. As follows from formula (3), the reduction
in the total sample size is 121.5/169.9 ≈ 0.72, about 30 per cent. Thus, the total sample size
n ≈ 252 × 0.715 ≈ 180 with the optimal ratio of cases to controls in the normal group gives the
same power as n = 252, assuming that this ratio is 1. Using formula (8), we compute the proportion
of controls in the entire sample ($n_0/n$): if A = 1 we have $n_0/n = 0.46$ and under the optimal design,
when A = 0.343, we have $n_0/n = 0.69$. Thus, under the optimal design, there should be 124 controls
and 56 cases in the entire sample of 180 participants.
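The arithmetic of this example can be retraced with a few lines of Python. The sketch uses only the numbers quoted above (the two variances, n = 252 and the control proportion 0.69); the variable names are hypothetical.

```python
# Figures quoted in the example above.
v_equal, v_optimal = 169.9, 121.5  # variance with A = 1 and with the optimal A
n_equal = 252                      # total sample size when A = 1
control_share = 0.69               # n0/n under the optimal design

ratio = v_optimal / v_equal              # n is proportional to V in formula (3)
n_optimal = round(n_equal * ratio)       # total sample size under the optimal design
controls = round(control_share * n_optimal)
cases = n_optimal - controls

print(round(ratio, 2), n_optimal, controls, cases)
```

The unrounded ratio, about 0.715, is what yields a total of 180; multiplying 252 by the rounded value 0.72 would give a slightly larger figure.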
The power function is the probability of rejecting the null hypothesis at the alternative. Therefore, power
should be computed using the same statistic used to test the null. Since the Z-test is commonly
used for coefficient significance testing, the same test should be used for power analysis and
sample size determination. However, when the likelihood-ratio test is planned for use, the same
test statistic should be used as the basis for power computation.
The Wald-based sample size and power calculations for logistic regression with binary interaction
can be carried out online at www.dartmouth.edu/~eugened.
It is a mystery why the null variance formula is used in all papers on Wald-based sample
size determination. One explanation is that it was initially borrowed from the test on proportion,
$H_0: p = p_0$. But in coefficient significance testing, such as in model (4), $H_0: \delta = 0$, the variance
is evaluated at the MLE, not at $\delta = 0$. Although the two formulas are close for small log ORs, they
may lead to a considerable difference otherwise. Sometimes the difference in the tests is
downplayed, suggesting that they are equivalent in large samples. Although this statement is true at the
null, it is not true at the alternative. Consequently, different tests yield different sample sizes even
asymptotically.
Since specification of the parameter values is never rigorous, it is a good practice to compute a
sample size under different scenarios. Closed-form formulas presented in this work become very
helpful to carry out the respective sensitivity analysis.
The power computation and sample size formula are applicable to case–control studies as well.
For a case–control study, we find an optimal proportion that maximizes power and, respectively,
minimizes the total sample size. We need to be careful when specifying the proportion of cases in
the total sample or normal group. When the alternative OR is close to 1, the 50/50 design is close
to optimal. But when OR becomes extreme, as may happen in gene–gene or gene–environment
studies, we may substantially reduce the total number of subjects and yet have the same power.
In our example, the optimal design reduces the number of subjects by 30 per cent, with the same
power.
APPENDIX A
where subindex (x, z) means that the expectation is taken over the joint distribution of x and z.
The right-hand side of this expression follows from the identities $x^2 = x$ and $z^2 = z$. Since x and
z take value 0 or 1, we express the information matrix as
$$\frac{e^{\beta_0}}{[1 + e^{\beta_0}]^2} M_1 \Pr(x = 0, z = 0) + \frac{e^{\beta_0 + \beta}}{[1 + e^{\beta_0 + \beta}]^2} M_2 \Pr(x = 1, z = 0)$$
$$+ \frac{e^{\beta_0 + \gamma}}{[1 + e^{\beta_0 + \gamma}]^2} M_3 \Pr(x = 0, z = 1) + \frac{e^{\beta_0 + \beta + \gamma + \delta}}{[1 + e^{\beta_0 + \beta + \gamma + \delta}]^2} M_4 \Pr(x = 1, z = 1)$$
where
$$M_1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \qquad M_2 = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$
$$M_3 = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \qquad M_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix}$$
which is the asymptotic covariance matrix of $\sqrt{n}$ times the MLE of $(\beta_0, \beta, \gamma, \delta)$. The last
diagonal element of this matrix is the asymptotic variance of the interaction coefficient estimate, $\sqrt{n}\,\hat{\delta}_{ML}$.
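The relation between the information matrix and the variance of the interaction estimate can be verified numerically. The sketch below builds the per-observation information matrix as a weighted sum of the outer products of the design vector (1, x, z, xz), inverts it with a small Gauss-Jordan routine, and reads off the last diagonal element. The function names, the coefficient values and the cell probabilities are illustrative assumptions; the symbols for the coefficients follow model (4).

```python
from math import exp, log


def information_matrix(beta0, beta, gamma, delta, probs):
    """Per-observation Fisher information for model (4); probs maps (x, z) to Pr(x, z)."""
    info = [[0.0] * 4 for _ in range(4)]
    for (x, z), p in probs.items():
        u = [1.0, x, z, x * z]                       # design vector (1, x, z, xz)
        eta = beta0 + beta * x + gamma * z + delta * x * z
        weight = p * exp(eta) / (1 + exp(eta)) ** 2  # logistic variance times cell probability
        for i in range(4):
            for j in range(4):
                info[i][j] += weight * u[i] * u[j]   # adds weight times the outer product u u'
    return info


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Assumed illustrative inputs: A = B = G = 1, K = 10, independent x and z.
cells = {(0, 0): 0.45, (1, 0): 0.15, (0, 1): 0.30, (1, 1): 0.10}
cov = invert(information_matrix(0.0, 0.0, 0.0, log(10), cells))
print(round(cov[3][3], 1))
```

For a saturated model on a 2 × 2 layout, the last diagonal element of the inverse equals the sum of the reciprocals of the four cell weights, which is the form of V given in formula (6).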
ACKNOWLEDGEMENTS
I am thankful to Sergey Demidenko, who helped me with programming and webpage design for sample
size and power calculations. The author is also grateful for the reviewer’s comments that helped to improve
the paper.
REFERENCES
1. Rosner B. Fundamentals of Biostatistics (6th edn). Duxbury: Pacific Grove, 2005.
2. Gardiner J, Pathak D, Indurkhya A. Power calculations for detecting interaction in stratified 2 × 2 tables. Statistics
and Probability Letters 1999; 41(3):267–275.
3. Smith PG, Day NE. The design of case–control studies: the influence of confounding and interaction effects.
International Journal of Epidemiology 1984; 13(3):356–365.
4. Hwang S-J, Beaty TH, Liang K-L, Coresh J, Khoury MJ. Minimum sample size estimation to detect gene–
environment interaction in case–control designs. American Journal of Epidemiology 1994; 140(11):1029–1037.
5. Yang Q, Khoury MJ, Friedman JM, Flanders WD. On the use of population attributable fraction to determine
sample size for case–control studies of gene–environment interaction. Epidemiology 2003; 14(2):161–167.
6. Whittemore AS. Sample size for logistic regression with small response probability. Journal of the American
Statistical Association 1981; 76(373):27–32.
7. Hsieh FY, Bloch DA, Larsen MD. A simple method of sample size calculation for linear and logistic regression.
Statistics in Medicine 1998; 17(14):1623–1634.
8. Shieh G. On power and sample size calculations for likelihood-ratio tests in generalized linear models. Biometrics
2000; 56(4):1192–1196.
9. Foppa I, Spiegelman D. Power, sample size calculations for case–control studies of gene–environment interactions
with polytomous exposure variable. American Journal of Epidemiology 1997; 146(7):596–604.
10. Lubin J, Gail M. On power and sample size calculations for studying features of the relative odds of disease.
American Journal of Epidemiology 1990; 131(3):552–566.
11. Self SG, Mauritsen RH. Power/sample size calculations for generalized linear models. Biometrics 1988; 44(1):
79–86.
12. Shieh G. A comparison of two approaches for power and sample size calculations in logistic regression models.
Communications in Statistics—Statistical Simulations 2000; 29(3):763–791.
13. Gauderman WJ. Sample size requirements for association studies of gene–gene interaction. American Journal of
Epidemiology 2002; 155(5):478–484.
14. Gauderman WJ. Sample size requirements for matched case–control studies of gene–environment interaction.
Statistics in Medicine 2002; 21(1):35–50.
15. Demidenko E. Sample size determination for logistic regression revisited. Statistics in Medicine 2007; 26(18):
3385–3397.
16. Machin D, Campbell MJ. Design of Studies for Medical Research. Wiley: New York, 2005.
17. Prentice RL, Pyke R. Logistic disease incidence models and case–control studies. Biometrika 1979; 66(3):403–411.
18. Gilliland F, McConnell R, Peters J, Gong J. A theoretical basis for investigating ambient air pollution and
children’s respiratory health. Environmental Health Perspectives 1999; 107(3):403–407.
19. Brittain E, Schlesselman JJ. Optimal allocation for the comparison of proportions. Biometrics 1982; 38(4):
1003–1009.