Am. J. Epidemiol. 1997 Schaubel 450 8
Copyright 1997 by The Johns Hopkins University School of Hygiene and Public Health
All rights reserved
Douglas Schaubel,1,2 James Hanley,1 Jean-Paul Collet,1,3 Jean-Francois Boivin,1,3 Colin Sharpe,1-3
Howard I. Morrison,2 and Yang Mao2
Preexisting computerized databases are potentially valuable sources of epidemiologic data. Since such
databases are infrequently created specifically for etiologic research, data may be available for the exposure
of interest and, through record linkage, for the endpoint of interest, but lacking for potential confounders.
Because of the size of these databases, two-stage sampling is an efficient alternative to surveying the entire
study population for confounder data. At stage 1, information on exposure and disease status is obtained for
the entire study population. Confounder data are collected for probability-selected subsamples at stage 2.
Logistic regression is performed on the stage 2 sample, with the parameter estimates and variances appropriately corrected to account for the stage 1 data. In this paper, the authors present methods for determining
the required stage 2 sample size in the case of categorical exposure and confounding variables. Sample size
tables, power curves, and a computer program have been produced to accommodate a binary exposure and
a single binary confounder. With the increasing availability of preexisting yet incomplete databases, the
potential for use of two-stage sampling will greatly increase in the future. This investigation provides a basis
for estimating the number of participants to sample for the collection of confounder data at the second stage.
Am J Epidemiol 1997;146:450-8.
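[Figure 1. Stage 1 sample: cross-tabulation of exposure level (E) and disease status (D) (2 x J table).]

E        D=0          D=1
0        N_{0,0}      N_{1,0}
1        N_{0,1}      N_{1,1}
...      ...          ...
J-1      N_{0,J-1}    N_{1,J-1}
Total    N_0          N_1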
tion," is usually more efficient than random, disease-based, or exposure-based stage 2 sampling. Its efficiency derives from sampling fractions that are
inversely proportional to cell size, exploiting the fact
that observations from small groups contribute, on
average, more information than those from large
groups. The reason for the increased efficiency can be
understood heuristically from Woolf's formula for the
variance of the logarithm of the odds ratio (OR) (5).
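The heuristic can be illustrated numerically. The sketch below uses Woolf's approximation, in which the variance of log(OR) is the sum of the reciprocals of the four cell counts of a 2 x 2 table. The function name and the counts are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def woolf_log_or_variance(a, b, c, d):
    """Woolf's approximate variance of log(OR) for a 2 x 2 table with cells a, b, c, d."""
    return 1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d


# Hypothetical counts: one small cell and three large cells.
base = woolf_log_or_variance(20, 500, 500, 500)

# Add 20 subjects to the small cell, or 20 subjects to one large cell.
more_small = woolf_log_or_variance(40, 500, 500, 500)
more_large = woolf_log_or_variance(20, 520, 500, 500)

gain_small = base - more_small  # variance reduction from enlarging the small cell
gain_large = base - more_large  # variance reduction from enlarging a large cell

print(f"baseline variance: {base:.4f}")
print(f"reduction, small cell: {gain_small:.5f}")
print(f"reduction, large cell: {gain_large:.5f}")
```

Because each cell enters the variance as a reciprocal, the same number of extra subjects lowers the variance far more when placed in the small cell, which is the intuition behind oversampling small cells at stage 2.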
ANALYSIS OF TWO-STAGE DATA
Preliminary issues
(1)
(2)
D=0
E        C=0            C=1            ...    C=K-1            Total
0        n_{0,0,0}      n_{0,0,1}      ...    n_{0,0,K-1}      n_{0,0} = s_{0,0} N_{0,0}
1        n_{0,1,0}      n_{0,1,1}      ...    n_{0,1,K-1}      n_{0,1} = s_{0,1} N_{0,1}
...      ...            ...            ...    ...              ...
J-1      n_{0,J-1,0}    n_{0,J-1,1}    ...    n_{0,J-1,K-1}    n_{0,J-1} = s_{0,J-1} N_{0,J-1}
Total                                                          n_0

D=1
E        C=0            C=1            ...    C=K-1            Total
0        n_{1,0,0}      n_{1,0,1}      ...    n_{1,0,K-1}      n_{1,0} = s_{1,0} N_{1,0}
1        n_{1,1,0}      n_{1,1,1}      ...    n_{1,1,K-1}      n_{1,1} = s_{1,1} N_{1,1}
...      ...            ...            ...    ...              ...
J-1      n_{1,J-1,0}    n_{1,J-1,1}    ...    n_{1,J-1,K-1}    n_{1,J-1} = s_{1,J-1} N_{1,J-1}
Total                                                          n_1

FIGURE 2. Cross-tabulation of exposure level (E), disease status (D), and confounder level (C) (2 x J x K table) for the stage 2 sample.
Confounder (C) information is obtained for n2 = Σi Σj n_ij subjects. The n_ij's are chosen at random from the N_ij members of the stage 1 sample
(see figure 1) with sampling fractions s_ij = n_ij/N_ij.
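The sampling fractions s_ij = n_ij/N_ij can be sketched for one possible allocation rule. The example below assumes a "balanced" rule in which the stage 2 budget n2 is split equally across the disease-by-exposure cells, and any cell smaller than its share is sampled completely with the remainder redistributed. This rule, the function name, and the dictionary layout are assumptions made for illustration; the stage 1 counts are the expected counts under H0 quoted in figure 3.

```python
def balanced_allocation(stage1, n2):
    """Split n2 equally across cells, capping each cell at its stage 1 count.

    stage1 maps (disease i, exposure j) -> N_ij. Returns a dict of n_ij.
    Hypothetical helper: the equal-split-with-cap rule is an assumption.
    """
    alloc = {}
    remaining = dict(stage1)
    budget = float(n2)
    while remaining:
        share = budget / len(remaining)
        # Cells too small to supply an equal share are sampled completely.
        small = {cell: size for cell, size in remaining.items() if size <= share}
        if not small:
            for cell in remaining:
                alloc[cell] = share
            break
        for cell, size in small.items():
            alloc[cell] = float(size)
            budget -= size
            del remaining[cell]
    return alloc


# Expected stage 1 counts under H0 from figure 3, keyed by (D, E).
stage1 = {(0, 0): 1800.0, (0, 1): 200.0, (1, 0): 863.3, (1, 1): 136.7}
n_ij = balanced_allocation(stage1, 600)
s_ij = {cell: n_ij[cell] / stage1[cell] for cell in stage1}

for cell in sorted(stage1):
    print(cell, round(n_ij[cell], 1), round(s_ij[cell], 3))
```

Under this rule the smallest cell is sampled with fraction 1 and the largest cell with the smallest fraction, so the fractions fall as cell size grows.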
(4)
(5)
ference in the precision of the estimator of $\beta$ attributable to having adjusted for the confounder. The
difference, $V^{(2)}(\beta)_{\mathrm{crude}} - V^{(1)}$, can be considered the
gain in precision obtained by incorporating the stage 1
information into the stage 2 estimates. The crude variances are given by the sum of the reciprocals of the
entries of the 2 X 2 table relating exposure and disease, as per Woolf's method (5). That is,

$$
V^{(1)} = \frac{1}{N_{11}} + \frac{1}{N_{10}} + \frac{1}{N_{01}} + \frac{1}{N_{00}},
\qquad
V^{(2)}(\beta)_{\mathrm{crude}} = \frac{1}{n_{11}} + \frac{1}{n_{10}} + \frac{1}{n_{01}} + \frac{1}{n_{00}}
\tag{6}
$$
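$$
V^{(2)}(\beta)_{\mathrm{logistic}} = \left[ \sum_{k} \frac{1}{v_k} \right]^{-1},
\qquad
v_k = \frac{1}{n_{11k}} + \frac{1}{n_{10k}} + \frac{1}{n_{01k}} + \frac{1}{n_{00k}},
\tag{7}
$$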
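A worked sketch of how these variance pieces can be combined is given here. It assumes that the two-stage variance is the stratified (logistic) stage 2 variance minus the gain term described above, that is, minus the difference between the stage 2 and stage 1 crude variances. That combination is an assumption, adopted because it reproduces the projected variance of about 0.0178 quoted in figure 3. The cell counts are the expected entries under HA from figure 3; the value 109.9 for the D=0, E=0, C=0 cell is not legible in the figure and is inferred here as the remainder of n2 = 600. Function names are hypothetical.

```python
def woolf_variance(cells):
    """Sum of reciprocals of the cell counts of one 2 x 2 table."""
    return sum(1.0 / n for n in cells)


def stratified_variance(strata):
    """Inverse of the summed inverse stratum variances (one 2 x 2 table per stratum)."""
    return 1.0 / sum(1.0 / woolf_variance(cells) for cells in strata)


def two_stage_variance(stage1, stage2_strata):
    """Stage 2 stratified variance minus the gain from stage 1 (assumed form)."""
    v_logistic = stratified_variance(stage2_strata)
    # Collapse the strata over the confounder to get the stage 2 crude table.
    stage2_crude = [sum(cell) for cell in zip(*stage2_strata)]
    gain = woolf_variance(stage2_crude) - woolf_variance(stage1)
    return v_logistic - gain


# Cell order throughout: (D=1,E=1), (D=1,E=0), (D=0,E=1), (D=0,E=0).
stage1 = (191.9, 808.1, 200.0, 1800.0)
stage2_strata = [
    (27.9, 71.6, 61.0, 109.9),   # C = 0 (109.9 inferred, see lead-in)
    (122.1, 78.4, 89.0, 40.1),   # C = 1
]

v_a = two_stage_variance(stage1, stage2_strata)
print(f"projected variance under HA: {v_a:.6f}")
```

Keeping the three components as separate functions makes it easy to see how much of the final precision comes from the stage 2 subsample and how much from the full stage 1 table.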
a. Exposure/Confounder Association

E        C=0            C=1            Total
0        p00=0.66       p01=0.24       1-pE=0.9
1        p10=0.04       p11=0.06       pE=0.1
Total    1-pC=0.7       pC=0.3

b. Expected stage 1 entries (D x E)

E    D=0                            D=1
0    N0(p01+p00) = 1800 (1800)      808.1 (863.3)
1    N0(p11+p10) = 200 (200)        N1(φ11+φ10)/φ = 191.9 (136.7)

c. Expected stage 2 entries (D x E x C)

D    E    C=0                                  C=1
0    0    ...                                  n00 p01/(p01+p00) = 40.1 (41.3)
0    1    n01 p10/(p11+p10) = 61.0 (62.8)      89.0 (91.7)
1    0    71.6 (73.7)                          78.4 (80.7)
1    1    27.9 (25.4)                          122.1 (111.3)

FIGURE 3. Example of power calculation for a two-stage study. A two-stage case-control study is planned to evaluate the effect on disease
incidence (D) of an exposure (E) recorded on a binary scale after adjustment for a single binary confounder (C). The following quantities are
known or estimated: cases (N1 = 1,000), controls (N0 = 2,000), exposure prevalence (pE = 10%), confounder prevalence (pC = 30%),
exposure odds ratio (e^β = 1.5) (crude exposure odds ratio = 2.1), confounder odds ratio (e^γ = 3.0), and stage 2 sample size (fixed in advance)
(n2 = 600). The (E,C) distribution in the source population is described in section a, with θ = (0.66 x 0.06)/(0.24 x 0.04) ≈ 4.0 (the cell
probabilities in section a are shown rounded to two decimals). Expected
cell entries for the cross-tabulation of D and E at stage 1 under HA (H0 in parentheses) are given in section b. Expected cell entries for the
cross-tabulation of D, E, and C at stage 2 under HA (H0) are given in section c. The projected variances under H0 and HA are V0(β) = 0.02
and VA(β) = 0.017821, respectively. Power is estimated at 1 - Φ(-z_β) = 83%, where z_β = (log(1.5) - 1.96 x √0.02)/√0.017821 = 0.961.
Power = 92% when the entire study population is sampled at stage 2. Note that φ_jk = p_jk e^(βj + γk), φ = φ11 + φ10 + φ01 + φ00.
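The power figure in this example can be checked with a few lines of arithmetic. The sketch below assumes the usual normal-approximation form z_β = (|β| - z_(α/2) √V0)/√VA, which is the form that yields the quoted value of 0.961 when V0 = 0.02 and VA = 0.017821. The function names are hypothetical and this is not the authors' SAS or C program.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_stage_power(beta, v0, va, z_alpha=1.96):
    """Approximate power for testing beta = 0 (two-sided), given the
    projected variances under H0 (v0) and under HA (va)."""
    z_beta = (abs(beta) - z_alpha * math.sqrt(v0)) / math.sqrt(va)
    return normal_cdf(z_beta), z_beta


# Values quoted in figure 3: exposure OR 1.5, V0 = 0.02, VA = 0.017821.
power, z_beta = two_stage_power(math.log(1.5), 0.02, 0.017821)
print(f"z_beta = {z_beta:.3f}, power = {power:.1%}")
```

The variance under H0 scales the critical value, while the variance under HA scales the distance between the true log odds ratio and that critical value.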
TABLE 1.
[Heading fragments surviving extraction: e; 10%, 30%, 50%; 10%, 50%; 10%, 30%, 50%; 30%, 40%; 20%, 30%, 40%. Row-group and column headings are not recoverable. Entries are stage 2 sample sizes, read column-wise from the source as three panels of nine rows by six columns; the first column is the 1.5/3.0/6.0 row label.]

1.5     48     37     33     30     27     25
3.0    136     99     84     85     72     64
6.0    302    207    166    192    150    125
1.5     92     72     66     57     52     50
3.0    245    195    179    156    143    136
6.0    548    446    406    356    329    308
1.5     91     72     68     56     52     51
3.0    220    184    179    139    134    135
6.0    434    390    396    280    285    301

1.5    192    148    135    119    107    102
3.0    311    236    206    200    173    156
6.0    528    387    320    352    285    242
1.5    308    248    232    194    180    175
3.0    460    390    373    303    288    282
6.0    762    680    666    516    504    500
1.5    260    211    201    162    154    152
3.0    352    308    310    228    226    235
6.0    507    484    522    337    358    393

1.5    349    273    250    219    198    188
3.0    490    381    339    323    282    257
6.0    756    575    487    516    425    367
1.5    504    408    384    318    298    291
3.0    639    560    548    428    414    414
6.0    932    879    897    642    651    668
1.5    384    315    302    241    229    228
3.0    443    397    408    291    293    308
6.0    463    551    613    378    408    460
* Stage 2 sample size required to detect an exposure odds ratio (OR) of e^β = 1.5 with 90% power and type I
error of α = 0.05 (two-sided).
† A case-control study (N1 cases and N0 controls at stage 1) designed to evaluate the effect of exposure
recorded on a binary scale, with adjustment for a single binary confounder with the following quantities anticipated:
exposure prevalence = pE, confounder prevalence = pC, exposure OR = e^β, confounder OR = e^γ, and (E,C) cross-product ratio = θ.
tables and power curves, generated by a program written in SAS (SAS Institute, Cary, North Carolina),
which provides the required n2 for either 80 percent or
90 percent power. The second is an executable program (source code written in C) which can calculate
either power for a given n2 or the minimum n2 needed
to achieve a prespecified level of power. The relevant
tables and software for this procedure are available
from the first author upon request.
DISCUSSION
APPENDIX
Expected Cell Entries
Stage 1. Assume that the numbers of diseased and nondiseased subjects at stage 1 are known, and that the data layout
for the study population follows that of table 1. Under a multiplicative model, as in equation 1, with no interaction, the
expected entries of the 2 X J table at stage 1 are:
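$$
E(N_{0j}) = N_0 \sum_{k=0}^{K-1} p_{jk},
\qquad
E(N_{1j}) = N_1 \, \frac{\sum_{k=0}^{K-1} \phi_{jk}}{\sum_{j=0}^{J-1} \sum_{k=0}^{K-1} \phi_{jk}}
$$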
Stage 2. With the n_ij known, the expected cell entries for the 2 X J X K table at stage 2 are given by:
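$$
E(n_{0jk}) = n_{0j} \, \frac{p_{jk}}{\sum_{k=0}^{K-1} p_{jk}},
\qquad
E(n_{1jk}) = n_{1j} \, \frac{\phi_{jk}}{\sum_{k=0}^{K-1} \phi_{jk}}
$$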
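The expected cell entries can be sketched for the binary exposure, binary confounder case of figure 3. The code assumes φ_jk = p_jk e^(βj + γk), as noted in the figure 3 caption, with controls distributed according to p_jk and cases according to φ_jk. The helper that recovers the joint (E,C) probabilities from the two prevalences and the cross-product ratio θ solves a quadratic; that helper, all function names, and the choice of 150 subjects per stage 2 cell are assumptions made for illustration.

```python
import math


def joint_probs(p_e, p_c, theta):
    """Joint (E, C) probabilities p[j][k] with margins p_e, p_c and
    cross-product ratio theta (hypothetical helper)."""
    if theta == 1.0:
        p11 = p_e * p_c
    else:
        a = theta - 1.0
        b = 1.0 + a * (p_e + p_c)
        p11 = (b - math.sqrt(b * b - 4.0 * a * theta * p_e * p_c)) / (2.0 * a)
    return [[1.0 - p_e - p_c + p11, p_c - p11], [p_e - p11, p11]]


def phi(p, beta, gamma):
    """phi[j][k] = p[j][k] * exp(beta * j + gamma * k)."""
    return [[p[j][k] * math.exp(beta * j + gamma * k) for k in (0, 1)]
            for j in (0, 1)]


def expected_stage1(n0, n1, p, beta, gamma):
    """Expected controls and cases by exposure level j at stage 1."""
    f = phi(p, beta, gamma)
    total = sum(map(sum, f))
    controls = [n0 * sum(p[j]) for j in (0, 1)]
    cases = [n1 * sum(f[j]) / total for j in (0, 1)]
    return controls, cases


def expected_stage2(n, p, beta, gamma):
    """Expected entries e[i][j][k] given stage 2 subsample sizes n[i][j]."""
    weights = [p, phi(p, beta, gamma)]  # controls follow p, cases follow phi
    return [[[n[i][j] * weights[i][j][k] / sum(weights[i][j]) for k in (0, 1)]
             for j in (0, 1)] for i in (0, 1)]


# Figure 3 inputs under HA; 150 subjects per stage 2 cell is an assumption.
p = joint_probs(0.1, 0.3, 4.0)
beta, gamma = math.log(1.5), math.log(3.0)
controls, cases = expected_stage1(2000, 1000, p, beta, gamma)
cells = expected_stage2([[150, 150], [150, 150]], p, beta, gamma)

print([round(x, 1) for x in controls], [round(x, 1) for x in cases])
print(round(cells[0][0][1], 1), round(cells[1][1][1], 1))
```

With these inputs the sketch gives stage 1 and stage 2 entries that agree with the values printed in figure 3 to one decimal place.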