Biometrics
June 2000
READER REACTION
Rick L. Williams
Research Triangle Institute, P.O. Box 12194, Research Triangle Park, North Carolina 27709-2194, U.S.A.
email: [email protected]
SUMMARY. There is a simple robust variance estimator for cluster-correlated data. While this estimator
is well known, it is poorly documented, and its wide range of applicability is often not understood. The
estimator is widely used in sample survey research, but the results in the sample survey literature are not
easily applied because of complications due to unequal probability sampling. This brief note presents a
general proof that the estimator is unbiased for cluster-correlated data regardless of the setting. The result
is not new, but a simple and general reference is not readily available. The use of the method will benefit
from a general explanation of its wide applicability.
KEY WORDS: Between-cluster variance estimator.
There are many situations where data are observed in clusters such that observations within a cluster are correlated while observations between clusters are uncorrelated, so-called cluster-correlated data. For example, the typical teratology screening experiment involves administration of a compound to pregnant dams of a rodent species, followed by evaluation of the fetuses in a litter for various types of malformations. In this situation, the fetuses within a particular litter are correlated while any two fetuses from different litters are independent. Similarly, dental studies often collect data on each tooth surface for each of several teeth from a set of patients. Again, observations from the same patient are correlated while any two observations from different patients are independent. Another example is repeated measurements or recurrent events observed on the same person. As before, observations at different time points from the same person are correlated while any two observations from different patients are independent. As a final illustration, sample surveys often use multistage sample designs. For example, a sample of hospital patients might start out with a sample of geographic areas (such as counties), followed by a sample of hospitals within the selected geographic areas, ending with a sample of hospital discharges abstracted from the selected hospitals. Here we have a three-stage design consisting of geographic areas, hospitals, and hospital discharges. If the geographic areas were selected with replacement, then selected discharges from two geographic areas would be uncorrelated while two discharges from the same geographic area would be correlated.

A major statistical problem with cluster-correlated data arises from intracluster correlation, or the potential for clustermates to respond similarly. This phenomenon is often referred to as overdispersion or extra variation in an estimated statistic beyond what would be expected under independence. Analyses that assume independence of the observations will generally underestimate the true variance and lead to test statistics with inflated Type I errors.

The following presents an unbiased variance estimator for a linear statistic from cluster-correlated data. The approach uses the well-known, but not well-documented, robust between-cluster variance estimator for cluster-correlated data. This approach is used extensively in sample survey research where clustered data are commonly encountered. See, e.g., Hansen, Hurwitz, and Madow (1953, Section 6.7) or Särndal, Swensson, and Wretman (1992, Section 4.5). These two references from the sample survey literature justify the variance estimator under the assumptions that the primary clusters are sampled with replacement, while any sampling plan that allows unbiased estimation of the primary cluster totals can be used within a cluster. In the sample survey situation, with-replacement sampling of the primary clusters implies that observations between primary clusters are uncorrelated. In the general situation, the critical assumption is that the observations between clusters are uncorrelated.

The following notation describes the general cluster-correlated data situation. Let $z_{jk}$ be the $k$th observation ($k = 1, 2, \ldots, n_j$) from the $j$th cluster ($j = 1, 2, \ldots, m$). Assume, without loss of generality, that $E[z_{jk}] = 0$. Further assume that $\mathrm{cov}(z_{jk}, z_{jk'}) = \sigma_{jkk'}$ and that $\mathrm{cov}(z_{jk}, z_{j'k'}) = 0$ when $j \neq j'$. These assumptions are very general and allow the variance to be heteroscedastic, both between and within clusters, and allow for an arbitrary dependence structure among observations within a cluster. For example, there could be three or more levels of nesting, as in the dental example above (tooth surfaces nested within teeth nested within patients) or an autoregressive process for repeated measurements over time on the same person.

First, consider the simple linear statistic $z = \sum_j \sum_k z_{jk}$ and note that

$$\mathrm{var}[z] = \sum_j \mathrm{var}\Big[\sum_k z_{jk}\Big] = \sum_j \sum_k \sum_{k'} \sigma_{jkk'}.$$

Letting $z_j = \sum_k z_{jk}$ and $\bar{z} = \sum_j z_j / m$, the between-cluster variance estimator is then given by

$$S^2 = \frac{m}{m-1} \sum_j (z_j - \bar{z})^2 = \frac{m}{m-1} \Big[ \sum_j z_j^2 - m \bar{z}^2 \Big].$$

We want to show that $E[S^2] = \sum_j \sum_k \sum_{k'} \sigma_{jkk'} = \mathrm{var}[z]$. First, note that

$$E[z_j^2] = E\Big[\sum_k \sum_{k'} z_{jk} z_{jk'}\Big] = \sum_k \sum_{k'} \sigma_{jkk'}.$$

Also,

$$E[\bar{z}^2] = \frac{1}{m^2} E\Big[\Big(\sum_j z_j\Big)^2\Big] = \frac{1}{m^2} \sum_j E[z_j^2] = \frac{1}{m^2} \sum_j \sum_k \sum_{k'} \sigma_{jkk'}$$

because observations from different clusters are uncorrelated. Thus,

$$E[S^2] = \frac{m}{m-1} \Big[ \sum_j E[z_j^2] - m E[\bar{z}^2] \Big] = \frac{m}{m-1} \Big(1 - \frac{1}{m}\Big) \sum_j \sum_k \sum_{k'} \sigma_{jkk'} = \sum_j \sum_k \sum_{k'} \sigma_{jkk'} = \mathrm{var}[z].$$

Hence, we have the desired result that the between-cluster variance estimator, $S^2$, is an unbiased estimator of the variance of a linear statistic. Notice that we only need to know to which cluster each observation belongs without regard to the dependence structure of observations within a cluster.

The above is not a new result, but it is poorly documented. It has been available in the sample survey literature since at least 1953 (Hansen et al., 1953, Section 6.7). However, we are not aware of a general proof that the between-cluster variance estimator is unbiased for cluster-correlated data. The proofs in the sample survey literature are not easily applied because of the complications due to unequal probability sampling. The wide applicability of the results is often not well recognized because of the lack of a clear reference.

On a final note, the between-cluster variance estimator can be combined with a Taylor series linearization approach (Woodruff, 1971; Binder, 1983) to yield, as the number of clusters grows large, consistent variance estimates of nonlinear statistics. This approach replaces the original data with a linear approximation which can then be used as shown above. For example, Taylor series linearization with the between-cluster variance estimator was used by Rao and Colin (1991) for the proportion of malformed fetuses for teratology studies, by Fuller (1975) for linear regression coefficients in complex sample surveys, by Bieler and Williams (1995) for logistic regression in teratology studies, and by Williams (1995) for Kaplan-Meier survival functions. The Taylor series linearization approach with the between-cluster variance estimator is closely related to the generalized estimating equation (GEE) approach of Liang and Zeger (1986) and, in some situations, the two approaches are the same when assuming working independence. The Taylor series linearization approach is much older, with its roots in sample survey research reaching back to the early 1950s. The GEE approach attempts to improve estimation by including assumptions about the within-cluster correlation structure in the estimating equations.

RÉSUMÉ
Il existe un estimateur simple et robuste de la variance pour des données corrélées par groupe. Alors que cet estimateur est bien connu, la documentation le concernant est limitée et son large champ d'application est souvent mal compris. Il est largement utilisé dans la recherche d'enquête par échantillon, mais dans la littérature sur les enquêtes par échantillon les résultats ne sont pas facilement appliqués à cause des complications dues aux inégales probabilités d'échantillonnage. Cette courte note présente la preuve générale que l'estimateur est non biaisé pour des données corrélées par groupe quelle que soit la composition. Bien que le résultat ne soit pas nouveau, aucune référence simple et générale n'est facilement disponible. L'utilisation de la méthode pourra bénéficier d'une explication générale de son large domaine d'application.

REFERENCES
Bieler, G. S. and Williams, R. L. (1995). Cluster sampling techniques in quantal response teratology and developmental toxicity studies. Biometrics 51, 764-776.
Binder, D. (1983). On the variance of asymptotically normal estimators from complex surveys. International Statistical Review 51, 279-292.
Fuller, W. A. (1975). Regression analysis for sample surveys. Sankhya C 37, 117-132.
Hansen, M. H., Hurwitz, W. N., and Madow, W. G. (1953). Sample Survey Methods and Theory, Volume I, Methods and Applications. New York: Wiley.
Liang, K. and Zeger, S. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13-22.
Rao, J. and Colin, D. (1991). Fitting dose-response models and hypothesis testing in teratological studies. In Statistics in Toxicology, D. Krewski and C. Franklin (eds). New York: Gordon and Breach.
Särndal, C. E., Swensson, B., and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.
Williams, R. L. (1995). Product-limit survival functions with correlated survival times. Lifetime Data Analysis 1, 171-186.
Woodruff, R. (1971). A simple method for approximating the variance of a complicated estimate. Journal of the American Statistical Association 66, 411-414.

Received June 1999. Revised October 1999.
Accepted November 1999.
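
As an informal illustration of the between-cluster variance estimator discussed in this note, the following Python sketch computes it from cluster totals, using the standard form of the estimator, m/(m-1) times the sum of squared deviations of the cluster totals from their mean. It is a minimal sketch, not part of the original note. The function names `between_cluster_variance` and `naive_variance` and the toy data are hypothetical. The values are assumed to be already centered, in line with the assumption that each observation has mean zero.

```python
from collections import defaultdict


def between_cluster_variance(values, cluster_ids):
    """Between-cluster estimate of var[z], where z is the sum of all values.

    Computes m/(m-1) * sum_j (z_j - zbar)^2, with z_j the total of cluster j
    and zbar the mean of the m cluster totals. Only cluster membership is
    used; no within-cluster correlation structure is specified.
    """
    totals = defaultdict(float)
    for value, cluster in zip(values, cluster_ids):
        totals[cluster] += value  # accumulate the cluster total z_j
    m = len(totals)
    if m < 2:
        raise ValueError("at least two clusters are required")
    zbar = sum(totals.values()) / m
    return m / (m - 1) * sum((t - zbar) ** 2 for t in totals.values())


def naive_variance(values):
    """Same formula with every observation treated as its own cluster,
    i.e. the estimate obtained when independence is (wrongly) assumed."""
    return between_cluster_variance(values, range(len(values)))


# Hypothetical toy data: four clusters whose members respond alike.
values = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0, 2.0, 2.0, -2.0, -2.0]
clusters = ["a", "a", "a", "b", "b", "b", "c", "c", "d", "d"]

clustered = between_cluster_variance(values, clusters)
independent = naive_variance(values)
print(clustered, independent)
```

In this toy example the cluster totals are 3, -3, 4, and -4, so the clustered estimate (200/3) is well above the estimate that ignores clustering (220/9), which is the kind of underestimation under an independence assumption described above.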