2 Complex Sampling Concepts
2.1 Introduction
A common theme in alcohol abuse research is that data are usually obtained from a multi-stage or
complex sample design. An example of a typical complex sampling scheme is:
o Stratify the geographical area under study according to census geography and census socio-
economic variables.
o Form meaningful clusters of population elements, called primary sampling units (PSUs), for
example schools, in each stratum.
o Draw a predetermined number of PSUs from each stratum, using probability sampling
proportional to size.
o Do one or more stages of subsampling within each PSU.
o Draw a simple random sample of ultimate sampling units (USUs) at the last stage.
The main advantages of a complex sample (CS) in comparison with a simple random sample (SRS)
are:
o CS does not require a complete sampling frame of the population elements.
o CS is more economical and practical.
o CS guarantees a representative sample of the population.
o CS makes a step-by-step design of the sample possible.
The main disadvantage of CS is that it is generally less efficient than SRS, i.e., it yields estimates of
lower precision for a fixed sample size.
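In the application of CS, the design effect (deff) and sampling weights play an important role. The
design effect is defined as the ratio of the variance of an estimator under the complex sampling
design to the variance of the same estimator under simple random sampling with the same sample size:
\[ \mathrm{deff} = \frac{\mathrm{Var}_{\mathrm{CS}}(\hat{\theta})}{\mathrm{Var}_{\mathrm{SRS}}(\hat{\theta})} . \]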
The design effect (deff) provides a rough and ready method of estimating the variance of survey
statistics and of adjusting the output of standard statistical software packages for the complex
sampling design. This aspect of deff derives from its assumed portability. See Kish (1965) for a
discussion of design effects.
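As a minimal sketch of how a design effect might be used to adjust a standard error, assume the
usual definition of deff as the ratio of the variance under the complex design to the variance under
SRS with the same sample size. The function names and the numerical values below are hypothetical
and serve only as an illustration.

```python
import math

def design_effect(var_complex: float, var_srs: float) -> float:
    """Ratio of the variance under the complex design to the SRS variance."""
    if var_srs <= 0:
        raise ValueError("var_srs must be positive")
    return var_complex / var_srs

def adjust_standard_error(se_srs: float, deff: float) -> float:
    """Inflate a standard error computed under SRS assumptions by sqrt(deff)."""
    return se_srs * math.sqrt(deff)

# Hypothetical values: the complex design doubles the variance.
deff = design_effect(var_complex=0.0008, var_srs=0.0004)
se_adjusted = adjust_standard_error(se_srs=0.02, deff=deff)
```

The adjustment relies on the assumed portability of deff, that is, on a design effect estimated for
one statistic being a reasonable approximation for another.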
The deff is used not only to produce estimates of variance, but also to adjust the output of standard
analyses. For example, the practitioner may utilize standard statistical software packages to
conduct a regression analysis of a hypothesized linear relationship between survey variables, or to
formulate a multi-way table and conduct a χ² test of independence between survey variables. The
output of standard statistical software packages gives wrong answers for such problems (because
In this chapter we start with some known results for the sum of random variables and for multiple
linear regression to illustrate adjustments that must be made to accommodate complex sampling
properly. In Sections 2.2 and 2.3 we provide a brief summary of important concepts in complex
sampling and in Sections 2.4 to 2.7 discuss how these concepts are currently applied to fit
regression models to survey data. We will show that standard software packages for regression
analysis allow for a weight variable, but do not yield the correct standard error estimates and
measures of fit.
Consider a finite population of identifiable units $U = \{u_1, u_2, \ldots, u_N\}$, where the size $N$
of the population is assumed known. The inclusion of a given element $u_i$ in a sample $s$ is a
random event indicated by the random variable $I_i$ (the sample membership indicator of element $i$),
$i = 1, 2, \ldots, N$, defined as
\[ I_i = \begin{cases} 1 & \text{if } i \in s \\ 0 & \text{otherwise.} \end{cases} \]
The probability that $u_i$ will be included in the sample, denoted by $\pi_i$, is
\[ \pi_i = P(u_i \in s) = P(I_i = 1). \]
The probability that both $u_i$ and $u_j$ will be included in the sample, denoted by $\pi_{ij}$, is
\[ \pi_{ij} = P(u_i \in s \text{ and } u_j \in s) = P(I_i \cdot I_j = 1) \]
by definition (see e.g. Traat, Meister, & Söstra, 2001). Therefore, $E(I_i) = \pi_i$,
$\mathrm{Var}(I_i) = \pi_i (1 - \pi_i)$ and $\mathrm{Cov}(I_i, I_j) = \pi_{ij} - \pi_i \pi_j$.
The selected sample $s = \{u_{(1)}, u_{(2)}, \ldots, u_{(n)}\}$ is an unordered set of population units,
where $n$ denotes the sample size. A sampling weight $w_i$ for the $i$-th USU is usually calculated as
$1/\pi_i$, where $\pi_i$ is the inclusion probability of that unit.
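The inclusion probabilities and the moments of the membership indicators can be checked by brute
force for a small design. The sketch below assumes simple random sampling without replacement of
n = 2 units from N = 5, chosen only because every possible sample can be enumerated; the function
name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def inclusion_probabilities(N: int, n: int):
    """Enumerate all equally likely SRS-without-replacement samples of size n
    from units 0..N-1 and return (pi, pi_joint) as exact fractions."""
    samples = list(combinations(range(N), n))
    p = Fraction(1, len(samples))            # every sample has the same probability
    pi = [Fraction(0)] * N
    pi_joint = [[Fraction(0)] * N for _ in range(N)]
    for s in samples:
        for i in s:
            pi[i] += p
            for j in s:
                pi_joint[i][j] += p          # the diagonal reproduces pi[i]
    return pi, pi_joint

pi, pi_joint = inclusion_probabilities(N=5, n=2)
var_I0 = pi[0] * (1 - pi[0])                   # Var(I_0) = pi_0 (1 - pi_0)
cov_I0_I1 = pi_joint[0][1] - pi[0] * pi[1]     # Cov(I_0, I_1) = pi_01 - pi_0 pi_1
weights = [1 / p for p in pi]                  # base weights 1 / pi_i
```

For this design every unit has inclusion probability n/N, so all base weights equal N/n.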
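Let $\mathbf{y}_N$ and $\mathbf{z}_N$ be $N \times 1$ vectors of finite population values with typical
elements $y_j$ and $z_j$, $j = 1, 2, \ldots, N$, respectively. Denote the values drawn from a
multi-stage sample of size $n$ by $\mathbf{y}_s$ and $\mathbf{z}_s$, with typical elements $y_{(i)}$
and $z_{(i)}$, $i = 1, 2, \ldots, n$.
Consider the population totals
\[ t_1 = \sum_{i=1}^{N} y_i, \quad t_2 = \sum_{i=1}^{N} y_i^2, \quad t_3 = \sum_{i=1}^{N} z_i, \quad
   t_4 = \sum_{i=1}^{N} z_i^2, \quad \text{and} \quad t_5 = \sum_{i=1}^{N} z_i y_i . \]
Each population total $t_j$ can be estimated by the corresponding $\pi$-estimator $\hat{t}_{j\pi}$
(Horvitz & Thompson, 1952). For example,
\[ \hat{t}_{1\pi} = \sum_{i=1}^{n} y_{(i)} / \pi_i, \qquad
   \hat{t}_{5\pi} = \sum_{i=1}^{n} z_{(i)} y_{(i)} / \pi_i . \]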
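Each estimated total can be written as a linear function of the sample membership indicators
$I_i$, $i = 1, 2, \ldots, N$. For example,
\[ \hat{t}_{1\pi} = \sum_{i=1}^{N} I_i y_i / \pi_i . \]
It follows that
\[ E(\hat{t}_{1\pi}) = \sum_{i=1}^{N} E(I_i)\, y_i / \pi_i = \sum_{i=1}^{N} y_i = t_1 \]
and, since $\mathrm{Cov}(I_i, I_j) = \pi_{ij} - \pi_i \pi_j$ with $\pi_{ii} = \pi_i$,
\[ \mathrm{Var}(\hat{t}_{1\pi}) = \sum_{i=1}^{N} \sum_{j=1}^{N}
   \left( \frac{\pi_{ij}}{\pi_i \pi_j} - 1 \right) y_i y_j . \]
Hence, an estimator of this variance is
\[ \widehat{\mathrm{Var}}(\hat{t}_{1\pi}) = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{1}{\pi_{ij}}
   \left( \frac{\pi_{ij}}{\pi_i \pi_j} - 1 \right) y_{(i)} y_{(j)} . \]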
This simple example shows that the expected value and variance of the sum are rather different
under complex sampling than for the corresponding case of simple random sampling.
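The π-estimator of a total and its estimated variance can be illustrated on a toy population. The
sketch below assumes simple random sampling without replacement with N = 4 and n = 2, so that all
six possible samples can be listed and the unbiasedness of both estimators checked directly. The
population values and function names are hypothetical.

```python
from itertools import combinations

def ht_total(y_s, pi_s):
    """pi-estimator of a total: sum of y_(i) / pi_i over the sampled units."""
    return sum(y / p for y, p in zip(y_s, pi_s))

def ht_variance_estimate(y_s, pi_s, pi_joint_s):
    """Double sum over the sample of
    (1 / pi_ij) * (pi_ij / (pi_i * pi_j) - 1) * y_(i) * y_(j), with pi_ii = pi_i."""
    n = len(y_s)
    total = 0.0
    for i in range(n):
        for j in range(n):
            pij = pi_joint_s[i][j]
            total += (pij / (pi_s[i] * pi_s[j]) - 1.0) / pij * y_s[i] * y_s[j]
    return total

# Toy population; inclusion probabilities for SRS without replacement.
y = [3.0, 5.0, 8.0, 12.0]
N, n = len(y), 2
pi_i = n / N
pi_ij = n * (n - 1) / (N * (N - 1))

estimates, var_estimates = [], []
for s in combinations(range(N), n):
    y_s = [y[k] for k in s]
    pi_s = [pi_i] * n
    pj = [[pi_i if a == b else pi_ij for b in range(n)] for a in range(n)]
    estimates.append(ht_total(y_s, pi_s))
    var_estimates.append(ht_variance_estimate(y_s, pi_s, pj))

# All samples are equally likely, so plain averages are design expectations.
mean_estimate = sum(estimates) / len(estimates)
true_variance = sum((e - sum(y)) ** 2 for e in estimates) / len(estimates)
mean_var_estimate = sum(var_estimates) / len(var_estimates)
```

Averaged over all samples, the estimated total reproduces the population total and the estimated
variance reproduces the design variance of the estimator.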
Standard methods are available for the point estimation of functions of the population totals, such
as means, ratios, and differences of ratios. These methods are based (see e.g. Särndal et al., 1992)
on the following result. Given that a population parameter $\theta$ can be expressed as a function of
several population totals, i.e. $\theta = f(t_1, t_2, \ldots, t_q)$, an estimator $\hat{\theta}$ of
$\theta$ is obtained from $\hat{\theta} = f(\hat{t}_{1\pi}, \ldots, \hat{t}_{j\pi}, \ldots, \hat{t}_{q\pi})$,
where $\hat{t}_{j\pi}$ is the corresponding $\pi$-estimator of $t_j$. Additionally, consistent
estimators of the sampling variances of the estimators are available and have been implemented in
various programs for the analysis of survey data (cf. Section 2.2). One method of estimating the
variance of $\hat{\theta}$, if $\hat{\theta}$ is a nonlinear function of the totals, is by a first-order
Taylor approximation of $f(\hat{t}_{1\pi}, \ldots, \hat{t}_{j\pi}, \ldots, \hat{t}_{q\pi})$ (see e.g.
Wolter, 1985).
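As an illustration of estimating a function of totals, the sketch below takes the ratio of two
totals as the parameter and forms the first-order Taylor (linearized) variable whose estimated total
carries the approximate variance of the ratio estimate. The data values and function names are
hypothetical, and the ratio is only one possible choice of f.

```python
def pi_total(values, pi):
    """pi-estimator of a total from sampled values and inclusion probabilities."""
    return sum(v / p for v, p in zip(values, pi))

def ratio_estimate(y_s, z_s, pi_s):
    """Estimate theta = t_y / t_z by plugging in the pi-estimated totals."""
    return pi_total(y_s, pi_s) / pi_total(z_s, pi_s)

def linearized_values(y_s, z_s, pi_s):
    """Linearized variable u_i = (y_i - theta * z_i) / t_z, evaluated at the
    estimated totals. The variance of the pi-estimated total of u approximates
    the variance of the ratio estimate."""
    t_z = pi_total(z_s, pi_s)
    theta = ratio_estimate(y_s, z_s, pi_s)
    return [(y - theta * z) / t_z for y, z in zip(y_s, z_s)]

y_s = [4.0, 6.0, 10.0]
z_s = [2.0, 2.0, 4.0]
pi_s = [0.5, 0.25, 0.25]
theta_hat = ratio_estimate(y_s, z_s, pi_s)
u = linearized_values(y_s, z_s, pi_s)
```

By construction the estimated total of the linearized variable is zero, so only its variability
across sampled units contributes to the variance approximation.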
To estimate the variance of a survey estimator in the case of a single-stage sampling design, there
are typically two alternatives: (1) the variance estimator based upon pps sampling with
replacement, and (2) the Yates-Grundy estimator of variance (Yates & Grundy, 1953; Biyani,
1980) for pps without replacement sampling. Many survey practitioners will find the first estimator
to be a satisfactory approximation to the variance given the actual survey design. In instances
where it is important to reflect the without-replacement sampling design (e.g., a non-negligible
sampling fraction) and where it is feasible to calculate the joint inclusion probabilities (e.g.,
Durbin's two-per-stratum design; see Durbin, 1967, and Shapiro & Bateman, 1978), the Yates-Grundy
estimator can be specified instead. Applied to a multi-stage sampling design, these estimators
usually provide a very good approximation to the total variance.
It was previously mentioned that the sampling weight $w_i$ is usually calculated as $w_i = 1/\pi_i$,
the so-called base weight. An adjusted weight has the form
\[ w_i = \frac{1}{\pi_i} \, \frac{1}{\hat{R}_i} , \]
in which the base weight is multiplied by an adjustment factor $1/\hat{R}_i$.
It is evident that non-response and post-stratification adjusted weights will have an impact on the
estimation of population totals and functions of population totals.
Suppose $\mathbf{Y}_N$ is an $N \times p$ matrix defined as
$\mathbf{Y}_N' = (\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N)$, where the elements of
$\mathbf{y}_i$ are values of $p$ variables of interest.
2.4.1 Example 1
Let $y_{ij}$ denote a typical element of $\mathbf{y}_i$, where $y_{ij}$ is the number of occasions
alcohol was consumed by student $i$ ($i = 1, 2, \ldots, N$) in the prior 30 days, and where $j = 1$
denotes grade 8, $j = 2$ denotes grade 9, …, $j = 5$ denotes grade 12.
2.4.2 Example 2
Let the subscript $i$ denote student $i$ ($i = 1, 2, \ldots, N$). Suppose $y_{i1}$ equals the number
of times this student was under the influence of alcohol in the prior year, $y_{i2}$ is a language
score, and $y_{i3}$ is a math score.
Example 1 above describes a longitudinal study, often referred to in the literature as a repeated
measurements study, since measurements are made on the same individual on successive
occasions. Note that, in general, measurement occasions are not necessarily equally spaced over
time.
Example 2 describes a typical cross-sectional study. Note, however, that this study may have been
carried out in 1998, and subsequently repeated in 2000 and 2002. It is evident that the finite
populations $U_{1998}$, $U_{2000}$, and $U_{2002}$ will overlap if, for example, 8th to 12th graders in
the state of Texas are defined as the population elements. Hence, the samples $s_{1998}$, $s_{2000}$,
and $s_{2002}$ may also have overlapping units. A cross-sectional study, repeated over time, is often
referred to as a panel study, but the statistical treatment usually treats the data as
multiple-group data. In this example the year of study defines the group. A typical multiple-group
application is to test for differences in the means of latent variables under the assumption of
factor invariance.
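Consider the linear regression model
\[ E(\mathbf{y}_N \mid \mathbf{X}_N) = \mathbf{X}_N \boldsymbol{\beta}; \qquad
   \mathrm{Cov}(\mathbf{y}_N \mid \mathbf{X}_N) = \sigma^2 \mathbf{I}_N . \tag{2.2} \]
The ordinary least squares (OLS) estimator
$\hat{\boldsymbol{\beta}} = (\mathbf{X}_s' \mathbf{X}_s)^{-1} \mathbf{X}_s' \mathbf{y}_s$ of
$\boldsymbol{\beta}$ is not, in general, a design-consistent estimator of $\boldsymbol{\beta}$.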
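\[ \boldsymbol{\beta} = \mathbf{T}^{-1} \mathbf{t} \tag{2.3} \]
where $\mathbf{T} = \sum_{\alpha=1}^{N} \mathbf{x}_\alpha \mathbf{x}_\alpha'$ and
$\mathbf{t} = \sum_{\alpha=1}^{N} \mathbf{x}_\alpha y_\alpha$. The typical elements of $\mathbf{T}$ and
$\mathbf{t}$ are the totals
\[ t_{ij} = \sum_{\alpha=1}^{N} x_{i\alpha} x_{j\alpha} \quad \text{and} \quad
   t_{i0} = \sum_{\alpha=1}^{N} x_{i\alpha} y_\alpha . \]
Each total (cf. Section 2.2) can be estimated by its unbiased $\pi$-estimator. For example,
\[ \hat{t}_{ij,\pi} = \sum_{\alpha=1}^{n} x_{(i\alpha)} x_{(j\alpha)} / \pi_\alpha , \qquad
   \hat{t}_{i0,\pi} = \sum_{\alpha=1}^{n} x_{(i\alpha)} y_{(\alpha)} / \pi_\alpha , \qquad
   i, j = 1, 2, \ldots, r . \]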
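In matrix notation,
\[ \hat{\mathbf{T}} = \sum_{\alpha=1}^{n} \frac{\mathbf{x}_{(\alpha)} \mathbf{x}_{(\alpha)}'}{\pi_\alpha}
   = \mathbf{X}_s' \mathbf{W}_s \mathbf{X}_s , \tag{2.4} \]
where $\mathbf{W}_s$ is the diagonal matrix with diagonal elements $1/\pi_\alpha$. Likewise
$\hat{\mathbf{t}} = \mathbf{X}_s' \mathbf{W}_s \mathbf{y}_s$, so that the design-weighted estimator is
$\hat{\boldsymbol{\beta}}_W = \hat{\mathbf{T}}^{-1} \hat{\mathbf{t}}
 = (\mathbf{X}_s' \mathbf{W}_s \mathbf{X}_s)^{-1} \mathbf{X}_s' \mathbf{W}_s \mathbf{y}_s$.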
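The matrix form of the estimated totals translates directly into code. The following sketch uses
NumPy (an assumption, since the text names no implementation) to form the estimated matrix of
totals X'WX, the estimated vector of totals X'Wy, and the resulting design-weighted coefficient
vector. The data values and the function name are hypothetical.

```python
import numpy as np

def design_weighted_regression(X_s, y_s, pi_s):
    """Return T_hat = X'WX, t_hat = X'Wy and beta_W = T_hat^{-1} t_hat,
    where W is diagonal with elements 1 / pi."""
    X_s = np.asarray(X_s, dtype=float)
    y_s = np.asarray(y_s, dtype=float)
    w = 1.0 / np.asarray(pi_s, dtype=float)
    T_hat = X_s.T @ (w[:, None] * X_s)     # sum of x x' / pi over the sample
    t_hat = X_s.T @ (w * y_s)              # sum of x y / pi over the sample
    beta_w = np.linalg.solve(T_hat, t_hat)
    return T_hat, t_hat, beta_w

# Hypothetical sample: intercept plus one regressor, unequal inclusion probabilities.
X_s = np.column_stack([np.ones(5), [1.0, 2.0, 3.0, 4.0, 5.0]])
y_s = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
pi_s = np.array([0.1, 0.1, 0.2, 0.4, 0.4])
T_hat, t_hat, beta_w = design_weighted_regression(X_s, y_s, pi_s)
```

Solving the linear system rather than inverting the matrix explicitly is a numerical convenience
and does not change the estimator.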
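For most sample designs used in practice, the sampling variance of $\hat{\boldsymbol{\beta}}_W$ cannot
be estimated using standard computer packages, and a variance estimating technique has to be used.
The basic methods available (see e.g. Rao, 1975) are:
(a) Linearization: The estimator is approximated by a first-order Taylor expansion in the estimated
totals,
\[ f(\hat{t}_{1,\pi}, \hat{t}_{2,\pi}, \ldots, \hat{t}_{q,\pi}) \approx f(t_1, t_2, \ldots, t_q)
   + \sum_{j=1}^{q} a_j (\hat{t}_{j,\pi} - t_j) , \]
where
\[ a_j = \left. \frac{\partial f(\hat{t}_{1,\pi}, \hat{t}_{2,\pi}, \ldots, \hat{t}_{q,\pi})}
   {\partial \hat{t}_{j,\pi}} \right|_{\hat{t}_{1,\pi} = t_1, \ldots, \hat{t}_{q,\pi} = t_q} . \]
For the design-weighted estimator this gives
\[ \hat{\boldsymbol{\beta}}_W \approx \boldsymbol{\beta}
   + \mathbf{T}^{-1} (\hat{\mathbf{t}} - \hat{\mathbf{T}} \boldsymbol{\beta}) , \tag{2.7} \]
so that
\[ \mathrm{Cov}(\hat{\boldsymbol{\beta}}_W) \approx \mathbf{T}^{-1} \mathbf{V} \mathbf{T}^{-1} , \]
where $\mathbf{V}$ denotes the covariance matrix of $\hat{\mathbf{t}} - \hat{\mathbf{T}} \boldsymbol{\beta}$.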
(b) Balanced repeated replication (McCarthy, 1969): Statistics based on half-samples, which are
selected so as to ensure an orthogonal balanced set, are computed and the empirical covariances
of these statistics are used as the appropriate estimator.
In longitudinal studies an increase in precision is obtained if allowance is made for the fact that
units sampled over time are correlated. Replication methods provide a simple means for
incorporating this correlation.
(c) Jackknife (Miller, 1974): The sample is first split into subsamples, each of which reflects the
original complex design. Statistics based on the sample data without one of the subsamples are
computed and the empirical covariances of these statistics serve as covariance estimators. A
more detailed account is given in Wolter (1985).
(d) Bootstrap (Efron, 1981, 1982; Kovar, Rao & Wu, 1988): The sample data are used to construct
an artificial population $U^*$ which is assumed to mimic the real, but unknown, population $U$.
The original design is used to draw a series of $K$ samples (with replacement) from $U^*$. For each
"bootstrap" sample $i$, an estimate $\hat{\theta}_i^*$ of the population parameter $\theta$ is computed
and subsequently $\theta$ and $\mathrm{var}(\hat{\theta})$ are estimated from
$\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_K^*$.
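Of the replication methods listed above, the jackknife is the simplest to sketch. The version below
is a delete-one-PSU jackknife for a weighted mean, under the simplifying assumptions of a single
stratum, replicates centred on their own average, and the factor (n - 1)/n with n the number of
PSUs. Stratified designs need stratum-specific factors, which are not shown. All names are
hypothetical.

```python
def weighted_mean(y, w):
    """Weighted mean of y with sampling weights w."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def jackknife_variance(y, w, psu):
    """Delete-one-PSU jackknife variance of the weighted mean (one stratum)."""
    labels = sorted(set(psu))
    n = len(labels)
    replicates = []
    for c in labels:
        # Recompute the statistic with every element of PSU c removed.
        keep = [k for k, p in enumerate(psu) if p != c]
        replicates.append(weighted_mean([y[k] for k in keep],
                                        [w[k] for k in keep]))
    center = sum(replicates) / n
    return (n - 1) / n * sum((r - center) ** 2 for r in replicates)

# Each observation is its own PSU and all weights are equal.
y = [1.0, 2.0, 3.0, 4.0]
v_jack = jackknife_variance(y, w=[1.0] * 4, psu=[0, 1, 2, 3])
```

In this special case the jackknife reproduces the familiar variance of a sample mean, s²/n, which
gives a quick check of the implementation.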
While the standard statistical package computer programs do not in general deal with the complex
sample design situation, several special purpose programs for covariance estimation have been
developed for use with complex sample designs. Lepkowski and Bowles (1996) give a list of eight
software packages that are available for use by the general survey analyst. The eight catalogued in
their paper are CENVAR, CLUSTERS, EpiInfo, PC CARP, STATA, SUDAAN, VPLX and WesVar.
Theoretical comparisons of the different methods of covariance estimation by Krewski & Rao
(1981) and empirical comparisons by Kish & Frankel (1974) and by Richards & Freeman (1980)
indicate their performance is very similar in many cases.
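For comparison with the replication methods, the linearization approach to the covariance of the
design-weighted regression estimator can be sketched as a sandwich of the form
(X'WX)^{-1} V (X'WX)^{-1}. In the sketch below the middle matrix is estimated from the between-PSU
variation of weighted score totals within strata, using a with-replacement approximation. That
choice, the NumPy implementation, the data, and the function name are all assumptions of this
illustration.

```python
import numpy as np

def linearized_cov(X, y, w, stratum, psu):
    """Sandwich estimate T_hat^{-1} V_hat T_hat^{-1} for the weighted estimator."""
    X, y, w = (np.asarray(a, dtype=float) for a in (X, y, w))
    stratum, psu = np.asarray(stratum), np.asarray(psu)
    T_hat = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(T_hat, X.T @ (w * y))
    scores = (w * (y - X @ beta))[:, None] * X       # rows: w_a * e_a * x_a
    V_hat = np.zeros((X.shape[1], X.shape[1]))
    for h in np.unique(stratum):
        in_h = stratum == h
        clusters = np.unique(psu[in_h])
        n_h = len(clusters)
        if n_h < 2:
            continue                                  # single-PSU stratum adds nothing
        d = np.array([scores[in_h & (psu == c)].sum(axis=0) for c in clusters])
        dev = d - d.mean(axis=0)
        V_hat += n_h / (n_h - 1) * (dev.T @ dev)
    T_inv = np.linalg.inv(T_hat)
    return beta, T_inv @ V_hat @ T_inv

# Hypothetical data: two strata with two PSUs each.
X = np.column_stack([np.ones(8), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]])
y = np.array([1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.4])
w = np.array([10.0, 10.0, 12.0, 12.0, 8.0, 8.0, 9.0, 9.0])
stratum = np.array([0, 0, 0, 0, 1, 1, 1, 1])
psu = np.array([0, 0, 1, 1, 2, 2, 3, 3])
beta_w, cov_w = linearized_cov(X, y, w, stratum, psu)
```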
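Under the heteroscedastic model
\[ E(\mathbf{y}_N \mid \mathbf{X}_N) = \mathbf{X}_N \boldsymbol{\beta}, \qquad
   \mathrm{Cov}(\mathbf{y}_N \mid \mathbf{X}_N) = \sigma^2 \mathbf{V} , \tag{2.8} \]
if $\mathbf{V}$ is diagonal and the inclusion probabilities are proportional to the variances, then
$\hat{\boldsymbol{\beta}}_W$ (cf. (2.6)) coincides with the weighted least squares estimator
\[ \hat{\boldsymbol{\beta}}^* = (\mathbf{X}_s' \mathbf{V}_s^{-1} \mathbf{X}_s)^{-1}
   \mathbf{X}_s' \mathbf{V}_s^{-1} \mathbf{y}_s . \tag{2.10} \]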
Standard programs compute the OLS estimator, $\hat{\boldsymbol{\beta}}$, and can often also compute
the generalized OLS estimator, $\hat{\boldsymbol{\beta}}^*$, together with unbiased estimators of their
model-variances $\sigma^2 (\mathbf{X}_s' \mathbf{X}_s)^{-1}$ under (2.2) and
$\sigma^2 (\mathbf{X}_s' \mathbf{V}_s^{-1} \mathbf{X}_s)^{-1}$ under (2.8), respectively. The
design-weighted estimator, $\hat{\boldsymbol{\beta}}_W$, can also be obtained by the weighted regression
options of standard statistical packages (e.g. LISREL or SPSS) by using the weights $1/\pi_i$.
Alternatively, $\hat{\boldsymbol{\beta}}_W$ can be obtained by unweighted regression on the transformed
variables $y_i / \sqrt{\pi_i}$ and $\mathbf{x}_i / \sqrt{\pi_i}$. Nathan (1988), however, has pointed out that the
The programs that use weighted regression, with weights $1/\pi_i$, report the estimator of the
variance-covariance matrix as $\hat{\sigma}^2 (\mathbf{X}_s' \mathbf{W}_s \mathbf{X}_s)^{-1}$. The
model-variance of $\hat{\boldsymbol{\beta}}_W$, under the homoscedastic model (2.2), is
\[ \hat{\sigma}^2 (\mathbf{X}_s' \mathbf{W}_s \mathbf{X}_s)^{-1}
   (\mathbf{X}_s' \mathbf{W}_s \mathbf{V}_s \mathbf{W}_s \mathbf{X}_s)
   (\mathbf{X}_s' \mathbf{W}_s \mathbf{X}_s)^{-1} , \]
which simplifies to $\hat{\sigma}^2 (\mathbf{X}_s' \mathbf{W}_s \mathbf{X}_s)^{-1}$ under the
homoscedastic model ($\mathbf{V} = \mathbf{I}$) only for self-weighting designs and under the
heteroscedastic model (2.8) only if $\mathbf{V}$ is diagonal and the inclusion probabilities are
proportional to the variances.
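Two of the statements above can be checked numerically. The sketch below, which assumes NumPy and
simulated data with hypothetical names, first confirms that weighted regression with weights 1/π
gives the same coefficients as unweighted regression after scaling each row by 1/√π, which is the
scaling that reproduces those weights. It then confirms that the sandwich form collapses to
(X'WX)^{-1} when V is diagonal with elements proportional to the inclusion probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
pi = rng.uniform(0.1, 0.9, size=n)      # hypothetical inclusion probabilities
w = 1.0 / pi

# Weighted regression with weights 1 / pi.
beta_weighted = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Unweighted regression on rows scaled by 1 / sqrt(pi).
scale = np.sqrt(w)
beta_transformed, *_ = np.linalg.lstsq(X * scale[:, None], y * scale, rcond=None)

def sandwich_factor(X, w, v):
    """(X'WX)^{-1} (X'W V W X) (X'WX)^{-1} for diagonal W = diag(w), V = diag(v)."""
    A = np.linalg.inv(X.T @ (w[:, None] * X))
    middle = X.T @ ((w * v * w)[:, None] * X)
    return A @ middle @ A

naive = np.linalg.inv(X.T @ (w[:, None] * X))
proportional_case = sandwich_factor(X, w, v=pi)          # variances proportional to pi
homoscedastic_case = sandwich_factor(X, w, v=np.ones(n))  # V = I, unequal weights
```

With unequal weights and V = I the sandwich and the naive matrix differ, which is why the variance
reported by a standard weighted regression routine cannot be taken at face value.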
2.7.1 Introduction
In this section formulae for the estimation of the covariance matrix of a vector of totals are given
for single-stage, two-stage, and three-stage sampling designs.
For a multi-stage sampling design we assume the following general sampling methods at each
stage:
o First stage: random sampling with replacement (WR), random sampling without
replacement and equal probability of selection (WOR), and random sampling without
replacement and unequal probabilities (UWOR).
o Second stage: if the first stage is not WR, then WR, WOR, or UWOR.
o Third stage: if second stage is not WR, then WR, WOR, or systematic.
From the above it follows that all specifications other than weights are ignored for subsequent
stages if a multi-stage sample contains a WR, or an approximation to WR, stage.
Overall weights for each ultimate sampling unit can be obtained as a product of weights for
corresponding units computed in each sampling stage.
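The product rule for overall weights is short enough to state in code. In this sketch each stage
weight is taken to be the reciprocal of the inclusion probability at that stage; the function name
and the probabilities are hypothetical.

```python
from math import prod

def overall_weight(stage_inclusion_probs):
    """Overall weight of an ultimate sampling unit: the product of the stage
    weights 1/pi for its PSU, its secondary unit, and so on down to the unit."""
    return prod(1.0 / p for p in stage_inclusion_probs)

# A unit whose PSU, secondary unit and final-stage selection have these probabilities.
w = overall_weight([0.1, 0.5, 0.25])
```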
To simplify the expressions for the estimated covariance matrix of a vector of totals, let
where the index h denotes a stratum within a given sampling stage, i denotes the i–th sampled unit
within stratum h in the same sampling stage and j denotes all final stage units contained within hi.
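Let
\[ \mathbf{z}_{hi} = \sum_{j=1}^{m_{hi}} \mathbf{z}_{hij} , \tag{2.12} \]
\[ \bar{\mathbf{z}}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} \mathbf{z}_{hi} , \tag{2.13} \]
and
\[ \mathbf{S}_h^2(\mathbf{y}) = \frac{1}{n_h - 1} \sum_{i=1}^{n_h}
   (\mathbf{z}_{hi} - \bar{\mathbf{z}}_h)(\mathbf{z}_{hi} - \bar{\mathbf{z}}_h)' . \tag{2.14} \]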
The covariance of the total for vector y in a single-stage sample is estimated by:
\[ \hat{V}(\mathbf{y}_T) = \hat{V}_1(\mathbf{y}_T) = \sum_{h=1}^{H} U_h(\mathbf{y}_T) , \tag{2.15} \]
where $U_h(\mathbf{y}_T)$ is an estimated contribution from stratum $h = 1, \ldots, H$ and depends on
the sampling method used:
o For WR, $U_h(\mathbf{y}_T) = n_h \mathbf{S}_h^2(\mathbf{y})$,
o For simple random sampling, $U_h(\mathbf{y}_T) = (1 - f_h)\, n_h \mathbf{S}_h^2(\mathbf{y})$,
In the Yates-Grundy-Sen variance estimator for UWOR sampling, $\pi_{hi}$ and $\pi_{hj}$ are the
inclusion probabilities for units $i$ and $j$ in stratum $h$, and $\pi_{hij}$ is the joint inclusion
probability for the same units (Yates & Grundy, 1953; Sen, 1953). In some situations this estimator
may yield a negative estimate, in which case it is treated as undefined.
Currently, for each stratum $h$ containing a single element, the covariance contribution
$U_h(\mathbf{y}_T)$ is set to zero. An alternative procedure is to collapse strata. Presently, we leave
it to the discretion of the user to collapse strata prior to any further statistical analysis.
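The WR and equal-probability WOR contributions in (2.15) can be sketched as follows. The element
values z are taken as given inputs, strata are assumed to be labelled by integers, and the sampling
fractions are supplied as a dictionary; these conventions, the NumPy implementation, and the
function name are assumptions of the illustration. Strata with a single unit contribute zero, as
described above.

```python
import numpy as np

def single_stage_cov(z, stratum, unit, method="WR", f=None):
    """Sum over strata of U_h: n_h * S_h^2 for WR, and (1 - f_h) * n_h * S_h^2
    for equal-probability WOR. z has one row per final-stage element."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    stratum, unit = np.asarray(stratum), np.asarray(unit)
    p = z.shape[1]
    V = np.zeros((p, p))
    for h in np.unique(stratum):
        in_h = stratum == h
        units = np.unique(unit[in_h])
        n_h = len(units)
        if n_h < 2:
            continue                                   # single-unit stratum: zero
        # Unit totals z_hi, equation (2.12).
        z_hi = np.array([z[in_h & (unit == u)].sum(axis=0) for u in units])
        dev = z_hi - z_hi.mean(axis=0)                 # uses the mean in (2.13)
        S2 = dev.T @ dev / (n_h - 1)                   # equation (2.14)
        fpc = 1.0 if method == "WR" else 1.0 - f[int(h)]
        V += fpc * n_h * S2
    return V

z = [1.0, 1.0, 4.0, 1.0, 3.0, 2.0, 3.0]
stratum = [0, 0, 0, 1, 1, 1, 1]
unit = [1, 1, 2, 1, 2, 3, 3]
V_wr = single_stage_cov(z, stratum, unit, method="WR")
V_wor = single_stage_cov(z, stratum, unit, method="WOR", f={0: 0.5, 1: 0.25})
```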
Two-stage sample
When two-stage sampling is used and sampling WOR is applied in the first stage, the following
estimate of the covariance of the total for vector y may be used:
\[ \hat{V}(\mathbf{y}_T) = \hat{V}_2(\mathbf{y}_T) = \hat{V}_1(\mathbf{y}_T)
   + \sum_{h=1}^{H} \sum_{i=1}^{n_h} \pi_{hi} \sum_{k=1}^{K_{hi}} U_{hik}(\mathbf{y}_T) . \tag{2.16} \]
o Here $\pi_{hi}$ represents the first stage inclusion probability for the primary sampling unit $i$
from stratum $h$.
o $U_{hik}(\mathbf{y}_T)$ is the covariance contribution from the second stage stratum $k$ from the
primary sampling unit $hi$. It depends on the sampling method used in the second stage (see
formulae above).
Three-stage sample
For a three-stage sample where first stage sampling is done without replacement, and simple
random sampling is applied in the second stage, the following estimate of the covariance of the
total for vector y may be used:
\[ \hat{V}(\mathbf{y}_T) = \hat{V}_2(\mathbf{y}_T)
   + \sum_{h=1}^{H} \sum_{i=1}^{n_h} \pi_{hi} \sum_{k=1}^{K_{hi}} f_{hik}
     \sum_{j=1}^{n_{hik}} \sum_{l=1}^{L_{hikj}} U_{hikjl}(\mathbf{y}_T) , \tag{2.17} \]
where
o $f_{hik}$ represents the sampling rate for the secondary sampling units in the second-stage
stratum $hik$,
o $L_{hikj}$ indicates the number of third-stage strata in the secondary sampling unit $hikj$, and
o $U_{hikjl}(\mathbf{y}_T)$ denotes the covariance contribution from the third-stage stratum $l$,
which is contained in the secondary sampling unit $hikj$. Again, this depends on the third-stage
sample method (see formulae above).
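Assume that $L$ is the likelihood function or any other appropriate function of the vector
$\boldsymbol{\gamma}$ of unknown parameters, and that an estimate $\hat{\boldsymbol{\gamma}}$ of
$\boldsymbol{\gamma}$ is obtained by solving the set of simultaneous equations
\[ \frac{\partial \ln L}{\partial \boldsymbol{\gamma}} = \mathbf{0} . \tag{2.18} \]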
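In general, no closed-form solution to the set of equations (2.18) exists, and therefore parameter
estimates are obtained iteratively using the Fisher scoring algorithm, for example,
\[ \hat{\boldsymbol{\gamma}}^{(t+1)} = \hat{\boldsymbol{\gamma}}^{(t)}
   + \mathbf{I}_n^{-1}(\hat{\boldsymbol{\gamma}}^{(t)}) \, \mathbf{g}(\hat{\boldsymbol{\gamma}}^{(t)}) \tag{2.19} \]
where $\hat{\boldsymbol{\gamma}}^{(t)}$ denotes the parameter values at iteration $t$, $t = 1, 2, \ldots$;
$\mathbf{g}(\cdot)$ denotes the gradient vector; and $\mathbf{I}_n(\cdot)$ denotes the information
matrix. In other words,
\[ \mathbf{g}(\boldsymbol{\gamma}) = \frac{\partial \ln L}{\partial \boldsymbol{\gamma}} \tag{2.20} \]
and
\[ \mathbf{I}_n(\boldsymbol{\gamma}) = -E \left[ \frac{\partial^2 \ln L}
   {\partial \boldsymbol{\gamma} \, \partial \boldsymbol{\gamma}'} \right] . \tag{2.21} \]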
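The iteration (2.19) can be illustrated with a model for which the gradient and the information
matrix have simple closed forms. The sketch below assumes an unweighted logistic regression, which
is not a model discussed in the text and is used only because its score and Fisher information are
easy to write down. The data, starting values, and function name are hypothetical.

```python
import numpy as np

def fisher_scoring_logistic(X, y, tol=1e-10, max_iter=50):
    """Fisher scoring for logistic regression: gamma <- gamma + I_n^{-1} g."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    gamma = np.zeros(X.shape[1])                        # starting values
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ gamma))
        g = X.T @ (y - p)                               # gradient, as in (2.20)
        info = X.T @ ((p * (1.0 - p))[:, None] * X)     # information, as in (2.21)
        step = np.linalg.solve(info, g)
        gamma = gamma + step                            # update, as in (2.19)
        if np.max(np.abs(step)) < tol:
            break
    return gamma

# Small data set with overlapping outcomes, so a finite solution exists.
X = np.column_stack([np.ones(6), [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
gamma_hat = fisher_scoring_logistic(X, y)
```

At convergence the gradient is numerically zero, which is the estimating equation (2.18).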
Denote the contribution to the gradient vector of each first-stage element for a given sampling
stage by $\mathbf{g}_{hij}$, where $h$ denotes the stratum and $i$ the $i$-th unit within this stratum.
The index $j$ denotes a typical final stage element contained within the PSU $hi$. Then the $r$-th
element of the gradient vector is
\[ [\mathbf{g}(\boldsymbol{\gamma})]_r = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}}
   [\mathbf{g}_{hij}(\boldsymbol{\gamma})]_r . \tag{2.22} \]
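From (2.18), (2.20), and (2.22) it follows that $\hat{\boldsymbol{\gamma}}$ is the solution to the set
of equations
\[ \hat{\mathbf{w}}(\hat{\boldsymbol{\gamma}}) = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}}
   \mathbf{g}_{hij}(\hat{\boldsymbol{\gamma}}) = \mathbf{0} . \tag{2.23} \]
Using a first-order Taylor expansion of $\hat{\mathbf{w}}(\hat{\boldsymbol{\gamma}})$ at
$\hat{\boldsymbol{\gamma}} = \boldsymbol{\gamma}$, it follows that
\[ \mathbf{0} = \hat{\mathbf{w}}(\hat{\boldsymbol{\gamma}}) \approx \hat{\mathbf{w}}(\boldsymbol{\gamma})
   + \frac{\partial \hat{\mathbf{w}}(\boldsymbol{\gamma})}{\partial \boldsymbol{\gamma}'}
     (\hat{\boldsymbol{\gamma}} - \boldsymbol{\gamma}) . \tag{2.24} \]
Hence,
\[ \mathrm{Cov}(\hat{\mathbf{w}}(\hat{\boldsymbol{\gamma}})) \approx
   \frac{\partial \hat{\mathbf{w}}(\boldsymbol{\gamma})}{\partial \boldsymbol{\gamma}'}
   \, \mathrm{Cov}(\hat{\boldsymbol{\gamma}}) \,
   \left( \frac{\partial \hat{\mathbf{w}}(\boldsymbol{\gamma})}{\partial \boldsymbol{\gamma}'} \right)' . \tag{2.25} \]
Thus, provided that (cf. (2.23))
$\partial^2 \ln L / \partial \boldsymbol{\gamma} \, \partial \boldsymbol{\gamma}'
 = \partial \mathbf{g}(\boldsymbol{\gamma}) / \partial \boldsymbol{\gamma}'$ is a non-singular matrix,
\[ \mathrm{Cov}(\hat{\boldsymbol{\gamma}}) \approx
   \left[ \frac{\partial^2 \ln L}{\partial \boldsymbol{\gamma} \, \partial \boldsymbol{\gamma}'} \right]^{-1}
   \mathrm{Cov}(\hat{\mathbf{w}}(\hat{\boldsymbol{\gamma}}))
   \left[ \frac{\partial^2 \ln L}{\partial \boldsymbol{\gamma} \, \partial \boldsymbol{\gamma}'} \right]^{-1} , \]
where
$E \left[ \partial^2 \ln L / \partial \boldsymbol{\gamma} \, \partial \boldsymbol{\gamma}' \right]
 = -\mathbf{I}_n(\boldsymbol{\gamma})$. Replacing the matrix of second derivatives by its expected
value gives
\[ \mathrm{Cov}(\hat{\boldsymbol{\gamma}}) \approx \mathbf{I}_n^{-1}(\boldsymbol{\gamma}) \, \mathbf{G} \,
   \mathbf{I}_n^{-1}(\boldsymbol{\gamma}) \tag{2.26} \]
where $\mathbf{G} = \mathrm{Cov}(\hat{\mathbf{w}}(\hat{\boldsymbol{\gamma}}))$.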
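Using results derived by Fuller (1975) (see also Section 2.7), it follows that, under single stage
sampling with replacement (WR) or without replacement (WOR),
\[ \mathbf{G} = \sum_{h=1}^{H} \frac{n_h (1 - f_h)}{n_h - 1} \sum_{i=1}^{n_h}
   (\mathbf{t}_{hi.} - \bar{\mathbf{t}}_{h..})(\mathbf{t}_{hi.} - \bar{\mathbf{t}}_{h..})' \tag{2.27} \]
where:
o $n_h = \sum_{j=1}^{n_{hi}} m_{hij}$, with $m_{hij}$ the number of cases with identical response
patterns within stratum $h$, cluster $i$, and USU $j$. If $f_{hij} = 1$ for all $h$, then
$m_{hij} = 1$ for all $h$, $i$ and $j$.
o $f_h = n_h / N_h$, the sampling rate for stratum $h$.
o $\mathbf{t}_{hij} = \mathbf{g}_{hij}(\hat{\boldsymbol{\gamma}})$, where
$\mathbf{g}_{hij}(\hat{\boldsymbol{\gamma}})$ is the $hij$-th contribution to the gradient vector
$\mathbf{g}(\boldsymbol{\gamma})$ as defined by (2.22).
o $\mathbf{t}_{hi.} = \sum_{j} \mathbf{t}_{hij}$.
o $\mathbf{t}_{h..} = \sum_{i=1}^{n_h} \mathbf{t}_{hi.}$ and
$\bar{\mathbf{t}}_{h..} = \frac{1}{n_h} \sum_{i=1}^{n_h} \mathbf{t}_{hi.}$.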
Currently, we assume a zero contribution to G for strata that contain a single PSU (cluster).
Alternatively, the collapsing of strata or PSUs is left to the user's discretion (see Section 2.7.3).
Additionally, if there is no variable to define clusters, the observations within each stratum are
treated as being the primary sampling units.
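The matrix G in (2.27) and the sandwich form (2.26) can be sketched as follows, taking the
per-element gradient contributions and the information matrix as given inputs. Strata are assumed
to be labelled by integers, sampling rates default to zero (the WR case), and strata with a single
PSU contribute nothing, as stated above. The NumPy implementation, the function names, and the
numbers are assumptions of the illustration.

```python
import numpy as np

def g_matrix(scores, stratum, psu, f=None):
    """G of (2.27) from gradient contributions (one row of scores per element)."""
    scores = np.asarray(scores, dtype=float)
    if scores.ndim == 1:
        scores = scores[:, None]
    stratum, psu = np.asarray(stratum), np.asarray(psu)
    q = scores.shape[1]
    G = np.zeros((q, q))
    for h in np.unique(stratum):
        in_h = stratum == h
        clusters = np.unique(psu[in_h])
        n_h = len(clusters)
        if n_h < 2:
            continue                                    # single-PSU stratum: zero
        f_h = 0.0 if f is None else f[int(h)]
        # PSU totals t_hi. and their deviations from the stratum mean.
        t_hi = np.array([scores[in_h & (psu == c)].sum(axis=0) for c in clusters])
        dev = t_hi - t_hi.mean(axis=0)
        G += n_h * (1.0 - f_h) / (n_h - 1) * (dev.T @ dev)
    return G

def sandwich_cov(information, G):
    """Covariance approximation I_n^{-1} G I_n^{-1} of (2.26)."""
    inv = np.linalg.inv(np.asarray(information, dtype=float))
    return inv @ G @ inv

# One parameter; stratum 0 has two PSUs, stratum 1 has a single PSU.
scores = [0.5, 0.5, -1.0, 0.3]
stratum = [0, 0, 0, 1]
psu = [0, 0, 1, 0]
G = g_matrix(scores, stratum, psu)
G_wor = g_matrix(scores, stratum, psu, f={0: 0.5, 1: 0.0})
cov = sandwich_cov([[2.0]], G)
```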
Binder, D.A. & Hidiroglou, M.A. (1988). Sampling in time. In: P.R. Krishnaiah & C.R. Rao (Eds.).
Handbook of Statistics, Vol. 6. Amsterdam: North-Holland, pp. 187-211.
Biyani, S.H. (1980). On variance estimator in unequal probability sampling, Proceedings of the
Survey Research Methods Section, American Statistical Association, 634-637.
Durbin, J. (1967). Design of Multi-Stage Surveys for the Estimation of Sampling Errors, Applied
Statistics, XVI, 152-164.
Efron, B. (1981). Nonparametric standard errors and confidence intervals, Canadian Journal of
Statistics, 9, 139-172.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans, CBMS-NSF Regional
Conf. Series in Applied Mathematics, no. 38.
Fuller, W.A. (1975). Regression Analysis for Sample Survey. Sankhya, Series C, 37, 117-132.
Horvitz, D.G. & Thompson, D.J. (1952). A generalization of sampling without replacement from a
finite universe. Journal of the American Statistical Association, 47, 663-685.
Kish, L., & Frankel, M.R. (1974). Inference from Complex Samples, Journal of Royal Statistical
Society Ser. B, 36, 1-37.
Kovar, J., Rao, J.N.K., & Wu, C.F.J. (1988). Bootstrap and other methods to measure errors in
survey estimates, Canadian Journal of Statistics, 16 (Supplement), 25-45.
Krewski, D., & Rao, J.N.K. (1981). Inference from stratified samples: properties of the
linearization, jackknife and balanced repeated replication methods, Annals of Statistics, 9, 1010-
1019.
McCarthy, P.J. (1969). Pseudo-replication: Half samples, Internat. Stat. Rev., 37, 239-264.
Rao, J.N.K. (1975). Unbiased variance estimation for multistage designs, Sankhya, C37, 133-139.
Rao, J.N.K., & Scott, A.J. (1981). The Analysis of Categorical Data from Complex Sample
Surveys: Chi-Squared Tests for Goodness of Fit and Independence in Two-Way Tables, Journal of
the American Statistical Association, 76, 221-230.
Richards, V., & Freeman, D.H. (1980). A comparison of replicated and pseudo-replicated
covariance matrix estimators for the analysis of contingency tables, Proceedings of the Survey
Research Methods Section, American Statistical Association, 209-211.
Särndal, C.E., Swensson, B., & Wretman, J. (1992). Model assisted survey sampling. New York:
Springer.
Sen, A.R. (1953). On the estimate of the variance in sampling with varying probabilities. Journal
of the Indian Society of Agricultural Statistics, 5, 55-77.
Shapiro, G.M., & Bateman, D.V. (1978). A better alternative to the collapsed stratum variance
estimate, Proceedings of the Survey Research Methods Section, American Statistical Association,
451-456.
Skinner, C.J., Holt, D., & Smith, T.M.F. (1989). Analysis of Complex Surveys. Chichester: Wiley.
Traat, I., Meister, K., & Söstra, K. (2001). Statistical inference in sampling theory, Theory of
stochastic processes, Vol. 7(23), no. 1-2, 301-316.
Yates, F., & Grundy, P.M. (1953). Selection without replacement from within strata with
probability proportional to size, Journal of the Royal Statistical Society, Series B, 15, 253-261.