Sample Surveys: Rohan, Vijayan
Sample Surveys: Rohan, Vijayan
Rohan, Vijayan
Contents
1 Introduction 2
2 Stratified Sampling 23
2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Post-Stratification . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Cluster Sampling 30
i
4 Sampling with unequal probabilities 39
5 Non-response 55
6 Variance Estimation 61
6.1 Half-samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
1
1 Introduction
2
Proof. Ignoring the dependencies between the yi we have that
Y1 Y2 YN
E [yi ] = + + ··· +
N N N
= Ȳ
just from the definition of simple random sampling, and so obviously E [ȳ] = Ȳ .
Pn n
i=1 yi 1 X X X
Var (ȳ) = Var = 2 Var (yi ) + Cov (yi , yj )
n n i=1 i6=j
Obviously for much the same reason as E [yi ] = Ȳ for any i we will also have
2
Var (y1 ) = E y12 − E [y1 ]
Y12 Y2 Y2
= + 2 + · · · + N − Ȳ 2
N N N
PN 2
i=1 Yi − Ȳ
= = σ2
N
and now we need to find Cov (y1 , y2 ). Well, call it c. Then
1
nσ 2 + n(n − 1)c
Var (ȳ) =
n2
and if n = N then clearly Var (ȳ) = 0, which means that
0 = N σ 2 + N (N − 1)c = σ 2 + (N − 1)c
−σ 2
and so c = N −1 . So
1
nσ 2 + n(n − 1)c
Var (ȳ) = 2
n
1 2 2 n−1
= σ −σ
n N −1
1 n−1
= 1− σ2
n N −1
1 N −n
= σ2
n N −1
1 N −n
= S2
n N
1 1
= − S2
n N
1 1
where the factor n − N is known as the finite population correction factor.
3
Now for an alternative approach to calculating Var (ȳ) and E [ȳ] which will be
much more useful later on. Define indicator random variables {δj } where δi is
n
1 if Yi is in our selected sample, and zero if it is not. Then P (δi = 1) = N is
the inclusion probability of the jth unit and we have
n
E [δi ] =
N
n(n − 1)
P (δi = 1, δj = 1) =
N (N − 1)
n n2 n n
Var (δi ) = − 2 = 1−
N N N N
n 2 n(n − 1) n 2
Cov (δi , δj ) = E [δi δj ] − = −
N N (N − 1) N
Transferring to ȳ,
N PN
1X 1 Yi
E [ȳ] = E [δi ] Yi = n i=1 = Ȳ
n i=1 n N
N
1 X XX
Var (ȳ) = Var (δi ) Yi2 + Cov (δi , δj ) Yi Yj
n2 i=1
i6=j
N n 2
1 n n X 2 X X n(n − 1)
= 1− Y + − Yi Yj
n2 N N i=1 i N (N − 1) N
i6=j
N
X XX
1 1 1 n−1 1
= − Yi2 + − 2 Yi Yj
N n N i=1
nN (N − 1) N
i6=j
N
1 1 1 X 2 1 1 1 XX
= − Yi + − Yi Yj
N n N i=1 N (N − 1) N n
i6=j
N
1 1 1 X 2 1 XX
= − Y − Yi Yj
n N N i=1 i N (N − 1)
i6=j
N
1 1 1 X 2 1 XX
= − Y − Yi Yj
n N N − 1 i=1 i N (N − 1) i,j
N
!
1 1 1 X 2 1 X
= − Y − Yi Ȳ
n N N − 1 i=1 i N −1 i
N
1 1 1 X 2
= − Yi − Yi Ȳ
n N N − 1 i=1
N
1 1 1 X 2
= − Y − 2Yi Ȳ + Yi Ȳ
n N N − 1 i=1 i
4
N
1 1 1 X 2
Yi − 2Yi Ȳ + Ȳ 2
= −
n N N − 1 i=1
N
!
1 1 1 X 2
= − Yi − Ȳ
n N N − 1 i=1
1 1
= − S2
n N
On the other hand, we can also do simple random sampling with replacement.
In this case the yi are independent identically distributed samples from some-
thing with an obvious probability mass function. This means we can apply
extremely basic theory to say that
E [ȳ] = Ȳ
σ2
Var (ȳ) =
n
Proposition 1.1.2. The statistic
Pn 2
2 i=1 (yi − ȳ)
s =
n−1
is an unbiased estimator for
PN 2
2 i=1 Yi − Ȳ
σ =
N
under simple random sampling with replacement. If sampling is without replace-
ment, then the same statistic is unbiased for S 2 .
5
PN " # " #!
Yj2 yi2 + nyi2 + n
P P
n j=1 j6=i yi yj i6=j yi yj
= − 2E +E
n−1 N n n2
PN " #!
Yj2 yi2 +
P
n j=1 j6=i yi yj
= −E
n−1 N n
PN 2
PN 2
!
n j=1 Yj j=1 Yj
n−1
= − − E [yi ] E [yj ]
n−1 N nN n
PN !
n (n − 1) j=1 Yj2 n−1
= − E [yi ] E [yj ]
n−1 nN n
PN
j=1 Yj2 2
− E [yi ] E [yj ] = E yi2 − E [yi ] = Var (yi ) = σ 2
=
N
6
n 1
= 1− S2 = S2
n−1 n
Pn
Proof. Well, take any linear estimator t = i=1 ai yi . Then for t to be unbiased
we must have
Xn
ai = 1
i=1
We also have !2
n
X n−1
X n−1
X
a2i = a2i + 1− ai
i=1 i=1 i=1
7
and so the variance will be minimised when for j 6= n
n
!
∂ X
2
a =0
∂aj i=1 i
!2
n−1 n−1
∂ X 2 X
= a + 1− ai
∂aj i=1 i i=1
n−1
!
X
= 2aj − 2 1 − ai = 2aj − 2an
i=1
which implies that aj = an and so the constants are all the same, so they must be
1
Pn for j 6= n, as we need the last degree of freedom
n . Note that we only minimise
to satisfy the constraint i=1 ai = 1. Also, we have located a minimum as the
second derivative is 2 > 0.
Proposition 1.1.4. If we sample characteristics x, y from a simple random
sample of size n from N units then
1 1
Cov (x̄, ȳ) = − SXY
n N
Proof.
Cov (x̄, ȳ) = E x̄ − X̄ ȳ − Ȳ
= E x̄ȳ − X̄ Ȳ
! N
N
1 X X
= E δi Xi δj Yj − X̄ Ȳ
n2 i=1 j=1
N X N
1 X
= E 2 δj δi Xi Yj − X̄ Ȳ
n i=1 j=1
N
1 XX 1 X
= E 2 δj δi Xi Yj + 2 δ 2 Xi Yi − X̄ Ȳ
n n i=1 i
i6=j
N
(n − 1) X X 1 X
= Xi Yj + Xi Yi − X̄ Ȳ
nN (N − 1) nN i=1
i6=j
N X
N N N
(n − 1) X (n − 1) X 1 X
= Xi Yj − Xi Yi + Xi Yi − X̄ Ȳ
nN (N − 1) i=1 j=1
N n(N − 1) i=1 nN i=1
N N
N (n − 1) (n − 1) X 1 X
= X̄ Ȳ − Xi Yi + Xi Yi − X̄ Ȳ
n(N − 1) nN (N − 1) i=1 nN i=1
8
N
X
N (n − 1) n(N − 1) N −1 n−1
= − X̄ Ȳ + − Xi Yi
n(N − 1) n(N − 1) nN (N − 1) nN (N − 1) i=1
N
n−N N −n X
= X̄ Ȳ + Xi Yi
n(N − 1) nN (N − 1) i=1
N
N −n X N (N − n)
= Xi Yi − X̄ Ȳ
nN (N − 1) i=1 N n(N − 1)
N
!
N −n X
= Xi Yi − N X̄ Ȳ
nN (N − 1) i=1
N
N −n X
= Xi − X̄ Yi − Ȳ
nN (N − 1) i=1
N −n 1 1
= SXY = − SXY
nN n N
Let X, Y be properties which we can measure for every member of our popula-
YT
tion. Then we will often be interested in estimating R = XT
, sometimes for its
own sake, or sometimes so that we can define the estimator
ȲˆR = R̂X̄
Note that for this to be of any use we assume that the population value X̄ is
known. This is not terribly unusual in practice.
9
On the other hand, switching to the second estimator
ȳ
Cov R̂, x̄ = Cov , x̄
x̄
h ȳ i
= E [ȳ] − E E [x̄]
x̄
h ȳ i
= Ȳ − E X̄
x̄
So
Cov x̄ȳ , x̄
h ȳ i
E =R−
x̄ X̄
This makes the bias of ȲRˆ
ȳ
−Cov , x̄
x̄
If x̄ȳ is approximately constant, or has small variance, then this makes the co-
variance term small as in general for any A, B,
2
|Cov (A, B)| ≤ Var (A) Var (B)
and in turn this makes R̂ approximately unbiased for R. Unforunately comput-
ing this covariance term is not possible, so we turn to the first order estimator.
h
ˆ
i h ȳ i ȳ − Rx̄
E Ȳ − Ȳ = E X̄ − Ȳ = X̄E
x̄ x̄
ȳ−Rx̄
Now we use the first order approximation of x̄ around the term in the
denominator, we get
ȳ − Rx̄ ȳ − Rx̄
− 2
x̄ − X̄
X̄ X̄
Note that we only expand one of the x̄ terms. Continuing,
ȳ − Rx̄ ȳ − Rx̄
∼ E X̄ − x̄ − X̄
X̄ X̄ 2
ȳ − Rx̄
=E − x̄ − X̄
X̄
" #
(ȳ − Rx̄) x̄ − X̄
=E −
X̄
" #
Rx̄ x̄ − X̄ ȳ x̄ − X̄
=E −
X̄ X̄
R Cov (ȳ, x̄)
= Var (x̄) −
X̄ X̄
R 1 1 1 1 1
= − SXX − − SXY
X̄ n N X̄ n N
1 1 1
= − (RSXX − SXY )
X̄ n N
10
Obviously this make ȲˆR unbiased for Ȳ as n approaches N . Note that this
estimate depends crucially on the specific taylor series expansion we choose. If
we simply expand x̄ȳ we get something different.
So we need to find the mean squared error of R̂. It turns out that taking
the bivariate expansion of R̂ around X̄, Ȳ gives the same answer as using the
previous expansion. But there’s probably a good reason for using the previous
one, so we still do. Applying an even more basic series expansion, we have
" 2 #
ȳ − Rx̄
∼ X̄ 2 E
X̄
h i
2
= E (ȳ − Rx̄) = Var (ȳ − Rx̄)
as the expectation of ȳ − Rx̄ is zero. Now we have to work out the term on the
right hand side. If we let zi = yi − Rxi and consider outselves to be sampling
from the Zi , we apply previous formulas to get
PN 2
Zi − Z̄
1 1 i=1
Var (z̄) = − SZ2 SZ2 =
n N N −1
So as z̄ = ȳ − Rx̄ we have another expression for Var (ȳ − Rx̄) and we actually
only need to estimate Var (z̄). An estimate of this is given by
Pn 2
(zi∗ − z̄ ∗ )
1 1
\
Var (z̄) = − s2∗
Z s2∗
Z = i=1
n N n−1
where
zi∗ = yi − R̂xi
We have
"P #
n 2
(zi∗ − z̄ ∗ )
E s2∗ i=1
z =E
n−1
" n #
1 X 2
= E yi − R̂xi − ȳ − R̂x̄
n−1 i=1
" n
#
1 X 2
= E (yi − ȳ) + R̂ (xi − x̄)
n−1 i=1
11
" n n n
#
1 X 2
X
2
X 2
= E (yi − ȳ) + R̂ (yi − ȳ) (xi − x̄) + R̂ (xi − x̄)
n−1 i=1 i=1 i=1
h i
= E s2y + R̂sxy + R̂2 s2x
which shows that Var\ (z̄) is unbiased for Var (z̄). Comparing this ratio estimator
to ȳ, we find that the difference of the mean squared errors is
h i
MSE ȲˆR − MSE [ȳ] = Var (ȳ − Rx̄) − Var (ȳ)
= Var (Rx̄) − 2RCov (x̄, ȳ)
p
= R2 Var (x̄) − 2Rρ Var (x̄) Var (ȳ)
So if ρ is sufficiently large then the ratio estimator will be more efficient than
the sample mean. And obviously as |ρ| ≤ 1 the sample mean will always be
more efficient than the ratio estimator if
p p
R Var (x̄) > 2 Var (ȳ)
On the other hand, we can also compute the exact mean square error of the
ratio estimate. So assume that the Xi are always positive.
P
i∈s Yi
h i
MSE ŶR = MSE X P
Xi
" Pi∈s #
N
i=1 δ i Yi
= MSE X PN
i=1 δi Xi
"N #
X
= MSE Yi bδi
i=1
12
where b = PN X .
i=1 δi Xi
N N
!2
X X
= E Yi bδi − Yi
i=1 i=1
N
!2
X
= E Yi (bδi − 1)
i=1
N X
X N
= Yj Yi E [(bδi − 1) (bδj − 1)]
j=1 i=1
N X
X N
= Yj Yi dij (2)
j=1 i=1
where
Well,
N X
X N N X
X N
Xi Xj dij = Xi Xj E [(bδi − 1) (bδj − 1)]
i=1 j=1 i=1 j=1
XN X
N
= E Xi Xj (bδi − 1) (bδj − 1)
i=1 j=1
N
!2
X
= E Xi (bδi − 1)
i=1
N N
!2
X X
= E Xi bδi − Xi
i=1 i=1
h i
2
= E (X − X) = 0
13
Yi
where Zi = X i
and aij = Xi Xj dij . Simply from the definition of the mean
square error, we have that this is always positive, so what we ended up with
must be a non-negative symmetric quadratic form in terms of Zi , for any values
of Zi . The aij are not all positive (although they must be for i = j), but this
implies that
XX XX X
ai,j = ai,j = ai,j = 0
i j i<j j
Going further,
N N
1 XX
=− −2Zj Zi aij
2 j=1 i=1
N N N N N N
1 X X X
2
X X
2
X
=− −2Zj Zi aij + Zi aij + Zj aij
2 j=1 i=1 i=1 j=1 j=1 i=1
N X N N X N N X N
1 X X X
=− −2Zj Zi aij + Zi2 aij + Zj2 aij
2 j=1 i=1 i=1 j=1 i=1 j=1
N N
1 XX 2
Zi + Zj2 − 2Zj Zi aij
=−
2 j=1 i=1
N N
1 XX 2
=− (Zi − Zj ) aij
2 j=1 i=1
1 XX 2
=− (Zi − Zj ) aij
2
i6=j
XX 2
=− (Zi − Zj ) aij
i<j
−1
It remains to work out dij . Well, if p(s) = Nn is the probability that we pick
the sample s using our scheme, we have
dij = E δi δj b2 − δi b − δj b + 1
X X X
= p(s)b(s)2 − p(s)b(s) − p(s)b(s) + 1
s∈i,j s∈j s∈i
1 X X X N
= N b(s)2 − b(s) − b(s) +
n s∈i,j s∈j s∈i
n
14
1 2X 1 X 1 X 1 N
= N X 2 − X P −X P +
n
P
s∈i,j Xk s∈j k∈s Xk s∈i k∈s Xk n
k∈s
The notation s ∈ i means that we sum over all samples s which contain i.
Example 1.2.1. It turns out that the estimation of the mean of a subpopulation
from a sample of a larger population is actually a ratio estimator. Say we have
a random sample of 400 households from WA, 20 of which are from Nedlands,
and we are interested in estimating average income. The obvious estimator of
average household income for Nedlands is going to be an average over 20 units.
So if the total income over the 20 units is 1, 600, 000 we have
1, 600, 000
ȲˆN edlands = = 80, 000
20
But in fact the 20 units from Nedland was actually random. So define Wi to
be an indicator of whether the ith unit in the population is in Nedlands, and
obviously the total income over Nedlands is
400
X
W YT = Wi Yi
i=1
WY
W̄
and obviously as this is a ratio estimator we already know its mean squared
error.
15
1.3 Superpopulation based and design based estimation
The justification for the supersample approach is that in many cases even a
census represents just a sample from a larger population. For instance, a census
of australian residents only measures the population at a single moment. Very
quickly the population will have changed, and will soon be different. It is also
only one of the possible populations that might have arisen from the same set
of underlying social and economic influences. In this case we might argue that
we were more interested in the underlying infinite population than the finite
sample, so the superpopulation view is not that unrealistic.
Y = βX +
16
We know that if s denotes our sample, we have
X
N Ȳ = nȳ + Yi
i∈s
/
ȲˆR = R̂X̄
The difference is that our assumptions make the variance of the estimator very
different. For example, assume that ∼ N (0, τ 2 ). Then
nτ 2 τ2
Var R̂ x1 . . . xn = Pn 2 = nx̄2
( i=1 xi )
This suggests that we not use simple random sampling, and instead try to
maximize x̄.
Now we extend the previous example to take into account an intercept as well.
That is, our model is
Yi = α + βXi + i
17
where it is assumed that the i are uncorrelated, have common variance σ 2 and
mean 0 conditional on the value of Xi . So
E [ i | Xi ] = 0
E [ Yi | Xi ] = α + βXi
To derive our estimate of Y , lets restrict ourselves to linear estimators. That is,
our prospective estimator is t, where
X X
t= Yi + gi Yi (3)
i∈s i∈s
and the collection gP i determine the equation of the estimator. For this to be
N
model unbiased for i=1 Yi we must have
" # " #
XN X X
E t− Yi x1 . . . xn = E gi Yi − Yi x1 . . . xn
i=1 i∈s i∈s
/
" #
X X
=E gi Yi x1 . . . xn − (α + βXi )
i∈s i∈s
/
" #
X X X
=E α gi + β gi Xi x1 . . . xn − (α + βXi )
i∈s i∈s i∈s
/
X X X
=α gi + β gi Xi − (α + βXi )
i∈s i∈s i∈s
/
=0
So
X X X X
α gi + β gi Xi = (α + βXi ) = (N − n)α + β Xi
i∈s i∈s i∈s
/ i∈s
/
18
!2
X X X
= E gi Yi − (α + βXi ) − (Yi − α − βXi ) x1 . . . xn
i∈s i∈s
/ i∈s
/
" # !2
X X X
= E gi Yi − E gi Yi x1 . . . xn − i x1 . . . xn
i∈s i∈s i∈s
/
!2
X X X
= E gi Yi − gi (α + βxi ) − i x1 . . . xn
i∈s i∈s i∈s
/
!2
X X
= E gi i − i x1 . . . xn
i∈s i∈s
/
!2 ! ! !2
X X X X
= E gi i −2 gi i i + i x1 . . . xn
i∈s i∈s i∈s
/ i∈s
/
XX XX XX
= E gi gj i j − 2 gi i j x1 . . . xn + E i j x1 . . . xn
i∈s j∈s i∈s j ∈s/ i∈s
/ j ∈s
/
XX X XX
E gi2 2i x1 . . . xn − 2
= gi gj E [ i j | x1 . . . xn ] + gi E [ i j | x1 . . . xn ]
i∈s,j∈s,j6=i i∈s i∈s j ∈s
/
XX X
E 2i x1 . . . xn
+ E [ i j | x 1 . . . x n ] +
i∈s,j
/ ∈s,i6
/ =j i∈s
/
i∈s
!
X
= gi2 + N − n σ2
i∈s
where the last line section follows from the independence of the i . We can use
lagrange multipliers to minimize this expression subject to conditions (4) and
(5), to get
N X̄ − x̄
N
gi = −1 + P 2 (Xi − x̄)
n i∈s (Xi − x̄)
19
!
X 1 Xi − x̄
=N + X̄ − x̄ P 2 Yi
i∈s
n i∈s (Xi − x̄)
X Xi − x̄
= N ȳ + N X̄ − x̄ P 2 Yi
i∈s i∈s (Xi − x̄)
= N ȳ + N X̄ − x̄ b
where
Pn
(xi − x̄) yi
b = Pni=1
i=1 (xi − x̄) xi
is a model-unbiased estimator of β as
Pn
(xi − x̄) E [ yi | x1 . . . xn ]
E [ b| x1 . . . xn ] = i=1 Pn
i=1 (xi − x̄) xi
Pn
(x − x̄) (α + βxi )
= i=1Pn i
(xi − x̄) xi
Pn i=1
(xi − x̄) xi β
= Pi=1
n =β
i=1 (xi − x̄) xi
Now, what would happen if we let our estimator have the same form but we
now treated it as a design-based estimator instead of a model based estimator?
Well, we still define
PN
i=1 (xi − x̄) yi
β = PN
i=1 (xi − x̄) xi
α = Ȳ − β X̄
20
The estimators of α and β are the same as for the model-based approach, but
these estimators are now only asymptotically unbiased. As for the variance of
Ȳˆ , we have
Var Ȳˆ = Var ȳ − b x̄ − X̄
∼ Var ȳ − β x̄ − X̄
= Var ȳ − β x̄ + β X̄
h 2 i
= E ȳ − Ȳ − β x̄ + β X̄
h i
2
= E (ȳ − β x̄ − α)
= E z̄ 2
1 1
= − SZ2
n N
where Zi = Yi − Ȳ − β(Xi − X̄) is the model residual and we assumed that this
had expected values of zero. A sample estimator of SZ2 is
n
1 X 2
s2z = ((yi − ȳ) − b (xi − x̄))
n − 1 i=1
n
1 X 2
= (yi − bxi − ȳ + bx̄)
n − 1 i=1
n
1 X 2
= (yi − bxi − a)
n − 1 i=1
21
Applying some approximations to b, we have
a (a − A) + A
b= =
c (c − C) + C
a−A
A A +1
= c−C
C C +1
a−A
A A +1
=
C 1 − − c−C
C
A a−A c−C
∼ +1 1−
C A C
The approximation comes from the expansion
∞
1 X
= xn
1 − x n=0
This is only valid for |x| < 1, and making this assumption in our case is not
unreasonable. Continuing,
A a−A c−C c−C a−A
= +1− −
C A C C A
We have
E [t] = Y − N E b x̄ − X̄
Ignoring the last term as being insignificant, this makes the bias of t
A a−A c−C
N E b x̄ − X̄ ∼ N E +1− x̄ − X̄
C A C
A a−A c−C
= NE − x̄ − X̄
C A C
A a c
= NE − x̄ − X̄
C A C
1 Ac
= NE a− x̄ − X̄
C C
1
= NE (a − βc) x̄ − X̄
C
" n n
! #
1 X X
= NE 2
(xi − x̄) yi − β (xi − x̄) xi x̄ − X̄
Sxx i=1 i=1
" n
! #
1 X
= NE 2
(xi − x̄) (yi − ȳ − β (xi − x̄)) x̄ − X̄
Sxx i=1
22
2 Stratified Sampling
2.1 Motivation
Suppose we again take a sample of 400 households from WA, and we find that
we have a sample mean of 100, 000. This is clearly too high to be representative,
and on looking more closely we notice that
This says that our random sample is rather unrepresentative, and contains far
too many households from wealthy areas. So divide the total population up
into 3 groups or strata. Let Wi be the proportion of the population that lies in
strata i, and ni the number of units chosen from strata i. Obviously in an ideal
situation we would have ni = nWi , but in our example we have rather severe
deviations from this.
So lets make some adjustments to fix this under and over representation in our
sample. We estimate the mean per strata, then weight these and sum them to
get a total mean that takes into account the variability among the three different
parts of the population.
Ȳˆst = 0.05 ∗ 200, 000 + 0.2 ∗ 160, 000 + 0.75 ∗ 32, 000 = 68, 000
In this case we performed the stratification after the sample was taken. How-
ever, we may perform stratification before taking the actual sample and this is
actually preferred.
23
2.2 Stratified simple random sampling
Now to lay the idea out in full. The main idea is that we decide beforehand
that there are k strata in the
Pkpopulation, and that we want to sample ni units
from the ith strata, where i=1 ni = n. Similarly let Ni be the population of
Pk
the ith strata, and i=1 Ni = N . The mean of the ni units from the ith strata
is ȳi and s2i is the corresponding variance. Ȳi and Si2 are the corresponding
population values, and Yij is the jth unit from the ith strata. Wi is the fraction
of the population that lies in the ith strata, so Wi = NN . We have
i
n
X
Ȳ = Wi Ȳi
i=1
n
Ȳˆst =
X
ȳi Wi
i=1
Stratification is in fact best when the between strata variation is very large and
the internal per-strata variation tends to be very small. This is because
n
X
Var (ȳst ) = Wi2 Var (ȳi )
i=1
n
X 1 1
= Wi2 − Si2
i=1
ni Ni
and so only the internal per-strata variation contributes to the error of the
estimator. If we are dealing with proportional allocation, this becomes
n
X
2 1 1
Var (ȳst ) = Wi − Si2
i=1
n i N i
24
n
X ni 1 N
= Wi − Si2
i=1
n n i N N i
n
X 1 ni N 1
= Wi − Si2
i=1
n N i n N
n
X 1 1
= Wi − Si2
i=1
n N
n
1 1 X
= − Wi Si2
n N i=1
If we have more than n strata then it will be impossible to choose even 1 unit
from every strata, so there are problems in this situation.
So we want to minimize
n
X Ni2 2
S
i=1
N 2 ni i
Pn
subject to the constraint i=1 ni = n. Well, using lagrange multipliers and
treating the ni as real-valued instead of integer-valued, we want to minimize
n n
! n n
!
X Ni2 Si2 X X Wi2 Si2 X
f= +λ ni − n = +λ ni − n
i=1
N 2 ni i=1 i=1
ni i=1
25
Taking partial derivatives,
∂ Wj2 Sj2
=− +λ=0 (6)
∂nj n2j
n
∂ X
= ni − n = 0 (7)
∂λ i=1
So !2
n
1 X
λ= 2 Wi Si
n i=1
Wj Sj
nj = n Pn
i=1 Wi Si
26
Pn
j=1 Wj S j
where A = n . On the other hand, the optimum allocation gives a
variance of
n
X 1 1
Var (ȳst ) = Wi2 Si2 −
i=1
n∗i Ni
n
X W 2S2 i i
∼
i=1
n∗i
n
X Wi2 Si2
=
n PnWiW
i=1
Si
j Sj
j=1
n n
X W i Si
X
= Wj Sj
i=1
n j=1
P 2
n
j=1 Wj Sj
=
n Pn
2 2 j=1 Wj Sj
= A n = A n Pn
j=1 Wj Sj
n
X Wj Sj
= A2 n Pn
j=1 j=1 Wj Sj
n n 2
X X (n∗ )
= A2 n∗i = A2 i
j=1 j=1
n∗i
So from the algebra, if ni is close to n∗i we will attain a variance close to the
optimum, like we said. So even if the Si2 are only approximately known, maybe
from some previous study, it still makes sense to use these to pick approximately
optimal strata sizes.
27
variance and internal-strata variance. That is,
k X Ni
1 X 2
S2 =
Yij − Ȳ
N − 1 i=1 j=1
k X Ni
1 X 2
= Yij − Ȳi + Ȳi − Ȳ
N − 1 i=1 j=1
k X Ni
1 X 2 2
= Yij − Ȳi + Ȳi − Ȳ + 2 Ȳi − Ȳ Yij − Ȳi
N − 1 i=1 j=1
k Ni Ni
1 X X Ni − 1 2 X 2
= Yij − Ȳi + Ȳi − Ȳ
N − 1 i=1 j=1 Ni − 1 j=1
Ni
X
+2 Ȳi − Ȳ Yij − Ȳi
j=1
k Ni Ni
1 X
(Ni − 1)
X 1 2 X 2
= Yij − Ȳi + Ȳi − Ȳ + 0
N −1 i=1 j=1
N i −1 j=1
k
!
1 X 2
(Ni − 1) Si2 + Ni Ȳi − Ȳ
=
N −1 i=1
and this can be related back to the ANOVA table of one-way classification. If
N − 1 ∼ N and Ni ∼ Ni − 1 we have approximately
k k
X X 2
N S2 = Ni Si2 + Ni Ȳi − Ȳ
i=1 i=1
k k
X X 2
S2 = Wi Si2 + Wi Ȳi − Ȳ
i=1 i=1
n
X
1 1 1 1
Var (ȳ) − Var (ȳst ) = − S2 − − Wi Si2
n N n N i=1
X n n n
!
1 1 2
X 2 X 2
∼ − Wi Si + Wi Ȳi − Ȳ − Wi Si
n N i=1 i=1 i=1
n
1 1 X 2
= − Wi Ȳi − Ȳ
n N i=1
and as the right hand side is positive we find that ȳst has a lower variance than
ȳ. But this relied crucially on our assumptions. So we find that in fact it is
28
not always true that ȳst has lower variance than ȳ, the whole-sample mean.
However it is usually true. In fact the exact condition we need is
k k
X 2 1 X
Ni Ȳi − Ȳ > (N − Ni ) Si2
i=1
N i=1
Proposition 2.4.2. Even optimal allocation does not always result in an esti-
mator with a lower variance than simple random sampling.
2.5 Post-Stratification
Post stratification means that the distribution of the n sampled units among the
strata is only known after the sampling has been performed. So the allocation of
the units among the strata is random, and as expected this adds some variability
to the stratified estimator. Obviously if we ignore the stratification and use the
whole-sample mean we have
k
X
ȳ = wi y¯i
i=1
where wi is the proportion of observed units from the ith strata. On the other
hand the post stratification estimator is defined as
k
X
ȳst = Wi ȳi
i=1
where we now have that ni is also random. As ȳi is unbiased for Ȳi we have
that ȳst is still unbiased for Ȳ . Now for calculating the variance. Well, it is a
fact that
Var (Y ) = E [Var (Y |X)] + Var (E [Y |X])
So
29
This variance is actually the variance of a standard stratified estimator, with
allocations of n1 , n2 . . . nk to the various strata. So
" k #
X 1 1 2 2
=E − Wi Si
i=1
ni Ni
k
X 1 1
= E − Wi2 Si2
i=1
ni Ni
k
X 1 2 2 1 2 2
= E Wi Si − W S
i=1
ni Ni i i
Now, assume that all the ni are nonzero. This is not too radical an assumption
if k is small compared to n. With this assumption the ni have the positive
binomial distribution, and we find that
1 1 1 1
E ∼ − 2 + 2 2
ni nWi n Wi n Wi
So
k
X 1 1 1 1
= − 2 + 2 2 Wi2 Si2 − Wi2 Si2
i=1
nWi n Wi n Wi Ni
k
X Wi Wi 1 1
= − 2 + 2 Si2 − Wi2 Si2
i=1
n n n Ni
k
X Wi 2 1 2 1 2
= S + 2 Si (1 − Wi ) − Wi Si
i=1
n i n N
X k k
1 1 1 X 2
= − Wi Si2 + 2 S (1 − Wi )
n N i=1 n i=1 i
If n is large the second term becomes small, and the first term looks like the
variance of proportional allocation. So if n is large post stratification is almost as
good as stratification with proportional allocation. Obviously the more variance
there is between the strata means, the better post stratification will be. The
unbiased estimate of this variance is
k
X 1 1
− Wi2 s2i
i=1
ni Ni
3 Cluster Sampling
So far we’ve assumed that a list of all the units in the population is available.
But often this is not the case and more often there will be lots of groups of
30
units, each one called a cluster. For each cluster we assume that the list of
units contained in the cluster can be obtained without much cost. There isn’t
really that much of a new problem here yet - We just sample from the clusters,
which will now be called sampling units and then perform a census of all
the units contained in the chosen clusters. We will assume that there are N
sampling units, with associated values Y1 . . . YN .
Going back to the example, say that we pick n = 3 and for our three clusters
we have
1 2 3
Households 12 20 8
Yi 600,000 1,200,000 4,400,000
Then we have
6 × 105 + 12 × 105 + 44 × 105 62 5
z̄ˆ = = 10
40 40
If, on the other hand, we knew that there were only 200 households, then we
could use the estimate
20(6 × 105 + 12 × 105 + 44 × 105
200
31
But the problem with our second estimator is that it fails to take into account
the number of households for the sampling units we picked. In ratio estimator
terms, the first estimator is x̄ȳ and the second X̄ ȳ
. So the mean square error of
the first is
h ȳ i 1
MSE = 2 Var (ȳ − Rx̄)
x̄ X̄
and that of the second is
ȳ 1
Var = 2 Var (ȳ)
X̄ X̄
So if we assume that ȳ − RX̄ is less variable than ȳ, as we normally would, then
the first estimator is biased but has a lower mean square error than the second.
ȳ
We do in fact prefer the estimator m̄ for cluster sampling, and the mean square
error is
Ȳ
Var ȳ − m̄
M̄
and can be estimated by
\ ȳ
Var (z̄) zi = yi − mi
m̄
for exactly the same reasons as given in the ration estimation section. As
said, this estimator is not unbiased, but compensates by correcting for non-
representativeness of the sample.
Cluster 1 2 3 4 5 Total
Household Size 20 12 8 40 20 100
Total Household Income 11 7 5 18 12 53
Now, take a sample of size 2. There are ten possible samples, all of which have
equal probability of being selected, and our estimates are either
5 × ȳ
T1 =
100
ȳ
T2 =
x̄
where the Xi are the household sizes and Yi are the total household incomes.
T2 is the preferred biased estimator and T1 is unbiased, and the corresponding
values are
32
Sample chosen T1 T2
(1,2) 45,000 56,500
.. .. ..
. . .
(2,3) 30,000 60,000
.. .. ..
. . .
(4,5) 75,000 50,000
The estimator T1 averages to 53, 000, which is the true value. The estimator
T2 averages to something else, but varies much less than T1 . Now, to make T2
unbiased we just alter the weights which we assign to each sample. Let our new
selection probabilities be denoted by p1 . . . p10 and assume that the cluster sizes
are known. We want to have that
(T1 )i
pi (T2 )i =
10
where i references the ith possible sample, as this will mean that
E [T1 ] = E [T2 ]
So
(T1 )i x̄i N 1
pi = =
(T2 )i 10 100 10
where N is the number of clusters. So we end up choosing pi proportional to
x̄i , that is, the probability of choosing a sample is proportional to the number
of households in that cluster. So in our example we would pick sample i with
probability proportional to L, which is given in the table.
Sample chosen L
(1,2) 32
.. ..
. .
(2,3) 28
.. ..
. .
(4,5) 60
But applying this scheme is difficult in practice, and we would really like to
assign inclusion probabilities to individual units. A common alternative is called
the Midzuno-Sen sampling scheme. Under this scheme we only pick 1 cluster
from the N with probability proportional to L, and then pick the remaining
n − 1 from the N − 1 as a simple random sample. If we were to apply this to the
above example, we would end up with the following probabilities for the first
choice.
Size 20 12 8 40 20 100
Probability .2 .12 .08 .4 .2 1
33
Under this scheme the probability of picking clusters (y1 , y2 . . . yn ) is given by
m1 1 m 1
+ ··· + n
M N −1 M N −1
n−1 n−1
Obviously the Midzuno-Sen scheme makes the inclusion probabilities much sim-
pler, and basically after the first unit is selected the inclusion probabilities be-
come simple random sampling without replacement and don’t change much
more. On the other hand, if every unit was selected proportionally to L then
the inclusion probabilities would in a sense change after every selection.
Finally, the distance between successive units in terms of the ordering can be
very important. For example, if the characteristic of interest is periodic in terms
of the ordering and Nn is approximately or exactly the period, then we will always
tend to sample at around the same point in the period. This can result in an
estimate that is quite bad. On the other hand, if N n is an odd multiple of half
the period, then we will tend to alternate between sampling peaks and troughs,
and this will give a good estimate.
34
3.3 Two stage cluster sampling
The assumption with two stage cluster sampling is that a list of all possible
primary units is available, and the list of all secondary units can be determined
for the selected primary units, possibly at a small cost. It is not really assumed
that a list of every secondary unit exists. This makes perfect sense, for example
imagine we are trying to estimate average household income in perth. A list of
all households doesn’t exist, but after we choose specific streets or blocks, a list
of households in these regions isn’t too difficult to find.
Now the notation. We assume that there are N clusters, with sizes M1 , M2 . . . MN ,
and the total number of second stages units is M . Before sampling, we decide
that if cluster i is chosen then mi secondary units will be picked from within
that cluster. Let Yi,j denote the jth unit from the ith cluster, Yi be the total
characteristic value of the ith cluster, and let Si2 and s2i be the population and
sample variance for the ith cluster, respectively. The value of interest is
PN PMi
i=1 j=1 Yi,j
Ȳ = PN
i=1 Mi
which is the average Y value per second stage unit. We estimate this by
N ȳT
Ȳˆ = PN
Mi
Pi=1
Y i
ȳT = i∈s
n
Unfortunately Yi is unknown and so we estimate ȳT by
P
Mi ȳi
ȳT = i∈s
ˆ
n
where ȳi is obviously the average of the values selected from the ith cluster.
Putting this back into the definition for Ȳˆ , we get the obviously unbiased esti-
mator
P
ˆ N ȳˆT N i∈S Mi ȳi N X
Ȳ = PN = PN = Mi ȳi
i=1 Mi n i=1 Mi nM
i∈S
Under two stage clustering the variance of the estimator comes from two sources
corresponding to the two stages, between cluster variation and within-cluster
35
variation. First lets look at the second stage variation. So take fixed numbers
i1 , i2 . . . in to be the clusters we have picked, so that the number of units selected
are mi1 , mi2 , . . . min respectively. Having fixed these clusters we are essentially
doing stratified sampling, so
N2 X
Var Ȳˆ i1 , i2 . . . in = Mi2 Var (ȳi )
n2 M 2 i∈s
N2 X 2 1
1
= 2 2 Mi − Si2
n M i∈s mi Mi
and then taking the expectation over first stage choices gives
N
n N2 X 2 1
h i 1
E Var Ȳˆ i1 , i2 . . . in = M − Si2
N n2 M 2 i=1 i mi Mi
n
as E [δi ] = N. Now for the first stage variation, which is
!
h i N X
Var E Ȳˆ i1 , i2 . . . in = Var Mi Ȳi
nM
i∈S
!
2
N X
= 2 2 Var Mi Ȳi
n M
i∈S
P
N2
i∈S Mi Ȳi
= 2 Var
M n
N2 1
1 2
= 2 − SM Ȳ
M n N
N
N2 1
N X 2 1 1 2 1 2
= M − S + − SM
nM 2 i=1 i mi Mi i
M2 n N Ȳ
If we define
N
X
YT = Mi Ȳi
i=1
36
then this is obviously M Ȳ and so the estimator will just be ŶT = M Ȳˆ , with
variance
N
NX 2 1 1 2 2 1 1 2
M − Si + N − SM
n i=1 i mi Mi n N Ȳ
This takes some explaining, as we have replaced a sum over [1, N ] with a sum
over [1, n] without adding a compensating Nn factor. The main point is that
PN
Mi Ȳi
i=1
MY =
Pn N
Mi y¯i
my = i=1
n
N 2
2
X Mi Ȳi − M Y
SM Ȳ =
i=1
N −1
n 2
X (Mi y¯i − my)
s2M Ȳ =
i=1
n−1
ŷratio will obviously generally be biased, as from the ratio estimation section R̂
will be biased for R, although if n is large the bias will not be too significant.
37
Looking at the mean squared error of ŷratio , we have
h i
2
MSE [ŷratio ] = E (ŷratio − Y )
" 2 #
2 ȳˆT
=M E −R
m̄
" 2 #
ˆ
ȳ T − R m̄
= M 2E
m̄
" 2 #
2 M̄ ȳˆT − Rm̄
=M E
m̄ M̄
" 2 #
ˆ
ȳ T − R m̄
∼ M 2E
M̄
M̄
assuming that m̄ is approximately constant.
h 2 i
= N 2 E ȳˆT − Rm̄
= N 2 Var ȳˆT − Rm̄
as E ȳˆT − Rm̄ = 0. This is equal to N 2 Var (z̄) if we let zi = Mi ȳi − Rmi .
Continuing,
38
There’s still the question of how we choose n and the {mi }. Often the main
consideration is cost or budget constraints. For example, say that the ith cluster
costs ci per sampled unit. This means that mi ci is the total cost of the sampling
from the ith cluster, and we can also add a fixed cost and fixed per-cluster cost.
Taking these three together gives us
X
overall cost = c0 + nc + mi ci
i∈s
Typically we would fix the expected total cost and then choose the {mi } to
minimise the variance.
So far we’ve mainly considered simple random sampling where all units have
had the same chance of inclusion, but here we consider different probability
schemes. Previously we assumed we had N units with values Y1 . . . YN , and
the probability that yi was from some particular unit was N1 . Now we are
going to allow the inclusion probabilities to vary, and they will be denoted by
p1 . . . pN . Obviously with replacement selection is easier as even if the inclusion
probabilities are not constant across units we will still have yi independent of
yj . So for the moment we use with-replacement selection.
Now for estimation using unequal probability sampling. Let P1 denote the
probability that y1 takes the value it does. That is, if y1 = Y1 then P1 = p1 .
If y1 = Y2 , then P1 = p2 . If we define t1 = Py11 then obviously this will be an
PN
unbiased estimator of Y = i=1 Yi , as
N
Y1 YN X
E [t1 ] = p1 + . . . pN = Yi
p1 pN i=1
2
Var (ti ) = E t21 − E [t1 ]
N 2
X Yi
= pi − Y 2
i=1
p i
N
X Y2
i
= −Y2
i=1
p i
N 2
X Yi
= pi −Y
i=1
pi
39
Pn
We have n independent estimates t1 . . . tn with ti = Pyii , and so t̄ = n1 i=1 ti
is unbiased for Y with variance Varn(ti ) . As we are doing with replacement
sampling this can be estimated by
Pn 2
s2t (ti − t̄)
= i=1
n n(n − 1)
Remember that we are sampling with replacement, so units can appear multiple
times.PSo let Qi denote the number of times that unit i is included in the sample,
N
with i=1 Qi = n and E [Qi ] = npi .
"P # P N
N N
i=1 Qi Ti npi Ti X
E [t̄] = E = i=1 = Yi
n n i=1
Yi
where Ti = pi .
"N #
s2t
1 X 2
E = E Qi (Ti − t̄)
n n(n − 1) i=1
"N #
1 X 2
= E Qi (Ti − Y + Y − t̄)
n(n − 1) i=1
"N #
1 X 2 2
= E Qi (Ti − Y ) + 2 (Y − t̄) (Ti − Y ) + (Y − t̄)
n(n − 1) i=1
"N N
#
1 X 2
X 2
= E Qi (Ti − Y ) + 2 (Y − t̄) Qi (Ti − Y ) + n (Y − t̄)
n(n − 1) i=1 i=1
"N N N
! #
1 X 2
X X 2
= E Qi (Ti − Y ) + 2 (Y − t̄) Qi Ti − Qi Y + n (Y − t̄)
n(n − 1) i=1 i=1 i=1
"N #
1 X 2 2
= E Qi (Ti − Y ) + 2n (Y − t̄) (t̄ − Y ) + n (Y − t̄)
n(n − 1) i=1
N
!
1 X 2
h
2
i
= n pi (Ti − Y ) − 2nVar (t̄) + nE (Y − t̄)
n(n − 1) i=1
N
!
1 X 2
= n pi (Ti − Y ) − 2nVar (t̄) + nVar (t̄)
n(n − 1) i=1
N
!
1 X 2
= n pi (Ti − Y ) − nVar (t̄)
n(n − 1) i=1
1 1
= (nVar (ti ) − nVar (t̄)) = (nVar (ti ) − Var (ti ))
n(n − 1) n(n − 1)
1
= Var (ti ) = Var (t̄)
n
40
Simple random sampling with replacement is not unusual in the unequal prob-
ability setting. Finally, we can also see why picking pi ∝ Yi is really quite a
boring case. It is quite unrealistic to assume this is possible, and when it is our
estimator has variance 0 as Pyii = c. Another way of looking at this is that if
we know all the pi and one yi , we can use this information about the design to
calculate Y - So in a sense the proportionality means that knowledge about the
Yi can be replaced by knowledge about the pi .
But now that we’ve decied to allow the selection probabilities to vary we need a
good method for specifying these probabilities. Assume that we again have an
auxillary characteristic X, known for every unit in the population, and that Y is
expected to be roughly proportional to X. Then it seems very logical to choose
pi proportional to Xi , especially as it might be difficult to pick it proportional
to the so far unobserved Yi . This sort of sampling scheme is very common
in cluster sampling, so common that in fact the characteristic X is sometimes
called ‘size’.
Now to actually apply this sampling scheme. Well, consider a two stage model
where our ‘size’ variable is cluster size. We use probability proportional to size
and with replacement sampling of clusters. Obviously
Yi
ti =
Pi
is impossible as we don’t know Yi , so we instead use
ŷi
ti =
Pi
which is still unbiased for Y as long as ŷi is unbiased for Yi , where Yi is the total
of the units in the ith selected cluster. Note we didn’t need to know anything
about the selection of the second stage units in order to say that ti was unbiased
for Y , except independence between the subsampling of different clusters. The
estimator
n
1X
t̄ˆ = ti
n i=1
41
is also unbiased for Y , with variance
N
!
Var (t1 ) 1 X ŷi
= Var δi
n n i=1
Pi
" N #! " N
!#!
1 X ŷi X ŷi
= Var E δi cluster + E Var δi cluster
n i=1
Pi i=1
Pi
N
! " N
#!
1 X Yi X Var (ŷi )
= Var δi +E δi
n i=1
P i i=1
Pi2
N 2 N
1X Yi 1 X Var (ŷi )
= Pi −Y +
n i=1 Pi n i=1 Pi2
Note that in the setup we’ve given without replacement sampling really is better
than with replacement sampling, however it is not commonly used due to the
complication involved.
Now lets go to without replacement selection. Well, for the first unit we have the
selection probabilities given by the {pi }, and for the second unit we choose with
probabilities proportional to the {pi } from the remaining units. For example,
assume that N = 4, and the selection probabilities are
Say that the first selected unit is 2. Then the selection probabilities for the
second unit conditional on having picked the first are
5 15 20
p1 = , p3 = , p4 =
40 40 40
If the next unit picked is 4, the selection probabilities for the next selection are
1 3
p1 = , p3 =
4 4
So we can see how the inclusion probabilities change over successive selections.
But the calculations are really quite messy. For example, the probability of
selecting unit 1 as the second unit is
1 1 1
0.2 ∗ + 0.3 ∗ + 0.4 ∗
8 7 6
42
To get all of these numbers we have to go back and work out the probability of
selecting unit 1 given that 3 is selected first, etc. In the simple case that n = 2
we have
0.1 0.1 1
π1 = 0.1 + 0.2 + 0.3 + 0.4
0.8 0.7
0.6
X pi
= 0.1 1 +
1 − pi
i6=1
X pi
π2 = 0.2 1 +
1 − pi
i6=2
Something that I haven’t had a chance to follow up - Given that we have selected
the units i1 . . . in , define
Yi1
t1,D =
Pi 1
Yi
t2,D = 2 (1 − Pi1 ) + Yi1
Pi 2
Yin 1 − Pi1 − · · · − Pin−1
tn,D = + Yi1 + · · · + Yin−1
Pi n
Pn
Then we can use the estimator t̄d = n1 i=1 ti,D , which has a variance which
can be unbiasedly estimated by
n
1 X 2
(ti,D − t̄D )
n − 1 i=1
This turns out to have been proposed by Des Raj in 1956. Apparently it relates
to selection of the first unit with probabilities {pi } and selection of successive
units with probabilities proportional to the {pi }.
Definition 4.1.1. The Horvitz-Thompson estimator of Y can be used with
any probability sampling scheme, including both with and without replacement.
It is
X Yi N
X Yi
ŶHT = = Zi
i∈s
πi i=1
πi
where πi is the probability that the sample contains the unit i and our indicator
variables are now going to be denoted n byo Zi . Another way of looking at this is
Yi
that we sample from the population πi . This estimator is obviously unbiased
for Y .
43
Some useful properties we will need are
N
X
πi = n
i=1
X
πij = (n − 1)πi
i6=j
PN
For the first one we know that n = i=1 Zi , so
"N # N
X X
n=E Zi = πi
i=1 i=1
So
N 2
X Yi X X Yi Yj
Var ŶHT = πi (1 − πi ) + (πij − πi πj )
i=1
πi πi πj
i6=j
44
Now to come up with an estimate of this variance. Well, we know that
"N # N
X X
E ai Zi = ai πi
i=1 i=1
XX XX
E aij Zi Zj = aij πij
i6=j i6=j
Applying this to Var ŶHT gives
2
Yi
ai = (1 − πi )
πi
Yi Yj (πij − πi πj )
aij =
πi πj πij
So our estimate is
N 2
\ X Yi XX Yi Yj (πij − πi πj )
Var ŶHT = Zi (1 − πi ) + Zi Zj
i=1
πi πi πj πij
i6=j
where
aii = πi (1 − πi )
aij = πij − πi πj
Following some algebra similar to that in the ratio estimation chapter, we get
X X Yi 2
Yj
=− aij −
πi i<j
πj
2
1 X X Yi Yj
=− − aij
2 πi πj
i6=j
45
Now similar to what we did with the Horvitz Thompson estimator, we can
estimate this by
2
1 X X Yi Yj Zi Zj
− − aij
2 πi πj πij
i6=j
Assume that we have N clusters, which are our primary sampling units, and Yi
is the total of the ith cluster. We use some arbitrary sampling scheme (choice
of πi values), which is irrelevant for our purposes, to pick a sample of n clusters.
Obviously the Horvitz Thompson estimator assuming we actually know the
values of Yi will be
N
X Zi Yi
ŶHT =
i=1
πi
Unfortunately the whole point is that we don’t know Yi and must estimate it as
Ŷi , by using some sampling scheme on the ith cluster. Again, we don’t care at
all what sampling scheme is used and this is one of the advantages of the Horvitz
Thompson estimator. Using these Ŷi values instead of Yi gives us the two stage
estimator ŶHT T S , which is still unbiased for Y so long as Ŷi is unbiased for Yi .
N
X Zi Ŷi
ŶHT T S =
i=1
πi
46
Obviously the second term is the Horvitz Thompson variance and the first term
is the contribution from the second stage subsampling. So this is
N Var Ŷ
X i
Var ŶHT +
i=1
πi
Now for estimating this slightly different variance. Well, our starting point is
the Horvitz Thompson estimator of the variance,
N 2
\ X Yi XX Yi Yj (πij − πi πj )
Var ŶHT = Zi (1 − πi ) + Zi Zj
i=1
π i πi πj πij
i6=j
where
1 − πi
ai =
πi2
πij − πi πj
aij =
πi πj πij
Our hope is that Q should estimate Var (YHT ), but this doesn’t work out exactly.
Ŷi and Ŷj are both independent by assumption, so there is no problem there,
but
h i
E Ŷi2 = Var Ŷi + Yi2 6= Yi2
and so when we look at E [Q] we do not get out Var ŶHT , instead we get
E [Q] = E [E [ Q| Z1 . . . ZN ]]
XN X X
= E Zi ai Yi2 + Var Ŷi + aij Zi Zj Yi Yj
i=1 i6=j
N
X X X
= πi a i Yi2 + Var Ŷi + aij πij Yi Yj
i=1 i6=j
N 2
X Yi X X Yi Yj
6= πi (1 − πi ) + (πij − πi πj ) = Var ŶHT
i=1
πi πi πj
i6=j
47
In fact we have
XN
E [Q] = Var ŶHT + πi ai Var Ŷi
i=1
N
X 1 − πi
= Var ŶHT + Var Ŷi
i=1
πi
So in
fact we just have to alter Q a little bit to get an unbiased estimator for
Var ŶHT T S .
N
X
Var ŶHT T S = E [Q] + Var Ŷi
i=1
Of course we could always have taken Q to be the Yates and Grundy variance
estimate with Yi swapped for Ŷi . If
aij = πij − πi πj
this gives
!2
1 X X Ŷi Ŷj Z Z
i j
E [Q] = E E − − aij Z1 . . . Zn
2 πi πj πij
i6=j
!2
1 XX Ŷi Ŷj Z
Z1 . . . Zn aij i j
Z
= E − E −
2 πi πj πij
i6=j
2 2
1 X X Var Ŷi + Y i Var Ŷj + Y j Yi Yj Zi Zj
= E − + − aij
2 πi2 πj2 πi πj πij
i6=j
1 X X
Yi Yj
2
Zi Zj 1 X X Var Ŷi Var Ŷj
= E − − aij − + aij
2 πi πj πij 2 πi2 πj2
i6=j i6=j
1 XX Var Ŷi XX Var Ŷj
= Var ŶHT − aij + aij
2 πi2 πj2
i6=j i6=j
XX Var Ŷi
= Var ŶHT − aij
πi2
i6=j
48
N Var Ŷ
X i X
= Var ŶHT − (πij − πi πj )
πi2
i=1 j6=i
N Var Ŷ
X i X
= Var ŶHT − (n − 1)πi − πi πj
i=1
πi2
j6=i
N Var Ŷ
X i
= Var ŶHT − 2 ((n − 1)πi − πi (n − πi ))
i=1
πi
N Var Ŷ
X i
πi2 − πi
= Var ŶHT − 2
i=1
πi
N Var Ŷ
X i
= Var ŶHT + (1 − πi )
i=1
π i
Example 4.2.1. A sample of size 3 is taken from the collection of all cities
in WA, with inclusion probability proportional to size. For each of these three
cities a sample of households is taken in some appropriate but unspecified way.
The estimates obtained for city mean income and variance are
Cities 1 2 3
estimated mean income, ȳi 500 340 300
size 100 400 500
πi 0.03 0.12 0.15
\
Var ȳˆi 20 24 18
This implies that there are 10, 000 households in the whole country. The joint
inclusion probabilities are
π12 = 0.0032
π13 = 0.0038
π23 = 0.0166
These are given as part of the question, and can’t be calculated from the data
we have. Converting to totals, we have
1 2 3
2 2
\
Var (ŷi ) 20 ∗ 100 24 ∗ 400 18 ∗ 5002
ŷi 50,000 136,000 150,000
49
So we have a ŶHT T S value of
50, 000 136, 000 150, 000
+ + = 3, 800, 000
0.03 0.12 0.15
For variance estimation we use the Yates and Grundy formula for Q, which gives
2
50, 000 136, 000 0.03 ∗ 0.12 − 0.0032
−
0.03 0.12 0.0032
2
50, 000 150, 000 0.03 ∗ 0.15 − 0.0038
+ −
0.03 0.15 0.0038
2
136, 000 150, 000 0.12 ∗ 0.15 − 0.0166
+ −
0.12 0.15 0.0166
= 1.18926231 × 1011
We then add
20 ∗ 1002 24 ∗ 4002 18 ∗ 5002
+ +
0.03 0.12 0.15
to Q to get our variance estimate.
One method often adopted is to just pick a large sample of size n from the
original population of size N , called the first phase sample, and only obtain
the value of the auxiliary variable for this large population. We then treat this
as the whole population, and apply some suitable sampling technique to take a
subsample and estimate the total of the characteristic of interest over the first
phase sample. This gives an unbiased estimate of the characteristic total over
the whole population, provided we choose an estimator that gives an unbiased
estimate of the total over the first-phase population. If the first phase sample
is large then the additional variance from carrying the estimate from the first
phase sample to the whole population is small.
Once we have our first phase sample we can apply whatever sampling technique
we want. One choice would be to use stratification. So based on the knowledge
50
of the auxiliary variable for the first phase sample we stratify the first phase
sample into k strata, where the ith strata consists of ni units. We use some
allocation method, and end up with the allocation of mi units to the ith strata.
Now, let ȳ denote the average of the characteristic over the first phase sample
and ȳi is the average over the whole ith strata from the first phase population.
So we now treat ȳ as a population characteristic and want to estimate it. So
using stratification we have
k
X
ȳˆ = wi ȳˆi
i=1
where wi = nni is the proportion of the first phase sample that lies in the ith
strata, and is also random. Its expectation is Wi = NN . In fact this first phase
i
sample mean also estimates the original population mean, so Ȳˆ = ȳˆ. Obviously
this estimator is still unbiased for Ȳ . So if δi is the random variable denoting
the inclusion of unit i in the first phase sample then
E ȳˆ = E E ȳˆ δ1 . . . δN
= E [ȳ] = Ȳ
This follows simply because the stratified estimator is unbiased for first phase
population mean, and the first phase population mean is unbiased for the whole
population mean as it is obtained via a simple random sample. Converting to
an estimator of the whole population mean increases the variance, so if Si2 is
the variance of the ith strata units from the original population,
Var ȳˆ n1 . . . nk
!
X k
= Var wi ȳˆi n1 . . . nk
i=1
k
X
wi2 Var ȳˆi n1 . . . nk
=
i=1
k
X
wi2 E Var ȳˆi δ1 . . . δN n1 . . . nk + Var E ȳˆi δ1 . . . δN n1 . . . nk
=
i=1
k
X 1 1
wi2 E s2i n1 . . . nk + Var ( ȳi | n1 . . . nk )
= −
i=1
mi ni
k
X 1 1 1 1
= wi2 − Si2 + − Si2
i=1
m i n i n i N i
k
X
2 1 1
= wi − Si2
i=1
m i N i
Note that we assumed that mi was constant, which is clearly not the case as
it is bounded above by ni , which is random. But if we assume that ni > mi
51
with probability approximately 1, then this assumption makes sense. Finally to
derive the unconditional variance. We also need some more information about
wi .
E [wi ] = Wi
1 1 N Wi (1 − Wi )
Var (wi ) = −
n N N −1
2 2
E wi = Var (wi ) + E wi
1 1 N Wi (1 − Wi )
= − + Wi2
n N N −1
N − n Ni Nj 1 1 N
Cov (wi , wj ) = − =− − Wi Wj
N − 1 nN 2 n N N −1
Then going back to the variance,
Var ȳˆ = E Var ȳˆ n1 . . . nk + Var E ȳˆ n1 . . . nk
k k
!
X 2 1 1 2
X
= E wi − Si + Var wi Ȳi
i=1
mi Ni i=1
k X k
X 1 1 XX
E wi2 Si2 + Var (wi ) Ȳi2 +
= − Ȳi Ȳj Cov (wi , wj )
i=1
mi Ni i=1 i6=j
k
1 1 N X 1 1
= Var (ȳst ) + − Wi (1 − Wi ) − Si2
n N N −1 i=1
mi Ni
k
X XX
+ Var (wi ) Ȳi2 + Ȳi Ȳj Cov (wi , wj )
i=1 i6=j
k
1 1 N X 1 1
= Var (ȳst ) + − Wi (1 − Wi ) − Si2
n N N − 1 i=1 mi Ni
k
1 1N X
+ − Wi (1 − Wi )Ȳi2
n N N − 1 i=1
1 1 N XX
− − Wi Wj Ȳi Ȳj
n N N −1
i6=j
k
1 1 N X 1 1
= Var (ȳst ) + − Wi (1 − Wi ) − Si2 + Wi Ȳi2
n N N −1 i=1
mi Ni
k
X XX
− Wi2 Ȳi2 − Wi Wj Ȳi Ȳj
i=1 i6=j
k
1 1 N X 1 1
= Var (ȳst ) + − Wi (1 − Wi ) − Si2 + Wi Ȳi2
n N N −1 i=1
mi Ni
52
k X
X k
− Wi Wj Ȳi Ȳj
i=1 j=1
k
1 1 N X 1 1
= Var (ȳst ) + − Wi (1 − Wi ) − Si2 + Wi Ȳi2
n N N −1 i=1
mi Ni
k
X k
X
− Ȳi Wi Wj Ȳj
i=1 j=1
k !
1 1 N X 1 1 2 2 2
= Var (ȳst ) + − Wi (1 − Wi ) − Si + Wi Ȳi − Ȳ
n N i=1
N −1 mi Ni
k
1 1 N X 1 1 2
2
= Var (ȳst ) + − Wi (1 − Wi ) − Si + Wi Ȳi − Ȳ
n N N − 1 i=1 mi Ni
Example 4.3.1. Say that we are dealing with a population of size 10, 000, and
we want to determine average income. It is found that the location where a
person lives is relevant to determining their income but this cannot be collected
across the whole population, probably due to resource constraints. So 1000
people are selected and divided into three strata according to where they live -
Wealthy regions, medium wealthy, and poor. We find that approximately 10%
of people live in wealthy areas, 30% live in medium wealthy areas and 60%
live in poor areas. We then select 100 of these 1000 people and measure their
income. The data is
and so Ȳˆ is also 16. As for the variance estimate of this estimator,
\ 100 99 8 300 299 2 600 599 1
Var Ȳˆ = + +
1000999 25 1000 999 40 1000 999 20
1 100 2 300 2 600 2
+ (40 − 16) + (20 − 16) + (10 − 16)
999 1000 1000 1000
53
As an alternative to double sampling with stratification, we can also do double
sampling with ratio estimation. Again we pick a large first-phase sample of size
n and get the value of the auxiliary characteristic, and then we pick a subsample
of size m on which to obtain the value of Y . Our estimator is then
ȳˆ
ȳˆratio = Ȳˆ = x̄
ˆ
x̄
We need slightly different notation, so say we have N units, a first phase sample
of size n is picked via simple random sampling and then a second phase sample
of size m is picked, again with simple random sampling. Now, define Zi to
be the indicator random variables denoting inclusion in the first phase sample.
Taking the expansion we originally used in the ratio estimation section leads to
a fairly horrible mess here, so instead we expand around every variable, giving
Ȳ X̄ X̄ Ȳ Ȳ X̄
ȳˆratio ∼ + ȳˆ − Ȳ + x̄ − X̄ − 2 x̄ ˆ − X̄
X̄ X̄ X̄ X̄
Note that this expansion says that the the estimator is roughly unbiased, so its
not that good. But looking at the mean square error, we have
" #
Ȳ Ȳ 2
MSE ȳˆratio = E ȳˆ − Ȳ + x̄ − X̄ − ˆ − X̄
x̄
X̄ X̄
" #
Ȳ 2
=E ˆ
ȳ − Ȳ + x̄ − x̄ ˆ
X̄
Ȳ
ˆ
= Var ȳ + ˆ
x̄ − x̄
X̄
Ȳ Ȳ
= Var E ȳ + ˆ ˆ
x̄ − x̄ Z1 . . . ZN
ˆ
+ E Var ȳ + ˆ
x̄ − x̄ Z1 . . . ZN
X̄ X̄
Ȳ
= Var (ȳ) + E Var ȳˆ − x̄ ˆ Z1 . . . ZN
X̄
1 1 2 1 1 2
= − SY + E − s Ȳ
n N m n y− X̄ x
h
1 1 1 1 i
= − S2 + − E s2y− Ȳ x
n N m n X̄
1 1 1 1
= − S2 + − 2
Sy− Ȳ
x
n N m n X̄
2
where Sy− Ȳ
x
is a population value and s2y− Ȳ x is the same value, but over the
X̄ X̄
first phase sample. We can also apply two-stage sampling to the non-response
problem. We do this by taking a sub-sample of the non-responders, and using
more resources or trying harder than we originally did, to get values from these
non-responders. Finally, another application of two-stage sampling is to perform
probability proportional to size sampling where the size variable is unknown.
54
5 Non-response
So far we have assumed that if i ∈ S then we can determine Yi , but this is not
always true, and is a very serious problem with mail and telephone surveys in
particular. Non-response problems even occur with many censuses. The point
is that those who respond to the survey may be very different from those who
do not, and this introduces a bias into the results. As an example, assume that
we are trying to measure the effect of a new measure or law on pharmacies.
That is, we want to know the dollar value of the loss they have incurred as a
result of the new measure.
Assume that we can categorize pharmacies into two sorts, large and small, and
that large pharmacies lose on average 10, 000 and small pharmacies lose on
average 3, 000. As 20% of pharmacies are large and 80% are small, we have the
population value
Of course we don’t know this value and want to estimate it, and so we use
a mail survey. We send this survey to all pharmacies, but the response rates
turn out to differ across small and large pharmacies. Large pharmacies may
employ people to deal with this sort of query, so assume that their response
rate for our survey is 90%. On the other hand, smaller pharmacies may not
have anyone to deal with this sort of thing, so their response rate is 40%. This
means that we end up with 0.2 × 0.9 = 18% of our surveys being returned by
large pharmacies, 0.8 × 0.4 = 32% being returned by small pharmacies, and
50% are not returned. The response rate is 50%, and so we have 36% of our
results coming from large pharmacies and 64% coming from small pharmacies.
Conditional on these return rates, if we ignore the non-response problem our
estimator is basically going to be
where ȳs is the average for 40% of the sampled small pharmacies and ȳl is the
average for 90% of the large pharmacies sampled. This gives an expected value
of
0.36 × 10, 000 + 0.64 × 3, 000 = 5, 520
for a 38% error.
55
things we can do to reduce non-response are send a reminder call, and give
advance notice. Obviously as resources are limited we will sometimes have to
choose between these three and again, a pilot study might help identify which
is most effective. Apparently it is found experimentally that the reminder call
is most effective, giving advance notice is the next most effective, and including
a stamped envelope is the least effective.
Assume that for every unit in the population there are certain factors affecting
non-response. So we can assign a probability of non-response φi to every unit
i, and more importantly we can hope to estimate this quantity. Also define Ui
to be the indicator random variable for non-response. That is, Ui is 1 if unit i
responds, and 0 otherwise.
X Yi N
X Zi Yi
=
i∈s
πi i=1
πi
That is, our new inclusion random variable is Ui Zi , and we have the problem
that part of the sample selection mechanism is now determined by some exter-
nal randomness. Obviously to use our new estimator we are going to need to
estimate φi , probably by using some sort of external information. For example,
age might be an important determinant of non-response, in that younger people
tend to be busy and therefore will not respond, but older people are not, and
so they will tend to respond more often.
Example 5.1.1. Say we divide the population into three groups, young, middle
aged and old, denoted by Y , M and O. Then we perform a survey, and we find
that 30% of the younger group responds, similarly 25% of the middle-aged group
and 50% of the older group. So we estimate that Ui = 0.3 for any unit in the
young group, Ui = 0.23 for the middle aged group and Ui = 0.5 for the older
group. In the same survey say that we had n = 100, and of our sampled units
we found that 40 were in the young group, 30 in the middle aged group and 30
in the old group. So our total data is
Y M O
Sampled 40 30 30
Responded 12 7 15
56
Obviously our value of Ŷ , if we could compute it, would be
3
N X X
Ŷ = Yi = N Wi ȳi
100 i∈s i=1
Recall that if we use proportional allocation then the stratified estimator is the
same as the whole-sample average. So we will require W1 = 0.4, W2 = 0.3, W3 =
0.3, which is probably approximately accurate. But as some of the y-values are
unknown we instead use
ˆ
Ŷ = N 0.4ȳ1(r) + 0.3ȳ2(r) + 0.3ȳ3(r)
where ȳi(r) is the sample average of the units in the ith strata which actually
ˆ
responded. So Ŷ looks like a stratified estimator, specifically post-stratification
as the number of units selected from each strata is random, although we have
to assume that wi ∼ Wi .
Continuing, assume for the moment that the non-response is not deliberate.
That is, units are not responding simply because they are busy, can’t be both-
ered, etc. The alternative is that non-response is because the respondents fear
the consequences of answering the question accurately. For example, the ques-
tion ‘Are you a drug user’ is an example of such a question. These questions
are termed ‘sensitive questions’.
If we are using the horvitz thompson estimator then we will have ai = π1i . Due
to non-response not all values of Yi will be obtained. But if we know φi we can
use the alternative estimator
ˆ X Wi
Ŷ = ai Yi
i∈s
φi
57
Example 5.1.2. Say that we attempt to survey 100 units, but only 80 respond.
So we stratify the 100 units into 3 different ages, and end up with
1 2 3
Strata < 25 25 - 45 45+
Sampled 20 50 30
Responded 12 40 28
Say we have some ‘sensitive question’ which we feel respondents will not be
willing to answer because of the consequences of doing so. But assume that if
the respondent is convinced that his answer will not be identifiable, then the
respondent will be willing to answer the question fully. Here we deal with one
particular design of an experiment which involves asking a sensitive question,
apparently due to Warner. First to illustrate this design by example. Say we
have 100 sheets of paper, these sheets of paper are randomly assigned to respon-
dents with replacement, and we are interested in estimating the proportion
p of the population who are drug users. 30 of these sheets of paper instruct the
respondent to answer ‘yes’, 20 instruct the respondent to answer ‘no’, and the
remaining 50 instruct the respondent to answer the question truthfully.
Now to go back and make this more rigorous. Assume that every unit has a
chance π of being instructed to answer the question correctly, and otherwise
are instructed on how to answer the question, which happens with probability
(1 − π). Those who are instructed how to answer are instructed with probability
γ to answer yes. Now, let Yi be the indicator random variable which is 1 if the
person possesses the characteristic of interest. In our case, Yi is 1 if and only if
unit i is a drug user. Another random variable Zi is defined only over the units
which we select to survey, and is 1 if the person actually answers yes.
The event that some respondent is instructed to answer the question correctly
is independent of the event that some other respondent is instructed to answer
correctly. So we have
P (Zi = 1) = γ(1 − π) + pπ
58
and so
E [z̄] = γ(1 − π) + pπ
z̄ − γ(1 − π)
p̂ =
π
1
Var (p̂) = 2 Var (z̄)
π
\ 1 \
Var (p̂) = 2 Var (z̄)
π
We can extend Warners idea to situations where the Yi can take arbitrary values,
and doesn’t have to be simply a yes/no answer. Say that Yi has k possible values,
denoted by X1 . . . Xk . Then we ask the respondent to answer truthfully with
probability π, and otherwise we ask a proportion γ1 of the respondents
Pk to give
answer X1 , γ2 to give answer X2 , etc. Obviously we require i=1 γi = 1, which
means that the distribution of Zi is
Value    Probability
X1       (1 − π)γ1
X2       (1 − π)γ2
...      ...
Xk       (1 − π)γk
Yi       π
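By the same reasoning as in the yes/no case, $E[\bar z] = (1 - \pi)\sum_{i=1}^{k}\gamma_i X_i + \pi\bar Y$,
so presumably one would estimate the population mean by
$$\hat{\bar Y} = \frac{\bar z - (1 - \pi)\sum_{i=1}^{k}\gamma_i X_i}{\pi}$$
which is unbiased by the same argument used for $\hat p$ above.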
On the other hand, what if instead of assigning the questions with replacement
we instead assign them without replacement? This means that instead of seeing
people individually and giving each person a randomly selected card or question,
we instead see n people at once and distribute n cards among them. In this
case we will want πn to be an integer, and
$$P(Z_i = 1 \mid \delta_1 \ldots \delta_N) = \gamma(1 - \pi) + p'\pi$$
where $\delta_1 \ldots \delta_N$ are the inclusion random variables for the units,
and $p'$ is the proportion of people in the sample who are drug users. So
$$P(Z_i = 1) = \gamma(1 - \pi) + p\pi$$
sense, so there are still things we can do. Let $p'$ denote the proportion of people
in a specified sample of size n who have the characteristic $Y_i$. That is, $p'$ is
random because the sample of people we observe is random. Then
$$E[\bar z] = E[E[\bar z \mid \delta_1 \ldots \delta_N]] = E[\gamma(1 - \pi) + p'\pi] = \gamma(1 - \pi) + p\pi$$
and so
$$\hat p = \frac{\bar z - \gamma(1 - \pi)}{\pi}$$
is unbiased for p. When it comes to the variance,
$$\operatorname{Var}(\hat p) = \operatorname{Var}\left(\frac{\bar z - \gamma(1 - \pi)}{\pi}\right) = \operatorname{Var}\left(\frac{\bar z}{\pi}\right) = \frac{1}{\pi^2}\operatorname{Var}(\bar z)$$
$$= \frac{1}{\pi^2}\Big(\operatorname{Var}\big(E[\bar z \mid \delta_1 \ldots \delta_N]\big) + E\big[\operatorname{Var}(\bar z \mid \delta_1 \ldots \delta_N)\big]\Big)$$
$$= \frac{1}{\pi^2}\Big(\operatorname{Var}\big(\gamma(1 - \pi) + p'\pi\big) + E\big[\operatorname{Var}(\bar z \mid \delta_1 \ldots \delta_N)\big]\Big)$$
$$= \operatorname{Var}(p') + \frac{1}{\pi^2}E\big[\operatorname{Var}(\bar z \mid \delta_1 \ldots \delta_N)\big]$$
Conditional on $\delta_1 \ldots \delta_N$ we can write
$$\bar z = \frac{(1 - \pi)\gamma n + s}{n}$$
where s is the number of people asked to answer truthfully who answer yes. So,
continuing,
$$= \operatorname{Var}(p') + \frac{1}{\pi^2}E\left[\operatorname{Var}\left(\left.\frac{(1 - \pi)\gamma n + s}{n}\,\right|\,\delta_1 \ldots \delta_N\right)\right]$$
$$= \operatorname{Var}(p') + \frac{1}{\pi^2}E\left[\operatorname{Var}\left(\left.\frac{s}{n}\,\right|\,\delta_1 \ldots \delta_N\right)\right]$$
$$= \operatorname{Var}(p') + \frac{\pi^2}{\pi^2}E\left[\operatorname{Var}\left(\left.\frac{s}{\pi n}\,\right|\,\delta_1 \ldots \delta_N\right)\right]
= \operatorname{Var}(p') + E\left[\operatorname{Var}\left(\left.\frac{s}{\pi n}\,\right|\,\delta_1 \ldots \delta_N\right)\right]$$
Conditionally on the sample, $s/(\pi n)$ is the mean of a simple random sample of
size $\pi n$ drawn without replacement from the n selected people, so the second term
is essentially the second stage of a sort of two-stage sample, and it becomes
$$= \operatorname{Var}(p') + E\left[\left(\frac{1}{\pi n} - \frac{1}{n}\right)\frac{n(1 - p')p'}{n - 1}\right]$$
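A simulation sketch of the without-replacement version, as a rough sanity check on
the formula just derived; the population size N, the sample size n, and p, π, γ below
are assumed values, chosen so that πn and γ(1 − π)n are whole numbers.

import numpy as np

rng = np.random.default_rng(1)
N, n, p, pi_, gamma = 2000, 100, 0.2, 0.5, 0.6
population = np.zeros(N, dtype=bool)
population[:round(N * p)] = True                     # the Y-values in the population

# the deck of n cards: pi*n "truthful", gamma*(1-pi)*n "say yes", the rest "say no"
n_truth = round(pi_ * n)
n_yes = round(gamma * (1 - pi_) * n)
deck = np.array([2] * n_truth + [1] * n_yes + [0] * (n - n_truth - n_yes))

reps = 20000
p_hats = np.empty(reps)
for r in range(reps):
    sample = rng.choice(N, size=n, replace=False)    # the delta_i
    rng.shuffle(deck)                                # deal the cards without replacement
    yes = np.where(deck == 2, population[sample], deck == 1)
    p_hats[r] = (yes.mean() - gamma * (1 - pi_)) / pi_

# theoretical value: Var(p') + E[(1/(pi n) - 1/n) n p'(1 - p') / (n - 1)]
S2 = N * p * (1 - p) / (N - 1)
var_p_prime = (1 / n - 1 / N) * S2
e_p_one_minus_p = p * (1 - p) - var_p_prime          # E[p'(1 - p')]
second = (1 / (pi_ * n) - 1 / n) * n * e_p_one_minus_p / (n - 1)
print(p_hats.var(), var_p_prime + second)            # these should be close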
6 Variance Estimation
6.1 Half-samples
$$\frac{\hat\sigma_1^2 + \hat\sigma_2^2}{2}$$
will almost certainly be better than $\frac{1}{2}(T_1 - T_2)^2$. But it requires us to do the
variance estimation directly, which we are avoiding. Note that the variance we
have estimated is not the same as the variance of the estimate from a single
sample of size n; it is the variance of the estimate from half-samples.
Proof. It should be obvious that $\frac{1}{2}\left(\hat{\bar Y}_1 - \hat{\bar Y}_2\right)^2$ estimates the variance of $\hat{\bar Y}_1$. This
variance is
$$\sum_{i=1}^{k} W_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right)S_i^2 \sim \sum_{i=1}^{k}\frac{W_i^2 S_i^2}{n_i}$$
So $\frac{\left(\hat{\bar y}_1 - \hat{\bar y}_2\right)^2}{4}$ estimates
$$\sum_{i=1}^{k}\frac{W_i^2 S_i^2}{2n_i} \sim \sum_{i=1}^{k} W_i^2\left(\frac{1}{2n_i} - \frac{1}{N_i}\right)S_i^2 = \operatorname{Var}\left(\hat{\bar y}_3\right)$$
Example 6.1.3. Say we are doing stratified sampling with 3 strata, and we
decide to take k = 2 from above. That is, we apply the same sampling scheme
twice, with replacement between the two applications. Our sampling scheme is
to pick 2 units from stratum 1, 2 from stratum 2 and 1 from stratum 3. The
stratum weights are $W_1 = 0.5$, $W_2 = 0.4$, $W_3 = 0.1$, and obviously the
within-stratum sampling is without replacement every time. Say that our
observed data is

            Sample 1           Sample 2
Stratum 1   7 (20), 12 (14)    5 (8), 15 (12)
Stratum 2   9 (32), 19 (21)    4 (29), 6 (26)
Stratum 3   2 (82)             8 (45)

where each entry gives the unit selected, with its observed y-value in brackets.
Then our two estimators from our two samples are the usual stratified estimates
$\hat{\bar y}_j = \sum_h W_h \bar y_{h(j)}$, where $\bar y_{h(j)}$ is the mean of the y-values observed in stratum h
of sample j.
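A minimal sketch of this arithmetic (the stratified form of each replicate estimate
is assumed exactly as stated above):

import numpy as np

W = np.array([0.5, 0.4, 0.1])
sample1 = [np.array([20.0, 14.0]), np.array([32.0, 21.0]), np.array([82.0])]
sample2 = [np.array([ 8.0, 12.0]), np.array([29.0, 26.0]), np.array([45.0])]

t1 = sum(w * s.mean() for w, s in zip(W, sample1))   # stratified estimate, replicate 1
t2 = sum(w * s.mean() for w, s in zip(W, sample2))   # stratified estimate, replicate 2
print(t1, t2)
print((t1 - t2) ** 2 / 2)   # estimates the variance of a single replicate
print((t1 - t2) ** 2 / 4)   # estimates the variance of the combined estimator (t1 + t2) / 2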
out replacement samples of size n/2, whose means are denoted by $\bar y_1$, $\bar y_2$. Then
$$\bar y = \frac{\bar y_1 + \bar y_2}{2}$$
$$\operatorname{Var}(\bar y) = \frac{\operatorname{Var}(\bar y_1) + \operatorname{Var}(\bar y_2)}{4} = \frac{\operatorname{Var}(\bar y_1)}{2}
= \frac{1}{2}\left(\frac{1}{n/2} - \frac{1}{N}\right)S^2 = \left(\frac{2}{n} - \frac{1}{N}\right)\frac{S^2}{2}$$
Note that ȳ does not come from a simple random sample without replacement
of size n. Now, if N is much greater than n we will have $\frac{1}{N} \simeq 0$, which means
that
$$\operatorname{Var}(\bar y) \simeq \frac{S^2}{n}$$
and we usually estimate this by $\frac{s^2}{n}$. Now, how else can we arrive at this value?
Well, say we take a sample of size n, and split it into two samples $t_1$, $t_2$ of
size n/2, with the average of both denoted by $\bar t$. Note that as we selected a
single sample of size n we don't have to worry about the same unit appearing
twice, which might happen if we took two samples of size n/2 and then combined
them; that could be inconsistent with our original sampling scheme.
Anyway, we can use this split into $t_1$, $t_2$ to construct the estimate of the variance
of $t_1$,
$$\frac{1}{2}\left(\frac{(t_1 - \bar t)^2 + (t_2 - \bar t)^2}{1}\right) = (t_1 - \bar t)^2 \tag{10}$$
But if we now consider our population to be the size n sample originally selected,
we have
$$E\left[(t_1 - \bar t)^2\right] = \operatorname{Var}(t_1) = \left(\frac{2}{n} - \frac{1}{n}\right)s^2 = \frac{s^2}{n}$$
And if we average over all possible half-samples we end up actually computing
this expectation, as
$$E\left[(t_1 - \bar t)^2\right] = \frac{(t_1 - \bar t)^2 + \cdots + (t_k - \bar t)^2}{k} = \frac{s^2}{n}$$
where $k = \binom{n}{n/2}$. So in this case, taking the variance estimates from all the
possible half-samples and averaging them gives us pretty much the same variance
estimate as we normally use for ȳ defined as the average of estimators from
simple random samples. Importantly, $\frac{S^2}{n}$ is also an approximation to the variance
of $\hat{\bar y}$ if this comes from a simple random sample without replacement of size n.
So the estimate we constructed also approximately estimates $\operatorname{Var}(\hat{\bar y})$. Again,
notice that in the second approach, instead of combining two samples of size n/2,
we split one sample of size n into two parts of size n/2.
The last example was a bit unclear, so hopefully the next one makes clear the
difference between combining two independent samples, and splitting in half a
single sample.
Proposition 6.1.2. Assume that we have k strata, and we take $n_1, n_2, \ldots, n_k$
units from the respective strata to form a first-phase sample, where all the $n_i$
are divisible by 2. Let $\hat{\bar y}$ be the stratified estimator using this sampling scheme,
and let $\hat{\bar y}_i$ denote the estimator from the ith half-sample of the selected units,
where i ranges from 1 to
$$j = \binom{n_1}{n_1/2}\binom{n_2}{n_2/2}\cdots\binom{n_k}{n_k/2}$$
Then
$$\operatorname{Var}\left(\hat{\bar y}\right) \simeq \frac{1}{j}\sum_{i=1}^{j}\left(\hat{\bar y}_i - \hat{\bar y}\right)^2$$
Proof. Well, let $\hat{\bar y}'$ denote the estimator from a randomly chosen half-sample of
the first-phase sample. Then
$$E\left[\left(\hat{\bar y}' - \hat{\bar y}\right)^2\right] = E\left[E\left[\left.\left(\hat{\bar y}' - \hat{\bar y}\right)^2 \,\right|\, \text{first-phase sample}\right]\right]$$
$$= E\left[\sum_{i=1}^{k} W_i^2\left(\frac{1}{n_i/2} - \frac{1}{n_i}\right)s_i^2\right]$$
$$= E\left[\sum_{i=1}^{k}\frac{W_i^2 s_i^2}{n_i}\right] = \sum_{i=1}^{k}\frac{W_i^2 S_i^2}{n_i} \simeq \operatorname{Var}\left(\hat{\bar y}\right)$$
The expectation can of course be computed as the average over all possible outcomes,
so in this case over the j half-samples from the first-phase sample, which gives the
stated result.
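A small numerical check of the conditional identity used in the proof: given a
particular first-phase sample, the average of $(\hat{\bar y}_i - \hat{\bar y})^2$ over all j half-samples equals
$\sum_i W_i^2 s_i^2 / n_i$. The weights and y-values below are assumed, with $n_1 = 2$ and
$n_2 = 4$, so that $j = 2 \times 6 = 12$.

import numpy as np
from itertools import combinations, product

W = [0.6, 0.4]                                            # assumed stratum weights
samples = [np.array([3.0, 9.0]), np.array([4.0, 8.0, 2.0, 10.0])]

full = sum(w * s.mean() for w, s in zip(W, samples))      # the stratified estimator

halves = [list(combinations(range(len(s)), len(s) // 2)) for s in samples]
half_ests = [sum(w * s[list(idx)].mean() for w, s, idx in zip(W, samples, choice))
             for choice in product(*halves)]              # all j half-sample estimators
lhs = np.mean([(e - full) ** 2 for e in half_ests])

rhs = sum(w ** 2 * s.var(ddof=1) / len(s) for w, s in zip(W, samples))
print(lhs, rhs)                                           # agree up to rounding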
Example 6.1.5. Say that we take two samples, and the first time we observe
units (5, 25) which have y-values (4000, 5200), and the second time we observe
units (29, 6) which have y-values (8146, 2749). The whole pooled sample is
(5, 6, 25, 29), and so the possible splits are
First half    Second half
5, 6          25, 29
5, 25         6, 29
5, 29         6, 25
25, 29        5, 6
6, 29         5, 25
6, 25         5, 29
where $t_{1,i}$ refers to the estimate from the first half of the ith split. We
then average all these estimates to get a total estimate that happens to coincide
with the usual variance estimate. Note that we only use the first part of every
split, in line with (10). Of course, we also have
$$\frac{1}{12}\sum_{i=1}^{6}\sum_{j=1}^{2}\left(t_{j,i} - \bar t\right)^2
= \frac{1}{6}\sum_{i=1}^{6}\frac{1}{2}\left[\left(t_{1,i} - \bar t\right)^2 + \left(t_{2,i} - \bar t\right)^2\right]
= \frac{1}{6}\sum_{i=1}^{6}\left(t_{1,i} - \bar t\right)^2$$
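A sketch of this averaging for the data of Example 6.1.5, assuming each half-sample
estimate is simply the mean of its two y-values; treating the pooled sample as the
population, the average reproduces the usual estimate $s^2/n$.

import numpy as np
from itertools import combinations

y = {5: 4000.0, 25: 5200.0, 29: 8146.0, 6: 2749.0}        # unit: y-value
values = np.array(list(y.values()))
t_bar = values.mean()                                     # the whole-sample estimate

halves = list(combinations(y, 2))                         # the 6 possible first halves
half_ests = [np.mean([y[u] for u in half]) for half in halves]
avg = np.mean([(t - t_bar) ** 2 for t in half_ests])

s2 = values.var(ddof=1)                                   # sample variance, divisor n - 1
print(avg, s2 / len(values))                              # these coincide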
Later we will want to allow the estimate to be non-linear, in which case the
proper way to look at this variance estimate is
$$\frac{1}{12}\sum_{i=1}^{6}\sum_{j=1}^{2}\left(t_{j,i} - \bar t\right)^2$$
as we can’t use the above trick we used for linear estimators. So really, we
take every half-sample and average the squared differences from the whole-
sample estimate. Note that when we apply this half-samples technique to biased
estimators, we actually end up estimating the mean square error.
So far we’ve looked at cases where this averaging of variances using half samples
gives us our usual estimator back again, which is rather boring. The more
interesting case is where the direct estimation of the variance is difficult or
computationally impossible. In this case, we hope that the easier procedure of
using split samples and averaging will give an estimate which is computationally
simpler than the direct estimate, and not much less efficient. The problem is
that a huge number of splits is required, specifically $\binom{n}{n/2}$. If possible we'd like
to reduce the number of splits required, and this is where balanced replication
comes in.
6.2 Balanced repetition
4 −1 −1 +1 +1
The extra column is needed because, as we said, a Hadamard matrix must have
dimensions n × n where n is a multiple of 4.
is also a Hadamard matrix. So Hadamard matrices of order 2n are trivial to
construct.
Now we go back to sample surveys. Take a Hadamard matrix H with the first
column and first row all 1's, then strip out the first row and use the remaining
rows to allocate units to half-samples. For example, if n = 4 then we take the
Hadamard matrix

Half-sample   Unit 1   Unit 2   Unit 3   Unit 4
                +1       +1       +1       +1
     1          +1       +1       −1       −1
     2          +1       −1       −1       +1
     3          +1       −1       +1       −1
The 4 columns denote the units. So the row marked 1 says that we take the half-
samples (1, 2) and (3, 4). The row marked 2 gives the half-samples (1, 4) and
(2, 3), and the last row gives (1, 3) and (2, 4). Note that as the first column is
always 1, the first unit is always included in the first half-sample. In the previous
section we showed that $\frac{(t_1 - t_2)^2}{4}$ was our variance estimate using the half-samples
$t_1$, $t_2$. So in this case our first row says that $t_1 = \frac{y_1 + y_2}{2}$ and $t_2 = \frac{y_3 + y_4}{2}$.
More generally we would take an n × n Hadamard matrix with first row and
column always 1, ignore the first row, and use the remaining (n − 1) rows to
allocate units to half-samples. We then use only the half-samples suggested by
the Hadamard matrix to calculate our variance estimate. So our three variance
estimates are
$$\frac{\left(\frac{y_1 + y_2}{2} - \frac{y_3 + y_4}{2}\right)^2}{4} = \frac{(y_1 + y_2 - y_3 - y_4)^2}{4^2}$$
$$\frac{\left(\frac{y_1 + y_4}{2} - \frac{y_2 + y_3}{2}\right)^2}{4} = \frac{(y_1 + y_4 - y_2 - y_3)^2}{4^2}$$
$$\frac{\left(\frac{y_1 + y_3}{2} - \frac{y_2 + y_4}{2}\right)^2}{4} = \frac{(y_1 + y_3 - y_2 - y_4)^2}{4^2}$$
Summing these and dividing by 4 not 3, we get
$$\frac{1}{4^3}\left(3\left(y_1^2 + y_2^2 + y_3^2 + y_4^2\right) - \left(2y_1y_2 + 2y_1y_3 + \cdots\right)\right)$$
$$= \frac{1}{4^3}\left(3\sum_{i=1}^{4} y_i^2 - 2\sum_{i<j} y_i y_j\right)$$
$$= \frac{1}{4^3}\left(4\sum_{i=1}^{4} y_i^2 - \sum_{i=1}^{4} y_i^2 - \sum_{i \neq j} y_i y_j\right)$$
$$= \frac{1}{4^3}\left(4\sum_{i=1}^{4} y_i^2 - \sum_{i,j} y_i y_j\right)$$
$$= \frac{1}{4^3}\left(4\sum_{i=1}^{4} y_i^2 - \left(\sum_{i=1}^{4} y_i\right)^2\right)$$
$$= \frac{1}{4^3}\left(4\sum_{i=1}^{4} y_i^2 - 4 \times 4 \times \bar y^2\right)$$
$$= \frac{1}{4}\left(\frac{1}{4}\sum_{i=1}^{4} y_i^2 - \bar y^2\right)$$
$$= \frac{s^2}{4}$$
where here $s^2$ denotes $\frac{1}{4}\sum_{i=1}^{4}(y_i - \bar y)^2$, the variance of the four observations
with divisor n rather than n − 1.
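A sketch of the whole procedure in code: build a Hadamard matrix by the doubling
construction, drop the all-ones row, and reproduce the calculation just carried out.
The y-values are assumed, the particular matrix produced differs from the one
displayed earlier only by a relabelling of units, and np.var uses the divisor n,
matching the $s^2$ of the last line.

import numpy as np

def hadamard(n):
    # Sylvester doubling: H_{2m} = [[H, H], [H, -H]], starting from H_1 = (1)
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

y = np.array([3.0, 7.0, 1.0, 9.0])          # assumed y-values for the 4 units
splits = hadamard(4)[1:]                    # drop the all-ones row; 3 balanced splits

total = 0.0
for row in splits:
    t1 = y[row == 1].mean()                 # half-sample containing unit 1
    t2 = y[row == -1].mean()
    total += (t1 - t2) ** 2 / 4             # the variance estimate from this split

print(total / 4, np.var(y) / 4)             # summing and dividing by 4, as above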
But things become much more complicated when the estimate is non-linear, and
it’s best to just start over. The theoretical underpinnings are also, apparently,
not that strong. So this time assume we have k balanced half-samples, the
estimated value from half-sample i is θ̂i and the estimated value from the whole
sample is θ̂. Then our variance estimator will be
$$\frac{1}{k}\sum_{i=1}^{k}\left(\hat\theta_i - \hat\theta\right)^2$$
which has a nice symmetry with our previous estimators, although this time we
can’t take a sum over the first ‘half’ of every pair of half-samples. We also don’t
in general have that
$$\frac{1}{k}\sum_{i=1}^{k}\hat\theta_i = \hat\theta$$
Example 6.2.2. Suppose we want to estimate the ratio $R = \frac{\bar Y}{\bar X}$. We have k
balanced half-samples, and the estimator from the ith half-sample is $\hat R_i = \frac{\hat Y_i}{\hat X_i}$.
We will clearly have
$$\frac{1}{k}\sum_{i=1}^{k}\hat R_i \neq \hat R$$
in general. If we now use the approximation
$$\hat R_i - \hat R \simeq \frac{\hat Y_i - \hat R\hat X_i}{\hat X}$$
then this variance becomes
$$\frac{1}{k}\sum_{i=1}^{k}\left(\frac{\hat Y_i - \hat R\hat X_i}{\hat X}\right)^2 = \frac{1}{k\hat X^2}\sum_{i=1}^{k}\left(\hat Y_i - \hat R\hat X_i\right)^2$$
$$= \frac{1}{k\hat X^2}\sum_{i=1}^{k}\left(\hat Y_i - \hat R\hat X_i - \hat Y + \hat Y\right)^2$$
$$= \frac{1}{k\hat X^2}\sum_{i=1}^{k}\left(\hat Y_i - \hat Y - \hat R\hat X_i + \hat R\hat X\right)^2$$
$$= \frac{1}{k\hat X^2}\sum_{i=1}^{k}\left(\hat Y_i - \hat Y - \hat R\left(\hat X_i - \hat X\right)\right)^2$$
where we have used $\hat Y = \hat R\hat X$.
This is just the usual estimate of the variance of a ratio estimator. So the
variance estimate from half-samples is the same as the standard estimate, to a
first approximation.
So, suppose we have k strata and denote by $(y_{i1}, y_{i2})$ the units observed from
the ith stratum. Let H be a Hadamard matrix of size $t \geq k$. Then the first k rows
of H represent the k strata and the t columns represent t pairs of half-samples.
If $h_{il}$ denotes the entry in the ith row and lth column, then $h_{il} = 1$ says that
$y_{i1}$ is going to be the unit picked from the ith stratum for the lth half-sample,
whereas $h_{il} = -1$ says that $y_{i2}$ will be picked. For example, assume that k = 3,
so that there are 3 strata, and we pick
$$H = \begin{pmatrix} +1 & +1 & +1 & +1 \\ +1 & +1 & -1 & -1 \\ +1 & -1 & +1 & -1 \\ +1 & -1 & -1 & +1 \end{pmatrix}$$
For each split, we get two estimates of the average over the whole population.
For instance, from the first split we get
$$t_1 = W_1 y_{11} + W_2 y_{21} + W_3 y_{31}, \qquad t_2 = W_1 y_{12} + W_2 y_{22} + W_3 y_{32}$$
where the $W_i$ are the proportional sizes of the strata. From previously, we have
that
$$\frac{(t_1 - \bar t)^2 + (t_2 - \bar t)^2}{2}$$
estimates the variance of the stratified estimator with twice as many units,
provided that $n_i$ is much smaller than $N_i$. Our four splits give us four such
estimates of the variance, which we then average to end up with a final variance
estimate. This estimate is for the variance of the 6-unit stratified estimator.
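Finally, a sketch of this balanced scheme for k = 3 strata using the matrix H above;
the stratum weights $W_i$ and the pairs $(y_{i1}, y_{i2})$ are assumed values, purely for
illustration.

import numpy as np

W = np.array([0.5, 0.3, 0.2])                      # assumed stratum weights
y = np.array([[10.0, 14.0],                        # (y_i1, y_i2) observed in stratum i
              [22.0, 18.0],
              [ 5.0,  9.0]])

H = np.array([[ 1,  1,  1,  1],
              [ 1,  1, -1, -1],
              [ 1, -1,  1, -1],
              [ 1, -1, -1,  1]])

t_bar = (W * y.mean(axis=1)).sum()                 # the 6-unit stratified estimate

est = 0.0
for l in range(4):                                 # the t = 4 pairs of half-samples
    first = np.where(H[:3, l] == 1, y[:, 0], y[:, 1])    # h_il = +1 picks y_i1
    second = np.where(H[:3, l] == 1, y[:, 1], y[:, 0])
    t1 = (W * first).sum()
    t2 = (W * second).sum()
    est += ((t1 - t_bar) ** 2 + (t2 - t_bar) ** 2) / 2

print(est / 4)                                     # the averaged variance estimate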