
Sample Surveys

Rohan, Vijayan

July 19, 2010

Contents

1 Introduction
  1.1 Simple random sampling
  1.2 Ratio estimation
  1.3 Superpopulation based and design based estimation

2 Stratified Sampling
  2.1 Motivation
  2.2 Stratified simple random sampling
  2.3 Choice of allocations to strata
  2.4 Comparison of allocation strategies
  2.5 Post-Stratification

3 Cluster Sampling
  3.1 Unbiased cluster sampling
  3.2 Systematic sampling
  3.3 Two stage cluster sampling

4 Sampling with unequal probabilities
  4.1 Probability proportional to size
  4.2 The Horvitz-Thompson estimator with two stage sampling
  4.3 Two phase sampling

5 Non-response
  5.1 Dealing with non-response
  5.2 Non-response for sensitive questions

6 Variance Estimation
  6.1 Half-samples
  6.2 Balanced repetition

1 Introduction

A population is a collection of N distinguishable individual objects, known as


units. We assume that we are interested in the average of some characteristic y
of these N units. That is, each unit has a fixed non-random value y associated
with it. So if we denote the N units by the numbers 1, 2, . . . , N then we have an
associated collection of characteristic values Y_1, Y_2, . . . , Y_N, which are non-random
and assumed to be obtainable at some cost. We are interested in determining

    Ȳ = (1/N) Σ_{i=1}^N Y_i

The most expensive method for determining Ȳ would be to determine the value
of every Y_i, known as a census. On the other hand, we could pick n < N units
for which we obtain values of y, called a sample.

1.1 Simple random sampling

Simple Random Sampling is the simplest method of selecting these n units.


We simply pick a random unit, pick another random unit from the remainder,
and repeat until we have selected n units. So we are just choosing objects at
random without replacement. The resulting random selection assigns equal
probability to each size-n subset {Y_{i_j}}_{j=1}^n of {Y_i}, and so every
element of the population has an equal probability of being selected. Every pair
of elements has the same chance of being picked as every other pair, similarly
every triple of elements and so on. Simple random sampling has the advantage
of being easy to analyse. It may also be unsuitable for providing data to answer
a specific question.
If {y_i} is our simple random sample then y_i = Y_j for some 1 ≤ j ≤ N, and the
obvious estimate of Ȳ is

    ȳ = (1/n) Σ_{i=1}^n y_i                                        (1)

This is in a sense the 'best' estimator of Ȳ. Although we can contrive specific
cases where there exists a better estimator than (1), those estimators will be
specific to those cases and will be useless in every other case.
Proposition 1.1.1. ȳ is unbiased for Ȳ, and

    E[ȳ] = Ȳ,    Var(ȳ) = (1/n − 1/N) S²

where

    S² = Σ_{i=1}^N (Y_i − Ȳ)² / (N − 1)

Proof. Ignoring the dependencies between the y_i we have that

    E[y_i] = Y_1/N + Y_2/N + · · · + Y_N/N = Ȳ

just from the definition of simple random sampling, and so E[ȳ] = Ȳ. Next,

    Var(ȳ) = Var( (1/n) Σ_{i=1}^n y_i ) = (1/n²) [ Σ_{i=1}^n Var(y_i) + Σ_{i≠j} Cov(y_i, y_j) ]

For much the same reason as E[y_i] = Ȳ for any i, we will also have

    Var(y_1) = E[y_1²] − E[y_1]²
             = Y_1²/N + Y_2²/N + · · · + Y_N²/N − Ȳ²
             = Σ_{i=1}^N (Y_i − Ȳ)² / N = σ²

and now we need to find Cov(y_1, y_2). Call it c. Then

    Var(ȳ) = (1/n²) [ n σ² + n(n − 1) c ]

and if n = N then clearly Var(ȳ) = 0, which means that

    0 = N σ² + N(N − 1) c = σ² + (N − 1) c

and so c = −σ²/(N − 1). So

    Var(ȳ) = (1/n²) [ n σ² + n(n − 1) c ]
           = (1/n) [ σ² − σ² (n − 1)/(N − 1) ]
           = (1/n) [ 1 − (n − 1)/(N − 1) ] σ²
           = (1/n) [ (N − n)/(N − 1) ] σ²
           = (1/n) [ (N − n)/N ] S²
           = (1/n − 1/N) S²

where the factor (N − n)/N = 1 − n/N is known as the finite population correction factor.
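
As a quick sanity check, here is a minimal Python simulation sketch (the population
array Y and the sample size n below are invented for illustration, not taken from the
text) comparing the empirical variance of ȳ over repeated simple random samples with
the formula (1/n − 1/N) S² from Proposition 1.1.1.

    import numpy as np

    rng = np.random.default_rng(0)
    Y = rng.gamma(shape=2.0, scale=3.0, size=200)    # hypothetical population of N = 200 values
    N, n = len(Y), 25

    S2 = Y.var(ddof=1)                               # population S^2, divisor N - 1
    theory = (1.0 / n - 1.0 / N) * S2                # Var(ybar) from Proposition 1.1.1

    # empirical variance of the sample mean over many without-replacement samples
    means = [rng.choice(Y, size=n, replace=False).mean() for _ in range(20000)]
    print(theory, np.var(means))                     # the two numbers should be close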

Now for an alternative approach to calculating Var(ȳ) and E[ȳ] which will be
much more useful later on. Define indicator random variables {δ_i} where δ_i is
1 if Y_i is in our selected sample, and zero if it is not. Then P(δ_i = 1) = n/N is
the inclusion probability of the ith unit and we have

    E[δ_i] = n/N
    P(δ_i = 1, δ_j = 1) = n(n − 1) / (N(N − 1))
    Var(δ_i) = n/N − n²/N² = (n/N)(1 − n/N)
    Cov(δ_i, δ_j) = E[δ_i δ_j] − (n/N)² = n(n − 1)/(N(N − 1)) − (n/N)²

Transferring to ȳ,

    E[ȳ] = (1/n) Σ_{i=1}^N E[δ_i] Y_i = (1/n) · n · (Σ_{i=1}^N Y_i / N) = Ȳ

    Var(ȳ) = (1/n²) [ Σ_{i=1}^N Var(δ_i) Y_i² + Σ_{i≠j} Cov(δ_i, δ_j) Y_i Y_j ]
           = (1/n²) [ (n/N)(1 − n/N) Σ_{i=1}^N Y_i² + ( n(n−1)/(N(N−1)) − (n/N)² ) Σ_{i≠j} Y_i Y_j ]
           = (1/n − 1/N)(1/N) Σ_{i=1}^N Y_i² + ( (n−1)/(nN(N−1)) − 1/N² ) Σ_{i≠j} Y_i Y_j
           = (1/n − 1/N)(1/N) Σ_{i=1}^N Y_i² + (1/(N(N−1))) (1/N − 1/n) Σ_{i≠j} Y_i Y_j
           = (1/n − 1/N) [ (1/N) Σ_{i=1}^N Y_i² − (1/(N(N−1))) Σ_{i≠j} Y_i Y_j ]
           = (1/n − 1/N) [ (1/(N−1)) Σ_{i=1}^N Y_i² − (1/(N(N−1))) Σ_{i,j} Y_i Y_j ]
           = (1/n − 1/N) [ (1/(N−1)) Σ_{i=1}^N Y_i² − (1/(N−1)) Σ_{i=1}^N Y_i Ȳ ]
           = (1/n − 1/N) (1/(N−1)) Σ_{i=1}^N ( Y_i² − Y_i Ȳ )
           = (1/n − 1/N) (1/(N−1)) Σ_{i=1}^N ( Y_i² − 2 Y_i Ȳ + Y_i Ȳ )
           = (1/n − 1/N) (1/(N−1)) Σ_{i=1}^N ( Y_i² − 2 Y_i Ȳ + Ȳ² )
           = (1/n − 1/N) (1/(N−1)) Σ_{i=1}^N (Y_i − Ȳ)²
           = (1/n − 1/N) S²

(in the second-to-last steps we used Σ_{i,j} Y_i Y_j = N Ȳ Σ_i Y_i and Σ_i Y_i Ȳ = N Ȳ² = Σ_i Ȳ²).

On the other hand, we can also do simple random sampling with replacement.
In this case the y_i are independent identically distributed samples from some-
thing with an obvious probability mass function. This means we can apply
extremely basic theory to say that

    E[ȳ] = Ȳ,    Var(ȳ) = σ²/n
Proposition 1.1.2. The statistic

    s² = Σ_{i=1}^n (y_i − ȳ)² / (n − 1)

is an unbiased estimator for

    σ² = Σ_{i=1}^N (Y_i − Ȳ)² / N

under simple random sampling with replacement. If sampling is without replace-
ment, then the same statistic is unbiased for S².

Proof. First, assume sampling is with replacement. Then

    E[s²] = (1/(n−1)) Σ_{i=1}^n E[(y_i − ȳ)²]
          = (1/(n−1)) Σ_{i=1}^n ( E[y_i²] − 2 E[y_i ȳ] + E[ȳ²] )
          = (n/(n−1)) ( E[y_i²] − 2 E[y_i ȳ] + E[ȳ²] )
          = (n/(n−1)) ( Σ_{j=1}^N Y_j²/N − 2 E[ y_i (Σ_{j=1}^n y_j)/n ] + E[ ( Σ_{i=1}^n y_i / n )² ] )
          = (n/(n−1)) ( Σ_j Y_j²/N − 2 E[ (y_i² + Σ_{j≠i} y_i y_j)/n ] + E[ ( Σ_i y_i² + Σ_{i≠j} y_i y_j )/n² ] )

By symmetry E[Σ_i y_i²] = n E[y_i²] and E[Σ_{i≠j} y_i y_j] = n(n−1) E[y_i y_j], so the last two
terms combine into a single term and

    E[s²] = (n/(n−1)) ( Σ_j Y_j²/N − E[ (y_i² + Σ_{j≠i} y_i y_j)/n ] )
          = (n/(n−1)) ( Σ_j Y_j²/N − Σ_j Y_j²/(nN) − ((n−1)/n) E[y_i] E[y_j] )
          = (n/(n−1)) ( ((n−1)/n) Σ_j Y_j²/N − ((n−1)/n) E[y_i] E[y_j] )
          = Σ_j Y_j²/N − E[y_i] E[y_j]
          = E[y_i²] − E[y_i]² = Var(y_i) = σ²

using that y_i and y_j are independent for i ≠ j under with-replacement sampling.

Now for without replacement:

    E[s²] = (1/(n−1)) E[ Σ_{i=1}^n (y_i − ȳ)² ]
          = (1/(n−1)) E[ Σ_{i=1}^n ( (y_i − Ȳ) − (ȳ − Ȳ) )² ]
          = (1/(n−1)) E[ Σ_{i=1}^n ( (y_i − Ȳ)² − 2 (ȳ − Ȳ)(y_i − Ȳ) + (ȳ − Ȳ)² ) ]
          = (n/(n−1)) ( Var(y_i) + Var(ȳ) ) − (2/(n−1)) E[ Σ_{i=1}^n ȳ (y_i − Ȳ) ]
          = (n/(n−1)) ( Var(y_i) + Var(ȳ) ) − (2/(n−1)) ( E[ Σ_{i=1}^n ȳ y_i ] − n E[ȳ Ȳ] )
          = (n/(n−1)) ( Var(y_i) + Var(ȳ) ) − (2n/(n−1)) ( E[ȳ²] − Ȳ² )
          = (n/(n−1)) ( Var(y_i) + Var(ȳ) ) − (2n/(n−1)) Var(ȳ)
          = (n/(n−1)) ( Var(y_i) − Var(ȳ) )
          = (n/(n−1)) ( σ² − (1/n − 1/N) S² )
          = (n/(n−1)) ( ((N−1)/N) S² − (1/n − 1/N) S² )
          = (n/(n−1)) ( 1 − 1/n ) S² = S²

(the factor Ȳ dropped in the fourth line contributes zero in expectation, since
E[ Σ_i (y_i − Ȳ) ] = 0).

Proposition 1.1.3. ȳ is the best linear unbiased estimator of Ȳ.

Proof. Take any linear estimator t = Σ_{i=1}^n a_i y_i. Then for t to be unbiased
we must have

    Σ_{i=1}^n a_i = 1

Looking at the variance of the estimator, we have

    Var(t) = Σ_{i=1}^n a_i² Var(y_i) + Σ_{i≠j} a_i a_j Cov(y_i, y_j)
           = Σ_{i=1}^n a_i² σ² − (σ²/(N−1)) Σ_{i≠j} a_i a_j
           = σ² [ Σ_{i=1}^n a_i² − (1/(N−1)) Σ_{i≠j} a_i a_j ]
           = σ² [ Σ_{i=1}^n a_i² − (1/(N−1)) Σ_{i,j} a_i a_j + (1/(N−1)) Σ_{i=1}^n a_i² ]
           = σ² ( (N/(N−1)) Σ_{i=1}^n a_i² − 1/(N−1) )
           = σ² (N/(N−1)) ( Σ_{i=1}^n a_i² − 1/N )
           = S² ( Σ_{i=1}^n a_i² − 1/N )

(using Σ_{i,j} a_i a_j = (Σ_i a_i)² = 1). We also have

    Σ_{i=1}^n a_i² = Σ_{i=1}^{n−1} a_i² + ( 1 − Σ_{i=1}^{n−1} a_i )²

and so the variance will be minimised when, for every j ≠ n,

    0 = ∂/∂a_j [ Σ_{i=1}^{n−1} a_i² + ( 1 − Σ_{i=1}^{n−1} a_i )² ]
      = 2 a_j − 2 ( 1 − Σ_{i=1}^{n−1} a_i ) = 2 a_j − 2 a_n

which implies that a_j = a_n, so the constants are all the same and must therefore be
1/n. Note that we only minimise over j ≠ n, as we need the last degree of freedom
to satisfy the constraint Σ_{i=1}^n a_i = 1. Also, we have located a minimum, as the
second derivative is 2 > 0.
Proposition 1.1.4. If we measure characteristics x, y on a simple random
sample of size n from N units then

    Cov(x̄, ȳ) = (1/n − 1/N) S_XY

where S_XY = Σ_{i=1}^N (X_i − X̄)(Y_i − Ȳ) / (N − 1).

Proof.

    Cov(x̄, ȳ) = E[(x̄ − X̄)(ȳ − Ȳ)]
              = E[x̄ ȳ] − X̄ Ȳ
              = E[ (1/n²) ( Σ_{i=1}^N δ_i X_i )( Σ_{j=1}^N δ_j Y_j ) ] − X̄ Ȳ
              = E[ (1/n²) Σ_{i=1}^N Σ_{j=1}^N δ_i δ_j X_i Y_j ] − X̄ Ȳ
              = E[ (1/n²) Σ_{i≠j} δ_i δ_j X_i Y_j + (1/n²) Σ_{i=1}^N δ_i² X_i Y_i ] − X̄ Ȳ
              = ((n−1)/(nN(N−1))) Σ_{i≠j} X_i Y_j + (1/(nN)) Σ_{i=1}^N X_i Y_i − X̄ Ȳ
              = ((n−1)/(nN(N−1))) Σ_{i=1}^N Σ_{j=1}^N X_i Y_j − ((n−1)/(nN(N−1))) Σ_{i=1}^N X_i Y_i
                + (1/(nN)) Σ_{i=1}^N X_i Y_i − X̄ Ȳ
              = (N(n−1)/(n(N−1))) X̄ Ȳ − ((n−1)/(nN(N−1))) Σ_{i=1}^N X_i Y_i + (1/(nN)) Σ_{i=1}^N X_i Y_i − X̄ Ȳ
              = ( (N(n−1) − n(N−1)) / (n(N−1)) ) X̄ Ȳ + ( ((N−1) − (n−1)) / (nN(N−1)) ) Σ_{i=1}^N X_i Y_i
              = ((n − N)/(n(N−1))) X̄ Ȳ + ((N − n)/(nN(N−1))) Σ_{i=1}^N X_i Y_i
              = ((N − n)/(nN(N−1))) ( Σ_{i=1}^N X_i Y_i − N X̄ Ȳ )
              = ((N − n)/(nN(N−1))) Σ_{i=1}^N (X_i − X̄)(Y_i − Ȳ)
              = ((N − n)/(nN)) S_XY = (1/n − 1/N) S_XY

1.2 Ratio estimation

Let X, Y be properties which we can measure for every member of our popula-
tion. Then we will often be interested in estimating the ratio of the population totals,
R = Y_T / X_T, sometimes for its own sake, or sometimes so that we can define the estimator

    Ȳˆ_R = R̂ X̄

Note that for this to be of any use we assume that the population value X̄ is
known. This is not terribly unusual in practice.

There are two obvious candidates for estimators of R. The first is

    R̄ˆ = (1/n) Σ_{i=1}^n y_i / x_i

and the second is

    R̂ = ȳ / x̄

These two estimators actually measure quite different things. Assuming simple
random sampling without replacement, R̄ˆ represents random sampling of the
characteristic R_i = Y_i / X_i, so we can apply the previous results. It has expected
value R̄ = (1/N) Σ_{i=1}^N Y_i/X_i, where R̄ is not the same as R in general, and variance

    (1/n − 1/N) S_R²

On the other hand, switching to the second estimator,

    Cov(R̂, x̄) = Cov(ȳ/x̄, x̄)
              = E[ȳ] − E[ȳ/x̄] E[x̄]
              = Ȳ − E[ȳ/x̄] X̄

So

    E[ȳ/x̄] = R − Cov(ȳ/x̄, x̄) / X̄

This makes the bias of Ȳˆ_R equal to

    −Cov(ȳ/x̄, x̄)

If ȳ/x̄ is approximately constant, or has small variance, then this makes the co-
variance term small, since in general for any A, B

    Cov(A, B)² ≤ Var(A) Var(B)

and in turn this makes R̂ approximately unbiased for R. Unfortunately comput-
ing this covariance term exactly is not possible, so we turn to a first order approximation.

    E[Ȳˆ_R − Ȳ] = E[ (ȳ/x̄) X̄ − Ȳ ] = X̄ E[ (ȳ − Rx̄)/x̄ ]

Now we use the first order approximation of (ȳ − Rx̄)/x̄ in x̄, expanding the x̄ in the
denominator around X̄:

    (ȳ − Rx̄)/x̄ ≈ (ȳ − Rx̄)/X̄ − ((ȳ − Rx̄)/X̄²)(x̄ − X̄)

Note that we only expand one of the x̄ terms. Continuing,

    E[Ȳˆ_R − Ȳ] ≈ E[ X̄ ( (ȳ − Rx̄)/X̄ − (ȳ − Rx̄)(x̄ − X̄)/X̄² ) ]
               = E[ȳ − Rx̄] − E[ (ȳ − Rx̄)(x̄ − X̄) ]/X̄
               = E[ R x̄ (x̄ − X̄)/X̄ − ȳ (x̄ − X̄)/X̄ ]
               = (R/X̄) Var(x̄) − (1/X̄) Cov(ȳ, x̄)
               = (R/X̄)(1/n − 1/N) S_XX − (1/X̄)(1/n − 1/N) S_XY
               = (1/X̄)(1/n − 1/N)( R S_XX − S_XY )

(using E[ȳ − Rx̄] = Ȳ − RX̄ = 0).
Obviously this makes Ȳˆ_R unbiased for Ȳ as n approaches N. Note that this
estimate depends crucially on the specific Taylor series expansion we choose. If
we simply expand ȳ/x̄ we get something different.

Now for the associated mean square error.

    MSE[Ȳˆ_R] = E[ (Ȳˆ_R − Ȳ)² ] = X̄² E[ (R̂ − R)² ]
             = X̄² E[ ( (ȳ − Rx̄)/x̄ )² ]

So we need to find the mean squared error of R̂. It turns out that taking
the bivariate expansion of R̂ around (X̄, Ȳ) gives the same answer as using the
previous expansion. But there's probably a good reason for using the previous
one, so we still do. Applying an even more basic series expansion (replacing the
x̄ in the denominator by X̄), we have

    MSE[Ȳˆ_R] ≈ X̄² E[ ( (ȳ − Rx̄)/X̄ )² ]
             = E[ (ȳ − Rx̄)² ] = Var(ȳ − Rx̄)

as the expectation of ȳ − Rx̄ is zero. Now we have to work out the term on the
right hand side. If we let z_i = y_i − R x_i and consider ourselves to be sampling
from the Z_i, we apply previous formulas to get

    Var(z̄) = (1/n − 1/N) S_Z²,    S_Z² = Σ_{i=1}^N (Z_i − Z̄)² / (N − 1)

So as z̄ = ȳ − Rx̄ we have another expression for Var(ȳ − Rx̄), and we actually
only need to estimate Var(z̄). An estimate of this is given by

    (1/n − 1/N) s_Z*²,    s_Z*² = Σ_{i=1}^n (z_i* − z̄*)² / (n − 1)

where

    z_i* = y_i − R̂ x_i
We have

    E[s_Z*²] = E[ Σ_{i=1}^n (z_i* − z̄*)² / (n − 1) ]
            = (1/(n−1)) E[ Σ_{i=1}^n ( (y_i − R̂x_i) − (ȳ − R̂x̄) )² ]
            = (1/(n−1)) E[ Σ_{i=1}^n ( (y_i − ȳ) − R̂ (x_i − x̄) )² ]
            = (1/(n−1)) E[ Σ_i (y_i − ȳ)² − 2 R̂ Σ_i (y_i − ȳ)(x_i − x̄) + R̂² Σ_i (x_i − x̄)² ]
            = E[ s_y² − 2 R̂ s_xy + R̂² s_x² ]

and now we make the rather crude assumption that R̂ ≈ R, so

    E[s_Z*²] ≈ E[ s_y² − 2 R s_xy + R² s_x² ]
            = S_Y² − 2 R S_XY + R² S_X² = S_Z²

which shows that this estimate of Var(z̄) is approximately unbiased. Comparing the ratio
estimator to ȳ, we find that the difference of the mean squared errors is

    MSE[Ȳˆ_R] − MSE[ȳ] = Var(ȳ − Rx̄) − Var(ȳ)
                       = R² Var(x̄) − 2 R Cov(x̄, ȳ)
                       = R² Var(x̄) − 2 R ρ √(Var(x̄) Var(ȳ))

Obviously this will be negative if

    ρ > (R/2) √( Var(x̄) / Var(ȳ) )

So if ρ is sufficiently large then the ratio estimator will be more efficient than
the sample mean. And obviously, as |ρ| ≤ 1, the sample mean will always be
more efficient than the ratio estimator if

    R √Var(x̄) > 2 √Var(ȳ)
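
As a concrete sketch, here is a short Python example (the population arrays X and Y
and the sample size are invented purely for illustration) that computes R̂ = ȳ/x̄, the
ratio estimate Ȳˆ_R = R̂X̄, and the estimated variance (1/n − 1/N) s_Z*² with
z_i* = y_i − R̂x_i, alongside the usual variance estimate for the plain sample mean.

    import numpy as np

    rng = np.random.default_rng(1)

    # hypothetical population in which y is roughly proportional to x
    N = 500
    X = rng.uniform(10, 50, size=N)
    Y = 3.0 * X + rng.normal(0, 5, size=N)
    Xbar_pop = X.mean()                       # assumed known population mean of x

    # one simple random sample without replacement
    n = 40
    idx = rng.choice(N, size=n, replace=False)
    x, y = X[idx], Y[idx]

    R_hat = y.mean() / x.mean()               # R-hat = ybar / xbar
    Y_ratio = R_hat * Xbar_pop                # ratio estimate of the population mean of y

    fpc = 1.0 / n - 1.0 / N
    z_star = y - R_hat * x                    # z_i* = y_i - R-hat * x_i
    var_ratio = fpc * z_star.var(ddof=1)      # estimated variance of the ratio estimator
    var_mean = fpc * y.var(ddof=1)            # estimated variance of the plain sample mean

    print(Y_ratio, Y.mean())                  # estimate vs true population mean
    print(var_ratio, var_mean)                # the ratio estimator wins when x and y are strongly correlated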

On the other hand, we can also compute the exact mean square error of the
ratio estimate of the total. So assume that the X_i are always positive, and write
X = Σ_{i=1}^N X_i and Y = Σ_{i=1}^N Y_i for the population totals.

    MSE[Ŷ_R] = MSE[ X Σ_{i∈s} Y_i / Σ_{i∈s} X_i ]
             = MSE[ X Σ_{i=1}^N δ_i Y_i / Σ_{i=1}^N δ_i X_i ]
             = MSE[ Σ_{i=1}^N Y_i b δ_i ]

where b = X / Σ_{i=1}^N δ_i X_i. Then

    MSE[Ŷ_R] = E[ ( Σ_{i=1}^N Y_i b δ_i − Σ_{i=1}^N Y_i )² ]
             = E[ ( Σ_{i=1}^N Y_i (b δ_i − 1) )² ]
             = Σ_{j=1}^N Σ_{i=1}^N Y_j Y_i E[(b δ_i − 1)(b δ_j − 1)]
             = Σ_{j=1}^N Σ_{i=1}^N Y_j Y_i d_ij                          (2)

where

    d_ij = E[(b δ_i − 1)(b δ_j − 1)]

We are going to need the property that

    Σ_{i=1}^N Σ_{j=1}^N X_i X_j d_ij = 0

Well,

    Σ_i Σ_j X_i X_j d_ij = E[ Σ_i Σ_j X_i X_j (b δ_i − 1)(b δ_j − 1) ]
                         = E[ ( Σ_{i=1}^N X_i (b δ_i − 1) )² ]
                         = E[ ( b Σ_{i=1}^N δ_i X_i − Σ_{i=1}^N X_i )² ]
                         = E[ (X − X)² ] = 0

and the same argument with one factor held fixed shows that each row sum also
vanishes: Σ_j X_j d_ij = E[ (b δ_i − 1)(X − X) ] = 0. Continuing from (2), we have

    MSE[Ŷ_R] = Σ_j Σ_i Z_j Z_i (X_i X_j d_ij) = Σ_j Σ_i Z_j Z_i a_ij

where Z_i = Y_i / X_i and a_ij = X_i X_j d_ij. Simply from the definition of the mean
square error, this quantity is always non-negative, so what we ended up with
must be a non-negative symmetric quadratic form in the Z_i, for any values
of Z_i. The a_ij are not all positive (although they must be for i = j), but the row-sum
property above gives

    Σ_i Σ_j a_ij = 0   and, for each i,   Σ_j a_ij = 0

Going further,

    MSE[Ŷ_R] = −(1/2) Σ_j Σ_i ( −2 Z_j Z_i a_ij )
             = −(1/2) [ Σ_j Σ_i ( −2 Z_j Z_i a_ij ) + Σ_i Z_i² Σ_j a_ij + Σ_j Z_j² Σ_i a_ij ]
             = −(1/2) Σ_j Σ_i ( Z_i² + Z_j² − 2 Z_i Z_j ) a_ij
             = −(1/2) Σ_j Σ_i (Z_i − Z_j)² a_ij
             = −(1/2) Σ_{i≠j} (Z_i − Z_j)² a_ij
             = − Σ_{i<j} (Z_i − Z_j)² a_ij

(the two added double sums are zero by the row-sum property, and the i = j terms
vanish). Substituting back the values of Z_i and a_ij,

    MSE[Ŷ_R] = − Σ_{i<j} ( Y_i/X_i − Y_j/X_j )² X_i X_j d_ij

It remains to work out d_ij. Well, if p(s) = 1/C(N, n) is the probability that we pick
the sample s using our scheme, we have

    d_ij = E[ δ_i δ_j b² − δ_i b − δ_j b + 1 ]
         = Σ_{s∈i,j} p(s) b(s)² − Σ_{s∈j} p(s) b(s) − Σ_{s∈i} p(s) b(s) + 1
         = (1/C(N,n)) [ Σ_{s∈i,j} b(s)² − Σ_{s∈j} b(s) − Σ_{s∈i} b(s) + C(N,n) ]
         = (1/C(N,n)) [ X² Σ_{s∈i,j} ( Σ_{k∈s} X_k )⁻² − X Σ_{s∈j} ( Σ_{k∈s} X_k )⁻¹
                        − X Σ_{s∈i} ( Σ_{k∈s} X_k )⁻¹ + C(N,n) ]

The notation s ∈ i means that we sum over all samples s which contain i, and C(N, n)
is the binomial coefficient.

Example 1.2.1. It turns out that the estimation of the mean of a subpopulation
from a sample of a larger population is actually a ratio estimator. Say we have
a random sample of 400 households from WA, 20 of which are from Nedlands,
and we are interested in estimating average income. The obvious estimator of
average household income for Nedlands is an average over those 20 units.
So if the total income over the 20 units is 1,600,000 we have

    Ȳˆ_Nedlands = 1,600,000 / 20 = 80,000

But in fact the 20 units from Nedlands were themselves random. So define W_i to
be an indicator of whether the ith sampled unit is in Nedlands; then the total
sampled income from Nedlands is

    Σ_{i=1}^{400} W_i Y_i

So we are estimating the Nedlands mean by the ratio

    Ȳˆ_Nedlands = Σ_i W_i Y_i / Σ_i W_i

and as this is a ratio estimator (with x_i = W_i) we already know its approximate mean
squared error.

Example 1.2.2. Say we want to estimate the mean income of WA households.
So we take a simple random sample and end up with ȳ = 100,000. This is far
too high, and we conclude that we have an unrepresentative sample. But can
we still make use of this unrepresentative sample to get a good estimate? Well,
assume that at the last census we had a population mean income of X̄ = 50,000
and the mean income (at census time) of the selected sample was x̄ = 80,000. We assume
that the proportional increase in income is approximately uniform over the population,
so that our better estimator of Ȳ is

    Ȳˆ_R = 100,000 × 50,000 / 80,000 = 62,500

and the general form is

    Ȳˆ_R = ȳ × X̄ / x̄

1.3 Superpopulation based and design based estimation

What we have previously discussed is called design based estimation. Using


this method we assume that there is some fixed finite population of N units,
and a sample denoted by s of n < N units is taken. The survey design therefore
consists of assigning a probability to every possible sample s, and assigning
these probabilities so that an ‘average’ sample is roughly similar to the whole
population. That is, in the long run if we take repeated samples we will tend
to draw the right inferences from the selected sample.

The superpopulation approach is also called the model based approach. It


assumes that the finite population of size N is itself a sample of size N from
some infinite population, and our sample of size n is a smaller sample again
from these N units. That is, we assume that the Y1 , . . . YN are random variables
with some joint distribution, and from this our sampled units y1 , . . . yn also have
some inherent randomness. The superpopulation approach allows us to make
inference either about the underlying random population, that is, about the
joint distribution of the Y1 , . . . YN , or about the particular realizations of these
random variables.

The justification for the superpopulation approach is that in many cases even a
census represents just a sample from a larger population. For instance, a census
of Australian residents only measures the population at a single moment; the
population changes quickly and will soon be different. It is also
only one of the possible populations that might have arisen from the same set
of underlying social and economic influences. In this case we might argue that
we were more interested in the underlying infinite population than the particular finite
population, so the superpopulation view is not that unrealistic.

A model-based estimator is called model-unbiased if conditional on the accu-


racy of the model and the choice of observation points, the estimator is unbiased.
On the other hand, design-based methods yield estimators that are uncondition-
ally unbiased, regardless of the form of the underlying population or model. If
the model is accurate then there are great benefits in using a model-based ap-
proach. On the other hand, the design-based approach is less efficient, but makes
no modelling assumptions and yields unconditionally unbiased estimators.
Example 1.3.1. Consider the superpopulation model where every unit is as-
sociated with two random variables X and Y, and

    Y = βX + ε

where E[ε | X] = 0. That is, there is a strong dependence between all three
variables. Now, we observe the realizations Y_1 . . . Y_N and X_1 . . . X_N, and from
this we take the subsample y_1 . . . y_n and x_1 . . . x_n. We know from the model
that

    E[Y | X] = βX

We know that if s denotes our sample, we have

    N Ȳ = n ȳ + Σ_{i∉s} Y_i

So our approach to estimating Ȳ is to estimate the superpopulation parameter
β, use this to estimate the non-observed Y_i, and then use these to calculate Ȳ.
The least squares approach (weighting the ith squared error by 1/x_i, which is natural
if the error variance grows with x) is to find R̂, the value of R that minimises

    Σ_{i=1}^n (y_i − R x_i)² / x_i

and it turns out that

    R̂ = Σ_{i=1}^n y_i / Σ_{i=1}^n x_i = ȳ / x̄

Then we predict the unobserved values. In the end, going through the model
based approach gets us the same ratio estimator we had before,

    Ȳˆ_R = R̂ X̄

The difference is that our assumptions make the variance of the estimator very
different. For example, assume that ε ∼ N(0, τ²). Then

    Var(R̂ | x_1 . . . x_n) = n τ² / ( Σ_{i=1}^n x_i )² = τ² / (n x̄²)

This suggests that we not use simple random sampling, and instead try to
maximize x̄.
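
A small Python simulation sketch of this point (the values β = 2, τ = 1 and the two
designs compared below are illustrative assumptions, not from the text): under the
model Y = βX + ε the conditional variance of R̂ = ȳ/x̄ is τ²/(n x̄²), so a purposive
sample taking the largest x-values beats a random choice of observation points.

    import numpy as np

    rng = np.random.default_rng(2)
    beta, tau = 2.0, 1.0
    N, n = 200, 20
    X = rng.uniform(1, 10, size=N)            # realized x-values for the finite population

    def draw_R_hat(sample_idx):
        # one realization of R-hat = ybar/xbar with the observation points held fixed
        x = X[sample_idx]
        y = beta * x + rng.normal(0, tau, size=n)   # superpopulation model for the sampled units
        return y.mean() / x.mean()

    srs_idx = rng.choice(N, size=n, replace=False)  # a random choice of points
    big_idx = np.argsort(X)[-n:]                    # the n largest x-values

    for name, idx in [("random x", srs_idx), ("largest x", big_idx)]:
        reps = [draw_R_hat(idx) for _ in range(5000)]
        xbar = X[idx].mean()
        print(name, np.var(reps), tau**2 / (n * xbar**2))  # empirical vs model-based variance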

In general, choosing between the superpopulation and design-based approaches
depends on how confident you are in the model you would use for the superpop-
ulation version. If you are very confident that the model holds uniformly then
the superpopulation approach is valid. The design-based approach, on the other hand,
allows some parts of the population to deviate from the 'model', pos-
sibly quite seriously; talk of a 'model' in the design-based approach is slightly
misleading, and all it means is that the design-based approach is able to ro-
bustly estimate the quantity Ȳ using the ratio estimator even when there is no
really good proportional relationship between the individual X and Y values.
Importantly, even if the two approaches suggest the same estimator, the differ-
ences between them may be substantial. For example, the different variance in
the above example suggests a quite different sampling criterion.

Now we extend the previous example to take into account an intercept as well.
That is, our model is

    Y_i = α + β X_i + ε_i

where it is assumed that the ε_i are uncorrelated, have common variance σ² and
mean 0 conditional on the value of X_i. So

    E[ε_i | X_i] = 0,    E[Y_i | X_i] = α + β X_i

To derive our estimate of Y, let us restrict ourselves to linear estimators. That is,
our prospective estimator is t, where

    t = Σ_{i∈s} Y_i + Σ_{i∈s} g_i Y_i                              (3)

and the collection {g_i} determines the equation of the estimator. For this to be
model unbiased for Σ_{i=1}^N Y_i we must have

    0 = E[ t − Σ_{i=1}^N Y_i | x_1 . . . x_n ]
      = E[ Σ_{i∈s} g_i Y_i − Σ_{i∉s} Y_i | x_1 . . . x_n ]
      = E[ Σ_{i∈s} g_i Y_i | x_1 . . . x_n ] − Σ_{i∉s} (α + β X_i)
      = α Σ_{i∈s} g_i + β Σ_{i∈s} g_i X_i − Σ_{i∉s} (α + β X_i)

So

    α Σ_{i∈s} g_i + β Σ_{i∈s} g_i X_i = Σ_{i∉s} (α + β X_i) = (N − n) α + β Σ_{i∉s} X_i

As these must be equal as functions of α and β we must have

    Σ_{i∈s} g_i X_i = Σ_{i∉s} X_i                                   (4)
    Σ_{i∈s} g_i = N − n                                             (5)

Now to minimise the variance of the estimator. Well,

    Var(t) = E[ ( t − Σ_{i=1}^N Y_i )² | x_1 . . . x_n ]
           = E[ ( Σ_{i∈s} g_i Y_i − Σ_{i∉s} Y_i )² | x_1 . . . x_n ]
           = E[ ( Σ_{i∈s} g_i Y_i − Σ_{i∉s} (α + β X_i) − Σ_{i∉s} (Y_i − α − β X_i) )² | x_1 . . . x_n ]
           = E[ ( Σ_{i∈s} g_i Y_i − Σ_{i∈s} g_i (α + β x_i) − Σ_{i∉s} ε_i )² | x_1 . . . x_n ]
           = E[ ( Σ_{i∈s} g_i ε_i − Σ_{i∉s} ε_i )² | x_1 . . . x_n ]
           = E[ ( Σ_{i∈s} g_i ε_i )² − 2 ( Σ_{i∈s} g_i ε_i )( Σ_{i∉s} ε_i ) + ( Σ_{i∉s} ε_i )² | x_1 . . . x_n ]

(the second line becomes the fourth because of conditions (4) and (5)). From our
uncorrelated assumption we have E[ε_i ε_j | x_1 . . . x_n] = 0 for i ≠ j, so all the cross
terms vanish and

    Var(t) = Σ_{i∈s} E[ g_i² ε_i² | x_1 . . . x_n ] + Σ_{i∉s} E[ ε_i² | x_1 . . . x_n ]
           = Σ_{i∈s} g_i² σ² + (N − n) σ²
           = ( Σ_{i∈s} g_i² + N − n ) σ²

We can use Lagrange multipliers to minimise this expression subject to conditions (4) and
(5), to get

    g_i = (N/n − 1) + ( N (X̄ − x̄) / Σ_{i∈s} (X_i − x̄)² ) (X_i − x̄)

Substituting this back into (3) we have

    t = Σ_{i∈s} Y_i + Σ_{i∈s} [ N (1/n − 1/N) + N (X̄ − x̄)(X_i − x̄) / Σ_{i∈s} (X_i − x̄)² ] Y_i
      = N Σ_{i∈s} [ 1/n + (X̄ − x̄)(X_i − x̄) / Σ_{i∈s} (X_i − x̄)² ] Y_i
      = N ȳ + N (X̄ − x̄) Σ_{i∈s} (X_i − x̄) Y_i / Σ_{i∈s} (X_i − x̄)²
      = N ȳ + N (X̄ − x̄) b

where

    b = Σ_{i=1}^n (x_i − x̄) y_i / Σ_{i=1}^n (x_i − x̄) x_i

is a model-unbiased estimator of β, as

    E[b | x_1 . . . x_n] = Σ_{i=1}^n (x_i − x̄) E[y_i | x_1 . . . x_n] / Σ_{i=1}^n (x_i − x̄) x_i
                        = Σ_{i=1}^n (x_i − x̄)(α + β x_i) / Σ_{i=1}^n (x_i − x̄) x_i
                        = β Σ_{i=1}^n (x_i − x̄) x_i / Σ_{i=1}^n (x_i − x̄) x_i = β

The corresponding estimator of α is a = ȳ − b x̄, which is obviously also model
unbiased. A quite standard formula for the variance of t is

    Var(t | x_1 . . . x_n) = N² Var(ȳ | x_1 . . . x_n) + N² (X̄ − x̄)² Var(b | x_1 . . . x_n)
                          = N² (1/n − 1/N) σ² + N² (X̄ − x̄)² Var(b | x_1 . . . x_n)
                          = N² [ (1/n − 1/N) σ² + (X̄ − x̄)² Σ_{i=1}^n (x_i − x̄)² Var(y_i) / ( Σ_{i=1}^n (x_i − x̄) x_i )² ]
                          = N² [ (1/n − 1/N) σ² + (X̄ − x̄)² σ² / Σ_{i=1}^n (x_i − x̄)² ]
                          = N² σ² [ 1/n − 1/N + (X̄ − x̄)² / Σ_{i=1}^n (x_i − x̄)² ]

Now, what would happen if we let our estimator have the same form but we
now treated it as a design-based estimator instead of a model-based estimator?
Well, we still define

    β = Σ_{i=1}^N (X_i − X̄) Y_i / Σ_{i=1}^N (X_i − X̄) X_i
    α = Ȳ − β X̄

The estimators of α and β are the same as for the model-based approach, but
these estimators are now only asymptotically unbiased. As for the variance of
Ȳˆ = ȳ + b (X̄ − x̄), we have

    Var(Ȳˆ) = Var( ȳ − b (x̄ − X̄) )
            ≈ Var( ȳ − β (x̄ − X̄) )
            = Var( ȳ − β x̄ + β X̄ )
            = E[ ( ȳ − Ȳ − β x̄ + β X̄ )² ]
            = E[ ( ȳ − β x̄ − α )² ]
            = E[ z̄² ]
            = (1/n − 1/N) S_Z²

where Z_i = Y_i − Ȳ − β (X_i − X̄) is the model residual, whose population mean is zero,
so that E[z̄] = 0. A sample estimator of S_Z² is

    s_z² = (1/(n−1)) Σ_{i=1}^n ( (y_i − ȳ) − b (x_i − x̄) )²
         = (1/(n−1)) Σ_{i=1}^n ( y_i − b x_i − ȳ + b x̄ )²
         = (1/(n−1)) Σ_{i=1}^n ( y_i − b x_i − a )²

which is approximately unbiased if we assume that b ≈ β. We can make more complicated
and more accurate approximations instead, if we want to. Define

    A = Σ_{i=1}^N (X_i − X̄) Y_i
    a = Σ_{i=1}^n (x_i − x̄) y_i
    C = Σ_{i=1}^N (X_i − X̄)²
    c = Σ_{i=1}^n (x_i − x̄)²

Applying some approximations to b, we have

    b = a/c = ( (a − A) + A ) / ( (c − C) + C )
      = (A/C) · ( (a − A)/A + 1 ) / ( (c − C)/C + 1 )
      ≈ (A/C) ( (a − A)/A + 1 ) ( 1 − (c − C)/C )

The approximation comes from the expansion

    1/(1 − x) = Σ_{n=0}^∞ xⁿ

This is only valid for |x| < 1, and making this assumption in our case is not
unreasonable. Continuing,

    b ≈ (A/C) ( (a − A)/A + 1 − (c − C)/C − (c − C)(a − A)/(C A) )

We have

    E[t] = Y − N E[ b (x̄ − X̄) ]

so the bias of t is −N E[ b (x̄ − X̄) ]. Ignoring the last (second-order) term in the expansion
of b as being insignificant, this gives

    N E[ b (x̄ − X̄) ] ≈ N E[ (A/C) ( (a − A)/A + 1 − (c − C)/C ) (x̄ − X̄) ]
                      = N E[ (A/C) ( (a − A)/A − (c − C)/C ) (x̄ − X̄) ]
                      = N E[ (A/C) ( a/A − c/C ) (x̄ − X̄) ]
                      = N E[ (1/C) ( a − (A/C) c ) (x̄ − X̄) ]
                      = N E[ (1/C) ( a − β c ) (x̄ − X̄) ]
                      = (N/C) E[ ( Σ_{i=1}^n (x_i − x̄) y_i − β Σ_{i=1}^n (x_i − x̄) x_i ) (x̄ − X̄) ]
                      = (N/C) E[ ( Σ_{i=1}^n (x_i − x̄)( y_i − ȳ − β (x_i − x̄) ) ) (x̄ − X̄) ]

(the term E[x̄ − X̄] multiplying the constant drops out because it is zero). We can apply
the same approximation to b to find approximate expressions for the bias and mean
square error of b as an estimator of β.

2 Stratified Sampling

2.1 Motivation

Suppose we again take a sample of 400 households from WA, and we find that
we have a sample mean of 100,000. This is clearly too high to be representative,
and on looking more closely we notice that

- 40 (10%) of the households are from mining towns.

- 160 (40%) of the households are from wealthy areas such as Nedlands,
Subiaco, etc.

- 200 (50%) of the households are from other areas.

The actual distribution of the population among these regions, however, is

- 5% live in mining towns.

- 20% live in wealthy areas.
- 75% live in other areas.

This says that our random sample is rather unrepresentative, and contains far
too many households from wealthy areas. So divide the total population up
into 3 groups or strata. Let W_i be the proportion of the population that lies in
strata i, and n_i the number of units chosen from strata i. Obviously in an ideal
situation we would have n_i = nW_i, but in our example we have rather severe
deviations from this.

So let's make some adjustments to fix this under and over representation in our
sample. We estimate the mean per strata, then weight these and sum them to
get a total mean that takes into account the variability among the three different
parts of the population.

    Stratum       W_i     n_i          sample mean income
    1. Mining     0.05    40 (10%)     200,000
    2. Rich       0.20    160 (40%)    160,000
    3. Rest       0.75    200 (50%)    32,000

    Ȳˆ_st = 0.05 × 200,000 + 0.2 × 160,000 + 0.75 × 32,000 = 66,000

In this case we performed the stratification after the sample was taken. How-
ever, we may perform stratification before taking the actual sample, and this is
actually preferred.
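
A minimal Python sketch of this adjustment, using the stratum weights, sample sizes
and stratum means from the table above:

    # post-hoc stratified estimate: weight each stratum's sample mean by its
    # population share W_i rather than by its (distorted) share of the sample
    W = [0.05, 0.20, 0.75]                 # population proportions: mining, rich, rest
    ybar = [200_000, 160_000, 32_000]      # sample mean income within each stratum
    n_i = [40, 160, 200]                   # realized sample sizes (10%, 40%, 50% of n = 400)

    y_raw = sum(n * y for n, y in zip(n_i, ybar)) / sum(n_i)   # unweighted sample mean
    y_st = sum(w * y for w, y in zip(W, ybar))                  # stratified estimate

    print(y_raw)   # 100,000 - dominated by the over-sampled wealthy areas
    print(y_st)    # 66,000  - corrected to the population weights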

2.2 Stratified simple random sampling

Now to lay the idea out in full. The main idea is that we decide beforehand
that there are k strata in the population, and that we want to sample n_i units
from the ith strata, where Σ_{i=1}^k n_i = n. Similarly let N_i be the population size of
the ith strata, with Σ_{i=1}^k N_i = N. The mean of the n_i units from the ith strata
is ȳ_i, and s_i² is the corresponding sample variance. Ȳ_i and S_i² are the corresponding
population values, and Y_ij is the jth unit from the ith strata. W_i is the fraction
of the population that lies in the ith strata, so W_i = N_i/N. We have

    Ȳ = Σ_{i=1}^k W_i Ȳ_i
    Ȳˆ_st = Σ_{i=1}^k W_i ȳ_i

In the case that n_i = n N_i/N we say that units are proportionally allocated
to the strata. If proportional allocation is used the stratified estimator and
the whole-sample mean will coincide, otherwise they will not. Note that this
does not really make the stratified estimator that similar to the simple random
sample mean, as our rules for how we sample the units are very different in these
two cases, resulting in a very different variance. So with proportional allocation,

    ȳ_st = Σ_{i=1}^k W_i ȳ_i = Σ_{i=1}^k (N_i/N) ȳ_i
         = Σ_{i=1}^k (n_i/n) ȳ_i = Σ_{i=1}^k (1/n) Σ_{j=1}^{n_i} y_ij
         = (1/n) Σ_{i=1}^k Σ_{j=1}^{n_i} y_ij = ȳ

Stratification is in fact best when the between-strata variation is very large and
the internal per-strata variation tends to be very small. This is because

    Var(ȳ_st) = Σ_{i=1}^k W_i² Var(ȳ_i) = Σ_{i=1}^k W_i² (1/n_i − 1/N_i) S_i²

and so only the internal per-strata variation contributes to the error of the
estimator. If we are dealing with proportional allocation, this becomes

    Var(ȳ_st) = Σ_{i=1}^k W_i² (1/n_i − 1/N_i) S_i²
              = Σ_{i=1}^k W_i (N_i/N) (1/n_i − 1/N_i) S_i²
              = Σ_{i=1}^k W_i ( N_i/(N n_i) − 1/N ) S_i²
              = Σ_{i=1}^k W_i (1/n − 1/N) S_i²
              = (1/n − 1/N) Σ_{i=1}^k W_i S_i²

using n_i = n N_i/N. The obvious estimator of the variance of the stratified estimator is

    Σ_{i=1}^k W_i² (1/n_i − 1/N_i) s_i²

If we have more than n strata then it will be impossible to choose even one unit
from every strata, so there are problems in this situation.

2.3 Choice of allocations to strata

Well, proportional allocation is one obvious way of distributing the sampled
units among the strata. Another is to try and choose the n_i so as to minimise
Var(ȳ_st). So assume that the S_i² are known, n is fixed, and ignore for now the
possibility that n_i > N_i. We know that

    Var(ȳ_st) = Σ_{i=1}^k W_i² (1/n_i − 1/N_i) S_i²
              = Σ_{i=1}^k ( N_i² S_i² / (N² n_i) − N_i S_i² / N² )

So we want to minimise

    Σ_{i=1}^k N_i² S_i² / (N² n_i)

subject to the constraint Σ_{i=1}^k n_i = n. Well, using Lagrange multipliers and
treating the n_i as real-valued instead of integer-valued, we want to minimise

    f = Σ_{i=1}^k N_i² S_i² / (N² n_i) + λ ( Σ_{i=1}^k n_i − n )
      = Σ_{i=1}^k W_i² S_i² / n_i + λ ( Σ_{i=1}^k n_i − n )

Taking partial derivatives,

    ∂f/∂n_j = − W_j² S_j² / n_j² + λ = 0                             (6)
    ∂f/∂λ = Σ_{i=1}^k n_i − n = 0                                     (7)

Well, (6) implies that

    n_j = W_j S_j / √λ                                                (8)

and putting this back into (7) gives

    Σ_{i=1}^k W_i S_i / √λ = n

So

    λ = ( Σ_{i=1}^k W_i S_i )² / n²

and finally substituting this back into (8) gives

    n_j = n W_j S_j / Σ_{i=1}^k W_i S_i

This optimal allocation is also known as Neyman allocation. We also
find that if the N_i are large we can vary the n_i away from the optimum values and
still achieve close to the optimal variance. So assume that the N_i are large and
that the n_i* are the optimum allocations. Then

    Var(ȳ_st) = Σ_{i=1}^k W_i² S_i² (1/n_i − 1/N_i)
              ≈ Σ_{i=1}^k W_i² S_i² / n_i
              = ( ( Σ_{j=1}^k W_j S_j )² / n² ) Σ_{i=1}^k W_i² S_i² n² / ( ( Σ_{j=1}^k W_j S_j )² n_i )
              = A² Σ_{i=1}^k (n_i*)² / n_i

where A = Σ_{j=1}^k W_j S_j / n. On the other hand, the optimum allocation gives a
variance of

    Var(ȳ_st) = Σ_{i=1}^k W_i² S_i² (1/n_i* − 1/N_i)
              ≈ Σ_{i=1}^k W_i² S_i² / n_i*
              = Σ_{i=1}^k W_i S_i ( Σ_{j=1}^k W_j S_j ) / n
              = ( Σ_{j=1}^k W_j S_j )² / n
              = A² n = A² Σ_{j=1}^k n_j* = A² Σ_{j=1}^k (n_j*)² / n_j*

So from the algebra, if n_i is close to n_i* we will attain a variance close to the
optimum, as claimed: the two expressions differ only in whether n_i or n_i* appears in
the denominator. So even if the S_i² are only approximately known, maybe
from some previous study, it still makes sense to use these to pick approximately
optimal strata sizes.
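
A short Python sketch of Neyman allocation (the stratum sizes and standard deviations
below are invented for illustration); it also shows how little the variance degrades when
the allocation is rounded or replaced by proportional allocation, as argued above.

    import numpy as np

    N_i = np.array([1000, 3000, 6000])        # hypothetical stratum population sizes
    S_i = np.array([50.0, 20.0, 5.0])         # hypothetical stratum standard deviations
    n = 300
    W_i = N_i / N_i.sum()

    def var_stratified(n_alloc):
        # Var(ybar_st) = sum_i W_i^2 (1/n_i - 1/N_i) S_i^2 for a given allocation
        return np.sum(W_i**2 * (1.0 / n_alloc - 1.0 / N_i) * S_i**2)

    n_neyman = n * W_i * S_i / np.sum(W_i * S_i)   # optimal (Neyman) allocation
    n_prop = n * W_i                               # proportional allocation
    n_rounded = np.round(n_neyman)                 # what we would actually use

    for name, alloc in [("Neyman", n_neyman), ("rounded", n_rounded), ("proportional", n_prop)]:
        print(name, alloc, var_stratified(alloc))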

2.4 Comparison of allocation strategies

Proposition 2.4.1. Stratified sampling with proportional allocation gives an
estimator with lower variance than simple random sampling, subject to certain
conditions.

Proof. Looking at S², we find that we can decompose it into between-strata
variance and within-strata variance. That is,

    S² = (1/(N−1)) Σ_{i=1}^k Σ_{j=1}^{N_i} (Y_ij − Ȳ)²
       = (1/(N−1)) Σ_{i=1}^k Σ_{j=1}^{N_i} ( (Y_ij − Ȳ_i) + (Ȳ_i − Ȳ) )²
       = (1/(N−1)) Σ_{i=1}^k Σ_{j=1}^{N_i} [ (Y_ij − Ȳ_i)² + (Ȳ_i − Ȳ)² + 2 (Ȳ_i − Ȳ)(Y_ij − Ȳ_i) ]
       = (1/(N−1)) Σ_{i=1}^k [ (N_i − 1) S_i² + N_i (Ȳ_i − Ȳ)² + 0 ]

(the cross terms vanish because Σ_{j=1}^{N_i} (Y_ij − Ȳ_i) = 0), and this can be related
back to the ANOVA table of a one-way classification. If N − 1 ≈ N and N_i − 1 ≈ N_i
we have approximately

    N S² = Σ_{i=1}^k N_i S_i² + Σ_{i=1}^k N_i (Ȳ_i − Ȳ)²
    S² = Σ_{i=1}^k W_i S_i² + Σ_{i=1}^k W_i (Ȳ_i − Ȳ)²

So

    Var(ȳ) − Var(ȳ_st) = (1/n − 1/N) S² − (1/n − 1/N) Σ_{i=1}^k W_i S_i²
                       ≈ (1/n − 1/N) [ Σ_{i=1}^k W_i S_i² + Σ_{i=1}^k W_i (Ȳ_i − Ȳ)² − Σ_{i=1}^k W_i S_i² ]
                       = (1/n − 1/N) Σ_{i=1}^k W_i (Ȳ_i − Ȳ)²

and as the right hand side is non-negative we find that ȳ_st has a lower variance than
ȳ. But this relied crucially on our approximations. So we find that in fact it is
not always true that ȳ_st has lower variance than ȳ, the whole-sample mean.
However it is usually true. In fact the exact condition we need is

    Σ_{i=1}^k N_i (Ȳ_i − Ȳ)² > (1/N) Σ_{i=1}^k (N − N_i) S_i²

Proposition 2.4.2. Even optimal allocation does not always result in an esti-
mator with a lower variance than simple random sampling.

2.5 Post-Stratification

Post-stratification means that the distribution of the n sampled units among the
strata is only known after the sampling has been performed. So the allocation of
the units among the strata is random, and as expected this adds some variability
to the stratified estimator. Obviously if we ignore the stratification and use the
whole-sample mean we have

    ȳ = Σ_{i=1}^k w_i ȳ_i

where w_i is the proportion of observed units from the ith strata. On the other
hand the post-stratification estimator is defined as

    ȳ_st = Σ_{i=1}^k W_i ȳ_i

where we now have that n_i is also random. As ȳ_i is unbiased for Ȳ_i we have
that ȳ_st is still unbiased for Ȳ. Now for calculating the variance. Well, it is a
fact that

    Var(Y) = E[Var(Y | X)] + Var(E[Y | X])

So

    Var(ȳ_st) = E[Var(ȳ_st | n_1 . . . n_k)] + Var(E[ȳ_st | n_1 . . . n_k])
              = E[Var(ȳ_st | n_1 . . . n_k)] + 0

Conditional on n_1, . . . , n_k this is just the variance of a standard stratified estimator,
with allocations n_1, n_2, . . . , n_k to the various strata. So

    Var(ȳ_st) = E[ Σ_{i=1}^k (1/n_i − 1/N_i) W_i² S_i² ]
              = Σ_{i=1}^k E[ 1/n_i − 1/N_i ] W_i² S_i²
              = Σ_{i=1}^k ( E[1/n_i] W_i² S_i² − (1/N_i) W_i² S_i² )

Now, assume that all the n_i are nonzero. This is not too radical an assumption
if k is small compared to n. With this assumption the n_i have the positive
binomial distribution, and we find that

    E[1/n_i] ≈ 1/(n W_i) − 1/(n² W_i) + 1/(n² W_i²)

So

    Var(ȳ_st) ≈ Σ_{i=1}^k [ ( 1/(n W_i) − 1/(n² W_i) + 1/(n² W_i²) ) W_i² S_i² − (1/N_i) W_i² S_i² ]
              = Σ_{i=1}^k [ ( W_i/n − W_i/n² + 1/n² ) S_i² − (W_i/N) S_i² ]
              = Σ_{i=1}^k [ (W_i/n) S_i² + (1/n²) S_i² (1 − W_i) − (W_i/N) S_i² ]
              = (1/n − 1/N) Σ_{i=1}^k W_i S_i² + (1/n²) Σ_{i=1}^k S_i² (1 − W_i)

If n is large the second term becomes small, and the first term is exactly the
variance under proportional allocation. So if n is large, post-stratification is almost as
good as stratification with proportional allocation. Obviously the more variation there
is between the strata means, the more post-stratification improves on the unadjusted
mean. An estimate of this variance is

    Σ_{i=1}^k (1/n_i − 1/N_i) W_i² s_i²


3 Cluster Sampling

So far we’ve assumed that a list of all the units in the population is available.
But often this is not the case and more often there will be lots of groups of

units, each one called a cluster. For each cluster we assume that the list of
units contained in the cluster can be obtained without much cost. There isn’t
really that much of a new problem here yet; we just sample from the clusters,
which will now be called sampling units and then perform a census of all
the units contained in the chosen clusters. We will assume that there are N
sampling units, with associated values Y1 . . . YN .

Say we want to estimate household average income in a newly developed area.


That is, we are starting with no knowledge of the population. The area we are
looking at is made up of 20 blocks, so it makes sense to consider clusters of
units, where each cluster is a single block. In this case the Y values will be total
incomes per block, and we take a sample of size n of these blocks and get the
total household incomes for each. Obviously the average ȳ estimates

    Ȳ = (1/N) Σ_{i=1}^N Y_i

If the average income across all households is Z̄ then, as

    Z̄ = N Ȳ / M

we can estimate this as

    Z̄ˆ = N ȳ / M

where M is the total number of households across the 20 blocks. If we don't
know M, then it has to be estimated by

    M̂ = N m̄

giving a slightly different estimator of Z̄,

    z̄ = N ȳ / M̂ = ȳ / m̄ = Σ_{i=1}^n y_i / Σ_{i=1}^n m_i

Going back to the example, say that we pick n = 3 and for our three clusters
we have

    Cluster      1        2          3
    Households   12       20         8
    Y_i          600,000  1,200,000  4,400,000

Then we have

    z̄ = (6×10⁵ + 12×10⁵ + 44×10⁵) / 40 = (62/40) × 10⁵ = 155,000

If, on the other hand, we knew that there were only M = 200 households, then we
could use the estimate

    Z̄ˆ = N ȳ / M = (20/3)(6×10⁵ + 12×10⁵ + 44×10⁵) / 200 ≈ 206,667

But the problem with our second estimator is that it fails to take into account
the number of households in the sampling units we picked. In ratio-estimator
terms (with x_i = m_i, the cluster sizes), the first estimator is ȳ/x̄ and the second
is ȳ/X̄. So the mean square error of the first is

    MSE[ȳ/x̄] ≈ (1/X̄²) Var(ȳ − Rx̄)

and that of the second is

    Var(ȳ/X̄) = (1/X̄²) Var(ȳ)

So if we assume that ȳ − Rx̄ is less variable than ȳ, as we normally would, then
the first estimator is biased but has a lower mean square error than the second.

We do in fact prefer the estimator ȳ/m̄ for cluster sampling. Its mean square
error is approximately

    (1/M̄²) Var(ȳ − R m̄)

where M̄ = M/N and R = Ȳ/M̄, and this can be estimated by

    (1/m̄²)(1/n − 1/N) s_z²,    z_i = y_i − (ȳ/m̄) m_i

for exactly the same reasons as given in the ratio estimation section. As
said, this estimator is not unbiased, but compensates by correcting for non-
representativeness of the sample.
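
A small Python sketch of this preferred (ratio-type) cluster estimator and its estimated
mean square error, using the three sampled blocks from the example above (N = 20 blocks;
the figures are the ones in the table):

    # cluster totals y_i and cluster sizes m_i for the n = 3 sampled blocks
    y = [600_000, 1_200_000, 4_400_000]
    m = [12, 20, 8]
    N, n = 20, 3

    z_bar = sum(y) / sum(m)                      # ratio estimate of income per household: 155,000
    R_hat = z_bar

    # estimated MSE along the lines of the ratio-estimation section:
    # (1/n - 1/N) * sample variance of z_i = y_i - R_hat * m_i, scaled by 1/m_bar^2
    m_bar = sum(m) / n
    z = [yi - R_hat * mi for yi, mi in zip(y, m)]
    s2_z = sum((zi - sum(z) / n) ** 2 for zi in z) / (n - 1)
    mse_hat = (1.0 / n - 1.0 / N) * s2_z / m_bar**2

    print(z_bar, mse_hat)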

3.1 Unbiased cluster sampling

Say we have five clusters, and the population data is

    Cluster                                   1    2    3    4    5    Total
    Household size                            20   12   8    40   20   100
    Total household income (in 100,000s, say) 11   7    5    18   12   53

Now, take a sample of two clusters. There are ten possible samples, all of which have
equal probability of being selected, and our estimates are either

    T_1 = 5 ȳ / 100    (that is, N ȳ / M)

or

    T_2 = Σ_{i∈s} Y_i / Σ_{i∈s} X_i

where the X_i are the cluster sizes (households per cluster) and the Y_i are the cluster
total incomes. T_2 is the preferred biased estimator and T_1 is unbiased, and the
corresponding values are

    Sample chosen   T_1      T_2
    (1,2)           45,000   56,250
    ...             ...      ...
    (2,3)           30,000   60,000
    ...             ...      ...
    (4,5)           75,000   50,000

The estimator T_1 averages to 53,000 over the ten samples, which is the true value. The
estimator T_2 averages to something else, but varies much less than T_1. Now, to make
T_2 unbiased we just alter the probabilities with which we select each sample. Let our
new selection probabilities be denoted by p_1 . . . p_10 and assume that the cluster sizes
are known. We want to have

    p_i (T_2)_i = (T_1)_i / 10

where i references the ith possible sample, as this will mean that

    E[T_1] = E[T_2]

So

    p_i = (T_1)_i / ( 10 (T_2)_i ) = (x̄_i / 100) (N / 10)

where N is the number of clusters. So we end up choosing p_i proportional to
x̄_i, that is, the probability of choosing a sample is proportional to the total number
of households in the chosen clusters. So in our example we would pick sample i with
probability proportional to L, the combined size of the two clusters, which is given in
the table.

    Sample chosen   L
    (1,2)           32
    ...             ...
    (2,3)           20
    ...             ...
    (4,5)           60

But applying this scheme is difficult in practice, and we would really like to
assign inclusion probabilities to individual units. A common alternative is called
the Midzuno-Sen sampling scheme. Under this scheme we pick only the first cluster
with probability proportional to size, and then pick the remaining
n − 1 from the other N − 1 as a simple random sample. If we were to apply this to the
above example, we would end up with the following probabilities for the first
choice.

    Size          20   12   8    40   20   100
    Probability   .2   .12  .08  .4   .2   1

Under this scheme the probability of picking the clusters (y_1, y_2, . . . , y_n) is given by

    ( m_1/M + · · · + m_n/M ) / C(N−1, n−1)

Obviously the Midzuno-Sen scheme makes the selection process much simpler:
after the first unit is selected, the remaining selections are just simple random
sampling without replacement. On the other hand, if every unit were selected with
probability proportional to size then the selection probabilities would in a sense change
after every selection.

3.2 Systematic sampling

Consider an accounting system with N accounts which need to be checked. N
will probably be very large, so assume N = 1000. We want to check these N by
just picking a sample of size n and checking these, and assume that n = 10. A
very simple method for doing this is to first take a random account numbered between 1
and 100, and denote this by i. Then pick every hundredth unit thereafter. That is,
pick

    i, i + 100, i + 200, . . . , i + 900

This is more properly visualised as cluster sampling, where there are N/n clusters
and every cluster has the form

    { i, i + N/n, i + 2N/n, . . . , i + (n − 1) N/n }

It is normal to pick at least 2 clusters so that we have an estimate of the variance,
but with systematic sampling this may not always be done.

Obviously systematic sampling is as good as simple random sampling if the or-
dering of the units has no influence on their Y-values. But if the Y-values tend
to increase or decrease with respect to the ordering then systematic sampling
is better than simple random sampling. Systematic sampling is in some ways
similar to stratified sampling, but as the observation from the first stratum com-
pletely determines the observations from all the remaining strata the method-
ology doesn't carry across.

Finally, the distance between successive units in terms of the ordering can be
very important. For example, if the characteristic of interest is periodic in terms
of the ordering and N/n is approximately or exactly the period, then we will always
tend to sample at around the same point in the period. This can result in an
estimate that is quite bad. On the other hand, if N/n is an odd multiple of half
the period, then we will tend to alternate between sampling peaks and troughs,
and this will give a good estimate.
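
A tiny Python sketch of the selection step (N = 1000 accounts, n = 10, so the sampling
interval is N/n = 100; a single random start determines the whole sample):

    import random

    N, n = 1000, 10
    k = N // n                                   # sampling interval, here 100
    start = random.randint(1, k)                 # random account between 1 and 100
    sample = [start + j * k for j in range(n)]   # i, i+100, ..., i+900
    print(sample)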

3.3 Two stage cluster sampling

Often the within-cluster variation is expected to be small, at least compared to


between cluster variation. So it is a plausible strategy to take only a sample
of the units within each cluster, rather than all units, and use the savings in
terms of time or money to sample more clusters. The clusters are called the
first stage or ‘primary’ sampling units (psu). The individual units contained in
the clusters are second stage or ‘secondary’ sampling units (ssu).

The assumption with two stage cluster sampling is that a list of all possible
primary units is available, and the list of all secondary units can be determined
for the selected primary units, possibly at a small cost. It is not really assumed
that a list of every secondary unit exists. This makes perfect sense: for example,
imagine we are trying to estimate average household income in Perth. A list of
all households doesn't exist, but after we choose specific streets or blocks, a list
of households in these regions isn't too difficult to find.

Now the notation. We assume that there are N clusters, with sizes M_1, M_2, . . . , M_N,
and the total number of second stage units is M. Before sampling, we decide
that if cluster i is chosen then m_i secondary units will be picked from within
that cluster. Let Y_ij denote the jth unit from the ith cluster, Y_i be the total
characteristic value of the ith cluster, and let S_i² and s_i² be the population and
sample variance for the ith cluster, respectively. The value of interest is

    Ȳ = Σ_{i=1}^N Σ_{j=1}^{M_i} Y_ij / Σ_{i=1}^N M_i

which is the average Y value per second stage unit. We estimate this by

    Ȳˆ = N ȳ_T / Σ_{i=1}^N M_i,    ȳ_T = Σ_{i∈s} Y_i / n

Unfortunately Y_i is unknown and so we estimate ȳ_T by

    ȳˆ_T = Σ_{i∈s} M_i ȳ_i / n

where ȳ_i is obviously the average of the values selected from the ith cluster.
Putting this back into the definition for Ȳˆ, we get the unbiased estimator

    Ȳˆ = N ȳˆ_T / Σ_{i=1}^N M_i = (N/(nM)) Σ_{i∈s} M_i ȳ_i

Under two stage clustering the variance of the estimator comes from two sources
corresponding to the two stages, between cluster variation and within-cluster

variation. First let's look at the second stage variation. So take fixed numbers
i_1, i_2, . . . , i_n to be the clusters we have picked, so that the numbers of units selected
are m_{i_1}, m_{i_2}, . . . , m_{i_n} respectively. Having fixed these clusters we are essentially
doing stratified sampling, so

    Var(Ȳˆ | i_1, i_2 . . . i_n) = (N²/(n²M²)) Σ_{i∈s} M_i² Var(ȳ_i)
                                 = (N²/(n²M²)) Σ_{i∈s} M_i² (1/m_i − 1/M_i) S_i²

Remembering that i_1, i_2 . . . i_n are actually random, we can write this as

    (N²/(n²M²)) Σ_{i=1}^N δ_i M_i² (1/m_i − 1/M_i) S_i²

and then taking the expectation over first stage choices gives

    E[ Var(Ȳˆ | i_1, i_2 . . . i_n) ] = (n/N)(N²/(n²M²)) Σ_{i=1}^N M_i² (1/m_i − 1/M_i) S_i²
                                     = (N/(nM²)) Σ_{i=1}^N M_i² (1/m_i − 1/M_i) S_i²

as E[δ_i] = n/N. Now for the first stage variation, which is

    Var( E[Ȳˆ | i_1, i_2 . . . i_n] ) = Var( (N/(nM)) Σ_{i∈s} M_i Ȳ_i )
                                      = (N²/M²) Var( Σ_{i∈s} M_i Ȳ_i / n )
                                      = (N²/M²) (1/n − 1/N) S²_{MȲ}

where S²_{MȲ} denotes the population variance of the collection of M_i Ȳ_i values.
Using the conditional variance identity, we have

    Var(Ȳˆ) = E[ Var(Ȳˆ | i_1 . . . i_n) ] + Var( E[Ȳˆ | i_1 . . . i_n] )
            = (N/(nM²)) Σ_{i=1}^N M_i² (1/m_i − 1/M_i) S_i² + (N²/M²)(1/n − 1/N) S²_{MȲ}

If we define

    Y_T = Σ_{i=1}^N M_i Ȳ_i

then this is obviously M Ȳ, and so the estimator of the total will just be Ŷ_T = M Ȳˆ,
with variance

    (N/n) Σ_{i=1}^N M_i² (1/m_i − 1/M_i) S_i² + N² (1/n − 1/N) S²_{MȲ}

An unbiased estimate of this variance is given by

    (N/n) Σ_{i=1}^n M_i² (1/m_i − 1/M_i) s_i² + N² (1/n − 1/N) s²_{Mȳ}

This takes some explaining, as we have replaced a sum over [1, N] with a sum
over [1, n] without adding a compensating N/n factor. The main point is that, writing

    MY = Σ_{i=1}^N M_i Ȳ_i / N,    my = Σ_{i=1}^n M_i ȳ_i / n
    S²_{MȲ} = Σ_{i=1}^N ( M_i Ȳ_i − MY )² / (N − 1)
    s²_{Mȳ} = Σ_{i=1}^n ( M_i ȳ_i − my )² / (n − 1)

the expectation of s²_{Mȳ} is not in fact S²_{MȲ}, although the algebra to show
this gets very horrible very quickly.

On the other hand, we can also use ratio estimation. So define

    ȳˆ_T = (1/n) Σ_{i∈s} (M_i/m_i) Σ_j y_ij = (1/n) Σ_{i∈s} M_i ȳ_i
    m̄ = (1/n) Σ_{i∈s} M_i        (the average size of the selected clusters)
    R = Y/M,    R̂ = ȳˆ_T / m̄
    ŷ_ratio = M R̂ = M ȳˆ_T / m̄

ŷ_ratio will generally be biased, as from the ratio estimation section R̂
will be biased for R, although if n is large the bias will not be too significant.
Looking at the mean squared error of ŷ_ratio, we have

    MSE[ŷ_ratio] = E[ (ŷ_ratio − Y)² ]
                 = M² E[ (R̂ − R)² ]
                 = M² E[ ( (ȳˆ_T − R m̄)/m̄ )² ]
                 = M² E[ ( (M̄/m̄)(ȳˆ_T − R m̄)/M̄ )² ]
                 ≈ M² E[ ( (ȳˆ_T − R m̄)/M̄ )² ]

assuming that m̄ is approximately constant (≈ M̄ = M/N). Then

                 = N² E[ (ȳˆ_T − R m̄)² ]
                 = N² Var(ȳˆ_T − R m̄)

as E[ȳˆ_T − R m̄] = 0. This is equal to N² Var(z̄) if we let z_i = M_i ȳ_i − R M_i for the
selected clusters. Continuing,

    MSE[ŷ_ratio] ≈ N² ( E[ Var(z̄ | δ_1 . . . δ_N) ] + Var( E[z̄ | δ_1 . . . δ_N] ) )
                 = N² ( E[ (1/n²) Σ_{i∈s} M_i² Var(ȳ_i) ] + Var( (1/n) Σ_{i=1}^N δ_i (M_i Ȳ_i − R M_i) ) )
                 = (N/n) Σ_{i=1}^N M_i² (1/m_i − 1/M_i) S_i² + N² Var(z̄′)

where z′_i = M_i Ȳ_i − R M_i, so that

    MSE[ŷ_ratio] ≈ (N/n) Σ_{i=1}^N M_i² (1/m_i − 1/M_i) S_i² + N² (1/n − 1/N) S²_{z′}

An estimate of this mean square error is given, somehow, by

    N² (1/n − 1/N) s_r² + (N/n) Σ_{i∈s} M_i² (1/m_i − 1/M_i) s_i²

where s_r² is given by

    s_r² = (1/(n−1)) Σ_{i∈s} ( M_i ȳ_i − M_i R̂ )²

There's still the question of how we choose n and the {m_i}. Often the main
consideration is cost or budget constraints. For example, say that the ith cluster
costs c_i per sampled unit. This means that m_i c_i is the total cost of the sampling
from the ith cluster, and we can also add a fixed cost and a fixed per-cluster cost.
Taking these three together gives us

    overall cost = c_0 + n c + Σ_{i∈s} m_i c_i

and taking the expected value gives

    c_0 + n c + (n/N) Σ_{i=1}^N m_i c_i

Typically we would fix the expected total cost and then choose the {m_i} to
minimise the variance.
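
A minimal Python sketch of the unbiased two-stage estimator Ȳˆ = (N/(nM)) Σ_{i∈s} M_i ȳ_i
and of the expected-cost formula c_0 + nc + (n/N) Σ m_i c_i; all the numbers (cluster sizes,
values, costs, sample sizes) are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(3)

    # hypothetical population of N clusters with sizes M_i and unit values
    N = 50
    M_i = rng.integers(10, 40, size=N)
    clusters = [rng.normal(100, 20, size=Mi) for Mi in M_i]
    M = M_i.sum()

    # design: sample n clusters by SRS, then m = 5 units from each selected cluster
    n, m = 8, 5
    sel = rng.choice(N, size=n, replace=False)
    ybar_i = np.array([rng.choice(clusters[i], size=m, replace=False).mean() for i in sel])

    Y_hat = (N / (n * M)) * np.sum(M_i[sel] * ybar_i)   # estimate of the per-unit mean
    print(Y_hat, np.concatenate(clusters).mean())

    # expected cost with fixed cost c0, per-cluster cost c, per-unit cost c_i = 2 everywhere
    c0, c, c_i = 1000.0, 50.0, np.full(N, 2.0)
    expected_cost = c0 + n * c + (n / N) * np.sum(m * c_i)
    print(expected_cost)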

4 Sampling with unequal probabilities

So far we've mainly considered simple random sampling, where all units have
had the same chance of inclusion, but here we consider different probability
schemes. Previously we assumed we had N units with values Y_1 . . . Y_N, and
the probability that y_i came from some particular unit was 1/N. Now we are
going to allow the selection probabilities to vary, and they will be denoted by
p_1 . . . p_N. Obviously with-replacement selection is easier, as even if the selection
probabilities are not constant across units we will still have y_i independent of
y_j. So for the moment we use with-replacement selection.
Now for estimation using unequal probability sampling. Let P_1 denote the
selection probability attached to the value that y_1 takes. That is, if y_1 = Y_1 then
P_1 = p_1; if y_1 = Y_2, then P_1 = p_2. If we define t_1 = y_1/P_1 then obviously this will be
an unbiased estimator of Y = Σ_{i=1}^N Y_i, as

    E[t_1] = p_1 (Y_1/p_1) + · · · + p_N (Y_N/p_N) = Σ_{i=1}^N Y_i

    Var(t_1) = E[t_1²] − E[t_1]²
             = Σ_{i=1}^N p_i (Y_i/p_i)² − Y²
             = Σ_{i=1}^N Y_i²/p_i − Y²
             = Σ_{i=1}^N p_i ( Y_i/p_i − Y )²

We have n independent estimates t_1 . . . t_n with t_i = y_i/P_i, and so t̄ = (1/n) Σ_{i=1}^n t_i
is unbiased for Y with variance Var(t_i)/n. As we are doing with-replacement
sampling this can be estimated by

    s_t²/n = Σ_{i=1}^n (t_i − t̄)² / ( n(n−1) )

Remember that we are sampling with replacement, so units can appear multiple
times. So let Q_i denote the number of times that unit i is included in the sample,
with Σ_{i=1}^N Q_i = n and E[Q_i] = n p_i. Then

    E[t̄] = E[ Σ_{i=1}^N Q_i T_i / n ] = Σ_{i=1}^N n p_i T_i / n = Σ_{i=1}^N Y_i

where T_i = Y_i/p_i, and

    E[s_t²/n] = (1/(n(n−1))) E[ Σ_{i=1}^N Q_i (T_i − t̄)² ]
             = (1/(n(n−1))) E[ Σ_{i=1}^N Q_i ( T_i − Y + Y − t̄ )² ]
             = (1/(n(n−1))) E[ Σ_{i=1}^N Q_i ( (T_i − Y)² + 2 (Y − t̄)(T_i − Y) + (Y − t̄)² ) ]
             = (1/(n(n−1))) E[ Σ_{i=1}^N Q_i (T_i − Y)² + 2 (Y − t̄)( Σ_i Q_i T_i − n Y ) + n (Y − t̄)² ]
             = (1/(n(n−1))) E[ Σ_{i=1}^N Q_i (T_i − Y)² + 2 n (Y − t̄)(t̄ − Y) + n (Y − t̄)² ]
             = (1/(n(n−1))) ( n Σ_{i=1}^N p_i (T_i − Y)² − 2 n Var(t̄) + n E[(Y − t̄)²] )
             = (1/(n(n−1))) ( n Σ_{i=1}^N p_i (T_i − Y)² − 2 n Var(t̄) + n Var(t̄) )
             = (1/(n(n−1))) ( n Var(t_i) − n Var(t̄) )
             = (1/(n(n−1))) ( n Var(t_i) − Var(t_i) )
             = Var(t_i)/n = Var(t̄)

Simple random sampling with replacement is just the special case p_i = 1/N of this
setting. Finally, we can also see why picking p_i ∝ Y_i is really quite a
boring case. It is quite unrealistic to assume this is possible, and when it is, our
estimator has variance 0, as y_i/P_i = Y for every draw. Another way of looking at this
is that if we know all the p_i and one y_i, we can use this information about the design
to calculate Y. So in a sense the proportionality means that knowledge about the
Y_i can be replaced by knowledge about the p_i.
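
A minimal Python simulation sketch of this with-replacement unequal-probability
estimator t̄ = (1/n) Σ y_i/P_i and its variance estimate s_t²/n (the population and the
probabilities below are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(4)

    N = 100
    p = rng.uniform(0.5, 2.0, size=N)
    p = p / p.sum()                               # selection probabilities p_i, summing to 1
    Y = 10 * p * N + rng.normal(0, 1, size=N)     # values roughly proportional to p_i
    Y_total = Y.sum()

    n = 15
    draws = rng.choice(N, size=n, replace=True, p=p)   # with-replacement, unequal probabilities
    t = Y[draws] / p[draws]                            # t_i = y_i / P_i
    t_bar = t.mean()                                   # unbiased estimate of the total
    var_hat = t.var(ddof=1) / n                        # s_t^2 / n

    print(Y_total, t_bar, var_hat)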

4.1 Probability proportional to size

But now that we've decided to allow the selection probabilities to vary we need a
good method for specifying these probabilities. Assume that we again have an
auxiliary characteristic X, known for every unit in the population, and that Y is
expected to be roughly proportional to X. Then it seems very logical to choose
p_i proportional to X_i, especially as it might be difficult to pick it proportional
to the so far unobserved Y_i. This sort of sampling scheme is very common
in cluster sampling, so common that in fact the characteristic X is sometimes
called 'size'.

Sampling schemes of this type are called probability proportional to size
sampling, or πps, but they have their drawbacks. The corresponding variance
estimates may have positive probabilities of being negative, and under some
πps schemes this is always the case. Also, we cannot say that with-replacement
sampling is always worse than without-replacement sampling, so in that re-
spect unequal probability sampling is more complicated. There are many πps
sampling schemes, for example Sampford's method.

Now to actually apply this sampling scheme. Well, consider a two stage model
where our 'size' variable is cluster size. We use probability proportional to size,
with replacement, when sampling clusters. Obviously

    t_i = Y_i / P_i

is impossible as we don't know Y_i, so we instead use

    t_i = ŷ_i / P_i

which is still unbiased for Y as long as ŷ_i is unbiased for Y_i, where Y_i is the total
of the units in the ith selected cluster. Note we didn't need to know anything
about the selection of the second stage units in order to say that t_i was unbiased
for Y, except independence between the subsampling of different clusters. The
estimator

    t̄ˆ = (1/n) Σ_{i=1}^n t_i

is also unbiased for Y, with variance

    Var(t̄ˆ) = (1/n) Var( Σ_{i=1}^N δ_i ŷ_i / P_i )
            = (1/n) [ Var( E[ Σ_{i=1}^N δ_i ŷ_i / P_i | cluster ] ) + E[ Var( Σ_{i=1}^N δ_i ŷ_i / P_i | cluster ) ] ]
            = (1/n) [ Var( Σ_{i=1}^N δ_i Y_i / P_i ) + E[ Σ_{i=1}^N δ_i Var(ŷ_i) / P_i² ] ]
            = (1/n) Σ_{i=1}^N P_i ( Y_i/P_i − Y )² + (1/n) Σ_{i=1}^N Var(ŷ_i) / P_i

where δ_i indicates the cluster selected on a single draw. This variance can be estimated by

    Σ_{i=1}^n ( ŷ_i/P_i − t̄ˆ )² / ( n(n−1) )

(the simple with-replacement variance estimate of the t_i picks up both the between-cluster
and the within-cluster components of variation).

Note that in the setup we've given, without-replacement sampling really is better
than with-replacement sampling; however it is not commonly used because of the
complication involved.

Now lets go to without replacement selection. Well, for the first unit we have the
selection probabilities given by the {pi }, and for the second unit we choose with
probabilities proportional to the {pi } from the remaining units. For example,
assume that N = 4, and the selection probabilities are

p1 = 0.1, p2 = 0.2, p3 = 0.3, p4 = 0.4

Say that the first selected unit is 2. Then the selection probabilities for the
second unit conditional on having picked the first are
5 15 20
p1 = , p3 = , p4 =
40 40 40
If the next unit picked is 4, the selection probabilities for the next selection are
1 3
p1 = , p3 =
4 4
So we can see how the inclusion probabilities change over successive selections.
But the calculations are really quite messy. For example, the probability of
selecting unit 1 as the second unit is
1 1 1
0.2 ∗ + 0.3 ∗ + 0.4 ∗
8 7 6

42
To get all of these numbers we have to go back and work out the probability of
selecting unit 1 given that 3 is selected first, etc. In the simple case that n = 2
we have
    π1 = 0.1 + 0.2 × (0.1/0.8) + 0.3 × (0.1/0.7) + 0.4 × (0.1/0.6)
       = 0.1 ( 1 + Σ_{i≠1} pi/(1 − pi) )

    π2 = 0.2 ( 1 + Σ_{i≠2} pi/(1 − pi) )
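As a quick sanity check, here is a minimal Python sketch (an illustration, not part of the notes) that computes these first-order inclusion probabilities for the n = 2 successive-selection scheme using the formula above.

    # Minimal sketch: first-order inclusion probabilities for n = 2 successive
    # PPS sampling without replacement.
    p = [0.1, 0.2, 0.3, 0.4]

    def pi_first_order(p, i):
        # P(unit i drawn first) + P(unit i drawn second)
        # = p_i * (1 + sum_{j != i} p_j / (1 - p_j))
        return p[i] * (1 + sum(p[j] / (1 - p[j]) for j in range(len(p)) if j != i))

    pis = [pi_first_order(p, i) for i in range(len(p))]
    print(pis)          # individual inclusion probabilities
    print(sum(pis))     # should equal n = 2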

Something that I haven't had a chance to follow up: given that we have selected
the units i1 . . . in, define

    t1,D = Yi1 / Pi1
    t2,D = (Yi2 / Pi2)(1 − Pi1) + Yi1
    ...
    tn,D = (Yin / Pin)(1 − Pi1 − · · · − Pin−1) + Yi1 + · · · + Yin−1

Then we can use the estimator t̄D = (1/n) Σ_{i=1}^n ti,D, which has a variance which
can be unbiasedly estimated by

    (1/(n − 1)) Σ_{i=1}^n ( ti,D − t̄D )²

This turns out to have been proposed by Des Raj in 1956. Apparently it relates
to selection of the first unit with probabilities {pi } and selection of successive
units with probabilities proportional to the {pi }.
Definition 4.1.1. The Horvitz-Thompson estimator of Y can be used with
any probability sampling scheme, including both with and without replacement.
It is
    ŶHT = Σ_{i∈s} Yi/πi = Σ_{i=1}^N Zi Yi/πi

where πi is the probability that the sample contains the unit i and our indicator
variables are now going to be denoted by Zi. Another way of looking at this is
that we sample from the population {Yi/πi}. This estimator is obviously unbiased
for Y.

Some useful properties we will need are

    Σ_{i=1}^N πi = n
    Σ_{j≠i} πij = (n − 1) πi

For the first one we know that n = Σ_{i=1}^N Zi, so

    n = E[ Σ_{i=1}^N Zi ] = Σ_{i=1}^N πi

For the second identity,

    Σ_{j≠i} πij = Σ_{j≠i} P(Zi = 1, Zj = 1)
                = Σ_{j≠i} P(Zj = 1 | Zi = 1) P(Zi = 1)
                = P(Zi = 1) Σ_{j≠i} P(Zj = 1 | Zi = 1)
                = P(Zi = 1) E[ Σ_{j≠i} Zj | Zi = 1 ]
                = P(Zi = 1) E[ ( Σ_{j=1}^N Zj ) − Zi | Zi = 1 ]
                = P(Zi = 1) (n − 1) = πi (n − 1)

Going back to the Horvitz-Thompson estimator, the variance is straightforward
to calculate,

    Var(ŶHT) = Σ_{i=1}^N (Yi/πi)² Var(Zi) + ΣΣ_{i≠j} (Yi Yj)/(πi πj) Cov(Zi, Zj)

As Zi is a Bernoulli random variable, we have

    Var(Zi) = πi − πi²
    Cov(Zi, Zj) = E[Zi Zj] − πi πj = πij − πi πj

So

    Var(ŶHT) = Σ_{i=1}^N (Yi/πi)² πi (1 − πi) + ΣΣ_{i≠j} (Yi Yj)/(πi πj) (πij − πi πj)

Now to come up with an estimate of this variance. Well, we know that

    E[ Σ_{i=1}^N ai Zi ] = Σ_{i=1}^N ai πi
    E[ ΣΣ_{i≠j} aij Zi Zj ] = ΣΣ_{i≠j} aij πij

Applying this to Var(ŶHT) gives

    ai = (Yi/πi)² (1 − πi)
    aij = (Yi Yj)/(πi πj) × (πij − πi πj)/πij

So our estimate is

    Var-hat(ŶHT) = Σ_{i=1}^N Zi (Yi/πi)² (1 − πi) + ΣΣ_{i≠j} Zi Zj (Yi Yj)/(πi πj) × (πij − πi πj)/πij

This estimator will be called the Horvitz-Thompson variance estimator.


Unfortunately this unbiased estimate of the variance is often negative, and there
is always a non-zero probability of this happening. An alternative is the Yates
and Grundy estimate of the variance. To get this we go back to the actual form
of the variance,

    Var(ŶHT) = Σ_{i=1}^N (Yi/πi)² πi (1 − πi) + ΣΣ_{i≠j} (Yi Yj)/(πi πj) (πij − πi πj)
             = Σ_{i=1}^N Σ_{j=1}^N aij (Yi Yj)/(πi πj)

where

    aii = πi (1 − πi)
    aij = πij − πi πj

Following some algebra similar to that in the ratio estimation chapter, we get

    Var(ŶHT) = − ΣΣ_{i<j} aij ( Yi/πi − Yj/πj )²
             = − (1/2) ΣΣ_{i≠j} aij ( Yi/πi − Yj/πj )²

Now, similar to what we did with the Horvitz-Thompson estimator, we can
estimate this by

    − (1/2) ΣΣ_{i≠j} ( Yi/πi − Yj/πj )² aij Zi Zj / πij

This is experimentally found to be negative less frequently than the previous


estimator, and has been proven to be always positive in two important cases,
one of which is the Midzuno-Sen sampling scheme.
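To make the formulas above concrete, here is a small Python sketch (an illustration, not part of the notes) that computes the Horvitz-Thompson estimate together with both the Horvitz-Thompson and Yates-Grundy variance estimates from supplied first- and second-order inclusion probabilities.

    # Illustrative sketch: Horvitz-Thompson estimation with both variance estimators.
    # pi maps unit -> inclusion probability; pij maps an (i, j) pair (either order)
    # -> joint inclusion probability; y maps unit -> its value.

    def joint(pij, i, j):
        return pij[(i, j)] if (i, j) in pij else pij[(j, i)]

    def ht_estimate(sample, y, pi):
        return sum(y[i] / pi[i] for i in sample)

    def ht_variance_estimate(sample, y, pi, pij):
        v = sum((y[i] / pi[i]) ** 2 * (1 - pi[i]) for i in sample)
        for i in sample:
            for j in sample:
                if i != j:
                    pj = joint(pij, i, j)
                    v += y[i] * y[j] / (pi[i] * pi[j]) * (pj - pi[i] * pi[j]) / pj
        return v

    def yates_grundy_estimate(sample, y, pi, pij):
        v = 0.0
        for a in range(len(sample)):
            for b in range(a + 1, len(sample)):
                i, j = sample[a], sample[b]
                pj = joint(pij, i, j)
                v += (pi[i] * pi[j] - pj) / pj * (y[i] / pi[i] - y[j] / pi[j]) ** 2
        return v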

4.2 The Horvitz Thompson estimator with two stage sampling

Assume that we have N clusters, which are our primary sampling units, and Yi
is the total of the ith cluster. We use some arbitrary sampling scheme (choice
of πi values), which is irrelevant for our purposes, to pick a sample of n clusters.
Obviously the Horvitz Thompson estimator, assuming we actually know the
values of Yi, will be

    ŶHT = Σ_{i=1}^N Zi Yi / πi

Unfortunately the whole point is that we don't know Yi and must estimate it as
Ŷi, by using some sampling scheme on the ith cluster. Again, we don't care at
all what sampling scheme is used and this is one of the advantages of the Horvitz
Thompson estimator. Using these Ŷi values instead of Yi gives us the two stage
estimator ŶHTTS, which is still unbiased for Y so long as Ŷi is unbiased for Yi.

    ŶHTTS = Σ_{i=1}^N Zi Ŷi / πi

Applying the total variance formula,

    Var(ŶHTTS) = E[ Var( ŶHTTS | choice of clusters ) ] + Var( E[ ŶHTTS | choice of clusters ] )
               = E[ Var( ŶHTTS | Z1 . . . ZN ) ] + Var( E[ ŶHTTS | Z1 . . . ZN ] )
               = E[ Var( Σ_{i=1}^N Zi Ŷi/πi | Z1 . . . ZN ) ] + Var( E[ Σ_{i=1}^N Zi Ŷi/πi | Z1 . . . ZN ] )
               = E[ Σ_{i=1}^N Zi² Var(Ŷi)/πi² ] + Var( Σ_{i=1}^N Zi Yi/πi )
               = Σ_{i=1}^N πi Var(Ŷi)/πi² + Var( Σ_{i=1}^N Zi Yi/πi )

Obviously the second term is the Horvitz Thompson variance and the first term
is the contribution from the second stage subsampling. So this is

    Var(ŶHT) + Σ_{i=1}^N Var(Ŷi) / πi

Now for estimating this slightly different variance. Well, our starting point is
the Horvitz Thompson estimator of the variance,

    Var-hat(ŶHT) = Σ_{i=1}^N Zi (Yi/πi)² (1 − πi) + ΣΣ_{i≠j} Zi Zj (Yi Yj)/(πi πj) × (πij − πi πj)/πij

This is unobservable as the Yi are unknown, so define

    Q = Σ_{i=1}^N Zi (Ŷi/πi)² (1 − πi) + ΣΣ_{i≠j} Zi Zj (Ŷi Ŷj)/(πi πj) × (πij − πi πj)/πij
      = Σ_{i=1}^N Zi ai Ŷi² + ΣΣ_{i≠j} aij Zi Zj Ŷi Ŷj

where

    ai = (1 − πi) / πi²
    aij = (πij − πi πj) / (πi πj πij)

Our hope is that Q should estimate Var(ŶHT), but this doesn't work out exactly.
Ŷi and Ŷj are independent by assumption, so there is no problem there, but

    E[Ŷi²] = Var(Ŷi) + Yi² ≠ Yi²

and so when we look at E[Q] we do not get out Var(ŶHT); instead we get

    E[Q] = E[ E[ Q | Z1 . . . ZN ] ]
         = E[ Σ_{i=1}^N Zi ai ( Yi² + Var(Ŷi) ) + ΣΣ_{i≠j} aij Zi Zj Yi Yj ]
         = Σ_{i=1}^N πi ai ( Yi² + Var(Ŷi) ) + ΣΣ_{i≠j} aij πij Yi Yj
         ≠ Σ_{i=1}^N πi (1 − πi) (Yi/πi)² + ΣΣ_{i≠j} (Yi Yj)/(πi πj) (πij − πi πj) = Var(ŶHT)

In fact we have

    E[Q] = Var(ŶHT) + Σ_{i=1}^N πi ai Var(Ŷi)
         = Var(ŶHT) + Σ_{i=1}^N (1 − πi)/πi × Var(Ŷi)

So in fact we just have to alter Q a little bit to get an unbiased estimator for
Var(ŶHTTS).

    Var(ŶHTTS) = E[Q] + Σ_{i=1}^N Var(Ŷi)

This gives the final estimator

    Var-hat(ŶHTTS) = Q + Σ_{i=1}^N (Zi/πi) Var-hat(Ŷi)

Of course we could always have taken Q to be the Yates and Grundy variance
estimate with Yi swapped for Ŷi. If

    aij = πij − πi πj

this gives

    E[Q] = E[ E[ −(1/2) ΣΣ_{i≠j} ( Ŷi/πi − Ŷj/πj )² aij Zi Zj/πij | Z1 . . . ZN ] ]
         = −(1/2) ΣΣ_{i≠j} E[ E[ ( Ŷi/πi − Ŷj/πj )² | Z1 . . . ZN ] aij Zi Zj/πij ]
         = −(1/2) ΣΣ_{i≠j} E[ ( (Var(Ŷi) + Yi²)/πi² + (Var(Ŷj) + Yj²)/πj² − 2 Yi Yj/(πi πj) ) aij Zi Zj/πij ]
         = E[ −(1/2) ΣΣ_{i≠j} ( Yi/πi − Yj/πj )² aij Zi Zj/πij ] − (1/2) ΣΣ_{i≠j} ( Var(Ŷi)/πi² + Var(Ŷj)/πj² ) aij
         = Var(ŶHT) − (1/2) ( ΣΣ_{i≠j} aij Var(Ŷi)/πi² + ΣΣ_{i≠j} aij Var(Ŷj)/πj² )
         = Var(ŶHT) − ΣΣ_{i≠j} aij Var(Ŷi)/πi²
         = Var(ŶHT) − Σ_{i=1}^N Var(Ŷi)/πi² Σ_{j≠i} ( πij − πi πj )
         = Var(ŶHT) − Σ_{i=1}^N Var(Ŷi)/πi² ( (n − 1)πi − πi (n − πi) )
         = Var(ŶHT) − Σ_{i=1}^N Var(Ŷi)/πi² ( πi² − πi )
         = Var(ŶHT) + Σ_{i=1}^N (1 − πi) Var(Ŷi)/πi

So even with this alternative definition of Q we still have

    Var(ŶHTTS) = E[Q] + Σ_{i=1}^N Var(Ŷi)

Example 4.2.1. A sample of size 3 is taken from the collection of all cities
in WA, with inclusion probability proportional to size. For each of these three
cities a sample of households is taken in some appropriate but unspecified way.
The estimates obtained for city mean income and variance are

    Cities                          1       2       3
    estimated mean income, ȳˆi    500     340     300
    size                          100     400     500
    πi                           0.03    0.12    0.15
    Var-hat(ȳˆi)                   20      24      18

Since πi = 3 × size_i / (total number of households), these inclusion probabilities
imply that there are 10,000 households in the whole population. The joint
inclusion probabilities are

    π12 = 0.0032
    π13 = 0.0038
    π23 = 0.0166

These are given as part of the question, and can't be calculated from the data
we have. Converting to totals, we have

                       1            2            3
    Var-hat(ŷi)   20 × 100²    24 × 400²    18 × 500²
    ŷi             50,000       136,000      150,000

So we have a ŶHTTS value of

    50,000/0.03 + 136,000/0.12 + 150,000/0.15 = 3,800,000

For variance estimation we use the Yates and Grundy formula for Q, which gives

    Q = ( 50,000/0.03 − 136,000/0.12 )² × (0.03 × 0.12 − 0.0032)/0.0032
      + ( 50,000/0.03 − 150,000/0.15 )² × (0.03 × 0.15 − 0.0038)/0.0038
      + ( 136,000/0.12 − 150,000/0.15 )² × (0.12 × 0.15 − 0.0166)/0.0166
      = 1.18926231 × 10^11

We then add

    20 × 100²/0.03 + 24 × 400²/0.12 + 18 × 500²/0.15

to Q to get our variance estimate.
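For readers who want to check the arithmetic, here is a short Python sketch (illustrative only) that reproduces the numbers in Example 4.2.1.

    # Illustrative sketch reproducing Example 4.2.1.
    ybar   = [500, 340, 300]          # estimated mean income per city
    size   = [100, 400, 500]          # households per selected city
    pi     = [0.03, 0.12, 0.15]       # first-order inclusion probabilities
    var_yb = [20, 24, 18]             # estimated variances of the city means
    pij    = {(0, 1): 0.0032, (0, 2): 0.0038, (1, 2): 0.0166}

    yhat   = [n * m for n, m in zip(size, ybar)]          # estimated city totals
    var_yh = [n * n * v for n, v in zip(size, var_yb)]    # their estimated variances

    Y_htts = sum(t / p for t, p in zip(yhat, pi))         # 3,800,000

    Q = sum((yhat[i] / pi[i] - yhat[j] / pi[j]) ** 2
            * (pi[i] * pi[j] - pij[(i, j)]) / pij[(i, j)]
            for (i, j) in pij)

    var_est = Q + sum(v / p for v, p in zip(var_yh, pi))
    print(Y_htts, Q, var_est)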

4.3 Two phase sampling

We can use auxiliary variables to estimate characteristics of interest with higher


accuracy. So far we assumed that this extra information was available from
the whole population at negligible cost, but now we relax that assumption and
instead only assume that the auxiliary information can be obtained much more
cheaply than the characteristic of interest. This means that obtaining the value
of the auxiliary characteristic for the whole population may be outright too
expensive, or may be possible but would result in very few resources being
available for actually surveying the characteristic of interest.

One method often adopted is to just pick a large sample of size n from the
original population of size N , called the first phase sample, and only obtain
the value of the auxiliary variable for this large population. We then treat this
as the whole population, and apply some suitable sampling technique to take a
subsample and estimate the total of the characteristic of interest over the first
phase sample. This gives an unbiased estimate of the characteristic total over
the whole population, provided we choose an estimator that gives an unbiased
estimate of the total over the first-phase population. If the first phase sample
is large then the additional variance from carrying the estimate from the first
phase sample to the whole population is small.

Once we have our first phase sample we can apply whatever sampling technique
we want. One choice would be to use stratification. So based on the knowledge

of the auxiliary variable for the first phase sample we stratify the first phase
sample into k strata, where the ith strata consists of ni units. We use some
allocation method, and end up with the allocation of mi units to the ith strata.
Now, let ȳ denote the average of the characteristic over the first phase sample
and ȳi is the average over the whole ith strata from the first phase population.
So we now treat ȳ as a population characteristic and want to estimate it. So
using stratification we have

    ȳˆ = Σ_{i=1}^k wi ȳˆi

where wi = ni/n is the proportion of the first phase sample that lies in the ith
strata, and is also random. Its expectation is Wi = Ni/N. In fact this first phase
sample mean also estimates the original population mean, so Ȳˆ = ȳˆ. Obviously
this estimator is still unbiased for Ȳ . So if δi is the random variable denoting
the inclusion of unit i in the first phase sample then
    
    E[ȳˆ] = E[ E[ ȳˆ | δ1 . . . δN ] ] = E[ȳ] = Ȳ

This follows simply because the stratified estimator is unbiased for the first phase
population mean, and the first phase population mean is unbiased for the whole
population mean as it is obtained via a simple random sample. Converting to
an estimator of the whole population mean increases the variance, so if Si² is
the variance of the ith strata units from the original population,

    Var( ȳˆ | n1 . . . nk ) = Var( Σ_{i=1}^k wi ȳˆi | n1 . . . nk )
        = Σ_{i=1}^k wi² Var( ȳˆi | n1 . . . nk )
        = Σ_{i=1}^k wi² [ E( Var( ȳˆi | δ1 . . . δN ) | n1 . . . nk ) + Var( E( ȳˆi | δ1 . . . δN ) | n1 . . . nk ) ]
        = Σ_{i=1}^k wi² [ E( (1/mi − 1/ni) si² | n1 . . . nk ) + Var( ȳi | n1 . . . nk ) ]
        = Σ_{i=1}^k wi² [ (1/mi − 1/ni) Si² + (1/ni − 1/Ni) Si² ]
        = Σ_{i=1}^k wi² (1/mi − 1/Ni) Si²

Note that we assumed that mi was constant, which is clearly not the case as
it is bounded above by ni , which is random. But if we assume that ni > mi

with probability approximately 1, then this assumption makes sense. Finally, to
derive the unconditional variance we also need some more information about
the wi.

    E[wi] = Wi
    Var(wi) = (1/n − 1/N) × N Wi (1 − Wi) / (N − 1)
    E[wi²] = Var(wi) + ( E[wi] )²
           = (1/n − 1/N) × N Wi (1 − Wi) / (N − 1) + Wi²
    Cov(wi, wj) = − (N − n)/(N − 1) × Ni Nj/(n N²) = − (1/n − 1/N) × N/(N − 1) × Wi Wj
Then going back to the variance,

    Var(ȳˆ) = E[ Var( ȳˆ | n1 . . . nk ) ] + Var( E[ ȳˆ | n1 . . . nk ] )
            = Σ_{i=1}^k E[wi²] (1/mi − 1/Ni) Si² + Var( Σ_{i=1}^k wi Ȳi )
            = Σ_{i=1}^k E[wi²] (1/mi − 1/Ni) Si² + Σ_{i=1}^k Var(wi) Ȳi² + ΣΣ_{i≠j} Ȳi Ȳj Cov(wi, wj)
            = Var(ȳst) + (1/n − 1/N) N/(N − 1) Σ_{i=1}^k Wi (1 − Wi)(1/mi − 1/Ni) Si²
              + Σ_{i=1}^k Var(wi) Ȳi² + ΣΣ_{i≠j} Ȳi Ȳj Cov(wi, wj)
            = Var(ȳst) + (1/n − 1/N) N/(N − 1) Σ_{i=1}^k Wi (1 − Wi)(1/mi − 1/Ni) Si²
              + (1/n − 1/N) N/(N − 1) Σ_{i=1}^k Wi (1 − Wi) Ȳi²
              − (1/n − 1/N) N/(N − 1) ΣΣ_{i≠j} Wi Wj Ȳi Ȳj
            = Var(ȳst) + (1/n − 1/N) N/(N − 1) [ Σ_{i=1}^k ( Wi (1 − Wi)(1/mi − 1/Ni) Si² + Wi Ȳi² )
              − Σ_{i=1}^k Wi² Ȳi² − ΣΣ_{i≠j} Wi Wj Ȳi Ȳj ]
            = Var(ȳst) + (1/n − 1/N) N/(N − 1) [ Σ_{i=1}^k ( Wi (1 − Wi)(1/mi − 1/Ni) Si² + Wi Ȳi² )
              − Σ_{i=1}^k Ȳi Wi Σ_{j=1}^k Wj Ȳj ]
            = Var(ȳst) + (1/n − 1/N) N/(N − 1) [ Σ_{i=1}^k Wi ( (1 − Wi)(1/mi − 1/Ni) Si² + Ȳi² ) − Ȳ² ]
            = Var(ȳst) + (1/n − 1/N) N/(N − 1) Σ_{i=1}^k Wi [ (1 − Wi)(1/mi − 1/Ni) Si² + ( Ȳi − Ȳ )² ]

As for estimating this, if ni is very large compared to mi then an approximate
estimator is

    Var-hat(Ȳˆ) = Σ_{i=1}^k [ ni (ni − 1) / ( n(n − 1) ) ] × si²/mi + [ 1/(n − 1) ] Σ_{i=1}^k wi ( ȳˆi − ȳˆ )²

Example 4.3.1. Say that we are dealing with a population of size 10, 000, and
we want to determine average income. It is found that the location where a
person lives is relevant to determining their income but this cannot be collected
across the whole population, probably due to resource constraints. So 1000
people are selected and divided into three strata according to where they live -
Wealthy regions, medium wealthy, and poor. We find that approximately 10%
of people live in wealthy areas, 30% live in medium wealthy areas and 60%
live in poor areas. We then select 100 of these 1000 people and measure their
income. The data is

    Strata      Wealthy   Medium wealthy   Poor
    units          25           40           20
    ȳˆi            40           20           10
    ŝi²             8            2            1

Obviously this means our overall estimate will be

    ȳˆ = 0.1 × 40 + 0.3 × 20 + 0.6 × 10 = 16

and so Ȳˆ is also 16. As for the variance estimate of this estimator,

    Var-hat(Ȳˆ) = (100 × 99)/(1000 × 999) × 8/25 + (300 × 299)/(1000 × 999) × 2/40 + (600 × 599)/(1000 × 999) × 1/20
                + (1/999) [ (100/1000)(40 − 16)² + (300/1000)(20 − 16)² + (600/1000)(10 − 16)² ]
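A compact Python sketch of this two-phase calculation (purely illustrative, using the figures from Example 4.3.1 with first-phase stratum counts 100, 300 and 600) might look like the following.

    # Illustrative sketch of the two-phase (double sampling) stratified estimate
    # and its approximate variance, using the figures from Example 4.3.1.
    n  = 1000                       # first phase sample size
    ni = [100, 300, 600]            # first phase units falling in each stratum
    mi = [25, 40, 20]               # second phase units measured in each stratum
    ybar_i = [40, 20, 10]           # second phase stratum means
    s2_i   = [8, 2, 1]              # second phase stratum variances

    wi = [a / n for a in ni]
    ybar = sum(w * y for w, y in zip(wi, ybar_i))          # 16

    var_hat = sum(a * (a - 1) / (n * (n - 1)) * s2 / m
                  for a, m, s2 in zip(ni, mi, s2_i))
    var_hat += sum(w * (y - ybar) ** 2 for w, y in zip(wi, ybar_i)) / (n - 1)
    print(ybar, var_hat)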

As an alternative to double sampling with stratification, we can also do double
sampling with ratio estimation. Again we pick a large first-phase sample of size
n and get the value of the auxiliary characteristic, and then we pick a subsample
of size m on which to obtain the value of Y . Our estimator is then
    ȳˆratio = Ȳˆ = ( ȳˆ / x̄ˆ ) x̄

We need slightly different notation, so say we have N units, a first phase sample
of size n is picked via simple random sampling and then a second phase sample
of size m is picked, again with simple random sampling. Now, define Zi to
be the indicator random variables denoting inclusion in the first phase sample.
Taking the expansion we originally used in the ratio estimation section leads to
a fairly horrible mess here, so instead we expand around every variable, giving

    ȳˆratio ≈ (Ȳ X̄)/X̄ + (X̄/X̄)( ȳˆ − Ȳ ) + (Ȳ/X̄)( x̄ − X̄ ) − (Ȳ X̄/X̄²)( x̄ˆ − X̄ )
Note that this expansion only says that the estimator is roughly unbiased, so it's
not that good. But looking at the mean square error, we have

    MSE(ȳˆratio) = E[ ( ȳˆ − Ȳ + (Ȳ/X̄)( x̄ − X̄ ) − (Ȳ/X̄)( x̄ˆ − X̄ ) )² ]
                 = E[ ( ȳˆ − Ȳ + (Ȳ/X̄)( x̄ − x̄ˆ ) )² ]
                 = Var( ȳˆ + (Ȳ/X̄)( x̄ − x̄ˆ ) )
                 = Var( E[ ȳˆ + (Ȳ/X̄)( x̄ − x̄ˆ ) | Z1 . . . ZN ] ) + E[ Var( ȳˆ + (Ȳ/X̄)( x̄ − x̄ˆ ) | Z1 . . . ZN ) ]
                 = Var(ȳ) + E[ Var( ȳˆ − (Ȳ/X̄) x̄ˆ | Z1 . . . ZN ) ]
                 = (1/n − 1/N) S_Y² + E[ (1/m − 1/n) s²_{y − (Ȳ/X̄)x} ]
                 = (1/n − 1/N) S_Y² + (1/m − 1/n) E[ s²_{y − (Ȳ/X̄)x} ]
                 = (1/n − 1/N) S_Y² + (1/m − 1/n) S²_{y − (Ȳ/X̄)x}

where S²_{y − (Ȳ/X̄)x} is a population value and s²_{y − (Ȳ/X̄)x} is the same value, but over the
first phase sample. We can also apply two-stage sampling to the non-response
problem. We do this by taking a sub-sample of the non-responders, and using
more resources or trying harder than we originally did, to get values from these
non-responders. Finally, another application of two-stage sampling is to perform
probability proportional to size sampling where the size variable is unknown.

5 Non-response

So far we have assumed that if i ∈ S then we can determine Yi , but this is not
always true, and is a very serious problem with mail and telephone surveys in
particular. Non-response problems even occur with many censuses. The point
is that those who respond to the survey may be very different from those who
do not, and this introduces a bias into the results. As an example, assume that
we are trying to measure the effect of a new measure or law on pharmacies.
That is, we want to know the dollar value of the loss they have incurred as a
result of the new measure.

Assume that we can categorize pharmacies into two sorts, large and small, and
that large pharmacies lose on average 10, 000 and small pharmacies lose on
average 3, 000. As 20% of pharmacies are large and 80% are small, we have the
population value

ȳ = 0.2 × 10, 000 + 0.8 × 3, 000 = 4400

Of course we don’t know this value and want to estimate it, and so we use
a mail survey. We send this survey to all pharmacies, but the response rates
turn out to differ across small and large pharmacies. Large pharmacies may
employ people to deal with this sort of query, so assume that their response
rate for our survey is 90%. On the other hand, smaller pharmacies may not
have anyone to deal with this sort of thing, so their response rate is 40%. This
means that we end up with 0.2 × 0.9 = 18% of our surveys being returned by
large pharmacies, 0.8 × 0.4 = 32% being returned by small pharmacies, and
50% are not returned. The response rate is 50%, and so we have 36% of our
results coming from large pharmacies and 64% coming from small pharmacies.
Conditional on these return rates, if we ignore the non-response problem our
estimator is basically going to be

    0.36 × ȳl + 0.64 × ȳs

where ȳs is the average over the 40% of sampled small pharmacies that responded
and ȳl is the average over the 90% of sampled large pharmacies that responded.
This gives an expected value of

    0.36 × 10,000 + 0.64 × 3,000 = 5,520

an error of roughly 25% relative to the true value of 4,400.

So it is important to reduce the amount of non-response, and where non-response


is unavoidable, to compensate for it. This means that at the design stage we
have to identify factors that could cause non-response and try and avoid them.
We can do this by looking at previous similar studies and the difficulties they
had, or via pilot studies. One problem common with mail surveys is that to
send the questionnaire back a stamp and envelope are needed, and this can
be a bit of a barrier. The solution is to include a stamped envelope. Other

things we can do to reduce non-response are send a reminder call, and give
advance notice. Obviously as resources are limited we will sometimes have to
choose between these three and again, a pilot study might help identify which
is most effective. Apparently it is found experimentally that the reminder call
is most effective, giving advance notice is the next most effective, and including
a stamped envelope is the least effective.

5.1 Dealing with non-response

Assume that for every unit in the population there are certain factors affecting
non-response. So we can assign a probability of response φi to every unit
i, and more importantly we can hope to estimate this quantity. Also define Ui
to be the indicator random variable for response. That is, Ui is 1 if unit i
responds, and 0 otherwise.

We know that the standard Horvitz-Thompson estimator is

    Σ_{i∈s} Yi/πi = Σ_{i=1}^N Zi Yi/πi

Modified for non-response it becomes

    Σ_{i=1}^N Ui Zi Yi / ( φi πi )

That is, our new inclusion random variable is Ui Zi , and we have the problem
that part of the sample selection mechanism is now determined by some exter-
nal randomness. Obviously to use our new estimator we are going to need to
estimate φi , probably by using some sort of external information. For example,
age might be an important determinant of non-response, in that younger people
tend to be busy and therefore will not respond, but older people are not, and
so they will tend to respond more often.
Example 5.1.1. Say we divide the population into three groups, young, middle
aged and old, denoted by Y, M and O. Then we perform a survey, and we find
that 30% of the younger group responds, similarly about 23% of the middle-aged group
and 50% of the older group. So we estimate that φi = 0.3 for any unit in the
young group, φi = 0.23 for the middle aged group and φi = 0.5 for the older
group. In the same survey say that we had n = 100, and of our sampled units
we found that 40 were in the young group, 30 in the middle aged group and 30
in the old group. So our total data is

Y M O
Sampled 40 30 30
Responded 12 7 15

Obviously our value of Ŷ, if we could compute it, would be

    Ŷ = (N/100) Σ_{i∈s} Yi = N Σ_{i=1}^3 Wi ȳi

Recall that if we use proportional allocation then the stratified estimator is the
same as the whole-sample average. So we will require W1 = 0.4, W2 = 0.3, W3 =
0.3, which is probably approximately accurate. But as some of the y-values are
unknown we instead use

    Ŷˆ = N ( 0.4 ȳ1(r) + 0.3 ȳ2(r) + 0.3 ȳ3(r) )

where ȳi(r) is the sample average of the units in the ith strata which actually
responded. So Ŷˆ looks like a stratified estimator, specifically post-stratification
as the number of units selected from each strata is random, although we have
to assume that wi ∼ Wi.

Continuing, assume for the moment that the non-response is not deliberate.
That is, units are not responding simply because they are busy, can’t be both-
ered, etc. The alternative is that non-response is because the respondents fear
the consequences of answering the question accurately. The question 'Are you
a drug user?' is an example of such a question. These questions are termed
'sensitive questions'.

Now we generalise slightly by allowing the form of the estimator to be somewhat
different. We take a sample and intend to estimate Y = Σ Yi by

    Ŷ = Σ_{i∈s} ai Yi

If we are using the Horvitz-Thompson estimator then we will have ai = 1/πi. Due
to non-response not all values of Yi will be obtained. But if we know φi we can
use the alternative estimator

    Ŷˆ = Σ_{i∈s} (Wi/φi) ai Yi

in place of the original estimator

    Ŷ = Σ_{i∈s} ai Yi

where Wi is the random variable indicating whether unit i responds.
Obviously the φi can't be known exactly - how can we know the exact chance
that a given person will respond to some survey? So there are two realistic
methods of getting this information. The first is to model φi as a function of
some auxiliary variables - for example age, income, etc. The second is to use
post stratification, with the assumption that when some auxiliary information
does not vary too much then the response probability also does not vary too
much.

Example 5.1.2. Say that we attempt to survey 100 units, but only 80 respond.
So we stratify the 100 units into 3 different ages, and end up with

1 2 3
Strata < 25 25 - 45 45+
Sampled 20 50 30
Responded 12 40 28

Then our estimates are

    φ1 = 12/20,   φ2 = 40/50,   φ3 = 28/30

and so our actual estimate of the characteristic is

    Ŷˆ = Σ_{1st strata} ai Yi / (12/20) + Σ_{2nd strata} ai Yi / (40/50) + Σ_{3rd strata} ai Yi / (28/30)
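As an illustration (not from the notes), a small Python sketch of this response-probability adjustment, assuming equal weights ai = N/n as in a simple random sample and made-up y-values, could look like this.

    # Illustrative sketch: adjusting an estimator of a total for non-response
    # using estimated response probabilities per post-stratum.
    # Assumed setup: simple random sample of n from N, so a_i = N / n for every unit.

    def adjusted_total(N, n, responses_by_stratum, phi_by_stratum):
        """responses_by_stratum: {stratum: list of observed y-values (responders only)}
        phi_by_stratum: {stratum: estimated response probability}"""
        a = N / n
        return sum(a * y / phi_by_stratum[h]
                   for h, ys in responses_by_stratum.items()
                   for y in ys)

    # Made-up y-values for the three age strata of Example 5.1.2:
    responses = {1: [30] * 12, 2: [50] * 40, 3: [20] * 28}
    phi = {1: 12 / 20, 2: 40 / 50, 3: 28 / 30}
    print(adjusted_total(N=10000, n=100, responses_by_stratum=responses, phi_by_stratum=phi))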

5.2 Non response for sensitive questions

Say we have some ‘sensitive question’ which we feel respondents will not be
willing to answer because of the consequences of doing so. But assume that if
the respondent is convinced that his answer will not be identifiable, then the
respondent will be willing to answer the question fully. Here we deal with one
particular design of an experiment which involves asking a sensitive question,
apparently due to Warner. First to illustrate this design by example. Say we
have 100 sheets of paper, these sheets of paper are randomly assigned to respon-
dents with replacement, and we are interested in estimating the proportion
p of the population who are drug users. 30 of these sheets of paper instruct the
respondent to answer ‘yes’, 20 instruct the respondent to answer ‘no’, and the
remaining 50 instruct the respondent to answer the question truthfully.

Now to go back and make this more rigorous. Assume that every unit has a
chance π of being instructed to answer the question correctly, and otherwise
are instructed on how to answer the question, which happens with probability
(1 − π). Those who are instructed how to answer are instructed with probability
γ to answer yes. Now, let Yi be the indicator random variable which is 1 if the
person possesses the characteristic of interest. In our case, Yi is 1 if and only if
unit i is a drug user. Another random variable Zi is defined only over the units
which we select to survey, and is 1 if the person actually answers yes.

The event that some respondent is instructed to answer the question correctly
is independent of the event that some other respondent is instructed to answer
correctly. So we have

P (Zi = 1) = γ(1 − π) + pπ

and so

    E[z̄] = γ(1 − π) + pπ
    p̂ = ( z̄ − γ(1 − π) ) / π
    Var(p̂) = (1/π²) Var(z̄)
    Var-hat(p̂) = (1/π²) Var-hat(z̄)
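To illustrate the mechanics, here is a hedged Python sketch (not from the notes) that simulates this randomized response design with replacement and recovers p with the estimator above; the sample size and true proportion used are made up.

    # Illustrative simulation of Warner-style randomized response (with replacement).
    import random

    def simulate_randomized_response(n, p_true, pi_truth, gamma, seed=0):
        """pi_truth: chance a respondent is told to answer truthfully;
        gamma: among the rest, the proportion instructed to answer 'yes'."""
        rng = random.Random(seed)
        answers = []
        for _ in range(n):
            is_user = rng.random() < p_true               # the sensitive characteristic
            if rng.random() < pi_truth:
                answers.append(is_user)                   # truthful answer
            else:
                answers.append(rng.random() < gamma)      # forced answer
        z_bar = sum(answers) / n
        p_hat = (z_bar - gamma * (1 - pi_truth)) / pi_truth
        return p_hat

    print(simulate_randomized_response(n=10000, p_true=0.15, pi_truth=0.5, gamma=0.6))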
We can extend Warner's idea to situations where the Yi can take arbitrary values,
and don't have to be simply a yes/no answer. Say that Yi has k possible values,
denoted by X1 . . . Xk. Then we ask the respondent to answer truthfully with
probability π, and otherwise we ask a proportion γ1 of the respondents to give
answer X1, γ2 to give answer X2, etc. Obviously we require Σ_{j=1}^k γj = 1, which
means that the distribution of Zi is

    Value    Probability
    X1       (1 − π) γ1
    X2       (1 − π) γ2
    ...      ...
    Xk       (1 − π) γk
    Yi       π

where Yi is the true value for that respondent. Also,

    E[Zi] = π Yi + (1 − π) Σ_{j=1}^k γj Xj

On the other hand, what if instead of assigning the questions with replacement
we instead assign them without replacement? This means that instead of seeing
people individually and giving each person a randomly selected card or question,
we instead see n people at once and distribute n cards among them. In this
case we will want πn to be an integer, and

    P( Zi = 1 | δ1 . . . δN ) = γ(1 − π) + p′ π

where δ1 . . . δN are the inclusion random variables on the units in the sample,
and p′ is the proportion of people in the sample who are drug users. So

P (Zi = 1) = γ(1 − π) + pπ

If we assume that the Yi are fixed, then we have a population Y1 . . . YN, with

    p = Σ_{i=1}^N Yi / N

This is not very helpful as our setup only allows us to observe the
Zi, not the Yi. But the population of values Zi still relates to the Yi in some
sense, so there are still things we can do. Let p′ denote the proportion of people
from a specified sample of size n which have the characteristic Yi. That is, p′ is
random as the sample of people we observe is random. Then

    E[z̄] = E[ E[ z̄ | δ1 . . . δN ] ]
         = E[ γ(1 − π) + p′ π ] = γ(1 − π) + pπ

This says that the estimate

    p̂ = ( z̄ − γ(1 − π) ) / π

is unbiased for p. When it comes to the variance,

    Var(p̂) = Var( ( z̄ − γ(1 − π) ) / π )
           = Var( z̄/π ) = (1/π²) Var(z̄)
           = (1/π²) ( Var( E[ z̄ | δ1 . . . δN ] ) + E[ Var( z̄ | δ1 . . . δN ) ] )
           = (1/π²) ( Var( γ(1 − π) + p′ π ) + E[ Var( z̄ | δ1 . . . δN ) ] )
           = Var(p′) + (1/π²) E[ Var( z̄ | δ1 . . . δN ) ]

Now for Var( z̄ | δ1 . . . δN ). Well, conditional on δ1 . . . δN, we have

    z̄ = ( (1 − π) γ n + s ) / n

where s is the number of people asked to answer truthfully who answer yes. So
continuing,

    Var(p̂) = Var(p′) + (1/π²) E[ Var( ( (1 − π) γ n + s ) / n | δ1 . . . δN ) ]
           = Var(p′) + (1/π²) E[ Var( s/n | δ1 . . . δN ) ]
           = Var(p′) + (1/π²) E[ Var( π s/(π n) | δ1 . . . δN ) ]
           = Var(p′) + (1/π²) π² E[ Var( s/(π n) | δ1 . . . δN ) ]
           = Var(p′) + E[ Var( s/(π n) | δ1 . . . δN ) ]

So the second term is basically a sort of two stage sample, which becomes

    Var(p̂) = Var(p′) + E[ ( 1/(π n) − 1/n ) × n (1 − p′) p′ / (n − 1) ]

which is not too hard to work out.

6 Variance Estimation

6.1 Half-samples

If we apply a complex sampling scheme, using several different sampling tech-


niques, then calculating variance estimates becomes more difficult. We can still
try and work out the algebra, and compute estimates recursively, but past a
point this probably becomes infeasible. So we turn to certain numerical approx-
imations instead.
Example 6.1.1. Say we have some really complex survey, we have the resources
to sample n units and we know that our estimates of the characteristic are
unbiased. But instead we sample n/2 units and do this twice, with replacement of
each size n/2 sample. We end up with two independent unbiased estimates T1, T2
of the characteristic we're interested in, and the average

    T̄ = (T1 + T2) / 2

is also unbiased with half the variance. This is not the same as the pooled
estimate, as the pooled sample can contain a unit twice whereas a sample of
size n can't. So

    σ̂² = ( (T1 − T̄)² + (T2 − T̄)² ) / (2 − 1)
       = T1² − 2 T̄ T1 + T̄² + T2² − 2 T̄ T2 + T̄²
       = T1² + T2² − 2 T̄ (T1 + T2) + 2 T̄²
       = T1² + T2² − (T1 + T2)² + (1/2)(T1 + T2)²
       = (1/2)(T1 + T2)² − 2 T1 T2
       = (1/2)( T1² + T2² + 2 T1 T2 − 4 T1 T2 )
       = (1/2)( T1² + T2² − 2 T1 T2 )
       = (1/2)(T1 − T2)²

is an unbiased estimate of the variance of T1, which means that, as Var(T̄) = Var(T1)/2,
we have that σ̂²/2 is an unbiased estimator of Var(T̄). Of course using
exactly the same sampling scheme we could have somehow used the first sample
to get an estimate σ̂1² of Var(T1) and σ̂2² of Var(T2). Then the estimate

    ( σ̂1² + σ̂2² ) / 2

will almost certainly be better than (1/2)(T1 − T2)². But it requires us to do the
variance estimation directly, which we are avoiding. Note that the variance we
have estimated is not the same as the variance of the estimate from a single
sample of size n; it is the variance of the estimate from half-samples.

More generally, if we have the resources to collect a certain amount of data,
we can instead take k with-replacement samples each using 1/k of the resources, and make the
corresponding variance estimate. When k is large this procedure works well. We can
extend this idea of 'splitting' the data further, by making the splits arbitrary.
Example 6.1.2. Say we take two samples with replacement. The first time we
select units 5, 8, 9, 2 and the second time we select units 7, 13, 5, 4 for 8 units in
total. Now, we observed one possible split. But we know that we are just as
likely to observe the split we did, as we are to observe the split 5, 4, 7, 8 and
2, 5, 9, 13. So the split into these two groups can also be used to give us an
estimate of the variance. So we can look at all possible splits, get a variance
estimate from each split and then average across the splits. Of course these
variance estimates are all rather dependent, so we won't get an improvement in
estimation quality like we do in the case with iid samples X1 . . . Xn from some
distribution, and

    Var(X̄) = Var(X1) / n                                                    (9)
But we will still get some improvement. And as previously stated, the variance
we end up with may not be the variance of the pooled estimate.
Proposition 6.1.1. Say that we are doing stratified simple random sampling
with k strata, and allocations of n1, n2 . . . nk. Let ȳˆ1 and ȳˆ2 be independent
estimators using this sampling scheme. Then

    ( ȳˆ1 − ȳˆ2 )² / 4

approximately estimates the variance of ȳˆ3, which comes from stratified simple
random sampling with allocations 2n1, 2n2, . . . 2nk.

Proof. It should be obvious that (1/2)( ȳˆ1 − ȳˆ2 )² estimates the variance of ȳˆ1. This
variance is

    Σ_{i=1}^k Wi² (1/ni − 1/Ni) Si² ∼ Σ_{i=1}^k Wi² Si² / ni

So ( ȳˆ1 − ȳˆ2 )² / 4 estimates

    Σ_{i=1}^k Wi² Si² / (2 ni) ∼ Σ_{i=1}^k Wi² ( 1/(2 ni) − 1/Ni ) Si² = Var(ȳˆ3)
Example 6.1.3. Say we are doing stratified sampling with 3 strata, and we
decide to take k = 2 from above. That is, we apply the same sampling scheme
twice, with replacement. Our sampling scheme is to pick 2 units from stratum
1, 2 from stratum 2 and 1 from stratum 3. The strata sizes are given by
W1 = 0.5, W2 = 0.4, W3 = 0.1 and obviously the within-strata sampling is
without replacement every time. Say that our observed data is

Sample 1 Sample 2
Strata 1 7(20), 12(14) 5(8), 15(12)
Strata 2 9(32), 19(21) 4(29), 6(26)
Strata 3 2(82) 8(45)

where the actual value observed for each unit is the number in brackets. Then
our two estimators from our two samples are

Ȳˆ1 = 0.5 × 17 + 0.4 × 26.5 + 0.1 × 82 = 27.3


Ȳˆ2 = 0.5 × 10 + 0.4 × 27.5 + 0.1 × 45 = 20.5

This gives a variance estimate of

    (1/4)( Ȳˆ1 − Ȳˆ2 )² = 11.56

which is in fact approximately an estimate of the variance of Ȳˆ3 , the stratified


estimator using an allocation of 4, 4, 2. Of course, we could take some other
split of the data into two parts. Note that if both samples include the same unit,
then we disallow all splits which would result in this unit being included twice
in either subsample. It turns out that if we take all allowable splits, calculate
the variance estimate from each one and then average these estimates over all
the possible splits, we end up with the standard estimate for the variance of a
sample mean with stratified sampling. That is, with selection of 4 units from
strata 1, 4 from strata 2 and 2 from strata 3. Some of the variances we end up
getting from these splits, in this case, are 11.56, 0.09, 25, 2.46. So the individual
estimates are highly variable, but taking the average smooths this out to an
extent.
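A hedged Python sketch of this split-and-average idea (illustrative only; it enumerates the allowable splits of the pooled stratified data from Example 6.1.3 and averages the resulting variance estimates) might look like the following.

    # Illustrative sketch: averaging half-sample variance estimates over all splits
    # of the pooled stratified data from Example 6.1.3 (no duplicated units here,
    # so every split is allowable).
    from itertools import combinations
    from statistics import mean

    W = [0.5, 0.4, 0.1]                                      # stratum weights
    pooled = [[20, 14, 8, 12], [32, 21, 29, 26], [82, 45]]   # pooled y-values per stratum

    def stratified_mean(per_stratum_values):
        return sum(w * mean(vals) for w, vals in zip(W, per_stratum_values))

    def stratum_splits(vals):
        # all ways to put half of the values into the first half-sample
        k = len(vals) // 2
        for idx in combinations(range(len(vals)), k):
            first = [vals[i] for i in idx]
            second = [vals[i] for i in range(len(vals)) if i not in idx]
            yield first, second

    estimates = []
    for s1 in stratum_splits(pooled[0]):
        for s2 in stratum_splits(pooled[1]):
            for s3 in stratum_splits(pooled[2]):
                t1 = stratified_mean([s1[0], s2[0], s3[0]])
                t2 = stratified_mean([s1[1], s2[1], s3[1]])
                tbar = (t1 + t2) / 2
                estimates.append(((t1 - tbar) ** 2 + (t2 - tbar) ** 2) / 2)

    print(estimates[0], mean(estimates))   # first split's estimate and the average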
Example 6.1.4. One more attempt to show that the half-samples method
makes sense. Say that we are using simple random sampling, and we have
enough resources to sample n units, but instead we take two independent with-

63
n
out replacement samples of size 2, whoose means are denoted by ȳ1 , ȳ2 . Then
ȳ1 + ȳ2
ȳ =
2
Var (ȳ1 ) + Var (ȳ2 )
Var (ȳ) =
4 
1 S2

Var (ȳ1 ) 1
= = n −
2 N 2
  22
2 1 S
= −
n N 2
Note that ȳ does not come from a simple random sample without replacement
of size n. Now, if N is much greater than n we will have N1 ' 0 which means
that
S2
Var (ȳ) '
n
2
and we usually estimate this by sn . Now, how else can we arrive at this value?
Well, say we take a sample of size n, and split it into two samples t1 , t2 of
size n2 , with the average of both denoted by t̄. Note that as we selected a
single sample of size n we don’t have to worry about either of our subsamples
containing a single unit twice, which might happen if we took two samples of
size n2 and then combined them - This might be inconsistent with our original
sampling scheme.

Anyway, we can use this split into t1 , t2 to construct the estimate of the variance
of t1 ,
!
2 2
1 (t1 − t̄) + (t2 − t̄) 2
= (t1 − t̄) (10)
2 1

But if we now consider our population to be the size n sample originally selected,
we have

    E[ (t1 − t̄)² ] = Var(t1) = ( 2/n − 1/n ) s² = s²/n

And if we average over all possible half samples we end up actually computing
this expectation, as

    E[ (t1 − t̄)² ] = ( (t1 − t̄)² + · · · + (tk − t̄)² ) / k = s²/n

where k = (n choose n/2). So in this case, taking the variance estimates from all the pos-
sible half samples and averaging them gives us pretty much the same variance
estimate as we normally use for ȳ defined as the average of estimators from sim-
ple random samples. Importantly, S²/n is also an approximation to the variance
of ȳˆ if this comes from a simple random sample without replacement of size n.
So the estimate we constructed also approximately estimates Var(ȳˆ). Again,
notice that in the second half of this example, instead of combining two samples of size n/2 we
split one of size n into two parts of size n/2.

The last example was a bit unclear, so hopefully the next one makes clear the
difference between combining two independent samples, and splitting in half a
single sample.
Proposition 6.1.2. Assume that we have k strata, and we take n1, n2 . . . nk
units from each strata to form a first phase sample, where all the ni are divisible
by 2. Let ȳˆ be the stratified estimator using this sampling scheme, and let ȳˆi
denote the estimator from the ith half sample of the selected units, where i ranges
from 1 to

    j = (n1 choose n1/2) × (n2 choose n2/2) × · · · × (nk choose nk/2)

Then

    Var(ȳˆ) ≃ (1/j) Σ_{i=1}^j ( ȳˆi − ȳˆ )²

Proof. Well, let ȳˆ′ denote the estimator from a randomly chosen half-sample of the first phase
sample.

    E[ ( ȳˆ′ − ȳˆ )² ] = E[ E[ ( ȳˆ′ − ȳˆ )² | first phase sample ] ]
                      = E[ Σ_{i=1}^k Wi² ( 1/(ni/2) − 1/ni ) si² ]
                      = E[ Σ_{i=1}^k Wi² si² / ni ]
                      = Σ_{i=1}^k Wi² Si² / ni ≃ Var(ȳˆ)

The expectation can of course be given as the sum over all possible outcomes,
so in this case over the j half-samples from the first-phase sample.

Example 6.1.5. Say that we take two samples, and the first time we observe
units (5, 25) which have y-values (4000, 5200), and the second time we observe
units (29, 6) which have y-values (8146, 2749). The whole pooled sample is
(5, 6, 25, 29), and so the possible splits are

Sample 1 Sample 2
5,6 25,29
5,25 6,29
5,29 6,25
25,29 5,6
6,29 5,25
6,25 5, 29

From these we get our variance estimates

(t1,1 − t̄)2 , (t1,2 − t̄)2 . . . (t1,6 − t̄)2

where t1,i refers to the estimate using the first half of the ith half-sample. We
then average all these estimates to get a total estimate that happens to coincide
with the usual variance estimate. Note that we only use the first part of every
split, in line with (10). Of course, we also have
    (1/12) Σ_{i=1}^6 Σ_{j=1}^2 ( tj,i − t̄ )²
      = (1/6) Σ_{i=1}^6 (1/2) [ ( t1,i − t̄ )² + ( t2,i − t̄ )² ]
      = (1/6) Σ_{i=1}^6 ( t1,i − t̄ )²

Later we will want to allow the estimate to be non-linear, in which case the
proper way to look at this variance estimate is

    (1/12) Σ_{i=1}^6 Σ_{j=1}^2 ( tj,i − t̄ )²

as we can’t use the above trick we used for linear estimators. So really, we
take every half-sample and average the squared differences from the whole-
sample estimate. Note that when we apply this half-samples technique to biased
estimators, we actually end up estimating the mean square error.

So far we’ve looked at cases where this averaging of variances using half samples
gives us our usual estimator back again, which is rather boring. The more
interesting case is where the direct estimation of the variance is difficult or
computationally impossible. In this case, we hope that the easier procedure of
using split samples and averaging will give an estimate which is computationally
simpler than the direct estimate, and not much more inefficient. The problem is
that a huge number of splits are required, specifically (n choose n/2). If possible we'd like
to reduce the number of splits required, and this is where balanced replication
comes in.

6.2 Balanced repetition

Balanced repetition refers to a procedure which lets us use only (n − 1) splits
as long as n is a multiple of 4. The main tool we use is called a Hadamard
matrix. A Hadamard matrix is a square matrix with entries which are all −1
or +1 (or more generally roots of +1), and whose rows are all orthogonal to
each other. In a combinatorial sense this means that if we pick any two rows,
then half of their entries will be identical and the other half will be different.
From this definition it turns out that the same properties must actually hold
for the columns, and that H H^T = nI, where n is the size of the square matrix
and I is the n × n identity matrix. An example of a Hadamard matrix is

    1  1  1  1
    1 −1 −1  1
    1  1 −1 −1
    1 −1  1 −1

Matrices of this form are also used in factorial experiments.
Definition 6.2.1. Let δh1 denote the inclusion vector for the 1st half of the
hth split, δh2 the inclusion vector for the second half and δh = δh1 − δh2 . Then
our collection of half-samples is said to be balanced if for every h 6= j,
δh · δj = 0
where the dot denotes the inner product.
Example 6.2.1. Say we want to find the effects of using nitrogen (N), Phosphorous (P)
and Potash (K) on crop yields. One possibility is to conduct six
experiments - two for each factor, one where the factor is applied and one
where it is not. Of course we can just have one control experiment instead of 3
for a total of 4 experiments, and this design can actually be encoded as part of
the Hadamard matrix

    Plot    N    P    K
    1      +1   +1   +1   +1
    2      +1   −1   −1   +1
    3      −1   +1   −1   +1
    4      −1   −1   +1   +1

The extra column is needed because, as we said, a Hadamard matrix must have
dimensions n × n where n is a multiple of 4.

It is not yet known if n = 4k is a sufficient condition for the existence of a
Hadamard matrix of order n, but apparently it seems likely. Also, if we have a
Hadamard matrix H of order n, the matrix

    [ H   H ]
    [ H  −H ]

is also a Hadamard matrix. So Hadamard matrices of order 2n are trivial to
construct.
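As an illustration of this doubling construction (a sketch, not part of the notes), the following Python function builds Hadamard matrices of order 2^m starting from H = [1].

    # Illustrative sketch: Sylvester's doubling construction for Hadamard matrices.
    def hadamard(order):
        """Return a Hadamard matrix of the given order, which must be a power of 2."""
        H = [[1]]
        while len(H) < order:
            # Replace H by the block matrix [[H, H], [H, -H]].
            top = [row + row for row in H]
            bottom = [row + [-x for x in row] for row in H]
            H = top + bottom
        return H

    H4 = hadamard(4)
    # Check orthogonality of the rows: H H^T should equal 4 I.
    for i in range(4):
        for j in range(4):
            dot = sum(H4[i][k] * H4[j][k] for k in range(4))
            assert dot == (4 if i == j else 0)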
Now we go back to sample surveys. Take a Hadamard matrix H with the first
column and first row all 1's, then strip out the first row and use the remaining
rows to allocate units to half-samples. For example, if n = 4 then we take the
Hadamard matrix

    Half-Sample    1    2    3    4
                  +1   +1   +1   +1
        1         +1   +1   −1   −1
        2         +1   −1   −1   +1
        3         +1   −1   +1   −1

The 4 columns denote the units. So the row marked 1 says that we take the half-
samples (1, 2) and (3, 4). The row marked 2 gives the half-samples (1, 4) and
(2, 3), and the last row gives (1, 3) and (2, 4). Note that as the first column is
always 1, the first unit is always included in the first half-sample. In the previous
section we showed that (t1 − t2)²/4 was our variance estimate using the half-samples
t1, t2. So in this case our first row says that t1 = (y1 + y2)/2 and t2 = (y3 + y4)/2.
More generally we would take an n × n Hadamard matrix with first row and
column always 1, ignore the first row, and use the remaining (n − 1) rows to
allocate units to half-samples. We then use only the half-samples suggested by
the Hadamard matrix to calculate our variance estimate. So our three variance
estimates are

    ( (y1 + y2)/2 − (y3 + y4)/2 )² / 4 = ( y1 + y2 − y3 − y4 )² / 4²
    ( (y1 + y4)/2 − (y2 + y3)/2 )² / 4 = ( y1 + y4 − y2 − y3 )² / 4²
    ( (y1 + y3)/2 − (y2 + y4)/2 )² / 4 = ( y1 + y3 − y2 − y4 )² / 4²

Summing these and dividing by 4 not 3, we get

    (1/4³) [ 3 ( y1² + y2² + y3² + y4² ) − ( 2 y1 y2 + 2 y1 y3 + . . . ) ]
      = (1/4³) [ 3 Σ_{i=1}^4 yi² − 2 ΣΣ_{i<j} yi yj ]
      = (1/4³) [ 4 Σ_{i=1}^4 yi² − Σ_{i=1}^4 yi² − ΣΣ_{i≠j} yi yj ]
      = (1/4³) [ 4 Σ_{i=1}^4 yi² − Σ_i Σ_j yi yj ]
      = (1/4³) [ 4 Σ_{i=1}^4 yi² − ( Σ_{i=1}^4 yi )² ]
      = (1/4³) [ 4 Σ_{i=1}^4 yi² − 4 × 4 × ȳ² ]
      = (1/4) [ (1/4) Σ_{i=1}^4 yi² − ȳ² ]
      = s²/4
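A quick numerical check of this identity (an illustrative sketch with arbitrary made-up y-values) is easy to write.

    # Illustrative check: the three balanced half-sample estimates, summed and
    # divided by 4, give s^2/4, where s^2 here uses the divisor n = 4.
    y = [3.0, 7.0, 1.0, 9.0]          # arbitrary made-up values
    ybar = sum(y) / 4
    s2 = sum((v - ybar) ** 2 for v in y) / 4

    splits = [((0, 1), (2, 3)), ((0, 3), (1, 2)), ((0, 2), (1, 3))]
    estimates = []
    for first, second in splits:
        t1 = sum(y[i] for i in first) / 2
        t2 = sum(y[i] for i in second) / 2
        estimates.append((t1 - t2) ** 2 / 4)

    print(sum(estimates) / 4, s2 / 4)   # the two numbers should agree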

But things become much more complicated when the estimate is non-linear, and
it's best to just start over. The theoretical underpinnings are also, apparently,
not that strong. So this time assume we have k balanced half-samples, the
estimated value from half-sample i is θ̂i and the estimated value from the whole
sample is θ̂. Then our variance estimator will be

    (1/k) Σ_{i=1}^k ( θ̂i − θ̂ )²

which has a nice symmetry with our previous estimators, although this time we
can't take a sum over the first 'half' of every pair of half-samples. We also don't
in general have that

    (1/k) Σ_{i=1}^k θ̂i = θ̂


Example 6.2.2. Suppose we want to estimate the ratio R = Ȳ/X̄. We have k
balanced half-samples, and the estimator from the ith half-sample is R̂i = Ŷi/X̂i.
We will clearly have

    R̂ ≠ (1/k) Σ_{i=1}^k R̂i

Our variance estimator is

    (1/k) Σ_{i=1}^k ( R̂i − R̂ )²

If we make the approximation

    R̂i − R̂ ≃ ( Ŷi − R̂ X̂i ) / X̂

then this variance becomes

    (1/k) Σ_{i=1}^k ( ( Ŷi − R̂ X̂i ) / X̂ )²
      = ( 1/(k X̂²) ) Σ_{i=1}^k ( Ŷi − R̂ X̂i )²
      = ( 1/(k X̂²) ) Σ_{i=1}^k ( Ŷi − R̂ X̂i − Ŷ + Ŷ )²
      = ( 1/(k X̂²) ) Σ_{i=1}^k ( Ŷi − Ŷ − R̂ X̂i + R̂ X̂ )²
      = ( 1/(k X̂²) ) Σ_{i=1}^k ( Ŷi − Ŷ − R̂ ( X̂i − X̂ ) )²

This is just the usual estimate of the variance of a ratio estimator. So the
variance estimate from half-samples is the same as the standard estimate, to a
first approximation.

We can extend this idea of balanced replication to stratified sampling but we


will generally need a lot of half-samples. The exception to this is where we
stratify very finely, and use simple random sampling without replacement to
select 2 units per strata.

So, suppose we have k strata and denote by (yi1, yi2) the units observed from
the ith strata. Let H be a Hadamard matrix of size t ≥ k. Then the first k rows
of H represent the k strata and the t columns represent t pairs of half-samples.
If hil denotes the entry in the ith row and lth column, then hil = 1 says that
yi1 is going to be the unit picked from the ith strata for the lth half sample,
whereas hil = −1 says that yi2 will be picked. For example, assume that k = 3
so that there are 3 strata and we pick

    H =  +1  +1  +1  +1
         +1  +1  −1  −1
         +1  −1  +1  −1
         +1  −1  −1  +1

Then this matrix specifies 4 splits, the first being

    (y11, y21, y31), (y12, y22, y32)

and the second

    (y11, y21, y32), (y12, y22, y31)

For each split, we get two estimates of the average over the whole population.
For instance, from the first split we get

t1 = W1 y11 + W2 y21 + W3 y31


t2 = W1 y12 + W2 y22 + W3 y32

where the Wi are the proportional sizes of the strata. From previously, we have
that
    ( (t1 − t̄)² + (t2 − t̄)² ) / 2
estimates the variance of the stratified estimator with twice as many units,
provided that ni is much smaller than Ni . Our four splits give us four such
estimates of the variance, which we then average to end up with a final variance
estimate. This estimate is for the variance of the 6 unit stratified estimator.
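Putting the pieces together, here is a hedged Python sketch (illustrative only, with made-up data) of balanced repeated replication for a design with two units per stratum, using the 4 × 4 Hadamard matrix above to choose the half-samples.

    # Illustrative sketch: balanced repeated replication for a stratified design
    # with exactly 2 observations per stratum.
    def brr_variance(pairs, W, H):
        """pairs: list of (y_i1, y_i2) per stratum; W: stratum weights;
        H: Hadamard matrix whose first k rows index the strata and whose
        columns index the half-samples."""
        t_cols = len(H[0])
        reps = []
        for l in range(t_cols):
            t1 = sum(w * (a if H[i][l] == 1 else b)
                     for i, (w, (a, b)) in enumerate(zip(W, pairs)))
            t2 = sum(w * (b if H[i][l] == 1 else a)
                     for i, (w, (a, b)) in enumerate(zip(W, pairs)))
            tbar = (t1 + t2) / 2
            reps.append(((t1 - tbar) ** 2 + (t2 - tbar) ** 2) / 2)
        return sum(reps) / t_cols

    H = [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, 1, -1], [1, -1, -1, 1]]
    pairs = [(12, 18), (30, 26), (80, 44)]      # made-up (y_i1, y_i2) per stratum
    W = [0.5, 0.4, 0.1]
    print(brr_variance(pairs, W, H))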

