Cluster Sampling
Example 8.1:
A sociologist wants to estimate the per-capita income in a certain small
city. No list of resident adults is available. How should he design the
sample survey?
Estimation of a Population Mean and Total
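Let
$N$ = the number of clusters in the population
$n$ = the number of clusters selected in a simple random sample
$m_i$ = the number of elements in cluster $i$, $i = 1, 2, …, N$
$y_i$ = the total of all observations in cluster $i$
$\bar{m} = \frac{1}{n} \sum_{i=1}^{n} m_i$ = the average cluster size for the sample
$M = \sum_{i=1}^{N} m_i$ = the number of elements in the population
$\bar{M} = M / N$ = the average cluster size for the population
Estimator of the population mean $\mu$:
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}$$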
Estimated variance of $\bar{y}$:
$$\hat{V}(\bar{y}) = \left(1 - \frac{n}{N}\right) \frac{s_r^2}{n \bar{M}^2}$$
where
$$s_r^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y} m_i)^2}{n - 1}$$
Note:
1. $\bar{M}$ can be estimated by $\bar{m}$ if $M$ is unknown.
2. The estimated variance is biased and is a good estimator of $V(\bar{y})$ only
if $n$ is large, say $n \ge 20$. The bias disappears if the cluster sizes
$m_1, m_2, …, m_N$ are equal.
Example 8.2:
Interviews are conducted in each of the 25 blocks sampled in Example
8.1. The data on incomes are presented in the following table (per-capita
income). Use the data to estimate the per-capita income in the city and
place a bound on the error of estimation.
Table 8.1
Cluster   No. of residents, m_i   Total income per cluster, y_i ($)
1         8                       96,000
2         12                      121,000
3         4                       42,000
4         5                       65,000
5         6                       52,000
6         6                       40,000
7         7                       75,000
8         5                       65,000
9         8                       45,000
10        3                       50,000
11        2                       85,000
12        6                       43,000
13        5                       54,000
14        10                      49,000
15        9                       53,000
16        3                       50,000
17        6                       32,000
18        5                       22,000
19        5                       45,000
20        4                       37,000
21        6                       51,000
22        8                       30,000
23        7                       39,000
24        3                       47,000
25        8                       41,000
Total     151                     1,329,000
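The ratio estimator and its bound can be checked numerically for Example 8.2. The sketch below is one way to do it, under two assumptions that are not stated in the example itself: the population has N = 415 blocks (the figure quoted for this city in Example 8.11), and the bound on the error of estimation is taken as two estimated standard errors. Because M is not given, the population average cluster size is replaced by the sample average. The function name `ratio_mean` is illustrative only.

```python
import math

# Table 8.1: residents per sampled block (m_i) and total block income (y_i).
m = [8, 12, 4, 5, 6, 6, 7, 5, 8, 3, 2, 6, 5,
     10, 9, 3, 6, 5, 5, 4, 6, 8, 7, 3, 8]
y = [96000, 121000, 42000, 65000, 52000, 40000, 75000, 65000, 45000,
     50000, 85000, 43000, 54000, 49000, 53000, 50000, 32000, 22000,
     45000, 37000, 51000, 30000, 39000, 47000, 41000]


def ratio_mean(y, m, N, M_bar=None):
    """Ratio estimate of the mean per element, its estimated variance, and a bound."""
    n = len(y)
    ybar = sum(y) / sum(m)
    if M_bar is None:
        # M unknown: use the sample average cluster size in place of M / N.
        M_bar = sum(m) / n
    s_r2 = sum((yi - ybar * mi) ** 2 for yi, mi in zip(y, m)) / (n - 1)
    var = (1 - n / N) * s_r2 / (n * M_bar ** 2)
    # Assumption: bound = two estimated standard errors.
    return ybar, var, 2 * math.sqrt(var)


ybar, var, bound = ratio_mean(y, m, N=415)  # N = 415 taken from Example 8.11
print(f"per-capita income estimate: {ybar:.2f}")
print(f"bound on the error of estimation: {bound:.2f}")
```

Passing `M_bar=M / N` instead of leaving it as `None` would use the population average cluster size when M is known.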
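Estimator of the population total $\tau$:
$$M\bar{y} = M \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}$$
Estimated variance of $M\bar{y}$:
$$\hat{V}(M\bar{y}) = M^2 \hat{V}(\bar{y}) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_r^2}{n}$$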
Example 8.3:
Use the data in the previous table to estimate the total income of all
residents of the city and place a bound on the error of estimation. There
are M = 2500 residents in the city.
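Estimator of the population total $\tau$ that does not depend on $M$:
$$N\bar{y}_t = \frac{N}{n} \sum_{i=1}^{n} y_i$$
Estimated variance of $N\bar{y}_t$:
$$\hat{V}(N\bar{y}_t) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n}$$
where
$$s_t^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y}_t)^2}{n - 1}$$
and $\bar{y}_t = \frac{1}{n} \sum_{i=1}^{n} y_i$ is the average cluster total in the sample.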
Note:
1. If the cluster sizes vary widely and are highly correlated with the
cluster totals, $\hat{V}(N\bar{y}_t)$ is generally larger than $\hat{V}(M\bar{y})$.
2. The estimator $N\bar{y}_t$ does not use the information provided by the
cluster sizes $m_1, m_2, …, m_n$ and hence may be less precise.
Example 8.4:
Use the data in the previous table to estimate the total income of all
residents of the city if M is unknown. Place a bound on the error of
estimation.
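For Example 8.4 the cluster sizes are not needed, so only the cluster totals from Table 8.1 are used. The sketch below expands the average cluster total by N and assumes the usual simple-random-sampling variance for that expansion, N^2 (1 - n/N) s_t^2 / n. As before, N = 415 is taken from Example 8.11 and the bound is assumed to be two estimated standard errors.

```python
import math

# Table 8.1: total income per sampled block (y_i); cluster sizes are not used.
y = [96000, 121000, 42000, 65000, 52000, 40000, 75000, 65000, 45000,
     50000, 85000, 43000, 54000, 49000, 53000, 50000, 32000, 22000,
     45000, 37000, 51000, 30000, 39000, 47000, 41000]

N = 415        # number of blocks in the city (from Example 8.11)
n = len(y)

ybar_t = sum(y) / n          # average cluster total
tau_hat_t = N * ybar_t       # estimate of the population total

s_t2 = sum((yi - ybar_t) ** 2 for yi in y) / (n - 1)
var_t = N ** 2 * (1 - n / N) * s_t2 / n
bound_t = 2 * math.sqrt(var_t)   # assumption: two standard errors

print(f"estimated total income: {tau_hat_t:,.0f}")
print(f"bound on the error of estimation: {bound_t:,.0f}")
```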
Equal Cluster Sizes: Comparison to SRS
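Assume all of the $m_i$ values are equal to a common value $m$ for the
entire population of clusters, so that $M = Nm$.
In this case the estimator $\bar{y} = \sum_{i=1}^{n} y_i / \sum_{i=1}^{n} m_i$ of the population
mean per element is denoted by $\bar{y}_c$:
$$\bar{y}_c = \frac{1}{mn} \sum_{i=1}^{n} y_i = \frac{\bar{y}_t}{m}$$
where
$$\bar{y}_t = \frac{1}{n} \sum_{i=1}^{n} y_i = m \bar{y}_c.$$
Let $y_{ij}$ be the $j$-th observation in cluster $i$ and $\bar{y}_i$ the mean of cluster $i$.
The total sum of squares splits into a within-cluster part and a between-cluster part:
$$\sum_{i=1}^{n} \sum_{j=1}^{m} (y_{ij} - \bar{y}_c)^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} (y_{ij} - \bar{y}_i)^2 + m \sum_{i=1}^{n} (\bar{y}_i - \bar{y}_c)^2$$
With the between-cluster mean square $MSB = \frac{m}{n-1} \sum_{i=1}^{n} (\bar{y}_i - \bar{y}_c)^2$, the estimated variance of $\bar{y}_c$ is
$$\hat{V}(\bar{y}_c) = \left(1 - \frac{n}{N}\right) \frac{1}{nm} MSB$$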
Example:
The circulation manager of a newspaper wishes to estimate the average
number of newspapers purchased per household in a given community.
Travel costs from household to household are substantial. Therefore, the
4000 households in the community are listed in 400 geographical
clusters of 10 households each, and a simple random sample of 4
clusters is selected. Interviews are conducted, with the results shown
below. Estimate the average number of newspapers per household for
the community and place a bound on the error of estimation.
Cluster   Number of newspapers            Total
1         1  2  1  3  3  2  1  4  1  1    19
2         1  3  2  2  3  1  4  1  1  2    20
3         2  1  1  1  1  3  2  1  3  1    16
4         1  1  3  2  1  5  1  2  3  1    20
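The newspaper example can be worked through with the equal-cluster-size formulas. The sketch below computes the mean per household, the between-cluster mean square MSB, and the estimated variance (1 - n/N) MSB / (nm), with N = 400 clusters as stated in the example. Taking the bound as two estimated standard errors is an assumption.

```python
import math

# Newspapers purchased by each of the 10 households in the 4 sampled clusters.
clusters = [
    [1, 2, 1, 3, 3, 2, 1, 4, 1, 1],
    [1, 3, 2, 2, 3, 1, 4, 1, 1, 2],
    [2, 1, 1, 1, 1, 3, 2, 1, 3, 1],
    [1, 1, 3, 2, 1, 5, 1, 2, 3, 1],
]

N = 400                  # clusters in the community
n = len(clusters)        # clusters sampled
m = len(clusters[0])     # households per cluster (equal for all clusters)

ybar_c = sum(sum(c) for c in clusters) / (n * m)   # mean per household
cluster_means = [sum(c) / m for c in clusters]

# Between-cluster mean square.
msb = m * sum((yb - ybar_c) ** 2 for yb in cluster_means) / (n - 1)
var_c = (1 - n / N) * msb / (n * m)
bound_c = 2 * math.sqrt(var_c)   # assumption: two standard errors

print(f"mean newspapers per household: {ybar_c:.3f}")
print(f"bound on the error of estimation: {bound_c:.3f}")
```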
Selecting the Sample Size for Estimating Population Means and Totals
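Given
$$\hat{V}(\bar{y}) = \left(1 - \frac{n}{N}\right) \frac{s_r^2}{n \bar{M}^2}$$
where
$$s_r^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y} m_i)^2}{n - 1}$$
and
$$V(\bar{y}) = \left(1 - \frac{n}{N}\right) \frac{\sigma_r^2}{n \bar{M}^2}$$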
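Approximate sample size required to estimate $\mu$ with a bound $B$ on the
error of estimation:
$$n = \frac{N \sigma_r^2}{N D + \sigma_r^2}$$
where $\sigma_r^2$ is estimated by $s_r^2$ from a preliminary sample and $D = B^2 \bar{M}^2 / 4$.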
Example 8.6:
Suppose the data in Table 8.1 represent a preliminary sample of incomes
in the city. How large a sample should be taken in a future survey in
order to estimate the average per-capita income μ with a bound of $500
on the error of estimation?
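A numerical sketch of Example 8.6 follows. It estimates the population quantity in the sample-size formula by s_r^2 from the preliminary sample in Table 8.1, replaces the unknown average cluster size by the sample average, and uses D = B^2 Mbar^2 / 4, which is what results from setting two standard errors equal to B. N = 415 is taken from Example 8.11; the helper name `clusters_needed` is illustrative.

```python
import math

# Table 8.1 used as a preliminary sample.
m = [8, 12, 4, 5, 6, 6, 7, 5, 8, 3, 2, 6, 5,
     10, 9, 3, 6, 5, 5, 4, 6, 8, 7, 3, 8]
y = [96000, 121000, 42000, 65000, 52000, 40000, 75000, 65000, 45000,
     50000, 85000, 43000, 54000, 49000, 53000, 50000, 32000, 22000,
     45000, 37000, 51000, 30000, 39000, 47000, 41000]


def clusters_needed(N, sigma2, D):
    """Smallest whole number of clusters satisfying n = N*sigma2 / (N*D + sigma2)."""
    return math.ceil(N * sigma2 / (N * D + sigma2))


N = 415                      # blocks in the city (from Example 8.11)
B = 500                      # desired bound on the error of estimation
n0 = len(y)                  # size of the preliminary sample

ybar = sum(y) / sum(m)
s_r2 = sum((yi - ybar * mi) ** 2 for yi, mi in zip(y, m)) / (n0 - 1)
m_bar = sum(m) / n0          # stands in for the unknown M / N
D = B ** 2 * m_bar ** 2 / 4

n_mean = clusters_needed(N, s_r2, D)
print(f"clusters needed to estimate the mean within ${B}: {n_mean}")
```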
Example 8.7:
Using the data in Table 8.1 as a preliminary sample of incomes in the
city, how large a sample is necessary to estimate the total income of all
residents, τ with a bound of $1,000,000 on the error of estimation?
There are M = 2500 residents of the city.
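Approximate sample size required to estimate $\tau$, using $N\bar{y}_t$, with a
bound $B$ on the error of estimation:
$$n = \frac{N \sigma_t^2}{N D + \sigma_t^2}$$
where $\sigma_t^2$ is estimated by $s_t^2$ from a preliminary sample and $D = B^2 / (4 N^2)$.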
Example 8.8:
Suppose the data in Table 8.1 are used as a preliminary sample of incomes
in the city and M is unknown. How large a sample must be taken to estimate
the total income of all residents, τ, with a bound of $1,000,000 on the
error of estimation?
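Examples 8.7 and 8.8 can be compared side by side. The sketch below assumes D = B^2 / (4 N^2) for both total estimators, which follows from setting two standard errors of the estimated total equal to B. For Example 8.7 (M known, estimator M times the ratio mean) the variance term is estimated by s_r^2; for Example 8.8 (M unknown, estimator N times the average cluster total) it is estimated by s_t^2. N = 415 is taken from Example 8.11.

```python
import math

# Table 8.1 used as a preliminary sample.
m = [8, 12, 4, 5, 6, 6, 7, 5, 8, 3, 2, 6, 5,
     10, 9, 3, 6, 5, 5, 4, 6, 8, 7, 3, 8]
y = [96000, 121000, 42000, 65000, 52000, 40000, 75000, 65000, 45000,
     50000, 85000, 43000, 54000, 49000, 53000, 50000, 32000, 22000,
     45000, 37000, 51000, 30000, 39000, 47000, 41000]


def clusters_needed(N, sigma2, D):
    """Smallest whole number of clusters satisfying n = N*sigma2 / (N*D + sigma2)."""
    return math.ceil(N * sigma2 / (N * D + sigma2))


N = 415                       # blocks in the city (from Example 8.11)
B = 1_000_000                 # desired bound on the error of the total
n0 = len(y)
D = B ** 2 / (4 * N ** 2)     # same D for both estimators of the total

# Example 8.7: M known, variation measured around the ratio line.
ybar = sum(y) / sum(m)
s_r2 = sum((yi - ybar * mi) ** 2 for yi, mi in zip(y, m)) / (n0 - 1)
n_ratio = clusters_needed(N, s_r2, D)

# Example 8.8: M unknown, variation measured around the average cluster total.
ybar_t = sum(y) / n0
s_t2 = sum((yi - ybar_t) ** 2 for yi in y) / (n0 - 1)
n_expand = clusters_needed(N, s_t2, D)

print(f"Example 8.7 (M known): {n_ratio} clusters")
print(f"Example 8.8 (M unknown): {n_expand} clusters")
```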
Estimation of a Population Proportion
Suppose an experimenter wishes to estimate a population proportion.
For example:
a) The proportion of houses in a state with inadequate plumbing.
b) The proportion of corporation presidents who are college
graduates.
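Let
$a_i$ = the total number of elements in cluster $i$ that possess the characteristic of
interest
$m_i$ = the number of elements in the $i$-th cluster
Estimator of the population proportion $p$:
$$\hat{p} = \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} m_i}$$
Estimated variance of $\hat{p}$:
$$\hat{V}(\hat{p}) = \left(1 - \frac{n}{N}\right) \frac{s_p^2}{n \bar{M}^2}$$
where
$$s_p^2 = \frac{\sum_{i=1}^{n} (a_i - \hat{p} m_i)^2}{n - 1}$$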
Note:
1. $\hat{V}(\hat{p})$ is a good estimator only when the sample size $n$ is large, say $n \ge 20$.
2. If $m_1 = m_2 = … = m_N$, then $\hat{p}$ is an unbiased estimator of $p$ and $\hat{V}(\hat{p})$ is an
unbiased estimator of the actual variance of $\hat{p}$.
Example 8.9:
In addition to being asked about their income, the residents in the
sample survey of Example 8.2 are asked whether they rent or own their
homes. The results are given in Table 8.2. Estimate the proportion of
residents who live in rented housing and place a bound on the error of
estimation.
Table 8.2
Cluster   Residents, m_i   Renters, a_i   a_i − p̂ m_i
1         8                4              0.16
2         12               7              1.24
3         4                1              −0.92
4         5                3
5         6                3
6         6                4
7         7                4
8         5                2
9         8                3
10        3                2
11        2                1
12        6                3
13        5                2
14        10               5
15        9                4
16        3                1
17        6                4
18        5                2
19        5                3
20        4                1
21        6                3
22        8                3
23        7                4
24        3                0
25        8                3
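The proportion of renters in Example 8.9 can be estimated with the same ratio-type calculation used for the mean, with the renter counts in place of the cluster totals. The sketch below assumes N = 415 blocks (from Example 8.11), replaces the unknown average cluster size by the sample average, and takes the bound as two estimated standard errors. It keeps the unrounded estimate of p, so the residuals differ slightly from the three filled-in table entries, which correspond to p rounded to 0.48.

```python
import math

# Table 8.2: residents (m_i) and renters (a_i) in each sampled block.
m = [8, 12, 4, 5, 6, 6, 7, 5, 8, 3, 2, 6, 5,
     10, 9, 3, 6, 5, 5, 4, 6, 8, 7, 3, 8]
a = [4, 7, 1, 3, 3, 4, 4, 2, 3, 2, 1, 3, 2,
     5, 4, 1, 4, 2, 3, 1, 3, 3, 4, 0, 3]

N = 415                  # blocks in the city (from Example 8.11)
n = len(a)

p_hat = sum(a) / sum(m)  # estimated proportion of renters
m_bar = sum(m) / n       # stands in for the unknown M / N

s_p2 = sum((ai - p_hat * mi) ** 2 for ai, mi in zip(a, m)) / (n - 1)
var_p = (1 - n / N) * s_p2 / (n * m_bar ** 2)
bound_p = 2 * math.sqrt(var_p)   # assumption: two standard errors

print(f"estimated proportion of renters: {p_hat:.3f}")
print(f"bound on the error of estimation: {bound_p:.3f}")
```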
Example 8.11:
Let the data in Table 8.1 form the sample for stratum 1, with, as in
Example 8.2, $N_1 = 415$ and $n_1 = 25$. A smaller neighboring city is taken to be
stratum 2. For stratum 2, $n_2 = 10$ blocks are to be sampled from $N_2 = 168$.
Estimate the average per-capita income in the two cities combined and
place a bound on the error of estimation, given the additional data shown
in the following table:
Cluster   No. of residents, m_i   Total income per cluster, y_i ($)
1         2                       18,000
2         5                       52,000
3         7                       68,000
4         4                       36,000
5         3                       45,000
6         8                       96,000
7         6                       64,000
8         10                      115,000
9         3                       41,000
10        1                       12,000
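No formula for combining the two strata is given above, so the following sketch rests on an assumption: it uses the combined ratio estimator, which divides the estimated total income over both strata by the estimated total number of residents, and the usual variance estimate for a combined ratio built from within-stratum deviations. The population size M is estimated from the sample because it is not given, and the bound is taken as two estimated standard errors. Treat the result as an illustration of that approach rather than the official solution to Example 8.11.

```python
import math

# Stratum 1 (Table 8.1) and stratum 2 (table above): (N_h, sizes m, totals y).
strata = [
    (415,
     [8, 12, 4, 5, 6, 6, 7, 5, 8, 3, 2, 6, 5,
      10, 9, 3, 6, 5, 5, 4, 6, 8, 7, 3, 8],
     [96000, 121000, 42000, 65000, 52000, 40000, 75000, 65000, 45000,
      50000, 85000, 43000, 54000, 49000, 53000, 50000, 32000, 22000,
      45000, 37000, 51000, 30000, 39000, 47000, 41000]),
    (168,
     [2, 5, 7, 4, 3, 8, 6, 10, 3, 1],
     [18000, 52000, 68000, 36000, 45000, 96000, 64000, 115000, 41000, 12000]),
]

# Estimated total income and estimated total residents over both strata.
total_income = sum(N_h * sum(y) / len(y) for N_h, m, y in strata)
total_residents = sum(N_h * sum(m) / len(m) for N_h, m, y in strata)
ybar_star = total_income / total_residents   # combined ratio estimate

# Assumed variance estimate: within-stratum deviations about the ratio line.
var_sum = 0.0
for N_h, m, y in strata:
    n_h = len(y)
    y_mean, m_mean = sum(y) / n_h, sum(m) / n_h
    s_c2 = sum(((yi - y_mean) - ybar_star * (mi - m_mean)) ** 2
               for yi, mi in zip(y, m)) / (n_h - 1)
    var_sum += N_h * (N_h - n_h) / n_h * s_c2

var_star = var_sum / total_residents ** 2
bound_star = 2 * math.sqrt(var_star)   # assumption: two standard errors

print(f"combined per-capita income estimate: {ybar_star:.2f}")
print(f"bound on the error of estimation: {bound_star:.2f}")
```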
Cluster Sampling with Probabilities Proportional to Size
Sometimes, estimates can be improved by varying the probabilities with
which units are sampled from the population.
For example, suppose we want to estimate the number of job openings in a
city by sampling industrial firms within that city. Small firms employ
few workers, whereas large firms employ many, so the number of openings
is likely to be related to firm size. Selecting firms with probabilities
proportional to their size is called sampling with probabilities
proportional to size, or pps sampling.
Example 8.12:
An auditor wishes to sample sick-leave records of a large firm in order
to estimate the average number of days of sick leave per employee over the
past quarter. The firm has eight divisions, with varying numbers of
employees per division. Because the number of days of sick leave used within
each division should be highly correlated with the number of employees, the
auditor decides to sample n = 3 divisions with probabilities proportional to
the number of employees. Show how to select the sample if the numbers of
employees in the eight divisions are 1200, 450, 2100, 860, 2840, 1910,
290 and 3200.
Estimate the average no. of sick-leave days used per person for the entire
firm and place a bound on the error of estimation.
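One common way to carry out the selection in Example 8.12 is the cumulative-range method: give each division a block of consecutive integers equal in length to its number of employees, draw random integers between 1 and the total number of employees, and select the division whose block contains each draw. The sketch below assumes sampling with replacement, so a division may be picked more than once. The seed and the helper name `division_for` are arbitrary choices for illustration, and the sick-leave data needed for the estimation step are not reproduced here, so only the selection is shown.

```python
import bisect
import itertools
import random

# Number of employees in divisions 1..8 (Example 8.12).
sizes = [1200, 450, 2100, 860, 2840, 1910, 290, 3200]

# Upper end of each division's cumulative range:
# division 1 covers 1..1200, division 2 covers 1201..1650, and so on.
cum = list(itertools.accumulate(sizes))


def division_for(r):
    """Return the 1-based division whose cumulative range contains the integer r."""
    return bisect.bisect_left(cum, r) + 1


rng = random.Random(0)   # arbitrary seed so the illustration is repeatable
draws = [rng.randint(1, cum[-1]) for _ in range(3)]   # with replacement
sample = [division_for(r) for r in draws]

for k, upper in enumerate(cum, start=1):
    lower = upper - sizes[k - 1] + 1
    print(f"division {k}: {lower}-{upper}")
print("random draws:", draws)
print("selected divisions:", sample)
```

For instance, a draw of 2011 falls in the range 1651 to 3750 and so selects division 3, which has probability 2100/12850 of being chosen on any single draw.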