Cluster Sampling

1. Cluster sampling is a probability sampling method where the sampling unit is a cluster of elements rather than individual elements. 2. It is less costly than simple random sampling or stratified random sampling when a frame listing all population elements is very expensive or impossible to obtain, but a frame listing clusters is easily available. 3. Cluster sampling involves randomly selecting clusters from the population and collecting data from all elements within those clusters.

Uploaded by

Ashweena Sundar
Copyright © All Rights Reserved

CLUSTER SAMPLING

Definition: A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.

Cluster sampling is less costly than simple random sampling (SRS) or stratified random sampling if the cost of obtaining a frame that lists all population elements is very high, or if the cost of obtaining observations increases as the distance separating the elements increases.
For example:
Suppose we wish to estimate the average income per household in a
large city.
1. If we use SRS, we need a frame listing all households in the city. This frame may be very costly or impossible to obtain.
2. To use stratified random sampling, a frame is still needed for each stratum in the population.
3. With cluster sampling, we divide the city into regions such as blocks (clusters of elements) and select a simple random sample of blocks, using a frame that lists all city blocks. The income of every household within each sampled block is then measured.

Suppose that a list of households in the city is available:
1. We could select a simple random sample of households, but these would probably be scattered throughout the city, so the cost would be higher.
2. Stratified random sampling could lower the cost.
3. Cluster sampling is more appropriate for reducing travel cost.
Cluster sampling is an effective design for obtaining a specified amount
of information at minimum cost under the following conditions:
1. A good frame listing population elements either is not available or
is very costly to obtain, but a frame listing clusters is easily
obtained.
2. The cost of obtaining observations increases as the distance
separating the elements increases.
Examples of cluster sampling:
1. Hospitals form convenient clusters of patients with certain illnesses for studies on the average length of time a patient is hospitalized or the average number of recurrences of these illnesses.
2. An automobile forms a nice cluster of 4 tires for studies on tire
wear and safety.
3. An orange tree forms a cluster of oranges for investigating an
insect infestation.
4. A circuit board manufactured for a computer forms a cluster of
semiconductors for testing.
How to draw a cluster sample?

Example 8.1:
A sociologist wants to estimate the per-capita income in a certain small
city. No list of resident adults is available. How should he design the
sample survey?
Estimation of a Population Mean and Total
Let
$N$ = the number of clusters in the population
$n$ = the number of clusters selected in a simple random sample
$m_i$ = the number of elements in cluster $i$, $i = 1, 2, \ldots, N$
$\bar{m} = \frac{1}{n} \sum_{i=1}^{n} m_i$ = the average cluster size for the sample
$M = \sum_{i=1}^{N} m_i$ = the number of elements in the population
$\bar{M} = M / N$ = the average cluster size for the population
$y_i$ = the total of all observations in the $i$th cluster

Ratio estimator of the population mean $\mu$:
\[
\bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}
\]

Estimated variance of $\bar{y}$:
\[
\hat{V}(\bar{y}) = \left(1 - \frac{n}{N}\right) \frac{s_r^2}{n \bar{M}^2}
\]
where
\[
s_r^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y} m_i)^2}{n - 1}
\]

Note:
1. Here $\bar{M}$ can be estimated by $\bar{m}$ if $M$ is unknown.
2. The estimated variance is biased and is a good estimator of $V(\bar{y})$ only if $n$ is large, say $n \ge 20$. The bias disappears if the cluster sizes $m_1, m_2, \ldots, m_N$ are all equal.
Example 8.2:
Interviews are conducted in each of the 25 blocks sampled in Example 8.1. The data on incomes are presented in the following table (per-capita income). Use the data to estimate the per-capita income in the city and place a bound on the error of estimation.
Table 8.1
Cluster   No. of residents, $m_i$   Total income per cluster, $y_i$
1         8                         96,000
2         12                        121,000
3         4                         42,000
4         5                         65,000
5         6                         52,000
6         6                         40,000
7         7                         75,000
8         5                         65,000
9         8                         45,000
10        3                         50,000
11        2                         85,000
12        6                         43,000
13        5                         54,000
14        10                        49,000
15        9                         53,000
16        3                         50,000
17        6                         32,000
18        5                         22,000
19        5                         45,000
20        4                         37,000
21        6                         51,000
22        8                         30,000
23        7                         39,000
24        3                         47,000
25        8                         41,000
Total     151                       1,329,000
[Figure: chart of the data in the table above; horizontal axis from 0 to 14, vertical axis from 0 to 140,000.]
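The calculation asked for in Example 8.2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a worked solution from the notes: it takes $N = 415$ blocks (the value quoted for this city in Example 8.11), estimates $\bar{M}$ by $\bar{m}$ because $M$ is not given at this point, and takes the bound on the error of estimation to be $2\sqrt{\hat{V}(\bar{y})}$, which is the convention implied by the factor $B^2/4$ in the sample-size formulas. The function name `ratio_mean_estimate` is hypothetical.

```python
import math

# Cluster sizes m_i and cluster income totals y_i for the 25 sampled blocks
m = [8, 12, 4, 5, 6, 6, 7, 5, 8, 3, 2, 6, 5,
     10, 9, 3, 6, 5, 5, 4, 6, 8, 7, 3, 8]
y = [96000, 121000, 42000, 65000, 52000, 40000, 75000, 65000, 45000,
     50000, 85000, 43000, 54000, 49000, 53000, 50000, 32000, 22000,
     45000, 37000, 51000, 30000, 39000, 47000, 41000]


def ratio_mean_estimate(m, y, N, M_bar=None):
    """Return (y_bar, estimated variance, bound) for the ratio estimator."""
    n = len(m)
    y_bar = sum(y) / sum(m)              # ratio estimator of the mean
    if M_bar is None:                    # M unknown: use the sample average size
        M_bar = sum(m) / n
    s_r2 = sum((yi - y_bar * mi) ** 2 for mi, yi in zip(m, y)) / (n - 1)
    var = (1 - n / N) * s_r2 / (n * M_bar ** 2)
    return y_bar, var, 2 * math.sqrt(var)  # bound assumed to be 2 * sqrt(var)


# N = 415 is an assumption taken from Example 8.11
y_bar, var, bound = ratio_mean_estimate(m, y, N=415)
print(f"estimated per-capita income: {y_bar:.2f}")
print(f"bound on the error of estimation: {bound:.2f}")
```

Passing `M_bar` explicitly lets the same function be reused when $M$, and hence $\bar{M} = M/N$, is known.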

Estimator of the population total, $\tau$:
\[
M\bar{y} = M \, \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}
\]

Estimated variance of $M\bar{y}$:
\[
\hat{V}(M\bar{y}) = M^2 \hat{V}(\bar{y}) = M^2 \left(1 - \frac{n}{N}\right) \frac{s_r^2}{n \bar{M}^2}
\]

Note: The estimator $M\bar{y}$ is useful only if the number of elements in the population, $M$, is known.

Example 8.3:
Use the data in previous table to estimate the total income of all
residents of the city and place a bound on the error of estimation. There
are 2500 residents of the city.
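A minimal Python sketch of the computation in Example 8.3 is given below. It assumes $N = 415$ blocks (the value quoted in Example 8.11), uses $\bar{M} = M/N$ because $M = 2500$ is known here, and again takes the bound to be two estimated standard deviations; these conventions are assumptions, not statements from the notes.

```python
import math

# Cluster sizes m_i and cluster income totals y_i for the 25 sampled blocks
m = [8, 12, 4, 5, 6, 6, 7, 5, 8, 3, 2, 6, 5,
     10, 9, 3, 6, 5, 5, 4, 6, 8, 7, 3, 8]
y = [96000, 121000, 42000, 65000, 52000, 40000, 75000, 65000, 45000,
     50000, 85000, 43000, 54000, 49000, 53000, 50000, 32000, 22000,
     45000, 37000, 51000, 30000, 39000, 47000, 41000]

N = 415       # number of blocks in the city (assumed, from Example 8.11)
M = 2500      # number of residents in the city (given in Example 8.3)
n = len(m)

y_bar = sum(y) / sum(m)                  # ratio estimator of the mean
M_bar = M / N                            # average cluster size, M known
s_r2 = sum((yi - y_bar * mi) ** 2 for mi, yi in zip(m, y)) / (n - 1)
var_mean = (1 - n / N) * s_r2 / (n * M_bar ** 2)

tau_hat = M * y_bar                      # estimator of the total, M * y_bar
var_total = M ** 2 * var_mean            # estimated variance of M * y_bar
bound = 2 * math.sqrt(var_total)         # bound assumed to be 2 * sqrt(var)

print(f"estimated total income: {tau_hat:,.0f}")
print(f"bound on the error of estimation: {bound:,.0f}")
```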
Estimator of the population total, $\tau$, which does not depend on $M$:
\[
N\bar{y}_t = \frac{N}{n} \sum_{i=1}^{n} y_i
\]

Estimated variance of $N\bar{y}_t$:
\[
\hat{V}(N\bar{y}_t) = N^2 \hat{V}(\bar{y}_t) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n}
\]
where
\[
s_t^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y}_t)^2}{n - 1}
\]

Note:
1. If there is a large amount of variation among the cluster sizes and if cluster sizes are highly correlated with cluster totals, then $\hat{V}(N\bar{y}_t)$ is generally larger than $\hat{V}(M\bar{y})$.
2. The estimator $N\bar{y}_t$ does not use the information provided by the cluster sizes $m_1, m_2, \ldots, m_n$, and hence may be less precise.

Example 8.4:
Use the data in previous table to estimate the total income of all
residents of the city if M is unknown. Place a bound on the error of
estimation.
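The estimator $N\bar{y}_t$ needs only the cluster totals, so Example 8.4 can be sketched with the $y_i$ column alone. The sketch below assumes $N = 415$ (from Example 8.11) and a bound of two estimated standard deviations; both are assumptions rather than values stated in this example.

```python
import math

# Cluster income totals y_i for the 25 sampled blocks
y = [96000, 121000, 42000, 65000, 52000, 40000, 75000, 65000, 45000,
     50000, 85000, 43000, 54000, 49000, 53000, 50000, 32000, 22000,
     45000, 37000, 51000, 30000, 39000, 47000, 41000]

N = 415                                   # number of blocks (assumed)
n = len(y)

y_bar_t = sum(y) / n                      # average cluster total
tau_hat = N * y_bar_t                     # estimator N * y_bar_t of the total
s_t2 = sum((yi - y_bar_t) ** 2 for yi in y) / (n - 1)
var_total = N ** 2 * (1 - n / N) * s_t2 / n
bound = 2 * math.sqrt(var_total)          # bound assumed to be 2 * sqrt(var)

print(f"estimated total income: {tau_hat:,.0f}")
print(f"bound on the error of estimation: {bound:,.0f}")
```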
Equal Cluster Sizes: Comparison to SRS
Assume all of the $m_i$ values are equal to a common value, $m$, for the entire population of clusters. The estimator
\[
\bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}
\]
of the population mean per element is denoted in this equal cluster size case by $\bar{y}_c$, and it becomes
\[
\bar{y}_c = \frac{1}{m} \left[ \frac{1}{n} \sum_{i=1}^{n} y_i \right] = \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij}
\]

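where $y_{ij}$ denotes the $j$th sample observation from cluster $i$. The estimated variance is
\[
\hat{V}(\bar{y}_c) = \left(1 - \frac{n}{N}\right) \left(\frac{1}{n m^2}\right) \left(\frac{1}{n - 1}\right) \sum_{i=1}^{n} (y_i - \bar{y}_t)^2
\]
where
\[
\bar{y}_t = \frac{1}{n} \sum_{i=1}^{n} y_i = m \bar{y}_c .
\]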

Let the sample average for cluster $i$ be denoted by $\bar{y}_i$, where $\bar{y}_i = y_i / m$. Therefore,
\[
\frac{1}{n m^2 (n - 1)} \sum_{i=1}^{n} (y_i - \bar{y}_t)^2
= \frac{1}{n m^2 (n - 1)} \sum_{i=1}^{n} (m \bar{y}_i - m \bar{y}_c)^2
= \frac{1}{n (n - 1)} \sum_{i=1}^{n} (\bar{y}_i - \bar{y}_c)^2
\]

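Using an ANOVA argument,
\[
\sum_{i=1}^{n} \sum_{j=1}^{m} (y_{ij} - \bar{y}_c)^2
= \sum_{i=1}^{n} \sum_{j=1}^{m} (y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^{n} \sum_{j=1}^{m} (\bar{y}_i - \bar{y}_c)^2
= \sum_{i=1}^{n} \sum_{j=1}^{m} (y_{ij} - \bar{y}_i)^2 + m \sum_{i=1}^{n} (\bar{y}_i - \bar{y}_c)^2
\]
that is, SST = SSW + SSB.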


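\[
MSB = \frac{SSB}{n - 1} = \frac{m}{n - 1} \sum_{i=1}^{n} (\bar{y}_i - \bar{y}_c)^2
\]
\[
MSW = \frac{SSW}{n (m - 1)} = \frac{1}{n (m - 1)} \sum_{i=1}^{n} \sum_{j=1}^{m} (y_{ij} - \bar{y}_i)^2
\]
Hence
\[
\hat{V}(\bar{y}_c) = \left(1 - \frac{n}{N}\right) \frac{1}{nm} \, MSB
\]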

Example:
The circulation manager of a newspaper wishes to estimate the average
number of newspapers purchased per household in a given community.
Travel costs from household to household are substantial. Therefore, the
4000 households in the community are listed in 400 geographical
clusters of 10 households each, and a simple random sample of 4
clusters is selected. Interviews are conducted, with the results shown
below. Estimate the average number of newspapers per household for
the community and place a bound on the error of estimation.
Cluster   Number of newspapers            Total
1         1  2  1  3  3  2  1  4  1  1    19
2         1  3  2  2  3  1  4  1  1  2    20
3         2  1  1  1  1  3  2  1  3  1    16
4         1  1  3  2  1  5  1  2  3  1    20
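The equal-cluster-size formulas can be checked on this newspaper example with a short Python sketch. It computes $\bar{y}_c$, $MSB$, and $\hat{V}(\bar{y}_c) = (1 - n/N)\,MSB/(nm)$ with $N = 400$, $n = 4$, and $m = 10$ as given in the example. The bound is taken to be $2\sqrt{\hat{V}(\bar{y}_c)}$, which is an assumption based on the $B^2/4$ factor in the sample-size formulas.

```python
import math

# Number of newspapers purchased by each of the 10 households in each cluster
clusters = [
    [1, 2, 1, 3, 3, 2, 1, 4, 1, 1],
    [1, 3, 2, 2, 3, 1, 4, 1, 1, 2],
    [2, 1, 1, 1, 1, 3, 2, 1, 3, 1],
    [1, 1, 3, 2, 1, 5, 1, 2, 3, 1],
]

N = 400                     # clusters in the population
n = len(clusters)           # clusters sampled
m = len(clusters[0])        # common cluster size

# Overall mean per household
y_bar_c = sum(sum(c) for c in clusters) / (n * m)

# Cluster means and the between-cluster mean square
cluster_means = [sum(c) / m for c in clusters]
msb = m * sum((yb - y_bar_c) ** 2 for yb in cluster_means) / (n - 1)

var = (1 - n / N) * msb / (n * m)
bound = 2 * math.sqrt(var)  # bound assumed to be 2 * sqrt(var)

print(f"y_bar_c = {y_bar_c:.3f}, MSB = {msb:.4f}")
print(f"bound on the error of estimation: {bound:.3f}")
```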
Selecting the Sample Size for Estimating Population Means and Totals
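Given
\[
\hat{V}(\bar{y}) = \left(1 - \frac{n}{N}\right) \frac{s_r^2}{n \bar{M}^2}
\]
where
\[
s_r^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y} m_i)^2}{n - 1}
\]
and
\[
V(\bar{y}) = \left(1 - \frac{n}{N}\right) \frac{\sigma_r^2}{n \bar{M}^2}
\]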
Approximate sample size required to estimate $\mu$, with a bound $B$ on the error of estimation:
\[
n = \frac{N \sigma_r^2}{N D + \sigma_r^2}
\]
where $\sigma_r^2$ is estimated by $s_r^2$ and $D = B^2 \bar{M}^2 / 4$.

Example 8.6:
Suppose the data in Table 8.1 represent a preliminary sample of incomes in the city. How large a sample should be taken in a future survey in order to estimate the average per-capita income $\mu$ with a bound of $500 on the error of estimation?
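As a rough sketch of Example 8.6, the formula $n = N\sigma_r^2 / (ND + \sigma_r^2)$ with $D = B^2 \bar{M}^2 / 4$ can be evaluated in Python. The sketch assumes $N = 415$ (from Example 8.11), estimates $\sigma_r^2$ by $s_r^2$ from the preliminary sample, estimates $\bar{M}$ by $\bar{m}$, and rounds the result up to a whole number of clusters.

```python
import math

# Preliminary sample: cluster sizes m_i and cluster income totals y_i
m = [8, 12, 4, 5, 6, 6, 7, 5, 8, 3, 2, 6, 5,
     10, 9, 3, 6, 5, 5, 4, 6, 8, 7, 3, 8]
y = [96000, 121000, 42000, 65000, 52000, 40000, 75000, 65000, 45000,
     50000, 85000, 43000, 54000, 49000, 53000, 50000, 32000, 22000,
     45000, 37000, 51000, 30000, 39000, 47000, 41000]

N = 415                                  # number of blocks (assumed)
B = 500                                  # required bound on the error
n0 = len(m)                              # size of the preliminary sample

y_bar = sum(y) / sum(m)
m_bar = sum(m) / n0                      # stands in for M_bar (M unknown)
s_r2 = sum((yi - y_bar * mi) ** 2 for mi, yi in zip(m, y)) / (n0 - 1)

D = B ** 2 * m_bar ** 2 / 4
n_exact = N * s_r2 / (N * D + s_r2)
n_required = math.ceil(n_exact)          # round up to whole clusters

print(f"n = {n_exact:.2f}, so sample {n_required} clusters")
```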

Approximate sample size required to estimate $\tau$ using $M\bar{y}$, with a bound $B$ on the error of estimation:
\[
n = \frac{N \sigma_r^2}{N D + \sigma_r^2}
\]
where $\sigma_r^2$ is estimated by $s_r^2$ and $D = B^2 / (4 N^2)$.

Example 8.7:
Using the data in Table 8.1 as a preliminary sample of incomes in the
city, how large a sample is necessary to estimate the total income of all
residents, τ with a bound of $1,000,000 on the error of estimation?
There are M = 2500 residents of the city.
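Example 8.7 can be sketched the same way, now with $D = B^2 / (4N^2)$. The value $N = 415$ is an assumption taken from Example 8.11, and $\sigma_r^2$ is estimated by $s_r^2$ from the preliminary sample. Note that $M$ does not appear explicitly in the formula: since $M = N\bar{M}$, the variance of $M\bar{y}$ reduces to $N^2 (1 - n/N)\,\sigma_r^2 / n$.

```python
import math

# Preliminary sample: cluster sizes m_i and cluster income totals y_i
m = [8, 12, 4, 5, 6, 6, 7, 5, 8, 3, 2, 6, 5,
     10, 9, 3, 6, 5, 5, 4, 6, 8, 7, 3, 8]
y = [96000, 121000, 42000, 65000, 52000, 40000, 75000, 65000, 45000,
     50000, 85000, 43000, 54000, 49000, 53000, 50000, 32000, 22000,
     45000, 37000, 51000, 30000, 39000, 47000, 41000]

N = 415                                  # number of blocks (assumed)
B = 1_000_000                            # required bound on the error
n0 = len(m)                              # size of the preliminary sample

y_bar = sum(y) / sum(m)
s_r2 = sum((yi - y_bar * mi) ** 2 for mi, yi in zip(m, y)) / (n0 - 1)

D = B ** 2 / (4 * N ** 2)
n_exact = N * s_r2 / (N * D + s_r2)
n_required = math.ceil(n_exact)          # round up to whole clusters

print(f"n = {n_exact:.2f}, so sample {n_required} clusters")
```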
Approximate sample size required to estimate $\tau$ using $N\bar{y}_t$, with a bound $B$ on the error of estimation:
\[
n = \frac{N \sigma_t^2}{N D + \sigma_t^2}
\]
where $\sigma_t^2$ is estimated by $s_t^2$ and $D = B^2 / (4 N^2)$.

Example 8.8:
Assume that the data in Table 8.1 are a preliminary sample of incomes in the city and that $M$ is unknown. How large a sample must be taken to estimate the total income of all residents, $\tau$, with a bound of $1,000,000 on the error of estimation?
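When $M$ is unknown the total is estimated by $N\bar{y}_t$, whose variance involves the variance of the cluster totals, $s_t^2$, rather than $s_r^2$. The sketch below therefore plugs $s_t^2$ into $n = N\sigma^2 / (ND + \sigma^2)$ with $D = B^2 / (4N^2)$. It assumes $N = 415$ (from Example 8.11) and is an illustration, not a solution taken from the notes.

```python
import math

# Preliminary sample: cluster income totals y_i
y = [96000, 121000, 42000, 65000, 52000, 40000, 75000, 65000, 45000,
     50000, 85000, 43000, 54000, 49000, 53000, 50000, 32000, 22000,
     45000, 37000, 51000, 30000, 39000, 47000, 41000]

N = 415                                  # number of blocks (assumed)
B = 1_000_000                            # required bound on the error
n0 = len(y)                              # size of the preliminary sample

y_bar_t = sum(y) / n0                    # average cluster total
s_t2 = sum((yi - y_bar_t) ** 2 for yi in y) / (n0 - 1)

D = B ** 2 / (4 * N ** 2)
n_exact = N * s_t2 / (N * D + s_t2)
n_required = math.ceil(n_exact)          # round up to whole clusters

print(f"n = {n_exact:.2f}, so sample {n_required} clusters")
```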
Estimation of a Population Proportion
Suppose an experimenter wishes to estimate a population proportion.
For example:
a) The proportion of houses in a state with inadequate plumbing.
b) The proportion of corporation presidents who are college
graduates.
Let
$a_i$ = the total number of elements in cluster $i$ that possess the characteristic of interest
$m_i$ = the number of elements in the $i$th cluster
Estimator of the population proportion $p$:
\[
\hat{p} = \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} m_i}
\]

Estimated variance of $\hat{p}$:
\[
\hat{V}(\hat{p}) = \left(1 - \frac{n}{N}\right) \frac{s_p^2}{n \bar{M}^2}
\]
where
\[
s_p^2 = \frac{\sum_{i=1}^{n} (a_i - \hat{p} m_i)^2}{n - 1}
\]

Note:
1. $\hat{V}(\hat{p})$ is a good estimator only when the sample size $n$ is large, say $n \ge 20$.
2. If $m_1 = m_2 = \ldots = m_N$, then $\hat{p}$ is an unbiased estimator of $p$ and $\hat{V}(\hat{p})$ is an unbiased estimator of the actual variance of $\hat{p}$.
Example 8.9:
In addition to being asked about their income, the residents of the
sample survey in Example 8.2 are asked whether they rent or own their
homes. The results are given in Table 8.2. Hence, estimate the
proportion of residents who live in rented housing. Place a bound on the
error of estimation.
Table 8.2
Cluster   Residents, $m_i$   Renters, $a_i$   $a_i - \hat{p} m_i$
1         8                  4                0.16
2         12                 7                1.24
3         4                  1                −0.92
4         5                  3
5         6                  3
6         6                  4
7         7                  4
8         5                  2
9         8                  3
10        3                  2
11        2                  1
12        6                  3
13        5                  2
14        10                 5
15        9                  4
16        3                  1
17        6                  4
18        5                  2
19        5                  3
20        4                  1
21        6                  3
22        8                  3
23        7                  4
24        3                  0
25        8                  3
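The proportion estimate in Example 8.9 can be sketched in Python from the residents and renters columns. The sketch assumes $N = 415$ (from Example 8.11), estimates $\bar{M}$ by $\bar{m}$, and uses a bound of $2\sqrt{\hat{V}(\hat{p})}$. It keeps $\hat{p}$ unrounded, so the deviations $a_i - \hat{p} m_i$ differ slightly from the three tabulated values, which appear to use $\hat{p}$ rounded to 0.48.

```python
import math

# Residents m_i and renters a_i in each of the 25 sampled blocks
m = [8, 12, 4, 5, 6, 6, 7, 5, 8, 3, 2, 6, 5,
     10, 9, 3, 6, 5, 5, 4, 6, 8, 7, 3, 8]
a = [4, 7, 1, 3, 3, 4, 4, 2, 3, 2, 1, 3, 2,
     5, 4, 1, 4, 2, 3, 1, 3, 3, 4, 0, 3]

N = 415                                  # number of blocks (assumed)
n = len(m)

p_hat = sum(a) / sum(m)                  # estimated proportion of renters
m_bar = sum(m) / n                       # stands in for M_bar (M unknown)
s_p2 = sum((ai - p_hat * mi) ** 2 for mi, ai in zip(m, a)) / (n - 1)

var = (1 - n / N) * s_p2 / (n * m_bar ** 2)
bound = 2 * math.sqrt(var)               # bound assumed to be 2 * sqrt(var)

print(f"p_hat = {p_hat:.4f}")
print(f"bound on the error of estimation: {bound:.4f}")
```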

Selecting the sample size for estimating proportions
\[
n = \frac{N \sigma_p^2}{N D + \sigma_p^2}
\]
where $\sigma_p^2$ is estimated by $s_p^2$ and $D = B^2 \bar{M}^2 / 4$.
Example 8.10:
The data in Table 8.2 are out of date. A new study will be conducted in
the same city for the purpose of estimating the proportion p of residents
who rent their homes. How large a sample size should be taken to
estimate p with a bound of 0.04 on the error of estimation?
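A minimal sketch of Example 8.10 follows, treating the Table 8.2 data as a preliminary sample. It assumes $N = 415$ (from Example 8.11), estimates $\sigma_p^2$ by $s_p^2$ and $\bar{M}$ by $\bar{m}$, and rounds up to a whole number of clusters.

```python
import math

# Preliminary sample: residents m_i and renters a_i per block
m = [8, 12, 4, 5, 6, 6, 7, 5, 8, 3, 2, 6, 5,
     10, 9, 3, 6, 5, 5, 4, 6, 8, 7, 3, 8]
a = [4, 7, 1, 3, 3, 4, 4, 2, 3, 2, 1, 3, 2,
     5, 4, 1, 4, 2, 3, 1, 3, 3, 4, 0, 3]

N = 415                                  # number of blocks (assumed)
B = 0.04                                 # required bound on the error
n0 = len(m)                              # size of the preliminary sample

p_hat = sum(a) / sum(m)
m_bar = sum(m) / n0                      # stands in for M_bar (M unknown)
s_p2 = sum((ai - p_hat * mi) ** 2 for mi, ai in zip(m, a)) / (n0 - 1)

D = B ** 2 * m_bar ** 2 / 4
n_exact = N * s_p2 / (N * D + s_p2)
n_required = math.ceil(n_exact)          # round up to whole clusters

print(f"n = {n_exact:.2f}, so sample {n_required} clusters")
```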
Cluster Sampling Combined with Stratification
The population may be divided into L strata and a cluster sample can
then be selected from each stratum.

Example 8.11:
Let the data in Table 8.1 form the sample of stratum 1, with, as in Example 8.2, $N_1 = 415$ and $n_1 = 25$. A smaller neighboring city is taken to be stratum 2. For stratum 2, $n_2 = 10$ blocks are to be sampled from $N_2 = 168$. Estimate the average per-capita income in the two cities combined and place a bound on the error of estimation, given the additional data shown in the following table:
Cluster   No. of residents, $m_i$   Total income per cluster, $y_i$
1         2                         18,000
2         5                         52,000
3         7                         68,000
4         4                         36,000
5         3                         45,000
6         8                         96,000
7         6                         64,000
8         10                        115,000
9         3                         41,000
10        1                         12,000
Cluster Sampling with Probabilities Proportional to Size
Sometimes, estimates can be improved by varying the probabilities with
which units are sampled from the population.
For example, suppose we want to estimate the number of job openings in a city by sampling industrial firms within that city. Small firms employ few workers, whereas large firms employ many, so it is natural to give larger firms a greater chance of being selected. This is sampling with probabilities proportional to size, or pps sampling.

Estimator of the population mean $\mu$:
\[
\hat{\mu}_{pps} = \bar{y} = \frac{1}{n} \sum_{i=1}^{n} \bar{y}_i
\]
where $\bar{y}_i$ is the mean for the $i$th cluster.

Estimated variance of $\hat{\mu}_{pps}$:
\[
\hat{V}(\hat{\mu}_{pps}) = \frac{1}{n (n - 1)} \sum_{i=1}^{n} (\bar{y}_i - \hat{\mu}_{pps})^2
\]

Estimator of the population total $\tau$:
\[
\hat{\tau}_{pps} = \frac{M}{n} \sum_{i=1}^{n} \bar{y}_i
\]

Estimated variance of $\hat{\tau}_{pps}$:
\[
\hat{V}(\hat{\tau}_{pps}) = \frac{M^2}{n (n - 1)} \sum_{i=1}^{n} (\bar{y}_i - \hat{\mu}_{pps})^2
\]
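The four pps formulas above translate directly into a few lines of Python. In the sketch below the function name `pps_estimates`, the three cluster means, and the population size $M = 1000$ are all hypothetical values chosen only to illustrate the arithmetic; they do not come from the notes.

```python
def pps_estimates(cluster_means, M=None):
    """Return pps estimates of the mean and, if M is given, of the total.

    cluster_means: the means y_bar_i of the n sampled clusters.
    M: number of elements in the population (needed only for the total).
    """
    n = len(cluster_means)
    mu_hat = sum(cluster_means) / n                  # mean of cluster means
    ss = sum((yb - mu_hat) ** 2 for yb in cluster_means)
    var_mu = ss / (n * (n - 1))                      # estimated variance of mu_hat
    result = {"mu_hat": mu_hat, "var_mu": var_mu}
    if M is not None:
        result["tau_hat"] = M * mu_hat               # (M / n) * sum of cluster means
        result["var_tau"] = M ** 2 * var_mu
    return result


# Hypothetical cluster means and population size, for illustration only
est = pps_estimates([2.0, 2.5, 1.5], M=1000)
print(est)
```

Only the cluster means enter the estimators; the cluster sizes are used at the selection stage, when clusters are drawn with probability proportional to size.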

Example 8.12:
An auditor wishes to sample sick-leave records of a large firm in order to estimate the average number of days of sick leave per employee over the past quarter. The firm has eight divisions, with varying numbers of employees per division. Because the number of days of sick leave used within each division should be highly correlated with the number of employees, the auditor decides to sample $n = 3$ divisions with probability proportional to the number of employees. Show how to select the sample if the numbers of employees in the eight divisions are 1200, 450, 2100, 860, 2840, 1910, 390 and 3200.

Division   No. of employees   Cumulative range
1          1200               1-1200
2          450                1201-1650
3          2100               1651-3750
4          860                3751-4610
5          2840               4611-7450
6          1910               7451-9360
7          390                9361-9750
8          3200               9751-12950
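The cumulative ranges in the table suggest the usual selection procedure: draw a random number between 1 and 12,950 and pick the division whose range contains it, repeating until $n = 3$ draws have been made. A minimal Python sketch of that idea follows. It assumes sampling with replacement (the notes do not say), and the function names `division_for` and `select_pps` are hypothetical.

```python
import bisect
import random

# Upper ends of the cumulative ranges in the table (division 8 ends at 12950)
upper_bounds = [1200, 1650, 3750, 4610, 7450, 9360, 9750, 12950]


def division_for(r):
    """Return the division (1 to 8) whose cumulative range contains r."""
    # bisect_left finds the first upper bound that is >= r
    return bisect.bisect_left(upper_bounds, r) + 1


def select_pps(n, seed=None):
    """Draw n divisions with replacement, proportional to number of employees."""
    rng = random.Random(seed)
    draws = [rng.randint(1, upper_bounds[-1]) for _ in range(n)]
    return [(r, division_for(r)) for r in draws]


for r, d in select_pps(3, seed=0):
    print(f"random number {r:5d} -> division {d}")
```

Because each of the 12,950 integers is equally likely, a division with more employees covers a wider range and is selected with proportionally higher probability.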
Example 8.13:
Suppose the total no. of sick-leave days used by the three sampled
divisions during the past quarter are, respectively,
$y_1 = 4320$, $y_2 = 4160$, $y_3 = 5790$

Estimate the average no. of sick-leave days used per person for the entire
firm and place a bound on the error of estimation.
