Chapter 3 Sampling Methods
- Abdi-Khalil Edriss -
SAMPLING METHODS
I. Sampling
Appropriate sampling methods are essential for making proper inferences about a
population and for drawing valid conclusions about the information of interest. The
sample must be selected through proper sampling techniques so that it represents the
entire population.
A researcher must prepare a sample design for the study, that is, a plan for how the
sample will be selected and how large it will be. The sample design is determined
before data are collected, and it is the core of any research undertaken.
POINTS TO PONDER
The relationship between a sample and a population is established through the sample
statistic and the population parameter. We can be reasonably sure that the sample
statistic (e.g., $\bar{y}$, $S$, or $S^2$) is fairly close to the population parameter
(i.e., $\mu$, $\sigma$, or $\sigma^2$) if the sample elements are representative of the
population elements and if the analysis is done properly. Thus, by studying and
understanding the properties (or characteristics) of the sample elements, it is
possible to generalize the estimated characteristics to the population elements.
[Figure: Sampling links the population to the sample. Population parameters ($\mu$, $\sigma$, or $\sigma^2$) are estimated by sample estimates ($\bar{y}$, $S$, or $S^2$).]
3.2. Why take a sample?
finite universe). A source list should be comprehensive, correct, reliable
and appropriate. It is extremely important for the source list to be as
representative of the population as possible. As in the example given previously,
a comprehensive list might follow the hierarchy region-district-village-household.
o Size of sample: This refers to the number of items to be selected from the
universe or population to constitute a sample. The size of sample should
neither be excessively large, nor too small. It should be optimum. In order
to decide on the size of the sample to be selected, a researcher must take
into consideration the size of population variance, the size of population,
the parameter of interest in the research study, and budgetary constraint.
Refer to Chapter 4 for detailed discussion on Sample Size Determinations.
What is a representative and adequate sample size?
3.4. How do we get a good sample?
POINTS TO PONDER
Make sure that all of the sampling units are listed in the sampling frame. If you refer to
several dated census listings, ensure that they are from approximately the same year.
RAND()*(b-a)+a
This formula can generate as many random numbers as you want. Alternatively,
one may use a table of random numbers, as in the example given here.
ii. The next step is to determine the number of digits we will need in the
random numbers we select. In our example, there are 350 members of
the population, so we will need 3-digit numbers to give everyone equal
chance of being selected. (If there were 17480 members of the
population, we would need to select 5-digit numbers.) Thus, we want
to select 40 random numbers in the range from 001 to 350.
iii. Now turn to Table 3.1. Notice that there are several rows and columns of
5-digit numbers. The table represents a series of random numbers in
the range from 00001 to 99999. To use the table for our hypothetical
sample, we have to answer these questions:
v. We can also choose to progress through the table any way we want:
down the columns, up, across to the right or to the left, or
diagonally. Again, any of these plans will work just fine so long as
we stick to it. For convenience, let's agree to move down the
columns; when we reach the bottom of one column, we will go to the
top of the next, and so forth.
vi. Now, where do we start? You can close your eyes and point
anywhere in the table (Table 3.1). (I know it does not sound
scientific, but it works.) Say, I will pick the number in the 3rd row
of column 2. Start with that number.
vii. Let's suppose we decide to start with the 3rd number in column 2
(Table 3.1). You will see that the starting number is 21960. We
have selected 219 as our first random number, and we have 39
more to go. Moving down the second column, we select 254, 004,
095, and continue to the top of column 3: 234, 116, 033, 156, 302,
174, 107, 328, 078, and so on.
Note that any time you come across a number that lies outside your
range, skip it and continue on your way. The same applies if the
same number comes up more than once.
viii. That is the way to go. You keep up the procedure until you have
selected 40 random numbers. Returning to your list, your sample
consists of household numbers 219, 254, 004, and so forth.
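The table-based procedure above can also be mimicked in software. The following is a minimal sketch, assuming Python's standard `random` module is an acceptable substitute for the printed random number table; the function name `draw_simple_random_sample` and the seed are illustrative, not part of the original procedure.

```python
import random


def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Draw unit numbers from 1..population_size without replacement."""
    rng = random.Random(seed)
    # random.sample never returns the same unit twice, which mirrors the
    # rule of skipping a number that comes up more than once in the table.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


# 40 households out of 350, as in the example (the seed is arbitrary).
households = draw_simple_random_sample(350, 40, seed=1)
# Print as 3-digit numbers, matching the 001-350 labelling used above.
print([f"{h:03d}" for h in households])
```

Numbers outside the range 001 to 350 never appear here because the draw is made directly from that range, so no skipping step is needed.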
POINTS TO PONDER
In simple random sampling, the selection of an individual or element is independent
of the selection of any other individual or element. Under this sampling design, every
item of the population has an equal chance of inclusion in the sample. It is blind
chance alone that determines whether one item or another is selected. Random
sampling is considered the best technique for selecting a representative sample. The
problem with this method is that it is time consuming. Always minimize sampling
error, standard error and all other sorts of statistical error.
By way of numerical examples, here are the computations for the main
characteristics of a simple random sample.
NUMERICAL EXAMPLE
From a large class, a random sample of 10 grades was drawn: 60, 52, 95, 80, 54, 48,
75, 91, 40 and 85. Calculate a 95% confidence interval for the mean grade of the whole class.
Sample mean:

$$\bar{y} = \frac{\sum y_i}{n} = \frac{60 + 52 + 95 + 80 + 54 + 48 + 75 + 91 + 40 + 85}{10} = \frac{680}{10} = 68$$

POINTS TO PONDER - This sample mean is an estimate of the population mean
(or average), denoted by $\mu$. That is, $\bar{y}$ estimates $\mu$.

Sample variance:

$$S^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}$$

$$= \frac{(60-68)^2 + (52-68)^2 + (95-68)^2 + (80-68)^2 + (54-68)^2 + (48-68)^2 + (75-68)^2 + (91-68)^2 + (40-68)^2 + (85-68)^2}{10 - 1}$$

$$= \frac{64 + 256 + 729 + 144 + 196 + 400 + 49 + 529 + 784 + 289}{9} = \frac{3440}{9} = 382.22$$

POINTS TO PONDER
The sample variance $S^2$ is an estimate of the population (class) variance, denoted
by $\sigma^2$.

Sample standard deviation:

$$S = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} = \sqrt{S^2} = \sqrt{382.22} = 19.55$$

95% confidence interval:

$$\mu = \bar{y} \pm \text{sampling error} = \bar{y} \pm t_{0.025,\,9}\,\frac{s}{\sqrt{n}} = 68 \pm 2.262\left(\frac{19.55}{\sqrt{10}}\right) = 68 \pm 13.98$$

where t is Student's t distribution, and for a 95% confidence interval with 9
degrees of freedom (n - 1 = 10 - 1 = 9), the t-value is 2.262 (refer to any
statistics book with a t-table).
Thus, with 95% confidence, we can conclude that the mean grade of the whole
class is between 54.02 and 81.98 (the lower and upper bounds of the confidence
interval).
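As a cross-check, the following minimal sketch recomputes the mean, variance and confidence interval directly from the ten grades. It assumes only Python's standard library; the t-value 2.262 is hard-coded from a t-table, as in the text, rather than computed.

```python
import math
import statistics

grades = [60, 52, 95, 80, 54, 48, 75, 91, 40, 85]
n = len(grades)

mean = statistics.mean(grades)                   # sample mean, y-bar
variance = float(statistics.variance(grades))    # sample variance, divisor n - 1
std_dev = math.sqrt(variance)                    # sample standard deviation

t_value = 2.262  # t(0.025, 9 degrees of freedom), read from a t-table
margin = t_value * std_dev / math.sqrt(n)        # sampling error
lower, upper = mean - margin, mean + margin

print(f"mean = {mean}, variance = {variance:.2f}, s = {std_dev:.2f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

Working from the raw data avoids carrying rounding or transcription slips from one hand-calculated step into the next.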
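[Figure 3.1: A region divided into five districts (strata) A, B, C, D and E, with sub-population sizes N1, N2, N3, N4 and N5 respectively.]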
Referring to Figure 3.1, suppose we divide a region into five districts (that is, strata),
A, B, C, D and E based on some geographic characteristics. And, the corresponding
size of the non-overlapping sub-populations (or strata) are given as –
Note that $N = \sum N_i = 200 + 300 + 500 + 600 + 400 = 2000$. Suppose we want to
take a sample of size n = 100 out of the entire population, N = 2000; then from each
stratum we may sample in proportion to its size:

$$n_1 = n\,\frac{N_1}{N} = 100 \times \frac{200}{2000} = 10 \quad \text{from A (i.e., out of } N_1 = 200\text{)}$$

n2 = 15 from B (out of N2), n3 = 25 from C (out of N3), n4 = 30 from D (out of
N4) and n5 = 20 from E (out of N5).
Therefore, n = n1 + n2 + n3 + n4 + n5 = 100
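The proportional allocation above can be sketched in a few lines of code. This is a minimal illustration, assuming the stratum sizes given in the text; the function name `proportional_allocation` is hypothetical.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Allocate total_sample to strata in proportion to their sizes N_i."""
    population = sum(stratum_sizes.values())
    return {
        name: total_sample * size / population
        for name, size in stratum_sizes.items()
    }


# Stratum sizes N1..N5 for districts A..E, as given in the text.
strata = {"A": 200, "B": 300, "C": 500, "D": 600, "E": 400}
allocation = proportional_allocation(strata, total_sample=100)
print(allocation)
```

In this example every allocation is a whole number; with other stratum sizes the results would need rounding.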
However, where strata differ not only in size but also in variability, it is
reasonable to take larger samples from the more variable strata and smaller samples
from the less variable strata. We can account for both differences (in stratum size
and in stratum variability) by using a disproportionate sampling design with the
formula:

$$n_i = \frac{n\,N_i\,\sigma_i}{N_1\sigma_1 + N_2\sigma_2 + \dots + N_k\sigma_k}$$

where $\sigma_i$ denotes the standard deviation of the ith stratum, $N_i$ denotes the size of the
ith stratum, and $n_i$ denotes the sample size for the ith stratum.
Using the previous example, assume a population is divided into five strata so that N1
= 200, N2 = 300, N3 = 500, N4 = 600 and N5 = 400. The respective standard deviations are
$\sigma_1 = 5$, $\sigma_2 = 7$, $\sigma_3 = 10$, $\sigma_4 = 15$ and $\sigma_5 = 9$. How should a sample of size n = 100 be
allocated to the five strata under optimum allocation?
Applying the disproportionate sampling design, the optimum sample size for each stratum
is calculated as follows. The denominator is the same for every stratum:

$$N_1\sigma_1 + N_2\sigma_2 + N_3\sigma_3 + N_4\sigma_4 + N_5\sigma_5 = (200)(5) + (300)(7) + (500)(10) + (600)(15) + (400)(9) = 20700$$

$$n_1 = \frac{100(200)(5)}{20700} = \frac{100000}{20700} = 4.83 \approx 5$$

$$n_2 = \frac{100(300)(7)}{20700} = \frac{210000}{20700} = 10.14 \approx 10$$

$$n_3 = \frac{100(500)(10)}{20700} = \frac{500000}{20700} = 24.15 \approx 24$$

$$n_4 = \frac{100(600)(15)}{20700} = \frac{900000}{20700} = 43.48 \approx 44$$

$$n_5 = \frac{100(400)(9)}{20700} = \frac{360000}{20700} = 17.39 \approx 17$$

Here n4 is rounded up to 44 so that the five allocations sum to n = 100.
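The optimum allocation can be sketched in code as follows. This is a minimal illustration using the stratum sizes and standard deviations from the example. The largest-remainder rounding rule is an assumption introduced here so that the integer allocations sum to n; the text does not state which rounding rule it uses.

```python
import math


def optimum_allocation(sizes, std_devs, total_sample):
    """Raw allocations n_i = n * N_i * sigma_i / sum(N_j * sigma_j)."""
    weights = [size * sd for size, sd in zip(sizes, std_devs)]
    denominator = sum(weights)
    return [total_sample * w / denominator for w in weights]


def round_to_total(raw, total_sample):
    """Largest-remainder rounding (assumed rule) so the integers sum to total_sample."""
    floors = [math.floor(x) for x in raw]
    shortfall = total_sample - sum(floors)
    # Give the leftover units to the strata with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - floors[i], reverse=True)
    for i in order[:shortfall]:
        floors[i] += 1
    return floors


sizes = [200, 300, 500, 600, 400]   # N_1..N_5
std_devs = [5, 7, 10, 15, 9]        # sigma_1..sigma_5
raw = optimum_allocation(sizes, std_devs, total_sample=100)
allocation = round_to_total(raw, total_sample=100)
print([round(x, 2) for x in raw], allocation)
```

Compared with proportional allocation, the more variable stratum D receives a larger share of the sample and the less variable stratum A a smaller one.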
POINTS TO PONDER
(1) In stratified sampling, each stratum is homogeneous internally and heterogeneous
with other strata. (2) The more strata used, the closer you come to maximizing inter-
strata differences and minimizing intra-stratum variances.
For example, we may want to measure indicators for special groups (sub-
populations) within our study area such as urban and rural, female-headed and
male-headed households, ethnic groups, religious groups, etc. To ensure that
such special groups are adequately represented in our sample, we create a
stratum for each group and select units from each stratum separately.
POINTS TO PONDER
In practice, we first make inferences (deal with characteristics) about each stratum,
and finally about all the strata combined.
POINTS TO PONDER
The weighting of results is needed because the strata (or groups) that are formed
yield unequal sample sizes (small and large groups).
Frequently the sample size in each stratum is not proportional to the actual
population. This imbalance must be corrected by weighting when the data set is
analyzed. For example, if urban dwellers constitute 10%
of a country's population but our stratification procedure results in a
sample with 20% belonging to that category, the sample data will have to
be weighted to produce national results. Another remedial procedure is to
use the Probability Proportional to Size Sampling (PPS) or Proportional
Stratification (PS) method, which we will discuss later.
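The weighting correction can be illustrated with a short sketch. It assumes a simple two-group split in which everyone who is not an urban dweller is labelled "rural" (a label introduced here for illustration), and it takes the weight of a group to be its population share divided by its sample share.

```python
# Shares from the example: urban dwellers are 10% of the population
# but 20% of the sample. The "rural" group is assumed to be the remainder.
population_share = {"urban": 0.10, "rural": 0.90}
sample_share = {"urban": 0.20, "rural": 0.80}

# Weight = population share / sample share for each group.
weights = {
    group: population_share[group] / sample_share[group]
    for group in population_share
}

# After weighting, each group's share of the sample matches the population.
weighted_share = {group: sample_share[group] * weights[group] for group in weights}
print(weights, weighted_share)
```

Over-represented groups receive a weight below 1 and under-represented groups a weight above 1.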
POINTS TO PONDER
The main advantages of stratified sampling are: (i) More reliable information is
obtained for the same sample size if the population is stratified than if it is
sampled as a whole. (ii) Comparisons between strata are easy, because a
separate but similar survey is done in each stratum.
Probability Proportional to Size (PPS) Sampling
PPS takes into consideration the size of each stratum. The PPS technique
automatically removes the imbalances of sample size in stratified sampling.
It is also known as Proportional Stratification.
PPS is applied when some strata are considerably larger than others, that is,
when the number of basic sampling units (say, households) varies from stratum
to stratum. For example, the areas of a country may vary considerably in
population density (some are highly populated while others are sparsely populated).
POINTS TO PONDER
The PPS method ensures that communities with larger populations have a proportionately
greater chance of containing a selected cluster than small communities. This type of
sample is self-weighting, which simplifies the analysis and improves the
representativeness of the sample.
Step 1
o Assign a number to each community or village
o List the population size of each community or village
1 Refer to the end of this chapter for an additional PPS example, courtesy of the National Statistics Office (NSO),
Malawi. This real application shows the combination of SRS, SS and PPS that was used by NSO
for the National Sample Survey of Agriculture (NSSA) data collection.
o List the cumulative population of each community, that is, the sum of
the population of that community plus the populations of all the
communities above it.
Step 2
o Suppose the population comprises 10 villages and the population sizes
are as shown in Table 3.2.
Step 3
o Calculate the sampling interval, using
Sampling interval = total population / number of clusters required
Now, suppose we need to select 5 villages (say, clusters) from a total population
of 8450; this will give
Sampling interval = 8450 / 5 = 1690
Step 4
o Select a random number that is less than or equal to the sampling
interval. So choose a number between 1 and 1690 by using a table of
random numbers. Say the chosen number is 1410.
Step 5
o Look at the cumulative village table and locate the first cluster or strata
by finding the village whose cumulative population exceeds this
random number. In our example, the first cluster would be located in
village 3, whose cumulative population is 1500.
Step 6
o Add the sampling interval to the random number
1690 + 1410 = 3100
Step 7
o Choose the community whose cumulative population exceeds 3100.
Thus, the second cluster will be located in village 5.
Step 8
o Identify the location of each subsequent cluster by adding the sampling
interval to the number that located the previous cluster (say, 3100 + 1690,
4790 + 1690, etc.). We stop when we have located as many clusters as
we need.
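The arithmetic in Steps 3 to 8 can be sketched as follows. This minimal illustration generates only the selection numbers (random start plus multiples of the sampling interval); matching each number to a village still requires the cumulative population column of Table 3.2. The function name `selection_points` is hypothetical, and integer division is an assumption that happens to be exact here.

```python
def selection_points(total_population, clusters_needed, random_start):
    """Return the sampling interval and the successive selection numbers."""
    interval = total_population // clusters_needed
    if not 1 <= random_start <= interval:
        raise ValueError("random start must lie between 1 and the interval")
    points = [random_start + k * interval for k in range(clusters_needed)]
    return interval, points


# 5 clusters from a total population of 8450, random start 1410 (as in the text).
interval, points = selection_points(8450, 5, random_start=1410)
print(interval, points)
```

Each selection number is then located in the cumulative population column to identify the village that contains the corresponding cluster.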
Note that p stands for proportion, $\mu$ is the population mean, $\sigma^2$ is the population variance, n
is the sample size, and $\bar{y}$ is the sample mean.
Furthermore, the variance of the stratified population can be estimated from the
sample stratified variances using the following formula.
In practice, since the population variance, $\sigma_j^2$, is unknown, we have replaced it by the
sample variance, $s_j^2$, in the formula as shown above.
NUMERICAL EXAMPLE
Suppose a Rural Credit Institution took a simple random sample of 1100 of the
300,000 farmers to whom it lends agricultural credit. The amounts y that the farmers
owed the Rural Credit Institution were summarized by an average of $\bar{y}$ = MK7500
(Malawi Kwacha) and a standard deviation of s = MK2000.
Now, since stratifying would improve accuracy, the population and sample were
sorted according to the total repayment. This gave the data shown in Table 3.4.
Totals: N = 300,000; n = 1100
Using data in Table 3.4, the Rural Credit Institution needs to know how much the
farmers owed it.
Solutions
First we need the population proportion pj (j = 1, 2, 3; that is, for the three strata), thus
p1 = 150,000/300,000 = 0.50
p2 = 100,000/300,000 = 0.33
p3 = 50,000/300,000 = 0.17
$$\bar{y}_{ss} = p_1\bar{y}_1 + p_2\bar{y}_2 + p_3\bar{y}_3 = 0.50(2500) + 0.33(6000) + 0.17(10{,}000) = 4930$$
So this is the average amount of repayment (in Malawi Kwacha) the farmers
should pay to the Rural Credit Institution.
The 95% confidence interval (C.I.) for the population mean repayment, $\mu_{ss}$, is:

$$\bar{y}_{ss} \pm t_{0.025,\,1100}\sqrt{\mathrm{Var}(\bar{y}_{ss})} = 4930 \pm 1.96(101.48) = 4930 \pm 198.90$$
From these results we can easily estimate the total amount of Malawi Kwacha (MK)
the farmers owe the institution. The total is calculated as
Hence, with 95% confidence we can conclude that the farmers owe the Rural Credit
Institution between MK1,419,330,000 and MK1,538,667,515 (1.4 to 1.5 billion
Malawi Kwacha).
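The calculations in this example can be retraced with a short sketch. It assumes the stratum proportions and means quoted in the solution, and it takes the standard error 101.48 and the multiplier 1.96 directly from the text rather than recomputing them, since the stratum variances are not used here. The computed totals may differ slightly from the printed figures because of rounding.

```python
proportions = [0.50, 0.33, 0.17]       # p_1, p_2, p_3 from the solution
stratum_means = [2500, 6000, 10_000]   # stratum means in MK, from the solution
population_size = 300_000              # number of farmers, N
std_error = 101.48                     # sqrt(Var(y_ss)), taken from the text
multiplier = 1.96                      # value used in the text for 95% confidence

# Stratified mean: weighted sum of the stratum means.
mean_ss = sum(p * y for p, y in zip(proportions, stratum_means))
margin = multiplier * std_error

# Scale the interval for the mean up to an interval for the total owed.
total_low = population_size * (mean_ss - margin)
total_high = population_size * (mean_ss + margin)
print(f"mean = {mean_ss:.0f} +/- {margin:.2f}")
print(f"total between MK{total_low:,.0f} and MK{total_high:,.0f}")
```

The interval for the total is simply the interval for the mean multiplied by the population size N.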
NUMERICAL EXAMPLE
Suppose the Ministry of Health and Population interviewed a sample of 50 pregnant
women from the 3 major hospitals in the country, and decided to select random
samples of size n1 = 25 from Queen Elizabeth, n2= 15 from Lilongwe Central
Hospital and n3 = 10 from Mzuzu Hospital. The simple random samples are selected
from the three strata, and interviews were conducted. The results, with measurements
of delivery and stay time in days are given below.
Solutions
Since the data use 3 strata (hospitals), we apply stratified sampling calculations –
$$\bar{y}_{ss} = p_1\bar{y}_1 + p_2\bar{y}_2 + p_3\bar{y}_3$$
b. Simply read the answer from the data table: it is 1.5 delivery
and stay days.
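$2\sqrt{\mathrm{Var}(\bar{y}_{ss})}$ and

Hence,

$$\mathrm{Var}(\bar{y}_{ss}) = \frac{p_1^2 s_1^2}{n_1} + \frac{p_2^2 s_2^2}{n_2} + \frac{p_3^2 s_3^2}{n_3}$$

$$t_{0.05,\,n}\,\frac{s}{\sqrt{n_2}} = (1.75)\,\frac{0.6}{\sqrt{15}} = 0.271$$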
Conclusion: The total estimated cost is between MK0.64 and 1.4 million per day at
Lilongwe Central Hospital.
V. Systematic Sampling
An element of randomness is introduced into this kind of sampling by
using random numbers to pick the unit with which to start. The
following steps will help:
o Assign a sequence number to each member of the population.
o Determine the skip interval by dividing the number of units in the
population by the sample size: I = P/S, where I is the interval (or skip), P
is the population size, and S is the sample size.
o Select a starting point from a random digit table (it must be between 1
and I).
o Include that item in the sample and select every Ith item thereafter
until the total sample has been selected.
For example, if we want to take a sample of 100 from a population of 2000 members, the
interval is 20 (i.e., 2000/100). The starting point must be selected randomly from the
interval 1 to 20. Then every 20th item will be part of the sample. If the starting point
is 5, then the sample must include the elements with sequence numbers 5, 25, 45, 65,
85, 105, 125 and so forth.
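The steps above can be sketched in code as follows. This is a minimal illustration, assuming the population size is an exact multiple of the sample size so that I = P/S is a whole number; the function name `systematic_sample` is hypothetical.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the sequence numbers of a 1-in-I systematic sample."""
    interval = population_size // sample_size   # skip interval I = P / S
    if start is None:
        # Random starting point between 1 and I.
        start = random.Random(seed).randint(1, interval)
    return [start + k * interval for k in range(sample_size)]


# 100 units from a population of 2000, starting at 5 as in the example.
sample = systematic_sample(2000, 100, start=5)
print(sample[:7])
```

Only the starting point is random; once it is fixed, every other member of the sample is determined.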
POINTS TO PONDER
In systematic sampling, the selection of the first unit determines the whole sample.
The advantage of this sampling technique is that the sample is spread evenly over the
entire population. It is also an easier and less costly method of sampling and can be
conveniently used even for large populations.
However, if there is a hidden periodicity in the population, systematic sampling will
prove to be an inefficient method of sampling, so care must be taken in applying it
to periodic populations.
Systematic sampling can provide greater information per unit cost than simple random
sampling.
POINTS TO PONDER
Systematic sampling seems about as precise as the corresponding stratified sampling,
but the difference is that in systematic sampling the sample unit occurs at the same
relative position; while in stratified sampling it is determined by randomization within
stratum.
Important Formulas
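Sample mean:

$$\bar{y}_{sys} = \frac{\sum y_i}{n}$$

Variance:

$$s^2 = \left(\frac{\sum y_i^2 - \frac{(\sum y_i)^2}{n}}{n - 1}\right)\left(\frac{N - n}{N^2}\right)$$

Estimated population total:

$$N\bar{y}_{sys}$$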
NUMERICAL EXAMPLE
A horticulturalist has 2000 experimental hybrid maize plants of a new variety under
study. He wants to estimate the total yield of the maize plants from a 1-in-10
systematic sample of 200 plants. The data from this survey are
listed in Table 3.5. Calculate the sample mean and variance, and place a bound on the
error of estimation.
Solution
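$$s^2 = \left(\frac{\sum y_i^2 - \frac{(\sum y_i)^2}{n}}{n - 1}\right)\left(\frac{N - n}{N^2}\right) = \left(\frac{4000 - \frac{800^2}{200}}{200 - 1}\right)\left(\frac{2000 - 200}{2000^2}\right) = 0.0018$$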
For example, a number of villages (or clusters) may be initially selected
from a sampling frame (list of villages). One or more villages are randomly
chosen from the total number of villages or clusters. Within each village or
cluster, we may interview all households, or we may select only a sample,
either by random selection or by another method within the cluster.
POINTS TO PONDER
If clusters happen to be some geographic subdivisions, in that case cluster sampling is
better known as area sampling. In other words, cluster designs, where the primary
sampling unit represents a cluster of units based on geographic area, are
distinguished as area sampling.
subgroups and heterogeneity between subgroups | and homogeneity between subgroups, but we usually get the reverse
Randomly choose elements from within each subgroup | First, randomly choose a number of subgroups, and secondly elements within randomly selected clusters or subgroups
To select the sample, we first obtain a frame listing all clusters in the
population.
Second, from the list or the frame, we draw a simple random sample of
clusters, using the random sampling procedures (Refer to simple random
sampling method discussed previously).
Third, we obtain frames that list all elements in each of the sampled
clusters.
Finally, we select a simple random sample of elements or items from each
of these frames.
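The four steps above amount to a two-stage selection, which can be sketched as follows. The frame of villages and households is entirely hypothetical and is generated only for illustration; the function name `two_stage_cluster_sample` is likewise an assumption.

```python
import random


def two_stage_cluster_sample(frame, n_clusters, n_per_cluster, seed=None):
    """Stage 1: sample clusters; stage 2: sample elements within each."""
    rng = random.Random(seed)
    # Stage 1: simple random sample of clusters from the cluster frame.
    chosen = rng.sample(sorted(frame), n_clusters)
    # Stage 2: simple random sample of elements from each chosen cluster.
    return {
        cluster: rng.sample(frame[cluster], min(n_per_cluster, len(frame[cluster])))
        for cluster in chosen
    }


# Hypothetical frame: 6 villages, each listing 12 numbered households.
frame = {
    f"village_{v}": [f"v{v}_hh{h}" for h in range(1, 13)]
    for v in range(1, 7)
}
sample = two_stage_cluster_sample(frame, n_clusters=3, n_per_cluster=4, seed=42)
print(sample)
```

Note that element lists are needed only for the clusters actually selected, which is the practical appeal of the method.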
For example, consider sampling with Probability Proportional to the Cluster Size. If
the cluster sampling units do not have the same, or approximately the same,
number of elements, it is appropriate to use a random selection process
in which the probability of each cluster being included in the sample is proportional to
the size of the cluster. For this purpose, we list the number of elements in each
cluster, irrespective of the method of ordering the clusters. Then we sample
the appropriate number of elements systematically from the cumulative totals.
Data and sample households selected from each village are as follows.
Village | Number of households | Cumulative | Sample households selected
A 20 20 5 (first household randomly selected)
B 40 60 25, 45
C 10 70 65
D 25 95 85
E 30 125 105, 125
F 100 225 145, 165, 185, 205, 225
G 60 285 245, 265, 285
H 25 310 305
I 30 340 325
J 15 355 345
K 10 365 365
L 5 370 None
M 60 430 385, 405, 425
N 40 470 445, 465
O 30 500 485
P 15 515 505
Q 20 535 525
R 15 550 545
S 30 580 565
T 20 600 585
Notes:
There are 600 households, from which 30 households were selected.
The sampling interval is calculated as Interval = 600/30 = 20, which is added to
the starting point 5 to determine the household selection in each village.
Successive increments of 20 from the first household (household number 5) are
made until 30 households have been selected.
The households selected from each village are given in the last
column of the table. For example, 2 households were selected in village B, 5
households in village F and none in village L.
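The selection shown in the table can be reproduced with a short sketch. It uses the household counts from the table, the interval of 20 and the starting point 5; the use of `bisect_left` to find the first village whose cumulative total reaches each selection number is an implementation choice made here.

```python
from bisect import bisect_left
from itertools import accumulate

villages = list("ABCDEFGHIJKLMNOPQRST")
households = [20, 40, 10, 25, 30, 100, 60, 25, 30, 15,
              10, 5, 60, 40, 30, 15, 20, 15, 30, 20]

cumulative = list(accumulate(households))   # running totals, ending at 600
interval = cumulative[-1] // 30             # 600 / 30 = 20
start = 5                                   # randomly chosen first household

selected = {village: [] for village in villages}
for number in range(start, cumulative[-1] + 1, interval):
    # First village whose cumulative total is at least the selection number.
    index = bisect_left(cumulative, number)
    selected[villages[index]].append(number)

for village in villages:
    print(village, selected[village] or "None")
```

Large villages such as F pick up several households, while a very small village such as L can be skipped entirely, which is what proportionality to size means in practice.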
A frame listing all elements in the population may be impossible or costly
to obtain, whereas a list of all clusters may be easy to obtain. For example,
compiling a list of all secondary students in the country would be
expensive and time consuming, but a list of secondary schools could be
readily acquired.
The cost of obtaining data may be inflated by travel costs if the sampled
elements are spread over a large geographic area. Thus, sampling clusters
of elements that are physically close together is often economical.
3.38. What is design effect in cluster sampling? And, why is it a
unique problem when applying cluster sampling?
POINTS TO PONDER
Based upon experience or published cluster surveys, a design effect of 2.0 is allowed
for most variables. For water and sanitation variables, however, a design effect of
10 is allowed, and for health variables such as goiter it is 3.0. Other indicators
strongly affected by clustering include vaccine coverage, which is associated with
distance from health centers and with local immunization campaigns. Homogeneity is
particularly severe for indicators of the incidence of infectious diseases such as
measles, which spread from child to child.
Some advantages of cluster sampling include reduced time and travel costs, as well
as simplified field work and ease of both field supervision and survey administration.
This is important since better supervision of interviewers will result in improved data
quality.
3.39. How do we measure some of the characteristics of a cluster
sample?
Important Formulas
Use the following notations to discuss the mean, variance and error bound of a
cluster sample. Let
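$$\bar{y} = \frac{\sum y_j}{\sum m_j}$$

Thus, the mean takes the form of a ratio estimator.
The estimated variance of $\bar{y}$ has the form of the variance of a ratio estimator,
given as:

$$\mathrm{Var}(\bar{y}) = \left(\frac{N - n}{N\,n\,\bar{m}^2}\right)\left(\frac{\sum (y_j - \bar{y}\,m_j)^2}{n - 1}\right)$$

Bound on the error of estimation:

$$2\sqrt{\mathrm{Var}(\bar{y})}$$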
NUMERICAL EXAMPLE
Interviews were conducted in 150 areas (or clusters) of a city, and 25 areas (or
clusters) were sampled using the cluster sampling method. For each of the 25 sampled
clusters, the data on income are given in Table 3.6. Note that the clusters are numbered
on a city map, with numbers from 1 to 150. Use these data (Table 3.6) to estimate per
capita income in the city, and place a bound on the error of estimation.
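Cluster | Residents (mj) | Total income (yj) | Cluster | Residents (mj) | Total income (yj)
1  | 5  | 5400  | 14 | 8  | 4100
2  | 6  | 4300  | 15 | 3  | 5300
3  | 2  | 8500  | 16 | 7  | 5000
4  | 3  | 5000  | 17 | 8  | 3200
5  | 8  | 4500  | 18 | 6  | 2200
6  | 5  | 6500  | 19 | 4  | 4500
7  | 7  | 7500  | 20 | 5  | 3700
8  | 6  | 4000  | 21 | 5  | 5100
9  | 6  | 5200  | 22 | 6  | 3000
10 | 5  | 6500  | 23 | 3  | 3900
11 | 4  | 5000  | 24 | 9  | 4100
12 | 11 | 12100 | 25 | 10 | 4700
13 | 8  | 9600  |
Totals: $\sum m_j = 150$, $\sum y_j = 132900$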
The best estimate of the population mean is given by the ratio of the cluster
totals:

$$\bar{y} = \frac{\sum y_j}{\sum m_j} = \frac{132900}{150} = 886$$

The mean cluster size M is estimated by $\bar{m}$ (the mean number of residents in
the sampled clusters), using the formula

$$\bar{m} = \frac{\sum m_j}{n} = \frac{150}{25} = 6$$

Substituting the numbers into the right-hand side of the variance equation

$$\mathrm{Var}(\bar{y}) = \left(\frac{N - n}{N\,n\,\bar{m}^2}\right)\left(\frac{\sum (y_j - \bar{y}\,m_j)^2}{n - 1}\right)$$

gives the estimate of the mean with a bound on the error of estimation (the
confidence interval of the cluster sample), $\bar{y} \pm 2\sqrt{\mathrm{Var}(\bar{y})}$.
Therefore, the best estimate of the average per capita income is MK886 (Malawi
Kwacha), and the error of estimation should be less than MK32.93 with probability
close to 95%. If this bound on the error of estimation appears rather large, sampling
more clusters (and consequently increasing the sample size) could, in general,
reduce the sampling error.
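The point estimate in this example can be retraced with a short sketch. It covers only the ratio estimate of per capita income and the mean cluster size, using the residents and income columns of the table; the variable names are illustrative.

```python
# Residents (m_j) and total income (y_j) for the 25 sampled clusters.
residents = [5, 6, 2, 3, 8, 5, 7, 6, 6, 5, 4, 11, 8,
             8, 3, 7, 8, 6, 4, 5, 5, 6, 3, 9, 10]
incomes = [5400, 4300, 8500, 5000, 4500, 6500, 7500, 4000, 5200, 6500,
           5000, 12100, 9600, 4100, 5300, 5000, 3200, 2200, 4500, 3700,
           5100, 3000, 3900, 4100, 4700]

n = len(residents)                               # number of sampled clusters
per_capita = sum(incomes) / sum(residents)       # ratio estimate of the mean
mean_cluster_size = sum(residents) / n           # m-bar, the estimate of M
print(n, sum(residents), sum(incomes), per_capita, mean_cluster_size)
```

Because cluster sizes differ, the estimate divides total income by total residents rather than averaging the 25 cluster means.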
3.40. What if a list of households is unavailable?
If lists are not available and cannot be created by carrying out a quick
census, or by consulting community leaders, then select one household as
the starting point, followed by selecting successive households to ensure
that the sample is as representative as possible.
The exact technique for the selection of the households will depend upon
field conditions. However, here are some commonly used methods in the
field.
Find a central point in the community, such as the market. Then randomly
select a direction from the central point and count the number of
households between the central point and the edge of town in that
direction. Randomly select one of these houses to be the starting point of
the survey. The remaining households in the sample should then be
selected to give as widespread coverage as possible of the community that
is consistent with practicability.
2 Such quota sampling methods find their greatest application in opinion polls, political elections and
so forth. In addition, they have the advantage of being cheap and administratively simple.
discretion; such selection of respondents by the investigator does not happen
with random sampling methods.
Note that quota sampling has the features of stratified sampling, including some
approximate information on the number of units in each stratum; however, the
number of units and the choice of sample units are left to the interviewer
him/herself.
3.49. What are non-sampling errors?
Non-sampling errors, usually known as 'biases', are often more serious than sampling
errors. They arise in a number of ways, sometimes intentionally
(consciously) and at other times unintentionally (unconsciously).
Example:
EV = price DV = total sales
EV = advertising DV = market share
EV = program DV = income
Experiments in questionnaires
o The effect of information
o The order of question
o The wording of questions
o Different question formats
E.g., open-ended vs. close-ended, recollection
This is an actual application of various random sampling methods, showing the
combination of SRS, SS and PPS that was used by NSO for the National Sample Survey
of Agriculture (NSSA) data collection in 1992.
1. Introduction
Data on the organization and structure of the smallholder agriculture sector in
Malawi
Done once every ten years
Helps the government to formulate plans to improve the productivity of the
smallholder sector
NSSA is an integrated, multiple-visit questionnaire which consists of six modules
i. Household Composition
ii. Garden Details
iii. Livestock Numbers
iv. Food security and Nutrition
v. Employment status of household head, migration and
household assets (SDA A) and Health (SDA B)
vi. Extension
3. Sample Design
Malawi is divided into eight ADDs, which are in turn divided into 30 RDPs
For sample selection purposes, the country was divided into 107 strata based
on ecological features (soil type, cropping pattern, rainfall, etc.)
The stratum boundaries never crossed RDP and EA boundaries. This ensured
that all the strata contained a complete set of EAs, the RDPs contained a
complete set of strata, and each ADD contained a complete set of RDPs
The sampling methodology was a two-stage stratified sample design
The EAs were selected with Probability Proportional to Size (PPS) of the EA.
The measure of size being total population of the EA as found in the 1987
Population and Housing Census
A simple random procedure was employed in the selection of the sample
households within the selected EAs
NUMERICAL EXAMPLE
Sample size set at 600 EAs. This simply was found to be statistically adequate
to give reliable results at the RDP level with coefficient of variation, CV
10%.
The number of EAs to be selected per stratum was determined by the square
root of the size of the stratum where the stratum size was given by the sum of
the population of all the EAs within the stratum.
Assume 4 strata, S1, S2, S3 and S4
Allocation of EAs to Strata
Stratum | Pop | √Pop
S1 | 100 | 10
S2 | 81 | 9
S3 | 225 | 15
S4 | 16 | 4
---------------------------------------------
Total Pop P = 422; sum of √Pop = 38
Note: Why √Pop? The square root of the population is taken to avoid over- or
under-representation of strata. Otherwise, large strata can be over-represented and
small strata under-represented. Basically, it is one of the procedures for increasing
precision in statistical analysis. Now, assume N = national sample size of EAs.
Square root allocation was used, where the number of EAs to be selected in a
stratum was arrived at as follows.
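One plausible reading of square root allocation is sketched below. It assumes that the number of EAs in a stratum is the national sample size multiplied by that stratum's share of the summed square roots; this formula is an assumption made for illustration, and the national sample size of 76 EAs is hypothetical.

```python
import math

# Stratum populations from the allocation table above.
stratum_pop = {"S1": 100, "S2": 81, "S3": 225, "S4": 16}

roots = {s: math.sqrt(p) for s, p in stratum_pop.items()}
total_root = sum(roots.values())    # 10 + 9 + 15 + 4 = 38

national_sample = 76                # hypothetical national number of EAs

# Assumed rule: EAs in stratum = national sample * sqrt(pop) / sum of sqrt(pop).
allocation = {s: national_sample * r / total_root for s, r in roots.items()}
print(allocation)

# For comparison, allocation directly proportional to population.
total_pop = sum(stratum_pop.values())
proportional = {s: national_sample * p / total_pop for s, p in stratum_pop.items()}
print(proportional)
```

Under this reading, the smallest stratum S4 receives a larger share than it would under allocation proportional to population, and the largest stratum S3 a smaller one.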
Procedures
The EAs in a stratum were listed, and their populations were shown together with
their cumulative populations. For example,
EA POP CUMULATIVE POP
Step 1:
o Divide 5000 by 10, which equals 500 (the sampling interval). Select a random
number between 1 and 500, say RND = 210
Step 2:
Check along the cumulative total and see in which one 210 falls. The EA against
this cumulative total is then selected. In this case, EA2
Step 3:
Add 500 (sample interval) to 210, you get 710. Check against which cumulative
total 710 falls. EA4 is selected.
Step 4:
Repeat step 3. That is, add 500 to 710, then 500 to 1210, and so on, until all
required EAs are selected (in this case, 10 for S3).
It is important to note that enumerators listed all households in the selected EAs, and
from the household list of each EA, 20 households were selected by the simple
random sampling procedure. Also note that the enumerators screened out households
with no cultivated garden and/or livestock. The total number of households for the
survey was 600 EAs times 20 households, which equals 12,000 households.
Some interesting statistical methods used in designing to collect NSSA data were –
Probability of selection of an EA
= Pop EA/Pop Stratum = PEA/PS
Overall Probability of selection = (PEA/PS) x (h/H)
(PSH)/(PEAh)
Estimation of a total area, Y, is done as follows.
Let $y_i$ = area of land a household has; then the stratum total is given by
$Y = \sum ((P_{EA}/P_S) \times (h/H))\, y_i$
Hence, we note that each EA gives an independent measure of the estimate of the
stratum it belongs to. Therefore, if n EAs are selected in a single stratum,
$Y = \frac{1}{n} \sum [(P_{EA}/P_S) \times (h/H)\, y_i]$
and
$\mathrm{Var}(Y_{Stratum}) = \left\{ \sum [(P_{EA}/P_S) \times (h/H)\, y_i]^2 - \left[\sum (P_{EA}/P_S) \times (h/H)\, y_i\right]^2 / n \right\} / (n - 1)$
sorghum, wheat and sunflower. Note that the crop was harvested by
the enumerator and the produce was weighed and recorded.
The survey recorded the area of land cultivated by the household, not
land owned.
SDA B
Health
========================================================
MENTAL GYMNASTICS
CHAPTER THREE
=======================================================
3. Why sampling?
4. Why does Design Effect occur? And how do we take care of it?
5. In order to reduce the standard error by 30%, the sample size should be increased
by a factor of what?
7. A random sample of 200 households is selected from the total of 2500 households.
The sample mean of income is found to be $\bar{x} = 1500$ Malawi Kwacha (MK), and
the sample variance is s2 = 220. Estimate μ, the average income for all 2500 households,
and a confidence interval.
8. A large company is concerned about the time per week lost due to absenteeism.
The company employs N=400 employees, and the time log sheets of a simple
random sample of n=40 employees show the average amount of time lost is 10
hours with a sample variance s2=3.1. Estimate the total number of person-hours
lost per week, and confidence interval.
9. The average amount of groundnut production μ for smallholder farmers must be
estimated. No prior data are available to estimate the population variance, but
most groundnut production lies within a 200 kg range. There are 500 smallholder
groundnut farmers. Find the sample size needed to estimate μ with a standard error
of 30 kg.
Estimate the average amount that would be received by all the smallholder
farmers in the district, and estimate a 95% confidence interval. (Hint: use the cluster
sampling method.)
47 69 54090 18 40 42500
4 32 34880 14 45 51590
15 40 15000 25 26 31200
36 25 23505 9 39 61850
27 30 39010 17 42 51490
5 40 46200 2 32 90020
1 20 21545 29 50 29505
10 22 31700 12 62 100000
Estimate the average amount a household in the city spends on electricity, and
calculate the confidence interval.