STA 410 Lecture Notes
September 2024
Contents
1 Introduction
1.1 Introduction to Sample Survey
1.1.1 Basic concepts
1.1.2 Sample Surveys Theory
1.1.3 Population
1.1.4 Types of populations
1.1.5 Sampling frame
1.1.6 Importance of sampling frames
1.1.7 Challenges with sampling frames
1.2 Uses, scope, and advantages of sample surveys
1.2.1 Uses of sample surveys
1.2.2 Scope of sample surveys
1.2.3 Advantages of sample surveys
1.2.4 Key Considerations in sample surveys
1.3 Probability sampling
1.3.1 Definition
1.3.2 Types of probability sampling
1.4 Non-Probability sampling
1.4.1 Types of Non-probability sampling
1.5 Comparison of probability and non-probability sampling
1.5.1 Probability sampling
1.5.2 Non-probability sampling
1.6 Simple random sampling (SRS)
1.6.1 Key concepts
1.6.2 Advantages of simple random sampling
1.7 Types of simple random sampling
1.7.1 Simple random sampling with replacement (SRSWR)
1.7.2 Simple random sampling without replacement (SRSWOR)
1.7.3 SRSWR example
1.7.4 SRSWOR example
1.8 Applications
1.8.1 SRSWR
1.8.2 SRSWOR
1.9 Mathematical formulation
1.9.1 SRSWR
1.9.2 SRSWOR
1.10 Chapter Example
1.11 Chapter Exercises
3 Sampling Methods
3.1 Stratified Random Sampling
3.1.1 Advantages of stratified sampling
3.1.2 Steps in stratified random sampling
3.1.3 Objective of optimal allocation
3.1.4 Neyman allocation formula
3.1.5 Benefits of Neyman allocation
3.1.6 Estimation of population mean and variance
3.1.7 Numerical example
3.2 Introduction to systematic sampling
3.2.1 Steps in systematic sampling
3.2.2 Advantages of systematic sampling
3.2.3 Disadvantages of systematic sampling
3.3 Cluster sampling
3.3.1 One-stage cluster sampling
3.3.2 Two-stage cluster sampling
3.3.3 Multi-stage cluster sampling
3.4 Criteria for choosing a sampling design
3.4.1 Research Objectives
3.4.2 Population Characteristics
3.4.3 Sampling frame
3.4.4 Resources
3.4.5 Desired precision and accuracy
3.4.6 Method of data collection
3.4.7 Statistical analysis
3.4.8 Example: Choosing a sampling design
3.5 Chapter Examples
3.6 Chapter Exercises
4 Project
4.1 Project Instructions
4.2 Sample Questionnaire
4.2.1 Sample Questionnaire
5 Diagrams
5.1 Systematic sampling diagram
5.2 Diagram of Cluster Sampling
5.3 Stratified Sampling Diagram
5.4 Diagram of Multi-stage sampling
Chapter 1
Introduction
• Sample survey theory and methods encompass a range of principles and techniques used to collect,
analyze, and interpret data from a subset (sample) of a larger population. The goal is to make
inferences about the population based on the sample data. This field is crucial in statistics, social
sciences, market research, public health, and many other domains.
• Sample survey theory and methods provide a scientific approach to collecting and analyzing data,
enabling researchers to draw valid conclusions about larger populations from smaller, manageable
samples.
• Statistic: A numerical characteristic of the sample used to estimate the population parameter.
• Examples:
– All the students in a university (finite population).
– All possible outcomes of rolling a die (finite population).
– All stars in the universe (infinite population).
• Study population: The group from which we actually collect data, which should ideally be a
representative subset of the target population.
• Examples:
1.2.2 Scope of sample surveys
• Descriptive surveys: Aim to describe the characteristics of a population at a given point in
time.
• Analytical surveys: Aim to explore relationships between variables and test hypotheses.
– Example: A survey to investigate the relationship between exercise habits and health
outcomes.
• Time-saving: Collecting data from a sample is quicker than from an entire population.
• Manageable data collection: Handling a smaller amount of data is more practical and
manageable.
• Greater accuracy: If properly designed, sample surveys can provide more accurate and reliable
results than attempting to survey an entire population due to better quality control.
• Feasibility: Some populations are too large or inaccessible to survey in their entirety, making
sampling the only feasible option.
• Sample size: Larger sample sizes generally lead to more accurate estimates but also increase
costs.
• Questionnaire design: The design of the survey questionnaire can significantly impact the
quality of the data collected.
1.3 Probability sampling
1.3.1 Definition
Probability sampling is a sampling technique where each member of the population has a known,
non-zero chance of being selected in the sample.
Systematic sampling
Every k-th member of the population is selected after a random start.
• Example: Selecting every 5th person on an alphabetical list of employees after starting at a
random point.
Stratified sampling
The population is divided into homogeneous subgroups (strata) and random samples are taken from
each stratum.
• Example: Dividing a population into male and female groups and then randomly selecting an
equal number of participants from each group.
Cluster sampling
The population is divided into clusters, some clusters are randomly selected, and all members of selected
clusters are included in the sample.
• Example: Dividing a city into blocks and then randomly selecting certain blocks, including all
households within those blocks.
Multistage sampling
A combination of different sampling methods used in various stages.
• Example: Using cluster sampling to select schools and then using simple random sampling to
select students within those schools.
1.4.1 Types of Non-probability sampling
Convenience sampling
Samples are chosen based on their ease of access.
Judgmental/Purposive sampling
Samples are selected based on the researcher’s knowledge and judgment.
Quota sampling
The population is segmented into mutually exclusive sub-groups, and then a non-random set of
observations is chosen from each subgroup.
• Example: Ensuring that a survey includes a certain number of men and women based on their
proportion in the population.
Snowball sampling
Existing study subjects recruit future subjects from among their acquaintances.
• Example: A researcher studying drug abuse starts with known users who then refer other users.
1.6 Simple random sampling (SRS)
Simple Random Sampling (SRS) is a basic sampling technique where each element in the population
has an equal chance of being selected in the sample.
• Sampling Frame: A list or representation of all elements in the population from which the
sample is drawn.
2. The process of selecting a simple random sample is straightforward. It can be achieved using
random number generators or drawing lots, making it easier to implement compared to more
complex sampling methods.
3. Data analysis is simplified with simple random sampling. Since each observation is equally likely,
standard statistical techniques and formulas can be applied without the need for complex
adjustments.
4. Simple random sampling tends to give more stable estimates of population parameters than
non-random sampling methods, because no part of the population is systematically favored
in the selection.
5. Simple random sampling can be easily adapted to various types of data and research designs.
It does not require specific information about the population structure, making it versatile for
different research contexts.
6. Simple random sampling provides a foundation for more advanced statistical methods, including
stratified sampling and cluster sampling. Understanding simple random sampling is essential for
implementing and interpreting these more complex methods.
Simple random sampling is a fundamental method in statistics with several advantages, including
unbiased representation, ease of implementation, simplified analysis, equal variance, flexibility, and its role
as a basis for advanced techniques. These advantages make it a preferred choice in many research and
data collection scenarios.
1.7 Types of simple random sampling
1.7.1 Simple random sampling with replacement (SRSWR)
• In SRSWR, each element in the population is returned to the population after it is selected. This
allows the same element to be selected more than once.
• Procedure:
• Advantages:
• Disadvantages:
– May include the same element multiple times, which might not be practical in some cases.
• Procedure:
• Advantages:
• Disadvantages:
– Successive selections are not independent, making the analysis slightly more complex.
1.7.3 SRSWR example
• Suppose we have a population of 5 students: A, B, C, D, E.
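The two selection schemes can be sketched for this five-student population as follows. This is a minimal illustration, not the notes' own procedure: the sample size of 3, the seed, and the helper names `srswr` and `srswor` are assumptions made for the example.

```python
import random

population = ["A", "B", "C", "D", "E"]
rng = random.Random(410)  # arbitrary fixed seed so the draws are repeatable


def srswr(units, n, rng):
    """Draw n units with replacement: a unit may appear more than once."""
    return [rng.choice(units) for _ in range(n)]


def srswor(units, n, rng):
    """Draw n distinct units without replacement."""
    return rng.sample(units, n)


with_replacement = srswr(population, 3, rng)
without_replacement = srswor(population, 3, rng)
print("SRSWR :", with_replacement)
print("SRSWOR:", without_replacement)
```

Under SRSWR the printed sample may repeat a student; under SRSWOR the three students are always distinct.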
1.8 Applications
1.8.1 SRSWR
• Useful in scenarios where the population size is small, and repeated measurements are acceptable
(e.g., quality control processes).
1.8.2 SRSWOR
• Commonly used in surveys and opinion polls where duplicate selections are not desirable (e.g.,
market research).
In SRSWR, each member of the population has probability $1/N$ of being selected on every draw, so in
a sample of size $n$ it is expected to appear $n/N$ times, where $n$ is the sample size and $N$ is the
population size. The variance of the sample mean $\bar{X}$ is given by:
\[ \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}, \]
where $\sigma^2$ is the population variance.
1.9.2 SRSWOR
• Probability of selecting any specific element on the first draw: $\frac{1}{N}$.
• Probability of selecting a different element on the second draw: $\frac{1}{N-1}$, and so on.
• The selection probability changes with each draw, making it dependent on previous selections.
In SRSWOR, the variance of the sample mean is adjusted for the fact that sampling without
replacement reduces the variance of the sample mean compared to sampling with replacement. The
variance is given by:
\[ \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} \left( \frac{N - n}{N - 1} \right) \]
where $\sigma^2$ is the population variance, $n$ is the sample size, and $N$ is the population size.
Numerical example
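Consider a population of size $N = 100$ with a population variance $\sigma^2 = 25$. Suppose we draw a sample
of size $n = 10$.
For SRSWR
\[ \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} = \frac{25}{10} = 2.5 \]
For SRSWOR
\[ \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} \left( \frac{N - n}{N - 1} \right) = \frac{25}{10} \left( \frac{100 - 10}{100 - 1} \right) = 2.5 \times \frac{90}{99} \approx 2.2727 \]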
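The comparison above can be checked with a short sketch. The function names `var_mean_srswr` and `var_mean_srswor` are hypothetical; they simply encode the two variance formulas stated in this section.

```python
def var_mean_srswr(sigma2, n):
    """Variance of the sample mean under sampling with replacement."""
    return sigma2 / n


def var_mean_srswor(sigma2, n, N):
    """Variance of the sample mean under sampling without replacement,
    using the finite population correction (N - n) / (N - 1)."""
    return (sigma2 / n) * (N - n) / (N - 1)


v_wr = var_mean_srswr(25, 10)
v_wor = var_mean_srswor(25, 10, 100)
print(v_wr, round(v_wor, 4))
```

The correction factor approaches 1 as $N$ grows relative to $n$, so the two variances nearly coincide for large populations.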
Solution
To determine the probability that a randomly chosen resident from the sampling frame is included in
the sample, we use the formula for the probability of inclusion in a simple random sample. Given a
sample size of $n = 200$ and a sampling frame of $N = 10{,}000$ residents, the probability of inclusion is:
\[ \text{Probability} = \frac{n}{N} \tag{1.1} \]
Substitute the given values:
\[ \text{Probability} = \frac{200}{10{,}000} = 0.02 \tag{1.2} \]
So, the probability that a randomly chosen resident from the sampling frame is included in the sample
is 0.02, or 2%.
1.11 Chapter Exercises
1. Simple random sampling: A population consists of 1000 individuals. A simple random sample of
100 individuals is selected. What is the probability that a particular individual is included in the
sample?
2. Population mean: The weights (in kg) of a population of 5 individuals are as follows: 60, 65, 70,
75, and 80. Calculate the population mean.
3. Sample mean: From the population given in Question 2, a simple random sample of 3 individuals
is drawn. The weights of the selected individuals are 60, 70, and 75. Calculate the sample mean.
4. Standard error of the mean: A population has a mean (µ) of 50 and a standard deviation (σ) of
10. If a sample of 25 individuals is selected, what is the standard error of the mean?
5. Population Size and Sample Selection: Suppose a researcher wants to conduct a survey on the
health habits of university students in a large city. The city has 50,000 university students. The
researcher decides to select a sample of 500 students. What is the sampling fraction for this
study?
6. Sampling frame accuracy: A researcher is studying the impact of a new teaching method on high
school students. The sampling frame consists of 1,200 high school students, but it is estimated
that 10% of the sampling frame is outdated or incorrect. If the researcher randomly selects
a sample of 100 students from this frame, what is the expected number of students from the
sampling frame who may be outdated or incorrect?
7. Estimation of population parameter: A random sample of 200 households is taken from a city
with 10,000 households. If the sample mean for monthly grocery expenditure is $400 with a
sample standard deviation of $50, estimate the total monthly grocery expenditure for the entire
population of households in the city. Assume the sample is representative of the population.
Chapter 2
• Example: The sample mean (x̄) is an unbiased estimator of the population mean (µ).
Consistency
An estimator is consistent if, as the sample size increases, the estimator converges in probability to the
true value of the parameter.
• Example: The sample mean (x̄) becomes closer to the population mean (µ) as the sample size
increases.
Efficiency
An estimator is efficient if it has the smallest variance among all unbiased estimators of the parameter.
• Example: When sampling from a normal distribution, the sample mean (x̄) has the smallest
variance among all unbiased estimators of the population mean (µ).
Sufficiency
An estimator is sufficient if it captures all the information about the parameter contained in the sample.
• Example: The sample mean (x̄) is a sufficient estimator for the population mean (µ) when
sampling from a normal distribution.
Robustness
An estimator is robust if it remains relatively unaffected by small deviations from model assumptions.
• Example: The median is a robust estimator of central tendency, especially in the presence of
outliers.
Examples of estimators
• Mean (x̄): Used to estimate the population mean (µ).
Variance
The variability of the estimator due to random sampling.
• Low Variance: Indicates that the estimator is stable and provides similar results for different
samples.
• Estimating sample size is a crucial step in the design of any statistical study. It involves
determining the number of observations or replicates necessary to achieve a desired level of accuracy
and confidence in the results. The sample size depends on the research objectives, the variability
in the population, and the precision required for the estimates.
• Sample size estimation is a critical step in designing experiments and surveys. It ensures that
the study has enough power to detect a statistically significant effect if one exists. The required
sample size depends on several factors, including the desired confidence level, the power of the
test, the effect size, and the variability in the data. In this lecture, we will discuss the different
methods for estimating sample sizes for various types of data and study designs.
Confidence level (1 - α)
The probability that the interval estimate will contain the true parameter.
• Common confidence levels are 90%, 95%, and 99%.
2.4.1 Example
Suppose we want to estimate the average height of university students with a 95% confidence level and
a margin of error of 2 cm. If the population standard deviation is known to be 10 cm, the required
sample size is:
\[ n = \left( \frac{1.96 \cdot 10}{2} \right)^2 = 96.04 \]
Therefore, we need a sample size of at least 97 students.
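The same calculation can be sketched in code. The helper name `sample_size_mean` is hypothetical, and the z-value is taken from the standard library's normal distribution rather than the rounded 1.96, so the intermediate value differs slightly from 96.04 before rounding up.

```python
import math
from statistics import NormalDist


def sample_size_mean(sigma, margin, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= margin, for known sigma."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    return math.ceil((z * sigma / margin) ** 2)


n_height = sample_size_mean(sigma=10, margin=2)
print(n_height)
```

Rounding up rather than to the nearest integer keeps the margin of error at or below the target.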
2.5.1 Example
Suppose we want to estimate the proportion of university students who prefer online classes with a
95% confidence level and a margin of error of 5%. If we expect the proportion to be around 50%, the
required sample size is:
\[ n = \frac{1.96^2 \cdot 0.5 \cdot (1 - 0.5)}{0.05^2} = 384.16 \]
Therefore, we need a sample size of at least 385 students.
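A minimal sketch of this proportion formula, assuming the hypothetical helper name `sample_size_proportion` and the exact normal critical value in place of 1.96:

```python
import math
from statistics import NormalDist


def sample_size_proportion(p, margin, confidence=0.95):
    """Smallest n with z * sqrt(p * (1 - p) / n) <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    return math.ceil(z**2 * p * (1 - p) / margin**2)


n_online = sample_size_proportion(p=0.5, margin=0.05)
print(n_online)
```

Because $p(1 - p)$ is largest at $p = 0.5$, using 0.5 gives the most conservative (largest) sample size when no prior estimate is available.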
2.6.1 Example
Suppose we want to compare the average exam scores of two teaching methods with a 95% confidence
level and 80% power. If the pooled standard deviation is 15 points and we want to detect a difference
of 5 points, the required sample size for each group is:
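As a worked sketch of this calculation, assume the usual two-sample formula for means, $n = 2\sigma^2 (Z_{\alpha/2} + Z_\beta)^2 / \Delta^2$, with $Z_{\alpha/2} \approx 1.96$ for 95% confidence and $Z_\beta \approx 0.84$ for 80% power. The choice of formula and the rounded z-values are assumptions made for this illustration.

```latex
\[
n = \frac{2\sigma^2 (Z_{\alpha/2} + Z_\beta)^2}{\Delta^2}
  = \frac{2 \cdot 15^2 \cdot (1.96 + 0.84)^2}{5^2}
  = \frac{2 \cdot 225 \cdot 7.84}{25}
  = 141.12
\]
```

Rounding up gives about 142 students per group under these assumptions.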
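For comparing two proportions, the required sample size for each group is given by:
\[ n = \frac{(Z_{\alpha/2} + Z_\beta)^2 \cdot (p_1 (1 - p_1) + p_2 (1 - p_2))}{(p_1 - p_2)^2} \]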
where:
• Zα/2 is the critical value of the normal distribution at the desired confidence level.
2.7.1 Example
Suppose we want to compare the proportion of smokers between two age groups with a 95% confidence
level and 80% power. If we expect the proportions to be 20% and 30% in the two groups, the required
sample size for each group is:
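A worked sketch of this calculation, assuming the two-proportion formula $n = (Z_{\alpha/2} + Z_\beta)^2 [p_1(1 - p_1) + p_2(1 - p_2)] / (p_1 - p_2)^2$ with the rounded values $Z_{\alpha/2} \approx 1.96$ and $Z_\beta \approx 0.84$ (both z-values are assumptions for the illustration):

```latex
\[
n = \frac{(1.96 + 0.84)^2 \, [\,0.2 \cdot 0.8 + 0.3 \cdot 0.7\,]}{(0.2 - 0.3)^2}
  = \frac{7.84 \times 0.37}{0.01}
  = 290.08
\]
```

Rounding up gives about 291 participants per group under these assumptions.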
Statistical power
The probability that a test will correctly reject a false null hypothesis (i.e., the probability of avoiding
a Type II error).
• Importance: High power reduces the risk of concluding that there is no effect when one actually
exists.
• Typical value: Researchers often aim for a power of 0.80 or 80%, which means there’s an 80%
chance of detecting an effect if it exists.
Effect size
A measure of the strength of the relationship between two variables or the magnitude of the difference
between groups.
• Pearson’s r: Used for measuring the strength of the correlation between two variables.
• Odds ratio: Used in logistic regression to measure the association between an exposure and an
outcome.
• Importance: Helps to understand the practical significance of a study’s findings, not just the statistical
significance.
• Significance level (α): The probability of rejecting the null hypothesis when it is true (Type I
error). Common values are 0.05 or 0.01.
• Variability: More variability in the data requires a larger sample size to detect the effect.
Table 2.1: Type I and Type II error
                       Accept H0                   Reject H0
If H0 is correct       Correct decision (1 − α)    Type I error (α)
If H0 is not correct   Type II error (β)           Correct decision (1 − β)
• Power (1 − β) : 0.80
\[ n = \frac{2\sigma^2 (Z_{\alpha/2} + Z_\beta)^2}{\Delta^2} \]
where:
• Zα/2 and Zβ are the z-values corresponding to the significance level and power, respectively,
2.9 Chapter Examples
Question 1: Properties of estimators
A researcher is using a sample mean (X̄) to estimate the population mean (µ). Suppose the sample
mean from a random sample of 50 observations is 75, and the population standard deviation is 10.
Solution 1
1. The standard error (SE) of the sample mean is calculated as:
\[ SE = \frac{\sigma}{\sqrt{n}} \]
\[ \bar{X} \pm Z_{\alpha/2} \times SE \]
Solution 2
The sample size (n) required for a given margin of error (E) and confidence level can be calculated
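using:
\[ n = \left( \frac{Z_{\alpha/2} \times \sigma}{E} \right)^2 \]
For a 90% confidence level, $Z_{\alpha/2} \approx 1.645$:
\[ n = \left( \frac{1.645 \times 20}{5} \right)^2 \approx 43.3 \]
Rounding up, at least 44 observations are required.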
Question 3: Estimation of sample size
A researcher wants to estimate the proportion of voters supporting a particular candidate in an election
with a margin of error of 0.03 and a confidence level of 95%. If the estimated proportion is 0.6, what
is the required sample size?
Solution 3
The sample size for estimating proportions can be calculated using:
\[ n = \frac{p(1 - p) \times Z_{\alpha/2}^2}{E^2} \]
where $p$ is the estimated proportion, $E$ is the margin of error, and $Z_{\alpha/2}$ is the Z-value for the confidence
level. For a 95% confidence level, $Z_{\alpha/2} \approx 1.96$:
\[ n = \frac{0.6 \times (1 - 0.6) \times 1.96^2}{0.03^2} \approx 1{,}024.4 \]
Rounding up, the required sample size is 1,025.
Solution 4
To calculate the 95% confidence interval for the population mean, we use the formula:
\[ \bar{x} \pm z \left( \frac{\sigma}{\sqrt{n}} \right) \]
where:
• x̄ is the sample mean,
• z is the z-value for the 95% confidence level (1.96),
• σ is the sample standard deviation,
• n is the sample size.
Given: x̄ = 50,000, σ = 8,000, n = 100, z = 1.96
The margin of error (E) is:
\[ E = 1.96 \left( \frac{8000}{\sqrt{100}} \right) = 1.96 \times 800 = 1568 \]
Thus, the 95% confidence interval is:
\[ 50{,}000 \pm 1{,}568 \]
So, the 95% confidence interval for the population mean income is:
\[ (48{,}432,\ 51{,}568) \]
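The interval above can be reproduced with a short sketch. The helper name `mean_confidence_interval` is hypothetical, and the exact normal critical value is used instead of 1.96, so the endpoints agree with the hand calculation only after rounding to the nearest unit.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(xbar, sigma, n, confidence=0.95):
    """Two-sided z-interval for a mean: xbar +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin


low, high = mean_confidence_interval(xbar=50_000, sigma=8_000, n=100)
print(round(low), round(high))
```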
Question 5: Estimation of sample size
A survey is being planned to estimate the proportion of voters who support a new policy. The desired
margin of error is 3%, and the confidence level is 95%. Estimate the minimum sample size required if
the estimated proportion is 0.5.
Solution 5
The formula to estimate the required sample size for a proportion is:
\[ n = \frac{z^2 \cdot p \cdot (1 - p)}{E^2} \]
where:
n ≈ 1068
Solution 6
To estimate the sample size, we use the following formula for the two-sample t-test:
\[ n = \frac{2 (Z_{\alpha/2} + Z_\beta)^2 \sigma^2}{d^2} \]
where:
• σ is the standard deviation of the outcome,
Given:
• d = 0.5,
\[ n = \frac{2 (1.96 + 0.84)^2 \times 1^2}{0.5^2} \]
\[ n = \frac{2 \times 7.84}{0.25} = 62.72 \]
You should round up to the next whole number, so n ≈ 63. Each group should have 63 participants.
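The calculation can be sketched in code as follows. The helper name `sample_size_two_means` is hypothetical, and it relies on the same normal approximation as the formula above, with exact critical values in place of the rounded 1.96 and 0.84.

```python
import math
from statistics import NormalDist


def sample_size_two_means(sigma, delta, alpha=0.05, power=0.80):
    """Per-group n for detecting a mean difference delta (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma**2 / delta**2)


n_per_group = sample_size_two_means(sigma=1, delta=0.5)
print(n_per_group)
```

Halving the detectable difference `delta` quadruples the required sample size, since `delta` enters the formula squared.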
Solution
For estimating the sample size for comparing two proportions, we use the following formula:
\[ n = \frac{(Z_{\alpha/2} + Z_\beta)^2 \, [p_1 (1 - p_1) + p_2 (1 - p_2)]}{(p_1 - p_2)^2} \]
where:
Given:
• p1 − p2 = 0.1,
3. A random sample of size 25 is drawn from a normal population with an unknown mean µ and a
known standard deviation σ = 10. The sample mean is found to be X̄ = 50. Construct a 95%
confidence interval for µ.
4. You wish to estimate the mean cholesterol level in a population with a margin of error of 5 mg/dL
and a confidence level of 95%. The population standard deviation is known to be 20 mg/dL.
What sample size is required?
5. Define the consistency of an estimator and prove whether the sample mean $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i$ is a
consistent estimator of the population mean µ.
6. A researcher wants to estimate the proportion of voters in a large city who support a particular
candidate. She wants the estimate to be within 3% of the true proportion with 95% confidence.
What sample size is needed if no preliminary estimate of the proportion is available?
7. Suppose we want to estimate the mean income of a population. From previous studies, we know
that the standard deviation of income in the population is σ = $10,000. We want our estimate
to be within $1,000 of the true mean with 95% confidence.
8. Power and effect size: A researcher wants to conduct a study to compare the means of two
independent groups. The researcher aims for a statistical power of 0.80 and expects a medium
effect size (Cohen’s d = 0.50). The significance level (α) is set at 0.05. Calculate the required
sample size for each group to detect the effect with the desired power.
9. A pharmaceutical company is testing the effectiveness of a new drug. They want to estimate the
proportion of patients who experience a positive effect from the drug. They desire a margin of
error of 3% at a 95% confidence level.
(a) What sample size should they use if they have no prior estimate for the proportion?
(b) How does the sample size change if they estimate the proportion to be around 0.5 based on
preliminary studies?
Chapter 3
Sampling Methods
• Improved representation: Ensures that each stratum is represented in the sample, providing a
more accurate representation of the entire population.
• Reduced variability: Units within a stratum are similar, so within-stratum variability is low, leading to more precise estimates.
• Focused analysis: Allows for separate analysis of different strata, useful for understanding specific
subgroups.
• Cost-effective: Can be more cost-effective than SRS if data collection is more efficient within
strata.
2. Determine the strata: Divide the population into non-overlapping subgroups (strata) based on
specific characteristics relevant to the study.
3. Determine the sample size: Decide the total sample size and allocate samples to each stratum.
This can be done proportionally or using optimal allocation.
4. Select samples from each stratum: Use simple random sampling to select the required number of
samples from each stratum.
5. Combine the Samples: Combine the samples from all strata to form the final sample.
where:
• Ni = Population size of stratum i
2. Calculate Products $N_i S_i$: For each stratum, calculate the product of its population size and its
standard deviation.
3. Compute the Sum: Calculate the total sum of all products $\sum_{j=1}^{L} N_j S_j$.
4. Calculate Sample Sizes: Use Neyman’s formula to compute the sample size for each stratum.
5. Apply Rounding: Since sample sizes must be integers, round the results to the nearest whole
number if necessary.
Example
Consider a population divided into three strata with the characteristics shown in the table below.
The total sample size is n = 100.
Step-by-Step Solution:
1. Compute Ni Si :
Stratum   Population Size N_i   Mean Income X̄_i   Standard Deviation S_i
1         500                   45,000             10,000
2         300                   55,000             8,000
3         200                   60,000             12,000
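The allocation for this table can be sketched as follows, assuming Neyman's rule in its standard form, $n_i = n \cdot N_i S_i / \sum_j N_j S_j$. The function name `neyman_allocation` is hypothetical; the mean income column is not needed for the allocation.

```python
def neyman_allocation(sizes, sds, n):
    """Allocate n across strata in proportion to N_i * S_i (unrounded)."""
    products = [N_i * S_i for N_i, S_i in zip(sizes, sds)]
    total = sum(products)
    return [n * p / total for p in products]


exact = neyman_allocation(
    sizes=[500, 300, 200],
    sds=[10_000, 8_000, 12_000],
    n=100,
)
rounded = [round(x) for x in exact]
print([round(x, 2) for x in exact], rounded, sum(rounded))
```

Rounding each stratum to the nearest integer can leave the total one unit away from $n$, as happens here, so the last unit has to be assigned by judgment (for example, to the stratum with the largest fractional part).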
Considerations
• Accuracy of Estimates: Ensure the estimates of standard deviations are as accurate as possible.
• Practical Constraints: The theoretical optimal sample sizes may need adjustment based on
practical constraints such as budget or resources.
Conclusion
Optimal allocation using Neyman’s method enhances the precision of estimates in stratified sampling
by allocating more resources to strata with higher variability. It is a powerful tool in survey design and
analysis. Optimal allocation further enhances the efficiency of sampling by considering the variability,
size, and cost associated with each stratum.
3.1.6 Estimation of population mean and variance
Estimation of population mean
The population mean µ is estimated by:
\[ \hat{\mu} = \frac{1}{N} \sum_{h=1}^{L} N_h \bar{X}_h \]
where:
\[ \mathrm{Var}(\hat{\mu}) = \frac{1}{N^2} \sum_{h=1}^{L} N_h^2 \frac{\sigma_h^2}{n_h} \]
where:
Estimate of population variance
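\[ \mathrm{Var}(\hat{\mu}) = \frac{1}{500^2} \left( 100^2 \cdot \frac{25}{10} + 150^2 \cdot \frac{20}{15} + 250^2 \cdot \frac{15}{20} \right) \]
\[ \mathrm{Var}(\hat{\mu}) = \frac{1}{250000} \left( 10000 \times 2.5 + 22500 \times \frac{4}{3} + 62500 \times 0.75 \right) \]
\[ \mathrm{Var}(\hat{\mu}) = \frac{1}{250000} (25000 + 30000 + 46875) = \frac{101875}{250000} = 0.4075 \]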
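The arithmetic above can be verified with a short sketch. The stratum sizes (100, 150, 250), variances (25, 20, 15), and sample sizes (10, 15, 20) are read off the worked expression; the function name `var_stratified_mean` is hypothetical.

```python
def var_stratified_mean(sizes, variances, samples):
    """Var(mu_hat) = (1 / N^2) * sum of N_h^2 * sigma_h^2 / n_h over strata."""
    N = sum(sizes)
    weighted = sum(
        N_h**2 * s2_h / n_h
        for N_h, s2_h, n_h in zip(sizes, variances, samples)
    )
    return weighted / N**2


v_strat = var_stratified_mean(
    sizes=[100, 150, 250],
    variances=[25, 20, 15],
    samples=[10, 15, 20],
)
print(v_strat)
```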
2. Determine the sample size (n): Decide how many units you want to include in the sample.
3. Calculate the sampling interval (k): Compute the interval by dividing the population size by the
sample size, $k = \frac{N}{n}$.
4. Select a random starting point (r): Choose a random number between 1 and k to determine
where to start sampling.
5. Select every k-th unit: Starting from the random starting point, select every k-th unit in the
population until the desired sample size is reached.
Numerical Example
Consider a population of 100 students in a university department. We want to select a sample of 10
students using systematic sampling.
Step-by-step solution
• Population size (N): 100
\[ k = \frac{N}{n} = \frac{100}{10} = 10 \]
• Select every k-th unit: Starting from the 3rd student, we select every 10th student.
Selected students: 3, 13, 23, 33, 43, 53, 63, 73, 83, 93
Thus, the sample selected using systematic sampling consists of the 10 students at positions 3, 13,
23, 33, 43, 53, 63, 73, 83, and 93 in the population list.
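The selection in this example can be sketched in a few lines. The function name `systematic_sample` is hypothetical, and the sketch assumes that $N$ is an exact multiple of $n$ so that the interval $k$ is an integer.

```python
import random


def systematic_sample(N, n, start=None, rng=random):
    """Return the 1-based positions chosen by systematic sampling."""
    k = N // n  # sampling interval (assumes N is a multiple of n)
    if start is None:
        start = rng.randint(1, k)  # random start between 1 and k inclusive
    return [start + i * k for i in range(n)]


positions = systematic_sample(N=100, n=10, start=3)
print(positions)
```

Passing `start=3` reproduces the example; leaving it out draws the starting point at random, as the procedure requires in practice.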
Conclusion
Systematic sampling is a practical and efficient method for selecting a sample from a large population.
It provides a simple way to ensure that the sample is spread evenly across the population, although care
must be taken to avoid periodicity bias.
Example
If a city is divided into 50 blocks (clusters), and we randomly select 10 blocks, then all households
within these 10 blocks are surveyed.
Advantages
• Cost-effective and convenient, especially when the population is spread over a large area.
Disadvantages
• Increased sampling error if the clusters are homogeneous within but heterogeneous between, because a few selected clusters then represent the population poorly.
Example
If a city is divided into 50 blocks, we first randomly select 10 blocks. Then, within each selected block,
we randomly select 20 households to survey.
Advantages
• More precise than one-stage cluster sampling.
Disadvantages
• More complex and time-consuming than one-stage sampling.
Example
• Stage 1: Randomly select districts.
Advantages
• Flexible and can be adapted to different types of populations and research questions.
• Reduces costs and effort compared to sampling every element in the population.
Disadvantages
• More complex and can introduce more sampling error at each stage.
Stages
1. Stage 1: Randomly select 5 districts from the country.
Data
Assume the following average incomes (in thousands) and number of households (per block):
• District 1: Block 1 ($50, 40), Block 2 ($55, 35), Block 3 ($53, 30)
• District 2: Block 1 ($48, 45), Block 2 ($52, 50), Block 3 ($51, 60)
• District 3: Block 1 ($47, 42), Block 2 ($49, 48), Block 3 ($50, 45)
• District 4: Block 1 ($54, 35), Block 2 ($56, 40), Block 3 ($55, 37)
• District 5: Block 1 ($50, 50), Block 2 ($52, 45), Block 3 ($51, 47)
Solution
1. Calculate the total income (average income × number of households, in thousands) for each selected block:
• District 1:
– Block 1: $50 × 40 = 2000
– Block 2: $55 × 35 = 1925
– Block 3: $53 × 30 = 1590
• District 2:
– Block 1: $48 × 45 = 2160
– Block 2: $52 × 50 = 2600
– Block 3: $51 × 60 = 3060
• District 3:
– Block 1: $47 × 42 = 1974
– Block 2: $49 × 48 = 2352
– Block 3: $50 × 45 = 2250
• District 4:
– Block 1: $54 × 35 = 1890
– Block 2: $56 × 40 = 2240
– Block 3: $55 × 37 = 2035
• District 5:
– Block 1: $50 × 50 = 2500
– Block 2: $52 × 45 = 2340
– Block 3: $51 × 47 = 2397
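2. Sum the products over all blocks and divide by the total number of households:
Total income = 5,515 + 7,820 + 6,576 + 6,165 + 7,237 = 33,313
Total households = 105 + 155 + 135 + 112 + 142 = 649
Estimated average income = 33,313 / 649 ≈ 51.33
The estimated average income for the households in the country is $51.33 thousand.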
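The arithmetic in this solution can be checked with a short sketch that recomputes the household-weighted average from the block data listed above. It assumes the estimator is simply total income divided by total households, as step 2 describes; the variable names are illustrative.

```python
# (average income in thousands, number of households) for each sampled block
blocks = {
    "District 1": [(50, 40), (55, 35), (53, 30)],
    "District 2": [(48, 45), (52, 50), (51, 60)],
    "District 3": [(47, 42), (49, 48), (50, 45)],
    "District 4": [(54, 35), (56, 40), (55, 37)],
    "District 5": [(50, 50), (52, 45), (51, 47)],
}

# Step 1: total income per block is average income times number of households
total_income = sum(mean * n for district in blocks.values() for mean, n in district)

# Step 2: divide by the total number of households across all blocks
total_households = sum(n for district in blocks.values() for _, n in district)
estimate = total_income / total_households

print(f"{total_income} / {total_households} = {estimate:.2f} thousand")
```

Weighting by household count matters here because the blocks differ in size; an unweighted mean of the 15 block averages would generally give a different figure.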
• Nature of study: Different studies (e.g., descriptive, analytical, experimental) may require different
sampling designs.
• Diversity: The heterogeneity or homogeneity of the population affects the sampling design. More
diverse populations may require stratified or cluster sampling to ensure representation.
3.4.3 Sampling frame
• Availability of a sampling frame: A complete and accurate list of the population elements (sam-
pling frame) is essential for many sampling designs like simple random sampling and stratified
sampling.
• Quality of the sampling frame: The sampling frame should be up-to-date and free from duplica-
tions or omissions.
3.4.4 Resources
• Budget constraints: The available budget can limit the choice of sampling design. Some designs,
like simple random sampling, may be less expensive than others like stratified sampling.
• Time constraints: The time available for data collection can also influence the sampling design.
Cluster sampling can be quicker in some scenarios compared to other designs.
• Accuracy: Higher accuracy may necessitate larger sample sizes or more sophisticated designs to
minimize bias and error.
• Data collection techniques: Some designs may be more suitable for certain data collection tech-
niques. For instance, telephone surveys might be easier to conduct using systematic sampling.
• Complexity of analysis: More complex designs like multistage sampling can make data analysis
more complicated and require advanced statistical techniques.
Solution
1. Research Objectives: The objective is to estimate average household income and understand the
income distribution across neighborhoods.
2. Population Characteristics: The population (100,000 households) is large and diverse, with sig-
nificant income variation across neighborhoods.
3. Sampling Frame: A complete and accurate list of households is available, categorized by neigh-
borhoods.
4. Resources: The budget is limited, and the results are needed within a month.
5. Desired Precision and Accuracy: The researcher requires a high level of precision to understand
income differences across neighborhoods.
6. Method of Data Collection: The data collection method will be face-to-face interviews, which
can be resource-intensive.
7. Statistical Analysis: The analysis will involve comparing income levels across different neighbor-
hoods.
• It aligns with the objective of comparing income levels across neighborhoods (strata).
• The complete list of households (sampling frame) is available and categorized by neighborhood.
• It balances the need for precision and the constraints of limited budget and time.
Sampling procedure
1. Stratify: Divide the city into neighborhoods.
3. Sample Size: Determine the sample size for each neighborhood based on the desired precision.
Stratum   Population Size (N_i)   Mean Income (X̄_i)   Standard Deviation (S_i)
1         500                     45,000               10,000
2         300                     55,000               8,000
3         200                     60,000               12,000
Solution
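To find the optimal sample size for each stratum, use Neyman’s allocation formula:
\[ n_i = \frac{N_i S_i}{\sum_{j=1}^{L} N_j S_j} \cdot n \]
where \( n_i \) is the sample size for stratum i, \( N_i \) and \( S_i \) are the population size and standard deviation of stratum i, L is the number of strata, and n is the total sample size (n = 100 in the calculations below).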
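Calculations:
\[ \sum_{j=1}^{3} N_j S_j = 5{,}000{,}000 + 2{,}400{,}000 + 2{,}400{,}000 = 9{,}800{,}000 \]
• For Stratum 1:
\[ n_1 = \frac{500 \times 10{,}000}{9{,}800{,}000} \times 100 \approx 51.02 \approx 51 \]
• For Stratum 2:
\[ n_2 = \frac{300 \times 8{,}000}{9{,}800{,}000} \times 100 \approx 24.49 \approx 24 \]
• For Stratum 3:
\[ n_3 = \frac{200 \times 12{,}000}{9{,}800{,}000} \times 100 \approx 24.49 \approx 24 \]
Thus, the optimal sample sizes are approximately 51 for Stratum 1, 24 for Stratum 2, and 24 for
Stratum 3. These rounded values sum to 99, so the one remaining unit can be assigned to either
Stratum 2 or Stratum 3 to reach the total of 100.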
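The allocation can be recomputed with a brief sketch, using the stratum sizes and standard deviations from the table and a total sample size of 100. The function name neyman_allocation is illustrative, and the sketch returns unrounded values so that the rounding step stays explicit.

```python
def neyman_allocation(sizes, std_devs, total_n):
    """Return unrounded Neyman sample sizes n_i = n * N_i S_i / sum_j(N_j S_j)."""
    weights = [N * S for N, S in zip(sizes, std_devs)]  # N_i * S_i per stratum
    total = sum(weights)
    return [total_n * w / total for w in weights]


sizes = [500, 300, 200]               # N_i from the table
std_devs = [10_000, 8_000, 12_000]    # S_i from the table

raw = neyman_allocation(sizes, std_devs, total_n=100)
for i, n_i in enumerate(raw, start=1):
    print(f"Stratum {i}: {n_i:.2f}")
```

Because each n_i is rounded separately, the rounded sizes need not add up to n exactly, so it is worth checking the total after rounding.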
3.6 Chapter Exercises
1. A researcher wants to estimate the average height of students in a university. The university has
three faculties: Science, Arts, and Engineering, with 300, 200, and 100 students, respectively.
If the researcher decides to use stratified random sampling and wants a total sample size of
60 students, how many students should be sampled from each faculty to ensure proportional
representation?
3. A government agency is interested in estimating the average number of hours of internet usage
per week by households in a region. The region is divided into 10 clusters, each containing
50 households. The agency randomly selects 3 clusters and surveys all households within those
clusters. The average hours of internet usage for the selected clusters are 20, 25, 30 hours,
respectively. Estimate the average number of hours of internet usage per week for the entire region.
4. A company wants to inspect the quality of products coming off a production line. The production
line produces 1,000 items per day. The company decides to use systematic sampling to select 50
items for inspection each day. If the first item is selected randomly from the first 20 items, and
every 20th item thereafter is selected, list the indices of the 50 items that will be inspected.
5. A researcher wants to estimate the average daily calorie intake of children in a large city. The
city is divided into 20 districts, each district into 10 schools, and each school into 5 classes. The
researcher uses a three-stage cluster sampling method. First, 4 districts are randomly selected.
Then, 3 schools are randomly selected from each chosen district. Finally, 2 classes are randomly
selected from each chosen school. If the average daily calorie intake for the selected classes is
1800, 1900, 1750, 1850, 1700, 1950, 2000, 2100, and 2200 calories, estimate the average daily
calorie intake for all children in the city.
6. In a town of 10,000 households, a researcher uses a telephone directory containing 8,000 house-
holds as a sampling frame. If the researcher selects 500 households using systematic sampling,
what is the sampling interval?
7. A researcher is conducting a survey to estimate the average expenditure on health care across a
population divided into four strata. The total sample size available for the survey is 120. Using
Neyman’s optimal allocation method, determine the sample size for each stratum. The strata
and their characteristics are given in the table below:
Chapter 4
Project
2. Each group should have a unique topic of study (Duplicate and copied work is prohibited)
iv. Show how you estimated the sample size
v. What are your dependent and independent variables?
vi. Mention the study variables that will assist you in questionnaire development and se-
lection of the data analysis methods.
(d) Questionnaire development (2 pages maximum) (6 Marks)
i. A simple, easy-to-understand questionnaire
ii. Be creative in the questionnaire development
(e) Data analysis (1 page maximum) (5 Marks)
i. Descriptive statistics methods to analyze the data from the questionnaire
ii. Inferential statistics methods to analyze the data from the questionnaire
iii. Since NO DATA IS COLLECTED, the team will not perform any data analysis
(f) References and Appendix (1 Page maximum)
i. References from the citations
ii. Any other information necessary in your work
(g) Please be creative in your writing and work (4 Marks)
(h) Other pertinent information
i. Submission deadline is 21st November 2024
ii. Submit a hard copy typed in Word or LaTeX
iii. Adhere closely to the instructions given but diversity is allowed
Introduction
This guideline is designed to help undergraduate students develop a questionnaire to collect data on
Generation Z in Kenya. The aim is to understand their views on education, politics, and social life. A
well-structured questionnaire will enable researchers to gather relevant and insightful data.
4.2.1 Sample Questionnaire
Section 1: Demographics
1. Age: _____
2. Gender:
• Male
• Female
• Other
3. Location: _____
4. Educational Level:
• Primary
• Secondary
• Tertiary
Section 2: Education
1. How satisfied are you with the current education system in Kenya?
• Very Satisfied
• Satisfied
• Neutral
• Dissatisfied
• Very Dissatisfied
2. Do you believe that the education system adequately prepares you for the job market?
• Yes
• No
• Not Sure
• Very Interested
• Interested
• Neutral
• Not Interested
• Not Interested at All
• Completely Trust
• Trust
• Neutral
• Distrust
• Completely Distrust
• Very High
• High
• Neutral
• Low
• Very Low
• Very Important
• Important
• Neutral
• Unimportant
• Very Unimportant
• _____
Conclusion
This guideline provides a structured approach to developing a questionnaire aimed at understanding
Generation Z in Kenya. By following these steps and using the sample questionnaire, researchers can
gather valuable data on the perspectives of this demographic group.
Chapter 5
Diagrams
[Figure: systematic sampling. Population units 1 to 20 along the axis, with sampled points S1, S5, S9, S13, and S17 marked.]
5.2 Diagram of Cluster Sampling
[Figure: cluster sampling. The population is divided into Cluster 1 and Cluster 2.]
[Figure: stratified sampling. Stratum 1, Stratum 2, and Stratum 3, each labelled with Sampling and Sample Size.]
5.4 Diagram of Multi-stage sampling
[Figure: multi-stage sampling. Population, then Sample of PSUs, then Sample of SSUs.]