5. Sampling and Sampling Distributions
Sampling
The process of selecting a portion of the population to
represent the entire population.
Sampling individuals from a population into a sample is a
critically important step in any biostatistical analysis, because
we are making generalizations about the population based on
that sample.
When selecting a sample from a population, it is important that
the sample is representative of the population, i.e., the
sample should be similar to the population with respect to key
characteristics.
Sampling Vs. Statistics
• Statistical Inference:
Predict and forecast values of population parameters...
Test hypotheses about values of population parameters...
Make decisions...
...on the basis of sample statistics derived from limited and
incomplete sample information.
Make generalizations about the characteristics of a population
on the basis of observations of a sample, a part of a population.
Unbiased Sample
An unbiased, representative sample is drawn at random from
the entire population.
[Figure: a population of males and females, with an unbiased sample drawn from it]
What do we mean by “a representative sample”?
It is an explicit or implicit objective of most studies in health care
which ‘count’ something or other (quantitative studies), to offer
conclusions that are generalizable.
This means that the findings of a study apply to situations other than
that of the cases in the study.
To give a hypothetical example, Smith and Jones’ (1997) study of
consultation rates in primary care which was based on data from five
practices in differing geographic settings (urban, suburban, rural)
finds higher rates in the urban environment.
When they wrote it up for publication, Smith and Jones used statistics
to claim their findings could be generalised: the differences applied
not just to these five practices, but to all practices in the country.
What do we mean by “a representative sample”?...
For such a claim to be valid (the study to possess ‘external validity’),
the authors must convince us that their sample was not biased (that it
was representative). Although other criteria must also be met (for
instance, that the design was both appropriate and carried out
correctly - the study’s ‘internal validity’ and ‘reliability’),
it is the representativeness of a sample which allows the researcher to
generalise the findings to the wider population.
If a study has an unrepresentative or biased sample, then it may
still have internal validity and reliability, but it will not be
generalizable (will not possess external validity). Consequently the
results of the study will be applicable only to the group under study.
Such studies are, by themselves, of little use, and for example in
the case of drug trials, this could be dangerous if their findings
were generalised.
What do we mean by “a representative sample”?...
It is essential to a study’s design (assuming that study wants to
generalise and is not simply descriptive of one setting) that sampling
is taken seriously.
However, there is a second issue which must be addressed in
sampling: sample size. Generalisations from data to the wider
population depend upon a kind of statistic which tests inferences
or hypotheses.
Example 1: the t-test can be used to test the hypothesis that there
is a difference between two populations, based on a sample from
each. Suppose we select 100 males and 100 females and measure their BMI. We
find a difference in our samples, and wish to argue that the difference
found is not an accident (due to chance), but reflects an actual
difference in the wider populations from which the samples were
drawn. We use a t-test to see if we can make this claim validly.
Why do we need to select a sample?
In some circumstances, it is not necessary to select a sample. If the
subjects of your study are very rare, for instance a disease occurring
only once in 100 000 children, then you might decide to study every
case you can find.
More usually, however, selecting a portion of the population to represent
the entire population is a must, for the following reasons:
Feasibility: Sampling may be the only feasible method of
collecting information.
Reduced cost: Sampling reduces demands on resources such
as finance, personnel, and material.
Greater accuracy: Sampling may lead to better accuracy in
data collection.
Greater speed: Data can be collected and summarized more
quickly
Sampling enables us to estimate the characteristic of a population.
Sample
Subjects who are selected
Sampling Frame
The list of potential subjects from which the sample is drawn
Source population
The population from whom the study subjects would be obtained
Target population
The population to whom the results would be generalized
Selection methods of sampling units
1. Sampling without replacement: A unit selected from
the population is not returned to that population
before the next draw. For a population of size N we can
form $\binom{N}{n}$ different samples of size n, where
$\binom{N}{n} = \frac{N!}{n!\,(N-n)!}$
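The count of possible samples can be checked numerically. The following is a minimal sketch, assuming a tiny hypothetical population of five labelled units; the function name and labels are illustrative only and are not part of the original material.

```python
import math
from itertools import combinations

def count_samples_without_replacement(N: int, n: int) -> int:
    """Number of distinct size-n samples from N units (order ignored)."""
    return math.comb(N, n)

# Hypothetical population of five units, used only for illustration.
population = ["A", "B", "C", "D", "E"]

# Enumerate every possible sample of size 2 drawn without replacement.
samples = list(combinations(population, 2))

print(count_samples_without_replacement(5, 2))  # formula-based count
print(len(samples))                             # count by enumeration
```

Both lines report the same number, which illustrates that the formula simply counts the unordered subsets of size n.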
Sampling error
The value of the characteristic measured in a sample
differs from that of the total population,
because a sample is only a subset of a larger group.
This type of error, arising from the sampling process, is
called sampling error.
Can’t be avoided or totally eliminated.
Minimized by increasing the size of the sample.
When n = N, sampling error = 0
Non sampling error (bias)
Systematic error in the design or conduct of a sampling
procedure.
Results in distortion of the sample and study results.
More serious type of error
Multi-factorial causes
Selection bias,
Information bias.
Observational error
Respondent errors
Errors in editing and tabulation of data
Sampling Methods
Two broad divisions:
A. Probability sampling methods
B. Non-probability sampling methods
A. Probability sampling
Every sampling unit has a known and non‐zero
probability of selection into the sample.
Because study samples are randomly selected
and their probability of inclusion can be calculated,
reliable estimates can be produced and
generalizations can be made about the population.
The method chosen depends on a number of factors, such as
The available sampling frame,
How spread out the population is,
How costly it is to survey members of the population
How users will analyse the data
Most common probability sampling methods
1. Simple random sampling
2. Systematic random sampling
3. Stratified random sampling
4. Cluster sampling
5. Multi-stage sampling
6. Sampling with probability proportional to size (PPS)
1. Simple random sampling
The required number of individuals are selected at
random from the sampling frame, a list or a database of
all individuals in the population
Each member (sampling unit) of the population has
an equal chance of being included in the sample.
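As a rough illustration of simple random sampling, the sketch below draws units from a numbered sampling frame using Python's standard library. The frame of 100 numbered individuals, the sample size of 10, and the function name are assumptions made for the example.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units from the frame without replacement, each equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(frame, n)

# Hypothetical sampling frame: individuals numbered 1 to 100.
frame = list(range(1, 101))
sample = simple_random_sample(frame, 10, seed=1)
print(sorted(sample))
```

Fixing the seed is only a convenience for reproducing the draw; in practice a table of random numbers or an unseeded generator serves the same purpose.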
2. Systematic random sampling
Sometimes called interval sampling
Selection of individuals from the sampling frame
systematically rather than randomly
Individuals are taken at regular intervals down the list
(Every Kth individual is chosen from the sampling frame)
The starting point is chosen at random
Steps in systematic random sampling
1. Number the units on your frame from 1 to N (where
N is the total population size).
2. Determine the sampling interval (K) by dividing the
number of units in the population by the desired sample
size: K = N/n.
3. Select a number between one and K at random. This
number is called the random start and would be the first
number included in your sample.
4. Select every Kth unit after that first number
Example:
To select n=20 from N=100, sampling interval K=N/n=100 ÷ 20 = 5
You will need to select one unit out of every five units to end up
with a total of 20 units in your sample.
Select a number between 1 and 5 from a table of random
numbers or by simple random sampling
If you choose 4, the fourth unit on your frame would be the
first unit included in your sample;
The sample might consist of the following units to make up a
sample of 20: 4 (the random start), 9, 14, 19, 24..., 99 (up to N,
which is 100 in this case).
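The steps of systematic sampling can be sketched in a few lines of code. This is a minimal sketch, assuming N is an exact multiple of n so that the interval K is a whole number; the function name and its parameters are illustrative.

```python
import random

def systematic_sample(N, n, start=None, seed=None):
    """Return the unit numbers chosen by systematic sampling.

    Assumes N is divisible by n, so the interval K = N / n is an integer.
    """
    k = N // n  # sampling interval K
    if start is None:
        # Random start between 1 and K inclusive.
        start = random.Random(seed).randint(1, k)
    if not 1 <= start <= k:
        raise ValueError("random start must lie between 1 and K")
    # Take the start, then every K-th unit after it.
    return [start + i * k for i in range(n)]

# Reproduce the worked example: N = 100, n = 20, random start = 4.
units = systematic_sample(100, 20, start=4)
print(units)
```

With a random start of 4, the list begins 4, 9, 14 and ends at 99, matching the example above.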
Systematic sampling…
Note: Systematic sampling should not be used when a cyclic
repetition is inherent in the sampling frame.
Advantage:
Easier to perform
Requires less time than SRS
Works well when the population from which the sample is to be
drawn is homogeneously distributed
Disadvantage:
Vulnerable to patterns/periodicity in the sampling frame
3. Stratified random sampling
It is done when the population is known to have heterogeneity
with regard to some factors and those factors are used for
stratification.
Using stratified sampling, the population is divided into
homogeneous, mutually exclusive groups called strata.
A population can be stratified by any variable that is available for
all units prior to sampling (e.g., age, sex, province of residence,
income, etc.).
Stratified random sampling…
A separate sample is drawn independently from each
stratum.
Any of the sampling methods mentioned in this section
(and others that exist) can be used to sample within
each stratum.
Stratified sampling ensures an adequate sample size
for sub‐groups in the population of interest.
When a population is stratified, each stratum becomes
an independent population and you will need to
decide the sample size for each stratum.
Stratified random sampling…
Why do we need to create strata?
There are many reasons, the main one being
that it can make the sampling strategy more efficient.
You need a larger sample to get a more accurate estimation of a
characteristic that varies greatly from one unit to the other than
for a characteristic that does not.
For example, if every person in a population had the same
salary, then a sample of one individual would be enough to
get a precise estimate of the average salary
Stratified random sampling…
Equal allocation:
Allocate equal sample size to each stratum
Proportionate allocation:
$n_j = \frac{n}{N} N_j$
$n_j$ is the sample size of the jth stratum
$N_j$ is the population size of the jth stratum
$n = n_1 + n_2 + \dots + n_k$ is the total sample size
$N = N_1 + N_2 + \dots + N_k$ is the total population size
Stratified random sampling…
Example:
Village            A     B     C     D     Total
Households (HHs)   100   150   120   130   500
Sample size        ?     ?     ?     ?     60
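One way to work through the village example is to apply proportionate allocation directly. The sketch below assumes the household counts from the table and a total sample of 60; rounding to whole households is an added assumption, and in general rounded allocations may need a small adjustment so that they still sum to n.

```python
def proportionate_allocation(stratum_sizes, n):
    """Allocate a total sample of n across strata in proportion to stratum size."""
    N = sum(stratum_sizes.values())  # total population size
    return {name: n * Nj / N for name, Nj in stratum_sizes.items()}

# Household counts per village, taken from the example table.
households = {"A": 100, "B": 150, "C": 120, "D": 130}

exact = proportionate_allocation(households, 60)
# Rounding to whole households is an illustrative choice, not part of the formula.
rounded = {village: round(size) for village, size in exact.items()}

print(exact)
print(rounded, "total:", sum(rounded.values()))
```

Here the exact allocations for villages C and D are fractional, so they are rounded to the nearest whole household, and the rounded sizes still add up to 60.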
Advantage
The representativeness of the sample is improved.
Disadvantage
Sampling frame for the entire population has to be prepared
separately for each stratum.
4. Cluster sampling
Sometimes it is too expensive to carry out SRS
Population may be large and scattered.
Complete list of the study population unavailable
Population consists of many natural groups (clusters)
Travel costs can become expensive if interviewers have to
survey people from one end of the country to the other.
Cluster sampling is the most widely used method for reducing cost
(administrative convenience)
The clusters should be similar to one another (homogeneous between
clusters), unlike stratified sampling, where the strata differ from one another
Steps in cluster sampling
Cluster sampling divides the population into groups or clusters.
A number of clusters are selected randomly to represent the total
population, and then all units within selected clusters are included in
the sample.
This differs from stratified sampling, where some units are selected
from each group.
No units from non-selected clusters are included in the sample; they
are represented by those from selected clusters.
Example:
In a school based study, we assume students of the same school to be
homogeneous.
We can randomly select sections and include all students of the
selected sections only.
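The school example can be sketched as follows. This is a minimal illustration, assuming six hypothetical sections of five students each; the section and student labels, and the function name, are invented for the example.

```python
import random

def cluster_sample(clusters, m, seed=None):
    """Randomly select m clusters and include every unit in each selected cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)  # random selection of clusters
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units

# Hypothetical school: sections A to F with five students each.
sections = {
    f"section_{s}": [f"{s}-student{i}" for i in range(1, 6)]
    for s in "ABCDEF"
}

chosen, students = cluster_sample(sections, 2, seed=0)
print(chosen)
print(students)
```

Only the list of sections is needed as a sampling frame; students in non-selected sections never enter the sample.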
Cluster sampling…
Advantages:
Cost and time reduction
It creates 'pockets' of sampled units instead of spreading the
sample over the whole territory.
Sometimes a list of all units in the population is not available,
while a list of all clusters is either available or easy to create.
Disadvantages:
Creates a loss of efficiency when compared with SRS.
It is usually better to survey a large number of small clusters
instead of a small number of large clusters.
This is because neighboring units tend to be more alike, resulting in a
sample that does not represent the whole spectrum of opinions or
situations present in the overall population (Design Effect).
5. Multi-stage sampling
Similar to cluster sampling, except that it involves picking a
sample from within each chosen cluster, rather than including all
units in the cluster.
The selected clusters in the primary cluster sample are themselves
sampled, rather than fully studied.
This type of sampling requires at least two stages.
The primary sampling unit (PSU) is the sampling unit in the first
sampling stage.
The secondary sampling unit (SSU) is the sampling unit in the
second sampling stage, etc.
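A two-stage design can be sketched by extending the cluster idea: clusters (PSUs) are drawn first, then units (SSUs) are drawn within each selected cluster. The village and household labels below are hypothetical, as are the function name and the choice of simple random sampling at both stages.

```python
import random

def two_stage_sample(clusters, m, n_per_cluster, seed=None):
    """Stage 1: randomly select m clusters (PSUs).
    Stage 2: randomly select up to n_per_cluster units (SSUs) within each."""
    rng = random.Random(seed)
    psus = rng.sample(sorted(clusters), m)
    return {
        psu: rng.sample(clusters[psu], min(n_per_cluster, len(clusters[psu])))
        for psu in psus
    }

# Hypothetical frame: five villages (PSUs), each with 20 households (SSUs).
villages = {
    f"village_{v}": [f"{v}-hh{i}" for i in range(1, 21)]
    for v in "ABCDE"
}

sample = two_stage_sample(villages, m=2, n_per_cluster=5, seed=3)
for psu, households in sample.items():
    print(psu, households)
```

A list of households is needed only for the selected villages, which is the practical saving over listing the whole population.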
Multi-stage sampling…
Advantage:
You do not need to have a list of all units in the
population.
Saves a great amount of time and effort by
not having to create a list of all the units in a
population.
Commonly used with cluster sampling
Multi‐Stage Cluster Sampling
Disadvantage
Sampling error is increased compared with SRS
B. Non-probability sampling
In non‐probability sampling, every item has an unknown chance of
being selected.
There is an assumption that there is an even distribution of a
characteristic of interest within the population.
In non‐probability sampling, since elements are chosen
subjectively, there is no way to estimate the probability of any one
element being included in the sample.
It may lead to unrepresentative samples and/or
unconvincing results
Reliability can’t be ensured
Inappropriate if the aim is to measure variables and generalize
findings obtained from a sample to the population.
Non probability sampling …
Despite these drawbacks, non‐probability sampling
methods can be useful when descriptive comments
about the sample itself are desired.
They are also quick, inexpensive and convenient.
The most common types of non‐probability sampling
1. Convenience or haphazard sampling
2. Volunteer sampling
3. Judgment sampling
4. Quota sampling
5. Snowball sampling technique
1. Convenience or haphazard sampling
Convenience sampling is sometimes referred to as haphazard or
accidental sampling.
The major reason for using it is administrative convenience: the sample is
chosen for ease of access.
It is not normally representative of the target population
because sample units are only selected if they can be
accessed easily and conveniently.
It can be used when time and resources are limited, but that
advantage is greatly offset by the presence of bias.
Although useful applications of the technique are limited, it can
deliver accurate results when the population is homogeneous.
Convenience or haphazard sampling…
For example, a scientist could use this method to
determine whether a lake is polluted or not.
Assuming that the lake water is well‐mixed, any
sample would yield similar information.
A scientist could safely draw water anywhere on the
lake without bothering about whether or not the sample
is representative
2. Volunteer sampling
Occurs when people volunteer to be involved in the study.
In psychological experiments or pharmaceutical trials (e.g.,
drug testing), for example, it would be difficult and unethical
to enlist random participants from the general public.
In these instances, the sample is taken from a group of
volunteers.
Sometimes, the researcher offers payment to attract
respondents.
Sampling voluntary participants as opposed to the general
population may introduce strong biases.
The majority does not volunteer, resulting in large selection
bias.
3. Judgment sampling
This approach is used when a sample is taken based on
certain judgments about the overall population.
The underlying assumption is that the investigator will select
units that are characteristic of the population.
The critical issue here is objectivity: how much can judgment be
relied upon to arrive at a typical sample?
Researchers often use this method in exploratory studies like
pre‐testing of questionnaires and focus groups.
They also prefer to use this method in laboratory settings
where the choice of experimental subjects (i.e., animal,
human) reflects the investigator's pre‐existing beliefs about the
population
4. Quota sampling
The most common sampling method in market research on
views about products
A proper design may be used to determine what
numbers are needed in each of the quotas
Example: a sample of 50 men and 50 women
Selection of individuals is done until the required total in
each group (quota) is obtained.
Quota sampling is an effective sampling method when
information is urgently required and can be conducted without
sampling frames.
In many cases where the population has no suitable frame, quota
sampling may be the only appropriate sampling method.
Quota sampling…
The main argument against quota sampling is that it
does not meet the basic requirement of randomness.
5. Snowball sampling
A technique for selecting a sample where existing
study subjects recruit future subjects from among their
acquaintances.
Thus the sample group appears to grow like a rolling
snowball.
Useful for sampling people who are difficult to contact
Snowball sampling …
This sampling technique is often used for hidden populations
which are difficult for researchers to access; example
populations would be drug users, commercial sex workers (CSWs),
homeless or street children, etc.
Because sample members are not selected from a sampling frame,
snowball samples are subject to numerous biases. For example,
people who have many friends are more likely to be recruited into the
sample
Sampling distribution
The mean of a representative sample provides an estimate of
the unknown population mean, but intuitively we know that if
we took multiple samples from the same population, the
estimates would vary from one another.
We could, in fact, sample over and over from the same
population and compute a mean for each of the samples. In
essence, all these sample means constitute yet another
"population," and we could graphically display the frequency
distribution of the sample means.
This is referred to as the sampling distribution of
the sample means.
Main types of sampling distributions
I. Distribution of the sample mean
II. Distribution of the sample proportion
III. Distribution of the difference between two means
IV. Distribution of the difference between two proportions
I. Sampling distribution of sample mean
The sampling distributions illustrate three fundamental properties:
1. The mean:
The mean of all possible estimates obtained from samples of identical
size is equal to the true population mean.
2. The standard deviation(SD):
The SD of the sampling distribution decreases as the sample size
increases.
The SD of a sampling distribution takes a special name, standard
error, often indicated by the letters SE.
3. The shape:
The shape of the sampling distribution is approximately normal
when the sample size is large.
This property is known as the Central Limit Theorem.
It is the most important of all the three properties
Sampling distribution of sample mean…
Now consider all possible samples of size n=2.
There are 4*4 = 16 different but equally-likely samples of size 2 that
can be drawn (with replacement) from a uniform population of the even
integers from 18 to 24 (18, 20, 22, 24).
Each of these samples has a sample mean. For example, the mean of the
sample (20,22) is 21, and the mean of the sample (18,22) is 20.
[Tables: Samples of size 2 from uniform (18,24); Sample means from uniform (18,24)]
Sampling distribution of sample mean…
Summary measures of this sampling distribution:
1. Calculate the mean of the sample means by adding the 16 individual
sample means & dividing the sum by 16.
2. Also calculate the SD of the sample means.
3. Finally compare with the original population results.
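The three steps above can be carried out by brute-force enumeration. This is a minimal sketch, assuming the population consists of the four values 18, 20, 22 and 24 (an assumption consistent with the 4*4 = 16 samples mentioned earlier); the variable names are illustrative.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

# Assumed population: four equally likely values.
population = [18, 20, 22, 24]

# All 16 ordered samples of size 2 drawn with replacement.
samples = list(product(population, repeat=2))
sample_means = [mean(s) for s in samples]

pop_mean = mean(population)        # population mean
pop_sd = pstdev(population)        # population standard deviation
mean_of_means = mean(sample_means) # step 1: mean of the sample means
sd_of_means = pstdev(sample_means) # step 2: SD of the sample means (standard error)

# Step 3: compare the sampling distribution with the population.
print(pop_mean, mean_of_means)
print(pop_sd, sd_of_means, pop_sd / sqrt(2))
```

Under this assumption the mean of the sample means equals the population mean, and the SD of the sample means equals the population SD divided by the square root of the sample size (here 2).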
Comparing the population with its sampling distribution
Example
If the population mean is μ = 98.6 degrees and a sample of n = 5
temperatures yields a sample mean of $\bar{x}$ = 99.2 degrees, then the
sampling error is: $\bar{x} - \mu$ = 99.2 - 98.6 = 0.6 degrees.
The sampling error may be positive or negative ($\bar{x}$ may be > or < μ)
The expected sampling error decreases as the sample size increases
Central Limit Theorem (CLT)
The CLT states that if you have a population with mean μ and
standard deviation σ and take sufficiently large random
samples from the population with replacement, then the
distribution of the sample means will be approximately
normally distributed.
This will hold true regardless of whether the source population
is normal or skewed, provided the sample size is sufficiently
large (usually n > 30). If the population is normal, then the
theorem holds true even for samples smaller than 30.
Central Limit Theorem (CLT)…
In fact, this also holds true even if the population is binomial,
provided that min(np, n(1-p)) is at least 5, where n is the sample size
and p is the probability of success in the population.
This means that we can use the normal probability model to
quantify uncertainty when making inferences about a
population mean based on the sample mean.
For the random samples we take from the population, we
can compute the mean of the sample means:
$\mu_{\bar{x}} = \mu$
Similarly, we can compute the standard deviation of the
sample means:
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Central Limit Theorem (CLT)…
In order for the result of the CLT to hold true,
the sample must be sufficiently large (n > 30). Again,
there are two exceptions to this. If the population is
normal, then the result holds for samples of any size
(i.e., the sampling distribution of the sample means
will be approximately normal even for samples of size
less than 30). If the population is binomial, the requirement
is instead that min(np, n(1-p)) is at least 5.
Central Limit Theorem with a Normal Population:
The figure below illustrates a normally distributed characteristic, X, in a
population in which the population mean is 75 with a standard deviation 8.
Central Limit Theorem with a Normal Population…
If we take simple random samples (with replacement) of size n=10 from the
population and compute the mean for each of the samples, the distribution
of sample means should be approximately normal according to the CLT.
Note that the sample size (n=10) is less than 30, but the source population is
normally distributed, so this is not a problem.
The distribution of the sample means is illustrated below. Note that the
horizontal axis is different from the previous illustration, and that the range is
narrower.
Central Limit Theorem with a Normal Population…
The mean of the sample means is 75 and the standard
deviation of the sample means is about 2.5, with the standard
deviation of the sample means computed as follows:
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{8}{\sqrt{10}} \approx 2.5$
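The claim for the normal population can be checked by simulation. The sketch below assumes 5,000 repeated samples and a fixed random seed, both arbitrary choices for illustration; only μ = 75, σ = 8 and n = 10 come from the example.

```python
import random
from math import sqrt
from statistics import mean, pstdev

mu, sigma, n = 75, 8, 10              # values from the example
theoretical_se = sigma / sqrt(n)      # standard error, sigma / sqrt(n)

rng = random.Random(42)               # fixed seed (assumption) for reproducibility
n_repeats = 5000                      # number of simulated samples (assumption)

# Draw repeated samples of size n and record each sample mean.
sample_means = [
    mean(rng.gauss(mu, sigma) for _ in range(n))
    for _ in range(n_repeats)
]

print(round(theoretical_se, 2))
print(round(mean(sample_means), 2), round(pstdev(sample_means), 2))
```

The simulated mean of the sample means should land close to 75 and their standard deviation close to the theoretical standard error, though the exact figures depend on the seed and the number of repeats.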
Central Limit Theorem with a Dichotomous Outcome:
Now suppose we measure a characteristic, X, in a population and
that this characteristic is dichotomous (e.g., success of a medical
procedure: yes or no) with 30% of the population classified as a
success (i.e., p=0.30) as shown below.
Central Limit Theorem with a Dichotomous Outcome…
The CLT applies even to binomial populations like this provided that the minimum
of np and n(1-p) is at least 5, where "n" refers to the sample size, and "p" is the
probability of "success" on any given trial.
In this case, we will take samples of n=20 with replacement, so min(np, n(1-
p)) = min(20(0.3), 20(0.7)) = min(6, 14) = 6. Therefore, the criterion is met.
The population mean and standard deviation for a binomial distribution are given by:
Mean binomial probability: $\mu = np$
Standard deviation: $\sigma = \sqrt{np(1-p)}$
Central Limit Theorem with a Dichotomous Outcome…
The distribution of sample means based on samples of size n=20 is shown below.
Central Limit Theorem with a Dichotomous Outcome…
Now, instead of taking samples of n=20, suppose we take simple random samples
(with replacement) of size n=10.
Note that in this scenario we do not meet the sample size requirement for the
Central Limit Theorem (i.e., min(np, n(1-p)) = min(10(0.3), 10(0.7)) = min(3,
7) = 3).
The distribution of sample means based on samples of size n=10 is shown
on the right, and you can see that it is not quite normally distributed.
The sample size must be larger in order for the distribution to approach
normality.
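The sample size criterion for a dichotomous outcome is easy to check in code. The following is a minimal sketch; the function names are illustrative, and the threshold of 5 is taken from the rule stated in the text.

```python
from math import sqrt

def normal_approx_ok(n, p, threshold=5):
    """Check the rule of thumb min(np, n(1-p)) >= threshold."""
    return min(n * p, n * (1 - p)) >= threshold

def binomial_mean_sd(n, p):
    """Mean np and standard deviation sqrt(np(1-p)) of a binomial count."""
    return n * p, sqrt(n * p * (1 - p))

p = 0.30  # proportion of successes in the example
for n in (20, 10):
    mean_count, sd_count = binomial_mean_sd(n, p)
    print(n, normal_approx_ok(n, p), round(mean_count, 2), round(sd_count, 2))
```

For p = 0.30 the check passes with n = 20 and fails with n = 10, in line with the two scenarios described above.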
Application of the Central Limit Theorem:
Example-1:
Data from the Framingham Heart Study found that subjects over age 50 had a
mean HDL of 54 and a standard deviation of 17.
Suppose a physician has 40 patients over age 50 and wants to determine the
probability that the mean HDL cholesterol for this sample of 40 patients is 60 mg/dl or
more (i.e., low risk).
Probability questions about a sample mean can be addressed with the Central
Limit Theorem, as long as the sample size is sufficiently large. In this case n=40, so
the sample mean is likely to be approximately normally distributed,
so we can compute the probability of HDL>60 by using the standard normal
distribution table.
Application of the CLT Example-1…
The population mean is 54, but the question is what is the probability that the
sample mean will be >60?
In general, we have the standardization formula:
$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{60 - 54}{17 / \sqrt{40}} \approx \frac{6}{2.7} \approx 2.22$
The value for Z = 2.22 can be looked up in the standard normal distribution table. Because the table
gives P(Z < 2.22) = 0.9868, we compute P(Z > 2.22) = 1 - 0.9868 = 0.0132.
Therefore, the probability that the mean HDL in these 40 patients will exceed 60 is 1.32%.
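The same calculation can be done without a table. This is a minimal sketch using Python's standard library; the function name is illustrative. Because it keeps Z unrounded, the probability it reports differs slightly from the table-based figure, which relies on rounded intermediate values.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_exceeds(x_bar, mu, sigma, n):
    """Return (z, P(sample mean > x_bar)) using the normal approximation."""
    se = sigma / sqrt(n)              # standard error of the mean
    z = (x_bar - mu) / se             # standardized value
    return z, 1 - NormalDist().cdf(z) # upper-tail probability

# Values from Example-1: mean 54, SD 17, n = 40, threshold 60.
z, p = prob_sample_mean_exceeds(60, 54, 17, 40)
print(round(z, 2), round(p, 4))
```

The result is a probability of roughly 1.3%, in agreement with the hand calculation up to rounding.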
Application of the Central Limit Theorem…
Example-2:
Suppose we want to estimate the mean LDL cholesterol in the population
of adults 65 years of age and older. We know from studies of adults under
age 65 that the standard deviation is 13, and we will assume that the
variability in LDL in adults 65 years of age and older is the same.
We will select a sample of n=100 participants > 65 years of age, and we
will use the mean of the sample as an estimate of the population mean.
We want our estimate to be precise, specifically we want it to be within 3
units of the true mean LDL value.
Application of the CLT Example-2…
What is the probability that our estimate (i.e., the sample mean) will be within 3 units of the
true mean? We think of this question as P(μ - 3 < sample mean < μ + 3). Because this is a
probability about a sample mean, we will use the Central Limit Theorem.
With a sample of size n=100 we clearly satisfy the sample size criterion so we can use the
Central Limit Theorem and the standard normal distribution table.
The previous questions focused on specific values of the sample mean (e.g., 50 or 60) and
we converted those to Z scores and used the standard normal distribution table to find the
probabilities.
Here the values of interest are μ - 3 and μ + 3. The solution can be set up as follows:
$P(\mu - 3 < \bar{x} < \mu + 3) = P\left(\frac{-3}{13/\sqrt{100}} < Z < \frac{3}{13/\sqrt{100}}\right) = P(-2.31 < Z < 2.31)$
From the standard normal distribution table P(Z < 2.31) = 0.9896, and P(Z < -2.31) = 0.0104. The
range between these two = P(-2.31 < Z < 2.31) = 0.9896 - 0.0104 = 0.9792.
Therefore, there is a 97.92% probability that the sample mean, based on a sample of size
n=100, will be within 3 units of the true population mean.
This is a very powerful statement, because it means that for this question looking only at 100
individuals aged 65 or older gives us a very precise estimate of the population mean.
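The probability of landing within a given margin of the true mean can also be computed directly. The sketch below uses Python's standard library with an illustrative function name; it keeps Z unrounded, so the last decimal place may differ slightly from the table-based answer.

```python
from math import sqrt
from statistics import NormalDist

def prob_within_margin(margin, sigma, n):
    """Return (z, P(|sample mean - mu| < margin)) using the normal approximation."""
    se = sigma / sqrt(n)   # standard error of the mean
    z = margin / se        # margin expressed in standard errors
    nd = NormalDist()
    return z, nd.cdf(z) - nd.cdf(-z)

# Values from Example-2: SD 13, n = 100, margin of 3 units.
z, p = prob_within_margin(3, 13, 100)
print(round(z, 2), round(p, 4))
```

Note that the population mean itself never enters the calculation: only the margin, the standard deviation and the sample size matter.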
Application of the Central Limit Theorem…
Example 3
Given: μ = 50, σ = 16, n = 64
Find: P($\bar{x}$ > 53)
Solution
1. Write the given information, μ=50, σ=16, n=64
2. Sketch a normal curve
3. Convert x to a z score