Introduction To Probabilistic Sampling
• In survey sampling, a population is specified whose data values are unknown but are regarded as fixed, not random. The observed sample, however, is random, because it depends on the random selection of individuals. Two conditions define a probabilistic sample:
1. Every individual in the population must have a known, nonzero probability of belonging to the sample (πi > 0 for individual i), and πi must be known for every individual who ends up in the sample.
2. Every pair of individuals in the population must have a known, nonzero joint probability of belonging to the sample (πij > 0 for the pair of individuals (i, j)), and πij must be known for every pair that ends up in the sample.
Sampling weights I
• If we take a simple random sample of 3500 people from Neverland (with total population 35 million)
then any person in Neverland has a chance of being sampled equal to πi = 3500/35000000 = 1/10000
for every i.
• If 100 people in our sample are unemployed, we would then expect 100 × 10000 = 1 million unemployed people in Neverland.
• An individual sampled with a sampling probability of πi represents 1/πi individuals in the population.
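The weight arithmetic can be sketched directly in R; the numbers below reproduce the Neverland example:

```r
# Each sampled person carries a weight of 1/pi_i: the number of
# population individuals he or she represents
pi_i <- 3500/35000000            # inclusion probability, 1/10000
w <- 1/pi_i                      # sampling weight, 10000
unemployed_in_sample <- 100
estimate <- unemployed_in_sample * w
estimate                         # one million unemployed estimated
```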
Sampling weights II
• Example: measure the income of a sample of one individual drawn from a population of N individuals, where individual i is selected with probability πi.
• The estimate (T̂income) of the total income of the population (Tincome) would be the income for that individual divided by its selection probability:

T̂income = (1/πi) · incomei

• This is not a good estimate, since it is based on only one person, but it will be unbiased: the expected value of T̂income is

E[T̂income] = Σ_{i=1}^{N} (1/πi) · incomei · πi = Σ_{i=1}^{N} incomei = Tincome
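The cancellation in the expectation above can be checked numerically; the incomes and selection probabilities below are made-up illustrative values:

```r
# Made-up population: N = 5 incomes, selection probabilities summing to 1
income <- c(20, 35, 50, 80, 120)
pi_i   <- c(0.10, 0.15, 0.20, 0.25, 0.30)
# E[T-hat] = sum_i (income_i / pi_i) * pi_i, which telescopes to the total
expected_That <- sum((income/pi_i) * pi_i)
c(expected_That, sum(income))    # both are 305: the estimator is unbiased
```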
The Horvitz-Thompson estimator
• The so-called Horvitz-Thompson estimator of the population total is the foundation for many complex analyses. Define the expanded value

X̃i = (1/πi) · Xi

• Given a sample of size n, the Horvitz-Thompson estimator T̂X for the population total TX of X is

T̂X = Σ_{i=1}^{n} (1/πi) · Xi = Σ_{i=1}^{n} X̃i
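A minimal R sketch of the Horvitz-Thompson estimator (the function name ht.total is ours, not from any package):

```r
# Horvitz-Thompson estimator of a population total: divide each sampled
# value by its inclusion probability, then sum
ht.total <- function(x, pi_i) sum(x / pi_i)

# Toy check under SRS: n = 3 from N = 30 gives pi_i = 0.1 for everyone,
# so each sampled unit stands for 10 population units
ht.total(c(1, 0, 1), rep(0.1, 3))   # 20
```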
• The variance estimator is

V̂ar(T̂X) = Σ_{i,j} [ (Xi·Xj)/(πi·πj) − (Xi·Xj)/πij ]

where the sum runs over all pairs (i, j) of sampled observations (with πii = πi).
• The formula applies to any design, however complicated, where πi and πij are known for the sampled observations.
• The formula depends on the pairwise sampling probabilities πij, not just on the sampling weights: the weights alone are not enough to estimate the variance.
Simple Random Sampling
• Simple random sampling (SRS) provides a natural starting point for a discussion of probability sampling
methods. It is the simplest method and it underlies many of the more complex methods.
• Formally defined: Simple random sampling is a sampling scheme with the property that any of the
possible subsets of n distinct elements, from the population of N elements, is equally likely to be the
chosen sample.
• Every element in the population has the same probability of being selected for the sample (πi = n/N), and every pair of elements has the same joint inclusion probability (πij = n(n − 1)/(N(N − 1))).
EXAMPLE:
• Suppose that a survey is to be conducted in a high school to find out about the students’ leisure habits.
A list of the school’s 1872 students is available, with the list being ordered by the students’ identification
numbers.
• Suppose that an SRS of n = 250 is required for the survey. How to draw this sample?
– By a lottery method: an urn. Although conceptually simple, this method is cumbersome to execute, and it depends on the assumption that the representative discs (one for each student) have been thoroughly mixed.
– By means of a table of random numbers. Done by hand, however, this is a tedious task, requiring a large selection of random numbers, most of which are nonproductive.
• There are two options:
– Simple random sampling with replacement: an element can be selected more than once.
– Simple random sampling without replacement: the sample must contain n distinct elements.
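Both schemes are available through the base R function sample(); a short sketch for the school example:

```r
set.seed(1)                            # for reproducibility
students <- 1:1872
# SRS without replacement: 250 distinct students
srswor <- sample(students, 250, replace = FALSE)
# SRS with replacement: the same student may be drawn more than once
srswr  <- sample(students, 250, replace = TRUE)
length(unique(srswor))                 # always 250
```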
• Sampling without replacement gives more precise estimators than sampling with replacement.
• Now, assume that we have responses from all those sampled (there are no problems of non-response).
• Next step: summarize the individual responses to provide estimates of characteristics of interest for the population.
For instance: the average number of hours of television viewing per day, or the proportion of students with some characteristic of interest.
See a visual demonstration of SRS:
library(animation)
sample.simple()
NOTATION:
• Capital letters are used for population values and parameters, and lower-case letters for sample values
and estimators.
• Y1 , Y2 , . . . YN denote the values of the variable y (e.g., hours of television viewing) for the N elements
in the population.
• In general, the value of variable y for the i-th element in the population is Yi (i = 1, 2, . . . , N), and the value for the i-th element in the sample is yi (i = 1, 2, . . . , n).
• The population mean is given by

Ȳ = (1/N) · Σ_{i=1}^{N} Yi

and the population variance by

σ² = (1/N) · Σ_{i=1}^{N} (Yi − Ȳ)²
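A quick R check of these definitions (note that R's var() uses divisor N − 1, so the population variance with divisor N is computed directly):

```r
Y <- c(1, 2, 4, 4, 7, 7, 7, 8)   # small population used later in the slides
N <- length(Y)
Ybar   <- sum(Y)/N                # population mean
sigma2 <- sum((Y - Ybar)^2)/N     # population variance (divisor N)
c(Ybar, sigma2)                   # 5 and 6
```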
Suppose that we wish to estimate the mean number of hours of television viewing per day for all the students in the school.
• On average, the estimator must be very close to Ȳ over repeated applications of the sampling method.
• Observe that the term estimate is used for a specific value, while estimator is used for the rule of computation.
• In the example we obtain an estimate of 2.20 hours of television viewing (computed by substituting the sample values into the estimator).
NOTE:
• Properties of sample estimators are derived theoretically by considering the pattern of results that would be obtained over repeated applications of the sampling scheme.
• Example: suppose that drawing an SRS of 250 students from the 1872 students and then calculating the sample mean for each sample were carried out infinitely many times (with replacement).
• The resulting set of sample means would have a distribution, known as the sampling distribution of the mean.
• If the sample size is not too small (n ≈ 20 is sufficient) the distribution of the means of each sample
approximates the normal distribution, and the mean of this distribution is the population mean, Ȳ .
• Then it is said that ȳ is an unbiased estimator of Ȳ: the mean of the individual sample estimates ȳ over an infinite number of samples equals Ȳ.
• Although the sampling distribution of ȳ is centered on Ȳ , any one estimate will differ from Ȳ .
• To avoid confusion with the standard deviation of the element values, standard deviations of sampling distributions are called standard errors.
• The variance (not useful in practice, since it requires all the population values) of a sample mean (ȳ0) of an SRS of size n is given by

Var(ȳ0) = [(N − n)/(N − 1)] · σ²/n = [(N − n)/(N − 1)] · (1/(n·N)) · Σ_{i=1}^{N} (Yi − Ȳ)²
• See example 2.1, p. 29 (artificial example) from Lohr (2006)
Suppose we have a population with eight elements (e.g. an almost extinct species of animal...) and we know the weight yi for each of the N = 8 units of the whole population. We want to estimate the total weight of the population.

Animal i:  1  2  3  4  5  6  7  8
yi:        1  2  4  4  7  7  7  8

• We take a sample of size 4. How many samples of size 4 can be drawn without replacement from this population?
• There are C(8, 4) = 70 possible samples of size 4 that can be drawn without replacement from this population.
• We define P(S) = 1/70 for each distinct subset S of size 4 from the population.
# Consider a vector of 'observations'
y <- c(1,2,4,4,7,7,7,8)
# Consider all possible samples of size 4. They are placed into a matrix with 70 rows.
allguys <- NULL
for(i in 1:5){
  for (j in (i+1):6) {
    for (k in (j+1):7){
      for(m in (k+1):8) {
        allguys <- rbind(allguys, y[c(i,j,k,m)])
      }
    }
  }
}
dim(allguys)   # 70 4
allguys
# Other option
library(combinat)
cuales <- combn(y, 4)   # each column is one possible sample
t(cuales)
# To estimate the whole population weight, we take two times the sum
# of each sample, since the expansion factor is N/n = 8/4 = 2
alltotal <- 2*apply(allguys, 1, sum)
table(alltotal)
print(c("Mean:",mean(alltotal)),quote=F)   # equals sum(y): unbiased
print(c("Sum:",sum(y)),quote=F)
• The formula Var(ȳ0) depends on the factor (N − n)/(N − 1), on σ², and on the sample size n.
• The term (N − n)/(N − 1) reflects the fact that the survey population is finite in size and that sampling is conducted without replacement.
• With an infinite population, or if sampling were conducted with replacement, the term is not included.
• The term indicates the gains of sampling without replacement over sampling with replacement.
• In many practical situations the populations are large and, even though the samples may also be large, the sampling fractions n/N remain small.
• In large populations, the difference between sampling with and without replacement is not important:
even if the sample is drawn with replacement, the chance of selecting an element more than once is
slight.
• If the sampling fraction n/N is small, (N − n)/(N − 1) is close to 1 and has a negligible effect on the standard error.
• The correction factor is commonly neglected (i.e., treated as 1) when the sampling fraction (n/N) is small.
• For large populations it is the sample size that is dominant in determining the precision of survey results.
• A sample of size 2000 drawn from a country with a population of 200 million yields about as precise results as a sample of the same size drawn from a small city of 40000 (assuming the element variances are similar).
• Since σ² is unknown in practice, it is estimated by the sample variance

s² = (1/(n − 1)) · Σ_{i=1}^{n} (yi − ȳ)²

for which

E(s²) = [N/(N − 1)] · σ²

then,

V̂ar(ȳ0) = [(N − n)/N] · s²/n = (1 − n/N) · s²/n
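The identity E(s²) = [N/(N − 1)]·σ² can be verified exactly by enumerating all 70 samples of size 4 from the eight-animal population used earlier:

```r
y <- c(1, 2, 4, 4, 7, 7, 7, 8)          # population of N = 8 weights
N <- length(y); n <- 4
sigma2 <- sum((y - mean(y))^2)/N         # population variance: 6
# sample variance of every possible SRS of size 4, averaged over all 70
all_s2 <- combn(N, n, function(idx) var(y[idx]))
c(mean(all_s2), N/(N - 1)*sigma2)        # both equal 48/7
```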
• The factor (1 − n/N) is called the finite population correction (fpc), where n/N is the sampling fraction.
• It is easy to determine a confidence interval for the population mean, applying the standard Central Limit Theorem argument.
• Example: suppose that the mean hours watching television per day for the 250 sampled students is ȳ0 = 2.192 hours, with an element variance of s² = 1.008. Then a 95% confidence interval for Ȳ is

2.192 ± 1.96·√[(1 − 250/1872)·1.008/250] = 2.192 ± 0.116

• That is, we are 95% confident that the interval from 2.076 to 2.308 contains the population mean.
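The interval can be reproduced in R:

```r
n <- 250; N <- 1872
ybar <- 2.192; s2 <- 1.008
half <- 1.96 * sqrt((1 - n/N) * s2/n)   # half-width, with fpc
round(c(ybar - half, ybar + half), 3)   # about 2.076 and 2.308
```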
• See some functions to illustrate the Central Limit Theorem and to compute confidence intervals pro-
grammed in R.
# Using R to illustrate the Central Limit Theorem (from Venables)
# (the loop body was lost in extraction; this reconstruction plots the
# histogram of N means of k uniforms, which approaches a normal as k grows)
N <- 10000
graphics.off()
for(k in 1:20) {
  m <- rowMeans(matrix(runif(N*k), N, k))
  hist(m, prob=TRUE, breaks=50, main=paste("Means of", k, "uniforms"))
  pu <- par("usr")[1:2]
  curve(dnorm(x, 1/2, sqrt(1/(12*k))), pu[1], pu[2], add=TRUE)
  Sys.sleep(1)
}
# Using the TeachingDemo library
library(TeachingDemos)
X11()
clt.examp()
X11()
clt.examp(5)
X11()
clt.examp(30)
X11()
clt.examp(50)
srs.mu <- function(y, N, cuacua=0.95) {
  # Confidence interval for the population mean from an SRS (with fpc);
  # cuacua is the confidence level
  n <- length(y)
  z <- qnorm(1-(1-cuacua)/2)
  s2 <- var(y)
  B1 <- mean(y) - z*sqrt((1 - n/N)*s2/n)   # lower limit
  B2 <- mean(y) + z*sqrt((1 - n/N)*s2/n)   # upper limit
  list(B2,B1)}
Sample size for estimating population means
• How large must a sample be? ⇒ Observations cost money, time and effort.
• The number of observations needed to estimate a population mean µ with a bound ε on the error of estimation follows from the variance formula.
• Hence, the sample size required to estimate µ with a bound on the error of estimation ε is

n = N·s² / [(N/4)·ε² + s²]
EXAMPLE
• The average amount of money µ for a hospital's accounts receivable must be estimated. Although no prior data is available to estimate the population variance σ², it is known that most accounts lie within a €100 range. There are N = 1000 open accounts. Find the sample size needed to estimate µ with a bound on the error of estimation of ε = 3.
• First we estimate the population variance. Since the range is often approximately equal to four or six standard deviations (4·s or 6·s), depending on the normality of the data (see Chebyshev's inequality), then

s ≈ range/4 = 100/4 = 25

so s² ≈ 625, and

n = N·s² / [(N/4)·ε² + s²] = 1000·625 / [(1000/4)·9 + 625] = 217.39 ≈ 217 or 218 observations
See a function to calculate sample sizes programmed in R:
n.mu <- function(N, s2, epsilon) {
  D <- (epsilon^2)/4
  n <- (N*s2)/((N*D)+s2)
  n.round <- ceiling(n)
  cbind(n, n.round)
}
# Application
n.mu(N=1000, s2=625, epsilon=3)   # 217.39, rounded up to 218
• Many sample surveys are interested in a population total, e.g. when analyzing total accounts.
• The population total (the sum of all observations in the population) is denoted by the symbol τ. Hence τ = N·µ, and it is estimated by τ̂ = N·ȳ.
– Estimated variance of τ̂:

V̂ar(τ̂) = V̂ar(N·ȳ) = N² · (1 − n/N) · s²/n

where s² = (1/(n − 1)) · Σ_{i=1}^{n} (yi − ȳ)²
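A short R sketch of the total estimator; the sample values below are made up for illustration:

```r
# Hypothetical SRS of n = 5 account balances from N = 1000 open accounts
y <- c(42.5, 87.0, 55.0, 110.0, 61.0)
N <- 1000; n <- length(y)
tau.hat <- N * mean(y)                    # estimated total, N * ybar
var.tau <- N^2 * (1 - n/N) * var(y)/n     # its estimated variance
c(tau.hat, sqrt(var.tau))                 # estimate and standard error
```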
Systematic Sampling
• The method of systematic sampling reduces the effort required for the sample selection.
• Systematic sampling is easy to apply, involving simply taking every k-th element after a random start.
• Example: suppose that a sample of 250 students is required from a school with 2000 students. The sampling interval is then k = 2000/250 = 8.
• A systematic sample of the required size would then be obtained by taking a random number between 1 and 8, to determine the first student in the sample, and taking every eighth student thereafter.
• If the random number were 5, the selected students would be the fifth, thirteenth, twenty-first, and so on.
• When the sampling interval is not a simple integer, we can round the interval to an integer, with a resultant change in the sample size.
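The selection rule above is a one-liner in R:

```r
k <- 8                            # sampling interval, 2000/250
start <- 5                        # random start between 1 and k (fixed here)
selected <- seq(start, 2000, by = k)
head(selected)                    # 5 13 21 29 37 45
length(selected)                  # 250 students
```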
• Example: if the fraction is 250/1872, or 1 in 7.488, a 1-in-7 sample would produce a sample of 267 or 268 students, more than required.
• Another solution is to round the interval down, start with an element selected at random from the N elements in the population, and proceed until the desired sample size has been achieved. The list is therefore treated as circular, with the last listing followed by the first.
• Like SRS, systematic sampling gives each element in the population the same chance of being selected for the sample.
• It differs, however, from SRS in that the probabilities of different sets of elements being included in the sample are not all equal: only k distinct samples are possible, one for each possible random start.
• In systematic sampling the sample mean is a reasonable estimator of the population mean. However, the unequal probabilities of sets of elements mean that the SRS standard error formulae are not directly applicable.
• In order to estimate the standard error of estimators based on systematic samples, it is sometimes reasonable to assume that the list is approximately randomly ordered, in which case the sample can be treated as an SRS.
• Lists arranged in alphabetical order may often reasonably be treated in this way.
• Systematic sampling performs badly when the list is ordered in cycles of values of the survey variables
and when the sampling interval coincides with a multiple of the length of the cycle.
• Systematic sampling is widely used in practice without excessive concern for the damaging effects of such periodicity. See a visual demonstration:
library(animation)
sample.system()
systematic.sample <- function(n, N, initial=F){
  # Select a 1-in-k systematic sample of about n elements from 1:N
  k <- floor(N/n)
  if(initial==F) initial <- sample(1:k, 1)   # random start
  guy <- seq(initial, N, by=k)
  return(guy)
}
Sampling with probabilities proportional to size
• Sometimes it is advantageous to select sampling units with different (non-uniform) probabilities.
• The method is called sampling with probabilities proportional to size, or pps sampling.
• For a sample y1, y2, . . . , yn from a population of size N, let πi be the probability that yi appears in the sample.
• The pps estimator of µ only produces smaller variances than a standard SRS if the probabilities πi are roughly proportional to the values yi.
• In this case, the estimator of the population mean µ is

µ̂pps = (1/(N·n)) · Σ_{i=1}^{n} yi/πi

with estimated variance

V̂ar(µ̂pps) = [1/(N²·n·(n − 1))] · Σ_{i=1}^{n} (yi/πi − N·µ̂pps)²
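The two formulas translate directly into R (the function name mu.pps is ours; pis holds the selection probabilities of the sampled units):

```r
mu.pps <- function(yi, pis, N) {
  # pps estimator of the population mean and its estimated variance
  n  <- length(yi)
  mu <- (1/(N*n)) * sum(yi/pis)
  v  <- (1/(N^2*n*(n - 1))) * sum((yi/pis - N*mu)^2)
  list(estimate = mu, variance = v)
}
```

An approximate bound on the error of estimation is then 2·sqrt(variance).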
• The best practical way to choose the πi's is to choose them proportional to a known measurement of the size of each unit. See the pps package:
http://cran.r-project.org/web/packages/pps/index.html
Example (from Scheaffer et al., p. 80):
An investigator wishes to estimate the average number of defects per keyboard on keyboards of electronic components manufactured for installation in computers. The keyboards contain varying numbers of components, and the investigator feels that the number of defects should be positively correlated with the number of components on a keyboard.
Thus, pps sampling is used, with the probability of selecting any one keyboard for the sample being proportional to its number of components. A sample of n = 4 keyboards is drawn from the N = 10 keyboards of one day's production, whose component counts are known.
After the sampling was completed, the numbers of defects found on the four keyboards were, respectively, 1, 3, 2 and 1. Estimate the average number of defects per keyboard, and place a bound on the error of estimation.
# pps estimate of the mean defects per keyboard; pis are the selection
# probabilities of the 4 sampled keyboards (placeholder values, not the
# actual component-count data from Scheaffer et al.)
data <- 1:10          # keyboard identifiers
N <- length(data)
n <- 4
yi <- c(1, 3, 2, 1)   # defects on the sampled keyboards
pis <- c(0.10, 0.20, 0.15, 0.05)
m.pps <- (1/(N*n)) * sum(yi/pis)
m.pps
Consider an SRS based on the survey library:
http://faculty.washington.edu/tlumley/survey/
# Artificial data: the original construction was garbled in extraction;
# this reconstruction simply builds a data frame with 235 rows
mydata <- data.frame(type  = rep("sc", 235),
                     group = c(rep(1, 100), rep(2, 135)),
                     score = 100*runif(235))
N <- dim(mydata)[[1]]
n <- 50
library(foreign)
write.dta(mydata, "C:/QM/mydata.dta")
# Selection of a sample
library(survey)
srs <- mydata[sample(1:N, n), ]   # draw the SRS (reconstructed line)
srs$popsize <- N
dsrs <- svydesign(ids=~1, fpc=~popsize, data=srs)   # SRS design with fpc
summary(dsrs)
par(mfrow=c(2,1))
Consider an SRS programmed in Stata:
use C:\QM\mydata.dta
count
sample 20
count
gen pw = 235/47
* Set the sampling design
svyset [pweight=pw]
svydescribe
* Compute a box-plot of the variable of interest