Chap 1
Lassad El Moubarki.
Tunis Business School
October 6, 2023
5 Sample adjustment
Adjustment according to a quantitative variable
Adjustment by weight
Other method
The questionnaire
Types of responses
One of the more difficult skills in data analysis is deciding which statistical models and tests to
use in a particular situation. The choice of statistical model often depends on the type of the
variable. Statistical variables can be classified as follows.
Categorical (Qualitative)
Categorical nominal.
Example: Are you married?
Categorical ordinal.
Example: What grade (mention) did you obtain in your baccalaureate?
Quantitative (Numerical)
Continuous: Responses are numbers that can be broken into finer and finer units. The set of
possible numbers is large.
Example: What is your height?
Discrete: The set of possible numbers is finite or countable.
Example: How many children do you have?
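As a small illustration, the four variable types above could be stored in R as follows. The answers are hypothetical and the variable names are assumptions made for this sketch only.

```r
# Hypothetical survey answers, one vector per type of variable
married  <- factor(c("yes", "no", "yes"))              # categorical nominal
mention  <- factor(c("good", "fair", "very good"),
                   levels = c("fair", "good", "very good"),
                   ordered = TRUE)                     # categorical ordinal
height   <- c(1.72, 1.65, 1.80)                        # continuous (metres)
children <- c(0L, 2L, 1L)                              # discrete (counts)

table(married)   # counts are the natural summary of a nominal variable
max(mention)     # the order of the levels is meaningful for an ordinal variable
mean(height)     # a mean makes sense for a quantitative variable
```

The type of each vector then guides which summaries and tests are meaningful.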
Sampling vs census
Census and sampling are two different techniques used for collecting survey data about a
reference population.
In a census, all members of the population are enumerated.
Sampling, on the other hand, is a widely used method in statistics in which a
subset of the reference population is selected, usually at random.
A census implies the complete enumeration of the study objects, whereas sampling is the
enumeration of the subgroup of elements chosen for participation.
Why sample?
We need to sample when the census is unachievable or difficult to perform. Often, the census
is:
Expensive.
Time consuming.
Technically difficult to perform.
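# Running estimate of p = 0.5 from n draws of a fair coin (0 or 1)
n=2000
a=sample(c(0,1),n,replace=TRUE)   # the n draws
b=cumsum(a)/(1:n)                 # estimate of p after each draw
plot(1:n,b,col=2,type="l",xlim=c(1,n),ylim=c(0,1),
     main="Convergence as a function of n",xlab="n",ylab="Estimates of p")
abline(h=0.5,lty=2)               # true value of p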
[Figure: convergence as a function of n; vertical axis: the estimates of p]
Probability sampling is a sampling method in which every subject of the population has a
known, non-zero chance of being included in the sample.
Non-probability sampling is a sampling method in which the chance that a given
individual is included in the sample is not known.
In practice, we use non-probability sampling methods when we do not have the list of all
contacts or identifiers of the population subjects (the sampling frame).
Simple random sampling (SRS) is a completely random method of selecting the sample subjects,
in which all subjects of the reference population are equally likely to be selected. In practice,
we assign numbers to the population subjects and then choose randomly among those numbers
through an automated process.
Notations
Let:
X : variable of interest
N: population size
n: sample size
f: sampling rate, f = n/N
$\bar{X}_n$: the mean of the SRS sample
$S_n$: standard deviation of X estimated from the SRS sample
µ: the expected value of X
Remark: When the variable of interest is binary (yes or no, coded 1 or 0), the problem becomes
the estimation of a proportion p̂. In that case $S_n^2 = \hat{p}(1 - \hat{p})$.
ε: accuracy (margin of error) of the estimate
Advantages
To generate an SRS sample we only need the list of identifiers (or contacts) of the
population subjects.
Easy to apply.
Disadvantages
Problem of accuracy with small sample size.
Expensive in terms of money and time.
Exercise 1
(in class)
Exercise 2
Let’s consider a poll of Tunisian opinion about a certain political party. In a sample of 5,000
people randomly selected from all over Tunisia, 39% gave a favorable opinion. Calculate the
accuracy of this result and its 95% confidence interval.
(in class)
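One possible way to check the answer to Exercise 2 in R is sketched below. It assumes that the population is large enough for the sampling rate to be negligible and that the normal approximation applies, so the accuracy is taken as z times the square root of p̂(1 − p̂)/n; these assumptions are not stated in the exercise, and the variable names are chosen for this sketch.

```r
# Exercise 2 data: 39% favorable opinions in a sample of 5,000 people
p_hat <- 0.39
n <- 5000

# Assumption: very large population, normal approximation
z <- qnorm(0.975)                           # quantile for a 95% confidence level
eps <- z * sqrt(p_hat * (1 - p_hat) / n)    # accuracy (margin of error)
ci <- c(lower = p_hat - eps, upper = p_hat + eps)

eps
ci
```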
SRS in R
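A minimal sketch of how an SRS without replacement might be drawn in R with `sample()`. The population of 1,000 subjects and the variable x are simulated here purely for illustration. The accuracy uses the finite population correction (1 − f), which is an assumption consistent with sampling without replacement.

```r
set.seed(1)                          # only to make the sketch reproducible
N <- 1000                            # hypothetical population size
frame <- seq_len(N)                  # identifiers of the population subjects
x <- rnorm(N, mean = 50, sd = 10)    # hypothetical variable of interest

n <- 100                             # sample size
ids <- sample(frame, size = n, replace = FALSE)   # the SRS sample

x_bar <- mean(x[ids])                # sample mean
s2 <- var(x[ids])                    # sample variance
f <- n / N                           # sampling rate
eps <- qnorm(0.975) * sqrt((1 - f) * s2 / n)      # accuracy at the 95% level
c(lower = x_bar - eps, upper = x_bar + eps)       # confidence interval
```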
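Let ε be the desired accuracy for a confidence level of 1 − α. The optimal sample size to reach
this accuracy is:
For the mean:
\[ n^* = \frac{N}{1 + \dfrac{N \varepsilon^2}{z_{1-\alpha/2}^2 \, S_n^2}} \]
For a proportion:
\[ n^* = \frac{N}{1 + \dfrac{N \varepsilon^2}{z_{1-\alpha/2}^2 \, p_n (1 - p_n)}} \]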
Remark: In both cases we assume that the sampling is without replacement. However, when
the sampling rate is very small (< 1%) the two ways of sampling converge to the same result.
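When the reference population is very large, the optimal sample size becomes (with ε the desired accuracy):
For the mean:
\[ n^* = \frac{z_{1-\alpha/2}^2 \, S_n^2}{\varepsilon^2} \]
For the proportion:
\[ n^* = \frac{z_{1-\alpha/2}^2 \, p_n (1 - p_n)}{\varepsilon^2} \]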
Remark: If there is no information about a previous value of $p_n$, we consider the extreme case:
$p_n (1 - p_n) = \frac{1}{4}$.
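The sample size formulas above can be sketched as a small R function. The function name `sample_size` and its arguments are assumptions made for this illustration. For a proportion, `s2` stands for p(1 − p), and `N = Inf` corresponds to the very large population case.

```r
# Optimal sample size for a desired accuracy eps at confidence level 1 - alpha.
# s2 is the variance S_n^2 (or p * (1 - p) for a proportion).
sample_size <- function(eps, s2, N = Inf, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  if (is.infinite(N)) {
    z^2 * s2 / eps^2                      # very large population
  } else {
    N / (1 + N * eps^2 / (z^2 * s2))      # finite population, without replacement
  }
}

# Extreme case p(1 - p) = 1/4 with an accuracy of 3 points
n_large  <- sample_size(eps = 0.03, s2 = 0.25)
n_finite <- sample_size(eps = 0.03, s2 = 0.25, N = 10000)
ceiling(c(n_large, n_finite))             # round up to whole individuals
```

The finite population size is always smaller than the large population one, and the two get closer as N grows.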
Exercise 3
(in class)
Systematic sampling I
In systematic sampling, every K-th individual of the population list is chosen to be part of the
sample. For example, we choose every 5th person to be in the sample. Members are selected at
regular intervals after a random starting point, so every member of the population has an
equal chance of being selected.
Systematic sampling II
Steps to follow:
Number all units of the population from 1 to N.
Determine the sampling interval K by dividing the population size N by the
sample size n: $K = N/n$.
Randomly select a number d between 1 and K. This number is the origin: it is the
first unit included in the sample.
Select every K-th unit after this first one. The sample obtained is formed by the units
of order: d, d + K, d + 2K, d + 3K, ..., d + (n − 1)K.
Advantages
The sample is easy to select: just draw one random number (the origin) and the rest of the
sample is determined automatically.
Disadvantage
The sample may be unrepresentative if the population list contains a cycle whose period
matches the sampling interval.
Example
A telephone directory contains the contacts of all the people in your area (this list is the frame).
Since there are 25,013 names in the directory and your sample will contain 1,250 individuals, you
need to choose one name from each block of 20 individuals (25,013 / 1,250 ≈ 20). If by chance we
start at the 7th position in the list, our second unit will be the 27th, our third unit the 47th, and so on.
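The systematic sampling steps can be sketched in R with the numbers of the directory example. Rounding the interval down with `floor()` is an assumption of this sketch, since 25,013 / 1,250 is not an integer.

```r
set.seed(7)                        # only to make the sketch reproducible
N <- 25013                         # names in the directory
n <- 1250                          # desired sample size

K <- floor(N / n)                  # sampling interval (assumption: rounded down)
d <- sample(seq_len(K), 1)         # random origin between 1 and K
positions <- d + K * (0:(n - 1))   # d, d + K, d + 2K, ..., d + (n - 1)K

head(positions)
```

Only the origin d is random. All other positions follow from it.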
In sampling with probability proportional to size (PPS), the probability that a unit is selected
depends on its size (or weight) in the population.
The larger a unit, the greater its chance of being included in the sample.
Advantage: Increased efficiency.
Disadvantage: Requires additional information on the units to be surveyed.
Example
In auditing, the accountant should devote more effort to checking invoices with high
amounts than invoices with low amounts.
Based on this selection method, an invoice of 10,000 dinars is ten times as likely
to be chosen as an invoice of 1,000 dinars.
Consider the list of 10 invoices presented in the following table. Let us write an R
program that generates a PPS sample of 4 invoices to check.
Invoice number    Amount (dinars)
 1                    12 930
 2                 1 245 190
 3                   379 150
 4                 3 843 030
 5                   289 460
 6                 1 210 260
 7                 7 245 010
 8                    45 890
 9                   125 120
10                 8 049 290
Solution with R
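# Invoice amounts (the size measure used for PPS sampling)
soldes=c(12930,1245190,379150,3843030,289460,
1210260,7245010,45890,125120,8049290)
total=sum(soldes)
# Selection probability of each invoice, proportional to its amount
Pr=soldes/total
l=1:10
# Draw 4 distinct invoice numbers with these probabilities
sample(x=l,size=4,replace=FALSE,prob=Pr)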
In stratified random sampling, the population is subdivided into strata (relatively
homogeneous groups) which are mutually exclusive.
In proportionate stratified sampling, the same proportion of individuals is drawn from each
stratum: the sampling rate is the same in all strata.
Suppose that 60 % of the students of an institute are enrolled in management science and 40%
in economics; to form a sample of 120 students following these strata, one should randomly
choose 60% × 120 = 72 students in management and 40% × 120 = 48 in economics.
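The proportionate allocation of this example can be reproduced in a few lines of R. The vector names are assumptions made for the sketch.

```r
# Share of each stratum in the population (from the example)
shares <- c(management = 0.60, economics = 0.40)
n <- 120                        # total sample size

# Proportionate allocation: same sampling rate in every stratum
n_h <- round(n * shares)
n_h
```

With other shares, rounding may make the allocations sum to slightly more or less than n, so the total should be checked.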
The only difference between proportionate and disproportionate stratified random sampling is
their sampling fractions. With disproportionate sampling, the different strata have different
sampling fractions.
Accuracy estimation
Notation:
$N_h$: size of stratum h.
H: number of strata.
$\bar{X}_h$: the mean of stratum h (h = 1, ..., H).
$n_h$: size of the sample drawn from stratum h.
$f_h = n_h / N_h$: sampling fraction inside stratum h. This fraction is constant for
proportionate sampling.
$S_h^2$: the variance of stratum h.
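Estimation:
\[ \hat{X}_{st} = \sum_{h=1}^{H} \frac{N_h}{N} \, \bar{X}_h \]
Accuracy:
\[ V(\hat{X}_{st}) = \sum_{h=1}^{H} \left(\frac{N_h}{N}\right)^2 (1 - f_h) \, \frac{S_h^2}{n_h}
\quad \text{with} \quad S_h^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} (x_i - \bar{X}_h)^2 \]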
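The stratified estimator and its variance can be sketched in R as follows. The three strata, their sizes and the sampled values are hypothetical and serve only to illustrate the computation. The estimate is the sum of (N_h / N) times the stratum means, and the variance is the sum of (N_h / N)² (1 − f_h) S_h² / n_h.

```r
# Hypothetical strata: sizes N_h and the observations sampled in each stratum
N_h <- c(500, 300, 200)
samples <- list(c(10, 12, 11, 13), c(20, 22, 19), c(30, 33, 31))

N <- sum(N_h)
n_h <- lengths(samples)            # sample size per stratum
f_h <- n_h / N_h                   # sampling fraction per stratum
x_bar_h <- sapply(samples, mean)   # stratum sample means
s2_h <- sapply(samples, var)       # stratum sample variances (divisor n_h - 1)

x_st <- sum(N_h / N * x_bar_h)                        # stratified estimate
v_st <- sum((N_h / N)^2 * (1 - f_h) * s2_h / n_h)     # its estimated variance
eps <- qnorm(0.975) * sqrt(v_st)                      # accuracy at the 95% level
c(estimate = x_st, lower = x_st - eps, upper = x_st + eps)
```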
L. El Moubarki Sampling October 6, 2023 39 / 62
Probability sampling techniques: Stratified random sampling
Advantages
Representative sample (members of each subgroup are included).
Estimators of population parameters (e.g., the mean) are generally more precise (lower variance than with SRS).
Disadvantage
The right criteria must be found, according to which the population will be divided into
several strata.
Additional information about the population is needed: if we stratify according to
gender, then in addition to the contact list we also need the gender of each member of the
population, which is not necessarily available.
Exercise 3 I
We have 1060 companies. We are interested in the average monthly turnover of these
companies (expressed in thousand dinars). The population is divided into 5 strata
by size range, according to the number of executives. A simple random sample is drawn
in each stratum h, with a budget that allows 300 companies to be surveyed in total.
We measure the mean $\bar{y}_h$ and the dispersion $S_h^2$ of the variable "Turnover" in the
sample of companies drawn. The information describing the sample is summarized in the following table:
Exercise 3 II
Calculate an estimate of the average $\bar{Y}$. Calculate the precision of this estimate and its
confidence interval.
Solution
Cluster sampling
Separate the population into subgroups called clusters. A number of clusters are
selected at random.
All elements of the selected clusters are part of the sample: it is as if we were doing a census
inside each selected cluster.
Advantage
This method is useful to reduce transport costs when the population is spread over a very large
geographical area.
Disadvantages
The sample may not be representative of the population. The loss of accuracy of the
estimators is due to the correlation between the responses of the elements of the same cluster.
Example
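Accuracy
\[ \mathrm{Var}(\hat{X}_{cl}) \approx \frac{M - m}{M m} \, \frac{1}{m - 1} \sum_{h=1}^{m} \left( \frac{N_h}{\bar{N}} \bar{X}_h - \hat{X}_{cl} \right)^2 \]
where M is the number of clusters in the population, m the number of sampled clusters, $N_h$ the
size of cluster h, $\bar{X}_h$ its mean and $\bar{N}$ the average cluster size.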
Exercise 4 I
A statistician wants to conduct a survey on the quality of care provided in hospital cardiology
departments. To do this, he randomly draws 7 hospitals out of the 50 hospitals listed, then, in
each of the hospitals drawn, he receives the opinion of all patients in the cardiology
department.
Hospital    Total number of beds (or patients)       Number of dissatisfied
            in each cardiology department (N_h)      patients by hospital
h=1         60                                       6
h=2         45                                       4
h=3         50                                       6
h=4         55                                       5
h=5         55                                       5
h=6         40                                       5
h=7         50                                       7
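A hedged sketch of how the cluster sampling variance, (M − m)/(Mm) × 1/(m − 1) × the sum of squared deviations of the rescaled cluster means, could be applied to this table in R. It assumes that the quantity of interest is the proportion of dissatisfied patients and that the average cluster size is estimated from the 7 sampled hospitals, since the total number of patients in the 50 hospitals is not given. Both points are assumptions of this sketch, not statements from the exercise.

```r
M <- 50                                      # hospitals listed
N_h <- c(60, 45, 50, 55, 55, 40, 50)         # patients per sampled hospital
dissatisfied <- c(6, 4, 6, 5, 5, 5, 7)       # dissatisfied patients per hospital
m <- length(N_h)                             # hospitals drawn

x_bar_h <- dissatisfied / N_h                # proportion dissatisfied per hospital
N_bar <- mean(N_h)                           # assumption: average size estimated from the sample

t_h <- N_h / N_bar * x_bar_h                 # rescaled cluster means
x_cl <- mean(t_h)                            # estimated proportion of dissatisfied patients
v_cl <- (M - m) / (M * m) * sum((t_h - x_cl)^2) / (m - 1)   # estimated variance

eps <- qnorm(0.975) * sqrt(v_cl)             # accuracy at the 95% level
c(estimate = x_cl, lower = x_cl - eps, upper = x_cl + eps)
```

With the average size estimated this way, the estimate reduces to the total number of dissatisfied patients divided by the total number of patients in the sampled hospitals.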
Exercise 4 II
Multistage sampling
Similar to cluster sampling, except that in this case a sample is taken inside each selected cluster.
There are at least two stages. The first one identifies large clusters (primary units). In the
second stage, within each selected cluster, the units (secondary units) that will be part of the
sample are selected.
We can have more than two stages. Suppose that we need to study the quality of health care
services. It would be costly in time and money to contact a sample of patients in all Tunisian
health facilities. To make the survey feasible we can follow a multistage sampling design with 4
levels.
1 Level 1: Governorate
2 Level 2: City
3 Level 3: Health facility
4 Level 4: Patient
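To keep the idea concrete, here is a sketch with only two stages (health facility, then patient) rather than the four levels listed above. The frame of 20 facilities with 30 patients each is entirely hypothetical.

```r
set.seed(3)   # only to make the sketch reproducible

# Hypothetical frame: 20 health facilities (primary units), 30 patients each
frame <- data.frame(facility = rep(1:20, each = 30), patient = 1:600)

# Stage 1: draw 5 facilities at random
facilities <- sample(unique(frame$facility), 5)

# Stage 2: draw 10 patients at random inside each selected facility
stage2 <- lapply(facilities, function(fac) {
  rows <- which(frame$facility == fac)
  frame[sample(rows, 10), ]
})
two_stage <- do.call(rbind, stage2)

table(two_stage$facility)   # 10 patients in each of the 5 selected facilities
```

Adding the governorate and city levels would repeat the same pattern: sample units at one level, then sample inside each selected unit.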
Advantage
It costs less and is faster than SRS.
Disadvantage
Less accurate than simple random sampling for the same sample size.
Judgmental sampling
Sample selection is based on judgments about the entire population, i.e., an investigator
selects units that he or she considers to be representative of the population.
Advantage: Reduced cost and time
Disadvantage: Subjectivity of the investigator
Example
A city council chooses to hold its socio-economic survey in only one district of the city,
claiming that this chosen neighborhood resembles the majority of others. Considering the cost
/ benefit ratio, there is no benefit in extending their survey to a second district.
Quota method
The quota method is a sampling method that seeks to make a sample representative
by giving it a structure similar to that of the base population.
The main difference between this type of sampling and the stratified one is that it is the
investigator who decides which units form the sample.
Advantages
Less costly and easier to perform.
Respect the proportions of the population.
Disadvantages
The units are not selected at random, so the sample can be biased.
The distribution of the variable in the population must be known to reproduce it in the
sample.
Example
In a university, 70% of the students are in the first cycle, 20% in the second cycle and 10% in
the third cycle. To compile a sample of 200 students from this university, the investigator
arbitrarily selects 140 first-cycle, 40 second-cycle and 20 third-cycle students.
Route sampling
For each randomly chosen sampling point (e.g., urban unit, small city, or voting district),
interviewers are assigned a starting location and given instructions on the random
walking rules, e.g., in which direction to start, on which side of the street to walk and which
crossroads to take. Households are selected by the interviewers following these instructions.
Example :
A sample of 80 supermarkets is considered. We try to estimate the average turnover $\bar{Y}$.
In the random sample we have: $\bar{y}$ = 110.2 dinars.
It is known that the average number µ of cash registers per supermarket is 28.
In the sample we have: $\bar{x}$ = 28.8.
The ratio (quotient) estimate of the mean is:
\[ \hat{Y}_q = \bar{y} \, \frac{\mu}{\bar{x}} = 110.2 \times \frac{28}{28.8} \approx 107.14 \]
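The adjustment of this example, the sample mean multiplied by the ratio of the known population mean of the auxiliary variable to its sample mean, can be written as a one-line R function. The function name `ratio_estimate` and its argument names are assumptions made for this sketch.

```r
# Ratio (quotient) adjustment of a sample mean using an auxiliary variable
# y_bar: sample mean of the variable of interest
# x_bar: sample mean of the auxiliary variable
# mu_x : known population mean of the auxiliary variable
ratio_estimate <- function(y_bar, x_bar, mu_x) {
  y_bar * mu_x / x_bar
}

# Supermarket example: turnover adjusted by the number of cash registers
y_q <- ratio_estimate(y_bar = 110.2, x_bar = 28.8, mu_x = 28)
y_q
```

Because the sample has slightly more cash registers than the population average, the adjusted turnover is lower than the raw sample mean.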
Exercise
We consider 120 farms. An attempt is made to estimate the average production of olive oil
per farm. The average number of olive trees per farm is 130. 45 farms were randomly selected.
The average production per farm for this sample is 120.5 liters and the average number of
olive trees is 123. What will be the average production after adjustment?
Adjustment by weight
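This method uses the weights $N_h / N$ of the stratified sampling method:
\[ \hat{Y} = \sum_{h=1}^{M} \frac{N_h}{N} \, \bar{y}_h \]
where the sum runs over the M strata, with:
\[ \bar{y}_h = \frac{1}{n_h} \sum_{j=1}^{n_h} y_j \]
Then the weight of each observation $y_j$ in stratum h is: $\frac{N_h}{N \, n_h}$.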
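A short R sketch of the adjustment by weight, on hypothetical data with two strata. Each observation of stratum h receives the weight N_h / (N n_h), and the weighted sum of the observations equals the sum of the stratum means weighted by N_h / N.

```r
# Hypothetical data: population sizes of two strata and a sample of 5 observations
N_h <- c(600, 400)
y <- c(10, 12, 14, 20, 24)          # observed values
stratum <- c(1, 1, 1, 2, 2)         # stratum of each observation

N <- sum(N_h)
n_h <- as.vector(table(stratum))    # sample size in each stratum

# Weight of each observation: N_h / (N * n_h) for its own stratum
w <- N_h[stratum] / (N * n_h[stratum])

y_hat <- sum(w * y)                 # weighted (adjusted) estimate of the mean
y_hat
```

The weights sum to 1, so over-represented strata are scaled down and under-represented ones are scaled up.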
By deleting:
In order to give the sample the same characteristics as the reference population, we can
randomly select some observations to delete from the sample.
By bootstrapping:
In order to give the sample the same characteristics as the reference population, we can
create a new bootstrap sample from the available sample.
Exercises I
Exercise 2
Imagine that a local clothing company has 2,700 employees. The Personnel Director wants
suggestions from employees on how to improve their workplace. Since it would take too long
to question all employees, the director chooses a systematic sample of 300 employees.
a) What would be the sampling interval?
b) If number 8 was your first random number, what would be the first 5 numbers in your
sample?
c) How many different samples are possible according to this technique?
Answers
Question 1 :
a) SRS sample
b) systematic
c) Quota method
d) Stratified sampling
Question 2 :
a) The sampling interval is 9 (2700 / 300).
b) The first five selected numbers are: 8, 17, 26, 35, 44.
c) 9 different samples (one for each possible starting number from 1 to 9).