
Slidesc53 3

The document discusses ratio and regression estimation in survey sampling, emphasizing the use of auxiliary variables to enhance the precision of estimators for population means and totals. It provides examples, formulas, and R code for implementing these estimations, particularly in the context of educational data analysis. Additionally, it touches on cluster sampling and stratified sampling methods for estimating population parameters.

Uploaded by

Ki Yan Shih
Copyright
© © All Rights Reserved

STAC53

Ratio and Regression Estimation


References: Sampling Design and Analysis, S.L. Lohr (Chap 4)

1
Ratio and Regression Estimation
• Sometimes in survey sampling, information on one or more covariates x
that are related to the variable of interest y is available prior to
sampling.
• For example, x = acreage of agricultural fields and y = yield in
bushels of grain.
• These covariates are often called auxiliary variables or subsidiary
variables.
• Ratio and regression estimation use auxiliary variables that are
correlated with the variable of interest to improve the precision of
estimators of the mean and total of a population.

2
Ratio estimator under SRSWOR
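
The formulas on this slide did not survive extraction. A reconstruction in the standard notation of the cited reference (Lohr, Chap 4), which also matches the R code later in these slides (`Bhat <- ybar/xbar`, `thatyr <- Bhat*tx`), would presumably read as follows. Here $t_x$, $t_y$ are population totals and $\bar{x}_U$, $\bar{y}_U$ are population means; treat this as a sketch rather than the slide's exact wording.

```latex
\begin{align*}
B &= \frac{t_y}{t_x} = \frac{\bar{y}_U}{\bar{x}_U}
  && \text{(population ratio)} \\
\hat{B} &= \frac{\bar{y}}{\bar{x}}
  && \text{(sample ratio from the SRSWOR)} \\
\hat{\bar{y}}_r &= \hat{B}\,\bar{x}_U
  && \text{(ratio estimator of the population mean)} \\
\hat{t}_{yr} &= \hat{B}\,t_x
  && \text{(ratio estimator of the population total)}
\end{align*}
```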

3
Ratio estimation under SRSWOR

• Since ratio estimators of the mean and total require the population
mean or total of the auxiliary variable, ratio estimation is
commonly used when the auxiliary variable is easily measured on the
whole population while the response variable is harder to measure
and is obtained from only an SRSWOR of the population.

4
• Example: An investigator wanted to estimate the total sugar content of a
truckload of oranges. A SRSWOR of ten oranges was selected.
• The weights and the sugar contents of those ten oranges are given below:

• The total weight of all the oranges, obtained by first weighing the truck loaded
and then unloaded, was found to be 1500 pounds.
• Estimate the total sugar content of all the oranges in this truckload.
5
• Solution: Let’s recall the formulas and the summary statistics:

• Note that the number of oranges in the population (i.e. $N$) is not
given here, but we can use the appropriate formula to estimate $t_y$.
• $\hat{B} = \frac{\bar{y}}{\bar{x}} = \frac{0.02280}{0.4210} = 0.0541567696$
• $\hat{t}_{yr} = \hat{B}\,t_x = 0.0541567696 \times 1500 = 81.24$ pounds
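
The arithmetic of this solution can be reproduced from the quoted summary statistics alone. The sketch below assumes that 0.02280 and 0.4210 are the sample means of sugar content and weight (in pounds per orange); the data table itself is not available here, so nothing else is computed.

```r
# Summary statistics quoted in the example
# (assumed to be per-orange sample means, in pounds)
ybar <- 0.02280   # sample mean sugar content
xbar <- 0.4210    # sample mean weight
tx   <- 1500      # total weight of the truckload (known)

Bhat   <- ybar / xbar   # estimated ratio of sugar content to weight
thatyr <- Bhat * tx     # ratio estimate of the total sugar content

Bhat
round(thatyr, 2)        # 81.24
```

Note that N is never used: the ratio estimate of the total needs only the known total of the auxiliary variable.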
6
Ratio estimation under SRSWOR
• Note:

where 𝜌 is the population correlation between 𝑥 and 𝑦.


• This implies that the ratio estimator ($\hat{B}$) is biased, but the bias
becomes smaller as the sample size $n$ increases.
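
The formula that the note refers to is missing from this copy. For orientation, the standard large-sample bias approximation for the ratio estimator of the mean (as given in the cited reference, written here with $\rho$ for the population correlation) is sketched below; it is an assumption that the slide showed this form.

```latex
\begin{align*}
E\!\left[\hat{\bar{y}}_r\right] - \bar{y}_U
  &= -\operatorname{Cov}\!\left(\hat{B}, \bar{x}\right) \\
  &\approx \left(1 - \frac{n}{N}\right)\frac{1}{n\,\bar{x}_U}
     \left(B S_x^2 - \rho\, S_x S_y\right)
\end{align*}
```

The factor $1/n$ is what makes the bias shrink as the sample size grows; the bias of $\hat{B}$ itself is this quantity divided by $\bar{x}_U$.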

7
Ratio estimation under SRSWOR
• Now we know how to estimate $B$, $\bar{y}_U$ and $t_y$ using ratio estimators.

What are the variances of those estimators,
i.e. $V(\hat{B})$, $V(\hat{\bar{y}}_r)$ and $V(\hat{t}_{yr})$?

8
Ratio estimation under SRSWOR
• Result: Under simple random sampling without replacement (for
large sample size $n$),
• $\hat{\bar{y}}_r$ is approximately unbiased for $\bar{y}_U$.
• The variances of $\hat{B}$, $\hat{\bar{y}}_r$ and $\hat{t}_{yr}$ are given (approximately) by
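
The variance formulas are missing from this copy. The usual large-sample approximations, which are consistent with the standard errors computed in the R code later in these slides, are sketched below. The symbol $S_e^2$ (the population variance of the residuals $y_i - B x_i$) is introduced here for illustration.

```latex
\begin{align*}
S_e^2 &= \frac{1}{N-1}\sum_{i=1}^{N}\left(y_i - B x_i\right)^2 \\
V(\hat{B}) &\approx \left(1 - \frac{n}{N}\right)\frac{S_e^2}{n\,\bar{x}_U^2} \\
V(\hat{\bar{y}}_r) &\approx \left(1 - \frac{n}{N}\right)\frac{S_e^2}{n} \\
V(\hat{t}_{yr}) &\approx N^2\left(1 - \frac{n}{N}\right)\frac{S_e^2}{n}
\end{align*}
```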

9
• Result: Estimates of these variances are given by

• This can also be expressed as

• where $f = \frac{n}{N}$, $CV(x) = \frac{s_x}{\bar{x}}$, $CV(y) = \frac{s_y}{\bar{y}}$ and $\hat{\rho}$ is the sample correlation
between x and y
10
• Result: Estimates of these variances are given by

• For sufficiently large samples, approximate confidence intervals can be
constructed using the standard errors (i.e. the square roots of these
variances) as $\hat{B} \pm z\,SE(\hat{B})$, $\hat{\bar{y}}_r \pm z\,SE(\hat{\bar{y}}_r)$ and $\hat{t}_{yr} \pm z\,SE(\hat{t}_{yr})$
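
As a rough illustration of these results, the sketch below wraps the estimates, standard errors and normal-theory confidence intervals in one function. The helper name `ratio_srs` and the synthetic population are assumptions made for this example; the standard error formulas are the ones used in the R code later in these slides, and the CV form is included only to check that the two expressions agree.

```r
# Sketch of ratio estimation under SRSWOR; ratio_srs() is a hypothetical helper.
ratio_srs <- function(x, y, N, tx, level = 0.95) {
  n    <- length(y)
  xbar <- mean(x)
  ybar <- mean(y)
  Bhat <- ybar / xbar
  e    <- y - Bhat * x                   # residuals about the fitted ratio line
  fpc  <- 1 - n / N                      # finite population correction
  se_B <- sqrt(fpc / n) * sd(e) / xbar   # SE of Bhat
  se_t <- N * sqrt(fpc / n) * sd(e)      # SE of the ratio estimate of the total

  # Equivalent expression for SE(Bhat) in terms of CVs and the sample correlation
  cvx <- sd(x) / xbar
  cvy <- sd(y) / ybar
  r   <- cor(x, y)
  se_B_cv <- sqrt(fpc / n * Bhat^2 * (cvx^2 + cvy^2 - 2 * r * cvx * cvy))

  z <- qnorm((1 + level) / 2)
  list(Bhat = Bhat, se_B = se_B, se_B_cv = se_B_cv,
       ci_B = Bhat + c(-1, 1) * z * se_B,
       total = Bhat * tx, se_total = se_t,
       ci_total = Bhat * tx + c(-1, 1) * z * se_t)
}

# Synthetic population (an assumption, not data from the slides)
set.seed(1)
N    <- 1000
xpop <- runif(N, 100, 1000)
ypop <- 0.8 * xpop + rnorm(N, sd = 30)
s    <- sample(N, 50)                    # SRSWOR of n = 50 units
res  <- ratio_srs(xpop[s], ypop[s], N = N, tx = sum(xpop))
res$Bhat
res$ci_total
```

Dividing `ci_total` by `N` gives the corresponding interval for the population mean.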
11
• When is the ratio estimator $\hat{\bar{y}}_r$ better than the simple sample mean $\bar{y}$?
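
The answer on this slide is missing from this copy. A short derivation under the large-sample variance approximation, assuming $B > 0$ and using $\rho$ for the population correlation, goes roughly as follows.

```latex
\begin{align*}
V(\hat{\bar{y}}_r) &\approx \left(1 - \frac{n}{N}\right)\frac{1}{n}
   \left(S_y^2 - 2B\rho\,S_x S_y + B^2 S_x^2\right),
&
V(\bar{y}) &= \left(1 - \frac{n}{N}\right)\frac{S_y^2}{n} \\[4pt]
V(\hat{\bar{y}}_r) \le V(\bar{y})
 &\iff B^2 S_x^2 \le 2B\rho\,S_x S_y
 \iff \rho \ge \frac{B S_x}{2 S_y} = \frac{CV(x)}{2\,CV(y)}
\end{align*}
```

So, to this order of approximation, the ratio estimator is at least as precise as the sample mean when the correlation between x and y is large enough relative to the two coefficients of variation.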


13
Example
The data file apisrs contains a SRSWOR of 200 schools from the API
population. We will use this data set to estimate the population total of
api.stu (the number of students who took the API test), using enroll (the
number of students enrolled in each school) as an auxiliary variable.
Ratio estimation of the population total of y requires the population
total of the auxiliary variable x. In the API population data set
(apipop), some schools have missing values for the variable enroll.
We will remove these schools and consider the remaining schools as
our population.

14
#R code for ratio estimation
# Note this data set has missing values
# This code removes the missing values and treats the
# individuals with no missing values as the population.
apipop <- read.csv("apipop.csv", header=1)
apisrs <- read.csv("apisrs.csv", header=1)
#----------------------------------------------------
pop <- data.frame(enroll = apipop$enroll, api.stu = apipop$api.stu)
head(pop)
enroll api.stu
1 1278 1090
2 1113 840
3 546 472
4 330 272
5 233 216
6 276 247

15
nrow(pop)
[1] 6194
# Now the pop has only these two variables
pop$enroll <- as.numeric(as.character(pop$enroll))
pop$api.stu <- as.numeric(as.character(pop$api.stu))
pop <- na.omit(pop) #This removes the missing values
nrow(pop)
[1] 6157
N <- nrow(pop)
N
[1] 6157

16
xu <- pop$enroll
xbaru <- mean(xu)
xbaru
[1] 619.0469
tx <- sum(xu)
tx
[1] 3811472
plot(apisrs$enroll, apisrs$api.stu)
cor(apisrs$enroll, apisrs$api.stu)
[1] 0.96559

17
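# Sample values of the auxiliary variable (x) and the response (y)
x <- apisrs$enroll
y <- apisrs$api.stu
xbar <- mean(x)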
xbar
[1] 584.61
ybar <- mean(y)
ybar
[1] 482.51
Bhat <- ybar/xbar
Bhat
[1] 0.8253537
e <- y - Bhat*x
sde <- sd(e)
sde
[1] 84.62279
n <- length(y)
n
[1] 200

18
# Estimate of B
sBhat <- (sqrt((1-n/N)/n))*(sde/xbar)
ConfLevel <- 0.95
CI_B <- Bhat + qnorm(c((1-ConfLevel)/2 , (1+ConfLevel)/2 )) * sBhat
Bhat
[1] 0.8253537
sBhat
[1] 0.01006782
CI_B
[1] 0.8056211 0.845086

19
# Ratio estimate of the population total
thatyr <- Bhat*tx
sthatyr <- N*(sqrt(1/n-1/N))*sde
CItotal <- thatyr + qnorm(c((1-ConfLevel)/2 , (1+ConfLevel)/2 ))*sthatyr
thatyr
[1] 3145812
sthatyr
[1] 36238.54
CItotal
[1] 3074786 3216839

# For an approximate CI for the population mean, divide both limits of
# the CI for the total by N.
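
The last comment can be made concrete with the values printed in these slides. The snippet below is only a sketch that re-enters those printed numbers by hand (so it inherits their rounding); the names `ybar_r` and `CImean` are introduced here.

```r
# Values printed earlier in the slides
N       <- 6157                    # population size after removing missing values
thatyr  <- 3145812                 # ratio estimate of the population total
CItotal <- c(3074786, 3216839)     # 95% CI for the population total

ybar_r <- thatyr / N               # ratio estimate of the population mean
CImean <- CItotal / N              # approximate 95% CI for the population mean
ybar_r
CImean
```

Because `thatyr` equals `Bhat*tx`, dividing by `N` gives `Bhat*xbaru`, the ratio estimate of the mean.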

20
Note

We can also use the R survey package to calculate these ratio
estimates, as in the R code below.

21
#Ratio estimation using Survey package when the data set has missing
# values. This code removes the missing values and treats the
# individuals with no missing values as the population.
apipop <- read.csv("apipop.csv", header=1)
apisrs <- read.csv("apisrs.csv", header=1)
pop <- data.frame(enroll = apipop$enroll, api.stu = apipop$api.stu)
nrow(pop)
[1] 6194
# Now the pop has only these two variables
pop$enroll <- as.numeric(as.character(pop$enroll))
pop$api.stu <- as.numeric(as.character(pop$api.stu))
pop <- na.omit(pop) #This removes the missing values
N <- nrow(pop)
N
[1] 6157
xu <- pop$enroll
xbaru <- mean(xu)
xbaru
[1] 619.0469
tx <- sum(xu)
tx
[1] 3811472

22
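library(survey)
# svyratio() below refers to x and y; define them as columns of the
# sample data so that this listing runs on its own
apisrs$x <- apisrs$enroll
apisrs$y <- apisrs$api.stu
srsdesign <- svydesign(id=~1, fpc=rep(nrow(pop), nrow(apisrs)), data=apisrs)
r <- svyratio(~y, ~x, srsdesign)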
r
Ratio estimator: svyratio.survey.design2(~y, ~x, srsdesign)
Ratios=
x
y 0.8253537
SEs=
x
y 0.01006782
confint(r, level = 0.95)
2.5 % 97.5 %
y/x 0.8056211 0.8450862

23
Cluster Sampling Revisited
• We have discussed cluster sampling for the special case in which all
clusters are of the same size (𝑀).
• We selected a SRSWOR of (𝑛) clusters and observed all elements in
the selected clusters.
• We will now look at how to extend that to the more general case
where the cluster sizes are not necessarily equal.
• We recall and extend the notation below:

24
Cluster Sampling Revisited, p178
• We recall and extend the notation below:
• $N$ = the number of clusters in the population
• $n$ = the number of clusters selected in an SRSWOR
• $M_i$ = the number of elements in cluster $i$, $i = 1, \dots, N$
• $M_0 = \sum_{i=1}^{N} M_i$
• $\bar{M} = \frac{1}{N}\sum_{i=1}^{N} M_i = \frac{M_0}{N}$
• $\hat{\bar{M}} = \frac{1}{n}\sum_{i \in S} M_i$
• $t_i = \sum_{j=1}^{M_i} y_{ij}$, where $y_{ij}$ is the measurement on the $j$th element in
the $i$th cluster.

25
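Cluster Sampling Revisited, p178
• With this notation, the population mean $\bar{y}_U$ is given by
• $\bar{y}_U = \frac{\sum_{i=1}^{N} t_i}{\sum_{i=1}^{N} M_i}$
• Here $t_i$ and $M_i$ are usually positively correlated, and this is a ratio like $B$ in
our discussion of ratio estimation, with $M_i$ taking the place of $x_i$ and $t_i$
taking the place of $y_i$. Its estimator is given by
• $\hat{\bar{y}}_r = \frac{\sum_{i \in S} t_i}{\sum_{i \in S} M_i}$
• Its estimated variance is given by
• $\hat{V}(\hat{\bar{y}}_r) = \hat{V}(\hat{B}) = \left(1 - \frac{n}{N}\right)\frac{1}{n\bar{M}^2}\,\frac{\sum_{i \in S}\left(t_i - \hat{\bar{y}}_r M_i\right)^2}{n-1} = \left(1 - \frac{n}{N}\right)\frac{1}{n\bar{M}^2}\,\frac{\sum_{i \in S} M_i^2\left(\bar{y}_i - \hat{\bar{y}}_r\right)^2}{n-1}$,
where $\bar{y}_i = t_i / M_i$ is the mean of cluster $i$.
• If $\bar{M}$ is unknown, we will estimate it by $\hat{\bar{M}} = \frac{1}{n}\sum_{i \in S} M_i$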
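
A small numerical sketch of the cluster ratio estimator and its estimated variance follows. The cluster sizes and totals are made-up values for illustration only, and the population mean cluster size is treated as unknown, so it is replaced by the sample mean of the selected cluster sizes.

```r
# Hypothetical data: n = 4 clusters sampled by SRSWOR from N = 20 clusters
N  <- 20
Mi <- c(10, 12, 8, 15)    # sizes of the sampled clusters
ti <- c(52, 60, 38, 80)   # totals of y in the sampled clusters
n  <- length(Mi)

ybar_r   <- sum(ti) / sum(Mi)   # ratio estimator of the population mean
Mbar_hat <- mean(Mi)            # estimate of the mean cluster size (Mbar unknown)

# Estimated variance, first form: residuals t_i - ybar_r * M_i
v_hat <- (1 - n / N) * sum((ti - ybar_r * Mi)^2) / (n * Mbar_hat^2 * (n - 1))

# Second form: cluster means ybar_i = t_i / M_i
ybar_i <- ti / Mi
v_hat2 <- (1 - n / N) * sum(Mi^2 * (ybar_i - ybar_r)^2) / (n * Mbar_hat^2 * (n - 1))

ybar_r
sqrt(v_hat)                     # standard error of ybar_r
all.equal(v_hat, v_hat2)        # the two forms agree
```

The two variance expressions are algebraically identical because each residual equals the cluster size times the difference between the cluster mean and the ratio estimate.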

26
Ratio Estimation Under stratified sampling
• When the population has been stratified, there are two ways to
construct ratio estimators:
• Combined Ratio Estimators
• Separate Ratio Estimators

27
Combined Ratio Estimators
• Combined ratio estimators are defined as follows:

• where
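
The formulas on this slide are missing from this copy. Judging from the R code later in these slides (`Bhat <- ybarstr/xbarstr`, `thatyrc <- Bhat*tx`), the combined ratio estimators presumably have the form sketched below; the subscript $c$ on $\hat{B}_c$ is added here to distinguish it from the stratum-level ratios.

```latex
\begin{align*}
\hat{B}_c &= \frac{\bar{y}_{str}}{\bar{x}_{str}},
&
\bar{y}_{str} &= \sum_{h} W_h\,\bar{y}_h,
\quad
\bar{x}_{str} = \sum_{h} W_h\,\bar{x}_h,
\quad
W_h = \frac{N_h}{N} \\
\hat{t}_{yrc} &= \hat{B}_c\,t_x,
&
\hat{\bar{y}}_{rc} &= \hat{B}_c\,\bar{x}_U
\end{align*}
```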

28
Separate ratio estimator
• For the separate ratio estimator, ratio estimation is applied to each
stratum, then combined.
• The separate ratio estimators are defined as follows:
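
The formulas on this slide are missing from this copy. Based on the R code later in these slides (`Bhath <- ybarh/xbarh`, `thatyrs <- sum(txh*Bhath)`), the separate ratio estimators presumably take the following form, where $t_{xh}$ is the population total of x in stratum $h$.

```latex
\begin{align*}
\hat{B}_h &= \frac{\bar{y}_h}{\bar{x}_h}
  && \text{(ratio estimated within stratum } h\text{)} \\
\hat{t}_{yrs} &= \sum_{h} \hat{B}_h\,t_{xh}
  && \text{(separate ratio estimator of the total)} \\
\hat{\bar{y}}_{rs} &= \frac{\hat{t}_{yrs}}{N}
  && \text{(separate ratio estimator of the mean)}
\end{align*}
```

Unlike the combined estimator, this version needs the auxiliary total for every stratum, not just the overall total.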

29
Result (Variance of combined ratio estimator)

30
Result (cont.)

31
Result (Variance of separate ratio estimator)

32
Result (cont.) sample estimates of variance
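
The variance estimates themselves are missing from this copy. For the combined ratio estimator, the expression implemented in the R code later in these slides (`vhatybarhatrc`, `vhatBhat`, `sethatyrc`) can be written as below, where $r_h$ is the sample correlation between x and y in stratum $h$ and $\hat{B}$ is the combined ratio. This is a transcription of that code rather than a recovered slide.

```latex
\begin{align*}
\hat{V}(\hat{\bar{y}}_{rc}) &= \sum_{h} W_h^2\left(1 - \frac{n_h}{N_h}\right)\frac{1}{n_h}
   \left(s_{yh}^2 + \hat{B}^2 s_{xh}^2 - 2\hat{B}\,r_h\,s_{xh}s_{yh}\right) \\
\hat{V}(\hat{B}) &= \frac{\hat{V}(\hat{\bar{y}}_{rc})}{\bar{x}_{str}^2},
\qquad
SE(\hat{t}_{yrc}) = t_x\sqrt{\hat{V}(\hat{B})}
\end{align*}
```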

33
Combined Ratio or Separate Ratio estimator?

34
35
apipop <- read.csv("apipop.csv", header=1)
apistrat <- read.csv("apistrat.csv", header=1)
pop <- data.frame(enroll = apipop$enroll, api.stu = apipop$api.stu, stype
= apipop$stype)
head(pop)

36
pop$enroll <- as.numeric(as.character(pop$enroll))
pop$api.stu <- as.numeric(as.character(pop$api.stu))
pop <- na.omit(pop) #This removes the missing values
nrow(pop)
[1] 6157
yu = pop$api.stu
N <- length(yu)
N
[1] 6157
xu <- pop$enroll
xbaru <- mean(xu)
xbaru
[1] 619.0469
tx <- sum(xu)
tx
[1] 3811472

37
library(dplyr)
Nh <- pop %>%
group_by(stype) %>%
summarise(Nh = length(enroll))
Nh

Nh <- as.numeric(Nh$Nh)
Nh
[1] 4397 751 1009
Wh <- Nh/N
Wh
[1] 0.7141465 0.1219750 0.1638785

38
nh <- apistrat %>%
group_by(stype) %>%
summarise(nh = length(enroll))
nh <- as.numeric(nh$nh)
nh
[1] 100 50 50
xbarh <- apistrat %>%
group_by(stype) %>%
summarise(xbarh = mean(enroll))
xbarh <- as.numeric(xbarh$xbarh)
xbarh
[1] 416.78 1320.70 832.48
sxh <- apistrat %>%
group_by(stype) %>%
summarise(sxh = sd(enroll))
sxh <- as.numeric(sxh$sxh)
sxh
[1] 166.0629 671.0737 395.357

39
ybarh <- apistrat %>%
group_by(stype) %>%
summarise(ybarh = mean(api.stu))
ybarh <- as.numeric(ybarh$ybarh)
ybarh
[1] 355.02 1070.52 695.70
syh <- apistrat %>%
group_by(stype) %>%
summarise(syh = sd(api.stu))
syh <- as.numeric(syh$syh)
syh
[1] 139.8607 589.4382 353.322
40
rh <- apistrat %>%
group_by(stype) %>%
summarise(rh = cor(enroll, api.stu))
rh <- as.numeric(rh$rh)
rh
[1] 0.9778512 0.9421533 0.9506217
xbarstr <- sum(Wh*xbarh)
ybarstr <- sum(Wh*ybarh)
Bhat <- ybarstr/xbarstr
Bhat
[1] 0.836956

41
vhatybarhatrc <- sum((Wh^2)*(1-nh/Nh)*(1/nh)*(syh^2+(Bhat*sxh)^2-
2*Bhat*rh*sxh*syh))
vhatBhat <- vhatybarhatrc/(xbarstr^2)
# Note: If xbaru is known we can use it, but the R survey package uses
# xbarstr
seBhat <- sqrt(vhatBhat)
seBhat
[1] 0.007754385

thatyrc <- Bhat*tx


thatyrc
[1] 3190038
sethatyrc <- seBhat*tx
sethatyrc
[1] 29555.62

42
Bhath <- ybarh/xbarh
Bhath
[1] 0.8518163 0.8105702 0.8356958
txh <- pop %>%
group_by(stype) %>%
summarise(txh = sum(enroll))
txh <- as.numeric(txh$txh)
txh
[1] 1877350 1013824 920298
thatyrs <- sum(txh*Bhath)
thatyrs
[1] 3190022
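
The combined and separate point estimates can be cross-checked from the stratum summaries printed in these slides, without the data files. The sketch below re-enters those printed values by hand, so its results agree with the printed totals only up to rounding; the names `Bhat_c` and `Bhat_h` are introduced here.

```r
# Stratum summaries as printed in the slides (same stratum order)
Nh    <- c(4397, 751, 1009)              # stratum population sizes
xbarh <- c(416.78, 1320.70, 832.48)      # stratum sample means of enroll
ybarh <- c(355.02, 1070.52, 695.70)      # stratum sample means of api.stu
txh   <- c(1877350, 1013824, 920298)     # stratum population totals of enroll
tx    <- sum(txh)                        # overall population total of enroll

Wh <- Nh / sum(Nh)                       # stratum weights

# Combined: one ratio from the stratified means, applied to the overall total
Bhat_c  <- sum(Wh * ybarh) / sum(Wh * xbarh)
thatyrc <- Bhat_c * tx

# Separate: one ratio per stratum, applied to each stratum total
Bhat_h  <- ybarh / xbarh
thatyrs <- sum(Bhat_h * txh)

c(combined = thatyrc, separate = thatyrs)
```

Here the three stratum ratios are close to each other, which is why the two estimates of the total are nearly identical.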

43
# Now using survey package
library (survey)
pop <- pop[order(pop$stype) , ]
apistrat <- apistrat[order(apistrat$stype) , ]
# You need to sort data by stratification variable in order for the fpc
# in the command below to match with the data
strdesign <- svydesign(id=~1, strata = ~stype, fpc=rep(Nh, nh),
data=apistrat)
#The command below estimates the combined ratio estimates
ComRatio <- svyratio(~api.stu, ~enroll, strdesign)
ComRatio
Ratio estimator: svyratio.survey.design2(~api.stu, ~enroll, strdesign)
Ratios=
enroll
api.stu 0.8369569
SEs=
enroll
api.stu 0.007754385

44
confint(ComRatio, level = 0.95)
2.5 % 97.5 %
api.stu/enroll 0.8217586 0.8521553
predict(ComRatio, total = tx) # This estimates the population total
$total
enroll
api.stu 3190038
$se
enroll
api.stu 29555.62
# For an approximate CI for the population mean, divide both
# limits of the CI for the total by N.

45
#Separate Ratio estimators
SepRatio <- svyratio(~api.stu, ~enroll, strdesign, separate=TRUE)
SepRatio

46
txh <- pop %>%
group_by(stype) %>%
summarise(txh = sum(enroll))
txh <- as.numeric(txh$txh)
txh
[1] 1877350 1013824 920298
stratum.totals <- txh
predict(SepRatio, total = stratum.totals) # This estimates the population total
$total
enroll
api.stu 3190022
$se
enroll
api.stu 29751.16

47
