Ca09 Pitblado Handout
Jeff Pitblado
Associate Director, Statistical Software
StataCorp LP
2009 Canadian Stata Users Group Meeting
Outline
1 Types of data
3 Variance estimation
3.1 Linearization
3.1.1 Total estimator
3.1.2 Regression models
3.2 Balanced repeated replication (BRR)
3.3 Jackknife
5 Summary
Why survey data?
Using stages of clustered sampling can help cut down on the expense and time.
1 Types of data
Simple random sample (SRS) data
Observations are "independently" sampled from a data generating process.
Sample variability is explained by the statistical model attributed to the data generating process.
Standard data
We'll use this term to distinguish these data from survey data.
Correlated data
Individuals are not assumed to be independent.
Cause:
Cluster sampling
Treatment:
Time-series models
cluster() option
Survey data
Individuals are sampled from a fixed population according to a survey design.
Distinguishing characteristics:
Standard data       Survey data
proportion          svy: proportion
regress             svy: regress
2 Survey data characteristics
2.1 Single stage designs
Single-stage syntax
svyset [psu] [weight] [, strata(varname) fpc(varname)]
Strata
Sampling unit
An individual or collection of individuals from the population that can be selected for observation.
Example
Sampling weight
The reciprocal of the probability for an individual to be sampled.
The sampling probability is determined by the design's sampling units and strata.
Typically considered to be the number of individuals in the population that a sampled individual represents.
Example
If there are 100 hospitals in our population, and we choose 5 of them, the sampling weight is
20 = 100/5. Thus a sampled hospital represents 20 hospitals in the population.
Sampling weights correct for over- or undersampling of sections of the population. This over- or undersampling is often deliberate.
Strata
In stratified designs, the population is partitioned into well-defined groups, called strata.
Example
States of the union are typically used as strata in national surveys in the US.
Although there is potential for improving efficiency by reducing sampling variability, it is usually not very practical to stratify on demographic information.
Finite population correction (FPC)
An adjustment applied to the variance due to sampling without replacement.
Note
The FPC affects the number of components in the linearized variance estimator for multistage designs.
Example: svyset for single-stage designs
2. nmihs: the National Maternal and Infant Health Survey (1988) dataset, which came from a stratified design
3. fpc: a simulated dataset with variables that identify the characteristics from a stratified and without-replacement clustered design
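As a hedged sketch, the two datasets listed above might be declared as follows. The variable names (finwgt, stratan, psuid, weight, stratid, Nh) are assumptions about the example datasets available through webuse and are not shown in this handout; check them with describe before relying on them.

```stata
* Stratified design with sampling weights (variable names assumed)
webuse nmihs, clear
svyset [pweight=finwgt], strata(stratan)

* Stratified, without-replacement clustered design (variable names assumed)
webuse fpc, clear
svyset psuid [pweight=weight], strata(stratid) fpc(Nh)
```

In the second call, psuid identifies the sampling units and fpc(Nh) supplies the information needed for the finite population correction.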
Below is a visual representation of a hypothetical population. Suppose each blue dot represents
an individual.
[Figure: population of 1000 individuals]
The following shows a 20% simple random sample. The solid symbols identify sampled individuals.
Here we partition the population into small blocks, then sample 20% of the blocks. Not all
blocks contain the same number of individuals, so the sample size is a random quantity.
Here we partition the population into four big regions, then perform a 20% sample within each
region. The sample size is not exactly 20% of the population size due to unbalanced regions and
rounding.
Here we re-establish the smaller blocks within the four regions, then sample 20% of the blocks
within each region.
FPC is required at stage s for stage s + 1 to play a role in the linearized variance estimator.
Note
svyset will note that it is disregarding subsequent stages when an FPC is not specified for a
given stage.
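A minimal sketch of a multistage declaration, using the variable names that appear in the svyset output for the seniors data later in this handout (state, county, school, gender, sampwgt, ncounties, nschools, nseniors). Stages are separated by ||, and each stage carries its own fpc() so that the following stage can contribute to the linearized variance. The /// line continuations assume the commands are run from a do-file.

```stata
webuse seniors, clear
* Stage 1: counties within states
* Stage 2: schools within sampled counties
* Stage 3: individual seniors, stratified by gender
svyset county [pweight=sampwgt], strata(state) fpc(ncounties) ///
    || school, fpc(nschools)                                  ///
    || _n, strata(gender) fpc(nseniors)
```

Dropping fpc(ncounties) from the first stage would leave the later stages declared but unused for variance estimation.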
2.3 Poststratification
Poststratification
A method for adjusting sampling weights, usually to account for underrepresented groups in the
population.
Syntax
Note
Recall that I said it is usually not very practical to stratify on demographic information such as
age group, gender, and ethnicity. However, we can usually poststratify on these variables using the
frequency distribution information available from census data.
*** Cat and dog data from Levy and Lemeshow (1999)
. webuse poststrata
. bysort type: sum totexp
Here are the mean estimates with poststratification:
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      totexp |   40.11513   1.163498      37.77699    42.45327
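A minimal sketch of how poststratification is declared with svyset. The poststrata() option names the variable identifying the poststrata, and postweight() names the variable holding the known population size of each poststratum. The variable name postwgt is an assumption about the poststrata dataset; type, weight, and totexp appear elsewhere in this handout.

```stata
webuse poststrata, clear
* type identifies the poststrata (cats and dogs);
* postwgt (assumed name) holds the population count for each type
svyset _n [pweight=weight], poststrata(type) postweight(postwgt)
svy: mean totexp
```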
. svyset _n [pw=weight]
      pweight: weight
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>
. svy: mean totexp
(running mean on estimation sample)
Survey: Mean estimation
Number of strata =       1          Number of obs    =        50
Number of PSUs   =      50          Population size  =      1300
                                    Design df        =        49
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
2.4 Strata with a single sampling unit
How do we get stuck with strata that have only one sampling unit?
Missing data can cause entire sampling units to be dropped from the analysis, possibly leaving a single sampling unit in the estimation sample.
Certainty units
Bad design
Example: svydes
The NHANES2 data has 31 strata, each containing 2 PSUs.
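The structure described above can be inspected along these lines. This is a sketch that assumes the nhanes2 example dataset uses the variable names psu, strata, and finalwgt.

```stata
webuse nhanes2, clear
* Declare the design (variable names assumed)
svyset psu [pweight=finalwgt], strata(strata)
* Tabulate the number of PSUs and observations per stratum
svydescribe
```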
Some variables in this dataset have enough missing values to cause the lonely PSU problem.
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    hdresult |   49.67141          .             .           .
Use if e(sample) after estimation commands to restrict svydes's focus to the estimation sample. The single option will further restrict output to strata with one sampling unit.
Specifying variable names with svydes will result in more information about missing values.
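The three uses just described can be sketched as follows, again assuming the nhanes2 variable names psu, strata, and finalwgt.

```stata
webuse nhanes2, clear
svyset psu [pweight=finalwgt], strata(strata)

* hdresult has missing values, so the standard error is reported as missing
svy: mean hdresult

* Describe only the estimation sample, listing strata with a single PSU
svydescribe if e(sample), single

* Naming the variable adds information about its missing values
svydescribe hdresult, single
```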
Handling lonely sampling units
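One way to handle lonely sampling units is the singleunit() option of svyset. The sketch below assumes the nhanes2 variable names used elsewhere; which method is appropriate depends on why the stratum ended up with a single unit, so treat the choice of certainty here as illustrative only.

```stata
webuse nhanes2, clear
* singleunit(missing) is the default and yields missing standard errors.
* Other methods: certainty, scaled, centered.
svyset psu [pweight=finalwgt], strata(strata) singleunit(certainty)
svy: mean hdresult
```

With certainty, single-unit strata are treated as certainty units and contribute nothing to the standard error; scaled and centered instead borrow information from the other strata.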
3 Variance estimation
Stata has three variance estimation methods for survey data:
Linearization
Balanced repeated replication
The jackknife
Note
Linearization
Replication methods
Motivation
Linearization can have poor performance in datasets with a small number of sampling units.
Due to privacy concerns, data providers are reluctant to release strata and sampling
unit information in public-use data. Thus some datasets now come packaged with
weight variables for use with replication methods.
Concept
Think of a replicate as a copy of the point estimates.
The idea is to resample the data, computing replicates from each resample, then
using the replicates to estimate the variance.
3.1 Linearization
Linearization
A method for deriving a variance estimator using a first order Taylor approximation of the point
estimator of interest.
Syntax
svyset ... vce(linearized)
Delta method
Huber/White/robust/sandwich estimator
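Strata: $h = 1, \ldots, L$
PSU: $i = 1, \ldots, n_h$
SSU: $j = 1, \ldots, m_{hi}$
Individual: $k = 1, \ldots, m_{hij}$
$$\hat{Y} = \sum_{h,i,j,k} w_{hijk}\, y_{hijk}$$
$$\hat{V}(\hat{Y}) = \sum_h (1 - f_h) \frac{n_h}{n_h - 1} \sum_i (y_{hi} - \bar{y}_h)^2 + \sum_h f_h \sum_i (1 - f_{hi}) \frac{m_{hi}}{m_{hi} - 1} \sum_j (y_{hij} - \bar{y}_{hi})^2$$
$$df = N_{\mathrm{PSU}} - N_{\mathrm{strata}}$$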
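The quantities inside the sums can be read as follows, assuming the usual definitions (an assumption here, not something the slides state): $y_{hi}$ is the weighted total for PSU $i$ in stratum $h$, $\bar{y}_h$ is the mean of those PSU totals, and $f_h$ is the first-stage sampling rate, with $N_h$ the number of PSUs in the population for stratum $h$. For a single-stage design sampled with replacement ($f_h = 0$), only the first component remains.

```latex
\[
y_{hi} = \sum_{j,k} w_{hijk}\, y_{hijk}, \qquad
\bar{y}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi}, \qquad
f_h = \frac{n_h}{N_h}
\]
\[
\hat{V}(\hat{Y}) = \sum_{h=1}^{L} \frac{n_h}{n_h - 1}
  \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2
  \qquad \text{(single stage, } f_h = 0 \text{)}
\]
```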
Example: svy: total
Let's use our (imaginary) survey data on high school seniors to estimate the number of smokers in
the population.
. webuse seniors
. svyset
      pweight: sampwgt
          VCE: linearized
  Single unit: missing
     Strata 1: state
         SU 1: county
        FPC 1: ncounties
     Strata 2: <one>
         SU 2: school
        FPC 2: nschools
     Strata 3: gender
         SU 3: <observations>
        FPC 3: nseniors
*** Estimate number of seniors who have smoked
. svy: total smoked
(running total on estimation sample)
Survey: Total estimation
Number of strata =      50          Number of obs    =     10559
Number of PSUs   =     100          Population size  =  20992929
                                    Design df        =        50
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
3.1.2 Regression models
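$\hat{G}(\beta)$ is a total estimator, use Taylor expansion to get $\hat{V}(\hat{\beta})$.
$$\hat{G}(\beta) = \sum_j w_j s_j x_j = 0$$
$$\hat{V}(\hat{\beta}) = D\, \hat{V}\{\hat{G}(\beta)\}\big|_{\beta = \hat{\beta}}\, D'$$
ML models
$\hat{G}(\beta)$ is the gradient
$s_j$ is an equation-level score
Least squares regression
$\hat{G}(\beta)$ is the normal equations
$s_j$ is a residual
$D$ is the inverse of the weighted outer product of the predictors, including the intercept
$$D = (X'WX)^{-1}$$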
Example: svy: logit
Here is an example of a logistic regression, modeling the incidence of high blood pressure as a
function of some demographic variables.
             |             Linearized
      highbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |  -.0316386   .0058648    -5.39   0.000    -.0435999   -.0196772
      weight |   .0511574   .0031191    16.40   0.000     .0447959     .057519
         age |   .0492406   .0023624    20.84   0.000     .0444224    .0540587
      female |  -.3215716   .0884387    -3.64   0.001    -.5019435   -.1411998
       _cons |  -2.858968   1.049395    -2.72   0.010    -4.999224   -.7187117
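A fit of this kind could be requested roughly as follows. The dataset (nhanes2) and its design variables psu, strata, and finalwgt are assumptions; the model variables are taken from the table above.

```stata
webuse nhanes2, clear
svyset psu [pweight=finalwgt], strata(strata)
* Linearized (Taylor series) standard errors are the default
svy: logit highbp height weight age female
```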
3.2 Balanced repeated replication (BRR)
Balanced repeated replication
For designs with two PSUs in each of L strata.
Syntax
svyset ... vce(brr) [mse]
Note
The idea is to resample the data, compute replicates from each resample, then use the replicates to estimate the variance.
Balance here means that stratum-specific contributions to the variance cancel out. In other
words, no stratum contributes more to the variance than any other.
When the dataset contains replicate weight variables, you do not need to worry about Hadamard
matrices.
Note
These replicate weights are used to produce a copy of the point estimates (replicate). The
replicates are then used to estimate the variance.
svy brr can employ replicate weight variables in the dataset, if you svyset them. Otherwise, svy brr will automatically adjust the sampling weights to produce the replicates;
however, a Hadamard matrix must be specified.
BRR variance formulas
$\hat{\theta}$: point estimates
$\hat{\theta}_{(i)}$: $i$th replicate of the point estimates
The default variance formula uses deviations of the replicates from their mean.
The MSE formula uses deviations of the replicates from the point estimates.
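Written out, and assuming $r$ replicates with no Fay adjustment, the two formulas described above take the following form. The symbols $r$, $\hat{\theta}$ (point estimates), $\hat{\theta}_{(i)}$ (the $i$th replicate), and $\bar{\theta}_{(.)}$ (the mean of the replicates) are notation assumed for this sketch.

```latex
\[
\text{Default:}\quad
\hat{V}(\hat{\theta}) = \frac{1}{r} \sum_{i=1}^{r}
  \{\hat{\theta}_{(i)} - \bar{\theta}_{(.)}\}
  \{\hat{\theta}_{(i)} - \bar{\theta}_{(.)}\}',
\qquad
\bar{\theta}_{(.)} = \frac{1}{r} \sum_{i=1}^{r} \hat{\theta}_{(i)}
\]
\[
\text{MSE:}\quad
\hat{V}(\hat{\theta}) = \frac{1}{r} \sum_{i=1}^{r}
  \{\hat{\theta}_{(i)} - \hat{\theta}\}
  \{\hat{\theta}_{(i)} - \hat{\theta}\}'
\]
```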
BRR * is clickable, taking you to a short help file informing you that you used the MSE
formula for BRR variance estimation.
Example: svy brr: logit
Let's revisit the previous logistic model fit, but use BRR for variance estimation.
             |                 BRR *
      highbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
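When the dataset has no replicate weight variables, a BRR fit like this one needs a Hadamard matrix. The sketch below builds a 32 x 32 matrix by repeated Kronecker products, which is large enough for the 31 strata mentioned earlier. The dataset and design variable names (nhanes2, psu, strata, finalwgt) and the matrix names h2 through h32 are assumptions made for illustration.

```stata
webuse nhanes2, clear
svyset psu [pweight=finalwgt], strata(strata)

* Build a Hadamard matrix of order 32 using the Kronecker product operator #
matrix h2  = (1, 1 \ 1, -1)
matrix h4  = h2 # h2
matrix h8  = h2 # h4
matrix h16 = h2 # h8
matrix h32 = h2 # h16

* mse requests deviations from the point estimates rather than from the
* mean of the replicates
svy brr, hadamard(h32) mse: logit highbp height weight age female
```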
3.3 Jackknife
The jackknife
A replication method for variance estimation. Not restricted to a specific survey design.
Syntax
svyset ... vce(jackknife) [mse]
Note
svy jackknife can employ replicate weight variables in the dataset, if you svyset them.
Otherwise, svy jackknife will automatically adjust the sampling weights to produce the
replicates using the delete-1 jackknife methodology.
The delete-k jackknife is only supported if you already have the corresponding replicate
weight variables for svyset.
$\hat{\theta}_{(h,i)}$: replicate of the point estimates from stratum $h$, PSU $i$
Note
The default variance formula uses deviations of the replicates from their mean.
The MSE formula uses deviations of the replicates from the point estimates.
Jknife * is clickable, taking you to a short help file informing you that you used the MSE
formula for jackknife variance estimation.
Make sure to specify the correct multiplier when you svyset jackknife replicate weight
variables.
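For the delete-1 jackknife, the two formulas described in this note can be sketched as below. The notation is assumed for illustration: $\hat{\theta}$ denotes the point estimates, $\hat{\theta}_{(h,i)}$ the replicate from stratum $h$, PSU $i$, $\bar{\theta}_h$ the mean of the replicates within stratum $h$, $f_h$ the sampling rate, and $m_h$ the jackknife multiplier.

```latex
\[
\text{Default:}\quad
\hat{V}(\hat{\theta}) = \sum_{h=1}^{L} (1 - f_h)\, m_h \sum_{i=1}^{n_h}
  \{\hat{\theta}_{(h,i)} - \bar{\theta}_h\}
  \{\hat{\theta}_{(h,i)} - \bar{\theta}_h\}',
\qquad
m_h = \frac{n_h - 1}{n_h}
\]
\[
\text{MSE:}\quad
\hat{V}(\hat{\theta}) = \sum_{h=1}^{L} (1 - f_h)\, m_h \sum_{i=1}^{n_h}
  \{\hat{\theta}_{(h,i)} - \hat{\theta}\}
  \{\hat{\theta}_{(h,i)} - \hat{\theta}\}'
\]
```

With two PSUs per stratum, $m_h = 1/2$, which is the value the multiplier must carry when replicate weight variables are supplied.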
             |              Jknife *
      highbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |  -.0316386   .0058674    -5.39   0.000    -.0436052   -.0196719
      weight |   .0511574   .0031203    16.40   0.000     .0447936    .0575213
         age |   .0492406   .0023634    20.83   0.000     .0444204    .0540607
      female |  -.3215716    .088471    -3.63   0.001    -.5020093   -.1411339
       _cons |  -2.858968   1.049924    -2.72   0.011    -5.000302   -.7176329
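A jackknife fit of this kind could be requested as sketched below, letting svy jackknife adjust the sampling weights itself. The dataset and design variable names (nhanes2, psu, strata, finalwgt) are assumptions.

```stata
webuse nhanes2, clear
svyset psu [pweight=finalwgt], strata(strata)
* Delete-1 jackknife: one replicate per PSU; mse uses deviations from the
* point estimates
svy jackknife, mse: logit highbp height weight age female
```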
Replicate weight variable
A variable in the dataset that contains sampling weight values that were adjusted for resampling
the data using BRR or the jackknife.
Syntax
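A hedged sketch of declaring replicate weight variables with svyset. The dataset names (nhanes2brr, nhanes2jknife) and the variable name patterns brr_* and jkw_* are assumptions about the example data; substitute the names used in your own files. The multiplier of 0.5 corresponds to (n_h - 1)/n_h with two PSUs per stratum.

```stata
* BRR replicate weights: no Hadamard matrix is needed
webuse nhanes2brr, clear
svyset [pweight=finalwgt], brrweight(brr_*) vce(brr) mse
svy: logit highbp height weight age female

* Jackknife replicate weights, with the multiplier stated explicitly
webuse nhanes2jknife, clear
svyset [pweight=finalwgt], jkrweight(jkw_*, multiplier(0.5)) vce(jackknife) mse
svy: logit highbp height weight age female
```

Because vce() is set in svyset, plain svy: picks up the replication method without a separate svy brr or svy jackknife prefix.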
Note
As I mentioned earlier on, variability is governed by the survey design, so our variance
estimates assume the design is fixed. The subpop() option assumes this too.
If we discourage you from using if and in, why does svy allow them?
You might want to restrict your sample because of known defects in some of the variables.
Researchers can use if and in to conduct simulation studies by simulating survey samples from a population dataset without having to use preserve and restore.
We can illustrate the difference between these estimators with an SRS design.
Example: svy, subpop()
Suppose we want to estimate the mean birth weight for mothers with high blood pressure. The
highbp variable (in the nmihs data) is an indicator for mothers with high blood pressure.
In the reported results, the subpopulation information is provided in the header. Notice that
although the restricted sample results reproduce the same mean, the standard errors differ.
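             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    birthwgt |   3202.483    28.7201      3146.077     3258.89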
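The comparison described in this example can be sketched as follows. The design variable names finwgt and stratan are assumptions about the nmihs dataset; birthwgt and highbp are named in the text above.

```stata
webuse nmihs, clear
svyset [pweight=finwgt], strata(stratan)

* Subpopulation estimation: all observations stay in the design, and
* highbp marks the subpopulation members
svy, subpop(highbp): mean birthwgt

* Restricted sample: same point estimate, but the standard error is
* computed as if the excluded observations had never been sampled
svy: mean birthwgt if highbp
```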
5 Summary
1. Use svyset to specify the survey design for your data.
References
[1] Levy, P. and S. Lemeshow. 1999. Sampling of Populations. 3rd ed. New York: Wiley.
[2] StataCorp. 2009. Survey Data Reference Manual: Release 11. College Station, TX: StataCorp
LP.