
Statistical Analysis with Missing Data

Lecture for BDSI 2022


Dr. Peisong Han

Created based on Prof. Rod Little’s lecture slides


Missing Data Problems

Missing data problems are very common in almost every field
– Nonresponse in sample surveys
– Noncompliance in clinical trials
– Two-stage design, etc.
– Dropout in longitudinal studies
– ...
2
Example: Longitudinal Data with Dropout (Hedeker and Gibbons, 1997)

3
4
Bias when ignoring subjects with missing data: a simulation study
• True model:
  X ~ N(0, 1)
  logit[Pr(E=1 | X)] = 0.5 + X
  logit[Pr(D=1 | E, X)] = 0.25 + 0.5X + 1.1E
• Sample size: 500
• Number of replicates: 5000
5
Missing-Data Mechanism
• D and E: completely observed
• X: sometimes missing
• Values of X in each (D, E) cell are set to missing with the following underlying probabilities:
  D=0, E=0: p00 = 0.19
  D=0, E=1: p01 = 0.09
  D=1, E=0: p10 = 0.015
  D=1, E=1: p11 = 0.055
6
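A minimal sketch of this simulation in Python, assuming the cell probabilities above are the per-subject probabilities that X is deleted (the code is illustrative, not the slides' own):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, n_rep = 500, 5000                 # sample size and replicates from the slides (reduce n_rep for a quick run)
p_del = {(0, 0): 0.19, (0, 1): 0.09, (1, 0): 0.015, (1, 1): 0.055}   # P(X missing | D, E), read off the slide

def expit(z):
    return 1 / (1 + np.exp(-z))

bd_est, cc_est = [], []
for _ in range(n_rep):
    X = rng.normal(size=n)
    E = rng.binomial(1, expit(0.5 + X))
    D = rng.binomial(1, expit(0.25 + 0.5 * X + 1.1 * E))
    design = sm.add_constant(np.column_stack([E, X]))

    # "Before deletion": fit logit Pr(D=1|E,X) = b0 + b1*E + b2*X on the full data
    bd_est.append(sm.Logit(D, design).fit(disp=0).params[1])

    # Delete X with the cell-specific probabilities, then do complete-case analysis
    p = np.array([p_del[(d, e)] for d, e in zip(D, E)])
    keep = rng.random(n) >= p
    cc_est.append(sm.Logit(D[keep], design[keep]).fit(disp=0).params[1])

print(np.mean(bd_est), np.mean(cc_est))   # true b1 = 1.1; the slides report serious negative bias for the complete-case estimate
```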
Before Deletion Estimates
• Histogram of 5000 estimates before deleting values of X
• Logistic model: logit Pr(D=1 | E, X) = b0 + b1 E + b2 X
7
Complete-Case Estimates
• Histogram of complete-case analysis estimates
• Delete subjects with missing X values
• True value = 1.1; serious negative bias
8
Patterns of Missing Data
• General pattern: any arrangement of missing values in the cases × variables data matrix
• Some special patterns: monotone, univariate, file matching
9
10
Some Examples of Missingness Mechanisms

11
More Examples
• MCAR:
– patients had their weight measured by flipping a
coin.

• MAR:
– patients with high blood pressure had their
weight measured.

• NMAR:
– overweight patients had their weight measured.
What Mechanism to Assume
• MCAR:
– Simplest mechanism; strongest assumption;
usually not the true mechanism in practice
• NMAR:
– Most complex mechanism; weakest assumption;
likely the true mechanism in practice
• MAR:
– A mechanism between MCAR and NMAR;
oftentimes a good approximation to the truth;
easy to work with
General Strategies
[Diagram: the same incomplete data matrix handled by four general strategies]
• Complete-case analysis: keep only the complete cases
• Imputation: fill in the missing values
• Weighting: attach weights (w1, w2, w3, ...) to the complete cases
• Analyze the available data as is; the incomplete cases are still useful (e.g. maximum likelihood, Bayes)
14
Complete-Case Analysis
[Diagram: discard the cases with missing values, keep only the complete cases]
• Default analysis in statistical packages
• Simple, and valid if MCAR
• Generally biased estimation otherwise
15
CC Analysis
• Does not invent data
• Simple, and may be good enough with small amounts of missing data
– but defining “small” is problematic; it depends on
  • the fraction of incomplete cases
  • the recorded information in these cases
  • the parameter being estimated
16
Limitations of CC Analysis
• Loss of information in the incomplete cases
– Increased variance of estimates
– Bias when complete cases differ systematically from incomplete cases
• Restriction to complete cases requires that they are representative of all the cases for the analysis in question, but this assumption is often questionable!
17
18
Imputation and Multiple Imputation

19
Problem
[Diagram: data matrix with variables Y1, Y2, Y3, Y4, ..., Yp; some cases are complete, others have some missing values]
Dobs = observed data, Dmiss = missing data
Y may be discrete, continuous, or semi-continuous, as well as multivariate
20
Considerations behind Imputation
• Multiple users analyzing different subsets of variables
• Different skill levels in dealing with incomplete data
• Software to perform complete-data analysis is available
• Assume missing at random
21
Imputation
Important issues:
• Imputations are not real values
• Uncertainties associated with the imputes
Imputation: draws from Pr(Dmiss | Dobs)
22
Features of Imputation
[Diagram: the incomplete data matrix with imputed values filled in]
Good:
• Rectangular file
• Retains observed data
• Handles missing data once
• Exploits incomplete cases
Bad:
• Naïve methods can be bad
• Invents data
• Understates uncertainty
23
A Bivariate Example: Continuous Case
• Imputations are random draws from a predictive distribution for the missing values
$\hat{y}_{i2} = \hat{E}(y_{i2} \mid y_{i1}) + r_i$
$r_i \sim N(0, s^2_{2 \cdot 1})$, where $s^2_{2 \cdot 1}$ = residual variance, or
$r_i$ = residual from a randomly selected complete case
24
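A minimal sketch of this stochastic regression imputation step in Python (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def impute_continuous(y1, y2, rng=np.random.default_rng(0)):
    """Impute missing y2 by regressing y2 on y1 among complete cases
    and adding a normal residual draw (stochastic regression imputation)."""
    miss = np.isnan(y2)
    # Fit y2 = a + b*y1 on the complete cases
    b, a = np.polyfit(y1[~miss], y2[~miss], 1)
    resid = y2[~miss] - (a + b * y1[~miss])
    s = resid.std(ddof=2)                       # residual standard deviation
    y2_imp = y2.copy()
    y2_imp[miss] = a + b * y1[miss] + rng.normal(0, s, miss.sum())
    # Alternative from the slide: draw r_i from the observed residuals instead,
    # e.g. rng.choice(resid, miss.sum())
    return y2_imp
```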
A Bivariate Example: Binary Case
• For binary (0-1) data, impute 1 with probability equal to the predicted probability of a one given the observed covariates
$\hat{p}_{i2} = \Pr(y_{i2} = 1 \mid y_{i1})$ (e.g. from a logistic regression)
$y_{i2} = 1$ with probability $\hat{p}_{i2}$, $\;0$ with probability $1 - \hat{p}_{i2}$
25
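A corresponding sketch for the binary case, assuming scikit-learn's LogisticRegression (illustrative, not the slides' code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def impute_binary(y1, y2, rng=np.random.default_rng(0)):
    """Impute missing 0/1 values of y2 as Bernoulli draws with
    probability predicted from a logistic regression of y2 on y1."""
    miss = np.isnan(y2)
    model = LogisticRegression().fit(y1[~miss].reshape(-1, 1), y2[~miss])
    p = model.predict_proba(y1[miss].reshape(-1, 1))[:, 1]   # predicted Pr(y2 = 1 | y1)
    y2_imp = y2.copy()
    y2_imp[miss] = rng.binomial(1, p)
    return y2_imp
```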
Example: Should imputations be conditional on all observed variables?
• Consumer Expenditure Survey (Bureau of Labor Statistics)
• Should the imputation of Income be conditional on the Expenditure variable?
26
BLS Simulation Example
• BLS researchers:
– created population by accumulating
complete cases over several years
– drew 200 random samples of size 500 each
(Before deletion data sets)
– created missing data on income in each data
set
– supplied 200 data sets along with 55
covariates to University of Michigan
27
BLS Example (Continued)
• UM did not know how Income values
were deleted (except that some or all of
55 covariates were used in specifying
missing data mechanism)
• UM created two sets of imputations
Using Expenditure
Not Using Expenditure

28
BLS Imputations
• Imputations were created by drawing
values from the posterior predictive
distribution of income under an explicit
model
• One included expenditure as a conditioning variable and the other did not
• Two sets of imputed data sets and actual
data sets were analyzed by UM and BLS
respectively.
29
BLS Models of Interest
• OLS model:
  Food-At-Home = b0 + b1 Income + covariates
• Tobit model:
  Food-Away-Home = g0 + g1 Income + covariates
30
Estimated regression coefficients of income from
undeleted and imputed data-sets: OLS Model

31
Estimated regression coefficients of income from
undeleted and imputed data-sets: Tobit Model

32
What should imputes condition on?
• In principle, all observed variables
– Whether predictors or outcomes of the final analysis model
– May be impractical with a lot of variables
• Variable selection
– Priority to variables predictive of the missing variable (and of nonresponse)
– Favor inclusion over exclusion
33
Key Problem of Single Imputation
• Single imputations do not account for imputation uncertainty
• Remedies:
– bootstrapping the imputation method
– multiple imputation
34
Multiple Imputation
• Create D sets of imputations, each set a draw from the predictive distribution of the missing values
– e.g. D = 5
[Diagram: a data matrix with missing entries; each missing entry receives 5 imputed values, one per imputation, e.g. 2.1, 2.7, 1.9, 2.5, 2.3 for one entry and 4.5, 5.1, 5.8, 3.9, 4.2 for another]
35
Multiple Imputation Inference
• D completed data sets (e.g. D = 5)
• Analyze each completed data set
• Combine the results in a simple way to produce the multiple imputation inference
• Particularly useful for public use datasets
– data provider creates imputes for multiple
users, who can analyze data with complete-
data methods
36
MI Inference for a Scalar Estimand
$\theta$ = estimand of interest, with estimate $\hat{\theta}_d$ and variance $W_d$ from completed data set $d$

The MI point estimate is $\bar{\theta}_D = \frac{1}{D}\sum_{d=1}^{D} \hat{\theta}_d$

The MI estimate of variance is $T_D = W_D + (1 + 1/D)\, B_D$, where

$W_D = \frac{1}{D}\sum_{d=1}^{D} W_d$ = within-imputation variance

$B_D = \frac{1}{D-1}\sum_{d=1}^{D} (\hat{\theta}_d - \bar{\theta}_D)^2$ = between-imputation variance
37
Example of Multiple Imputation

Estimates (se²) from each completed data set, for two estimands (the estimate in the first column is identical across the completed data sets, so its between-imputation variance is 0):

Dataset (d)   Estimate 1 (se²)   Estimate 2 (se²)
1             12.6 (3.6²)        4.32 (1.95²)
2             12.6 (3.6²)        4.15 (2.64²)
3             12.6 (3.6²)        4.86 (2.09²)
4             12.6 (3.6²)        3.98 (2.14²)
5             12.6 (3.6²)        4.50 (2.47²)
Mean          12.6 (3.6²)        4.36 (2.27²)
Var           0                  0.339

38
Summary of MI Inferences

              θ̄_D    W_D     B_D     T_D = W_D + (6/5) B_D    γ̂_D = 1.2 B_D / (1.2 B_D + W_D)
Estimand 1    12.6   3.6²    0       3.6²                     0
Estimand 2    4.36   2.27²   0.339   2.36²                    0.073

(With D = 5, 1 + 1/D = 6/5 = 1.2; γ̂_D is the estimated fraction of missing information.)
39
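A minimal sketch of these combining rules in Python; plugging in W_D and B_D from the second row reproduces T_D and γ̂_D (illustrative code, not from the slides):

```python
def rubin_combine(W, B, D):
    """Combine MI results: total variance T and estimated fraction of missing information."""
    T = W + (1 + 1 / D) * B
    gamma = (1 + 1 / D) * B / T
    return T, gamma

# Values from the second row of the summary table (D = 5 imputations)
T, gamma = rubin_combine(W=2.27 ** 2, B=0.339, D=5)
print(T ** 0.5, gamma)   # ≈ 2.36 and ≈ 0.073
```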
Imputation for Monotone Patterns
U = fully observed; Y1, Y2, ..., Yp have a monotone missingness pattern.
(a) Regress Y1 on U; impute the missing values of Y1
(b) Regress Y2 on Y1 and U; impute the missing values of Y2 (with imputes for missing Y1 from (a))
...
(k) Regress Yp on Y1, ..., Yp-1 and U; impute the missing values of Yp (with imputes for missing Y1, ..., Yp-1 from the previous steps)
E.g. SAS PROC MI
40
Chained Equations / Sequential Regression Approach
U = fully observed; Y1, ..., Yp = incomplete, ordered from least to most missing values (not necessarily a monotone pattern)
Iteration 1:
  Regress Y1 on U; impute missing Y1
  Regress Y2 on U, Y1; impute missing Y2
  ...
  Regress Yp on U, Y1, ..., Yp-1; impute missing Yp
Iterations 2, ..., I: update the imputed draws:
  Regress Y1 on U, Y2, ..., Yp; reimpute missing Y1
  Regress Y2 on U, Y1, Y3, ..., Yp; reimpute missing Y2
  ...
  Regress Yp on U, Y1, ..., Yp-1; reimpute missing Yp
41
Chained Equations / Sequential Regression Approach
• Missing values are replaced by the most recent imputes
• Regression is tailored to the type of variable:
  • Continuous (linear regression)
  • Binary (logistic regression)
  • Categorical (polytomous regression)
  • Count (Poisson regression)
• Regression diagnostics to fine-tune each model
• Empirical studies show that nothing much changes after 5 or 6 iterations
42
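A minimal sketch of the chained-equations idea, assuming scikit-learn's IterativeImputer (one of several implementations; the slides themselves mention SAS PROC MI, and this continuous-only sketch does not reproduce the variable-type-specific regressions above). Running it D times with different seeds gives multiple imputations:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (activates the experimental API)
from sklearn.impute import IterativeImputer

# Y: cases-by-variables array with np.nan for the missing entries (continuous variables only here)
Y = np.array([[1.0,    2.0,    3.0],
              [2.0,    np.nan, 4.0],
              [np.nan, 3.0,    5.0],
              [4.0,    5.0,    np.nan]])

D = 5
completed = [
    IterativeImputer(max_iter=10, sample_posterior=True, random_state=d).fit_transform(Y)
    for d in range(D)
]
# 'completed' holds D completed data sets; analyze each and combine with the MI rules above
```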
Chained Equation Approach
• Sacrifices a coherent joint distribution for the variables, and models a sequence of conditional distributions instead
• Can be a useful practical approach when
– there are general patterns of missing data
– explicitly modeling the joint distribution is difficult
– multivariate normality does not model variables of different types well
43
Weighted Complete-Case Analysis

44
Weighted CC Analysis
[Diagram: discard the incomplete cases, but attach weights w1, w2, w3, ... to the retained complete cases]
• Weight respondents differentially to reduce nonresponse bias; e.g. the mean becomes a weighted mean
• Most common weight: proportional to 1/P(response)
45
Adjustment Cell Method for Estimating the Probability of Response
• Group respondents and nonrespondents into adjustment cells with similar values on variables recorded for both:
  e.g. females aged 25-35 living in AA: 100 in sample = 80 respondents + 20 nonrespondents
  pr(response in cell) = 0.8, so response weight = 1/0.8 = 1.25
• With extensive covariate information, we can't cross-classify on all of the variables
• How do we choose which variables to use?
46
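A minimal pandas sketch of adjustment-cell weighting (the data frame and column names are hypothetical):

```python
import pandas as pd

# Toy sample: 5 females aged 25-35, 4 of whom responded (response rate 0.8, as in the slide's cell)
df = pd.DataFrame({
    "sex":       ["F"] * 5,
    "age_group": ["25-35"] * 5,
    "responded": [1, 1, 1, 1, 0],
})

# Response rate within each adjustment cell; respondent weight = 1 / response rate
rate = df.groupby(["sex", "age_group"])["responded"].transform("mean")
df.loc[df["responded"] == 1, "weight"] = 1 / rate[df["responded"] == 1]
print(df)   # respondents get weight 1.25
```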
Response Propensity Score
[Diagram: covariates X1, X2, ..., Xp fully observed; Y missing for some cases; M = missingness indicator (0 = observed, 1 = missing)]
• Regress M on X (probit or logistic), using both respondent and nonrespondent data
– p(M=0 | X) is the response propensity score
• Weight respondents by the inverse of the propensity score
– 1/p(M=0 | X)
47
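A minimal sketch of propensity-score weighting for estimating a mean, assuming scikit-learn's LogisticRegression (illustrative code, not the slides' own):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_mean(X, y):
    """Estimate the mean of y (missing for some cases) by weighting
    respondents with 1 / p(M = 0 | X), the estimated response propensity."""
    observed = ~np.isnan(y)                        # M = 0 means observed
    ps_model = LogisticRegression().fit(X, observed.astype(int))
    p_obs = ps_model.predict_proba(X)[:, 1]        # column 1 = estimated probability of being observed
    w = 1.0 / p_obs[observed]                      # inverse-propensity weights for respondents
    return np.sum(w * y[observed]) / np.sum(w)
```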
Propensity Score Weighting
• A widely used alternative to imputation
• Avoids modeling the data distribution
• A fundamental concept in both missing data and causal inference
• May not be stable if some propensity scores are close to zero
48
Likelihood Methods

49
Likelihood Methods
• Statistical model + data → likelihood
• Two general approaches based on the likelihood
– maximum likelihood inference for large samples
– Bayesian inference for small samples:
  log(likelihood) + log(prior) = log(posterior) (up to a constant)
• Methods use all available data
– do not require rectangular data sets
50
Parametric Likelihood
• Data Y
• Statistical model yields a probability density $f(Y \mid \theta)$ for Y with unknown parameters $\theta$
• Likelihood function is then the density viewed as a function of $\theta$:
  $L(\theta \mid Y) = \text{const} \times f(Y \mid \theta)$
• Loglikelihood is often easier to work with:
  $\ell(\theta \mid Y) = \log L(\theta \mid Y) = \text{const} + \log\{ f(Y \mid \theta) \}$
• Constants can depend on the data but not on the parameter $\theta$
51
Example: Normal Sample
• Univariate iid normal sample
Data: $Y = (y_1, \ldots, y_n)$
Parameters: $\theta = (\mu, \sigma^2)$, $\mu$ = mean, $\sigma^2$ = variance

Normal density: $f(Y \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right\}$

Likelihood: $L(\mu, \sigma^2 \mid Y) = (\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right\}$
52
Maximum Likelihood Estimate
• The maximum likelihood (ML) estimate $\hat\theta$ of $\theta$ maximizes the likelihood:
  $L(\hat\theta \mid Y) \ge L(\theta \mid Y)$ for all $\theta$
• The ML estimate is the “value of the parameter that makes the data most likely”
• The ML estimate is not always unique, but it is for many regular problems given enough data
53
Computing the ML Estimate
• In regular problems, the ML estimate can be found by solving the score equation
  $S(\theta \mid Y) \equiv \dfrac{\partial \log L(\theta \mid Y)}{\partial \theta} = 0$
• Iterative methods (e.g. Newton-Raphson, scoring, the EM algorithm) are required for most problems
54
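As an illustrative sketch (not from the slides), the normal-sample likelihood above can be maximized numerically with scipy and checked against the closed-form ML estimates (the sample mean and the variance with divisor n):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=200)   # illustrative data

def negloglik(theta, y):
    """Negative loglikelihood of a N(mu, sigma^2) sample; theta = (mu, log sigma)."""
    mu, log_sigma = theta
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + np.sum((y - mu) ** 2) / (2 * sigma2)

fit = minimize(negloglik, x0=[0.0, 0.0], args=(y,))
mu_hat, sigma2_hat = fit.x[0], np.exp(2 * fit.x[1])
print(mu_hat, sigma2_hat)          # numerical ML estimates
print(y.mean(), y.var(ddof=0))     # closed-form ML estimates
```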
Properties of ML Estimates
• Under the assumed model, the ML estimate is:
– Consistent
– Efficient for large samples
– Not necessarily the best for small samples
• The ML estimate is transformation invariant:
– If $\hat\theta$ is the ML estimate of $\theta$, then $\phi(\hat\theta)$ is the ML estimate of $\phi(\theta)$
55
Likelihood methods with incomplete
data
• Statistical models needed for:
– data without missing values
– missing-data mechanism
• Model for mechanism not needed if it is ignorable (to be
defined later)
• With likelihood, proceed as before:
– ML estimates, large sample standard errors
– Bayes posterior distribution

56
The Observed Data
[Diagram: a data matrix Y1, Y2, Y3 with some entries missing, and the corresponding matrix of missingness indicators M1, M2, M3]

$Y = (y_{ij})_{n \times K} = (Y_{obs}, Y_{mis}), \quad M = (m_{ij})_{n \times K}$

$m_{ij} = \begin{cases} 0, & y_{ij} \text{ observed} \\ 1, & y_{ij} \text{ missing} \end{cases}$
57
Model for Y and M

$f(Y, M \mid \theta, \psi) = f(Y \mid \theta) \times f(M \mid Y, \psi)$
(complete-data model $\times$ model for the mechanism)

Example: bivariate normal monotone data ($Y_1$ fully observed, $Y_2$ sometimes missing)

Complete-data model: $(y_{i1}, y_{i2}) \sim \text{iid } N_2(\mu, \Sigma)$

Model for the mechanism: $(m_{i2} \mid y_{i1}, y_{i2}) \sim_{\text{ind}} \text{Bern}\left[ \Phi(\psi_0 + \psi_1 y_{i1} + \psi_2 y_{i2}) \right]$

$\Phi$ = normal cumulative distribution function
58
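A minimal sketch of simulating data from this selection model (the parameter values are illustrative; setting ψ2 = 0 makes the mechanism depend only on the observed y1, i.e. MAR, while ψ2 ≠ 0 gives a nonignorable mechanism):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 1000
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])          # illustrative mean and covariance
psi0, psi1, psi2 = -0.5, 1.0, 0.0                   # psi2 = 0 gives MAR; psi2 != 0 gives NMAR

y = rng.multivariate_normal(mu, Sigma, size=n)      # complete data (y1, y2)
p_miss = norm.cdf(psi0 + psi1 * y[:, 0] + psi2 * y[:, 1])
m2 = rng.binomial(1, p_miss)                        # missingness indicator for y2

y2_obs = np.where(m2 == 0, y[:, 1], np.nan)         # observed data: y2 set to NaN where missing
```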
The Likelihood
• The full likelihood involves the model for M:
$f(Y_{obs}, M \mid \theta, \psi) = \int f(Y_{obs}, Y_{mis} \mid \theta)\, f(M \mid Y_{obs}, Y_{mis}, \psi)\, dY_{mis}$
$\Rightarrow L_{full}(\theta, \psi \mid Y_{obs}, M) = \text{const} \times f(Y_{obs}, M \mid \theta, \psi)$

• Likelihood ignoring the missing-data mechanism M:
– simpler, since it does not involve the model for M
– works (only) when the mechanism is ignorable
$f(Y_{obs} \mid \theta) = \int f(Y_{obs}, Y_{mis} \mid \theta)\, dY_{mis}$
$\Rightarrow L_{ign}(\theta \mid Y_{obs}) = \text{const} \times f(Y_{obs} \mid \theta)$
59
Ignoring the Missing-Data Mechanism
• It is easy to show that sufficient conditions for ignoring the missing-data mechanism are:
(A) Missing at Random (MAR):
$f(M \mid Y_{obs}, Y_{mis}, \psi) = f(M \mid Y_{obs}, \psi)$ for all $Y_{mis}$
(B) Distinctness:
$\theta$ and $\psi$ have distinct parameter spaces
(Bayes: the prior distributions are independent)
60
Proof:
$f(Y_{obs}, M \mid \theta, \psi) = \int f(Y_{obs}, Y_{mis} \mid \theta)\, f(M \mid Y_{obs}, Y_{mis}, \psi)\, dY_{mis}$
$\stackrel{\text{(MAR)}}{=} \int f(Y_{obs}, Y_{mis} \mid \theta)\, f(M \mid Y_{obs}, \psi)\, dY_{mis}$
$= f(M \mid Y_{obs}, \psi) \int f(Y_{obs}, Y_{mis} \mid \theta)\, dY_{mis}$
$= f(M \mid Y_{obs}, \psi) \times f(Y_{obs} \mid \theta)$

• If MAR holds but not distinctness, ML based on the ignorable likelihood is valid but not fully efficient
• So MAR is the key condition
61
Bayes: Add Prior Distributions
$p_{complete}(\theta, \psi \mid Y, M) \propto \pi(\theta, \psi) \times f(Y \mid \theta) \times f(M \mid Y, \psi)$
(prior distribution $\times$ complete-data model $\times$ model for the mechanism)

• Full posterior distribution involves the model for M:
$p_{full}(\theta, \psi \mid Y_{obs}, M) \propto \pi(\theta, \psi) \times f(Y_{obs}, M \mid \theta, \psi)$
$f(Y_{obs}, M \mid \theta, \psi) = \int f(Y_{obs}, Y_{mis} \mid \theta)\, f(M \mid Y_{obs}, Y_{mis}, \psi)\, dY_{mis}$

• Posterior distribution ignoring the missing-data mechanism M (simpler, since it does not involve the model for M):
$p_{ign}(\theta \mid Y_{obs}) \propto \pi(\theta) \times f(Y_{obs} \mid \theta)$
$f(Y_{obs} \mid \theta) = \int f(Y_{obs}, Y_{mis} \mid \theta)\, dY_{mis}$
62
Summary
• Likelihood inference ignoring the missing-data mechanism is valid if
– the model for Y is correctly specified
– the data are MAR
– it is fully efficient if the distinctness condition also holds
63
Some Discussion and Take-Home Messages

64
A Few Notes
• Any missing-data method involves modeling assumptions
• Collect and use relevant covariate information
– Covariates related to missingness and to the main outcomes
• Sensitivity analyses for nonignorable missing data
65
66
67
Missing data methods -- history
1. Before the EM algorithm (pre-1970s)
– Ad-hoc adjustments (simple imputation)
– ML for simple problems (Anderson 1957)
– ML for complex problems too hard
2. ML era (1970s – mid 1980s)
– Rubin formulates a model for the missing-data mechanism, defines MAR (1976)
– EM and extensions facilitate ML for complex problems
– ML for more flexible models, beyond multivariate normal (see e.g. Little and Rubin 1987)
68
Missing data methods -- history
3. Bayes and Multiple Imputation (mid 1980s – present)
– Rubin proposes MI, justified via Bayes (1977, 1987)
– MCMC facilitates Bayes as an alternative to ML, with better small-sample properties (see e.g. Little and Rubin 2019)
4. Robustness concerns (1990s – present)
– Robins et al. propose doubly robust methods for missing data based on a semiparametric approach
– Robust Bayesian models, more attention to model checks
69
A Great Textbook on Missing Data

70
Summary
• Missing data problems are widely seen
• Many methods for dealing with missing data
– Parametric, semiparametric, nonparametric
• We covered some basic concepts and methods
• Hopefully a useful introduction
71
