Lecture 4: Missing data and parametric models
Missing data is a very common problem in scientific research. In a survey sample, it occurs when individuals refuse to answer some questions. In medical research, it happens when participants drop out of the study.
There are several common strategies that practitioners use to handle missing data; here we focus on one of them:
• Imputation. Imputation is a popular approach to the missing data problem. The idea is very simple: we fill in each missing entry with a proper value, which leads to a complete dataset. Then we can treat the problem as if there were no missingness.
Here is a caveat. If the imputation is done in a deterministic way, i.e., every time a missing entry is imputed, it is imputed with a fixed number, the imputed data are often problematic because we do not take into account the intrinsic variation of the missing value. This leads to bias in the later estimation procedure.
A better approach is stochastic imputation, in which we impute the missing entries by drawing from a distribution. Later we will show that if the distribution we draw from is the actual distribution that generates the data, stochastic imputation leads to a dataset without any bias (Section 4.1.2).
A challenge is that, in general, we do not know the actual distribution, so how to perform the stochastic imputation is itself a problem.
Consider a regression problem where we have a binary covariate X ∈ {0, 1} and a continuous response Y ∈ R.
However, in our data, some response variables are missing and only the covariates are observed. So our data
can be represented as
(X1 , Y1 ), · · · , (Xn , Yn ), (Xn+1 , ?), · · · , (Xn+m , ?).
The symbol ? denotes a missing value. Namely, we have n observations that are fully observed, while for the other m observations we only observe the covariate, not the response. Suppose that the parameter of interest is the marginal median m_Y of the response variable. How should we estimate this median?
We can introduce an additional variable R to denote the missingness such that R = 0 means that Y is not
observed whereas R = 1 means that Y is observed. Note that R itself is another random variable.
Without any assumptions on the missing data, we are not able to estimate the median consistently. There are two common assumptions people make about the missingness:
1. MCAR: missing completely at random. This means that the missingness is independent of all variables. In the above notation, MCAR means that
R ⊥ X, Y.
2. MAR: missing at random. Under MAR, the missingness depends only on the observed variables. In our case,
P (R = 0|X, Y ) = P (R = 0|X)
since Y is not observed when R = 0.
When the missingness is neither MCAR nor MAR, it is called MNAR: missing not at random.
Under MCAR, we can completely ignore the observations with missing values and just use the sample median of the complete cases as an estimate of m_Y. However, under MAR, we cannot do this because the missingness may depend on X; if the distribution of the covariate differs between the fully observed data (R = 1) and the partially observed data (R = 0), we will obtain a biased estimate.
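To see this bias concretely, here is a minimal simulation sketch; the distributions and missingness probabilities below are illustrative assumptions, not from the lecture.

```python
# Illustrative example: X ~ Bernoulli(0.5), Y | X = x ~ N(2x, 1), and the
# response is MAR with P(R = 0 | X = 1) = 0.8, P(R = 0 | X = 0) = 0.1.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

X = rng.binomial(1, 0.5, size=n)
Y = rng.normal(2 * X, 1.0)

# MAR missingness: the probability of being missing depends on X only.
p_miss = np.where(X == 1, 0.8, 0.1)
R = rng.binomial(1, 1 - p_miss)          # R = 1 means Y is observed

true_median = np.median(Y)               # marginal median m_Y (here equal to 1)
cc_median = np.median(Y[R == 1])         # complete-case median

print(f"marginal median of Y : {true_median:.3f}")
print(f"complete-case median : {cc_median:.3f}  (biased under MAR)")
```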
While there are other ways to estimate the median under MAR, we will focus on the method of imputation.
4.1.2 Imputation
The idea of imputation is to fill in a value for each missing entry so that, after imputing all missing entries, we obtain a dataset without any missingness. Then we can simply apply a regular estimator (in the above example, the sample median) to estimate the parameter of interest.
However, we cannot impute an arbitrary number for a missing entry because this would cause bias in the estimation. We need to impute the value in a smart way. Generally, we want to impute the value according to the
conditional density
p(y|x, R = 0),
the conditional density of the response variable Y given the covariate X and the missing pattern R = 0. Namely, for the (n + i)-th observation, where only X_{n+i} is observed, we want to draw a random number

Ŷ_{n+i} ∼ p(y | X_{n+i}, R = 0).

If indeed Ŷ_{n+i} is drawn from the above density function, one can show that the sample median of the completed dataset is an unbiased estimator of m_Y.
This idea works regardless of which missing data assumption holds. However, the problem is that the density function p(y|x, R = 0) cannot be estimated from our data because we only observe Y when R = 1. In this case, MAR implies a powerful result:

p(y | x, R = 0) = p(y | x, R = 1).    (4.1)

Namely, the conditional density of Y given X does not depend on the missing indicator R, so we can estimate it from the fully observed data. To see how equation (4.1) is derived, note that MAR implies P(R = 0 | X, Y) = P(R = 0 | X), i.e., R ⊥ Y | X, and therefore p(y | x, R = 0) = p(y | x) = p(y | x, R = 1).
Given an observation X_{n+i} = x, how should we sample Ŷ_{n+i} from an estimate p̂(y | x, R = 1)? It is very simple. We first sample an index I such that
P(I = i | data) = (1/n_x) I(X_i = x),    where n_x = Σ_{j=1}^{n} I(X_j = x) is the number of fully observed cases with covariate value x.
Namely, I is chosen uniformly at random from the fully observed observations with covariate X_i = x. Given I, we then sample Ŷ_{n+i} from the density function
q(y) = (1/h) K((Y_I − y)/h).
Although this may look scary, if the kernel function K is Gaussian, q(y) is the normal density with mean Y_I and variance h². Namely, when K is Gaussian,

Ŷ_{n+i} ∼ N(Y_I, h²).
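The following is a minimal sketch of this sampling scheme (function and variable names are my own, not from the lecture); it assumes a Gaussian kernel, so each imputed value is drawn as N(Y_I, h²), and the bandwidth h = 0.5 is an arbitrary illustrative choice.

```python
import numpy as np

def stochastic_impute(X_obs, Y_obs, X_miss, h=0.5, rng=None):
    """Impute missing responses by sampling from a KDE of p(y | x, R = 1).

    For each covariate value x with a missing response, pick a donor I
    uniformly among fully observed cases with X_i = x, then draw
    Y_hat ~ N(Y_I, h^2) (Gaussian kernel with bandwidth h).
    """
    rng = np.random.default_rng() if rng is None else rng
    Y_imp = np.empty(len(X_miss))
    for j, x in enumerate(X_miss):
        donors = Y_obs[X_obs == x]            # fully observed cases with X_i = x
        Y_I = rng.choice(donors)              # donor drawn uniformly at random
        Y_imp[j] = rng.normal(Y_I, h)         # add kernel noise: N(Y_I, h^2)
    return Y_imp

# Example usage with the simulated data (X, Y, R) from the earlier sketch:
# X_obs, Y_obs, X_miss = X[R == 1], Y[R == 1], X[R == 0]
# Y_imp = stochastic_impute(X_obs, Y_obs, X_miss)
# m_hat = np.median(np.concatenate([Y_obs, Y_imp]))
```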
Remark.
• The use of a KDE is just one example. You can use any density estimator for p̂(y|x, R = 1) as long as you are able to sample from it.
• The equation (4.1) relies on the MAR assumption along with the fact that only one variable is subject to missingness. When more than one variable can be missing, we no longer have such a simple equivalence.
• The imputed data can be used for other estimators as well, not limited to estimating the median. You
may notice that during our imputation process, we do not use any information about the estimator.
• There are imputation methods that impute a fixed, non-random number for each missing entry. This is often called deterministic imputation. For certain problems, a deterministic imputation works, but in general it may not. So a rule of thumb is to use a random imputation if possible.
After doing the imputation for all missing entries, we obtain a complete dataset

(X_1, Y_1), ..., (X_n, Y_n), (X_{n+1}, Ŷ_{n+1}), ..., (X_{n+m}, Ŷ_{n+m}).
The estimate of m_Y is just the sample median of this imputed dataset. However, there will be Monte Carlo error in this estimator because every time we do the imputation, we will not get the same numbers (due to sampling from p(y|x, R = 0)). If we impute the data only once (this is often called single imputation), we may suffer a lot from the Monte Carlo error. Thus, a better approach is to perform multiple imputation.
Multiple Imputation.¹ After obtaining a complete dataset, we run the same imputation procedure again, which gives us another complete dataset. We keep repeating this process, leading to several complete datasets, which can be represented as
(X_1, Y_1), ..., (X_n, Y_n), (X_{n+1}, Ŷ_{n+1}^{(1)}), ..., (X_{n+m}, Ŷ_{n+m}^{(1)})
(X_1, Y_1), ..., (X_n, Y_n), (X_{n+1}, Ŷ_{n+1}^{(2)}), ..., (X_{n+m}, Ŷ_{n+m}^{(2)})
...
(X_1, Y_1), ..., (X_n, Y_n), (X_{n+1}, Ŷ_{n+1}^{(N)}), ..., (X_{n+m}, Ŷ_{n+m}^{(N)}).
We then combine all these datasets into one large dataset and compute the estimator of the parameter of interest (in our case, the median of the response variable). This estimator has a smaller Monte Carlo error.
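A minimal sketch of this pooling scheme, reusing the stochastic_impute function defined earlier; the number of imputations N = 20 and the bandwidth are arbitrary illustrative choices.

```python
import numpy as np

def multiple_impute_median(X_obs, Y_obs, X_miss, N=20, h=0.5, rng=None):
    """Run the stochastic imputation N times, pool the completed datasets,
    and return the sample median of the pooled responses."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = []
    for _ in range(N):
        Y_imp = stochastic_impute(X_obs, Y_obs, X_miss, h=h, rng=rng)
        pooled.append(np.concatenate([Y_obs, Y_imp]))   # one completed dataset
    return np.median(np.concatenate(pooled))            # estimate of m_Y
```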
1 For more introduction on this topic, see https://stats.idre.ucla.edu/stata/seminars/mi_in_stata_pt1_new/
When more than one variable is subject to missingness, the problem gets a lot more complex. Consider the case where each individual has d variables X_1, ..., X_d, all of them may be missing, and many of them may be missing at the same time. There are two categories of missing patterns:
1. Monotone missingness. In this case, if Xt is missing, then Xs is also missing for any s > t. This
occurs a lot in medical research due to dropout of the individuals. For instance, let Xt denote the
BMI of an individual at year t. If this individual left the study at time point τ , then we only observe
X1 , · · · , Xτ from this individual. Any information beyond year τ is missing.
2. Non-monotone missingness. When the missing pattern is not monotone, it is called non-monotone missingness. Non-monotone missing data are a lot more challenging than monotone missing data because there are many possible missing patterns that can occur in the data. If there are d variables, then monotone missing data have d different missing patterns, but the non-monotone case may have up to 2^d different missing patterns!
Let R ∈ {0, 1}^d be a binary vector that denotes the observed pattern, and use the notation X_R = (X_i : R_i = 1). For instance, with d = 5, R = 11001 means that we observe variables X_1, X_2, and X_5, and X_{11001} = (X_1, X_2, X_5). Under this notation, the MAR assumption can be written as

P(R = r | X) = P(R = r | X_r),

namely, the probability of seeing a pattern R = r depends only on the observed variables.
MAR is a very popular assumption in practice (although it may not be reasonable in some cases). However, in the non-monotone case, MAR tells us little about the missingness and is actually not easy to work with. Why, then, is MAR still so popular in practice?
There are two reasons why MAR is so popular. The first is that, in both the monotone and the non-monotone case, MAR makes likelihood inference a lot easier. The second is that, in the monotone missing data problem, MAR provides an elegant way to identify the entire distribution function.
MAR has a nice property called ignorability, which holds under both monotone and non-monotone missingness. Consider the joint density function p(x, r) of the variables of interest X and the missing pattern R. Recall that X_R = (X_i : R_i = 1) are the observed variables under pattern R. We also denote by X_{R̄} = (X_i : R_i = 0) the missing variables.
We can then factorize it into
p(x, r) = P (R = r|X = x)p(x).
Suppose we use parametric models separately for both P (R = r|X = x) and p(x), leading to
p(x, r; φ, θ) = P(R = r | X = x; φ) p(x; θ) = P(R = r | X_r = x_r; φ) p(x; θ)   (by MAR),
where θ is the parameter for modeling p(x) and φ is the parameter for modeling the missing probability P(R = r | X_r = x_r) (this separability of the parameters, together with MAR, is often called ignorability). In our data, what we observe is (x_r, r), so we should integrate over the missing variables x_{r̄}:
p(x_r, r; φ, θ) = ∫ p(x, r; φ, θ) dx_{r̄} = P(R = r | X_r = x_r; φ) ∫ p(x; θ) dx_{r̄}.
Taking the logarithm, the observed-data log-likelihood decomposes as

ℓ(φ, θ | x_r, r) = log p(x_r, r; φ, θ) = ℓ(φ | x_r, r) + ℓ(θ | x_r),

where

ℓ(φ | x_r, r) = log P(R = r | X_r = x_r; φ),
ℓ(θ | x_r) = log ∫ p(x; θ) dx_{r̄}.
The above factorization is very powerful: it decouples the problem of estimating θ from the problem of estimating φ! Namely, if we are only interested in the distribution of X, we do not even need to deal with φ. We just need to maximize ℓ(θ | x_r). So finding the MLE of θ can be done without estimating the parameter φ, leading to a simple procedure.
EM algorithm. Estimating θ by maximizing ℓ(θ | x_r) is often done via the EM algorithm. The EM algorithm is an iterative algorithm that finds a stationary point. It consists of two steps, an expectation step (E-step) and a maximization step (M-step). Given an initial guess of the parameter θ^{(0)}, the EM algorithm iterates the following two steps until convergence (t = 0, 1, 2, 3, ...):
1. E-step. Compute

Q(θ; θ^{(t)} | X_r) = E(ℓ(θ | X) | X_r; θ^{(t)}) = ∫ ℓ(θ | x_{r̄}, X_r) p(x_{r̄} | X_r; θ^{(t)}) dx_{r̄}.
2. M-step. Update
θ(t+1) = argmaxθ Q(θ; θ(t) |Xr ).
Under good conditions, the EM algorithm has the ascent property, i.e.,

ℓ(θ^{(t+1)} | x_r) ≥ ℓ(θ^{(t)} | x_r),

and will converge to a stationary point. However, the problem is that the stationary point is not guaranteed to be the global maximum (the MLE). It could be a local mode or even a saddle point.
A good introduction on the EM algorithm and missing data is Section 8 of the following textbook:
Little, R. J., & Rubin, D. B. (2019). Statistical analysis with missing data (Vol. 793). John
Wiley & Sons.
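To make the E-step and M-step concrete, here is a minimal sketch for one specific parametric model: (X, Y) bivariate normal with Y missing for some units under MAR. This model and all names are illustrative assumptions, not from the lecture; the E-step fills in E[Y | X] and E[Y² | X] under the current parameters, and the M-step updates the mean and covariance using the expected sufficient statistics.

```python
import numpy as np

def em_bivariate_normal(X, Y, R, n_iter=100):
    """EM for (X, Y) ~ N(mu, Sigma) when some Y's are missing (R = 0).

    X is fully observed; Y[i] is used only when R[i] == 1.
    Returns the estimated mean vector and covariance matrix.
    """
    obs = (R == 1)
    # Initialize with complete-case estimates.
    mu = np.array([X.mean(), Y[obs].mean()])
    Sigma = np.cov(np.vstack([X[obs], Y[obs]]))

    for _ in range(n_iter):
        # E-step: conditional mean/variance of the missing Y's given X.
        beta = Sigma[0, 1] / Sigma[0, 0]
        cond_mean = mu[1] + beta * (X - mu[0])
        cond_var = Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0]
        Ey = np.where(obs, Y, cond_mean)                         # E[Y_i | data]
        Ey2 = np.where(obs, Y ** 2, cond_mean ** 2 + cond_var)   # E[Y_i^2 | data]

        # M-step: update mean and covariance from expected sufficient statistics.
        mu = np.array([X.mean(), Ey.mean()])
        sxx = np.mean((X - mu[0]) ** 2)
        sxy = np.mean((X - mu[0]) * (Ey - mu[1]))
        syy = np.mean(Ey2 - 2 * mu[1] * Ey + mu[1] ** 2)
        Sigma = np.array([[sxx, sxy], [sxy, syy]])
    return mu, Sigma
```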
In the monotone missing data problem, let T denote the index of the last observed variable; namely, the individual drops out after time point T. We use the notation X_{≤t} = (X_1, ..., X_t). Then MAR can be
written as
P (T = t|X) = P (T = t|X≤t ).
The above equation gives us a very powerful result–we can estimate the missing probability P (T = t|X) for
every t = 1, · · · , d!
To see this, consider the case t = 1, where MAR implies

P(T = 1 | X) = P(T = 1 | X_1).

Note that P(T > 1 | X) = 1 − P(T = 1 | X) = 1 − P(T = 1 | X_1) = P(T > 1 | X_1). Thus, we can estimate P(T = 1 | X_1) by comparing the pattern T = 1 against T > 1 given the variable X_1, which is always observed.
Thus, P(T = 1 | X) is estimable. For t = 2, MAR implies
P (T = 2|X) = P (T = 2|X1 , X2 ).
Thus,
P (T > 2|X) = 1 − P (T = 2|X) − P (T = 1|X) = 1 − P (T = 2|X1 , X2 ) − P (T = 1|X1 ) = P (T > 2|X1 , X2 ).
Again, we can compare the pattern T = 2 against T > 2 and estimate the probability P(T = 2 | X). We can keep repeating this procedure, and eventually all the missing probabilities P(T = t | X) can be estimated.
For instance, if the parameter of interest is ρ = E(ω(X_1, ..., X_d)), we can use the IPW estimator² as in the causal inference problem:

ρ̂ = (1/n) Σ_{i=1}^{n} ω(X_{i,1}, ..., X_{i,d}) I(T_i = d) / P̂(T = d | X_i),
where P̂(T = d | X) is an estimate of P(T = d | X). Similar to the case of causal inference, P(T = t | X) is also called the propensity score.
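A rough sketch of this sequential estimation and the IPW estimator for the simplest monotone case d = 2 (so only X_2 can be missing); the data-generating model, the logistic-regression propensity model, and all names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000

# Illustrative monotone-missing data with d = 2: X1 is always observed,
# X2 is observed only when the dropout time T = 2.
X1 = rng.normal(size=n)
X2 = X1 + rng.normal(size=n)
p_drop = 1 / (1 + np.exp(-(-1.0 + X1)))          # MAR: P(T = 1 | X1)
T = np.where(rng.uniform(size=n) < p_drop, 1, 2)

# Estimate the propensity P(T = 2 | X1) by comparing T = 1 against T > 1.
clf = LogisticRegression().fit(X1.reshape(-1, 1), (T == 2).astype(int))
pi_hat = clf.predict_proba(X1.reshape(-1, 1))[:, 1]   # estimate of P(T = 2 | X1)

# IPW estimate of rho = E[omega(X1, X2)], here with omega(x1, x2) = x2.
rho_ipw = np.mean(np.where(T == 2, X2 / pi_hat, 0.0))
rho_cc = X2[T == 2].mean()                             # naive complete-case mean
print(f"true E[X2] = 0, complete-case = {rho_cc:.3f}, IPW = {rho_ipw:.3f}")
```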
Actually, MAR under monotone missingness is equivalent to the available case missing value (ACMV)
assumption:
p(xt+1 |x≤t , T = t) = p(xt+1 |x≤t , T > t)
for every t. The right-hand side can be estimated by a conditional KDE, so the density function³

p(x_{>t} | x_{≤t}, T = t) = ∏_{s=t}^{d−1} p(x_{s+1} | x_{≤s}, T = t) = ∏_{s=t}^{d−1} p(x_{s+1} | x_{≤s}, T > s)
can be estimated under the ACMV assumption. Why is it so useful that the above density is estimable? This is because the joint density function has the following pattern mixture model formulation:

p(x) = Σ_{t=1}^{d} p(x, t) = Σ_{t=1}^{d} p(x_{>t} | x_{≤t}, T = t) p(x_{≤t} | T = t) P(T = t),
where both p(x≤t | T = t) and P (T = t) can be directly estimated using our data so what remains unknown
is the density function p(x>t | x≤t , T = t). ACMV implies an estimator of this density function, so the entire
joint density function can be estimated. The equivalence between MAR and ACMV is shown in
Molenberghs, G., Michiels, B., Kenward, M. G., & Diggle, P. J. (1998). Monotone missing data
and pattern-mixture models. Statistica Neerlandica, 52(2), 153-161.
2 See https://en.wikipedia.org/wiki/Inverse_probability_weighting for more details.
3 Also called the extrapolation density.
Under MNAR, the missing data problem becomes a lot more complicated. There are two common strategies for handling MNAR: the selection model and the pattern mixture model approaches.
To simplify the problem, we consider the monotone missing data problem. Even in this scenario, we will see several identifiability issues, so we have to be very careful about our choice of model.
Recall that X denotes the study variables and T is the dropout time. We are interested in the full-data density p(x, t); note that p(x, t) determines the joint PDF of the study variables p(x).
A useful reference: https://content.sph.harvard.edu/fitzmaur/lda/C6587_C018.pdf.
The selection model approach factorizes the full-data density as

p(x, t) = P(T = t | x) p(x),

where P(T = t | x) is the selection probability.
The MAR and MCAR conditions are often expressed in a selection model framework. Formally, the MCAR
is
P (T = t|X) = P (T = t).
Namely, the probability of any dropout time is totally independent of the study variable X. The MAR is
P (T = t|X) = P (T = t|X≤t ).
In other words, the conditional probability of the dropout time depends only on the observed variables.
As we have mentioned, the selection model allows a simple way to construct a consistent estimator of a parameter of interest via the IPW procedure. Here is a simple example. Suppose that the parameter of interest is a linear statistical functional θ = θ(F) = ∫ ω(x) dF(x). Then it can be further written as

θ = ∫ ω(x) p(x) dx = ∫ ω(x) p(x, T = d) / P(T = d | x) dx = ∫ ω(x) dF(x, T = d) / P(T = d | x).
With an estimate of the selection probability P̂(T = d | x) (and we only need to estimate the probability of the fully observed case), a simple IPW estimator of θ is

θ̂_0 = ∫ ω(x) dF̂(x, T = d) / P̂(T = d | x) = (1/n) Σ_{i=1}^{n} ω(X_i) I(T_i = d) / P̂(T = d | X_i).    (4.2)
You can show that θ̂_0 is a consistent estimator (and it is asymptotically normal as well, by Slutsky's theorem). Moreover, the influence function (recall the bootstrap lecture note) of θ̂_0 can be easily derived, so the variance of θ̂_0 can be estimated via a plug-in estimate.
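As a rough sketch of such a plug-in variance estimate, one can use the empirical variance of the terms ω(X_i) I(T_i = d) / P̂(T = d | X_i); this simplified version treats the propensity score as known, ignoring the extra variability from estimating it, so it is only an approximation. The function and argument names below are my own.

```python
import numpy as np

def ipw_estimate_with_se(omega_vals, T, pi_hat, d):
    """IPW estimate as in (4.2) and a crude plug-in standard error.

    omega_vals[i] = omega(X_i) (only meaningful when T[i] == d),
    pi_hat[i]     = estimated P(T = d | X_i).
    The SE treats pi_hat as known, ignoring its estimation error.
    """
    terms = np.where(T == d, omega_vals / pi_hat, 0.0)   # influence-type terms
    theta_hat = terms.mean()
    se_hat = terms.std(ddof=1) / np.sqrt(len(terms))
    return theta_hat, se_hat
```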
Although θ̂_0 is elegant, it may not be the best estimator in the sense that, after estimating the propensity score P(T = t | x), we only rely on the completely observed data (the ones with T_i = d) to form the final estimator. Other observations are discarded entirely. Intuitively, this leads to an inefficient estimator.
To construct an efficient estimator, consider augmenting θ̂_0 with an additional term

θ̂_1 = θ̂_0 + (1/n) Σ_{i=1}^{n} (I(T_i = τ) − P̂(T_i = τ | X_{i,≤τ})) g_τ(X_{i,≤τ}) I(T_i = τ),
where τ < d is any time point and g_τ is a function of the variables x_{≤τ}. The augmented term has asymptotic mean 0, so θ̂_1 is still a consistent estimator. The insight here is that the function g_τ is something we can choose; namely, we can choose it to minimize the variance of θ̂_1, and this may lead to a reduction in the total variance compared to the estimator θ̂_0. The same idea can be applied to every time point τ = 1, ..., d − 1, leading to the augmented inverse probability weighting (AIPW) estimator
θ̂_AIPW = θ̂_0 + (1/n) Σ_{i=1}^{n} Σ_{τ=1}^{d−1} (I(T_i = τ) − P̂(T_i = τ | X_{i,≤τ})) g_τ(X_{i,≤τ}) I(T_i = τ).
With a proper choice of g_τ, τ = 1, ..., d − 1, we can construct the estimator with the smallest variance, i.e., an efficient estimator. How to construct the functions g_τ, τ = 1, ..., d − 1, is a central topic of semiparametric inference.
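As a concrete illustration, here is a rough sketch of the d = 2 special case written in the standard doubly robust form θ̂_AIPW = (1/n) Σ_i [ω(X_i) I(T_i = 2)/π̂(X_{i,1}) − (I(T_i = 2) − π̂(X_{i,1})) m̂(X_{i,1})/π̂(X_{i,1})], which corresponds to the choice g_1(x_1) = m̂(x_1)/π̂(x_1), where m̂(x_1) estimates E[ω(X) | X_1 = x_1]; the regression model for m̂ and all names are illustrative assumptions, not from the lecture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def aipw_mean_x2(X1, X2, T, pi_hat):
    """Doubly robust (AIPW) estimate of E[X2] for the d = 2 monotone case.

    X1 is always observed; X2[i] is used only when T[i] == 2;
    pi_hat[i] is an estimate of P(T = 2 | X1[i]).
    """
    obs = (T == 2)
    # Outcome model m(x1) = E[X2 | X1 = x1], fit on the complete cases
    # (valid under MAR since T is independent of X2 given X1).
    reg = LinearRegression().fit(X1[obs].reshape(-1, 1), X2[obs])
    m_hat = reg.predict(X1.reshape(-1, 1))

    ipw_term = np.where(obs, X2 / pi_hat, 0.0)              # omega(x) = x2 here
    aug_term = (obs.astype(float) - pi_hat) * m_hat / pi_hat
    return np.mean(ipw_term - aug_term)

# Example: aipw_mean_x2(X1, X2, T, pi_hat) with the arrays from the IPW sketch.
```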
Note that sometimes the AIPW (and IPW) estimators are constructed from solving an estimating equation.
This occurs when the parameter of interest θ0 = θ(F ) is defined through solving the equation
0 = E(S(X; θ_0)) = ∫ S(x; θ_0) dF(x) = ∫ S(x; θ_0) dF(x, T = d) / P(T = d | x),

and we can augment it with a set of mean-0 terms to improve the efficiency.
If you are interested in the construction of AIPW, I would recommend the following textbook:
Tsiatis, A. (2007). Semiparametric theory and missing data. Springer Science & Business Media.
Note: although we introduced AIPW estimators in the MNAR framework, they are often used in the MAR scenario because the identification of the propensity score/selection probability P(T = t | X) is challenging under MNAR. MAR is a simple case where we can identify the propensity score entirely, so AIPW estimators can be constructed easily. Essentially, as long as you can identify the selection probability, you can construct an IPW estimator and attempt to augment it into an AIPW estimator to improve the efficiency. So the direction of research is often on how to identify the selection probability.
The pattern mixture models (PMMs) instead factorize the full-data density as

p(x, t) = p(x_{>t} | x_{≤t}, t) p(x_{≤t} | t) P(T = t),

where the first term p(x_{>t} | x_{≤t}, t) is called the extrapolation density and the latter two terms p(x_{≤t} | t) P(T = t) are called the observed-data density. The extrapolation density is unobservable and unidentifiable: it describes the distribution of the missing entries. The observed-data density is identifiable since at each dropout time T = t, we do observe the variables x_1, ..., x_t.
The PMMs provide a clean separation between what is identifiable and what is not. So the strategy for identifying p(x, t) is to make the extrapolation density identifiable.
In monotone missing problems, the extrapolation density has the following product form:
p(x_{>t} | x_{≤t}, t) = ∏_{s=t+1}^{d} p(x_s | x_{<s}, T = t).
Thus, it suffices to identify each term in the product form to identify the extrapolation density. Several identifying restrictions have been proposed in the literature. For instance, the complete case missing value (CCMV) restriction equates

p(x_s | x_{<s}, T = t) = p(x_s | x_{<s}, T = d)    (CC),

the ACMV restriction (introduced earlier) requires

p(x_s | x_{<s}, T = t) = p(x_s | x_{<s}, T ≥ s)    (AC),

and the nearest case missing value (NCMV) restriction requires

p(x_s | x_{<s}, T = t) = p(x_s | x_{<s}, T = s)    (NC).
In general, one can specify any subset of patterns A_{ts} ⊂ {s, s + 1, ..., d} and construct a corresponding identifying restriction

p(x_s | x_{<s}, T = t) = p(x_s | x_{<s}, T ∈ A_{ts});
this is called the donor-based identifying restriction in the following paper:
Chen, Y. C., & Sadinle, M. (2019). Nonparametric Pattern-Mixture Models for Inference with
Missing Data. arXiv preprint arXiv:1904.11085.
If you make any of these assumptions, the extrapolation density can be identified from the data so you can
then estimate the full-data density p(x, t).
Here is a nice review on PMMs for MNAR:
Linero, A. R., & Daniels, M. J. (2018). Bayesian approaches for missing not at random outcome
data: The role of identifying restrictions. Statistical Science, 33(2), 198-213.
In the previous section, we introduced the idea of imputation when only one variable is missing, but it can be applied to cases with multiple missing entries. Suppose that we have an imputation procedure such that, if we observe X_{≤T} = (X_1, ..., X_T) and the dropout time T, the procedure generates random numbers X_{>T} = (X_{T+1}, ..., X_d) from a distribution Q.
Then you can always view this imputation procedure as a PMM in which the PDF corresponding to the imputation distribution Q is the underlying model for the extrapolation density. So any imputation method can be viewed as implicitly handling the problem with a PMM.
From this point of view, you may notice that if we always impute the same number when observing (X_{≤T}, T), then this imputation procedure is problematic: the corresponding imputation distribution is not a good estimator of the underlying extrapolation distribution unless we are interested in some very special parameter of interest. The commonly used mean imputation or median imputation is thus a bad idea to apply in practice.
In MNAR, we need to make identifying restrictions so that the full-data distribution F (x, t) (or p(x, t)) is
identifiable. However, there is one property that an identifying restriction should have: the implied joint
distribution should be compatible/consistent with what we observe. This property is called nonparametric
saturation/nonparametric identification/just identification.
The idea is simple: because we can identify F(x, t), we can pretend that the implied joint distribution is the true generating distribution and generate a new missing dataset from it. The generated missing data should be similar to the original data we have.
MAR and the pattern mixture models above satisfy this property (when we estimate the joint distribution via a nonparametric estimator). However, some identifying restrictions, such as MCAR, do not satisfy it. Whenever you propose a new MNAR restriction, you should always think about whether the implied full-data distribution satisfies this property.
Sensitivity analysis is a common procedure in handling missing data problems. In short, sensitivity analysis perturbs the missing data assumption a bit and examines how the conclusion changes. This is often required in handling missing data because, as we have shown previously, there is no way to check whether a missing data assumption is correct (unless we have additional information), so our conclusion relies heavily on our assumption about the missingness. By perturbing the assumption on the missingness, we are able to examine whether our conclusion is robust to the missing data assumption.
Under MAR, one common approach to sensitivity analysis is to introduce the model (called the exponential tilting strategy)

log [ P(T = t | X) / P(T = t | X_{≤t}) ] = γ^T X,

where γ = 0 recovers the MAR assumption; by varying γ around 0 and re-running the analysis, we can see how sensitive the conclusion is to departures from MAR.
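To connect this back to the single-missing-response example of Section 4.1.2, here is a rough sketch of one way such a tilt could be used there (this mapping is my own illustration, not from the lecture): if we posit p(y | x, R = 0) ∝ e^{γy} p(y | x, R = 1), we can approximate a draw from the tilted density by selecting the donor with probability proportional to e^{γ Y_i} instead of uniformly; γ = 0 recovers the MAR imputation above.

```python
import numpy as np

def tilted_impute(X_obs, Y_obs, X_miss, gamma=0.0, h=0.5, rng=None):
    """Stochastic imputation under an exponential-tilting perturbation.

    Donors with covariate X_i = x are selected with probability
    proportional to exp(gamma * Y_i); gamma = 0 gives the MAR imputation.
    """
    rng = np.random.default_rng() if rng is None else rng
    Y_imp = np.empty(len(X_miss))
    for j, x in enumerate(X_miss):
        donors = Y_obs[X_obs == x]
        gy = gamma * donors
        w = np.exp(gy - gy.max())                 # stabilized tilting weights
        Y_I = rng.choice(donors, p=w / w.sum())   # tilted donor selection
        Y_imp[j] = rng.normal(Y_I, h)             # kernel noise as before
    return Y_imp

# Sensitivity analysis: recompute the median over a grid of gamma values, e.g.
# for gamma in (-0.5, 0.0, 0.5): ...
```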
When the missingness is non-monotone (which occurs very often in survey samples), the problem becomes a lot more complicated. Even if we are willing to assume MAR, the full-data distribution p(x) may not be unique. The following paper proposed a pattern mixture model that yields a full-data distribution satisfying MAR:
Robins, J. M., & Gill, R. D. (1997). Non-response models for the analysis of non-monotone ignorable missing data. Statistics in Medicine, 16(1), 39-56.
However, it only identifies one full-data distribution satisfying MAR, not all possible distributions.
The problem is even more challenging in the MNAR case. In general, the non-monotone MNAR problem is still very much an open problem. There are some attempts to deal with it, but we have very limited options. Here is some recent work related to non-monotone MNAR:
1. Sadinle, M., & Reiter, J. P. (2017). Itemwise conditionally independent nonresponse modelling
for incomplete multivariate data. Biometrika, 104(1), 207-220.
2. Tchetgen, E. J. T., Wang, L., & Sun, B. (2018). Discrete choice models for nonmonotone
nonignorable missing data: Identification and inference. Statistica Sinica, 28(4), 2069-2088
3. Malinsky, D., Shpitser, I., & Tchetgen, E. J. T. (2019). Semiparametric Inference for
Non-monotone Missing-Not-at-Random Data: the No Self-Censoring Model. arXiv preprint
arXiv:1909.01848.
4. Chen, Y. C., & Sadinle, M. (2019). Nonparametric Pattern-Mixture Models for Inference with
Missing Data. arXiv preprint arXiv:1904.11085.
In particular, the first and the third papers consider the following interesting assumption:
Xj ⊥ Rj |X−j , R−j ,
where R_j ∈ {0, 1} is the response indicator, with R_j = 1 if variable X_j is observed. This assumption is known as the ICIN (itemwise conditionally independent nonresponse) or NSC (no self-censoring) assumption. It has a beautiful graphical representation induced by the conditional independence.