
STAT 542: Multivariate Analysis Spring 2021

Lecture 4: Missing data and parametric models


Instructor: Yen-Chi Chen

4.1 Missing data: introduction and simple cases

Missing data is a very common problem in scientific research. In survey sampling, it occurs when individuals refuse to answer some questions. In medical research, it happens when participants drop out of a study.
There are three common strategies that practitioners use to handle missing data:

• Complete-case analysis (ignoring observations with missing entries). The complete-case
analysis removes any observation that contains one or more missing entries. When the proportion
of missing values is small (and the missingness is unrelated to any variable, including the one that can be
missing), this is an acceptable procedure. In general, however, it leads to a biased estimate.
To see this, think about estimating the average income of a city from a social survey. Many rich people
may refuse to report their income (it would be easy to identify them), leading to missing entries. In
this scenario, if we ignore the individuals whose income is missing, we will get a biased estimate of
the average income.
In observational studies in medical research, people sometimes perform the analysis by adjusting
the inclusion criteria: the criteria that determine which individuals are included in the analysis.
If the criteria require individuals to be fully observed, this is essentially a complete-case analysis.

• Ignorable missingness (missing at random). Another common approach is to make assumptions
and choose a model under which the missingness is ignorable. Note that in this case we do NOT remove
observations with missing entries; we still use their observed variables to construct our model. This
is possible when we assume the missingness is missing at random (MAR; see Section 4.2) and use a
proper parametric model.
However, MAR is just an assumption. It may be violated (this is often called missing not at random,
MNAR). When MAR is violated, it is often hard to obtain an ignorable-missingness approach to
deal with the missing data. Note that sometimes we are still able to construct an ignorable
procedure using a selection model and an inverse probability weighting estimator (see Section 4.3.1).

• Imputation. Imputation is another popular approach that practitioners use to handle missing
data. The idea is very simple: we fill in each missing entry with a proper value, which yields
a complete dataset. Then we can treat the problem as if there were no missingness.
Here is a caveat. If the imputation is done in a deterministic way, i.e., every time a missing entry is
imputed it is always imputed with the same fixed number, the imputed data are often problematic because we
do not take into account the intrinsic variation of the missing value. This leads to bias in the
subsequent estimation procedure.
A better approach is stochastic imputation, in which we impute the missing entries by drawing
from a distribution. Later we will show that if the distribution being drawn from is the actual distribution
that generates the data, stochastic imputation leads to a dataset without any bias (Section 4.1.2).


A challenge here is that, in general, we do not know the actual distribution, so how to perform the
stochastic imputation is itself a problem.

4.1.1 Simple cases

Consider a regression problem where we have a binary covariate X ∈ {0, 1} and a continuous response Y ∈ R.
However, in our data, some response variables are missing and only the covariates are observed. So our data
can be represented as
(X1 , Y1 ), · · · , (Xn , Yn ), (Xn+1 , ?), · · · , (Xn+m , ?).
The symbol ? denotes a missing value. Namely, we have n observations that are fully observed, while for the
other m observations we only observe the covariate, not the response. Suppose that the parameter of
interest is the marginal median mY of the response variable. How should we estimate the median?
We can introduce an additional variable R to denote the missingness such that R = 0 means that Y is not
observed whereas R = 1 means that Y is observed. Note that R itself is another random variable.
Without any assumptions on the missing data, we are not able to estimate the median consistently.
There are two common assumptions people make about the missingness:

1. MCAR: missing completely at random. This means that the missingness is independent of any
variables. Under the above notations, MCAR means that

R ⊥ X, Y.

2. MAR: missing at random. Under MAR, the missingness depends only on the observed pattern. In
our case,
P (R = 0|X, Y ) = P (R = 0|X),
since Y is not observed when R = 0.

When the missingness is neither MCAR nor MAR, it is called MNAR: missing not at random.
Under MCAR, we can completely ignore the data with missing values and just use the sample median of the
observed responses as an estimate of mY . However, under MAR, we cannot do so because the missingness may
depend on X; if the distribution of the covariate differs between the fully observed data (R = 1) and the
partially observed data (R = 0), we will obtain a biased estimate.
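To see the bias concretely, here is a small simulation sketch (the data-generating mechanism and all numbers are hypothetical, chosen only for illustration): the response distribution differs across the two covariate groups, and the response probability depends only on X, so MAR holds but the complete-case median is biased.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: binary covariate, continuous response.
X = rng.binomial(1, 0.5, size=n)
Y = rng.normal(loc=np.where(X == 1, 2.0, 0.0), scale=1.0)

# MAR missingness: P(R = 1 | X, Y) depends only on X, not on Y.
p_obs = np.where(X == 1, 0.3, 0.9)
R = rng.binomial(1, p_obs)

print("true marginal median of Y :", np.median(Y))          # about 1 by symmetry
print("complete-case median      :", np.median(Y[R == 1]))  # biased downward: the X = 1 group is under-represented
```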
While there are other ways to estimate the median under MAR, we will focus on the method of imputation.

4.1.2 Imputation

The idea of imputation is to impute a value for each missing entry so that, after imputing all missing entries,
we obtain a dataset without any missingness. Then we can simply apply a regular estimator (in the above
example, the sample median) to estimate the parameter of interest.
However, we cannot impute an arbitrary number for a missing entry because this would cause bias in the estimation.
We need to impute the value in a smart way. Generally, we want to impute the value according to the
conditional density
p(y|x, R = 0),

the conditional density of the response variable Y given the covariate X and the missing pattern R = 0. Namely,
for the (n + i)-th observation, where only Xn+i is observed, we want to draw a random number

Ỹn+i ∼ p(y|Xn+i , R = 0).

If the imputed values Ỹn+1 , · · · , Ỹn+m are indeed from the above density function, one can show that the sample median

median{Y1 , · · · , Yn , Ỹn+1 , · · · , Ỹn+m }

is an unbiased estimator of mY .
This idea works regardless of the missingness assumption. However, the problem is that the density function
p(y|x, R = 0) cannot be estimated from our data because we only observe Y when R = 1.
In this situation, MAR implies a powerful result:

p(y|x, R = 0) = p(y|x, R = 1). (4.1)

Namely, the conditional density of Y given X is independent of the missing indicator R. To see how equation
(4.1) is derived, note that MAR implies

P (R = 1|X, Y ) = 1 − P (R = 0|X, Y ) = 1 − P (R = 0|X) = P (R = 1|X).

Thus, the conditional density satisfies
$$
\begin{aligned}
p(y\mid x, R=0) &= \frac{p(y, x, R=0)}{P(x, R=0)} \\
&= \frac{p(x, y)\,P(R=0\mid x, y)}{P(x, R=0)} \\
&= p(x, y)\,\frac{P(R=0\mid x)}{P(x, R=0)} \\
&= p(x, y)\,\frac{1}{p(x)} \\
&= p(x, y)\,\frac{P(R=1\mid x)}{P(x, R=1)} \\
&= \frac{p(x, y)\,P(R=1\mid x, y)}{P(x, R=1)} \\
&= \frac{p(y, x, R=1)}{P(x, R=1)} \\
&= p(y\mid x, R=1).
\end{aligned}
$$

Thus, we obtain equation (4.1).


The power of equation (4.1) is that p(y|x, R = 1) can be estimated by a KDE:
$$
\hat p(y\mid x, R=1) = \frac{\frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{Y_i - y}{h}\right) I(X_i = x)}{\frac{1}{n}\sum_{i=1}^{n} I(X_i = x)} = \frac{1}{n_x h}\sum_{i=1}^{n} K\!\left(\frac{Y_i - y}{h}\right) I(X_i = x),
$$
where $n_x = \sum_{i=1}^{n} I(X_i = x)$ is the number of observations with $X_i = x$ among the completely observed data
and x ∈ {0, 1}. Namely, p̂(y|x, R = 1) is the KDE applied to the completely observed data with covariate X = x.

Given an observation Xn+i = x, how should we sample Ŷn+i from p̂(y|x, R = 1)? It is very simple. We first
sample an index I such that
$$
P(I = i\mid \text{data}) = \frac{1}{n_x}\, I(X_i = x).
$$
Namely, I is chosen uniformly from the fully observed data points with covariate Xi = x. Given
I, we then sample Ŷn+i from the density function
$$
q(y) = \frac{1}{h} K\!\left(\frac{Y_I - y}{h}\right).
$$
Although this may look scary, if the kernel function is Gaussian, q(y) is the normal density with mean YI
and variance h2 . Namely, when K is Gaussian,
$$
\hat Y_{n+i} \sim N(Y_I, h^2).
$$
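The whole procedure fits in a few lines. Below is a minimal sketch for this single-missing-variable setting, assuming MAR and a Gaussian kernel; the function name impute_once and the default bandwidth are illustrative choices, not part of the notes.

```python
import numpy as np

def impute_once(X_obs, Y_obs, X_mis, h=0.3, rng=None):
    """One stochastic imputation pass (MAR, Gaussian kernel).

    X_obs, Y_obs : the fully observed pairs (R = 1), with X binary in {0, 1}.
    X_mis        : covariates of the cases whose response is missing (R = 0).
    h            : kernel bandwidth.
    Returns one imputed response for each entry of X_mis.
    """
    rng = np.random.default_rng() if rng is None else rng
    Y_tilde = np.empty(len(X_mis))
    for j, x in enumerate(X_mis):
        donors = Y_obs[X_obs == x]       # fully observed cases with X_i = x
        Y_I = rng.choice(donors)         # draw the donor index I uniformly
        Y_tilde[j] = rng.normal(Y_I, h)  # draw from N(Y_I, h^2), i.e., the Gaussian-kernel KDE
    return Y_tilde
```

Drawing a donor uniformly and then adding N(0, h2) noise is exactly a draw from the KDE p̂(y|x, R = 1) above; as h → 0 this degenerates into a hot-deck draw of an observed response.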

Remark.

• The use of a KDE is just one example. You can use any density estimator for p̂(y|x, R = 1) as long as
you are able to sample from it.
• Equation (4.1) relies on the MAR assumption together with the fact that only one variable is subject
to missingness. When more than one variable can be missing, we no longer have such a
simple equivalence.
• The imputed data can be used for other estimators as well, not only for estimating the median. You
may notice that our imputation process does not use any information about the estimator.
• There are imputation methods that impute a fixed, non-random number for each missing entry.
This is often called deterministic imputation. For certain problems a deterministic imputation works,
but in general it may not. So a rule of thumb is to use a random imputation whenever possible.

4.1.3 Multiple imputation

After imputing all missing entries, we obtain a complete dataset
$$
(X_1, Y_1), \cdots, (X_n, Y_n), (X_{n+1}, \hat Y_{n+1}), \cdots, (X_{n+m}, \hat Y_{n+m}).
$$
The estimate of mY is just the sample median of this imputed dataset. However, there will be Monte Carlo
error in this estimator because every time we do the imputation, we will not get the same numbers (due to
sampling from p(y|x, R = 0)). If we impute the data only once (this is often called single imputation), we
may suffer a lot from the Monte Carlo error. Thus, a better approach is to perform multiple imputation.
Multiple Imputation.1 After obtaining a complete dataset, we run the same imputation procedure again,
which gives us another complete dataset. We keep repeating this process, leading to several
complete datasets, which can be represented as
$$
\begin{aligned}
&(X_1, Y_1), \cdots, (X_n, Y_n), (X_{n+1}, \hat Y^{(1)}_{n+1}), \cdots, (X_{n+m}, \hat Y^{(1)}_{n+m})\\
&(X_1, Y_1), \cdots, (X_n, Y_n), (X_{n+1}, \hat Y^{(2)}_{n+1}), \cdots, (X_{n+m}, \hat Y^{(2)}_{n+m})\\
&\qquad\qquad\vdots\\
&(X_1, Y_1), \cdots, (X_n, Y_n), (X_{n+1}, \hat Y^{(N)}_{n+1}), \cdots, (X_{n+m}, \hat Y^{(N)}_{n+m}).
\end{aligned}
$$
We then combine all these datasets into one large dataset and compute the estimator of the parameter of interest
(in our case, the median of the response variable). This estimator has a smaller Monte Carlo error.
1 For more introduction on this topic, see https://stats.idre.ucla.edu/stata/seminars/mi_in_stata_pt1_new/
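A minimal sketch of the multiple-imputation procedure, pooling the N completed datasets as described above; it relies on the hypothetical impute_once() function sketched in Section 4.1.2.

```python
import numpy as np

def multiple_imputation_median(X_obs, Y_obs, X_mis, N=50, h=0.3, rng=None):
    """Multiple-imputation estimate of the marginal median of Y.

    Repeats the stochastic imputation N times and, as in the notes,
    pools all completed datasets before taking the sample median."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = []
    for _ in range(N):
        Y_tilde = impute_once(X_obs, Y_obs, X_mis, h=h, rng=rng)   # one completed dataset
        pooled.append(np.concatenate([Y_obs, Y_tilde]))
    return np.median(np.concatenate(pooled))
```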

4.2 Missing data: general problems and missing at random

When more than one variable is subject to missingness, the problem gets a lot more complex.
Consider the case where each individual has d variables X1 , · · · , Xd , all of which may be missing, possibly
several at the same time. There are two categories of missing patterns:

1. Monotone missingness. In this case, if Xt is missing, then Xs is also missing for every s > t. This
occurs a lot in medical research due to dropout of individuals. For instance, let Xt denote the
BMI of an individual at year t. If this individual left the study at time point τ , then we only observe
X1 , · · · , Xτ from this individual. Any information beyond year τ is missing.
2. Non-monotone missingness. When the missing pattern is not monotone, it is called non-monotone
missingness. Non-monotone missing data are a lot more challenging than monotone missing data
because there are many possible missing patterns that can occur. If there are d variables,
monotone missing data have d different missing patterns, but the non-monotone case may have up
to 2^d different missing patterns!

Let R ∈ {0, 1}^d be a multi-index that denotes the observed pattern, and write XR = (Xi :
Ri = 1). For instance, when d = 5, R = 11001 means that we observe variables X1 , X2 , and X5 , and X11001 = (X1 , X2 , X5 ).
Under this notation, the MAR assumption can be written as

P (R = r|X) = P (R = r|Xr ),

namely, the probability of seeing a pattern R = r depends only on the observed variables.
MAR is a very popular assumption in practice (although it may not be reasonable
in some cases). However, in the non-monotone case, MAR tells us little about the missingness and is
actually not very easy to work with. Why, then, is MAR still so popular in practice?
There are two reasons. The first is that in both the monotone and non-monotone cases, MAR makes
likelihood inference a lot easier. The second is that under monotone missingness, MAR provides an
elegant way to identify the entire distribution function.

4.2.1 Likelihood inference with MAR

MAR has a nice property called ignorability, which holds under both monotone and non-monotone
missingness. Consider the joint density function p(x, r) of the variable of interest X and the missing
pattern R. Recall that XR = (Xi : Ri = 1) are the observed variables under pattern R. We also write
XR̄ = (Xi : Ri = 0) for the missing variables.
We can factorize the joint density into
p(x, r) = P (R = r|X = x)p(x).
Suppose we use separate parametric models for P (R = r|X = x) and p(x), leading to
$$
p(x, r; \phi, \theta) = P(R = r\mid X = x; \phi)\, p(x; \theta) \overset{(MAR)}{=} P(R = r\mid X_r = x_r; \phi)\, p(x; \theta),
$$
where θ is the parameter for modeling p(x) and φ is the parameter for modeling the missing probability
P (R = r|Xr = xr ) (this separability of the parameters, together with MAR, is often called ignorability). In our
data, what we observe is (xr , r), so we should integrate over the missing variables xr̄ :
$$
p(x_r, r; \phi, \theta) = \int p(x, r; \phi, \theta)\, dx_{\bar r} = P(R = r\mid X_r = x_r; \phi) \int p(x; \theta)\, dx_{\bar r}.
$$

Thus, the log-likelihood function is
$$
\begin{aligned}
\ell(\theta, \phi\mid x_r, r) &= \log P(R = r\mid X_r = x_r; \phi) + \log \int p(x; \theta)\, dx_{\bar r}\\
&= \ell(\phi\mid x_r, r) + \ell(\theta\mid x_r),\\
\ell(\phi\mid x_r, r) &= \log P(R = r\mid X_r = x_r; \phi),\\
\ell(\theta\mid x_r) &= \log \int p(x; \theta)\, dx_{\bar r}.
\end{aligned}
$$

The above factorization is very powerful: it decouples the problem of estimating θ from the problem of
estimating φ!
Namely, if we are only interested in the distribution of X, we do not even need to deal with φ. We just need
to maximize ℓ(θ|xr ). So the MLE of θ can be found without estimating the parameter φ, leading to
a simple procedure.
EM algorithm. Maximizing ℓ(θ|xr ) is often done via the EM algorithm. The EM algorithm
is an iterative algorithm that finds a stationary point. It consists of two steps, an expectation step (E-step)
and a maximization step (M-step). Given an initial guess of the parameter θ(0) , the EM algorithm iterates the
following two steps until convergence (t = 0, 1, 2, 3, · · · ):

1. E-step. Compute
$$
Q(\theta; \theta^{(t)}\mid X_r) = E\big(\ell(\theta\mid X)\,\big|\, X_r; \theta^{(t)}\big) = \int \ell(\theta\mid x_{\bar r}, X_r)\, p(x_{\bar r}\mid X_r; \theta^{(t)})\, dx_{\bar r}.
$$
2. M-step. Update
$$
\theta^{(t+1)} = \operatorname{argmax}_{\theta}\, Q(\theta; \theta^{(t)}\mid X_r).
$$

Note that in practice we have n observations, so the Q function will be
$$
Q_n(\theta; \theta^{(t)}) = \frac{1}{n}\sum_{i=1}^{n} Q(\theta; \theta^{(t)}\mid X_{i, R_i})
$$
and the M-step will be
$$
\theta^{(t+1)} = \operatorname{argmax}_{\theta}\, Q_n(\theta; \theta^{(t)}).
$$

Under suitable conditions, the EM algorithm has the ascent property, i.e.,
$$
\ell(\theta^{(t+1)}\mid X_r) \ge \ell(\theta^{(t)}\mid X_r),
$$
and it converges to a stationary point. However, the stationary point is not guaranteed
to be the global maximum (the MLE). It could be a local mode or even a saddle point.
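To make the E- and M-steps concrete, here is a minimal sketch for one specific parametric model: a bivariate normal (X, Y) in which X is always observed and Y is MAR. In this model the E-step only requires the conditional mean and variance of the missing Y's, and the M-step has a closed form. All function and variable names are our own illustrative choices.

```python
import numpy as np

def em_bivariate_normal(X, Y, R, n_iter=500, tol=1e-8):
    """EM for a bivariate normal (X, Y) where some Y's are missing at random.

    X : (n,) fully observed; Y : (n,) with np.nan where missing; R : (n,) 0/1 indicator.
    Returns the estimated mean vector and covariance matrix.
    """
    obs = R == 1
    mu = np.array([X.mean(), Y[obs].mean()])                      # initialize with complete cases
    Sigma = np.cov(np.column_stack([X[obs], Y[obs]]), rowvar=False)
    for _ in range(n_iter):
        # E-step: conditional moments of the missing Y's given the observed X's.
        beta = Sigma[0, 1] / Sigma[0, 0]
        cond_var = Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0]
        Ey = np.where(obs, Y, mu[1] + beta * (X - mu[0]))          # E[Y_i | X_i]
        Ey2 = np.where(obs, Y ** 2, Ey ** 2 + cond_var)            # E[Y_i^2 | X_i]
        # M-step: maximize Q by plugging the expected sufficient statistics into the usual MLE.
        mu_new = np.array([X.mean(), Ey.mean()])
        cov_xy = np.mean(X * Ey) - mu_new[0] * mu_new[1]
        Sigma_new = np.array([
            [np.mean(X ** 2) - mu_new[0] ** 2, cov_xy],
            [cov_xy, np.mean(Ey2) - mu_new[1] ** 2],
        ])
        if np.abs(mu_new - mu).max() < tol and np.abs(Sigma_new - Sigma).max() < tol:
            return mu_new, Sigma_new
        mu, Sigma = mu_new, Sigma_new
    return mu, Sigma
```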
A good introduction on the EM algorithm and missing data is Section 8 of the following textbook:

Little, R. J., & Rubin, D. B. (2019). Statistical analysis with missing data (Vol. 793). John
Wiley & Sons.

4.2.2 MAR under monotone case

Under the monotone missingness problem, let T denote the index of the last observed variable. Namely, the
individual drops out after time point T . We use the notation X≤t = (X1 , · · · , Xt ). Then MAR can be
written as
P (T = t|X) = P (T = t|X≤t ).
This gives us a very powerful result: we can estimate the missing probability P (T = t|X) for
every t = 1, · · · , d!
To see this, consider the case t = 1, where MAR implies
P (T = 1|X) = P (T = 1|X1 ).
Note that P (T > 1|X) = 1 − P (T = 1|X) = P (T ≠ 1|X1 ) = P (T > 1|X1 ). Thus, we can estimate
P (T = 1|X1 ) by comparing the pattern T = 1 against T > 1 given the variable X1 , which is always observed.
Thus, P (T = 1|X) is estimable. For t = 2, MAR implies
P (T = 2|X) = P (T = 2|X1 , X2 ).
Thus,
P (T > 2|X) = 1 − P (T = 2|X) − P (T = 1|X) = 1 − P (T = 2|X1 , X2 ) − P (T = 1|X1 ) = P (T > 2|X1 , X2 ).
Again, we can compare the pattern T = 2 against T > 2 and estimate the probability P (T = 2|X). We can
keep repeating this procedure, and eventually every missing probability P (T = t|X) can be estimated.
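This sequential argument translates directly into an estimation recipe: for each t, fit a model for the discrete dropout hazard P(T = t | T ≥ t, X≤t) using only the individuals with T ≥ t (for whom X≤t is observed), and multiply the fitted survival factors. Below is a sketch using logistic regression for each hazard; the model choice and the function name are our own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def full_obs_propensity(X, T, d):
    """Sketch: estimate P(T = d | X) under monotone MAR via discrete hazards.

    X : (n, d) array whose first T_i columns are observed for individual i.
    T : (n,) dropout times in {1, ..., d}; T_i = d means fully observed.
    Returns an (n,) array; for the fully observed cases the entry equals the full
    product prod_t (1 - hazard_t) = P(T = d | X_i), which is all the IPW estimator needs.
    """
    surv = np.ones(len(T))
    for t in range(1, d):                               # hazards at t = 1, ..., d - 1
        at_risk = T >= t                                # X_{<= t} is observed for these cases
        y = (T[at_risk] == t).astype(int)               # compare the pattern T = t against T > t
        model = LogisticRegression().fit(X[at_risk, :t], y)
        hazard = model.predict_proba(X[at_risk, :t])[:, 1]
        surv[at_risk] *= 1.0 - hazard
    return surv
```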
For instance, if the parameter of interest is ρ = E(ω(X1 , · · · , Xd )), we can use the IPW estimator2,
as in causal inference problems:
$$
\hat\rho = \frac{1}{n}\sum_{i=1}^{n} \frac{\omega(X_{i,1}, \cdots, X_{i,d})\, I(T_i = d)}{\hat P(T = d\mid X_i)},
$$
where P̂(T = d|Xi ) is an estimate of P (T = d|X). Similar to the causal inference setting, P (T = t|X) is also
called the propensity score.
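Given the estimated propensities (for instance from the hypothetical full_obs_propensity() sketched above), the IPW estimator ρ̂ is a one-line computation; the function name is ours.

```python
import numpy as np

def ipw_estimate(omega_values, T, propensity, d):
    """IPW estimate of rho = E[omega(X_1, ..., X_d)] under monotone MAR.

    omega_values : (n,) values of omega(X_i); only the fully observed cases are used.
    propensity   : (n,) estimates of P(T = d | X_i).
    """
    full = T == d
    return np.sum(omega_values[full] / propensity[full]) / len(T)
```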
Actually, MAR under monotone missingness is equivalent to the available case missing value (ACMV)
assumption:
$$
p(x_{t+1}\mid x_{\le t}, T = t) = p(x_{t+1}\mid x_{\le t}, T > t)
$$
for every t. The right-hand side can be estimated by a conditional KDE, so the density function3
$$
p(x_{> t}\mid x_{\le t}, T = t) = \prod_{s=t}^{d-1} p(x_{s+1}\mid x_{\le s}, T > s)
$$

can be estimated under the ACMV assumption. Why is it useful that this density is estimable? The reason is
that the joint density has the following pattern-mixture model formulation:
$$
p(x) = \sum_{t=1}^{d} p(x, t) = \sum_{t=1}^{d} p(x_{> t}\mid x_{\le t}, T = t)\, p(x_{\le t}\mid T = t)\, P(T = t),
$$
where both p(x≤t | T = t) and P (T = t) can be directly estimated from our data, so the only unknown
is the density function p(x>t | x≤t , T = t). ACMV implies an estimator of this density, so the entire
joint density function can be estimated. The equivalence between MAR and ACMV is shown in

Molenberghs, G., Michiels, B., Kenward, M. G., & Diggle, P. J. (1998). Monotone missing data
and pattern-mixture models. Statistica Neerlandica, 52(2), 153-161.
2 See https://en.wikipedia.org/wiki/Inverse_probability_weighting for more details.
3 Also called the extrapolation density.

4.3 Missing data: strategies for missing not at random

Under MNAR, the missing data problem becomes a lot more complicated. There are two common strategies for
handling MNAR: the selection model and the pattern mixture model approaches.
To simplify the problem, we consider the monotone missing data problem. Even in this scenario, we will see
several identifiability issues, so we have to be very careful about our choice of model.
Recall that X denotes the study variables and T is the dropout time. We are interested in the full-data
density p(x, t); note that p(x, t) determines the joint PDF of the study variables p(x).
A useful reference: https://content.sph.harvard.edu/fitzmaur/lda/C6587_C018.pdf.

4.3.1 Selection models

Selection models decompose the full-data density using

p(x, t) = P (T = t|x)p(x),

where P (T = t|x) is called the missing probability or missing data mechanism.


A common strategy in selection models is to identify P (T = d|x), where d is the end time of the study. There
are two reasons for identifying P (T = d|x). First, identifying this quantity is enough to construct a
consistent inverse probability weighting (IPW) estimator, similar to the one we saw in causal inference.
Second, we can easily estimate the PDF p(x, T = d) using the observations without
missing entries; if P (T = d|x) is known, then we can identify p(x) via
$$
p(x) = \frac{p(x, T = d)}{P(T = d\mid x)}.
$$

The MAR and MCAR conditions are often expressed in the selection model framework. Formally, MCAR
is
P (T = t|X) = P (T = t).
Namely, the probability of any dropout time is totally independent of the study variables X. MAR is

P (T = t|X) = P (T = t|X≤t ).

In other words, the conditional probability of the dropout time depends only on the observed variables.
As we have mentioned, the selection model allows a simple way to construct a consistent estimator of a
parameter of interest via the IPW procedure. Here is a simple example. Suppose that the parameter of
interest is a linear statistical functional θ = θ(F ) = ∫ ω(x) dF (x); then it can be further written as
$$
\theta = \int \omega(x)\, p(x)\, dx = \int \omega(x)\, \frac{p(x, T = d)}{P(T = d\mid x)}\, dx = \int \omega(x)\, \frac{F(dx, T = d)}{P(T = d\mid x)}.
$$
With an estimator of the selection probability P̂(T = d|x) (and we only need to estimate the probability of the
fully observed case), a simple IPW estimator of θ is
$$
\hat\theta_0 = \int \omega(x)\, \frac{\hat F(dx, T = d)}{\hat P(T = d\mid x)} = \frac{1}{n}\sum_{i=1}^{n} \frac{\omega(X_i)\, I(T_i = d)}{\hat P(T = d\mid X_i)}. \tag{4.2}
$$
You can show that θ̂0 is a consistent estimator (and it is asymptotically normal as well, due to Slutsky's
theorem). Moreover, the influence function (recall the bootstrap lecture notes) of θ̂0 can be easily derived,
so the variance of θ̂0 can be estimated via a plug-in estimate.

Although θ̂0 is elegant, it may not be the best estimator, in the sense that after estimating the propensity
score P (T = t|x), we rely only on the completely observed data (those with Ti = d) to form the final
estimator. All other observations are discarded entirely. Intuitively, this leads to an inefficient estimator.
To construct a more efficient estimator, consider augmenting θ̂0 with an additional term:
$$
\hat\theta_1 = \hat\theta_0 + \frac{1}{n}\sum_{i=1}^{n} \big(I(T_i = \tau) - \hat P(T_i = \tau\mid X_{i,\le\tau})\big)\, g_\tau(X_{i,\le\tau})\, I(T_i = \tau),
$$
where τ < d is any time point and gτ is a function of the variables x≤τ . The augmented term has asymptotic
mean 0, so θ̂1 is still a consistent estimator. The insight here is that the function gτ is something we can
choose: we can choose it to minimize the variance of θ̂1 , which may lead to a reduction in the total
variance compared to θ̂0 . The same idea can be applied to every time point τ = 1, · · · , d − 1,
leading to the augmented inverse probability weighting (AIPW) estimator
$$
\hat\theta_{\mathrm{AIPW}} = \hat\theta_0 + \frac{1}{n}\sum_{i=1}^{n}\sum_{\tau=1}^{d-1} \big(I(T_i = \tau) - \hat P(T_i = \tau\mid X_{i,\le\tau})\big)\, g_\tau(X_{i,\le\tau})\, I(T_i = \tau).
$$
With a proper choice of gτ , τ = 1, · · · , d − 1, we can construct the estimator with the smallest variance. This
leads to an efficient estimator. How to construct the functions gτ , τ = 1, · · · , d − 1, is a central topic of
semiparametric inference.
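The general construction of the gτ 's is beyond these notes, but the flavor of the augmentation can be seen in the simplest case d = 2 (X always observed, Y possibly missing) with θ = E(Y ). The sketch below uses the standard doubly robust form of the augmented estimator for that case, which corresponds to one particular choice of the augmentation function; it is not the general monotone-dropout AIPW above, and the inputs pi_hat and m_hat are assumed to be supplied by the user.

```python
import numpy as np

def aipw_mean(Y, R, pi_hat, m_hat):
    """Augmented IPW estimate of E[Y] when only Y is subject to missingness.

    R      : (n,) 0/1, R_i = 1 if Y_i is observed.
    pi_hat : (n,) estimates of the propensity score P(R = 1 | X_i).
    m_hat  : (n,) estimates of the regression E(Y | X_i) (the augmentation term).
    """
    Y_filled = np.where(R == 1, Y, 0.0)             # missing Y's are never actually used
    ipw_term = R * Y_filled / pi_hat                # plain IPW contribution
    augment = (R - pi_hat) / pi_hat * m_hat         # mean-zero correction term
    return np.mean(ipw_term - augment)
```

A good choice of m_hat reduces the variance relative to plain IPW; with m_hat ≡ 0 this reduces to the IPW estimator.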
Note that sometimes the AIPW (and IPW) estimators are constructed by solving an estimating equation.
This occurs when the parameter of interest θ0 = θ(F ) is defined through solving the equation
$$
0 = E(S(X; \theta_0)) = \int S(x; \theta_0)\, dF(x) = \int S(x; \theta_0)\, \frac{F(dx, T = d)}{P(T = d\mid x)}.
$$
In this case, the IPW estimator θ̂0 is the solution to
$$
0 = \int S(x; \hat\theta_0)\, \frac{\hat F(dx, T = d)}{\hat P(T = d\mid x)} = \frac{1}{n}\sum_{i=1}^{n} \frac{S(X_i; \hat\theta_0)\, I(T_i = d)}{\hat P(T = d\mid X_i)},
$$
and we can augment it with a set of mean-0 terms to improve the efficiency.
If you are interested in the construction of AIPW, I would recommend the following textbook:

Tsiatis, A. (2007). Semiparametric theory and missing data. Springer Science & Business Media.

Note: although we introduce AIPW estimators in the MNAR framework, they are most often used in the MAR
scenario, because identifying the propensity score/selection probability P (T = t|X) is challenging under
MNAR. MAR is a simple case where we can identify the propensity score entirely, so AIPW estimators
can be constructed easily. Essentially, as long as you can identify the selection probability, you can construct
an IPW estimator and attempt to augment it into an AIPW estimator to improve the efficiency. So the
direction of research is often on how to identify the selection probability.

4.3.2 Pattern mixture models

Pattern-mixture models (PMMs) use another factorization of the full-data density:

p(x, t) = p(x>t |x≤t , t)p(x≤t |t)P (T = t),



where the first term p(x>t |x≤t , t) is called the extrapolation density and the latter two terms p(x≤t |t)P (T = t)
are called the observed-data density. The extrapolation density is unobservable and unidentifiable; it describes
the distribution of the missing entries. The observed-data density is identifiable since, at each dropout time
T = t, we do observe the variables x1 , · · · , xt .
PMMs provide a clean separation between what is identifiable and what is not. So the strategy
for identifying p(x, t) is to make the extrapolation density identifiable.
In monotone missing problems, the extrapolation density has the following product form:
$$
p(x_{> t}\mid x_{\le t}, t) = \prod_{s=t+1}^{d} p(x_s\mid x_{< s}, T = t).
$$

Thus, to identify the extrapolation density it suffices to identify each term in the product. Several identifying
restrictions have been proposed in the literature. For instance, the complete case missing value (CCMV)
restriction equates
$$
p(x_s\mid x_{< s}, T = t) \overset{CC}{=} p(x_s\mid x_{< s}, T = d),
$$
the available case missing value (ACMV) restriction assumes that
$$
p(x_s\mid x_{< s}, T = t) \overset{AC}{=} p(x_s\mid x_{< s}, T \ge s),
$$
and the nearest case missing value (NCMV) restriction requires that
$$
p(x_s\mid x_{< s}, T = t) \overset{NC}{=} p(x_s\mid x_{< s}, T = s).
$$
In general, one can specify any subset of patterns Ats ⊂ {s, s + 1, · · · , d} and construct a corresponding
identifying restriction
$$
p(x_s\mid x_{< s}, T = t) \overset{A_{ts}}{=} p(x_s\mid x_{< s}, T \in A_{ts});
$$
this is called a donor-based identifying restriction in the following paper:

Chen, Y. C., & Sadinle, M. (2019). Nonparametric Pattern-Mixture Models for Inference with
Missing Data. arXiv preprint arXiv:1904.11085.

If you make any of these assumptions, the extrapolation density can be identified from the data, so you can
then estimate the full-data density p(x, t).
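As a concrete (and entirely illustrative) example of how an identifying restriction turns into an algorithm, here is a sketch of sequential imputation under CCMV: each missing Xs is drawn from an approximation of p(xs | x<s , T = d), here a k-nearest-neighbour hot deck among the complete cases. The matching rule, the value of k, and the function name are our own choices, not part of the restriction itself.

```python
import numpy as np

def ccmv_hot_deck(X, T, d, k=10, rng=None):
    """Sequential hot-deck imputation sketch under the CCMV restriction.

    X : (n, d) array; for individual i only the first T_i columns are observed
        (the remaining entries may be np.nan).
    T : (n,) dropout times in {1, ..., d}.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = X.copy()
    complete = X[T == d]                                  # the donor pool under CCMV
    for i in np.where(T < d)[0]:
        for s in range(T[i], d):                          # impute column s (0-indexed), left to right
            dist = np.linalg.norm(complete[:, :s] - X[i, :s], axis=1)
            donors = complete[np.argsort(dist)[:k], s]    # k complete cases closest on the earlier coordinates
            X[i, s] = rng.choice(donors)                  # stochastic draw from the neighbours
    return X
```

Under NCMV or ACMV only the donor pool changes; the sequential left-to-right structure stays the same.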
Here is a nice review on PMMs for MNAR:

Linero, A. R., & Daniels, M. J. (2018). Bayesian approaches for missing not at random outcome
data: The role of identifying restrictions. Statistical Science, 33(2), 198-213.

4.3.3 Imputation and pattern mixture models

Earlier (Section 4.1.2), we introduced the idea of imputation when only one variable is missing, but
it can also be applied when there are multiple missing entries. Suppose that we have an imputation
procedure such that, given the observed X≤T = (X1 , · · · , XT ) and the dropout time T , the procedure generates
random numbers X>T = (XT +1 , · · · , Xd ) from a distribution Q.

Then you can always view this imputation procedure as a PMM in which the PDF corresponding to the
imputation distribution Q is the underlying model for the extrapolation density. So any imputation method
can be viewed as implicitly handling the problem with a PMM.
From this point of view, you may notice that if we always impute the same number when observing (X≤T , T ),
then the imputation procedure is problematic, since the corresponding imputation distribution is not a
good estimator of the underlying extrapolation distribution unless we are interested in some very special
parameter of interest. The commonly used mean imputation and median imputation are thus bad ideas to
apply in practice.

4.3.4 Nonparametric Saturation

Under MNAR, we need to make identifying restrictions so that the full-data distribution F (x, t) (or p(x, t)) is
identifiable. However, there is one property that an identifying restriction should have: the implied joint
distribution should be compatible/consistent with what we observe. This property is called nonparametric
saturation (also nonparametric identification or just identification).
The idea is simple: because we can identify F (x, t), we can pretend the implied joint distribution is the true
generating distribution and generate a new missing dataset from it. The generated missing data should be
similar to the original data we have.
MAR and any pattern mixture model satisfy this property (when we estimate the joint distribution
via a nonparametric estimator). However, some identifying restrictions, such as MCAR, do not
satisfy it. Whenever you propose a new MNAR restriction, you should always think about whether the implied
full-data distribution satisfies this property or not.

4.3.5 Sensitivity analysis

Sensitivity analysis is a common procedure in handling missing data. In short, a sensitivity
analysis perturbs the missing data assumption a bit and examines how the conclusion changes. This is often
required when handling missing data because, as we have shown previously, there is no way to check whether a
missing data assumption is correct (unless we have additional information), so our conclusion relies heavily on
the assumed missingness mechanism. By perturbing the assumption on the missingness, we can examine whether our
conclusion is robust to the missing data assumption.
In MAR, one common approach for sensitivity analysis is to introduce the model (called the exponential
tilting strategy)
$$
\log \frac{P(T = t\mid X)}{P(T = t\mid X_{\le t})} = \gamma^{\top} X,
$$
where γ ∈ R^d is a sensitivity parameter such that if γ = 0, we have P (T = t|X)/P (T = t|X≤t ) = 1, which is the MAR
condition. We vary γ, examine how the estimator changes as a function of γ, and use this to assess
how sensitive the estimator is to the MAR assumption.
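In the single-missing-Y setting of Section 4.1, a simple operational version of this idea is to tilt the imputation distribution rather than the dropout model: draw each donor with weight proportional to exp(γ·y) instead of uniformly, so that γ = 0 recovers the MAR imputation, and re-compute the estimate over a grid of γ. This is only a sketch of the general idea under our own implementation choices (the tilting is applied to the donor weights, which approximates tilting the KDE when h is small); it is not the exact model above.

```python
import numpy as np

def tilted_impute_median(X_obs, Y_obs, X_mis, gamma, h=0.3, rng=None):
    """Sensitivity-analysis sketch: exponentially tilted stochastic imputation.

    gamma = 0 reproduces the MAR imputation of Section 4.1.2; gamma != 0 makes the
    imputation distribution proportional to p(y | x, R = 1) * exp(gamma * y).
    """
    rng = np.random.default_rng() if rng is None else rng
    Y_tilde = np.empty(len(X_mis))
    for j, x in enumerate(X_mis):
        donors = Y_obs[X_obs == x]
        w = np.exp(gamma * donors)
        w = w / w.sum()                       # tilted donor weights
        Y_I = rng.choice(donors, p=w)
        Y_tilde[j] = rng.normal(Y_I, h)
    return np.median(np.concatenate([Y_obs, Y_tilde]))

# Trace the estimate over a grid of sensitivity parameters:
# for gamma in np.linspace(-1, 1, 9):
#     print(gamma, tilted_impute_median(X_obs, Y_obs, X_mis, gamma))
```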

4.3.6 Nonmonotone missing data problem

When the missingness is non-monotone (which occurs very often in survey samples), the problem becomes a
lot more complicated. Even if we are willing to assume MAR, the full-data distribution p(x) may not be unique.
The following paper proposed a pattern mixture model that yields a full-data distribution satisfying MAR:

Robins, J. M., & Gill, R. D. (1997). Non-response models for the analysis of non-monotone
ignorable missing data. Statistics in Medicine, 16(1), 39-56.

However, it identifies only one full-data distribution satisfying MAR, not all possible distributions.
The problem is even more challenging in the MNAR case. In general, the nonmonotone MNAR problem is still
a very open problem. There are some attempts to deal with it, but we have very limited options. Here is
some recent work related to nonmonotone MNAR:

1. Sadinle, M., & Reiter, J. P. (2017). Itemwise conditionally independent nonresponse modelling
for incomplete multivariate data. Biometrika, 104(1), 207-220.
2. Tchetgen, E. J. T., Wang, L., & Sun, B. (2018). Discrete choice models for nonmonotone
nonignorable missing data: Identification and inference. Statistica Sinica, 28(4), 2069-2088
3. Malinsky, D., Shpitser, I., & Tchetgen, E. J. T. (2019). Semiparametric Inference for
Non-monotone Missing-Not-at-Random Data: the No Self-Censoring Model. arXiv preprint
arXiv:1909.01848.
4. Chen, Y. C., & Sadinle, M. (2019). Nonparametric Pattern-Mixture Models for Inference with
Missing Data. arXiv preprint arXiv:1904.11085.

In particular, the first and the third papers consider the following interesting assumption:
$$
X_j \perp R_j \mid X_{-j}, R_{-j},
$$
where Rj ∈ {0, 1} is the response indicator with Rj = 1 if variable Xj is observed. This assumption is known
as the ICIN (itemwise conditionally independent nonresponse) or NSC (no self-censoring) assumption. It has
a beautiful graphical representation induced by the conditional independence.
