A Comparison of Three Popular Methods For Handling Missing Data: Complete-Case Analysis, Inverse Probability Weighting, and Multiple Imputation
Abstract
Missing data are a pervasive problem in data analysis. Three common meth-
ods for addressing the problem are (a) complete-case analysis, where only
units that are complete on the variables in an analysis are included; (b)
weighting, where the complete cases are weighted by the inverse of an esti-
mate of the probability of being complete; and (c) multiple imputation (MI),
where missing values of the variables in the analysis are imputed as draws
from their predictive distribution under an implicit or explicit statistical
model, the imputation process is repeated to create multiple filled-in data
sets, and analysis is carried out using simple MI combining rules. This article
provides a non-technical discussion of the strengths and weaknesses of these
approaches, and when each of the methods might be adopted over the
others. The methods are illustrated on data from the Youth Cohort
(Time) Series (YCS) for England, Wales and Scotland, 1984–2002.

1 Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
2 Department of Medical Statistics, London School of Hygiene & Tropical Medicine, London, UK
3 MRC Clinical Trials Unit, UCL, UK
4 Murdoch Children’s Research Institute, Royal Children’s Hospital, Melbourne, Australia

Corresponding Author:
Roderick J. Little, Department of Biostatistics, University of Michigan, 1420 Washington Heights,
Ann Arbor MI 48209, USA.
Email: [email protected]

1106 Sociological Methods & Research 53(3)
Keywords
incomplete data, imputation, missing data, weighting
Preliminaries
Missing data are a pervasive problem in statistical analysis. The topic has an
extensive literature – textbooks on the topic include Little and Rubin (2019),
van Buuren (2018), Raghunathan (2015), Carpenter and Kenward (2013),
and Schafer (1997). We consider and compare three common approaches to
the analysis of data with missing values, namely complete-case analysis
(henceforth CC), inverse probability weighting (henceforth IPW), and multiple
imputation (henceforth MI). In CC — or complete-record analysis (e.g.
Carpenter and Kenward 2013, chapter 1) to avoid confusion with the termin-
ology of cases and controls in medical studies — only units that are complete
on the variables in an analysis are included; in IPW, the complete cases are
weighted by the inverse of an estimate of the probability of being complete;
and in MI, missing values of the variables in the analysis are imputed as
draws from their predictive distribution under an implicit or explicit statistical
model; the imputation process is repeated to create multiple filled-in data sets,
and analysis is carried out using simple MI combining rules (Rubin 1987).
All methods for handling missing data make unverifiable assumptions;
perhaps the closest to an assumption-free method is that in Horowitz and
Manski (1998), which presents bounds on parameter inferences based on
best and worst-case values of the missing variables. This method is (as
they acknowledge) very conservative, and is essentially limited to missing
variables that have known finite support.
Our focus here is principally on inference for regression coefficients and
sample means. We restrict attention to models that assume the missingness
mechanism is missing at random (MAR), as discussed in the next section,
although it is important to recognize that missing not at random (MNAR) MI
methods are possible as well. CC, IPW and MI are all quite general, in that
(given sufficient information) they can be used to handle missing data in any
statistical analysis an analyst might wish to perform on the data without missing
values.
Other valid approaches exist for handling missing data (by “valid” we mean
that when a method’s assumptions hold, it yields consistent estimates of target
parameters, confidence intervals with close to nominal coverage, and tests
close to the stated size). Specifically, likelihoods can be defined for nonrectan-
gular data sets with missing values, and hence methods based on these likeli-
hoods can be implemented. In particular, maximum likelihood (ML)
estimates can be computed, with standard errors based on the information
matrix or sample re-use methods like the bootstrap; or a prior distribution can
be added to the specification and inferences based on the Bayesian posterior dis-
tribution. Indeed, ML methods for missing data are quite widely used in the
social sciences – often implicitly, as incorporated into structural equation mod-
eling software like Mplus (Muthen and Muthen 2017). ML is asymptotically
equivalent to MI under the same model for the data, so it shares some of the prop-
erties of MI discussed here. However, MI is more flexible than ML in some set-
tings, because it allows variables not included in the final analysis model to be
included in the imputation model and readily extends to settings where data may
be MNAR. Augmented inverse-probability weighted estimating equations
(Robins and Rotnitzky 1995; Robins, Rotnitzky and Zhao 1995) employ esti-
mating equations that include model predictions of missing values and weighted
residual terms, which provide some protection against model misspecification.
We focus on the three methods described above because they are used
extremely widely. In particular, CC is the default method in much statistical soft-
ware, is intuitive and is simple to implement. IPW is the standard approach to
handling unit nonresponse in surveys, and is also relatively simple to carry out.
MI methods are more varied and complex, but increasingly common because
of their extensive availability in software packages: examples include MICE
and other R packages (Van Buuren and Groothuis-Oudshoorn 2011, Su et al.
2011), IVEware (Raghunathan et al. 2001), PROC MI in SAS (2015), and
Stata (see https://www.stata.com/features/multiple-imputation/).
Additional modeling assumptions are unavoidable when analyzing data
with missing values, so the most important step in dealing with missing
data is to limit the extent of missing values, by careful design and data collec-
tion (e.g. National Research Council 2010). Because some data are likely to
be missing despite these efforts, it is important to try to collect covariates that
are predictive of the missing values, so that an adequate adjustment can be
made. In addition, the processes that lead to missing values should be
assessed during the collection of data if possible (e.g. Little 1995), because
this information plays a role in the choice of missing data adjustment method,
as discussed further below.
A basic assumption in all missing-data methods is that missingness of a
particular value hides a true underlying value that is meaningful for analysis.
Deciding whether a value is meaningful is not always as simple as it seems.
For example, consider a longitudinal analysis of measures of quality of life;
for subjects who leave the study because they move to a different location, it
makes sense to consider quality of life as missing, whereas for subjects who
die during the course of the study, it is not reasonable to consider quality of
life after time of death as missing. Rather it is preferable to restrict the analysis
of quality of life to individuals while they are alive. More complex missing
data problems arise when individuals leave a study for unknown reasons,
which may include relocation or death. Another example is nonresponse to
opinion polls, where the target population consists of individuals who will
vote – nonresponse for people who do not vote is arguably not missing
data, since an imputed value is not meaningful for estimating the proportion
of votes cast for each candidate.
Despite the fact that CC, IPW and MI are common in practice, we believe
that the principles underlying the choice between these methods are not as
well understood as they might be. Therefore, this article provides a relatively
nontechnical discussion of the strengths and weaknesses of CC, IPW and MI,
and guidelines for when each of the methods might be favored over the
others. For those who believe that the material is well known, here are four
preliminary facts that may surprise some readers:
Methods
Complete-Case (CC) Analysis
CC for a set of variables simply discards units where any of these variables
are missing. It has the advantage of simplicity, and it is the default analysis
in most statistical software packages. It has two main drawbacks. Firstly,
the complete cases are not a random subsample of the original sample
unless the data are missing completely at random (MCAR). This is usually an unrealistic assumption,
because cases with missing values often differ from complete cases in
terms of the variables of interest. If the complete cases are not a random sub-
sample, CC will give biased answers for simple summary measures (such as
mean, sd) and may yield biased answers for regression models, although not
in all situations, as discussed below. Secondly, CC discards information in the
incomplete cases, which has typically cost non-trivial resources to collect,
and which will often contain information for reducing bias or increasing
the efficiency of CC estimates. A key question is thus how much information
is contained in the incomplete cases – if nearly all the information is contained
in the complete cases, CC might be a reasonable approach. Unfortunately, the
answer to this question is not straightforward, because it depends on the
data, we then apply the standard complete-data analysis to each of the M data-
sets, yielding M estimates, say (θ̂(1) , . . . , θ̂(M) ) of parameters θ; (d) combine
the parameter estimates to create an overall estimate of θ — a method for
doing this is called an MI combining rule. In particular, for scalar estimands,
the MI estimate is the average θ̂_MI = (1/M) Σ_{m=1}^{M} θ̂^(m) of the estimates from
the M datasets, and the sampling variance of the estimate is estimated as
V̂ = Ŵ + (1 + 1 / M)B̂, where Ŵ is the average of the estimated sampling
variances from the M datasets, and B̂ is the sample variance of the estimates
across the M datasets; the factor 1 + 1/M is a small-M correction. The quantity
(1 + 1 / M)B̂ is crucial, because it estimates the increase in the variance from
imputation uncertainty, which is omitted (i.e., set to zero) by single imput-
ation methods. Other combining rules provide refinements of this basic
method, and include combining rules for test statistics and p-values. See,
for example, Little and Rubin (2019, Section 10.2).
The imputation of draws from the predictive distribution creates the vari-
ability in the estimates over the MI data sets, allowing the appropriate assess-
ment of imputation uncertainty. Imputing draws is inefficient, but the fact that
θ̂MI is averaged over datasets reduces this inefficiency, roughly by a factor of
M. In fact, MI under a well-specified model is essentially fully efficient from a
statistical perspective, providing M is sufficiently large. The appropriate
choice of M depends on the fraction of missing information, which is esti-
mated for each parameter θ by (1 + 1 / M)B̂ / V̂. Larger fractions of
missing information require larger values of M to yield good estimates of
the imputation uncertainty.
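To make the combining rules and the fraction of missing information concrete, here is a short sketch in Python (our own illustrative code, not from the article; the helper name `rubin_combine` is ours):

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Rubin's MI combining rules for a scalar estimand.

    estimates: the M completed-data point estimates theta_hat^(m)
    variances: their M estimated sampling variances
    Returns (theta_MI, V_hat, estimated fraction of missing information).
    """
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    M = len(est)
    theta_mi = est.mean()            # average of the M estimates
    W = var.mean()                   # within-imputation variance
    B = est.var(ddof=1)              # between-imputation variance
    V = W + (1 + 1 / M) * B          # total variance; (1 + 1/M) is the small-M correction
    fmi = (1 + 1 / M) * B / V        # estimated fraction of missing information
    return theta_mi, V, fmi
```

For example, with M = 3 estimates (1, 2, 3), each with estimated sampling variance 1, the rule gives θ̂_MI = 2 and V̂ = 1 + (4/3)·1 = 7/3.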
Once the data are imputed, the remaining steps of MI are not much more
difficult than doing a single imputation. The additional computing from
repeating an analysis M times is not a major burden and MI combining
rules for standard errors are standard in MI programs. Modern MI programs
yield imputed data sets that lead to proper inferences, in the sense that they
appropriately incorporate uncertainty in the parameter estimates in the imput-
ation models; they also can be applied to a general missing data pattern. The
imputation models generally assume the missing data are MAR, although
MNAR mechanisms can also be incorporated (e.g. Tompsett et al. 2018,
Giusti and Little 2011). Most of the work is in generating good predictive dis-
tributions for the missing values.
There are three primary approaches to creating the predictive distributions
for multiple imputation of the missing data. (1) Joint modeling, where predict-
ive distributions are derived from an explicit parametric joint model f (Y|θ) for
the variables in the data set, indexed by parameters θ. Examples of models
include the multivariate normal model for continuous variables, loglinear
models for categorical variables, and the general location model for mixture
of continuous and categorical variables (see Little and Rubin 2019, Chapters
11–14). (2) Sequential regression imputation (Raghunathan et al. 2001), also
called chained-equation imputation (White, Royston and Wood 2011, van
Buuren and Groothuis-Oudshoorn 2011), or full conditional specification, where a
model is specified for the conditional distribution fj (Yj |Y(j) ) of each variable
Yj with missing values, given the other variables Y(j) = (Y1 , . . . , Y j−1 ,
Y j+1 , . . . , Yp ), for j = 1,…,p. These methods are iterative, and impute the
missing values of each variable as draws from their conditional distribution,
given the observed or most recently imputed values of the other variables. (3)
Hot deck imputation, which matches each incomplete case (which we call the
recipient) to a complete case (which we call the donor) based on some close-
ness metric. The values of missing variables for the recipient are then imputed
with the corresponding values of those variables for the donor. A variety of
metrics are used, but a common and principled choice is the distance
between the predicted means from a regression of the missing variables on
the observed variables (predictive mean matching, see Little 1988). Hot
deck methods were originally defined for single imputation – for a review
of these methods, see Andridge and Little (2010); they can be extended to
MI by defining a set of close donors for each recipient, and randomly
picking a donor for each MI data set (Little 1988).
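To illustrate predictive mean matching concretely, here is a minimal single-imputation sketch (our own code; `pmm_impute` is a hypothetical helper, and a multiple-imputation version would redraw a donor from the close set for each of the M data sets):

```python
import numpy as np

def pmm_impute(y, x, rng, k=5):
    """Predictive mean matching: impute missing y from close donors.

    y: outcome with np.nan marking missing entries; x: fully observed covariate.
    Each recipient is matched to the k complete cases (donors) whose predicted
    means, from a regression of y on x, are closest to its own, and the
    observed y of one randomly drawn donor is used as the imputation.
    """
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)
    X = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    pred = X @ beta                              # predicted means for all cases
    y_imp = y.copy()
    donor_pred, donor_y = pred[obs], y[obs]
    for i in np.flatnonzero(~obs):
        close = np.argsort(np.abs(donor_pred - pred[i]))[:k]   # k closest donors
        y_imp[i] = donor_y[rng.choice(close)]                  # draw one donor's value
    return y_imp
```

Because imputed values are always observed donor values, the method cannot produce implausible imputations outside the support of the data.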
Joint modeling is well-motivated theoretically – the underlying theory is
Bayesian, which creates imputations that take into account uncertainty in
model parameters. The approach is well suited to a monotone pattern,
where the joint distribution of Y can be factored as f (Y1 , . . . , Yp ) = f1 (Y1 )
f2 (Y2 |Y1 ) · · · fp (Yp |Y1 , . . . , Y p−1 ), with variables arranged from most observed
(Y1 ) to least observed (Yp ). The distributions in this product can then be
modeled using regressions appropriate for the outcome variable type – for
example, normal linear regression for continuous outcomes, logistic regres-
sion for binary outcomes, and so on. These regressions are quite flexible,
in that they can include polynomial terms and interactions as covariates.
Imputations are created sequentially, first filling in missing values of Y1 as
draws from f1 (Y1 ), then filling in missing values of Y2 as draws from
f2 (Y2 |Y1 ), conditioning on observed and previously imputed values of Y1 ,
and so on. The data analyst creating the MIs needs to provide appropriate spe-
cifications of these regressions, using subject-matter knowledge and regres-
sion diagnostic tools applied to the set of cases that are observed on the
relevant set of variables.
For non-monotone patterns, imputation algorithms are iterative and
involve an application of Markov Chain Monte Carlo methods. This means
that methods are needed to monitor convergence of the chain, and the
methods can be computationally intensive if the data matrix is large. A chal-
lenge for the joint modeling approach is the limited availability of models for
the joint distribution of Y. For example, the popular multivariate normal model
implies normal regression models for the imputations that are
linear and additive in the covariates, with a constant residual variance. This
limitation can be eased by strategies such as transformation of the variables,
or more generally by using a latent normal model for binary and unordered
categorical data, a flexible approach which has recently been shown to
perform well (Quartagno and Carpenter 2019).
The chained equation approach sidesteps this limitation of joint modeling
for non-monotone patterns by not requiring that the set of conditional distri-
butions {fj (Yj |Y(j) )} for each j corresponds to a coherent joint distribution for
(Y1 , . . . , Yp ). This allows much more flexibility in the choice of imputation
model for each variable, at the expense of some theoretical coherence. In
practice, simulation studies suggest that the approach does well, provided
careful attention is given to specifying the set of imputation models so they
are mutually consistent.
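The cycling scheme can be sketched for two continuous variables with normal linear conditionals (a toy illustration of our own; production implementations such as mice also draw the regression parameters, e.g. via posterior draws or bootstrap, to propagate parameter uncertainty):

```python
import numpy as np

def chained_impute(y1, y2, rng, n_cycles=10):
    """Toy chained-equations (full conditional specification) imputation.

    y1, y2: two continuous variables with np.nan marking missing entries.
    Each variable with missing values is imputed in turn as a draw from a
    normal linear regression on the other variable, cycling a fixed number
    of times so the imputations stabilize.
    """
    m1, m2 = np.isnan(y1), np.isnan(y2)
    y1 = np.where(m1, np.nanmean(y1), y1)      # start from mean imputation
    y2 = np.where(m2, np.nanmean(y2), y2)
    for _ in range(n_cycles):
        for target, miss, other in ((y1, m1, y2), (y2, m2, y1)):
            X = np.column_stack([np.ones(len(other)), other])
            beta, *_ = np.linalg.lstsq(X[~miss], target[~miss], rcond=None)
            resid = target[~miss] - X[~miss] @ beta
            sigma = resid.std(ddof=2)
            # replace missing entries with draws from the fitted conditional
            target[miss] = X[miss] @ beta + rng.normal(0.0, sigma, miss.sum())
    return y1, y2
```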
Finally, hot deck approaches avoid the need to formally specify imputation
models, and are potentially less vulnerable to model misspecification
(although they still rely on the MAR assumption). These methods tend to
perform well with large data sets, where potential donors that are close
matches to recipients are plentiful. They are less useful (and results may
have relatively high variance) in smaller datasets, where good matches are
less plentiful. In such settings, the joint modeling or sequential regression
approaches tend to be superior.
cell, the absolute bias and variance of IPW and MI estimates relative to CC
are tabulated. See the footnote to the table for details.
When X is weakly associated with both R and Y (Cell LLL), CC, IPW, and
MI are similar, and CC may be preferred on grounds of simplicity. For either
IPW or MI to reduce the absolute bias of CC, the auxiliary variables X need to
be related to both R and Y, as in the cells HHL and HHH in Table 1. When X
is strongly associated with R but weakly associated with Y (Cells
HLL and HLH), IPW actually makes things worse than CC in terms of vari-
ance, because the variability of the sample weights increases the sampling
variance of the weighted mean without a compensating reduction in absolute
bias. When the propensity is strongly related to Y (cells LHL, HHL, LHH and
HHH), IPW can have lower variance than CC. MI does not have the increased
variance of IPW in Cell HLL, and otherwise is more efficient than CC and
IPW when there are auxiliary variables other than the propensity that
Table 1. Bias and Variance of MI and IPW Relative to CC for Estimating a Mean, by Strength of
Association of the Auxiliary Variables X = (X1 , . . . , Xp ) with Response (R) and Outcome (Y).
Rows: association of X with response R (i.e., the strength of the propensity to respond), Low or High.
Columns: association of the outcome Y with (i) the propensity to respond and (ii) Z (as defined in
the text): (Propensity: Low, Z: Low); (Propensity: Low, Z: High); (Propensity: High, Z: Low);
(Propensity: High, Z: High).
Notes:
“---” for Bias (or Var) within a cell indicates that the estimate for the method has similar bias (or
variance) to the estimate for CC.
“↓” for Bias (or Var) within a cell indicates that the estimate for the method has less absolute
bias (or variance) than the estimate for CC.
“↑” for Bias (or Var) within a cell indicates that the estimate for the method has greater
absolute bias (or variance) than the estimate for CC.
In summary, “↓” indicates that a method is better than CC, “↑” indicates that a method is
worse than CC, and “---” indicates that a method is similar to CC.
predict Y (namely cells LLH, HLH, LHH and HHH). See Belin (2009) for an
application where MI is more efficient than CC, and Collins, Schafer and
Kam (2001) for more on the utility of auxiliary variables for enhancing the
precision of MI inferences.
Note that MI is seen to be superior to IPW in Table 1, a result of the
fact that both methods can reduce bias, but MI can reduce both bias
and variance (Little 1986). However, this property relies on the assump-
tion that the imputation model is well specified, for example, nonlinear
terms and interactions among the X’s are included as predictors if they
are needed. If the imputation model is misspecified but the model for R
on X for IPW is well specified, then IPW may be superior to MI. Also,
IPW based on a single regression of R on X can be applied to a set of vari-
ables Y with the same pattern of missing values, whereas MI requires a
different imputation model for each Y-variable in the set.
In summary, absolute bias is reduced by IPW and MI when there are
auxiliary variables that are predictive of both response and Y. Sampling
variance is reduced by IPW when the response propensity is predictive
of Y, and MI is generally more efficient than IPW, particularly when aux-
iliary variables, Z, orthogonal to the propensity to respond are predictive of
Y. These comments generally apply for subgroup means, with X interpreted
as the set of auxiliary variables other than the variable used to form the
subgroups.
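A small simulation illustrates this calculus for a mean (our own sketch, with the response propensity treated as known; in practice it would be estimated, typically by logistic regression of R on X):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical MAR setup: auxiliary X predicts both response R and outcome Y,
# so the complete cases are a biased sample for the mean of Y (true mean 0).
x = rng.normal(size=n)
y = x + rng.normal(size=n)                 # Y strongly associated with X
p_resp = 1 / (1 + np.exp(-x))              # propensity to respond depends on X only
r = rng.uniform(size=n) < p_resp           # response indicator R

cc_mean = y[r].mean()                                # complete-case mean: biased upward
ipw_mean = np.average(y[r], weights=1 / p_resp[r])   # weight by the inverse propensity
```

Here the complete-case mean is biased upward, because responders tend to have high X and hence high Y, while the inverse-probability-weighted mean is approximately unbiased.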
What if X1 , . . . , Xp also have missing values? Because MI can be
applied to a general pattern of missing values, it can still be used to
recover the information about the mean of Y in the observed auxiliary vari-
ables for units where Y is missing, although additional modeling is needed
to develop imputation models for the missing values of X1 , . . . , Xp . The
weights in IPW can only condition on the subset of X1 , . . . , Xp that are
completely observed, and therefore may be less effective in reducing abso-
lute bias or variance than MI. This is one reason why imputation is gener-
ally favored over weighting for item nonresponse, which often gives an
unstructured “swiss-cheese” appearance to the matrix R of response
indicators.
the mean of Y. With X fully observed and missing values confined to Y, and
assuming MAR, the likelihood has the form
L(θ, ϕ | data) = ∏_{i=1}^{r} f_{Y|X}(y_i | x_{i1}, …, x_{ip}, θ) × ∏_{i=1}^{n} f_X(x_{i1}, …, x_{ip}, ϕ),

with r the number of complete cases among the n units (ordered so the complete cases come first),
where fY|X (yi |xi1 , . . . , xip , θ) is the density of the conditional distribution of Y
given X and fX (xi1 , . . . , xip , ϕ) is the density of the marginal distribution of X.
It follows that provided θ and ϕ are distinct parameters, the complete cases
carry all the information for the parameters θ of the regression of Y on X,
so CC is in fact fully efficient. CC is also unbiased under MAR, not requiring
the stronger MCAR assumption. Under MAR, MI is not necessary in this situ-
ation. IPW is sometimes advocated over CC because it is potentially more
robust than CC when the regression of Y on X is misspecified but the
model for the propensity to respond is correctly specified. From a robustness
perspective, comparing the results from CC and IPW is sensible, if only as a
specification check for the regression of Y on X.
The incomplete cases have more information when there are missing
values in the covariates X rather than missing values in the outcome Y.
Suppose, for simplicity, that values of one of the covariates, say X1 , are
missing, and X2 , . . . , Xp , Y are fully observed. The incomplete cases with
X1 missing then have considerable information for the intercept and coeffi-
cients of X2 , . . . , Xp , but very limited information for the coefficient of X1
(Little 1992). The incomplete cases are thus of limited value if the primary
interest is in the coefficient of X1 , but are of considerably more value if the
primary interest is in other coefficients; in particular, if X1 is weakly asso-
ciated with Y, the incomplete cases have about as much information as the
complete cases for these other regression coefficients. For MI of covariates,
it is important to include the outcome variable Y in the imputation model
so as not to bias the estimated regression coefficients from the fitted data
(Little 1992).
CC also has the (perhaps unexpected) property of being unbiased for
regression coefficients when the probability that a case is complete depends
on the covariates but — given these — not the outcome, under a well-
specified model (Hughes et al. 2019, Little and Rubin 2019, Example 3.3).
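This property is easy to verify by simulation (our own sketch): when the probability of being complete depends only on the covariate, complete-case least squares recovers the true coefficients even though the discarded cases are a highly selective subsample.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical model: Y = 1 + 2X + noise; the probability that a case is
# complete depends on X but, given X, not on Y.
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
complete = rng.uniform(size=n) < 1 / (1 + np.exp(-x))

# Complete-case OLS fit of Y on X
X_cc = np.column_stack([np.ones(complete.sum()), x[complete]])
beta_cc, *_ = np.linalg.lstsq(X_cc, y[complete], rcond=None)
# beta_cc is close to (1, 2) despite roughly half the cases being discarded
```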
For the simple missingness pattern discussed above, we consider three cases:
study variable, and individuals dropping out between times k and k + 1 have
Y1 , . . . , Yk observed and Yk+1 , . . . , YK missing. The incomplete cases often
have substantial information for the regression of YK on X, and a repeated-
measures model should be used to model the longitudinal distribution of Y
given X. This is particularly so if the intermediate values of Y measured
prior to drop-out are predictive of the missing values of Y after dropout.
Specifically, use a repeated measures model (fully efficient if correctly speci-
fied) for all the observed data, with carefully chosen covariance structure and
a parameterization of the mean model chosen to answer the scientific ques-
tion; for examples, see Carpenter & Kenward (2008) chapter 3. Because
ML for the repeated-measures model is fully efficient, MI or IPW is not
needed.
Other Analyses
We have focused attention on analysis of means and regression parameters
here, because these analyses are common in social science data sets. CC,
IPW and MI can all be applied to other types of analysis, such as loglinear
models for contingency tables, time series modeling, or analyses that
involve latent variables like factor analysis or latent structure analysis. To
keep this paper a manageable length, we forgo a detailed discussion of
these other kinds of analyses, but offer some general comments for
completeness.
Illustrative Example
Introduction
We illustrate some of the points from previous sections with an analysis of
data from the Youth Cohort (Time) Series (YCS) for England, Wales and
Scotland, 1984 − 2002 (Shapira, Iannelli and Croxford 2007). The raw data
are freely available from the U.K. data archive, http://data-archive.ac.uk,
study number SN 5765. The data come from two UK representative
government-funded cohort studies set up to examine the effects of social, eco-
nomic and policy change on young people’s experiences of education and
transitions to the labor market. For our analyses we use a subset of the data
from school children attending all school types in England and Wales from
five YCS cohorts, who reached the end of Year 11 (i.e., age 16+ years) in
years 1990, 1993, 1995, 1997 and 1999. All our analyses use Stata 15.1.
We compare estimates from CC, IPW and MI of the distribution of paren-
tal occupation, and the regression of Year 11 educational achievement (in the
General Certificate of Secondary Education qualifications), on the covariates
cohort, boy, ethnicity and a three-level classification of parental occupation
Pattern   Var. 1   Var. 2   Var. 3        N       %
1           √        √        √       66965     87%
2           √        ?        √        7523     10%
3           ?        √        √         760     <1%
4           √        ?        ?         651     <1%
5         Other patterns                892     <1%
(√ = observed, ? = missing.)
Table 4. Distribution of Inverse Probability Weights from Logistic Model for the
Probability That a Unit is Complete.
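The construction of such weights, a logistic regression of the completeness indicator on fully observed covariates followed by inverting the fitted probabilities, can be sketched as follows (a numpy Newton-Raphson implementation of our own; in practice standard logistic regression software is used, and very small fitted probabilities are often trimmed or the weights stabilized):

```python
import numpy as np

def ipw_weights(X, complete, n_iter=25):
    """Inverse probability weights from a logistic model for completeness.

    X: fully observed covariates (vector or (n, p) matrix); complete: 0/1
    indicator that a unit is a complete case. Fits logit P(complete = 1 | X)
    by Newton-Raphson and returns 1 / p_hat for the complete cases.
    """
    r = np.asarray(complete, dtype=float)
    X = np.column_stack([np.ones(len(r)), np.asarray(X, dtype=float)])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)                  # IRLS working weights
        # Newton step: beta += (X' W X)^{-1} X' (r - p)
        beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (r - p))
    p_hat = 1 / (1 + np.exp(-X @ beta))
    return 1 / p_hat[r == 1]
```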
are three variables with missing data, we have three chained regression imput-
ation equations which the imputation algorithm cycles through:
Each of these properly imputes the missing data in the dependent variable,
and then takes these imputed values through to the next model. We complete
10 cycles before imputing each data set and a further 10 cycles between each
of our 20 imputations.
A complication in this imputation is that the ethnicity variable has a
number of relatively sparse categories, leading to quasi-complete separation.
Unless this is corrected for, this can cause coefficients in the multinomial
regression model to become large in magnitude, and corresponding SEs to
be large too, leading to poor imputations. A relatively simple fix, which we
use here, is to temporarily augment the data set with a small number of obser-
vations at each point when this occurs (White, Daniel and Royston 2010).
We now illustrate and compare the results of two analyses using CC, IPW
and MI.
Table 5. Distribution of Parental Occupation, Estimated from CC, IPW and MI for (i) Whole Data Set (top) and (ii) Bangladeshi
Ethnic Group (Bottom).
Table 6. Estimated Effects of Ethnicity on GCSE Score (Estimate, SE), Adjusted for
Cohort, sex and Parental Occupation; (i) Left, CC Analysis (ii) Centre, IPW; (iii) Right,
MI.
of these data, which shows that inferences are robust to the MAR
assumption.
Conclusions
We have presented a non-technical discussion of three widely used approaches
for handling missing data, namely CC, IPW and MI. In applications, we always
begin by tabulating and graphing the data, and exploring the associations using
complete case analyses. As we move to the definitive analysis, Table 7 sum-
marizes how we choose between the approaches in our work.
In particular, when data are plausibly MAR, MI and IPW can improve effi-
ciency and reduce bias over a CC analysis. The relative gain of MI over CC or
IPW depends on how much information is contained in the incomplete cases
for the scientific analysis. Further, IPW and MI can both exploit information
in auxiliary variables (which are not included in the scientific model) to (i)
increase the plausibility of the MAR assumption, and hence reduce bias
and (ii) further improve efficiency – especially with MI, when the auxiliary
variables are good predictors of the variables with missing values in the sci-
entific model. A further advantage of MI is that it can be used when these aux-
iliary variables themselves have missing values.
However, the advantages of MI are contingent on the imputation model
being well specified, in terms of assumed relationships between the
missing and observed variables. In scientific models where there are, for
example, non-linear effects, interactions, hierarchical (multilevel) structure
and time-to-event outcomes, considerable thought needs to go into the speci-
fication of the imputation model. Analysts should also check that the distribu-
tion of imputed data is plausible in the scientific context (e.g. graphically).
Carpenter and Smuk (2021) discuss a number of examples in detail, and
provide further references. One robust MI approach is Penalized Spline of
Propensity Prediction, which imputes missing variables based on a model
that includes a penalized spline of the estimated response propensity and
other predictive covariates (Zhang and Little 2009, 2011).
When MI is not indicated, and we are choosing between CC and IPW, we
reiterate the following. First, for inferences about means, use IPW when auxil-
iary variables are available that are strongly related to both response and the vari-
able with missing values. Second, for inferences about regression with all (or
most) missing values in the outcome alone, CC is valid if the regression
model is correctly specified. However, it is prudent to compare results from
IPW and CC as a specification check. If the estimated regression coefficients
from the two analyses are very different (and a careful check of the IPW
Table 7. Analysis method: when to use, when to avoid.
model does not highlight any concerns) the specification of the regression model
needs to be checked for errors (for example, assumptions about linearity or
absence of interactions may be invalid). Third, for inferences about regression
with missing values in the covariates, IPW is preferred if the missingness mech-
anism is MAR; CC is preferred if the missingness mechanism plausibly depends
on the covariates but not (or only weakly) on the outcome.
In our work, we typically compare the CC analysis with either MI or IPW
(or both) and seek to understand and explain in our reporting why they differ,
because such explanations typically give additional insights and hence
improve confidence in the scientific findings.
In practice the mechanism behind the missing data will not be known, so the
analyst must make an assumption about the most plausible mechanism. It is there-
fore important to conduct a sensitivity analysis under alternative plausible
assumptions about the missingness mechanism. One approach to clarifying the
assumptions regarding the missingness mechanism in the primary and sensitivity
analyses is to use causal diagrams (Lee et al. 2021).
Finally, if the data are suspected to be MNAR, then it is important to con-
sider an MNAR model, at least as a sensitivity analysis. A discussion of
MNAR models is beyond the scope of this manuscript, but MI provides a
practical vehicle, as described by Carpenter and Smuk (2021) and references
therein.
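One widely used MI-based sensitivity analysis of this kind is delta adjustment, in which imputed values are shifted by a fixed offset and the analysis is repeated over a range of plausible offsets. The sketch below illustrates the idea with toy data; it is only a schematic of one of the approaches discussed by Carpenter and Smuk (2021), and all names and values are illustrative.

```python
# Hedged sketch of a delta-adjustment MNAR sensitivity analysis:
# imputed values are shifted by delta to represent a systematic
# departure from MAR, and the target quantity (here a mean) is
# recomputed for each plausible delta.
import statistics

def delta_adjusted_means(observed, imputed, deltas):
    """For each delta, shift the imputed values by delta and recompute
    the overall mean. A conclusion that is stable across plausible
    deltas is robust to this form of MNAR departure."""
    results = {}
    for delta in deltas:
        completed = observed + [v + delta for v in imputed]
        results[delta] = statistics.fmean(completed)
    return results

# Toy data: three observed values, two (single-imputation) imputed values.
obs = [5.0, 6.0, 7.0]
imp = [6.0, 5.5]
sens = delta_adjusted_means(obs, imp, deltas=[-1.0, 0.0, 1.0])
```

In a full MI analysis the shift would be applied within each imputed data set before the MI combining rules, but the logic of varying delta and inspecting the stability of the conclusions is the same.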
Acknowledgments
James Carpenter is supported by UK Medical Research Council Grant
MC_UU_00004/07. The manuscript is submitted on behalf of the STRengthening
Analytical Thinking for Observational Studies (STRATOS) initiative (http://stratos-
initiative.org), which aims to provide accessible and accurate guidance documents
for relevant topics in the design and analysis of observational studies. The authors
thank the editor, associate editor, three referees and two reviewers on the
STRATOS publication panel for their helpful comments on the manuscript.
Funding
The author(s) disclosed receipt of the following financial support for the research,
authorship, and/or publication of this article: This work was supported by the UK
Medical Research Council, grant number MC_UU_00004/07.
ORCID iD
Roderick J. Little https://orcid.org/0000-0001-9878-6977
References
Andridge, Rebecca H. and Roderick J. Little. 2010. “A Review of Hot Deck
Imputation for Survey Nonresponse.” International Statistical Review 78(1):40‐64.
Bartlett, Jonathan W., Ofer Harel, and James R. Carpenter. 2015. “Asymptotically
Unbiased Estimation of Exposure Odds Ratios in Complete Records Logistic
Regression.” American Journal of Epidemiology 182(8):730‐6.
Belin, Tom R. 2009. “Missing Data: What a Little can do, and What Researchers can
do in Response.” American Journal of Ophthalmology 148(6):820‐2.
Cao, Weihua, Anastasios A. Tsiatis, and Marie Davidian. 2009. “Improving Efficiency
and Robustness of the Doubly Robust Estimator for a Population Mean with
Incomplete Data.” Biometrika 96:723‐34.
Carpenter, James R. and Michael G. Kenward. 2008. “Missing Data in Clinical Trials
– a Practical Guide.” National Health Service Co-ordinating Centre for Research
Methodology. https://researchonline.lshtm.ac.uk/id/eprint/4018500/.
Collins, Linda M., Joseph L. Schafer, and Chi-Ming Kam. 2001. “A Comparison of
Inclusive and Restrictive Strategies in Modern Missing Data Procedures.”
Psychological Methods 6(4):330‐51.
Carpenter, James R. and Melanie Smuk. 2021. “Missing Data: A Statistical
Framework for Practice.” Biometrical Journal 63:915‐47. https://doi.org/10.
1002/bimj.202000196
Carpenter, James R. and Michael G. Kenward. 2013. Multiple Imputation and Its
Application. New York: Wiley.
Giusti, Caterina and Roderick J. Little. 2011. “A Sensitivity Analysis of Nonignorable
Nonresponse to Income in a Survey with a Rotating Panel Design.” Journal of
Official Statistics 27(2):211‐29.
Horowitz, Joel L. and Charles F. Manski. 1998. “Censoring of Outcomes and
Regressors Due to Survey Nonresponse: Identification and Estimation Using
Weights and Imputations.” Journal of Econometrics 84:37‐58.
Hughes, Rachael A., Jon Heron, Jonathan A.C. Sterne, and Kate Tilling. 2019.
“Accounting for Missing Data in Statistical Analyses: Multiple Imputation is
not Always the Answer.” International Journal of Epidemiology 48(4):
1294‐304.
Lee, Katherine J., Kate M. Tilling, Rosie P. Cornish, Roderick J. Little, Melanie
M. Bell, Els Goetghebeur, Joseph W. Hogan, and James R. Carpenter. 2021.
“Framework for the Treatment and Reporting of Missing Data in Observational
Studies: The Treatment And Reporting of Missing Data in Observational Studies
Framework.” Journal of Clinical Epidemiology, 134: 79‐88.
Little, Roderick J. 1986. “Survey Nonresponse Adjustments.” International Statistical
Review 54:139‐57.
Little, Roderick J. 1988. “Missing Data in Large Surveys (with Discussion).” Journal
of Business and Economic Statistics 6:287‐301.
Little, Roderick J. 1992. “Regression with Missing X’s: A Review.” Journal of the
American Statistical Association 87:1227‐37.
SAS. 2015. The MI Procedure. SAS/STAT 14.1 User’s Guide, SAS Institute Inc.,
Cary, NC, USA.
Schafer, Joseph L. 1997. Analysis of Incomplete Multivariate Data. New York: CRC
Press.
Schafer, Joseph L. 1998. “Multiple Imputation: A Primer.” Statistical Methods in
Medical Research 8:3‐15.
Seaman, Shaun R. and Ian R. White. 2011. “Review of Inverse Probability Weighting
for Dealing with Missing Data.” Statistical Methods in Medical Research 22:278‐
95.
Shapira, Marina, Cristina Iannelli, and Linda Croxford. 2007. Youth Cohort Time
Series for England, Wales and Scotland, 1984–2002. [Data Collection]. Scottish
Centre for Social Research, University of Edinburgh, Centre for Educational
Sociology, National Centre for Social Research, [original data producer(s)].
Scottish Centre for Social Research. SN: 5765, https://doi.org/10.5255/UKDA-
SN-5765-1
Su, Yu-Sung, Andrew Gelman, Jennifer Hill, and Masanao Yajima. 2011. “Multiple
Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box.”
Journal of Statistical Software 45(2):1‐31.
Tompsett, Daniel M., Finbarr Leacy, Margarita Moreno-Betancur, Jon Heron, and Ian
R. White. 2018. “On the Use of the Not-at-Random Fully Conditional Specification
(NARFCS) Procedure in Practice.” Statistics in Medicine 37(15):2338‐53.
van Buuren, Stef. 2018. Flexible Imputation of Missing Data. 2nd Edition. Boca
Raton, FL: Chapman and Hall/CRC.
van Buuren, Stef and Karen Groothuis-Oudshoorn. 2011. “Multivariate Imputation by
Chained Equations in R.” Journal of Statistical Software 45(4):1‐67. For asso-
ciated software see http://www.multiple-imputation.com.
Von Hippel, Paul T. 2007. “Regression with Missing Ys: An Improved Strategy
for Analyzing Multiply Imputed Data.” Sociological Methodology 37(1):
83‐117.
White, Ian R., Rhian Daniel, and Patrick Royston. 2010. “Avoiding Bias Due to
Perfect Prediction in Multiple Imputation of Incomplete Categorical Variables.”
Computational Statistics and Data Analysis 54(10):2267‐75.
White, Ian R., Patrick Royston, and Angela M. Wood. 2011. “Multiple Imputation
Using Chained Equations: Issues and Guidance for Practice.” Statistics in
Medicine 30(4):377‐99.
Zhang, Guangyu and Roderick J. Little. 2009. “Extensions of the Penalized Spline of
Propensity Prediction Method of Imputation.” Biometrics 65(3):911‐8.
Zhang, Guangyu and Roderick J. Little. 2011. “A Comparative Study of
Doubly-Robust Estimators of the Mean with Missing Data.” Journal of
Statistical Computation and Simulation 81(12):2039‐58.
Author Biographies
Roderick J. Little is Richard D. Remington Distinguished University Professor of
Biostatistics at the University of Michigan, where he also holds appointments in the
Institute for Social Research and the Department of Statistics. His research focuses
on methods for the analysis of data with missing values and model-based survey infer-
ence, and the application of statistics to diverse scientific areas, including medicine,
demography, economics, psychiatry, aging and the environment.