Multiple Imputation in Practice
Zj^(i) = φ*0 + φ*1 Z1 + φ*2 Z2 + · · · + φ*j−1 Zj−1 + σ* ε,

where σ* is the estimate of variance from the model and ε is a simulated normal random variate. We refer to this as the regression method. A variant of this approach imputes the observed value of Zj that is closest to Ẑj in the dataset; this ensures that imputed values are plausible, and may be more appropriate if the normality assumption is violated. We refer to this as the predictive mean matching method. The predictive mean model assumes a linear regression model and a monotone structure; otherwise there will be missing predictors in model (1).
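To make the two monotone approaches concrete, the following minimal R sketch (not taken from the article; the variable names z1 and z2 and the sample sizes are invented for illustration) imputes a single incomplete continuous variable both ways. For brevity it plugs in the least squares point estimates; a fully proper implementation would instead draw the starred parameters in the equation above from their posterior distribution.

# Illustrative sketch: impute z2 from a fully observed z1 under a monotone pattern.
set.seed(42)
n  <- 200
z1 <- rnorm(n)
z2 <- 1 + 0.5 * z1 + rnorm(n)
z2[sample(n, 60)] <- NA                          # introduce missing values

obs  <- !is.na(z2)
fit  <- lm(z2 ~ z1, subset = obs)                # imputation model fit to observed cases
pred <- predict(fit, newdata = data.frame(z1 = z1))
sigma.star <- summary(fit)$sigma

# Regression method: predicted value plus simulated normal noise.
z2.reg <- z2
z2.reg[!obs] <- pred[!obs] + rnorm(sum(!obs), 0, sigma.star)

# Predictive mean matching: impute the observed z2 value closest to the prediction.
z2.pmm <- z2
for (i in which(!obs)) {
  z2.pmm[i] <- z2[obs][which.min(abs(z2[obs] - pred[i]))]
}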
The propensity score method uses a different model for imputation. Here, values are imputed from observations that are equally likely to be missing, by fitting a logistic regression model for the missingness indicators. Allison (2000) noted that this method can yield serious bias for imputation of missing covariates under some settings.
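A simplified R sketch of this idea follows; it is not the article's code and not SOLAS's exact algorithm (production implementations typically add an approximate Bayesian bootstrap within each group, omitted here), and the variables z1 and z2 are again invented for illustration.

# Propensity score imputation, simplified: group cases by their estimated
# probability of being missing and sample donors from the observed values.
set.seed(43)
n  <- 200
z1 <- rnorm(n)
z2 <- 1 + 0.5 * z1 + rnorm(n)
z2[rbinom(n, 1, plogis(z1 - 0.5)) == 1] <- NA    # missingness depends on z1

r   <- as.integer(is.na(z2))                     # missingness indicator
ps  <- fitted(glm(r ~ z1, family = binomial))    # estimated propensity to be missing
grp <- cut(ps, quantile(ps, 0:5 / 5), include.lowest = TRUE, labels = FALSE)
for (g in unique(grp[r == 1])) {
  donors   <- z2[r == 0 & grp == g]              # observed values in the same group
  need     <- which(r == 1 & grp == g)
  z2[need] <- donors[sample.int(length(donors), length(need), replace = TRUE)]
}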
For discrete incomplete variables, discriminant analysis (or logistic regression for dichotomous variables) can be used to impute values based on the estimated probability that a missing value takes on a certain value, P(Zj^mis = k | Z^obs). Using Bayes's theorem, this can be calculated from estimates of the joint distribution of Z.
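Spelled out (in LaTeX notation, which the article itself does not use), the Bayes's theorem step for a discrete Zj is

P(Z_j^{mis} = k \mid Z^{obs}) =
  \frac{P(Z^{obs} \mid Z_j = k)\, P(Z_j = k)}
       {\sum_{k'} P(Z^{obs} \mid Z_j = k')\, P(Z_j = k')},

with the quantities on the right estimated from the fitted joint distribution of Z; an imputed value is then drawn from this discrete distribution over k.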
Finally, the MCMC method constructs a Markov chain to simulate draws from the posterior distribution f(Z^mis | Z^obs). This can be implemented using the IP algorithm (Schafer 1997), where at the tth iteration the steps can be defined as:

Imputation-step: Draw Z^mis,(t+1) from f(Z | Z^obs, φ^(t)).
Parameter-step: Draw φ^(t+1) from f(φ | Z^obs, Z^mis,(t+1)).

The Markov chain

Z^(1), φ^(1), Z^(2), φ^(2), . . . , Z^(t+1), φ^(t+1), . . .

can be shown to converge to the posterior distribution of interest. This method has the advantage that it can handle arbitrary patterns of missing data. Schafer (1997) provided a complete exposition of the method in the imputation setting, while Gilks, Richardson, and Spiegelhalter (1996) described the background as well as other applications. As a computational tool, MCMC has some downsides: it requires an assumption of multivariate normality, and it is complicated and computationally expensive. Convergence is difficult to determine, and remains more of an art form than a science. However, MCMC is available in SAS, S-Plus, and MICE, and is thus becoming more mainstream.
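The R sketch below (not from the article) shows the two IP steps for the simplest special case: a single incomplete normal variable z2 regressed on a complete covariate z1, with a flat prior. The names da.impute, z1, z2, and niter are invented; in practice the m imputations would be taken from widely separated iterations or from independent chains.

# Minimal data augmentation (IP) sketch for one incomplete normal variable.
da.impute <- function(z1, z2, niter = 200) {
  mis <- is.na(z2)
  X   <- cbind(1, z1)
  # crude starting parameter values from the observed cases
  start  <- lm.fit(X[!mis, , drop = FALSE], z2[!mis])
  beta   <- start$coefficients
  sigma2 <- sum(start$residuals^2) / (sum(!mis) - ncol(X))
  for (t in 1:niter) {
    # I-step: draw the missing values from their conditional distribution.
    z2[mis] <- X[mis, , drop = FALSE] %*% beta + rnorm(sum(mis), 0, sqrt(sigma2))
    # P-step: draw sigma^2 and the coefficients from their posterior
    # given the completed data.
    fit    <- lm.fit(X, z2)
    sigma2 <- sum(fit$residuals^2) / rchisq(1, length(z2) - ncol(X))
    V      <- chol2inv(chol(crossprod(X)))
    beta   <- fit$coefficients + t(chol(sigma2 * V)) %*% rnorm(ncol(X))
  }
  z2                                             # one completed copy of z2
}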
Some practical suggestions for what variables to include in the imputation model were given by van Buuren, Boshuizen, and Knook (1999). They recommended that this set of variables include those in the complete data model, factors known to be associated with missingness, and factors that explain a considerable amount of variance for the target variables.

Combination step—Finally, the results are combined using results from Rubin (1987), to calculate estimates of the within-imputation and between-imputation variability. These statistics account for the variability of the imputations and, assuming that the imputation model is correct, provide consistent estimates of the parameters and their standard errors. There has been an extensive literature regarding the asymptotic behavior of multiple imputation methods (Barnard and Rubin 1999; Meng and Rubin 1992; Robins and Wang 2000; Rubin 1996; Wang and Robins 1998); these issues are not further considered here.
Notes about imputation—It should be noted that one advantage of multiple imputation as an analytic approach is that it allows the analyst to incorporate additional information into the imputation model. This auxiliary (or extraneous) information may not be of interest in the regression model, but may make the MAR assumption increasingly plausible (Liu, Taylor, and Belin 2000; Rubin 1996); such information is straightforward to incorporate into the imputation model.

A useful quantity in interpreting results from multiple imputation is an estimate of the fraction of missing information (Rubin 1987). This quantity, typically denoted by γ̂, describes how the missing data influence the uncertainty of estimates of β (Schafer 1997). It has been noted that even with a large fraction of missing information, a relatively small number of imputations provides estimates of standard errors that are almost fully efficient (Schafer 1997). Schafer (1999) suggested that no more than 10 imputations are usually required, though this should be investigated more closely if the fraction of missing information is large. In any case, the appropriate number of imputations can be determined informally by carrying out replicate sets of m imputations and determining whether the estimates are stable between sets.
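For reference, Rubin's combining rules and the estimated fraction of missing information can be computed in a few lines; the R function below is a minimal sketch of the standard formulas (the name combine.mi and the example numbers are invented), not code taken from any of the packages reviewed.

# Rubin's (1987) rules for a single coefficient estimated on m imputed datasets;
# est and se hold the m point estimates and their standard errors.
combine.mi <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)                        # combined point estimate
  W    <- mean(se^2)                       # within-imputation variance
  B    <- var(est)                         # between-imputation variance
  Tvar <- W + (1 + 1/m) * B                # total variance
  r    <- (1 + 1/m) * B / W                # relative increase in variance
  df   <- (m - 1) * (1 + 1/r)^2            # degrees of freedom for the t reference
  fmi  <- (r + 2 / (df + 3)) / (r + 1)     # estimated fraction of missing information
  c(estimate = qbar, se = sqrt(Tvar), df = df, fmi = fmi)
}
# Example with hypothetical results from m = 3 imputations:
# combine.mi(est = c(0.98, 1.04, 1.01), se = c(0.06, 0.05, 0.06))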
Before the advent of general purpose packages that supported multiple imputation, the process of generating imputed datasets, managing the results from each of the m datasets, and combining the results required specialized programming or use of macros that were difficult to use. The packages reviewed in this paper, though still more complicated than complete case methods, greatly facilitate the process of using multiple imputation.

3. SOFTWARE PACKAGES

SOLAS version 3.0
Statistical Solutions (North American office)
Stonehill Corporate Center, Suite 104,
999 Broadway, Saugus, MA 01906, USA.
Tel (781) 231-7680
http://www.statsol.ie/solas/solas.htm, [email protected]

SOLAS is designed specifically for the analysis of datasets with missing observations. It offers a variety of multiple imputation techniques in a single, easy-to-use package with a well-designed graphical user interface.
SOLAS supports the predictive mean model (using the closest observed value to the predicted value) and propensity score models for missing continuous variables, and discriminant models for missing binary and categorical variables. Once the multiple datasets are created, the package allows the calculation of descriptive statistics, t tests and ANOVA, frequency tables, and linear regression.

The system automatically provides summary measures by combining the results from the multiple analyses. These reports can be saved as rich text format (rtf) files. It has extensive capabilities to read and write database formats (1-2-3, dBase, FoxPro, Paradox), spreadsheets (Excel), and statistical packages (Gauss, Minitab, SAS, S-Plus, SPSS, Stata, Statistica, and Systat). Imputed datasets can be exported to other statistical packages, though this is a somewhat cumbersome process, since the combination of results from the multiple analyses then needs to be done manually.

The new script language facility is particularly useful in documenting the steps of a multiple imputation run, and for conducting simulations. It can be set up to automatically record the settings from a menu-based multiple imputation session, and store this configuration in a file for later revision and reuse.

SOLAS includes the ability to view missing data patterns, review the quantity and positioning of missing values, and classify them into categories of monotone or nonmonotone missingness. Because it does not consolidate observations with the same pattern of missing data, however, this feature is of limited utility in large datasets.

A nice feature of SOLAS is the fine-grained and intuitive control of the details of the imputation model. As an example, incorporating auxiliary information (variables in the imputation model but not in the regression model of interest) is straightforward.
ward. Once PROC MI has been run, use of complete data methods
is straightforward; the only addition is the specification of a
There are a number of limitations to SOLAS’ implementa-
BY statement to repeat these methods (i.e., PROC GLM, PROC
tion. It is primarily a package for multiple imputation in linear
PHREG, or PROC LOGISTIC) for each value of the variable
regression models, and has limited data manipulation ability.
Imputation . This approach is attractive, since it allows the
While there are extensive options for linear regression, it lacks
full range of regression models available within SAS to be used
the completeness of general-purpose packages (e.g., specifica-
in imputation.
tion of interactions is cumbersome). Nonlinear regression meth-
The results are combined using PROC MIANALYZE, which
ods, such as logistic or survival models, are not supported. Non-
provides a clear summary of the results. SAS provides an op-
monotone missingness is handled in an ad-hoc fashion. While tion (EDF) to use the adjusted degrees of freedom suggested
this may be acceptable in many applications, it is not always by Barnard and Rubin (1999), and it displays estimates of the
appropriate (support for MCMC methods, not available in SO- fraction of missing information for each parameter.
LAS, are particularly attractive in this setting). By default, SO- A disadvantage of the imputation methods provided by PROC
LAS does not display any estimates of the fraction of missing MI is that the analyst has little control over the imputation model
information (this can be calculated separately from the regres- itself. In addition, for the regression and MCMC methods, SAS
sion model), nor standard error estimates for the intercept. The does not impute an observed value that is closest to the predicted
default behavior uses a fixed seed for the random number gen- value (i.e., there is no support for predictive mean matching using
erator seed. While this can be set to 0 (or left blank) to use clock observed values). Instead, it uses an assumption of multivariate
time as seed (except within the script language), the default seed normality to generate a plausible value for the imputation. SAS
will always yield the same imputation results. Using the clock allows the analyst to specify a minimum and maximum value
time as a pseudo-random seed would seem a more reasonable for imputed values on a variable-by-variable basis, as well as
default. the ability to round imputed values. In addition, a SAS data step
Installation was straightforward, and the interface was clear could be used to generate observed values. In practice, however,
and intuitive. The documentation (which consisted of three man- both these approaches are somewhat cumbersome.
uals, with a total of 487 pages) was well organized (it was the No additional installation was needed for PROC MI/PROC
only documentation with an index), and easy to read. The ex- MIANALYZE, since they are bundled with SAS/STAT. The doc-
The documentation was terse (69 pages for PROC MI, 31 pages for PROC MIANALYZE), but well organized; a number of examples were provided.

SAS is licensed on an annual, per-module basis. An annual license for base SAS and SAS/STAT is $3,900 for the first year, and $1,900 for subsequent years. Academic discounts are generally available.

Missing Data Library for S-Plus
Insightful (formerly MathSoft)
(800) 569-0123
http://www.insightful.com, [email protected]

S-Plus 6.0 features a new missing data library, which extends S-Plus to support model-based missing data models, by use of the EM algorithm (Dempster, Laird, and Rubin 1977) and data augmentation (DA) algorithms (Tanner and Wong 1987). DA algorithms can be used to generate multiple imputations. The missing data library provides support for multivariate normal data (impGauss), categorical data (impLoglin), and conditional Gaussian models (impCgm) for imputations involving both discrete and continuous variables.

The package provides a concise summary of missing data distributions and patterns (further described in the discussion of the examples), including both text-based and graphical displays. There are good diagnostics provided for the convergence of the data augmentation algorithms. The printed documentation, while extensive (164 pages), is light on examples of imputation, instead focusing more on data augmentation and maximum likelihood (EM) based approaches. It provides an excellent tutorial regarding missing data methods in general.

S-Plus has strong support for file input, including the ability to connect directly to Excel, and to read files in a variety of formats (including Access, dBASE, Gauss, Matlab, Paradox, SAS, SPSS, Stata, and Systat).

The single-user license price for S-Plus 2000 Professional for Windows is $2,500; S-Plus 6.0 is expected to have similar pricing. Discounted academic pricing is available, including academic site licenses.
MICE (Multivariate Imputation by Chained Equations)

For each incomplete variable, MICE allows the user to specify an elementary imputation method; available methods include linear regression, logistic and polytomous regression, and discriminant analysis. Nonmonotone missingness is handled by using chained equations (MCMC) to loop through all missing values. Extensive graphical summaries of the MCMC process are provided. In addition, MICE allows users to program their own imputation functions, which is useful for undertaking sensitivity analyses of different (possibly nonignorable) missingness models. The system allows transformation of variables, and fine-grained control over the choice of predictors in the imputation model. The imputation step is carried out using the mice() function. For continuous missing variables, MICE supports imputation using the norm function (similar to SAS's REGRESSION option), and the pmm function (similar to SOLAS's predictive mean matching). Completed datasets can be extracted using the complete() function, or an analysis can be run for each imputation using the lm.mids() or glm.mids() functions. Finally, results can be combined using the pool() function.

Although computationally attractive, the chained equation approach implemented in MICE requires assumptions about the existence of the multivariate posterior distribution used for sampling; however, it is not always certain that such a distribution exists (van Buuren et al. 1999). Like SOLAS, MICE uses a fixed seed for random number generation, which must be overridden during the imputation phase to avoid always generating the same imputed values. It would be preferable to have this seed vary by default, but allow the option to fix the seed to allow replication of results.

Installation was straightforward, though automated addition of packages under R is only supported on Unix systems. In addition to the mice package, under R, two additional add-on packages were required (MASS and nnet). The documentation was short (39 pages) and terse (particularly regarding the imputation models) but clear. An example using the NHANES dataset provided a summary of how to use the package. The manual included the help pages for each function in the library.

S-Plus was described previously. R is free software for Unix, Windows, and Macintosh that is distributed under a GNU-style copyleft. More information can be found at the R project web site: www.r-project.org. The MICE library is freely available, and may be downloaded from the www.multiple-imputation.com Web site.
Amelia uses an EM-based algorithm (King, Honaker, Joseph, and Scheve 2001) and performs the imputation step, but does not provide support for analysis of these imputed datasets or combination of the results. IVEware (http://www.isr.umich.edu/src/smp/ive) by Raghunathan et al. is a SAS version 6.12 callable routine built using the SAS macro language. It extends multiple imputation to support complex survey sample designs. HLM (hierarchical linear and nonlinear modeling) version 5 supports the analysis of multiply-imputed datasets, where multiple plausible datasets have been previously created. LISREL 8.20 and later supports multiple imputation, but the focus of this package is not on the regression models described in this article. Because of the existence of more complete implementations, these packages are not further discussed.

For the artificial data example, X1, X2, and ε were drawn from a multivariate normal distribution with covariance given by:

        ( 1    0.5  0 )
    Σ = ( 0.5  1    0 ) .
        ( 0    0    1 )

The true regression model is given by:

    E[Y | X1, X2] = β0 + β1 X1 + β2 X2.

We generated 10,000 observations from the multivariate normal distribution, with Y = X1 + X2 + ε (i.e., β0 = 0, β1 = β2 = 1). Following Allison's example, we caused approximately half of the values of X2 in this dataset to be missing, under several missingness mechanisms; the resulting estimates are summarized in Table 1.
Table 1. Parameter Estimates (and standard errors) From Artificial Datasets (true parameter values 1.00 and 1.00). Columns: missing mechanism; parameter; complete case; MI propensity; MI regression; MI MCMC.
S-Plus:

library(missing)
emstart <- emGauss(allison)
worstFraction(emstart)
start <- matrix(rep(emstart$paramIter[2,], 10), nrow = 10, byrow = T)
imp <- impGauss(allison, start = start, control = list(niter = 200))
fit <- miEval(lm(y ~ x1 + x2, data = imp))
result <- miMeanSE(fit)

MICE:

library(mice)
imp <- mice(allison, imputationMethod = "pmm", m = 10, seed = 456)
fit <- lm.mids(y ~ x1 + x2, imp)
result <- pool(fit)

Figure 3. Code to Fit Models for Artificial Data Example.
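For completeness, the allison data frame referenced in Figure 3 could be generated along the following lines. This sketch is not the article's code: the object names (allison, Sigma, xe) and the use of MASS::mvrnorm are assumptions, and the article's specific missingness mechanisms for X2 are not reproduced here.

# Generate the artificial data: (X1, X2, epsilon) trivariate normal with the
# covariance matrix given above, and Y = X1 + X2 + epsilon.
library(MASS)
set.seed(456)
Sigma <- matrix(c(1,   0.5, 0,
                  0.5, 1,   0,
                  0,   0,   1), nrow = 3)
xe <- mvrnorm(10000, mu = c(0, 0, 0), Sigma = Sigma)
allison <- data.frame(x1 = xe[, 1],
                      x2 = xe[, 2],
                      y  = xe[, 1] + xe[, 2] + xe[, 3])   # beta0 = 0, beta1 = beta2 = 1
# Values of x2 would then be set to NA under each missingness mechanism
# before running the code in Figure 3.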
Figure 4. Distribution of Estimates of MOMSTRS Parameter Using Different Number of Imputations (based on 50 replications).

Figure 5. Distribution of Standard Error Estimates for MOMSTRS Parameter Using Different Number of Imputations (based on 50 replications).