Multiple Imputation Presentation
Adrienne D. Woods
Methods Hour Brown Bag
April 14, 2017
A COLLECTIVIST APPROACH TO BEST PRACTICES
• As I began learning about multiple imputation (MI) last semester, I realized that there are a lot of
guidelines that are not often followed…
• …or, if they are, nobody reports what they did!
• …or the guidelines are outdated and/or differ across disciplines
“[Controlling] variables that help account for the mechanisms resulting in missing data (e.g., race/ethnicity, age,
gender, SES)…leads to a reasonable assumption of missing at random (MAR).” (Hibel, Farkas, & Morgan, 2010)
Acock, 2005; Graham, 2009; Hibel, Farkas, & Morgan, 2010; Schafer, 1999
THE WHAT: WHAT IS MULTIPLE IMPUTATION?
Then, imputed another 46 datasets to get to m = 50, and checked FMI again:
• R – mice package
• Completely syntax-based, can get out of hand for uninitiated/beginners
• Stata – multiple imputation feature
• Subsequent analyses are run by prefixing each estimation command with “mi estimate:”
• SPSS – multiple imputation feature
• Creates one dataset or imputes X separate datasets (useful for HLM, for example)
• But, limited in options
• e.g., can’t adjust the k-nearest-neighbor (knn) settings
CO-CONSTRUCTED KNOWLEDGE & DISCUSSION:
Main Take-Aways:
• First, always know what type of missing data you are working with
• Base m (the number of imputations) on the fraction of missing information (FMI) – rule of thumb is FMI/m < .01
• Know your analysis model beforehand and include at least all analysis variables in imputation model
(including interaction terms)
• Above all, be explicit about your choices.
• Include software you used to impute, auxiliary variables, etc.
• If not written out in actual manuscript, add to appendices!
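The FMI/m < .01 rule of thumb above can be checked with a few lines of Python. This is only a sketch: the function names, the default threshold argument, and the example FMI values are assumptions made for illustration, not part of the talk or of any imputation package.

```python
def meets_fmi_rule(fmi: float, m: int, threshold: float = 0.01) -> bool:
    """Return True when FMI/m falls below the rule-of-thumb threshold."""
    if not 0.0 <= fmi <= 1.0:
        raise ValueError("fmi must be between 0 and 1")
    if m < 1:
        raise ValueError("m must be a positive integer")
    return fmi / m < threshold


def smallest_m(fmi: float, threshold: float = 0.01, m_max: int = 1000) -> int:
    """Search upward for the smallest m that satisfies FMI/m < threshold."""
    for m in range(1, m_max + 1):
        if meets_fmi_rule(fmi, m, threshold):
            return m
    raise ValueError("no m up to m_max satisfies the rule")


if __name__ == "__main__":
    # Hypothetical FMI values, as would be read off the software output
    # for the analysis variable with the largest FMI.
    for fmi in (0.10, 0.25, 0.45):
        print(f"FMI = {fmi:.2f}: need at least m = {smallest_m(fmi)}")
```

In practice the FMI is itself estimated from an initial set of imputations, so the check is usually repeated after more datasets have been imputed.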
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of
Psychology, 60, 549-576.
Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really needed? Some
practical clarifications of multiple imputation theory. Prevention Science, 8(3), 206-213.
Rubin, D. B. (1987). Comment. Journal of the American Statistical Association, 82(398), 543-546.
Shin, T., Davison, M. L., & Long, J. D. (2016). Maximum likelihood versus multiple imputation for missing data in small longitudinal
samples with nonnormality.
Spratt, M., Carpenter, J., Sterne, J. A., Carlin, J. B., Heron, J., Henderson, J., & Tilling, K. (2010). Strategies for
multiple imputation in longitudinal studies. American Journal of Epidemiology, 172(4), 478-487.
Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Pearson.
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal
of Statistical Software, 45(3).
White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance
for practice. Statistics in Medicine, 30(4), 377-399.
RELATIVE EFFICIENCY OF M
“The variability between sets of imputations depends on both the number of imputations used and the
fraction of missing information. However, the fraction of missing information is itself estimated using the
between- and within-imputation variances, and thus may have substantial variability when estimated from
small numbers of imputations. Monte Carlo variation among sets of small numbers of imputations can be
substantial enough to materially affect conclusions, particularly where the original data set is small. One
approach might be to estimate the Monte Carlo variation and use that to decide the appropriate number of
imputations.” (p. 486, Spratt et al., 2010)
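To make concrete how the fraction of missing information is built from the between- and within-imputation variances, here is a minimal Python sketch for a single parameter, using the standard pooling formulas usually credited to Rubin. The function name and the input values are hypothetical; real software (e.g., mice or Stata) reports these quantities directly.

```python
from statistics import mean, variance


def fraction_missing_info(estimates, variances):
    """Return (lambda, fmi) for one parameter pooled over m imputed datasets.

    estimates: the m point estimates, one per imputed dataset
    variances: the m squared standard errors, one per imputed dataset
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need matching lists with at least two imputations")
    w = mean(variances)        # within-imputation variance
    b = variance(estimates)    # between-imputation variance (m - 1 denominator)
    if w <= 0:
        raise ValueError("within-imputation variance must be positive")
    t = w + (1 + 1 / m) * b    # total variance of the pooled estimate
    lam = (1 + 1 / m) * b / t  # share of total variance due to missingness
    if b == 0:
        return lam, 0.0        # imputations agree exactly: nothing is lost
    r = (1 + 1 / m) * b / w    # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2
    fmi = (r + 2 / (df + 3)) / (r + 1)  # small-m adjusted FMI
    return lam, fmi


if __name__ == "__main__":
    # Hypothetical regression coefficient estimated in m = 5 imputed datasets.
    est = [0.42, 0.47, 0.39, 0.51, 0.44]
    se2 = [0.010, 0.011, 0.009, 0.012, 0.010]
    lam, fmi = fraction_missing_info(est, se2)
    print(f"lambda = {lam:.3f}, FMI = {fmi:.3f}")
```

Because b is estimated from only m values, rerunning this with a different set of five imputations can give a noticeably different FMI, which is the Monte Carlo variation the quote describes.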
“The early literature focused on efficiency, and the conclusion was that you could usually get by with three
to five data sets. Schafer (1999) upped that number slightly when he stated that “Unless rates of missing
information are unusually high, there tends to be little or no practical benefit to using more than five to ten
imputations.” That conclusion was based on Rubin’s formula for relative efficiency: 1/(1+F/M) where F is the
fraction of missing information and M is the number of imputations. Thus, even with 50% missing
information, five imputed data sets would produce point estimates that were 91% as efficient as those based
on an infinite number of imputations. Ten data sets would yield 95% efficiency. But what’s good enough for
efficiency isn’t necessarily good enough for standard error estimates, confidence intervals, and p-values.”
(Allison, 2012)
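The relative-efficiency formula quoted above, 1/(1+F/M), is easy to verify numerically. The short Python sketch below reproduces the 91% and 95% figures for 50% missing information; the function name and the list of M values are illustrative choices only.

```python
def relative_efficiency(fmi: float, m: int) -> float:
    """Relative efficiency of m imputations versus infinitely many: 1 / (1 + F/M)."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    return 1.0 / (1.0 + fmi / m)


if __name__ == "__main__":
    fmi = 0.5  # 50% missing information, as in the quoted example
    for m in (3, 5, 10, 20, 50):
        print(f"M = {m:>2}: relative efficiency = {relative_efficiency(fmi, m):.1%}")
```

Efficiency climbs quickly and then flattens, which is why the early literature settled on three to five datasets; as the quote notes, this says nothing about how stable the standard errors and p-values are.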