Flexible Imputation of Missing Data
Second Edition
CHAPMAN & HALL/CRC
Interdisciplinary Statistics Series
Series editors: N. Keiding, B.J.T. Morgan, C.K. Wikle, P. van der Heijden
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20180613
International Standard Book Number-13: 978-1-138-58831-8 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
I Basics  1
1 Introduction  3
1.1 The problem of missing data  3
1.1.1 Current practice  3
1.1.2 Changing perspective on missing data  6
1.2 Concepts of MCAR, MAR and MNAR  8
1.3 Ad-hoc solutions  9
1.3.1 Listwise deletion  9
1.3.2 Pairwise deletion  11
1.3.3 Mean imputation  12
1.3.4 Regression imputation  13
1.3.5 Stochastic regression imputation  14
1.3.6 LOCF and BOCF  16
1.3.7 Indicator method  17
1.3.8 Summary  18
1.4 Multiple imputation in a nutshell  19
1.4.1 Procedure  19
1.4.2 Reasons to use multiple imputation  20
1.4.3 Example of multiple imputation  21
1.5 Goal of the book  23
1.6 What the book does not cover  23
1.6.1 Prevention  24
1.6.2 Weighting procedures  24
1.6.3 Likelihood-based approaches  25
2 Multiple imputation  29
2.1 Historic overview  29
2.1.1 Imputation  29
2.1.2 Multiple imputation  30
2.1.3 The expanding literature on multiple imputation  32
2.2 Concepts in incomplete data  33
2.2.1 Incomplete-data perspective  33
2.2.2 Causes of missing data  33
2.2.3 Notation  35
2.2.4 MCAR, MAR and MNAR again  36
2.2.5 Ignorable and nonignorable♠  38
2.2.6 Implications of ignorability  39
2.3 Why and when multiple imputation works  41
2.3.1 Goal of multiple imputation  41
2.3.2 Three sources of variation♠  41
2.3.3 Proper imputation  44
2.3.4 Scope of the imputation model  46
2.3.5 Variance ratios♠  46
2.3.6 Degrees of freedom♠  47
2.3.7 Numerical example  48
2.4 Statistical intervals and tests  49
2.4.1 Scalar or multi-parameter inference?  49
2.4.2 Scalar inference  50
2.4.3 Numerical example  50
2.5 How to evaluate imputation methods  51
2.5.1 Simulation designs and performance measures  51
2.5.2 Evaluation criteria  52
2.5.3 Example  53
2.6 Imputation is not prediction  55
2.7 When not to use multiple imputation  57
2.8 How many imputations?  58
2.9 Exercises  61
3.2.1 Overview  67
3.2.2 Algorithms♠  67
3.2.3 Performance  69
3.2.4 Generating MAR missing data  70
3.2.5 MAR missing data generation in multivariate data  72
3.2.6 Conclusion  73
3.3 Imputation under non-normal distributions  74
3.3.1 Overview  74
3.3.2 Imputation from the t-distribution  75
3.4 Predictive mean matching  77
3.4.1 Overview  77
3.4.2 Computational details♠  79
3.4.3 Number of donors  81
3.4.4 Pitfalls  82
3.4.5 Conclusion  84
3.5 Classification and regression trees  84
3.5.1 Overview  84
3.6 Categorical data  87
3.6.1 Generalized linear model  87
3.6.2 Perfect prediction♠  89
3.6.3 Evaluation  90
3.7 Other data types  91
3.7.1 Count data  91
3.7.2 Semi-continuous data  92
3.7.3 Censored, truncated and rounded data  93
3.8 Nonignorable missing data  96
3.8.1 Overview  96
3.8.2 Selection model  97
3.8.3 Pattern-mixture model  98
3.8.4 Converting selection and pattern-mixture models  99
3.8.5 Sensitivity analysis  100
3.8.6 Role of sensitivity analysis  101
3.8.7 Recent developments  102
3.9 Exercises  102
IV Extensions  337
12 Conclusion  339
12.1 Some dangers, some do’s and some don’ts  339
12.1.1 Some dangers  339
12.1.2 Some do’s  340
12.1.3 Some don’ts  341
12.2 Reporting  342
12.2.1 Reporting guidelines  343
12.2.2 Template  344
12.3 Other applications  345
12.3.1 Synthetic datasets for data protection  345
12.3.2 Analysis of coarsened data  345
12.3.3 File matching of multiple datasets  346
12.3.4 Planned missing data for efficient designs  346
12.3.5 Adjusting for verification bias  347
12.4 Future developments  347
12.4.1 Derived variables  347
12.4.2 Algorithms for blocks and batches  347
12.4.3 Nested imputation  348
12.4.4 Better trials with dynamic treatment regimes  348
12.4.5 Distribution-free pooling rules  348
12.4.6 Improved diagnostic techniques  349
12.4.7 Building block in modular statistics  349
12.5 Exercises  349
References 351
I’m delighted to see this new book on multiple imputation by Stef van Buuren
for several reasons. First, to me at least, having another book devoted to multi-
ple imputation marks the maturing of the topic after an admittedly somewhat
shaky initiation. Stef is certainly correct when he states in Section 2.1.2: “The
idea to create multiple versions must have seemed outrageous at that time
[late 1970s]. Drawing imputations from a distribution, instead of estimating
the ‘best’ value, was a severe breach with everything that had been done be-
fore.” I remember how this idea of multiple imputation was even ridiculed
by some more traditional statisticians, sometimes for just being “silly” and
sometimes for being hopelessly inefficient with respect to storage demands
and outrageously expensive with respect to computational requirements.
Some others of us foresaw what was happening to both (a) computational
storage (I just acquired a 64 GB flash drive the size of a small finger for
under $60, whereas only a couple of decades ago I paid over $2500 for a
120 KB hard drive larger than a shoebox weighing about 10 kilos), and (b)
computational speed and flexibility. To develop statistical methods for the
future while being bound by computational limitations of the past was clearly
inapposite. Multiple imputation’s early survival was clearly due to the insight
of a younger generation of statisticians, including many colleagues and former
students, who realized future possibilities.
A second reason for my delight at the publication of this book is more
personal and concerns the maturing of the author, Stef van Buuren. As he
mentions, we first met through Jan van Rijckevorsel at TNO. Stef was a
young and enthusiastic researcher there, who knew little about the kind of
statistics that I felt was essential for making progress on the topic of dealing
with missing data. But consider the progress over the decades starting with
his earlier work on MICE! Stef has matured into an independent researcher
making important and original contributions to the continued development of
multiple imputation.
This book represents a ‘no nonsense’ straightforward approach to the ap-
plication of multiple imputation. I particularly like Stef’s use of graphical
displays, which are badly needed in practice to supplement the more theoret-
ical discussions of the general validity of multiple imputation methods. As I
have said elsewhere, and as implied by much of what is written by Stef, “It’s
not that multiple imputation is so good; it’s really that other methods for
addressing missing data are so bad.” It’s great to have Stef’s book on multiple
imputation, and I look forward to seeing more editions as this rapidly
developing methodology continues to become even more effective at handling
missing data problems in practice.
Finally, I would like to say that this book reinforces the pride of an aca-
demic father who has watched one of his children grow and develop. This book
is a step in the growing list of contributions that Stef has made, and, I am
confident, will continue to make, in methodology, computational approaches
and application of multiple imputation.
Donald B. Rubin
Preface to first edition
with him on many occasions. His clear vision and deceptively simple ideas
have been a tremendous source of inspiration. I am also indebted to Jan
van Rijckevorsel for bringing me into contact with Donald Rubin, and for
establishing the scientific climate at TNO in which our work on missing data
techniques could prosper.
Many people have helped realize this project. I thank Nico van Meeteren
and Michael Holewijn of TNO for their trust and support. I thank Peter van
der Heijden of Utrecht University for his support. I thank Rob Calver and the
staff at Chapman & Hall/CRC for their help and advice. Many colleagues have
commented on part or all of the manuscript: Hendriek Boshuizen, Elise Dussel-
dorp, Karin Groothuis-Oudshoorn, Michael Hermanussen, Martijn Heymans,
Nicholas Horton, Shahab Jolani, Gerko Vink, Ian White and the research mas-
ter students of the Spring 2011 class. Their comments have been very valuable
for detecting and eliminating quite a few glitches. I happily take the blame
for the remaining errors and vagaries.
The major part of the manuscript was written during a six-month sab-
batical leave. I spent four months in Krukö, Sweden, a small village of just
eight houses. I thank Frank van den Nieuwenhuijzen and Ynske de Koning
for making their wonderful green house available to me. It was the perfect
tranquil environment that, apart from snowplowing, provided a minimum of
distractions. I also spent two months at the residence of Michael Hermanussen
and Beate Lohse-Hermanussen in Altenhof, Germany. I thank them for their
hospitality, creativity and wit. It was a wonderful time.
Finally, I thank my family, in particular my beloved wife Eveline, for their
warm and ongoing support, and for allowing me to devote time, often nights
and weekends, to work on this book. Eveline liked to tease me by telling
people that I was writing “a book that no one understands.” I fear that her
statement is accurate, at least for 99% of the people. My hope is that you, my
dear reader, will belong to the remaining 1%.
Part I
Basics
Chapter 1
Introduction
y <- c(1, 2, 4)
mean(y)
[1] 2.33
y <- c(1, 2, NA)
mean(y)
[1] NA
The mean is now undefined, and R informs us about this outcome by setting
the mean to NA. It is possible to add an extra argument na.rm = TRUE to the
function call. This removes any missing data before calculating the mean:
mean(y, na.rm = TRUE)
[1] 1.5
This makes it possible to calculate a result, but of course the set of observations
on which the calculations are based has changed. This may cause problems in
statistical inference and interpretation.
It gets worse with multivariate analysis. For example, let us try to predict
daily ozone concentration (ppb) from wind speed (mph) using the built-in
airquality dataset. We fit a linear regression model by calling the lm()
function to predict daily ozone levels, as follows:
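The call below is a reconstruction, not the book's verbatim code; it assumes that the na.fail action is in effect, which makes lm() stop when it encounters missing values.

options(na.action = na.fail)   # assumption: makes lm() fail on missing data
fit <- lm(Ozone ~ Wind, data = airquality)
# Error: missing values in object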
Many R users have seen this message. The code cannot continue because there
are missing values. One way to circumvent the problem is to omit any incom-
plete records by specifying the na.action = na.omit argument to lm(). The
regression weights can now be obtained as
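A sketch of this step (a reconstruction; the coefficient values printed in the book are omitted here):

fit <- lm(Ozone ~ Wind, data = airquality, na.action = na.omit)
coef(fit)

The deletion option can also be set globally for all subsequent calls by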
options(na.action = na.omit)
which eliminates the error message once and for all. Users of other software
packages like SPSS, SAS and Stata enjoy the “luxury” that this deletion option
has already been set for them, so the calculations can progress silently. Next,
we wish to plot the predicted ozone levels against the observed data, so we
use predict() to calculate the predicted values, and add these to the data to
prepare for plotting.
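A hypothetical sketch of this step (the column name predicted is illustrative):

airquality$predicted <- predict(fit)
# fails: airquality has 153 rows, but predict(fit) returns only 116 values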
Argg... that doesn’t work either. The error message tells us that the two
datasets have a different number of rows. The airquality data has 153 rows,
whereas there are only 116 predicted values. The problem, of course, is that
there are missing data. The lm() function dropped any incomplete rows in
the data. We find the indices of the first six cases by
head(na.action(fit))
5 10 25 26 27 32
5 10 25 26 27 32
Figure 1.1: Predicted versus measured ozone levels for the observed (left, blue) and missing values (right, red).
naprint(na.action(fit))
colSums(is.na(airquality))
The first call reports that 37 observations were deleted due to missingness; the
second shows that only Ozone (37 values) and Solar.R (7 values) contain missing
data, so in our regression model all 37 deleted cases have missing ozone scores.
Removing the incomplete cases prior to analysis is known as listwise dele-
tion or complete-case analysis. In R, there are two related functions for the
subset of complete cases, na.omit() and complete.cases().
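For example (a small sketch, not taken from the book):

cc <- airquality[complete.cases(airquality), ]   # same rows as na.omit(airquality)
nrow(cc)                                         # 111 of the 153 rows are complete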
Figure 1.1 plots the predicted against the observed values. Here we adopt
the Abayomi convention for the colors (Abayomi et al., 2008): Blue refers
to the observed part of the data, red to the synthetic part of the data (also
called the imputed values or imputations), and black to the combined data
(also called the imputed data or completed data). The printed version of the
first edition of this book used gray instead of blue. The blue points on the left
are all from the complete cases, whereas the figure on the right plots the points
for the incomplete cases (in red). Since there are no measured ozone levels in
that part of the data, the possible values are indicated by 37 horizontal lines.
Listwise deletion allows the calculations to proceed, but it may introduce
additional complexities in interpretation. Let’s try to find a better predictive
model by including solar radiation (Solar.R) into the model as
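A sketch of the extended call (a reconstruction, not the book's verbatim code):

fit2 <- lm(Ozone ~ Wind + Solar.R, data = airquality, na.action = na.omit)
naprint(na.action(fit2))   # 42 observations deleted due to missingness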
Observe that the number of deleted days has increased to 42, since some rows
had no value for Solar.R. Thus, changing the model altered the sample.
There are methodological and statistical issues associated with this proce-
dure. Some questions that come to mind are:
distorted. MNAR includes the possibility that the scale produces more missing
values for the heavier objects (as above), a situation that might be difficult to
recognize and handle. An example of MNAR in public opinion research occurs
if those with weaker opinions respond less often. MNAR is the most complex
case. Strategies to handle MNAR are to find more data about the causes for
the missingness, or to perform what-if analyses to see how sensitive the results
are under various scenarios.
Rubin’s distinction is important for understanding why some methods will
work and others will not. His theory lays down the conditions under which a
missing data method can provide valid statistical inferences. Most simple fixes
only work under the restrictive and often unrealistic MCAR assumption. If
MCAR is implausible, such methods can provide biased estimates.
The standard lm() function does not take means and covariances as input,
but the lavaan package (Rosseel, 2012) provides this feature:
library(lavaan)
# mu and cv are the pairwise-deletion means and covariances computed in code
# not shown here; a plausible reconstruction (an assumption) is
# mu <- colMeans(data, na.rm = TRUE) and cv <- cov(data, use = "pairwise"),
# with data <- airquality[, c("Ozone", "Solar.R", "Wind")].
fit <- lavaan("Ozone ~ 1 + Wind + Solar.R
               Ozone ~~ Ozone",
              sample.mean = mu, sample.cov = cv,
              sample.nobs = sum(complete.cases(data)))
The method is simple, and appears to use all available information. Under
MCAR, it produces consistent estimates of mean, correlations and covari-
ances (Little and Rubin, 2002, p. 55). The method also has some shortcom-
ings. First, the estimates can be biased if the data are not MCAR. Further,
the covariance and/or correlation matrix may not be positive definite, which
is a requirement for most multivariate procedures. Problems are generally more
severe for highly correlated variables (Little, 1992). It is not clear what sample
size should be used for calculating standard errors. Taking the average sample
size yields standard errors that are too small (Little, 1992). Also, pairwise dele-
tion requires numerical data that follow an approximate normal distribution,
whereas in practice we often have variables of mixed types.
The idea to use all available information is good, but the proper analysis of
the pairwise matrix requires sophisticated optimization techniques and special
formulas to calculate the standard errors (Van Praag et al., 1985; Marsh,
1998), which somewhat defeats its utility. Pairwise deletion works best
if the data approximate a multivariate normal distribution, if the correlations
between the variables are low, and if the assumption of MCAR is plausible.
It is not recommended for other cases.
install.packages("mice")
library("mice")
Figure 1.2: Mean imputation of Ozone. Blue indicates the observed data, red indicates the imputed values.
Figure 1.3 shows the result. The imputed values correspond to the most
likely values under the model. However, the ensemble of imputed values varies
less than the observed values. It may be that each of the individual points is
the best under the model, but it is very unlikely that the real (but unobserved)
values of Ozone would have had this distribution. Imputing predicted values
also has an effect on the correlation. The red points have a correlation of 1
since they are located on a line. If the red and blue dots are combined, then
the correlation increases from 0.35 to 0.39. Note that this upward bias grows
with the percentage of missing ozone levels (here 24%).
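Figure 1.3 can be reproduced along these lines (a sketch that assumes mice's "norm.predict" method, which imputes the predicted value from a linear regression fitted to the observed data; the book's exact call may differ):

imp <- mice(airquality[, c("Ozone", "Solar.R")], method = "norm.predict",
            m = 1, maxit = 1, seed = 1, print = FALSE)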
Figure 1.3: Regression imputation: Imputing Ozone from the regression line.
Figure 1.4: Stochastic regression imputation of Ozone.
Figure 1.5: LOCF imputation of Ozone (ppb) for the first 80 days.
only the regression weights, but also the correlation between variables (cf. Ex-
ercise 3). The idea of drawing imputations from the residuals is very powerful, and forms
the basis of more advanced imputation techniques.
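Stochastic regression imputation is available in mice as method "norm.nob"; the call below is a sketch, not the book's exact code:

imp <- mice(airquality[, c("Ozone", "Solar.R")], method = "norm.nob",
            m = 1, maxit = 1, seed = 1, print = FALSE)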
Figure 1.5 plots the results of the first 80 days of the Ozone series. The
stretches of red dots indicate the imputations, and are constant within the
same batch of missing ozone levels. The real, unseen values are likely to vary
within these batches, so applying LOCF here gives implausible imputations.
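One way to apply LOCF to the Ozone series is with the fill() function of the tidyr package (a sketch; the book's own implementation may differ):

library(tidyr)
airquality2 <- fill(airquality, Ozone)   # carry the last observed Ozone value forward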
LOCF is convenient because it generates a complete dataset. It can be
applied with confidence in cases where we are certain what the missing values
should be, for example, for administrative variables in longitudinal data. For
outcomes, LOCF is dubious. The method has long been used in clinical tri-
als. The U.S. Food and Drug Administration (FDA) has traditionally viewed
Observe that since the missing data are in Ozone we needed to reverse the
direction of the regression model.
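The step just described might look as follows (a hypothetical sketch, not the book's code):

imp <- mice(airquality, method = "mean", m = 1, maxit = 1, print = FALSE)
dat <- cbind(complete(imp), r.Ozone = is.na(airquality$Ozone))
fit <- lm(Wind ~ Ozone + r.Ozone, data = dat)   # reversed: Wind is the outcome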
The indicator method has been popular in public health and epidemiol-
ogy. An advantage is that the indicator method retains the full dataset. Also,
it allows for systematic differences between the observed and the unobserved
data by inclusion of the response indicator, and could be more efficient. White
and Thompson (2005) pointed out that the method can be useful to estimate
the treatment effect in randomized trials when a baseline covariate is partially
observed. If the missing data are restricted to the covariate, if the interest is
solely restricted to estimation of the treatment effect, if compliance to the
allocated treatment is perfect and if the model is linear without interactions,
then using the indicator method for that covariate yields an unbiased esti-
mate of the treatment effect. This is true even if the missingness depends on
the covariate itself. Additional work can be found in Groenwold et al. (2012);
Sullivan et al. (2018). It is not yet clear whether the coverage of the confi-
dence interval around the treatment estimate will be satisfactory for multiple
incomplete baseline covariates.
The conditions under which the indicator method works may not be met
in practice. For example, the method does not allow for missing data in the
outcome, and generally fails in observational data. It has been shown that
the method can yield severely biased regression estimates, even under MCAR
and for low amounts of missing data (Vach and Blettner, 1991; Greenland and
Finkle, 1995; Knol et al., 2010). The indicator method may have its uses in
particular situations, but fails as a generic method to handle missing data.
1.3.8 Summary
Table 1.1 provides a summary of the methods discussed in this section. The
table addresses two topics: whether the method yields the correct results on
average (unbiasedness), and whether it produces the correct standard error.
Unbiasedness is evaluated with respect to three types of estimates: the mean,
the regression weight (with the incomplete variable as dependent) and the
correlation.
The table identifies the assumptions on the missing data mechanism each
method must make in order to produce unbiased estimates. The first line of
the table should be read as follows:
1. Listwise deletion produces an unbiased estimate of the mean provided
that the data are MCAR;
2. Listwise deletion produces an estimate of the standard error that is too
large.
The interpretation of the other lines is similar. The “–” sign in some cells
indicates that the method cannot produce unbiased estimates. Observe that
both deletion methods require MCAR for all types. Regression imputation and
stochastic regression imputation can yield unbiased estimates under MAR. In
order to work, the model needs to be correctly specified. LOCF and the in-
dicator method are incapable of providing consistent estimates, even under
MCAR. Note that some special cases are not covered in Table 1.1. For example,
listwise deletion is unbiased under two special MNAR scenarios (cf. Section 2.7).
1 If the missing values occur in the outcome only. See Section 2.5.3 for another case.

Figure 1.6: Scheme of the main steps in multiple imputation.
Listwise deletion produces standard errors that are correct for the subset
of complete cases, but in general too large for the entire dataset. Calculation of
standard errors under pairwise deletion is complicated. The standard errors
after single imputation are too small since the standard calculations make
no distinction between the observed data and the imputed data. Correction
factors for some situations have been developed (Schafer and Schenker, 2000),
but a more convenient solution is multiple imputation.
by plausible data values. These plausible values are drawn from a distribution
specifically modeled for each missing entry. Figure 1.6 portrays m = 3 imputed
datasets. In practice, m is often taken larger (cf. Section 2.8). The number
m = 3 is taken here just to make the point that the technique creates multiple
versions of the imputed data. The three imputed datasets are identical for the
observed data entries, but differ in the imputed values. The magnitude of
these differences reflects our uncertainty about what value to impute.
The second step is to estimate the parameters of interest from each im-
puted dataset. This is typically done by applying the analytic method that we
would have used had the data been complete. The results will differ because
their input data differ. It is important to realize that these differences are
caused only by the uncertainty about what value to impute.
The last step is to pool the m parameter estimates into one estimate, and
to estimate its variance. The variance combines the conventional sampling
variance (within-imputation variance) and the extra variance caused by the
missing data extra variance caused by the missing data (between-imputation
variance). Under the appropriate conditions, the pooled estimates are unbiased
and have the correct statistical properties.
There is much more to say about each of these steps, but it shows that
multiple imputation need not be a daunting task. Assuming we have set
options(na.action = na.omit), fitting the same model to the complete
cases can be done by
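For example (a sketch with the same illustrative formula as above):

fit.cc <- lm(Ozone ~ Wind + Temp + Solar.R, data = airquality)
summary(fit.cc)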
The solutions are nearly identical here, which is due to the fact that most miss-
ing values occur in the outcome variable. The standard errors of the multiple
imputation solution are slightly smaller than in the complete-case analysis.
Multiple imputation is often more efficient than complete-case analysis. De-
pending on the data and the model at hand, the differences can be dramatic.
Figure 1.7 shows the distribution and scattergram for the observed and
imputed data combined. The imputations are taken from the first completed
dataset. The blue and red distributions are quite similar. Problems with the
negative values as in Figure 1.4 are now gone since the imputation method
used observed data as donors to fill the missing data. Section 3.4 describes
Figure 1.7: Multiple imputation of Ozone. Plotted are the imputed values from the first imputation.
the method in detail. Note that the red points respect the heteroscedastic
nature of the relation between Ozone and Solar.R. All in all, the red points
look as if they could have been measured if they had not been missing. The
reader can easily recalculate the solution and inspect these plots for the other
imputations.
Figure 1.8 plots the completed Ozone data. The imputed data of all five
imputations are plotted for the days with missing Ozone scores. In order to
avoid clutter, the lines that connect the dots are not drawn for the imputed
values. Note that the pattern of imputed values varies substantially over the
days. At the beginning of the series, the values are low and the spread is small,
in particular for the cold and windy days 25–27. The small spread for days 25–
27 indicates that the model is quite sure of these values. High imputed values
are found around the hot and sunny days 35–42, whereas the imputations
during the moderate days 52–61 are consistently in the moderate range. Note
how the available information helps determine sensible imputed values that
respect the relations between wind, temperature, sunshine and ozone.
One final point. The airquality data is a time series of 153 days. It is well
known that the ordinary least squares (OLS) estimate is inefficient (its standard
error is larger than necessary) if the residuals have positive serial correlation
(Harvey, 1981). The first three autocorrelations of the Ozone series are indeed large: 0.48,
0.31 and 0.29. The residual autocorrelations are however small and within
the confidence interval: 0.13, −0.02 and 0.04. The inefficiency of OLS is thus
negligible here.
Figure 1.8: Multiple imputation of Ozone. Plotted are the observed values (in blue) and the multiply imputed values (in red).
1.6.1 Prevention
With the exception of McKnight et al. (2007, Chapter 4), books on missing
data do not mention prevention. Yet prevention is the most direct attack on
the problems caused by missing data. Prevention is fully in the spirit of the
quote from Orchard and Woodbury given on p. 7. There is a
lot one could do to prevent missing data. The remainder of this section lists
point-wise advice.
Minimize the use of intrusive measures, like blood samples. Visit the sub-
ject at home. Use incentives to stimulate response, and try to match up the
interviewer and respondent on age and ethnicity. Adapt the mode of the study
(telephone, face to face, web questionnaire, and so on) to the study population.
Use a multi-mode design for different groups in your study. Quickly follow-up
for people that do not respond, and where possible try to retrieve any missing
data from other sources.
In experimental studies, try to minimize the treatment burden and inten-
sity where possible. Prepare a well-thought-out flyer that explains the purpose
and usefulness of your study. Try to organize data collection through an au-
thority, e.g., the patient’s own doctor. Conduct a pilot study to detect and
smooth out any problems.
Economize on the number of variables collected. Only collect the informa-
tion that is absolutely essential to your study. Use short forms of measure-
ment instruments where possible. Eliminate vague or ambivalent questionnaire
items. Use an attractive layout of the instruments. Refrain from using blocks
of items that force the respondent to stay on a particular page for a long time.
Use computerized adaptive testing where feasible. Do not allow other studies
to piggy-back on your data collection efforts.
Do not overdo it. Many Internet questionnaires are annoying because they
force the respondent to answer. Do not force your respondent. The result will
be an apparently complete dataset with mediocre data. Respect the wish of
your respondent to skip items. The end result will be more informative.
Use double coding in the data entry, and chase up any differences between
the versions. Devise nonresponse forms in which you try to find out why people
did not respond, or why they dropped out.
Last but not least, consult experts. Many academic centers have depart-
ments that specialize in research methodology. Sound expert advice may turn
out to be extremely valuable for keeping your missing data rate under control.
Most of this advice can be found in books on research methodology and
data quality. Good books are Shadish et al. (2001), De Leeuw et al. (2008),
Dillman et al. (2008) and Groves et al. (2009).
are weighted by design weights, which are inversely proportional to their prob-
ability of being selected in the survey. If there are missing data, the complete
cases are re-weighted according to design weights that are adjusted to counter
any selection effects produced by nonresponse. The method is widely used in
official statistics. Relevant pointers include Cochran (1977), Särndal et al.
(1992) and Bethlehem (2002).
The method is relatively simple in that only one set of weights is needed
for all incomplete variables. On the other hand, it discards data by listwise
deletion, and it cannot handle partial response. Expressions for the variance
of regression weights or correlations tend to be complex, or do not exist. The
weights are estimated from the data, but are generally treated as fixed. The
implications for this are unclear (Little and Rubin, 2002, p. 53).
There has been interest recently in improved weighting procedures that
are “double robust” (Scharfstein et al., 1999; Bang and Robins, 2005). This
estimation method requires specification of three models: Model A is the sci-
entifically interesting model, Model B is the response model for the outcome,
and Model C is the joint model for the predictors and the outcome. The dual
robustness property states that if either Model B or Model C is wrong (but
not both), the estimates under Model A are still consistent. This seems like
a useful property, but the issue is not free of controversy (Kang and Schafer,
2007).
by Little and Rubin (2002). A less technical account that should appeal to
social scientists can be found in Enders (2010, chapters 3–5). Molenberghs
and Kenward (2007) provide a hands-on approach to likelihood-based methods
geared toward clinical studies, including extensions to data that are MNAR.
The pairwise likelihood method was introduced by Katsikatsou et al. (2012)
and has been implemented in lavaan.
1.8 Exercises
1. Reporting practice. What are the reporting practices in your field? Take
a random sample of articles that have appeared during the last 10 years
in the leading journal in your field. Select only those that present quan-
titative analyses, and address the following topics:
(a) Did the authors report that there were missing data?
(b) If not, can you infer from the text that there must have been missing
data?
(c) Did the authors discuss how they handled the missing data?
(d) Were the missing data properly addressed?
(c) Does the curvature also show up in the imputed data? If so, does
the same solution work? Hint: You can access the jth fitted model
by getfit(fit, j), where fit was created by with(imp, ...).
(d) Advanced: Do you think your solution would necessitate drawing
new imputations?
Chapter 2
Multiple imputation
involved in the technique, the larger datasets, the extra work to create the
model and the repeated analysis, software issues, and so on. These issues have
all been addressed by now, but in 1983 Dempster and Rubin wrote: “Practi-
cal implementation is still in the developmental state” (Dempster and Rubin,
1983, p. 8).
Rubin (1987a) provided the methodological and statistical footing for the
method. Though several improvements have been made since 1987, the book
was really ahead of its time and discusses the essentials of modern imputation
technology. It provides the formulas needed to combine the repeated complete-
data estimates (now called Rubin’s rules), and outlines the conditions under
which statistical inference under multiple imputation will be valid. Further-
more, pp. 166–170 provide a description of Bayesian sampling algorithms that
could be used in practice.
Tests for combinations of parameters were developed by Li et al. (1991a),
Li et al. (1991b) and Meng and Rubin (1992). Technical improvements for
the degrees of freedom were suggested by Barnard and Rubin (1999) and
Reiter (2007). Iterative algorithms for multivariate missing data with general
missing data patterns were proposed by Rubin (1987a, p. 192), Schafer (1997),
Van Buuren et al. (1999), Raghunathan et al. (2001) and King et al. (2001).
In the 1990s, multiple imputation came under fire from various sides. The
most severe criticism was voiced by Fay (1992). Fay pointed out that the valid-
ity of multiple imputation can depend on the form of subsequent analysis. He
produced “counterexamples” in which multiple imputation systematically un-
derstated the true covariance, and concluded that “multiple imputation is in-
appropriate as a general purpose methodology.” Meng (1994) pointed out that
Fay’s imputation models omitted important relations that were needed in the
analysis model, an undesirable situation that he labeled uncongenial . Related
issues on the interplay between the imputation model and the complete-data
model have been discussed by Rubin (1996) and Schafer (2003).
Several authors have shown that Rubin’s estimate of the variance can be
biased (Wang and Robins, 1998; Robins and Wang, 2000; Nielsen, 2003; Kim
et al., 2006). If there is bias, the estimate is usually too large. Rubin (2003)
emphasized that variance estimation is only an intermediate goal for making
confidence intervals, and generally not a parameter of substantive interest. He
also noted that observed bias does not seem to affect the coverage of these
intervals across a wide range of cases of practical interest.
The tide turned around 2005. Reviews started to appear that criticized
insufficient reporting practices for missing data in diverse fields (cf. Sec-
tion 1.1.2). Nowadays multiple imputation is almost universally accepted, and
in fact acts as the benchmark against which newer methods are being com-
pared. The major statistical packages have all implemented modules for mul-
tiple imputation, so effectively the technology is implemented, almost three
decades after Dempster and Rubin’s remark.
Figure 2.1: Number of publications (log) on multiple imputation: early publications, "multiple imputation" in title, and "multiple imputation" in abstract.
data collector. For example, the data of a unit can be missing because the
unit was excluded from the sample. Another form of intentional missing data
is the use of different versions of the same instrument for different subgroups,
an approach known as matrix sampling. See Gonzalez and Eltinge (2007)
or Graham (2012, Section 4) for an overview. Also, missing data that occur
because of the routing in a questionnaire are intentional, as are data (e.g.,
survival times) that are censored at some time because the event (e.g.,
death) has not yet taken place. A related term in a multilevel context is
systematically missing data. This term refers to variables that are missing
for all individuals in a cluster because the variable was not measured in that
cluster (Resche-Rigon and White, 2018).
Though often foreseen, unintentional missing data are unplanned and not
under the control of the data collector. Examples are: the respondent skipped
an item, there was an error in the data transmission causing data to be missing,
some of the objects dropped out before the study could be completed resulting
in partially complete data, and the respondent was sampled but refused to
cooperate. A related term in a multilevel context is sporadically missing data.
This term is used for variables with missing values for some but not all
individuals in a cluster.
Another important distinction is item nonresponse versus unit nonre-
sponse. Item nonresponse refers to the situation in which the respondent
skipped one or more items in the survey. Unit nonresponse occurs if the
respondent refused to participate, so all outcome data are missing for this
respondent. Historically, the methods for item and unit nonresponse have
been rather different, with unit nonresponse primarily addressed by weighting
methods, and item nonresponse primarily addressed by edit and imputation
techniques.
Table 2.1 cross-classifies both distinctions, and provides some typical ex-
amples in each of the four cells. The distinction between intentional/uninten-
tional missing data is the more important one. The item/unit nonresponse
distinction says how much information is missing, while the distinction be-
tween intentional and unintentional missing data says why some information
is missing. Knowing the reasons why data are incomplete is a first step toward
the solution.
2.2.3 Notation
The notation used in this book will be close to that of Rubin (1987a)
and Schafer (1997), but there are some exceptions. The symbol m is used to
indicate the number of multiple imputations. Compared to Rubin (1987a) the
subscript m is dropped from most of the symbols. In Rubin (1987a), Y and R
represent the data of the population, whereas in this book Y refers to data of
the sample, similar to Schafer (1997). Rubin (1987a) uses X to represent the
completely observed covariates in the population. Here we assume that the
covariates are possibly part of Y , so there is not always a symbolic distinction
between complete covariates and incomplete data. The symbol X is used to
indicate the set of predictors in various types of models.
Let Y denote the n × p matrix containing the data values on p variables
for all n units in the sample. We define the response indicator R as an n × p
0–1 matrix. The elements of Y and R are denoted by yij and rij , respectively,
where i = 1, . . . , n and j = 1, . . . , p. If yij is observed, then rij = 1, and if yij
is missing, then rij = 0.
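In R, the response indicator of a dataset can be computed directly. The small illustration below uses the airquality data from Chapter 1; it is an example added for concreteness, not part of the original text:

R <- 1 - is.na(airquality)   # rij = 1 if yij is observed, 0 if it is missing
head(R, 3)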
This book is restricted to the case where R is completely known, i.e., we
know where the missing data are. This covers many applications of practical
interest, but not all. For example, some questionnaires present a list of diseases
and ask the respondent to place a “tick” at each disease that applies. If there
is a “yes” we know that the field is not missing. However, if the field is not
ticked, it could be because the person didn’t have the disease (a genuine “no”)
or because the respondent skipped the question (a missing value). There is
no way to tell the difference from the data, so these are unknown unknowns.
In order to make progress in cases like these, we need additional assumptions
about the response behavior.
The observed data are collectively denoted by Yobs . The missing data are
collectively denoted as Ymis , and contain all elements yij where rij = 0. When
taken together Y = (Yobs , Ymis ) contain the hypothetically complete data. The
part Ymis has real values, but the values themselves are masked from us, where
R indicates which values are masked. In their book, Little and Rubin (2002,
p. 8) make the following key assumption:
Missingness indicators hide the true values that are meaningful for
analysis.
While this statement may seem obvious and uncomplicated, there are practical
situations where it may not hold. In a trial where we are interested in both
survival and quality of life, we may have missing values in either outcome. If
we know that a person is alive, then an unknown quality of life outcome is
simply missing because the quality of life score is defined for that person, but
for some reason we haven’t been able to see it. But if the person has died,
quality of life becomes undefined, and that’s the reason why we don’t see it.
It wouldn’t make much sense to try to impute something that is undefined.
A more sensible option is to stratify the analysis according to whether the
concept is defined or not. The situation becomes more complex if we do not
know the person’s survival status. See Rubin (2000) for an analysis. In order
to evade such complexities, we assume that Y contains values that are all
defined, and that R indicates what we actually see.
If Y = Yobs (i.e., if the sample data are completely observed) and if we
know the mechanism of how the sample was created, then it is possible to make
a valid estimate of the population quantities of interest. For a simple random
sample, we could just take the sample mean Q̂ as an unbiased estimate of the
population mean Q. We will assume throughout this book that we know how
to do the correct statistical analysis on the complete data Y . If we cannot do
this, then there is little hope that we can solve the more complex problem of
analyzing Yobs . This book addresses the problem of what to do if Y is observed
incompletely. Incompleteness can incorporate intentional missing data, but
also unintentional forms like refusals, self-selection, skipped questions, missed
visits and so on.
Note that every unit in the sample has a row in Y . If no data have been
obtained for a unit i (presumably because of unit nonresponse), the ith record
will contain only the sample number and perhaps administrative data from
the sampling frame. The remainder of the record will be missing.
A variable without any observed values is called a latent variable. Latent
variables are often used to define concepts that are difficult to measure. Latent
variables are theoretical constructs and not part of the manifest data, so they
are typically not imputed. Mislevy (1991) showed how latent variables can be
imputed, and provided several illustrative applications.
does not simplify, so here the probability to be missing also depends on un-
observed information, including Ymis itself.
As explained in Chapter 1, simple techniques usually only work under
MCAR, but this assumption is very restrictive and often unrealistic. Multiple
imputation can handle both MAR and MNAR.
Several tests have been proposed to test MCAR versus MAR. These tests
are not widely used, and their practical value is unclear. See Enders (2010,
pp. 17–21) for an evaluation of two procedures. It is not possible to test MAR
versus MNAR since the information that is needed for such a test is missing.
Numerical illustration. We simulate three archetypes of MCAR, MAR and
MNAR. The data Y = (Y1 , Y2 ) are drawn from a standard bivariate normal
distribution with a correlation between Y1 and Y2 equal to 0.5. Missing data
are created in Y2 using the missing data model
\Pr(R_2 = 0) = \psi_0 + \frac{e^{Y_1}}{1 + e^{Y_1}}\,\psi_1 + \frac{e^{Y_2}}{1 + e^{Y_2}}\,\psi_2 \qquad (2.4)
with different parameter settings for ψ = (ψ0, ψ1, ψ2). For MCAR we set
ψMCAR = (0.5, 0, 0), for MAR we set ψMAR = (0, 1, 0) and for MNAR we set
ψMNAR = (0, 0, 1). Thus, we obtain the following models:
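The three models below are reconstructed by substituting these ψ settings into Equation 2.4; the equation numbers used in the book are omitted:

\Pr(R_2 = 0) = 0.5 \quad \text{(MCAR)}
\Pr(R_2 = 0) = e^{Y_1}/(1 + e^{Y_1}) = \operatorname{logit}^{-1}(Y_1) \quad \text{(MAR)}
\Pr(R_2 = 0) = e^{Y_2}/(1 + e^{Y_2}) = \operatorname{logit}^{-1}(Y_2) \quad \text{(MNAR)}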
where logit(p) = log(p/(1 − p)) for any 0 < p < 1 is the logit function. In practice,
it is more convenient to work with the inverse logit (or logistic) function
logit⁻¹(x) = exp(x)/(1 + exp(x)), which transforms a continuous x to the
interval ⟨0, 1⟩. In R, it is straightforward to draw random values under these
models as follows.
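The code below is a reconstruction, not the book's own; the seed, the sample size n and the object names are assumptions.

library(MASS)
set.seed(1)                          # assumed seed
n <- 300                             # assumed sample size
y <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), nrow = 2))
p.mcar <- rep(0.5, n)                # Pr(R2 = 0) under MCAR
p.mar  <- plogis(y[, 1])             # Pr(R2 = 0) under MAR
p.mnar <- plogis(y[, 2])             # Pr(R2 = 0) under MNAR
r2.mcar <- 1 - rbinom(n, 1, p.mcar)  # response indicator: 1 = Y2 observed
r2.mar  <- 1 - rbinom(n, 1, p.mar)
r2.mnar <- 1 - rbinom(n, 1, p.mnar)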
Figure 2.2 displays the distribution of Yobs and Ymis under the three miss-
ing data models. As expected, these are similar under MCAR, but become
progressively more distinct as we move to the MNAR model.
Figure 2.2: Distribution of Yobs and Ymis under three missing data models.
so the distribution of the data Y is the same in the response and nonresponse
groups. Thus, if the missing data model is ignorable we can model the poste-
rior distribution P (Y |Yobs , R = 1) from the observed data, and use this model
to create imputations for the missing data. Vice versa, techniques that (im-
plicitly) assume equivalent distributions assume ignorability and thus MAR.
On the other hand, if the nonresponse is nonignorable, we find
E(Q̂|Y ) = Q (2.12)
where the function V (Q̂|Y ) denotes the variance caused by the sampling pro-
cess. A statistical test with a stated nominal rejection rate of 5% should reject
the null hypothesis in at most 5% of the cases when in fact the null hypothesis
is true. A procedure is said to be confidence valid if this holds.
In summary, the goal of multiple imputation is to obtain estimates of the
scientific estimand in the population. This estimate should on average be equal
to the value of the population parameter. Moreover, the associated confidence
intervals and hypothesis tests should achieve at least the stated nominal value.
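The discussion that follows centers on Equation 2.14; reconstructed from the description below, it reads

P(Q \mid Y_{\mathrm{obs}}) = \int P(Q \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}})\, P(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}})\, dY_{\mathrm{mis}} \qquad (2.14)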
Here, P (Q|Yobs ) is the posterior distribution of Q given the observed data Yobs .
This is the distribution that we would like to know. P (Q|Yobs , Ymis ) is the pos-
terior distribution of Q in the hypothetically complete data, and P (Ymis |Yobs )
is the posterior distribution of the missing data given the observed data.
The interpretation of Equation 2.14 is most conveniently done from right to
left. Suppose that we use P (Ymis |Yobs ) to draw imputations for Ymis , denoted
as Ẏmis . We can then use P (Q|Yobs , Ẏmis ) to calculate the quantity of interest Q
from the imputed data (Yobs ,Ẏmis ). We repeat these two steps with new draws
Ẏmis , and so on. Equation 2.14 says that the actual posterior distribution of Q
is equal to the average over the repeated draws of Q. This result is important
since it expresses P (Q|Yobs ), which is generally difficult, as a combination of
two simpler posteriors from which draws can be made.
It can be shown that the posterior mean of P (Q|Yobs ) is equal to
the average of the posterior means of Q over the repeatedly imputed data.
This equation suggests the following procedure for combining the results of
repeated imputations. Suppose that Q̂ℓ is the estimate of the ℓth repeated
imputation, then the combined estimate is equal to

\bar{Q} = \frac{1}{m} \sum_{\ell=1}^{m} \hat{Q}_\ell \qquad (2.16)
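The equation referred to in the next paragraph is the decomposition of the posterior variance; a reconstruction consistent with the description is

V(Q \mid Y_{\mathrm{obs}}) = E\left[V(Q \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) \mid Y_{\mathrm{obs}}\right] + V\left[E(Q \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) \mid Y_{\mathrm{obs}}\right]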
This equation is well known in statistics, but can be difficult to grasp at first.
The first component is the average of the repeated complete-data posterior
variances of Q. This is called the within-variance. The second component is the
variance between the complete-data posterior means of Q. This is called the
between variance. Let Ū∞ and B∞ denote the estimated within and between
components for an infinitely large number of imputations m = ∞. Then T∞ =
Ū∞ + B∞ is the posterior variance of Q.
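The within-imputation variance is estimated by averaging the m complete-data variances (a reconstruction of the formula that the next sentence qualifies):

\bar{U} = \frac{1}{m} \sum_{\ell=1}^{m} \bar{U}_\ell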
where the term Ūℓ is the variance-covariance matrix of Q̂ℓ obtained for the
ℓth imputation. The standard unbiased estimate of the variance between the
m complete-data estimates is given by

B = \frac{1}{m - 1} \sum_{\ell=1}^{m} (\hat{Q}_\ell - \bar{Q})(\hat{Q}_\ell - \bar{Q})' \qquad (2.19)
T = \bar{U} + B + B/m = \bar{U} + \left(1 + \frac{1}{m}\right) B \qquad (2.20)
for the total variance of Q̄, and hence of (Q−Q̄) if Q̄ is unbiased. The procedure
to combine the repeated-imputation results by Equations 2.16 and 2.20 is
referred to as Rubin’s rules.
In summary, the total variance T stems from three sources:
1. Ū , the variance caused by the fact that we are taking a sample rather
than observing the entire population. This is the conventional statistical
measure of variability;
2. B, the extra variance caused by the fact that there are missing values
in the sample;
3. B/m, the extra simulation variance caused by the fact that Q̄ itself is
estimated for finite m.
The addition of the latter term is critical to make multiple imputation
work at low values of m. Not including it would result in p-values that are too
low, or confidence intervals that are too short. Traditional choices for m are
m = 3, m = 5 and m = 10. The current advice is to set m higher, e.g., m = 50
(cf. Section 2.8). The larger m gets, the smaller the effect of simulation error
on the total variance.
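Rubin's rules for a scalar parameter take only a few lines of code; the sketch below is illustrative (in practice the pool() function of mice performs these calculations):

pool_scalar_by_hand <- function(qhat, u) {
  # qhat: vector of m complete-data estimates; u: vector of their variances
  m    <- length(qhat)
  qbar <- mean(qhat)               # pooled estimate, Equation 2.16
  ubar <- mean(u)                  # within-imputation variance
  b    <- var(qhat)                # between-imputation variance
  t    <- ubar + (1 + 1/m) * b     # total variance, Equation 2.20
  c(estimate = qbar, variance = t)
}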
Table 2.2: Role of symbols at three analytic levels and the relations between
them. The relation =⇒ means "is an estimate of." The relation ≐ means "is
asymptotically equal to."

  Incomplete Sample        Complete Sample          Population
  Yobs                     Y = (Yobs, Ymis)
  Q̄             =⇒         Q̂              =⇒        Q
  Ū             =⇒         U               ≐         V(Q̂)
  B                                        ≐         V(Q̄)
sample. Imputation of data should, at the very least, lead to adequate es-
timates of both Q̂ and U . Three conditions define whether an imputation
procedure is considered proper. We use the slightly simplified version given
by Brand (1999, p. 89) combined with Rubin (1987a). An imputation proce-
dure is said to be confidence proper for complete-data statistics (Q̂, U ) if at
large m all of the following conditions hold approximately:
E(\bar{Q} \mid Y) = \hat{Q} \qquad (2.21)
E(\bar{U} \mid Y) = U \qquad (2.22)
\left(1 + \frac{1}{m}\right) E(B \mid Y) \geq V(\bar{Q}) \qquad (2.23)
The hypothetically complete sample data Y is now held fixed, and the response
indicator R varies according to a specified model.
The first requirement is that Q̄ is an unbiased estimate of Q̂. This means
that, when averaged over the response indicators R sampled under the as-
sumed response model, the multiple imputation estimate Q̄ is equal to Q̂,
the estimate calculated from the hypothetically complete data in the realized
sample.
The second requirement is that Ū is an unbiased estimate of U . This means
that, when averaged over the response indicator R sampled under the assumed
response model, the estimate Ū of the sampling variance of Q̂ is equal to U ,
the sampling variance estimate calculated from the hypothetically complete
data in the realized sample.
The third requirement is that B is a confidence valid estimate of the vari-
ance due to missing data. Equation 2.23 implies that the extra inferential
uncertainty about Q̂ due to missing data is correctly reflected. On average,
the estimate B of the variance due to missing data should be equal to V (Q̄),
the variance observed in the multiple imputation estimator Q̄ over differ-
ent realizations of the response mechanism. This requirement is analogous to
Equation 2.13 for confidence valid estimates of U .
If we replace ≥ in Equation 2.23 by >, then the procedure is said to be
proper, a stricter version. In practice, being confidence proper is enough to
obtain valid inferences.
Note that a procedure may be proper for the estimand pair (Q̂, U ), while being
improper for another pair (Q̂0 , U 0 ). Also, a procedure may be proper with
respect to one response mechanism P (R), but improper for an alternative
mechanism P (R0 ).
It is not always easy to check whether a certain procedure is proper. Sec-
tion 2.5 describes simulation-based tools for checking the adequacy of impu-
tations for valid statistical inference. Chapter 3 provides examples of proper
and improper procedures.
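The variance ratios discussed in this section build on two quantities whose defining displays are reconstructed here: the relative increase in variance due to nonresponse, r, and the proportion of variation attributable to the missing data, λ:

r = \frac{B + B/m}{\bar{U}}, \qquad \lambda = \frac{B + B/m}{T}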
\gamma = \frac{r + 2/(\nu + 3)}{1 + r} \qquad (2.26)
This measure needs an estimate of the degrees of freedom ν, and will be
discussed in Section 2.3.6. The interpretations of γ and λ are similar, but γ is
adjusted for the finite number of imputations. Both statistics are related by
γ = ((ν + 1)/(ν + 3)) λ + 2/(ν + 3)                  (2.27)
The literature often confuses γ and λ, and erroneously labels λ as the fraction
of missing information. The values of λ and γ are almost identical for large ν,
but they could notably differ for low ν.
If Q is a vector, it is sometimes useful to calculate a compromise λ over
all elements in Q̄ as
λ̄ = (1 + 1/m) tr(BT⁻¹)/k                             (2.28)
where k is the dimension of Q̄, and where B and T are now k × k matrices.
The compromise expression for r is equal to
r̄ = (1 + 1/m) tr(B Ū⁻¹)/k                            (2.29)
the average relative increase in variance.
The quantities λ, r and γ as well as their multivariate analogues λ̄ and r̄ are
indicators of the severity of the missing data problem. Fractions of missing
information up to 0.2 can be interpreted as “modest,” 0.3 as “moderately
large” and 0.5 as “high” (Li et al., 1991b). High values indicate a difficult
problem in which the final statistical inferences are highly dependent on the
way in which the missing data were handled. Note that estimates of λ, r and
γ may be quite variable for low m (cf. Section 2.8).
The degrees of freedom cannot be the same as for the complete data because part of the
data is missing. The “old” formula (Rubin, 1987a, eq. 3.1.6) for the degrees
of freedom can be written concisely as
νold = (m − 1)(1 + 1/r)² = (m − 1)/λ²                (2.30)
with r and λ defined as in Section 2.3.5. The lowest possible value is νold = m−
1, which occurs if essentially all variation is attributable to the nonresponse.
The highest value νold = ∞ indicates that all variation is sampling variation,
either because there were no missing data, or because we could re-create them
perfectly.
Barnard and Rubin (1999) noted that Equation 2.30 can produce values
that are larger than the sample size in the complete data, a situation that is
“clearly inappropriate.” They developed an adapted version for small samples
that is free of the problem. Let νcom be the degrees of freedom of Q̄ in the
hypothetically complete data. In models that fit k parameters on data with
a sample size of n we may set νcom = n − k. The estimated observed data
degrees of freedom that accounts for the missing information is
νobs = ((νcom + 1)/(νcom + 3)) νcom (1 − λ)          (2.31)
The adjusted degrees of freedom to be used for testing in multiple imputation
can be written concisely as
ν = νold νobs / (νold + νobs)                        (2.32)
The quantity ν is always less than or equal to νcom . If νcom = ∞, then Equa-
tion 2.32 reduces to 2.30. If λ = 0 then ν = νcom , and if λ = 1 we find ν = 0.
Distributions with zero degrees of freedom are nonsensical, so for ν < 1 we
should refrain from any testing due to lack of information.
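As a hedged illustration (not the mice implementation itself), Equations 2.30–2.32 can be combined into one small R function; the function name is ours:

barnard.rubin <- function(m, lambda, dfcom) {
  # Eq. 2.30: old degrees of freedom
  dfold <- (m - 1) / lambda^2
  # Eq. 2.31: observed-data degrees of freedom
  dfobs <- (dfcom + 1) / (dfcom + 3) * dfcom * (1 - lambda)
  # Eq. 2.32: adjusted degrees of freedom
  dfold * dfobs / (dfold + dfobs)
}

# e.g. barnard.rubin(m = 10, lambda = 0.319, dfcom = 23) gives about 12.4,
# matching the df that pool() reports for the intercept further below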
Alternative corrections were proposed by Reiter (2007) and Lipsitz et al.
(2002). Wagstaff and Harel (2011) compared the four methods, and concluded that the small-sample methods by Barnard–Rubin and Reiter performed satisfactorily.
library(mice)

# impute the nhanes data with m = 10 multiple imputations
imp <- mice(nhanes, print = FALSE, m = 10, seed = 24415)
# fit the regression of bmi on age in each imputed dataset
fit <- with(imp, lm(bmi ~ age))
# pool the 10 sets of estimates with Rubin's rules
est <- pool(fit)
est
Class: mipo m = 10
estimate ubar b t dfcom df riv lambda
(Intercept) 30.50 3.408 1.454 5.01 23 12.4 0.469 0.319
age -2.13 0.906 0.238 1.17 23 15.1 0.289 0.224
fmi
(Intercept) 0.408
age 0.310
models and the testing of model terms that involved multiple parameters like
regression estimates for dummy codings created from the same variable.
All methods assume that, under repeated sampling and with complete
data, the parameter estimates Q̂ are normally distributed around the popu-
lation value Q as
Q̂ ∼ N (Q, U ) (2.33)
where U is the variance-covariance matrix of (Q − Q̂) (Rubin, 1987a, p. 75).
For scalar Q, the quantity U reduces to σ², the variance of the estimate Q̂ over
repeated samples. Observe that U is not the variance of the measurements.
Several approaches for multi-parameter inference are available: Wald test,
likelihood ratio test and χ2 -test. These methods are more complex than single-
parameter inference, and their treatment is therefore deferred to Section 5.2.
The next section shows how confidence intervals and p-values for scalar pa-
rameters can be calculated from multiply imputed data.
Inference for a scalar Q rests on the approximation

(Q − Q̄)/√T ∼ tν                                      (2.34)

where tν is the Student’s t-distribution with ν degrees of freedom, with ν
defined by Equation 2.32.
The 100(1 − α)% confidence interval of Q̄ is calculated as

Q̄ ± tν,1−α/2 √T                                       (2.35)

The p-value for testing the null hypothesis Q = Q0, for some specified value Q0, is

Ps = Pr[F1,ν > (Q0 − Q̄)²/T]                           (2.36)
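In mice, these interval and test calculations are available through the summary() method for pooled objects; a hedged illustration continuing the nhanes example shown earlier:

# t-based confidence intervals and p-values for the pooled estimates
summary(est, conf.int = TRUE)

# the same interval computed by hand for the age coefficient, using the
# pooled estimate, the total variance T and the adjusted degrees of freedom
qbar <- est$pooled$estimate[2]
tvar <- est$pooled$t[2]
df   <- est$pooled$df[2]
qbar + c(-1, 1) * qt(0.975, df) * sqrt(tvar)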
3. Missing data mechanism only. The basic simulation steps are: choose
(Q̂, U), generate incomplete data Yobs(t), impute, estimate (Q̄, Ū)(t) and
B(t), and calculate outcomes aggregated over t.
1. Raw bias (RB) and percent bias (PB). The raw bias of the estimate Q̄
is defined as the difference between the expected value of the estimate
and truth: RB = E(Q̄) − Q. RB should be close to zero. Bias can also be
expressed as percent bias: PB = 100 × |(E(Q̄) − Q)/Q|. For acceptable
performance we use an upper limit for PB of 5% (Demirtas et al., 2008).
2. Coverage rate (CR). The coverage rate (CR) is the proportion of con-
fidence intervals that contain the true value. The actual rate should be
equal to or exceed the nominal rate. If CR falls below the nominal rate,
the method is too optimistic, leading to false positives. A CR below 90
percent for a nominal 95 percent interval indicates poor quality. A high
CR (e.g., 0.99) may indicate that the confidence interval is too wide, so the
method is inefficient and leads to inferences that are too conservative.
Inferences that are “too conservative” are generally regarded as a lesser sin
than inferences that are “too optimistic”.
3. Average width (AW). The average width of the confidence interval is
an indicator of statistical efficiency. The length should be as small as
possible, but not so small that the CR will fall below the nominal level.
4. Root mean squared error (RMSE). The RMSE = √((E(Q̄) − Q)²) is a
compromise between bias and variance, and evaluates Q̄ on both accuracy
and precision.
If all is well, then RB should be close to zero, and the coverage should
be near 0.95. Methods having no bias and proper coverage are called
randomization-valid (Rubin, 1987a). If two methods are both randomization-
valid, the method with the shorter confidence intervals is more efficient. While
the RMSE is widely used, we will see in Section 2.6 that it is not a suitable
metric to evaluate multiple imputation methods.
2.5.3 Example
This section demonstrates the measures defined in Section 2.5.2 can be
calculated using simulation. The process starts by specifying a model that
is of scientific interest and that fixes Q. Pseudo-observations according to
the model are generated, and part of these observations is deleted, resulting
in an incomplete dataset. The missing values are then filled using the new
imputation procedure, and Rubin’s rules are applied to calculate the estimates
Q̄ and T . The whole process is repeated a large number of times, say in 1000
runs, each starting from different random seeds.
For the sake of simplicity, suppose scientific interest focuses on determining
β in the linear model yi = α + xiβ + εi. Here εi ∼ N(0, σ²) are random errors
uncorrelated with x. Suppose that the true values are α = 0, β = 1 and
σ 2 = 1. We have 50% random missing data in x, and compare two imputation
methods: regression imputation (cf. Section 1.3.4) and stochastic regression
imputation (cf. Section 1.3.5).
It is convenient to create a series of small R functions. The create.data()
function randomly draws artificial data from the specified linear model.
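A minimal sketch of such a function, consistent with the values given above (the name and defaults are assumptions):

# sketch: generate n observations from y = beta * x + e, with e ~ N(0, sigma2)
# and alpha = 0 implicitly assumed
create.data <- function(beta = 1, sigma2 = 1, n = 50, run = 1) {
  set.seed(seed = run)
  x <- rnorm(n)
  y <- beta * x + rnorm(n, sd = sqrt(sigma2))
  cbind(x = x, y = y)
}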
Next, we remove some data in order to make the data incomplete. Here we use
a simple random missing data mechanism (MCAR) to generate approximately
50% missing values.
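A sketch of such a deletion step (name and default are assumptions):

# sketch: make roughly 50% of x missing completely at random (MCAR)
make.missing <- function(data, p = 0.5) {
  rx <- rbinom(nrow(data), 1, p)
  data[rx == 0, "x"] <- NA
  data
}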
We then define a small test function that calls mice() and applies Rubin’s
rules to the imputed data.
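A sketch of the test function, together with a simple driver that fills the array res used below (function names, the choice of methods and the array layout are assumptions; library(mice) is assumed to be loaded):

# sketch: impute, fit y ~ x in each imputed dataset, pool with Rubin's rules,
# and return the estimate of beta with its 95% confidence interval
test.impute <- function(data, m = 5, method = "norm", ...) {
  imp <- mice(data, method = method, m = m, print = FALSE, ...)
  fit <- with(imp, lm(y ~ x))
  tab <- summary(pool(fit), conf.int = TRUE)
  unlist(tab[2, c("estimate", "2.5 %", "97.5 %")])
}

# sketch of a driver producing res[method, run, statistic]
simulate <- function(runs = 10) {
  res <- array(NA, dim = c(2, runs, 3),
               dimnames = list(c("norm.predict", "norm.nob"), NULL,
                               c("estimate", "2.5 %", "97.5 %")))
  for (run in seq_len(runs)) {
    data <- create.data(run = run)
    data <- make.missing(data)
    res["norm.predict", run, ] <- test.impute(data, method = "norm.predict")
    res["norm.nob", run, ] <- test.impute(data, method = "norm.nob")
  }
  res
}
res <- simulate(1000)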
The raw bias, percent bias, coverage rate and average width per method can
be obtained from the estimates and the confidence interval bounds as follows:
# true value of beta
true <- 1
# raw bias and percent bias of the pooled estimate
RB <- rowMeans(res[,, "estimate"]) - true
PB <- 100 * abs((rowMeans(res[,, "estimate"]) - true)/ true)
# coverage and average width of the 95% confidence interval
CR <- rowMeans(res[,, "2.5 %"] < true & true < res[,, "97.5 %"])
AW <- rowMeans(res[,, "97.5 %"] - res[,, "2.5 %"])
RB PB CR AW RMSE
norm.predict 0.343 34.3 0.364 0.555 0.409
norm.nob -0.005 0.5 0.925 0.693 0.201
The RMSE of the imputed values is defined as

RMSE = √( (1/nmis) Σi (yimis − ẏi)² )

where the sum runs over the nmis missing entries, yimis represents the true
(removed) data value for unit i and ẏi is the imputed value for unit i. For
multiply imputed data we calculate the RMSE for each imputed dataset, and
average these.
RMSE
norm.predict 0.725
norm.nob 1.025
the regression coefficients are free of bias (Little, 1992; King et al., 2001).
This holds for any type of regression analysis, and for missing data in both Y
and X. Since the missing data rate may depend on X, complete-case analysis
will in fact work in a relevant class of MNAR models. White and Carlin
(2010) confirmed the superiority of complete-case analysis by simulation. The
differences were often small, and multiple imputation gained the upper hand
as more predictive variables were included. The property is nevertheless
useful in practice.
The second special case holds only if the complete data model is logistic
regression. Suppose that the missing data are confined to either a dichotomous
Y or to X, but not to both. Assuming that the model is correctly specified, the
regression coefficients (except the intercept) from the complete-case analysis
are unbiased if the probability to be missing depends only on Y and not on
X (Vach, 1994). This property provides the statistical basis of the estimation
of the odds ratio from case-control studies in epidemiology. If missing data
occur in both Y and X the property does not hold.
At a minimum, application of listwise deletion should be a conscious de-
cision of the analyst, and should preferably be accompanied by an explicit
statement that the missing data fit in one of the three categories described
above.
Other alternatives to multiple imputation were briefly reviewed in Sec-
tion 1.6, and may work well in particular applications. However, none of these
is as general as multiple imputation.
Rubin (1987a, p. 114) showed that the two variances are related by

Tm = (1 + γ0/m) T∞                                   (2.38)
where γ0 is the (true) population fraction of missing information. This quan-
tity is equal to the expected fraction of observations missing if Y is a single
variable without covariates, and commonly less than this if there are covari-
ates that predict Y . For example, for γ0 = 0.3 (e.g., a single variable with 30%
missing) and m = 5 we find that the calculated variance Tm is 1+0.3/5 = 1.06
times (i.e., 6%) larger than the ideal variance T∞. The corresponding confi-
dence interval would thus be √1.06 = 1.03 (i.e., 3%) longer than the ideal
confidence interval based on m = ∞. Increasing m to 10 or 20 would bring
the factor down to 1.5% and 0.7%, respectively. The argument is that “the ad-
ditional resources that would be required to create and store more than a few
imputations would not be well spent” (Schafer, 1997, p. 107), and “in most
situations there is simply little advantage to producing and analyzing more
than a few imputed datasets” (Schafer and Olsen, 1998, p. 549).
Royston (2004) observed that the length of the confidence interval also
depends on ν, and thus on m (cf. Equation 2.30). He suggested basing the
criterion for m on the confidence coefficient tν√T, and proposed that the
coefficient of variation of ln(tν√T) should be smaller than 0.05. This effectively
constrains the range of uncertainty about the confidence interval to roughly
within 10%. This rule requires m to be “at least 20 and possibly more.”
Graham et al. (2007) investigated the effect of m on the statistical power
of a test for detecting a small effect size (< 0.1). Their advice is to set m
high in applications where high statistical power is needed. For example, for
γ0 = 0.3 and m = 5 the statistical power obtained is 73.1% instead of the
theoretical value of 78.4%. We need m = 20 to increase the power to 78.2%.
In order to have an attained power within 1% of the theoretical power,
for fractions of missing information γ = (0.1, 0.3, 0.5, 0.7, 0.9) we need to set
m = (20, 20, 40, 100, > 100), respectively.
Bodner (2008) explored the variability of three quantities under various m:
the width of the 95% confidence interval, the p-value, and γ0 . Bodner selected
m such that the width of the 95% confidence interval is within 10% of its true
value 95% of the time. For γ0 = (0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9), he recom-
mends m = (3, 6, 12, 24, 59, 114, 258), respectively, using a linear rule. Since
the true γ0 is unknown, Bodner suggested the proportion of complete cases
as a conservative estimate of γ0. Von Hippel (2018) showed that the relation
between m and γ0 is better explained by a quadratic rule

m = 1 + (1/2) (γ0 / (SD(√Uℓ)/E(√Uℓ)))²               (2.39)

where E(√Uℓ) and SD(√Uℓ) are the mean and standard deviation of the
standard errors calculated from the imputed datasets. The rule is used in
a two-step procedure, where the first step estimates γ0 and its 95% confi-
dence interval. The upper limit of the confidence interval is then plugged into
Equation 2.39. Compared to Bodner, the rule suggests somewhat lower m if
γ0 < 0.5 and substantially higher m if γ0 > 0.5.
The starting point of White et al. (2011b) is that all essential quantities
in the analysis should be reproducible within some limit, including confidence
intervals, p-values and estimates of the fraction of missing information. They
take a quote from Von Hippel (2009) as a rule of thumb: the number of impu-
tations should be similar to the percentage of cases that are incomplete. This
rule applies to fractions of missing information of up to 0.5. If m ≈ 100λ, the
following properties will hold for a parameter β:
1. The Monte Carlo error of β̂ is approximately 10% of its standard error;
2. The Monte Carlo error of the test statistic β̂/se(β̂) is approximately 0.1;
3. The Monte Carlo error of the p-value is approximately 0.01 when the
true p-value is 0.05.
White et al. (2011b) suggest these criteria provide an adequate level of repro-
ducibility in practice. The idea of reproducibility is sensible and the rule is
simple to apply, so there is much to commend it. The rule has now become the
de-facto standard, especially in medical applications. One potential difficulty
might be that the percentage of complete cases is sensitive to the number of
variables in the data. If we extend the active dataset by adding more variables,
then the percentage of complete cases can only drop. An alternative would be
to use the average missing data rate as a less conservative estimate.
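As a small illustration of these rules of thumb (assuming the mice package is loaded, with its nhanes data as a stand-in):

# percentage of incomplete cases: a conservative starting value for m
100 * mean(!complete.cases(nhanes))

# average missing data rate over all cells: a less conservative alternative
100 * mean(is.na(nhanes))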
Theoretically it is always better to use higher m, but this involves more
computation and storage. Setting m very high (say m = 200) may be use-
ful for low-level estimands that are very uncertain, and for which we want
to approximate the full distribution, or for parameters that are notoriously
difficult to estimate, like variance components. On the other hand, setting
m high may not be worth the extra wait if the primary interest is in the point
estimates (and not in standard errors, p-values, and so on). In that case using
m = 5–20 will be enough under moderate missingness.
Imputing a dataset in practice often involves trial and error to adapt and
refine the imputation model. Such initial explorations do not require large m.
It is convenient to set m = 5 during model building, and increase m only
after being satisfied with the model for the “final” round of imputation. So
if calculation is not prohibitive, we may set m to the average percentage of
missing data. The substantive conclusions are unlikely to change as a result
of raising m beyond m = 5.
2.9 Exercises
1. Nomogram. Construct a graphic representation of Equation 2.27 that al-
lows the user to convert λ and γ for different values of ν. What influence
does ν have on the relation between λ and γ?
2. Models. Explain the difference between the response model and the im-
putation model.
3. Listwise deletion. In the airquality data, predict Ozone from Wind and
Temp. Now randomly delete half of the wind data above 10 mph, and
randomly delete half of the temperature data above 80°F.
(a) Are the data MCAR, MAR or MNAR?
(b) Refit the model under listwise deletion. Do you notice a change in
the estimates? What happens to the standard errors?
(c) Would you conclude that listwise deletion provides valid results
here?
(d) If you add a quadratic term to the model, would that alter your
conclusion?
(c) Check White’s conditions 1 and 2 (cf. Section 2.8). For which m
do these conditions hold?
(d) Does this also hold for categorical data? Use the nhanes2 to study
this.
6. Automated choice of m. Write an R function that implements the meth-
ods discussed in Section 2.8.
Chapter 3
Univariate missing data
Figure 3.1: Five ways to impute missing gas consumption for a temperature
of 5◦ C: (a) no imputation; (b) predict; (c) predict + noise; (d) predict + noise
+ parameter uncertainty; (e) two predictors; (f) drawing from observed data.
pling. Imputed values are now defined as the predicted value of the sampled
line added with noise, as in Section 3.1.2.
3.1.6 Conclusion
In summary, prediction methods are not suitable to create multiple im-
putations. Both the inherent prediction error and the parameter uncertainty
should be incorporated into the imputations. Adding a relevant extra predic-
tor reduces the amount of uncertainty, and leads to more efficient estimates
later on. The text also highlights an alternative that draws imputations from
the observed data. The imputation methods discussed in this chapter are all
variations on this basic idea.
1. Predict. ẏ = β̂0 + Xmis β̂1 , where β̂0 and β̂1 are least squares estimates
calculated from the observed data. Section 1.3.4 named this regression
imputation. In mice this method is available as method norm.predict.
2. Predict + noise. ẏ = β̂0 + Xmis β̂1 + ε̇, where ε̇ is randomly drawn from
N(0, σ̂²), with β̂0, β̂1 and σ̂ estimated by least squares from the observed
data. Section 1.3.5 named this stochastic regression imputation. In mice
this method is available as method norm.nob.
3. Bayesian multiple imputation. ẏ = β̇0 + Xmis β̇1 + ε̇, where ε̇ ∼ N(0, σ̇²)
and β̇0, β̇1 and σ̇ are random draws from their posterior distribution,
given the data. Section 3.1.3 named this “predict + noise + parameter
uncertainty.” The method is available as method norm.
4. Bootstrap multiple imputation. ẏ = β̇0 + Xmis β̇1 + ε̇, where ε̇ ∼ N(0, σ̇²),
and where β̇0, β̇1 and σ̇ are the least squares estimates calculated from
a bootstrap sample taken from the observed data. This is an alternative
way to implement “predict + noise + parameter uncertainty.” The
method is available as method norm.boot.
3.2.2 Algorithms♠
The calculations of the first two methods are straightforward and do not
need further explanation. This section describes the algorithms used to in-
troduce sampling variability into the parameter estimates of the imputation
model.
The Bayesian sampling draws β̇0 , β̇1 and σ̇ from their respective posterior
distributions. Box and Tiao (1973, Section 2.7) explain the Bayesian theory
behind the normal linear model. We use the method that draws imputations
under the normal linear model using the standard noninformative priors for
each of the parameters. Given these priors, the required inputs are the observed outcomes yobs, the corresponding predictors Xobs, and the predictors Xmis of the cases to be imputed.
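A minimal sketch of these Bayesian draws under the normal linear model, with illustrative names (not the mice implementation):

# sketch: draw sigma and beta from their posteriors under noninformative
# priors, then impute ymis from the normal linear model
norm.draw.sketch <- function(yobs, Xobs, Xmis) {
  Xobs <- cbind(1, Xobs)
  Xmis <- cbind(1, Xmis)
  v <- solve(crossprod(Xobs))
  beta.hat <- v %*% crossprod(Xobs, yobs)
  res <- yobs - Xobs %*% beta.hat
  df <- nrow(Xobs) - ncol(Xobs)
  sigma.dot <- sqrt(sum(res^2) / rchisq(1, df))
  beta.dot <- beta.hat + sigma.dot * t(chol(v)) %*% rnorm(ncol(Xobs))
  drop(Xmis %*% beta.dot) + rnorm(nrow(Xmis), sd = sigma.dot)
}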
Algorithm 3.2: Imputation under the normal linear model with bootstrap.♠
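A matching sketch of the bootstrap variant, in the spirit of Algorithm 3.2 (again with illustrative names): resample the observed cases, compute the least squares estimates on the bootstrap sample, and use those to generate the imputations.

# sketch: bootstrap version, replacing the posterior draws by least squares
# estimates computed on a bootstrap sample of the observed cases
norm.boot.sketch <- function(yobs, Xobs, Xmis) {
  Xobs <- cbind(1, Xobs)
  Xmis <- cbind(1, Xmis)
  idx <- sample(nrow(Xobs), replace = TRUE)
  yb <- yobs[idx]
  Xb <- Xobs[idx, , drop = FALSE]
  beta.dot <- solve(crossprod(Xb), crossprod(Xb, yb))
  sigma.dot <- sqrt(sum((yb - Xb %*% beta.dot)^2) / (nrow(Xb) - ncol(Xb)))
  drop(Xmis %*% beta.dot) + rnorm(nrow(Xmis), sd = sigma.dot)
}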
3.2.3 Performance
Which of these four imputation methods of Section 3.2 is best? In order
to find out let us conduct a small simulation experiment where we calculate
the performance statistics introduced in Section 2.5.3. We keep close to the
original data by assuming that β0 = 5.49, β1 = −0.29 and σ = 0.86 are
the population values. These values are used to generate artificial data with
known properties.
Table 3.1 summarizes the results for the situation where we have 50% com-
pletely random missing in y and m = 5. All methods are unbiased for β1 . The
confidence interval of method norm.predict is much too short, leading to
substantial undercoverage and p-values that are “too significant.” This result
confirms the problems already noted in Section 2.6. The norm.nob method
performs better, but the coverage of 0.908 is still too low. Methods norm and
norm.boot and complete-case analysis are correct. Complete-case analysis is
a correct analysis here (Little and Rubin, 2002), and in fact the most effi-
cient choice for this problem as it yields the shortest confidence interval (cf.
Section 2.7). This result does not hold more generally. In realistic situations
involving more covariates multiple imputation will rapidly catch up and pass
complete-case analysis. Note that the RMSE values are uninformative for sep-
arating correct and incorrect methods, and are in fact misleading.
While method norm.predict is simple and fast, the variance estimate is
too low. Several methods have been proposed to correct the estimate (Lee
et al., 1994; Fay, 1996; Rao, 1996; Schafer and Schenker, 2000). Though such
methods require special adaptation of formulas to calculate the variance, they
may be useful when the missing data are restricted to the outcome.
It is straightforward to adapt the simulations to other, perhaps more in-
teresting situations. Investigating the effect of missing data in the explana-
tory x instead of the outcome variable requires only a small change in the
function to create the missing data. Table 3.2 displays the results. Method
norm.predict is now severely biased, whereas the other methods remain un-
biased. The confidence interval of norm.nob is still too short, but less than in
Table 3.1. Methods norm, norm.boot and listwise deletion are correct, in the
sense that these are unbiased and have appropriate coverage. Again, under
the simulation conditions, listwise deletion is the optimal analysis. Note that
norm is slightly biased, whereas method norm.boot slightly underestimates
the variance. Both tendencies are small in magnitude. The RMSE values are
uninformative, and are only shown to illustrate that point.
We could increase the number of explanatory variables and the number of
imputations m to see how much the average confidence interval width would
shrink. It is also easy to apply more interesting missing data mechanisms,
such as those discussed in Section 3.2.4. Data can be generated from skewed
distributions, the sample size n can be varied and so on. Extensive simulation
work is available (Rubin and Schenker, 1986b; Rubin, 1987a).
Figure 3.2: Probability that Y2 is missing as a function of Y1 under
MARRIGHT, MARMID and MARTAIL.
σ12 = 0.6. We assume that all values generated are positive. Missing data in
Y2 can be created in many ways. Let R2 be the response indicator for Y2 . We
study three examples, each of which affects the distribution in different ways:
set.seed(1)
n <- 10000
sigma <- matrix(c(1, 0.6, 0.6, 1), nrow = 2)

# complete bivariate normal data for Y1 and Y2
cmp <- MASS::mvrnorm(n = n, mu = c(5, 5), Sigma = sigma)

# probability that Y2 is observed decreases with Y1 (MARRIGHT)
p2.marright <- 1 - plogis(-5 + cmp[, 1])
r2.marright <- rbinom(n, 1, p2.marright)
yobs <- cmp
yobs[r2.marright == 0, 2] <- NA
Figure 3.2 displays the probability of being missing under the three MAR models.
Figure 3.3: Box plot of Y2 separated for the observed and missing parts under
three models for the missing data based on n = 10000.
V1 V2
4.91 4.91
As expected, the means in the amputed data are lower than in the complete
data. It is possible to inspect the distributions of the observed data more
closely by md.pattern(amp$amp), bwplot(amp) and xyplot(amp).
Many options are available that allow the user to tailor the missing data
patterns to the data at hand. See Schouten et al. (2018) for details.
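A minimal ampute() call to create such amputed data could look like this (the settings are illustrative):

# sketch: generate roughly 50% MAR missingness in the complete data cmp
amp <- ampute(data.frame(cmp), prop = 0.5, mech = "MAR")
md.pattern(amp$amp, plot = FALSE)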
3.2.6 Conclusion
Tables 3.1 and 3.2 show that methods norm.predict (regression imputa-
tion) and norm.nob (stochastic regression imputation) fail in terms of under-
stating the uncertainty in the imputations. If the missing data occur in y only,
then it is possible to correct the variance formulas of method norm.predict.
However, if the missing data occur in X, norm.predict is severely biased, so
variance correction is not useful. Methods norm and norm.boot account
for the uncertainty of the imputation model and provide statistically correct infer-
ences. For missing y, the efficiency of these methods is less than theoretically
possible, presumably due to simulation error.
It is always better to include parameter uncertainty, either by the Bayesian
or the bootstrap method. The effect of doing so will diminish with increasing
sample size (Exercise 2), so for estimates based on a large sample one may
opt for the simpler norm.nob method if speed of calculation is at a premium.
Note that in subgroup analyses, the large-sample requirement applies to the
subgroup size, and not to the total sample size.
the data. Transformations may fail to achieve near-normality, and even if that
succeeds, bivariate relations may be affected when imputed by a method that
assumes normality. The examples of Von Hippel are somewhat extreme, but
they do highlight the point that simple fixes to achieve normality are limited
in what they can do.
There are two possible strategies to progress. The first is to use predic-
tive mean matching. Section 3.4 will describe this approach in more detail.
The other strategy is to model the non-normal data, and to directly draw
imputations from those models. Liu (1995) proposed methods for drawing
imputations under the t-distribution instead of the normal. He and Raghu-
nathan (2006) created imputations by drawing from Tukey’s gh-distribution,
which can take many shapes. Demirtas and Hedeker (2008a) investigated the
behavior of methods for drawing imputation from the Beta and Weibull distri-
butions. Likewise, Demirtas and Hedeker (2008b) took draws from Fleishman
polynomials, which allows for combinations of left and right skewness with
platykurtic and leptokurtic distributions.
The GAMLSS method (Rigby and Stasinopoulos, 2005; Stasinopoulos
et al., 2017) extends both the generalized linear model and the generalized
additive model. A unique feature of GAMLSS is its ability to specify a (pos-
sibly nonlinear) model for each of the parameters of the distribution, thus
giving rise to an extremely flexible toolbox that can be used to model almost
any distribution. The gamlss package contains over 60 built-in distributions.
Each distribution comes with a function to draw random variates, so once
the gamlss model is fitted, it can also be used to draw imputations. The first
edition of this book showed how to construct a new univariate imputation
function that mice could call. This is no longer needed. De Jong (2012)
and De Jong et al. (2016) developed a series of imputation methods based on
GAMLSS, so it is now easy to perform multiple imputation under a variety of
distributions. The ImputeRobust package (Salfran and Spiess, 2017) imple-
ments various mice methods for continuous data: gamlss (normal), gamlssJSU
(Johnson’s SU), gamlssTF (t-distribution) and gamlssGA (gamma distribu-
tion). The following section demonstrates the use of the package.
Figure 3.4: Measured head circumference of 755 Dutch boys aged 1–2 years
(Fredriks et al., 2000a).
outliers are genuine data, then the t-distribution should provide imputations
that are more realistic than the normal.
We create a synthetic dataset by imputing head circumference of the same
755 boys. Imputation is easily done with the following steps: append the
data with a duplicate, create missing data in hc and run mice() calling the
gamlssTF method as follows:
library(ImputeRobust)
library(gamlss)

# select head circumference (hc) of boys aged 1-2 years
data(db)
data <- subset(db, age > 1 & age < 2, c("age", "head"))
names(data) <- c("age", "hc")

# append a duplicate of the data and set its hc values to missing
synthetic <- rep(c(FALSE, TRUE), each = nrow(data))
data2 <- rbind(data, data)
row.names(data2) <- 1:nrow(data2)
data2[synthetic, "hc"] <- NA

# impute the missing hc under the t-distribution (method gamlssTF)
imp <- mice(data2, m = 1, meth = "gamlssTF", seed = 88009,
            print = FALSE)
syn <- subset(mice::complete(imp), synthetic)
Figure 3.5 is the equivalent of Figure 3.4, but now calculated from the
synthetic data. Both configurations are similar. As expected, some outliers
also occur in the imputed data, but these are a little less extreme than in
the observed data due to the smoothing by the t-distribution. The estimated
degrees of freedom varies over replications, and appears to be somewhat larger
than the value of 6.7 estimated from the observed data. For this replication,
it is larger (11.5). The distribution of the imputed data is better behaved
Figure 3.5: Fully synthetic data of head circumference of 755 Dutch boys
aged 1–2 years using a t-distribution.
compared to the observed data. The typical rounding patterns seen in the
real measurements are not present in the imputed data. Though these are
small differences, they may be of relevance in particular analyses.
Figure: Gas consumption versus temperature (°C), before and after insulation,
with the candidate donors for one incomplete observation highlighted.
particular instance, five candidate donors are found, four from the subgroup
“after insulation” and one from the subgroup “before insulation.” The last
step is to make a random draw among these five candidates. The red parts
in the figure will vary between different imputed datasets, and thus the set of
candidates will also vary over the imputed datasets.
The data point at coordinate (10.2, 2.6) is one of the candidate donors.
This point differs from the incomplete unit in both temperature and insulation
status, yet it is selected as a candidate donor. The advantage of including the
point is that closer matches in terms of the predicted values are possible.
Under the assumption that the distribution of the target in different bands is
similar, including points from different bands is likely to be beneficial.
1. Choose a threshold η, and take all i for which |ŷi − ŷj | < η as candidate
donors for imputing j. Randomly sample one donor from the candidates,
and take its yi as replacement value.
2. Take the closest candidate, i.e., the case i for which |ŷi − ŷj | is minimal as
the donor. This is known as “nearest neighbor hot deck,” “deterministic
hot deck” or “closest predictor.”
3. Find the d candidates for which |ŷi − ŷj | is minimal, and sample one
of them. Usual values for d are 3, 5 and 10. There is also an adaptive
method to specify the number of donors (Schenker and Taylor, 1996).
4. Sample one donor with a probability that depends on |ŷi − ŷj | (Siddique
and Belin, 2008).
later in the works of Koller-Meinfelder (2009, p. 43) and White et al. (2011b,
p. 383).
Algorithm 3.3 provides the steps used in predictive mean matching using
Bayesian parameter draws for β. It is possible to create the bootstrap version
of this algorithm that will also evade the need to draw β along the same lines
as Algorithm 3.2. Given that the number of candidate donors and the model
for the mean is provided by the user, the algorithm does not need an explicit
specification of the distribution.
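A minimal sketch of this matching scheme for a single predictor, using a Bayesian draw for β (illustrative names; not the mice implementation):

# sketch: predictive mean matching with a Bayesian draw of beta;
# y contains missing values, ry indicates observed entries, x is the predictor
pmm.sketch <- function(y, ry, x, donors = 5) {
  X <- cbind(1, x)
  Xobs <- X[ry, , drop = FALSE]
  Xmis <- X[!ry, , drop = FALSE]
  yobs <- y[ry]
  v <- solve(crossprod(Xobs))
  beta.hat <- v %*% crossprod(Xobs, yobs)
  res <- yobs - Xobs %*% beta.hat
  df <- nrow(Xobs) - ncol(Xobs)
  sigma.dot <- sqrt(sum(res^2) / rchisq(1, df))
  beta.dot <- beta.hat + sigma.dot * t(chol(v)) %*% rnorm(ncol(Xobs))
  yhat.obs <- drop(Xobs %*% beta.hat)   # predicted means of potential donors
  yhat.mis <- drop(Xmis %*% beta.dot)   # predicted means of the targets
  # for each target, sample one of the 'donors' closest observed cases
  sapply(yhat.mis, function(z) {
    id <- order(abs(yhat.obs - z))[seq_len(donors)]
    yobs[id[sample.int(donors, 1)]]
  })
}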
Morris et al. (2014) suggested a variation called local residuals draws.
Rather than taking the observed value of the donor, this method borrows
the residual from the donor, and adds that to the predicted value from the
target case. Thus, imputations are not equal to observed values, and can ex-
tend beyond the range of the observed data. This may address concerns about
variability of imputations.
settings for d, and found that d = 5 and d = 10 generally provided the best
results. Kleinke (2017) found that d = 5 may be too high for sample sizes lower
than n = 100, and suggested setting d = 1 for better point estimates for small
samples. Gaffert et al. (2016) explored scenarios in which candidate donors
have different probabilities to be drawn, where the probability depends on
the distance between the donor and recipient cases. As all observed cases can
be donors in this scenario, there is no need to specify d. Instead a closeness
parameter needs to be specified, and this was made adaptive to the data. An
advantage of using all donors is that the variance of the imputations can be
corrected by the Parzen correction, which alleviates concerns about insuffi-
cient variability of the imputes. Their simulations showed that with a small
sample (n = 10), the adaptive method is clearly superior to methods with a
fixed donor pool. The method is available in mice as the midastouch method.
There is also a separate midastouch package in R. Related work can be found
in Tutz and Ramzan (2015).
The default in mice is d = 5, and represents a compromise. The above
results suggest that an adaptive method for setting d could improve small
sample behavior. Meanwhile, the number of donors can be changed through
the donors argument.
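For example, with the nhanes data as a stand-in, a larger donor pool can be requested as follows (illustrative call):

# illustrative only: predictive mean matching with 10 candidate donors
imp <- mice(nhanes, method = "pmm", donors = 10, print = FALSE, seed = 1)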
Table 3.3 repeats the simulation experiment done in Tables 3.1 and 3.2
for predictive mean matching for three different choices of the number d of
candidate donors. Results are given for n = 50 and n = 1000. For n = 50
we find that β1 is increasingly biased towards the null for larger d. Because
of the bias, the coverage is lower than nominal. For missing x the bias is
much smaller. Setting d to a lower value, as recommended by Kleinke (2017),
improves point estimates, but the magnitude of the effect depends on whether
the missing values occur in x or y. For the sample size n = 1000 predictive
mean matching appears well calibrated for d = 5 for missing data in y, and
has slight undercoverage for missing data in x. Note that Table 3.3 in the first
edition of this book presented incorrect information because it had erroneously
imputed the data by norm instead of pmm.
3.4.4 Pitfalls
The obvious danger of predictive mean matching is the duplication of the
same donor value many times. This problem is more likely to occur if the
sample is small, or if there are many more missing data than observed data in
a particular region of the predicted value. Such unbalanced regions are more
likely if the proportion of incomplete cases is high, or if the imputation model
contains variables that are very strongly related to the missingness. For small
samples the donor pool size can be reduced, but be aware that this may not
work if there are only a few predictors.
The traditional method does not work for a small number of predictors.
Heitjan and Little (1991) report that for just two predictors the results were
“disastrous.” The cause of the problem appears to be related to their use
                       RB     PB     CR     AW    RMSE
Missing x
  pmm          1    -0.002    0.8  0.916  0.223  0.063
  pmm          3     0.002    0.9  0.931  0.228  0.061
  pmm          5     0.008    2.8  0.938  0.237  0.062
  pmm         10     0.028    9.6  0.946  0.261  0.067
Missing y, n = 50, κ
  midastouch auto    0.013    4.5  0.920  0.265  0.066
  midastouch   2     0.032   11.1  0.917  0.273  0.068
  midastouch   3     0.018    6.2  0.927  0.261  0.064
  midastouch   4     0.012    4.1  0.926  0.260  0.064
Missing x
  midastouch auto   -0.003    0.9  0.932  0.241  0.060
  midastouch   2     0.013    4.4  0.959  0.264  0.059
  midastouch   3     0.000    0.2  0.947  0.245  0.058
  midastouch   4    -0.004    1.4  0.940  0.237  0.058
Missing y, n = 1000, d
  pmm          1     0.001    0.2  0.929  0.056  0.014
  pmm          3     0.001    0.4  0.950  0.056  0.013
  pmm          5     0.002    0.6  0.951  0.055  0.013
  pmm         10     0.003    1.2  0.932  0.054  0.013
Missing x
  pmm          1     0.000    0.2  0.926  0.041  0.011
  pmm          3     0.000    0.1  0.933  0.041  0.011
  pmm          5     0.000    0.1  0.937  0.042  0.011
  pmm         10     0.000    0.1  0.928  0.042  0.011
3.4.5 Conclusion
Predictive mean matching with d = 5 is the default in mice() for contin-
uous data. The method is robust against misspecification of the imputation
model, yet performs as well as theoretically superior methods. In the context
of missing covariate data, Marshall et al. (2010a) concluded that predictive
mean matching “produced the least biased estimates and better model per-
formance measures.” Another simulation study that addressed skewed data
concluded that predictive mean matching “may be the preferred approach
provided that less than 50% of the cases have missing data and the missing
data are not MNAR” (Marshall et al., 2010b). Kleinke (2017) found that the
method works well across a wide variety of scenarios, but warned that the
default settings cannot address severe skewness or small samples.
The method works best with large samples, and provides imputations that
possess many characteristics of the complete data. Predictive mean matching
cannot be used to extrapolate beyond the range of the data, or to interpolate
within the range of the data if the data at the interior are sparse. Also, it may
not perform well with small datasets. Bearing these points in mind, predictive
mean matching is a great all-around method with exceptional properties.
Figure 3.8: Regression tree for predicting gas consumption. The left-hand
plot displays the binary tree, whereas the right-hand plot identifies the groups
at each end leaf in the data.
binary tree. The target variable can be discrete (classification tree) or contin-
uous (regression tree).
Figure 3.8 illustrates a simple CART solution for the whiteside data. The
left-hand side contains the optimal binary tree for predicting gas consump-
tion from temperature and insulation status. The right-hand side shows the
scatterplot in which the five groups are labeled by their terminal nodes.
CART methods have properties that make them attractive for imputation:
they are robust against outliers, can deal with multicollinearity and skewed
distributions, and are flexible enough to fit interactions and nonlinear rela-
tions. Furthermore, many aspects of model fitting have been automated, so
there is “little tuning needed by the imputer” (Burgette and Reiter, 2010).
The idea of using CART methods for imputation has been suggested by a
wide variety of authors in a variety of ways. See Saar-Tsechansky and Provost
(2007) for an introductory overview. Some investigators (He, 2006; Vateekul
and Sarinnapakorn, 2009) simply fill in the mean or mode. The majority of
tree-based imputation methods use some form of single imputation based on
prediction (Bárcena and Tusell, 2000; Conversano and Cappelli, 2003; Siciliano
et al., 2006; Creel and Krotki, 2006; Ishwaran et al., 2008; Conversano and
Siciliano, 2009). Multiple imputation methods have been developed by Harrell
(2001), who combined it with optimal scaling of the input variables, by Reiter
(2005b) and by Burgette and Reiter (2010). Wallace et al. (2010) present
a multiple imputation method that averages the imputations to produce a
single tree and that does not pool the variances. Parker (2010) investigates
multiple imputation methods for various unsupervised and supervised learning
algorithms.
The missForest method (Stekhoven and Bühlmann, 2011) successfully
used regression and classification trees to predict the outcomes in mixed con-
tinuous/categorical data. MissForest is popular, presumably because it pro-
duces a single complete dataset, which at the same time is the reason why
it fails as a scientific method. The missForest method does not account for
the uncertainty caused by the missing data, treats the imputed data as if
they were real (which they are not), and thus invents information. As a con-
sequence, p-values calculated after application of missForest will be more
significant than they actually are, confidence intervals will be shorter than
they actually are, and relations between variables will be stronger than they
actually are. These problems worsen as more missing values are imputed. Un-
fortunately, comparison studies that evaluate only accuracy, such as Waljee
et al. (2013), will fail to detect these problems.
As an alternative, multiple imputations can be created using the tree in
Figure 3.8. For a given temperature and insulation status, traverse the tree
and find the appropriate terminal node. Form the donor group of all observed
cases at the terminal node, randomly draw a case from the donor group,
and take its reported gas consumption as the imputed value. The idea is
identical to predictive mean matching (cf. Section 3.4), where the “predictive
mean” is now calculated by a tree model instead of a regression model. As
before, the parameter uncertainty can be incorporated by fitting the tree on
a bootstrapped sample.
Algorithm 3.4 describes the major steps of an algorithm for creating impu-
tations using a classification or regression tree. There is considerable freedom
at step 2, where the tree model is fitted to the training data (ẏobs , Ẋobs ). It
may be useful to fit the tree such that the number of cases at each node is equal
to some pre-set number, say 5 or 10. The composition of the donor groups
will vary over different bootstrap replications, thus incorporating sampling
uncertainty about the tree.
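In mice this idea is available as method cart; a minimal call with illustrative settings could be:

# illustrative only: impute with classification and regression trees;
# minbucket (minimum number of cases in a terminal node) is passed to rpart
imp <- mice(nhanes, method = "cart", minbucket = 5, print = FALSE, seed = 1)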
Multiple imputation methodology using trees has been developed by Bur-
gette and Reiter (2010), Shah et al. (2014) and Doove et al. (2014). The
main motivation given in these papers was to improve our ability to account
for interactions and other non-linearities, but these are generic methods that
apply to both continuous and categorical outcomes and predictors. Burgette
and Reiter (2010) used the tree package, and showed that the CART results
for recovering interactions were uniformly better than standard techniques.
Shah et al. (2014) applied random forest techniques to both continuous and
categorical outcomes, which produced more efficient estimates than standard
procedures. The techniques are available as methods rfcat and rfcont in
the CALIBERrfimpute package. Doove et al. (2014) independently developed
a similar set of routines building on the rpart (Therneau et al., 2017) and
randomForest (Liaw and Wiener, 2002) packages. Methods cart and rf are
part of mice.
A recent development is the growing interest from the machine learning
community for the idea of multiple imputation. The problem of imputing miss-
ing values has now been discovered by many, but unfortunately nearly all new
algorithms produce single imputations. An exception is the paper by Sovilj
et al. (2016), who propose the extreme learning machine using conditional
Gaussian mixture models to generate multiple imputations. It is a matter
of time before researchers realize the intimate connections between multiple
imputation and ensemble learning, so that more work along these lines may
follow.
A binary variable is typically imputed under the logistic regression model

Pr(yi = 1|Xi, β) = exp(Xiβ) / (1 + exp(Xiβ))          (3.4)
A categorical variable with K unordered categories is imputed under the multi-
nomial logit model
Pr(yi = k|Xi, β) = exp(Xiβk) / Σk=1..K exp(Xiβk)       (3.5)
An ordered categorical variable with K ordered categories is imputed under
the ordered logit, or proportional odds, model

Pr(yi ≤ k|Xi, β, τk) = exp(τk − Xiβ) / (1 + exp(τk − Xiβ))    (3.6)
where the slope β is identical across categories, but the intercepts τk differ.
For identification, we set τ1 = 0. The probability of observing category k is
written as

Pr(yi = k|Xi) = Pr(yi ≤ k|Xi) − Pr(yi ≤ k − 1|Xi)

where the model parameters β, τk and τk−1 are suppressed for clarity.
Scott Long (1997) is a very readable introduction to these methods. The
practical application of these techniques in R is treated in Aitkin et al. (2009).
The general idea is to estimate the probability model on the subset of the ob-
served data, and draw synthetic data according to the fitted probabilities to
impute the missing data. The parameters are typically estimated by iteratively
reweighted least squares. As before, the variability of the model parameters β
and τ2 , . . . , τK introduces additional uncertainty that needs to be incorporated
into the imputations.
Algorithm 3.5 provides the steps for an approximate Bayesian imputation
method using logistic regression. The method assumes that the parameter
vector β follows a multivariate normal distribution. Although this is true in
large samples, the distribution can in fact be far from normal for modest n1 ,
for large q or for predicted probabilities close to 0 or 1. The procedure is also
approximate in the sense that it does not draw the estimated covariance
matrix V. It is possible to define an explicit Bayesian sampling for drawing β
and V from their exact posteriors. This method is theoretically preferable,
but as it requires more elaborate modeling, it does not easily extend to other
regression situations. In mice the algorithm is implemented as the method
logreg.
It is easy to construct a bootstrap version that avoids some of the diffi-
culties in Algorithm 3.5. Prior to estimating β̂, we include a step that draws
a bootstrap sample from Yobs and Xobs . Steps 2–5 can then be replaced by
equating β̇ = β̂.
The algorithms for imputation of variables with more than two categories
follow the same structure. In mice the multinomial logit model in method
polyreg is estimated by the nnet::multinom() function in the nnet package.
The ordered logit model in method polr is estimated by the polr() function
of the MASS package. Even though the ordered model uses fewer parameters,
it is often more difficult to estimate. In cases where MASS::polr() fails to
converge, nnet::multinom() will take over its duties. See Venables and Ripley
(2002) for more details on both functions.
group) we know only the symptom and impute disease status. Under MAR,
we should impute all 100 cases with the symptom to the diseased group, and
divide the 100 cases without the symptom randomly over the diseased and
non-diseased groups. However, this is not what happens in Algorithm 3.5. The
estimate of V will be very large as a result of separation. If we naively use this
V, then β̇ in step 5 covers positive and negative values with roughly equal
probability. The result is that either all 100 imputations are (correctly) placed
in Yes or all 100 are (incorrectly) placed in No, thereby biasing the estimated
disease probability.
The problem has recently gained attention. There are at least six different
approaches to perfect prediction:
1. Eliminate the variable that causes perfect prediction.
2. Take β̂ instead of β̇.
3. Use penalized regression with Jeffreys prior in step 2 of Algorithm 3.5
(Firth, 1993; Heinze and Schemper, 2002).
4. Use the bootstrap, and then apply method 2.
5. Use data augmentation, a method that concatenates pseudo-observations
with a small weight to the data, effectively prohibiting infinite estimates
(Clogg et al., 1991; White et al., 2010).
6. Apply the explicit Bayesian sampling with a suitable weak prior. Gelman
et al. (2008) recommend using independent Cauchy distributions on all
logistic regression coefficients.
Eliminating the most predictive variable is generally undesirable in the
context of imputation, and may in fact bias the relation of interest. Option 2
does not yield proper imputations, and is therefore not recommended. Op-
tion 3 provides finite estimates, but has been criticized as not being well in-
terpretable in a regression context (Gelman et al., 2008) and computationally
inefficient (White et al., 2010). Option 4 corrects method 2, and is simple to
implement. Options 5 and 6 have been recommended by White et al. (2010)
and Gelman et al. (2008), respectively.
Methods 4, 5 and 6 all solve a major difficulty in the construction of auto-
matic imputation techniques. It is not yet clear whether one of these methods
is superior. The logreg, polr and polyreg methods in mice implement op-
tion 5.
3.6.3 Evaluation
The methods are based on the elegant generalized linear models. Simu-
lations presented in Van Buuren et al. (2006) show that these methods per-
formed quite well in the lab. When used in practice however, the methods
may be unstable, slow and exhibit poor performance. Hardt et al. (2013) in-
tentionally pushed the logistic methods to their limits, and observed that most
methods break down relatively quickly, i.e., if the proportion of missing values
exceeds 0.4. Van der Palm et al. (2016a) found that logreg failed to pick up
a three-way association in the data, leading to biased estimates. Likewise, Vi-
dotto et al. (2015) observed that logreg did not recover the structure in the
data as well as latent class models. Wu et al. (2015) found poor results for all
three methods (i.e., binary, multinomial and proportional odds), and advise
against their application. Akande et al. (2017) reported difficulties with fitting
multinomial variables having many categories. The performance of the proce-
dures suffered when variables with probabilities nearly equal to one (or zero)
were included in the models. Methods based on the generalized linear model
were found to be inferior to method cart (cf. Section 3.5) and to latent class
models for categorical data (cf. Section 4.4). Audigier et al. (2017) found that
logistic regression presented difficulties on the datasets with a high number of
categories, resulting in undercoverage on several quantities.
Imputation of categorical data is more difficult than continuous data. As
a rule of thumb, in logistic regression we need at least 10 events per predic-
tor in order to get reasonably stable estimates of the regression coefficients
(Van Belle, 2002, p. 87). So if we impute 10 binary outcomes, we need 100
events, and if the events occur with a probability of 0.1, then we need n > 1000
cases. If we impute outcomes with more categories, the numbers rapidly in-
crease for two reasons. First, we have more possible outcomes, and we need 10
events for each category. Second, when used as predictor, each nominal vari-
able is expanded into dummy variables, so the number of predictors multiplies
by the number of categories minus 1. The defaults logreg, polyreg and polr
tend to preserve the main effects well provided that the parameters are identi-
fied and can be reasonably well estimated. In many datasets, especially those
with many categories, the ratio of the number of fitted parameters relative
to the number of events easily drops below 10, which may lead to estimation
problems. In those cases, the advice is to specify more robust methods, like
pmm, cart or rf.
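As a hedged illustration with nhanes2 as a stand-in dataset, the default method for one categorical variable can be replaced by a more robust alternative:

# illustrative only: switch the binary variable hyp from the default logreg
# to cart, keeping the defaults for the other variables
meth <- make.method(nhanes2)
meth["hyp"] <- "cart"
imp <- mice(nhanes2, method = meth, print = FALSE, seed = 1)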
two-part model was developed by Javaras and Van Dyk (2003), who extended
the standard general location model (Olkin and Tate, 1961) to impute partially
observed semi-continuous data.
Yu et al. (2007) evaluated nine different procedures. They found that pre-
dictive mean matching performs well, provided that a sufficient number of
data points in the neighborhood of the incomplete data are available. Vink
et al. (2014) found that generic predictive mean matching is at least as good
as three dedicated methods for semi-continuous data: the two-part models as
implemented in mi (Su et al., 2011) and irmi (Templ et al., 2011b), and the
blocked general location model by Javaras and Van Dyk (2003). Vroomen et al.
(2016) investigated imputation of cost data, and found that predictive mean
matching on the log-transformed costs outperformed plain predictive mean
matching, a two-step method and complete-case analysis, and hence recommend
the log-transformed method for monetary data.
The problem of censored event times has been studied extensively. There
are many statistical methods that can analyze left- or right-censored data di-
rectly, collectively known as survival analysis. Kleinbaum and Klein (2005),
Hosmer et al. (2008) and Allison (2010) provide useful introductions into the
field. Survival analysis is the method of choice if censoring is restricted to the
single outcomes. The approach is, however, less suited for censored predictors
or for multiple interdependent censored outcomes. Van Wouwe et al. (2009)
discuss an empirical example of such a problem. The authors are interested in
the time interval between resuming contraception and cessation of lac-
tation in young mothers who gave birth in the last 6 months. As the sample
was cross-sectional, both contraception and lactation were subject to censor-
ing. Imputation could be used to impute the hypothetically uncensored event
times in both durations, and this allowed a study of the association between
the uncensored event times.
The problem of missing event times is relevant if the event time is un-
observed. The censoring status is typically also unknown if the event time is
missing. Missing event times may be due to happenstance, for example, re-
sulting from a technical failure of the instrument that measures event times.
Alternatively, the missing data could have been caused by truncation, where
all event times beyond the truncation point are set to missing. It will be clear
that the optimal way to deal with the missing events data depends on the
reasons for the missingness. Analysis of the complete cases will systematically
distort the analysis of the event times if the data are truncated.
Imputation of right-censored data has received most attention to date. In
general, the method aims to find new (longer) event times that would have
been observed had the data not been censored. Let n1 denote the number of
observed failure times, let n0 = n − n1 denote the number of censored event
times and let t1 , . . . , tn be the ordered set of failure and censored times. For
some time point t, the risk set R(t) = {ti : ti > t, i = 1, . . . , n} is the set of event
and censored times that are longer than t. Taylor et al. (2002) proposed two
imputation strategies for right-censored data:
1. Risk set imputation. For a given censored value t construct the risk set
R(t), and randomly draw one case from this set. Both the failure time
and censoring status from the selected case are used to impute the data.
2. Kaplan–Meier imputation. For a given censored value t construct the
risk set R(t) and estimate the Kaplan–Meier curve from this set. A
randomly drawn failure time from the Kaplan–Meier curve is used for
imputation.
Both methods are asymptotically equivalent to the Kaplan–Meier estimator
after multiple imputation with large m. The adequacy of imputation proce-
dures will depend on the availability of possible donor observations, which
diminishes in the tails of the survival distribution. The Kaplan–Meier method
has the advantage that nearly all censored observations are replaced by imputed
failure times.
Algorithm 3.6: Imputation of right-censored data using predictive mean
matching, Kaplan–Meier estimation and the bootstrap.♠
In principle, both Bayesian sampling and bootstrap meth-
ods can be used to incorporate model uncertainty, but in practice only the
bootstrap has been used.
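As a hedged sketch of the Kaplan–Meier strategy for a single censored time t (using the survival package; names and details are illustrative and do not reproduce Algorithm 3.6):

library(survival)

# sketch: draw a failure time from the Kaplan-Meier curve of the risk set R(t)
km.impute <- function(time, status, t) {
  risk <- time > t
  if (!any(risk & status == 1)) return(c(time = t, status = 0))
  fit <- summary(survfit(Surv(time[risk], status[risk]) ~ 1))
  p <- -diff(c(1, fit$surv))        # Kaplan-Meier mass at each failure time
  p <- p / sum(p)
  c(time = fit$time[sample.int(length(fit$time), 1, prob = p)], status = 1)
}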
Hsu et al. (2006) extended both methods to include covariates. The authors
fitted a proportional hazards model and calculated a risk score as a linear
combination of the covariates. The key adaptation is to restrict the risk set to
those cases that have a risk score similar to the risk score of the censored
case, an idea similar to predictive mean matching. A donor group size of
d = 10 was found to perform well, and Kaplan–Meier imputation was superior
to risk set imputation across a wide range of situations.
Algorithm 3.6 is based on the KIMB method proposed by Hsu et al. (2006).
The method assumes that censoring status is known, and aims to impute
plausible event times for censored observations. Hsu et al. (2006) actually
suggested fitting two proportional hazards models, one with survival time as
outcome and one with censoring status as outcome, but in order to keep in
line with the rest of this chapter, here we only fit the model for survival time.
The way in which predictive mean matching is done differs slightly from Hsu
et al. (2006).
The literature on imputation methods for censored and rounded data is
rapidly evolving. Alternative methods for right-censored data have also been
proposed (Wei and Tanner, 1991; Geskus, 2001; Lam et al., 2005; Liu et al.,
2011). Lyles et al. (2001), Lynn (2001), Hopke et al. (2001) and Lee et al.
(2018) concentrated on left-censored data. Imputation of interval-censored
data (rounded data) has been discussed quite extensively (Heitjan and Rubin,
1990; Dorey et al., 1993; James and Tanner, 1995; Pan, 2000; Bebchuk and
Betensky, 2000; Glynn and Rosner, 2004; Hsu, 2007; Royston, 2007; Chen
and Sun, 2010; Hsu et al., 2015). Imputation of double-censored data, where
both the initial and the final times are interval censored, is treated by Pan
(2001) and Zhang et al. (2009). Delord and Génin (2016) extended Pan’s
approach to interval-censored competing risks data, thus allowing estimation
of the survival function, cumulative incidence function, Cox and Fine & Gray
regression coefficients. These methods are available in the MIICD package.
Jackson et al. (2014) used multiple imputation to study departures from the
independent censoring assumption in the Cox model.
By comparison, very few methods have been developed to deal with trunca-
tion. Methods for imputing a missing censoring indicator have been proposed
by Subramanian (2009, 2011) and Wang and Dinse (2010).
into independent parts. Two main strategies to decompose P (Y, R) are known
as the selection model (Heckman, 1976) and the pattern-mixture model (Glynn
et al., 1986). Little and Rubin (2002, ch. 15) and Little (2009) provide in-depth
discussions of these models.
Imputations are created most easily under the pattern-mixture model. Her-
zog and Rubin (1983, pp. 222–224) proposed a simple and general family of
nonignorable models that accounts for shift bias, scale bias and shape bias.
Suppose that we expect that the nonrespondent data are shifted relative to
the respondent data. Adding a simple shift parameter δ to the imputations creates a difference of δ between the means. In a similar vein, if we suspect that
the nonrespondents and respondents use different scales, we can multiply each
imputation by a scale parameter. Likewise, if we suspect that the shapes of
both distributions differ, we could redraw values from the candidate impu-
tations with a probability proportional to the dissimilarity between the two
distributions, a technique known as the SIR algorithm (Rubin, 1987b). We
only discuss the shift parameter δ.
In practice, it may be difficult to specify the distribution of the nonre-
spondents, e.g., to provide a sensible specification of δ. One approach is to
compare the results under different values of δ by sensitivity analysis. Though
helpful, this puts the burden on the specification of realistic scenarios, i.e., a
set of plausible δ-values. The next sections describe the selection model and
pattern mixture in more detail, as a way to evaluate the plausibility of δ.
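As a concrete illustration, the following is a minimal sketch (the data, shift values and analysis model are assumptions) of such a δ-sensitivity analysis using the post-processing facility of mice(): each run shifts the imputed values of one variable by a fixed δ and re-estimates the quantities of interest.

library(mice)

# Sketch of a delta-adjustment sensitivity analysis: impute under MAR, then
# shift the imputations of chl downward by a fixed offset delta.
deltas <- c(0, -5, -10, -15, -20)
ini <- mice(nhanes, maxit = 0)            # dry run to obtain default settings
post <- ini$post
results <- vector("list", length(deltas))
for (i in seq_along(deltas)) {
  post["chl"] <- paste("imp[[j]][, i] <- imp[[j]][, i] +", deltas[i])
  imp <- mice(nhanes, post = post, m = 5, print = FALSE, seed = 1)
  results[[i]] <- summary(pool(with(imp, lm(chl ~ age + bmi))))
}

Comparing the pooled estimates across the elements of results shows how sensitive the conclusions are to the assumed δ.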
the missing data are created mostly at the lower blood pressures. Section 9.2.1
discusses why more missing data in the lower levels are plausible. When taken
together, the columns P (Y ) and P (R = 1|Y ) specify a selection model.
Compared to Equation 3.8 this model only reverses the roles of Y and R, but
the interpretation is quite different. The pattern-mixture model emphasizes
that the combined distribution is a mix of the distributions of Y in the respon-
ders and nonresponders. The model needs a specification of the distribution
P (Y |R = 1) of the responders (which can be conveniently modeled after the
data), and of the distribution P (Y |R = 0) of the nonresponders (for which we
have no data at all). The joint distribution is the mixture of these two distri-
butions, with mixing probabilities P (R = 1) and P (R = 0) = 1 − P (R = 1),
the overall proportions of observed and missing data, respectively.
Numerical example. The columns labeled P (Y |R = 1) and P (Y |R = 0) in
Table 3.5 contain the probability per blood pressure category for the respon-
dents and nonrespondents. Since more missing data are expected to occur at
lower blood pressures, the mass of the nonresponder distribution has shifted
toward the lower end of the scale. As a result, the mean of the nonrespon-
der distribution is equal to 138.6 mmHg, while the mean of the responder
distribution equals 151.58 mmHg.
[Figure: left panel, observation probability P (R = 1|Y ) versus blood pressure (100–200 mmHg) under the selection model; right panel, densities of blood pressure for the observed and missing data under the pattern-mixture model.]
Table 3.6: Difference between the means of the blood pressure distributions
of the response and nonresponse groups, and its interpretation in the light of
what we know about the data.
δ Interpretation
0 mmHg MCAR, δ too small
−5 mmHg Small effect
−10 mmHg Large effect
−15 mmHg Extreme effect
−20 mmHg Too extreme effect
The left-hand plot shows the observation probabilities P (R = 1|Y ) under the selection model. The right-hand plot provides the distributions P (Y |R) in the observed
(blue) and missing (red) data in the corresponding pattern-mixture model.
The hypothetically complete distribution is given by the black curve. The
distribution of blood pressure in the group with missing blood pressures is
quite different, both in form and location. At the same time, observe that the
effect of missingness on the combined distribution is only slight. The reason
is that 87% of the information is actually observed.
The mean of the distribution of the observed data remains almost un-
changed (151.6 mmHg instead of 150 mmHg), but the mean of the distribu-
tion of the missing data is substantially lower at 138.6 mmHg. Thus, under
the assumed selection model we expect that the mean of the imputed data
should be 151.6 − 138.6 = 13 mmHg lower than in the observed data.
In cases where no one model will be obviously more realistic than any
other, Rubin (1987a, p. 203) stressed the need for easily communicated mod-
els, like a “20% increase over the ignorable value.” Little (2009, p. 49) warned
that it is easy to be enamored of complicated models for P (Y, R) so that we
may be “lulled into a false sense of complacency about the fundamental lack
of identification,” and suggested simple methods:
The idea of adding offsets is simple, transparent, and can be readily
accomplished with existing software.
Adding a constant or multiplying by a value are in fact the most direct ways
to specify nonignorable models.
3.9 Exercises
1. MAR. Reproduce Table 3.1 and Table 3.2 for MARRIGHT, MARMID
and MARTAIL missing data mechanisms of Section 3.2.4.
(a) Are there any choices that you need to make? If so, which?
(b) Consider the six possibilities to combine the missing data mech-
anism and missingness in x or y. Do you expect complete-case
analysis to perform well in each case?
(c) Do the Bayesian sampling and bootstrap methods also work under
the three MAR mechanisms?
Figure 4.1: Some missing data patterns in multivariate data. Blue is ob-
served, red is missing.
    A B C
2   1 1 1 0
3   1 1 0 1
1   1 0 1 1
2   0 0 1 2
    2 3 3 8
p <- md.pairs(pattern4)
p
$rr
A B C
A 6 5 3
B 5 5 2
C 3 2 5
$rm
A B C
A 0 1 3
B 0 0 3
C 2 3 0
$mr
A B C
A 0 0 2
B 1 0 3
C 3 3 0
$mm
A B C
A 2 2 0
B 2 3 0
C 0 0 3
Thus, for pair (A,B) there are five completely observed pairs (in rr), one pair in which A is observed and B is missing (in rm), no pairs in which A is missing and B is observed (in mr) and two pairs with both A and B missing (in mm).
Note that these numbers add up to the total sample size.
The proportion of usable cases (Van Buuren et al., 1999) for imputing
variable Yj from variable Yk is defined as
$$I_{jk} = \frac{\sum_{i=1}^{n} (1 - r_{ij})\, r_{ik}}{\sum_{i=1}^{n} (1 - r_{ij})} \qquad (4.1)$$
p$mr/(p$mr+p$mm)
A B C
A 0.000 0 1
B 0.333 0 1
C 1.000 1 0
The first row contains IAA = 0, IAB = 0 and IAC = 1. This informs us
that B is not relevant for imputing A since there are no observed cases in B
where A is missing. However, C is observed for both missing entries in A, and
may thus be a relevant predictor. The Ijk statistic is an inbound statistic that
measures how well the missing entries in variable Yj are connected to the rest
of the data.
The outbound statistic Ojk measures how observed data in variable Yj
connect to missing data in the rest of the data. The statistic is defined as
$$O_{jk} = \frac{\sum_{i=1}^{n} r_{ij}\,(1 - r_{ik})}{\sum_{i=1}^{n} r_{ij}} \qquad (4.2)$$
This quantity is the number of observed pairs (Yj , Yk ) with Yj observed and
Yk missing, divided by the total number of observed cases in Yj . The quantity
Ojk equals 1 if variable Yj is observed in all records where Yk is missing.
The statistic can be used to evaluate whether Yj is a potential predictor for
imputing Yk . We can calculate Ojk in the dataset pattern4 in Figure 4.1 for
all pairs (Yj , Yk ) by
p$rm/(p$rm+p$rr)
A B C
A 0.0 0.167 0.5
B 0.0 0.000 0.6
C 0.4 0.600 0.0
The influx coefficient Ij is equal to the number of variable pairs (Yj , Yk ) with Yj missing and Yk observed, divided by the total number of observed data cells. The
value of Ij depends on the proportion of missing data of the variable. Influx
of a completely observed variable is equal to 0, whereas for completely missing
variables we have Ij = 1. For two variables with the same proportion of missing
data, the variable with higher influx is better connected to the observed data,
and might thus be easier to impute.
The outflux coefficient Oj is defined in an analogous way as
$$O_j = \frac{\sum_{k=1}^{p} \sum_{i=1}^{n} r_{ij}\,(1 - r_{ik})}{\sum_{k=1}^{p} \sum_{i=1}^{n} (1 - r_{ij})} \qquad (4.4)$$
Figure 4.2: Fluxplot: Outflux versus influx in the four missing data patterns
from Figure 4.1. The influx of a variable quantifies how well its missing data
connect to the observed data on other variables. The outflux of a variable
quantifies how well its observed data connect to the missing data on other
variables. In general, higher influx and outflux values are preferred.
flux(pattern4)[,1:3]
The rows correspond to the variables. The columns contain the propor-
tion of observed data, Ij and Oj . Figure 4.2 shows the influx-outflux pattern
of the four patterns in Figure 4.1 produced by fluxplot(). In general, vari-
ables that are located higher up in the display are more complete and thus
potentially more useful for imputation. It is often (but not always) true that
Ij + Oj ≤ 1, so in practice variables closer to the subdiagonal are typically
better connected than those farther away. The fluxplot can be used to spot
variables that clutter the imputation model. Variables that are located in the
lower regions (especially near the lower-left corner) and that are uninteresting
for later analysis are better removed from the data prior to imputation.
Influx and outflux are summaries of the missing data pattern intended
to aid in the construction of imputation models. Keeping everything else con-
stant, variables with high influx and outflux are preferred. Realize that outflux
indicates the potential (and not actual) contribution to impute other variables.
A variable with high Oj may turn out to be useless for imputation if it is unre-
lated to the incomplete variables. On the other hand, the usefulness of a highly
predictive variable is severely limited by a low Oj . More refined measures of
usefulness are conceivable, e.g., multiplying Oj by the average proportion of
explained variance. Also, we could specialize to one or a few key variables
to impute. Alternatively, analogous measures for Ij could be useful. The fur-
ther development of diagnostic summaries for the missing data pattern is a
promising area for further investigation.
This list is by no means exhaustive, and other complexities may appear for
particular data. The next sections discuss three general strategies for imputing
multivariate data:
1. Monotone data imputation. For monotone missing data patterns, impu-
tations are created by a sequence of univariate methods;
2. Joint modeling. For general patterns, imputations are drawn from a
multivariate model fitted to the data;
3. Fully conditional specification (FCS). For general patterns, imputations are created on a variable-by-variable basis by iterating over a set of conditionally specified imputation models.
Numerical example. The first three columns of the data frame nhanes2 in
mice have a monotone missing data pattern. In terms of the above notation,
X contains the complete variable age, Y1 is the variable hyp and Y2 is the
variable bmi. Monotone data imputation can be applied to generate m = 2
complete datasets by:
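A minimal sketch of such a run (the exact call is an assumption, not the book's code) uses the monotone visit sequence with a single iteration:

library(mice)

# Sketch: monotone data imputation for the first three columns of nhanes2.
# Variables are visited in order of increasing missingness, and one pass
# (maxit = 1) suffices because the pattern is monotone.
imp <- mice(nhanes2[, 1:3], visitSequence = "monotone", maxit = 1, m = 2,
            print = FALSE, seed = 123)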
4.3.2 Algorithm
Algorithm 4.1 provides the main steps of monotone data imputation. We
order the variables according to their missingness, and impute from left to right.
The primary advantage is speed. We need to make only two passes through
the data. Since the method uses single imputation in the first step, it should be
done only if the number of missing values that destroy the monotone pattern
is small.
Observe that the imputed values for the missing hyp data in row 3 could
also depend on bmi and chl, but in the procedure both predictors are ignored.
In principle, we can improve the method by incorporating bmi and chl into
the model, and then iterate. We will explore this technique in more detail in
Section 4.5, but first we study the theoretically nice alternative.
$$\dot{Y}^{t}_{\mathrm{mis}} \sim P(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, \dot{\theta}^{t-1}) \qquad (4.5)$$
$$\dot{\theta}^{t} \sim P(\theta \mid Y_{\mathrm{obs}}, \dot{Y}^{t}_{\mathrm{mis}}) \qquad (4.6)$$
where imputations from $P(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, \dot{\theta}^{t-1})$ are drawn by the method described in the previous section, and where draws from the parameter distribution $P(\theta \mid Y_{\mathrm{obs}}, \dot{Y}^{t}_{\mathrm{mis}})$ are generated according to the method of Schafer (1997, p. 184).
Algorithm 4.2 lists the major steps needed to impute multivariate miss-
ing data under the normal model. Additional background can be found in Li
(1988), Rubin and Schafer (1990) and Schafer (1997). Song and Belin (2004)
generated multiple imputations under the common factor model. The per-
formance of the method was found to be similar to that of the multivariate
normal distribution, the main pitfall being the danger of setting the numbers
of factors too low. Audigier et al. (2016) proposed an imputation method based
on Bayesian principal components analysis, and suggested it as an alternative
to regularize data with more columns than rows.
Schafer (1997, pp. 211–218) reported simulations that showed that imputa-
tions generated under the multivariate normal model are robust to non-normal
data. Demirtas et al. (2008) confirmed this claim in a more extensive simu-
lation study. The authors conclude that “imputation under the assumption
of normality is a fairly reasonable tool, even when the assumption of normal-
ity is clearly violated; the fraction of missing information is high, especially
when the sample size is relatively large.” It is often beneficial to transform
the data before imputation toward normality, especially if the scientifically in-
teresting parameters are difficult to estimate, like quantiles or variances. For
example, we could apply a logarithmic transformation before imputation to
remove skewness, and apply an exponential transformation after imputation
to revert to the original scale. See Von Hippel (2013) for a cautionary note.
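A small sketch of this transform-impute-backtransform idea (the variable, data and method choices are assumptions) is:

library(mice)

# Sketch: impute a right-skewed, positive variable on the log scale,
# then revert to the original scale.
data <- nhanes
data$chl <- log(data$chl)                          # transform toward normality
imp <- mice(data, method = "norm", m = 5, print = FALSE, seed = 123)
long <- complete(imp, action = "long", include = TRUE)
long$chl <- exp(long$chl)                          # back-transform the imputations
imp.back <- as.mids(long)                          # mids object on the original scale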
Some work on automatic transformation methods for joint models is
available. Van Buuren et al. (1993) developed an iterative transformation-
imputation algorithm that finds optimal transformations of the variables to-
ward multivariate normality. The algorithm is iterative because the multiply
imputed values contribute to define the transformation, and vice versa. Trans-
formations toward normality have also been incorporated in transcan() and
aregImpute() of the Hmisc package in R (Harrell, 2001).
If a joint model is specified, it is nearly always the multivariate normal
model. Alternatives like the t-distribution (Liu, 1995) are hardly being devel-
oped or applied. The recent research effort in the area has focused on models
for multilevel data. These developments are covered in Chapter 7.
Algorithm 4.2: Imputation of missing data by a joint model for multivariate
normal data.♠
4. Repeat for s = 1, . . . , S.
5. Calculate parameters φ̇s = SWP(θ̇t−1 , s) by sweeping the predic-
tors of pattern s out of θ̇t−1 .
6. Calculate ps as the number of missing data in pattern s. Calculate
os = p − ps .
7. Calculate the Cholesky decomposition Cs of the ps × ps submatrix
of φ̇s corresponding to the missing data in pattern s.
8. Draw a random vector z ∼ N (0, 1) of length ps .
9. Take β̇s as the os × ps submatrix of φ̇s of regression weights.
10. Calculate imputations $\dot{Y}^{t}_{[s]} = Y^{\mathrm{obs}}_{[s]}\dot{\beta}_s + C_s' z$, where $Y^{\mathrm{obs}}_{[s]}$ is the observed data in pattern s.
11. End repeat s.
12. Draw θ̇t = (µ̇, Σ̇) from the normal-inverse-Wishart distribution
according to Schafer (1997, p. 184).
13. End repeat t.
best performance, although its advantage over simple rounding was sometimes
slight.” Further work has been done by Yucel et al. (2008), who proposed
rounding such that the marginal distribution in the imputations is similar
to that of the observed data. Alternatively, Demirtas (2009) proposed two
rounding methods based on logistic regression and an additional drawing step
that makes rounding dependent on other variables in the imputation model.
Another proposal is to model the indicators of the categorical variables (Lee
et al., 2012). A single best rounding rule for categorical data has yet to be iden-
tified. Demirtas (2010) encourages researchers to avoid rounding altogether,
and apply methods specifically designed for categorical data.
Several joint models for categorical variables have been proposed that do
not rely on rounding. Schafer (1997) proposed several techniques to impute
categorical data and mixed continuous-categorical data. Missing data in con-
tingency tables can be imputed under the log-linear model. The model pre-
serves higher-order interactions, and works best if the number of variables is
small, say, up to six. Mixed continuous-categorical data can be imputed under
the general location model originally developed by Olkin and Tate (1961). This
model combines the log-linear and multivariate normal models by fitting a re-
stricted normal model to each cell of the contingency table. Further extensions
have been suggested by Liu and Rubin (1998) and Peng et al. (2004). Belin
et al. (1999) pointed out some limitations of the general location model for a
larger dataset with 16 binary and 18 continuous variables. Their study found
substantial differences between the imputed and follow-up data, especially for
the binary data.
Alternative imputation methods based on joint models have been devel-
oped. Van Buuren and Van Rijckevorsel (1992) maximized internal consis-
tency by the k-means clustering algorithm, and outlined methods to generate
multiple imputations. This is a single imputation method which artificially
strengthens the relations in the data. The MIMCA imputation technique (Au-
digier et al., 2017) uses a similar underlying model, and derives variability in
imputations by taking bootstrap samples under a chosen number of dimen-
sions.
Van Ginkel et al. (2007) proposed two-way imputation, a technique for
imputing incomplete categorical data by conditioning on the row and column
sum scores of the multivariate data. This method has applications for imputing
missing test item responses. Chen et al. (2011) proposed a class of models that
specifies the conditional density by an odds ratio representation relative to the
center of the distribution. This allows for separate models of the odds ratio
function and the conditional density at the center.
Vermunt et al. (2008) pioneered the use of latent class analysis for im-
puting categorical data. The latent class (or finite mixture) model describes
the joint distribution as the product of locally independent categorical vari-
ables. When the number of classes is large, the model can be used as a generic
density estimation tool that captures the relations between the variables by
a highly parameterized model. The relevant associations in the data need not be specified a priori, and the main modeling effort consists of setting the
number of latent classes. Unlike the saturated log-linear models advocated
by Schafer (1997), latent models can handle a large number of variables. Vi-
dotto et al. (2015) surveyed several different implementations of the latent
class model for imputation, both frequentist (Vermunt et al., 2008; Van der
Palm et al., 2016b) and Bayesian (Si and Reiter, 2013), which differ in the
ways to select the number of classes. Vidotto (2018) proposed an extension to
longitudinal data.
Joint models for nested (multilevel) data have been intensively studied.
Section 7.4 discusses these developments in more detail.
(Gelfand and Smith, 1990; Casella and George, 1992). In conventional applica-
tions of the Gibbs sampler, the full conditional distributions are derived from
the joint probability distribution (Gilks, 1996). In the MICE algorithm, the
conditional distributions are under direct control of the user, and so the joint
distribution is only implicitly known, and may not actually exist. While the
latter is clearly undesirable from a theoretical point of view (since we do not
know the joint distribution to which the algorithm converges), in practice it
does not seem to hinder useful applications of the method (cf. Section 4.5.3).
In order to converge to a stationary distribution, a Markov chain needs to
satisfy three important properties (Roberts, 1996; Tierney, 1996):
• irreducible, the chain must be able to reach all interesting parts of the
state space;
• aperiodic, the chain should not oscillate between different states;
• recurrent, all interesting parts can be reached infinitely often, at least from almost all starting points.
Do these properties hold for the MICE algorithm? Irreducibility is generally
not a problem since the user has considerable control over the interesting parts of the
state space. This flexibility is actually the main rationale for FCS instead of
a joint model.
Periodicity is a potential problem, and can arise in the situation where
imputation models are clearly inconsistent. A rather artificial example of oscillatory behavior occurs when Y1 is imputed by Y2β + ε1 and Y2 is imputed by −Y1β + ε2 for some fixed, nonzero β. The sampler will oscillate between
two qualitatively different states, so the correlation between Y1 and Y2 after
imputing Y1 will differ from that after imputing Y2 . In general, we would
like the statistical inferences to be independent of the stopping point. A way
to diagnose the ping-pong problem, or order effect, is to stop the chain at
different points. The stopping point should not affect the statistical inferences.
The addition of noise to create imputations is a safeguard against periodicity,
and allows the sampler to “break out” more easily.
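A quick way to check for such an order effect in practice (a sketch; the data and model are assumptions) is to stop the chains after different numbers of iterations and compare the pooled estimates:

library(mice)

# Sketch: pooled estimates for chains stopped at different points.
est <- sapply(c(5, 10, 20), function(k) {
  imp <- mice(nhanes, maxit = k, m = 5, print = FALSE, seed = 1)
  summary(pool(with(imp, lm(chl ~ bmi))))$estimate
})
est    # columns should be similar if the stopping point does not matter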
Non-recurrence may also be a potential difficulty, manifesting itself as ex-
plosive or non-stationary behavior. For example, if imputations are made by
deterministic functions, the Markov chain may lock up. Such cases can some-
times be diagnosed from the trace lines of the sampler. See Section 6.5.2 for
an example. As long as the parameters of imputation models are estimated
from the data, non-recurrence is mild or absent.
The required properties of the MCMC method can be translated into con-
ditions on the eigenvalues of the matrix of transition probabilities (MacKay,
2003, pp. 372–373). The development of practical tools that put these condi-
tions to work for multiple imputation is still an ongoing research problem.
4.5.3 Compatibility♠
Gibbs sampling is based on the idea that knowledge of the conditional
distributions is sufficient to determine a joint distribution, if it exists. Two
conditional densities p(Y1 |Y2 ) and p(Y2 |Y1 ) are said to be compatible if a
joint distribution p(Y1 , Y2 ) exists that has p(Y1 |Y2 ) and p(Y2 |Y1 ) as its condi-
tional densities. More precisely, the two conditional densities are compatible
if and only if their density ratio p(Y1 |Y2 )/p(Y2 |Y1 ) factorizes into the product
u(Y1 )v(Y2 ) for some integrable functions u and v (Besag, 1974). So, the joint
distribution either exists and is unique, or does not exist.
If the joint density itself is of genuine scientific interest, we should care-
fully evaluate the effect that imputations might have on the estimate of the
distribution. For example, incompatible conditionals could produce a ridge
(or spike) in an otherwise smooth density, and the location of the ridge may
actually depend on the stopping point. If such is the case, then we should have
a reason to favor a particular stopping point. Alternatively, we might try to
reformulate the imputation model so that the order effect disappears.
Arnold and Press (1989) and Arnold et al. (1999) provide necessary and
sufficient conditions for the existence of a joint distribution given two con-
ditional densities. Gelman and Speed (1993) concentrate on the question
whether an arbitrary mix of conditional and marginal distribution yields a
unique joint distribution. Arnold et al. (2002) describe near-compatibility in
discrete data.
Several papers are now available on the conditions under which impu-
tations created by conditionally specified models are draws from the implicit
joint distribution. According to Hughes et al. (2014), two conditions must hold
for this to occur. First, the conditionals must be compatible, and second the
margin must be noninformative. Suppose that p(φj ) is the prior distribution
of the set of parameters that relate Yj to Y−j , and that p(φ̃j ) is the prior distribution of the set of parameters that describes the relations among the Y−j . The noninformative margins condition states that the two sets of parameters are distinct (i.e., their joint parameter space is the product of their separate parameter spaces), and that their joint prior distribution factorizes as p(φj , φ̃j ) = p(φj )p(φ̃j ). Independence is a property of the prior
distributions, whereas distinctness is a property of the model. Hughes et al.
(2014) show that distinctness holds for the saturated multinomial distribu-
tion with a Dirichlet prior, so imputations from this joint distribution can be
achieved by a set of conditionally specified models. However, for the log-linear
model with only two-way factor interaction (and no higher-order terms) dis-
tinctness only holds for a maximum of three variables. The noninformative margins condition is sufficient, but not necessary. In most practical cases we are unable to show that the condition holds, but we
can stop the algorithms at different points and inspect the estimates for order
effect. Simulations by Hughes et al. (2014) show that such order effects exist,
but in general they are small. Liu et al. (2013) made the same division in the
the skewness of the residuals, or ideally, generate the imputations from the
underlying (but usually unknown) mechanism that generated the data.
The interesting point is that the last two papers have shifted the perspec-
tive from the user’s joint model to the data producer’s data generating model.
With incompatible models, the most important condition is the validity of
each conditional model. As long as the conditional models are able to replay
the missing data according to the mechanism that generated the data, we
might not be overly concerned with issues of compatibility.
In the majority of cases, scientific interest will focus on quantities that are
more remote from the joint density, such as regression weights, factor loadings,
prevalence estimates and so on. In such cases, the joint distribution is more
like a nuisance factor that has no intrinsic value.
Apart from potential feedback problems, incompatibility appears to be a relatively minor problem in practice, especially if the missing
data rate is modest and the imputation models fit the data well. In order to
evaluate these aspects, we need to inspect convergence and assess the fit of
the imputations.
The simulate() function code collects the correlations ρ(Y1 , Y2 ) per iter-
ation in the data frame s. Now call the function with
Figure 4.3: Correlation between Y1 and Y2 in the imputed data per iteration
in five independent runs of the MICE algorithm for six levels of missing data.
The true value is 0.7. The figure illustrates that convergence can be slow for
high percentages of missing data.
missing data rates. At the same time, observe that we really have to push the
MICE algorithm to its limits to see the effect. Over 99% of real data will have
lower correlations and lower missing data rates. Of course, it never hurts to do
a couple of extra iterations, but my experience is that good results can often
be obtained with a small number of iterations.
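In mice, convergence is commonly monitored with the trace plot of the chain means and standard deviations; a short sketch (the data choice is an assumption):

library(mice)

# Sketch: inspect convergence via the trace plot of a mids object.
imp <- mice(nhanes, m = 5, maxit = 20, print = FALSE, seed = 1)
plot(imp)   # chain means and standard deviations per variable and iteration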
4.5.8 Performance
Each conditional density has to be specified separately, so FCS requires
some modeling effort on the part of the user. Most software provides reasonable
defaults for standard situations, so the actual effort required may be small. A
number of simulation studies provide evidence that FCS generally yields esti-
mates that are unbiased and that possess appropriate coverage (Brand, 1999;
Raghunathan et al., 2001; Brand et al., 2003; Tang et al., 2005; Van Buuren
et al., 2006; Horton and Kleinman, 2007; Yu et al., 2007).
4.6.2 Comparisons
FCS cannot use computational shortcuts like the sweep operator, so the
calculations per iteration are more intensive than under JM. Also, JM has
better theoretical underpinnings.
On the other hand, FCS allows tremendous flexibility in creating multi-
variate models. One can easily specify models that are outside any known
standard multivariate density P (X, Y, R|θ). FCS can use specialized imputa-
tion methods that are difficult to formulate as a part of a multivariate density
P (X, Y, R|θ). Imputation methods that preserve unique features in the data,
e.g., bounds, skip patterns, interactions, bracketed responses and so on can be
incorporated. It is possible to maintain constraints between different variables
in order to avoid logical inconsistencies in the imputed data, something that would be difficult to achieve within a multivariate density P (X, Y, R|θ).
Lee and Carlin (2010) found that JM performs as well as FCS, even in
the presence of binary and ordinal variables. These authors also observed sub-
stantial improvements for skewed variables by transforming the variable to
symmetry (for JM) or by using predictive mean matching (for FCS). Kropko
et al. (2014) found that JM and FCS performed about equally well for continuous and binary variables, but FCS outperforms JM on every metric when
the variable of interest is categorical. With predictive mean matching, FCS
outperforms JM “for every metric and variable type, including the continuous
variable.” Seaman and Hughes (2018) compared FCS to a restricted general
location model. As expected, the latter model is more efficient when correctly
specified, but the gains are small unless the relations between the variables
are very strong. As FCS was found to be more robust under misspecification,
the authors advise FCS over JM.
4.6.3 Illustration
The Fourth Dutch Growth Study by Fredriks et al. (2000a) collected data
on 14500 Dutch children between 0 and 21 years. The development of sec-
ondary pubertal characteristics was measured by the so-called Tanner stages,
which divide the continuous process of maturation into discrete stages for the
ages between 8 and 21 years. Pubertal stages of boys are defined for genital
development (gen: five ordered stages G1–G5), pubic hair development (phb:
six ordered stages P1–P6) and testicular volume (tv: 1–25 ml).
We analyze the subsample of 424 boys in the age range 8–21 years using
the boys data in mice. There were 180 boys (42%) for which scores for genital
development were missing. The missingness was strongly related to age, rising
from about 20% at ages 9–11 years to 60% missing data at ages 17–20 years.
The data consist of three complete covariates: age (age), height (hgt)
and weight (wgt), and three incomplete outcomes measuring maturation. The
following code block creates m = 10 imputations by the normal model, by
predictive mean matching and by the proportional odds model.
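The following is a sketch of these three runs (the variable selection, age subsetting and seeds are assumptions, not the book's exact code); the ordered Tanner stages are treated as numeric for the normal and predictive mean matching runs.

library(mice)

# Sketch: three imputation runs for the pubertal development data.
select <- c("age", "hgt", "wgt", "gen", "phb", "tv")
data <- boys[boys$age >= 8, select]                # boys aged 8-21 years
data.num <- data.frame(lapply(data, as.numeric))   # stages coded 1-5 and 1-6

imp.norm <- mice(data.num, method = "norm", m = 10, print = FALSE, seed = 1)
imp.pmm  <- mice(data.num, method = "pmm",  m = 10, print = FALSE, seed = 1)
imp.polr <- mice(data,     m = 10, print = FALSE, seed = 1)  # default: polr for ordered factors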
Figure 4.4 plots the results of the first five imputations from the normal
model. It was created by the following statement:
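One way to produce such a display (an assumption, not the book's exact statement; it uses the imp.norm object from the sketch above) is to stack the imputations in long format and plot genital stage against age, conditional on the imputation number:

library(mice)
library(lattice)

# Sketch: panel 0 shows the observed data, panels 1-5 the first imputations.
long <- complete(imp.norm, action = "long", include = TRUE)
long <- long[long$.imp %in% 0:5, ]
xyplot(gen ~ age | factor(.imp), data = long,
       xlab = "Age", ylab = "Genital stage")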
Figure 4.4: Joint modeling: Imputed data for genital development (Tanner
stages G1–G5) under the multivariate normal model. The panels are labeled
by the imputation numbers 0–5, where 0 is the observed data and 1–5 are five
multiply imputed datasets.
The figure portrays how genital development depends on age for both the
observed and imputed data. The spread of the synthetic values in Figure 4.4 is
larger than the observed data range. The observed data are categorical while
the synthetic data vary continuously. Note that there are some negative values
in the imputations. If we are to do categorical data analysis on the imputed
data, we need some form of rounding to make the synthetic values comparable
with the observed values.
Imputations for the proportional odds model in Figure 4.5 differ markedly
from those in Figure 4.4. This model yields imputations that are categorical,
and hence no rounding is needed.
The complete-data model describes the probability of achieving each Tanner stage.
[Figure 4.5 (residue): imputed genital stages (G1–G5) against age under the proportional odds model, in the same layout as Figure 4.4. Figure 4.6 (residue): probability curves against age (years) in panels labeled FCS: predictive mean matching and FCS: proportional odds.]
some discrepancies for the older boys remain. The panel labeled FCS: propor-
tional odds displays the results after applying the method for ordered cate-
gorical data as discussed in Section 3.6. The imputed data essentially agree
with the complete-case analysis, perhaps apart from some minor deviations
around the probability level of 0.9.
Figure 4.6 shows clear differences between FCS and JM when data are
categorical. Although rounding may provide reasonable results in particular
datasets, it seems that it does more harm than good here. There are many ways
to round, rounding may require unrealistic assumptions and it will attenuate
correlations. Horton et al. (2003), Ake (2005) and Allison (2005) recommend
against rounding when data are categorical. See Section 4.4.3. Horton et al.
(2003) expected that bias problems of rounding would taper off if variables
have more than two categories, but the analysis in this section suggests that
JM may also be biased for categorical data with more than two categories.
Even though it may sound a bit trivial, my recommendation is: Impute cate-
gorical data by methods for categorical data.
4.8 Conclusion
Multivariate missing data lead to analytic problems caused by mutual de-
pendencies between incomplete variables. The missing data pattern provides
important information for the imputation model. The influx and outflux meas-
ures are useful to sift out variables that cannot contribute to the imputations.
For general missing data patterns, both JM and FCS approaches can be used
to impute multivariate missing data. JM is the model of choice if the data
conform to the modeling assumptions because it has better theoretical prop-
erties. The FCS approach is much more flexible, easier to understand and al-
lows for imputations close to the data. Automatic tile imputation algorithms
with simultaneous partitions of rows and columns of the data form a vast and
unexplored field.
4.9 Exercises
1. MAR. Repeat Exercise 3.1 for a multivariate missing data mechanism.
2. Convergence. Figure 4.3 shows that convergence can take longer for
very high amounts of missing data. This exercise studies an even more
extreme situation.
(a) The default argument ns of the simulate() function in Sec-
tion 4.5.7 defines six scenarios with different missing data patterns.
Define a 6 × 4 matrix ns2, where patterns R2 and R3 are replaced
by pattern R4 = (1, 0, 0). How many more missing values are there
in each scenario?
(b) For the new scenarios, do you expect convergence to be slower or
faster? Explain.
(c) Change the scenario in which all data in Y1 and Y2 are missing so
that there are 20 complete cases. Then run
slow2 <- simulate(ns = ns2, maxit = 50, seed = 62771)
5.1 Workflow
Figure 5.1 outlines the three main steps in any multiple imputation analy-
sis. In step 1, we create several (m) complete versions of the data by replacing
the missing values by plausible data values. The task of step 2 is to esti-
mate the parameters of scientific or commercial interest from each imputed
dataset. Step 3 involves pooling the m parameter estimates into one estimate, and obtaining an estimate of its variance. The results allow us to arrive at valid decisions from the data, accounting for the missing data and having the correct type I error rate.
[Figure 5.1: The main steps in multiple imputation: the incomplete data are imputed m times, each completed dataset is analyzed separately, and the m results are pooled into a single inference.]
The objects imp, fit and est have classes mids, mira and mipo, respec-
tively. See Table 5.1 for an overview. The classic workflow works because mice
contains a with() function that understands how to deal with a mids-object.
The classic mids workflow has been widely adopted, but there are more pos-
sibilities.
The magrittr package introduced the pipe operator to R. This operator
removes the need to save and reread objects, resulting in more compact and
more readable code:
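A sketch of the piped version of the classic workflow (the data and analysis model are assumptions):

library(mice)
library(magrittr)

# Sketch: impute, analyze and pool in one pipe.
est <- nhanes %>%
  mice(m = 5, print = FALSE, seed = 123) %>%
  with(lm(chl ~ age + bmi)) %>%
  pool()
summary(est)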
The with() function handles two tasks: to fill in the missing data and to
analyze the data. Splitting these over two separate functions provides the user with easier access to the imputed data, and hence is more flexible. The following code uses the complete() function to save the imputed data as a list of datasets (i.e., as an object with class mild), and then executes the analysis on each
dataset by the lapply() function.
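A sketch of this workflow (object names and the analysis model are assumptions):

library(mice)

# Sketch: impute, extract the list of completed datasets, analyze each with
# lapply(), and pool the results.
imp <- mice(nhanes, m = 5, print = FALSE, seed = 123)
idl <- complete(imp, action = "all")     # list of completed datasets (class mild)
fit <- lapply(idl, lm, formula = chl ~ age + bmi)
est <- pool(fit)
summary(est)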
RStudio has been highly successful with the introduction of the free and open
tidyverse ecosystem for data acquisition, organization, analysis, visualization
and reproducible research. The book by Wickham and Grolemund (2017)
provides an excellent introduction to data science using tidyverse. The mild
workflow can be written in tidyverse as
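For example, a sketch along these lines (the exact verbs and model are assumptions, not the book's code):

library(mice)
library(dplyr)
library(purrr)

# Sketch: a piped, tidyverse-flavoured version of the mild workflow.
est <- nhanes %>%
  mice(m = 5, print = FALSE, seed = 123) %>%
  mice::complete(action = "all") %>%      # explicit namespace to avoid masking
  map(~ lm(chl ~ age + bmi, data = .x)) %>%
  pool()
summary(est)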
Manipulating the imputed data is easy if we store the imputed data in long
format.
The long format can be processed by the dplyr::do() function into a list-
column and pooled, as follows:
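A sketch of this list-column approach (the imp object from the earlier sketch and the model are assumptions):

library(mice)
library(dplyr)

# Sketch: long-format imputations, one model per .imp group via do(), pooled.
long <- complete(imp, action = "long")       # stacked imputed datasets, .imp = 1..m
fit <- long %>%
  group_by(.imp) %>%
  do(model = lm(chl ~ age + bmi, data = .))
est <- pool(fit$model)
summary(est)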
These workflows yield identical estimates, but allow for different extensions.
This workflow is faster and easier than the methods in Section 5.1.1, since
there is no need to replicate the analyses m times. In the words of Dempster
and Rubin (1983), this workflow is
... seductive because it can lull the user into the pleasurable state
of believing that the data are complete after all.
The ensuing statistical analysis does not know which data are observed and
which are missing, and treats all data values as real, which will underestimate
the uncertainty of the parameters. The reported standard errors and p-values
after data-averaging are generally too low. The correlations between the vari-
ables of the averaged data will be too high. For example, the correlation matrix
of the averaged data is
cor(ave)
While the estimated regression coefficients are unbiased, we cannot trust the
standard errors, t-values and so on. An advantage of stacking over averaging
is that it is easier to analyze categorical data. Although stacking can be useful
in specific contexts, like variable selection, in general it is not recommended.
coef(fit$analyses[[2]])
Note that the estimates differ from each other because of the uncertainty
created by the missing data. Applying the standard pooling rules is done by
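(a sketch of the pooling step, where fit is assumed to be the mira object analyzed above)

library(mice)

# Apply Rubin's rules to the m repeated analyses and summarize.
est <- pool(fit)
summary(est)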
All the major software packages nowadays have ways to execute the m re-
peated analyses to the imputed data.
The assumption that the fraction of missing information is the same across
all variables and statistics is unlikely to hold in practice. However, Li et al.
(1991b) provide encouraging simulation results for situations where this as-
sumption is violated. Except for some extreme cases, the level of the procedure
was close to the nominal level, while the loss of power from such violations
was modest.
The work of Li et al. (1991b) is based on large samples. Reiter (2007) de-
veloped a small-sample version for the degrees of freedom using ideas similar
to Barnard and Rubin (1999). Reiter’s νf spans several lines of text, and is not
given here. A simulation study conducted by Reiter showed marked improve-
ments over the earlier formulation, especially in smaller samples. Simulation
work by Grund et al. (2016b) and Liu and Enders (2017) confirmed that for
small samples (say n < 50) νf is more conservative than ν1 , and produced
type I error rates closer to their nominal value. Raghunathan (2015) recently
provided an elegant alternative based on Equation 2.32 with νobs substituted
as νobs = (νcom + 1)νcom /{(νcom + 3)(1 + r̄)}. It is not yet known how this
correction compares to ν1 and νf .
The mice package implements the multivariate Wald test as the D1() func-
tion. Let us impute the nhanes2 data, and fit the linear regression of chl on
age and bmi.
library(mice)
imp <- mice(nhanes2, m = 10, print = FALSE, seed = 71242)
m2 <- with(imp, lm(chl ~ age + bmi))
pool(m2)
Class: mipo m = 10
estimate ubar b t dfcom df riv
(Intercept) -2.98 2896.71 1266.44 4289.79 21 11.28 0.481
age40-59 45.73 296.37 152.46 464.07 21 10.43 0.566
age60-99 65.62 342.71 229.04 594.66 21 9.08 0.735
bmi 6.40 3.26 1.41 4.81 21 11.32 0.477
lambda fmi
(Intercept) 0.325 0.419
age40-59 0.361 0.456
age60-99 0.424 0.519
bmi 0.323 0.418
We want to simplify the model by testing for age. Since age is a categorical
variable with three categories, removing it involves deleting two columns at
the same time, hence the univariate Wald test does not apply. The solution is
to fit the model without age, and run the multivariate Wald statistic to test
whether the model estimates are different.
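A sketch of this comparison (the object name m1 for the restricted fit matches its later use):

# Fit the restricted model and compare both models with the multivariate
# Wald test D1():
m1 <- with(imp, lm(chl ~ bmi))
D1(m2, m1)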
Models:
model formula
1 chl ~ age + bmi
2 chl ~ bmi
Comparisons:
test statistic df1 df2 df.com p.value riv
1 ~~ 2 4.23 2 14.4 21 0.036 0.63
Since the Wald test is significant, removing age from the model reduces its
predictive power.
D2(m2, m1)
In contrast to the previous analysis, observe that the D2 -statistic is not sig-
nificant at an α-level of 0.05. The reason is that the D2 test is less informed
by the data, and hence less powerful than the D1 test.
the average of the likelihood ratio tests across the m datasets, i.e.,
$$\hat{d} = m^{-1} \sum_{\ell} -2\left\{l(\hat{Q}_{0,\ell}) - l(\hat{Q}_{\ell})\right\} \qquad (5.10)$$
Then re-estimate the full and restricted models, with their model parameters fixed to Q̄ and Q̄0 , respectively, and average the corresponding likelihood ratio tests as
$$\bar{d} = m^{-1} \sum_{\ell} -2\left\{l(\bar{Q}_{0,\ell}) - l(\bar{Q}_{\ell})\right\} \qquad (5.11)$$
The test statistic is
$$D_3 = \frac{\bar{d}}{k(1 + r_3)} \qquad (5.12)$$
where
$$r_3 = \frac{m+1}{k(m-1)}\,(\hat{d} - \bar{d}) \qquad (5.13)$$
estimates the average relative increase in variance due to nonresponse. The
quantity r3 is asymptotically equivalent to r̄ from Equation 2.29. The p-value
for D3 is equal to
P3 = Pr[Fk,ν3 > D3 ] (5.14)
where ν3 = ν1 , or equal to Reiter’s νf correction for small samples.
The likelihood ratio test does not require normality. For complete data,
the likelihood ratio test is invariant to scale changes, which is the reason that
many prefer the likelihood ratio scale over the Wald test. However, Schafer
(1997, p. 118) observed that the invariance property is lost in multiple im-
putation because the averaging operations in Equations 5.10 and 5.11 may
yield somewhat different results under nonlinear transformations of l(Q). He
advised that the best results will be obtained if the distribution of Q is ap-
proximately normal. One may transform the parameters to achieve normality,
provided that appropriate care is taken to infer that the result is still within
the allowable parameter space.
Liu and Enders (2017) found in their simulations that D3 can become
negative, a nonsensical value, in some scenarios. They suggest that a value of
r3 > 10 or a 1000% increase in sampling variance due to missing data may
act as warning signals for this anomaly.
Routine use of the likelihood ratio statistic has long been hampered by
difficulties in calculating the likelihood ratio tests for the models with fixed
parameters Q̄ and Q̄0 . With the advent of the broom package (Robinson,
2017), the calculations have become feasible for a wide class of models. The
D3() function in mice can be used to calculate the likelihood ratio test. We
apply it to the data from previous examples by
D3(m2, m1)
5.3.4 D1 , D2 or D3 ?
If the estimates are approximately normal and if the software can pro-
duce the required variance-covariance matrices, we recommend using D1 with
an adjustment for small samples if n < 100. D1 is a direct extension of Ru-
bin’s rules to multi-parameter problems, theoretically convincing, mature and
widely applied. D1 is insensitive to the assumption of equal fractions of missing
information, is well calibrated, works well with small m (unless the fractions
of information are large and variable) and suffers only modest loss of power.
The relevant literature (Rubin, 1987a; Li et al., 1991b; Reiter, 2007; Grund
et al., 2016b; Liu and Enders, 2017) is quite consistent.
If only the test statistics are available for pooling, then the D2 -statistic is
a good option, provided that the number of imputations m > 20. The test
is easy to calculate and applies to different test statistics. For m < 20, the
power may be low. D2 tends to become optimistic for high fractions of missing
information (> 0.3), and this effect unfortunately increases with sample size
(Grund et al., 2016b). Thus, careless application of D2 to large datasets with
many missing values may yield high rates of false positives.
The likelihood ratio statistic D3 is theoretically sound. Calculation of D3
requires refitting the repeated analysis models with the estimates constrained
to their pooled values. This was once an issue, but probably less so in the
future. D3 is asymptotically equivalent to D1 , and may be preferred for the-
oretical reasons: it does not require normality in the complete-data model, it
is often more powerful and it may be more stable than D1 if k is large (as Ū need
not be inverted). Grund et al. (2016b), Liu and Enders (2017) and Eekhout
et al. (2017) found that D3 produces Type 1 error rates that were comparable
to D1 . D3 tends to be somewhat conservative in smaller samples, especially
with high fractions of missing information and with high k. Also, D3 has lower
statistical power in some of the extreme scenarios. For small samples, D1 has
a slight edge over D3 , so given the current available evidence D1 is the better
option for n < 200. In larger samples (n ≥ 200) D1 and D3 appear equally
good, so the choice between them is mostly a matter of convenience.
2. Stack . Stack the imputed datasets into a single dataset, assign a fixed
weight to each record and apply the usual variable selection methods.
5.4.2 Computation
The following steps illustrate the main steps involved in implementing a simple majority method to select variables in mice.
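A sketch of these steps (settings and seed are assumptions, not the book's exact code):

library(mice)

# Sketch: impute the boys data and run stepwise selection within each of
# the m completed datasets.
imp <- mice(boys, m = 10, print = FALSE, seed = 123)
fit <- with(imp, step(lm(tv ~ age + hgt + wgt + hc + gen + phb + reg),
                      trace = 0))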
This code imputes the boys data m = 10 times, and fits a stepwise linear model to predict tv (testicular volume) separately to each of the imputed datasets. The following code block counts how many times each variable was selected.
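(a sketch; it uses the fit object from the step above)

# Count how often each variable appears in the m selected models.
formulas <- lapply(fit$analyses, formula)
terms <- lapply(formulas, terms)
votes <- unlist(lapply(terms, labels))
table(votes)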
votes
age gen hc hgt phb reg wgt
10 9 1 6 9 10 1
The lapply() function is used three times. The first statement extracts the
model formulas fitted to the m imputed datasets. The second lapply() call
decomposes the model formulas into pieces, and the third call extracts the
names of the variables included in all m models. The table() function counts
the number of times that each variable appears in the 10 replications. Variables age,
gen and reg are always included, whereas hc was selected in only one of the
models. Since hgt appears in more than 50% of the models, we can use the
Wald test to determine whether it should be in the final model.
The significant difference (p = 0.029) between the models implies that phb
should be retained. We obtain similar results for the other three variables, so
the final model contains age, gen, reg and phb.
2. Bootstrap. 200 bootstrap samples are drawn from one singly imputed, completed dataset. Automatic backward selection is applied to each
dataset. Any differences found between the 200 fitted models are due
to sampling variation.
3. Nested bootstrap. The bootstrap method is applied on each of the multi-
ply imputed datasets. Automatic backward selection is applied to each
of the 100 × 200 datasets. Differences between the fitted models portray
both sampling and missing data uncertainty.
4. Nested imputation. The imputation method is applied on each of the
bootstrapped datasets.
1. Gordon (2014) presents a fully worked out example code that builds
upon the doParallel library, and that combines complete() and
ibind(). With some programming this example can be adapted to other
datasets.
2. The parlMICE() function is a wrapper around mice() that can divide
the imputations over multiple cores or CPUs. Schouten and Vink (2017)
show that substantial gains are already possible with three free cores,
especially for a combination of a large number of imputations m and a
large sample size n.
3. The par.mice() function in the micemd package (Audigier and Resche-
Rigon, 2018) takes the same arguments as the mice() function, plus two
extra arguments related to the parallel calculations. It also builds on the
parallel package.
The last two options are quite similar. Application of these methods is
especially beneficial for simulation studies, where the same model needs to be
replicated a large number of times. Support for multi-core processing is likely
to grow, so keep an eye on the Internet.
5.6 Conclusion
The statistical analysis of the multiply imputed data involves repeated analysis followed by parameter pooling. Rubin’s rules apply to a wide variety of quantities, especially if these quantities are transformed toward normality. Dedicated statistical tests and model selection techniques are now available.
Although many techniques for complete data now have their analogues for
incomplete data, the present state-of-the-art does not cover all. As multiple
imputation becomes more familiar and more routine, we will see new post-
imputation methodology that will be progressively more refined.
5.7 Exercises
Allison and Cicchetti (1976) investigated the interrelationship between
sleep, ecological and constitutional variables. They assessed these variables
for 39 mammalian species. The authors concluded that slow-wave sleep is
negatively associated with a factor related to body size. This suggests that
large amounts of this sleep phase are disadvantageous in large species. Also,
paradoxical sleep was associated with a factor related to predatory danger,
suggesting that large amounts of this sleep phase are disadvantageous in prey
species.
Allison and Cicchetti (1976) performed their analyses under complete-
case analysis. In this exercise we will recompute the regression equations for
slow wave (“nondreaming”) sleep (hrs/day) and paradoxical (“dreaming”)
sleep (hrs/day), as reported by the authors. Furthermore, we will evaluate the
imputations.
2. Imputation. The mammalsleep data are part of the mice package. Impute
the data with mice() under all the default settings. Recalculate the
regression equations (1) and (2) on the multiply imputed data.
3. Traces. Inspect the trace plot of the MICE algorithm. Does the algorithm
appear to converge?
Advanced techniques
Chapter 6
Imputation in practice
Ad hoc methods were designed to get past the missing data so that
at least some analyses could be done.
John W. Graham
2. The second choice refers to the form of the imputation model. The form
encompasses both the structural part and the assumed error distribu-
tion. In FCS the form needs to be specified for each incomplete column
in the data. The choice will be steered by the scale of the variable to
be imputed, and preferably incorporates knowledge about the relation
between the variables. Chapter 3 described many different methods for
creating univariate imputations.
3. A third choice concerns the set of variables to include as predictors in
the imputation model. The general advice is to include as many relevant
variables as possible, including their interactions (Collins et al., 2001).
This may, however, lead to unwieldy model specifications. Section 6.3 de-
scribes the facilities within the mice() function for setting the predictor
matrix.
4. The fourth choice is whether we should impute variables that are func-
tions of other (incomplete) variables. Many datasets contain derived
variables, sum scores, interaction variables, ratios and so on. It can be
useful to incorporate the transformed variables into the multiple im-
putation algorithm. Section 6.4 describes methods that we can use to
incorporate such additional knowledge about the data.
5. The fifth choice concerns the order in which variables should be imputed.
The visit sequence may affect the convergence of the algorithm and
the synchronization between derived variables. Section 6.5.1 discusses
relevant options.
6. The sixth choice concerns the setup of the starting imputations and the
number of iterations M . The convergence of the MICE algorithm can
be monitored in many ways. Section 6.5.2 outlines some techniques that
assist in this task.
7. The seventh choice is m, the number of multiply imputed datasets. Set-
ting m too low may result in large simulation error and statistical in-
efficiency, especially if the fraction of missing information is high. Sec-
tion 2.8 provided guidelines for setting m.
Please realize that these choices are always needed. Imputation software
needs to make default choices. These choices are intended to be useful across
a wide range of applications. However, the default choices are not necessarily
the best for the data at hand. There is simply no magical setting that always
works, so often some tailoring is needed. Section 6.6 highlights some diagnostic
tools that aid in determining the choices.
6.3.2 Predictors
A useful feature of the mice() function is the ability to specify the set
of predictors to be used for each incomplete variable. The basic specification
is made through the predictorMatrix argument, which is a square matrix
of size ncol(data) containing 0/1 data. Each row in predictorMatrix iden-
tifies which predictors are to be used for the variable in the row name. If
diagnostics=T (the default), then mice() returns a mids object containing
a predictorMatrix entry. For example, type
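(a sketch; the data choice is an assumption)

library(mice)

# A dry run (maxit = 0) suffices to inspect the default predictor matrix.
imp <- mice(nhanes, maxit = 0, print = FALSE)
imp$predictorMatrix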
negligible after the best, say, 15 variables have been included. For imputation
purposes, it is expedient to select a suitable subset of data that contains no
more than 15 to 25 variables. Van Buuren et al. (1999) provide the following strategy for selecting predictor variables from a large database (a quickpred() sketch follows the list):
1. Include all variables that appear in the complete-data model, i.e., the
model that will be applied to the data after imputation, including the
outcome (Little, 1992; Moons et al., 2006). Failure to do so may bias
the complete-data model, especially if the complete-data model contains
strong predictive relations. Note that this step is somewhat counter-
intuitive, as it may seem that imputation would artificially strengthen
the relations of the complete-data model, which would be clearly unde-
sirable. If done properly however, this is not the case. On the contrary,
not including the complete-data model variables will tend to bias the re-
sults toward zero. Note that interactions of scientific interest also need
to be included in the imputation model.
2. In addition, include the variables that are related to the nonresponse.
Factors that are known to have influenced the occurrence of missing data
(stratification, reasons for nonresponse) are to be included on substan-
tive grounds. Other variables of interest are those for which the distri-
butions differ between the response and nonresponse groups. These can
be found by inspecting their correlations with the response indicator of
the variable to be imputed. If the magnitude of this correlation exceeds
a certain level, then the variable should be included.
3. In addition, include variables that explain a considerable amount of
variance. Such predictors help reduce the uncertainty of the imputations.
They are basically identified by their correlation with the target variable.
4. Remove from the variables selected in steps 2 and 3 those variables
that have too many missing values within the subgroup of incomplete
cases. A simple indicator is the percentage of observed cases within this
subgroup, the percentage of usable cases (cf. Section 4.1.2).
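The quickpred() function in mice automates much of this strategy. A brief sketch (the thresholds and the small nhanes data are illustrative only):

# Select predictors whose correlation with the target or its response
# indicator exceeds mincor, and with at least minpuc usable cases.
pred <- quickpred(nhanes, mincor = 0.30, minpuc = 0.25, include = "age")
imp <- mice(nhanes, predictorMatrix = pred, print = FALSE, seed = 1)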
imp$loggedEvents
yields a warning that informs us that at initialization variable chl2 was re-
moved from the imputation model because it is collinear with chl. As a result,
chl will be imputed, but chl2 will not. It is possible to override this removal.
Figure 6.1: Scatterplot of chl2 against chl for m = 3. The observed data
are linearly related, but the imputed data do not respect the relationship.
Imputations such as those in Figure 6.1 thus produce combinations of data values that are impossible in the real world. Including
knowledge about derived data in the imputation model prevents imputations
from being inconsistent. Knowledge about the derived data can take many
forms, and includes data transformations, interactions, sum scores, recoded
versions, range restrictions, if-then relations and polynomials.
The easiest way to deal with the problem is to leave any derived data
outside the imputation process. For example, we may impute any missing
height and weight data, and append whr to the imputed data afterward. It is
simple to do that in mice by
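For instance, a hedged sketch (the column names, the seed and the use of complete() plus as.mids() for the reshaping are assumptions):

imp <- mice(data[, setdiff(names(data), "whr")], print = FALSE, seed = 1)
long <- mice::complete(imp, "long", include = TRUE)
long$whr <- 100 * long$wgt / long$hgt   # append the derived ratio afterwards
imp.itt <- as.mids(long)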
The approach is known as Impute, then transform (Von Hippel, 2009). While
whr will be consistent, the obvious problem with this approach is that whr
is not used by the imputation method, and hence biases the estimates of
parameters related to whr towards zero. Note the use of the as.mids function,
which transforms the long-format imputed data back into a mids object.
Another possibility is to create whr before imputation, and impute whr as
just another variable, known as JAV (White et al., 2011b), or under the name
Transform, then impute (Von Hippel, 2009). This is easy to do, as follows:
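For example (a hedged sketch; the seed is arbitrary):

data$whr <- 100 * data$wgt / data$hgt   # derive whr before imputation
imp.jav1 <- mice(data, print = FALSE, seed = 1)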
The warning results from the linear dependencies among the predictors, which
were introduced by adding whr. The mice() function checks for linear depen-
dencies during the iterations, and temporarily removes predictors from the
univariate imputation models where needed. Each removal action is docu-
mented in the loggedEvents component of the imp.jav1 object. The last
three removal events are
tail(imp.jav1$loggedEvents, 3)
which informs us that wgt was removed while imputing whr, and vice versa.
The I() operator in the meth definitions instructs R to interpret the argument
as literal. So I(100 * wgt / hgt) calculates whr by dividing wgt by hgt
(in meters). The imputed values for whr are thus derived from hgt and wgt
according to the stated transformation, and hence are consistent. Since whr
is the last column in the data, it is updated after wgt and hgt are imputed.
The changes to the default predictor matrix are needed to break any feedback
loop between whr and its components.
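A hedged sketch of the setup referred to above (the exact method and predictor settings of the original are not shown; data is assumed to contain hgt, wgt and whr):

meth <- make.method(data)
meth["whr"] <- "~ I(100 * wgt / hgt)"   # passive: recompute whr from imputed hgt and wgt
pred <- make.predictorMatrix(data)
pred[c("hgt", "wgt"), "whr"] <- 0       # break the feedback from whr to its components
imp.pas <- mice(data, method = meth, predictorMatrix = pred,
                print = FALSE, seed = 1)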
Table 6.2: Evaluation of parameter for whr with 25% MCAR missing in hgt
and 25% MCAR missing in wgt using four imputation strategies (nsim = 200).
Method Bias % Bias Coverage CI Width RMSE
Impute, transform -0.28 26.4 0.09 0.322 0.289
JAV -0.90 84.5 0.00 0.182 0.897
Passive imputation -0.28 26.8 0.06 0.328 0.293
smcfcs -0.03 2.6 0.90 0.334 0.094
Listwise deletion 0.01 0.7 0.90 0.307 0.094
The simulation creates 25% MCAR missing values in hgt and wgt, applies each
of the three methods 200 times using m = 5, and evaluates the parameter for whr.
Table 6.2 shows that all three methods have substantial negative biases.
Method JAV almost nullifies the parameter. The other two methods are bet-
ter, but still far from optimal. Actually, none of these methods can be recom-
mended for imputing a ratio.
Bartlett et al. (2015) proposed a novel rejection sampling method that
creates imputations that are congenial in the sense of Meng (1994) with the
substantive (complete-data) model. The method was applied to squared terms
and interactions, and here we investigate whether it extends to ratios. The
method has been implemented in the smcfcs package. The imputation method
requires a specification of the complete-data model, as arguments smtype and
smformula. An example of how to generate imputations, fit models, and pool
the results is:
library(smcfcs)
data <- pop
data[sample(nrow(data), size = 100), "wgt"] <- NA
data[sample(nrow(data), size = 100), "hgt"] <- NA
data$whr <- 100 * data$wgt / data$hgt
meth <- c("", "norm", "norm", "", "", "norm")
imps <- smcfcs(originaldata = data, meth = meth, smtype = "lm",
               smformula = "hc ~ age + hgt + wgt + whr")
fit <- lapply(imps$impDatasets, lm,
              formula = hc ~ age + hgt + wgt + whr)
summary(pool(fit))
The results of the simulations are also in Table 6.2 under the heading
smcfcs. The smcfcs method is far better than the three previous alternatives,
and almost as good as one could wish for. Rejection sampling for imputation is
still new and relatively unexplored, so this seems a promising area for further
work.
Figure 6.3: The relation between the interaction term wgt.hc (on the hori-
zontal axes) and its components wgt and hc (on the vertical axes).
Figure 6.3 illustrates that the scatterplots of the real and synthetic values
are similar. Furthermore, the imputations adhere to the stated recipe (wgt -
40) * (hc - 50). Interactions involving categorical variables can be done in
similar ways (Van Buuren and Groothuis-Oudshoorn, 2011), for example by
imputing the data in separate groups. One may do this in mice by splitting
the dataset into two or more parts, running mice() on each part and then
combining the imputed datasets with rbind().

Table 6.3: Evaluation of parameter for wgt.hc with 25% MCAR missing in hc
and 25% MCAR missing in wgt using four imputation strategies (nsim = 200).
Method Bias % Bias Coverage CI Width RMSE
Impute, transform 0.20 22.7 0.17 0.290 0.207
JAV 0.63 71.5 0.00 0.229 0.632
Passive imputation 0.20 22.6 0.17 0.283 0.207
smcfcs -0.01 0.8 0.92 0.306 0.083
Listwise deletion -0.01 0.8 0.91 0.237 0.076
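A minimal sketch of this split-and-recombine approach (the grouping variable sex is hypothetical):

imp1 <- mice(subset(data, sex == "M"), print = FALSE, seed = 1)
imp2 <- mice(subset(data, sex == "F"), print = FALSE, seed = 2)
imp <- rbind(imp1, imp2)   # rbind() has a method for mids objects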
Other methods for imputing interactions are JAV, Impute, then transform
and smcfcs. Table 6.3 contains the results of simulations similar to those in
Section 6.4.1, but adapted to include the interaction effect shown in Figure 6.3
by using the complete-data model lm(hgt ~ wgt + hc + wgt.hc). The re-
sults tell the same story as before, with smcfcs the best method, followed by
Passive imputation and Impute, then transform.
Von Hippel (2009) stated that JAV would give consistent results under
MAR, but Seaman et al. (2012) showed that consistency actually required
MCAR. It is interesting that Seaman et al. (2012) found that JAV gener-
ally performed better than passive imputation, which is not confirmed in our
simulations. It is not quite clear where the difference comes from, but the
discussion of JAV versus passive imputation pales somewhat in light of smcfcs.
Generic methods to preserve interactions include tree-based regression and
classification (Section 3.5) as well as various joint modeling methods (Sec-
tion 4.4). The relative strengths and limitations of these approaches still need
to be sorted out.
X = \alpha + Y\beta_1 + Y^2\beta_2 + \epsilon \qquad (6.1)

with ε ∼ N(0, σ²). We assume that X is complete, and that Y = (Y_obs, Y_mis) is
partially missing. The problem is to find imputations for Y such that estimates
of α, β₁, β₂ and σ² based on the imputed data are unbiased, while ensuring
that the quadratic relation between Y and Y² will hold in the imputed data.
Define the polynomial combination Z as Z = Yβ₁ + Y²β₂ for some β₁ and
β₂. The idea is to impute Z instead of Y and Y², followed by a decomposition
of the imputed data Z into components Y and Y². Imputing Z reduces the
multivariate imputation problem to a univariate problem, which is easier to
manage. Under the assumption that P(X, Z) is multivariate normal, we can
impute the missing part of Z by Algorithm 3.1. In cases where a normal
residual distribution is suspect, we replace the linear model by predictive
mean matching. The next step is to decompose Z into Y and Y². Under the
model in Equation 6.1 the value Y has two roots:
Y_{-} = -\frac{1}{2\beta_2}\left(\sqrt{4\beta_2 Z + \beta_1^2} + \beta_1\right) \qquad (6.2)

Y_{+} = \frac{1}{2\beta_2}\left(\sqrt{4\beta_2 Z + \beta_1^2} - \beta_1\right) \qquad (6.3)

where we assume that the discriminant 4β₂Z + β₁² is larger than zero. For a
given Z, we can take either Y = Y₋ or Y = Y₊, and square it to obtain Y².
Either root is consistent with Z = Yβ₁ + Y²β₂, but the choice between these
two options requires care. Suppose we choose Y₋ for all Z. Then all Y will
correspond to points located on the left arm of the parabolic function. The
minimum of the parabola is located at Y_min = −β₁/(2β₂), so all imputations
will occur on the left-hand side of the parabola. This is probably not intended.
The choice between the roots is made by random sampling. Let V be a
binary random variable defined as 1 if Y > Y_min, and as 0 if Y ≤ Y_min, and let
us model the probability P(V = 1) by logistic regression.
If only one additive term of a known total is missing, its value can be determined with certainty by deducting the known terms from the total. This is known
as deductive imputation (De Waal et al., 2011). If two additive terms are
missing, imputing one of these terms uses the available one degree of freedom,
and hence implicitly determines the other term. Data of this type are known
as compositional data, and they occur often in household and business sur-
veys. Imputation of compositional data has only recently received attention
(Tempelman, 2007; Hron et al., 2010; De Waal et al., 2011; Vink et al., 2015).
Hron et al. (2010) proposed matching on the Aitchison distance, a measure
specifically designed for compositional data. The method is available in R as
the robCompositions package (Templ et al., 2011a).
This section suggests a somewhat different method for imputing compo-
sitional data. Let Y123 = Y1 + Y2 + Y3 be the known total score of the three
variables Y1 , Y2 and Y3 . We assume that Y3 is complete and that Y1 and Y2
are jointly missing or observed. The problem is to create multiple imputations
in Y1 and Y2 such that the sum of Y1 , Y2 and Y3 equals a given total Y123 ,
and such that parameters estimated from the imputed data are unbiased and
have appropriate coverage.
Since Y3 is known, we write Y12 = Y123 − Y3 for the sum score Y1 + Y2 . The
key to the solution is to find appropriate values for the ratio P1 = Y1 /Y12 ,
or equivalently for (1 − P1 ) = Y2 /Y12 . Let P (P1 |Y1obs , Y2obs , Y3 , X) denote
the posterior distribution of P1 , which is possibly dependent on the observed
information. For each incomplete record, we make a random draw Ṗ1 from
this distribution, and calculate imputations for Y1 as Ẏ1 = Ṗ1 Y12 . Likewise,
imputations for Y2 are calculated by Ẏ2 = (1 − Ṗ1 )Y12 . It is easy to show that
Ẏ1 + Ẏ2 = Y12 , and hence Ẏ1 + Ẏ2 + Y3 = Y123 , as required.
How this posterior is best specified has yet to be
determined. In this section we apply standard predictive mean matching. We
study the properties of the method by a small simulation study. The first step
is to create an artificial dataset with known properties as follows:
set.seed(43112)
n <- 400
Y1 <- sample(1:10, size = n, replace = TRUE)
Y2 <- sample(1:20, size = n, replace = TRUE)
Y3 <- 10 + 2 * Y1 + 0.6 * Y2 + sample(-10:10, size = n,
replace = TRUE)
Y <- data.frame(Y1, Y2, Y3)
Y[1:100, 1:2] <- NA
md.pattern(Y, plot = FALSE)
Y3 Y1 Y2
300 1 1 1 0
100 1 0 0 2
0 100 100 200
Thus, Y is a 400 × 3 dataset with 300 complete records and with 100 records
in which both Y1 and Y2 are missing. Next, define three auxiliary variables
that are needed for imputation:
Y123 <- Y1 + Y2 + Y3
Y12 <- Y123 - Y[,3]
P1 <- Y[,1] / Y12
data <- data.frame(Y, Y123, Y12, P1)
where the naming of the variables corresponds to the total score Y123 , the sum
score Y12 and the ratio P1 .
The imputation model specifies how Y1 and Y2 depend on P1 and Y12 by
means of passive imputation. The predictor matrix specifies that only Y3 and
Y12 may be predictors of P1 in order to avoid linear dependencies.
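A hedged sketch of this setup (the original call is not shown; m and the seed are arbitrary):

meth <- make.method(data)
meth["Y1"] <- "~ I(P1 * Y12)"          # passive: Y1 as the product of P1 and Y12
meth["Y2"] <- "~ I((1 - P1) * Y12)"    # passive: Y2 as the complement
meth["P1"] <- "pmm"
pred <- make.predictorMatrix(data)
pred[, ] <- 0
pred["P1", c("Y12", "Y3")] <- 1        # only Y12 and Y3 predict P1
imp <- mice(data, method = meth, predictorMatrix = pred,
            m = 10, print = FALSE, seed = 23109)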
The code I(P1 * Y12) calculates Y1 as the product of P1 and Y12 , and so on.
The pooled estimates are calculated as
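The fitting step itself is not shown; presumably it regresses Y3 on Y1 and Y2 in each imputed dataset and pools the results, along these lines:

fit <- with(imp, lm(Y3 ~ Y1 + Y2))
summary(pool(fit))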
estimate std.error
(Intercept) 9.71 0.98
Y1 1.99 0.11
Y2 0.61 0.05
The estimates are reasonably close to their true values of 10, 2 and 0.6, re-
spectively. A small simulation study with these data using 100 simulations
and m = 10 revealed average estimates of 9.94 (coverage 0.96), 1.95 (coverage
0.95) and 0.63 (coverage 0.91). Though not perfect, the estimates are close to
the truth, while the data adhere to the summation rule.
Figure 6.4 shows where the solution might be further improved. The dis-
tribution of P1 in the observed data is strongly patterned. This pattern is only
partially reflected in the imputed Ṗ1 after predictive mean matching on both
Y12 and Y3 . It is possible to imitate the pattern perfectly by removing Y3 as
a predictor for P1 . However, this introduces bias in the parameter estimates.
Evidently, some sort of compromise between these two options might further
remove the remaining bias. This is an area for further research.
For a general missing data pattern, the procedure can be repeated for
all pairs (Yj , Yj 0 ) that have missing data. First create a consistent starting
imputation that adheres to the rule of composition, then apply the above
method to pairs (Yj , Yj 0 ) that belong to the composition. This algorithm is
a variation on the MICE algorithm with iterations occurring over pairs of
variables rather than separate variables.
Vink (2015) extended these ideas to nested compositional data, where
a given element of the composition is broken down into subelements. The
method, called predictive ratio matching, calculates the ratio of two compo-
nents, and then borrows imputations from donors that have a similar ratio.
Component pairs are visited in an ingenious way and combined into an itera-
tive algorithm.
Figure 6.4: P1 plotted against Y1 + Y2, in panels for matching on Y12 + Y3 and on Y12.
Figure 6.5: Frequency distribution of Ozone (ppb).
Compare Figure 6.5 to Figure 1.4. The negative ozone value of −18.8 has
now been replaced by a value of 1.
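Such a restriction is typically imposed through the post argument. A hedged sketch (the airquality data from Chapter 1 and the upper bound of 200 are assumptions):

post <- make.post(airquality)
post["Ozone"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(1, 200))"
imp <- mice(airquality, post = post, print = FALSE, seed = 1)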
The previous syntax of the post argument is a bit cumbersome. The same
result can be achieved by neater code:
Figure 6.6: Genital development of Dutch boys by age. The free solution
does not constrain the imputations, whereas the restricted solution requires
all imputations below the age of 8 years to be at the lowest category.
Some visit sequences are more efficient than others. In practice, there are small order effects of
the MICE algorithm, where the parameter estimates depend on the sequence
of the variables. To date, there is little evidence that this matters in prac-
tice, even for clearly incompatible imputation models (Van Buuren et al.,
2006). For monotone missing data, convergence is immediate if variables are
ordered according to their missing data rate. Rather than reordering the data,
it is more convenient to change the visit sequence of the algorithm by the
visitSequence argument. In its basic form, the visitSequence argument
is a vector of names, or a vector of integers in the range 1:ncol(data) of
arbitrary length, specifying the sequence of blocks (usually variables) for one
iteration of the algorithm. Any given block may be visited more than once
within the same iteration, which can be useful to ensure proper synchroniza-
tion among blocks of variables. Consider the mids object imp.int created in
Section 6.4.2. The visit sequence is
imp.int$visitSequence
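To impose a different order, a revised sequence can be passed to mice() through the visitSequence argument. A hedged sketch (the variable names and settings are assumptions):

vis <- c("hc", "wgt", "hgt", "wgt.hc")
imp.int2 <- mice(data, visitSequence = vis, print = FALSE, seed = 1)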
Figure 6.7: Mean and standard deviation of the synthetic values plotted
against iteration number for the imputed nhanes data.
6.5.2 Convergence
There is no clear-cut method for determining when the MICE algorithm
has converged. It is useful to plot one or more parameters against the iteration
number. The mean and variance of the imputations for each parallel stream
can be plotted by
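In mice this is the plot() method for mids objects, for example (imp being the mids object of the example, which is not shown here):

plot(imp)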
Figure 6.9 is the resulting plot for the same three variables. There is little
trend and the streams mingle well.
The default plot() function for mids objects plots the mean and variance
of the imputations. While these parameters are informative for the behavior
of the MICE algorithm, they may not always be the parameter of greatest
interest. It is easy to replace the mean and variance by other parameters, and
monitor these. Schafer (1997, pp. 129–131) suggested monitoring the “worst
linear function” of the model parameters, i.e., a combination of parameters
that will experience the most problematic convergence. If convergence can be
established for this parameter, then it is likely that convergence will also be
achieved for parameters that converge faster. Alternatively, we may monitor
some statistic of scientific interest, e.g., a correlation or a proportion. See
Sections 4.5.7 (Pearson correlation) and 9.4.3 (Kendall’s τ ) for examples.
Figure 6.9: Healthy convergence of the MICE algorithm for hgt, wgt and
bmi.
6.6 Diagnostics
Assessing the plausibility of the generated imputations is also important for
building trust in the imputation model. Diagnostics for statistical models
are procedures to find departures from the assumptions underlying the model.
Model evaluation is a huge field in which many special tools are available, e.g.,
Q-Q plots, residual and influence statistics, formal statistical tests, informa-
tion criteria and posterior predictive checks. In principle, all these techniques
can be applied to evaluate the imputation model. Conventional model evalua-
tion concentrates on the fit between the data and the model. In imputation it
is often more informative to focus on distributional discrepancy, the difference
between the observed and imputed data. The next section illustrates this with
an example.
Figure 6.10: Worm plot of the predictive mean matching imputations for
body weight. Different panels correspond to different age ranges. While the
imputation model does not fit the data in many age groups, the distributions
of the observed and imputed data often match up very well.
Differences between the densities of the observed and imputed data may sug-
gest a problem that needs to be further checked. The mice package con-
tains several graphic functions that can be used to gain insight into the
correspondence of the observed and imputed data: bwplot(), stripplot(),
densityplot() and xyplot().
The stripplot() function produces the individual points for numerical
variables per imputation, as in Figure 6.11.
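A call of the following form produces such a plot (hedged; the graphical parameters used in the original are not shown):

stripplot(imp)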
Figure 6.11: Stripplot of the observed and imputed data for age, bmi, hyp and chl against the imputation number (0 to 5).
The stripplot works well for a modest number of data points. For large datasets it is more appropriate to use the
function bwplot() that produces side-by-side box-and-whisker plots for the
observed and synthetic data.
The densityplot() function produces Figure 6.12,
which shows kernel density estimates of the imputed and observed data. In
this case, the distributions match up well.
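The call is presumably of this simple form (plotting options not shown):

densityplot(imp)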
Interpretation is more difficult if there are discrepancies. Such discrepan-
cies may be caused by a bad imputation model, by a missing data mechanism
that is not MCAR or by a combination of both. Bondarenko and Raghu-
nathan (2016) proposed a more refined diagnostic tool that aims to compare
the distributions of observed and imputed data conditional on the missingness
probability. The idea is that under MAR the conditional distributions should
be similar if the assumed model for creating multiple imputations has a good
fit. An example is created as:
20 30 40 0.5 1.0 1.5 2.0 2.5 100 150 200 250 300
Figure 6.12: Kernel density estimates for the marginal distributions of the
observed data (blue) and the m = 5 densities per variable calculated from the
imputed data (thin red lines).
# The first two lines are a hedged reconstruction: model the probability that
# a record is incomplete from all variables, within each imputed dataset.
fit <- with(imp, glm(ici(imp) ~ age + bmi + hyp + chl, family = binomial))
ps <- rep(rowMeans(sapply(fit$analyses, fitted.values)), imp$m + 1)
xyplot(imp, bmi ~ ps | as.factor(.imp),
       xlab = "Probability that record is incomplete",
       ylab = "BMI", pch = c(1, 19))
These statements first model the probability of each record being incom-
plete as a function of all variables in each imputed dataset. The probabilities
(propensities) are then averaged over the imputed datasets to obtain stability.
Figure 6.13 plots BMI against the propensity score in each dataset. Observe
that the imputed data points are somewhat shifted to the right. In this case,
the distributions of the blue and red points are quite similar, as expected
under MAR.
Realize that the comparison is only as good as the propensity score is. If
important predictors are omitted from the response model, then we may not
be able to see the potential misfit. In addition, it could be useful to investigate
the residuals of the regression of BMI on the propensity score. See Van Buuren
and Groothuis-Oudshoorn (2011) on a technique for how to calculate and plot
the relevant quantities.
Compared to conventional diagnostic methods, imputation comes with the
advantage that we can directly compare the observed and imputed values. The
marginal distributions of the observed and imputed data may differ because
the missing data are MAR or MNAR. The diagnostics tell us in what way
they differ, and hopefully also suggest whether these differences are expected
and sensible in light of what we know about the data. Under MAR, any
distributions that are conditional on the missing data process should be the
same. If our diagnostics suggest otherwise (e.g., the blue and red points are
very different), there might be something wrong with the imputations that we
Figure 6.13: BMI against missingness probability for observed and imputed
values.
created. Alternatively, it could be the case that the observed differences are
justified, and that the missing data process is MNAR. The art of imputation
is to distinguish between these two explanations.
6.7 Conclusion
Multiple imputation is not a quick automatic fix. Creating good imputa-
tions requires substantive knowledge about the data paired with some healthy
statistical judgement. Impute close to the data. Real data are richer and more
complex than the statistical models applied to them. Ideally, the imputed val-
ues should look like real data in every respect, especially if multiple models
are to be fit on the imputed data. Keep the following points in mind:
• Use MAR as a starting point using the strategy outlined in Section 6.2.
• Choose the imputation methods and set the predictors using the strate-
gies outlined in Section 6.3.
• If the data contain derived variables that are not needed for imputation,
impute the originals and calculate the derived variables afterward.
• Use passive imputation if you need the derived variables during impu-
tation. Carefully specify the predictor matrix to avoid feedback loops.
See Section 6.4.
• Monitor convergence of the MICE algorithm for aberrant patterns, es-
pecially if the rate of missing data is high or if there are dependencies
in the data. See Sections 4.5.6 and 6.5.2.
• Make liberal use of diagnostic graphs to compare the observed and the
imputed data. Convince yourself that the imputed values could have
been real data, had they not been missing. See Section 6.6.
6.8 Exercises
1. Worm plot for normal model. Repeat the imputations in Section 6.6.1
using the linear normal model for the numerical variables. Draw the
worm plot.
• Does the imputation model for wgt fit the observed data? If not,
describe in which aspects they differ.
• Does the imputation model for wgt fit the imputed data? If not,
describe in which aspects they differ.
• Are there striking differences between your worm plot and Fig-
ure 6.10? If so, describe.
• Which imputation model do you prefer? Why?
2. Defaults. Select a real dataset that is familiar to you and that contains
at least 20% missing data. Impute the data with mice() under all the
default settings.
• Inspect the streams of the MICE algorithm. Does the sampler ap-
pear to converge?
• Extend the analysis with 20 extra iterations using mice.mids().
Does this affect your conclusion about convergence?
• Inspect the data with diagnostic plots for univariate data. Are the
univariate distributions of the observed and imputed data similar?
Do you have an explanation why they do (or do not) differ?
• Inspect the data with diagnostic plots for the most interesting bi-
variate relations. Are the relations similar in the observed and im-
puted data? Do you have an explanation why they do (or do not)
differ?
• Consider each of the seven default choices in turn. Do you think
the default is appropriate for your data? Explain why.
• Do you have particular suggestions for improved choices? Which?
• Implement one of your suggestions. Do the results now look more
plausible or realistic? Explain.
Chapter 7
Multilevel multiple imputation
7.1 Introduction
Multiple imputation of multilevel data is one of the hot spots in statistical
technology. Imputers and analysts now have a bewildering array of options
for imputing missing values in multilevel data. This chapter summarizes the
state of the art, and formulates advice and guidelines for practical application
of multilevel imputation.
The structure of this chapter is as follows. We start with a concise overview
of three ways to formulate the multilevel model. Section 7.3 reviews several
non-imputation approaches for dealing with missing values in multilevel data.
Sections 7.4 and 7.5 describe imputation using the joint modeling and fully
conditional specification frameworks. Sections 7.6 and 7.7 review current pro-
cedures for imputation under multilevel models with continuous and discrete
outcomes, respectively. Section 7.8 deals with missing data in the level-2 pre-
dictors, and Section 7.9 summarizes comparative work on the different ap-
proaches. Section 7.10 contains worked examples that illustrate how impu-
tations can be generated in mice, provides guidelines on the practical appli-
cation, written in the form of recipes for multilevel imputation. The chapter
closes with an overview of unresolved issues and topics for further research.
In educational studies, for example, pupils are nested within classes: some data may relate to pupils (e.g., test scores), whereas other data concern the class
level (e.g., class size). Another example arises in longitudinal studies, where
the individual’s responses over time are nested within individuals. Some of the
data vary with time (e.g., disease history), whereas other data vary between in-
dividuals (e.g., sex). The term multilevel analysis refers to the methodology to
analyze data with such multilevel structure, a methodology that can be traced
back to the definition of the intra-class correlation (ICC) by Fisher (1925).
Multilevel analysis is quite different from methods for single-level data. The
analysis of multilevel data is a vast topic, and this is not the place to cover
the model in detail. There are excellent introductions by Raudenbush and
Bryk (2002), Gelman and Hill (2007), Snijders and Bosker (2012), Fitzmau-
rice et al. (2011), and Hox et al. (2018). This chapter assumes basic familiarity
with these models.
A challenging aspect of multilevel analysis is the existence of a variety of
notational systems and concepts. This section describes three different nota-
tions. In order to illustrate these notations, we use data on school performance
of grade 8 pupils in Dutch schools. These data were collected by Brandsma
and Knuver (1989), and were used as the primary examples in Chapters 4 and
5 of Snijders and Bosker (2012). The data are available as the brandsma object
in the mice package. The data contain a mix of both pupil-level measurements
and school-level measurements.
library(mice)
data("brandsma", package = "mice")
d <- brandsma[, c("sch", "lpo", "sex", "den")]
The scientific interest is to create a model for predicting the outcome lpo
from the level-1 predictor sex (coded as 0-1) and the level-2 predictor den
(which takes values 1-4). Let the data be divided into C clusters (e.g., classes,
schools), indexed by c (c = 1, . . . , C). Each cluster holds nc units, indexed
by i = 1, . . . , nc . There are three ways to write the same model (Scott et al.,
2013).
In level notation, introduced by Bryk and Raudenbush (1992), we formu-
late a multilevel model as a system of three equations, one at level-1 and two
at level-2:
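In this notation the model reads (reconstructed from the description below; the number of the third equation is a guess):

\text{lpo}_{ic} = \beta_{0c} + \beta_{1c}\,\text{sex}_{ic} + \epsilon_{ic} \qquad (7.1)
\beta_{0c} = \gamma_{00} + \gamma_{01}\,\text{den}_{c} + u_{0c} \qquad (7.2)
\beta_{1c} = \gamma_{10} \qquad (7.3)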
where lpoic is the test score of pupil i in school c, where sexic is the sex of
pupil i in school c, and where denc is the religious denomination of school
c. Note that here the subscripts distinguish the level-1 and level-2 variables.
In this notation, sexic only appears in the level-1 model 7.1, and denc only
appears in the level-2 model 7.2. The term β0c is a random intercept that
varies by cluster, while β1c is a sex effect that is assumed to be the same across
schools. The term ε_ic is the within-cluster random residual at the pupil level
with a normal distribution ε_ic ∼ N(0, σ²_ε). The first level-2 model describes
the variation in the mean test score between schools as a function of the grand
mean γ_00, a school-level effect γ_01 of denomination and a school-level random
residual with a normal distribution u_0c ∼ N(0, σ²_u0). The second level-2 model
does not have a random residual, so this specifies that β_1c is a fixed effect equal
in value to γ_10. The unknowns to be estimated are the fixed parameters γ_00,
γ_01 and γ_10, and the variance components σ²_ε and σ²_u0.
We may write the same model as a single predictive equation by substi-
tuting the level-2 models into the level-1 model:
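(reconstructed by substituting models 7.2 and 7.3 into 7.1)

\text{lpo}_{ic} = \gamma_{00} + \gamma_{01}\,\text{den}_{c} + \gamma_{10}\,\text{sex}_{ic} + u_{0c} + \epsilon_{ic}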
We do not need the double subscripts any more, so we write the model in
composite notation as
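(reconstructed; compare the conditional model 7.8 below)

\text{lpo}_{ic} = \beta_{0} + \beta_{1}\,\text{den}_{c} + \beta_{2}\,\text{sex}_{ic} + u_{0c} + \epsilon_{ic}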
Note that these β’s are fixed effects and the β’s in the level-1 model 7.1 are
random effects. They differ by the number of subscripts.
The same model written in matrix notation is widely known as the linear
mixed effects model (Laird and Ware, 1982) and can be written as
y_c = X_c\beta + Z_c u_c + \epsilon_c \qquad (7.6)
can be used to impute class membership. See Hill and Goldstein (1998) and
Goldstein (2011b) for models to handle missing class identification.
4. The variation of the random slopes can be large, so the method used to
deal with the missing data should account for this.
5. The error variance σ² may differ across clusters (heteroscedasticity).
Increasing the cluster size hardly aids in reducing this bias. In addition, the regression
weights for the fixed effects will be biased. Grund et al. (2018b) conclude
that single-level imputation should be avoided unless only a few cases contain
missing data (e.g., less than 5%) and the intra-class correlation is low (e.g.,
less than .10). Conducting multiple imputation with the wrong model (e.g.,
single-level methods) can be more hazardous than listwise deletion.
Another ad-hoc technique is to add a dummy variable for each cluster, so
that the model estimates a separate coefficient for each cluster. The coeffi-
cients are estimated by ordinary least squares, and the parameters are drawn
from their posteriors. If the missing values are restricted to the outcome, this
method will estimate the fixed effects quite well, but also artificially inflates
the true variation between groups, and thus biases the ICC upwards (An-
dridge, 2011; Van Buuren, 2011; Graham, 2012). If there are also missing
values in the predictors, the level-1 regression weights will be unbiased, but
the level-2 weights are biased, in particular for small clusters and low ICC. See
Lüdtke et al. (2017) for more detail, who also derive the asymptotic bias. If
the primary interest is on the fixed effects, adding a cluster dummy is an eas-
ily implementable alternative, unless the missing rate is very large and/or the
intra-class correlation is very low and the number of records in the cluster is
small (Drechsler, 2015; Lüdtke et al., 2017). Since the bias in random slopes
and variance components can be substantial, one should turn to multilevel
imputation to obtain proper estimates of those parts of the multilevel model
(Speidel et al., 2017).
Vink et al. (2015) described an application of Australian school data with
over 2.8 million records, where a dummy variable per school was combined
with predictive mean matching. Given the size and complexity of the impu-
tation problem, this application would have been computationally infeasible
with full multilevel imputation. Thus, for large databases, adding a dummy
variable per cluster is a practical and useful technique for estimating the fixed
effects.
sizes.” Of course, this statement holds only insofar as the assumptions of the
model are met: the data are missing at random and the model is correctly
specified.
Mixed-effects models can be fit with maximum-likelihood methods, which
take care of missing data in the dependent variable. This principle can be
extended to address missing data in explanatory variables in (multilevel) soft-
ware for structural equation modeling like Mplus (Muthén et al., 2016) and
gllamm (Rabe-Hesketh et al., 2002). Grund et al. (2018b) remarked that such
extensions could alter the meaning and value of the parameters of interest.
This is natural when the complete-data analysis focuses on different within- and between-
cluster associations (Enders et al., 2016; Grund et al., 2016a). Multilevel im-
putation is not without problems. Except for jomo, most models assume a
homoscedastic error structure in the level-1 residuals, which implies no ran-
dom slope variation between Yic (Carpenter and Kenward, 2013; Enders et al.,
2016). Imputations created by jomo reflect pairwise linear relationships in the
data and ignore higher-order interactions and non-linearities. Joint modeling
may also experience difficulties with smaller samples and the default inverse
Wishart prior (Grund et al., 2018b; Audigier et al., 2018).
Imputation models can also be formulated as latent class models (Vermunt
et al., 2008; Vidotto et al., 2015). Vidotto (2018) proposed a Bayesian multi-
level latent class model that is designed to capture heterogeneity in the data
at both levels through local independence and conditional independence as-
sumptions. This class of models is quite flexible. As the method is very recent,
there is not yet much practical experience.
\dot{\text{lpo}}_{ic} \sim N(\beta_0 + \beta_1\,\text{den}_{c} + \beta_2\,\text{sex}_{ic} + u_{0c},\ \sigma^2) \qquad (7.8)
\dot{\text{sex}}_{ic} \sim N(\beta_0 + \beta_1\,\text{den}_{c} + \beta_2\,\text{lpo}_{ic} + u_{0c},\ \sigma^2) \qquad (7.9)
where all parameters are re-estimated at every iteration. Since the first equa-
tion corresponds to the complete-data model, there are no issues with this
step. The second equation simply alternates the roles of lpo and sex, and
uses the inverted mixed model to draw imputations. The above steps illus-
trate the key idea of multilevel imputation using FCS. It is not yet clear when
and how the idea will work.
Resche-Rigon and White (2018) studied the consequences of model inver-
sion, and found that the conditional expectation of the level-1 predictor in a
multivariate multilevel model with random intercepts depends on the cluster
mean of the predictor, and on the size of the cluster. In addition, the con-
ditional variance depends on cluster size. These results hold for the random
intercept model. Of course, including random slopes as well will only compli-
cate matters. The naive FCS procedure in Equation 7.8 does not account for
the cluster means or for the effects of cluster size, and hence might not provide
good imputations. From their derivation, Resche-Rigon and White (2018)
therefore hypothesized that the imputation model (1) should incorporate the
cluster means, and (2) be heteroscedastic if cluster sizes vary. We now discuss
these points in turn.
the imputation model and the substantive model. Improving congeniality had
a major effect, which is in line with the larger multiple imputation literature.
Section 4.5.4 explains the confusion surrounding the term compatibility in
some detail.
A problematic aspect of including cluster means is that the contextual
variable may be an unreliable estimate in small clusters. It is known that
the regression weight of the contextual variable is then biased (Lüdtke et al.,
2008). A solution is to formulate the contextual variable as a latent variable,
and use an estimator that essentially shrinks the weight towards zero. Most
joint modeling approaches assume a multivariate mixed-effects model, where
cluster means are latent.
It is not yet clear when the manifest cluster means can be regarded as
“correct” in an FCS context. When clusters are large and of similar size,
the manifest cluster means are likely to be valid and have little differential
shrinkage. For smaller clusters or clusters of unequal size, including the cluster
means in the imputation model also seems valid because proper imputation
techniques will use draws from the posterior distribution of the group means
rather than using the manifest means themselves. All in all, it appears prefer-
able to include the cluster means into the imputation model.
7.6.2 Methods
The mice, miceadds, micemd and mitml packages contain useful func-
tions for multilevel imputation. The mice package implements two methods,
2l.lmer and 2l.pan. Method 2l.lmer (Jolani, 2018) imputes both sporad-
ically and systematically missing values. Under the appropriate model, the
method is randomization-valid for the fixed effects, but the variance compo-
nents were more difficult to estimate, especially for a small number of clusters.
Method 2l.pan uses the PAN method (Schafer and Yucel, 2002). Method
2l.continuous from miceadds is similar to 2l.lmer with some different op-
tions. The 2l.jomo method from micemd is similar to 2l.pan, but uses the
jomo package as the computational engine. Method 2l.glm.norm is similar to
2l.continuous and 2l.lmer.
Two functions for heteroscedastic errors are available. A method named
2l.2stage.norm from micemd implements the two-stage method by Resche-
Rigon and White (2018). The 2l.norm method from mice implements the
Gibbs sampler from Kasim and Raudenbush (1998). Method 2l.norm can
recover the intra-class correlation quite well, even for severe MAR cases and
high amounts of missing data in the outcome or the predictor. However, it is
fairly slow and fails to achieve nominal coverage for the fixed effects for small
classes (Van Buuren, 2011).
The 2l.pmm method in the miceadds package is a generalization of the
default pmm method to data with two levels, using a linear mixed model fitted
by lmer() or blmer(). Method 2l.2stage.pmm generalizes pmm by a two-stage
method. The default in both methods is to obtain donors across all clusters,
which is probably fine for most applications.
Table 7.2 presents an overview of R functions for univariate imputations
according to a multilevel model for continuous outcomes. Each row represents
a function. The functions belong to different packages, and there is overlap in
functionality. All functions can be called from mice() as building blocks to
form an iterative FCS algorithm.
7.6.3 Example
We use the brandsma data introduced in Section 7.2. Here we will analyze
the full set of 4016 pupils. Apart from Chapter 9, Snijders and Bosker (2012)
concentrated on the analysis of a reduced set of 3758 pupils. In order to keep
things simple, this section restricts the analysis to just two variables.
The cluster variable is sch. The variable lpo is the pupil’s test score at grade
8. The cluster variable is complete, but lpo has 204 missing values.
sch lpo
3902 1 1 0
204 1 0 1
0 204 204
How do we impute the 204 missing values? Let’s apply the following five
methods:
1. sample: Random draws from the observed values of lpo, ignoring any predictor information;
2. pmm: Predictive mean matching, the default single-level method;
3. 2l.pan: Multilevel method using the linear mixed model to draw uni-
variate imputations;
4. 2l.norm: Multilevel method using the linear mixed model with hetero-
geneous error variances;
5. 2l.pmm: Predictive mean matching based on predictions from the linear
mixed model, with random draws from the regression coefficients and
the random effects, using five donors.
The following code block will impute the data according to these five methods.
Figure 7.1: Frequency distribution of SD(language score) per school.
library(miceadds)
methods <- c("sample", "pmm", "2l.pan", "2l.norm", "2l.pmm")
result <- vector("list", length(methods))
names(result) <- methods
for (meth in methods) {
  d <- brandsma[, c("sch", "lpo")]
  pred <- make.predictorMatrix(d)
  pred["lpo", "sch"] <- -2
  result[[meth]] <- mice(d, pred = pred, meth = meth,
                         m = 10, maxit = 1,
                         print = FALSE, seed = 82828)
}
The code -2 in the predictor matrix pred signals that sch is the cluster
variable. There is only one variable with missing values here, so we do not
need to iterate, and can set maxit = 1. The miceadds library is needed for
the 2l.pmm method.
The 2l.pan and 2l.norm methods are the oldest multilevel methods.
Method 2l.pan is very fast, while method 2l.norm is more flexible since
the within-cluster error variances may differ. To see which of these methods
should be preferred for these data, let us study the distribution of the stan-
dard deviation of lpo by schools. Figure 7.1 shows that the standard deviation
per school varies between 4 and 16, a fairly large spread. This suggests that
2l.norm might be preferred here.
Figure 7.2 shows the box plot of the observed data (in blue) and the
imputed data (in red) under each of the methods. Box plots are drawn for
schools with zero missing values, one missing value, two or three missing values
and more than three missing values.
Figure 7.2: Box plots comparing the distribution of the observed data (blue),
and the imputed data (red) under five methods, split according to the number
of missing values per school.
Figure 7.3: Density plots for schools with one and with two or three missing
values for 2l.pan (top) and 2l.pmm (bottom).
Pupils in schools with one to three missing values have lower scores than pupils from a school with complete data. Pupils
from schools with more than three missing values score similar to pupils from
schools with complete data. It is interesting to study how well the different
imputation methods preserve this feature in the data.
Method sample does not use any school information, and hence the impu-
tations in all schools look alike. Methods pmm, 2l.pan, 2l.norm and 2l.pmm
preserve the pattern, though the differences are less pronounced than in the
observed data. Note that the distribution of the two normal methods (2l.pan
and 2l.norm) have tails that extend beyond the range of the observed data
(the maximum is 58). Hence, complete-data estimators based on the tails (e.g.,
finding the Top 10 Dutch schools) can be distorted by this use of the normal
imputation.
Figure 7.3 shows the density plot of the 10 sets of imputed values (red)
compared with the density plot of the observed values (blue). The top row
corresponds to the 2l.pan method, and shows that some parts of the blue
curve are not well represented by the imputed values. The method at the
bottom row (2l.pmm) tracks the observed data distribution a little better.
Most research to date has concentrated on multilevel imputation using
the normal model. In reality, normality is always an approximation, and it
depends on the substantive question of how good this approximation should
be. Two-level predictive mean matching is a promising alternative that can
impute close to the data.
7.7.1 Methods
The generalized linear mixed model (GLMM) extends the mixed model for
continuous data with link functions. For example, we can draw imputations
for clustered binary data by positing a logit link with a binomial distribution.
As before, all parameters need to be drawn from their respective posteriors in
order to account for the sampling variation.
Jolani et al. (2015) developed a multilevel imputation method for binary
data that obtains estimates of the model parameters with the lme4::glmer()
function (Bates et al., 2015), followed by a sequence of random draws from
the parameter distributions. For meta-analysis of individual
participant data, this method outperforms simpler methods that ignore the
clustering, that assume MCAR or that split the data by cluster (Jolani et al.,
2015). The method is available as method 2l.bin in mice. The miceadds
package (Robitzsch et al., 2017) contains a method 2l.binary that allows
the user to choose between likelihood estimation with lme4::glmer() and
penalized ML with blme::bglmer() (Chung et al., 2013). Related methods
are available under the sequential hierarchical regression imputation (SHRIMP)
framework (Yucel, 2017).
Resche-Rigon and White (2018) proposed a two-stage estimator. At step
1, a linear regression model is fitted to each observed cluster. Any sporadically
missing data are imputed, and the model per cluster ignores any systematically
missing variables. At step 2, estimates obtained from each cluster are combined
using meta-analysis. Systematically missing variables are modeled through
a linear random effect model across clusters. A method for binary data is
available as the method 2l.2stage.bin in the micemd package. The two-stage
Count
Package    Function          Description
micemd     2l.2stage.pois    Poisson, mvmeta
micemd     2l.glm.pois       Poisson, glmer
countimp   2l.poisson        Poisson, glmmPQL
countimp   2l.nb2            negative binomial, glmmadmb
countimp   2l.zihnb          zero-infl neg bin, glmmadmb
7.7.2 Example
The toenail data were collected in a randomized parallel group trial com-
paring two treatments for a common toenail infection. A total of 294 patients
were seen at seven visits, and severity of infection was dichotomized as “not
severe” (0) and “severe” (1). The version of the data in the DPpackage is all
numeric and easy to analyze. The following statements load the data, and
expand the data to the full design with 7 × 294 = 2058 rows. There are in
total 150 missed visits.
library(tidyr)
data("toenail", package = "DPpackage")
data <- tidyr::complete(toenail, ID, visit) %>%
  tidyr::fill(treatment) %>%
  dplyr::select(-month)
table(data$outcome, useNA = "always")
0 1 <NA>
1500 408 150
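The imputation step itself is not shown here. A hedged sketch follows (the choice of 2l.bin, the predictor coding and the seed are assumptions), after which a completed dataset no longer has missing outcomes, as the second table shows:

pred <- make.predictorMatrix(data)
pred["outcome", "ID"] <- -2            # ID is the cluster variable
imp <- mice(data, method = "2l.bin", predictorMatrix = pred,
            m = 5, maxit = 1, print = FALSE, seed = 12102)
table(mice::complete(imp)$outcome, useNA = "always")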
0 1 <NA>
1635 423 0
Figure 7.4 visualizes the imputations. The plot shows the partially imputed
profiles of 16 subjects in the toenail data. The general downward trend in
the probability of infection severity with time is obvious, and was also found
by Molenberghs and Verbeke (2005, p. 302). Subjects 9 (never severe) and
117 (always severe) have both complete data. They represent the extremes,
and their random effect estimates are very similar in all five imputed datasets.
They are close, but not identical — as you might have expected — because the
multiple imputations will affect the random effects also for the fully observed
subjects. Subjects 31, 41 and 309 are imputed such that their outcomes are
equivalent to subject 9, and hence have similar random effect estimates. In
contrast, subject 214 has the same observed data pattern as 31, but it is
sometimes imputed as “severe”. As a consequence, we see that there are now
two random effect estimates for this subject that are quite different, which
reflects the uncertainty due to the missing data. Subjects 48 and 99 even
have three clearly different estimates. Imputation number 3 is colored green
instead of grey, so the isolated lines in subjects 48 and 230 come from the
same imputed dataset.
The complete-data model is a generalized linear mixed model for outcome
given treatment status, time and a random intercept. This is similar to the
models used by Molenberghs and Verbeke (2005), but here we use the visit
instead of time (which is incomplete) as the timing variable. The estimates
from the combined multiple imputation analysis are then obtained as
Figure 7.4: Plot of observed (blue) and imputed (red) infection (Yes/No) by
visit for 16 selected persons in the toenail data (m = 5). The lines visualize the
subject-wise infection probability predicted by the generalized linear mixed
model given visit, treatment and their interaction per imputed dataset.
library(purrr)
mice::complete(imp, "all") %>%
  purrr::map(lme4::glmer,
             formula = outcome ~ treatment * visit + (1 | ID),
             family = binomial) %>%
  pool() %>%
  summary()
more levels. In addition, it also allows imputation at the lowest level (and any
other level) with an arbitrary specification of (additive) random effects. This
includes general nested models, cross-classified models, the ability to include
cluster means at any level of clustering, and the specification of random slopes
at any level of clustering. Table 7.4 lists the various methods.
found JM and FCS equally effective, and better than ad-hoc approaches or
FIML. A difference with Enders et al. (2016) was the addition of FCS methods
that included cluster means. For models with random slopes and cross-level
interactions, FCS was found almost unbiased for the main effects, but less
reliable for higher-order terms. For categorical data, the conclusion was that
both multilevel JM and FCS are suitable for creating multiple imputations.
Incomplete level-2 variables were handled equally well by JM, FCS and FIML.
Audigier et al. (2018) found that JM, as implemented in jomo, worked
well with large clusters and binary data, but had difficulty in modeling small
(number of) clusters, tending to conservative inferences. The homogeneity
assumption in the standard generalized linear mixed model was found to be
limiting. The two-stage approach was found to perform well for systematically
missing data, but was less reliable for small clusters.
The picture that emerges is that FIML is not inherently preferable for
missing predictors or outcomes. Modern versions of JM and FCS are reliable
ways of dealing with missing data in multilevel models with random intercepts.
The FCS framework seems better suited to accommodate models with random
slopes, but may have difficulty with higher-order terms.
Figure 7.5: Missing data patterns in the six-variable subset of the brandsma data.
Bosker (2012). For reasons of clarity, the code examples are restricted to a
subset of six variables.
There is one cluster variable (sch), one administrative variable (pup), one out-
come variable at the pupil level (lpo), two explanatory variables at the pupil
level (iqv, ses) and one explanatory variable at the school level (ssi). The
cluster variable and pupil number are complete, whereas the others contain
missing values.
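In code, this subset is presumably:

d <- brandsma[, c("sch", "pup", "lpo", "iqv", "ses", "ssi")]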
Figure 7.5, with the missing data patterns, reveals that there are 3183 (out
of 4016) pupils without missing values. For the remaining sample, most have
a missing value in just one variable: 583 pupils have only missing ssi, 175
pupils have only missing lpo, 104 pupils have only missing ses and 11 pupils
have only missing iqv. The remaining 50 pupils have two or three missing
values. The challenge is to perform the analyses from Snijders and Bosker
(2012) using the full set with 4016 pupils.
The empty model is fitted to the imputed datasets, and the estimates are
pooled as
library(lme4)
fit <- with(imp, lmer(lpo ~ (1 | sch), REML = FALSE))
summary(pool(fit))
The variance components can be extracted with the mitml package:
library(mitml)
testEstimates(as.mitml.result(fit), var.comp = TRUE)$var.comp
Estimate
Intercept~~Intercept|sch 18.021
Residual~~Residual 63.306
ICC|sch 0.222
See Example 4.1 in Snijders and Bosker (2012) for the interpretation of the
estimates from this model.
There are missing data in both lpo and iqv. Imputation can be done
with both FCS and JM. For FCS my advice is to impute lpo and iqv by
2l.pmm with added cluster means. Adding the cluster means is done here to
improve compatibility among the conditional specified imputation models (cf.
Section 7.5.1).
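A hedged sketch of that setup; the predictor codes -2 (cluster identifier) and 3 (fixed effect plus its cluster mean, as documented for the miceadds 2l methods) should be checked against the package documentation:

library(miceadds)
d <- brandsma[, c("sch", "lpo", "iqv")]
meth <- make.method(d)
meth[c("lpo", "iqv")] <- "2l.pmm"
pred <- make.predictorMatrix(d)
pred["lpo", c("sch", "iqv")] <- c(-2, 3)
pred["iqv", c("sch", "lpo")] <- c(-2, 3)
imp <- mice(d, method = meth, predictorMatrix = pred,
            m = 10, print = FALSE, seed = 152)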
The imputations will also adopt that scale, so we must back-transform the
data if we want to analyze the data in the original scale. For the multilevel
model with only random intercepts and fixed slopes, rescaling the data to the
origin presents no issues since the model is invariant to linear transformations.
This is not true when there are random slopes (Hox et al., 2018, p. 48). We
return to this point in Section 7.10.6.
The JM can create multivariate imputations by the jomoImpute or
panImpute methods. We use panImpute here.
This uses a new facility in mice that allows imputation of blocks of variables
(cf. Section 4.7.2). The final estimates on the multiply imputed data under
model 7.10 can be calculated (from the 2l.pmm method) as
Estimate
Intercept~~Intercept|sch 9.505
Residual~~Residual 40.819
ICC|sch 0.189
which produces the estimates for the random intercept model with an effect
for IQ with imputed IQ and language scores. See Example 4.2 in Snijders and
Bosker (2012) for the interpretation of the parameters.
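The contextual model can be reconstructed from the description that follows as

\text{lpo}_{ic} = \gamma_{00} + \gamma_{10}\,\text{iqv}_{ic} + \gamma_{01}\,\text{iqvc}_{c} + u_{0c} + \epsilon_{ic}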
where the variable iqvc stands for the cluster means of iqv. The model decom-
poses the contribution of IQ to the regression into a within-group component
with parameter γ10 , and a between-group component with parameter γ01 . The
interest in contextual analysis lies in testing the null hypothesis that γ01 = 0.
Because of this decomposition we need to add the cluster means of lpo and
iqv to the imputation model. Remember however that we just did that in the
FCS imputation model of Section 7.10.2. Thus, we may use the same set of
imputations to perform the within- and between-group regressions.
The following code block adds the cluster means to the imputed data,
estimates the model parameters on each set, stores the results in a list, and
pools the estimated parameters from the fitted models to get the combined
results.
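A hedged sketch of these steps (the analysis formula and the use of ave() for the cluster means are assumptions):

fit <- list()
for (i in 1:imp$m) {
  com <- mice::complete(imp, i)
  com$iqvc <- ave(com$iqv, com$sch)               # cluster means of imputed iqv
  fit[[i]] <- lme4::lmer(lpo ~ iqv + iqvc + (1 | sch),
                         data = com, REML = FALSE)
}
mitml::testEstimates(fit, var.comp = TRUE)$var.comp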
Estimate
Intercept~~Intercept|sch 8.430
Residual~~Residual 40.800
ICC|sch 0.171
An alternative could have been to use the imp2 object with the imputations
under the joint imputation model.
Binary level-1 predictors can be imputed in the same way using one of the
methods listed in Table 7.4. It is not yet clear which of the methods should be
preferred. No univariate methods yet exist for multi-category variables, but
2l.pmm may be a workable alternative. Categorical variables can be imputed
by jomo (Quartagno and Carpenter, 2017), jomoImpute (Grund et al., 2018c),
by latent class analysis (Vidotto, 2018), or by Blimp (Keller and Enders, 2017).
The missing values occur in lpo, iqv and den. The difference with model
7.14 is that den is a measured variable, so the value is identical for all members
of the same cluster. If den is missing, it is missing for the entire cluster.
Imputing a missing level-2 predictor is done by forming an imputation model
at the cluster level.
Imputation can be done with both FCS and JM (cf. Table 6, row 3 in
Grund et al. (2018b)). For FCS, the advice is to include aggregates of all level-
1 variables into the cluster level imputation model. Methods 2lonly.norm
and 2lonly.pmm add the means of all level-1 variables as predictors, and
subsequently follow the rules for single-level imputation at level-2.
The following code block imputes missing values in the 2-level predictor
den. For reasons of simplicity, I have used 2lonly.pmm, so the imputations adhere
to the original four-point scale. This use of predictive mean matching rests on the
assumption that the relative frequencies of the denomination categories change
approximately linearly with the predictors. Alternatively, one might opt for a true categorical
method to impute den, which would introduce additional parameters into the
imputation model.
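A sketch of such a call is given below; the object names, the seed and the predictor specification beyond the cluster indicator are assumptions.

meth <- make.method(d)
meth[c("lpo", "iqv")] <- "2l.pmm"       # level-1 targets (miceadds)
meth["den"] <- "2lonly.pmm"             # level-2 target
pred <- make.predictorMatrix(d)
pred[, "sch"] <- -2                     # sch is the cluster indicator
pred["sch", ] <- 0
imp <- mice(d, method = meth, predictorMatrix = pred, print = FALSE, seed = 152)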
The following statements address the same imputation task as a joint model
by jomoImpute:
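The original statements call jomoImpute through mice(); as a rough stand-in under the same assumptions (object names, numbers of iterations and imputations are guesses), the corresponding two-level joint model can also be specified directly with the mitml package:

library(mitml)
fml <- list(lpo + iqv ~ 1 + (1 | sch),   # level-1 targets
            den ~ 1)                     # level-2 target
jm <- jomoImpute(d, formula = fml, n.burn = 100, n.iter = 100, m = 5, seed = 711)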
Because mice calls jomoImpute per replication, the latter method can be slow
because the entire burn-in sequence is re-run for every call. Inspection of the
trace lines revealed that autocorrelations were low and convergence was quick.
To improve speed, the number of burn-in iterations was lowered from n.burn
= 5000 (default) to n.burn = 100. The total number of iterations was set as
maxit = 1, since all variables were members of the same block.
Figure 7.6 shows the density plots of the observed and imputed data after
applying the joint mixed normal/categorical model, and after predictive mean
matching. Both methods handle categorical data, so the figures for den have
multiple modes. The imputations of lpo under JM and FCS are very similar,
with jomoImpute slightly closer to normality. The complete-data analysis on
the multiply imputed data can be fitted as
Estimate
Intercept~~Intercept|sch 8.621
Residual~~Residual 40.761
ICC|sch 0.175
Figure 7.6: Density plots for language score and denomination after
jomoImpute (top) and 2l.pmm (bottom).
Consider next the model with interactions involving sex and den, written in composite form as

lpo_{ic} = \gamma_{00} + \gamma_{10}\, iqv_{ic} + \gamma_{20}\, sex_{ic} + \gamma_{30}\, iqv_{ic} sex_{ic} + \gamma_{40}\, sex_{ic} den_{c}
           + \gamma_{01}\, iqv_{c} + \gamma_{02}\, den_{c} + \gamma_{03}\, iqv_{c} den_{c} + u_{0c} + \epsilon_{ic}

or, equivalently, in level notation as

lpo_{ic} = \beta_{0c} + \beta_{1c}\, iqv_{ic} + \beta_{2c}\, sex_{ic} + \beta_{3c}\, iqv_{ic} sex_{ic} + \beta_{4c}\, sex_{ic} den_{c} + \epsilon_{ic}
\beta_{0c} = \gamma_{00} + \gamma_{01}\, iqv_{c} + \gamma_{02}\, den_{c} + \gamma_{03}\, iqv_{c} den_{c} + u_{0c}
\beta_{1c} = \gamma_{10}
\beta_{2c} = \gamma_{20}
\beta_{3c} = \gamma_{30}
\beta_{4c} = \gamma_{40}
How should we impute the missing values in lpo, iqv, sex and den, and
obtain valid estimates for the interaction term? Grund et al. (2018b) recom-
mend FCS with passive imputation of the interaction terms. As a first step,
we initialize a number of derived variables.
The new variables lpm, iqm and sxm will hold the cluster means of lpo, iqv
and sex, respectively. Variables iqd and lpd will hold the values of iqv and
lpo in deviations from their cluster means. Variables iqd.sex, lpd.sex and
iqd.lpd are two-way interactions of level-1 variables scaled as deviations from
the cluster means. Variables iqd.den, sex.den and lpd.den are cross-level
interactions. Finally, iqm.den, sxm.den and lpm.den are interactions at level-
2. For simplicity, we ignore further level-2 interactions between iqm, sxm and
lpm.
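A sketch of this initialization is given below; the object names are assumptions, and sex and den are assumed to be coded numerically at this stage.

d$lpm <- ave(d$lpo, d$sch, FUN = function(x) mean(x, na.rm = TRUE))
d$iqm <- ave(d$iqv, d$sch, FUN = function(x) mean(x, na.rm = TRUE))
d$sxm <- ave(d$sex, d$sch, FUN = function(x) mean(x, na.rm = TRUE))
d$iqd <- d$iqv - d$iqm                  # deviations from the cluster means
d$lpd <- d$lpo - d$lpm
d$iqd.sex <- d$iqd * d$sex              # level-1 interactions
d$lpd.sex <- d$lpd * d$sex
d$iqd.lpd <- d$iqd * d$lpd
d$iqd.den <- d$iqd * d$den              # cross-level interactions
d$sex.den <- d$sex * d$den
d$lpd.den <- d$lpd * d$den
d$iqm.den <- d$iqm * d$den              # level-2 interactions
d$sxm.den <- d$sxm * d$den
d$lpm.den <- d$lpm * d$den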
The idea is that we impute lpo, iqv, sex and den, and update the other
variables accordingly. Level-1 variables are imputed by two-level predictive
mean matching, and include as predictor the other level-1 variables, the two-
way interactions between the other level-1 variables (in deviations from their
group means), level-2 variables, and cross-level interactions.
# level-1 variables
meth <- make.method(d)
meth[c("lpo", "iqv", "sex")] <- "2l.pmm"
# level-2 variables
meth["den"] <- "2lonly.pmm"
# predictor matrix (assumed set-up): sch is the cluster indicator
pred <- make.predictorMatrix(d)
pred[, "sch"] <- -2
pred["sch", ] <- 0
pred["den", ] <- 0
pred["den", "sch"] <- -2
pred["den", c("lpo", "iqv", "sex",
              "iqd.sex", "lpd.sex", "iqd.lpd")] <- 1
The transpose of the relevant entries of the predictor matrix illustrates the
symmetric structure of the imputation model.
The entries corresponding to the level-1 predictors are coded with a 3, indicat-
ing that both the original values as well as the cluster means of the predictor
are included into the imputation model. Interactions are coded with a 1. One
could also code these with a 3, in order to improve compatibility, but this is
not done here because the imputation model becomes too heavy. Because we
cannot have the same variable appearing on both sides of the equation, any
interaction terms involving the target have been deleted from the conditional
imputation models.
The specification above defines the imputation model for the variables in
the data. All other variables (e.g., cluster means, interactions) are calculated
on-the-fly by passive imputation. The code below calculates the cluster means
and the versions of iqv and lpo centered on them.
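A sketch of this step follows; the exact predictor codes in the 2l.groupmean rows are assumptions.

meth[c("lpm", "iqm", "sxm")] <- "2l.groupmean"   # cluster means (miceadds)
meth["iqd"] <- "~ I(iqv - iqm)"                  # centered versions by passive imputation
meth["lpd"] <- "~ I(lpo - lpm)"
pred[c("lpm", "iqm", "sxm"), ] <- 0
pred["lpm", c("lpo", "sch")] <- c(1, -2)         # aggregate lpo within sch
pred["iqm", c("iqv", "sch")] <- c(1, -2)
pred["sxm", c("sex", "sch")] <- c(1, -2)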
The 2l.groupmean method from the miceadds package returns the cluster
mean pertaining to each observation. Centering on the cluster means is widely
practiced, but significantly alters the multilevel model. In the context of im-
putation, centering on the cluster means often enhances the stability and robustness
of the models used to generate the imputations, especially if interactions are involved.
When the complete-data model uses cluster centering, then the imputation
model should also do so. See Section 7.5.1 for more details.
The next block of code specifies the interaction effects, by means of passive
imputation.
# derive interactions
meth["iqd.sex"] <- "~ I(iqd * sex)"
meth["lpd.sex"] <- "~ I(lpd * sex)"
meth["iqd.lpd"] <- "~ I(iqd * lpd)"
meth["iqd.den"] <- "~ I(iqd * den)"
meth["sex.den"] <- "~ I(sex * den)"
meth["lpd.den"] <- "~ I(lpd * den)"
meth["iqm.den"] <- "~ I(iqm * den)"
meth["sxm.den"] <- "~ I(sxm * den)"
meth["lpm.den"] <- "~ I(lpm * den)"
The visit sequence specified below updates the relevant derived variables
after any of the measured variables is imputed, so that interactions are always
in sync. The specification of the imputation model is now complete, so it can
be run with mice().
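A sketch of such a specification appears below; the exact ordering, the number of iterations and the seed are assumptions.

visit <- c("lpo", "lpm", "lpd", "lpd.sex", "iqd.lpd", "lpd.den", "lpm.den",
           "iqv", "iqm", "iqd", "iqd.sex", "iqd.lpd", "iqd.den", "iqm.den",
           "sex", "sxm", "iqd.sex", "lpd.sex", "sex.den", "sxm.den",
           "den", "iqd.den", "sex.den", "lpd.den", "iqm.den", "sxm.den", "lpm.den")
imp <- mice(d, meth = meth, pred = pred, visitSequence = visit,
            maxit = 10, print = FALSE, seed = 188)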
The analysis of the imputed data according to the specified model first
transforms den into a categorical variable, and then fits and pools the mixed
model.
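A sketch of this analysis step is given below; the fixed part follows the model displayed earlier in this section, and the use of mitml::testEstimates() for pooling and the other details are assumptions.

library(lme4)
library(mitml)
fits <- lapply(seq_len(imp$m), function(j) {
  com <- complete(imp, j)
  com$den <- as.factor(com$den)              # den as categorical variable
  com$iqvc <- ave(com$iqv, com$sch)          # cluster means of iqv
  lmer(lpo ~ iqv + sex + iqv:sex + sex:den + iqvc + den + iqvc:den + (1 | sch),
       data = com, REML = FALSE)
})
testEstimates(fits, extra.pars = TRUE)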
This section considers a model with a random slope for IQ (cf. Example 5.1 in Snijders and Bosker (2012)). Section 7.10.3 showed
how to impute the contextual model. Including random slopes extends the
complete-data model as
lpo_{ic} = \gamma_{00} + \gamma_{01}\, iqv_{c} + \gamma_{10}\, iqv_{ic} + u_{0c} + u_{1c}\, iqv_{ic} + \epsilon_{ic} \qquad (7.22)
The addition of the term u1c to the equation for β1c allows for β1c to vary
over clusters, hence the name “random slopes”.
Missing data may occur in lpo and iqv. Enders et al. (2016) and Grund
et al. (2018b) recommend FCS for this problem. The procedure is almost
identical to that in Section 7.10.2, but now including both the cluster means
and random slopes into the imputation model.
The entry of 4 at cell (lpo, iqv) in the predictor matrix adds three variables to
the imputation model for lpo: the value of iqv, the cluster means of iqv and
the random slopes of iqv. Conversely, imputing iqv adds the three covariates:
the values of lpo, the cluster means of lpo and the random slopes of lpo.
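A minimal sketch of this specification, with assumed object names and seed:

meth <- make.method(d)
meth[c("lpo", "iqv")] <- "2l.pmm"
pred <- make.predictorMatrix(d)
pred[, "sch"] <- -2
pred["lpo", "iqv"] <- 4       # value, cluster means and random slopes of iqv
pred["iqv", "lpo"] <- 4       # value, cluster means and random slopes of lpo
pred["sch", ] <- 0
imp <- mice(d, method = meth, predictorMatrix = pred, print = FALSE, seed = 452)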
The iqv variable had zero mean in the data, so this could be imputed right
away, but lpo needs to be centered around the grand mean in order to reduce
the large number of warnings about unstable estimates. It is known that the
random slopes model is not invariant to a shift in origin in the predictors (Hox
et al., 2018), so we may wonder what the effect of centering on the grand mean
will be on the quality of the imputations. See Kreft et al. (1995) and Enders
and Tofighi (2007) for discussions on the effects of centering in multilevel
Estimate
Intercept~~Intercept|sch 8.591
Intercept~~iqv|sch -0.781
iqv~~iqv|sch 0.188
Residual~~Residual 39.791
ICC|sch 0.178
See Example 5.1 in Snijders and Bosker (2012) for the interpretation of
these model parameters. Interestingly, if we don’t restore the mean of lpo, the
estimated intercept represents the average difference between the observed and
imputed language scores. Its value here is -0.271 (not shown), so on average
pupils without a language test score a little lower than pupils with a score.
The difference is not statistically significant (p = 0.23).
We now turn to the model described in Example 5.3 of Snijders and Bosker (2012). This is a fairly
elaborate model that can best be understood in level notation:
lpo_{ic} = \beta_{0c} + \beta_{1c}\, iqv_{ic} + \beta_{2c}\, ses_{ic} + \beta_{3c}\, iqv_{ic} ses_{ic} + \epsilon_{ic} \qquad (7.26)
\beta_{0c} = \gamma_{00} + \gamma_{01}\, iqv_{c} + \gamma_{02}\, ses_{c} + \gamma_{03}\, iqv_{c} ses_{c} + u_{0c} \qquad (7.27)
\beta_{1c} = \gamma_{10} + \gamma_{11}\, iqv_{c} + \gamma_{12}\, ses_{c} + u_{1c} \qquad (7.28)
\beta_{2c} = \gamma_{20} + \gamma_{21}\, iqv_{c} + \gamma_{22}\, ses_{c} + u_{2c} \qquad (7.29)
\beta_{3c} = \gamma_{30} \qquad (7.30)
Although this expression may look somewhat horrible, it clarifies that the
expected value of lpo depends on the following terms:
• the level-1 variables iqvic and sesic ;
• the level-1 interaction iqvic sesic ;
• the cluster means iqvc and sesc ;
• the within-variable cross-level interactions iqvic iqvc and sesic sesc ;
• the between-variable cross-level interactions iqvic sesc and sesic iqvc ;
• the level-2 interaction iqvc sesc ;
• the random intercepts;
• the random slopes for iqv and ses.
All terms need to be included into the imputation model for lpo. Univariate
imputation models for iqv and ses can be specified along the same principles
by reversing the roles of outcome and predictor. As a first step, let us pad the
data with the set of all relevant interactions from model 7.26.
Here iqv.ses represents the multiplicative interaction term for iqv and ses,
and lpm represents the cluster means of lpo, and so on. Imputation models for
lpo, iqv and ses are specified by setting the relevant entries in the transformed
predictor matrix as follows:
The model for lpo is almost equivalent to model 7.26. According to the
model, both cluster means and random effects should be included, thus val-
ues pred["lpo", c("iqv", "ses")] should be coded as a 4, and not as a
3. However, the cluster means and random effects are almost linearly depen-
dent, which causes slow convergence and unstable estimates in the imputation
model. These problems disappear when only the cluster means are included
as covariates. An alternative is to scale the predictors in deviations from the
cluster means, as was done in Section 7.10.5. This circumvents many of the
computational issues of raw-scored variables, and the parameters are easier to
interpret.
The specifications for iqv and ses correspond to the inverted models.
Inverting the random slope model produces reasonable estimates for the fixed
effect and the intercept variance, but estimates of the slope variance can be
unstable and biased, especially in small samples (Grund et al., 2016a). Unless
the interest is in the slope variance (for which listwise deletion appears to
be better), using FCS by inverting the random slope model is the currently
preferred method to account for differences in slopes between clusters.
Next, we need to specify the derived variables. The cluster means are
updated by the 2l.groupmean method.
summary(pool(fit))
Estimate
Intercept~~Intercept|sch 7.93524
Intercept~~ses|sch -0.00920
Intercept~~iqv|sch -0.75078
ses~~ses|sch 0.00114
ses~~iqv|sch -0.00830
iqv~~iqv|sch 0.16489
Residual~~Residual 37.78840
ICC|sch 0.17355
The estimates are quite close to Table 5.3 in Snijders and Bosker (2012).
These authors continue with simplifying the model. The same set of imputa-
tions can be used for these simpler models since the imputation model is more
general than the substantive models.
7.10.8 Recipes
The term “cookbook statistics” is sometimes used to refer to thoughtless
and rigid applications of statistical procedures. Minute execution of a sequence
of steps won’t earn you a Nobel Prize, but a good recipe will enable you to
produce a decent meal from ingredients that you may not have seen before.
The recipes given here are intended to assist you in creating a decent set of
imputations for multilevel data.
Table 7.5 contains two recipes for imputing multilevel data. There are
separate recipes for level-1 and level-2 data. The recipes follow the inclusive
strategy advocated by Collins et al. (2001), and extend the predictor specifi-
cation strategy in Section 6.3.2 to multilevel data. Including all two-way (or
higher-order) interactions may quickly inflate the number of parameters in
the model, especially for categorical data, so some care is needed in selecting
the interactions that seem most important to the application at hand.

Table 7.5: Recipes for imputing multilevel data for models with random inter-
cepts and random slopes. There are different procedures for level-1 and level-2
variables.

Recipe for a level-1 target
1. Define the most general analytic model to be applied to imputed data
2. Select a 2l method that imputes close to the data
3. Include all level-1 variables
4. Include the disaggregated cluster means of all level-1 variables
5. Include all level-1 interactions implied by the analytic model
6. Include all level-2 predictors
7. Include all level-2 interactions implied by the analytic model
8. Include all cross-level interactions implied by the analytic model
9. Include predictors related to the missingness and the target
10. Exclude any terms involving the target
Sections 7.10.1 to 7.10.7 demonstrated applications of these recipes for a
variety of multilevel models. One very important source of information was
not yet included. For clarity, all procedures were restricted to the subset of
data that was actually used in the model. This strategy is not optimal in
general because it fails to include potentially useful auxiliary information that is not
modeled. For example, the brandsma data also contains the test scores from
the same pupils taken one year before the outcome was measured. This score
is highly correlated to the outcome, but it was not part of the model and
hence not used for imputation. Of course, one could extend the substantive
model (e.g., include the pre-test score as a covariate), but this affects the
interpretation and may not correspond to the question of scientific interest. A
better way is to include these variables only into the imputation model. This
will decrease the between-imputation variability and hence lead to sharper
statistical inferences. Including extra predictive variables is left as an exercise
for the reader.
The procedure in Section 7.10.7 may be a daunting task when the number
of variables grows, especially keeping track of all required interaction effects.
The whole process can be automated, but currently there is no software that
will perform these steps behind the scenes. This may be a matter of time. In
general, it is good to be aware of the steps taken, so specification by hand
could also be considered an advantage.
Monitoring convergence is especially important for models with many ran-
dom slopes. Warnings from the underlying multilevel routines may indicate
over-specification of the model, for example, with too many parameters. The
imputer should be attentive to such messages and respond by reducing the
complexity of the imputation model in light of the analytic model. In mul-
tilevel modeling, overparameterization occurs almost always in the variance
part of the model. Reducing the number of random slopes, or simplifying the
level-2 model structure could help to reduce computational complexity.
Chapter 8
Individual causal effects
People differ widely in how they react to events. Most scientific studies
express the effect of a treatment as an average over a group of persons. This
is informative if the effect is thought to be similar for all persons, but is less
useful if the effect is expected to differ. This chapter uses multiple imputation
to estimate the individual causal effect (ICE), or the unit-level causal effect,
for one or more units in the data. The hope is that this allows us to develop
a deeper understanding of how and why people differ in their reactions to an
intervention.
so John gains one year because of surgery. We see that the new surgery is
beneficial for John, Peter and Torey, but harmful to the others.
Table 8.1: Number of years lived for eight patients under a new surgery Y (1)
and under standard treatment Y (0). Hypothetical data.
Patient Age Y (1) Y (0) τi
John 68 14 13 +1
Caren 76 0 6 −6
Joyce 66 1 4 −3
Robert 81 2 5 −3
Ruth 70 3 6 −3
Nick 72 1 6 −5
Peter 81 10 8 +2
Torey 72 9 8 +1
In addition, let the average causal effect, or ACE, be the mean ICE over
all units, i.e.,
\tau = \frac{1}{n} \sum_{i=1}^{n} \{ Y_i(1) - Y_i(0) \} \qquad (8.2)
There are many such unit-level causal effects, and we often wish
to summarize them for the finite sample or for subpopulations.
These authors also note that estimating the ICE is difficult because the es-
timates are sensitive to choices for the prior distribution of the dependence
structure between the two potential outcomes. Morgan and Harding (2006)
wrote
Because it is usually impossible to effectively estimate individual-
level causal effects, we typically shift attention to aggregated
causal effects.
8.3 Framework
Let us explore the use of multiple imputation of the missing potential
outcomes, with the objective of estimating τi for some target person i. We use
the potential outcomes framework using the notation of Imbens and Rubin
(2015). Let the individual causal effect for unit i be defined as τi = Yi (1) −
Yi (0). Let Wi = 0 if unit i received the control treatment, and let Wi = 1
if i received the active treatment. We assume that assignment to treatments
is unconfounded by the unobserved outcomes Ymis , so P (W |Y (0), Y (1), X) =
P (W |Yobs , X) specifies an ignorable treatment assignment mechanism where each
unit has a non-zero probability for each treatment (Imbens and Rubin, 2015,
p. 39). Optionally, we may assume a joint distribution P (Y (0), Y (1), X) of
potential outcomes Y (0) and Y (1) and covariates X. This is not strictly needed
for creating valid inferences under known randomized treatment assignments,
but it is beneficial in more complex situations.
Imbens and Rubin (2015) specified a series of joint normal models to gener-
ate multiple imputations of the missing values in the potential outcomes. Here
we will use the FCS framework to create multiple imputations of the missing
potential outcomes. The idea is that we alternate two univariate imputations: one that draws the missing Y(1) given Y(0) and the covariates, and one that draws the missing Y(0) given Y(1) and the covariates, where φ̇1 and φ̇0 are draws from the parameters of the respective imputation models.
Let Ẏ_{iℓ}(W_i) denote an independent draw from the posterior predictive distribution of Y for unit i, imputation ℓ, and treatment W_i. The replicated individual causal effect τ̇_{iℓ} in the ℓth imputed dataset is equal to

\dot\tau_{i\ell} = \dot Y_{i\ell}(1) - \dot Y_{i\ell}(0) \qquad (8.5)

Note that both Ẏ_{iℓ}(1) and Ẏ_{iℓ}(0) vary over ℓ in Equation 8.5, but this is only needed if both outcomes are missing for unit i. In general, we may equate Ẏ_{iℓ}(W_i) = Y_i(W_i) for the observed outcomes. If unit i was allocated to the experimental treatment and the outcome was observed, the replicated causal effect 8.5 simplifies to

\dot\tau_{i\ell} = Y_i(1) - \dot Y_{i\ell}(0) \qquad (8.8)

Likewise, if unit i was measured under the control condition, we find

\dot\tau_{i\ell} = \dot Y_{i\ell}(1) - Y_i(0) \qquad (8.9)
library(mice)
data2 <- data[, -1]
imp <- mice(data2, method = "norm", seed = 188, print = FALSE)
Warning: Number of logged events: 7
[Figure 8.1: Stripplots of the observed and imputed values of the potential outcomes y1 and y0, plotted per imputation.]
Figure 8.1 shows the values of the observed and imputed data of the poten-
tial outcomes. The imputations look very bad, especially for y1. The spread
is much larger than in the data, resulting in illogical negative values and im-
plausible high values. The problem is that there are no persons for which y1
and y0 are jointly observed, so the relation between y1 and y0 is undefined.
We may see this clearly from the correlations ρ(Y (0), Y (1)) between the two
potential outcomes in each imputed dataset.
1 2 3 4 5
-0.994 -0.552 0.952 0.594 -0.558
The ρ’s are all over the place, signaling that the correlation ρ(Y (0), Y (1)) is
not identified from the data in y1 and y0, a typical finding for the file matching
missing data pattern.
At the extreme ρ = −1, we expect that the treatment would entirely reverse the order
of units, so the unit with the maximum outcome under treatment will have
the minimum outcome under control, and vice versa. It is hard to imagine
interventions for which that would be realistic.
The ρ parameter can act as a tuning knob regulating the amount of het-
erogeneity in the imputation. In my experience, ρ has to be set at fairly high
value, say in the range 0.9 – 0.99. The correlation in Table 8.1 is 0.9, which
allows for fairly large differences in τi , here from −6 years to +2 years.
The specification of ρ in mice can be a little tricky, and is most easily achieved
by appending hypothetical extra cases to the data with both y1 and y0 ob-
served given the specified correlation. Following Imbens and Rubin (2015,
p. 165) we assume a bivariate normal distribution for the potential outcomes:
\begin{pmatrix} Y_i(0) \\ Y_i(1) \end{pmatrix} \Bigm| \theta \sim N\left( \begin{pmatrix} \mu_0 \\ \mu_1 \end{pmatrix}, \begin{pmatrix} \sigma_0^2 & \rho\sigma_0\sigma_1 \\ \rho\sigma_0\sigma_1 & \sigma_1^2 \end{pmatrix} \right) \qquad (8.10)
where θ = (µ0, µ1, σ0², σ1²) are informed by the available data, and where ρ is
set by the user. The corresponding sample estimates are µ̂0 = 6.6, µ̂1 = 5.0,
σ̂0² = 1.8 and σ̂1² = 61. However, we do not use these estimates right away
in Equation 8.10. Rather we equate µ0 = µ1 because we wish to avoid using
the group difference twice. The location is arbitrary, but a convenient choice
is the grand mean µ = 6, which gives quicker convergence than, say, µ = 0. Also,
we equate σ0² = σ1² because the scale units of the potential outcomes must
be the same in order to calculate meaningful differences. A convenient choice
is the variance of the observed outcome data, σ̂² = 19.1. For very high ρ, we
found that setting σ0² = σ1² = 1 made the imputation algorithm more stable.
Finally, we need to account for the difference in means between the data and
the prior. Define Di = 1 if unit i belongs to the data, and Di = 0 otherwise.
The bivariate normal model for drawing the imputation is
\begin{pmatrix} Y_i(0) \\ Y_i(1) \end{pmatrix} \Bigm| \theta \sim N\left( \begin{pmatrix} 6 + D_i \dot\alpha_0 \\ 6 + D_i \dot\alpha_1 \end{pmatrix}, \begin{pmatrix} 19.1 & 19.1\rho \\ 19.1\rho & 19.1 \end{pmatrix} \right) \qquad (8.11)
where α̇0 and α̇1 are drawn as usual. The number of cases used for the prior is
arbitrary, and will give essentially the same result. We have set it here to 100,
so that the empirical correlation in the extra data will be reasonably close to
the specified value. The following code block generates the extra cases.
set.seed(84409)
rho <- 0.9
mu <- mean(unlist(data2[, c("y1", "y0")]), na.rm = TRUE)
sigma2 <- var(unlist(data2), na.rm = TRUE)
# sigma2 <- 1
cv <- rho * sigma2
s2 <- matrix(c(sigma2, cv, cv, sigma2), nrow = 2)
prior <- data.frame(MASS::mvrnorm(n = 100, mu = rep(mu, 2),
                                  Sigma = s2))
names(prior) <- c("y1", "y0")
The next statements combine the observed data and the prior, and cal-
culate two variables. The binary indicator d separates the intercepts of the
observed data units and the prior units.
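A sketch of this step, with assumed object names (only the indicator d is constructed here):

obs <- data2[, c("y1", "y0")]
stacked <- rbind(cbind(obs, d = 1),      # observed cases
                 cbind(prior, d = 0))    # prior cases generated above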
1 2 3 4 5
0.952 0.976 0.960 0.953 0.840
[Figure: Trace plot of the mean and standard deviation of y1 and y0 over 100 iterations.]
Figure 8.3: FCS with a ρ prior. Stripplots of the prior data (gray) and the
imputed data (red) for the potential outcomes y1 and y0.
[Figure: Scatterplots of y1 against y0 per imputed dataset.]
Figure 8.5: Fan plot. Observed and imputed (m = 100) outcomes under new
(1) and standard (0) surgery. John, Caren and Joyce had the new surgery.
The three rows correspond to ρ = 0.00 (top), ρ = 0.90 (middle) and ρ = 0.99
(bottom). Data from Table 8.1.
the mean. The effects are heterogeneous. Convergence of this condition is very
quick.
The pattern for the ρ = 0.99 condition (bottom row) is different. All
patients would benefit from the standard surgery, but the magnitude of the
benefit is smaller than those under ρ = 0. Observe that all effects, except
for Caren and Joyce, go against the direction of the regression to the mean.
The effects are almost identical, and the between-imputation variance is small.
The solution in the middle row (ρ = 0.9) is a compromise between the two.
Intuitively, this setting is perhaps the most realistic and reasonable.
8.4.3 Extensions
Imbens and Rubin (2015) observe that the inclusion of covariates does not
fundamentally change the underlying method for imputing the missing out-
comes. This generalizes Neyman’s method to covariates, and has the advantage
that covariates can improve the imputations by providing additional informa-
tion on the outcomes. A clear causal interpretation is only warranted if the
covariates are not influenced by the treatment. This includes pre-treatment
factors that led up to the decision to treat or not (e.g., age, disease history),
pre-treatment factors that are predictive of the later outcome, such as base-
line outcomes, and post-treatment factors that are not affected by treatment.
Covariates that may have changed as a result of the experimental treatment
should not be included in a causal model.
We may distinguish various types of covariates. A covariate can be part of
the assignment mechanism, related to the potential outcomes Yi (0) and Yi (1),
or related to τi . The first covariate type is often included into the imputation
model in order to account for design effects, in particular to achieve compara-
bility of the experimental units. A causal effect must be a comparison of the
ordered sets {Yi (1), i ∈ S} and {Yi (0), i ∈ S} (Rubin, 2005), which can always
be satisfied once the potential outcomes have been imputed for units i ∈ S.
Hence, we have no need to stratify on design variables to achieve comparabil-
ity. However, we still need to include design factors into the imputation model
in order to satisfy the condition of ignorable treatment assignment. Including
the second covariate type will make the imputations more precise, and so this
is generally beneficial. The third covariate type directly explains heterogeneity
of causal effects over the units. Because τi is an entirely missing (latent) vari-
able, it is difficult to impute τi directly in mice. The method in Section 8.4.2
explains heterogeneity of τi indirectly via the imputation models for Yi (0) and
Yi (1). Any covariates of type 3 should be included in the imputation models,
and their regression weights should be allowed to differ between models for
Yi (0) and Yi (1).
Suppose we wish to obtain the average causal effect for nS > 1 units i ∈ S.
Calculate the within-replication average causal effect τ̇_ℓ over the units in set S as

\dot\tau_\ell = \frac{1}{n_S} \sum_{i \in S} \dot\tau_{i\ell} \qquad (8.12)
and then combine the results over the replications by means of Rubin’s rules.
We can use the same principles for categorical outcomes. Mortality is a widely
used outcome, and can be imputed by logistic regression. The ICE can then
take on one of four possible values: (alive, alive), (alive, dead), (dead, alive)
and (dead, dead). A still open question is how the dependency between the
potential outcomes is best specified.
We may add a potential outcome for every additional treatment, so ex-
tension to three or more experimental conditions does not present new con-
ceptual issues. However, the imputation problem becomes more difficult. With
four treatments, 75 percent of each outcome will need to be imputed, and the
number of outcomes to impute will double. There are practical limits to what
can be done, but I have done analyses with seven treatment arms. Careful
monitoring of convergence is needed, as well as a reasonably sized dataset in
each experimental group.
After imputation, the individual causal effect estimates can be analyzed
for patterns that explain the heterogeneity. The simplest approach takes τ̂i
as the outcome for a regression at the unit level, and ignores the often sub-
stantial variation around τ̂i . This is primarily useful for exploratory analysis.
Alternatively, we may utilize the full multiple imputation cycle, so perform
the regression on τ̇i` within each imputed dataset, and then pool the results
by Rubin’s rules.
Lam (2013) found that predictive mean matching performed well for imput-
ing potential outcomes. Gutman and Rubin (2015) described a spline-based
imputation method for binary data with good statistical properties. Imbens
and Rubin (2015) show how the ACE and ρ are independent, discuss various
options of setting ρ and derive estimates of the ACE. Smink (2016) found
that the quality of the ICE estimate depends on the quantile of the realized
outcome, and concluded that proper modeling of the correlation between the
potential outcomes is needed.
There is a vast class of methods that relate the observed scores Yi to
covariates X by least-squares or machine learning methods. These methods
are conceptually and analytically distinct from the methods presented in this
chapter. Some methods are advertised as estimating individual causal effects,
but actually target a different estimand. The relevant literature typically de-
fines the individual causal effect as something like

\tilde\tau_i = \widehat{E}[Y \mid X = x_i, W_i = 1] - \widehat{E}[Y \mid X = x_i, W_i = 0]
which is the difference between the predicted value under treatment and
predicted value under control for each individual. In order to quantify τ̃i ,
one needs to estimate the components E[Y |X = xi , Wi = 1] and E[Y |X =
xi , Wi = 0] from the data. Now in practice, the set of units i ∈ S1 for estimat-
ing the first component differs from the set of units i ∈ S0 for estimating the
second. In that case, τ̃i takes the expectation over different sets of units, so τ̃i
reflects not only the treatment effect, but also any effects that arise because
the units in S1 and S0 are different, and even mutually exclusive. This violates
the critical requirement for causal inference that “the comparison must be a
comparison of Yi (1) and Yi (0) for a common set of units” (Rubin, 2005, p.
323). If we aspire to take expectations over the same set of units, we will need
to make additional assumptions. Depending on such assumptions about the
treatment assignment mechanism and about ρ, there will be circumstances
where τi and τ̃i lead to the same estimates, but without such assumptions,
the estimands τi and τ̃i are generally different.
I realize that the methods presented in this chapter only scratch the surface
of a tremendous, yet unexplored field. The methodology is in a nascent state,
and I hope that the materials in this chapter will stimulate further research
in the area.
Part III
Case studies
Chapter 9
Measurement issues
library(mice)
## DO NOT DO THIS
imp <- mice(data) # not recommended
If you are lucky, the program may run and impute, but after a few minutes
it becomes clear that it takes a long time to finish. And after the wait is over,
the imputations turn out to be surprisingly bad. What happened?
Some exploration of the data reveals that your colleague sent you a dataset
with 351 columns, essentially all the information that was sampled in the
study. By default, the mice() function uses all other variables as predictors,
so mice() will try to calculate regression analyses with 350 explanatory vari-
ables, and repeat that for every incomplete variable. Categorical variables are
internally represented as dummy variables, so the actual number of predictors
could easily double. This makes the algorithm extremely slow, if it runs at all.
Some further exploration reveals that some variables are free text fields, and
that some of the missing values were not marked as such in the data. As a
consequence, mice() treats impossible values such as “999” or “−1” as real
data. Just one forgotten missing data mark may introduce large errors into
the imputations.
In order to evade such practical issues, it is necessary to spend some time
exploring the data first. Furthermore, it is helpful if you understand for which
scientific question the data are used. Both will help in creating sensible impu-
tations.
This section concentrates on what can be done based on the data val-
ues themselves. In practice, it is far more productive and preferable to work
together with someone who knows the data really well, and who knows the
questions of scientific interest that one could ask from the data. Sometimes
the possibilities for cooperation are limited. This may occur, for example, if
the data have come from several external sources (as in meta analysis), or if
the dataset is so diverse that no one person can cover all of its contents. It will
be clear that this situation calls for a careful assessment of the data quality,
well before attempting imputation.
were visited by a physician between January 1987 and May 1989. A full med-
ical history, information on current use of drugs, a venous blood sample, and
other health-related data were obtained. BP was routinely measured during
the visit. Apart from some individuals who were bedridden, BP was measured
while seated. An Hg manometer was used and BP was rounded to the near-
est 5 mmHg. Measurements were usually taken near the end of the interview.
The mortality status of each individual on March 1, 1994 was retrieved from
administrative sources.
Of the original cohort, a total of 218 persons died before they could be
visited, 59 persons did not want to participate (some because of health prob-
lems), 2 emigrated and 1 was erroneously not interviewed, so 956 individuals
were visited. Effects due to subsampling the visited persons from the entire
cohort were taken into account by defining the date of the home visit as the
start (Boshuizen et al., 1998). This type of selection will not be considered
further.
library(foreign)
file.sas <- file.path(dataproject, "original/master85.xport")
## xport.info <- lookup.xport(file.sas)
original.sas <- read.xport(file.sas)
names(original.sas) <- tolower(names(original.sas))
dim(original.sas)
The dataset contains 1236 rows and 351 columns. When I tracked down
the origin of the data, the former investigators informed me that the file was
composed during the early 1990’s from several parts. The basic component
consisted of a Dbase file with many free text fields. A dedicated Fortran
program was used to separate free text fields. All fields with medical and
drug-related information were hand-checked against the original forms. The
information not needed for analysis was not cleaned. All information was kept,
so the file contains several versions of the same variable.
A first scan of the data makes clear that some variables are free text fields,
person codes and so on. Since these fields cannot be sensibly imputed, they
are removed from the data. In addition, only the 956 cases that were initially
visited are selected, as follows:
The frequency distribution of the missing cases per variable can be obtained
as:
0 2 3 5 7 14 15 28 29 32 33 34 35 36 40 42
87 2 1 1 1 1 2 1 3 2 34 15 25 4 1 1
43 44 45 46 47 48 49 50 51 54 64 72 85 103 121 126
2 1 4 2 3 24 4 1 20 2 1 4 1 1 1 1
137 155 157 168 169 201 202 228 229 230 231 232 233 238 333 350
1 1 1 2 1 7 3 5 4 2 4 1 1 1 3 1
501 606 635 636 639 642 722 752 753 812 827 831 880 891 911 913
3 1 2 1 1 2 1 5 3 1 1 3 3 3 3 1
919 928 953 954 955
1 1 3 3 3
Ignoring the warning for a moment, we see that there are 87 variables that
are complete. The set includes administrative variables (e.g., person number),
design factors, date of measurement, survival indicators, selection variables
and so on. The set also included some variables for which the missing data
were inadvertently not marked, containing values such as “999” or “−1.” For
example, the frequency distribution of the complete variable “beroep1” (oc-
cupation) is
-1 0 1 2 3 4 5 6 <NA>
42 1 576 125 104 47 44 17 0
There are no missing values, but a variable with just categories “−1” and
“0” is suspect. The category “−1” likely indicates that the information was
missing (this was indeed the case). One option is to leave this “as is,” so that
mice() treats it as complete information. All cases with a missing occupation
are then seen as a homogeneous group.
Two other variables without missing data markers are syst and diast, i.e.,
systolic and diastolic BP classified into six groups. The correlation (using the
observed pairs) between syst and rrsyst, the variable of primary interest,
is 0.97. Including syst into the imputation model for rrsyst will ruin the
imputations. The “as is” option is dangerous, and shares some of the same
perils of the indicator method (cf. Section 1.3.7). The message is that variables
that are 100% complete deserve appropriate attention.
After a first round of screening, I found that 57 of the 87 complete variables
were uninteresting or problematic in some sense. Their names were placed on
a list named outlist1 as follows:
[1] 57
9.1.4 Outflux
We should also scrutinize the variables at the other end. Variables with
high proportions of missing data generally create more problems than they
solve. Unless some of these variables are of genuine interest to the investiga-
tor, it is best to leave them out. Virtually every dataset contains some parts
that could better be removed before imputation. This includes, but is not
limited to, uninteresting variables with a high proportion of missing data,
variables without a code for the missing data, administrative variables, con-
stant variables, duplicated, recoded or standardized variables, and aggregates
and indices of other information.
Figure 9.1 is the influx-outflux pattern of Leiden 85+ Cohort data. The in-
flux of a variable quantifies how well its missing data connect to the observed
data on other variables. The outflux of a variable quantifies how well its ob-
served data connect to the missing data on other variables. See Section 4.1.3
for more details. Though the display could obviously benefit from a better
label-placing strategy, we can see three groups. All points are relatively close
to the diagonal, which indicates that influx and outflux are balanced.
The group at the left-upper corner has (almost) complete information, so
the number of missing data problems for this group is relatively small. The
intermediate group has an outflux between 0.5 and 0.8, which is small. Miss-
ing data problems are more severe, but potentially this group could contain
important variables. The third group has an outflux with 0.5 and lower, so
its predictive power is limited. Also, this group has a high influx, and is thus
highly dependent on the imputation model.
Figure 9.1: Global influx-outflux pattern of the Leiden 85+ Cohort data.
Variables with higher outflux are (potentially) the more powerful predictors.
Variables with higher influx depend strongly on the imputation model.

Note that there are two variables (hypert1 and aovar) in the third group
that are located above the diagonal. Closer inspection reveals that the missing
data mark had not been set for these two variables. Variables that might cause
problems later on in the imputations are located in the lower-right corner.
Under the assumption that this group does not contain variables of scientific
interest, I transferred 45 variables with an outflux < 0.5 to outlist2:
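A sketch of this selection, assuming the working data frame is called data; flux() returns the influx and outflux of every variable.

fx <- flux(data)
outlist2 <- row.names(fx)[fx$outflux < 0.5]
length(outlist2)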
[1] 45
In these data, the set of selected variables is identical to the group with
more than 500 missing values, but this need not always be the case. I removed
the 45 variables, recalculated influx and outflux on the smaller dataset and
selected 32 new variables with outflux < 0.5.
Variable outlist3 contains 32 variable names, among which are many lab-
oratory measurements. I prefer to keep these for imputation since they may
correlate well with BP and survival. Note that the outflux changed consider-
ably as I removed the 45 least observed variables. Influx remained nearly the
same.
head(ini$loggedEvents, 2)
tail(ini$loggedEvents, 2)
[1] 108
There are 108 unique variables to be removed. Thus, before doing any
imputations, I cleaned out about one third of the data that are likely to cause
problems. The downsized data are
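A sketch of the downsizing step; the exact composition of the removal list and the object names are assumptions.

outlist <- unique(c(outlist1, outlist2, outlist3,
                    as.character(ini$loggedEvents$out)))
data2 <- data[, !names(data) %in% outlist]
dim(data2)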
The next step is to build the imputation model according to the strategy
outlined above. The function quickpred() is applied as follows:
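A sketch of this call; the variable names in inlist and the parameter values are assumptions.

inlist <- c("age", "sex", "rrsyst", "rrdiast")
pred <- quickpred(data2, mincor = 0.1, minpuc = 0.5, include = inlist)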
There are 198 incomplete variables in data2. The character vector inlist
specifies the names of the variables that should be included as covariates in
every imputation model. Here I specified age, sex and blood pressure. Blood
pressure is the variable of central interest, so I included it in all models. This
list could be longer if there are more outcome variables. The inlist could
also include design factors.
The quickpred() function creates a binary predictor matrix of 198 rows
and 198 columns. The rows correspond to the incomplete variables and the
columns report the same variables in their role as predictor. The number of
predictors varies per row. We can display the distribution of the number of
predictors by
table(rowSums(pred))
0 7 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 1 2 1 1 2 5 2 13 8 16 9 13 7 5 6 10 6 3 6 4
30 31 32 33 34 35 36 37 38 39 40 41 42 44 45 46 49 50 57 59 60
8 3 6 9 2 4 6 2 5 2 4 2 3 4 3 3 3 1 1 1 1
61 68 79 83 85
1 1 1 1 1
rowSums(pred[c("rrsyst", "rrdiast"),])
rrsyst rrdiast
41 36
names(data2)[pred["rrsyst", ] == 1]
Thanks to the smaller dataset and the more compact imputation model,
this code runs about 50 times faster than “blind imputation” as practiced in
Section 9.1. More importantly, the new solution is much better. To illustrate
the latter, take a look at Figure 9.2.
The figure is the scatterplot of rrsyst and rrdiast of the first imputed
dataset. The left-hand figure shows what can happen if the data are not prop-
erly screened. In this particular instance, a forgotten missing data mark of
“−1” was counted as a valid blood pressure value, and produced imputations
that are far off. In contrast, the imputations created with the help of
quickpred() look reasonable.
The plot was created by the following code:
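One panel per imputation can be produced with the xyplot() method for mids objects; the options shown here are assumptions.

library(lattice)
xyplot(imp, rrdiast ~ rrsyst | .imp,
       xlab = "Systolic BP (mmHg)", ylab = "Diastolic BP (mmHg)")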
Figure 9.2: Scatterplot of systolic and diastolic blood pressure from the first
imputation. The left-hand-side plot was obtained after just running mice()
on the data without any data screening. The right-hand-side plot is the result
after cleaning the data and setting up the predictor matrix with quickpred().
Leiden 85+ Cohort data.
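The cumulative hazard can be added as a predictor along these lines; the variable names survda (survival time) and dead (status) are assumptions.

data2$hazard <- nelsonaalen(data2, timevar = survda, statusvar = dead)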
where dead is coded such that “1” means death. The nelsonaalen() function
is part of mice. Table 9.1 lists the correlations between several key variables.
The correlation between H0 (T ) and T is almost equal to 1, so for these data
it matters little whether we take H0 (T ) or T as the predictor. The high corre-
lation may be caused by the fact that nearly everyone in this cohort has died,
so the percentage of censoring is low. The correlation between H0 (T ) and
T could be lower in other epidemiological studies, and thus it might matter
whether we take H0 (T ) or T . Observe that the correlation between log(T ) and
blood pressure is higher than for H0 (T ) or T , so it makes sense to add log(T )
as an additional predictor. This strong relation may have been a consequence
of the design, as the frail people were measured first.
3. Perform a dry run with maxit=0 and inspect the logged events pro-
duced by mice(). Remove any constant and collinear variables before
imputation.
4. Find out what will happen after the data have been imputed. Deter-
mine a set of variables that are important in subsequent analyses, and
include these as predictors in all models. Transform variables to improve
predictability and coherence in the complete-data model.
5. Run quickpred(), and determine values of mincor and minpuc such
that the average number of predictors is around 25.
6. After imputation, determine whether the generated imputations are sen-
sible by comparing them to the observed information, and to knowledge
external to the data. Revise the model where needed.
7. Document your actions and decisions, and obtain feedback from the
owner of the data.
It is most helpful to try out these techniques on data gathered within
your own institute. Some of these steps may not be relevant for other data.
Determine where you need to adapt the procedure to suit your needs.
Figure 9.3: Kaplan–Meier curves of the Leiden 85+ Cohort, stratified ac-
cording to missingness. The figure shows the survival probability since intake
for the group with observed BP measures (blue) and the group with missing
BP measures (red).
data had they been observed. By definition, this guess needs to be based on
external information beyond the data.
Table 9.2: Some variables that have different distributions in the response
(n = 835) and nonresponse groups (n = 121). Shown are rounded percentages.
Significance levels correspond to the χ2 -test.
that the sequence in which the respondents were interviewed was not random.
High-risk groups, that is, elderly in hospitals and nursing homes and those
over 95, were visited first.
Table 9.3 contains the proportion of persons for which BP was not meas-
ured, cross-classified by three-year survival and history of hypertension as
measured during anamnesis. Of all persons who die within three years and
that have no history of hypertension, more than 19% have no BP score. The
rate for other categories is about 9%. This suggests that a relatively large
group of individuals without hypertension and with high mortality risk is
missing from the sample for which BP is known.
Using only the complete cases could lead to confounding by selection. The
complete-case analysis might underestimate the mortality of the lower and
normal BP groups, thereby yielding a distorted impression of the influence of
BP on survival. This reasoning is somewhat tentative as it relies on the use of
hypertension history as a proxy for BP. If true, however, we would expect more
missing data from the lower BP measures. It is known that BP and mortality
are inversely related in this age group, that is, lower BP is associated with
higher mortality. If there are more missing data for those with low BP and
high mortality (as in Table 9.3), selection of the complete cases could blur the
effect of BP on mortality.
9.2.2 Scenarios
The previous section presented evidence that there might be more missing
data for the lower blood pressures. Imputing the data under MAR can only
account for nonresponse that is related to the observed data. However, the
missing data may also be caused by factors that have not been observed.
In order to study the influence of such factors on the final inferences, let us
conduct a sensitivity analysis.
Section 3.8 advocated the use of simple adjustments to the imputed data as
a way to perform sensitivity analysis. Table 3.6 lists possible values for an offset
δ, together with an interpretation whether the value would be (too) small or
(too) large. The next section uses the following range for δ: 0 mmHg (MCAR,
too small), −5 mmHg (small), −10 mmHg (large), −15 mmHg (extreme) and
−20 mmHg (too extreme). The last value is unrealistically low, and is primarily
included to study the stability of the analysis in the extreme.
# delta-adjustment of the imputations for rrsyst (delta values from the text)
delta <- c(0, -5, -10, -15, -20)
imp.all.undamped <- vector("list", length(delta))
post <- mice(data2, maxit = 0)$post      # empty post-processing commands
for (i in seq_along(delta)) {
  d <- delta[i]
  # the i and j inside the string are the sampler's own indices; only d is filled in
  cmd <- paste("imp[[j]][,i] <- imp[[j]][,i] +", d)
  post["rrsyst"] <- cmd
  imp <- mice(data2, pred = pred, post = post, maxit = 10,
              seed = i * 22)
  imp.all.undamped[[i]] <- imp
}
Table 9.4: Realized difference in means of the observed and imputed SBP
(mmHg) data under various δ-adjustments. The number of multiple imputa-
tions is m = 5.
δ Difference
0 −8.2
−5 −12.3
−10 −20.7
−15 −26.1
−20 −31.5
Table 9.5: Hazard ratio estimates (with 95% confidence interval) of the classic
proportional hazards model. The estimates are relative to the reference group
(145–160 mmHg). Rows correspond to different scenarios in the δ-adjustment.
The row labeled “CCA” contains results of the complete-case analysis.
as.vector(exp(summary(pool(fit))[, 1]))
Table 9.5 provides the hazard ratio estimates under the different scenarios
for three SBP groups. A risk ratio of 1.76 means that the mortality risk (after
correction for sex and age) in the group “<125 mmHg” is 1.76 times the risk of
the reference group “145–160 mmHg.” The inverse relation between mortality
and blood pressure in this age group is consistent; even the group with the
highest blood pressures has a (nonsignificant) lower risk.
Though the imputations differ dramatically under the various scenarios,
the hazard ratio estimates for different δ are close. Thus, the results are es-
sentially the same under all specified MNAR mechanisms. Also observe that
the results are close to those from the analysis of the complete cases.
9.2.5 Conclusion
Sensitivity analysis is an important tool for investigating the plausibility
of the MAR assumption. This section explored the use of an informal, simple
and direct method to create imputations under nonignorable models by simply
deducting some amount from the imputations.
Section 3.8.1 discussed shift, scale and shape parameters for nonignorable
models. We only used a shift parameter here, which suited our purposes in
the light of what we knew about the causes of the missing data. In other ap-
plications, scale or shape parameters could be more natural. The calculations
are easily adapted to such cases.
[Figure 9.4: Self-reported minus measured BMI plotted against measured BMI, with areas 1–4 marked.]
bias, the number of persons located in area “4” is generally larger than in area
“2,” leading to underestimation.
There have been many attempts to correct measured height and weight
for bias using predictive equations. These attempts have generally not been
successful. The estimated prevalences were often still found to be too low after
correction. Moreover, there is substantial heterogeneity in the proposed pre-
dictive formulae, resulting in widely varying prevalence estimates. See Visscher
et al. (2006) for a summary of these issues. The current consensus is that it is
not possible to estimate overweight and obesity prevalence from self-reported
data. Dauphinot et al. (2008) even suggested lowering cut-off values for obesity
based on self-reported data.
The goal is to estimate obesity prevalence in the population from self-
reported data. This estimate should be unbiased in the sense that, on aver-
age, it should be equal to the estimate that would have been obtained had
data been truly measured. Moreover, the estimate must be accompanied by a
standard error or a confidence interval.
[Figure 9.5: Measured BMI plotted against self-reported BMI, centered at 30 kg/m², with quadrants 1–4 marked.]
Figure 9.5 plots the data of Figure 9.4 in a different way. The figure is
centered around the BMI of 30 kg/m2 . The two dashed lines divide the area
into four quadrants. Quadrant 1 contains the cases that are obese according
to both BMI values. Quadrant 3 contains the cases that are classified as non-
obese according to both. Quadrant 2 holds the subjects that are classified as
obese according to self-report, but not according to measured BMI. Quad-
rant 4 has the opposite interpretation. The area and quadrant numbers used
in Figures 9.4 and 9.5 correspond to identical subdivisions in the data.
The “true obese” in Figure 9.5 lie in quadrants 1 and 4. The obese accord-
ing to self-report are located in quadrants 1 and 2. Observe that the number
of cases in quadrant 2 is smaller than in quadrant 4, a result of the systematic
bias that is observed in humans. Using uncorrected self-report thus leads to
an underestimate of the true prevalence.
The regression line that predicts measured BMI from self-reported BMI
is added to the display. This line intersects the horizontal line that separates
quadrant 3 from quadrant 4 at a (self-reported) BMI value of 29.4 kg/m2 .
Note that using the regression line to predict obese versus non-obese is in fact
equivalent to classifying all cases with a self-report of 29.4 kg/m2 or higher as
obese. Thus, the use of the regression line as a predictive equation effectively
shifts the vertical dashed line from 30 kg/m2 to 29.4 kg/m2 . Now we can make
the same type of comparison as before. We count the number of cases in
9.3.4 Data
The calibration sample is taken from Krul et al. (2010). The dataset con-
sists of nc = 1257 Dutch subjects with both measured and self-reported data.
The survey sample consists of ns = 803 subjects of a representative sample of
Dutch adults aged 18–75 years. These data were collected in November 2007
either online or using paper-and-pencil methods. The missing data pattern in
the combined data is summarized as
age sex hr wr hm wm
1257 1 1 1 1 1 1 0
803 1 1 1 1 0 0 2
0 0 0 0 803 803 1606
The row containing all ones corresponds to the 1257 observations from the
calibration sample with complete data, whereas the rows with a zero on hm
and wm correspond to 803 observations from the survey sample (where hm and
wm were not measured).
We apply predictive mean matching (cf. Section 3.4) to impute hm and wm in
the 803 records from the survey data. The number of imputations is m = 10. The
complete-data estimates are calculated on each imputed dataset and combined
using Rubin’s pooling rules to obtain prevalence rates and the associated
confidence intervals as in Sections 2.3.2 and 2.4.
9.3.5 Application
The mice() function can be used to create m = 10 multiply imputed
datasets. We imputed measured height, measured weight and measured BMI
using the following code:
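A sketch of this set-up; the seed, the name bm for the measured-BMI column and the other details are assumptions.

bmi <- function(h, w) w / (h / 100)^2            # height in cm, weight in kg
meth <- make.method(data)
meth[c("hm", "wm")] <- "pmm"
meth["bm"] <- "~ bmi(hm, wm)"                    # passive imputation of measured BMI
pred <- make.predictorMatrix(data)
pred[, ] <- 0
pred[c("hm", "wm"), c("age", "sex", "hr", "wr")] <- 1
imp <- mice(data, method = meth, predictorMatrix = pred, m = 10,
            print = FALSE, seed = 66573)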
The code defines a bmi() function for use in passive imputation to calculate
bmi. The predictor matrix is set up so that only age, sex, hr and wr are
permitted to impute hm and wm.
Figure 9.6: Relation between measured BMI and self-reported BMI in the
calibration (blue) and survey (red) data in the first imputed dataset.
Table 9.7: Obesity prevalence estimate (%) and standard error (se) in the
survey data (n = 803), reported (observed data) and corrected (imputed).
Sex    Age     n    Reported %   Reported se   Corrected %   Corrected se
Male   18–29   69       8.7          3.4            9.4          3.9
       30–39   73      11.0          3.7           15.7          5.0
       40–49   66       9.1          3.6           12.5          4.8
       50–59   91      20.9          4.3           25.4          5.2
       60–75  101       7.9          2.7           15.6          4.2
       18–75  400      11.7          1.6           16.0          2.0
9.3.6 Conclusion
Predictive equations to correct for self-reporting bias will only work if
the percentage of explained variance is very high. In the general case, they
have a systematic downward bias, which makes them unsuitable as correction
methods. The remedy is to explicitly account for the residual distribution. We
have done so by applying multiple imputation to impute measured height and
weight. In addition, multiple imputation produces the correct standard errors
of the prevalence estimates.
YA YB
300 1
292 1
6 2
Figure 9.7: Missing data pattern for walking data without a bridge study.
The data Y is a data frame with 604 rows and 2 columns: YA and YB. Figure 9.7
shows that the missing data pattern is unconnected (cf. Section 4.1.1), with
no observations linking YA to YB. There are six records that contain no data
at all.
For this problem, we monitor the behavior of a rank-order correlation,
Kendall’s τ , between YA and YB. This is not a standard facility in mice(), but
we can easily write a small function micemill() that calculates Kendall’s τ
after each iteration as follows.
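A sketch of such a function, together with the call that starts the milling, is given below. It is not the original code: the data frame Y and the items YA and YB follow the text, while the bookkeeping details are assumptions.

library(mice)

micemill <- function(n) {
  for (i in 1:n) {
    imp <<- mice.mids(imp, maxit = 1, printFlag = FALSE)   # one extra MICE iteration
    taus <- sapply(1:imp$m, function(j) {
      com <- complete(imp, j)
      cor(as.numeric(com$YA), as.numeric(com$YB), method = "kendall")
    })
    tau <<- rbind(tau, taus)                               # one row per iteration
  }
}

tau <- NULL
imp <- mice(Y, m = 10, maxit = 0, printFlag = FALSE)       # initialize the sampler
micemill(50)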
This code executes 50 iterations of the MICE algorithm. After any number
of iterations, we may plot the trace lines of the MICE algorithm by
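a call to matplot(), for example (a sketch; the labels are assumptions):

matplot(x = 1:nrow(tau), y = tau, type = "l", lty = 1, col = "black",
        xlab = "Iteration", ylab = "Kendall's tau")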
Figure 9.8 contains the trace plot of 50 iterations. The traces start near
zero, but then freely wander off over a substantial range of the correlation. In
principle, the traces could hit values close to +1 or −1, but that is an extremely
unlikely event. The MICE algorithm obviously does not know where to go,
and wanders pointlessly through parameter space. The reason that this occurs
is that the data contain no information about the relation between YA and
YB .
Despite the absence of any information about the relation between YA and
YB , we can calculate θ̂AB and θ̂BA without a problem from the imputed data.
We find θ̂AB = 0.500 (SD: 0.031), which is very close to θ̂BB (0.497), and
far from the estimate under simple equating (0.807). Likewise, we find θ̂BA =
0.253 (SD: 0.034), very close to θ̂AA (0.243) and far from the estimate under
equating (0.658). Thus, if we perform the analysis without any information
that links the items, we consistently find no difference between the estimates
for populations A and B, despite the huge variation in Kendall’s τ .
Figure 9.8: Trace plot of Kendall's τ between YA and YB over 50 iterations of the MICE algorithm.
We now have two estimates of θ̂AB and θ̂BA. In particular, in Section 9.4.2
we calculated θ̂BA = 0.658 and θ̂AB = 0.807, whereas in the present section
the results are θ̂BA = 0.253 and θ̂AB = 0.500, respectively. Thus, the estimates
differ substantially depending on the assumptions made. The question is
which method yields results that are closer to the truth.
Figure 9.9: Missing data pattern for walking data with a bridge study.
where X contains any relevant covariates, like age and sex, and/or interaction
terms. In other words, the way in which YA depends on YB and X is the same
in sources B and E. Likewise, the way in which YB depends on YA and X
is the same in sources A and E. The inclusion of such covariates allows for
various forms of differential item functioning (Holland and Wainer, 1993).
The two assumptions need critical evaluation. For example, if the respon-
dents in source S = E answered the items in a different language than the
respondents in sources A or B, then the assumption may not be sensible un-
less one has great faith in the translation. It is perhaps better then to search
for a bridge study that is more comparable.
Note that it is only required that the conditional distributions are identi-
cal. The imputations remain valid when the samples have different marginal
distributions. For reasons of efficiency and stability, it is generally advisable to
use samples with similar distributions, but this is not a requirement.
The design is known as the common-item nonequivalent groups design (Kolen
and Brennan, 1995) or the non-equivalent group anchor test (NEAT) design
(Dorans, 2007).
Multiple imputation on the dataset walking is straightforward.
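A sketch of the setup is given below; it reuses the micemill() function from the previous section. The covariate names src, age and sex are taken from the text, and the remaining details are assumptions.

pred <- make.predictorMatrix(walking)
pred[, c("src", "age", "sex")] <- 0            # YA and YB impute each other only
imp <- mice(walking, predictorMatrix = pred, m = 10, maxit = 0,
            printFlag = FALSE)
tau <- NULL
micemill(20)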
The behavior of the trace plot is very different now (cf. Figure 9.10).
After the first few iterations, the trace lines consistently move around a value
of approximately 0.53, with a fairly small range. Thus, after five iterations,
the conditional distributions defined by sample E have percolated into the
imputations for item A (in sample B) and item B (in sample A).
The behavior of the samplers is dependent on the relative size of the bridge
study. In these data, the bridge study is about one third of the total data. If
the bridge study is small relative to the other two data sources, the sampler
may be slow to converge. As a rule of thumb, the bridge study should be
at least 10% of the total sample size. Also, carefully monitor convergence of
the most critical linkages using association measures.
Note that we can also monitor the behavior of θ̂AB and θ̂BA . In order to
calculate θ̂AB after each iteration we add two statements to the micemill()
function:
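(In sketch form; the statements assume that src identifies the sample and that category 0 is the lowest level of YB.)

props <- sapply(1:imp$m, function(j)
  with(complete(imp, j), mean(YB[src == "A"] == levels(factor(YB))[1])))
thetaAB <<- rbind(thetaAB, props)              # proportion of sample A in category 0 of item B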
Figure 9.10: Trace plot of Kendall's τ between YA and YB over 20 iterations for the walking data with a bridge study.
The results are assembled in the variable thetaAB in the workspace.
This variable should be initialized as thetaAB <- NULL before milling.
It is possible that the relation between YA and YB depends on covariates,
like age and sex. If so, including covariates into the imputation model allows
for differential item functioning across the covariates. It is perfectly possible
to change the imputation model between iterations. For example, after the
first 20 iterations (where we impute YA from YB and vice versa) we add age
and sex as covariates, and do another 20 iterations. This goes as follows:
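(A sketch, not the original code.)

pred[c("YA", "YB"), c("age", "sex")] <- 1      # allow differential item functioning
imp <- mice(walking, predictorMatrix = pred, m = 10, maxit = 0,
            printFlag = FALSE)
micemill(20)                                    # iterations 21-40; thetaAB keeps growing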
Figure 9.11: Trace plot of θ̂AB (proportion of sample A that scores in category 0
of item B) after multiple imputation (m = 10), without covariates
(iterations 1–20), and with covariates age and sex as part of the imputation
model (iterations 21–40).
Table 9.9: Disability and health estimates for populations A and B under
four assumptions. θ̂AA and θ̂BA are the item means on item A for samples A
and B, respectively. θ̂AB and θ̂BB are the proportions of cases in category 0
of item B for samples A and B, respectively. MI = multiple imputation.
9.4.6 Interpretation
Figure 9.11 plots the traces of the MICE algorithm, where we calculated θAB,
the proportion of sample A in category 0 of item B. Without covariates, the
proportion is approximately 0.587. Under equating, this proportion was found
to be equal to 0.807 (cf. Section 9.4.2). The difference between the old (0.807)
and the new (0.587) estimate is dramatic. After adding age and sex to the
imputation model, θAB drops further to about 0.534, close to θBB , the estimate
for population B (0.497).
Table 9.9 summarizes the estimates from the four analyses. Large differ-
ences are found between population A and B when we simply assume that
the four categories of both items are identical (simple equating). In this case,
population A appears much healthier by both measures. In contrast, if we use
multiple imputation, the differences between the two populations largely disappear.
9.4.7 Conclusion
Incomparability of data is a key problem in many fields. It is natural for
scientists to adapt, refine and tweak measurement procedures in the hope of
obtaining better data. Frequent changes, however, will hamper comparisons.
Equating categories is widely practiced to “make the data comparable.”
It is often not realized that recoding and equating data amplify differences.
The degree of exaggeration is inversely related to Kendall’s τ . For the item
mean statistic, the difference in mean walking disability after equating is about
twice the size of that under multiple imputation. Also, the estimate of 0.807
after simple equating is a gross overestimate. Overstated differences between
populations may spur inappropriate interventions, sometimes with substan-
tial financial consequences. Unless backed up by appropriate data, equating
categories is not a solution.
This section used multiple imputation as a natural and attractive alterna-
tive. The first major application of multiple imputation addressed issues of
comparability (Clogg et al., 1991). The advantage is that bureau A can inter-
pret the information of bureau B using the scale of bureau A, and vice versa.
The method provides possible contingency tables of items A and B that could
have been observed if both had been measured.
Dorans (2007) describes techniques for creating valid equating tables. Such
tables convert the score of instrument A into that of instrument B, and vice
versa. The requirements for constructing such tables are extremely high: the
measured constructs should be equal, the reliability should be equal, the
conversion of B to A should be the inverse of that from A to B (symmetry),
it should not matter whether A or B is measured, and the table should be
independent of the population. Holland (2007) presents a logical sequence of
linking methods that progressively moves toward higher forms of equating.
Multiple imputation in general fails on the symmetry requirement, as it pro-
duces m scores on B for one score of A, and thus cannot be invertible. The
method as presented here can be seen as a first step toward obtaining such
formal equating tables.
9.5 Exercises
1. Contingency table. Adapt the micemill() function for the walking data
so that it prints out the contingency table of YA and YB of the first
imputation at each iteration. How many statements do you need?
2. Pool τ . Find out what the variance of Kendall’s τ is, and construct its
95% confidence intervals under multiple imputation. Use the auxiliary
function pool.scalar() for pooling.
3. Covariates. Calculate the correlation between age and the items A and
B under two imputation models: one without covariates, and one with
covariates. Which of the correlations is higher? Which solution do you
prefer? Why?
Chapter 10
Selection issues
There are known knowns. These are things we know that we know.
There are known unknowns. That is to say, there are things that
we know we don’t know. But there are also unknown unknowns.
There are things we don’t know we don’t know.
Donald Rumsfeld
This chapter changes the perspective to the rows of the data matrix. An
important consequence of nonresponse and missing data is that the remaining
sample may not be representative any more. Multiple imputation of entire
blocks of variables (Section 10.1) can be useful to adjust for selective loss of
cases in panel and cohort studies. Section 10.2 takes this idea a step further
by appending and imputing new synthetic records to the data. This can also
work for cross-sectional studies.
and another 67 children died between the ages of 28 days and 19 years, leaving
959 survivors at the age of 19 years. Intermediate outcome measures from
earlier follow-ups were available for 89% of the survivors at age 14 (n = 854),
77% at age 10 (n = 712), 84% at age 9 (n = 813), 96% at age 5 (n = 927) and
97% at age 2 (n = 946).
To study the effect of drop-out, Hille et al. (2005) divided the 959 survivors
into three response groups:
1. Full responders were examined at an outpatient clinic and completed
the questionnaires (n = 596);
2. Postal responders only completed the mailed questionnaires (n = 109);
3. Nonresponders did not respond to any of the mailed requests or tele-
phone calls, or could not be traced (n = 254).
Table 10.1: Count (percentage) of various factors for three response groups.
Source: Hille et al. (2005).
                                  All          Full        Postal          Non-
                                           responders   responders    responders
n                                 959          596           109           254
Sex
  Boy                        497 (51.8)   269 (45.1)     60 (55.0)    168 (66.1)
  Girl                       462 (48.2)   327 (54.9)     49 (45.0)     86 (33.9)
Origin
  Dutch                      812 (84.7)   524 (87.9)     96 (88.1)    192 (75.6)
  Non-Dutch                  147 (15.3)    72 (12.1)     13 (11.9)     62 (24.4)
Maternal education
  Low                        437 (49.9)   247 (43.0)     55 (52.9)    135 (68.2)
  Medium                     299 (34.1)   221 (38.5)     31 (29.8)     47 (23.7)
  High                       140 (16.0)   106 (18.5)     18 (17.3)     16 (8.1)
Social economic level
  Low                        398 (42.2)   210 (35.5)     48 (44.4)    140 (58.8)
  Medium                     290 (30.9)   193 (32.6)     31 (28.7)     66 (27.7)
  High                       250 (26.7)   189 (31.9)     29 (26.9)     32 (13.4)
Handicap status at age 14 years
  Normal                     480 (50.8)   308 (51.7)     42 (38.5)    130 (54.2)
  Impairment                 247 (26.1)   166 (27.9)     36 (33.0)     45 (18.8)
  Mild                       153 (16.2)   101 (16.9)     16 (14.7)     36 (15.0)
  Severe                      65 (6.9)     21 (3.5)      15 (13.8)     29 (12.1)
For each outcome, the investigator created a list of potentially relevant pre-
dictors according to the predictor selection strategy set forth in Section 6.3.2.
In total, this resulted in a set of 85 unique variables. Only four of these were
completely observed for all 959 children. Moreover, the information provided
by the investigators was coded (in Microsoft Excel) as an 85 × 85 predictor
matrix that is used to define the imputation model.
Figure 10.1 shows a miniature version of the predictor matrix. The dark
cell indicates that the column variable is used to impute the row variable. Note
the four complete variables with rows containing only zeroes. There are three
blocks of variables. The first nine variables (Set 1: geslacht–sga) are potential
confounders that should be controlled for in all analyses. The second set of
variables (Set 2: grad.t–sch910r) are variables measured at intermediate
time points that appear in specific models. The third set of variables (Set 3:
iq–occrec) are the incomplete outcomes of primary interest collected at the
age of 19 years. The imputation model is defined such that:
1. All variables in Set 1 are used as predictors to impute Set 1, to preserve
relations between them;
2. All variables in Set 1 are used as predictors to impute Set 3, because all
variables in Set 1 appear in the complete-data models of Set 3;
Figure 10.1: The 85 × 85 predictor matrix used in the POPS study. The gray
parts signal the column variables that are used to impute the row variable.
Figure 10.2: Trace lines of the MICE algorithm for the variable a10u illus-
trating problematic convergence.
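The imputations were created by a call of roughly the following form (a sketch: pops and pred are hypothetical names for the POPS data and the 85 × 85 predictor matrix, and m is an assumption):

library(mice)

imp <- mice(pops, predictorMatrix = pred, maxit = 20, m = 5,
            printFlag = FALSE)
plot(imp, "a10u")                               # trace plot as in Figure 10.2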
The number of iterations is set to 20 because the trace lines from the MICE
algorithm show strong initial trends and slow mixing.
Figure 10.2 plots the trace lines of the binary variable a10u, an indicator of
visual disabilities. The behavior of these trace lines looks suspect. The mean
of the imputed values of a10u (left side) converges to
a value near 1.9, while the standard deviation (right side) drops below the
variability that is found in the observed data. Since the categories are coded as 1 = no
problem and 2 = problem, a value of 1.9 actually implies that 90% of the
nonresponders would have a problem. The observed prevalence of a10u in the
full responders is only 1.5%, so 90% is clearly beyond any reasonable value.
In addition, iq and coping move into remote territories. Figure 10.3 illus-
trates that the imputed values for iq appear unreasonably low, whereas for
coping they appear unreasonably high. What is happening here?
The phenomenon we see illustrates a weakness (or feature) of the MICE al-
gorithm that manifests itself when the imputation model is overparametrized
relative to the available information. The source of the problem lies in the
imputation of the variables in Set 3, the measurements at 19 years. We spec-
ified that all variables in Set 3 should impute each other, with the idea of
preserving the multivariate relations between these variables.
Figure 10.3: Values of iq and coping by imputation number under the initial imputation model.
For 254 out of
959 children (26.5%), we do not have any information at age 19. The MICE
algorithm starts out by borrowing information from the group of responders,
and then quickly finds out that it can create imputations that are highly cor-
related. However, the imputed values do not look at all like the observed data,
and are more like multivariate outliers that live in an extreme part of the data
space.
There are several ways to alleviate the problem. The easiest approach is to
remove the 254 nonresponders from the data. This is a sensible approach if the
analysis focuses on determinants of 19-year outcomes, but it is not suitable for
making inferences on the marginal outcome distribution of the entire cohort.
A second approach is to simplify the model. Many of the 19-year outcomes are
categorical, and we reduce the number of parameters drastically by applying
predictive mean matching to these outcomes. A third approach would be to
impute all outcomes as a block. This would find potential donors among the
observed data, and impute all outcomes simultaneously. This removes the
risk of artificially inflating the relations among outcomes, and is a promising
alternative. Finally, the approach we follow here is to simplify the imputation
model by removing the gray block in the lower-right part of Figure 10.1. The
relation between the outcomes would then only be maintained through their
relation with predictors measured at other time points. It is easy to change
and rerun the model as:
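(A sketch; set3 is a hypothetical vector holding the names of the 19-year outcome variables.)

pred2 <- pred
pred2[set3, set3] <- 0                          # Set 3 variables no longer impute each other
imp2 <- mice(pops, predictorMatrix = pred2, maxit = 20, m = 5,
             printFlag = FALSE)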
10.1.5 Results
Table 10.2 provides estimates of the percentage of three health problems,
both uncorrected and corrected for selective drop-out. As expected, all es-
timates are adjusted upward. Note that the prevalence of visual problems
tripled to 4.7% after correction. While this increase is substantial, it is well
within the range of odds ratios of 2.6 and 4.4 reported by Hille et al. (2005).
The adjustment shows that prevalence estimates in the whole group can be
substantially higher than in the group of full responders. Hille et al. (2007)
provide additional and more detailed results.
10.1.6 Conclusion
Many studies are plagued by selective drop-out. Multiple imputation pro-
vides an intuitive way to adjust for drop-out, thus enabling estimation of
statistics relative to the entire cohort rather than the subgroup. The method
assumes MAR. The formulation of the imputation model requires some care.
Section 10.1.3 outlines a simple strategy to specify the predictor matrix to fit
an imputation model for multiple uses. This methodology is easily adapted to
other studies.
Section 10.1.4 illustrates that multiple imputation is not without dangers.
The imputations produced by the initial model were far off, which underlines
the importance of diagnostic evaluation of the imputed data. A disadvantage
of the approach taken to alleviate the problem is that it preserves the relations
between the variables in Set 3 only insofar as they are related through their
common predictors. These relations may thus be attenuated. Some alterna-
tives were highlighted, and an especially promising one is to impute blocks
of variables (cf. Section 4.7.2). Whatever is done, it is important to diag-
nose aberrant algorithmic behavior, and decide on an appropriate strategy to
prevent it given the scientific questions at hand.
10.2.2 Nonresponse
During data collection, it quickly became evident that the response in
children older than 15 years was extremely poor, and sometimes even fell below
20%. Though substantial nonresponse was caused by a lack of interest on the part
of the children, we could not rule out the possibility of selective nonresponse.
For example, overweight children may have been less inclined to participate.
The data collection method was changed in November 2008 so that all children
within a school class were measured. Once a class was selected, nonresponse
among the pupils was generally very small. In addition, children were measured by
Table 10.3: Distribution of the population and the sample over five geographical
regions by age. Numbers are column percentages. Source: Fifth Dutch
Growth Study (Schönbeck et al., 2013).
              0–9 Years           10–13 Years          14–21 Years
Region   Population  Sample   Population  Sample   Population  Sample
North        12         7         12        11         12         4
East         24        28         24        11         24        55
South        23        27         24        31         25        21
West         21        26         20        26         20        15
City         20        12         19        22         19         4
special teams at two high schools, two universities and a youth festival. The
sample was supplemented with data from two studies from Amsterdam and
Zwolle.
sex and age in years. For example, for the combination (North, 0–9 years)
nimp = 400 new records are created as follows. All 400 records have the region
category North. The first 200 records are boys and the last 200 records are
girls. Age is drawn uniformly from the range 0–9 years. The outcomes of
interest, like height and weight, are set to missing. Similar blocks of records
are created for the other five categories of interest, resulting in a total of 1975
new records with complete covariates and missing outcomes.
The following R code creates a dataset of 1975 records, with four complete
covariates (id, reg, sex, age) and four missing outcomes (hgt, wgt, hgt.z,
wgt.z). The outcomes hgt.z and wgt.z are standard deviation scores (SDS),
or Z-scores, derived from hgt and wgt, respectively, standardized for age and
sex relative to the Dutch references (Fredriks et al., 2000a).
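A sketch of one such block is given below for the combination (North, 0–9 years); the remaining five blocks are built analogously and appended. The id range, factor levels and exact age bounds are assumptions.

nimp <- 400
block <- data.frame(
  id    = 900001:(900000 + nimp),              # hypothetical id range
  reg   = "North",
  sex   = rep(c("boy", "girl"), each = nimp / 2),
  age   = runif(nimp, min = 0, max = 10),      # drawn uniformly over the 0-9 year range
  hgt   = NA_real_, wgt   = NA_real_,
  hgt.z = NA_real_, wgt.z = NA_real_)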
Figure 10.5: Height SDS by age and region of Dutch children. Source: Fifth
Dutch Growth Study (n = 10030).
Figure 10.6: Mean height SDS by age for regions North and City, in the
observed data (n = 10030) (blue) and 10 augmented datasets that correct for
the nonresponse (n = 12005).
normal model method norm rather than pmm. If necessary, absolute values in
centimeters (cm) and kilograms (kg) can be calculated after imputation.
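A minimal sketch of the imputation step is given below. The object names data10 (the observed records) and synthetic (the 1975 appended records) are hypothetical, and the original imputation model contains more detail (e.g., interaction terms).

library(mice)

aug  <- rbind(data10, synthetic)
meth <- make.method(aug)
meth[c("hgt.z", "wgt.z")] <- "norm"            # normal model rather than pmm
meth[c("hgt", "wgt")] <- ""                    # back-calculated after imputation
pred <- make.predictorMatrix(aug)
pred[, c("id", "hgt", "wgt")] <- 0
imp  <- mice(aug, method = meth, predictorMatrix = pred,
             m = 10, printFlag = FALSE)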
Figure 10.6 displays mean height SDS per year for regions North and City
in the original and augmented data. The 10 imputed datasets show patterns
in mean height SDS similar to those in the observed data. Because of the lower
sample size, the means for region North are more variable than those for City. Observe
also that the rising pattern in City is reproduced in the imputed data. No
imputations were generated for the ages 10–13 years, which explains why the
means of the imputed and observed data coincide there. The imputations tend to
smooth out sharp peaks at higher ages due to the low number of data points.
Figure 10.7: Final height estimates in Dutch boys and girls from the original
sample (n = 10030) and 10 augmented samples (n = 12005) that correct for
the nonresponse.
10.2.7 Discussion
The application as described here only imputes height and weight in Dutch
children. It is straightforward to extend the method to impute additional
outcomes, like waist or hip circumference.
The method can only correct for covariates whose distributions are known
in both the sample and population. It does not work if nonresponse depends
on factors for which we have no population distribution. However, if we have
possession of nonresponse forms for a representative sample, we may use any
covariates common to the responders and nonresponders to correct for the
nonresponse using a similar methodology. The correction will be more suc-
cessful if these covariates are related to the reasons for the nonresponse.
There are no accepted methods yet to calculate the number of extra records
needed. Here we used 1975 new records to augment the existing 10030 records,
about 16% of the total. This number of artificial records brought the covari-
ate distribution in the augmented sample close to the population distribution
without the need to discard any of the existing records. When the imbalance
grows, we may need a higher percentage of augmentation. The estimates will
then be based on a larger fraction of missing information, and may thus be-
come unstable. Alternatively, we could sacrifice some of the existing records
by taking a random subsample of strata that are overrepresented, but discarding
data is likely to lead to less efficient estimates. It would be interesting to
compare the methodology to traditional weighting approaches.
10.3 Exercises
1. 90th centile. Repeat the analysis in Section 10.2.6 for final height. Study
the effect of omitting the interaction effect from the imputation model.
Are the effects on the 90th centile the same as for the mean?
2. How many records? Section 10.2.4 describes an application in which
incomplete records are appended to create a representative sample. De-
velop a general strategy to determine the number of records needed to
append.
Chapter 11
Longitudinal data
Wide format:
id age Y1 Y2
 1  14 28 22
 2  12 34 16
 3 ...

Long format:
id age  Y
 1  14 28
 1  14 22
 2  12 34
 2  12 16
 3 ...
Note that the concepts of long and wide are general, and also apply to
cross-sectional data. For example, we have seen the long format before in
Section 5.1.3, where it referred to stacked imputed data that was produced
by the complete() function. The basic idea is the same.
Both formats have their advantages. If the data are collected on the same
time points, the wide format has no redundancy or repetition. Elementary
statistical computations like calculating means, change scores, age-to-age cor-
relations between time points, or the t-test are easy to do in this format. The
long format is better at handling irregular and missed visits. Also, the long
format has an explicit time variable available that can be used for analysis.
Graphs and statistical analyses are easier in the long format.
Applied researchers often collect, store and analyze their data in the wide
format. Classic ANOVA and MANOVA techniques for repeated measures and
structural equation models for longitudinal data assume the wide format.
Modern multilevel techniques and statistical graphs, however, work only from
the long format. The distinction between the two formats is a first stumbling
block for those new to longitudinal analysis.
Singer and Willett (2003) advise storing the data in both formats. The
wide and the long formats can easily be converted into one another by means of
the gather() and spread() functions in tidyr (Wickham and Grolemund, 2017).
The wide-to-long conversion can usually be done without a problem. The long-
to-wide conversion can be difficult. If individuals are seen at different times,
direct conversion is impractical. The number of columns in the wide format
becomes overly large, and each column contains many missing values. An ad
hoc solution is to create homogeneous time groups, which then become the
new columns in the wide format. Such regrouping will lead to loss of precision
of the time variable. For some studies this need not be a problem, but for
others it will.
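For the small example above, the conversion might look as follows (a sketch using the tidyr verbs mentioned earlier; newer tidyr versions offer pivot_longer() and pivot_wider() as successors):

library(tidyr)

wide  <- data.frame(id = c(1, 2), age = c(14, 12),
                    Y1 = c(28, 34), Y2 = c(22, 16))
long  <- gather(wide, key = "variable", value = "Y", Y1, Y2)   # wide to long
wide2 <- spread(long, key = "variable", value = "Y")           # and back again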
Multiple imputation is somewhat more convenient in the wide format.
Apart from the fact that the columns are ordered in time, there is nothing
special about the imputation problem. We may thus apply techniques for
single level data to longitudinal data. Section 11.2 discusses an imputation
technique in the wide format in a clinical trial application with the goal of
performing a statistical analysis according to the intention to treat (ITT)
principle. The longitudinal character of the data helped specify the imputation
model.
Multiple imputation of the longitudinal data in the long form can be
done by multilevel imputation techniques. See Chapter 7 for an overview.
Section 11.3 discusses multiple imputation in the long format. The applica-
tion defines a common time raster for all persons. Multiple imputations are
drawn for each raster point. The resulting imputed datasets can be converted
to, and analyzed in, the wide format if desired. This approach is a more prin-
cipled way to deal with the information loss problem discussed previously.
The procedure aligns times to a common raster, hence the name time raster
imputation (cf. Section 11.3).
Table 11.1: SE Fireworks Disaster Study. The UCLA PTSD Reaction Index
of 52 subjects, children and parents, randomized to EMDR or CBT.
id trt pp Y1c Y2c Y3c Y1p Y2p Y3p id trt pp Y1c Y2c Y3c Y1p Y2p Y3p
1 E Y – – – 36 35 38 32 E N 28 17 8 40 42 33
2 C N 45 – – – – – 33 E N – – – 38 22 25
3 E N – – – 13 19 13 34 E N – – – 17 – –
4 C Y – – – 33 27 20 35 E Y 50 20 – 19 1 5
5 E Y 26 6 4 27 16 11 37 C N 30 – 26 59 – 28
6 C Y 8 1 2 32 15 13 38 C Y – – – 35 24 27
7 C Y 41 26 31 – 39 39 39 E N – – – – – –
8 C N – – – 24 13 35 40 E Y 25 5 2 42 13 11
10 C Y 35 27 14 48 23 – 41 E Y 36 11 9 30 2 1
12 C Y 28 15 13 45 33 36 43 E N 17 – – – – –
13 E Y – – – 26 17 14 44 E N 27 – – 40 – –
14 C Y 33 8 9 37 7 3 45 C Y 31 12 29 34 28 29
15 E Y 43 – 7 25 27 1 46 C Y – – – 44 35 25
16 C Y 50 8 35 39 21 34 47 C Y – – – 30 18 14
17 C Y 31 21 10 32 21 19 48 E Y 25 18 – 18 17 2
18 E Y 30 17 16 47 28 34 49 C N 24 23 16 44 29 34
19 E Y 29 6 5 20 14 11 50 E Y 31 13 9 34 18 13
20 E Y 47 14 22 44 21 25 51 C Y – – – 52 13 13
21 C Y 39 12 12 39 5 19 52 C Y 30 35 28 – 44 50
23 C Y 14 12 5 29 9 4 53 C Y 19 33 21 36 21 21
24 E N 27 – – – – – 54 C N 43 – – 48 – –
25 E Y 6 10 5 25 16 16 55 E Y 64 42 35 44 31 16
28 C Y – 2 6 36 17 23 56 C Y – – – 37 6 9
29 E Y 23 23 28 23 25 13 57 C Y 31 12 – 32 26 –
30 E Y – – – 20 23 12 58 E Y – – – 49 28 25
31 C N 15 24 26 33 36 38 59 E Y 39 7 – 39 7 –
Figure 11.1: Missing data patterns for the “per-protocol” group (left) and
the “drop-out” group (right).
refused treatment from a therapist not belonging to his own culture (1). One
child showed spontaneous recovery before treatment started (1).
Comparison between the 14 drop-outs and the 38 completers regarding
presentation at time of initial assessment yielded no significant differences in
any of the demographic characteristics or number of traumatic experiences.
On the symptom scales, only the mean score of the PROPS was marginally
significantly higher for the drop-out group than for the treatment completers
(t = 2.09, df = 48, p = .04).
Though these preliminary analyses are comforting, the best way to analyze
the data is to compare the participants in the groups to which they were
randomized, regardless of whether they received or adhered to the allocated
intervention. Formal statistical testing requires random assignment to groups.
The intention to treat (ITT) principle is widely recommended as the preferred
approach to the analysis of clinical trials. DeMets et al. (2007) and White et al.
(2011a) provide balanced discussions of pros and cons of ITT.
challenging, the real dataset is more complex than this. There are six addi-
tional outcome variables (e.g., the Child Behavior Checklist, or CBCL), each
measured over time and similarly structured as in Table 11.1. In addition,
some of the outcome measures are to be analyzed on both the subscale level
and the total score level. For example, the PTSD-RI has three subscales
(intrusiveness/numbing/avoidance, fear/anxiety, and disturbances in sleep and
concentration) and two additional summary measures (full PTSD and partial
PTSD). All in all, there were 65 variables in the data to be analyzed. Of these, 49
variables were incomplete. The total number of cases was 52, so in order to
avoid grossly overdetermined models, the predictors of the imputation model
should be selected very carefully.
A first strategy for predictor reduction was to preserve all deterministic
relations between columns in the incomplete data. This was done by passive imputation.
For example, let Ya,1p, Yb,1p and Yc,1p represent the scores on three subscales of
the PTSD parent form administered at T1. Each of these is imputed individually.
The total variable Y1p is then imputed by mice in a deterministic way
as the sum score.
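A sketch of this passive sum-score imputation is given below; the data frame name fdd and the column names ypa1, ypb1, ypc1 and yp1 are assumptions.

library(mice)

ini  <- mice(fdd, maxit = 0)
meth <- ini$method
meth["yp1"] <- "~ I(ypa1 + ypb1 + ypc1)"       # parent total score at T1 as sum of subscales
imp  <- mice(fdd, method = meth, m = 5, printFlag = FALSE)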
A second strategy to reduce the number of predictors was to leave out
other outcomes measured at other time points. For example, the rows of the
predictor matrix for imputing Ya,1p, Yb,1p and Yc,1p exclude the corresponding
outcomes measured at T2 and T3 as predictors.
Figure 11.2: Plot of the multiply imputed data of the 13 subjects with one
or more missing values on PTSD-RI parent form.
Figure 11.3: Mean levels of PTSD-RI parent form for the completely ob-
served profiles (blue) and all profiles (black) in the EMDR and CBT groups.
presumably caused by the EMDR and CBT therapies. The shape between
end of treatment (T2) and follow-up (T3) differs somewhat between the groups,
suggesting that EMDR has better long-term effects, but this difference was
not statistically significant. Also note that the complete-case analysis and the
analysis based on ITT are in close agreement with each other here.
We will not go into details here to answer the second research question as
stated on page 313. It is of interest to note that EMDR needed fewer sessions
to achieve its effect. The original publication (De Roos et al., 2011) contains
the details.
Figure 11.4: Mean levels of PTSD-RI Child Form for the completely observed
profiles (blue) and all profiles (black) in the EMDR and CBT groups.
fitted values that have been smoothed. Deriving a change score as the differ-
ence between the fitted curve of the person at T1 and T2 results in values that
are closer to zero than those derived from data that have been observed.
This section describes a technique that inserts pseudo time points into the
observed data of each person. The outcome data at these supplementary time
points are multiply imputed. The idea is that the imputed data can be ana-
lyzed subsequently by techniques for change scores and repeated measures.
The imputation procedure is akin to the process needed to print a photo
in a newspaper. The photo is coded as points on a predefined raster. At
the microlevel there could be information loss, but the scenery is essentially
unaffected. Hence the name time raster imputation. My hope is that this
method will help bridge the gap between modern and classic approaches to
longitudinal data.
3. Short early. There is a large increase between ages 2y and 5y. We would
surely interpret the period 2y–5y as a critical period for this person.
Figure 11.5: Five theoretical BMI SDS trajectories for a person aged 18 years
with a BMI SDS of 1.5 SD.
4. Short late. This is essentially the same as before, but shifted forward in
time.
5. Two critical periods. Here the total increase of 1.5 SD is spread over
two periods. The first occurs at 2y–5y with an increase of 1.0 SD. The
second at 12y–15y with an increase of 0.5 SD.
In practice, mixing between these and other forms will occur.
The objective is to identify any periods during childhood that contribute
to an increase in overweight at adult age. A period is “critical” if
1. change differs between those who are and are not later overweight; and
2. change is associated with the outcome after correction for the measure
at the end of the period.
Both need to hold. In order to solve the problem of irregular age spacing,
De Kroon et al. (2010) use the broken stick model, a piecewise linear growth
curve, as a means to describe individual growth curves at a fixed set of ages.
This section extends this methodology by generating imputations accord-
ing to the broken stick model. The multiply imputed values are then used
to estimate difference scores and regression models that throw light on the
question of scientific interest.
library(splines)
data <- tbc
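The linear B-spline (broken stick) basis X can be constructed along the following lines. This is a sketch: the break ages are taken from the tables in this section, and the exact call is an assumption.

brk <- c(0, 8/365, 1/3, 1, 2, 6, 10, 18, 29)   # 0d, 8d, 4m, 1y, 2y, 6y, 10y, 18y, 29y
k <- length(brk)
X <- bs(data$age, knots = brk[-c(1, k)], degree = 1, intercept = TRUE,
        Boundary.knots = c(brk[1], brk[k] + 0.0001))
colnames(X) <- paste0("x", 1:ncol(X))          # x1, ..., x9
data <- cbind(data, X)
round(head(X, 3), 2)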
x1 x2 x3 x4 x5 x6 x7 x8 x9
[1,] 1.00 0.00 0.00 0 0 0 0 0 0
[2,] 0.27 0.73 0.00 0 0 0 0 0 0
[3,] 0.00 0.83 0.17 0 0 0 0 0 0
Matrix X has only two nonzero elements in each row. Each row sums to 1.
If an observed age coincides with a break age, the corresponding entry is equal
to 1, and all remaining elements are zero. In the data example, this occurs
in the first record, at birth. A small constant of 0.0001 was added to the last
break age. This was done to accommodate a pseudo time point with an
exact age of 29 years, which will be inserted later in Section 11.3.6.
The measurements yi for person i are modeled by the linear mixed effects
model
yi = Xi (β + βi ) + εi                    (11.2)
   = Xi γi + εi
library(lme4)
fit <- lmer(wgt.z ~ 0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 +
              (0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 | id),
            data = data)
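The per-person broken stick estimates can then be assembled as follows (a sketch; the names tsiz and tinc follow the text, but the construction itself is an assumption):

tsiz <- fixef(fit) + t(ranef(fit)$id)          # 9 x n matrix: one column per person
tinc <- diff(tsiz)                             # successive increments per person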
round(head(t(tsiz)), 2)
The γ̂i estimates are found in the variable tsiz. Let δ̂ik = γ̂i,k+1 − γ̂i,k
denote the successive differences (or increments) of the
elements in γ̂i. These are found in the variable tinc. We may interpret δ̂i as
the expected change scores for person i.
The first criterion for a critical period is that change differs between those
who are and are not later overweight. A simple analysis for this criterion is
the Student’s t-test applied to δ̂ik for every period k. The correlations between
δ̂ik at successive k were generally higher than 0.5, so we analyzed uncondi-
tional change scores (Jones and Spiegelhalter, 2009). The second criterion for
a critical period involves fitting two regression models, both of which have
final BMI SDS at adulthood, denoted by γiadult , as their outcome. The two
models are:
Figure 11.6: Broken stick trajectories for Weight SDS from six selected in-
dividuals from the Terneuzen cohort.
proportions during the first year. Child 7646 was born prematurely with a
gestational age of 32 weeks. This individual has an unusually large increase in
weight between birth and puberty. Child 8046 is aberrant with an unusually
large number of weight measurements around the age of 8 days, but was
subsequently not measured for about 1.5 years.
Figure 11.6 also displays the individual broken stick estimates for each
outcome as a line. Observe that the model follows the individual data points
very well. De Kroon et al. (2010) analyzed these estimates by the methods
described at the end of Section 11.3.2, and found that the periods 2y–6y and
10y–18y were most relevant for developing later overweight.
11.3.6 Imputation
The measured outcomes are denoted by Yobs , e.g., weight SDS. For the
moment, we assume that the Yobs are coded in long format and complete,
though neither is an essential requirement. For each person i we append k
records, each of which corresponds to a break age. In R we first set up a time
warping model that connects real age to warped age, and then integrate the
new ages into the data.
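A simplified sketch of this step is given below; the helper name appendbreak() is an assumption, and the age warping is omitted.

appendbreak <- function(data, brk) {
  extra <- expand.grid(id = unique(data$id), age = brk)
  extra <- merge(extra, unique(data[, c("id", "sex")]), by = "id")
  extra[, c("hgt.z", "wgt.z", "bmi.z")] <- NA          # outcomes to be imputed
  out <- rbind(data[, names(extra)], extra[, names(extra)])
  out <- out[order(out$id, out$age), ]
  k <- length(brk)
  X <- bs(out$age, knots = brk[-c(1, k)], degree = 1, intercept = TRUE,
          Boundary.knots = c(brk[1], brk[k] + 0.0001))  # same basis as before
  colnames(X) <- paste0("x", 1:ncol(X))
  cbind(out, X)
}
data2 <- appendbreak(data, brk)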
This step sets the outcome variables to NA, appends the result to the original data and
sorts the result with respect to id and age. The real data are thus mingled
with the supplementary records with missing outcomes. The first few records
of data2 look like this:
head(data2)
Multiple imputation must take into account that the data are clustered
within persons. The setup for mice() requires some care, so we discuss each
step in detail.
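We sketch the setup below (not the original code); the names follow the text.

library(mice)

ini  <- mice(data2, maxit = 0)
meth <- ini$method
meth[1:length(meth)] <- ""
meth[c("hgt.z", "wgt.z", "bmi.z")] <- "2l.pan"   # mice.impute.2l.pan()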
These statements specify that only hgt.z, wgt.z and bmi.z need to be
imputed. For these three outcomes we request the elementary imputation
function mice.impute.2l.pan(), which is designed to impute data with two
levels. See Section 7.6 for more detail.
The setup of the predictor matrix needs some care. We first empty all
entries from the variable pred. The statement pred[Y, "id"] <- (-2) de-
fines variable id as the class variable. The statement pred[Y, "sex"] <- 1
specifies sex as a fixed effect, as usual, while pred[Y, paste("x", 1:9, sep
= "")] <- 2 sets the B-spline basis as a random effect, as in Equation 11.2.
The remaining three statements specify that Y2 is a random effects predictor of
Y1 (and vice versa), and both Y1 and Y2 are random effects predictors of Y3 .
Note that Y3 (BMI SDS) is not a predictor of Y1 or Y2 in order to prevent
the type of convergence problems explained in Section 6.5.2. Note also that
age is not included in order to evade duplication with its B-spline coding. In
summary, there are 12 random effects (9 for age and 3 for the outcomes), one
class variable, and one fixed effect.
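In sketch form, the statements that build this predictor matrix could read:

Y <- c("hgt.z", "wgt.z", "bmi.z")
pred <- ini$predictorMatrix
pred[, ] <- 0
pred[Y, "id"] <- -2                        # class (cluster) variable
pred[Y, "sex"] <- 1                        # fixed effect
pred[Y, paste0("x", 1:9)] <- 2             # random effects: B-spline basis
pred["hgt.z", "wgt.z"] <- 2                # Y2 predicts Y1 ...
pred["wgt.z", "hgt.z"] <- 2                # ... and vice versa
pred["bmi.z", c("hgt.z", "wgt.z")] <- 2    # Y1 and Y2 predict Y3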
The actual imputations are produced by
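a call of roughly the following form (m = 10 matches the ten imputed trajectories shown in Figure 11.7; maxit is an assumption):

imp <- mice(data2, method = meth, predictorMatrix = pred,
            m = 10, maxit = 10, printFlag = FALSE)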
There are over 48000 records. This call takes about 30 minutes to complete,
which is much longer than the other applications discussed in this book. In
the year 2012 the same problem still took over 10 hours, so there is certainly
progress.
Figure 11.7 displays ten multiply imputed trajectories for the six persons
displayed in Figure 11.6. The general impression is that the imputed trajectory
follows the data quite well. At ages where there are many data points (e.g., in
period 0d–1y in person 1259 or in period 8d–1y in person 7460) the curves
are quite close, indicating a relatively large certainty. On the other hand, at
locations where data are sparse (e.g., the period 10y–29y in person 7019, or the
period 8d–2y in person 8046) the curves diverge, indicating a large amount
of uncertainty about the imputation. This effect is especially strong at the
edges of the age range. Incidentally, we noted that the end effects are less
pronounced for larger sample sizes.
It is also interesting to study whether imputation preserves the relation
between height, weight and BMI. Figure 11.8 is a scattergram of height SDS
and weight SDS split according to age that superposes the imputations on the
observed data in the period after the break point. In general the relation in
the observed data is preserved in the imputed data. Note that the imputations
become more variable for regions with fewer data. This is especially visible in
the panel in the upper-right corner at age 29y, where there were no data at
all. Similar plots can be made in combination with BMI SDS. In general, the
data in these plots all behave as one would expect.
[Figure 11.7 shows six panels (persons 1259, 2447, 7019, 7460, 7646 and 8046) of Weight SDS against Age.]
Figure 11.7: Ten multiply imputed trajectories of weight SDS for the same
persons as in Figure 11.6 (in red). Also shown are the data points (in blue).
The broken stick estimates and the time raster imputations in Table 11.2 are
generally similar, so the mean change estimated under both methods is similar.
The p-values in the broken stick method are generally more optimistic relative
to multiple imputation, because the broken stick model ignores the uncertainty
about the estimates.
There is also an effect on the correlations. In general, the age-to-age cor-
relations of the broken stick method are higher than those of the raster imputations.
Table 11.3 provides the age-to-age correlation matrix of BMI SDS esti-
mated from 1745 cases from the Terneuzen Birth Cohort. Apart from the
peculiar values for the age of 8 days, the correlations decrease as the period
between time points increases. The values for the broken stick method are
higher because these do not incorporate the uncertainty of the estimates.
[Figure 11.8 shows scatterplot panels of Weight SDS against Height SDS, one per break age.]
Figure 11.8: The relation between height SDS and weight SDS in the ob-
served (blue) and imputed (red) longitudinal trajectories. The imputed data
occur exactly at the break ages. The observed data come from the period
immediately after the break age. No data beyond 29 years were observed, so
the upper-right panel contains no observed data.
11.4 Conclusion
This chapter described techniques for imputing longitudinal data in both
the wide and long formats. Some things are easier in the wide format, e.g.,
change scores or imputing data, while other procedures are easier in the long
format, e.g., graphics and advanced statistical modeling. It is therefore useful
to have both formats available.
Table 11.2: Mean change per period, split according to adult overweight
(AO) (n = 124) and no adult overweight (NAO) (n = 486) for the broken
stick method and for multiple imputation of the time raster.
Period     Broken stick               Time raster imputation
           NAO     AO    p-value      NAO     AO    p-value
0d–8d    −0.88  −0.80      0.214    −0.93  −0.82      0.335
8d–4m    −0.32  −0.34      0.811    −0.07  −0.11      0.745
4m–1y     0.42   0.62      0.006     0.35   0.58      0.074
1y–2y     0.22   0.28      0.242     0.24   0.26      0.884
2y–6y    −0.36  −0.10     <0.001    −0.35  −0.06      0.026
6y–10y    0.05   0.34     <0.001    −0.01   0.31      0.029
10y–18y   0.09   0.52     <0.001     0.17   0.68      0.009
Table 11.3: Age-to-age correlations of BMI SDS for the broken stick estimates
(lower triangle) and the raster imputations (upper triangle) for the Terneuzen
Birth Cohort (n = 1745).
Age 0d 8d 4m 1y 2y 6y 10y 18y
0d – 0.64 0.20 0.21 0.18 0.17 0.16 0.11
8d 0.75 – 0.30 0.17 0.20 0.20 0.15 0.13
4m 0.28 0.44 – 0.39 0.30 0.29 0.20 0.16
1y 0.28 0.23 0.65 – 0.55 0.40 0.31 0.23
2y 0.31 0.33 0.46 0.76 – 0.56 0.36 0.23
6y 0.31 0.36 0.46 0.59 0.79 – 0.62 0.42
10y 0.26 0.26 0.35 0.47 0.55 0.89 – 0.53
18y 0.23 0.26 0.29 0.37 0.40 0.72 0.89 –
The methodology for imputing data in the wide format is not really differ-
ent from that of cross-sectional data. When possible, always try to convert the
data into the wide format before imputation. If the data have been observed
at irregular time points, as in the Terneuzen Birth Cohort, conversion of the
data into the wide format is not possible, however, and imputation can be
done in the long format by multilevel imputation.
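A minimal sketch of such a conversion in base R, assuming a regular measure-
ment schedule recorded in a column named occasion (all names and values are
illustrative):

long <- data.frame(id       = rep(1:3, each = 2),
                   occasion = rep(1:2, times = 3),
                   wgt.z    = c(0.1, NA, -0.5, -0.4, NA, 1.2))
wide <- reshape(long, idvar = "id", timevar = "occasion",
                direction = "wide")
# wide has one row per person with columns wgt.z.1 and wgt.z.2, and can
# be imputed with standard (cross-sectional) methods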
This chapter introduced time raster imputation, a technique for converting
data with an irregular age spacing into the wide format by means of impu-
tation. Time rastering seems to work well in the sense that the generated
trajectories follow the individual trajectories. The technique is still experi-
mental and may need further refinement before it can be used routinely.
The current method inserts missing data at the full time grid, and thus
imputes data even at time points where there are real observations. One obvi-
ous improvement would be to strip such points from the grid so that they are
not imputed. For example, in the Terneuzen Birth Cohort this means that we
would always take observed birth weight when it is measured.
11.5 Exercises
1. Potthoff–Roy, wide format imputation. Potthoff and Roy (1964) pub-
lished classic data from a study of 16 boys and 11 girls, who at ages 8,
10, 12, and 14 had the distance (mm) from the center of the pituitary
gland to the pteryomaxillary fissure measured. Changes in pituitary-
pteryomaxillary distances during growth are important in orthodontic
therapy. The goals of the study were to describe the distance in boys and
girls as simple functions of age, and then to compare the functions for
boys and girls. The data have been reanalyzed by many authors includ-
ing Jennrich and Schluchter (1986), Little and Rubin (1987), Pinheiro
and Bates (2000), Verbeke and Molenberghs (2000) and Molenberghs
and Kenward (2007).
• Take the version from Little and Rubin (1987) in which nine entries
have been made missing. The missing data have been created such
that children with a low value at age 8 are more likely to have a
missing value at age 10. Use mice() to impute the missing entries
under the normal model using m = 100.
• For each missing entry, summarize the distribution of the 100 im-
putations. Determine the interquartile range of each distribution.
If the imputations fit the data, how many of the original values would
you expect to fall within this range? How many actually do?
• Produce a lattice graph of the nine imputed trajectories that
clearly shows the range of the imputed values.
2. Potthoff–Roy comparison. Use the multiply imputed data from the previ-
ous exercise, and apply a linear mixed effects model with an unstructured
covariance matrix.
• Calculate the broken stick estimates for each child using 8, 10, 12
and 14 as the break ages. Make a graph like Figure 11.6. Each
data point has exactly one parameter, so the fit could be perfect in
principle. Why doesn’t that happen? Which two children show the
largest discrepancies between the data and the model?
• Compare the age-to-age correlation matrix of the broken stick es-
timates to the original data. Why are these correlation matrices
different?
• How would you adapt the analysis such that the age-to-age cor-
relation matrix of the broken stick estimates would reproduce the
age-to-age correlation matrix of the original data? Hint: Think of a
simpler form of multilevel analysis.
• Multiply impute the data according to the method used in Sec-
tion 11.3.6, and produce a display like Figure 11.7 for children 1,
7, 20, 21, 22 and 24.
• Compare the age-to-age correlation matrix from the imputed data
to that of the original data. Are these different? How? Calculate
the correlation matrix after deleting the data from the two children
who showed the largest discrepancy in the broken stick model. Did
this help?
• How would you adapt the imputation method for the longitudinal
data so that its correlation matrix is close to that of the original?
Part IV
Extensions
Chapter 12
Conclusion
This closing chapter starts with a description of the limitations and pit-
falls of multiple imputation. Section 12.2 provides reporting guidelines for
applications. Section 12.3 gives an overview of applications that were omitted
from the book. Section 12.4 contains some speculations about possible future
developments.
• Evaluate whether the imputed data could have been real data if they
had not been missing;
• Take m = 5 for model building, and increase afterward if needed;
• Uncritically accept imputations that are very different from the observed
data.
12.2 Reporting
Section 1.1.2 noted that the attitude toward missing data is changing.
Many aspects related to missing data could potentially affect the conclusions
drawn from the statistical analysis, but not all aspects are equally important.
This leads to the question: What should be reported from an analysis with
missing data?
Guidelines to report the results of a missing data analysis have been given
by Sterne et al. (2009), Enders (2010), National Research Council (2010) and
Mackinnon (2010). These sources vary in scope and comprehensiveness, but
they also exhibit a great deal of overlap and consensus. Section 12.2.1 combines
some of the material found in these sources.
Reviewers or editors may be unfamiliar with, or suspicious of, newer ap-
proaches to handling missing data. Substantive researchers are therefore often
wary about using advanced statistical methods in their reports. Though this
concern is understandable,
. . . resorting to flawed procedures in order to avoid criticism from
an uninformed reviewer or editor is a poor reason for avoiding
sophisticated missing data methodology (Enders, 2010, p. 340)
Until reviewers and referees become more familiar with the newer methods,
a better approach is to add well-chosen and concise explanatory notes. On
the other hand, editors and reviewers are increasingly expecting applied re-
searchers to do multiple imputation, even when the authors had good reasons
for not doing it (e.g., less than 5% incomplete cases) (Ian White, personal
communication).
The natural place to report about the missing data in a manuscript is
the paragraph on the statistical methodology. As scientific articles are often
subject to severe space constraints, part of the report may need to go into
supplementary online materials instead of the main text. Since the addition
of explanatory notes increases the number of words, there needs to be some
balance between the material that goes into the main text and the supplemen-
tary material. In applications that requires novel methods, a separate paper
may need to be written by the team’s statistician. For example, Van Buuren
et al. (1999) explained the imputation methodology used in the substantive
paper by Boshuizen et al. (1998). In general, the severity of the missing data
problem and the method used to deal with the problem need to be part of
the main paper, whereas the precise modeling details could be relegated to
the appendix or to a separate methodological paper.
1. Amount of missing data: What is the number of missing values for each
variable of interest? What is the number of cases with complete data for
the analyses of interest? If people drop out at various time points, break
down the number of participants per occasion (see the sketch after this list).
2. Reasons for missingness: What is known about the reasons for missing
data? Are the missing data intentional? Are the reasons possibly related
to the outcome measurements? Are the reasons related to other variables
in the study?
3. Consequences: Are there important differences between individuals with
complete and incomplete data? Do these groups differ in mean or spread
on the key variables? What are the consequences if complete-case anal-
ysis is used?
4. Method: What method is used to account for missing data (e.g.,
complete-case analysis, multiple imputation)? Which assumptions were
made (e.g., missing at random)? How were multivariate missing data
handled?
5. Software: What multiple imputation software is used? Which settings
differ from the default?
6. Number of imputed datasets: How many imputed datasets were created
and analyzed?
7. Imputation model: Which variables were included in the imputation
model? Was any form of automatic variable selection used? How
were non-normally distributed and categorical variables imputed? How
were design features (e.g., hierarchical data, complex samples, sampling
weights) taken into account?
8. Derived variables: How were derived variables (transformations, recodes,
indices, interaction terms, and so on) taken into account?
9. Diagnostics: How has convergence been monitored? How do the observed
and imputed data compare? Are imputations plausible in the sense that
they could have been measured if they had not been missing?
10. Pooling: How have the repeated estimates been combined (pooled) into
the final estimates? Have any statistics been transformed for pooling?
11. Complete-case analysis: Do multiple imputation and complete-case anal-
ysis lead to similar conclusions? If not, what might explain the
difference?
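The counts asked for in item 1 are easy to obtain; a minimal sketch in R,
assuming an incomplete data frame named data:

library(mice)
colSums(is.na(data))   # number of missing values per variable
md.pattern(data)       # missing data patterns
ncc(data)              # number of complete cases
nic(data)              # number of incomplete cases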
12.2.2 Template
Enders (2010, pp. 340–343) provides four useful templates for reporting
the results of a missing data analysis. These templates include explanatory
notes for uninformed editors and reviewers. It is straightforward to adapt the
template text to other settings. Below I provide a template loosely styled
after Enders that I believe captures the essentials needed to report multiple
imputation in the statistical paragraph of the main text.
This text is about 135 words. If this is too long, then the sentences that
begin with “Methodologists” and “For comparison” can be deleted. In the
paragraphs that describe the results we can add the following sentence:
Table 1 gives the missing data rates of each variable.
cases. See also Gill et al. (1997) for a slightly more extended model. Heit-
jan and Rubin (1990) provided an application where age is misreported, and
the amount of misreporting increases with age itself. Such problems with the
data can be handled by multiple imputation of true age, given reported age
and other personal factors. Heitjan (1993) discussed various other biomedical
examples and an application to data from the Stanford Heart Transplant Pro-
gram. Related work on measurement error is available from several sources
(Brownstone and Valletta, 1996; Ghosh-Dastidar and Schafer, 2003; Yucel
and Zaslavsky, 2005; Cole et al., 2006; Glickman et al., 2008). Goldstein and
Carpenter (2015) formulated joint models for three types of coarsened data.
12.5 Exercises
1. Do’s: Take the list of do’s in Section 12.1.2. For each item on the list,
answer the following questions:
(a) Why is it on the list?
(b) Which is the most relevant section in the book?
(c) Can you order the list of elements from most important to least
important?
(d) What were your reasons for picking the top three?
Aarts, E., Van Buuren, S., and Frank, L. E. (2010). A novel method to obtain
the treatment effect assessed for a completely randomized design: Multiple
imputation of unobserved potential outcomes. Master thesis. University of
Utrecht, Utrecht.
Abayomi, K., Gelman, A., and Levy, M. (2008). Diagnostics for multivariate
imputations. Journal of the Royal Statistical Society C, 57(3):273–291.
Agresti, A. (1990). Categorical Data Analysis. John Wiley & Sons, New York.
Aitkin, M., Francis, B., Hinde, J., and Darnell, R. (2009). Statistical Modelling
in R. Oxford University Press, Oxford.
Akande, O., Li, F., and Reiter, J. P. (2017). An empirical comparison of mul-
tiple imputation methods for categorical data. The American Statistician,
71(2):162–170.
Ake, C. F. (2005). Rounding after multiple imputation with non-binary cat-
egorical covariates. In Proceedings of the SAS Users Group International
(SUGI), volume 30, pages 112–30.
Akl, E. A., Shawwa, K., Kahale, L. A., Agoritsas, T., Brignardello-Petersen,
R., Busse, J. W., Carrasco-Labra, A., Ebrahim, S., Johnston, B. C., Neu-
mann, I., Sola, I., Sun, X., Vandvik, P., Zhang, Y., Alonso-Coello, P., and
Guyatt, G. H. (2015). Reporting missing participant data in randomised
trials: Systematic survey of the methodological literature and a proposed
guide. BMJ Open, 5(12):e008431.
Albert, A. and Anderson, J. A. (1984). On the existence of maximum likeli-
hood estimates in logistic regression models. Biometrika, 71(1):1–10.
Allan, F. E. and Wishart, J. (1930). A method of estimating the yield of
a missing plot in field experiment work. Journal of Agricultural Science,
20(3):399–406.
Allison, P. D. (2005). Imputation of categorical variables with PROC MI.
In Proceedings of the SAS Users Group International (SUGI), volume 30,
pages 113–30.
Allison, P. D. (2010). Survival Analysis Using SAS: A Practical Guide. SAS
Press, Cary, NC, 2nd edition.
Boshuizen, H. C., Izaks, G. J., Van Buuren, S., and Ligthart, G. J. (1998).
Blood pressure and mortality in elderly people aged 85 and older: Commu-
nity based study. British Medical Journal, 316(7147):1780–1784.
Box, G. E. P. and Tiao, G. C. (1973). Bayesian Inference in Statistical Anal-
ysis. John Wiley & Sons, New York.
Brand, J. E. and Xie, Y. (2010). Who benefits most from college? Evidence for
negative selection in heterogeneous economic returns to higher education.
American Sociological Review, 75(2):273–302.
Brand, J. P. L. (1999). Development, Implementation and Evaluation of Mul-
tiple Imputation Strategies for the Statistical Analysis of Incomplete Data
Sets. PhD thesis, Erasmus University, Rotterdam.
Brand, J. P. L., Van Buuren, S., Groothuis-Oudshoorn, C. G. M., and
Gelsema, E. S. (2003). A toolkit in SAS for the evaluation of multiple
imputation methods. Statistica Neerlandica, 57(1):36–45.
Chung, Y., Rabe-Hesketh, S., Dorie, V., Gelman, A., and Liu, J. (2013). A
nondegenerate penalized likelihood estimator for variance parameters in
multilevel models. Psychometrika, 78(4):685–709.
Clogg, C. C., Rubin, D. B., Schenker, N., Schultz, B., and Weidman, L. (1991).
Multiple imputation of industry and occupation codes in census public-
use samples using Bayesian logistic regression. Journal of the American
Statistical Association, 86(413):68–78.
Cochran, W. G. (1977). Sampling Techniques. John Wiley & Sons, New York,
3rd edition.
Cole, S. R., Chu, H., and Greenland, S. (2006). Multiple imputation for mea-
surement error correction. International Journal of Epidemiology, 35:1074–
1081.
Cole, T. J. and Green, P. J. (1992). Smoothing reference centile curves: The
LMS method and penalized likelihood. Statistics in Medicine, 11(10):1305–
1319.
Collins, L. M., Schafer, J. L., and Kam, C. M. (2001). A comparison of
inclusive and restrictive strategies in modern missing data procedures. Psy-
chological Methods, 6(3):330–351.
Conversano, C. and Cappelli, C. (2003). Missing data incremental impu-
tation through tree based methods. In Härdle, W. and Rönz, B., editors,
COMPSTAT 2002: Proceedings in Computational Statistics, pages 455–460,
Heidelberg, Germany. Physica-Verlag.
Conversano, C. and Siciliano, R. (2009). Incremental tree-based missing
data imputation with lexicographic ordering. Journal of Classification,
26(3):361–379.
Cowles, M. K. and Carlin, B. P. (1996). Markov chain Monte Carlo conver-
gence diagnostics: A comparative review. Journal of the American Statis-
tical Association, 91(434):883–904.
Creel, D. V. and Krotki, K. (2006). Creating imputation classes using clas-
sification tree methodology. In Proceeding of the Joint Statistical Meeting
2006, ASA Section on Survey Research Methods, pages 2884–2887, Alexan-
dria, VA. American Statistical Association.
Daniels, M. J. and Hogan, J. W. (2008). Missing Data in Longitudinal Studies.
Strategies for Bayesian Modeling and Sensitivity Analysis. Chapman &
Hall/CRC, Boca Raton, FL.
Dauphinot, V., Wolff, H., Naudin, F., Guéguen, R., Sermet, C., Gaspoz, J.,
and Kossovsky, M. (2008). New obesity body mass index threshold for self-
reported data. Journal of Epidemiology and Community Health, 63(2):128–
132.
Dillman, D. A., Smyth, J. D., and Melani Christian, L. (2008). Internet, Mail,
and Mixed-Mode Surveys: The Tailored Design Method. John Wiley & Sons,
New York, 3rd edition.
Doove, L. L., Van Buuren, S., and Dusseldorp, E. (2014). Recursive parti-
tioning for missing data imputation in the presence of interaction effects.
Computational Statistics & Data Analysis, 72:92–104.
Dorans, N. J. (2007). Linking scores from multiple health outcome instru-
ments. Quality of Life Research, 16(Suppl 1):85–94.
D’Orazio, M., Di Zio, M., and Scanu, M. (2006). Statistical Matching: Theory
and Practice. John Wiley & Sons, Chichester, UK.
Erler, N. S., Rizopoulos, D., Van Rosmalen, J., Jaddoe, V. W. V., Franco,
O. H., and Lesaffre, E. M. (2016). Dealing with missing covariates in epi-
demiologic studies: A comparison between multiple imputation and a full
Bayesian approach. Statistics in Medicine, 35(17):2955–2974.
Fay, R. E. (1992). When are inferences from multiple imputation valid? In
ASA 1992 Proceedings of the Survey Research Methods Section, pages 227–
232, Alexandria, VA.
Fisher, R. A. (1925). Statistical methods for research workers. Oliver & Boyd,
Edinburgh, London.
Fitzmaurice, G. M., Davidian, M., Verbeke, G., and Molenberghs, G. (2009).
Longitudinal Data Analysis. Chapman & Hall/CRC, Boca Raton, FL.
Fitzmaurice, G. M., Laird, N. M., and Ware, J. H. (2011). Applied Longitudinal
Analysis, Second Edition. John Wiley & Sons, New York.
Ford, B. L. (1983). An overview of hot-deck procedures. In Madow, W., Olkin,
I., and Rubin, D. B., editors, Incomplete Data in Sample Surveys, volume 2,
chapter 14, pages 185–207. Academic Press.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian
Data Analysis. Chapman & Hall/CRC, London, 2nd edition.
Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multi-
level/Hierarchical Models. Cambridge University Press, Cambridge.
Gelman, A., Jakulin, A., Grazia Pittau, M., and Su, Y. S. (2008). A weakly in-
formative default prior distribution for logistic and other regression models.
Annals of Applied Statistics, 2(4):1360–1383.
Gelman, A., King, G., and Liu, C. (1998). Not asked and not answered: Mul-
tiple imputation for multiple surveys. Journal of the American Statistical
Association, 93(443):846–857.
Gelman, A. and Meng, X.-L., editors (2004). Applied Bayesian Modeling and
Causal Inference from Incomplete-Data Perspectives. John Wiley & Sons,
Chichester, UK.
Gill, R. D., Van der Laan, M. L., and Robins, J. M. (1997). Coarsening at
random: Characterizations, conjectures and counter-examples. In Lin, D. Y.
and Fleming, T. R., editors, Proceedings of the First Seattle Conference on
Biostatistics, pages 255–294, Berlin. Springer-Verlag.
Gleason, T. C. and Staelin, R. (1975). A proposal for handling missing data.
Psychometrika, 40(2):229–252.
Glickman, M. E., He, Y., Yucel, R. M., and Zaslavsky, A. M. (2008). Mis-
reporting, missing data, and multiple imputation: Improving accuracy of
cancer registry databases. Chance, 21(3):55–58.
Glynn, R. J., Laird, N. M., and Rubin, D. B. (1986). Selection modeling versus
mixture modeling with nonignorable nonresponse. In Wainer, H., editor,
Drawing Inferences from Self-Selected Samples, pages 115–142. Springer-
Verlag.
Glynn, R. J. and Rosner, B. (2004). Multiple imputation to estimate the
association between eyes in disease progression with interval-censored data.
Statistics in Medicine, 23(21):3307–3318.
Goldstein, H. (2011a). Bootstrapping in multilevel models. In Hox, J. and
Roberts, J., editors, The Handbook of Advanced Multilevel Analysis, chap-
ter 9, pages 163–171. Routledge, Milton Park, UK.
Goldstein, H. (2011b). Multilevel Statistical Models. John Wiley & Sons,
Chichester, UK, 4th edition.
Goldstein, H. and Carpenter, J. R. (2015). Multilevel multiple imputation.
In Molenberghs, G., Fitzmaurice, G. M., Kenward, M. G., Tsiatis, A. A.,
and Verbeke, G., editors, Handbook of Missing Data Methodology, pages
295–316. Chapman & Hall/CRC Press, Boca Raton, FL.
Goldstein, H., Carpenter, J. R., and Browne, W. J. (2014). Fitting multilevel
multivariate models with missing data in responses and covariates that may
include interactions and non-linear terms. Journal of the Royal Statistical
Society: Series A, 177(2):553–564.
Goldstein, H., Carpenter, J. R., Kenward, M. G., and Levin, K. A. (2009).
Multilevel models with multivariate mixed response types. Statistical Mod-
elling, 9(3):173–179.
Gomes, M., Gutacker, N., Bojke, C., and Street, A. (2016). Addressing missing
data in patient-reported outcome measures (PROMS): Implications for the
use of PROMS for comparing provider performance. Health Economics,
25(5):515–528.
Gonzalez, J. M. and Eltinge, J. L. (2007). Multiple matrix sampling: A review.
In ASA 2007 Proceedings of the Section on Survey Research Methods, pages
3069–3075, Alexandria, VA.
Goodman, L. A. (1970). The multivariate analysis of qualitative data: Inter-
actions among multiple classifications. Journal of the American Statistical
Association, 65(329):226–256.
Gorber, S. C., Tremblay, M., Moher, D., and Gorber, B. (2007). A comparison
of direct vs. self-report measures for assessing height, weight and body mass
index: A systematic review. Obesity Reviews, 8(4):307–326.
Grund, S., Lüdtke, O., and Robitzsch, A. (2016b). Pooling ANOVA results
from multiply imputed datasets. Methodology, 12(3):75–88.
Grund, S., Lüdtke, O., and Robitzsch, A. (2018a). Multiple imputation of
missing data at level 2: A comparison of fully conditional and joint modeling
in multilevel designs. Journal of Educational and Behavioral Statistics,
doi.org/10.3102/1076998617738087.
Herzog, T. N., Scheuren, F. J., and Winkler, W. E. (2007). Data Quality and
Record Linking Techniques. Springer, New York.
Heymans, M. W., Van Buuren, S., Knol, D. L., Van Mechelen, W., and De Vet,
H. C. W. (2007). Variable selection under multiple imputation using the
bootstrap in a prognostic study. BMC Medical Research Methodology, 7:33.
Dutch project on preterm and small for gestational age infants at 19 years
of age. Pediatrics, 120(3):587–595.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American
Statistical Association, 81(396):945–960.
Holland, P. W. (2007). A framework and history for score linking. In Dorans,
N. J., Pommerich, M., and Holland, P. W., editors, Linking and Aligning
Scores and Scales, chapter 2, pages 5–30. Springer, New York.
Holland, P. W. and Wainer, H., editors (1993). Differential Item Functioning.
Lawrence Erlbaum Associates, Hillsdale, NJ.
Hopke, P. K., Liu, C., and Rubin, D. B. (2001). Multiple imputation for
multivariate data with missing and below-threshold measurements: Time-
series concentrations of pollutants in the Arctic. Biometrics, 57(1):22–33.
Hopman-Rock, M., Dusseldorp, E., Chorus, A. M. J., Jacobusse, G. W.,
Rütten, A., and Van Buuren, S. (2012). Response conversion for improving
comparability of international physical activity data. Journal of Physical
Activity & Health.
Horton, N. J. and Kleinman, K. P. (2007). Much ado about nothing: A com-
parison of missing data methods and software to fit incomplete data regres-
sion models. The American Statistician, 61(1):79–90.
Horton, N. J., Lipsitz, S. R., and Parzen, M. (2003). A potential for bias when
rounding in multiple imputation. The American Statistician, 57(4):229–232.
Hosmer, D. W. and Lemeshow, S. (2000). Applied Logistic Regression. John
Wiley & Sons, New York, 2nd edition.
Hosmer, D. W., Lemeshow, S., and May, S. (2008). Applied Survival Analysis:
Regression Modeling of Time to Event Data. John Wiley & Sons, Hoboken,
NJ, 2nd edition.
Hox, J. J., Moerbeek, M., and Van de Schoot, R. (2018). Multilevel Analysis:
Techniques and Applications. Third Edition. Routledge, New York.
Hron, K., Templ, M., and Filzmoser, P. (2010). Imputation of missing values
for compositional data using classical and robust methods. Computational
Statistics & Data Analysis, 54(12):3095–3107.
Hsu, C. H. (2007). Multiple imputation for interval censored data with aux-
iliary variables. Statistics in Medicine, 26(4):769–781.
Hsu, C. H., Taylor, J. M. G., and Hu, C. (2015). Analysis of accelerated
failure time data with dependent censoring using auxiliary variables via
nonparametric multiple imputation. Statistics in Medicine, 34(19):2768–
2780.
Hsu, C. H., Taylor, J. M. G., Murray, S., and Commenges, D. (2006). Survival
analysis using auxiliary variables via non-parametric multiple imputation.
Statistics in Medicine, 25(20):3503–3517.
Hughes, R. A., White, I. R., Seaman, S. R., Carpenter, J. R., Tilling, K., and
Sterne, J. A. C. (2014). Joint modelling rationale for chained equations.
BMC Medical Research Methodology, 14(1):28.
King, G., Honaker, J., Joseph, A., and Scheve, K. (2001). Analyzing incom-
plete political science data: An alternative algorithm for multiple imputa-
tion. American Political Science Review, 95(1):49–69.
King, G., Murray, C. J. L., Salomon, J. A., and Tandon, A. (2004). Enhanc-
ing the validity and cross-cultural comparability of measurement in survey
research. American Political Science Review, 98(1):191–207.
Klebanoff, M. A. and Cole, S. R. (2008). Use of multiple imputation in the
epidemiologic literature. American Journal of Epidemiology, 168(4):355–
357.
Kleinbaum, D. G. and Klein, M. B. (2005). Survival Analysis: A Self-Learning
Text. Springer-Verlag, New York, 2nd edition.
Kleinke, K. (2017). Multiple imputation under violated distributional assump-
tions: A systematic evaluation of the assumed robustness of predictive mean
matching. Journal of Educational and Behavioral Statistics, 42(4):371–404.
Lee, H., Rancourt, E., and Särndal, C. E. (1994). Experiments with vari-
ance estimation from survey data with imputed values. Journal of Official
Statistics, 10(3):231–243.
Lee, K. J. and Carlin, J. B. (2010). Multiple imputation for missing data: Fully
conditional specification versus multivariate normal imputation. American
Journal of Epidemiology, 171(5):624–632.
Lee, K. J., Galati, J. C., Simpson, J. A., and Carlin, J. B. (2012). Comparison
of methods for imputing ordinal data using multivariate normal imputation:
a case study of non-linear effects in a large cohort study. Statistics in
Medicine, 31(30):4164–4174.
Lee, M., Rahbar, M. H., Brown, M., Gensler, L., Weisman, M., Diekman, L.,
and Reveille, J. D. (2018). A multiple imputation method based on weighted
quantile regression models for longitudinal censored biomarker data with
missing values at early visits. BMC Medical Research Methodology, 18(1):8.
Lesaffre, E. M. and Albert, A. (1989). Partial separation in logistic discrimi-
nation. Journal of the Royal Statistical Society B, 51(1):109–116.
Li, F., Baccini, M., Mealli, F., Zell, E. R., Frangakis, C. E., and Rubin, D. B.
(2014). Multiple imputation by ordered monotone blocks with application
to the anthrax vaccine research program. Journal of Computational and
Graphical Statistics, 23(3):877–892.
Li, F., Yu, Y., and Rubin, D. B. (2012). Imputing missing data by fully con-
ditional models: Some cautionary examples and guideline. Duke University
Department of Statistical Science Discussion Paper, 11-24.
Li, K.-H. (1988). Imputation using Markov chains. Journal of Statistical
Computation and Simulation, 30(1):57–79.
Li, K.-H., Meng, X.-L., Raghunathan, T. E., and Rubin, D. B. (1991a). Signif-
icance levels from repeated p-values with multiply-imputed data. Statistica
Sinica, 1(1):65–92.
Li, K.-H., Raghunathan, T. E., and Rubin, D. B. (1991b). Large-sample sig-
nificance levels from multiply imputed data using moment-based statistics
and an F reference distribution. Journal of the American Statistical Asso-
ciation, 86(416):1065–1073.
Liaw, A. and Wiener, M. (2002). Classification and regression by randomFor-
est. R News, 2(3):18–22.
Licht, C. (2010). New methods for generating significance levels from multiply-
imputed data. PhD thesis, University of Bamberg, Bamberg, Germany.
Liu, J., Gelman, A., Hill, J., Su, Y. S., and Kropko, J. (2013). On the sta-
tionary distribution of iterative imputations. Biometrika, 101(1):155–173.
Lüdtke, O., Marsh, H. W., Robitzsch, A., Trautwein, U., Asparouhov, T., and
Muthén, B. O. (2008). The multilevel latent covariate model: A new, more
reliable approach to group-level effects in contextual studies. Psychological
Methods, 13(3):203.
Lüdtke, O., Robitzsch, A., and Grund, S. (2017). Multiple imputation of
missing data in multilevel designs: A comparison of different strategies.
Psychological Methods, 22(1):141–165.
Lyles, R. H., Fan, D., and Chuachoowong, R. (2001). Correlation coefficient
estimation involving a left censored laboratory assay variable. Statistics in
Medicine, 20(19):2921–2933.
Madow, W. G., Olkin, I., and Rubin, D. B., editors (1983). Incomplete Data
in Sample Surveys, volume 2. Academic Press, New York.
Mallinckrodt, C. H. (2013). Preventing and Treating Missing Data in Lon-
gitudinal Clinical Trials: A Practical Guide. Cambridge University Press,
Cambridge, UK.
Marino, M., Buxton, O. M., and Li, Y. (2017). Covariate selection for multi-
level models with missing data. Stat, 6(1):31–46.
Marker, D. A., Judkins, D. R., and Winglee, M. (2002). Large-scale imputation
for complex surveys. In Groves, R. M., Dillman, D. A., Eltinge, J. L., and
Little, R. J. A., editors, Survey Nonresponse, chapter 22, pages 329–341.
John Wiley & Sons, New York.
Marsh, H. W. (1998). Pairwise deletion for missing data in structural equation
models: Nonpositive definite matrices, parameter estimates, goodness of fit,
and adjusted sample sizes. Structural Equation Modeling, 5(1):22–36.
Marshall, A., Altman, D. G., and Holder, R. L. (2010a). Comparison of im-
putation methods for handling missing covariate data when fitting a Cox
proportional hazards model: A resampling study. BMC Medical Research
Methodology, 10:112.
Marshall, A., Altman, D. G., Royston, P., and Holder, R. L. (2010b). Com-
parison of techniques for handling missing covariate data within prognostic
modelling studies: A simulation study. BMC Medical Research Methodology,
10:7.
Marshall, A., Billingham, L. J., and Bryan, S. (2009). Can we afford to ignore
missing data in cost-effectiveness analyses? European Journal of Health
Economics, 10(1):1–3.
Matsumoto, D. and Van de Vijver, F. J. R., editors (2010). Cross-Cultural
Research Methods in Psychology. Cambridge University Press, Cambridge.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman
& Hall, New York, 2nd edition.
McCulloch, C. E. and Searle, S. R. (2001). Generalized, Linear, and Mixed
Models. John Wiley & Sons, New York.
McKnight, P. E., McKnight, K. M., Sidani, S., and Figueredo, A. J. (2007).
Missing Data: A Gentle Introduction. Guilford Press, New York.
Meng, X.-L. (1994). Multiple imputation with uncongenial sources of input
(with discusson). Statistical Science, 9(4):538–573.
Meng, X.-L. and Rubin, D. B. (1992). Performing likelihood ratio tests with
multiply-imputed data sets. Biometrika, 79(1):103–111.
Netten, A. P., Dekker, F. W., Rieffe, C., Soede, W., Briaire, J. J., and Frijns,
J. H. (2017). Missing data in the field of otorhinolaryngology and head &
neck surgery: Need for improvement. Ear and Hearing, 38(1):1–6.
Neyman, J. (1923). On the application of probability theory to agricultural
experiments. Essay on principles. Section 9. Annals of Agricultural Sciences,
pages 1–51.
Neyman, J. (1935). Statistical problems in agricultural experimentation
(with discussion). Journal of the Royal Statistical Society, Series B,
Suppl.(2):107–180.
Nguyen, C. D., Carlin, J. B., and Lee, K. J. (2017). Model checking in multiple
imputation: an overview and case study. Emerging Themes in Epidemiology,
14(1):8.
Nielsen, S. F. (2003). Proper and improper multiple imputation. International
Statistical Review, 71(3):593–627.
O’Kelly, M. and Ratitch, B. (2014). Clinical Trials with Missing Data: A
Guide for Practitioners. John Wiley & Sons, Chichester, UK.
Olkin, I. and Tate, R. F. (1961). Multivariate correlation models with discrete
and continuous variables. Annals of Mathematical Statistics, 32(2):448–465.
Olsen, M. K. and Schafer, J. L. (2001). A two-part random effects model
for semicontinuous longitudinal data. Journal of the American Statistical
Association, 96(454):730–745.
Orchard, T. and Woodbury, M. A. (1972). A missing information principle:
Theory and applications. In Proceedings of the Sixth Berkeley Symposium
on Mathematical Statistics and Probability, volume 1, pages 697–715.
Palmer, M. J., Mercieca-Bebber, R., King, M., Calvert, M., Richardson, H.,
and Brundage, M. (2018). A systematic review and development of a clas-
sification framework for factors associated with missing patient-reported
outcome data. Clinical Trials, 15(1):95–106.
Pan, W. (2000). A multiple imputation approach to Cox regression with
interval-censored data. Biometrics, 56(1):199–203.
Pan, W. (2001). A multiple imputation approach to regression analysis
for doubly censored data with application to AIDS studies. Biometrics,
57(4):1245–1250.
Parker, R. (2010). Missing Data Problems in Machine Learning. VDM Verlag
Dr. Müller, Saarbrücken, Germany.
Peng, Y., Little, R. J. A., and Raghunathan, T. E. (2004). An extended general
location model for causal inferences from data subject to noncompliance and
missing values. Biometrics, 60(3):598–607.
Steel, R. J., Wang, N., and Raftery, A. E. (2010). Inference from multiple
imputation for missing data using mixtures of normals. Statistical Method-
ology, 7(10):351–365.
Steinberg, A. M., Brymer, M. J., Decker, K. B., and Pynoos, R. S. (2004).
The University of California at Los Angeles post-traumatic stress disorder
reaction index. Current Psychiatry Reports, 6(2):96–100.
Taljaard, M., Donner, A., and Klar, N. (2008). Imputation strategies for miss-
ing continuous outcomes in cluster randomized trials. Biometrical Journal,
50(3):329–345.
Tang, L., Unützer, J., Song, J., and Belin, T. R. (2005). A comparison of
imputation methods in a longitudinal randomized clinical trial. Statistics
in Medicine, 24(14):2111–2128.
Templ, M., Kowarik, A., and Filzmoser, P. (2011b). Iterative stepwise re-
gression imputation using standard and robust methods. Computational
Statistics & Data Analysis, 55(10):2793–2806.
Therneau, T. M., Atkinson, B., and Ripley, B. D. (2017). rpart: Recursive
Partitioning and Regression Trees. R package version 4.1-11.
Tian, G.-L., Tan, M. T., Ng, K. W., and Tang, M.-L. (2009). A unified
method for checking compatibility and uniqueness for finite discrete condi-
tional distributions. Communications in Statistics - Theory and Methods,
38(1):115–129.
Van Buuren, S. and Tennant, A., editors (2004). Response Conversion for the
Health Monitoring Program, volume 04 145. TNO Quality of Life, Leiden.
Van Buuren, S. and Van Rijckevorsel, J. L. A. (1992). Imputation of miss-
ing categorical data by maximizing internal consistency. Psychometrika,
57(4):567–580.
Van Buuren, S., Van Rijckevorsel, J. L. A., and Rubin, D. B. (1993). Multiple
imputation by splines. In Bulletin of the International Statistical Institute,
volume II (CP), pages 503–504.
Van der Palm, D. W., Van der Ark, L. A., and Vermunt, J. K. (2016a). A
comparison of incomplete-data methods for categorical data. Statistical
Methods in Medical Research, 25(2):754–774.
Van der Palm, D. W., Van der Ark, L. A., and Vermunt, J. K. (2016b). Divisive
latent class modeling as a density estimation method for categorical data.
Journal of Classification, 33:52–72.
Van Deth, J. W., editor (1998). Comparative Politics. The Problem of Equiv-
alence. Routledge, London.
Van Ginkel, J. R. and Kroonenberg, P. M. (2014). Analysis of variance of
multiply imputed data. Multivariate Behavioral Research, 49(1):78–91.
Van Ginkel, J. R., Van der Ark, L. A., and Sijtsma, K. (2007). Multiple
imputation for item scores when test data are factorially complex. British
Journal of Mathematical and Statistical Psychology, 60(2):315–337.
Van Praag, B. M. S., Dijkstra, T. K., and Van Velzen, J. (1985). Least-squares
theory based on general distributional assumptions with an application to
the incomplete observations problem. Psychometrika, 50(1):25–36.
Van Wouwe, J. P., Lanting, C. I., Van Dommelen, P., Treffers, P. E., and
Van Buuren, S. (2009). Breastfeeding duration related to practised contra-
ception in The Netherlands. Acta Paediatrica, 98(1):86–90.
Vandenbroucke, J. P., Von Elm, E., Altman, D. G., Gotzsche, P. C., Mul-
row, C. D., Pocock, S. J., Poole, C., Schlesselman, J. J., and Egger, M.
(2007). Strengthening the reporting of observational studies in epidemiol-
ogy (STROBE): Explanation and elaboration. Annals of Internal Medicine,
147(8):W163–94.
Vateekul, P. and Sarinnapakorn, K. (2009). Tree-based approach to missing
data imputation. In 2009 IEEE International Conference on Data Mining
Workshops, pages 70–75. IEEE Computer Society.
Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S.
Springer-Verlag, New York, 4th edition.
Verbeke, G. and Molenberghs, G. (2000). Linear Mixed Models for Longitu-
dinal Data. Springer, New York.
Vergouwe, Y., Royston, P., Moons, K. G. M., and Altman, D. G. (2010).
Development and validation of a prediction model with missing predictor
data: A practical approach. Journal of Clinical Epidemiology, 63(2):205–
214.
Verloove-Vanhorick, S. P., Verwey, R. A., Brand, R., Bennebroek Gravenhorst,
J., Keirse, M. J. N. C., and Ruys, J. H. (1986). Neonatal mortality risk in
relation to gestational age and birthweight: Results of a national survey
of preterm and very-low-birthweight infants in The Netherlands. Lancet,
1(8472):55–57.
Vermunt, J. K., Van Ginkel, J. R., Van der Ark, L. A., and Sijtsma, K. (2008).
Multiple imputation of incomplete categorical data using latent class anal-
ysis. Sociological Methodology, 38(1):369–397.
Viallefont, V., Raftery, A. E., and Richardson, S. (2001). Variable selection
and Bayesian model averaging in case-control studies. Statistics in Medicine,
20(21):3215–3230.
Vidotto, D. (2018). Bayesian latent class models for the multiple imputation
of cross-sectional, multilevel and longitudinal categorical data. PhD thesis,
Tilburg University, Tilburg, The Netherlands.
Vidotto, D., Vermunt, J. K., and Kaptein, M. C. (2015). Multiple imputation
of missing categorical data using latent class models: State of art. Psycho-
logical Test and Assessment Modelling, 57(4):542–576.
Vink, G., Frank, L. E., Pannekoek, J., and Van Buuren, S. (2014). Predictive
mean matching imputation of semicontinuous variables. Statistica Neer-
landica, 68(1):61–90.
Vink, G., Lazendic, G., and Van Buuren, S. (2015). Partitioned predictive mean
matching as a large data multilevel imputation technique. Psychological Test
and Assessment Modeling, 57(4):577–594.
Waljee, A. K., Mukherjee, A., Singal, A. G., Zhang, Y., Warren, J., Balis,
U., Marrero, J., Zhu, J., and Higgins, P. D. R. (2013). Comparison of
imputation methods for missing laboratory data in medicine. BMJ open,
3(8):e002847.
Wallace, M. L., Anderson, S. J., and Mazumdar, S. (2010). A stochastic
multiple imputation algorithm for missing covariate data in tree-structured
survival analysis. Statistics in Medicine, 29(29):3004–3016.
Walls, T. A. and Schafer, J. L., editors (2006). Models for Intensive Longitu-
dinal Data. Oxford University Press, Oxford.
Weisberg, H. I. (2010). Bias and causation: Models and judgment for valid
comparisons. John Wiley & Sons, Hoboken, NJ.
White, I. R. and Carlin, J. B. (2010). Bias and efficiency of multiple impu-
tation compared with complete-case analysis for missing covariate values.
Statistics in Medicine, 29(28):2920–2931.
White, I. R., Daniel, R., and Royston, P. (2010). Avoiding bias due to per-
fect prediction in multiple imputation of incomplete categorical variables.
Computational Statistics & Data Analysis, 54(10):2267–2275.
White, I. R., Horton, N. J., Carpenter, J. R., and Pocock, S. J. (2011a).
Strategy for intention to treat analysis in randomised trials with missing
outcome data. British Medical Journal, 342:d40.
White, I. R. and Royston, P. (2009). Imputing missing covariate values for
the Cox model. Statistics in Medicine, 28(15):1982–1998.
Wood, A. M., White, I. R., and Royston, P. (2008). How should variable
selection be performed with multiply imputed data? Statistics in Medicine,
27(17):3227–3246.
Wood, A. M., White, I. R., and Thompson, S. G. (2004). Are missing outcome
data adequately handled? A review of published randomized controlled tri-
als in major medical journals. Clinical Trials, 1(4):368–376.
Wu, L. (2010). Mixed Effects Models for Complex Data. Chapman & Hall
/CRC, Boca Raton, FL.
Wu, W., Jia, F., and Enders, C. K. (2015). A comparison of imputation
strategies for ordinal missing data on Likert scale variables. Multivariate
Behavioral Research, 50(5):484–503.
Yang, X., Belin, T. R., and Boscardin, W. J. (2005). Imputation and variable
selection in linear regression models with missing covariates. Biometrics,
61(2):498–506.
Yao, Y., Chen, S.-C., and Wang, S.-H. (2014). On compatibility of discrete
full conditional distributions: A graphical representation approach. Journal
of Multivariate Analysis, 124:1–9.
Yates, F. (1933). The analysis of replicated experiments when the field results
are incomplete. Empire Journal of Experimental Agriculture, 1(2):129–
142.
Yu, L.-M., Burton, A., and Rivero-Arias, O. (2007). Evaluation of software for
multiple imputation of semi-continuous data. Statistical Methods in Medical
Research, 16(3):243–258.
Yu, M., Reiter, J. P., Zhu, L., Liu, B., Cronin, K. A., and Feuer, E. J. (2017).
Protecting confidentiality in cancer registry data with geographic identifiers.
American Journal of Epidemiology, 186(1):83–91.
Yusuf, S., Zucker, D., Passamani, E., Peduzzi, P., Takaro, T., Fisher, L.,
Kennedy, J. W., Davis, K., Killip, T., and Norris, R. (1994). Effect of
coronary artery bypass graft surgery on survival: overview of 10-year re-
sults from randomised trials by the coronary artery bypass graft surgery
trialists collaboration. The Lancet, 344(8922):563–570.
Yuval, N. (2014). Sapiens. Random House, New York.
Zhou, X. H., Zhou, C., Lui, D., and Ding, X. (2014). Applied Missing Data
Analysis in the Health Sciences. John Wiley & Sons, Chichester, UK.
Zhu, J. (2016). Assessment and Improvement of a Sequential Regression Mul-
tivariate Imputation Algorithm. PhD thesis, University of Michigan.
Zhu, J. and Raghunathan, T. E. (2015). Convergence properties of a sequen-
tial regression multiple imputation algorithm. Journal of the American
Statistical Association, 110(511):1112–1124.