
Expert Systems With Applications 227 (2023) 120201

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

Deep learning versus conventional methods for missing data imputation: A review and comparative study

Yige Sun a, Jing Li a, Yifan Xu b, Tingting Zhang c, Xiaofeng Wang d,∗

a Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, USA
b Meta Platforms, Inc., Menlo Park, CA 94025, USA
c Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, USA
d Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH 44195, USA

ARTICLE INFO

Keywords:
Missing data imputation
Deep learning
Generative networks
MICE
MissForest

ABSTRACT

Deep learning models have recently been proposed for missing data imputation. In this paper, we review the popular statistical, machine learning, and deep learning approaches, and discuss the advantages and disadvantages of these methods. We conduct a comprehensive numerical study to compare the performance of several widely used imputation methods for incomplete tabular (structured) data. Specifically, we compare the deep learning methods, namely generative adversarial imputation networks (GAIN) with one-hot encoding, GAIN with embedding, variational auto-encoders (VAE) with one-hot encoding, and VAE with embedding, versus two conventional methods: multiple imputation by chained equations (MICE) and missForest. Seven real benchmark datasets and three simulated datasets are considered, covering various scenarios with different feature types under different sample sizes. The missing data are generated based on different missing ratios and three kinds of missing mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Our experiments show that, for small or moderate sample sizes, the conventional methods establish better robustness and imputation performance than the deep learning methods. GAINs only perform well in the case of MCAR and often fail in the cases of MAR and MNAR. VAEs are prone to mode collapse under all missing mechanisms. We conclude that the conventional methods, MICE and missForest, are preferable for practitioners dealing with missing data imputation for tabular data with a limited sample size (i.e., n < 30,000) in real case analyses.

1. Introduction

Missing data commonly exist in a wide range of applications due to human errors, data processing issues, or cases where the relevant facts are not observed or not available. Missingness creates problems in data analyses and predictive modeling. Data imputation is an established practice to resolve the issue, i.e., estimating missing values from non-missing values in the dataset. Missing data imputation has been an active research area in both statistics and machine learning for a few decades (Rubin, 1976).

The validity and effectiveness of imputation strategies are affected by the missing mechanism, formalized by Rubin and colleagues (Little & Rubin, 2019; Rubin, 1976), which describes the underlying process that generates missing data and falls into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR means that there is no relationship between missingness and either observed or unobserved variables. MCAR is a good practical starting point for imputation analysis because of its convenient hypothesis. However, it is not applicable to most real situations, since actual missingness generally involves complex relationships among observed variables and can be potentially affected by unobserved causes. In contrast to MCAR, MAR occurs when the missingness is still random but depends on the observed variables. MAR is more general and more realistic than MCAR. Most statistical missing data imputation methods start from the MAR assumption. If neither MCAR nor MAR holds, we speak of MNAR. It means that the probability that an element is missing depends on the unobserved value of the missing elements. Specifically, MNAR happens if (1) the missing value influences the probability of missingness or (2) a certain unmeasured quantity predicts the value of the missing variable and/or the probability of missingness. For example, censored data in survival studies fall into this category. Rubin's

∗ Correspondence to: Xiaofeng Wang, Ph.D., Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, 9500 Euclid
Ave/JJN3, Cleveland, OH 44195, USA.
E-mail addresses: [email protected] (Y. Sun), [email protected] (J. Li), [email protected] (Y. Xu), [email protected] (T. Zhang), [email protected]
(X. Wang).

https://doi.org/10.1016/j.eswa.2023.120201
Received 6 August 2022; Received in revised form 15 March 2023; Accepted 15 April 2023
Available online 26 April 2023
0957-4174/© 2023 Elsevier Ltd. All rights reserved.
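The three mechanisms described in the introduction can be made concrete with a small sketch. The masking rules below (a simple logistic rule for MAR and MNAR) are illustrative choices for exposition, not the paper's actual generators, which are described in Section 3.3:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))          # complete data: two continuous features

# MCAR: missingness of feature 0 is independent of the data.
m_mcar = rng.random(1000) < 0.3

# MAR: missingness of feature 0 depends only on the OBSERVED feature 1.
p_mar = 1 / (1 + np.exp(-X[:, 1]))      # logistic in the observed column
m_mar = rng.random(1000) < p_mar

# MNAR: missingness of feature 0 depends on feature 0's own (unobserved) value.
p_mnar = 1 / (1 + np.exp(-X[:, 0]))
m_mnar = rng.random(1000) < p_mnar

X_obs = X.copy()
X_obs[m_mnar, 0] = np.nan               # apply, e.g., the MNAR mask
```

Under the MNAR rule, large values of feature 0 are deleted more often, so the observed part of that column is biased downward; this is precisely the setting in which methods that assume MAR can fail.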

distinction of these missing mechanisms is important for understanding why some methods will work and others will not.

Many methods and strategies have been developed to address the missing data problem. Generally, we can classify the available techniques into three categories: conventional statistical methods, machine learning methods, and (newly developed) deep learning methods. Popular statistical methods include mean/median imputation, regression imputation, and expectation–maximization (EM). These methods are single imputation approaches, which are simple but do not account for the uncertainty of the prediction of missing values. Multiple imputation was introduced by Rubin (1978) and refined in Little and Rubin (2019) and Rubin (2004). It is a powerful imputation method that originates from a Bayesian analysis of a large-scale survey. Multiple imputation first creates several copies of the data set, each containing different imputed values. The imputation is then carried out on each data set using the same procedures. Finally, analyzing each data set separately yields multiple sets of parameter estimates and standard errors, and these multiple sets of results are combined into a single set of results. Multivariate imputation by chained equations (MICE) (Buuren & Groothuis-Oudshoorn, 2011), sometimes called "fully conditional specification", is one of the most popular and robust methods of multiple imputation. Machine learning-based imputation methods have also been proposed and studied; well-known examples include K-nearest neighbors (Batista & Monard, 2002), the traditional feed-forward neural network (Gupta & Lam, 1996; Sharpe & Solly, 1995), and missForest (Stekhoven & Bühlmann, 2012), among others. Many machine learning methods involve creating a predictive model to estimate missing values from the available information in the dataset. For example, missForest is an imputation method based on a random forest. By averaging over many unpruned classification or regression trees, missForest intrinsically constitutes a multiple imputation scheme.

In recent years, advances in deep learning models have motivated a suite of new imputation methods. Generative adversarial imputation nets (GAIN) (Yoon, Jordon, & Schaar, 2018) were proposed for missing data imputation using the generative adversarial network (GAN). A method of multiple imputation using denoising auto-encoders (MIDA) was built from a de-noised auto-encoder (Gondara & Wang, 2018; Lu, Perrone, & Unpingco, 2020; Vincent, Larochelle, Bengio, & Manzagol, 2008). Several algorithms based on variational auto-encoders (VAE) were also suggested for imputation (Camino, Hammerschmidt, & State, 2019; McCoy, Kroon, & Auret, 2018; Qiu, Zheng, & Gevaert, 2020). The deep ladder imputation network (DLIN) is a novel deep learning imputation method that incorporates the advantages of denoising auto-encoders and the ladder architecture into an innovative formulation (Hallaji, Razavi-Far, & Saif, 2021). Although most of these deep learning methods demonstrated improved performance over traditional methods in certain simulation settings, there is no comprehensive comparison of these state-of-the-art methods versus conventional statistical and machine learning methods for data with different types of variables under different missing mechanisms.

This paper reviews the popular statistical, machine learning, and deep learning approaches, and discusses the advantages and disadvantages of these methods. A comparative numerical study is conducted to compare various methods under different scenarios. GAIN and VAE are chosen as the representatives of deep learning methods to compare with the widely used statistical method, MICE, and the machine learning method, missForest. The two selected conventional methods have been shown to outperform other common imputation techniques in terms of imputation error and maintenance of predictive ability (Shah, Bartlett, Carpenter, Nicholas, & Hemingway, 2014; Waljee et al., 2013). We design numerical experiments to compare the aforementioned methods in comprehensive settings by varying the types of data sets (continuous, categorical, and mixed-type), the missing mechanisms, the levels of correlation, and the missing ratios.

The rest of the paper is organized as follows. Section 2 reviews the popular statistical, machine learning, and deep learning based methods. Section 3 presents the setting of the numerical experiments that compare the performances of the recently developed deep learning methods, GAIN and VAE, versus the widely used and accepted methods, MICE and missForest. Section 4 summarizes the results of the numerical study. Section 5 provides a discussion of our findings, where the advantages and disadvantages of the different missing data imputation approaches are addressed. Concluding remarks and recommendations are given in Section 6. All software code files and example datasets in this study are available on GitHub: https://github.com/EagerSun/A-comparison-of-Deep-Learning-and-conventional-statistical-methods-for-missing-data-imputation.

2. Review of missing data imputation methods

2.1. Problem definition

Assume that a d-dimensional random variable X = (X_1, …, X_d) (continuous or categorical) follows a probability distribution p(X), and there is a d-dimensional variable M = (M_1, …, M_d) taking values in {0, 1}^d. We call M the mask variable. We further define X^M = (X_1^M, …, X_d^M) such that

X_k^M = X_k if M_k = 1, and X_k^M = ∅ if M_k = 0,  for k = 1, …, d.  (1)

Here ∅ represents an unobserved value. The mask variable M is an indicator variable that indicates which elements of X are missing. In the missing data imputation problem, we observe a sample {x_1^M, …, x_n^M} and {m_1, …, m_n}. The goal is to impute the unobserved values in {x_1^M, …, x_n^M}.

2.2. Conventional statistical methods

2.2.1. Mean/median/mode imputation

For a continuous variable, mean/median imputation is a simple method in which the mean or median of the observed values for each variable is computed and the missing values for that variable are imputed by this mean or median. For a categorical variable, the missing values for that variable are imputed by the mode of the observed values.

2.2.2. Regression imputation

Regression imputation assumes a linear relationship between variables: the value of one variable changes in a linear way with the other variables. Mean imputation can be viewed as the simplest application of regression imputation. Regression imputation generally consists of two steps: a linear regression model is estimated on the basis of the observed values in the target variable and some explanatory variables; the model is then used to predict values for the missing cases in the target variable. Missing values of the variable are replaced based on these predictions.

There are two types of regression imputation: (1) deterministic regression imputation, which replaces missing values with the exact prediction of the regression model; random variation around the regression slope is not considered, hence imputed values are often too precise; and (2) stochastic regression imputation, which adds a random error term to the value predicted by deterministic regression imputation and thereby resolves the above issue.

2.2.3. Expectation–maximization

The expectation–maximization (EM) algorithm is a general method for obtaining maximum likelihood estimates when data are missing (Dempster, Laird, & Rubin, 1977). It is an iterative procedure


in which it uses other variables to impute a value (expectation) and then checks whether that is the most likely value (maximization). Each iteration includes two steps: (1) the expectation step (E-step) uses the current estimate of the parameter to impute the (expectation of the) missing data; (2) the maximization step (M-step) uses the updated data from the E-step to find a maximum likelihood estimate of the parameter. The iterative process continues until there is convergence in the parameter estimates. EM imputation is generally better than mean/median imputation since it preserves the relationships with other variables.

2.2.4. MICE

Multiple imputation has several advantages over the above single imputation approaches. It involves filling in the missing values multiple times by creating multiple imputed datasets. The analyses of multiply imputed datasets take the uncertainty of imputation into account and yield the standard errors of estimates. MICE is one of the most popular multiple imputation techniques (Buuren & Groothuis-Oudshoorn, 2011). It operates under the assumption that the missing data are MAR.

We assume that the multivariate distribution p(X|θ) of X is completely specified by θ, a vector of unknown parameters. The problem is how to estimate the distribution of θ. The MICE algorithm is a Gibbs sampler that estimates the posterior distribution of θ by sampling iteratively from the conditional distributions p(X_1|X_{−1}, θ_1), …, p(X_k|X_{−k}, θ_k), …, p(X_d|X_{−d}, θ_d), where X_{−k} = (X_1, …, X_{k−1}, X_{k+1}, …, X_d) denotes the collection of the d − 1 variables in X except X_k, and the parameters θ_1, …, θ_d are specific to the corresponding conditional densities.

In conventional applications of the Gibbs sampler, the full conditional distributions are derived from the joint probability distribution. However, in MICE, the joint distribution is only implicitly known and may not actually exist, so the conditional densities are not necessarily the product of a factorization of the joint distribution p(X|θ). Starting from a simple draw from the observed marginal distributions, the Gibbs sampler successively draws

θ̂_1^(t) ∼ p(θ_1 | X_1^obs, X_2^(t−1), …, X_d^(t−1)),
X̂_1^(t) ∼ p(X_1 | X_1^obs, X_2^(t−1), …, X_d^(t−1), θ̂_1^(t)),
…
θ̂_d^(t) ∼ p(θ_d | X_d^obs, X_1^(t), …, X_{d−1}^(t)),
X̂_d^(t) ∼ p(X_d | X_d^obs, X_1^(t), …, X_{d−1}^(t), θ̂_d^(t)),

where X_k^obs and X̂_k^(t) stand for the observed and imputed data for the kth variable at iteration t, and X_k^(t) = (X_k^obs, X̂_k^(t)). The convergence of this algorithm is typically fast.

Buuren and Groothuis-Oudshoorn (2011) presented an R package, mice, which extends the standard MICE algorithm in several ways. The new functionalities include imputing multilevel data, automatic predictor selection, post-processing of imputed values, specialized pooling routines, model selection tools, and diagnostic graphs.

2.3. Machine learning methods

2.3.1. K-nearest neighbors

A K-nearest neighbor model imputes missing values using only similar cases in a dataset. The method finds the K samples in the dataset "closest" to an incomplete data point and then averages these K data points to fill in the missing value. Typically, a K-nearest neighbor method uses a distance metric to compute the nearest neighbors. Configuring K-nearest neighbor imputation often involves selecting the distance measure (e.g., Euclidean) and the number of contributing neighbors for each prediction, the hyperparameter K of the algorithm. The optimal value of K is usually chosen by cross-validation. The K-nearest neighbor method can impute both continuous variables (using the mean or weighted mean among the K nearest neighbors) and categorical variables (using the mode among the K nearest neighbors).

2.3.2. Feed-forward neural network

Traditional feed-forward neural network modeling can be used to estimate missing values by training a network to learn the incomplete variable (as an output) using the remaining complete variables as inputs. Sharpe and Solly (1995) proposed the following imputation scheme. (1) Given a dataset containing missing values, denote the subset of variables that have missing values as X^∅ and the subset that does not contain any missing values as X^c. For each possible combination of incomplete variables in X^∅, one constructs a feed-forward neural network using X^c. Depending on the type of variable to be imputed (numerical or categorical), different types of error functions are used during the training process. (2) After each neural network model is trained, unknown values are predicted using the corresponding model. The imputed values are obtained by averaging the missing data estimates from each model.

2.3.3. MissForest

Stekhoven and Bühlmann (2012) proposed an iterative imputation method called "missForest" that is based on random forests (RF). In each iteration, to impute the missing values of a variable X_k^M, missForest first fits an RF with X_k^M ∼ X_{−k}^M using the rows that do not contain missing X_k^M values; it then uses the trained RF to predict the missing values in X_k^M. This process is done for all k ∈ {1, …, d}, in order of the number of missing values in X_k^M from the fewest to the most. For the first iteration, missForest makes an initial guess for the missing values in X^M using mean imputation or another simple imputation method. The whole procedure is repeated until a stopping criterion is met.

By averaging over many unpruned classification or regression trees, an RF intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forests, one can estimate the imputation error without the need for a test set. In Stekhoven and Bühlmann's study, missForest outperforms other imputation methods, including K-nearest neighbors. It performs especially well in data settings where complex interactions and non-linear relations are suspected.

2.4. Deep learning methods

2.4.1. GAIN

Based on the mechanism of GAN (Goodfellow et al., 2020), Yoon et al. (2018) proposed a generative model framework for missing data imputation, named GAIN. We outline the procedure below. GAIN consists of two main components: (1) a generator G that imputes the missing data conditioned on the observed data and outputs a completed vector, and (2) a discriminator D that takes the output of G and attempts to identify which parts of the data are imputed. To ensure that G produces a unique distribution, the authors introduced a hint vector, correlated with the missing pattern, that is passed to D.

Fig. 1 shows the data flow of GAIN. The input of the generator, G, consists of three components, X^M, M, and R, where X^M and M are defined in Section 2.1, and R = {R_1, …, R_d} is a d-dimensional noise variable that is independent of X^M and M. The generator imputes the missing values and produces an output variable X̂, in which the missing values of X are replaced by the imputed values. Then we have

X̂ = M ⊙ X^M + (1 − M) ⊙ G(X^M, M, (1 − M) ⊙ R),  (2)

where ⊙ denotes element-wise multiplication.

The inputs of the discriminator in GAIN, D, consist of two components, X̂ and H, where H is a hint variable of the same shape as M, obtained by randomly selecting a certain portion of the values in M to reveal to D. The output of D is the predicted mask, M̂, which is calculated as follows:

M̂ = D(X̂, H),  (3)

where the element m̂_ij ∈ M̂ indicates the likelihood that the ijth value in X̂ is an observed (non-missing) value. By modifying the degree of

Fig. 1. The data flow of GAIN.
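The forward pass in Fig. 1, i.e. Eqs. (2) and (3), can be sketched in numpy with small placeholder networks standing in for the trained generator G and discriminator D (hypothetical linear maps, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
x_m = np.array([0.5, 0.0, -1.2, 0.0])   # X^M, with unobserved slots set to 0
m = np.array([1.0, 0.0, 1.0, 0.0])      # mask: 1 = observed, 0 = missing
r = rng.normal(size=d)                  # noise variable R

W_g = rng.normal(scale=0.1, size=(d, 3 * d))
W_d = rng.normal(scale=0.1, size=(d, 2 * d))

def G(x_m, m, noise):                   # stand-in generator network
    inp = np.concatenate([x_m, m, noise])
    return np.tanh(W_g @ inp)           # candidate values for every slot

def D(x_hat, h):                        # stand-in discriminator network
    inp = np.concatenate([x_hat, h])
    return 1 / (1 + np.exp(-(W_d @ inp)))  # per-entry prob. of "observed"

# Eq. (2): keep observed entries, fill missing ones from the generator.
x_hat = m * x_m + (1 - m) * G(x_m, m, (1 - m) * r)

# Hint: reveal a random portion of the true mask to the discriminator.
reveal = rng.random(d) < 0.9
h = np.where(reveal, m, 0.5)

# Eq. (3): predicted mask.
m_hat = D(x_hat, h)
```

Note that Eq. (2) guarantees the observed entries pass through unchanged; only the masked entries come from G.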

Fig. 2. The dataflow of VAE for imputation.
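The iterative imputation loop in Fig. 2 can be sketched as follows, with a stand-in `vae_reconstruct` function in place of a trained encoder/decoder pair (an illustrative shrinkage map, not a real VAE):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))                       # ground-truth complete data
M = (rng.random(X.shape) < 0.8).astype(float)       # mask: 1 = observed
X_tilde = np.where(M == 1, X, rng.normal(size=X.shape))  # random initial fill

def vae_reconstruct(Z):
    # Stand-in for decoder(encoder(Z)): pull each row toward the column means,
    # roughly the way a low-capacity VAE reconstruction behaves.
    return 0.5 * Z + 0.5 * Z.mean(axis=0)

prev_loss = np.inf
for _ in range(50):
    X_rec = vae_reconstruct(X_tilde)
    # keep observed entries fixed; replace only the missing ones
    X_tilde = np.where(M == 1, X, X_rec)
    loss = np.mean((X_rec[M == 1] - X[M == 1]) ** 2)  # loss on non-missing only
    if abs(prev_loss - loss) < 1e-8:                  # predetermined threshold
        break
    prev_loss = loss

X_imputed = X_tilde
```

The loop mirrors the procedure in the text: reconstruct, overwrite only the missing slots, and stop once the reconstruction loss over the non-missing entries stabilizes.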

correlation between H and M, we control the amount of "hint" passed to D.

Similar to GAN, the loss function of GAIN is defined as

Q(G, D) = E_{X̂, M, H}[ M^T log(M̂) + (1 − M)^T log(1 − M̂) ],

where log is the element-wise logarithm function, and X̂ and M̂ are defined in (2) and (3). Thus, one solves the minimax problem

min_G max_D Q(G, D).

2.4.2. VAE

VAE (Kingma & Welling, 2014) is another deep learning-based generative framework, which can be utilized for missing data imputation (Camino et al., 2019; McCoy et al., 2018). In the VAE framework, one assumes that the real data X are generated by a latent variable Z that follows a multivariate Gaussian distribution. An encoder generates the posterior distribution q_φ(Z|X) ∼ N(μ_e, σ_e) of Z given X, while a decoder generates the distribution p_θ(X|Z) ∼ N(μ_d, σ_d) of X given Z. Both the encoder and the decoder are neural networks, parameterized by φ and θ, respectively. Note that the input to the decoder is a random sample drawn from q_φ(Z|X). The reconstructed X is a random sample drawn from p_θ(X|Z). The parameters φ and θ are learned by maximizing the evidence lower bound (ELBO),

ELBO = E_{q_φ(Z|X)}[ log p_θ(X|Z) ] − KL( q_φ(Z|X) ∥ p(Z) ),  (4)

where p(Z) is the prior distribution of Z, assumed to follow the standard Gaussian distribution. Maximizing the first term in (4) is equivalent to minimizing the reconstruction loss. The second term can be regarded as a regularization term. Readers are referred to Kingma and Welling (2014) for the full derivation of the equations.

Fig. 2 shows the dataflow for imputing the missing values of X^M using VAE. First, we train a VAE where the missing values in X^M are filled with random values; we denote the result as X̃. The reconstruction loss in (4) is calculated over only the non-missing values during training. Then we iteratively pass X̃ through the trained VAE, replacing the missing values with the reconstructed values from the previous iteration, until the reconstruction loss over the non-missing values falls below a predetermined threshold. The reconstructed missing values in the last iteration are the imputed values. More specifically,

X^imp ∼ p_θ(X|Z) = N(μ_d, σ_d),
X̂ = M ⊙ X^M + (1 − M) ⊙ X^imp,

where X^imp denotes the imputed data vector sampled from the complete distribution p_θ(X|Z) and X̂ represents the resulting complete data vector.

2.5. DLIN

DLIN is a novel deep learning method whose architecture is different from those of GAIN and VAE (Hallaji et al., 2021). It incorporates denoising auto-encoders and the ladder network into a unified framework. The deep ladder network is a relatively new approach to semi-supervised learning that has turned out to be very successful (Rasmus, Berglund, Honkala, Valpola, & Raiko, 2015). In a standard autoencoder network, all information has to go through the highest layer, which needs to represent all the details of the input x. The intermediate hidden layer outputs h^(l) in the network cannot independently represent information because they only receive information from the highest layer. In contrast, lateral connections at each layer in a deep ladder network give each h^(l) a chance to represent information from the higher layers.

Most deep learning-based imputation methods cannot utilize unobserved data to train an imputation model. However, the ladder networks built into a DLIN can be used to remove this limitation. In a DLIN model, one first gives initial imputed data, X_0, and then generates a stochastically corrupted version, X̆. Consider a DLIN with L hidden layers. The (corrupted) autoencoder path is formulated by

E(x̆_i) = (x̆_i, z_1, z_2, …, z_L, x̂_i),

where z_l is the latent variable (hidden representation) at the lth hidden layer and x̂_i is an estimate of x̆_i, and the clean encoder path and the decoder network are modeled as

E^c(x̆_i) = (x̆_i, z_1^c, z_2^c, …, z_L^c, x̂_i^c) | m_i = 1,
D(z̆_i) = (z_L^c, z_1^d, z_2^d, …, z_L^d) | z_l^c ∈ E^c(x̆_i),

where m_i is the ith mask variable defined in Section 2.1, 1 is a 1 × n vector of ones, and z_l^c and z_l^d are the hidden representations of E^c and D at the lth hidden layer, respectively. In a DLIN, a fusion function fuses the noisy representations in E^c with the denoised representations through the lateral connections. Thus, the model facilitates the generation of more abstract features at each layer of the ladder network (Hallaji et al., 2021). It has been shown that DLIN performs well for data imputation with high missing ratios. It can also handle cases where temporal and/or spatial missing values exist among the data.

3. Numerical experiments

The efficacy of missing data imputation methods depends heavily on the problem domain, for example, the sample size, types of variables,

Table 1
Summary of the ten datasets in the numerical study.

Continuous data:
- PimaIndiansDiabetes (n = 768, d = 8). From a study of Pima Indians' diabetes; consists of several medical variables, including the number of pregnancies the patient has had, BMI, insulin level, age, etc. Characteristics: a typical moderate-size clinical dataset.
- Spam (n = 4601, d = 57). A dataset that classifies e-mails as spam or non-spam. The first 48 variables contain the frequency of certain keywords in the e-mail; variables 49–54 indicate the frequency of certain characters; variables 55–57 contain the average, longest, and total run-length of capital letters. Characteristics: skewed data with proportion and rate variables.
- DTI (n = 382, d = 93). Contains fractional anisotropy tract profiles in diffusion tensor images for patients with multiple sclerosis. Characteristics: functional (time series) data.

Categorical data:
- Carcinoma (n = 118, d = 7). A dataset of carcinoma diagnoses based on seven categorical features representing pathologist ratings. Characteristics: a small categorical dataset.
- DNA (n = 3186, d = 180). Contains 180 binary indicator variables for primate splice-junction gene sequences. Characteristics: high-dimensional categorical variables.

Mixed data:
- NCbirths (n = 1450, d = 13). A sample of birth records from the North Carolina State Center for Health and Environmental Statistics; includes 5 continuous and 8 categorical variables. Characteristics: a typical moderate-size dataset with mixed variables.
- VietNamI (n = 27,765, d = 12). A dataset of medical expenses in Vietnam from 1997; includes 7 continuous and 5 categorical variables. Characteristics: a big dataset with imbalanced and skewed variables.
- Simulated1 (n = 100, d = 15). Simulated data with 10 continuous and 5 categorical variables. Characteristics: a small set with known covariance structure.
- Simulated2 (n = 500, d = 15). Simulated data with 10 continuous and 5 categorical variables. Characteristics: a moderate set with known covariance structure.
- Simulated3 (n = 1000, d = 15). Simulated data with 10 continuous and 5 categorical variables. Characteristics: a large set with known covariance structure.
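The simulated datasets in Table 1 can be generated along the lines of Section 3.2: equicorrelated Gaussian features with the last five dichotomized at fixed thresholds. A sketch for Simulated2 follows (variable names are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 15                                   # Simulated2
mu = np.array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 0, 0, 0, 0, 0], dtype=float)
cov = 0.5 * np.eye(d) + 0.5 * np.ones((d, d))    # Eq. (6): 0.5*I + 0.5*1

Z = rng.multivariate_normal(mu, cov, size=n)

# Cutoff thresholds for the last five features, per Eq. (7).
thresholds = [[-1, 0, 1], [-1, 0], [-0.5], [0], [0.5]]
X_cont = Z[:, :10]                               # 10 continuous features
X_cat = np.column_stack(
    [np.digitize(Z[:, 10 + j], t) for j, t in enumerate(thresholds)]
)                                                # integer level codes c_0, c_1, ...
```

`np.digitize` assigns level c_0 below the first cutoff, c_1 between the first and second, and so on, matching the interval rule in Eq. (7).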

missingness patterns. In this study, we compare the performances of the popular deep learning methods, GAIN and VAE, versus the widely used and accepted methods, MICE and missForest. We consider seven real datasets and three synthetic datasets that represent typical cases in practice. These datasets include cases with small, moderate, and large sample sizes and cases with continuous variables, categorical variables, and mixed-type (both continuous and categorical) variables. We generate missing data in the datasets by three types of missing mechanisms, MCAR, MAR, and MNAR, and three levels of missing ratios: 10%, 30%, and 50%. Table 1 provides the basic summary information of each dataset.

Model performances are evaluated based on the (averaged) root mean squared error (RMSE) for continuous variables and the accuracy for categorical variables, respectively. Assume that we have p continuous variables and q categorical variables among all d-dimensional variables. The average RMSE is given as

RMSE = (1/p) Σ_{k=1}^{p} sqrt( (1/n_k) Σ_{i=1}^{n_k} (x̂_{ki} − x_{ki})^2 ),

where x̂_{ki} is the ith imputed value, x_{ki} is the ith true observed value, and n_k is the total number of missing observations for the kth continuous variable (k = 1, …, p). The averaged accuracy is given as

accuracy = (1/q) Σ_{j=1}^{q} (# correct labels for the jth variable) / n_j,

where n_j is the total number of missing observations for the jth categorical variable (j = 1, …, q).

RMSE is calculated on scaled datasets to avoid disproportionate variable contributions. RMSE and accuracy provide us with measures of the relative distance between the imputed dataset and the original complete dataset. For multiple imputation scenarios with K imputations, we have K values of RMSE and/or accuracy per dataset, and we use the average to evaluate the model performance.

We conducted each experiment 100 times. We report the mean RMSE and/or accuracy along with their standard deviations (SD) as the performance metrics.

3.1. Real datasets

We include 7 real datasets in this study: PimaIndiansDiabetes (Ripley, 2007; Wahba, Gu, Wang, & Chappell, 2018), spam (Hastie, Tibshirani, Friedman, & Friedman, 2009), and DTI (Goldsmith, Crainiceanu, Caffo, & Reich, 2012) contain only continuous features; Carcinoma (Agresti, 2003) and DNA (Noordewier, Towell, & Shavlik, 1991) contain only categorical features; and NCbirths (Cannon et al., 2018) and VietNamI (Cameron & Trivedi, 2005) contain both continuous and categorical features. Table 1 provides a summary of these datasets and the special characteristics of each set. The supplementary document provides further details on them.

3.2. Synthetic data construction

In addition to the seven real benchmark datasets, we generate three mixed-type synthetic datasets with three levels of sample size (see Table 1). The reason we consider simulated datasets here is that we know explicitly the data distribution and the covariance structure of the data. All three simulated datasets have 15 features that are sampled from multivariate Gaussian distributions with the following mean and covariance matrices:

μ_s = [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 0, 0, 0, 0, 0],  (5)

σ_s = 0.5 · I + 0.5 · 1,  (6)

where I is an identity matrix and 1 is a matrix of ones. We then transform the last five features of each synthetic dataset into categorical variables by dichotomization. Specifically, categorical levels are assigned as follows:

c_i = c_0 if v ∈ (−∞, a_1]; c_1 if v ∈ (a_1, a_2]; …; c_n if v ∈ (a_n, +∞),  (7)

where v is the continuous value and t_i = [a_1, a_2, …, a_n] are the cutoff thresholds for feature c_i. We choose the cutoff thresholds for the last

5
Y. Sun et al. Expert Systems With Applications 227 (2023) 120201

Fig. 3. The embedding structure of missing data 𝑿 𝑀 .

Fig. 4. The mean RMSEs with error bars of different methods under different missing ratios and different missing mechanisms for the DTI data.

Fig. 5. The mean accuracies with error bars of different methods under different missing ratios and different missing mechanisms for the DNA data.

five features to be 𝑡1 = [−1, 0, 1], 𝑡2 = [−1, 0], 𝑡3 = [−0.5], 𝑡4 = [0], 3.3.2. MAR
𝑡5 = [0.5], respectively. The probability that a value of is missing only depends on observed
values 𝑿 𝑜𝑏𝑠 . Defined as the following:
3.3. Generation of missing values
𝑝𝑀 (𝑴|𝑿 𝑜𝑏𝑠 , 𝑿 𝑚𝑖𝑠 ) = 𝑝𝑀 (𝑴|𝑿 𝑜𝑏𝑠 ), ∀𝑿 𝑜𝑏𝑠 . (9)

Let 𝑴 with probability distribution 𝑝𝑀 , parameterized by 𝜙, de-


notes the missingness of data 𝑿. Let 𝑿 𝑜𝑏𝑠 , 𝑿 𝑚𝑖𝑠 denote the observed and 3.3.3. MNAR
missing values in 𝑿, respectively. The three missing data generation Missing mechanisms that are neither MCAR nor MAR are considered
procedures are described as follows. MNAR. We create MNAR missing data by correlating 𝜙 with both 𝑿 𝑜𝑏𝑠
and 𝑿 𝑚𝑖𝑠 .
We also consider another special scenario of MNAR, the bursty
3.3.1. MCAR
missing data. The ‘‘bursty’’ missing refers to the situation that the
The probability that a value is missing does not depend on any other
instances of missing data appear to be clustered in some groups/orders
features, defined as the following:
or at certain moments of time. Erhan et al. (2021) provided a great
𝑝𝑀 (𝑴|𝑿 𝑜𝑏𝑠 , 𝑿 𝑚𝑖𝑠 ) = 𝑝𝑀 (𝑴). (8) example to compare non-bursty missing and bursty missing data.
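The synthetic construction of Section 3.2 (Eqs. (5)-(7)) can be reproduced in a few lines of NumPy. This is a minimal sketch; the sample size and random seed below are illustrative and are not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Eqs. (5)-(6): mean vector and compound-symmetry covariance 0.5*I + 0.5*J,
# i.e. unit variances with pairwise correlation 0.5.
mu = np.array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 0, 0, 0, 0, 0], dtype=float)
cov = 0.5 * np.eye(15) + 0.5 * np.ones((15, 15))

n = 1000  # illustrative; the paper uses three sample-size levels
X = rng.multivariate_normal(mu, cov, size=n)

# Eq. (7): dichotomize the last five features into ordinal categories
# c_0, c_1, ... using the half-open intervals (a_i, a_{i+1}].
thresholds = [[-1, 0, 1], [-1, 0], [-0.5], [0], [0.5]]
for j, t in enumerate(thresholds, start=10):
    X[:, j] = np.digitize(X[:, j], bins=t, right=True)
```

With `right=True`, `np.digitize` assigns a value v to category i when a_i < v <= a_{i+1}, matching the interval convention in Eq. (7).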

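The three generators of Section 3.3 can be sketched as follows. The logistic link used for MAR/MNAR and the rescaling to a target missing rate are our own assumptions for illustration, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(1)

def mcar_mask(X, rate):
    """MCAR, Eq. (8): each entry is missing independently at the same rate."""
    return rng.random(X.shape) < rate

def mar_mask(X, rate, driver=0):
    """MAR, Eq. (9): missingness depends only on an always-observed 'driver'
    column, here through a logistic link rescaled to an approximate rate."""
    M = np.zeros(X.shape, dtype=bool)
    prob = 1.0 / (1.0 + np.exp(-(X[:, driver] - X[:, driver].mean())))
    prob = np.clip(prob * rate / prob.mean(), 0.0, 1.0)
    for j in range(X.shape[1]):
        if j != driver:                      # the driver stays fully observed
            M[:, j] = rng.random(len(X)) < prob
    return M

def mnar_mask(X, rate):
    """MNAR: the missingness probability of an entry depends on its own
    (possibly unobserved) value, so larger values vanish more often."""
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    prob = 1.0 / (1.0 + np.exp(-z))
    prob = np.clip(prob * rate / prob.mean(), 0.0, 1.0)
    return rng.random(X.shape) < prob
```

Because of the clipping, the realized missing rate under MAR/MNAR is only approximately the target rate.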

Fig. 6. The mean RMSEs with error bars (for continuous variables) and the mean accuracies with error bars (for categorical variables) of different methods under different missing
ratios and different missing mechanisms for the VietNamI data.

Fig. 7. The mean RMSEs with error bars (for continuous variables) and the mean accuracies with error bars (for categorical variables) of different methods under different missing
ratios under burst-missing for the NCbirths data.


Fig. 8. The mean RMSEs with error bars (for continuous variables) and the mean accuracies with error bars (for categorical variables) of different methods under different missing
ratios and different missing mechanisms for the simulated3 data.
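The averaged RMSE and accuracy defined in Section 3 can be computed on the masked entries along the following lines (a sketch; the function names are ours, not the paper's):

```python
import numpy as np

def averaged_rmse(X_true, X_imp, M, cont_cols):
    """Mean over continuous variables of the per-variable RMSE, computed
    only on entries flagged missing (M == True), as in the paper's formula."""
    rmses = []
    for k in cont_cols:
        idx = M[:, k]
        if idx.any():
            diff = X_imp[idx, k] - X_true[idx, k]
            rmses.append(np.sqrt(np.mean(diff ** 2)))
    return float(np.mean(rmses))

def averaged_accuracy(X_true, X_imp, M, cat_cols):
    """Mean over categorical variables of the fraction of correctly
    imputed labels among the missing entries."""
    accs = []
    for j in cat_cols:
        idx = M[:, j]
        if idx.any():
            accs.append(np.mean(X_imp[idx, j] == X_true[idx, j]))
    return float(np.mean(accs))
```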

3.4. Data preprocessing for deep generative models

Deep learning models cannot directly process categorical values. We apply onehot encoding and dense embedding to transform categorical values into numerical vectors.

• Onehot encoding: A categorical value with m possible unique categorical values, v = {v_1, …, v_m}, is embedded as a vector e ∈ {0, 1}^m, where the entry corresponding to the index of the value in v is set to 1, and the others to 0.

• Dense embedding: Each unique value is assigned a numerical vector e ∈ R^n that is updated during the training process. The aim of applying dense embeddings in deep generative models is to improve the imputation performance by learning the potential relations between features. The embedding vector of missing categorical values is filled with N/A's.

All categorical features are transformed into numerical vectors as illustrated in Fig. 3 for deep neural net-based methods. All observed continuous features are standardized into the range [0, 1] or [−1, 1], depending on the specific embedding method applied. Processed categorical and continuous features are concatenated to form the input feature vector for deep neural net-based models.

The imputed data from deep generative models are decoded back to the original space by data splitting: values of continuous features are transferred back to their original ranges, and embeddings of categorical features are mapped to the original value whose embedding is the closest according to cosine similarity.

3.5. Model training configurations

Conventional methods, e.g., MICE and missForest, are selected for comparison with the deep generative methods. Both methods are implemented in R using the packages mice and missForest. For MICE, we apply predictive mean matching for continuous features, and logistic/multinomial regression for binary/multi-level categorical features, respectively. We set the number of multiple imputations to 50 for each incomplete dataset. Missing continuous values and categorical values are filled with the corresponding means and modes from those 50 imputation results. For missForest, we set the number of random trees for imputation to 200.

The deep generative methods, GAIN and VAE, are implemented in Python with TensorFlow/Keras (Abadi et al., 2015; Chollet et al., 2015). After the proper epochs and batch sizes of the models under each miss-sampling setting are determined, we train GAIN and VAE with the Adam and RMSprop optimizers, respectively, in the imputation process. Finally, we compute the performance metrics from the trained models.

4. Results

4.1. Comparisons using real data

4.1.1. Pure continuous data

Fig. 4 shows the means and standard deviations of RMSEs by different models under three missing mechanisms for the DTI data. Three levels of missing rates were considered: 10%, 30% and 50%. We observe that the conventional methods, MICE and missForest, outperform


Fig. 9. The scatter plots of true v.s. predicted values of features in the study of DTI data (the case of 30% missing ratio) for GAIN: the first, second, and third row of each
sub-figure correspond to the case of MCAR, MAR and MNAR, respectively. Three continuous features are shown (con_cat_1, con_cat_2 and con_cat_3).

the deep learning-based methods, GAIN and VAE. VAE is the worst method in all scenarios. Unsurprisingly, data with MCAR are easier to impute compared with the MAR and MNAR cases. The RMSEs increase as the missing rate increases. When the missing rate is low (e.g. 10% or 30%), MICE appears to be the winner among all methods. Note that the DTI data are functional (time-series) data, where the 93 variables are dependent (Ramsay & Silverman, 2005). Both MICE and missForest showed great performance for these functionally dependent data. Table S-I in the supplementary document presents the summarized RMSE results for the two other real continuous datasets, PimaIndiansDiabetes and spam. PimaIndiansDiabetes is a typical clinical dataset, while spam contains skewed proportional/rate variables. The results for these cases show similar findings as in Fig. 4, where both MICE and missForest are the winners compared with the deep learning methods.

4.1.2. Pure categorical data

Fig. 5 shows the means and standard deviations of accuracies by different models under different missing rates and missing mechanisms for the DNA data. The DNA dataset has a large sample size (N = 3186) and high-dimensional variables (d = 180). We observe that missForest is the winner among all methods. The performance of MICE is similar to that of missForest when the missing ratio is small (ratio = 0.1). However, it is of interest to observe that MICE is the worst method when the missing ratio is large (ratio = 0.5). GAIN is better than VAE, but both deep learning methods are worse than missForest. We considered two types of encoding methods for deep learning imputation: onehot encoding and dense embedding. It appears that there is no obvious difference between the two encoding methods.

Table S-II in the supplementary document shows the results for the other categorical dataset, Carcinoma. The sample size of Carcinoma is much smaller (n = 118). In this case, MICE is the winner among


Fig. 10. The scatter plots of true v.s. predicted values of features in the study of DTI data (the case of 30% missing ratio) for VAE.

all methods. The performance of missForest is close to that of MICE in all scenarios. They generally outperform the deep generative models in imputing categorical variables.

4.1.3. Mixed data

Fig. 6 presents the results of the study of the VietNamI data. This is a relatively big dataset with mixed variables: it contains 27,765 observations with 12 variables. We note that GAIN performs poorly in imputing missing values for the continuous variables. It is unstable across the three missing mechanisms and the different missing rates. For both continuous and categorical variables, the conventional methods are uniformly better than the deep generative models. Table S-III in the supplementary document presents the summarized results for the other mixed dataset. The results show similar findings as in Fig. 6.

We also conducted a case study of the bursty missing mechanism, a special case of MNAR, using the NCbirths data. Following the design in Erhan et al. (2021), we randomly select a corresponding number of bursts of a given block size (number of data points, here set to 50) to be invalidated from the dataset, in order to reach the desired dataset impairment level (10%, 30% and 50%). The results of the different imputation methods are shown in Fig. 7. Both MICE and missForest are better than the deep learning methods in the bursty missing case. GAIN with onehot encoding seems not a good choice for bursty missing data with a high missing ratio (for example, 50%).

4.2. Comparisons using simulated data

As described in Section 3, the simulated data allow us to specify the known correlation structure among the variables. We considered here a compound-symmetry correlation. As summarized in Table S-IV, MICE and missForest outperform the neural net-based methods on these simulated data as well. For the deep generative methods, GAIN is better than VAE in the case of MCAR, while VAE is better than GAIN in the


Fig. 11. The scatter plots of true v.s. predicted values of features in the study of DTI data (the case of 30% missing ratio) for MICE.

cases of MAR and MNAR. Comparing the onehot encoding and embedding models, it is interesting that GAIN with embedding works worse than GAIN with onehot encoding. The VAE with embedding and the VAE with onehot encoding have similar performances in imputing both continuous and categorical variables. Fig. 8 displays the results for Simulated3, which shows that MICE and missForest are better than GAIN and VAE.

5. Discussion

We have performed a thorough investigation of practical imputation methods in the context of tabular data (i.e. structured data), investigating the performance of multiple missing data strategies using both real-world and simulated datasets. The scenarios were varied by using different sample sizes, missing data mechanisms, and ratios of missing data. In summary, we found that the conventional methods, MICE and missForest, outperform the recently proposed deep generative methods, GAIN and VAE, in these experiments. For the deep learning models, there is no obvious improvement using embedding vs. onehot encoding for categorical features.

In our study, we also looked closely at model fitting through a graphical evaluation of the predicted values of missing data. Here we demonstrate a few examples showing that the deep learning models, both GAIN and VAE, lead to model collapse in many situations. Figs. 9–12 show the scatter plots of true versus predicted values of continuous features in the study of the DTI data for the case of a 30% missing ratio. We present three continuous features (con_cat_1, con_cat_2 and con_cat_3). The first, second, and third rows correspond to the cases of MCAR, MAR, and MNAR, respectively. Additional graphical examples can be found in the supplementary document. Figure S-I presents the confusion matrices and the scatter plots of true versus predicted values for categorical and continuous features in the Simulated3 example for the case of a 30% missing ratio.


Fig. 12. The scatter plots of true v.s. predicted values of features in the study of DTI data (the case of 30% missing ratio) for MissForest.
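The model-collapse pattern visible in scatter plots like Figs. 9-12 can also be screened numerically. The sketch below is our own heuristic, not a procedure from the paper: a "horizontal" point cloud corresponds to a near-zero truth/imputation correlation together with an imputed-value spread far below the spread of the true values.

```python
import numpy as np

def collapse_diagnostics(x_true, x_imp):
    """Numeric counterpart of a truth-vs-imputation scatter plot: returns the
    correlation between true and imputed values and the ratio of their
    standard deviations; both near zero signals a collapsed imputer."""
    r = float(np.corrcoef(x_true, x_imp)[0, 1]) if np.std(x_imp) > 0 else 0.0
    return {"corr": r, "spread_ratio": float(np.std(x_imp) / np.std(x_true))}

rng = np.random.default_rng(5)
truth = rng.normal(size=300)
# A collapsed imputer returns (almost) the same value everywhere;
# a reasonable one tracks the truth up to noise.
collapsed = np.full_like(truth, np.median(truth)) + rng.normal(0, 0.01, 300)
good = truth + rng.normal(0, 0.1, 300)
```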

In this DTI data example, we note that the scatter plots based on the predicted values using VAE tend to be horizontal. Those predicted values are randomly distributed around the median of the feature. They indicate that VAE suffered from the model collapse problem and generated poor results under all missing mechanism situations. Although GAIN was better than VAE, it appears to be sensitive to the type of missingness mechanism: GAIN is reasonable when data are MCAR, but becomes poor in cases where data are MAR or MNAR.

Deep generative modeling is an active research area. Many novel deep models with real applications have been proposed in recent years. The number of research papers on deep learning for missing data imputation is growing, and these novel methods appear potentially attractive. However, these models' power and validity need to be evaluated carefully in real applications. In many cases, reliable and reproducible code is either unavailable or incomplete. For the deep generative models, the number of hyperparameters to tune is usually much larger than in traditional statistical methods. The training time or memory size required for the hyperparameter search on big data can be prohibitive in some applications. Recently, Dong et al. (2021) showed that GAIN had better accuracy than MICE and missForest in very large data. However, our findings demonstrated that the stability and convergence of the deep generative models were questionable for data with a small, moderate or even large sample size (i.e. when the sample size is less than 30,000).

Conventional methods have been well established in the scientific community, leading practitioners to use stable libraries instead of implementing deep learning alternatives. For example, the mice and missForest packages in R are well-tested and debugged packages for missing data imputation analysis. We have shown that both MICE and missForest are better and more stable methods for data with limited sample size. Comparing MICE and missForest, the computational cost of missForest is much lower than that of MICE. Applying missForest does not require standardization of the data or laborious dummy coding of categorical variables, and it has no tuning parameters. MissForest can be applied to high-dimensional datasets where the number of variables greatly exceeds the sample size and still provides excellent imputation results (Stekhoven & Bühlmann, 2012). On the other hand, the advantages of MICE include that it results in unbiased estimates; its results are readily interpreted in a Bayesian context; and many feasible algorithms are available within the MICE framework (Buuren & Groothuis-Oudshoorn, 2011). MICE is particularly


Table 2
Advantages and disadvantages of popular missing data imputation methods.

Statistical

Mean/Median/Mode
  Advantages: Simple and easy to implement; computationally fast.
  Disadvantages: Might lead to severely biased estimates even if data are MCAR; susceptible to skewed distributions.

Regression
  Advantages: Improvement over mean/median/mode imputation.
  Disadvantages: The assumptions of error distribution and linear relationship are relatively strict; poor results for heteroscedastic data.

EM
  Advantages: Performance equal to MICE when data are multivariate normal; guarantees that the likelihood will improve after each iteration.
  Disadvantages: Slow convergence; underestimates the standard error.

MICE
  Advantages: Widely used and accepted stable method; flexibility.
  Disadvantages: Requires specification of conditional models, which may be difficult to know a priori; implementing MICE when data are MNAR may result in biased estimates.

Machine learning

K-Nearest Neighbors
  Advantages: Easily handles both quantitative and qualitative features; accommodates missingness patterns.
  Disadvantages: Computationally intensive for large data since it searches through the whole dataset; requires specification of hyperparameters that can have a large effect on the results.

Feed-forward NN
  Advantages: Good performance for skewed data.
  Disadvantages: Weight initialization is critical; many models have to be constructed in a high-dimensional problem.

MissForest
  Advantages: Excellent performance for data with complex interactions and/or non-linear relations; good for data with both quantitative and qualitative features.
  Disadvantages: Training on observed data can be biased.

Deep learning

GAIN
  Advantages: Data-driven approach, no distribution assumptions needed; good for imbalanced and skewed data.
  Disadvantages: Performance may be degraded depending on the sample size of the data; no standard software; sensitive to the type of missing mechanism.

VAE
  Advantages: Can learn latent associations of the input data.
  Disadvantages: Implementation varies; model collapse issues.

DLIN
  Advantages: Not sensitive to the missing mechanism; can handle spatial or temporal data.
  Disadvantages: No standard software; users need advanced knowledge for model training.
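For readers who want a Python starting point for the two conventional winners in Table 2, scikit-learn's IterativeImputer implements a chained-equations scheme in the spirit of MICE, and can mimic a missForest-style imputer by plugging in a random-forest estimator. This is a rough analogue for illustration, not the R packages benchmarked in this study; the data and settings below are made up:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.multivariate_normal(np.zeros(5), 0.5 * np.eye(5) + 0.5, size=300)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.3] = np.nan   # MCAR holes

# MICE-style chained equations (Bayesian ridge per feature, posterior draws).
mice_like = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
X_mice = mice_like.fit_transform(X_miss)

# missForest-style: the same chained scheme with random forests as the
# conditional models (an approximation of the R package's algorithm).
rf_like = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=25, random_state=0),
    max_iter=5, random_state=0)
X_rf = rf_like.fit_transform(X_miss)
```

Note that, unlike the R packages, this sketch handles only continuous features; categorical columns would need encoding first.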

useful if missing values are associated with the target variable in a way that introduces leakage. MICE also allows users to make statements about the likely distribution of the missing values.

The effectiveness of different missing data imputation techniques depends on multiple factors, such as the sample size of the data, the distribution of the variables, the amount of missing values in the data, the correlation structure of the data, and the possible missingness mechanisms. In Section 2, we have summarized the popular missing data imputation methods. Here we address their advantages and disadvantages in Table 2. Practitioners should try to gather as much information as possible about the factors of missingness in a study, so that a suitable imputation technique can be appropriately applied.

6. Conclusion

In conclusion, we recommend that, for tabular data with a limited sample size (n < 30,000), practitioners use the conventional methods, MICE and missForest, to deal with missing data imputation in real case analyses. The recently developed deep learning method, DLIN, appears to be a good alternative. Its authors showed that it is not sensitive to the missing mechanism or the levels of missing ratios and can handle spatial or temporal data. However, we did not compare this method with the others in our numerical studies because there is no publicly available open-source software for DLIN. Missing data imputation using deep learning techniques remains an active research area. There is great potential that deep learning-based methods will become popular in practice in the future, once reliable and easy-to-use software is developed.

In many situations, missing data imputation is just the first step of data analysis. Investigators will need to further build causal regression models or risk prediction models based on the imputed datasets. Using appropriate imputation techniques is crucial to the success of those models. Successfully imputing missing predictor values can substantially increase the reliability of the risk predictions from the fitted model. Investigators also need a good understanding of the dangers and limitations of the imputation technique they apply. A consensus reporting guideline on how to report missing data imputation details in research studies is needed in the future literature.

CRediT authorship contribution statement

Yige Sun: Software, Writing – original draft, Data curation. Jing Li: Writing – review & editing, Supervision. Yifan Xu: Writing – review & editing, Validation. Tingting Zhang: Writing – review & editing, Investigation. Xiaofeng Wang: Conceptualization, Methodology, Writing – original draft, Supervision.


Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

We are grateful to the Editor, the Associate Editor, and the reviewers for their constructive comments, which substantially improved this paper. JL was supported in part by National Science Foundation (NSF), USA grants CCF-2006780, IIS-2027667, and CCF-2200255, and National Institutes of Health (NIH), USA grants HG009658 and 5U01AG073323.

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.eswa.2023.120201.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. URL https://www.tensorflow.org/. Software available from tensorflow.org.
Agresti, A. (2003). Categorical data analysis. New York, NY: John Wiley & Sons.
Batista, G. E., & Monard, M. C. (2002). A study of K-nearest neighbour as an imputation method. HIS - Frontiers in Artificial Intelligence and Applications, 87(87), 251–260.
Buuren, S. v., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45, 1–67.
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and applications. Cambridge University Press.
Camino, R. D., Hammerschmidt, C. A., & State, R. (2019). Improving missing data imputation with deep generative models. arXiv preprint arXiv:1902.10666.
Cannon, A. R., Cobb, G. W., Hartlaub, B. A., Legler, J. M., Lock, R. H., Moore, T. L., et al. (2018). STAT2: Modeling with regression and ANOVA. New York, NY: W.H. Freeman.
Chollet, F., et al. (2015). Keras. https://keras.io.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B. Statistical Methodology, 39(1), 1–22.
Dong, W., Fong, D. Y. T., Yoon, J.-s., Wan, E. Y. F., Bedford, L. E., Tang, E. H. M., et al. (2021). Generative adversarial networks for imputing missing data for big data clinical research. BMC Medical Research Methodology, 21(1), 1–10.
Erhan, L., Di Mauro, M., Anjum, A., Bagdasar, O., Song, W., & Liotta, A. (2021). Embedded data imputation for environmental intelligent sensing: A case study. Sensors, 21(23), 7774.
Goldsmith, J., Crainiceanu, C. M., Caffo, B., & Reich, D. (2012). Longitudinal penalized functional regression for cognitive outcomes on neuronal tract measurements. Journal of the Royal Statistical Society. Series C. Applied Statistics, 61(3), 453–469.
Gondara, L., & Wang, K. (2018). MIDA: Multiple imputation using denoising autoencoders. In Pacific-Asia conference on knowledge discovery and data mining (pp. 260–272). Springer.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.
Gupta, A., & Lam, M. S. (1996). Estimating missing values using neural networks. Journal of the Operational Research Society, 47(2), 229–238.
Hallaji, E., Razavi-Far, R., & Saif, M. (2021). DLIN: Deep ladder imputation network. IEEE Transactions on Cybernetics, 52(9), 8629–8641.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York, NY: Springer.
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. arXiv:1312.6114.
Little, R. J., & Rubin, D. B. (2019). Statistical analysis with missing data, vol. 793 (3rd ed.). John Wiley & Sons.
Lu, H.-m., Perrone, G., & Unpingco, J. (2020). Multiple imputation with denoising autoencoder using metamorphic truth and imputation feedback. arXiv preprint arXiv:2002.08338.
McCoy, J. T., Kroon, S., & Auret, L. (2018). Variational autoencoders for missing data imputation with application to a simulated milling circuit. IFAC-PapersOnLine, 51(21), 141–146. 5th IFAC Workshop on Mining, Mineral and Metal Processing MMM 2018.
Noordewier, M. O., Towell, G. G., & Shavlik, J. W. (1991). Training knowledge-based neural networks to recognize genes in DNA sequences. In Advances in neural information processing systems (pp. 530–536).
Qiu, Y. L., Zheng, H., & Gevaert, O. (2020). Genomic data imputation with variational auto-encoders. GigaScience, 9(8), giaa082.
Ramsay, J., & Silverman, B. (2005). Functional data analysis (2nd ed.). Springer Series in Statistics. New York, NY: Springer.
Rasmus, A., Berglund, M., Honkala, M., Valpola, H., & Raiko, T. (2015). Semi-supervised learning with ladder networks. Advances in Neural Information Processing Systems, 28.
Ripley, B. D. (2007). Pattern recognition and neural networks. Cambridge University Press.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Rubin, D. B. (1978). Multiple imputations in sample surveys - a phenomenological Bayesian approach to nonresponse. In Proceedings of the survey research methods section of the American Statistical Association, vol. 1 (pp. 20–34). Alexandria, VA: American Statistical Association.
Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys. New York, NY: John Wiley & Sons.
Shah, A. D., Bartlett, J. W., Carpenter, J., Nicholas, O., & Hemingway, H. (2014). Comparison of random forest and parametric imputation models for imputing missing data using MICE: A CALIBER study. American Journal of Epidemiology, 179(6), 764–774.
Sharpe, P. K., & Solly, R. (1995). Dealing with missing values in neural network-based diagnostic systems. Neural Computing & Applications, 3(2), 73–77.
Stekhoven, D. J., & Bühlmann, P. (2012). MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112–118.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on machine learning (pp. 1096–1103).
Wahba, G., Gu, C., Wang, Y., & Chappell, R. (2018). Soft classification, aka risk estimation, via penalized log likelihood and smoothing spline analysis of variance. In The mathematics of generalization (pp. 331–359). CRC Press.
Waljee, A. K., Mukherjee, A., Singal, A. G., Zhang, Y., Warren, J., Balis, U., et al. (2013). Comparison of imputation methods for missing laboratory data in medicine. BMJ Open, 3(8), Article e002847.
Yoon, J., Jordon, J., & Schaar, M. (2018). GAIN: Missing data imputation using generative adversarial nets. In International conference on machine learning (pp. 5689–5698). PMLR.
