A Comparison of Machine Learning Methods for Data Imputation
Christos Platias∗
Georgios Petasis∗
[email protected]
[email protected]
Institute of Informatics & Telecommunications
National Centre for Scientific Research (N.C.S.R.) “Demokritos”
Agia Paraskevi, Attiki, Greece
ABSTRACT
Handling missing values in a dataset is a long-standing issue across many disciplines. Missing values can arise from different sources such as mishandling of samples, measurement errors, lack of responses, or deleted values. The main problem emerging from this situation is that many algorithms cannot run on incomplete datasets. Several methods exist for handling missing values, including “SoftImpute”, “k-nearest neighbor”, “mice”, “MatrixFactorization”, and “missForest”. However, performance comparisons for these methods are hard to find, as most research approaches treat imputation as an intermediate step of a regression or classification task, and focus only on that task’s performance. In addition, comparisons with existing scientific work are difficult, due to the lack of evaluations on publicly-available, open-access datasets. In order to overcome the aforementioned obstacles, in this paper we propose four new open datasets, representing data from real use cases and derived from publicly-available existing datasets, so that anyone can access them and compare their experimental results. We then compare the performance of some of the state-of-the-art and most frequently used methods for missing data imputation. In addition, we propose and evaluate two new approaches, one based on Denoising Autoencoders and one on bagging. All in all, 17 different methods were tested on four different real-world, publicly available datasets.

KEYWORDS
missing values, neural networks, autoencoders, imputation methods

ACM Reference Format:
Christos Platias and Georgios Petasis. 2020. A Comparison of Machine Learning Methods for Data Imputation. In 11th Hellenic Conference on Artificial Intelligence (SETN 2020), September 2–4, 2020, Athens, Greece. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3411408.3411465

1 INTRODUCTION
Missing data arise frequently in almost all statistical analyses in many disciplines. There are many reasons why data might be missing, such as patients dropping out from clinical studies, people being embarrassed to fill in specific fields, human errors, or even sensor malfunction in IoT networks. In some cases missingness might even indicate an answer or a state. In many cases, in order to analyze the data and use them for predictive models, the dataset must be complete. The easiest way to meet this condition is to exclude instances with missing feature values, a method known as complete case analysis [26]. However, this can dramatically limit the amount of available information, especially in cases where the dataset has many instances with missing values. As a result, it is of great importance to find an effective way to “discover” these missing values, or in other words, to impute them [17]. Data imputation is the process of replacing missing values in a data structure with other meaningful substitute values. As R. Little and D. Rubin stated in “Statistical Analysis with Missing Data” [19], “Imputations are means or draws from a predictive distribution of the missing values, and require a method of creating a predictive distribution for the imputation based on the observed data. There are two generic approaches to generating this distribution: Explicit modeling: the predictive distribution is based on a formal statistical model, and hence the assumptions are explicit; Implicit modeling: the focus is on an algorithm, which implies an underlying model; assumptions are implicit, but they still need to be carefully assessed to ensure that they are reasonable”. The most common method to impute data is to use the mean or median of a feature if it is numerical and the mode if it is categorical [39]. More sophisticated approaches are statistically based [7, 34], and the latest imputation strategies depend on machine learning methods [9, 27]. A very informative book on data imputation is the one by van Buuren [4].

Missing values can be an obstacle when trying to train a model. For example, when dealing with time series data, Recurrent Neural Networks (RNNs) are widely used [5, 16], but these networks are not capable of handling missing values. Frequently, in these cases data are interpolated with simple methods, like the mean, or forward imputation, where the last observed value of a feature is used [10]. However, the evaluation of such imputation approaches is based on the performance of classification or regression tasks, and not on how close the imputed values were to the real ones. A more relevant study of the efficiency of imputation methods is the one by Muharemi et al. [24], who tested kNN [2], Mice [30]
and Random forest [37] on some real-world datasets, with random forest having the best overall performance. Another similar and interesting study by Beaulieu-Jones and Moore [3] benchmarked seven different mainstream imputation methods and also proposed an approach based on autoencoders. This method outperformed the rest, but was evaluated on a regression task’s performance.

The initial motivation of our work has been to propose a new imputation method. However, bringing together all the scientific work mentioned above, one of the most important points standing out from the very beginning was the absence of open-access datasets that can be used to reproduce experiments and compare new approaches with the state of the art. In addition, even if the data were open, the authors almost never published the preprocessing that had been applied, limiting reproducibility. Since comparisons of new imputation methods with state-of-the-art approaches are difficult, the introduction of new datasets for data imputation was imperative. As a result, our motivation was extended to include the creation of new datasets, preferably from existing, notable, publicly-available open datasets from real use cases. After examining several repositories, we have created four new datasets for data imputation, making possible the direct comparison between methods. These new datasets enabled us to fulfil our initial goal, and within this context two new approaches are proposed. The first approach is based on Denoising Autoencoders and also creates a model that can be used for future predictions. The second approach explores the applicability of a bagging scheme utilising different algorithms. Finally, having fulfilled all the necessary conditions, it is now possible to compare current data imputation methods on their ability to reconstruct multivariate data with missing values. To this end, the final goal was to create a benchmark with results from some of the most frequently used machine learning approaches for data imputation.

The main contributions of the work presented in this paper can be summarised as follows: the proposition of four new datasets for data imputation, derived from existing publicly-available ones and representing real (not synthetic) data; the proposition of two new approaches for data imputation, one based on denoising autoencoders and one implementing bagging over existing approaches; and the evaluation of a large number of existing approaches (17) on the same four datasets, enabling the comparison of all these methods in a setting that can be easily reproduced.

2 ALGORITHMS AND IMPLEMENTATIONS
In this section we briefly describe the implementations used for the experiments. With the exception of the self-implemented “Autoencoder” and “Bagging” methods, the evaluated approaches come either from Python or R libraries.

2.1 Amelia
“Amelia” [14] takes an incomplete dataset and returns m imputed datasets under the assumption that the data follow a multivariate normal distribution. To do so, the algorithm first creates a bootstrapped version of the original data, estimates the sufficient statistics by “Expectation-Maximization” (EM) [8] and then imputes the missing values of the original data using the estimated sufficient statistics. It repeats this process m times to produce the m complete datasets, where the observed values are the same and the unobserved values are drawn from their posterior distributions.

2.2 Hmisc
The “Hmisc” package [13] contains many functions for data analysis. The “aregImpute” function of Hmisc can be used for imputation. It supports multiple imputations and uses a bootstrap approach to approximate the process of drawing predicted values from a full Bayesian predictive distribution. Different bootstrap resamples are used for each of the multiple imputations. A flexible additive model is fitted on each bootstrap sample, and this model is used to predict all of the original missing values for the target variable. This method uses predictive mean matching to draw values from the imputation models instead of drawing at random.

2.3 mi
This is another multiple imputation approach. “Mi” [36] iteratively draws imputed values from the conditional distribution of each variable given the observed and imputed values of the other variables in the data. The main idea behind this process is the approximation of a Bayesian framework where multiple chains are run and convergence is assessed after a pre-specified number of iterations within each chain. Predictive mean matching (PMM) [18, 29] is also used as the imputation method in this implementation.

2.4 mice
The “mice” package generates multiple imputations for incomplete multivariate data by Gibbs sampling [41]. The algorithm imputes each incomplete column (target) iteratively by generating synthetic values using the other variables of the dataset (predictors). Predictors that include missing values are filled with an initial value, and then the most recently generated imputations are used to complete them. Each column has a separate imputation model, and several imputation methods can be used, including Random Forest [38] and PMM.

2.5 missForest
This package is especially useful when there are mixed-type variables in the dataset. A number of iterations is performed, with a forest being trained at each one. The out-of-bag (OOB) imputation error is calculated after every iteration and the process stops either when the maximum number of iterations is exceeded or when the OOB error increases. The steps can be described as: pre-impute, grow a forest for each variable that has missing values, predict the missing values using the grown forest, update the missing values with the predicted values, and iterate for improved results [35, 37].

2.6 IterativeImputer
An implementation of the R “mice” algorithm in the fancyimpute library [31]. Initial values can be either the mean, the median or a constant, predictions start from the variable with the fewest missing values, and the regression is done using “RidgeCV” [33] or BayesianRidge [32].

2.7 IterativeSVD
This method is an iterative low-rank Singular Value Decomposition (SVD) and, according to the “fancyimpute” library authors, it should
be similar to the one described by [40]. SVD obtains a set of mutually orthogonal expression patterns that can be linearly combined to approximate missing values in the data set. SVD can only be performed on complete matrices; therefore the method initially substitutes feature averages for all missing values in the original incomplete matrix and then utilizes an expectation-maximization method to arrive at the final estimate.

2.12 SoftImpute
Soft-Impute is a matrix completion method based on iterative soft thresholding of SVD decompositions. It is an algorithm for Nuclear Norm Regularization, based on “Spectral Regularization Algorithms for Learning Large Incomplete Matrices” by [23]. “SoftImpute” iteratively replaces the missing elements with those obtained from a soft-thresholded SVD.
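Most of the Python-side implementations above (from the fancyimpute library and scikit-learn) share a simple interface: the imputer receives a matrix whose missing cells are NaN and returns a completed copy. The snippet below is only an illustrative sketch of that interface with a toy matrix and arbitrary parameter choices, not the exact code used in our experiments; the random-forest estimator at the end is a rough Python stand-in for the missForest idea, not the R package itself.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor
from fancyimpute import KNN, IterativeSVD, SoftImpute, SimpleFill

# Toy incomplete matrix: np.nan marks the missing cells.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [np.nan, 8.0, 9.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 10.0]])

imputers = {
    "SimpleFill-Mean": SimpleFill("mean"),
    "kNN": KNN(k=3),
    "IterativeSVD": IterativeSVD(rank=2),
    "SoftImpute": SoftImpute(),
    "Iterativeimputer": IterativeImputer(estimator=BayesianRidge(), random_state=0),
    # rough stand-in for the missForest idea (not the R package itself)
    "RF-Iterative": IterativeImputer(estimator=RandomForestRegressor(n_estimators=50,
                                                                     random_state=0)),
}

# Every imputer returns a dense matrix with the NaNs filled in.
completed = {name: imp.fit_transform(X) for name, imp in imputers.items()}
```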
By filling the original missing values with the mean (a dummy value is used for the pseudo missing values), the whole dataset can be used, and by creating pseudo missing values the Autoencoder can be trained on existing correct data. As a result, the Denoising Autoencoder tries to undo the corruption introduced by the dummy values, so it is important that in the prediction phase the same dummy value is used for each feature.

3.2.2 Prediction phase.
(1) Fill the original missing values with the dummy value used during training.
(2) Use the learned model to reconstruct the dummy-filled original dataset and get the predictions.

3.2.3 Model parameters. Below is a description of the most important parameters and settings of this implementation; an illustrative sketch follows the list.
• Pseudo missing values rate. Depends on the original percentage of missing values. Approximately 20% seems to be enough.
• Augmentation permutations. Depends on the dataset and the original percentage of missing values. There seems to be a positive effect when concatenating up to 30 permutations of the original dataset with different pseudo missing values.
• Dummy value. The corrupted value the Autoencoder tries to undo. It must be a distinct value not contained in the set of values of the variables. Large values can have a negative effect on the training process.
• Network layers and nodes. Depend on the dataset, but augmentation tends to smooth results. Few layers, between 1 and 3, might result in lower error during the training process, but will have greater error when predicting on the test set. Such shallow networks, also depending on the number of available features, tend to result in a model with very few parameters, which does not help the network learn and generalize. It is advised to use 5 or more layers.
• Validation set. Used for validation during training. It was set to 30% of the dataset.
• Loss function. Mean squared error tends to perform better than mean absolute error.
• Optimizer. The optimizer used was Adam with a learning rate of 0.0005.
• Batch size. Batch size seems to affect the performance. Sizes used were between 8 and 64. Small datasets tend to perform better with a smaller batch size. In addition, training with a smaller batch size tends to give better prediction results.
• Early stopping with best-epoch parameters was used.
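The following Keras sketch illustrates how the training and prediction phases described above fit together. The hidden layer widths, the dummy value of -1, the epoch budget and the random toy data are assumptions made purely for illustration and do not reproduce the exact configuration of our implementation.

```python
import numpy as np
from tensorflow import keras

def make_pseudo_missing(X, rate=0.2, dummy=-1.0, rng=None):
    """Corrupt a copy of X by overwriting ~`rate` of its cells with the dummy value."""
    rng = rng if rng is not None else np.random.default_rng(0)
    X_corrupt = X.copy()
    mask = rng.random(X.shape) < rate
    X_corrupt[mask] = dummy
    return X_corrupt

def build_dae(n_features, hidden=(64, 32, 16, 32, 64)):
    """Denoising autoencoder: 5 hidden layers, MSE loss, Adam with lr=0.0005."""
    inputs = keras.Input(shape=(n_features,))
    x = inputs
    for units in hidden:
        x = keras.layers.Dense(units, activation="relu")(x)
    outputs = keras.layers.Dense(n_features, activation="linear")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0005), loss="mse")
    return model

# --- Training phase --------------------------------------------------------
DUMMY = -1.0                                    # assumed dummy value
rng = np.random.default_rng(0)
X_complete = rng.random((1000, 18))             # stand-in for a real, mean-filled dataset

# Augmentation: concatenate several permutations with different pseudo missing values.
X_train = np.vstack([make_pseudo_missing(X_complete, 0.2, DUMMY, rng) for _ in range(30)])
y_train = np.vstack([X_complete] * 30)

model = build_dae(X_train.shape[1])
model.fit(X_train, y_train, batch_size=32, epochs=200, validation_split=0.3,
          callbacks=[keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)])

# --- Prediction phase ------------------------------------------------------
# (1) fill the original missing values with the same dummy value,
# (2) reconstruct, and keep the network's output at the originally missing positions.
X_new = X_complete.copy()
missing_mask = rng.random(X_new.shape) < 0.10   # stand-in for the real missing-value mask
X_new[missing_mask] = DUMMY
reconstruction = model.predict(X_new)
X_new[missing_mask] = reconstruction[missing_mask]
```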
4 DATASETS
Here we describe the datasets used for the experiments. There were two main requirements in the data selection process. First, the data should come from real use cases. Second, they must be open and accessible to anyone who wants to use them, in order to replicate results or test other methods and make comparisons. Table 1 summarises their dimensions.

Table 1: Dimensions of the datasets used in the experiments.
Dataset name | Instances | Features
TADPOLE [22] | 9088 | 18
Alsfrs [25] | 36194 | 11
Lab tests [25] | 14485 | 35
Gesture [21] | 9901 | 18

4.1 The TADPOLE dataset
The Alzheimer’s Disease Prediction Of Longitudinal Evolution (TADPOLE) [22] is a collaboration between the EuroPOND consortium and the Alzheimer’s Disease Neuroimaging Initiative (ADNI). TADPOLE is a challenge to identify which people within an age group at risk of AD will start to show symptoms in the short to medium term. The data are accessible to anyone with a simple registration (https://tadpole.grand-challenge.org/). Each row in the dataset represents one particular visit of a subject for tests and each column represents a feature measurement (biomarker), coming from MRI, PET and other sources. In order to be able to evaluate the imputation results of each method, datasets should be complete. The original TADPOLE dataset contains many missing values, so in order to be exploitable, columns and rows were dropped in such a way that as many features as possible would be kept while leaving no empty fields, and the total number of records would remain appropriate for the experiments. After dropping columns that are not useful for the imputation process, like ids and timestamps, the final dataset size is 9088 rows x 18 columns. The features used are PTEDUCAT, APOE4, CDRSB_bl, ADAS11_bl, ADAS13_bl, MMSE_bl, RAVLT_immediate_bl, RAVLT_learning_bl, RAVLT_forgetting_bl, RAVLT_perc_forgetting_bl, FAQ_bl, Ventricles_bl, Hippocampus_bl, WholeBrain_bl, Entorhinal_bl, Fusiform_bl, MidTemp_bl, ICV_bl. By keeping all the complete rows for the mentioned feature subset, one can obtain the dataset used for the experiments.
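As a rough illustration of this preprocessing, the pandas sketch below keeps the listed feature columns and retains only the complete rows. The column list is the one given above; the input file name is an assumption on our part and may differ from the exact TADPOLE release used.

```python
import pandas as pd

FEATURES = [
    "PTEDUCAT", "APOE4", "CDRSB_bl", "ADAS11_bl", "ADAS13_bl", "MMSE_bl",
    "RAVLT_immediate_bl", "RAVLT_learning_bl", "RAVLT_forgetting_bl",
    "RAVLT_perc_forgetting_bl", "FAQ_bl", "Ventricles_bl", "Hippocampus_bl",
    "WholeBrain_bl", "Entorhinal_bl", "Fusiform_bl", "MidTemp_bl", "ICV_bl",
]

# Keep only the selected biomarker columns (ids/timestamps are dropped by not
# selecting them) and retain the rows that are complete on this feature subset.
df = pd.read_csv("TADPOLE_D1_D2.csv", usecols=FEATURES)  # file name is an assumption
complete = df.dropna()            # expected shape: (9088, 18)
complete.to_csv("tadpole_imputation.csv", index=False)
```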
4.2 The Pro-ACT dataset
The Pooled Resource Open-Access ALS Clinical Trials Database (Pro-ACT) contains over 10,700 fully de-identified clinical records of patients with Amyotrophic Lateral Sclerosis (ALS). ALS is a disease that involves the degeneration and death of the nerve cells that control voluntary muscle movement. The Pro-ACT records contain placebo and treatment-arm data measurements, demographic information, lab tests, medical and family history and other measurements. In total, there are more than 10 million longitudinally collected data points. The database is maintained by Prize4Life and the Neurological Clinical Research Institute (NCRI) of the Massachusetts General Hospital. The data can be accessed easily with a registration (https://nctu.partners.org/ProACT/). From this source, two different datasets were created, described in the following two subsections.

4.2.1 “Alsfrs”. The first dataset was created using the “alsfrs.csv” file. The Amyotrophic Lateral Sclerosis Functional Rating Scale (ALSFRS) is an instrument for evaluating the functional status of patients with ALS. It contains ratings of everyday tasks like speech, walking, handwriting and others. As mentioned before, the datasets must be complete for our experiments, so after dropping features like timestamps and ids and removing rows with empty values, the final dataset has 36194 rows x 11 columns. Once again, by keeping the complete rows for the following features, Q1_Speech, Q2_Salivation, Q3_Swallowing, Q4_Handwriting, Q5_Cutting, Q6_Dressing_and_Hygiene, Q7_Turning_in_Bed, Q8_Walking, Q9_Climbing_Stairs, Q10_Respiratory, Gastrostomy, one can reproduce the experimental dataset.

4.2.2 “Lab tests”. The second dataset comes from the “labs.csv” file, and contains many different lab measurements mainly based on blood tests. The final dataset, after processing, contains 14485 rows and 35 columns. The features used are the following: AST(SGOT), Neutrophils, Lymphocytes, Gamma-glutamyltransferase, Potassium, Albumin, Protein, ALT(SGPT), Eosinophils, Sodium, White Blood Cell (WBC), Triglycerides, Total Cholesterol, Absolute Neutrophil Count, Bicarbonate, Hemoglobin, Glucose, Bilirubin (Total), Urine Ph, Platelets, Absolute Eosinophil Count, Calcium, Red Blood Cells (RBC), Absolute Basophil Count, Creatinine, CK, Basophils, Phosphorus, Absolute Lymphocyte Count, Absolute Monocyte Count, Chloride, Monocytes, Alkaline Phosphatase, Hematocrit, Blood Urea Nitrogen (BUN).

4.3 Gesture Phase Segmentation - UCI
This dataset is available at the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets/gesture+phase+segmentation) [20, 42]. It contains features extracted from 7 videos of people gesticulating. The dataset was created using the file containing the raw positions of the hands, wrists, head and spine of the user in each frame. There were no missing values in this dataset and only two unnecessary columns were dropped, timestamp and phase. The final shape of the dataset is 9901 rows and 18 columns.

5 EVALUATION METHOD AND RESULTS
The datasets used for the experiments are complete, meaning there are no missing values. In order to compare the imputation results of each implementation, we create random missing values equal to 10, 25 and 50 percent of the total values of every dataset. After the imputation is done, the predicted values are compared to their original “hidden” values using the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) metrics [1, 15]. The scope of our work is to discover the missing values per se, and not to implicitly evaluate the performance of imputation approaches through the results obtained on a classification or regression task.

The methods presented in Sections 2 and 3 were evaluated using the four datasets mentioned in Section 4 and three different missing percentage levels. The algorithms coming from libraries were used with default parameters, while the Denoising Autoencoder hyperparameters were described in Section 3. The approach employing bagging was applied on “mice”, “Iterativeimputer” and “kNN”. Each of the model’s bags included 70% of the rows and columns of the original dataset, with a total of 200 bags.

5.1 Evaluation metrics
After the imputation of missing values, we assess the performance of each method based on two different metrics: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).

Given N missing values, the RMSE is defined as the square root of the average squared difference between the imputed values y_i and the respective true values x_i (eq. 1). This indicator is very useful for measuring overall precision or accuracy. In general, the most effective method should have the lowest RMSE.

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - x_i)^2}    (1)

Given N missing values, the Mean Absolute Error between the imputed values y_i and the respective true values x_i is (eq. 2):

MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - x_i|    (2)

Again, lower is better. In general, RMSE is preferred since it gives more weight to larger differences, and a lower RMSE means fewer large differences.
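A compact sketch of this evaluation loop (hide a fraction of the cells, impute, score only the hidden cells with eq. 1 and eq. 2) is given below. The imputer dictionary is a placeholder for the implementations of Sections 2 and 3, and the snippet is illustrative rather than the exact experimental script.

```python
import numpy as np

def hide_values(X, fraction, seed=0):
    """Randomly hide `fraction` of the cells; return the masked copy and the mask."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < fraction
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask

def rmse_mae(X_true, X_imputed, mask):
    """RMSE (eq. 1) and MAE (eq. 2) computed over the hidden cells only."""
    diff = X_imputed[mask] - X_true[mask]
    return float(np.sqrt(np.mean(diff ** 2))), float(np.mean(np.abs(diff)))

# X: a complete dataset as a NumPy array; `imputers`: name -> object with fit_transform()
# for fraction in (0.10, 0.25, 0.50):
#     X_missing, mask = hide_values(X, fraction)
#     for name, imputer in imputers.items():
#         rmse, mae = rmse_mae(X, imputer.fit_transform(X_missing), mask)
```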
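For the bagging scheme described in Section 5 (each bag drawing 70% of the rows and columns, 200 bags in total), one possible reading is sketched below: each missing cell takes the average of the predictions from the bags that covered it. The combination rule and the choice of scikit-learn's IterativeImputer as the base imputer are illustrative assumptions for this sketch, not a verbatim transcription of the evaluated implementation.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def bagged_impute(X_missing, n_bags=200, row_frac=0.7, col_frac=0.7, seed=0):
    """Average a base imputer's predictions over random row/column bags.

    Each bag sees ~70% of the rows and ~70% of the columns; a missing cell's
    final value is the mean prediction over the bags that covered it.
    (Assumes every bagged sub-matrix keeps at least one observed value per column.)
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X_missing.shape
    total = np.zeros_like(X_missing, dtype=float)
    counts = np.zeros_like(X_missing, dtype=float)

    for _ in range(n_bags):
        rows = rng.choice(n_rows, size=int(row_frac * n_rows), replace=False)
        cols = rng.choice(n_cols, size=int(col_frac * n_cols), replace=False)
        sub = X_missing[np.ix_(rows, cols)]
        filled = IterativeImputer(random_state=0).fit_transform(sub)
        total[np.ix_(rows, cols)] += filled
        counts[np.ix_(rows, cols)] += 1.0

    averaged = np.divide(total, counts, out=X_missing.copy(), where=counts > 0)
    # observed cells keep their original values; only the NaNs take the bag average
    return np.where(np.isnan(X_missing), averaged, X_missing)

# usage: X_imputed = bagged_impute(X_missing)   # X_missing holds NaNs for missing cells
```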
6 RESULTS
Before moving on, it is important to note that the results in the following tables are unnormalized error scores, not percentages. The range of these values depends solely on the values of each dataset’s features. Lower scores are better, and the score of filling missing values with the mean value is considered the baseline. In each table, the left pair of columns ranks the algorithms by RMSE and the right pair ranks them by MAE; the rows under “Bagging” report the three bagging variants.

6.1 “Alsfrs” dataset
Starting with the Alsfrs dataset results (Table 2), for the 10% missing values case we can see that our proposed bagging-based approach (“mice” with bagging) achieved the lowest RMSE score (0.663) and “missForest”, which also uses a bagging scheme, achieved the lowest MAE (0.464). “Iterativeimputer” had a balanced performance and was slightly behind the best scores. Unlike mice, where there is a clear performance increase with bagging, “Iterativeimputer” performed worse with bagging, and “kNN” had slightly lower RMSE and higher MAE. The “Autoencoder” implementation has similar performance to “MatrixFactorization”, “Sciblox Mice” and “kNN”, while “IterativeSVD” is at the same level as mean imputation and “mi” gave the worst results.

Table 2: Alsfrs dataset imputation results for 10% missing values. Lower is better.
Algorithm | RMSE | MAE | Algorithm
Iterativeimputer | 0.666 | 0.464 | missForest
missForest | 0.675 | 0.489 | Iterativeimputer
Autoencoder | 0.721 | 0.493 | Sciblox Mice
Sciblox Mice | 0.726 | 0.507 | kNN
kNN | 0.730 | 0.535 | MatrixFactorization
MatrixFactorization | 0.736 | 0.536 | Autoencoder
SoftImpute | 0.899 | 0.563 | mice
mice | 0.914 | 0.568 | Hmisc
Hmisc | 0.922 | 0.715 | Amelia
Amelia | 0.937 | 0.717 | SoftImpute
SimpleFill-Mean | 1.136 | 0.794 | IterativeSVD
IterativeSVD | 1.211 | 0.833 | SimpleFill-Median
SimpleFill-Median | 1.226 | 0.899 | SimpleFill-Mean
mi | 1.334 | 1.605 | mi
Bagging
mice | 0.663 | 0.492 | mice
Iterativeimputer | 0.673 | 0.505 | Iterativeimputer
kNN | 0.714 | 0.533 | kNN

Moving on to the 25% missing values percentage (Table 3), our proposed approach “mice” with bagging is again the lowest-RMSE algorithm (0.696). This time “Iterativeimputer” has the lowest MAE (0.511) and is also only slightly behind (+0.003) “mice” with bagging in RMSE. Again “kNN” performed worse with bagging, while “Iterativeimputer” was at the same level. The “Autoencoder” had the third lowest RMSE score and had similar performance to “missForest” and “MatrixFactorization”. Once again, “mi” was worse than simple mean imputation.

Table 3: Alsfrs dataset imputation results for 25% missing values. Lower is better.
Algorithm | RMSE | MAE | Algorithm
Iterativeimputer | 0.699 | 0.511 | Iterativeimputer
Autoencoder | 0.754 | 0.521 | missForest
missForest | 0.763 | 0.560 | Autoencoder
MatrixFactorization | 0.772 | 0.568 | MatrixFactorization
kNN | 0.834 | 0.599 | mice
Sciblox Mice | 0.914 | 0.602 | Hmisc
mice | 0.963 | 0.609 | Sciblox Mice
Hmisc | 0.966 | 0.609 | kNN
IterativeSVD | 0.981 | 0.691 | IterativeSVD
Amelia | 0.981 | 0.746 | Amelia
SoftImpute | 1.059 | 0.828 | SimpleFill-Median
SimpleFill-Mean | 1.134 | 0.838 | SoftImpute
mi | 1.149 | 0.898 | SimpleFill-Mean
SimpleFill-Median | 1.223 | 1.090 | mi
Bagging
mice | 0.696 | 0.521 | mice
Iterativeimputer | 0.698 | 0.521 | Iterativeimputer
kNN | 0.833 | 0.637 | kNN

When half of the values were removed (Table 4), our proposed approach “mice” with bagging once again had the lowest RMSE score (0.799) and “Iterativeimputer” with bagging had the lowest MAE (0.603). The performance of these two implementations was practically identical in this case. The best single algorithm was “Iterativeimputer”, followed by “IterativeSVD” and “missForest”. “MatrixFactorization” and the “Autoencoder” come next. This time, bagging gave a performance boost to “kNN”, and once again “mi” didn’t perform better than imputing with the mean value. Finally, “Amelia” wasn’t able to give a solution.

Table 4: Alsfrs dataset imputation results for 50% missing values. Lower is better.
Algorithm | RMSE | MAE | Algorithm
Iterativeimputer | 0.847 | 0.617 | missForest
IterativeSVD | 0.871 | 0.626 | Iterativeimputer
MatrixFactorization | 0.890 | 0.655 | IterativeSVD
missForest | 0.903 | 0.665 | MatrixFactorization
Autoencoder | 0.914 | 0.704 | Autoencoder
kNN | 1.023 | 0.705 | mice
SoftImpute | 1.089 | 0.708 | Hmisc
mice | 1.101 | 0.768 | kNN
Hmisc | 1.103 | 0.786 | Sciblox Mice
SimpleFill-Mean | 1.135 | 0.831 | SimpleFill-Median
Sciblox Mice | 1.142 | 0.861 | SoftImpute
mi | 1.182 | 0.896 | SimpleFill-Mean
SimpleFill-Median | 1.226 | 0.990 | mi
Amelia | - | - | Amelia
Bagging
mice | 0.799 | 0.603 | Iterativeimputer
Iterativeimputer | 0.805 | 0.604 | mice
kNN | 0.965 | 0.747 | kNN

6.2 “Lab Tests” dataset
For the 10% missing values in the “Lab Tests” dataset (Table 5), “Sciblox Mice” has both the lowest RMSE (50.370) and MAE (11.634). The next best scoring algorithm is “missForest”, which has nearly identical performance in both metrics. With approximately 7 units higher RMSE (57.025) and 1 unit higher MAE (12.833), “Iterativeimputer” is the next best performing solution. Bagging using the “mice” algorithm and the “Autoencoder” fall in the same performance range with minor deviations. After this point, the rest of the algorithms have RMSE scores greater than 61. Aside from median imputation, only “SoftImpute” managed to perform worse than the mean imputation, with 87.838 against 87.318 respectively. Once again “mice” seems to greatly benefit from bagging, “Iterativeimputer” performs worse, and “kNN” shows a negligible change in performance.

Table 5: Lab tests dataset imputation results for 10% missing values. Lower is better.
Algorithm | RMSE | MAE | Algorithm
Sciblox Mice | 50.37 | 11.634 | Sciblox Mice
missForest | 51.915 | 11.934 | missForest
Iterativeimputer | 57.025 | 12.833 | Iterativeimputer
Autoencoder | 57.813 | 13.972 | Autoencoder
MatrixFactorization | 61.955 | 14.113 | MatrixFactorization
kNN | 64.851 | 15.454 | kNN
Hmisc | 76.671 | 17.604 | mice
mice | 78.491 | 17.615 | Hmisc
Amelia | 81.931 | 19.48 | SimpleFill-Median
mi | 82.04 | 19.988 | SimpleFill-Mean
IterativeSVD | 85.16 | 20.191 | Amelia
SimpleFill-Mean | 87.318 | 20.217 | mi
SoftImpute | 87.838 | 20.437 | SoftImpute
SimpleFill-Median | 89.082 | 20.621 | IterativeSVD
Bagging
mice | 57.293 | 13.057 | mice
Iterativeimputer | 57.837 | 13.079 | Iterativeimputer
kNN | 64.833 | 15.153 | kNN

When 25% of the values are missing (Table 6), “missForest” is the best performing algorithm in both metrics (58.716 RMSE, 13.220 MAE). “Sciblox Mice” follows with nearly identical performance. Bagging using “mice” and “Iterativeimputer” yields almost identical results, with the RMSE being around 63.3 and the MAE around 14.3. Next comes the “Autoencoder” with a one-unit increase in both scores. The rest
of the algorithms score above 70 in RMSE, and “SoftImpute” has the worst performance. This time “Iterativeimputer” also benefited from bagging, while “kNN” was negatively affected.

For the 50% missing percentage (Table 7), “missForest” remained in first place in both RMSE (67.880) and MAE (15.119), followed by “mice” and “Iterativeimputer” with bagging at 70.138 and 71.693 RMSE respectively. At around 75 units of RMSE follow “Iterativeimputer” and the “Autoencoder”, while the rest of the implementations are above 80. In this case “Amelia”, “Hmisc”, mice and “SoftImpute” all performed worse than mean imputation (RMSE > 91), with “kNN” standing out by scoring almost 104 RMSE. In addition, “mi” didn’t manage to produce a result. Finally, bagging gave a great boost to all algorithms it was used with.

Table 7: Lab tests dataset imputation results for 50% missing values. Lower is better.
Algorithm | RMSE | MAE | Algorithm
missForest | 67.88 | 15.119 | missForest
Iterativeimputer | 75.211 | 16.789 | Iterativeimputer
Autoencoder | 75.945 | 17.649 | Autoencoder
Sciblox Mice | 80.142 | 17.684 | Sciblox Mice
MatrixFactorization | 83.081 | 18.543 | MatrixFactorization
IterativeSVD | 89.922 | 19.974 | SimpleFill-Median
SimpleFill-Mean | 90.344 | 20.557 | SimpleFill-Mean
SoftImpute | 91.413 | 20.92 | SoftImpute
SimpleFill-Median | 91.818 | 21.181 | IterativeSVD
Hmisc | 96.208 | 21.372 | Hmisc
Amelia | 97.2 | 21.58 | mice
mice | 97.722 | 23.264 | kNN
kNN | 103.986 | 23.454 | Amelia
mi | - | - | mi
Bagging
mice | 70.138 | 15.627 | mice
Iterativeimputer | 71.593 | 15.692 | Iterativeimputer
kNN | 89.675 | 20.322 | kNN

6.3 “TADPOLE” dataset

Table 8: TADPOLE dataset imputation results for 10% missing values. Lower is better.
Algorithm | RMSE | MAE | Algorithm
kNN | 2557 | 46 | kNN
missForest | 5039 | 454 | missForest
Sciblox Mice | 7896 | 1102 | Sciblox Mice
Iterativeimputer | 21625 | 6146 | Iterativeimputer
Autoencoder | 23061 | 6653 | Autoencoder
MatrixFactorization | 27577 | 7781 | MatrixFactorization
mice | 29623 | 8365 | mice
Hmisc | 30204 | 8666 | Hmisc
Amelia | 30467 | 8807 | Amelia
mi | 30576 | 8864 | mi
IterativeSVD | 35142 | 9882 | IterativeSVD
SoftImpute | 40128 | 11119 | SoftImpute
SimpleFill-Mean | 45122 | 12913 | SimpleFill-Median
SimpleFill-Median | 45332 | 12973 | SimpleFill-Mean
Bagging
kNN | 3000 | 191 | kNN
Iterativeimputer | 22605 | 6382 | mice
mice | 22626 | 6403 | Iterativeimputer

Almost the same performance pattern is repeated for the 25% case (Table 9). First is “kNN” with 4429 RMSE and 180 MAE, followed by “missForest” and “Sciblox Mice”. This time, mice and “Iterativeimputer” used with bagging are at the same level as the “Autoencoder”. “Iterativeimputer” without bagging has a RMSE of 71319, which is far greater than the 45551 of the mean imputation; clearly, in this case this algorithm didn’t work as expected. The rest of the algorithms performed better than the mean imputation.
Bagging gave a great boost to “Iterativeimputer” this time (25150 RMSE), as the algorithm by itself had very poor results. “Mice” also benefited from bagging, while “kNN” didn’t.

Table 9: TADPOLE dataset imputation results for 25% missing values. Lower is better.
Algorithm | RMSE | MAE | Algorithm
kNN | 4429 | 180 | kNN
missForest | 10561 | 1630 | missForest
Sciblox Mice | 13496 | 2762 | Sciblox Mice
Autoencoder | 27192 | 7769 | Autoencoder
MatrixFactorization | 32603 | 9131 | MatrixFactorization
mice | 33101 | 9348 | mice
mi | 33847 | 9655 | Hmisc
Amelia | 34012 | 9672 | Amelia
Hmisc | 34356 | 9743 | mi
IterativeSVD | 37638 | 10575 | IterativeSVD
SoftImpute | 41770 | 11737 | SoftImpute
SimpleFill-Mean | 45551 | 13157 | SimpleFill-Median
SimpleFill-Median | 45709 | 13193 | SimpleFill-Mean
Iterativeimputer | 71319 | 16912 | Iterativeimputer
Bagging
kNN | 7817 | 1293 | kNN
Iterativeimputer | 25150 | 7123 | Iterativeimputer
mice | 25791 | 7220 | mice

At 50% missing values (Table 10), “missForest” is the best scoring algorithm in both metrics, with 25645 RMSE and 6002 MAE. Bagging with “mice” and “Sciblox Mice” come next with an increase of approximately 5000 in RMSE, followed by the “Autoencoder” with an increase of 9000. The “kNN” implementation stands at 36004 RMSE and 9662 MAE, which is similar to the “Autoencoder”. “Iterativeimputer” didn’t give results, an expected fact taking into consideration that it performed poorly at 25%. All other algorithms were once again better than mean imputation, which holds a rather stable result through all missing percentages (approximately 45k RMSE and 13k MAE).

Table 10: TADPOLE dataset imputation results for 50% missing values. Lower is better.
Algorithm | RMSE | MAE | Algorithm
missForest | 25645 | 6002 | missForest
Sciblox Mice | 32366 | 8057 | Sciblox Mice
Autoencoder | 34602 | 9662 | kNN
kNN | 36004 | 9939 | Autoencoder
MatrixFactorization | 37352 | 10577 | MatrixFactorization
IterativeSVD | 41039 | 11644 | IterativeSVD
mi | 41805 | 11750 | Amelia
Amelia | 41841 | 11791 | mi
Hmisc | 42317 | 11896 | mice
mice | 42824 | 11919 | Hmisc
SoftImpute | 44029 | 12413 | SoftImpute
SimpleFill-Mean | 46159 | 13278 | SimpleFill-Median
SimpleFill-Median | 46316 | 13335 | SimpleFill-Mean
Iterativeimputer | - | - | Iterativeimputer
Bagging
mice | 31209 | 8730 | mice
kNN | 39590 | 11262 | kNN
Iterativeimputer | - | - | Iterativeimputer

6.4 Gesture phase segmentation
Moving on to the last dataset, for the 10% missing percentage (Table 11) “missForest” has the best scores, with 0.055 RMSE and 0.020 MAE. Next comes “kNN” with 0.091 RMSE and 0.030 MAE, and then the “Iterativeimputer” with 0.169 and 0.083. “Amelia”, “mi”, “MatrixFactorization”, the “Autoencoder” and bagging with “mice” have similar performance, with a RMSE around 0.220 and a MAE around 0.105 to 0.140. Our approach based on bagging helped only “mice” to perform better, and all algorithms were considerably better than the mean imputation, which had 0.877 RMSE and 0.644 MAE.

Table 11: Gesture dataset imputation results for 10% missing values. Lower is better.
Algorithm | RMSE | MAE | Algorithm
missForest | 0.055 | 0.02 | missForest
kNN | 0.091 | 0.029 | Sciblox Mice
Sciblox Mice | 0.098 | 0.03 | kNN
Iterativeimputer | 0.169 | 0.083 | Iterativeimputer
Amelia | 0.216 | 0.105 | MatrixFactorization
MatrixFactorization | 0.217 | 0.117 | Amelia
mi | 0.22 | 0.118 | mi
Autoencoder | 0.226 | 0.122 | Hmisc
Hmisc | 0.251 | 0.134 | IterativeSVD
IterativeSVD | 0.268 | 0.139 | Autoencoder
mice | 0.355 | 0.171 | mice
SoftImpute | 0.399 | 0.259 | SoftImpute
SimpleFill-Mean | 0.877 | 0.615 | SimpleFill-Median
SimpleFill-Median | 0.909 | 0.644 | SimpleFill-Mean
Bagging
kNN | 0.110 | 0.039 | kNN
Iterativeimputer | 0.195 | 0.111 | Iterativeimputer
mice | 0.230 | 0.132 | mice

At the next missing percentage level, 25% (Table 12), “missForest” was once again the best scoring implementation, having a RMSE equal to 0.083 and a MAE equal to 0.027. The error increase is small compared to that of the 10% case. Similar to the 10% situation, “kNN” and “Sciblox Mice” follow with nearly identical performance, having approximately 0.160 RMSE and 0.050 MAE. After this point, “Iterativeimputer”, bagging with “mice”, “Amelia” and “MatrixFactorization” show similar behavior. The “Autoencoder” managed to achieve a RMSE equal to 0.345 and a MAE equal to 0.215, which are both a lot better than the mean imputation results. Once again, all algorithms managed to perform better than the mean imputation, and our bagging-based approach only improved “mice”.

When half the values are missing (Table 13), “missForest” manages to stay at the top of all implementations with 0.178 RMSE and 0.058 MAE. “Sciblox Mice” is the next best performing implementation with 0.276 RMSE and 0.103 MAE. Bagging with “mice” follows, and then come the “Iterativeimputer” and bagging with “kNN”.
“MatrixFactorization” has similar performance. The rest of the implementations are above 0.450 in RMSE. The “Autoencoder” achieved 0.585 RMSE and 0.406 MAE. Bagging helped both “mice” and “kNN” this time, while “Iterativeimputer” did not benefit.

Table 12: Gesture dataset imputation results for 25% missing values. Lower is better.
Algorithm | RMSE | MAE | Algorithm
missForest | 0.083 | 0.027 | missForest
kNN | 0.157 | 0.049 | Sciblox Mice
Sciblox Mice | 0.159 | 0.054 | kNN
Iterativeimputer | 0.284 | 0.138 | Iterativeimputer
MatrixFactorization | 0.289 | 0.146 | MatrixFactorization
Amelia | 0.309 | 0.161 | Amelia
mi | 0.339 | 0.169 | mi
Autoencoder | 0.345 | 0.172 | Hmisc
Hmisc | 0.366 | 0.195 | IterativeSVD
IterativeSVD | 0.367 | 0.204 | mice
mice | 0.421 | 0.215 | Autoencoder
SoftImpute | 0.499 | 0.332 | SoftImpute
SimpleFill-Mean | 0.887 | 0.622 | SimpleFill-Median
SimpleFill-Median | 0.919 | 0.651 | SimpleFill-Mean
Bagging
kNN | 0.195 | 0.081 | kNN
mice | 0.269 | 0.148 | mice
Iterativeimputer | 0.276 | 0.150 | Iterativeimputer

Table 13: Gesture dataset imputation results for 50% missing values. Lower is better.
Algorithm | RMSE | MAE | Algorithm
missForest | 0.178 | 0.058 | missForest
Sciblox Mice | 0.276 | 0.103 | Sciblox Mice
Iterativeimputer | 0.41 | 0.223 | Iterativeimputer
MatrixFactorization | 0.423 | 0.233 | kNN
kNN | 0.46 | 0.239 | MatrixFactorization
Amelia | 0.467 | 0.256 | Amelia
mi | 0.492 | 0.269 | mi
IterativeSVD | 0.495 | 0.273 | mice
mice | 0.535 | 0.277 | Hmisc
Hmisc | 0.545 | 0.283 | IterativeSVD
Autoencoder | 0.585 | 0.406 | Autoencoder
SoftImpute | 0.782 | 0.566 | SoftImpute
SimpleFill-Mean | 0.882 | 0.617 | SimpleFill-Median
SimpleFill-Median | 0.914 | 0.647 | SimpleFill-Mean
Bagging
mice | 0.377 | 0.214 | mice
kNN | 0.412 | 0.236 | kNN
Iterativeimputer | 0.414 | 0.241 | Iterativeimputer

7 CONCLUSION
The initial motivation for our work was to develop and propose a new data imputation method. However, by reading the relevant literature, our focus shifted to three main goals. As most publications used their own datasets, comparisons of methods were hard. Thus, finding and using open-access datasets that could be used as benchmarks became equally important. Four real-world open datasets have been proposed and were used in our experiments, with their preprocessing described so anyone can reproduce the results. The datasets have 11 to 35 features and 9000 to 36000 rows. Having set a common reference point, two imputation methods have been proposed and evaluated: a method based on Denoising Autoencoders and a bagging scheme using different algorithms. Finally, using the proposed datasets, we present a collective and comparative evaluation of 13 of the most popular imputation methods along with the two approaches proposed by us. The metrics used for the comparisons were the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE).

From the results it is clear that there is no single implementation that is consistently in first place across all datasets. As a result, the characteristics of each dataset are the factors that mainly determine which algorithms are best suited to each case. Another parameter that affects the performance of the imputation is the percentage of missing values. Some implementations might perform well when few values are missing, but might not work when the missing percentage is high. More specifically, Amelia, mi and Iterative Imputer weren’t able to give a result on some datasets when half the values were missing. On the other hand, one of our proposed approaches, bagging with “mice”, tends to have a solid performance at 50% missing values compared to other implementations. As expected, the RMSE and MAE of the imputed datasets increase as the number of missing values increases. However, the mean imputation scores fluctuate within a narrow range as the missing values increase, and the errors of the other implementations tend to get closer to them as the percentage of missing values increases.

For the Alsfrs dataset, where all features were on the same scale and the number of rows was relatively high (36000), bagging using mice was the best implementation. In the Lab tests dataset, which had the most features (35) and varying feature scales, “Sciblox Mice” was the best implementation for 10% missing values, while missForest gave the best imputations for the other two percentages. In the TADPOLE dataset, where there were great differences in the feature scales, with values on scales from 10 up to 10^7, “kNN” managed to give an astonishingly low RMSE and MAE at 10 and 25 percent, while “missForest” was best at 50%. Finally, on the last dataset, Gesture phase segmentation, which was the only dataset with some serious correlation between the features, “missForest” outperformed all other implementations in all cases. Remarkably, this is the only dataset where all implementations gave results in all cases and performed clearly better than the mean imputation, showing that correlation between features is necessary in some cases and leads to better imputation results.

All in all, a good overall suggestion is “missForest”. In the experiments it managed to achieve first-place results in many cases and was usually among the top three best performing implementations. The tests also showed that it has a solid performance regardless of the missing values percentage. Another good choice is “Sciblox Mice”. Using “mice” in a bagging ensemble also has a solid performance and usually ranks in the first three positions. When the above methods fail, “kNN” might give unexpectedly good results. Finally, the Autoencoder gave good results in almost all cases and
ranked around sixth place overall among the tested implementations. The Autoencoder tends to have increased MAE compared to equivalently performing implementations, but this can be partially justified as the neural network was optimised on the mean squared error and not on the mean absolute error.

8 FUTURE WORK
One of the conditions that the datasets should fulfill to be considered for the experiments was their open access, as this would allow not only reproducibility but would also give others the chance to compare further implementations against the ones already compared. As a result, one could try to further explore the Autoencoder implementation by optimising it further, trying a variational or a stacked Autoencoder, or using Generative Adversarial Networks [11]. In general, deep learning techniques are quite rare in this field. Finally, one could also try to implement an ensemble, combining the results from many different methods, including our proposed approach based on denoising autoencoders. This could be a simple averaging scheme with weights or a stacking ensemble.

REFERENCES
[1] M. Albinsson and E. Gillsbro. 2017. Imputation Methods in Dialysis Data. Lund University.
[2] Gustavo Batista and Maria Carolina Monard. 2003. A Study of K-Nearest Neighbour as an Imputation Method. In HIS.
[3] Brett K. Beaulieu-Jones and Jason H. Moore. 2017. Missing data imputation in the electronic health record using deeply learned autoencoders. In Biocomputing 2017. WORLD SCIENTIFIC, 207–218.
[4] Stef van Buuren. 2012. Flexible Imputation of Missing Data. Chapman and Hall/CRC.
[5] Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, and Yitan Li. 2018. BRITS: Bidirectional Recurrent Imputation for Time Series. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 6775–6785.
[6] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. CoRR abs/1603.02754 (2016).
[7] T. De Waal, J. Pannekoek, and S. Scholtus. 2011. Handbook of Statistical Data Editing and Imputation. John Wiley and Sons Inc.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39 (1977), 1–38. http://web.mit.edu/6.435/www/Dempster77.pdf
[9] Pedro J. García-Laencina, José-Luis Sancho-Gómez, and Aníbal R. Figueiras-Vidal. 2009. Pattern classification with missing data: a review. Neural Computing and Applications 19 (2009), 263–282.
[10] Mostafa Mehdipour Ghazi, Mads Nielsen, Akshay Pai, M. Jorge Cardoso, Marc Modat, Sebastien Ourselin, and Lauge Sørensen. 2018. Robust training of recurrent neural networks to handle missing data for disease progression modeling. arXiv:1808.05500 (Aug. 2018).
[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2672–2680. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
[12] Daniel Han-Chen. [n.d.]. sciblox. GitHub repository: https://github.com/danielhanchen/sciblox.
[13] Frank E. Harrell, Cole Beck, and Charles Dupont. [n.d.]. Hmisc. GitHub repository: https://github.com/harrelfe/Hmisc, CRAN documentation: https://cran.r-project.org/web/packages/Hmisc/Hmisc.pdf.
[14] James Honaker, Gary King, and Matthew Blackwell. 2011. Amelia II: A Program for Missing Data. Journal of Statistical Software 45, 7 (Dec. 2011), 47. https://gking.harvard.edu/amelia
[15] Ömür Kaya Kalkan, Yusuf Kara, and Hülya Kelecioğlu. 2018. Evaluating Performance of Missing Data Imputation Methods in IRT Analyses. International Journal of Assessment Tools in Education 5 (2018), 403–416.
[16] Yeo Jin Kim and Min Chi. 2018. Temporal Belief Memory: Imputing Missing Data during RNN Training. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization, 2326–2332.
[17] Serena G. Liao, Yan Lin, Dongwan D. Kang, Divay Chandra, Jessica Bon, Naftali Kaminski, Frank C. Sciurba, and George C. Tseng. 2014. Missing value imputation in high-dimensional phenomic data: imputable or not, and how? BMC Bioinformatics 15, 1 (Nov. 2014), 346. https://doi.org/10.1186/s12859-014-0346-6
[18] Roderick J. A. Little. 1988. Missing-Data Adjustments in Large Surveys. Journal of Business & Economic Statistics 6, 3 (1988), 287–296. http://www.jstor.org/stable/1391878
[19] Roderick J. A. Little and Donald B. Rubin. 2019. Statistical Analysis with Missing Data, 3rd Edition. John Wiley & Sons, Inc.
[20] Renata C. B. Madeo, Clodoaldo A. M. Lima, and Sarajane M. Peres. 2013. Gesture Unit Segmentation Using Support Vector Machines: Segmenting Gestures from Rest Positions. In Proceedings of the 28th Annual ACM Symposium on Applied Computing (SAC ’13). ACM, 46–52.
[21] R. C. B. Madeo, P. K. Wagner, and S. M. Peres. [n.d.]. Gesture Phase Segmentation Data Set. Retrieved from: https://archive.ics.uci.edu/ml/datasets/gesture+phase+segmentation.
[22] Razvan V. Marinescu, Neil P. Oxtoby, Alexandra L. Young, Esther E. Bron, Arthur W. Toga, Michael W. Weiner, Frederik Barkhof, Nick C. Fox, Stefan Klein, Daniel C. Alexander, the EuroPOND Consortium, and for the Alzheimer’s Disease Neuroimaging Initiative. 2018. TADPOLE Challenge: Prediction of Longitudinal Evolution in Alzheimer’s Disease. arXiv:1805.03909 [q-bio, stat] (May 2018).
[23] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. 2010. Spectral Regularization Algorithms for Learning Large Incomplete Matrices. Journal of Machine Learning Research 11 (March 2010), 2287–2322. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3087301/
[24] Fitore Muharemi, Doina Logofătu, and Florin Leon. 2018. Review on General Techniques and Packages for Data Imputation in R on a Real World Dataset. In Computational Collective Intelligence (Lecture Notes in Computer Science). Springer International Publishing, 386–395.
[25] NCRI and Prize4Life. [n.d.]. Pooled Resource Open-Access ALS Clinical Trials Database (PRO-ACT). Retrieved from: https://nctu.partners.org/ProACT/.
[26] Therese D. Pigott. 2001. A Review of Methods for Missing Data. Educational Research and Evaluation 7, 4 (2001), 353–383.
[27] Jason Poulos and Rafael Valle. 2018. Missing Data Imputation for Supervised Learning. Applied Artificial Intelligence 32, 2 (2018), 186–196.
[28] Manizheh Ranjbar, Parham Moradi, Mostafa Azami, and Mahdi Jalili. 2015. An imputation-based matrix factorization method for improving accuracy of collaborative filtering systems. Engineering Applications of Artificial Intelligence 46 (2015), 58–66.
[29] Donald B. Rubin. 1986. Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations. Journal of Business & Economic Statistics 4, 1 (1986), 87–94. http://www.jstor.org/stable/1391390
[30] D. B. Rubin. 1987. Multiple Imputation for Nonresponse in Surveys. Wiley. 258 pages.
[31] Alex Rubinsteyn and Sergey Feldman. [n.d.]. fancyimpute. GitHub repository: https://github.com/iskandr/fancyimpute.
[32] Scikit-learn. [n.d.]. BayesianRidge. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html.
[33] Scikit-learn. [n.d.]. RidgeCV. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html.
[34] Shaun R. Seaman, Jonathan W. Bartlett, and Ian R. White. 2012. Multiple imputation of missing covariates with non-linear effects and interactions: an evaluation of statistical methods. BMC Medical Research Methodology 12 (April 2012), 46. https://doi.org/10.1186/1471-2288-12-46
[35] Daniel J. Stekhoven and Peter Bühlmann. 2012. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 1 (Jan. 2012), 112–118.
[36] Yu-Sung Su, Andrew Gelman, Jennifer Hill, and Masanao Yajima. 2011. Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box. Journal of Statistical Software 45, 2 (2011), 1–31.
[37] Fei Tang and Hemant Ishwaran. 2017. Random forest missing data algorithms. Statistical Analysis and Data Mining (2017).
[38] Tin Kam Ho. 1995. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Vol. 1. 278–282.
[39] Luís Torgo. 2010. Data Mining with R: Learning by Case Studies. 277 pages.
[40] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics (Oxford, England) 17, 6 (June 2001), 520–525.
[41] Stef van Buuren and Karin Groothuis-Oudshoorn. 2011. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 45, 3 (2011), 1–67.
[42] Priscilla K. Wagner, Sarajane M. Peres, Renata Cristina Barros Madeo, Clodoaldo A. M. Lima, and Fernando A. Freitas. 2014. Gesture Unit Segmentation Using Spatial-Temporal Information and Machine Learning. In Proceedings of the 27th International Florida Artificial Intelligence Research Society Conference, FLAIRS 2014. 6.