Missing Data Mechanisms and Imputation Methods
https://www.scirp.org/journal/ojs
ISSN Online: 2161-7198
ISSN Print: 2161-718X
Department of Electrical Engineering and Computer Science, Howard University, Washington DC, USA
Keywords
Missing Data, Mechanisms, Imputation Techniques, Models
1. Introduction
For a long time, researchers have experimented with different methods intended to tackle the missing data problem entirely. Having explored various approaches and their effectiveness, they determined that the old methods, such as univariate imputation and deletion, did not solve the problem and even made the situation worse. This has led to invalid conclusions in research based on incomplete data.
Missing data can arise, for example, from nonresponse in a survey or from the breakdown of equipment meant to measure some variables in our dataset, such as temperature.
We aim to address the following issues:
1) Complete datasets are rarely found. Old univariate imputation and deletion methods did not solve the missing data problem; they worsened results, produced wrong outcomes with invalid inferences, misled researchers, and biased conclusions. Researchers should therefore never be tempted to analyse a dataset with its missing values left untreated. It is also imperative to distinguish the missingness mechanisms in order to select the best imputation method.
2) We should avoid missing data at the collection stage, but since that is impossible in practice, we must use multivariate imputation by chained equations.
Analysing a dataset to make inferences is usually done when the dataset is in a rectangular format. This makes it easy to observe patterns of missing data, and according to Schafer and Graham, identifying these patterns is a crucial step in classifying them and eventually determining how to handle them.
Considerable research has been conducted over the years, mainly after 2005. As researchers advanced their knowledge of this domain, newer improvements were observed, reflecting extensive growth in Bayesian methods for simulating posterior distributions.
The work of Schafer and Graham [2] raised issues that remain unsolved to date, such as converting the MNAR mechanism to MAR, its analysis, and the use of auxiliary variables, and discussed dealing with other types of missingness. However, some of their work is still not in the mainstream.
White, Royston, and Wood [3] show how to impute categorical and quantitative variables, including skewed variables. They proposed and explained Multiple Imputation by Chained Equations as an immediate solution to missing data, which we will return to in the next chapter.
Little and Rubin [4] present approaches to, and an introduction to, multivariate analysis with missing values. More recently, van Buuren [5] introduced Flexible Imputation of Missing Data and presented the MICE algorithm.
We constructed this article mainly to highlight missing data mechanisms and briefly explain the multiple imputation process; in the next chapter, we extend our study to multiple imputation and the MICE algorithm.
We believe the added value of this article to the literature is the reinforcement of some easily understood central concepts about missing data, contributing modern techniques that avoid corrupting the analysis of missing data in many fields, including cybersecurity systems.
Researchers should try as much as possible to avoid analysing missing data, mainly by adopting the best data collection and cleaning methods. It is, however, nearly impossible to prevent this problem altogether in many research projects. It is therefore crucial to identify the possible mechanism so as to decide how to handle the missing data once it is identified.
2. Problem Statement
The missing data problem arises when values for one or more variables are absent from the observed dataset. A dataset is a group of observations, in most cases rectangular, denoted by X = (X_1, ..., X_n), where X_i, i = 1, 2, ..., n, is the ith column and X_j, j = 1, 2, ..., m, is the jth row. Suppose X_i is a random variable defined by a cumulative distribution function (CDF) F_k, k = 1, 2, ..., n, and every X_i has a different probability density function (PDF). Then the CDF of X_i, i = 1, 2, ..., n, is defined as F(X_i) = Pr(X_j ≤ X_i). If the observed dataset X has a missing value denoted by Y_pq, this is the missing value in the pth row and qth column. Also, Y_(m_1, n), m_1 < m, is the group of rows containing missing values, Y_(m_1, n) = {Y_pq}, and hence Y_(m_1, n) ⊂ X.
To indicate the number of missing values and their positions clearly, let R be the missingness indicator, whose elements take the values one and zero. R = 1 indicates that the value in the dataset is known, while R = 0 indicates that the value is missing. Also let Y_m be the number of missing values in a certain row j and Y_0 be the number of known values in the same row; then Y_m = Σ_{R=0} X_R and Y_0 = Σ_{R=1} X_R. Therefore, if the count of missing elements is zero, the observed dataset is complete; otherwise it has missing values.
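The indicator notation above can be sketched numerically. The following Python snippet is a toy illustration (a hypothetical 4 × 2 dataset, with NumPy assumed available): it builds the indicator R and the per-row counts Y_m and Y_0.

```python
import numpy as np

# Hypothetical toy dataset: rows are observations, np.nan marks a missing value.
X = np.array([
    [25.0, 243.0],
    [np.nan, 280.0],
    [33.0, 332.0],
    [np.nan, np.nan],
])

# R is the missingness indicator: 1 where the value is known, 0 where missing.
R = (~np.isnan(X)).astype(int)

# Per-row counts: Ym = number of missing values, Y0 = number of known values.
Ym = (R == 0).sum(axis=1)
Y0 = (R == 1).sum(axis=1)

# The dataset is complete only if every row has Ym = 0.
is_complete = bool((Ym == 0).all())
print(R)
print(Ym, Y0, is_complete)
```

Here the second and fourth rows contain missing values, so the dataset as a whole is incomplete.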
by Buuren [9]. At the same time, MCAR assumes that the missingness does not depend on either the observed or the unobserved values, and it can be tested by Little's test on the observed values. If, instead, the missingness is assumed to depend only on the unobserved data, then the mechanism of the missing data pattern is MNAR (Table 1).
4. Missingness Explanation
Dealing with missingness is essential, and one needs to understand the mechanisms and patterns of missing data. Attributes that contain no missing values are called complete data, while features that have missing values are called incomplete data. For example, referring to Section 2.2 of the dissertation, the cyber database selected for this research is the KDD subset, reduced by data cleansing to a complete dataset of 145,585 connections and 39 features. The label column was excluded before making the data incomplete at proportions of 10%, 20%, and 30% under two missingness mechanisms: MCAR and MAR.
After imputing the missing data, the label was returned to the dataset so that a machine learning algorithm could measure the accuracy of each classifier. We wrote R code to analyze the data and make it incomplete under MCAR. Under this mechanism, the missingness is unsystematic: the missing values have no relationship with the specific column or any other column. Each cell of the dataset has the same chance of being selected, so we randomly sample from our dataset, treated as the population, to obtain missing datasets at the mentioned proportions.
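The MCAR sampling just described can be sketched as deleting each cell with equal probability. The original analysis used R code; the following is an illustrative Python re-sketch with NumPy, not the authors' code.

```python
import numpy as np

def make_mcar(X, prop, rng):
    """Delete values completely at random: every cell has the same
    probability `prop` of being replaced by np.nan (MCAR)."""
    X = X.astype(float).copy()
    mask = rng.random(X.shape) < prop
    X[mask] = np.nan
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # hypothetical complete dataset
X_mcar = make_mcar(X, 0.20, rng)

# Empirically, about 20% of cells are now missing.
frac_missing = np.isnan(X_mcar).mean()
print(round(frac_missing, 3))
```

Because the mask is drawn independently of all values, the resulting missingness depends on neither observed nor unobserved data, which is exactly the MCAR assumption.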
MAR missingness is a bit more complex because the missingness depends on the observed values. For this reason, we wrote R code that implements MAR as a conditional form of MCAR: whereas under MCAR every cell has the same probability of being selected and samples are drawn at random, under MAR the probability is conditioned on observed values, hence we call it CMAR (Conditional Missing at Random), so as not to confuse the two abbreviated terms. The selection probability is made equal to that of MCAR given the condition, and quantiles are used as the conditioning events. Finally, we imputed the missing data using the MICE algorithm and then fed the result to the machine learning algorithm to measure accuracy; these steps were carried out in different chapters. MNAR, the most problematic type, is left for further research (Figure 1).
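The quantile-conditioned scheme can be sketched along the same lines. In this simplified illustration (not the authors' exact R code), values in one column are deleted only where a second, fully observed column falls below its median, so the missingness depends on observed values.

```python
import numpy as np

def make_mar(X, target_col, cond_col, prop, rng):
    """Delete values in `target_col` only for rows where `cond_col` lies
    below its median, so the missingness depends on observed values (MAR).
    Within the eligible half of the rows, cells are deleted at rate 2*prop,
    making the column-wide missing rate approximately `prop`."""
    X = X.astype(float).copy()
    threshold = np.quantile(X[:, cond_col], 0.5)
    eligible = X[:, cond_col] < threshold
    hit = eligible & (rng.random(len(X)) < 2 * prop)
    X[hit, target_col] = np.nan
    return X

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))          # hypothetical complete dataset
X_mar = make_mar(X, target_col=0, cond_col=1, prop=0.10, rng=rng)

col_missing = np.isnan(X_mar[:, 0]).mean()
# Every deletion happened where column 1 is below its median.
cond_ok = np.all(X_mar[np.isnan(X_mar[:, 0]), 1] < np.quantile(X[:, 1], 0.5))
print(round(col_missing, 3), bool(cond_ok))
```

Within the conditioning event the selection is still completely random, which is the sense in which this construction is a conditional MCAR.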
Figure 1. Missing data: patterns (what) and mechanisms (why).
5. Literature Review
A few publications, such as Little and Rubin [16] and Schafer [17], give an incredibly detailed and sophisticated theoretical outlook on and analysis of missing data. Rubin [18] and Schafer [19] provide a complete discussion of the theoretical conjecture of multiple imputation, with some examples of its use.
Stef van Buuren [20] focused on the flexibility needed for the many data types present in real applications within the multiple imputation framework. The techniques for creating estimates of missing data values vary, but multivariate imputation by chained equations (MICE), a popular method, creates these estimates using predictive mean matching, logistic regression, Bayesian linear regression, and many others (van Buuren and Groothuis-Oudshoorn [21]). Multivariate imputation by chained equations has arisen in data analysis as a robust, major technique for handling missing data. Generating multiple imputations, instead of single imputations, accounts for the uncertainty in the data within the imputations. In addition, the chained equations method is flexible and able to handle variables of different types, including continuous data. The method assumes that the missing data are of the MAR type (missing at random), meaning that the likelihood of a value being missing from the dataset depends solely on the observed values; in other words, the missingness that remains after controlling for all the available data is purely random.
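The chained-equations idea can be illustrated with a minimal sketch: each incomplete column is repeatedly regressed on the others, and its missing cells are refilled with the predictions. This shows only the iteration skeleton of the principle; it is not the mice package, and real MICE additionally draws imputations with noise and produces several imputed datasets.

```python
import numpy as np

def chained_impute(X, n_iter=10):
    """Minimal chained-equations sketch: start from mean imputation, then
    repeatedly regress each incomplete column on the remaining columns
    (least squares on the originally observed rows) and overwrite its
    missing cells with the fitted predictions."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]          # initial fill
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            beta, *_ = np.linalg.lstsq(A[~miss[:, j]], X[~miss[:, j], j],
                                       rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ beta
    return X

# Hypothetical data: column 1 is strongly predictive of column 0.
rng = np.random.default_rng(2)
z = rng.normal(size=500)
X = np.column_stack([z, 2 * z + rng.normal(scale=0.1, size=500)])
X[:50, 0] = np.nan                      # delete some values of column 0
X_imp = chained_impute(X)
err = np.abs(X_imp[:50, 0] - z[:50]).mean()
print(round(float(err), 3))
```

Because the imputations for one column feed the regressions for the next, the procedure is iterated until the filled-in values stabilize, which is the "chained" part of the name.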
The steps of such a hypothesis test are: stating the null and alternative hypotheses; choosing the level of significance (α), which is simply the area of the tail as in the diagram shown; finding the critical values; finding the test statistic; and then drawing a conclusion about the null hypothesis. Rejecting it means we currently believe that the alternative hypothesis is true. To look up the critical value in the chi-square table, we must know the degrees of freedom, calculated as (rows − 1) × (columns − 1), or (n − 1) for a one-way table, and find the entry in the column for α = 0.05. The test statistic is the chi-square, χ² = Σ((O − E)² / E), and if it is less than the critical value, we conclude that we cannot reject the null hypothesis (Figure 2).
From the table below, we observe that the chi-square critical value increases as the degrees of freedom (df) increase (Table 2).
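The computation just described can be checked on a small goodness-of-fit example (hypothetical coin-flip counts); the 0.05 critical value of 3.841 for df = 1 comes from the standard chi-square table.

```python
# Goodness-of-fit example: a coin flipped 100 times, with 60 heads and
# 40 tails observed against an expected 50/50 split.
observed = [60, 40]
expected = [50, 50]

# Chi-square statistic: sum of (O - E)^2 / E over the categories.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# df = n - 1 = 1; the 0.05 critical value from the chi-square table is 3.841.
critical = 3.841
reject_null = chi_sq > critical
print(chi_sq, reject_null)   # 4.0, True -> reject the fair-coin hypothesis
```

Here χ² = 4.0 exceeds the critical value, so the null hypothesis of a fair coin is rejected at the 5% level.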
- Dixon’s MCAR test compares the means of complete and incomplete cases; if the t-statistic is insignificant, the data are consistent with MCAR. If, however, the t-statistic is significant, the data are not MCAR.
- Little’s test for MCAR is the test used most widely to determine whether data can be assumed to be MCAR (Little, 1988). If its p-value is statistically insignificant, the data are assumed to be MCAR, and the missingness is then assumed not to be important in the analysis. Using listwise deletion on observations with missing values is then an appropriate way to deal with them, or we can use the more advanced and reliable imputation method, multiple imputation by chained equations, if a more complete dataset is needed to increase the sample size and achieve better statistical power.
- Hawkins’s test for MCAR is a test of multivariate normality as well as a test of homogeneity of covariances.
- Wilks’s test shows the amount of variance in the response variable accounted for by the explanatory variables.
- For MAR, there are other tests, such as Diggle’s test, the Kolmogorov MAR test, and Fisher’s test.
- For MNAR, there is the Fairclough approach.
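Dixon's approach from the first bullet above can be illustrated with a two-sample t statistic comparing an observed variable between cases that are complete and incomplete on another variable. This sketch uses the Welch form on randomly generated toy data and compares |t| with 1.96 as a rough 5% cutoff; it is an illustration of the idea, not a full implementation of the test.

```python
import numpy as np

def welch_t(a, b):
    """Welch two-sample t statistic (unequal-variance form)."""
    va, vb = a.var(ddof=1), b.var(ddof=1)
    return (a.mean() - b.mean()) / np.sqrt(va / len(a) + vb / len(b))

rng = np.random.default_rng(3)
age = rng.normal(40, 10, size=1000)          # fully observed variable
# MCAR deletion of a second variable: which rows are incomplete is random.
incomplete = rng.random(1000) < 0.3

# Compare the observed variable between complete and incomplete cases.
t = welch_t(age[~incomplete], age[incomplete])
looks_mcar = abs(t) < 1.96   # insignificant difference -> consistent with MCAR
print(round(float(t), 3), looks_mcar)
```

Since the deletion here is genuinely random, the group means should usually differ insignificantly; under MAR or MNAR deletion tied to age, the statistic would tend to be significant.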
For Equation (1), we use the likelihood-ratio test (λ): from the dataset, we calculate the mean and standard deviation to see whether the mean (µ) is equal to a certain estimated value (µ0). The hypotheses are H0: µ = µ0 against H1: µ ≠ µ0, and the likelihood ratio after calculation is [26] λ = (1 + t²/(n − 1))^(−n/2).
If the missing data do not satisfy Equation (1), the MAR condition, then the assumption is MNAR.
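The relation between the t statistic and λ can be checked numerically on a toy sample (hypothetical numbers chosen only for illustration):

```python
import math

# Toy sample for testing H0: mu = mu0 against H1: mu != mu0.
x = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0]
n = len(x)
mu0 = 5.0

# One-sample t statistic from the sample mean and standard deviation.
mean = sum(x) / n
s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
t = (mean - mu0) / (s / math.sqrt(n))

# Likelihood ratio from the section: lambda = (1 + t^2/(n-1))^(-n/2).
lam = (1 + t ** 2 / (n - 1)) ** (-n / 2)
print(round(t, 3), round(lam, 3))
```

λ always lies in (0, 1]; the farther t is from zero, the smaller λ becomes, and small values of λ lead to rejection of H0.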
Pr ( R | Ym , Yo , X ) = Pr ( R | Ym , X ) (3)
Table 3. Examples of listwise deletion and pairwise deletion [28].

Original data:
M   25   243
F    .   280
M   33   332
M    .   272
F   25     .
M   29   326
.   26   259
M   32   297

After deletion:
M   25   243
.
M   33   332
.    272
.
M   29   326
.
M   32   297
As for imputation with linear regression, a few predictors of the variable with missing values are identified with the help of a correlation matrix. The best predictors are used as predictor variables, with the variable containing missing data as the dependent variable, in a regression equation that is then used to predict the missing values. Imputation with stochastic regression improves on linear regression-based imputation by adding a random noise term to the regression prediction, restoring lost variability to the data.
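Stochastic regression imputation as just described can be sketched as follows. This is an illustrative NumPy implementation on hypothetical data, not tied to any particular package.

```python
import numpy as np

def stochastic_regression_impute(X, target, rng):
    """Impute missing values of column `target` by regressing it on the
    other (fully observed) columns, then adding normal noise with the
    residual standard deviation to restore the lost variability."""
    X = X.astype(float).copy()
    miss = np.isnan(X[:, target])
    others = np.delete(X, target, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    # Fit the regression on the observed rows of the target column.
    beta, *_ = np.linalg.lstsq(A[~miss], X[~miss, target], rcond=None)
    resid = X[~miss, target] - A[~miss] @ beta
    sigma = resid.std(ddof=A.shape[1])
    # Prediction plus random noise, instead of the bare regression line.
    X[miss, target] = A[miss] @ beta + rng.normal(0, sigma, miss.sum())
    return X

rng = np.random.default_rng(4)
z = rng.normal(size=400)
X = np.column_stack([z + rng.normal(scale=0.5, size=400), z])
X[:80, 0] = np.nan
X_imp = stochastic_regression_impute(X, target=0, rng=rng)
print(np.isnan(X_imp).sum())
```

Dropping the noise term would collapse the imputed values onto the regression line, shrinking the variance of the imputed variable; the added noise is precisely what distinguishes stochastic from plain regression imputation.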
Other traditional techniques include general techniques, scale-item techniques,
and time-series techniques.
Enders summarized the traditional techniques for the missingness mechanisms; we consider his summary a road map for handling missing data, as in Table 4.
All the above-mentioned methods, however, cause bias and do not satisfactorily solve the problem of missing data.
MCAR:
● Listwise deletion if the sample size is very small
MAR:
● Listwise deletion if the amount of missing data is small and not important
● Use stochastic regression imputation
MNAR:
● None of the above applicable
● Selection models
● MAR techniques
imputation field [33]. Therefore, in this dissertation, the MICE technique, along with multiple imputation, has been explored and adopted. Multiple imputation, MICE, and their application to the analysis are described in a subsequent chapter.
9. Conclusion
Missing data is always a limiting factor when undertaking any test or experimental project, regardless of whether the missing data are MAR, MNAR, or MCAR. In all these scenarios, there will be a loss of statistical power, which can lead to a type II error, and this may result in inaccurate inferences about a population. Therefore, researchers should try to prevent missing data in the datasets used in their research where possible. This can only be achieved through a careful approach to data collection and preparation, as stated earlier in the chapter. If, on the other hand, missing data are impossible to avoid, as they are in most cases, then researchers should adopt appropriate techniques for handling them, preferably modern methods such as multivariate imputation by chained equations (MICE), because traditional techniques involve excluding cases that have missing values in the dataset. Such older methods are inappropriate, since research usually strives to make inferences about an entire population and not just the portion represented in a dataset.
Acknowledgements
The authors would like to thank the reviewers for their helpful notes and suggestions on this article. Thanks also extend to Scientific Research Publishing and the Open Journal of Statistics for their valuable reviews and for publishing this work.
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this pa-
per.
References
[1] Brangetto, P. and Veenendaal, M.A. (2016) Influence Cyber Operations: The Use of Cyberattacks in Support of Influence Operations. 2016 8th International Conference on Cyber Conflict (CyCon), Tallinn, 31 May-3 June 2016, 113-126. https://doi.org/10.1109/CYCON.2016.7529430
[2] Schafer, J.L. (2003) Multiple Imputation in Multivariate Problems When the Imputation and Analysis Models Differ. Statistica Neerlandica, 57, 19-35. https://doi.org/10.1111/1467-9574.00218
[3] White, I.R., Royston, P. and Wood, A.M. (2011) Multiple Imputation Using Chained Equations: Issues and Guidance for Practice. Statistics in Medicine, 30, 377-399. https://doi.org/10.1002/sim.4067
[4] Little, R.J. and Rubin, D.B. (1989) The Analysis of Social Science Data with Missing Values. Sociological Methods & Research, 18, 292-326. https://doi.org/10.1177/0049124189018002004
[5] Van Buuren, S. (2018) Flexible Imputation of Missing Data. CRC Press, Boca Raton.
[6] Rubin, D.B. (1976) Inference and Missing Data. Biometrika, 63, 581-592. https://doi.org/10.1093/biomet/63.3.581
[7] Rubin, D.B. (1978) Multiple Imputations in Sample Surveys—A Phenomenological Bayesian Approach to Nonresponse. In: Proceedings of the Survey Research Methods Section of the American Statistical Association, Vol. 1, American Statistical Association, Alexandria, 20-34.
[8] Little, R.J. (1988) A Test of Missing Completely at Random for Multivariate Data with Missing Values. Journal of the American Statistical Association, 83, 1198-1202. https://doi.org/10.1080/01621459.1988.10478722
[9] Doove, L.L., Van Buuren, S. and Dusseldorp, E. (2014) Recursive Partitioning for Missing Data Imputation in the Presence of Interaction Effects. Computational Statistics & Data Analysis, 72, 92-104. https://doi.org/10.1016/j.csda.2013.10.025
[10] Chen, H.Y. and Little, R. (1999) A Test of Missing Completely at Random for Generalised Estimating Equations with Missing Data. Biometrika, 86, 1-13. https://doi.org/10.1093/biomet/86.1.1
[11] Schafer, J.L. and Olsen, M.K. (1998) Multiple Imputation for Multivariate Missing-Data Problems: A Data Analyst's Perspective. Multivariate Behavioral Research, 33, 545-571. https://doi.org/10.1207/s15327906mbr3304_5
[12] Graham, J.W. and Hofer, S.M. (2000) Multiple Imputation in Multivariate Research. In: Little, T.D., Schnabel, K.U. and Baumert, J., Eds., Modeling Longitudinal and Multilevel Data, Psychology Press, New York, 189-204. https://doi.org/10.4324/9781410601940-15
[13] Heitjan, D.F. and Basu, S. (1996) Distinguishing "Missing at Random" and "Missing Completely at Random". The American Statistician, 50, 207-213. https://doi.org/10.1080/00031305.1996.10474381
[14] McPherson, S., Barbosa-Leiker, C., Mamey, M.R., McDonell, M., Enders, C.K. and Roll, J. (2015) A 'Missing Not at Random' (MNAR) and 'Missing at Random' (MAR) Growth Model Comparison with a Buprenorphine/Naloxone Clinical Trial. Addiction, 110, 51-58. https://doi.org/10.1111/add.12714
[15] Little, R.J. and Smith, P.J. (1987) Editing and Imputation for Quantitative Survey Data. Journal of the American Statistical Association, 82, 58-68. https://doi.org/10.1080/01621459.1987.10478391
[16] Little, R.J. and Rubin, D.B. (2019) Statistical Analysis with Missing Data. Vol. 793, John Wiley & Sons, Hoboken. https://doi.org/10.1002/9781119482260
[17] Graham, J.W., Hofer, S.M., Donaldson, S.I., MacKinnon, D.P. and Schafer, J.L. (1997) Analysis with Missing Data in Prevention Research. https://content.apa.org/doi/10.1037/10222-010
[18] Rubin, D.B. (2003) Discussion on Multiple Imputation. International Statistical Review, 71, 619-625. https://doi.org/10.1111/j.1751-5823.2003.tb00216.x
[19] Schafer, J.L. and Graham, J.W. (2002) Missing Data: Our View of the State of the Art. Psychological Methods, 7, 147-177. https://doi.apa.org/doi/10.1037/1082-989X.7.2.147
[20] Van Buuren, S. (2011) Multiple Imputation of Multilevel Data. Routledge, 181-204. https://doi.org/10.1201/b11826
[21] Van Buuren, S., Groothuis-Oudshoorn, K., Robitzsch, A., Vink, G., Doove, L., Jolani, S., et al. (2015) Package 'Mice'.
[22] Diggle, P.J. (1979) On Parameter Estimation and Goodness-of-Fit Testing for Spatial Point Patterns. Biometrics, 35, 87-101. https://doi.org/10.2307/2529938
[23] Barr, D.R. and Davidson, T. (1973) A Kolmogorov-Smirnov Test for Censored Samples. Technometrics, 15, 739-757. https://doi.org/10.1080/00401706.1973.10489108
[24] Saylordot Organisation (2019) 11.2 Chi-Square One Sample Test of Goodness of Fit. https://saylordotorg.github.io/text_introductory-statistics/s15-02-chi-square-one-sample-goodness.html
[25] Tallarida, R.J. and Murray, R.B. (1987) Chi-Square Test. In: Manual of Pharmacologic Calculations. Springer, New York, 140-142. https://doi.org/10.1007/978-1-4612-4974-0_43
[26] Kent, J.T. (1982) Robust Properties of Likelihood Ratio Tests. Biometrika, 69, 19-27. https://doi.org/10.1093/biomet/69.1.19
[27] Scheffer, J.A. (2000) An Analysis of the Missing Data Methodology for Different Types of Data: A Thesis Presented in Partial Fulfilment of the Requirements for the Degree of Master of Applied Statistics at Massey University. Doctoral Dissertation, Massey University, Palmerston North.
[28] StackOverflow (2017) Machine Learning with Incomplete Data. https://stackoverflow.com/questions/39386936/machine-learning-with-incomplete-data
[29] Enders, C.K. (2010) Applied Missing Data Analysis. Guilford Press, New York.
[30] Jakobsen, J.C., Gluud, C., Wetterslev, J. and Winkel, P. (2017) When and How Should Multiple Imputation Be Used for Handling Missing Data in Randomised Clinical Trials—A Practical Guide with Flowcharts. BMC Medical Research Methodology, 17, Article No. 162. https://doi.org/10.1186/s12874-017-0442-1
[31] Graham, J.W., Hofer, S.M. and MacKinnon, D.P. (1996) Maximizing the Usefulness of Data Obtained with Planned Missing Value Patterns: An Application of Maximum Likelihood Procedures. Multivariate Behavioral Research, 31, 197-218. https://doi.org/10.1207/s15327906mbr3102_3
[32] Andradóttir, S. and Bier, V.M. (2000) Applying Bayesian Ideas in Simulation. Simulation Practice and Theory, 8, 253-280. https://doi.org/10.1016/S0928-4869(00)00025-2
[33] Azur, M.J., Stuart, E.A., Frangakis, C. and Leaf, P.J. (2011) Multiple Imputation by Chained Equations: What Is It and How Does It Work? International Journal of Methods in Psychiatric Research, 20, 40-49. https://doi.org/10.1002/mpr.329