Comparing Multiple Imputation and Machine Learning Techniques For Longitudinal Data
ISSN No:-2456-2165
Abstract:- Finding ways to impute the missing values in a data set is an essential part of research.
Missingness is unavoidable, as it can arise from natural or non-natural causes. Missing information is
inevitable in longitudinal or multilevel studies and can result in biased estimates, loss of power,
increased variability and inaccurate results. For this study, a complete data set recording the
resistance scores of intellectually disabled children who were given behavioral skill training was used
to compare various imputation techniques. The secondary data were longitudinal in nature: the
resistance score was recorded before the training and at four different time points after the training.
Missingness was created at random in the complete data at varying percentages (5%, 10%, 15%, 20%,
30%) using the MAR mechanism. The values obtained after imputation were compared with the full data
using a linear mixed model. Various models were built under the multiple imputation and machine
learning techniques to impute the different features used to predict the resistance score, using the
coefficients taken from the real data, and the same mechanism was implemented for simulated data as
well. The methods based on machine learning techniques were the most suitable for imputing missing
values and led to a significant improvement in prognosis accuracy when compared with multiple
imputation techniques using linear mixed models.

Keywords:- Multiple Imputation, MAR Mechanisms, Machine Learning Techniques, Linear Mixed Effect Model.

I. INTRODUCTION

Dealing with the incidence of missing data remains a concern when using real-world data to supplement
clinical trials [1]. It is not always practical to remove a record from the data in order to make
predictions. In a study, missing data can cause considerable variation in the results and can also lead
to bias, which in turn can lead to flawed and inaccurate results. The missing values in the data may be
replaced by values obtained using a variety of methods. A study design that involves observing the same
variables repeatedly over a range of time frames is called a longitudinal study. Longitudinal studies [2]
are frequently employed in clinical trials and social-personality psychology to examine quick changes in
actions, thoughts, and emotions from one moment to the next or from one day to the next, and they are
frequently referred to as a type of observational study.

In this study, the missing values [3] were imputed using machine learning and several types of multiple
imputation techniques, and the accuracy of the imputed values was compared using a linear mixed model
(LMM) [4].

A. Missing Data
Longitudinal data [5] frequently contain missing values. This is particularly true in long-term
biomedical investigations of people, since it is hard to guarantee 100% protocol compliance. A linear
mixed model can accommodate missing values when fitting a model and predicting a variable, unlike
other models that require a complete data set, although the resulting estimates are valid only under
certain assumptions about why the data are missing. Often the underlying reason for the occurrence of
missing data is not known, yet in order to recover these values the reason or cause has to be understood
in advance. The missing data literature [3] mainly distinguishes three mechanisms by which data can be
missing.

Missing at Random (MAR)
This indicates that there is a systematic link between the probability of a value being missing and the
observed data; that is, the missingness can be anticipated using the other attributes in the data set.

Missing Completely at Random (MCAR)
In contrast to MAR, this form of missingness [6] means that there is no connection between the
missingness of a value and the other features in the data set.

Since no reasoning is needed to understand why a certain value is missing, this form of missingness is
the easiest to comprehend.

Missing not at Random (MNAR)
MNAR data are the hardest to identify and to work with. The missingness is caused by factors that were
not taken into consideration.
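
The three missingness mechanisms (MCAR, MAR, MNAR) can be illustrated with a small simulation. The sketch below is not the procedure used in this study; it is a minimal example under stated assumptions. The function names `simulate_scores` and `delete_scores`, the simulated score values, the logistic weighting, and the choice to keep the pre-training (baseline) score always observed are all hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def simulate_scores(n_children=200, n_times=5, seed=0):
    """Simulate long-format rows: one score per child per time point."""
    rng = random.Random(seed)
    rows = []
    for child in range(n_children):
        intercept = rng.gauss(50.0, 8.0)            # child-specific level (made-up values)
        baseline = intercept + rng.gauss(0.0, 4.0)  # time 0, before training
        rows.append({"child": child, "time": 0, "baseline": baseline, "score": baseline})
        for t in range(1, n_times):
            score = intercept + 3.0 * t + rng.gauss(0.0, 4.0)
            rows.append({"child": child, "time": t, "baseline": baseline, "score": score})
    return rows


def delete_scores(rows, rate, mechanism, seed=1):
    """Return a copy of rows in which some follow-up scores are set to None.

    MCAR: every follow-up score has the same deletion probability.
    MAR:  the probability depends on the baseline, which stays observed.
    MNAR: the probability depends on the very score that may be deleted.
    """
    if mechanism not in ("MCAR", "MAR", "MNAR"):
        raise ValueError("mechanism must be 'MCAR', 'MAR' or 'MNAR'")
    rng = random.Random(seed)
    out = [dict(r) for r in rows]
    eligible = [r for r in out if r["time"] > 0]  # assumption: baseline is never deleted
    if mechanism == "MCAR":
        weights = [1.0] * len(eligible)
    else:
        key = "baseline" if mechanism == "MAR" else "score"
        values = [r[key] for r in eligible]
        mu, sd = statistics.fmean(values), statistics.pstdev(values)
        # Assumed direction: lower values make a score more likely to go missing.
        weights = [1.0 / (1.0 + math.exp(1.5 * (v - mu) / sd)) for v in values]
    mean_weight = statistics.fmean(weights)
    for row, w in zip(eligible, weights):
        # Rescale so that the average deletion probability is close to `rate`.
        if rng.random() < min(1.0, rate * w / mean_weight):
            row["score"] = None
    return out


rows = simulate_scores()
n_follow_up = sum(r["time"] > 0 for r in rows)
for pct in (5, 10, 15, 20, 30):
    incomplete = delete_scores(rows, pct / 100, "MAR")
    n_missing = sum(r["score"] is None for r in incomplete)
    print(f"target {pct}%: {n_missing} of {n_follow_up} follow-up scores deleted")
```

Under the MAR setting in this sketch, the deleted follow-up scores belong disproportionately to children with low baseline scores, which is exactly the kind of link to observed data that an imputation model can exploit. Under MNAR the same link exists only to the deleted values themselves, so it cannot be recovered from the observed data alone.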
The various techniques were compared using linear mixed modelling, based on the estimates produced by the models fitted to the
imputed data. These findings were tabulated in order to determine which of these techniques generated [9] the outcomes most
consistent with the original model developed for the complete data set. The predicted values for all missingness percentages
were then combined to check whether the techniques can produce accurate estimates at all levels of missingness.
Finally, a conclusion was drawn on which method provides better estimates of the missing values by comparing the imputed
values with the original values.
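
The comparison workflow described above (delete values at each missingness percentage, impute, refit, and tabulate the deviation from the full-data estimate) can be sketched as follows. This is only an illustration under strong simplifying assumptions and does not reproduce the models of this study: the data are simulated with made-up parameters, the two imputation methods (per-time-point mean imputation and regression on the baseline score) are simple stand-ins for the multiple imputation and machine learning techniques, and a pooled ordinary least squares slope of score on time stands in for the fixed effect of a linear mixed model so that the example needs no external library. All function names are hypothetical.

```python
import math
import random
import statistics


def simulate(n_children=1000, n_times=5, seed=0):
    """Return data[child][time] with a random intercept and a time slope of 3."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_children):
        intercept = rng.gauss(50.0, 8.0)
        data.append([intercept + 3.0 * t + rng.gauss(0.0, 4.0) for t in range(n_times)])
    return data


def ols(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def time_slope(data):
    """Pooled OLS slope of score on time (stand-in for the LMM fixed effect)."""
    xs = [t for row in data for t in range(len(row))]
    ys = [y for row in data for y in row]
    return ols(xs, ys)[1]


def delete_mar(data, rate, seed):
    """Delete follow-up scores; a lower baseline raises the deletion probability."""
    rng = random.Random(seed)
    baselines = [row[0] for row in data]
    mu, sd = statistics.fmean(baselines), statistics.pstdev(baselines)
    weights = [1.0 / (1.0 + math.exp(1.5 * (b - mu) / sd)) for b in baselines]
    mean_weight = statistics.fmean(weights)
    out = []
    for row, w in zip(data, weights):
        p = min(1.0, rate * w / mean_weight)
        out.append([row[0]] + [None if rng.random() < p else y for y in row[1:]])
    return out


def impute_mean(data):
    """Replace each missing score with the observed mean at that time point."""
    filled = [list(row) for row in data]
    for t in range(1, len(data[0])):
        mean_t = statistics.fmean([row[t] for row in data if row[t] is not None])
        for row in filled:
            if row[t] is None:
                row[t] = mean_t
    return filled


def impute_regression(data):
    """Replace each missing score with its prediction from the baseline score."""
    filled = [list(row) for row in data]
    for t in range(1, len(data[0])):
        pairs = [(row[0], row[t]) for row in data if row[t] is not None]
        a, b = ols([x for x, _ in pairs], [y for _, y in pairs])
        for row in filled:
            if row[t] is None:
                row[t] = a + b * row[0]
    return filled


def compare(rates=(0.05, 0.10, 0.15, 0.20, 0.30)):
    """Tabulate |imputed slope - full-data slope| per method and missingness rate."""
    full = simulate()
    target = time_slope(full)
    table = {}
    for i, rate in enumerate(rates):
        incomplete = delete_mar(full, rate, seed=100 + i)
        table[rate] = {
            "mean": abs(time_slope(impute_mean(incomplete)) - target),
            "regression": abs(time_slope(impute_regression(incomplete)) - target),
        }
    return table


for rate, errors in compare().items():
    print(f"{rate:.0%}  mean={errors['mean']:.3f}  regression={errors['regression']:.3f}")
```

Each row of the resulting table reports, for one missingness percentage, how far the estimate from the imputed data lies from the full-data estimate. In a real analysis the two stand-in imputers would be replaced by the techniques under study and the slope by the estimates of a fitted linear mixed model.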