
Volume 8, Issue 10, October – 2023 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Comparing Multiple Imputation and Machine Learning Techniques for Longitudinal Data

Sanjana Rajamani¹, Seena Thomas²
¹Department of Statistics and Data Science, Christ (Deemed to be University), Bangalore, India
²Department of Statistics and Data Science, Christ (Deemed to be University), Bangalore, India

Abstract:- It is an essential part of research to find ways to impute the missing values in a data set. The missingness is unavoidable as it could be due to natural or non-natural reasons. Missing information is inevitable in longitudinal or multilevel studies, and can result in biased estimates, loss of power, variability and inaccuracy in results. For this study, a complete data set recording the resistance scores of intellectually disabled children given behavioral skills training was considered in order to compare the various imputation techniques. The secondary data collected were longitudinal in nature. The resistance score was noted before the training and at four different time points after the training. Random missingness was created in the complete data at varying percentages (5%, 10%, 15%, 20%, 30%) using the MAR mechanism. The values obtained after imputation were compared with the full data using a linear mixed model. Various models were built under the multiple imputation and machine learning techniques for imputing the features used to predict the resistance score, using coefficients taken from the real data; the same mechanism was implemented for simulated data as well. The methods based on machine learning techniques were the most suited for the imputation of missing values and led to a significant enhancement of prediction accuracy when compared to multiple imputation techniques using linear mixed models.

Keywords:- Multiple Imputation, MAR Mechanisms, Machine Learning Techniques, Linear Mixed Effect Model.

I. INTRODUCTION

Dealing with the incidence of missing data remains a concern when using real-world data to supplement clinical trials [1]. It is not always practical to remove a record from your data in order to make predictions. In this study, missing data introduced considerable variation in the results; missingness can also lead to bias, which could further lead to flawed and inaccurate conclusions. By estimating these values using a variety of methods, the missing values in the data may be replaced. A study design that involves observing the same variables repeatedly over a range of time frames is called longitudinal. Since longitudinal studies [2] are frequently employed in clinical trials and social-personality psychology to examine quick changes in actions, thoughts, and emotions from one moment or day to the next, they are frequently regarded as a type of observational study.

These missing values [3] were imputed using machine learning and several types of multiple imputation techniques. Using a linear mixed model (LMM) [4], the accuracy of these imputed values was compared.

A. Missing Data
Longitudinal data [5] frequently contain missing values. This is particularly true in long-term biomedical investigations of people, since it is hard to guarantee 100% protocol compliance. Unlike other models, which require a complete data set to fit a model and predict the estimates, a linear mixed model can take missing values into account and still produce results, although the missingness affecting the parameters we want to estimate has to be of a certain type. Many a time the underlying reason for the occurrence of missing data is not known; in order to recover these values, the reason or cause has to be known well in advance. The missing-data literature [3] mainly distinguishes three mechanisms by which data can be missing.

 Missing at Random (MAR)
This indicates that there is a systematic link between the probability of missing values and the observed data; that is, the missing value can be anticipated using the other attributes in the data set.

 Missing Completely at Random (MCAR)
In contrast to MAR, this form of missingness [6] shows that there is no connection between the missing value and the other features in the data set. Since there is no logic involved in why a certain value is missing, this form of missing values is the easiest to comprehend.

 Missing not at Random (MNAR)
MNAR data are the hardest both to detect and to work with. The missingness and lack of data are caused by factors that we failed to take into consideration.
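The MAR mechanism described above can be mimicked in a short simulation. The study itself was carried out in RStudio; the sketch below is an illustrative Python version on made-up data (the column names `resistance`, `gender`, and `education` are placeholders, not the study's actual variables), in which the probability that `education` is missing depends only on the observed `gender` column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy data: one outcome ("resistance") plus two observed covariates.
n = 600
df = pd.DataFrame({
    "resistance": rng.normal(8.6, 1.0, n),
    "gender": rng.integers(0, 2, n),
    "education": rng.integers(0, 4, n).astype(float),
})

def make_mar(data, target_col, driver_col, frac, rng):
    """Blank out `frac` of `target_col`, with the probability of being
    missing driven by the observed `driver_col` (the MAR mechanism:
    missingness depends only on observed values)."""
    out = data.copy()
    # Higher driver values -> higher chance of being selected as missing.
    weights = (out[driver_col] - out[driver_col].min() + 1).astype(float)
    weights /= weights.sum()
    n_missing = int(round(frac * len(out)))
    idx = rng.choice(out.index.to_numpy(), size=n_missing,
                     replace=False, p=weights.to_numpy())
    out.loc[idx, target_col] = np.nan
    return out

amputed = make_mar(df, "education", "gender", 0.05, rng)
print(amputed["education"].isna().mean())  # exactly 30/600 = 0.05
```

The same call with `frac` set to 0.10, 0.15, or 0.20 reproduces the other missingness levels used in the study; the outcome column is left untouched, as in the paper.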

IJISRT23OCT1169 www.ijisrt.com 1013


All real-world data sets have missing values, and this definitely has to be taken care of. There are several approaches to handling missing data, since the outcomes we aim to get might be significantly impacted by these missing values [7]. Data recording errors, damaged raw data columns, and human error are some of the common causes of missing data. Missing data pose significant obstacles for a study's findings: if important information is missed, it is difficult to draw any conclusions [8]. Hence, a study of several tactics is necessary to ascertain which approach will result in adequate outcomes and a sound model [9].

B. Imputation
Missing data may add a significant amount of bias, make processing and analyzing the data more difficult, and reduce efficiency. That is, omissions (introduced in collection or processing) may result in certain sub-groups of the target population being excluded from the analysis of the data set, in turn increasing the risk of biased estimates, reducing the power of inferential statistics and increasing the uncertainty of estimates and inferences derived from the data. By substituting missing data with an estimated value based on other available information, imputation preserves all cases [8]. The data set may be analyzed using methods intended for complete data once all missing values have been imputed.

 Multiple Imputation Techniques
When using the multiple imputation [10] process, the distribution of each variable with missing values has to be modeled. Multiple imputation is a generic solution to the issue of missing data that is included in a number of widely used statistical software programs [7]. By constructing many distinct plausible imputed data sets and correctly merging the findings from each one, it seeks to account for the uncertainty around the missing data. It is a method for addressing non-response bias that is based on a Bayesian approach. The steps involved in multiple imputation [11] are: missing values are imputed m times (m > 1), resulting in m complete data sets; the imputed values are drawn from distributions modelled specifically for each missing entry. The standard suggestion for relative efficiency is 4-5 imputations, while a higher number of imputations would give more accurate results. Each of these data sets is analyzed using the statistical model, creating m sets of estimates. The m analysis estimates are combined into one multiple imputation estimate.

 Fully Conditional Specification (FCS)
We refer to this method as FCS-Standard since it uses all of the repeated measurements of the time-varying covariate as predictors in each of the univariate imputation models. Due to model overfitting / collinearity when there are a lot of associated repeated measures, this method is vulnerable to convergence issues [10].

 Fully Conditional Specification, Linear Mixed Model (FCS-LMM)
This method cycles over the univariate imputation models, using a multilevel LMM to impute missing values in each incomplete time-dependent variable given all the others, instead of considering repeated measurements as separate variables. The linear two-level model with homogeneous within-subject variances, a specific instance of a multivariate LMM [5], is implemented using the Gibbs sampler in this method.

 Joint Modelling (JM)
The algorithm [13] behind this technique employs a joint multivariate [12] LMM for imputing multiple incomplete longitudinal variables [11], rather than considering repeated measurements as separate variables. In order to account for dependence among people over time, this technique presupposes that all the incomplete variables are continuous with subject-specific random effects. Similar to the univariate LMM [14], this approach assumes that measurement errors and random effects have a normal distribution with constant error covariance across all subjects.

 Machine Learning Techniques
A large number of important machine learning methods [15] have emerged since the 1980s and 1990s, such as the back-propagation neural network and random forest (RF), which had a profound impact on the medical field, including clinical decision making in the presence of missing data. Before that, the traditional methods used to process missing data in clinical decision making [16] mainly included complete case analysis, k-nearest neighbors (KNN), expectation maximization, and so on. With the in-depth application of machine learning models in this field, researchers found that machine learning models can restore the true distribution of data from missing data sets more accurately than the traditional missing-data processing models.

 K-Nearest Neighbors
An unclassified sample point is given the classification of the nearest previously classified point via the nearest-neighbor decision rule. The sample points' classifications and the underlying joint distribution have no bearing on this rule. In contrast, a large-sample analysis shows that, for all suitably smooth underlying distributions [16], these bounds are the tightest possible in the M-category case. It may be claimed that the nearest neighbor holds half of the classification information in an infinite sample set.

 Random Forest
An effective method of imputation, Random Forest virtually meets all the criteria for being the best [15] imputation method. Random Forests are fairly good at scaling to large data settings, and they can tolerate outliers and non-linearity in the data. Mixed-type data (both numerical and categorical) can be handled by Random Forests. They additionally offer a built-in feature-selection method. Random Forests can easily outperform KNN and other techniques because of these unique advantages.
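As a concrete illustration of these two machine-learning imputers (the study's implementation was in RStudio; this is a Python sketch on synthetic data, not the study's code), scikit-learn's `KNNImputer` fills each missing entry with the mean of its nearest complete neighbours, while an `IterativeImputer` driven by a `RandomForestRegressor` gives a missForest-style random-forest imputation:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic matrix whose last column depends on the others, with ~10%
# of all entries blanked out at random positions.
X_full = rng.normal(size=(200, 4))
X_full[:, 3] = X_full[:, :3] @ np.array([0.5, -0.3, 0.2]) + 0.1 * rng.normal(size=200)
X_miss = X_full.copy()
mask = rng.random(X_miss.shape) < 0.10
X_miss[mask] = np.nan

# KNN imputation: each hole filled with the mean of its 5 nearest
# neighbours under a NaN-aware Euclidean distance.
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_miss)

# Random-forest imputation (missForest-style): iteratively regress each
# incomplete column on the others with a random forest.
rf = RandomForestRegressor(n_estimators=50, random_state=0)
X_rf = IterativeImputer(estimator=rf, max_iter=5,
                        random_state=0).fit_transform(X_miss)

for name, X_imp in [("KNN", X_knn), ("RF", X_rf)]:
    rmse = np.sqrt(np.mean((X_imp[mask] - X_full[mask]) ** 2))
    print(f"{name} imputation RMSE on masked entries: {rmse:.3f}")
```

Because the true values at the masked positions are known here, the RMSE on those entries gives a direct accuracy comparison, analogous to the coefficient comparison performed in the paper.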



C. Data Description
Behavioral Skills Training (BST) was conducted to build intellectually disabled children's knowledge of sexual abuse [17] and capacity for resistance. A pilot study was carried out to look at how children who had experienced sexual abuse in the past, and those who had not, responded to the personal safety program. Parents or other caregivers trained 60 children between the ages of 3 and 7 who had been subjected to abuse and 60 children who had not. Before and after the intervention, the children were evaluated using the Personal Safety Questionnaire (PSQ). A six-month at-home behavioral skills training program on sexual abuse prevention was offered [18]. The findings indicated that both groups gained knowledge and expertise in sexual abuse prevention. After participating in the program, children who had a history of sexual abuse significantly reduced their inappropriate sexual behavior [21].

II. METHODOLOGY

 A complete data set was taken, containing the resistance score as well as several features that significantly influence the prediction of the score, and missingness was then created in it at various percentages (5%, 10%, 15%, 20%).
 The plot below depicts where missingness in the data was created. No missingness was created in the resistance column, since it is the outcome variable being predicted.
 Considering each technique's relevance to longitudinal data, a comparison was made between the multiple imputation techniques (FCS, FCS-LMM, and JM [20]) and the machine learning imputation techniques (KNN and Random Forest). These methods were all programmed using RStudio.

Fig 1 Variables with Missing Values

 Based on the estimates that the models fitted to the imputed data sets provided, the techniques were compared using linear mixed modelling. These findings were tabulated in order to determine which of the techniques generated [9] outcomes most compatible with the original model developed for the entire data set.
 The predicted values were combined across all missingness percentages to check whether the techniques can produce accurate estimates at every level of missingness.
 A conclusion on which method provides a better estimate for the missing values was reached by comparing the estimates to the original values.
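The pipeline in the bullets above — create missingness, impute, refit the linear mixed model, and compare the fixed-effect coefficients with the full-data fit — can be sketched as follows. This is a Python/statsmodels illustration on synthetic data with hypothetical variable names (`subject`, `group`, `time`, `resistance`); the study's actual models were fitted in RStudio, and a crude mode-fill stands in here for the imputation step:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Toy longitudinal data: 60 subjects x 5 time points, random intercepts.
n_subj, n_time = 60, 5
subj = np.repeat(np.arange(n_subj), n_time)
time = np.tile(np.arange(n_time), n_subj)
group = np.repeat(rng.integers(0, 2, n_subj), n_time)
u = np.repeat(rng.normal(0, 1, n_subj), n_time)  # subject-level effect
resistance = (8.6 + 1.5 * group + 0.4 * time + u
              + rng.normal(0, 0.5, n_subj * n_time))
df = pd.DataFrame({"subject": subj, "time": time,
                   "group": group.astype(float), "resistance": resistance})

def fixed_effects(data):
    """Fit a random-intercept LMM and return its fixed-effect estimates."""
    m = smf.mixedlm("resistance ~ group + time", data,
                    groups=data["subject"]).fit()
    return m.params[["group", "time"]]

full = fixed_effects(df)

# Blank 10% of `group` at random, impute with the column mode
# (a deliberately crude placeholder for the imputers compared above).
df_miss = df.copy()
idx = rng.choice(df_miss.index.to_numpy(),
                 size=int(0.10 * len(df_miss)), replace=False)
df_miss.loc[idx, "group"] = np.nan
df_miss["group"] = df_miss["group"].fillna(df_miss["group"].mode()[0])

imputed = fixed_effects(df_miss)
print((full - imputed).abs())  # per-coefficient deviation from full-data fit
```

Swapping a better imputer into the mode-fill step and tabulating these deviations for each method and missingness level reproduces the structure of the comparison reported in the Results section.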



III. RESULTS

(a) Missing Pattern

(b) Percentage of Missingness in Each Variable


Fig 2 The pattern of missingness in the data where 5% missingness was generated, and visualization of the amount of missing data. The location of missing values is shown in black, along with the overall percentage of missing values in each variable.




(a) Cumulative Sum of Missing Values

(b) Correlation Plot


Fig 3 The plots show the cumulative sum of missing values, reading the rows of the data set from top to bottom, and the correlation between the variables after generating 5% missingness in the data set.




(a) Missing Pattern

(b) Percentage of Missingness in Each Variable


Fig 4 The pattern of missingness in the data where 10% missingness was generated, and visualization of the amount of missing data. The location of missing values is shown in black, along with the overall percentage of missing values in each variable.




(a) Cumulative Sum of Missing Values

(b) Correlation Plot


Fig 5 The plots show the cumulative sum of missing values, reading the rows of the data set from top to bottom, and the correlation between the variables after generating 10% missingness in the data set.




(a) Missing Pattern

(b) Percentage of Missingness in Each Variable


Fig 6 The pattern of missingness in the data where 15% missingness was generated, and visualization of the amount of missing data. The location of missing values is shown in black, along with the overall percentage of missing values in each variable.




(a) Cumulative sum of missing values

(b) Correlation Plot


Fig 7 The plots show the cumulative sum of missing values, reading the rows of the data set from top to bottom, and the correlation between the variables after generating 15% missingness in the data set.




(a) Missing Pattern

(b) Percentage of Missingness in Each Variable


Fig 8 The pattern of missingness in the data where 20% missingness was generated, and visualization of the amount of missing data. The location of missing values is shown in black, along with the overall percentage of missing values in each variable.




(a) Cumulative Sum of Missing Values

(b) Correlation Plot


Fig 9 The plots show the cumulative sum of missing values, reading the rows of the data set from top to bottom, and the correlation between the variables after generating 20% missingness in the data set.



Table 1 The estimated coefficients obtained using LMM for the different multiple imputation and machine learning imputation techniques at different percentages (5%, 10%, 15%, 20%) of missingness are displayed in Tables 1a, 1b, 1c and 1d. Only p-values which are significant have been added to the linear mixed model.

Percentage of Missingness Created    Number of Observations Missing (out of 600)    Percentage of Missingness Reflected in the Data
5                                    32                                             5.33
10                                   62                                             10.33
15                                   84                                             15
20                                   117                                            19.33

Table 1a 5% Missingness
Variables    Group      Gender     Domicile   Education  Assault
Original     8.61419    -0.72853   0.01083    -0.18508   -0.0273
LWD          8.0000     -0.72853   0.01083    -0.18508   -0.0273
MI           8.521      -0.71      0.012      -0.0182    -0.026
KNN          8.612      -0.723     0.01       -0.183     -0.0269
RF           8.615      -0.726     0.01083    -0.184     -0.0265
HD           8.2        -0.61      0.011      -0.122     -0.026
JM           8.59       -0.727     0.013      -0.186     -0.028
FCS          8.59       -0.727     0.013      -0.186     -0.028
FCS-LMM      -8.54      0.71       0.0122     -0.182     -0.0271

Table 1b 10% Missingness
Variables    Group      Gender     Domicile   Education  Assault
Original     8.61419    -0.72853   0.01083    -0.18508   -0.0273
LWD          8.61419    -0.72853   0.01083    -0.18508   -0.0273
MI           8.62       -0.73      0.011      -0.187     -0.026
KNN          8.613      -0.728     0.0111     -0.185     -0.0269
RF           8.62       -0.657     0.01       -0.184     -0.023
HD           8.2        -0.61      0.011      -0.122     -0.026
JM           8.2        -0.61      0.011      -0.122     -0.026
FCS          8.59       -0.727     0.013      -0.186     -0.028
FCS-LMM      -8.54      0.71       0.0122     -0.182     -0.0271

Table 1c 15% Missingness
Variables    Group      Gender     Domicile   Education  Assault
Original     8.61419    -0.72853   0.01083    -0.18508   -0.0273
LWD          8.1002     -0.72853   0.01083    -0.18508   -0.0273
MI           8.521      -0.71      0.012      -0.0182    -0.026
KNN          8.612      -0.728     0.01       -0.185     -0.0261
RF           8.59       -0.727     0.013      -0.186     -0.028
HD           -8.54      0.71       0.012      -0.182     -0.027
JM           8.612      -0.728     0.01       -0.185     -0.027
FCS          8.614      -0.729     0.011      -0.185     -0.027
FCS-LMM      8.2        -0.61      0.011      -0.122     -0.026

Table 1d 20% Missingness
Variables    Group      Gender     Domicile   Education  Assault
Original     8.61419    -0.72853   0.01083    -0.18508   -0.0273
LWD          8.0090     -0.72853   0.01083    -0.18508   -0.0273
MI           8.62       -0.73      0.011      -0.187     -0.026
KNN          8.619      -0.728     0.0111     -0.185     -0.0269
RF           8.59       -0.727     0.013      -0.186     -0.028
HD           -8.54      0.71       0.012      -0.182     -0.027
JM           8.612      -0.728     0.01       -0.185     -0.027
FCS          8.614      -0.729     0.011      -0.185     -0.027
FCS-LMM      8.2        -0.61      0.011      -0.122     -0.026

IV. CONCLUSION

In the current study, we compared the performance of several multiple imputation methods (FCS-Standard, FCS-LMM, JM) and machine learning imputation techniques (k-nearest neighbors and Random Forest) for handling missing values in longitudinal data, in the context of fitting a linear mixed effect model with both random intercepts and slopes. Our comparison also revealed that the joint modeling (JM) approach holds great promise for the imputation of longitudinal data. The results of our exploration revealed that, although several MI methods are available for imputing missing values in longitudinal data, it is quite evident that machine learning imputation techniques can provide much better estimates. Due to their simplicity, ease of understanding and relatively high accuracy, we conclude from our study that the k-nearest neighbor and random forest machine learning imputation techniques showed better and more efficient performance than multiple imputation techniques such as FCS and JM, on the basis of the coefficients obtained on fitting the linear mixed model.

FUTURE SCOPE

In many public health contexts where data are collected from individuals repeatedly over time and from groups of people that are clustered within natural units, longitudinal and cluster-correlated data both emerge. By substituting missing data with an estimated value based on other available information, imputation preserves all cases. The data set can be analysed using methods intended for complete data once all missing values have been imputed. With the growing volume of data in the health sector, there will be many situations where missing data arise, and there must be ways to recover those values so the data can be analysed. There are numerous imputation approaches that may be utilised for comparison in longitudinal studies. Although machine learning algorithms are becoming easier to access and a variety of approaches are being developed, it still takes a lot of processing time for these techniques to produce estimates that are effective and nearly exact. Numerous methods can be developed to impute categorical and continuous types of data by modifying multiple imputation algorithms.
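The pooling step of multiple imputation referred to in the conclusion — combining the m per-imputation coefficient estimates into a single estimate — follows Rubin's rules. A minimal Python sketch with made-up numbers (the estimates below are illustrative, not taken from Table 1):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules:
    pooled estimate = mean of the estimates; total variance =
    within-imputation variance + (1 + 1/m) * between-imputation variance."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()           # pooled point estimate
    u_bar = variances.mean()           # within-imputation variance
    b = estimates.var(ddof=1)          # between-imputation variance
    t = u_bar + (1 + 1 / m) * b        # total variance
    return q_bar, t

# Five imputed-data-set estimates of one coefficient and their variances.
est = [8.59, 8.62, 8.55, 8.60, 8.61]
var = [0.04, 0.05, 0.04, 0.05, 0.04]
q, t = pool_rubin(est, var)
print(round(q, 3), round(t, 4))  # → 8.594 0.0449
```

The small between-imputation component here reflects the stability of the coefficient across imputations; it is this pooled estimate that is then compared against the complete-data fit.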



ACKNOWLEDGEMENT

I would also like to extend my sincere thanks to Dr. Natasha, Faculty of Health at the University of Canberra, Bruce, ACT, Australia, whose willingness to share her knowledge has significantly enhanced the quality of this paper.

REFERENCES

[1]. Van Buuren, S. Flexible Imputation of Missing Data (CRC Press, Taylor & Francis Group, Chapman and Hall, 1991).
[2]. Huque, M. H., Carlin, J., Simpson, J. & Lee, K. A comparison of multiple imputation methods for missing data in longitudinal studies. BMC Medical Research Methodology 18 (Dec. 2018).
[3]. In Linear Mixed Models for Longitudinal Data 221–229 (Springer New York, New York, NY, 2000). ISBN 978-0-387-22775-7. https://doi.org/10.1007/978-0-387-22775-7_16.
[4]. Ben, Â. et al. The handling of missing data in trial-based economic evaluations: should data be multiply imputed prior to longitudinal linear mixed-model analyses? The European Journal of Health Economics, 1–15 (Sept. 2022).
[5]. In Linear Mixed Models for Longitudinal Data 19–29 (Springer New York, New York, NY, 2000). ISBN 978-0-387-22775-7. https://doi.org/10.1007/978-0-387-22775-7_3.
[6]. Gurka, M. J. & Edwards, L. J. In Essential Statistical Methods for Medical Statistics (eds Rao, C., Miller, J. & Rao, D.) 146–173 (North-Holland, Boston, 2011). ISBN 978-0-444-53737-9. https://www.sciencedirect.com/science/article/pii/B9780444537379500086.
[7]. Huque, M. H. et al. Multiple imputation methods for handling incomplete longitudinal and clustered data where the target analysis is a linear mixed effects model. Biometrical Journal 62 (2020).
[8]. Berglund, P. A. Multiple Imputation Using the Fully Conditional Specification Method: A Comparison of SAS®, Stata, IVEware, and R (2015).
[9]. Molenberghs, G. & Verbeke, G. A review on linear mixed models for longitudinal data, possibly subject to dropout. Statistical Modelling 1, 235–269 (Dec. 2001).
[10]. Sterne, J. A. C. et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 338 (2009).
[11]. Diggle, P., Liang, K.-Y. & Zeger, S. Analysis of Longitudinal Data. Biometrics 53, 782 (June 1997).
[12]. In Linear Mixed Models for Longitudinal Data 209–219 (Springer New York, New York, NY, 2000). ISBN 978-0-387-22775-7. https://doi.org/10.1007/978-0-387-22775-7_15.
[13]. Quartagno, M., Grund, S. & Carpenter, J. jomo: A Flexible Package for Two-level Joint Modelling Multiple Imputation. The R Journal 9 (Jan. 2019).
[14]. Enders, C., Hayes, T. & Du, H. A Comparison of Multilevel Imputation Schemes for Random Coefficient Models: Fully Conditional Specification and Joint Model Imputation with Random Covariance Matrices. Multivariate Behavioral Research 53, 1–19 (Jan. 2019).
[15]. Cover, T. & Hart, P. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, 21–27 (1967).
[16]. Quartagno, M. & Carpenter, J. Substantive model compatible multilevel multiple imputation: A joint modeling approach. Statistics in Medicine 41 (Aug. 2022).
[17]. Kumar, P. Imputation and characterization of uncoded self-harm in major mental illness using machine learning. Journal of the American Medical Informatics Association 27, 1, 136–146 (2020).
[18]. Riiser, K., Richardsen, K., Haugen, A., Lund, S. & Løndal, K. Active Play in ASP – a matched-pair cluster-randomized trial investigating the effectiveness of an intervention in after-school programs for supporting children's physical activity (Apr. 2020).
[19]. Laukkanen, A. Physical activity and motor competence in 4–8-year old children: results of a family-based cluster-randomized controlled physical activity trial. ISBN 978-951-39-6582-2 (PDF) (Apr. 2016).
[20]. In Linear Mixed Models for Longitudinal Data 201–207 (Springer New York, New York, NY, 2000). ISBN 978-0-387-22775-7. https://doi.org/10.1007/978-0-387-22775-7_14.
[21]. Knowledge of sexual abuse and resistance ability among children with intellectual disability. Child Abuse & Neglect 136, 105985 (2023). ISSN 0145-2134. https://www.sciencedirect.com/science/article/pii/S0145213422005191.

