
Volume 8, Issue 10, October – 2023 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Comparing Multiple Imputation and Machine Learning Techniques for Longitudinal Data

Sanjana Rajamani¹, Seena Thomas²
¹Department of Statistics and Data Science, Christ (Deemed to be University), Bangalore, India
²Department of Statistics and Data Science, Christ (Deemed to be University), Bangalore, India

Abstract:- It is an essential part of research to find ways to impute the missing values in a data set. The missingness is unavoidable as it could be due to natural or non-natural reasons. Missing information is inevitable in longitudinal or multilevel studies, and can result in biased estimates, loss of power, variability and inaccuracy in results. For this study, a complete data set recording the resistance scores of intellectually disabled children given behavioral skills training was considered in order to compare the various imputation techniques. The secondary data collected were longitudinal in nature. The resistance score was noted before the training and at four different time points after the training. Random missingness was created in the complete data at varying percentages (5%, 10%, 15%, 20%, 30%) using the MAR mechanism. The values obtained after imputation were compared with the full data using a linear mixed model. Various models were built under the multiple imputation and machine learning techniques for imputing the features used to predict the resistance score, using coefficients taken from the real data; the same mechanism was implemented for simulated data as well. The methods based on machine learning techniques were the most suited for the imputation of missing values and led to a significant enhancement of prediction accuracy when compared to multiple imputation techniques using linear mixed models.

Keywords:- Multiple Imputation, MAR Mechanisms, Machine Learning Techniques, Linear Mixed Effect Model.

I. INTRODUCTION

Dealing with the incidence of missing data remains a concern when using real-world data to supplement clinical trials [1]. It is not always practical to remove a record from your data in order to make predictions. In this study, missing data introduced considerable variation in the results; missingness can also lead to bias, which could further lead to flawed and inaccurate conclusions. By estimating these values using a variety of methods, the missing values in the data may be replaced. A study design that involves observing the same variables repeatedly over a range of time frames is called longitudinal. Since longitudinal studies [2] are frequently employed in clinical trials and social-personality psychology to examine quick changes in actions, thoughts, and emotions from one moment or day to the next, they are frequently regarded as a type of observational study.

These missing values [3] were imputed using machine learning and several types of multiple imputation techniques. Using a linear mixed model (LMM) [4], the accuracy of these imputed values was compared.

A. Missing Data
Longitudinal data [5] frequently contain missing values. This is particularly true in long-term biomedical investigations of people, since it is hard to guarantee 100% protocol compliance. Unlike other models, which require a complete data set to fit a model and predict the estimates, a linear mixed model can take missing values into account and still produce results, although the missingness affecting the parameters we want to estimate has to be of a certain type. Many a time the underlying reason for the occurrence of missing data is not known; in order to recover these values, the reason or cause has to be known well in advance. The missing-data literature [3] mainly distinguishes three mechanisms by which data can be missing.

 Missing at Random (MAR)
This indicates that there is a systematic link between the probability of missing values and the observed data; that is, the missing value can be anticipated using the other attributes in the data set.

 Missing Completely at Random (MCAR)
In contrast to MAR, this form of missingness [6] shows that there is no connection between the missing value and the other features in the data set. Since there is no logic involved in why a certain value is missing, this form of missing values is the easiest to comprehend.

 Missing not at Random (MNAR)
MNAR data are the hardest both to detect and to work with. The missingness and lack of data are caused by factors that we failed to take into consideration.
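The MAR mechanism described above can be mimicked in a short simulation. The study itself was carried out in RStudio; the sketch below is an illustrative Python version on made-up data (the column names `resistance`, `gender`, and `education` are placeholders, not the study's actual variables), in which the probability that `education` is missing depends only on the observed `gender` column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy data: one outcome ("resistance") plus two observed covariates.
n = 600
df = pd.DataFrame({
    "resistance": rng.normal(8.6, 1.0, n),
    "gender": rng.integers(0, 2, n),
    "education": rng.integers(0, 4, n).astype(float),
})

def make_mar(data, target_col, driver_col, frac, rng):
    """Blank out `frac` of `target_col`, with the probability of being
    missing driven by the observed `driver_col` (the MAR mechanism:
    missingness depends only on observed values)."""
    out = data.copy()
    # Higher driver values -> higher chance of being selected as missing.
    weights = (out[driver_col] - out[driver_col].min() + 1).astype(float)
    weights /= weights.sum()
    n_missing = int(round(frac * len(out)))
    idx = rng.choice(out.index.to_numpy(), size=n_missing,
                     replace=False, p=weights.to_numpy())
    out.loc[idx, target_col] = np.nan
    return out

amputed = make_mar(df, "education", "gender", 0.05, rng)
print(amputed["education"].isna().mean())  # exactly 30/600 = 0.05
```

The same call with `frac` set to 0.10, 0.15, or 0.20 reproduces the other missingness levels used in the study; the outcome column is left untouched, as in the paper.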

IJISRT23OCT1169 www.ijisrt.com 1013


All real-world data sets have missing values, and this definitely has to be taken care of. There are several approaches to handling missing data, since the outcomes we aim to get might be significantly impacted by these missing values [7]. Data recording errors, damaged raw data columns, and human error are some of the common causes of missing data. Missing data pose significant obstacles for a study's findings: if important information is missed, it is difficult to draw any conclusions [8]. Hence, a study of several tactics is necessary to ascertain which approach will result in adequate outcomes and a sound model [9].

B. Imputation
Missing data may add a significant amount of bias, make processing and analyzing the data more difficult, and reduce efficiency. That is, omissions (introduced in collection or processing) may result in certain sub-groups of the target population being excluded from the analysis of the data set, in turn increasing the risk of biased estimates, reducing the power of inferential statistics and increasing the uncertainty of estimates and inferences derived from the data. By substituting missing data with an estimated value based on other available information, imputation preserves all cases [8]. The data set may be analyzed using methods intended for complete data once all missing values have been imputed.

 Multiple Imputation Techniques
When using the multiple imputation [10] process, the distribution of each variable with missing values has to be modeled. Multiple imputation is a generic solution to the issue of missing data that is included in a number of widely used statistical software programs [7]. By constructing many distinct plausible imputed data sets and correctly merging the findings from each one, it seeks to account for the uncertainty around the missing data. It is a method for addressing non-response bias that is based on a Bayesian approach. The steps involved in multiple imputation [11] are: missing values are imputed m times (m > 1), resulting in m complete data sets; the imputed values are drawn from distributions modelled specifically for each missing entry. The standard suggestion for relative efficiency is 4-5 imputations, while a higher number of imputations would give more accurate results. Each of these data sets is analyzed using the statistical model, creating m sets of estimates. The m analysis estimates are combined into one multiple imputation estimate.

 Fully Conditional Specification (FCS)
We refer to this method as FCS-Standard since it uses all of the repeated measurements of the time-varying covariate as predictors in each of the univariate imputation models. Due to model overfitting / collinearity when there are a lot of associated repeated measures, this method is vulnerable to convergence issues [10].

 Fully Conditional Specification, Linear Mixed Model (FCS-LMM)
This method cycles over the univariate imputation models, using a multilevel LMM to impute missing values in each incomplete time-dependent variable given all the others, instead of considering repeated measurements as separate variables. The linear two-level model with homogeneous within-subject variances, a specific instance of a multivariate LMM [5], is implemented using the Gibbs sampler in this method.

 Joint Modelling (JM)
The algorithm [13] behind this technique employs a joint multivariate [12] LMM for imputing multiple incomplete longitudinal variables [11], rather than considering repeated measurements as separate variables. In order to account for dependence among people over time, this technique presupposes that all the incomplete variables are continuous with subject-specific random effects. Similar to the univariate LMM [14], this approach assumes that measurement errors and random effects have a normal distribution with constant error covariance across all subjects.

 Machine Learning Techniques
A large number of important machine learning methods [15] have emerged since the 1980s and 1990s, such as the back-propagation neural network and random forest (RF), which had a profound impact on the medical field, including clinical decision making in the presence of missing data. Before that, the traditional methods used to process missing data in clinical decision making [16] mainly included complete case analysis, k-nearest neighbors (KNN), expectation maximization, and so on. With the in-depth application of machine learning models in this field, researchers found that machine learning models can restore the true distribution of data from missing data sets more accurately than the traditional missing-data processing models.

 K-Nearest Neighbors
An unclassified sample point is given the classification of the nearest previously classified point via the nearest-neighbor decision rule. The sample points' classifications and the underlying joint distribution have no bearing on this rule. In contrast, a large-sample analysis shows that, for all suitably smooth underlying distributions [16], these bounds are the tightest possible in the M-category case. It may be claimed that the nearest neighbor holds half of the classification information in an infinite sample set.

 Random Forest
An effective method of imputation, Random Forest virtually meets all the criteria for being the best [15] imputation method. Random Forests are fairly good at scaling to large data settings, and they can tolerate outliers and non-linearity in the data. Mixed-type data (both numerical and categorical) can be handled by Random Forests. They additionally offer a built-in feature-selection method. Random Forests can easily outperform KNN and other techniques because of these unique advantages.
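As a concrete illustration of these two machine-learning imputers (the study's implementation was in RStudio; this is a Python sketch on synthetic data, not the study's code), scikit-learn's `KNNImputer` fills each missing entry with the mean of its nearest complete neighbours, while an `IterativeImputer` driven by a `RandomForestRegressor` gives a missForest-style random-forest imputation:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic matrix whose last column depends on the others, with ~10%
# of all entries blanked out at random positions.
X_full = rng.normal(size=(200, 4))
X_full[:, 3] = X_full[:, :3] @ np.array([0.5, -0.3, 0.2]) + 0.1 * rng.normal(size=200)
X_miss = X_full.copy()
mask = rng.random(X_miss.shape) < 0.10
X_miss[mask] = np.nan

# KNN imputation: each hole filled with the mean of its 5 nearest
# neighbours under a NaN-aware Euclidean distance.
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_miss)

# Random-forest imputation (missForest-style): iteratively regress each
# incomplete column on the others with a random forest.
rf = RandomForestRegressor(n_estimators=50, random_state=0)
X_rf = IterativeImputer(estimator=rf, max_iter=5,
                        random_state=0).fit_transform(X_miss)

for name, X_imp in [("KNN", X_knn), ("RF", X_rf)]:
    rmse = np.sqrt(np.mean((X_imp[mask] - X_full[mask]) ** 2))
    print(f"{name} imputation RMSE on masked entries: {rmse:.3f}")
```

Because the true values at the masked positions are known here, the RMSE on those entries gives a direct accuracy comparison, analogous to the coefficient comparison performed in the paper.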



C. Data Description
Behavioral Skills Training (BST) was conducted to build intellectually disabled children's knowledge of sexual abuse [17] and capacity for resistance. A pilot study was carried out to look at how children who had experienced sexual abuse in the past, and those who had not, responded to the personal safety program. Parents or other caregivers trained 60 children between the ages of 3 and 7 who had been subjected to abuse and 60 children who had not. Before and after the intervention, the children were evaluated using the Personal Safety Questionnaire (PSQ). A six-month at-home behavioral skills training program on sexual abuse prevention was offered [18]. The findings indicated that both groups gained knowledge and expertise in sexual abuse prevention. After participating in the program, children who had a history of sexual abuse significantly reduced their inappropriate sexual behavior [21].

II. METHODOLOGY

 A complete data set was taken, containing the resistance score as well as several features that significantly influence the prediction of the score, and missingness was then created in it at various percentages (5%, 10%, 15%, 20%).
 The plot below depicts where missingness in the data was created. No missingness was created in the resistance column, since it is the outcome variable being predicted.
 Considering each technique's relevance to longitudinal data, a comparison was made between the multiple imputation techniques (FCS, FCS-LMM, and JM [20]) and the machine learning imputation techniques (KNN and Random Forest). These methods were all programmed using RStudio.

Fig 1 Variables with Missing Values

 Based on the estimates that the models fitted to the imputed data sets provided, the techniques were compared using linear mixed modelling. These findings were tabulated in order to determine which of the techniques generated [9] outcomes most compatible with the original model developed for the entire data set.
 The predicted values were combined across all missingness percentages to check whether the techniques can produce accurate estimates at every level of missingness.
 A conclusion on which method provides a better estimate for the missing values was reached by comparing the estimates to the original values.
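The pipeline in the bullets above — create missingness, impute, refit the linear mixed model, and compare the fixed-effect coefficients with the full-data fit — can be sketched as follows. This is a Python/statsmodels illustration on synthetic data with hypothetical variable names (`subject`, `group`, `time`, `resistance`); the study's actual models were fitted in RStudio, and a crude mode-fill stands in here for the imputation step:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Toy longitudinal data: 60 subjects x 5 time points, random intercepts.
n_subj, n_time = 60, 5
subj = np.repeat(np.arange(n_subj), n_time)
time = np.tile(np.arange(n_time), n_subj)
group = np.repeat(rng.integers(0, 2, n_subj), n_time)
u = np.repeat(rng.normal(0, 1, n_subj), n_time)  # subject-level effect
resistance = (8.6 + 1.5 * group + 0.4 * time + u
              + rng.normal(0, 0.5, n_subj * n_time))
df = pd.DataFrame({"subject": subj, "time": time,
                   "group": group.astype(float), "resistance": resistance})

def fixed_effects(data):
    """Fit a random-intercept LMM and return its fixed-effect estimates."""
    m = smf.mixedlm("resistance ~ group + time", data,
                    groups=data["subject"]).fit()
    return m.params[["group", "time"]]

full = fixed_effects(df)

# Blank 10% of `group` at random, impute with the column mode
# (a deliberately crude placeholder for the imputers compared above).
df_miss = df.copy()
idx = rng.choice(df_miss.index.to_numpy(),
                 size=int(0.10 * len(df_miss)), replace=False)
df_miss.loc[idx, "group"] = np.nan
df_miss["group"] = df_miss["group"].fillna(df_miss["group"].mode()[0])

imputed = fixed_effects(df_miss)
print((full - imputed).abs())  # per-coefficient deviation from full-data fit
```

Swapping a better imputer into the mode-fill step and tabulating these deviations for each method and missingness level reproduces the structure of the comparison reported in the Results section.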



III. RESULTS

(a) Missing Pattern

(b) Percentage of Missingness in Each Variable


Fig 2 The pattern of missingness in the data where 5% missingness was generated, and visualization of the amount of missing data. The location of missing values is shown in black, along with the overall percentage of missing values in each variable.




(a) Cumulative Sum of Missing Values

(b) Correlation Plot


Fig 3 The plots show the cumulative sum of missing values, reading the rows of the data set from top to bottom, and the correlation between the variables after generating 5% missingness in the data set.




(a) Missing Pattern

(b) Percentage of Missingness in Each Variable


Fig 4 The pattern of missingness in the data where 10% missingness was generated, and visualization of the amount of missing data. The location of missing values is shown in black, along with the overall percentage of missing values in each variable.




(a) Cumulative Sum of Missing Values

(b) Correlation Plot


Fig 5 The plots show the cumulative sum of missing values, reading the rows of the data set from top to bottom, and the correlation between the variables after generating 10% missingness in the data set.




(a) Missing Pattern

(b) Percentage of Missingness in Each Variable


Fig 6 The pattern of missingness in the data where 15% missingness was generated, and visualization of the amount of missing data. The location of missing values is shown in black, along with the overall percentage of missing values in each variable.




(a) Cumulative sum of missing values

(b) Correlation Plot


Fig 7 The plots show the cumulative sum of missing values, reading the rows of the data set from top to bottom, and the correlation between the variables after generating 15% missingness in the data set.




(a) Missing Pattern

(b) Percentage of Missingness in Each Variable


Fig 8 The pattern of missingness in the data where 20% missingness was generated, and visualization of the amount of missing data. The location of missing values is shown in black, along with the overall percentage of missing values in each variable.




(a) Cumulative Sum of Missing Values

(b) Correlation Plot


Fig 9 The plots show the cumulative sum of missing values, reading the rows of the data set from top to bottom, and the correlation between the variables after generating 20% missingness in the data set.



Table 1 The estimated coefficients obtained using LMM for the different multiple imputation and machine learning imputation techniques at different percentages (5%, 10%, 15%, 20%) of missingness are displayed in Tables 1a, 1b, 1c and 1d. Only p-values which are significant have been added to the linear mixed model.

Percentage of Missingness Created    Number of Observations Missing (out of 600)    Percentage of Missingness Reflected in the Data
5                                    32                                             5.33
10                                   62                                             10.33
15                                   84                                             15
20                                   117                                            19.33

Table 1a 5% Missingness
Variables    Group      Gender     Domicile   Education  Assault
Original     8.61419    -0.72853   0.01083    -0.18508   -0.0273
LWD          8.0000     -0.72853   0.01083    -0.18508   -0.0273
MI           8.521      -0.71      0.012      -0.0182    -0.026
KNN          8.612      -0.723     0.01       -0.183     -0.0269
RF           8.615      -0.726     0.01083    -0.184     -0.0265
HD           8.2        -0.61      0.011      -0.122     -0.026
JM           8.59       -0.727     0.013      -0.186     -0.028
FCS          8.59       -0.727     0.013      -0.186     -0.028
FCS-LMM      -8.54      0.71       0.0122     -0.182     -0.0271

Table 1b 10% Missingness
Variables    Group      Gender     Domicile   Education  Assault
Original     8.61419    -0.72853   0.01083    -0.18508   -0.0273
LWD          8.61419    -0.72853   0.01083    -0.18508   -0.0273
MI           8.62       -0.73      0.011      -0.187     -0.026
KNN          8.613      -0.728     0.0111     -0.185     -0.0269
RF           8.62       -0.657     0.01       -0.184     -0.023
HD           8.2        -0.61      0.011      -0.122     -0.026
JM           8.2        -0.61      0.011      -0.122     -0.026
FCS          8.59       -0.727     0.013      -0.186     -0.028
FCS-LMM      -8.54      0.71       0.0122     -0.182     -0.0271

Table 1c 15% Missingness
Variables    Group      Gender     Domicile   Education  Assault
Original     8.61419    -0.72853   0.01083    -0.18508   -0.0273
LWD          8.1002     -0.72853   0.01083    -0.18508   -0.0273
MI           8.521      -0.71      0.012      -0.0182    -0.026
KNN          8.612      -0.728     0.01       -0.185     -0.0261
RF           8.59       -0.727     0.013      -0.186     -0.028
HD           -8.54      0.71       0.012      -0.182     -0.027
JM           8.612      -0.728     0.01       -0.185     -0.027
FCS          8.614      -0.729     0.011      -0.185     -0.027
FCS-LMM      8.2        -0.61      0.011      -0.122     -0.026

Table 1d 20% Missingness
Variables    Group      Gender     Domicile   Education  Assault
Original     8.61419    -0.72853   0.01083    -0.18508   -0.0273
LWD          8.0090     -0.72853   0.01083    -0.18508   -0.0273
MI           8.62       -0.73      0.011      -0.187     -0.026
KNN          8.619      -0.728     0.0111     -0.185     -0.0269
RF           8.59       -0.727     0.013      -0.186     -0.028
HD           -8.54      0.71       0.012      -0.182     -0.027
JM           8.612      -0.728     0.01       -0.185     -0.027
FCS          8.614      -0.729     0.011      -0.185     -0.027
FCS-LMM      8.2        -0.61      0.011      -0.122     -0.026

IV. CONCLUSION

In the current study, we compared the performance of several multiple imputation methods (FCS-Standard, FCS-LMM, JM) and machine learning imputation techniques (k-nearest neighbors and Random Forest) for handling missing values in longitudinal data, in the context of fitting a linear mixed effect model with both random intercepts and slopes. Our comparison also revealed that the joint modeling (JM) approach holds great promise for the imputation of longitudinal data. The results of our exploration revealed that, although several MI methods are available for imputing missing values in longitudinal data, it is quite evident that machine learning imputation techniques can provide much better estimates. Due to their simplicity, ease of understanding and relatively high accuracy, we conclude from our study that the k-nearest neighbor and random forest machine learning imputation techniques showed better and more efficient performance than multiple imputation techniques such as FCS and JM, on the basis of the coefficients obtained on fitting the linear mixed model.

FUTURE SCOPE

In many public health contexts where data are collected from individuals repeatedly over time and from groups of people that are clustered within natural units, longitudinal and cluster-correlated data both emerge. By substituting missing data with an estimated value based on other available information, imputation preserves all cases. The data set can be analysed using methods intended for complete data once all missing values have been imputed. With the growing volume of data in the health sector, there will be many situations where missing data arise, and there must be ways to recover those values so the data can be analysed. There are numerous imputation approaches that may be utilised for comparison in longitudinal studies. Although machine learning algorithms are becoming easier to access and a variety of approaches are being developed, it still takes a lot of processing time for these techniques to produce estimates that are effective and nearly exact. Numerous methods can be developed to impute categorical and continuous types of data by modifying multiple imputation algorithms.
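The pooling step of multiple imputation referred to in the conclusion — combining the m per-imputation coefficient estimates into a single estimate — follows Rubin's rules. A minimal Python sketch with made-up numbers (the estimates below are illustrative, not taken from Table 1):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules:
    pooled estimate = mean of the estimates; total variance =
    within-imputation variance + (1 + 1/m) * between-imputation variance."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()           # pooled point estimate
    u_bar = variances.mean()           # within-imputation variance
    b = estimates.var(ddof=1)          # between-imputation variance
    t = u_bar + (1 + 1 / m) * b        # total variance
    return q_bar, t

# Five imputed-data-set estimates of one coefficient and their variances.
est = [8.59, 8.62, 8.55, 8.60, 8.61]
var = [0.04, 0.05, 0.04, 0.05, 0.04]
q, t = pool_rubin(est, var)
print(round(q, 3), round(t, 4))  # → 8.594 0.0449
```

The small between-imputation component here reflects the stability of the coefficient across imputations; it is this pooled estimate that is then compared against the complete-data fit.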



ACKNOWLEDGEMENT

I would also like to extend my sincere thanks to Dr. Natasha, Faculty of Health at the University of Canberra, Bruce, ACT, Australia, whose willingness to share her knowledge has significantly enhanced the quality of this paper.

REFERENCES

[1]. Van Buuren, S. Flexible Imputation of Missing Data (CRC Press, Taylor & Francis Group, Chapman and Hall, 1991).
[2]. Huque, M. H., Carlin, J., Simpson, J. & Lee, K. A comparison of multiple imputation methods for missing data in longitudinal studies. BMC Medical Research Methodology 18 (Dec. 2018).
[3]. In Linear Mixed Models for Longitudinal Data 221–229 (Springer New York, New York, NY, 2000). ISBN 978-0-387-22775-7. https://doi.org/10.1007/978-0-387-22775-7_16.
[4]. Ben, Â. et al. The handling of missing data in trial-based economic evaluations: should data be multiply imputed prior to longitudinal linear mixed-model analyses? The European Journal of Health Economics, 1–15 (Sept. 2022).
[5]. In Linear Mixed Models for Longitudinal Data 19–29 (Springer New York, New York, NY, 2000). ISBN 978-0-387-22775-7. https://doi.org/10.1007/978-0-387-22775-7_3.
[6]. Gurka, M. J. & Edwards, L. J. In Essential Statistical Methods for Medical Statistics (eds Rao, C., Miller, J. & Rao, D.) 146–173 (North-Holland, Boston, 2011). ISBN 978-0-444-53737-9. https://www.sciencedirect.com/science/article/pii/B9780444537379500086.
[7]. Huque, M. H. et al. Multiple imputation methods for handling incomplete longitudinal and clustered data where the target analysis is a linear mixed effects model. Biometrical Journal 62 (2020).
[8]. Berglund, P. A. Multiple Imputation Using the Fully Conditional Specification Method: A Comparison of SAS®, Stata, IVEware, and R (2015).
[9]. Molenberghs, G. & Verbeke, G. A review on linear mixed models for longitudinal data, possibly subject to dropout. Statistical Modelling 1, 235–269 (Dec. 2001).
[10]. Sterne, J. A. C. et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 338 (2009).
[11]. Diggle, P., Liang, K.-Y. & Zeger, S. Analysis of Longitudinal Data. Biometrics 53, 782 (June 1997).
[12]. In Linear Mixed Models for Longitudinal Data 209–219 (Springer New York, New York, NY, 2000). ISBN 978-0-387-22775-7. https://doi.org/10.1007/978-0-387-22775-7_15.
[13]. Quartagno, M., Grund, S. & Carpenter, J. jomo: A Flexible Package for Two-level Joint Modelling Multiple Imputation. The R Journal 9 (Jan. 2019).
[14]. Enders, C., Hayes, T. & Du, H. A Comparison of Multilevel Imputation Schemes for Random Coefficient Models: Fully Conditional Specification and Joint Model Imputation with Random Covariance Matrices. Multivariate Behavioral Research 53, 1–19 (Jan. 2019).
[15]. Cover, T. & Hart, P. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, 21–27 (1967).
[16]. Quartagno, M. & Carpenter, J. Substantive model compatible multilevel multiple imputation: A joint modeling approach. Statistics in Medicine 41 (Aug. 2022).
[17]. Kumar, P. Imputation and characterization of uncoded self-harm in major mental illness using machine learning. Journal of the American Medical Informatics Association 27, 1, 136–146 (2020).
[18]. Riiser, K., Richardsen, K., Haugen, A., Lund, S. & Løndal, K. Active Play in ASP – a matched-pair cluster-randomized trial investigating the effectiveness of an intervention in after-school programs for supporting children's physical activity (Apr. 2020).
[19]. Laukkanen, A. Physical activity and motor competence in 4–8-year old children: results of a family-based cluster-randomized controlled physical activity trial. ISBN 978-951-39-6582-2 (PDF) (Apr. 2016).
[20]. In Linear Mixed Models for Longitudinal Data 201–207 (Springer New York, New York, NY, 2000). ISBN 978-0-387-22775-7. https://doi.org/10.1007/978-0-387-22775-7_14.
[21]. Knowledge of sexual abuse and resistance ability among children with intellectual disability. Child Abuse & Neglect 136, 105985 (2023). ISSN 0145-2134. https://www.sciencedirect.com/science/article/pii/S0145213422005191.

