Keywords
LSTM, XGBoost, Hybrid Models, Machine Learning, Deep Learning
1. Introduction
Type II diabetes is a chronic disease that affects millions of individuals world-
wide. The disease can cause serious damage to the body, especially nerves and
blood vessels, and is often preventable. Type II Diabetes Mellitus is a serious
public health concern with significant impacts on human life and health. It af-
fects individuals’ functional capacities and quality of life, leading to significant
morbidity and premature mortality [1]. The sudden increase in the number of
Type II Diabetes cases has raised serious public health concerns. The multifac-
torial nature of Type II Diabetes Mellitus poses a challenge for early detection, as
symptoms can be mild and take years to manifest. Additionally, the complexity
of the disease and its interactions with other factors make it difficult to predict
with high accuracy using traditional methods. Current predictive models have
limitations in capturing complex patterns in patient data, and there are concerns
about suboptimal control of blood glucose and other targets for many patients
[2].
Type II diabetes is a prevalent and serious health condition that affects a di-
verse range of individuals globally. It is characterized by the body’s ineffective
use of insulin, with around 90% of all diabetes diagnoses being type II. This
chronic disease can lead to various health complications, including kidney dis-
ease, amputations, blindness, cardiovascular disease, obesity, hypertension, hy-
poglycemia, dyslipidemia, and an increased risk of heart attack or stroke. Nota-
bly, diabetes claims more lives annually than breast cancer and AIDS combined.
The prevalence of type II diabetes is on the rise, with more young people be-
ing diagnosed. In America alone, expenditures related to diabetes healthcare
costs have significantly increased over the years. Lifestyle factors such as obesity
and lack of exercise contribute to the development of type II diabetes. Genetics
also plays a significant role in increasing the risk of this condition, especially for
individuals with close relatives who have diabetes [3].
Moreover, people from certain ethnic backgrounds are at a higher risk of de-
veloping type II diabetes. For instance, individuals of South Asian, Chinese,
African-Caribbean, and black African origin are more likely to develop this con-
dition. Regular exercise and maintaining a healthy weight can significantly re-
duce the risk of developing type II diabetes by more than 50%.
Early diagnosis and treatment are crucial in managing type II diabetes effec-
tively. Regular check-ups and blood tests are essential for early detection to pre-
vent severe complications associated with the disease. Individuals at risk or those
with pre-diabetes need to take preventative steps to avoid the progression to type
II diabetes.
The importance of accurately predicting Type II Diabetes cannot be overstated. Early detection and intervention can improve disease outcomes and reduce the risk of serious consequences. However, predicting Type II Diabetes is difficult due to the
complexity of the components involved, which include genetic, behavioral, and
environmental influences. Traditional techniques of prediction frequently rely
on a custom knowledge base using graphs, frames, first-order logic, etc., which
may not always capture the correct patterns found in patient data [4].
To overcome this issue, we offer a hybrid model that incorporates Long
Short-Term Memory (LSTM) networks and Extreme Gradient Boosting (XGBoost).
The hybrid LSTM-XGBoost model represents an advancement over traditional
methods, offering improved accuracy in predicting Type II Diabetes Mellitus
and its complications, thereby contributing to early intervention and better pa-
tient outcomes.
This model combines the strengths of LSTM and XGBoost to process and analyze complex medical data. The LSTM network is a type of recurrent neural network noted for its capacity to process sequential data, making it well suited to the time-series data that is common in medical records; it can detect patterns over time, providing detailed insights into patient history and trends. XGBoost, in contrast, is a sophisticated implementation of gradient boosting noted for its efficiency, adaptability, and efficacy in classification tasks. By combining these two methods, our approach captures both the temporal dynamics and the complex correlations in the data, enhancing diabetes prediction accuracy.
The objectives of the study are to develop a hybrid model that leverages LSTM
for temporal data analysis and XGBoost for robust classification, to validate the
model’s effectiveness in predicting diabetes using comprehensive datasets, and
to contribute to the field of predictive healthcare by introducing a model with
high accuracy, precision, recall, and F1 score. This research is significant because
it advances the field of medical data analysis and predictive healthcare. Our work
aims to improve prediction accuracy, allowing for earlier diagnosis and more ef-
fective therapies. This has the potential to enhance patient outcomes while also
lowering the overall strain on healthcare systems. The findings of this study are
likely to provide useful insights into the application of advanced machine learn-
ing techniques in healthcare [5].
2. Related Works
Several studies have been conducted on diabetes prediction using traditional sta-
tistical methods and machine learning algorithms. Traditional statistical me-
thods such as logistic regression, decision trees, and k-means clustering have
been used to predict diabetes with varying degrees of accuracy.
In recent years, many researchers have been using the concept of machine
learning to predict Diabetes Mellitus disease. Some of the commonly used algo-
rithms include logistic regression (LR), XGBoost (XGB), gradient boosting (GB),
decision trees (DTs), ExtraTrees, random forest (RF), and light gradient boost-
ing machines (LGBM). Each classifier has its advantages over the other classifi-
ers.
Another recent development in machine learning is Extreme Gradient Boosting (XGBoost), introduced by [6]. XGBoost is an efficient and scalable implementation of gradient boosting. In one related study, several classifiers were employed; the highest accuracy among them, 99.14%, was achieved by Bagged Decision Trees.
[12] implemented a machine learning system for Type I and Type II Diabetes
Mellitus that employs an ensemble learning technique to track glucose levels
based on independent features. They used data from 27,050 cases and 111
attributes gathered from patients at 10 different Slovenian healthcare facilities
that focused on preventative medicine. For this framework, 59 variables were se-
lected after preprocessing and feature engineering. When compared to other clas-
sifiers, LightGBM achieved better results across the board. This included better
accuracy, precision, recall, AUC, AUPRC, and RMSE.
Using a variety of machine learning classifiers such as k-nearest neighbors, deci-
sion trees, AdaBoost, naive Bayes, XGBoost, and multi-layer perceptrons, [15] created a robust framework for Type II Diabetes Mellitus prediction. They used exploratory data analysis (EDA) for tasks including outlier detection, missing value completion, data standardization,
feature selection, and result validation. With a sensitivity of 0.789, a specificity of
0.934, a false omission rate of 0.092, a diagnostic odds ratio of 66.234, and an
AUC of 0.950, the ensembling classifiers AdaBoost and XGBoost performed the
best.
4) Combine the new weak predictive model with the previous models to create
an updated model.
5) Repeat steps 2 - 4 until a stopping criterion is met, such as a maximum
number of iterations or a minimum reduction in error.
Gradient Boosting is effective in classification tasks because it can handle
non-linear relationships and interactions between features, and it can be used
with various types of weak predictive models, such as decision trees, linear re-
gression, and neural networks [6].
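As a concrete illustration of this procedure (a minimal sketch, not the code used in this study), the following example trains a gradient boosting classifier with shallow decision trees as weak learners on synthetic binary-labelled data; all parameter values are illustrative assumptions.

```python
# Illustrative sketch of gradient boosting for binary classification using
# scikit-learn; the weak learners are shallow decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

gb = GradientBoostingClassifier(
    n_estimators=200,   # stopping criterion: maximum number of boosting rounds
    learning_rate=0.1,  # shrinks each weak model's contribution to the ensemble
    max_depth=3,        # depth of the decision-tree weak learners
    random_state=42,
)
gb.fit(X_train, y_train)
print("Test accuracy:", gb.score(X_test, y_test))
```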
The integration of Long Short-Term Memory (LSTM) with XGBoost represents
a novel contribution to diabetes prediction. This integration is expected to cap-
ture time-dependent patterns in diabetes progression and treatment response
while addressing the challenges posed by high-dimensional patient data. By leve-
raging the strengths of LSTM for temporal data analysis and XGBoost for robust
classification, the hybrid model is anticipated to significantly improve the accuracy
of diabetes prediction, thereby enabling more effective early intervention and pa-
tient care.
3. Methodology
The methodology section of this study outlines the comprehensive approach
undertaken to develop and evaluate a hybrid predictive model that synergizes
the capabilities of Long Short-Term Memory (LSTM) networks and eXtreme
Gradient Boosting (XGBoost) for the prediction of Type II Diabetes Mellitus.
This innovative model leverages the sequential data processing strength of LSTM
to capture temporal dependencies and intricate patterns within patient data,
alongside the robust classification and predictive power of XGBoost, to effectively
identify potential diabetes cases. This section delineates the step-by-step process,
from data collection and preprocessing to the final evaluation of the model’s per-
formance, establishing a clear and structured pathway toward achieving the goal
of improved diabetes prediction.
The dataset includes demographic information, BMI, age, gender, ethnicity, blood pressure measurements, blood test results, and pre-existing health conditions, including diabetes mellitus status. To
ensure the integrity and applicability of our model, we conducted a thorough
preprocessing routine. This involved the elimination of columns with more than
30% missing values and identifier columns, which do not contribute to the predic-
tive analysis. The resulting dataset was further refined to address residual missing
values, with medians imputed for numerical data and modes for categorical data,
ensuring a dataset devoid of null values.
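A minimal pandas sketch of this preprocessing routine is shown below; the file name and the identifier column are hypothetical placeholders, and only the 30% missing-value threshold and the median/mode imputation rules come from the description above.

```python
# Sketch of the preprocessing routine described above (not the authors' code).
import pandas as pd

df = pd.read_csv("wids_diabetes.csv")            # hypothetical file name

# Drop identifier columns and columns with more than 30% missing values.
df = df.drop(columns=["encounter_id"], errors="ignore")   # assumed identifier column
df = df.loc[:, df.isnull().mean() <= 0.30]

# Impute remaining gaps: medians for numerical columns, modes for categorical ones.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

assert df.isnull().sum().sum() == 0              # dataset now devoid of null values
```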
Feature engineering played a pivotal role in enhancing the predictive capabil-
ity of our model. This step involved the creation of new variables from existing
data points, designed to uncover underlying patterns and relationships indica-
tive of diabetes risk. Additionally, categorical variables were encoded to facilitate
their integration into the machine learning models, which necessitate numerical
input.
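As an example of what such feature engineering and encoding might look like (a sketch continuing the preprocessing snippet above; the column name "bmi" is an assumption):

```python
# Derive an illustrative engineered feature and one-hot encode categoricals.
import pandas as pd

df["bmi_category"] = pd.cut(df["bmi"], bins=[0, 18.5, 25, 30, 100],
                            labels=["underweight", "normal", "overweight", "obese"])

# Encode all remaining categorical variables so the models receive numerical input.
df = pd.get_dummies(df, drop_first=True)
```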
Figure 1 illustrates the varied distributions of selected clinical features from the
WiDS Diabetes Prediction Dataset. Each subplot highlights the different patterns
and ranges for features such as maximum oxygen saturation (h1_spo2_max),
minimum noninvasive diastolic blood pressure (h1_diasbp_noninvasive_min),
and patient age.
To address dataset imbalance where diabetes cases are fewer than non-diabetes
cases, the study uses Random Over-Sampling. This method duplicates the di-
abetes cases to balance the dataset, which helps prevent model bias toward the
more common non-diabetes cases.
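A sketch of this balancing step with the imbalanced-learn package is shown below, continuing the preprocessing snippets above; the target column name "diabetes_mellitus" is an assumption.

```python
# Random Over-Sampling: duplicate minority-class (diabetes) rows to balance classes.
from imblearn.over_sampling import RandomOverSampler

X = df.drop(columns=["diabetes_mellitus"])       # assumed target column name
y = df["diabetes_mellitus"]

ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(y_resampled.value_counts())                # both classes now equally represented
```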
The final stage of preprocessing involved standardizing the dataset using a
Standard Scaler. This procedure adjusted the data to have a mean of zero and a
standard deviation of one, a critical step to ensure uniformity in feature contri-
bution and to foster model convergence.
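A corresponding sketch of the standardization step, applied to the over-sampled feature matrix from the previous snippet:

```python
# Standardize features to zero mean and unit standard deviation.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_resampled)
```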
Table 1 illustrates the status before and after over-sampling:
Table 1. The number of instances before and after applying Random Over-Sampling to
balance the dataset.
Figure 2 shows the disparity between the cases with and without diabetes, in-
dicating the necessity for over-sampling.
Figure 3 demonstrates a balanced number of cases for both classes, achieved
by Random Over-Sampling to correct the imbalance in the dataset.
The input gate decides which new values to store, using the sigmoid and tanh functions to produce the gate activation i_t and an intermediate candidate value C̃_t, respectively:

i_t = sigmoid(W_i · [h_{t−1}, x_t] + b_i)    (2)

C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)    (3)

These values are combined to generate the cell state C_t, which incorporates old data with new inputs:

C_t = f_t · C_{t−1} + i_t · C̃_t    (4)

The cell output is then calculated, using the sigmoid function to decide which data will be output from the cell, and the tanh function to scale this output:

o_t = sigmoid(W_o · [h_{t−1}, x_t] + b_o)    (5)

h_t = o_t · tanh(C_t)    (6)
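To make the gate equations concrete, the following NumPy sketch implements a single LSTM cell step corresponding to Equations (1)-(6); the weight matrices are randomly initialized purely for illustration and do not reflect the trained model.

```python
# Minimal NumPy sketch of one LSTM cell step (Equations (1)-(6)).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ concat + b_f)        # forget gate, Eq. (1)
    i_t = sigmoid(W_i @ concat + b_i)        # input gate, Eq. (2)
    C_tilde = np.tanh(W_C @ concat + b_C)    # candidate cell state, Eq. (3)
    C_t = f_t * C_prev + i_t * C_tilde       # cell state update, Eq. (4)
    o_t = sigmoid(W_o @ concat + b_o)        # output gate, Eq. (5)
    h_t = o_t * np.tanh(C_t)                 # hidden state, Eq. (6)
    return h_t, C_t

hidden_size, n_features = 4, 3
rng = np.random.default_rng(0)
W = [rng.standard_normal((hidden_size, hidden_size + n_features)) for _ in range(4)]
b = [np.zeros(hidden_size) for _ in range(4)]
h, C = np.zeros(hidden_size), np.zeros(hidden_size)
h, C = lstm_step(rng.standard_normal(n_features), h, C, *W, *b)
```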
The Multivariate LSTM structure used in this study is similar to the classical
LSTM structure but is specifically tailored for time series analysis in diabetes
prediction. It captures the dynamic changes in health indicators over time, con-
tributing to the risk of diabetes [19].
The LSTM component transforms each input sequence into a feature vector, F = LSTM(x), where x denotes the sequential input data and F represents the extracted features.
This step adapts the feature set for efficient processing by the XGBoost classifier.
The hybrid LSTM-XGBoost model merges LSTM’s feature extraction from se-
quential data with XGBoost’s classification strength, enhancing diabetes prediction
by understanding temporal patterns and employing a robust classification frame-
work. This innovative approach aims to surpass traditional models in accuracy,
marking a significant advancement in analyzing complex health data.
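A hedged sketch of such a hybrid pipeline is shown below, reusing the scaled, over-sampled data from the preprocessing snippets; the layer sizes, number of epochs, and XGBoost hyperparameters are illustrative assumptions rather than the exact configuration reported in this paper, and a full experiment would evaluate on a held-out test split.

```python
# Hybrid sketch: an LSTM extracts features F from sequential input x, and
# XGBoost classifies the extracted features.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
from xgboost import XGBClassifier

timesteps, n_features = 1, X_scaled.shape[1]
X_seq = X_scaled.reshape(-1, timesteps, n_features)   # tabular rows as length-1 sequences

# Train an LSTM with a classification head, then reuse its last recurrent
# layer as the feature extractor (F = LSTM(x)).
inputs = Input(shape=(timesteps, n_features))
h = LSTM(64, activation="relu", return_sequences=True)(inputs)
h = LSTM(32, activation="relu", return_sequences=False)(h)
out = Dense(1, activation="sigmoid")(h)
lstm_model = Model(inputs, out)
lstm_model.compile(optimizer="adam", loss="binary_crossentropy")
lstm_model.fit(X_seq, y_resampled, epochs=10, batch_size=256, verbose=0)

feature_extractor = Model(inputs, h)                  # outputs the extracted features F
F = feature_extractor.predict(X_seq, verbose=0)

# XGBoost classifier trained on the LSTM-extracted features.
clf = XGBClassifier(objective="binary:logistic", n_estimators=300,
                    learning_rate=0.05, reg_lambda=1.0, eval_metric="logloss")
clf.fit(F, y_resampled)
```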
Figure 6. Training and validation loss of the LSTM model over epochs.
The model's performance is evaluated using standard metrics such as Accuracy, Precision, Recall, and the F1-Score. Additionally, the Confusion
Matrix will provide a detailed view of the model’s classification accuracy across
different categories.
3.6.1. Accuracy
This metric evaluates the total number of instances correctly predicted by the trained model relative to all possible instances. Accuracy is defined as the proportion of instances accurately classified to the total number of instances provided.
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (11)
where TP refers to true positive, TN refers to true negative, FP refers to false
positive, and FN refers to false negative values.
3.6.2. Precision
This metric measures the proportion of true positive cases among all predicted
positive instances. It is mathematically represented as follows:

Precision = TP / (TP + FP)    (12)
where TP refers to true positive and FP refers to false positive values.
3.6.3. Recall
This metric assesses the model's ability to correctly detect diabetes patients out of all actual cases of diabetes. Recall becomes an especially important measure when the cost of missing true cases (false negatives) is high, as in medical diagnosis. It is mathematically represented as follows:

Recall = TP / (TP + FN)    (13)

where TP refers to true positive and FN refers to false negative values.
3.6.4. F1-Score
The F1 score offers a combined metric of classification accuracy, taking into ac-
count both precision and recall. It is the harmonic mean of the two, providing a
balance between them. The F1 score reaches its maximum value when precision
and recall are equal. This measure effectively gauges the model’s comprehensive
performance by integrating the results of both precision and recall.
F1 Score = (2 × Precision × Recall) / (Precision + Recall)    (14)
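These metrics, together with the confusion matrix, can be computed with scikit-learn as in the sketch below, which assumes the fitted hybrid classifier and extracted features from the earlier snippet; a real evaluation would use predictions on a held-out test set.

```python
# Compute the evaluation metrics of Equations (11)-(14) and the confusion matrix.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_pred = clf.predict(F)

print("Accuracy :", accuracy_score(y_resampled, y_pred))
print("Precision:", precision_score(y_resampled, y_pred))
print("Recall   :", recall_score(y_resampled, y_pred))
print("F1 score :", f1_score(y_resampled, y_pred))
print(confusion_matrix(y_resampled, y_pred))   # rows: actual class, columns: predicted class
```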
The architecture of our hybrid model comprises LSTM and XGBoost components. The LSTM part encompasses a sequence of layers with "relu" activations, tuned to capture the temporal dynamics of the data. The "return_sequences" parameter is carefully adjusted to ensure the output feeds appropriately into subsequent layers. For the XGBoost classifier, a precise selection of hyperparameters balances the model's learning complexity with performance, incorporating a binary:logistic objective and regularization to optimize classification tasks.
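As an illustration of how such a configuration is typically expressed (the numeric values below are assumptions, not the settings reported in this paper):

```python
# Illustrative XGBoost configuration: binary:logistic objective plus
# feature-fraction selection and weight penalization to limit overfitting.
from xgboost import XGBClassifier

xgb_params = {
    "objective": "binary:logistic",   # binary classification objective
    "n_estimators": 300,              # number of boosted trees
    "max_depth": 6,                   # tree depth controls model complexity
    "learning_rate": 0.05,            # shrinkage applied to each tree
    "subsample": 0.8,                 # fraction of rows sampled per tree
    "colsample_bytree": 0.8,          # fraction of features sampled per tree
    "reg_alpha": 0.1,                 # L1 weight penalization
    "reg_lambda": 1.0,                # L2 weight penalization
}
xgb_clf = XGBClassifier(**xgb_params)
```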
Confusion Matrix
Figure 7 illustrates the LSTM model’s classification performance, with the con-
fusion matrix providing a clear visual representation. Darker shades indicate
higher numbers of correctly predicted cases, delineating the model’s true positive
and true negative rates. This visualization is key in evaluating the model’s ability
to distinguish between diabetic and non-diabetic instances accurately.
Precision
Figure 8 reflects the model’s precision, indicating the proportion of true posi-
tive predictions out of all positive predictions. High precision relates to a low
false positive rate, crucial for medical diagnostic tools.
Recall
Figure 9 shows the model’s recall, reflecting its capability to identify all actual
positives accurately. High recall indicates minimal false negatives, a vital factor
in medical diagnosis, where overlooking a true condition could have significant
consequences.
F1 Score
Figure 10 presents the F1 score, combining precision and recall into a single measure that offers a balanced perspective on the LSTM model's classification efficacy. A high F1 score suggests a balanced classification capability.
Table 3. Architectural parameters of the XGBoost model with detailed descriptions, highlighting
the model’s complexity and regularization strategies to ensure effective learning without overfitting.
Architecture:
Table 3 delineates the architectural parameters of the XGBoost model, detailing
the specific values and their functions. It sheds light on the model’s complexity
and the implemented regularization strategies, such as feature fraction selection
and weight penalization, which are pivotal in fostering effective learning and
averting overfitting.
Confusion Matrix
Figure 11 illustrates the model’s proficiency in classifying true positives and
true negatives, which are pivotal for appraising the performance of a binary clas-
sifier.
Table 4 shows the precision, recall, and F1 score for the XGBoost model,
showcasing its reliable performance across both classes. The scores indicate the
model’s balanced accuracy in classifying both the negative and positive instances,
essential for medical diagnostics.
Confusion Matrix
Figure 12 shows the hybrid model’s true positive and true negative rates, with
the top left and bottom right cells displaying the counts of accurately predicted
negative (0) and positive (1) classes, respectively. The off-diagonal cells denote
the instances of misclassification.
Table 6 presents a concise summary of the LSTM-XGBoost model’s perfor-
mance, detailing the precision, recall, and F1 score metrics for both classes. Pre-
cision values demonstrate the model’s accuracy in predicting positive cases,
while recall figures reflect its effectiveness in identifying all positive samples. The
F1 scores indicate a well-balanced harmony between precision and recall for
both classes.
room for growth in reliably diagnosing non-diabetic cases, while its F1 Score of
0.84 indicates a decent but not ideal combination of precision and recall.
The comparison table reports, for each model type, the train accuracy, test accuracy, precision, recall, and F1 score.
In comparison, the hybrid model, which combines the properties of both LSTM and XGBoost, outperforms the separate models by scoring near-perfect on all criteria. It achieves an impressive training accuracy of 0.99 and a test accuracy of 0.98, demonstrating strong learning and generalization abilities. The model achieves a high precision score of 0.97 and a near-perfect recall score of 0.99, demonstrating its outstanding ability to identify almost all positive diabetes cases with very few false negatives. The hybrid model has a considerably higher F1 Score (0.98) than the
standalone LSTM and XGBoost models, indicating a better balance of precision
and recall. The hybrid model’s comprehensive and high-performing nature de-
monstrates the usefulness of combining LSTM’s sequential data processing ca-
pacity with XGBoost’s powerful classification, resulting in the most robust and
dependable model for predictive tasks in this study.
5. Discussion
Our study’s findings suggest that combining LSTM (Long Short-Term Memory)
and XGBoost models into a hybrid model is effective at predicting diabetes. This hybrid model has demonstrated high accuracy, precision, recall, and F1 scores, all of which indicate how well the model predicts diabetes. The
rationale for this success is that LSTM excels at interpreting and processing pa-
tient data over time, whereas XGBoost excels at categorizing it (such as “has di-
abetes” or “does not have diabetes”). They work better together than they would
individually. The LSTM detects crucial trends and patterns in the patient’s health
data over time, and XGBoost uses these discoveries to reliably forecast whether a
patient has diabetes.
6. Conclusion
6.1. Summary of Key Findings
Our research yields several significant findings on predicting diabetes using deep learning and machine learning techniques. The key achievement was the crea-
tion and validation of a hybrid LSTM-XGBoost model, which outperformed
standalone LSTM and XGBoost models. This model correctly predicted diabetes
by efficiently processing patient data, identifying temporal trends with LSTM and performing robust classification with XGBoost. The strong accuracy, preci-
sion, recall, and F1 scores suggest that this model has the potential to be a trust-
worthy diabetes prediction tool in healthcare.
The hybrid approach’s effectiveness stems from its ability to combine the bene-
fits of LSTM’s sequential data processing with XGBoost’s excellent categorization
capabilities. This synergy has proven especially useful when working with compli-
cated datasets common in healthcare, where variables are numerous and interde-
pendent.
7. Experimental Setup
Our research utilized Jupyter Notebooks via Anaconda and Google Colab’s
cloud-based platform to develop and evaluate the hybrid LSTM-XGBoost model
for diabetes prediction.
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this pa-
per.
References
[1] Sevilla-Gonzalez, M.D.R., Bourguet-Ramirez, B., Lazaro-Carrera, L.S., Marta-
gon-Rosado, A.J., Gomez-Velasco, D.V. and Viveros-Ruiz, T.L. (2022) Evaluation of a
Web Platform to Record Lifestyle Habits in Subjects at Risk of Developing Type 2
Diabetes in a Middle-Income Population: Prospective Interventional Study. JMIR
Diabetes, 7, e25105. https://doi.org/10.2196/25105
[2] Alam, T.M., Iqbal, M.A., Ali, Y., Wahab, A., Ijaz, S., Baig, T.I., Hussain, A., Malik,
M.A., Raza, M.M., Ibrar, S., et al. (2019) A Model for Early Prediction of Diabetes.
Informatics in Medicine Unlocked, 16, Article ID: 100204.
https://doi.org/10.1016/j.imu.2019.100204
[3] Bhat, S.S., Selvam, V., Ansari, G.A., Ansari, M.D., Rahman, M.H., et al. (2022) Pre-
valence and Early Prediction of Diabetes Using Machine Learning in North Kash-
mir: A Case Study of District Bandipora. Computational Intelligence and Neuros-
cience, 2022, Article ID: 2789760. https://doi.org/10.1155/2022/2789760
[4] American Diabetes Association (2010) Diagnosis and Classification of Diabetes
Mellitus. Diabetes Care, 33, S62-S69. https://doi.org/10.2337/dc10-S062
[5] Bhat, S.S. and Ansari, G.A. (2021) Predictions of Diabetes and Diet Recommendation
System for Diabetic Patients Using Machine Learning Techniques. 2021 2nd Interna-
tional Conference for Emerging Technology (INCET), Belagavi, 21-23 May 2021, 1-5.
[6] Chen, T.Q. and Guestrin, C. (2016) Xgboost: A Scalable Tree Boosting System. Pro-
ceedings of the 22nd ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, San Francisco, 13-17 August 2016, 785-794.
https://doi.org/10.1145/2939672.2939785
[7] Ahamed, B.S., Arya, M.S. and Nancy, A.O. (2022) Diabetes Mellitus Disease Predic-
tion Using Machine Learning Classifiers and Techniques Using the Concept of Data
Augmentation and Sampling. In: Tuba, M., Akashe, S. and Joshi, A., Eds., ICT Sys-
tems and Sustainability: Proceedings of ICT4SD 2022, Springer, Berlin, 401-413.
https://doi.org/10.1007/978-981-19-5221-0_40
[8] Zhang, X.J. and Zhang, Q.R. (2020) Short-Term Traffic Flow Prediction Based on