Machine Learning Approach For Predicting Heart and Diabetes Diseases Using Data-Driven Analysis
Corresponding Author:
Usha Sekar
Department of Computer Science, Faculty of Science and Humanities, SRM Institute of Science and
Technology, Kattankulathur, Chengalpattu Dt., TamilNadu, India
Email: [email protected]
1. INTRODUCTION
Healthy living is crucial to a good quality of life. Healthcare professionals prevent, diagnose, and
treat diseases to improve health. Because the information provided by a patient is often inexact, it
can be challenging to determine a specific disease from symptoms alone [1]. Predicting diseases
is a fundamental challenge worldwide [2]. Many diseases are associated with particular symptoms and
signs. A disease can be inherited, caused by infection, or triggered by stress [3]. Modern lifestyles
carry a risk of mortality and morbidity from diseases such as heart disease, chronic respiratory disease, and
diabetes [4]. Millions of people worldwide suffer and die from such diseases [5]. The majority of
people with multiple disorders are also distressed by numerous infections.
Predicting disease from early-stage symptoms remains a difficult challenge for
physicians. Medical informatics and disease prediction have become increasingly
relevant to the data science community in recent years. Manually collected medical data often suffer
from repetition, multiple attributes, incompleteness, and closely correlated variables, which makes it
difficult to identify disease symptoms. The extensive use of computer-based technologies in the health
sector has made large health databases available to researchers. Many surgical research studies use these
electronic records [6]. Hospitals now maintain electronic health data for patients to help identify symptoms
and diagnose disease. Analyzing and interpreting the information recorded in electronic health records,
providing feedback, and implementing changes based on the collected data can revolutionize a health care system [7].
Machine learning techniques have become increasingly significant in various fields over the past decade,
including health care and biomedical research [8]. Correctly medicating a patient on the basis of
a large amount of data is nonetheless an enormous task. Since the advent of the digital era and technological
innovations, several multidimensional patient data sets have been developed, including clinical data, hospital
resource information, and patient disease diagnosis information. Such complex data sets must be analyzed to
extract valuable insights [9]. The proposed work aims to develop heart disease and diabetes prediction models
from a single dataset using supervised machine learning, specifically ensemble methods, so that more than one
disease can be predicted from the same dataset.
Heart disease has been the leading cause of death worldwide in recent decades [10]. The heart
is a vital organ; various factors cause heart disease, and people exhibit different
symptoms [11]. Diabetes is likewise regarded as a serious and potentially deadly chronic disease [12].
Elevated blood sugar and fat levels substantially affect patients' daily lives.
In health care systems, machine learning techniques can help medical practitioners diagnose
various diseases from medical data promptly and cost-effectively [13]. Identifying probable disorders could
help patients undergo targeted medical tests. Without such information, a patient might skip extensive
medical tests, leading to severe health problems. Machine learning identifies
patterns in massive datasets, a task that would otherwise require human intelligence. Machine learning (ML)
approaches can help build prediction models that handle and analyze vast volumes of complex medical data and
efficiently determine the presence or absence of disease in a patient [14].
The main aim of recursion enhanced random forest with an improved linear model (RERF-ILM) is to
find the key features; combining classification models gives the prediction model better performance.
That work compares the essential variables, suggests that coronary artery disease develops more frequently
as people age, describes disease progression, and predicts disease outcomes [15]. The approach proposed
in [16] uses a model with various features and known classification techniques to recognize the relevant
factors through a machine learning algorithm, which improves the accuracy of cardiovascular disease prediction.
Its hybrid random forest with a linear model (HRFLM) achieved an accuracy of 88.7%.
According to [17], diabetes is difficult to identify. A rigorous framework was
developed that rejects outliers and eliminates missing values. After the data are standardized and features
are selected, various machine learning classifiers are applied. The method improved the outcome by weighting
the classifiers according to the area under the receiver operating characteristic curve.
The work in [18] proposes a hybrid technique that applies different machine learning classifiers to
diagnose cardiovascular disease. The performance metrics of the classifiers are evaluated
using the WEKA and KEEL tools. The primary intent of that work is to choose the best classifier by comparing
each classifier's accuracy. The system in [19] used Python to perform preprocessing with the neighborhood
cleaning rule and feature engineering. AutoML, advanced extended gradient boost, and advanced ensemble
bagging models are applied. That work is intended to help specialists identify whether or not someone has
cardiovascular disease and diagnose the patient's condition. The study in [20] used four ML methods
to estimate diabetes risk, with bagging and boosting techniques used to enhance robustness. Among the
algorithms compared, random forest provided the most accurate results. The study in [21] employs
the AdaBoost and bagging ensemble techniques, with the J48 (C4.5) decision tree as the base learner, alongside
a standalone J48 decision tree, to classify patients with diabetes using
diabetes risk indicators. In that study, the AdaBoost ensemble outperformed both bagging and the standalone
J48 decision tree in overall performance.
The work in [22] aims to develop a model to predict diabetes. K-nearest neighbor (KNN) helps reduce
the processing time, and support vector machine (SVM) assigns a class to every sample in the dataset. Feature
selection in that work helps build four classifiers, which the researchers use
to determine the efficacy and accuracy of predicting whether or not people will have diabetes. In
the study [23], a hierarchical ensemble model combines decision tree and logistic regression classifiers
trained independently. A neural network combined with these models at the next level provides better overall
accuracy.
This work focuses on diagnosing the risk of heart disease and diabetes and encourages people
to maintain good health. The proposed study shows that two chronic ailments, diabetes and heart disease,
can be predicted using the chi-square filter method and principal component analysis (PCA). Classification
techniques in diagnostics can help to avoid human error. The model uses ensemble boosting
strategies, namely AdaBoost, gradient boosting, and extreme gradient boosting, to improve prediction accuracy.
The rest of the paper is organized as follows: section 2 describes the method, section 3 presents the results
and discussion, and section 4 concludes with future work.
2. METHOD
This section describes the datasets, feature selection, and the ensemble classifiers: adaptive boosting
(AdaBoost), gradient boosting, and extreme gradient boosting (XGBoost). Figure 1 depicts the disease prediction
pipeline. The proposed system is structured into the following phases: data collection, data
preprocessing and feature selection, feature extraction, data splitting, classifier models, evaluation metrics,
and comparison of the ensemble classifier models.
Machine learning approach for predicting heart and diabetes diseases using data-driven … (Usha Sekar)
1690 ISSN: 2252-8938
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}
\]
Where,
− TP (true positive): the classifier predicted TRUE, and the correct class was TRUE.
− TN (true negative): the classifier predicted FALSE, and the correct class was FALSE.
− FP (false positive): the classifier predicted TRUE, but the correct class was FALSE.
− FN (false negative): the classifier predicted FALSE, but the patient actually had the disease.
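The accuracy in (1) can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function names and the label vectors are hypothetical and serve only as an illustration (1 denotes disease present, 0 denotes disease absent).

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = disease, 0 = no disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy as in Eq. (1): correct predictions over all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (tp + tn) / total


# Hypothetical labels, for illustration only (not taken from the study's dataset).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
```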
The proposed work uses correlation and the chi-square selection method to select features after data
preprocessing. A heat map represents the correlation between the target and the other features and shows the
relationships among the features. Figure 2 and Figure 3 use heat maps to highlight the correlation between
the dataset's attributes for the two predictions. A heat map represents values as colors in a two-dimensional
grid and provides a quick visual summary of the data at a glance. The viewer can easily comprehend
complex datasets using more elaborate heat maps. Feature selection increases the classification accuracy.
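The matrix behind such a heat map is the pairwise Pearson correlation of the attributes and the target. The sketch below computes it with NumPy on synthetic data; the column names, coefficients, and sample size are assumptions chosen for illustration and do not come from the study's dataset. Any plotting library can then render the resulting matrix as colors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in attributes (hypothetical, not the study's data).
age = rng.normal(55, 10, n)
glucose = rng.normal(100, 20, n)
noise_feature = rng.normal(0, 1, n)

# Synthetic binary target that depends on age and glucose only.
risk = 0.05 * (age - 55) + 0.03 * (glucose - 100) + rng.normal(0, 0.5, n)
target = (risk > 0).astype(float)

names = ["age", "glucose", "noise", "target"]
data = np.column_stack([age, glucose, noise_feature, target])

# Pairwise Pearson correlation; columns are variables, hence rowvar=False.
corr = np.corrcoef(data, rowvar=False)

# Correlation of each attribute with the target (last row, without the diagonal).
target_corr = dict(zip(names[:-1], corr[-1, :-1]))
for name, value in target_corr.items():
    print(f"{name:8s} {value:+.2f}")
```

Attributes whose correlation with the target is close to zero, such as the noise column here, are natural candidates for removal.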
According to the principle of feature importance, every feature has a score that determines
the extent of its contribution to the prediction. Figure 4 shows the most significant features for prediction
based on the feature importance generated by the filter method. The subsets of features used to predict the
two diseases differ in this work. The most significant features were sysbp, glucose, age, chol, and ciger for
heart disease and api_hi, weight, api, age, and cholesterol for diabetes. The features with the highest
importance scores are selected to predict heart disease and diabetes.
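Ranking features by a chi-square score can be sketched as follows. This example uses the contingency-table form of the Pearson chi-square statistic on categorical (here binary) features, which is an assumption: the exact scoring variant used in this work is not specified, and library implementations may compute the statistic differently. The feature names and values are hypothetical.

```python
from collections import Counter


def chi_square_score(feature, target):
    """Pearson chi-square statistic between two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    score = 0.0
    for f_val, f_count in feature_totals.items():
        for t_val, t_count in target_totals.items():
            expected = f_count * t_count / n  # count expected under independence
            score += (observed[(f_val, t_val)] - expected) ** 2 / expected
    return score


# Hypothetical binarized features and target, for illustration only.
target = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "high_sysbp": [1, 1, 1, 1, 0, 0, 0, 0],    # perfectly associated with target
    "high_glucose": [1, 1, 1, 0, 1, 0, 0, 0],  # partially associated
    "unrelated": [1, 0, 1, 0, 1, 0, 1, 0],     # independent of target
}

scores = {name: chi_square_score(col, target) for name, col in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)   # {'high_sysbp': 8.0, 'high_glucose': 2.0, 'unrelated': 0.0}
print(ranking)  # ['high_sysbp', 'high_glucose', 'unrelated']
```

A higher score indicates a stronger dependence between the feature and the target, so the top-ranked features are the ones retained.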
In the final phase, the methodology improves accuracy with boosting classifiers, namely
AdaBoost, gradient boosting, and XGBoost. These results were obtained by combining the features selected
by chi-square with PCA to devise the best classifiers for diagnosing the disease. Applying principal component
analysis to the input features reduces the dimensionality of the input, lowers computational complexity, and
speeds up the training process. Figure 5 depicts the most
significant accuracy outcomes for heart disease and diabetes.
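The dimensionality reduction performed by PCA can be illustrated with a short NumPy sketch based on the singular value decomposition. The function name, the synthetic data, and the choice of two components are assumptions made for illustration; they are not the settings used in this study.

```python
import numpy as np


def pca_transform(X, n_components):
    """Project X onto its top principal components using the SVD."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    return X_centered @ components.T, ratio


rng = np.random.default_rng(0)

# Synthetic data: 100 samples, 5 correlated features driven by 2 latent factors.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(100, 5))

Z, ratio = pca_transform(X, n_components=2)
print(Z.shape)      # (100, 2)
print(ratio.sum())  # fraction of total variance kept by the two components
```

Because the five synthetic features are driven by two latent factors, two components retain almost all of the variance while the classifier receives fewer than half of the original inputs.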
The results section compares all boosting classifiers on both disease predictions. The classification
results for heart disease and diabetes are shown in Table 1. Among the boosting classifiers,
extreme gradient boosting (XGBoost) performed
best and produced the highest accuracy.
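A comparison of this kind can be sketched with scikit-learn as shown below. This is an illustrative sketch under stated assumptions, not the implementation used in this study: the data are synthetic, the values of k, the number of components, and the split ratio are arbitrary, and XGBoost is left out so that the example depends on scikit-learn only (the `XGBClassifier` from the separate `xgboost` package could be added to the dictionary in the same way). The printed accuracies describe the synthetic data and are unrelated to Table 1.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a disease dataset (hypothetical, for illustration only).
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

classifiers = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, clf in classifiers.items():
    model = make_pipeline(
        MinMaxScaler(),          # chi2 requires non-negative inputs
        SelectKBest(chi2, k=8),  # filter-based feature selection
        PCA(n_components=4),     # feature extraction
        clf,                     # boosting classifier
    )
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Wrapping the scaler, selector, and PCA in one pipeline ensures that they are fitted on the training split only, so the test accuracy is not inflated by information from the test data.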
4. CONCLUSION
This study aimed to develop a dependable and accurate predictive model for heart disease and diabetes.
It used a single dataset that contains target variables for both diseases.
The chi-square filter method selected the features used to diagnose each disease, and PCA was
used to extract the features. Three ensemble boosting classifiers were applied: AdaBoost, gradient boosting, and
XGBoost. Results showed that XGBoost provides higher accuracy than the other boosting algorithms in
both disease predictions. Future work will aim to improve the performance metrics by implementing a hybrid model
for both diseases.
ACKNOWLEDGEMENTS
I am pleased to thank my research supervisor Dr. S. Kanchana for her guidance and enthusiastic
encouragement of my research work. This article received no financial support for its research, authorship,
and/or publication.
REFERENCES
[1] A. K. Yadav, R. Shukla, and T. R. Singh, “Machine learning in expert systems for disease diagnostics in human healthcare,”
Machine Learning, Big Data, and IoT for Medical Informatics, pp. 179–200, 2021, doi: 10.1016/B978-0-12-821777-1.00022-7.
[2] P. G. Shynu, V. G. Menon, R. L. Kumar, S. Kadry, and Y. Nam, “Blockchain-based secure healthcare application for diabetic-
cardio disease prediction in fog computing,” IEEE Access, vol. 9, pp. 45706–45720, 2021, doi: 10.1109/ACCESS.2021.3065440.
[3] K. Burse, V. P. S. Kirar, A. Burse, and R. Burse, “Various preprocessing methods for neural network based heart disease prediction,”
Advances in Intelligent Systems and Computing, vol. 851, pp. 55–65, 2019, doi: 10.1007/978-981-13-2414-7_6.
[4] P. Priyanga, V. V. Pattankar, and S. Sridevi, “A hybrid recurrent neural network-logistic chaos-based whale optimization framework
for heart disease prediction with electronic health records,” Computational Intelligence, vol. 37, no. 1, pp. 315–343, 2021,
doi: 10.1111/coin.12405.
[5] A. Elumalai, P. B. Maruthi, N. Gautam, S. Priyadharshini, and M. Suganthy, “RETRACTED ARTICLE: Optimal prediction of
attacks and arterial stiffness effects on heart disease by hybrid machine learning algorithm,” Journal of Ambient Intelligence and
Humanized Computing, vol. 13, p. 83, 2022, doi: 10.1007/s12652-020-02706-4.
[6] S. Uddin, A. Khan, M. E. Hossain, and M. A. Moni, “Comparing different supervised machine learning algorithms for disease
prediction,” BMC Medical Informatics and Decision Making, vol. 19, no. 1, 2019, doi: 10.1186/s12911-019-1004-8.
[7] A. M. Khedr, Z. Al Aghbari, A. Al Ali, and M. Eljamil, “An efficient association rule mining from distributed medical databases
for predicting heart diseases,” IEEE Access, vol. 9, pp. 15320–15333, 2021, doi: 10.1109/ACCESS.2021.3052799.
[8] A. K. Dubey, “Optimized hybrid learning for multi disease prediction enabled by lion with butterfly optimization algorithm,”
Sadhana - Academy Proceedings in Engineering Sciences, vol. 46, no. 2, 2021, doi: 10.1007/s12046-021-01574-8.
[9] R. Manne and S. C. Kantheti, “Application of artificial intelligence in healthcare: chances and challenges,” Current Journal of
Applied Science and Technology, pp. 78–89, 2021, doi: 10.9734/cjast/2021/v40i631320.
[10] R. C. Ripan et al., “A data-driven heart disease prediction model through k-means clustering-based anomaly detection,” SN
Computer Science, vol. 2, no. 2, 2021, doi: 10.1007/s42979-021-00518-7.
[11] R. Kumar and P. Rani, “Comparative analysis of decision support system for heart disease,” Advances in Mathematics: Scientific
Journal, vol. 9, no. 6, pp. 3349–3356, 2020, doi: 10.37418/amsj.9.6.15.
[12] U. Ahmed et al., “Prediction of diabetes empowered with fused machine learning,” IEEE Access, vol. 10, pp. 8529–8538, 2022,
doi: 10.1109/ACCESS.2022.3142097.
[13] L. Men, N. Ilk, X. Tang, and Y. Liu, “Multi-disease prediction using LSTM recurrent neural networks,” Expert Systems with
Applications, vol. 177, 2021, doi: 10.1016/j.eswa.2021.114905.
[14] M. N. Uddin and R. K. Halder, “An ensemble method based multilayer dynamic system to predict cardiovascular disease using
machine learning approach,” Informatics in Medicine Unlocked, vol. 24, 2021, doi: 10.1016/j.imu.2021.100584.
[15] C. Guo, J. Zhang, Y. Liu, Y. Xie, Z. Han, and J. Yu, “Recursion enhanced random forest with an improved linear model (RERF-
ILM) for heart disease detection on the internet of medical things platform,” IEEE Access, vol. 8, pp. 59247–59256, 2020,
doi: 10.1109/ACCESS.2020.2981159.
[16] S. Mohan, C. Thirumalai, and G. Srivastava, “Effective heart disease prediction using hybrid machine learning techniques,” IEEE
Access, vol. 7, pp. 81542–81554, 2019, doi: 10.1109/ACCESS.2019.2923707.
[17] M. K. Hasan, M. A. Alam, D. Das, E. Hossain, and M. Hasan, “Diabetes prediction using ensembling of different machine learning
classifiers,” IEEE Access, vol. 8, pp. 76516–76531, 2020, doi: 10.1109/ACCESS.2020.2989857.
[18] F. Z. Abdeldjouad, M. Brahami, and N. Matta, “A hybrid approach for heart disease diagnosis and prediction using machine learning
techniques,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), vol. 12157 LNCS, pp. 299–306, 2020, doi: 10.1007/978-3-030-51517-1_26.
[19] M. Fayez and S. Kurnaz, “RETRACTED ARTICLE: Novel method for diagnosis diseases using advanced high-performance
machine learning system (Applied Nanoscience, (2023), 13),” Applied Nanoscience (Switzerland), vol. 13, no. 3, p. 1787, 2023,
doi: 10.1007/s13204-021-01990-6.
[20] N. Nai-Arun and R. Moungmai, “Comparison of classifiers for the risk of diabetes prediction,” Procedia Computer Science, vol.
69, pp. 132–142, 2015, doi: 10.1016/j.procs.2015.10.014.
[21] S. Perveen, M. Shahbaz, A. Guergachi, and K. Keshavjee, “Performance analysis of data mining classification techniques to predict
diabetes,” Procedia Computer Science, vol. 82, pp. 115–121, 2016, doi: 10.1016/j.procs.2016.04.016.
[22] M. Panda, D. P. Mishra, S. M. Patro, and S. R. Salkuti, “Prediction of diabetes disease using machine learning algorithms,” IAES
International Journal of Artificial Intelligence, vol. 11, no. 1, pp. 284–290, 2022, doi: 10.11591/ijai.v11.i1.pp284-290.
[23] M. Abedini, A. Bijari, and T. Banirostam, “Classification of Pima Indian diabetes dataset using ensemble of decision tree, logistic
regression and neural network,” IJARCCE, vol. 9, no. 7, pp. 1–4, 2020, doi: 10.17148/ijarcce.2020.9701.
[24] E. Nasarian et al., “Association between work-related features and coronary artery disease: A heterogeneous hybrid feature selection
integrated with balancing approach,” Pattern Recognition Letters, vol. 133, pp. 33–40, 2020, doi: 10.1016/j.patrec.2020.02.010.
[25] B. A. Tama and K. H. Rhee, “Tree-based classifier ensembles for early detection method of diabetes: an exploratory study,”
Artificial Intelligence Review, vol. 51, no. 3, pp. 355–370, 2019, doi: 10.1007/s10462-017-9565-3.
[26] Y. Wang and L. Feng, “An adaptive boosting algorithm based on weighted feature selection and category classification confidence,”
Applied Intelligence, vol. 51, no. 10, pp. 6837–6858, 2021, doi: 10.1007/s10489-020-02184-3.
[27] P. Theerthagiri and J. Vidya, “Cardiovascular disease prediction using recursive feature elimination and gradient boosting
classification techniques,” Expert Systems, vol. 39, no. 9, 2022, doi: 10.1111/exsy.13064.
[28] H. Jiang et al., “Machine learning-based models to support decision-making in emergency department triage for patients with
suspected cardiovascular disease,” International Journal of Medical Informatics, vol. 145, 2021,
doi: 10.1016/j.ijmedinf.2020.104326.
[29] D. Ananey-Obiri and E. Sarku, “Predicting the presence of heart diseases using comparative data mining and machine learning
algorithms,” International Journal of Computer Applications, vol. 176, no. 11, pp. 17–21, 2020, doi: 10.5120/ijca2020920034.
BIOGRAPHIES OF AUTHORS
S. Usha received the B.Sc. and MCA degrees from Madurai Kamaraj
University. She worked as an Assistant Professor for 12 years at SRM Institute of Science
& Technology. She is currently pursuing a Ph.D. as a full-time research scholar in the
Department of Computer Science, SRM Institute of Science & Technology, Kattankulathur,
Chennai, India. Her research areas include image processing, data mining, cloud
computing, machine learning, and deep learning. She has published a paper in an
international journal and presented papers at national and international conferences. She can
be contacted at email: [email protected]