Heart Failure Prediction Based On Random Forest Algorithm Using Genetic Algorithm For Feature Selection
Corresponding Author:
Yudi Ramdhani
Department of Informatic Engineering, Adhirajasa Reswara Sanjaya University
Bandung, Indonesia
Email: [email protected]
1. INTRODUCTION
The heart is a vital organ, often called the "center of the human body," because it supplies blood to all other organs; when it can no longer fulfill this role, a person may die suddenly. According to studies, most adults who are susceptible to heart-related disorders have unhealthy eating habits, mental stress, depression, and excessive working hours, and the same factors are present in people who develop heart failure [1]. Heart failure is essentially a condition in which the heart is unable to pump enough blood to the body's organs. It typically arises from diseases such as diabetes and hypertension, or from other conditions such as human immunodeficiency virus (HIV) infection, thyroid disorders, alcoholism, or genetic diseases [2]. When the heart muscle deteriorates, the heart's ability to pump blood is restricted [3]. Notably, heart disease is one of the most prevalent ailments among middle-aged people [4]. As confirmed by the World Health Organization (WHO) [3], cardiovascular disease (CVD), a condition affecting the heart and blood vessels, is responsible for 31% of annual deaths. Early and precise recognition of the condition is therefore essential to deliver the proper treatment process and save the lives of many patients [5]. Medical professionals have effectively used machine learning algorithms such as principal component analysis (PCA) and support vector machines (SVM) to assist in the diagnosis of a variety of disorders, including diabetes and heart failure. Researchers have also applied artificial neural networks (ANN) in the medical field [6].
Additionally, data mining plays a significant role in extracting useful information from massive data and is widely employed in practically every sphere of life, including business, engineering, medicine, and education. Data mining is used to examine data while minimizing the gap between forecasts and factual outcomes, and several machine learning algorithms have been applied to capture the complexity and non-linear interactions between diverse factors [7]. This prospective strategy also opens a significantly better resource window, improving the sensitivity and specificity of disease detection and diagnosis [8]. As medical data keeps growing, machine learning algorithms are needed to help medical teams analyze data and reach precise and accurate diagnostic conclusions [9]. Many classification algorithms are used in medical data mining to forecast cardiovascular disease and cardiac death in patients [10].
Machine learning algorithms typically handle prediction and classification tasks on ready-to-use datasets, and researchers have tried a variety of methods to increase the precision of data classification in order to identify potential patients [11]. In this work, a machine learning system was used to classify death from heart disease based on measurement results and life information gathered from individuals. Classification is an application of supervised learning on labeled data: the dataset is split into training and test data, and the trained model assigns a class label to each new instance. One well-known classification method is random forest [12]. Ho originally proposed the random forest algorithm in 1995, claiming that it can achieve greater accuracy without overtraining provided the decision trees can recognize oblique hyperplanes [13]. The "randomness" of the underlying algorithm is a prerequisite for the random forest algorithm's improved accuracy. In this study, the random forest technique is combined with a genetic algorithm that selects features for classifying heart failure. The primary goal of this work is to overcome the issue of dataset imbalance and to select features that yield higher accuracy. To do this, we employ the random forest algorithm with a genetic algorithm for feature selection to predict the survival of heart failure patients and compare the results with earlier studies.
2. METHOD
The data mining paradigm, which describes the process of searching for or mining knowledge, is the foundation of this research, in which the optimization function is based on the random forest algorithm. The genetic algorithm searches the "heart failure clinical record dataset" using this objective function to uncover significant features. Figure 1 shows the overall progression of this study.
2.2. Dataset
The dataset is the primary component analyzed by the algorithm in this research. The heart failure clinical records data from the UCI repository website were used in this investigation. The collection includes the medical histories of 299 patients with heart conditions, all older than 40 years: 194 males and 105 females. In the target class there were 203 survivors (death event = 0) and 96 fatalities (death event = 1), i.e., 67.89% negative and 32.11% positive cases. The dataset was published in 2020 and contains 13 features for the 299 records; the features are listed in Table 1.
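For illustration, the dataset can be loaded and its class balance checked as in the minimal sketch below. The CSV file name and the target column name (DEATH_EVENT) are assumptions based on the UCI distribution of the data, not details quoted from this paper.

```python
import pandas as pd

# Minimal loading sketch; file name and target column name are assumptions
# based on the UCI distribution of the heart failure clinical records data.
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

X = df.drop(columns=["DEATH_EVENT"])   # clinical features
y = df["DEATH_EVENT"]                  # 1 = death event, 0 = survivor

print(len(df))                          # expected: 299 records
print(y.value_counts(normalize=True))   # roughly 67.89% negative, 32.11% positive
```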
2.3. Pre-processing
Data preprocessing for data mining is the collection of techniques applied before data mining methods are used, and it is acknowledged as one of the most important challenges in the well-known knowledge discovery in databases process [14]. Data cannot be used directly to begin the data mining process since it is likely to be incomplete, inconsistent, and redundant. More complex analysis methods are required as data collection scales up. Preprocessing enables the processing of data that would not otherwise be usable by adapting the data to the constraints imposed by the different data mining algorithms [15].
Once constructed, the random forest approach is distinguished by excellent classification performance and good noise resistance [18]. Each decision tree is grown with the CART method to its maximum size and is then left unpruned. The result is a forest, that is, a collection of trees [21]. The random forest classification approach thus consists of a structured collection of decision trees, each of which casts a unit vote for the most prevalent class on input x using independent, identically distributed random vectors, as shown in Figure 2 [22].
Each decision tree produces a result for the input data, and the aggregation of these outputs yields the final result of the random forest. From the original training set, the bagging method draws random samples with replacement to form a different training set for each tree. The features are also selected by sampling: if the dataset has N features, M features are sampled from the N, where M << N. For each bootstrap training set, only the M randomly selected features, rather than all N, are considered for node splitting while constructing the tree. All decision trees are grown freely without pruning [18].
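A minimal sketch of this construction, using scikit-learn's RandomForestClassifier with bootstrap sampling and per-split feature subsampling, is given below. The tree count and other hyperparameters are illustrative assumptions, not the settings used in this study, and X, y are taken from the dataset sketch above.

```python
from sklearn.ensemble import RandomForestClassifier

# Sketch of the forest described above: each tree is grown on a bootstrap
# sample (bagging) and only a random subset of M features (here sqrt(N)) is
# considered at every node split; trees are left unpruned.
rf = RandomForestClassifier(
    n_estimators=100,      # number of unpruned CART-style trees (assumed)
    max_features="sqrt",   # M features sampled from the N available per split
    bootstrap=True,        # sample the training set with replacement per tree
    random_state=42,
)
rf.fit(X, y)

# The forest prediction is the majority vote of the individual trees.
print(rf.predict(X.iloc[:5]))
```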
Candidate solutions are, broadly speaking, components of a potential answer to a specific problem. The population consists of the many candidate solutions that the GA creates. Each candidate solution comprises a collection of parameters known as the genotype or chromosome; the genome is the representation of this set of variables [24]. A randomly generated initial population of genomes is used as the starting point of the GA evolution mechanism. In each iteration, the fitness value of every individual is computed with the fitness function, and the best overall fitness value is determined by comparing the individuals in the current population. In this study, random forest was used to classify with the selected features, and the fitness function of the suggested framework is calculated from these classification results. Genotypes represent the feature vectors of the original dataset, and the chromosome phenotype represents a mask over the feature vector. A phenotype bit labeled "0" stands for a feature that is removed, whereas a bit labeled "1" stands for a feature that is selected; thus "0" marks a less significant feature and "1" a highly significant one. Each genotype therefore produces a feature subset through its phenotype. This subset forms the training set of the suggested framework, and the operating basis of the genetic process is shown in Figure 4.
Figure 4. GA operation
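The sketch below illustrates this GA-over-feature-mask scheme with a cross-validated random forest as the fitness function. The population size, generation count, crossover scheme, and mutation rate are illustrative assumptions rather than the configuration used in the study; X and y are the data from the earlier sketches.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_features = X.shape[1]

def fitness(mask):
    """Cross-validated random forest accuracy on the selected feature subset."""
    if mask.sum() == 0:            # an empty mask selects no features
        return 0.0
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    return cross_val_score(clf, X.iloc[:, mask.astype(bool)], y, cv=5).mean()

# Initial population: random binary genomes, one bit per feature
# ("1" keeps the feature, "0" drops it).
pop = rng.integers(0, 2, size=(20, n_features))

for generation in range(10):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]          # keep the fitter half
    children = []
    while len(children) < len(pop) - len(parents):
        p1, p2 = parents[rng.integers(len(parents), size=2)]
        point = rng.integers(1, n_features)          # single-point crossover
        child = np.concatenate([p1[:point], p2[point:]])
        flip = rng.random(n_features) < 0.05         # bit-flip mutation
        child[flip] = 1 - child[flip]
        children.append(child)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", list(X.columns[best.astype(bool)]))
```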
2.7. Classification
The classification stage classifies the quality of the "heart failure clinical record dataset". In this study, validation was first carried out using cross validation, which splits the data into training and testing folds to determine which model has the best level of accuracy. The model selected by cross validation is then validated using split validation. This model is later optimized using genetic algorithm feature selection.
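A sketch of this two-stage evaluation is shown below; the fold count, the list of split ratios, and the use of stratification are assumptions for illustration, not the exact settings reported here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score

# Model comparison with cross-validation (fold count assumed to be 10),
# followed by split validation at several assumed train ratios.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
cv_acc = cross_val_score(rf, X, y, cv=10, scoring="accuracy").mean()
print(f"10-fold CV accuracy: {cv_acc:.4f}")

for train_ratio in (0.5, 0.6, 0.7, 0.8, 0.9):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_ratio, stratify=y, random_state=42)
    acc = accuracy_score(y_te, rf.fit(X_tr, y_tr).predict(X_te))
    print(f"split ratio {train_ratio}: accuracy {acc:.4f}")
```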
The area under the curve (AUC) can be interpreted as the probability that, if one positive and one negative example are chosen at random, the classification method assigns a higher score to the positive example than to the negative one. A higher AUC value therefore indicates a better classification method [26]. The AUC can be estimated from the confusion matrix with the following formula [27]:
$$\mathrm{AUC} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$
The AUC value always lies in the range 0 to 1, because it is part of a unit square whose x-axis and y-axis values run from 0 to 1. Values above 0.5 are considered meaningful, because a random prediction produces the diagonal line between (0,0) and (1,1), which has an area of 0.5. The quality of classification accuracy of diagnostic tests according to AUC values is shown in Table 3 [26].
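For illustration, the formula can be evaluated directly from confusion matrix counts; the counts below are placeholders, not results from this study.

```python
# AUC estimate from confusion matrix counts, following the formula above.
# The counts are illustrative placeholders, not values reported in the paper.
TP, FN, TN, FP = 80, 16, 190, 13
auc = 0.5 * (TP / (TP + FN) + TN / (TN + FP))
print(f"AUC = {auc:.4f}")
```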
After obtaining the best model, the next step is to apply feature optimization. In this study, five experiments were carried out with different split validation ratios; in each experiment, the best algorithm from the comparison, random forest with genetic algorithm feature optimization, was applied. The experimental results can be seen in Table 6. The plain random forest algorithm yields an average accuracy of 82.55%, with the best results, 83.33%, obtained at split validation ratios of 0.7 and 0.9. With the feature optimization approach the average accuracy rises to 93.36%, and the best result, 100%, was obtained at a split validation ratio of 0.9. It is clear from the results in Table 6 that the random forest classification performs better than before: the average improvement is 10.812%, and the biggest improvement, 16.670%, occurs at a split validation ratio of 0.9. Furthermore, a paired two-sample t-test for means was performed to determine whether the technique employed significantly increases the performance of the random forest algorithm [28]. The paired t-test yields a P value of 0.00695; since this is below 0.05, the application of genetic algorithms for feature selection in the random forest algorithm significantly improves accuracy. The results of the t-test can be seen in Table 7.
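Such a test can be reproduced, for example, with SciPy's paired t-test. In the sketch below the accuracy lists are placeholders standing in for the five paired runs summarized in Table 6, not the reported values themselves.

```python
from scipy import stats

# Paired two-sample t-test: plain random forest vs the GA-optimized variant
# across five split-validation runs. The numbers are illustrative placeholders,
# not the values reported in Table 6.
rf_acc    = [0.8133, 0.8250, 0.8333, 0.8267, 0.8333]
rf_ga_acc = [0.9067, 0.9250, 0.9333, 0.9400, 1.0000]

t_stat, p_value = stats.ttest_rel(rf_ga_acc, rf_acc)
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")   # p < 0.05 indicates a significant gain
```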
According to the results of the t-test significance test, the genetic algorithm performs excellently in improving the classification algorithm through the feature selection stage. The random forest algorithm was subjected to a further experiment in which selection methods other than the genetic algorithm, such as greedy forward selection and greedy backward selection, were used, in order to ascertain whether the genetic algorithm is the optimal algorithm for feature selection. Table 8 displays the outcomes of this experiment. The experimental results in Table 8 show that the genetic algorithm is the feature selection algorithm that produces the best classification performance compared with the other feature selection algorithms. The average accuracy of each feature selection algorithm is 90.18% for forward selection, 89.56% for backward selection, 90.52% for greedy forward selection, and 88.32% for greedy backward selection, while the average accuracy of the genetic algorithm is 93.36%. Based on all the experimental results obtained, it can be concluded that the genetic algorithm succeeds in improving the performance of the random forest algorithm for the classification of heart disease through feature selection.
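The greedy forward and backward selection baselines mentioned above can be reproduced, for instance, with scikit-learn's SequentialFeatureSelector wrapped around a random forest. The sketch below is illustrative only; the number of retained features, the fold count, and the hyperparameters are assumptions rather than the settings behind Table 8.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

# Greedy forward/backward selection baselines around random forest.
# X, y are the data from the earlier sketches; settings are assumptions.
rf = RandomForestClassifier(n_estimators=100, random_state=42)

for direction in ("forward", "backward"):
    selector = SequentialFeatureSelector(
        rf, n_features_to_select=6, direction=direction,
        scoring="accuracy", cv=5)
    selector.fit(X, y)
    kept = selector.get_support()
    acc = cross_val_score(rf, X.loc[:, kept], y, cv=5).mean()
    print(direction, list(X.columns[kept]), f"accuracy {acc:.4f}")
```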
In previous studies, several classifications of heart disease have been obtained using various algorithms and classification methods. A comparison of the results of this study with the results of other studies can be seen in Table 9.
The data mining in this study is not based on extremely large amounts of data, but it illustrates a field that is likely to keep developing and to have an impact on heart failure and many other aspects of health. Based on the tests performed on the "heart failure clinical records dataset," applying the genetic algorithm for feature selection in the random forest algorithm achieves a good accuracy of 93.36%, making it suitable for use by experts in the medical field. For programmers, it can also serve as a reference method that can be implemented in a program related to heart failure. Applying this knowledge to the prevention of diseases, especially heart failure, and to the promotion of health can have significant and potentially highly favorable effects.
4. CONCLUSION
This study has measured the accuracy of the random forest algorithm on the "heart failure clinical record dataset" with genetic algorithm feature selection, achieving an accuracy of 93.36%, higher than the other algorithms. The genetic algorithm was applied to feature selection together with the random forest algorithm to improve classification accuracy on the heart failure clinical record dataset. In five test experiments using split validation with varying ratios, the genetic algorithm proved effective in increasing accuracy significantly. The genetic algorithm was also compared with other feature selection methods such as forward selection, backward selection, greedy forward
selection, and greedy backward selection to determine which feature selection method best increases the accuracy on the heart failure clinical record dataset; the genetic algorithm proved to have the best performance among these selection methods. In this study, the genetic algorithm was applied for feature selection together with the random forest algorithm with the aim of improving the performance of heart failure classification. Future work to improve this research includes using other parameter optimization algorithms or other optimization features such as feature weighting or feature generation with other classification algorithms such as neural networks, decision trees, or extreme gradient boosting (XGBoost).
REFERENCES
[1] O. N. Emuoyibofarhe, S. Adebayo, A. Ibitoye, M. O. Ayomide, and A. Taye, “Predictive system for heart disease using a machine
learning trained model,” International Journal of Computer (IJC), vol. 34, no. 1, pp. 140–152, 2019.
[2] D. Chicco and G. Jurman, “Machine learning can predict survival of patients with heart failure from serum creatinine and ejection
fraction alone,” BMC Medical Informatics and Decision Making, vol. 20, no. 1, 2020, doi: 10.1186/s12911-020-1023-5.
[3] O. O. Oladimeji and O. Oladimeji, “Predicting survival of heart failure patients using classification algorithms,” JITCE (Journal
of Information Technology and Computer Engineering), vol. 4, no. 02, pp. 90–94, 2020, doi: 10.25077/jitce.4.02.90-94.2020.
[4] J. H. Joloudari et al., “Coronary artery disease diagnosis; ranking the significant features using a random trees model,”
International Journal of Environmental Research and Public Health, vol. 17, no. 3, 2020, doi: 10.3390/ijerph17030731.
[5] M. F. Aslan, K. Sabanci, and A. Durdu, “A CNN-based novel solution for determining the survival status of heart failure patients
with clinical record data: numeric to image,” Biomed Signal Process Control, vol. 68, 2021, doi: 10.1016/j.bspc.2021.102716.
[6] M. T. Le, M. T. Vo, L. Mai, and S. V. T. Dao, “Predicting heart failure using deep neural network,” in 2020 International
Conference on Advanced Technologies for Communications (ATC), 2020, pp. 221–225, doi: 10.1109/ATC50776.2020.9255445.
[7] S. F. Weng, J. Reps, J. Kai, J. M. Garibaldi, and N. Qureshi, “Can machine-learning improve cardiovascular risk prediction using
routine clinical data?,” PLoS One, vol. 12, no. 4, 2017, doi: 10.1371/journal.pone.0174944.
[8] S. Perveen, M. Shahbaz, A. Guergachi, and K. Keshavjee, “Performance analysis of data mining classification techniques to
predict diabetes,” Procedia Computer Science, vol. 82, pp. 115–121, 2016, doi: 10.1016/j.procs.2016.04.016.
[9] A. Ishaq et al., “Improving the prediction of heart failure patients’ survival using SMOTE and effective data mining techniques,”
IEEE Access, vol. 9, pp. 39707–39716, 2021, doi: 10.1109/ACCESS.2021.3064084.
[10] M. M. A. Mary and T. L. A. Beena, “Heart disease prediction using machine learning techniques: A survey,” International
Journal for Research in Applied Science and Engineering Technology (IJRASET), vol. 8, no. 10, pp. 441–447, 2020, doi:
10.22214/ijraset.2020.31917.
[11] R. T. Prasetio, A. A. Rismayadi, N. Suryana, and R. Setiady, “Features selection and k-NN parameters optimization based on genetic
algorithm for medical datasets classification,” Heart Disease (SPECTF), pp. 3080–3086, 2020, doi: 10.5220/0009947130803086.
[12] M. Huljanah, Z. Rustam, S. Utama, and T. Siswantining, “Feature selection using random forest classifier for predicting prostate
cancer,” In IOP Conference Series: Materials Science and Engineering, vol. 546, no. 5, 2019, doi: 10.1088/1757-899X/546/5/052031.
[13] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, “Do we need hundreds of classifiers to solve real world
classification problems?,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3133–3181, 2014.
[14] S. García, J. Luengo, and F. Herrera, “Data preprocessing in data mining,” Part of the book series: Intelligent Systems Reference
Library, Springer Cham, 2015, doi: 10.1007/978-3-319-10247-4.
[15] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, “Big data preprocessing: methods and prospects,” Big
Data Analytics, vol. 1, no. 1, pp. 1–22, 2016, doi: 10.1186/s41044-016-0014-0.
[16] T. N. Nuklianggraita, A. Adiwijaya, and A. Aditsania, “On the feature selection of microarray data for cancer detection based on
random forest classifier,” Jurnal Infotel, vol. 12, no. 3, pp. 89–96, 2020, doi: 10.20895/infotel.v12i3.485.
[17] M. J. H. Mughal, “Data mining: Web data mining techniques, tools and algorithms: an overview,” International Journal of
Advanced Computer Science and Applications, vol. 9, no. 6, 2018, doi: 10.14569/IJACSA.2018.090630.
[18] L. Yingchun and Y. Liu, “Random forest algorithm in big data environment,” Computer Modelling and New Technologies, vol.
18, no. 12A, pp. 147–151, 2014.
[19] B. Dai, R. C. Chen, S. Z. Zhu, and W. W. Zhang, “Using random forest algorithm for breast cancer diagnosis,” Proceedings - 2018
International Symposium on Computer, Consumer and Control, IS3C 2018, pp. 449–452, 2019, doi: 10.1109/IS3C.2018.00119.
[20] A. D. Kulkarni and B. Lowe, “Random forest algorithm for land cover classification,” International Journal on Recent and
Innovation Trends in Computing and Communication, vol. 4, no. 3, pp. 58–63, 2016.
[21] J. Han, M. Kamber, and J. Pei, Data mining: concepts and techniques, USA: Elsevier Science Ltd, 2011.
[22] E. Goel and E. Abhilasha, “Random forest: A review,” International Journal of Advanced Research in
Computer Science, vol. 7, no. 1, pp. 251–257, 2017, doi: 10.23956/ijarcsse/V7I1/01113.
[23] D. Riana, Y. Ramdhani, R. T. Prasetio, and A. N. Hidayanto, “Improving hierarchical decision approach for single image
classification of pap smear,” International Journal of Electrical and Computer Engineering (IJECE), vol. 8, no. 6, pp. 5415-5424,
2018, doi: 10.11591/ijece.v8i6.pp5415-5424.
[24] C. B. Gokulnath and S. P. Shantharajah, “An optimized feature selection based on genetic approach and support vector machine
for heart disease,” Cluster Computing, vol. 22, pp. 14777–14787, 2019, doi: 10.1007/s10586-018-2416-4.
[25] A. M. Hay, “The derivation of global estimates from a confusion matrix,” International Journal of Remote Sensing, vol. 9, no. 8,
pp. 1395–1398, 1988, doi: 10.1080/01431168808954945.
[26] F. Gorunescu, Data Mining: Concepts, models and techniques, Springer Science and Business Media, vol. 12, 2011.
[27] M. Sokolova and G. Lapalme, “A systematic analysis of performance measures for classification tasks,” Information Processing
and Management vol. 45, no. 4, pp. 427–437, 2009, doi: 10.1016/j.ipm.2009.03.002.
[28] R. T. Prasetio and D. Riana, “A comparison of classification methods in vertebral column disorder with the application of genetic
algorithm and bagging,” in 2015 4th International Conference on Instrumentation, Communications, Information Technology,
and Biomedical Engineering (ICICI-BME), 2015, pp. 163–168.
[29] F. Novaldy and A. Herliana, “Application of PSO to Naïve Bayes for prediction of life expectancy in heart failure patients,”
Jurnal Responsif: Riset Sains dan Informatika, vol. 3, no. 1, pp. 37–43, 2021, doi: 10.51977/jti.v3i1.396.