Heart Disease Prediction Using Hyper Parameter Optimization (HPO) Tuning
Keywords: Hyper parameter tuning; Heart disease; Grid search; Randomized search; TPOT classifier

Abstract

Coronary artery disease prediction is considered to be one of the most challenging tasks in the health care industry. In our research, we propose a prediction system to detect heart disease. Three hyperparameter optimization (HPO) techniques, Grid Search, Randomized Search and Genetic Programming (TPOT Classifier), were applied to optimize the performance of the Random Forest and XGBoost classifier models. The performance of the two models was compared with existing studies. The models were evaluated on two publicly available datasets, the Cleveland Heart Disease Dataset (CHD) and the Z-Alizadeh Sani dataset. Random Forest with the TPOT Classifier achieved the highest accuracy of 97.52% on the CHD dataset. Random Forest with Randomized Search achieved the highest accuracy of 80.2%, 73.6% and 76.9% for the diagnosis of stenosis of the three vessels LAD, LCX and RCA, respectively, on the Z-Alizadeh Sani dataset. Compared with existing studies on heart disease prediction, the proposed models were found to outperform their results significantly.
1. Introduction

By 2030, deaths due to cardiovascular disease are expected to increase to 23.3 million [1]. The blood vessels of the heart supply oxygen, and when these vessels get blocked or narrowed, it can lead to heart disease or stroke [2]. According to WHO, every year around 12 million people fall victim to heart disease and 80% of them die of the heart ailment [3]. High blood pressure, high cholesterol, stress, tension, alcohol consumption, a sedentary lifestyle, obesity and diabetes are the major factors that affect the heart. These attributes help in the prediction of heart disease. Increased blood pressure thickens the walls of the arteries, which causes blockage and can increase the mortality rate [4]. Early diagnosis of heart disease and proper treatment can reduce the mortality rate of patients. One of the common methods to diagnose the abnormal narrowing of a heart vessel is angiography. Symptoms, examination and ECG features were investigated with SMO, Naïve Bayes and an ensemble method, reaching an accuracy of 88.5% in predicting the presence of CAD [31]. Numerous supervised and unsupervised machine learning algorithms have been applied by researchers in the medical field for the diagnosis and prediction of heart disease [5].

With its ability to detect patterns and turn data into information, data mining serves as a strong base for analysis [6]. Machine learning techniques help to derive useful knowledge for decision making from vast datasets [7]. These algorithms have been broadly used in health care, computer vision, speech recognition, social science, cosmology and education [8]. They provide a variety of algorithms to identify the different patterns in large datasets [9]. ML has been widely used in the health care industry for identifying disease and making effective decisions. It helps to classify patients with significant risk factors [10–12]. Massive amounts of data are collected by the healthcare industry, and ML provides different models to train on and analyze the data quickly [13]. These algorithms search through a large space of solutions and find an optimal solution by training on the dataset. The performance of the models can be examined with performance metrics such as Accuracy, Sensitivity, Specificity, Precision and F1-Score. By tuning the hyperparameters, the best hyperparameters are selected and applied to the ML models, and the performance is improved. Robust models can be built by adjusting the hyperparameters. Overfitting or underfitting can be prevented by tuning the hyperparameters. The major contributions of our proposed model are as follows:

1. The performance of the models is tested on a subset of features selected by the Sequential Forward Selection (SFS) method with 10-fold cross validation for the Cleveland Heart Disease dataset.
* Corresponding author.
E-mail addresses: [email protected] (R. Valarmathi), [email protected] (T. Sheela).
https://doi.org/10.1016/j.bspc.2021.103033
Received 9 March 2021; Received in revised form 12 July 2021; Accepted 30 July 2021
Available online 12 August 2021
1746-8094/© 2021 Elsevier Ltd. All rights reserved.
R. Valarmathi and T. Sheela Biomedical Signal Processing and Control 70 (2021) 103033
2. The SMOTE technique is applied to balance the Z-Alizadeh Sani dataset.
3. Next, the model performance is tuned and tested with Grid Search CV, Randomized Search CV and Genetic Programming (TPOT Classifier) with 10-fold cross validation.
4. Our work suggests the combination of ML model and optimization technique that predicts heart disease with the highest accuracy.

The remainder of this paper is structured as follows: Section 2 covers related work on previous heart disease prediction systems using different machine learning algorithms. Section 3 describes the methodology of the proposed work in more detail. Section 4 describes the experimental results and a comparative analysis of the previous studies and techniques used. Section 5 outlines our findings and future research directions.

2. Literature review

The performance on different datasets was analyzed using Bayesian Optimization based on a Gaussian process [14]. The performance of 6 different machine learning models was examined on the heart dataset, and logistic regression was found to predict heart disease with the highest accuracy [15]. The least significant features were eliminated using backward feature elimination. Association rules were mined using frequent item sets, and a genetic algorithm was applied to predict heart disease [16]. A fitness function was used to remove redundant rules and to optimize the generated association rules.

Three neural network models were used to construct an ensemble model to diagnose heart disease [17]. SAS Enterprise Miner 5.2 was used to evaluate the performance. The number of NNs was increased but no improvement in performance was observed. 270 patient records were used to train and test a Cascaded neural network, a Self Organizing network and a Support Vector Machine with RBF function [18]. A Naïve Bayes machine learning model was developed to predict heart disease [19].

An enhanced SVM classifier was presented to classify linear and nonlinear inputs. PSO was used in feature extraction and Fuzzy C-means Clustering was applied to improve the accuracy [20]. Bhatla and Jyoti [21] employed the Weka tool with different data mining techniques for heart disease prediction, and their observations showed that the neural network gave good results compared to the other data mining techniques. Krishnaiah et al. [22] incorporated a fuzzy approach to remove the uncertainty in the data and applied a KNN classifier to classify heart patients.

Amin et al. [23] identified the risk factors of 50 patients and implemented an integrated model of a genetic algorithm and a neural network to predict the presence of heart disease. Abdeldjouad et al. [24] used two different software tools, Weka and Keel, to build two different models. The first model was built in Weka by applying the PCA feature selection method to extract the significant features, followed by 3 different classification algorithms: Multi-Objective Evolutionary Fuzzy Classifier (MOEFC), Logistic Regression (LR) and Adaptive Boosting (AdaBoostM1). The second model was built in Keel by applying the Wrapper method for feature extraction, followed by Genetic Fuzzy System-LogitBoost (GFS-LB), Fuzzy Unordered Rule Induction Algorithm (FURIA) and Fuzzy Hybrid Genetic Based Machine Learning (FH-GBML) for classification. The two models were completely trained and their performance was evaluated.

Purusothaman and Krishnakumari [25] surveyed different research findings based on single-model and hybrid-model approaches and concluded that hybrid models are better at predicting disease than a single model. Khourdifi et al. [26] optimized KNN, RF, SVM, Naïve Bayes and ANN with a combination of particle swarm optimization and ant colony optimization. Kalaiselvi and Nasira [27] used PSO for extraction of data, and ANFIS with AGKNN for classification. Santhanam and Ephzibah [28] used a genetic algorithm for feature extraction and fuzzy logic for prediction. Mohan et al. [29] developed a hybrid approach of random forest and a linear method for classification of heart disease. Kaan Uyar et al. [30] proposed genetic algorithm based recurrent fuzzy neural networks (RFNN) to classify the data. None of the aforementioned studies implemented hyperparameter optimization (HPO) techniques to boost the accuracy of the heart disease prediction system. Thus, in our proposed model we used HPO techniques to improve the accuracy of the model.

Alizadehsani et al. [32] employed a rule based classifier and a cost sensitive algorithm along with Sequential Minimal Optimization (SMO) to diagnose CAD. Alizadehsani et al. [33] handled data uncertainty. Different evolutionary algorithms were used for feature selection [33-38,65]. Some of the existing studies presented real time predictions and the performance of heart disease detection using hardware. Cardiac arrhythmias (CA) were differentiated based on five characteristics using Multi-Level Support Vector Machine classifiers [66]. A Patient Specific SCAD processor [67], a Smart ECG processor [68] and a Wearable ECG Processor [69] were designed to discriminate the CA in real time. An ECG processor and the STAC algorithm were presented to improve the accuracy of heart rate detection [70].

3. Materials and methods

3.1. Proposed methodology

The original datasets were collected and preprocessed. Relevant features were selected using the sequential forward selection method. The parameters of Random Forest and XGBoost were tuned using the hyperparameter optimization techniques Grid Search, Randomized Search and Genetic Programming (TPOT Classifier). Finally, the models were validated and analyzed to predict heart disease.

In our proposed model, 10-fold cross validation is used to validate the data. Fig. 1 shows the flowchart of the proposed model.

3.2. Dataset description

3.2.1. Cleveland heart disease (CHD)

The Cleveland Heart Disease (CHD) dataset is a heart disease prediction dataset available online in the UCI repository [45]. The actual dataset contains 76 attributes, but most published articles use only 14 of them. The dataset used here contains 13 independent variables and one dependent target variable, a total of 14 columns. The target variable has two classes that represent the presence and absence of heart disease. The dataset has 303 rows and no missing values. The dataset description is given in Table 1.

3.2.2. Z-Alizadeh Sani Dataset

The Z-Alizadeh Sani dataset contains 303 samples and 54 attributes. There are two diagnosis classes that represent normal patients and patients affected with coronary artery disease (CAD). Among the 303 samples, 216 samples indicate normal patients and 87 samples indicate the presence of heart disease [46]. The dataset description is given in Table 2.

3.3. Feature selection

Irrelevant features decrease the performance of the model. In our paper, we have used Sequential Forward Selection with ten-fold cross validation to remove the irrelevant features. Sequential Forward Selection (SFS) first selects the single best feature, then the best pair, then the best triplet of features, and this procedure is continued until n relevant features are selected. Table 2 shows the number of features and the highest accuracy obtained with irrelevant features removed. Random Forest and XGBoost are tested with SFS. Table 3 shows the models and their highest accuracies obtained with
reduced number of features. Random Forest achieved the highest accuracy of 83.96% with 9 relevant features. XGBoost obtained the highest accuracy of 82.7% with 11 relevant features. Selecting relevant features improves the accuracy of the proposed system. Figs. 2 and 3 show the accuracy obtained with Random Forest and XGBoost applied with Sequential Forward Selection (SFS). Tables 4 and 5 show the subsets of features selected with Random Forest and XGBoost.

Fig. 1. The Proposed Model of the Heart Disease Prediction system.

Table 1
Dataset description (CHD).

S.No  Feature   Description
1     age       Age
2     sex       male, female
3     cp        chest pain type
4     trestbps  resting blood pressure
5     chol      serum cholesterol
6     fbs       fasting blood sugar
7     restecg   resting electrocardiographic results
8     thalach   maximum heart rate achieved
9     exang     exercise induced angina
10    oldpeak   ST depression induced by exercise relative to rest
11    slope     slope of the peak exercise ST segment
12    ca        number of major vessels
13    thal      Type of Defect
14    target    Risk, No risk

Table 2
Dataset description (Z-Alizadeh Sani Dataset).

Feature Type             Attribute (Values)
Demographic              Age (30–86)
                         Weight (48–120)
                         Length (140–188)
                         Sex (M, F)
                         Body Mass Index (BMI in 18.1–40.9)
                         Diabetes Mellitus (DM) (Y, N)
                         Hypertension (HTN) (Y, N)
                         Current Smoker (Y, N)
                         Ex-Smoker (Y, N)
                         Family History (FH) (Y, N)
                         Obesity (Y, N)
                         Chronic Renal Failure (CRF) (Y, N)
                         Cerebrovascular Accident (CVA) (Y, N)
                         Airway Disease (Y, N)
                         Thyroid Disease (Y, N)
                         Congestive Heart Failure (CHF) (Y, N)
                         Dyslipidemia (DLP) (Y, N)
Symptom and Examination  Blood Pressure (BP in 90–190)
                         Pulse Rate (PR in 50–110)
                         Edema (Y, N)
                         Weak peripheral pulse (Y, N)
                         Lung Rales (Y, N)
                         Systolic murmur (Y, N)
                         Diastolic murmur (Y, N)
                         Typical Chest Pain (Y, N)
                         Dyspnea (Y, N)
                         Function Class (1, 2, 3, 4)
                         Atypical (Y, N)
                         Nonanginal (Y, N)
                         Exertional CP (Y, N)
                         LowTHAng (low threshold angina) (Y, N)
                         Rhythm (Y, N)
ECG                      Q Wave (0, 1)
                         ST Elevation (0, 1)
                         ST Depression (0, 1)
                         T inversion (0, 1)
                         Left Ventricular Hypertrophy (LVH) (Y, N)
                         Poor R Progression (Y, N)
Laboratory and Echo      Fasting Blood Sugar (FBS in 62–100 mg/dl)
                         Creatinine (Cr in 0.5–2.2 mg/dl)
                         Triglyceride (TG in 37–1050 mg/dl)
                         Low density lipoprotein (LDL in 18–232 mg/dl)
                         High density lipoprotein (HDL in 15.9–111 mg/dl)
                         Blood Urea Nitrogen (BUN in 6–52 mg/dl)
                         Erythrocyte Sedimentation Rate (ESR in 1–90 mm/h)
                         Hemoglobin (HB in 8.9–17.6 g/dl)
                         Potassium (K in 3.0–6.6 mEq/lit)
                         Sodium (Na in 128–156 mEq/lit)
                         White Blood Cells (WBC in 3700–18000 cells/ml)
                         Lymphocyte (Lymph in 7–60%)
                         Neutrophil (Neut in 32–89%)
                         Platelet (PLT in 25–742/ml)
                         Ejection Fraction (EF in 15–60%)
                         Regional Wall Motion Abnormality (RWMA) (0, 1, 2, 3, 4)
                         Valvular Heart Disease (Normal, Mild, Moderate, Severe)

Table 3
Model, number of features and their accuracy using Sequential Forward Selection (SFS).

Model          No. of Features  Accuracy  No. of Features  Accuracy
Random Forest  13               82.68%    9                83.96%
XGBoost        13               79.36%    11               82.7%
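The Sequential Forward Selection procedure of Section 3.3 can be illustrated with a short sketch. It assumes scikit-learn's `SequentialFeatureSelector` (the paper does not state which implementation was used), a synthetic stand-in with the same shape as the CHD data, and a small target of 3 features to keep the run short; the helper name `forward_select` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score


def forward_select(model, X, y, n_features, cv=10):
    """Greedy forward selection: start from no features and repeatedly add
    the feature that gives the best cross-validated accuracy."""
    selector = SequentialFeatureSelector(
        model,
        n_features_to_select=n_features,
        direction="forward",
        scoring="accuracy",
        cv=cv,
    )
    selector.fit(X, y)
    return selector.get_support(indices=True)


# Synthetic stand-in with the CHD shape (303 rows, 13 features); NOT the real data.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=5, random_state=0
)
model = RandomForestClassifier(n_estimators=10, random_state=0)

selected = forward_select(model, X, y, n_features=3)

# Compare 10-fold accuracy with all features against the selected subset.
score_all = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
score_sel = cross_val_score(
    model, X[:, selected], y, cv=10, scoring="accuracy"
).mean()

print("selected feature indices:", selected)
print(f"all features: {score_all:.3f}  selected subset: {score_sel:.3f}")
```

Repeating the call for every subset size from 1 to 13 and keeping the size with the best cross-validated accuracy would mirror the kind of comparison summarised in Table 3; the scores printed here come from synthetic data and say nothing about the reported results.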
Fig. 2. Performance of RF with Sequential Forward Selection (SFS).

3.4.2. Extreme Gradient Boosting (XGBoost)

Extreme Gradient Boosting (XGBoost) is an ensemble technique [40] in which a set of weak learners is combined to improve accuracy. The training speed and learning effect of the XGBoost model have attracted wide attention from the research community. XGBoost is an enhancement of the Gradient Boosting Decision Tree (GBDT) algorithm. It uses the CART model for classification and regression problems. The model is trained by adding a tree and splitting the features in each iteration to grow the tree. At the end of training, a score for each leaf node is obtained by multiplying the weight with the predicted value of the tree. The hyperparameters min_samples_split, min_samples_leaf and max_depth control overfitting. These parameters influence each individual tree in the model. The parameters learning_rate, n_estimators and subsample enhance the boosting operation. These hyperparameters are tuned using the hyperparameter optimization techniques Grid Search CV, Randomized Search CV and Genetic Programming (TPOT Classifier), and the best hyperparameters are selected to improve the performance of the proposed system.
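As a minimal sketch of tuning the tree-level and boosting-level hyperparameters named above with Grid Search CV, the example below uses scikit-learn's `GradientBoostingClassifier`, whose parameter names match those in the text. This is a stand-in chosen so the sketch needs only scikit-learn; it is not the XGBoost library itself, and the grid values and synthetic data are assumptions rather than the authors' settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with the CHD shape (303 rows, 13 features); NOT the real data.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=5, random_state=0
)

# Hypothetical grid: two tree-shape parameters (overfitting control) and
# two boosting parameters, two candidate values each -> 16 combinations.
param_grid = {
    "max_depth": [2, 3],
    "min_samples_leaf": [1, 5],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_grid,
    scoring="accuracy",
    cv=10,  # 10-fold cross validation, as in the paper
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best 10-fold accuracy:", round(search.best_score_, 3))
```

Grid search evaluates every combination, so the cost grows multiplicatively with each added parameter; this is the motivation for sampling-based alternatives such as Randomized Search.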
suitable for high dimensional space. Random search draws its samples from a uniformly distributed search space [41].

The researcher [42] stated that only a few parameters have a large impact on the optimization of the model score. Grid Search spends much of its time exploring unimportant parameters. Random search provided a better choice of hyperparameter combinations than Grid Search. Random search focuses on exploring the hyperparameters that have a greater impact on the model score, and it works better under the assumption that not all parameters are equally important. In random search the experiments are conducted separately, so there is no way to carry the information obtained in one experiment over to the next. These two approaches avoid the model falling into local optima [43]. Their disadvantages are that they are time consuming and not suitable for data with a high dimensional space. Fig. 4 shows the comparison between grid layout and random layout.

The Tree-Based Pipeline Optimization Tool (TPOT) is a Genetic Programming (GP) based AutoML system that optimizes ML models automatically [44]. TPOT uses meta learning techniques to optimize machine learning pipelines using GP primitives to solve a particular problem. In this AutoML tool, feature selection, feature preprocessing, feature construction, model selection and parameter optimization take place automatically. The genetic algorithm finds the best parameters. This AutoML considers multiple machine learning algorithms, multiple preprocessing steps and multiple ways to ensemble.

3.6. Performance metrics

A confusion matrix was employed to evaluate the performance of the two models. True Positive (TP) is the count of predictions that correctly identify the presence of disease. True Negative (TN) is the count of predictions that correctly identify the absence of disease. False Positive (FP) is the count of predictions incorrectly classified as positive (when the actual class was negative). False Negative (FN) is the count of predictions incorrectly classified as negative (when the actual class was positive).

Once the model is trained, the risk of heart disease is predicted and evaluated with ten-fold cross validation. The analysis was done with the performance metrics Accuracy, Specificity, Sensitivity, Precision and ROC-AUC.

$$\text{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP} \tag{1}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{3}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

4. Experimental results

4.1. Experimental setup

The experiments were implemented in Python 3.0 on a DELL machine with an Intel (R) Core (TM) i5-8250U CPU @ 1.60 GHz and 8 GB RAM, running Windows 10.

4.2. Results and analysis

In this study, the effect of hyperparameters on the predictive performance of two machine learning models, Random Forest and XGBoost, was examined. Three hyperparameter optimization methods, Grid Search, Randomized Search and Genetic Programming (TPOT Classifier), were used to optimize the ML models. A comparative analysis of the predictive performance of the 2 algorithms RF and XGBoost with the 3 hyperparameter optimization techniques was carried out in the experiments. Each algorithm is analyzed by selecting different hyperparameters. We compared the hyperparameter optimization results of Randomized Search, Grid Search and Genetic Programming with the results of existing techniques. 80% and 20% of the data were taken for training and testing, respectively. In our study we used 10-fold cross validation. Previous studies demonstrated that 10-fold cross validation provides a generalized model and avoids overfitting [47,48]. In our research, the presence of heart disease is represented as 1 and the absence of heart disease as 0. We implemented the three hyperparameter optimization techniques using the scikit-learn Python library. Scikit-learn comes with built-in functionality for hyperparameter tuning techniques.

4.3. Experimental results with Random Forest model (Cleveland Dataset)

The Random Forest model was evaluated with different hyperparameter values. Table 6 shows the best hyperparameters obtained with the different optimization techniques for Random Forest.

To validate the performance of the model, a confusion matrix is used. The correct and incorrect predictions of the classifier are represented in a 2 × 2 matrix. The confusion matrices for the Random Forest classifier with ten-fold cross validation are depicted in Tables 7–9.

Table 10 shows the performance analysis of Random Forest with the different optimization techniques.

4.4. Result analysis with XGBoost Model (Cleveland Dataset)

The XGBoost model was evaluated with different hyperparameter values. Table 11 shows the best hyperparameters obtained with the different optimization techniques for the XGBoost model.

The confusion matrices for the XGBoost classifier with ten-fold cross validation are depicted in Tables 12–14.

Table 15 shows the experimental results of the Extreme Gradient Boosting (XGBoost) model.
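The evaluation protocol described in Section 4.2 (an 80/20 split, 10-fold cross validation and a hyperparameter search) can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' script: the data are a synthetic stand-in, and the sampling ranges, `n_iter` and random seeds are hypothetical.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (
    RandomizedSearchCV,
    cross_val_predict,
    train_test_split,
)

# Synthetic stand-in with the CHD shape (303 rows, 13 features); NOT the real data.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=5, random_state=0
)

# 80% of the rows for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical sampling ranges; randomized search draws n_iter settings
# from these distributions instead of trying every combination.
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 10),
    "min_samples_split": randint(2, 10),
    "min_samples_leaf": randint(1, 5),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=8,
    scoring="accuracy",
    cv=10,
    random_state=0,
)
search.fit(X_train, y_train)

# Out-of-fold predictions on the training portion give a 2 x 2 confusion
# matrix (rows: actual class, columns: predicted class).
y_oof = cross_val_predict(search.best_estimator_, X_train, y_train, cv=10)
cm = confusion_matrix(y_train, y_oof)

print("best parameters:", search.best_params_)
print(cm)
print("held-out accuracy:", round(search.score(X_test, y_test), 3))
```

With 303 rows, the 80% training portion contains 242 rows, so the entries of a confusion matrix built this way sum to 242.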
Table 6
Best hyperparameters obtained with different optimization techniques for Random Forest for Cleveland Dataset.

Model  Parameters  Grid Search  Randomized Search  Genetic Programming (TPOT)
Table 7
Confusion Matrix for RF Using Grid Search for Cleveland Dataset.

                 Predicted Class
Actual Class     Absent        Present       Actual Value
Absent           98 (88.28%)   8             106
Present          13            123           136
Total predicted  111           131 (93.89%)

Table 8
Confusion Matrix for RF Using Random Search for Cleveland Dataset.

                 Predicted Class
Actual Class     Absent        Present       Actual Value
Absent           103 (92.79%)  4             107
Present          8             127 (96.95%)  135
Total predicted  111           131

Table 9
Confusion Matrix for RF Using Genetic Programming (TPOT Classifier) for Cleveland Dataset.

                 Predicted Class
Actual Class     Absent        Present       Actual Value
Absent           108 (97.29%)  3             111
Present          3             128 (97.70%)  131
Total predicted  111           131

Table 10
Performance analysis of Random Forest with different optimization techniques for Cleveland Dataset.

Table 12
Confusion Matrix for XGBoost Using Grid Search for Cleveland Dataset.

                 Predicted Class
Actual Class     Absent        Present       Actual Value
Absent           92 (82.88%)   14            106
Present          19            117 (89.31%)  136
Total predicted  111           131

Table 13
Confusion Matrix for XGBoost Using Random Search for Cleveland Dataset.

                 Predicted Class
Actual Class     Absent        Present       Actual Value
Absent           100 (90.09%)  8             108
Present          11            123 (93.89%)  134
Total predicted  111           131

Table 14
Confusion Matrix for XGBoost Using Genetic Programming (TPOT Classifier) for Cleveland Dataset.

                 Predicted Class
Actual Class     Absent        Present       Actual Value
Absent           97 (87.38%)   9             108
Present          14            122 (93.12%)  136
Total predicted  111           131

Table 15
Performance analysis of XGBoost Model with different optimization techniques for Cleveland Dataset.

                 Grid Search  Randomized Search  Genetic Programming (TPOT)
AUC-ROC (%)      86.07        91.99              88.21
Accuracy (%)     86.36        92.14              90.50
Sensitivity (%)  82.88        90.09              87.38
Specificity (%)  89.31        93.89              93.12
Precision (%)    87           93                 92
F1-Score (%)     85           91                 89

Table 16
Performance analysis of Random Forest for Z-Alizadeh Sani Dataset.

                   Artery  Accuracy (%)  Sensitivity (%)  Specificity (%)
Grid Search        LAD     78.0          75.6             80.0
                   LCX     70.3          79.6             56.8
                   RCA     79.1          92.20            61.5
Randomized Search  LAD     80.2          75.6             84
                   LCX     73.6          85.1             56.8
                   RCA     76.9          94.2             74.0
TPOT               LAD     75.8          73.2             78
                   LCX     68.1          72.2             62.1
                   RCA     74.7          88.4             56.4
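As a worked check of Eqs. (1)–(4), the sketch below applies them to the four cell counts of Table 9 (Random Forest with the TPOT Classifier). It assumes that "Present" is the positive class and that rows hold the actual class; the function name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and precision from the four
    cells of a 2 x 2 confusion matrix, following Eqs. (1)-(4)."""
    return {
        "accuracy": (tn + tp) / (tn + tp + fn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Cell counts read from Table 9; "Present" is assumed to be the positive class.
metrics = confusion_metrics(tp=128, tn=108, fp=3, fn=3)

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

The accuracy evaluates to 236/242, which rounds to the 97.52% reported for Random Forest with the TPOT Classifier.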
Table 17
Performance of XGBoost for Z-Alizadeh Sani Dataset.

                   Artery  Accuracy  Sensitivity  Specificity
Grid Search        LAD     74.7      78.0         72
                   LCX     65.9      75.9         51.3
                   RCA     70.3      80.76        56.4
Randomized Search  LAD     75.8      75.6         76
                   LCX     71.4      79.6         59.5
                   RCA     78.0      86.5         66.6
TPOT               LAD     74.7      73.1         76
                   LCX     69.2      75.9         59.4
                   RCA     78.0      90.3         61.5

model (RF with Linear) and obtained an accuracy of 88.4%. Haq et al. [57] carried out experiments on combinations of various feature selection methods and different machine learning models. Their research concluded that the hybrid of Relief-based feature selection and logistic regression achieved the highest accuracy of 89% compared to the other models. Saqlain et al. [58] selected the significant features using the Fisher Score algorithm, and the subset of features was given to an SVM and validated. The combination of MFSFSA and SVM obtained an accuracy of 81.19%.

Soni et al. [52] applied association rules to classify the disease and obtained an accuracy of 81.51%. Latha and Jeeva [59] employed a voting model with NB, BN, RF and MLP and found an accuracy of 85.48%. Leema et al. [54] proposed a Computer-Aided Diagnostic system which uses Differential Evolution for global search and Back propagation for local search. The accuracy obtained from the system was 86.6%. Kumari and Godara [53] proposed an SVM model and obtained an accuracy of 84.12%.

Long et al. [51] proposed a chaos firefly algorithm to predict the disease and removed the irrelevant features using rough sets. The highest accuracy obtained from CFARS-AR was 88.3%. Six different machine learning models were compared by Dwivedi et al. [49], and the LR model obtained the highest accuracy of 85% compared to the other models. Amin et al. [56] developed a hybrid voting model with Naïve Bayes and logistic regression. The model was trained with the significant features. The hybrid model showed an accuracy of 87.41%. Ayon et al. [61] made a comparative study of seven machine learning models and found that DNN performed better than all other algorithms. The outcome of the study showed that random forest obtained an accuracy of 87.45% with 10-fold cross validation. Table 18 shows that our proposed models are effective in classifying heart disease when compared to the previous studies.

Table 18
Comparison of the Model from previous studies for Cleveland Dataset.

Authors                 Method                                               Results
Chadha and Mayank [50]  DT                                                   88.03%
                        NB                                                   85.86%
Long et al. [51]        CFARS-AR                                             88.3%
Soni et al. [52]        Association rules                                    81.51%
Kumari and Godara [53]  SVM                                                  84.12%
Leema et al. [54]       DE + BP                                              86.6%
Verma et al. [55]       CFS + PSO + K-means + MLP                            90.28%
Amin et al. [56]        Hybrid (NB + LR)                                     87.41%
A.K. Dwivedi [49]       SVM                                                  82.00%
                        LR                                                   85.00%
Haq et al. [57]         Relief-based feature selection + Logistic regression 89%
Saqlain et al. [58]     MFSFSA + Support Vector Machine                      81.19%
Latha and Jeeva [59]    Naïve Bayes + BN + Random Forest + MLP               85.48%
Mohan et al. [60]       RF + Linear Model                                    88.4%
Ayon et al. [61]        RF                                                   87.45%
Proposed Model          RF with Grid Search                                  91.32%
                        RF with Randomized Search                            95.04%
                        RF with TPOT Classifier                              97.52%
                        XGBoost with Grid Search                             86.36%
                        XGBoost with Randomized Search                       92.14%
                        XGBoost with TPOT Classifier                         90.50%

Babaoglu et al. [62] employed ANN and achieved an accuracy of 73%, 64.8% and 69.4%. Alizadehsani et al. [63] studied the presence of CAD by applying the feature selection criteria information gain and Gini index to extract the effective features, and evaluated the performance with the C4.5 and Bagging algorithms. Alizadehsani et al. [64] examined the demographic, examination and ECG features and obtained an accuracy of 74.20%, 63.76% and 68.33% for LAD, LCX and RCA, respectively. Table 19 shows that our proposed models are effective when compared to the previous studies.

Table 19
Comparison of the model from previous studies for Z-Alizadeh Sani Dataset.

                                                                               Accuracy (%)
Authors                                   Method                               LAD   LCX   RCA
Babaoglu et al. [62]                      ANN                                  73.0  64.8  69.4
Alizadehsani, Habibi, Alizadehsani        Bagging                              79.5  65.1  68.0
et al. [63]
Alizadehsani, Habibi, Bahadorian          Decision Tree                        74.2  63.8  68.3
et al. [64]
Proposed Method                           Random Forest with Randomized Search 80.2  73.6  76.9

4.8. Performance analysis with ROC_AUC for Cleveland Dataset

The receiver operating characteristic (ROC) is a visualization curve that compares the "true positive rate" and the "false positive rate". The area under the receiver operating characteristic curve (AUC) is compared for both models for all three hyperparameter optimization techniques. The best model is the one whose AUC value is closest or equal to 1. Fig. 5 shows the AUC-ROC curve for the experiments conducted with Grid Search. When experiments were conducted with Grid Search, Random Forest and XGBoost achieved an AUC of 91.09% and 86.09%, respectively.

The AUC-ROC curve for the experiments conducted with Randomized Search is illustrated in Fig. 6. When experiments were conducted with Randomized Search, Random Forest and XGBoost achieved an AUC of 94.80% and 91.99%, respectively.

Fig. 7 shows the AUC-ROC curve for the experiments conducted with Genetic Programming (TPOT Classifier). When experiments were conducted with Genetic Programming (TPOT Classifier), Random Forest and XGBoost achieved an AUC of 97.50% and 88.21%, respectively.

4.9. Performance Analysis with ROC_AUC for Z-Alizadeh Sani Dataset

Figs. 8–10 show the AUC-ROC curves for the experiments conducted with the Z-Alizadeh Sani Dataset. The diagnosis of stenosis of the LAD vessel achieved 77.8%, 79.8% and 75.5% with Random Forest and 75.0%, 75.8% and 74.5% with XGBoost. The diagnosis of stenosis of the LCX vessel achieved 68.1%, 71% and 67.2% with Random Forest and 63.6%, 69.5% and 67.7% with XGBoost. The diagnosis of stenosis of the RCA vessel achieved 76.9%, 74.0% and 72.4% with Random Forest and 68.5%, 76.6% and 75.9% with XGBoost.

5. Conclusion

Every year, many deaths occur due to heart disease. Heart disease, if predicted early, can be treated in time and many lives saved. In this study, Sequential Forward Selection is applied to remove the insignificant features. Removing the irrelevant features improved the performance significantly. The machine learning models Random Forest and XGBoost were tuned and tested with three different hyperparameter tuning techniques, and their accuracies were compared to the existing techniques. By tuning the parameters of Random Forest with Grid Search, Randomized Search and Genetic Programming we obtained
better results with 91.32%, 95.04% and 97.52%, respectively, compared to XGBoost and other state-of-the-art algorithms for CHD. The proposed approach shows the highest accuracy for Random Forest with Randomized Search, with 80.2%, 73.6% and 76.9% for the diagnosis of abnormality in the LAD, LCX and RCA vessels, respectively. In the future, heart disease predictions can be done in real time and the performance can be evaluated in hardware.

Fig. 5. ROC curve with Grid Search.
Fig. 6. ROC curve with Randomized Search.
Fig. 7. ROC curve with Genetic Programming (TPOT Classifier).
Fig. 8. ROC curve for LAD using Random Forest and XGBoost.
Fig. 9. ROC curve for LCX using Random Forest and XGBoost.
Fig. 10. ROC curve for RCA using Random Forest and XGBoost.
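The ROC-AUC comparisons of Sections 4.8 and 4.9, which Figs. 5–10 plot, could be computed along the following lines. This is a sketch under stated assumptions: the data are synthetic, default hyperparameters are used instead of tuned ones, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that only scikit-learn is required.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in with the CHD shape (303 rows, 13 features); NOT the real data.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=5, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient boosting (stand-in for XGBoost)": GradientBoostingClassifier(
        random_state=0
    ),
}

aucs = {}
for name, model in models.items():
    # Out-of-fold probability of the positive class from 10-fold cross validation.
    proba = cross_val_predict(model, X, y, cv=10, method="predict_proba")[:, 1]
    # False positive rate and true positive rate at each threshold: the
    # points that would be plotted as the ROC curve.
    fpr, tpr, _ = roc_curve(y, proba)
    aucs[name] = roc_auc_score(y, proba)
    print(f"{name}: AUC = {aucs[name]:.3f}")
```

An AUC close to 1 indicates that the model ranks diseased cases above healthy ones at almost every threshold, which is the criterion the paper uses to compare the two models.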
Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] C.D. Mathers, D. Loncar, Projections of global mortality and burden of disease from 2002 to 2030, PLoS Med. 3 (11) (2006).
[2] M. Akhil, Dr. Priti Chandra, Dr. B.L Deekshatulu, “Heart Disease Prediction System using Associative Classification and Genetic Algorithm”, International Conference on Emerging Trends in Electrical, Electronics and Communication Technologies-ICECIT, 2012.
[3] Ibrahim Umar Said, Abdullahi Haruna Adam, Dr. Ahmed BaitaGarko, “Association Rule Mining On Medical Data To Predict Heart Disease”, International Journal of Science Technology and Management, August 2015, pp. 26–35.
[4] A. Chauhan, A. Jain, P. Sharma, V. Deep, Heart disease prediction using evolutionary rule learning, in: 2018 4th International Conference on Computational Intelligence & Communication Technology (CICT), Ghaziabad, 2018, pp. 1–4.
[5] Ilaria Castelli, Edmondo Trentin, Combination of supervised and unsupervised learning for training the activation functions of neural networks, Pattern Recogn. Lett. 37 (2014) 178–191.
[6] R. Safdari, T. Samad-Soltani, M. GhaziSaeedi, M. Zolnoori, Evaluation of classification algorithms vs knowledge-based methods for differential diagnosis of asthma in Iranian patients, Int. J. Inform. Syst. Serv. Sect. 10 (2) (2018) 22–26.
[7] F.S. Alotaibi, Implementation of machine learning model to predict heart failure disease, Int. J. Adv. Comput. Sci. Appl. 10 (6) (2019) 261–268.
[8] M.I. Jordan, T.M. Mitchell, Machine learning: trends, perspectives, and prospects, Science 349 (6245) (2015) 255–260, https://doi.org/10.1126/science.aaa8415.
[9] M. Motwani, D. Dey, D.S. Berman, G. Germano, S. Achenbach, M.H. Al-Mallah, D. Andreini, M.J. Budoff, F. Cademartiri, T.Q. Callister, “Machine learning for prediction of all-cause mortality in patients with suspected coronary artery disease: a 5-year multicentre prospective registry analysis”, Eur. Heart J. 38(7), pp. 500–507, 2016.
[10] A. Sani, “Machine Learning for Decision Making”, Université de Lille 1, 2015.
[11] W. Raghupathi, V. Raghupathi, “Big data analytics in healthcare: promise and potential”, Health Inf. Sci. Syst. 2(3), 2014.
[12] P. Groves, B. Kayyali, D. Knott, S.V. Kuiken, “The ‘Big Data’ Revolution in Healthcare: Accelerating Value and Innovation”, 2016.
[13] T. Condie, P. Mineiro, N. Polyzotis, M. Weimer, Machine learning on big data, in: 2013 IEEE 29th International Conference on Data Engineering (ICDE), IEEE, 2013, pp. 1242–1244.
[14] Jia Wu, Xiu-Yun Chen, Hao Zhang, Li-Dong Xiong, Hang Lei, Si-Hao Deng, Hyperparameter optimization for machine learning models based on Bayesian optimization, J. Electron. Sci. Technol. 17 (1) (2019).
[15] Debabrata Swain, Preeti Ballal, Vishal Dolase, Banchhanidhi Dash, Jayasri Santhappan, An efficient heart disease prediction system using machine learning, Advances in Intelligent Systems and Computing, Springer Nature Singapore Pvt Ltd, 2020.
[16] M.A. Jabbar, B.L. Deekshatulu, P. Chandra, “An Evolutionary Algorithm for Heart Disease Prediction”, in: K.R. Venugopal, L.M. Patnaik (eds.) Wireless Networks and Computational Intelligence. ICIP 2012. Communications in Computer and Information Science, vol 292. Springer, Berlin, Heidelberg. DOI:10.1007/978-3-642-31686-9_44.
[17] R. Das, I. Turkoglu, A. Sengur, Effective diagnosis of heart disease through neural networks ensembles, Expert Syst. Appl. 36 (4) (2009) 7675–7680.
[18] R. Chitra, V. Seenivasagam, Heart disease prediction system using supervised learning classifier, Bonfring Int. J. Software Eng. Soft Comput. 3 (1) (2013) 1.
[19] K. Vembandasamy, R. Sasipriya, E. Deepa, Heart diseases detection using Naive Bayes Algorithm, Int. J. Innov. Sci. Eng. Technol. 2 (9) (2015).
[20] R. Kavitha, T. Christopher, “An effective classification of heart rate data using PSO-FCM clustering and enhanced support vector machine,” Indian Journal of Science and Technology, 8(30), 2015.
[21] N. Bhatla, K. Jyoti, An analysis of heart disease prediction using different data mining techniques, Int. J. Eng. Res. Technol. 1 (8) (2012) 1–4.
[22] V. Krishnaiah, G. Narsimha, N.S. Chandra, “Heart Disease Prediction System Using Data Mining Technique by Fuzzy K-NN Approach”, in: Satapathy S., Govardhan A., Raju K., Mandal J. (eds) Emerging ICT for Bridging the Future - Proceedings of the 49th Annual Convention of the Computer Society of India (CSI) Volume 1. Advances in Intelligent Systems and Computing, vol 337. Springer, Cham, 2015. DOI:10.1007/978-3-319-13728-5_42.
[23] S. U. Amin, K. Agarwal and R. Beg, “Genetic neural network based data mining in prediction of heart disease using risk factors,” in: 2013 IEEE Conference on Information & Communication Technologies, Thuckalay, Tamil Nadu, India, 2013, pp. 1227–1231, doi: 10.1109/CICT.2013.6558288.
[24] F.Z. Abdeldjouad, M. Brahami, N. Matta, “A Hybrid Approach for Heart Disease Diagnosis and Prediction Using Machine Learning Techniques”, in: M. Jmaiel, M. Mokhtari, B. Abdulrazak, H. Aloulou, S. Kallel (eds.) The Impact of Digital Technologies on Public Health in Developed and Developing Countries. ICOST 2020. Lecture Notes in Computer Science, vol 12157. Springer, Cham, 2020. DOI:10.1007/978-3-030-51517-1_26.
[25] G. Purusothaman, P. Krishnakumari, “A survey of data mining techniques on risk prediction: heart disease”, Indian J. Sci. Technol. 8(12), 2015.
[26] Youness Khourdifi, Mohamed Bahaj, Heart disease prediction and classification using machine learning algorithms optimized by particle swarm optimization and ant colony optimization, Int. J. Intell. Eng. Syst. 12 (1) (2018) 242–252.
[27] C. Kalaiselvi, G. Nasira, Prediction of heart diseases and cancer in diabetic patients using data mining techniques, Indian J. Sci. Technol. 8(14), 2015.
[28] T. Santhanam, E. Ephzibah, Heart disease prediction using hybrid genetic fuzzy model, Indian J. Sci. Technol. 8 (9) (2015) 797–803.
[29] Senthilkumar Mohan, Chandrasegar Thirumalai, Gautam Srivastava, Effective heart disease prediction using hybrid machine learning techniques, IEEE Access 7 (2019) 81542–81554, https://doi.org/10.1109/ACCESS.2019.2923707.
[30] Kaan Uyar, Ahmet İlhan, Diagnosis of heart disease using genetic algorithm based trained recurrent fuzzy neural networks, Procedia Comput. Sci. 120 (2017) 588–593.
[31] Roohallah Alizadehsani, Jafar Habibi, Mohammad Javad Hosseini, Reihane Boghrati, Asma Ghandeharioun, Behdad Bahadorian, et al., Diagnosis of Coronary Artery Disease Using Data Mining Techniques Based on Symptoms and ECG Features, European Journal of Scientific Research 82 (4) (2012) 542–553.
[32] Alizadehsani Roohallah et al. “Exerting Cost-Sensitive and Feature Creation Algorithms for Coronary Artery Disease Diagnosis.” IJKDB 3.1 (2012): 59–79. Web. 1 May. 2021. doi:10.4018/jkdb.2012010104.
[33] Roohallah Alizadehsani, Mohamad Roshanzamir, Moloud Abdar, Adham Beykikhoshk, Mohammad Hossein Zangooei, Abbas Khosravi, Saeid Nahavandi, Ru San Tan, U. Rajendra Acharya, Model uncertainty quantification for diagnosis of each main coronary artery stenosis, Soft Computing 24 (13) (2020) 10149–10160, https://doi.org/10.1007/s00500-019-04531-0.
[34] Roohallah Alizadehsani, Jafar Habibi, Mohammad Javad Hosseini, Hoda Mashayekhi, Reihane Boghrati, Asma Ghandeharioun, Behdad Bahadorian, Zahra Alizadeh Sani, A data mining approach for diagnosis of coronary artery disease, Comput. Methods Programs Biomed. 111 (1) (2013) 52–61, https://doi.org/10.1016/j.cmpb.2013.03.004.
[35] R. Alizadehsani, M. Roshanzamir, M. Abdar, A. Beykikhoshk, A. Khosravi, M. Panahiazar, A. Koohestani, F. Khozeimeh, S. Nahavandi, N. Sarrafzadegan, A database for using machine learning and data mining techniques for coronary artery disease diagnosis, Scientific Data 6 (1) (2019), https://doi.org/10.1038/s41597-019-0206-3.
[36] Roohallah Alizadehsani, Moloud Abdar, Mohamad Roshanzamir, Abbas Khosravi, Parham M. Kebria, Fahime Khozeimeh, Saeid Nahavandi, Nizal Sarrafzadegan, U. Rajendra Acharya, Machine learning-based coronary artery disease diagnosis: a comprehensive review, Comput. Biol. Med. 111 (2019) 103346, https://doi.org/10.1016/j.compbiomed.2019.103346.
[37] Roohallah Alizadehsani, Mohammad Hossein Zangooei, Mohammad Javad Hosseini, Jafar Habibi, Abbas Khosravi, Mohamad Roshanzamir, Fahime Khozeimeh, Nizal Sarrafzadegan, Saeid Nahavandi, Coronary artery disease detection using computational intelligence methods, Knowledge-Based Syst. 109 (2016) 187–197, https://doi.org/10.1016/j.knosys.2016.07.004.
[38] R. Alizadehsani, M.J. Hosseini, Z.A. Sani, A. Ghandeharioun, R. Boghrati, Diagnosis of coronary artery disease using cost-sensitive algorithms, in: Paper presented at the 2012 IEEE 12th International Conference on Data Mining Workshops, Brussels, Belgium, 2012.
[39] L. Breiman, Random Forests, Machine Learning 45 (2001) 5–32.
[40] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in: Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, San Francisco, CA, USA, Aug. 2016, pp. 785–794, doi:10.1145/2939672.2939785.
[41] Douglas C Montgomery, Design and Analysis of Experiments, John Wiley & Sons, 2017.
[42] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, J. Mach. Learn. Res. 13 (2012) 281–305.
[43] R. Liessner, J. Schmitt, A. Dietermann, and B. Bäker, “Hyperparameter Optimization for Deep Reinforcement Learning in Vehicle Energy Management”, in: Proceedings of the 11th International Conference on Agents and Artificial Intelligence (ICAART 2019), pages 134–144, ISBN: 978-989-758-350-6.
[44] R.S. Olson, J.H. Moore, TPOT: A Tree-Based Pipeline Optimization Tool for Automating Machine Learning, in: F. Hutter, L. Kotthoff, J. Vanschoren (Eds.), Automated Machine Learning. The Springer Series on Challenges in Machine Learning, Springer, Cham, 2019, https://doi.org/10.1007/978-3-030-05318-5_8.
[45] https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names.
[46] http://archive.ics.uci.edu/ml/datasets/extention+of+Z-Alizadeh+sani+dataset.
[47] R. Kohavi, D. Wolpert, Bias plus variance decomposition for zero-one loss functions, in: Proc. 13th Int. Conf. Mach. Learn., San Francisco, CA, USA, 1996, pp. 275–283.
[48] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier, San Diego, CA, USA, 2012.
[49] Ashok Kumar Dwivedi, Performance evaluation of different machine learning techniques for prediction of heart disease, Neural Comput. Appl. 29 (10) (2018) 685–693.
[50] Ritika Chadha, Shubhankar Mayank, Prediction of heart disease using data mining techniques, CSI Trans. ICT 4 (2-4) (2016) 193–198.
[51] N.C. Long, P. Meesad, H. Unger, A highly accurate firefly-based algorithm for heart disease prediction, Expert. Syst. Appl. 42 (2015) 8221–8231.
[52] J. Soni, U. Ansari, D. Sharma, S. Soni, Intelligent and effective heart disease prediction system using weighted associative classifiers, Int. J. Comput. Sci. Eng. 3 (6) (2011) 2385–2392.
[53] M. Kumari, S. Godara, Comparative study of data mining classification methods in cardiovascular disease prediction, IJCST 2 (2) (2011) 304–308.
[54] N. Leema, H. Khanna Nehemiah, A. Kannan, Neural network classifier optimization using differential evolution with global information and back propagation algorithm for clinical datasets, Appl. Soft. Comput. 49 (2016) 834–844.
[55] L. Verma, S. Srivastava, P.C. Negi, A hybrid data mining model to predict coronary artery disease cases using non-invasive clinical data, J. Med. Syst. 40 (7) (2016) 178, https://doi.org/10.1007/s10916-016-0536-z.
[56] M. S. Amin, Y. K. Chiam, and K. D. Varathan, “Identification of significant features and data mining techniques in predicting heart disease,” Telematics Inform., vol. 36, pp. 82–93, Mar. 2019, doi: 10.1016/j.tele.2018.11.007.
[57] A. U. Haq, J. P. Li, M. H. Memon, S. Nazir, and R. Sun, “A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms,” Mobile Inf. Syst., vol. 2018, pp. 1–21, Dec. 2018, doi: 10.1155/2018/3860146.
[58] S. M. Saqlain, M. Sher, F. A. Shah, I. Khan, M. U. Ashraf, M. Awais, and A. Ghani, “Fisher score and Matthews correlation coefficient-based feature subset selection for heart disease diagnosis using support vector machines,” Knowl. Inf. Syst., 58(1), pp. 139–167, Jan. 2019, doi:10.1007/s10115-018-1185-y.
[59] C. B. C. Latha and S. C. Jeeva, “Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques,” Inform. Med. Unlocked, 16, Jan. 2019, Art. no. 100203, doi:10.1016/j.imu.2019.100203.
[60] Senthilkumar Mohan, Chandrasegar Thirumalai, Gautam Srivastava, Effective heart disease prediction using hybrid machine learning techniques, IEEE Access 7 (2019) 81542–81554, https://doi.org/10.1109/ACCESS.2019.2923707.
[61] Safial Islam Ayon, Md. Milon Islam, Md. Rahat Hossain, Coronary Artery heart disease prediction: a comparative study of computational intelligence techniques, IETE J. Res. (2020), https://doi.org/10.1080/03772063.2020.1713916.
[62] I. Babaoglu, O. Fındık, M. Bayrak, Effects of principle component analysis on assessment of coronary artery diseases using support vector machine, Expert Syst. Appl. 37 (3) (2010) 2182–2185, https://doi.org/10.1016/j.eswa.2009.07.055.
[63] Roohallah Alizadehsani, Jafar Habibi, Zahra Alizadeh Sani, Hoda Mashayekhi, Reihane Boghrati, Asma Ghandeharioun, Fahime Khozeimeh, Fariba Alizadeh-Sani, Diagnosing coronary artery disease via data mining algorithms by considering laboratory and echocardiography features, Res Cardiovasc Med. 2 (3) (2013) 133, https://doi.org/10.5812/cardiovascmed.10888.
[64] Roohallah Alizadehsani, Jafar Habibi, Behdad Bahadorian, Hoda Mashayekhi, Asma Ghandeharioun, Reihane Boghrati, et al., Diagnosis of coronary arteries stenosis using data mining, J. Medical Signals Sens. 2 (3) (2012) 57–65.
[65] Seyed Mohammad Jafar Jalali, Abbas Khosravi, Roohallah Alizadehsani, Syed Moshfeq Salaken, Parham Mohsenzadeh Kebria, Rishi Puri, Saeid Nahavandi, Parsimonious evolutionary-based model development for detecting artery disease, in: ICIT 2019: Proceedings of the IEEE International Conference on Industrial Technology, IEEE, Piscataway, N.J., 2019, pp. 800–805, doi: 10.1109/ICIT.2019.8755107.
[66] M. A. Sohail, Z. Taufique, S. M. Abubakar, W. Saadeh and M. A. Bin Altaf, “An ECG Processor for the Detection of Eight Cardiac Arrhythmias with Minimum False Alarms,” in: 2019 IEEE Biomedical Circuits and Systems Conference (BioCAS), 2019, pp. 1–4, doi: 10.1109/BIOCAS.2019.8919053.
[67] S.M. Abubakar, M. Rizwan Khan, W. Saadeh, M. A. Bin Altaf, “A Wearable Auto-Patient Adaptive ECG Processor for Shockable Cardiac Arrhythmia,” in: 2018 IEEE Asian Solid-State Circuits Conference (A-SSCC), 2018, pp. 267–268, doi: 10.1109/ASSCC.2018.8579263.
[68] Shihui Yin, Minkyu Kim, Deepak Kadetotad, Yang Liu, Chisung Bae, Sang Joon Kim, Yu Cao, Jae-Sun Seo, A 1.06-µW Smart ECG Processor in 65-nm CMOS for Real-Time Biometric Authentication and Personal Cardiac Monitoring, IEEE J. Solid-State Circ. 54 (8) (2019) 2316–2326, https://doi.org/10.1109/JSSC.2019.2912304.
[69] S. M. Abubakar, W. Saadeh and M. A. B. Altaf, “A wearable long-term single-lead ECG processor for early detection of cardiac arrhythmia,” in: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp. 961–966, doi: 10.23919/DATE.2018.8342148.
[70] S. Izumi, et al., A 14 µA ECG processor with robust heart rate monitor for a wearable healthcare system, in: 2013 Proceedings of the ESSCIRC (ESSCIRC), 2013, pp. 145–148, https://doi.org/10.1109/ESSCIRC.2013.6649093.