Predicting The Presence of Heart Diseases Using Comparative Data Mining and Machine Learning Algorithms
International Journal of Computer Applications (0975 – 8887)
Volume 176 – No. 11, April 2020
which was measured on 303 individuals. (i) The first step of the preprocessing was identifying and removing duplicated rows. (ii) Subsequently, rows containing missing values were also identified and removed. (iii) Box-and-whisker plots were then used to detect outliers, and (iv) rows containing outliers (i.e. values outside the range of -3δ to +3δ) were subsequently removed. (v) In total, one duplicated row and 15 rows containing outliers were removed.

2.1.1 Data Transformation
The following features were normalized to range between 0 and 1: age, thalach, and oldpeak. This was done before the feature reduction process. The other features were not normalized because they are either categorical or already Gaussian.

2.1.2 Feature Reduction
The aim of feature reduction is to search for a projection of the data onto features that preserves the information, patterns, and trends as much as possible (Hira & Gillies, 2015). In this study, the singular value decomposition (SVD) method was used to construct enriched features in the data. The features were reduced from thirteen (13) to four (4) using SVD.

2.2 Machine Learning Model Development
Three classification models were developed: a Decision tree classifier, Logistic regression, and a Gaussian Naïve Bayes model. The three classification models were trained to find the best fit by splitting the data into training and test datasets. The training dataset is used to fit the model. In splitting data into training and testing sets, it is important to avoid bias. An efficient training method is k-fold cross validation; the 10-fold cross validation resampling method was adopted in training and testing the models. The dataset was split into 10 folds. The first iteration uses the data in the 1st fold to test the model while the remaining 9 folds are used to train the model. The second iteration uses the 2nd fold as the test set and the remaining 9 folds as the training set. This procedure is repeated until all the folds have been used as testing data. In each iteration, a model is fit on the training set and evaluated on the test set.

2.2.1 Decision Tree Classifier Model (CART)
A decision tree operates by inferring the value of a dependent attribute given the values of the independent features [10]. The classification and regression tree (CART), a type of decision tree, was adopted and implemented in this project. It works by splitting the dataset into several segments through posing a series of questions about the features. CART was employed in this case because of its successful use in medical research as a powerful statistical tool for classification, interpretation, and data manipulation (Song & Lu, 2015).

2.2.2 Logistic Regression (LR)
LR has been one of the most widely used machine learning models for analyzing multivariate regression problems in the health fields [11]. It is used for predicting the outcome of a binary dependent variable from continuous independent variables, and it fits an equation to the data.

2.2.3 Gaussian Naïve Bayes Model (GNB)
The Naïve Bayes model is a classification model based on Bayes' theorem. It is a simple probabilistic model premised on the assumption that all the features are conditionally independent of each other given the categorical variable [12]. The Gaussian Naïve Bayes (GNB) classifier is known for prediction and for recognizing patterns in data. The model operates by taking each data point and assigning it to the nearest class. Instead of using only the Euclidean distance from the class means, GNB also takes the class variances into account [13].
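The preprocessing and feature-reduction steps in sections 2.1-2.1.2 can be sketched as follows. This is a minimal sketch, not the authors' code: it assumes the dataset is loaded into a pandas DataFrame with the UCI heart-disease column names used above (age, thalach, oldpeak) and a label column here called `target`, and it uses NumPy's SVD in place of whichever SVD implementation the study used.

```python
import numpy as np
import pandas as pd

def preprocess(df, n_components=4):
    """Clean the data and reduce the features with SVD, as in section 2.1."""
    # (i)-(ii) drop duplicated rows and rows containing missing values
    df = df.drop_duplicates().dropna()
    # (iii)-(iv) drop rows with values outside mean +/- 3 standard deviations
    numeric = df.select_dtypes(include=[np.number])
    z = (numeric - numeric.mean()) / numeric.std()
    df = df[(z.abs() <= 3).all(axis=1)].copy()
    # 2.1.1 min-max normalize the non-categorical, non-Gaussian features
    for col in ("age", "thalach", "oldpeak"):
        if col in df.columns:
            df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    # 2.1.2 singular value decomposition: keep the top n_components features
    X = df.drop(columns=["target"]).to_numpy(dtype=float)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_reduced = U[:, :n_components] * s[:n_components]
    return X_reduced, df["target"].to_numpy()
```

With the study's 13 features, `preprocess(df)` would return the four SVD components described in section 2.1.2 together with the class labels.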
[Figure: block diagram with stages "Predicting using Model", "Evaluation of Result", and "Model Comparison"]
Fig.1 A high level block diagram summarizing the methodology adopted in this paper
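The 10-fold cross validation of the three models described in section 2.2 can be sketched with scikit-learn. This is a minimal sketch under stated assumptions, not the study's implementation: `X` and `y` stand in for the four SVD-reduced features and the heart-disease label, and default hyperparameters are assumed throughout.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def evaluate_models(X, y, n_splits=10, seed=0):
    """Mean accuracy of each classifier over n_splits cross-validation folds."""
    models = {
        "CART": DecisionTreeClassifier(random_state=seed),
        "LR": LogisticRegression(max_iter=1000),
        "GNB": GaussianNB(),
    }
    # each fold serves as the test set exactly once; the other 9 train the model
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return {name: cross_val_score(m, X, y, cv=cv, scoring="accuracy").mean()
            for name, m in models.items()}
```

Calling `evaluate_models(X, y)` returns one mean accuracy per model, which is the quantity compared across the three algorithms in the results section.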
3. RESULTS
3.1 Evaluation of Classification Algorithms
The three classification models were analyzed by evaluating precision, recall, f1, and accuracy scores. The performance of each classification model on the test data was visualized using a confusion matrix. Area under the curve (AUC) and receiver operating characteristic (ROC) curves were used to visualize the performance output of each of the models. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity). AUC lies between 0 and 1; the greater the AUC value, the better the classification model. The reliability of the proposed models was tested by dividing the data with the 10-fold cross validation method. Figure 11 provides a comparison of the accuracy scores of the three algorithms (LR, GNB, CART).

3.1.1 Decision Tree Classifier Model
The confusion matrix for the decision tree classifier is presented in table 1. The accuracy score obtained for this model was 79.31%. Table 2 presents the evaluation metrics: precision (the ability of the model not to label patients as having heart disease who do not have heart disease), recall, and f1-score. The AUC value obtained was 0.81, with the corresponding ROC graph displayed in figure 8.

Table 1. Confusion matrix of decision Tree Classifier model

                     Predicted (absence)   Predicted (presence)
Actual (absence)             20                      7
Actual (presence)             5                     26

Fig 3: ROC curve showing the performance output of Decision Tree Classifier model

3.1.2 Gaussian Naïve Bayes Model
Table 3 displays the confusion matrix describing the performance of the GNB classifier, indicating how the testing dataset was predicted. The accuracy score indicates how correctly the model was able to predict the presence or absence of heart disease: the model correctly predicted 76% of the test dataset as either the presence or absence of heart disease in patients. Other performance metrics such as precision, recall, and f1-score are displayed in table 4. The recall indicates how many cases of either absence or presence of the disease the model was able to capture by classifying them as absence or presence of heart disease, respectively. Moreover, the AUC for this model was 0.87, and the ROC curve is represented in fig. 3.

Table 3. Confusion matrix of Gaussian Naive Bayes model

                     Predicted (absence)   Predicted (presence)
Actual (absence)             21                      6
Actual (presence)             4                     27
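The metrics discussed above can be reproduced directly from the confusion-matrix counts. As a minimal sketch, the counts from Table 1 (TN = 20, FP = 7, FN = 5, TP = 26, treating "presence" as the positive class) recover the 79.31% accuracy reported for the decision tree classifier:

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and f1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # predicted "presence" that were correct
    recall = tp / (tp + fn)      # actual "presence" that were captured
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Table 1: decision tree classifier
acc, prec, rec, f1 = classification_metrics(tn=20, fp=7, fn=5, tp=26)
print(f"accuracy={acc:.4f}")  # accuracy=0.7931
```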
[Figure: bar chart titled "Classification Models" comparing the accuracy scores of LR, CART, and GNB; accuracy axis ranges from 76 to 84]
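The AUC values quoted in section 3.1 can be computed from predicted scores with scikit-learn. This is a minimal sketch on a hypothetical four-patient test set (1 = presence of heart disease, 0 = absence), not the study's data:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]              # hypothetical ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8]    # hypothetical predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

Here the AUC equals the fraction of (presence, absence) pairs that the scores rank correctly, which is why a value closer to 1 indicates a better classifier.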
Fig 4. ROC curve showing the performance output of Logistic Regression model

6. REFERENCES
[1] H. K. Weir et al., "Heart Disease and Cancer Deaths - Trends and Projections in the United States, 1969-2020," Prev. Chronic Dis., vol. 13, pp. E157–E157, Nov. 2016.
[2] C. S. Dangare and M. E. Cse, "Improved Study of Heart Disease Prediction System using Data Mining Classification Techniques," Int. J. Comput. Appl., vol. 47, no. 10, pp. 44–48, 2012.
[3] J. Patel, S. Tejalupadhyay, and S. Patel, Heart Disease Prediction using Machine Learning and Data Mining Technique. 2016.
[4] S. Sakr et al., "Comparison of machine learning techniques to predict all-cause mortality using fitness data: the Henry Ford ExercIse Testing (FIT) project," BMC Med. Inform. Decis. Mak., vol. 17, no. 1, p. 174, Dec. 2017.
[5] S. Sakr et al., "Using machine learning on cardiorespiratory fitness data for predicting hypertension: The Henry Ford ExercIse Testing (FIT) Project," PLoS One, vol. 13, no. 4, p. e0195344, 2018.
[6] M. Shouman, T. Turner, and R. Stocker, Using Decision Tree for Diagnosing Heart Disease Patients, vol. 121. 2011.
[7] A. H. Babar, "Comparative Analysis of Classification Models for Healthcare Data Analysis," vol. 07, no. 04, pp. 170–175, 2018.
[8] V. Chaurasia, "Early Prediction of Heart Diseases Using Data Mining Techniques," Caribb. J. Sci. Technol., vol. 1, pp. 208–217, Dec. 2013.
[9] H. Sharma, "Prediction of Heart Disease using Machine Learning Algorithms: A Survey," Int. J. Recent Innov. Trends Comput. Commun., vol. 5, no. 8, pp. 99–104, 2017.
[10] N. Bhargava and G. Sharma, "Decision Tree Analysis on J48 Algorithm for Data Mining," Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 3, no. 6, pp. 1114–1119, 2013.
[11] E. W. Steyerberg, M. J. Eijkemans, F. E. J. Harrell, and J. D. Habbema, "Prognostic modeling with logistic regression analysis: in search of a sensible strategy in small data sets," Med. Decis. Making, vol. 21, no. 1, pp. 45–56, 2001.
[12] S. Xu, "Bayesian Naïve Bayes classifiers to text classification," J. Inf. Sci., vol. 44, no. 1, pp. 48–59, Nov. 2016.
[13] R. D. S. Raizada and Y.-S. Lee, "Smoothness without Smoothing: Why Gaussian Naive Bayes Is Not Naive for Multi-Subject Searchlight Studies," PLoS One, vol. 8, no. 7, p. e69566, Jul. 2013.
[14] I. Rish, "An Empirical Study of the Naïve Bayes Classifier," IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3, Jan. 2001.
[15] H. Zhang, The Optimality of Naive Bayes, vol. 2. 2004.
[16] R. Nichenametla, T. Maneesha, S. Hafeez, and H. Krishna, "Prediction of Heart Disease Using Machine Learning Algorithms," Int. J. Eng. Technol., vol. 7, pp. 363–366, May 2018.