
International Journal of Computer Applications (0975 – 8887)

Volume 176 – No. 11, April 2020

Predicting the Presence of Heart Diseases using Comparative Data Mining and Machine Learning Algorithms

Daniel Ananey-Obiri
Department of Computational Science and Engineering
North Carolina Agricultural and Technical State University

Enoch Sarku
Department of Computational Science and Engineering
North Carolina Agricultural and Technical State University

ABSTRACT
Heart disease, an example of cardiovascular disease, is the number one notable cause of death in the world. Recently, studies have concentrated on using alternative, efficient techniques such as data mining and machine learning to diagnose diseases based on certain features of an individual. This study uses data exploration and mining techniques in Python to extract hidden patterns. Machine learning algorithms (logistic regression, a decision tree classifier, and Gaussian Naïve Bayes models) are developed to predict the presence of heart disease in patients, seeking better predictive performance so as to reduce the number of tests required for diagnosis. The k-fold cross-validation approach is used to assess the resulting models via receiver operating characteristic (ROC) curves (sensitivity against specificity). The dataset, collected from the UCI machine learning repository, contains information on patients with heart disease; it has 14 attributes measured on 303 individuals.

General Terms
Algorithms, pattern recognition, supervised learning, machine learning, heart disease.

Keywords
Classification, regression, k-fold cross validation, Receiver Operator Characteristics

1. INTRODUCTION
Heart diseases such as heart failure and myocardial infarction have been ranked as the highest cause of death in the United States [1]. According to the Centers for Disease Control and Prevention, about 610,000 people die every year from heart diseases. Health professionals have put measures in place for early detection of heart diseases, as this is key to preventing and curbing them. However, early-stage strategies adopted for identifying heart diseases have not been successful, due to the associated complexities [2]. According to [3], unlike traditional statistical methods, data mining techniques can detect and extract hidden, inconspicuous patterns and relationships in large datasets. Support vector machines, Naïve Bayes, artificial neural networks, logistic regression, and other models have been developed and used in healthcare research [4][5]. They have shown immense potential for accurate prediction of heart diseases based on the clinical data of patients [2].

In the diagnosis of heart diseases, a series of laboratory tests is required. However, the numerous tests impede the rapidity and efficiency of diagnosis. Data mining techniques provide an alternative approach for quick and efficient detection of heart diseases at the early stages. The primary objective of this project is to develop three different classification models, Gaussian Naïve Bayes (GNB), logistic regression (LR), and a decision tree classifier (CART), to predict heart disease in patients based on a clinical data sample used for training and testing. The models' performance was also evaluated and compared using accuracy scores, 10-fold cross validation, and the area under the receiver operating characteristic curve (AUCROC).

Age, cholesterol, and family history, among other factors, are considered risk factors for heart diseases. Early identification of heart diseases among patients can reduce the fatality associated with them. Many research studies have used machine learning to predict heart diseases among patients [6][7]. However, findings have differed in the metrics used to evaluate the models, culminating in differences in accuracies. Some research has involved developing ML algorithms on the Cleveland dataset, which is also used in this project. [3] developed J48, logistic model tree, and random forest algorithms; the highest accuracy score was found with the J48 algorithm (56.76%), and the lowest with the logistic model tree algorithm (55.77%). [8] also developed Classification and Regression Tree, Iterative Dichotomiser 3 (ID3), and Decision Table models on this dataset, with accuracies of 83.49%, 72.93%, and 82.50%, respectively. However, they adopted feature selection, leaving out important features such as the number of major vessels (0-3) colored by fluoroscopy (ca) and the ST depression induced by exercise relative to rest (oldpeak). [9] observed one common problem: many authors use different parameters for testing the accuracies of their models, which has made it difficult to be conclusive about the best model.

2. METHOD
Three classification models, namely logistic regression (LR), a decision tree classifier (CART), and Gaussian Naïve Bayes (GNB), were developed. The data was analyzed and the models implemented in Python. The steps followed to achieve the aim of this research were data preprocessing (including feature transformation), training and testing with the individual models, and finally comparison of the models' performance.

2.1 Preprocessing
The dataset, called the Cleveland Heart Disease dataset, was collected from the UCI machine learning repository and contains information on patients with heart disease. It has 14 attributes, including the patient's age, sex, cholesterol level, etc.,


which were measured on 303 individuals. (i) The first step of the preprocessing was identifying and removing duplicated rows. (ii) Subsequently, rows containing missing values were also identified and removed. (iii) Box-and-whisker plots were then used to detect outliers, and (iv) rows with outliers (i.e., values outside the range of -3σ to +3σ) were subsequently removed. (v) In total, one duplicated row and 15 rows containing outliers were removed.

2.1.1 Data Transformation
The following features were normalized to range between 0 and 1: age, thalach, and oldpeak. This was done before the feature reduction process. The other features were not normalized because they are either categorical or already Gaussian.

2.1.2 Feature Reduction
The aim of feature reduction is to search for a projection of the data onto features that preserves the information, patterns, and trends as much as possible (Hira & Gillies, 2015). In this study, the singular value decomposition (SVD) method was used to construct enriched features from the data. The features were reduced from thirteen (13) to four (4) using SVD.

2.2 Machine Learning Model Development
Three classification models were built: a decision tree classifier, logistic regression, and Gaussian Naïve Bayes. The three classification models were trained to find the best fit by splitting the data into training and test datasets, with the training dataset used to fit each model. In splitting data into training and testing sets, it is important to avoid bias, and an efficient way to do this is k-fold cross validation. The 10-fold cross-validation resampling method was adopted for training and testing the models: the dataset was split into 10 folds. The first iteration uses the data in the 1st fold to test the model while the remaining 9 folds are used to train the model; the second iteration uses the 2nd fold as the test set and the remaining 9 as the training set. This procedure is repeated until all the folds have been used as testing data. In each iteration, a model is fit on the training set and evaluated on the test set.

2.2.1 Decision Tree Classifier Model (CART)
A decision tree operates by concluding the value of a dependent attribute given the values of the independent features [10]. The classification and regression tree (CART), a type of decision tree, was adopted and implemented in this project. It works by splitting the dataset into several segments by posing a series of questions about the features. It was employed in this case because of its successful use in medical research as a powerful statistical tool for classification, interpretation, and data manipulation (Song & Lu, 2015).

2.2.2 Logistic Regression (LR)
LR has been one of the most widely used machine learning models for analyzing multivariate regression problems in the health fields [11]. It is used to predict the outcome of a dependent variable from continuous independent variables: it models binary dependent variables and fits an equation to the data.

2.2.3 Gaussian Naïve Bayes Model (GNB)
The Naïve Bayes model is a classification model based on Bayes' theorem. It is a simple probabilistic model premised on the assumption that all the features are independent of each other, given the categorical variable [12]. The Gaussian Naïve Bayes (GNB) classifier is known for prediction and for recognizing patterns in data. The model operates by taking each data point and assigning it to the nearest class; instead of using only the Euclidean distance from the class means, GNB also takes the per-class variance into account [13].
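The cleaning and transformation steps of Sections 2.1 and 2.1.1 could be sketched roughly as below. This is a non-authoritative sketch, not the authors' code: a small synthetic frame stands in for the Cleveland dataset, the helper `preprocess` is hypothetical, and only the three continuous columns (age, thalach, oldpeak) are shown.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # (i) remove duplicated rows, (ii) remove rows with missing values
    df = df.drop_duplicates().dropna()
    # (iii)-(iv) drop rows holding values outside mean +/- 3 standard deviations
    numeric = df.select_dtypes(include=[np.number])
    z = (numeric - numeric.mean()) / numeric.std()
    df = df[(z.abs() <= 3).all(axis=1)].copy()
    # Section 2.1.1: min-max normalize age, thalach and oldpeak to [0, 1]
    for col in ("age", "thalach", "oldpeak"):
        lo, hi = df[col].min(), df[col].max()
        df[col] = (df[col] - lo) / (hi - lo)
    return df

# Synthetic stand-in for the three continuous Cleveland columns
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(29, 78, 50),
    "thalach": rng.integers(90, 203, 50),
    "oldpeak": rng.random(50) * 4.0,
})
clean = preprocess(demo)
print(clean.shape)
```

On the real dataset this sequence would reproduce the paper's step (v): one duplicate and 15 outlier rows removed, leaving 287 of 303 rows.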

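The feature-reduction step of Section 2.1.2 could look like the following minimal sketch. It assumes scikit-learn's `TruncatedSVD` as the SVD implementation (the paper does not name one), and random data stands in for the 287 cleaned rows.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(1)
# 287 rows remain after cleaning (303 - 1 duplicate - 15 outlier rows);
# 13 predictor columns (all attributes except the target).
X = rng.random((287, 13))

# Project the 13 predictors onto 4 derived SVD features
svd = TruncatedSVD(n_components=4, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (287, 4)
```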
[Figure omitted: flow diagram — Patient Data → Data Preprocessing / Transformation → Feature Reduction → Model Training (applying Logistic Regression, Gaussian Naïve Bayes, and Decision Tree Classifier) → Predicting using Model → Evaluation of Result → Model Comparison]

Fig. 1: A high-level block diagram summarizing the methodology adopted in this paper
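The training stage summarized in Fig. 1 and Section 2.2 might be sketched as follows (a hedged sketch: scikit-learn is assumed, and random synthetic data stands in for the four SVD features and the diagnosis labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.random((287, 4))           # stand-in for the 4 SVD features
y = rng.integers(0, 2, 287)        # stand-in labels: 1 = disease present

# The three classifiers compared in the paper
models = {
    "LR": LogisticRegression(max_iter=1000),
    "CART": DecisionTreeClassifier(random_state=0),
    "GNB": GaussianNB(),
}

# 10-fold cross validation: each fold serves once as the test set
cv = KFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} over 10 folds")
```

On the random labels used here the mean accuracies hover around chance; on the real Cleveland features they would approach the figures reported in Section 3.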


3. RESULTS
3.1 Evaluation of Classification Algorithms
The three classification models were analyzed by evaluating precision, recall, F1, and accuracy scores. The performance of each classification model on the test data was visualized using a confusion matrix. The area under the curve (AUC) of the receiver operating characteristic (ROC) curve was used to visualize the performance of each of the models; the ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity). AUC lies between 0 and 1, and the greater the AUC value, the better the classification model. The reliability of the proposed models was tested by dividing the data with the 10-fold cross-validation method. Fig. 5 provides a comparison of the accuracy scores of the three algorithms (LR, GNB, CART).

3.1.1 Decision Tree Classifier Model
The confusion matrix for the decision tree classifier is presented in Table 1. The accuracy score obtained for this model was 79.31%. Table 2 presents the evaluation metrics: precision (the ability of the model not to label patients who do not have heart disease as having it), recall, and F1-score. The AUC value obtained was 0.81, with the corresponding ROC graph displayed in Fig. 3.

Table 1. Confusion matrix of the Decision Tree Classifier model

                      Predicted (absence)   Predicted (presence)
Actual (absence)              20                      7
Actual (presence)              5                     26

Table 2. Classification report of the Decision Tree Classifier model
[Table values not legible in the source.]

Fig. 3: ROC curve showing the performance output of the Decision Tree Classifier model
[Figure omitted.]

3.1.2 Gaussian Naïve Bayes Model
Table 3 shows the confusion matrix describing the performance of the GNB classifier, indicating how the testing dataset was predicted. The accuracy score indicates how correctly the model was able to predict the presence or absence of heart disease: the model correctly classified 82.75% of the test dataset. Other performance metrics, such as precision, recall, and F1-score, are displayed in Table 4. The recall indicates how much of the absence or presence of the disease the model was able to capture by classifying it as absence or presence, respectively. Moreover, the AUC for this model was 0.87, and the ROC curve is presented in Fig. 4.

Table 3. Confusion matrix of the Gaussian Naïve Bayes model

                      Predicted (absence)   Predicted (presence)
Actual (absence)              21                      6
Actual (presence)              4                     27

Table 4. Classification report of the Gaussian Naïve Bayes model
[Table values not legible in the source.]
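The per-model evaluation of Section 3.1 (confusion matrix, classification report, AUC) could be reproduced with scikit-learn roughly as below. This is a sketch under stated assumptions: synthetic, mildly separable data stands in for the Cleveland test split, and logistic regression stands in for any of the three models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.random((287, 4))
# Mildly separable synthetic labels so the metrics are non-trivial
y = (X[:, 0] + 0.3 * rng.random(287) > 0.65).astype(int)

# 20% of 287 rows gives the 58-sample test set seen in Tables 1, 3 and 5
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = clf.predict(X_te)
cm = confusion_matrix(y_te, pred)      # rows: actual, columns: predicted
print(cm)
print(classification_report(y_te, pred))
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print("AUC:", round(auc, 3))
```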


Fig. 4: ROC curve showing the performance output of the Gaussian Naïve Bayes model
[Figure omitted.]

3.1.3 Logistic Regression
The confusion matrix associated with the LR model, indicating how many of the test samples were predicted accurately as the presence or absence of heart disease, is shown in Table 5. A summary of the precision, recall, and F1 scores is presented in Table 6. The accuracy score for this model was 82.75%, and its AUC was 0.86; the corresponding ROC curve is presented graphically in Fig. 4.

Table 5. Confusion matrix of the Logistic Regression model

                      Predicted (absence)   Predicted (presence)
Actual (absence)              21                      6
Actual (presence)              4                     27

Table 6. Classification report of the Logistic Regression model
[Table values not legible in the source.]

Fig. 4: ROC curve showing the performance output of the Logistic Regression model
[Figure omitted.]

Fig. 5: A diagrammatic comparison of the predicting accuracies (%) of the three models
[Figure omitted: bar chart of accuracy scores (%) for LR, CART, and GNB; axis running from 76 to 84; bar values not legible in the source.]

4. DISCUSSION
The purpose of this research was to compare the accuracies of the three algorithms (CART, LR, and GNB) in predicting heart disease in patients. Surprisingly, the highest predicting accuracy was obtained with both LR and GNB: they had the same accuracy score of 82.75% and the same precision, recall, and F1 scores. However, the AUCROC value for GNB was higher than for the LR model. Naïve Bayes algorithms are documented to be effective in practical medical diagnosis [14]. The competitive performance of GNB in classification could be attributed to the dependence distribution [15]. The CART model scored the lowest predicting accuracy among the three models, which could be due to the relatively small sample size of the dataset [16].

5. CONCLUSION
Detecting heart disease at the early stages, with few clinical tests needed to diagnose it, is crucial to preventing the many deaths associated with it. The burgeoning influence of data mining and machine learning techniques in the medical field, detecting subtle patterns in large datasets, makes them relevant to heart disease diagnostics. The performance metrics used to evaluate the three models put the GNB model ahead of the other classifiers: the greater AUCROC value (0.87) of the GNB model makes it a better choice than LR (0.86). Future research could focus on including different models such as random forest, k-nearest neighbors, and support vector machines.

6. REFERENCES
[1] H. K. Weir et al., "Heart Disease and Cancer Deaths - Trends and Projections in the United States, 1969-2020," Prev. Chronic Dis., vol. 13, p. E157, Nov. 2016.
[2] C. S. Dangare and S. S. Apte, "Improved Study of Heart Disease Prediction System using Data Mining Classification Techniques," Int. J. Comput. Appl., vol. 47, no. 10, pp. 44–48, 2012.
[3] J. Patel, S. Tejalupadhyay, and S. Patel, "Heart Disease Prediction using Machine Learning and Data Mining Technique," 2016.
[4] S. Sakr et al., "Comparison of machine learning techniques to predict all-cause mortality using fitness data: the Henry Ford exercIse testing (FIT) project," BMC Med. Inform. Decis. Mak., vol. 17, no. 1, p. 174, Dec. 2017.
[5] S. Sakr et al., "Using machine learning on cardiorespiratory fitness data for predicting hypertension: The Henry Ford ExercIse Testing (FIT) Project," PLoS One, vol. 13, no. 4, p. e0195344, 2018.
[6] M. Shouman, T. Turner, and R. Stocker, "Using decision tree for diagnosing heart disease patients," vol. 121, 2011.
[7] A. H. Babar, "Comparative Analysis of Classification Models for Healthcare Data Analysis," vol. 07, no. 04, pp. 170–175, 2018.
[8] V. Chaurasia, "Early Prediction of Heart Diseases Using Data Mining Techniques," Caribb. J. Sci. Technol., vol. 1, pp. 208–217, Dec. 2013.
[9] H. Sharma, "Prediction of Heart Disease using Machine Learning Algorithms: A Survey," Int. J. Recent Innov. Trends Comput. Commun., vol. 5, no. 8, pp. 99–104, 2017.
[10] N. Bhargava and G. Sharma, "Decision Tree Analysis on J48 Algorithm for Data Mining," Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 3, no. 6, pp. 1114–1119, 2013.
[11] E. W. Steyerberg, M. J. Eijkemans, F. E. J. Harrell, and J. D. Habbema, "Prognostic modeling with logistic regression analysis: in search of a sensible strategy in small data sets," Med. Decis. Making, vol. 21, no. 1, pp. 45–56, 2001.
[12] S. Xu, "Bayesian Naïve Bayes classifiers to text classification," J. Inf. Sci., vol. 44, no. 1, pp. 48–59, Nov. 2016.
[13] R. D. S. Raizada and Y.-S. Lee, "Smoothness without Smoothing: Why Gaussian Naive Bayes Is Not Naive for Multi-Subject Searchlight Studies," PLoS One, vol. 8, no. 7, p. e69566, Jul. 2013.
[14] I. Rish, "An Empirical Study of the Naïve Bayes Classifier," IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3, Jan. 2001.
[15] H. Zhang, "The Optimality of Naive Bayes," vol. 2, 2004.
[16] R. Nichenametla, T. Maneesha, S. Hafeez, and H. Krishna, "Prediction of Heart Disease Using Machine Learning Algorithms," Int. J. Eng. Technol., vol. 7, pp. 363–366, May 2018.
