Manuscript
Wada Mohammed Jinjri, *Pantea Keikhosrokiani, Nasuha Lee Abdullah
School of Computer Sciences, Universiti Sains Malaysia
Minden, 11800 Penang, Malaysia
[email protected] [email protected] [email protected]
Abstract— Heart disease (cardiovascular disease) is a major human disorder that significantly affects many people's lives. Diagnosing heart disease at an early stage is an important task for reducing its impact. Machine learning methods remain the most widely used methods for classification and detection. This work aims to design and identify a model that best classifies cardiovascular disease and accurately predicts the presence or absence of the disease in patients using machine learning methods. To this end, this paper compares five powerful machine learning algorithms for classifying cardiovascular disease (CVD) data: support vector machine (SVM), K-nearest neighbor (K-NN), logistic regression (LR), decision tree (DT), and Naïve Bayes (NB). To validate the work, the dataset was obtained from the Kaggle online repository. The algorithms' performance is analyzed, evaluated, and compared using various performance measures. Results indicate that the support vector machine (SVM) and logistic regression (LR) methods are the most efficient for diagnosing cardiovascular disease.

Keywords—heart disease, cardiovascular disease, machine learning, classification, comparative analysis

I. INTRODUCTION

Cardiovascular disease (CVD), also known as heart disease, is a well-known and critical global health problem that significantly affects people. Recent studies estimate that millions of deaths worldwide are due to heart disease [1]–[5], representing 31% of all deaths worldwide. Medical evidence has revealed that certain risk factors increase an individual's likelihood of developing heart disease. Some of these factors, as stated in [6], [7], are unhealthy diet, tobacco use, depression, stress, excessive alcohol use, physical inactivity, heredity, overweight, and age. Several reports by the WHO have shown a rise in deaths due to CVD, mainly attributed to insufficient protective measures despite increasing risk factors.

The cumulative morbidity and mortality due to heart disease worldwide have attracted researchers' attention and prompted numerous investigations aimed at curtailing these rates. Machine learning techniques play a very active role in medical data mining for information extraction and analysis. These systems have been broadly used to implement medical decision systems for prediction, improved health policy-making, prevention of clinical errors, early detection and prevention of disease, and reduction of avoidable hospital deaths [8]. As intelligent systems, machine learning methods can be used to interpret a data set consistently and produce suitable output from raw data for various purposes. They provide a way of analyzing large datasets to find patterns and relations among diverse entities that are not detectable without advanced analysis techniques [9].

Due to the high-dimensional and imbalanced nature of the data, standard statistical approaches face a significant challenge, and several classification techniques have been rendered infeasible. Thus, researchers [10]–[12] have proposed several techniques to handle the inherent difficulties of high-dimensional data. The difficulty of precise classification is possibly due to noisy features that are irrelevant to classification but lead to an accumulation of significant errors and, frequently, to inaccurate analysis. Standard classification methods do not handle these data inadequacies, outliers, and noise, which leads to less accurate results. For this reason, there is a need to identify and apply a technique capable of solving these issues.

Classification is a persistent problem that arises in various applications. Several healthcare establishments face a significant challenge in delivering quality services, such as diagnosing patients appropriately and managing treatment at reasonable cost. Therefore, classification approaches are generally used in the medical field to classify health data into different classes according to some constraints, relative to a specific classifier [13]. In general, a classification algorithm is a function that evaluates the input features so that its output separates one class into positive values and the other into negative values. Classifier training is therefore essential to identify the weights that give the most precise separation of the classes in the data [14].

Motivated by advances in machine learning approaches to forecasting cardiovascular disease risk and by improved classification performance, this paper contributes a comparative study of machine learning algorithms and identifies the most effective algorithm, with reasonable accuracy, for classifying cardiovascular disease data. In addition, it establishes the performance of different algorithms on large and small datasets in a single view, classifies them appropriately, and offers information on building supervised machine learning models. We propose five machine learning algorithms, support vector machine (SVM), K-nearest neighbor (K-NN), logistic regression (LR), decision tree (DT), and Naïve Bayes (NB), and evaluate their performance using clinical data obtained from cardiovascular disease patients. The rest of the paper is organized as follows: Section 2 discusses related work by other researchers. Section 3 presents the proposed
                 Predicted value
Actual value     Absence    Presence
Absence          TP         FN
Presence         FP         TN

calculate entropy. Next, the dataset is split on the variable/predictor with the highest information gain (lowest entropy). The two steps above are then repeated for the rest of the attributes.
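The entropy-based split described for the decision tree can be illustrated with a minimal sketch. It assumes categorical features and Shannon entropy measured in bits; the function names (`entropy`, `information_gain`, `best_split`) and the toy records are hypothetical and are not taken from the paper's implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, feature_index):
    """Entropy reduction obtained by splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


def best_split(rows, labels):
    """Index of the feature with the highest information gain."""
    return max(range(len(rows[0])),
               key=lambda i: information_gain(rows, labels, i))


# Hypothetical toy records: (smoker, physically_active) -> class label.
rows = [("yes", "no"), ("yes", "yes"), ("no", "no"), ("no", "yes")]
labels = ["presence", "presence", "absence", "absence"]

for i in range(len(rows[0])):
    print(f"feature {i}: gain = {information_gain(rows, labels, i):.3f}")
print("split on feature", best_split(rows, labels))
```

A full tree would apply `best_split` recursively to each resulting subset until the subsets are pure or no attributes remain.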
machine (SVM), has been used for the classification of the cardiovascular disease dataset. The dataset consists of 70,000 samples and 11 features. The class variable has two categories: absence or presence of the disease. Of the data samples, 35,021 are labeled as absence, while the remaining 34,979 are labeled as presence of cardiovascular disease. The data is partitioned into training and testing sets at a ratio of 70:30, respectively.

The confusion matrices of the prediction results for DT, LR, K-NN, NB, and SVM are shown in Tables 1-5. From these, the F1_score, recall, precision, and accuracy are measured, as illustrated in Figs. 2-6.

Presence          2509 (17.92%)    4503 (32.16%)    7012
Total Predicted   8180             5820             14000

From the confusion matrix results, Naïve Bayes (NB) predicts the highest number of true positives, while logistic regression predicts the highest number of true negatives, as represented in Tables 6 and 5.

Figure 2 compares the algorithms DT, K-NN, LR, NB, and SVM on the F1_score, with 63.94%, 67.02%, 71.13%, 44.43%, and 70.71%, respectively. It is concluded that LR outperforms the other algorithms on the F1_score.
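The measures compared here all derive from the four cells of a confusion matrix. The following is a minimal sketch of that calculation; the function name `classification_metrics` and the counts passed to it are hypothetical illustrations and are not values from the paper's tables. Which class counts as "positive" is also an assumption that depends on the labeling convention in use.

```python
def classification_metrics(tp, fn, fp, tn):
    """Precision, recall, F1 score, and accuracy from confusion-matrix counts."""
    total = tp + fn + fp + tn
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=60, fn=20, fp=40, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Because F1 is a harmonic mean, a classifier with high precision but low recall (or the reverse) receives a low F1 score, which is why it can rank classifiers differently from accuracy.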
TABLE III. CONFUSION MATRIX FOR DECISION TREE CLASSIFIER
                 Predicted Value
Actual value     Absence          Presence         Total Actual
Absence          6562 (31.35%)    3899 (18.46%)    10461

                 Predicted Value
Actual value     Absence          Presence         Total Actual
Absence          8243 (39.25%)    2296 (10.93%)    10539

Fig. 2. Comparison for F1_score

                 Predicted Value
Actual value     Absence          Presence         Total Actual
Absence          5363 (38.31%)    1625 (11.61%)    6988
K-NN   61.46   67.02   73.68   69.87%   5.78
LR     67.99   71.13   74.58   72.36%   2.52
NB     32.30   44.43   71.11   59.44%   0.63

Fig. 4 compares the recall of the algorithms. Recall is the proportion of all actually positive examples in the dataset that are correctly predicted as positive.
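In confusion-matrix terms, the recall measure compared in Fig. 4 can be written as follows, where TP and FN are the true-positive and false-negative counts. Which class is treated as positive is an assumption that depends on the labeling convention.

```latex
\[
\text{Recall} = \frac{TP}{TP + FN}
\]
```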