International Journal of Innovative Computing 9(1) 39-46

Classification and Feature Selection Approaches by Machine Learning Techniques: Heart Disease Prediction

N. Satish Chandra Reddy, Song Shue Nee, Lim Zhi Min & Chew Xin Ying
School of Computer Sciences,
11800, Universiti Sains Malaysia, Pulau Pinang, Malaysia
Email: [email protected]

Submitted: 03/08/2018. Revised edition: 05/12/2018. Accepted: 26/12/2018. Published online: 30/05/2019
DOI: https://doi.org/10.11113/ijic.v9n1.210

Abstract—Heart disease is one of the major causes of death worldwide. Diagnosing heart disease is expensive, so it is necessary to predict the risk of heart disease from a selected set of features. Feature selection methods can reduce the cost of diagnosis by identifying the important attributes. The objectives of this study are to build a classification model and to identify which selected features play a key role in the prediction of heart disease, using the Cleveland and Statlog project heart datasets. The accuracy of the random forest algorithm, in both the classification and the feature selection models, was observed to be 90–95% across three different percentage splits. The 8 and 6 selected features appear to be the minimum feature requirements for building a well-performing model; dropping further features from the 8 or 6 selected features may not lead to better performance of the prediction model.

Keywords—Machine Learning, Feature Selection, R tool, Classification, Prediction

I. INTRODUCTION

Heart disease is a major cause of death globally. According to the World Health Organization (WHO), heart disease caused around 17.7 million deaths (31% of all deaths) in 2015, and the number was estimated to increase by 2020. Every year nearly 20 million people die, which makes heart disease a leading cause of death [1]. The group of diseases related to both the heart and the blood vessels is referred to as cardiovascular disease (CVD). CVD includes coronary heart disease (CHD), also known as coronary artery disease (CAD), which is a disease of the arteries that supply oxygen and blood to the heart and is associated with lifestyle conditions and age. CHD occurs when the coronary arteries narrow due to the deposition of fatty material (plaque), a process termed atherosclerosis. Atherosclerosis reduces the blood flow to the heart and is said to be the underlying cause of complications such as angina and heart attacks. Angina is typically a pain in the chest, but it can often radiate to the shoulder and left arm. A heart attack, also known as myocardial infarction (MI), cardiac infarction, or coronary thrombosis or occlusion, is a condition in which an artery ruptures or narrows. A blood clot formed, partially or completely, during the repair of a ruptured blood vessel reduces the blood flow to the heart muscle, causing a heart attack. Chest pain, which may also radiate to the left arm, is a sign of a heart attack. The modifiable risk factors are smoking, high blood pressure, high cholesterol, obesity, unhealthy diet, diabetes, depression and stress; the non-modifiable risk factors are age, gender, genetic factors, race and ethnicity. Heart failure, also known as congestive heart failure, affects the right side, the left side or both sides of the heart; once damaged, the heart cannot heal. It is a serious condition in which the heart muscle is damaged, becomes weak and does not pump blood properly to meet the needs of the rest of the body. Factors such as heart disease, high blood pressure, diabetes and cardiomyopathy (a disease of the heart muscle) can, over

N. Satish Chandra Reddy et al. / IJIC Vol. 9:1(2019) 39-46

time, leave the heart too weak to fill and pump properly. Other circulatory system malfunctions are peripheral artery disease (PAD: narrowing of the arteries in the arms, legs, stomach and head), venous thromboembolism (VTE: a blood clot in a vein) and aortic aneurysm (a weakened artery wall that allows the artery to widen) [2], [3].

Based on disease statistics, the rate of heart disease is increasing, and the disease cannot be identified by visual inspection because it is caused by many risk factors and does not show noticeable symptoms. However, continuous monitoring of vital signs such as heartbeat, blood pressure, glucose level, ECG, EEG, EMG, respiratory rate and temperature is useful for the early detection of abnormalities. This monitoring has been made possible by an advance in wireless communication technology, the “Wireless Body Area Network (WBAN) which consists of tiny Bio-Medical Sensors (BMSs)”, which are available at reasonable cost. The three methods of BMS deployment are on-body wearable (to monitor blood pressure, ECG and EMG), in-body implantation (to monitor the lungs, liver and kidneys) and off-body (to monitor physical health conditions, body position, arm positions, walking and running). The BMSs are monitored through a Body Area Network Coordinator (BANC) in a star network topology [4]. Patient monitoring data are categorized as non-emergency (normal readings of vital signs such as temperature and glucose level) and emergency (signs such as low respiratory rate and high blood pressure). The Medium Access Control (MAC) superframe structure uses IEEE 802.15.4 as its protocol [5]. The heterogeneous patient data are classified into Ordinary Packet (OP), Reliability Data Packet (RP), Critical Data Packet (CP) and Delay Data Packet (DP). Although this classification helps deliver patient data to medical doctors without loss and with lower energy consumption by the BMSs, it does not consider low and high threshold values of vital signs when categorizing emergency data. This limitation has been overcome by a traffic-priority-based MAC protocol and by slot allocation based on alert signals to the BMSs, in which vitals such as heart rate, blood pressure, respiratory rate and temperature are categorized into low, normal and high threshold ranges. These ranges are used to provide appropriate, dedicated time slots to non-emergency and emergency BMSs. The BANC allocates slots based on the severity of the threshold values of the vital signs and also resolves slot allocation conflicts. This protocol improves throughput, reduces packet delay for non-emergency and emergency data, and minimizes the energy consumption of the BMSs and the BANC [6], [7].

An enormous amount of data is produced by hospitals, medical centers and other organizations. These data are not being used properly, although they contain potential information for further proceedings such as clinical decision support, disease surveillance and population health management [8], [9]. On the other hand, clinical decisions made by doctors may miss information hidden in the data, leading to errors and unwanted biases. This might affect medical costs and the quality of the health services provided to patients [10]. Medical errors and unwanted practice variation can be reduced, and patient outcomes and safety improved, by integrating clinical decision support with computer-based patient records. Researchers and clinicians are studying different heart disease datasets with different machine learning classification algorithms and feature selection methods to produce predictions for heart disease [11], [12]. In this paper, the main objectives are to build a well-performing model and to identify which selected features play a key role in the prediction of heart disease, using the Cleveland and Statlog project heart datasets from UCI [13] with the R tool.

II. LITERATURE REVIEW

This section reviews work related to the classification and feature selection methods applied to the prediction of heart disease. The heart, one of the main organs of the body, pumps blood through the blood vessels (i.e., the circulatory system). Oxygen and other materials such as nutrients, hormones and waste substances are carried to the other parts of the body by the blood. Thus, the heart plays an important role in the circulatory system. A condition in which the heart is impaired is referred to as heart disease. Heart disease, or cardiovascular disease, refers to disorders of the heart and blood vessels. The different types of heart disease are coronary heart disease, heart attack (blocked vessels or myocardial infarction) and angina (coronary artery disease). Coronary heart disease leads to heart attack and is said to be one of the major causes of death globally (WHO). Non-modifiable risk factors such as age, gender, genetic or family history, and ethnicity or race, together with modifiable risk factors such as cholesterol, high blood pressure, hypertension, lack of physical exercise, smoking, obesity and stress, are responsible for heart disease [14], [15].

A. Data Mining and Machine Learning Algorithms

Data mining is the process of exploring data for new and valuable information; it can significantly improve the quality of clinical decisions and plays an important role in intelligent medical systems. Data mining spans various disciplines; machine learning is one of them and comprises techniques such as classification, clustering, regression, association analysis and outlier detection [16]. The classification algorithms used for the prediction of heart disease are reviewed below:


1) K-Nearest Neighbors Algorithm

It is a simple classifier that cannot handle noise; it is easy to implement and understand, requires a short training time, and uses the whole training set for prediction. K-Nearest Neighbors (K-NN) with a weighting parameter has been used for the prediction of heart disease. Among the 13 attributes in the UCI heart disease dataset, 8 attributes were selected for that study because they are simple to measure. The study reports an accuracy of 81.9% and states that the 8 attributes age, sex, chest pain, trestbps, trestbpd, restecg, thalrest and exang are more than enough to predict heart disease [17]. KNN has also been used together with a feature selection technique, particle swarm optimization (PSO), for heart disease prediction, where it showed 100% accuracy [18].

2) Support Vector Machine (SVM)

Based on their kernel functions, support vector classifiers are divided into types such as linear, nonlinear, radial basis function (RBF), sigmoid and polynomial. The support vector machine separates the data points of different classes with a hyperplane; the data points closest to the hyperplane are the support vectors. Different classification techniques, namely K-Nearest Neighbors, J48 decision tree, SMO and Naïve Bayes, were applied to 8 attributes with 10-fold cross-validation in WEKA. Four features were extracted using the gain ratio evaluation technique. The highest accuracy, 83.73%, was obtained with J48 [19]. In another study, datasets on heart disease, diabetes and cancer were classified using SVM, RBF and Naïve Bayes in Weka. Among the 3 classifiers, SVM showed the highest accuracy of 93.75% and also appeared to be the most effective and robust classifier for predicting diseases [20].

3) Random Forest

Random forest constructs a multitude of decision trees at training time and outputs the mode of the class predictions for classification and the mean prediction for regression. The different patterns in the data are evaluated by the decision trees. For classification, the class prediction is based on the majority vote. Random forest with 10-fold cross-validation, together with feature selection methods such as chi-square and a genetic algorithm, was used for the prediction of heart disease. A single UCI heart dataset was used in that study; the accuracy was found to be 83.70%, which was better than that of other algorithms such as Naïve Bayes, decision tree and neural nets [21]. In another study, the Cleveland heart disease dataset was used with random forest and 10-fold cross-validation to obtain an accuracy of 85.81%; the feature selection approach gave better accuracy [22].

4) Naïve Bayes

Naïve Bayes is a classifier based on Bayes' theorem; it assumes that each feature is independent of the other features given the class. The classification algorithms REPTREE, J48, CART, Naïve Bayes and neural networks were applied to data collected from medical practitioners in South Africa. The Weka tool was used for this purpose, and the accuracy of Naïve Bayes was found to be 85.92% [23]. In another study, a single UCI heart disease dataset was used for the prediction of heart disease with K-NN, decision tree, SVM and Naïve Bayes. The Weka software was used, and Naïve Bayes showed the highest accuracy [24].

5) Neural Network

A neural network, also called an artificial neural network, is inspired by the biological neural networks of the brain or central nervous system. It is a machine learning algorithm that can be used for classification (supervised learning). The neurons are interconnected through synapses, which allow messages to pass between them. The three major parts of a neural network are the input layer, the hidden layer and the output layer.

Three algorithms, namely neural networks, J48 decision tree and Naïve Bayes, were used together with the CfsSubsetEval attribute filter and the Best First search method for feature selection on a single UCI heart disease dataset. The Weka tool was used for that study, and the highest accuracy was shown by Naïve Bayes both with and without feature selection [25]. In another study, a neural network with feature correlation analysis (NN-FCA) was studied on the Sixth Korea National Health and Nutrition Examination Survey (KNHANES-VI) dataset. That study found that chronic renal failure and triglyceride were closely related to coronary heart disease. The NN-FCA showed the highest accuracy of 82.51% [26].

B. Feature Selection Approaches

The huge amounts of data produced by different sources pose fundamental challenges for capturing, storing, searching and sharing, and are hard to interpret and analyze [27]. The huge volume of data and the rising cost of diagnosis have motivated the use of feature selection, which in turn increases the accuracy of the model and gives better disease prediction results. The initial dataset consists of a number of attributes, some of which may not be useful, so it is necessary to remove them during data preprocessing [28]. Feature selection is therefore one of the important steps in data preprocessing: unnecessary features are removed, which improves performance and yields a better classification model. Various feature selection methods, such as wrapper, filter, embedded, ensemble and hybrid methods, have been applied to heart disease prediction [29].


1. Trestbps, Restecg, Slope, CA, Thal and Age were the best features selected by a novel feature selection approach; with these features, the accuracy was found to be 93% for a neural network and 89% for the SMO algorithm [28].

2. Recursive feature elimination (RFE) was applied together with a stochastic gradient boosting algorithm; the selected features were chest pain, ca and thal, with an accuracy of 95.45% [30].

3. Four feature selection methods and 12 classification algorithms were applied to heart disease prediction. The different feature selection methods produced different sets of best features. The highest accuracy, 84.81%, was obtained with SVM-linear and Naïve Bayes [31].

III. DATASET

A. Data Sources

In this study, five heart disease datasets (Cleveland, Switzerland, Hungarian, V.A. Medical and Statlog project heart disease) were collected from the publicly available UCI machine learning repository [13]. The five datasets were combined into a single dataset (i.e., the combined dataset) for better model performance. The combined dataset consists of 1190 instances with 14 attributes. The characteristics of the five heart disease datasets and of the combined dataset are shown in Table 1.

TABLE 1. Characteristics of the five heart disease datasets and the combined dataset

Datasets | No of Instances | No of Attributes | Missing Values
Statlog project | 270 | 14 | No
Cleveland | 303 | 14 | No
Hungarian | 294 | 14 | Yes
V.A. Long Beach | 200 | 14 | Yes
Switzerland | 123 | 14 | Yes
Combined Dataset | 1190 | 14 | Missing values replaced with mode value

B. Attributes Description

The dataset consists of 14 attributes. The attribute to be predicted is “Diagnosis”, and the remaining 13 are input attributes. The attribute descriptions are shown in Table 2.

TABLE 2. Description of attributes in the dataset

Features | Type | Description, Value
Sex | Discrete | 1 = male; 0 = female
Chest Pain | Discrete | Value 1: typical angina; Value 2: atypical angina; Value 3: non-anginal pain; Value 4: asymptomatic
FBS | Discrete | Fasting blood sugar. 0 = false (FBS < 120 mg/dl); 1 = true (FBS > 120 mg/dl)
Restecg (Rest Electrocardiographic) | Discrete | Value 0: normal; Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
Exang | Discrete | Exercise induced angina. 1 = yes; 0 = no
Slope | Discrete | Peak exercise slope measure. Value 1: upsloping; Value 2: flat; Value 3: downsloping
ca | Discrete | Number of major vessels colored by fluoroscopy (0-3)
thal | Discrete | Patient heart rate. 3 = normal; 6 = fixed defect; 7 = reversible defect
Age | Continuous | Patient's age, 28 to 77
Trestbps | Continuous | Resting blood pressure of the patient, measured in mm Hg on admission to the hospital. 80-200
Chol | Continuous | Patient serum cholesterol measured in mg/dl. 85, 100 - 200 - 394, 400-603
thalach | Continuous | Patient maximum heart rate achieved. 60-202; Low: below 50, Normal: 51-119, High: 120-202 [6] [7]
oldpeak | Continuous | ST depression induced by exercise relative to rest. -2.6 to -0.1, 0, 0.1 to 6.2, 120
Diagnose | Discrete | 0: No heart disease; 1: Heart disease
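The combination of the source datasets and the mode replacement of missing values summarized in Table 1 can be illustrated with a minimal sketch. It is written in Python rather than the R tool used in this study, and it makes several assumptions: the "?" missing-value marker, the function names `mode_impute` and `combine_sources`, and the toy rows are all hypothetical and are not taken from the original analysis. Each source is imputed on its own before stacking, following the statement that the mode is taken per dataset source.

```python
from collections import Counter

MISSING = "?"  # assumed missing-value marker; adjust to match the actual files

# Instance counts per source as listed in Table 1.
TABLE1_INSTANCES = {
    "Statlog project": 270,
    "Cleveland": 303,
    "Hungarian": 294,
    "V.A. Long Beach": 200,
    "Switzerland": 123,
}
total_instances = sum(TABLE1_INSTANCES.values())  # size of the combined dataset


def mode_impute(rows):
    """Replace missing entries in each column with that column's most frequent observed value."""
    modes = []
    for col in range(len(rows[0])):
        observed = [row[col] for row in rows if row[col] != MISSING]
        # most_common(1) yields [(value, count)] for the most frequent value.
        modes.append(Counter(observed).most_common(1)[0][0])
    return [[modes[c] if v == MISSING else v for c, v in enumerate(row)] for row in rows]


def combine_sources(sources):
    """Impute each source on its own, then stack all rows into one dataset."""
    combined = []
    for rows in sources.values():
        combined.extend(mode_impute(rows))
    return combined


# Hypothetical toy rows (ca, thal, diagnosis); the values are invented for illustration.
toy_sources = {
    "hungarian": [[0, 3, 0], ["?", 3, 1], [0, "?", 0]],
    "switzerland": [[1, 7, 1], [1, "?", 1], ["?", 7, 0]],
}
combined = combine_sources(toy_sources)
```

The sketch does not handle a column in which every value of a source is missing; such a column would need a fallback, for example the mode over all sources.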


IV. METHODOLOGY

A. Data Preprocessing

In the data preprocessing stage, missing values are first replaced with the mode value computed for the particular dataset source. Second, because heart disease patients may have high values for the respective attributes, these values (i.e., the outliers in the dataset) are not removed. Normalization (normalize <- function(x) {return ((x - min (x)) / (max(x) - min(x)))}) has been carried out because the dataset contains attributes with different measuring units.

B. Data Analysis

After normalization, to build a classification model, the combined dataset with 14 attributes is divided into training and testing data with percentage splits of 60–40%, 70–30% and 80–20%, with set.seed (123). The R-based CARET package (https://cran.r-project.org/web/packages/) is used for data splitting and for pre-processing steps such as normalization. The classification algorithms K-Nearest Neighbor (KNN, library “caret”, method = 'knn', tuneLength = 10), Support Vector Machine (SVM, library “caret”, method = 'svmLinear', tuneLength = 10), Random Forest (RF, library “randomForest”, method = 'rf', ntree = 500, importance = TRUE), Naïve Bayes (NB, library “e1071”, method = 'naive_bayes') and Neural Network (NN, library “nnet”, method = 'nnet', trace = FALSE) are evaluated with 10-fold cross-validation (method = "cv", number = 10) on the aforementioned training and testing data. The feature selection methods are the correlation matrix (library “mlbench”) and recursive feature elimination (RFE) with the random forest algorithm (library “mlbench”, functions = rfFuncs, method = "cv", number = 10). Variable importance is estimated with a regression method (library “mlbench”) and with rank features by importance (libraries “class” and “mlbench”) using a learning vector quantization (LVQ) model; all of these are carried out with the CARET package in the R tool (http://topepo.github.io/caret/index.html). The performance of a model on the test data is measured by accuracy, sensitivity/recall and specificity in the R tool. Sensitivity measures the proportion of true positives (risk class) that are correctly identified, and specificity measures the proportion of true negatives (normal class) that are correctly identified. Thus, the predictive capabilities of the classifiers are measured by the sensitivity and specificity values.

V. RESULTS

A. Performance on Combined Dataset

To the best of the authors' knowledge, no study has addressed the combined dataset (i.e., the five heart disease datasets Cleveland, Switzerland, Hungarian, V.A. Medical and Statlog project heart disease) with the five algorithms used in this study [24], [32]. The R tool is used to determine which classification model performs better on the combined dataset with 13 features. Performance is compared on the basis of the accuracy of each classification algorithm. Among the five classification algorithms used, the highest accuracy in all three percentage splits (60–40, 70–30 and 80–20) has been observed for random forest, ranging from 91.39% to 94.96%, while the average accuracy over all classifiers ranges from 85.92% to 89.41% (refer to Table 3).
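The normalization and percentage splits of Section IV can be illustrated with a minimal sketch. It is written in Python rather than R; the `normalize` function mirrors the min-max formula given in Section IV-A, while `percentage_split`, the guard for constant columns and the stand-in data are assumptions made for illustration. A seed of 123 in Python does not reproduce the partitions produced by set.seed(123) in R, and the simple shuffle used here is not necessarily how the CARET package partitions the data.

```python
import random


def normalize(values):
    """Min-max scale a numeric column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Guard added for this sketch: a constant column would divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def percentage_split(rows, train_fraction, seed=123):
    """Shuffle row indices with a fixed seed and cut them into training and testing parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Hypothetical resting blood pressure readings in mm Hg.
trestbps = [80, 120, 140, 200]
scaled = normalize(trestbps)

# Stand-in for the 1190 instances of the combined dataset.
rows = list(range(1190))
SPLITS = {"60-40": 0.6, "70-30": 0.7, "80-20": 0.8}
split_sizes = {}
for label, fraction in SPLITS.items():
    train, test = percentage_split(rows, fraction)
    split_sizes[label] = (len(train), len(test))
```

With 1190 instances, the three splits in this sketch leave 476, 357 and 238 instances for testing.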
matrix (library “mblench”), recursive feature elimination

TABLE 3. Performance comparison of the classifiers on the combined dataset with 13 features for the three percentage splits (accuracy in %)

Algorithms | 60–40 Accu | 60–40 Sensi | 60–40 Speci | 70–30 Accu | 70–30 Sensi | 70–30 Speci | 80–20 Accu | 80–20 Sensi | 80–20 Speci
KNN | 84.03 | 0.7917 | 0.8808 | 83.75 | 0.8176 | 0.853 | 86.55 | 0.847 | 0.879
SVM | 84.45 | 0.791 | 0.888 | 85.15 | 0.786 | 0.904 | 87.82 | 0.8381 | 0.909
Random Forest (RF) | 91.39 | 0.856 | 0.961 | 92.16 | 0.880 | 0.954 | 94.96 | 0.904 | 0.985
Naïve Bayes | 84.45 | 0.842 | 0.846 | 85.71 | 0.729 | 0.959 | 88.66 | 0.800 | 0.954
Neural Network | 85.29 | 0.773 | 0.919 | 85.43 | 0.786 | 0.909 | 89.08 | 0.876 | 0.902
Avg Accuracy | 85.92 | 0.810 | 0.898 | 86.44 | 0.799 | 0.915 | 89.41 | 0.853 | 0.925

Accu: Accuracy; Sensi: Sensitivity; Speci: Specificity; KNN: k-nearest neighbors algorithm; SVM: support vector machine
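The three measures reported in the tables follow from the counts of a confusion matrix: accuracy = (TP + TN) / (TP + TN + FP + FN), sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below, in Python rather than R, shows the calculation under the assumption that class 1 is the risk (positive) class and class 0 the normal class; the function name `classification_metrics` and the ten toy labels are hypothetical.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute accuracy, sensitivity and specificity from paired class labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # risk case correctly flagged
        elif a == positive:
            fn += 1  # risk case missed
        elif p == positive:
            fp += 1  # normal case wrongly flagged
        else:
            tn += 1  # normal case correctly cleared
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Invented labels: 4 risk cases and 6 normal cases.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(actual, predicted)
```

For these toy labels there are 3 true positives, 1 false negative, 5 true negatives and 1 false positive. The sketch assumes that both classes occur in `actual`; otherwise one of the ratios has a zero denominator.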


B. Performance on Feature Selection Attributes

Feature selection is performed on the combined dataset to select a subset of relevant features for model building, with the aim of improving model accuracy. In this study, a correlation analysis was performed to identify highly correlated attributes; with a correlation cutoff of 0.75, a high correlation was found between the exang and oldpeak attributes (refer to Figure 1).

Fig. 1. Correlation plot of the combined dataset

After the removal of exang and oldpeak from the combined dataset, the different feature selection methods (see Data Analysis) implemented in the R tool were applied. The 10 selected features/attributes thal, ca, chestpain, age, thalach, chol, trestbps, slope, restecg and sex are common across the 3 feature selection methods. The attribute FBS is observed in the regression method and in rank feature by importance. The features selected by the 3 methods showed a similar order of arrangement, with minor differences in the ranking order (refer to Table 4).

TABLE 4. The order of best selected features after the removal of exang and old peak attributes

Recursive Feature Elimination | Regression to Calculate Variable Importance | Rank Feature by Importance
thal | thal*** 12.650 | thal 0.816
ca | ChestPain*** 7.368 | ChestPain 0.766
ChestPain | ca*** 7.159 | thalach 0.734
oldpeak | thalach*** 4.330 | exang 0.711
thalach | Trestbps** 2.731 | oldpeak 0.704
Trestbps | Sex* 2.520 | ca 0.664
Age | Slope* 2.226 | Age 0.657
Chol | Restecg 1.721 | Sex 0.632
exang | Chol 0.628 | Slope 0.605
Slope | Exang 0.622 | Trestbps 0.547
Sex | FBS 0.427 | Chol 0.546
restecg | oldpeak 0.057 | Restecg 0.542
FBS | Age 0.050 | FBS 0.532

Significance code: ***: 0, **: 0.001, *: 0.01

Based on the order of the features (refer to Table 4), the common 8 and 6 selected features (refer to Table 5) are taken into consideration to build a model.

TABLE 5. The common selected features

8 selected features | thal, chest pain, ca, thalach, trestbps, age, sex and chol
6 selected features | thal, chest pain, ca, thalach, trestbps, age

With the 8 selected features, the highest accuracy in all three percentage splits (60–40, 70–30 and 80–20) has been observed for random forest, ranging from 91.18% to 94.96%, while the average accuracy ranges from 85.99% to 89.07% (refer to Table 6). Similarly, with the 6 selected features, the highest accuracy in all three percentage splits (60–40, 70–30 and 80–20) has been observed for random forest, ranging from 89.64% to 92.44%, while the average accuracy ranges from 85.20% to 88.40% (refer to Table 7).

The highest accuracy with the 8 and 6 selected features is obtained with random forest (refer to Tables 6 and 7), which agrees with the result for the combined dataset with 13 features (refer to Table 3). With the 8 selected features, the average accuracies differ only insignificantly (by about 0.5%) in the 70–30 and 80–20 percentage splits (refer to Table 6), whereas the 6 selected features (refer to Table 7) show a decrease in average accuracy (1.24% in the 70–30 split and 1.18% in the 80–20 split). This decrease could be due to the dip in the performance of SVM and RF (in both the 70–30 and 80–20 splits) and of NN (in the 80–20 split) relative to the combined dataset with 13 features (refer to Table 3).

V. DISCUSSION

In this study, heart disease prediction has been evaluated with the classification and feature selection algorithms implemented in the CARET package of the R tool, using the combined dataset. The highest accuracy is shown by random forest in all three percentage splits (without and with feature selection). This indicates that the random forest algorithm shows the best performance when compared with existing model accuracies [22], [33]. The average accuracies with the 8 selected features are similar to those of the combined dataset, indicating that these features could be used for the prediction of heart disease, as mentioned in a previous study where the KNN algorithm was used [17].
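The correlation analysis of Section V-B, in which attribute pairs above a cutoff of 0.75 are flagged, can be illustrated with a minimal sketch. It is written in Python rather than R and is not the CARET routine used in this study; the function names `pearson` and `highly_correlated_pairs` and all toy values are hypothetical, and the use of the absolute Pearson coefficient is an assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(columns, cutoff=0.75):
    """List attribute pairs whose absolute correlation exceeds the cutoff."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > cutoff:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Invented toy columns; they are not taken from the combined dataset.
toy = {
    "exang": [0, 0, 1, 1, 1, 0],
    "oldpeak": [0.0, 0.2, 1.8, 2.5, 2.0, 0.1],
    "chol": [210, 250, 190, 260, 240, 205],
}
flagged = highly_correlated_pairs(toy)
```

In this toy example only the exang and oldpeak pair exceeds the cutoff. The sketch assumes that no column is constant, since a zero variance would make the coefficient undefined.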


TABLE 6. Performance comparison of the classifiers on the combined dataset with 8 features for the three percentage splits (accuracy in %)

Algorithms | 60–40 Accu | 60–40 Sensi | 60–40 Speci | 70–30 Accu | 70–30 Sensi | 70–30 Speci | 80–20 Accu | 80–20 Sensi | 80–20 Speci
KNN | 83.82 | 0.796 | 0.873 | 83.47 | 0.811 | 0.853 | 86.97 | 0.866 | 0.872
SVM | 84.24 | 0.777 | 0.896 | 81.79 | 0.748 | 0.873 | 85.71 | 0.828 | 0.879
Random Forest | 91.18 | 0.875 | 0.942 | 91.60 | 0.880 | 0.944 | 94.96 | 0.914 | 0.977
Naïve Bayes | 86.76 | 0.810 | 0.915 | 87.39 | 0.811 | 0.924 | 89.5 | 0.857 | 0.924
Neural Network | 85.29 | 0.777 | 0.915 | 85.71 | 0.773 | 0.924 | 88.24 | 0.838 | 0.917
Avg Accuracy | 86.25 | 0.807 | 0.908 | 85.99 | 0.804 | 0.903 | 89.07 | 0.860 | 0.913

Accu: Accuracy; Sensi: Sensitivity; Speci: Specificity; KNN: k-nearest neighbors algorithm; SVM: support vector machine

TABLE 7. Performance comparison of the classifiers on the combined dataset with 6 features for the three percentage splits (accuracy in %)

Algorithms | 60–40 Accu | 60–40 Sensi | 60–40 Speci | 70–30 Accu | 70–30 Sensi | 70–30 Speci | 80–20 Accu | 80–20 Sensi | 80–20 Speci
KNN | 85.5 | 0.812 | 0.892 | 85.15 | 0.792 | 0.899 | 87.82 | 0.847 | 0.902
SVM | 82.77 | 0.773 | 0.873 | 82.07 | 0.735 | 0.888 | 83.61 | 0.809 | 0.857
Random Forest | 90.76 | 0.870 | 0.938 | 89.64 | 0.886 | 0.904 | 92.44 | 0.923 | 0.924
Naïve Bayes | 86.97 | 0.824 | 0.907 | 84.03 | 0.792 | 0.878 | 90.76 | 0.885 | 0.924
Neural Network | 85.5 | 0.791 | 0.907 | 85.15 | 0.786 | 0.904 | 87.39 | 0.828 | 0.909
Avg Accuracy | 86.3 | 0.814 | 0.903 | 85.20 | 0.798 | 0.894 | 88.40 | 0.858 | 0.903

Accu: Accuracy; Sensi: Sensitivity; Speci: Specificity; KNN: k-nearest neighbors algorithm; SVM: support vector machine
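The "Avg Accuracy" rows of Tables 3, 6 and 7 appear to be the arithmetic mean of the five classifier accuracies in each percentage split. The sketch below, in Python rather than R, recomputes these means from the accuracy values printed in the three tables; the dictionary layout and the names `ACCURACY` and `averages` are assumptions made for illustration.

```python
from statistics import mean

# Accuracies (%) in the order KNN, SVM, Random Forest, Naive Bayes, Neural Network,
# copied from Tables 3, 6 and 7.
ACCURACY = {
    "13 features": {
        "60-40": [84.03, 84.45, 91.39, 84.45, 85.29],
        "70-30": [83.75, 85.15, 92.16, 85.71, 85.43],
        "80-20": [86.55, 87.82, 94.96, 88.66, 89.08],
    },
    "8 features": {
        "60-40": [83.82, 84.24, 91.18, 86.76, 85.29],
        "70-30": [83.47, 81.79, 91.60, 87.39, 85.71],
        "80-20": [86.97, 85.71, 94.96, 89.5, 88.24],
    },
    "6 features": {
        "60-40": [85.5, 82.77, 90.76, 86.97, 85.5],
        "70-30": [85.15, 82.07, 89.64, 84.03, 85.15],
        "80-20": [87.82, 83.61, 92.44, 90.76, 87.39],
    },
}

# Mean accuracy over the five classifiers for every feature set and split.
averages = {
    feature_set: {split: round(mean(values), 2) for split, values in splits.items()}
    for feature_set, splits in ACCURACY.items()
}
```

The recomputed means should agree with the printed "Avg Accuracy" rows up to rounding in the second decimal place.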

However, the further analysis with the 6 selected features showed a decrease in average accuracy compared with the combined dataset; this could be due to underfitting with so few attributes and to running the classifiers under default conditions. These results are in general agreement with a previous study in which 5 of the 6 features selected here were used with a different feature selection method [11], indicating that these features are the minimum number needed to build a reliable model. It is expected that dropping further features from the 6 selected features may not give a better prediction. However, this contrasts with a study in which 3 selected features (chest pain, ca and thal) showed 95.45% accuracy with a different feature selection method; that result suggests the 3 selected features might be important for improving accuracy [30].

VI. CONCLUSION

The accuracy of a model is known to depend on the database, preprocessing, analytical tools and techniques. The present study shows that it is important to select a minimal set of prominent attributes to improve performance, compared with using all features of the dataset. This study shows that random forest can be used as a good classification algorithm for the accurate prediction of heart disease, with an accuracy of 90–95%. However, extending the dataset with other non-modifiable risk factors (genetic factors) and modifiable risk factors such as smoking, lack of physical exercise and alcohol consumption could result in better risk prediction of heart disease. The small differences in accuracy between the full dataset and the selected features (8 and 6) indicate that these features can be useful for the prediction of heart disease. However, dropping further features from the 6 selected features may not lead to better model prediction.

REFERENCES

[1] Bhatnagar, P., Wickramasinghe, K., Williams, J., Rayner, M., and Townsend, N. (2015). The Epidemiology of Cardiovascular Disease in the UK 2014. Heart, 101, 1182-1189.
[2] Thiyagaraj, M., and Suseendran, G. (2017). Survey on Heart Disease Prediction System Based on Data Mining Techniques. Indian Journal of Innovations and Developments, 6(1), 1-9.
[3] https://www.heartfoundation.org.au/your-heart/heart-conditions.
[4] Ullah, F., Abdullah A, H., Kaiwartya, O., Kuman, S., and Arshad M, M. (2017). Medium Access Control (MAC) for Wireless Body Area Network (WBAN): Superframe Structure, Multiple Access Technique, Taxonomy, and Challenges. Hum Cent Comput Inf Sci, 7(34), 1-39.
[5] Ullah, F., Abdullah A, H., Kaiwartya, O., Kuman, S., & Lloret, J., and Arshad, M, Md. (2017). EETP-MAC: Energy Efficient Traffic Prioritization for Medium Access Control in Wireless Body Area Networks. Telecommunication Systems, 1-23.
[6] Ullah, F., Abdullah, A, H., Kaiwartya, O., and Cao, Y. (2017). TraPy-MAC: Traffic Priority Aware Medium Access Control Protocol for Wireless Body Area Network. J Med Syst, 41, 93, 1-18.
[7] Ullah, F., Abdullah, A., H., Kaiwartya, O., and Arshad, M., M. (2017). Traffic Priority-Aware Adaptive Slot Allocation for Medium Access Control

Protocol in Wireless Body Area Network. Computers, 6, 9, 1-26.
[8] Raghupathi, W., and Raghupathi, V. (2014). Big Data Analytics in Healthcare: Promise and Potential. Health Information Science and Systems, 2, 3.
[9] Palanisamy, V. and Thirunavukarasu, R. (2017). Implications of Big Data Analytics in Developing Healthcare Frameworks – A Review. Journal of King Saud University–Computer and Information Sciences, 1-11, in press.
[10] Ghasemi, M. and Amyot, D. (2016). Process Mining In Healthcare: A Systematised Literature Review. Int. J. of Electronic Healthcare, 9(1), 60-88.
[11] Dominic, Vinitha., Deepa, Gupta. and Sangita, Khare. (2015). An Effective Performance Analysis of Machine Learning Techniques for Cardiovascular Disease. Applied Medical Informatics, 36(1), 23-32.
[12] Chandralekha, M. and Shenbagavadivu, N. (2018). Performance Analysis of Various Machine Learning Techniques to Predict Cardiovascular Disease: An Emprical Study. Appl. Math. Inf. Sci, 12(1), 217-226.
[13] UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Heart+Disease, http://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29.
[14] Hajar, R. (2017). Risk Factors for Coronary Artery Disease: Historical Perspectives. Heart Views, 18, 109-114.
[15] World Heart Federation. (2018). Available at https://www.world-heart-federation.org/resources/risk-factors/ (accessed 26 June 2018). Google Scholar.
[16] Neesha, Jothi., Nur, Aini., Abdul, Rashid. and Wahidah, Husain. (2015). Data Mining in Healthcare–A Review. Procedia Computer Science, 72, 306-313.
[17] Enriko I. K. A., Muhammad, Suryanegara. and Dadang, Gunawan. (2016). Heart Disease Prediction System using k-Nearest Neighbor Algorithm with Simplified Patient's Health Parameters. Journal of Telecommunication, Electronic and Computer Engineering, 8(12), 59-65.
[18] Jabbar, M. A. (2017). Prediction of Heart Disease Using K-Nearest Neighbor and Particle Swarm Optimization. Biomedical Research, 28(9), 4154-4158.
[19] Boshra, Bahrami., Mirsaeid, Hosseini, Shirvani. (2015). Prediction and Diagnosis of Heart Disease by Data Mining Techniques. Journal of Multidisciplinary Engineering Science and Technology, 2(2), 164-168.
[20] Janardhanan, P., L, Heena. and Sabika, F. (2015). Effectiveness of Support Vector Machines in Medical Data Mining. Journal of Communications Software and Systems, 11(1), 25-30.
[21] Jabbar, M. A., Deekshatulu, B. L. and Priti, Chandra. (2016). Intelligent Heart Disease Prediction System Using Random Forest and Evolutionary Approach. Journal of Network and Innovative Computing, 4, 175-184.
[22] Yeshvendra, K, Singh., Nikhil, Sinha. and Sanjay, K, Singh. (2016). Heart Disease Prediction System Using Random Forest. Advances in Computing and Data Sciences, 613-623.
[23] Lavanya, M., Gomathi, P, M. (2016). Prediction of Heart Disease using Classification Algorithms. International Journal of Advanced Research in Computer Engineering & Technology, 5(7), 2173-2175.
[24] Sen, S. K. (2017). Predicting and Diagnosing of Heart Disease Using Machine Learning Algorithms. International Journal of Engineering and Computer Science, 6, 21623-21631.
[25] Shafique, Umair, Majeed, Fiaz., Qaiser, Haseeb and U., l, Mustafa, Irfan. (2015). Data Mining in Healthcare for Heart Diseases. International Journal of Innovation and Applied Studies, 10, 2028-9324.
[26] Karayılan, T. and Kılıç, Ö. (2017). Prediction of heart disease using neural network. International Conference on Computer Science and Engineering (UBMK), Antalya, 719-723. doi: 10.1109/UBMK.2017.8093512.
[27] Rishika, R., A. and Suresh, K. P. (2016). Predictive Big Data Analytics in Healthcare. Second International Conference on Computational Intelligence & Communication Technology, IEEE.
[28] Suganya, R., Rajaram, S., Sheik, Abdullah, A. and Rajendran, J. (2016). A Novel Feature Selection Method for Predicting Heart Disease with Data Mining Techniques. Asian J of Info Tech, 15, 1314-1321.
[29] Jain, D., Singh, V. (2018). Feature Selection and Classification Systems for Chronic Disease Prediction: A Review. Egyptian Informatics J. 1-11, in press. https://doi.org/10.1016/j.eij.2018.03.002.
[30] Kakulapati, V., Ankith, Kirti, Vaibhav, Kulkarni. and Charan, Pandit, Raj. (2017). Predictive Analysis of Heart Disease using Stochastic Gradient Boosting along with Recursive Feature Elimination. International Journal of Science and Research, 6, 909-912.
[31] Hidayet, Takci. (2018). Improvement of Heart Attack Prediction by the Feature Selection Methods. Turk J Elec Eng & Comp Sci, 26, 1-10.
[32] Randa, El-Bialy., Mostafa, A, Salamay, Omar, H, Karam and M, Essam, Khalifa. (2015). Feature Analysis of Coronary Artery Heart Disease Data Sets. Procedia Computer Science, 65, 459-468.
[33] Patil, P., R. Kinariwala, A., S. (2017). Automated Diagnosis of Heart Disease using Random Forest Algorithm. International Journal of Advance Research, Ideas and Innovations in Technology, 3(2), 579-589.
