International Journal of Innovative Computing 9(1) 39-46

Classification and Feature Selection Approaches by Machine Learning Techniques: Heart Disease Prediction

N. Satish Chandra Reddy, Song Shue Nee, Lim Zhi Min & Chew Xin Ying
School of Computer Sciences,
11800, Universiti Sains Malaysia, Pulau Pinang, Malaysia
Email: [email protected]

Submitted: 03/08/2018. Revised edition: 05/12/2018. Accepted: 26/12/2018. Published online: 30/05/2019
DOI: https://doi.org/10.11113/ijic.v9n1.210

Abstract: Heart disease is one of the major causes of death worldwide. Diagnosing heart disease is expensive, so it is necessary to predict the risk of heart disease from a set of selected features. Feature selection methods can be valuable for reducing the cost of diagnosis by identifying the important attributes. The objectives of this study are to build a classification model for prediction and to determine which selected features play a key role in the prediction of heart disease, using the Cleveland and Statlog project heart datasets. The accuracy of the random forest algorithm, in both the classification and the feature selection models, was observed to be 90–95% across three different percentage splits. The 8 and 6 selected features appear to be the minimum feature requirements for building a well-performing model; dropping further features from the 8 or 6 selected features may not lead to better performance of the prediction model.

Keywords: Machine Learning, Feature Selection, R tool, Classification, Prediction

I. INTRODUCTION

Heart disease is a major cause of death globally. According to the World Health Organization (WHO), deaths due to heart disease numbered around 17.7 million (31%) in 2015, and the figure is estimated to increase by 2020. Every year nearly 20 million people die, indicating that heart disease is a leading cause of death [1]. The group of diseases related to both the heart and the blood vessels is referred to as cardiovascular disease (CVD). CVD includes coronary heart disease (CHD), also known as coronary artery disease (CAD), which is a disease of the arteries that supply oxygen and blood to the heart and is associated with lifestyle conditions and age. CHD occurs when the coronary arteries narrow due to the deposition of fatty material (plaque), which is termed atherosclerosis. Atherosclerosis reduces the blood flow to the heart and is said to be the underlying cause of complications such as angina and heart attacks. Angina typically refers to pain in the chest, which can often radiate to the shoulder and left arm. A heart attack, also known as myocardial infarction (MI), cardiac infarction, or coronary thrombosis or occlusion, is a condition in which an artery ruptures or narrows. A blood clot formed, partially or completely, during the repair of a ruptured blood vessel reduces the blood flow to the heart muscle, causing a heart attack. Chest pain, which may also radiate to the left arm, is a sign of a heart attack. The modifiable risk factors are smoking, high blood pressure, high cholesterol, obesity, unhealthy diet, diabetes, depression and stress; the non-modifiable risk factors are age, gender, genetic factors, race and ethnicity. Heart failure, also known as congestive heart failure, affects the right side, the left side or both sides of the heart; once damaged, the heart cannot heal. It is a serious condition in which the heart muscle becomes damaged and weak and does not pump blood properly to meet the needs of the rest of the body. Some factors, such as heart diseases, high blood pressure, diabetes and cardiomyopathy (a disease of the heart muscle), can over

N. Satish Chandra Reddy et al. / IJIC Vol. 9:1(2019) 39-46

time, leave the heart too weak to fill and pump properly. Other circulatory system malfunctions are peripheral artery disease (PAD: narrowing of the arteries in the arms, legs, stomach and head), venous thromboembolism (VTE: a blood clot in a vein) and aortic aneurysm (a weakened artery wall that allows the artery to widen) [2], [3].

Based on the disease statistics, the rate of heart disease is increasing, and the disease cannot be identified by visual inspection because it is caused by many risk factors and does not show noticeable symptoms. However, continuous monitoring of vital signs such as heartbeat, blood pressure, glucose levels, ECG, EEG, EMG, respiratory rate and temperature would be useful for the early detection of different disease abnormalities. This monitoring has been made possible by an advancement in wireless communication technology, the Wireless Body Area Network (WBAN), which consists of tiny Bio-Medical Sensors (BMSs) that are available at reasonable cost. The three methods of BMS deployment are on-body wearable (to monitor blood pressure, ECG and EMG), in-body implantation (to monitor lungs, liver and kidney) and off-body (to monitor physical health conditions, body position, arm positions, walking and running). The BMSs are monitored through a Body Area Network Coordinator (BANC) following a star network topology [4]. Patient monitoring data is categorized as non-emergency data (normal readings of vital signs such as temperature and glucose levels) and emergency data (signs such as low respiratory rate and high blood pressure). The Medium Access Control (MAC) superframe structure uses IEEE 802.15.4 as a protocol [5]. The heterogeneous patient data is classified into Ordinary Packet (OP), Reliability Data Packet (RP), Critical Data Packet (CP) and Delay Data Packet (DP). Although these classifications help to deliver patient data to medical doctors without loss and with less energy consumption by the BMSs, they do not consider low and high threshold values of vital signs when categorizing emergency data. This has been overcome with a traffic-priority-based MAC protocol and with slot allocation based on alert signals to the BMSs, respectively, in which vitals such as heart rate, blood pressure, respiratory rate and temperature are categorized into threshold ranges of low, normal and high values. These values are used to provide appropriate and dedicated timeslots to non-emergency and emergency BMSs. The BANC allocates slots based on the severity of the threshold values of the vital signs and also resolves conflicts in slot allocation. This protocol improves throughput, reduces packet delay for non-emergency and emergency data, and consumes minimal energy of the BMSs and BANC [6], [7].

An enormous amount of data is being produced by hospitals, medical centers and other organizations. This data is not being used properly, although it contains potential information for further proceedings such as clinical decision support, disease surveillance and population health management [8], [9]. On the other hand, clinical decisions made by doctors may leave hidden information in the data unused, leading to errors and unwanted biases. This might affect the medical costs and the quality of health services provided to patients [10]. Fewer medical errors, less unwanted practice variation, and improved patient outcomes and safety can be achieved by integrating clinical decision support with computer-based patient records. Researchers and clinicians are studying different heart disease datasets with different machine learning classification algorithms and feature selection methods to produce possible predictions for heart disease [11], [12]. In this paper, the main objectives are to build a model with the best performance and to identify which selected features play a key role in the prediction of heart disease, using the Cleveland and Statlog project heart datasets from UCI [13] with the R tool.

II. LITERATURE REVIEW

This section reviews work related to classification and feature selection methods applied to the prediction of heart disease. The heart, one of the main organs of the body, pumps blood through the blood vessels (i.e., the circulatory system). Oxygen and other materials such as nutrients, hormones and waste substances are carried to the other parts of the body through the blood. Thus, the heart plays an important role in the circulatory system. A condition in which the heart is impaired is referred to as heart disease. Heart disease, or cardiovascular disease, refers to disorders of the heart and blood vessels. The different types of heart disease are coronary heart disease, heart attack (blocked vessels or myocardial infarction) and angina (coronary artery disease). Coronary heart disease leads to heart attack and is said to be one of the major causes of death globally (WHO). Non-modifiable risk factors such as age, gender, genetic or family history, and ethnicity or race, and modifiable risk factors such as cholesterol, high blood pressure, hypertension, lack of physical exercise, smoking, obesity and stress are responsible for heart diseases [14] [15].

A. Data Mining and Machine Learning Algorithms

Data mining is the process of exploring data for new and valuable information; it can significantly improve the quality of clinical decisions and plays an important role in intelligent medical systems. Data mining draws on various disciplines. Machine learning is one of them and comprises techniques such as classification, clustering, regression, association analysis and outlier detection [16]. The classification algorithms used for the prediction of heart disease are reviewed below:


1) K-Nearest Neighbors Algorithm

K-Nearest Neighbors (K-NN) is a simple classifier that is easy to implement and understand and requires little training time, but it cannot handle noise well, and the whole training set is used for prediction. K-NN with a weighting parameter has been used for the prediction of heart disease. Among the 13 attributes in the UCI heart disease dataset, 8 attributes were selected for that study because they are simple to measure. The study reports an accuracy of 81.9% and states that the 8 attributes (age, sex, chest pain, trestbps, trestbpd, restecg, thalrest and exang) are more than enough to predict heart disease [17]. A similar technique (KNN) has been used together with a feature selection technique, particle swarm optimization (PSO), for heart disease prediction, where it showed 100% accuracy [18].

2) Support Vector Machine (SVM)

Based on the kernel function, support vector classifiers are divided into types such as linear, nonlinear, radial basis function (RBF), sigmoid and polynomial. The support vector machine uses a hyperplane to separate the data points. Different classification techniques (K-Nearest Neighbors, J48 decision tree, SMO and Naïve Bayes) were applied to 8 attributes with 10-fold cross-validation in WEKA. Four features were extracted using the gain ratio evaluation technique. The highest accuracy, 83.73%, was obtained with J48 [19]. In another study, datasets on heart disease, diabetes and cancer were classified using SVM, RBF and Naïve Bayes in Weka. Among the 3 classifiers, SVM showed the highest accuracy of 93.75% and appeared to be the most effective and robust classifier for predicting diseases [20].

3) Random Forest

Random forest constructs a multitude of decision trees at training time and outputs the mode of the trees' class predictions for classification and the mean prediction for regression. The different patterns in the data are evaluated by the decision trees, and the class prediction is based on the majority vote. Random forest with 10-fold cross-validation, along with feature selection methods such as chi-square and a genetic algorithm, was used for the prediction of heart disease. A single UCI heart dataset was used in that study, and the accuracy was found to be 83.70%, better than that of other algorithms such as Naïve Bayes, decision tree and neural nets [21]. In another study, the Cleveland heart disease dataset was used with random forest and 10-fold cross-validation to obtain an accuracy of 85.81%; the feature selection approach gave better accuracy [22].

4) Naïve Bayes

Naïve Bayes is a classifier based on Bayes' theorem that assumes each feature is independent of the other features. The classification algorithms REPTREE, J48, CART, Naïve Bayes and neural networks have been applied to data collected from medical practitioners in South Africa. The Weka tool was used for this purpose, and the accuracy of Naïve Bayes was found to be 85.92% [23]. In another study, a single UCI heart disease dataset was used for the prediction of heart disease with K-NN, decision tree, SVM and Naïve Bayes. The Weka software was used, and Naïve Bayes showed the highest accuracy [24].

5) Neural Network

A neural network, also called an artificial neural network, is inspired by the biological neural networks of the brain or central nervous system. It is a machine learning algorithm that can be used for classification (supervised learning). The neurons are interconnected through synapses that allow messages to pass between them. The three major parts of a neural network are the input layer, the hidden layer and the output layer.

Three algorithms (neural networks, J48 decision tree and Naïve Bayes) were applied, together with the CfsSubsetEval attribute filter and the Best First search method for feature selection, to a single UCI heart disease dataset. The Weka tool was used for this study, and the highest accuracy was shown by Naïve Bayes both with and without feature selection [25]. In another study, a neural network with feature correlation analysis (NN-FCA) was studied on the Sixth Korea National Health and Nutrition Examination Survey (KNHANES-VI) dataset. That study found that chronic renal failure and triglyceride were closely related to coronary heart disease. The NN-FCA showed the highest accuracy of 82.51% [26].

B. Feature Selection Approaches

The huge amounts of data produced by different sources have become fundamentally important to capture, store, search and share, and they are hard to interpret and analyze [27]. The huge volume of data and the increase in diagnosis cost have motivated the use of feature selection, which in turn increases the accuracy of the model and gives better results for the prediction of disease. The initial dataset consists of a number of attributes, some of which may not be useful, so it is necessary to remove them during data preprocessing [28]. Feature selection is thus one of the important steps in data preprocessing: unnecessary features are removed, which improves performance and yields a better classification model. Various feature selection methods, such as wrapper, filter, embedded, ensemble and hybrid methods, have been applied to heart disease prediction [29].
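
As one illustration of a filter-type method, the sketch below drops one attribute from every pair whose absolute Pearson correlation exceeds a cutoff. It is written in Python purely for illustration, whereas the study itself works in R. The function names `pearson` and `correlation_filter`, the toy columns, and the rule of discarding the later attribute of a correlated pair are all assumptions and not the authors' procedure; the default cutoff of 0.75 mirrors the value the paper reports for its own correlation analysis.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_filter(columns, cutoff=0.75):
    """Keep a column only if |r| with every already-kept column is <= cutoff."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy attributes: "b" is almost a multiple of "a".
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = correlation_filter(toy)
print(selected)
```

Because the filter looks only at the attributes and never at a classifier, it is cheap to run, which is the usual reason for applying a filter before a wrapper method.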


1. Trestbps, Restecg, slope, CA, Thal and Age were the best selected features under a novel feature selection approach; with the selected features the accuracy was found to be 93% and 89% with neural network and SMO algorithms, respectively [28].

2. Recursive feature elimination (RFE) has been applied with a stochastic gradient boosting algorithm; the selected features were chest pain, ca and Thal, with an accuracy of 95.45% [30].

3. Four feature selection methods and 12 classification algorithms have been used for heart disease prediction. The different feature selection methods produced different best selected features. The highest accuracy, 84.81%, was found with SVM-linear and Naïve Bayes [31].

III. DATASET

A. Data Sources

In this study, five heart disease datasets (Cleveland, Switzerland, Hungarian, V.A. Medical and Statlog project heart disease) were collected from the publicly available UCI machine learning repository [13]. The five datasets were combined into a single dataset (the combined dataset) for better model performance. The combined dataset consists of 1190 instances with 14 attributes. The characteristics of the five heart disease datasets and the combined dataset are shown in Table 1.

TABLE 1. Characteristics of the five heart disease datasets and the combined dataset

| Dataset | No. of Instances | No. of Attributes | Missing Values |
|---|---|---|---|
| Statlog project | 270 | 14 | No |
| Cleveland | 303 | 14 | No |
| Hungarian | 294 | 14 | Yes |
| V.A. Long Beach | 200 | 14 | Yes |
| Switzerland | 123 | 14 | Yes |
| Combined Dataset | 1190 | 14 | Missing values replaced with Mode Value |

B. Attributes Description

The dataset consists of 14 attributes. The attribute to be predicted is “Diagnosis”, and the remaining 13 are input attributes. The attribute descriptions are shown in Table 2.

TABLE 2. Description of attributes in the dataset

| Feature | Type | Description, Value |
|---|---|---|
| Sex | Discrete | 1 = male; 0 = female |
| Chest Pain | Discrete | Value 1: typical angina; Value 2: atypical angina; Value 3: non-anginal pain; Value 4: asymptomatic |
| FBS | Discrete | Fasting Blood Sugar, 0 = false (FBS < 120 mg/dl); 1 = true (FBS > 120 mg/dl) |
| Restecg (Rest Electrocardiographic) | Discrete | Value 0: normal; Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria |
| Exang | Discrete | Exercise induced angina. 1 = yes; 0 = no |
| Slope | Discrete | Peak exercise slope measure, Value 1: upsloping, Value 2: flat, Value 3: downsloping |
| ca | Discrete | Number of major vessels colored by fluoroscopy (0-3) |
| thal | Discrete | Patient heart rate, 3 = normal; 6 = fixed defect; 7 = reversible defect |
| Age | Continuous | Patient's age, 28 to 77 |
| Trestbps | Continuous | Resting blood pressure of patients measured in mm Hg on admission to the hospital, 80-200 |
| Chol | Continuous | Patient serum cholesterol measured in mg/dl. 85, 100 - 200 - 394, 400-603 |
| thalach | Continuous | Patient maximum heart rate achieved. 60-202, Low: below 50, Normal: 51-119, High: 120-202 [6] [7] |
| oldpeak | Continuous | ST depression made by exercise relative to rest, -2.6 to -0.1, 0, 0.1 to 6.2, 120 |
| Diagnose | Discrete | 0: No heart disease; 1: Heart disease |
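
Table 1 notes that missing values in the combined dataset were replaced with the mode, and the preprocessing in the Methodology section rescales each attribute to the range [0, 1] with min-max normalization. A minimal Python sketch of both steps on a single attribute is given here, under stated assumptions: missing entries are represented by `None`, the function names `impute_mode` and `min_max_normalize` are illustrative, and the sample values are hypothetical rather than taken from the dataset. The normalization is a direct port of the R one-liner quoted in the paper.

```python
from collections import Counter

def impute_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale to [0, 1] as (x - min) / (max - min); needs max > min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical cholesterol readings with one missing entry.
chol = [200, None, 250, 200, 300]
filled = impute_mode(chol)
scaled = min_max_normalize(filled)
print(filled)
print(scaled)
```

Imputing before scaling matters here: the minimum and maximum used for scaling are then computed on complete columns.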


IV. METHODOLOGY

A. Data Preprocessing

In the data preprocessing stage, missing values are first replaced with the mode value of the particular source dataset. Second, since heart disease patients might have high values of the respective attributes, such values (i.e., those that would be regarded as outliers in the dataset) are not removed. Normalization, normalize <- function(x) { return((x - min(x)) / (max(x) - min(x))) }, has been carried out since the dataset consists of attributes with different measuring units.

B. Data Analysis

After data normalization, the combined dataset with 14 attributes is divided into training and testing data to build a classification model, with percentage splits of 60–40%, 70–30% and 80–20%, using set.seed(123). The R-based CARET package (https://cran.r-project.org/web/packages/) is used for data splitting, for pre-processing steps such as normalization, and for the classification algorithms: K-Nearest Neighbor (KNN, library “caret”, method = 'knn', tuneLength = 10), Support Vector Machine (SVM, library “caret”, method = 'svmLinear', tuneLength = 10), Random Forest (RF, library “randomForest”, method = 'rf', ntree = 500, importance = TRUE), Naïve Bayes (NB, library “e1071”, method = 'naive_bayes') and Neural Network (NN, library “nnet”, method = 'nnet', trace = FALSE). The algorithms are evaluated with 10-fold cross-validation (method = "cv", number = 10) on the aforementioned training and testing data. The feature selection methods are a correlation matrix (library “mlbench”) and recursive feature elimination (RFE) with the random forest algorithm (library “mlbench”, functions = rfFuncs, method = "cv", number = 10). Variable importance is estimated with a regression method to calculate variable importance (library “mlbench”) and with rank feature by importance (library “class” and “mlbench”) using a learning vector quantization (LVQ) model, all through the CARET package in the R tool (http://topepo.github.io/caret/index.html). The performance of a model on the test data is measured by accuracy, sensitivity/recall and specificity in the R tool. Sensitivity and specificity measure the proportion of true positives (risk class) and true negatives (normal class) that are correctly identified, respectively. Thus the predictive capabilities of the classifiers are measured by the sensitivity and specificity values.

V. RESULTS

A. Performance on Combined Dataset

To the best of the authors' knowledge, no studies have addressed the combined dataset (i.e., the five heart disease datasets: Cleveland, Switzerland, Hungarian, V.A. Medical and Statlog project heart disease) with the five algorithms used in this study [24] [32]. The R tool is used to determine which classification model performs better on the combined dataset with 13 features. Performance is measured by the accuracy of each classification algorithm. Among the five classification algorithms, random forest showed the highest accuracy in all three percentage splits (60–40, 70–30 and 80–20), ranging from 91.39% to 94.96%; the average accuracy across the algorithms ranged from 85.92% to 89.41% (refer to Table 3).

TABLE 3. The performance comparisons of different percentage splits of combined dataset with 13 features

| Algorithm | 60–40 Accu | 60–40 Sensi | 60–40 Speci | 70–30 Accu | 70–30 Sensi | 70–30 Speci | 80–20 Accu | 80–20 Sensi | 80–20 Speci |
|---|---|---|---|---|---|---|---|---|---|
| KNN | 84.03 | 0.7917 | 0.8808 | 83.75 | 0.8176 | 0.853 | 86.55 | 0.847 | 0.879 |
| SVM | 84.45 | 0.791 | 0.888 | 85.15 | 0.786 | 0.904 | 87.82 | 0.8381 | 0.909 |
| Random Forest (RF) | 91.39 | 0.856 | 0.961 | 92.16 | 0.880 | 0.954 | 94.96 | 0.904 | 0.985 |
| Naïve Bayes | 84.45 | 0.842 | 0.846 | 85.71 | 0.729 | 0.959 | 88.66 | 0.800 | 0.954 |
| Neural Network | 85.29 | 0.773 | 0.919 | 85.43 | 0.786 | 0.909 | 89.08 | 0.876 | 0.902 |
| Avg Accuracy | 85.92 | 0.810 | 0.898 | 86.44 | 0.799 | 0.915 | 89.41 | 0.853 | 0.925 |

Accu: Accuracy (%); Sensi: Sensitivity; Speci: Specificity; KNN: k-nearest neighbors algorithm; SVM: support vector machine
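
The accuracy, sensitivity and specificity columns in Table 3 can be derived from a confusion matrix on the test data. The following Python sketch shows the arithmetic on made-up labels. The function name `classification_metrics` and the toy labels are illustrative assumptions, and the study itself computed these measures in R. Label 1 is taken to be the risk class and 0 the normal class, following the Diagnose attribute. Note that Table 3 prints accuracy as a percentage and the other two measures as fractions.

```python
def classification_metrics(actual, predicted, positive=1):
    """Accuracy, sensitivity (recall) and specificity for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # share of risk cases found
        "specificity": tn / (tn + fp),   # share of normal cases found
    }

# Hypothetical test labels: 4 patients with heart disease, 6 without.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

In this toy case one diseased patient is missed and one healthy patient is flagged, so sensitivity (3 of 4) is lower than specificity (5 of 6), the same pattern most rows of Table 3 show.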


B. Performance on Feature Selection Attributes

Feature selection is performed on the combined dataset to select a subset of relevant features for model building, with the aim of improving model accuracy. In this study, a correlation analysis was performed to identify highly correlated attributes; with a correlation cutoff of 0.75, a high correlation was found between the exang and oldpeak attributes (refer to Figure 1).

Fig. 1. Correlation plot of the combined dataset

After the removal of exang and oldpeak from the combined dataset, the different feature selection methods (see Data Analysis) implemented in the R tool were applied. Ten selected features/attributes (thal, ca, chestpain, age, thalach, chol, trestbps, slope, restecg and sex) are common across the 3 feature selection methods. The attribute FBS is observed in the regression method and in rank feature by importance. The features selected by the 3 methods showed a similar order of arrangement, with minor differences in the ranking order (refer to Table 4).

Based on the order of the features (refer to Table 4), the common 8 and 6 selected features (refer to Table 5) are used to build a model. With the 8 selected features, random forest showed the highest accuracy in all three percentage splits (60–40, 70–30 and 80–20), ranging from 91.18% to 94.96%, and the average accuracy ranged from 85.99% to 89.07% (refer to Table 6). Similarly, with the 6 selected features, random forest showed the highest accuracy in all three percentage splits (60–40, 70–30 and 80–20), ranging from 89.64% to 92.44%, and the average accuracy ranged from 85.20% to 88.40% (refer to Table 7).

TABLE 4. The order of best selected features after the removal of exang and oldpeak attributes

| Recursive Feature Elimination | Regression to Calculate Variable Importance | Value | Rank Feature by Importance | Value |
|---|---|---|---|---|
| thal | thal*** | 12.650 | thal | 0.816 |
| ca | ChestPain*** | 7.368 | ChestPain | 0.766 |
| ChestPain | ca*** | 7.159 | thalach | 0.734 |
| oldpeak | thalach*** | 4.330 | exang | 0.711 |
| thalach | Trestbps** | 2.731 | oldpeak | 0.704 |
| Trestbps | Sex* | 2.520 | ca | 0.664 |
| Age | Slope* | 2.226 | Age | 0.657 |
| Chol | Restecg | 1.721 | Sex | 0.632 |
| exang | Chol | 0.628 | Slope | 0.605 |
| Slope | Exang | 0.622 | Trestbps | 0.547 |
| Sex | FBS | 0.427 | Chol | 0.546 |
| restecg | oldpeak | 0.057 | Restecg | 0.542 |
| FBS | Age | 0.050 | FBS | 0.532 |

Significance code: ***: 0, **: 0.001, *: 0.01

TABLE 5. The common selected features

| Feature set | Features |
|---|---|
| 8 selected features | thal, chest pain, ca, thalach, trestbps, age, sex and chol |
| 6 selected features | thal, chest pain, ca, thalach, trestbps, age |

The highest accuracy with the 8 and 6 selected features is found with random forest (refer to Tables 6 and 7), which agrees with the combined dataset with 13 features (refer to Table 3). The 8 selected features show an insignificant change in average accuracy of about 0.5% in the 70–30 and 80–20 percentage splits (refer to Table 6), whereas the 6 selected features (refer to Table 7) showed a decrease in average accuracy (1.24% in the 70–30 split and 1.18% in the 80–20 split), which could be due to the dip in performance of SVM and RF (in both the 70–30 and 80–20 splits) and NN (80–20 split) relative to the combined dataset with 13 features (refer to Table 3).

V. DISCUSSION

In this study, heart disease prediction has been evaluated with the classification and feature selection algorithms implemented in the CARET package of the R tool using the combined dataset. The highest accuracy is shown by random forest in all three percentage splits (without and with feature selection). The random forest algorithm thus performs best in comparison with existing model accuracies [22] [33]. The average accuracies with the 8 selected features are similar to those with the combined dataset, indicating that these features could be used for the prediction of heart disease, as mentioned in a previous study where the KNN algorithm was used [17].


TABLE 6. The performance comparisons of different percentage splits of combined dataset with 8 features

| Algorithm | 60–40 Accu | 60–40 Sensi | 60–40 Speci | 70–30 Accu | 70–30 Sensi | 70–30 Speci | 80–20 Accu | 80–20 Sensi | 80–20 Speci |
|---|---|---|---|---|---|---|---|---|---|
| KNN | 83.82 | 0.796 | 0.873 | 83.47 | 0.811 | 0.853 | 86.97 | 0.866 | 0.872 |
| SVM | 84.24 | 0.777 | 0.896 | 81.79 | 0.748 | 0.873 | 85.71 | 0.828 | 0.879 |
| Random Forest | 91.18 | 0.875 | 0.942 | 91.60 | 0.880 | 0.944 | 94.96 | 0.914 | 0.977 |
| Naïve Bayes | 86.76 | 0.810 | 0.915 | 87.39 | 0.811 | 0.924 | 89.5 | 0.857 | 0.924 |
| Neural Network | 85.29 | 0.777 | 0.915 | 85.71 | 0.773 | 0.924 | 88.24 | 0.838 | 0.917 |
| Avg Accuracy | 86.25 | 0.807 | 0.908 | 85.99 | 0.804 | 0.903 | 89.07 | 0.860 | 0.913 |

Accu: Accuracy (%); Sensi: Sensitivity; Speci: Specificity; KNN: k-nearest neighbors algorithm; SVM: support vector machine
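
The "Avg Accuracy" row can be reproduced from the five per-algorithm accuracies printed in Table 6. The short Python check below does this; the helper name `truncated_mean` is illustrative. The printed averages match the mean truncated to two decimals rather than rounded. That reading is an inference from the numbers and is not stated in the paper.

```python
import math

# Accuracy (%) per algorithm, in the row order of Table 6:
# KNN, SVM, Random Forest, Naive Bayes, Neural Network.
table6_accuracy = {
    "60-40": [83.82, 84.24, 91.18, 86.76, 85.29],
    "70-30": [83.47, 81.79, 91.60, 87.39, 85.71],
    "80-20": [86.97, 85.71, 94.96, 89.50, 88.24],
}

def truncated_mean(values, decimals=2):
    """Arithmetic mean, cut off (not rounded) after `decimals` places."""
    factor = 10 ** decimals
    return math.floor(sum(values) / len(values) * factor) / factor

averages = {split: truncated_mean(acc) for split, acc in table6_accuracy.items()}
print(averages)
```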

TABLE 7. The performance comparisons of different percentage splits of combined dataset with 6 features

| Algorithm | 60–40 Accu | 60–40 Sensi | 60–40 Speci | 70–30 Accu | 70–30 Sensi | 70–30 Speci | 80–20 Accu | 80–20 Sensi | 80–20 Speci |
|---|---|---|---|---|---|---|---|---|---|
| KNN | 85.5 | 0.812 | 0.892 | 85.15 | 0.792 | 0.899 | 87.82 | 0.847 | 0.902 |
| SVM | 82.77 | 0.773 | 0.873 | 82.07 | 0.735 | 0.888 | 83.61 | 0.809 | 0.857 |
| Random Forest | 90.76 | 0.870 | 0.938 | 89.64 | 0.886 | 0.904 | 92.44 | 0.923 | 0.924 |
| Naïve Bayes | 86.97 | 0.824 | 0.907 | 84.03 | 0.792 | 0.878 | 90.76 | 0.885 | 0.924 |
| Neural Network | 85.5 | 0.791 | 0.907 | 85.15 | 0.786 | 0.904 | 87.39 | 0.828 | 0.909 |
| Avg Accuracy | 86.3 | 0.814 | 0.903 | 85.20 | 0.798 | 0.894 | 88.40 | 0.858 | 0.903 |

Accu: Accuracy (%); Sensi: Sensitivity; Speci: Specificity; KNN: k-nearest neighbors algorithm; SVM: support vector machine
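
The three column groups in the performance tables correspond to 60–40, 70–30 and 80–20 train/test splits of the 1190-instance combined dataset. A seeded split of that size can be sketched in Python as follows. This is a sketch under assumptions: a simple shuffled split is used, the function name `percentage_split` is illustrative, and although the seed value 123 is borrowed from the paper's set.seed(123), Python's generator will not reproduce the partition obtained in R.

```python
import random

def percentage_split(n_rows, train_fraction, seed=123):
    """Shuffle row indices reproducibly and cut them into train and test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = round(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

sizes = {}
for fraction in (0.6, 0.7, 0.8):
    train, test = percentage_split(1190, fraction)
    sizes[fraction] = (len(train), len(test))
print(sizes)
```

Fixing the seed makes each split repeatable, so that differences between the 13-, 8- and 6-feature models are not caused by a different partition of the rows.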

However, further analysis with the 6 selected features showed a decrease in average accuracy relative to the combined dataset; this could be due to underfitting with so few attributes and to the performance of the classifiers under default conditions. These results are in general agreement with a previous study in which 5 of the 6 features selected in this study were used with a different feature selection method [11], indicating that these features are a minimum number needed to build a reliable model. A further drop from the 6 selected features is not expected to give better prediction. However, this contrasts with a study in which 3 selected features (chest pain, ca and Thal) showed 95.45% accuracy with a different feature selection method, which suggests that these 3 features might be important for accuracy improvement [30].

VI. CONCLUSION

It is known that the accuracy of a model depends on the database, preprocessing, analytical tools and techniques. The present study shows that it is important to select a minimal set of prominent attributes to improve performance compared with using all the features in the dataset. This study shows that random forest can be used as a good classification algorithm for the accurate prediction of heart disease, with an accuracy of 90–95%. However, additional data that includes other non-modifiable risk factors (genetic factors) and modifiable risk factors such as smoking, lack of physical exercise and alcohol consumption could result in better risk prediction of heart disease. The small variation in accuracy between the full dataset and the selected features (8 and 6) indicates that these features can be useful for the prediction of heart disease. However, dropping further features from the 6 selected features may not lead to better model prediction.

REFERENCES

[1] Bhatnagar, P., Wickramasinghe, K., Williams, J., Rayner, M., and Townsend, N. (2015). The Epidemiology of Cardiovascular Disease in the UK 2014. Heart, 101, 1182-1189.
[2] Thiyagaraj, M., and Suseendran, G. (2017). Survey on Heart Disease Prediction System Based on Data Mining Techniques. Indian Journal of Innovations and Developments, 6(1), 1-9.
[3] https://www.heartfoundation.org.au/your-heart/heart-conditions.
[4] Ullah, F., Abdullah A, H., Kaiwartya, O., Kuman, S., and Arshad M, M. (2017). Medium Access Control (MAC) for Wireless Body Area Network (WBAN): Superframe Structure, Multiple Access Technique, Taxonomy, and Challenges. Hum Cent Comput Inf Sci, 7(34), 1-39.
[5] Ullah, F., Abdullah A, H., Kaiwartya, O., Kuman, S., & Lloret, J., and Arshad, M, Md. (2017). EETP-MAC: Energy Efficient Traffic Prioritization for Medium Access Control in Wireless Body Area Networks. Telecommunication Systems, 1-23.
[6] Ullah, F., Abdullah, A, H., Kaiwartya, O., and Cao, Y. (2017). TraPy-MAC: Traffic Priority Aware Medium Access Control Protocol for Wireless Body Area Network. J Med Syst, 41, 93, 1-18.
[7] Ullah, F., Abdullah, A., H., Kaiwartya, O., and Arshad, M., M. (2017). Traffic Priority-Aware Adaptive Slot Allocation for Medium Access Control


Protocol in Wireless Body Area Network. Computers, 6, 9, 1-26.
[8] Raghupathi, W., and Raghupathi, V. (2014). Big Data Analytics in Healthcare: Promise and Potential. Health Information Science and Systems, 2, 3.
[9] Palanisamy, V. and Thirunavukarasu, R. (2017). Implications of Big Data Analytics in Developing Healthcare Frameworks – A Review. Journal of King Saud University–Computer and Information Sciences, 1-11, in press.
[10] Ghasemi, M. and Amyot, D. (2016). Process Mining In Healthcare: A Systematised Literature Review. Int. J. of Electronic Healthcare, 9(1), 60-88.
[11] Dominic, Vinitha., Deepa, Gupta. and Sangita, Khare. (2015). An Effective Performance Analysis of Machine Learning Techniques for Cardiovascular Disease. Applied Medical Informatics, 36(1), 23-32.
[12] Chandralekha, M. and Shenbagavadivu, N. (2018). Performance Analysis of Various Machine Learning Techniques to Predict Cardiovascular Disease: An Empirical Study. Appl. Math. Inf. Sci, 12(1), 217-226.
[13] UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Heart+Disease, http://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29.
[14] Hajar, R. (2017). Risk Factors for Coronary Artery Disease: Historical Perspectives. Heart Views, 18, 109-114.
[15] World Heart Federation. (2018). Available at https://www.world-heart-federation.org/resources/risk-factors/ (accessed 26 June 2018).
[16] Neesha, Jothi., Nur, Aini., Abdul, Rashid. and Wahidah, Husain. (2015). Data Mining in Healthcare–A Review. Procedia Computer Science, 72, 306-313.
[17] Enriko I. K. A., Muhammad, Suryanegara. and Dadang, Gunawan. (2016). Heart Disease Prediction System using k-Nearest Neighbor Algorithm with Simplified Patient's Health Parameters. Journal of Telecommunication, Electronic and Computer Engineering, 8(12), 59-65.
[18] Jabbar, M. A. (2017). Prediction of Heart Disease Using K-Nearest Neighbor and Particle Swarm Optimization. Biomedical Research, 28(9), 4154-4158.
[19] Boshra, Bahrami., Mirsaeid, Hosseini, Shirvani. (2015). Prediction and Diagnosis of Heart Disease by Data Mining Techniques. Journal of Multidisciplinary Engineering Science and Technology, 2(2), 164-168.
[20] Janardhanan, P., L, Heena. and Sabika, F. (2015). Effectiveness of Support Vector Machines in Medical Data Mining. Journal of Communications Software and Systems, 11(1), 25-30.
[21] Jabbar, M. A., Deekshatulu, B. L. and Priti, Chandra. (2016). Intelligent Heart Disease Prediction System Using Random Forest and Evolutionary Approach. Journal of Network and Innovative Computing, 4, 175-184.
[22] Yeshvendra, K, Singh., Nikhil, Sinha. and Sanjay, K, Singh. (2016). Heart Disease Prediction System Using Random Forest. Advances in Computing and Data Sciences, 613-623.
[23] Lavanya, M., Gomathi, P, M. (2016). Prediction of Heart Disease using Classification Algorithms. International Journal of Advanced Research in Computer Engineering & Technology, 5(7), 2173-2175.
[24] Sen, S. K. (2017). Predicting and Diagnosing of Heart Disease Using Machine Learning Algorithms. International Journal of Engineering and Computer Science, 6, 21623-21631.
[25] Shafique, Umair, Majeed, Fiaz., Qaiser, Haseeb and U., l, Mustafa, Irfan. (2015). Data Mining in Healthcare for Heart Diseases. International Journal of Innovation and Applied Studies, 10, 2028-9324.
[26] Karayılan, T. and Kılıç, Ö. (2017). Prediction of heart disease using neural network. International Conference on Computer Science and Engineering (UBMK), Antalya, 719-723. doi: 10.1109/UBMK.2017.8093512.
[27] Rishika, R., A. and Suresh, K. P. (2016). Predictive Big Data Analytics in Healthcare. Second International Conference on Computational Intelligence & Communication Technology, IEEE.
[28] Suganya, R., Rajaram, S., Sheik, Abdullah, A. and Rajendran, J. (2016). A Novel Feature Selection Method for Predicting Heart Disease with Data Mining Techniques. Asian J of Info Tech, 15, 1314-1321.
[29] Jain, D., Singh, V. (2018). Feature Selection and Classification Systems for Chronic Disease Prediction: A Review. Egyptian Informatics J. 1-11, in press. https://doi.org/10.1016/j.eij.2018.03.002.
[30] Kakulapati, V., Ankith, Kirti, Vaibhav, Kulkarni. and Charan, Pandit, Raj. (2017). Predictive Analysis of Heart Disease using Stochastic Gradient Boosting along with Recursive Feature Elimination. International Journal of Science and Research, 6, 909-912.
[31] Hidayet, Takci. (2018). Improvement of Heart Attack Prediction by the Feature Selection Methods. Turk J Elec Eng & Comp Sci, 26, 1-10.
[32] Randa, El-Bialy., Mostafa, A, Salamay, Omar, H, Karam and M, Essam, Khalifa. (2015). Feature Analysis of Coronary Artery Heart Disease Data Sets. Procedia Computer Science, 65, 459-468.
[33] Patil, P., R. Kinariwala, A., S. (2017). Automated Diagnosis of Heart Disease using Random Forest Algorithm. International Journal of Advance Research, Ideas and Innovations in Technology, 3(2), 579-589.
