N. Satish Chandra Reddy, Song Shue Nee, Lim Zhi Min & Chew Xin Ying
School of Computer Sciences,
11800, Universiti Sains Malaysia, Pulau Pinang, Malaysia
Email: [email protected]
Submitted: 03/08/2018. Revised edition: 05/12/2018. Accepted: 26/12/2018. Published online: 30/05/2019
DOI: https://doi.org/10.11113/ijic.v9n1.210
Abstract—Heart disease has been one of the major causes of death worldwide. Heart disease diagnosis is expensive nowadays, so it is necessary to predict the risk of heart disease from selected features. Feature selection methods can be valuable techniques for reducing the cost of diagnosis by selecting the important attributes. The objectives of this study are to build a classification model and to identify which selected features play a key role in the prediction of heart disease, using the Cleveland and Statlog project heart datasets. The accuracy of the random forest algorithm, in both the classification and the feature selection models, was observed to be 90–95% across three different percentage splits. The 8 and 6 selected features appear to be the minimum feature requirements for building a well-performing model; dropping further features from the 8 or 6 selected features may not lead to better performance of the prediction model.

Keywords—Machine Learning, Feature Selection, R tool, Classification, Prediction

I. INTRODUCTION

Heart disease is one of the major causes of death globally. According to the World Health Organization (WHO), deaths due to heart disease numbered around 17.7 million (31%) in 2015, and the figure is estimated to increase by 2020. Nearly 20 million people die every year, which marks heart disease as a leading cause of death [1]. The group of diseases related to both the heart and the blood vessels is referred to as cardiovascular disease (CVD). CVD includes coronary heart disease (CHD), also known as coronary artery disease (CAD), which is a disease of the arteries that supply oxygen and blood to the heart and is associated with lifestyle conditions and age. CHD occurs when the coronary arteries narrow because of the deposition of fatty material (plaque), which is termed atherosclerosis. Atherosclerosis reduces the blood flow to the heart and is said to be the underlying cause of complications such as angina and heart attacks. Angina typically refers to pain in the chest, but it can often radiate to the shoulder and left arm. A heart attack, also known as myocardial infarction (MI), cardiac infarction, coronary thrombosis or coronary occlusion, is a condition in which an artery ruptures or narrows. A blood clot formed, partially or completely, during the repair of a ruptured blood vessel reduces the blood flow to the heart muscle, causing a heart attack. Chest pain, which may also radiate to the left arm, is a sign of a heart attack. The modifiable risk factors are smoking, high blood pressure, high cholesterol, obesity, unhealthy diet, diabetes, depression and stress; the non-modifiable risk factors are age, gender, genetic factors, race and ethnicity. Heart failure, also known as congestive heart failure, affects the right side, the left side or both sides of the heart; once damaged, the heart cannot heal. It is a serious condition in which the heart muscle is damaged, becomes weak and does not pump blood properly to meet the needs of the rest of the body. Some factors, such as heart disease, high blood pressure, diabetes and cardiomyopathy (a disease of the heart muscle), can over
N. Satish Chandra Reddy et al. / IJIC Vol. 9:1(2019) 39-46
time leave the heart too weak to fill and pump properly. Other malfunctions of the circulatory system are peripheral artery disease (PAD: narrowing of the arteries in the arms, legs, stomach and head), venous thromboembolism (VTE: a blood clot in a vein) and aortic aneurysm (a weakened artery that is allowed to widen) [2], [3].

Based on disease statistics, the rate of heart disease is increasing, and the disease cannot be identified by visual inspection because it is caused by many risk factors and does not show noticeable symptoms. However, continuous monitoring of vital signs such as heartbeat, blood pressure, glucose level, ECG, EEG, EMG, respiratory rate and temperature would be useful for the early detection of disease abnormalities. This monitoring has been made possible by an advancement in wireless communication technology, the Wireless Body Area Network (WBAN), which consists of tiny Bio-Medical Sensors (BMSs) available at reasonable cost. The three methods of BMS deployment are on-body wearable (to monitor blood pressure, ECG and EMG), in-body implantation (to monitor the lungs, liver and kidneys) and off-body (to monitor physical health conditions, body position, arm position, walking and running). The BMSs are monitored through a Body Area Network Coordinator (BANC) in a star network topology [4]. Patient monitoring data are categorized as non-emergency (normal readings of vital signs such as temperature and glucose level) and emergency (signs such as low respiratory rate and high blood pressure). The medium access control (MAC) superframe structure uses IEEE 802.15.4 as its protocol [5]. The heterogeneous patient data are classified into Ordinary Packets (OP), Reliability Data Packets (RP), Critical Data Packets (CP) and Delay Data Packets (DP). Although these classifications help to deliver patient data to medical doctors without loss and with less energy consumption by the BMSs, they do not consider the low and high threshold values of vital signs when categorizing emergency data. This has been overcome with a traffic-priority-based MAC protocol and with slot allocation based on alert signals to the BMSs, in which vitals such as heart rate, blood pressure, respiratory rate and temperature are categorized into low, normal and high threshold ranges. These values are used to provide appropriate, dedicated timeslots to non-emergency and emergency BMSs. The BANC allocates slots based on the severity of the threshold values of the vital signs and also resolves slot allocation conflicts. This protocol improves throughput, reduces packet delay for non-emergency and emergency data, and minimizes the energy consumption of the BMSs and the BANC [6], [7].

An enormous amount of data is produced by hospitals, medical centers and other organizations. This data is not used properly, although it contains potential information for further proceedings such as clinical decision support, disease surveillance and population health management [8], [9]. On the other hand, clinical decisions made by doctors may leave hidden qualities of the data undiscovered, leading to errors and unwanted biases. This might affect medical costs and the quality of the health services provided to patients [10]. Fewer medical errors, less unwanted practice variation, and improved patient outcomes and safety can be achieved by integrating clinical decision support with computer-based patient records. Researchers and clinicians are studying different heart disease datasets with different machine learning classification algorithms and feature selection methods to produce predictions for heart disease [11], [12]. In this paper, the main objectives are to build a model with superior performance and to identify which selected features play a key role in the prediction of heart disease, using the Cleveland and Statlog project heart datasets from UCI [13] with the R tool.

II. LITERATURE REVIEW

This section reviews work related to the classification and feature selection methods applied to the prediction of heart disease. The heart, one of the main organs of the body, pumps blood through the blood vessels (i.e., the circulatory system). Oxygen and other materials such as nutrients, hormones and waste substances are carried to the other parts of the body by the blood. Thus, the heart plays an important role in the circulatory system. A condition in which the heart is impaired is referred to as heart disease. Heart disease, or cardiovascular disease, refers to disorders of the heart and blood vessels. The different types of heart disease are coronary heart disease, heart attack (blocked vessels or myocardial infarction) and angina (coronary artery disease). Coronary heart disease leads to heart attack and is said to be one of the major causes of death globally (WHO). Non-modifiable risk factors such as age, gender, genetic or family history, and ethnicity or race, and modifiable risk factors such as cholesterol, high blood pressure, hypertension, lack of physical exercise, smoking, obesity and stress, are responsible for heart disease [14], [15].

A. Data Mining and Machine Learning Algorithms

Data mining is the process of extracting new and valuable information from data; it could significantly improve the quality of clinical decisions and plays an important role in intelligent medical systems. Data mining spans various disciplines, one of which is machine learning, and comprises techniques such as classification, clustering, regression, association analysis and outlier detection [16]. The machine learning classification algorithms used for the prediction of heart disease are reviewed below:
1. Trestbps, Restecg, slope, CA, Thal and Age were the best features selected by a novel feature selection approach; with these features, the accuracy was found to be 93% and 89% for the neural network and SMO algorithms, respectively [28].

2. Recursive Feature Elimination (RFE) was applied along with a stochastic gradient boosting algorithm; the selected features were chest pain, ca and Thal, with an accuracy of 95.45% [30].

3. Four feature selection methods and 12 classification algorithms were applied to heart disease prediction. Different feature selection methods produced different sets of best features. The highest accuracy, 84.81%, was obtained with SVM-linear and Naïve Bayes [31].

III. DATASET

A. Data Sources

In this study, five heart disease datasets (Cleveland, Switzerland, Hungarian, V.A. Medical and Statlog project heart disease) were collected from the publicly available UCI Machine Learning Repository [13]. The five datasets were combined into a single dataset (the combined dataset) for better model performance. The combined dataset consists of 1190 instances with 14 attributes. The characteristics of the five heart disease datasets and the combined dataset are shown in Table 1.

TABLE 1. Characteristics of the five heart disease datasets and the combined dataset

Datasets | No of Instances | No of Attributes | Missing Values
Statlog project | 270 | 14 | No
Cleveland | 303 | 14 | No
Hungarian | 294 | 14 | Yes
V.A. Long Beach | 200 | 14 | Yes
Switzerland | 123 | 14 | Yes
Combined Dataset | 1190 | 14 | Missing values replaced with Mode Value

B. Attributes Description

TABLE 2. Description of attributes in the dataset

Features | Type | Description, Value
Sex | Discrete | 1 = male; 0 = female
Chest Pain | Discrete | Value 1: typical angina; Value 2: atypical angina; Value 3: non-anginal pain; Value 4: asymptomatic
FBS | Discrete | Fasting Blood Sugar. 0 = false (FBS < 120 mg/dl); 1 = true (FBS > 120 mg/dl)
Restecg (Rest Electrocardiographic) | Discrete | Value 0: normal; Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
Exang | Discrete | Exercise induced angina. 1 = yes; 0 = no
Slope | Discrete | Peak exercise slope measure. Value 1: upsloping; Value 2: flat; Value 3: downsloping
ca | Discrete | Number of major vessels colored by fluoroscopy (0-3)
thal | Discrete | Patient heart rate. 3 = normal; 6 = fixed defect; 7 = reversible defect
Age | Continuous | Patient's age, 28 to 77
Trestbps | Continuous | Resting blood pressure of patients measured in mm Hg on admission to the hospital. 80-200
Chol | Continuous | Patient serum cholesterol measured in mg/dl. 85, 100 - 200 - 394, 400-603
thalach | Continuous | Patient maximum heart rate achieved. 60-202; Low: below 50, Normal: 51-119, High: 120-202 [6], [7]
oldpeak | Continuous | ST depression made by exercise relative to rest. -2.6 to -0.1, 0, 0.1 to 6.2, 120
Diagnose | Discrete | 0: No heart disease; 1: Heart disease
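Table 1 notes that missing values in the combined dataset were replaced with the mode value. The following is a minimal sketch of that idea in Python, not the authors' R code: it concatenates toy datasets and fills each missing entry with the most frequent value of its column. The "?" missing-value marker, the function names and the toy rows are assumptions made for illustration.

```python
from collections import Counter

MISSING = "?"  # assumption: missing entries are marked with "?"


def column_mode(rows, col):
    """Return the most frequent non-missing value of one column."""
    counts = Counter(row[col] for row in rows if row[col] != MISSING)
    return counts.most_common(1)[0][0]


def combine_and_impute(*datasets):
    """Concatenate datasets (lists of dicts) and fill gaps with column modes."""
    combined = [dict(row) for dataset in datasets for row in dataset]
    columns = list(combined[0].keys())
    modes = {col: column_mode(combined, col) for col in columns}
    for row in combined:
        for col in columns:
            if row[col] == MISSING:
                row[col] = modes[col]
    return combined


# Toy rows for illustration only; these are not values from the real datasets.
cleveland = [{"ca": "0", "thal": "3"}, {"ca": "0", "thal": "7"}]
hungarian = [{"ca": "?", "thal": "7"}, {"ca": "1", "thal": "?"}]

combined = combine_and_impute(cleveland, hungarian)
print(combined)
```

Here the modes are computed on the combined rows rather than per source dataset; the text does not say which of the two the study used.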
TABLE 3. The performance comparisons of different percentage splits of combined dataset with 13 features
Algorithms | 60-40 Accu (%) | 60-40 Sensi | 60-40 Speci | 70-30 Accu (%) | 70-30 Sensi | 70-30 Speci | 80-20 Accu (%) | 80-20 Sensi | 80-20 Speci
KNN | 84.03 | 0.7917 | 0.8808 | 83.75 | 0.8176 | 0.853 | 86.55 | 0.847 | 0.879
SVM | 84.45 | 0.791 | 0.888 | 85.15 | 0.786 | 0.904 | 87.82 | 0.8381 | 0.909
Random Forest (RF) | 91.39 | 0.856 | 0.961 | 92.16 | 0.880 | 0.954 | 94.96 | 0.904 | 0.985
Naïve Bayes | 84.45 | 0.842 | 0.846 | 85.71 | 0.729 | 0.959 | 88.66 | 0.800 | 0.954
Neural Network | 85.29 | 0.773 | 0.919 | 85.43 | 0.786 | 0.909 | 89.08 | 0.876 | 0.902
Avg Accuracy | 85.92 | 0.810 | 0.898 | 86.44 | 0.799 | 0.915 | 89.41 | 0.853 | 0.925
Accu: Accuracy; Sensi: Sensitivity; Speci: Specificity; k-nearest neighbors algorithm (k-NN), Support vector machine (SVM)
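The three reported measures can be derived from a confusion matrix computed on the held-out part of a percentage split. The sketch below is an illustrative Python version, not the authors' R workflow. It assumes that "heart disease" (label 1) is the positive class and that the split is a random shuffle; the function names and the toy labels are hypothetical.

```python
import random


def percentage_split(rows, train_fraction, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def evaluate(actual, predicted, positive=1):
    """Accuracy (in percent), sensitivity and specificity from label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        "accuracy": 100.0 * (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# An 80-20 split of 1190 row indices (the size of the combined dataset).
train, test = percentage_split(range(1190), 0.80)

# Toy labels for illustration only.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = evaluate(actual, predicted)
print(len(train), len(test), metrics)
```

With 1190 instances, the 80-20 split leaves 238 test rows, so a reported accuracy of 94.96% corresponds to 226 correctly classified test instances.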
B. Performance on Feature Selection Attributes

Feature selection was performed on the combined dataset to select a subset of relevant features for model building, with the aim of improving model accuracy. In this study, a correlation analysis was performed to identify highly correlated attributes; with a correlation cutoff of 0.75, a high correlation was found between the exang and oldpeak attributes (refer to Figure 1).

TABLE 4. The order of best selected features after the removal of exang and old peak attributes

Recursive Feature Elimination | Regression to Calculate Variable Importance | Rank Feature by Importance
thal | thal*** 12.650 | thal 0.816
ca | ChestPain*** 7.368 | ChestPain 0.766
ChestPain | ca*** 7.159 | thalach 0.734
oldpeak | thalach*** 4.330 | exang 0.711
thalach | Trestbps** 2.731 | oldpeak 0.704
Trestbps | Sex* 2.520 | ca 0.664
Age | Slope* 2.226 | Age 0.657
Chol | Restecg 1.721 | Sex 0.632
exang | Chol 0.628 | Slope 0.605
Slope | Exang 0.622 | Trestbps 0.547
Sex | FBS 0.427 | Chol 0.546
restecg | oldpeak 0.057 | Restecg 0.542
FBS | Age 0.050 | FBS 0.532
Significance code: ***: 0, **: 0.001, *: 0.01

8 selected features: thal, chest pain, ca, thalach, trestbps, age, sex and chol
6 selected features: thal, chest pain, ca, thalach, trestbps, age
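The correlation screening described in this subsection can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' R code: it assumes the Pearson coefficient and flags attribute pairs whose absolute correlation exceeds the 0.75 cutoff. The helper names and the toy columns are hypothetical and are not values from the real data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def highly_correlated_pairs(columns, cutoff=0.75):
    """List attribute pairs whose absolute correlation exceeds the cutoff."""
    names = list(columns)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) > cutoff:
                flagged.append((first, second, r))
    return flagged


# Toy columns for illustration only.
toy = {
    "exang": [0, 0, 1, 1, 0, 1],
    "oldpeak": [0.0, 0.2, 1.8, 2.4, 0.1, 2.0],
    "chol": [240, 180, 220, 200, 260, 210],
}

pairs = highly_correlated_pairs(toy)
print(pairs)
```

On these toy columns only the exang and oldpeak pair exceeds the cutoff. One attribute of each flagged pair would then typically be dropped before the features are ranked.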
TABLE 6. The performance comparisons of different percentage splits of combined dataset with 8 features
Algorithms | 60-40 Accu (%) | 60-40 Sensi | 60-40 Speci | 70-30 Accu (%) | 70-30 Sensi | 70-30 Speci | 80-20 Accu (%) | 80-20 Sensi | 80-20 Speci
KNN | 83.82 | 0.796 | 0.873 | 83.47 | 0.811 | 0.853 | 86.97 | 0.866 | 0.872
SVM | 84.24 | 0.777 | 0.896 | 81.79 | 0.748 | 0.873 | 85.71 | 0.828 | 0.879
Random Forest | 91.18 | 0.875 | 0.942 | 91.60 | 0.880 | 0.944 | 94.96 | 0.914 | 0.977
Naïve Bayes | 86.76 | 0.810 | 0.915 | 87.39 | 0.811 | 0.924 | 89.5 | 0.857 | 0.924
Neural Network | 85.29 | 0.777 | 0.915 | 85.71 | 0.773 | 0.924 | 88.24 | 0.838 | 0.917
Avg Accuracy | 86.25 | 0.807 | 0.908 | 85.99 | 0.804 | 0.903 | 89.07 | 0.860 | 0.913
Accu: Accuracy; Sensi: Sensitivity; Speci: Specificity; k-nearest neighbors algorithm (k-NN), Support vector machine (SVM)
TABLE 7. The performance comparisons of different percentage splits of combined dataset with 6 features
Algorithms | 60-40 Accu (%) | 60-40 Sensi | 60-40 Speci | 70-30 Accu (%) | 70-30 Sensi | 70-30 Speci | 80-20 Accu (%) | 80-20 Sensi | 80-20 Speci
KNN | 85.5 | 0.812 | 0.892 | 85.15 | 0.792 | 0.899 | 87.82 | 0.847 | 0.902
SVM | 82.77 | 0.773 | 0.873 | 82.07 | 0.735 | 0.888 | 83.61 | 0.809 | 0.857
Random Forest | 90.76 | 0.870 | 0.938 | 89.64 | 0.886 | 0.904 | 92.44 | 0.923 | 0.924
Naïve Bayes | 86.97 | 0.824 | 0.907 | 84.03 | 0.792 | 0.878 | 90.76 | 0.885 | 0.924
Neural Network | 85.5 | 0.791 | 0.907 | 85.15 | 0.786 | 0.904 | 87.39 | 0.828 | 0.909
Avg Accuracy | 86.3 | 0.814 | 0.903 | 85.20 | 0.798 | 0.894 | 88.40 | 0.858 | 0.903
Accu: Accuracy; Sensi: Sensitivity; Speci: Specificity; k-nearest neighbors algorithm (k-NN), Support vector machine (SVM)
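The "Avg Accuracy" rows in Tables 3, 6 and 7 can be checked with simple arithmetic. The Python sketch below recomputes the 80-20 averages from the per-algorithm accuracies reported in those tables and picks the best algorithm for each feature set; the variable names are illustrative.

```python
# 80-20 split accuracies (%) as reported in Tables 3, 6 and 7,
# keyed by the number of features used.
accuracy_80_20 = {
    13: {"KNN": 86.55, "SVM": 87.82, "Random Forest": 94.96,
         "Naive Bayes": 88.66, "Neural Network": 89.08},
    8: {"KNN": 86.97, "SVM": 85.71, "Random Forest": 94.96,
        "Naive Bayes": 89.5, "Neural Network": 88.24},
    6: {"KNN": 87.82, "SVM": 83.61, "Random Forest": 92.44,
        "Naive Bayes": 90.76, "Neural Network": 87.39},
}

# Mean accuracy over the five algorithms for each feature set.
averages = {
    n_features: sum(scores.values()) / len(scores)
    for n_features, scores in accuracy_80_20.items()
}

# Algorithm with the highest accuracy for each feature set.
best = {
    n_features: max(scores, key=scores.get)
    for n_features, scores in accuracy_80_20.items()
}

for n_features in accuracy_80_20:
    print(n_features, round(averages[n_features], 2), best[n_features])
```

The recomputed means agree with the reported 89.41, 89.07 and 88.40 to within 0.01, and random forest is the most accurate algorithm for all three feature sets.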
However, further analysis with the 6 selected features showed a decrease in average accuracy compared with the combined dataset. This could be due to underfitting and to the performance of the classifiers under default settings. These results are in general agreement with previous studies, where 5 of the 6 features selected in this study were obtained with a different feature selection method [11], indicating that these features are the minimum needed to build a reliable model. It is expected that dropping further features from the 6 selected features may not give a better prediction. This contrasts, however, with a study in which 3 selected features (chest pain, ca and Thal) gave an accuracy of 95.45% with a different feature selection method, which suggests that these 3 features might be important for accuracy improvement [30].

VI. CONCLUSION

It is known that the accuracy of a model depends on the database, preprocessing, analytical tools and techniques. The present study shows that it is important to select a minimal set of prominent attributes to improve performance, compared with using all the features in the dataset. This study shows that random forest can be used as a good classification algorithm for the accurate prediction of heart disease, with an accuracy of 90–95%. However, extending the dataset with other non-modifiable risk factors (genetic factors) and modifiable risk factors such as smoking, lack of physical exercise and alcohol consumption could result in better risk prediction of heart disease. The small differences in accuracy between the full dataset and the selected features (8 and 6) indicate that these features can be useful for the prediction of heart disease. However, dropping further features from the 6 selected features may not lead to better model prediction.

REFERENCES

[1] Bhatnagar, P., Wickramasinghe, K., Williams, J., Rayner, M., and Townsend, N. (2015). The Epidemiology of Cardiovascular Disease in the UK 2014. Heart, 101, 1182-1189.
[2] Thiyagaraj, M., and Suseendran, G. (2017). Survey on Heart Disease Prediction System Based on Data Mining Techniques. Indian Journal of Innovations and Developments, 6(1), 1-9.
[3] https://www.heartfoundation.org.au/your-heart/heart-conditions.
[4] Ullah, F., Abdullah A, H., Kaiwartya, O., Kuman, S., and Arshad M, M. (2017). Medium Access Control (MAC) for Wireless Body Area Network (WBAN): Superframe Structure, Multiple Access Technique, Taxonomy, and Challenges. Hum Cent Comput Inf Sci, 7(34), 1-39.
[5] Ullah, F., Abdullah A, H., Kaiwartya, O., Kuman, S., & Lloret, J., and Arshad, M, Md. (2017). EETP-MAC: Energy Efficient Traffic Prioritization for Medium Access Control in Wireless Body Area Networks. Telecommunication Systems, 1-23.
[6] Ullah, F., Abdullah, A, H., Kaiwartya, O., and Cao, Y. (2017). TraPy-MAC: Traffic Priority Aware Medium Access Control Protocol for Wireless Body Area Network. J Med Syst, 41, 93, 1-18.
[7] Ullah, F., Abdullah, A., H., Kaiwartya, O., and Arshad, M., M. (2017). Traffic Priority-Aware Adaptive Slot Allocation for Medium Access Control
Protocol in Wireless Body Area Network. Computers, 6, 9, 1-26.
[8] Raghupathi, W., and Raghupathi, V. (2014). Big Data Analytics in Healthcare: Promise and Potential. Health Information Science and Systems, 2, 3.
[9] Palanisamy, V. and Thirunavukarasu, R. (2017). Implications of Big Data Analytics in Developing Healthcare Frameworks – A Review. Journal of King Saud University–Computer and Information Sciences, 1-11, in press.
[10] Ghasemi, M. and Amyot, D. (2016). Process Mining In Healthcare: A Systematised Literature Review. Int. J. of Electronic Healthcare, 9(1), 60-88.
[11] Dominic, Vinitha., Deepa, Gupta. and Sangita, Khare. (2015). An Effective Performance Analysis of Machine Learning Techniques for Cardiovascular Disease. Applied Medical Informatics, 36(1), 23-32.
[12] Chandralekha, M. and Shenbagavadivu, N. (2018). Performance Analysis of Various Machine Learning Techniques to Predict Cardiovascular Disease: An Empirical Study. Appl. Math. Inf. Sci, 12(1), 217-226.
[13] UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Heart+Disease, http://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29.
[14] Hajar, R. (2017). Risk Factors for Coronary Artery Disease: Historical Perspectives. Heart Views, 18, 109-114.
[15] World Heart Federation. (2018). Available at https://www.world-heart-federation.org/resources/risk-factors/ (accessed 26 June 2018).
[16] Neesha, Jothi., Nur, Aini., Abdul, Rashid. and Wahidah, Husain. (2015). Data Mining in Healthcare–A Review. Procedia Computer Science, 72, 306-313.
[17] Enriko I. K. A., Muhammad, Suryanegara. and Dadang, Gunawan. (2016). Heart Disease Prediction System using k-Nearest Neighbor Algorithm with Simplified Patient's Health Parameters. Journal of Telecommunication, Electronic and Computer Engineering, 8(12), 59-65.
[18] Jabbar, M. A. (2017). Prediction of Heart Disease Using K-Nearest Neighbor and Particle Swarm Optimization. Biomedical Research, 28(9), 4154-4158.
[19] Boshra, Bahrami., Mirsaeid, Hosseini, Shirvani. (2015). Prediction and Diagnosis of Heart Disease by Data Mining Techniques. Journal of Multidisciplinary Engineering Science and Technology, 2(2), 164-168.
[20] Janardhanan, P., L, Heena. and Sabika, F. (2015). Effectiveness of Support Vector Machines in Medical Data Mining. Journal of Communications Software and Systems, 11(1), 25-30.
[21] Jabbar, M. A., Deekshatulu, B. L. and Priti, Chandra. (2016). Intelligent Heart Disease Prediction System Using Random Forest and Evolutionary Approach. Journal of Network and Innovative Computing, 4, 175-184.
[22] Yeshvendra, K, Singh., Nikhil, Sinha. and Sanjay, K, Singh. (2016). Heart Disease Prediction System Using Random Forest. Advances in Computing and Data Sciences, 613-623.
[23] Lavanya, M., Gomathi, P, M. (2016). Prediction of Heart Disease using Classification Algorithms. International Journal of Advanced Research in Computer Engineering & Technology, 5(7), 2173-2175.
[24] Sen, S. K. (2017). Predicting and Diagnosing of Heart Disease Using Machine Learning Algorithms. International Journal of Engineering and Computer Science, 6, 21623-21631.
[25] Shafique, Umair, Majeed, Fiaz., Qaiser, Haseeb and U., l, Mustafa, Irfan. (2015). Data Mining in Healthcare for Heart Diseases. International Journal of Innovation and Applied Studies, 10, 2028-9324.
[26] Karayılan, T. and Kılıç, Ö. (2017). Prediction of heart disease using neural network. International Conference on Computer Science and Engineering (UBMK), Antalya, 719-723. doi: 10.1109/UBMK.2017.8093512.
[27] Rishika, R., A. and Suresh, K. P. (2016). Predictive Big Data Analytics in Healthcare. Second International Conference on Computational Intelligence & Communication Technology, IEEE.
[28] Suganya, R., Rajaram, S., Sheik, Abdullah, A. and Rajendran, J. (2016). A Novel Feature Selection Method for Predicting Heart Disease with Data Mining Techniques. Asian J of Info Tech, 15, 1314-1321.
[29] Jain, D., Singh, V. (2018). Feature Selection and Classification Systems for Chronic Disease Prediction: A Review. Egyptian Informatics J. 1-11, in press. https://doi.org/10.1016/j.eij.2018.03.002.
[30] Kakulapati, V., Ankith, Kirti, Vaibhav, Kulkarni. and Charan, Pandit, Raj. (2017). Predictive Analysis of Heart Disease using Stochastic Gradient Boosting along with Recursive Feature Elimination. International Journal of Science and Research, 6, 909-912.
[31] Hidayet, Takci. (2018). Improvement of Heart Attack Prediction by the Feature Selection Methods. Turk J Elec Eng & Comp Sci, 26, 1-10.
[32] Randa, El-Bialy., Mostafa, A, Salamay, Omar, H, Karam and M, Essam, Khalifa. (2015). Feature Analysis of Coronary Artery Heart Disease Data Sets. Procedia Computer Science, 65, 459-468.
[33] Patil, P., R. Kinariwala, A., S. (2017). Automated Diagnosis of Heart Disease using Random Forest Algorithm. International Journal of Advance Research, Ideas and Innovations in Technology, 3(2), 579-589.