Using Decision Trees in Data Mining For Predicting Factors Influencing of Heart Disease
Abdar / Carpathian Journal of Electronic and Computer Engineering 8/2 (2015) 31-36 31
Moloud Abdar
Undergraduate of Computer Engineering, Department of Engineering
University of Damghan
Damghan, Iran
B. Heart Disease
The human body has a complex mechanism, so a dysfunction in any one part influences the others. The human heart is about the size of a fist, yet it is one of the strongest muscles in the body. It begins to beat 21-28 days after the fetus forms in the womb and beats about 100,000 times a day on average. The average heart rate is about 70 beats per minute, which doubles or multiplies with physical activity. The heart plays an important role in human life: any dysfunction of the heart leads to dysfunction in the body as a whole, especially in the blood supply and the respiratory system.

According to the WHO, the World Heart Federation (WHF) and the USA's Centers for Disease Control and Prevention (CDC), the number of deaths due to “heart disease and stroke” is expected to reach up to 20 million in 2020 and to increase to 24 million by the year 2030 [4]. This rising number of deaths is the reason research on heart disease is so important. The problem also causes heavy spending by patients and governments on managing and treating heart disease. There are many types of heart disease, including coronary heart disease, stroke, hypertensive heart disease, inflammatory heart disease, and rheumatic heart disease [5]. Like other diseases, heart diseases have certain symptoms, among them chest pain, discomfort in the chest area, cough, palpitations, and fluid retention [6]. Many data sets concerning heart patients are available, and one of the most popular is the Cleveland data set, which includes 303 records with 13 features in five classes [7].

C. Risk factors of Heart Disease
For each disease there are factors that cause the illness or intensify its effects. The effect of these factors varies from patient to patient, and each factor also affects different types of heart disease differently. The most common risk factors for heart disease include smoking, gender (sex), age, ethnicity, family history of the disease, high blood pressure, high blood cholesterol, diabetes, poor diet, lack of exercise, obesity, stress, and blood vessel inflammation.

II. RELATED WORKS
One of these studies, conducted by Jasmin Nahar, Tasadduq Imam, Kevin S. Tickle, and Yi-Ping Phoebe Chen [6], identifies the relationships among risk factors for heart disease in women and men. It notes that men are more likely than women to develop coronary heart disease, and that both men and women can overcome chest pain by exercising. One of the points extracted in that article is that Rest ECG in either the Normal or the Hyper form, and Slope being flat, are defined as risk factors, whereas for men Rest ECG is a risk only in the Hyper form. Therefore, Rest ECG also has to be considered as a factor for predicting heart disease in women. In this research, the techniques of Apriori, Predictive Apriori and Tertius were compared with each other, with the following reported result: at a confidence level of 90%, a high accuracy rate of 99% and a confirmation level of 79%, respectively, for the models of Apriori, Predictive Apriori and Tertius.

Another valuable study, by Kemal Polat and Salih Güneş [8], used fuzzy weighted pre-processing and an artificial immune recognition system (AIRS) and reported 92.59% accuracy. Roohallah et al. [9] used the SMO, Bagging, and ANN algorithms on the Z-Alizadeh Sani data, which involve 303 patients with 54 features. Their study noted that "Chest Pain and Age" had a greater impact on productivity than the rest of the features, and the reported accuracy was 94.08%. Another article, by Mohammad Abushariah, Assal A. M. Alqudah, Omar Y. Adwan, and Rana M. M. Yousef [10], compares ANN and ANFIS using MATLAB. On the training data, the accuracy was 100% for ANFIS and 90.74% for ANN; on the experimental data, however, it was 87.04% for ANN and 75.93% for ANFIS. Negar Ziasabounchi and Iman Askerzade [11] discussed the accuracy of the ANFIS technique, using UCI data with seven features as input and reporting an accuracy of 92.30%. Sumit Bhatia, Praveen Prakash, and G. N. Pillai [12] examined the SVM technique: with five classes the accuracy obtained was 72.55%, but with two classes (patients and healthy people) it was 90.57%.

Mai Shouman, Tim Turner, and Rob Stocker [13] discuss several kinds of decision trees, described as J4.8, Gain Ratio, and binary discretization, and then introduce two less-used types of decision tree, namely Gini Index and Information Gain. After a comparison with J4.8 and Bagging, the paper shows the effectiveness of the chosen method, with an accuracy of 84.10%. The paper by Thanh Nguyen, Abbas Khosravi, Douglas Creighton, and Saeid Nahavandi [14] introduces GSAM, obtained by integrating the fuzzy Standard Additive Model (SAM) with a genetic algorithm, and compares it with the Probabilistic Neural Network (PNN), SOM, Fuzzy ARTMAP (FARTMAP), the Adaptive Neuro-Fuzzy Inference System or Adaptive Network-based Fuzzy Inference System (ANFIS), and the original Standard Additive Model (SAM). The highest accuracies were obtained as follows: in the Original SAM, the ANFIS method with 73.10%; in the Principal Component Analysis (PCA), the GSAM method with 64.25%; and in the Wavelets, the GSAM method with 78.78%.

To predict heart disease and breast cancer using classification algorithms, Hlaudi Daniel Masethe and Mosima Anna Masethe [15] compared the J4.8, Bayes Net, Naïve Bayes, Simple Cart, and REPTREE algorithms, and obtained an accuracy of 99.0741% for each of J4.8, REPTREE, and Simple Cart on heart patients.

The article presented by Kay Chen Tan, Eu Jin Teoh, Q. Yu, and K. C. Goh [3], which examines the data sets Iris, Diabetes, Breast-Cancer, Heart-c, and Hepatitis using GA-SVM, obtained
from the integration of a genetic algorithm and SVM, gives a best accuracy of 85.81% for heart patients.

Another article about heart disease, presented by Jyoti Soni, Ujma Ansari, Dipesh Sharma, and Sunita Soni [16], compared Naïve Bayes, Decision Tree, and classification via clustering, and found the accuracy of Decision Tree to be 99.2%. Another article, by Chaitrali S. Dangare and Sulabha S. Apte [17], deals with the application of data mining to predict heart disease using ANN. An important point in that article is that adding the two factors of obesity and smoking to the 13 previous factors raises the accuracy to 100%. In the same vein, U. Rajendra Acharya et al. [18] studied coronary artery disease using 400 healthy controls and 400 patient cases. The researchers chose the Gaussian Mixture Model (GMM) and reported an accuracy of 100%.

Nura Esfandiari, Mohammad Reza Babavalian, Amir-Masoud Eftekhari Moghadam, and Vahid Kashani Tabar [19], in a study based on data collected between 1999 and 2013, deal with knowledge discovery in medicine. Each section of the paper is devoted to one of six medical tasks: screening, diagnosis, treatment, prognosis, monitoring, and management. For each of the six tasks, five data mining approaches are considered: classification, regression, clustering, association, and hybrid. The main purpose of the paper is to investigate the work performed between 1999 and 2013 and to integrate it in order to extract medical information and data mining knowledge from 291 articles published in that period.

Kyle E. Walker and Sean M. Crotty [20], classifying the frequency of cardiovascular disease in different regions of Texas, show that cardiovascular disease is the main cause of mortality in the state. To help the development of health care policies for combating the disease, their paper examined the areas most vulnerable to it. It concluded that although factors such as poor health and social and economic deprivation can cause the disease, in some areas it also affected people with high living standards.

III. METHOD AND DATASET
The present paper employs decision trees, which are among the most important algorithms used in data mining; specifically, the C5.0, C&R Tree, CHAID, and QUEST algorithms were used. In machine learning, a decision tree is a predictive model that turns observed facts about a phenomenon into inferences about the target value of that phenomenon. The machine learning technique of inferring a decision tree from data is called decision tree learning, and it is one of the most common methods of data mining. Decision trees can produce a human-understandable description of the relationships in a data set, which can be used for classification and prediction tasks. The technique is widely used in fields such as diagnosis, classification of plants, and customer marketing strategies. Each of the algorithms is briefly described below:

A. C5.0 Algorithm
The C5.0 algorithm, developed from the ID3 and C4.5 algorithms, is one of the most important and widely used algorithms in data mining. A C5.0 tree is a classification tree that, by analyzing the input data, finds an attribute (feature) to use for the decision at each node. Since different features could be used at each node, all of them are examined and one is chosen so that selecting it leads to a reduction in entropy (disorder). This process continues until the last node (leaf) is reached. The algorithm can produce either a decision tree or a set of rules. In many applications the rule set is preferred because rules are easier to understand [21], [22], [23], [24].

B. C&R Tree Algorithm
This algorithm was introduced in 1984 by Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen [25]. It creates a decision tree with single-variable binary splits. The algorithm was developed for quantitative variables but can also be used for other variables. It uses the standard Gini coefficient (Gini index) to divide the data into groups, and it is also possible to use an index such as entropy at higher speed. The C&R Tree algorithm generates a univariate binary tree and can also build regression trees. Its weaknesses include biased selection of variables and misleading results for qualitative variables with more than two levels.

C. CHAID Algorithm
This algorithm is a type of decision tree developed and introduced by Kass in 1980 [26]. The name stands for CHi-squared Automatic Interaction Detection. It can be used for prediction, for classification, and for establishing relationships between factors. Decision trees usually provide simple and understandable results, and one advantage of this algorithm is likewise that its results are simple to understand and interpret. CHAID can be used for grouped qualitative and quantitative variables. Iterating over three steps (merging, splitting, and stopping), CHAID moves from the root node toward the bottom of the tree. At each step it chooses the best predictor, and this continues until the end of the tree is reached. The algorithm uses p-values to find the best attribute (feature) at each node, so variables with lower p-values are considered first for splitting the node.

D. QUEST Algorithm
The algorithm was designed and introduced for nominal variables by Loh and Shih in 1997 [27]. Like the C&R Tree algorithm, it forms a tree with binary splits. It creates a single-variable tree using the linear separation criterion and is an upgraded version of the FACT tree. The criterion for selecting variables in this algorithm is the P-value of the F statistic from an ANOVA test for quantitative variables and the P-value of the chi-square statistic on contingency tables for
qualitative variables. Because it relies on P-values for its decisions, the QUEST algorithm does not produce a biased tree. Its accuracy is the same as that of C&R Tree, but it is faster.

E. Dataset
To perform the study, the Cleveland heart disease data [7] were used, which include 303 records with 13 features, as shown in TABLE I. The data are divided into 5 classes: class 0 indicates the absence of heart disease, and classes 1 to 4 indicate increasing severity of heart disease. The features of the data are as follows:

TABLE I. CLEVELAND HEART DISEASE DATASET
1. age: age in years [29.0, 77.0]
2. sex: gender (1 = male; 0 = female) [0.0, 1.0]
3. cp: chest pain type [1.0, 4.0]
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital) [94.0, 200.0]
5. chol: serum cholesterol in mg/dl [126.0, 564.0]
6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false) [0.0, 1.0]
7. restecg: resting electrocardiographic results [0.0, 2.0]
8. thalach: maximum heart rate achieved [71.0, 202.0]
9. exang: exercise induced angina (1 = yes; 0 = no) [0.0, 1.0]
10. oldpeak: ST depression induced by exercise relative to rest [0.0, 6.2]
11. slope: the slope of the peak exercise ST segment [1.0, 3.0]
12. ca: number of major vessels (0-3) colored by fluoroscopy [0.0, 3.0]
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect [3.0, 7.0]
14. num: the predicted attribute (0 = healthy; 1-4 = increasingly sick) [0, 1, 2, 3, 4]

various categories. Ideally, most of the observations should lie on the main diagonal of the matrix, and the remaining values of the matrix should be zero or near zero [28], [29]. The accuracy of the generated models was then calculated in order to choose the better model for knowledge extraction, as shown in TABLE III. Since the model generated by the C5.0 algorithm had the highest accuracy, it was selected to extract knowledge.

TP = number of positive data labels that have been correctly classified,
FP = number of negative data labels that have been falsely classified as positive,
FN = number of positive data labels that have been falsely classified as negative,
TN = number of negative data labels that have been correctly classified.

Specificity = TN / (TN + FP)    (1)
Sensitivity = TP / (TP + FN)    (2)
Precision = TP / (TP + FP)    (3)
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

TABLE IV shows the values of the four indices for each class label. The values were calculated for the C5.0 algorithm using the confusion matrix. The error rate, or misclassification rate, can also be calculated from the accuracy index (formula 5) [28], [29].
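Formulas (1) to (4) can be sketched in Python as follows. This is only an illustrative sketch, not the procedure used in Clementine: the function name class_indices, the matrix layout (rows are actual classes, columns are predicted classes), and the example counts are all assumptions introduced here. For a multi-class matrix, the chosen label is treated as "positive" and all other labels as "negative". The error rate is taken as 1 - accuracy, which is the usual definition; all denominators are assumed to be non-zero.

```python
from typing import Dict, List


def class_indices(matrix: List[List[int]], label: int) -> Dict[str, float]:
    """One-vs-rest indices for `label`.

    Assumed layout: matrix[i][j] counts records of actual class i
    that were predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[label][label]
    fn = sum(matrix[label]) - tp                 # actual `label`, predicted as another class
    fp = sum(row[label] for row in matrix) - tp  # predicted `label`, actually another class
    tn = total - tp - fn - fp
    accuracy = (tp + tn) / total
    return {
        "specificity": tn / (tn + fp),   # formula (1)
        "sensitivity": tp / (tp + fn),   # formula (2)
        "precision": tp / (tp + fp),     # formula (3)
        "accuracy": accuracy,            # formula (4)
        "error_rate": 1.0 - accuracy,    # usual definition, assumed here
    }


# Hypothetical counts for illustration only; not taken from the paper's data.
example = [[40, 10],
           [5, 45]]
print(class_indices(example, 0))
```

Calling the function once per class label gives one row of indices per label, which is the shape of the per-class results reported for the C5.0 model.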
IV. RESULTS
In this paper, Clementine 12 was used. The system used for the implementation had the following specifications: Intel Core i7-3610QM, 2.3 GHz, with 8 GB of installed memory. The C5.0, C&R Tree, CHAID, and QUEST algorithms were run on the heart patients' data with the aim of extracting the knowledge underlying the data. The diagnosis field containing the category model label (TABLE II) was taken as the output. The sample data were divided into two groups (70% for training and 30% for testing).

TABLE II: CATEGORY MODEL LABEL

TABLE III: ACCURACY FOR ALGORITHMS IN THIS STUDY
Algorithm Name    Accuracy (%)
C5.0              85.33
C&R Tree          60.82
CHAID             59
QUEST             59.36

TABLE IV. THE INDICES FOR C5.0 ALGORITHM
Category (Class) Label    Specificity (%)    Sensitivity (%)    Precision (%)    Accuracy (%)
0                         54.67              95.12              71.23            76.56