
M. Abdar / Carpathian Journal of Electronic and Computer Engineering 8/2 (2015) 31-36

Using Decision Trees in Data Mining for Predicting Factors Influencing of Heart Disease

Moloud Abdar
Undergraduate of Computer Engineering, Department of Engineering
University of Damghan
Damghan, Iran
[email protected]

Abstract—Statistics from the World Health Organization (WHO) show that heart disease is one of the leading causes of mortality all over the world. Because of the importance of heart disease, many studies have been conducted on it in recent years using data mining. The main objective of this study is to find a better decision tree algorithm and then use that algorithm to extract rules for predicting heart disease. The Cleveland data, comprising 303 records, are used for this study. These data include 13 features, and we have categorized them into five classes. In this paper, the C5.0 algorithm, with an accuracy of 85.33%, performs better than the other algorithms used in this study. Considering the rules created by this algorithm, the attributes Trestbps, Restecg, Thalach, Slope, Oldpeak, and CP were extracted as the most influential factors in predicting heart disease.

Keywords— Data Mining; Heart Disease; Classification; Decision Tree; C5.0 Algorithm.

I. INTRODUCTION

In recent years, the volume of accumulated data has increased rapidly. As a result, methods that can extract beneficial information from these data have received considerable attention. Data mining is used in most scientific fields, including the medical sciences. So far, data mining techniques have been used for diagnosing conditions such as heart disease, diabetes, neurological disorders, depression, breast cancer, and liver disease. There are different methods and algorithms in data mining, and the power and performance of each algorithm differ depending on the data provided. Examples include Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Decision Trees, and Artificial Neural Networks (ANN). The WHO has reported that heart disease is a major cause of death around the world [1]. Because of the importance of this disease, this paper examines the performance of the decision tree algorithms C5.0, CHi-squared Automatic Interaction Detection (CHAID), Quick Unbiased and Efficient Statistical Tree (QUEST), and Classification and Regression Tree (C&R Tree) on heart patients' data.

A. Data Mining and Classification

With the advancement of science, the volume of data stored in various fields has increased. Analyzing the accumulated data can extract the useful information they contain. Applying data mining to data can extract the knowledge lying within them. Data mining reveals beneficial relationships in the data, and the right decisions can be made based on these relationships [2], [3]. Using the related tools to show the results, data mining employs analytical modeling, classification, and prediction of information. To extract information easily, data mining algorithms need pre-processing of the data and post-processing of the extracted patterns. The techniques used for data mining can be classified as follows:

1. Classification (Predictive Technique): A sample is classified into one of several predefined categories.
2. Regression (Predictive Technique): The value of a variable is predicted based on other variables.
3. Clustering (Descriptive Technique): A data set is mapped into one of several clusters. Clusters are groupings of data that are shaped based on the similarity of some criteria.
4. Discovery of association rules (Descriptive Technique): It states the dependence relationships among the various features.
5. Sequence Analysis: It models sequence patterns, such as time series.

One of the divisions of data mining is classification, which operates using If-Then rules. Its purpose is to predict a feature (characteristic) based on other features (characteristics), which are known as predictors. In classification, the data are divided into two sets, Training and Testing, and the data mining algorithms extract the rules from the training set. The target feature and the values of the predictor features must be supplied to the data mining algorithms. KNN, SVM, Decision Tree, and ANN are among the classification algorithms.
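The training/testing idea described above can be illustrated with a minimal Python sketch. It is not part of the original study: the toy records, the single-threshold If-Then rule, and the function names (split_train_test, learn_threshold_rule, predict) are hypothetical, and the 0.7 training fraction is only an assumed default.

```python
import random


def split_train_test(records, train_fraction=0.7, seed=0):
    """Shuffle the records and split them into training and testing sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def learn_threshold_rule(train):
    """Find t for the rule 'if x <= t then class 0 else class 1'
    that misclassifies the fewest training records."""
    best_t, best_errors = None, None
    for t in sorted({x for x, _ in train}):
        errors = sum((0 if x <= t else 1) != y for x, y in train)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def predict(threshold, x):
    """Apply the learned If-Then rule to one predictor value."""
    return 0 if x <= threshold else 1


# Hypothetical toy data: (predictor value, target class).
records = [(x, 0 if x <= 5 else 1) for x in range(1, 11)]
train, test = split_train_test(records)
threshold = learn_threshold_rule(train)
test_accuracy = sum(predict(threshold, x) == y for x, y in test) / len(test)
print(f"rule: if x <= {threshold} then 0 else 1; test accuracy = {test_accuracy:.2f}")
```

The rule is learned only from the training records and is then scored on the held-out testing records, which is the separation the paragraph above describes. A real decision tree repeats this kind of split search over many features and nodes.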

ISSN 1844 – 9689 http://cjece.ubm.ro


B. Heart Disease

The human body has a complex mechanism in which dysfunction of any part influences the others. The human heart is about the size of a fist, yet it is one of the strongest muscles in the body. It begins to beat 21-28 days after the fetus forms in the womb and beats 100,000 times daily on average. The average heart rate is about 70 beats per minute, which doubles or increases further during physical activity. The heart plays an important role in human life. Any dysfunction of the heart leads to dysfunction in the body's entire system, especially the blood supply and the respiratory system.

According to the WHO, the World Heart Federation (WHF), and the USA's Centers for Disease Control and Prevention (CDC), the number of deaths due to "heart disease and stroke" will reach 20 million in 2020 and rise to 24 million by the year 2030 [4]. The rising number of deaths due to heart disease is the reason research on it is so important. The problem also causes heavy spending by patients and governments to manage and cure heart disease. There are many types of heart disease, including coronary heart disease, stroke, hypertensive heart disease, inflammatory heart disease, and rheumatic heart disease [5]. Like other diseases, heart diseases have certain symptoms, among which are chest pain, discomfort in the chest area, cough, palpitations, and fluid retention [6]. There are many data sets concerning heart patients, and one of the most popular is the Cleveland data set. It includes 303 records with 13 features in five classes [7].

C. Risk factors of Heart Disease

For each disease there are factors that cause the illness or intensify its effects. The effect of these factors varies with each patient. Each of the following factors also has a variety of effects on different types of heart disease. The most common risk factors for heart disease include smoking, gender (sex), age, ethnicity, family history of the disease, high blood pressure, high blood cholesterol, diabetes, poor diet, lack of exercise, obesity, stress, and blood vessel inflammation.

II. RELATED WORKS

One of these studies, conducted by Jasmin Nahar, Tasadduq Imam, Kevin S. Tickle, and Yi-Ping Phoebe Chen [6], deals with identifying the role of relationships among risk factors for heart disease in women and men. It notes that men are more likely to develop coronary heart disease than women. Men and women can overcome chest pain by exercising. One of the findings of that article is that, for women, Rest ECG in either the Normal or the Hyper form and a flat Slope are defined as risk factors. For men, however, only Rest ECG in the Hyper form is a risk. The result is therefore that Rest ECG must also be considered as a factor for predicting heart disease in women. In that research, the Apriori, Predictive Apriori, and Tertius techniques were compared with each other, and a confidence level of 90%, an accuracy rate of 99%, and a confirmation level of 79% were reported for Apriori, Predictive Apriori, and Tertius, respectively.

Another valuable study, by Kemal Polat and Salih Güneş [8], used fuzzy weighted pre-processing and an artificial immune recognition system (AIRS) and reported 92.59% accuracy. The article by Roohallah Alizadehsani et al. [9] used the SMO, Bagging, and ANN algorithms on the Z-Alizadeh Sani data, involving 303 patients with 54 features. That study noted that "Chest Pain and Age" had a greater impact than the rest of the features, and its reported accuracy was 94.08%. Another article, written by Mohammad Abushariah, Assal A. M. Alqudah, Omar Y. Adwan, and Rana M. M. Yousef [10], compares ANN and ANFIS using MATLAB software. The accuracy on the training data was 100% for ANFIS and 90.74% for ANN. However, the accuracy on the test data was 87.04% for ANN and 75.93% for ANFIS. Negar Ziasabounchi and Iman Askerzade [11] discussed the accuracy of the ANFIS technique. In their study, UCI data with seven features were used as input, and the reported accuracy was 92.30%. An investigation conducted by Sumit Bhatia, Praveen Prakash, and G. N. Pillai [12] examined the SVM technique; with five classes the accuracy obtained was 72.55%, but with two classes (patients and healthy people) the accuracy obtained was 90.57%.

Mai Shouman, Tim Turner, and Rob Stocker [13] discuss several kinds of decision trees, defined as J4.8, Gain Ratio, and binary discretization, and then introduce two less commonly used types of decision trees, namely Gini Index and Information Gain. After a comparison with J4.8 and Bagging, that paper demonstrated the effectiveness of the chosen method with an accuracy of 84.10%. The paper by Thanh Nguyen, Abbas Khosravi, Douglas Creighton, and Saeid Nahavandi [14] introduces GSAM, obtained by integrating the "Fuzzy Standard Additive Model" (SAM) with a "Genetic algorithm". The paper then compares it to the Probabilistic Neural Network (PNN), SOM, Fuzzy ARTMAP (FARTMAP), the Adaptive Neuro-Fuzzy Inference System or Adaptive Network-based Fuzzy Inference System (ANFIS), and the Original Standard Additive Model (SAM). The highest accuracies were obtained as follows: in the Original SAM, the ANFIS method with 73.10%; in the Principal Component Analysis (PCA), the GSAM method with 64.25%; and in the Wavelets, the GSAM method with 78.78%.

To predict heart disease and breast cancer using classification algorithms, Hlaudi Daniel Masethe and Mosima Anna Masethe [15] compared the J4.8, Bayes Net, Naïve Bayes, Simple Cart, and REPTREE algorithms with each other, and they obtained an accuracy of 99.0741% for each of J4.8, REPTREE, and Simple Cart for heart patients.

The article by Kay Chen Tan, Eu Jin Teoh, Q. Yu, and K. C. Goh [3], which examines the Iris, Diabetes, Breast-Cancer, Heart-c, and Hepatitis data sets using GA-SVM, obtained
from the integration of a genetic algorithm and SVM, gives a best accuracy of 85.81% for heart patients.

Another article about heart disease, presented by Jyoti Soni, Ujma Ansari, Dipesh Sharma, and Sunita Soni [16], compared the techniques of Naïve Bayes, Decision Tree, and Classification via clustering. It found the accuracy of Decision Tree to be 99.2%. Another article, by Chaitrali S. Dangare and Sulabha S. Apte [17], deals with the application of data mining to predict heart disease using ANN. That article notes that adding the two factors of obesity and smoking to the 13 previous factors raises the accuracy to 100%. In the same vein, U. Rajendra Acharya et al. [18] studied coronary artery disease. For that research, they used 400 healthy controls and 400 patient cases. The researchers chose the Gaussian Mixture Model (GMM), and the reported accuracy was 100%. Nura Esfandiari, Mohammad Reza Babavalian, Amir-Masoud Eftekhari Moghadam, and Vahid Kashani Tabar [19], in a study based on data collected between 1999 and 2013, deal with knowledge discovery in medicine. Each section of their paper is devoted to one of six medical tasks: screening, diagnosis, treatment, prognosis, monitoring, and management. For each of the six tasks, five data mining approaches are considered: classification, regression, clustering, association, and hybrid. The main purpose of that paper is to investigate the work performed between 1999 and 2013, and to integrate it in order to extract medical information and data mining knowledge from 291 articles published in that period. A study classifying the prevalence of cardiovascular disease in different regions of Texas, by Kyle E. Walker and Sean M. Crotty [20], shows that the main cause of mortality in that state is cardiovascular disease. To help the development of health care policies for combating the disease, that paper examined the areas most vulnerable to it. It concluded that although factors such as poor health and social and economic deprivation can cause this disease, in some areas the disease also affects people with high living standards.

III. METHOD AND DATASET

To prepare the present paper, decision trees, which are among the most important algorithms used in data mining, were employed; they included the C5.0, C&R Tree, CHAID, and QUEST algorithms. In machine learning, a decision tree is a predictive model that turns the observed facts about a phenomenon into inferences about the target value of that phenomenon. The machine learning technique for inferring a decision tree from data is called Decision Tree Learning, which is one of the most common methods of data mining. Decision trees are able to produce a human-understandable description of the relationships in a data set, which can be used for classification and prediction tasks. This technique is widely used in various fields such as diagnosis, classification of plants, and customer marketing strategies. Each of the algorithms is briefly described below.

A. C5.0 Algorithm

The C5.0 algorithm, developed from the ID3 and C4.5 algorithms, is one of the most important and widely used algorithms in data mining. A C5.0 tree is a classification tree that, based on analysis of the input data, finds an attribute (feature) to use for making the decision at each Node. Since each Node may use a different feature, all features are examined and the one whose selection leads to the greatest reduction in entropy (disorder) is chosen. This process continues until the last Node (Leaf) is reached. The algorithm can express its classification either as a decision tree or as a set of rules. In many applications, the set of rules is preferred because it is easier to understand [21], [22], [23], [24].

B. C&R Tree Algorithm

This algorithm was introduced in 1984 by Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen [25]. Using this algorithm, it is possible to create a decision tree with single-variable binary splits. The algorithm was developed for quantitative variables, but it can also be used for other variables. In this algorithm, the standard Gini coefficient (Gini Index) is used to divide the data into different groups, and it is also possible to use an index such as entropy at higher speed. The C&R Tree algorithm generates a univariate binary tree. It can also build regression trees. Among the weaknesses of this algorithm are the biased selection of variables and misleading results for qualitative variables with more than two levels.

C. CHAID Algorithm

This algorithm is a type of decision tree developed and introduced by Kass in 1980 [26]. The name stands for CHi-squared Automatic Interaction Detection, and the algorithm can be used for prediction, classification, and establishing relationships between various factors. Decision trees usually provide simple and understandable results, and one of the advantages of this algorithm is likewise that its results are simple to understand and interpret. The CHAID algorithm can be used for grouped qualitative and quantitative variables. Using three steps (merging, splitting, and stopping) that are performed iteratively, the CHAID algorithm moves from the root Node toward the bottom of the tree. At each step, CHAID chooses the best predictor, and this continues until the end of the tree is reached. The algorithm uses p-values to find the best attributes (features) at each Node, so the variable with the lowest p-value is considered first for splitting the node.

D. QUEST Algorithm

This algorithm was designed and introduced for nominal variables by Loh and Shih in 1997 [27]. The tree formed by this algorithm has binary splits, just like the C&R Tree algorithm. The algorithm creates a single-variable tree using the linear separation standard. This tree is an upgraded version of the FACT tree. The criterion for selecting variables in this algorithm is the P-value of the ANOVA F statistic for quantitative variables and the P-value of the chi-square statistic on contingency tables for
qualitative variables. Because it uses P-values for its decisions, the QUEST algorithm does not produce a biased tree. The accuracy of the algorithm is the same as that of C&R Tree, but its speed is higher.

E. Dataset

To perform the study, the Cleveland heart disease data were used [7], which include 303 records with 13 features, as shown in TABLE I. The data are divided into 5 classes, in which class 0 indicates the absence of heart disease and classes 1 to 4 indicate increasing severity of heart disease. The features of the data are as follows:

TABLE I. CLEVELAND HEART DISEASE DATASET

1. age: age in years [29.0, 77.0]
2. sex: gender (1 = male; 0 = female) [0.0, 1.0]
3. cp: chest pain type [1.0, 4.0]
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital) [94.0, 200.0]
5. chol: serum cholesterol in mg/dl [126.0, 564.0]
6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false) [0.0, 1.0]
7. restecg: resting electrocardiographic results [0.0, 2.0]
8. thalach: maximum heart rate achieved [71.0, 202.0]
9. exang: exercise induced angina (1 = yes; 0 = no) [0.0, 1.0]
10. oldpeak: ST depression induced by exercise relative to rest [0.0, 6.2]
11. slope: the slope of the peak exercise ST segment [1.0, 3.0]
12. ca: number of major vessels (0-3) colored by fluoroscopy [0.0, 3.0]
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect [3.0, 7.0]
14. num: the predicted attribute (0 = healthy; 1-4 = increasingly sick) [0, 1, 2, 3, 4]

IV. RESULTS

In this paper, Clementine 12 was used. The system used for the implementation had the following specifications: Intel Core i7-3610QM, 2.3 GHz, with 8 GB of installed memory. The C5.0, C&R Tree, CHAID, and QUEST algorithms were applied to the heart patients' data with the aim of extracting the knowledge underlying the data under investigation. The diagnosis field containing the category model label (TABLE II) was used as the output. The sample data were divided into two groups (70% for training and 30% for testing).

TABLE II: CATEGORY MODEL LABEL

Category Label    Description
0                 Patients with no heart failure
1-4               Heart Disease

A. Evaluation

Evaluating the results obtained from the C5.0 model allows the model to be improved and made usable. Various parameters, such as specificity, sensitivity, precision, and accuracy, are used to evaluate classification methods; they are calculated according to formulas (1) to (4) below. The confusion matrix can be used to compute these indices. This matrix is a useful tool for analyzing the performance of a classification method in detecting data or observations of various categories. Ideally, most of the observations should be located on the main diagonal of the matrix, and the remaining values of the matrix should be zero or near zero [28], [29]. The accuracy of the generated models was calculated in order to choose the better model for knowledge extraction, as shown in TABLE III. Since the model generated by the C5.0 algorithm had the highest accuracy, this model was selected to extract knowledge.

TP = number of positive data labels that have been properly classified,
FP = number of negative data labels that have been falsely classified as positive,
FN = number of positive data labels that have been falsely classified as negative,
TN = number of negative data labels that have been properly classified.

\[ \text{Specificity} = \frac{TN}{TN + FP} \tag{1} \]
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{2} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{3} \]
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4} \]

TABLE IV shows the values of the four indices for each class label. The values were calculated for the C5.0 algorithm using the confusion matrix. The error rate, or misclassification rate, can also be calculated from the accuracy index using formula (5) [28], [29].

TABLE III: ACCURACY FOR ALGORITHMS IN THIS STUDY

Algorithm Name    Accuracy (%)
C5.0              85.33
C&R Tree          60.82
CHAID             59
QUEST             59.36

TABLE IV. THE INDICES FOR C5.0 ALGORITHM

Category (Class) Label    Specificity (%)    Sensitivity (%)    Precision (%)    Accuracy (%)
0                         54.67              95.12              71.23            76.56
1                         95.96              16.36              47.36            81.51
2                         89.51              50                 39.13            84.81
3                         98.88              11.42              57.14            88.77
4                         97.58              38.46              41.66            95.04

\[ \text{Error Rate} = 1 - \text{Accuracy} \tag{5} \]

The average model accuracy calculated from the confusion matrix is 85.338%. Hence, the error rate of the model is 14.662%, and this rate indicates rather good precision and accuracy for the model.
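As an illustrative check, formulas (1) to (5) can be written as small Python functions. The sketch below is not the Clementine implementation used in the study; the function names and the example counts are hypothetical. The only figures taken from the paper are the per-class accuracy values of TABLE IV, whose mean reproduces the reported 85.338% average accuracy and 14.662% error rate.

```python
def specificity(tp, fp, fn, tn):
    """Formula (1): share of negative records classified as negative."""
    return tn / (tn + fp)


def sensitivity(tp, fp, fn, tn):
    """Formula (2): share of positive records classified as positive."""
    return tp / (tp + fn)


def precision(tp, fp, fn, tn):
    """Formula (3): share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def accuracy(tp, fp, fn, tn):
    """Formula (4): share of all records classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def error_rate(acc):
    """Formula (5): misclassification rate derived from accuracy."""
    return 1 - acc


# Hypothetical one-vs-rest counts for a single class (not from the paper).
counts = dict(tp=40, fp=10, fn=5, tn=45)
example_accuracy = accuracy(**counts)

# Per-class accuracy (%) for classes 0..4, copied from TABLE IV.
table_iv_accuracy = [76.56, 81.51, 84.81, 88.77, 95.04]
mean_accuracy = sum(table_iv_accuracy) / len(table_iv_accuracy)
mean_error = 100 - mean_accuracy  # formula (5) on a percentage scale

print(round(mean_accuracy, 3), round(mean_error, 3))  # 85.338 14.662
```

Averaging the other three columns of TABLE IV in the same way gives the remaining mean values of the four indices. Because each class is scored one-vs-rest, a class with few records can show high accuracy and specificity while its sensitivity stays low, as classes 1 and 3 do in TABLE IV.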


B. Findings

The values of 85.33% for accuracy, 51.30% for precision, 87.32% for specificity, and 42.27% for sensitivity calculated for the C5.0 algorithm show that the resulting tree can provide comprehensive rules for predicting the future condition of patients. The extracted knowledge, or rules, are shown in TABLE V. Class 0 is associated with the absence of heart disease: if Cp <= 3, or if Cp > 3 and Oldpeak <= 0.7, or if Cp > 3, Oldpeak > 0.7, and Thalach > 168, the result is lack of heart disease. Classes 1 to 4, which indicate increasing symptoms and severity of heart disease in ascending order, are more important. There are symptoms of heart disease in class 1, but they are less severe than in classes 2, 3, and 4. In this class, if Cp > 3, Oldpeak > 0.7, Thalach <= 168, Restecg > 1, Slope <= 2, Oldpeak <= 2.6, and Trestbps <= 142, there are weak signs of heart disease. In class 2, the severity of heart disease is greater than in class 1. If Cp > 3, Oldpeak > 0.7, Thalach <= 168, and Restecg <= 1, or if Cp > 3, Oldpeak > 0.7, Thalach <= 168, Restecg > 1, Slope <= 2, and Oldpeak > 2.6, the evaluation of heart disease calls for greater precision, so these cases should be considered more carefully. Classes 3 and 4, which represent severe heart disease, must be taken very seriously.

TABLE V. A SAMPLE OF RULES MADE BY C5.0 ALGORITHM

Category (Class) Label    Rules
0                         If Cp <= 3 then ClassLabel = 0
0                         If Cp > 3 and Oldpeak <= 0.7 then ClassLabel = 0
0                         If Cp > 3 and Oldpeak > 0.7 and Thalach > 168 then ClassLabel = 0
1                         If Cp > 3 and Oldpeak > 0.7 and Thalach <= 168 and Restecg > 1 and Slope <= 2 and Oldpeak <= 2.6 and Trestbps <= 142 then ClassLabel = 1
2                         If Cp > 3 and Oldpeak > 0.7 and Thalach <= 168 and Restecg <= 1 then ClassLabel = 2
2                         If Cp > 3 and Oldpeak > 0.7 and Thalach <= 168 and Restecg > 1 and Slope <= 2 and Oldpeak > 2.6 then ClassLabel = 2
3                         If Cp > 3 and Oldpeak > 0.7 and Thalach <= 168 and Restecg > 1 and Slope > 2 then ClassLabel = 3
4                         If Cp > 3 and Oldpeak > 0.7 and Thalach <= 168 and Restecg > 1 and Slope <= 2 and Oldpeak <= 2.6 and Trestbps > 142 then ClassLabel = 4

In class 3, if Cp > 3, Oldpeak > 0.7, Thalach <= 168, Restecg > 1, and Slope > 2, the severity of heart disease is high, and the patients need more care than those in the previous classes. Class 4 is the most important class and carries the highest risk for heart patients. Patients in this class are at risk of death. Therefore, if Cp > 3, Oldpeak > 0.7, Thalach <= 168, Restecg > 1, Slope <= 2, Oldpeak <= 2.6, and Trestbps > 142, the patients belong to the class of patients in very bad condition, so it can be concluded that these patients must be under complete care.

V. CONCLUSION

This paper examined the factors influencing heart disease using data mining techniques. The data consisted of 303 patients from the Cleveland heart disease dataset. After applying the C5.0, C&R Tree, CHAID, and QUEST algorithms to the data, the C5.0 algorithm, with an accuracy of 85.33 percent, had the best performance in detecting the causes of heart disease. The attributes Cp, Oldpeak, Slope, Thalach, Restecg, and Trestbps were obtained as important and influential factors in coronary heart disease. One of the main points that stands out in the rules generated by the C5.0 algorithm is the value of the Thalach attribute: if Thalach > 168, the record belongs to the class representing lack of heart disease. However, sifting through the rules made it clear that if Thalach <= 168, that is, an attribute value in [71, 168], the record may belong to any of the classes of patients afflicted with heart disease. This relationship suggests that the attribute can be a great help in detecting patients suffering from heart disease. Another important point was the value of the Oldpeak attribute in class 4 patients. Together with the other attributes, the value of Oldpeak lay in the interval (0.7, 2.6]. Therefore, after careful examination, we can say that this model can serve as an appropriate model to help identify the factors influencing heart patients.

REFERENCES

[1] WHO Report, The Top 10 Causes of Death, Last Accessed 12/9/2013 From Http:// Who.Int/Mediacentre/Factsheets/Fs310/En/ [Accessed 02/03/2015].
[2] V. Paramasivam, T. S. Yee, S. K. Dhillon and A. S. Sidhu, "A methodological review of data mining techniques in predictive medicine: An application in hemodynamic prediction for abdominal aortic aneurysm disease", Biocybernetics and Biomedical Engineering, Vol. 34, Issue 3, pp. 139-145, 2014.
[3] K. C. Tan, E. J. Teoh, Q. Yu and K. C. Goh, "A hybrid evolutionary algorithm for attribute selection in data mining", Expert Systems with Applications, Vol. 36, No. 4, pp. 8616-8630, 2009.
[4] S. Mendis, David Porter, Judith Mackay, Lauren O'Brien. Http://Www.Who.Int/Mediacentre/News/Releases/2004/Pr68/En/ [Accessed 03/04/2015].
[5] J. W. V. Goethe, "Types Of Cardiovascular Disease", Www.Who.Int/Cardiovascular_Diseases./Cvd, [Accessed 04.04.2015].
[6] J. Nahar, T. Imam, K. S. Tickle and Y. P. P. Chen, "Association rule mining to detect factors which contribute to heart disease in males and females", Expert Systems with Applications, Vol. 40, No. 4, pp. 1086-1093, 2013.
[7] KEEL, Cleveland Heart Disease Dataset, "A Software Tool To Assess Evolutionary Algorithms For Data Mining Problem", Http://Sci2s.Ugr.Es/Keel/Dataset.Php?Cod=57, [Accessed 02/12/2015].
[8] K. Polat and Salih Güneş, "A hybrid approach to medical decision support systems: Combining feature selection, fuzzy weighted pre-processing and AIRS", Computer methods and programs in biomedicine, Vol. 88, No. 2, pp. 164-174, 2007.
[9] R. Alizadehsani, J. Habibi, M. J. Hosseini, H. Mashayekhi, R. Boghrati, A. Ghandeharioun, B. Bahadorian and Z. A. Sani, "A data mining approach for diagnosis of coronary artery disease", Computer methods and programs in biomedicine, Vol. 111, No. 1, pp. 52-61, 2013.
[10] M. A. M. Abushariah, A. A. M. Alqudah, O. Y. Adwan and R. M. M. Yousef, "Automatic Heart Disease Diagnosis System Based on Artificial Neural Network (ANN) and Adaptive Neuro-Fuzzy Inference Systems (ANFIS) Approaches", Journal of Software Engineering and Applications, 7, No. 12, pp. 1055-, 2014.
[11] N. Ziasabounchi and I. Askerzade, "ANFIS Based Classification Model for Heart Disease Prediction", International Journal Of Electrical & Computer Sciences IJECS-IJENS, Vol. 14, No. 02, pp. 7-12, 2014.
[12] S. Bhatia, P. Prakash and G. N. Pillai, "SVM based decision support system for heart disease classification with integer-coded genetic algorithm to select critical features", In Proceedings of the World Congress on Engineering and Computer Science, WCECS, San Francisco, USA, pp. 22-24, 2008.
[13] M. Shouman, T. Turner and R. Stocker, "Using decision tree for diagnosing heart disease patients", In Proceedings of the Ninth Australasian Data Mining Conference-Volume 121, Australian Computer Society, Inc., pp. 23-30, 2011.
[14] T. Nguyen, A. Khosravi, D. Creighton and S. Nahavandi, "Classification of healthcare data using genetic fuzzy logic system and wavelets", Expert Systems with Applications, Vol. 42, No. 4, pp. 2184-2197, 2015.
[15] H. D. Masethe and M. A. Masethe, "Prediction of Heart Disease using Classification Algorithms", In Proceedings of the World Congress on Engineering and Computer Science, WCECS, San Francisco, USA, Vol. 2, 2014.
[16] J. Soni, U. Ansari, D. Sharma and S. Soni, "Predictive data mining for medical diagnosis: An overview of heart disease prediction", International Journal of Computer Applications, Vol. 17, No. 8, pp. 43-48, 2011.
[17] C. S. Dangare and S. S. Apte, "A data mining approach for prediction of heart disease using neural networks", International Journal of Computer Engineering and Technology (IJCET), Vol. 3, No. 3, 2012.
[18] U. R. Acharya, S. VSree, M. M. R. Krishnan, N. Krishnananda, S. Ranjan, P. Umesh, and J. S. Suri, "Automated classification of patients with coronary artery disease using grayscale features from left ventricle echocardiographic images", Computer methods and programs in biomedicine, Vol. 112, No. 3, pp. 624-632, 2013.
[19] N. Esfandiari, M. R. Babavalian, A-M E. Moghadam and V. Kashani Tabar, "Knowledge discovery in medicine: Current issue and future trend", Expert Systems with Applications, Vol. 41, No. 9, pp. 4434-4463, 2014.
[20] K. E. Walker and S. M. Crotty, "Classifying high-prevalence neighborhoods for cardiovascular disease in Texas", Applied Geography, Vol. 57, pp. 22-31, 2015.
[21] J. R. Quinlan, "Induction of decision trees", Machine learning, Vol. 1, No. 1, pp. 81-106, 1986.
[22] J. R. Quinlan, "C4.5: programs for machine learning", Elsevier, 2014.
[23] J. R. Quinlan, "Bagging, boosting, and C4.5", In AAAI/IAAI, Vol. 1, pp. 725-730, 1996.
[24] J. R. Quinlan, "C5", Http://Rulequest.Com, 2007.
[25] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, "Classification and regression trees", CRC press, 1984.
[26] G. V. Kass, "An exploratory technique for investigating large quantities of categorical data", Applied statistics, pp. 119-127, 1980.
[27] W-Y. Loh and Y-S. Shih, "Split selection methods for classification trees", Statistica sinica, Vol. 7, No. 4, pp. 815-840, 1997.
[28] S. Alizadeh, M. Ghazanfari, and B. Teimorpour, "Data Mining and Knowledge Discovery", Publication of Iran University of Science and Technology, 2011.
[29] J. Han, M. Kamber, and J. Pei, "Data mining: concepts and techniques", Elsevier, 2011.

Moloud Abdar. He received his Undergraduate degree in Computer Engineering from the University of Damghan, Iran in 2015. He has more than 9 conference and journal papers about Data Mining. Currently, his research interests include data mining, web and text mining, artificial intelligence and image processing.