
Expert Systems with Applications 40 (2013) 96–104

Contents lists available at SciVerse ScienceDirect

Expert Systems with Applications


journal homepage: www.elsevier.com/locate/eswa

Computational intelligence for heart disease diagnosis: A medical knowledge driven approach

Jesmin Nahar a,*, Tasadduq Imam a, Kevin S. Tickle a, Yi-Ping Phoebe Chen b

a Faculty of Arts, Business, Informatics and Education, Central Queensland University, Queensland, Australia
b Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Victoria 3086, Australia

Keywords: Cleveland data; Heart disease; Computational intelligence; Classification; Feature selection

Abstract

This paper investigates a number of computational intelligence techniques in the detection of heart disease. In particular, a comparison of six well-known classifiers on the widely used Cleveland data is performed. Further, this paper highlights the potential of an expert-judgment-based (i.e., medical knowledge driven) feature selection process (termed MFS), and compares it against the generally employed computational intelligence based feature selection mechanism. This article also recognizes that the publicly available Cleveland data becomes imbalanced when considering binary classification. The performance of the classifiers, and the potential of MFS, are investigated with this imbalanced-data issue in mind. The experimental results demonstrate that the use of MFS noticeably improved the performance, especially in terms of accuracy, for most of the classifiers considered and for the majority of the datasets (generated by converting the Cleveland dataset for binary classification). MFS combined with the computerized feature selection process (CFS) has also been investigated and showed encouraging results, particularly for Naive Bayes, IBK and SMO. In summary, the medical knowledge based feature selection method has shown promise for use in heart disease diagnostics.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Various classification and regression processes have been used to identify heart disease (Boors et al., 2000; Das, Turkoglu, & Sengur, 2009; Detrano et al., 1989; El-hanjouri, Alkhaldi, Hamdy, & Alim, 2002; Skalak, 1997). In particular, focus has been placed on the University of California Irvine (UCI) heart disease dataset, also known as the Cleveland dataset (Uci, 2009), and different computational intelligence algorithms have been applied to it. Existing investigations are, however, to the best of the authors' knowledge, yet to present a comparative study that considers modern classification techniques and the imbalanced nature of the data, and that employs a feature selection process incorporating medical knowledge. Medical knowledge is important for feature selection in this area, since a computer-automated process may remove clinically important features or select features that are less likely to be clinically relevant. The research presented in this paper highlights this issue, and the findings may contribute to the future identification of heart disease.

The plan of this paper is as follows: Section 2 provides an overview of existing research on using computational intelligence techniques in heart disease diagnosis. Sections 3 and 4 describe the datasets and the experimental setup used in this research. Section 5 then presents the comparative study of the different classification algorithms and identifies the best suited ones for this problem. Section 6 describes significant risk factors for heart disease from a medical point of view. Section 7 presents the results of the comparison of the computer feature selection (CFS) process and medical knowledge based feature selection for the heart disease dataset. Section 8 proposes the medical knowledge motivated feature selection (MFS), as well as a process combining CFS with MFS. Finally, Section 9 concludes the paper with a summary of findings and future research directions.

* Corresponding author. Tel.: +61 07 40232112; fax: +61 07 49309700. E-mail addresses: [email protected] (J. Nahar), [email protected] (T. Imam), [email protected] (K.S. Tickle), [email protected] (Y.-P.P. Chen).

2. Computational intelligence for heart disease diagnostics

This section provides an overview of existing research on using computational intelligence techniques in heart disease diagnosis and points to the limitations that motivated this research. Cardiovascular disease is a highly fatal disease, responsible for over 17 million deaths globally (Smith, 2010), so early detection and treatment of the disease are imperative. Researchers have used different computational intelligence techniques to improve heart disease diagnostics over the years. A particular heart disease diagnostic dataset widely popular with data mining researchers is the publicly available University of California Irvine, Cleveland dataset (Uci, 2009). Some of the key research efforts on this dataset are:

0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2012.07.032

• Aha and Kibler (1988) used the dataset to predict the effectiveness of instance-based algorithms and achieved 77% and 74.8% accuracy for the NTgrowth and C4.5 techniques.
• Detrano et al. (1989) investigated a probabilistic algorithm to diagnose the risk of coronary artery disease and concluded that patients experiencing chest pain and transitional disease occurrences are the higher risk subjects.
• Gennari, Langley, and Fisher (1989) explored a conceptual clustering system and obtained an acceptable accuracy (78.9%).
• Edmonds (2005) worked on the Cleveland data set with a focus on comparing global evolutionary computation approaches, and observed some prediction performance improvement with a new approach. However, the performance of the proposed technique is dependent on the attributes selected by the algorithm.

Other than these works, several studies have focused on diverse aspects of heart disease diagnosis on different datasets (Avci, 2009; Boors et al., 2000; Doyle, Temko, Marnane, Lightbody, & Boylan, 2010; El-hanjouri et al., 2002; Gamboa, Mendoza, Orozco, Vargas, & Gress, 2006; Maglogiannis, Loukis, Zafiropoulos, & Stasis, 2009; Obayya & Abou-chadi, 2008; Zheng, Jiang, & Yan, 2006; Kim, Lee, Cho, & Oh, 2008). Different researchers have also used different machine learning techniques in related research. These include: fuzzy support vector clustering for the identification of heart disease (Gamboa et al., 2006); prototype development using data mining techniques, mainly decision trees, Naive Bayes and neural networks (Palaniappan & Awang, 2008); a diagnostic system improved using feature extraction and Hidden Markov Models (HMM) (El-hanjouri et al., 2002); a data fusion approach recommended for classifying heart diseases (Obayya & Abou-chadi, 2008); an intelligent system based on genetic-support vector machines (GSVM) (Avci, 2009); an automated detection system based on SVM classification (Maglogiannis et al., 2009); a committee machine (CM) based on an ensemble of Multilayer Perceptrons (MLP) (Zheng et al., 2006); a computerized cardiovascular disease diagnosis and categorization system (Kim et al., 2008); and decision trees and SVM to predict heart disease (Soman, Shyam, & Madhavdas, 2003).

Feature selection has also been applied in heart disease diagnostics, but mainly for datasets other than Cleveland. For instance, Zhao, Chen, Hou, Zheng, and Wang (2010) used a backward elimination procedure along with a novel algorithm, and Fan and Chaovalitwongse (2010) suggested a novel optimization framework for improved feature selection in classification. Several other researchers have also noted the impact of feature selection in heart disease diagnosis (Chang, 2010; Hanbay, 2009; Qazi et al., 2007; Zhao, Guo et al., 2010). Further, feature selection processes have often been found to improve the prediction performance of different classifiers (Abraham, Simha, & Iyengar, 2007; Cheng, Wei, & Tseng, 2006; Devaney & Ram, 1997; Polat & Guenes, 2009; Sethi & Jain, 2010; Wang & Ma, 2009; Zhao, Chen et al., 2010).

It is observed that a number of different classifiers have been used to diagnose heart disease in the different studies. The comparison of different algorithms in order to identify heart disease, however, has to date not received appropriate focus. In addition, the literature has not taken into account medical knowledge based feature selection for medical datasets during the classification of heart disease. Computer based feature selection (CFS) selects features automatically, by calculating the significance of the attributes and by considering their individual predictive capacity. So, there is a chance of discarding medically important factors for a specific disease. For instance, as shown in Fig. 4, applying computerized feature selection (CFS) on the Cleveland dataset (with healthy as the positive class) discards medically established attributes like age, cholesterol, fasting blood sugar, resting blood pressure and ECG characteristics. This sort of outcome is doubted by medical practitioners and reduces the significance of the automated system. So, a feature selection process motivated by medical knowledge is important.

The literature also indicates that in most cases complex and time intensive algorithms have been recommended. Well-known standard classification algorithms are, however, more easily accessible due to their availability in different software packages. From the medical practitioner's point of view, in particular, a comprehensive analysis of well-established classifiers is therefore essential.

This research focuses on these issues. As the Cleveland dataset is considered a benchmark in much existing research, this research also uses this dataset. The study provides a comparative assessment of the suitability of commonly used classifiers. In addition, the research investigates a medical knowledge guided feature selection process for the classification of heart disease.

3. Dataset details

As mentioned earlier, the popular and publicly available UCI heart disease dataset is used in this research. The UCI heart disease dataset consists of a total of 76 attributes. However, the majority of existing studies have used only a maximum of 14 attributes (Uci, 2009; Uci, 2010). Different datasets have been based on the UCI heart disease data. Computational intelligence researchers, however, have mainly used the Cleveland dataset consisting of 14 attributes. The 14 attributes of the Cleveland dataset, along with their values and data types, are as follows (Uci, 2009; Uci, 2010).

1. Age: age in years (numeric);
2. Sex: male, female (nominal);
3. Chest pain type (CP): (a) typical angina (angina), (b) atypical angina (abnang), (c) non-anginal pain (notang), (d) asymptomatic (asympt) (nominal). From a medical point of view:
   (a) Typical angina is the condition in which the past history of the patient shows the usual symptoms, so the possibility of having coronary artery blockages is high (Baliga & Eagle, 2008; Diagnosis, 2010; Kaul, 2010).
   (b) Atypical angina refers to the condition in which the patient's symptoms are not detailed, so the probability of blockages is lower (Baliga & Eagle, 2008; Diagnosis, 2010; Kaul, 2010).
   (c) Non-anginal pain is a stabbing or knife-like, prolonged, dull, or painful condition that can last for short or long periods of time (Diagnosis, 2010; Mengel & Schwiebert, 2005; Society, 1945).
   (d) Asymptomatic pain shows no symptoms of illness or disease and possibly will not cause or exhibit disease symptoms (Pickett, 2000; Freedc, 2010);
4. Trestbps: the patient's resting blood pressure in mm Hg at the time of admission to the hospital (numeric);
5. Chol: serum cholesterol in mg/dl (numeric);
6. Fbs: Boolean measure indicating whether fasting blood sugar is greater than 120 mg/dl (1 = true; 0 = false) (nominal);
7. Restecg: electrocardiographic results during rest. Three types of values: normal (norm); abnormal (abn), i.e., having ST-T wave abnormality; and ventricular hypertrophy (hyp) (nominal);
8. Thalach: maximum heart rate attained (numeric);
9. Exang: Boolean measure indicating whether exercise induced angina has occurred (1 = yes, 0 = no) (nominal);
10. Oldpeak: ST depression brought about by exercise relative to rest (numeric);
11. Slope: the slope of the ST segment for peak exercise. Three types of values: upsloping, flat, downsloping (nominal);

Table 1
The heart-disease datasets (Uci, 2009).

Dataset name | Class label considered as positive class | No. of positive class instances | No. of negative class instances | Class (Label: No. of instances) | Status indicated by positive class
H-O    | 0 (healthy) | 165 | 138 | 1:165, 2:138 | Healthy, Sick
Sick-1 | 1 (sick1)   | 56  | 247 | 1:56,  2:247 | Sick1
Sick-2 | 2 (sick2)   | 37  | 266 | 1:37,  2:266 | Sick2
Sick-3 | 3 (sick3)   | 36  | 267 | 1:36,  2:267 | Sick3
Sick-4 | 4 (sick4)   | 14  | 289 | 1:14,  2:289 | Sick4
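The conversion behind Table 1 is a standard one-vs-rest relabeling of the multi-class target. A minimal Python sketch (the variable names and the tiny target vector are illustrative stand-ins, not the paper's data):

```python
# One-vs-rest relabeling used to derive the five binary datasets in
# Table 1: one value of the Cleveland 'num' target (0 = healthy,
# 1-4 = sick types) is taken as positive, everything else as negative.

def one_vs_rest(targets, positive_label):
    """Relabel a multi-class target list: positive_label -> 1, rest -> 0."""
    return [1 if t == positive_label else 0 for t in targets]

# Tiny illustrative target vector with classes 0-4.
targets = [0, 0, 1, 2, 0, 3, 4, 1, 0, 2]

for name, label in [("H-O", 0), ("Sick-1", 1), ("Sick-2", 2),
                    ("Sick-3", 3), ("Sick-4", 4)]:
    binary = one_vs_rest(targets, label)
    print(name, "positives:", sum(binary),
          "negatives:", len(binary) - sum(binary))
```

Applied to the full Cleveland data, the call with label 0 yields the H-O dataset (165 positive vs. 138 negative instances per Table 1), and labels 1-4 yield the four sick datasets, whose strong positive/negative imbalance motivates the later discussion.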

12. Ca: number of major vessels (0–3) colored by fluoroscopy (numeric);
13. Thal: the heart status (normal, fixed defect, reversible defect) (nominal);
14. The class attribute: the value is either healthy or heart disease (sick type: 1, 2, 3, and 4).

The class attribute has five values, indicating either healthy or one of four sick types. For this research, the multi-class classification problem is converted into a binary classification problem. The reason for this is that SMO, a robust and modern algorithm used in the experiments, is principally a binary classifier. For the conversion into binary, one of the class labels was considered as positive and the rest as negative. In this way, five datasets were created. Table 1 shows the characteristics of these datasets. The generated datasets are referenced using the symbols H-O (healthy), Sick1, Sick2, Sick3 and Sick4, respectively.

4. Research design

To provide a comparison among the popular classification algorithms, four performance metrics were used in our experiment: accuracy, true positive rate (TP), F-measure, and time. Here, accuracy is the overall prediction accuracy, the true positive rate (TP) is the accurate classification rate for the positive class, and the F-measure indicates the effectiveness of an algorithm when the accurate prediction rates for both classes are considered. Training time was also considered, to compare the computational complexity of learning.

In the case of medical data diagnosis, many researchers have used a 10-fold cross validation on the total data and reported the result for disease detection, while other researchers have not used this method for heart disease prediction (Abdel-aal, 2005; Baek, Tsai, & Chen, 2009; Chen et al., 2007; Dash, 2008; Fountoulaki, Karacapilidis, & Manatakis, 2010; Kumar & Shelokar, 2008; Mei, Ma, Ashley-koch, & Martin, 2005; Polat & Günes, 2007; Xing, Wang, Zhao, & Gao, 2007). We argue that selecting the best training parameters on a validation set and reporting prediction on a test set is more authentic than simply performing a 10-fold cross validation on a training set. However, to relate to common practice, we have used both the train-test split method and 10-fold cross validation when comparing the algorithms.

The experimental software used was Weka (Kumar & Shelokar, 2008; Witten & Frank, 2005). Six classification techniques, as implemented in Weka, were used: Naive Bayes, SMO, IBK, AdaBoostM1, J48 and PART. Their performance is compared in terms of accuracy, TP, F-measure, and time.

5. Comparison of algorithms

Weka provides facilities to report the performance of classifiers by performing a 10-fold cross validation on a provided dataset and reporting performance results on that dataset. This is the method generally used in many studies. However, this method is expected to be biased towards the training data and may not reflect the expected performance when applied to real-life data. So, in addition to the generally used 10-fold cross validation, we have also performed a train-test split on the dataset and then used a 10-fold cross validation to select the best parameters for training. Performance results were presented based on the prediction outcomes on the test set. For each of the datasets, a stratified sampling process was used to select two-thirds of the data for training and the rest for prediction. The CVParameterSelection tool provided by Weka was used for the train-test split.

The results obtained using the two experimental processes are shown in Table 2. The following discussion refers to these two experiments as the 10-fold and the CVP 10-fold.

From the results (Table 2), it was found that for the 10-fold, SMO is the best performing algorithm for all the datasets in terms of accuracy. But the CVP 10-fold results showed that the SMO algorithm is the best for three datasets only (sick-1, sick-2, and sick-4). For the H-O (healthy) dataset, SMO shows better performance in terms of TP (0.891) and F-measure (0.862) compared to the other algorithms using the 10-fold cross validation. But using the CVP 10-fold, the results showed that Naive Bayes was best in terms of accuracy, IBK was best in terms of TP, and AdaBoostM1 was best in terms of F-measure.

It is observed that the two experimental processes gave varied results. But, in terms of accuracy, the best performing algorithm was SMO for both the 10-fold and the CVP 10-fold. Thus it can be concluded that, when considering accuracy as the key performance measure, the experimental results show that SMO is the best suited classification algorithm among the six algorithms for the UCI heart disease dataset. With respect to performance metrics like TP and F-measure, no such clear outcome exists, and the choice of the best algorithm would then depend upon the characteristics of the dataset.

6. Responsible risk factors for heart disease from a medical point of view

Single risk factors are not sufficiently sensitive to identify all individuals at high risk of heart disease. There are different risk factors for heart disease, and while none of these factors is unimportant for disease diagnosis, some of them deserve to be more actively considered when diagnosing heart disease, according to the medical literature. For example, in heart disease diagnosis, exercise stress testing is most of the time an important factor in assessing coronary heart disease. The literature argues that this test is not necessarily a wise decision for the majority of patients with a recognized unstable angina (a type of chest pain) condition or those with an abnormal ECG. One could argue that, in heart disease diagnosis, if the patient is able to exercise for 6 or 8 min without any complaints in the chest, that patient could be considered to possess a healthy heart condition. In contrast, if patients display any pain or discomfort in the chest during exercise, or if their heart rate falls or goes up, this could suggest an abnormal condition of the heart;

Table 2
Performance for the 10-fold and CVP 10-fold experiments. (Bold values indicate the best algorithm; performance close to the best is also bold. The 10-fold columns show the results from the 10-fold cross validation, while the CVP 10-fold columns show the results from the train-test split, with 10-fold cross validation applied on the training set to choose the best learning parameters.)

Dataset Algorithms Accuracy (%) TP F-measure Training time (s)


10-fold CVP-10-fold 10-fold CVP-10-fold 10-fold CVP-10-fold 10-fold CVP-10-fold
H-O Naive Bayes 83.83 81.88 0.797 0.782 0.818 0.819 0.02 0.02
SMO 84.49 75.25 0.891 0.764 0.862 0.771 0.14 0.14
IBK 76.90 76.24 0.800 0.855 0.790 0.797 0 0
AdaBoostM1 83.50 81.19 0.855 0.818 0.849 0.826 0.03 0.03
J48 76.57 76.24 0.806 0.745 0.789 0.774 0.03 0.03
PART 81.52 79.21 0.830 0.818 0.830 0.811 0.03 0.02
Sick-1 Naive Bayes 74.92 72.28 0.196 0.111 0.224 0.125 0.02 0
SMO 81.52 82.18 0 0 0 0 0.14 0.16
IBK 73.27 72.28 0.321 0.111 0.308 0.125 0 0.02
AdaBoostM1 81.52 82.18 0 0 0 0 0.03 0.05
J48 81.52 82.18 0 0 0 0 0.03 0.03
PART 75.25 71.29 0.214 0.222 0.242 0.216 0.03 0.02
Sick-2 Naive Bayes 78.55 79.21 0.405 0.333 0.772 0.276 0.02 0.02
SMO 87.79 88.12 0 0 0 0 0.16 0.16
IBK 82.84 79.21 0.216 0.167 0.586 0.16 0 0
AdaBoostM1 86.80 86.14 0 0 0 0 0.03 0.03
J48 86.14 86.14 0 0.167 0 0.222 0.03 0.02
PART 79.87 85.15 0.162 0.167 0.539 0.211 0.02 0.02
Sick-3 Naive Bayes 81.52 81.19 0.472 0.583 0.378 0.424 0 0
SMO 87.88 88.11 0 0 0 0 0.22 0.16
IBK 83.17 87.13 0.194 0.333 0.215 0.381 0 0
AdaBoostM1 87.46 86.14 0.056 0.083 0.095 0.125 0.03 0.03
J48 87.46 88.12 0.083 0 0.136 0 0.05 0.02
PART 82.18 83.17 0.167 0 0.182 0 0 0.03
Sick-4 Naive Bayes 89.77 92.08 0.143 0 0.114 0 0 0
SMO 95.38 96.04 0 0 0 0 0.19 0.16
IBK 92.74 95.05 0.214 0 0.644 0 0 0
AdaBoostM1 95.38 96.04 0 0 0 0 0.03 0.03
J48 95.38 96.04 0 0 0 0 0 0.02
PART 93.40 96.04 0.143 0 0.167 0 0 0.02
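The metrics reported in Table 2 can all be derived from a binary confusion matrix. A stdlib Python sketch (illustrative only; the paper itself used Weka's built-in evaluation):

```python
def binary_metrics(actual, predicted, positive=1):
    """Return (accuracy, tp_rate, f_measure) with respect to the positive class."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    tn = len(actual) - tp - fp - fn

    accuracy = (tp + tn) / len(actual)
    tp_rate = tp / (tp + fn) if tp + fn else 0.0   # recall on the positive class
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * tp_rate / (precision + tp_rate)
                 if precision + tp_rate else 0.0)
    return accuracy, tp_rate, f_measure

# A classifier that never predicts the minority (positive) class can still
# score high accuracy on imbalanced data, while TP rate and F-measure drop
# to 0 -- the same pattern as the 0 entries for SMO on the sick datasets.
acc, tpr, f = binary_metrics([1, 0, 0, 0, 0], [0, 0, 0, 0, 0])
print(acc, tpr, f)  # 0.8 0.0 0.0
```

This is why accuracy alone can be misleading for the sick-1 to sick-4 datasets, and why Table 2 reports TP and F-measure alongside it.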

or when the patient is resting or exercising, his or her ECG showing an abnormal reading could be an indication of heart disease status. A positive exercise stress test is indicated by abnormal horizontal or downsloping ST segment depression (Khan, 2005). Those factors might support the diagnosis of heart disease more strongly than other factors and are discussed below.

From this overview of the medical literature (details in Section 6 and Table 3), it can be seen that factors such as cholesterol, hypertension (blood pressure), heart rate, resting ECG, blood sugar, diabetes, exercise induced angina, stress and older age are the most significant factors for predicting heart disease. Several books and articles mention these particular factors as being effective, and they should therefore be considered during disease diagnosis. In the UCI Cleveland heart disease dataset, there are eight factors of medical significance that are considered for feature selection as MFS. The factors are: age, chest pain type (angina, abnang, notang, asympt), resting blood pressure, cholesterol, fasting blood sugar, resting ECG (normal, abnormal, ventricular hypertrophy), maximum heart rate, and exercise induced angina. As discussed previously, feature selection based on medical knowledge is an important factor in heart disease diagnosis. It can be argued that if significant symptoms related to heart disease are not considered during feature selection, then there is a strong likelihood that the diagnosis runs the risk of neglecting the most important factors. In the subsequent experiment, the knowledge derived from the medical literature survey is taken into account and the eight factors mentioned above are used for feature selection.

7. Comparison of automated and medical knowledge based feature selection

This section presents a comparison between medical knowledge based feature selection and computerized feature selection. For the computerized feature selection process, the CfsSubsetEval attribute selection (using the BestFirst search strategy) provided by Weka (Witten & Frank, 2005) was used. In later discussions, the symbol CFS (computer feature selection) is used to indicate this process. The attributes selected by MFS and CFS for the H-O dataset are shown in Fig. 1.

CFS selection is based on computational modeling of significant predictors and, as a consequence, does not necessarily consider medically significant factors. It can be seen that medically important attributes such as age, resting blood pressure, cholesterol, fasting blood sugar and resting ECG have been discarded by CFS for the healthy (H-O) dataset. In sick-1 (Fig. 2), age, resting blood pressure, cholesterol, fasting blood sugar, resting ECG and max heart rate were not considered relevant by CFS. Similarly, for sick-2 (Fig. 3), age, resting blood pressure, cholesterol and resting ECG were discarded by CFS, while for sick-3 (Fig. 4), age, resting blood pressure, cholesterol, resting ECG and max heart rate were discarded by CFS. And for sick-4 (Fig. 5), age, resting blood pressure, cholesterol, fasting blood sugar, max heart rate and exercise induced angina were not considered; only five attributes were selected by CFS: chest pain type, resting ECG, slope, number of vessels colored and thal (heart status). Factors such as age, resting blood pressure, cholesterol, maximum heart rate, exercise induced angina and fasting blood sugar, selected by MFS, were disregarded by CFS. The results indicate that the CFS method selected attributes based on the significance of association among the factors. The six classification algorithms were executed on each of the datasets, with feature selection applied based on CFS or MFS. Table 4 presents the performance of the algorithms using datasets selected on the basis of MFS and CFS, alongside the CVP 10-fold results. Performances are quantified in terms of accuracy, TP and F-measure.

Results indicated that for H-O, MFS performance improved when compared with CFS in terms of accuracy for two cases

Table 3
Medically important factors related to heart disease. (✓ = factor considered by the source.)

Source | Considered factors: Age; Angina; Blood pressure; Blood sugar; Chest pain; Cholesterol; Diabetes; ECG; Exercise induced angina; Fasting serum glucose & total serum cholesterol; Heart rate; Hypertension; Low exercise workload; Smoking; Stress

Boyko et al., 2004: ✓
Dietrich et al., 2008: ✓ ✓ ✓ ✓
Ding et al., 2004: ✓
Edlin and Golanty, 2009: ✓ ✓ ✓ ✓ ✓ ✓ ✓
Edlin et al., 1999: ✓ ✓ ✓ ✓
Facchini et al., 2009: ✓ ✓
Fuster et al., 2005: ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Goodpaster et al., 2003: ✓
Hales, 2008: ✓ ✓ ✓ ✓
Hales, 2009: ✓ ✓ ✓ ✓
Hayashi et al., 2004: ✓
Hoeger and Hoeger, 2010: ✓ ✓ ✓ ✓ ✓ ✓ ✓
Huikuri, 2009: ✓ ✓
Kanaya et al., 2004: ✓
Khang et al., 2008: ✓ ✓
Kurl et al., 2009: ✓
Lindeberg et al., 2009: ✓ ✓ ✓
Lindsay and Gaw, 2004: ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Lyerly et al., 2008: ✓
Maylunas and Mironenko, 2005: ✓ ✓ ✓ ✓ ✓ ✓
Miller, 2008: ✓ ✓ ✓ ✓ ✓ ✓
Mittal, 2005: ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Mozaffarian et al., 2008: ✓ ✓ ✓
Nyman et al., 2009: ✓
Peacock et al., 2009: ✓ ✓ ✓ ✓ ✓ ✓ ✓
Schenck-Gustafsson, 2009: ✓ ✓ ✓ ✓
Sizer and Whitney, 2007: ✓ ✓ ✓ ✓
Stansfeld et al., 2009: ✓
Tan et al., 2008: ✓
Yusuf et al., 2004: ✓ ✓ ✓ ✓

Fig. 1. Attributes selected by MFS and CFS for dataset (H-O). [MFS only: age, resting blood pressure, cholesterol, fasting blood sugar, resting ECG. Both MFS and CFS: chest pain type, maximum heart rate, exercise induced angina. CFS only: oldpeak, number of vessels coloured, thal.]

(IBK-100, PART-86.77) and comparable to CFS (J48-80.88). For sick-1, the performance of MFS was comparable to CFS for two cases (SMO-82.35, J48-82.35) and better than CFS (Naive Bayes-85.29, AdaBoostM1-82.35) in two cases. Similar characteristics were observed for sick-2,

Fig. 2. Attributes selected by MFS and CFS for dataset (sick-1). [MFS only: age, resting blood pressure, cholesterol, fasting blood sugar, resting ECG, maximum heart rate. Both MFS and CFS: chest pain type, exercise induced angina. CFS only: sex, number of vessels coloured, thal.]

Fig. 3. Attributes selected by MFS and CFS for dataset (sick-2). [MFS only: age, resting blood pressure, cholesterol, resting ECG. Both MFS and CFS: chest pain type, maximum heart rate, exercise induced angina, fasting blood sugar. CFS only: oldpeak, number of vessels coloured, thal.]

Fig. 4. Attributes selected by MFS and CFS for dataset (sick-3). [MFS only: age, resting blood pressure, cholesterol, resting ECG, maximum heart rate. Both MFS and CFS: chest pain type, exercise induced angina, fasting blood sugar. CFS only: slope, number of vessels coloured, thal.]



Fig. 5. Attributes selected by MFS and CFS for dataset (sick-4). [MFS only: age, resting blood pressure, cholesterol, fasting blood sugar, maximum heart rate, exercise induced angina. Both MFS and CFS: chest pain type, resting ECG. CFS only: slope, number of vessels coloured, thal.]
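The MFS/CFS comparisons in Figs. 1-5 reduce to set operations over attribute names. A sketch for the sick-4 case, with the subsets taken from the discussion in Section 7 (the union is the MFS+CFS combination examined in Section 8; the attribute strings are spelled as in the running text):

```python
# Attribute subsets for the sick-4 dataset, as stated in Section 7.
mfs = {"age", "chest pain type", "resting blood pressure", "cholesterol",
       "fasting blood sugar", "resting ECG", "maximum heart rate",
       "exercise induced angina"}
cfs = {"chest pain type", "resting ECG", "slope",
       "number of vessels coloured", "thal"}

mfs_only = mfs - cfs       # medically motivated attributes discarded by CFS
shared = mfs & cfs         # attributes chosen by both processes
mfs_plus_cfs = mfs | cfs   # the combined MFS+CFS attribute set

print(sorted(shared))      # ['chest pain type', 'resting ECG']
print(len(mfs_plus_cfs))   # 11
```

The large `mfs_only` set (six of the eight medically motivated attributes) is the concrete form of the paper's complaint that automated selection discards clinically established factors.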

Table 4
Combining proposed feature selection (MFS) with automated feature selection (CFS).

Dataset  Algorithms    Accuracy (%)               TP                         F-measure
                       MFS     CFS    CVP-10-fold MFS    CFS    CVP-10-fold  MFS    CFS    CVP-10-fold
H-O      Naive Bayes   69.11   82.36  81.88       0.757  0.865  0.782        0.727  0.842  0.819
         SMO           77.95   77.94  75.25       0.811  0.892  0.764        0.8    0.815  0.771
         IBK           100     80.88  76.24       1      0.838  0.855        1      0.827  0.797
         AdaBoostM1    72.05   77.94  81.19       0.784  0.784  0.818        0.753  0.795  0.826
         J48           80.88   80.88  76.24       0.838  0.838  0.745        0.827  0.827  0.774
         PART          86.77   82.35  79.21       0.919  0.865  0.818        0.883  0.842  0.811
Sick-1   Naive Bayes   85.29   76.47  72.28       0.167  0.167  0.111        0.286  0.2    0.125
         SMO           82.35   82.35  82.18       0      0      0            0      0      0
         IBK           67.65   70.58  72.28       0.333  0.167  0.111        0.267  0.167  0.125
         AdaBoostM1    82.35   76.47  82.18       0      0      0            0      0      0
         J48           82.35   82.35  82.18       0      0      0            0      0      0
         PART          73.53   73.53  71.29       0.167  0      0.222        0.182  0      0.216
Sick-2   Naive Bayes   85.29   79.41  79.21       0      0.25   0.333        0      0.222  0.276
         SMO           88.24   88.25  88.12       0      0      0            0      0      0
         IBK (2)       88.24   82.35  79.21       0.25   0.25   0.167        0.333  0.25   0.16
         AdaBoostM1    88.24   82.35  86.14       0      0.25   0            0      0.25   0
         J48           85.29   88.23  86.14       0      0      0.167        0      0      0.222
         PART          85.29   91.18  85.15       0.25   0.25   0.167        0.286  0.4    0.211
Sick-3   Naive Bayes   82.35   100    81.19       0.25   1      0.583        0.25   1      0.424
         SMO           88.24   83.33  88.11       0      0      0            0      0      0
         IBK           79.41   75     87.13       0.25   0      0.333        0.222  0      0.381
         AdaBoostM1    88.23   83.33  86.14       0      0      0.083        0      0      0.125
         J48           88.24   83.33  88.12       0      0      0            0      0      0
         PART          85.29   83.33  83.17       0.25   0      0            0.286  0      0
Sick-4   Naive Bayes   97.05   94.12  92.08       0      0      0            0      0      0
         SMO           97.05   97.05  96.04       0      0      0            0      0      0
         IBK           97.05   97.05  95.05       0      0      0            0      0      0
         AdaBoostM1    97.05   97.05  96.04       0      0      0            0      0      0
         J48           97.05   97.05  96.04       0      0      0            0      0      0
         PART          97.05   97.05  96.04       0      0      0            0      0      0

sick-3 and sick-4 datasets (for sick-2, MFS was comparable to CFS in one case and higher in three cases; for sick-3, MFS produced a higher accuracy reading than CFS in five cases; and for sick-4, MFS was higher in one case and comparable in the other five cases). Overall, in terms of accuracy, medical knowledge based feature selection produced better prediction performance than computerized feature selection. Table 4 also shows results in comparison to the CVP 10-fold settings used in the previous experiment (in other words, no feature selection). The results show that both MFS and CFS improved prediction rates, in terms of accuracy, for the majority of the algorithms and for all the datasets in comparison to the results with no feature selection. The experiment showed that when feature selection is applied on the UCI heart disease data, it advances the prediction efficiency, with medical knowledge based feature selection resulting in the better performance. It was also shown that using MFS with IBK resulted in better or comparable TP and F-measure results relative to the results using CFS for four datasets (H-O, sick-1, sick-2 and sick-3). Similar results were also observed using PART.

In the context of the experiment, it can be concluded that the results strongly suggest that using MFS improves the performance, especially in terms of accuracy, of most of the classifiers for the majority

Table 5
Medical feature selection compared with the combination of medical and computer based feature selection. (MFS stands for medical feature selection, while MFS+CFS stands for medical feature selection plus computer feature selection. Classifier performance on the five datasets is shown, with the best performance shown in bold.)

Dataset Algorithms Accuracy (%) TP F-measure


MFS MFS+CFS MFS MFS+CFS MFS MFS+CFS
H-O Naive Bayes 69.11 83.83 0.757 0.892 0.727 0.857
SMO 77.95 83.83 0.811 0.919 0.8 0.861
IBK 100 73.53 1 0.784 1 0.763
AdaBoostM1 72.05 80.88 0.784 0.811 0.753 0.822
J48 80.88 73.52 0.838 0.784 0.827 0.763
PART 86.77 75.00 0.919 0.757 0.883 0.767
Sick-1 Naive Bayes 85.29 77.94 0.167 0.077 0.286 0.118
SMO 82.35 75 0 0.077 0 0.105
IBK 67.65 76.47 0.333 0.231 0.267 0.273
AdaBoostM1 82.35 80.88 0 0 0 0
J48 82.35 80.88 0 0 0 0
PART 73.53 73.53 0.167 0.077 0.182 0.1
Sick-2 Naive Bayes 85.29 85.29 0 0.25 0 0.286
SMO 88.24 88.24 0 0 0 0
IBK (2) 88.24 94.11 0.25 0.5 0.333 0.667
AdaBoostM1 88.24 82.35 0 0.25 0 0.25
J48 85.29 88.23 0 0.5 0 0.5
PART 85.29 85.29 0.25 0.25 0.286 0.286
Sick-3 Naive Bayes 82.35 85.30 0.25 0.75 0.25 0.545
SMO 88.24 88.23 0 0 0 0
IBK 79.41 85.29 0.25 0 0.222 0
AdaBoostM1 88.23 85.30 0 0 0 0
J48 88.24 88.23 0 0 0 0
PART 85.29 88.23 0.25 0.5 0.286 0.5
Sick-4 Naive Bayes 97.05 91.17 0 0 0 0
SMO 97.05 97.05 0 0 0 0
IBK 97.05 97.05 0 0 0 0
AdaBoostM1 97.05 97.05 0 0 0 0
J48 97.05 97.05 0 0 0 0
PART 97.05 97.05 0 0 0 0
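The metrics reported in Table 5 (accuracy, TP rate and F-measure for the positive "sick" class) follow from a standard binary confusion matrix. The sketch below is illustrative only, not the authors' WEKA pipeline, and the example labels are hypothetical; it also shows why several sick-4 rows combine high accuracy with a zero TP rate, which happens whenever a classifier predicts the majority class for every instance of an imbalanced dataset.

```python
# Illustrative computation of accuracy, TP rate (recall) and F-measure
# for the positive class, as reported in Table 5. Minimal sketch; the
# label vectors below are hypothetical examples, not study data.

def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, tp_rate, f_measure) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0   # recall / sensitivity
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_measure = (2 * precision * tp_rate / (precision + tp_rate)
                 if (precision + tp_rate) else 0.0)
    return accuracy, tp_rate, f_measure

# Hypothetical predictions on an imbalanced sample (3 sick, 5 healthy).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
acc, tpr, f1 = binary_metrics(y_true, y_pred)
```

With these hypothetical labels the sketch yields an accuracy of 0.75 with a TP rate of 2/3; replacing `y_pred` with all zeros would give 0.625 accuracy but a TP rate and F-measure of 0, the pattern seen in the sick-4 rows.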

of the datasets. Therefore, the method shows promise in terms of measured performance. In the next section, MFS is investigated further in combination with CFS.

8. Extension to MFS

The previous section showed that MFS is a promising feature selection method in heart disease diagnosis and that CFS often disregards features that are medically important. This section details an extension to MFS. Features selected by CFS were combined with those selected by MFS, and the results were compared to using MFS alone. The objective was to see whether this combination of MFS and CFS improved prediction results over MFS. Results are shown in Table 5, where the experimental results for combining MFS and CFS are denoted MFS+CFS.

The results showed that, in terms of accuracy, MFS improved the performance in 10 cases (for H-O: IBK, J48 and PART (86.77%); for sick-1: Naive Bayes, SMO, AdaBoostM1 and J48; for sick-2: AdaBoostM1; for sick-3: AdaBoostM1; for sick-4: Naive Bayes) and was comparable to MFS+CFS in 11 cases across all the datasets. MFS+CFS showed improved accuracy in nine cases (for H-O: Naive Bayes, SMO and AdaBoostM1; for sick-1: IBK; for sick-2: IBK and J48; and for sick-3: Naive Bayes, IBK and PART). Although the performance of the two methods appeared comparable, in terms of accuracy MFS showed marginally higher results. In terms of TP and F-measure, however, MFS+CFS outperformed MFS alone for the majority of the cases, particularly for the Naive Bayes classifier (better than MFS for three datasets). SMO performed better using MFS+CFS in terms of accuracy for one dataset (H-O) and comparably to MFS for three datasets (sick-2, sick-3 and sick-4); similar results were also observed for the IBK algorithm. SMO further showed some improvement in terms of TP and F-measure using MFS+CFS over using MFS alone. Overall, the performance indicated that MFS feature selection is a favorable strategy for heart disease data, while the combination of MFS and CFS has promise for some of the classifiers (particularly Naive Bayes, IBK and SMO). Therefore, the proposed MFS and MFS+CFS are promising techniques for use in heart disease diagnostics.

9. Conclusion

Early detection of heart disease is essential to save lives, so understanding the usefulness of data mining for assisting in the diagnosis of heart disease is important. This paper has provided a detailed comparison of classifiers for the detection of heart disease. It was observed that SMO (a Support Vector Machine implementation) is a promising classification algorithm in this area, particularly when considering total accuracy as a performance measure. The paper also presented outcomes from using automated feature selection and a medical knowledge driven feature selection process (MFS). The experimental results demonstrated that the use of MFS noticeably improved the performance, especially in terms of accuracy, of most of the classifiers for the majority of the datasets. This indicates that the method has promise. MFS combined with the computerized feature selection process (CFS) was also evaluated, and encouraging results were seen for some of the classifiers, particularly Naive Bayes, IBK and SMO. In summary, MFS and MFS+CFS are promising techniques for use in heart disease diagnostics.
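The MFS+CFS combination described in Section 8 amounts to taking the union of the expert-selected features and those retained by the computerized (CFS) search before retraining each classifier. A minimal sketch follows; the attribute names are the standard Cleveland dataset attributes, but which of them each method selects below is purely illustrative, not the paper's actual selections.

```python
# Sketch of combining medical (MFS) and computerized (CFS) feature
# selection by set union, preserving the dataset's attribute order.
# The specific MFS and CFS selections below are hypothetical.

CLEVELAND_ATTRIBUTES = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
]

def combine_feature_sets(mfs, cfs, all_attributes=CLEVELAND_ATTRIBUTES):
    """Union of expert-chosen and CFS-chosen features, in dataset order."""
    chosen = set(mfs) | set(cfs)
    unknown = chosen - set(all_attributes)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return [a for a in all_attributes if a in chosen]

mfs = ["cp", "thalach", "exang", "oldpeak", "ca", "thal"]  # hypothetical
cfs = ["cp", "oldpeak", "slope", "ca"]                     # hypothetical
combined = combine_feature_sets(mfs, cfs)
```

Keeping the union in the dataset's original attribute order means the reduced dataset can be produced by a single column filter, after which any of the six classifiers is trained on the combined subset exactly as in the MFS-only experiments.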

