International Journal of Engineering & Technology, 7 (4) (2018) 5373-5379

International Journal of Engineering & Technology


Website: www.sciencepubco.com/index.php/IJET

Research paper

Machine learning classification techniques for heart disease prediction: a review
Maryam I. Al-Janabi 1, Mahmoud H. Qutqut 1 2 *, Mohammad Hijjawi 1
1 Faculty of Information Technology, Applied Science Private University, Amman, 11931 Jordan
2 Telecommunications Research Lab, School of Computing, Queen's University, Kingston, ON K7L 2N8 Canada
*Corresponding author E-mail: [email protected]

Abstract

The most crucial task in the healthcare field is disease diagnosis. If a disease is diagnosed early, many lives can be saved. Machine learning classification techniques can significantly benefit the medical field by providing an accurate and quick diagnosis of diseases, hence saving time for both doctors and patients. Heart disease is the number one killer in the world today and one of the most difficult diseases to diagnose. In this paper, we provide a survey of the machine learning classification techniques that have been proposed to help healthcare professionals in diagnosing heart disease. We start with an overview of machine learning and brief definitions of the most commonly used classification techniques for diagnosing heart disease. Then, we review representative research works on using machine learning classification techniques in this field. Also, a detailed tabular comparison of the surveyed papers is presented.

Keywords: Heart Disease; Heart Disease Diagnosis; Heart Disease Prediction; Machine Learning; Machine Learning Classification Techniques.

1. Introduction

[...] the task of making computers more intelligent. Since the most basic requirement of intelligence is learning, a subfield of AI emerged that is called machine learning (ML). ML is one of the most rapidly evolving fields of AI and is used in many areas of life, primarily in the healthcare field. ML has great value in healthcare because it is an intelligent tool for analyzing data, and the medical field is rich with data. In the past few years, large amounts of data have been collected and stored because of the digital revolution. Monitoring and other data collection devices are available in modern hospitals and are used every day, so abundant amounts of data are being gathered. It is very hard or even impossible for humans to derive useful information from these massive amounts of data, which is why machine learning is now widely used to analyze these data and diagnose problems in the healthcare field. Put simply, a machine learning algorithm learns from previously diagnosed patient cases. The resulting classifier has many uses, such as helping doctors diagnose new patients with higher speed and efficiency, and training students and non-specialists to diagnose patients [1].

Since vast amounts of medical data are available, machine learning can help us discover patterns and beneficial information in them. Although it has many uses, machine learning is mostly used for disease prediction in the medical field. Many researchers became interested in using machine learning for diagnosing diseases because it helps reduce diagnosis time and increases accuracy and efficiency. Several diseases can be diagnosed using machine learning techniques, but the focus of this paper is on heart disease diagnosis, since heart disease is the primary cause of death in the world today and its effective diagnosis is immensely useful for saving lives [1].

The term heart disease, also called cardiovascular disease, encompasses the diverse diseases that affect the heart. The World Health Organization estimates that 12 million deaths occur worldwide every year due to heart disease. It is the major cause of death in many developing countries. In the United States, it kills one person every 34 seconds. It is also the main cause of death in India, which shows that heart disease is one of the most dangerous diseases threatening adults' lives today [2]. Heart disease diagnosis is one of the most critical and challenging tasks in the healthcare field. The disease must be diagnosed quickly, efficiently and correctly in order to save lives. Diagnosis requires the patient to undergo many tests, and healthcare professionals must carefully examine the results. That is why researchers have been interested in predicting heart disease and have developed different heart disease prediction systems using various machine learning algorithms [3]. Some of them achieved better results than others. Many used the well-known UCI heart disease dataset to train and test their classifiers, while others used data obtained from hospitals accessible to them.

This survey paper provides an overview of the machine learning classification techniques used in diagnosing heart disease and of how previous researchers implemented them. It sheds light on how important machine learning is in the healthcare field and how it can make accurate predictions and help healthcare professionals.

The rest of the paper is organized as follows. Section 2 presents background topics on ML, classification techniques, and the heart disease dataset most widely used by researchers in this field. Section 3 contains the literature review of the currently proposed research work in this area. Section 4 presents a tabular comparison between the classification techniques overviewed in Section 3 on the basis of their accuracy. Finally, the conclusion is presented in Section 5.

Copyright © 2018 Maryam I. Al-Janabi et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

2. Background

This section describes the topics related to this paper: machine learning and its techniques, data preprocessing, performance evaluation metrics, and a brief explanation of the most used heart disease dataset.

2.1. Machine learning

Machine learning (ML) is a domain of artificial intelligence that involves constructing algorithms that can learn from experience. ML algorithms work by detecting hidden patterns in the input dataset and building models. They can then make accurate predictions on data that are entirely new to them. In this way the machine becomes more intelligent through learning, so it can identify patterns that are very hard or impossible for humans to detect by themselves. ML algorithms and techniques can operate on large datasets and make decisions and predictions [4]. Figure 1 gives a simplified representation of how machine learning works. In this figure, the dataset, which in our case can be a patient database, is preprocessed first. The preprocessing phase is crucial as it cleans the dataset and prepares it for use by the machine learning algorithm. The model consists of a single algorithm, or it can contain multiple algorithms working together in a hybrid approach. The output of the model is a classifier; this is where the intelligence is, and this is what makes the prediction. Once the classifier receives input data, it can predict without any human intervention. For example, if the dataset fed into the model is a medical dataset of healthy and unhealthy patients' information, the input data can be a new patient's information. This input data is entirely new to the classifier and has never been seen before. The classifier receives this data and predicts whether the new patient is healthy or unhealthy based on past data.

2.2. Machine learning techniques

The main ML techniques can be classified as follows:

2.2.1. Supervised learning

In this technique, a dataset exists with examples and their responses (the output). The algorithm learns from the dataset through a training process; it can then respond to any new input based on what it has learned. Examples of supervised learning techniques are classification and regression [5].

Fig. 1: Machine Learning Simplified Representation.

2.2.2. Unsupervised learning

In this technique, the dataset does not contain the responses. The algorithm therefore tries to recognize the similarities between input values and categorizes them based on those similarities. Clustering is an unsupervised learning method [5].

2.2.3. Reinforcement learning

This technique lies between supervised and unsupervised learning: the model improves its performance as it interacts with the environment and hence learns how to correct its mistakes. It is expected to reach the correct result by exploring and trying out different possibilities [5].

The most common type of learning is supervised learning, especially classification, which is widely used for prediction. In this paper, we mainly focus on the papers that used classification algorithms to diagnose heart disease.

2.3. Classification machine learning techniques

Classification, which is a type of supervised ML technique, performs predictions for future cases based on a previous dataset. In this section, we present a brief definition of the most widely used classification techniques for heart disease prediction.

2.3.1. Naive bayes (NB)

The Naive Bayes classifier belongs to a family of probabilistic classifiers based on Bayes' theorem. It assumes strong independence between the features, and this assumption is the essential part of how the classifier makes predictions. It is easy to build and usually performs well, which makes it suitable for the medical science field and for diagnosing diseases [6].

2.3.2. Artificial neural network (ANN)

This algorithm was developed to imitate the neurons in the human brain. It consists of a number of connected nodes or neurons, where the output of one node is the input of another. Each node receives multiple inputs but produces only one output value. The Multi-Layer Perceptron (MLP) is a widely used type of ANN, and it consists of an input layer, hidden layers, and an output layer. A different number of neurons is assigned to each layer under different conditions [6].

2.3.3. Radial basis function (RBF)

This is a type of ANN that is similar to the Multi-Layer Perceptron (MLP) neural network but differs in the number of hidden layers, the approximation technique, the number of parameters, and other factors [6].

2.3.4. Decision tree (DT)

This algorithm has a tree-like or flowchart-like structure. It consists of branches, leaves, nodes and a root node. The internal nodes contain the attributes, while the branches represent the result of each test on each node. DT is widely used for classification purposes because it does not require much domain knowledge or parameter setting to work [6].

2.3.5. K-nearest neighbor (KNN)

This algorithm predicts the class of a new instance based on the majority vote of its closest neighbors. It uses Euclidean distance to calculate the distance of an instance from its neighbours [6].

2.3.6. Support vector machine (SVM)

This algorithm has a useful classification accuracy. It operates in a finite-dimensional vector space that has one dimension for every feature/attribute of an object [6].

2.3.7. Genetic algorithm

It is an evolutionary algorithm that is built on Darwin's theory of evolution. It imitates methods in nature such as mutation, crossover, and natural selection. One of its most common uses is to initialize the weights of a neural network model [8], which is why it appears alongside ANN in many studies to produce a hybrid prediction model.
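
To make the genetic algorithm described in Section 2.3.7 concrete, the following is a minimal Python sketch of the three operators it names: selection, crossover, and mutation. It is an illustration only and is not taken from any of the surveyed papers. The toy objective (maximizing the number of 1-bits in a bit string), the function names `fitness` and `evolve`, the use of tournament selection and elitism, and all parameter values are assumptions chosen for brevity. In the hybrid systems surveyed in Section 3.6, an individual would instead encode neural network weights and the fitness would reflect the network's prediction quality.

```python
import random


def fitness(bits):
    # Toy objective (an assumption for illustration): number of 1-bits.
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Evolve bit strings toward higher fitness; returns (best, best_per_generation)."""
    rng = random.Random(seed)
    # Random initial population of bit strings.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best_per_gen = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        best_per_gen.append(fitness(pop[0]))
        # Elitism: carry the fittest individual over unchanged.
        next_pop = [pop[0][:]]
        while len(next_pop) < pop_size:
            # Selection: each parent is the fittest of 3 randomly drawn individuals.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # Crossover: splice the parents at a single random cut point.
            cut = rng.randrange(1, n_bits)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    best_per_gen.append(fitness(best))
    return best, best_per_gen
```

Calling `evolve()` returns the best bit string found together with the best fitness recorded in each generation. Because the fittest individual is always copied unchanged into the next generation, that recorded fitness never decreases from one generation to the next.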

2.3.8. Ensemble learning

This method combines multiple classifiers into one model to increase the accuracy. There are three types of ensemble learning methods. The first type is bagging, which aggregates classifiers of the same kind through a voting technique. Boosting is the second type; it is like bagging, yet each new model is affected by the results of the previous models. Stacking is the third type, which aggregates machine learning classifiers of various kinds to produce one model [6].

2.4. Data preprocessing

The performance and accuracy of a predictive model are affected not only by the algorithms used but also by the quality of the dataset and the preprocessing techniques. Preprocessing refers to the steps applied to the dataset before the machine learning algorithms are applied to it. The preprocessing stage is very important because it prepares the dataset and puts it in a form that the algorithm understands.
Datasets can have errors, missing data, redundancies, noise, and many other problems that make the data unsuitable for direct use by the machine learning algorithm. Another factor is the size of the dataset. Some datasets have many attributes, which makes it harder for the algorithm to analyze them, discover patterns, or make accurate predictions. Such problems can be solved by analyzing the dataset and using suitable data preprocessing techniques. Data preprocessing steps include data cleaning, data transformation, missing value imputation, data normalization, feature selection, and other steps depending on the nature of the dataset [9].

2.5. Performance evaluation metrics

The metrics mentioned below are used by researchers to evaluate prediction models and show their performance results. We provide a short definition of each metric without delving into the details and mathematical equations.
1) Accuracy: This metric shows the percentage of correct results.
2) Precision: This metric shows how relevant the result is.
3) Recall or Sensitivity: Measures how many of the relevant results are returned.
4) F-Measure: Combines precision and recall.
5) Receiver Operating Characteristic (ROC): A graph for visualizing the classifier's performance. It shows the correctly classified cases as well as the incorrectly classified ones [6].
The most widely used performance evaluation metric is accuracy, which is used in all research papers discussed in our article. Hence, the focus of this overview article is on categorizing, comparing and reviewing previous work based on accuracy.

2.6. Heart disease dataset

The dataset used in the majority of research papers is the heart disease dataset obtained from the University of California, Irvine (UCI) Center for Machine Learning and Intelligent Systems. It contains four databases from four hospitals. Each database has the same number of features, which is 14, but a different number of records. The Cleveland dataset is the one most used by machine learning researchers, because it contains fewer missing attributes than the other datasets and has more records. The "num" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. The Cleveland dataset contains 303 instances [10]. Table 1 shows the 14 attributes/features as they exist in the dataset alongside the description of each attribute.

3. Current classification techniques for predicting heart disease

Many researchers have used various classification techniques for predicting heart disease. In this section, we provide a summary of the surveyed papers in this area. We grouped the papers based on the algorithms that were used in their prediction model. Most researchers combined multiple algorithms in their research work or provided a comparison between them; these papers can be found in the last subsection, called the "Hybrid Approach" section.

Table 1: Dataset attributes

| Number | Attribute | Description |
|---|---|---|
| 1 | Age | Age in years |
| 2 | Gender | Male or Female |
| 3 | cp | Chest pain type |
| 4 | trestbps | Resting blood pressure in mmHg |
| 5 | chol | Serum cholesterol in mg/dl |
| 6 | fbs | Fasting blood sugar |
| 7 | restecg | Resting electrocardiographic results |
| 8 | thalach | Maximum heart rate achieved |
| 9 | exang | Exercise induced angina |
| 10 | oldpeak | ST depression induced by exercise relative to rest |
| 11 | slope | The slope of the peak exercise ST segment |
| 12 | ca | Number of major vessels (0-3) colored by fluoroscopy |
| 13 | thal | Thallium heart scan |
| 14 | num | Diagnosis of heart disease (angiographic disease status) |

3.1. Naive bayes

Vembandasamy et al. in [11] used the Naive Bayes classifier to diagnose the presence or absence of heart disease. The dataset used in the research was obtained from one of the leading diabetic research institutes in Chennai; it contained records of about 500 patients and had 11 attributes (including the diagnosis). The Waikato Environment for Knowledge Analysis (WEKA) tool, which is a collection of ML algorithms, was used to apply the Naive Bayes classifier. The accuracy of their research work was 86.4198%.
Medhekar et al. in [12] proposed a system that categorized the data into five categories using the Naive Bayes classifier. The categories are no, low, average, high and very high. The system predicts the possibility of heart disease in the input data. The dataset used for training and testing is the UCI heart disease dataset shown in Table 1. The system showed an accuracy of 88.96%.

3.2. Artificial neural network (ANN)

Das et al. in [7] proposed a system using an Artificial Neural Network (ANN) ensemble method. The Cleveland heart disease dataset shown in Table 1 was used. The ensemble model provided increased generalization by combining a number of models trained on the same task. The tool used to implement the experiment was SAS enterprise miner 5.2, and the results showed that the model predicted heart disease with an accuracy of 89.01%.
Chen et al. in [13] developed a heart disease prediction system (HDPS) using an Artificial Neural Network. Learning Vector Quantization (LVQ), which is a type of ANN, was used in this research. The ANN model in this paper used thirteen neurons for the input layer, six neurons for the hidden layer and two neurons for the output layer. The dataset used in the paper is the Cleveland dataset in Table 1. The developed system has a user-friendly interface and requires users to fill in the thirteen medical attributes to be able to make predictions. The output displays the result of the prediction, either healthy or unhealthy, alongside the ROC curve, the accuracy, sensitivity, specificity and the running time it took to display the result. The system was developed in the C programming language, with C# used for the user interface. The results showed that the model obtained an accuracy, sensitivity, and specificity of approximately 80%, 85%, and 70% respectively.
Dangare and Apte in [14] used ANN to develop a Heart Disease
Prediction system (HDPS) to predict the presence or absence of features were removed such as age, sex and resting blood sugar. In
heart disease in patients. It used the Cleveland heart disease da- case 4, the dataset was resampled by WEKA tool and only the
taset shown in table 1 for training the algorithm, and the Statlog seven most essential attributes were used. The resampling in-
dataset for testing; both obtained from the UCI repository and creased the accuracy of each classifier. In case 5, resampling was
contain thirteen medical attributes. Additional two attributes applied to all the 14 attributes. Finally, in case 6, the Synthetic
which are smoking and obesity were added to increase the accura- Minority Over-sampling Technique (SMOTE) was applied in
cy, which makes them fifteen attributes. The tool used for experi- WEKA tool. The best result achieved was using KNN on case 5,
menting is WEKA tool. The results showed that using the thirteen which yielded 79.20% accuracy.
attributes provided an accuracy of 99.25% whereas using the fif- Pouriyeh et al. in [6] conducted a comprehensive comparison of
teen attributes provided an accuracy of nearly 100% for predicting different classification techniques on the Cleveland heart disease
the disease. dataset to determine which classifier outperforms the rest. The
classifiers included were Decision Tree (DT), Naive Bayes (NB),
3.3. Decision tree (DT) Multi-layer Perceptron (MLP), K-Nearest Neighbor (KNN), Sin-
gle Conjunctive Rule Learner (SCRL), Radial Basis Function
Sabarinathan and Sugumaran in [15] used the Decision Tree J48 (RBF) and Support Vector Machine (SVM). The paper also in-
algorithm for feature selection and for predicting heart disease. cluded comparing ensemble techniques as bagging, boosting and
The dataset used contains thirteen medical attributes/features, and stacking. The authors used the K-Fold Cross Validation technique
240 records were used for training and 120 for testing. The accu- to estimate the accuracy of classifiers. For each classifier, the per-
racy achieved was 75.83% using all the features; while the accura- formance evaluation metrics were accuracy, precision, recall, F-
cy is improved to 76.67% using feature selection. Furthermore, measure and ROC curve. For the KNN classifier, different values
when more irrelevant features were removed, the accuracy is im- of K were tried, resulting in K=9 as the best value. For ANN, sev-
proved to 85%. The paper claims that the J48 algorithm enables eral neuron numbers were experimented to arrive at the best com-
selecting minimum features to enhance prediction accuracy. bination which is thirteen, seven and two for the input, hidden and
Patel et al. in [16] compared several decision tree algorithms using output layers respectively. The research was divided into two ex-
WEKA tool on the UCI dataset to determine the presence or ab- periments: the first one included comparing the different classifi-
sence of heart disease. The different algorithms tested were J48, ers mentioned above, while the second one involved applying the
logistic model tree, and random forest. The J48 algorithm outper- ensemble techniques. The results showed that SVM outperformed
formed the rest with an accuracy of 56.76%. the other classifiers in the first experiment at an accuracy of
84.15%. In the second experiment, using the boosting technique
3.4. K-nearest neighbour (KNN) with SVM also proved to be the most efficient with an accuracy of
84.81%.
Shouman et al. in [17] applied K-Nearest Neighbor (KNN) to Amin et al. in [19] proposed a hybrid system for predicting heart
predict heart disease using the Cleveland dataset. The paper com- disease using ANN and Genetic algorithm. The dataset used in
pared the results of applying KNN only and applying KNN with this research was collected from 50 people through a survey con-
the voting technique. Voting is the method of dividing the data ducted by the American Heart Association and contains thirteen
into subsets and applying the classifier to each subset. Evaluation attributes. Data analysis involved preprocessing the data to re-
is done using 10-fold cross-validation. The results showed that move missing or incorrect values. The dataset was divided into
without voting, the accuracy ranged from 94% to 97.4% with var- 70% of the data for training and 15% for testing and validation.
ious values for K. When K=7, the accuracy was the highest at The system was implemented using MATLAB R2012a through
97.4%. Using the voting technique, however, did not improve the Global Optimization Toolbox and the Neural Network Toolbox.
accuracy. The results showed that at K=7, the accuracy decreased The results showed an accuracy of 89% for predicting whether a
to 92.7%. person has heart disease or not.
Waghulde and Patil in [8] developed a heart disease prediction
3.5. Support vector machine (SVM) system using ANN and Genetic algorithm. The method used a
genetic algorithm to initialize the weights in the Neural Network.
Wiharto et al. in [18] studied the accuracy of SVM algorithm The experiment was done using MATLAB on a dataset of 50 peo-
types on the UCI dataset to diagnose heart disease. The study in- ple collected by the American Health Association and included
cluded various SVM types such as Binary Tree Support Vector thirteen attributes. The results generated an accuracy of 98% and
Machine (BTSVM), One-Against-One (OAO), One-Against-All 84% when carried out using six hidden nodes and ten hidden
(OAA), Decision Direct Acyclic Graph (DDAG) and Exhaustive nodes respectively.
Output Error Correction Code (ECOC). The dataset was first pre- Amma in [20] presented a system for heart disease diagnosis by
processed using a min-max scaler. The next stage was training the combining ANN and Genetic algorithm. The dataset used was the
algorithm on the dataset which was done using the SVM algo- Cleveland dataset. Preprocessing the dataset consisted of filling
rithms mentioned above. In the performance evaluation, BTSVM out missing values and normalizing the data using Min-Max nor-
performed better than the other algorithms with 61.86% overall malization. The weights of the neural network were determined
accuracy. using the genetic algorithm. The accuracy obtained was 94.17%.
Venkatalakshmi and Shivsankar in [21] included a comparison
3.6. Hybrid approach between Naive Bayes and Decision Tree to determine which one
has the highest accuracy for heart disease prediction. The dataset
This section contains research work that built a model using dif- used was the UCI heart disease dataset. The experiment was car-
ferent algorithms or made a comparison between several algo- ried out using WEKA tool and showed an accuracy of 85.03% and
rithms. 84.01% for Naive Bayes and Decision Tree respectively. The
Khateeb and Usman in [3] experimented with various classifica- paper suggested using a genetic algorithm in MATLAB to reduce
tion algorithms such as Naive Bayes, KNN, decision tree and bag- the number of features before feeding the dataset into the WEKA
ging technique on the UCI Cleveland dataset. The work was di- tool for future work.
vided into six cases, and the accuracy is calculated for every case Palaniappan and Awang in [22] proposed an Intelligent Heart
by every classifier. In case 1, all the classifiers were applied to the Disease Prediction System (IHDPS) using multiple classification
dataset without feature reduction. In case 2, feature reduction was techniques which are Decision Tree, Naive Bayes and Neural
used where instead of using all the 14 attributes in the dataset, Network. The system is web-based and was implemented using
only seven attributes, which are the most important for heart dis- .NET framework. The data source consisted of 909 records with
ease diagnosis, were selected. In case 3, only the most generic fifteen attributes obtained from the Cleveland Heart Disease data-
International Journal of Engineering & Technology 5377

base. Data Mining Extension (DMX) query language was used to 2) Classification Technique/s: This represents the classification
create the model. The results showed that Naive Bayes proved to algorithm used in the research; whether it was a single algo-
be the most efficient with 86.53% correct predictions followed by rithm, a comparison or a hybrid model.
Neural Network with only 1% difference. 3) Best Technique Found: This column is only applicable to
Dangare and Apte in [23] developed a model for predicting heart papers having a comparison between multiple algorithms. It
disease. The dataset used is the Cleveland database consisting of represents the best algorithm found in the research work,
303 records alongside the Statlog database comprising of 270 which is the algorithm with the highest accuracy.
records. Instead of using only the thirteen attributes present in the 4) Tool: The framework or programming language used to
dataset, they added two attributes: obesity and smoking. WEKA build the model is shown in this column. It is what the re-
tool used for preprocessing the dataset. The classification tech- searcher used to pre-process the input dataset, create the
niques used for analyzing the dataset were Decision Tree, Naive predictive model and test it.
Bayes, and ANN. According to the results, ANN gave an accuracy 5) Dataset: This shows the dataset that was used as an input for
of 100%, Decision Tree 99.62%, and Naive Bayes 90.74% which the classification algorithm.
proves that Artificial Neural Network is the highest performing 6) Accuracy: This represents the accuracy of the results of the
algorithm. proposed model. If the paper contained a comparison, this
Zriqat et al. in [24] developed an effective intelligent medical column only shows the accuracy of the best technique found
decision support system. Five classification algorithms were com- by the author.
pared which are: Naive Bayes, Decision Tree, Discriminant, Ran-
dom Forest, and Support Vector Machine. The analysis was done Table 2: Comparison of Classification Techniques for Heart Disease Pre-
using MATLAB on two datasets, the Cleveland Heart Disease and diction
the Statlog Heart Disease. The results showed that Decision Tree Best
Classifica-
achieved the highest accuracy on both datasets: 99.01% on the Cleveland dataset and 98.15% on the Statlog dataset.

Liu et al. in [25] proposed a hybrid model for diagnosing heart disease. The dataset used was the Statlog heart disease dataset from the UCI repository. The model, developed with MATLAB, consisted of two subsystems: feature selection and classification. The feature selection subsystem used the ReliefF method to estimate the weight of each feature and then applied the Rough Set method (together, the RFRS approach) to remove unnecessary features and improve the accuracy of the model. The classification subsystem used an Ensemble classifier with the C4.5 algorithm (which is used to generate a Decision Tree) as the base. The results showed 92.59% classification accuracy.

Ghumbre et al. in [26] compared Support Vector Machine (SVM) and Radial Basis Function (RBF), which is a type of ANN. The algorithms were applied to a patient dataset from India consisting of 214 records and 19 attributes to predict whether a person has heart disease or not. The performance of the algorithms was evaluated using the overall average across training and testing on the dataset, 5-fold cross-validation, and 10-fold cross-validation. The overall average accuracy was 86.42% for SVM and 80.81% for RBF, showing that SVM provided better accuracy.

Masethe and Masethe in [27] applied several algorithms to diagnose heart disease, namely J48, Naive Bayes, REPTREE, Simple Cart (Classification and Regression Tree, which is a type of Decision Tree), and Bayes Net. The dataset used for this work was obtained from South African physicians and contains eleven attributes: patient identification number (replaced with dummy values to protect the privacy of patients), gender, cardiogram, age, chest pain, blood pressure level, heart rate, cholesterol, smoking, alcohol consumption, and blood sugar level. The tool used in the experiment was WEKA. The performance evaluation was done using 10-fold cross-validation to assess the efficiency of the built model. The results showed an accuracy of 99.0471% for J48, 99.0471% for REPTREE, 97.222% for Naive Bayes, 98.1481% for Bayes Net, and 99.0741% for Simple Cart, showing that Simple Cart outperformed the rest.

4. Comparison of ML classification techniques for heart disease prediction

This section provides a tabular comparison between all the research papers described above. The comparison is made on the basis of accuracy and can be seen in Table 2. The table has six elements, which are as follows:
1) Author: This shows the author/s of the paper and the reference number.

| Author | Technique/s | …tion Technique Found | Tool | Dataset | Accuracy |
|---|---|---|---|---|---|
| Vembandasamy et al. [11] | NB | *n/a | WEKA | A diabetic research institute in Chennai | 86.4198% |
| Medhekar et al. [12] | NB | n/a | Not mentioned | Cleveland (UCI) | 88.96% |
| Das et al. [7] | ANN Ensemble | n/a | SAS enterprise miner 5.2 | Cleveland (UCI) | 89.01% |
| Chen et al. [13] | ANN LVQ | n/a | C and C# | Cleveland (UCI) | 80% |
| Dangare and Apte [14] | ANN | n/a | WEKA | Cleveland and Statlog (UCI) | Nearly 100% |
| Sabarinathan and Sugumaran [15] | DT | J48 with feature selection | Not mentioned | A dataset with 240 records for testing and 120 for training | 85% |
| Patel et al. [16] | DT | J48 | WEKA | Cleveland (UCI) | 56.76% |
| Shouman et al. [17] | KNN | n/a | Not mentioned | Cleveland (UCI) | 97.4% |
| Wiharto et al. [18] | SVM | BT SVM | Not mentioned | Cleveland (UCI) | 61.86% |
| Khateeb and Usman [3] | NB, KNN, DT and bagging technique | KNN | WEKA | Cleveland (UCI) | 79.20% |
| Pouriyeh et al. [6] | NB, DT, MLP, KNN, SCRL, RBF, SVM, bagging, boosting and stacking | Boosting with SVM | Not mentioned | Cleveland (UCI) | 84.81% |
| Amin et al. [19] | ANN and Genetic Algorithm hybrid system | n/a | MATLAB | American Heart Association dataset | 89% |
| Waghulde and Patil [8] | ANN and Genetic Algorithm hybrid system | n/a | MATLAB | American Heart Association dataset | 98% |
| Amma [20] | ANN and Genetic Algorithm hybrid system | n/a | Not mentioned | Cleveland (UCI) | 94.17% |
| Venkatalakshmi and Shivsankar [21] | NB and DT | NB | WEKA | UCI | 85.03% |
| Palaniappan and Awang [22] | DT, NB and ANN | NB | DMX | Cleveland (UCI) | 86.53% |
| Dangare and Apte [23] | DT, NB and ANN | ANN | WEKA | Cleveland and Statlog (UCI) | Nearly 100% |
| Zriqat et al. [24] | NB, DT, Discriminant, Random Forest, and SVM | DT | MATLAB | Cleveland and Statlog (UCI) | 99.01% for Cleveland and 98.15% for Statlog |
| Liu et al. [25] | ReliefF and Rough Set (RFRS) for feature reduction, Ensemble using C4.5 for classification | n/a | MATLAB | Statlog (UCI) | 92.59% |
| Ghumbre et al. [26] | SVM and Radial Basis Function | SVM | Not mentioned | Indian patients dataset of 214 records and 19 attributes | 86.42% |
| Masethe and Masethe [27] | J48, NB, REPTREE, Simple Cart, and Bayes Net | Simple Cart | WEKA | South African dataset containing 11 attributes | 99.0741% |

*n/a: not applicable.

5. Conclusion and final remarks

This paper overviews the literature on machine learning classification methods for diagnosing heart disease. Many representative papers on using machine learning classification techniques were surveyed and categorized. The accuracy of the proposed models varies depending on the tool used, the dataset used, the number of attributes and records in the dataset, the preprocessing techniques, and the classifier implemented in the model. It also depends on whether the model is a hybrid model and whether it uses feature selection. From Table 2, we can conclude that the researchers who reported the highest accuracy were Dangare and Apte, using an Artificial Neural Network (ANN), the WEKA tool, and a combination of the Cleveland and Statlog heart disease datasets. We conclude that to build an accurate heart disease prediction model, a dataset with sufficient samples and correct data must be used. The dataset must be preprocessed accordingly, because preprocessing is the most critical step in preparing the dataset for the machine learning algorithm and obtaining good results. A suitable algorithm must also be used when developing a prediction model. We observe that Artificial Neural Network (ANN), as well as Decision Tree (DT), performed well in most models for predicting heart disease.

Finally, using machine learning for diagnosing heart disease is an important field that can help both healthcare professionals and patients. It is still a growing field, and despite the massive availability of patient data in hospitals and clinics, not much of it is published. As observed in Table 2, most researchers obtained their datasets from the same source, the UCI repository. Since the quality of the dataset is an essential factor in the prediction's accuracy, more hospitals should be encouraged to publish high-quality datasets (while protecting the privacy of patients) so that researchers have a good source to help them develop their models and obtain good results.

Acknowledgement

This work was made possible by the financial support from the Applied Science Private University in Amman, Jordan.

References

[1] I. Kononenko, “Machine learning for medical diagnosis: History, state of the art and perspective,” Artificial Intelligence in Medicine, vol. 23, no. 1, pp. 89–109, 2001. https://doi.org/10.1016/S0933-3657(01)00077-X.
[2] J. Soni et al., “Intelligent and effective heart disease prediction system using weighted associative classifiers,” International Journal on Computer Science and Engineering, vol. 3, no. 6, pp. 2385–2392, 2011.
[3] N. Khateeb and M. Usman, “Efficient heart disease prediction system using k-nearest neighbor classification technique,” in Proceedings of the International Conference on Big Data and Internet of Thing (BDIOT), New York, NY, USA: ACM, 2017, pp. 21–26. https://doi.org/10.1145/3175684.3175703.
[4] H. Almarabeh and E. Amer, “A study of data mining techniques accuracy for healthcare,” International Journal of Computer Applications, vol. 168, no. 3, pp. 12–17, Jun 2017.
[5] M. Fatima and M. Pasha, “Survey of machine learning algorithms for disease diagnostic,” Journal of Intelligent Learning Systems and Applications, vol. 9, no. 01, pp. 1–16, 2017. https://doi.org/10.4236/jilsa.2017.91001.
[6] S. Pouriyeh et al., “A comprehensive investigation and comparison of machine learning techniques in the domain of heart disease,” in Proceedings of IEEE Symposium on Computers and Communications (ISCC). Heraklion, Greece: IEEE, July 2017, pp. 204–207. https://doi.org/10.1109/ISCC.2017.8024530.
[7] R. Das, I. Turkoglu, and A. Sengur, “Effective diagnosis of heart disease through neural networks ensembles,” Expert systems with applications, vol. 36, no. 4, pp. 7675–7680, 2009. https://doi.org/10.1016/j.eswa.2008.09.013.
[8] N. Waghulde and N. Patil, “Genetic neural approach for heart disease prediction,” International Journal of Advanced Computer Research, vol. 4, no. 3, pp. 778, 2014.
[9] S. Garcia et al., “Big data preprocessing: methods and prospects,” Big Data Analytics, vol. 1, no. 1, p. 9, Nov 2016. https://doi.org/10.1186/s41044-016-0014-0.
[10] A. Janosi et al., “Heart disease data set,” Jul 1988. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/heart+Disease.
[11] K. Vembandasamy, R. Sasipriya, and E. Deepa, “Heart diseases detection using naive bayes algorithm,” International Journal of Innovative Science, Engineering & Technology, vol. 2, no. 9, pp. 441–444, 2015.
[12] D. Medhekar, M. Bote, and S. Deshmukh, “Heart disease prediction system using naive bayes,” International Journal of Enhanced Research In Science Technology & Engineering, vol. 2, no. 3, pp. 1–5, 2013.
[13] A. Chen et al., “HDPS: Heart disease prediction system,” in Computing in Cardiology, Hangzhou, China: IEEE, 2011, pp. 557–560.
[14] C. Dangare and S. Apte, “A data mining approach for prediction of heart disease using neural networks,” International Journal of Computer Engineering & Technology, vol. 3, no. 3, pp. 30–40, 2012.
[15] V. Sabarinathan and V. Sugumaran, “Diagnosis of heart disease using decision tree,” International Journal of Research in Computer Applications & Information Technology, vol. 2, no. 6, pp. 74–79, 2014.
[16] J. Patel et al., “Heart disease prediction using machine learning and data mining technique,” Heart Disease, vol. 7, no. 1, pp. 129–137, 2015.
[17] M. Shouman, T. Turner, and R. Stocker, “Applying k-nearest neighbour in diagnosing heart disease patients,” International Journal of Information and Education Technology, vol. 2, no. 3, pp. 220, 2012. https://doi.org/10.7763/IJIET.2012.V2.114.
[18] W. Wiharto, H. Kusnanto, and H. Herianto, “Performance analysis of multiclass support vector machine classification for diagnosis of coronary heart diseases,” International Journal on Computational Science & Applications, vol. 5, no. 5, pp. 27–37, 2015. https://doi.org/10.5121/ijcsa.2015.5503.
[19] S. Amin, K. Agarwal, and R. Beg, “Genetic neural network based data mining in prediction of heart disease using risk factors,” in IEEE Conference on Information Communication Technologies. Thuckalay, Tamil Nadu, India, April 2013, pp. 1227–1231. https://doi.org/10.1109/CICT.2013.6558288.
[20] N. Amma, “Cardiovascular disease prediction system using genetic algorithm and neural network,” in International Conference on Computing, Communication and Applications. Dindigul, Tamilnadu, India: IEEE, Feb 2012, pp. 1–5. https://doi.org/10.1109/ICCCA.2012.6179185.
[21] B. Venkatalakshmi and M. Shivsankar, “Heart disease diagnosis using predictive data mining,” International Journal of Innovative Research in Science, Engineering and Technology, vol. 3, no. 3, pp. 1873–1877, 2014.
[22] S. Palaniappan and R. Awang, “Intelligent heart disease prediction system using data mining techniques,” in IEEE/ACS International Conference on Computer Systems and Applications. Doha, Qatar, March 2008, pp. 108–115. https://doi.org/10.1109/AICCSA.2008.4493524.
[23] C. Dangare and S. Apte, “Improved study of heart disease prediction system using data mining classification techniques,” International Journal of Computer Applications, vol. 47, no. 10, pp. 44–48, 2012.
[24] I. Zriqat, A. Altamimi, and M. Azzeh, “A comparative study for predicting heart diseases using data mining classification methods,” International Journal of Computer Science and Information Security (IJCSIS), vol. 14, no. 12, pp. 868–879, 2017.
[25] X. Liu et al., “A hybrid classification system for heart disease diagnosis,” Computational and Mathematical Methods in Medicine, vol. 2017, pp. 1–11, 2017. https://doi.org/10.1155/2017/8272091.
[26] S. Ghumbre, C. Patil, and A. Ghatol, “Heart disease diagnosis using support vector machine,” in International conference on computer science and information technology. Pattaya, Thailand: Planetary Scientific Research Centre, 2011, pp. 84–88.
[27] H. Masethe and M. Masethe, “Prediction of heart disease using classification algorithms,” in Proceedings of the world congress on Engineering and Computer Science, San Francisco, USA: International Association of Engineers (IAENG), 2014, pp. 22–24.
