Machine Learning Classification Techniques For Heart Disease Prediction: A Review
Research paper
Abstract
The most crucial task in the healthcare field is disease diagnosis. If a disease is diagnosed early, many lives can be saved. Machine learning classification techniques can significantly benefit the medical field by providing accurate and quick diagnosis of diseases, which saves time for both doctors and patients. Heart disease is the number one killer in the world today and one of the most difficult diseases to diagnose. In this paper, we provide a survey of the machine learning classification techniques that have been proposed to help healthcare professionals diagnose heart disease. We start with an overview of machine learning and brief definitions of the classification techniques most commonly used to diagnose heart disease. Then, we review representative research works on using machine learning classification techniques in this field. A detailed tabular comparison of the surveyed papers is also presented.
Keywords: Heart Disease; Heart Disease Diagnosis; Heart Disease Prediction; Machine Learning; Machine Learning Classification Techniques.
1. Introduction
[...] the task of making computers more intelligent. Since the most basic requirement of intelligence is learning, a subfield of AI called machine learning (ML) emerged. ML is one of the most rapidly evolving fields of AI and is used in many areas of life, notably in the healthcare field. ML has great value in healthcare because it is an intelligent tool for analyzing data, and the medical field is rich with data. In the past few years, a vast amount of data has been collected and stored because of the digital revolution. Monitoring and other data collection devices are available in modern hospitals and are used every day, so abundant amounts of data are being gathered. It is very hard or even impossible for humans to derive useful information from these massive amounts of data, which is why machine learning is now widely used to analyze these data and diagnose problems in the healthcare field. Put simply, a machine learning algorithm learns from previously diagnosed patient cases. The resulting classifier has many uses, such as helping doctors diagnose new patients with higher speed and efficiency and training students and non-specialists to diagnose patients [1].
Since vast amounts of medical data are available, machine learning can help us discover patterns and beneficial information in them. Although it has many uses, machine learning in the medical field is mostly used for disease prediction. Many researchers became interested in using machine learning for diagnosing diseases because it helps reduce diagnosis time and increases accuracy and efficiency. Several diseases can be diagnosed using machine learning techniques, but the focus of this paper is heart disease diagnosis, since heart disease is the primary cause of death in the world today and its effective diagnosis is immensely useful for saving lives [1].
The term heart disease, also called cardiovascular disease, encompasses the diverse diseases that affect the heart. The World Health Organization estimates that 12 million deaths occur worldwide every year due to heart disease. It is the major cause of death in many developing countries. In the United States, it kills one person every 34 seconds. It is also the main cause of death in India, which shows that heart disease is one of the most dangerous diseases threatening adults' lives today [2]. Heart disease diagnosis is one of the most critical and challenging tasks in the healthcare field. The disease must be diagnosed quickly, efficiently and correctly in order to save lives. Diagnosis requires the patient to undergo many tests, and healthcare professionals must carefully examine the results. That is why researchers have been interested in predicting heart disease and have developed different heart disease prediction systems using various machine learning algorithms [3]. Some of them achieved better results than others. Many used the well-known UCI heart disease dataset to train and test their classifiers, while others used data obtained from hospitals accessible to them.
This survey paper provides an overview of the machine learning classification techniques used in the field of diagnosing heart disease and of how previous researchers implemented them. It sheds light on how important machine learning is in the healthcare field and how it can make accurate predictions and help healthcare professionals.
The rest of the paper is organized as follows. Section 2 presents background topics on ML, classification techniques, and the heart disease dataset most widely used by researchers in this field. Section 3 contains the literature review of the research work currently proposed in this area. Section 4 presents a tabular comparison of the classification techniques reviewed in Section 3 on the basis of their accuracy. Finally, the conclusion is presented in Section 5.
Copyright © 2018 Maryam I. Al-Janabi et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
5374 International Journal of Engineering & Technology
2. Background
This section describes the topics related to this paper: machine learning and its techniques, data preprocessing, performance evaluation metrics, and the most widely used heart disease dataset.
2.1. Machine learning
Machine learning (ML) is a domain of artificial intelligence that involves constructing algorithms that can learn from experience. ML algorithms work by detecting hidden patterns in the input dataset and building models. They can then make accurate predictions for datasets that are entirely new to them. In this way the machine becomes more intelligent through learning, so it can identify patterns that are very hard or impossible for humans to detect by themselves. ML algorithms and techniques can operate on large datasets and make decisions and predictions [4]. Figure 1 gives a simplified representation of how machine learning works. In this figure, the dataset, which in our case can be a patient database, is preprocessed first. The preprocessing phase is crucial, as it cleans the dataset and prepares it to be used by the machine learning algorithm. The model consists of a single algorithm, or it can contain multiple algorithms working together in a hybrid approach. The output of the model is a classifier; this is where the intelligence lies, and this is what makes the prediction. When the classifier receives input data, it can predict without any human intervention. For example, if the dataset fed into the model is a medical dataset of healthy and unhealthy patients' information, the input data can be a new patient's information. This input data is entirely new to the classifier and has never been seen before. The classifier receives this data and predicts whether the new patient is healthy or unhealthy based on past data.
2.2. Machine learning techniques
The main ML techniques can be classified as follows:
2.2.1. Supervised learning
In this technique, a dataset exists with examples and their responses (the outputs). The algorithm learns from the dataset through a training process; it can then respond to any new input based on what it has learned. Examples of supervised learning techniques are classification and regression [5].
[...] ought to get the correct result through examination and trying out different possibilities [5].
The most common type of learning is supervised learning, especially classification, which is widely used for prediction. In this paper, we mainly focus on the papers that used classification algorithms to diagnose heart disease.
2.3. Classification machine learning techniques
Classification, which is a type of supervised ML technique, makes predictions for future cases based on a previous dataset. In this section, we present brief definitions of the classification techniques most widely used for heart disease prediction.
2.3.1. Naive bayes (NB)
The Naive Bayes classifier belongs to a family of probabilistic classifiers based on Bayes' theorem. It assumes strong independence between the features, and this assumption is the essential part of how the classifier makes predictions. It is easy to build and usually performs well, which makes it suitable for the medical science field and for diagnosing diseases [6].
2.3.2. Artificial neural network (ANN)
This algorithm was developed to imitate the neurons in the human brain. It consists of a number of connected nodes, or neurons, where the output of one node is the input of another. Each node receives multiple inputs but produces only one output value. The Multi-Layer Perceptron (MLP) is a widely used type of ANN, and it consists of an input layer, hidden layers, and an output layer. A different number of neurons is assigned to each layer under different conditions [6].
2.3.3. Radial basis function (RBF)
This is a type of ANN that is similar to the Multi-Layer Perceptron (MLP) neural network but differs in the number of hidden layers, the approximation technique, the number of parameters, and other factors [6].
2.3.4. Decision tree (DT)
This algorithm has a tree-like or flowchart-like structure. It consists of branches, leaves, nodes and a root node. The internal nodes contain the attributes, while the branches represent the result of each test on each node. DT is widely used for classification because it does not require much domain knowledge or parameter setting to work [6].
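As a rough illustration of the structure described in Section 2.3.4, the following minimal Python sketch grows a small decision tree in which internal nodes hold an attribute test and leaves hold a class label. The Gini impurity criterion, the depth limit, the function names, and the toy records (age and resting blood pressure) are all assumptions made for this example; they are not taken from the surveyed papers, and tools such as WEKA's J48 use a different splitting criterion.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature index, threshold) pair that lowers the weighted
    Gini impurity the most, or None if no split improves on the parent."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Internal nodes are dicts holding one test; leaves are class labels."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        # Leaf: predict the majority class of the records that reached it.
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            depth + 1, max_depth),
    }


def predict(tree, row):
    """Follow the branches from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Hypothetical toy records: [age, resting blood pressure]; 1 = heart disease.
rows = [[35, 118], [42, 125], [48, 130], [51, 160],
        [58, 150], [63, 145], [67, 170], [39, 165]]
labels = [0, 0, 0, 1, 1, 1, 1, 1]

tree = build_tree(rows, labels)
print(tree)
print(predict(tree, [45, 120]), predict(tree, [60, 155]))
```

On this toy data a single test on the second attribute already separates the two classes, so the tree has one internal node and two leaves. Real patient data would normally need deeper trees and an evaluation on held-out records.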
[...] Prediction System (HDPS) to predict the presence or absence of heart disease in patients. It used the Cleveland heart disease dataset shown in Table 1 for training the algorithm and the Statlog dataset for testing; both were obtained from the UCI repository and contain thirteen medical attributes. Two additional attributes, smoking and obesity, were added to increase the accuracy, for a total of fifteen attributes. The tool used for the experiments was WEKA. The results showed that using the thirteen attributes provided an accuracy of 99.25%, whereas using the fifteen attributes provided an accuracy of nearly 100% for predicting the disease.
3.3. Decision tree (DT)
Sabarinathan and Sugumaran in [15] used the Decision Tree J48 algorithm for feature selection and for predicting heart disease. The dataset used contains thirteen medical attributes/features; 240 records were used for training and 120 for testing. The accuracy achieved was 75.83% using all the features, and it improved to 76.67% using feature selection. Furthermore, when more irrelevant features were removed, the accuracy improved to 85%. The paper claims that the J48 algorithm enables selecting a minimum set of features to enhance prediction accuracy.
Patel et al. in [16] compared several decision tree algorithms using the WEKA tool on the UCI dataset to determine the presence or absence of heart disease. The algorithms tested were J48, logistic model tree, and random forest. The J48 algorithm outperformed the rest with an accuracy of 56.76%.
3.4. K-nearest neighbour (KNN)
Shouman et al. in [17] applied K-Nearest Neighbor (KNN) to predict heart disease using the Cleveland dataset. The paper compared the results of applying KNN alone and applying KNN with the voting technique. Voting is the method of dividing the data into subsets and applying the classifier to each subset. Evaluation was done using 10-fold cross-validation. The results showed that without voting, the accuracy ranged from 94% to 97.4% across various values of K, with the highest accuracy of 97.4% at K=7. Using the voting technique, however, did not improve the accuracy: at K=7, the accuracy decreased to 92.7%.
3.5. Support vector machine (SVM)
Wiharto et al. in [18] studied the accuracy of SVM algorithm types on the UCI dataset to diagnose heart disease. The study included various SVM types such as Binary Tree Support Vector Machine (BTSVM), One-Against-One (OAO), One-Against-All (OAA), Decision Direct Acyclic Graph (DDAG) and Exhaustive Output Error Correction Code (ECOC). The dataset was first preprocessed using a min-max scaler. The next stage was training on the dataset using the SVM algorithms mentioned above. In the performance evaluation, BTSVM performed better than the other algorithms with 61.86% overall accuracy.
3.6. Hybrid approach
This section contains research work that built a model using different algorithms or made a comparison between several algorithms.
Khateeb and Usman in [3] experimented with various classification algorithms such as Naive Bayes, KNN, decision tree and the bagging technique on the UCI Cleveland dataset. The work was divided into six cases, and the accuracy was calculated for every case and every classifier. In case 1, all the classifiers were applied to the dataset without feature reduction. In case 2, feature reduction was used: instead of all the 14 attributes in the dataset, only the seven attributes most important for heart disease diagnosis were selected. In case 3, only the most generic features, such as age, sex and resting blood sugar, were removed. In case 4, the dataset was resampled with the WEKA tool and only the seven most essential attributes were used. The resampling increased the accuracy of each classifier. In case 5, resampling was applied to all the 14 attributes. Finally, in case 6, the Synthetic Minority Over-sampling Technique (SMOTE) was applied in the WEKA tool. The best result was achieved using KNN on case 5, which yielded 79.20% accuracy.
Pouriyeh et al. in [6] conducted a comprehensive comparison of different classification techniques on the Cleveland heart disease dataset to determine which classifier outperforms the rest. The classifiers included were Decision Tree (DT), Naive Bayes (NB), Multi-layer Perceptron (MLP), K-Nearest Neighbor (KNN), Single Conjunctive Rule Learner (SCRL), Radial Basis Function (RBF) and Support Vector Machine (SVM). The paper also compared ensemble techniques such as bagging, boosting and stacking. The authors used the K-Fold Cross Validation technique to estimate the accuracy of the classifiers. For each classifier, the performance evaluation metrics were accuracy, precision, recall, F-measure and ROC curve. For the KNN classifier, different values of K were tried, with K=9 giving the best result. For ANN, several neuron counts were tried to arrive at the best combination, which is thirteen, seven and two for the input, hidden and output layers respectively. The research was divided into two experiments: the first compared the different classifiers mentioned above, while the second applied the ensemble techniques. The results showed that SVM outperformed the other classifiers in the first experiment with an accuracy of 84.15%. In the second experiment, using the boosting technique with SVM also proved to be the most efficient with an accuracy of 84.81%.
Amin et al. in [19] proposed a hybrid system for predicting heart disease using ANN and a genetic algorithm. The dataset used in this research was collected from 50 people through a survey conducted by the American Heart Association and contains thirteen attributes. Data analysis involved preprocessing the data to remove missing or incorrect values. The dataset was divided into 70% of the data for training and 15% each for testing and validation. The system was implemented using MATLAB R2012a through the Global Optimization Toolbox and the Neural Network Toolbox. The results showed an accuracy of 89% for predicting whether a person has heart disease or not.
Waghulde and Patil in [8] developed a heart disease prediction system using ANN and a genetic algorithm. The method used a genetic algorithm to initialize the weights in the neural network. The experiment was done using MATLAB on a dataset of 50 people collected by the American Heart Association that included thirteen attributes. The results showed an accuracy of 98% and 84% when carried out using six hidden nodes and ten hidden nodes respectively.
Amma in [20] presented a system for heart disease diagnosis that combines ANN and a genetic algorithm. The dataset used was the Cleveland dataset. Preprocessing the dataset consisted of filling in missing values and normalizing the data using Min-Max normalization. The weights of the neural network were determined using the genetic algorithm. The accuracy obtained was 94.17%.
Venkatalakshmi and Shivsankar in [21] compared Naive Bayes and Decision Tree to determine which one has the higher accuracy for heart disease prediction. The dataset used was the UCI heart disease dataset. The experiment was carried out using the WEKA tool and showed an accuracy of 85.03% and 84.01% for Naive Bayes and Decision Tree respectively. As future work, the paper suggested using a genetic algorithm in MATLAB to reduce the number of features before feeding the dataset into the WEKA tool.
Palaniappan and Awang in [22] proposed an Intelligent Heart Disease Prediction System (IHDPS) using multiple classification techniques, namely Decision Tree, Naive Bayes and Neural Network. The system is web-based and was implemented using the .NET framework. The data source consisted of 909 records with fifteen attributes obtained from the Cleveland Heart Disease database. The Data Mining Extension (DMX) query language was used to create the model. The results showed that Naive Bayes proved to be the most efficient with 86.53% correct predictions, followed by Neural Network with only 1% difference.
Dangare and Apte in [23] developed a model for predicting heart disease. The dataset used is the Cleveland database consisting of 303 records alongside the Statlog database comprising 270 records. Instead of using only the thirteen attributes present in the dataset, they added two attributes: obesity and smoking. The WEKA tool was used for preprocessing the dataset. The classification techniques used for analyzing the dataset were Decision Tree, Naive Bayes, and ANN. According to the results, ANN gave an accuracy of 100%, Decision Tree 99.62%, and Naive Bayes 90.74%, which indicates that Artificial Neural Network is the highest performing algorithm.
Zriqat et al. in [24] developed an effective intelligent medical decision support system. Five classification algorithms were compared: Naive Bayes, Decision Tree, Discriminant, Random Forest, and Support Vector Machine. The analysis was done using MATLAB on two datasets, the Cleveland Heart Disease and the Statlog Heart Disease. The results showed that Decision Tree achieved the highest accuracy for both datasets, at 99.01% and 98.15% for the Cleveland and Statlog datasets respectively.
Liu et al. in [25] proposed a hybrid model for diagnosing heart disease. The dataset used was the Statlog heart disease dataset from the UCI repository. The model, developed with MATLAB, consisted of two subsystems: feature selection and classification. The feature selection subsystem used the ReliefF method to estimate the weight of features and then used the Rough Set feature selection approach (RFRS) to remove unnecessary features and improve the accuracy of the model. The classification subsystem used an ensemble classifier with the C4.5 algorithm (which is used to generate a Decision Tree) as the base. The results showed 92.59% classification accuracy.
Ghumbre et al. in [26] compared Support Vector Machine and Radial Basis Function (RBF), which is a type of ANN. The algorithms were applied to a patient dataset from India consisting of 214 records and 19 attributes to predict whether a person has heart disease or not. The performance of the algorithms was evaluated using the overall average through training and testing the dataset, 5-fold cross-validation, and 10-fold cross-validation. The overall average performance yielded 86.42% and 80.81% accuracy for SVM and RBF respectively. Their results showed that SVM provided better accuracy.
Masethe and Masethe in [27] applied several algorithms to diagnose heart disease, namely J48, Naive Bayes, REPTREE, Simple Cart (Classification and Regression Tree), which is a type of Decision Tree, and Bayes Net. The dataset used for this work was obtained from South African physicians and contains eleven attributes: patient identification number (replaced with dummy values to protect the privacy of patients), gender, cardiogram, age, chest pain, blood pressure level, heart rate, cholesterol, smoking, alcohol consumption and blood sugar level. The tool used in the experiment was WEKA. The performance evaluation was done using 10-fold cross-validation to assess the efficiency of the built model. The results showed an accuracy of 99.0471% for J48, 99.0471% for REPTREE, 97.222% for Naive Bayes, 98.1481% for Bayes Net, and 99.0741% for Simple Cart, showing that Simple Cart outperformed the rest.
4. Comparison of ML classification techniques for heart disease prediction
This section provides a tabular comparison between all the research papers described above.
The comparison is made on the basis of accuracy and can be seen in Table 2. The table has six elements, which are as follows:
1) Author: This shows the author/s of the paper and the reference number.
2) Classification Technique/s: This represents the classification algorithm used in the research, whether it was a single algorithm, a comparison or a hybrid model.
3) Best Technique Found: This column is only applicable to papers having a comparison between multiple algorithms. It represents the best algorithm found in the research work, which is the algorithm with the highest accuracy.
4) Tool: The framework or programming language used to build the model is shown in this column. It is what the researcher used to pre-process the input dataset, create the predictive model and test it.
5) Dataset: This shows the dataset that was used as an input for the classification algorithm.
6) Accuracy: This represents the accuracy of the results of the proposed model. If the paper contained a comparison, this column only shows the accuracy of the best technique found by the author.
Table 2: Comparison of Classification Techniques for Heart Disease Prediction
Author | Classification Technique/s | Best Technique Found | Tool | Dataset | Accuracy
Vembandasamy et al. [11] | NB | n/a* | WEKA | A diabetic research institute in Chennai | 86.4198%
Medhekar et al. [12] | NB | n/a | Not mentioned | Cleveland (UCI) | 88.96%
Das et al. [7] | ANN Ensemble | n/a | SAS Enterprise Miner 5.2 | Cleveland (UCI) | 89.01%
Chen et al. [13] | ANN LVQ | n/a | C and C# | Cleveland (UCI) | 80%
Dangare and Apte [14] | ANN | n/a | WEKA | Cleveland and Statlog (UCI) | Nearly 100%
Sabarinathan and Sugumaran [15] | DT | J48 with feature selection | Not mentioned | A dataset with 240 records for training and 120 for testing | 85%
Patel et al. [16] | DT | J48 | WEKA | Cleveland (UCI) | 56.76%
Shouman et al. [17] | KNN | n/a | Not mentioned | Cleveland (UCI) | 97.4%
Wiharto et al. [18] | SVM | BTSVM | Not mentioned | Cleveland (UCI) | 61.86%
Khateeb and Usman [3] | NB, KNN, DT and bagging technique | KNN | WEKA | Cleveland (UCI) | 79.20%
Pouriyeh et al. [6] | NB, DT, MLP, KNN, SCRL, RBF, SVM, bagging, boosting and stacking | Boosting with SVM | Not mentioned | Cleveland (UCI) | 84.81%
Amin et al. [19] | ANN and Genetic Algorithm hybrid system | n/a | MATLAB | American Heart Association dataset | 89%
Waghulde and Patil [8] | ANN and Genetic Algorithm hybrid system | n/a | MATLAB | American Heart Association dataset | 98%
Amma [20] | ANN and Genetic Algorithm hybrid system | n/a | Not mentioned | Cleveland (UCI) | 94.17%
Venkatalakshmi and Shivsankar [21] | NB and DT | NB | WEKA | UCI | 85.03%
Palaniappan and Awang [22] | DT, NB and ANN | NB | DMX | Cleveland (UCI) | 86.53%
Dangare and Apte [23] | DT, NB and ANN | ANN | WEKA | Cleveland and Statlog (UCI) | Nearly 100%
Zriqat et al. [24] | NB, DT, Discriminant, Random Forest, and SVM | DT | MATLAB | Cleveland and Statlog (UCI) | 99.01% for Cleveland and 98.15% for Statlog
Liu et al. [25] | ReliefF and Rough Set (RFRS) for feature reduction, Ensemble using C4.5 for classification | n/a | MATLAB | Statlog (UCI) | 92.59%
Ghumbre et al. [26] | SVM and Radial Basis Function | SVM | Not mentioned | Indian patients dataset of 214 records and 19 attributes | 86.42%
Masethe and Masethe [27] | J48, NB, REPTREE, Simple Cart, and Bayes Net | Simple Cart | WEKA | South African dataset containing 11 attributes | 99.0741%
*n/a: not applicable.
5. Conclusion and final remarks
This paper overviews the literature on machine learning classification methods for diagnosing heart disease. Many representative papers on using machine learning classification techniques were surveyed and categorized. The accuracy of the proposed models varies depending on the tool used, the dataset used, the number of attributes and records in the dataset, the preprocessing techniques, and the classifier implemented in the model. It also depends on whether the model is hybrid and whether it uses feature selection. From Table 2, we can conclude that the researchers who produced the highest accuracy were Dangare and Apte, using Artificial Neural Network (ANN), the WEKA tool and a combination of the Cleveland and Statlog heart disease datasets. We conclude that to build an accurate heart disease prediction model, a dataset with sufficient samples and correct data must be used. The dataset must be preprocessed accordingly, because preprocessing is the most critical step in preparing the dataset for the machine learning algorithm and obtaining good results. A suitable algorithm must also be used when developing a prediction model. We can observe that Artificial Neural Network (ANN), as well as Decision Tree (DT), performed well in most models for predicting heart disease.
Finally, the use of machine learning for diagnosing heart disease is an important field that can help both healthcare professionals and patients. It is still a growing field, and despite the massive availability of patient data in hospitals and clinics, not much of it is published. As observed in Table 2, most researchers obtained their datasets from the same source, the UCI repository. Since the quality of the dataset is an essential factor in the accuracy of the prediction, more hospitals should be encouraged to publish high-quality datasets (while protecting the privacy of patients) so that researchers have a good source to help them develop their models and obtain good results.
Acknowledgement
This work was made possible by the financial support from the Applied Science Private University in Amman, Jordan.
References
[1] I. Kononenko, “Machine learning for medical diagnosis: History, state of the art and perspective,” Artificial Intelligence in Medicine, vol. 23, no. 1, pp. 89–109, 2001. https://doi.org/10.1016/S0933-3657(01)00077-X.
[2] J. Soni et al., “Intelligent and effective heart disease prediction system using weighted associative classifiers,” International Journal on Computer Science and Engineering, vol. 3, no. 6, pp. 2385–2392, 2011.
[3] N. Khateeb and M. Usman, “Efficient heart disease prediction system using k-nearest neighbor classification technique,” in Proceedings of the International Conference on Big Data and Internet of Thing (BDIOT), New York, NY, USA: ACM, 2017, pp. 21–26. https://doi.org/10.1145/3175684.3175703.
[4] H. Almarabeh and E. Amer, “A study of data mining techniques accuracy for healthcare,” International Journal of Computer Applications, vol. 168, no. 3, pp. 12–17, Jun 2017.
[5] M. Fatima and M. Pasha, “Survey of machine learning algorithms for disease diagnostic,” Journal of Intelligent Learning Systems and Applications, vol. 9, no. 01, pp. 1–16, 2017. https://doi.org/10.4236/jilsa.2017.91001.
[6] S. Pouriyeh et al., “A comprehensive investigation and comparison of machine learning techniques in the domain of heart disease,” in Proceedings of IEEE Symposium on Computers and Communications (ISCC), Heraklion, Greece: IEEE, July 2017, pp. 204–207. https://doi.org/10.1109/ISCC.2017.8024530.
[7] R. Das, I. Turkoglu, and A. Sengur, “Effective diagnosis of heart disease through neural networks ensembles,” Expert Systems with Applications, vol. 36, no. 4, pp. 7675–7680, 2009. https://doi.org/10.1016/j.eswa.2008.09.013.
[8] N. Waghulde and N. Patil, “Genetic neural approach for heart disease prediction,” International Journal of Advanced Computer Research, vol. 4, no. 3, p. 778, 2014.
[9] S. Garcia et al., “Big data preprocessing: methods and prospects,” Big Data Analytics, vol. 1, no. 1, p. 9, Nov 2016. https://doi.org/10.1186/s41044-016-0014-0.
[10] A. Janosi et al., “Heart disease data set,” Jul 1988. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/heart Disease.
[11] K. Vembandasamy, R. Sasipriya, and E. Deepa, “Heart diseases detection using naive bayes algorithm,” International Journal of Innovative Science, Engineering & Technology, vol. 2, no. 9, pp. 441–444, 2015.
[12] D. Medhekar, M. Bote, and S. Deshmukh, “Heart disease prediction system using naive bayes,” International Journal of Enhanced Research In Science Technology & Engineering, vol. 2, no. 3, pp. 1–5, 2013.
[13] A. Chen et al., “HDPS: Heart disease prediction system,” in Computing in Cardiology, Hangzhou, China: IEEE, 2011, pp. 557–560.
[14] C. Dangare and S. Apte, “A data mining approach for prediction of heart disease using neural networks,” International Journal of Computer Engineering & Technology, vol. 3, no. 3, pp. 30–40, 2012.