Volume 7, Issue 7, July – 2022 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Hybrid Heart Disease Prediction Model using Machine Learning Algorithm

Ankita Singha (1), Anushka Sikdar (2), Palak Choudhary (3), Pranati Rakshit (4), Sonali Bhattacharyya (5)
(1, 2, 3) B. Tech Student, (4, 5) Associate Professor
Department of Computer Science and Engineering,
JIS College of Engineering, Kalyani, India

Abstract:- Machine learning is used worldwide in a variety of fields. It can play a crucial role in determining whether or not heart disorders are present, and such information, if predicted well in advance, can give clinicians crucial insights. Most of our work focuses on applying machine learning algorithms to predict possible heart problems. In the course of this work we compare classifiers such as Naive Bayes, Logistic Regression, SVM, XGBoost, Random Forest, etc. Because it draws on a wide range of samples for training and validation, Random Forest serves as an ensemble classifier that performs hybrid classification by using both strong and weak classifiers. We therefore analyse proposed and existing classifiers, such as AdaBoost and XGBoost, that offer the highest accuracy and predictive power. The best accuracy is provided by XGBoost (90.6%).

Keywords:- SVM, Naive Bayes, Random Forest, logistic regression, AdaBoost, XGBoost, Python programming, confusion matrix, and matrix.

I. INTRODUCTION

The World Health Organization estimates that cardiovascular disease causes 12 million deaths worldwide each year. Cardiovascular disease is one of the leading causes of death and illness around the world, and its prediction is regarded as one of the most important topics in data analysis. Over the past few years, the incidence of the disease has increased rapidly everywhere in the world. Numerous studies have been carried out to identify the most influential risk factors for cardiovascular disease and to predict the risk accurately. Cardiovascular disease is also referred to as a silent killer, because it kills a person without showing any evident signs. Early diagnosis of cardiovascular disease is crucial in helping patients decide whether to adjust their lifestyles, which in turn reduces complications.

With machine learning, the huge volume of data in the health care industry can be used to make decisions and predictions. This study uses machine learning to analyse patient data and classify whether or not patients have cardiovascular disease, in order to predict future cardiovascular disease. Machine learning techniques are extremely helpful in this respect. Although cardiovascular disease can manifest in many different ways, there is a common set of critical risk indicators that can determine whether someone is ultimately at risk. We consider this method suitable for predicting cardiovascular illness by gathering information from many sources, classifying it under relevant headings, and finally analysing it to extract the necessary knowledge.

Machine learning is highly complex, and the way it works varies depending on the task and the algorithm used to accomplish it. At its core, however, a machine learning model is a computer looking at data and identifying patterns, then using those insights to better complete its assigned task. Any task that depends on a set of data points or rules can be automated using machine learning, even more complicated tasks such as responding to customer service calls and reviewing resumes.

A Decision Process: In general, machine learning algorithms are used to make a prediction or classification. Based on some input data, which may be labelled or unlabelled, the algorithm produces an estimate about a pattern in the data.

An Error Function: An error function serves to evaluate the prediction of the model. If there are known examples, the error function can make a comparison to assess the accuracy of the model.

A Model Improvement Process: If the model can fit the data points in the training set better, weights are adjusted to reduce the difference between the model estimate and the known example. When a class label is predicted for a specific example of input data, this is referred to as classification in machine learning.

A. Supervised Learning
Supervised learning is a type of machine learning in which machines are trained using well "labelled" training data and then predict the output based on that data. Labelled data means that some input data has already been tagged with the correct output.

In supervised learning, the training data given to the machines acts as the supervisor that teaches them to predict the output correctly. It applies the same idea as a pupil learning under a teacher's supervision.

Supervised learning is a process of providing the machine learning model with input data together with the correct output data. The purpose of a supervised learning algorithm is to

IJISRT22JUL359 www.ijisrt.com 444

find a mapping function that maps the input variable (x) to the output variable (y).

The following are supervised machine learning algorithms:
• Linear Regression
• Logistic Regression
• Support Vector Machines (SVM)
• Neural Networks
• Random Forest
• Gradient Boosted Trees
• Decision Trees
• Naive Bayes

B. Unsupervised Learning
Unsupervised learning cannot be applied directly to a regression or classification problem because, in contrast to supervised learning, we have the input data but no corresponding output data.

Fig. 1: System Architecture

Unsupervised learning aims to identify the underlying structure of a dataset, group related pieces of data together, and represent that dataset in a compressed format.

Unsupervised learning is helpful in extracting useful insights from the data. It is much closer to how a person learns to think from their own experiences, which brings it closer to real AI. Because it works on unlabelled and uncategorised data, unsupervised learning becomes even more important.

In the real world, we do not always have input data with the corresponding output, so unsupervised learning is needed in these situations.

C. Reinforcement Learning
Reinforcement learning is an area of machine learning. It is concerned with taking suitable actions to maximise reward in a particular situation. Various software and machines use it to find the best possible behaviour or path to take in a specific situation. Reinforcement learning differs from supervised learning in that, in supervised learning, the training data already includes the correct answers, so the model is trained with them.

II. RELATED WORK

Using data from the UCI Machine Learning dataset, numerous studies and experiments have been conducted to identify heart disease. Various data mining approaches have been used to obtain high levels of precision. These approaches are described as follows:

Avinash Golande and colleagues investigate various machine learning techniques that can be applied to the classification and prediction of heart disorders. The study examined the Decision Tree, k-nearest neighbour, and k-means algorithms, which can be used for classification, and compared their accuracy. It concludes that the Decision Tree produces accurate predictions, and that it could be made more effective by combining several techniques and fine-tuning the parameters.

T. Nagamani et al. developed a system that combined the MapReduce algorithm with data mining techniques. For the 45 instances in the testing set, the accuracy obtained in this study was better than the accuracy obtained using a traditional fuzzy artificial neural network. The accuracy of the algorithm was improved by the use of linear scaling and dynamic schema.

Fahd Saleh Alotaibi developed a machine learning model by comparing five different algorithms. The RapidMiner tool was used, which produced results that were more accurate than those from the MATLAB and Weka tools. In this study, the classification accuracy of Decision Tree, Logistic Regression, Random Forest, Naive Bayes, and SVM was compared. The decision tree algorithm produced the most accurate results.

Anjan Nikhil Repaka et al. proposed a system for disease prediction that used NB (Naïve Bayesian) techniques for classification of the dataset and the AES (Advanced Encryption Standard) algorithm for secure data transfer.

Theresa Princy R. et al. conducted a survey of the different classification algorithms used for heart disease prediction. The classification techniques covered were Naive Bayes, KNN (K-Nearest Neighbour), Decision Tree, and Neural Network. The accuracy of the classifiers was then examined for different numbers of attributes.

Nagaraj M. Lutimath et al. predicted heart disease using Naive Bayes classification and SVM (Support Vector Machine). Mean Absolute Error, Sum of Squared Error, and Root Mean Squared Error were the principal metrics employed in the analysis, and it was established that SVM appeared to be a superior method to Naive Bayes in terms of precision.
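
The three error metrics named in the last study above can be illustrated with a short sketch. For n true values y_i and predictions p_i, MAE is the mean of |y_i - p_i|, SSE is the sum of (y_i - p_i)^2, and RMSE is the square root of SSE / n. The function name `error_metrics` and the sample values below are assumptions made for illustration only; they are not taken from the cited work.

```python
import math

def error_metrics(y_true, y_pred):
    """Return (MAE, SSE, RMSE) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n   # Mean Absolute Error
    sse = sum(e * e for e in errors)        # Sum of Squared Error
    rmse = math.sqrt(sse / n)               # Root Mean Squared Error
    return mae, sse, rmse

# Hypothetical labels (1 = disease, 0 = no disease) and predicted scores.
y_true = [1, 0, 1, 1, 0]
y_pred = [0.9, 0.2, 0.6, 0.8, 0.1]
mae, sse, rmse = error_metrics(y_true, y_pred)
print(f"MAE={mae:.3f} SSE={sse:.3f} RMSE={rmse:.3f}")
```

A lower value is better for all three metrics. Because SSE and RMSE square each error, they penalise large individual errors more strongly than MAE does.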

After reviewing the above studies, the fundamental idea behind the proposed system was to build a heart disease prediction system based on the necessary inputs.

By comparing the accuracy, precision, recall, and f-measure scores of the various classification algorithms, including Decision Tree, Random Forest, Logistic Regression, and Naive Bayes, we were able to determine which classification method would be most effective at predicting heart disease.

III. METHODOLOGY

A. Existing System
Heart disease is highlighted as a silent killer that leads to death in people with no outward signs of the condition. The nature of the disease is the source of mounting concern about the illness and its effects. As a result, continued efforts are being made.

B. Proposed System
Data gathering and the selection of relevant attributes are the first steps in the system's operation. The required data is then pre-processed into the required format. The data is then divided into training and testing data. The algorithms are applied, and the model is trained using the training data. The accuracy of the system is obtained by testing the system with the testing data. The system is implemented using the following modules:

• Collection of Dataset
• Selection of Attributes
• Data Pre-Processing
• Balancing of Data
• Disease Prediction

a) Collection of dataset
We first collect a dataset for our heart disease prediction system. After collecting the dataset, we split it into training data and testing data. The training dataset is used to train the prediction model, and the testing dataset is used to evaluate it. In this project, 70% of the data is used for training and 30% for testing.

The dataset used for this project is Heart Disease UCI. The dataset contains 76 attributes; 14 of them are used by the system.

b) Selection of attributes
Attribute or feature selection is the process of choosing appropriate attributes for the prediction system. It improves the system's efficiency. Various patient attributes, including gender, chest pain type, fasting blood pressure, serum cholesterol, exang, etc., are considered for the prediction. The correlation matrix is used to select the attributes for this model.

c) Pre-processing of data
Data pre-processing is a crucial first step in creating a machine learning model. Initially, the data may not be clean or in the format the model requires, which could lead to misleading results. During pre-processing, we transform the data into the format we need. Pre-processing is used to deal with noise, duplicates, and missing values in the dataset. It includes tasks such as importing the dataset, splitting the dataset, attribute scaling, etc. Pre-processing is necessary to improve the model's accuracy.

Fig. 2: Data Pre-processing

d) Balancing of data
Imbalanced datasets can be balanced in one of two ways: (a) under sampling and (b) over sampling.

a. Under Sampling
In under sampling, the dataset is balanced by reducing the size of the abundant class. This approach is considered when there is enough data.

b. Over Sampling
In over sampling, the dataset is balanced by increasing the size of the scarce class. This approach is considered when there is not enough data.

e) Prediction of disease
Various machine learning algorithms, such as SVM, Naive Bayes, Decision Tree, Random Forest, Logistic Regression, AdaBoost, and XGBoost, are used for classification. A comparative analysis is performed among the algorithms, and the algorithm that provides the highest accuracy is then used to predict heart disease for patients.

C. Machine Learning Algorithms
Machine learning is a powerful technology, defined as the systematic study of algorithms that give systems the ability to mimic human learning processes without being explicitly programmed. Machine learning is further divided into unsupervised learning, supervised learning, and reinforcement learning.
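
The data handling steps of the proposed system (attribute selection by correlation, the 70/30 train/test split, and balancing by over sampling) might be sketched as follows. This is a minimal illustration that uses only the Python standard library. The helper names `pearson`, `split_train_test` and `oversample_minority`, the fixed seed, and the toy records are all assumptions; this is not the authors' implementation, and random duplication is only one possible way of enlarging the scarce class.

```python
import math
import random
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def oversample_minority(rows, label_of, seed=0):
    """Duplicate randomly chosen rows of the scarce classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

# Toy records (hypothetical): two numeric attributes and a label,
# where 1 = disease and 0 = no disease.
records = [(i % 10, i % 7, 1 if i % 5 == 0 else 0) for i in range(100)]
labels = [row[-1] for row in records]

# Correlation of each attribute with the label, as a basis for selection.
correlations = [pearson([row[j] for row in records], labels) for j in range(2)]

# 70/30 split, then balance the training part only.
train, test = split_train_test(records)
balanced_train = oversample_minority(train, label_of=lambda row: row[-1])
counts = Counter(row[-1] for row in balanced_train)
print(len(train), len(test), dict(counts))
```

In this sketch only the training part is balanced, so the test data keeps its original class proportions. That is a design choice of the sketch, not something the paper specifies.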

D. Naïve Bayes Algorithm
The Naïve Bayes algorithm is a supervised learning method that is based on Bayes' theorem and is used to solve classification problems. It is mostly used in text classification, which involves a large training set.

The Naïve Bayes classifier is among the simplest and most effective classification algorithms, and it helps in building fast machine learning models that produce quick predictions.

It is a probabilistic classifier, which means that it predicts on the basis of the probability of an object. Spam filtering, sentiment analysis, and classifying articles are some popular applications of the Naïve Bayes algorithm.

It is a classification method that relies on Bayes' theorem and the assumption of independence among predictors. Simply put, a Naïve Bayes classifier assumes that the presence of one particular feature in a class is unrelated to the presence of any other feature.

Fig. 3: Naïve Bayes Classifier

E. Support Vector Machine (SVM)
Support Vector Machine, or SVM, is one of the most well-known supervised learning algorithms and is used for both classification and regression problems. It is, however, mostly used for classification problems in machine learning.

The objective of the SVM algorithm is to create the best line or decision boundary that divides an n-dimensional space into classes, allowing us to quickly assign new data points to the appropriate class in the future. This best decision boundary is called a hyperplane. SVM selects the extreme points (vectors) that help create the hyperplane.

F. Decision Tree Algorithm
Decision trees are a supervised learning method that can be used to solve both classification and regression problems, although they are most frequently used for classification. In the classifier's tree-like structure, internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the final classification outcome. There are two kinds of node in a decision tree: the decision node and the leaf node. Decision nodes are used to make a decision and have multiple branches, while leaf nodes are the output of those decisions and have no further branches. The decisions or tests are made on the basis of the features of the given dataset. The tree is a graphical representation of all possible solutions to a problem or decision under the given conditions. It is called a decision tree because, like a tree, it starts with a root node and grows by adding more branches to form a tree-like structure. To build the tree we use the CART algorithm, which stands for Classification and Regression Tree algorithm. A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.

The Decision Tree algorithm belongs to the family of supervised machine learning algorithms. It is frequently used for both classification and regression problems.

The objective of this algorithm is to predict the value of a target variable. To do this, a decision tree is used, in which the internal nodes of the tree represent attributes and the leaf nodes correspond to class labels.

The main point to keep in mind when creating a machine learning model is to choose the best algorithm for the given dataset and problem, since there are several machine learning algorithms. The following list includes the two justifications for using the decision tree:

a) Based on a comparison with the record's (real dataset) attribute, the branch is followed and the algorithm jumps to the next node.
For each subsequent node, the algorithm again compares the attribute value with the other sub-nodes and moves on. The process is carried out until the tree's leaf node is reached. The entire procedure can be better understood using the following algorithm:
• Step-1: Begin the tree with the root node, say S, which holds the entire dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Create the decision tree node that contains the best attribute.
• Step-5: Recursively build new decision trees using the subsets of the dataset created in Step-3. Continue this process until the nodes can no longer be classified further, at which point the final node is declared a leaf node.

G. Random Forest Algorithm
Random Forest is a supervised learning algorithm. It is an extension of machine learning classifiers that uses bagging to improve decision tree performance. It combines tree predictors, and trees are
dependent on a randomly sampled vector. All trees have the same distribution.

On randomly selected data samples, Random Forests generate decision trees, obtain a prediction from each tree, and then select the best solution by voting. Random Forest also gives a clear indication of feature importance.

Random Forest is a classifier that uses a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy for that dataset. Instead of relying on a single decision tree, the random forest takes the predictions from all trees and predicts the final output based on the majority vote of those predictions.

A greater number of trees in the forest results in greater accuracy and avoids the issue of overfitting.

Fig. 3: Random Forest Algorithm

H. Logistic Regression Algorithm
Logistic regression is one of the most popular machine learning algorithms and falls under the supervised learning techniques. It is used to predict a categorical dependent variable from a given set of independent variables.

In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).

The curve of the logistic function shows the likelihood of something, such as whether or not cells are cancerous, or whether or not a mouse is obese based on its weight.

Logistic regression is an important machine learning technique because of its ability to classify new data using discrete and continuous datasets.

Fig. 4: Logistic Regression

It is difficult to capture complex relationships using logistic regression. More powerful and complex algorithms such as Neural Networks can easily outperform this algorithm.

I. AdaBoost Algorithm
AdaBoost was the first really successful boosting algorithm created with binary classification in mind. The name AdaBoost stands for "Adaptive Boosting", and it is a well-known boosting approach that combines several "weak classifiers" into one "strong classifier".

At first, AdaBoost randomly selects a training subset.

It iteratively trains the AdaBoost machine learning model by selecting the training set based on the accurate predictions of the previous training round.

It gives incorrectly classified observations a higher weight so that they have a higher chance of being correctly classified in the following round.

Additionally, it assigns a weight to the trained classifier in each iteration in accordance with the classifier's accuracy. More accurate classifiers are given more weight.

This process is repeated until the entire training set is fitted. The final boosted classifier is created by adding these classifiers in a weighted manner. The weights are determined by each classifier's accuracy.

Reweighting: AdaBoost requires a learning algorithm that takes the weighted input instances into account, meaning that the loss function should give more weight to the examples with higher weights.

J. XGBoost Algorithm
XGBoost is an implementation of gradient boosted decision trees. It is a software library that was created primarily to improve model performance and speed. In this approach, decision trees are created sequentially. Weights play a crucial role in XGBoost. Weights are assigned to all the independent variables, which are then fed into the decision tree that predicts outcomes. The weight of variables that the tree predicted incorrectly is increased, and these variables are then fed to the second decision tree. These individual classifiers/predictors are then combined to provide a robust
and more accurate model. It can be applied to regression, classification, ranking, and user-defined prediction problems.

• The Power of XGBoost
The beauty of this powerful algorithm lies in its scalability, which drives superfast learning through parallel and distributed computing and offers efficient memory usage.

IV. RESULTS & DISCUSSION

After training the machine learning algorithms, we obtained the outputs presented in tabular form below.

Algorithm              Accuracy
XGBoost                90.6%
SVM                    82.5%
Logistic Regression    83.5%
Random Forest          90.2%
Naive Bayes            76.9%
Decision Tree          89.3%
AdaBoost               83.4%
Table 1: Accuracy comparison of algorithms

The highest accuracy is given by the XGBoost algorithm.

V. PERFORMANCE ANALYSIS

In this project, various machine learning algorithms, namely SVM, Naive Bayes, Decision Tree, Random Forest, Logistic Regression, AdaBoost, and XGBoost, are used to predict heart disease. The Heart Disease UCI dataset has a total of seventy-six attributes, of which only fourteen are considered for the prediction of heart disease. Various patient attributes, such as gender, chest pain type, fasting blood pressure, serum cholesterol, exang, etc., are considered for this project. The accuracy of each individual algorithm is measured, and whichever algorithm gives the best accuracy is considered for the heart disease prediction. Several evaluation criteria, including accuracy, confusion matrix, precision, recall, and f1-score, have been considered for analysing the experiment.

VI. CONCLUSION AND FUTURE SCOPE

Applying promising technology, such as machine learning, to the early prediction of heart problems can have a significant influence on society. Heart diseases are a major cause of death in India and around the world. Early prediction of heart problems will help in making decisions about lifestyle changes in high-risk patients and in turn reduce complications, which may be an important milestone in the field of medicine. The number of people suffering from heart disease is increasing yearly. This calls for both early detection and treatment. The medical community and patients will find the use of suitable technology support in this regard to be of great benefit. In this research, seven different machine learning algorithms are applied to the dataset and tested: SVM, Decision Tree, Random Forest, Naive Bayes, Logistic Regression, Adaptive Boosting, and Extreme Gradient Boosting.

The dataset contains seventy-six features covering the expected characteristics that lead to heart conditions in patients, and fourteen important features that are useful for evaluating the system are selected from among them. From the comparison of the accuracies of the seven machine learning methods, one prediction model was developed. A variety of evaluation metrics were applied, with XGBoost providing the best accuracy (90.6%).

ACKNOWLEDGMENT
We acknowledge the persons who helped us in carrying out the present work.