Development of Heart Disease Prediction System Using Firefly Feature Selection and Logistic Regression Algorithm (Tobless)
INTRODUCTION
Heart disease is a leading cause of death worldwide, accounting for over 17.9 million
deaths per year. Early prediction and diagnosis of heart disease can improve treatment
outcomes, reduce healthcare costs, and enhance patient quality of life. Traditional
prediction methods rely on medical professionals’ expertise and manual analysis of patient
A scarcity of clinical specialists, a growth in the number of chronic diseases, and rising
healthcare costs are all barriers in today's environment. Heart disease remains the leading
cause of premature mortality. Heart disease occurs when the heart does not pump enough
blood to meet the body's needs (Zheng, 2018). Because of multiple contributing risk
factors such as diabetes, high blood pressure, high cholesterol, abnormal pulse rate, and
many other conditions, it is difficult to detect heart disease.
People's health, particularly their hearts, suffers as a result of busy lifestyles and junk
food consumption. An accurate decision support system can play a key role in the early-
are still not available in remote, semi-urban, and rural locations (Rani et al., 2021).
Heart disease is a harmful condition that impairs the functioning of the heart. It covers a
group of diseases from which many people suffer, such as heart failure, coronary artery
disease (CAD), heart arrhythmia, heart valve disease, pericardial disease, cardiomyopathy
(heart muscle disease), and congenital heart disease. Cardiovascular disease is one of the
most dangerous of these. According to the World Health Organization, 17.9 million deaths
occur worldwide each year due to cardiovascular disease. Heart problems seriously disrupt
people's lifestyles. Nowadays
conditions that affect the heart mechanism. Cardiovascular disease is therefore identified
and measured by assessing a range of conditions in the affected heart. This project can
predict and detect heart disease in patients from their medical history reports. It can
help those who have heart disease symptoms such as high blood pressure, asthma, heart
valve pain, and chest pain by supporting effective, accurate treatment with less medical
algorithm. Heart disease (HD) prediction built with the logistic regression algorithm
achieved a rate of 62%. The regression algorithm is a machine learning technique and is
widely used in skin cancer and breast cancer prediction. Improper diet, high sugar intake
at a young age, increasing age, high-calorie food, and lack of physical activity are the
main causes of heart disease. Meditation, physical activities, a proper diet, and
Heart disease is a leading cause of mortality worldwide, accounting for over 17.9 million
deaths per year. Early prediction and diagnosis of heart disease can significantly reduce
the risk of death and improve patient outcomes. However, the accuracy of
analysis of patient data, which can be time-consuming and prone to errors. Early
identification of people who are at a high risk of contracting the illness is essential for
effective heart disease prevention and management. There is a need for more precise and
effective prediction models because traditional risk factors, including age, family history
High Dimensionality: The vast number of features collected in electronic health records
(EHRs) and other data sources makes it challenging to identify the most relevant features
Feature Selection: Existing feature selection methods may not effectively identify
the most informative features, leading to reduced model accuracy and increased risk of
overfitting
Model Complexity: Logistic regression models may not capture complex nonlinear
Data Imbalance: Heart disease datasets often suffer from class imbalance, where the
number of healthy instances far exceeds the number of diseased instances, leading to
biased models.
Aim
The aim of this project is to develop a heart disease prediction system using firefly feature
Objectives
The objective of the heart disease (HD) prediction system is to detect heart disease from a
patient's age, name, medical history report, cardiovascular test, blood test,
electrocardiogram, nuclear cardiac stress test, and so on. A dataset of patient attributes
and medical history reports is obtained from the Kaggle repository. Using this dataset and
its 14 attributes, heart disease can be predicted and detected. In order to predict disease as early as
The significance of the system lies in developing a heart disease prediction system using
firefly feature selection and the logistic regression algorithm. The system uses a dataset
of patient characteristics and medical history. Firefly feature selection is applied to
identify the most relevant features. The logistic regression algorithm is then used to predict the
This study intends to predict heart disease with high accuracy by proposing an improved
feature selection and enhanced classification approach. The project employs logistic
LITERATURE REVIEW
Based on a World Health Organization (WHO) report, there are 17.9 million deaths yearly from
cardiovascular disease, almost 32% of all deaths (Maghdid et al., 2022). According to the
WHO page, heart disease includes heart attack, stroke, and rheumatic heart disease.
Everyone has the potential for heart disease, men more so than women. Unhealthy lifestyle
factors, such as smoking, high cholesterol, high blood pressure, obesity, alcohol, and
hereditary history, are the most critical risk factors for heart disease (Latha et al.,
2020). Not all sufferers of heart disease die from it. A controlled lifestyle, such as eating habits and
Symptoms that indicate heart disease include shortness of breath (Alshuky et al., 2020),
physical fatigue (Nagarajam et al., 2020), and pain in the chest, arms, shoulders, or back
(Kim J, 2021). Heart disease is not easy to cure because it needs special treatment. As a
vital organ, the heart must be carefully protected. The simplest preventive measures are to
reduce smoking, eat a healthy diet, be physically active, and stop consuming alcohol
(Ndejjo, 2020). The various causes of heart disease may increase the prediction complexity.
With the growth of medical data sourced from patients' health records, there is a great
opportunity to use it as a basis for improving patient health. Currently, computers are
applied in various fields. In health, they can be used to improve the
machine learning as an analytical tool can find hidden patterns in the data (Hassan M,
prevention.
Heart disease, also known as cardiovascular disease, has several causes and risk factors.
i. High blood pressure: uncontrolled high blood pressure can damage blood
can lead to plaque buildup in arteries, increasing the risk of heart disease
iii. Smoking: Smoking damages blood vessels, increases blood pressure, and
iv. Diabetes: high blood sugar levels can damage blood vessels and nerves,
v. Obesity: excess weight can lead to high blood pressure, high cholesterol, and
vii. Lack of exercise: a sedentary lifestyle can contribute to obesity, high blood
viii. Age: heart disease risk increases with age especially after 45 for men and 55
for women.
Machine learning is the subset of artificial intelligence that focuses on building systems
that learn or improve performance based on the data they consume (Nasteski n.d.). It was born from pattern
recognition and the theory that computers can learn without being programmed to
computers could learn from data. The iterative aspect of machine learning is important
because as models are exposed to new data, they can independently adapt. They learn
from previous computations to produce reliable, repeatable decisions and results. The
practice of machine learning involves taking data, examining it for patterns, and
developing some sort of prediction about future outcomes (Liu et al. 2022). By feeding an
algorithm more data over time, data scientists can sharpen the machine learning model's
predictions. From this basic concept, several different types of machine learning have
developed.
2.2.1 Unsupervised Machine Learning
Unsupervised learning is a technique in which models are not supervised using a training
dataset. Instead, the models themselves find hidden patterns and insights in the given
data. It can be compared to the learning that takes place in the human brain while learning new things (Eweje et al.
2021).
because, unlike supervised learning, we have the input data but no corresponding output
data. The goal of unsupervised learning is to find the underlying structure of the dataset,
group that data according to similarities, and represent that dataset in a compressed
format.
containing images of different types of dogs. The algorithm is never trained on labeled
examples from the given dataset, which means it has no prior knowledge of the features of
the dataset. The task of the unsupervised learning algorithm is to identify the image
features on its own. An unsupervised learning algorithm will perform this task by clustering the image
The unsupervised learning algorithm can be further categorized into two types of
problems:
i. Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in a group and have fewer or no
similarities with the objects of another group (Benndorf et al. 2018). Cluster
analysis finds the commonalities between the data objects and categorizes them as
the set of items that occurs together in the dataset. The Association rule makes
Supervised learning is the type of machine learning in which machines are trained using
well-labeled training data and, on the basis of that data, predict the output (Nasteski
n.d.). Labeled data means that some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as the supervisor
that teaches the machines to predict the output correctly. It applies the same concept as a
student learning under the supervision of a lecturer. In supervised learning, models are
trained using labeled datasets, where the model learns about each type of data (Benndorf
et al. 2018). Once the training process is completed, the model is tested on test data
input variable and the output variable. It is used for the prediction of
ii. Classification: Classification algorithms are used when the output variable is
categorical, which means there are two classes such as Yes-No, Male-Female, and
True-False.
Logistic Regression:
Logistic regression is a popular machine learning algorithm used for predicting heart
categorical dependent variable (target variable) based on one or more predictor variables.
solve problems involving binary categorization. It is a type of regression that, in its
basic form, models a binary variable, such as whether or not an event occurs. Logistic
regression can help determine whether a new instance belongs to a given class. The result will be
ii. Odds Ratio: it estimates the odds ratio, which represents the change in odds of
iii. Sigmoid Curve: Logistic regression uses a sigmoid curve (S-shaped curve) to
ii. Risk assessment: estimating the probability of a specific outcome (e.g. heart
disease risk).
iii. Customer churn prediction: predicting whether a customer will leave or stay.
Where
P (outcome) is the probability of the outcome
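As an illustration of the sigmoid curve and odds ratio mentioned above, the following is a minimal sketch of the standard logistic model, in which P(outcome) is the sigmoid of a linear combination of the predictors. The function names and the coefficient values are hypothetical and are not taken from the fitted model in this project.

```python
import math


def sigmoid(z):
    """Standard logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(intercept, coefs, x):
    """P(outcome) = sigmoid(intercept + sum(coef_i * x_i))."""
    z = intercept + sum(c * v for c, v in zip(coefs, x))
    return sigmoid(z)


def odds_ratio(coef):
    """Odds ratio for a one-unit increase in a predictor: exp(coef)."""
    return math.exp(coef)


# Hypothetical coefficients, for illustration only (not fitted to any data).
p = predict_probability(-1.0, [0.5, 0.25], [2.0, 0.0])
print(round(p, 3))  # z = -1.0 + 1.0 + 0.0 = 0, so this prints 0.5
```

A coefficient of zero gives an odds ratio of 1, meaning that the predictor leaves the odds of the outcome unchanged.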
Feature Extraction aims to reduce the number of features in a dataset by creating new
features from the existing ones (and then discarding the original features). These new
reduced sets of features should then be able to summarize most of the information
contained in the original set of features. In this way, a summarized version of the original
features can be created from a combination of the original set (Gemescu et al. 2019). The
process of feature extraction is useful when you need to reduce the number of resources
extraction can also reduce the amount of redundant data for a given analysis. Also, the
reduction of the data and the machine’s efforts in building variable combinations
(features) facilitate the speed of learning and generalization steps in the machine learning
process.
Heart Disease Prediction System Using Data Mining and Hybrid Intelligent Techniques: heart
disease prediction is done mostly through doctors' expertise and experience. Computer-aided
decision support systems play a major role in the medical field. With the growing research
on heart disease prediction systems, it has become important to categorize the research
outcomes and provide readers with an overview of the existing heart disease prediction
techniques in each category. Neural networks are one of many data mining analytical tools
that can be utilized to make predictions from medical data. The study observes that the
hybrid intelligent algorithm improves the accuracy of the heart disease prediction system.
Heart Disease Detection Using Feature Extraction and Artificial Neural Networks: this work
uses an artificial neural network (ANN) technique to identify scent patterns in individuals
using ten metal oxide semiconductor sensors. Before ANN patterns are generated from the
sensor data, the sensory information is scanned and extracted from that data. Each
participant is scanned for a total of 1000 different characteristics during the course of
the multiple investigations, which are conducted across a variety of time periods and
include 5, 10, 15, and 20 people. Because of the varying time periods, signals from the
sensors are received in analog form, which an Arduino then converts into digital form. The
architecture must be trained on the data set that has been created. The
benchmarks that are employed for the assessment of the model that is presented for the
among other things. Experiments are carried out using the assessment measures, and the
findings demonstrate that this model has an accuracy of greater than 85% in most cases.
Heart Disease Prediction System Using Logistic Regression Algorithm this is done
Detecting the disease at an early stage may save the life of the patient. Data mining
techniques are very popular and have been used in many fields, including healthcare, to
help doctors make better decisions. Machine learning classification algorithms such as
decision tree (DT), Naïve Bayes, support vector machine (SVM), and logistic regression (LR)
are used in many studies for predicting heart disease. The dataset is collected from the
Kaggle repository. It contains 604 records and 14 attributes used to train the model that
will be used in the web application.
Building an efficient prediction model to be deployed into the web application is the main
(Jabbar et al., 2016) employed random forest (RF) to predict cardiac illness. The
chi-square (CHI) approach was utilized to select the relevant features. Compared to
decision trees, the research suggests that random forests yield more accurate results. The
proposed work of (Kim JK, 2017) was built using neural networks. Sensitivity analysis is
one of the evaluation metrics for prediction. The importance of features with a high degree
of sensitivity was considered. After selecting the relevant
sensitivity of each feature is determined by it. (Amin U, 2018) employed seven
classification algorithms to predict cardiac disease in people. This study used the Relief,
MRMR, and LASSO (Least Absolute Shrinkage and Selection Operator) feature selection methods
to choose the appropriate features.
In addition to the seven performance metrics this study employed, the ROC and AUC
will help clinicians diagnose heart patients more efficiently. To select appropriate
features, (Rani et al., 2021) used a genetic algorithm (GA) and recursive feature
elimination. The study used standard scaling and SMOTE to preprocess the data and
applied support vector machines, naive Bayes, logistic regression, random forest, and
an AdaBoost classifier to aid in the earlier prediction of heart disease based on the
patient's medical features. The system's simulation environment was built in Python, and
random forest achieved a maximum accuracy of 86.6 percent. (Ali et al., 2019) used the
chi-square statistical approach to pick significant features. The selected
features were fed into a deep neural network, which was then trained to
configuration.
(Paul et al., 2016) used a fuzzy decision support system (FDSS) with weighted fuzzy rules
derived from a genetic algorithm (GA).
They were able to recover eight useful features with an accuracy of 80%. Multiple heart
(Bashir et al., 2019) for experimentation analysis and to increase accuracy performance.
Feature selection algorithms such as Decision Tree, Logistic Regression SVM, Naïve
Bayes, and Random Forest are used with RapidMiner, and accuracy is enhanced.
(Liu et al., 2017) offered a study that used ReliefF and rough set approaches. The proposed
system consists of two subsystems: the RFRS feature selection system and an ensemble
classification system. The first system has three stages, including feature extraction
using the ReliefF method and feature reduction using a heuristic Rough Set reduction
technique. In the second system, which
technique had a classification accuracy of 92.32 percent. On the Cleveland heart disease
dataset, (Singh et al., 2017) used an RF classifier that can handle large amounts of data
with missing values. This classifier generates a large number of decision trees that are
selected through voting. The chosen branch is used to improve precision. Because the
dataset is clearly non-linear, this study was able to reach an accuracy of 85.81 percent.
CHAPTER THREE
RESEARCH METHODOLOGY
Machine learning algorithms have become very popular and are used in fields such as
healthcare and business to solve many problems. In this system, we propose a logistic
regression machine learning algorithm for predicting heart disease. The logistic
regression algorithm gives highly accurate prediction results. The user-friendly
web application is developed using Flask, HTML, and CSS. The user logs in to the
system and gives the required input for prediction.
3.1.1 Advantages
Dataset attributes:
Gender of the patient: 1 = Female
Chest pain type: 2 = Atypical Angina, 4 = Asymptomatic
Serum cholesterol
Resting blood pressure
Fasting blood sugar: Value 1: >120 mg/dl
Resting ECG result: 1 = having ST-T wave abnormality
Maximum heart rate achieved
9 Exang (exercise-induced angina): 0 = No, 1 = Yes
ST depression induced by exercise relative to rest
Slope of the peak exercise ST segment
Number of major vessels
Defect
Target (prediction): 1 = stage 1, 2 = stage 2, 3 = stage 3, 4 = stage 4
The data is imported into the Python environment from a CSV file. The independent
variables (name, gender, age, chest pain type, resting blood pressure, serum
cholesterol, fasting blood sugar, exercise-induced angina, resting ECG result, maximum
heart rate achieved, ST depression, slope of the peak exercise ST segment, and number of
major vessels) and the dependent variable (target value) are extracted and stored as x and
y respectively.
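The import step can be sketched as follows, using only the Python standard library. This is an illustration under assumptions: the column names, the two sample rows, and the choice to drop the non-numeric name column are hypothetical, since the actual file layout is not shown here.

```python
import csv
import io

# Hypothetical two-row sample; the real file and its column names are assumptions.
SAMPLE_CSV = """name,age,sex,cp,trestbps,target
A,63,0,1,145,1
B,41,1,2,130,0
"""


def load_features_and_target(handle, target_column="target", drop_columns=("name",)):
    """Read a CSV and return (x, y): numeric feature rows and target labels."""
    reader = csv.DictReader(handle)
    x, y = [], []
    for row in reader:
        y.append(int(row[target_column]))
        # Keep every column except the target and the explicitly dropped ones.
        x.append([float(value) for key, value in row.items()
                  if key != target_column and key not in drop_columns])
    return x, y


x, y = load_features_and_target(io.StringIO(SAMPLE_CSV))
print(x)  # [[63.0, 0.0, 1.0, 145.0], [41.0, 1.0, 2.0, 130.0]]
print(y)  # [1, 0]
```

To read from disk instead, the same function can be given an open file handle, for example `open(path, newline="")`, where `path` stands for wherever the dataset is stored.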
3.3 Flowchart Model
Flowchart stages: Data Collection; Data Processing; Feature Selection; Logistic Regression Classifier; Web Application; Performance Evaluation
It is the process of obtaining a subset from the original dataset without losing the
essential characteristics of the data set. In this step, irrelevant and noisy data are
removed. Removing irrelevant data has a large impact on the process: it improves accuracy,
reduces training time, and makes the model easier to understand. The data is divided into a
training set and a testing set. The features are then scaled in order to get an accurate result for
learning algorithms. Logistic regression is very popular and is used for classification
and prediction due to its high accuracy. There are two types of logistic regression models:
the binary logistic regression model and the multinomial logistic regression model. In the
binary logistic regression model, the target variable can be either 1 or 0. In the
multinomial logistic regression model, on the other hand, which is the model used in this
project, the dependent variable can have 3 or more possible values. The algorithm predicts
the dependent variable based on the independent variables. In this project, the logistic
regression model predicts whether the patient has heart disease or not, together with the
specific stage of the disease (target), based on the symptoms, personal details, and
medical tests (independent variables).
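To make the training step concrete, the sketch below fits a logistic regression model by batch gradient descent on the log-loss. It shows the binary case only, for brevity; a multinomial model would replace the sigmoid with a softmax over the disease stages. The toy data, learning rate, and number of epochs are assumptions chosen for illustration, not the settings used in this project.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(x, y, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(x), len(x[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(x, y):
            # Gradient of the log-loss for one sample is (prediction - label) * input.
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, row))) - label
            grad_b += error
            for j in range(d):
                grad_w[j] += error * row[j]
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def predict(weights, bias, row, threshold=0.5):
    """Return class 1 when the predicted probability reaches the threshold."""
    probability = sigmoid(bias + sum(w * v for w, v in zip(weights, row)))
    return 1 if probability >= threshold else 0


# Toy, already-scaled data (hypothetical): the label is 1 when the feature is positive.
x_train = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y_train = [0, 0, 0, 1, 1, 1]
weights, bias = fit_logistic(x_train, y_train)
predictions = [predict(weights, bias, row) for row in x_train]
print(predictions)  # [0, 0, 0, 1, 1, 1]
```

In practice, any scaling statistics would be computed on the training set and then applied unchanged to the test set, so that no information from the test data reaches the model.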
The hardware requirement refers to the tangible (physical) components to be used for the
development of the system, and these are: personal computer (PC), MacBook Air, 4 GB RAM
Windows 8 or a higher operating system can be used for the deployment of this
Apache (A), and Python 3 will all be used in the project to develop the system. Visual
Studio Code is the software package that will be used to create the source files to make the
4.1 RESULTS
The analyses were performed on the provided heart dataset multiple times, and the average
accuracy and the standard deviation were noted for each of the datasets. Because the heart
disease dataset is highly varied, repeatedly splitting the dataset to obtain the best
accuracy with the cross-validation approach took more than 13000 ms. Subsequently, for each
of the heart disease datasets, the proposed strategy was applied by choosing 10, 50, and
100 elements in the firefly algorithm, in turn.
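The repeated splitting and the reporting of average accuracy with its standard deviation can be sketched as a k-fold cross-validation loop. This is a minimal illustration: the number of folds, the seed, and the `train_and_score` callback are assumptions, and the majority-class baseline below merely stands in for the real classifier.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices once and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(x, y, k, train_and_score, seed=0):
    """Return (mean, standard deviation) of the score over k folds.

    train_and_score(x_train, y_train, x_test, y_test) is a caller-supplied
    function (hypothetical here) that fits a model and returns its accuracy.
    """
    scores = []
    for held_out in k_fold_indices(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        scores.append(train_and_score(
            [x[i] for i in train], [y[i] for i in train],
            [x[i] for i in held_out], [y[i] for i in held_out]))
    return statistics.mean(scores), statistics.stdev(scores)


def majority_baseline(x_train, y_train, x_test, y_test):
    """Stand-in model: always predict the most common training label."""
    majority = max(set(y_train), key=y_train.count)
    return sum(1 for label in y_test if label == majority) / len(y_test)


# Hypothetical data: 20 samples with balanced labels.
x = [[float(i)] for i in range(20)]
y = [0] * 10 + [1] * 10
mean_acc, std_acc = cross_validate(x, y, k=5, train_and_score=majority_baseline)
print(mean_acc, std_acc)
```

Replacing `majority_baseline` with a function that trains the logistic regression model on the selected features would give the kind of per-dataset mean and standard deviation described above.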
Regression
selection and logistic regression for predicting heart disease. The system achieved an
accuracy of 92.1%, a sensitivity of 90.5%, and a specificity of 93.5%, outperforming the
Regression Regression
Positive Positive 85 80
Positive Negative 5 10
Negative Positive 10 15
Negative Negative 90 85
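The evaluation measures can be computed from confusion-matrix counts as sketched below. The mapping of the table rows to true and false positives and negatives is an assumption (first label taken as the actual class, second as the predicted class). Accuracy does not depend on that reading, whereas sensitivity and specificity do.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity (recall), and specificity from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Counts from the first numeric column of the table, ASSUMING each row reads
# (actual, predicted): TP = 85, FN = 5, FP = 10, TN = 90.
first = confusion_metrics(tp=85, fn=5, fp=10, tn=90)
print(round(first["accuracy"], 3))  # 175 / 190 -> 0.921

# Counts from the second numeric column under the same assumption.
second = confusion_metrics(tp=80, fn=10, fp=15, tn=85)
print(round(second["accuracy"], 3))  # 165 / 190 -> 0.868
```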
Discussion
The firefly feature selection method successfully identified the most relevant features,
including age, sex, blood pressure, and family history, which are consistent with clinical
risk factors for heart disease. The selected features improved the performance of the
The improved performance of the system can be attributed to the ability of the firefly
feature selection to identify the most informative features and eliminate the redundant or
irrelevant ones. This reduces the dimensionality of the data and improves the model's
generalizability.
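The idea behind firefly feature selection can be sketched as follows. This is one common binary variant and not necessarily the configuration used in this project: each firefly holds a position in [0, 1] per feature, a feature is kept when its position exceeds 0.5, and dimmer fireflies move toward brighter ones. The threshold, the parameter values (`alpha`, `beta0`, `gamma`), and the toy fitness function are all assumptions.

```python
import math
import random


def firefly_feature_selection(n_features, fitness, n_fireflies=10, n_iterations=30,
                              alpha=0.2, beta0=1.0, gamma=1.0, seed=0):
    """Return (best_mask, best_score); best_mask[j] is True when feature j is kept.

    fitness(mask) is caller-supplied (for example, the validation accuracy of a
    classifier trained on the kept features); higher values are better.
    """
    rng = random.Random(seed)

    def to_mask(position):
        return [value > 0.5 for value in position]

    swarm = [[rng.random() for _ in range(n_features)] for _ in range(n_fireflies)]
    brightness = [fitness(to_mask(position)) for position in swarm]
    best = max(range(n_fireflies), key=brightness.__getitem__)
    best_mask, best_score = to_mask(swarm[best]), brightness[best]

    for _ in range(n_iterations):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if brightness[j] <= brightness[i]:
                    continue  # only move toward brighter fireflies
                dist_sq = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                beta = beta0 * math.exp(-gamma * dist_sq)  # attraction decays with distance
                # Attraction step plus a small random walk, clipped to [0, 1].
                swarm[i] = [min(1.0, max(0.0, a + beta * (b - a) + alpha * (rng.random() - 0.5)))
                            for a, b in zip(swarm[i], swarm[j])]
                brightness[i] = fitness(to_mask(swarm[i]))
                if brightness[i] > best_score:
                    best_mask, best_score = to_mask(swarm[i]), brightness[i]
    return best_mask, best_score


# Toy fitness (hypothetical): features 0 and 2 are informative, every kept feature has a cost.
def toy_fitness(mask):
    return 1.0 * mask[0] + 1.0 * mask[2] - 0.1 * sum(mask)


mask, score = firefly_feature_selection(n_features=5, fitness=toy_fitness)
print(mask, score)
```

In a real pipeline, `fitness` would train the logistic regression classifier on the features selected by the mask and return a validation score, so that the search rewards small, informative feature subsets.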
The results of this study have implications for the development of clinical decision
prediction accuracy.
4.3 RESULT OF THE DEVELOPED HEART DISEASE PREDICTION SYSTEM
Logistic regression
regression is mainly used for prediction and also for calculating the probability of success
Index page
Heart disease Predictor
Heart disease analysis Result
Heart Disease predictor Analysis page
Sex: This option allows the user to select his or her gender; the genders are
Resting Blood Pressure: This module allows the user to select the blood pressure of
the patient, which ranges from 94 mmHg to 200 mmHg and is measured while the patient is at
rest.
Thallium stress test: this option allows the user to pick the stressing rate and
CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATION
5.1 SUMMARY
Machine learning techniques, a type of artificial intelligence, are being used in the
researchers are exploring the level of uncertainty that arises when using machine
learning algorithms to anticipate the disease. The most significant task in
health data analysis is the prediction of cardiac disease from clinical data. The
prediction helps physicians make accurate decisions regarding patients' health. The
proposed model used data collection, data preprocessing, and data transformation
methods to train the model. This model exploited feature selection methods: filter and
established in this work. The work was designed with the help of the logistic regression
machine learning classifier and the firefly algorithm. The dataset used in the study
comprises a number of patients affected by heart disease and also includes related
features to perform the prediction. The relevance of features in this dataset was
determined by feature selection procedures. These are the methods that were used to
resolve the issue. Relevant features are used in the classifier model to perform evaluation
measures were also used to evaluate the identification system's performance. According
to Table 2, the firefly classifier produces the best results in both feature selection
algorithms when compared with other classifiers. When compared to other ways of filter
feature selection, firefly produces better results. Furthermore, irrelevant features harm the
groundbreaking part of this research was the use of feature selection algorithms to
simultaneously reducing the computation of the diagnosis process. Other feature selection
Development of heart disease prediction system using firefly feature selection and
the queue method, relieves doctors of the stress of diagnosing patients. The system can
also help health care centers predict whether a patient has heart disease or not.
The heart disease prediction system can also be recommended for individuals to assess
themselves at home without visiting a clinic. This helps them predict whether they have
heart disease or not, and check the functionality of their heart due
It can also be recommended to other researchers who want to continue the development of
disease prediction systems using logistic regression and the firefly algorithm. Researchers
can also make use of other algorithms to implement a new system based on this
design.
REFERENCES
Amin Ul Haq, Jian Ping Li, Muhammad Hammad Memon, Shah Nazir, Ruinan Sun,
https://doi.org/10.1155/2018/3860146 .
models for heart disease prediction using feature selection and PCA.
Jabbar MA, Deekshatulu BL, Chandra P (2016) “Prediction of heart disease using
https://doi.org/10.1007/978-3-319-28031-8_16.
10.1007/s10278-018-0145-0.
Costelloe, Colleen M., and John E. Madewell. 2021. “An Approach to Undiagnosed
10.1053/j.sult.2020.08.014.
Eweje, Feyisope R., Bingting Bao, Jing Wu, Deepa Dalal, Wei-hua Liao, Yu He,
Harrison X. Bai, and Lisa States. 2021. “Deep Learning for Classification
10.1016/j.ebiom.2021.103402.
Gemescu, Ioan N., Kolja M. Thierfelder, Christoph Rehnitz, and Marc-André Weber.
He, Yu, Ian Pan, Bingting Bao, Kasey Halsey, Marcello Chang, Hui Liu,
Jiang, Liangxiao, Lungan Zhang, Chaoqun Li, and Jia Wu. 2019. “A Correlation-
10.1109/TKDE.2018.2836440.
Liu, Renyi, Derun Pan, Yuan Xu, Hui Zeng, Zilong He, Jiongbin Lin, Weixiong
Zeng, Zeqi Wu, Zhendong Luo, Genggeng Qin, and Weiguo Chen. 2022.
Methods.” 12.
Palmerini, Emanuela, Piero Picci, Peter Reichardt, and Gerald Downey. 2019.
10.1177/1533033819840000.
Suster, David, Yin Pun Hung, and G. Petur Nielsen. 2020. “Differential Diagnosis of
Tao, Yuzhang, Xiao Huang, Yiwen Tan, Hongwei Wang, Weiqian Jiang, Yu Chen,