Machine Learning-Based Cardiovascular Disease Detection Using Optimal Feature Selection
ABSTRACT Cardiovascular disease (CVD) is a prevalent and serious condition and a major cause of
mortality worldwide. According to the World Health Organization (WHO), in 2022, CVD claimed the
lives of approximately 19.1 million people, accounting for 33% of global fatalities. ECG is widely used
for automatic detection of CVD using traditional machine learning; however, selecting optimal features
is usually difficult. To address this issue, a scalable machine learning-based architecture is proposed for
early CVD detection based on optimal feature selection. This architecture aims to revolutionize healthcare
by enabling timely diagnosis and treatment, reducing CVD-related fatalities. Comprising data collection,
storage, and processing components, the system employs machine learning classifiers to predict patients’
heart conditions. Features are first extracted from ECG signals; feature selection techniques such as
FCBF, MrMr, and Relief, along with PSO-based optimization, are then used to select optimal features.
Extra Tree and Random Forest classifiers trained on the selected features each achieved an accuracy
of 100%. Furthermore, the proposed method is compared with state-of-the-art methods on both small
and large datasets. The proposed system holds potential to revolutionize patient
care and substantially lower CVD-related mortality, enhancing the quality of life for affected individuals.
In summary, this architecture offers a promising solution to the pressing issue of CVD and paves the way
for advanced healthcare systems.
© 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
VOLUME 12, 2024
T. Ullah et al.: Machine Learning-Based Cardiovascular Disease Detection Using Optimal Feature Selection
In a comparable way, other nations around the world are dealing with the issues associated with CVD. According to recently published studies, chronic diseases are responsible for 86.5% of deaths in China [3].

Cardiovascular diseases (CVD) have emerged as a major global cause of mortality, claiming a substantial number of lives annually. The underlying pathology of cardiac diseases is the inability of the heart to effectively circulate an adequate amount of blood to various organs. This condition poses a significant threat to life and is recognized as one of the most lethal chronic diseases worldwide. By affecting the heart or blood vessels, CVD disrupts the normal supply of blood, impairing the proper functioning of essential body organs [4].

Cardiovascular disease (CVD) is a leading cause of mortality worldwide, affecting both developed and developing nations. According to the World Health Organization (WHO), in 2022, CVD claimed the lives of approximately 19.1 million people, accounting for 33% of global fatalities. In the United States alone, CVD causes the death of around 647,000 individuals each year. Similarly, in Pakistan, CVD claims the lives of approximately 200,000 people annually, with mortality rates on the rise. The European Society of Cardiology (ESC) estimates that 26.5 million Europeans are currently living with CVD, and each year, 3.8 million new cases are diagnosed. Shockingly, 50% to 55% of CVD patients do not survive beyond a year, placing a significant burden on healthcare systems. Moreover, approximately 4% of healthcare spending is allocated to the treatment of CVD patients [5].

The symptoms associated with CVD encompass physical fragility, shortness of breath, inflammation in the feet, lethargy, and other related manifestations [6].

Cardiovascular disease (CVD) is a significant global health issue that can be attributed to various risk factors including hypertension, high cholesterol levels, smoking, sedentary lifestyle, and obesity. CVD encompasses a range of conditions such as congenital heart disease, congestive heart failure, and cardiac arrhythmias. Traditional approaches to predicting and diagnosing CVD were complex and often led to complications that impacted individuals’ overall well-being. This disease remains the leading cause of mortality in both developed and developing countries, necessitating effective preventive and diagnostic measures [7].

In developing countries, clinicians face challenges in accurately diagnosing and treating cardiovascular disease (CVD) due to limited resources. Computer technology and machine learning have been introduced as aids in clinical decision-making, enabling early detection and assessment of CVD risk. Medical data mining technologies can extract useful information from the massive amounts of data in healthcare, which is vital due to the complexity of medical data. Our technology for CVD prediction could potentially save millions of lives by enabling more people to receive treatment faster [8].

II. TRADITIONAL RISK ASSESSMENT SYSTEMS
Traditional risk assessment systems have certain limitations in terms of their accuracy and efficiency, despite the fact that early detection and prompt intervention can considerably improve patient outcomes. The application of machine learning algorithms has shown some encouraging results in predicting the risk of cardiovascular disease, and these algorithms have the potential to improve clinical decision-making and individualized care. However, there is a dearth of sufficient knowledge, empirical data, and competent research studies in this field, underscoring the need for additional exploration into the topic. By addressing this research gap, the proposed project hopes to contribute to the creation of prediction models for cardiovascular disease that are more accurate and efficient, which would eventually improve the outcomes for patients and reduce the costs of healthcare.

In a previous study, MrMr, FCBF, LASSO, and Relief were used to select features. However, the outcomes were neither satisfactory nor of the required quality. In addition to the methods listed above, we used ANOVA with join combinations. The authors of that study predicted cardiovascular disease (CVD) with the help of a classifier; in the current research work, however, we optimized the findings of the model with the help of an optimization approach called particle swarm optimization (PSO), as described in the methodological section of this research study. This will save a lot of effort and produce empirical data to predict cardiovascular disease (CVD), and we utilized four different machine learning techniques to do this.

III. RELATED WORK
Accurate prediction of cardiac issues is crucial for providing optimal care to patients. Machine learning (ML) techniques offer a promising approach to gaining a deeper understanding of heart disease symptoms and improving treatment strategies. The study in [1] evaluated six ML models using a dataset comprising 74 features. The results demonstrated high accuracy rates of 98.7%, 99.0%, and 99.4% for the Cleveland, Hungarian, and Cleveland-Hungarian (CH) datasets, respectively, using the combined approach of chi-square and principal component analysis (CHI-PCA) with random forests (RF). Through the analysis, the authors identified significant features such as cholesterol levels, maximum heart rate, chest discomfort, ST depression characteristics, and coronary artery structure using the Chi-Square Selector. Their experiments revealed the powerful combination of chi-square and PCA in predicting cardiac issues [1]. Although this work achieves high performance, it cannot be generalized, as the model is applied to a limited dataset.

Deep learning has also been applied to ECG signal classification. Stacked de-noising autoencoders (SDAEs) with sparsity constraints extracted meaningful features from raw ECG data. A DNN was created using these features and a soft-max regression layer. Experts labeled the most relevant and uncertain ECG beats during interaction to update network weights. DNN
posterior probabilities and confidence measures classified ECG beats. Multiple databases showed improved accuracy, reduced expert interaction, and faster online retraining. The proposed method may improve ECG classification and cardiac disease diagnosis [2]. The performance of this model is good; however, its main focus is on the use of the Python language in heart disease detection.

A recent study demonstrated the benefits of automated classification methods in aiding physicians’ treatment decisions for cardiac arrhythmias. The study focused on utilizing probabilistic n-grams to classify these arrhythmias and compared the performance of five unsupervised dimensionality reduction (DR) methods: principal component analysis (PCA), fast independent component analysis (fastICA) with tangential, kurtosis, and Gaussian contrast functions, kernel PCA (KPCA) with polynomial kernel, hierarchical nonlinear PCA (hNLPCA), and principal polynomial analysis (PPA). Notably, employing the fastICA DR algorithm with a tangential contrast function on at least 10 dimensions, along with a PNN classifier at a spread parameter of 0.4, resulted in a significant improvement in F score (99.83%). However, it is important to note that the calculation of low-dimensional mapping using hNLPCA or KPCA is more time-consuming. Additionally, PPA demonstrated a 10% higher effectiveness compared to PCA. These findings highlight the potential of utilizing specific DR methods in conjunction with a PNN classifier for accurate cardiac arrhythmia classification [3]. The study was conducted on a relatively small dataset of 100 ECG recordings, which limits the generalizability of the findings. The study only evaluated five dimensionality reduction methods and one classifier. It is possible that other methods and classifiers could achieve even better performance.

Cardiac Resynchronization Therapy (CRT) is a rhythm treatment for people with heart failure that has been used for a long time. The New York Heart Association (NYHA) rating is often used to assess how well a patient is responding to CRT. Recording a patient’s NYHA class regularly over time in an electronic health record (EHR) can help doctors learn more about how heart failure progresses and how well CRT works. However, NYHA class is rarely kept as structured data in an EHR. Instead, this kind of information is often written down in unstructured clinical notes. The authors investigated the feasibility of using NLP to categorize NYHA class from clinical notes. They analyzed 6,174 hospital-specific clinical records that had been linked to NYHA class diagnostic numbers. The results of machine learning-based methods were comparable to those of rule-based ones. Support vector machines using n-gram features performed best (93.0% F-measure). The study does not provide details of the feature selection method; in addition, it is trained and validated on a small dataset [4].

The accurate prediction and diagnosis of heart disease has become a critical challenge for healthcare systems worldwide. To reduce the mortality rate associated with heart diseases, it is crucial to develop a fast and efficient method of detection. Data mining techniques and machine learning methods play a vital role in this domain. Researchers are actively engaged in accelerating their efforts to develop software using machine learning algorithms that can assist doctors in making accurate predictions and diagnoses of heart disease. The primary objective of this study was to leverage machine learning techniques to determine the presence of heart disease in patients. Graphical representations of data were employed to evaluate the performance of various machine learning algorithms [5]. The article presents a significant overview of the literature on ML models for CVD; however, it is limited to work done until 2021.

To identify coronary artery disease (CAD) patients, HRV data from a standard library and self-recorded data from healthy people were used. The HRV time series was decomposed into four levels, and 62 features were extracted from it using nonlinear methods. Numerical tests were done, and the suggested method of using the ten most important entropy features achieved close to 100% detection accuracy. HRV signals can be used to find and study CAD patients by decomposing them into level 4 and level 3 subspaces [6]. This method cannot be generalized to all cases of heart disease, as it is applied to a very small dataset.

The study in [7] aimed to improve fuzzy clustering methods by automatically calculating feature weights and eliminating irrelevant feature components. The authors proposed a feature-reduction FCM (FRFCM) approach, which utilized a learning schema to optimize parameters and reduce irrelevant features. FRFCM outperformed existing feature-weighted FCM algorithms, as demonstrated through experimental results and comparisons on numerical and real datasets. The findings highlight the efficacy and practicality of FRFCM in enhancing fuzzy clustering techniques. Overall, the study provides some promising results on the use of machine learning for early and accurate heart disease prediction. However, it also suffers from overfitting and is trained on features which are not ranked for optimality.

A pre-processing technique is used to improve ECG signal classification accuracy by removing noise from raw data. Classifiers (KNN, Naive Bayes, and Decision Tree) are tested for detecting normal and abnormal heart rhythms, with the Decision Tree performing best. This technique proves effective in accurately diagnosing heart-related diseases. Machine learning is applied to predict and understand heart disease symptoms, using feature selection to reduce dimensionality. The CHI-PCA with RF approach achieves high accuracies on different datasets, identifying relevant features. Chi-square with PCA outperforms other classifiers, while PCA with raw data yields poorer results. The study’s drawbacks include its limited dataset size, a lack of investigation into feature selection, and a lack of comparison with state-of-the-art approaches. Ethical considerations in healthcare research are also not thoroughly explored in the work [9].

Heart disease has been a subject of significant research due to its detrimental impact on human health. It is a leading cause of death in the United States. Data mining has emerged as a crucial technique for analyzing healthcare data, enabling
easier interpretation of medical records. The paper focuses on predicting heart disease using supervised machine learning algorithms like support vector machines, k-nearest neighbors, and naive Bayes. The implementation of these algorithms is carried out using the R programming language. The accuracy metric is utilized to assess the performance of the algorithms, and the findings of the analysis are discussed; however, there is insufficient discussion of the model’s generalizability and of ethical considerations in heart disease prediction [10]. The method is validated on a relatively small dataset and thus cannot be generalized; moreover, the use of different optimization techniques is not analyzed. Accurate prediction of heart disease is crucial for saving lives, as misdiagnosis can have severe consequences. This research focuses on analyzing the UCI Machine Learning Heart Disease dataset using a range of machine learning and deep learning techniques, comparing the obtained results and analysis. The dataset consists of 14 key features, which are thoroughly examined. Evaluation is performed using accuracy metrics and a confusion matrix, confirming promising outcomes. To enhance accuracy, irrelevant features are eliminated using Isolation Forest, and data standardization techniques are applied. The integration of this research with multimedia technology, particularly mobile devices, is also explored. By leveraging deep learning, an impressive accuracy of 94.2% is achieved [11]. The researchers aimed to predict the likelihood of heart disease based on medical criteria. They employed machine learning techniques such as logistic regression and K-Nearest Neighbors to classify individuals with cardiovascular disease. The proposed model demonstrated high flexibility and outperformed previous classifiers. It effectively alleviated concerns by accurately predicting the likelihood of identifying heart disease; however, the ambitious plans for future work are not accompanied by a discussion of their feasibility, and the claim that MICE is the best algorithm is not supported by any comparisons. It is imperative to consider these characteristics in order to improve the system’s dependability in practical medical situations [12].

Another work used machine learning algorithms and Python programming to identify heart disease. Heart disease has become a common and dangerous disease in the last few decades. According to the authors, it is caused by fat, and people get this disease when their bodies have too much pressure. The authors looked at a dataset with 13 attributes and 270 individual data points to study how well patients did. The main goal of the paper is to improve the detection of heart disease using algorithms whose target output indicates whether or not a person has heart disease; however, the investigation of alternative data mining methods was constrained, and there was a lack of comprehensive examination of the computing efficiency and interpretability of the selected algorithms [13].

In another study, the researchers developed an intelligent medical system utilizing machine learning to detect heart issues and aid in accurate diagnoses. They addressed information gaps between the Framingham dataset and the publicly available UCI Heart Disease dataset. By applying machine learning techniques, they sought to identify the most effective approach for detecting cardiovascular diseases. The system’s performance was evaluated using accuracy, sensitivity, F-measure, and precision, highlighting the superiority of the proposed strategy compared to similar models. However, one problem with this survey paper is that it does not go into enough detail about the techniques and algorithms it looks at. It also only gives a general outline of automation in cardiovascular disease prediction without looking into each method in more detail [14]. A dataset from India was utilized to diagnose heart disease, and the performance of an automatic diagnosis system was evaluated based on classification accuracy, sensitivity, and specificity. The findings indicated that the Sequential Minimization Optimization (SMO) learning method in Support Vector Machines (SVM) outperformed other approaches for medical disease diagnosis applications [15].

Classification, data mining, machine learning, and deep learning algorithms for predicting cardiovascular diseases are compared and reported on. The survey is broken down into three sections: cardiovascular disease classification and data mining techniques, cardiovascular disease prediction using machine learning and deep learning models. In addition to collecting and reporting the accuracy metrics, dataset, and instruments used for prediction and classification, the survey also compiles and publishes the performance metrics utilized for reporting the accuracy. However, it ignores overfitting and high-dimensional data issues, and the evaluation recommends wide research without specific solutions to cardiovascular disease prediction problems [16].

An effective machine learning system for heart disease diagnosis uses Support Vector Machine, Logistic Regression, Artificial Neural Network, K-nearest neighbors, Naïve Bayes, and Decision Tree. The study introduces Relief, Minimal Redundancy Maximal Relevance, Least Absolute Shrinkage Selection Operator, Local Learning, and a novel Fast Conditional Mutual Information algorithm to improve classification accuracy and execution time. The suggested approach (FCMIM-SVM) shows potential accuracy and feasibility for healthcare deployment using leave-one-subject-out cross-validation. The study’s weaknesses include a lack of discussion on dataset biases, system generalizability, and feature interpretability and clinical relevance [17].

Given the rising prevalence of cardiovascular disease, particularly among the younger population, it is imperative to adopt a proactive approach in the early detection of symptoms in order to mitigate future complications such as strokes. The feasibility of conducting costly ECG tests on a daily basis for the general population may be questionable. Therefore, it is imperative to establish a consensus on a reliable and readily available approach to predict the risk of heart disease. The objective of this work is to create a nursing framework assistant that can identify the risk of heart disease by utilizing essential parameters such as age, gender, and heart rate.
The incorporation of neural codes into the learning process boosts the dependability and resilience of the predictive model, providing a viable method for timely evaluation of potential risks. A potential limitation of this technique is the possibility of oversimplifying the prediction of heart disease risk by relying on a restricted range of indications, which may result in the omission of other pertinent aspects [18].

Many sources, including wearable sensor devices in Internet of Things health monitoring and streaming systems, create an unprecedented amount of continuous data. In healthcare, streaming big data analytics and machine learning can detect early cardiac problems cost-effectively. A sophisticated large-scale distributed computing platform, Apache Spark, is used to predict cardiac illness in real time. Spark’s in-memory computations match streaming data events, optimizing machine learning. The system comprises streaming processing and data storage/visualization. The former applies a classification model for real-time heart disease prediction using Spark MLlib with Spark streaming, while the latter saves the large volume of generated data in Apache Cassandra. However, guaranteeing data privacy and security is a potential obstacle, and a comparative analysis with other real-time prediction systems is absent [19].

An improved machine learning method for heart disease risk prediction used mean-based splitting to randomly partition the dataset. A homogeneous ensemble is formed by building classification and regression tree (CART) models for each subgroup using an accuracy-based weighted ageing classifier ensemble. This weighted ageing classifier ensemble (WAE) adjustment optimizes performance. Classification accuracy on the Cleveland and Framingham datasets is 93% and 91%, respectively, outperforming previous machine learning techniques and similar scientific publications. The ensemble learning method’s higher effectiveness in predicting heart disease risk is supported by receiver operating characteristic curves. The study lacks additional dataset validation despite excellent classification accuracy, limiting the implications for different demographics and healthcare settings [20]. Early diagnosis is essential for cardiovascular disease, the leading cause of death worldwide. The article introduced a Machine Intelligence Framework for Heart Disease Diagnosis (MIFH) using Factor Analysis of Mixed Data (FAMD) to extract features and train machine learning models using the UCI heart disease Cleveland dataset. Holdout-validated MIFH exceeds recent approaches in accuracy, helping healthcare professionals and radiologists diagnose heart patients; however, the research does not investigate multi-class classification for heart disease, which is important for medical institution conditions [21].

The study presented a novel ensemble method applying majority voting to predict the occurrence of heart disease using cost-effective medical tests conducted at community healthcare facilities. The objective is to enhance the level of confidence and precision in physicians’ diagnostic abilities by using authentic patient data. The proposed model utilizes a hard voting ensemble technique, which combines multiple machine learning models to make predictions. This ensemble approach has been found to achieve a notable accuracy rate of 90%; however, a thorough examination of the interpretability and clinical significance of the ensemble model is lacking [22].

Over the past two decades, artificial intelligence (AI) has grown rapidly in computer engineering and has many applications in computer vision, medicine, philosophy, psychology, and robotics. Machine Learning, a subtype of AI, has shaped manufacturing automation, biometric recognition, medical diagnosis, and data science. Even though machine learning is used daily, cardiovascular diseases (CVDs) are a global health issue. The Reliable Boolean Machine Learning Algorithm (RBMLA) is a unique heart disease prediction algorithm that emphasizes the requirement for a trustworthy and accurate system for timely identification and diagnosis. The proposed RBMLA has 86% accuracy, indicating its potential for real-time and new test data prediction. Due to limits and changes in the supervised machine learning algorithms it relies on, the proposed RBMLA cannot attain 100% accuracy and ideal performance [23].

The study used multi-layer perceptron (MLP) and K-nearest neighbor (K-NN) machine learning algorithms to diagnose cardiovascular disease (CVD) early and automatically. Performance optimization through outlier removal and null value handling gave the MLP model a detection accuracy of 82.47% and an area-under-the-curve value of 86.41%, outperforming the K-NN model. The MLP model proposed for automatic CVD detection may also work for other diseases. One of the drawbacks of this study is the absence of an in-depth investigation of the probable factors contributing to the observed performance disparity between the MLP and K-NN models [24].

The research aimed to examine different computational intelligence techniques, including Logistic Regression, Support Vector Machine, Deep Neural Network, Decision Tree, Naïve Bayes, Random Forest, and K-Nearest Neighbor, in their ability to predict coronary artery heart disease. A thorough analysis of performance measures was conducted. The deep neural network demonstrated a remarkable accuracy rate of 98.15%, along with sensitivity and precision values of 98.67% and 98.01%, respectively. The proposed methodologies showed superior performance in comparison to state-of-the-art studies in cardiac disease prediction, as evidenced by comparative assessments. However, the work does not address imbalanced dataset issues or biases [25].

The proposed work analyzes features to create an effective system using enormous data. The study highlights the need to evaluate medical and pathological data from healthcare providers. The proposed approach is tested on whole and reduced feature sets for classifier precision and implementation time. Machine learning, notably the Decision Tree and Ada-Boost algorithms, helps medical professionals diagnose
cardiac patients of all ages. This study incorporates machine learning techniques and critical features to understand cardiac disease; however, the Decision Tree algorithm overfits, and while the Ada-Boost algorithm optimizes output, the study is limited by its use of simulated data, suggesting the need for future validation on real-world datasets and exploration of a wider range of machine learning techniques for heart disease prediction [26].

In another work, machine learning optimization problems were first described. The authors then presented the basic principles and developments of well-known optimization techniques. Following this, they provided a brief overview of how optimization techniques have been used and developed in a variety of well-known ML applications. In conclusion, the authors outlined certain difficulties and unanswered questions regarding machine learning optimization [27].

Despite rich research activity in the field, this research suggests that the analysis of different feature selection methods combined with machine learning algorithms for predicting heart diseases has been understudied in the previous two decades, and the current body of literature lacks sufficient information and research studies to adequately address this gap in knowledge. Therefore, this study presents a comprehensive analysis of different feature selection and optimization methods. This research aims to bridge a gap in the existing body of knowledge by investigating various feature selection and machine learning algorithms for the prediction of cardiovascular diseases.

IV. METHODOLOGY
A detailed schematic representation of the suggested research framework’s design is depicted as a flow chart in Figure 1. This diagram provides a thorough overview of the structure and components of the proposed framework.

FIGURE 1. Flow chart of the proposed system for cardiovascular disease detection.

A. DATASET COLLECTION
The accuracy of classification metrics is heavily dependent on the quality of the dataset used for statistical predictions. For our research, we have picked the following datasets to both highlight the significance of the dataset and to assess its generalizability.

The first dataset used for CVD, the Hungarian Heart Disease Dataset (HHDD) (Small Dataset), is obtained from the UCI Machine Learning Repository and Kaggle. It is an older and standard dataset developed in 1988. It comprises multiple databases, including those from Cleveland, Hungary, Switzerland, and Long Beach V. The dataset consists of 14 attributes and a total of 1025 instances. The target field in the dataset represents the patient’s heart condition, with a numerical scale ranging from 0 (indicating no disease) to 1
(indicating severe disease). The second dataset used in this study is the Kaggle (Large Dataset). It is derived from the Behavioral Risk Factor Surveillance System (BRFSS), conducted by the Centers for Disease Control (CDC), which involves annual phone surveys of over 400,000 Americans. The surveys gather information on health-related behaviors, chronic conditions, and the use of preventive services. This dataset specifically focuses on the 2015 BRFSS, containing 253,680 responses that have been cleaned and categorized into two groups based on the presence or absence of heart disease. It should be noted that there is a significant imbalance in the classes, with 229,787 individuals categorized as not having heart disease and 23,893 individuals having a history of heart disease.

information of redundancy. It then selects features with high significance and lower redundancy to the target variables [31]. Relief assigns feature weights on the basis of each feature's ability to differentiate between classes. Using these weights, it selects the optimal features, which are more informative and have no redundancy [32]. ANOVA, on the other hand, is a statistics-based method which assesses the differences among different classes. It selects optimal features which have statistical significance on the target variable [33].

These are standard feature selection methods used in machine learning. Alongside these feature selection methods, we proposed a novel Particle Swarm Optimization (PSO) for optimal feature selection.
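As an illustration of the ANOVA-based ranking mentioned above, the following is a minimal sketch that scores each feature with a one-way ANOVA F statistic and keeps the highest-scoring ones. It is not the implementation used in this work: the function names `anova_f` and `select_k_best` and the toy data are assumptions made for illustration, and it covers only the ANOVA filter, not FCBF, MrMr, Relief, or the PSO step.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    if k < 2:
        raise ValueError("ANOVA needs at least two classes")
    grand_mean = mean(values)
    group_means = {label: mean(g) for label, g in groups.items()}
    # Variation of the class means around the overall mean.
    ss_between = sum(
        len(g) * (group_means[label] - grand_mean) ** 2
        for label, g in groups.items()
    )
    # Variation of the samples around their own class mean.
    ss_within = sum(
        (v - group_means[label]) ** 2
        for label, g in groups.items()
        for v in g
    )
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(X, y, k):
    """Return the indices of the k highest-scoring features and all scores."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [3.0, 4.0], [3.1, 5.0], [2.9, 3.0]]
y = [0, 0, 0, 1, 1, 1]
selected, scores = select_k_best(X, y, k=1)
print(selected)  # -> [0]
```

A larger F value means the class means of that feature differ more relative to the spread within each class, which is the sense in which the feature is considered statistically significant for the target variable.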
for Gradient Boosting is as follows [38].

$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma) \tag{2}$$

In the above equation
• $F_0(x)$: Represents a function $F_0$ that takes input $x$.
• $\arg\min$: Denotes the argument (input) that minimizes the following expression.
• $\sum_{i=1}^{n}$: This is the summation symbol, indicating that we are summing up the terms that follow for each $i$ from 1 to $n$.
• $L(y_i, \gamma)$: A loss function that measures the difference between the true value $y_i$ and a predicted or estimated value $\gamma$.
In this study it is used to automatically detect CVD from the optimal features.

2) EXTRA TREE CLASSIFIER
The Extra Trees Classifier is an ensemble machine learning algorithm used for classification problems. In this variant of
Random Forest, each decision tree is built using a random subset of features and random split points at each node. The
main advantage of the Extra Trees Classifier is its fast training time, as the decision trees are grown using random
features and split points at each node [39]. The equation for the Extra Tree Classifier is as follows.

$$\text{Entropy}(S) = \sum_{i=1}^{e} -p_i \log_2(p_i) \tag{3}$$

In the above equation
• $\text{Entropy}(S)$: This represents a measure of uncertainty or randomness in a system, often denoted as $S$.
• $\sum_{i=1}^{e}$: This is the summation symbol, indicating that we are summing up the terms that follow for each possible outcome $i$ from 1 to $e$.
• $-p_i \log_2(p_i)$: This is the contribution of each possible outcome to the overall entropy. It consists of two parts:
  • $p_i$: The probability of the $i$-th outcome.
  • $\log_2(p_i)$: The logarithm base 2 of the probability.
In this study it is used to automatically detect CVD from the optimal features. Its performance is also compared with the
state of the art ML models.

3) RANDOM FOREST
Random Forest is a type of ensemble learning in which numerous decision trees are built and their predictions are averaged
out. Random Forest also has good interpretability, as it can provide feature importance scores that indicate the relative
importance of each feature in the final prediction. The equation for Random Forest is as follows [40].

$$\mathrm{RFfi}_i = \sum_{j \in \text{all trees}} \mathrm{normfi}_{ij} \tag{4}$$

In the equation
• $\mathrm{RFfi}_i$: This represents the Random Forest importance value associated with feature index $i$.
• $\sum_{j \in \text{all trees}}$: This is the summation symbol, indicating that we are summing up the terms that follow for each $j$ in the set of all trees.
• $\mathrm{normfi}_{ij}$: This term represents the normalized value of fi associated with index $i$ in the $j$-th tree.
It is utilized in this research to detect CVD automatically based on the optimal features chosen by the proposed models.
It is also evaluated in comparison to the most recent machine learning models.

4) LOGISTIC REGRESSION
Logistic Regression is a statistical method used for binary classification problems, where the goal is to predict the
probability of an instance belonging to one of two classes (e.g. yes/no, true/false). It is a type of generalized linear
model that uses a logistic function to model the relationship between the input features and the output class. The
logistic regression model is expressed as a mathematical equation of the form [41].

$$P = \frac{e^{a+bx}}{1 + e^{a+bx}} \tag{5}$$

In the above equation
• $P$: This represents the probability of an event occurring. In logistic regression, it is often the probability of a binary outcome being 1.
• $e$: This is the mathematical constant approximately equal to 2.71828.
• $a$ and $b$: These are coefficients that are determined during the training of the logistic regression model. They influence the shape and position of the logistic curve.
• $x$: This is the input variable, and the logistic function is modeling how it influences the probability $P$.

H. PERFORMANCE EVALUATION
The performance evaluation of each algorithm in this study is done using various widely used metrics such as the confusion
matrix, accuracy, precision, sensitivity, specificity, area under the curve (AUC), F1-score, and Matthews correlation
coefficient (MCC). These metrics provide a comprehensive assessment of the algorithms' performance and allow for a thorough
analysis of their effectiveness in predicting and diagnosing heart disease [28].

1) ACCURACY
The model's overall performance can be measured by calculating its accuracy using the formula in equation 6 [28]. In
equations (6)-(10), TP is the number of true positives, TN is the number of true negatives, FP is the number of false
positives, and FN is the number of false negatives.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$
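Equation (6), together with the other confusion-matrix metrics listed in this section, can be computed directly from the four counts TP, TN, FP, and FN. The following is a minimal Python sketch. The function name `confusion_metrics` and the example counts are illustrative assumptions and are not results from this study; the formulas for precision, sensitivity, specificity, F1-score, and MCC are the standard textbook definitions rather than equations reproduced from this paper.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total                      # equation (6)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On a strongly imbalanced dataset, accuracy alone can look high even when the minority class is poorly detected, which is one reason to report sensitivity, specificity, and MCC alongside it.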
MrMr, FCBF, and Relief along with PSO on Extra Tree Classifier and Random Forest, a significant improvement compared to the
state of the art, which is less than 96%. Additionally, we introduced the ANOVA selection technique, resulting in improved
accuracy in our research compared to the state of the art.
The comparison highlighted in Figure 4 shows that the strong performance of the Extra Tree and Random Forest models is due
to the optimal features provided by our proposed methods. Notably, the MrMr, FCBF, and Relief selection techniques achieved
the highest accuracy of 100%. On the contrary, the Lasso technique showed the poorest performance among the methods
evaluated because Lasso's L1 regularization penalty tends to favor features with large absolute values, potentially
overlooking features that have smaller values. It also mostly suffers from overfitting, which leads to low validation
accuracy.
FIGURE 2. Distribution of numerical features.
TABLE 2. Accuracy of all models on small dataset.
FIGURE 5. Pearson correlation between all the features for Large Data.
TABLE 3. Overall results of all classifiers with confusion matrix on small dataset.
TABLE 4. Results for large data using the selected features.
TABLE 5. Overall results of all classifiers with confusion matrix on large dataset using the selected features.
dataset. As shown in Tables 4 and 5, this reduced feature subset from the CVD dataset resulted in the classification
performance for each model that is also plotted in Figure 6. The analyses conducted highlight the effectiveness and
intricacies of each classifier in our research work using the reduced feature set. In the above analysis, we found that
two models, Extra Tree and Random Forest, achieved the highest performance with the limited number of features. Among the
feature selection techniques, FCBF and Relief achieved accuracies of 78% and 77%, respectively. On the other hand, MrMr
and ANOVA had the lowest performance, with an accuracy of 70% each on the large dataset. Table 5 presents the overall
outcomes of the different models together with their corresponding confusion matrices, evaluated on the larger dataset.
It provides a detailed overview of the analyses conducted and of the interplay between each classifier and its
corresponding results. Moreover, employing the reduced feature set during the training of classification models led to a
decrease in computational iterations per second (it/s). These findings underscore the impact of feature selection
techniques, demonstrating that they not only reduce the dimensionality of the feature space but also enhance the
performance of ML models in various aspects.
This proposed framework holds promise for improving the early detection and diagnosis of cardiovascular disease, a
significant global public health concern. This study also presents a comparative analysis of feature selection techniques
on a small dataset and evaluates their impact on the performance of machine learning algorithms. To assess the
performance, we considered various evaluation metrics such as accuracy, precision, sensitivity, specificity, AUC,
F1-score, and MCC. The findings indicated that MRMR, FCBF, and Relief demonstrated superior performance compared to the
other techniques in terms of accuracy. Notably, the Extra Tree Classifier and Random Forest achieved a remarkable accuracy
of 100% when using features selected by MRMR, FCBF, and Relief. This study underscored the significance of feature
selection in enhancing the performance of machine learning algorithms and emphasized the importance of selecting an
appropriate technique based on the dataset's size and characteristics. The large dataset used in this research work
comprised 253,680 records and 22 columns, predominantly consisting of categorical features. Imbalanced data was observed,
and to address this issue, oversampling techniques were employed to balance the dataset. The study compared the
performance of various feature selection techniques, including MRMR, FCBF, LASSO, ANOVA, and RELIEF, in combination with
different machine learning models.
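The text states that oversampling was used to balance the classes but does not specify the variant. As a minimal sketch, assuming simple random oversampling with replacement (an assumption, not necessarily the procedure used in this study), the minority class can be duplicated until every class matches the size of the largest one. The function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are equally frequent."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    counts = Counter(labels)
    target = max(counts.values())      # size of the largest class

    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # Original rows belonging to this class; samples are drawn from these only.
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (hypothetical): 8 samples of class 0 and 2 samples of class 1.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts)
```

In practice, oversampling is usually applied to the training split only, so that duplicated records do not also appear in the data used for evaluation.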
FIGURE 6. Accuracy of models on each technique for large data.

VI. CONCLUSION
This study proposed a novel framework for detecting and classifying cardiovascular disease (CVD) using machine learning
algorithms and optimal feature selection techniques. The proposed framework demonstrated the significant impact of feature
selection on enhancing the performance of machine learning algorithms for CVD prediction. The study evaluated five
different feature selection techniques: MRMR, FCBF, LASSO, Relief, and ANOVA. Among these techniques, FCBF exhibited
superior performance, achieving an accuracy of 78% when combined with the Extra Tree and Random Forest models. This finding
highlights the effectiveness of FCBF in selecting relevant features from large-scale CVD datasets. The study also
highlights the importance of selecting an appropriate feature selection and optimization technique based on the
characteristics of the dataset. For large datasets with predominantly categorical features, like the one used in this
study, FCBF emerged as a promising technique for identifying relevant features and improving the performance of machine
learning algorithms in CVD prediction.

VII. FUTURE WORK
In addition to the future works mentioned earlier, there are several other areas that could be explored to improve the
analysis and results of this study. Some of these potential future works include:
Deep learning: Consider using deep learning models, specifically neural networks, to enhance model accuracy on different
datasets. These models capture complex feature relationships and have the potential to improve overall performance.
Further exploration of deep learning could lead to valuable insights and advancements in the field.
Alternative feature selection techniques: While several feature selection techniques were used in this study, there are
many other methods that could be explored. For example, genetic algorithms, decision trees, and mutual information-based
methods could be used for feature selection. By testing various methods and comparing the results, the most effective
feature selection method for this dataset could be identified.
Additional data sources: While the dataset used in this study contained a large amount of information about heart disease
risk factors, there may be additional data sources that could be used to further improve the accuracy of the models. For
example, data on environmental factors such as air quality and access to healthcare could be included and analyzed to see
if they have an impact on heart disease risk.
Improved data balancing techniques: In this study, oversampling was used to balance the imbalanced data. However, there are
other techniques, such as undersampling and SMOTE (Synthetic Minority Over-Sampling Technique), that could be explored to
improve the balance of the data.
Ensemble learning: Ensemble learning is a technique that combines multiple machine learning models to improve the overall
accuracy of the predictions. By using ensemble learning techniques such as bagging or boosting, the accuracy of the models
could potentially be improved.
By exploring these and other potential future works, it may be possible to further improve the accuracy of the heart
disease risk prediction models and develop more effective strategies for preventing and managing heart disease.

REFERENCES
[1] A. K. Gárate-Escamila, A. H. El Hassani, and E. Andrès, "Classification models for heart disease prediction using feature selection and PCA," Informat. Med. Unlocked, vol. 19, Jan. 2020, Art. no. 100330, doi: 10.1016/j.imu.2020.100330.
[2] V. Chang, V. R. Bhavani, A. Q. Xu, and M. Hossain, "An artificial intelligence model for heart disease detection using machine learning algorithms," Healthcare Anal., vol. 2, Nov. 2022, Art. no. 100016, doi: 10.1016/j.health.2022.100016.
[3] M. Ganesan and N. Sivakumar, "IoT based heart disease prediction and diagnosis model for healthcare using machine learning models," in Proc. IEEE Int. Conf. Syst., Comput., Autom. Netw. (ICSCAN), Mar. 2019, pp. 1–5, doi: 10.1109/ICSCAN.2019.8878850.
[4] D. P. Isravel, S. V. P. Darcini, and S. Silas, "Improved heart disease diagnostic IoT model using machine learning techniques," Int. J. Sci. Technol. Res., vol. 9, no. 2, pp. 4442–4446, 2020.
[5] I. S. G. Brites, L. M. da Silva, J. L. V. Barbosa, S. J. Rigo, S. D. Correia, and V. R. Q. Leithardt, "Machine learning and IoT applied to cardiovascular diseases identification through heart sounds: A literature review," Informatics, vol. 8, no. 4, p. 73, Oct. 2021, doi: 10.3390/informatics8040073.
[6] D. T. Thai, Q. T. Minh, and P. H. Phung, "Toward an IoT-based expert system for heart disease diagnosis," in Proc. 28th Mod. Artif. Intell. Cogn. Sci. Conf. (MAICS), 2017, pp. 157–164.
[7] B. Padmaja, C. Srinidhi, K. Sindhu, K. Vanaja, N. M. Deepika, and E. K. R. Patro, "Early and accurate prediction of heart disease using machine learning model," Turkish J. Comput. Math. Educ., vol. 12, no. 6, pp. 4516–4528, 2021.
[8] S. Anitha and N. Sridevi, Heart Disease Prediction Using Data Mining Techniques, document HAL Id hal-02196156, 2019. [Online]. Available: https://hal.archives-ouvertes.fr/hal-02196156/document
[9] R. Bharti, A. Khamparia, M. Shabaz, G. Dhiman, S. Pande, and P. Singh, "Prediction of heart disease using a combination of machine learning and deep learning," Comput. Intell. Neurosci., vol. 2021, pp. 1–11, Jul. 2021, doi: 10.1155/2021/8387680.
[10] H. Jindal, S. Agrawal, R. Khera, R. Jain, and P. Nagrath, "Heart disease prediction using machine learning algorithms," IOP Conf., Mater. Sci. Eng., vol. 1022, no. 1, Jan. 2021, Art. no. 012072, doi: 10.1088/1757-899x/1022/1/012072.
[11] B. Pavithra and V. Rajalakshmi, "Heart disease detection using machine learning algorithms," in Proc. Int. Conf. Emerg. Current Trends Comput. Expert Technol., vol. 35, 2020, pp. 1131–1137, doi: 10.1007/978-3-030-32150-5_115.
[12] N. Louridi, S. Douzi, and B. El Ouahidi, "Machine learning-based identification of patients with a cardiovascular defect," J. Big Data, vol. 8, no. 1, pp. 1–5, Dec. 2021, doi: 10.1186/s40537-021-00524-9.
[13] P. Singh, G. K. Pal, and S. Gangwar, "Prediction of cardiovascular disease using feature selection techniques," Int. J. Comput. Theory Eng., vol. 14, no. 3, pp. 97–103, 2022, doi: 10.7763/ijcte.2022.v14.1316.
[14] M. Swathy and K. Saruladha, "A comparative study of classification and prediction of cardio-vascular diseases (CVD) using machine learning and deep learning techniques," ICT Exp., vol. 8, no. 1, pp. 109–116, Mar. 2022, doi: 10.1016/j.icte.2021.08.021.
[15] D. Vaddella, C. Sruthi, B. K. Chowdary, and S.-R. Subbareddy, "Prediction of heart disease using machine learning techniques," Restaur. Bus., vol. 118, no. 1, pp. 125–129, 2019, doi: 10.26643/rb.v118i1.7621.
[16] V. V. Ramalingam, A. Dandapath, and M. Karthik Raja, "Heart disease prediction using machine learning techniques: A survey," Int. J. Eng. Technol., vol. 7, no. 2, p. 684, Mar. 2018, doi: 10.14419/ijet.v7i2.8.10557.
[17] J. P. Li, A. U. Haq, S. U. Din, J. Khan, A. Khan, and A. Saboor, "Heart disease identification method using machine learning classification in E-healthcare," IEEE Access, vol. 8, pp. 107562–107582, 2020, doi: 10.1109/ACCESS.2020.3001149.
[18] P. Kalpana, S. S. Vignesh, L. M. P. Surya, and V. V. Prasad, "Prediction of heart disease using machine learning," J. Phys., Conf. Ser., vol. 1916, no. 1, May 2021, Art. no. 012022, doi: 10.1088/1742-6596/1916/1/012022.
[19] A. Ed-Daoudy and K. Maalmi, "Real-time machine learning for early detection of heart disease using big data approach," in Proc. Int. Conf. Wireless Technol., Embedded Intell. Syst. (WITS), Apr. 2019, pp. 1–5, doi: 10.1109/WITS.2019.8723839.
[20] I. D. Mienye, Y. Sun, and Z. Wang, "An improved ensemble learning approach for the prediction of heart disease risk," Informat. Med. Unlocked, vol. 20, Jan. 2020, Art. no. 100402, doi: 10.1016/j.imu.2020.100402.
[21] A. Gupta, R. Kumar, H. S. Arora, and B. Raman, "MIFH: A machine intelligence framework for heart disease diagnosis," IEEE Access, vol. 8, pp. 14659–14674, 2020, doi: 10.1109/ACCESS.2019.2962755.
[22] R. Atallah and A. Al-Mousa, "Heart disease detection using machine learning majority voting ensemble method," in Proc. 2nd Int. Conf. New Trends Comput. Sci. (ICTCS), Oct. 2019, pp. 1–6, doi: 10.1109/ICTCS.2019.8923053.
[23] M. Bheemalingaiah, G. R. Swamy, P. Vishvapathi, P. V. Babu, E. N. Rao, and P. N. Rao, "Detection of heart disease by using reliable Boolean machine learning algorithm," J. Theor. Appl. Inf. Technol., vol. 99, no. 15, pp. 3856–3880, 2021, doi: 10.5281/zenodo.5353586.
[24] M. Pal, S. Parija, G. Panda, K. Dhama, and R. K. Mohapatra, "Risk prediction of cardiovascular disease using machine learning classifiers," Open Med., vol. 17, no. 1, pp. 1100–1113, Jun. 2022.
[25] S. I. Ayon, M. M. Islam, and M. R. Hossain, "Coronary artery heart disease prediction: A comparative study of computational intelligence techniques," IETE J. Res., vol. 68, no. 4, pp. 2488–2507, Jul. 2022, doi: 10.1080/03772063.2020.1713916.
[26] G. Choudhary and S. N. Singh, "Prediction of heart disease using machine learning algorithms," in Proc. Int. Conf. Smart Technol. Comput., Electr. Electron. (ICSTCEE), Oct. 2020, pp. 197–202, doi: 10.1109/ICSTCEE49637.2020.9276802.
[27] Y. Khourdifi and M. Bahaj, "Heart disease prediction and classification using machine learning algorithms optimized by particle swarm optimization and ant colony optimization," Int. J. Intell. Eng. Syst., vol. 12, no. 1, pp. 242–252, Feb. 2019, doi: 10.22266/ijies2019.0228.24.
[28] Y. Muhammad, M. Tahir, M. Hayat, and K. T. Chong, "Early and accurate detection and diagnosis of heart disease using intelligent computational model," Sci. Rep., vol. 10, no. 1, pp. 1–18, Nov. 2020, doi: 10.1038/s41598-020-76635-9.
[29] M. A. Hall, "Correlation-based feature selection for machine learning," Ph.D. dissertation, Univ. Waikato, Hamilton, New Zealand, 1999.
[30] L. Yu and H. Liu, "Feature selection for regression," Data Mining Knowl. Discovery, vol. 15, no. 3, pp. 259–285, 2003.
[31] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Stat. Soc., Ser. B, Methodolog., vol. 58, no. 1, pp. 267–288, Jan. 1996.
[32] K. Kira and L. Cooper, "A compressed sensing approach to feature extraction from high-dimensional data," School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep., 2000.
[33] D. C. Montgomery, Introduction to Linear Regression Analysis. Hoboken, NJ, USA: Wiley, 2012.
[34] A. H. Shahid and M. P. Singh, "A novel approach for coronary artery disease diagnosis using hybrid particle swarm optimization based emotional neural network," Biocybernetics Biomed. Eng., vol. 40, no. 4, pp. 1568–1585, Oct. 2020, doi: 10.1016/j.bbe.2020.09.005.
[35] R. Tr, U. K. Lilhore, P. M, S. Simaiya, A. Kaur, and M. Hamdi, "Predictive analysis of heart diseases with machine learning approaches," Malaysian J. Comput. Sci., pp. 132–148, Mar. 2022.
[36] N. A. Baghdadi, S. M. Farghaly Abdelaliem, A. Malki, I. Gad, A. Ewis, and E. Atlam, "Advanced machine learning techniques for cardiovascular disease early detection and diagnosis," J. Big Data, vol. 10, no. 1, p. 144, Sep. 2023.
[37] K. M. Mohi Uddin, R. Ripa, N. Yeasmin, N. Biswas, and S. K. Dey, "Machine learning-based approach to the diagnosis of cardiovascular vascular disease using a combined dataset," Intell.-Based Med., vol. 7, Jan. 2023, Art. no. 100100.
[38] P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized trees," Mach. Learn., vol. 63, no. 1, pp. 3–42, Apr. 2006, doi: 10.1007/s10994-006-6226-1.
[39] M. M. Hameed, M. K. AlOmar, F. Khaleel, and N. Al-Ansari, "An extra tree regression model for discharge coefficient prediction: Novel, practical applications in the hydraulic sector and future research directions," Math. Problems Eng., vol. 2021, pp. 1–19, Sep. 2021, doi: 10.1155/2021/7001710.
[40] M. Pal and S. Parija, "Prediction of heart diseases using random forest," J. Phys., Conf. Ser., vol. 1817, no. 1, Mar. 2021, Art. no. 012009, doi: 10.1088/1742-6596/1817/1/012009.
[41] A. G, B. Ganesh, A. Ganesh, C. Srinivas, and K. Mensinkal, "Logistic regression technique for prediction of cardiovascular disease," Global Transitions Proc., vol. 3, no. 1, pp. 127–130, Jun. 2022, doi: 10.1016/j.gltp.2022.04.008.

TAHSEEN ULLAH received the Master of Computer Science (M.C.S.) degree from Abdul Wali Khan University Mardan, Pakistan. He
is currently pursuing the M.S. degree in computer science with Abasyn University, Peshawar, Pakistan. His research
interests include artificial intelligence, machine learning, deep learning, computer vision, and the IoT.

SYED IRFAN ULLAH received the Ph.D. degree from the Department of Computer Science, International Islamic University
Peshawar. He is currently an Associate Professor with the Department of Computing, Abasyn University. He is a well-known
Researcher in the area of intelligent cryptosystems and he has a number of research articles in various fields of computer
science.

KHALIL ULLAH received the Graduate degree in computer systems engineering from the University of Engineering and
Technology, Peshawar, Pakistan, in 2006, the Master of Science (M.S.) degree in electronics and communications engineering
from Myongji University, South Korea, in 2009, and the Ph.D. degree in biomedical engineering from LISiN, Politecnico di
Torino, in 2016, under the Erasmus Mundus Expert II Fellowship. He is currently an Assistant Professor and the Head of the
Software Engineering Department, University of Malakand. His research interests include extracting muscle anatomical and
physiological information from high-density electromyography, computer vision, digital signal and image processing, and
deep learning with applications to medical healthcare.
MUHAMMAD ISHAQ received the B.S. degree in computer science from The University of Haripur, Pakistan. He is currently
pursuing the M.S. degree in telecommunication and networks with Abasyn University, Peshawar, Pakistan. He is currently a
Senior IT Officer with the Helping Hand Institute of Rehabilitation Sciences, Mansehra, Pakistan. His research interests
include artificial intelligence, machine learning, deep learning, and the IoT.

YAZEED YASIN GHADI received the Ph.D. degree in electrical and computer engineering from Queensland University. His
dissertation on developing novel hybrid plasmonic-photonic on-chip biochemical sensors received the Sigma Xi Best Ph.D.
Thesis Award. He is currently an Assistant Professor in software engineering with Al Ain University. He was a Postdoctoral
Researcher with Queensland University, before joining Al Ain. His current research is on developing novel
electro-acousto-optic neural interfaces for large-scale high-resolution electrophysiology and distributed optogenetic
stimulation. He has published more than 80 peer-reviewed journal and conference papers and he holds three pending patents.
He is a recipient of several awards.