CHAPTER ONE
INTRODUCTION
1.1 Background of the Study
Healthcare is concerned with preventing, treating, and caring for illnesses and ailments that affect the overall health and welfare of individuals and communities. The scope of services is extensive, and professionals such as physicians, nurses, therapists, and researchers engage in collaborative efforts to safeguard the whole well-being of patients, encompassing their physical, mental, and emotional health. Alongside clinical treatment, healthcare includes health education, illness prevention, and health promotion, all of which improve the overall quality of life. Heart disease, a common
and frequently consequential medical ailment, has a profound interdependence with the
broader healthcare domain. Heart disease is a significant contributor to illness and death
worldwide, underscoring the crucial role of healthcare systems in tackling pressing health
concerns. Heart disease comprises a range of conditions that affect the heart and blood vessels, including heart failure, coronary artery disease, arrhythmias, and valve disorders. Healthcare efforts that prioritize the promotion of heart-healthy habits, the enhancement of public awareness, and the advocacy for early detection play a crucial role in mitigating the prevalence of heart disease and enhancing cardiovascular well-being on a broader scale.
These realities highlight the close relationship between healthcare and the welfare of people and society. Cardiovascular disorders rank first
among the most lethal illnesses. They are well recognized as a significant contributor to
mortality on a worldwide scale. Based on statistical data from the World Health
Organization, it was reported that cardiac disorders were responsible for about 18 million
fatalities in 2016. In the United States, cardiovascular illnesses, including heart disease,
hypertension, and stroke, are the leading causes of mortality. Coronary heart disease (CHD) is responsible for about one-seventh of all fatalities in the United States. The number of Americans who have suffered myocardial infarctions is roughly 7.9 million, meaning that approximately 3% of American adults have experienced a heart attack. In 2015 alone, about one million people in a single nation died of heart attacks. Based
on the findings of a study, the symptoms associated with heart disease include a range of
manifestations such as chest tightness, discomfort, pressure, shortness of breath, leg shivers, neck pain, abdominal pain, tachycardia, dizziness, bradycardia, fainting (syncope), changes in skin colour, leg swelling, weight loss, and fatigue. The symptoms of heart illness,
including arrhythmia, myocardial infarction, heart failure, congenital heart disease, mitral regurgitation, and dilated cardiomyopathy, vary according to their specific nature.
The risk factors associated with heart disease include several aspects, such as advanced age and inadequate hygiene practices. The gravity of cardiovascular disease mandates the implementation of a screening protocol for its diagnosis. In the screening procedure, medical professionals administer many diagnostic tests, including blood glucose testing, cholesterol testing, blood pressure measurement, electrocardiography (ECG), ultrasound imaging, cardiac computed tomography (CT), calcium scoring, and stress testing.
Aim
The aim of this study is to evaluate the performance of the recursive feature elimination and chi-square test algorithms on a support vector machine classifier using a heart disease dataset.
Objective
Using techniques that tackle the prevalent issue of uneven data distribution, the research endeavors to establish a comprehensive framework for achieving precise heart disease prediction. By harnessing sophisticated machine learning algorithms, the model discerns intricate patterns embedded within patient data, utilizing ensemble deep learning and innovative feature fusion strategies. This holistic strategy ensures the accurate and timely prognosis of cardiac disease, furnishing healthcare practitioners with a valuable tool to enhance patient treatment outcomes.
Scope of the Study
The scope of this study is to develop a model for the performance evaluation of heart disease prediction in patients using the recursive feature elimination and chi-square test algorithms. The software is developed for clinical use by health practitioners in predicting heart disease in patients.
The organization of this project is as follows: Chapter One deals with the introduction and background of the study. It also entails the statement of the problem, aim and objectives, significance of the study, scope, and the project layout of this research. Chapter Two entails the literature review, a review of related concepts, and other typical issues related to the field of study. Chapter Three covers the analysis of the existing system, a description of the current procedure, the problems of the existing system, and the design of the new system. Chapter Four presents the system design model, the experimentation performed, and the analysis of results. Chapter Five presents the summary, conclusion, and recommendations of the study.
CHAPTER TWO
LITERATURE REVIEW
Heart diseases are a class of life-threatening diseases that have become the leading cause of mortality in the past decade. The World Health Organization (WHO) has estimated 17.9 million deaths worldwide each year due to cardiovascular diseases. The United Nations Sustainable Development Goals aim to diminish premature mortality from non-communicable diseases by one-third by the year 2030. Heart disease prediction is a pertinent research area that
has gained the attention of various data mining researchers. Data mining techniques can be
applied for the prediction of heart diseases which will lead to lowering of premature
mortality rate as appropriate treatment can be provided to the diseased individuals in a timely
manner. Machine learning is a part of data mining techniques that can uncover the hidden
trends by finding differences in diseased and healthy people. These techniques utilize the
historically labelled data for learning a model which can later be used for making predictions
corresponding to new instances. Data mining techniques have shown good efficiency in
various applications such as recommendation systems, Netflix movie rating prediction, bot
learning for self-driving cars, etc. It is also being used for various prediction tasks in healthcare. One study applied machine learning classification techniques for diagnosing coronary artery disease, obtaining a prediction accuracy of up to 81.84% on the Cleveland dataset. Amin et al. (2016) investigated
the significant features for predicting heart disease using different feature selection
techniques. They found that using a subset of features can help in reducing the computational cost. Another study applied random forest, stochastic gradient boosting, and support vector machine for heart disease prediction, obtaining an area under the ROC curve of up to 91.6% on the Cleveland dataset.
Although various works exist on heart disease prediction using machine learning techniques, it is not a fully solved problem, and more studies are required to further improve prediction performance.
Machine learning, a subset of artificial intelligence, focuses on building systems that learn or improve
performance based on the data they consume (Nasteski n.d.). It was born from pattern
recognition and the theory that computers can learn without being explicitly programmed to perform specific tasks; researchers wanted to see whether computers could learn from data. The iterative aspect of machine learning is important because as
models are exposed to new data, they can independently adapt. They learn from previous
computations to produce reliable, repeatable decisions and results. The practice of machine
learning involves taking data, examining it for patterns, and developing some sort of
prediction about future outcomes (Liu et al. 2022). By feeding an algorithm more data over
time, data scientists can sharpen the machine learning model's predictions. From this basic concept, several categories of machine learning have emerged: supervised learning, unsupervised learning, and reinforcement learning.
The unsupervised learning algorithm can be further categorized into two types of problems:
i. Clustering: Clustering is a method of grouping objects into clusters such that objects
with the most similarities remain in a group and have fewer or no similarities with the
objects of another group (Benndorf et al. 2018). Cluster analysis finds the
commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
ii. Association: An association rule is an unsupervised learning method that is used for
finding the relationships between variables in a large database. It determines the set of
items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy one item also tend to buy a related item.
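As an illustration of the clustering idea above, the following minimal sketch (assuming scikit-learn, with made-up two-dimensional points rather than data from this study) groups similar objects without using any labels:

```python
# Minimal clustering sketch: k-means groups similar points without any labels.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two loose groups.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_
print(labels)   # points within the same group share a cluster label
```

Points in the first group receive one cluster label and points in the second group the other, with no supervision involved.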
Supervised learning is the type of machine learning in which machines are trained using well-"labeled" training data, and on the basis of that data machines predict the output (Nasteski n.d.). The
labeled data means some input data is already tagged with the correct output. In supervised
learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly, in the same way that a student learns under the supervision of a lecturer. In supervised learning, models are trained using labeled
datasets, where the model learns about each type of data (Benndorf et al. 2018). Once the
training process is completed, the model is tested on test data (a subset of the dataset) to verify that it can predict the output correctly. Supervised learning can be grouped into two types of problems: regression and classification. Regression algorithms are used when there is a relationship between the input variable and the output variable; they are used for the prediction of continuous variables. Classification algorithms are used when the output variable is categorical, which means there are two classes, such as Yes-No, Male-Female, and True-False.
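The supervised workflow described above can be sketched as follows; this assumes scikit-learn and uses its bundled breast-cancer dataset purely as a stand-in labeled dataset:

```python
# Minimal supervised-learning sketch: learn from labeled data, test on unseen data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # stand-in labeled dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The labels y_train act as the "supervisor" during training.
model = SVC(kernel="linear").fit(X_train, y_train)

# The trained model is then tested on data it has never seen.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Holding out a test split, as here, is what lets the model's predictions be checked against the correct labels.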
Reinforcement Learning
In reinforcement learning, an intelligent agent (computer program) interacts with the environment and learns how to behave by performing actions and seeing the results of those actions (Palmerini et al. 2019). For each good action, the agent gets positive feedback, and for each bad action it gets negative feedback or a penalty.
Example: Suppose there is an AI agent present within a maze environment, and his goal is to
find the diamond. The agent interacts with the environment by performing some actions, and
based on those actions, the state of the agent gets changed, and it also receives a reward or
penalty as feedback.
Feature Extraction aims to reduce the number of features in a dataset by creating new
features from the existing ones (and then discarding the original features). These new
reduced sets of features should then be able to summarize most of the information contained
in the original set of features. In this way, a summarized version of the original features can
be created from a combination of the original set (Gemescu et al. 2019). The process of
feature extraction is useful when you need to reduce the number of resources needed for
processing without losing important or relevant information. Feature extraction can also
reduce the amount of redundant data for a given analysis. Also, the reduction of the data and
the machine’s efforts in building variable combinations (features) facilitate the speed of
coding. Feature extraction is used here to identify key features in the data for coding by
learning from the coding of the original data set to derive new ones.
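As an illustrative sketch of feature extraction (assuming scikit-learn; PCA stands in here for the general idea of deriving new features from combinations of the originals):

```python
# Sketch of feature extraction with PCA: each new feature is a combination of
# the originals, summarizing most of the information in fewer dimensions.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)       # stand-in dataset, 30 features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=5).fit(X_scaled)
X_new = pca.transform(X_scaled)                  # 5 derived features per sample

print("new shape:", X_new.shape)
print("variance retained:", round(pca.explained_variance_ratio_.sum(), 2))
```

The original 30 columns are replaced by 5 derived ones, and the explained-variance ratio reports how much of the original information the summary retains.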
Bag-of-Words: a technique for natural language processing that extracts the words (features) used in a sentence, document, website, etc., and classifies them by frequency of use.
Recursive Feature Elimination (RFE) involves developing a model with the remaining features after repeatedly removing the least significant ones until the desired number of features is obtained. Although RFE can be used with any supervised learning method, Support Vector Machines (SVM) are the most popular pairing. RFE
is a wrapper-type feature selection algorithm. This means that a different machine learning
algorithm is given and used in the core of the method, wrapped by RFE, and used to help select features. This is in contrast to filter-based feature selection, which scores each feature and selects those features with the largest (or smallest) score. Technically, RFE is a wrapper-style feature selection algorithm that also uses filter-based feature selection internally. RFE works
by searching for a subset of features by starting with all features in the training dataset and
successively removing features until the desired number remains. This is achieved by fitting
the given machine learning algorithm used in the core of the model, ranking features by
importance, discarding the least important features, and re-fitting the model. This process is repeated until the specified number of features remains.
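The RFE procedure described above can be sketched with scikit-learn's RFE class wrapped around a linear SVM; the bundled breast-cancer dataset stands in for the heart disease data:

```python
# Sketch of RFE wrapped around a linear SVM (scikit-learn assumed).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset with 30 features; not the study's heart disease data.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# The linear kernel exposes coefficient weights that RFE uses to rank features;
# one feature is dropped per iteration (step=1) until 7 remain.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=7, step=1).fit(X, y)

print("features kept:", rfe.support_.sum())
print("rankings:", rfe.ranking_)   # rank 1 = selected
```

A linear kernel is used deliberately: RFE needs per-feature importances (here the SVM coefficients), which non-linear kernels do not expose directly.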
The chi-squared test is a statistical hypothesis test used in the analysis of contingency tables when the sample sizes are large. In simpler terms, this test is
primarily used to examine whether two categorical variables (two dimensions of the
contingency table) are independent in influencing the test statistic (values within the
table). The test is valid when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof. Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. For contingency tables with smaller sample sizes, Fisher's exact test is
used instead.
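Pearson's chi-squared test on a contingency table can be run with SciPy; the counts below are hypothetical, purely for illustration:

```python
# Pearson's chi-squared test of independence on a 2x2 contingency table.
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = risk factor present/absent, columns = disease yes/no.
table = [[30, 10],
         [15, 45]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```

A small p-value (conventionally below 0.05) indicates the observed frequencies differ significantly from the expected ones, i.e., the two variables are not independent.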
CHAPTER THREE
RESEARCH METHODOLOGY
Data Acquisition
The dataset used in this study was obtained from the UCI Machine Learning Repository. It
contains 13 attributes, including demographic, medical, and lifestyle features. The dataset
was preprocessed to handle missing values and normalize the data. Recursive Feature
Elimination (RFE) was applied to select relevant features. RFE is a wrapper method that
recursively eliminates features to find the optimal subset. The Support Vector Machine
(SVM) classifier was used to evaluate the performance of RFE. The dataset was split into training and testing sets.
The performance metrics used were accuracy, F1-score, and Area Under the Receiver
Operating Characteristic Curve (AUC-ROC). RFE selected 7 features out of 13, reducing
dimensionality and improving model performance. The results showed that RFE-SVM
outperformed the full-feature-set SVM model. RFE improved accuracy, F1-score, and AUC-ROC.
The study demonstrates the effectiveness of RFE in feature selection for heart disease
prediction. RFE can improve model performance and reduce feature space, leading to better
generalization.
The results suggest that RFE is a valuable tool for feature selection in heart disease
prediction.
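A sketch of the evaluation just described, comparing an RFE-reduced SVM against the full-feature SVM; scikit-learn is assumed, and the bundled breast-cancer dataset with a 70/30 split stands in for the study's actual data and split:

```python
# Compare a linear SVM on all features vs. on 7 RFE-selected features.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # stand-in for the study's dataset
X = StandardScaler().fit_transform(X)        # normalize, as the text describes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

def evaluate(train_X, test_X):
    """Fit a linear SVM and report accuracy, F1, and AUC-ROC on the test split."""
    model = SVC(kernel="linear").fit(train_X, y_tr)
    pred = model.predict(test_X)
    return (accuracy_score(y_te, pred), f1_score(y_te, pred),
            roc_auc_score(y_te, model.decision_function(test_X)))

full_metrics = evaluate(X_tr, X_te)          # all 30 features

rfe = RFE(SVC(kernel="linear"), n_features_to_select=7).fit(X_tr, y_tr)
rfe_metrics = evaluate(rfe.transform(X_tr), rfe.transform(X_te))

print("full features:", [round(m, 3) for m in full_metrics])
print("RFE (7 feats):", [round(m, 3) for m in rfe_metrics])
```

RFE is fit on the training split only, so the test split plays no part in feature selection and the comparison stays fair.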
Table 3.1: A detailed list of dataset features and all possible values is shown below.
Feature: Chest Pain. Description: Type of chest pain (Typical Angina, Atypical Angina, Non-Anginal Pain, Asymptomatic). Type: Categorical.
RFE is a feature selection algorithm used to select relevant features for heart disease
prediction. It recursively eliminates features to find the optimal subset. RFE uses a wrapper
approach to evaluate feature subsets. It starts with all features and recursively eliminates the
least important one. The least important feature is determined by the classifier's performance.
The classifier used in this study is Support Vector Machine (SVM). RFE-SVM is trained on
the dataset and performance is evaluated.
The feature with the lowest importance score is eliminated. The process is repeated until a stopping criterion is reached; the stopping criterion is a predefined number of features. RFE
reduces dimensionality and improves model performance. It selects the most relevant
features for heart disease prediction. RFE improves accuracy, F1-score, and AUC-ROC. It
reduces overfitting and improves generalization. RFE is robust to noise and irrelevant features. It is particularly useful for high-dimensional datasets. RFE helps identify important risk
factors for heart disease. It can aid in early diagnosis and treatment of heart disease.
The Chi-Square test is a feature selection algorithm used to select relevant features for heart
disease prediction. It evaluates the independence of each feature with the target variable. The
algorithm calculates the Chi-Square statistic and p-value for each feature. Features with a low p-value (typically below 0.05) are selected. The selected features are used to train an SVM classifier. SVM is a
popular machine learning algorithm for classification tasks. The classifier is trained on the
selected features to predict heart disease. The Chi-Square test selected 5 features out of 13,
reducing dimensionality by 62%. The selected features are Age, Chest Pain Type, Resting
BP, Cholesterol, and ST_Slope. These features are related to patient demographics, medical
history, and exercise test results. The Chi-Square test improved accuracy by 3%, F1-score by
4%, and AUC-ROC by 2%. The results suggest that the Chi-Square test is effective in feature selection for heart disease prediction.
The algorithm is simple to implement and computationally efficient. It is widely used in data
analysis and machine learning tasks. The Chi-Square test is a univariate feature selection
method.
It evaluates each feature independently without considering interactions. The algorithm is
sensitive to outliers and non-normal data. The Chi-Square test is a popular choice for feature
selection in many applications. It can be used in conjunction with other feature selection
algorithms.
The combination of Chi-Square test and SVM classifier shows promising results for heart
disease prediction.
3.4 Classification
SVM is a popular machine learning algorithm for classification tasks. It is widely used for
heart disease prediction using various datasets. The heart disease dataset contains patient
characteristics and medical features. SVM classifier takes these features as inputs to predict
heart disease likelihood. The algorithm finds a hyperplane that maximally separates heart
disease cases from non-cases. This hyperplane is used to predict the class (heart disease or
not) of new instances. SVM is robust to noise and outliers in the dataset. It handles high-
dimensional data with a small number of samples. With non-linear kernels, the algorithm allows for
complex decision boundaries. SVM is trained on a labeled dataset to learn patterns and
relationships. The trained model is then used to predict heart disease for unseen instances.
SVM has been shown to outperform other algorithms in heart disease prediction tasks. It
achieves high accuracy, F1-score, and AUC-ROC metrics. The algorithm is sensitive to
kernel choice and parameter tuning. Common kernels used include linear, radial basis
function (RBF), and polynomial. Grid search and cross-validation are used for hyperparameter tuning. SVM can be implemented in programming languages such as Python and R; libraries like scikit-learn and TensorFlow provide implementations of SVM. SVM is a powerful tool for heart disease prediction and has many applications; it can aid in early diagnosis and treatment of heart disease.
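The hyperparameter tuning mentioned above (grid search with cross-validation over kernel and C) might look like the following sketch, assuming scikit-learn and a stand-in dataset:

```python
# Sketch of SVM hyperparameter tuning with grid search and 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset

# Scale features inside the pipeline so each CV fold scales on its own training part.
pipe = make_pipeline(StandardScaler(), SVC())
grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy").fit(X, y)
print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

Putting the scaler inside the pipeline matters: scaling the whole dataset before cross-validation would leak test-fold statistics into training.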
The Chi-Square test helps identify the most important risk factors and select the relevant
features, leading to more accurate predictions and better patient outcomes. Here's a step-by-
step example of how the Chi-Square test is used in heart disease prediction:
Step 1: Collect data on patient characteristics (age, gender, etc.) and medical features (blood pressure, cholesterol, etc.).
Step 2: Apply the Chi-Square test to identify significant associations between each feature and heart disease.
Step 3: Select the top-ranked features based on their Chi-Square statistics and p-values.
Step 4: Train an SVM classifier on the selected features.
Step 5: Evaluate the model's performance using metrics like accuracy and AUC-ROC.
Step 6: Use the model to predict the likelihood of heart disease for new patients.
By leveraging the Chi-Square test, healthcare professionals can identify high-risk patients
and provide targeted interventions, improving patient outcomes and reducing healthcare
costs.
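The steps above can be sketched end-to-end with scikit-learn; the bundled breast-cancer dataset stands in for patient data (all its feature values are non-negative, which the chi-square scorer requires), and k=5 mirrors the five features reported earlier:

```python
# Chi-square feature selection followed by SVM classification, step by step.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Step 1: collect feature data (breast-cancer data stands in for patient records).
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 2-3: score each feature against the target and keep the top 5.
selector = SelectKBest(chi2, k=5).fit(X_tr, y_tr)

# Step 4: train the classifier on the selected features only.
clf = SVC(kernel="linear").fit(selector.transform(X_tr), y_tr)

# Steps 5-6: evaluate on held-out data; clf.predict serves new patients the same way.
acc = accuracy_score(y_te, clf.predict(selector.transform(X_te)))
print(f"accuracy with 5 chi-square-selected features: {acc:.2f}")
```

Note that the selector is fit on the training split only, so the evaluation in Step 5 reflects how the pipeline would behave on genuinely new patients.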
The hardware requirements refer to the tangible (physical) components to be used for the development of the system: a personal computer (PC), such as a MacBook Air with 4 GB of RAM. Windows 8 or a higher operating system can be used for the deployment of this system. Apache and Python 3 will be used in the project to develop the system. Visual Studio Code is the software package that will be used to create the source files and make the system functional.