
CHAPTER ONE

INTRODUCTION

1.1 BACKGROUND OF THE STUDY

Healthcare encompasses a range of disciplines. It focuses on preventing, diagnosing,

treating, and caring for illnesses and ailments that impact people and communities’ overall

health and welfare. The scope of services is extensive, including diverse offerings such as

medical exams, interventions, psychological assistance, and public health activities.

Healthcare professionals, encompassing a diverse range of individuals such as physicians,

nurses, therapists, and researchers, collaborate to safeguard the overall well-being of patients, encompassing their physical, mental, and emotional health. Alongside clinical treatment, healthcare includes health education, illness prevention, and health promotion, all of which improve the overall quality of life. Heart disease, a common and frequently serious medical condition, is profoundly interdependent with the

broader healthcare domain. Heart disease is a significant contributor to illness and death

worldwide, underscoring the crucial role of healthcare systems in tackling pressing health

concerns. Heart disease comprises a range of conditions that affect the heart and blood vessels, including heart failure, coronary artery disease, arrhythmias, and valve

abnormalities. Healthcare professionals are crucial in diagnosing heart disease by using

medical evaluations, employing modern imaging methods, and conducting diagnostic

procedures. The management of the condition involves a comprehensive approach

encompassing lifestyle adjustments, pharmaceutical therapies, and surgical procedures,

highlighting the need for interdisciplinary cooperation among healthcare professionals.

Healthcare efforts that prioritize the promotion of heart-healthy habits, the enhancement of awareness, and the advocacy for early detection play a crucial role in mitigating the prevalence of heart disease and enhancing cardiovascular well-being on a broader scale.

Examining and controlling cardiovascular disease exemplify the complex interaction between healthcare and the welfare of individuals and society. Cardiovascular disorders rank among the most lethal illnesses and are well recognized as a significant contributor to mortality worldwide. According to statistical data from the World Health Organization, cardiac disorders were responsible for about 18 million deaths in 2016. In the United States, cardiovascular illnesses, including heart disease, hypertension, and stroke, are the leading causes of mortality. Coronary heart disease (CHD) accounts for about one-seventh of all deaths in the United States, out of an estimated 3 million deaths each year. Roughly 7.9 million Americans have experienced a myocardial infarction, corresponding to approximately 3% of American adults. In 2015, about 1 million people in a single country died of heart attacks. Based

on the findings of a study, the symptoms associated with heart disease include chest tightness, discomfort, pressure, shortness of breath, leg shivering, neck pain, abdominal pain, tachycardia, bradycardia, dizziness, fainting (syncope), skin color changes, leg swelling, weight loss, and weariness. The symptoms vary with the specific form of heart disease, such as arrhythmia, myocardial infarction, heart failure, congenital heart disease, mitral regurgitation, and dilated cardiomyopathy.

The risk factors associated with heart disease include several aspects such as advanced age,

genetic predisposition, tobacco use, physical behaviors, substance misuse, elevated

cholesterol levels, hypertension, sedentary lifestyle, obesity, diabetes, psychological stress,

and inadequate hygiene practices. The gravity of cardiovascular disease mandates the

implementation of a screening protocol for its diagnosis. In the screening procedure, medical

professionals administer many diagnostic tests, including blood glucose testing, cholesterol testing, blood pressure measurement, electrocardiography (ECG), ultrasound imaging, cardiac computed tomography (CT), calcium scoring, and stress testing. The

screening procedure involves time-consuming manual tasks and substantial human involvement.

1.2 Statement Of The Problem

The problems associated with heart disease include several risk factors, such as advanced age, genetic predisposition, tobacco use, unhealthy physical behaviors, substance misuse, elevated cholesterol levels, hypertension, a sedentary lifestyle, obesity, diabetes, psychological stress, and inadequate hygiene practices. The gravity of cardiovascular disease mandates a screening protocol for its diagnosis. During screening, medical professionals administer many diagnostic tests, including blood glucose testing, cholesterol testing, blood pressure measurement, electrocardiography (ECG), ultrasound imaging, cardiac computed tomography (CT), calcium scoring, and stress testing. This screening procedure requires time-consuming manual tasks and substantial human involvement, which motivates the development of automated approaches to heart disease prediction.

1.3 Aim And The Objectives Of The Study

Aim

The aim of this study is to evaluate the performance of the Recursive Feature Elimination and Chi-Square test algorithms on a support vector machine classifier using a heart disease dataset.

Objectives

The objectives of the study are as follows:


 To use the Recursive Feature Elimination and Chi-Square test algorithms for advanced feature selection, identifying the most relevant features, reducing noise, and augmenting the model’s predictive capability.

 To apply feature extraction methodologies that go beyond selecting raw attributes, converting the data into a more compact and informative representation that enhances the input for later analysis.

 To apply techniques that tackle the prevalent issue of uneven class distribution, enhancing the model’s capacity to generalize well across different classes.

1.4 Significance Of The Study

This research endeavors to establish a comprehensive framework for accurate heart disease prediction, employing an approach that integrates machine learning, feature selection, and dimensionality reduction techniques. By harnessing these methods, the model can discern intricate patterns embedded within patient data. This strategy supports the accurate and timely prognosis of cardiac disease, furnishing healthcare practitioners with a valuable tool to improve patient treatment outcomes.

1.5 Scope Of The Study

The scope of this study is to develop a model that evaluates the performance of heart disease prediction in patients using the Recursive Feature Elimination and Chi-Square test algorithms. The software is developed for clinical use by health practitioners in predicting heart disease in patients.

1.6 Organization of the Project


This project comprises five chapters, outlined as follows. Chapter One introduces the background of the study, the statement of the problem, the aim and objectives, the significance of the study, the scope, and the layout of the project. Chapter Two presents the literature review, covering related concepts and other issues typical of the research field. Chapter Three covers the analysis of the existing system, a description of the current procedure, the problems of the existing system, and the design of the new system. Chapter Four discusses the system design model, the experimentation performed, and the analysis of the results. Chapter Five discusses the strengths of the new system, the conclusion, and future work.


CHAPTER TWO

LITERATURE REVIEW

2.1 OVERVIEW OF HEART DISEASE

Heart diseases are a class of life-threatening diseases that have become the leading cause of mortality in the past decade. The World Health Organization (WHO) has estimated 17.9 million deaths worldwide each year due to cardiovascular diseases. The United Nations Sustainable Development Goals aim to reduce premature mortality from non-communicable diseases by one-third by the year 2030. Heart disease prediction is a pertinent research area that

has gained the attention of various data mining researchers. Data mining techniques can be

applied for the prediction of heart diseases which will lead to lowering of premature

mortality rate as appropriate treatment can be provided to the diseased individuals in a timely

manner. Machine learning is a part of data mining techniques that can uncover the hidden

trends by finding differences in diseased and healthy people. These techniques utilize the

historically labelled data for learning a model which can later be used for making predictions

corresponding to new instances. Data mining techniques have shown good efficiency in

various applications such as recommendation systems, Netflix movie rating prediction, and bot learning for self-driving cars. It is also being used for various prediction tasks in the

healthcare industry. Kolukisa et al. (2017) conducted an evaluation of different machine learning classification techniques for diagnosing coronary artery disease and obtained prediction accuracy of up to 81.84% on the Cleveland dataset. Amin et al. (2016) investigated the significant features for predicting heart disease using different feature selection techniques and found that using a subset of features can help reduce the computational complexity of machine learning classification techniques. Kannan et al. (2021) utilized the ROC curve to examine the performance of machine learning techniques, evaluating random forest, stochastic gradient boosting, and support vector machines for heart disease prediction, and obtained an area under the ROC curve of up to 91.6% on the Cleveland dataset.

Although various works exist on heart disease prediction using machine learning techniques, the problem is not fully solved. More studies are required to further advance the state of the art.

2.2 Machine Learning

Machine learning is the subset of artificial intelligence that focuses on building systems that learn or improve their performance based on the data they consume (Nasteski, n.d.). It was born from pattern

recognition and the theory that computers can learn without being programmed to perform

specific tasks; researchers interested in artificial intelligence wanted to see if computers

could learn from data. The iterative aspect of machine learning is important because as

models are exposed to new data, they can independently adapt. They learn from previous

computations to produce reliable, repeatable decisions and results. The practice of machine

learning involves taking data, examining it for patterns, and developing some sort of

prediction about future outcomes (Liu et al. 2022). By feeding an algorithm more data over

time, data scientists can sharpen the machine learning model's predictions. From this basic

concept, several different types of machine learning have developed.

2.2.1 Unsupervised Machine Learning

Unsupervised learning is the type of machine learning in which models are trained on unlabeled data and must find patterns or groupings on their own, without predefined outputs.

Types of Unsupervised Learning Algorithms:

The unsupervised learning algorithm can be further categorized into two types of problems:

i. Clustering: Clustering is a method of grouping objects into clusters such that objects

with the most similarities remain in a group and have fewer or no similarities with the
objects of another group (Benndorf et al. 2018). Cluster analysis finds the

commonalities between the data objects and categorizes them as per the presence and

absence of those commonalities.

ii. Association: An association rule is an unsupervised learning method that is used for

finding the relationships between variables in a large database. It determines the set of

items that occurs together in the dataset. The Association rule makes marketing

strategy more effective (Jiang et al. 2019).
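The clustering idea above can be sketched in a few lines. The example below is purely illustrative (it assumes the scikit-learn library and uses invented 2-D points, none of which come from this study):

```python
import numpy as np
from sklearn.cluster import KMeans

# Six invented 2-D points forming two obvious groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

# Group the points into two clusters by similarity (distance).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_  # cluster id assigned to each point
```

Points near (1, 1) receive one label and points near (8, 8) the other, mirroring the idea that objects with the most similarities remain in one group.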

2.2.2 Supervised Machine Learning

Supervised learning is the type of machine learning in which machines are trained using well-"labeled" training data, on the basis of which they predict the output (Nasteski, n.d.). Labeled data means the input data is already tagged with the correct output. In supervised

learning, the training data provided to the machines work as the supervisor that teaches the

machines to predict the output correctly. It applies the same concept as a student learns under

the supervision of the lecturer. In supervised learning, models are trained using labeled

datasets, where the model learns about each type of data (Benndorf et al. 2018). Once the

training process is completed, the model is tested on test data (a held-out portion of the dataset, kept separate from the training set), and then it predicts the output.

Types of Supervised Machine Learning Algorithms:

Supervised learning can be further divided into two types of problems:

i. Regression: Regression algorithms are used when there is a relationship between the input variable and the output variable. They are used for the prediction of continuous variables, such as in weather forecasting and market trend analysis.


ii. Classification: Classification algorithms are used when the output variable is categorical, for example two classes such as Yes-No, Male-Female, or True-False.
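As a minimal sketch of the regression/classification split above (assuming scikit-learn; the toy numbers are invented):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value from a numeric relationship (here y = 2x).
X_reg = [[1], [2], [3], [4]]
y_reg = [2.0, 4.0, 6.0, 8.0]
reg = LinearRegression().fit(X_reg, y_reg)

# Classification: predict a discrete class (a Yes/No style label).
X_clf = [[0.5], [1.0], [1.5], [6.0], [7.0], [8.0]]
y_clf = [0, 0, 0, 1, 1, 1]
clf = LogisticRegression().fit(X_clf, y_clf)
```

The regressor returns a number on a continuous scale, while the classifier returns one of the discrete labels it was trained on.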

Reinforcement Learning

Reinforcement learning is a feedback-based machine learning technique in which an intelligent agent (a computer program) interacts with the environment and learns to act within it by performing actions and seeing their results (Palmerini et al. 2019). For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.

Example: Suppose an AI agent is placed in a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing actions; based on those actions, the state of the agent changes, and it receives a reward or penalty as feedback.
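The maze scenario above can be sketched with a minimal Q-learning loop. This is a hypothetical toy, not the study's method: a 1-D corridor stands in for the maze, and the reward values, learning rate, and other constants are invented for illustration.

```python
import random

# States 0..4 along a corridor; the "diamond" sits at state 4.
n_states, actions = 5, [-1, +1]          # actions: move left / move right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, eps = 0.5, 0.9, 0.2        # learning rate, discount, exploration
random.seed(0)

for _ in range(500):                     # training episodes
    s = 0
    while s != n_states - 1:
        # Explore occasionally, otherwise act greedily on current Q-values.
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), n_states - 1)
        # Positive feedback at the diamond, a small penalty for every other step.
        r = 1.0 if s2 == n_states - 1 else -0.01
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2

# The learned greedy policy: the action each non-terminal state prefers.
policy = [max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)]
```

After training, every state prefers moving right, toward the reward, which is exactly the behavior the reward/penalty feedback described above is meant to produce.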

2.3 Features Extraction

Feature Extraction aims to reduce the number of features in a dataset by creating new

features from the existing ones (and then discarding the original features). These new

reduced sets of features should then be able to summarize most of the information contained

in the original set of features. In this way, a summarized version of the original features can

be created from a combination of the original set (Gemescu et al. 2019). The process of

feature extraction is useful when you need to reduce the number of resources needed for

processing without losing important or relevant information. Feature extraction can also

reduce the amount of redundant data in a given analysis. Reducing the data, and the machine’s effort in building variable combinations (features), also speeds up the learning and generalization steps of the machine learning process.


Practical uses of feature extraction include:

Autoencoders: The purpose of autoencoders is unsupervised learning of efficient data

coding. Feature extraction is used here to identify key features in the data for coding by

learning from the coding of the original data set to derive new ones.

Bag-of-Words: A technique for natural language processing that extracts the words

(features) used in a sentence, document, website, etc., and classifies them by frequency of

use. This technique can also be applied to image processing.
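The feature-extraction idea above can also be sketched with Principal Component Analysis (PCA), a standard technique not named in the text. The example below is illustrative (scikit-learn assumed; the data is synthetic, with four observed features built from two underlying signals):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))          # two underlying signals
mix = np.array([[1.0, 2.0], [3.0, 4.0]])
X = np.hstack([base, base @ mix])         # four observed, redundant features

# Extract two new combined features that summarize the original four.
pca = PCA(n_components=2).fit(X)
X_new = pca.transform(X)
```

Because the four columns really carry only two signals, the two extracted components retain essentially all of the variance — a "summarized version of the original features" in exactly the sense described above.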

Recursive Feature Elimination (RFE) is a feature selection method used to identify a dataset’s key features. The process repeatedly removes the least significant features and develops a model with those remaining, until the desired number of features is reached. Although RFE can be used with any supervised learning method, Support Vector Machines (SVM) are the most popular pairing. RFE is a wrapper-type feature selection algorithm: a different machine learning algorithm is given, used at the core of the method, and wrapped by RFE to help select features. This is in contrast to filter-based feature selection, which scores each feature and selects those with the largest (or smallest) scores. Technically, RFE is a wrapper-style feature selection algorithm that also uses filter-based feature selection internally. RFE works by searching for a subset of features, starting with all features in the training dataset and successively removing features until the desired number remains. This is achieved by fitting the given machine learning algorithm at the core of the model, ranking features by importance, discarding the least important features, and re-fitting the model. The process is repeated until the specified number of features remains.
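A minimal sketch of this elimination loop, assuming scikit-learn's RFE wrapper around a linear SVM (the synthetic data and feature counts here are illustrative, not the study's):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# 200 samples, 10 features of which only 3 are informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Wrap a linear SVM: fit, rank features by importance, drop the weakest, refit.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=3).fit(X, y)
mask = rfe.support_      # True for the features that survived elimination
ranking = rfe.ranking_   # 1 = selected; larger numbers were eliminated earlier
```

Note that the wrapped estimator must expose a feature-importance signal (here the linear SVM's coefficients), which is what RFE uses to decide which feature to discard at each round.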

Chi-Square Test Algorithm


A chi-squared test (also chi-square or χ2 test) is a statistical hypothesis test used in the

analysis of contingency tables when the sample sizes are large. In simpler terms, this test is

primarily used to examine whether two categorical variables (two dimensions of the

contingency table) are independent in influencing the test statistic (values within the

table). The test is valid when the test statistic is chi-squared distributed under the null

hypothesis, specifically Pearson's chi-squared test and variants thereof. Pearson's chi-squared

test is used to determine whether there is a statistically significant difference between the

expected frequencies and the observed frequencies in one or more categories of a

contingency table. For contingency tables with smaller sample sizes, a Fisher's exact test is

used instead.
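For concreteness, here is a small sketch of Pearson's chi-squared test on a contingency table (SciPy assumed; the counts are invented):

```python
from scipy.stats import chi2_contingency

# Invented 2x2 table: rows = exposed / not exposed, columns = disease / no disease.
table = [[30, 10],
         [15, 45]]

# Compare the observed counts with the frequencies expected under independence.
stat, p, dof, expected = chi2_contingency(table)
```

A small p-value indicates the two categorical variables are unlikely to be independent; for tables with much smaller counts, Fisher's exact test would be used instead, as noted above.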
CHAPTER THREE

RESEARCH METHODOLOGY

3.1 Data Acquisition

The dataset used in this study was obtained from the UCI Machine Learning Repository. It

contains 13 attributes, including demographic, medical, and lifestyle features. The dataset

was preprocessed to handle missing values and normalize the data. Recursive Feature

Elimination (RFE) was applied to select relevant features. RFE is a wrapper method that

recursively eliminates features to find the optimal subset. The Support Vector Machine

(SVM) classifier was used to evaluate the performance of RFE. The dataset was split into

training and testing sets to evaluate the model's performance.

The performance metrics used were accuracy, F1-score, and Area Under the Receiver

Operating Characteristic Curve (AUC-ROC). RFE selected 7 features out of 13, reducing

dimensionality and improving model performance. The results showed that RFE-SVM

outperformed the full feature set SVM model. RFE improved accuracy, F1-score, and AUC-

ROC by 5%, 6%, and 4%, respectively.

The study demonstrates the effectiveness of RFE in feature selection for heart disease

prediction. RFE can improve model performance and reduce feature space, leading to better

generalization.

The results suggest that RFE is a valuable tool for feature selection in heart disease

prediction.
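The pipeline described above — preprocessing, RFE-based selection of 7 of 13 features, SVM classification, and evaluation by accuracy, F1-score, and AUC-ROC — can be sketched as follows. This is a hedged reconstruction using scikit-learn and a synthetic 13-feature stand-in for the UCI data; it does not reproduce the repository's actual file or the reported 5%/6%/4% gains.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 13-attribute heart disease dataset.
X, y = make_classification(n_samples=300, n_features=13, n_informative=7,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Normalize, then let RFE keep 7 of the 13 features using a linear SVM.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
rfe = RFE(SVC(kernel="linear"), n_features_to_select=7).fit(X_train_s, y_train)

# Train the final SVM on the selected features and score on the held-out split.
clf = SVC(kernel="linear", probability=True).fit(rfe.transform(X_train_s), y_train)
pred = clf.predict(rfe.transform(X_test_s))
proba = clf.predict_proba(rfe.transform(X_test_s))[:, 1]

acc = accuracy_score(y_test, pred)
f1 = f1_score(y_test, pred)
auc = roc_auc_score(y_test, proba)
```

Comparing these three metrics against the same SVM trained on all 13 features is how the RFE-versus-full-feature-set comparison reported above would be carried out.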

Table 3.1: A detailed list of dataset features and their possible values is shown below.

Feature          | Description                                                                          | Type
Age              | Patient's age                                                                        | Numeric
Sex              | Patient's gender (Male/Female)                                                       | Categorical
Chest Pain Type  | Type of chest pain (Typical Angina, Atypical Angina, Non-Anginal Pain, Asymptomatic) | Categorical
RestingBP        | Resting blood pressure (mmHg)                                                        | Numeric
Cholesterol      | Serum cholesterol level (mg/dL)                                                      | Numeric
FastingBS        | Fasting blood sugar level (mg/dL)                                                    | Numeric
RestingECG       | Resting electrocardiogram results (Normal, Abnormal)                                 | Categorical
MaxHR            | Maximum heart rate achieved (beats/min)                                              | Numeric
ExerciseAngina   | Exercise-induced angina (Yes/No)                                                     | Categorical
OldPeak          | ST depression induced by exercise                                                    | Numeric
ST_Slope         | Slope of the ST segment (Up, Flat, Down)                                             | Categorical
MajorVessels     | Number of major vessels colored by fluoroscopy                                       | Numeric
Thal             | Thalassemia (Normal, Fixed Defect, Reversible Defect)                                | Categorical
Target           | Heart disease present (Yes/No)                                                       | Categorical

3.2 Recursive Feature Elimination

RFE is a feature selection algorithm used to select relevant features for heart disease

prediction. It recursively eliminates features to find the optimal subset. RFE uses a wrapper

approach to evaluate feature subsets. It starts with all features and recursively eliminates the

least important one. The least important feature is determined by the classifier's performance.

The classifier used in this study is Support Vector Machine (SVM). RFE-SVM is trained on
the dataset and performance is evaluated.

The feature with the lowest importance score is eliminated. The process is repeated until a

stopping criterion is reached. The stopping criterion is a predefined number of features. RFE reduces dimensionality and improves model performance. It selects the most relevant

features for heart disease prediction. RFE improves accuracy, F1-score, and AUC-ROC. It reduces overfitting and improves generalization. RFE is robust to noise and irrelevant

features. It is computationally efficient and easy to implement. RFE is widely used in

machine learning and data mining.

It is particularly useful for high-dimensional datasets. RFE helps identify important risk

factors for heart disease. It can aid in early diagnosis and treatment of heart disease.

3.3 Chi square test Algorithm

The Chi-Square test is a feature selection algorithm used to select relevant features for heart

disease prediction. It evaluates the independence of each feature with the target variable. The

algorithm calculates the Chi-Square statistic and p-value for each feature. Features with a low p-value (typically below 0.05) are selected. The selected features are used to train an SVM classifier. SVM is a

popular machine learning algorithm for classification tasks. The classifier is trained on the

selected features to predict heart disease. The Chi-Square test selected 5 features out of 13,

reducing dimensionality by 62%. The selected features are Age, Chest Pain Type, Resting

BP, Cholesterol, and ST_Slope. These features are related to patient demographics, medical

history, and exercise test results. The Chi-Square test improved accuracy by 3%, F1-score by

4%, and AUC-ROC by 2%. The results suggest that the Chi-Square test is effective in

feature selection for heart disease prediction.

The algorithm is simple to implement and computationally efficient. It is widely used in data

analysis and machine learning tasks. The Chi-Square test is a univariate feature selection

method.
It evaluates each feature independently, without considering interactions. The algorithm is sensitive to small sample sizes and low expected cell frequencies. The Chi-Square test is a popular choice for feature

selection in many applications. It can be used in conjunction with other feature selection

algorithms.

The combination of Chi-Square test and SVM classifier shows promising results for heart

disease prediction.
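A sketch of this Chi-Square-then-SVM combination is shown below. It assumes scikit-learn, and the data is synthetic and non-negative (sklearn's chi2 scorer requires non-negative feature values); the selected column indices here are invented, not the Age/Chest Pain Type/RestingBP/Cholesterol/ST_Slope set reported above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# Five informative count features (shifted by class) plus eight noise features.
informative = rng.integers(0, 5, size=(n, 5)) + 5 * y[:, None]
noise = rng.integers(0, 10, size=(n, 8))
X = np.hstack([informative, noise])          # 13 features, like the dataset

# Keep the 5 features whose Chi-Square statistic against the target is largest.
selector = SelectKBest(chi2, k=5).fit(X, y)
X_sel = selector.transform(X)

# Train the SVM classifier on the reduced feature set.
clf = SVC(kernel="rbf").fit(X_sel, y)
```

Because each informative feature's counts shift with the class label, the Chi-Square scores single them out, and the SVM trains on only the 5 retained columns.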

3.4 Classification

SVM is a popular machine learning algorithm for classification tasks. It is widely used for

heart disease prediction using various datasets. The heart disease dataset contains patient

characteristics and medical features. SVM classifier takes these features as inputs to predict

heart disease likelihood. The algorithm finds a hyperplane that maximally separates heart

disease cases from non-cases. This hyperplane is used to predict the class (heart disease or

not) of new instances. SVM is robust to noise and outliers in the dataset. It handles high-

dimensional data with a small number of samples. The algorithm is non-linear, allowing for

complex decision boundaries. SVM is trained on a labeled dataset to learn patterns and

relationships. The trained model is then used to predict heart disease for unseen instances.

SVM has been shown to outperform other algorithms in heart disease prediction tasks. It

achieves high accuracy, F1-score, and AUC-ROC metrics. The algorithm is sensitive to

kernel choice and parameter tuning. Common kernels used include linear, radial basis

function (RBF), and polynomial. Grid search and cross-validation are used for hyperparameter tuning. SVM is implemented in various programming languages, including Python

and R. Libraries like scikit-learn and TensorFlow provide implementation of SVM. SVM is a

powerful tool for heart disease prediction and has many applications. It can aid in early

diagnosis and treatment of heart disease, improving patient outcomes.
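The kernel and hyperparameter tuning described above can be sketched with a cross-validated grid search (scikit-learn assumed; the grid values and synthetic data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=8, random_state=0)

# 5-fold cross-validated search over kernel choice and regularization strength C.
param_grid = {"kernel": ["linear", "rbf", "poly"], "C": [0.1, 1, 10]}
grid = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

best = grid.best_params_   # the kernel/C pair with the best cross-validation score
```

Each candidate kernel/C combination is scored by cross-validation, and the best pair is refit on the full training data, which is the usual way the kernel sensitivity noted above is handled in practice.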


3.5 Explanation of Chi square Test Algorithm

The Chi-Square test helps identify the most important risk factors and select the relevant

features, leading to more accurate predictions and better patient outcomes. Here's a step-by-

step example of how the Chi-Square test is used in heart disease prediction:

Step 1: Collect data on patient characteristics (age, gender, etc.) and medical features (blood

pressure, cholesterol levels, etc.).

Step 2: Apply the Chi-Square test to identify significant associations between each feature

and heart disease.

Step 3: Select the top-ranked features based on their Chi-Square statistics and p-values.

Step 4: Build a predictive model using the selected features.

Step 5: Evaluate the model's performance using metrics like accuracy and AUC-ROC.

Step 6: Use the model to predict the likelihood of heart disease for new patients.

By leveraging the Chi-Square test, healthcare professionals can identify high-risk patients

and provide targeted interventions, improving patient outcomes and reducing healthcare

costs.

3.6 Software and Hardware Requirements

The requirements needed to implement this system are as follows:

3.6.1 Hardware Requirements

The hardware requirements refer to the tangible (physical) components to be used for the development of the system: a personal computer (PC) such as a MacBook Air with 4 GB RAM, a 256 GB hard drive, and a Core i3 processor or higher.


3.6.2 Software Requirements

Windows 8 or a later operating system (or macOS on a MacBook Air or newer) can be used for the deployment of this system. The Terminal or Command Prompt, a cross-platform (X) stack with Apache (A), and Python 3 will all be used in the project to develop the system. Visual Studio Code is the software package that will be used to create the source files so the system can be run from the terminal.
