Final UROP Report - Heart Attack Detection Using Machine Learning
Bachelor of Technology
In
Computer Science and Engineering
School of Engineering and Sciences
Submitted by
B P V Manikanteswara Rao (AP21110011588)
Y N V Sai Prakash (AP21110011583)
B Sathyam (AP21110011562)
M Sai Venkat (AP21110011582)
M Manohar Naik (AP21110011602)
SRM University–AP
Neerukonda, Mangalagiri, Guntur
Andhra Pradesh – 522 240
November 2023
Certificate
Date: 27-Nov-23
This is to certify that the work presented in this project entitled “Heart Attack Detection
Using Machine Learning” has been carried out by Sai Prakash, Sathyam,
Manikanteswara Rao, Sai Venkat, and Manohar Naik under my supervision.
The work is genuine, original, and suitable for submission to SRM University – AP
for the award of Bachelor of Technology in the School of Engineering and Sciences.
Supervisor
(Signature)
Mr. Shaiju Panchikkil
Assistant Professor,
Department of Computer Science and Engineering.
Acknowledgments
We would like to express our gratitude to everyone who played an important role in
the success of this project. Completing this project under expert guidance has been an
excellent opportunity, and we consider ourselves extremely fortunate to have been able
to interact with the professionals who guided us throughout this project.
We thank our supervisor, Mr. Shaiju Panchikkil, who provided insights and
expertise whenever required, which greatly assisted the project. Additionally, we
want to express our gratitude to him for sharing his knowledge with us during
the course of the project.
We see this opportunity as a milestone in the progression of our careers.
To achieve our intended career objectives, we will try to utilize the acquired skills and
knowledge as effectively as possible and continue to develop them.
Table of Contents
Certificate .................................................................................................................................. i
Acknowledgments ..................................................................................................................ii
Table of Contents .................................................................................................................. iii
Abstract .................................................................................................................................. iv
Abbreviations ..........................................................................................................................v
List of Tables .......................................................................................................................... vi
List of Figures ....................................................................................................................... vii
System Requirements ......................................................................................................... viii
1. Introduction ......................................................................................................................... 1
2. Literature Surveys .............................................................................................................. 2
3. Methodology ....................................................................................................................... 4
3.1 Data Collection........................................................................................................................... 5
3.2 Data Preprocessing .................................................................................................................... 5
3.3 Split Data .................................................................................................................................... 6
3.4 Classification Methods.............................................................................................................. 7
3.5 Testing of Model: ..................................................................................................................... 10
4. Results Discussion ............................................................................................................ 12
5. Conclusion ......................................................................................................................... 22
References .............................................................................................................................. 23
Abstract
Heart attack is the leading cause of death worldwide. Millions of people have heart
attacks every year. Early detection of heart attack symptoms is important to improve
patient outcomes and reduce mortality. In recent years, machine learning (ML) has
become a powerful tool for clinical applications, including early detection of heart
disease. This project aims to develop a machine learning system using algorithms
such as logistic regression (LR), decision trees (DT), and support vector
machines (SVM) to detect heart attacks. A Kaggle dataset is used for the empirical
study. After some pre-processing, we use the data to train each model and measure its
accuracy on held-out test data to find the best machine learning model for heart
attack diagnosis.
Abbreviations
ML Machine Learning
CM Confusion Matrix
CR Classification Report
LR Logistic Regression
DT Decision Tree
SVC/SVM Support Vector Classifier / Support Vector Machine
GSCV Grid-Search-CV
TP True Positive
FP False Positive
TN True Negative
FN False Negative
List of Tables
List of Figures
System Requirements
1. Introduction
One of the most important parts of the human body is the heart. The heart, a
fist-sized muscle, works together with the blood vessels to form the body's
circulatory system, pumping blood throughout the body. Abnormalities in the heart's
blood flow can lead to many heart conditions, collectively known as cardiovascular
disease (CVD). Heart attack is the leading cause of death worldwide.
WHO’s figures serve as a sobering reminder of the worldwide impact of these
cardiovascular diseases. Globally, cardiovascular diseases contribute to about 17.5 million
deaths each year, with over 75 percent of these fatalities occurring in low- and
middle-income countries, and roughly 80 percent of
cardiovascular disease-related deaths are specifically linked to heart attacks and
strokes. With such a large majority of these deaths occurring in low- and middle-income
countries, there is an urgent need for affordable and effective solutions to tackle the
incidence of heart disease.
This project aims to address the critical need for early detection of heart attacks
through the application of machine learning techniques. The input dataset
comprises 13 numerical features derived from various aspects of patient health. We
use several algorithms, such as LR, DT, and SVM, each of which outputs a binary
label: "1" indicates a heart attack, and "0" indicates no heart attack.
Finally, we identify the best of the algorithms used by means of evaluation metrics
such as accuracy, precision, and F1-score, so that the model can support further studies
and future enhancements of this work and help reduce deaths.
2. Literature Surveys
decisions based on certain parameters. One measurement during the testing
and training phases yielded 86.3% accuracy in the testing phase and 87.3%
accuracy in the training phase.
6. In 2018, Kumar Dwivedi employed SVM and KNN algorithm models on the
UCI data set and found that SVM obtained 82% accuracy.
7. In 2019, A Lakshmanarao employed the SVM algorithm on a data set of 644
samples of Ten-Year-CHD and obtained 82.30% accuracy.
3. Methodology
3.1 Data Collection
The first step involves collecting research-related data. This data is pivotal, as it forms
the foundation for subsequent analysis and for the development of the model. The data is
obtained from a Kaggle dataset.
3.2 Data Preprocessing
This stage covers the preprocessing of the collected data, which is essential for
ensuring data quality and consistency. It involves handling missing values,
removing null values, encoding the categorical values, and standardizing the data
to meet the specific requirements of the algorithms.
Here, we check whether there are any null values or duplicated records in the dataset.
Any null values are replaced with the column mean, and any duplicated records
are removed.
Afterwards, we divide the attributes of the data into two types of variables and encode the
categorical variables as numerical (continuous) values.
Categorical Variables:
1) SEX: Male=1, Female=0.
2) CP: Type of Chest pain - (0, 1, 2, 3).
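As a rough illustration of these preprocessing steps, the sketch below uses pandas; the file name heart.csv and the lower-case column names (e.g. sex, cp) are assumptions based on the commonly used Kaggle heart-attack dataset, not details taken from this report.

```python
import pandas as pd

# Load the Kaggle CSV (file name is an assumption).
df = pd.read_csv("heart.csv")

# Replace any null values with the column mean, as described above.
df = df.fillna(df.mean(numeric_only=True))

# Remove duplicated records.
df = df.drop_duplicates()

# Treat the categorical attributes (e.g. sex, cp) as numeric codes so that
# they can be used alongside the continuous attributes.
for col in ["sex", "cp"]:
    df[col] = df[col].astype(int)

print(df.shape, df.isnull().sum().sum())
```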
The correlation matrix for the dataset is shown in Figure 2.
Here, we removed FBS and CHOL from the data since they have very low correlation
with the output. We then processed the data by encoding the categorical values as
numerical values and standardizing all values using a standard scaler; after
standardization, we also removed the SEX attribute because it has little significance
for this problem.
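A minimal sketch of the correlation-based feature removal and standardization described above; the column names chol, fbs, sex and the target column output are assumptions based on the common Kaggle heart-attack dataset.

```python
from sklearn.preprocessing import StandardScaler

# Inspect how strongly each feature correlates with the target.
print(df.corr()["output"].sort_values())

# Drop the weakly correlated features and the sex attribute; keep the target separate.
X = df.drop(columns=["chol", "fbs", "sex", "output"])
y = df["output"]

# Standardize the remaining features (zero mean, unit variance).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```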
3.3 Split Data
This step involves randomly splitting the dataset into training data and test data
in the ratio of 80:20. The training data is used to train a particular model, and the
test data is then used with that model to evaluate how accurately it predicts.
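A sketch of the 80:20 split with scikit-learn; the random_state value is only an illustrative choice for reproducibility and is not taken from the report.

```python
from sklearn.model_selection import train_test_split

# 80% of the data for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
```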
3.4 Classification Methods
Following the completion of data splitting, we use the LR, SVM, and DT algorithms
to train models on the training data.
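A minimal sketch of this training step; the default scikit-learn hyperparameters shown here are an assumption for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One instance of each classifier, trained on the same training split.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
```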
impurity or uncertainty in the data. The root node is split into
branches by maximizing information gain (IG), a measure of the reduction in
entropy achieved by a given split. This process continues recursively
until a leaf node is reached, which gives the final prediction.
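The following small sketch illustrates the entropy and information-gain calculation described above for a binary split; it is an illustration of the idea, not code from the report.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a label array (0/1 in this project).
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def information_gain(parent, left, right):
    # Reduction in entropy achieved by splitting `parent` into `left`/`right`.
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children
```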
Figure 5: Basic Graph of SVM
Here, we used two methods: one is the normal method without GSCV, and the
other uses GSCV with 5-fold cross-validation for hyperparameter tuning, so that
accuracy can be improved.
Define the parameter grid: The first step is to define a grid of hyperparameter values
to search. The grid should contain a set of values that are considered appropriate for
the model and dataset.
Select the best hyperparameters: The combination of hyperparameter values that
yields the best average performance is selected as the optimal set of hyperparameters.
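A sketch of 5-fold GridSearchCV for the Decision Tree; the particular grid values below are assumptions for illustration (the report only mentions that the tuned tree ended up with a depth of 5 and the random splitter).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Step 1: define the parameter grid to search over.
param_grid = {
    "max_depth": [3, 4, 5, 6],
    "splitter": ["best", "random"],
    "criterion": ["gini", "entropy"],
}

# Steps 2-3: evaluate every combination with 5-fold cross-validation and
# keep the combination with the best mean accuracy.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```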
3.5 Testing of Model
The trained model in both cases is applied to a separate set of test data to obtain
meaningful results. In other words, the model’s performance is assessed by applying
it to data distinct from that used for training.
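A short sketch of this testing step, continuing the earlier sketches: each fitted model, and the tuned tree returned by GridSearchCV, is applied to the held-out 20% test split.

```python
# Collect test-set predictions for every model.
y_preds = {name: model.predict(X_test) for name, model in models.items()}
y_preds["DT (GSCV)"] = grid.best_estimator_.predict(X_test)
```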
This phase involves the evaluation of the proposed models. It includes the
confusion matrix and a classification report containing precision, recall, and F1-score.
The algorithm that gives the best accuracy is considered the best in this analysis.
We used Accuracy, Precision, Recall, and F1-score to evaluate each
model. These metrics are based on the confusion matrix, which compares the
values predicted by the model with the actual values.
The confusion matrix consists of:
True Positive (TP):
The number of samples that actually belong to the positive class (presence of heart
attack) and are correctly predicted by the model as positive.
False Positive (FP):
The number of samples that actually belong to the negative class (absence of heart
attack) but are incorrectly predicted by the model as positive.
True Negative (TN):
The number of samples that actually belong to the negative class (absence of heart
attack) and are correctly predicted by the model as negative.
False Negative (FN):
The number of samples that actually belong to the positive class (presence of heart
attack) but are incorrectly predicted by the model as negative.
These four counts are used to construct the confusion matrix, from which the
evaluation metrics are calculated. The basic confusion matrix for this project is shown in
Figure 6.
The evaluation metrics are:
1. Accuracy: Measures the proportion of predictions that are correct out of the total
number of predictions made.
Accuracy = (TP + TN) / (TP + FP + FN + TN)
2. Precision: Measures the proportion of samples predicted as positive that actually
belong to the positive class.
Precision = TP / (TP + FP)
3. Recall: Measures the model’s ability to identify all relevant instances of the
positive class.
Recall = TP / (TP + FN)
4. F1-Score: The harmonic mean of precision and recall, which balances the two metrics.
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
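Continuing the sketches above, these metrics can be computed directly from the confusion-matrix counts, mirroring the formulas listed here (scikit-learn's classification_report produces the same numbers).

```python
from sklearn.metrics import confusion_matrix

for name, y_pred in y_preds.items():
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name}: accuracy={accuracy:.4f}, precision={precision:.4f}, "
          f"recall={recall:.4f}, F1={f1:.4f}")
```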
The confusion matrices of LR, DT, and SVM without Grid-Search-CV are shown in
Figure 7, Figure 8, and Figure 9, respectively, and those obtained with Grid-Search-CV
are shown in Figure 10, Figure 11, and Figure 12, respectively.
4. Results Discussion
This section discusses the experiments we have carried out and the results
that were obtained.
We used a Kaggle dataset that is publicly available on the Kaggle website. It consists of
13 features of both continuous and categorical types.
We encoded the categorical values as continuous values for the best performance of the
models. Here, we removed ‘CHOL’ and ‘FBS’ from the data due to their poor correlation
with the output. After removing them, we standardized the dataset and then removed
‘SEX’, which does not play a very important role in this type of project.
Even when we added the removed attributes back to the data, there was no meaningful
increase in accuracy, so we kept only the selected features. We split the data in the
ratio of 80:20 because it gave better accuracy than other ratios and is
the standard split commonly used.
Here, when we use the normal method without GSCV, we get accuracy values of
86.88% for the LR model, 73.77% for the DT classifier, and 85.24% for SVC. After applying
GSCV with 5-fold cross-validation, we found that the accuracy of the DT increased
to 77.04%, while the accuracy of the remaining two models remained unchanged. This
suggests that hyperparameter tuning of the data does not make much difference for
these two algorithms, which are inherently well suited to binary classification compared
with the DT. We think that the change in the accuracy of the DT may be related to the
change in the depth of the tree from 4 to 5 and the change of the DT splitter from best to random.
Logistic Regression (LR) shows the best results in both methods, with a
slight edge in terms of F1-score and accuracy
in both the confusion matrix and the classification report. This indicates that Logistic
Regression may be the best model among those tested for this particular binary
classification project. Therefore, LR is the best of the three models for
detecting heart attack due to its high accuracy.
Figure 6: Basic CM for this project.
Figure 7: CM of LR without GSCV
Figure 8: CM of DT without GSCV
Figure 9: CM of SVM without GSCV
Figure 10: CM of LR with GSCV
Figure 11: CM of DT with GSCV
Figure 12: CM of SVM with GSCV
Figure 13: Histogram of data
Table: Model-wise Accuracy, Precision, Recall, and F1-Score.
5. Conclusion
This research analysis on the detection of heart attacks has yielded valuable insights
regarding ML techniques. The study focused on the assessment of key
performance metrics such as Accuracy, Precision, Recall, and F1-score, which serve as vital
indicators for evaluating the quality of predictive models. One notable finding from
the analysis is that logistic regression emerged as the top-performing technique, with
an accuracy of 86.88%; SVM follows closely behind LR with an accuracy of 85.24%,
and the accuracy of DT improved slightly, from about 74% to 77%, due to hyperparameter
tuning. We can therefore say that logistic regression, a
well-established supervised learning algorithm, proved effective in classifying
individuals by the presence or absence of heart attack. This finding suggests that logistic
regression can be a reliable choice for such binary classification tasks and is the best
method for this project.
6. Future Work
In the future, this work could have a greater impact if more deep-learning techniques and
machine-learning algorithms were added, together with additional data, so that it
could provide a more effective solution for heart attacks and related diseases.
We could also use real-time data from various healthcare organizations rather than
recorded data, so that the system can give accurate results for the current period.
References
2. Soni, Jyoti, et al. "Predictive data mining for medical diagnosis: An overview
of heart disease prediction." International Journal of Computer Applications
17.8 (2011): 43-48.
3. Dangare, Chaitrali S., and Sulabha S. Apte. "Improved study of heart disease
prediction system using data mining classification techniques." International
Journal of Computer Applications 47.10 (2012): 44-48.
4. Uyar, Kaan, and Ahmet İlhan. "Diagnosis of heart disease using genetic
algorithm based trained recurrent fuzzy neural networks." Procedia computer
science 120 (2017): 588-593.
8. R, T. P., and Thomas, J. "Human Heart Disease Prediction System Using Data
Mining Techniques." (2016).