
Heart Attack Detection Using Machine Learning

Project submitted to the


SRM University – AP, Andhra Pradesh
for the partial fulfillment of the requirements to award the degree of

Bachelor of Technology
In
Computer Science and Engineering
School of Engineering and Sciences

Submitted by
B P V Manikanteswara Rao (AP21110011588)
Y N V Sai Prakash (AP21110011583)
B Sathyam (AP21110011562)
M Sai Venkat (AP21110011582)
M Manohar Naik (AP21110011602)

Under the Guidance of


(Mr. Shaiju Panchikkil)

SRM University–AP
Neerukonda, Mangalagiri, Guntur
Andhra Pradesh – 522 240
Nov, 2023
Certificate

Date: 27-Nov-23

This is to certify that the work presented in this project, entitled “Heart Attack
Detection Using Machine Learning”, has been carried out by Sai Prakash, Sathyam,
Manikanteswara Rao, Sai Venkat, and Manohar Naik under my/our supervision.
The work is genuine, original, and suitable for submission to SRM University – AP
for the award of Bachelor of Technology/Master of Technology in the School of
Engineering and Sciences.

Supervisor

(Signature)
Mr. Shaiju Panchikkil

Assistant Professor,
Department of Computer Science and Engineering.

Acknowledgments

We would like to express our gratitude to everyone who played an important role in
the success of this project. Completing this project under their guidance was an
excellent opportunity, and we consider ourselves extremely fortunate to have
interacted with the professionals who guided us throughout.

We thank our supervisor, Mr. Shaiju Panchikkil, who provided insights and expertise
whenever required, which greatly assisted the course project. We are also grateful to
everyone who shared their knowledge with us during the course project.

We regard this opportunity as a milestone in the progression of our careers. To
achieve our intended career objectives, we will try to use the acquired skills and
knowledge as effectively as possible and continue to develop them in our careers.

Table of Contents

Certificate
Acknowledgments
Table of Contents
Abstract
Abbreviations
List of Tables
List of Figures
System Requirements
1. Introduction
2. Literature Surveys
3. Methodology
3.1 Data Collection
3.2 Data Preprocessing
3.3 Split Data
3.4 Classification Methods
3.5 Testing of Model
3.6 Result Evaluation
4. Results Discussion
5. Conclusion
6. Future Work
References
Abstract

Heart attacks are a leading cause of death worldwide, affecting millions of people
every year. Early detection of heart attack symptoms is important for improving
patient outcomes and reducing mortality. In recent years, machine learning (ML) has
become a powerful tool for clinical applications, including the early detection of
heart disease. This project aims to develop a machine learning system that detects
heart attacks using logistic regression (LR), decision tree (DT), and support vector
classifier (SVC) algorithms. A publicly available Kaggle dataset is used for the
empirical study. After preprocessing, we train each model on the data, test it, and
compare accuracy to find the best machine learning model for heart attack
diagnosis.

Abbreviations

ML       Machine Learning
CM       Confusion Matrix
CR       Classification Report
LR       Logistic Regression
DT       Decision Tree
SVC/SVM  Support Vector Classifier / Support Vector Machine
GSCV     Grid-Search-CV
TP       True Positive
FP       False Positive
TN       True Negative
FN       False Negative
List of Tables

Table 1. Evaluation Metrics without GSCV
Table 2. Evaluation Metrics with GSCV
List of Figures

Figure 1. Flow Chart of Proposed Work
Figure 2. Correlation Matrix of data
Figure 3. Sigmoid Function Graph
Figure 4. Basic Structure of DT
Figure 5. Basic Graph of SVM
Figure 6. Basic CM for this project
Figure 7. CM of LR without GSCV
Figure 8. CM of DT without GSCV
Figure 9. CM of SVM without GSCV
Figure 10. CM of LR with GSCV
Figure 11. CM of DT with GSCV
Figure 12. CM of SVM with GSCV
Figure 13. Histograms of Data
System Requirements

1. Operating system: Windows 8 or later, on a laptop or PC.
2. Python 3.7 or later, with the required Python libraries installed.
3. Jupyter Notebook installed.
4. 4 GB or more of RAM.
5. 16 GB or more of hard disk space.
1. Introduction

The heart is one of the most important organs in the human body. It is a fist-sized
muscle that, together with the blood vessels, forms the circulatory system and pumps
blood throughout the body. Abnormalities in the heart's blood flow can lead to many
heart conditions, including cardiovascular disease (CVD). Heart attacks are a leading
cause of death worldwide.

Data from the World Health Organization (WHO) are a sobering reminder of the
worldwide effect of cardiovascular diseases. Globally, heart attacks and strokes
contribute to 17.5 million deaths, and over 75 percent of deaths attributed to
cardiovascular disease occur in low- and middle-income countries. Moreover, 80
percent of cardiovascular disease-related deaths are specifically linked to strokes
and heart attacks. Because most of these deaths occur in low- and middle-income
countries, there is an urgent need for affordable and effective solutions to reduce
the incidence of heart disease.

Cardiovascular diseases, including heart attacks, continue to be a leading cause of
mortality worldwide. Timely and accurate detection of impending heart problems is
crucial for effective intervention and improved patient outcomes. In medical
diagnostics, ML has become a powerful tool for leveraging large datasets to improve
predictive models and decision support systems. This work therefore proposes a
machine learning approach to heart attack prediction and verifies it on a publicly
available Kaggle dataset.

This project aims to address the critical need for early detection of heart attacks
through the application of machine learning techniques. The input dataset comprises
13 numerical features derived from various aspects of patient health. We use several
algorithms, namely LR, DT, and SVM, to output a binary label: "1" indicates a heart
attack, and "0" indicates no heart attack.

Finally, we identify the best of these algorithms using evaluation metrics such as
accuracy, precision, and F1-score, so that the model can support further studies and
future enhancements on this topic and help reduce deaths.

2. Literature Surveys

1. In 2007, Boleslaw Szymanski et al. exploited the sparse computation
capability of SUPANOVA. The researchers applied this approach to a standard
BHM dataset to uncover heart illnesses, measure heart activity, and forecast
heart disorders, and reported a result of 83.7 using the SVM algorithm with a
spline kernel. The spline kernel produced high-quality output on the standard
BHM dataset.
2. In 2011, Ujma Ansari achieved a notable 99% accuracy in predicting heart
disease using a Decision Tree model. This success inspired our exploration of
an improved variant, Random Forest, known for its enhanced generalization.
However, replicating Ansari's work posed challenges as their paper referenced
a dataset with 3000 instances, yet the publicly available heart disease dataset
on UCI comprises only 303 instances. The absence of clarity on the dataset's
source in Ansari's work raises concerns about data integrity and transparency.
3. In 2012, Santana Krishnan. J and Chaitrali S. D. presented a paper titled
“Predicting Heart Disease Using Machine Learning Algorithms,” which uses
decision trees and the Naive Bayes algorithm to predict heart disease. In a
decision tree algorithm, the tree is built using decision criteria that evaluate
to true or false. Algorithms such as SVM and KNN produce results based on
vertical or horizontal split conditions that depend on different parameters,
whereas a decision tree, which has a tree-like structure of roots, branches,
and leaves, bases its result on the decisions made at each node. Decision
trees also help reveal the importance of attributes in the data set. The
authors used the Cleveland dataset, divided into 70% training and 30% testing
data. The accuracy of the decision tree algorithm was 91%. The second
algorithm, Naive Bayes, is used for classification. It can handle complex,
non-linear, dependent data, so it is also suitable for heart disease data,
which share these characteristics. The accuracy of this algorithm was 87%.
4. In 2017, Kaan Uyar and Ahmet İlhan used the same data set that we use in
this project. In their analysis, “the distribution category was defined as 54%
without heart disease and 46% with heart disease.” In the data set we
downloaded from Kaggle, 54% of the labels are 1s and 46% are 0s. Comparing
the two, 1 in the downloaded data means no heart disease and 0 means heart
disease. To make the results easier to understand, we swapped 1 and 0 in the
graph so that 1 represents heart disease, which accounts for the discrepancy
in our results [10].
5. In 2016, Purushottam, Saxena & Sharma proposed an efficient cardiovascular
disease prediction system using data mining. The system helps doctors make the
right decisions based on certain parameters. Their evaluation yielded 86.3%
accuracy in the testing phase and 87.3% accuracy in the training phase.
6. In 2018, Kumar Dwivedi employed SVM and KNN algorithm models on the
UCI data set and found that SVM obtained 82% accuracy.
7. In 2019, A Lakshmanarao employed the SVM algorithm on a data set of 644
samples of Ten-Year-CHD and obtained 82.30% accuracy.

3. Methodology

This study examines different machine learning (ML) algorithms: Logistic Regression
(LR), the Decision Tree (DT) classifier, and the Support Vector Machine (SVM). These
algorithms can aid medical analysts and practitioners in diagnosing heart attacks
accurately. The methodology is a systematic approach that converts raw data into
recognizable patterns, thereby enriching the knowledge available to users.
The proposed system consists of the sequence of steps shown in Figure 1 below.

Figure 1: Flow chart of proposed work.

3.1 Data Collection

The first step is to collect the data required for the study. This data is pivotal
because it forms the foundation for the subsequent analysis and model development.
The data is obtained from a publicly available Kaggle data set.

3.2 Data Preprocessing

This stage preprocesses the collected data, which is essential for ensuring data
quality and consistency. It involves handling missing (null) values, removing
duplicates, encoding the categorical values, and standardizing the data to meet the
requirements of the algorithms.

First, we check the data set for null values and duplicated rows. Any null value is
replaced with the mean of its column, and any duplicated row is removed.
Next, we divide the attributes into two types of variables, categorical and
continuous, and encode the categorical variables as numerical values.
Categorical Variables:
1) SEX: Male=1, Female=0.
2) CP: Type of chest pain - (0, 1, 2, 3).
3) FBS: Fasting blood sugar (> 120 mg/dl: 1, <= 120 mg/dl: 0).
4) RESTECG: Resting electrocardiographic result - (0, 1, 2).
5) EXNG: Exercise-induced angina (1 - Yes, 0 - No).
6) SLP: Peak exercise ST segment slope - (0, 1, 2).
7) CAA: Major vessels - (0, 1, 2, 3, 4).
8) THALL: Thallium stress - (0, 1, 2, 3).
Continuous Variables:
1) AGE: Age of the patient.
2) TRTBPS: Resting blood pressure.
3) CHOL: Cholesterol level.
4) THALACHH: Maximum Heart rate achieved.
5) OLDPEAK: Exercise-induced ST depression compared to rest.

Correlation Matrix for the data set is shown in Figure 2.

Figure 2: Correlation Matrix of data.

Here, we removed FBS and CHOL from the data because their correlation with the
output is very low. We then encoded the categorical values as numerical values and
standardized all values using a standard scaler. After standardizing the data, we
removed the SEX attribute because it has little significance for this problem.
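
The preprocessing steps described above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function name `preprocess`, the lowercase column names, and the sample rows are assumptions made for the example. It drops all three attributes before standardizing, which gives the same result as the order described above because each column is standardized independently.

```python
from statistics import mean, pstdev


def preprocess(rows, target="output", drop=("fbs", "chol", "sex")):
    """Clean a list of row dicts (column name -> number or None)."""
    columns = list(rows[0])
    cleaned = [dict(row) for row in rows]  # work on copies of the input rows

    # 1) Replace each null value with the mean of its column.
    for col in columns:
        known = [row[col] for row in cleaned if row[col] is not None]
        col_mean = mean(known)
        for row in cleaned:
            if row[col] is None:
                row[col] = col_mean

    # 2) Remove duplicated rows, keeping the first occurrence.
    seen, unique = set(), []
    for row in cleaned:
        key = tuple(row[col] for col in columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # 3) Drop the attributes judged to be of low value.
    kept = [col for col in columns if col not in drop]

    # 4) Standardize every remaining feature: z = (x - mean) / std.
    for col in kept:
        if col == target:
            continue  # the class label is left unchanged
        values = [row[col] for row in unique]
        mu, sigma = mean(values), pstdev(values)
        for row in unique:
            row[col] = (row[col] - mu) / sigma if sigma else 0.0

    return [{col: row[col] for col in kept} for row in unique]


# Hypothetical sample rows (not taken from the Kaggle data set).
sample = [
    {"age": 50.0, "chol": 200.0, "fbs": 0.0, "sex": 1.0, "output": 1},
    {"age": 60.0, "chol": None, "fbs": 1.0, "sex": 0.0, "output": 0},
    {"age": 50.0, "chol": 200.0, "fbs": 0.0, "sex": 1.0, "output": 1},
    {"age": 70.0, "chol": 240.0, "fbs": 0.0, "sex": 1.0, "output": 0},
]
processed = preprocess(sample)
```

In the sample, the third row duplicates the first and is removed, and the missing CHOL value is filled with the column mean before the column is dropped.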

3.3 Split Data

This step randomly splits the data set into training data and test data in the ratio
80:20. The training data is used to train each model, and the test data is then
applied to the trained model to check how accurately it predicts.
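
A random 80:20 split can be sketched as follows. The helper name `train_test_split_indices`, the fixed seed, and the sample size of 100 are assumptions for illustration only; the project's own split may have been produced differently.

```python
import random


def train_test_split_indices(n_samples, test_ratio=0.2, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


# Hypothetical data set of 100 rows: 80 go to training, 20 to testing.
train_idx, test_idx = train_test_split_indices(100)
```

Splitting by index keeps the features and labels aligned: the same index lists select rows from both.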

3.4 Classification Methods

Following the completion of data splitting, we utilize LR, SVM, and DT algorithms
to train the model using the train data.

1. Logistic Regression: It is a well-known ML algorithm that belongs to the
family of supervised learning techniques. Like linear regression, it learns
from a data set to make predictions, but it is mainly used for classification
tasks. It forms a linear combination of the input values and the coefficient
values and passes the result through a sigmoid (logistic) function to estimate
the likelihood of the outcome. The coefficients are fitted by maximum
likelihood estimation, a key concept in logistic regression, which selects the
values that make the observed outcomes most probable. The resulting
probability lies between 0 and 1 and indicates whether an event is likely to
occur or not.
The graph of the sigmoid function is shown in Figure 3.

Figure 3: Sigmoid Function Graph
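
To make the role of the sigmoid function concrete, the following sketch computes a logistic regression prediction for one sample. The weights, bias, and 0.5 threshold are hypothetical values chosen for the example, not coefficients fitted in this project.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1) without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(features, weights, bias):
    """Probability of class 1: sigmoid of the linear combination of inputs."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict(features, weights, bias, threshold=0.5):
    """Return 1 (heart attack) if the probability reaches the threshold."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical coefficients for two standardized features.
weights, bias = [1.0, 1.0], -1.0
label_high = predict([1.0, 2.0], weights, bias)    # z = 2, probability > 0.5
label_low = predict([-1.0, -2.0], weights, bias)   # z = -4, probability < 0.5
```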

2. Decision Tree: It is a supervised learning approach that can be used for both
classification and regression. The structure of a decision tree resembles a
branching tree, with the root node initiating the process and branches leading
to internal nodes and leaf nodes. Each internal node represents a decision rule
based on a feature, while a leaf provides the final prediction. Constructing
the tree involves calculating the entropy of attributes, which measures
impurity or uncertainty in the data. The root node is divided into subtrees or
branches by maximizing information gain (IG), a measure of the reduction in
entropy achieved by a given split. This process continues recursively until it
reaches a leaf, which gives the final result.

The basic structure of the decision tree is shown in Figure 4.

Figure 4: Basic structure of DT
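
Entropy and information gain, the two quantities used to choose a split, can be sketched as follows. The function names and the tiny label lists are illustrative assumptions; they show the calculation for a single candidate split rather than a full tree-building routine.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Reduction in entropy obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical node with two samples of each class.
parent = [0, 0, 1, 1]
perfect_split = information_gain(parent, [0, 0], [1, 1])  # classes fully separated
useless_split = information_gain(parent, [0, 1], [0, 1])  # classes still mixed
```

A split that separates the classes completely removes all uncertainty, while a split that leaves both branches as mixed as the parent gains nothing.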

3. Support Vector Machine: SVM is a powerful ML algorithm applicable to
regression and classification tasks, and it is particularly effective in binary
classification. It seeks the hyperplane that separates the classes with the
largest margin. The data points closest to this hyperplane, known as support
vectors, determine the margin. SVM extends beyond linear separation by using
kernel functions to transform the data so that the classes can be separated by
a linear hyperplane in the transformed space. This enables SVM to handle not
only linearly separable data but also non-linear patterns. To strike a balance
between margin maximization and classification error minimization, the
algorithm also includes a regularization parameter (C). After training, SVM
classifies new data points according to which side of the hyperplane they fall
on. To adapt SVM to various tasks and data types, it is essential to select an
appropriate kernel and regularization parameter.

The basic Graph of SVM is shown in Figure 5.

Figure 5: Basic Graph of SVM
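
For the linear case, the decision rule and the trade-off controlled by C can be sketched as below. The weight vector, bias, and sample points are hypothetical, and the objective shown is the standard soft-margin formulation, given here only as an illustration of how C weighs classification errors against margin width.

```python
def decision_value(x, w, b):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(x, w, b):
    """Return 1 if x lies on the positive side of the hyperplane, else 0."""
    return 1 if decision_value(x, w, b) >= 0 else 0


def soft_margin_objective(X, y, w, b, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses, with labels y in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1.0 - yi * decision_value(xi, w, b)) for xi, yi in zip(X, y)
    )
    return margin_term + C * hinge


# Hypothetical hyperplane x1 = 0 in two dimensions.
w, b = [1.0, 0.0], 0.0
well_separated = soft_margin_objective([[2.0, 0.0], [-2.0, 0.0]], [1, -1], w, b)
inside_margin = soft_margin_objective([[0.5, 0.0]], [1], w, b, C=2.0)
```

A larger C penalizes points inside the margin more heavily, which favors fewer training errors over a wider margin.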

Here, we used two methods: a baseline method without GSCV, and a method that uses
GSCV with 5-fold cross-validation to tune the hyperparameters of each model in
order to increase its accuracy.

Grid-Search-CV: It is a popular method for hyperparameter tuning in machine
learning. It is a search algorithm that evaluates all possible combinations of
hyperparameter values in a specified grid.

How GSCV works:

Define the parameter grid: The first step is to define a grid of hyperparameter values
to search. The grid should contain a set of values that are considered appropriate for
the model and dataset.

Perform cross-validation: GSCV uses cross-validation to evaluate model performance
for each combination of hyperparameter values. In cross-validation, the data set is
divided into folds; the model is trained on all folds except one (the validation
fold) and evaluated on the validation fold. This process is repeated so that each
fold serves once as the validation fold, and the average performance across the
folds is calculated.

Select the best hyperparameters: The combination of hyperparameter values that
yields the best average performance is selected as the optimal set of hyperparameters.
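
The three steps above can be sketched as a small, library-free grid search. Everything named here (`k_fold_indices`, `grid_search_cv`, the toy threshold "model", and the ten data points) is a hypothetical stand-in used to show the control flow; it is not the tuning code or the parameter grid used in this project.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        val = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, val
        start = stop


def grid_search_cv(fit, score, X, y, param_grid, k=5):
    """Try every parameter combination and keep the best average fold score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, val in k_fold_indices(len(X), k):
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            fold_scores.append(
                score(model, [X[i] for i in val], [y[i] for i in val])
            )
        average = sum(fold_scores) / len(fold_scores)
        if average > best_score:
            best_params, best_score = params, average
    return best_params, best_score


def fit_threshold(X_train, y_train, threshold):
    """Toy stand-in for training: the 'model' is just the given threshold."""
    return threshold


def accuracy(model, X_val, y_val):
    """Fraction of validation samples classified correctly by the threshold."""
    predictions = [1 if x >= model else 0 for x in X_val]
    return sum(p == t for p, t in zip(predictions, y_val)) / len(y_val)


# Hypothetical one-feature data: values of 6 and above belong to class 1.
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
best_params, best_score = grid_search_cv(
    fit_threshold, accuracy, X, y, {"threshold": [2.5, 5.5, 8.5]}, k=5
)
```

With a real classifier, `fit` would train on the training folds and the grid would hold settings such as the tree depth or the SVM regularization parameter.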

3.5 Testing of Model:

In both cases, the trained model is tested on a separate set of test data. The
model's performance is thus assessed on data that is distinct from the data used
for training.

3.6 Result Evaluation:

This phase evaluates the proposed models. It produces a classification report and a
confusion matrix, from which recall, precision, and F1-score are obtained. In this
analysis, the algorithm that gives the best accuracy is considered the best.
We used Accuracy, Precision, Recall, and F1-score to evaluate each model. These
metrics are based on the confusion matrix, which compares the values predicted by
the model with the actual values.
The confusion matrix consists of:
True Positive (TP): The number of samples that actually belong to the positive
class (presence of heart attack) and are correctly predicted as positive.
False Positive (FP): The number of samples that actually belong to the negative
class (absence of heart attack) but are incorrectly predicted as positive.
True Negative (TN): The number of samples that actually belong to the negative
class (absence of heart attack) and are correctly predicted as negative.
False Negative (FN): The number of samples that actually belong to the positive
class (presence of heart attack) but are incorrectly predicted as negative.

These four counts are used to construct the confusion matrix, from which the
evaluation metrics are calculated. The basic confusion matrix for this project is
shown in Figure 6.
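
The four counts can be obtained by comparing actual and predicted labels one pair at a time, as in this sketch. The function name `confusion_counts` and the two short label lists are illustrative assumptions, not the project's test data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, and FN for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted heart attack, and it was one
            else:
                fp += 1  # predicted heart attack, but there was none
        else:
            if actual == positive:
                fn += 1  # missed an actual heart attack
            else:
                tn += 1  # correctly predicted no heart attack
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical labels for eight test samples.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
```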
The evaluation metrics are:
1. Accuracy: Measures the number of correct predictions out of the total number
of predictions.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

2. Precision: Measures the proportion of positive predictions that are correct.

Precision = TP / (TP + FP)

3. Recall: Measures the model’s ability to identify all relevant instances of the
positive class.

Recall = TP / (TP + FN)

4. F1-score: The harmonic mean of precision and recall, which balances the
trade-off between them. It is particularly useful when the class distribution
in the data set is uneven, making accuracy an inadequate measure of a model’s
performance.

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
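
The four formulas can be checked with a short calculation. The counts below are hypothetical round numbers chosen to make the arithmetic easy to follow; they are not the values from the confusion matrices in this report.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1-score from the four counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 100 test samples.
metrics = evaluation_metrics(tp=40, fp=10, tn=45, fn=5)
# accuracy = 85/100, precision = 40/50, recall = 40/45, F1 = 80/95
```

The guards against division by zero return 0.0 when a metric is undefined, for example when the model predicts no positives at all; this convention is a choice made for the sketch.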

The confusion matrices of LR, DT, and SVM without Grid-Search-CV are shown in
Figure 7, Figure 8, and Figure 9 respectively, and those with Grid-Search-CV are
shown in Figure 10, Figure 11, and Figure 12 respectively.
4. Results Discussion

This section discusses the experiments we carried out and the results we obtained.

We used data that is publicly available on the Kaggle website. It consists of 13
features, both continuous and categorical. We encoded the categorical values as
numerical values for the best model performance. We removed ‘CHOL’ and ‘FBS’ from
the data because of their poor correlation with the output. After removing them, we
standardized the data set and then removed ‘SEX’, which does not play a very
important role in this type of project. Including the removed attributes again did
not increase the accuracy much, so we kept the selected features. We split the data
in the ratio 80:20 because it gave better accuracy than the other ratios and is a
commonly used standard.
With the baseline method without GSCV, the accuracy is 86.88% for the LR model,
73.77% for the DT classifier, and 85.24% for SVC. After applying GSCV with 5-fold
cross-validation, the accuracy of the DT increased to 77.04%, while the accuracy of
the other two models remained unchanged. This suggests that hyperparameter tuning
makes little difference for LR and SVC on this data; in our view, this is because
these two algorithms are more directly suited to binary classification than the DT.
We think the change in the accuracy of the DT may be related to the change in the
depth of the tree from 4 to 5 and in the splitter from best to random.
Logistic Regression (LR) shows the best results with both methods, with a slight
edge in F1-score and accuracy in both the confusion matrix and the classification
report. This indicates that LR is the best of the three models tested for detecting
heart attacks in this binary classification project.

Figure 6: Basic CM for this project.

Figure 7: CM of LR without GSCV

Figure 8: CM of DT without GSCV

Figure 9: CM of SVM without GSCV

Figure 10: CM of LR with GSCV

Figure 11: CM of DT with GSCV

Figure 12: CM of SVM with GSCV

Figure 13: Histogram of data

Model   Accuracy   Precision   Recall   F1-Score
LR      0.87       0.90        0.84     0.87
DT      0.74       0.81        0.66     0.72
SVM     0.85       0.90        0.81     0.85

Table 1: Evaluation Metrics without Grid-Search-CV

Model   Accuracy   Precision   Recall   F1-Score
LR      0.87       0.90        0.84     0.87
DT      0.77       0.76        0.81     0.79
SVM     0.85       0.90        0.81     0.85

Table 2: Evaluation Metrics with Grid-Search-CV

5. Conclusion

This analysis of heart attack detection has yielded valuable insights into ML
techniques. The study focused on key performance metrics, namely Accuracy,
Precision, Recall, and F1-score, which serve as vital indicators for evaluating the
quality of predictive models. One notable finding is that logistic regression
emerged as the top-performing technique, with an accuracy of 86.88%. SVM was
slightly behind LR at 85.24%, and the accuracy of DT improved slightly, from 74% to
77%, after hyperparameter tuning. Logistic regression, a well-established
supervised learning algorithm, thus proved effective in classifying individuals by
the presence or absence of a heart attack. This finding suggests that logistic
regression can be a reliable choice for such binary classification tasks and is the
best method for this project.

6. Future Work
In the future, the work could be extended with deep-learning techniques and further
machine-learning algorithms, together with additional data, so that it covers a
wider range of cases and provides a more effective solution for heart attacks and
heart disease.

We could also use real-time data from healthcare organizations rather than recorded
data, so that the results reflect a patient's condition at that point in time.

References

1. K. Szymanski, L. Zhu, L. Han, M. Embrechts, A. Ross, and K. Sternickel, “A
computationally efficient supernova: Spline kernel-based machine learning
tool,” in Soft Computing in Industrial Applications: Recent Trends. Springer,
2007, pp. 144–155.

2. Soni, Jyoti, et al. "Predictive data mining for medical diagnosis: An overview
of heart disease prediction." International Journal of Computer Applications
17.8 (2011): 43-48.

3. Dangare, Chaitrali S., and Sulabha S. Apte. "Improved study of heart disease
prediction system using data mining classification techniques." International
Journal of Computer Applications 47.10 (2012): 44-48.

4. Uyar, Kaan, and Ahmet İlhan. "Diagnosis of heart disease using genetic
algorithm based trained recurrent fuzzy neural networks." Procedia computer
science 120 (2017): 588-593.

5. Purushottam, Saxena, K., & Sharma, R. (2016). Efficient Heart Disease
Prediction System. In Procedia Computer Science (Vol. 85, pp. 962–969).
https://doi.org/10.1016/j.procs.2016.05.288

6. AK. Dwivedi, "Performance evaluation of different machine learning techniques
for prediction of heart disease", Neural Comput Appl, vol. 29, no. 10,
pp. 685-693, 2018.

7. A. Lakshmanarao, Y. Swathi and P. Sri Sai Sundareswar, "Machine Learning
Techniques for Heart Disease Prediction", International Journal Of Scientific &
Technology Research, vol. 8, no. 11, November 2019.

8. R, T.P. Thomas, J., 2016. Human Heart Disease Prediction System Using Data
Mining Techniques.
