A PROJECT REPORT
Submitted by
Shivank [Reg. No.: RA201100301386]
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
BONAFIDE CERTIFICATE
Certified that the 18CSP109L / 18CSP111L project report titled “EARLY HEART DISEASE
…” is the bonafide work of Shivank [Reg. No.: RA201100301386], who carried out the project
work under my supervision. Certified further that, to the best of my knowledge, the work
reported herein does not form part of any other thesis or dissertation on the basis of which a
degree or award was conferred on an earlier occasion on this or any other candidate.
Dr. M. PUSHPALATHA
HEAD OF THE DEPARTMENT
Department of Computing Technologies
I/We hereby certify that this assessment complies with the University’s Rules and
Regulations relating to Academic misconduct and plagiarism, as listed on the University
Website, Regulations, and the Education Committee guidelines.
I / We confirm that all the work contained in this assessment is our own except where
indicated, and that we have met the following conditions:
▪ Clearly referenced / listed all sources as appropriate
▪ Referenced and put in inverted commas all quoted text (from books, web, etc.)
▪ Given the sources of all pictures, data, etc. that are not my own
▪ Not made any use of the report(s) or essay(s) of any other student(s), either past or present
▪ Acknowledged in appropriate places any help that I have received from others (e.g., fellow students, technicians, statisticians, external sources)
▪ Complied with any other plagiarism criteria specified in the Course handbook / University website
I understand that any false claim for this work will be penalized in accordance with the
University policies and regulations.
DECLARATION:
I am aware of and understand the University’s policy on Academic misconduct and
plagiarism, and I certify that this assessment is my / our own work, except where
indicated by referencing, and that I have followed the good academic practices noted above.
Student 1 Signature:
Student 2 Signature:
Date:
If you are working in a group, please write your registration numbers and sign with the
date for every student in your group.
ACKNOWLEDGEMENT
Institute of Science and Technology, for the facilities extended for the project work and his
continued support.
We extend our sincere thanks to Dean-CET, SRM Institute of Science and Technology, Dr. T.
Computing, SRM Institute of Science and Technology, for her support throughout the project
work.
We are incredibly grateful to our Head of the Department, Dr. M. Pushpalatha, Professor,
Department of Computing Technologies, SRM Institute of Science and Technology, for her
support.
We want to convey our thanks to our Project Coordinators, Dr. A. Anbarasi, Dr. T. K.
Sivakumar and Dr. P. Saravanan; Panel Head, Dr. Jeyasekar A, Associate Professor; and
Panel Members, Dr. M. Vijayalakshmi, Assistant Professor, and Mrs. A. Mariya Nancy,
Assistant Professor, Department of Computing Technologies, SRM Institute of Science and
Technology, for their inputs during the project.
We register our immeasurable thanks to our Faculty Advisor, Dr. Jagadeesan, Assistant
Professor, Department of Computing Technologies, SRM Institute of Science and Technology,
for the constant support and guidance.
Our inexpressible respect and thanks to our guide, Dr. A. Anbarasi, Associate Professor,
Department of Computing Technologies, SRM Institute of Science and Technology, for
providing us with an opportunity to pursue our project under her mentorship. She provided us
with the freedom and support to explore the research topics of our interest. Her passion for
solving problems and making a difference in the world has always been inspiring.
We sincerely thank all the staff and students of the Department of Computing Technologies,
School of Computing, SRM Institute of Science and Technology, for their help during our project.
Finally, we would like to thank our parents, family members, and friends for their support
and encouragement.
ABSTRACT
The incidence of heart disease is steadily on the rise, and it is of utmost importance
to be able to predict such diseases in advance. The process of diagnosis is quite
challenging, requiring precision and efficiency. The primary focus of this research
paper is to determine which patients are at a higher risk of developing heart disease
based on various medical attributes. To accomplish this, we have developed a heart
disease prediction system that utilizes a patient's medical history. We employed
various machine learning algorithms, including logistic regression and K-nearest
neighbors (KNN), to predict and classify patients with heart disease. An effective
approach was adopted to fine-tune the model and enhance the accuracy of heart
attack prediction in individuals.
The results demonstrated the model's robust performance, particularly when using
KNN and Logistic Regression, showcasing improved accuracy compared to
previously employed classifiers such as naïve Bayes. This model has significantly
alleviated the pressure associated with accurately identifying heart disease, offering
a valuable tool for assessing the probability of heart disease in individuals. The
heart disease prediction system presented here not only enhances medical care but
also reduces associated costs. This project equips us with valuable insights that can
aid in the early detection of heart disease, and the implementation is available in the
.ipynb (Jupyter notebook) format.
INTRODUCTION
Fortunately, with the passage of time, substantial research data and hospital patient
records have become available. Open sources offer access to patient records, facilitating
research to leverage computer technologies for accurate disease diagnosis and
prevention. Machine learning and artificial intelligence are now widely acknowledged
for their substantial contributions to the medical field. These technologies enable the
development of various models for diagnosing and predicting heart disease. They also
permit in-depth analysis of complete genomic data, prediction of pandemics, and improved
understanding of medical records for more accurate predictions.
Numerous studies have explored machine learning models for heart disease
classification and prediction. Notable examples include a classifier by Melillo et al.,
which detected congestive heart failure with a 93.3% sensitivity and 63.5% specificity
using the CART algorithm. Rahhal et al. improved performance using deep neural
networks and electrocardiogram data. Guidi et al. introduced a clinical decision
support system, comparing various machine learning and deep learning models to
achieve an 87.6% accuracy with random forest and CART algorithms, outperforming
other classifiers.
The data used in this work was sourced from Kaggle. Data collection is the systematic
process of gathering and measuring information from diverse sources so that it can be
used effectively.
METHODOLOGY
The dataset utilized for the research comprises the Public Health Dataset from 1988,
encompassing four distinct databases: Cleveland, Hungary, Switzerland, and Long
Beach V. This dataset contains a total of 76 attributes, including the predictive
attribute. However, all published experiments primarily focus on a subset of 14 specific
attributes. The "target" field within this dataset pertains to the presence of heart
disease in the patient and is represented by integer values, where 0 signifies the
absence of disease and 1 denotes the presence of disease.
Here's a description of the attributes used in this research and their respective
significance:
1. Age: The patient’s age in years.
2. Sex: The patient’s sex, where “1” denotes male and “0” denotes female.
3. Cp (Chest Pain Type): This attribute characterizes the type of chest pain experienced.
4. Trestbps (Resting Blood Pressure): The patient’s resting blood pressure, measured in
mm Hg on admission to the hospital.
5. Chol (Serum Cholesterol): This attribute reflects the serum cholesterol level in mg/dL;
a desirable total cholesterol level is below 200 mg/dL.
6. Fbs (Fasting Blood Sugar): It categorizes fasting blood sugar levels, where values
larger than 120 mg/dL are labeled as "1" for true. Normal values are below 100 mg/dL.
7. Restecg (Resting Electrocardiographic Results): This attribute captures the results of
resting electrocardiography.
8. Thalach (Maximum Heart Rate Achieved): It records the maximum heart rate
achieved, which can be roughly estimated as 220 minus the patient's age.
9. Exang (Exercise-Induced Angina): This attribute is marked as “1” for yes; angina is a
type of chest pain caused by reduced blood flow to the heart.
10. Oldpeak: ST depression induced by exercise relative to rest.
11. Slope (Slope of Peak Exercise ST Segment): This attribute describes the slope of the
peak exercise ST segment.
12. Ca (Number of Major Vessels): It quantifies the number of major vessels, ranging
from 0 to 3, colored by fluoroscopy.
13. Thal: Although not explicitly explained, this attribute is likely associated with
thalassemia and contains values such as "3" for normal, "6" for fixed defects, and "7"
for reversible defects.
14. Target (T): This attribute indicates the patient's disease status, where "0" signifies
the absence of disease, and "1" denotes the presence of angiographic disease.
Machine Learning
Supervised Learning
The dataset used for this analysis is the Heart Disease Dataset, a
combination of four different databases, although only the UCI Cleveland
dataset was utilized. This dataset contains a total of 76 attributes, but all
published experiments focus on a subset of only 14 features. Consequently,
we utilized the pre-processed UCI Cleveland dataset available on Kaggle for
our analysis. A detailed description of the 14 attributes employed in our work is provided
in the attribute list in the Methodology section above.
The dataset does not contain any null values; however, it required proper handling of
outliers and of its uneven distribution. Two contrasting preprocessing settings were
examined: one without outlier handling or feature selection, which yielded unsatisfactory
results, and another using a normally distributed dataset to mitigate overfitting, combined
with Isolation Forest for outlier detection, which produced more promising outcomes.
Various visualization techniques were used to assess data skewness, detect outliers, and
evaluate data distribution. These preprocessing techniques play a crucial role when the
data is applied for classification or prediction purposes.
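As an illustration of the outlier-handling step just described, the following minimal Python sketch removes outliers with scikit-learn's Isolation Forest; the file name heart.csv and the contamination rate are assumptions for illustration, not values from the report.

import pandas as pd
from sklearn.ensemble import IsolationForest

# Load the pre-processed Kaggle dataset (hypothetical file name).
df = pd.read_csv("heart.csv")

# Fit an Isolation Forest on the feature columns and keep only the inliers.
features = df.drop(columns=["target"])
iso = IsolationForest(contamination=0.05, random_state=42)  # assumed contamination rate
mask = iso.fit_predict(features) == 1  # 1 = inlier, -1 = outlier
df_clean = df[mask]
print(f"Removed {len(df) - len(df_clean)} outlier rows out of {len(df)}")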
To check the attribute values and determine the skewness of the data (the asymmetry of a
distribution), distribution plots were drawn for each attribute so that the data could be
interpreted at a glance. The distributions of age and sex, chest pain and trestbps,
cholesterol and fasting blood sugar, resting ECG and thalach, exang and oldpeak, slope
and ca, and thal and target were all analyzed. These plots show that thal and fasting
blood sugar are not uniformly distributed and need to be handled; otherwise, they will
cause overfitting or underfitting of the data.
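A sketch of how such distribution plots and a skewness check might be produced with pandas and seaborn, reusing the df dataframe from the previous sketch:

import matplotlib.pyplot as plt
import seaborn as sns

# One distribution plot per attribute, arranged in a grid.
cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
fig, axes = plt.subplots(7, 2, figsize=(10, 22))
for ax, col in zip(axes.ravel(), cols):
    sns.histplot(df[col], kde=True, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()

# Numeric skewness per column (0 means symmetric).
print(df.skew())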
The proposed approach first analyzed the dataset thoroughly and then applied different
machine learning algorithms. From the linear models, Logistic Regression was used; for
the neighbor-based technique, the KNeighbors classifier was used; for the tree-based
technique, the Decision Tree classifier was used; and from the highly popular ensemble
methods, the Random Forest classifier was used. To check and handle the high
dimensionality of the data, a Support Vector Machine was used. Finally, the XGBoost
classifier, which combines ensemble methods with decision trees, was also applied, as
sketched below.
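A hedged sketch of that classifier line-up using scikit-learn and the xgboost package, reusing df_clean from the earlier sketch; the train/test split ratio and the default hyperparameters are assumptions.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier  # requires the xgboost package

X = df_clean.drop(columns=["target"])
y = df_clean["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)  # assumed split

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNeighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")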
The deep learning model can be summarized by the following procedure:
• Input: training dataset, testing dataset
• Repeat: add dense layers with dropout layers and ReLU activation functions
• Add a final dense layer with one output and a binary (sigmoid) activation function
• End repeat
• Output: L (the predicted label)
• End procedure
There are two ways a deep learning approach can be applied: using a sequential model or
using a functional deep learning approach. In this research the first one is used. A
sequential model with fully connected dense layers is used, together with flatten and
dropout layers to prevent overfitting. The results of the machine learning and deep
learning approaches are then compared, and the variations in learning, including
computational time and accuracy, can be analyzed in the figures further discussed in
the Results section.
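A minimal Keras sketch of the sequential model described above (128 ReLU neurons, dropout, a single sigmoid output, binary cross-entropy loss, and the Adam optimizer, as stated in the Results section); the dropout rate, epoch count, and batch size are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    layers.Dense(128, activation="relu"),   # 128 neurons as stated in the text
    layers.Dropout(0.2),                    # assumed dropout rate
    layers.Dense(1, activation="sigmoid"),  # single output for binary prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X_train, y_train, validation_split=0.2,
                    epochs=100, batch_size=32, verbose=0)  # assumed training schedule
print(f"Best validation accuracy: {max(history.history['val_accuracy']):.3f}")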
For the evaluation process, the confusion matrix, accuracy score, precision, recall
(sensitivity), and F1 score are used. A confusion matrix is a table that compares
predicted values against true values. It has four parts: true positives (TP), where values
identified as positive are actually positive; false positives (FP), where values identified
as positive are actually negative; false negatives (FN), where values identified as
negative are actually positive; and true negatives (TN), where values identified as
negative are actually negative.
To check how well a model is performing, the accuracy score is used. It is defined as the
number of true positives plus true negatives divided by the total of all four counts:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
After accuracy there is specificity, the proportion of actual negative cases that were
classified as negative; thus, it measures how well a classifier identifies negative cases.
It is also known as the true negative rate:
Specificity = TN / (TN + FP)
Then there is sensitivity, the proportion of actual positive cases that were predicted as
positive (true positives); sensitivity is also termed recall. In other words, an unhealthy
person is predicted as unhealthy:
Sensitivity = TP / (TP + FN)
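These metrics can be computed directly from the confusion matrix, as in this sketch, which reuses the fitted models and test split from the earlier sketches:

from sklearn.metrics import confusion_matrix, f1_score

y_pred = models["KNeighbors"].predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
specificity = tn / (tn + fp)   # true negative rate
sensitivity = tp / (tp + fn)   # recall / true positive rate
print(f"Accuracy: {accuracy:.3f}, Specificity: {specificity:.3f}, "
      f"Sensitivity: {sensitivity:.3f}, F1: {f1_score(y_test, y_pred):.3f}")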
RESULTS
Three approaches were used to compare different machine learning algorithms and to see
what difference deep learning makes when applied to the data. In the first approach, the
dataset as acquired is used directly for classification. In the second approach, feature
selection is performed but there is no outlier detection, and the results achieved are
quite promising. In the third approach, the dataset is normalized with both outlier
handling and feature selection, and the results achieved are much better than with the
previous techniques; compared with accuracies reported in other research, our results are
quite promising.
Using the First Approach (without Doing Feature Selection and Outliers Detection):
As can be seen, the dataset is not normalized and there is no equal distribution of the
target class; this can further be seen when a correlation heatmap is plotted, where there
are many negative values. So, even if feature selection is done, outliers remain.
With the first approach, the accuracy achieved by Random Forest is 76.7%, Logistic
Regression 83.64%, KNeighbors 82.27%, Support Vector Machine 84.09%, Decision Tree 75.0%,
and XGBoost 70.0%. SVM has the highest accuracy here, achieved by using cross-validation
and grid search to find the best parameters, in other words, hyperparameter tuning (see
the sketch below). After machine learning, deep learning is applied using the sequential
model approach: the model uses 128 neurons with the ReLU activation function and, since
the output is a single-class binary prediction, a sigmoid activation function in the
output layer, with binary cross-entropy loss and the Adam gradient descent optimizer. The
accuracy achieved is 76.7%.
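A sketch of the cross-validation and grid search used for the SVM's hyperparameter tuning; the parameter grid shown is an assumption, not the report's exact grid.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {  # assumed candidate values
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.01, 0.1],
    "kernel": ["rbf", "linear"],
}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print(f"Best cross-validated accuracy: {grid.best_score_:.3f}")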
Using the Second Approach (Doing Feature Selection and No Outliers Detection):
After selecting the features and scaling the data, and because there are outliers, the
robust scaler (RobustScaler) is used; it is suited to datasets containing outliers (see
the sketch below). In the second approach, the accuracy achieved by Random Forest is 88%,
Logistic Regression 85.9%, KNeighbors 79.69%, Support Vector Machine 84.26%, Decision
Tree 76.35%, and XGBoost 71.1%. Here Random Forest is the clear winner, with a precision
of 88.4% and an F1 score of 86.5%. Deep learning is then applied with the same parameters
as before; the training accuracy achieved is 86.8% and the evaluation accuracy is 81.9%,
which is better than in the first approach.
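A sketch of that scaling step with scikit-learn's RobustScaler, which centres the data on the median and scales by the interquartile range and is therefore resistant to outliers:

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics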
Using the Third Approach (by Doing Feature Selection and Also Outliers Detection):
In this approach, the dataset is normalized, feature selection is done, and the outliers
are handled using Isolation Forest. The correlation comparison can be seen in Figure 10.
The accuracy of Random Forest is 80.3%, Logistic Regression 83.31%, KNeighbors 84.86%,
Support Vector Machine 83.29%, Decision Tree 82.33%, and XGBoost 71.4%. Here the winner
is KNeighbors, with a precision of 77.7% and a specificity of 80%. A lot of tips and
tricks for selecting different algorithms are given by Garate-Escamila et al. [38]. Using
deep learning in the third approach, the accuracy achieved is 94.2%. So the maximum
accuracy achieved by a machine learning model is KNeighbors (84.86%) in the third
approach, while for deep learning the maximum accuracy achieved is 94.2%. The conclusion
can be drawn that, for this dataset, the deep learning algorithm achieved 94.2% accuracy,
which is greater than that of the machine learning models. We also compared with other
deep learning research: Ramprakash et al. [39] achieved 84% accuracy and Das et al. [33]
achieved 92.7% accuracy, so our algorithm produced greater accuracy and is more promising
than the other approaches. The comparison of the different ML and DL classifiers bears
this out.
• Machine learning makes it possible to build models that quickly analyze data and
deliver results. By leveraging historical and real-time data, machine learning helps
healthcare service providers make better decisions about a patient’s disease diagnosis.
• By analyzing the data we can predict the occurrence of the disease in our
project.
• This intelligent system for disease prediction plays a major role in controlling the
disease and maintaining the good health status of people by predicting accurate
disease risk.
• Machine learning algorithms can also help provide vital statistics, real-time data
and advanced analytics in terms of the patient’s disease, lab test results, blood
pressure, family history, clinical trial data, etc., to doctors.
• The goal is to predict the patient’s health from collective data, to detect
configurations that put the patient at risk and, in cases requiring emergency medical
assistance, to alert the appropriate medical staff to the patient’s situation.
• The results of the predictions, derived from the predictive models generated by
machine learning, will be presented through several distinct graphical interfaces
according to the datasets considered. We will then critically assess the scope of our
results.
Common heart conditions include heart arrhythmias, heart failure, and cardiomyopathy.
Eating a diet high in saturated fats, trans fat, and cholesterol has been linked to heart
disease and related conditions, such as atherosclerosis. Also, too much salt (sodium) in
the diet can raise blood pressure. Not getting enough physical activity can lead to heart
disease.
Coronary artery disease
• The usual cause is the build-up of plaque, which causes the coronary arteries to
narrow, limiting blood flow to the heart.
High blood pressure
• A condition in which the force of the blood against the artery walls is too high.
Cardiac arrest
• In cardiac arrest, the heart abruptly stops beating. Without prompt intervention,
it can result in the person's death.
Supervised Learning
In this study, an effective heart disease prediction system (EHDPS) was developed using a
neural network to predict the risk level of heart disease. The system uses 15 medical
parameters, such as age, sex, blood pressure, cholesterol, and obesity, for prediction.
Data insight: As mentioned, we work with the heart disease detection dataset and draw
inferences from the data to derive meaningful results.
EDA: Exploratory data analysis is the key step for obtaining meaningful results.
Feature engineering: After getting insights from the data, we alter the features so that
they can move forward to the model building phase.
Model building: In this phase, we build our machine learning model for heart disease
detection. A compact end-to-end sketch follows.
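The sketch below ties the four phases together; the file name, the one-hot-encoded columns, and the choice of classifier are assumptions for illustration.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("heart.csv")                  # data insight: load the dataset
print(df.describe())                           # EDA: summary statistics
X = pd.get_dummies(df.drop(columns=["target"]),
                   columns=["cp", "restecg", "slope", "thal"])  # feature engineering (assumed encoding)
y = df["target"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = RobustScaler()
clf = KNeighborsClassifier().fit(scaler.fit_transform(X_tr), y_tr)  # model building
print(f"Test accuracy: {clf.score(scaler.transform(X_te), y_te):.3f}")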
CONCLUSION
The conclusion we reached is that machine learning algorithms performed better in this
analysis. Many researchers have previously suggested using ML where the dataset is not
that large, which is borne out in this work. We proposed three methods, carried out a
comparative analysis, and achieved promising results. The measures used for comparison
are the confusion matrix, precision, specificity, sensitivity, and F1 score. For the 13
features in the dataset, the KNeighbors classifier performed best among the ML approaches
when data preprocessing was applied.
The computational time was also reduced, which is helpful when deploying a model. It was
also found that the dataset should be normalized; otherwise, the training model sometimes
becomes overfitted and the accuracy achieved is insufficient when the model is evaluated
on real-world data, which can differ drastically from the dataset on which the model was
trained. Statistical analysis is likewise important when a dataset is analyzed: it should
have a Gaussian distribution, and outlier detection also matters, for which the Isolation
Forest technique was used. The main difficulty encountered is that the sample size of the
dataset is not large; with a larger dataset, the results could improve considerably for
both deep learning and ML. The algorithm we applied in the ANN architecture increased the
accuracy, which we compared against the results of other researchers. The dataset size
can be increased, and deep learning with various other optimizations can then be used to
achieve more promising results. Machine learning with other optimization techniques can
also be used to further improve the evaluation results, and different ways of normalizing
the data can be tried and compared. Finally, ways could be found to integrate
heart-disease-trained ML and DL models with multimedia applications for the convenience
of patients and doctors.
REFERENCES
58381e0602d2.
https://webfocusinfocenter.informationbuilders.com/wfappent/TLs/TL_rstat/source/DecisionTree
https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d
https://www.investopedia.com/terms/n/neuralnetwork.asp
topics/heart-attack/angina-chest-pain.
M., Hetts, S., English, J., & Wilson, M. (2012, January). MR fluoroscopy in vascular and cardiac interventions (review). Retrieved March 14, 2020, from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3275732/
“What Can I Do To Avoid A Heart Attack Or A Stroke?”. World Health Organization, 2020, https://www.who.int/features/qa/27/en/.