Heart Disease Prediction Using ML
PHASE I
PROJECT REPORT
Submitted by
JEEVAMUKESH M (1901032)
DHARNESH K (1901019)
KARUPPASAMY A (1901039)
MUTHUKUMAR V (1901059)
of
BACHELOR OF ENGINEERING
in
DECEMBER 2022
ANNA UNIVERSITY : CHENNAI 600025
BONAFIDE CERTIFICATE
Signature
Dr. A. Ramathilagam M.E., Ph.D.,
HEAD OF THE DEPARTMENT,
Professor & Head,
Computer Science and Engineering,
P.S.R. Engineering College,
Sivakasi – 626140.
Signature
Dr. S. Priyadarsini M.E., Ph.D.,
SUPERVISOR,
Associate Professor,
Computer Science and Engineering,
P.S.R. Engineering College,
Sivakasi – 626140.
ABSTRACT
Machine learning is used across many fields around the world, and the
healthcare industry is no exception. Machine learning can play an essential role
in predicting the presence or absence of locomotor disorders, heart diseases and
more. Such information, if predicted well in advance, can provide important
insights to doctors, who can then adapt their diagnosis and treatment on a
per-patient basis. We work on predicting possible heart disease in people by
using the Random Forest algorithm, implemented in Python. With this algorithm,
the model achieves an accuracy of 93%.
ACKNOWLEDGEMENT
We also wish to express our sincere thanks to our project guide Dr.
S.PRIYADARSINI M.E., Ph.D., Associate Professor for her excellent
guidance and constant encouragement during this project work.
TABLE OF CONTENTS
1 INTRODUCTION
1.1 INTRODUCTION
1.2 PROBLEM STATEMENT
1.3 OVERVIEW OF THE PROJECT
1.4 OBJECTIVES OF THE PROJECT
2 LITERATURE REVIEW
2.1 RELATED WORKS
2.2 EXISTING SYSTEM
3 SYSTEM ARCHITECTURE
3.1 PROPOSED SYSTEM
3.2 SYSTEM ARCHITECTURE
3.3 METHODOLOGY
3.4 SYSTEM DESIGN
3.4.1 USE CASE DIAGRAM
3.4.2 SEQUENCE DIAGRAM
4 SYSTEM REQUIREMENTS
4.1 PLATFORM
4.2 HARDWARE REQUIREMENTS
4.3 SOFTWARE REQUIREMENTS
5 SYSTEM IMPLEMENTATION
5.1 THEORETICAL BACKGROUND
5.2 MODULES
5.2.1 DATASET COLLECTION
5.2.2 ATTRIBUTES SELECTION
5.2.3 DATA PRE-PROCESSING
5.2.3.1 ONE-HOT ENCODING
5.2.3.2 LABEL ENCODING
5.2.4 PERFORMANCE METRICS
5.2.4.1 CONFUSION MATRIX
5.2.4.2 F1-SCORE
5.2.4.3 CLASSIFICATION REPORT
APPENDIX
REFERENCES
LIST OF TABLES
LIST OF FIGURES
6.11 Data Transformation (Heart Disease Frequency based on Age & Max Heart Rate - Scatterplot)
6.12 Data Transformation (Heart Disease Frequency based on per chest pain type)
6.13 Data Transformation (Heart Disease Frequency based on per chest pain type - Bar Chart)
6.14 Data Pre-processing (One Hot Encoding)
6.15 Data Pre-processing (One Hot Encoding - Result)
6.16 Data Pre-processing (Label Encoding & its result)
6.17 Performance Metrics (Confusion Matrix)
6.18 Performance Metrics (Confusion Matrix - Result)
6.19 Performance Metrics (Display Scores - Precision, Recall, F1-Score)
6.20 Performance Metrics (Display a Classification Report)
LIST OF ABBREVIATIONS
ABBREVIATION EXPANSION
ML Machine Learning
OS Operating System
Sklearn Scikit-Learn
SVM Support Vector Machine
Learning Repository
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
1.2 PROBLEM STATEMENT
LITERATURE REVIEW
[3] Fitriyani, N., Syafrudin, M., Alfian, G. and Rhee, J. (2020)
"HDPM: An Effective Heart Disease Prediction Model for a Clinical
Decision Support System", IEEE Access, Vol. 8, No. 07, pp. 133034-133050.
This paper proposes an effective heart disease prediction model for a
CDSS. The model consists of Density-Based Spatial Clustering of Applications
with Noise (DBSCAN) to detect and eliminate outliers, a hybrid Synthetic
Minority Over-sampling Technique-Edited Nearest Neighbor (SMOTE-ENN) to
balance the training data distribution, and XGBoost to predict heart disease.
Two publicly available datasets (Statlog and Cleveland) were used to build the
model and to compare its results with those of other models (Naive Bayes,
logistic regression, multilayer perceptron (MLP), support vector machine (SVM),
decision tree, and random forest) and of previous studies. The model achieved
accuracies of 95.90% and 98.40% on the Statlog and Cleveland datasets,
respectively.
cannot be observed with the naked eye, it can appear immediately anywhere,
anytime. Many ML algorithms are more capable of handling various algorithms.
Due to complexity, the processing of massive data sets is more complicated. By
improving these systems, the quality of medical diagnosis decisions can be
improved. They can find patterns hidden in large amounts of data that will avoid
the use of traditional statistical methods for analysis. In this article, an
ENDDP algorithm is developed to predict the early stages of heart disease. The
results demonstrate the performance of the proposed system.
[7] Liaqat Ali, Awais Niamat, Javed Ali Khan, Noorbakhsh Amiri
Golilarz, Xiong Xingzhong, Adeeb Noor, Redhwan Nour, and Syed
Ahmad Chan Bukhari, “An Optimized Stacked Support Vector Machines
Based Expert System for the Effective Prediction of Heart Failure”, IEEE
Access 2019. In this article, the authors propose a hybrid grid search algorithm
that is capable of optimizing the two models simultaneously. The effectiveness
of the proposed method is evaluated using six different evaluation metrics:
accuracy, sensitivity, specificity, the Matthews correlation coefficient, ROC
charts, and AUC. The experimental results confirm that the proposed method
improves the performance of a conventional SVM model by 3.3%. Moreover, the
proposed method performs better than ten previously proposed methods, which
achieved accuracies in the range of 57.85%–91.83%.
CHAPTER 3
SYSTEM ARCHITECTURE
The system starts with the collection of data and the selection of the
important attributes. The required data is then preprocessed into the required
format and divided into two parts: training data and testing data. The
algorithms are applied and the model is trained using the training data. The
accuracy of the system is obtained by testing the system using the testing data.
This system offers many advantages.
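The workflow described above can be sketched in a few lines of scikit-learn. This is only an illustration under stated assumptions: the data is a synthetic stand-in generated with make_classification (not the heart disease dataset), and the 80/20 split, the number of trees and the random seeds are assumed values that the report does not specify.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data: 300 rows, 13 features, binary target.
# (Assumption: the real project would load and preprocess the UCI data here.)
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=42)

# Divide the data into training and testing parts (the 80/20 ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Train the model using the training data only.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Obtain the accuracy by testing the model on the held-out testing data.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

Because the data here is synthetic, the printed accuracy says nothing about the accuracy reported for the actual system.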
Fig 3.1 System Architecture for heart disease prediction
3.3 METHODOLOGY
Machine Learning (ML) is a subset of AI in which a machine learns from the
given data without being explicitly programmed. ML algorithms are grouped into
Supervised Learning, Unsupervised Learning and Reinforcement Learning. In this
project, we choose the Random Forest algorithm. It is a supervised learning
algorithm, in which machines are trained using well "labelled" training data
and, on the basis of that data, predict the output.
number of datasets. It can be used for both classification and regression. The
worst-case time complexity of learning with Random Forests is
O(M · d · n log n), where M is the number of trees grown, n is the number of
instances, and d is the data dimension. Random Forests have a variety of
applications, such as recommendation engines, image classification and feature
selection. They can be used to classify loyal loan applicants, identify
fraudulent activity and predict diseases. As the name suggests, “Random Forest
is a classifier that contains a number of decision trees on various subsets of
the given dataset and takes the average to improve the predictive accuracy of
that dataset.” Instead of relying on one decision tree, the random forest takes
the prediction from each tree and predicts the final output based on the
majority vote of those predictions.
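The majority-vote step can be illustrated without any library. The sketch below assumes that each decision tree has already produced a class label for one patient; the function name majority_vote and the example votes are hypothetical and are not taken from the project code.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    # most_common(1) returns a list with the single (label, count) pair
    # that has the highest count.
    return counts.most_common(1)[0][0]


# Hypothetical votes from five decision trees (1 = disease, 0 = no disease)
votes = [1, 0, 1, 1, 0]
final_prediction = majority_vote(votes)
print(final_prediction)  # 1, since three of the five trees vote for class 1
```

For regression, a forest would average the numeric outputs of the trees instead of counting votes.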
Assumptions: Since the random forest combines multiple trees to predict the
class of the dataset, it is possible that some decision trees may predict the
correct output, while others may not. But together, all the trees predict the
correct output. Therefore, below are two assumptions for a better Random forest
classifier
Advantages
3.4.1 USE CASE DIAGRAM
CHAPTER 4
SYSTEM REQUIREMENTS
4.1 PLATFORM
Windows is a powerful, scalable operating system that provides basic file and
print services as well as a robust platform for server applications. Its main
features are as follows
CHAPTER 5
SYSTEM IMPLEMENTATION
Sklearn: Scikit-learn (sklearn) is a widely used and robust library for machine
learning in Python. It provides a selection of efficient tools for machine
learning and statistical modeling, including classification, regression,
clustering and dimensionality reduction, via a consistent interface in Python.
This library, which is largely written in Python, is built upon NumPy, SciPy
and Matplotlib.
GUI toolkits like Tkinter, wxPython, Qt, or GTK. There is also a procedural
"pylab" interface based on a state machine (like OpenGL), designed to closely
resemble that of MATLAB, though its use is discouraged.
5.2 MODULES
Initially, we collect a dataset for our heart disease prediction system. The
dataset used for this project is Heart Disease UCI. The dataset consists of 76
attributes, of which 14 are used for the system.
9. exang: exercise induced angina (1 = yes; 0 = no). Type: Nominal
10. oldpeak: ST depression induced by exercise relative to rest. Type: Numerical
    Looks at the stress of the heart during exercise; an unhealthy heart will stress more.
11. slope: the slope of the peak exercise ST segment. Type: Nominal
    1. Upsloping: better heart rate with exercise (uncommon)
    2. Flat sloping: minimal change (typical healthy heart)
    3. Downsloping: signs of an unhealthy heart
12. ca: number of major vessels (0-3) colored by fluoroscopy. Type: Numerical
    A colored vessel means the doctor can see the blood passing through;
    the more blood movement the better (no clots).
13. thal: thallium stress result. Type: Nominal
    1, 3: normal
    6: fixed defect: used to be a defect but is ok now
    7: reversible defect: no proper blood movement when exercising
14. target: have disease or not (1 = yes, 0 = no) (= the predicted attribute). Type: Nominal
5.2.3.1 ONE-HOT ENCODING
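One-hot encoding replaces a nominal attribute with one binary column per category, so that the model does not read an order into the category codes. The appendix uses scikit-learn's OneHotEncoder for this step; the library-free sketch below only illustrates the idea. The helper one_hot_encode and the example chest pain codes are hypothetical.

```python
def one_hot_encode(values):
    """Map each distinct category to a binary indicator vector."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # set the column that belongs to this category
        encoded.append(row)
    return categories, encoded


# Hypothetical chest pain type codes for four patients
cp_values = [0, 2, 1, 2]
categories, encoded = one_hot_encode(cp_values)
print(categories)  # [0, 1, 2]
print(encoded)     # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
```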
5.2.4.2 F1-SCORE
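The F1-score is the harmonic mean of precision and recall: F1 = 2 · precision · recall / (precision + recall), where precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below computes the three values from two label lists; the function name and the example labels are hypothetical and serve only to make the arithmetic concrete.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1-score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels for eight patients (1 = disease, 0 = no disease)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Here three of the four predicted positives are correct and three of the four actual positives are found, so precision, recall and F1 all equal 0.75.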
CHAPTER 6
Fig 6.7 Data Transformation (No.of.patients - Pie Chart)
Fig 6.9 Data Transformation
(Heart Disease Frequency based on Age & Max Heart Rate - Scatterplot)
Fig 6.13 Data Transformation
(Heart Disease Frequency based on Per chest pain type - Bar Chart)
Fig 6.15 Data Pre-processing (One Hot Encoding - Result)
7.1 CONCLUSION
Heart disease is a major killer in India and throughout the world.
Applying a promising technology such as machine learning to the early
prediction of heart disease can have a profound impact on society. Early
prognosis of heart disease can aid decisions on lifestyle changes in
high-risk patients and in turn reduce complications, which can be a great
milestone in the field of medicine. The number of people facing heart disease
is on the rise each year, which calls for early diagnosis and treatment.
Suitable technology support in this regard can prove highly beneficial to the
medical fraternity and to patients. In this project, the Random Forest
algorithm is applied to the dataset and its performance is measured. The
dataset contains 76 features describing attributes that may lead to heart
disease, of which 14 important features are selected to evaluate the system.
If all the features are taken into consideration, the efficiency of the system
decreases considerably, so attribute selection is performed to increase it: n
features are selected for evaluating the model so that it gives higher
accuracy. Some features in the dataset are almost equally correlated, and so
they are removed. Evaluation metrics such as the confusion matrix, accuracy,
precision, recall and F1-score are used to assess how well the model predicts
the disease. When tested on the heart disease prediction task, the Random
Forest classifier gives the highest accuracy, 93%.
7.2 FUTURE ENHANCEMENT
In future work, more machine learning approaches will be used for better
analysis of heart disease and for earlier prediction, so that the death rate
can be reduced through awareness of the disease. This application can be made
a common platform for predicting all kinds of diseases. We are excited to
continue our project by adding more attributes and by extending this
application into a disease prediction application for all kinds of diseases.
APPENDIX
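import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix, classification_report
%matplotlib inline
# Load the dataset (file name assumed; the report does not state it)
df = pd.read_csv('heart.csv')
df.head()
df.info()
df.describe()
# Correlation matrix of all attributes
corrmat = df.corr()
top_corr_features = corrmat.index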
plt.figure(figsize=(16,7))
g=sns.heatmap(df[top_corr_features].corr(),annot=True,cmap="RdYlGn")
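# Count the patients with (1) and without (0) heart disease
df['target'].value_counts()
# Definitions assumed; they are missing from the original listing
disease = len(df[df['target'] == 1])
no_disease = len(df[df['target'] == 0])
# Horizontal bar chart of the two counts
plt.rcdefaults()
fig, ax = plt.subplots()
y = ('Heart Disease', 'No Disease')
y_pos = np.arange(len(y))
x = (disease, no_disease)
ax.barh(y_pos, x, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(y)
ax.invert_yaxis()
ax.set_xlabel('Count')
ax.set_title('Target')
for i, v in enumerate(x):
    # Assumed loop body (lost in the original listing): label each bar with its count
    ax.text(v, i, str(v), va='center')
plt.show()
# Pie chart of the same counts
fig1, ax1 = plt.subplots()
# Assumed call (lost in the original listing): draw the slices with percentage labels
ax1.pie(x, labels=y, autopct='%1.1f%%')
ax1.axis('equal')
plt.show()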
df.sex.value_counts()
# Display a crosstab showing how many persons have heart disease
# and how many do not, according to sex
pd.crosstab(df.target, df.sex)
pd.crosstab(df.target, df.sex).plot(kind='bar',
                                    figsize=(10, 6),
                                    color=['cyan', 'crimson']);
# Labelling
plt.ylabel('Amount')
plt.legend(['Female', 'Male'])
plt.grid()
plt.xticks(rotation=0);
# Scatter plot of age against maximum heart rate (thalach), coloured by target
plt.figure(figsize=(10, 6))
plt.scatter(df.age[df.target == 1],
            df.thalach[df.target == 1],
            color='crimson')
plt.scatter(df.age[df.target == 0],
            df.thalach[df.target == 0],
            color='darkcyan')
plt.xlabel('Age')
plt.grid()
# Display a crosstab showing how many persons have heart disease
# and how many do not, according to type of chest pain
pd.crosstab(df.cp, df.target)
pd.crosstab(df.cp, df.target).plot(kind='bar',
                                   figsize=(10, 6),
                                   color=['cyan', 'crimson'])
# Labelling
plt.ylabel("Amount")
plt.grid()
plt.xticks(rotation=0);
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_df = pd.DataFrame(encoder.fit_transform(df[['target']]).toarray())
final_df = df.join(encoder_df)
# view final df
print(final_df)
label_encoder = preprocessing.LabelEncoder()
df['target']= label_encoder.fit_transform(df['target'])
df['target'].unique()
# Create a Dataframe
data=pd.DataFrame(df['target'])
print(data)
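# As in the original listing, the chest pain type column is compared with the
# target column; to evaluate a classifier, y_true and y_pred would instead be
# the held-out labels and the model's predictions.
y_true = df['cp']
y_pred = df['target']
# Computation reconstructed; the line that built the matrix is missing from the listing
cm = confusion_matrix(y_true, y_pred)
print(cm)
sns.heatmap(cm, annot=True)
print(classification_report(y_true, y_pred))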
REFERENCES
1. Adeeb Noor, Awais Niamat, Javed Ali Khan, Noorbakhsh Amiri Golilarz, Redhwan Nour,
Syed Ahmad Chan Bukhari, Liaqat Ali, and Xiong Xingzhong, “An Optimized Stacked
Support Vector Machines Based Expert System for the Effective Prediction of Heart
Failure”, IEEE Access 2019.
2. Alfian, G., Fitriyani, N., Rhee, J. and Syafrudin, M. (2020) "HDPM: An Effective Heart Disease
Prediction Model for a Clinical Decision Support System", IEEE Access, Vol. 8, No. 07,
pp. 133034-133050.
3. Ashir Javeed, Atiqur Rahman, Aurangzeb Khan, Javed Ali Khan, Mingyi Zhou, And Liaqat
Ali “An Automated Diagnostic System for Heart Disease Prediction Based on χ2 Statistical
Model and Optimally Configured Deep Neural Network”, IEEE Access 2019.
4. Bahaj, M. and Khourdifi, Y. (2019) "Heart Disease Prediction and Classification Using
Machine Learning Algorithms Optimized by Particle Swarm Optimization and Ant Colony
Optimization" ,International Journal of Intelligent Engineering and Systems, Vol. 12, No. 02.
5. Chandrasegar Thirumalai, Gautam Srivastava, Senthilkumar Mohan, “Effective Heart Disease
Prediction Using Hybrid Machine Learning Techniques”, IEEE Access 2019.
6. Chintala Srinidhi, Deepika, Kalali Vanaja, Kotha Sindhu, Krishna Rao Patro, E. and
Padmaja, B. (2021) “Early and Accurate Prediction of Heart Disease Using Machine
Learning Model”, Turkish Journal of Computer and Mathematics Education, Vol. 12, No. 6,
pp. 4516-4528.
7. Deepak, K., Koushik, K. V. S. and Nikhil Kumar, M. (2019) “Prediction of Heart Diseases
Using Data Mining and Machine Learning Algorithms and Tools”, International Journal of
Scientific Research in Computer Science, Engineering and Information Technology,
IJSRCSEIT.
8. Hussein Kanaan , Israa Nadheer , Mohammad Ayache “Heart Disease Prediction System
Using Machine Learning Algorithm”, Iraqi Journal of Information and Communications
Technology(IJICT) Conference Series: The 1st Conference of Applied Researches in
Information Engineering (ARIE2021).
9. Jaymin Patel, Dr. Samir Patel ,and Prof. Tejal Upadhyay, (2015-2016) “Heart Disease
Prediction using Machine Learning and Data Mining Technique”, Vol.7, No.1, pp. 129-137.
10. MALKARI BHARGAV, J.RAGHUNATHA “Study on Risk Prediction of Cardiovascular
Disease Using Machine Learning Algorithms”, JETIR August 2020, Volume 7, Issue 8.