Heart Disease Prediction Using ML

Uploaded by

M. Jeevamukesh

HEART DISEASE PREDICTION

USING MACHINE LEARNING


TECHNIQUES

PHASE I
PROJECT REPORT

Submitted by
JEEVAMUKESH M (1901032)
DHARNESH K (1901019)
KARUPPASAMY A (1901039)
MUTHUKUMAR V (1901059)

in partial fulfillment for the award of the degree

of

BACHELOR OF ENGINEERING

in

COMPUTER SCIENCE AND ENGINEERING

P.S.R. ENGINEERING COLLEGE, SIVAKASI


(An Autonomous Institution – Affiliated To Anna University, Chennai)

ANNA UNIVERSITY :: CHENNAI 600 025

DECEMBER 2022
ANNA UNIVERSITY : CHENNAI 600025

BONAFIDE CERTIFICATE

Certified that this project report “HEART DISEASE PREDICTION USING MACHINE
LEARNING TECHNIQUES” is the bonafide work of JEEVAMUKESH M (1901032),
DHARNESH K (1901019), KARUPPASAMY A (1901039) and MUTHUKUMAR V (1901059),
who carried out the work under my supervision.

Signature
Dr. A. Ramathilagam M.E., Ph.D.,
HEAD OF THE DEPARTMENT,
Professor & Head,
Computer Science and Engineering,
P.S.R. Engineering College,
Sivakasi – 626140.

Signature
Dr. S. Priyadarsini M.E., Ph.D.,
SUPERVISOR,
Associate Professor,
Computer Science and Engineering,
P.S.R. Engineering College,
Sivakasi – 626140.

Submitted for viva-voce examination held on ………………………………...

INTERNAL EXAMINER EXTERNAL EXAMINER

ABSTRACT

Machine learning is used in many fields around the world, and the
healthcare industry is no exception. Machine learning can play an essential
role in predicting the presence or absence of locomotor disorders, heart
diseases and more. Such information, if predicted well in advance, can provide
important insights to doctors, who can then adapt their diagnosis and treatment
on a per-patient basis. We work on predicting possible heart disease in people
using the Random Forest algorithm, implemented in Python. With this algorithm,
the model achieves an accuracy of 93%.

ACKNOWLEDGEMENT

We take this opportunity to thank all those who helped towards the successful
completion of this project. At the very outset, we thank the Almighty for his
profuse blessings showered on us.

We thank our beloved parents, whose encouragement and support helped us
to complete our project successfully.

We wish to express our cordial gratitude to our Honourable
Correspondent Thiru. R.SOLAISAMY, our beloved Director Er.
S.VIGNESWARI ARUNKUMAR B.Tech., and our respected Principal Dr. J.S.
SENTHILKUMAAR M.E., Ph.D., for their patronage and the excellent facilities
provided during the project.

We wish to express our sincere thanks to our adored Head of the
Department and guide, Dr. A.RAMATHILAGAM M.E., Ph.D., for her
motivation during the course of this work.

We also wish to express our sincere thanks to our project guide Dr.
S.PRIYADARSINI M.E., Ph.D., Associate Professor, for her excellent
guidance and constant encouragement during this project work.

We also wish to express our sincere thanks to our project coordinators
Dr. S.PRIYADARSINI M.E., Ph.D., Associate Professor, and Dr. S.EDWIN
RAJA M.Tech., Ph.D., Associate Professor, for having helped us in all aspects.

We are also bound to thank all faculty and non-teaching staff members
of the Department of Computer Science and Engineering, whose support and
cooperation contributed much to the completion of this project work.

TABLE OF CONTENTS

CHAPTER NO: TITLE

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

1 INTRODUCTION
1.1 INTRODUCTION
1.2 PROBLEM STATEMENT
1.3 OVERVIEW OF THE PROJECT
1.4 OBJECTIVES OF THE PROJECT

2 LITERATURE REVIEW
2.1 RELATED WORKS
2.2 EXISTING SYSTEM

3 SYSTEM ARCHITECTURE
3.1 PROPOSED SYSTEM
3.2 SYSTEM ARCHITECTURE
3.3 METHODOLOGY
3.4 SYSTEM DESIGN
3.4.1 USE CASE DIAGRAM
3.4.2 SEQUENCE DIAGRAM

4 SYSTEM REQUIREMENTS
4.1 PLATFORM
4.2 HARDWARE REQUIREMENTS
4.3 SOFTWARE REQUIREMENTS

5 SYSTEM IMPLEMENTATION
5.1 THEORETICAL BACKGROUND
5.2 MODULES
5.2.1 DATASET COLLECTION
5.2.2 ATTRIBUTES SELECTION
5.2.3 DATA PRE-PROCESSING
5.2.3.1 ONE-HOT ENCODING
5.2.3.2 LABEL ENCODING
5.2.4 PERFORMANCE METRICS
5.2.4.1 CONFUSION MATRIX
5.2.4.2 F1-SCORE
5.2.4.3 CLASSIFICATION REPORT

6 RESULTS & DISCUSSIONS

7 CONCLUSION & FUTURE ENHANCEMENT
7.1 CONCLUSION
7.2 FUTURE ENHANCEMENT

APPENDIX

REFERENCES
LIST OF TABLES

TABLE NO. TABLE NAME

5.2.1 Input Dataset Attributes

LIST OF FIGURES

FIGURE NO. FIGURE NAME

3.1 System Architecture for heart disease prediction
3.2 Working Principle of Random Forest Algorithm
3.3.1 Use Case Diagram for heart disease prediction
3.3.2 Sequence Diagram for heart disease prediction
5.2.1 Dataset Collection
5.2.4.1 Confusion Matrix
6.1 Import Necessary Libraries & Dataset
6.2 Input Dataset Attributes (Summary)
6.3 Input Dataset Attributes (Statistical Analysis)
6.4 Attribute Selection (Correlation Matrix)
6.5 Data Transformation (No. of Patients)
6.6 Data Transformation (No. of Patients - Bar Chart)
6.7 Data Transformation (No. of Patients - Pie Chart)
6.8 Data Transformation (Heart Disease Frequency based on Sex)
6.9 Data Transformation (Heart Disease Frequency based on Sex - Bar Chart)
6.10 Data Transformation (Heart Disease Frequency based on Age & Max Heart Rate)
6.11 Data Transformation (Heart Disease Frequency based on Age & Max Heart Rate - Scatterplot)
6.12 Data Transformation (Heart Disease Frequency based on per chest pain type)
6.13 Data Transformation (Heart Disease Frequency based on per chest pain type - Bar Chart)
6.14 Data Pre-processing (One Hot Encoding)
6.15 Data Pre-processing (One Hot Encoding - Result)
6.16 Data Pre-processing (Label Encoding & its result)
6.17 Performance Metrics (Confusion Matrix)
6.18 Performance Metrics (Confusion Matrix - Result)
6.19 Performance Metrics (Display Scores - Precision, Recall, F1-Score)
6.20 Performance Metrics (Display a Classification Report)
LIST OF ABBREVIATIONS

ABBREVIATION EXPANSION

ANN Artificial Neural Network

API Application Programming Interface

AUC Area Under the Curve

CDSS Clinical Decision Support System

DNN Deep Neural Network

ENDDP Enhanced New Dynamic Data Processing

GUI Graphical User Interface

HDL High Density Lipoprotein

IDE Integrated Development Environment

LDL Low Density Lipoprotein

ML Machine Learning

mg/dl Milligrams per decilitre

mmHg Millimetre of mercury

MLP Multilayer Perceptron

OS Operating System

RAM Random Access Memory

ROC Receiver Operating Characteristic Curve

Sklearn Scikit-Learn

SVM Support Vector Machine

SMO Sequential Minimal Optimization

UCI University of California Irvine Machine Learning Repository

WHO World Health Organization
CHAPTER 1

INTRODUCTION

1.1 INTRODUCTION

According to the WHO, every year 12 million deaths occur worldwide
due to heart disease. Heart disease is one of the biggest causes of morbidity
and mortality among the population of the world. Prediction of cardiovascular
disease is regarded as one of the most important subjects in the field of data
analysis. The burden of cardiovascular disease has been increasing rapidly all
over the world in the past few years. Much research has been conducted in an
attempt to pinpoint the most influential factors of heart disease as well as to
predict the overall risk accurately. Heart disease is even described as a silent
killer, since it can lead to the death of a person without obvious symptoms. The
early diagnosis of heart disease plays a vital role in making decisions on
lifestyle changes in high-risk patients and in turn reduces complications.

Machine learning proves to be effective in assisting decision-making
and prediction from the large quantity of data produced by the healthcare
industry. This project aims to predict heart disease by analysing patient data
and classifying whether a patient has heart disease or not using a machine
learning algorithm. Machine learning techniques can be a boon in this regard.
Even though heart disease can occur in different forms, there is a common set of
core risk factors that influence whether someone will ultimately be at risk of
heart disease or not. By collecting data from various sources, classifying it
under suitable headings and finally analysing it to extract the desired
information, this technique can be well adapted to the prediction of heart
disease.
1.2 PROBLEM STATEMENT

The major challenge in heart disease is its detection. There are
instruments available that can predict heart disease, but they are either
expensive or not efficient at calculating the chance of heart disease in a
person. Early detection of cardiac diseases can decrease the mortality rate and
overall complications. However, it is not possible to monitor patients
accurately every day in all cases, and round-the-clock consultation of a patient
by a doctor is not available, since it requires more skill, time and expertise.
Since a good amount of data is available in today’s world, we can use various
machine learning algorithms to analyse the data for hidden patterns. These
hidden patterns can be used for health diagnosis from medical data.

1.3 OVERVIEW OF THE PROJECT

This project presents a heart disease prediction model for predicting the
occurrence of heart disease. Further, this research work aims to obtain the
highest accuracy from the chosen classification algorithm in identifying the
possibility of heart disease in a patient.

1.4 OBJECTIVES OF THE PROJECT

• To develop a machine learning model that predicts heart disease by
implementing a classification algorithm such as the Random Forest classifier.
• To determine significant risk factors for heart disease from a medical
dataset by analysing feature selection methods such as the correlation
matrix and understanding their working principles. Finally, we check the
performance of the model using the confusion matrix, F1-score, precision,
recall, etc.
CHAPTER 2

LITERATURE REVIEW

2.1 RELATED WORKS

[1] Israa Nadheer, Mohammad Ayache, Hussein Kanaan, “Heart
Disease Prediction System Using Machine Learning Algorithm”, Iraqi
Journal of Information and Communications Technology (IJICT),
Conference Series: The 1st Conference of Applied Researches in
Information Engineering (ARIE2021). In this paper we propose a Heart
Disease Prediction System using machine learning algorithms. In terms of data,
we used the Cleveland dataset. This dataset is normalized and then divided into
three training/testing scenarios: 80%-20%, 50%-50% and 30%-70%. These three
scenarios are applied whether the dataset is normalized or not. We used three
machine learning algorithms for every scenario: support vector machine (SVM),
sequential minimal optimization (SMO) and multilayer perceptron (MLP).

[2] Malkari Bhargav, J. Raghunatha, “Study on Risk
Prediction of Cardiovascular Disease Using Machine Learning
Algorithms”, JETIR, August 2020, Volume 7, Issue 8. The main causes of
cardiovascular disease are changes in blood pressure, cholesterol,
increasing heartbeat, etc., which may lead to risk to life and to death. The
main aim of this project is to predict heart disease with machine learning
algorithms and diagnose it in the early stages. In this research, different
machine learning algorithms are implemented on the UCI dataset to find the
algorithm with the best accuracy. The best accuracy was obtained with ANN, so
the ANN classification algorithm is used to estimate the possibility of heart
disease and diagnose it at an initial stage.
[3] Fitriyani, N., Syafrudin, M., Alfian, G. and Rhee, J. (2020)
"HDPM: An Effective Heart Disease Prediction Model for A Clinical
Decision Support System", IEEE Access, Vol. 8, No. 07, pp. 133034-
133050. This paper proposes an effective heart disease prediction model for a
CDSS which consists of Density-Based Spatial Clustering of Applications with
Noise to detect and eliminate the outliers, a hybrid Synthetic Minority Over-
sampling Technique-Edited Nearest Neighbor to balance the training data
distribution and XGBoost to predict heart disease. Two publicly available
datasets (Statlog and Cleveland) were used to build the model and compare the
results with those of other models (Naive Bayes, logistic regression, multilayer
perceptron (MLP), support vector machine (SVM), decision tree, and random
forest) and of previous studies. The model achieved accuracies of 95.90% and
98.40% on the Statlog and Cleveland datasets, respectively.

[4] Senthilkumar Mohan, Chandrasegar Thirumalai and Gautam
Srivastava, “Effective Heart Disease Prediction Using Hybrid Machine
Learning Techniques”, IEEE Access 2019. In this paper, we propose a novel
method that aims at finding significant features by applying machine learning
techniques, resulting in improved accuracy in the prediction of
cardiovascular disease. The prediction model is introduced with different
combinations of features and several known classification techniques. We
produce an enhanced performance level with an accuracy level of 88.7%
through the prediction model for heart disease with the hybrid random forest
with a linear model.

[5] Nikhil Kumar, M., Koushik, K. V. S., Deepak, K. (2019)
“Prediction of Heart Diseases Using Data Mining and Machine Learning
Algorithms and Tools”, International Journal of Scientific Research in
Computer Science, Engineering and Information Technology (IJSRCSEIT).
The prediction of heart disease is the most complex task in the medical field.
It cannot be observed with the naked eye, and it can appear immediately,
anywhere, anytime. Many ML algorithms are more capable of handling various
algorithms. Due to complexity, the processing of massive data sets is more
complicated. By improving these systems, the quality of medical diagnosis
decisions can be improved. They can find patterns hidden in large amounts of
data that will avoid the use of traditional statistical methods for analysis.
In this article, an ENDDP algorithm is developed to predict the early stages of
heart disease. The results prove the performance of the proposed system.

[6] Liaqat Ali, Atiqur Rahman, Aurangzeb Khan, Mingyi Zhou,
Ashir Javeed and Javed Ali Khan, “An Automated Diagnostic System for
Heart Disease Prediction Based on χ2 Statistical Model and Optimally
Configured Deep Neural Network”, IEEE Access 2019. In this paper, we
propose to use the χ2 statistical model while the optimally configured DNN is
searched for using an exhaustive search strategy. The strength of the proposed
hybrid model, named χ2-DNN, is evaluated by comparing its performance with
conventional ANN and DNN models, other state-of-the-art machine learning
models and previously reported methods for heart disease prediction. The
proposed model achieves a prediction accuracy of 93.33%. The obtained results
are promising compared to the previously reported methods.

[7] Liaqat Ali, Awais Niamat, Javed Ali Khan, Noorbakhsh Amiri
Golilarz, Xiong Xingzhong, Adeeb Noor, Redhwan Nour and Syed
Ahmad Chan Bukhari, “An Optimized Stacked Support Vector Machines
Based Expert System for the Effective Prediction of Heart Failure”, IEEE
Access 2019. In this article, we propose a hybrid grid search algorithm that is
capable of optimizing the two models simultaneously. The effectiveness of the
proposed method is evaluated using six different evaluation metrics: accuracy,
sensitivity, specificity, the Matthews correlation coefficient, ROC charts, and
AUC. The experimental results confirm that the proposed method improves the
performance of a conventional SVM model by 3.3%. Moreover, the proposed
method shows better performance compared to the ten previously proposed
methods that achieved accuracies in the range of 57.85%–91.83%.

[8] Padmaja, B., Chintala Srinidhi, Kotha Sindhu, Kalali Vanaja,
Deepika and Krishna Rao Patro, E. (2021) “Early and Accurate Prediction
of Heart Disease Using Machine Learning Model”, Turkish Journal of
Computer and Mathematics Education, Vol. 12, No. 6, pp. 4516-4528.
The purpose of this article is to design a model to predict heart diseases
using machine learning techniques. This model is developed using classification
algorithms, as they play an important role in prediction. The classification
algorithms used include Logistic Regression, Random Forest, Support Vector
Machine, Gaussian Naive Bayes, Gradient Boosting, K-Nearest Neighbours,
Multinomial Naive Bayes and Decision Trees. Out of all the classifiers evaluated
using performance metrics, Random Forest gives good accuracy (93%).

2.2 EXISTING SYSTEM

By collecting data from various sources, classifying it under suitable
headings and finally analysing it to extract the desired information, we can
draw conclusions. This technique can be well adapted to the prediction of
heart disease. The main disadvantages of the existing system are:

1. Medical misdiagnoses are a serious risk to the healthcare profession. If
they continue, people will fear going to the hospital for treatment.
We can put an end to medical misdiagnosis by informing the public and
filing claims and suits against the medical practitioners at fault.
2. Most of these studies are theoretical analyses at the macro level, and
there is a lack of quantitative investigations.
CHAPTER 3

SYSTEM ARCHITECTURE

System Architecture is the conceptual model that defines the structure,
behaviour and views of the system. It is a formal description and
representation of a system, organized in a way that supports the structure and
behaviour of the system.

3.1 PROPOSED SYSTEM

The working of the system starts with the collection of data and the selection
of the important attributes. The required data is then preprocessed into the
required format and divided into two parts: training data and testing data. The
algorithm is applied and the model is trained using the training data. The
accuracy of the system is obtained by testing the system using the testing data.
This system offers the following advantages:

1. It enhances visualization and ease of interpretation.
2. Extensive experiments on real-world large datasets have demonstrated the
effectiveness of our approach for prediction of heart disease.
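
The workflow described above (split the data, train on one part, test on the other) can be sketched in a few lines with scikit-learn. This is a minimal sketch and not the project's actual code: the synthetic data produced by make_classification stands in for the patient records, and the 80/20 split ratio, n_estimators=100 and random_state=0 are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the patient data (assumption): 300 rows, 13 features.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Divide the data into training and testing parts (80/20 is an assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train the model on the training data only.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Estimate accuracy on the held-out testing data.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

The accuracy printed here describes the synthetic data only and says nothing about the result reported for the real dataset.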

3.2 SYSTEM ARCHITECTURE

Dataset collection gathers the data that contains the patient details.
The attribute selection process selects the attributes that are useful for the
prediction of heart disease. After the available data resources are identified,
they are further selected, cleaned and converted into the desired form. The
classification technique is then applied to the preprocessed data to predict
heart disease. The accuracy measure reports the accuracy of the classifier.
Fig 3.1 System Architecture for heart disease prediction

3.3 METHODOLOGY

Machine Learning (ML) is a subset of AI in which a machine learns from the
given data without being explicitly programmed. ML algorithms fall into three
broad groups: Supervised Learning, Unsupervised Learning and Reinforcement
Learning. In this project, we choose the Random Forest algorithm. It is a
supervised learning algorithm, in which machines are trained using well
"labelled" training data and, on the basis of that data, predict the output.

Random forest is one of the most popular machine learning algorithms and
belongs to the supervised learning family. It is based on the concept of
ensemble learning, which is the process of combining multiple classifiers to
solve a complex problem and to improve the performance of the model. In this
algorithm, each tree splits the data at its nodes and produces its own
prediction. The main advantages of the random forest algorithm are that it is
easy to apply, gives good accuracy and can handle large datasets. It can be
used for both classification and regression. The worst-case time complexity of
learning with Random Forests is O(M · d · n log n), where M is the number of
trees grown, n is the number of instances, and d is the data dimension. Random
Forests have a variety of applications, such as recommendation engines, image
classification and feature selection. They can be used to classify loyal loan
applicants, identify fraudulent activity and predict diseases. As the name
suggests, “Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve
the predictive accuracy of that dataset.” Instead of relying on one decision
tree, the random forest takes the prediction from each tree and, based on the
majority vote of these predictions, predicts the final output.
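
To give a feel for the bound O(M · d · n log n), the short sketch below plugs in some numbers. The values are assumptions for illustration: 100 trees, 303 patient records (a size commonly reported for the Cleveland subset, not stated in this report) and 13 input attributes (the 14 attributes used here minus the target). Big-O notation ignores constant factors, so the result is only an order of magnitude.

```python
import math


def rf_training_cost(n_trees, n_samples, n_features):
    """Order-of-magnitude operation count M * d * n * log2(n)."""
    return n_trees * n_features * n_samples * math.log2(n_samples)


# Assumed values: 100 trees, 303 patient records, 13 input attributes.
cost = rf_training_cost(n_trees=100, n_samples=303, n_features=13)
print(f"roughly {cost:,.0f} elementary operations")

# The bound is linear in the number of trees: doubling M doubles the estimate.
doubled = rf_training_cost(n_trees=200, n_samples=303, n_features=13)
print(doubled / cost)
```

A count of a few million operations suggests why training on a dataset of this size should be quick on modest hardware.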

Assumptions: Since the random forest combines multiple trees to predict the
class of the dataset, it is possible that some decision trees may predict the
correct output, while others may not. But together, all the trees predict the
correct output. Therefore, below are two assumptions for a better Random Forest
classifier:

• There should be some actual values in the feature variable of the
dataset so that the classifier can predict accurate results rather than a
guessed result.
• The predictions from each tree must have very low correlations.

Algorithm: It works in four steps:

• Select random samples from a given dataset.
• Construct a Decision Tree for each sample and get a prediction result
from each Decision Tree.
• Perform a vote for each predicted result.
• Select the prediction result with the most votes as the final prediction.
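
The four steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the implementation used in the project: each "tree" is reduced to a single-split stump on one feature, the rows are made-up values, and the helper names (bootstrap_sample, fit_stump, forest_predict) are hypothetical. A real random forest grows full decision trees and also samples features at random.

```python
import random
from collections import Counter


def majority(labels, default):
    """Return the most common label, or `default` for an empty list."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(sample):
    """Step 2 (simplified): a one-split 'tree' on a single feature."""
    threshold = sum(x for x, _ in sample) / len(sample)
    overall = majority([y for _, y in sample], 0)
    left = majority([y for x, y in sample if x <= threshold], overall)
    right = majority([y for x, y in sample if x > threshold], overall)
    return lambda x: left if x <= threshold else right


def forest_predict(stumps, x):
    """Steps 3 and 4: collect one vote per tree and return the majority."""
    votes = [stump(x) for stump in stumps]
    return majority(votes, 0)


# Toy rows of (feature value, label); the values are made up.
rows = [(1, 0), (2, 0), (3, 0), (4, 0), (101, 1), (102, 1), (103, 1), (104, 1)]
rng = random.Random(0)
stumps = [fit_stump(bootstrap_sample(rows, rng)) for _ in range(25)]

print(forest_predict(stumps, 2), forest_predict(stumps, 103))
```

An individual stump trained on an unlucky sample can be wrong, but the vote across all stumps still separates the two groups, which is the point of the ensemble.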
Fig 3.2 Working Principle of Random Forest Algorithm

Advantages

• Random Forest is capable of performing both classification and
regression tasks. It is also a flexible and easy-to-use algorithm.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and reduces the risk of overfitting.

3.4 SYSTEM DESIGN

System Design is the process of designing the architecture, components,
and interfaces for a system so that it meets the end-user requirements. It is
important for defining the product and its architecture. It is necessary for the
interfaces, design, data, and modules to satisfy the system requirements.
3.4.1 USE CASE DIAGRAM

Fig 3.3.1 Use Case Diagram for heart disease prediction

3.4.2 SEQUENCE DIAGRAM

Fig 3.3.2 Sequence Diagram for heart disease prediction

CHAPTER 4

SYSTEM REQUIREMENTS

System requirements analysis is concerned with analysing the existing system
with the aim of determining and structuring the requirements of the proposed
system. The analysis stage focused on the functionality and data flow of
Heart Disease Prediction using Machine Learning Techniques. The following are
needed for developing Heart Disease Prediction using ML Techniques:

4.1 PLATFORM

Windows is a powerful, scalable operating system that provides basic file and
print services as well as a robust platform for server applications. Its main
features are as follows:

• An easier-to-use interface and tools.
• More extensive network performance.
• Enhanced communication features.

4.2 HARDWARE REQUIREMENTS

• Processor - Intel Pentium Processor i3 (or) above
• RAM - 2 GB (or) above
• System type - 32-bit OS, x64-based processor

4.3 SOFTWARE REQUIREMENTS

• Operating System : Windows 7 (or) above
• Technology : Python
• IDE : Jupyter Notebook
CHAPTER 5

SYSTEM IMPLEMENTATION

5.1 THEORETICAL BACKGROUND

Heart Disease Prediction using ML Techniques is implemented using Python.

Python : Python is an interpreted, high-level, general-purpose programming
language. Python's design philosophy emphasizes code readability with its
notable use of significant whitespace. Its language constructs and
object-oriented approach aim to help programmers write clear, logical code for
small and large-scale projects. Python is dynamically typed and garbage
collected. It supports multiple programming paradigms, including procedural,
object-oriented, and functional programming.

Sklearn : Sklearn (scikit-learn) is a widely used and robust library for
machine learning in Python. It provides a selection of efficient tools for
machine learning and statistical modeling, including classification,
regression, clustering and dimensionality reduction, via a consistent interface
in Python. This library, which is largely written in Python, is built upon
NumPy, SciPy and Matplotlib.

NumPy : NumPy is a library for the Python programming language, adding
support for large, multi-dimensional arrays and matrices, along with a large
collection of high-level mathematical functions to operate on these arrays. The
ancestor of NumPy, Numeric, was originally created by Jim Hugunin with
contributions from several other developers.

Matplotlib : Matplotlib is a plotting library for the Python programming
language and its numerical mathematics extension NumPy. It provides an
object-oriented API for embedding plots into applications using general-purpose
GUI toolkits like Tkinter, wxPython, Qt, or GTK. There is also a procedural
"pylab" interface based on a state machine (like OpenGL), designed to closely
resemble that of MATLAB, though its use is discouraged.

Seaborn : Seaborn is a Python data visualization library built on top of
matplotlib and closely integrated with pandas data structures. It provides a
high-level interface for drawing attractive and informative statistical
graphics. Visualization is the central part of Seaborn, which helps in the
exploration and understanding of data.

5.2 MODULES

5.2.1 DATASET COLLECTION

Initially, we collect a dataset for our heart disease prediction system. The
dataset used for this project is Heart Disease UCI. The dataset consists of 76
attributes; out of which, 14 attributes are used for the system.

Fig 5.2.1 Dataset Collection


5.2.2 ATTRIBUTES SELECTION

Attribute (or feature) selection is the selection of appropriate
attributes for the prediction system. It is used to increase the efficiency of
the system. Various attributes of the patient, such as sex, chest pain type,
resting blood pressure, fasting blood sugar, serum cholesterol and
exercise-induced angina (exang), are selected for the prediction. The
correlation matrix is used for attribute selection in this model.

S.no  Attribute  Type       Description

1.    age        Numerical  Age in years (29 to 77)
2.    sex        Nominal    1 = male, 0 = female
3.    cp         Nominal    Chest pain type
                            0: Typical angina: chest pain related to decreased
                               blood supply to the heart
                            1: Atypical angina: chest pain not related to the heart
                            2: Non-anginal pain: typically esophageal spasms
                               (not heart related)
                            3: Asymptomatic: chest pain not showing signs of disease
4.    trestbps   Numerical  Resting blood pressure (in mm Hg on admission to the
                            hospital); anything above 130-140 is typically cause
                            for concern
5.    chol       Numerical  Serum cholesterol in mg/dl
                            (serum = LDL + HDL + 0.2 * triglycerides);
                            above 200 is cause for concern
6.    fbs        Nominal    Fasting blood sugar > 120 mg/dl (1 = true, 0 = false);
                            > 126 mg/dl signals diabetes
7.    restecg    Nominal    Resting electrocardiographic results
                            0: Nothing to note
                            1: ST-T wave abnormality: can range from mild symptoms
                               to severe problems; signals a non-normal heartbeat
                            2: Possible or definite left ventricular hypertrophy:
                               enlarged main pumping chamber of the heart
8.    thalach    Numerical  Maximum heart rate achieved
9.    exang      Nominal    Exercise-induced angina (1 = yes, 0 = no)
10.   oldpeak    Numerical  ST depression induced by exercise relative to rest;
                            looks at the stress of the heart during exercise
                            (an unhealthy heart will stress more)
11.   slope      Nominal    Slope of the peak exercise ST segment
                            1: Upsloping: better heart rate with exercise (uncommon)
                            2: Flatsloping: minimal change (typical healthy heart)
                            3: Downsloping: signs of an unhealthy heart
12.   ca         Numerical  Number of major vessels (0-3) colored by fluoroscopy;
                            a colored vessel means the doctor can see the blood
                            passing through; the more blood movement the better
                            (no clots)
13.   thal       Nominal    Thallium stress result
                            1, 3: normal
                            6: fixed defect: used to be a defect but OK now
                            7: reversible defect: no proper blood movement when
                               exercising
14.   target     Nominal    Has disease or not (1 = yes, 0 = no); this is the
                            predicted attribute

Table 5.2.1 Input Dataset Attributes
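
Correlation-based attribute selection can be sketched with pandas as follows. The six records are made-up toy values, not rows from the dataset, and the 0.5 cut-off is an assumed threshold chosen for illustration; the report does not state which threshold, if any, was applied.

```python
import pandas as pd

# Made-up toy records using attribute names from Table 5.2.1.
df = pd.DataFrame({
    "exang":   [1, 0, 1, 0, 1, 0],
    "fbs":     [0, 0, 1, 1, 0, 0],
    "thalach": [150, 170, 140, 180, 130, 175],
    "target":  [0, 1, 0, 1, 0, 1],
})

# Pearson correlation between every pair of columns.
corr_matrix = df.corr()

# Correlation of each attribute with the target, excluding the target itself.
target_corr = corr_matrix["target"].drop("target")

# Keep attributes whose absolute correlation reaches the assumed cut-off.
threshold = 0.5
selected = target_corr[target_corr.abs() >= threshold].index.tolist()

print(target_corr.round(2))
print("Selected attributes:", selected)
```

In this toy data fbs is uncorrelated with the target and is dropped, while exang (negatively correlated) and thalach (positively correlated) are kept, since the sign of the correlation does not matter for selection.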

5.2.3 DATA PRE-PROCESSING

Data pre-processing is an important step in the creation of a machine
learning model. Initially, the data may not be clean or in the format the model
requires, which can cause misleading outcomes. In pre-processing, we transform
the data into the required format and deal with noise, duplicates and missing
values in the dataset. Data pre-processing includes activities such as
importing datasets and attribute scaling, and it is required for improving the
accuracy of the model. It consists of various techniques such as one-hot
encoding and label encoding.
5.2.3.1 ONE-HOT ENCODING

One-hot encoding is the process of converting a categorical variable into
a set of binary indicator columns, one per category, so that it can be provided
to machine learning and deep learning algorithms. This can improve the
predictions as well as the classification accuracy of a model. One-hot encoding
is a common way of preprocessing categorical features for machine learning
models.
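
As an illustration, the sketch below one-hot encodes the chest pain type column (cp) with pandas. The four rows are made-up values, and the use of pandas.get_dummies is an assumption; the report's own code may use a different function.

```python
import pandas as pd

# Made-up rows: age plus the nominal chest pain type (cp, values 0 to 3).
df = pd.DataFrame({"age": [63, 37, 41, 56], "cp": [3, 2, 1, 0]})

# Replace cp with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["cp"], prefix="cp", dtype=int)

print(encoded)
```

Each row has exactly one indicator set to 1, so the model no longer treats chest pain type 3 as "larger" than type 1.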

5.2.3.2 LABEL ENCODING

Label encoding refers to converting labels into a numeric form so that
they become machine-readable. Each distinct label is mapped to an integer,
which lets machine learning algorithms operate on those labels. It is an
important pre-processing step for structured datasets in supervised learning.
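
A minimal sketch of label encoding with scikit-learn's LabelEncoder is shown below. The list of labels is made up, and using LabelEncoder here is an assumption about the tooling rather than a description of the project's code.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up text labels for the sex attribute.
sex_labels = ["male", "female", "female", "male"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sex_labels)

# Classes are sorted alphabetically, so "female" maps to 0 and "male" to 1.
print(list(encoder.classes_), list(encoded))

# The mapping can be reversed when results need to be reported.
decoded = encoder.inverse_transform(encoded)
```

The resulting codes happen to match the convention in Table 5.2.1 (1 = male, 0 = female) only because of the alphabetical ordering of these two labels.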

5.2.4 PERFORMANCE METRICS

5.2.4.1 CONFUSION MATRIX

A confusion matrix is a technique for summarizing the performance of a
classification algorithm. It is an N x N matrix used for evaluating the
performance of a classification model, where N is the number of target classes.

Fig 5.2.4.1 Confusion Matrix


The confusion matrix is built from two sets of values: actual values and
predicted values. Each can be positive or negative, which gives four outcomes:
TP – True Positive, FP – False Positive, TN – True Negative, FN – False Negative.
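
The four counts can be computed directly from paired actual and predicted labels. The sketch below uses made-up labels for eight patients, and the function name confusion_counts is hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted disease, patient has disease
            else:
                fp += 1  # predicted disease, patient is healthy
        else:
            if actual == positive:
                fn += 1  # predicted healthy, patient has disease
            else:
                tn += 1  # predicted healthy, patient is healthy
    return tp, fp, tn, fn


# Made-up actual and predicted labels (1 = disease, 0 = no disease).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
```

In a medical setting the FN count deserves particular attention, because it represents patients with the disease whom the model missed.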

5.2.4.2 F1-SCORE

The F1-score is a popular performance measure for classification. It is
defined as the harmonic mean of precision and recall. Precision measures how
good the model is at predicting a specific category: it is the ratio of the
number of positive samples correctly classified as positive to the total number
of samples predicted as positive, TP / (TP + FP). Recall is the ratio of the
number of positive samples correctly classified as positive to the total number
of positive samples, TP / (TP + FN). Recall measures the model's ability to
detect positive samples: the higher the recall, the more positive samples are
detected.

F1-Score = 2 × (Precision × Recall) / (Precision + Recall) = 2TP / (2TP + FP + FN) - (1)

When the scores are micro-averaged over all classes in a single-label
classification problem, they coincide with accuracy:

Precision = Recall = Micro F1 = Accuracy - (2)
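
Equation (1) can be checked numerically. In the sketch below the counts TP = 40, FP = 5 and FN = 10 are made-up values, and the function assumes that none of the denominators is zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0 (no zero denominators).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts for illustration.
tp, fp, fn = 40, 5, 10
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

# Second form of equation (1), written directly in terms of the counts.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Both forms of equation (1) give the same value, and F1 always lies between precision and recall, closer to the smaller of the two.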

5.2.4.3 CLASSIFICATION REPORT

A classification report is a performance evaluation metric in machine
learning. It is used to show the precision, recall, F1-score, and support of a
trained classification model. A classification report is used to measure the
quality of predictions from a classification algorithm.
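
With scikit-learn, such a report can be produced by the classification_report function. The labels below are made up, and the class names "no disease" and "disease" are chosen only for readability.

```python
from sklearn.metrics import classification_report

# Made-up actual and predicted labels (1 = disease, 0 = no disease).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
names = ["no disease", "disease"]  # names for labels 0 and 1, in that order

# Text table with precision, recall, F1-score and support per class.
print(classification_report(y_true, y_pred, target_names=names))

# The same numbers as a dictionary, convenient for further processing.
report = classification_report(
    y_true, y_pred, target_names=names, output_dict=True
)
```

The support column is the number of actual samples of each class, which helps judge how much weight each per-class score deserves.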

CHAPTER 6

RESULTS & DISCUSSIONS

Fig 6.1 Import Necessary Libraries & Dataset

Fig 6.2 Input Dataset Attributes (Summary)

Fig 6.3 Input Dataset Attributes (Statistical Analysis)

Fig 6.4 Attribute Selection (Correlation Matrix)

Fig 6.5 Data Transformation (No. of Patients)

Fig 6.6 Data Transformation (No. of Patients - Bar Chart)

Fig 6.7 Data Transformation (No. of Patients - Pie Chart)

Fig 6.8 Data Transformation (Heart Disease Frequency based on Sex)

Fig 6.9 Data Transformation (Heart Disease Frequency based on Sex - Bar Chart Representation)

Fig 6.10 Data Transformation (Heart Disease Frequency based on Age & Max Heart Rate)

Fig 6.11 Data Transformation (Heart Disease Frequency based on Age & Max Heart Rate - Scatterplot)

Fig 6.12 Data Transformation (Heart Disease Frequency based on Per chest pain type)

Fig 6.13 Data Transformation (Heart Disease Frequency based on Per chest pain type - Bar Chart)

Fig 6.14 Data Pre-processing (One Hot Encoding)

Fig 6.15 Data Pre-processing (One Hot Encoding - Result)

Fig 6.16 Data Pre-processing (Label Encoding & its Result)

Fig 6.17 Performance Metrics (Confusion Matrix)

Fig 6.18 Performance Metrics (Confusion Matrix - Result)

Fig 6.19 Performance Metrics (Display Scores - Precision, Recall, F1-Score)

Fig 6.20 Performance Metrics (Display a Classification Report)

CHAPTER 7

CONCLUSION & FUTURE ENHANCEMENT

7.1 CONCLUSION

Heart disease is a major killer in India and throughout the world, and
applying a promising technology like machine learning to the early prediction
of heart disease can have a profound impact on society. Early prognosis can
help high-risk patients decide on lifestyle changes and in turn reduce
complications, which can be a great milestone in the field of medicine. The
number of people facing heart disease is on the rise each year, which calls for
early diagnosis and treatment. Suitable technology support in this regard can
prove highly beneficial to the medical fraternity and to patients. In this
project, the Random Forest algorithm is applied to the dataset and its
performance is measured. The dataset contains 76 features describing the
attributes expected to lead to heart disease in patients, and 14 important
features that are useful for evaluating the system are selected from among
them. If all the features are taken into consideration, the efficiency of the
system decreases considerably. To increase efficiency, attribute selection is
performed: a subset of n features that gives higher accuracy is selected for
evaluating the model. Some features in the dataset have almost equal
correlation, and so they are removed. The model is evaluated with metrics such
as the confusion matrix, accuracy, precision, recall and F1-score. When tested,
the heart disease prediction model built with the Random Forest classifier
gives the highest accuracy of 93%.

7.2 FUTURE ENHANCEMENT

In future work, more machine learning approaches will be used for better
analysis of heart diseases and for earlier prediction, so that the death rate
can be reduced through awareness of the disease. This application can be made
a common platform for predicting all kinds of diseases. We are excited to
continue our project, add more attributes, and extend this application into a
disease prediction application for all kinds of diseases.

APPENDIX

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Notebook magic (Jupyter/Colab): render plots inline
%matplotlib inline

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import *
from sklearn.metrics import *
from sklearn.feature_selection import *

# Import & read the dataset
df = pd.read_csv("/content/drive/MyDrive/Heart Disease Prediction/heart.csv")

df.head()      # first five rows
df.shape       # (rows, columns)
df.info()      # column types and non-null counts
df.describe()  # summary statistics

# Create a correlation matrix and plot it as an annotated heatmap
corrmat = df.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(16,7))
g = sns.heatmap(df[top_corr_features].corr(), annot=True, cmap="RdYlGn")

# Count the patients with and without heart disease
df['target'].value_counts()
disease = len(df[df['target'] == 1])
no_disease = len(df[df['target'] == 0])

# Horizontal bar chart of the two counts
plt.rcdefaults()
fig, ax = plt.subplots()
y = ('Heart Disease', 'No Disease')
y_pos = np.arange(len(y))
x = (disease, no_disease)
ax.barh(y_pos, x, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(y)
# labels read top-to-bottom
ax.invert_yaxis()
ax.set_xlabel('Count')
ax.set_title('Target')
# Annotate each bar with its count
for i, v in enumerate(x):
    ax.text(v + 10, i, str(v), color='black', va='center', fontweight='normal')
plt.show()

# Assigning labels for identifying No Disease & Heart Diseased persons
labels = 'Heart Disease', 'No Disease'
sizes = [disease, no_disease]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
# Equal aspect ratio ensures that pie is drawn as a circle.
ax1.axis('equal')
plt.title('Percentage of target', size=16)
plt.show()

df.sex.value_counts()

# Display a crosstab showing how many persons have heart disease and how many
# do not, according to sex
pd.crosstab(df.target, df.sex)
pd.crosstab(df.target, df.sex).plot(kind='bar',
                                    figsize=(10,6),
                                    color=['cyan','crimson'])
# Labelling
plt.title('Heart Disease Frequency for Sex')
plt.xlabel('0 = No Disease, 1 = Disease')
plt.ylabel('Amount')
plt.legend(['Female','Male'])
plt.grid()
plt.xticks(rotation=0);

plt.figure(figsize=(10,6))
# Scatter with positive examples
plt.scatter(df.age[df.target==1],
            df.thalach[df.target==1],
            color='crimson')
# Scatter with negative examples
plt.scatter(df.age[df.target==0],
            df.thalach[df.target==0],
            color='darkcyan')
# Adding some helpful info
plt.title('Heart Disease in function of Age and Max Heart Rate')
plt.xlabel('Age')
plt.ylabel('Max Heart Rate (thalach)')
plt.legend(["Disease", "No Disease"]);
plt.grid()

# Display a crosstab showing how many persons have heart disease and how many
# do not, according to chest pain type
pd.crosstab(df.cp, df.target)
pd.crosstab(df.cp, df.target).plot(kind='bar',
                                   figsize=(10,6),
                                   color=['cyan','crimson'])
# Labelling
plt.title('Heart Disease Frequency Per Chest Pain Type')
plt.xlabel("Chest Pain Type")
plt.ylabel("Amount")
plt.legend(["No Disease", "Disease"])
plt.grid()
plt.xticks(rotation=0);

# Import the class used for one-hot encoding
from sklearn.preprocessing import OneHotEncoder

# Create an instance of the one-hot encoder
encoder = OneHotEncoder(handle_unknown='ignore')

# Perform one-hot encoding on the 'target' column
encoder_df = pd.DataFrame(encoder.fit_transform(df[['target']]).toarray())

# Merge the one-hot encoded columns back with the original DataFrame
final_df = df.join(encoder_df)

# View the final DataFrame
print(final_df)

from sklearn import preprocessing

# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()

# Encode labels in column 'target'
df['target'] = label_encoder.fit_transform(df['target'])
df['target'].unique()

# Create a DataFrame holding the encoded column
data = pd.DataFrame(df['target'])
print(data)

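# Cross-tabulate the sex and target attributes (a contingency table laid out
# like a confusion matrix). The variable is not named confusion_matrix so that
# it does not shadow the sklearn.metrics function of the same name.
sex_target_table = pd.crosstab(df['sex'], df['target'], rownames=['sex'],
                               colnames=['target'])
print(sex_target_table)

# Display the table as a heatmap
sns.heatmap(sex_target_table, annot=True)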

# The scores below compare the chest pain type (cp) with target; to score a
# classifier, y_true would be the held-out target values and y_pred the
# model's predictions.
y_true = df['cp']
y_pred = df['target']

# Display the micro & macro scores in Precision, Recall, F1-Score
print('Micro Precision: {:.2f}'.format(
    precision_score(y_true, y_pred, average='micro')))
print('Micro Recall: {:.2f}'.format(
    recall_score(y_true, y_pred, average='micro')))
print('Micro F1-score: {:.2f}\n'.format(
    f1_score(y_true, y_pred, average='micro')))
print('Macro Precision: {:.2f}'.format(
    precision_score(y_true, y_pred, average='macro')))
print('Macro Recall: {:.2f}'.format(
    recall_score(y_true, y_pred, average='macro')))
print('Macro F1-score: {:.2f}\n'.format(
    f1_score(y_true, y_pred, average='macro')))

# Display a Classification Report
print(classification_report(y_true, y_pred))
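
The listing above does not include the step in which the Random Forest classifier is trained and scored. The following is a minimal sketch of how that step could look, not the authors' implementation. It runs on a synthetic stand-in dataset generated with `make_classification`, and the 80/20 split, the number of trees and the random seed are all assumptions, so it will not reproduce the accuracy reported in Chapter 7.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart dataset (assumption): 13 features, binary target.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=42)

# Hold out 20% of the samples for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Train the Random Forest classifier (hyperparameters are illustrative).
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out samples only.
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print(cm)
print('Accuracy: {:.2f}'.format(accuracy))
print(classification_report(y_test, y_pred))
```

With the real data, `X` would be the selected feature columns of `df` and `y` the `target` column.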
