HEART DISEASE PREDICTION USING MACHINE LEARNING

PID-BS IT-F18M10

MUHAMMAD ZAIN UL ABIDEEN (Bsf1804155)

WAQAS RAZZAQ (Bsf1804119)

MUHAMMAD HUNZLA BIN JAVAID (Bsf1804550)

DEPARTMENT OF INFORMATION SCIENCES

MULTAN CAMPUS

UNIVERSITY OF EDUCATION

LAHORE

2018-22
HEART DISEASE PREDICTION USING MACHINE LEARNING

MUHAMMAD ZAIN UL ABIDEEN (Bsf1804155)

WAQAS RAZZAQ (Bsf1804119)

MUHAMMAD HUNZLA BIN JAVAID (Bsf1804550)

2018-22

A project submitted in partial fulfillment of the

Requirements for the award of the degree of

Bachelor of Sciences in Information Technology

DEPARTMENT OF INFORMATION SCIENCES

MULTAN CAMPUS

UNIVERSITY OF EDUCATION

LAHORE

2018-22
@Copyright Muhammad Zain ul Abideen 2022

@Copyright Waqas Razzaq 2022

@Copyright Muhammad Hunzla bin Javaid 2022


“We hereby declare that we have read this project and in our opinion this project is
sufficient in terms of scope and quality for the award of the degree of BS Information
Technology.”

Name of Supervisor: ___________________________________________

Signature: ___________________________________________

Date: ___________________________________________

Name of Co-Supervisor: ______________________________________

Signature: __________________________________________

Date: ________________________________

DECLARATION

We declare that this research project/report entitled “Heart Disease Prediction System
Using Machine Learning” is the result of our own research except as cited in the
references. The research project has not been accepted for any degree and is not
concurrently submitted in candidature for any other degree. If at any time our
statement is found to be incorrect, even after the award of the IT degree, the university
has the right to withdraw our IT degree.

Name: ____ Muhammad Zain-ul-Abideen_________

Signature: ___________________________________________

Name: _____________Waqas Razzaq_________________

Signature: ___________________________________________

Name: _________Muhammad Hunzala-bin-Javaid______

Signature: ___________________________________________

Date: ___________________________________________

PLAGIARISM UNDERTAKING

We solemnly declare that the research work presented in the research project entitled
“Heart Disease Prediction System Using Machine Learning” is solely our research
work with no significant contribution from any other person. Any small contribution
or help taken has been duly acknowledged, and the complete thesis has been
written by us.

We understand the zero-tolerance policy of the HEC and the University of Education,
Lahore towards plagiarism. Therefore, we as the authors of the above titled thesis
declare that no portion of our thesis has been plagiarized and that any material used as
reference is properly referred/cited.

We undertake that if we are found guilty of any formal plagiarism in the above titled
thesis, even after the award of the BS degree, the University reserves the right to
withdraw/revoke our IT degree, and that the HEC and the University have the right to
publish our names on the HEC/University website on which the names of students who
submitted plagiarized theses are placed.

Name: ___________________________________________
Signature: ___________________________________________

Name: ___________________________________________

Signature: ___________________________________________

Name: ___________________________________________
Signature: ___________________________________________

Date: ___________________________________________

CERTIFICATE OF APPROVAL

This is to certify that the research project presented in this thesis, entitled
“Heart Disease Prediction System Using Machine Learning”, was conducted by
Muhammad Zain-ul-Abideen, Waqas Razzaq and Muhammad Hunzala-bin-Javaid
under the supervision of Sir Shahid Touqeer.

No part of this research project has been submitted anywhere else for any
other degree. This thesis is submitted to the Division of Science and Technology,
University of Education, Lahore in partial fulfillment of the requirements for the
degree of Bachelors in Information Technology.

Student Name:___________________ Signature: ________________

Student Name:___________________ Signature: ________________

Student Name:___________________ Signature: ________________

Examination Committee:

1. External Examiner
Name: ____________________ Signature: _________________

(Designation & Office Address)


__________________________
__________________________

Supervisor Name: ____________________ Signature:__________________

Dean/HOD Name: ____________________ Signature:__________________



ABSTRACT

Cardiovascular disease is one of the most life-threatening diseases. It contributes to
nearly 17 million deaths all over the world. Early diagnosis helps to treat the disease
in a timely manner and prevent mortality. Several machine learning and deep learning
techniques are available to classify the presence or absence of the disease.
In this research, the Logistic Regression (LR) technique is applied to a dataset to
classify cardiac disease. To improve the performance of the model, the data were
pre-processed by cleaning the dataset and finding missing values, and feature selection
was performed by computing the correlation of every feature with the target value. The
features with high positive correlation were selected. Classification was then performed
by dividing the dataset into training and testing sets in the ratios 80:20, 70:30, 40:60
and 50:50. The 50:50 split gave the best accuracy. The LR model obtained 86.9%
accuracy.

Table of Contents
1. INTRODUCTION
1.1 Background of the Study
1.2 Problem Statement
1.3 Objective of the Study
1.3.1 Main Objectives
1.3.2 Specific Objectives
1.4 Scope of the Study
1.4.1 Limitations of the Study
1.5 Significance of the Study
1.6 Gantt Chart
2. LITERATURE REVIEW
2.1 Introduction
2.2 Literature Review
2.3 Proposed Architecture
2.3.1 Naïve Bayes Classifier
2.3.2 Stochastic gradient descent (SGD)
3. RESEARCH METHODOLOGY
3.1 System Development Methodology
3.2 Work Break down Structure
3.3 Process Model
3.4 Proposed Work
3.4.1 Data Acquisition
3.4.2 Data Pre-processing
3.4.3 Feature Selection
3.4.4 Splitting Dataset
3.5 Data Visualization
3.6 Classification
3.6.1 Pseudo Code
3.7 Summary
4. RESULTS AND DISCUSSION
4.1 Results of Entire System
4.1.1 Person without Heart Disease
4.1.2 Person with Heart Disease
4.2 Technologies
5. IMPLICATIONS/RECOMMENDATIONS
6. CONCLUSION
REFERENCES

LIST OF FIGURES

FIGURE NO.   TITLE
1.1          Gantt Chart
2.1          Flow Chart of naïve Bayes Decision
2.2          Flow Chart of Entire System
2.3          Data Flow Diagram of Model Working
2.4          Kappa, ROC, MAE for different algorithms
2.5          Relative Absolute Error of different algorithms
3.1          Work Break down Structure
3.2          Iterative Process Model
3.3          Logistic Regression Cardiac Disease Classification flow diagram
3.4          Distribution of Heart Disease in accordance to Gender
3.5          Heat Map of Subset Attributes
4.1          Accuracy Result of Logistic Regression Classifier on Training Data
4.2          Accuracy Result of Logistic Regression Classifier on Testing Data
4.3          ROC curve

LIST OF TABLES

TABLE NO.    TITLE
2.1          Accuracy of different Algorithms
3.1          Feature selection using Correlation
3.2          Split Percentage of Training and Test Set
4.1          Accuracy Result of Logistic Regression Classifier on Training Data
4.2          Accuracy Result of Logistic Regression Classifier on Testing Data
4.3          Previous work on Logistic Regression


CHAPTER 1
INTRODUCTION

Machine learning disease prediction is a system that predicts diseases based on
information provided by people. Today the greatest challenge for the medical industry
is to provide better facilities within the health infrastructure so that disease can be
diagnosed at an early stage and treated in time, improving quality of life through
quality of service. Around 31% of deaths worldwide are due to cardiac disease (World
Health Organization, J. Dostupno, K. Uyar & A. Ilhan, 2016, 2017). The system predicts
the disease of the patient or user based on the information or symptoms entered into the
system and gives results based on those inputs. It is most helpful when the patient's
condition is not serious and the user just wants to understand what kind of disease
he/she may have.

The system gives ideas and tips for taking care of the user's health and shows how to
detect disease using this prediction. By entering the symptoms and all other useful
information, users can learn which disease they are affected by. The health industry
can also benefit from this technique: by asking the user for the values and entering
them into the system, staff can tell within a few seconds whether the heart is in good
condition or not.

ML systems of this kind have been implemented by many other organizations, but
we intend to make ours unique and more useful to its users. This Heart Disease
Prediction Using Machine Learning project is built entirely with machine learning
algorithms and the Python programming language, and it predicts the disease using a
dataset previously made available by hospitals.
Nowadays doctors use many technologies and methodologies to identify and diagnose
not only common diseases but also many deadly ones.
Successful treatment is normally attributed to exact and accurate analysis.
When doctors fail to make accurate decisions while examining a patient's disease,
disease forecasting systems that use ML algorithms can help.

1.1 Background of the Study


Among all fatal diseases, heart diseases are considered the most prevalent.
Medical specialists conduct surveys on heart disease and gather information about
heart patients, their symptoms and disease progression. Most patients are reported
with common diseases that have typical symptoms. In this fast-moving world people
want to live a luxurious life, so they work like machines to earn a lot of money and
live comfortably. In this race they forget to take care of themselves: their food habits
and their entire lifestyle change, they are under more stress, they develop high blood
pressure and blood sugar problems at a very young age, they do not rest enough, and
they eat whatever is available without considering the quality of the food. If they feel
sick they resort to self-medication. Together, these small acts of negligence lead to a
major threat, namely heart disease.

Data mining has been used in a variety of applications such as marketing, customer
relationship management, engineering, medical analysis, expert prediction, web
mining and mobile computing. Of late, data mining has also been applied successfully
to detecting healthcare fraud and abuse cases.

Data analysis is crucial in the medical field. It provides a meaningful basis for
critical decisions and helps to create a complete study proposal. One of its most
important uses is that, with proper statistical treatment, it helps to keep human bias
away from medical conclusions. Data mining is used for exploratory analysis because
large volumes of data contain nontrivial information. The health care industry collects
huge amounts of data that contain hidden information useful for making effective
decisions, and data mining techniques are used to extract it and to improve the
resulting conclusions. Vast medical records are available for research, but the medical
industry faces enormous challenges in using this data. Machines can transform the vast
amount of data into valuable and accurate information quickly, which makes machine
learning an important area. Machine learning models are highly useful for discovering
hidden patterns and correlations among features in a dataset
(Kausar, S. Palaniappan, B.B. Samir, A. Abdullah, N. Dey, T. Turner & R. Stocker,
2016, 2013). The heart disease detection system will use data mining knowledge to
give a user-oriented approach to new and hidden patterns in the data. The knowledge
obtained can be used by healthcare experts to deliver a better quality of service and
to reduce the extent of adverse medicine effects.

1.2 Problem Statement

Heart disease can be managed effectively with a combination of lifestyle changes,
medicine and, in some cases, surgery. With the right treatment, the symptoms of heart
disease can be reduced and the functioning of the heart improved. The predicted
results can be used for prevention and thus reduce the cost of surgical treatment and
other expensive procedures. Researchers have considered different sets of risk features,
the most prevalent being a set of 13 features. Feature selection is therefore an
important part of the study, since the prediction accuracy of the model increases or
decreases depending on the features selected (D. Singh & J.S. Samagh, 2020). The
overall objective of our work is to predict the presence of heart disease accurately
with few tests and attributes. Many input attributes could be taken, but our goal is to
predict the risk of heart disease with few attributes and greater efficiency. Decisions
are often made based on doctors' intuition and experience rather than on the
knowledge-rich data hidden in data sets and databases. This practice leads to unwanted
errors and excessive medical costs, which affect the quality of service provided to
patients. Data mining holds great potential for the healthcare industry: it enables
health systems to use data and analytics systematically to identify inefficiencies and
best practices that improve care and reduce costs. The opportunities to improve care
and reduce costs concurrently could apply to as much as 30% of overall healthcare
spending. The successful application of data mining in highly visible fields like
e-business, marketing and retail has led to its application in other industries and
sectors. Healthcare is among the sectors that are just discovering it. The healthcare
environment is still “information rich” but “knowledge poor”, although there is a
wealth of data available within healthcare systems.

1.3 Objectives of Study

1.3.1 Main Objectives

The main objective of this study is to predict whether a patient is affected by heart
disease or not, using a machine learning algorithm (Logistic Regression) on a qualified
dataset by finding the correlations between different attributes (Carney, R. M. &
Freedland, K. E., 2010). The system can discover and extract hidden knowledge
associated with diseases from a historical heart data set. The heart disease prediction
system aims to use data mining techniques on a medical data set to assist in the
prediction of heart diseases.

1.3.2 Specific Objectives
• Provide a new approach to discovering concealed patterns in the data
• Help avoid human bias
• Reduce medical costs

1.4 Scope of the Study

The scope of the project is that integrating clinical decision support with
computer-based patient records could reduce medical errors, enhance patient safety,
decrease unwanted practice variation, and improve patient health more effectively.
This suggestion is promising because data modeling and analysis tools, e.g., data
mining, have the potential to generate a knowledge-rich environment which can help to
significantly improve the quality of clinical decisions.

1.4.1 Limitation of the Study

Medical diagnosis is a significant yet complex task that needs to be carried out
precisely and efficiently, and automating it would be highly beneficial. Clinical
decisions are often made based on the doctor's intuition and experience rather than on
the knowledge-rich data hidden in the database. This practice leads to unwanted biases,
errors and excessive medical costs, which affect the quality of service provided to
patients. Data mining has the potential to generate a knowledge-rich environment which
can help to significantly improve the quality of clinical decisions.

1.5 Significance of the Study

Clinical decisions are often made based on the doctor's insight and experience rather
than on the knowledge hidden in the dataset. This practice leads to different barriers,
errors and excessive medical costs, which affect the quality of service provided to
patients. The proposed system will integrate clinical decision support with
computer-based patient records (data sets). This will reduce medical errors, enhance
patient safety, decrease unwanted practice variation, and improve patient outcomes.
This suggestion is promising because data modeling and analysis tools, e.g., data
mining, have the potential to generate a knowledge-rich environment which can help to
significantly improve the quality of clinical decisions. The medical data domain
contains voluminous records, and because of this it has become necessary to use data
mining techniques to help with decision support and prediction in the field of
healthcare. Therefore, medical data mining is useful for diagnosing disease.

1.6 Gantt Chart

FIGURE 1.1: Gantt Chart



CHAPTER 2
THE LITERATURE REVIEW

2.1 Introduction
Data mining is the process of discovering previously unknown patterns and trends in
databases and using that information to build predictive models. Data mining
combines statistical analysis, machine learning and database technology to extract
hidden patterns and relationships from large databases. The World Health Statistics
2012 report highlights the fact that one in three adults worldwide has elevated blood
pressure, a condition that accounts for almost half of all deaths from stroke and heart
disease.

Heart disease, also known as cardiovascular disease (CVD), covers many
conditions that affect the heart, not only heart attacks. Heart disease is the leading
cause of death in various countries, including India. Heart disease kills one person
every 34 seconds in the United States. Coronary heart disease, cardiomyopathy and
cardiovascular disease are some categories of heart disease. The term “cardiovascular
disease” covers a variety of conditions that affect the heart and blood vessels as well
as the way blood is pumped and distributed throughout the body.

Diagnosis is a difficult and important task that needs to be done accurately and
effectively. Diagnosis is usually made based on the doctor's experience and
knowledge. This leads to unwanted errors and excessive treatment costs for
patients. Therefore, an automated medical diagnostic program can be very
helpful.

2.2 Literature Review


Numerous studies have focused on the diagnosis of heart disease. They have applied
different data mining techniques to the diagnosis and reported different accuracies
for the different methods.

Polaraju, Durga Prasad, & Tech Scholar (2017) proposed heart disease prediction
using a multiple regression model and showed that Multiple Linear Regression is
appropriate for predicting the risk of heart disease. The work was carried out on a
training data set of 3000 instances with the 13 attributes mentioned earlier. The data
set was divided into two parts: 70% of the data was used for training and
30% for testing (Polaraju, K., Durga Prasad, D. & Tech Scholar, M., 2017).

Beyene & Kamat (2018) compare different algorithms such as Naive Bayes, Decision
Tree, KNN, Logistic Regression, SVM and ANN, and report that Logistic Regression
provides better accuracy than the other algorithms. Beyene & Kamat (2018) developed a
heart disease prediction system using data mining techniques. WEKA software was
used for automatic diagnosis and for the provision of services at health facilities. The
paper used various algorithms such as SVM, Naïve Bayes, Association Rule, KNN,
ANN and Decision Tree, and recommended SVM as more efficient and more accurate
than the other data mining algorithms. Chala Beyene recommended predicting and
analyzing the occurrence of heart disease using data mining techniques. The primary
goal is to predict the occurrence of heart disease so that the disease can be diagnosed
automatically and results obtained in a short time. The proposed approach is also
important for health care organizations whose professionals lack the necessary
knowledge and skills. It uses a variety of medical attributes such as blood sugar,
heart rate, age and sex, among others, to determine whether a person has heart disease
or not. The data set analysis is performed using the WEKA
software (Beyene, C., & Kamat, P., 2018).

Soni, Ansari, & Sharma (2011) proposed using a non-linear classification algorithm to
predict heart disease. They recommend big data tools such as the Hadoop
Distributed File System (HDFS) and MapReduce, together with SVM, for heart disease
prediction on a set of attributes. This work investigated the use of various data mining
methods for predicting heart disease. It suggests using HDFS to store large data across
different nodes and running the SVM prediction algorithm on more than one node at a
time. Running SVM in parallel in this way produced a better computation time than
sequential SVM (Soni, J., Ansari, U., & Sharma, D., 2011).

Science & Faculty (2009) suggested heart disease prediction using data mining
and machine learning algorithms. The aim of this study was to uncover hidden patterns
by applying data mining techniques. The J48 algorithm performed best on the UCI data,
with a much higher accuracy rate than LMT (Science, C., & Faculty, G. M., 2009).

Purushottam, Saxena, & Sharma (2016) proposed a system for predicting heart
disease using data mining. The system helps the doctor to make effective decisions
based on specific parameters. By testing and training on specific parameters, it
provides 86.3% accuracy in the testing phase and 87.3% in the training phase
(Purushottam, Saxena, K., & Sharma, R.).

Sai & Reddy (2017) proposed heart disease prediction using the ANN algorithm in
data mining. Because of the increasing cost of heart disease diagnosis, there has been a
need to develop a new system that can predict heart disease. The prediction model is
used to predict the patient's condition after evaluation based on various parameters
such as heart rate, blood pressure and cholesterol. The system is demonstrated in Java
(Sai, P. P., & Reddy, C.).

A & Naik (2016) recommended developing a prediction system that diagnoses heart
disease from a patient's medical data set. Thirteen risk factors were considered as
input attributes for the system design. After analysis of the data from the database,
data cleaning and data integration were performed. They used the k-means and
Naïve Bayes methods to predict heart disease. The paper builds the system using
historical heart data, which supports the diagnosis. To extract knowledge from a
database, data mining techniques such as clustering and classification methods can be
used. Thirteen attributes with a total of 300 records from the Cleveland Heart Database
were used. The model predicts whether a patient has heart disease or not based on the
values of the 13 attributes (A, A. S., & Naik, C.).

Sultana, Haider, & Uddin (2017) proposed an analysis of heart disease. The paper
applies data mining techniques to predict the disease. It is intended to provide a
survey of current techniques for extracting information from a dataset that will be
useful to health professionals. Performance can be measured by the time taken to build
the decision tree for the system. The main goal is to predict the disease with a small
number of attributes (Sultana, M., Haider, A., & Uddin, M. S.).

Firda Anindita Latifah et al. proposed a comparative study of two machine learning
models, namely logistic regression and random forest, for classification of heart
disease. The research was done on the Framingham dataset with 3656 records and a
training to testing ratio of 70:30. The model achieved an accuracy of 85.04%
(F.A. Latifah, & I. Slamet).

Saba Bashir et al. proposed several ML algorithms, namely logistic regression,
decision tree, logistic regression support vector machine, etc. The research uses the
UCI dataset; logistic regression achieved an accuracy of 82.56% and logistic
regression support vector machine achieved an accuracy of 84.85%
(S. Bashir, Z.S. Khan, F. Hassan Khan, A. Anjum, & K. Bashir).

2.3 Proposed Architecture


An effective heart attack prediction system can be built using the Naïve Bayes
algorithm. Input can be provided as a CSV file or entered into the system manually.
The Naïve Bayes algorithm is then applied to the input. After the data set is accessed,
the operation is performed and the heart attack risk level is produced. The proposed
system will add other important heart attack parameters, such as weight, age and
critical levels, in consultation with specialist physicians and medical professionals.
The heart attack prediction system is designed to help identify different levels of
heart attack risk as normal, low or high and to provide prescriptive medical
information related to the predicted outcome.

2.3.1 Naïve Bayes Classifier


The Naïve Bayes classifier is based on Bayes' theorem. This classifier assumes
conditional independence, where the value of an attribute is independent of the values
of the other attributes. Bayes' theorem is as follows: let X = {x1, x2, ..., xn} be a set
of n attributes. In Bayesian terms, X is considered the evidence and H a hypothesis,
for example that the data X belongs to class C. We must determine P(H | X), the
probability that hypothesis H holds given the evidence, i.e. the data sample X.
According to Bayes' theorem, P(H | X) is defined as: P(H | X) = P(X | H) P(H) / P(X).
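
As a small worked illustration of the formula above, the following Python sketch computes P(H | X) for one hypothesis H (disease present) and one piece of evidence X (a positive symptom). The function name `posterior` and all probability values are hypothetical and chosen only for illustration; they are not taken from the project dataset.

```python
def posterior(prior_h, p_x_given_h, p_x_given_not_h):
    """Return P(H | X) using Bayes' theorem for a binary hypothesis H."""
    # P(X) from the law of total probability over H and not-H
    p_x = p_x_given_h * prior_h + p_x_given_not_h * (1.0 - prior_h)
    # P(H | X) = P(X | H) P(H) / P(X)
    return p_x_given_h * prior_h / p_x


# Hypothetical inputs: P(H) = 0.10, P(X | H) = 0.80, P(X | not H) = 0.20
p_disease = posterior(prior_h=0.10, p_x_given_h=0.80, p_x_given_not_h=0.20)
print(round(p_disease, 3))
```

Even with a likelihood of 0.80, the posterior stays well below 0.80 here because the assumed prior P(H) is small, which is why the prior term in the formula matters.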

Table 2.1: Accuracy of different Algorithms

Algorithm Used    Accuracy
Naïve Bayes       52.33%
Decision Tree     52%
K-NN              45.67%

Through Bayesian classifiers, the system will discover hidden knowledge
related to diseases in the historical records of patients with heart disease. Bayesian
classifiers predict class membership probabilities, that is, the probability that a
given sample belongs to a particular class. The classifier is based on Bayes' theorem,
which we can use to determine the probability that a proposed diagnosis is correct,
given the observations. A simple probabilistic classifier, the Naïve Bayes classifier,
applies Bayes' theorem to perform the classification.
According to the Naïve Bayes classifier, the presence or absence of a particular
feature of a class is considered independent of the presence or absence of any other
feature. The Naïve Bayes classification technique is applicable when the
dimensionality of the input is high and a more efficient result is expected. The Naïve
Bayes model identifies the physical characteristics and features of patients suffering
from heart disease. For each input, it gives the probability of the predicted state.
Naïve Bayes is a statistical classifier that assumes no dependency between attributes.
The classifier uses conditional independence, meaning that it assumes the value of an
attribute for a given class is independent of the values of the other attributes. An
advantage of Naïve Bayes is that one can work with the Naïve Bayes model without
using any other Bayesian methods (Brownlee, 2016). The posterior is computed as:

P(disease | symptom1, symptom2, ..., symptomn) = P(disease) P(symptom1, ..., symptomn | disease) / P(symptom1, symptom2, ..., symptomn).
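
The posterior above, combined with the conditional independence assumption, can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the project: the function names, the toy records and the use of Laplace (add-one) smoothing are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict_proba(model, row):
    """Return P(class | row) for every class, assuming independent attributes."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for i, value in enumerate(row):
            # Laplace-smoothed estimate of P(attribute value | class)
            count = value_counts[(i, label)][value]
            score *= (count + 1) / (n_label + len(values_seen[i]))
        scores[label] = score
    evidence = sum(scores.values())  # plays the role of P(symptom1, ..., symptomn)
    return {label: s / evidence for label, s in scores.items()}


# Hypothetical toy records: (blood pressure level, chest pain)
rows = [("high", "yes"), ("high", "no"), ("normal", "no"),
        ("normal", "no"), ("high", "yes"), ("normal", "yes")]
labels = ["disease", "disease", "healthy", "healthy", "disease", "healthy"]

model = train_naive_bayes(rows, labels)
probs = predict_proba(model, ("high", "yes"))
print(probs)
```

Dividing by the sum of the per-class scores replaces an explicit estimate of the denominator, since the denominator is the same for every class.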

FIGURE 2.1: Flow Chart of naïve Bayes Decision

A decision tree is literally a tree of branches, nodes, and leaves. To classify an
unknown data point, we start at the root and descend the tree, using the attribute
values of the data point to choose a branch at each node, until a leaf is reached
and the class of the data point can be determined. To create a good decision tree
model, we need an existing data set with known outcomes on which we can build our
model. We also divided our data into two parts: a training set, used for model
creation, and a test set, used to ensure that the model was accurate and not
over-fitted.
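The descent from root to leaf described above can be sketched in a few lines of Python. The tree below is hand-written and hypothetical: the feature names are borrowed from the dataset, but the thresholds and leaf labels are illustrative assumptions, not a tree learned from the data.

```python
def classify(tree, sample):
    """Descend from the root until a leaf (a plain class label) is reached."""
    node = tree
    while isinstance(node, dict):
        value = sample[node["feature"]]
        # Go left when the value is at or below the threshold, right otherwise.
        node = node["left"] if value <= node["threshold"] else node["right"]
    return node


# Hypothetical tree: thresholds and labels are for illustration only.
tree = {
    "feature": "cp",
    "threshold": 0,
    "left": {
        "feature": "thalach",
        "threshold": 150,
        "left": "heart disease",
        "right": "normal",
    },
    "right": "normal",
}

label = classify(tree, {"cp": 0, "thalach": 120})
```

In practice the tree would be learned from the training set and checked against the test set rather than written by hand.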

The proposed flow of the entire system is shown in the flow chart below.

Start → Collect Heart Disease Dataset → Extract Significant Variable → Data
Processing → Build Neural Network → Train Neural Network → Test Performance →
Deploy Model → Classifier → Normal / Heart Disease → End

FIGURE 2.2: Flow Chart of Entire System

Data Set → Preprocessing → Pattern Matching → Prediction → Rule Generation →
Accuracy Calculation → Results

FIGURE 2.3: Data Flow Diagram of Model Working

FIGURE 2.4: Kappa, ROC, MAE for different algorithms



FIGURE 2.6: Relative Absolute Error of different algorithms

2.3.2 Stochastic Gradient Descent (SGD)

Gradient descent is an optimization algorithm used to minimize the loss functions of
many models, such as Support Vector Machines (SVM) and Logistic Regression, and it
is often used to fit linear functions. The word "stochastic" refers to the randomness
introduced into the optimization. In Stochastic Gradient Descent, a "batch" of
samples is selected at random for each iteration instead of the whole data set, and
only this batch is used to compute the update for that iteration.
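The random batch selection can be sketched as follows. This is a minimal illustration that fits a straight line by mini-batch stochastic gradient descent on squared error; the function name `sgd_linear`, the learning rate, the batch size, and the toy data are assumptions made for the example and are not part of the project.

```python
import random


def sgd_linear(xs, ys, lr=0.01, batch_size=4, steps=5000, seed=0):
    """Fit y ~ w * x + b by mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(steps):
        # Draw a random batch instead of using the whole data set.
        batch = rng.sample(indices, batch_size)
        # Gradient of the mean squared error over the batch only.
        grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / batch_size
        grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / batch_size
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [2 * x + 1 for x in xs]  # noise-free toy data with slope 2 and intercept 1
w, b = sgd_linear(xs, ys)
```

Because each step looks at only a few samples, one iteration is cheap, at the price of a noisier path towards the minimum.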

CHAPTER 3
RESEARCH METHODOLOGY

3.1 System Development Methodology



A software development methodology is a method for managing project development.
Many methodology models are available, such as the Waterfall, Incremental, RAD,
Agile, Iterative, and Spiral models, and the developer must decide which one to use
in the project. A methodology model helps to manage the project efficiently and
helps the developer avoid problems during development. It also helps to achieve the
objectives and scope of the project. In order to build the project, the stakeholder
requirements must be understood.
The methodology also provides a framework for undertaking the proposed data mining
(DM) modeling: it comprises the steps that transform raw data into recognized data
patterns from which knowledge is extracted for users.

3.2 Work Breakdown Structure

FIGURE 3.1: Work Breakdown Structure

3.3 Process Model



The Heart Disease Detection System will be implemented and executed using the
plan-driven Iterative Process Model.

FIGURE 3.2: Iterative Process Model

We selected this model because, instead of beginning with fully known requirements,
we can start by implementing a subset of the software requirements, test and
evaluate it, and add further requirements after each iteration. Each iteration
produces a new version of the software, and the cycle repeats until the complete
project is ready. This provides the flexibility to modify the requirements and the
software design if needed. We therefore adopted the Iterative model for developing
this project, as it is the only SDLC model we found suitable for our project
development.

3.4 Proposed Work


In the proposed system, the cardiac disease dataset is analysed through data
acquisition, pre-processing by cleaning the data, and feature selection, which keeps
the features that have a high correlation with the target variable. A logistic
regression model is then trained and tested to predict whether cardiac disease is
present or not.

FIGURE 3.3: Logistic Regression Cardiac Disease Classification flow diagram

3.4.1 Data Acquisition

The cardiac disease dataset was obtained from the UCI ML repository. It contains 13
features and 1025 records.

3.4.2 Data Pre-processing



The UCI cardiovascular disease dataset is first loaded, and then data cleaning and a
check for missing values are performed on all records. The dataset contains complete
information. The attributes of the dataset are multi-class variables, and the target
has a binary classification.
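A check for missing values of the kind described above might look like the following sketch. The helper names, the set of markers treated as "missing", and the three sample records are all assumptions made for illustration; the project's own cleaning code is not shown in this report.

```python
MISSING_MARKERS = (None, "", "?")  # assumed markers for a missing value


def count_missing(records):
    """Count, per column, how many records have a missing value."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts.setdefault(column, 0)
            if value in MISSING_MARKERS:
                counts[column] += 1
    return counts


def drop_incomplete(records):
    """Keep only the records in which every column has a value."""
    return [
        record
        for record in records
        if not any(value in MISSING_MARKERS for value in record.values())
    ]


# Hypothetical records using column names from the dataset.
records = [
    {"age": 63, "sex": 1, "thalach": 150},
    {"age": 41, "sex": 0, "thalach": "?"},
    {"age": 57, "sex": 1, "thalach": 163},
]
missing = count_missing(records)
clean = drop_incomplete(records)
```

If every count is zero, as the report states for this dataset, no records need to be dropped.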

3.4.3 Feature Selection

Each patient record is described by the 13 features of the dataset. Some, such as
sex and age, are personal information; the rest consist of medical information,
which is vital for predicting heart disease. The correlation of all 13 attributes
with the target value was computed in order to select the features with a high
positive correlation, as shown in Table 3.1.

Table 3.1: Feature selection using Correlation

Features Correlation

Exang 0.436757

Cp 0.433798

Oldpeak 0.430696

Thalach 0.421741

Ca 0.391724

Slope 0.344029
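Correlation-based selection of this kind can be sketched in plain Python. The report does not say which correlation coefficient or cut-off was used, so the Pearson coefficient, the threshold of 0.3, and the toy columns below are assumptions for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long, non-constant lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def select_features(columns, target, threshold):
    """Keep the columns whose correlation with the target exceeds the threshold."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return {name: r for name, r in scores.items() if r > threshold}


# Toy columns with hypothetical names and values.
columns = {
    "a": [1, 2, 3, 4],
    "b": [4, 3, 2, 1],
    "c": [1, 3, 2, 4],
}
target = [0, 0, 1, 1]
selected = select_features(columns, target, threshold=0.3)
```

Column `b` is dropped because its correlation with the target is negative, which mirrors the report's choice of keeping only high positive correlations.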

3.4.4 Splitting Dataset


Table 3.2 below shows the ratios, in percent, in which the dataset is split into
training and test sets.

Table 3.2: Split Percentage of Training and Test Set

Serial Number Training Set Test Set

1 50% 50%

2 60% 40%

3 70% 30%

4 80% 20%
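The four split ratios can be reproduced with a short sketch. The shuffling seed and the helper name `split_dataset` are assumptions, and the integers stand in for the 1025 patient records; the report does not state how the records were shuffled or rounded.

```python
import random


def split_dataset(records, train_percent, seed=42):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point surprises at the cut point.
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


records = list(range(1025))  # stand-ins for the 1025 patient records
sizes = {}
for train_percent in (50, 60, 70, 80):
    train, test = split_dataset(records, train_percent)
    sizes[train_percent] = (len(train), len(test))
```

Fixing the seed keeps the split reproducible, so accuracies measured on different runs remain comparable.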

3.5 Dataset Visualization

The data are visualized for features such as gender, chest pain category, and
fasting blood sugar level. According to this Cleveland dataset, males are more
likely than females to get heart disease. The majority of individuals with
cardiovascular disease experience asymptomatic chest discomfort.

The distribution of heart disease by gender is shown in Figure 3.4: 68% males and
32% females among those getting heart disease.

FIGURE 3.4: Distribution of Heart Disease in accordance to Gender



FIGURE 3.5: Heat Map of Subset Attributes

3.6 Classification

Logistic Regression (LR) is one of the simplest and most widely used ML
classification algorithms. It is a supervised binary classification algorithm that
works on a categorical dependent variable, so the result is a discrete or binary
value, 0 or 1. The sigmoid function is used as the activation function: it maps a
predicted real value to a probability between 0 and 1.

Logistic sigmoid function:

\[ P(x) = \frac{1}{1 + e^{-x}} \tag{1} \]

where P(x) is the estimated probability, a value between 0 and 1; x is the input to
the probability function (the algorithm's prediction value); and e is Euler's
number, approximately equal to 2.71828, as shown in equation 1.
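Equation 1 translates directly into code. The sketch below is illustrative: the helper `predict_label` and its 0.5 threshold follow the class-assignment rule used in this chapter, while the split into two branches is only a numerical safeguard added for the example.

```python
import math


def sigmoid(x):
    """Logistic sigmoid of equation 1: maps any real x to a value from 0 to 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Algebraically equal form that avoids overflow for very negative x.
    z = math.exp(x)
    return z / (1.0 + z)


def predict_label(x, threshold=0.5):
    """Return class 1 when the estimated probability exceeds the threshold."""
    return 1 if sigmoid(x) > threshold else 0


p_zero = sigmoid(0.0)
```

An input of 0 gives a probability of exactly 0.5, which is the decision boundary between the two classes.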

To predict cardiac disease, a logistic regression ML model is used. The LR model is
first trained under each of the four splitting conditions and then tested on the
test data, to find the best accuracy and to observe the model's behavior. The
algorithm outputs the category 1 or 0 for the presence or absence of cardiac
disease.

The Logistic Regression model described in Pseudo Code 1 is used for both training
and testing on the data instances.

3.6.1 Pseudo Code: Logistic Regression

• Input: Feature-selected data
• Output: Best classification
• Algorithm:
    For i ← 1 to k
        For each training and testing data instance d_j
            Set the target value for the regression to
                z_j ← (y_j − P(1|d_j)) / [P(1|d_j) · (1 − P(1|d_j))]
            Initialize the weight of instance d_j to
                w_j ← P(1|d_j) · (1 − P(1|d_j))
        Fit f(j) to the data with class values (z_j) and weights (w_j)
    Assign class label 1 if P(1|d_j) > 0.5, otherwise class label 0
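One possible reading of the pseudo code above is an iteratively reweighted least-squares fit, sketched below for a single feature. This is an interpretation rather than the project's actual implementation: the function names, the choice of a straight line for f, the clamp on the weights, and the toy data are all assumptions.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def weighted_line_fit(xs, zs, ws):
    """Weighted least-squares fit of z ~ slope * x + intercept."""
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_z = sum(w * z for w, z in zip(ws, zs)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxz = sum(w * (x - mean_x) * (z - mean_z) for w, x, z in zip(ws, xs, zs))
    slope = sxz / sxx
    return slope, mean_z - slope * mean_x


def fit_logistic(xs, ys, k=10):
    """Repeat the target/weight/fit steps of the pseudo code k times."""
    slope, intercept = 0.0, 0.0  # start with P(1|d_j) = 0.5 for every instance
    for _ in range(k):
        ps = [sigmoid(slope * x + intercept) for x in xs]
        # w_j = P(1|d_j) * (1 - P(1|d_j)), clamped away from zero.
        ws = [max(p * (1.0 - p), 1e-6) for p in ps]
        # z_j = (y_j - P(1|d_j)) / w_j
        zs = [(y - p) / w for y, p, w in zip(ys, ps, ws)]
        step_slope, step_intercept = weighted_line_fit(xs, zs, ws)
        slope += step_slope
        intercept += step_intercept
    return slope, intercept


def predict(x, slope, intercept):
    """Assign class label 1 if P(1|x) > 0.5, otherwise class label 0."""
    return 1 if sigmoid(slope * x + intercept) > 0.5 else 0


# Toy, non-separable data: one hypothetical feature and a 0/1 label.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
slope, intercept = fit_logistic(xs, ys)
```

With several features the weighted fit becomes a weighted multiple regression, but the targets z_j and weights w_j are computed in the same way.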

3.7 Summary
Planning the project beforehand aids in its timely completion, and the project plan
gives details about the deliverables. The methodology is an important aspect that
describes how we are going to achieve our goals. The iterative model is used in this
project due to predefined requirements. It is a fast development process and a
suitable choice for this project, because it allows prototypes to be produced so
that faults are easy to identify and the final product can be refined. The condition
for using an iterative model is that the requirements must be clear in advance. The
algorithm that we used is the Logistic Regression model. The whole software is built
on the Logistic Regression model because its accuracy increases as the training data
matures with time.

CHAPTER 4
RESULTS AND DISCUSSION

The logistic regression model is tested on the UCI dataset with four different split
ratios, and the resulting accuracies are shown in the tables below. Logistic
regression obtained an accuracy of 86.91% for a training-to-testing split ratio of
50:50. The accuracy of the model on the training data is shown in Table 4.1 and
Figure 4.1.

Table 4.1: Accuracy Result of Logistic Regression Classifier on Training Data

Training Data : Testing Data


80:20 70:30 60:40 50:50
85.24% 86.47% 86.17% 86.91%

On the training data, Logistic Regression gives its best accuracy of 86.91% with the
50:50 split, followed by 86.47% with 70:30, 86.17% with 60:40, and 85.24% with
80:20.

[Chart: accuracy (%) on training data for the 80/20, 70/30, 60/40, and 50/50 splits;
vertical axis from 84 to 87.5]

FIGURE 4.1: Accuracy Result of Logistic Regression Classifier on Training Data

Table 4.2: Accuracy Result of Logistic Regression Classifier on Testing Data

( Training Data : Testing Data )


80:20 70:30 60:40 50:50
80.48% 79.22% 81.70% 83.43%

On the testing data, the model gives its highest accuracy of 83.43% with the 50:50
split, followed by 81.70% with 60:40, 80.48% with 80:20, and 79.22% with 70:30.

So, for this model, the best accuracy is obtained when the data is split 50:50 (50%
training data and 50% testing data).
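The comparison between the two tables can be checked with a few lines of arithmetic. The accuracies below are copied from Tables 4.1 and 4.2; the "gap" between training and testing accuracy is an extra quantity computed here for illustration and is not reported in the tables themselves.

```python
# Accuracies (%) copied from Table 4.1 (training) and Table 4.2 (testing).
train_acc = {"80:20": 85.24, "70:30": 86.47, "60:40": 86.17, "50:50": 86.91}
test_acc = {"80:20": 80.48, "70:30": 79.22, "60:40": 81.70, "50:50": 83.43}

# Difference between training and testing accuracy for each split.
gaps = {split: round(train_acc[split] - test_acc[split], 2) for split in train_acc}

# Split ratio with the highest accuracy on the testing data.
best_split = max(test_acc, key=test_acc.get)
```

The 50:50 split has both the highest testing accuracy and the smallest gap between training and testing accuracy, while the 70:30 split has the largest gap.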

[Chart: accuracy (%) on testing data for the 80/20, 70/30, 60/40, and 50/50 splits;
vertical axis from 77 to 84]

FIGURE 4.2: Accuracy Result of Logistic Regression Classifier on Testing Data

The ROC (Receiver Operating Characteristic) curve shown in Figure 4.3 is used to
investigate the model further. The ROC curve visualizes the performance of the model
as the trade-off between the TPR (True Positive Rate) and the FPR (False Positive
Rate). Both rates range from 0 to 1, and the area under the curve indicates how well
the ML model distinguishes between the classes: the closer the area is to one, the
better the model is at classifying.

FIGURE 4.3: ROC curve
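How the points of a ROC curve and the area under it are obtained can be sketched as follows. The labels and scores are a small made-up example, not the model's actual predictions, and the helper names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by thresholding at each distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Area under the piecewise-linear curve, by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up true labels and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(labels, scores)
auc = area_under_curve(points)
```

An area of 0.5 corresponds to random guessing and an area of 1 to perfect separation of the two classes.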

Table 4.3 lists previous research on Logistic Regression carried out with RapidMiner
and Python on the UCI dataset from 2019 to 2021, together with the accuracy of
prediction.

Table 4.3: Previous work on Logistic Regression

Year   Author                                        Tool/Techniques                                             Logistic Regression Accuracy
2019   Z. Khan, D.K. Mishra, V. Sharma, A. Sharma    RapidMiner (Logistic Regression)                            82.56%
2019   Z. Khan, D.K. Mishra, V. Sharma, A. Sharma    RapidMiner (Logistic Regression, Support Vector Machine)    84.85%
2020   World Health Organization and J. Dostupno     Python                                                      85.04%

4.1 Results of Entire System

4.1.1 Person without Heart Disease:

The output for a person without heart disease is shown below:

4.1.2 Person with Heart Disease:

The output for a person with heart disease is shown below:

4.2 Technologies
For the development, the software requirements are as follows:
• Operating System: Windows or any Linux distribution
• Language: Python
• Tools: Anaconda, Google Colaboratory, and Draw.io and Visio to create and design
  the Data Flow and Context Diagrams
• Technologies used: Python, Streamlit



CHAPTER 5
RECOMMENDATIONS

The developed application can be used in hospitals for a quick assessment of a
patient's underlying condition, giving an overview of the patient's health. This
process can save time and the cost of expensive medical tests. The application can
also be used at home. People are often unaware of the state of their health and
reluctant to go for a test; this study will help them get a prediction about their
heart health.

Previous researchers worked with different algorithms, which gave results of varying
accuracy.

Our research is based on Logistic Regression, which gave the most accurate results
in our comparison. The model can be retrained as new data is entered on a daily
basis, which should also increase its efficiency and accuracy.

All of this can be made available on mobile devices: when a person enters their
symptoms on a cell phone, the trained model will already be available to analyze the
symptoms and provide the appropriate prescription. Different doctors can be
consulted and a complete, independent system developed. We can also include doctors'
contact numbers, so that a person can consult a doctor if the model shows a high
risk. If the person shows only minor symptoms, medication already prescribed by
doctors for such cases can be indicated. Such a program would be beneficial and
would also reduce the workload of doctors. In the current era of the coronavirus, we
need independent programs that can help people and that most people can trust. We
can therefore build further apps with the help of doctors and put them into use.

CHAPTER 6
CONCLUSION

Prediction of heart disease is challenging and very necessary in the medical field.
Recognizing heart disease by processing raw health care information will help to
save human lives in the long term.

This project predicts whether people have cardiovascular disease by extracting, from
a dataset of patients' medical histories, the factors that lead to fatal heart
disease, such as chest pain, sugar level, and blood pressure. The Heart Disease
detection system assists a patient on the basis of his or her clinical information,
including any previous diagnosis of heart disease (Piller L B, Davis B R, Cutler J A,
Cushman W C, Wright J T, Williamson J D & Haywood L J, 2002).

The mortality rate can be controlled if the disorder is detected at an early stage
and preventive measures are adopted as soon as possible. The system is helpful in
the early detection of abnormalities in the heart.

In this software, Logistic Regression is used to gather information and provide a
method for predicting heart disease. However, a further extension of the work is
highly desirable, to direct the investigation towards real-world data instead of
theoretical methods and simulations.

Out of the 13 features we examined, the top 4 significant features that helped us
distinguish between a positive and a negative diagnosis were chest pain type (cp),
maximum heart rate achieved (thalach), number of major vessels (ca), and ST
depression induced by exercise relative to rest (oldpeak).

Our machine learning model can now classify patients with heart disease, which can
support diagnosis and help patients get the help they need to recover. By detecting
these features early, we may prevent worse symptoms from arising later.

Our system yields its highest accuracy of 86.91% on training data and 83.43% on test
data. As a rule of thumb, an accuracy above 70% is considered good, but an extremely
high accuracy may be too good to be true (a sign of overfitting). We therefore
regard an accuracy of around 80% as ideal.

Using more training data increases the chances that the model accurately predicts
whether a given person has heart disease or not (Dangare Chaitrali S and Sulabha S
Apte, 2012).

References

World Health Organization and J. Dostupno, Cardiovascular diseases: key facts, vol.
13, no. 2016, p. 6, 2016. [Online]. Available:
https://www.who.int/en/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds).

K. Uyar, A. Ilhan, Diagnosis of heart disease using genetic algorithm based trained
recurrent fuzzy neural networks, Proced. Comput. Sci., 120 (2017), pp. 588-593.

N. Kausar, S. Palaniappan, B.B. Samir, A. Abdullah, N. Dey, Systematic analysis of
applied data mining based optimization algorithms in clinical attribute extraction
and classification for diagnosis of cardiac patients, in Applications of Intelligent
Optimization in Biology and Medicine, Cham, Switzerland: Springer (2016), pp.
217-231.

M. Shouman, T. Turner, R. Stocker, Integrating clustering with different data mining
techniques in the diagnosis of heart disease, J. Comput. Sci. Eng., 20 (1) (2013),
pp. 1-10.

D. Singh, J.S. Samagh, A comprehensive review of heart disease prediction using
machine learning, J. Crit. Rev., 7 (12) (2020), p. 2020.

Polaraju, K., Durga Prasad, D., & Tech Scholar, M. (2017). Prediction of Heart
Disease using Multiple Linear Regression Model. International Journal of Engineering
Development and Research, 5(4), 2321–9939. Retrieved from www.ijedr.org

Beyene, C., & Kamat, P. (2018). Survey on prediction and analysis the occurrence of
heart disease using data mining techniques. International Journal of Pure and
Applied Mathematics, 118(Special Issue 8), 165–173. Retrieved from
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85041895038&partnerID=40&md5=2f0b0c5191a82bc0c3f0daf67d73bc81

Soni, J., Ansari, U., & Sharma, D. (2011). Intelligent and Effective Heart Disease
Prediction System using Weighted Associative Classifiers. Heart Disease, 3(6),
2385–2392.

Science, C., & Faculty, G. M. (2009). Heart Disease Prediction Using Machine
learning and Data Mining Technique. Ijcsc 0973-7391, 7, 1–9.

Purushottam, Saxena, K., & Sharma, R. (2016). Efficient Heart Disease Prediction
System. In Procedia Computer Science (Vol. 85, pp. 962–969).
https://doi.org/10.1016/j.procs.2016.05.288

Sai, P. P., & Reddy, C. (2017). Heart Disease Prediction Using ANN Algorithm in Data
Mining. International Journal of Computer Science & Mobile Computing, 6(4), 168–172.
Retrieved from www.ijcsmc.com

A, A. S., & Naik, C. (2016). Different Data Mining Approaches for Predicting Heart
Disease, 277–281. https://doi.org/10.15680/IJIRSET.2016.0505545

Sultana, M., Haider, A., & Uddin, M. S. (2017). Analysis of data mining techniques
for heart disease prediction. In 2016 3rd International Conference on Electrical
Engineering and Information and Communication Technology, iCEEiCT 2016 (pp. 1–5).
https://doi.org/10.1109/CEEICT.2016.7873142

F.A. Latifah, I. Slamet, Comparison of heart disease classification with logistic
regression algorithm and random forest algorithm, Proceedings of the AIP Conference
(2020), p. 2296, 10.1063/5.0030579

S. Bashir, Z.S. Khan, F. Hassan Khan, A. Anjum, K. Bashir, Improving heart disease
prediction using feature selection approaches, Proceedings of the 16th International
Bhurban Conference on Applied Sciences and Technology (IBCAST) (2019), pp. 619-623,
10.1109/IBCAST.2019.8667106

Carney, R. M. and Freedland, K. E. (2010). Psychotherapies for depression in people
with heart disease. Depression and Heart Disease, pp. 145–168.

Piller L B, Davis B R, Cutler J A, Cushman W C, Wright J T, Williamson J D & Haywood
L J (2002). Validation of heart failure events in the Antihypertensive and Lipid
Lowering Treatment to Prevent Heart Attack Trial (ALLHAT) participants assigned to
doxazosin and chlorthalidone. Current controlled trials in cardiovascular medicine,
3(1), 10.

Dangare Chaitrali S and Sulabha S Apte. "Improved study of heart disease prediction
system using data mining classification techniques." International Journal of
Computer Applications 47.10 (2012): 44-8.
