Final Report
A PROJECT REPORT - II
Submitted by
J. JENI NENATHA (212922405001)
in partial fulfillment for the award of the degree
of
MASTER OF ENGINEERING
in
ANNA UNIVERSITY: CHENNAI 600 025
JULY 2024
BONAFIDE CERTIFICATE
Certified that this project report “HEART DISEASE PROGNOSIS WITH
ARTIFICIAL INTELLIGENCE AND NEURAL NETWORKS” is the bonafide
work of “J. JENI NENATHA (212922405001)” who carried out the project work
under my supervision.
SIGNATURE SIGNATURE
We are very grateful and blessed to take up this opportunity to thank the
Lord Almighty for showering His unlimited blessings upon us.
We take this occasion to express our respect and thanks to Rev. Fr. Dr. J. E.
Arul Raj, Founder and Chairman and Rev. Sr. S. Gnanaselvam DMI, Managing
Trustee, for facilitating us to do our project successfully.
We thank our family members and friends for their invaluable support.
ABSTRACT
The mortality rate increases all over the world on a daily basis, and one of the
reasons could be the growing number of patients with cardiovascular disease. When
considering the death rates and the large number of people who suffer from heart
disease, the importance of early diagnosis through tests such as ECG, stress test,
and heart MRI becomes clear. Nowadays, the health care industry contains a huge
amount of health care data, which holds hidden information. This hidden
information, along with advanced data mining techniques, can be used to obtain
appropriate results. The neural network is a widely used tool for heart disease
diagnosis and prediction. In our project, a heart disease prediction system which
uses an artificial neural network is developed: the medical attributes of the dataset
were given as input to the neural network, and the neural network was then trained
with backpropagation.
TABLE OF CONTENTS

1 INTRODUCTION
2 LITERATURE SURVEY
3 SYSTEM ANALYSIS
4 SYSTEM DESIGN
5 SYSTEM DESCRIPTION
5.1 HARDWARE DESCRIPTION
5.2 SOFTWARE DESCRIPTION
5.2.1 Python Programming
5.2.2 Jupyter Notebook
6 SYSTEM IMPLEMENTATION
6.1 SOFTWARE IMPLEMENTATION
6.1.1 Implementation Methods
6.1.2 Import Libraries
6.1.3 Analyze and Visualize Pandas Data Structure
6.1.4 Exploratory Data Analysis
6.1.5 Checking Relationship Among Variables
6.1.6 Checking and Dropping Duplicates
6.1.7 Outlier Detection and Removal
6.1.8 Pre-processing
6.1.9 Fitting an ANN Model
6.1.10 Confusion Matrix
6.1.11 ANN Model Classification and Saving
7 SYSTEM TESTING
7.1 UNIT TESTING
7.2 VALIDATION TESTING
7.3 MODULE LEVEL TESTING
7.4 INTEGRATION & SYSTEM TESTING
7.5 REGRESSION TESTING
8 RESULTS AND DISCUSSION
8.1 MODULES IMPORT
8.2 DATASET VISUALIZATION
8.3 CORRELATION MATRIX
8.4 FEATURE SELECTION
8.5 ANN MODEL TRAINING
8.6 PREDICTION
8.7 CLASSIFICATION
8.8 PERFORMANCE EVALUATION
9 CONCLUSION
REFERENCES

LIST OF ABBREVIATIONS

LIST OF FIGURES
CHAPTER 1
INTRODUCTION
Heart disease is one of the leading causes of death around the world. The effective
functioning of the heart plays a vital role in the body. There are many types of
heart disease, such as myocardial infarction, myocardial ischemia, congenital
heart disease, coronary heart disease, cardiac arrest, and peripheral heart disease.
There are various methodologies available for predicting heart disease.
Heart disease describes a range of conditions that affect the heart. The term
cardiac disease is regularly used interchangeably with cardiovascular disease
(CVD). Blood is supplied to the heart by the coronary arteries, and narrowing of
the coronary arteries is the major cause of heart failure. Prediction of
cardiovascular disease is considered one of the most important subjects in the
field of data analytics. The major cause of heart attack in the United States is
coronary artery disease. Cardiac disorders are more widespread in males than in
females. A survey analyzed by the World Health Organization (WHO) estimates
that 24% of deaths in India are due to cardiac disorders. Analysts have recorded
the different factors that increase the possibility of cardiac disorders and coronary
artery disease.

These factors fall into two types: those that can be modified and those that
cannot. Age, smoking, heredity, sex, high blood pressure, poor diet, consumption
of alcohol, and physical inactivity are considered risk factors for heart disease.
Age, sex, and heredity cannot be modified, while factors like smoking and intake
of alcohol can. The determinant conditions related to lifestyle can be changed
with medication and lifestyle changes. The most common method used for
treating coronary artery disease has been angiography; however, angiography is
the costliest method and has more side effects. The risk factors make the detection
of heart disease very difficult, and these constraints have led to the evolution of
modern methods for detecting the presence of cardiac disease. It is hard to identify
cardiac disorders manually, so conventional methods, mostly based on
examination of patient data, are used. Machine learning techniques are useful for
predicting such diseases. We develop a diagnosis system based on a χ²-DNN (a
deep neural network combined with a χ² statistical model). The main objective of
the diagnosis framework is to eliminate the problems of overfitting and
underfitting. The training data are fed into the neural network, and the testing
dataset is used to assess the performance of the neural network. Since the DNN
uses multiple hidden layers, the proposed model achieves higher performance
than a plain ANN. The χ²-DNN is developed to improve the classification
accuracy of cardiac disorder prognosis.
a "risk marker", a witness to a process (e.g., elevation of microalbuminuria,
elevation of C-reactive protein "CRP"). While there is a direct causal link between
the agent and the disease, it is a genuine risk factor. Overall cardiovascular risk
is defined as the probability of developing a cardiovascular event in a given time
depending on several risk factors considered simultaneously. Born more than 50
years ago with the advent of the analytical epidemiology of cardiovascular
diseases, this concept now appears as a potential prevention tool integrating the
multifactorial nature of these diseases.
A. Physiological Factors
1) Age
2) Sex
3) Menopausal Status
B. Lifestyle Factors
1) Smoking
2) Physical Activity
3) Alcohol
1.3 BACKGROUND
The model achieved higher accuracy when the decision tree classification
algorithm was used. The accuracy of the model was evaluated using a confusion
matrix, and the model achieved an accuracy of 55% with SVM.
PSO is combined with the initial rate of ACO. After the parameter changes, the
heart disease classification model is created. The best algorithms for the
classification and prediction of heart disease are found to be K-Nearest Neighbour
and Random Forest.
The perceptron receives an input value vector and outputs 1 when the result is
greater than a predefined threshold and -1 otherwise. The proof of convergence
of this algorithm is known as the perceptron convergence theorem. Figure 1.1
shows an example of an ANN. The output node is used to represent the model
output, and the nodes in a neural network architecture are commonly called
neurons. Each input node is connected to the output node via a weighted link,
which is used to emulate the strength of the synaptic connection between neurons.
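As an illustration of this behaviour, a minimal Python sketch of such a perceptron follows; the weights, bias, and inputs are invented for the example and are not the project's code:

    import numpy as np

    def perceptron_predict(x, weights, bias):
        """Single perceptron: weighted sum of inputs plus bias, thresholded at 0."""
        activation = np.dot(weights, x) + bias
        return 1 if activation > 0 else -1

    # Hypothetical example with 3 inputs
    x = np.array([0.5, -1.2, 0.3])
    weights = np.array([0.4, 0.7, -0.2])
    bias = 0.1
    print(perceptron_predict(x, weights, bias))  # prints -1 for this input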
Fig 1.1: Example of Artificial Neural Network
Coronary heart disease occurs when the arteries that normally supply oxygen and
blood to the heart become narrowed or completely blocked. Cardiovascular
diseases account for enormous mortality and morbidity in the world. In India,
deaths due to CHD were about 1.6 million in the year 2000, and by 2015 an
estimated 61 million cases were expected to be due to coronary heart disease. The
major risk factors include:
1) Blood pressure
2) Cholesterol
3) Diabetes
4) Smoking
5) Eating Habits
6) Sedentary life style
7) Stress
This epidemic might be stopped through the promotion of better lifestyles;
physical activity and traditional food consumption would help to alleviate this
burden.
1.7 OBJECTIVE
CHAPTER 2
LITERATURE SURVEY
Their project explains the use of soft computing in decision support systems for
disease diagnosis, one of the emerging interdisciplinary research areas in the field
of computer science. Machine learning algorithms play a significant part in the
risk detection of diseases. Feature selection within the dataset is the fundamental
factor that influences the accuracy of coronary illness prediction. The Matthews
correlation coefficient performance metric is additionally considered. The
particle swarm optimization algorithm is modified and applied for quality feature
selection. An enhanced fuzzy artificial neural network classifier performs the
prediction task.
Their project addresses coronary heart disease (CHD), whose mortality grows
each year by a significant amount. Deaths from coronary heart disease have the
highest prevalence in Indonesia, at 1.5 percent. Misdiagnosis of coronary heart
disease is a crucial issue and a main factor leading to death. To avoid wrong
diagnosis of CHD, an intelligent framework has been designed. Their method
proposes a system that can be used for the diagnosis of coronary heart disease
with better performance than conventional diagnostic strategies. Several
researchers have built frameworks using ordinary neural networks and machine
learning algorithms, but the outcomes do not provide good performance. Building
on the conventional neural network, a deep neural network (DNN) is suggested
in this work. The supervised learning neural network provides an algorithm that
performs well in classification. In the DNN model, binary classification was used
to diagnose CHD as present or absent. For performance analysis using the UCI
machine learning repository coronary heart disease dataset, the ROC curve and
the confusion matrix were implemented in the proposed method. The overall
accuracy, sensitivity, and specificity obtained for the proposed method are 96%,
99%, and 92%, respectively.
Atiqur Rahman and Aurangzeb Khan (2019) An Automated Diagnostic
System for Heart Disease Prediction Based on χ² Statistical Model and
Optimally Configured Deep Neural Network [4]
Bin Xiao, Yunqiu Xu, Xiuli Bi and Junhui Zhang (2019) proposed heart
sounds classification using a novel 1-D convolutional neural network with
extremely low parameter consumption [5]
Their project explains the concept of automatic heart sound classification, one of
the commonly used techniques for cardiovascular disease detection. A novel heart
sound classification method based on deep learning technologies for
cardiovascular disease prediction is presented, which mainly comprises three
parts: pre-processing, classification of 1-D waveform heart sound patches using
a deep convolutional neural network (CNN) with an attention mechanism, and
majority voting for the final prediction of heart sound recordings. To enhance the
information flow of the CNNs, a block-stacked architecture with clique blocks is
used, and in each clique block a bidirectional connection structure is introduced
in the proposed CNN. By using the stacked clique and transition blocks, the
proposed CNN achieves both spatial and channel attention, leading to promising
classification performance. Also, a novel separable convolution with inverted
bottleneck is used to decouple the spatial and channel-wise importance of features
efficiently. Experiments on PhysioNet/CinC 2016 show that the proposed method
obtains superior classification results and excels in parameter usage compared
with state-of-the-art methods.
Bin Deng and Min Guo (2020) "Risk Factors and Intervention Status of
Cardiovascular Disease in Elderly Patients with Coronary Heart Disease",
Health [9]
In recent days, heart ailments have assumed a fundamental role in world health.
Physicians give different names to heart disease, for example, cardiovascular
failure, heart failure, and so on. Among the automated techniques to discover
coronary illness, this research work uses a Named Entity Recognition (NER)
algorithm to discover synonyms in the coronary illness text in order to mine the
meaning in clinical reports and other applications. The heart disease text
information given by the physician is taken for preprocessing, the text is mapped
to its intended meaning, and the resulting text data are taken as input for the
prediction of heart disease. This experimental work utilizes NER to discover the
synonyms of the coronary illness text data and combines two strategies, namely
Optimal Deep Learning and Whale Optimization, into a proposed method called
Optimal Deep Neural Network (ODNN) for predicting the illness. For the
prediction, weights and ranges of the patient data, by means of chosen attributes,
are picked for the experiment. The outcome is then classified with the Deep
Neural Network and the Artificial Neural Network to determine the accuracy of
the algorithms. The performance of the ODNN is assessed by means of
classification metrics such as precision, recall, and F-measure.
Their project applied data mining techniques (e.g., K-Nearest Neighbor) to a
diabetes patients' database, analyzing various attributes of diabetes for the
prediction of diabetes disease.
CHAPTER 3
SYSTEM ANALYSIS
The term heart disease encompasses the diverse diseases that affect the
heart. As of 2007, heart disease was the major cause of death in the United States,
England, Canada, and Wales. Heart disease kills one person every 34 seconds in
the United States. Coronary heart disease, cardiomyopathy, and cardiovascular
disease are some categories of heart disease. The term
“cardiovascular disease” includes a wide range of conditions that affect the heart
and the blood vessels and the manner in which blood is pumped and circulated
through the body. Cardiovascular disease (CVD) results in severe illness,
disability, and death.
coronary heart disease. Association rules are used for the prediction of heart
disease. These association rules are applied to the medical dataset, and many of
the generated rules are irrelevant. The rules that are truly essential for predicting
heart disease are identified by using search constraints, which search the
association rules in the training dataset and finally validate them on the test set.
A hybrid system uses the global optimization of a genetic algorithm to initialize
the neural network weights. A multi-layered feed-forward network is used, with
12 input nodes, 10 hidden nodes, and 2 output nodes. The input nodes are
basically the risk factors used in predicting heart disease. The risk factors of heart
attack, otherwise called myocardial infarction, are identified based on decision
trees and the Apriori algorithm. Based on these methods, the risk factors
identified as effective in the detection of heart attack are chest pain, diabetes,
smoking, gender, physical inactivity, age, lipids, cholesterol, triglycerides, and
blood pressure.
3.1.1 Disadvantages
● Low accuracy
● Highly expensive
● Sensitive to noise
● Highly complicated
3.2 PROPOSED SYSTEM
Nowadays the artificial neural network (ANN) is widely used as a tool for
solving many decision modeling problems. A multilayer perceptron is a feed-
forward ANN model that is used extensively for the solution of a number of
different problems. An ANN is a simulation of the human brain. It is a supervised
learning technique used for non-linear classification. Coronary heart disease is a
major epidemic in India, and Andhra Pradesh is at risk of coronary heart disease.
Clinical diagnosis is done mostly by the doctor's expertise, and patients are asked
to take a number of diagnostic tests; however, not all of the tests contribute
towards an effective diagnosis of the disease. Feature subset selection is a
pre-processing step used to reduce dimensionality and remove irrelevant data. In
this project we introduce a classification approach which uses an ANN and
feature subset selection for the classification of heart disease. PCA is used for
pre-processing and to reduce the number of attributes, which indirectly reduces
the number of diagnostic tests a patient needs to take. We applied our approach
to the Andhra Pradesh heart disease database. Our experimental results show that
accuracy improved over traditional classification techniques. This system is
feasible, faster, and more accurate for the diagnosis of heart disease.
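As an illustration of the PCA step described above, here is a minimal scikit-learn sketch; the placeholder feature matrix and the number of retained components are assumptions for the example, not the report's configuration:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # X stands in for the numeric attribute matrix of the heart disease records
    X = np.random.rand(297, 13)  # placeholder data: 297 records, 13 attributes

    # Standardize features first, since PCA is sensitive to scale
    X_scaled = StandardScaler().fit_transform(X)

    pca = PCA(n_components=8)               # illustrative number of components
    X_reduced = pca.fit_transform(X_scaled)
    print(pca.explained_variance_ratio_.sum())  # fraction of variance retained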
optimization problems. Pattern recognition and function estimation abilities make
ANNs a prevalent utility in data mining. Their main advantage is that they can
solve problems that are too complex for conventional technologies. Neural
networks are well suited to problems like pattern recognition and forecasting.
ANNs are used to extract useful patterns from the data and infer rules from them.
These are useful in providing information on associations, classifications, and
clustering.
The single-layer perceptron is the simplest form of artificial neural network, used
for classification where patterns are linearly separable. The output node is used
to represent the model output, and the nodes in a neural network architecture are
commonly known as neurons. Each input node is connected to the output node
via a weighted link. This is used to emulate the strength of the synaptic connection
between neurons.
3.4 ADVANTAGES
● High Accuracy and Independent from prior assumptions about the
distribution of the data.
● Noise tolerance & Ease of Maintenance.
● ANNs can be implemented in parallel hardware.
CHAPTER 4
SYSTEM DESIGN
(Figure: system design block diagram showing model building, ANN evaluation,
and performance measures.)
single neuron with 13 inputs, and a step function is used as the transfer function.
There is also a bias, which controls the y-intercept, and each input has its own
separate weight. Finally, the ANN is used for the classification of heart disease.
4.3 MODULES
The modules involved in the proposed ANN based heart disease prediction
are explained as follows:
4.3.1 Data Collection
The data used in this project is obtained from the Cleveland Heart Disease
database. A total of 297 records with 14 medical attributes is used to predict heart
disease.
This database contains 76 attributes, but all published experiments refer to
using a subset of 14 of them. In particular, the Cleveland database is the only one
that has been used by ML researchers to this date. The "goal" field refers to the
presence of heart disease in the patient. It is integer valued from 0 (no presence)
to 4. Experiments with the Cleveland database have concentrated on simply
attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
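A minimal pandas sketch of this step is shown below, assuming the processed Cleveland file has been saved locally as heart.csv; the file name and column names are illustrative, not taken from the report:

    import pandas as pd

    # Hypothetical local copy of the processed Cleveland database
    cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
    df = pd.read_csv("heart.csv", names=cols)

    # Collapse the 0-4 "num" goal field into presence (1) vs. absence (0)
    df["target"] = (df["num"] > 0).astype(int)
    df = df.drop(columns=["num"])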
The names and social security numbers of the patients were recently
removed from the database, replaced with dummy values.
One file has been "processed": the one containing the Cleveland database.
All four unprocessed files also exist in this directory.
Only 14 attributes are used:
1. #3 (age)
2. #4 (sex)
3. #9 (cp)
4. #10 (trestbps)
5. #12 (chol)
6. #16 (fbs)
7. #19 (restecg)
8. #32 (thalach)
9. #38 (exang)
10. #40 (oldpeak)
11. #41 (slope)
12. #44 (ca)
13. #51 (thal)
14. #58 (num) (the predicted attribute)
The description of dataset is tabulated in Table 4.1.
Table 4.1: Description of Dataset
(Table 4.1, final row) 13. Thal: 3 = normal, 6 = fixed defect, 7 = reversible defect
The dataset of 297 records is divided into training and testing datasets. The
training dataset is used to build the predictive relationship; it is the set of
examples used for learning and for fitting the weights of the classifier. The test
set is the set of examples used to evaluate the performance of the fully-specified
classifier. The data is divided into 75% for training and 25% for testing.
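Assuming the DataFrame df and target column from the earlier sketch, the 75/25 split could look like this with scikit-learn (an illustrative sketch, not the report's exact code):

    from sklearn.model_selection import train_test_split

    X = df.drop(columns=["target"])
    y = df["target"]

    # 75% for training, 25% for testing; stratify to preserve class balance
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y)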
Here, the model will be trained using the datasets and tested for finding the
accuracy of the model. Optimization will be done to improve the accuracy if
needed. In machine learning, a common task is the study and construction of
algorithms that can learn from and make predictions on data. Such algorithms
work by making data-driven predictions or decisions, through building a
mathematical model from input data. The data used to build the final model
usually comes from multiple datasets. In particular, three data sets are commonly
used in different stages of the creation of the model.
The model (e.g., a neural net classifier) is trained on the training dataset
using a supervised learning method (e.g., gradient descent or stochastic gradient
descent). In practice, the training dataset often consists of pairs of an input vector
(or scalar) and the corresponding output vector (or scalar), which is commonly
denoted as the target (or label). The current model is run with the training dataset
and produces a result, which is then compared with the target for each input
vector in the training dataset. Based on the result of the comparison and the
specific learning algorithm being used, the parameters of the model are adjusted.
The model fitting can include both variable selection and parameter estimation.
Successively, the fitted model is used to predict the responses for the observations
in a second dataset called the validation dataset.
The heart disease dataset consists of 14 attributes, including age, gender,
type of chest pain, blood pressure, cholesterol, blood sugar level,
electrocardiograph result, maximum heart rate, exercise-induced angina, old
peak, slope, number of vessels colored, and thal. The dataset contains 303
instances, of which 2 instances are missing the number-of-vessels-colored
attribute and 4 instances are missing the thal attribute; these are filled with their
respective mean values over the dataset. The prediction attribute consists of 2
classes, the integer values 0 and 1, where 0 indicates no heart disease and 1
indicates the presence of heart disease. Feature selection from the 14 input
parameters by backward elimination resulted in a total of 11 significant input
parameters, which include gender, type of chest pain, blood pressure, blood sugar
level, electrocardiograph result, maximum heart rate, exercise-induced angina,
old peak, slope, number of vessels colored, and thal.
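A small sketch of the mean-value imputation described above; the column names follow the dataset description, and df is the loaded DataFrame from the earlier sketch:

    # Fill the missing "ca" (vessels colored) and "thal" values with column means
    for col in ["ca", "thal"]:
        df[col] = df[col].fillna(df[col].mean())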
Precision: It is the percentage of results that are relevant and is defined as:

Precision = TruePositive / (TruePositive + FalsePositive)

Accuracy: It is the fraction of all cases that are classified correctly and is defined as:

Accuracy = (TruePositive + TrueNegative) / Total
Sensitivity: The sensitivity of a test is its ability to determine the patient cases
correctly. Mathematically, this can be stated as:

Sensitivity = TruePositive / (TruePositive + FalseNegative)

Specificity: The specificity of a test is its ability to determine the healthy cases
correctly. Mathematically, this can be stated as:

Specificity = TrueNegative / (TrueNegative + FalsePositive)
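All four measures can be computed directly from confusion-matrix counts; the following sketch uses invented counts purely for illustration:

    def evaluate(tp, tn, fp, fn):
        """Compute the four measures from confusion-matrix counts."""
        precision = tp / (tp + fp)
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        sensitivity = tp / (tp + fn)   # recall on patient cases
        specificity = tn / (tn + fp)   # recall on healthy cases
        return precision, accuracy, sensitivity, specificity

    # Hypothetical counts from a test set
    print(evaluate(tp=30, tn=28, fp=5, fn=4))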
CHAPTER 5
SYSTEM DESCRIPTION
Python is a multi-paradigm programming language. Object-oriented
programming and structured programming are fully supported, and many of its
features support functional programming and aspect-oriented programming
(including by metaprogramming and metaobjects (magic methods)). Many other
paradigms are supported via extensions, including design by contract and logic
programming.
Python's design offers some support for functional programming in the Lisp
tradition. It has filter, map, and reduce functions; list comprehensions,
dictionaries, sets, and generator expressions. The standard library has two
modules (itertools and functools) that implement functional tools borrowed from
Haskell and Standard ML.
Rather than having all of its functionality built into its core, Python was
designed to be highly extensible. This compact modularity has made it
particularly popular as a means of adding programmable interfaces to existing
applications. Van Rossum's vision of a small core language with a large standard
library and easily extensible interpreter stemmed from his frustrations with ABC,
which espoused the opposite approach.
Jupyter Notebook is built using several open-source technologies, including:
● IPython
● ØMQ
● Tornado (web server)
● jQuery
● Bootstrap (front-end framework)
● MathJax
The Notebook interface was added to IPython in the 0.12 release and renamed
to Jupyter Notebook in 2015 (IPython 4.0 – Jupyter 1.0). Jupyter Notebook is
similar to the notebook interface of other programs such as Maple, Mathematica,
and SageMath, a computational interface style that originated with Mathematica
in the 1980s. Colaboratory (also known as Colab) is a free Jupyter notebook
environment that runs in the cloud and stores its notebooks on Google Drive.
Colab was originally an internal Google project; an attempt was made to open
source all the code and work more directly upstream, leading to the development
of the "Open in Colab" Google Chrome extension, but this eventually ended, and
Colab development continued internally. As of October 2019, the Colaboratory
UI only allows for the creation of notebooks with Python 2 and Python 3 kernels;
however, an existing notebook whose kernelspec is IR or Swift will also work,
since both R and Swift are installed in the container. The Julia language can also
work on Colab (with, e.g., Python and GPUs; Google's tensor processing units
also work with Julia on Colab).
CHAPTER 6
SYSTEM IMPLEMENTATION
Implementation includes all those activities that take place to convert from
the old system to the new. The old system consists of manual operations, which
is operated in a very different manner from the proposed new system. A proper
implementation is essential to provide a reliable system to meet the requirements
of the organizations. An improper installation may affect the success of the
computerized system. The implementation steps are listed below.
6.1.1 Implementation Methods
There are several methods for handling the implementation and the
consequent conversion from the old to the new computerized system.
The most secure method for conversion from the old system to the new
system is to run the old and new system in parallel. In this approach, a person may
operate in the manual older processing system as well as start operating the new
computerized system. This method offers high security, because even if there is a
flaw in the computerized system, we can depend upon the manual system.
However, the cost for maintaining two systems in parallel is very high. This
outweighs its benefits.
Another common method is a direct cut-over from the existing manual
system to the computerized system. The changeover may happen within a week
or within a day. There are no parallel activities; however, there is no remedy in
case of a problem. This strategy requires careful planning.
The implementation plan includes a description of all the activities that must occur
to implement the new system and to put it into operation. It identifies the personnel
responsible for the activities and prepares a time chart for implementing the
system.
The implementation plan consists of the following steps.
● Identify all data required to build new files during the implementation.
● List all new documents and procedures that go into the new system.
The implementation plan should anticipate possible problems and must be
able to deal with them. The usual problems may be missing documents, mixed
data formats between current and new files, errors in data translation, missing
data, etc.
6.1.2 Import Libraries

Pandas is used for reading our dataset, NumPy for working with arrays and
performing linear algebra calculations, and seaborn and Matplotlib for
visualizations. Lastly, warning messages are typically issued in situations where
it is useful to alert the user to some condition in the program, where that condition
(normally) does not warrant raising an exception and terminating the program.
So, our aim here is to analyze our data and report our findings through
visualizations, and the code below allows us to check whether we have any
missing values in our dataset before going further with the analysis.
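A sketch of the imports and the missing-value check described here; the dataset file name is an assumption, and the aliases are the conventional ones:

    import warnings

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    warnings.filterwarnings("ignore")   # silence non-fatal warning messages

    df = pd.read_csv("heart.csv")       # hypothetical dataset file
    print(df.isnull().sum())            # count of missing values per column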
6.1.3 Analyze and Visualize Pandas Data Structure
Splitting the Data: Dividing the dataset into training, validation, and test sets.
Typically, the data is split in a ratio such as 70-20-10, where 70% is used for
training, 20% for validation, and 10% for testing.
The data preprocessing involves several key steps. First, the describe() function
is used to generate a statistical summary, including mean and standard deviation.
Next, data cleaning is performed to correct errors, handle missing values, and
remove duplicates. Then, normalization or standardization is applied to ensure
that features are on a comparable scale. Categorical variables are encoded into
numerical formats using techniques like one-hot or label encoding. Finally, the
dataset is split into training, validation, and test sets, typically in a 70-20-10 ratio,
for effective model development and evaluation.
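A condensed sketch of these preprocessing steps, assuming the DataFrame df and a target column as before; the columns chosen for one-hot encoding are illustrative, and the 70-20-10 split is obtained with two successive calls to train_test_split:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    print(df.describe())        # statistical summary: mean, std, quartiles, ...
    df = df.drop_duplicates()   # data cleaning: remove duplicate records
    df = pd.get_dummies(df, columns=["cp", "slope", "thal"])  # one-hot encoding

    X, y = df.drop(columns=["target"]), df["target"]
    # 70% train, then split the remaining 30% into 20% validation and 10% test
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, train_size=0.7,
                                                      random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=1/3,
                                                    random_state=42)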
6.1.4 Exploratory Data Analysis

The unique() function on the output variable shows its data type and the
unique values present. Here we notice that the output variable is imbalanced by a
small fraction, which does not make any difference in the development of the
algorithm.
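The check might look like this, assuming the output column is named target:

    print(df["target"].unique())        # distinct class labels and their dtype
    print(df["target"].value_counts())  # class frequencies, to judge imbalance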
6.1.5 Checking Relationship among Variables
Now we check the relationship between variables using the seaborn relplot
chart. From the result we can deduce that people between the ages of 50 and 65
experience cp = 0 (angina), the most common pain type in the list. We can also
observe that this combination of heart disease and high cholesterol levels occurs
more among women than men.
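A sketch of such a relplot; the column names follow the dataset description, and the exact chart in the report may differ:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Age vs. cholesterol, coloured by sex and faceted by chest-pain type
    sns.relplot(data=df, x="age", y="chol", hue="sex", col="cp")
    plt.show()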
6.1.6 Checking and Dropping Duplicates

Duplicate rows are identified and dropped so that repeated records do not
bias the model.

6.1.7 Outlier Detection and Removal

Here we used the z-score to detect outliers in our dataset, with a threshold
of 3 to filter them out; our dataset of 303 entries was reduced to 287 entries.
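A sketch of the z-score filtering with a threshold of 3, using scipy and assuming df holds the numeric feature columns at this point:

    import numpy as np
    from scipy import stats

    # Keep rows whose features all lie within 3 standard deviations of the mean
    z = np.abs(stats.zscore(df.select_dtypes(include=np.number)))
    df = df[(z < 3).all(axis=1)]
    print(len(df))  # e.g., 303 entries reduced to 287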
6.1.8 Pre-processing

We preprocess our data before building our model, in three steps:
6.1.8.1 Splitting the data into training and testing sets
6.1.8.2 Scaling the features into a uniform range
6.1.8.3 Scaling our data
Train test split

At this stage we enter another phase in the development of our machine
learning model: training and testing. We split the data into two sections, training
and testing, because we cannot use the same data both to train and to test our
model; hence one part of the data is used for training and the other for testing.
Scaling

We scale the data so that, when it is fed into the machine learning model, it
is uniform, which also positively impacts model accuracy.
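A sketch using scikit-learn's StandardScaler, fitted on the training data only so that no information leaks from the test set (X_train and X_test as in the earlier split sketch):

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
    X_test_scaled = scaler.transform(X_test)        # apply the same transformation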
6.1.9 Fitting an ANN Model

Initializing Weights: Setting initial weights for the network, often using
methods like Xavier initialization to facilitate efficient training.
Forward Propagation: Passing the input data through the network to obtain
predictions. Each neuron calculates a weighted sum of its inputs and applies an
activation function.
Training the Network: Iteratively feeding the training data through the
network and updating weights until the model converges to a minimum error.
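A sketch of fitting such an ANN with Keras under the report's stated assumptions (13 inputs, binary output, Xavier/Glorot initialization); the layer sizes, epochs, and other hyperparameters are illustrative choices, not the report's exact configuration:

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Input(shape=(13,)),                          # 13 medical attributes
        layers.Dense(16, activation="relu",
                     kernel_initializer="glorot_uniform"),  # Xavier initialization
        layers.Dense(8, activation="relu"),
        layers.Dense(1, activation="sigmoid"),              # P(heart disease)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

    # Iteratively feed the training data and update weights until convergence
    history = model.fit(X_train_scaled, y_train, epochs=100, batch_size=16,
                        validation_split=0.2, verbose=0)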
CHAPTER 7
SYSTEM TESTING
disease prediction system. The software must be executed several times in order
to find out the errors in the different modules of the proposed system.
Validation refers to the process of using the new software for the developed
system in a live environment, i.e., the Jupyter notebook file of the proposed heart
disease prediction model, in order to find the errors in it. The validation phase
reveals the failures and bugs in the heart disease prediction system. It makes
known the practical difficulties the heart disease prediction model faces when
operated in the Jupyter notebook. By testing the code of the implemented
software, the logic of the program can be examined. A specification test is
conducted to check whether the program performs as specified under various
conditions, given the hardware specifications (system, monitor, RAM, ROM,
mouse, keyboard, etc.) and the software specifications (Python 3.7 and Jupyter
Notebook). Apart from these tests, some special tests are conducted, which are
given below:
Peak Load Tests: This determines whether the proposed heart disease
prediction system will handle the volume of activities when the system is at the
peak of its processing demand. The test has revealed that the new software for the
agency is capable of handling the demands at the peak time.
Storage Testing: This determines the capacity of the proposed heart
disease prediction system to store transaction data on a disk or on other files. The
proposed software has the required storage space available, because of the use of
a number of hard disks.
Performance Time Testing: This test determines the length of time the heart
disease prediction system takes to process transaction data.
In this phase, testing the developed heart disease prediction model means
exercising the software to uncover errors and to ensure that the system meets the
defined requirements. Testing may be done at 4 levels:
• Unit Level
• Module Level
• Integration & System
• Regression
2. Configuration review
7.4 INTEGRATION & SYSTEM TESTING
Integration testing is used to verify the combination of the software modules,
such as dataset preparation, training, testing, model building, and performance
evaluation. Integration testing addresses the issues associated with the dual
problems of verification and program construction. System testing is used to
verify whether the predictive model meets the requirements. System testing is
actually a series of different tests whose primary purpose is to fully exercise the
computer-based system, where the software and other system elements are tested
as a whole. To test computer software, we spiral out along streamlines that
broaden the scope of testing with each turn.
The last higher-order testing step falls outside the boundary of software
engineering and into the broader context of computer system engineering. The
proposed predictive model of heart disease, once validated, must be combined
with other system elements (e.g., hardware, people, databases). System testing
verifies that all the elements mesh properly and that overall system
function/performance is achieved.
1. Recovery Testing
2. Security Testing
3. Stress Testing
7.5 REGRESSION TESTING
Each modification in software impacts unmodified areas, which can result
in serious defects in that software. So, the process of re-testing after the
rectification of errors due to modification is known as regression testing.
Installation and Delivery: Installation and Delivery is the process of delivering
the proposed predictive model and tested model to the customer.
Acceptance and Project Closure: Acceptance is the part of the project by which
the customer accepts the product. Once the customer accepts the developed heart
disease prediction model, closure of the project is started. This includes metrics
collection, PCD, etc.
CHAPTER 8
RESULTS AND DISCUSSION

8.1 MODULES IMPORT

The necessary modules are imported, and sample records are retrieved and shown
from the dataset.
8.2 DATASET VISUALIZATION
Preprocessing is done by dropping the duplicates.
8.3 CORRELATION MATRIX
8.4 FEATURE SELECTION
8.5 ANN MODEL TRAINING

Output prediction after model training.

8.6 PREDICTION
8.7 CLASSIFICATION
8.8 PERFORMANCE EVALUATION

Plot for accuracy.
Acquiring the required output through custom input.
CHAPTER 9
CONCLUSION
In our project, we have proposed a new feature selection method for heart
disease classification using an ANN together with various feature selection
methods for the Andhra Pradesh population. We applied different feature
selection methods to rank the attributes that contribute most towards the
classification of heart disease, which indirectly reduces the number of diagnostic
tests a patient has to take. Our experimental results indicate that, on average, an
ANN with feature subset selection provides better classification accuracy and
dimensionality reduction. Our proposed method eliminates useless and distortive
data. This research will contribute to a reliable and faster automatic heart disease
diagnosis system, where easy diagnosis of heart disease will save lives. Coronary
heart disease can be handled successfully if more research is encouraged in this
area.
REFERENCES
[1] Ahmed, Rizgar Maghdid, and Omar Qusay Alshebly. "Prediction and factors
affecting of chronic kidney disease diagnosis using artificial neural networks
model and logistic regression model." Iraqi Journal of Statistical Sciences
16.28 (2019): 140-159.
[2] Burse, Kavita, et al. "Various pre-processing methods for neural network
based heart disease prediction." Smart innovations in communication and
computational sciences. Springer, Singapore, 2019. 55-65.
[3] Ali, Farman, et al. "A smart healthcare monitoring system for heart disease
prediction based on ensemble deep learning and feature fusion." Information
Fusion 63 (2020): 208-222.
[4] Shahid, Nida, Tim Rappon, and Whitney Berta. "Applications of artificial
neural networks in health care organizational decision-making: A scoping
review." PloS one 14.2 (2019): e0212356.
[5] Khourdifi, Youness, and Mohamed Bahaj. "Heart disease prediction and
classification using machine learning algorithms optimized by particle
swarm optimization and ant colony optimization." Int. J. Intell. Eng. Syst.
12.1 (2019): 242-252.
[6] Khalil, Ahmed J., et al. "Energy Efficiency Predicting using Artificial Neural
Network." (2019).
[7] Ali, Liaqat, et al. "An Automated Diagnostic System for Heart Disease
Prediction Based on Chi-Square Statistical Model and Optimally Configured
Deep Neural Network." IEEE Access 7 (2019): 34938-34945.
[8] Mohan, Senthilkumar, Chandrasegar Thirumalai, and Gautam Srivastava.
"Effective heart disease prediction using hybrid machine learning
techniques." IEEE Access 7 (2019): 81542-81554.
[9] Latha, C. Beulah Christalin, and S. Carolin Jeeva. "Improving the accuracy
of prediction of heart disease risk based on ensemble classification
techniques." Informatics in Medicine Unlocked 16 (2019): 100203.
[10] Darmawahyuni, Annisa, Siti Nurmaini, and Firdaus Firdaus. "Coronary
Heart Disease Interpretation Based on Deep Neural Network." Computer
Engineering and Applications Journal 8.1 (2019): 1-12.
[11] Aqueel Ahmed, Shaikh Abdul Hannan, “Data Mining Techniques to Find
Out Heart Diseases: An Overview”, IJITEE.
[12] Monika Gandhi, Shailendra Narayan Singh, "Predictions in Heart Disease
Using Techniques of Data Mining", IEEE (2019).
[13] Marjia Sultana, Afrin Haider and Mohammad Shorif Uddin, "Analysis of
Data Mining Techniques for Heart Disease Prediction".
[14] M. A. Nishara Banu, B. Gomathy, "Disease Predicting System Using Data
Mining Techniques", IJTRA.
[15] K. Raj Mohan, Ilango Paramasivam, Subhashini Sathya Narayan,
"Prediction and Diagnosis of Cardiovascular Disease – A Critical Survey",
IEEE (2019).
[16] Mohamed Elnoamany, Ashraf Dawood, Nahed Mohamed Momtaz, Waleed
Abdou, "Thrombospondin-1 Levels in Patients with Coronary Heart
Disease", WJCD (2021).
[17] Abhishek Rairikar, Vedant Kulkarni, Vikas Sabale, "Heart Disease
Prediction Using Data Mining Techniques", IEEE (2018).
[18] Pahulpreet Singh Kohli, Shriya Arora, "Application of Machine Learning in
Disease Prediction", IEEE (2019).
[19] Jyoti Soni, Ujma Ansari, "Predictive Data Mining for Medical Diagnosis:
An Overview of Heart Disease Prediction", IJCA.
[20] Mudasir M. Kirmani, "Cardiovascular Disease Prediction Using Data
Mining Techniques: A Review", Oriental Scientific Publishing Co. (2019).
[21] Bin Deng, Min Guo, "Risk Factors and Intervention Status of Cardiovascular
Disease in Elderly Patients with Coronary Heart Disease", Health (2020).
[22] T. Velmurugan, U. Latha, "Classifying Heart Disease in Medical Data Using
Deep Learning Methods", Journal of Computer and Communications (2021).
[23] Karaskov Alexander, "Impact of chemical elements on heart failure
progression in coronary heart disease patients", SciRes.
[24] Jyoti Soni, Ujma Ansari, "Predictive Data Mining for Medical Diagnosis:
An Overview of Heart Disease Prediction", International Journal of
Computer Applications (2019).
[25] Ammar Aldallal, Amina Abdul Aziz Al-Moosa, "Using Data Mining
Techniques to Predict Diabetes and Heart Diseases", IEEE (2018).
[26] Monika Gandhi, Shailendra Narayan Singh, "Predictions in Heart Disease
Using Techniques of Data Mining", IEEE.
[27] Durairaj M, Revathi V, "Prediction of Heart Disease Using
Backpropagation MLP Algorithm", IJSTR (2019).
[28] M. Akhil Jabbar, B. L. Deekshatulu, "Classification of Heart Disease Using
K-Nearest Neighbor and Genetic Algorithm", Procedia Technology.
[29] Resul Das, Ibrahim Turkoglu, "Effective diagnosis of heart disease through
neural networks ensembles", Expert Systems with Applications.
[30] Samir B. Patel, Tejal Upadhyay, "Heart Disease Prediction Using Machine
Learning and Data Mining Technique", IJCSC (2019).
[31] Saima Safdar, Saad Zafar, "Machine Learning Based Decision Support
Systems (DSS) for Heart Disease Diagnosis: A Review", SpringerLink.
[32] P. Santhi, R. Ajay, "A Survey on Heart Attack Prediction Using Machine
Learning", Turkish Journal of Computer and Mathematics Education (2021).