1822 B.E Cse Batchno 95
MAY 2022
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
(Established under Section 3 of UGC Act, 1956)
JEPPIAAR NAGAR, RAJIV GANDHI SALAI
CHENNAI– 600119
www.sathyabama.ac.in
BONAFIDE CERTIFICATE
Internal Guide
_____________________________________________________________________________________________
Submitted for Viva voce Examination held on
DECLARATION
DATE: G. Venugopalreddy
ACKNOWLEDGEMENT
I convey my thanks to Dr. T. Sasikala, M.E., Ph.D., Dean, School of Computing, and to Dr. S.
Vigneshwari, M.E., Ph.D. and Dr. L. Lakshmanan, M.E., Ph.D., Heads of the Department
of Computer Science and Engineering, for providing me the necessary support and details at the
right time during the progressive reviews.
I would like to express my sincere and deep sense of gratitude to my Project Guide, Dr. D.
Usha Nandini, M.E., Ph.D., whose valuable guidance, suggestions and constant
encouragement paved the way for the successful completion of my project work.
I wish to express my thanks to all Teaching and Non-teaching staff members of the
Department of Computer Science and Engineering who were helpful in many ways for the
completion of the project.
ABSTRACT
In the medical field, the diagnosis of heart disease is one of the most difficult tasks, as the
decision relies on combining large amounts of clinical and pathological data. Because of this
complexity, interest in efficient and accurate heart disease prediction has grown significantly
among researchers and clinical professionals. For heart disease, a correct diagnosis at an
early stage is essential, as time is a critical factor. Heart disease is a principal cause of death
worldwide, so predicting it at an early phase is significant. In recent years machine learning
has evolved into a reliable supporting tool in the medical domain and provides great support
for predicting disease when the models are trained and tested correctly. The main idea behind
this work is to study diverse prediction models for heart disease and to select the important
heart disease features using the Random Forest algorithm. Random Forest is a supervised
machine learning algorithm that achieves higher accuracy than other supervised machine
learning algorithms such as logistic regression. Using the Random Forest algorithm, we
predict whether a person has heart disease or not.
TABLE OF CONTENTS
CHAPTERS CONTENTS PG. NOS
ABSTRACT V
CHAPTER 1 INTRODUCTION 1
REFERENCES 31
APPENDICES 32
A. SAMPLE CODE 33
B. SCREEN SHOTS 37
C. PLAGIARISM REPORT 41
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1
INTRODUCTION
The heart is a muscular organ that pumps blood through the body and is the central part of
the body's cardiovascular system, which also includes the lungs. The cardiovascular system
further comprises a network of blood vessels, for example veins, arteries, and capillaries,
which deliver blood all over the body. Abnormalities in the normal flow of blood from the
heart cause several types of heart diseases, commonly known as cardiovascular diseases
(CVD). Heart diseases are among the main causes of death worldwide. According to a survey
of the World Health Organization (WHO), 17.5 million global deaths occur because of heart
attacks and strokes. More than 75% of deaths from cardiovascular diseases occur in
middle-income and low-income countries, and 80% of the deaths due to CVDs are caused by
stroke and heart attack. Therefore, predicting cardiac abnormalities at an early stage, and
tools for the prediction of heart disease, can save many lives and help doctors design
effective treatment plans, which ultimately reduces the mortality rate due to cardiovascular
diseases.
Due to the development of advanced healthcare systems, a large amount of patient data is
available nowadays (i.e., Big Data in Electronic Health Record systems), which can be used
for designing predictive models for cardiovascular diseases. Data mining or machine learning
is a discovery method for analyzing big data from assorted perspectives and encapsulating it
into useful information. "Data mining is a non-trivial extraction of implicit, previously
unknown and potentially useful information from data." Nowadays, a huge amount of data
pertaining to disease diagnosis, patients, etc. is generated by healthcare industries. Data
mining provides a number of techniques that discover hidden patterns or similarities in data.
Therefore, in this work, a machine learning algorithm is proposed for the implementation of
a heart disease prediction system, which was validated on two open-access heart disease
prediction datasets.
Data mining is the computer-based process of extracting useful information from enormous
databases. It is most helpful in exploratory analysis because it can uncover non-trivial
information in large volumes of evidence. Medical data mining has great potential for
exploring the hidden patterns in clinical data sets, and these patterns can be utilized for
healthcare diagnosis. However, the available raw medical data are widely distributed,
voluminous and heterogeneous in nature. These data need to be collected in an organized
form and can then be integrated to form a medical information system. Data mining provides
a user-oriented approach to novel and hidden patterns in the data. Data mining tools are
useful for answering business questions and provide techniques for predicting various
diseases in the healthcare field. Disease prediction plays a significant role in data mining.
This work analyzes heart disease prediction using classification algorithms; the hidden
patterns found in this way can be utilized for health diagnosis in healthcare data.
Data mining technology affords an efficient approach to new and previously unknown
patterns in the data. The information that is identified can be used by healthcare
administrators to improve their services. Heart disease is one of the most crucial causes of
death in countries such as India and the United States. In this project we predict heart
disease using classification algorithms. Machine learning techniques, namely classification
algorithms such as Random Forest and Logistic Regression, are used to explore different
kinds of heart-related problems.
CHAPTER 2
LITERATURE SURVEY
Machine learning techniques are used to analyze and predict medical data from information
resources. Diagnosis of heart disease is a significant and tedious task in medicine. The term
heart disease encompasses the various diseases that affect the heart. Detecting heart disease
from various factors or symptoms is an issue that is not free from false presumptions and is
often accompanied by unpredictable effects. The data classification here is based on a
supervised machine learning algorithm, which results in better accuracy. We use Random
Forest as the training algorithm to train on the heart disease dataset and to predict heart
disease. The results showed that the designed prediction system is capable of predicting
heart attacks successfully. Machine learning techniques have been used to indicate early
mortality by analyzing heart disease patients and their clinical records (Richards, G. et al.,
2001). Sung, S.F. et al. (2015) applied two machine learning techniques, a k-nearest
neighbor model and an existing multi-linear regression, to predict the stroke severity index
(SSI) of patients; their study shows that k-nearest neighbor performed better than the
multi-linear regression model. Arslan, A. K. et al. (2016) suggested various machine learning
techniques, such as support vector machine (SVM) and penalized logistic regression (PLR),
to predict heart stroke; their results show that SVM produced the best prediction
performance compared to the other models. Boshra Brahmi et al. [20] developed different
machine learning techniques to evaluate the prediction and diagnosis of heart disease. The
main objective was to evaluate different classification techniques such as J48, Decision
Tree, KNN and Naïve Bayes, and then to evaluate performance measures such as accuracy,
precision, sensitivity and specificity.
Data source:
Clinical databases have collected a significant amount of information about patients and
their medical conditions. Records with medical attributes were obtained from the Cleveland
Heart Disease database. With the help of this dataset, the patterns significant to heart attack
diagnosis are extracted. The records were split equally into two datasets: a training dataset
and a testing dataset. A total of 303 records with 76 medical attributes were obtained. All the
attributes are numeric-valued, and we work on a reduced set of only 14 attributes.
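A minimal sketch of how the reduced 14-attribute dataset might be loaded and split; the file name heart.csv and the column names (taken from the standard UCI Cleveland attribute list) are illustrative assumptions, not taken from this report.

import pandas as pd
from sklearn.model_selection import train_test_split

# assumed file name and the 14 standard UCI Cleveland attribute names
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
df = pd.read_csv("heart.csv", names=columns)   # 303 records, 14 attributes

X = df.drop("target", axis=1)   # 13 predictor attributes
Y = df["target"]                # 1 = heart disease present, 0 = absent

# the records are split equally into training and testing datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, random_state=0)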
These restrictions were applied to reduce the number of patterns:
1. The features should appear on a single side of the rule.
2. The rule should separate the various features into different groups.
3. The number of features available from the rule is organized by the medical history of
people having heart disease only.
The following table shows the list of attributes on which we are working.
Table 2.1: List of Attributes
S. No   Attribute   Description
1       Age         Age in years
3       Cp          Chest pain type
6       Fbs         Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7       Restecg     Resting electrocardiographic results
CHAPTER 3
AIM AND SCOPE OF PRESENT INVESTIGATION
3.1 EXISTING SYSTEM:
Clinical decisions are often made based on doctors' intuition and experience rather than on
the knowledge-rich data hidden in the database. This practice leads to unwanted biases,
errors and excessive medical costs, which affect the quality of service provided to patients.
There are many ways that a medical misdiagnosis can present itself. Whether a doctor or
hospital staff is at fault, a misdiagnosis of a serious illness can have very extreme and
harmful effects. The National Patient Safety Foundation reports that 42% of medical patients
feel they have experienced a medical error or missed diagnosis. Patient safety is sometimes
negligently given the back seat to other concerns, such as the cost of medical tests, drugs,
and operations. Medical misdiagnoses are a serious risk to our healthcare profession; if they
continue, people will fear going to the hospital for treatment. We can put an end to medical
misdiagnosis by informing the public and filing claims and suits against the medical
practitioners at fault.
Disadvantages:
Prediction is not possible at early stages.
In the existing system, practical use of the collected data is time consuming.
Any fault made by the doctor or hospital staff in prediction could lead to fatal incidents.
A highly expensive and laborious process needs to be performed before treating the patient
to find out whether he/she has any chance of getting heart disease in the future.
3.2 PROPOSED SYSTEM:
This section depicts an overview of the proposed system and illustrates all of the
components, techniques and tools used for developing the entire system. To develop an
intelligent and user-friendly heart disease prediction system, an efficient software tool is
needed in order to train on huge datasets and compare multiple machine learning
algorithms. After choosing the robust algorithm with the best accuracy and performance
measures, it will be implemented in the development of a smartphone-based application for
detecting and predicting heart disease risk level. Hardware components like an
Arduino/Raspberry Pi, different biomedical sensors, a display monitor, a buzzer, etc. are
needed to build the continuous patient monitoring system.
of the internet, and for the professional programmer it is easy to learn and use effectively.
As the developing organization has all the resources available to build the system, the
proposed system is technically feasible.
3.3.3 Operational Feasibility:
Operational feasibility is defined as the process of assessing the degree to which a proposed
system solves business problems or takes advantage of business opportunities. The system
is self-explanatory and does not need any extra sophisticated training. The system has
built-in methods and classes which are required to produce the result. The application can
be handled very easily by a novice user, and the overall time a user needs to get trained is
less than one hour. The software used for developing this application is very economical
and readily available in the market. Therefore, the proposed system is operationally
feasible.
3.4.1 Embedded:
This class of system is characterized by tight constraints, a changing environment, and
unfamiliar surroundings. Projects of the embedded type are novel to the company and
usually exhibit temporal constraints.
3.4.2 Organic:
This category encompasses all systems that are small relative to project size and team size
and that have a stable environment, familiar surroundings and relaxed interfaces. These are
simple business systems, data processing systems, and small software libraries.
3.4.3 Semidetached:
The software systems falling under this category are a mix of the organic and embedded
types. Some examples of software of this class are operating systems, database
management systems, and inventory management systems.
The list of attributes is composed of several features of the software and includes product,
computer, personnel and project attributes, as follows.
3.4.4 Product Attributes:
Required reliability (RELY): expresses the effect of software faults, ranging from slight
inconvenience (VL) to loss of life (VH). The nominal value (NM) denotes moderate,
recoverable losses.
Data bytes per DSI (DATA): a lower rating corresponds to a smaller database size.
Complexity (CPLX): expresses code complexity, again ranging from straight batch code
(VL) to real-time code with multiple resource scheduling (XH).
3.4.5 Computer Attributes:
Execution time (TIME) and memory (STOR) constraints: identify the percentage of
computer resources used by the system. NM states that less than 50% is used; 95% is
indicated by XH.
Virtual machine volatility (VIRT): indicates the frequency of changes made to the
hardware, operating system, and overall software environment. More frequent and
significant changes are indicated by higher ratings.
Development turnaround time (TURN): the time from when a job is submitted until output
is received. LO indicates a highly interactive environment; VH quantifies a situation where
this time is longer than 12 hours.
3.4.7 Project Attributes:
Modern development practices (MODP): deals with the amount of use of modern software
practices such as structured programming and the object-oriented approach.
Use of software tools (TOOL): measures the level of sophistication of the automated tools
used in software development and the degree of integration among the tools being used. A
higher rating describes higher levels in both aspects.
Schedule effects (SCED): concerns the amount of schedule compression (HI or VH) or
schedule expansion (LO or VL) of the development schedule in comparison to a nominal
(NM) schedule.
Rating   VL     LO     NM     HI     VH     XH
VEXP     1.21   1.10   1.00   0.90
CHAPTER 4
EXPERIMENTAL OR MATERIALS AND METHODS
4.2 REQUIREMENT ANALYSIS:
Software Requirement Specification (SRS) is the starting point of the software development
activity. As systems grew more complex, it became evident that the goals of the entire
system could not be easily comprehended; hence the need for the requirement phase arose.
The software project is initiated by the client's needs. The SRS is the means of translating
the ideas in the minds of the clients (the input) into a formal document (the output of the
requirement phase). Under requirement specification, the focus is on specifying what has
been found during analysis; issues such as representation, specification languages and
tools, and checking of the specifications are addressed during this activity. The requirement
phase terminates with the production of the validated SRS document. Producing the SRS
document is the basic goal of this phase. The purpose of the Software Requirement
Specification is to reduce the communication gap between the clients and the developers. It
is the medium through which the client and user needs are accurately specified, and it forms
the basis of software development. A good SRS should satisfy all the parties involved in the
system.
4.2.1 Product Perspective:
The application is developed in such a way that any future enhancement can be easily
implemented. The project is developed in such a way that it requires minimal maintenance.
The software used is open source and easy to install. The application developed should be
easy to install and use. This is an independent application which can easily be run on any
system that has Python and Jupyter Notebook installed.
4.2.2 Product Features:
The application is developed in a way that 'heart disease' accuracy is predicted using
Random Forest. The dataset is taken from
https://www.datacamp.com/community/tutorials/scikit-learn-credit-card. We can compare
the accuracy of the implemented algorithms.
User Characteristics: The application is developed in such a way that it is easy to use, error
free, requires minimal or no training, and supports regular patient monitoring.
Assumptions & Dependencies: It is assumed that the dataset taken fulfils all the
requirements.
4.2.3 Domain Requirements:
This document is the only one that describes the requirements of the system. It is meant for
use by the developers and will also be the basis for validating the final heart disease system.
Any changes made to the requirements in the future will have to go through a formal change
approval process.
User Requirements: The user can decide on the prediction accuracy in order to decide
which algorithm can be used in real-time predictions.
Non-Functional Requirements: The dataset collected should be in CSV format. The column
values should be numerical values. The training set and test set are stored as CSV files.
Error rates can be calculated for the prediction algorithms.
4.2.4 Requirements Efficiency:
Efficiency: Less time for predicting the heart disease.
Reliability: Maturity, fault tolerance and recoverability.
Portability: The software can easily be transferred to another environment, including
installability.
4.2.5 Usability:
Usability: How easy it is to understand, learn and operate the software system.
Organizational Requirements: Do not block the required ports through the Windows
firewall; an internet connection should be available.
Implementation Requirements: The dataset collection, and an internet connection to install
the related libraries.
Engineering Standard Requirements: The user interface is developed in Python, which
takes the required input values.
4.2.6 Hardware Interfaces:
Ethernet on the AS/400 supports TCP/IP, Advanced Peer-to-Peer Networking (APPN) and
Advanced Program-to-Program Communications (APPC). ISDN can connect the AS/400 to
an Integrated Services Digital Network (ISDN) for faster, more accurate data transmission.
An ISDN is a public or private digital communications network that can support data, fax,
image, and other services over the same physical interface. Other protocols, such as IDLC
and X.25, can also be used on ISDN.
Software Interfaces: Anaconda Navigator and Jupyter Notebook are used.
4.2.7 Operational Requirements:
a. Economic: The developed product is economical, as it does not require any hardware
interface, etc.
Environmental: Statements of fact and assumptions that define the expectations of the
system in terms of mission objectives, environment, constraints, and measures of
effectiveness and suitability (MOE/MOS). The customers are those that perform the eight
primary functions of systems engineering, with special emphasis on the operator as the key
customer.
b. Health and Safety: The software may be safety-critical; if so, there are issues associated
with its integrity level. The software may not be safety-critical although it forms part of a
safety-critical system. There is little point in producing 'perfect' code in some language if
the hardware and system software (in the widest sense) are not reliable. If a computer
system is to run software of a high integrity level, then that system should not at the same
time accommodate software of a lower integrity level. Systems with different requirements
for safety levels must be separated; otherwise, the highest level of integrity required must
be applied to all systems in the same environment.
4.4 SOFTWARE DESCRIPTION
4.4.1 Python:
Python is an interpreted, high-level programming language for general-purpose
programming. Created by Guido van Rossum and first released in 1991, Python has a
design philosophy that emphasizes code readability, notably using significant whitespace. It
provides constructs that enable clear programming on both small and large scales. Python
features a dynamic type system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative, functional and procedural,
and has a large and comprehensive standard library. Python interpreters are available for
many operating systems. CPython, the reference implementation of Python, is open-source
software and has a community-based development model, as do nearly all of its variant
implementations. CPython is managed by the non-profit Python Software Foundation.
4.4.2 Pandas
Pandas is an open-source Python library providing high-performance data manipulation and
analysis tools through its powerful data structures. The name Pandas is derived from "panel
data", an econometrics term for multidimensional data. In 2008, developer Wes McKinney
started developing pandas when he needed a high-performance, flexible tool for data
analysis. Prior to Pandas, Python was mostly used for data munging and preparation and
contributed very little to data analysis; Pandas solved this problem.
Using Pandas, we can accomplish five typical steps in the processing and analysis of data,
regardless of the origin of the data: load, prepare, manipulate, model, and analyze. Python
with Pandas is used in a wide range of academic and commercial domains, including
finance, economics, statistics and analytics.
Key Features of Pandas:
Fast and efficient DataFrame object with default and customized indexing.
Tools for loading data into in-memory data objects from different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, indexing and subsetting of large data sets.
Columns from a data structure can be deleted or inserted.
Group-by functionality for aggregation and transformations.
High-performance merging and joining of data.
Time series functionality.
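As an illustration of the features listed above, the short sketch below builds a small DataFrame (with made-up values), fills a missing value, performs label-based subsetting and a group-by aggregation.

import pandas as pd
import numpy as np

df_demo = pd.DataFrame({"age": [63, 37, 41, 56],
                        "sex": [1, 1, 0, 1],
                        "chol": [233, 250, np.nan, 236]})

df_demo["chol"] = df_demo["chol"].fillna(df_demo["chol"].mean())  # integrated missing-data handling
young = df_demo.loc[df_demo["age"] < 50, ["age", "chol"]]         # label-based slicing and subsetting
print(df_demo.groupby("sex")["chol"].mean())                      # group-by aggregation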
4.4.3 NumPy:
NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object and tools for working with these arrays. It is the fundamental
package for scientific computing with Python. It contains various features, including these
important ones:
A powerful N-dimensional array object
Sophisticated (broadcasting) functions
Tools for integrating C/C++ and Fortran code
Useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data types can be defined, which allows NumPy to
seamlessly and speedily integrate with a wide variety of databases.
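A short sketch of the capabilities mentioned above: an N-dimensional array, broadcasting, and the linear algebra and random-number utilities. The values are placeholders.

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])                 # N-dimensional array object
scaled = a * 10                                        # broadcasting a scalar over the array
inverse = np.linalg.inv(a)                             # linear algebra routine
noise = np.random.default_rng(0).normal(size=(2, 2))   # random-number capability
print(scaled, inverse, noise, sep="\n")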
4.4.4 Scikit-Learn:
Simple and efficient tools for data mining and data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable (BSD license)
4.4.5 Matplotlib:
Matplotlib is a Python library used to create 2D graphs and plots from Python scripts.
It has a module named pyplot which makes plotting easy by providing features to control
line styles, font properties, axis formatting, etc.
It supports a very wide variety of graphs and plots, namely histograms, bar charts, power
spectra, error charts, etc.
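A minimal pyplot sketch of the kind of bar chart used later in this report; the two accuracy values shown here are placeholders, not results from this project.

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
plt.bar(["Logistic Regression", "Random Forest"], [85.0, 90.0])   # placeholder values
plt.xlabel("Algorithms")
plt.ylabel("Accuracy score (%)")
plt.title("Accuracy comparison (illustrative)")
plt.show()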
4.4.6 Jupyter Notebook:
The Jupyter Notebook is an incredibly powerful tool for interactively developing and
presenting data science projects. A notebook integrates code and its output into a single
document that combines visualizations, narrative text, mathematical equations, and other
rich media. The Jupyter Notebook is an open-source web application that allows you to
create and share documents that contain live code, equations, visualizations and narrative
text. Uses include data cleaning and transformation, numerical simulation, statistical
modeling, data visualization, machine learning, and much more.
The Notebook has support for over 40 programming languages, including Python, R, Julia,
and Scala.
Notebooks can be shared with others using email, Dropbox, GitHub and the Jupyter
Notebook Viewer.
Your code can produce rich, interactive output: HTML, images, videos, LaTeX, and custom
MIME types.
Leverage big data tools, such as Apache Spark, from Python, R and Scala.
Explore the same data with pandas, scikit-learn, ggplot2 and TensorFlow.
4.5 ALGORITHMS
4.5.1 Logistic Regression:
Logistic regression (LogR) models the data points using the standard logistic function, an
S-shaped curve also called the sigmoid curve, given by the equation

    sigma(z) = 1 / (1 + e^(-z)),  where z = b0 + b1*x1 + b2*x2 + ... + bn*xn

Here the coefficients b0...bn are learned from the training data, and sigma(z) is interpreted
as the probability of the positive class.
Logistic Regression Assumptions:
Logistic regression requires the dependent variable to be binary.
For a binary regression, factor level 1 of the dependent variable should represent the
desired outcome.
Only the meaningful variables should be included.
The independent variables should be independent of each other.
Logistic regression requires quite large sample sizes.
Even though logistic (logit) regression is frequently used for binary variables (2 classes), it
can be used for categorical dependent variables with more than 2 classes; in this case it is
called multinomial logistic regression.
Fig 4.5.1: Logistic Regression
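A hedged sketch of fitting a logistic regression classifier with scikit-learn; it assumes the X_train, X_test, Y_train and Y_test variables produced by the earlier train/test split, and the names Y_pred_lr and score_lr follow the style of the sample code in the appendix.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression(max_iter=1000)   # larger max_iter helps convergence on unscaled data
lr.fit(X_train, Y_train)
Y_pred_lr = lr.predict(X_test)

score_lr = round(accuracy_score(Y_pred_lr, Y_test) * 100, 2)
print("The accuracy score achieved using Logistic Regression is: " + str(score_lr) + " %")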
4.5.2 Random Forest:
Random Forest is a supervised learning algorithm that is used for both classification and
regression, but it is mainly used for classification problems. Just as a forest is made up of
trees, and more trees mean a more robust forest, the random forest algorithm creates
decision trees on data samples, gets a prediction from each of them, and finally selects the
best solution by means of voting. It is an ensemble method that is better than a single
decision tree because it reduces over-fitting by averaging the results.
The working of Random Forest can be explained with the help of the following steps (a
short code sketch follows the figure below):
1. First, start with the selection of random samples from a given dataset.
2. Next, the algorithm constructs a decision tree for every sample and gets the prediction
result from every decision tree.
3. Voting is then performed over every predicted result.
4. Finally, the most-voted prediction result is selected as the final prediction.
The following diagram illustrates its working:
Fig 4.5.2: Random Forest Classifier
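A short sketch of the voting behaviour described in the steps above, assuming the training and testing splits from earlier; n_estimators controls how many decision trees take part in the vote.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=0)   # 100 decision trees
rf.fit(X_train, Y_train)

# predict_proba averages the per-tree class probabilities;
# predict() returns the class with the highest averaged vote for each patient
vote_share = rf.predict_proba(X_test)
final_prediction = rf.predict(X_test)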
4.6 SYSTEM ARCHITECTURE
The figure below shows the process flow diagram of the proposed work. First, we collected
the Cleveland Heart Disease database from the UCI website, then pre-processed the dataset
and selected 16 important features. For feature selection we used the Recursive Feature
Elimination algorithm with the Chi2 method to get the top 16 features. After that, we
applied the ANN and Logistic Regression algorithms individually and computed the
accuracy. Finally, we used the proposed ensemble voting method and determined the best
method for diagnosis of heart disease.
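A hedged sketch of chi-squared feature selection. Note that scikit-learn's RFE works with a fitted estimator rather than the chi-squared statistic itself, so SelectKBest with chi2 is used here as a stand-in for the selection step described above; the value of k is illustrative.

from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=10)          # keep the k highest-scoring features
X_train_sel = selector.fit_transform(X_train, Y_train)
X_test_sel = selector.transform(X_test)

print("Selected features:", list(X_train.columns[selector.get_support()]))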
4.7 MODULES:
The entire work of this project is divided into 4 modules.
They are:
a. Data Pre-Processing
b. Feature
c. Classification
d. Prediction
a. Data Pre-processing:
This module contains all the pre-processing functions needed to process all input documents
and texts. First, we read the train, test and validation data files, then performed some
preprocessing such as tokenizing and stemming. Some exploratory data analysis is also
performed, such as examining the response variable distribution, along with data quality
checks such as looking for null or missing values.
Data preprocessing is the process of transforming raw data into an understandable format.
It is also an important step in data mining, as we cannot work with raw data. The quality of
the data should be checked before applying machine learning or data mining algorithms.
Preprocessing of data is mainly done to check the data quality, which can be checked by the
following:
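As an illustration, common data-quality checks of this kind (missing values, duplicates, value ranges and the response-variable distribution) can be written as the sketch below, assuming the dataset has been loaded into the DataFrame df.

print(df.isnull().sum())             # null / missing values per column
print(df.duplicated().sum())         # duplicate records
print(df.describe())                 # ranges, means and obvious outliers
print(df["target"].value_counts())   # response variable distribution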
b. Feature:
methods from the scikit-learn Python libraries. For feature selection, we have used methods
like a simple bag-of-words and n-grams, and then term-frequency weighting such as tf-idf.
We have also used word2vec and POS tagging to extract features, though POS tagging and
word2vec have not been used at this point in the project.
Bag of Words:
It is an algorithm that transforms text into fixed-length vectors by counting the number of
times each word is present in a document. The word occurrences allow us to compare
different documents and evaluate their similarities for applications such as search,
document classification, and topic modeling. The reason for its name, "Bag of Words", is
that it represents the sentence as a bag of terms: it does not consider the order or the
structure of the words, but only checks whether the words appear in the document.
N-grams:
TF-IDF Weighting:
c. Classification:
Here we have built all the classifiers for heart disease detection. The extracted features are
fed into different classifiers. We have used Naïve Bayes, Logistic Regression, Linear SVM,
Stochastic Gradient Descent and Random Forest classifiers from sklearn. Each of the
extracted features was used in all of the classifiers. Once the models were fitted, we
compared the F1 scores and checked the confusion matrices. After fitting all the classifiers,
the two best-performing models were selected as candidate models for heart disease
classification. We performed parameter tuning by applying GridSearchCV to these
candidate models and chose the best-performing parameters for these classifiers. The
finally selected model was used for heart disease detection together with the probability of
truth. In addition, we extracted the top 50 features from our term-frequency tf-idf
vectorizer to see which words are most important in each of the classes. We also used
precision-recall and learning curves to see how the training and test sets perform as the
amount of data given to the classifiers increases.
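A sketch of the parameter-tuning step described above, applying GridSearchCV to the Random Forest candidate model; the parameter grid and scoring choice are illustrative assumptions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, scoring="f1", cv=5)
grid.fit(X_train, Y_train)

print("Best parameters:", grid.best_params_)
best_rf = grid.best_estimator_   # best-performing candidate model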
d. Prediction:
Our finally selected, best-performing classifier was the algorithm that was then saved to
disk under the name final_model.sav. Once this repository is cloned, this model is copied to
the user's machine and is used by the prediction.py file to classify heart disease. It takes the
user's input, and the model then produces the final classification output, which is shown to
the user along with the probability of truth.
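A hedged sketch of saving the selected classifier to disk as final_model.sav and loading it back for prediction; joblib is one common way to do this, and the report itself only names the output file.

import joblib

joblib.dump(best_rf, "final_model.sav")            # persist the chosen classifier

loaded_model = joblib.load("final_model.sav")      # later, e.g. inside prediction.py
probability_of_disease = loaded_model.predict_proba(X_test)[:, 1]   # probability of truth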
4.8 DATA FLOW DIAGRAM:
The data flow diagram (DFD) is one of the most important tools used in system analysis.
Data flow diagrams are made up of a number of symbols which represent system
components. Most data flow modeling methods use four kinds of symbols: processes, data
stores, data flows and external entities, which represent four kinds of system components.
Circles in a DFD represent processes, a data flow is represented by a thin line, each data
store has a unique name, and a square or rectangle represents an external entity.
LEVEL 0:
DATA COLLECTION → PRE-PROCESSING → RANDOM SELECTION → TRAINING & TESTING DATASET
LEVEL 1:
DATA COLLECTION → PRE-PROCESSING → FEATURE EXTRACTION → APPLY ALGORITHMS
CHAPTER 5
In this project, we introduce a heart disease prediction system with different classifier
techniques for the prediction of heart disease. The techniques are Random Forest and
Logistic Regression; we observed that Random Forest has better accuracy compared to
Logistic Regression. Our purpose is to improve the performance of the Random Forest by
removing unnecessary and irrelevant attributes from the dataset and picking only those that
are most informative for the classification task.
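One way to identify the most informative attributes, as described above, is to rank them by the fitted Random Forest's feature importances; this is a hedged sketch assuming the rf model and the X_train split from earlier.

import pandas as pd

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))   # most informative attributes first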
The first result shows the target values of the dataset for male and female patients in
bar-graph format and shows the percentage of patients with and without heart problems.
It shows the values of fbs and restecg in a bar graph, analyzing the restecg and fbs features.
It shows the slope and ca features from the dataset values in bar-graph format.
Finally, it shows that the Random Forest algorithm has higher accuracy than Logistic
Regression on the dataset values, and also shows the accuracy percentage of the Random
Forest algorithm.
CHAPTER 6
6.1 Summary
The objective of this project is to predict heart disease using machine learning. In this work,
a machine learning algorithm is proposed for the implementation of a heart disease
prediction system, which was validated on two open-access heart disease prediction
datasets.
6.2 Conclusion
In this project, we introduced a heart disease prediction system with different classifier
techniques for the prediction of heart disease. The techniques are Random Forest and
Logistic Regression; we found that Random Forest has better accuracy compared to
Logistic Regression. Our purpose is to improve the performance of the Random Forest by
removing unnecessary and irrelevant attributes from the dataset and picking only those that
are most informative for the classification task.
References
[1] P. K. Anooj, "Clinical decision support system: Risk level prediction of heart disease
using weighted fuzzy rules", Journal of King Saud University – Computer and Information
Sciences (2012) 24, 27–40.
[2] Nidhi Bhatla, Kiran Jyoti, "An Analysis of Heart Disease Prediction using Different Data
Mining Techniques", International Journal of Engineering Research & Technology.
[3] Jyoti Soni, Ujma Ansari, Dipesh Sharma, Sunita Soni, "Predictive Data Mining for
Medical Diagnosis: An Overview of Heart Disease Prediction".
[5] Dane Bertram, Amy Voida, Saul Greenberg, Robert Walker, "Communication,
Collaboration, and Bugs: The Social Nature of Issue Tracking in Small, Collocated Teams".
[7] Ankita Dewan, Meghna Sharma, "Prediction of Heart Disease Using a Hybrid Technique
in Data Mining Classification", 2nd International Conference on Computing for Sustainable
Global Development, IEEE 2015, pp. 704-706.
[9] M. Akhil Jabbar, B. L. Deekshatulu, Priti Chandra, "Heart disease classification using
nearest neighbor classifier with feature subset selection", Anale. Seria Informatica, 11, 2013.
[10] Shadab Adam Pattekari, Asma Parveen, "Prediction System for Heart Disease using
Naive Bayes", International Journal of Advanced Computer and Mathematical Sciences,
ISSN 2230-9624, Vol. 3, Issue 3, 2012, pp. 290-294.
[11] C. Kalaiselvi, "Diagnosis of Heart Disease Using K-Nearest Neighbor Algorithm of
Data Mining", IEEE, 2016.
[12] Keerthana T. K., "Heart Disease Prediction System using Data Mining Method",
International Journal of Engineering Trends and Technology, May 2017.
[13] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Elsevier;
Animesh Hazra, Arkomita Mukherjee, Amit Gupta, Prediction Using Machine Learning and
Data Mining, July 2017, pp. 2137-2159.
APPENDIX A: SAMPLE CODE
# Imports assumed from the earlier (not shown) part of the appendix, repeated for completeness
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt

# Search for the random_state giving the best Random Forest accuracy.
# (The loop header and the two lines fitting rf are reconstructed; the originals
# appear on an earlier appendix page.)
max_accuracy = 0
for x in range(2000):
    rf = RandomForestClassifier(random_state=x)
    rf.fit(X_train, Y_train)
    Y_pred_rf = rf.predict(X_test)
    current_accuracy = round(accuracy_score(Y_pred_rf, Y_test) * 100, 2)
    if current_accuracy > max_accuracy:
        max_accuracy = current_accuracy
        best_x = x

# Refit the Random Forest with the best random_state found above
rf = RandomForestClassifier(random_state=best_x)
rf.fit(X_train, Y_train)
Y_pred_rf = rf.predict(X_test)
Y_pred_rf.shape

# Printing the accuracy score for the Random Forest algorithm
score_rf = round(accuracy_score(Y_pred_rf, Y_test) * 100, 2)
print("The accuracy score achieved using Random Forest is: " + str(score_rf) + " %")

# Showing the accuracy of both algorithms
scores = [score_lr, score_rf]
algorithms = ["Logistic Regression", "Random Forest"]
for i in range(len(algorithms)):
    print("The accuracy score achieved using " + algorithms[i] + " is: " + str(scores[i]) + " %")

# Showing the comparison in bar-graph format
sns.set(rc={'figure.figsize': (15, 8)})
sns.barplot(x=algorithms, y=scores)
plt.xlabel("Algorithms")
plt.ylabel("Accuracy score")
APPENDIX B: SCREENSHOTS
1.
In the above screenshot, we are importing the modules and libraries and also reading the
dataset from the system.
2.
4.
5.
Here we are printing the target values in graph format for both male and female patients.
6.
7.
Here we are printing the accuracy percentage values of the Logistic Regression and Random
Forest algorithms.
8.
Finally, the accuracy scores of Logistic Regression and Random Forest are printed. From
these accuracy scores we conclude that the Random Forest algorithm is more accurate than
Logistic Regression.
APPENDIX C: PLAGIARISM REPORT