
A

Project Report

on

“Heart Attack Prediction Using Machine Learning”


Submitted to

I.K. Gujral Punjab Technical University Kapurthala

In partial fulfillment of the requirement for the award of degree

of
Master of Computer Application

Submitted by: Abhishek Chauhan (2229826)

Project Guide: Dr. Avnip Deora, Dean (School of IT)

(Session 2022 - 2024)

School of IT

Apeejay Institute of Management & Engineering Technical Campus, Jalandhar
CERTIFICATE

This is to certify that the project report titled "Heart Attack Prediction Using Machine Learning" submitted by Mr. Abhishek Chauhan (University Roll Number 2229826) is a bona fide piece of work conducted under my direct supervision and guidance. No part of this work has been submitted for any other degree at any other university.

It may be considered for evaluation in partial fulfillment of the requirement for the award of the degree of Master of Computer Application.

Date:

AIMETC, Jalandhar

DECLARATION

I, Abhishek Chauhan (University Roll Number 2229826), a student of MCA Semester IV at Apeejay Institute of Management & Engineering Technical Campus, Jalandhar, hereby declare that the project entitled "Heart Attack Prediction Using Machine Learning" is my original work.

Date: Signature of Student

This is to certify that the above statement made by the candidate is correct to the best of my knowledge.

Date: Name & Signature of Supervisor

The MCA Semester IV 'Project' Examination has been held on ____________ and accepted.

Signature of Internal Examiner          Signature of External Examiner          Signature of Dean

ABSTRACT

In the medical field, diagnosing heart disease is one of the most difficult tasks, because a diagnosis depends on combining large amounts of clinical and pathological data. This complexity has generated significant interest among researchers and clinical professionals in efficient and accurate heart disease prediction. For heart disease, a correct diagnosis at an early stage is essential, since time is a critical factor. Heart disease is the leading cause of death worldwide, so predicting it at an early stage matters. In recent years, machine learning has evolved into a reliable and supportive tool in the medical domain and, with proper training and testing, provides strong support for disease prediction. The main idea behind this work is to study diverse prediction models for heart disease and to select important heart disease features using the Random Forest algorithm. Random Forest is a supervised machine learning algorithm that achieves high accuracy compared with other supervised algorithms such as logistic regression. Using the Random Forest algorithm, we predict whether or not a person has heart disease.

Abhishek Chauhan

ACKNOWLEDGEMENT

The success of any work depends largely on the encouragement and guidance of many others, apart from one's own efforts. I take this opportunity to express my gratitude to the people who have been instrumental in the progress of this work. First of all, I sincerely thank Almighty God for His compassion and bountiful blessings, which allowed me to see this wonderful moment. I would like to express my deep sense of gratitude to Dr. Avnip Deora, Dean (School of IT), AIMETC, for his continuous support of my study and research; his motivation, enthusiasm and immense knowledge throughout this work are deeply appreciated.

I am also highly indebted to him for his suggestions, as he spared no effort in reviewing and auditing my report linguistically and technically. Last but not least, I would like to thank my family for encouraging me not only in this project work but throughout my life.

TABLE OF CONTENTS
CERTIFICATE
DECLARATION
ABSTRACT
ACKNOWLEDGEMENT

CHAPTER-1 INTRODUCTION
    1.1 Introduction
    1.2 Background of Study

CHAPTER-2 LITERATURE SURVEY
    2.1 Data Source

CHAPTER-3 AIM AND SCOPE OF PRESENT INVESTIGATION
    3.1 Existing System
    3.2 Proposed System
    3.3 Feasibility Study
        3.3.1 Economic Feasibility
        3.3.2 Technical Feasibility
        3.3.3 Operational Feasibility

CHAPTER-4 REQUIREMENT ANALYSIS
    4.1 Requirement Analysis
    4.2 System Requirements

CHAPTER-5 DESIGN
    5.1 Data Flow Diagram
    5.2 Types of DFD
    5.3 DFD Components

CHAPTER-6 SCREENSHOTS

CHAPTER-7 INTRODUCTION OF TECHNOLOGIES AND LIBRARIES USED IN PROJECT
    7.1 Technologies
    7.2 Libraries
    7.3 Algorithms
    7.4 System Architecture
    7.5 Modules

CHAPTER-8 IMPLEMENTATION

BIBLIOGRAPHY

Chapter 1
INTRODUCTION

1.1 Introduction

The heart is a muscular organ that pumps blood through the body and is the central part of the body's cardiovascular system, which also includes the lungs. The cardiovascular system further comprises a network of blood vessels, such as veins, arteries, and capillaries, which deliver blood throughout the body. Abnormalities in the normal flow of blood from the heart cause several types of heart disease, commonly known as cardiovascular diseases (CVD). Heart diseases are the main cause of death worldwide. According to a survey by the World Health Organization (WHO), 17.5 million global deaths occur because of heart attacks and strokes. More than 75% of deaths from cardiovascular diseases occur in middle-income and low-income countries, and about 80% of the deaths due to CVDs are caused by stroke and heart attack. Therefore, predicting cardiac abnormalities at an early stage, and tools for the prediction of heart disease, can save many lives and help doctors design effective treatment plans, ultimately reducing the mortality rate due to cardiovascular diseases.
Due to the development of advanced healthcare systems, large amounts of patient data are now available (i.e., big data in electronic health record systems) and can be used to design predictive models for cardiovascular diseases. Data mining, or machine learning, is a discovery method for analyzing big data from assorted perspectives and encapsulating it into useful information. "Data mining is a non-trivial extraction of implicit, previously unknown, and potentially useful information from data." Nowadays, a huge amount of data pertaining to disease diagnosis, patients, etc. is generated by healthcare industries, and data mining provides a number of techniques for discovering hidden patterns or similarities in these data.
Therefore, in this work a machine learning algorithm is proposed for the implementation of a heart disease prediction system, validated on two open-access heart disease prediction datasets. Data mining is the computer-based process of extracting useful information from enormous sets of databases; it is most helpful in exploratory analysis because it uncovers non-trivial information in large volumes of evidence. Medical data mining has great potential for exploring hidden patterns in clinical data sets, and these patterns can be utilized for healthcare diagnosis. However, the available raw medical data are widely distributed, voluminous and heterogeneous in nature; they need to be collected in an organized form and can then be integrated to form a medical information system. Data mining provides a user-oriented approach to novel and hidden patterns in the data, and data mining tools are useful both for answering business questions and for predicting various diseases in the healthcare field. Disease prediction plays a significant role in data mining. This report analyzes heart disease prediction using classification algorithms; the hidden patterns identified can be used by healthcare administrators to improve their services. Heart disease has been the most crucial cause of death in countries such as India and the United States. In this project we predict heart disease using classification algorithms: machine learning techniques such as Random Forest and Logistic Regression are used to explore different kinds of heart-related problems.

1.2 Background of study

The heart disease predictor is an offline platform designed and developed to explore machine learning. The goal is to predict the health of a patient from collected data, so as to detect configurations that put the patient at risk and, in cases requiring emergency medical assistance, to alert the appropriate medical staff to the patient's situation. We start with a dataset collecting information on many patients, from which we can consolidate the results and make precise predictions. The results of the predictions, derived from the predictive models generated by machine learning, are presented through several distinct graphical views according to the dataset considered. We then critically assess the scope of our results. The data have been collected from Kaggle. Data collection is the process of gathering and measuring information from many different sources so that it can be used for analysis.

Chapter 2
LITERATURE SURVEY

Machine learning techniques are used to analyze and predict medical data. Diagnosis of heart disease is a significant and tedious task in medicine; the term heart disease encompasses the various diseases that affect the heart. Detecting heart disease from various factors or symptoms is a multi-layered problem that is not free from false presumptions and is often accompanied by unpredictable effects. The data classification in this work is based on a supervised machine learning algorithm, which results in better accuracy. Here we use Random Forest as the training algorithm to train on the heart disease dataset and to predict heart disease. The results show that the designed prediction system is capable of predicting heart attacks successfully. Machine learning techniques have been used to indicate early mortality by analyzing heart disease patients and their clinical records (Richards, G. et al., 2001). (Sung, S.F. et al., 2015) applied two machine learning techniques, a k-nearest neighbor model and an existing multiple linear regression, to predict the stroke severity index (SSI) of patients; their study shows that k-nearest neighbor performed better than the multiple linear regression model. (Arslan, A. K. et al., 2016) suggested various machine learning techniques such as support vector machines (SVM) and penalized logistic regression (PLR) to predict heart stroke; their results show that SVM produced the best prediction performance compared to the other models.
Boshra Brahmi et al. [20] developed different machine learning techniques to evaluate the prediction and diagnosis of heart disease. The main objective was to evaluate different classification techniques such as J48, Decision Tree, KNN and Naive Bayes, and then to assess their performance in terms of accuracy, precision, sensitivity and specificity.
2.1 Data source:
Clinical databases have accumulated a significant amount of information about patients and their medical conditions. A record set with medical attributes was obtained from the Cleveland Heart Disease database, and from this dataset the patterns significant to heart attack diagnosis are extracted.
The records were split equally into a training dataset and a testing dataset. A total of 303 records with 76 medical attributes were obtained; all the attributes are numeric-valued. We work on a reduced set of only 14 attributes.
The following restrictions were imposed to reduce the number of patterns:
1. The features should appear on only one side of the rule.
2. The rule should separate the features into different groups.
3. The number of features available from the rule is determined by the medical history of people having heart disease only.
The following list shows the attributes on which we are working.
Variable definitions in the dataset:
 age: age of the patient
 sex: sex of the patient
 exang: exercise-induced angina (1 = yes; 0 = no)
 ca: number of major vessels (0-3)
 cp: chest pain type
   Value 1: typical angina
   Value 2: atypical angina
   Value 3: non-anginal pain
   Value 4: asymptomatic
 trtbps: resting blood pressure (in mm Hg)
 chol: cholesterol in mg/dl, fetched via BMI sensor
 fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
 rest_ecg: resting electrocardiographic results
   Value 0: normal
   Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
   Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
 thalach: maximum heart rate achieved
 target: 0 = less chance of heart attack; 1 = more chance of heart attack
Additional variable descriptions to help us:
 age - age in years
 sex - sex (1 = male; 0 = female)
 cp - chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 0 = asymptomatic)
 trestbps - resting blood pressure (in mm Hg on admission to the hospital)
 chol - serum cholesterol in mg/dl
 fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
 restecg - resting electrocardiographic results (1 = normal; 2 = having ST-T wave abnormality; 0 = hypertrophy)
 thalach - maximum heart rate achieved
 exang - exercise-induced angina (1 = yes; 0 = no)
 oldpeak - ST depression induced by exercise relative to rest
 slope - the slope of the peak exercise ST segment (2 = upsloping; 1 = flat; 0 = downsloping)
 ca - number of major vessels (0-3) colored by fluoroscopy
 thal - 2 = normal; 1 = fixed defect; 3 = reversible defect
 num - the predicted attribute: diagnosis of heart disease (angiographic disease status) (Value 0: < 50% diameter narrowing; Value 1: > 50% diameter narrowing)
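To make the working attribute set concrete, a minimal pandas sketch for loading the data and keeping only the reduced attribute set is shown below. The file name heart.csv and the exact column names are assumptions (the Kaggle and UCI versions of this dataset name some columns differently, e.g. trtbps vs. trestbps and target vs. output), so the column list is filtered defensively.

```python
import pandas as pd

# Assumed file name for the downloaded heart disease dataset.
df = pd.read_csv("heart.csv")

# The reduced working attribute set described above (one common naming variant).
wanted = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
          "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Keep only the columns that are actually present in this version of the file.
df = df[[c for c in wanted if c in df.columns]]

print(df.shape)            # expected: (303, 14)
print(df.head())           # first few records
print(df.isnull().sum())   # check for missing values
```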

Chapter 3
AIM AND SCOPE OF PRESENT INVESTIGATION

3.1 EXISTING SYSTEM:


Clinical decisions are often made based on doctors' intuition and experience rather than on the knowledge-rich data hidden in the database. This practice leads to unwanted biases, errors and excessive medical costs, which affect the quality of service provided to patients. There are many ways that a medical misdiagnosis can present itself. Whether a doctor or hospital staff is at fault, a misdiagnosis of a serious illness can have very extreme and harmful effects. The National Patient Safety Foundation cites that 42% of medical patients feel they have experienced a medical error or missed diagnosis. Patient safety is sometimes negligently given the back seat to other concerns, such as the cost of medical tests, drugs, and operations. Medical misdiagnoses are a serious risk to the healthcare profession; if they continue, people will fear going to the hospital for treatment. Medical misdiagnosis can be reduced by informing the public and by filing claims and suits against the medical practitioners at fault.
Disadvantages:
 Prediction is not possible at early stages.
 In the existing system, practical use of the collected data is time-consuming.
 Any fault made by the doctor or hospital staff in prediction could lead to fatal incidents.
 A highly expensive and laborious process must be performed before treating the patient to find out whether he/she has any chance of getting heart disease in the future.

3.2 PROPOSED SYSTEM:


This section depicts an overview of the proposed system and illustrates all of the components, techniques and tools used for developing the entire system. To develop an intelligent and user-friendly heart disease prediction system, an efficient software tool is needed in order to train on huge datasets and compare multiple machine learning algorithms. After choosing the robust algorithm with the best accuracy and performance measures, it will be implemented in the development of a smartphone-based application for detecting and predicting heart disease risk level.

3.3 FEASIBILITY STUDY:


A feasibility study is a preliminary study undertaken before the real work of a project starts to ascertain the likelihood of the project's success. It is an analysis of possible alternative solutions to a problem and a recommendation on the best alternative.

3.3.1 Economic Feasibility:


Economic feasibility is the process of assessing the benefits and costs associated with the development of a project. A proposed system that is both operationally and technically feasible must also be a good investment for the organization. With the proposed system, users benefit greatly, as they can obtain an early indication of heart disease risk from routine clinical attributes. The proposed system does not need any additional software or a high system configuration; hence it is economically feasible.

3.3.2 Technical Feasibility:


Technical feasibility considers whether the proposed system can be developed given the technical issues, such as availability of the necessary technology, technical capacity, adequate response and extensibility. The project is built using Python. Jupyter Notebook is designed for use in distributed environments such as the internet, and for the professional programmer it is easy to learn and use effectively. As the developing organization has all the resources available to build the system, the proposed system is technically feasible.

3.3.3 Operational Feasibility:
Operational feasibility is the process of assessing the degree to which a proposed system solves business problems or takes advantage of business opportunities. The system is self-explanatory and does not need any extra sophisticated training. It has built-in methods and classes that are required to produce the result, and the application can be handled very easily by a novice user. The overall time a user needs to get trained is less than one hour. The software used for developing this application is very economical and readily available in the market. Therefore, the proposed system is operationally feasible.

CHAPTER 4

REQUIREMENT ANALYSIS

4.1 REQUIREMENT ANALYSIS

The Software Requirement Specification (SRS) is the starting point of the software development activity. As systems grew more complex, it became evident that the goals of the entire system could not be easily comprehended; hence the need for the requirement phase arose. The software project is initiated by the client's needs. The SRS is the means of translating the ideas in the minds of the clients (the input) into a formal document (the output of the requirement phase). Under requirement specification, the focus is on specifying what has been found during analysis; issues such as representation, specification languages and tools, and checking of the specifications are addressed during this activity. The requirement phase terminates with the production of the validated SRS document, and producing the SRS document is the basic goal of this phase. The purpose of the SRS is to reduce the communication gap between the clients and the developers; it is the medium through which the client and user needs are accurately specified. It forms the basis of software development, and a good SRS should satisfy all the parties involved in the system.
4.1.1 Product Perspective:
The application is developed in such a way that any future enhancement can be easily implemented and it requires minimal maintenance. The software tools used are open source and easy to install, and the application itself should be easy to install and use. This is an independent application which can easily be run on any system that has Python and Jupyter Notebook installed.
4.1.2 Product Features:
The application is developed in such a way that heart disease is predicted using Random Forest, and we can compare the accuracy of the implemented algorithms.
User characteristics: the application is developed so that it is
 easy to use,
 error free,
 requires minimal or no training, and
 supports regular patient monitoring.
Assumptions & Dependencies: It is assumed that the dataset taken fulfils all the requirements.

4.1.3 Domain Requirements:
This document is the only one that describes the requirements of the system. It is meant for use by the developers and will also be the basis for validating the final heart disease system. Any changes made to the requirements in the future will have to go through a formal change approval process.
User Requirements: The user can compare prediction accuracy to decide which algorithm should be used in real-time predictions.
Non-Functional Requirements:
 The dataset collected should be in CSV format.
 The column values should be numerical values.
 The training set and test set are stored as CSV files.
 Error rates can be calculated for the prediction algorithms.
4.1.4 Requirements:
 Efficiency: Less time for predicting heart disease.
 Reliability: Maturity, fault tolerance and recoverability.
 Portability: The software can easily be transferred to another environment, including installability.
4.1.5 Usability:
How easy it is to understand, learn and operate the software system.
Organizational Requirements: Do not block the required ports through the Windows firewall; an internet connection should be available.
Implementation Requirements: The dataset collection and an internet connection to install the related libraries.
Engineering Standard Requirements: The user interface is developed in Python and takes the patient's attribute values as input.
4.1.6 Hardware Interfaces:
Ethernet on the AS/400 supports TCP/IP, Advanced Peer-to-Peer Networking (APPN) and Advanced Program-to-Program Communications (APPC). ISDN: the AS/400 can be connected to an Integrated Services Digital Network (ISDN) for faster, more accurate data transmission. An ISDN is a public or private digital communications network that can support data, fax, image and other services over the same physical interface. Other protocols, such as IDLC and X.25, can also be used on ISDN.
Software Interfaces: Anaconda Navigator and Jupyter Notebook are used.

4.1.7 Operational Requirements:
a. Economic: The developed product is economical, as it does not require any special hardware interface.
b. Environmental: Statements of fact and assumptions that define the expectations of the system in terms of mission objectives, environment, constraints, and measures of effectiveness and suitability (MOE/MOS). The customers are those who perform the eight primary functions of systems engineering, with special emphasis on the operator as the key customer.
c. Health and Safety: The software may be safety-critical; if so, there are issues associated with its integrity level. The software may not be safety-critical even though it forms part of a safety-critical system. There is little point in producing 'perfect' code in some language if the hardware and system software (in the widest sense) are not reliable. If a computer system is to run software of a high integrity level, then that system should not at the same time accommodate software of a lower integrity level. Systems with different safety-level requirements must be separated; otherwise, the highest level of integrity required must be applied to all systems in the same environment.

4.2 SYSTEM REQUIREMENTS

4.2.1 Hardware Requirements:

Processor        : 500 MHz or above
RAM              : 4 GB
Hard Disk        : 4 GB
Input devices    : Standard keyboard and mouse
Output device    : VGA or high-resolution monitor

4.2.2 Software Requirements:

Operating System : Windows 7 or higher
Programming      : Python 3.6 and related libraries
Software         : Anaconda Navigator, Jupyter Notebook and Google Colab

CHAPTER 5

DATA FLOW DIAGRAMS

5.1 Data Flow Diagrams

The data flow diagram (DFD) is one of the most important tools used by system analysts. Data flow diagrams are made up of a number of symbols which represent system components. Most data flow modeling methods use four kinds of symbols: processes, data stores, data flows and external entities. Circles in a DFD represent processes, a data flow is represented by a thin line, each data store has a unique name, and a square or rectangle represents an external entity.
5.2. Types of DFD:

• Logical DFD - This type of DFD concentrates on the system processes and the flow of data in the system. In this project, the logical DFD describes how data moves from the backend to the frontend: the data passes through a number of authentication and authorization checks, and only once the database connection is established does the data flow start.
• Physical DFD - This type of DFD shows how the data flow is actually implemented in the system. It is more specific and closer to the implementation.

5.3. DFD Components:

• Entities - Entities are the sources and destinations of information data. They are represented by rectangles with their respective names.
• Process - Activities and actions taken on the data are represented by circles or round-edged rectangles.
• Data Storage - There are two variants of data storage: it can be represented either as a rectangle with both smaller sides absent or as an open-sided rectangle with only one side missing.
• Data Flow - The movement of data is shown by pointed arrows, from the base of the arrow (the source) towards the head of the arrow (the destination).

Level 0 DFD: Data Collection (diagram)

Level 1 DFD (diagram)
CHAPTER 6

SCREENSHOTS

1.

In the above screenshot, we import and print the modules and libraries and read the dataset from the system.

2.

Here we print the dataset and its values.

3.

4.

5.

Age Variable

 The vast majority of patients are between 50 and 60 years old.
 There is a notable region on the chart: the number of patients decreases between the ages of 47 and 50.
 There appear to be no outliers in this variable.

6.

Sex Variable

 68.3% of the patients are male, 31.7% are female.


 So, the number of male patients is more than twice that of female patients.

7.

Cp Variable

 Almost half of the patients have an observation value of 0; in other words, they are asymptomatic, with no classic chest pain symptoms.
 If we examine the other half of the pie chart, 1 out of 4 patients has an observation value of 2; in other words, atypical angina is seen in about 29% of the patients. This observation value covers patients with shortness of breath or non-classical pain.
 The other two observation values occur less often.
 16.5% of patients have a value of 1, i.e., typical angina, the classic exertional pain that comes during any physical activity.
 The remaining 8% have the value for non-anginal pain, the third type of angina. Non-anginal pain is the term used to describe chest pain that is not caused by heart disease or a heart attack.

8.

Sex - Target Variable

 Among female patients, those at high risk of heart attack are noticeably more numerous than those at low risk.
 The situation is different for an observation value of 1, that is, for men: the blue-colored bar has more observations.
 So a male patient in this dataset is more likely to fall into the low-risk group.
 In summary, female patients are proportionally at higher risk of a heart attack.
 The correlation between the two variables is -0.280937; in other words, there is a low-intensity negative correlation.

9.

10. ROC Curve and Area Under the Curve (AUC)

11.

12.

13.

14.

1. We used four different algorithms in the model phase.


2. We got 87% accuracy and 88% AUC with the Logistic Regression model.
3. We got 83% accuracy and 85% AUC with the Decision Tree Model.
4. We got 83% accuracy and 89% AUC with the Support Vector Classifier Model.
5. And we got 90.3% accuracy and 93% AUC with the Random Forest Classifier
Model.
6. When all these model outputs are evaluated, we prefer the model we created with
the Random Forest Algorithm, which gives the best results.
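As an illustration of how these four models could be trained and compared, a small scikit-learn sketch is given below. The file name, target column name, train/test split and hyperparameters are assumptions, so the accuracy and AUC values it prints will not necessarily match the percentages quoted above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

df = pd.read_csv("heart.csv")                       # assumed file name
X, y = df.drop(columns=["target"]), df["target"]    # column is named "output" in some versions

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Support Vector Classifier": SVC(probability=True, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: accuracy = {acc:.3f}, AUC = {auc:.3f}")
```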

CHAPTER 7

INTRODUCTION OF TECHNOLOGIES AND LIBRARIES USED IN PROJECT

7.1 TECHNOLOGIES

7.1.1 Python:
Python is an interpreted high-level programming language for general-purpose
programming. Created by Guido van Rossum and first released in 1991, Python has a
design philosophy that emphasizes code readability, notably using significant whitespace.
It provides constructs that enable clear programming on both small and large scales.
Python features a dynamic type of system and automatic memory management. It supports
multiple programming paradigms, including object-oriented, imperative, functional and
procedural, and has a large and comprehensive standard library. Python interpreters are
available for many operating systems. C Python, the reference implementation of Python,
is open-source software and has a communitybased development model, as do nearly all its
variant implementations. C Python is managed by the non-profit Python Software
Foundation.

7.2 Libraries

7.2.1 Pandas
Pandas is an open-source Python library providing high-performance data manipulation and analysis tools built on its powerful data structures. The name Pandas is derived from "panel data", an econometrics term for multidimensional data. In 2008, developer Wes McKinney started developing pandas when he needed a high-performance, flexible tool for data analysis. Prior to Pandas, Python was mainly used for data gathering and preparation and contributed little to data analysis itself; Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model and analyze. Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics and analytics.

Key Features of Pandas:
 Fast and efficient DataFrame object with default and customized indexing.
 Tools for loading data into in-memory data objects from different file formats.
 Data alignment and integrated handling of missing data.
 Reshaping and pivoting of data sets.
 Label-based slicing, indexing and subsetting of large data sets.
 Columns from a data structure can be deleted or inserted.
 Group-by data for aggregation and transformations.
 High-performance merging and joining of data.
 Time series functionality.
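For example, the load-prepare-analyze workflow described above can be expressed in a few lines (the file and column names are placeholders taken from this project's dataset):

```python
import pandas as pd

df = pd.read_csv("heart.csv")                  # load the dataset into a DataFrame
print(df.info())                               # data types and missing-value overview
print(df.describe())                           # descriptive statistics for each column
print(df.groupby("sex")["target"].mean())      # aggregation: heart-attack rate per sex
```
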
7.2.2 NumPy:

NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python. It contains various features, including these important ones:
 A powerful N-dimensional array object
 Sophisticated (broadcasting) functions
 Tools for integrating C/C++ and Fortran code
 Useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data types can be defined, which allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
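A brief example of the array object and broadcasting features listed above:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # a 2-D array object
v = np.array([10.0, 20.0])

print(a + v)             # broadcasting: v is added to every row of a
print(a.mean(axis=0))    # column-wise mean
print(np.linalg.inv(a))  # linear algebra: matrix inverse
```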

7.2.3 Scikit-Learn:
 Simple and efficient tools for data mining and data analysis
 Accessible to everybody, and reusable in various contexts
 Built on NumPy, SciPy, and Matplotlib
 Open source, commercially usable - BSD license

7.2.4 Matplotlib:

 Matplotlib is a Python library used to create 2D graphs and plots from Python scripts.
 It has a module named pyplot which makes plotting easy by providing features to control line styles, font properties, axis formatting, etc.
 It supports a very wide variety of graphs and plots, namely histograms, bar charts, power spectra, error charts, etc.
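For instance, an age histogram like the one shown in the screenshots chapter could be produced with pyplot as follows (file and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("heart.csv")                    # assumed file name
plt.hist(df["age"], bins=20, edgecolor="black")  # distribution of patient ages
plt.xlabel("Age")
plt.ylabel("Number of patients")
plt.title("Age distribution of patients")
plt.show()
```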

7.2.5 Jupyter Notebook:


 The Jupyter Notebook is an incredibly powerful tool for interactively developing and
presenting data science projects.
 A notebook integrates code and its output into a single document that combines
visualizations, narrative text, mathematical equations, and other rich media.
 The Jupyter Notebook is an open-source web application that allows you to create and
share documents that contain live code, equations, visualizations and narrative text.
 Uses include data cleaning and transformation, numerical simulation, statistical
modeling, data visualization, machine learning, and much more.
 The Notebook has support for over 40 programming languages, including Python, R,
Julia, and Scala.
 Notebooks can be shared with others using email, Dropbox, GitHub and the Jupyter Notebook.
 Your code can produce rich, interactive output: HTML, images, videos, LaTeX, and custom MIME types.

7.3 ALGORITHMS

7.3.1 Logistic Regression
A popular statistical technique for predicting binomial outcomes (y = 0 or 1) is Logistic Regression. Logistic regression predicts categorical outcomes (binomial/multinomial values of y). The predictions of Logistic Regression (henceforth, LogR) take the form of probabilities of an event occurring, i.e., the probability of y = 1 given certain values of the input variables x; thus, the results of LogR range between 0 and 1. LogR models the data points using the standard logistic function, which is an S-shaped curve, also called the sigmoid curve, given by the equation:
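p(y = 1 | x) = sigma(z) = 1 / (1 + e^(-z)), where z = b0 + b1·x1 + ... + bn·xn is a linear combination of the input variables, so the predicted probability always lies between 0 and 1.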

Logistic Regression Assumptions:


 Logistic regression requires the dependent variable to be binary.
 For a binary regression, the factor level 1 of the dependent variable should represent
the desired outcome.
 Only the meaningful variables should be included.
 The independent variables should be independent of each other.
 Logistic regression requires quite large sample sizes.
 Even though logistic (logit) regression is frequently used for binary variables (2 classes), it can be used for categorical dependent variables with more than 2 classes; in this case it is called multinomial logistic regression.

Fig 7.3.1: Logistic Regression

7.3.2 Decision Tree:

A decision tree is a non-parametric supervised learning algorithm for classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes and leaf nodes, and it provides easy-to-understand models.

A decision tree is also a hierarchical model used in decision support: it depicts decisions and their potential outcomes, incorporating chance events, resource costs and utility. The algorithm uses conditional control statements and, being non-parametric and supervised, is useful for both classification and regression tasks.

Fig 7.3.2: Decision Tree

7.3.3 Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.

Fig 7.3.3: Support Vector Machine

7.3.4 Random Forest:
Random Forest is a supervised learning algorithm that is used for both classification and regression; however, it is mainly used for classification problems. Just as a forest is made up of trees, and more trees mean a more robust forest, the random forest algorithm creates decision trees on data samples, gets a prediction from each of them, and finally selects the best solution by means of voting. It is an ensemble method that is better than a single decision tree because it reduces over-fitting by averaging the results.
Random Forest works through the following steps:
 First, select random samples from the given dataset.
 Next, construct a decision tree for every sample and obtain a prediction from each decision tree.
 Then perform voting over the predicted results.
 Finally, select the most-voted prediction as the final prediction result.
The following diagram illustrates its working.

Fig 7.3.4: Random Forest Classifier
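A minimal sketch of this training-and-voting procedure on the heart disease data, including the feature importances that can guide the selection of important attributes, might look as follows (file name, column names and hyperparameters are assumptions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")                        # assumed file name
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 100 trees: each tree is trained on a bootstrap sample and votes on the class.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print("Test accuracy:", rf.score(X_test, y_test))
# Feature importances indicate which attributes contribute most to the prediction.
print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))
```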

7.4 SYSTEM ARCHITECTURE
The figure below shows the process flow diagram of the proposed work. First, we collected the Cleveland Heart Disease database from the UCI website, then pre-processed the dataset and selected 16 important features.

Fig 7.4: SYSTEM ARCHITECTURE

After that, we applied the ANN and Logistic Regression algorithms individually and computed their accuracy. Finally, we used the proposed Ensemble Voting method and determined the best method for the diagnosis of heart disease.
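A rough sketch of such an ensemble voting step using scikit-learn is shown below, with MLPClassifier standing in for the ANN; the models, parameters and file name are assumptions rather than the exact configuration used in this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

df = pd.read_csv("heart.csv")                        # assumed file name
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Assumed stand-ins: a small MLP for the ANN, plus standard logistic regression.
ann = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=42))
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Soft voting averages the predicted probabilities of the two models.
ensemble = VotingClassifier(estimators=[("ann", ann), ("logreg", logreg)], voting="soft")
ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))
```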

7.5 MODULES:
The entire work of this project is divided into 4 modules.
They are:
a. Data Pre-processing
b. Feature Extraction
c. Classification
d. Prediction
a. Data Pre-processing:
This file contains all the pre-processing functions needed to process the input data. First, we read the train, test and validation data files and performed some preprocessing such as tokenizing and stemming. Some exploratory data analysis is also performed, such as examining the response variable distribution and running data quality checks for null or missing values. Data preprocessing is the process of transforming raw data into an understandable format. It is an important step in data mining, as we cannot work with raw data; the quality of the data should be checked before applying machine learning or data mining algorithms. Data quality can be checked in terms of the following:
 Accuracy: whether the data entered are correct.
 Completeness: whether the data are available and recorded.
 Consistency: whether the same data are kept consistently in all the places where they appear.
 Timeliness: whether the data are updated correctly.
 Believability: whether the data are trustworthy.
 Interpretability: the understandability of the data.

b. Feature Extraction:
In this file we performed feature extraction and selection using methods from the scikit-learn Python library. For feature selection, we used methods such as a simple bag-of-words and n-grams, followed by term-frequency weighting such as TF-IDF. We also used word2vec and POS tagging to extract features, though POS tagging and word2vec have not been used at this point in the project.
Bag of Words:
It is an algorithm that transforms text into fixed-length vectors by counting the number of times each word is present in a document. The word occurrences allow us to compare different documents and evaluate their similarities for applications such as search, document classification, and topic modeling.

N-grams:
N-grams are continuous sequences of words, symbols or tokens in a document. In technical terms, they can be defined as neighbouring sequences of items in a document. They come into play when we deal with text data in NLP (Natural Language Processing) tasks.
TF-IDF Weighting:
TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in
the fields of information retrieval (IR) and machine learning, that can quantify the
importance or relevance of string representations (words, phrases, lemmas, etc) in a
document amongst a collection of documents (also known as a corpus).
c. Classification:
Here we built all the classifiers for heart disease detection. The extracted features are fed into different classifiers. We used Naive Bayes, Logistic Regression, Linear SVM, Stochastic Gradient Descent and Random Forest classifiers from sklearn. Each of the extracted features was used in all the classifiers. After fitting each model, we compared the F1 scores and checked the confusion matrices.
After fitting all the classifiers, the two best-performing models were selected as candidate models for heart disease classification.

d. Prediction:
Our finally selected, best-performing classifier was then saved on disk with the name final_model.sav. Once this repository is cloned, the model is copied to the user's machine and is used by the prediction.py file to classify heart disease. It takes a patient's attribute values as input from the user; the model then produces the final classification output, which is shown to the user along with the predicted probability.
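The saving and loading of final_model.sav described above could be done with pickle, for example; the report does not state which serialization library was used, and the sample patient record below is purely hypothetical.

```python
import pickle

# clf is assumed to be the trained classifier chosen earlier (e.g. the Random Forest model).
with open("final_model.sav", "wb") as f:
    pickle.dump(clf, f)

# In prediction.py: load the saved model and classify a new patient record.
with open("final_model.sav", "rb") as f:
    model = pickle.load(f)

# One hypothetical patient described by the 13 input attributes (order must match training).
sample = [[57, 1, 0, 140, 241, 0, 1, 123, 1, 0.2, 1, 0, 3]]
print("Prediction:", model.predict(sample))
print("Probability of heart disease:", model.predict_proba(sample)[0][1])
```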

CHAPTER 8

IMPLEMENTATION

1. Environment Setup

 Install necessary libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn.

2. Import Libraries

 Import essential libraries for data handling, preprocessing, modeling, and evaluation.

3. Load the Dataset

 Load the heart disease dataset (e.g., UCI Heart Disease dataset) using Pandas.

4. Data Exploration

 Display the first few rows of the dataset.


 Get a summary of the dataset including data types and missing values.
 Generate descriptive statistics of the dataset.

5. Data Preprocessing

 Handle missing values, if any, by filling them with appropriate statistics (mean,
median).
 Encode categorical variables using one-hot encoding or label encoding.
 Split the dataset into features (X) and target (y).
 Further split the features and target into training and testing sets.
 Scale the features using StandardScaler for better model performance.

6. Model Training

 Initialize the Random Forest Classifier with specified hyperparameters.


 Train the classifier on the training dataset.

7. Model Evaluation

 Use the trained model to make predictions on the testing dataset.


 Calculate the accuracy of the model.

 Generate and display the confusion matrix.
 Generate and display the classification report, which includes precision, recall, and F1-
score.

8. Visualization

 Plot the confusion matrix using Seaborn for better visualization and understanding of
the model’s performance.

This process will allow you to implement a machine learning model for heart attack
prediction from loading the data to evaluating the model’s performance.
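Putting steps 3 to 8 together, a condensed sketch of the pipeline might look like the following; the file name, target column name and hyperparameters are assumptions.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Steps 3-4: load and explore the dataset (assumed file name).
df = pd.read_csv("heart.csv")
print(df.head())
print(df.describe())

# Step 5: preprocessing - fill missing values, split, and scale.
df = df.fillna(df.median(numeric_only=True))
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)

# Step 6: train the Random Forest classifier.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Step 7: evaluate on the test set.
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Step 8: visualize the confusion matrix as a heatmap.
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion matrix")
plt.show()
```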

BIBLIOGRAPHY

American Heart Association. (2022). Heart Attack. Retrieved from https://www.heart.org/en/health-topics/heart-attack

Dua, D., & Graff, C. (2019). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. Retrieved from http://archive.ics.uci.edu/ml

Krittanawong, C., & Zhang, H. (2020). Artificial Intelligence in Precision Cardiovascular Medicine. Journal of the American College of Cardiology, 75(20), 2560-2574. doi:10.1016/j.jacc.2020.03.036

Mathews, S. C., McShea, M. J., Hanley, C. L., & Ravitz, A. D. (2019). Lab Values: Interpreting Chemistry and Hematology for Adult Patients. Hoboken, NJ: Wiley.

World Health Organization. (2022). Cardiovascular Diseases (CVDs). Retrieved from https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1

Rasheed, J., Haroon, S., & Ali, A. (2020). Machine Learning Techniques for Predicting Heart Diseases: A Comprehensive Review. Journal of Healthcare Engineering, 2020, 1-15. doi:10.1155/2020/8518034

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York, NY: Springer.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830. Retrieved from http://jmlr.org/papers/v12/pedregosa11a.html
