
HEART DISEASE PREDICTION USING

MACHINE LEARNING

A MINI PROJECT REPORT

Submitted By

KAVIYA SRI U (421122108020)


VICSHIYA SHERIN V (421122108056)
NANDHINI V (421122108303)

In partial fulfillment for the award of the degree of

BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY

IFET COLLEGE OF ENGINEERING

(An Autonomous Institution)


Approved by AICTE, New Delhi and Accredited by NAAC & NBA
Affiliated to Anna University, Chennai-25
Gangarampalayam, Villupuram – 605 108
NOVEMBER 2023

IFET COLLEGE OF ENGINEERING
(An Autonomous Institution)

BONAFIDE CERTIFICATE

Certified that this report titled "HEART DISEASE PREDICTION USING
MACHINE LEARNING" is the bonafide work of KAVIYA SRI U (421122108020),
VICSHIYA SHERIN V (421122108056) & NANDHINI V (421122108303), who
carried out the work under my supervision. Certified further that to the best of my
knowledge the work reported herein does not form part of any other thesis or dissertation
on the basis of which a degree or award was conferred on an earlier occasion on this or
any other candidate.

SIGNATURE SIGNATURE

Dr.R.THENDRAL, M.E., Ph.D. MR.ARUNKUMAR

HEAD OF THE DEPARTMENT, SUPERVISOR,

Associate Professor, Associate Professor,

Department of IT, Department of IT,

IFET College of Engineering, IFET College of Engineering,

Villupuram-605108. Villupuram-605108.

Submitted to the Viva Voce Examination held on

INTERNAL EXAMINER EXTERNAL EXAMINER

ACKNOWLEDGEMENT

We thank the Almighty for the blessings showered upon us to bring this project
to success. We would like to express our sincere gratitude to our Chairman
Mr. K. V. RAJA, Secretary Mr. K. SHIVRAM ALVA and our Treasurer Mr. R. VIMAL for
providing us with excellent infrastructure and the resources necessary to carry out
this project, and we extend our gratitude to our Principal Dr. G. MAHENDRAN for his
constant support of our work.
We take this opportunity to express our sincere thanks to our Vice Principal and
Dean Academics Dr. S. MATILDA and our Head Placement and Corporate Affairs
Prof. S. VISWANATHAN, who provided all the help needed to execute the project
successfully.
We also wish to express our thanks to our Head of the Department
Dr. R. THENDRAL for her persistent encouragement and support in completing this project.
We express our heartfelt gratitude to our guide Mr. ARUNKUMAR, Associate
Professor, Department of Information Technology, for his priceless guidance and
motivation, which helped us bring this project into shape.
We express our deep sense of thanks to all faculty members and lab technicians
in our department for their cooperation and the interest they showed at every stage of
our endeavor to make this project a success. Last but not least, we wholeheartedly thank
our parents and friends for their moral support in tough times and for their constructive
criticism, which helped us succeed in our work.

ABSTRACT

In the medical field, diagnosing heart disease is one of the most difficult tasks, because
the decision relies on combining large amounts of clinical and pathological data. Because
of this complexity, interest in efficient and accurate heart disease prediction has grown
significantly among researchers and clinical professionals. For heart disease, a correct
diagnosis at an early stage is important because time is a critical factor. Heart disease is
a leading cause of death worldwide, so predicting it at an early stage is significant.
In recent years machine learning has become an evolving, reliable and supportive tool in
the medical domain and, given appropriate training and testing, has provided strong support
for disease prediction. The main idea behind this work is to study diverse prediction
models for heart disease and to select important heart disease features using the Random
Forest algorithm. Random Forest is a supervised machine learning algorithm which, in this
work, achieved higher accuracy than other supervised machine learning algorithms such as
logistic regression. Using the Random Forest algorithm, we predict whether or not a person
has heart disease.

TABLE OF CONTENTS

CHAPTER NO    TITLE

              ABSTRACT
              LIST OF FIGURES
              LIST OF ABBREVIATIONS
1             INTRODUCTION
              1.1 GENERAL
              1.2 DOMAIN OVERVIEW
2             LITERATURE SURVEY
              2.1 REVIEW OF LITERATURE
              2.2 EXISTING SYSTEM
              2.3 DISADVANTAGES OF EXISTING SYSTEM
3             PROPOSED SYSTEM
              3.1 SYSTEM REQUIREMENTS
                  3.1.1 HARDWARE REQUIREMENTS
                  3.1.2 SOFTWARE REQUIREMENTS
              3.2 SYSTEM ARCHITECTURE
              3.3 MODULES
                  3.3.1 DATA PRE-PROCESSING
                  3.3.2 FEATURE EXTRACTION
                  3.3.3 CLASSIFICATION
                  3.3.4 PREDICTION
              3.4 ADVANTAGES
4             RESULT AND DISCUSSION
              4.1 FIGURE
              4.2 SCREENSHOT
5             CONCLUSION AND FUTURE WORK
              5.1 CONCLUSION
              5.2 FUTURE WORK
6             REFERENCE
              6.1 REFERENCE
              6.2 APPENDIX

LIST OF FIGURES

FIGURE NO    NAME OF FIGURE
1            SYSTEM ARCHITECTURE
2            LOGISTIC REGRESSION
3            RANDOM FOREST CLASSIFIER

LIST OF ABBREVIATIONS

CVD - Cardiovascular disease
WHO - World Health Organization
NLP - Natural Language Processing
SSI - Stroke severity index
SVM - Support vector machine
PLR - Penalized logistic regression

CHAPTER 1

INTRODUCTION
1.1 GENERAL
The heart is a muscular organ that pumps blood through the body and is the
central part of the body's cardiovascular system, which works together with the
lungs. The cardiovascular system also comprises a network of blood vessels, for
example veins, arteries, and capillaries, which deliver blood all over the body.
Abnormalities in the normal blood flow from the heart cause several types of heart
disease, commonly known as cardiovascular diseases (CVD). Heart diseases are
the leading cause of death worldwide. According to World Health Organization
(WHO) estimates, about 17.5 million deaths worldwide are due to cardiovascular
diseases. More than 75% of deaths from cardiovascular diseases occur in
middle-income and low-income countries. Also, 80% of the deaths due to CVDs
are caused by stroke and heart attack.
Therefore, predicting cardiac abnormalities at an early stage, together with tools
for predicting heart disease, can save many lives and help doctors design
effective treatment plans, which ultimately reduces the mortality rate due to
cardiovascular diseases. Owing to the development of advanced healthcare systems,
large amounts of patient data are now available (i.e., Big Data in Electronic Health
Record systems) and can be used to design predictive models for cardiovascular
diseases. Data mining, or machine learning, is a discovery method for analyzing
big data from various perspectives and summarizing it into useful information.
"Data mining is a non-trivial extraction of implicit, previously unknown and
potentially useful information about data". Nowadays, healthcare industries
generate a huge amount of data pertaining to disease diagnosis, patients, etc.
Data mining provides a number of techniques that discover hidden patterns or
similarities in data. Therefore, in this work, a machine learning algorithm is
proposed for the implementation of a heart disease prediction system, which was
validated on two open access heart disease prediction datasets.
Data mining is the computer-based process of extracting useful information from
enormous databases. It is most helpful in exploratory analysis because it yields
nontrivial information from large volumes of evidence. Medical data mining has
great potential for exploring the hidden patterns in the datasets of the clinical
domain, and these patterns can be utilized for healthcare diagnosis. However, the
available raw medical data are widely distributed, voluminous and heterogeneous
in nature. These data need to be collected in an organized form and can then be
integrated to form a medical information system. Data mining provides a
user-oriented approach to novel and hidden patterns in the data. Data mining
tools are useful for answering business questions and for predicting various
diseases in the healthcare field. Disease prediction plays a significant role in data
mining. This work analyzes heart disease prediction using classification
algorithms. The information identified can be used by healthcare administrators
to provide better services. Heart disease is a leading cause of death in countries
such as India and the United States. In this project we predict heart disease using
classification algorithms. Machine learning classification algorithms such as
Random Forest and Logistic Regression are used to explore different kinds of
heart-related problems.

1.2 DOMAIN OVERVIEW
Machine learning is used for heart disease prediction in a system that
predicts diseases based on information provided by users. It predicts the disease of
the patient or user from the information or symptoms entered into the web system
and returns results based on that information. It is intended for cases where the
patient's condition is not very serious and the user just wants to understand the
kind of disease he or she may have. The system also gives ideas and tips for taking
care of the user's health and shows how to identify a disease using this
prediction.

CHAPTER 2
LITERATURE SURVEY
2.1 REVIEW OF LITERATURE
Machine learning techniques are used to analyze and make predictions from medical data
resources. Diagnosis of heart disease is a significant and tedious task in medicine. The
term heart disease encompasses the various diseases that affect the heart. Detecting heart
disease from various factors or symptoms is a problem that is not free from false
presumptions and is often accompanied by unpredictable effects. The data classification is
based on a supervised machine learning algorithm, which results in better accuracy. Here we
use Random Forest as the training algorithm to train on the heart disease dataset and to
predict heart disease. The results showed that the designed medical prediction system is
capable of predicting heart attacks successfully.
Machine learning techniques have been used to indicate early mortality by analyzing heart
disease patients and their clinical records (Richards, G. et al., 2001). Sung, S.F. et al.
(2015) applied two machine learning techniques, a k-nearest neighbor model and an existing
multiple linear regression model, to predict the stroke severity index (SSI) of patients.
Their study shows that k-nearest neighbor performed better than the multiple linear
regression model. Arslan, A. K. et al. (2016) suggested various machine learning techniques,
such as support vector machine (SVM) and penalized logistic regression (PLR), to predict
stroke. Their results show that SVM produced the best prediction performance compared to
the other models. Boshra Brahmi et al. [20] developed different machine learning techniques
to evaluate the prediction and diagnosis of heart disease. The main objective was to
evaluate different classification techniques such as J48, Decision Tree, KNN and Naïve
Bayes, which were then compared in terms of accuracy, precision, sensitivity and specificity.

Data source:

Clinical databases have collected a significant amount of information
about patients and their medical conditions. The record set with medical
attributes was obtained from the Cleveland Heart Disease database. With
the help of this dataset, the patterns significant to heart attack diagnosis
are extracted. The records were split equally into two datasets: a training
dataset and a testing dataset. A total of 303 records with 76 medical
attributes were obtained. All the attributes are numeric-valued. We work
on a reduced set of only 14 attributes. The following restrictions were
introduced to reduce the number of patterns:
1. The features should appear on only one side of the rule.
2. The rule should separate the various features into different groups.
3. The number of features available from the rule is organized by the
medical history of people having heart disease only.
The following table shows the list of attributes on which we are working.

S no  Attribute name  Description
1     Age             Age in years
2     Sex             Sex (1 = male; 0 = female)
3     Cp              Chest pain type
4     Trestbps        Resting blood pressure (in mm Hg on admission to the hospital)
5     Chol            Serum cholesterol in mg/dl
6     Fbs             Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7     Restecg         Resting electrocardiographic results
8     Thalach         Maximum heart rate achieved
9     Exang           Exercise induced angina (1 = yes; 0 = no)
10    Oldpeak         ST depression induced by exercise relative to rest
11    Slope           The slope of the peak exercise ST segment
12    Ca              Number of major vessels (0-3) colored by fluoroscopy
13    Thal            3 = normal; 6 = fixed defect; 7 = reversible defect
14    Target          1 or 0
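
As an illustration of how one record with the 14 attributes listed above could be represented and sanity-checked, the following is a minimal sketch. The class name HeartRecord, the validate method and the sample values are assumptions made for this example and are not taken from the project code; the allowed codes follow the attribute table.

```python
from dataclasses import dataclass, replace


@dataclass
class HeartRecord:
    """One patient record with the 14 attributes from the table (hypothetical type)."""
    age: int
    sex: int
    cp: int
    trestbps: int
    chol: int
    fbs: int
    restecg: int
    thalach: int
    exang: int
    oldpeak: float
    slope: int
    ca: int
    thal: int
    target: int

    def validate(self):
        """Return a list of problems found in the coded fields."""
        problems = []
        # Binary flags described in the table as 1/0 values.
        for name in ("sex", "fbs", "exang", "target"):
            if getattr(self, name) not in (0, 1):
                problems.append(f"{name} must be 0 or 1")
        if not 0 <= self.ca <= 3:
            problems.append("ca must be between 0 and 3")
        if self.thal not in (3, 6, 7):
            problems.append("thal must be 3, 6 or 7")
        return problems


# Illustrative values only, not a quoted row of the dataset.
record = HeartRecord(63, 1, 1, 145, 233, 1, 2, 150, 0, 2.3, 3, 0, 6, 1)
bad_record = replace(record, thal=5)
```

Calling record.validate() returns an empty list for a well-formed record, while bad_record.validate() reports the invalid thal code.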

2.2 EXISTING SYSTEM:
Clinical decisions are often made based on doctors' intuition and experience rather
than on the knowledge-rich data hidden in the database. This practice leads to
unwanted biases, errors and excessive medical costs, which affect the quality of
service provided to patients. There are many ways in which a medical misdiagnosis
can present itself. Whether a doctor or the hospital staff is at fault, a misdiagnosis
of a serious illness can have extreme and harmful effects. The National Patient
Safety Foundation reports that 42% of medical patients feel they have
experienced a medical error or missed diagnosis. Patient safety is sometimes
negligently given a back seat to other concerns, such as the cost of medical
tests, drugs, and operations. Medical misdiagnoses are a serious risk to the
healthcare profession. If they continue, people will fear going to the hospital
for treatment. We can put an end to medical misdiagnosis by informing the public
and by filing claims and suits against the medical practitioners at fault.
2.3 DISADVANTAGES OF EXISTING SYSTEM
✓ Prediction of cardiovascular disease is not accurate.
✓ Data mining techniques do not help to provide effective decision making.
✓ Cannot handle enormous datasets of patient records.

CHAPTER 3
PROPOSED SYSTEM:
This section gives an overview of the proposed system and describes the
components, techniques and tools used to develop the entire system. To
develop an intelligent and user-friendly heart disease prediction system, an
efficient software tool is needed to train on huge datasets and compare
multiple machine learning algorithms. After the most robust algorithm, with the
best accuracy and performance measures, has been chosen, it will be used in the
development of a smartphone-based application for detecting and predicting
heart disease risk level. Hardware components such as an Arduino/Raspberry Pi,
different biomedical sensors, a display monitor, a buzzer, etc. are needed to build
the continuous patient monitoring system.

3.1 SYSTEM REQUIREMENTS


3.1.1 HARDWARE REQUIREMENTS
✓ Processor : above 500 MHz
✓ RAM : 4 GB
✓ Hard Disk : 4 GB
✓ Input device : Standard keyboard and mouse
✓ Output device : VGA and high-resolution monitor
3.1.2 SOFTWARE REQUIREMENTS:
✓ Operating System : Windows 7 or higher
✓ Programming : Python 3.6 and related libraries
✓ Software : Anaconda Navigator, Jupyter Notebook and Google Colab

ALGORITHMS
Logistic Regression
Logistic Regression is a popular statistical technique for predicting binomial
outcomes (y = 0 or 1). More generally, it predicts categorical outcomes (binomial or
multinomial values of y). The predictions of Logistic Regression (henceforth LogR)
take the form of probabilities of an event occurring, i.e., the probability of
y = 1 given certain values of the input variables x. Thus, the results of LogR
range between 0 and 1.
LogR models the data points using the standard logistic function, an S-shaped
curve also called the sigmoid curve, which is given by the equation
f(z) = 1 / (1 + e^(-z)).

Logistic Regression Assumptions:


• Binary logistic regression requires the dependent variable to be binary.
• For a binary regression, factor level 1 of the dependent variable should
represent the desired outcome.
• Only the meaningful variables should be included.
• The independent variables should be independent of each other.
• Logistic regression requires quite large sample sizes.
• Even though logistic (logit) regression is frequently used for binary
variables (2 classes), it can also be used for categorical dependent variables with
more than 2 classes. In this case it is called Multinomial Logistic Regression.
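
To make the description of the logistic function concrete, the following is a minimal sketch of how a logistic regression model turns input values into a probability between 0 and 1. The coefficients, the two example inputs and the 0.5 decision threshold are hypothetical values chosen for illustration; they are not fitted to the project dataset.

```python
import math


def sigmoid(z):
    """Standard logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of y = 1 for one record under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical, hand-picked coefficients (assumption, not a trained model).
weights = [0.04, 0.9]   # e.g. age in years, exercise induced angina flag
bias = -3.0

p = predict_probability([60, 1], weights, bias)
label = 1 if p >= 0.5 else 0   # assumed decision threshold of 0.5
```

The linear score z can be any real number, and the sigmoid squeezes it into the 0 to 1 range that is then read as the probability of y = 1.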

Random Forest:
Random Forest is a supervised learning algorithm that can be used for both
classification and regression, although it is mainly used for classification
problems. Just as a forest is made up of trees, and more trees make a more robust
forest, a random forest creates decision trees on data samples, gets a prediction
from each of them and finally selects the best solution by means of voting.
It is an ensemble method that is better than a single decision tree because it reduces
over-fitting by averaging the results.
Random Forest works in the following steps:
• First, select random samples from the given dataset.
• Next, construct a decision tree for every sample and get the prediction result
from every decision tree.
• Then, perform voting over the predicted results.
• Finally, select the most-voted prediction result as the final prediction.
The following diagram illustrates its working.
[Figure: Random Forest classifier]
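
The steps above (random sampling, one tree per sample, voting) can be sketched in a few lines. This is a deliberately simplified illustration, not the implementation used in the project: each "tree" here is a one-split decision stump rather than a full decision tree, and the toy data, function names and forest size of 25 are assumptions made for the example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) records at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority(labels, default):
    """Most common label in a list, or a default if the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def train_stump(rows, rng):
    """Step 2: fit a one-split 'tree' on one randomly chosen feature.

    Each row is (features, label). The stump splits the chosen feature at
    its mean and predicts the majority label on each side of the split.
    """
    f = rng.randrange(len(rows[0][0]))
    threshold = sum(x[f] for x, _ in rows) / len(rows)
    overall = majority([y for _, y in rows], 0)
    left = majority([y for x, y in rows if x[f] <= threshold], overall)
    right = majority([y for x, y in rows if x[f] > threshold], overall)
    return f, threshold, left, right


def stump_predict(stump, x):
    f, threshold, left, right = stump
    return left if x[f] <= threshold else right


def forest_predict(forest, x):
    """Steps 3 and 4: collect one vote per tree and return the winner."""
    votes = Counter(stump_predict(s, x) for s in forest)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
# Toy records: label is 1 when the two feature values are jointly high.
data = [([a, b], 1 if a + b > 8 else 0) for a in range(10) for b in range(10)]
forest = [train_stump(bootstrap_sample(data, rng), rng) for _ in range(25)]
high_risk = forest_predict(forest, [9, 9])
low_risk = forest_predict(forest, [0, 0])
```

Because every stump sees a different bootstrap sample and a different feature, individual stumps can be wrong, but the majority vote is more stable than any single one.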

3.2 SYSTEM ARCHITECTURE:


The figure below shows the process flow of the proposed work. First,
we collected the Cleveland Heart Disease Database from the UCI website, then
preprocessed the dataset and selected 16 important features.

[Figure 1: System architecture. Heart disease database -> Data preprocessing ->
Feature selection -> Machine learning algorithms -> Outputs]

For feature selection we used the Recursive Feature Elimination algorithm with the
Chi2 method and obtained the top 16 features. After that, we applied the ANN and
Logistic Regression algorithms individually and computed their accuracy. Finally, we
used the proposed Ensemble Voting method to determine the best method for diagnosis
of heart disease.
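
The Chi2 scoring mentioned above can be illustrated with the textbook chi-square statistic for a 2x2 contingency table. This sketch covers only the scoring and top-k selection of binary features; it does not reproduce recursive elimination, the ANN or the voting step, and library implementations may compute the score differently. The function names and the tiny feature columns are made up for the example.

```python
def chi2_score(feature, target):
    """Chi-square statistic of a binary feature against a binary target."""
    n = len(feature)
    # Observed counts of the 2x2 contingency table.
    observed = {(f, t): 0 for f in (0, 1) for t in (0, 1)}
    for f, t in zip(feature, target):
        observed[(f, t)] += 1
    score = 0.0
    for f in (0, 1):
        for t in (0, 1):
            row_total = observed[(f, 0)] + observed[(f, 1)]
            col_total = observed[(0, t)] + observed[(1, t)]
            expected = row_total * col_total / n
            if expected > 0:
                score += (observed[(f, t)] - expected) ** 2 / expected
    return score


def select_top_k(features, target, k):
    """Rank feature names by chi-square score and keep the best k."""
    scores = {name: chi2_score(col, target) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up toy columns (assumption): 8 patients, 3 binary features.
target = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "exang": [1, 1, 1, 0, 0, 0, 0, 0],  # closely follows the target
    "fbs":   [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to the target
    "sex":   [1, 1, 0, 1, 1, 0, 0, 1],  # weakly related
}
top_features = select_top_k(features, target, 2)
```

A feature that is independent of the target has observed counts equal to the expected counts and therefore scores zero, which is why it is dropped first.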

3.3 MODULES:
The entire work of this project is divided into 4 modules.
They are:
✓ Data Pre-Processing
✓ Feature
✓ Classification
✓ Prediction

3.3.1 Data Pre-processing:
This file contains all the pre-processing functions needed to process the input
documents and texts. First, we read the train, test and validation data files and then
performed preprocessing such as tokenizing and stemming. Some exploratory data
analysis is also performed, such as checking the response variable distribution, along
with data quality checks for null or missing values. Data preprocessing is the process
of transforming raw data into an understandable format. It is an important step in
data mining because we cannot work with raw data. The quality of the data should be
checked before applying machine learning or data mining algorithms; preprocessing
is mainly a check of data quality.
Data quality can be checked against the following criteria:
• Accuracy: whether the data entered is correct.
• Completeness: whether all the required data has been recorded and is available.
• Consistency: whether the same data kept in different places matches.
• Timeliness: whether the data is kept up to date.
• Believability: whether the data can be trusted.
• Interpretability: whether the data can be understood.
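
Two of the quality criteria above, completeness and consistency, lend themselves to simple automated checks. The following sketch counts missing values per column and finds exact duplicate records; the function names and the three toy records are assumptions for illustration, not the project's preprocessing code.

```python
def missing_value_report(records, columns):
    """Completeness: count missing (None or empty-string) entries per column."""
    report = {col: 0 for col in columns}
    for row in records:
        for col in columns:
            value = row.get(col)
            if value is None or value == "":
                report[col] += 1
    return report


def duplicate_rows(records, columns):
    """Consistency: return the indexes of rows that repeat an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(records):
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            duplicates.append(index)
        seen.add(key)
    return duplicates


# Toy records (assumption): one missing cholesterol value, one repeated row.
columns = ["age", "chol", "target"]
records = [
    {"age": 54, "chol": 239, "target": 1},
    {"age": 61, "chol": None, "target": 0},
    {"age": 54, "chol": 239, "target": 1},
]
report = missing_value_report(records, columns)
repeated = duplicate_rows(records, columns)
```

Checks like these are typically run before any model is trained, so that missing or repeated records can be fixed or removed first.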
3.3.2 Feature Extraction:
In this file we have performed feature extraction and selection using methods
from the scikit-learn Python library. For feature selection, we have used methods such
as simple bag-of-words and n-grams, followed by term-frequency weighting such as
tf-idf. We have also implemented word2vec and POS tagging to extract features, though
POS tagging and word2vec have not been used at this point in the project.
• Bag of Words:
It is an algorithm that transforms text into fixed-length vectors by counting
the number of times each word is present in a document. The word occurrences
make it possible to compare different documents and evaluate their similarity for
applications such as search, document classification, and topic modeling. It is
called "Bag-of-Words" because it represents the sentence as a bag of terms: it does
not consider the order or structure of the words, only how often each word appears
in the document.
• N-grams:
N-grams are contiguous sequences of words, symbols or tokens in
a document. In technical terms, they can be defined as neighbouring sequences
of items in a document. They come into play when we deal with text data in
NLP (Natural Language Processing) tasks.
• TF-IDF Weighting:
TF-IDF stands for term frequency-inverse document frequency. It is
a measure, used in the fields of information retrieval (IR) and machine
learning, that quantifies the importance or relevance of string representations
(words, phrases, lemmas, etc.) in a document amongst a collection of documents
(also known as a corpus).
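
The three text representations described above can be sketched with the standard library alone. This is an illustrative example using the plain textbook definitions (term frequency as count divided by document length, inverse document frequency as log(N / document frequency)); library vectorizers usually apply smoothing and normalization, so their numbers will differ. The three short documents are made up.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def bag_of_words(tokens, vocabulary):
    """Fixed-length count vector: one position per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def ngrams(tokens, n):
    """All contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs_tokens):
    """Unsmoothed tf-idf weight of every word in every document."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(word for tokens in docs_tokens for word in set(tokens))
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Made-up example documents (assumption).
docs = ["chest pain on exertion", "no chest pain", "mild pain at rest"]
tokens = [tokenize(d) for d in docs]
vocabulary = sorted({word for t in tokens for word in t})
vectors = [bag_of_words(t, vocabulary) for t in tokens]
bigrams = ngrams(tokens[0], 2)
weights = tf_idf(tokens)
```

A word that occurs in every document, such as "pain" here, gets a tf-idf weight of zero, while a word unique to one document is weighted most heavily.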
3.3.3 Classification:
Here we have built all the classifiers for heart disease detection. The
extracted features are fed into different classifiers. We have used Naive Bayes,
Logistic Regression, Linear SVM, Stochastic Gradient Descent and Random Forest
classifiers from sklearn. Each of the extracted features was used in all the classifiers.
After fitting each model, we compared the F1 scores and checked the confusion matrices.
After fitting all the classifiers, the 2 best-performing models were selected as candidate
models for heart disease classification. We performed parameter tuning by running
GridSearchCV on these candidate models and chose the best-performing parameters
for these classifiers. The finally selected model was used for heart disease detection,
together with the associated probability. In addition, we extracted the top 50 features
from our tf-idf vectorizer to see which words are most important in each of the
classes. We also used precision-recall and learning curves to see how the training and
test sets perform as we increase the amount of data given to our classifiers.
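
The confusion matrix and F1 score used above to compare classifiers can be computed by hand as follows. This is a small sketch with made-up labels and predictions, shown only to make the definitions explicit; it is not the evaluation code of the project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and their harmonic mean (F1)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up true labels and predictions for 8 patients (assumption).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

Accuracy alone can hide the difference between missed cases (false negatives) and false alarms (false positives), which is why the confusion matrix is checked alongside it.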
3.3.4 Prediction:
Our finally selected and best-performing classifier was then saved on disk under the
name final_model.sav. Once you clone this repository, this model will be copied to the
user's machine and will be used by the prediction.py file to classify heart disease.
It takes the patient's details as input from the user; the model then produces the final
classification, which is shown to the user along with the associated probability.
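
Saving a model to disk and loading it again for prediction can be sketched with Python's pickle module. Only the file name final_model.sav comes from the text above; the stand-in model (a dictionary of hypothetical logistic coefficients), the temporary directory and the helper function are assumptions, since the contents of prediction.py are not shown in this report.

```python
import math
import os
import pickle
import tempfile

# Stand-in for the trained classifier (assumption): plain parameters of a
# hypothetical logistic model. A real project would pickle the fitted model.
model = {"weights": [0.04, 0.9], "bias": -3.0}

# Save the model under the file name mentioned in the text.
model_path = os.path.join(tempfile.mkdtemp(), "final_model.sav")
with open(model_path, "wb") as handle:
    pickle.dump(model, handle)

# Later, e.g. in a prediction script, load it back from disk.
with open(model_path, "rb") as handle:
    loaded_model = pickle.load(handle)


def predict_with_probability(params, features):
    """Return (predicted class, probability of class 1) for one record."""
    z = params["bias"] + sum(w * x for w, x in zip(params["weights"], features))
    probability = 1.0 / (1.0 + math.exp(-z))
    return (1 if probability >= 0.5 else 0), probability


label, probability = predict_with_probability(loaded_model, [60, 1])
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.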

3.4 ADVANTAGES
✓ Increased accuracy for effective heart disease diagnosis.
✓ Handles enormous amounts of data using the Random Forest algorithm
and feature selection.
✓ Reduces the time doctors need for diagnosis.
✓ Cost-effective for patients.

CHAPTER 4
RESULT AND DISCUSSION

In this project, we introduce a heart disease prediction system that uses
different classification techniques. The techniques are Random Forest and
Logistic Regression; our analysis shows that Random Forest has better accuracy
than Logistic Regression. Our purpose is to improve the performance of Random
Forest by removing unnecessary and irrelevant attributes from the dataset and
keeping only those that are most informative for the classification task.

4.1 FIGURE

[Figure] The target value of the dataset for male and female patients in bar graph
format, with the percentage of patients with and without a heart problem.

[Figure] Bar graphs of the fbs and restecg features.

[Figure] Bar graphs of the slope and ca features from the dataset.

[Figure] Accuracy comparison showing that the Random Forest algorithm is more
accurate than Logistic Regression on the dataset, together with the accuracy
percentage of the Random Forest algorithm.

4.2 SCREENSHOT

[Screenshot] Importing the modules and libraries and reading the dataset from
the system.

[Screenshot] Printing the dataset and its values.

[Screenshot] Plotting the target value as a graph for both male and female patients.

[Screenshot] Printing the accuracy percentage of Logistic Regression and the Random
Forest algorithm.

[Screenshot] Plotting the accuracy scores of the algorithms as a graph.

[Screenshot] Printing the accuracy scores of Logistic Regression and Random Forest.
Based on the accuracy scores, we conclude that the Random Forest algorithm is
more accurate than Logistic Regression.

CHAPTER 5
CONCLUSION AND FUTURE WORK

The objective of this project is to predict heart disease using machine learning.
A machine learning algorithm is proposed for the implementation of a heart disease
prediction system, which was validated on two open access heart disease prediction
datasets.
5.1 CONCLUSION
In this project, we introduce a heart disease prediction system that uses
different classification techniques. The techniques are Random Forest and Logistic
Regression; our analysis shows that Random Forest has better accuracy than
Logistic Regression. Our purpose is to improve the performance of Random Forest
by removing unnecessary and irrelevant attributes from the dataset and keeping
only those that are most informative for the classification task.
5.2 FUTURE WORK
Future advancements in using machine learning for heart disease prediction
involve integrating diverse datasets, such as genetic information, lifestyle data, and
medical records, to enhance the accuracy of predictions. Researchers are
emphasizing the development of models that offer clear and interpretable results,
particularly in the context of complex algorithms like deep learning. Long-term
predictions, considering temporal changes in risk factors, are gaining attention to
provide a more comprehensive understanding of an individual's cardiovascular
health over time. Improving the robustness and generalizability of models across
various populations is crucial, along with exploring personalized risk assessment
models that consider individual responses to treatments. Real-time monitoring,
incorporating data from wearable devices and telehealth applications, is being
explored for continuous and remote assessment. Ethical considerations, including
privacy and bias mitigation, are at the forefront, requiring careful attention to
fairness and ethical deployment. Collaboration between machine learning
researchers and healthcare professionals is essential to align predictive models with
clinical needs and ensure seamless integration into healthcare workflows. Thorough
validation on diverse datasets and populations, along with strategies for effective
deployment and continuous learning, are key aspects of future research, contributing
to more accurate, personalized, and ethical heart disease prediction and intervention
strategies.

CHAPTER 6
REFERENCE

6.1 REFERENCE
[1] P.K. Anooj, "Clinical decision support system: Risk level prediction of
heart disease using weighted fuzzy rules", Journal of King Saud University –
Computer and Information Sciences, 24 (2012), pp. 27–40.

[2] Nidhi Bhatla, Kiran Jyoti, "An Analysis of Heart Disease Prediction using
Different Data Mining Techniques", International Journal of Engineering Research &
Technology.

[3] Jyoti Soni, Ujma Ansari, Dipesh Sharma, Sunita Soni, "Predictive Data Mining
for Medical Diagnosis: An Overview of Heart Disease Prediction".

[4] Chaitrali S. Dangare, Sulabha S. Apte, "Improved Study of Heart Disease
Prediction System using Data Mining Classification Techniques", International
Journal of Computer Applications (0975 – 888).

[5] Dane Bertram, Amy Voida, Saul Greenberg, Robert Walker, "Communication,
Collaboration, and Bugs: The Social Nature of Issue Tracking in Small, Collocated
Teams".

[6] M. Anbarasi, E. Anupriya, N.Ch.S.N. Iyengar, "Enhanced Prediction of Heart
Disease with Feature Subset Selection using Genetic Algorithm", International
Journal of Engineering Science and Technology, Vol. 2(10), 2010.

[7] Ankita Dewan, Meghna Sharma, "Prediction of Heart Disease Using a Hybrid
Technique in Data Mining Classification", 2nd International Conference on
Computing for Sustainable Global Development, IEEE, 2015, pp. 704-706.

[8] R. Alizadehsani, J. Habibi, B. Bahadorian, H. Mashayekhi, A. Ghandeharioun,
R. Boghrati, et al., "Diagnosis of coronary arteries stenosis using data mining", J
Med Signals Sens, vol. 2, pp. 153-9, Jul 2012.

[9] M. Akhil Jabbar, B.L. Deekshatulu, Priti Chandra, "Heart disease classification
using nearest neighbor classifier with feature subset selection", Anale. Seria
Informatica, 11, 2013.

[10] Shadab Adam Pattekari and Asma Parveen, "Prediction System for
Heart Disease Using Naive Bayes", International Journal of Advanced
Computer and Mathematical Sciences, ISSN 2230-9624, Vol 3, Issue 3, 2012,
pp. 290-294.

[11] C. Kalaiselvi, "Diagnosis of Heart Disease Using K-Nearest Neighbor
Algorithm of Data Mining", IEEE, 2016.

[12] Keerthana T. K., "Heart Disease Prediction System using Data Mining
Method", International Journal of Engineering Trends and Technology, May 2017.

[13] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques,
Elsevier. Animesh Hazra, Arkomita Mukherjee, Amit Gupta, Prediction Using
Machine Learning and Data Mining, July 2017, pp. 2137-2159.
6.2 APPENDIX
STEPS FOR IMPLEMENTATION:
1. Install the required packages for building the 'Passive Aggressive Classifiers'.
2. Load the libraries into the workspace from the packages.
3. Read the input data set.
4. Normalize the given input dataset.
5. Divide this normalized data into two parts:
A. Train data.
B. Test data.
(Note: 80% of the normalized data is used as train data and 20% of the
normalized data is used as test data.)
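
Steps 4 and 5 above can be sketched as follows. The report does not say which normalization is used, so min-max scaling to the range 0 to 1 is an assumption here, as are the function names, the fixed random seed and the ten toy rows.

```python
import random


def min_max_normalize(rows):
    """Step 4: rescale every column to the range [0, 1] (assumed method)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Step 5: shuffle the rows and hold out test_fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy rows (assumption): [age, resting blood pressure] for 10 patients.
rows = [[29 + i, 120 + 2 * i] for i in range(10)]
normalized = min_max_normalize(rows)
train, test = train_test_split(normalized, test_fraction=0.2)
```

With ten rows and a test fraction of 0.2, eight rows go to the train data and two to the test data, matching the 80/20 split in the note.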
SOURCE CODE
#Importing essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
print(os.listdir())
import warnings
warnings.filterwarnings('ignore')
#Importing and understanding our dataset
dataset = pd.read_csv(―heart.csv")
#Verifying it as a 'dataframe' object in pandas
type(dataset)
#Shape of dataset
dataset.shape

#Printing out a few rows
dataset.head(5)
dataset.sample(5)
#Describing the dataset
dataset.describe()
#Printing dataset information
dataset.info()
#Let's understand our columns better
info = [
    "age",
    "1: male, 0: female",
    "chest pain type, 1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic",
    "resting blood pressure",
    "serum cholesterol in mg/dl",
    "fasting blood sugar > 120 mg/dl",
    "resting electrocardiographic results (values 0,1,2)",
    "maximum heart rate achieved",
    "exercise induced angina",
    "oldpeak = ST depression induced by exercise relative to rest",
    "the slope of the peak exercise ST segment",
    "number of major vessels (0-3) colored by fluoroscopy",
    "thal: 3 = normal; 6 = fixed defect; 7 = reversible defect",
]
for i in range(len(info)):
    print(dataset.columns[i]+":\t\t\t"+info[i])
#Analysing the 'target' variable
dataset["target"].describe()
dataset["target"].unique()
#Checking correlation between columns
print(dataset.corr()["target"].abs().sort_values(ascending=False))
#First, analysing the target variable:
y = dataset["target"]
sns.countplot(x=y)
target_temp = dataset.target.value_counts()

print(target_temp)
#Printing the percentage of patients with and without heart problems
print("Percentage of patients without heart problems: "+str(round(target_temp[0]*100/len(dataset),2)))
print("Percentage of patients with heart problems: "+str(round(target_temp[1]*100/len(dataset),2)))
#We'll analyse 'sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca' and 'thal' features
#Analysing the 'sex' feature
dataset["sex"].unique()
sns.barplot(x=dataset["sex"],y=y)
#Analysing the 'cp' (chest pain type) feature
dataset["cp"].unique()
sns.barplot(x=dataset["cp"],y=y)
#Analysing the 'fbs' feature
dataset["fbs"].describe()
dataset["fbs"].unique()
sns.barplot(x=dataset["fbs"],y=y)
#Analysing the 'restecg' feature
dataset["restecg"].unique()
sns.barplot(x=dataset["restecg"],y=y)
#Analysing the 'exang' feature
dataset["exang"].unique()
sns.barplot(x=dataset["exang"],y=y)
#Analysing the 'slope' feature
dataset["slope"].unique()
sns.barplot(x=dataset["slope"],y=y)

#Analysing the 'ca' feature
dataset["ca"].unique()
sns.countplot(x=dataset["ca"])
sns.barplot(x=dataset["ca"],y=y)
#Analysing the 'thal' feature
dataset["thal"].unique()
sns.barplot(x=dataset["thal"],y=y)
#histplot replaces the deprecated distplot
sns.histplot(dataset["thal"],kde=True)
#Train Test split
from sklearn.model_selection import train_test_split
predictors = dataset.drop("target",axis=1)
target = dataset["target"]
X_train,X_test,Y_train,Y_test = train_test_split(predictors,target,test_size=0.20,random_state=0)
X_train.shape
X_test.shape
Y_train.shape
Y_test.shape
#Model Fitting
from sklearn.metrics import accuracy_score
#Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,Y_train)
Y_pred_lr = lr.predict(X_test)
Y_pred_lr.shape
#Printing the accuracy score for Logistic Regression Algorithm

score_lr = round(accuracy_score(Y_test,Y_pred_lr)*100,2)
print("The accuracy score achieved using Logistic Regression is: "+str(score_lr)+" %")
#Random Forest
from sklearn.ensemble import RandomForestClassifier
#Trying 2000 random seeds and keeping the one with the highest test accuracy
#(choosing the seed on the test set makes the reported score optimistic)
max_accuracy = 0
best_x = 0
for x in range(2000):
    rf = RandomForestClassifier(random_state=x)
    rf.fit(X_train,Y_train)
    Y_pred_rf = rf.predict(X_test)
    current_accuracy = round(accuracy_score(Y_test,Y_pred_rf)*100,2)
    if current_accuracy > max_accuracy:
        max_accuracy = current_accuracy
        best_x = x
#Refitting the model with the best seed found
rf = RandomForestClassifier(random_state=best_x)
rf.fit(X_train,Y_train)
Y_pred_rf = rf.predict(X_test)
Y_pred_rf.shape
#Printing the accuracy score for Random Forest Algorithm
score_rf = round(accuracy_score(Y_test,Y_pred_rf)*100,2)
print("The accuracy score achieved using Random Forest is: "+str(score_rf)+" %")
#Collecting the accuracy of both algorithms
scores = [score_lr,score_rf]
algorithms = ["Logistic Regression","Random Forest"]
for i in range(len(algorithms)):
    print("The accuracy score achieved using "+algorithms[i]+" is: "+str(scores[i])+" %")
#Showing in bar graph format
sns.set(rc={'figure.figsize':(15,8)})
plt.xlabel("Algorithms")
plt.ylabel("Accuracy score")
sns.barplot(x=algorithms,y=scores)
