Heart Disease Prediction Using Machine Learning in R

Final-year project, uploaded by Raja Lakshmi

ABSTRACT

Machine learning in R is used across the world, and the healthcare industry is no
exception. Machine learning can play an essential role in predicting the
presence or absence of locomotor disorders, heart disease and more. Such
information, if predicted well in advance, can provide important insights to
doctors, who can then adapt their diagnosis and treatment on a per-patient
basis. This project works on predicting possible heart disease in people using
machine learning algorithms. We perform a comparative analysis of classifiers
such as decision tree, Naïve Bayes, logistic regression, SVM and random forest,
and propose an ensemble classifier that performs hybrid classification by
combining strong and weak classifiers, since it can use multiple samples for
training and validating the data. We then compare the existing classifiers with
the proposed boosting classifiers, AdaBoost and XGBoost, which can give accurate
results and aid predictive analysis.

Keywords: machine learning in R, SVM, Random Forest, Linear Discriminant
Analysis, Quadratic Discriminant Analysis, k-nearest neighbors, Gradient
Boosting Machine

OBJECTIVES

The main objectives of developing this project are:

 To develop a machine learning model in R to predict the future possibility
of heart disease by implementing logistic regression.
 To determine significant risk factors, based on the medical dataset, which may
lead to heart disease.
 To analyze feature selection methods and understand their working principles.

LIST OF ABBREVIATIONS:

 LDA: Linear Discriminant Analysis
 QDA: Quadratic Discriminant Analysis
 K-NN: k-Nearest Neighbors
 SVM: Support Vector Machine
 RF: Random Forest
 GBM: Gradient Boosting Machine
 EDA: Exploratory Data Analysis
 ECG: Electrocardiogram
 AMI: Acute Myocardial Infarction

SYSTEM SPECIFICATIONS
HARDWARE REQUIREMENTS

The selection of the hardware configuration is an important task in software
development: insufficient random access memory may adversely affect the speed
and efficiency of the entire system. The processor should be powerful enough to
handle all the operations, and the hard disk should have sufficient capacity to
store the files and the application.

System : Intel Pentium processor.

RAM : 2 GB and above.

Hard disk : 250 GB Hard Disk and above

SOFTWARE REQUIREMENTS

A major element in building a system is the selection of compatible software,
since software in the market is growing in geometric progression. The selected
software should be acceptable to the firm and the end user, as well as feasible
for the system. This document gives a detailed description of the software
requirement specification. The study of the requirement specification focuses
on the functioning of the system. It allows the developer or analyst to
understand the system, the functions to be carried out, the performance level
to be attained and the corresponding interfaces to be established.

Operating system : Windows 11

Environment (IDE) : RSTUDIO

Back end : R version 4.2.1

1 INTRODUCTION

According to the World Health Organization, 12 million deaths occur worldwide
every year due to heart disease. The burden of cardiovascular disease has been
rapidly increasing all over the world over the past few years. Much research
has been conducted in an attempt to pinpoint the most influential factors of
heart disease as well as to accurately predict the overall risk. Heart disease
is even described as a silent killer, leading to death without obvious
symptoms.

The early diagnosis of heart disease plays a vital role in making decisions on
lifestyle changes in high-risk patients and, in turn, reduces complications.
This project aims to predict future heart disease by analyzing patient data,
classifying whether patients have heart disease or not using machine-learning
algorithms.

1.1 PROBLEM DEFINITION

The major challenge with heart disease is its detection. Instruments are
available that can predict heart disease, but they are either expensive or not
efficient at calculating the chance of heart disease in humans. Early detection
of cardiac disease can decrease the mortality rate and overall complications.
However, it is not possible to monitor patients accurately every day in all
cases, and 24-hour consultation with a doctor is not available, since it
requires more attention, time and expertise. Since we have a good amount of
data in today's world, we can use various machine learning algorithms in R to
analyze the data for hidden patterns, and those hidden patterns can be used for
health diagnosis in medical data.

1.2 MOTIVATION FOR THE WORK

Machine learning techniques in R have been around for a while and have been
compared and used for analysis in many kinds of data science applications. The
major motivation behind this research-based project was to explore the feature
selection methods, and the data preparation and processing behind training
models, in machine learning with R. Even with off-the-shelf models and
libraries, a challenge we face today is that, despite the abundance of data,
the accuracy we see during training, testing and actual validation has a high
variance. Hence this project is carried out with the motivation to look behind
the models and to implement a logistic regression model to train on the
obtained data. Furthermore, since the whole effort is motivated by developing
an appropriate computer-based decision-support system that can aid the early
detection of heart disease, in this project we have developed a model which
classifies whether a patient will have heart disease within ten years, based on
various features (i.e., potential risk factors that can cause heart disease),
using logistic regression. The early prognosis of cardiovascular disease can
aid in making decisions on lifestyle changes in high-risk patients and in turn
reduce complications, which can be a great milestone in the field of medicine.

With growing development in the field of medical science alongside machine
learning, various experiments and studies have been carried out in recent
years, releasing relevant and significant papers. One paper proposes heart
disease prediction using KStar, J48, SMO, Bayes Net and multilayer perceptron
in the WEKA software. Based on performance across different factors, SMO (89%
accuracy) and Bayes Net (87% accuracy) achieved better performance than the
KStar, multilayer perceptron and J48 techniques using k-fold cross-validation.
The accuracy achieved by those algorithms is still not satisfactory, so the
accuracy needs to be improved further to give better decisions for disease
diagnosis.

In research conducted using the Cleveland heart disease dataset, which contains
303 instances, considering 13 attributes with 10-fold cross-validation and
implementing 4 different algorithms, the authors concluded that Gaussian Naïve
Bayes and Random Forest gave the maximum accuracy of 91.2 percent.

Using the similar dataset from Framingham, Massachusetts, experiments were
carried out with 4 models, trained and tested with maximum accuracies of:
K-Neighbors Classifier 87%, Support Vector Classifier 83%, Decision Tree
Classifier 79% and Random Forest Classifier 84%.

2 PROJECT DESCRIPTION
Heart disease is perceived as the deadliest disease in human life across the
world. In this type of disease, the heart is not capable of pushing the
required quantity of blood to the remaining organs of the human body in order
to accomplish their regular functions. Some symptoms of heart disease include
physical weakness, improper breathing, swollen feet, etc. Effective diagnostic
techniques are essential to identify the complicated heart diseases that pose
high risk and in turn affect human life. Presently, the diagnosis and treatment
process is highly challenging due to the inadequacy of physicians and
diagnostic apparatus, which affects the treatment of heart patients.

Early diagnosis of heart disease is significant to minimize heart-related
issues and to protect against serious risks. Invasive techniques are used to
diagnose heart disease based on medical history, symptom analysis reports by
experts, and physical laboratory reports. However, this causes delay and
imprecise diagnosis due to human intervention, and it is time-consuming,
computationally intensive and expensive at the time of assessment.

Heart disease can be predicted based on various attributes such as age, gender,
pulse rate, etc. Data analysis in healthcare assists in predicting diseases,
improving diagnosis, analyzing symptoms, providing appropriate medicines,
improving the quality of care, minimizing costs, extending the life span and
reducing the death rate of heart patients. An ECG helps in screening for
irregular heartbeat and stroke: with embedded sensors resting on the chest, it
tracks the patient's heartbeat.

8
Heart disease prediction is carried out with detailed clinical data that can
assist experts in making decisions. Human life is highly dependent on the
proper functioning of the blood vessels of the heart. Improper blood
circulation causes heart inactivity, kidney failure, imbalanced brain
conditions, and even immediate death. Some of the risk factors that can cause
heart disease are obesity, smoking, diabetes, blood pressure, cholesterol, lack
of physical activity and an unhealthy diet.

AMI is the cardiovascular disease that happens due to interruption of the blood
flow or circulation in the heart muscle, causing the heart muscle to become
necrotic (damaged or dead). The primary reason for this disease is a blockage,
meaning that the blood flow to the heart muscle becomes obstructed or reduced.
If the blood flow is reduced or obstructed, the functioning of the red blood
cells, which carry the oxygen that sustains consciousness and human life, is
severely impacted. Without an oxygen supply for 6 to 8 minutes, the heart
muscle may arrest, which in turn results in the patient's death.

The significant cause of cardiovascular disease is 'plaque', a hard substance
formed in the coronary arteries that is made up of cholesterol (fat) and causes
the blood flow to be reduced or obstructed. The formation of plaque in the
arteries is known as atherosclerosis, and chronic inflammation has been
determined to be a cause of it.

The increase in the amount of white blood cells causes inflammation and
subsequent disorders such as stroke or reinfarction. Generally, there are two
stages of wound healing in terms of monocytes and macrophages, namely the
inflammatory and reparative stages. Both stages are necessary for proper wound
healing, and if the inflammation continues too long, it leads to heart failure.
An unusual type of heart disease is acute spasm, or contraction, of the
coronary arteries. The spasms appear in the arteries suddenly, with no symptom
of atherosclerosis. They block the blood flow, causing oxygen deprivation in
the heart. Males are more likely to experience a heart attack than females.
Moreover, women can experience pain for more than an hour, whereas the duration
of pain in men is normally less than an hour. Cardiovascular disease has an
impact on the complete physiological system, not only the heart; changes occur
everywhere, even in remote organs such as the bone marrow and spleen.

3 SOFTWARE DESCRIPTION
R is a programming language and free software developed by Ross Ihaka and
Robert Gentleman in 1993. R possesses an in-depth catalog of statistical and
graphical methods. It includes machine learning algorithms, simple and multiple
linear regression, and applied statistics. Most R libraries are written in R
itself, but for heavy computational tasks, C, C++ and Fortran code is often
preferred.

R is not only trusted by academics; many large firms and MNCs also use the R
programming language, including Uber, Google, Airbnb, Facebook and so on. Data
analysis with R is done in a series of steps: programming, transforming,
discovering, modeling, and communicating the results.

3.1 EVOLUTION OF R

R was initially written by Ross Ihaka and Robert Gentleman at the


Department of Statistics of the University of Auckland in Auckland, New Zealand.
R made its first appearance in 1993.

A large group of individuals has contributed to R by sending code and bug reports.

Since mid-1997 there has been a core group (the "R Core Team") who can modify
the R source code archive.

3.2 R VERSION

R version 4.2.1 (2022-06-23 ucrt) -- "Funny-Looking Kid"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to
redistribute it under certain conditions. Type 'license()' or 'licence()' for
distribution details.

Natural language support but running in an English locale.

R is a collaborative project with many contributors. Type 'contributors()' for
more information and 'citation()' on how to cite R or R packages in
publications.

Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for
an HTML browser interface to help. Type 'q()' to quit R.

3.3 FEATURES OF R

As stated earlier, R is a programming language and software environment for


statistical analysis, graphics representation and reporting. The following are the
important features of R −

 R is a well-developed, simple and effective programming language which


includes conditionals, loops, user defined recursive functions and input and
output facilities.
 R has an effective data handling and storage facility.
 R provides a suite of operators for calculations on arrays, lists, vectors and
matrices.
 R provides a large, coherent and integrated collection of tools for data analysis.
 R provides graphical facilities for data analysis and display, either on-screen
or in print.

3.4 USAGE OF R PROGRAMMING

R language is in much demand in real-world applications because of the following


reasons:

IMPORTANT FOR DATA SCIENCE: As R is an interpreted language, we can run code
without any compiler, which is convenient in data science. R is a vector
language and hence powerful and faster than many other languages. R is used in
biology and genetics as well as in statistics, and can perform many types of
tasks.

OPEN-SOURCE: R is an open-source language, maintained by a large community of
programmers across the world. Since R is issued under the GNU General Public
License (GPL), there is no restriction on its usage.

POPULARITY: R has become one of the most popular programming languages in the
technological world. R was long given importance mainly in the academic world,
but with the emergence of data science, the demand for R in industry has
increased.

ROBUST VISUALIZATION LIBRARY: R consists of libraries like ggplot2 and plotly
that provide graphical plots to the user. R is widely recognized for its
amazing visualizations, which are very important in data science.

USED TO DEVELOP WEB APPS: R provides the ability to build web applications.
Using R packages such as Shiny, we can develop interactive applications from
the console of your R IDE.

PLATFORM INDEPENDENT: R is a platform-independent language. It can work on any
system, whether Windows, Linux or Mac.

USED IN MACHINE LEARNING: A most important advantage of R is that it helps to
carry out machine learning operations like classification and regression, and
also provides features for artificial intelligence and neural networks.

3.5 RSTUDIO APPLICATION FOR WINDOWS

RStudio is a free, open-source integrated development environment (IDE) for the
R programming language. It is designed for use by data scientists,
statisticians, data miners, business intelligence developers and data science
project teams. It is a powerful and popular IDE for R, and a cross-platform
product available for Windows, macOS and Ubuntu. RStudio has a simple and
intuitive interface that is easy to learn, which makes it a good place for
beginners to start. The interface offers a lot of functionality, is a great
tool for analyzing and visualizing data, and provides plenty of support in the
form of a built-in help system.

INTERFACE

RStudio is a well-designed, intuitive, user-friendly application. The design is
minimalistic and clean, and the user can easily navigate through the menus. The
application provides a wide range of features that are mainly categorized into
data science, visualization and administration. The interface is based on a
"scratchpad" principle, where users can create their own projects or start with
one of the many available templates. The application also provides code
completion, which makes it easier to write code.

USABILITY

RStudio has an intuitive, user-friendly interface with a clean, minimalistic
design, and it is simple to use. Many features make the application usable,
such as code completion and the built-in help system. It is designed for data
analysts and statisticians and provides a lot of features and functionality for
them. RStudio is easy to install, has a quick-start guide that is easy to
follow, and includes a built-in help guide with short tutorials to get you
started.

FUNCTIONALITY

A wide range of features is available in RStudio, divided into data science,
visualization and administration, which are the main areas of the software. The
functionality of the application is excellent: it has a wide range of features
for analyzing and visualizing data, and it can be used for different purposes,
such as data science, web development and other fields. You can use the console
to run scripts and use interactive code to explore data sets. The editor and
help system are very helpful and informative, the plots and charts you create
are easy to customize and look beautiful, and you can use RStudio to manage
your packages and collaborate with other users.

RStudio is a very powerful IDE for R, with many features that make it more
comfortable to use. You can use it to open a project, which is a folder with a
collection of related documents that make up a complete work session, or to
open a file, which is a single document containing a collection of related data
or text.

SUPPORT

A wide range of documentation and tutorials is available on the internet.
RStudio has a forum where users can ask questions and get answers, and online
training is available. There are many tutorials and how-to guides on the
RStudio website and on YouTube, you can contact customer service with any
problem you might have, and there is a very active community of developers and
users.

4 PACKAGES IN R PROGRAMMING
The package is an appropriate way to organize the work and share it with
others. Typically, a package will include code (not only R code!), documentation
for the package and the functions inside, some tests to check everything works as it
should, and data sets.

4.1 PACKAGES IN R

Packages in the R programming language are a set of R functions, compiled code,
and sample data. They are stored under a directory called "library" within the
R environment. By default, R installs a group of packages during installation.
Once we start the R console, only the default packages are available. Other
packages that are already installed need to be loaded explicitly to be utilized
by the R program that is going to use them.
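As a minimal sketch of that install-once, load-per-session workflow (using the bundled tools package purely as an illustration; any CRAN package works the same way):

```r
# A package is installed once, e.g. install.packages("readr"), and then
# loaded in every session that uses it. 'tools' ships with R but is not
# attached by default, so it still needs library():
library(tools)

# After library(), the package's functions are available by name:
file_ext("Cleveland_hd.csv")   # extracts the file extension
```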

4.2 PACKAGES USED

library(readr):

The goal of readr is to provide a fast and friendly way to read rectangular
data from delimited files, such as comma-separated values (CSV) and tab-
separated values (TSV). It is designed to parse many types of data found in the
wild, while providing an informative problem report when parsing leads to

unexpected results. If you are new to readr, the best place to start is the data import
chapter in R for Data Science.

library(tidyverse):

These Tidyverse packages were specially designed for Data Science with a
common design philosophy. They include all the packages required in the data
science workflow, ranging from data exploration to data visualization.

Tidyverse Packages in R following:

1. Data Visualization and Exploration: ggplot2

2. Data Wrangling and Transformation: dplyr, tidyr, stringr, forcats

3. Data Import and Management: tibble, readr

4. Functional Programming: purrr

library(broom):

broom summarizes key information about models in tidy tibble()s. broom


provides three verbs to make it convenient to interact with model objects:

1. tidy () summarizes information about model components

2. glance () reports information about the entire model

3. augment () adds information about observations to a dataset

tidy() produces a tibble where each row contains information about an important
component of the model. For regression models, this often corresponds to
regression coefficients. This can be useful if you want to inspect a model or
create custom visualizations.
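A minimal sketch of the three verbs, using a linear model on the built-in mtcars data (assumes the broom package is installed):

```r
library(broom)

# Fit a simple linear model on a built-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

tidy(fit)           # one row per model component (here, per coefficient)
glance(fit)         # a one-row summary of the whole model (R^2, AIC, ...)
head(augment(fit))  # the original observations plus fitted values and residuals
```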

library(Metrics):

The Metrics package implements evaluation metrics that are commonly used in
supervised machine learning, covering regression, time series, binary
classification, classification and information retrieval problems, with a
consistent interface (for example, functions for RMSE, accuracy and AUC).

library(dslabs)

dslabs: Data Science Labs

Datasets and functions that can be used for data analysis practice, homework
and projects in data science courses and workshops. 26 datasets are available
for case studies in data visualization, statistical inference, modeling, linear
regression, data wrangling and machine learning.

library(dplyr)

dplyr is an R package that provides a grammar of data manipulation: a
consistent set of verbs that help analysts solve the most common data
manipulation challenges. To use it, install it first with
install.packages('dplyr') and load it with library(dplyr).
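For illustration, a small hypothetical patient table manipulated with the core verbs (the column names and values here are made up, not the Cleveland data):

```r
library(dplyr)

# Hypothetical patient records
patients <- data.frame(age  = c(63, 41, 57, 52),
                       chol = c(233, 180, 260, 212))

# filter() keeps rows, mutate() adds columns, summarise() aggregates
high_chol <- patients %>%
  filter(chol > 200) %>%
  mutate(age_decade = age %/% 10 * 10)

summarise(patients, mean_age = mean(age))
```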

library(caret)
The caret package is a comprehensive framework for building machine learning
models in R. It provides a consistent interface to nearly all the core steps of
building predictive models. Be it a decision tree or xgboost, caret helps to
find the optimal model in the shortest possible time.

library(lubridate)

Lubridate makes it easier to do the things R does with date-times and


possible to do the things R does not. If you are new to lubridate, the best place to
start is the date and times chapter in R for data science.

library(tidytext)

Using tidy data principles can make many text mining tasks easier, more
effective, and consistent with tools already in wide use. Much of the infrastructure
needed for text mining with tidy data frames already exists in packages like dplyr,
broom, tidyr, and ggplot2. In this package, we provide functions and supporting
data sets to allow conversion of text to and from tidy formats, and to switch
seamlessly between tidy tools and existing text mining packages.

library("RColorBrewer")

RColorBrewer is an R package that offers a variety of color palettes to use
when making different types of plots. Colors impact the way we visualize data:
if we have to make data stand out, or we want a color-blind person to be able
to read the plot as well as anyone else, we have to use the right color
palette.

library(randomForest)

The R package "randomForest" is used to create random forests. Install it with
install.packages("randomForest") from the R console, along with any dependent
packages. The package provides the function randomForest(), which is used to
create and analyze random forests.

library(tictoc)

tictoc is an R package with extended timing functions tic() and toc(), as well
as stack and list structures, which make it easy to time how long sections of
code take to run. It has a permissive license and can be installed from CRAN or
downloaded from GitHub.

library(e1071)

e1071 is a package for R that provides functions for statistical and
probabilistic algorithms such as a fuzzy classifier, the naive Bayes
classifier, bagged clustering, the short-time Fourier transform, support vector
machines, etc. When it comes to SVMs, it is one of many packages available in R
that implement them.

library(ggpubr)

ggpubr provides 'ggplot2'-based publication-ready plots. The 'ggplot2' package
is excellent and flexible for elegant data visualization in R; however, the
default plots require some formatting before they can be sent for publication.

5 DATA VISUALIZATION
The outcome variable class has more than two levels. According to the codebook,
any non-zero value can be coded as an "event." We create a new variable in
"Cleveland_hd" to represent a binary 1/0 outcome. There are a few other
categorical/discrete variables in the dataset; we also convert sex into a
'factor' for the next step of the analysis, since otherwise R will treat it as
continuous by default.
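A minimal sketch of this recoding step on a few simulated rows (the real analysis applies the same idea to the Cleveland_hd data frame):

```r
# Simulated stand-in with the same 'class' and 'sex' coding as the codebook
hd <- data.frame(class = c(0, 2, 1, 0, 3),
                 sex   = c(1, 0, 1, 1, 0))

# Any non-zero class value is coded as an event: 1 = disease, 0 = no disease
hd$hd_event <- ifelse(hd$class > 0, 1, 0)

# Convert sex to a factor so R treats it as categorical, not continuous
hd$sex <- factor(hd$sex, levels = c(0, 1), labels = c("Female", "Male"))

table(hd$hd_event, hd$sex)
```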

NAME      TYPE        DESCRIPTION

Age       Continuous  Age in years
Sex       Discrete    0=Female, 1=Male
Cp        Discrete    Chest pain type: 1=typical angina, 2=atypical angina,
                      3=non-anginal pain, 4=asymptomatic
Trestbps  Continuous  Resting blood pressure (in mm Hg)
Chol      Continuous  Serum cholesterol in mg/dl
Fbs       Discrete    Fasting blood sugar > 120 mg/dl: 1=True, 0=False
Exang     Discrete    Exercise-induced angina: 1=Yes, 0=No
Thalach   Continuous  Max heart rate achieved
Oldpeak   Continuous  ST depression induced by exercise relative to rest
Slope     Discrete    The slope of the peak exercise ST segment: 1=up sloping,
                      2=flat, 3=down sloping
Ca        Continuous  Number of major vessels colored by fluoroscopy, ranging
                      between 0 and 3
Thal      Discrete    3=normal, 6=fixed defect, 7=reversible defect
Class     Discrete    Diagnosis classes: 0=no presence, 1=least likely to have
                      heart disease, 2 and 3=increasingly likely, 4=most likely
                      to have heart disease

5.1 CLINICAL VARIABLES

We use statistical tests to see which predictors are related to heart disease,
exploring the association for each variable in the dataset. Depending on the
type of the data (i.e., continuous or categorical), we use a t-test or a
chi-squared test to calculate the p-values.

A t-test is used to determine whether there is a significant difference between
the means of two groups (e.g., is the mean age in group A different from the
mean age in group B?). A chi-squared test for independence compares the
equivalence of two proportions.
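The two tests can be sketched on simulated data (the real analysis would pass the dataset's own columns):

```r
set.seed(1)
# Simulated data: age (continuous) and sex (categorical) vs. a 1/0 outcome
disease <- rep(c(1, 0), each = 50)
age     <- c(rnorm(50, mean = 58, sd = 8), rnorm(50, mean = 52, sd = 8))
sex     <- sample(c("Male", "Female"), 100, replace = TRUE)

# t-test: do the mean ages of the two outcome groups differ?
t_res <- t.test(age ~ disease)

# chi-squared test: is sex independent of the disease outcome?
chi_res <- chisq.test(table(sex, disease))

c(t_p = t_res$p.value, chi_p = chi_res$p.value)
```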

5.2 PUTTING ALL THREE VARIABLES IN ONE MODEL:

The plots and the statistical tests both confirmed that all the three variables
are highly significantly associated with our outcome (p<0.001 for all tests).

In general, we want to use multiple logistic regression when we have one


binary outcome variable and two or more predicting variables. The binary variable
is the dependent (Y) variable; we are studying the effect that the independent (X)
variables have on the probability of obtaining a particular value of the dependent
variable. For example, we might want to know the effect that maximum heart rate,
age, and sex have on the probability that a person will have a heart disease in the
next year. The model will also tell us what the remaining effect of maximum heart
rate is after we control or adjust for the effects of the other two effectors.The glm()

24
command is designed to perform generalized linear models (regressions) on binary
outcome data, count data, probability data, proportion data, and many other data
types. In our case, the outcome is binary following a binomial distribution.
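A sketch of the glm() call on simulated data (the variable names mirror the dataset, but the values are generated here so the snippet runs on its own):

```r
set.seed(42)
n <- 200
age     <- round(runif(n, 30, 75))
sex     <- factor(sample(c("Female", "Male"), n, replace = TRUE))
thalach <- round(rnorm(n, 150, 20))          # simulated max heart rate

# Simulate a binary outcome whose log-odds depend on the three predictors
lp       <- -0.5 + 0.05 * (age - 50) + 0.8 * (sex == "Male") -
            0.03 * (thalach - 150)
hd_event <- rbinom(n, 1, plogis(lp))

# glm() with a binomial family fits the multiple logistic regression
model <- glm(hd_event ~ age + sex + thalach, family = binomial(link = "logit"))
summary(model)$coefficients
```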

5.3 EXTRACTING USEFUL INFORMATION FROM THE MODEL OUTPUT:

It is common practice in medical research to report Odds Ratio (OR) to


quantify how strongly the presence or absence of property A is associated with the
presence or absence of the outcome. When the OR is greater than 1, we say A is
positively associated with outcome B (increases the Odds of having B). Otherwise,
we say A is negatively associated with B (decreases the Odds of having B).

The raw glm coefficient table (the ‘estimate’ column in the printed output)
in R represents the log(Odds Ratios) of the outcome. Therefore, we need to convert
the values to the original OR scale and calculate the corresponding 95%
Confidence Interval (CI) of the estimated Odds Ratios when reporting results from
a logistic regression.
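The conversion is a call to exp() on the coefficients and confidence limits; the model is refit here on simulated data so the snippet is self-contained (confint.default() gives Wald 95% intervals):

```r
set.seed(42)
n <- 200
age      <- round(runif(n, 30, 75))
hd_event <- rbinom(n, 1, plogis(-3 + 0.05 * age))
model    <- glm(hd_event ~ age, family = binomial)

# Exponentiate the log-odds coefficients to get Odds Ratios, and the
# Wald confidence limits to get the 95% CI on the OR scale
or_table <- exp(cbind(OR = coef(model), confint.default(model)))
round(or_table, 3)
```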

5.4 PREDICTED PROBABILITIES FROM OUR MODEL:

So far, we have built a logistic regression model and examined the model
coefficients/ORs. We may wonder how can we use this model we developed to
predict a person’s likelihood of having heart disease given his/her age, sex, and
maximum heart rate. Furthermore, we’d like to translate the predicted probability
into a decision rule for clinical use by defining a cutoff value on the probability
scale. In practice, when an individual comes in for a health check-up, the doctor
would like to know the predicted probability of heart disease, for specific values of
the predictors: a 45-year-old female with a max heart rate of 150. To do that, we
create a data frame called newdata, in which we include the desired values for
our prediction.
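That prediction step can be sketched as follows (the model is refit on simulated data here so the snippet runs on its own):

```r
set.seed(7)
n <- 300
train <- data.frame(
  age     = round(runif(n, 30, 75)),
  sex     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  thalach = round(rnorm(n, 150, 20))
)
lp <- with(train, -0.5 + 0.05 * (age - 50) + 0.8 * (sex == "Male") -
                  0.03 * (thalach - 150))
train$hd_event <- rbinom(n, 1, plogis(lp))
model <- glm(hd_event ~ age + sex + thalach, data = train, family = binomial)

# Desired predictor values: a 45-year-old female with max heart rate 150
newdata <- data.frame(age = 45,
                      sex = factor("Female", levels = levels(train$sex)),
                      thalach = 150)

# type = "response" returns the probability rather than the log-odds
predict(model, newdata, type = "response")
```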

5.5 MODEL PERFORMANCE METRICS:

We are going to use some common metrics to evaluate the model


performance. The most straightforward one is Accuracy, which is the proportion of
the total number of predictions that were correct. On the other hand, we can
calculate the classification error rate using 1- accuracy. However, accuracy can be
misleading when the response is rare (i.e., imbalanced response). Another popular
metric, Area Under the ROC curve (AUC), has the advantage that it’s independent
of the change in the proportion of responders. AUC ranges from 0 to 1. The closer
it gets to 1, the better the model performance. Lastly, a confusion matrix is
an N x N matrix, where N is the number of outcome levels. For the problem at
hand, we have N=2, and hence we get a 2 x 2 matrix. It cross-tabulates the
predicted outcome levels against the true outcome levels.

After these metrics are calculated, we'll see (from the logistic regression OR
table) that older age, being male and having a lower max heart rate are all
risk factors for heart disease. We can also apply our model to predict the
probability of having heart disease. For a 45-year-old female with a max heart
rate of 150, our model generates a heart disease probability of 0.177,
indicating a low risk of heart disease.

5.6 DISEASE PREDICTION

The analysis below shows disease prediction using various ML algorithms. The
outcome has been defined as a binary classification variable, and several
classification algorithms have been used and their accuracy compared. This is a
comparison study, and the reasoning behind the choice of these algorithms has
not been the focus of this study.

5.7 EXPLORE THE ASSOCIATIONS GRAPHICALLY

In addition to p-values from statistical tests, we can plot the age, sex, and
maximum heart rate distributions with respect to our outcome variable. This will
give us a sense of both the direction and magnitude of the relationship.

 First, we plot age using a boxplot since it is a continuous variable.

 Next, we plot sex using a barplot since it is a binary variable in this dataset.
 Finally, we plot thalach using a boxplot since it is a continuous variable.
 Age is on the x-axis, sex on the y-axis (0 - female, 1 - male), the size of each
circle is the cholesterol level, and the color is the condition. Yellow means
disease, blue means no disease, and each circle is a data point. You can see
that the male count is much higher than the female count, and males have
more cases with disease than the female population. The disease also
appears more prevalent at high cholesterol values.
 The plot below is the same as the above, except that the y-axis is the chest
pain type and the color is sex rather than condition.

6. SOURCE CODE

# Read datasets Cleveland_hd.csv into Cleveland_hd

library(readr)

Cleveland_hd <- read.csv("D:/Mini R Project/Cleveland_hd.csv")

# take a look at the first 5 rows of Cleveland_hd

head(Cleveland_hd,5)

IDENTIFYING IMPORTANT CLINICAL VARIABLES

# load the tidyverse package

library(tidyverse)

# Use the 'mutate' function from dplyr to recode our data

Cleveland_hd %>% mutate(hd = ifelse(class > 0, 1, 0))-> Cleveland_hd

# recode sex using mutate function and save as Cleveland_hd

Cleveland_hd %>% mutate(sex = factor(sex, levels = 0:1,
                                     labels = c("Female", "Male"))) -> Cleveland_hd

# Does sex have an effect? Sex is a binary variable in this dataset,

# so the appropriate test is chi-squared test

hd_sex <- chisq.test(Cleveland_hd$hd, Cleveland_hd$sex)

# Does age have an effect? Age is continuous, so we use a t-test

hd_age <- t.test(age ~ hd, data = Cleveland_hd)

# What about thalach? Thalach is continuous, so we use a t-test

hd_heartrate <- t.test(thalach ~ hd, data = Cleveland_hd)

# Print the results to see if p<0.05.

print(hd_sex)

print(hd_age)

print(hd_heartrate)
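For reference, the p-values printed above can also be extracted programmatically: objects returned by t.test() and chisq.test() are of class "htest" and store the p-value in their p.value element. A self-contained sketch on toy data (the group means and sample sizes below are made up):

```r
# An htest object stores its p-value in the p.value element, so significance
# can be checked in code rather than read off the printed output.
# Toy data only: two groups whose means clearly differ.
set.seed(1)
toy <- data.frame(thalach = c(rnorm(50, 160, 15), rnorm(50, 140, 15)),
                  hd      = rep(0:1, each = 50))
toy_test <- t.test(thalach ~ hd, data = toy)
toy_test$p.value < 0.05   # TRUE when the group means differ significantly
```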

PUTTING ALL THREE VARIABLES IN ONE MODEL

# use glm function from base R and specify the family argument as binomial

model <-glm(data = Cleveland_hd, hd~age+sex+thalach, family="binomial")

# extract the model summary

summary(model)

EXTRACTING USEFUL INFORMATION FROM THE MODEL OUTPUT

# load the broom package

library(broom)

# tidy up the coefficient table

tidy_m <- tidy(model)

tidy_m

# calculate OR

tidy_m$OR <- exp(tidy_m$estimate)

# calculate 95% CI and save as lower CI and upper CI

tidy_m$lower_CI <- exp(tidy_m$estimate - 1.96 * tidy_m$std.error)

tidy_m$upper_CI <- exp(tidy_m$estimate + 1.96 * tidy_m$std.error)

# display the updated coefficient table

tidy_m

PREDICTED PROBABILITIES FROM OUR MODEL

# get the predicted probability in our dataset using the predict() function

pred_prob <- predict(model,Cleveland_hd, type = "response")

# create a decision rule using probability 0.5 as cutoff and save the
# predicted decision into the main data frame

Cleveland_hd$pred_hd <- ifelse(pred_prob >= 0.5,1,0)

# create a newdata data frame to save a new case information

newdata <- data.frame(age = 45, sex = "Female", thalach = 150)

# predict probability for this new case and print out the predicted value

p_new <- predict(model,newdata, type = "response")

p_new

MODEL PERFORMANCE METRICS

# load Metrics package

library(Metrics)

# calculate auc, accuracy, classification error

auc <- auc(Cleveland_hd$hd,Cleveland_hd$pred_hd)

accuracy <- accuracy(Cleveland_hd$hd,Cleveland_hd$pred_hd)

classification_error <- ce(Cleveland_hd$hd,Cleveland_hd$pred_hd)

# print out the metrics on to screen

print(paste("AUC=", auc))

print(paste("Accuracy=", accuracy))

print(paste("Classification Error=", classification_error))

# confusion matrix

table(Cleveland_hd$hd, Cleveland_hd$pred_hd,
      dnn = c('True Status', 'Predicted Status'))
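Note that the AUC above is computed from hard 0/1 predictions; computing it from the predicted probabilities instead (e.g., passing pred_prob rather than pred_hd) preserves the ranking information AUC is designed to measure. The rank-based definition can be sketched in base R on toy values:

```r
# AUC as the probability that a random positive case is ranked above a
# random negative case (Mann-Whitney formulation) -- toy values only
auc_by_rank <- function(actual, prob) {
  r     <- rank(prob)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

actual <- c(0, 0, 1, 1)
prob   <- c(0.10, 0.40, 0.35, 0.80)
auc_by_rank(actual, prob)   # 3 of the 4 positive/negative pairs are ordered correctly
```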

7. GRAPHICAL OUTPUT

7.1 RECODE HD TO BE LABELLED

# Recode hd to be labelled

Cleveland_hd %>% mutate(hd_labelled = ifelse(hd == 0, "No Disease", "Disease")) -> Cleveland_hd

# age vs hd

ggplot(data = Cleveland_hd, aes(x = hd_labelled,y = age)) + geom_boxplot()

7.2 MAX HEART RATE VS HD

# Max heart rate vs hd

ggplot(data = Cleveland_hd,aes(x=hd_labelled,y=thalach)) + geom_boxplot()

7.3 DISEASE DISTRIBUTION FOR AGE.

####################################################

# Disease distribution for age.

# 0 - no disease

# 1 - disease

####################################################

Cleveland_hd%>% group_by(age, condition) %>% summarise(count = n()) %>%

ggplot() + geom_bar(aes(age, count, fill = as.factor(condition)), stat = "identity") +

theme_bw() +

theme(axis.text.x = element_text(angle = 90, size = 10)) +

ylab("Count") + xlab("Age") + labs(fill = "Condition")

7.4 CHEST PAIN TYPE FOR DISEASED PEOPLE

####################################################

# Chest pain type for diseased people

# You can see that the majority are chest pain type 3 (asymptomatic)

# 0: typical angina, 1: atypical angina, 2: non-anginal pain, 3: asymptomatic

####################################################

Cleveland_hd %>% filter(condition == 1) %>% group_by(age, cp) %>%
  summarise(count = n()) %>%

ggplot() + geom_bar(aes(age, count, fill = as.factor(cp)), stat = "identity") +

theme_bw() +

theme(axis.text.x = element_text(angle = 90, size = 10)) +

ylab("Count") + xlab("Age") + labs(fill = "Chest pain type") +

ggtitle("Age vs. Count (disease only) for various chest pain conditions") +

scale_fill_manual(values=c("red", "blue", "green", "black"))

7.5 CONDITION SEX WISE

Age is on the x-axis, sex on the y-axis (0 - female, 1 - male), the size of
each circle is the cholesterol level, and the color is the condition.

Yellow means disease, blue means no disease, and each circle is a data
point.

You can see that the male count is much higher than the female count,
and males have more cases with disease than the female population.
Also, the disease appears more prevalent at high cholesterol values.

####################################################

# condition sex wise

####################################################

# ggballoonplot() is provided by the ggpubr package
library(ggpubr)

options(repr.plot.width = 20, repr.plot.height = 8)

heart_disease_data %>% ggballoonplot(x = "age", y = "sex",
                                     size = "chol", size.range = c(5, 30),
                                     fill = "condition", show.label = FALSE,
                                     ggtheme = theme_bw()) +

scale_fill_viridis_c(option = "C") +

theme(axis.text.x = element_text(angle = 90, size = 10)) +

ggtitle("Age vs. Sex Map") + labs(fill = "Condition")

The plot below is the same as the one above, except that the y-axis is
the chest pain type and the color is sex rather than condition.

options(repr.plot.width = 20, repr.plot.height = 8)

####################################################

# chest pain type sex wise

####################################################

heart_disease_data %>% ggballoonplot(x = "age", y = "cp",
                                     size = "chol", size.range = c(5, 30),
                                     fill = "sex", show.label = FALSE,
                                     ggtheme = theme_bw()) +

scale_fill_viridis_c(option = "C") +

theme(axis.text.x = element_text(angle = 90, size = 10)) +

ggtitle("Age vs. Chest Pain Map") + labs(fill = "sex")

7.6 DISEASE PREDICTION SETUP

# createDataPartition() and train() below come from the caret package,
# and tic()/toc() come from the tictoc package
library(caret)
library(tictoc)

set.seed(2020, sample.kind = "Rounding")

# Divide into train and validation datasets
test_index <- createDataPartition(y = heart_disease_data$condition,
                                  times = 1, p = 0.2, list = FALSE)
train_set <- heart_disease_data[-test_index, ]
validation <- heart_disease_data[test_index, ]

# Convert the dependent variable to a factor
train_set$condition <- as.factor(train_set$condition)
validation$condition <- as.factor(validation$condition)
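createDataPartition() samples within each outcome level so the class balance is preserved in both sets. A simplified base-R analogue of that stratified 80/20 split, on a toy outcome vector (this is not the caret implementation, just a sketch of the idea):

```r
# Toy outcome vector with a 160/140 class split
set.seed(2020)
condition <- rep(0:1, c(160, 140))

# Sample 20% of the indices within each class, mimicking stratification
test_idx <- unlist(lapply(split(seq_along(condition), condition),
                          function(i) sample(i, round(0.2 * length(i)))))

length(test_idx) / length(condition)   # 0.2 of the rows are held out
table(condition[test_idx])             # class balance preserved: 32 vs 28
```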

7.7 LDA: LINEAR DISCRIMINANT ANALYSIS

################################

# LDA Analysis

###############################

lda_fit <- train(condition ~ ., method = "lda", data = train_set)

lda_predict <- predict(lda_fit, validation)

confusionMatrix(lda_predict, validation$condition)

7.8 QDA: QUADRATIC DISCRIMINANT ANALYSIS

################################

# QDA Analysis

###############################

qda_fit <- train(condition ~ ., method = "qda", data = train_set)

qda_predict <- predict(qda_fit, validation)

confusionMatrix(qda_predict, validation$condition)

7.9 K-NN: K-NEAREST NEIGHBORS CLASSIFIER

5-fold cross-validation was used, and tuning was performed on all of the
algorithms discussed below, to avoid over-training them.
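What trainControl(method = "cv", number = 5) arranges internally can be sketched as randomly assigning each row to one of five folds:

```r
# Assign each of n rows to one of 5 folds at random; in each CV round,
# one fold is held out for validation and the other four are used to train
set.seed(1)
n     <- 100
folds <- sample(rep(1:5, length.out = n))

table(folds)   # each fold holds exactly 20 rows here
```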

ctrl <- trainControl(method = "cv", verboseIter = FALSE, number = 5)

tic(msg = " Total time for KNN :: ")

knnFit <- train(condition ~ .,
                data = train_set, method = "knn", preProcess = c("center", "scale"),
                trControl = ctrl, tuneGrid = expand.grid(k = seq(1, 20, 2)))

plot(knnFit)

toc()

knnPredict <- predict(knnFit,newdata = validation )

knn_results <- confusionMatrix(knnPredict, validation$condition )

knn_results

7.10 SVM: SUPPORT-VECTOR MACHINES

############################

# SVM

############################

ctrl <- trainControl(method = "cv", verboseIter = FALSE, number = 5)

grid_svm <- expand.grid(C = c(0.01, 0.1, 1, 10, 20))

tic(msg= " Total time for SVM :: ")

svm_fit <- train(condition ~ .,data = train_set,

method = "svmLinear", preProcess = c("center","scale"),

tuneGrid = grid_svm, trControl = ctrl)

plot(svm_fit)

toc()

svm_predict <- predict(svm_fit, newdata = validation)

svm_results <- confusionMatrix(svm_predict, validation$condition)

svm_results

7.11 RF: RANDOM FOREST

############################

# RF

############################

control<- trainControl(method = "cv", number = 5, verboseIter = FALSE)

grid <-data.frame(mtry = seq(1, 10, 2))

tic(msg= " Total time for rf :: ")

rf_fit <- train(condition ~ ., method = "rf", data = train_set, ntree = 20,
                trControl = control, tuneGrid = grid)

plot(rf_fit)

toc()

rf_predict <- predict(rf_fit, newdata = validation)

rf_results <- confusionMatrix(rf_predict, validation$condition)

rf_results

7.12 GBM: GRADIENT BOOSTING MACHINE

############################
# GBM
############################

gbmGrid <- expand.grid(interaction.depth = c(1, 5, 10, 25, 30),
                       n.trees = c(5, 10, 25, 50),
                       shrinkage = c(0.1, 0.2, 0.3, 0.4, 0.5),
                       n.minobsinnode = 20)

tic(msg= " Total time for GBM :: ")


gbm_fit <- train(condition ~ ., method = "gbm", data = train_set,
                 trControl = control, verbose = FALSE, tuneGrid = gbmGrid)

plot(gbm_fit)
toc()
gbm_predict <- predict(gbm_fit, newdata = validation)

gbm_results <- confusionMatrix(gbm_predict, validation$condition)

gbm_results

CONCLUSION
Heart disease is a major killer in India and throughout the world, so the
application of promising technology like machine learning to the early prediction
of heart disease can have a profound impact on society. The early prognosis of
heart disease can aid decisions on lifestyle changes in high-risk patients and in
turn reduce complications, which would be a great milestone in the field of
medicine. The number of people facing heart disease rises each year, which
calls for early diagnosis and treatment. The use of suitable technology in this
regard can prove highly beneficial to the medical fraternity and patients. In this
project, seven different machine learning algorithms were applied to the dataset
and their performance measured: SVM, Decision Tree, Random Forest, Naïve
Bayes, Logistic Regression, Adaptive Boosting, and Extreme Gradient Boosting.

FUTURE ENHANCEMENT

The attributes expected to lead to heart disease in patients are available in
the dataset, which contains 76 features; the 14 important features that are useful
for evaluating the system are selected from among them. If all the features are
taken into consideration, the efficiency of the resulting system is lower. To
increase efficiency, attribute selection is performed, in which the n features that
give the most accuracy are selected for evaluating the model. Some features in
the dataset are almost perfectly correlated with others, and so they are
removed. If all the attributes in the dataset are taken into account, the efficiency
decreases considerably.

The accuracies of all seven machine learning methods are compared, and
from this comparison one prediction model is generated. Hence, the aim is to
use various evaluation metrics such as the confusion matrix, accuracy,
precision, recall, and F1-score, which predict the disease efficiently. Comparing
all seven, the Extreme Gradient Boosting classifier gives the highest accuracy,
at 81%.
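The evaluation metrics mentioned above follow directly from the confusion-matrix counts. A sketch with toy counts, treating condition 1 (disease) as the positive class (the numbers are illustrative, not this project's results):

```r
# Toy confusion-matrix counts -- illustrative only
TP <- 50   # diseased, predicted diseased
FP <- 10   # healthy, predicted diseased
FN <- 15   # diseased, predicted healthy

precision <- TP / (TP + FP)   # of predicted positives, how many are right
recall    <- TP / (TP + FN)   # of true positives, how many are found
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 3)
```

With these counts the F1-score works out to exactly 2·TP / (2·TP + FP + FN) = 100/125 = 0.8.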

REFERENCES
[1] Soni, Jyoti, et al. "Predictive data mining for medical diagnosis: An overview
of heart disease prediction." International Journal of Computer Applications 17.8
(2011): 43-48.

[2] Dangare, Chaitrali S., and Sulabha S. Apte. "Improved study of heart disease
prediction system using data mining classification techniques." International
Journal of Computer Applications 47.10 (2012): 44-48.

[3] Uyar, Kaan, and Ahmet İlhan. "Diagnosis of heart disease using genetic
algorithm based trained recurrent fuzzy neural networks." Procedia computer
science 120 (2017): 588-593.

[4] Kim, Jae Kwon, and Sanggil Kang. "Neural network-based coronary heart
disease risk prediction using feature correlation analysis." Journal of healthcare
engineering 2017 (2017).

[5] Baccouche, Asma, et al. "Ensemble Deep Learning Models for Heart Disease
Classification: A Case Study from Mexico." Information 11.4 (2020): 207.

[6] https://archive.ics.uci.edu/ml/datasets/Heart+Disease

[7] https://www.kaggle.com/ronitf/heart-disease-uci

[8] https://www.robots.ox.ac.uk/~az/lectures/ml/lect2.pdf

[9] https://nthu-datalab.github.io/ml/labs/03_Decision-Trees_RandomForest/03_Decision-Tree_Random-Forest.html

[10] https://www.kaggle.com/jprakashds/confusion-matrix-in-python-binaryclass

[11] scikit-learn, keras, pandas and matplotlib

[12] A. H. M. S. U. Marjia Sultana, "Analysis of Data Mining Techniques for
Heart Disease Prediction," 2018.

[13] M. I. K., A. I., S. Musfiq Ali, "Heart Disease Prediction Using Machine
Learning Algorithms".

[14] K. Bhanot, "towardsdatascience.com," 13 Feb 2019. [Online]. Available:
https://towardsdatascience.com/predicting-presence-of-heart-diseases-using-machinelearning-36f00f3edb2c. [Accessed 2 March 2020].

[15] [Online]. Available: https://www.kaggle.com/ronitf/heart-disease-uci#heart.csv.
[Accessed 05 December 2019].

[16] M. A. K. S. H. K. M. a. V. P. M. Marimuthu, "A Review on Heart Disease
Prediction using Machine Learning and Data Analytics Approach".
