
“Heart Disease Diagnosis Prediction using Random Forest”

ABSTRACT
Machine learning in R is used across the world, and the healthcare
industry is no exception. Machine learning can play an essential role
in predicting the presence or absence of locomotor disorders, heart
disease and more. Such information, if predicted well in advance, can
provide important insights to doctors, who can then adapt their
diagnosis and treatment on a per-patient basis. This project works on
predicting possible heart disease in people using machine learning
algorithms. We perform a comparative analysis of classifiers such as
decision tree, Naïve Bayes, logistic regression, SVM and Random
Forest, and propose an ensemble classifier that performs hybrid
classification by combining strong and weak classifiers, since it can
use multiple samples for training and validating the data. We then
compare the existing classifiers with proposed boosted classifiers
such as AdaBoost and XGBoost, which give accurate results and aid
in predictive analysis.

➢ Keywords: machine learning in R, SVM, Random Forest, Linear
Discriminant Analysis, Quadratic Discriminant Analysis, k-nearest
neighbours, Gradient Boosting Machine.
OBJECTIVES
➢ The main objectives of developing this project are:

• To develop a machine learning model in R to predict the future
possibility of heart disease by implementing logistic regression.
• To determine significant risk factors, based on a medical dataset,
which may lead to heart disease.
• To analyse feature selection methods and understand their
working principles.

❖ LIST OF ABBREVIATIONS:
1. LDA: Linear Discriminant Analysis
2. QDA: Quadratic Discriminant Analysis
3. k-NN: k-Nearest Neighbours
4. SVM: Support Vector Machine
5. RF: Random Forest
6. GBM: Gradient Boosting Machine
7. EDA: Exploratory Data Analysis
8. ECG: Electrocardiogram
9. AMI: Acute Myocardial Infarction
SYSTEM SPECIFICATIONS
➢ HARDWARE REQUIREMENTS:

The selection of the hardware configuration is an important task
related to software development: insufficient random access memory
may adversely affect the speed and efficiency of the entire system.
The processor should be powerful enough to handle all the
operations, and the hard disk should have sufficient capacity to store
the files and applications.

❖ System: Intel Pentium processor
❖ RAM: 2 GB and above
❖ Hard disk: 250 GB and above

➢ SOFTWARE REQUIREMENTS:

A major element in building a system is the selection of compatible
software, since the software available in the market is growing at a
geometric rate. The selected software should be acceptable to the
firm and the end user, as well as feasible for the system. This
document gives a detailed description of the software requirement
specification. The study of the requirement specification focuses on
the functioning of the system. It allows the developer or analyst to
understand the system, the functions to be carried out, the
performance level to be obtained and the corresponding interfaces to
be established.

❖ Operating system: Windows 11


❖ Environment (IDE): R STUDIO
❖ Back end: R version 4.2.1
1. INTRODUCTION
According to the World Health Organization, every year 12
million deaths occur worldwide due to heart disease. The burden of
cardiovascular disease has been rapidly increasing all over the world
over the past few years. Much research has been conducted in an
attempt to pinpoint the most influential factors of heart disease as
well as to accurately predict the overall risk. Heart disease is even
described as a silent killer, leading to death without obvious
symptoms.

The early diagnosis of heart disease plays a vital role in making
decisions on lifestyle changes in high-risk patients and in turn
reduces complications. This project aims to predict future heart
disease by analysing patient data and classifying whether or not
patients have heart disease using machine learning algorithms.

1.1 PROBLEM DEFINITION:

The major challenge with heart disease is its detection.
Instruments are available that can predict heart disease, but they are
either expensive or not efficient at calculating a person's chance of
heart disease. Early detection of cardiac disease can decrease the
mortality rate and overall complications. However, it is not possible to
monitor patients accurately every day in all cases, and round-the-clock
consultation by a doctor is not available, since it requires more
patience, time and expertise. Since we have a good amount of data in
today's world, we can use various machine learning algorithms in R
to analyse the data for hidden patterns. These hidden patterns can
then be used for health diagnosis in medical data.

1.2 MOTIVATION FOR THE WORK:

Machine learning techniques in R have been around for a while
and have been compared and used for analysis in many kinds of data
science applications. The major motivation behind this research-based
project was to explore the feature selection methods, data preparation
and processing behind training models in R. With ready-made models
and libraries, the challenge we face today is that, despite the
abundance of data and well-tuned models, the accuracy we see during
training, testing and actual validation has a high variance. Hence this
project is carried out with the motivation to look behind the models
and implement a logistic regression model to train on the obtained
data. Furthermore, since the whole field of machine learning is
motivated to develop appropriate computer-based systems and
decision support that can aid the early detection of heart disease, in
this project we have developed a model which classifies whether a
patient will have heart disease in ten years or not, based on various
features (i.e. potential risk factors that can cause heart disease),
using logistic regression. The early prognosis of cardiovascular
disease can aid in making decisions on lifestyle changes in high-risk
patients and in turn reduce complications, which can be a great
milestone in the field of medicine.
2. PROJECT DESCRIPTION
Heart disease is perceived as the deadliest disease affecting
human life across the world. In this type of disease, the heart is not
capable of pushing the required quantity of blood to the remaining
organs of the body to accomplish their regular functions. Some of the
symptoms of heart disease include physical weakness, improper
breathing and swollen feet. Techniques are therefore essential to
identify complicated heart diseases, which carry high risk and in turn
affect human life. Presently, the diagnosis and treatment process is
highly challenging due to the inadequacy of physicians and diagnostic
apparatus, which affects the treatment of heart patients.

Early diagnosis of heart disease is significant in minimizing
heart-related issues and protecting the heart from serious risk.
Invasive techniques diagnose heart disease based on medical history,
symptom analysis reports by experts, and physical laboratory reports.
However, this causes delay and imprecise diagnosis due to human
intervention, and it is time-consuming, computationally intensive and
expensive at the time of assessment.

Heart disease can be predicted based on various attributes such
as age, gender and pulse rate. Data analysis in healthcare assists in
predicting diseases, improving diagnosis, analysing symptoms,
providing appropriate medicines, improving the quality of care,
minimizing cost, extending the life span and reducing the death rate
of heart patients. An ECG helps in screening for irregular heart beat
and stroke, with embedded sensors resting on the chest to track the
patient's heart beat.

Heart disease prediction is done with detailed clinical data that
can assist experts in making decisions. Human life is highly dependent
on the proper functioning of the blood vessels of the heart. Improper
blood circulation causes heart inactivity, kidney failure, an imbalanced
condition of the brain, and even immediate death. Some of the risk
factors that can cause heart disease are obesity, smoking, diabetes,
blood pressure, cholesterol, lack of physical activity and an unhealthy
diet.

AMI is a cardiovascular disease that happens due to an
interruption in the blood flow or circulation to the heart muscle,
causing the heart muscle to become necrotic (damaged or dead). The
primary reason for this disease is a blockage, meaning that the blood
flow to the heart muscle becomes obstructed or reduced. If the blood
flow is reduced or obstructed, the function of the red blood cells,
which carry the oxygen that sustains consciousness and life, is
severely impacted. Without an oxygen supply for 6 to 8 minutes, the
heart muscle may arrest, which in turn results in the patient's death.

A significant cause of cardiovascular disease is 'plaque', a hard
substance made up of cholesterol (fat) that forms in the coronary
arteries and causes the blood flow to be reduced or obstructed. This
build-up of plaque in the arteries is known as atherosclerosis, and its
cause has been determined to be chronic inflammation.
3. SOFTWARE DESCRIPTION
R is a programming language and free software developed by
Ross Ihaka and Robert Gentleman in 1993. R possesses an in-depth
catalogue of statistical and graphical methods. It includes machine
learning algorithms, simple and linear regression, and statistics. Most
R libraries are written in R itself, but for heavy computational tasks,
C, C++ and Fortran code is preferred.

R is not only trusted by academics; many massive firms and
MNCs additionally use the R programming language, including Uber,
Google, Airbnb and Facebook. Data analysis with R is done in a
series of steps: programming, transforming, discovering, modelling,
and communicating the results.

3.1 EVOLUTION OF R :
• R was initially written by Ross Ihaka and Robert Gentleman at the
Department of Statistics of the University of Auckland in
Auckland, New Zealand. R made its first appearance in 1993.
• A large group of individuals has contributed to R by sending code
and bug reports.
• Since mid-1997 there has been a core group (the "R Core Team")
who can modify the R source code archive.
3.2 R VERSION :
R version 4.2.1 (2022-06-23 ucrt) -- "Funny-Looking Kid"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions. Type
'license()' or 'licence()' for distribution details. Natural language
support but running in an English locale.

R is a collaborative project with many contributors. Type
'contributors()' for more information and 'citation()' on how to cite R or
R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.

3.3 FEATURES OF R :
As stated earlier, R is a programming language and software
environment for statistical analysis, graphics representation and
reporting. The following are the important features of R:

• R is a well-developed, simple and effective programming language
which includes conditionals, loops, user-defined recursive functions
and input and output facilities.
• R has an effective data handling and storage facility.
• R provides a suite of operators for calculations on arrays, lists,
vectors and matrices.
• R provides a large, coherent and integrated collection of tools for
data analysis.
• R provides graphical facilities for data analysis and display, either
directly on the computer or printed on paper.

3.4 USAGE OF R PROGRAMMING :

The R language is in much demand in real-world applications because
of the following reasons:

❖ IMPORTANT FOR DATA SCIENCE: As R is an interpreted
language, we can run code without any compiler, which is important
in data science. R is a vector language and hence powerful and faster
than many other languages. R is used in biology and genetics as well
as in statistics, and can perform many types of tasks.

❖ OPEN-SOURCE: R is an open-source language, maintained by a
large community of programmers across the world. Since R is issued
under the GNU General Public License, there is no restriction on its
usage.

❖ POPULARITY: R has become one of the most popular programming
languages in the technological world. R is not only given importance
in the academic world; with the emergence of data science, the
requirement for R in industry has also increased.

❖ ROBUST VISUALIZATION LIBRARIES: R has libraries like ggplot2
and plotly that provide graphical plots to the user. R is widely
recognized for its visualizations, which are very important in data
science.

❖ USED TO DEVELOP WEB APPS: R provides the ability to build web
applications. Using R packages, we can develop interactive
applications from the console of an R IDE.

❖ PLATFORM INDEPENDENT: R is a platform-independent language.
It can work on any system, whether Windows, Linux or Mac.

❖ USED IN MACHINE LEARNING: An important advantage of R is
that it helps to carry out machine learning operations like
classification and regression, and also provides features for artificial
intelligence and neural networks.

3.5 RSTUDIO APPLICATION FOR WINDOWS :

RStudio is a free, open-source integrated development
environment for the R programming language. It is designed for use
by data scientists, statisticians, data miners and business intelligence
developers, as well as data science project teams. It is a powerful and
popular IDE for R, and a cross-platform product available for
Windows, macOS and Ubuntu. RStudio has a simple and intuitive
interface that is easy to learn, which makes it a good place for
beginners to start. The interface offers a lot of functionality and is a
great tool for analysing and visualizing data, and the application
provides a lot of support in the form of a built-in help system.

I. INTERFACE:
RStudio is a well-designed, intuitive, user-friendly application. The
design is minimalistic and clean, and the user can easily navigate
through the menus. The application provides a wide range of features,
mainly categorized into data science, visualization and administration.
The interface is based on a "scratchpad" principle, where users can
create their own projects or start from one of the many available
templates.
II. USABILITY:

The application is simple to use, with a clean, minimalistic design and
an intuitive interface. Many features make it usable, such as code
completion and the built-in help system. It is designed for data
analysts and statisticians and provides a lot of features and
functionality for them.

III. FUNCTIONALITY:

A wide range of features is available in RStudio, divided into data
science, visualization and administration, the main areas of the
software. The functionality of the application is excellent: it has a
wide range of features for analysing and visualizing data, and it can
be used for different purposes, such as data science, web
development and other fields.
4. PACKAGES IN R PROGRAMMING
A package is an appropriate way to organize work and share it
with others. Typically, a package will include code (not only R code!),
documentation for the package and the functions inside, some tests to
check everything works as it should, and data sets.

4.1 PACKAGES IN R:

Packages in the R programming language are a set of R functions,
compiled code and sample data. They are stored under a directory
called "library" within the R environment. By default, R installs a
group of packages during installation, and once we start the R
console, only these default packages are available. Other packages
that are already installed need to be loaded explicitly in order to be
used by the R program that is going to use them.

4.2 PACKAGES USED:

▪ library(readr):
The goal of readr is to provide a fast and friendly way to read
rectangular data from delimited files, such as comma-separated
values (CSV) and tab-separated values (TSV). It is designed to
parse many types of data found in the wild, while providing an
informative problem report when parsing leads to unexpected
results. If you are new to readr, the best place to start is the data
import chapter in R for Data Science.
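As a quick illustration (a minimal sketch; the column names below are made up, and in practice read_csv() would be given a file path):

```r
library(readr)

# read_csv() treats a string containing newlines as literal CSV data;
# in practice you would pass a file path instead
df <- read_csv("age,sex,chol\n63,1,233\n41,0,204\n", show_col_types = FALSE)
nrow(df)   # number of rows read
names(df)  # column names parsed from the header
```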
▪ library(tidyverse):

The tidyverse packages were specially designed for data science
with a common design philosophy. They include all the packages
required in the data science workflow, ranging from data exploration
to data visualization.

 Tidyverse packages in R include the following:

1. Data visualization and exploration: ggplot2

2. Data wrangling and transformation: dplyr, tidyr, stringr, forcats

3. Data import and management: tibble, readr

4. Functional programming: purrr

▪ library(broom):

broom summarizes key information about models in tidy tibbles. It
provides three verbs to make it convenient to interact with model
objects:

1. tidy() summarizes information about model components
2. glance() reports information about the entire model
3. augment() adds information about observations to a dataset

tidy() produces a tibble where each row contains information about
an important component of the model. For regression models, this
often corresponds to the regression coefficients. This can be useful
if you want to inspect a model or create custom visualizations.
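The three verbs can be sketched on a toy linear model (using the built-in mtcars data, not the project dataset):

```r
library(broom)

# fit a simple linear model on the built-in mtcars data
m <- lm(mpg ~ wt, data = mtcars)

tidy(m)           # one row per coefficient: estimate, std.error, statistic, p.value
glance(m)         # one row of model-level statistics: r.squared, AIC, ...
head(augment(m))  # original observations plus fitted values and residuals
```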
▪ library(Metrics):

The Metrics package implements evaluation metrics commonly used
in supervised machine learning, including accuracy, classification
error and AUC for classification problems, as well as common
regression metrics. It is used in this project to evaluate the
performance of the fitted model.
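A small sketch of the Metrics functions used later in this project, on a toy truth/prediction pair:

```r
library(Metrics)

truth <- c(0, 0, 1, 1)   # actual labels
pred  <- c(0, 1, 1, 1)   # predicted labels (one false positive)

accuracy(truth, pred)  # proportion of correct predictions (3 of 4 here)
ce(truth, pred)        # classification error = 1 - accuracy
auc(truth, pred)       # rank-based area under the ROC curve
```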

▪ library(dslabs): Data Science Labs

Datasets and functions that can be used for data analysis practice,
homework and projects in data science courses and workshops. 26
datasets are available for case studies in data visualization, statistical
inference, modelling, linear regression, data wrangling and machine
learning.

▪ library(dplyr):

dplyr is an R package that provides a grammar of data manipulation,
with a widely used set of verbs that helps analysts solve the most
common data manipulation tasks. To use it, install it first with
install.packages('dplyr') and load it with library(dplyr).
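The core verbs can be sketched on the built-in mtcars data (not the project dataset):

```r
library(dplyr)

mtcars %>%
  filter(cyl == 4) %>%             # keep only 4-cylinder cars
  mutate(heavy = wt > 3) %>%       # add a derived logical column
  group_by(heavy) %>%              # split into groups
  summarise(mean_mpg = mean(mpg))  # one summary row per group
```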

▪ library(caret):

The caret package is a comprehensive framework for building
machine learning models in R, providing a uniform interface for the
step-by-step process of building predictive models. Be it a decision
tree or xgboost, caret helps to find the optimal model in the shortest
possible time.
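A hedged sketch of caret's uniform interface, cross-validating a decision tree on the built-in iris data (method = "rpart" assumes the rpart package is installed):

```r
library(caret)

set.seed(123)
# 5-fold cross-validation of a decision tree (rpart) on iris
fit <- train(Species ~ ., data = iris, method = "rpart",
             trControl = trainControl(method = "cv", number = 5))
fit$results  # accuracy and kappa for each complexity parameter tried
```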

▪ library(lubridate):

Lubridate makes it easier to do the things R does with date-times and


possible to do the things R does not. If you are new to lubridate, the
best place to start is the dates and times chapter in R for Data Science.
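For example, a few of lubridate's parsing and arithmetic helpers:

```r
library(lubridate)

d <- ymd("2022-06-23")  # parse an ISO-formatted date
month(d)                # extract the month (6)
year(d)                 # extract the year (2022)
d + days(7)             # date arithmetic: one week later
```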

▪ library(tidytext):

Using tidy data principles can make many text mining tasks easier,
more effective, and consistent with tools already in wide use. Much of
the infrastructure needed for text mining with tidy data frames already
exists in packages like dplyr, broom, tidyr, and ggplot2.

▪ library("RColorBrewer"):

RColorBrewer is an R package that offers a variety of color palettes
to use when making different types of plots. Colors impact the way we
visualize data.

▪ library(randomForest):

The R package randomForest is used to create random forests. Install
it with install.packages("randomForest"), along with any dependent
packages. The package provides the function randomForest(), which
is used to create and analyze random forests.
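A minimal sketch on the built-in iris data (the project's Cleveland data would be used the same way):

```r
library(randomForest)

set.seed(1)
# grow 100 trees; randomForest() infers classification from the factor response
rf <- randomForest(Species ~ ., data = iris, ntree = 100)
print(rf)       # shows the OOB error estimate and confusion matrix
importance(rf)  # variable importance (mean decrease in Gini)
```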

▪ library(tictoc):

tictoc is an R library that provides timing functions: tic() starts a
timer and toc() stops it and reports the elapsed time, which is useful
for benchmarking the steps of an analysis.
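For example:

```r
library(tictoc)

tic("model fitting")  # start a named timer
Sys.sleep(0.2)        # stand-in for a slow computation
toc()                 # prints the label and the elapsed time in seconds
```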

▪ library(e1071):

e1071 is a package for R that provides functions for statistical and
probabilistic algorithms such as a fuzzy classifier, the naive Bayes
classifier, bagged clustering, the short-time Fourier transform and
support vector machines. When it comes to SVMs, e1071 is one of
many packages available in R that implement them.
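A minimal SVM sketch on the built-in iris data:

```r
library(e1071)

# radial-kernel SVM; svm() scales the inputs by default
sv <- svm(Species ~ ., data = iris, kernel = "radial")
mean(predict(sv, iris) == iris$Species)  # training accuracy
```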

▪ library(ggpubr):

ggpubr provides 'ggplot2'-based publication-ready plots. The ggplot2
package is excellent and flexible for elegant data visualization in R;
however, the default generated plots require some formatting before
they can be sent for publication, and ggpubr simplifies this.
5. DATA VISUALIZATION

The outcome variable, class, has more than two levels. According to
the codebook, any non-zero value can be coded as an "event", so we
create a new binary 1/0 outcome variable in the Cleveland_hd data.
There are a few other categorical/discrete variables in the dataset; we
also convert sex into a factor for the next analysis step, otherwise R
will treat it as continuous by default.
NAME        TYPE        DESCRIPTION
Age         Continuous  Age in years
Sex         Discrete    0=Female, 1=Male
Cp          Discrete    Chest pain type: 1=typical angina, 2=atypical
                        angina, 3=non-anginal pain, 4=asymptomatic
Trestbps    Continuous  Resting blood pressure (in mm Hg)
Chol        Continuous  Serum cholesterol in mg/dl
Fbs         Discrete    Fasting blood sugar > 120 mg/dl: 1=True, 0=False
Exang       Discrete    Exercise-induced angina: 1=Yes, 0=No
Thalach     Continuous  Max heart rate achieved
Oldpeak ST  Continuous  Depression induced by exercise relative to rest
Slope       Discrete    The slope of the peak exercise segment:
                        1=up sloping, 2=flat, 3=down sloping
Ca          Continuous  Number of major vessels colored by fluoroscopy,
                        ranging between 0 and 3
Thal        Discrete    3=normal, 6=fixed defect, 7=reversible defect
Class       Discrete    Diagnosis classes: 0=no presence, 1=least
                        likely to have heart disease, 2 and 3=
                        increasingly likely, 4=most likely to have
                        heart disease

5.1 CLINICAL VARIABLES:

We use statistical tests to see which predictors are related to heart
disease, exploring the association of each variable in the dataset with
the outcome. Depending on the type of the data (i.e., continuous or
categorical), we use a t-test or a chi-squared test to calculate the
p-values.

A t-test is used to determine whether there is a significant difference
between the means of two groups (e.g., is the mean age in group A
different from the mean age in group B?). A chi-squared test for
independence compares the equivalence of two proportions.
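The two tests can be sketched on simulated data as follows (toy data, not the project dataset):

```r
set.seed(42)
outcome <- rep(c(0, 1), each = 50)                         # binary outcome
age     <- c(rnorm(50, 50, 8), rnorm(50, 58, 8))           # continuous predictor
sex     <- rbinom(100, 1, ifelse(outcome == 1, 0.7, 0.4))  # categorical predictor

# continuous predictor vs binary outcome: compare group means
t.test(age ~ outcome)

# categorical predictor vs binary outcome: compare proportions
chisq.test(table(sex, outcome))
```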

5.2 PUTTING ALL THREE VARIABLES IN ONE MODEL:

The plots and the statistical tests both confirm that all three variables
are highly significantly associated with our outcome (p<0.001 for all
tests).

In general, we use multiple logistic regression when we have one
binary outcome variable and two or more predictor variables. The
binary variable is the dependent (Y) variable; we are studying the
effect that the independent (X) variables have on the probability of
obtaining a particular value of the dependent variable. For example,
we might want to know the effect that maximum heart rate, age, and
sex have on the probability that a person will have heart disease in
the next year.

5.3 EXTRACTING USEFUL INFORMATION FROM THE MODEL
OUTPUT:

It is common practice in medical research to report the Odds Ratio
(OR) to quantify how strongly the presence or absence of property A
is associated with the presence or absence of an outcome B. When
the OR is greater than 1, we say A is positively associated with
outcome B (it increases the odds of having B); otherwise, we say A is
negatively associated with B (it decreases the odds of having B).

The raw glm coefficient table (the 'estimate' column in the printed
output) in R represents the log odds ratios of the outcome. Therefore,
when reporting results from a logistic regression, we need to convert
the values back to the original OR scale and calculate the
corresponding 95% confidence interval (CI) of the estimated odds
ratios.
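This conversion can be sketched with base R alone, here on a toy logistic model fitted to the built-in mtcars data (the project code in Section 6 does the same with broom):

```r
# toy logistic regression: transmission type as outcome, weight and power as predictors
m <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

est <- coef(summary(m))
OR       <- exp(est[, "Estimate"])                               # odds ratios
lower_CI <- exp(est[, "Estimate"] - 1.96 * est[, "Std. Error"])  # 95% CI lower bound
upper_CI <- exp(est[, "Estimate"] + 1.96 * est[, "Std. Error"])  # 95% CI upper bound
cbind(OR, lower_CI, upper_CI)
```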

5.4 PREDICTED PROBABILITIES FROM OUR MODEL:

So far, we have built a logistic regression model and examined the
model coefficients/ORs. We may wonder how we can use this model
to predict a person's likelihood of having heart disease given their
age, sex and maximum heart rate. Furthermore, we would like to
translate the predicted probability into a decision rule for clinical use
by defining a cutoff value on the probability scale. In practice, when
an individual comes in for a health check-up, the doctor would like to
know the predicted probability of heart disease for specific values of
the predictors: for example, a 45-year-old female with a max heart
rate of 150. To do that, we create a data frame called newdata, in
which we include the desired values for our prediction.

5.5 MODEL PERFORMANCE METRICS:

We use some common metrics to evaluate the model performance.
The most straightforward one is accuracy, the proportion of the total
number of predictions that are correct; correspondingly, the
classification error rate is 1 − accuracy. However, accuracy can be
misleading when the response is rare (i.e., an imbalanced response).
Another popular metric, the area under the ROC curve (AUC), has the
advantage that it is independent of changes in the proportion of
responders. AUC ranges from 0 to 1.

5.6 DISEASE PREDICTION:

The analysis below shows disease prediction using various ML
algorithms. The outcome has been defined as a binary classification
variable, and several classification algorithms have been used and
their accuracy compared. This is a comparison study; the reasoning
behind the choice of these algorithms is not the focus of this study.

5.7 EXPLORE THE ASSOCIATIONS GRAPHICALLY:

In addition to p-values from statistical tests, we can plot the age, sex
and maximum heart rate distributions with respect to our outcome
variable. This gives us a sense of both the direction and magnitude of
the relationships.

▪ First, we plot age using a boxplot, since it is a continuous variable.
▪ Next, we plot sex using a barplot, since it is a binary variable in this dataset.
▪ Finally, we plot thalach using a boxplot, since it is a continuous variable.
▪ In the balloon plot, age is on the x-axis, sex is on the y-axis (0 = female,
1 = male), the size of each circle is the cholesterol level, and the colour is
the condition (yellow means disease, blue means no disease); each circle is
a data point. You can see that the male count is much higher than the female
count, and that males have more cases of disease than the female population.
The disease also appears more prevalent at high cholesterol values.
▪ The second plot is the same as the above, except that the y-axis is the
chest pain type and the colour is sex rather than condition.
6. SOURCE CODE
# Read dataset Cleveland_hd.csv into Cleveland_hd
library(readr)
Cleveland_hd <- read.csv("D:/Mini R Project/Cleveland_hd.csv")

# take a look at the first 5 rows of Cleveland_hd
head(Cleveland_hd, 5)

▪ IDENTIFYING IMPORTANT CLINICAL VARIABLES:

# load the tidyverse package
library(tidyverse)

# Is sex associated with hd? Sex is categorical, so we use a chi-squared test
hd_sex <- chisq.test(Cleveland_hd$sex, Cleveland_hd$hd)

# Does age have an effect? Age is continuous, so we use a t-test
hd_age <- t.test(age ~ hd, Cleveland_hd)

# What about thalach? Thalach is continuous, so we use a t-test
hd_heartrate <- t.test(thalach ~ hd, Cleveland_hd)

# Print the results to see if p < 0.05
print(hd_sex)
print(hd_age)
print(hd_heartrate)
▪ PUTTING ALL THREE VARIABLES IN ONE MODEL :
# use the glm function from base R and specify the family argument as binomial
model <- glm(data = Cleveland_hd, hd ~ age + sex + thalach, family = "binomial")

# extract the model summary
summary(model)
▪ EXTRACTING USEFUL INFORMATION FROM THE
MODEL OUTPUT:
# load the broom package
library(broom)

# tidy up the coefficient table
tidy_m <- tidy(model)
tidy_m

# calculate OR
tidy_m$OR <- exp(tidy_m$estimate)

# calculate 95% CI and save as lower CI and upper CI
tidy_m$lower_CI <- exp(tidy_m$estimate - 1.96 * tidy_m$std.error)
tidy_m$upper_CI <- exp(tidy_m$estimate + 1.96 * tidy_m$std.error)

# display the updated coefficient table
tidy_m

▪ PREDICTED PROBABILITIES FROM OUR MODEL :

# get the predicted probability in our dataset using the predict() function
pred_prob <- predict(model, Cleveland_hd, type = "response")

# create a decision rule using probability 0.5 as cutoff and save the
# predicted decision into the main data frame
Cleveland_hd$pred_hd <- ifelse(pred_prob >= 0.5, 1, 0)

# create a newdata data frame to hold a new case's information
newdata <- data.frame(age = 45, sex = "Female", thalach = 150)

# predict the probability for this new case and print out the predicted value
p_new <- predict(model, newdata, type = "response")
p_new
▪ MODEL PERFORMANCE METRICS:
# load the Metrics package
library(Metrics)

# calculate AUC, accuracy, classification error
auc <- auc(Cleveland_hd$hd, Cleveland_hd$pred_hd)
accuracy <- accuracy(Cleveland_hd$hd, Cleveland_hd$pred_hd)
classification_error <- ce(Cleveland_hd$hd, Cleveland_hd$pred_hd)

# print out the metrics to the screen
print(paste("AUC=", auc))
print(paste("Accuracy=", accuracy))
print(paste("Classification Error=", classification_error))

# confusion matrix
table(Cleveland_hd$hd, Cleveland_hd$pred_hd, dnn = c('True Status', 'Predicted Status'))
7. GRAPHICAL OUTPUT
7.1 RECODE HD TO BE LABELLED:

# Recode hd to be labelled
Cleveland_hd %>%
  mutate(hd_labelled = ifelse(hd == 0, "No Disease", "Disease")) ->
  Cleveland_hd

# age vs hd
ggplot(data = Cleveland_hd, aes(x = hd_labelled, y = age)) + geom_boxplot()

7.2 MAX HEART RATE VS HD:

# Max heart rate vs hd
ggplot(data = Cleveland_hd, aes(x = hd_labelled, y = thalach)) + geom_boxplot()

7.3 DISEASE DISTRIBUTION FOR AGE:

####################################################
# Disease distribution for age.
# 0 - no disease
# 1 - disease
####################################################
Cleveland_hd %>%
  group_by(age, condition) %>%
  summarise(count = n()) %>%
  ggplot() +
  geom_bar(aes(age, count, fill = as.factor(condition)), stat = "Identity") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, size = 10)) +
  ylab("Count") + xlab("Age") + labs(fill = "Condition")

7.4 CHEST PAIN TYPE FOR DISEASED PEOPLE:

####################################################
# Chest pain type for diseased people
# You can see - Majority as condition 3 type
# 0: typical angina 1: atypical angina 2: non-anginal pain 3: asymptomatic
####################################################
Cleveland_hd %>%
  filter(condition == 1) %>%
  group_by(age, cp) %>%
  summarise(count = n()) %>%
  ggplot() +
  geom_bar(aes(age, count, fill = as.factor(cp)), stat = "Identity") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, size = 10)) +
  ylab("Count") + xlab("Age") + labs(fill = "Condition") +
  ggtitle("Age vs. Count (disease only) for various chest pain conditions") +
  scale_fill_manual(values = c("red", "blue", "green", "black"))


7.5 CONDITION SEX WISE:
▪ Age is on the x-axis,
▪ Sex is on the y-axis (0 - female, 1 - male),
▪ The size of each circle is the cholesterol level, and the colour is the condition.
▪ Yellow means disease and blue means no disease, and each circle is a data point.
▪ You can see that the male count is much higher than the female count, and males have more disease cases than the female population. The disease also appears more prevalent at high cholesterol values.

####################################################
# condition sex wise
####################################################

options(repr.plot.width = 20, repr.plot.height = 8)
heart_disease_data %>%
  ggballoonplot(x = "age", y = "sex",
                size = "chol", size.range = c(5, 30),
                fill = "condition", show.label = FALSE,
                ggtheme = theme_bw()) +
  scale_fill_viridis_c(option = "C") +
  theme(axis.text.x = element_text(angle = 90, size = 10)) +
  ggtitle("Age vs. Sex Map") + labs(fill = "Condition")
▪ The plot below is the same as the above, except that the y-axis is the chest pain type and the colour is sex rather than condition:

####################################################
# condition sex wise
####################################################

options(repr.plot.width = 20, repr.plot.height = 8)
heart_disease_data %>%
  ggballoonplot(x = "age", y = "cp",
                size = "chol", size.range = c(5, 30),
                fill = "sex", show.label = FALSE,
                ggtheme = theme_bw()) +
  scale_fill_viridis_c(option = "C") +
  theme(axis.text.x = element_text(angle = 90, size = 10)) +
  ggtitle("Age vs. Chest Pain Map") + labs(fill = "Sex")
7.6 DISEASE PREDICTION SETUP:

set.seed(2020, sample.kind = "Rounding")

# Divide into train and validation datasets
test_index <- createDataPartition(y = heart_disease_data$condition, times = 1, p = 0.2, list = FALSE)
train_set <- heart_disease_data[-test_index, ]
validation <- heart_disease_data[test_index, ]

# Convert the dependent variable to a factor
train_set$condition <- as.factor(train_set$condition)
validation$condition <- as.factor(validation$condition)
7.7 LDA: LINEAR DISCRIMINANT ANALYSIS:

################################
# LDA Analysis
################################

lda_fit <- train(condition ~ ., method = "lda", data = train_set)
lda_predict <- predict(lda_fit, validation)
confusionMatrix(lda_predict, validation$condition)
7.8 QDA: QUADRATIC DISCRIMINANT ANALYSIS:

################################
# QDA Analysis
################################

qda_fit <- train(condition ~ ., method = "qda", data = train_set)
qda_predict <- predict(qda_fit, validation)
confusionMatrix(qda_predict, validation$condition)
7.9 K-NN: K-NEAREST NEIGHBORS CLASSIFIER:
5-fold cross-validation was used, and tuning was done on all the algorithms discussed from here on to avoid over-training them.

ctrl <- trainControl(method = "cv", verboseIter = FALSE, number = 5)
tic(msg = " Total time for KNN :: ")
knnFit <- train(condition ~ ., data = train_set, method = "knn",
                preProcess = c("center", "scale"), trControl = ctrl,
                tuneGrid = expand.grid(k = seq(1, 20, 2)))
plot(knnFit)
toc()
knnPredict <- predict(knnFit, newdata = validation)
knn_results <- confusionMatrix(knnPredict, validation$condition)
knn_results
7.10 SVM: SUPPORT-VECTOR MACHINES:

############################
# SVM
############################

ctrl <- trainControl(method = "cv", verboseIter = FALSE, number = 5)
grid_svm <- expand.grid(C = c(0.01, 0.1, 1, 10, 20))
tic(msg = " Total time for SVM :: ")
svm_fit <- train(condition ~ ., data = train_set, method = "svmLinear",
                 preProcess = c("center", "scale"),
                 tuneGrid = grid_svm, trControl = ctrl)
plot(svm_fit)
toc()
svm_predict <- predict(svm_fit, newdata = validation)
svm_results <- confusionMatrix(svm_predict, validation$condition)
svm_results
7.11 RF: RANDOM FOREST:

############################
# RF
############################

control <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
grid <- data.frame(mtry = seq(1, 10, 2))
tic(msg = " Total time for RF :: ")
rf_fit <- train(condition ~ ., method = "rf", data = train_set, ntree = 20,
                trControl = control, tuneGrid = grid)
plot(rf_fit)
toc()
rf_predict <- predict(rf_fit, newdata = validation)
rf_results <- confusionMatrix(rf_predict, validation$condition)
rf_results
7.12 GBM: GRADIENT BOOSTING MACHINE:

############################
# GBM
############################

gbmGrid <- expand.grid(interaction.depth = c(1, 5, 10, 25, 30),
                       n.trees = c(5, 10, 25, 50),
                       shrinkage = c(0.1, 0.2, 0.3, 0.4, 0.5),
                       n.minobsinnode = 20)
tic(msg = " Total time for GBM :: ")
gbm_fit <- train(condition ~ ., method = "gbm", data = train_set,
                 trControl = control, verbose = FALSE, tuneGrid = gbmGrid)
plot(gbm_fit)
toc()
gbm_predict <- predict(gbm_fit, newdata = validation)
gbm_results <- confusionMatrix(gbm_predict, validation$condition)
gbm_results
CONCLUSION
Heart disease is a major killer in India and throughout the world, so applying a promising technology like machine learning to the early prediction of heart disease can have a profound impact on society. Early prognosis of heart disease can aid decisions about lifestyle changes in high-risk patients and in turn reduce complications, which would be a great milestone in the field of medicine. The number of people facing heart disease rises each year, which calls for early diagnosis and treatment. Suitable technology support in this regard can prove highly beneficial to the medical fraternity and to patients. In this paper, the seven machine learning algorithms used to measure performance are SVM, Decision Tree, Random Forest, Naïve Bayes, Logistic Regression, Adaptive Boosting, and Extreme Gradient Boosting, applied to the dataset.
FUTURE ENHANCEMENT

The attributes expected to lead to heart disease in patients are available in the dataset, which contains 76 features; the 14 important features that are useful for evaluating the system are selected from among them. If all the features are taken into consideration, the resulting efficiency of the system is lower, so attribute selection is performed: the n features that give the best accuracy are selected for evaluating the model. Some features in the dataset are almost perfectly correlated with one another, so one feature from each such pair is removed. If all the attributes present in the dataset were taken into account, the efficiency would decrease considerably.
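The correlation-based pruning described above can be sketched in base R. The toy data frame, the 0.9 cutoff, and the rule of dropping one column from each highly correlated pair are illustrative assumptions, not the project's actual procedure (caret's findCorrelation implements a more principled variant of the same idea):

```r
# Illustrative toy data (assumed, not the Cleveland dataset):
# x2 is almost a copy of x1, x3 is an independent feature
set.seed(1)
x1  <- rnorm(100)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(100, sd = 0.01),  # nearly duplicates x1
                  x3 = rnorm(100))

# absolute correlations, keeping each pair only once (lower triangle)
cor_mat <- abs(cor(toy))
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- 0

# drop the first column of every pair whose correlation exceeds the cutoff
drop_cols <- apply(cor_mat > 0.9, 2, any)
reduced   <- toy[, !drop_cols, drop = FALSE]
names(reduced)  # x1 is dropped; x2 and x3 remain
```

Applied to the project's 76-feature table, the same idea keeps only weakly correlated attributes, matching the motivation stated above.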

The accuracies of all seven machine learning methods are compared, and based on this comparison one prediction model is generated. Hence, the aim is to use evaluation metrics such as the confusion matrix, accuracy, precision, recall, and F1-score to predict the disease efficiently. Comparing all seven, the extreme gradient boosting classifier gives the highest accuracy, at 81%.
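The metrics named above can be computed directly from a 2x2 confusion matrix. The sketch below uses made-up label vectors, not the project's predictions, purely to show the arithmetic:

```r
# Hypothetical true labels and predictions (1 = disease, 0 = no disease)
truth <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 1)
pred  <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)

# confusion matrix and its four cells
cm <- table(True = truth, Predicted = pred)
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["0", "1"]; FN <- cm["1", "0"]

accuracy  <- (TP + TN) / sum(cm)                            # 0.7
precision <- TP / (TP + FP)                                 # 0.8
recall    <- TP / (TP + FN)                                 # 0.667
f1        <- 2 * precision * recall / (precision + recall)  # 0.727

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
```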
PHOTOS

➢ The proposed model developed to predict heart disease: