Heart Disease Diagnosis Prediction Using Machine Learning and Data Analytics Approach.
ABSTRACT
Machine Learning in the R language is used across the world, and the
healthcare industry is no exception. Machine Learning can play an
essential role in predicting the presence or absence of locomotor
disorders, heart disease and more. Such information, if predicted
well in advance, can provide important insights to doctors, who can
then adapt their diagnosis and treatment on a per-patient basis. This
project works on predicting possible heart disease in people using
Machine Learning algorithms. We perform a comparative analysis of
classifiers such as Decision Tree, Naïve Bayes, Logistic Regression, SVM
and Random Forest, and propose an ensemble classifier that performs
hybrid classification by combining strong and weak classifiers, since it
can use multiple samples for training and validating the data. We then
compare the existing classifiers with the proposed boosted classifiers,
AdaBoost and XGBoost, which give accurate results and aid in
predictive analysis.
❖ LIST OF ABBREVIATIONS:
1. LDA: Linear Discriminant Analysis
2. QDA: Quadratic Discriminant Analysis
3. K-NN: k-Nearest Neighbours
4. SVM: Support Vector Machines
5. RF: Random Forest
6. GBM: Gradient Boosting Machine
7. EDA: Exploratory Data Analysis
➢ SOFTWARE REQUIREMENTS:
3.1 EVOLUTION OF R :
• R was initially written by Ross Ihaka and Robert Gentleman at the
Department of Statistics of the University of Auckland in
Auckland, New Zealand. R made its first appearance in 1993.
• A large group of individuals has contributed to R by sending code
and bug reports.
• Since mid-1997 there has been a core group (the "R Core Team")
who can modify the R source code archive.
3.2 R VERSION :
R version 4.2.1 (2022-06-23 ucrt) -- "Funny-Looking Kid"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
3.3 FEATURES OF R :
As stated earlier, R is a programming language and software
environment for statistical analysis, graphics representation and
reporting. The following are the important features of R −
I. INTERFACE:
RStudio is a well-designed, intuitive and user-friendly application. Its
design is minimalistic and clean, and the user can easily navigate
through the menus. The application provides a wide range of features,
mainly categorised into Data Science, Visualization and Administration.
The interface is based on a "scratchpad" principle, whereby the user can
create their own projects or start from one of the many templates that
are available.
II. USABILITY:
III. FUNCTIONALITY:
4.1 PACKAGES IN R:
▪ library(readr):
The goal of readr is to provide a fast and friendly way to read
rectangular data from delimited files, such as comma-separated
values (CSV) and tab-separated values (TSV). It is designed to
parse many types of data found in the wild, while providing an
informative problem report when parsing leads to unexpected
results. If you are new to readr, the best place to start is the data
import chapter in R for Data Science. A short package-loading and
data-import sketch is given after this package list.
▪ library(tidyverse):
▪ library(broom):
broom summarizes the key information from statistical objects (such as
lm and glm model fits) in tidy tibbles through its tidy(), glance() and
augment() functions, which makes model results easy to combine,
reshape and report alongside the rest of a tidyverse workflow.
▪ library(dplyr):
▪ library(caret):
▪ library(lubridate):
▪ library(tidytext):
Using tidy data principles can make many text mining tasks easier,
more effective, and consistent with tools already in wide use. Much of
the infrastructure needed for text mining with tidy data frames already
exists in packages like dplyr, broom, tidyr, and ggplot2.
▪ library("RColorBrewer"):
▪ library(randomForest):
▪ library(tictoc):
▪ library(e1071):
▪ library(ggpubr):
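As a quick illustration of how these packages come together at the start of the analysis, here is a minimal loading-and-import sketch; the file name heart_cleveland.csv is an assumption for illustration, not taken from the original script:

# load the packages used throughout the analysis
library(tidyverse)
library(caret)
library(randomForest)
library(e1071)
library(tictoc)

# read the Cleveland heart-disease data (file name assumed for illustration)
heart_disease_data <- read_csv("heart_cleveland.csv")
glimpse(heart_disease_data)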
The outcome variable class has more than two levels. According to the
codebook, any non-zero value can be coded as an "event", so we create
a new binary 1/0 outcome variable, hd, in the Cleveland_hd data frame.
There are a few other categorical/discrete variables in the dataset; we
also convert sex into a factor for the next step of the analysis,
otherwise R will treat it as continuous by default.
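A minimal sketch of this recoding step, assuming the raw outcome column is named class and that sex is coded 0 = female, 1 = male as in the standard Cleveland codebook:

library(dplyr)

Cleveland_hd <- Cleveland_hd %>%
  # any non-zero value of the original outcome counts as an "event"
  mutate(hd = ifelse(class > 0, 1, 0)) %>%
  # convert sex to a factor so R does not treat it as continuous
  mutate(sex = factor(sex, levels = c(0, 1), labels = c("Female", "Male")))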
NAME TYPE DESCRIPTION
Age Continuous Age in years
The raw glm coefficient table (the ‘estimate’ column in the printed
output) in R represents the log(Odds Ratios) of the outcome. Therefore,
we need to convert the values to the original OR scale and calculate the
corresponding 95% Confidence Interval (CI) of the estimated Odds
Ratios when reporting results from a logistic regression.
library(tidyverse)

# Does age have an effect? Age is continuous, so we use a t-test
hd_age <- t.test(age ~ hd, data = Cleveland_hd)
print(hd_age)

# Does maximum heart rate have an effect? (column name thalach assumed; also continuous, so a t-test)
hd_heartrate <- t.test(thalach ~ hd, data = Cleveland_hd)
print(hd_heartrate)
▪ PUTTING ALL THREE VARIABLES IN ONE MODEL :
# use glm function from base R and specify the family argument as binomial
library(broom)
exp(tidy_m$estimate)

# create a decision rule using probability 0.5 as cutoff and save the predicted decision into the main data frame

# create a newdata data frame to save a new case's information
newdata <-

# predict probability for this new case and print out the predicted value
p_new <- predict(model, newdata, type = "response")
p_new
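A minimal end-to-end sketch of this step, assuming the model regresses hd on age, sex and maximum heart rate (thalach), and using made-up values for the new case:

# fit the logistic regression (formula assumed)
model <- glm(hd ~ age + sex + thalach, data = Cleveland_hd, family = "binomial")

# tidy the coefficients and convert log(OR) estimates to Odds Ratios with 95% CIs
library(broom)
tidy_m <- tidy(model, conf.int = TRUE)
tidy_m$OR     <- exp(tidy_m$estimate)
tidy_m$OR_low <- exp(tidy_m$conf.low)
tidy_m$OR_up  <- exp(tidy_m$conf.high)

# decision rule: predicted probability >= 0.5 is classified as disease
Cleveland_hd$pred_prob <- predict(model, Cleveland_hd, type = "response")
Cleveland_hd$pred_hd   <- ifelse(Cleveland_hd$pred_prob >= 0.5, 1, 0)

# hypothetical new patient (values chosen only for illustration)
newdata <- data.frame(age = 45, sex = factor("Male", levels = c("Female", "Male")), thalach = 150)
p_new <- predict(model, newdata, type = "response")
p_new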
▪ MODEL PERFORMANCE METRICS:
# load Metrics package
library(Metrics)
classification_error <- ce(Cleveland_hd$hd, Cleveland_hd$pred_hd)
print(paste("Classification Error =", classification_error))
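Beyond the classification error, the Metrics package also offers accuracy() and auc() helpers; a short sketch, reusing the predicted labels and the predicted probabilities assumed in the sketch above:

# overall accuracy of the 0.5-cutoff classifier
accuracy(Cleveland_hd$hd, Cleveland_hd$pred_hd)

# area under the ROC curve, computed from the predicted probabilities
auc(Cleveland_hd$hd, Cleveland_hd$pred_prob)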
# Recode hd to be labelled
Cleveland_hd <- Cleveland_hd %>% mutate(hd_labelled = ifelse(hd == 0, "No disease", "Disease"))
# boxplot of age (y-variable assumed) by disease status
ggplot(Cleveland_hd, aes(x = hd_labelled, y = age)) + geom_boxplot()
####################################################
# condition: 0 - no disease
#            1 - disease
####################################################

####################################################
# chest pain type: 0 - typical angina
#                  1 - atypical angina
#                  2 - non-anginal pain
#                  3 - asymptomatic
####################################################

####################################################
# Age vs. Sex map
####################################################
  ggtheme = theme_bw()) +
  scale_fill_viridis_c(option = "C") +
  theme(axis.text.x = element_text(angle = 90, size = 10)) +
  ggtitle("Age vs. Sex Map") +
  labs(fill = "Condition")
▪ The plot below is the same as the one above, except that the y-axis is the chest
pain type and the colour represents sex rather than condition:
####################################################
####################################################
  ggtheme = theme_bw()) +
  scale_fill_viridis_c(option = "C") +
  theme(axis.text.x = element_text(angle = 90, size = 10)) +
  ggtitle("Age vs. Chest Pain Map") +
  labs(fill = "sex")
7.6 DISEASE PREDICTION SETUP:
set.seed(2020, sample.kind = "Rounding")
# Divide into train and validation dataset
test_index <- createDataPartition(y = heart_disease_data$condition, times = 1, p = 0.2, list = FALSE)
train_set  <- heart_disease_data[-test_index, ]
validation <- heart_disease_data[test_index, ]
################################
# LDA Analysis
###############################
lda_fit <- train(condition ~ ., method = "lda", data = train_set)
lda_predict <- predict(lda_fit, validation)
################################
# QDA Analysis
###############################
qda_fit <- train(condition ~ ., method = "qda", data = train_set)   # mirrors the LDA setup
qda_predict <- predict(qda_fit, validation)
confusionMatrix(qda_predict, validation$condition)
7.9 K-NN: K-NEAREST NEIGHBORS CLASSIFIER:
Five-fold cross-validation was used, and tuning was carried out on this
and all of the following algorithms to avoid over-fitting.
ctrl <- trainControl(method = "cv", verboseIter = FALSE, number = 5)
knn_fit <- train(condition ~ ., method = "knn", data = train_set, trControl = ctrl,
                 tuneGrid = data.frame(k = seq(3, 51, 2)))   # tuning grid assumed
knnPredict <- predict(knn_fit, newdata = validation)
knn_results <- confusionMatrix(knnPredict, validation$condition)
knn_results
7.10 SVM: SUPPORT-VECTOR MACHINES:
############################
# SVM
############################
tic(msg= " Total time for SVM :: ") svm_fit <- train(condition ~ .,data =
confusionMatrix(svm_predict, validation$condition)
svm_results
7.11 RF: RANDOM FOREST:
############################
# RF
############################
control <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
grid <- data.frame(mtry = seq(1, 10, 2))
tic(msg = " Total time for rf :: ")
rf_fit <- train(condition ~ ., method = "rf", data = train_set, ntree = 20,
                trControl = control, tuneGrid = grid)
toc()
rf_predict <- predict(rf_fit, newdata = validation)
rf_results <- confusionMatrix(rf_predict, validation$condition)
rf_results
7.12 GBM: GRADIENT BOOSTING MACHINE:
############################
# GBM
############################
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 10, 25, 30),
                       n.trees = c(5, 10, 25, 50),
                       shrinkage = c(0.1, 0.2, 0.3, 0.4, 0.5),
                       n.minobsinnode = 20)
tic(msg = " Total time for GBM :: ")
gbm_fit <- train(condition ~ ., method = "gbm", data = train_set,
                 trControl = control, verbose = FALSE, tuneGrid = gbmGrid)
plot(gbm_fit)
toc()
gbm_predict <- predict(gbm_fit, newdata = validation)
gbm_results <- confusionMatrix(gbm_predict, validation$condition)
gbm_results
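With all the confusionMatrix objects in hand, their overall accuracies can be collected into a single comparison table; a small sketch using the *_results objects defined above (the tibble layout is an assumption, not part of the original script):

library(tibble)
library(dplyr)

# pull the overall accuracy out of each caret confusionMatrix object
model_accuracy <- tibble(
  model    = c("KNN", "SVM", "Random Forest", "GBM"),
  accuracy = c(knn_results$overall["Accuracy"],
               svm_results$overall["Accuracy"],
               rf_results$overall["Accuracy"],
               gbm_results$overall["Accuracy"])
)
model_accuracy %>% arrange(desc(accuracy))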
Heart disease is a major killer in India and throughout the world, so
the application of promising technology such as machine learning to
the early prediction of heart disease can have a profound impact on
society. Early prognosis of heart disease can aid decisions on lifestyle
changes for high-risk patients and in turn reduce complications, which
would be a great milestone in the field of medicine. The number of
people facing heart disease rises each year, which calls for early
diagnosis and treatment. The use of suitable technological support in
this regard can prove highly beneficial to the medical fraternity and to
patients. In this paper, the seven machine learning algorithms used to
measure performance on the dataset are SVM, Decision Tree, Random
Forest, Naïve Bayes, Logistic Regression, Adaptive Boosting (AdaBoost)
and Extreme Gradient Boosting (XGBoost).
FUTURE ENHANCEMENT