Heart Disease Prediction Using Machine Learning in R
OBJECTIVES
LIST OF ABBREVIATIONS:
SYSTEM SPECIFICATIONS
HARDWARE REQUIREMENTS
SOFTWARE REQUIREMENTS
1 INTRODUCTION
Early diagnosis of heart disease plays a vital role in guiding lifestyle changes
for high-risk patients, which in turn reduces complications. This project aims to
predict heart disease by analyzing patient data and classifying whether each
patient has heart disease or not, using machine learning algorithms.
The major challenge in heart disease is its detection. Instruments that can
predict heart disease are available, but they are either expensive or not
efficient at estimating a person's chance of heart disease. Early detection of
cardiac diseases can decrease the mortality rate and overall complications.
However, it is not possible to monitor every patient accurately each day, and
round-the-clock consultation with a doctor is not available, since it requires
considerable knowledge, time and expertise. Since a good amount of data is
available today, we can use various machine learning algorithms in R to analyze
the data for hidden patterns. These hidden patterns can be used for health
diagnosis in medical data.
1.2 MOTIVATION FOR THE WORK
optimum performance than KStar, Multilayer perceptron and J48 techniques using
k-fold cross-validation. The accuracy achieved by those algorithms is still not
satisfactory, so improving it further would support better decisions in
diagnosing the disease.
2 PROJECT DESCRIPTION
Heart disease is perceived as the deadliest disease affecting human life across
the world. In this type of disease, the heart is not capable of pumping the
required quantity of blood to the other organs of the body for them to carry out
their regular functions. Some of the symptoms of heart disease include physical
weakness, difficulty breathing, swollen feet, etc. Techniques to identify
complicated heart diseases are essential, because these diseases carry a high
risk and in turn affect human life. At present, diagnosis and treatment are
highly challenging due to the shortage of physicians and diagnostic apparatus,
which affects the treatment of heart patients.
Heart disease prediction is done using detailed clinical data that can assist
experts in making decisions. Human life is highly dependent on the proper
functioning of the blood vessels in the heart. Improper blood circulation causes
heart inactiveness, kidney failure, imbalanced brain function, and even
immediate death. Some of the risk factors that can cause heart disease are
obesity, smoking, diabetes, high blood pressure, high cholesterol, lack of
physical activity and an unhealthy diet.
An increase in the number of white blood cells causes inflammation and other
subsequent disorders such as stroke or reinfarction. Generally, there are two
stages of wound healing in terms of monocytes and macrophages, namely the
inflammatory and reparative stages. Both stages are necessary for proper wound
healing, but if the inflammation continues for too long, it leads to heart
failure.
An unusual type of heart disease is acute spasm or contraction of the coronary
arteries. The spasms appear in the arteries suddenly, with no sign of
atherosclerosis, and block the blood flow, which deprives the heart of oxygen.
Men are more likely to experience a heart attack than women. Moreover, women can
experience pain for more than an hour, whereas in men the pain normally lasts
less than an hour. Cardiovascular disease affects the complete physiological
system, not only the heart; changes occur everywhere, including in remote organs
such as the bone marrow and spleen.
3 SOFTWARE DESCRIPTION
R is a programming language and free software developed by Ross Ihaka and
Robert Gentleman in 1993. R possesses an extensive catalog of statistical and
graphical methods. It includes machine learning algorithms, simple and linear
regression, and statistical methods. Most R libraries are written in R, but for
heavy computational tasks, C, C++, and Fortran code is preferred.
3.1 EVOLUTION OF R
A large group of individuals has contributed to R by sending code and bug reports.
Since mid-1997 there has been a core group (the "R Core Team") who can modify
the R source code archive.
3.2 R VERSION
Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an
HTML browser interface to help. Type 'q()' to quit R.
3.3 FEATURES OF R
R provides a suite of operators for calculations on arrays, lists, vectors and
matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display, either directly on
the computer or printed on paper.
its amazing visualizations, which are very important in a data science
programming language.
INTERFACE
RStudio is a well-designed, intuitive, user-friendly application. The design is
minimalistic and clean, and the user can easily navigate through the menus. The
application provides a wide range of features that are mainly categorized into
Data Science, Visualization and Administration. It is based on a "scratchpad"
principle, where users can create their own projects or start with one of the
many available templates. The application also provides code completion, which
makes it easier to write code.
USABILITY
FUNCTIONALITY
RStudio is a very powerful IDE for R. It has many features that make the
application more comfortable to use. You can use it to open a project, which is a
folder with a collection of related documents that make up a complete work
session. You can also use it to open a file, which is a single document
containing a collection of related data or text.
SUPPORT
4 PACKAGES IN R PROGRAMMING
A package is an appropriate way to organize the work and share it with
others. Typically, a package will include code (not only R code!), documentation
for the package and the functions inside, some tests to check everything works as it
should, and data sets.
4.1 PACKAGES IN R
library(readr):
The goal of readr is to provide a fast and friendly way to read rectangular
data from delimited files, such as comma-separated values (CSV) and tab-
separated values (TSV). It is designed to parse many types of data found in the
wild, while providing an informative problem report when parsing leads to
unexpected results. If you are new to readr, the best place to start is the data import
chapter in R for Data Science.
library(tidyverse):
These Tidyverse packages were specially designed for Data Science with a
common design philosophy. They include all the packages required in the data
science workflow, ranging from data exploration to data visualization.
library(broom):
to regression coefficients. This can be useful if you want to inspect a model or
create custom visualizations.
library(Metrics):
The Metrics package implements evaluation metrics that are commonly used in
supervised machine learning, such as accuracy and the area under the ROC curve
(AUC). In this project it is used to compute the AUC and accuracy of the fitted
model.
library(dslabs)
Datasets and functions that can be used for data analysis practice, homework
and projects in data science courses and workshops. 26 datasets are available for
case studies in data visualization, statistical inference, modeling, linear
regression, data wrangling and machine learning.
library(dplyr)
library(caret)
The caret package is a comprehensive framework for building machine learning
models in R. It covers the step-by-step process of building predictive models.
Be it a decision tree or xgboost, caret helps to find the optimal model in the
shortest possible time.
library(lubridate)
library(tidytext)
Using tidy data principles can make many text mining tasks easier, more
effective, and consistent with tools already in wide use. Much of the infrastructure
needed for text mining with tidy data frames already exists in packages like dplyr,
broom, tidyr, and ggplot2. In this package, we provide functions and supporting
data sets to allow conversion of text to and from tidy formats, and to switch
seamlessly between tidy tools and existing text mining packages.
library("RColorBrewer")
blind person to visualize the data as well as a person with normal color vision,
we have to use the right color palette.
library(randomForest)
library(tictoc)
library(e1071)
library(ggpubr)
ggpubr: 'ggplot2' Based Publication Ready Plots. The 'ggplot2' package is
excellent and flexible for elegant data visualization in R. However, the default
generated plots require some formatting before we can send them for publication.
5 DATA VISUALIZATION
The outcome variable class has more than two levels. According to the
codebook, any non-zero value can be coded as an “event.” We create a new
variable called “hd” in the Cleveland_hd data frame to represent a binary 1/0
outcome. There are a few other categorical/discrete variables in the dataset. We
also convert sex into a ‘factor’ for the next step of the analysis; otherwise, R
will treat it as continuous by default.
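The recoding described above can be sketched in base R as follows. This is only an illustration: the five rows are made-up values, the outcome column name hd follows the Cleveland_hd$hd usage in the source code section, and the coding 1 = male, 0 = female is taken from the UCI codebook.

```r
# Hypothetical rows that mimic the Cleveland coding
# (class: 0 = no disease, 1-4 = disease; sex: 1 = male, 0 = female).
Cleveland_hd <- data.frame(
  age   = c(63, 67, 41, 56, 57),
  sex   = c(1, 1, 0, 1, 0),
  class = c(0, 2, 0, 1, 4)
)

# Any non-zero class counts as an event, so collapse it to a binary 1/0 outcome.
Cleveland_hd$hd <- ifelse(Cleveland_hd$class > 0, 1, 0)

# Store sex as a factor so that R does not treat it as a continuous number.
Cleveland_hd$sex <- factor(Cleveland_hd$sex, levels = 0:1,
                           labels = c("Female", "Male"))

str(Cleveland_hd)
```

Keeping the original class column alongside hd makes it easy to check the recoding against the raw values.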
Trestbps    Continuous    Resting blood pressure (in mm Hg)
Chol        Continuous    Serum cholesterol (in mg/dl)
4 = More likely to have heart disease
Use statistical tests to see which predictors are related to heart disease. We
can explore the associations for each variable in the dataset. Depending on the type
of the data (i.e., continuous or categorical), we use t-test or chi-squared test to
calculate the p-values.
The plots and the statistical tests both confirmed that all three variables
(age, sex, and maximum heart rate) are highly significantly associated with our
outcome (p<0.001 for all tests).
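As a rough sketch of these tests, the code below runs a chi-squared test for a categorical predictor and t-tests for continuous predictors. The data are simulated, not the Cleveland data, and the column name thalach (maximum heart rate, as named in the UCI dataset) is an assumption about how the real data frame is laid out.

```r
set.seed(1)
n  <- 200
hd <- rbinom(n, 1, 0.45)  # simulated binary outcome, not real patients

sim <- data.frame(
  hd      = hd,
  sex     = factor(rbinom(n, 1, 0.5 + 0.2 * hd), levels = 0:1,
                   labels = c("Female", "Male")),
  age     = rnorm(n, mean = 52 + 4 * hd, sd = 9),
  thalach = rnorm(n, mean = 158 - 18 * hd, sd = 20)
)

# Categorical predictor vs. binary outcome: chi-squared test.
hd_sex <- chisq.test(sim$sex, sim$hd)

# Continuous predictors vs. binary outcome: two-sample t-tests.
hd_age       <- t.test(age ~ hd, data = sim)
hd_heartrate <- t.test(thalach ~ hd, data = sim)

# Collect the p-values to see which are below 0.05.
c(sex = hd_sex$p.value, age = hd_age$p.value, thalach = hd_heartrate$p.value)
```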
The glm command is designed to fit generalized linear models (regressions) on
binary outcome data, count data, probability data, proportion data, and many
other data types. In our case, the outcome is binary, following a binomial
distribution.
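A minimal sketch of such a model is shown below, fitted on simulated data with age, sex and maximum heart rate as predictors. The data, the coefficients used to simulate them, and the column name thalach are assumptions for illustration only.

```r
set.seed(2)
n <- 300
sim <- data.frame(
  age     = round(rnorm(n, 54, 9)),
  sex     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  thalach = round(rnorm(n, 150, 22))
)

# Simulate a binary outcome from an assumed linear predictor on the logit scale.
lp <- -1 + 0.04 * (sim$age - 54) + 1.2 * (sim$sex == "Male") -
  0.03 * (sim$thalach - 150)
sim$hd <- rbinom(n, 1, plogis(lp))

# family = "binomial" requests logistic regression for the 1/0 outcome.
model <- glm(hd ~ age + sex + thalach, data = sim, family = "binomial")
summary(model)
```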
The raw glm coefficient table (the ‘estimate’ column in the printed output)
in R represents the log(Odds Ratios) of the outcome. Therefore, we need to convert
the values to the original OR scale and calculate the corresponding 95%
Confidence Interval (CI) of the estimated Odds Ratios when reporting results from
a logistic regression.
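The conversion can be sketched in base R by exponentiating each estimate and the ends of its Wald interval, estimate ± 1.96 × standard error. The model here is fitted on simulated data, and the base R approach is a stand-in for whatever the source code does with the broom output, which is not fully visible in this report.

```r
set.seed(2)
n <- 300
sim <- data.frame(
  age     = round(rnorm(n, 54, 9)),
  sex     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  thalach = round(rnorm(n, 150, 22))
)
lp <- -1 + 0.04 * (sim$age - 54) + 1.2 * (sim$sex == "Male") -
  0.03 * (sim$thalach - 150)
sim$hd <- rbinom(n, 1, plogis(lp))
model <- glm(hd ~ age + sex + thalach, data = sim, family = "binomial")

# The coefficient table holds log(OR) estimates and their standard errors.
est <- coef(summary(model))

# Exponentiate to move from the log(OR) scale to the OR scale.
or_table <- data.frame(
  term     = rownames(est),
  OR       = exp(est[, "Estimate"]),
  lower_CI = exp(est[, "Estimate"] - 1.96 * est[, "Std. Error"]),
  upper_CI = exp(est[, "Estimate"] + 1.96 * est[, "Std. Error"]),
  row.names = NULL
)
print(or_table)
```

An OR above 1 indicates higher odds of the outcome as the predictor increases, and an interval that excludes 1 corresponds to a significant association at roughly the 5% level.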
So far, we have built a logistic regression model and examined the model
coefficients/ORs. We may wonder how we can use this model to predict a person’s
likelihood of having heart disease given his/her age, sex, and maximum heart
rate. Furthermore, we’d like to translate the predicted probability into a
decision rule for clinical use by defining a cutoff value on the probability
scale. In practice, when an individual comes in for a health check-up, the doctor
would like to know the predicted probability of heart disease for specific values
of the predictors, for example a 45-year-old female with a max heart rate of 150.
To do that, we create a data frame called newdata, in which we include the
desired values for our prediction.
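The prediction step can be sketched as follows. The model is fitted on simulated data, so the printed probability will not match the value reported for the real dataset; the column name thalach is again an assumption.

```r
set.seed(2)
n <- 300
sim <- data.frame(
  age     = round(rnorm(n, 54, 9)),
  sex     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  thalach = round(rnorm(n, 150, 22))
)
lp <- -1 + 0.04 * (sim$age - 54) + 1.2 * (sim$sex == "Male") -
  0.03 * (sim$thalach - 150)
sim$hd <- rbinom(n, 1, plogis(lp))
model <- glm(hd ~ age + sex + thalach, data = sim, family = "binomial")

# One row per new case; column names and factor levels must match the model.
newdata <- data.frame(
  age     = 45,
  sex     = factor("Female", levels = levels(sim$sex)),
  thalach = 150
)

# type = "response" returns a probability instead of a log-odds value.
p_new <- predict(model, newdata, type = "response")
p_new
```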
Once these metrics are calculated, we can see (from the logistic regression OR
table) that older age, being male and having a lower max heart rate are all risk
factors for heart disease. We can also apply our model to predict the probability
of having heart disease. For a 45-year-old female who has a max heart rate of
150, our model generated a heart disease probability of 0.177, indicating a low
risk of heart disease.
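For illustration, the sketch below applies a 0.5 cutoff to the predicted probabilities and computes accuracy, AUC and a confusion matrix in base R. The source code loads the Metrics package for this step; the rank-based AUC formula used here is a swapped-in base R equivalent, and the data are simulated.

```r
set.seed(2)
n <- 300
sim <- data.frame(
  age     = round(rnorm(n, 54, 9)),
  sex     = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  thalach = round(rnorm(n, 150, 22))
)
lp <- -1 + 0.04 * (sim$age - 54) + 1.2 * (sim$sex == "Male") -
  0.03 * (sim$thalach - 150)
sim$hd <- rbinom(n, 1, plogis(lp))
model <- glm(hd ~ age + sex + thalach, data = sim, family = "binomial")

# Decision rule: predict disease when the fitted probability is at least 0.5.
pred_prob <- predict(model, type = "response")
pred_hd   <- ifelse(pred_prob >= 0.5, 1, 0)

# Accuracy: share of cases where the decision matches the observed outcome.
accuracy <- mean(pred_hd == sim$hd)

# AUC via the rank (Mann-Whitney) formula: the probability that a random
# diseased case receives a higher score than a random non-diseased case.
r   <- rank(pred_prob)
n1  <- sum(sim$hd == 1)
n0  <- sum(sim$hd == 0)
auc <- (sum(r[sim$hd == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(paste("AUC=", round(auc, 3)))
print(paste("Accuracy=", round(accuracy, 3)))

# Confusion matrix of observed vs. predicted classes.
table(Actual = sim$hd, Predicted = pred_hd)
```

These figures are computed on the same data used to fit the model, so they are optimistic compared with performance on held-out data.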
The analysis below shows disease prediction using various ML algorithms. The
outcome has been defined as a binary classification variable, and several
classification algorithms have been used to predict it and compare their
accuracy. This is just a comparison study, and the reasoning behind the choice of
these algorithms has not been the focus of this study.
In addition to p-values from statistical tests, we can plot the age, sex, and
maximum heart rate distributions with respect to our outcome variable. This will
give us a sense of both the direction and magnitude of the relationship.
6 SOURCE CODE
library(readr)
head(Cleveland_hd,5)
IDENTIFYING IMPORTANT CLINICAL VARIABLES
library(tidyverse)
hd_sex <- chisq.test(Cleveland_hd$hd, Cleveland_hd$sex)
# Print the results to see if p<0.05.
print(hd_sex)
print(hd_age)
print(hd_heartrate)
# use glm function from base R and specify the family argument as binomial
summary(model)
EXTRACTING USEFUL INFORMATION FROM THE MODEL OUTPUT
library(broom)
tidy_m
# calculate OR
tidy_m
PREDICTED PROBABILITIES FROM OUR MODEL
# get the predicted probability in our dataset using the predict() function
# create a decision rule using probability 0.5 as cutoff and save the predicted
# decision into the main data frame
# predict probability for this new case and print out the predicted value
p_new
MODEL PERFORMANCE METRICS
library(Metrics)
print(paste("AUC=", auc))
print(paste("Accuracy=", accuracy))
# confusion matrix
7 GRAPHICAL OUTPUT
# Recode hd to be labelled
# age vs hd
7.2 MAX HEART RATE VS HD
7.3 DISEASE DISTRIBUTION FOR AGE
####################################################
# 0 - no disease
# 1 - disease
####################################################
theme_bw() +
7.4 CHEST PAIN TYPE FOR DISEASED PEOPLE
####################################################
####################################################
theme_bw() +
ggtitle("Age vs. Count (disease only) for various chest pain conditions") +
7.5 CONDITION SEX WISE
Yellow means disease and blue means no disease, and each circle is a
data point.
We can see that the male count is much higher than the female count, and
that the male population has more cases of disease than the female
population. Also, the disease seems more prevalent at high cholesterol
values.
####################################################
####################################################
ggtheme = theme_bw()) +
scale_fill_viridis_c(option = "C") +
The plot below is the same as the one above, except that the y-axis is
the chest pain type and the color is sex rather than
condition.
####################################################
####################################################
ggtheme = theme_bw()) +
scale_fill_viridis_c(option = "C") +
7.6 DISEASE PREDICTION SETUP
set.seed(2020, sample.kind = "Rounding")
# Divide into train and validation dataset
test_index <- createDataPartition(y = heart_disease_data$condition,
times = 1, p = 0.2, list= FALSE)
train_set <- heart_disease_data[-test_index, ]
validation <- heart_disease_data[test_index, ]
################################
# LDA Analysis
###############################
confusionMatrix(lda_predict, validation$condition)
7.8 QDA: QUADRATIC DISCRIMINANT ANALYSIS
################################
# QDA Analysis
###############################
confusionMatrix(qda_predict, validation$condition)
7.9 K-NN: K-NEAREST NEIGHBORS CLASSIFIER
5-fold cross-validation was used, and tuning was done for all of the following
algorithms discussed here, to avoid over-training them.
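To make the idea concrete, here is a hand-rolled sketch of 5-fold cross-validation used to tune k for a k-NN classifier. It uses knn() from the class package instead of the caret workflow in the report, and the data, the predictor names and the grid of k values are assumptions for illustration.

```r
library(class)  # provides knn(); a recommended package shipped with R

set.seed(2020)
n <- 200
y <- rbinom(n, 1, 0.5)            # simulated outcome, not the real data
condition <- factor(y)

# Two simulated predictors, scaled once for brevity (a stricter version
# would compute the scaling inside each training fold).
x <- scale(data.frame(
  age     = rnorm(n, 52 + 5 * y, 9),
  thalach = rnorm(n, 158 - 18 * y, 20)
))

# Assign every row to one of 5 folds at random.
folds  <- sample(rep(1:5, length.out = n))
k_grid <- seq(3, 21, by = 2)

# For each k, average the held-out accuracy over the 5 folds.
cv_accuracy <- sapply(k_grid, function(k) {
  mean(sapply(1:5, function(f) {
    held_out <- folds == f
    pred <- knn(train = x[!held_out, ], test = x[held_out, ],
                cl = condition[!held_out], k = k)
    mean(pred == condition[held_out])
  }))
})

best_k <- k_grid[which.max(cv_accuracy)]
data.frame(k = k_grid, cv_accuracy = cv_accuracy)
best_k
```

Because each accuracy is measured on rows the model did not see during fitting, choosing k this way guards against picking a value that only fits the training data well.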
plot(knnFit)
toc()
knn_results
7.10 SVM: SUPPORT-VECTOR MACHINES
############################
# SVM
############################
plot(svm_fit)
toc()
svm_results
7.11 RF: RANDOM FOREST
############################
# RF
############################
rf_fit <- train(condition ~ ., method = "rf", data = train_set, ntree = 20,
                trControl = control,
                tuneGrid = grid)
plot(rf_fit)
toc()
rf_results
7.12 GBM: GRADIENT BOOSTING MACHINE
############################
# GBM
############################
plot(gbm_fit)
toc()
gbm_predict <- predict(gbm_fit, newdata = validation)
gbm_results
CONCLUSION
Heart diseases are a major killer in India and throughout the world. Applying
a promising technology like machine learning to the early prediction of heart
disease will have a profound impact on society. Early prognosis of heart disease
can aid in making decisions on lifestyle changes in high-risk patients and in
turn reduce complications, which can be a great milestone in the field of
medicine. The number of people facing heart disease is on the rise each year,
which calls for early diagnosis and treatment. The use of suitable technology
support in this regard can prove highly beneficial to the medical fraternity and
patients. In this paper, the seven machine learning algorithms whose performance
is measured on the dataset are SVM, Decision Tree, Random Forest, Naïve Bayes,
Logistic Regression, Adaptive Boosting, and Extreme Gradient Boosting.
FUTURE ENHANCEMENT
The accuracies of all seven machine learning methods are compared, and one
prediction model is selected on that basis. The aim is to use evaluation metrics
such as the confusion matrix, accuracy, precision, recall, and F1-score to
identify the model that predicts the disease most effectively. Of the seven, the
extreme gradient boosting classifier gives the highest accuracy, at 81%.
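As a small worked example of these metrics, the sketch below derives accuracy, precision, recall and F1-score from the four cells of a confusion matrix. The ten actual and predicted labels are hypothetical and do not come from any model in this report.

```r
# Hypothetical labels for ten patients (1 = disease, 0 = no disease).
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 1, 0)
predicted <- c(1, 1, 0, 1, 0, 1, 0, 0, 1, 0)

# The four cells of the confusion matrix.
tp <- sum(predicted == 1 & actual == 1)  # true positives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives
tn <- sum(predicted == 0 & actual == 0)  # true negatives

accuracy  <- (tp + tn) / length(actual)  # share of correct predictions
precision <- tp / (tp + fp)              # predicted positives that are correct
recall    <- tp / (tp + fn)              # actual positives that are found
f1        <- 2 * precision * recall / (precision + recall)

table(Actual = actual, Predicted = predicted)
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```

In a screening setting, recall is often watched closely, because a false negative means a patient with disease is missed.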