Applications of Logistic Regression Analysis by Using R Programming - Zhyan Haidar Pirdawd
Salaheddin University-Erbil
Research Project
Submitted to the Department of Mathematics in partial fulfillment of the
requirements for the degree of BSc in Mathematics
Prepared By:
Zhyan Haidar Perdoowd
Supervised by:
Dr. Awaz Kakamam Muhammad
April 2023
Certification of the supervisors:
I certify that this work was prepared under my supervision at the Department
of Mathematics, College of Education, Salaheddin University-Erbil, in partial
fulfilment of the requirements for the degree of Bachelor of Science in
Mathematics.
Signature:
Signature:
Acknowledgement
Abstract
Heart disease is one of the most common diseases in the world, and many
articles related to it have been published across scientific disciplines and
in social contexts [1]. In this work, we applied logistic regression to build
a model on a dataset that contains 303 instances, each having 13 features that
are used to infer the presence (values 1, 2, 3, 4) or absence (value 0) of
heart disease. We take four variables as predictors of heart disease: age,
sex, cholesterol (chol), and chest pain (cp).
The goal of this project is to explore logistic regression, and logistic
regression models have been used for prediction. We found that the correlation
coefficient between (cp) and (heart disease) is 0.434, which we classify as a
very strong positive correlation under the thresholds used in this project.
The correlation coefficient between age and heart disease is -0.2254, which
suggests a negative correlation. The correlation coefficient between (sex) and
(chol) is -0.1979, which is a weak negative correlation.
Contents
Abstract
Introduction
Chapter one: Logistic Regression
Graphical presentation
Correlation coefficient
Introduction
Heart disease describes a range of conditions that affect the heart. Diseases
under the umbrella term heart disease include cardiovascular disease, heart
arrhythmia, congenital heart disease, cardiomyopathy, heart disease caused by
heart infections, and heart valve disease. When it was first described in 1768
by William Heberden, many believed it had something to do with blood
circulating in the coronary arteries, though others thought it was a harmless
condition. It is estimated that around 200 million people are living with
coronary heart disease (ischaemic heart disease; CHD), about 110 million men
and 80 million women. Heart disease is significantly high across the world,
especially in countries such as India, China, Indonesia, Russia, Iran, and
Turkey [1].
Symptoms of heart disease include a fast, irregular, or slow heartbeat, chest
pain and discomfort, shortness of breath, dizziness and fainting, changes in
skin colour, and swelling of the legs, throat, and around the eyes. These
symptoms pose a serious danger to human life.
In this project we discuss what logistic regression is, what its types are,
what steps a logistic regression analysis involves, how logistic regression is
used to analyse heart disease data, and what the relationship is between heart
disease and each of chest pain (cp), age, cholesterol (chol), and sex [2].
Logistic regression is a model that assumes a logistic relationship between
the input variables (x) and the single output variable (y). Logistic
regression analysis is used to predict the value of a variable based on the
value of another variable.
In this work, I applied logistic regression to build a predictive model on a
dataset obtained from a trusted website [3]. The data include 303 cases with
four predictor variables (age, sex, cp, and chol) and the outcome, heart
disease. The logistic regression model has been used for prediction. We use R
programming [4] and Excel for this analysis.
This project contains two chapters. In chapter one, I describe the logistic
regression model. Logistic regression analysis is used to predict the value of
a variable based on the value of another variable. The variable we want to
predict is called the dependent variable (often called the 'outcome' or
'response' variable). The variables we use to predict it are called the
independent variables (often called 'predictors', 'covariates', 'explanatory
variables', 'attributes', or 'features').
Chapter two includes some applications to the dataset using simple logistic
regression.
From this work we found that the correlation coefficient between (cp) and
heart disease is 0.4338, which we classify as a very strong positive
correlation. The results also show that the correlation coefficient between
chol and heart disease is -0.0852, which means there is a weak negative
relationship between them. Moreover, the correlation coefficient between age
and heart disease is -0.2254, which suggests a negative correlation between
age and heart disease. The correlation coefficient between sex and chol is
-0.1979, which is a weak negative correlation. For this data analysis we use R
programming [4] and Excel.
Chapter one:
Logistic Regression
In this chapter we present some basic definitions and concepts of logistic
regression.
What is regression?
Regression is a statistical procedure which attempts to predict the values of a
given variable (termed the dependent, outcome, or response variable) based
on the values of one or more other variables (called independent variables,
predictors, or covariates). The result of a regression is usually an equation
which summarizes the relationship between the dependent and independent
variable(s). Typically, the model is accompanied by summary statistics
describing how well the model fits the data, the amount of variation in the
outcome accounted for by the model, and a basis for comparing the existing
model to other similar models. By comparing these statistics across multiple
models, the user is able to determine a combination and order of independent
variables that most satisfactorily predict the values of the outcome. Numerous
forms of regression have been developed to predict the values of a wide
variety of outcome measures. Since the focus of regression modeling is on the
response variable, the type of regression you use will be dictated by the type
of response variable you are analyzing and by your eventual analytic goal.
So, though we may have continuous or categorical independent variables, we can
use the logistic regression modeling technique to predict the outcome when the
outcome variable is binary.
Let's see how logistic regression differs from linear regression. A linear
regression model is used to predict continuous outcome variables, whereas
logistic regression predicts categorical outcome variables. The fitted line of
a linear regression model is also highly susceptible to outliers, so it is not
appropriate for a binary outcome.
If we feed an output value to the sigmoid function, it returns the probability
of the outcome, a number between 0 and 1. If the value is below 0.5, the
output is returned as No/Fail. If the value is above 0.5, the output is
returned as Yes/Pass.
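The thresholding rule described above can be sketched in a few lines of R. This is a minimal illustration only: the input values `z` are hypothetical, the function name `sigmoid` is chosen here for readability, and a value of exactly 0.5 (which the rule above does not cover) is assigned to "No" as an arbitrary convention.

```r
# Sigmoid (logistic) function: maps any real number into the interval (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical linear-predictor values, chosen only for illustration.
z <- c(-3, -0.5, 0, 0.5, 3)
p <- sigmoid(z)

# Apply the 0.5 cut-off: above 0.5 -> "Yes", otherwise "No".
# (Assumption: a probability of exactly 0.5 is labelled "No".)
label <- ifelse(p > 0.5, "Yes", "No")

print(data.frame(z = z, probability = round(p, 3), label = label))
```

Base R also provides `plogis()`, which computes the same function.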
Types of logistic regression:
There are three main types of logistic regression: binary, multinomial, and
ordinal. They differ in execution and theory. Binary logistic regression deals
with two possible values, essentially yes or no. Multinomial logistic
regression deals with three or more values. Ordinal logistic regression deals
with three or more classes in a predetermined order.
Binary logistic regression has just two possible outcomes, typically
represented as 0 or 1 in coding. It is a type of regression analysis where the
dependent variable is a dummy variable (coded 0, 1).
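As an illustration of dummy coding, the dataset used in this project records the diagnosis as 0 (absence) or 1 to 4 (presence), so a binary outcome can be obtained by collapsing the non-zero values. The sketch below uses a short hypothetical vector `num`, not the project data, and the variable names are assumptions made for the example.

```r
# Hypothetical raw diagnosis codes: 0 = no heart disease, 1-4 = disease present.
num <- c(0, 2, 1, 0, 4, 3, 0)

# Dummy (0/1) coding: any non-zero code becomes 1.
heart_dis <- as.integer(num > 0)

# Frequency of each outcome after recoding.
print(table(heart_dis))
```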
Why not just use ordinary least squares?
With Y = a + bx, you would typically get the correct answers in terms of the
sign and significance of coefficients. However, there are three problems.
First, the error terms are heteroskedastic (the variance of the dependent
variable differs across values of the independent variables). Second, the
error terms are not normally distributed. Third, and most importantly for
purposes of interpretation, the predicted probabilities can be greater than 1
or less than 0, which can be a problem for subsequent analysis.
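The third problem can be seen directly on simulated data. The sketch below is illustrative only: the data are generated artificially (they are not the heart disease data), and the object names are assumptions. It fits the same binary outcome with `lm()` and with `glm(..., family = binomial)` and compares the range of the fitted values.

```r
set.seed(1)

# Simulated predictor and binary outcome (not the project's dataset).
x <- seq(-4, 4, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(2 * x))

# Ordinary least squares on a 0/1 outcome (linear probability model).
fit_ols <- lm(y ~ x)

# Logistic regression on the same data.
fit_logit <- glm(y ~ x, family = binomial)

p_ols <- predict(fit_ols)                        # fitted "probabilities" from OLS
p_logit <- predict(fit_logit, type = "response") # fitted probabilities from the logit model

print(range(p_ols))    # extends below 0 and above 1
print(range(p_logit))  # stays inside (0, 1)
```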
The coefficients of the multiple regression model are estimated using sample
data with k independent variables:
$$\hat{Y}_i = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$$
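For comparison with the linear form above, the logistic model keeps the same linear combination of predictors but passes it through the sigmoid function, so the prediction is a probability. The following is the standard textbook form, written with the same symbols b_0, ..., b_k and X_1, ..., X_k; the symbol \hat{p}_i for the predicted probability is introduced here as an assumption.

```latex
\hat{p}_i = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k)}},
\qquad
\ln\!\left(\frac{\hat{p}_i}{1 - \hat{p}_i}\right)
  = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k
```

The second equation shows that the log-odds, rather than the outcome itself, are linear in the predictors.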
2) Ordinal Logistic Regression
Ordinal logistic regression is also a model in which an item can be classified
into one of multiple classes; however, in this case an ordering of the classes
is required. The distances between the classes do not need to be proportional.
Chapter two
Some Applications of Logistic Regression
Graphical presentation:
We present graphs of the dataset and describe its frequency distribution.
Figure 2.1: histogram for age
Figure 2.2: histogram for chest pain (cp)
Figure 2.3: histogram of dataset for Chol (cholesterol)
Figure 2.4: histogram of dataset for heart disease
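Histograms of this kind can be produced with base R graphics. The sketch below is only an illustration: the data frame `heart` is simulated stand-in data with assumed column names and assumed distributions, not the dataset analysed in this project.

```r
set.seed(42)

# Simulated stand-in data (NOT the project's dataset); column names are assumptions.
heart <- data.frame(
  age  = round(rnorm(303, mean = 54, sd = 9)),
  chol = round(rnorm(303, mean = 246, sd = 50)),
  cp   = sample(0:3, 303, replace = TRUE)
)

# Histogram of a continuous variable; hist() also returns the bin counts.
h_age <- hist(heart$age, main = "Histogram of age", xlab = "Age (years)")
h_chol <- hist(heart$chol, main = "Histogram of chol", xlab = "Cholesterol")

# For a coded categorical variable such as cp, a bar plot of the
# frequency table shows the same information more directly.
cp_counts <- table(heart$cp)
barplot(cp_counts, main = "Frequency of chest pain type (cp)", xlab = "cp")

# Frequency distribution table for age, taken from the histogram bins.
print(data.frame(lower = head(h_age$breaks, -1),
                 upper = tail(h_age$breaks, -1),
                 count = h_age$counts))
```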
Correlation coefficient:-
Notation:
If Y tends to increase as X increases, the correlation is called positive, or
direct, correlation.
If Y tends to decrease as X increases, the correlation is called negative, or
inverse, correlation.
If there is no relationship indicated between the variables, we say that there is
no correlation between them (i.e., they are uncorrelated).
In this project we treat r = 0.4 as a very strong correlation between the
features in this dataset, r = 0.3 as strong, and r = 0.1 as a weak correlation.
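The sign and strength conventions above can be written as a small helper. This is a sketch under stated assumptions: the function name `describe_r` is hypothetical, and the three reference values are treated as cut-offs (|r| of at least 0.4 is "very strong", at least 0.3 is "strong", anything smaller is "weak"), which is one possible reading of the convention. The coefficients passed in are the ones reported in this project.

```r
# Describe a correlation coefficient using the project's thresholds.
# Assumption: the reference values 0.4 and 0.3 are treated as lower cut-offs.
describe_r <- function(r) {
  if (r == 0) return("no correlation")
  strength <- if (abs(r) >= 0.4) {
    "very strong"
  } else if (abs(r) >= 0.3) {
    "strong"
  } else {
    "weak"
  }
  direction <- if (r > 0) "positive" else "negative"
  paste(strength, direction, "correlation")
}

# Pearson correlation on a tiny hypothetical example: y rises exactly with x.
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)
print(cor(x, y))

# Coefficients reported in this project.
reported <- c(cp_heart = 0.4338, chol_heart = -0.0852,
              age_heart = -0.2254, sex_chol = -0.1979)
labels <- sapply(reported, describe_r)
print(labels)
```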
Difference between correlation and regression:
The table below shows some basic differences between correlation and
regression.
Table 2.1: Differences between correlation and regression.
Basis for comparison | Correlation | Regression
Meaning | Correlation is a statistical measure that determines the association or co-relation between two variables. | Regression describes how to numerically relate an independent variable to the dependent variable.
Usage | To represent a linear relationship between two variables. | To fit the best line and to estimate one variable based on another.
Dependent and independent variables | No difference. | Both variables are different.
Indicates | The correlation coefficient indicates the extent to which two variables move together. | Regression indicates the impact of a unit change in the known variable (x) on the estimated variable (y).
Objective | To find a numerical value expressing the relationship between variables. | To estimate the value of a random variable on the basis of the values of fixed variables.
The main problem with Figure 2.6 is that the variability in heart disease at all
ages is large. This makes it difficult to see any functional relationship between
Age and heart disease. One common method of removing some variation,
while still maintaining the structure of the relationship between the dependent
and the independent variable, is to create intervals for the independent
variable and compute the mean of the outcome variable within each group.
We use this strategy by grouping age into the categories (Age Group) defined
in Table 2.2. Table 2.2 contains, for each age group, the frequency of
occurrence of each outcome, as well as the percent with heart disease.
Figure 2.6: scatterplot of heart disease (“yes” or “no”) by age.
Table 2.2 shows the frequency table of age group by heart disease. I divided
the age feature into 10 subsets, placed in 10 class intervals, each of width 5.
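The grouping strategy described above (class intervals for age, then the mean of the outcome within each interval) can be sketched in R with `cut()` and `tapply()`. The data below are simulated and the break points 29, 34, ..., 79 are an assumption chosen to give 10 intervals of width 5; the intervals actually used in the project's table may differ.

```r
set.seed(7)

# Simulated stand-in data (NOT the project's dataset).
age <- sample(29:77, 303, replace = TRUE)
heart_dis <- rbinom(303, size = 1, prob = 0.5)

# Assumed break points: 10 class intervals, each of width 5.
breaks <- seq(29, 79, by = 5)
age_group <- cut(age, breaks = breaks, right = FALSE, include.lowest = TRUE)

# Frequency of each outcome within each age group.
freq <- table(age_group, heart_dis)
print(freq)

# Mean of the 0/1 outcome per group = proportion with heart disease.
prop <- tapply(heart_dis, age_group, mean)
print(round(prop, 3))

# Plot the group proportions against the group index.
plot(seq_along(prop), prop, type = "b", xaxt = "n",
     xlab = "Age group", ylab = "Proportion with heart disease")
axis(1, at = seq_along(prop), labels = names(prop), las = 2, cex.axis = 0.7)
```

Because the outcome is coded 0/1, the group mean is the same as the proportion of cases with the outcome in that group.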
Figure 2.7: plot of the percentage of subjects with heart disease in each age
group (x-axis: age group; plotted series: mean, with a linear trend line).
Moreover, following Figures 2.6 and 2.7, the same kind of study can be carried
out for the other variables against heart disease.
Figure 2.8: a) plot of age with heart disease b) plot of cp with heart disease
Figure 2.9: a) plot of sex with heart disease b) plot of Chol with heart disease.
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.2369743)
Null deviance: 75.149 on 302 degrees of freedom
Residual deviance: 71.329 on 301 degrees of freedom
AIC: 427.61
Between chest pain (cp) and heart disease
glm(formula = Heart.Dis. ~ cp, data = Zhyanproject)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.97082 -0.34180 0.02918 0.44853 0.65820
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.34180 0.03547 9.636 2e-16 ***
cp 0.20967 0.02510 8.353 2.47e-15 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.2026811)
Null deviance: 75.149 on 302 degrees of freedom
Residual deviance: 61.007 on 301 degrees of freedom
AIC: 380.25
Number of Fisher Scoring iterations: 2
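The summary above reports a "Dispersion parameter for gaussian family" and t values, which is what `glm()` prints when no `family` argument is given; in that case the fit is an ordinary linear model on the 0/1 outcome. A logistic regression fit would need `family = binomial`. The sketch below shows the difference on simulated data: the data frame `sim`, its column names, and the coefficients used to generate it are assumptions for illustration, not the project's data or results.

```r
set.seed(123)

# Simulated stand-in data (NOT the project's dataset).
n <- 303
cp <- sample(0:3, n, replace = TRUE)
heart_dis <- rbinom(n, size = 1, prob = plogis(-0.7 + 0.9 * cp))
sim <- data.frame(heart_dis = heart_dis, cp = cp)

# Default family is gaussian: a linear model, as in the printed output.
fit_gauss <- glm(heart_dis ~ cp, data = sim)

# Logistic regression: binomial family with the (default) logit link.
fit_logit <- glm(heart_dis ~ cp, data = sim, family = binomial)

print(family(fit_gauss)$family)
print(family(fit_logit)$family)
print(summary(fit_logit)$coefficients)

# Odds ratio for a one-unit increase in cp.
odds_ratio <- exp(coef(fit_logit)["cp"])
print(odds_ratio)

# Predicted probability of heart disease for each chest pain code.
p_hat <- predict(fit_logit, newdata = data.frame(cp = 0:3), type = "response")
print(round(p_hat, 3))
```

With the binomial family, the summary shows z values instead of t values, and the coefficients are on the log-odds scale, so they are usually read through the odds ratio.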
Between cholesterol and heart disease
glm(formula = Heart.Dis. ~ chol, data = Zhyanproject)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.6391 -0.5370 0.4027 0.4499 0.7161
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7465813 0.1390865 5.368 1.6e-07 ***
chol -0.0008204 0.0005527 -1.484 0.139
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.2478489)
Null deviance: 75.149 on 302 degrees of freedom
Residual deviance: 74.603 on 301 degrees of freedom
AIC: 441.2
Number of Fisher Scoring iterations: 2
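The correlation coefficients quoted in this project can be cross-checked against the deviances printed above. For a gaussian fit with one predictor, the null deviance is the total sum of squares and the residual deviance is the residual sum of squares, so R^2 = 1 - residual/null and r has the sign of the slope. The sketch below only re-uses numbers copied from the cp and chol outputs; the object names are assumptions.

```r
# Values copied from the printed glm summaries.
null_dev  <- 75.149
resid_dev <- c(cp = 61.007, chol = 74.603)
slope     <- c(cp = 0.20967, chol = -0.0008204)

# Proportion of variation explained by each single-predictor model.
r_squared <- 1 - resid_dev / null_dev

# Pearson correlation: square root of R^2, signed like the slope.
r <- sign(slope) * sqrt(r_squared)

print(round(r_squared, 4))
print(round(r, 4))
```

Up to rounding of the printed deviances, the results agree with the reported coefficients of 0.4338 for cp and -0.0852 for chol.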
Conclusion:
In this work, I applied logistic regression to build a predictive model on a
dataset obtained from a trusted website [3]. The data include 303 cases with
four predictor variables (age, sex, cp, and chol) and the outcome, heart
disease. The logistic regression model has been used for prediction. We use R
programming [4] and Excel for this analysis.
This project contains two chapters. In chapter one, I describe the logistic
regression model. Logistic regression analysis is used to predict the value of
a variable based on the value of another variable. The variable we want to
predict is called the dependent variable (often called the 'outcome' or
'response' variable). The variables we use to predict it are called the
independent variables (often called 'predictors', 'covariates', 'explanatory
variables', 'attributes', or 'features').
Chapter two includes some applications to the dataset using simple logistic
regression.
From this work we found that the correlation coefficient between (cp) and
heart disease is 0.4338, which we classify as a very strong positive
correlation. The results also show that the correlation coefficient between
chol and heart disease is -0.0852, which means there is a weak negative
relationship between them. Moreover, the correlation coefficient between age
and heart disease is -0.2254, which suggests a negative correlation between
age and heart disease. The correlation coefficient between sex and chol is
-0.1979, which is a weak negative correlation. For this data analysis we use R
programming [4] and Excel.
Abstract (translated from Kurdish)
In this work, we applied logistic regression to build a model on a dataset
that contains 303 instances, each having 13 features that are used to infer
the presence (values 1, 2, 3, 4) or absence (value 0) of heart disease. We
take four variables as predictors of heart disease: age, sex, cholesterol
(chol), and chest pain (cp). The goal of this project is to explore logistic
regression, and the logistic regression model has been used for prediction.
We found that the correlation coefficient between (cp) and (heart disease) is
0.434, which we classify as a very strong positive correlation. The
correlation coefficient between age and heart disease is -0.2254, which
suggests a negative correlation. The correlation coefficient between (sex) and
(cholesterol) is -0.1979, which is a weak negative correlation.
References:
[1] Guido, J. J., Winters, P. C., & Rains, A. B. (2006). Logistic regression
basics. MSc, University of Rochester Medical Center, Rochester, NY.
[2] Tranmer, M., & Elliot, M. (2008). Binary logistic regression. Cathie Marsh
Centre for Census and Survey Research, Paper 20.
[3] Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1988). UCI
Machine Learning Repository: Heart Disease Data Set. School of Information and
Computer Science, University of California, Irvine, CA, USA.