Applications of Regression Analysis by Using R Programming - Tara Qasm Bakr


‫هەولێر‬-‫زانکۆی سەاڵحەدین‬

Salahaddin university-Erbil

Applications of Regression Analysis

By Using R Programming

Research Project Submitted to the Department of Mathematics in partial fulfillment of the requirements for the degree of BSc in Mathematics

Prepared By: Tara Qasm Bakr


Supervised by: Dr. Awaz K. Muhammad

April – 2023
Certification of the supervisor:

I certify that this work was prepared under my supervision at the Department of Mathematics, College of Education, Salahaddin University-Erbil, in partial fulfilment of the requirements for the degree of Bachelor of Science in Mathematics.

Signature:

Supervisor: Dr. Awaz Kakamam Muhammad

Scientific grade: Lecturer

In view of the available recommendations, I forward this work for debate by the
examining committee.

Signature:

Name: Dr. Rashad R. Haji

Scientific grade: Assistant Professor

Chairman of the Mathematics Department

Acknowledgement
Primarily, I would like to thank God for helping me to complete this research
successfully.

Then I would like to express special thanks to my supervisor Dr. Awaz Kakamam,
whose valuable guidance helped me to complete my research.
Words can only inadequately express my gratitude to my supervisor for patiently
helping me to think clearly and consistently by discussing every point of this
project with me.

I would also like to extend my gratitude to the head of the Mathematics Department,
Assist. Prof. Dr. Rashad Rashid.

I would like to thank my family, friends, and the library staff, whose support has
helped me to complete this research.

Abstract

In this work we review the regression model. We present the types of regression and give some
examples. We applied linear regression to a dataset that contains about 304 instances, each having
13 features which are used to infer the presence (values 1, 2, 3, 4) or absence (value 0) of heart
disease. We take 6 variables for prediction (age, sex, oldpeak, ca, slope and thal). A linear
regression model has been used for the predictions. The aim is to explore simple linear regression
models and multiple linear regression models for examining heart disease with these variables.
We found that the correlation between age and heart disease is -0.226, so we can say that there is
a significant weak negative relationship between age and heart disease. We found that the
correlation coefficient between oldpeak and heart disease is -0.431, which indicates a very strong
negative correlation between oldpeak and heart disease.

Contents
Introduction
Chapter one
  Regression
  There are two kinds of regression
  Non-linear regression
  Linear regression
    Simple linear regression
  Ordinary least squares (OLS) method
    Example of simple linear regression
  Multiple linear regression
    Example of multiple linear regression
  Multivariate linear regression
    Example of multivariate linear regression
Chapter two
  Some Applications of the Linear Regression Model
  Graphical representation
  Correlation Coefficient
  Prediction of the Linear Regression model
  For Multiple linear regression
  Conclusion
References

Introduction
William Harvey (1578–1657), physician to King Charles I, is credited with
discovering that blood moves around the body in a circulatory manner from the heart.
Friedrich Hoffmann (1660–1742), chief professor of medicine at the University of
Halle, later noted that coronary heart disease started in the "reduced passage of the
blood within the coronary arteries".

Heart disease is considered one of the top preventable causes of death in the world.
Some genetic factors can contribute, but the disease is largely attributed to poor
lifestyle habits. Among these are poor diet, lack of regular exercise, tobacco
smoking, alcohol or drug abuse, and high stress. These are issues that remain
prevalent in American culture, so it’s no wonder that heart disease is of great
concern.[1]

We now discuss what linear regression is, the types of regression, the steps of a
linear regression analysis, how linear regression is used to analyse the heart disease
data, and the relationship between heart disease and each of age, sex, oldpeak,
slope, and thal. Linear regression is a linear model: a model that assumes a
linear relationship between the input variables (x) and the single output variable (y).
Linear regression analysis is used to predict the value of a variable based on the value
of another variable.

In this work, I applied linear regression to build a predictive model on the dataset [4]. The dataset
was collected from the Cleveland Clinic Foundation and contains about 304
instances, each having 13 features which are used to infer the presence (values 1, 2,
3, 4) or absence (value 0) of heart disease. The features are (1) age, (2) sex, (3) chest
pain type, (4) resting blood pressure, (5) cholesterol, (6) fasting blood sugar, (7)
resting electrocardiographic results, (8) maximum heart rate, (9) exercise induced
angina, (10) depression induced by exercise relative to rest (oldpeak), (11) slope of peak
exercise, (12) number of major vessels and (13) thal.

From this data set we take six variables (age, sex, oldpeak, ca, slope, and thal). A linear
regression model has been used for the predictions. We use R programming and Excel for
this analysis.

This project contains two chapters. In chapter one, I describe the linear regression
model. Linear regression analysis is used to predict the value of a variable based on
the value of another variable. The variable you want to predict is called the dependent
variable (often called the 'outcome' or 'response' variable). The variable we are using
to predict the other variable's value is called the independent variable (often called
'predictors', 'covariates', 'explanatory variables', 'attributes' or 'features') [3].

Chapter two includes some applications to our data set using simple linear
regression and multiple linear regression. We used R programming to analyse the data.

From this work we found that the correlation between age and heart disease is -0.226,
so we can say that there is a significant weak negative relationship between age and
heart disease. In this project, the results also show that the correlation coefficient
between sex and heart disease is -0.28, which suggests a weak negative correlation
between them. Moreover, we found that the correlation coefficient between oldpeak
and heart disease is -0.431, a very strong negative correlation between oldpeak and
heart disease. The correlation coefficient between ca and heart disease is -0.392,
which is considered a very strong negative correlation between them. For this data
analysis we use R programming [4] and Excel.

Chapter one
Regression: -
Regression analysis has been one of the most widely used statistical methodologies
during the past 50 years for analyzing relationships among variables. Due to its
flexibility, usefulness, applicability, and theoretical and technical succinctness,
regression analysis has become a basic statistical tool for solving problems in the real
world. Regression analysis is a collection of statistical techniques that serve as a basis
for drawing inferences about relationships among interrelated variables. Since these
techniques are applicable in almost every field of study, including the social, physical
and biological sciences, business and engineering, regression analysis is now perhaps
the most used of all data analysis methods.

There are two kinds of regression: -


1) Non-linear regression
2) Linear regression

Non-linear regression

In non-linear regression the relationship between the response and some of the
predictors is nonlinear, or some of the parameters appear nonlinearly and no
transformation can make the parameters appear linearly.

Linear regression:
Linear regression is a modelling technique for analyzing data to make predictions. In
simple linear regression, a bivariate model is built to predict a response variable (𝑦)
from an explanatory variable (𝑥). In multiple linear regression the model is extended
to include more than one explanatory variable (𝑥1, 𝑥2 ,…..,𝑥𝑝 ) producing a
multivariate model. This section presents the necessary theory and gives a practical
outline of the technique for bivariate and multivariate linear regression models. We
discuss model building, assumptions for regression modelling, and interpreting the
results to gain meaningful understanding from data [3].

3
We have three types of linear regression: -

1. Simple linear regression: models using only one predictor
2. Multiple linear regression: models using multiple predictors
3. Multivariate linear regression: models for multiple response variables

Simple linear regression:

A simple linear regression estimates the relationship between a response variable 𝑦
and a single explanatory variable 𝑥, given a set of data that includes observations for
both of these variables for a particular sample. To fit a straight line to the points on a
scatterplot of the data, we use linear regression; the equation of this line is what we
use to make predictions. The equation for the line in regression modelling takes the form:

𝑦𝑖 = 𝛽0 + 𝛽1𝑥𝑖 + 𝑒𝑖

• 𝑦𝑖 is the value of the dependent variable (y) for a given value of
the independent variable (x).
• 𝛽0 is the intercept, the predicted value of y when x is 0.
• 𝛽1 is the regression coefficient: how much we expect y to change as x
increases.
• 𝑥𝑖 is the independent variable (the variable we expect is influencing y).
• 𝑒𝑖 is the error of the estimate, or how much variation there is in our estimate
of the regression coefficient.

Ordinary least squares (OLS) method:
When fitting the model we can use ordinary least squares to estimate the values of
the coefficients. The OLS procedure seeks to minimize the sum of the squared
residuals (a residual being the difference between an observed value and the fitted
value provided by the model). We can use sample data to find the line of regression:

𝑦̂ = 𝑏0 + 𝑏1𝑥

➢ 𝑦̂: the predicted value of the dependent variable.
➢ 𝑏0: the y-intercept of the line.
➢ 𝑏1: the slope of the line.
➢ 𝑥: the independent variable.

Recall that the least squares method minimizes

min Σ(𝑦𝑖 − 𝑦̂𝑖)²

Where

𝑦𝑖 = observed value of the dependent variable for the observation.

𝑦̂𝑖 = Predicted value of the dependent variable for the observation.

Example of simple linear regression:-


When an anthropologist finds skeletal remains, they need to figure out the height of
the person. The height of a person (in cm) and the length of the metacarpal bone
(in cm) were collected and are in Table 1.1 ("Prediction of height," 2013). Create a
scatter plot and find a regression equation between the height of a person and the
length of their metacarpal. Then use the regression equation to find the height of a
person for a metacarpal length of 44 cm and for a metacarpal length of 55 cm.
Which height that you calculated do you think is closer to the true height of the
person? Why?

Table 1.1: Data of Metacarpal versus Height
Length of Metacarpal (cm) Height of Person (cm)
45 171
51 178
39 157
41 163
48 172
49 183
46 173
43 175
47 173

Calculating the slope:

𝑏1 = Σ(𝑥𝑖 − 𝑥̄)(𝑦𝑖 − 𝑦̄) / Σ(𝑥𝑖 − 𝑥̄)²

➢ 𝑥𝑖 = value of the independent variable for observation 𝑖.
➢ 𝑦𝑖 = value of the dependent variable for observation 𝑖.
➢ 𝑥̄ = mean value of the independent variable.
➢ 𝑦̄ = mean value of the dependent variable.

Calculating the y-intercept: 𝑏0 = 𝑦̄ − 𝑏1𝑥̄

Table 1.2
 𝑥𝑖    𝑦𝑖    𝑥𝑖−𝑥̄   𝑦𝑖−𝑦̄   (𝑥𝑖−𝑥̄)(𝑦𝑖−𝑦̄)   (𝑥𝑖−𝑥̄)²
 45   171     0     -1        0            0
 51   178     6      5       30           36
 39   157    -6    -15       90           36
 41   163    -4     -9       36           16
 48   172     3      0        0            9
 49   183     4     11       44           16
 46   173     1      1        1            1
 43   175    -2      3       -6            4
 47   173     2      1        2            4
Mean: 45   172              Total: 197   122

𝑥̄ = Σ𝑥𝑖 / 𝑛 = 409/9 ≈ 45

𝑦̄ = Σ𝑦𝑖 / 𝑛 = 1545/9 ≈ 172
𝑏1 = Σ(𝑥𝑖 − 𝑥̄)(𝑦𝑖 − 𝑦̄) / Σ(𝑥𝑖 − 𝑥̄)²

From the table above we get

𝑏1 = 197/122 ≈ 1.61

𝑏0 = 𝑦̄ − 𝑏1𝑥̄

𝑏0 = 172 − 1.61(45) ≈ 99.4

𝑦̂ = 𝑏0 + 𝑏1𝑥

𝑦̂ = 99.4 + 1.61𝑥

This is the estimated regression line. (Table 1.2 uses the rounded means 𝑥̄ ≈ 45 and
𝑦̄ ≈ 172, so these coefficients are approximate.)

We use the regression line to predict the value of 𝑦 for a given 𝑥. For a metacarpal
length of 𝑥 = 44 cm the predicted height is 𝑦̂ = 99.4 + 1.61(44) ≈ 170.2 cm, and for
𝑥 = 55 cm it is 𝑦̂ = 99.4 + 1.61(55) ≈ 188.0 cm. The prediction for 44 cm should be
closer to the true height, because 44 cm lies inside the observed range of metacarpal
lengths (39-51 cm), whereas 55 cm requires extrapolating beyond the data.
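The hand computation above uses the rounded means from Table 1.2, so the fitted coefficients are approximate. Fitting the same data without rounding gives a slightly different line; the following is a small Python sketch for illustration (the thesis's own analysis uses R and Excel):

```python
# Least-squares fit of the metacarpal data from Table 1.1,
# without rounding the means first.
x = [45, 51, 39, 41, 48, 49, 46, 43, 47]            # metacarpal length (cm)
y = [171, 178, 157, 163, 172, 183, 173, 175, 173]   # height (cm)

n = len(x)
x_bar = sum(x) / n          # 409/9 = 45.44..., not the rounded 45
y_bar = sum(y) / n          # 1545/9 = 171.66..., not the rounded 172

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx            # slope
b0 = y_bar - b1 * x_bar     # intercept

print(f"y-hat = {b0:.1f} + {b1:.2f} x")   # prints: y-hat = 94.4 + 1.70 x
```

The exact least-squares line is therefore 𝑦̂ ≈ 94.4 + 1.70𝑥; either version predicts a height of roughly 171-172 cm for a 45 cm metacarpal.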

Multiple linear regression
Multiple linear regression extends simple linear regression to include more than
one explanatory variable. In both cases, we still use the term ‘linear’ because we
assume that the response variable is directly related to a linear combination of the
explanatory variables. The equation for multiple linear regression has the same
form as that for simple linear regression but has more terms:

𝑦𝑖 = 𝛽0 + 𝛽1𝑥𝑖1 + 𝛽2𝑥𝑖2 + … + 𝛽𝑝𝑥𝑖𝑝 + 𝜖𝑖

As in the simple case, 𝛽0 is the constant, which will be the predicted value of y
when all explanatory variables are 0. In a model with 𝑝 explanatory variables, each
explanatory variable has its own 𝛽 coefficient [2].

➢ 𝑖 = index of the observation
➢ 𝑦𝑖 = dependent variable
➢ 𝑥𝑖 = explanatory (independent) variables
➢ 𝛽0 = intercept (constant term)
➢ 𝛽𝑝 = slope coefficients for each explanatory variable
➢ 𝜖 = the model's error term (also known as the residual)

Example on multiple linear regression


With two independent variables the model is y = 𝑏0 + 𝑏1𝑥1 + 𝑏2𝑥2.

Table 1.3: data with two independent variables
   y    𝑥1  𝑥2  (𝑥1)²  (𝑥2)²   𝑥1𝑦    𝑥2𝑦   𝑥1𝑥2
 -3.7    3   8    9     64   -11.1  -29.6   24
  3.5    4   5   16     25    14.0   17.5   20
  2.5    5   7   25     49    12.5   17.5   35
 11.5    6   3   36      9    69.0   34.5   18
  5.7    2   1    4      1    11.4    5.7    2

𝑏0 = 𝑦̄ − 𝑏1𝑥̄1 − 𝑏2𝑥̄2

𝑏1 = [𝑆22𝑆1𝑦 − 𝑆12𝑆2𝑦] / [𝑆11𝑆22 − (𝑆12)²]

𝑏2 = [𝑆11𝑆2𝑦 − 𝑆12𝑆1𝑦] / [𝑆11𝑆22 − (𝑆12)²]

where the corrected (deviation) sums are 𝑆11 = Σ𝑥1² − (Σ𝑥1)²/𝑁, 𝑆22 = Σ𝑥2² − (Σ𝑥2)²/𝑁,
𝑆12 = Σ𝑥1𝑥2 − (Σ𝑥1)(Σ𝑥2)/𝑁, 𝑆1𝑦 = Σ𝑥1𝑦 − (Σ𝑥1)(Σ𝑦)/𝑁 and 𝑆2𝑦 = Σ𝑥2𝑦 − (Σ𝑥2)(Σ𝑦)/𝑁.

From Table 1.3, with 𝑁 = 5:

Σ𝑥1 = 20, Σ𝑥2 = 24, Σ𝑦 = 19.5, Σ𝑥1² = 90, Σ𝑥2² = 148, Σ𝑥1𝑦 = 95.8, Σ𝑥2𝑦 = 45.6, Σ𝑥1𝑥2 = 99

𝑆11 = 90 − (20)²/5 = 90 − 80 = 10

𝑆22 = 148 − (24)²/5 = 148 − 115.2 = 32.8

𝑆1𝑦 = 95.8 − (20)(19.5)/5 = 95.8 − 78 = 17.8

𝑆2𝑦 = 45.6 − (24)(19.5)/5 = 45.6 − 93.6 = −48

𝑆12 = 99 − (20)(24)/5 = 99 − 96 = 3

𝑆11𝑆22 − (𝑆12)² = (10)(32.8) − (3)² = 319

𝑏1 = [(32.8)(17.8) − (3)(−48)] / 319 = 727.84/319 ≈ 2.28

𝑏2 = [(10)(−48) − (3)(17.8)] / 319 = −533.4/319 ≈ −1.67

𝑏0 = 3.9 − (2.28)(4) − (−1.67)(4.8) ≈ 2.796

𝑦 = 2.796 + 2.28𝑥1 − 1.67𝑥2
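The corrected-sum arithmetic above can be verified with a short Python sketch (illustrative only; the thesis's own computations use R and Excel):

```python
# Two-predictor least squares using the corrected (deviation) sums,
# mirroring the hand computation in the example above.
y  = [-3.7, 3.5, 2.5, 11.5, 5.7]
x1 = [3, 4, 5, 6, 2]
x2 = [8, 5, 7, 3, 1]
n = len(y)

def s(u, v):
    # corrected sum of products: sum(u*v) - sum(u)*sum(v)/n
    return sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n

s11, s22, s12 = s(x1, x1), s(x2, x2), s(x1, x2)
s1y, s2y = s(x1, y), s(x2, y)

den = s11 * s22 - s12 ** 2                  # 10 * 32.8 - 3^2 = 319
b1 = (s22 * s1y - s12 * s2y) / den          # ~  2.28
b2 = (s11 * s2y - s12 * s1y) / den          # ~ -1.67
b0 = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n   # ~ 2.80
```

Running this reproduces the fitted equation y ≈ 2.80 + 2.28x1 − 1.67x2.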

Multivariate linear regression: -
Multivariate regression is a method used to measure the degree to which more than
one independent variable (predictors) and more than one dependent variable
(responses) are linearly related.

A mathematical model based on multivariate regression analysis will address this
and other more complicated questions.

The form of the equation is

𝑦𝑡 = 𝑎1 + 𝑏2𝑥2𝑡 + 𝑏3𝑥3𝑡 + 𝑒𝑡

➢ 𝑦𝑡 = dependent variable
➢ 𝑎1 = intercept
➢ 𝑏2 = constant (partial regression coefficient)
➢ 𝑏3 = constant (partial regression coefficient)
➢ 𝑥2 = explanatory variable
➢ 𝑥3 = explanatory variable
➢ 𝑒𝑡 = error term

Example of multivariate linear regression:-


A researcher has collected data on three psychological variables, four academic
variables (standardized test scores), and the type of educational program the student
is in for 600 high school students. She is interested in how the set of psychological
variables is related to the academic variables and the type of program the student is
in.

Chapter two
Some Applications of the Linear Regression Model
In this chapter we apply linear regression to build predictive models. We have the data
from [4]. The dataset was collected from the Cleveland Clinic Foundation and
contains about 304 instances, each having 13 features, but we take 6 features as
independent variables together with the dependent variable y (age = 𝑥1, sex = 𝑥2,
oldpeak = 𝑥3, slope = 𝑥4, ca = 𝑥5, thal = 𝑥6), where each 𝑥𝑖 is defined as follows:

• Age:- age in years
• Sex:- sex (1 = male; 0 = female)
• Oldpeak:- depression induced by exercise relative to rest
• Slope:- the slope of the peak exercise
  -- Value 1: upsloping
  -- Value 2: flat
  -- Value 3: downsloping
• Ca:- number of major vessels (0-3) colored by fluoroscopy
• Thal
• Output (heart disease): Y

Graphical representation:
We represent the data set with histograms and describe its frequency
distribution.

Figure 2.1: histogram of sex    Figure 2.2: histogram of age

Figure 2.3: histogram of oldpeak    Figure 2.4: histogram of ca

Figure 2.5: histogram of slope    Figure 2.6: histogram of thal

Figure 2.7: histogram of heart disease

The main problem with Figure 2.8 is that the variability in heart disease at all ages
is large. This makes it difficult to see any functional relationship between Age and
heart disease. One common method of removing some variation, while still
maintaining the structure of the relationship between the dependent and the
independent variable, is to create intervals for the independent variable and
compute the mean of the outcome variable within each group. We use this strategy
by grouping age into the categories (Age Group) defined in Table 1.4, which
contains, for each age group, the frequency of occurrence of each outcome, as well
as the percent with heart disease present.

Figure 2.8: Scatterplot of "yes" or "no" of heart disease by Age.

Table 1.4 illustrates the frequency table of Age Group by heart disease. I divided
the Age feature into 10 class intervals, each of size 4.

N   Age group   n    Heart disease (yes)   Heart disease (no)   Mean
1   29-33       1          1                     0              1.00
2   34-38      11          8                     3              0.73
3   39-43      33         25                     8              0.76
4   44-48      38         25                    13              0.66
5   49-53      46         32                    14              0.70
6   54-58      71         32                    39              0.45
7   59-63      53         15                    38              0.28
8   64-68      39         20                    19              0.52
9   69-73      10          6                     4              0.60
10  74-78       3          2                     1              0.66
Figure 2.9: Plot of the percentage of subjects with heart disease in each AGE group

Figure 2.10: Plot of oldpeak with heart disease

Correlation Coefficient:
Definition (correlation coefficient) [5]: The correlation coefficient is an indicator
for measuring the dependence between attributes. The Pearson Correlation
Coefficient (PCC) can be defined as the covariance of two random variables divided
by the product of their individual standard deviations. Consider two variables 𝑋
and 𝑌, with a series of 𝑛 measurements of 𝑋 and 𝑌 written 𝑥𝑖 and 𝑦𝑖, where 𝑖 = 1, 2, …,
𝑛. The Pearson 𝑟 is defined as:

𝑟 = Σᵢ₌₁ⁿ (𝑥𝑖 − 𝑥̄)(𝑦𝑖 − 𝑦̄) / √( Σᵢ₌₁ⁿ (𝑥𝑖 − 𝑥̄)² · Σᵢ₌₁ⁿ (𝑦𝑖 − 𝑦̄)² ),   where −1 ≤ 𝑟 ≤ 1

Notation:
• If Y tends to increase as X increases, the correlation is called positive, or
direct, correlation.
• If Y tends to decrease as X increases, the correlation is called negative, or
inverse, correlation.
• If there is no relationship indicated between the variables, we say that there is
no correlation between them (i.e., they are uncorrelated).
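As an illustration of the definition, the following small Python function (not part of the thesis's R analysis) applies Pearson's formula to the metacarpal data of Table 1.1:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient r of two equal-length samples."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    den = math.sqrt(sum((a - x_bar) ** 2 for a in x)
                    * sum((b - y_bar) ** 2 for b in y))
    return num / den

# Metacarpal length vs height from Table 1.1
r = pearson_r([45, 51, 39, 41, 48, 49, 46, 43, 47],
              [171, 178, 157, 163, 172, 183, 173, 175, 173])
# r is about 0.86: height rises with metacarpal length
```

Here 𝑟 ≈ 0.86, a positive (direct) correlation: height tends to increase with metacarpal length.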

In this research we adopt the convention that a correlation between two features with
absolute value greater than or equal to 0.4 (i.e. |r| ≥ 0.4) is very strong, |r| ≈ 0.3 is
strong, and |r| ≈ 0.2 is weak.

In our data the correlation coefficients between the features and heart disease are as
follows. We found that the correlation between age and heart disease is
-0.2254387, so we can say that there is a significant weak negative relationship between
age and heart disease. The results show that the correlation coefficient between sex and
heart disease is -0.2809366, suggesting a weak
negative correlation between them. Moreover, we found that the correlation
coefficient between oldpeak and heart disease is -0.430696, a very strong negative
correlation between oldpeak and heart disease.
The correlation coefficient between ca and heart disease is -0.391724, which is
considered a very strong negative correlation between them.

Prediction of the Linear Regression model:

In this project, we applied the linear regression method to our dataset. First, we want
to find the relationship between the independent variable Age and oldpeak.

Figure 2.11: Plot of age with oldpeak

Figure 2.11 shows the linear relationship between Age and oldpeak in the dataset
of 304 cases, with Age ranging from 29 to 77 years, where oldpeak is the depression
induced by exercise relative to rest.
We perform a simple linear regression analysis; the results of fitting the linear
regression model to the Age and oldpeak data, n = 303, are as follows.

lm(formula = oldpeak ~ age, data = Taraproject)
Residuals:
Min 1Q Median 3Q Max
-1.6473 -0.8418 -0.3519 0.5702 4.9554
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.420048 0.397085 -1.058 0.290981
age 0.026848 0.007204 3.727 0.000232 ***
--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.137 on 301 degrees of freedom

Multiple R-squared: 0.04411, Adjusted R-squared: 0.04093

F-statistic: 13.89 on 1 and 301 DF, p-value: 0.0002317
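As a consistency check on the printed summary: for a simple regression with a single predictor, the t value is Estimate / Std. Error and the overall F statistic equals t². A small Python sketch (illustrative; the numbers are copied from the R output above):

```python
# Consistency check on the R summary above: with a single predictor,
# t = Estimate / Std.Error, and the overall F statistic equals t^2.
estimate, std_error = 0.026848, 0.007204   # the "age" row of the output

t_value = estimate / std_error   # ~ 3.727, as printed
f_stat = t_value ** 2            # ~ 13.89, as printed
```

This kind of check only works for one predictor; with several predictors the F statistic tests all slope coefficients jointly.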

Next we consider the linear relationship between Age and heart disease in the data set,
with Age ranging from 29 to 77 years; for heart disease, 165 persons are infected and
138 are not infected. The results of fitting the linear regression model to the Age and
heart disease data, n = 303, are as follows.

The output looks like this:

lm(formula = output.Heart.Disease..Y ~ age.x1, data = Taraproject)


Residuals:

Min 1Q Median 3Q Max


-0.7843 -0.4996 0.2899 0.4385 0.7233
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.217731 0.170000 7.163 6.09e-12 ***
Age -0.012382 0.003084 -4.015 7.52e-05 ***
---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4868 on 301 degrees of freedom
Multiple R-squared: 0.05082, Adjusted R-squared: 0.04767
F-statistic: 16.12 on 1 and 301 DF, p-value: 7.525e-05

In this output for heart disease we can see the values of the intercept ''a'' and the
slope ''b'' for age. These a and b values define the line fitted through all the points
of the data: a = 1.217731 and b = -0.012382, so the model is

Y = a + bX + error

For a person 30 years old, the predicted value is

Heart disease = 1.217731 + (-0.012382 × 30) ≈ 0.846

You can see that there is a significant negative relation between Age and heart disease.
Moreover, the correlation coefficient between age and heart disease is -0.23, which
is a significant negative correlation.
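The plug-in calculation can be sketched in code (coefficients copied from the R output above; Python is used here purely for illustration):

```python
# Plug-in prediction from the fitted simple regression of heart
# disease on age (coefficients copied from the R output above).
b0, b1 = 1.217731, -0.012382

def predict(age):
    return b0 + b1 * age

y_30 = predict(30)   # ~ 0.846 for a 30-year-old
```

The negative slope means the predicted value decreases by about 0.012 for each additional year of age.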

Next, the linear relationship between ca and heart disease in the database of 303 cases
(165 persons infected, 138 not infected), where ca is the number of major vessels (0-3)
colored by fluoroscopy.
The output looks like:
lmoutput.Heart.Disease..Y = lm(output.Heart.Disease..Y ~ ca, data = Taraproject)
summary(lmoutput.Heart.Disease..Y)
Call:
lm(formula = output.Heart.Disease..Y ~ ca, data = Taraproject)
Residuals:

Min 1Q Median 3Q Max


-0.6839 -0.4928 0.3161 0.3161 1.0804
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.68393 0.03246 21.071 < 2e-16 ***
ca -0.19109 0.02587 -7.386 1.49e-12 ***
--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4597 on 301 degrees of freedom
Multiple R-squared: 0.1534, Adjusted R-squared: 0.1506
F-statistic: 54.56 on 1 and 301 DF, p-value: 1.492e-12

The linear relationship between sex and heart disease in the database of 303 cases
(165 persons infected, 138 not infected); the results are as follows:
lm(formula = output.Heart.Disease..Y ~ sex.x2, data = Taraproject)
Residuals:
Min 1Q Median 3Q Max
-0.7500 -0.4493 0.2500 0.5507 0.5507
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.75000 0.04894 15.324 < 2e-16 ***
sex.x2 -0.30072 0.05921 -5.079 6.68e-07 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4795 on 301 degrees of freedom
Multiple R-squared: 0.07893, Adjusted R-squared: 0.07587
F-statistic: 25.79 on 1 and 301 DF, p-value: 6.679e-07

The linear relationship between oldpeak and heart disease in the database is as follows:
lm(formula = output.Heart.Disease..Y ~ oldpeak.x3, data = Taraproject)
Residuals:
Min 1Q Median 3Q Max
-0.7369 -0.4409 0.2631 0.3371 1.0402
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.73692 0.03480 21.18 < 2e-16 ***
oldpeak.x3 -0.18504 0.02235 -8.28 4.09e-15 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4509 on 301 degrees of freedom
Multiple R-squared: 0.1855, Adjusted R-squared: 0.1828
F-statistic: 68.55 on 1 and 301 DF, p-value: 4.085e-15

The linear relationship between slope and heart disease in the database is as follows,
where slope is the slope of the peak exercise:

▪ Value 1: upsloping
▪ Value 2: flat
▪ Value 3: downsloping

lm(formula = output.Heart.Disease..Y ~ slope, data = Taraproject)


Residuals:

Min 1Q Median 3Q Max


-0.7127 -0.4327 0.2873 0.2873 0.8472
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.15276 0.06692 2.283 0.0231 *
slope 0.27999 0.04378 6.395 6.1e-10 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4688 on 301 degrees of freedom
Multiple R-squared: 0.1196, Adjusted R-squared: 0.1167
F-statistic: 40.9 on 1 and 301 DF, p-value: 6.102e-10

For Multiple linear regression: -


The visualization step for multiple regression is more difficult than for simple
regression, because we now have several predictors. OLS was applied to examine
the dependent and independent variables. OLS is a linear regression method used to
make predictions or detect relationships between dependent and independent variables.
We examine heart disease cases as a dependent variable with all independent variables
(Age, sex, ca, slope, oldpeak, and thal). This OLS model uses the equation below:
𝑦 = 𝑏0 + 𝑏1𝑥1 + 𝑏2𝑥2 + …

In this project, I will examine OLS in three case studies:

Firstly: I applied OLS to examine heart disease (dependent) with Age and oldpeak
(independent variables).

lm(formula = output.Heart.disease ~ age + oldpeak, data = Taraproject)


Residuals:
Min 1Q Median 3Q Max
-0.8739 -0.4233 0.1726 0.3437 1.0359
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.145357 0.156173 7.334 2.10e-12 ***
Age -0.007756 0.002893 -2.681 0.00774 **
Oldpeak -0.172299 0.022627 -7.615 3.47e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4464 on 300 degrees of freedom
Multiple R-squared: 0.2046, Adjusted R-squared: 0.1993
F-statistic: 38.58 on 2 and 300 DF, p-value: 1.233e-15

Y = a + b1X1 + b2X2

For a person 50 years old with oldpeak = 2:
Y = predicted heart disease = 1.145357 + (-0.007756 × 50) + (-0.172299 × 2) ≈ 0.413

Secondly: I applied OLS to examine heart disease (dependent) with Age, oldpeak,
and slope (independent variables):
lm(formula = output.Heart.disease..Y ~ age.x1 + oldpeak.x3 + slope,
data = Taraproject)
Residuals:
Min 1Q Median 3Q Max
-0.8984 -0.4097 0.1533 0.3396 1.0836
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.936416 0.182702 5.125 5.34e-07 ***
age -0.007384 0.002880 -2.564 0.0108 *
oldpeak -0.139124 0.027201 -5.115 5.62e-07 ***
slope 0.110222 0.050837 2.168 0.0309 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4437 on 299 degrees of freedom
Multiple R-squared: 0.2169, Adjusted R-squared: 0.209
F-statistic: 27.6 on 3 and 299 DF, p-value: 8.749e-16

Y = a + b1X1 + b2X2 + b3X3

For a person 60 years old with oldpeak = 2.2 and slope = 1:
Y = 0.936416 + (-0.007384 × 60) + (-0.139124 × 2.2) + (0.110222 × 1) ≈ 0.298

Thirdly: I applied OLS to examine heart disease (dependent) with Age, oldpeak,
slope, and ca (independent variables); the results are as follows:
lm(formula = output.Heart.disease ~ age + oldpeak + slope + ca, data =
Taraproject)
Residuals:
Min 1Q Median 3Q Max
-0.9105 -0.3486 0.1191 0.3057 0.9549

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.760471 0.175573 4.331 2.03e-05 ***
Age -0.003284 0.002814 -1.167 0.24411
Oldpeak -0.109840 0.026232 -4.187 3.72e-05 ***
slope 0.132490 0.048299 2.743 0.00645 **
ca -0.148855 0.025067 -5.938 8.01e-09 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4202 on 298 degrees of freedom
Multiple R-squared: 0.2997, Adjusted R-squared: 0.2903
F-statistic: 31.89 on 4 and 298 DF, p-value: < 2.2e-16
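The printed F statistic can also be recovered from R² via F = (R²/p) / ((1 − R²)/(n − p − 1)), with n = 303 observations and p = 4 predictors. A small Python check (illustrative; the values are copied from the R output above):

```python
# Recover the printed F statistic from R^2 for the four-predictor model:
# F = (R^2 / p) / ((1 - R^2) / (n - p - 1)), with n = 303, p = 4.
n, p, r2 = 303, 4, 0.2997
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))   # ~ 31.89, as printed
```

This identity ties the multiple R² reported by R to the overall F test of the model.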

Conclusion:
In this work we reviewed the linear regression model. Linear regression analysis is used
to predict the value of a variable based on the value of another variable. The variable
you want to predict is called the dependent variable, and the variable we are using to
predict the other variable's value is called the independent variable. We applied
simple linear regression and multiple linear regression to our dataset using R
programming.
Correlation coefficients and regression analysis work together to tell you which
relationships in your model are statistically significant and the nature of those
relationships. The coefficients describe the mathematical relationship between each
independent variable and the dependent variable, and the p-values for the coefficients
indicate whether these relationships are statistically significant in this work.

From this work we found that the correlation between age and heart disease is -0.225,
so we can say that there is a significant weak negative relationship between age and
heart disease. In this project, the results show that the correlation coefficient between
sex and heart disease is -0.281, suggesting a weak negative correlation between them.
Moreover, we found that the correlation coefficient between oldpeak and heart disease
is -0.431, a very strong negative correlation between oldpeak and heart disease.
The correlation coefficient between ca and heart disease is -0.392, which is
considered a very strong negative correlation between them. For this data analysis we
used R programming and Excel.
References:

[1] https://www.healthline.com/health/heart-disease/history#early-discoveries

[2] Golberg, M.A. and Cho, H.A., 2004. Introduction to Regression Analysis. WIT Press.

[3] Tranmer, M. and Elliot, M., 2008. Multiple Linear Regression. The Cathie Marsh
Centre for Census and Survey Research (CCSR), 5(5), pp.1-5.

[4] UCI Machine Learning Repository: Heart Disease Data Set.
https://archive.ics.uci.edu/ml/datasets/heart+disease

[5] Hogg, R.V. and Craig, A.T., 1995. Introduction to Mathematical Statistics, 5th
edition. Englewood Cliffs, New Jersey.

[6] Sykes, A.O., 1993. An Introduction to Regression Analysis.

Abstract (Kurdish)

In this work we review the regression model. We show the types of regression and give
some examples. We applied linear regression on a dataset containing about 304
instances, each having 13 features which are used to infer the presence (values 1, 2, 3, 4)
or absence (value 0) of heart disease. We take 6 variables for prediction (age, sex,
oldpeak, slope, thal, ca).

A linear regression model has been used for the predictions. The aim is to explore simple
linear regression models and multiple linear regression models for examining heart
disease with these variables. We found that the relation between age and heart disease is
-0.226, so we can say that there is a significant weak negative relationship between age
and heart disease. We found that the correlation coefficient between oldpeak and heart
disease is -0.431, a very strong negative correlation between oldpeak (depression) and
heart disease.
