Linear Regression and Correlation
linear regression and correlation cont’d...
Figure 1: A scatter plot of body mass index against hip circumference, for a sample of 412 women in a diet and health cohort study. The scatter of values appears to be distributed around a straight line; that is, the relationship between these two variables appears to be broadly linear. (The annotation in the figure marks the residual, or error term, e, for one subject.)
Figure 2: Scatter plot indicating the relationship between the heights of oldest sons and their fathers’ heights.
The fitted regression model is: BMI = α + b × HIP
Table 2: Dummy coding cont’d…
Table 4: Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .719a   .517       .480                6.850
2       .951b   .905       .889                3.162

a. Predictors: (Constant), dummy 1
b. Predictors: (Constant), dummy 1, dummy 2
Table 5 provides the unstandardized regression coefficients (B), the intercept (constant), and the standardized regression coefficients (β), which we can use to develop the model.

Table 5: Coefficients (Dependent Variable: DV_score)

Model           B        Std. Error   Beta     t        Sig.   95% CI for B (Lower, Upper)
1  (Constant)   19.000   2.166                 8.771    .000   (14.320, 23.680)
   dummy 1     -14.000   3.752        -.719   -3.731    .003   (-22.106, -5.894)
2  (Constant)   26.000   1.414                 18.385   .000   (22.919, 29.081)
   dummy 1     -21.000   2.000        -1.079  -10.500   .000   (-25.358, -16.642)
   dummy 2     -14.000   2.000        -.719   -7.000    .000   (-18.358, -9.642)
The intercept: a = ȳ − b x̄
The slope:
b = [nΣXY − (ΣX)(ΣY)] / [nΣX² − (ΣX)²] = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
Example 1: Heights of 10 fathers (X) together with their oldest sons (Y) are given below (in inches). Find the regression of Y on X.
b = [nΣXY − (ΣX)(ΣY)] / [nΣX² − (ΣX)²] = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
a = ȳ − b x̄
b = [10(45967) − (676 × 679)] / [10(45784) − (676)²]
  = (459670 − 459004) / (457840 − 456976)
  = 666 / 864
  = 0.77
a = ȳ − b x̄ = 679/10 − 0.77 × (676/10) = 67.9 − 52.05 = 15.85
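The slope and intercept of Example 1 can be reproduced from the summary sums in a short Python sketch (variable names here are my own):

```python
# Simple linear regression of sons' heights (Y) on fathers' heights (X),
# computed from the summary sums given in Example 1.
n = 10
sum_x, sum_y = 676, 679
sum_xy, sum_x2 = 45967, 45784

# Slope: b = (n*ΣXY - ΣX*ΣY) / (n*ΣX² - (ΣX)²)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: a = ȳ - b*x̄
a = sum_y / n - b * sum_x / n

print(round(b, 2), round(a, 2))  # 0.77 15.79
```

Note that carrying full precision for b gives a ≈ 15.79; the slide’s 15.85 comes from rounding b to 0.77 before computing the intercept.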
Figure: A perfect linear relationship between two variables, a perfect positive correlation (r = 1).
Figure: A perfect linear relationship, but a negative correlation (r = −1).
Strength of relationship
– Correlations from 0 to 0.25 (or 0 to −0.25) indicate little or no relationship;
– those from 0.25 to 0.50 (or −0.25 to −0.50) indicate a fair degree of relationship;
– those from 0.50 to 0.75 (or −0.50 to −0.75) a moderate to good relationship; and
– those greater than 0.75 (or −0.75 to −1.00) a very good to excellent relationship.
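These rule-of-thumb cut-offs can be encoded directly; a minimal sketch (the function name is my own):

```python
def describe_correlation(r):
    """Classify a correlation coefficient using the rule-of-thumb
    cut-offs above, based on the absolute value of r."""
    a = abs(r)
    if a <= 0.25:
        return "little or no relationship"
    elif a <= 0.50:
        return "fair degree of relationship"
    elif a <= 0.75:
        return "moderate to good relationship"
    else:
        return "very good to excellent relationship"

print(describe_correlation(-0.62))  # moderate to good relationship
```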
The significance of the correlation coefficient is tested with:
tcal = r √(n − 2) / √(1 − r²)
Its formula is:
r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}
Properties
– −1 ≤ r ≤ 1
– r is a pure number without units
– If r is close to +1, there is a strong positive relationship
– If r is close to −1, there is a strong negative relationship
– If r = 0, there is no linear correlation
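The computational formula above can be sketched in Python (function name my own; the test data are invented to show the r = 1 property):

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient via the computational formula
    r = (nΣXY - ΣXΣY) / sqrt((nΣX² - (ΣX)²)(nΣY² - (ΣY)²))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
    return num / den

# A perfectly linear, increasing relationship gives r = 1
print(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))  # 1.0
```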
Assumptions in correlation
– The assumptions needed to make inferences about the correlation coefficient are that the sample was randomly selected and that the two variables, X and Y, vary together in a joint distribution that is normally distributed (called the bivariate normal distribution).
Spearman’s rank correlation coefficient
– If either (or both) of the variables is ordinal, then Spearman’s rank correlation coefficient (usually denoted ρs in the population and rs in the sample) is appropriate. It is also appropriate when there are extreme values.
– This is a non-parametric measure.
– As with Pearson’s correlation coefficient, Spearman’s correlation coefficient varies from −1 to +1.
rs = 1 − 6Σdi² / [n(n² − 1)]
Subject   Rank X   Rank Y    d     d²
A         2        2         0     0
B         1        3        −2     4
C         4        4         0     0
D         5        6        −1     1
E         6        5         1     1
F         3        1         2     4

Σdi² = 10, n = 6.

rs = 1 − 6Σdi² / [n(n² − 1)] = 1 − 6(10) / [6(6² − 1)] = 1 − 60/210 = 1 − 0.29 = 0.71
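The worked example can be checked with a short Python sketch (function name my own; the ranks are those of subjects A through F above):

```python
def spearman_rs(rank_x, rank_y):
    """Spearman's rank correlation: rs = 1 - 6Σd² / (n(n² - 1))."""
    n = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Ranks from the worked example (subjects A-F)
rx = [2, 1, 4, 5, 6, 3]
ry = [2, 3, 4, 6, 5, 1]
print(round(spearman_rs(rx, ry), 2))  # 0.71
```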
The significance of rs can be tested with the same statistic:
tcal = r √(n − 2) / √(1 − r²)
Multiple linear regression
– Multivariate analysis refers to the analysis of data
that takes into account a number of explanatory
variables and one outcome variable
simultaneously.
– It allows for the efficient estimation of measures
of association while controlling for a number of
confounding factors.
– All types of multivariate analyses involve the
construction of a mathematical model to
describe the association between independent
and dependent variables.
12/18/2023
Multiple linear regression cont’d…
Multiple linear regression (we often refer to this
method as multiple regression) is an extension of the
most fundamental model describing the linear
relationship between two variables.
Regression equation for a linear relationship:
A linear relationship of n predictor variables, denoted X1, X2, . . ., Xn, to a single response variable, denoted Y, is described by a linear equation involving several variables.
The general linear equation (model) is:
Y = α + b1X1 + b2X2 + . . . + bnXn
• Where:
– The regression coefficients (or b1 . . . bn ) represent
the independent contributions of each explanatory
variable to the prediction of the dependent variable.
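A least-squares fit of this general model can be sketched as follows (the data here are invented for illustration, not from the lecture; NumPy is one of many libraries that will do):

```python
import numpy as np

# Illustrative data: Y is an exact linear function of two predictors,
# so the fit should recover alpha = 1.5, b1 = 2.0, b2 = -0.5.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = 1.5 + 2.0 * X1 - 0.5 * X2

# Design matrix with a leading column of ones for the intercept α
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least-squares solution of Y = α + b1*X1 + b2*X2
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
alpha, b1, b2 = coef
print(round(alpha, 2), round(b1, 2), round(b2, 2))  # 1.5 2.0 -0.5
```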
Assumptions
1. First of all, as it is evident in the name multiple
linear regression, it is assumed that the relationship
between the dependent variable and each
continuous explanatory variable is linear. We can
examine this assumption for any variable, by
plotting (i.e., by using bivariate scatter plots) the
residuals (the difference between observed values
of the dependent variable and those predicted by
the regression equation) against that variable.
Predicted and Residual Scores
– The regression line expresses the best prediction
of the dependent variable (Y), given the
independent variables (X).
– However, nature is rarely (if ever) perfectly
predictable, and usually there is substantial
variation of the observed points around the fitted
regression line.
– The deviation of a particular point from the
regression line (its predicted value) is called the
residual value.
Residual Variance and R-square
– The smaller the variability of the residual values
around the regression line relative to the overall
variability, the better is our prediction.
– For example, if there is no relationship between
the X and Y variables, then the ratio of the
residual variability of the Y variable to the
original variance is equal to 1.0.
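R-square is built from exactly this ratio: R² = 1 − (residual variability / total variability). A minimal sketch (function name my own):

```python
def r_square(y_obs, y_pred):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    ybar = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - ybar) ** 2 for o in y_obs)
    return 1 - ss_res / ss_tot

# Predicting every value with the mean explains nothing:
# the residual-to-total ratio is 1.0, so R² = 0.
y = [1.0, 2.0, 3.0, 4.0]
print(r_square(y, [2.5] * 4))  # 0.0
```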
N.B. A) The sources of variation in regression are:
i) Due to regression
ii) Residual (about regression)
B) The sum of squares due to regression (SSR)
over the total sum of squares (TSS) is the
proportion of the variability accounted for by the
regression model.
Therefore, the percentage variability accounted for
or explained by the regression is 100 times this
proportion.
Interpreting the multiple correlation coefficient (R)
– Customarily, the degree to which two or more predictors (independent or X variables) are related to the dependent (Y) variable is expressed in the multiple correlation coefficient R, which is the square root of R-square.
– R assumes values between 0 and 1. This is because no direction (sign) can be given to the correlation in the multivariate case. (Why?)
Multicollinearity
– This is a common problem in many multivariate
correlation analyses.
– Imagine that you have two predictors (X variables)
of a person's height:
1. weight in pounds and
2. weight in ounces.
Trying to decide which one of the two measures is a
better predictor of height would be rather silly.
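One common numerical diagnostic for this problem is the condition index, computed from the eigenvalues of the scaled cross-products matrix X′X. The sketch below is my own implementation of that idea (not SPSS code), using the pounds/ounces example:

```python
import numpy as np

def condition_indices(X):
    """Condition indices: sqrt(λ_max / λ_i) over the eigenvalues λ_i of
    X'X, after scaling each column of X to unit length."""
    Xs = X / np.linalg.norm(X, axis=0)       # scale columns to unit length
    eigvals = np.linalg.eigvalsh(Xs.T @ Xs)  # eigenvalues of X'X
    eigvals = np.clip(eigvals, 1e-12, None)  # guard against tiny negative rounding error
    return np.sqrt(eigvals.max() / eigvals)

# Weight in pounds and the same weight in ounces are perfectly collinear,
# so the largest condition index blows up (>> 30: serious collinearity).
pounds = np.array([120.0, 135.0, 150.0, 165.0, 180.0])
ounces = pounds * 16.0
X = np.column_stack([np.ones(5), pounds, ounces])
print(condition_indices(X).max() > 30)  # True
```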
A condition index greater than 15 indicates a possible problem, and an index greater than 30 suggests a serious problem with collinearity.

Collinearity Diagnostics (Dependent Variable: birth weight of the child (kgs)(X1))
Variance-proportion columns: (Constant); height of mother (cms)(X2); monthly family income (Birr)(X5); period of gestation (days)(X6); age of mother (years)(X3)

Model  Dimension  Eigenvalue  Condition Index  (Constant)  X2    X5    X6    X3
1      1          1.999       1.000            .00         .00
       2          .001        58.071           1.00        1.00
2      1          2.845       1.000            .00         .00   .01
       2          .154        4.294            .00         .00   .43
       3          .000        104.138          1.00        1.00  .56
3      1          3.829       1.000            .00         .00   .01   .00
       2          .170        4.741            .00         .00   .42   .00
       3          .000        116.493          .84         .19   .57   .03
       4          7.58E-005   224.782          .16         .81   .00   .97
4      1          4.806       1.000            .00         .00   .00   .00   .00
       2          .171        5.308            .00         .00   .41   .00   .00
       3          .023        14.410           .00         .00   .13   .00   .90
       4          .000        132.931          .87         .17   .45   .03   .03
       5          7.10E-005   260.214          .13         .83   .01   .96   .06
Example:
A popular radio talk show host has just received the latest
government study on public health care funding and has
uncovered a startling fact: As health care funding
increases, disease rates also increase! Cities that spend
more actually seem to be worse off than cities that spend
less!
The data in the government report yield a high, positive correlation between health care funding and disease rates, which seems to indicate that people would be much healthier if the government simply stopped putting money into health care programs.
Correlations (zero-order; no control variables)

                                           Funding        Diseases       Visits
                                           (amount        (rate per      (rate per
                                           per 100)       10,000)        10,000)
Health care funding      Correlation       1.000          .737           .964
(amount per 100)         Sig. (2-tailed)   .              .000           .000
                         df                0              48             48
Reported diseases        Correlation       .737           1.000          .762
(rate per 10,000)        Sig. (2-tailed)   .000           .              .000
                         df                48             0              48
Visits to health care    Correlation       .964           .762           1.000
providers (per 10,000)   Sig. (2-tailed)   .000           .000           .
                         df                48             48             0

Cells contain zero-order (Pearson) correlations.
The zero-order correlation between health care funding and disease rates is, indeed, both fairly high (0.737) and statistically significant (p < 0.001).

Partial correlations, controlling for visits to health care providers (rate per 10,000):

                                           Funding        Diseases
Health care funding      Correlation       1.000          .013
(amount per 100)         Sig. (2-tailed)   .              .928
                         df                0              47
Reported diseases        Correlation       .013           1.000
(rate per 10,000)        Sig. (2-tailed)   .928           .
                         df                47             0
The partial correlation controlling for the rate of visits to health care providers, however, is negligible (0.013) and not statistically significant (p = 0.928).
Going back to the zero-order correlations, you can see that both health care funding rates and reported disease rates are highly positively correlated with the control variable, the rate of visits to health care providers.
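The partial correlation reported above can be approximated directly from the three zero-order correlations using the standard first-order formula (a sketch; the function name is my own, and the small discrepancy from the printed .013 comes from rounding of the inputs):

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y controlling for Z:
    r_xy.z = (r_xy - r_xz*r_yz) / sqrt((1 - r_xz²)(1 - r_yz²))."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Zero-order correlations from the table: funding-diseases .737,
# funding-visits .964, diseases-visits .762
print(round(partial_corr(0.737, 0.964, 0.762), 3))  # 0.014
```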
Linear Regression Variable Selection Methods
Method selection allows you to specify how
independent variables are entered into the analysis.
Using different methods, you can construct a variety
of regression models from the same set of variables.
– Enter (Regression): A procedure for variable
selection in which all variables in a block are
entered in a single step.
– Stepwise: At each step, the independent variable
not in the equation which has the smallest
probability of F is entered, if that probability is
sufficiently small. Variables already in the
regression equation are removed if their
probability of F becomes sufficiently large. The
method terminates when no more variables are
eligible for inclusion or removal.
– Remove: A procedure for variable selection in
which all variables in a block are removed in a
single step.
Multiple linear regression cont’d…
Notations:
BW = Birth weight (kgs) of the child =X1
HEIGHT = Height of mother (cms) = X2
AGEMOTH = Age of mother (years) = X3
AGEFATH = Age of father (years) = X4
FAMINC = Monthly family income (Birr) = X5
GESTAT = Period of gestation (days) = X6
Answer the following questions based on the above
data
1. Check the association of each predictor with the
dependent variable.
2. Fit the full regression model
3. Fit the condensed regression model
4. What do you understand from your answers in parts 1,
2 and 3 ?
5. What is the proportion of variability accounted for
by the regression?
6. Compute the multiple correlation coefficient
7. Predict the birth weight of a baby born alive from a
woman aged 30 years and with the following
additional characteristics;
– height of mother =170 cm
– age of father =40 years
– monthly family income = 600 Birr
– period of gestation = 275 days
8. Estimate the birth weight of a baby born alive from
a woman with the same characteristics as in “7"
but with a mother's age of 49 years.
Thank you!