Lecture 8 Correlation and Linear Regression
Scatter Plot Examples
[Figure: scatter plots of y against x illustrating relationships of varying strength and direction, including one panel with no relationship at all]
Correlation Coefficient
Unit free
[Figure: scatter plots illustrating r = -1, r = -0.6, r = 0, r = +0.3, and r = +1]
Calculating the Correlation Coefficient
Sample correlation coefficient:

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} $$

where:
r = sample correlation coefficient
n = sample size
x = value of the 'independent' variable
y = value of the 'dependent' variable
Example
Weight gained (y) against diet (x), with n = 8 observations.
[Figure: scatter plot of weight gained (y, scale 0-60) against diet (x, scale 0-14)]

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} = \frac{8(3142) - (73)(321)}{\sqrt{\left[8(713) - (73)^2\right]\left[8(14111) - (321)^2\right]}} = 0.886 $$

Testing the significance of r:

$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{.886}{\sqrt{\dfrac{1 - .886^2}{8 - 2}}} = 4.68 $$
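The calculation above can be reproduced from the summary sums alone (the raw data points are not shown on the slide). A minimal Python sketch, assuming only those sums:

```python
import math

# Summary sums from the example: n, Σx, Σy, Σxy, Σx², Σy²
n, sx, sy, sxy, sxx, syy = 8, 73, 321, 3142, 713, 14111

# Sample correlation coefficient (computational formula)
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))

# t statistic for H0: no correlation, with n - 2 degrees of freedom
t = r / math.sqrt((1 - r**2) / (n - 2))

print(f"r = {r:.3f}")  # 0.886
print(f"t = {t:.2f}")  # 4.69 here; the slide's 4.68 rounds r to .886 first
```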
Correlation cont…
Correlation coefficients below 0.35 show only a slight relationship between the variables.
The first model form is linear in the parameters, but the second one is not.
Simple Linear Regression Model
Only one independent variable, x:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

Here $\beta_0 + \beta_1 x$ is the linear component and $\varepsilon$ is the random error component.
Linear Regression Assumptions
The relationship between the two variables, x and y, is linear
Independent observations
No outlier distortion
Assumptions viewed pictorially
LINE (Linear, Independent, Normal and Equal variance) assumptions
[Figure: regression line $\mu_{y|x} = \alpha + \beta x$ with identical normal distributions of errors, all centered on the regression line: $y|x \sim N(\mu_{y|x}, \sigma_{y|x})$]
Population Linear Regression

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

[Figure: for a given $x_i$, the observed value of y, the predicted value of y, and the random error $\varepsilon_i$ for that x value; slope $= \beta_1$, intercept $= \beta_0$]
Estimated Regression Model
The sample regression line, $\hat{y} = b_0 + b_1 x$, provides an estimate of the population regression line. Its coefficients are chosen to minimize the sum of squared errors:

$$ \sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2 $$
The Least Squares Equation
Differentiating the sum of squared errors with respect to $b_0$ and $b_1$ and setting the derivatives to zero, we find:

$$ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} \qquad \text{and} \qquad b_0 = \bar{y} - b_1 \bar{x} $$
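As a quick illustration, a Python sketch that applies these formulas to a small made-up data set (the numbers are hypothetical, not from the lecture) and cross-checks the result against numpy's built-in least-squares fit:

```python
import numpy as np

# Hypothetical data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])
n = len(x)

# Slope: b1 = (Σxy - ΣxΣy/n) / (Σx² - (Σx)²/n)
b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x**2) - np.sum(x)**2 / n)

# Intercept: b0 = ȳ - b1·x̄
b0 = y.mean() - b1 * x.mean()

print(f"y-hat = {b0:.3f} + {b1:.3f} x")
print(np.polyfit(x, y, 1))  # returns [b1, b0]; should agree
```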
Interpretation of the Slope and the Intercept

$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{17.57 - (20.35)(16.27)/20}{22.30 - 414.12/20} = 0.643 $$

SPSS output:

                   Unstandardized Coefficients   Standardized Coefficients
Model              B          Std. Error         Beta                        t       Sig.
(Constant)         0.160      .077                                           2.065   .054
i.e. $\sum (y - \hat{y})^2$ is minimized

where:
$\bar{y}$ = average value of the dependent variable
$y$ = observed values of the dependent variable
$\hat{y}$ = estimated value of y for the given x value
Explained and Unexplained Variation
[Figure: decomposition of the deviation of y from $\bar{y}$ at a given value $X_i$]
Coefficient of Determination, R2

$$ R^2 = \frac{SSR}{SST}, \qquad 0 \le R^2 \le 1 $$

Coefficient of Determination, R2
Coefficient of determination:

$$ R^2 = \frac{SSR}{SST} = \frac{\text{sum of squares explained by regression}}{\text{total sum of squares}} $$

In simple linear regression, $R^2 = r^2$

where:
R2 = coefficient of determination
r = simple correlation coefficient
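A short Python sketch, continuing the hypothetical data from the least-squares example above, that computes $R^2$ as SSR/SST and confirms it equals the squared correlation in simple linear regression:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean())**2)      # total sum of squares
ssr = np.sum((y_hat - y.mean())**2)  # sum of squares explained by regression
r2 = ssr / sst

r = np.corrcoef(x, y)[0, 1]          # simple correlation coefficient
print(r2, r**2)                      # the two values agree
```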
Coefficient of Determination, R2 cont…
[Figure: partition of the total variation into explained variation and unpredictable variation]
Assumptions of the Linear Regression Model
1. Linear functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model (no omitted variables)*
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual variance)
7. No multicollinearity*
8. No autocorrelation of the errors
9. No outlier distortion
(Most of these, except the 4th and 7th, were mentioned in the simple linear regression model assumptions.)
Explanation of some of the Assumptions
Specification of the model (no omitted variables)
In multivariable models (including multiple regression), to define the best prediction rule for the outcome variable, one must select variables for the final (best-explaining) model from a set of candidate variables.
Option 1: variable selection based on significance in univariate models (here simple linear regression): all variables that show a significant effect in univariate models are included. The significance level is usually set between 0.15 and 0.25; a p-value cutoff of 0.2, the midpoint of that range, is commonly used.
Option 2: variable selection based on significance in a multivariable model: starting with a multivariable model including all candidate variables, one eliminates non-significant effects one by one until all effects in the model are significant (backward/stepwise/forward selection); see the sketch after this option.
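A minimal sketch of the backward-elimination variant of Option 2, assuming the data sit in a pandas DataFrame with one outcome column and a list of candidate predictor columns; the function name, data layout, and the 0.05 threshold are illustrative assumptions, not from the lecture:

```python
import statsmodels.api as sm

def backward_eliminate(df, outcome, candidates, alpha=0.05):
    """Refit, dropping the least significant predictor, until all p-values < alpha."""
    kept = list(candidates)
    while kept:
        X = sm.add_constant(df[kept])
        model = sm.OLS(df[outcome], X).fit()
        pvals = model.pvalues.drop("const")  # ignore the intercept
        worst = pvals.idxmax()               # least significant effect
        if pvals[worst] < alpha:
            return model                     # every remaining effect is significant
        kept.remove(worst)                   # eliminate it and refit
    return None                              # no candidate survived
```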
Explanation of some of the Assumptions
Option 3: the 'Purposeful Selection' algorithm. This variable selection procedure selects variables not only based on their significance in a multivariable model, but also on whether their omission from the multivariable model would cause the regression coefficients of other variables in the model to change by more than, say, 20%.
Option 4: variable selection based on subject-matter knowledge: this is the best way to select variables, as it is not data-driven and is therefore considered to yield unbiased results.
Among the 'automatic' selection procedures (options 1-3), the third is currently state-of-the-art and should be applied.
Explanation of some of the Assumptions
Multicollinearity prevents proper parameter estimation: when the independent variables are highly correlated with one another, their separate contributions to y cannot be estimated reliably.
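One common diagnostic for multicollinearity (not shown in the lecture) is the variance inflation factor. A sketch, assuming the predictors are the columns of a pandas DataFrame:

```python
import pandas as pd
import statsmodels.api as sm

def vif(X: pd.DataFrame) -> pd.Series:
    """VIF for each predictor: 1 / (1 - R_j²), where R_j² comes from
    regressing predictor j on all the other predictors."""
    out = {}
    for col in X.columns:
        others = sm.add_constant(X.drop(columns=col))
        r2 = sm.OLS(X[col], others).fit().rsquared
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)  # values above roughly 10 often flag collinearity
```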
Simple vs. Multiple Regression

$$ \text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} $$

where n = sample size and p = number of parameters in the model.
As the sample size increases above 20 cases per variable, the adjustment is less needed (and vice versa).
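A one-function Python sketch of the formula; the sample values are made up to show how the penalty shrinks as n grows relative to p:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R² = 1 - (1 - R²)·(n - 1)/(n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative values only: the adjustment matters less for larger n
print(adjusted_r2(0.80, n=25, p=2))   # ≈ 0.782
print(adjusted_r2(0.80, n=250, p=2))  # ≈ 0.798
```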
Example:
Y = α + β1X1 + β2X2 + … + βkDk + βk+1Dk+1; here the kth variable has a nominal scale with three categories, hence it gets two dummies, Dk and Dk+1, which must both be included in the model as separate terms (see the coding sketch below).
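A short sketch of how such dummy coding is typically produced in Python; the variable name and category labels are hypothetical:

```python
import pandas as pd

# Hypothetical three-category nominal variable
df = pd.DataFrame({"diet": ["low", "medium", "high", "low", "high"]})

# Three categories -> two dummies; drop_first keeps one reference category
dummies = pd.get_dummies(df["diet"], prefix="diet", drop_first=True)
print(dummies)
# Both dummy columns enter the regression together, and each coefficient
# is interpreted relative to the dropped reference category.
```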
Intercorrelation or collinearity
If the two independent variables are uncorrelated,
we can uniquely partition the amount of variance
in Y due to X1 and X2 and bias is avoided.
Coefficients(a)

                   Unstandardized Coefficients   Standardized Coefficients                    95% Confidence Interval for B
Model              B          Std. Error         Beta                        t       Sig.     Lower Bound    Upper Bound
1  (Constant)      15.625     3.283                                          4.760   .000     8.728          22.522
   sex             16.594     3.670              .729                        4.521   .000     8.883          24.305
2  (Constant)      6.209      5.331                                          1.165   .260     -5.039         17.457
   sex             10.130     4.517              .445                        2.243   .039     .600           19.659
   age             0.309      .145               .424                        2.136   .047     .004           .614

a. Dependent Variable: percentage of body fat relative to body weight
Interpretations
Keeping all other variables constant, females have 10.13% more body fat relative to body weight than males.