Lec 9: Linear Correlation and Linear Regression
By
Md. Siddikur Rahman, PhD
Associate Professor
Department of Statistics
Begum Rokeya University, Rangpur.
Scatter Plots and Correlation
A scatter plot (or scatter diagram) is used to show the relationship between two variables.
Correlation analysis is used to measure the strength of the association (linear relationship) between two variables.
Scatter Plot Examples
[Figure: scatter plots illustrating linear and curvilinear relationships between x and y]
Scatter Plot Examples
(continued)
[Figure: scatter plots illustrating strong and weak linear relationships]
Scatter Plot Examples
(continued)
[Figure: scatter plot showing no relationship between x and y]
Recall: Covariance
$$\operatorname{cov}(x,y)=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}$$
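As a sketch, the covariance formula above translates directly into code. A minimal Python version, assuming the data are plain lists of equal length:

```python
def sample_cov(x, y):
    """Sample covariance: sum((x_i - x̄)(y_i - ȳ)) / (n - 1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar)
               for xi, yi in zip(x, y)) / (n - 1)
```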
Interpreting Covariance
[Figure: six scatter plots of Y against X illustrating correlations of r = −1, r = −0.6, r = 0, r = +1, r = +0.3, and r = 0]
Calculating the Correlation Coefficient
$$r=\frac{\operatorname{cov}(x,y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}}
=\frac{\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}}
{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}\,\sqrt{\dfrac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}}}
=\frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\left[\sum(x-\bar{x})^2\right]\left[\sum(y-\bar{y})^2\right]}}
=\frac{SP(x,y)}{\sqrt{SS(x)\,SS(y)}}$$

where SP(x, y) is the numerator of the covariance, and SS(x) and SS(y) are the numerators of the variances.
Calculating the Correlation Coefficient
where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
Calculation Example
Tree Height (y)   Trunk Diameter (x)   xy        y²         x²
35                8                    280       1225       64
49                9                    441       2401       81
27                7                    189       729        49
33                6                    198       1089       36
60                13                   780       3600       169
21                7                    147       441        49
45                11                   495       2025       121
51                12                   612       2601       144
Σ = 321           Σ = 73               Σ = 3142  Σ = 14111  Σ = 713
Calculation Example
(continued)
$$r=\frac{n\sum xy-\sum x\sum y}{\sqrt{\left[n\left(\sum x^2\right)-\left(\sum x\right)^2\right]\left[n\left(\sum y^2\right)-\left(\sum y\right)^2\right]}}
=\frac{8(3142)-(73)(321)}{\sqrt{\left[8(713)-(73)^2\right]\left[8(14111)-(321)^2\right]}}=0.886$$

r = 0.886 → relatively strong positive linear association between x and y.

[Figure: scatter plot of Tree Height, y against Trunk Diameter, x for the eight trees]
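As a check, a short Python sketch (standard library only) reproduces r = 0.886 from the table above:

```python
import math

x = [8, 9, 7, 6, 13, 7, 11, 12]       # trunk diameter
y = [35, 49, 27, 33, 60, 21, 45, 51]  # tree height
n = len(x)

sx, sy = sum(x), sum(y)                     # 73, 321
sxy = sum(xi * yi for xi, yi in zip(x, y))  # 3142
sxx = sum(xi ** 2 for xi in x)              # 713
syy = sum(yi ** 2 for yi in y)              # 14111

# Computational formula for the correlation coefficient.
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 3))  # 0.886
```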
Excel Correlation Output
Tools / Data Analysis / Correlation…
Manually: =CORREL(array1, array2)
Correlation between
Tree Height and Trunk Diameter
Introduction to
Regression Analysis
Regression analysis is used to:
◦ Predict the value of a dependent variable based on the
value of at least one independent variable
◦ Explain the impact of changes in an independent variable
on the dependent variable
Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain
the dependent variable
y = β0 + β1x + ε
[Figure: population regression model y = β0 + β1x + ε, showing the observed value of y for xi, the predicted value on the line, the random error εi for that x value, the slope β1, and the intercept β0]
Simple Linear Regression Equation
(Prediction Line)
The simple linear regression equation provides an estimate of the
population regression line
$$\hat{Y}_i=\hat{b}_0+\hat{b}_1X_i$$

where Ŷi is the estimated (predicted) value of Y for observation i, b̂0 is the estimate of the regression intercept, b̂1 is the estimate of the regression slope, and Xi is the value of X for observation i.
The individual random error terms ei have a mean of zero
Least Squares Criterion
b0 and b1 are obtained by finding the values of
b0 and b1 that minimize the sum of the squared
residuals
$$SSE=f(\hat{b}_0,\hat{b}_1)=\sum e^2=\sum\left(y-\hat{y}\right)^2=\sum\left(y-(\hat{b}_0+\hat{b}_1X)\right)^2$$

Setting the partial derivatives of SSE to zero yields the normal equations, one of which is

$$\hat{b}_0\sum_{i=1}^{n}X_i+\hat{b}_1\sum_{i=1}^{n}X_i^2=\sum_{i=1}^{n}Y_iX_i$$
The Least Squares Equation
The formulas for b̂1 and b̂0 are:

$$\hat{b}_1=\frac{\sum(x-\bar{x})(y-\bar{y})}{\sum(x-\bar{x})^2}$$

algebraic equivalent for b̂1:

$$\hat{b}_1=\frac{\sum xy-\dfrac{\sum x\sum y}{n}}{\sum x^2-\dfrac{\left(\sum x\right)^2}{n}}$$

and

$$\hat{b}_0=\bar{y}-\hat{b}_1\bar{x}$$
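A minimal Python sketch of these formulas (the function name `least_squares` is illustrative, not from the slides):

```python
def least_squares(x, y):
    """Slope and intercept via the algebraic-equivalent formulas above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi ** 2 for xi in x)
    b1 = (sxy - sx * sy / n) / (sxx - sx ** 2 / n)  # slope
    b0 = sy / n - b1 * (sx / n)                     # intercept = ȳ − b1·x̄
    return b0, b1
```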
Age:    23   23   27   27   39   41   45   49   50   53   53   54   56   57   58   58   60   61
% Fat:  9.5  27.9 7.8  17.8 31.4 25.9 27.4 25.2 31.1 34.7 42   29.1 32.5 30.3 33   33.8 41.1 34.5
X (Age)   y (% Fat)   X²      XY
23        9.5         529     218.5
23        27.9        529     641.7
27        7.8         729     210.6
27        17.8        729     480.6
39        31.4        1521    1224.6
41        25.9        1681    1061.9
45        27.4        2025    1233.0
49        25.2        2401    1234.8
50        31.1        2500    1555.0
53        34.7        2809    1839.1
53        42          2809    2226.0
54        29.1        2916    1571.4
56        32.5        3136    1820.0
57        30.3        3249    1727.1
58        33          3364    1914.0
58        33.8        3364    1960.4
60        41.1        3600    2466.0
61        34.5        3721    2104.5

ΣX = 834,  Σy = 515,  ΣX² = 41612,  ΣXY = 25489.2
Example
n = 18, Σx = 834, Σy = 515, Σx² = 41612, Σxy = 25489.2

$$S_{xx}=\sum x^2-\frac{\left(\sum x\right)^2}{n}=41612-\frac{834^2}{18}=2970$$

$$S_{xy}=\sum xy-\frac{\left(\sum x\right)\left(\sum y\right)}{n}=25489.2-\frac{(834)(515)}{18}=1627.53$$
Example
$$\hat{b}_1=b=\frac{S_{xy}}{S_{xx}}=\frac{1627.53}{2970}=0.54799$$

$$\hat{b}_0=a=\bar{y}-b\bar{x}=\frac{515}{18}-0.54799\cdot\frac{834}{18}=3.2209$$

ŷ = 3.22 + 0.548x

To predict the average % Fat for, say, a 45-year-old person:

ŷ = 3.22 + 0.548(45) = 27.9
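The whole worked example can be verified in a few lines of Python (standard library only; variable names are illustrative):

```python
age = [23, 23, 27, 27, 39, 41, 45, 49, 50,
       53, 53, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1,
       34.7, 42, 29.1, 32.5, 30.3, 33, 33.8, 41.1, 34.5]
n = len(age)

s_xx = sum(a ** 2 for a in age) - sum(age) ** 2 / n                    # 2970
s_xy = sum(a * f for a, f in zip(age, fat)) - sum(age) * sum(fat) / n  # 1627.53

b1 = s_xy / s_xx                        # 0.548
b0 = sum(fat) / n - b1 * sum(age) / n   # 3.22
print(round(b0 + b1 * 45, 1))           # predicted % Fat at age 45 ≈ 27.9
```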
Simple Linear Regression Example
ANOVA
             df    SS            MS            F         Significance F
Regression    1    18934.9348    18934.9348    11.0848   0.01039
Residual      8    13665.5652     1708.1957
Total         9    32600.5000
[Figure: scatter plot of the dependent variable against Square Feet with the fitted regression line; estimated intercept = 98.248, slope = 0.10977]
Residual Analysis
The residual for observation i, eᵢ = Yᵢ − Ŷᵢ, is the difference between its observed and predicted value.
Check the assumptions of regression by examining the
residuals
◦ Examine for linearity assumption
◦ Examine for constant variance for all levels of X (homoscedasticity)
◦ Evaluate normal distribution assumption
◦ Evaluate independence assumption
Residual Analysis for Linearity
[Figure: paired scatter and residual plots contrasting "Not Linear" (curved residual pattern) with "Linear ✓" (random scatter around zero)]
Residual Analysis for
Homoscedasticity
[Figure: paired scatter and residual plots contrasting non-constant variance with constant variance ✓]

Residual Analysis for Independence
[Figure: residual plots against X contrasting "Not Independent" (systematic pattern) with "Independent ✓" (random scatter)]
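A sketch of how such residual plots can be produced, assuming matplotlib is available and that `b0`, `b1` come from a fitted line as above:

```python
import matplotlib.pyplot as plt

def residual_plot(x, y, b0, b1):
    """Plot residuals against x to eyeball linearity, constant variance,
    and independence: a random band around zero is the good case."""
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    plt.scatter(x, residuals)
    plt.axhline(0, linestyle="--")  # residuals should scatter randomly about 0
    plt.xlabel("x")
    plt.ylabel("residuals")
    plt.show()
```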
Explained and Unexplained Variation
[Figure: decomposition of the variation in y at a point Xi into explained and unexplained parts]
Coefficient of Determination, R2
The coefficient of determination is the portion of
the total variation in the dependent variable that is
explained by variation in the independent variable
$$R^2=\frac{SSR}{SST}\qquad\text{where }0\le R^2\le 1$$
Coefficient of Determination, R2
(continued)
Coefficient of determination:

$$R^2=\frac{SSR}{SST}=\frac{\text{sum of squares explained by regression}}{\text{total sum of squares}}$$

For simple linear regression, R² = r².
where:
R2 = Coefficient of determination
r = Simple correlation coefficient
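A quick numeric check, using the SSR and SST from the ANOVA table above:

```python
SSR = 18934.9348   # sum of squares explained by regression
SST = 32600.5000   # total sum of squares
R2 = SSR / SST
print(round(R2, 4))  # 0.5808; for simple regression this equals r²
```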
Examples of Approximate
R2 Values
[Figure: scatter plots with all points exactly on the regression line: R² = 1 (perfect linear relationship)]
Examples of Approximate
R2 Values
(continued)
[Figure: scatter plot with points scattered loosely around the regression line: 0 < R² < 1]
Examples of Approximate
R2 Values
(continued)
[Figure: scatter plot with no linear relationship between x and y: R² = 0]
Outliers in linear regression
[Figure: regression fits distorted by an outlier with an extreme X value and by an outlier with an extreme Y value]

The residuals measure how far each observation lies from the fitted line:

$$y_1-\hat{y}_1=y_1-(a+bx_1)$$
$$y_2-\hat{y}_2=y_2-(a+bx_2)$$
$$\vdots$$
$$y_n-\hat{y}_n=y_n-(a+bx_n)$$
Outlier In Y-direction
Standardized Residuals:
Standardized residuals are defined as

$$d_i=\frac{\hat{e}_i}{\hat{\sigma}},\quad i=1,2,\ldots,n$$

where

$$\hat{\sigma}^2=\frac{1}{n-p}\sum_{i=1}^{n}\hat{e}_i^2$$

If |dᵢ| > 3, the corresponding observation is said to be an outlier.
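A minimal Python sketch of this rule (assuming `e` holds the fitted residuals and `p` the number of model parameters, p = 2 for simple linear regression):

```python
import math

def standardized_residuals(e, p):
    """Standardized residuals d_i = e_i / σ̂, flagging |d_i| > 3 as outliers."""
    n = len(e)
    sigma_hat = math.sqrt(sum(ei ** 2 for ei in e) / (n - p))  # σ̂
    d = [ei / sigma_hat for ei in e]
    outliers = [i for i, di in enumerate(d) if abs(di) > 3]
    return d, outliers
```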
Outlier In Y-direction
Studentised Residuals:
Studentised residuals are defined as

$$r_i=\frac{\hat{e}_i}{\hat{\sigma}\sqrt{h_{ii}}}=\frac{\hat{e}_i}{\hat{\sigma}\sqrt{1-w_{ii}}},\quad i=1,2,\ldots,n$$

where

$$w_{ii}=\frac{1}{n}+\frac{(x_i-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2},\quad i=1,2,\ldots,n$$

and the variance estimate that leaves out case i is

$$S_{(i)}^2=\frac{1}{n-p-1}\sum_{j\neq i}\left[y_j-x_j^{T}\hat{b}_{(-i)}\right]^2$$
Outlier In Y-direction
ê₍₋ᵢ₎ is the residual for the i-th case when the model is fitted with that case deleted, and the average leverage is h̄ = p/n.
Outlier In X-direction
An observation is flagged as an outlier in the X-direction (a high-leverage point) if

$$h_{ii}>\frac{3p}{n}$$
Outlier In X-direction
Huber’s suggestions: Huber (1981) suggested breaking the range of hᵢᵢ into three intervals: observations with hᵢᵢ ≤ 0.2 are safe, those with 0.2 < hᵢᵢ ≤ 0.5 are risky, and values of hᵢᵢ > 0.5 should be avoided.
Influential Outlier
Influence point:
Point A has a moderately unusual x-coordinate, and its y value is unusual as well.
An influential point has a noticeable impact on the model coefficients in that it pulls the regression model in its direction.
Influential Outlier
Cook’s distance:
$$CD_i=\frac{\left[\hat{b}_{(-i)}-\hat{b}\right]^{T}\left(X^{T}X\right)\left[\hat{b}_{(-i)}-\hat{b}\right]}{p\,\hat{\sigma}^2}$$
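In practice these diagnostics need not be computed by hand. As a sketch, statsmodels (an assumption; the slides do not prescribe software) exposes leverage, studentised residuals, and Cook's distance for a fitted OLS model:

```python
import numpy as np
import statsmodels.api as sm

# Tree data from the earlier correlation example, reused for illustration.
x = np.array([8, 9, 7, 6, 13, 7, 11, 12], dtype=float)
y = np.array([35, 49, 27, 33, 60, 21, 45, 51], dtype=float)

results = sm.OLS(y, sm.add_constant(x)).fit()
infl = results.get_influence()

leverage = infl.hat_matrix_diag            # h_ii; flag h_ii > 3p/n
student = infl.resid_studentized_internal  # studentised residuals
cooks_d, _ = infl.cooks_distance           # Cook's distance per observation
```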