Correlation and Regression
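The sample correlation coefficient of the pairs $(x_1, y_1), \dots, (x_n, y_n)$ is
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right),$$
where $\bar{x}$ and $s_x$ are the sample mean and sample standard deviation of $x_1, \dots, x_n$, and similarly $\bar{y}$ and $s_y$ for $y_1, \dots, y_n$. Note that $r$ is the sum of the products of the standardized scores, divided by $n - 1$.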
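A useful formula for computational purposes is
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}},$$
where
$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = (n-1) s_x^2,$$
$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = (n-1) s_y^2,$$
$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}.$$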
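As a minimal sketch, the computational formula can be coded directly. The function name `pearson_r` and the toy data are illustrative assumptions, not part of the notes.

```python
import math

def pearson_r(x, y):
    """Sample correlation via r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(v * v for v in x) - sum_x ** 2 / n
    s_yy = sum(v * v for v in y) - sum_y ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical toy data lying exactly on the line y = 2x + 1
r_line = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])
print(round(r_line, 6))
```

Because the toy points lie exactly on a line with positive slope, the result should be 1 up to rounding.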
Remarks:
(i) The standardized score $(x_i - \bar{x})/s_x$ says how many SDs the observation $x_i$ lies above or below $\bar{x}$.
(ii) The correlation $r$ has no unit.
(iii) The measure $r$ is called Pearson's sample correlation coefficient.
Properties of Correlation
Note both variables have to be quantitative.
- $r(x, y) = r(y, x)$.
- $-1 \le r \le 1$, and $r$ has no units. (Hint: the proof uses the covariance inequality.)
- r measures the extent of linear relationship between x and y
and does not capture the non-linear relationship.
- Variables may be strongly associated, but still may have
small r, if the association is not linear.
- The sign of the correlation gives the direction of the association: $r < 0$ indicates negative association and $r > 0$ indicates positive association.
- The value of $r$ does not depend on the units of measurement of either variable. That is, it is not affected by shifting or scaling the variables, because $r(ax + b, cy + d) = r(x, y)$ for $a, c$ positive and $b, d$ real.
- $r$ is strongly affected by a few outlying observations.
- $r = \pm 1$ only when all points $(x_i, y_i)$ lie on a straight line.
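Several of these properties can be checked numerically. The sketch below uses hypothetical data and an illustrative helper `pearson_r`; it checks symmetry, invariance under shifting and positive rescaling, and that a perfect but non-linear relationship can give $r = 0$.

```python
import math

def pearson_r(x, y):
    """Sample correlation from deviations about the means."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    s_xx = sum((a - xbar) ** 2 for a in x)
    s_yy = sum((b - ybar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical, roughly linear data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                              # symmetry
r_scaled = pearson_r([2.54 * a + 10 for a in x],    # shift and positive scale
                     [0.5 * b - 3 for b in y])

# Perfect but non-linear association: y = x^2 on a symmetric range
x_sym = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_quad = pearson_r(x_sym, [a * a for a in x_sym])

print(r_xy, r_yx, r_scaled, r_quad)
```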
Example 5 (Ex 12.15): An accurate assessment of soil productivity is critical to rational land-use planning. The following data give the corn yield (X) and peanut yield (Y), in mT/Ha, for eight types of soil.
X 2.4 3.4 4.6 3.7 2.2 3.3 4.0 2.1
Y 1.33 2.12 1.80 1.65 2.00 1.76 2.11 1.63
Find if there is any association between corn yield and peanut yield.
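Solution: With $\sum x_i = 25.7$, $\sum y_i = 14.40$, $\sum x_i^2 = 88.31$, $\sum y_i^2 = 26.4324$ and $\sum x_i y_i = 46.856$, we get $S_{xx} = 5.749$, $S_{yy} = 0.5124$ and $S_{xy} = 0.596$.
Hence $r = 0.596 / \sqrt{(5.749)(0.5124)} = 0.347$.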
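The calculation for this example can be reproduced from the table with a short script. It is only a sketch and assumes that the X row is the corn yield and the Y row is the peanut yield.

```python
import math

# Yields in mT/Ha, copied from the table above
# (assumption: X row = corn yield, Y row = peanut yield)
x = [2.4, 3.4, 4.6, 3.7, 2.2, 3.3, 4.0, 2.1]
y = [1.33, 2.12, 1.80, 1.65, 2.00, 1.76, 2.11, 1.63]
n = len(x)

s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
s_yy = sum(b * b for b in y) - sum(y) ** 2 / n
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"S_xx={s_xx:.4f}  S_yy={s_yy:.4f}  S_xy={s_xy:.4f}  r={r:.3f}")
```

The value of $r$ comes out near 0.35, a fairly weak positive linear association.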
Example 6. The following data give the marks on the first midterm (x) and the second midterm (y) of 9 students from 3 sections:
8 A.M: (70, 60), (72, 83), (94, 85)
Noon : (80, 72), (60, 74), (55, 58)
Evening: (45, 63), (50, 40), (35, 54)
(a) Find the correlation coefficient between x and y.
(b) Find the correlation coefficient between the section means $\bar{x}$ and $\bar{y}$.
Solution.
(a) $S_{xx} = 37695 - (561)^2/9 = 2726$, $S_{yy} = 40223 - (589)^2/9 = 1676.222$, and $S_{xy} = 38281 - (561)(589)/9 = 1566.667$.
So,
$$r = \frac{1566.667}{\sqrt{(2726)(1676.222)}} = .733.$$
(b) Now
$\bar{x}_1 = (70+72+94)/3 = 78.667$, $\bar{y}_1 = (60+83+85)/3 = 76$;
$\bar{x}_2 = (80+60+55)/3 = 65$, $\bar{y}_2 = (72+74+58)/3 = 68$;
$\bar{x}_3 = (45+50+35)/3 = 43.333$, $\bar{y}_3 = (63+40+54)/3 = 52.333$.
$S_{xx} = (78.667)^2 + (65)^2 + (43.333)^2 - (78.667+65+43.333)^2/3 = 634.913$,
$S_{yy} = (76)^2 + (68)^2 + (52.333)^2 - (76+68+52.333)^2/3 = 289.923$,
$S_{xy} = (78.667)(76) + (65)(68) + (43.333)(52.333) - (187)(196.333)/3 = 428.348$.
So,
$$r = \frac{428.348}{\sqrt{(634.913)(289.923)}} = .9984.$$
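Both parts of this example can be recomputed in a few lines, which also shows how much larger the correlation between section averages is than the correlation between individual students. The dictionary layout and the helper `pearson_r` are illustrative choices, not part of the notes.

```python
import math

def pearson_r(x, y):
    """Sample correlation via S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    s_yy = sum(b * b for b in y) - sum(y) ** 2 / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return s_xy / math.sqrt(s_xx * s_yy)

# (first midterm, second midterm) marks by section, from the example
sections = {
    "8 A.M.": [(70, 60), (72, 83), (94, 85)],
    "Noon": [(80, 72), (60, 74), (55, 58)],
    "Evening": [(45, 63), (50, 40), (35, 54)],
}

# (a) correlation over all nine students
pairs = [p for marks in sections.values() for p in marks]
r_students = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])

# (b) correlation over the three section means
mean_x = [sum(p[0] for p in m) / len(m) for m in sections.values()]
mean_y = [sum(p[1] for p in m) / len(m) for m in sections.values()]
r_means = pearson_r(mean_x, mean_y)

print(f"students: {r_students:.3f}, section means: {r_means:.4f}")
```

Without intermediate rounding the second value is about .9985, which agrees with the hand calculation to three decimals.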
Population Correlation Coefficient
The population correlation coefficient between $X$ and $Y$ is defined by
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$
We now only look at its properties:
(i) $-1 \le \rho \le 1$.
(ii) $\rho = \pm 1$ if all $(x_i, y_i)$ in the population lie on a straight line. The sample correlation coefficient $r$ can be used to decide whether $\rho = 0$ (no linear relationship between $X$ and $Y$) or not.
(iii) Also, $\rho = \pm 1$ for the bivariate distribution means that the variables $X$ and $Y$ are linearly related.
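A test for $\rho = 0$
To test the hypothesis $H_0: \rho = 0$, use a $t$ test based on $r$ with $n - 2$ degrees of freedom.
Carry out a test of significance level 0.01 to see whether $\rho \ne 0$.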
Solution: $H_0: \rho = 0$ vs $H_a: \rho \ne 0$. The test statistic is
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}.$$
Reject $H_0$ at level .01 if either $t \ge t_{.005,22} = 2.819$ or $t \le -2.819$. Here $r = .5778$ and $t = 3.32$, so $H_0$ should be rejected. There appears to be a non-zero correlation in the population.
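The test statistic in this solution is easy to recompute. In the sketch below, $n = 24$ is an assumption inferred from the 22 degrees of freedom of the critical value, and the critical value 2.819 is simply taken from the solution rather than computed.

```python
import math

def correlation_t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

r = 0.5778
n = 24            # assumption: inferred from the 22 degrees of freedom
t_crit = 2.819    # t_{.005,22}, table value quoted in the solution

t = correlation_t_statistic(r, n)
reject_h0 = abs(t) >= t_crit   # two-sided test at level .01

print(f"t = {t:.2f}, reject H0: {reject_h0}")
```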
12.1 Linear Regression
Least squares
Coefficient of determination
Details of Linear Regression
Fitting a straight line
Often, one is interested not only in studying the relationship, but also in predicting the value of the dependent variable from the independent (predictor or explanatory) variable.
When scatter plot suggests a linear relationship, it is natural to find
a straight line which is as close as possible to the points.
The equation of a straight line is $y = a + bx$.
A particular equation is $y = 5 + x$. Here $a = 5$ and $b = 1$.
To draw a line, we need two quantities, namely the intercept (with the $y$-axis) $a$ and the slope $b$.
Given the data $(x_1, y_1), \dots, (x_n, y_n)$ on $(x, y)$.
Aim: To find the straight line $y = a + bx$ which fits the data well.
2. Method of Least Squares
Here $x$ = explanatory (predictor) variable,
$y$ = response variable.
Let $\epsilon_i = y_i - (a + b x_i)$ = error = deviation from the line.
Then
$$\sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
= sum of squares of errors.
The principle of least squares says: choose the line (that is, find $a$ and $b$) such that $\sum \epsilon_i^2$ is minimum. The resulting equation is called the sample regression line.
3. The Derivation
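Let
$$f(a, b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2. \qquad (*)$$
For fixed $b$, treating $f$ as a function of $a$, we have
$$\frac{\partial f}{\partial a} = 0 \;\Rightarrow\; \sum_{i=1}^{n} (y_i - a - b x_i)(-1) = 0$$
$$\Rightarrow\; n\bar{y} - na - nb\bar{x} = 0$$
$$\Rightarrow\; a = \bar{y} - b\bar{x} = \hat{a} \ \text{(say)}.$$
Also, substituting $a$ in (*) and treating $f$ as a function of $b$,
$$\frac{\partial f}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} (y_i - a - b x_i)(-x_i) = 0$$
$$\Rightarrow\; \sum_{i=1}^{n} x_i y_i - n a \bar{x} - b \sum_{i=1}^{n} x_i^2 = 0$$
$$\Rightarrow\; \sum_{i=1}^{n} x_i y_i - n(\bar{y} - b\bar{x})\bar{x} - b \sum_{i=1}^{n} x_i^2 = 0 \quad \text{(substituting } a\text{)}.$$
Solving now for $b$, we obtain
$$b = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{S_{xy}}{S_{xx}} = \hat{b} \ \text{(say)}.$$
Then the line $\hat{y} = \hat{a} + \hat{b}x$ is called the fitted least-squares (regression) line.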
The slope of the least squares (regression) line is $\hat{b} = S_{xy}/S_{xx}$. The intercept of the line is $\hat{a} = \bar{y} - \hat{b}\bar{x}$. Therefore, the (sample) regression line is $\hat{y} = \hat{a} + \hat{b}x$.
The value $\hat{y}_i = \hat{a} + \hat{b}x_i$ is called the fitted value of $y$, and $y_i$ is called the observed value of $y$.
The quantity $e_i = y_i - \hat{y}_i$ is called the residual.
If $e_i > 0$, the model underestimates the data value;
if $e_i < 0$, the model overestimates the data value.
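The slope, intercept, fitted values and residuals defined above can be computed as in the following sketch. The function name `least_squares_line` and the five data points are hypothetical and chosen only to illustrate the calculation.

```python
def least_squares_line(x, y):
    """Return (a_hat, b_hat) for the fitted line y = a_hat + b_hat * x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
    s_xx = sum(xi * xi for xi in x) - n * xbar ** 2
    b_hat = s_xy / s_xx
    a_hat = ybar - b_hat * xbar
    return a_hat, b_hat

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a_hat, b_hat = least_squares_line(x, y)
fitted = [a_hat + b_hat * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for xi, yi, fi, ei in zip(x, y, fitted, residuals):
    # A positive residual means the line underestimates the observed value
    print(f"x={xi}  y={yi}  fitted={fi:.1f}  residual={ei:+.1f}")
```

For these points the fitted line is $\hat{y} = 2.2 + 0.6x$, and the residuals sum to zero up to rounding.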
Example 1. The following data give the mean height of a group of children in Kalama, an Egyptian village that was the subject of a study of nutrition in developing countries. The data were obtained on 161 children each month from 18 to 29 months of age.
Here, $x$ = age (in months) = explanatory variable;
$y$ = height (in cm) = response variable.
x 18 19 20 21 22 23 24 25 26 27 28 29
y 76.1 77.0 78.1 78.2 78.8 79.7 79.9 81.1 81.2 81.8 82.8 83.5
For the above data,
$\bar{x} = 23.5$; $\bar{y} = 79.85$;
$s_x = 3.606$; $s_y = 2.302$.
Also, $r = r(x, y) = .9944$.
Hence,
$$b_1 = r\,\frac{s_y}{s_x} = (.9944)\,\frac{2.302}{3.606} = .6348 = \hat{b}.$$
And
$$\hat{a} = b_0 = \bar{y} - b_1\bar{x} = 79.85 - (.6348)(23.5) = 64.932.$$
Therefore, the least-squares line is $\hat{y} = 64.932 + 0.6348x$.
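The figures in this example can be reproduced from the data table. The sketch below also checks that the slope obtained as $r\,s_y/s_x$ agrees with $S_{xy}/S_{xx}$; variable names are illustrative.

```python
import math

age = list(range(18, 30))   # months, 18 through 29
height = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
          79.9, 81.1, 81.2, 81.8, 82.8, 83.5]   # cm
n = len(age)

xbar, ybar = sum(age) / n, sum(height) / n
s_xx = sum((x - xbar) ** 2 for x in age)
s_yy = sum((y - ybar) ** 2 for y in height)
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(age, height))

s_x = math.sqrt(s_xx / (n - 1))
s_y = math.sqrt(s_yy / (n - 1))
r = s_xy / math.sqrt(s_xx * s_yy)

b_from_r = r * s_y / s_x     # slope via the correlation
b_direct = s_xy / s_xx       # slope via S_xy / S_xx
a = ybar - b_direct * xbar

print(f"s_x={s_x:.3f}  s_y={s_y:.3f}  r={r:.4f}")
print(f"slope={b_direct:.4f}  intercept={a:.3f}")
```

Carried at full precision, the slope and intercept come out near 0.6350 and 64.928; the small differences from .6348 and 64.932 come from rounding $r$, $s_x$ and $s_y$ before multiplying.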
Interpretation
The slope $b = .6348$ cm/month is the rate of change in mean height as age increases. Although $r$ does not change with the units of measurement, the equation of the least-squares line does.
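Genesis of Regression. Note the slope
$$b = \frac{S_{xy}}{S_{xx}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \cdot \frac{\sqrt{S_{yy}}}{\sqrt{S_{xx}}} = r\,\frac{s_y}{s_x}.$$
Hence,
$$\hat{y} = \bar{y} + r\,\frac{s_y}{s_x}(x - \bar{x}).$$
Put $x = \bar{x} + s_x$; then $\hat{y} = \bar{y} + r s_y$.
When $r = 1$, $\hat{y} = \bar{y} + s_y$; when $r = \tfrac{1}{2}$, $\hat{y} = \bar{y} + \tfrac{1}{2} s_y$.
For any $x$-value, $\hat{y}$ (the predicted value) will be closer (in terms of SDs) to $\bar{y}$ than $x$ is to $\bar{x}$ whenever $|r| < 1$. That is, $\hat{y}$ is pulled toward (regressed toward) $\bar{y}$.
This regression effect was first noticed by Sir Francis Galton, who found that the predicted height of a son ($\hat{y}_i$) was always closer to $\bar{y}$ than his father's height ($x_i$) was to $\bar{x}$.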
Assessing the Fit
To assess the effectiveness of the fit, the residuals can be used. Note
$e_i = (y_i - \hat{y}_i) > 0$ if $y_i > \hat{y}_i$,
and
$e_i = (y_i - \hat{y}_i) < 0$ if $y_i < \hat{y}_i$.
Also,
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 0 \iff y_i = \hat{y}_i \ \text{for all } i.$$
That is, all observed values lie on a straight line. Also, $\sum_{i=1}^{n} e_i^2$ can be used as a measure of the fit. Another one is the total variation in the $y_i$'s, namely $\sum_{i=1}^{n} (y_i - \bar{y})^2$.
Definition:
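The residual sum of squares, SSE, is
$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2,$$
and the total sum of squares is defined as
$$\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = S_{yy}.$$
Note:
$$y_i - \hat{y}_i = y_i - (\hat{a} + \hat{b}x_i) = (y_i - \bar{y}) - \hat{b}(x_i - \bar{x}) \quad \text{(substituting } \hat{a}\text{)}.$$
Hence, $\sum e_i = 0$ and
$$\mathrm{SSE} = \mathrm{SST} + \hat{b}^2 S_{xx} - 2\hat{b} S_{xy} = \mathrm{SST} - \hat{b} S_{xy} \quad (\text{since } S_{xy} = \hat{b} S_{xx}),$$
which shows SSE can be calculated without the $e_i$'s.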
Note :
i. SSE is used as a measure of unexplained variation by the
regression line.
ii. Similarly, SST is used as a measure of total variation.
iii. $\mathrm{SSE}/\mathrm{SST}$ = fraction of total variation that is unexplained by the line.
Definition: The coefficient of determination, denoted by $R^2$, is
$$R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}.$$
It is the proportion of variation in $y$ explained by regression.
Result. $r^2 = R^2$, where $r$ is the sample correlation coefficient.
Definition: The quantity
$$s_e^2 = \frac{\mathrm{SSE}}{n-2} = \frac{\sum_{i=1}^{n} e_i^2}{n-2}$$
is the variance of the residuals, and $s_e = \sqrt{s_e^2}$ is called the SD of the residuals about the least squares line. It is the estimator of $\sigma$.
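The relations among SSE, SST, $R^2$ and the residual SD can be verified numerically. The sketch below reuses the age and height data of Example 1; the variable names are illustrative.

```python
import math

# Age (months) and mean height (cm) from Example 1
x = list(range(18, 30))
y = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
     79.9, 81.1, 81.2, 81.8, 82.8, 83.5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

s_xx = sum((xi - xbar) ** 2 for xi in x)
s_yy = sum((yi - ybar) ** 2 for yi in y)          # this is SST
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = s_xy / s_xx
a = ybar - b * xbar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse_direct = sum(e * e for e in residuals)        # from the residuals
sse_shortcut = s_yy - b * s_xy                    # SST - b * S_xy

r_squared = s_xy ** 2 / (s_xx * s_yy)             # square of the correlation
big_r_squared = 1 - sse_direct / s_yy             # coefficient of determination
s_e = math.sqrt(sse_direct / (n - 2))             # SD of the residuals

print(f"SSE={sse_direct:.4f}  R^2={big_r_squared:.4f}  s_e={s_e:.3f}")
```

Both routes to SSE agree, and $r^2$ and $R^2$ coincide (about 0.989 here).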
Plotting the Residuals (Residual Plot)
Definition: A scatter plot of the $e_i$'s against the $x_i$'s is called a residual plot.
(i) It is used for checking whether there are any unusual or highly influential observations, or any revealing patterns, in the data.
(ii) If there is no particular pattern, such as curvature, the least-squares fit is a good fit. Also, the residuals will be centered around the x-axis.
(iii) Looking at the residual plot is equivalent to examining $y$ after removing its linear dependence on $x$. This may sometimes show the existence of a non-linear relationship.
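To see how a residual plot exposes curvature, the sketch below fits a line to hypothetical data generated from $y = x^2$ and lists the residuals against $x$. No plotting library is used; the printed values stand in for the plot.

```python
# Hypothetical curved data: y = x^2
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [xi ** 2 for xi in x]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

s_xx = sum((xi - xbar) ** 2 for xi in x)
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = ybar - b * xbar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

for xi, ei in zip(x, residuals):
    print(f"x={xi:.0f}  e={ei:+.1f}")
```

The residuals are positive at both ends and negative in the middle, a U-shaped pattern that signals a non-linear relationship even though they still sum to zero.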
Example 2 (Ex 9): The flow rate $y$ (m³/min) in a device used for air-quality measurement depends on the pressure drop $x$ (in. of water) across the device's filter. Suppose that for $x$ values between 5 and 20, the two variables are related according to the simple linear regression model with true regression line $y = -.12 + .095x$.
a. What is the expected change in flow rate associated with a 1-in. increase in pressure drop? Explain.
b. What change in flow rate can be expected when pressure drop decreases by 5 in.?
c. What is the expected flow rate for a pressure drop of 10 in.? A drop of 15 in.?
d. Suppose $\sigma = .025$ and consider a pressure drop of 10 in. What is the probability that the observed value of flow rate will exceed .835? That the observed flow rate will exceed .840?
e. What is the probability that an observation on flow rate when pressure drop is 10 in. will exceed an observation on flow rate made when pressure drop is 11 in.?
Solution:
a. $\beta_1$ = expected change in flow rate ($y$) associated with a one-inch increase in pressure drop ($x$) = .095.
b. We expect flow rate to decrease by $5\beta_1 = .475$.
c. $\mu_{Y \cdot 10} = -.12 + .095(10) = .83$, and $\mu_{Y \cdot 15} = -.12 + .095(15) = 1.305$.
d. $$P(Y > .835) = P\left(Z > \frac{.835 - .830}{.025}\right) = P(Z > .20) = .4207,$$
$$P(Y > .840) = P\left(Z > \frac{.840 - .830}{.025}\right) = P(Z > .40) = .3446.$$
e. Let $Y_1$ and $Y_2$ denote the flow rates observed at pressure drops of 10 and 11, respectively. Then $\mu_{Y \cdot 11} = .925$, so $Y_1 - Y_2$ has expected value $.830 - .925 = -.095$ and s.d. $\sqrt{(.025)^2 + (.025)^2} = .035355$. Thus
$$P(Y_1 > Y_2) = P(Y_1 - Y_2 > 0) = P\left(Z > \frac{0 + .095}{.035355}\right) = P(Z > 2.69) = .0036.$$
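The normal probabilities in parts d and e can be checked with the standard library alone. This sketch assumes $\sigma = .025$ (the value that appears in the standardizations above) and independent observations for part e; the helper names are illustrative.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

beta0, beta1 = -0.12, 0.095   # true regression line y = -.12 + .095x
sigma = 0.025                 # assumed error SD, as used in part d

def mean_flow(x):
    """Expected flow rate at pressure drop x."""
    return beta0 + beta1 * x

mu10, mu11 = mean_flow(10), mean_flow(11)

# Part d: observed flow rate at x = 10 exceeds .835 or .840
p_835 = upper_tail((0.835 - mu10) / sigma)
p_840 = upper_tail((0.840 - mu10) / sigma)

# Part e: Y1 - Y2 has mean mu10 - mu11 and SD sqrt(2) * sigma
# (assumes the two observations are independent)
sd_diff = math.sqrt(2) * sigma
p_y1_gt_y2 = upper_tail((0 - (mu10 - mu11)) / sd_diff)

print(f"{p_835:.4f}  {p_840:.4f}  {p_y1_gt_y2:.4f}")
```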
Example 3 (Ex 13): The accompanying data on x = current density (mA/cm²) and y = rate of deposition (µm/min) appeared in an article. Do you agree with the article's author that a linear relationship was obtained from the tin-lead rate of deposition as a function of current density? Explain your reasoning.
x 20 40 60 80
y .24 1.20 1.71 2.22
Solution: For this data, $n = 4$, $\sum x_i = 200$, $\sum y_i = 5.37$, $\sum x_i^2 = 12{,}000$, $\sum y_i^2 = 9.3501$, $\sum x_i y_i = 333$.
$$S_{xx} = 12{,}000 - \frac{(200)^2}{4} = 2000, \quad S_{yy} = 9.3501 - \frac{(5.37)^2}{4} = 2.140875, \quad S_{xy} = 333 - \frac{(200)(5.37)}{4} = 64.5.$$
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{64.5}{2000} = .03225 \quad \text{and} \quad \hat{\beta}_0 = \frac{5.37}{4} - (.03225)\frac{200}{4} = -.27000.$$
$$\mathrm{SSE} = S_{yy} - \hat{\beta}_1 S_{xy} = 2.140875 - (.03225)(64.5) = .060750.$$
$$r^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{.060750}{2.140875} = .972.$$
This is a very high value of $r^2$, which confirms the author's claim that there is a strong linear relationship between the two variables.
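The arithmetic of this solution can be reproduced from the four data points. This is only a sketch; variable names are illustrative.

```python
x = [20, 40, 60, 80]            # current density
y = [0.24, 1.20, 1.71, 2.22]    # rate of deposition
n = len(x)

s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
s_yy = sum(b * b for b in y) - sum(y) ** 2 / n
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

b1 = s_xy / s_xx                        # slope estimate
b0 = sum(y) / n - b1 * sum(x) / n       # intercept estimate
sse = s_yy - b1 * s_xy                  # shortcut formula for SSE
r2 = 1 - sse / s_yy                     # SST equals S_yy

print(f"b1={b1:.5f}  b0={b0:.5f}  SSE={sse:.6f}  r^2={r2:.3f}")
```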
Example 4 (Ex 19): The following data is representative of that reported in an article with x = burner area liberation rate (MBtu/hr-ft²) and y = NOx emission rate (ppm):
X 100 125 125 150 150 200 200 250 250 300 300 350 400 400
Y 150 140 180 210 190 320 280 400 430 440 390 600 610 670
a. Assuming that the simple linear regression model is valid,
obtain the least squares estimate of the true regression line.
b. What is the estimate of expected NOx emission rate when
burner area liberation rate equals 225?
c. Estimate the amount by which you expect NOx emission rate to change when burner area liberation rate is decreased by 50.
d. Would you use the estimated regression line to predict emission
rate for a liberation rate of 500? Why or why not?
Solution:
$n = 14$, $\sum x_i = 3300$, $\sum y_i = 5010$, $\sum x_i^2 = 913{,}750$, $\sum y_i^2 = 2{,}207{,}100$, $\sum x_i y_i = 1{,}413{,}500$.
a. $\hat{\beta}_1 = \dfrac{3{,}256{,}000}{1{,}902{,}500} = 1.71143233$, $\hat{\beta}_0 = -45.55190543$, so we use the equation $\hat{y} = -45.5519 + 1.7114x$.
b. $\hat{\mu}_{Y \cdot 225} = -45.5519 + 1.7114(225) = 339.51$.
c. Estimated expected change $= -50\hat{\beta}_1 = -85.57$.
d. No, the value 500 is outside the range of x values for which observations were available (the danger of extrapolation).
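All four parts can be recomputed from the data table. The sketch below uses the form $\hat{\beta}_1 = (n\sum x_i y_i - \sum x_i \sum y_i)/(n\sum x_i^2 - (\sum x_i)^2)$, which matches the fraction shown in part a; variable names are illustrative.

```python
x = [100, 125, 125, 150, 150, 200, 200, 250, 250, 300, 300, 350, 400, 400]
y = [150, 140, 180, 210, 190, 320, 280, 400, 430, 440, 390, 600, 610, 670]
n = len(x)

num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = n * sum(a * a for a in x) - sum(x) ** 2

b1 = num / den                       # part a: slope estimate
b0 = (sum(y) - b1 * sum(x)) / n      # part a: intercept estimate
pred_225 = b0 + b1 * 225             # part b: estimated mean at x = 225
change = -50 * b1                    # part c: effect of a decrease of 50
in_range = min(x) <= 500 <= max(x)   # part d: is 500 inside the data range?

print(f"b1={b1:.8f}  b0={b0:.8f}")
print(f"estimate at 225: {pred_225:.2f}, change: {change:.2f}, 500 in range: {in_range}")
```

With unrounded coefficients the estimate at 225 is about 339.52; the hand value 339.51 uses the coefficients rounded to four decimals.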
Homework:
Sec 12.1: 3, 8, 9
Sec 12.2: 12, 14, 16
Sec 12.5: 58, 62, 65