
Lecture 15

Chapter 17
Simple linear regression and correlation

17.5 Using the regression equation
17.6 Coefficients of correlation
17.7 Regression diagnostics (optional)

17.5 Using the regression equation


• Before using the regression model, we need to assess
how well it fits the data.
• If we are satisfied with how well the model fits the
data, we can use it to make predictions for y.
Estimating the expected value of Y for a given value of X:
$E(Y \mid X = x_g)$ is estimated by $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_g$
Example: Predict the selling price of a three-year-old
Ford Laser with 40000 km on the odometer (refer to
Example 17.3).
Solution: $\hat{y} = 19.61 - 0.0937x = 19.61 - 0.0937(40) = 15.862$ (about \$15,862: price is measured in \$1,000s and the odometer in 1,000s of km, so $x_g = 40$).

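The arithmetic is easy to confirm in Python. A minimal sketch using only the coefficients quoted above (the variable names are ours, not the textbook's):

```python
# Point prediction from the fitted line of Example 17.3
# (price in $1,000s, odometer in 1,000s of km).
b0, b1 = 19.61, -0.0937   # estimated intercept and slope from the slide
x_g = 40                  # a car with 40,000 km on the odometer

y_hat = b0 + b1 * x_g
print(f"Predicted price: {y_hat:.3f} thousand dollars")   # 15.862
```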
Prediction interval & confidence interval
• Two intervals can be used to discover how closely
the predicted value will match the true value of y
– prediction interval – for a particular value of y
– confidence interval – for the expected value of y.

The prediction interval (for a particular value of y):

$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

The confidence interval (for the expected value of y):

$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

where $s_\varepsilon$ is the standard error of estimate.

The prediction interval is wider than the confidence interval. This is reasonable: predicting a single value is more difficult than estimating the average value.
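The two formulas differ only by the extra 1 under the square root, which a short helper makes explicit. This is a sketch, not textbook software; the function and argument names are ours:

```python
import math

def interval(y_hat, t_crit, s_e, n, x_g, x_bar, ss_x, prediction=True):
    """Prediction interval (prediction=True) or confidence interval
    for E(Y | X = x_g) (prediction=False) around y_hat."""
    leverage = 1 / n + (x_g - x_bar) ** 2 / ss_x   # distance-from-mean term
    extra = 1.0 if prediction else 0.0             # the extra '1' widens the PI
    half = t_crit * s_e * math.sqrt(extra + leverage)
    return y_hat - half, y_hat + half
```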

Example 17.8 (Example 17.3, contd.)


a. Provide an interval estimate for the bidding
price on a Ford Laser with 40000 km on the
odometer.
Solution
– The dealer would like to predict the price of a
single car.
– The prediction interval (95%) =
$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

With $t_{0.025,98} = 1.984$:

$15.862 \pm 1.984 \times 0.4526 \sqrt{1 + \frac{1}{100} + \frac{(40 - 36.01)^2}{4307.378}} = 15.862 \pm 0.904$

b. The car dealer wants to bid on a lot of 250
Ford Lasers, where each car has been driven
for about 40000 km.
Solution
– The dealer needs to estimate the mean price
per car.
– The confidence interval (95%) =
$\hat{y} \pm t_{\alpha/2,\,n-2}\; s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

$15.862 \pm 1.984 \times 0.4526 \sqrt{\frac{1}{100} + \frac{(40 - 36.01)^2}{4307.378}} = 15.862 \pm 0.105$
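Plugging the summary statistics from Example 17.8 into the `interval` helper defined above reproduces both intervals:

```python
y_hat, t_crit, s_e = 15.862, 1.984, 0.4526
n, x_g, x_bar, ss_x = 100, 40, 36.01, 4307.378

print(interval(y_hat, t_crit, s_e, n, x_g, x_bar, ss_x, prediction=True))
# -> approximately (14.958, 16.766), i.e. 15.862 +/- 0.904
print(interval(y_hat, t_crit, s_e, n, x_g, x_bar, ss_x, prediction=False))
# -> approximately (15.757, 15.967), i.e. 15.862 +/- 0.105
```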

The effect of the given value $x_g$ of X on the intervals (optional reading)

As $x_g$ moves away from $\bar{x}$, the interval becomes longer; the shortest interval is obtained at $x_g = \bar{x}$. Writing $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_g$, the confidence interval is:

– when $x_g = \bar{x}$:   $\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{0^2}{\sum (x_i - \bar{x})^2}}$

– when $x_g = \bar{x} \pm 1$:   $\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{1^2}{\sum (x_i - \bar{x})^2}}$

– when $x_g = \bar{x} \pm 2$:   $\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{2^2}{\sum (x_i - \bar{x})^2}}$

[Figure: the fitted line with confidence intervals drawn at $x_g = \bar{x}$, $\bar{x} \pm 1$ and $\bar{x} \pm 2$; the intervals widen as $x_g$ moves away from $\bar{x}$.]
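The widening is easy to verify numerically by reusing the `interval` helper with the Example 17.8 statistics (the offsets from $\bar{x}$ are arbitrary; only the half-width matters here, and it does not depend on $\hat{y}$):

```python
for offset in (0, 1, 2):
    lo, hi = interval(15.862, 1.984, 0.4526, 100,
                      x_g=36.01 + offset, x_bar=36.01, ss_x=4307.378,
                      prediction=False)
    print(f"x_g = x_bar + {offset}: half-width = {(hi - lo) / 2:.4f}")
# half-widths grow: 0.0898, 0.0908, 0.0939
```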
17.6 Coefficient of Correlation
• The coefficient of correlation is used to measure
the strength of a linear association between two
variables.
• The population coefficient of correlation is
denoted  (rho).
• The coefficient values range between –1 and 1.
– If  = –1 (perfect negative linear association)
or  = +1 (perfect positive linear association)
every point falls on the regression line.
– If  = 0 there is no linear association.
• The coefficient can be used to test for linear
relationships between two variables.

Coefficient of Correlation…
We estimate the population coefficient of correlation $\rho$ from sample data with the sample coefficient of correlation:

$r = \frac{SS_{xy}}{\sqrt{SS_x\, SS_y}}$   (or $r = \frac{s_{xy}}{\sqrt{s_x^2\, s_y^2}}$)

The test statistic for testing whether $\rho = 0$ is

$t = r\sqrt{\frac{n-2}{1-r^2}}$,

which is Student t-distributed with $\nu = n - 2$ degrees of freedom.

Remark: It can be proved that this t-statistic is exactly the same as the t-statistic used to test whether $\beta_1 = 0$, although the two formulas look quite different.

Testing the Coefficient of Correlation
• When there is no linear relationship between two variables, $\rho = 0$.
• The hypotheses are:
  H0: $\rho = 0$ (no linear relationship)
  HA: $\rho \neq 0$ (a linear relationship exists)
• The test statistic is

$t = r\sqrt{\frac{n-2}{1-r^2}}$.

The statistic is Student t-distributed with d.f. = $n - 2$, provided the variables are bivariate normally distributed.

Example 17.9, page 748 (Example 17.3, contd.)

• Test the coefficient of correlation to determine whether a linear association exists in the data of Example 17.3.

Solution
– We test the hypotheses H0: $\rho = 0$, HA: $\rho \neq 0$.

Solving manually
– The rejection region is $|t| > t_{\alpha/2,\,n-2} = t_{0.025,98} = 1.984$.
– The sample coefficient of correlation: $r = SS_{xy}/\sqrt{SS_x\, SS_y} = -0.8083$.
– The t-statistic value: $t = r\sqrt{\frac{n-2}{1-r^2}} = -13.59$.
– Conclusion: Since $|-13.59| > 1.984$, there is sufficient evidence at $\alpha = 5\%$ to infer that there is a linear association between the two variables.
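As a cross-check, the t-value and its two-tail p-value can be computed from r and n alone. A minimal sketch, assuming scipy is available:

```python
import math
from scipy import stats

r, n = -0.8083, 100
t = r * math.sqrt((n - 2) / (1 - r ** 2))          # t = -13.59
p_two_tail = 2 * stats.t.sf(abs(t), df=n - 2)      # essentially zero
print(f"t = {t:.2f}, p-value = {p_two_tail:.2e}")
```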
Example 17.9 using the Computer
We can also use Excel > Add-Ins > Data Analysis Plus and the Correlation (Pearson) tool.

[Excel output: the Pearson coefficient of correlation with its t-statistic and one- and two-tail p-values; we can also perform a one-tail test for a positive or negative linear relationship.]

Comparing the p-value with $\alpha$, we again reject the null hypothesis (that there is no linear correlation) in favor of the alternative hypothesis (that our two variables are in fact related in a linear fashion).

Spearman Rank Correlation Coefficient

The Spearman rank test is used to test whether a relationship exists between variables in cases where
– at least one variable is ranked, or
– both variables are numerical but the normality requirement is not satisfied.

• The hypotheses are: H0: $\rho_s = 0$, HA: $\rho_s \neq 0$.
• The test statistic is

$r_s = \frac{SS_{ab}}{\sqrt{SS_a\, SS_b}}$,

where $a$ and $b$ are the ranks of the data.
• For a large sample ($n \geq 30$), $r_s$ is approximately normally distributed with mean 0 and standard deviation $1/\sqrt{n-1}$, so we can use the test statistic

$z = r_s \sqrt{n-1}$.
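For the large-sample case the z-test is a one-liner. A sketch with hypothetical values ($r_s = 0.25$ and $n = 40$ are made up purely to illustrate the comparison with the 5% critical value):

```python
import math

r_s, n = 0.25, 40            # hypothetical sample values; n >= 30, so z applies
z = r_s * math.sqrt(n - 1)   # z = 1.561
print("reject H0 at 5%" if abs(z) > 1.96 else "do not reject H0 at 5%")
```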

Example 17.10, page 751
• A production manager wants to examine the relationship between
– aptitude test score given prior to hiring, and
– performance rating three months after starting work.
• A random sample of 20 production workers was selected. Their test scores and performance ratings were recorded.
• The aptitude test results range from 0 to 100. The performance ratings are from 1 to 5, with 5 being the highest performance level (well above average).

Example 17.10 Solution

– The problem objective is to analyze the relationship between two variables: aptitude test score (0 to 100) and performance rating (1 to 5).
– Performance rating is ranked.
– The hypotheses are: H0: $\rho_s = 0$, HA: $\rho_s \neq 0$.
– The test statistic is $r_s$, and the rejection region is $|r_s| > r_{\text{critical}}$ (taken from the Spearman rank correlation table).

Employee   Aptitude test   Performance rating
1          59              3
2          47              2
3          58              4
4          66              3
5          77              2
...        ...             ...

Example 17.10 Solution (continued)

Employee   Aptitude test   Rank(a)   Performance rating   Rank(b)
1          59              9         3                    10.5
2          47              3         2                    3.5
3          58              8         4                    17
4          66              14        3                    10.5
5          77              20        2                    3.5
...        ...             ...       ...                  ...

Ties are broken by averaging the ranks.

Solving by hand
– Rank each variable separately.
– Calculate $SS_a = 665$, $SS_b = 575$, $SS_{ab} = 234.5$; then $r_s = SS_{ab}/\sqrt{SS_a\, SS_b} = 0.379$.
– The critical value for $\alpha = 0.05$ and $n = 20$ is 0.450 (Table 10, Appendix B).
– Note: $0.450 \approx 1.96/\sqrt{20-1}$.

Conclusion: Since $0.379 < 0.450$, we do not reject the null hypothesis. At the 5% level of significance there is insufficient evidence to infer that the two variables are related to one another.
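The by-hand arithmetic can be verified from the sums of squares quoted above (a sketch; with raw data, scipy.stats.spearmanr would do the ranking and the whole calculation in one call):

```python
import math

ss_a, ss_b, ss_ab = 665, 575, 234.5    # sums of squares of the ranks
r_s = ss_ab / math.sqrt(ss_a * ss_b)   # 0.379
r_crit = 0.450                         # Table 10, Appendix B (alpha = 0.05, n = 20)
print("reject H0" if abs(r_s) > r_crit else "do not reject H0")  # do not reject
```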


17.7 Regression Diagnostics (Optional)

• The three important conditions required for the validity of the regression analysis are:
– The error variable is normally distributed.
– The error variance is constant for all values of x.
– The errors are independent of each other.

• How can we diagnose violations of these conditions? By residual analysis, that is, by examining the differences between the actual data points and those predicted by the linear equation.
Residual Analysis…
Recall that the deviations between the actual data points and the regression line are called residuals. Excel calculates residuals as part of its regression analysis:

RESIDUAL OUTPUT

Observation   Predicted Price (y)   Residuals   Standard Residuals
1             16.10684              -0.10684    -0.23729
2             15.41343              -0.21343    -0.47400
3             15.31973              -0.31973    -0.71007
4             16.71592               0.68408     1.51924
5             16.64096               0.75904     1.68572

We can use these residuals to determine whether the error variable is non-normal, whether the error variance is constant, and whether the errors are independent…

Residual Analysis…
For each residual we calculate the standard deviation as follows:

$s_{r_i} = s_\varepsilon \sqrt{1 - h_i}$, where $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$

Standardized residual $i$ = residual $i$ / standard deviation of residual $i$.
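A sketch of this calculation for a fitted simple regression, assuming numpy arrays `x`, `y` and fitted values `y_hat` are already available (the function and variable names are ours):

```python
import numpy as np

def standardized_residuals(x, y, y_hat):
    """Residuals divided by their estimated standard deviations."""
    n = len(x)
    resid = y - y_hat
    s_e = np.sqrt(np.sum(resid ** 2) / (n - 2))   # standard error of estimate
    h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
    return resid / (s_e * np.sqrt(1 - h))
```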

Example 17.3 continued
Non-normality
– Use Excel to obtain the standardized residual histogram.
– Examine the histogram and look for a bell-shaped distribution with mean close to zero.

[Figure: histogram of the standardized residuals, with bins from -1.2 to 1.2; frequency on the vertical axis.]

– As can be seen, the standardized residual histogram appears to be bell-shaped.
– We can also apply the Lilliefors test or the $\chi^2$ test of normality.

Heteroscedasticity
When the requirement of a constant variance is violated, we have heteroscedasticity.

[Figure: scatter plot of the residuals against the predicted values $\hat{y}$; the spread of the residuals increases with $\hat{y}$.]

Homoscedasticity
When the requirement of a constant variance is not violated, we have homoscedasticity.

[Figure: scatter plot of the residuals against the predicted values $\hat{y}$; the spread of the data points does not change much.]

Homoscedasticity…

[Figure: another residual plot, with an even spread around zero.]

As far as the even spread goes, this is a much better situation. We can diagnose heteroscedasticity by plotting the residuals against the predicted values of Y.

Heteroscedasticity…
If the variance of the error variable ($\sigma_\varepsilon^2$) is not constant, then we have 'heteroscedasticity'. Here is the plot of the residuals against the predicted values of y:

[Figure: residuals plotted against the predicted prices, showing a roughly even spread.]

There does not appear to be a change in the spread of the plotted points, therefore there is no heteroscedasticity.
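The diagnostic plot itself takes only a few lines with matplotlib. A minimal sketch, assuming `y_hat` and `resid` come from a fitted model such as the one in the earlier snippets:

```python
import matplotlib.pyplot as plt

def residual_plot(y_hat, resid):
    """Scatter the residuals against the predicted values."""
    plt.scatter(y_hat, resid)
    plt.axhline(0, linewidth=1)          # reference line at zero
    plt.xlabel("Predicted value")
    plt.ylabel("Residual")
    plt.show()                           # a fan shape suggests heteroscedasticity
```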

Summary: pages 763-764

Home assignment:
- Section 17.5 Exercises, page 746: 17.43, 17.47
- Section 17.6 Exercises, pages 753-754: 17.55, 17.56, 17.57, 17.58
