Lecture 15
Chapter 17
Simple linear regression and
correlation
Prediction interval & confidence interval
• Two intervals can be used to assess how closely
the predicted value will match the true value of y:
– prediction interval – for a particular value of y
– confidence interval – for the expected value of y.
Prediction interval (for a particular value of y):
$$\hat{y} \pm t_{\alpha/2,\,n-2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
Confidence interval (for the expected value of y):
$$\hat{y} \pm t_{\alpha/2,\,n-2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
Example (95% prediction interval, with $t_{.025,98} = 1.984$):
$$15.862 \pm 1.984 \times 0.4526 \sqrt{1 + \frac{1}{100} + \frac{(40 - 36.01)^2}{4307.378}} = 15.862 \pm 0.904$$
b. The car dealer wants to bid on a lot of 250
Ford Lasers, where each car has been driven
for about 40000 km.
Solution
– The dealer needs to estimate the mean price
per car.
– The confidence interval (95%) =
$$\hat{y} \pm t_{\alpha/2,\,n-2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
$$15.862 \pm 1.984 \times 0.4526 \sqrt{\frac{1}{100} + \frac{(40 - 36.01)^2}{4307.378}} = 15.862 \pm 0.105$$
[Figure omitted: the interval becomes wider as the given value $x_g$ moves farther from $\bar{x}$ (e.g. at $\bar{x} \pm 2$ versus $\bar{x} \pm 1$), because of the $(x_g - \bar{x})^2$ term in the interval formulas.]
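As a quick check of both intervals, here is a minimal Python sketch using the summary statistics quoted above (ŷ = 15.862, s_ε = 0.4526, n = 100, x̄ = 36.01, Σ(x_i − x̄)² = 4307.378); it is an illustration of the formulas, not code from the text:

```python
import math
from scipy import stats

# Summary statistics from the Ford Laser example above (n = 100)
y_hat = 15.862    # predicted value at x_g = 40 ('000 km)
s_e   = 0.4526    # standard error of estimate
n     = 100
x_bar = 36.01
ss_x  = 4307.378  # sum of (x_i - x_bar)^2
x_g   = 40

t_crit = stats.t.ppf(0.975, df=n - 2)        # t_{.025,98} ≈ 1.984

core = 1 / n + (x_g - x_bar) ** 2 / ss_x     # term shared by both intervals

half_pred = t_crit * s_e * math.sqrt(1 + core)  # prediction interval half-width
half_conf = t_crit * s_e * math.sqrt(core)      # confidence interval half-width

print(f"Prediction interval: {y_hat:.3f} ± {half_pred:.3f}")  # ≈ ± 0.904
print(f"Confidence interval: {y_hat:.3f} ± {half_conf:.3f}")  # ≈ ± 0.105
```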
17.6 Coefficient of Correlation
• The coefficient of correlation is used to measure
the strength of a linear association between two
variables.
• The population coefficient of correlation is
denoted ρ (rho).
• The coefficient values range between –1 and 1.
– If ρ = –1 (perfect negative linear association)
or ρ = +1 (perfect positive linear association),
every point falls on the regression line.
– If ρ = 0 there is no linear association.
• The coefficient can be used to test for linear
relationships between two variables.
Coefficient of Correlation…
We estimate the population coefficient of correlation ρ
from sample data with the sample coefficient of correlation:
$$r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}}$$
Testing the Coefficient of Correlation
• When there is no linear relationship between two
variables, ρ = 0.
• The hypotheses are:
H0: ρ = 0 (no linear relationship)
HA: ρ ≠ 0 (a linear relationship exists)
• The test statistic is:
$$t = r \sqrt{\frac{n-2}{1-r^2}}$$
The statistic is Student t-distributed with d.f. = n – 2, provided the
variables are bivariate normally distributed.
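A minimal Python sketch of this test (the data here is hypothetical; the function first computes r from the SS quantities defined above, then applies the t statistic):

```python
import math
import numpy as np
from scipy import stats

def correlation_t_test(x: np.ndarray, y: np.ndarray) -> tuple[float, float, float]:
    """Compute r = SS_xy / sqrt(SS_x * SS_y) and test H0: rho = 0 (two-tail)."""
    n = len(x)
    ss_x  = np.sum((x - x.mean()) ** 2)
    ss_y  = np.sum((y - y.mean()) ** 2)
    ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
    r = ss_xy / math.sqrt(ss_x * ss_y)
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    p_value = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tail p-value
    return r, t, p_value

# Hypothetical data:
x = np.array([2.0, 3.1, 4.5, 5.2, 6.8, 7.4, 8.9])
y = np.array([1.1, 2.0, 2.4, 3.9, 4.1, 5.6, 5.9])
r, t, p = correlation_t_test(x, y)
print(f"r = {r:.3f}, t = {t:.3f}, p-value = {p:.4f}")
```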
5
Example 17.9 using the Computer
We can also use Excel > Add-Ins > Data Analysis
Plus and the Correlation (Pearson) tool to get this
output:
[Excel output omitted: it reports the coefficient of correlation and the p-value, which we compare against α; a one-tail test for a positive or negative linear relationship can also be performed.]
Again, we reject the null hypothesis (that there is no
linear correlation) in favor of the alternative hypothesis
(that our two variables are in fact related in a linear
fashion).
Example 17.10, page 751
A production manager wants to examine the
relationship between
– aptitude test score given prior to hiring, and
– performance rating three months after starting
work.
A random sample of 20 production workers was
selected. Their test scores and performance ratings
were recorded.
The aptitude test results range from 0 to 100. The
performance ratings are from 1 to 5, with 5 being the
highest performance level (well above average).
Example 17.10. Solution

Employee   Aptitude test   Rank(a)   Performance rating   Rank(b)
    1           59            9              3              10.5
    2           47            3              2               3.5
    3           58            8              4              17
    4           66           14              3              10.5
    5           77           20              2               3.5
    .            .            .              .                .

(Ties are broken by averaging the ranks.)

Solving by hand:
– Rank each variable separately.
– Calculate $SS_a = 665$; $SS_b = 575$; $SS_{ab} = 234.5$.
Then $r_s = SS_{ab} / \sqrt{SS_a \, SS_b} = 0.379$.
– The critical value for $\alpha = 0.05$ and n = 20 is 0.450 (Table 10, Appendix B).
– Note: $0.450 \approx 1.96/\sqrt{20 - 1}$.

Conclusion: We do not reject the null hypothesis. At the 5% level of
significance there is insufficient evidence to infer that the two
variables are related to one another.
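A minimal Python sketch of the same rank-correlation test (the data array below is illustrative, not the full textbook sample of 20 workers):

```python
from scipy import stats

# Illustrative (hypothetical) data; the textbook sample has n = 20 workers.
aptitude    = [59, 47, 58, 66, 77, 61, 63, 55, 67, 72]
performance = [ 3,  2,  4,  3,  2,  3,  4,  3,  5,  4]

# spearmanr ranks each variable (averaging ties) and computes r_s.
r_s, p_value = stats.spearmanr(aptitude, performance)
print(f"r_s = {r_s:.3f}, p-value = {p_value:.4f}")

# Large-sample critical value used in the slide's note: z_.025 / sqrt(n - 1)
n = 20
print(f"critical value ≈ {1.96 / (n - 1) ** 0.5:.3f}")   # ≈ 0.450
```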
Residual Analysis…
Recall that the deviations between the actual data points and
the regression line are called residuals. Excel calculates
residuals as part of its regression analysis:
[Excel RESIDUAL OUTPUT table omitted.]
Residual Analysis…
For each residual we calculate the standard deviation
as follows:
$$s_{r_i} = s_\varepsilon \sqrt{1 - h_i} \quad \text{where} \quad h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$$
Standardized residual $i$ = residual $i$ / standard deviation $s_{r_i}$.
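A minimal Python sketch of this computation (the x values and residuals below are hypothetical; it assumes a simple linear regression has already been fitted):

```python
import numpy as np

def standardized_residuals(x: np.ndarray, residuals: np.ndarray) -> np.ndarray:
    """Divide each residual by its estimated std. deviation s_e * sqrt(1 - h_i)."""
    n = len(x)
    s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))   # standard error of estimate
    h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # leverage h_i
    return residuals / (s_e * np.sqrt(1 - h))

# Hypothetical example values:
x = np.array([36.0, 40.0, 32.5, 38.2, 35.1])
resid = np.array([0.31, -0.52, 0.18, 0.07, -0.04])
print(standardized_residuals(x, resid))
```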
Example 17.3 continued
Non-normality
– Use Excel to obtain the standardized residual
histogram.
– Examine the histogram and look for a bell-shaped
diagram with mean close to zero.
– We can also apply the Lilliefors test or the χ² test of normality.
[Histogram of standardized residuals omitted: bins from –1.2 to 1.2 and beyond; the histogram appears to be bell-shaped.]
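For the normality check, here is a minimal Python sketch using statsmodels' Lilliefors test (the residual array is simulated, purely for illustration):

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

# Hypothetical standardized residuals (simulated here):
resid = np.random.default_rng(1).normal(size=100)

stat, p_value = lilliefors(resid, dist='norm')
print(f"Lilliefors statistic = {stat:.4f}, p-value = {p_value:.4f}")
# A small p-value (< 0.05) suggests the residuals are not normal.
```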
Heteroscedasticity
When the requirement of a constant variance is violated,
we have heteroscedasticity.
[Plot omitted: residuals plotted against the predicted value ŷ; the spread of the residuals changes as ŷ increases, a sign of heteroscedasticity.]
Homoscedasticity
When the requirement of a constant variance is not
violated, we have homoscedasticity.
[Plot omitted: residuals plotted against the predicted value ŷ; the spread of the data points does not change much.]
Heteroscedasticity…
If the variance of the error variable (ε) is not constant,
then we have 'heteroscedasticity'. Here's the plot of
the residual against the predicted value of y:
[Plot omitted: residuals versus predicted ŷ.]
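A minimal matplotlib sketch of such a diagnostic plot (the fitted values and residuals are simulated, purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_hat = np.linspace(10, 20, 100)        # hypothetical predicted values
resid = rng.normal(scale=0.1 * y_hat)   # simulated: spread grows with y_hat

plt.scatter(y_hat, resid)
plt.axhline(0, color='gray', linestyle='--')
plt.xlabel('Predicted value y_hat')
plt.ylabel('Residual')
plt.title('Residuals vs predicted values')
plt.show()
# A funnel shape (spread changing with y_hat) indicates heteroscedasticity.
```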
Home assignment: