Session 15 Regression and Correlation
[Scatter plots: variables that are positively related, not related, and negatively related]
Two measures of the linear relationship between variables: covariance and correlation.
Sample Covariance
The sample covariance measures the strength of the linear association between two numerical variables:

cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
The covariance is only concerned with the
strength of the relationship.
No causal effect is implied.
Covariance
The covariance between two random variables, divided by the product of their standard deviations, gives the sample correlation coefficient:

r = \frac{\mathrm{cov}(X, Y)}{S_X S_Y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
The Correlation Coefficient
Unit free
Ranges between -1 and 1
The closer to -1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker any linear relationship
r² is called the Coefficient of Determination
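As a quick illustration of these definitions, the sketch below computes the sample covariance and the correlation coefficient for two small arrays with NumPy; the data values are made up purely for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative X values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # illustrative Y values
n = len(x)

# Sample covariance: sum of cross-deviations divided by n - 1
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Correlation coefficient: covariance scaled by the sample standard deviations
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy, r)                                             # here r is close to +1
print(np.cov(x, y, ddof=1)[0, 1], np.corrcoef(x, y)[0, 1])   # NumPy's built-ins agree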
The Correlation Coefficient
[Scatter plots of Y versus X illustrating r = -1, r = -0.6, r = 0, r = +1, and r = +0.3]
An example
Y: 12.8, 11.3, 3.2, 6.4, 11.6, 3.2, 8.7, 10.5, 8.2, 10.5, 9.4, 12.8, 8.2
Another example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet), and then to predict the value of Y for a given value of X.
[Scatter plots of Sales versus Advertising]
Types of regression models
Regression models with one explanatory variable are simple regression models; models with two or more explanatory variables are multiple regression models. Each type can be linear or non-linear.
Linear regression model
Y = bX + a
b = slope = change in Y / change in X
a = Y-intercept
The Scatter Diagram
[Scatter diagram of Y versus X]
Types of Scatter diagrams
[Scatter diagrams: positive linear relationship; relationship not linear]
The Slope and y-intercept of
the best fitting regression line
b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}

a = \bar{Y} - b\bar{X}
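A minimal sketch of these two formulas in Python (NumPy), using small illustrative arrays: the slope b is computed from the sums, then the intercept a from the means.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
y = np.array([2.0, 4.5, 5.8, 8.2, 9.9])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Slope: b = (sum(XY) - n*Xbar*Ybar) / (sum(X^2) - n*Xbar^2)
b = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x**2) - n * x_bar**2)

# Intercept: a = Ybar - b*Xbar
a = y_bar - b * x_bar

print(a, b)
print(np.polyfit(x, y, 1))   # [slope, intercept] from NumPy's least-squares fit agrees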
Assumptions of Regression
L.I.N.E
Linearity
The relationship between X and Y is linear
Independence of Errors
Error values are statistically independent
Normality of Error
Error values are normally distributed for any given
value of X
Equal Variance (also called homoscedasticity)
The probability distribution of the errors has
constant variance
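One practical way to eyeball the L.I.N.E. assumptions is to plot the residuals against the fitted values: curvature suggests non-linearity, and a funnel shape suggests unequal variance. The sketch below does this with synthetic data (generated in the code purely for illustration) and matplotlib.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)                  # synthetic predictor
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, 100)  # linear signal plus constant-variance noise

b, a = np.polyfit(x, y, 1)                   # fitted slope and intercept
fitted = a + b * x
residuals = y - fitted

# Residuals vs fitted values: should look like a patternless horizontal band
plt.scatter(fitted, residuals)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()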
Normality and Constant
Variance Assumption
[Figure: normal error distributions f(e) with constant variance around the regression line at X1 and X2]
Variation of Errors Around
the Regression Line
Y values are normally distributed around the regression line. For each X value, the "spread" (variance) around the regression line is the same.
[Figure: distributions of Y around the regression line at X1 and X2]
Linear Regression: Assumptions
If these assumptions hold, the formulas that we use to estimate the coefficients in a regression yield BLUE (Best Linear Unbiased Estimators).
Population Linear Regression Model

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

where Y_i is the dependent (response) variable, X_i is the independent (explanatory) variable, \beta_0 is the intercept, \beta_1 is the slope, and \varepsilon_i is the random error for observation i.
The population regression line gives the mean of Y for a given X: \mu_{Y|X} = E(Y) = \beta_0 + \beta_1 X_i; an observed value Y_i deviates from this line by the random error \varepsilon_i.
Simple Linear Regression Model
\hat{Y}_i = b_0 + b_1 X_i
\hat{Y}_i = predicted value of Y for observation i (also written \hat{y} = a + bx)
y_i - \hat{y}_i = error (residual)
y_i - \bar{y} = total effect
\hat{y}_i - \bar{y} = regression effect
Revisit the Example
A random sample of 10 houses is selected: Dependent
variable (Y) = house price in $1000s, Independent variable
(X) = square feet. Let us find the equation of the straight line
that fits the data best
House Price in $1000s (Y)    Square Feet (X)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
Scatter Diagram Example
House price model: scatter plot
[Scatter plot: House Price ($1000s) versus Square Feet]
Equation for the Best Straight Line using SPSS output

Regression Statistics:
  Multiple R          0.76211
  R Square            0.58082
  Adjusted R Square   0.52842
  Standard Error      41.33032
  Observations        10

The regression equation is:
  house price = 98.24833 + 0.10977 (square feet)
[Scatter plot with the fitted line: intercept = 98.248, slope = 0.10977]
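The coefficients and R Square above can be reproduced directly from the 10 observations; the sketch below does so with NumPy (the printed values should match the reported 98.24833, 0.10977 and 0.58082 up to rounding).

import numpy as np

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])   # $1000s

b1, b0 = np.polyfit(sqft, price, 1)        # slope, intercept
r = np.corrcoef(sqft, price)[0, 1]

print(f"intercept = {b0:.5f}, slope = {b1:.5f}")       # ~98.24833, ~0.10977
print(f"Multiple R = {r:.5f}, R Square = {r**2:.5f}")  # ~0.76211, ~0.58082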
[Scatter plot of House Price ($1000s) versus Square Feet with the fitted line]
Do not try to extrapolate beyond the range of observed X's.
Linear Regression Example
Making Predictions
Predict the price for a house with 2000 square feet:
house price = 98.25 + 0.1098(2000) = 317.85
The predicted price is 317.85 ($1000s), i.e. about $317,850.
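A one-line check of this prediction using the rounded coefficients from the fitted model:

b0, b1 = 98.25, 0.1098             # rounded intercept and slope from the output above
sqft = 2000
predicted_price = b0 + b1 * sqft   # in $1000s
print(predicted_price)             # 317.85, i.e. about $317,850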
Measures of Variation:
The Sum of Squares
SST = Total Sum of Squares
• measures the variation of the Yi values around their mean Ȳ
SSR = Regression Sum of Squares
• explained variation attributable to the relationship between X and Y
SSE = Error Sum of Squares
• unexplained variation, attributable to factors other than the relationship between X and Y (SST = SSR + SSE)
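These sums of squares can be computed directly from the house-price data and the fitted line; a short sketch (the values should agree with the ANOVA table further below).

import numpy as np

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

b1, b0 = np.polyfit(sqft, price, 1)
pred = b0 + b1 * sqft

sst = np.sum((price - price.mean()) ** 2)   # total variation of Y around its mean
ssr = np.sum((pred - price.mean()) ** 2)    # variation explained by the regression
sse = np.sum((price - pred) ** 2)           # unexplained (error) variation

print(sst, ssr, sse)   # ~32600.5, ~18934.9, ~13665.6; note SST = SSR + SSE
print(ssr / sst)       # R Square, ~0.58082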
α = .05; rejection region: F > F.05 = 5.32 (d.f. = 1 and 8).
Since F = 11.08 > 5.32, reject H0.
Conclusion: there is sufficient evidence that house size affects selling price.
The Coefficient of Determination
[Scatter plots: when r² = 1 (r = +1 or r = -1), every point lies exactly on the fitted line \hat{Y}_i = b_0 + b_1 X_i]
R2 and F connection
F = \frac{SSR/(2-1)}{SSE/(n-2)} = \frac{r^2 \cdot SST/(2-1)}{(1-r^2)\cdot SST/(n-2)} = \frac{r^2}{1-r^2}\cdot\frac{n-2}{2-1}

The standard error of the estimate:

S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}}
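A quick numerical check of these two formulas with the values reported for the house-price model (r² = 0.58082, n = 10, and SSE = 13665.5652 from the ANOVA table below):

r2, n, sse = 0.58082, 10, 13665.5652

f_from_r2 = (r2 / (1 - r2)) * (n - 2)   # F = r^2 (n - 2) / (1 - r^2)
s_yx = (sse / (n - 2)) ** 0.5           # standard error of the estimate

print(f_from_r2)   # ~11.08, matching the ANOVA F below
print(s_yx)        # ~41.33, matching the reported Standard Error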
ANOVA
df SS MS F Significance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000
Test statistic: t = \frac{b_1 - \beta_1}{S_{b_1}}, where S_{b_1} = \frac{S_{YX}}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}} and d.f. = n - 2
Inferences About the Slope:
t Test Example
Estimated regression equation: house price = 98.25 + 0.1098 (sq. ft.)
The slope of this model is 0.1098.
Is there a relationship between the square footage of the house and its sales price?
(Data: the same 10 observations of house price in $1000s (Y) and square feet (X) listed above.)
Inferences About the Slope:
t Test Example
H0: β1 = 0    H1: β1 ≠ 0

From the output:
                Coefficients   Standard Error   t Stat    P-value
  Intercept     98.24833       58.03348         1.69296   0.12892
  Square Feet   0.10977        0.03297          3.32938   0.01039

t = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{0.10977 - 0}{0.03297} = 3.32938
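The standard error of the slope and the t statistic in this output can be reproduced from the data with the formula above; a sketch:

import numpy as np

sqft  = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
n = len(sqft)

b1, b0 = np.polyfit(sqft, price, 1)
resid = price - (b0 + b1 * sqft)

s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))              # standard error of the estimate
s_b1 = s_yx / np.sqrt(np.sum((sqft - sqft.mean()) ** 2))  # standard error of the slope
t = (b1 - 0) / s_b1                                       # test statistic under H0: beta1 = 0

print(s_b1, t)   # ~0.03297 and ~3.329, matching the output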
Inferences About the Slope:
t Test Example
H0: β1 = 0, H1: β1 ≠ 0
Test statistic: t = 3.329, d.f. = 10 - 2 = 8, α/2 = .025 in each tail (critical values ±2.306)
Decision: Reject H0. There is evidence that square footage is related to house price.
Interval estimate for an individual Y at a given X_i:

\hat{Y}_i \pm t_{n-2}\, S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}
Confidence Bands
The error associated with a forecast has two components:
Error at the mean (the standard error of the estimate)
Error in estimating the slope b1
Therefore, the confidence intervals around forecasts become wider as we move away from the mean of X.
Interval Estimates for Different Values of X
[Figure: around the fitted line \hat{Y}_i = b_0 + b_1 X_i, the confidence interval for an individual Y_i is wider than the confidence interval for the mean of Y, and both widen as a given X moves away from \bar{X}]
Example: Produce Stores
Data for 7 stores:

  Store   Square Feet   Annual Sales ($000)
  1       1,726         3,681
  2       1,542         3,395
  3       2,816         6,653
  4       5,555         9,543
  5       1,292         3,318
  6       2,208         5,563
  7       1,313         3,760

Regression model obtained: \hat{Y}_i = 1636.415 + 1.487 X_i
Predict the annual sales for a store with 2,000 square feet.
Estimation of Predicted Values: Example
Confidence interval estimate for an individual Y at a given X
Find the 95% confidence interval for the annual sales of one particular store of 2,000 square feet.

Predicted sales: \hat{Y}_i = 1636.415 + 1.487(2000) = 4610.45 ($000)

\hat{Y}_i \pm t_{n-2}\, S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}} = 4610.45 \pm 1853.45

(Confidence interval for an individual Y.)
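The sketch below recomputes the point forecast and a 95% interval for an individual store of 2,000 square feet from the 7 observations, using the formula above with SciPy's t quantile. Because of rounding and the particular t value used, the half-width it prints may differ somewhat from the ±1853.45 shown on the slide.

import numpy as np
from scipy import stats

sqft  = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313])
sales = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760])   # $000
n = len(sqft)

b1, b0 = np.polyfit(sqft, sales, 1)            # ~1.487 and ~1636.4
x_new = 2000
y_hat = b0 + b1 * x_new                        # ~4610 ($000)

resid = sales - (b0 + b1 * sqft)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))   # standard error of the estimate
ssx = np.sum((sqft - sqft.mean()) ** 2)

t_crit = stats.t.ppf(0.975, df=n - 2)          # two-sided 95%, d.f. = n - 2
half_width = t_crit * s_yx * np.sqrt(1 + 1/n + (x_new - sqft.mean())**2 / ssx)

print(y_hat, y_hat - half_width, y_hat + half_width)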