Session 15 Regression and Correlation


PGP 17-19

Statistics for Decision Making

Simple Linear Regression Model and Correlation

Dr. Rohit Joshi, IIMS


Contents
 Correlation - Measuring the Strength of the Association
 Types of Regression Models
 Determining the Simple Linear Regression Equation
 Measures of Variation in Regression and Correlation
 Assumptions of Regression and Correlation
 Estimation of Predicted Values
Association between two variables
Amount of time spent in class ~ understanding of statistics and SPSS

 Positively related
 Not related
 Negatively related

Covariance
Correlation
Sample Covariance
 The sample covariance measures the strength of the linear association between two numerical variables.
 The sample covariance:

$$\mathrm{cov}(X,Y)=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}$$

 The covariance is only concerned with the strength of the relationship.
 No causal effect is implied.
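The covariance formula above can be sketched in a few lines of plain Python (an illustration, not part of the original slides):

```python
# Sample covariance: cov(X, Y) = sum((Xi - Xbar)(Yi - Ybar)) / (n - 1)
def sample_covariance(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar)
               for xi, yi in zip(x, y)) / (n - 1)

# Two series that move in the same direction give a positive covariance
print(sample_covariance([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))   # 5.0
# Reversing one series flips the sign
print(sample_covariance([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))   # -5.0
```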
Covariance
 Covariance between two random variables:
 cov(X,Y) > 0: X and Y tend to move in the same direction
 cov(X,Y) < 0: X and Y tend to move in opposite directions
 cov(X,Y) = 0: X and Y have no linear association (independence implies zero covariance, but zero covariance does not imply independence)
Limitation
Dependence on measurement scale
The Correlation Coefficient
 The correlation coefficient measures the relative strength of the linear relationship between two variables.
 Sample coefficient of correlation:

$$r=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}=\frac{\mathrm{cov}(X,Y)}{S_X S_Y}$$
The Correlation Coefficient
 Unit free
 Ranges between –1 and 1
 The closer to –1, the stronger the negative linear
relationship
 The closer to 1, the stronger the positive linear
relationship
 The closer to 0, the weaker any linear relationship
 r² is called the Coefficient of Determination
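The definition of r above can be sketched directly (illustrative Python, not from the slides); note that r is unit-free and bounded by ±1:

```python
import math

def correlation(x, y):
    """Sample correlation r = cov(X, Y) / (Sx * Sy)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

print(correlation([1, 2, 3], [2, 4, 6]))   # 1.0  (perfect positive line)
print(correlation([1, 2, 3], [6, 4, 2]))   # -1.0 (perfect negative line)
```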
The Correlation Coefficient
[Figure: scatter plots of Y vs. X illustrating r = -1, r = -0.6, r = 0, r = +1, and r = +0.3]
An example

 Bank of India is interested in reducing the amount of time people spend waiting to see a personal banker. The bank wants to determine whether there is any relationship between waiting time (Y), in minutes, and the number of bankers on duty (X). Customers are randomly selected, with the data given below:

X: 2    3    5    4    2    6    1    3    4    3    3    2    4
Y: 12.8 11.3 3.2  6.4  11.6 3.2  8.7  10.5 8.2  10.5 9.4  12.8 8.2
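Applying the sample correlation coefficient to this data (an illustrative Python sketch, not part of the slides) shows a strong negative association, as expected: more bankers on duty, shorter waits.

```python
import math

x = [2, 3, 5, 4, 2, 6, 1, 3, 4, 3, 3, 2, 4]                  # bankers on duty
y = [12.8, 11.3, 3.2, 6.4, 11.6, 3.2, 8.7, 10.5, 8.2,
     10.5, 9.4, 12.8, 8.2]                                    # waiting time (min)

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))   # a strongly negative r
```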
Another example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet). He then wants to predict the value of Y for a given value of X.

House Price in $1000s (Y)   Square Feet (X)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
Regression Analysis
Correlation vs. Regression
 A scatter plot (or scatter diagram) can
be used to show the relationship
between two numerical variables
 Correlation analysis is used to measure
strength of the association (linear
relationship) between two variables
 Correlation is only concerned with strength
of the relationship
 No causal effect is implied with correlation
Regression Analysis
Regression analysis is used to:
 Predict the value of a dependent variable based

on the value of at least one independent variable


 Explain the impact of changes in an independent

variable on the dependent variable


Dependent variable: the variable you wish to explain
Independent variable: the variable used to explain
the dependent variable
Hypothesis Testing
 For example: hypothesis 1 : X is statistically
significantly related to Y.
 The relationship is positive (as X increases, Y
increases) or negative (as X increases, Y
decreases).
 The magnitude of the relationship is small,
medium, or large.
If the magnitude is small, then a unit change in x
is associated with a small change in Y.
Simple linear regression model
 Only one independent variable, X
 Relationship between X and Y is described by a linear
function
 Changes in Y are related to changes in X
Model specification is based on theory:

 Economic, psychological & business theory
 Mathematical theory
 Previous research
 'Common sense'
 We ASSUME causality flows from X to Y
Which one is more logical?

[Figure: four candidate scatter patterns of Sales vs. Advertising]
Types of regression models

Regression models with one explanatory variable are simple; models with two or more explanatory variables are multiple. Each can be linear or non-linear.
Linear regression model

$$Y = a + bX$$

b = slope (change in Y per unit change in X); a = Y-intercept
The Scatter Diagram

Plot of all (Xi, Yi) pairs

[Figure: scatter plot]
Types of Scatter diagrams
Positive Linear Relationship Relationship NOT Linear

Negative Linear Relationship No Relationship


The Slope and Y-intercept of the best-fitting regression line

$$b=\frac{\sum XY-n\bar{X}\bar{Y}}{\sum X^{2}-n\bar{X}^{2}},\qquad a=\bar{Y}-b\bar{X}$$
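A minimal Python sketch of these least-squares formulas (illustrative, not from the slides):

```python
def least_squares(x, y):
    """Slope b and intercept a of the best-fitting line Y = a + b*X."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
        (sum(xi ** 2 for xi in x) - n * x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b

# Points on the exact line Y = 1 + 2X are recovered exactly
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)   # 1.0 2.0
```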
Assumptions of Regression
L.I.N.E
 Linearity
 The relationship between X and Y is linear

 Independence of Errors
 Error values are statistically independent

 Normality of Error
 Error values are normally distributed for any given

value of X
 Equal Variance (also called homoscedasticity)
 The probability distribution of the errors has

constant variance
Normality and Constant Variance Assumption

[Figure: normal error density f(e) at X1 and X2, identical in spread, centered on the regression line]
Variation of Errors Around the Regression Line

 Y values are normally distributed around the regression line.
 For each X value, the "spread" or variance around the regression line is the same.

[Figure: normal error distributions f(e) of equal variance at X1 and X2 around the regression line]
Linear Regression: Assumptions
If these assumptions hold, the formulas that we use to estimate the coefficients in a regression yield BLUE (Best Linear Unbiased Estimators):

 Best = "most efficient" = smallest variance
 Unbiased = expected value of the estimator equals the true population value
Simple Linear Regression Model
• The relationship between the variables is a linear function
• The straight line that best fits the data

$$Y_i=\beta_0+\beta_1 X_i+\varepsilon_i$$

Yi = dependent (response) variable; Xi = independent (explanatory) variable; β0 = Y-intercept (constant term); β1 = slope; εi = random error
Population Linear Regression Model

$$Y_i=\beta_0+\beta_1 X_i+\varepsilon_i \quad\text{(observed value)}$$
$$\mu_{Y|X}=E(Y)=\beta_0+\beta_1 X_i$$

εi = random error
Simple Linear Regression Model

$$\hat{Y}_i=b_0+b_1 X_i$$

Ŷi = predicted value of Y for observation i
Xi = value of X for observation i
b0 = sample Y-intercept, used as an estimate of the population β0
b1 = sample slope, used as an estimate of the population β1
Decomposition of Effects

$$\hat{y}=a+bx$$
$$y_i-\hat{y}_i=\text{error (residual)}$$
$$y_i-\bar{y}=\text{total effect}$$
$$\hat{y}_i-\bar{y}=\text{regression effect}$$
Revisit the Example
A random sample of 10 houses is selected: Dependent
variable (Y) = house price in $1000s, Independent variable
(X) = square feet. Let us find the equation of the straight line
that fits the data best
House Price in $1000s Square Feet
(Y) (X)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
Scatter Diagram Example
 House price model: scatter plot

[Figure: scatter plot of house price ($1000s) vs. square feet]
Equation for the Best Straight Line using SPSS output

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

If X = 0, then Ŷ = 98.248. Realistic?

ANOVA
            df   SS          MS          F        Significance F
Regression  1    18934.9348  18934.9348  11.0848  0.01039
Residual    8    13665.5652  1708.1957
Total       9    32600.5000

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet   0.10977       0.03297         3.32938  0.01039  0.03374    0.18580
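The SPSS coefficients above can be reproduced from the raw house data with the least-squares formulas given earlier (an illustrative Python check, not part of the slides):

```python
# (square feet, price in $1000s) for the 10 sampled houses
data = [(1400, 245), (1600, 312), (1700, 279), (1875, 308), (1100, 199),
        (1550, 219), (2350, 405), (2450, 324), (1425, 319), (1700, 255)]
x = [sqft for sqft, price in data]
y = [price for sqft, price in data]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
     (sum(xi ** 2 for xi in x) - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar
print(round(b0, 5), round(b1, 5))   # ≈ 98.24833 0.10977
```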
Linear Regression Example: Graphical Representation
 House price model: scatter plot and regression line

[Figure: scatter plot of house price ($1000s) vs. square feet with fitted line; slope = 0.10977, intercept = 98.248]

house price = 98.24833 + 0.10977 (square feet)
Linear Regression Example: Making Predictions
When using a regression model for prediction, only predict within the relevant range of data.

[Figure: scatter plot marking the relevant range for interpolation; do not try to extrapolate beyond the range of observed X's]
Linear Regression Example: Making Predictions
Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098(2000)
            = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850.
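The arithmetic above in Python (an illustrative check using the slide's rounded coefficients):

```python
b0, b1 = 98.25, 0.1098          # rounded coefficients from the fitted model
sqft = 2000
price = b0 + b1 * sqft          # predicted price in $1000s
print(round(price, 2))          # 317.85 -> $317,850
```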
Interpreting the Results
Linear Regression Example
Interpretation of b0
house price = 98.24833 + 0.10977 (square feet)

 b0 is the estimated mean value of Y


when the value of X is zero (if X = 0 is in
the range of observed X values)
 Because the square footage of the
house cannot be 0, the Y intercept has
no practical application.
Linear Regression Example: Interpretation of b1

house price = 98.24833 + 0.10977 (square feet)

 b1 measures the mean change in Y for a one-unit change in X
 Here, b1 = 0.10977 tells us that the mean value of a house increases by 0.10977($1000) = $109.77, on average, for each additional square foot of size
Reliability of the estimated equation
Does X explain a significant portion of the variation in Y?
Explaining Variation in Y
 If X and Y have no relationship, we
should predict the mean of Y for every
X value.
 We would like to measure whether
knowing the value of X helps us explain
why Y differs from its mean value.
Measures of Variation: The Sum of Squares

$$SST=\sum(Y_i-\bar{Y})^2 \qquad SSR=\sum(\hat{Y}_i-\bar{Y})^2 \qquad SSE=\sum(Y_i-\hat{Y}_i)^2$$

[Figure: decomposition of the deviation of Yi from Ȳ into SSR and SSE around the fitted line Ŷi = b0 + b1Xi]
Measures of Variation: The Sum of Squares
SST = Total Sum of Squares
• measures the variation of the Yi values around their mean Ȳ
SSR = Regression Sum of Squares
• explained variation attributable to the relationship between X and Y
SSE = Error Sum of Squares
• variation attributable to factors other than the relationship between X and Y
Measures of Variation: The Sum of Squares
SST = Total Sum of Squares
• This is the identical measure that we used in ANOVA
SSR = Regression Sum of Squares
• We called this the Sum of Squares Among in ANOVA
SSE = Error Sum of Squares
• We called this the Sum of Squares Within in ANOVA
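These three sums of squares can be computed directly for the house-price data, and the identity SST = SSR + SSE verified (illustrative Python, not from the slides):

```python
data = [(1400, 245), (1600, 312), (1700, 279), (1875, 308), (1100, 199),
        (1550, 219), (2350, 405), (2450, 324), (1425, 319), (1700, 255)]
x = [d[0] for d in data]
y = [d[1] for d in data]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares fit
b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
     (sum(xi ** 2 for xi in x) - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

SST = sum((yi - y_bar) ** 2 for yi in y)                # total variation
SSR = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation

print(round(SST, 1), round(SSR, 1), round(SSE, 1))      # 32600.5 18934.9 13665.6
```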
F-Test for Significance Output

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

$$F=\frac{MSR}{MSE}=\frac{18934.9348}{1708.1957}=11.0848$$

with 1 and 8 degrees of freedom; the p-value for the F-test is Significance F = 0.01039

ANOVA
            df   SS          MS          F        Significance F
Regression  1    18934.9348  18934.9348  11.0848  0.01039
Residual    8    13665.5652  1708.1957
Total       9    32600.5000
Interpreting Anova Results
 The F-test tests the null hypothesis that
the regression does not explain a
significant proportion of the variation in Y
 The degrees of freedom for the F-test of a
simple regression are 1 and n-2
 In this example, F=11.08 with 1 and 8
degrees of freedom.
F-Test for Significance
 H0: β1 = 0;  H1: β1 ≠ 0
 α = .05; df1 = 1, df2 = 8
 Critical value: F.05 = 5.32
 Test statistic: F = MSR/MSE = 11.08
 Decision: Reject H0 at α = 0.05, since F = 11.08 > 5.32
 Conclusion: There is sufficient evidence that house size affects selling price
The Coefficient of Determination

$$r^2=\frac{SSR}{SST}=\frac{\text{regression sum of squares}}{\text{total sum of squares}}$$

 Measures the proportion of variation that is explained by the independent variable X in the regression model
Linear Regression Example: Coefficient of Determination, r²

$$r^2=\frac{SSR}{SST}=\frac{18934.9348}{32600.5000}=0.58082$$

58.08% of the variation in house prices is explained by variation in square feet.

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
            df   SS          MS          F        Significance F
Regression  1    18934.9348  18934.9348  11.0848  0.01039
Residual    8    13665.5652  1708.1957
Total       9    32600.5000

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet   0.10977       0.03297         3.32938  0.01039  0.03374    0.18580
Coefficients of Determination (r²) and Correlation (r)

[Figure: fitted lines Ŷi = b0 + b1Xi illustrating r² = 1 (r = +1), r² = 1 (r = -1), r² = .8 (r = +0.9), and r² = 0 (r = 0)]
R² and F connection

$$F=\frac{SSR/(2-1)}{SSE/(n-2)}=\frac{r^2\cdot SST}{(1-r^2)\cdot SST}\cdot\frac{n-2}{1}=\frac{r^2(n-2)}{1-r^2}$$

 The F-test can be written in terms of r².
 The F-test is the test that r² = 0.
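Plugging the house-price numbers into this identity recovers the F statistic from the ANOVA table (illustrative Python):

```python
r2, n = 0.58082, 10             # r-squared and sample size from the output
F = (r2 / (1 - r2)) * (n - 2)
print(round(F, 2))              # ≈ 11.08, matching F = MSR/MSE
```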
Standard Error of Estimate

$$S_{YX}=\sqrt{\frac{SSE}{n-2}}=\sqrt{\frac{\sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2}{n-2}}$$

 The standard deviation of the variation of observations around the regression line
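For the house-price model, S_YX can be recomputed from the SSE in the ANOVA table (illustrative Python):

```python
import math

SSE, n = 13665.5652, 10                  # from the ANOVA table
s_yx = math.sqrt(SSE / (n - 2))          # standard error of the estimate
print(round(s_yx, 5))                    # ≈ 41.33032
```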
Linear Regression Example: Standard Error of Estimate

$$S_{YX}=41.33032$$

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
            df   SS          MS          F        Significance F
Regression  1    18934.9348  18934.9348  11.0848  0.01039
Residual    8    13665.5652  1708.1957
Total       9    32600.5000

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet   0.10977       0.03297         3.32938  0.01039  0.03374    0.18580
Inferences About the Slope: t Test
• t test for a population slope: is there a linear relationship between X and Y?
• Null and alternative hypotheses:
H0: β1 = 0 (no linear relationship);  H1: β1 ≠ 0 (linear relationship)
• Test statistic:

$$t=\frac{b_1-\beta_1}{S_{b_1}},\qquad S_{b_1}=\frac{S_{YX}}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}},\qquad df=n-2$$
Inferences About the Slope: t Test Example
Estimated regression equation (from the 10-house data above):

house price = 98.25 + 0.1098 (sq. ft.)

The slope of this model is 0.1098. Is there a relationship between the square footage of the house and its sales price?
Inferences About the Slope: t Test Example
 H0: β1 = 0;  H1: β1 ≠ 0

From output:
              Coefficients  Standard Error  t Stat   P-value
Intercept     98.24833      58.03348        1.69296  0.12892
Square Feet   0.10977       0.03297         3.32938  0.01039

$$t=\frac{b_1-\beta_1}{S_{b_1}}=\frac{0.10977-0}{0.03297}=3.32938$$
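The t statistic is just the coefficient divided by its standard error (illustrative Python check):

```python
b1, s_b1 = 0.10977, 0.03297     # slope and its standard error from the output
t = (b1 - 0) / s_b1             # hypothesized beta1 = 0 under H0
print(round(t, 4))              # ≈ 3.3294
```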
Inferences About the Slope: t Test Example
 H0: β1 = 0;  H1: β1 ≠ 0
Test statistic: t = 3.329;  d.f. = 10 - 2 = 8;  α/2 = .025
Critical values: ±tα/2 = ±2.3060
Decision: Reject H0, since t = 3.329 > 2.3060.
There is sufficient evidence that square footage affects house price.
Inferences About the Slope:
t Test Example
From the output: P-Value
  Coefficients Standard Error t Stat P-value
Intercept 98.24833 58.03348 1.69296 0.12892
Square Feet 0.10977 0.03297 3.32938 0.01039
 H0: β1 = 0
 H1: β1 ≠ 0
Decision: Reject H0, since p-value < α

There is sufficient evidence that


square footage affects house price.
Confidence Interval Estimate for the Slope
Confidence interval estimate of the slope:

$$b_1 \pm t_{n-2}\,S_{b_1},\qquad d.f.=n-2$$

Excel printout for house prices:

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet   0.10977       0.03297         3.32938  0.01039  0.03374    0.18580

At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858).
Confidence Interval Estimate
for the Slope
  Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580

Since the units of the house price variable are $1000s, you
are 95% confident that the mean change in sales price is
between $33.74 and $185.80 per additional square foot of house size.

This 95% confidence interval does not include 0.


Conclusion: There is a significant relationship between house price and
square feet at the .05 level of significance
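The interval endpoints can be reproduced from b1, its standard error, and the critical value t(0.025, df = 8) = 2.3060 (illustrative Python):

```python
b1, s_b1, t_crit = 0.10977, 0.03297, 2.3060   # t_crit = t(0.025, df=8)
lo = b1 - t_crit * s_b1
hi = b1 + t_crit * s_b1
print(round(lo, 5), round(hi, 5))             # ≈ 0.03374 0.1858
```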
Estimation of Predicted Values
Interval estimate for an individual response Yi at a particular Xi:

$$\hat{Y}_i \pm t_{n-2}\,S_{YX}\sqrt{1+\frac{1}{n}+\frac{(X_i-\bar{X})^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}}$$

S_YX is the standard error of the estimate; the t value comes from the table with df = n - 2; the size of the interval varies with distance from the mean X̄.
Confidence Bands
 Error associated with a forecast has two
components:
 Error at the mean (standard error of
estimate)
 Error in estimating β1 (the slope)
 Therefore, the confidence intervals around
forecasts will be larger as we move away
from the mean of X
Interval Estimates for Different Values of X

[Figure: fitted line Ŷi = b0 + b1Xi with the confidence interval for the mean of Y and the wider interval for an individual Yi, both widening with distance from X̄]
Example: Produce Stores
Data for 7 stores:

Store  Square Feet  Annual Sales ($000)
1      1,726        3,681
2      1,542        3,395
3      2,816        6,653
4      5,555        9,543
5      1,292        3,318
6      2,208        5,563
7      1,313        3,760

Predict the annual sales for a store with 2,000 square feet.

Regression model obtained: Ŷi = 1636.415 + 1.487 Xi
Estimation of Predicted Values: Example
Find the 95% confidence interval for annual sales of one particular store of 2,000 square feet.

Predicted sales: Ŷi = 1636.415 + 1.487(2000) = 4610.45 ($000)
X̄ = 2350.29;  S_YX = 611.75;  t_{n-2} = t_5 = 2.5706

$$\hat{Y}_i \pm t_{n-2}\,S_{YX}\sqrt{1+\frac{1}{n}+\frac{(X_i-\bar{X})^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}} = 4610.45 \pm 1853.45$$

(Confidence interval for an individual Y.)
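A sketch that recomputes this interval from the raw store data (illustrative Python; the critical value t5 = 2.5706 is taken from the slide, and the figures may differ somewhat from the slide's, which uses rounded intermediate values):

```python
import math

sqft  = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]   # $000
n = len(sqft)
x_bar, y_bar = sum(sqft) / n, sum(sales) / n

# Least-squares fit (matches the slide's Y = 1636.415 + 1.487 X after rounding)
b = (sum(x * y for x, y in zip(sqft, sales)) - n * x_bar * y_bar) / \
    (sum(x ** 2 for x in sqft) - n * x_bar ** 2)
a = y_bar - b * x_bar

# Standard error of the estimate
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(sqft, sales))
s_yx = math.sqrt(sse / (n - 2))

# 95% interval for an individual store with 2000 sq. ft.
x_new = 2000
ssx = sum((x - x_bar) ** 2 for x in sqft)
t_crit = 2.5706                                       # t(0.025, df=5), from the slide
y_hat = a + b * x_new
margin = t_crit * s_yx * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / ssx)
print(round(y_hat, 1), "+/-", round(margin, 1))
```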