Session 15 Regression and Correlation


PGP 17-19

Statistics for Decision Making

Simple Linear Regression Model and Correlation

Dr. Rohit Joshi, IIMS


Contents
 Correlation - Measuring the Strength of the Association
 Types of Regression Models
 Determining the Simple Linear Regression Equation
 Measures of Variation in Regression and Correlation
 Assumptions of Regression and Correlation
 Estimation of Predicted Values
Association between two variables
Amount of time spent in class ~ understanding of statistics and SPSS

 Positively related
 Not related
 Negatively related

Covariance
Correlation
Sample Covariance
 The sample covariance measures the strength of the linear association between two numerical variables.
 The sample covariance:

$$\mathrm{cov}(X,Y)=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}$$

 The covariance is only concerned with the strength of the relationship.
 No causal effect is implied.
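The covariance formula above can be sketched in a few lines of plain Python (an illustration, not part of the original slides):

```python
# Sample covariance: cov(X, Y) = sum((Xi - Xbar)(Yi - Ybar)) / (n - 1)
def sample_covariance(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar)
               for xi, yi in zip(x, y)) / (n - 1)

# Two series that move in the same direction give a positive covariance
print(sample_covariance([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))   # 5.0
# Reversing one series flips the sign
print(sample_covariance([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))   # -5.0
```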
Covariance
 Covariance between two random variables:
 cov(X,Y) > 0: X and Y tend to move in the same direction
 cov(X,Y) < 0: X and Y tend to move in opposite directions
 cov(X,Y) = 0: X and Y have no linear association (independence implies zero covariance, but zero covariance does not imply independence)
Limitation
Dependence on measurement scale
The Correlation Coefficient
 The correlation coefficient measures the relative strength of the linear relationship between two variables.
 Sample coefficient of correlation:

$$r=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}=\frac{\mathrm{cov}(X,Y)}{S_X S_Y}$$
The Correlation Coefficient
 Unit free
 Ranges between –1 and 1
 The closer to –1, the stronger the negative linear
relationship
 The closer to 1, the stronger the positive linear
relationship
 The closer to 0, the weaker any linear relationship
 r² is called the Coefficient of Determination
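The definition of r above can be sketched directly (illustrative Python, not from the slides); note that r is unit-free and bounded by ±1:

```python
import math

def correlation(x, y):
    """Sample correlation r = cov(X, Y) / (Sx * Sy)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

print(correlation([1, 2, 3], [2, 4, 6]))   # 1.0  (perfect positive line)
print(correlation([1, 2, 3], [6, 4, 2]))   # -1.0 (perfect negative line)
```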
The Correlation Coefficient
[Figure: scatter plots of Y vs. X illustrating r = -1, r = -0.6, r = 0, r = +1, and r = +0.3]
An example

 Bank of India is interested in reducing the amount of time people spend waiting to see a personal banker. The bank wants to determine whether there is any relationship between waiting time (Y), in minutes, and the number of bankers on duty (X). Customers are randomly selected, with the data given below:

X: 2    3    5    4    2    6    1    3    4    3    3    2    4
Y: 12.8 11.3 3.2  6.4  11.6 3.2  8.7  10.5 8.2  10.5 9.4  12.8 8.2
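Applying the sample correlation coefficient to this data (an illustrative Python sketch, not part of the slides) shows a strong negative association, as expected: more bankers on duty, shorter waits.

```python
import math

x = [2, 3, 5, 4, 2, 6, 1, 3, 4, 3, 3, 2, 4]                  # bankers on duty
y = [12.8, 11.3, 3.2, 6.4, 11.6, 3.2, 8.7, 10.5, 8.2,
     10.5, 9.4, 12.8, 8.2]                                    # waiting time (min)

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))   # a strongly negative r
```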
Another example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet). He then wants to predict the value of Y for a given value of X.

House Price in $1000s (Y)   Square Feet (X)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
Regression Analysis
Correlation vs. Regression
 A scatter plot (or scatter diagram) can
be used to show the relationship
between two numerical variables
 Correlation analysis is used to measure
strength of the association (linear
relationship) between two variables
 Correlation is only concerned with strength
of the relationship
 No causal effect is implied with correlation
Regression Analysis
Regression analysis is used to:
 Predict the value of a dependent variable based

on the value of at least one independent variable


 Explain the impact of changes in an independent

variable on the dependent variable


Dependent variable: the variable you wish to explain
Independent variable: the variable used to explain
the dependent variable
Hypothesis Testing
 For example: hypothesis 1 : X is statistically
significantly related to Y.
 The relationship is positive (as X increases, Y
increases) or negative (as X increases, Y
decreases).
 The magnitude of the relationship is small,
medium, or large.
If the magnitude is small, then a unit change in x
is associated with a small change in Y.
Simple linear regression model
 Only one independent variable, X
 Relationship between X and Y is described by a linear
function
 Changes in Y are related to changes in X
Model specification is based on theory:

 Economic, psychological & business theory
 Mathematical theory
 Previous research
 'Common sense'
 We ASSUME causality flows from X to Y
Which one is more logical?

[Figure: four candidate scatter patterns of Sales vs. Advertising]
Types of regression models

Regression models with one explanatory variable are simple; models with two or more explanatory variables are multiple. Each can be linear or non-linear.
Linear regression model

$$Y = a + bX$$

b = slope (change in Y per unit change in X); a = Y-intercept
The Scatter Diagram

Plot of all (Xi, Yi) pairs

[Figure: scatter plot]
Types of Scatter diagrams
Positive Linear Relationship Relationship NOT Linear

Negative Linear Relationship No Relationship


The Slope and Y-intercept of the best-fitting regression line

$$b=\frac{\sum XY-n\bar{X}\bar{Y}}{\sum X^{2}-n\bar{X}^{2}},\qquad a=\bar{Y}-b\bar{X}$$
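A minimal Python sketch of these least-squares formulas (illustrative, not from the slides):

```python
def least_squares(x, y):
    """Slope b and intercept a of the best-fitting line Y = a + b*X."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
        (sum(xi ** 2 for xi in x) - n * x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b

# Points on the exact line Y = 1 + 2X are recovered exactly
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)   # 1.0 2.0
```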
Assumptions of Regression
L.I.N.E
 Linearity
 The relationship between X and Y is linear

 Independence of Errors
 Error values are statistically independent

 Normality of Error
 Error values are normally distributed for any given

value of X
 Equal Variance (also called homoscedasticity)
 The probability distribution of the errors has

constant variance
Normality and Constant Variance Assumption

[Figure: normal error density f(e) at X1 and X2, identical in spread, centered on the regression line]
Variation of Errors Around the Regression Line

 Y values are normally distributed around the regression line.
 For each X value, the "spread" or variance around the regression line is the same.

[Figure: normal error distributions f(e) of equal variance at X1 and X2 around the regression line]
Linear Regression: Assumptions
If these assumptions hold, the formulas that we use to estimate the coefficients in a regression yield BLUE (Best Linear Unbiased Estimators):

 Best = "most efficient" = smallest variance
 Unbiased = expected value of the estimator equals the true population value
Simple Linear Regression Model
• The relationship between the variables is a linear function
• The straight line that best fits the data

$$Y_i=\beta_0+\beta_1 X_i+\varepsilon_i$$

Yi = dependent (response) variable; Xi = independent (explanatory) variable; β0 = Y-intercept (constant term); β1 = slope; εi = random error
Population Linear Regression Model

$$Y_i=\beta_0+\beta_1 X_i+\varepsilon_i \quad\text{(observed value)}$$
$$\mu_{Y|X}=E(Y)=\beta_0+\beta_1 X_i$$

εi = random error
Simple Linear Regression Model

$$\hat{Y}_i=b_0+b_1 X_i$$

Ŷi = predicted value of Y for observation i
Xi = value of X for observation i
b0 = sample Y-intercept, used as an estimate of the population β0
b1 = sample slope, used as an estimate of the population β1
Decomposition of Effects

$$\hat{y}=a+bx$$
$$y_i-\hat{y}_i=\text{error (residual)}$$
$$y_i-\bar{y}=\text{total effect}$$
$$\hat{y}_i-\bar{y}=\text{regression effect}$$
Revisit the Example
A random sample of 10 houses is selected: Dependent
variable (Y) = house price in $1000s, Independent variable
(X) = square feet. Let us find the equation of the straight line
that fits the data best
House Price in $1000s Square Feet
(Y) (X)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
Scatter Diagram Example
 House price model: scatter plot

[Figure: scatter plot of house price ($1000s) vs. square feet]
Equation for the Best Straight Line using SPSS output

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

If X = 0, then Ŷ = 98.248. Realistic?

ANOVA
            df   SS          MS          F        Significance F
Regression  1    18934.9348  18934.9348  11.0848  0.01039
Residual    8    13665.5652  1708.1957
Total       9    32600.5000

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet   0.10977       0.03297         3.32938  0.01039  0.03374    0.18580
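The SPSS coefficients above can be reproduced from the raw house data with the least-squares formulas given earlier (an illustrative Python check, not part of the slides):

```python
# (square feet, price in $1000s) for the 10 sampled houses
data = [(1400, 245), (1600, 312), (1700, 279), (1875, 308), (1100, 199),
        (1550, 219), (2350, 405), (2450, 324), (1425, 319), (1700, 255)]
x = [sqft for sqft, price in data]
y = [price for sqft, price in data]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
     (sum(xi ** 2 for xi in x) - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar
print(round(b0, 5), round(b1, 5))   # ≈ 98.24833 0.10977
```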
Linear Regression Example: Graphical Representation
 House price model: scatter plot and regression line

[Figure: scatter plot of house price ($1000s) vs. square feet with fitted line; slope = 0.10977, intercept = 98.248]

house price = 98.24833 + 0.10977 (square feet)
Linear Regression Example: Making Predictions
When using a regression model for prediction, only predict within the relevant range of data.

[Figure: scatter plot marking the relevant range for interpolation; do not try to extrapolate beyond the range of observed X's]
Linear Regression Example: Making Predictions
Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098(2000)
            = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850.
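The arithmetic above in Python (an illustrative check using the slide's rounded coefficients):

```python
b0, b1 = 98.25, 0.1098          # rounded coefficients from the fitted model
sqft = 2000
price = b0 + b1 * sqft          # predicted price in $1000s
print(round(price, 2))          # 317.85 -> $317,850
```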
Interpreting the Results
Linear Regression Example
Interpretation of b0
house price = 98.24833 + 0.10977 (square feet)

 b0 is the estimated mean value of Y


when the value of X is zero (if X = 0 is in
the range of observed X values)
 Because the square footage of the
house cannot be 0, the Y intercept has
no practical application.
Linear Regression Example: Interpretation of b1

house price = 98.24833 + 0.10977 (square feet)

 b1 measures the mean change in Y for a one-unit change in X
 Here, b1 = 0.10977 tells us that the mean value of a house increases by 0.10977($1000) = $109.77, on average, for each additional square foot of size
Reliability of the estimated equation
Does X explain a significant portion of the variation in Y?
Explaining Variation in Y
 If X and Y have no relationship, we
should predict the mean of Y for every
X value.
 We would like to measure whether
knowing the value of X helps us explain
why Y differs from its mean value.
Measures of Variation: The Sum of Squares

$$SST=\sum(Y_i-\bar{Y})^2 \qquad SSR=\sum(\hat{Y}_i-\bar{Y})^2 \qquad SSE=\sum(Y_i-\hat{Y}_i)^2$$

[Figure: decomposition of the deviation of Yi from Ȳ into SSR and SSE around the fitted line Ŷi = b0 + b1Xi]
Measures of Variation: The Sum of Squares
SST = Total Sum of Squares
• measures the variation of the Yi values around their mean Ȳ
SSR = Regression Sum of Squares
• explained variation attributable to the relationship between X and Y
SSE = Error Sum of Squares
• variation attributable to factors other than the relationship between X and Y
Measures of Variation: The Sum of Squares
SST = Total Sum of Squares
• This is the identical measure that we used in ANOVA
SSR = Regression Sum of Squares
• We called this the Sum of Squares Among in ANOVA
SSE = Error Sum of Squares
• We called this the Sum of Squares Within in ANOVA
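These three sums of squares can be computed directly for the house-price data, and the identity SST = SSR + SSE verified (illustrative Python, not from the slides):

```python
data = [(1400, 245), (1600, 312), (1700, 279), (1875, 308), (1100, 199),
        (1550, 219), (2350, 405), (2450, 324), (1425, 319), (1700, 255)]
x = [d[0] for d in data]
y = [d[1] for d in data]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares fit
b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
     (sum(xi ** 2 for xi in x) - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

SST = sum((yi - y_bar) ** 2 for yi in y)                # total variation
SSR = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation

print(round(SST, 1), round(SSR, 1), round(SSE, 1))      # 32600.5 18934.9 13665.6
```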
F-Test for Significance Output

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

$$F=\frac{MSR}{MSE}=\frac{18934.9348}{1708.1957}=11.0848$$

with 1 and 8 degrees of freedom; the p-value for the F-test is Significance F = 0.01039

ANOVA
            df   SS          MS          F        Significance F
Regression  1    18934.9348  18934.9348  11.0848  0.01039
Residual    8    13665.5652  1708.1957
Total       9    32600.5000
Interpreting Anova Results
 The F-test tests the null hypothesis that
the regression does not explain a
significant proportion of the variation in Y
 The degrees of freedom for the F-test of a
simple regression are 1 and n-2
 In this example, F=11.08 with 1 and 8
degrees of freedom.
F-Test for Significance
 H0: β1 = 0;  H1: β1 ≠ 0
 α = .05; df1 = 1, df2 = 8
 Critical value: F.05 = 5.32
 Test statistic: F = MSR/MSE = 11.08
 Decision: Reject H0 at α = 0.05, since F = 11.08 > 5.32
 Conclusion: There is sufficient evidence that house size affects selling price
The Coefficient of Determination

$$r^2=\frac{SSR}{SST}=\frac{\text{regression sum of squares}}{\text{total sum of squares}}$$

 Measures the proportion of variation that is explained by the independent variable X in the regression model
Linear Regression Example: Coefficient of Determination, r²

$$r^2=\frac{SSR}{SST}=\frac{18934.9348}{32600.5000}=0.58082$$

58.08% of the variation in house prices is explained by variation in square feet.

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
            df   SS          MS          F        Significance F
Regression  1    18934.9348  18934.9348  11.0848  0.01039
Residual    8    13665.5652  1708.1957
Total       9    32600.5000

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet   0.10977       0.03297         3.32938  0.01039  0.03374    0.18580
Coefficients of Determination (r²) and Correlation (r)

[Figure: fitted lines Ŷi = b0 + b1Xi illustrating r² = 1 (r = +1), r² = 1 (r = -1), r² = .8 (r = +0.9), and r² = 0 (r = 0)]
R² and F connection

$$F=\frac{SSR/(2-1)}{SSE/(n-2)}=\frac{r^2\cdot SST}{(1-r^2)\cdot SST}\cdot\frac{n-2}{1}=\frac{r^2(n-2)}{1-r^2}$$

 The F-test can be written in terms of r².
 The F-test is the test that r² = 0.
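Plugging the house-price numbers into this identity recovers the F statistic from the ANOVA table (illustrative Python):

```python
r2, n = 0.58082, 10             # r-squared and sample size from the output
F = (r2 / (1 - r2)) * (n - 2)
print(round(F, 2))              # ≈ 11.08, matching F = MSR/MSE
```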
Standard Error of Estimate

$$S_{YX}=\sqrt{\frac{SSE}{n-2}}=\sqrt{\frac{\sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2}{n-2}}$$

 The standard deviation of the variation of observations around the regression line
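For the house-price model, S_YX can be recomputed from the SSE in the ANOVA table (illustrative Python):

```python
import math

SSE, n = 13665.5652, 10                  # from the ANOVA table
s_yx = math.sqrt(SSE / (n - 2))          # standard error of the estimate
print(round(s_yx, 5))                    # ≈ 41.33032
```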
Linear Regression Example: Standard Error of Estimate

$$S_{YX}=41.33032$$

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
            df   SS          MS          F        Significance F
Regression  1    18934.9348  18934.9348  11.0848  0.01039
Residual    8    13665.5652  1708.1957
Total       9    32600.5000

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet   0.10977       0.03297         3.32938  0.01039  0.03374    0.18580
Inferences About the Slope: t Test
• t test for a population slope: is there a linear relationship between X and Y?
• Null and alternative hypotheses:
H0: β1 = 0 (no linear relationship);  H1: β1 ≠ 0 (linear relationship)
• Test statistic:

$$t=\frac{b_1-\beta_1}{S_{b_1}},\qquad S_{b_1}=\frac{S_{YX}}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}},\qquad df=n-2$$
Inferences About the Slope: t Test Example
Estimated regression equation (from the 10-house data above):

house price = 98.25 + 0.1098 (sq. ft.)

The slope of this model is 0.1098. Is there a relationship between the square footage of the house and its sales price?
Inferences About the Slope: t Test Example
 H0: β1 = 0;  H1: β1 ≠ 0

From output:
              Coefficients  Standard Error  t Stat   P-value
Intercept     98.24833      58.03348        1.69296  0.12892
Square Feet   0.10977       0.03297         3.32938  0.01039

$$t=\frac{b_1-\beta_1}{S_{b_1}}=\frac{0.10977-0}{0.03297}=3.32938$$
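The t statistic is just the coefficient divided by its standard error (illustrative Python check):

```python
b1, s_b1 = 0.10977, 0.03297     # slope and its standard error from the output
t = (b1 - 0) / s_b1             # hypothesized beta1 = 0 under H0
print(round(t, 4))              # ≈ 3.3294
```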
Inferences About the Slope: t Test Example
 H0: β1 = 0;  H1: β1 ≠ 0
Test statistic: t = 3.329;  d.f. = 10 - 2 = 8;  α/2 = .025
Critical values: ±tα/2 = ±2.3060
Decision: Reject H0, since t = 3.329 > 2.3060.
There is sufficient evidence that square footage affects house price.
Inferences About the Slope:
t Test Example
From the output: P-Value
  Coefficients Standard Error t Stat P-value
Intercept 98.24833 58.03348 1.69296 0.12892
Square Feet 0.10977 0.03297 3.32938 0.01039
 H0: β1 = 0
 H1: β1 ≠ 0
Decision: Reject H0, since p-value < α

There is sufficient evidence that


square footage affects house price.
Confidence Interval Estimate for the Slope
Confidence interval estimate of the slope:

$$b_1 \pm t_{n-2}\,S_{b_1},\qquad d.f.=n-2$$

Excel printout for house prices:

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept     98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet   0.10977       0.03297         3.32938  0.01039  0.03374    0.18580

At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858).
Confidence Interval Estimate
for the Slope
  Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580

Since the units of the house price variable are $1000s, you
are 95% confident that the mean change in sales price is
between $33.74 and $185.80 per additional square foot of house size.

This 95% confidence interval does not include 0.


Conclusion: There is a significant relationship between house price and
square feet at the .05 level of significance
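The interval endpoints can be reproduced from b1, its standard error, and the critical value t(0.025, df = 8) = 2.3060 (illustrative Python):

```python
b1, s_b1, t_crit = 0.10977, 0.03297, 2.3060   # t_crit = t(0.025, df=8)
lo = b1 - t_crit * s_b1
hi = b1 + t_crit * s_b1
print(round(lo, 5), round(hi, 5))             # ≈ 0.03374 0.1858
```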
Estimation of Predicted Values
Interval estimate for an individual response Yi at a particular Xi:

$$\hat{Y}_i \pm t_{n-2}\,S_{YX}\sqrt{1+\frac{1}{n}+\frac{(X_i-\bar{X})^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}}$$

S_YX is the standard error of the estimate; the t value comes from the table with df = n - 2; the size of the interval varies with distance from the mean X̄.
Confidence Bands
 Error associated with a forecast has two
components:
 Error at the mean (standard error of
estimate)
 Error in estimating β1 (the slope)
 Therefore, the confidence intervals around
forecasts will be larger as we move away
from the mean of X
Interval Estimates for Different Values of X

[Figure: fitted line Ŷi = b0 + b1Xi with the confidence interval for the mean of Y and the wider interval for an individual Yi, both widening with distance from X̄]
Example: Produce Stores
Data for 7 stores:

Store  Square Feet  Annual Sales ($000)
1      1,726        3,681
2      1,542        3,395
3      2,816        6,653
4      5,555        9,543
5      1,292        3,318
6      2,208        5,563
7      1,313        3,760

Predict the annual sales for a store with 2,000 square feet.

Regression model obtained: Ŷi = 1636.415 + 1.487 Xi
Estimation of Predicted Values: Example
Find the 95% confidence interval for annual sales of one particular store of 2,000 square feet.

Predicted sales: Ŷi = 1636.415 + 1.487(2000) = 4610.45 ($000)
X̄ = 2350.29;  S_YX = 611.75;  t_{n-2} = t_5 = 2.5706

$$\hat{Y}_i \pm t_{n-2}\,S_{YX}\sqrt{1+\frac{1}{n}+\frac{(X_i-\bar{X})^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}} = 4610.45 \pm 1853.45$$

(Confidence interval for an individual Y.)
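A sketch that recomputes this interval from the raw store data (illustrative Python; the critical value t5 = 2.5706 is taken from the slide, and the figures may differ somewhat from the slide's, which uses rounded intermediate values):

```python
import math

sqft  = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]   # $000
n = len(sqft)
x_bar, y_bar = sum(sqft) / n, sum(sales) / n

# Least-squares fit (matches the slide's Y = 1636.415 + 1.487 X after rounding)
b = (sum(x * y for x, y in zip(sqft, sales)) - n * x_bar * y_bar) / \
    (sum(x ** 2 for x in sqft) - n * x_bar ** 2)
a = y_bar - b * x_bar

# Standard error of the estimate
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(sqft, sales))
s_yx = math.sqrt(sse / (n - 2))

# 95% interval for an individual store with 2000 sq. ft.
x_new = 2000
ssx = sum((x - x_bar) ** 2 for x in sqft)
t_crit = 2.5706                                       # t(0.025, df=5), from the slide
y_hat = a + b * x_new
margin = t_crit * s_yx * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / ssx)
print(round(y_hat, 1), "+/-", round(margin, 1))
```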