Bivariate Regression Model
Cross-Section Data
Ref: Damodar Gujarati, Econometrics by Example
Having survived recent economic slowdowns that have diminished their competitors, Sunflowers Apparel, a chain of
upscale fashion stores for women, is in the midst of a companywide review that includes researching the factors that
make its stores successful. Until recently, Sunflowers managers had no data analyses to support location decisions,
relying instead on subjective factors, such as the availability of an inexpensive lease or the perception that a particular
location seemed ideal for one of their stores.
As the new director of planning, you have already consulted with marketing data firms that specialize in using business
analytics to identify and classify groups of consumers. Based on such preliminary analyses, you have tentatively
discovered that the profile of Sunflowers shoppers may include not only the upper middle class, long suspected of being
the chain's clientele, but also younger, aspirational families with young children and, surprisingly, mostly single,
trend-setting urban hipsters.
You seek to develop a systematic approach that will lead to better decisions during the site-selection process.
As a starting point, you have asked one marketing data firm to collect and organize data on the number of people in
the identified categories who live within a fixed radius of each Sunflowers store. You believe that a greater number of
profiled customers contributes to store sales, and you want to explore the possible use of this relationship in the
decision-making process. How can you use statistics to forecast the annual sales of a proposed store
based on the number of profiled customers residing within the fixed radius of a Sunflowers store?
In regression, we try to understand the variation in the Y variable:
• Dependent variable (Y): Annual Sales.
• Independent variable (X): the variable influencing Y, here Profiled Customers.

What is the nature of the influence? How do you describe the way X influences Y?
Description: the type of relation is the regression equation. How do we describe it in equation form?
Answer: LINEAR REGRESSION, the familiar Y = mx + c, written here as $Y = b_0 + b_1 X + e$.

Store   Profiled Customers (millions)   Annual Sales ($millions)
  1     3.7                             5.7
  2     3.6                             5.9
  3     2.8                             6.7
  4     5.6                             9.5
  5     3.3                             5.4
  6     2.2                             3.5
  7     3.3                             6.2
  8     3.1                             4.7
  9     3.2                             6.1
 10     3.5                             4.9
 11     5.2                             10.7
 12     4.6                             7.6
 13     5.8                             11.8
 14     3.0                             4.1
[Figure: Scatter Plot of Profiled Customers and Annual Sales; x-axis: Profiled Customers (millions), y-axis: Annual Sales ($millions), one point per store.]
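The fitted line reported in the output below can be reproduced in a few lines of code. This is a minimal sketch, assuming a Python environment with numpy and statsmodels installed; the variable names are illustrative:

```python
# Fit the regression of Annual Sales on Profiled Customers for the 14 stores.
import numpy as np
import statsmodels.api as sm

profiled = np.array([3.7, 3.6, 2.8, 5.6, 3.3, 2.2, 3.3,
                     3.1, 3.2, 3.5, 5.2, 4.6, 5.8, 3.0])  # X, millions
sales = np.array([5.7, 5.9, 6.7, 9.5, 5.4, 3.5, 6.2,
                  4.7, 6.1, 4.9, 10.7, 7.6, 11.8, 4.1])   # Y, $millions

X = sm.add_constant(profiled)        # adds the intercept (b0) column
model = sm.OLS(sales, X).fit()       # ordinary least squares fit
print(model.params)                  # ~ [-1.20884, 2.074173]
print(model.rsquared)                # ~ 0.847869
print(model.summary())               # full table, mirroring the output below
```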
Bivariate Regression with Cross Section Data
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.92079785
R Square            0.84786868
Adjusted R Square   0.83519107
Standard Error      0.99929836
Observations        14

Regression Equation:
Annual Sales = -1.20884 + 2.074173 × (Profiled Customers)

ANOVA
            df    SS            MS        F          Significance F
Regression   1    66.78540482   66.7854   66.87922   2.99943E-06
Residual    12    11.98316661   0.9986
Total       13    78.76857143
[Figure: Scatter plot of the 14 stores with the fitted regression line; x-axis: Profiled Customers (millions), y-axis: Annual Sales ($millions).]
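As an illustration of using the fitted equation (the 4-million figure is hypothetical, not part of the case data): for a proposed site with 4 million profiled customers within the fixed radius,

$\hat{Y} = -1.20884 + 2.074173 \times 4 \approx 7.09$

that is, forecast annual sales of about $7.09 million.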
Introduction to Regression Analysis
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one
independent variable
• Explain the impact of changes in an independent variable on the dependent
variable
• Dependent variable: the variable we wish to predict or explain
• Independent variable: the variable used to predict or explain the dependent variable
Bivariate Linear Regression Model
The population regression model:

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

where $Y_i$ is the dependent variable, $\beta_0$ the population Y intercept, $\beta_1$ the population slope coefficient, $X_i$ the independent variable, and $\varepsilon_i$ the random term. $\beta_0 + \beta_1 X_i$ is the deterministic component; $\varepsilon_i$ is the random component.
Simple Linear Regression Model:
Coefficient estimates
The simple linear regression equation provides an
estimate of the population regression line
$\hat{Y}_i = b_0 + b_1 X_i$   (Prediction Line)

where $\hat{Y}_i$ is the estimated (or predicted) Y value for observation i, $b_0$ the estimate of the regression intercept, $b_1$ the estimate of the regression slope, and $X_i$ the value of X for observation i.
Simple Linear Regression Model:
Coefficient estimates

[Figure: For a given $X_i$, the observed value $Y_i$ and the predicted value $\hat{Y}_i$ on the line $\hat{Y}_i = b_0 + b_1 X_i$ differ by the error committed in predicting $Y_i$ for this $X_i$: $e_i = Y_i - \hat{Y}_i$; slope = $b_1$, intercept = $b_0$.]
Least Squares Method
• The vertical distance from each point (observed Y) to the line (predicted Y) is the error of the prediction.
• For n observations we get n errors in prediction; we obtain the sum of squared errors (total error) for this line.
• The same process is repeated for all possible lines, generated from all possible values of the coefficient estimates.
• The least squares regression line is the line that results in the smallest sum of squared errors (see the sketch below).
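For the store data, the minimizing line has a closed form. This is a minimal sketch assuming Python with numpy; it reproduces the coefficients and the minimized SSE:

```python
# Closed-form least-squares solution: b1 = Sxy/Sxx and b0 = ybar - b1*xbar
# minimize the sum of squared errors.
import numpy as np

x = np.array([3.7, 3.6, 2.8, 5.6, 3.3, 2.2, 3.3,
              3.1, 3.2, 3.5, 5.2, 4.6, 5.8, 3.0])
y = np.array([5.7, 5.9, 6.7, 9.5, 5.4, 3.5, 6.2,
              4.7, 6.1, 4.9, 10.7, 7.6, 11.8, 4.1])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)  # smallest attainable sum of squared errors
print(b0, b1, sse)                      # ~ -1.20884, 2.074173, 11.98317
```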
Simple Linear Regression Model:
Deriving the Coefficient estimates
The Least Squares Method
b0 and b1 are obtained by finding the values that minimize the sum of the squared differences between $Y_i$ and $\hat{Y}_i$:

$\min_{b_0, b_1} \sum_{i=1}^{n} \left(Y_i - (b_0 + b_1 X_i)\right)^2$
Sampling distribution of $b_1$:

Mean of $b_1$: $E(b_1) = \beta_1$

Standard deviation of $b_1$: $se(b_1) = \sqrt{V(b_1)} = \dfrac{\sigma}{\sqrt{\sum (X_i - \bar{X})^2}}$
Probability distribution of the random term

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$

$E(\varepsilon_i) = 0, \qquad Var(\varepsilon_i) = \sigma^2$
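To make this assumption concrete, here is a small illustrative simulation of the data-generating process in Python; the parameter values are this case's estimates, used purely as an example:

```python
# Simulate Y_i = beta0 + beta1*X_i + eps_i with eps_i ~ N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = -1.20884, 2.074173, 0.9993
x_sim = rng.uniform(2.0, 6.0, size=14)   # hypothetical profiled-customer counts
eps = rng.normal(0.0, sigma, size=14)    # random term: mean 0, sd sigma
y_sim = beta0 + beta1 * x_sim + eps      # deterministic + random component
```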
Sum of Squares

The SS column of the ANOVA table decomposes the total variation in Y:

ANOVA
            df    SS            MS        F          Significance F
Regression   1    66.78540482   66.7854   66.87922   2.99943E-06
Residual    12    11.98316661   0.9986
Total       13    78.76857143

(Regression SS = SSR, Residual SS = SSE, Total SS = SST.)
Measures of Variation
• Total variation is made up of two parts: SST = SSR + SSE.

$R^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}} \quad \text{OR} \quad R^2 = 1 - \frac{SSE}{SST}$

note: $0 \le R^2 \le 1$
Bivariate Regression with Cross Section Data

$R^2 = \frac{SSR}{SST} = \frac{66.7854}{78.76857} = 0.847869$

84.79% of the variation in Annual Sales is explained by variation in Profiled Customers (R Square = 0.84786868 in the regression output).
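The decomposition can be verified numerically, continuing with the x, y, b0, b1 defined in the least-squares sketch earlier:

```python
# Numerical check of the variance decomposition.
y_hat = b0 + b1 * x                   # predicted values
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares      ~ 78.76857
sse = np.sum((y - y_hat) ** 2)        # error sum of squares      ~ 11.98317
ssr = sst - sse                       # regression sum of squares ~ 66.78540
r2 = ssr / sst                        # ~ 0.847869
```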
Examples of Approximate r² Values

• $r^2 = 1$: perfect linear relationship between X and Y; all of the variation in Y is explained by variation in X.
• $0 < r^2 < 1$: weaker linear relationship between X and Y; some but not all of the variation in Y is explained by variation in X.
• $r^2 = 0$: no linear relationship between X and Y; the value of Y does not depend on X (none of the variation in Y is explained by variation in X).
Recall the probability distribution of the random term: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$, so $E(\varepsilon_i) = 0$ and $Var(\varepsilon_i) = \sigma^2$.
Standard Error of Regression
• Error in estimation occurs because of the inherent randomness in the data-generating process of Y.
• The standard deviation of the error in the regression is called the standard error of the regression.
• The standard error of the regression serves as an estimator of the standard deviation of $\varepsilon_i$ ($\sigma$).

$S = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}$

where SSE = error sum of squares and n = sample size.

[Figure: two scatter plots with the same fitted line; one with points tightly clustered around the line (small S), one with points widely scattered (large S).]
Standard Error of Regression

From the ANOVA table, SSE = 11.98317 and n = 14:

$S = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{11.98317}{14-2}} = 0.999298$

This is the "Standard Error" (0.99929836) reported in the Regression Statistics. S is the standard error of the regression, the estimator of $\sigma$.
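Continuing the same Python session, S follows directly from the residuals:

```python
# Standard error of the regression.
n = len(y)
s = (sse / (n - 2)) ** 0.5            # sqrt(SSE/(n-2)) ~ 0.999298
```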
Standard Error of Coefficients

$S_{b_1} = \frac{S}{\sqrt{SSX}} = \frac{\sqrt{SSE/(n-2)}}{\sqrt{\sum (X_i - \bar{X})^2}}$

With $S = \sqrt{11.98/12} = 0.9993$ and $\sqrt{\sum (X_i - \bar{X})^2} = 3.94$:

$S_{b_1} = \frac{0.9993}{3.94} = 0.2536$
Inference about the Slope: t Test

$t_{STAT} = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{2.074173 - 0}{0.253629} = 8.1779$

where:
b1 = regression slope coefficient
β1 = hypothesized slope (here H0: β1 = 0)
Sb1 = standard error of the slope

With d.f. = n − 2 = 12 and α = .05, the two-tailed critical values are ±2.179 (α/2 = .025 in each tail). Since tSTAT = 8.1779 > 2.179, reject H0: there is sufficient evidence that the number of profiled customers significantly affects annual sales.

[Figure: two-tailed rejection regions with critical values −2.179 and 2.179; tSTAT = 8.1779 lies far inside the upper rejection region.]
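The test statistic, critical value, and p-value can be computed directly; this sketch continues the same session and assumes scipy is available:

```python
# t test for the slope.
from scipy import stats

s_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))  # ~ 0.2536
t_stat = (b1 - 0) / s_b1                         # ~ 8.1779
t_crit = stats.t.ppf(0.975, df=n - 2)            # ~ 2.179 (alpha = .05, df = 12)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # ~ 3.0e-06
```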
Inference about the R Square: F Test

ANOVA
            df    SS            MS        F          Significance F
Regression   1    66.78540482   66.7854   66.87922   2.99943E-06
Residual    12    11.98316661   0.9986
Total       13    78.76857143

Mean Square Regression (MSR): if different samples of Y and X (in our case, annual sales and profiled customers) are taken and the regression is run, we will obtain a different SSR every time. MSR is interpreted as the average of these SSRs: the explanatory power of the model on average, if you were to run the model many times with different samples.

$F_{STAT} = \frac{MSR}{MSE} = \frac{66.7854 / 1}{11.98317 / (14 - 1 - 1)} = 66.87922$

Since Significance F = 2.99943E-06 < .05, the regression is significant.
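Continuing the session, the F statistic and its p-value (Significance F):

```python
# F test. In simple regression, F equals t squared.
msr = ssr / 1                          # k = 1 regressor
mse = sse / (n - 1 - 1)                # ~ 0.9986
f_stat = msr / mse                     # ~ 66.879 (= t_stat ** 2)
sig_f = stats.f.sf(f_stat, 1, n - 2)   # Significance F ~ 3.0e-06
```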
To estimate the population $\rho^2$ with $R^2$, what we are essentially doing is replacing $\sigma^2$ with the variation in the error, $V(e) = SSE/n$, and $\sigma_Y^2$ with the variation of Y, $V(Y)$, based on the 14 values of Y, which is $SST/n$. That is how $R^2$ becomes:

$R^2 = 1 - \frac{SSE/n}{SST/n} = 1 - \frac{SSE}{SST}$

However, it can be statistically proven that $SSE/n$ and $SST/n$ are not unbiased estimators of $\sigma^2$ and $\sigma_Y^2$; $SSE/(n-(k+1))$ and $SST/(n-1)$ are.

How to adjust for the loss of degrees of freedom?

$\bar{R}^2 = 1 - \frac{SSE/(n-(k+1))}{SST/(n-1)} = 1 - \frac{(n-1)}{(n-(k+1))}\cdot\frac{SSE}{SST} = 1 - \frac{(n-1)}{(n-(k+1))}\left(1 - R^2\right)$
Adjusted R Square

ANOVA
            df    SS            MS        F          Significance F
Regression   1    66.78540482   66.7854   66.87922   2.99943E-06
Residual    12    11.98316661   0.9986
Total       13    78.76857143
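Plugging the values from this output into the adjusted R² formula, with n = 14 and k = 1:

$\bar{R}^2 = 1 - \frac{(n-1)}{(n-(k+1))}\left(1 - R^2\right) = 1 - \frac{13}{12}(1 - 0.847869) = 0.835191$

which matches the Adjusted R Square (0.83519107) reported in the Regression Statistics.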