Bivariate Regression Model

Regression analysis with cross section data
Ref: Damodar Gujarati, Econometrics by Example
Having survived recent economic slowdowns that have diminished their competitors, Sunflowers Apparel, a chain of upscale fashion stores for women, is in the midst of a companywide review that includes researching the factors that make its stores successful. Until recently, Sunflowers managers had no data analyses to support location decisions, relying instead on subjective factors, such as the availability of an inexpensive lease or the perception that a particular location seemed ideal for one of their stores.
As the new director of planning, you have already consulted with marketing data firms that specialize in using business analytics to identify and classify groups of consumers. Based on such preliminary analyses, you have tentatively discovered that the profile of Sunflowers shoppers may include not only the upper middle class, long suspected of being the chain's clientele, but also younger, aspirational families with young children and, surprisingly, urban hipsters who set trends and are mostly single.
You seek to develop a systematic approach that will lead to better decisions during the site-selection process. As a starting point, you have asked one marketing data firm to collect and organize data on the number of people in the identified categories who live within a fixed radius of each Sunflowers store. You believe that a greater number of profiled customers contributes to store sales, and you want to explore the possible use of this relationship in the decision-making process. How can you use statistics to forecast the annual sales of a proposed store based on the number of profiled customers residing within the fixed radius of a Sunflowers store?
In regression, we try to understand the variation in the Y variable.
Dependent variable: Y, here Annual Sales.
Independent variable: X, the one influencing your Y variable, here Profiled Customers.
What is the nature of the influence? How do you describe the way X is influencing Y?
Description: the type of relation is the regression equation. Can we describe it in equation form?
Answer: LINEAR REGRESSION: Y = mX + c, written as Y = b0 + b1X + e.

Store   Profiled Customers (millions)   Annual Sales ($millions)
1       3.7                             5.7
2       3.6                             5.9
3       2.8                             6.7
4       5.6                             9.5
5       3.3                             5.4
6       2.2                             3.5
7       3.3                             6.2
8       3.1                             4.7
9       3.2                             6.1
10      3.5                             4.9
11      5.2                             10.7
12      4.6                             7.6
13      5.8                             11.8
14      3.0                             4.1
[Scatter Plot of Profiled Customers (millions, X-axis) vs. Annual Sales ($millions, Y-axis) for the 14 stores, showing a positive relationship]
Bivariate Regression with Cross Section Data

SUMMARY OUTPUT

Regression Equation: Annual Sales = -1.20884 + 2.074173 (Profiled Customers)

Regression Statistics
Multiple R          0.92079785
R Square            0.84786868
Adjusted R Square   0.83519107
Standard Error      0.99929836
Observations        14

ANOVA
            df   SS            MS        F          Significance F
Regression   1   66.78540482   66.7854   66.87922   2.99943E-06
Residual    12   11.98316661    0.9986
Total       13   78.76857143

                                Coefficients   Standard Error   t Stat    P-value    Lower 95%      Upper 95%
Intercept                       -1.2088391     0.994874424      -1.2151   0.247707   -3.376484251   0.958806066
Profiled Customers (millions)    2.07417292    0.253629259       8.17797  3E-06       1.521562232   2.626783601
Using the linear regression model, we obtain the line that best describes the relation between the two variables: "The Best Fitted Line"

Annual Sales = -1.20884 + 2.074173 (Profiled Customers)

The technique of obtaining the best fitted line is called the "Ordinary Least Squares" (OLS) regression method.

[Scatter Plot of Profiled Customers vs. Annual Sales with the fitted regression line drawn through the points]
[Profiled Customers (millions) Line Fit Plot: observed Annual Sales ($millions) and Predicted Annual Sales ($millions) plotted against Profiled Customers, with the prediction line Annual Sales = -1.20884 + 2.074173 (Profiled Customers)]
Introduction to Regression Analysis
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one
independent variable
• Explain the impact of changes in an independent variable on the dependent
variable
-Dependent variable: the variable we wish to predict or explain
-Independent variable: the variable used to predict or explain the dependent variable
Bivariate Linear Regression Model

• Only one independent variable, X


• Relationship between X and Y is described
by a linear function
• Changes in Y are assumed to be related to
changes in X
Bivariate Regression with Cross Section Data
- Interpret the regression output
- Be able to calculate the statistics
Modeling the behavior of the dependent variable

Yi = β0 + β1 Xi + εi

where Yi is the dependent variable, β0 the population Y intercept, β1 the population slope coefficient, Xi the independent variable, and εi the random term. β0 + β1 Xi is the deterministic component; εi is the random component.
Simple Linear Regression Model: Coefficient estimates
The simple linear regression equation provides an estimate of the population regression line:

Ŷi = b0 + b1 Xi   (Prediction Line)

where Ŷi is the estimated (or predicted) Y value for observation i, b0 the estimate of the regression intercept, b1 the estimate of the regression slope, and Xi the value of X for observation i.
Simple Linear Regression Model: Coefficient estimates

[Diagram: the prediction line Ŷi = b0 + b1 Xi, with intercept b0 and slope b1. For a given Xi, the observed value Yi lies off the line; the vertical gap between the observed value Yi and the predicted value Ŷi is the error committed in predicting Yi for this Xi value: ei = Yi − Ŷi.]
Least Squares Method
• The vertical distance from each point (observed Y) to the line (predicted Y) is the error of the prediction.
• For n observations we get n prediction errors. We obtain the sum of squared errors (total error) for this line.
• The same process is repeated for all possible lines, generated from all possible values of the coefficient estimates.
• The least squares regression line is the regression line that results in the smallest sum of squared errors.
Simple Linear Regression Model: Deriving the Coefficient estimates
The Least Squares Method: b0 and b1 are obtained by finding the values that minimize the sum of the squared differences between Y and Ŷ:

min Σ (Yi − Ŷi)² = min Σ (Yi − (b0 + b1 Xi))²
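This minimization has a closed-form solution: b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and b0 = Ȳ − b1 X̄. A minimal sketch (my own, not from the slides) that applies it to the 14-store data from the table above:

```python
# Sketch: OLS coefficient estimates from the closed-form least-squares solution.
X = [3.7, 3.6, 2.8, 5.6, 3.3, 2.2, 3.3, 3.1, 3.2, 3.5, 5.2, 4.6, 5.8, 3.0]
Y = [5.7, 5.9, 6.7, 9.5, 5.4, 3.5, 6.2, 4.7, 6.1, 4.9, 10.7, 7.6, 11.8, 4.1]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# slope: covariation of X and Y over variation of X
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar  # the fitted line passes through (x_bar, y_bar)

print(round(b0, 5), round(b1, 5))  # -1.20884 2.07417
```

The values match the Intercept and Profiled Customers coefficients in the SUMMARY OUTPUT.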
Simple Linear Regression Model: Interpretation of the Coefficient estimates
• b0 is the estimated average value of Y when the value of X is zero.
• b1 is the estimated change in the average value of Y as a result of a one-unit increase in X. In our example, each additional one million profiled customers increases annual sales by about $2.07 million, on average (both variables are measured in millions).
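The fitted line can then answer the forecasting question posed at the start. A minimal sketch (my own, not from the slides), plugging the estimated coefficients into the prediction line for a hypothetical site:

```python
# Sketch: point forecast from the prediction line Y-hat = b0 + b1 * X.
# Coefficients are taken from the regression output above.
b0, b1 = -1.2088391, 2.07417292

def predicted_sales(profiled_customers_millions):
    """Forecast annual sales ($millions) for a proposed store location."""
    return b0 + b1 * profiled_customers_millions

# A candidate site with 4 million profiled customers within the fixed radius:
print(round(predicted_sales(4.0), 2))  # 7.09
```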
Probability Distribution of OLS Estimators
• Ordinary Least Squares (OLS) gives estimators (b0 and b1) of the population parameters (β0 and β1). The values of b0 and b1 obtained from a particular sample (-1.21 and 2.07 in our example) are called estimates of β0 and β1.
• OLS estimators are random variables:
  • b0 and b1 are estimators of β0 and β1.
  • For different samples, the values of b0 and b1 will change.
  • Since our samples are random, b0 and b1 are also random variables.
  • So b0 and b1 will have probability distributions.
Probability Distribution of OLS Estimators

Probability distribution of b0: b0 ~ N(β0, se(b0))
Probability distribution of b1: b1 ~ N(β1, se(b1))

Mean of b1: E(b1) = β1
Standard deviation of b1: se(b1) = √V(b1) = σ / √(Σ(Xi − X̄)²)
Probability distribution of the random term

Yi = β0 + β1 Xi + εi,  with  εi ~ N(0, σ²)

E(εi) = 0
Var(εi) = σ²
Sum of Squares: Measures of Variation

[Diagram: for a given Xi, the total deviation of Yi from Ȳ splits into the explained part, Ŷi − Ȳ, and the unexplained part, Yi − Ŷi: SST = Σ(Yi − Ȳ)², SSR = Σ(Ŷi − Ȳ)², SSE = Σ(Yi − Ŷi)².]
Sum of Squares: Measures of Variation
• Total variation is made up of two parts:

SST = SSR + SSE

Total Sum of Squares:       SST = Σ(Yi − Ȳ)²
Regression Sum of Squares:  SSR = Σ(Ŷi − Ȳ)²
Error Sum of Squares:       SSE = Σ(Yi − Ŷi)²

where Ȳ = mean value of the dependent variable, Yi = observed value of the dependent variable, and Ŷi = predicted value of Y for the given Xi value.
Measuring R square: Measures of Variation
• SST = total sum of squares (Total Variation): measures the variation of the Yi values around their mean Ȳ.
• SSR = regression sum of squares (Explained Variation): variation attributable to the relationship between X and Y.
• SSE = error sum of squares (Unexplained Variation): variation in Y attributable to factors other than X.
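A minimal sketch (my own, not from the slides) that computes the decomposition for the Sunflowers data, reusing the estimated coefficients from the output:

```python
# Sketch: decomposing the total variation of Y into SSR + SSE, then R-squared.
X = [3.7, 3.6, 2.8, 5.6, 3.3, 2.2, 3.3, 3.1, 3.2, 3.5, 5.2, 4.6, 5.8, 3.0]
Y = [5.7, 5.9, 6.7, 9.5, 5.4, 3.5, 6.2, 4.7, 6.1, 4.9, 10.7, 7.6, 11.8, 4.1]
b0, b1 = -1.2088391, 2.07417292          # from the regression output

y_bar = sum(Y) / len(Y)
Y_hat = [b0 + b1 * x for x in X]          # predicted values

SST = sum((y - y_bar) ** 2 for y in Y)               # total variation
SSR = sum((yh - y_bar) ** 2 for yh in Y_hat)         # explained variation
SSE = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))  # unexplained variation

r_squared = SSR / SST                     # equivalently 1 - SSE/SST
print(round(SST, 4), round(SSR, 4), round(SSE, 4), round(r_squared, 4))
# 78.7686 66.7854 11.9832 0.8479
```

The three sums match the SS column of the ANOVA table, and r_squared matches the R Square entry.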
Measuring R square
• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.
• The coefficient of determination is also called R-squared and is denoted as R².

R² = SSR / SST = regression sum of squares / total sum of squares

or equivalently

R² = (SST − SSE) / SST = 1 − SSE / SST

Note: 0 ≤ R² ≤ 1.
Bivariate Regression with Cross Section Data

R² = SSR / SST = 66.7854 / 78.76857 = 0.847869

About 84.79% of the variation in Annual Sales is explained by variation in Profiled Customers.
Examples of Approximate r² Values
• r² = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X.
• 0 < r² < 1: weaker linear relationship between X and Y; some but not all of the variation in Y is explained by variation in X.
• r² = 0: no linear relationship between X and Y; the value of Y does not depend on X (none of the variation in Y is explained by variation in X).
Standard Error of Regression
• Error in estimation occurs because of the inherent randomness in the data generating process of Y.
• The standard deviation of the error in the regression is called the standard error of the regression.
• The standard error of the regression serves as an estimator of the standard deviation of εi (σ).
Standard Error of Regression
S measures the overall error that the regression has committed in predicting the actual observations, i.e. the variation of the observed Y values from the regression line. So basically we measure the deviation of the predicted Y values from the actual Y values:

S = √( Σ_{i=1}^{n} (Yi − Ŷi)² / (n − 2) ) = √( SSE / (n − 2) )

where SSE = error sum of squares and n = sample size.

[Diagrams: a small S means the observed points lie close to the fitted line; a large S means they are widely scattered around it.]
Standard Error of Regression

S = √( 11.98317 / (14 − 2) ) = 0.999298
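A minimal sketch (my own, not from the slides) reproducing this calculation from the ANOVA table's SSE:

```python
import math

# Sketch: standard error of the regression, S = sqrt(SSE / (n - 2)).
SSE = 11.98316661   # residual sum of squares from the ANOVA table
n = 14

# two degrees of freedom are lost estimating b0 and b1
S = math.sqrt(SSE / (n - 2))
print(round(S, 6))  # 0.999298
```

This matches the "Standard Error" entry of the Regression Statistics block.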
Standard Error of Coefficients
The standard error of the regression slope coefficient (b1) is the standard deviation of b1, estimated as:

se(b1) = √V(b1) = σ / √(Σ(Xi − X̄)²)

Since σ is unknown, we replace σ by S:

S_b1 = S / √SSX = S / √(Σ(Xi − X̄)²)

where S_b1 = estimator of the standard error of the slope, and S = √(SSE / (n − 2)) = standard error of the regression (estimator of σ).
Standard Error of Coefficients

S_b1 = S / √(Σ(Xi − X̄)²) = √(11.98 / 12) / 3.94 = 0.2536
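A minimal sketch (my own, not from the slides) computing √SSX from the X data and then S_b1:

```python
import math

# Sketch: standard error of the slope, S_b1 = S / sqrt(sum((X_i - X_bar)^2)).
X = [3.7, 3.6, 2.8, 5.6, 3.3, 2.2, 3.3, 3.1, 3.2, 3.5, 5.2, 4.6, 5.8, 3.0]
S = 0.99929836                           # standard error of the regression

x_bar = sum(X) / len(X)
ssx = sum((x - x_bar) ** 2 for x in X)   # sum of squared deviations of X

S_b1 = S / math.sqrt(ssx)
print(round(math.sqrt(ssx), 2), round(S_b1, 6))  # 3.94 0.253629
```

This matches the Standard Error reported for the Profiled Customers coefficient.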
Inferences about the regression coefficients: t Test
• The sum of squared errors (SSE) is required to derive S (the standard error of regression), which is the estimator of the standard deviation of the randomness that causes the error.
• S is used to derive the standard deviation of the error committed in estimating the b's, i.e. the standard error of b.
• The standard error of b (S_b1) is used to get the t-statistic required for hypothesis testing.
Inferences about the regression coefficients: t Test
Your objective is to see whether there is any relation between Annual Sales and Profiled Customers. You first proposed a model to represent the relation between the two in the population, then took a sample and estimated the relation using your proposed model (OLS regression). Your estimate turns out to be 2.07.
Is the relation the sample portrays also true for the population? Does such a relation exist between sales and customers for the entire population?

Null and alternative hypotheses:
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)

We test whether the magnitude of the relationship we have obtained between annual sales and profiled customers (the value b1 = 2.07) is statistically significant. That is, did the value we obtained happen by chance (because we got this specific sample), or will b1 be around 2.07 every time we take a sample?

b1 ~ N(β1, S_b1)

Z or t = (b1 − β1) / se(b1)

Z = (b1 − β1) / ( σ / √(Σ(Xi − X̄)²) )   (if σ were known)

t = (b1 − β1) / ( S / √(Σ(Xi − X̄)²) )   (σ replaced by its estimator S)
Inferences about the regression coefficients: t Test
Computing the t stat assuming that the null hypothesis is true:

t_STAT = (b1 − β1) / S_b1 = (2.074173 − 0) / 0.253629 = 8.1779

where b1 = regression slope coefficient, β1 = hypothesized slope, and S_b1 = standard error of the slope.
Inferences about the regression coefficients: t Test
H0: β1 = 0; H1: β1 ≠ 0
Test statistic: t_STAT = 8.1779, d.f. = 14 − 2 = 12
Critical values at the 5% significance level (α/2 = .025 in each tail): ±2.179
Decision: Reject H0, since 8.1779 > 2.179.
Meaning: the value of b1 is "far away" from 0; a sample slope this large would be very unlikely if β1 were really 0. There is sufficient evidence that the number of profiled customers significantly affects annual sales.

[Diagram: t distribution with rejection regions beyond ±2.179; t_STAT = 8.1779 falls deep in the right rejection region.]
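A minimal sketch (my own, not from the slides) of the test decision, using the slide's critical value of 2.179 for 12 degrees of freedom:

```python
# Sketch: two-tailed t test for H0: beta1 = 0, using values from the output.
b1 = 2.07417292
S_b1 = 0.253629259
beta1_H0 = 0.0

t_stat = (b1 - beta1_H0) / S_b1
t_crit = 2.179        # two-tailed critical value, alpha = 0.05, d.f. = 12

reject_H0 = abs(t_stat) > t_crit
print(round(t_stat, 3), reject_H0)  # 8.178 True
```

Since |t_STAT| far exceeds the critical value, H0 is rejected, consistent with the tiny P-value (3E-06) in the output.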
Inference about the R Square: F Test
In our regression, we have obtained an R square of 0.848. How likely is it to get an R square around 0.848 if we run the regression for different samples?

H0: R² = 0
H1: R² ≠ 0

F_STAT = MSR / MSE, where MSR = SSR / k and MSE = SSE / (n − k − 1)

Mean Square Regression (MSR): if different samples of Y and X (in our case annual sales and profiled customers) are taken and the regression is run, we will obtain a different SSR every time. MSR is interpreted as the average of the SSRs: the explanatory power of the model, on average, if you run the model many times with different samples.

Mean Square Error (MSE): likewise, we will obtain a different SSE every time. MSE is interpreted as the average of the SSEs: the error committed by the model, on average, if you run the model many times with different samples.
Inference about the R Square: F Test

F_STAT = MSR / MSE = (66.7854 / 1) / (11.98317 / (14 − 1 − 1)) = 66.87922
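A minimal sketch (my own, not from the slides) reproducing the F statistic from the ANOVA table, with k = 1 regressor:

```python
# Sketch: F statistic from the ANOVA sums of squares.
SSR, SSE = 66.78540482, 11.98316661
n, k = 14, 1

MSR = SSR / k               # mean square regression (1 d.f.)
MSE = SSE / (n - k - 1)     # mean square error (12 d.f.)
F_stat = MSR / MSE

print(round(F_stat, 4))  # 66.8792
```

Note that with a single regressor, F_STAT equals t_STAT squared (8.1779² ≈ 66.88), so the two tests agree.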
Getting Adjusted R Squared from R Squared
Consider the entire population of stores. The variation of annual sales (Y) across stores in the population is partly correlated with the variation in the number of profiled customers (X) across stores. The remaining variation in Y is random (ε). The (squared) correlation between Y and X in the population is defined as:

ρ² = 1 − σ_ε² / σ_Y²

where σ_ε² is the variance of ε (V(ε)) and σ_Y² is the variance of Y (V(Y)).

As we have a sample of only 14 stores, we can only estimate ρ², and we do that with R². To estimate ρ² with R², what we are essentially doing is replacing σ_ε² with the variance of the sample errors (V(e)), which is SSE/n, and σ_Y² with the variance of Y based on the 14 values, which is SST/n. That is how R² becomes:

R² = 1 − (SSE/n) / (SST/n) = 1 − SSE / SST

However, it can be statistically proven that SSE/n and SST/n are not unbiased estimators of σ_ε² and σ_Y², while SSE/(n − (k + 1)) and SST/(n − 1) are. To adjust for the loss of degrees of freedom:

Adjusted R² = 1 − [ SSE / (n − (k + 1)) ] / [ SST / (n − 1) ]
            = 1 − [ (n − 1) / (n − (k + 1)) ] · (SSE / SST)
            = 1 − [ (n − 1) / (n − (k + 1)) ] · (1 − R²)
Adjusted R Square
• Every time we add explanatory variables (X variables) to the model, R² will increase (it can never decrease).
• Adding more and more explanatory variables means a loss of degrees of freedom.
• We can "penalize" R² every time we add a new explanatory variable to the model. The penalized R² is called the adjusted R².
Adjusted R Square

Adjusted R² = 1 − [ (n − 1) / (n − (k + 1)) ] · (1 − R²) = 1 − [ (14 − 1) / (14 − (1 + 1)) ] · (1 − 0.8478) = 0.8351
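A minimal sketch (my own, not from the slides) applying the degrees-of-freedom penalty to the R² from the output:

```python
# Sketch: adjusted R-squared via the degrees-of-freedom penalty.
r2 = 0.84786868     # R Square from the regression output
n, k = 14, 1        # 14 observations, 1 explanatory variable

adj_r2 = 1 - (n - 1) / (n - (k + 1)) * (1 - r2)
print(round(adj_r2, 5))  # 0.83519
```

This matches the Adjusted R Square entry of the Regression Statistics block.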
