
ECON2280 Introductory Econometrics

First Term, 2024-2025

Multiple Regression Analysis: Estimation

Fall, 2024

1 / 51
Motivation for Multiple Regression

2 / 51
Motivation for Multiple Regression

▶ The multiple linear regression (MLR) model is defined as

y = β0 + β1 x1 + · · · + βk xk + u,

which tries to explain variable y in terms of variables x1 , · · · , xk .


▶ The terminology for y , (x1 , · · · , xk ), u, and (β0 , β1 , · · · , βk ) is the
same as in the SLR model.
▶ Motivations:
– Incorporate more explanatory factors into the model;
– Explicitly hold fixed other factors that otherwise are in u;
– Allow for more flexible functional forms.

3 / 51
Example: Wage Equation

▶ Suppose
wage = β0 + β1 educ + β2 exper + u

– wage: hourly wage


– educ: years of education
– exper : years of labor market experience
– u: all other factors affecting wage

▶ Now, β1 measures the effect of education, explicitly holding experience
fixed.
▶ If exper is omitted, then E [u|educ] ̸= 0 because educ and exper
are correlated =⇒ β̂1 is biased.

4 / 51
Example: Family Income and Family Consumption
▶ Suppose
cons = β0 + β1 inc + β2 inc 2 + u,

– cons: family consumption


– inc: family income
– inc 2 : family income squared
– u: all other factors affecting cons

▶ Model has two explanatory variables: income and income squared.

▶ Consumption is explained as a quadratic function of income.

▶ To interpret the coefficients:

∂cons/∂inc = β1 + 2β2 inc,
which depends on how much income is already there.
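▶ A short numerical sketch (Python with numpy; the coefficient values are made up purely for illustration) of how the marginal effect β1 + 2β2 inc varies with the income level:

import numpy as np

# Hypothetical coefficient values, chosen only for illustration
beta1, beta2 = 0.80, -0.002   # beta2 < 0: diminishing marginal effect of income

def marginal_effect(inc):
    """Marginal effect of income on consumption: d cons / d inc = beta1 + 2*beta2*inc."""
    return beta1 + 2 * beta2 * inc

for inc in (10, 50, 100):     # income levels (e.g., in $1,000s)
    print(f"inc = {inc:3d}: d cons/d inc = {marginal_effect(inc):.3f}")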

5 / 51
Mechanics and Interpretation of
Ordinary Least Squares

6 / 51
Obtaining the OLS Estimates

▶ Suppose we have a random sample


{(xi1 , · · · , xik , yi ) : i = 1, · · · , n}, where i denotes the observation
number and j = 1, · · · , k indexes the independent variables.
▶ Given (β̂0 , β̂1 , · · · , β̂k ), the residual for observation i is:

ûi = yi − ŷi = yi − β̂0 − β̂1 xi1 − · · · − β̂k xik .

▶ We choose (β̂0 , β̂1 , · · · , β̂k ) to minimize the sum of squared residuals:

min_{β̂0 ,β̂1 ,··· ,β̂k} Σ_{i=1}^n ûi² = min_{β̂0 ,β̂1 ,··· ,β̂k} Σ_{i=1}^n (yi − β̂0 − β̂1 xi1 − · · · − β̂k xik )²

▶ The minimizers are the OLS estimates.

7 / 51
Obtaining the OLS Estimates

▶ The OLS estimates are the solution to the first-order conditions (FOCs):

Σ_{i=1}^n (yi − β̂0 − β̂1 xi1 − · · · − β̂k xik ) = 0,
Σ_{i=1}^n xi1 (yi − β̂0 − β̂1 xi1 − · · · − β̂k xik ) = 0,
...
Σ_{i=1}^n xik (yi − β̂0 − β̂1 xi1 − · · · − β̂k xik ) = 0,

which can be solved by any standard econometric software package.
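▶ A minimal sketch (Python with numpy, simulated data; illustrative only) of solving the FOCs, i.e., the k + 1 normal equations X'X β̂ = X'y:

import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # constant plus k regressors
beta = np.array([1.0, 0.5, -0.3])                            # true parameters (made up)
y = X @ beta + rng.normal(size=n)                            # simulated data

# OLS estimates: solve the k+1 normal equations X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The FOCs say the residuals are orthogonal to every column of X
u_hat = y - X @ beta_hat
print(beta_hat)
print(X.T @ u_hat)   # all entries are (numerically) zero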

8 / 51
Interpreting the OLS Regression Equation
▶ In the MLR model y = β0 + β1 x1 + · · · + βk xk + u:

βj = ∂y/∂xj ,

which means “by how much does the dependent variable change if
the j-th independent variable is increased by one unit, holding all
other independent variables and the error term constant”.
▶ The multiple linear regression model manages to hold the values of
other explanatory variables fixed even if, in reality, they are
correlated with the explanatory variable under consideration.
▶ dy/dxj |_{(x1 ,··· ,xj−1 ,xj+1 ,··· ,xk ) fixed} = βj + ∂u/∂xj .

▶ We still need to assume that u does not change with xj conditional on
(x1 , · · · , xj−1 , xj+1 , · · · , xk ). The zero conditional mean assumption
is E [u|x1 , · · · , xk ] = 0, which is more plausible than E [u|xj ] = 0.

9 / 51
Example: Determinants of College GPA

▶ The fitted regression is

FreshGPA = 1.29 + 0.5 hsGPA + 0.0003 SAT .

– FreshGPA: GPA in freshman year


– hsGPA: high school GPA
– SAT : SAT score

▶ Holding SAT fixed, an increase in high school GPA by 1 point is


associated with a 0.5 point higher freshman year GPA.
▶ Or: If we compare two students, A and B, with the same SAT , but
the hsGPA of A is one point higher, we predict A to have a
FreshGPA that is 0.5 higher than that of B.

10 / 51
A "Partialling Out" Interpretation of Multiple Regression
▶ One can show that the estimated coefficient of an explanatory
variable in a multiple regression can be obtained in two steps:
1. Regress the explanatory variable on all other explanatory
variables.
2. Regress y on the residuals from this regression.

▶ Mathematically, suppose we regress y on the constant 1, x1 and x2
(denoted as ŷ = β̂0 + β̂1 x1 + β̂2 x2 ), and want to get β̂1 .
1. Regress xi1 on xi2 : x̂i1 = δ̂0 + δ̂1 xi2 ; let r̂i1 be the residual from this regression.
2. Regress yi on r̂i1 : ŷi = α̂0 + α̂1 r̂i1

=⇒ α̂1 = Σ_{i=1}^n r̂i1 yi / Σ_{i=1}^n r̂i1² = β̂1

▶ From Step 1, xi1 = x̂i1 + r̂i1 with x̂i1 = δ̂0 + δ̂1 xi2 , Σ_{i=1}^n r̂i1 = 0,
Σ_{i=1}^n xi2 r̂i1 = 0, and Σ_{i=1}^n x̂i1 r̂i1 = 0.
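▶ A minimal numerical check of the partialling-out result (a Python/numpy sketch on simulated data; all names and parameter values are illustrative):

import numpy as np

rng = np.random.default_rng(1)
n = 500
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)          # x1 correlated with x2
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

# Full regression of y on (1, x1, x2): beta1_hat is the coefficient on x1
X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Step 1: regress x1 on (1, x2) and keep the residual r1
Z = np.column_stack([np.ones(n), x2])
r1 = x1 - Z @ np.linalg.solve(Z.T @ Z, Z.T @ x1)

# Step 2: the slope of y on r1 equals beta1_hat
alpha1 = (r1 @ y) / (r1 @ r1)
print(beta_hat[1], alpha1)   # identical up to rounding error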

11 / 51
A "Partialling Out" Interpretation of Multiple Regression
▶ The FOC w.r.t. β̂1 is Σ_{i=1}^n xi1 (yi − β̂0 − β̂1 xi1 − β̂2 xi2 ) = 0

=⇒ Σ_{i=1}^n (δ̂0 + δ̂1 xi2 + r̂i1 )(yi − β̂0 − β̂1 xi1 − β̂2 xi2 )
= δ̂0 Σ_{i=1}^n ûi + δ̂1 Σ_{i=1}^n xi2 ûi + Σ_{i=1}^n r̂i1 (yi − β̂0 − β̂1 xi1 − β̂2 xi2 )
= −β̂0 Σ_{i=1}^n r̂i1 − β̂2 Σ_{i=1}^n xi2 r̂i1 + Σ_{i=1}^n r̂i1 [yi − β̂1 (x̂i1 + r̂i1 )]
= Σ_{i=1}^n r̂i1 (yi − β̂1 r̂i1 ) = 0,

where Σ_{i=1}^n ûi = 0 and Σ_{i=1}^n xi2 ûi = 0 are obtained from the FOCs
w.r.t. β̂0 and β̂2 , respectively.

12 / 51
Why does This Procedure Work?

▶ This procedure is usually called the Frisch-Waugh theorem.

▶ The residuals from the first regression are the part of the explanatory
variable that is uncorrelated with the other explanatory variables.
▶ The slope coefficient of the second regression therefore represents
the isolated (or pure) effect of the explanatory variable on the
dependent variable.
▶ Recall that in the SLR,

β̂1 = Σ_{i=1}^n (xi − x̄ ) yi / Σ_{i=1}^n (xi − x̄ )².

In the MLR, we replace (xi − x̄ ) by r̂i1 . Actually, in the SLR,
(xi − x̄ ) is the residual from the regression of xi on all other
explanatory variables, which include only the constant 1.

13 / 51
Properties of OLS on Any Sample of Data

▶ Algebraic properties of OLS regression:

– Σ_{i=1}^n ûi = 0: deviations from the fitted regression "plane" sum
up to zero.
– ȳ = β̂0 + β̂1 x̄1 + · · · + β̂k x̄k : the sample averages of y and of the
regressors lie on the fitted regression plane.
– Σ_{i=1}^n xij ûi = 0, j = 1, ..., k: correlations between deviations
and regressors are zero.

▶ These properties are corollaries of the FOCs for the OLS estimates.

14 / 51
Goodness-of-Fit

▶ Decomposition of total variation:

SST = SSE + SSR

▶ R-squared:

R² = SSE/SST = 1 − SSR/SST.
▶ Alternative expression for R-squared [proof not required]:

R² = [Σ_{i=1}^n (yi − ȳ )(ŷi − ŷ̄ )]² / [Σ_{i=1}^n (yi − ȳ )² · Σ_{i=1}^n (ŷi − ŷ̄ )²]
   = Cov̂ (y , ŷ )² / [Var̂ (y ) Var̂ (ŷ )] = Corr̂ (y , ŷ )²,

i.e., R-squared is equal to the squared correlation coefficient between
the actual and the predicted value of the dependent variable.
▶ Because Corr̂ (y , ŷ ) ∈ [−1, 1], R² ∈ [0, 1].
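▶ A minimal sketch (Python/numpy on simulated data; illustrative only) checking that R² computed from SSE/SST equals the squared sample correlation between y and ŷ:

import numpy as np

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.7, -0.4]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y_hat - y.mean()) ** 2)    # explained sum of squares
ssr = np.sum((y - y_hat) ** 2)           # residual sum of squares

r2_a = sse / sst
r2_b = 1 - ssr / sst
r2_c = np.corrcoef(y, y_hat)[0, 1] ** 2  # squared sample correlation
print(r2_a, r2_b, r2_c)                  # all three agree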

15 / 51
R² Cannot Decrease When One More Regressor Is Added
▶ SSR with k and k + 1 regressors:

SSRk = min_{β̂0 ,β̂1 ,··· ,β̂k} Σ_{i=1}^n (yi − β̂0 − β̂1 xi1 − · · · − β̂k xik )²

SSRk+1 = min_{β̂0 ,β̂1 ,··· ,β̂k ,β̂k+1} Σ_{i=1}^n (yi − β̂0 − β̂1 xi1 − · · · − β̂k xik − β̂k+1 xi,k+1 )²

▶ Treat SSRk+1 as a function of β̂k+1 , i.e., for each value of β̂k+1 , we
minimize the objective function of SSRk+1 with respect to
(β̂0 , β̂1 , ..., β̂k ). Denote the resulting function as SSRk+1 (β̂k+1 ).
▶ Obviously, when β̂k+1 = 0, the two objective functions of SSRk and
SSRk+1 are the same, i.e., SSRk+1 (0) = SSRk .
▶ However, we search for the optimal β̂k+1 that minimizes
SSRk+1 (β̂k+1 ). If the minimizer β̂k+1 ̸= 0, then
SSRk+1 (β̂k+1 ) < SSRk+1 (0) = SSRk =⇒ R²_{k+1} > R²_k .
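▶ A small simulation sketch (Python/numpy; the added regressor is pure noise and all names are illustrative) showing that R² cannot fall when a regressor is added:

import numpy as np

def r_squared(y, X):
    """R-squared from regressing y on X (X should include a constant column)."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ b
    return 1 - u @ u / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
y = 1 + 0.5 * x1 + rng.normal(size=n)
noise = rng.normal(size=n)               # irrelevant regressor (true coefficient is zero)

X_k = np.column_stack([np.ones(n), x1])
X_k1 = np.column_stack([X_k, noise])
print(r_squared(y, X_k), r_squared(y, X_k1))   # the second is (weakly) larger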

16 / 51
Example: Explaining Arrest Records
▶ The fitted regression line is

narr86 = 0.712 − 0.150 pcnv − 0.034 ptime86 − 0.104 qemp86
n = 2,725, R² = 0.0413

– narr86: number of times arrested during 1986
– pcnv : proportion (not percentage) of prior arrests that led to
conviction
– ptime86: months spent in prison during 1986
– qemp86: the number of quarters employed in 1986

▶ pcnv : +0.5 =⇒ −0.075, i.e., 7.5 fewer arrests per 100 men.
▶ ptime86: +12 =⇒ −0.408 arrests.
▶ qemp86: +1 =⇒ −0.104, i.e., 10.4 fewer arrests per 100 men;
economic policies are effective.

17 / 51
Example: Explaining Arrest Records

▶ An additional explanatory variable, avgsen (average prior sentence length), is added:

narr86 = 0.707 − 0.151 pcnv + 0.007 avgsen − 0.037 ptime86 − 0.103 qemp86
n = 2,725, R² = 0.0422 (increases only slightly)

▶ A longer average prior sentence increases the number of arrests
(counter-intuitive, but β̂2 ≈ 0).
▶ Limited additional explanatory power, as R-squared increases only by
a little. (Why? β̂2 ≈ 0.)
▶ General remark on R-squared: even if R-squared is small (as in the
given example), the regression may still provide good estimates (i.e.,
small s.e.'s) of ceteris paribus effects.

18 / 51
The Expected Value of the OLS Estimators

19 / 51
Standard Assumptions for the MLR Model

▶ Assumption MLR.1 (Linear in Parameters):

y = β0 + β1 x1 + · · · + βk xk + u.

– In the population, the relationship between y and x is linear.


– The “linear” in linear regression means “linear in parameter”.

▶ Assumption MLR.2 (Random Sampling): The data


{(xi1 , · · · , xik , yi ) : i = 1, ..., n} is a random sample drawn from the
population, i.e., each data point follows the population equation,

yi = β0 + β1 xi1 + · · · + βk xik + ui .

20 / 51
Standard Assumptions for the MLR Model

▶ Assumption MLR.3 (No Perfect Collinearity): In the sample (and


therefore in the population), none of the independent variables is
constant and there are no exact linear relationships among the
independent variables.

– The assumption only rules out perfect correlation between


explanatory variables; imperfect correlation is allowed.

– If an explanatory variable is a perfect linear combination of other


explanatory variables it is redundant and can be removed.

– Constant variables are ruled out (collinear with the regressor 1).
▶ This is an extension of the condition Σ_{i=1}^n (xi − x̄ )² > 0 in the SLR model. Why?

21 / 51
Example for Perfect Collinearity

▶ Suppose the MLR model is

voteA = β0 + β1 shareA + β2 shareB + u

where voteA is the percentage of the vote received by candidate A, and
shareA and shareB are the percentages of total campaign expenditures spent
by A and B in two-candidate elections.
▶ Either shareA or shareB has to be dropped from the regression
because there is an exact linear relationship between them:
shareA + shareB = 100.

22 / 51
Standard Assumptions for the MLR Model
▶ Assumption MLR.4 (Zero Conditional Mean):

E [u|x1 , x2 , · · · , xk ] = 0

The values of the explanatory variables must contain no
information about the mean of the unobserved factors.
▶ In a MLR model, the zero conditional mean assumption is much
more likely to hold because fewer things end up in the error.
▶ Example: avgscore = β0 + β1 expend + β2 avginc + u,
– avgscore is average standardized test score of a school; expend
is per student spending at the school; avginc is average family
income of students at the school.
If avginc was not included in the regression, it would end up in the
error term; it would then be hard to defend that expend is
uncorrelated with the error.

23 / 51
Unbiasedness of OLS
▶ Explanatory variables that are correlated with u are called
endogenous variables; endogeneity is a violation of assumption
MLR.4.
▶ Explanatory variables that are uncorrelated with u are called
exogenous variables; MLR.4 holds if all explanatory variables are
exogenous.
▶ Exogeneity is the key assumption for a causal interpretation of the
regression, and for unbiasedness of the OLS estimators.
▶ Theorem (Unbiasedness of OLS): Under assumptions
MLR.1-MLR.4,
E [β̂j ] = βj , j = 0, 1, · · · , k,
for any values of population parameter βj .
▶ Unbiasedness is an average property in repeated samples; in a given
sample, the estimates may still be far away from the true values.

24 / 51
Including Irrelevant Variables in a Regression Model

▶ Suppose
y = β0 + β1 x1 + β2 x2 + β3 x3 + u,
where β3 = 0, i.e., x3 is irrelevant to y .
▶ No problem because E [β̂3 ] = β3 = 0.

▶ However, including irrelevant variables may increase sampling


variance of β̂1 and β̂2 .

25 / 51
Omitted Variable Bias: the Simple Case

▶ Suppose the true model is

y = β0 + β1 x1 + β2 x2 + u,

i.e., the true model contains both x1 and x2 (β1 ̸= 0, β2 ̸= 0).


However, we estimate a misspecified model:

y = α0 + α1 x1 + ε.

So, x2 is omitted.
▶ Suppose x1 and x2 are correlated, with the linear regression relationship:

x2 = δ0 + δ1 x1 + v .

26 / 51
Omitted Variable Bias: the Simple Case

▶ Then,

y = β0 + β1 x1 + β2 (δ0 + δ1 x1 + v ) + u
  = (β0 + β2 δ0 ) + (β1 + β2 δ1 ) x1 + (u + β2 v ),

where α0 = β0 + β2 δ0 , α1 = β1 + β2 δ1 , and ε = u + β2 v .

▶ If y is only regressed on x1 , the estimated intercept and slope satisfy

E [α̂0 ] = β0 + β2 δ0

E [α̂1 ] = β1 + β2 δ1
Why? The new error term ε = u + β2 v satisfies the zero conditional
mean assumption: E [u + β2 v |x1 ] = E [u|x1 ] + β2 E [v |x1 ] = 0.
▶ Obviously, if β2 δ1 = 0, α̂1 is an unbiased estimator of β1 .
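▶ A small Monte Carlo sketch (Python/numpy; the parameter values are made up for illustration) of the omitted variable bias formula E [α̂1 ] = β1 + β2 δ1 :

import numpy as np

rng = np.random.default_rng(4)
beta1, beta2, delta1 = 2.0, 1.0, 0.5          # illustrative true parameters
n, reps = 200, 2000
alpha1_hats = np.empty(reps)

for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = delta1 * x1 + rng.normal(size=n)     # x2 correlated with x1
    y = 1.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    # Short regression of y on (1, x1) only: x2 is omitted
    xc = x1 - x1.mean()
    alpha1_hats[r] = (xc @ y) / (xc @ xc)

print(alpha1_hats.mean(), beta1 + beta2 * delta1)   # both close to 2.5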

27 / 51
Omitted Variable Bias: the Simple Case

▶ The direction of the bias is determined by the signs of β2 and δ1 . In
practice, β2 is an unknown population parameter, so we cannot be certain
whether it is positive or negative; nevertheless, we usually have a pretty
good idea about the direction of the partial effect of x2 on y . Further,
even though the correlation between x1 and x2 cannot be computed if x2 is
not observed, in many cases we can make an educated guess about whether
x1 and x2 are positively or negatively correlated.
▶ Summary of the bias in α̂1 when x2 is omitted from the estimating
equation:

                 Corr (x1 , x2 ) > 0    Corr (x1 , x2 ) < 0
   β2 > 0        positive bias          negative bias
   β2 < 0        negative bias          positive bias

▶ When is there no omitted variable bias? If the omitted variable is


irrelevant (β2 = 0) or uncorrelated (δ1 = 0).

28 / 51
Example: Omitting Ability in a Wage Equation

▶ Suppose the true wage equation is

wage = β0 + β1 educ + β2 abil + u, where β2 > 0.

But the estimated equation is

wage = α0 + α1 educ + ε.

▶ Suppose
abil = δ0 + δ1 educ + v , where δ1 > 0.
▶ The return to education β1 will be overestimated because β2 δ1 > 0.
It will look as if people with many years of education earn very high
wages, but this is partly due to the fact that people with more
education are also more able on average.

29 / 51
Omitted Variable Bias: More General Cases

▶ How does the omission of xk bias the estimates of β0 , β1 , · · · , βk−1
when k ≥ 3?
– Long regression: y = β̂0 + β̂1 x1 + · · · + β̂k xk
– Short regression: y = α̂0 + α̂1 x1 + · · · + α̂k−1 xk−1
– xk regression: xk = δ̃0 + δ̃1 x1 + · · · + δ̃k−1 xk−1

▶ Plug the xk regression into the long regression and collect terms:

α̂j = β̂j + β̂k δ̃j , j = 0, 1, · · · , k − 1.

=⇒ E [α̂j ] = βj + βk δj .
▶ α̂j is an unbiased estimator for βj only if βk = 0 or δj = 0.

30 / 51
Exercise

Suppose that you are interested in estimating the ceteris paribus


relationship between y and x1 . For this purpose, you can collect
data on two control variables, x2 and x3 . Let β̃1 be the simple
regression estimate from y on x1 and let β̂1 be the multiple
regression estimate from y on x1 , x2 , x3 .
▶ If x1 is highly correlated with x2 and x3 in the sample, and x2
and x3 have large partial effects on y , would you expect β̃1
and β̂1 to be similar or very different? Explain.
▶ If x1 is almost uncorrelated with x2 and x3 , but x2 and x3 are
highly correlated, will β̃1 and β̂1 tend to be similar or very
different? Explain.

31 / 51
Solution

▶ Because x1 is highly correlated with x2 and x3 , and these


latter variables have large partial effects on y , the simple and
multiple regression coefficients on x1 can differ by large
amounts.
▶ Here we would expect β̃1 and β̂1 to be similar. The amount of
correlation between x2 and x3 does not directly affect the
multiple regression estimate on x1 , if x1 is essentially
uncorrelated with x2 and x3 .

32 / 51
The Variance of the OLS Estimators

33 / 51
Standard Assumptions for the MLR Model

▶ Assumption MLR.5 (Homoskedasticity):

Var [u|x1 , · · · , xk ] = σ 2

The value of the explanatory variables must contain no information


about the variance of the unobserved factors.
▶ Example: In the wage equation

wage = β0 + β1 educ + β2 exper + β3 tenure + u,

the homoskedasticity assumption

Var [u|educ, exper , tenure] = σ 2

may also be hard to justify in many cases.

34 / 51
Sampling Variances of the OLS Slope Estimators
▶ Theorem (Sampling Variances of the OLS Slope Estimators):
Under assumptions MLR.1-MLR.5,

Var (β̂j ) = σ² / [SSTj (1 − Rj²)], j = 1, · · · , k,

where σ² is the variance of the error term, SSTj = Σ_{i=1}^n (xij − x̄j )² is the
total sample variation in explanatory variable xj , and Rj² is the
R-squared from the regression:

xj = δ0 + δ1 x1 + · · · + δj−1 xj−1 + δj+1 xj+1 + · · · + δk xk . (1)

▶ Note that SSTj (1 − Rj²) = SSRj = Σ_{i=1}^n r̂ij², where r̂ij is the residual
from the regression (1).

▶ Compared with the SLR case, where Var (β̂1 ) = σ²/SSTx = σ² / Σ_{i=1}^n (xi − x̄ )²,
the MLR case replaces xi − x̄ by r̂ij .
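▶ A numerical sketch (Python/numpy with simulated regressors; σ² is treated as known here purely for illustration) checking that σ²/[SSTj (1 − Rj²)] matches the corresponding diagonal element of the familiar matrix expression σ²(X'X)⁻¹:

import numpy as np

rng = np.random.default_rng(5)
n = 500
sigma2 = 4.0                                   # error variance (chosen for illustration)
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)             # correlated regressors
X = np.column_stack([np.ones(n), x1, x2])

# Var(beta_hat_1) via the slide formula: sigma^2 / (SST_1 * (1 - R_1^2))
Z = np.column_stack([np.ones(n), x2])          # all other regressors
r1 = x1 - Z @ np.linalg.solve(Z.T @ Z, Z.T @ x1)
sst1 = np.sum((x1 - x1.mean()) ** 2)
r2_1 = 1 - (r1 @ r1) / sst1                    # R^2 from regressing x1 on the others
var_a = sigma2 / (sst1 * (1 - r2_1))

# Same quantity via sigma^2 * (X'X)^{-1}
var_b = sigma2 * np.linalg.inv(X.T @ X)[1, 1]
print(var_a, var_b)                            # identical up to rounding error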

35 / 51
The Components of OLS Variances
▶ The error variance, σ 2 :

– A high σ 2 indicates more “noise” in the equation, which


increases the sampling variance and makes estimates imprecise.

▶ The total sample variation in the explanatory variable xj , SSTj :

– More sample variation leads to more precise estimates.


– Total sample variation is non-decreasing with the sample size:

Σ_{i=1}^n (xi − x̄n )² ≤ Σ_{i=1}^n (xi − x̄n+1 )² ≤ Σ_{i=1}^{n+1} (xi − x̄n+1 )²

– SSTj = n · [(1/n) Σ_{i=1}^n (xij − x̄j )²] = n Var̂ (xj ); Var̂ (xj ) tends to be
stable.
– Increasing the sample size n is thus a way to get more precise
estimates.

36 / 51
The Components of OLS Variances

▶ The linear relationships among the independent variables, Rj2 :

– In the regression of xj on all other independent variables
(including a constant), the R² (= Rj²) will be higher the better xj
can be linearly explained by the other independent variables.
– Var (β̂j ) will therefore be higher the better explanatory variable xj
can be linearly explained by the other independent variables.
– The problem of almost linearly dependent explanatory variables
is called multicollinearity (i.e., Rj² → 1 for some j).
– If Rj² = 1, i.e., there is perfect collinearity between xj and the other
regressors, βj cannot be identified. This is why Var (β̂j ) = ∞.
– Multicollinearity is a small-sample problem. As larger and
larger data sets are available nowadays, i.e., n is much larger
than k, it is seldom a problem in current econometric practice.

37 / 51
An Example for Multicollinearity
▶ Consider the following MLR model,

avgscore = β0 + β1 teacherexp + β2 matexp + β3 othexp + · · · ,


– avgscore: average standardized test score of school
– teacherexp: expenditures for teachers
– matexp: expenditures for instructional materials
– othexp: other expenditures

▶ The different expenditure categories will be strongly correlated


because if a school has ample resources it will spend on everything.
▶ For precise estimates of the differential effects, one needs
information about situations where expenditure categories change
differentially.
▶ Therefore, sampling variance of the estimated effects will be large.

38 / 51
Discussion of the Multicollinearity Problem

▶ In the above example, it would probably be better to lump all


expenditure categories together.
▶ In other cases, dropping some independent variables may reduce
multicollinearity (but this may lead to omitted variable bias).
▶ Only the sampling variance of the variables involved in
multicollinearity will be inflated; the estimates of other effects may
be very precise.

▶ Multicollinearity may be detected by "variance inflation factors":

VIFj = 1 / (1 − Rj²).

As an (arbitrary) rule of thumb, the variance inflation factor should
not be larger than 10 (or Rj² should not be larger than 0.9).
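▶ A minimal sketch (Python/numpy, simulated data; the near-collinear design is constructed for illustration) of computing VIFj = 1/(1 − Rj²) for each regressor:

import numpy as np

def vif(X):
    """Variance inflation factor for each non-constant column of X (column 0 is the constant)."""
    out = []
    for j in range(1, X.shape[1]):
        xj = X[:, j]
        others = np.delete(X, j, axis=1)            # all other columns, incl. the constant
        fitted = others @ np.linalg.lstsq(others, xj, rcond=None)[0]
        r2_j = 1 - np.sum((xj - fitted) ** 2) / np.sum((xj - xj.mean()) ** 2)
        out.append(1 / (1 - r2_j))
    return out

rng = np.random.default_rng(6)
n = 300
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)           # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])
print(vif(X))                                        # large VIFs for x1 and x2, about 1 for x3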

39 / 51
Discussion of the Multicollinearity Problem

[Figure 3.1 (Wooldridge): Var (β̂1 ) as a function of R1²; the variance grows
without bound as R1² → 1.]

40 / 51
Variances in Misspecified Models

▶ The choice of whether to include a particular variable in a regression


can be made by analyzing the trade-off between bias and variance.
▶ Suppose the true model is

y = β0 + β1 x1 + β2 x2 + u,

the fitted regression line in model 1 is

ŷ = β̂0 + β̂1 x1 + β̂2 x2 ,

and model 2 is
ỹ = β̃0 + β̃1 x1 .
▶ It might be the case that the likely omitted variable bias of β̃1 in the
misspecified model 2 is more than compensated for by its smaller variance.

41 / 51
Variances in Misspecified Models

▶ Mean Squared Error (MSE): For a general estimator, say, β̂,

MSE (β̂) = E [(β̂ − β)²]
        = E [(β̂ − E [β̂] + E [β̂] − β)²]
        = E [(β̂ − E [β̂])²] + (E [β̂] − β)² − 2 (β − E [β̂]) E [β̂ − E [β̂]]
        = Var (β̂) + Bias(β̂)²,

since the cross term vanishes: E [β̂ − E [β̂]] = 0.

▶ MSE is unobserved. In practice, estimate both models and assess how
sensitive β̂1 and se(β̂1 ) are to the inclusion of x2 .

42 / 51
Estimating the Error Variance

▶ The unbiased estimator of σ² is

σ̂² = (1/(n − k − 1)) Σ_{i=1}^n ûi² = SSR/(n − k − 1),

where n − (k + 1) is called the degrees of freedom.
▶ The n estimated residuals {ûi : i = 1, · · · , n} in the sum are
not completely independent but are related through the k + 1 equations
that define the first-order conditions of the minimization problem.
▶ In the SLR, k = 1, and σ̂² = (1/(n − 2)) Σ_{i=1}^n ûi².

▶ Theorem (Unbiased Estimation of σ 2 ): Under assumptions


MLR.1-MLR.5,
E [σ̂ 2 ] = σ 2 .

43 / 51
Estimating the Error Variance

▶ The true sampling variation of the estimated βj is

sd(β̂j ) = √Var (β̂j ) = √(σ² / [SSTj (1 − Rj²)])

▶ The estimated sampling variation of the estimated βj is

se(β̂j ) = √Var̂ (β̂j ) = √(σ̂² / [SSTj (1 − Rj²)])

i.e., we plug in σ̂² for the unknown σ².


▶ Note that these formulas are only valid under assumptions
MLR.1-MLR.5 (in particular, there has to be homoskedasticity).
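▶ A sketch (Python/numpy, simulated data; values are illustrative) of computing σ̂² = SSR/(n − k − 1) and the standard errors se(β̂j ):

import numpy as np

rng = np.random.default_rng(7)
n, k = 400, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n) * 2.0

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat
sigma2_hat = (u_hat @ u_hat) / (n - k - 1)        # unbiased estimate of the error variance

# se(beta_hat_j): square roots of the diagonal of sigma2_hat * (X'X)^{-1},
# which coincides with sqrt(sigma2_hat / (SST_j * (1 - R_j^2))) for j = 1, ..., k
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
print(sigma2_hat, se)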

44 / 51
Exercise

Suppose that you are interested in estimating the ceteris paribus


relationship between y and x1 . For this purpose, you can collect
data on two control variables, x2 and x3 . Let β̃1 be the simple
regression estimate from y on x1 and let β̂1 be the multiple
regression estimate from y on x1 , x2 , x3 .
▶ If x1 is highly correlated with x2 and x3 , and x2 and x3 have
small partial effect on y , would you expect se(β̃1 ) or se(β̂1 ) to
be smaller?
▶ If x1 is almost uncorrelated with x2 and x3 , and x2 and x3 are
highly correlated, would you expect se(β̃1 ) or se(β̂1 ) to be
smaller?

45 / 51
Solution

▶ In this case, we are (unnecessarily) introducing


multicollinearity into the regression; x2 and x3 have small
partial effects on y and yet x2 and x3 are highly correlated
with x1 . Adding x2 and x3 increases the standard error of the
coefficient on x1 substantially, so se(β̂1 ) is likely to be much
larger than se(β̃1 ).
▶ In this case, adding x2 and x3 will decrease the residual
variance without causing much collinearity (because x1 is
almost uncorrelated with x2 and x3 ), so we should see se(β̂1 )
smaller than se(β̃1 ). The amount of correlation between x2
and x3 does not directly affect se(β̂1 ).

46 / 51
Efficiency of OLS: The Gauss-Markov Theorem

47 / 51
Efficiency of OLS

▶ Under assumptions MLR.1-MLR.4, OLS is unbiased.

▶ However, under these assumptions there may be many other


estimators that are unbiased. Which one is the unbiased estimator
with the smallest variance?
▶ In order to answer this question, one usually limits attention to linear
estimators, i.e., estimators linear in the dependent variable:

β̃j = Σ_{i=1}^n wij yi ,

where wij is an arbitrary function of the sample values of all the
explanatory variables.

48 / 51
Efficiency of OLS

▶ The OLS estimator can be shown to be of this form. In the SLR,

β̂1 = Σ_{i=1}^n (xi − x̄ ) yi / Σ_{i=1}^n (xi − x̄ )²,

i.e.,

wi1 = (xi − x̄ ) / Σ_{i=1}^n (xi − x̄ )² = (xi − x̄ ) / SSTx ,

which is a function of {xi : i = 1, · · · , n}. (How about β̂j in the MLR?)
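▶ A quick check (Python/numpy, simulated SLR data; parameter values are illustrative) that the OLS slope is a linear estimator with weights wi1 = (xi − x̄)/SSTx :

import numpy as np

rng = np.random.default_rng(8)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

sst_x = np.sum((x - x.mean()) ** 2)
w = (x - x.mean()) / sst_x             # weights depend only on the x's
beta1_tilde = w @ y                    # linear in y

# Compare with the usual OLS slope estimate
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sst_x
print(beta1_tilde, beta1_hat)          # equal up to rounding error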

49 / 51
The Gauss-Markov Theorem

▶ Theorem (The Gauss-Markov Theorem): Under assumptions


MLR.1-MLR.5, the OLS estimators are the best linear unbiased
estimators (BLUEs) of the regression coefficients, i.e.,

Var (β̂j ) ≤ Var (β̃j )

for all β̃j = Σ_{i=1}^n wij yi for which E [β̃j ] = βj , j = 0, 1, · · · , k.
▶ OLS is the best linear estimator only when MLR.1-MLR.5 hold; if
there is heteroskedasticity for example, there are better estimators.
▶ The key assumption for the Gauss-Markov theorem is Assumption
MLR.5.
▶ Due to the Gauss-Markov Theorem, assumptions MLR.1-MLR.5 are
collectively known as the Gauss-Markov assumptions.

50 / 51
The Gauss-Markov Theorem

OLS is efficient in the class of unbiased, linear estimators.

[Figure: Venn diagram of all estimators, with the subsets of linear and of
unbiased estimators; OLS lies in their intersection.]

OLS is BLUE: the best linear unbiased estimator.

51 / 51
