
EC212: Introduction to Econometrics

Multiple Regression: Estimation


(Wooldridge, Ch. 3)

Tatiana Komarova

London School of Economics

Summer 2021

1
1. Motivation for multiple regression
(Wooldridge, Ch. 3.1)

2
Example: Wage equation

• Extend simple regression for wage as

log(wage) = β0 + β1 educ + β2 IQ + u

where IQ is IQ score

• Primarily interested in β1 but β2 is of some interest, too

• Now IQ is taken out of the error term. If IQ is good proxy for


intelligence, this may lead to more convincing estimate of
causal effect of schooling

3
Model with two regressors

• Generally, we can write regression model with two regressors

y = β0 + β1 x1 + β2 x2 + u

where β0 is intercept, β1 measures change in y with respect


to x1 , holding other factors (u and x2 ) fixed, and β2 measures
change in y with respect to x2 , holding other factors (u and
x1 ) fixed

• In this model, key assumption about how u is related to x1


and x2 is
E (u|x1 , x2 ) = 0
i.e. for any values of x1 and x2 in population, conditional
expectation of u is zero

4
Back to example

• In wage example, this assumption is E (u|educ, IQ) = 0. Now


u no longer contains intelligence (we hope). So this condition
has better chance of being true. In simple regression, we had
to assume IQ and educ are unrelated to justify leaving IQ in
error term

• Other factors such as experience and “motivation” are part of


u. Motivation is very difficult to measure. Experience is easier:

log(wage) = β0 + β1 educ + β2 IQ + β3 exper + u

5
Model with k regressors

• Multiple linear regression model is written as

y = β0 + β1 x1 + · · · + βk xk + u

where β0 is intercept and β1 , . . . , βk are slope parameters


(i.e. (k + 1) unknown parameters in total)

• Key assumption is

E (u|x1 , . . . , xk ) = 0

• Provided we are careful, we can make this condition closer to


being true by “controlling for” more variables. In wage
example, we “control for” IQ when estimating return to
education

6
• Multiple regression allows us to incorporate different factors to
explain behavior of y

• Also multiple regression is useful to allow more flexible


functional forms. For example

log(wage) = β0 + β1 educ + β2 IQ + β3 exper + β4 exper² + u

so that exper is allowed to have quadratic effect on log(wage)

• In this case, we set x1 = educ, x2 = IQ, x3 = exper , and


x4 = exper². Note that x4 is a nonlinear function of x3

• See App. A.4 for review on quadratic function

7
• We already know that 100 · β1 is percent change in wage
when educ increases by one year. 100 · β2 has similar
interpretation (for one point increase in IQ)

• β3 and β4 are harder to interpret, but we can use calculus to


get slope of log(wage) with respect to exper

∂ log(wage)/∂exper = β3 + 2β4 exper
• Multiply by 100 to get percentage effect
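
• A minimal Stata sketch of this calculation (assuming WAGE2.dta is in memory; the generated variable expersq and the evaluation point exper = 10 are illustrative choices, not from the slides):

. gen expersq = exper^2
. reg lwage educ IQ exper expersq
. display 100*(_b[exper] + 2*_b[expersq]*10)   // approx. % effect of one more year of exper, evaluated at exper = 10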

8
2. Mechanics and interpretation of
OLS
(Wooldridge, Ch. 3.2)

9
OLS for multiple regression

• Suppose we have x1 and x2 (k = 2) along with y . We want to


fit the equation

ŷi = β̂0 + β̂1 xi1 + β̂2 xi2

by data {(yi , xi1 , xi2 ) : i = 1, . . . , n}

• Now regressors have two subscripts: i is observation number


and the second subscript (1 or 2 in this case) labels the
particular regressor. For example,

xi1 = educi for i = 1, . . . , n


xi2 = IQi for i = 1, . . . , n

10
• As in simple regression case, there are different ways to derive OLS
estimator. We choose β̂0 , β̂1 , and β̂2 (so three unknowns) to
minimize sum of squared residuals
∑_{i=1}^n (yi − β̂0 − β̂1 xi1 − β̂2 xi2)²

• Case with k regressors is easy to state: choose k + 1 values


β̂0 , β̂1 , . . . , β̂k to minimize
∑_{i=1}^n (yi − β̂0 − β̂1 xi1 − · · · − β̂k xik)²

• Later we discuss condition to have unique solution. Stata is


good at finding solution
• Terminology: We say β̂0 , β̂1 , . . . , β̂k are the OLS estimates
from the regression
y on x1 , x2 , . . . , xk
11
OLS regression line

• OLS regression line is written as

ŷ = β̂0 + β̂1 x1 + · · · + β̂k xk

• Slope coefficients now explicitly have ceteris paribus


interpretations

• For example, if k = 2, then

∆ŷ = β̂1 ∆x1 + β̂2 ∆x2

which allows us to compute how predicted y changes when x1


and x2 change by any amount

12
• What if we “hold x2 fixed”? Then

β̂1 = ∆ŷ/∆x1   if ∆x2 = 0

i.e., β̂1 is slope of ŷ with respect to x1 when x2 is held fixed

• Similarly
β̂2 = ∆ŷ/∆x2   if ∆x1 = 0
• We call β̂1 and β̂2 partial effects
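
• As a worked example, using the wage estimates reported on the following slides (β̂1 = 0.039 for educ, β̂2 = 0.0059 for IQ): holding IQ fixed (∆x2 = 0), one more year of education gives ∆ŷ = 0.039, i.e. predicted wage about 3.9% higher; holding educ fixed, a 10-point increase in IQ gives ∆ŷ = 0.0059 · 10 ≈ 0.059, about 5.9%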

13
Example: Regress log(wage) on educ (WAGE2.dta)

. reg lwage educ

Source SS df MS Number of obs = 935


F( 1, 933) = 100.70
Model 16.1377042 1 16.1377042 Prob > F = 0.0000
Residual 149.518579 933 .160255712 R-squared = 0.0974
Adj R-squared = 0.0964
Total 165.656283 934 .177362188 Root MSE = .40032

lwage Coef. Std. Err. t P>|t| [95% Conf. Interval]

educ .0598392 .0059631 10.03 0.000 .0481366 .0715418


_cons 5.973063 .0813737 73.40 0.000 5.813366 6.132759

14
Multiple regression: log(wage) on educ and IQ
. reg lwage educ IQ

Source SS df MS Number of obs = 935


F( 2, 932) = 69.42
Model 21.4779447 2 10.7389723 Prob > F = 0.0000
Residual 144.178339 932 .154697788 R-squared = 0.1297
Adj R-squared = 0.1278
Total 165.656283 934 .177362188 Root MSE = .39332

lwage Coef. Std. Err. t P>|t| [95% Conf. Interval]

educ .0391199 .0068382 5.72 0.000 .0256998 .05254


IQ .0058631 .0009979 5.88 0.000 .0039047 .0078215
_cons 5.658288 .0962408 58.79 0.000 5.469414 5.847162

. corr educ IQ
(obs=935)

educ IQ

educ 1.0000
IQ 0.5157 1.0000
15
• Results

log(wage)^ = 5.973 + 0.060 educ
log(wage)^ = 5.658 + 0.039 educ + 0.0059 IQ

• Estimated return to one year of education falls from 6.0% to


3.9% when we control for differences in IQ

• To interpret multiple regression, we do this thought


experiment: Take two people A and B with same IQ score.
Suppose person B has one more year of schooling than person
A. Then we predict B's wage to be about 3.9% higher

• Simple regression does not allow us to compare people with


same IQ score. Larger estimated return from simple regression
is because we are attributing part of IQ effect to education

16
• Not surprisingly, there is nontrivial positive correlation
between educ and IQ: Corr (educi , IQi ) = 0.516

• Multiple regression “partials out” other regressors when looking


at effect of educ. We can show that β̂1 measures effect of
educ on log(wage) once correlation between educ and IQ is
partialled out

• Another IQ point is worth much less than one year of


education. Holding educ fixed, 10 more IQ points increases
predicted wage by about 5.9%

• Beauty of multiple regression is that it gives us ceteris paribus


interpretation without having to find two people with same
value of IQ who differ in education by one year. OLS
automatically does it for us

17
Fitted values and residuals

• For each i, fitted value is

ŷi = β̂0 + β̂1 xi1 + · · · + β̂k xik

and residual is
ûi = yi − ŷi

18
Algebraic properties

• (1) Residuals always average to zero


∑_{i=1}^n ûi = 0

• (2) Each regressor has zero sample covariance (or


correlation) with residuals
∑_{i=1}^n xij ûi = 0 for j = 1, . . . , k

• These properties follow from the first order conditions of OLS


• These properties imply, e.g., that the sample average of the ŷi equals ȳ, and that ∑_{i=1}^n ŷi ûi = 0
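
• A minimal Stata check of these properties (assuming a regression such as reg lwage educ IQ on WAGE2.dta has just been run):

. predict lwagehat, xb               // fitted values
. predict uhat, resid                // OLS residuals
. summarize uhat                     // sample mean is zero (up to rounding)
. correlate uhat educ IQ lwagehat    // residuals uncorrelated with regressors and fitted values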

19
Goodness-of-fit
• As with simple regression, it can be shown that

SST = SSE + SSR

where SST , SSE and SSR are total, explained and residual
sum of squares

• We define R-squared as before

R² = SSE/SST = 1 − SSR/SST
• Property: 0 ≤ R² ≤ 1, but using same dependent
variable, R² never falls when another regressor is added
to regression (adding another x cannot increase SSR)

• Thus, if we focus on R 2 , we might include silly variables


20
Adjusted R 2

• One way to overcome this problem with R² is to use the adjusted R²

R̄² = 1 − [SSR/(n − k − 1)] / [SST/(n − 1)]
• When more regressors are added, SSR falls, but so does
df = n − k − 1. R̄ 2 can increase or decrease

• Goodness-of-fit of different multiple regression models can be


compared by R̄ 2

• See Ch. 6.3 for further discussion
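
• Stata reports R̄² as "Adj R-squared"; as a sketch, it can also be computed from the stored results of reg (here using the multiple regression of log(wage) on educ and IQ shown earlier):

. quietly reg lwage educ IQ
. scalar sst = e(mss) + e(rss)                     // total sum of squares
. display 1 - (e(rss)/e(df_r))/(sst/(e(N) - 1))    // adjusted R-squared by the formula above
. display e(r2_a)                                  // Stata's stored value, 0.1278 here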

21
Compare simple and multiple regression estimates

• Compare simple and multiple OLS regression lines

ỹ = β̃0 + β̃1 x1
ŷ = β̂0 + β̂1 x1 + β̂2 x2

where tilde (~) denotes simple regression and hat (ˆ) denotes
multiple regression (estimated on the same data)

• Question: Is there simple relationship between β̃1 (which does


not control for x2 ) and β̂1 (which does)?

22
• Yes, but we need to define another simple regression. Let δ̃1
be the slope from regression

xi2 on xi1

(note: x2 plays the role of dependent variable)

• It is always true for any sample that

β̃1 = β̂1 + β̂2 δ̃1

23
Case 1: β̂2 > 0, x1 & x2 are positively correlated

• Positive correlation between x1 and x2 implies δ̃1 > 0. Thus, if


β̂2 > 0, then β̂2 δ̃1 > 0 and

β̃1 = β̂1 + β̂2 δ̃1


= β̂1 + (+)(+) > β̂1

i.e. slope estimate of x1 gets smaller if x2 is added

24
Example: log(wage) = β0 + β1 educ + β2 IQ + u

• We got β̃1 = .060 > .039 = β̂1

. reg IQ educ

Source SS df MS Number of obs = 935


F( 1, 933) = 338.02
Model 56280.9277 1 56280.9277 Prob > F = 0.0000
Residual 155346.531 933 166.502177 R-squared = 0.2659
Adj R-squared = 0.2652
Total 211627.459 934 226.581862 Root MSE = 12.904

IQ Coef. Std. Err. t P>|t| [95% Conf. Interval]

educ 3.533829 .1922095 18.39 0.000 3.156616 3.911042


_cons 53.68715 2.622933 20.47 0.000 48.53962 58.83469

• Indeed δ̃1 = 3.53 > 0 from above
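
• Plugging the reported estimates into β̃1 = β̂1 + β̂2 δ̃1 confirms the identity: 0.03912 + 0.00586 × 3.534 ≈ 0.0598 = β̃1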

25
Case 2: β̂2 > 0, x1 & x2 are negatively correlated

• Negative correlation between x1 and x2 implies δ̃1 < 0. Thus,


if β̂2 > 0, then β̂2 δ̃1 < 0 and

β̃1 = β̂1 + β̂2 δ̃1


= β̂1 + (+)(−) < β̂1

i.e. slope estimate of x1 gets larger if x2 is added

26
Regression through the origin

• Occasionally, one wants to impose that predicted y is zero


when all xj ’s are zero. This means intercept should be set to
zero, rather than estimated

ỹ = β̃1 x1 + · · · + β̃k xk

• Cost of imposing zero intercept when population intercept is


not zero (i.e. β0 ≠ 0) is severe: all slope estimators are
biased, in general

• Estimating intercept when we do not need to (i.e. β0 = 0)


does not cause bias in slope estimators. We already know this
from simple regression: Nothing in SLR.1-4 prevents
population parameters from being zero

27
3. Expected value of OLS estimators
(Wooldridge, Ch. 3.3)

28
Statistical properties of OLS

• As with simple regression, there is set of assumptions under


which OLS is unbiased

• We also explicitly consider bias caused by omitting regressor


appearing in population model

29
Assumptions MLR (Multiple Linear Regression)
• Assumption MLR.1 (Linear in parameters) In population,
it holds
y = β0 + β1 x1 + · · · + βk xk + u
where βj ’s are parameters and u is error term

• Assumption MLR.2 (Random Sampling) We have a


random sample {(yi , xi1 , . . . , xik ) : i = 1, . . . , n} of size n from
population

• Assumption MLR.3 (No perfect collinearity) No


regressor is constant, and there are no exact linear
relationships among them

• Assumption MLR.4 (Zero conditional mean)

E (u|x1 , . . . , xk ) = 0 for all (x1 , . . . , xk )


30
Assumption MLR.1

• Assumption MLR.1 (Linear in parameters)

In population, it holds

y = β0 + β1 x1 + · · · + βk xk + u

where βj ’s are parameters and u is error term

• y and xj ’s can be nonlinear functions of underlying variables


(e.g. log(y ) and xj2 ), so the model is flexible

31
Assumption MLR.2

• Assumption MLR.2 (Random Sampling)

We have a random sample {(yi , xi1 , . . . , xik ) : i = 1, . . . , n} of


size n from population

• As with SLR.2, this assumption introduces data and implies


data are representative sample from population

• By MLR.1-2, we can write

yi = β0 + β1 xi1 + · · · + βk xik + ui

for i = 1, . . . , n

32
Assumption MLR.3

• Assumption MLR.3 (No perfect collinearity)

No regressor is constant, and there are no exact linear


relationships among them

• The need to rule out cases where {xij : i = 1, . . . , n} has no


variation is clear from simple regression

• New part to this assumption because of multiple regressors:


We must rule out the (extreme) case that one of the regressors is an
exact linear function of others

33
Perfect collinearity

• If, say, xi1 is an exact linear function of xi2 , . . . , xik in sample,


we say model suffers from perfect collinearity

• Under perfect collinearity, there are no unique OLS estimators.


Stata will indicate the problem

• Usually perfect collinearity arises from bad specification of


model. Small sample size can also be reason (e.g. unluckily
educi = 2experi for all i)

34
Example: Same variable in different units

• Do not include same variable in model measured in different


units

• For example, in CEO salary equation, it would make no sense


to include firm sales measured in dollars along with sales
measured in millions of dollars (no new information)

• Another example: Return on equity should be included as


percent or proportion, but not both

35
• Also be careful with functional forms

• For example, following does not work

log(cons) = β0 + β1 log(inc) + β2 log(inc²) + u

because log(inc²) = 2 log(inc)

• Instead we probably mean something like

log(cons) = β0 + β1 log(inc) + β2 [log(inc)]² + u

With this choice, x2 = x1² is an exact nonlinear function of


x1 , but this is allowed in MLR.3

36
One more example
• Consider

voteA = β0 + β1 exA + β2 exB + β3 exTotal + u

where exA and exB are campaign spendings by Candidates A


and B, exTotal is total spending

• Problem is: by definition

exA + exB = exTotal

• One of three variables has to be dropped (Stata automatically


does this, but better to do by yourself)
• On the other hand, share of expenditure shareA = exA/(exA + exB)
can be included along with exA and exB because shareA is
nonlinear function of exA and exB
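
• As a hypothetical illustration (variable names as in the voting example above, not an actual dataset), Stata detects this automatically: if exTotal is generated as exA + exB and all three are included, one regressor is dropped with a note such as "omitted because of collinearity":

. gen exTotal = exA + exB        // exact linear combination of exA and exB
. reg voteA exA exB exTotal      // Stata omits one of the collinear regressors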
37
Further remark on MLR.3

• Key: MLR.3 does not say regressors have to be uncorrelated.


MLR.3 only rules out perfect correlation in sample, i.e.
correlations of ±1

• Again in practice violations of MLR.3 are rare unless mistake


has been made in specifying model

• In equation like

log(wage) = β0 + β1 educ + β2 IQ + β3 exper + u

we fully expect correlation among regressors

• Multiple regression allows us to estimate ceteris paribus


effects even under correlation among xj ’s

38
Assumption MLR.4

• Assumption MLR.4 (Zero conditional mean)

E (u|x1 , . . . , xk ) = 0 for all (x1 , . . . , xk )

• If u is correlated with any of xj ’s, MLR.4 is violated

• Often hope is that if our focus is on, say, x1 , we can include


enough other variables in x2 , . . . , xk to make MLR.4 true or
close to true

• When MLR.4 holds, we say x1 , . . . , xk are exogenous


regressors

• If xj is correlated with u, we often say xj is an endogenous


regressor (although this name comes from another context)

39
Example: Effects of class size on student performance

• Consider regression for standardized test score

score = β0 + β1 classize + β2 income + u

• Even at same income level, families differ in their interest and


concern about children’s education. Family support and
student motivation are in u. Are these correlated with class
size even though we have included income? (Probably)

40
Theorem: Unbiasedness of OLS

• Under Assumptions MLR.1-4, OLS estimators are unbiased

E (β̂j ) = βj

for each j = 0, 1, . . . , k

• This result holds for any value of βj , including zero

• See Appendix 3A for proof

41
Inclusion of irrelevant variables

• It is important to see that the unbiasedness result allows for βj to


be any value, including zero

• Consider

log(wage) = β0 + β1 educ + β2 exper + β3 motheduc + u

where MLR.1-4 hold

• Suppose that β3 = 0, but we do not know that. We estimate


full model by OLS

log(wage)^ = β̂0 + β̂1 educ + β̂2 exper + β̂3 motheduc

42
• We automatically know from unbiasedness result that

E (β̂j ) = βj for j = 0, 1, 2
E (β̂3 ) = 0

• Including irrelevant variables (regressors with zero coefficients)


does not cause bias in any coefficients

• In other words, overspecifying the model causes no bias

43
Omitted variable bias (OVB)

• Leaving a variable out when it should be included in multiple


regression is serious problem

• Consider the case where correct model has two explanatory


variables (satisfying MLR.1-4)

y = β0 + β1 x1 + β2 x2 + u

• If we regress y on x1 and x2 , we know resulting OLS


estimators will be unbiased. But suppose we omit x2 and use
simple regression of y on x1

ỹ = β̃0 + β̃1 x1

• In most cases, we omit x2 because we cannot collect data on it

44
Derivation of OVB

• We can easily derive bias in β̃1 conditional on the sample


outcomes X = {(xi1 , xi2 ) : i = 1, . . . , n}

• We already have relationship between β̃1 and multiple


regression estimator β̂1

β̃1 = β̂1 + β̂2 δ̃1

where β̂2 is multiple regression estimator of β2 and δ̃1 is slope


coefficient in auxiliary regression

x2 on x1

45
• Now use the fact that β̂1 and β̂2 are unbiased conditional on X

E (β̂1 ) = β1
E (β̂2 ) = β2

• Since δ̃1 is a function of {(xi1 , xi2 ) : i = 1, . . . , n}, conditional


on X

E (β̃1 ) = E (β̂1 ) + E (β̂2 )δ̃1


= β1 + β2 δ̃1

• Therefore, conditional on X

Bias(β̃1 ) = E (β̃1 ) − β1 = β2 δ̃1

• Recall that δ̃1 has same sign as sample correlation


Corr (xi1 , xi2 )

46
When does β̃1 happen to be unbiased?

• Simple regression estimator β̃1 is unbiased in two cases

• (1) β2 = 0. But this means x2 does not appear in model, so


simple regression is right thing to do

• (2) δ̃1 = 0 or Corr (xi1 , xi2 ) = 0

• If β2 ≠ 0 and Corr (xi1 , xi2 ) ≠ 0, then β̃1 is generally biased

• We do not know β2 and only have vague idea about size of δ̃1 .
But we can often guess sign of bias

47
Bias in simple regression estimator of β1

• OVB formula: Conditional on X

Bias(β̃1 ) = E (β̃1 ) − β1 = β2 δ̃1

• Sign of bias

                    Corr(x1, x2) > 0     Corr(x1, x2) < 0
  β2 > 0            Positive Bias        Negative Bias
  β2 < 0            Negative Bias        Positive Bias
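
• A minimal simulation sketch (not from the slides) illustrating the first cell of the table, with β2 > 0 and Corr(x1, x2) > 0, so the short regression is biased upward:

. clear
. set obs 10000
. set seed 123
. gen x1 = rnormal()
. gen x2 = 0.5*x1 + rnormal()           // x1 and x2 positively correlated
. gen y = 1 + 2*x1 + 3*x2 + rnormal()   // true model: beta1 = 2, beta2 = 3
. reg y x1                              // short regression: slope near 2 + 3(0.5) = 3.5
. reg y x1 x2                           // long regression: slope on x1 near 2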

48
Example: Omitted ability bias

• Consider

log(wage) = β0 + β1 educ + β2 abil + u

where abil is “ability”

• Essentially by definition β2 > 0. We also think

Corr (educ, abil) > 0

so that higher ability people get more education on average

49
• In this scenario

E (β̃1 ) = β1 + β2 δ̃1
= β1 + (+)(+) > β1

so there is upward bias in simple regression. Failure to control


for ability leads to (on average) overestimating return to
education

• Remember, for particular sample, we never know whether


β̃1 > β1. But we should be very hesitant to trust a procedure
that produces a large bias on average

50
Example: Effects of tutoring program on student
performance
• Consider
GPA = β0 + β1 tutor + β2 abil + u
where tutor is hours spent in tutoring.
• Again β2 > 0. Suppose that students with lower ability tend
to use more tutoring

Corr (tutor , abil) < 0

• In this scenario,

E (β̃1 ) = β1 + β2 δ̃1
= β1 + (+)(−) < β1

so that failure to account for ability leads to underestimating the


effect of tutoring
51
4. Variance of OLS estimators
(Wooldridge, Ch. 3.4)

52
Assumptions so far

• MLR.1: y = β0 + β1 x1 + · · · + βk xk + u

• MLR.2: random sampling from the population

• MLR.3: no perfect collinearity in the sample

• MLR.4: E (u|x1 , . . . , xk ) = 0

• Under MLR.3 we can compute OLS estimates

• Other assumptions ensure that OLS is unbiased

• To get Var (β̂j ), we add simplifying assumption,


homoskedasticity

53
Assumption MLR.5

• Assumption MLR.5 (Homoskedasticity)

Variance of u does not change with any of x1 , . . . , xk

Var (u|x1 , . . . , xk ) = Var (u) = σ 2

• This assumption can never be guaranteed. We impose this for


now to get simple formulas

• MLR.1-4 imply

E (y |x1 , . . . , xk ) = β0 + β1 x1 + · · · + βk xk

and when we add MLR.5

Var (y |x1 , . . . , xk ) = Var (u|x1 , . . . , xk ) = σ 2

54
Example: Savings equation

• Consider savings equation

sav = β0 + β1 inc + β2 famsize + β3 pareduc + u

where famsize is size of family and pareduc is total parents’


education

• MLR.5 means that the variance in sav cannot depend on income,


family size, or parents' education

• Later we will show how to relax MLR.5, and how to test


whether it is true

55
Formula for Var (β̂j )
• Focus on slope (different formula is needed for intercept)

• As before, we compute variance conditional on the values of


regressors

• We need to define two quantities associated with each xj .


First is total variation of xj in sample
SSTj = ∑_{i=1}^n (xij − x̄j)²

• Second is measure of correlation between xj and other


regressors, which is

Rj² = R² from the regression of xj on the other regressors

(note: y plays no role here)


56
Theorem: Sampling variance of OLS estimators

• Under Assumptions MLR.1-5, and conditional on X,

Var(β̂j) = σ² / [SSTj(1 − Rj²)]

for j = 1, . . . , k, where
SSTj = ∑_{i=1}^n (xij − x̄j)²
Rj² = R² from the regression of xj on the other regressors

57
Remark on theorem

• All five assumptions are needed to get this formula

• Note: Rj2 = 1 is ruled out by Assumption MLR.3

• Any value 0 ≤ Rj2 < 1 is permitted. As Rj2 gets closer to one,


xj is more linearly related to other regressors
• If MLR.5 is violated, variance formula Var(β̂j) = σ²/[SSTj(1 − Rj²)] is
generally incorrect for all j = 1, . . . , k

58
Remark on variance formula

• Variance formula

Var(β̂j) = σ² / [SSTj(1 − Rj²)]

has three components

• σ 2 and SSTj are familiar from simple regression. The third


component 1 − Rj2 is new to multiple regression

• As error variance σ 2 = Var (ui ) decreases, Var (β̂j ) decreases.


One way to reduce error variance is to take more stuff out of
the error, i.e. add more regressors

59
Effect of SSTj

• As total sample variation SSTj in xj increases, Var (β̂j )


decreases. It is easier to estimate how xj affects y if we see
more variation in xj

• As we mentioned earlier, SSTj /n (or SSTj /(n − 1)) is sample


variance of {xij : i = 1, . . . , n}. So we can say

SSTj ≈ nσj2

where σj2 = Var (xj ) is variance of xj

• We can increase SSTj by increasing sample size

60
Effect of Rj2

• As Rj2 → 1, Var (β̂j ) → ∞. Rj2 measures how linearly related


xj is to other regressors

• We get smallest variance for β̂j when Rj2 = 0

Var(β̂j) = σ²/SSTj

which looks just like simple regression formula

• If xj is unrelated to all other regressors, it is easier to estimate


its ceteris paribus effect on y

• Rj2 = 0 is very rare. In fact, Rj2 ≈ 1 is somewhat common.


This can cause problems for getting sufficiently precise
estimate of βj

61
Multicollinearity

• Loosely, Rj2 “close” to one is called the “problem” of


multicollinearity

• Unfortunately, we cannot define what we mean by “close” that


is relevant for all situations. We have ruled out the case of
perfect collinearity Rj2 = 1

• Here is important point: One often hears discussions of


multicollinearity as if high correlation among regressors is
violation of an assumption we made. But it does not violate
any of Assumptions MLR.1-5

• So regardless of multicollinearity, we still have E (β̂j ) = βj and


variance formula is correct

62
• In fact, formula is doing its job: It shows that if Rj2 is “close”
to one, Var (β̂j ) might be very large

• If Rj2 is “close” to one, xj does not have much sample variation


separate from other regressors. We are trying to estimate
effect of xj on y , holding x1 , . . . , xj−1 , xj+1 , . . . , xk fixed, but
data may not allow us to do that very precisely

• Because multicollinearity violates none of our assumptions, it


is essentially impossible to state hard rules about when it is a
“problem”

63
• Value of Rj² per se is not important. Ultimately what is
important is Var(β̂j)

• For Var (β̂j ), large Rj2 can be offset by large SSTj , which
grows roughly linearly with sample size n

• At this point, we have no way of knowing whether Var (β̂j ) is


“too large” for the estimate β̂j to be useful. Only when we
discuss confidence intervals and hypothesis testing (in Ch. 4)
will this become apparent
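
• In practice, one way to inspect the Rj² (a side note, not covered in these slides): after reg, estat vif reports the variance inflation factor 1/(1 − Rj²) for each regressor, and Rj² itself comes from the auxiliary regression directly:

. quietly reg lwage educ IQ exper
. estat vif                       // VIF_j = 1/(1 - Rj^2) for each regressor
. quietly reg educ IQ exper       // auxiliary regression for educ
. display e(r2)                   // Rj^2 for educ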

64
Correlation among control variables

• Consider
y = β0 + β1 x1 + β2 x2 + β3 x3 + u
where β1 is coefficient of interest. Assume x2 and x3 act as
controls so that we hope to get good ceteris paribus estimate
of the effect of x1. Such controls are often highly correlated. (E.g. x2 and
x3 are different test scores)

• Key is: correlation between x2 and x3 has nothing to do with


Var (β̂1 ). It is only correlation of x1 with (x2 , x3 ) that matters

65
Example

• To determine whether communities with larger minority


populations are discriminated against in lending

percapproved = β0 + β1 percminority
+β2 avginc + β3 avghouseval + u,

where β1 is of interest

• avginc and avghouseval might be highly correlated. But we


do not care whether we can precisely estimate β2 or β3

66
Variance in misspecified models

• As with bias calculations, we can study variances of OLS


estimators in misspecified models

• Consider
y = β0 + β1 x1 + β2 x2 + u
where Assumptions MLR.1-5 hold true

• We run “short” regression, y on x1 , and also “long” regression,


y on x1 , x2

ỹ = β̃0 + β̃1 x1
ŷ = β̂0 + β̂1 x1 + β̂2 x2

67
• From previous analysis, we know: conditional on X,

Var(β̂1) = σ² / [SST1(1 − R1²)]

• What about simple regression OLS? We can show: conditional


on X,
Var(β̃1) = σ²/SST1
• Whenever xi1 and xi2 are correlated, then R1² > 0 and

Var(β̃1) = σ²/SST1 < σ²/[SST1(1 − R1²)] = Var(β̂1)

• By omitting x2 , in fact we get estimator with smaller


variance, even though it is biased (bias-variance tradeoff)

68
Two cases: y = β0 + β1 x1 + β2 x2 + u

• (1) If β2 ≠ 0, then
β̃1 is biased, β̂1 is unbiased, but Var (β̃1 ) < Var (β̂1 )

• (2) If β2 = 0, then
β̃1 and β̂1 are both unbiased and Var (β̃1 ) < Var (β̂1 )

• Case 2 is clear. If β2 = 0, x2 has no (partial) effect on y .


When x2 is correlated with x1 , including it along with x1
makes it more difficult to estimate partial effect of x1 . Simple
regression is clearly preferred

69
Case 1

• Case 1 is more difficult, but there is reason to prefer unbiased


estimator β̂1

• Bias in β̃1 does not systematically change with sample size.


We should assume bias is as large when n = 1000 as when
n = 10

• By contrast, variances of β̃1 and β̂1 both shrink at the rate


1/n. With large sample size, difference between Var (β̃1 ) and
Var (β̂1 ) is less important

70
Estimation of σ 2 and Standard error
• We still need to estimate σ 2 = Var (u). For multiple
regression, its unbiased estimator is
σ̂² = [1/(n − k − 1)] ∑_{i=1}^n ûi² = SSR/(n − k − 1)

• In Stata, square root σ̂ is reported as “Root MSE”

• Note: SSR falls when new regressor is added, but degrees of


freedom n − (k + 1) falls too. So σ̂ can increase or decrease
when new variable is added
• Standard error of slope β̂j is computed as

se(β̂j) = σ̂ / √[SSTj(1 − Rj²)]

• Critical to report this along with coefficient estimate
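
• A sketch verifying this formula against the Stata output on the next slide (WAGE2.dta; the reported se for educ is .007348):

. quietly reg lwage educ IQ exper
. scalar sighat = e(rmse)                   // sigma-hat (Root MSE)
. quietly reg educ IQ exper                 // auxiliary regression for educ
. scalar r2j = e(r2)                        // Rj^2 for educ
. quietly summarize educ
. scalar sstj = r(Var)*(r(N) - 1)           // SSTj = (n - 1) times sample variance
. display sighat/sqrt(sstj*(1 - r2j))       // matches the reported .007348 up to rounding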


71
Example: Wage equation

. reg lwage educ IQ exper

Source SS df MS Number of obs = 935


F( 3, 931) = 60.10
Model 26.876768 3 8.95892266 Prob > F = 0.0000
Residual 138.779515 931 .149065 R-squared = 0.1622
Adj R-squared = 0.1595
Total 165.656283 934 .177362188 Root MSE = .38609

lwage Coef. Std. Err. t P>|t| [95% Conf. Interval]

educ .057108 .007348 7.77 0.000 .0426875 .0715285


IQ .0057856 .0009797 5.91 0.000 .003863 .0077082
exper .0195249 .0032444 6.02 0.000 .0131579 .025892
_cons 5.198085 .1215426 42.77 0.000 4.959556 5.436614

72
5. Efficiency of OLS: Gauss-Markov
theorem
(Wooldridge, Ch. 3.5)

73
Efficiency of OLS

• Why do we use the OLS estimator β̂j, rather than some other estimation
method, say β̌j?

• First criterion to compare these estimators would be


unbiasedness. If β̌j is biased, we prefer OLS

• Suppose β̌j is unbiased. Then we prefer estimator with smaller


variance (called efficiency)

• Under MLR.1-5, OLS estimator has smallest variance in


certain class of estimators

74
Gauss-Markov theorem

• Terminology: Estimator β̃j of βj is called linear estimator if


it takes the form of
β̃j = ∑_{i=1}^n wij yi

where the weights wij (i = 1, . . . , n) can be any functions of regressors X

• OLS estimator β̂j can be written in this way, i.e., OLS


estimator is an example of linear estimator
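
• For example, in simple regression (k = 1), β̂1 = ∑_{i=1}^n (xi − x̄) yi / SSTx with SSTx = ∑_{i=1}^n (xi − x̄)², so the OLS slope is linear in the yi with weights wi1 = (xi − x̄)/SSTx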

• Gauss-Markov theorem: Under MLR.1-5, OLS estimator is


the Best Linear Unbiased Estimator (BLUE)

• Here “best” means “smallest variance”

75
• G-M theorem says: under MLR.1-5, if we take any linear
unbiased estimator β̃j , then conditional on X

Var (β̂j ) ≤ Var (β̃j )

for j = 0, . . . , k

• Recall β̂j is unbiased under MLR.1-4

• Implication of G-M theorem: If we insist on linear unbiased


estimators, then we need look no further than OLS

• See Appendix 3A for proof

• If MLR.5 fails, β̂j is not BLUE in general

76
