
Multivariate Regression

Econ 2560, Spring 2024

Prof. Josh Abel

(Chapters 6, 7.1)
Introduction

- So far, we have estimated means conditional on one X variable:

  E[Y|X] → E[Y|X_1]

  - Linear
  - Non-parametric
- Now we will consider conditioning on multiple variables:

  E[Y|X] → E[Y|X_1, X_2, ..., X_K]
Motivation

Why use multiple X variables?

- Because you can!
- Useful to have more information when estimating a mean
- May strengthen the causal interpretation of coefficients on individual variables
Quick discussion of causality

- Have not discussed causation yet
  - Will discuss in much more detail later
- For now, suffice to say that this probably does not give the right causal effect:

  E[Earnings|Education] = β_0^U + β_1^U · Education

- Why not?
  - Maybe more “connected” people get more schooling and have better access to jobs
  - Maybe “smarter” people find it easier to advance in school and excel at work
- In either case, people with more education will have higher earnings, even without any causal relationship
A multivariate regression function

E[Earnings_i | X_i] = β_0^M + β_1^M · Education_i + β_2^M · AFQT_i

OLS chooses β̂_0^M, β̂_1^M, and β̂_2^M such that:

  E[û_i · Education_i] = 0
  E[û_i · AFQT_i] = 0
  E[û_i] = 0

where û_i = Y_i − [β̂_0^M + β̂_1^M · Education_i + β̂_2^M · AFQT_i]
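
A minimal numpy sketch of these conditions on simulated data (the variable names mirror the slide, but every coefficient in the data-generating process is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Simulated data: all data-generating coefficients are made up
afqt = rng.uniform(0, 100, n)
educ = 9 + 0.05 * afqt + rng.normal(0, 2, n)
earnings = -50 + 6 * educ + 0.5 * afqt + rng.normal(0, 30, n)

# OLS via least squares; X includes a constant column
X = np.column_stack([np.ones(n), educ, afqt])
beta_hat, *_ = np.linalg.lstsq(X, earnings, rcond=None)
u_hat = earnings - X @ beta_hat

# The three (sample) moment conditions hold up to floating-point error
print(u_hat.mean())           # ≈ 0
print((u_hat * educ).mean())  # ≈ 0
print((u_hat * afqt).mean())  # ≈ 0
```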
Interpreting a multivariate regression function

E[Earnings_i | X_i] = β_0^M + β_1^M · Education_i + β_2^M · AFQT_i

- Let’s suppose AFQT is a perfect measure of how “smart” someone is
  - Not true...
- How can we interpret β_1^M?
“Holding constant” interpretation

E[Earnings_i | X_i] = β_0^M + β_1^M · Education_i + β_2^M · AFQT_i

- If AFQT stays constant but Education increases by 1 year, how much does expected Earnings increase?
  - β_1^M
- β_1^M is the (predictive) effect of Education “holding AFQT constant”
- Can think of it as a partial derivative from calculus:

  ∂E[Earnings_i | Education_i, AFQT_i] / ∂Education_i = β_1^M
“Holding constant” interpretation (2)

E[Earnings_i | X_i] = β_0^M + β_1^M · Education_i + β_2^M · AFQT_i

- Can we think of β_1^M as being the causal effect of education now?
  - Probably not
  - Education might have associations with factors other than AFQT that drive β_1^M
- Still, this is cleaner than the univariate case
  - At least β_1^M is (mostly) not being driven by AFQT
“Residual regression” interpretation

E[Earnings_i | Education_i, AFQT_i] = β̂_0^M + β̂_1^M · Education_i + β̂_2^M · AFQT_i

- Now consider the auxiliary regression of one regressor on another:

  E[Education_i | AFQT_i] = α̂_0 + α̂_1 · AFQT_i,

  and let û_i^{X1} be the residual from this regression.
- You can (but won’t have to) show the following:

  E[Earnings_i | û_i^{X1}] = κ_0 + β̂_1^M · û_i^{X1}
“Residual regression” interpretation (2)

E[Y_i | X_i] = β̂_0^M + β̂_1^M · X_{1i} + β̂_2^M · X_{2i}

- In words: β_1^M (or any other β^M) from a multivariate regression can be estimated as follows:
  - “Residualize” X_1 on all other regressors (call the residual û_i^{X1})
  - Regress Y on û_i^{X1}: the coefficient will be the same β_1^M from the equation above
- Key interpretation: β_1^M measures the effect on Y of the portion of X_1 that cannot be explained by the other variables
  - β_1^M is “identified off of” the “residual variation” of X_1
  - Similarly, β_2^M is identified off of the residual variation of X_2
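
A sketch of this residual-regression recipe (the Frisch–Waugh result) on simulated data; the data-generating parameters below are invented, and the point is only that the two estimates coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulated data (invented parameters, for illustration only)
afqt = rng.uniform(0, 100, n)
educ = 9 + 0.05 * afqt + rng.normal(0, 2, n)
earnings = -50 + 6 * educ + 0.5 * afqt + rng.normal(0, 30, n)

def ols(y, *regressors):
    """OLS coefficients, with a constant prepended automatically."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Multivariate regression: earnings on education and AFQT
b_multi = ols(earnings, educ, afqt)

# Step 1: residualize education on AFQT
a = ols(educ, afqt)
u_x1 = educ - (a[0] + a[1] * afqt)

# Step 2: regress earnings on the residualized education
b_resid = ols(earnings, u_x1)

# The two education coefficients agree up to floating-point error
print(b_multi[1], b_resid[1])
```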
Income and education regressions

Outcome: Annual earnings ($1,000s)

              Univariate    Multivariate
  Constant       -75.7         -53.2
  Education       10.1           6.7
  AFQT                           0.5
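
A quick worked check of the “holding constant” interpretation, plugging the multivariate estimates from the table into the fitted equation (a sketch; earnings are in $1,000s and the AFQT value is an arbitrary choice):

```python
# Fitted multivariate regression, using the estimates from the table above
def predicted_earnings(education, afqt):
    return -53.2 + 6.7 * education + 0.5 * afqt

# Holding AFQT fixed (here at the 50th percentile), one extra year of
# education raises predicted earnings by the education coefficient:
# 6.7, i.e. about $6,700
print(predicted_earnings(13, 50) - predicted_earnings(12, 50))
```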
Income and education

[Figure: Earnings by education. Scatter of annual earnings ($1,000s) against years of education (8–20), with the univariate fit; slope = 10.1]
Education by AFQT

[Figure: Education by AFQT. Years of education (8–20) against AFQT percentile (1989), with linear fit]
Residual regression

[Figure: Earnings by education (residualized). Annual earnings ($1,000s) against years of education residualized on AFQT; slope = 6.7, matching the multivariate education coefficient]
Omitted Variable “Bias”

- Frequently, we don’t have data on every variable we’d like
- For instance, suppose we don’t observe AFQT for this regression:

  E[Income_i | Education_i, AFQT_i] = β_0^M + β_1^M · Education_i + β_2^M · AFQT_i

- We then have no choice but to estimate this equation:

  E[Income_i | Education_i] = β_0^U + β_1^U · Education_i

- Want to consider whether β_1^U = β_1^M
  - Seems unlikely: β_1^U is estimated with all variation in Education, while β_1^M only uses residual variation
  - It turns out we can be more precise about this.
Omitted Variable “Bias” (2)

E[Income_i | Education_i, AFQT_i] = β_0^M + β_1^M · Education_i + β_2^M · AFQT_i

E[Income_i | Education_i] = β_0^U + β_1^U · Education_i

- Consider the following auxiliary regression:

  E[AFQT_i | Education_i] = δ_0 + δ_1 · Education_i

- It then follows (by the law of iterated expectations) that:

  E[Income_i | Education_i] = β_0^M + β_1^M · Education_i + β_2^M · E[AFQT_i | Education_i]
                            = β_0^M + β_1^M · Education_i + β_2^M · (δ_0 + δ_1 · Education_i)
                            = (β_0^M + β_2^M · δ_0) + (β_1^M + β_2^M · δ_1) · Education_i

- So β_1^U = β_1^M + β_2^M · δ_1 ≠ β_1^M (typically)


Omitted Variable “Bias” results

β_1^U = β_1^M + β_2^M · δ_1

- “Bias” from omitting AFQT (β_1^U − β_1^M) is δ_1 · β_2^M
- Breakdown:

                              AFQT–Education          AFQT–Education
                              relation is positive    relation is negative
                              (δ_1 > 0)               (δ_1 < 0)
  AFQT increases income
  (β_2^M > 0)                 β_1^U > β_1^M           β_1^U < β_1^M
  AFQT decreases income
  (β_2^M < 0)                 β_1^U < β_1^M           β_1^U > β_1^M
Omitted Variable “Bias” example

                 Earnings        Earnings          AFQT
                 (univariate)    (multivariate)    (on Education)
  Constant         -75.68          -53.17            -41.83
  Education         10.06            6.73              6.17
  AFQT                               0.54

The third column is the auxiliary regression of AFQT on Education, so δ̂_1 = 6.17.

- 10.06 = 6.73 + 0.54 · 6.17
- β_1^U = β_1^M + β_2^M · δ_1
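
A one-line check of the OV“B” identity using the numbers from the table (values copied from the slide):

```python
b1_multi = 6.73  # education coefficient, multivariate earnings regression
b2_multi = 0.54  # AFQT coefficient, multivariate earnings regression
delta1 = 6.17    # education coefficient, auxiliary AFQT regression

# beta_1^U = beta_1^M + beta_2^M * delta_1
print(b1_multi + b2_multi * delta1)  # ≈ 10.06, the univariate coefficient
```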
Omitted Variable “Bias” discussion

- The OV“B” result is extremely useful in practice!
  - Even though you don’t observe β_2^M or δ_1, you can sometimes use the OV“B” logic to get a sense of whether your “bias” is positive or negative
- “Bias” is a loaded word
  - Sometimes, the correct regression is the one that omits a particular variable!
Sidebar: R²

- R² is a measure of how well a model fits the data
- It answers the following question: “How much closer to the actual data (Y_i) do I get by using my fitted values (Ŷ_i) than if I had just guessed the sample mean (Ȳ)?”
R² visual

[Figure: Scatter of Y against X, comparing the linear fit to the unconditional mean Ȳ]
R² formula

  TSS = Σ_{i=1}^{N} (Y_i − Ȳ)²

  SSR = Σ_{i=1}^{N} (Y_i − Ŷ_i)² = Σ_{i=1}^{N} û_i²

  R² = 1 − SSR / TSS

- Share of squared deviations from the mean that we eliminate by using fitted values rather than the mean
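
A minimal numpy sketch of these formulas on simulated data (the data-generating process is invented; any fitted model would do):

```python
import numpy as np

def r_squared(y, y_hat):
    """R² = 1 - SSR/TSS."""
    tss = np.sum((y - y.mean()) ** 2)
    ssr = np.sum((y - y_hat) ** 2)
    return 1 - ssr / tss

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 1 + 2 * x + rng.normal(size=500)

# Fitted values from a univariate OLS regression of y on x
X = np.column_stack([np.ones_like(x), x])
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
print(r_squared(y, y_hat))  # well above 0: the fit beats guessing the mean
```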
Comments on R²

- R² is a measure of in-sample predictive accuracy
  - OLS maximizes R²
  - R² ∈ [0, 1]; in a univariate regression, R² = ρ², the squared correlation between X and Y
- Measures tightness of the relationship, not the slope
- It is not a summary measure of how “good” the model is
  - Can easily come up with inane models with high R²s
  - Can only compare R² across models if they have the same Y_i
- Adding variables to the model will always (weakly) increase R²
  - If a variable were truly useless, the model could give it a coefficient of 0, so R² wouldn’t change
  - If there is any predictive content, R² will increase
Adjusted R²

- Because of the problems with R², economists sometimes refer to “adjusted R²”:

  R̄² = 1 − [(n − 1) / (n − k − 1)] · (SSR / TSS),

  where k is the number of regressors.
- Recall R² = 1 − SSR/TSS, so if n ≫ k, then R̄² ≈ R²
- R̄² is like R², but it punishes models with many regressors
  - If you add a useless variable, R² will increase but R̄² will probably fall (see the sketch below)
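
A sketch contrasting the two measures when a useless regressor is added (simulated data; “junk” is pure noise by construction, so R² ticks up mechanically while R̄² will usually fall):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
junk = rng.normal(size=n)  # unrelated to y by construction
y = 1 + 2 * x + rng.normal(size=n)

def r2_and_adjusted(y, k, *regressors):
    """R² and adjusted R² for an OLS fit with k regressors plus a constant."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ssr / tss
    r2_bar = 1 - (len(y) - 1) / (len(y) - k - 1) * ssr / tss
    return r2, r2_bar

print(r2_and_adjusted(y, 1, x))        # baseline
print(r2_and_adjusted(y, 2, x, junk))  # R² rises slightly; R̄² likely falls
```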
Multicollinearity

Consider the following regression model:

  Income_i = β̂_0 + β̂_1 · CurrentYear_i + β̂_2 · Age_i + β̂_3 · BirthYear_i + û_i

- Modeling income as a linear trend in age, time, and birth cohort
- Note that CurrentYear_i = Age_i + BirthYear_i
- This model cannot be estimated!
- In a model with perfect “multicollinearity” – one regressor is a linear combination of the other regressors – the coefficients are not uniquely identified
[Figure: Multicollinearity example]
Intuition for multicollinearity

  Income_i = β̂_0 + β̂_1 · CurrentYear_i + β̂_2 · Age_i + β̂_3 · BirthYear_i + û_i

- Interpretation 1: holding constant
  - You can’t look at a change in CurrentYear while keeping Age and BirthYear constant
  - If CurrentYear changes, one of the others must as well!
- Interpretation 2: residual regression
  - A regression of CurrentYear on Age and BirthYear will explain CurrentYear perfectly; all residuals will be 0 (i.e., R² = 1)
  - Therefore, there is no residual variation to identify the effect of CurrentYear on Income → the SE will be ∞
Intuition for multicollinearity

[Figure: Density of Year (demeaned) vs. Year (not predicted by age and birth year); the portion of Year not predicted by age and birth year collapses to a point mass at zero]
Intuition for multicollinearity (2)

  Income_i = β̂_0 + β̂_1 · CurrentYear_i + β̂_2 · Age_i + β̂_3 · BirthYear_i + û_i

  CurrentYear_i = Age_i + BirthYear_i

- The regression equation can be rewritten as:

  Income_i = β̂_0 + (β̂_2 + β̂_1) · Age_i + (β̂_3 + β̂_1) · BirthYear_i + û_i

- The data pin down only the combined coefficients: say (β̂_2 + β̂_1) = 3 and (β̂_3 + β̂_1) = 2. Each of these sets of parameters is consistent with those values (as are many others):
  - β̂_1 = 1, β̂_2 = 2, β̂_3 = 1
  - β̂_1 = 2, β̂_2 = 1, β̂_3 = 0
- I.e., we don’t know how to pick a single solution
- The model is asking something nonsensical, and so OLS fails
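
A sketch of this failure in numpy (invented data; CurrentYear is an exact linear combination by construction):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000
age = rng.integers(25, 65, n).astype(float)
birth_year = rng.integers(1950, 1990, n).astype(float)
current_year = age + birth_year  # exact linear combination
income = 30 + 1.0 * age + rng.normal(0, 10, n)

X = np.column_stack([np.ones(n), current_year, age, birth_year])

# The design matrix is rank-deficient (4 columns, rank 3), so the
# normal equations have no unique solution
print(np.linalg.matrix_rank(X))  # 3

# lstsq still returns an answer: the minimum-norm solution, which is just
# one arbitrary point among the infinitely many that fit equally well
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
print(beta)
```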
Multicollinearity in practice

- Your software will tell you if you have perfect multicollinearity
  - You can “fix” it by removing one of the variables
  - This often comes up with “indicator variables”, which we’ll discuss soon
  - Before you proceed, you should pause to think through whether your model makes sense...
- Subtler issues arise with high-but-imperfect multicollinearity
  - Suppose your regressors are Age, YearsEducation, YearsWorking
  - There is a very tight relationship between these 3 variables, though not a perfect one, because some people take gap years, maternity leave, unemployment spells, etc.
  - There will be little residual variation and likely high SEs!
  - May want to consider dropping a variable
“OVB for SEs”

- Adding/removing a variable from a regression doesn’t just change the point estimates of the other coefficients
  - It changes the standard errors, too!
- In the case of homoskedasticity, there are two countervailing effects:
  - An additional variable reduces the residuals and therefore shrinks SEs
  - An additional variable reduces the amount of residual variation in X and increases SEs
- The overall impact on SEs is ambiguous: it depends on the strength of those two effects
- With heteroskedasticity, this stark tradeoff is not technically guaranteed, but in practice it still typically holds
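
A sketch of the tradeoff under homoskedasticity (simulated data; here the added control x2 is highly correlated with x1 but has no effect of its own, so adding it inflates the SE on x1):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2_000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)  # highly correlated with x1
y = 1 + 2 * x1 + rng.normal(size=n)      # x2 has no effect of its own

def se_on_x1(y, *regressors):
    """Homoskedastic SE of the coefficient on the first regressor."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ beta
    sigma2 = (u @ u) / (len(y) - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)     # classical variance matrix
    return np.sqrt(cov[1, 1])

print(se_on_x1(y, x1))      # without the control
print(se_on_x1(y, x1, x2))  # with it: larger, since x2 soaks up x1's
                            # variation while adding little explanatory power
```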
“OVB for SEs” results

                           New X is very            New X is not very
                           correlated with old X    correlated with old X
  New X is a weak          New X will increase
  predictor of Y           the SE on old X          Ambiguous
  New X is a strong                                 New X will decrease
  predictor of Y           Ambiguous                the SE on old X
“OVB for SEs” example

Outcome: Δ GDP

                                 (1)       (2)       (3)
  Δ G                           -0.87     -0.96     -0.83
  (SE)                          (0.89)    (1.04)    (0.82)
  Lagged Δ G                               0.29
  (SE)                                    (0.76)
  Lagged Δ GDP                                      -0.14
  (SE)                                              (0.30)
  Variance of û                 20,322    20,287    19,927
  Residual variation of Δ G        442       401       442

Note: constant term not shown

Tip: Adding the lagged outcome variable is often a good trick – good explanatory power, often not as correlated with the other regressors
“OVB for SEs” takeaways

- Takeaway: when considering controls, don’t just think about bias – SEs matter, too!
  - A control that addresses bias might blow up the SE and make the regression useless (if the Xs are highly correlated)
  - A control variable that doesn’t address bias might reduce the SE and be super-useful
