
Lecture 4

Introduction: Multiple Regression

Dragos Radu
[email protected]

5SSMN932: Introduction to Econometrics


outline lecture 4

• motivation for multiple regression


• omitted variable bias (OVB)
• multiple regression: interpretation and inference
• variables of interest versus control: rescued by the CIA?
Recommended readings:
Stock and Watson, chapter: 6

What comes next?


• assumptions - multicollinearity
• hypothesis tests concerning (multiple) coefficients
Gauss-Markov conditions
simple linear regression (SLR)

SLR.1: y = β0 + β1·x + u
SLR.2: random sampling from the population
SLR.3: some sample variation in the xi
SLR.4: E(u|x) = 0
SLR.5: Var(u|x) = Var(u) = σ²
• under these assumptions the OLS estimator has the smallest variance
among all linear unbiased estimators and is therefore BLUE (Best Linear
Unbiased Estimator). This is the Gauss-Markov theorem.
back to our question

TestScore = β0 + β1·STR + u

• the error u arises because of factors, or variables, that influence Y but


are not included in the regression function.
• there are always omitted variables.
• when and why does the omission of those variables lead to bias in the
OLS estimator?
simple regression result

. reg testscr str

Source | SS df MS Number of obs = 420


-------------+---------------------------------- F(1, 418) = 22.58
Model | 7794.11004 1 7794.11004 Prob > F = 0.0000
Residual | 144315.484 418 345.252353 R-squared = 0.0512
-------------+---------------------------------- Adj R-squared = 0.0490
Total | 152109.594 419 363.030056 Root MSE = 18.581

------------------------------------------------------------------------------
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -2.279808 .4798256 -4.75 0.000 -3.22298 -1.336637
_cons | 698.933 9.467491 73.82 0.000 680.3231 717.5428
------------------------------------------------------------------------------
do smaller classes promote learning?
can an omitted variable affect our result?
immigrants in California
English learners in Californian schools
do smaller classes promote learning?
class size and % of English learners
your turn: cross tabulation in Stata
overview: where are we going from here?

• we want to construct the same table in Stata


• the aim is to practise Stata and to introduce the basic intuition for
multiple regression
You can construct the table in three steps (a sketch of the corresponding Stata commands follows below):
1 break districts down into four categories that correspond to the
quartiles of the distribution of the % of English learners
2 within each of these four categories we further break down the
districts into two groups: (i) with average classes smaller than 20 and
(ii) with average classes equal to or larger than 20 students per teacher.
3 we use these eight groups to describe the relationship between class
size and test scores - replicate Table 6.1 (page 216) in your textbook
the data (caschool.dta) and an annotated do-file are on KEATS
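For reference, a minimal Stata sketch of the three steps (assuming caschool.dta with the variables testscr, str and el_pct used elsewhere in this lecture; the annotated do-file on KEATS is the authoritative version):

use caschool.dta, clear
* step 1: quartiles of the distribution of the % of English learners
xtile el_quartile = el_pct, nq(4)
* step 2: small vs large average classes, using the 20 students per teacher cut-off
generate small = (str < 20)
* step 3: mean test score in each of the eight groups (compare with Table 6.1)
tabulate el_quartile small, summarize(testscr) means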
Lecture 4
Part I: Omitted Variable Bias

Dragos Radu
[email protected]

5SSMN932: Introduction to Econometrics


outline lecture 4 part 1

• the conditions for omitted variable bias (OVB)


• definition of OVB
• the OVB formula

What comes next:


In the next part (two) of this lecture we’ll discuss the interpretation of
coefficients in multiple regression.

Stata example:
Determine OVB in our simple regression of test score on class size
(separate video).
test scores and class size
is the % of English learners a confounder?
two conditions for OVB
class size and % of English learners
why does OVB arise?
we would need to think about a second variable in our regression
how can we assess the OVB
% of English learners: a determinant of TestScr and related to STR

short regression: TestScr = β0^s + β1^s·STR + u^s

long regression: TestScr = β0^l + β1^l·STR + β2^l·PctEL + u^l

knowing that: PctEL = δ0 + δ1·STR + ν

we can re-write TestScr in the long regression as:

TestScr = β0^l + β1^l·STR + β2^l·(δ0 + δ1·STR + ν) + u^l
        = (β0^l + β2^l·δ0) + (β1^l + β2^l·δ1)·STR + (β2^l·ν + u^l)

where the term (β1^l + β2^l·δ1) is our β1 in the short regression
defining OVB
short regression: TestScr = β0^s + β1^s·STR + u^s

long regression: TestScr = β0^l + β1^l·STR + β2^l·PctEL + u^l

TestScr = (β0^l + β2^l·δ0) + (β1^l + β2^l·δ1)·STR + (β2^l·ν + u^l)
          [= β0 in short]    [= β1 in short]        [= u in short]

all coefficients are biased if we regress TestScr on STR alone

defining OVB
short regression: TestScr = β0^s + β1^s·STR + u^s

long regression: TestScr = β0^l + β1^l·STR + β2^l·PctEL + u^l

omitted variable bias is:

OVB = coefficient in short − coefficient in long

if we subtract β1 in long from β1 in short:

OVB = β1^s − β1^l = δ1·β2^l
    = {relationship between omitted var and var of interest}
      × {effect of omitted var in long}
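The OVB formula can be checked directly in Stata (a minimal sketch, not the do-file from the separate video; the numbers in the comments quote the regression output shown in this lecture):

quietly regress testscr str
scalar b1_short = _b[str]          // about -2.28
quietly regress testscr str el_pct
scalar b1_long = _b[str]           // about -1.10
scalar b2_long = _b[el_pct]        // about -0.65
quietly regress el_pct str         // auxiliary regression of the omitted variable on STR
scalar d1 = _b[str]
display b1_short - b1_long         // coefficient in short minus coefficient in long
display d1 * b2_long               // delta1 x beta2(long): the same value, about -1.18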
Venn diagram: multiple regression to solve OVB
Ballantine-Venn diagram
recap: OVB

OVB = {relationship between omitted var and var of interest}
      × {effect of omitted var in long}

β̂1 →p β1 + ρXu · (σu / σX)   (convergence in probability as n grows)

if an omitted variable Z is both:
1. a determinant of Y (that is, it is contained in u); and
2. correlated with X

then ρXu ≠ 0 and the OLS estimator β̂1 is biased and is not consistent
OVB in our simple regression
TestScore = β0 + β1·STR + u

β̂1 →p β1 + ρXu · (σu / σX)

In our test score example:
1. English language ability (whether the student has English as a second
language) plausibly affects standardized test scores:
Z is a determinant of Y.
2. Immigrant communities tend to be less affluent and thus have smaller
school budgets and higher STR:
Z is correlated with X.

• accordingly, β̂1 is biased. What is the direction of the bias?


• try to answer this question before watching the Stata example
(separate video)
what comes next?

• part 2 of lecture 4: interpretation of multiple regression coefficients


Lecture 4
Part II: The Multiple Regression Model

Dragos Radu
[email protected]

5SSMN932: Introduction to Econometrics


outline lecture 4 part 2

• the multiple regression model


• multiple regression: interpretation
• measures of fit in multiple regression
Recommended readings for this part:
Stock and Watson, chapter: 6.2-4

What comes next?


• assumptions - multicollinearity
• hypothesis tests in multiple regression (next week)
three ways to overcome OVB

1 run a randomized controlled experiment in which treatment (STR ) is


randomly assigned: then PctEL is still a determinant of TestScore,
but PctEL is uncorrelated with STR. (This solution is rarely feasible.)
2 adopt the “cross tabulation” approach, with finer gradations of STR
and PctEL – within each group, all classes have the same PctEL, so
we control for PctEL (But soon you will run out of data, and what
about other determinants like family income and parental education?)
3 use a regression in which the omitted variable (PctEL) is no longer
omitted: include PctEL as an additional regressor in a multiple
regression:
TestScore = β0 + β1·STR + β2·PctEL + u    (1)
from Galton to...

... George Udny Yule

Galton only used one variable and one control

we use regression to make comparisons more equal - multiple regression
the use of multiple regression was pioneered by Yule
Yule and multiple regression (1911)
“holding other factors fixed”

• the beauty of multiple regression is that it gives us the ceteris paribus
interpretation without having to find two districts with the same value
of PctEL that differ in class size by one student per teacher.
• the estimation method does that for us.
• it is able to do it because we assume a particular relationship,
in this case

TestScore = β0 + β1·STR + β2·PctEL + u    (2)

motivation for multiple regression

with this we extend the TestScore equation we used for simple regression:

TestScore = β0 + β1·STR + β2·PctEL + u    (3)

• we are primarily interested in β1, but β2 is of some interest, too.
• by explicitly including PctEL in the equation, we have taken it out of
the error term.
• this may lead to a more persuasive estimate of the causal effect of
class size on test scores.
simple regression result
TestScore = β0 + β1·STR + u

. reg testscr str

Source | SS df MS Number of obs = 420


-------------+---------------------------------- F(1, 418) = 22.58
Model | 7794.11004 1 7794.11004 Prob > F = 0.0000
Residual | 144315.484 418 345.252353 R-squared = 0.0512
-------------+---------------------------------- Adj R-squared = 0.0490
Total | 152109.594 419 363.030056 Root MSE = 18.581

------------------------------------------------------------------------------
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -2.279808 .4798256 -4.75 0.000 -3.22298 -1.336637
_cons | 698.933 9.467491 73.82 0.000 680.3231 717.5428
------------------------------------------------------------------------------
multiple regression result
TestScore = β0 + β1·STR + β2·PctEL + u

. reg testscr str el_pct

Source | SS df MS Number of obs = 420


-------------+---------------------------------- F(2, 417) = 155.01
Model | 64864.3011 2 32432.1506 Prob > F = 0.0000
Residual | 87245.2925 417 209.221325 R-squared = 0.4264
-------------+---------------------------------- Adj R-squared = 0.4237
Total | 152109.594 419 363.030056 Root MSE = 14.464

------------------------------------------------------------------------------
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -1.101296 .3802783 -2.90 0.004 -1.848797 -.3537945
el_pct | -.6497768 .0393425 -16.52 0.000 -.7271112 -.5724423
_cons | 686.0322 7.411312 92.57 0.000 671.4641 700.6004
------------------------------------------------------------------------------
the multiple linear regression model

Yi = β0 + β1·X1i + β2·X2i + ui

• Y is the dependent variable
• X1, X2 are the two independent variables (regressors)
• (Yi, X1i, X2i) denote the i-th observation on Y, X1 and X2
• β0 = unknown population intercept
• β1 = effect on Y of a change in X1, holding X2 constant
• β2 = effect on Y of a change in X2, holding X1 constant
• ui = the regression error (omitted factors)
interpretation of multiple regression

Y = β0 + β1·X1 + β2·X2 + u
consider changing X1 by ΔX1 while holding X2 constant:
• before the change: Y = β0 + β1·X1 + β2·X2
• after the change: Y + ΔY = β0 + β1·(X1 + ΔX1) + β2·X2
• taking the difference (after minus before): ΔY = β1·ΔX1

β1 = ΔY/ΔX1, holding X2 constant

β2 = ΔY/ΔX2, holding X1 constant

β0 = predicted value of Y when X1 = X2 = 0


multiple regression result
TestScore = β0 + β1·STR + β2·PctEL + u

. reg testscr str el_pct

Source | SS df MS Number of obs = 420


-------------+---------------------------------- F(2, 417) = 155.01
Model | 64864.3011 2 32432.1506 Prob > F = 0.0000
Residual | 87245.2925 417 209.221325 R-squared = 0.4264
-------------+---------------------------------- Adj R-squared = 0.4237
Total | 152109.594 419 363.030056 Root MSE = 14.464

------------------------------------------------------------------------------
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -1.101296 .3802783 -2.90 0.004 -1.848797 -.3537945
el_pct | -.6497768 .0393425 -16.52 0.000 -.7271112 -.5724423
_cons | 686.0322 7.411312 92.57 0.000 671.4641 700.6004
------------------------------------------------------------------------------

predicted TestScore = 686.0 − 1.10·STR − 0.65·PctEL
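As a worked example of the ceteris paribus interpretation: using these estimates, cutting the student-teacher ratio by 2 while holding PctEL constant changes the predicted test score by ΔTestScore = (−1.10) × (−2) = 2.2 points; the simple regression without the PctEL control would instead have predicted a change of (−2.28) × (−2) ≈ 4.6 points.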
describing qualitative information

• how do we describe binary qualitative information?
(woman vs man; listed vs non-listed companies...)
• define a binary variable (or dummy, or zero-one variable); see the one-line Stata sketch after this list.
• decide which outcome is assigned 0 and which is 1.
(hint: choose the variable name to be informative)
• e.g., to indicate gender, woman (=1 for women and =0 for men) is
better than gender (unclear what gender = 1 corresponds to)
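For instance, a one-line Stata sketch (the variable and value names here are hypothetical, not from a dataset used in this lecture):

generate woman = (sex == "female")   // woman = 1 for women, 0 for men, assuming a string variable sex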
using the data set gpa2.dta

. list colgpa sat hsperc athlete female in 1/10


+---------------------------------------------+
| colgpa sat hsperc athlete female |
|---------------------------------------------|
1. | 3 810 66.66667 0 1 |
2. | 3.41 1110 96.2963 0 0 |
3. | 1.39 950 21.31148 0 0 |
4. | 3.75 1260 85.18519 0 0 |
5. | 2.84 870 54.05405 0 1 |
|---------------------------------------------|
6. | 3.61 1020 78.78788 0 1 |
7. | 2 860 79.62963 0 1 |
8. | 2.86 1150 81.81818 0 0 |
9. | 2.7 860 68.18182 1 0 |
10. | 1.65 820 32.78689 0 0 |
+---------------------------------------------+
single dummy independent variable

what would it mean to specify a simple regression model where the
explanatory variable is binary?

wage = β0 + δ0·female + u

where we assume that:

E(u|female) = 0

then:
E(wage|female) = β0 + δ0·female
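Since female only takes the values 0 and 1, this means E(wage|female = 0) = β0 for men and E(wage|female = 1) = β0 + δ0 for women, so δ0 is simply the difference in average wages between women and men.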
regression with dummy independent variable
wage = β0 + δ0·female

. reg wage female

Source | SS df MS Number of obs = 750


-------------+------------------------------ F( 1, 748) = 93.29
Model | 2334.06601 1 2334.06601 Prob > F = 0.0000
Residual | 18714.8844 748 25.019899 R-squared = 0.1109
-------------+------------------------------ Adj R-squared = 0.1097
Total | 21048.9504 749 28.1027376 Root MSE = 5.002

------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | -3.531853 .3656696 -9.66 0.000 -4.249714 -2.813992
_cons | 12.34696 .2643634 46.70 0.000 11.82797 12.86594
------------------------------------------------------------------------------

the estimated difference is very large:

women earn about 3.53 less than men per hour, on average.
comparison of means
wage = β0 + δ0·female

the simple regression allows us a comparison of means, where the null
hypothesis is:
H0: µfemale = µmale

the t statistic and the confidence interval are directly reported:

t_female = −9.66

which is a very strong rejection of H0.


. reg wage female

------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | -3.531853 .3656696 -9.66 0.000 -4.249714 -2.813992
_cons | 12.34696 .2643634 46.70 0.000 11.82797 12.86594
------------------------------------------------------------------------------
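The same comparison of means can be obtained with Stata's two-sample t test (a sketch; with its default equal-variance assumption it reproduces, up to sign, the t statistic on female from the regression above):

ttest wage, by(female)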
comparison of means
wage = β0 + δ0·female

the estimate δ̂0 = −3.53 does not control for factors that should affect
wage, such as workforce experience and schooling, which could explain the
difference in average wages.
if we just control for experience, the model in expected values is:

E(wage|female, exper) = β0 + δ0·female + β1·exper

where now δ0 measures the gender difference when we hold exper fixed.
dummy in multiple regression
wage = β0 + δ0·female + β1·exper

δ0 = E(wage|female, exper0) − E(wage|male, exper0)


. reg wage female exper

Source | SS df MS Number of obs = 750


-------------+------------------------------ F( 2, 747) = 59.45
Model | 2890.1896 2 1445.0948 Prob > F = 0.0000
Residual | 18158.7608 747 24.3089168 R-squared = 0.1373
-------------+------------------------------ Adj R-squared = 0.1350
Total | 21048.9504 749 28.1027376 Root MSE = 4.9304
------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | -2.987036 .3780069 -7.90 0.000 -3.729118 -2.244954
exper | .3330561 .0696329 4.78 0.000 .1963566 .4697555
_cons | 8.642637 .8171341 10.58 0.000 7.038484 10.24679
------------------------------------------------------------------------------

we impose a common slope on exper for men and women (β1 = .333 in this example);
only the intercepts are allowed to differ.
intercept shift
[figure: graph of wage = β0 + δ0·female + β1·exper for δ0 < 0; two parallel fitted lines in
(exper, wage) space, predicted wage for men and predicted wage for women, each with
slope = .333, the men's line lying above the women's by a constant difference of 2.99]
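A minimal Stata sketch (assumed, not the lecture's do-file; wage_hat is an arbitrary name) of how a figure like this can be reproduced from the same regression:

regress wage female exper
predict wage_hat                    // fitted values from the common-slope model
twoway (line wage_hat exper if female == 0, sort) ///
       (line wage_hat exper if female == 1, sort), ///
       legend(order(1 "predicted wage (men)" 2 "predicted wage (women)"))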


multiple regression

it is easy to add other variables, like coll (college education)


. reg wage female exper coll

Source | SS df MS Number of obs = 750


-------------+------------------------------ F( 3, 746) = 85.47
Model | 5384.33651 3 1794.77884 Prob > F = 0.0000
Residual | 15664.6139 746 20.998142 R-squared = 0.2558
-------------+------------------------------ Adj R-squared = 0.2528
Total | 21048.9504 749 28.1027376 Root MSE = 4.5824

------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | -2.457225 .3546709 -6.93 0.000 -3.153497 -1.760954
exper | .4158217 .0651616 6.38 0.000 .2878998 .5437436
coll | .8004933 .0734492 10.90 0.000 .6563015 .944685
_cons | 5.785301 .803433 7.20 0.000 4.208042 7.36256
------------------------------------------------------------------------------
goodness of fit
SER and RootMSE

as in regression with a single regressor, the SER and the RootMSE are
measures of the spread of the Ys around the regression line:

SER = sqrt[ (1/(n − k − 1)) · Σ(i=1..n) ûi² ]

RootMSE = sqrt[ (1/n) · Σ(i=1..n) ûi² ]

for SER we apply a correction for the degrees of freedom, i.e. with n
observations and k regressors we have n − k − 1 degrees of freedom, as we
estimate k slope coefficients and the intercept.
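As a check against the multiple regression output shown in this lecture (SSR = 87245.29, n = 420, k = 2): the degrees-of-freedom-corrected formula gives sqrt(87245.29/417) ≈ 14.46, which is the figure Stata reports as Root MSE, while the 1/n version gives sqrt(87245.29/420) ≈ 14.41.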
more on goodness of fit
R² and R̄² (adjusted R²)

R² = ESS/TSS = 1 − SSR/TSS

We need to think about a different goodness-of-fit measure because the
usual R² never decreases (and typically increases) when one or more
variables are added to a regression.
Sometimes we want to compare across models that have different numbers
of explanatory variables but where one is not a special case of the other.
It is useful to have a goodness-of-fit measure that penalizes adding
additional explanatory variables. (The usual R² has no penalty.)
more on goodness of fit: adjusted R²

• the R̄² “penalises” us for including another regressor.
• R̄² does not necessarily increase when we add another regressor.
• the adjusted R-squared, also called “R-bar-squared”:

R̄² = 1 − [SSR/(n − k − 1)] / [TSS/(n − 1)]

• when more regressors are added, SSR falls, but so do the degrees of
freedom, df = n − k − 1. R̄² can increase or decrease.
• for k ≥ 1, R̄² < R² unless SSR = 0 (not an interesting case).
In addition, it is possible that R̄² < 0, especially if df is small.
Remember that R² ≥ 0 always.
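As a check against the multiple regression of testscr on str and el_pct shown in this lecture: R² = 1 − 87245.29/152109.59 ≈ 0.4264 and R̄² = 1 − (87245.29/417)/(152109.59/419) ≈ 0.4237, matching the R-squared and Adj R-squared reported by Stata.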
simple regression result
TestScore = β0 + β1·STR + u

. reg testscr str

Source | SS df MS Number of obs = 420


-------------+---------------------------------- F(1, 418) = 22.58
Model | 7794.11004 1 7794.11004 Prob > F = 0.0000
Residual | 144315.484 418 345.252353 R-squared = 0.0512
-------------+---------------------------------- Adj R-squared = 0.0490
Total | 152109.594 419 363.030056 Root MSE = 18.581

------------------------------------------------------------------------------
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -2.279808 .4798256 -4.75 0.000 -3.22298 -1.336637
_cons | 698.933 9.467491 73.82 0.000 680.3231 717.5428
------------------------------------------------------------------------------
multiple regression result
TestScore = β0 + β1·STR + β2·PctEL + u

. reg testscr str el_pct

Source | SS df MS Number of obs = 420


-------------+---------------------------------- F(2, 417) = 155.01
Model | 64864.3011 2 32432.1506 Prob > F = 0.0000
Residual | 87245.2925 417 209.221325 R-squared = 0.4264
-------------+---------------------------------- Adj R-squared = 0.4237
Total | 152109.594 419 363.030056 Root MSE = 14.464

------------------------------------------------------------------------------
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -1.101296 .3802783 -2.90 0.004 -1.848797 -.3537945
el_pct | -.6497768 .0393425 -16.52 0.000 -.7271112 -.5724423
_cons | 686.0322 7.411312 92.57 0.000 671.4641 700.6004
------------------------------------------------------------------------------
what comes next?

• part 3 of lecture 4: assumptions in multiple regression (SW ch. 6.5)


• before that: intuition and interpretation of multiple regression
(practical example in separate video)
Lecture 4
Part III: Assumptions for the Multiple Regression Model

Dragos Radu
[email protected]

5SSMN932: Introduction to Econometrics


outline lecture 4 part 3

• least squares assumptions for causal inference in multiple regression


• multicollinearity
• control variables and conditional mean independence
Recommended readings for this part:
Stock and Watson, chapter: 6.5-8

What comes next?


• hypothesis tests in multiple regression (next week)
multiple linear regression assumptions

Yi = β0 + β1·X1i + β2·X2i + · · · + βk·Xki + ui,   i = 1, . . . , n

1 The conditional distribution of u given the X's has mean zero, that is,
E(u|X1, ..., Xk) = E(u) = 0
2 random sampling from the population
3 large outliers are unlikely
4 no perfect multicollinearity
no perfect multicollinearity

Perfect multicollinearity is when one of the regressors is an exact linear


function of the other regressors.
. reg testscr str str
note: str omitted because of collinearity

------------------------------------------------------------------------------
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -2.279808 .4798256 -4.75 0.000 -3.22298 -1.336637
str | 0 (omitted)
_cons | 698.933 9.467491 73.82 0.000 680.3231 717.5428
------------------------------------------------------------------------------

in such a regression we would ask: what is the effect on TestScore of a unit change in STR,
holding STR constant??? (a logical impossibility)
perfect multicollinearity
dummy variable trap

. gen small=str<20
. gen large=1-small

. reg testscr small large


note: large omitted because of collinearity

------------------------------------------------------------------------------
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
small | 7.37241 1.843475 4.00 0.000 3.748774 10.99605
large | 0 (omitted)
_cons | 649.9788 1.387717 468.38 0.000 647.2511 652.7066
------------------------------------------------------------------------------
dummy variable trap

suppose you have a set of multiple binary (dummy) variables, which are
mutually exclusive and exhaustive – that is, there are multiple categories
and every observation falls in one and only one category
(e.g. small or large):
• if you include all these dummy variables and a constant, you will have
perfect multicollinearity – this is the dummy variable trap.
• why is there perfect multicollinearity here?
• solution to the dummy variable trap: omit one of the groups
(e.g. large)
• how do we interpret the coefficients?
perfect multicollinearity
dummy variable trap

. gen small=str<20
. gen large=1-small

. reg testscr small large


note: large omitted because of collinearity

------------------------------------------------------------------------------
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
small | 7.37241 1.843475 4.00 0.000 3.748774 10.99605
large | 0 (omitted)
_cons | 649.9788 1.387717 468.38 0.000 647.2511 652.7066
------------------------------------------------------------------------------

. reg testscr small

------------------------------------------------------------------------------
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
small | 7.37241 1.843475 4.00 0.000 3.748774 10.99605
_cons | 649.9788 1.387717 468.38 0.000 647.2511 652.7066
------------------------------------------------------------------------------
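Reading the last regression above: the intercept (≈ 650.0) is the average test score in districts with large classes (small = 0), and the coefficient on small (≈ 7.37) is the difference in average test scores between the two groups of districts.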
dummy variable trap

• perfect multicollinearity usually reflects a mistake in the definitions of


the regressors, or an oddity in the data
• if you have perfect multicollinearity, your statistical software will let
you know – either by crashing or giving an error message or by
“dropping” one of the variables arbitrarily
• the solution to perfect multicollinearity is to modify your list of
regressors so that you no longer have perfect multicollinearity.
control variables and conditional mean independence

• we want to get an unbiased estimate of the effect on test scores of
changing class size, holding constant factors such as outside learning
opportunities, parental involvement in education, etc.
• if we could run an experiment, we would randomly assign students
(and teachers) to different sized classes.
• then STRi would be independent of all the other factors that go into
ui, so E(ui|STRi) = 0 and the OLS estimator in the regression of
TestScorei on STRi would be an unbiased estimator of the desired
causal effect.
conditional mean independence

• but in observational data ui can include other omitted factors related
to the included variables, so that E(u|X1, ..., Xk) ≠ 0.
• you can include “control variables” which are correlated with these
omitted causal factors, but which themselves are not causal.
• a control variable is a regressor included to hold constant factors that,
if neglected, could lead the estimated causal effect of interest to suffer
from omitted variable bias.
• then we have conditional mean independence: given the control
variable, the mean of ui doesn't depend on the variable of interest.
recap: CIA

Y = β0 + β1·X1 + β2·X2 + u
• in our discussion X1 is the variable of interest and X2 is a control variable
• under the conditional independence assumption (CIA):

E(u|x1, x2) = E(u|x2)

• we can claim a causal interpretation of our regression estimate for
β1 but not for β2

TestScr = β0 + β1·STR + β2·PctEL + u


rescued by the CIA?

OVB = {relationship between omitted var and var of interest}
      × {effect of omitted var in long}

• the OVB formula is one of the most important things to know about
your regression model
• if you claim no OVB for your study, you're effectively saying that the
regression you have is the regression you want
• in other words, you depend on the conditional independence
assumption (CIA):
E(u|x1, x2) = E(u|x2)
for a causal interpretation of your regression estimates
control variables in our California test score data

. reg testscr str el_pct freelunk


------------------------------------------------------------------------------
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
str | -.9983092 .2387543 -4.18 0.000 -1.467624 -.528994
el_pct | -.1215733 .0323173 -3.76 0.000 -.1850988 -.0580478
freelunk | -.5473456 .0215988 -25.34 0.000 -.589802 -.5048891
_cons | 700.15 4.685687 149.42 0.000 690.9394 709.3605
------------------------------------------------------------------------------

• str is the variable of interest
• el_pct is a control variable: immigrant communities tend to be less affluent and often
have fewer outside learning opportunities, and el_pct is correlated with those omitted
causal variables
• freelunk is also correlated with, and controls for, income-related outside learning
opportunities.
control variables

Three interchangeable statements about effective control variables:

• an effective control variable, when included in the regression, makes
the error term uncorrelated with the variable of interest.
• holding constant the control variable(s), the variable of interest is
“as if” randomly assigned.
• among individuals (entities) with the same value of the control
variable(s), the variable of interest is uncorrelated with the omitted
determinants of Y.
what comes next?

next week:
• hypothesis tests in multiple regression
• examples of nonlinearities and interactions
