
FNCE 926

Empirical Methods in CF
Lecture 1 – Linear Regression I

Professor Todd Gormley


Today’s Agenda
n  Introduction
n  Discussion of Syllabus
n  Review of linear regressions
About Me
n  PhD from MIT economics
n  Undergraduate at Michigan St. Univ.
n  Research on bank entry & corporate
topics involving risk and governance
Today’s Agenda
n  Introduction… about me
n  Discussion of Syllabus
n  Review of linear regressions
Course Objectives
n  Provide toolbox & knowledge of cross-
sectional & panel data empirical methods
n  Course will have three-pronged approach
q  Lectures will provide you econometric intuition
behind each method discussed
q  Course readings expose you to examples of
these tools being used in recent research
q  Exercises will force you to use the methods
taught in actual data
Reading Materials [Part 1]
n  My lecture notes will be your primary
source for each econometric tool
n  But, please read background texts before
lecture [see syllabus for relevant sections]
q  Angrist & Pischke’s Mostly Harmless… book
q  Roberts & Whited (2010) paper
q  Greene’s textbook on econometrics
q  Wooldridge’s textbook on panel data
Reading Materials [Part 2]
n  We will also be covering 35+ empirical
papers; obtain these using Econlit or
by going to authors’ SSRN websites
for working papers [I’ve provided links]
q  Sorry, for copyright reasons, I can’t post
the papers to Canvas…
q  Just let me know if you have any problem
finding a particular paper
Study Groups
n  3 study groups will do in-class presentations
q  Choose own members; can change later if need to
q  Try to split yourself somewhat equally into groups
q  Choose initial groups during today’s break; first
group presentations will be in next class!
[More about group presentations in a second…]
Course Structure
n  Total of 150 possible points
q  In-class exam [50 points]
q  Five data exercises [25 points]
q  In-class presentations/participation [25 points]
q  Research proposal
n  Rough draft [15 points]
n  Final proposal [35 points]
Exam
n  Done in last class, April 26
n  More details when we get closer…, but a
practice exam is already available on Canvas
Data exercises
n  Exercises will ask you to download and
manipulate data within Stata
q  E.g. will need to estimate a triple-diff
q  To receive credit, you will send me your DO
files; I will then run them on my own dataset to
confirm the coding is correct
q  More instructions in handouts [which will be
available on Canvas website]
Turning in exercises
n  Please upload both DO file and typed
answers to Canvas; i.e., we won’t be
handing them in during class
q  They will be graded & returned on Canvas
q  Deadline to submit is noon
[Canvas tracks when the file is uploaded]
In-class presentations & participation
n  In every class (except today), students will
present three papers in second half
q  Each study group does one presentation (this is
why there needs to be three study groups)
n  But, only one student for each group actually presents
n  Rotate the presenter each week; doing this basically
guarantees everyone full participation points

q  Assign papers for next class at end of class


[all papers are listed in the syllabus]
PowerPoint Presentations [Part 1]
n  Should last for 10 min., no more than 12 min.
q  Summarize [2-3 minutes]
q  Analytical discussion which should focus on
identification and causality [6-7 minutes]
q  Conclusion [1 minute]

n  Presentations followed by 5-10 minutes


discussion; students must read all three papers
n  See handout on Canvas for more details
PowerPoint Presentations [Part 2]
n  Each student must also type up 2-3 sentence
concern for each paper their group does NOT
present and turn it in at start of class
q  I will randomly select one after each student
presentation to further facilitate class discussion
q  Write your comment with one of these goals in mind…
n  Write down your own view of “biggest concern”
n  Or, write a concern you think presenter might miss!
Goal of Presentations

n  Help you think critically about empirical


tools discussed in previous class
n  Allow you to see and learn from papers
that actually use these techniques
n  Gives you practice on presenting; this will
be important in the long run
Research Proposal
n  You will outline a possible empirical paper
that uses tools taught in this course
q  Rough draft due March 22
q  Final proposal due exam week, May 3

n  If you want, think of this as a jump start on a


possible 2nd or 3rd year paper
n  See handout on Canvas for more details
Office Hours & E-mail

n  My office hours will be…


q  Thursdays, 1:30-3:00 p.m.
q  Or, by appointment

n  Office location: 2458 SH-DH


n  Email: [email protected]
Teaching Assistant

n  The TA for this course will be…


q  Tetiana Davydiuk
q  [email protected]

n  She will be grading the exercises and answering


any questions you might have about them

All other questions can be directed to me!


Tentative Schedule
n  See syllabus…
n  While exam date & final research proposal
deadline are fixed, topics covered and other
case due dates may change slightly if there is a
sudden and unexpected class cancellation
How the course is structured…
n  We will have a 1-2 lectures per ‘tool’
q  I will lecture in first half (except today) on the ‘tool’
q  In the second half of the following class, students will
present papers using that particular tool
Canvas
n  https://wharton.instructure.com
n  Things available to download
q  Exercises & solutions [after turned in]
q  Lecture notes
q  Handouts that provide more details on what I
expect for presentations & research proposal,
including grading templates
q  Practice exam
q  Student presentations [to help study for exam]
Lecture Notes
n  I will provide a copy of lecture notes on
Canvas before the start of each class
q  I strongly encourage printing these out and
bringing them with you to class!
Structural estimation lecture
n  Prof. Taylor has agreed to give this lecture
n  Tuesday, April 19… the usual time
Remaining Items
n  3 hours is long! We’ll take one 10 minute
break or two 5 minute breaks
n  Read rest of syllabus for other details
about the course including:
q  Class schedule or assigned papers are subject
to change; I’ll keep you posted of changes
q  Limitation of course; I won’t have time to
cover everything you should know, but it will
be a good start
Questions
n  If you have a question, ask! J
q  If you’re confused, you’re probably not alone
q  I don’t mind being interrupted
q  If I’m going too fast, just let me know

n  I may not always have an immediate answer,


but all questions will be answered eventually

n  Any questions?


Today’s Agenda
n  Introduction
n  Discussion of Syllabus
n  Review of linear regressions

My expectation is that you’ve seen most of this before, but it is helpful to review the key ideas that are useful in practice (without all the math).

Despite trying to do much of it without math, today’s lecture is likely to be long and tedious… (sorry)
Linear Regression – Outline
n  The CEF and causality (very brief)
n  Linear OLS model
n  Multivariate estimation
n  Hypothesis testing
n  Miscellaneous issues

We will cover the latter


two in the next lecture
Background readings
n  Angrist and Pischke
q  Sections 3.1-3.2, 3.4.1
n  Wooldridge
q  Sections 4.1 & 4.2
n  Greene
q  Chapter 3 and Sections 4.1-4.4, 5.7-5.9, 6.1-6.2
Motivation
n  Linear regression is arguably the most popular
modeling approach in corporate finance
q  Transparent and intuitive
q  Very robust technique; easy to build on
q  Even if not interested in causality, it is useful for
describing the data

Given importance, we will spend today &


next lecture reviewing the key ideas
Motivation continued…
n  As researchers, we are interested
explaining how the world works
q  E.g. how are firms’ choices regarding leverage
are explained by their investment opportunities
n  I.e., if investment opportunities suddenly jumped
for some random reason, how would we expect
firms’ leverage to respond on average?

q  More broadly, how is y explained by x, where


both y and x are random variables?
Linear Regression – Outline
n  The CEF and causality (very brief)
q  Random variables & the CEF
q  Using OLS to learn about the CEF
q  Briefly describe “causality”

n  Linear OLS model


n  Multivariate estimation
n  Hypothesis testing
n  Miscellaneous issues
A bit about random variables
n  With this in mind, it is useful know that any
random variable y can be written as

y = E ( y | x) + ε
where (y, x, ε) are random variables and E(ε|x)=0  

q  E(y|x) is expected value of y given x


q  In words, y can be broken down into part
‘explained’ by x, E(y|x), and a piece that is
mean independent of x, ε
Conditional expectation function (CEF)
n  E(y|x) is what we call the CEF, and
it has very desirable properties
q  Natural way to think about relationship
between x and y
q  And, it is best predictor of y given x
in a minimum mean-squared error sense
n  I.e. E(y|x) minimizes E[(y-m(x))2],where
m(x) can be any function of x.
CEF visually…
n  E(y|x) is fixed, but unobservable

Our goal is
to learn about
the CEF

n  Intuition: for any value of x, distribution


of y is centered about E(y|x)
Linear Regression – Outline
n  The CEF and causality (very brief)
q  Random variables & the CEF
q  Using OLS to learn about the CEF
q  Briefly describe “causality”

n  Linear OLS model


n  Multivariate estimation
n  Hypothesis testing
n  Miscellaneous issues
Linear regression and the CEF
n  If done correctly, a linear regression can
help us uncover what the CEF is
n  Consider linear regression model, y = β x + u
q  y = dependent variable
q  x = independent variable
q  u = error term (or disturbance)
q  β = slope parameter
Some additional terminology
n  Other terms for y… n  Other terms for x…
q  Outcome variable q  Covariate
q  Response variable q  Control variable
q  Explained variable q  Explanatory variable
q  Predicted variable q  Predictor variable
q  Regressand q  Regressor
Details about y = βx + u
n  (y, x, u) are random variables
n  (y, x) are observable
n  (u, β) are unobservable
q  u captures everything that determines y after
accounting for x [This might be a lot of stuff!]
q  We want to estimate β
Ordinary Least Squares (OLS)
n  Simply put, OLS finds the β that
minimizes the mean-squared error

β = argmin_b E[(y − bx)²]

n  Using the first order condition, E[x(y − βx)] = 0,
we have β = E(xy)/E(x²)
n  Note: by definition, the residual from this
regression, y-βx, is uncorrelated with x
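As a concrete check, here is a minimal Stata sketch (simulated data; the variable names and the true slope of 2 are assumptions for illustration) showing that the sample analogue of E(xy)/E(x²) equals the no-constant OLS slope, and that the resulting residual is uncorrelated with x:

* simulate data consistent with y = beta*x + u, with beta = 2 (assumed for illustration)
clear
set seed 123
set obs 10000
gen x = rnormal()
gen y = 2*x + rnormal()
* sample analogue of beta = E(xy)/E(x^2)
gen xy = x*y
gen xsq = x*x
quietly summarize xy
scalar num = r(mean)
quietly summarize xsq
scalar den = r(mean)
display "beta-hat from E(xy)/E(x^2): " num/den
* the same slope from OLS without a constant
regress y x, noconstant
* by construction, the residual is uncorrelated with x
predict uhat, residuals
correlate uhat x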
What’s great about this linear regression?
n  It can be proved that…
q  βx is best* linear prediction of y given x
q  βx is best* linear approximation of E(y|x)
* ‘best’ in terms of minimum mean-squared error

n  This is quite useful. I.e. even if E(y|x) is


nonlinear, the regression gives us the best
linear approximation of it
Linear Regression – Outline
n  The CEF and causality (very brief)
q  Random variables & the CEF
q  Using OLS to learn about the CEF
q  Briefly describe “causality”

n  Linear OLS model


n  Multivariate estimation
n  Hypothesis testing
n  Miscellaneous issues
What about causality?
n  Need to be careful here…
q  How x explains y, which this regression
helps us understand, is not the same as
learning the causal effect of x on y
q  For that, we need more assumptions…
The basic assumptions [Part 1]
n  Assumption #1: E(u) = 0
q  With intercept, this is totally innocuous
q  Just change regression to y = α + βx + u,
where α is the intercept term
q  Now suppose, E(u)=k≠0
n  We could rewrite u = k + w, where E(w)=0
n  Then, model becomes y = (α + k) + βx + w
n  Intercept is now just α + k, and error, w, is mean zero
n  I.e. Any non-zero mean is absorbed by intercept
The basic assumptions [Part 2]
Intuition?
n  Assumption #2: E(u|x) = E(u)
q  In words, average of u (i.e. unexplained portion
of y) does not depend on value of x
q  This is “conditional mean independence” (CMI)
n  True if x and u are independent of each other
n  Implies u and x are uncorrelated

This is the key assumption being made


when people make causal inferences
CMI Assumption
n  Basically, assumption says you’ve got correct
CEF model for causal effect of x on y
q  CEF is causal if it describes differences in
average outcomes for a change in x
n  i.e. increase in x from values a to b is equal to
E(y|x=b)–E(y|x=a) [In words?]

q  Easy to see that this is only true if E(u|x) = E(u)


[This is done on next slide…]
Example of why CMI is needed
n  With model y = α + βx + u,
q  E(y|x=a) = α + βa + E(u|x=a)
q  E(y|x=b) = α + βb + E(u|x=b)
q  Thus, E(y|x=b) – E(y|x=a) =
β(b-a) + E(u|x=b) – E(u|x=a)
q  This only equals what we think of as the ‘causal’
effect of x changing from a to b if E(u|x=b) =
E(u|x=a)… i.e. CMI assumption holds
Tangent – CMI versus correlation
n  CMI (which implies x and u are
uncorrelated) is needed for no bias
[which is a finite sample property]
n  But, we only need to assume a zero
correlation between x and u for consistency
[which is a large sample property]
q  More about bias vs. consistency later; but we
typically care about consistency, which is why
I’ll often refer to correlations rather than CMI
Is it plausible?
n  Admittedly, there are many reasons why
this assumption might be violated
q  Recall, u captures all the factors that affect y
other than x… It will contain a lot!
q  Let’s just do a couple of examples…
Ex. #1 – Capital structure regression
n  Consider following firm-level regression:
Leveragei = α + β Profitabilityi + ui

q  CMI implies average u is same for each profitability


q  Easy to find a few stories why this isn’t true…
n  #1 – unprofitable firms tend to have higher bankruptcy risk,
which by tradeoff theory, should mean a lower leverage
n  #2 – unprofitable firms have accumulated less cash, which
by pecking order means they should have more leverage
Ex. #2 – Investment
[Q = a measure of investment opportunities]

n  Consider following firm-level regression:


Investmenti = α + β Qi + ui

q  CMI implies average u is same for each Tobin’s Q


q  Easy to find a few stories why this isn’t true…
n  #1 – Firms with low Q might be in distress & invest less
n  #2 – Firms with high Q might be smaller, younger firms
that have a harder time raising capital to fund investments
Is there a way to test for CMI?
n  Let ŷ be the predicted value of y, i.e.
= α + β x , where α and β are OLS estimates
ŷTtttt
n  And, let û be the residual, i.e. uˆ = y − y
ˆ
n  Can we prove CMI if residuals if E(û )=0
and if û is uncorrelated with x?
q  Answer: No! By construction these residuals are
mean zero and uncorrelated with x. See earlier
derivation of OLS estimates
Identification police
n  What people call the “identification police”
are those that look for violations of CMI
q  I.e. the “police” look for a reason why the
model’s disturbance is correlated with x
n  Unfortunately, it’s not that hard…
n  Trying to find ways to ensure the CMI
assumption holds and causal inferences can be
made will be a key focus of this course
A side note about “endogeneity”
n  Many “police” will criticize a model by
saying it has an “endogeneity problem” but
then don’t say anything further…

n  But what does it mean to say there is an


“an endogeneity problem”?
A side note about “endogeneity”
n  My view: such vague “endogeneity” critics
suspect something is potentially wrong, but
don’t really know why or how
q  Don’t let this be you! Be specific about
what the problem is!

n  Violations to CMI can be roughly


categorized into three bins… which are?
Three reasons why CMI is violated
n  Omitted variable bias
n  Measurement error bias
n  Simultaneity bias
q  We will look at each of these in much
more detail in the “Causality” lecture
What “endogenous” means to me
n  An “endogenous” x is when its value depends
on y (i.e. it determined jointly with y such that
there is simultaneity bias).
q  But, some use a broader definition to
mean any correlation between x and u
[e.g. Roberts & Whited (2011)]
q  Because of the confusion, I avoid using
“endogeneity”; I’d recommend the same for you
n  I.e. Be specific about CMI violation; just say omitted
variable, measurement error, or simultaneity bias
A note about presentations…
n  Think about “causality” when presenting
next week and the following week
q  I haven’t yet formalized the various reasons for
why “causal” inferences shouldn’t be made; but
I’d like you to take a stab at thinking about it
Linear Regression – Outline
n  The CEF and causality (very brief)
n  Linear OLS model
q  Basic interpretation
q  Rescaling & shifting of variables
q  Incorporating non-linearities

n  Multivariate estimation


n  Hypothesis testing
n  Miscellaneous issues
Interpreting the estimates
n  Suppose I estimate the following model of
CEO compensation
salaryi = α + β ROEi + ui

q  Salary for CEO i is in $000s; ROE is a %


n  If you get… αˆ = 963.2
βˆ = 18.50
q  What do these coefficients tell us?
q  Is CMI likely satisfied?
Interpreting the estimates – Answers

salaryi = 963.2 + 18.5ROEi + ui

n  What do these coefficients tell us?


q  1 percentage point increase in ROE is
associated with $18,500 increase in salary
q  Average salary for CEO with ROE = 0
was equal to $963,200
n  Is CMI likely satisfied? Probably not
Linear Regression – Outline
n  The CEF and causality (very brief)
n  Linear OLS model
q  Basic interpretation
q  Rescaling & shifting of variables
q  Incorporating non-linearities

n  Multivariate estimation


n  Hypothesis testing
n  Miscellaneous issues
Scaling the dependent variable
n  What if I change measurement of salary from
$000s to $s by multiplying it by 1,000?

q  Estimates were… αˆ = 963.2


βˆ = 18.50

αˆ = 963, 200
q  Now, they will be…
βˆ = 18,500
Scaling y continued…
n  Scaling y by an amount c just causes all the
estimates to be scaled by the same amount
q  Mathematically, easy to see why…
y = α + βx + u
cy = (cα) + (cβ)x + cu
[new intercept = cα, new slope = cβ]


Scaling y continued…
n  Notice, the scaling has no effect on the
relationship between ROE and salary
q  I.e. because y is expressed in $s now, β̂ = 18,500
means that a one percentage point increase in ROE
is still associated with $18,500 increase in salary
Scaling the independent variable
n  What if I instead change measurement of
ROE from percentage to decimal? (i.e.
multiply ROE by 1/100)
q  Estimates were… α̂ = 963.2, β̂ = 18.50
q  Now, they will be… α̂ = 963.2, β̂ = 1,850
Scaling x continued…
n  Scaling x by an amount k just causes the
slope on x to be scaled by 1/k
q  Mathematically, easy to see why…
Will interpretation of
estimates change?
y = α + βx + u
Answer: Again, no!
⎛β⎞
y = α + ⎜ ⎟ kx + u
⎝ k⎠

New slope
Scaling both x and y
n  If scale y by an amount c and x by
amount k , then we get…
q  Intercept scaled by c
q  Slope scaled by c/k
y = α + βx + u
cy = (cα) + (cβ/k)(kx) + cu
n  When is scaling useful?
Practical application of scaling #1
n  No one wants to see a coefficient of
0.000000456 or 1,234,567,890
n  Just scale the variables for cosmetic purposes!
q  It will effect coefficients & SEs
q  But, it won’t affect t-stats or inference
Practical application of scaling #2 [P1]
n  To improve interpretation, in terms of
found magnitudes, helpful to scale by the
variables by their sample standard deviation
q  Let σx and σy be sample standard deviations of
x and y respectively
q  Let c, the scalar for y, be equal to 1/σy
q  Let k, the scalar for x, be equal to 1/σx
q  I.e. unit of x and y is now standard deviations
Practical application of scaling #2 [P2]
n  With the prior rescaling, how would we
interpret a slope coefficient of 0.25?
q  Answer = a 1 s.d. increase in x is associated
with ¼ s.d. increase in y
q  The slope tells us how many standard
deviations y changes, on average, for a
standard deviation change in x
q  Is 0.25 large in magnitude? What about 0.01?
Shifting the variables

n  Suppose we instead add c to y and k to x (i.e.


we shift y and x up by c and k respectively)

n  Will the estimated slope change?


Shifting continued…
n  No! Only the estimated intercept will change
q  Mathematically, easy to see why…
y = α + βx + u
y + c = α + c + βx + u
y + c = α + c + β(x + k) − βk + u
y + c = (α + c − βk) + β(x + k) + u
[new intercept = α + c − βk; slope unchanged]


Practical application of shifting
n  To improve interpretation, sometimes helpful
to demean x by its sample mean
q  Let μx be the sample mean of x; regress y on x - μx
q  Intercept now reflects expected value of y for x =μx
y = (α + βμx) + β(x − μx) + u
E(y | x = μx) = α + βμx
q  This will be very useful when we get to diff-in-diffs
Break Time
n  Let’s take a 10 minute break
Linear Regression – Outline
n  The CEF and causality (very brief)
n  Linear OLS model
q  Basic interpretation
q  Rescaling & shifting of variables
q  Incorporating non-linearities

n  Multivariate estimation


n  Hypothesis testing
n  Miscellaneous issues
Incorporating nonlinearities [Part 1]

n  Assuming that the causal CEF is linear


may not always be that realistic
q  E.g. consider the following regression

wage = α+ βeducation + u

q  Why might a linear relationship between #


of years of education and level of wages be
unrealistic? How can we fix it?
Incorporating nonlinearities [Part 2]
n  Better assumption is that each year of
education leads to a constant proportionate
(i.e. percentage) increase in wages
q  Approximation of this intuition captured by…

ln(wage) = α+ βeducation + u

q  I.e. the linear specification is very flexible


because it can capture linear relationships
between non-linear variables
Common nonlinear function forms

n  Regressing Levels on Logs


n  Regressing Logs on Levels
n  Regressing Logs on Logs

Let’s discuss how to interpret each of these


The usefulness of log

n  Log variables are useful because


100*Δln(y)≈% Δy
q  Note: When I (and others) say “Log”, we
really mean the natural logarithm, “Ln”.
E.g. if you use the “log” function in Stata,
it assumes you meant “ln”
Interpreting log-level regressions

n  If estimate, the ln(wage) equation, 100β


will tell you the %Δwage for an additional
year of education. To see this…

ln( wage) = α + β education + u


Δ ln( wage) = βΔeducation
100 × Δ ln( wage) = (100 β )Δeducation
%Δwage ≈ (100 β )Δeducation
Log-level interpretation continued…

n  The proportionate change in y for a


given change in x is assumed constant
q  The change in y is not assumed to be
constant… it gets larger as x increases
q  Specifically, ln(y) is assumed to be linear in
x; but y is not a linear function of x…

ln( y ) = α + β x + u
y = exp(α + β x + u )
Example interpretation
n  Suppose you estimated the wage equation (where
wages are $/hour) and got…

ln(wage) = 0.584 + 0.083education


q  What does an additional year of education get you?
Answer = 8.3% increase in wages.
q  Any potential problems with the specification?
q  Should we interpret the intercept?
Interpreting log-log regressions

n  If estimate the following…

ln( y) = α + β ln( x) + u

n  β is the elasticity of y w.r.t. x!


q  i.e. β is the percentage change in y for a
percentage change in x
q  Note: regression assumes constant elasticity
between y and x regardless of level of x
Example interpretation of log-log
n  Suppose you estimated the CEO salary model
using logs and got the following:

ln(salary) = 4.822 + 0.257ln(sales)

n  What is the interpretation of 0.257?

Answer = For each 1% increase in


sales, salary increases by 0.257%
Interpreting level-log regressions

n  If estimate the following…

y = α + β ln( x) + u

n  β/100 is the change in y for 1% change x


Example interpretation of level-log
n  Suppose you estimated the CEO salary
model using logs and got the following,
where salary is expressed in $000s:

salary = 4.822 + 1,812.5ln(sales)

n  What is the interpretation of 1,812.5?


Answer = For each 1% increase in
sales, salary increases by $18,125
Summary of log functional forms
Model         Dependent Variable   Independent Variable   Interpretation of β
Level-Level   y                    x                      dy = βdx
Level-Log     y                    ln(x)                  dy = (β/100)%dx
Log-Level     ln(y)                x                      %dy = (100β)dx
Log-Log       ln(y)                ln(x)                  %dy = β%dx
n  See syllabus…
n  Now, let’s talking about what happens if
you change units (i.e. scale) for either y
or x in these regressions…
Rescaling logs doesn’t matter [Part 1]
n  What happens to intercept & slope if rescale
(i.e. change units) of y when in log form?
n  Answer = Only intercept changes; slope
unaffected because it measures proportional
change in y in Log-Level model
log( y ) = α + β x + u
log(c) + log( y ) = log(c) + α + β x + u
log(cy ) = ( log(c) + α ) + β x + u
Rescaling logs doesn’t matter [Part 2]
n  Same logic applies to changing scale of x in
level-log models… only intercept changes

y = α + β log( x) + u
y + β log(c) = α + β log( x) + β log(c) + u
y = (α − β log(c) ) + β log(cx) + u
Rescaling logs doesn’t matter [Part 3]
n  Basic message – If you rescale a logged variable,
it will not affect the slope coefficient because you
are only looking at proportionate changes
Log approximation problems
n  I once discussed a paper where author
argued that allowing capital inflows into
country caused -120% change in stock
prices during crisis periods…
q  Do you see a problem with this?
n  Of course! A 120% drop in stock prices isn’t
possible. The true percentage change was -70%.
Here is where that author went wrong…
Log approximation problems [Part 1]
n  Approximation error occurs because as true
%Δy becomes larger, 100Δln(y)≈%Δy
becomes a worse approximation
n  To see this, consider a change from y to y’…
q  Ex. #1: (y′ − y)/y = 5%, and 100Δln(y) = 4.9%
q  Ex. #2: (y′ − y)/y = 75%, but 100Δln(y) = 56%
Log approximation problems [Part 2]
[figure: the approximation error grows with the size of the true % change]
Log approximation problems [Part 3]
n  Problem also occurs for negative changes

q  Ex. #1: (y′ − y)/y = −5%, and 100Δln(y) = −5.1%
q  Ex. #2: (y′ − y)/y = −75%, but 100Δln(y) = −139%
Log approximation problems [Part 4]
q  So, if implied percent change is large, better to convert
it to true % change before interpreting the estimate

ln(y) = α + βx + u
ln(y′) − ln(y) = β(x′ − x)
ln(y′/y) = β(x′ − x)
y′/y = exp(β(x′ − x))
%[(y′ − y)/y] = 100[exp(β(x′ − x)) − 1]
Log approximation problems [Part 5]
n  We can now use this formula to see what
true % change in y is for x’–x = 1
%[(y′ − y)/y] = 100[exp(β(x′ − x)) − 1]
%[(y′ − y)/y] = 100[exp(β) − 1]
q  If β = 0.56, the percent change isn’t 56%; it is
100[exp(0.56) − 1] = 75%
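For example, the conversion above can be checked directly in Stata (numbers taken from the examples on these slides):

* true percent change implied by a log-point estimate of 0.56
display 100*(exp(0.56) - 1)     // roughly 75
* and for the -139 log-point example above
display 100*(exp(-1.39) - 1)    // roughly -75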


Recap of last two points on logs

n  Two things to keep in mind about using logs


q  Rescaling a logged variable doesn’t affect slope
coefficients; it will only affect intercept
q  Log is only approximation for % change; it can
be a very bad approximation for large changes
Usefulness of logs – Summary
n  Using logs gives coefficients
with appealing interpretation
n  Can be ignorant about unit of
measurement of log variables
since they’re proportionate Δs
n  Logs of y or x can mitigate
influence of outliers
“Rules of thumb” on when to use logs
n  Helpful to take logs for variables with…
q  Positive currency amount
q  Large integral values (e.g. population)

n  Don’t take logs for variables measured in


years or as proportions
n  If y ∈ [0, ∞) , can take ln(1+y), but be
careful… nice interpretation no longer true…
What about using ln(1+y)?
n  Because ln(0) doesn’t exist, people use ln(1+y)
for non-negative variables, i.e. y ∈ [0, ∞)
q  Be careful interpreting the estimates! Nice
interpretation no longer true, especially if a lot of
zeros or many small values in y [Why?]
n  Ex. #1: What does it mean to go from ln(0) to ln(x>0)?
n  Ex. #2: And, Ln(x’+1) – Ln(x+1) is not percent change of x

q  In this case, might be better to scale y by another


variable instead, like firm size
Tangent – Percentage Change
n  What is the percent change in
unemployment if it goes from 10% to 9%?
q  This is 10 percent drop
q  It is a 1 percentage point drop
n  Percentage change is [(x1 – x0)/x0]×100
n  Percentage point change is the raw change in
percentages

Please take care to get this right in


description of your empirical results
Models with quadratic terms [Part 1]
n  Consider y = β0 + β1x + β2x2 + u
n  Partial effect of x is given by…
Δy = ( β1 + 2 β 2 x ) Δx
q  What is different about this partial effect
relative to everything we’ve seen thus far?
n  Answer = It depends on the value of x. So, we will
need to pick a value of x to evaluation (e.g. x )
Models with quadratic terms [Part 2]
n  If βˆ1 > 0, βˆ2 < 0 , then it has parabolic relation
q  Turning point = Maximum = βˆ1 / 2 βˆ2
q  Know where this turning point is! Don’t claim a
parabolic relation if it lies outside range of x!
q  Odd values might imply misspecification or simply
mean the quadratic terms are irrelevant and should
be excluded from the regression
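A Stata sketch of both calculations (y and x are placeholder names; data assumed in memory); margins evaluates the partial effect at the sample mean of x, and the last line computes the implied turning point:

* quadratic in x using factor-variable notation
regress y c.x##c.x
* partial effect of x, evaluated at the sample mean of x
margins, dydx(x) at((mean) x)
* implied turning point: x* = -b1/(2*b2)
display -_b[x]/(2*_b[c.x#c.x])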
Linear Regression – Outline
n  The CEF and causality (very brief)
n  Linear OLS model
n  Multivariate estimation
q  Properties & Interpretation
q  Partial regression interpretation
q  R2, bias, and consistency

n  Hypothesis testing


n  Miscellaneous issues
Motivation
n  Rather uncommon that we have
just one independent variable
q  So, now we will look at multivariate
OLS models and their properties…
Basic multivariable model
n  Example with constant and k regressors
y = β0 + β1 x1 + ... + βk xk + u
n  Similar identifying assumptions as before
q  No collinearity among covariates [why?]
q  E(u|x1,…, xk) = 0
n  Implies no correlation between any x and u, which
means we have the correct model of the true causal
relationship between y and (x1,…, xk)
Interpretation of estimates
n  Estimated intercept, βˆ0 , is predicted
value of y when all x = 0; sometimes this
makes sense, sometimes it doesn’t
n  (
Estimated slopes, βˆ ,..., βˆ , have a
1 k )
more subtle interpretation now…
y = βˆ0 + βˆ1x1 + ... + βˆk xk + uˆ
q  How would you interpret βˆ1 ?
Interpretation – Answer
n  ( )
Estimated slopes, βˆ1 ,..., βˆk , have partial
effect interpretations
n  Typically, we think about change in just one
variable, e.g. Δ x1, holding constant all other
variables, i.e. (Δx2,…, Δxk all equal 0)
q  This is given by Δŷ = βˆ1Δx1
q  I.e. βˆ1 is the coefficient holding all else fixed
(ceteris paribus)
Interpretation continued…
n  But, can also look at how changes in
multiple variables at once affects
predicted value of y
q  I.e. given changes in x1 through xk
we obtain the predicted change in y, Δy

Δyˆ = βˆ1Δx1 + ... + βˆk Δxk


Example interpretation – College GPA
n  Suppose we regress college GPA onto high
school GPA (4-point scale) and ACT score
for N = 141 university students
colGPA = 1.29 + 0.453hsGPA + 0.0094 ACT
q  What does the intercept tell us?
q  What does the slope on hsGPA tell us?
Example – Answers
n  Intercept pretty meaningless… person with
zero high school GPA and ACT doesn’t exist
n  Example interpretation of slope…
q  Consider two students, Ann and Bob, with
identical ACT score, but Ann’s GPA is 1 point
higher than Bob. Best prediction of Ann’s college
GPA is that it will be 0.453 higher than Bob’s
Example continued…

n  Now, what is effect of increasing high school


GPA by 1 point and ACT by 1 point?

ΔcolGPA = 0.453 × ΔhsGPA + 0.0094 × ΔACT


ΔcolGPA = 0.453 + 0.0094
ΔcolGPA = 0.4624
Example continued…

n  Lastly, what is effect of increasing high school


GPA by 2 points and ACT by 10 points?

ΔcolGPA = 0.453 × ΔhsGPA + 0.0094 × ΔACT


ΔcolGPA = 0.453 × 2 + 0.0094 × 10
ΔcolGPA = 1
Fitted values and residuals
n  Definition of residual for observation i, uˆi
uˆi = yi − yˆi
n  Properties of residual and fitted values
q  Sample average of residuals = 0; implies that
sample average of ŷ equals sample average of y
q  Sample covariance between each independent
variable and residuals = 0
q  Point of means ( y , x1 ,..., xk ) lies on regression line
Tangent about residuals
n  Again, it bears repeating…
q  Looking at whether the residuals are correlated
with the x’s is NOT a test for causality
q  By construction, they are uncorrelated with x
q  There is no “test” of whether the CEF is the
causal CEF; that justification will need to rely
on economic arguments
Linear Regression – Outline
n  The CEF and causality (very brief)
n  Linear OLS model
n  Multivariate estimation
q  Properties & Interpretation
q  Partial regression interpretation
q  R2, bias, and consistency

n  Hypothesis testing


n  Miscellaneous issues
Question to motivate the topic…
n  What is wrong with the following? And why?
q  Researcher wants to know effect of x on y
after controlling for z
q  So, researcher removes the variation in y that is
driven by z by regressing y on z & saves residuals
q  Then, researcher regresses these residuals on x and
claims to have identified effect of x on y controlling
for z using this regression
We’ll answer why it’s
wrong in a second…
Partial regression [Part 1]
n  The following is quite useful to know…
n  Suppose you want to estimate the following
y = β0 + β1 x1 + β 2 x2 + u
q  Is there another way to get βˆ1 that doesn’t
involve estimating this directly?
n  Answer: Yes! You can estimate it by regressing the
residuals from a regression of y on x2 onto the
residuals from a regression of x1 onto x2
Partial regression [Part 2]
n  To be clear, you get βˆ1 , by…
#1 – Regress y on x2; save residuals (call them y! )
#2 – Regress x1 on x2; save residuals (call them x! )
#3 – Regress y! onto x! ; the estimated coefficient
will be the same as if you’d just run the original
multivariate regression!!!
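A Stata sketch of the three steps (x1, x2, and y are placeholder names); the slope in the final regression should match the coefficient on x1 from the full regression:

* full multivariate regression, for comparison
regress y x1 x2
* step 1: partial x2 out of y
quietly regress y x2
predict y_tilde, residuals
* step 2: partial x2 out of x1
quietly regress x1 x2
predict x1_tilde, residuals
* step 3: regress residuals on residuals; slope equals the multivariate coefficient on x1
regress y_tilde x1_tilde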
Partial regression – Interpretation
n  Multivariate estimation is basically finding
effect of each independent variable after
partialing out effect of other variables
q  I.e. Effect of x1 on y after controlling for x2, (i.e.
what you’d get from regressing y on both x1 and
x2) is the same as what you get after you partial
out the effect x2 from both x1 and y and then run
a regression using the residuals
Partial regression – Generalized
n  This property holds more generally…
q  Suppose X1 is vector of independent variables
q  X2 is vector of more independent variables
q  And, you want to know that coefficients on X1 that
you would get from a multivariate regression of y
onto all the variables in X1 and X2…
Partial regression – Generalized, Part 2
n  You can get the coefficients for each
variable in X1 by…
q  Regress y and each variable in X1 onto all the
variables in X2 (at once), save residuals from
each regression
q  Do a regression of residuals; i.e. regress y
onto variables of X1, but replace y and X1
with the residuals from the corresponding
regression in step #1
Practical application of partial regression
n  Now, what is wrong with the following?
q  Researcher wants to know effect of x on y
after controlling for z
q  So, researcher removes the variation in y that is
driven by z by regressing y on z & saves residuals
q  Then, researcher regresses these residuals on x and
claims to have identified effect of x on y controlling
for z using this regression
Practical application – Answer
n  It’s wrong because it didn’t partial effect of
z out of x! Therefore, it is NOT the same
as regressing y onto both x and z!
n  Unfortunately, it is commonly done by
researchers in finance [e.g. industry-adjusting]
q  We will see how badly this can mess up things in
a later lecture where we look at my paper with
David Matsa on unobserved heterogeneity
Linear Regression – Outline
n  The CEF and causality (very brief)
n  Linear OLS model
n  Multivariate estimation
q  Properties & Interpretation
q  Partial regression interpretation
q  R2, bias, and consistency

n  Hypothesis testing


n  Miscellaneous issues
Goodness-of-Fit (R2)
n  A lot is made of R-squared; so let’s
quickly review exactly what it is
n  Start by defining the following:
q  Sum of squares total (SST)
q  Sum of squares explained (SSE)
q  Sum of squares residual (SSR)
Definition of SST, SSE, SSR
If N is the number of observations and the
regression has a constant, then
SST = Σi (yi − ȳ)²     [SST is total variation in y]
SSE = Σi (ŷi − ȳ)²     [SSE is total variation in predicted y; mean of predicted y = mean of y]
SSR = Σi ûi²           [SSR is total variation in residuals; mean of residuals = 0]
where each sum runs over i = 1, …, N
SSR, SST, and SSE continued…
n  The total variation, SST, can be broken
into two pieces… the explained part,
SSE and unexplained part, SSR

SST = SSE + SSR


n  R2 is just the share of total variation that
is explained! In other words,

R2 = SSE/SST = 1 – SSR/SST
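A small Stata sketch (placeholder names; data assumed in memory) that verifies the decomposition and the R² formula after a regression:

regress y x1 x2
predict yhat
predict uhat, residuals
quietly summarize y
scalar sst = r(Var)*(r(N) - 1)
quietly summarize yhat
scalar sse = r(Var)*(r(N) - 1)
quietly summarize uhat
scalar ssr = r(Var)*(r(N) - 1)
display "SST = " sst "   SSE + SSR = " sse + ssr
display "R2 = " sse/sst "   (Stata reports e(r2) = " e(r2) ")"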
More about R2
n  As seen on last slide, R2 must be
between 0 and 1
n  It can also be shown that R2 is equal
to the square of the correlation
between y and predicted y
n  If you add an independent variable,
R2 will never go down
Adjusted R2
n  Because R2 always goes up, we often use
what is called Adjusted R2

Adj. R² = 1 − (1 − R²) × (N − 1)/(N − 1 − k)

q  k = # of regressors, excluding the constant


q  Basically, you get penalized for each additional
regressor, such that adjusted R2 won’t go up after
you add another variable if it doesn’t improve fit
much [it can actually go down!]
Interpreting R2
n  If I tell you the R2 is 0.014 from a
regression, what does that mean? Is it bad?
q  Answer #1 = It means I’m only explaining
about 1.4% of the variation in y with the
regressors that I’m including in the regression
q  Answer #2 = Not necessarily! It doesn’t mean
the model is wrong; you might still be getting a
consistent estimate of the β you care about!
Unbiasedness versus Consistency
n  When we say an estimate is unbiased
or consistent, it means we think it has
a causal interpretation…
q  I.e. the CMI assumption holds and the x’s are
all uncorrelated with the disturbance, u

n  Bias refers to finite sample property;


consistency refers to asymptotic property
More formally…
n  An estimate, β̂ , is unbiased if E βˆ = β( )
q  I.e. on average, the estimate is centered around the
true, unobserved value of β
q  Doesn’t say whether you get a more precise
estimate as sample size increases

n  An estimate is consistent if plim βˆ = β


N →∞
q  I.e. as sample size increases, the estimate converges
(in probability limit) to the true coefficient
Unbiasedness of OLS
n  OLS will be unbiased when…
q  Model is linear in parameters
q  We have a random sample of x
q  No perfect collinearity between x’s
q  E(u|x1,…, xk) = 0
[Earlier assumptions #1 and #2 give us this]

n  Unbiasedness is nice feature of OLS; but in


practice, we care more about consistency
Consistency of OLS
n  OLS will be consistent when
q  Model is linear in parameters
q  u is not correlated with any of the x’s,
[CMI assumptions #1 and #2 give us this]

n  Again, this is good


n  See textbooks for more information
Summary of Today [Part 1]
n  The CEF, E(y|x) has desirable properties
q  Linear OLS gives best linear approx. of it
q  If correlation between error, u, and independent
variables, x’s, is zero it has causal interpretation

n  Scaling & shifting of variables doesn’t affect


inference, but can be useful
q  E.g. demean to give intercepts more meaningful
interpretation or rescale for cosmetic purposes
Summary of Today [Part 2]
n  Multivariate estimates are partial effects
q  I.e. effect of x1 holding x2,…, xk constant
q  Can get same estimates in two steps by first
partialing out some variables and regressing
residuals on residuals in second step
Assign papers for next week…
n  Angrist (AER 1990)
These seminal
q  Military service & future earnings papers in
economics with
n  Angrist and Lavy (QJE 1999) clever identification
strategies…
q  Class size & student achievements i.e., what we aspire
to learn about later
n  Acemoglu, et al. (AER 2001) in the course

q  Institutions and economic development


In First Half of Next Class
n  Finish discussion of the linear regression
q  Hypothesis testing
q  Irrelevant regressors & multicollinearity
q  Binary variables & interactions

n  Relevant readings; see syllabus


FNCE 926
Empirical Methods in CF
Lecture 2 – Linear Regression II

Professor Todd Gormley


Today's Agenda
n  Quick review
n  Finish discussion of linear regression
q  Hypothesis testing
n  Standard errors
n  Robustness, etc.
q  Miscellaneous issues
n  Multicollinearity
n  Interactions

n  Presentations of "Classics #1"

Background readings
n  Angrist and Pischke
q  Sections 3.1-3.2, 3.4.1
n  Wooldridge
q  Sections 4.1 & 4.2
n  Greene
q  Chapter 3 and Sections 4.1-4.4, 5.7-5.9, 6.1-6.2

Announcements
n  Exercise #1 is due next week
q  You can download it from Canvas
q  If any questions, please e-mail TA, or if
necessary, feel free to e-mail me
q  When finished, upload both typed
answers and DO file to Canvas

Quick Review [Part 1]
n  When does the CEF, E(y|x), we approx.
with OLS give causal inferences?
q  Answer = If correlation between error, u, and
independent variables, x's, is zero

n  How do we test for whether this is true?


q  Trick question! You can't test it. The error is
unobserved. Need to rely on sound logic.

Quick Review [Part 2]
n  What is interpretation of coefficients
in a log-log regression?
q  Answer = Elasticity. It captures the percent
change in y for a percent change in x

n  What happens if rescale log variables?


q  Answer = The constant will change

Quick Review [Part 3]
n  How should I interpret coefficient on x1 in a
multivariate regression? And, what two steps
could I use to get this?
q  Answer = Effect of x1 holding other x's constant
q  Can get same estimates in two steps by first
partialing out some variables and regressing
residuals on residuals in second step

Linear Regression – Outline
n  The CEF and causality (very brief)
n  Linear OLS model
n  Multivariate estimation
n  Hypothesis testing
q  Heteroskedastic versus Homoskedastic errors
q  Hypothesis tests
q  Economic versus statistical significance

n  Miscellaneous issues

Hypothesis testing
n  Before getting to hypothesis testing, which
allows us to say something like "our
estimate is statistically significant", it is
helpful to first look at OLS variance
q  Understanding it and the assumptions made to
get it can help us get the right standard errors
for our later hypothesis tests

Variance of OLS Estimators
n  Homoskedasticity implies Var(u|x) = σ2
q  I.e. Variance of disturbances, u, doesn't
depend on level of observed x
n  Heteroskedasticity implies Var(u|x) = f(x)
q  I.e. Variance of disturbances, u, does depend
on level of x in some way

Variance visually…

[figure: scatterplots illustrating homoskedastic vs. heteroskedastic errors]

Which assumption is more realistic?
n  In investment regression, which is more realistic,
homoskedasticity or heteroskedasticity?

Investment = α + βQ + u

q  Answer: Heteroskedasticity seems like a much safer


assumption to make; not hard to come up with stories
on why homoskedasticity is violated

Heteroskedasticity (HEK) and bias
n  Does heteroskedasticity cause bias?
q  Answer = No! E(u|x)=0 (which is what we need
for consistent estimates) is something entirely
different. Heteroskedasticity just affects SEs!
q  Heteroskedasticity just means that the OLS
estimate may no longer be the most efficient (i.e.
precise) linear estimator

n  So, why do we care about HEK?

Default is homoskedastic (HOK) SEs
n  Default standard errors reported by
programs like Stata assume HOK
q  If standard errors are heteroskedastic,
statistical inferences made from these
standard errors might be incorrect…

q  How do we correct for this?

Robust standard errors (SEs)
n  Use "robust" option to get standard
errors (for hypothesis testing ) that are
robust to heteroskedasticity
q  Typically increases SE, but usually won't
make that big of a deal in practice
q  If standard errors go down, could have
problem; use the larger standard errors!
q  We will talk about clustering later…

Using WLS to deal with HEK
n  Weighted least squares (WLS) is sometimes
used when worried about heteroskedasticity
q  WLS basically weights the observation of x using
an estimate of the variance at that value of x
q  Done correctly, can improve precision of estimates

WLS continued… a recommendation
n  Recommendation of Angrist-Pischke
[See Section 3.4.1]: don't bother with WLS
q  OLS is consistent, so why bother?
Can just use robust standard errors
q  Finite sample properties can be bad [and it may
not actually be more efficient]
q  Harder to interpret than just using OLS [which
is still best linear approx. of CEF]

Linear Regression – Outline
n  The CEF and causality (very brief)
n  Linear OLS model
n  Multivariate estimation
n  Hypothesis testing
q  Heteroskedastic versus Homoskedastic errors
q  Hypothesis tests
q  Economic versus statistical significance

n  Miscellaneous issues

Hypothesis tests
n  This type of phrases are common: "The
estimate, βˆ , is statistically significant"
q  What does this mean?
q  Answer = "Statistical significance" is
generally meant to imply an estimate is
statistically different than zero

But, where does this come from?

Hypothesis tests[Part 2]
n  When thinking about significance, it is
helpful to remember a few things…
q  Estimates of β1, β2, etc. are functions of random
variables; thus, they are random variables with
variances and covariances with each other
q  These variances & covariances can be estimated
[See textbooks for various derivations]
q  Standard error is just the square root of an
estimate's estimated variance

Hypothesis tests[Part 3]
n  Reported t-stat is just telling us how
many standard deviations our sample
estimate, βˆ , is from zero
q  I.e. it is testing the null hypothesis: β = 0
q  p-value is just the likelihood that we would
get an estimate βˆ standard deviations away
from zero by luck if the true β = 0

Hypothesis tests[Part 4]
n  See textbooks for more details on how to
do other hypothesis tests; E.g.
q  β1 = β 2

q  β1 = β 2 = β 3 = 0

q  Given these are generally easily done in


programs like Stata, I don't want to
spend time going over the math
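For reference, these tests are one-liners in Stata after a regression (x1, x2, x3 are placeholder names):

regress y x1 x2 x3
* test H0: coefficient on x1 equals coefficient on x2
test x1 = x2
* test H0: all three slope coefficients are jointly zero
test x1 x2 x3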

Linear Regression – Outline
n  The CEF and causality (very brief)
n  Linear OLS model
n  Multivariate estimation
n  Hypothesis testing
q  Heteroskedastic versus Homoskedastic errors
q  Hypothesis tests
q  Economic versus statistical significance

n  Miscellaneous issues

Statistical vs. Economic Significance
n  These are not the same!
q  Coefficient might be statistically
significant, but economically small
n  You can get this in large samples, or when
you have a lot of variation in x (or outliers)

q  Coefficient might be economically large,


but statistically insignificant
n  Might just be small sample size or too little
variation in x to get precise estimate

Economic Significance
n  You should always check economic
significance of coefficients
q  E.g. how large is the implied change in y
for a standard deviation change in x?
q  And importantly, is that plausible? If not,
you might have a specification problem

Linear Regression – Outline
n  The CEF and causality (very brief)
n  Linear OLS model
n  Multivariate estimation
n  Hypothesis testing
n  Miscellaneous issues
q  Irrelevant regressors & multicollinearity
q  Binary models and interactions
q  Reporting regressions

Irrelevant regressors

n  What happens if include a regressor that


should not be in the model?
q  We estimate y = β0 + β1x1 + β2x2 + u
q  But, real model is y = β0 + β1x1 + u
q  Answer: We still get a consistent of all the β,
where β2 = 0, but our standard errors might
go up (making it harder to find statistically
significant effects)… see next few slides

Variance of OLS estimators
n  Greater variance in your estimates, βˆ j ,
increases your standard errors, making it
harder to find statistically significant estimates

n  So, useful to know what increases Var βˆ j ( )

Variance formula
n  Sampling variance of OLS slope is…

Var(β̂j) = σ² / [ Σi (xij − x̄j)² (1 − Rj²) ]

for j = 1,…, k, where the sum runs over i = 1,…, N, Rj² is the R² from
regressing xj on all other independent variables
(including the intercept), and σ² is the variance of
the regression error, u

Variance formula – Interpretation
n  How will more variation in x affect SE? Why?
n  How will higher σ2 affect SE? Why?
n  How will higher Rj2 affect SE? Why?

Var(β̂j) = σ² / [ Σi (xij − x̄j)² (1 − Rj²) ]

Variance formula – Variation in xj
n  More variation in xj is good; smaller SE!
q  Intuitive; more variation in xj helps us
identify its effect on y!
q  This is why we always want larger samples;
it will give us more variation in xj

Variance formula – Effect of σ2
n  More error variance means bigger SE
q  Intuitive; a lot of the variation in y is
explained by things you didn't model
q  Can add variables that affect y (even if not
necessary for identification) to improve fit!

Variance formula – Effect of Rj2
n  But, more variables can also be bad if
they are highly collinear
q  Gets harder to disentangle effect of the
variables that are highly collinear
q  This is why we don't want to add variables
that are "irrelevant" (i.e. they don't affect y)

Should we include variables that do explain y and


are highly correlated with our x of interest?

Multicollinearity [Part 1]
n  Highly collinear variables can inflate SEs
q  But, it does not cause a bias or inconsistency!
q  Problem is really just one of a having too small
of a sample; with a larger sample, one could get
more variation in the independent variables
and get more precise estimates

Multicollinearity [Part 2]
n  Consider the following model
y = β0 + β1 x1 + β 2 x2 + β3 x3 + u
where x2 and x3 are highly correlated

q  ( ) ( )
Var βˆ2 and Var βˆ3 may be large, but
correlation between x2 and x3 has no
( )
direct effect on Var βˆ1
q  If x1 is uncorrelated with x2 and x3, the
( )
R12 = 0 and Var βˆ1 unaffected

Multicollinearity – Key Takeaways
n  It doesn't cause bias
n  Don't include controls that are highly
correlated with independent variable of
interest if they aren't needed for
identification [i.e. E(u|x) = 0 without them]
q  But obviously, if E(u|x) ≠ 0 without these
controls, you need them!
q  A larger sample will help increase precision

Linear Regression – Outline
n  The CEF and causality (very brief)
n  Linear OLS model
n  Multivariate estimation
n  Hypothesis testing
n  Miscellaneous issues
q  Irrelevant regressors & multicollinearity
q  Binary models and interactions
q  Reporting regressions

Models with interactions
n  Sometimes, it is helpful for identification, to
add interactions between x's
q  Ex. – theory suggests firms with a high value of x1
should be more affected by some change in x2
q  E.g. see Rajan and Zingales (1998)

n  The model will look something like…


y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + u

Interactions – Interpretation [Part 1]
n  According to this model, what is the effect of
increasing x1 on y, holding all else equal?
y = β 0 + β1 x1 + β 2 x2 + β3 x1 x2 + u
q  Answer:
Δy = (β1 + β3x2)Δx1
dy/dx1 = β1 + β3x2

Interactions – Interpretation [Part 2]
n  If β3 < 0, how does a higher x2 affect the
partial effect of x1 on y?
dy/dx1 = β1 + β3x2
q  Answer: The increase in y for a given change in
x1 will be smaller in levels (not necessarily in
absolute magnitude) for firms with a higher x2

Interactions – Interpretation [Part 3]
n  Suppose, β1 > 0 and β3 < 0 … what is
the sign of the effect of an increase in x1
for the average firm in the population?
dy/dx1 = β1 + β3x2

q  Answer: It is the sign of dy/dx1 evaluated at x2 = x̄2, i.e. β1 + β3x̄2

A very common mistake! [Part 1]

q  Researcher claims that "since β1>0 and β3<0, an


increase in x1 increases y on for the average firm, but
the increase is less for firms with a high x2"
dy x2 = x2
| = β1 + β3 x2
dx1
n  Wrong!!! The average effect of an increase in x1
might actually be negative if x2 is very large!
n  β1 only captures partial effect when x2 = 0, which
might not even make sense if x2 is never 0!

A very common mistake! [Part 2]
n  To improve interpretation of β1, you can
reparameterize the model by demeaning
each variable in the model, and estimate
ỹ = δ0 + δ1x̃1 + δ2x̃2 + δ3x̃1x̃2 + u
where ỹ = y − μy, x̃1 = x1 − μx1, and x̃2 = x2 − μx2

A very common mistake! [Part 3]
n  You can then show… Δy = (δ 1 + δ 3 x!2 ) Δx1
dy x2 =µ2
and thus, | = δ1 + δ 3 ( x2 − µ2 )
dx1
dy x2 =µ2
| = δ1
dx1

n  Now, the coefficient on the demeaned x1 can


be interpreted as effect of x1 for avg. firm!
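A Stata sketch of this reparameterization (placeholder names; data assumed in memory). After demeaning, the coefficient on the demeaned x1 is the partial effect at the average x2; the margins line is an equivalent way to get the same number without demeaning:

* demean y, x1, and x2
quietly summarize y
gen y_dm = y - r(mean)
quietly summarize x1
gen x1_dm = x1 - r(mean)
quietly summarize x2
gen x2_dm = x2 - r(mean)
gen x1x2_dm = x1_dm*x2_dm
regress y_dm x1_dm x2_dm x1x2_dm
* equivalent: partial effect of x1 at the mean of x2, via factor variables
regress y c.x1##c.x2
margins, dydx(x1) at((mean) x2)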

The main takeaway – Summary
n  If you want to coefficients on non-
interacted variables to reflect the effect
of that variable for the "average" firm,
demean all your variables before
running the specification

n  Why is there so much confusion about this?


Probably because of indicator variables…

Indicator (binary) variables

n  We will now talk about indicator variables


q  Interpretation of the indicator variables
q  Interpretation when you interact them
q  When demeaning is helpful
q  When using an indicator rather than a
continuous variable might make sense

Motivation
n  Indicator variables, also known as binary
variables, are quite popular these days
q  Ex. #1 – Sex of CEO (male, female)
q  Ex. #2 – Employment status (employed, unemployed)
q  Also see in many diff-in-diff specifications
n  Ex. #1 – Size of firm (above vs. below median)
n  Ex. #2 – Pay of CEO (above vs. below median)

How they work
n  Code the information using dummy variable
q  Ex. #1: Male_i = 1 if person i is male, 0 otherwise
q  Ex. #2: Large_i = 1 if Ln(assets) of firm i > median, 0 otherwise

n  Choice of 0 or 1 is relevant only for interpretation

48
Single dummy variable model
n  Consider wage = β0 + δ 0 female + β1educ + u
n  δ0 measures difference in wage between male
and female given same level of education
q  E(wage|female = 0, educ) = β0 + β1educ
q  E(wage|female = 1, educ) = β0 + δ0 + β1educ
q  Thus, E(wage|f = 1, educ) – E(wage|f = 0, educ) = δ0

n  Intercept for males = β0 , females = β0 + δ0

Single dummy just shifts intercept!
n  When δ0 < 0, we have visually…

[figure: two parallel lines with common slope β1; the female intercept is shifted down by δ0]

Single dummy example – Wages
n  Suppose we estimate the following wage model

Wage = -1.57 – 1.8female + 0.57educ + 0.03exp + 0.14tenure

q  Male intercept is -1.57; it is meaningless, why?


q  How should we interpret the 1.8 coefficient?
n  Answer: Females earn $1.80/hour less then men
with same education, experience, and tenure

Log dependent variable & indicators
n  Nothing new; coefficient on indicator has %
interpretation. Consider following example…
ln( price) = −1.35 + 0.17ln(lotsize) + 0.71ln( sqrft )
+0.03bdrms + 0.054colonial
q  Again, negative intercept meaningless; all other
variables are never all equal to zero
q  Interpretation = colonial style home costs about
5.4% more than "otherwise similar" homes

Multiple indicator variables
n  Suppose you want to know how much lower
wages are for married and single females
q  Now have 4 possible outcomes
n  Single & male
n  Married & male
n  Single & female
n  Married & female

q  To estimate, create indicators for three of the


variables and add them to the regression

But, which to exclude?
n  We have to exclude one of the four
because they are perfectly collinear with
the intercept, but does it matter which?
q  Answer: No, not really. It just effects the
interpretation. Estimates of included
indicators will be relative to excluded indicator
q  For example, if we exclude "single & male",
we are estimating partial change in wage
relative to that of single males

But, which to exclude? [Part 2]

n  Note: if you don't exclude one, then


statistical programs like Stata will just
drop one for you automatically. For
interpretation, you need to figure out
which one was dropped!

55
Multiple indicators – Example
n  Consider the following estimation results…
ln(wage) = 0.3 + 0.21marriedMale − 0.20marriedFemale
−0.11singleFemale + 0.08education

q  I omitted single male; thus intercept is for single males


q  And, can interpret other coefficients as…
n  Married men earn ≈ 21% more than single males, all else equal
n  Married women earn ≈ 20% less than single males, all else equal

56
Interactions with Indicators
n  We could also do prior regression instead
using interactions between indicators
q  I.e. construct just two indicators, 'female' and
'married' and estimate the following

ln( wage) = β 0 + β1 female + β 2 married


+ β3 ( female × married ) + β 4education

q  How will our estimates and interpretation


differ from earlier estimates?

57
Interactions with Indicators [Part 2]
n  Before we had,
ln(wage) = 0.3 + 0.21marriedMale − 0.20marriedFemale
−0.11singleFemale + 0.08education
n  Now, we will have,
ln( wage) = 0.3 − 0.11 female + 0.21married
−0.30 ( female × married ) + 0.08education

q  Question: Before, married females had wages


that were 0.20 lower; how much lower are
wages of married females now?

58
Interactions with Indicators [Part 3]
n  Answer: It will be the same!
ln( wage) = 0.32 − 0.11 female + 0.21married
−0.30 ( female × married ) + ...

q  Difference for married female = –0.11+0.21–


0.30 = -0.20; exactly the same as before

n  Bottom line = you can do the indicators


either way; inference is unaffected

59
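As a rough Stata illustration of this equivalence (hypothetical variable names mirroring the slide, with the three group dummies assumed to be already created; single & male is the omitted group in both runs):

  * version 1: three group dummies (single & male omitted)
  reg lwage marriedMale marriedFemale singleFemale educ
  * version 2: two indicators plus their interaction; same fit, same inference
  reg lwage i.female##i.married educ

In the second run the married-female effect is the sum of the female, married, and interaction coefficients, exactly as on the previous slide.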
Indicator Interactions – Example
n  Krueger (1993) found…
ln(wage) = β̂0 + 0.18compwork + 0.07comphome
+0.02 ( compwork × comphome ) + ...

q  Excluded category = people with no computer


q  How do we interpret these estimates?
n  How much higher are wages if have computer at work? ≈18%
n  If have computer at home? ≈7%
n  If have computers at both work and home? ≈18+7+2=27%

60
Indicator Interactions – Example [part 2]
n  Remember, these are just approximate percent
changes… To get true change, need to convert
q  E.g. % change in wages for having computers at both
home and work is given by
100*[exp(0.18+0.07+0.02) – 1] = 31%

61
Interacting Indicators w/ Continuous
n  Adding dummies alone will only shift
intercepts for different groups
n  But, if we interact these dummies with
continuous variables, we can get different
slopes for different groups as well
q  See next slide for an example of this

62
Continuous Interactions – Example
n  Consider the following
ln( wage) = β0 + δ 0 female + β1educ + δ1 ( female × educ ) + u

q  What is intercept for males? β0


q  What is slope for males? β1
q  What is intercept for females? β0+δ0
q  What is slope for females? β1+δ1

63
Visual #1 of Example
ln( wage) = β0 + δ 0 female + β1educ + δ1 ( female × educ ) + u

In this example…
q  Females earn lower wages
at all levels of education
q  Avg. increase per unit of
education is also lower

64
Visual #2 of Example
ln( wage) = β0 + δ 0 female + β1educ + δ1 ( female × educ ) + u

In this example…
q  Wage is lower for females
but only for lower levels
of education because their
slope is larger

Is it fair to conclude that


women eventually earn
higher wages with
enough education?

65
Cautionary Note on Different Slopes!
n  Crossing point (where women earn higher
wages) might occur outside the data
(i.e. at education levels that don't exist)
q  Need to solve for crossing point before
making this claim about the data
Women : ln( wage) = β 0 + δ 0 + ( β1 + δ1 ) educ + u
Men : ln( wage) = β 0 + β1educ + u

q  They equal when educ = δ0/δ1

66
Cautionary Note on Interpretation!
n  Interpretation of non-interacted terms when
using continuous variables is tricky
n  E.g., consider the following estimates
ln(wage) = 0.39 − 0.23female + 0.08educ − 0.01(female × educ)

q  Return to educ is 8% for men, 7% for women


q  But, at the average education level, how much less
do women earn? [–0.23 – 0.01×avg. educ]%

67
Cautionary Note [Part 2]
n  Again, interpretation of non-interacted
variables does not equal average effect unless
you demean the continuous variables
q  In prior example estimate the following:
ln( wage) = β 0 + δ 0 female + β1 ( educ − µeduc )
+δ1 female × ( educ − µeduc )
q  Now, δ0 tells us how much lower the wage is of
women at the average education level

68
Cautionary Note [Part 3]
n  Recall! As we discussed in prior lecture, the
slopes won't change because of the shift
q  Only the intercepts, β0 and β0 + δ0 , and their
standard errors will change

n  Bottom line = if you want to interpret non-


interacted indicators as the effect of indicators
at the average of the continuous variables, you
need to demean all continuous variables

69
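A minimal Stata sketch of this bottom line, assuming hypothetical variables lwage, female, and educ:

  * demean the continuous variable before interacting it with the indicator
  summarize educ, meanonly
  gen educ_c = educ - r(mean)
  * the coefficient on female is now the gap at the average education level
  reg lwage i.female##c.educ_c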
Ordinal Variables
n  Consider credit ratings: CR ∈ ( AAA, AA,..., C , D)
n  If want to explain interest rate, IR, with
ratings, we could convert CR to numeric scale,
e.g. AAA = 1, AA = 2, … and estimate

IRi = β 0 + β1CRi + ui
q  But, what are we implicitly assuming, and how
might it be a problematic assumption?

70
Ordinal Variables continued…
n  Answer: We assumed a constant linear
relation between interest rates and CR
q  I.e. Moving from AAA to AA produces same
change as moving from BBB to BB
q  Could take log interest rate, but is a constant
proportional much better? Not really…

n  A better route might be to convert the


ordinal variable to indicator variables

71
Convert ordinal to indicator variables
n  E.g. let CRAAA = 1 if CR = AAA, 0 otherwise;
CRAA = 1 if CR = AA, 0 otherwise, etc.
n  Then, run this regression
IRi = β 0 + β1CRAAA + β 2CRAA + ... + β m−1CRC + ui
q  Remember to exclude one (e.g. "D")

n  This allows IR change from each rating


category [relative to the excluded indicator]
to be of different magnitude!

72
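In Stata, the easiest way to implement this is with factor-variable notation; the sketch below assumes a hypothetical numeric variable rating (e.g. 1 = AAA, 2 = AA, …) and an interest-rate variable ir:

  * one indicator per rating category; Stata omits one category automatically,
  * so each coefficient is the IR difference relative to that omitted rating
  reg ir i.rating
  * or build the dummies yourself (then remember to leave one out)
  tabulate rating, generate(CR_)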
Linear Regression – Outline
n  The CEF and causality (very brief)
n  Linear OLS model
n  Multivariate estimation
n  Hypothesis testing
n  Miscellaneous issues
q  Irrelevant regressors & multicollinearity
q  Binary models and interactions
q  Reporting regressions

73
Reporting regressions
n  Table of OLS outputs should
generally show the following…
q  Dependent variable [clearly labeled]
q  Independent variables
q  Est. coefficients, their corresponding
standard errors (or t-stat), and stars
indicating level of stat. significance
q  Adjusted R2
q  # of observations in each regression

74
Reporting regressions [Part 2]
n  In body of paper…
q  Focus only on variable(s) of interest
n  Tell us their sign, magnitude, statistical &
economic significance, interpretation, etc.

q  Don't waste time on other coefficients


unless they are "strange" (e.g. wrong
sign, huge magnitude, etc)

75
Reporting regressions [Part 3]
n  And last, but not least, don't report
regressions in tables that you aren't going to
discuss and/or mention in the paper's body
q  If it's not important enough to mention in the
paper, it's not important enough to be in a table

76
Summary of Today [Part 1]
n  Irrelevant regressors and multi-
collinearity do not cause bias
q  But, can inflate standard errors
q  So, avoid adding unnecessary controls

n  Heteroskedastic variance doesn't cause bias


q  Just means the default standard errors for
hypothesis testing are incorrect
q  Use 'robust' standard errors (if larger)

77
Summary of Today [Part 2]
n  Interactions and binary variables
can help us get a causal CEF
q  But, if you want to interpret non-interacted
indicators it is helpful to demean continuous var.

n  When writing up regression results


q  Make sure you put key items in your tables
q  Make sure to talk about both economic and
statistical significance of estimates

78
In First Half of Next Class
n  Discuss causality and potential biases
q  Omitted variable bias
q  Measurement error bias
q  Simultaneity bias

n  Relevant readings; see syllabus

79
Assign papers for next week…
n  Fazzari, et al (BPEA 1988) These classic papers
in finance that use
q  Finance constraints & investment rather simple
estimations and
n  Morck, et al (BPEA 1990) 'identification' was
not a foremost
q  Stock market & investment concern

n  Opler, et al (JFE 1999) Do your best to think


about their potential
q  Corporate cash holdings weaknesses…

80
Break Time
n  Let's take our 10 minute break
n  We'll do presentations when we get back

81
FNCE 926
Empirical Methods in CF
Lecture 3 – Causality

Professor Todd Gormley


Announcement
n  You should have uploaded Exercise #1
and your DO files to Canvas already
n  Exercise #2 due two weeks from today

2
Background readings for today
n  Roberts-Whited
q  Section 2
n  Angrist and Pischke
q  Section 3.2
n  Wooldridge
q  Sections 4.3 & 4.4
n  Greene
q  Sections 5.8-5.9

3
Outline for Today
n  Quick review
n  Motivate why we care about causality
n  Describe three possible biases & some
potential solutions
q  Omitted variable bias
q  Measurement error bias
q  Simultaneity bias

n  Student presentations of "Classics #2"

4
Quick Review [Part 1]
n  Why is adding irrelevant regressors a
potential problem?
q  Answer = It can inflate standard errors if the
irrelevant regressors are highly collinear with
variable of interest

n  Why is a larger sample helpful?


q  Answer = It gives us more variation in x,
which helps lower our standard errors

5
Quick Review [Part 2]
n  Suppose, β1 < 0 and β3 > 0 … what is the
sign of the effect of an increase in x1 for the
average firm in the below estimation?
y = β 0 + β1 x1 + β 2 x2 + β3 x1 x2 + u
q  Answer: It is the sign of
dy x2 = x2
| = β1 + β3 x2
dx1

6
Quick Review [Part 3]
n  How could we make the coefficients easier
to interpret in the prior example?
q  Shift all the variables by subtracting out their
sample mean before doing the estimation
q  It will allow the non-interacted coefficients to be
interpreted as effect for average firm

7
Quick Review [Part 4]
n  Consider the following estimate:
ln( wage) = 0.32 − 0.11 female + 0.21married
−0.30 ( female × married ) + 0.08education

q  Question: How much lower are wages of married


and unmarried females after controlling for
education, and who is this relative to?
n  Answer = unmarried females make 11% less than single
males; married females make –11%+21%–30%=20% less

8
Outline for Today
n  Quick review
n  Motivate why we care about causality
n  Describe three possible biases & some
potential solutions
q  Omitted variable bias
q  Measurement error bias
q  Simultaneity bias

n  Student presentations of "Classics #2"

9
Motivation
n  As researchers, we are interested in
making causal statements
q  Ex. #1 – what is the effect of a change in
corporate taxes on firms' leverage choice?
q  Ex. #2 – what is the effect of giving a CEO
more stock ownership in the firm on the
CEO's desire to take on risky investments?

n  I.e. we don't like to just say variables are


'associated' or 'correlated' with each other

10
What do we mean by causality?
n  Recall from earlier lecture, that if our linear
model is the following…

y = β 0 + β1 x1 + ... + β k xk + u

And, we want to infer β1 as the causal


effect of x1 on y, holding all else equal, then
we need to make the following
assumptions…

11
The basic assumptions
n  Assumption #1: E(u) = 0
n  Assumption #2: E(u|x1,…,xk) = E(u)
q  In words, average of u (i.e. unexplained portion
of y) does not depend on value of x
q  This is "conditional mean independence" (CMI)
n  Generally speaking, you need the estimation
error to be uncorrelated with all the x's

12
Tangent – CMI versus correlation
n  CMI (which implies x and u are
uncorrelated) is needed for unbiasedness
[which is again a finite sample property]
n  But, we only need to assume a zero
correlation between x and u for consistency
[which is a large sample property]
q  This is why I'll typically just refer to whether
u and x are correlated in my test of whether
we can make causal inferences

13
Three main ways this will be violated
n  Omitted variable bias
n  Measurement error bias
n  Simultaneity bias

n  Now, let's go through each in turn…

14
Omitted variable bias (OVB)
n  Probably the most common concern you
will hear researchers worry about
n  Basic idea = the estimation error, u,
contains other variable, e.g. z, that affects
y and is correlated with an x
q  Please note! The omitted variable is only
problematic if correlated with an x

15
OVB more formally, with one variable
n  You estimate: y = β 0 + β1 x + u
n  But, true model is: y = β 0 + β1 x + β 2 z + v

n  Then, βˆ1 = β1 + δ xz β2 , where δ xz is the


coefficient you'd get from regressing the
omitted variable, z, on x; and

cov( x, z )
δ xz =
var( x)

16
Interpreting the OVB formula
β̂1 = β1 + [cov(x, z)/var(x)]·β2
where β1 is the effect of x on y, β2 is the effect of z on y,
cov(x, z)/var(x) is the regression of z on x, and the
second term as a whole is the bias

n  Easy to see, estimated coefficient is only unbiased


if cov(x, z) = 0 [i.e. x and z are uncorrelated] or z
has no effect on y [i.e. β2 = 0]

17
Direction and magnitude of the bias
β̂1 = β1 + [cov(x, z)/var(x)]·β2

n  Direction of bias given by signs of β2, cov(x, z)


q  E.g. If know z has positive effect on y [i.e. β2 > 0]
and x and z are positively correlated [cov(x, z) > 0],
then the bias will be positive

n  Magnitude of the bias will be given by


magnitudes of β2, cov(x, z)/var(x)

18
Example – One variable case
n  Suppose we estimate: ln( wage) = β 0 + β1educ + w
n  But, true model is:
ln( wage) = β 0 + β1educ + β 2 ability + u

n  What is likely bias on β̂1 ? Recall,

β̂1 = β1 + [cov(educ, ability)/var(educ)]·β2

19
Example – Answer
q  Ability & wages likely positively correlated, so β2 > 0
q  Ability & education likely positive correlated, so
cov(educ, ability) > 0
q  Thus, the bias is likely to positive! βˆ is too big!
1

20
OVB – General Form
n  Once move away from simple case of just one
omitted variable, determining sign (and
magnitude) of bias will be a lot harder
q  Let β be vector of coefficients on k included variables
q  Let γ be vector of coefficient on l excluded variables
q  Let X be matrix of observations of included variables
q  Let Z be matrix of observations of excluded variables

β̂ = β + E[X'X]⁻¹E[X'Z]γ

21
OVB – General Form, Intuition
β̂ = β + E[X'X]⁻¹E[X'Z]γ
where E[X'X]⁻¹E[X'Z] is the matrix of regression coefficients
(from regressing the excluded variables on the included ones)
and γ is the vector of partial effects of the excluded variables

n  Same idea as before, but more complicated


n  Frankly, this can be a real mess!
[See Gormley and Matsa (2014) for example with
just two included and two excluded variables]

22
Eliminating Omitted Variable Bias

n  How we try to get rid of this bias will


depend on the type of omitted variable
q  Observable omitted variable
q  Unobservable omitted variable

How can we deal with an


observable omitted variable?

23
Observable omitted variables
n  This is easy! Just add them as controls
q  E.g. if the omitted variable, z, in my simple case
was 'leverage', then add leverage to regression
n  A functional form misspecification is a special
case of an observable omitted variable
Let's now talk about this…

24
Functional form misspecification
n  Assume true model is…
y = β0 + β1x1 + β2x2 + β3x2² + u
n  But, we omit the squared term, x2²
q  Just like any OVB, bias on (β0, β1, β2) will
depend on β3 and correlations among (x1, x2, x2²)
q  You get same type of problem if have incorrect
functional form for y [e.g. it should be ln(y) not y]
n  In some sense, this is minor problem… Why?

25
Tests for correction functional form
n  You could add additional squared and
cubed terms and look to see whether
they make a difference and/or have
non-zero coefficients
n  This isn't as easy when the possible
models are not nested…

26
Non-nested functional form issues…
n  Two non-nested examples are:
y = β0 + β1x1 + β2x2 + u
versus
y = β0 + β1ln(x1) + β2ln(x2) + u

y = β0 + β1x1 + β2x2 + u
versus
y = β0 + β1x1 + β2z + u

[Let's use the first example and see how we can
try to figure out which is right]

27
Davidson-MacKinnon Test [Part 1]
n  To test which is correct, you can try this…
q  Take fitted values, ŷ , from 1st model and add them
as a control in 2nd model
y = β0 + β1 ln( x1 ) + β2 ln( x2 ) + θ1 yˆ + u
q  Look at t-stat on θ1; if significant rejects 2nd model!
q  Then, do reverse, and look at t-stat on θ1 in
y = β 0 + β1 x1 + β 2 x2 + θ1 yˆˆ + u
where ŷˆ is predicted value from 2nd model… if
significant then 1st model is also rejected L

28
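A sketch of how the two steps might look in Stata (y, x1, and x2 are hypothetical, and lnx1, lnx2 are assumed to be their pre-computed logs):

  * step 1: fitted values from the levels model, added to the log model
  reg y x1 x2
  predict yhat1, xb
  reg y lnx1 lnx2 yhat1      // significant t-stat on yhat1 rejects the log model
  * step 2: the reverse
  reg y lnx1 lnx2
  predict yhat2, xb
  reg y x1 x2 yhat2          // significant t-stat on yhat2 rejects the levels model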
Davidson-MacKinnon Test [Part 2]
n  Number of weaknesses to this test…
q  A clear winner may not emerge
n  Both might be rejected
n  Both might be accepted [If this happens, you can
use the R2 to choose which model is a better fit]
q  And, rejecting one model does NOT imply
that the other model is correct

29
Bottom line advice on functional form
n  Practically speaking, you hope that changes
in functional form won't affect coefficients
on key variables very much…
q  But, if it does… You need to think hard about
why this is and what the correct form should be
q  The prior test might help with that…

30
Eliminating Omitted Variable Bias

n  How we try to get rid of this bias will


depend on the type of omitted variable
q  Observable omitted variable
q  Unobservable omitted variable

Unobservable are much harder to deal with,


but one possibility is to find a proxy variable

31
Unobserved omitted variables
n  Again, consider earlier estimation

ln( wage) = β 0 + β1educ + β 2 ability + u

q  Problem: we don't observe & can't measure ability


q  What can we do? Ans. = Find a proxy variable that
is correlated with the unobserved variable, E.g. IQ

32
Proxy variables [Part 1]
n  Consider the following model…
y = β0 + β1x1 + β2x2 + β3x3* + u
where x3* is unobserved, but we have proxy x3
n  Then, suppose x3* = δ0 + δ1x3 + v
q  v is error associated with proxy's imperfect
representation of unobservable x3
q  Intercept just accounts for different scales
[e.g. ability has different average value than IQ]

33
Proxy variables [Part 2]
n  If we are only interested in β1 or β2, we can just
replace x3* with x3 and estimate
y = β 0 + β1 x1 + β 2 x2 + β3 x3 + u

n  But, for this to give us consistent estimates of β1


and β2 , we need to make some assumptions
#1 – We've got the right model, and
#2 – Other variables don't explain unobserved
variable after we've accounted for our proxy

34
Proxy variables – Assumptions
#1 – E(u | x1, x2, x3*) = 0; i.e. we have the right
model and x3 would be irrelevant if we could
control for x1, x2, x3*, such that E(u | x3) = 0
q  This is a common assumption; not controversial

#2 – E(v | x1, x2, x3) = 0; i.e. x3 is a good proxy
for x3* such that, after controlling for x3, x3*
doesn't depend on x1 or x2
q  I.e. E(x3* | x1, x2, x3) = E(x3* | x3)

35
Why the proxy works…
n  Recall true model: y = β 0 + β x
1 1 + β x
2 2 + β 3 3 +u
x *

n  Now plug-in for x3*, using x3* = δ 0 + δ1 x3 + v

y = ( β0 + β3δ 0 ) + β1x1 + β 2 x2 + ( β3δ 1 ) x3 + ( u + β3v )


!#"#$ % ! #"# $
α0 α1 e

q  Prior assumptions ensure that E (e | x1 , x2 , x3 ) = 0


such that the estimates of (α 0 , β1 , β 2 , α1 ) are consistent
q  Note: β0 and β3 are not identified

36
Proxy assumptions are key [Part 1]
n  Suppose assumption #2 is wrong such that
x = δ 0 + δ 1x3 + γ 1x1 + γ 2 x2 + w
*
3
!##"## $
v
where E ( w | x1 , x2 , x3 ) = 0

q  If above is true, E (v | x1 , x2 , x3 ) ≠ 0, and if you


substitute into model of y, you'd get…

37
Proxy assumptions are key [Part 2]
n  Plugging in for x3*, you'd get
y = α0 + α1x1 + α2x2 + α3x3 + e
where α0 = β0 + β3δ0
α1 = β1 + β3γ1
α2 = β2 + β3γ2
α3 = β3δ1
[E.g. α1 captures the effect of x1 on y, β1, but also
its correlation with the unobserved variable]

n  We'd get consistent estimates of (α0, α1, α2, α3)…
but that isn't what we want!

38
Proxy variables – Example #1
n  Consider earlier wage estimation
ln( wage) = β 0 + β1educ + β 2 ability + u

q  If use IQ as proxy for unobserved ability, what


assumption must we make? Is it plausible?
n  Answer: We assume E (ability | educ, IQ) = E (ability | IQ) ,
i.e. average ability does not change with education after
accounting for IQ… Could be questionable assumption!

39
Proxy variables – Example #2
n  Consider Q-theory of investment
investment = β 0 + β1Q + u

q  Can we estimate β1 using a firm's market-to-book


ratio (MTB) as proxy for Q? Why or why not?
n  Answer: Even if we believe this is the correct model
(Assumption #1) or that Q only depends on MTB
(Assumption #2), e.g. Q=δ0+δ1MTB, we are still not
getting estimate of β1… see next slide for the math

40
Proxy variables – Example #2 [Part 2]
n  Even if assumptions held, we'd only be getting
consistent estimates of
investment = α0 + α1MTB + e
where α0 = β0 + β1δ0
α1 = β1δ1

q  While we can't get β1, is there something we can


get if we make assumptions about sign of δ1?
q  Answer: Yes, the sign of β1

41
Proxy variables – Summary

n  If the coefficient on the unobserved variable


isn't what we are interested in, then a proxy
for it can be used to identify and remove
OVB from the other parameters
q  Proxy can also be used to determine sign of
coefficient on unobserved variable

42
Random Coefficient Model
n  So far, we've assumed that the effect of x on y
(i.e. β) was the same for all observations
q  In reality, this is unlikely true; model might look
more like yi = αi + βixi + ui, where
αi = α + ci, βi = β + di, and E(ci) = E(di) = 0
[I.e. each observation's relationship between x
and y is slightly different]

q  α is the average intercept and β is what we call the


"average partial effect" (APE)

43
Random Coefficient Model [Part 2]
n  Regression would seem to be incorrectly
specified, but if willing to make assumptions,
we can identify the APE
q  Plug in for α and β:
yi = α + βxi + (ci + dixi + ui)
[If you like, can think of the unobserved differential
intercepts and slopes as an omitted variable]
q  Identification requires
E(ci + dixi + ui | x) = 0
What does this imply?

44
Random Coefficient Model [Part 3]
n  This amounts to requiring
E ( ci | x ) = E ( ci ) = 0 ⇒ E (α i | x ) = E (α i )
E ( di | x ) = E ( di ) = 0 ⇒ E ( βi | x ) = E ( βi )

q  We must assume that the individual slopes and


intercepts are mean independent (i.e. uncorrelated
with the value of x) in order to estimate the APE
n  I.e. knowing x, doesn't help us predict the
individual's partial effect

45
Random Coefficient Model [Part 4]
n  Implications of APE
q  Be careful interpreting coefficients when
you are implicitly arguing elsewhere in paper
that effect of x varies across observations
n  Keep in mind the assumption this requires
n  And, describe results using something like…
"we find that, on average, an increase in x
causes a β change in y"

46
Three main ways this will be violated
n  Omitted variable bias
n  Measurement error bias
n  Simultaneity bias

47
Measurement error (ME) bias
n  Estimation will have measurement error whenever
we measure the variable of interest imprecisely
q  Ex. #1: Altman-z-score is noisy measure of default risk
q  Ex. #2: Avg. tax rate is noisy measure of marg. tax rate

n  Such measurement error can cause bias, and


the bias can be quite complicated

48
Measurement error vs. proxies
n  Measurement error is similar to proxy variable,
but very different conceptually
q  Proxy is used for something that is entirely
unobservable or unmeasurable (e.g. ability)
q  With measurement error, the variable we don't
observe is well-defined and can be quantified… it's
just that our measure of it contains error

49
ME of Dep. Variable [Part 1]
n  Usually not a problem (in terms of bias); just
causes our standard errors to be larger. E.g. …
q  Let y* = β0 + β1x1 + ... + βkxk + u
q  But, we measure y* with error, e = y − y*
q  Because we only observe y, we estimate
y = β0 + β1x1 + ... + βkxk + (u + e)

Note: we always assume E(e)=0; this


is innocuous because if untrue, it
only affects the bias on the constant

50
ME of Dep. Variable [Part 2]
n  As long as E(e|x)=0, the OLS estimates
are consistent and unbiased
q  I.e. as long as the measurement error of y is
uncorrelated with the x's, we're okay
q  Only issue is that we get larger standard errors
when e and u are uncorrelated [which is what
we typically assume] because Var(u+e)>Var(u)

What are some common examples of ME?

51
ME of Dep. Variable [Part 3]
n  Some common examples
q  Market leverage – typically use book value
of debt because market value hard to observe
q  Firm value – again, hard to observe market
value of debt, so we use book value
q  CEO compensation – value of options are
approximated using Black-Scholes

Is assuming e and x are uncorrelated plausible?

52
ME of Dep. Variable [Part 4]
n  Answer = Maybe… maybe not
q  Ex. – Firm leverage is measured with error; hard to
observe market value of debt, so we use book value
n  But, the measurement error is likely to be larger when firm's
are in distress… Market value of debt falls; book value doesn't
n  This error could be correlated with x's if it includes things like
profitability (i.e. ME larger for low profit firms)
n  This type of ME will cause inconsistent estimates

53
ME of Independent Variable [Part 1]
n  Let's assume the model is y = β 0 + β1 x *
+u
n  But, we observe x* with error, e = x − x*
q  We assume that E(y|x*, x) = E(y|x*) [i.e. x
doesn't affect y after controlling for x*; this is
standard and uncontroversial because it is just
stating that we've written the correct model]

n  What are some examples in CF?

54
ME of Independent Variable [Part 2]
n  There are lots of examples!
q  Average Q measures marginal Q with error
q  Altman-z score measures default prob. with error
q  GIM, takeover provisions, etc. are all just noisy
measures of the nebulous "governance" of firm

Will this measurement error cause bias?

55
ME of Independent Variable [Part 2]
n  Answer depends crucially on what we assume
about the measurement error, e
n  Literature focuses on two extreme assumptions
#1 – Measurement error, e, is uncorrelated
with the observed measure, x
#2 – Measurement error, e, is uncorrelated
with the unobserved measure, x*

56
Assumption #1: e uncorrelated with x
n  Substituting x* with what we actually
observe, x* = x – e, into true model, we have
y = β 0 + β1 x + u − β1e
q  Is there a bias?
n  Answer = No. x is uncorrelated with e by assumption,
and x is uncorrelated with u by earlier assumptions

q  What happens to our standard errors?


n  Answer = They get larger; error variance is now σ u2 + β12σ e2

57
Assumption #2: e uncorrelated with x*
n  We are still estimating y = β 0 + β1 x + u − β1e ,
but now, x is correlated with e
q  e uncorrelated with x* guarantees e is correlated
with x; cov(x, e) = E(xe) = E(x*e) + E(e²) = σe²
q  I.e. an independent variable will be correlated with
the error… we will get biased estimates!

n  This is what people call the Classical Error-


in-Variables (CEV) assumption

58
CEV with 1 variable = attenuation bias
n  If work out math, one can show that the
estimate of β1, βˆ1 , in prior example (which
had just one independent variable) is…
⎛ σ x2* ⎞ This scaling
p lim( βˆ1 ) = β1 ⎜ 2
⎜ σ * + σ e2 ⎟⎟
factors is always
⎝ x ⎠
between 0 and 1
q  The estimate is always biased towards zero; i.e. it
is an attenuation bias
n  And, if variance of error, σ e2, is small, then attenuation
bias won't be that bad

59
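A quick simulation sketch of the attenuation result (all numbers are made up): with σ²x* = σ²e = 1, the scaling factor is 1/2, so a true coefficient of 2 should be estimated near 1.

  clear
  set obs 10000
  set seed 12345
  gen xstar = rnormal()              // true (unobserved) regressor
  gen y = 1 + 2*xstar + rnormal()    // true beta1 = 2
  gen x = xstar + rnormal()          // CEV: error uncorrelated with xstar
  reg y x                            // slope should be close to 1, not 2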
Measurement error… not so bad?
n  Under current setup, measurement error
doesn't seem so bad…
q  If error uncorrelated with observed x, no bias
q  If error uncorrelated with unobserved x*, we
get an attenuation bias… so at least the sign
on our coefficient of interest is still correct

n  Why is this misleading?

60
Nope, measurement error is bad news
n  Truth is, measurement error is
probably correlated a bit with both the
observed x and unobserved x*
q  I.e… some attenuation bias is likely

n  Moreover, even in CEV case, if there


is more than one independent variable,
the bias gets horribly complicated…

61
ME with more than one variable
n  If estimating y = β 0 + β1 x1 + ... + β k xk + u , and
just one of the x's is mismeasured, then…
q  ALL the β's will be biased if the mismeasured
variable is correlated with any other x
[which presumably is true since it was included!]
q  Sign and magnitude of biases will depend on all
the correlations between x's; i.e. big mess!
n  See Gormley and Matsa (2014) math for AvgE
estimator to see how bad this can be

62
ME example
n  Fazzari, Hubbard, and Petersen (1988) is
classic example of a paper with ME problem
q  Regresses investment on Tobin's Q (it's measure
of investment opportunities) and cash
q  Finds positive coefficient on cash; argues there
must be financial constraints present
q  But Q is noisy measure; all coefficients are biased!

n  Erickson and Whited (2000) argues the pos.


coeff. disappears if you correct the ME

63
Three main ways this will be violated
n  Omitted variable bias
n  Measurement error bias
n  Simultaneity bias

64
Simultaneity bias
n  This will occur whenever any of the supposedly
independent variables (i.e. the x's) can be
affected by changes in the y variable; E.g.
y = β 0 + β1 x + u
x = δ 0 + δ1 y + v
q  I.e. changes in x affect y, and changes in y affect x;
this is the simplest case of reverse causality
q  An estimate of y = β0 + β1 x + u will be biased…

65
Simultaneity bias continued…
n  To see why estimating y = β 0 + β1 x + u won't
reveal the true β1, solve for x
x = δ0 + δ1y + v
x = δ0 + δ1(β0 + β1x + u) + v
x = (δ0 + δ1β0)/(1 − δ1β1) + v/(1 − δ1β1) + [δ1/(1 − δ1β1)]·u

q  Easy to see that x is correlated with u! I.e. bias!

66
Simultaneity bias in other regressors
n  Prior example is case of reverse causality; the
variable of interest is also affected by y
n  But, if y affects any x, their will be a bias; E.g.
y = β 0 + β1 x1 + β 2 x2 + u
x2 = γ 0 + γ 1 y + w
q  Easy to show that x2 is correlated with u; and there
will be a bias on all coefficients
q  This is why people use lagged x's

67
"Endogeneity" problem – Tangent
n  In my opinion, the prior example is
what it means to have an "endogeneity"
problem or an "endogenous" variable
q  But, as I mentioned earlier, there is a lot of
misusage of the word "endogeneity" in
finance… So, it might be better just saying
"simultaneity bias"

68
Simultaneity Bias – Summary
n  If your x might also be affected by the y
(i.e. reverse causality), you won't be able to
make causal inferences using OLS
q  Instrumental variables or natural experiments
will be helpful with this problem

n  Also can't get causal estimates with OLS if


controls are affected by the y

69
"Bad controls"
n  Similar to simultaneity bias… this is when
one x is affected by another x; e.g.
y = β 0 + β1 x1 + β 2 x2 + u
x2 = γ 0 + γ 1 x1 + v
q  Angrist-Pischke call this a "bad control", and it
can introduce a subtle selection bias when
working with natural experiments
[we will come back to this in later lecture]

70
"Bad Controls" – TG's Pet Peeve
n  But just to preview it… If you have an x
that is truly exogenous (i.e. random) [as you
might have in natural experiment], do not put
in controls, that are also affected by x!
q  Only add controls unaffected by x, or just
regress your various y's on x, and x alone!

We'll revisit this in later lecture…

71
What is Selection Bias?
n  Easiest to think of it just as an omitted
variable problem, where the omitted
variable is the unobserved counterfactual
q  Specifically, error, u, contains some unobserved
counterfactual that is correlated with whether
we observe certain values of x
q  I.e. it is a violation of the CMI assumption

72
Selection Bias – Example
n  Mean health of hospital visitors = 3.21
n  Mean health of non-visitors = 3.93
q  Can we conclude that going to the hospital
(i.e. the x) makes you less healthy?
n  Answer = No. People going to the hospital are
inherently less healthy [this is the selection bias]
n  Another way to say this: we fail to control for what
health outcomes would be absent the visit, and this
unobserved counterfactual is correlated with going
to hospital or not [i.e. omitted variable]

73
Selection Bias – More later
n  We'll treat it more formally later when
we get to natural experiments

74
Summary of Today [Part 1]
n  We need conditional mean independence
(CMI), to make causal statements
n  CMI is violated whenever an independent
variable, x, is correlated with the error, u
n  Three main ways this can be violated
q  Omitted variable bias
q  Measurement error bias
q  Simultaneity bias

75
Summary of Today [Part 2]
n  The biases can be very complex
q  If more than one omitted variable, or omitted
variable is correlated with more than one
regressor, sign of bias hard to determine
q  Measurement error of an independent
variable can (and likely does) bias all
coefficients in ways that are hard to determine
q  Simultaneity bias can also be complicated

76
Summary of Today [Part 3]
n  To deal with these problems, there are
some tools we can use
q  E.g. Proxy variables [discussed today]
q  We will talk about other tools later, e.g.
n  Instrumental variables
n  Natural experiments
n  Regression discontinuity

77
In First Half of Next Class
n  Before getting to these other tools, will first
discuss panel data & unobserved heterogeneity
q  Using fixed effects to deal with unobserved variables
n  What are the benefits? [There are many!]
n  What are the costs? [There are some…]

q  Fixed effects versus first differences


q  When can FE be used?

n  Related readings: see syllabus

78
Assign papers for next week…
n  Rajan and Zingales (AER 1998)
q  Financial development & growth

n  Matsa (JF 2010)


q  Capital structure & union bargaining

n  Ashwini and Matsa (JFE 2013)


q  Labor unemployment risk & corporate policy

79
Break Time
n  Let's take our 10 minute break
n  We'll do presentations when we get back

80
FNCE 926
Empirical Methods in CF
Lecture 4 – Panel Data

Professor Todd Gormley


Announcements
n  Exercise #2 is due next week
q  You can download it from Canvas
q  Largely just has you manipulate panel data
n  Please upload both completed DO file and
typed solutions to Canvas [don't e-mail]

2
Background readings
n  Angrist and Pischke
q  Sections 5.1, 5.3
n  Wooldridge
q  Chapter 10 and Sections 13.9.1, 15.8.2, 15.8.3
n  Greene
q  Chapter 11

3
Outline for Today
n  Quick review
n  Motivate how panel data is helpful
q  Fixed effects model
q  Random effects model
q  First differences
q  Lagged y models

n  Student presentations of “Causality”

4
Quick Review [Part 1]
n  What is the key assumption needed for us
to make causal inferences? And what are
the ways in which it can be violated?
q  Answer = CMI is violated whenever an
independent variable, x, is correlated with the
error, u. This occurs when there is…
n  Omitted variable bias
n  Measurement error bias
n  Simultaneity bias

5
Quick Review [Part 2]
n  When is it possible to determine the sign of
an omitted variable bias?
q  Answer = Basically, when there is just one
OMV that is correlated with just one of the x's;
other scenarios are much more complicated

6
Quick Review [Part 3]
n  When is measurement error of the
dependent variable problematic (for
identifying the causal CEF)?
q  Answer = If error is correlated with any x.

7
Quick Review [Part 4]
n  What is the bias on the coefficient of x,
and on other coefficients when an indep-
endent variable, x, is measured with error?
q  Answer = Hard to know!
n  If ME is uncorrelated with observed x, no bias
n  If ME is uncorrelated with unobserved x*, the
coefficient on x has an attenuation bias, but the
sign of the bias on all other coefficients is unclear

8
Quick Review [Part 5]
n  When will an estimation suffer from
simultaneity bias?
q  Answer = If we can think of any x as a
potential outcome variable; i.e. we think y
might directly affect an x

9
Outline for Panel Data
n  Motivate how panel data is helpful
n  Fixed effects model
q  Benefits [There are many]
q  Costs [There are some…]

n  Random effects model


n  First differences
n  Lagged y models

10
Motivation [Part 1]
n  As noted in prior lecture, omitted
variables pose a substantial hurdle in
our ability to make causal inferences
n  What's worse… many of them are
inherently unobservable to researchers

11
Motivation [Part 2]
n  E.g. consider a the firm-level estimation
leveragei , j ,t = β 0 + β1 profiti , j ,t −1 + ui , j ,t
where leverage is debt/assets for firm i,
operating in industry j in year t, and profit is
the firms net income/assets

What might be some unobservable


omitted variables in this estimation?

12
Motivation [Part 3]
n  Oh, there are so, so many…
q  Managerial talent and/or risk aversion
q  Industry supply and/or demand shock
Sadly, this is
q  Cost of capital easy to do with
q  Investment opportunities other dependent
or independent
q  And so on… variables…

n  Easy to think of ways these might be affect


leverage and be correlated with profits

13
Motivation [Part 4]
n  Using observations from various
geographical regions (e.g. state or country)
opens up even more possibilities…
q  Can you think of some unobserved variables
that might be related to a firm's location?
n  Answer: any unobserved differences in local economic
environment, e.g. institutions, protection of property
rights, financial development, investor sentiment,
regional demand shocks, etc.

14
Motivation [Part 5]
n  Sometimes, we can control for these
unobservable variables using proxy variables
q  But, what assumption was required for a
proxy variable to provide consistent
estimates on the other parameters?
n  Answer: It needs to be a sufficiently good proxy such
that the unobserved variable can't be correlated with
the other explanatory variables after we control for
the proxy variable… This might be hard to find

15
Panel data to the rescue…
n  Thankfully, panel data can help us with a
particular type of unobserved variable…

q  What type of unobserved variable does


panel data help us with, and why?
q  Answer = It helps us with time-invariant
omitted variables; now, let's see why…
[Actually, it helps with any unobserved variable that
doesn't vary within groups of observations]

16
Outline for Panel Data
n  Motivate how panel data is helpful
n  Fixed effects model
q  Benefits [There are many]
q  Costs [There are some…]

n  Random effects model


n  First differences
n  Lagged y models

17
Panel data
n  Panel data = whenever you have multiple
observations per unit of observation i (e.g.
you observe each firm over multiple years)
q  Let's assume N units i
q  And, T observations per unit i [i.e. balanced panel]
n  Ex. #1 – You observe 5,000 firms in Compustat
over a twenty year period [i.e. N=5,000, T=20]
n  Ex. #2 – You observe 1,000 CEOs in Execucomp
over a 10 year period [i.e. N=1,000, T=10]

18
Time-invariant unobserved variable
n  Consider the following model… Unobserved,
time-invariant
yi ,t = α + β xi ,t + δ fi + ui ,t variable, f

where E (ui ,t ) = 0 These implies what?


Answer: If don't control
corr ( xi ,t , f i ) ≠ 0 for f, we have OVB, but if
corr ( fi , ui ,t ) = 0 could, then we wouldn't

corr ( xi ,t , ui , s ) = 0 for all s, t

Note: This is stronger assumption then we usually make; it's


called strict exogeneity. In words, this assumption means what?

19
If we ignore f, we get OVB
n  If estimate the model…
yi,t = α + β xi,t + vi,t
!
δ f i +ui ,t

q  x is correlated with the disturbance v (through


it's correlation with the unobserved variable, f,
which is now part of the disturbance)

ˆ σ xf This is standard OVB…


q  Easy to show β = β + δ 2
OLS

σx coefficient from regression


of omitted var., f, on x
times the true coeff. on f

20
Can solve this by transforming data
n  First, notice that if you take the population
mean of the dependent variable for each
unit of observation, i, you get…
ȳi = α + βx̄i + δfi + ūi
[Again, I assumed there are T obs. per unit i]
where
ȳi = (1/T)Σt yi,t ,  x̄i = (1/T)Σt xi,t ,  ūi = (1/T)Σt ui,t

21
Transforming data [Part 2]
n  Now, if we subtract yi from yi ,t , we have

yi ,t − yi = β ( xi ,t − xi ) + ( ui ,t − ui )

q  And look! The unobserved variable, fi, is gone


(as is the constant) because it is time-invariant
q  With our assumption of strict exogeneity earlier,
easy to see that (xi,t − x̄i) is uncorrelated with the
new disturbance, (ui,t − ūi), which means…
?

22
Fixed Effects (or Within) Estimator
n  Answer: OLS estimation of transformed
model will yield a consistent estimate of β
n  The prior transformation is called the
“within transformation” because it
demeans all variables within their group
q  In this case, the “group” was each cross-section
of observations over time for each firm
q  This is also called the FE estimator

23
Unobserved heterogeneity – Tangent
n  Unobserved variable, f, is very general
q  Doesn't just capture one unobserved
variable; captures all unobserved variables
that don't vary within the group
q  This is why we often just call it
“unobserved heterogeneity”

24
FE Estimator – Practical Advice
n  When you use the fixed effects (FE)
estimator in programs like Stata, it does
the within transformation for you
n  Don't do it on your own because…
q  The degrees of freedom(doF) (which are used
to get the standard errors) sometimes need to be
adjusted down by the number of panels, N
q  What adjustment is necessary depends on
whether you cluster, etc.

25
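For reference, a minimal Stata sketch contrasting the two routes (firm, year, y, and x are hypothetical variable names):

  * preferred: let xtreg do the within transformation (and handle the DoF)
  xtset firm year
  xtreg y x, fe
  * demeaning by hand gives the same coefficient, but the default SEs from
  * this regression ignore the estimated group means and can be off
  egen ybar = mean(y), by(firm)
  egen xbar = mean(x), by(firm)
  gen y_dm = y - ybar
  gen x_dm = x - xbar
  reg y_dm x_dm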
Least Squares Dummy Variable (LSDV)
n  Another way to do the FE estimation is
by adding indicator (dummy) variables
q  Notice that the coefficient on fi, δ, doesn't
really have any meaning; so, can just rescale
the unobserved fi to make it equal to 1
yi ,t = α + β xi ,t + fi + ui ,t
q  Now, to estimate this, we can just treat each
fi as a parameter to be estimated

26
LSDV continued…
n  I.e. create a dummy variable for each
group i, and add it to the regression
q  This is least squares dummy variable model
q  Now, our estimation equation exactly matches
the true underlying model
yi ,t = α + β xi ,t + fi + ui ,t
q  We get consistent estimates and SE that are
identical to what we'd get with within estimator

27
LSDV – Practical Advice
n  Because the dummy variables will be
collinear with the constant, one of them
will be dropped in the estimation
q  Therefore, don't try to interpret the intercept;
it is just the average y when all the x's are
equal to zero for the group corresponding to
the dropped dummy variable
q  In xtreg, fe, the reported intercept is just
average of individual specific intercepts

28
LSDV versus FE [Part 1]
n  Can show that LSDV and FE are identical,
using partial regression results [How?]
q  Remember, to control for some variable z, we can
regress y onto both x and z, or we can just partial
z out from both y and x before regressing y on x
(i.e. regress residuals from regression of y on z
onto residual from regression of x on z)
q  The demeaned variables are the residuals from a
regression of them onto the group dummies!

29
LSDV versus FE [Part 2]
n  Reported R2 will be larger with LSDV
q  All the dummy variables will explain a lot of the
variation in y, driving up R2
q  Within R2 reported for FE estimator just reports
what proportion of the within variation in y that is
explained by the within variation in x
q  The within R2 is usually of more interest to us

30
R-squared with FE – Practical Advice
n  The within R2 is usually of more interest
since it describes explanatory power of x's
[after partialling out the FE]
q  The get within R2, use xtreg, fe
n  Reporting overall adjusted-R2 is also useful
q  To get overall adjusted-R2, use areg command
instead of xtreg, fe. The “overall R2” reported
by xtreg does not include variation explained
by FE, but the R2 reported by areg does

31
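A short Stata sketch of the two commands (hypothetical variables y, x, firm):

  * within R-squared: share of within-firm variation in y explained by x
  xtreg y x, fe
  * overall adjusted R-squared that also credits the variation explained by the FE
  areg y x, absorb(firm)

Both report the same coefficient on x; the main difference here is how the R² is computed.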
Outline for Panel Data
n  Motivate how panel data is helpful
n  Fixed effects model
q  Benefits [There are many]
q  Costs [There are some…]

n  Random effects model


n  First differences
n  Lagged y models

32
FE Estimator – Benefits [Part 1]
n  There are many benefits of FE estimator
q  Allows for arbitrary correlation between each
fixed effect, fi, and each x within group i
n  I.e. its very general and not imposing much structure on
what the underlying data must look like

q  Very intuitive interpretation; coefficient is


identified using only changes within cross-sections

33
FE Estimator – Benefits [Part 2]
q  It is also very flexible and can help us control for
many types of unobserved heterogeneities
n  Can add year FE if worried about unobserved
heterogeneity across time [e.g. macroeconomic shocks]
n  Can add CEO FE if worried about unobserved
heterogeneity across CEOs [e.g. talent, risk aversion]
n  Add industry-by-year FE if worried about unobserved
heterogeneity across industries over time [e.g. investment
opportunities, demand shocks]

34
FE Estimator – Tangent [Part 1]

n  FE estimator is very general


q  It applies to any scenario where
observations can be grouped together
n  Ex. #1 – Firms can be grouped by industry
n  Ex. #2 – CEOs observations (which may span multiple
firms) can be grouped by CEO-firm combinations

q  Textbook example of grouping units i across time


is just example (though, the most common)

35
FE Estimator – Tangent [Part 2]

n  Once you are able to construct groups, you


can remove any unobserved 'group-level
heterogeneity' by adding group FE
q  Consistency just requires there be a large
number of groups

36
Outline for Panel Data
n  Motivate how panel data is helpful
n  Fixed effects model
q  Benefits [There are many]
q  Costs [There are some…]

n  Random effects model


n  First differences
n  Lagged y models

37
FE Estimator – Costs

n  But, FE estimator also has its costs


q  Can't identify variables that don't vary within group
q  Subject to potentially large measurement error bias
q  Can be hard to estimate in some cases
q  Miscellaneous issues

38
FE Cost #1 – Can't estimate some var.
n  If no within-group variation in the
independent variable, x, of interest, can't
disentangle it from group FE
q  It is collinear with group FE; and will be
dropped by computer or swept out in the
within transformation

39
FE Cost #1 – Example
q  Consider following CEO-level estimation
ln(totalpay)ijt = α + β1ln(firmsize)ijt + β2volatilityijt
+ β3femalei + δt + fi + λj + uijt
n  Ln(totalpay) is for CEO i, firm j, year t
n  Estimation includes year, CEO, and firm FE

q  What coefficient can't be estimated?


n  Answer: β3! Being female doesn’t vary within
the group of each CEO’s observations; i.e. it is
collinear with the CEO fixed effect

40
FE Cost #1 – Practical Advice
n  Be careful of this!
q  Programs like xtreg are good about dropping the
female variable and not reporting an estimate…
q  But, if you create dummy variables yourself and
input them yourself, the estimation might drop one
of them rather than the female indicator
n  I.e. you'll get an estimate for β3, but it has no
meaning! It's just a random intercept value that
depends entirely on the random FE dropped by Stata

41
FE Cost #1 – Any Solution?
n  Instrumental variables can provide a
possible solution for this problem
q  See Hausman and Taylor (Econometrica 1981)
q  We will discuss this next week

42
FE Cost #2 – Measurement error [P1]
n  Measurement error of independent variable
(and resulting biases) can be amplified
q  Think of there being two types of variation
n  Good (meaningful) variation
n  Noise variation because we don't perfectly
measure the underlying variable of interest

q  Adding FE can sweep out a lot of the good


variation; fraction of remaining variation coming
from noise goes up [What will this do?]

43
FE Cost #2 – Measurement error [P2]
n  Answer: Attenuation bias on
mismeasured variable will go up!
q  Practical advice: Be careful in interpreting 'zero'
coefficients on potentially mismeasured
regressors; might just be attenuation bias!
q  And remember, sign of bias on other
coefficients will be generally difficult to know

44
FE Cost #2 – Measurement error [P3]
n  Problem can also apply even when all
variables are perfectly measured [How?]
n  Answer: Adding FE might throw out relevant
variation; e.g. y in firm FE model might respond to
sustained changes in x, rather than transitory
changes [see McKinnish 2008 for more details]
n  With FE you'd only have the transitory variation
leftover; might find x uncorrelated with y in FE
estimation even though sustained changes in x is
most important determinant of y

45
FE Cost #2 – Example
n  Difficult to identify causal effect of credit
shocks on firm output because credit shocks
coincide with demand shocks [i.e. OVB]
q  Paravisini, Rappoport, Schnabl, Wolfenzon
(2014) used product-level export data & shock to
some Peru banks to address this
n  Basically regressed product output on total firm credit,
and added firm, bank, and product×destination FE (i.e.
dummy for selling a product to a particular country!)
n  Found small effect… [Concern?]

46
FE Cost #2 – Example continued
n  Concern = Credit extended to firms may
be measured with error!
q  E.g. some loan originations and payoffs may
not be recorded in timely fashion
q  Need to be careful interpreting a coefficient
from a model with so many FE as “small”
n  Note: This paper is actually very good (and does
IV as well), and the authors are very careful to not
interpret their findings as evidence that financial
constraints only have a “small” effect

47
FE Cost #2 – Any solution?
n  Admittedly, measurement error, in
general, is difficult to address
n  For examples on how to deal with
measurement error, see following papers
q  Griliches and Hausman (JoE 1986)
q  Biorn (Econometric Reviews 2000)
q  Erickson and Whited (JPE 2000, RFS 2012)
q  Almeida, Campello, and Galvao (RFS 2010)

48
FE Cost #3 – Computation issues [P1]
n  Estimating a model with multiple types of
FE can be computationally difficult
q  When more than one type of FE, you cannot
remove both using within-transformation
n  Generally, you can only sweep one away with
within-transformation; other FE dealt with by
adding dummy variable to model
n  E.g. firm and year fixed effects [See next slide]

49
FE Cost #3 – Computation issues [P2]
n  Consider the below model, where δt are year FE
and fi are firm FE:

yi,t = α + βxi,t + δt + fi + ui,t

q  To estimate this in Stata, we'd use a
command something like the following…

xtset firm
xi: xtreg y x i.year, fe

q  xtset firm tells Stata that the panel dimension
is given by the firm variable
q  xi: … i.year tells Stata to create and add dummy
variables for the year variable
q  The fe option tells Stata to remove the FE for
panels (i.e. firms) by doing the within-transformation

50
FE Cost #3 – Computation issues [P3]

n  Dummies not swept away in within-


transformation are actually estimated
q  With year FE, this isn't problem because
there aren't that many years of data
q  If had to estimate 1,000s of firm FE,
however, it might be a problem
n  In fact, this is why we sweep away the firm FE
rather than the year FE; there are more firms!

51
FE Cost #3 – Example
n  But, computational issues is becoming
increasingly more problematic
q  Researchers using larger datasets with many
more complicated FE structures
q  E.g. if you try adding both firm and
industry×year FE, you'll have a problem
n  Estimating 4-digit SIC×year and firm FE in
Compustat requires ≈ 40 GB memory
n  No one has this; hence, no one does it…

52
FE Cost #3 – Any Solution?
n  Yes, there are some potential solutions
q  Gormley and Matsa (2014) discusses some
of these solutions in Section 4
q  We will come back to this in “Common
Limitations and Errors” lecture

53
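As a preview of one such route, a hedged sketch using the user-written reghdfe command (not built into Stata; install it first, and check its help file, since the syntax below is written from memory rather than taken from these notes):

  * ssc install reghdfe
  * absorbs firm FE and industry-by-year FE without building the dummies
  reghdfe y x, absorb(firm industry#year)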
FE – Some Remaining Issues

n  Two more issues worth noting about FE


q  Predicted values of unobserved FE
q  Non-linear estimations with FE and the
incidental parameter problem

54
Predicted values of FE [Part 1]
n  Sometimes, predicted value of
unobserved FE is of interest
n  Can get predicted value using
f̂i = ȳi − β̂x̄i , for all i = 1,..., N
q  E.g. Bertrand and Schoar (QJE 2003) did
this to back out CEO fixed effects
n  They show that the CEO FE are jointly
statistically significant from zero, suggesting
CEOs have 'styles' that affect their firms

55
Predicted values of FE [Part 2]
n  But, be careful with using these predicted
values of the FE
q  They are unbiased, but inconsistent
n  As sample size increases (and we get more
groups), we have more parameters to estimate…
never get the necessary asymptotics
n  We call this the Incidental Parameters Problem

56
Predicted values of FE [Part 3]

q  Moreover, doing an F-test to show they are


statistically different from zero is only valid
under rather strong assumptions
n  Need to assume errors, u, are distributed normally,
homoskedastic, and serially uncorrelated
n  See Wooldridge (2010, Section 10.5.3) and Fee,
Hadlock, and Pierce (2011) for more details

57
Nonlinear models with FE [Part 1]
n  Because we don't get consistent estimates
of the FE, we can't estimate nonlinear
panel data models with FE
q  In practice, Logit, Tobit, Probit should not be
estimated with many fixed effects
q  They only give consistent estimates under
rather strong and unrealistic assumptions

58
Nonlinear models with FE [Part 2]
q  E.g. Probit with FE requires…
n  Unobserved fi to be distributed normally
n  fi and xi,t to be independent
[Why should we believe this to be true?
Almost surely not true in CF]

q  And, Logit with FE requires…
n  No serial correlation of y after conditioning on the
observable x and unobserved f
[Probably unlikely in many CF settings]

q  For more details, see…
n  Wooldridge (2010), Sections 13.9.1, 15.8.2-3
n  Greene (2004) – uses simulation to show how bad

59
Outline for Panel Data
n  Motivate how panel data is helpful
n  Fixed effects model
q  Benefits [There are many]
q  Costs [There are some…]

n  Random effects model


n  First differences
n  Lagged y models

60
Random effects (RE) model [Part 1]
n  Very similar model as FE…
yi ,t = α + β xi ,t + fi + ui ,t

n  But, one big difference…


q  It assumes that unobserved heterogeneity, fi,
and observed x's are uncorrelated
n  What does this imply about consistency of OLS?
n  Is this a realistic assumption in corporate finance?

61
Random effects (RE) model [Part 2]

n  Answer #1 – That assumption means that


OLS would give you consistent estimate of β!
n  Then why bother?
q  Answer… potential efficiency gain relative to FE
n  FE is no longer most efficient estimator. If our
assumption is correct, we can get more efficient estimate
by not eliminating the FE and doing generalized least
squares [Note: can't just do OLS; it will be consistent as well but
SE will be wrong since they ignore serial correlation]

62
Random effects (RE) model [Part 3]

n  Answer #2 – The assumption that f and x


are uncorrelated is likely unrealistic in CF
q  The violation of this assumption is whole
motivation behind why we do FE estimation!
n  Recall that correlation between unobserved
variables, like managerial talent, demand shocks,
etc., and x will cause omitted variable bias

63
Random effects – My Take

n  In practice, RE model is not very useful


q  As Angrist-Pischke (page 223) write,
n  Relative to fixed effects estimation, random effects
requires stronger assumptions to hold
n  Even if right, asymptotic efficiency gain likely modest
n  And, finite sample properties can be worse

q  Bottom line, don't bother with it

64
Outline for Panel Data
n  Motivate how panel data is helpful
n  Fixed effects model
q  Benefits [There are many]
q  Costs [There are some…]

n  Random effects model


n  First differences
n  Lagged y models

65
First differencing (FD) [Part 1]
n  First differencing is another way to
remove unobserved heterogeneities
q  Rather than subtracting off the group
mean of the variable from each variable,
you instead subtract the lagged observation
q  Easy to see why this also works…

66
First differencing (FD) [Part 2]
n  Notice that, yi ,t = α + β xi ,t + fi + ui ,t
yi ,t −1 = α + β xi ,t −1 + fi + ui ,t −1 Note: we'll lose
on observation
per cross-section
n  From this, we can see that because there
won't be a lag
yi ,t − yi ,t −1 = β ( xi ,t − xi ,t −1 ) + (ui ,t − ui ,t −1 )

q  When will OLS estimate of this provide a


consistent estimate of β?
n  Answer: With same strict exogeneity assumption of
FE (i.e. xi,t and ui,s are uncorrelated for all t and s)

67
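A minimal Stata sketch of the FD estimator (hypothetical variables firm, year, y, x); the constant in the differenced regression would pick up a common time trend and should be near zero under the model above:

  xtset firm year
  * D. is Stata's first-difference operator
  reg d.y d.x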
First differences (without time)

n  First differences can also be done even


when observations within groups aren't
ordered by time
q  Just order the data within groups in whatever
way you want, and take 'differences'
q  Works, but admittedly, not usually done

68
FD versus FE [Part 1]
n  When just two observations per group,
they are identical to each other
n  In other cases, both are consistent;
difference is generally about efficiency
q  FE is more efficient if disturbances,
ui,t, are serially uncorrelated
Which is true?
q  FD is more efficient if disturbance, Unclear. Truth is
ui,t, follow a random walk that it is probably
something in between

69
FD versus FE [Part 2]
n  If strict exogeneity is violated (i.e. xi,t is
correlated with ui,s for s≠t), FE might be better
q  As long as we believe xi,t and ui,t are uncorrelated,
the FE's inconsistency shrinks to 0 at rate 1/T;
but, FD gets no better with larger T
q  Remember: T is the # of observations per group

n  But, if y and x are spuriously correlated, and N


is small, T large, FE can be quite bad

70
FD versus FE [Part 3]
n  Bottom line: not a bad idea to try both…
q  If different, you should try to understand why
q  With an omitted variable or measurement
error, you’ll get diff. answers with FD and FE
n  In fact, Griliches and Hausman (1986) shows that
because measurement error causes predictably
different biases in FD and FE, you can (under
certain circumstances) use the biased estimates to
back out the true parameter

71
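Side note – a minimal Stata sketch of running both FE and FD on the same panel, as suggested above (firm, year, y, and x are placeholder names; a panel dataset is assumed to already be in memory):

    xtset firm year

    * Fixed effects (within) estimator
    xtreg y x, fe vce(cluster firm)

    * First-difference estimator; D. is Stata's difference operator
    reg D.y D.x, vce(cluster firm)

    * With exactly two periods per firm the two estimates are identical; otherwise,
    * large differences hint at measurement error or a strict exogeneity violation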
Outline for Panel Data
n  Motivate how panel data is helpful
n  Fixed effects model
q  Benefits [There are many]
q  Costs [There are some…]

n  Random effects model


n  First differences
n  Lagged y models

72
Lagged dependent variables with FE
n  We cannot easily estimate models with both
a lagged dep. var. and unobserved FE
yi ,t = α + ρ yi ,t −1 + β xi ,t + fi + ui ,t , ρ <1

q  Same as before, but now true model contains


lagged y as independent variable
n  Can't estimate with OLS even if x & f are uncorrelated
n  Can't estimate with FE

73
Lagged y & FE – Problem with OLS
n  To see the problem with OLS, suppose
you estimate the following:
y_i,t = α + ρ y_i,t-1 + β x_i,t + v_i,t,   where v_i,t = f_i + u_i,t

q  But, y_i,t-1 = α + ρ y_i,t-2 + β x_i,t-1 + f_i + u_i,t-1
q  Thus, yi,t-1 and composite error, vi,t are positively
correlated because they both contain fi
q  I.e. you get omitted variable bias

74
Lagged y & FE – Problem with FE
n  Will skip the math, but it is always biased
q  Basic idea is that if you do a within
transformation, the lagged mean of y, which will be
on RHS of the model now, will always be
negatively correlated with demeaned error, u
n  Note #1 – This is true even if there was no unobserved
heterogeneity, f; FE with lagged values is always bad idea
n  Note #2: Same problem applies to FD

q  Problem, however goes away as T goes to infinity

75
How do we estimate this? IV?
n  Basically, you're going to need instrument;
we will come back to this next week….

76
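Side note – one commonly used IV-based approach is the Arellano and Bond (1991) difference GMM estimator cited later in these notes, which first-differences out fi and instruments the lagged change in y with deeper lags of y; a minimal Stata sketch (firm, year, y, x are placeholders), keeping in mind the serial-correlation caveats discussed in the next lecture:

    xtset firm year

    * Arellano-Bond: lags of y dated t-2 and earlier instrument for the lagged difference of y
    xtabond y x, lags(1) vce(robust)

    * The instruments are invalid if u is serially correlated, so check the
    * test for second-order autocorrelation in the first-differenced errors
    estat abond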
Lagged y versus FE – Bracketing
n  Suppose you don't know which is correct
q  Lagged value model: yi ,t = α + γ yi ,t −1 + β xi ,t + ui ,t
q  Or, FE model: yi ,t = α + β xi ,t + fi + ui ,t

n  Can show that estimate of β>0 will…


q  Be too high if lagged model is correct, but you
incorrectly use FE model
q  Be too low if FE model is correct, but you
incorrectly used lagged model

77
Bracketing continued…
n  Use this to 'bracket' where true β is…
q  But sometimes, you won't observe bracketing
q  Likely means your model is incorrect in other
ways, or there is some severe finite sample bias

78
Summary of Today [Part 1]

n  Panel data allows us to control for certain


types of unobserved variables
q  FE estimator can control for these potential
unobserved variables in very flexible way
q  Greatly reduces the scope for potential omitted
variable biases we need to worry about
q  Random effects model is useless in most
empirical corporate finance settings

79
Summary of Today [Part 2]
n  FE estimator, however, has weaknesses
q  Can't estimate variables that don't vary within
groups [or at least, not without an instrument]
q  Could amplify any measurement error
n  For this reason, be cautious interpreting zero or small
coefficients on possibly mismeasured variables

q  Can't be used in models with lagged values of the


dependent variable [or at least, not without an IV]

80
Summary of Today [Part 3]

n  FE are generally not a good idea when


estimating nonlinear models [e.g. Probit,
Tobit, Logit]; estimates are inconsistent
n  First differences can also remove
unobserved heterogeneity
q  Largely just differs from FE in terms of relative
efficiency; which depends on error structure

81
In First Half of Next Class
n  Instrumental variables
q  What are the necessary assumptions? [E.g.
what is the exclusion restriction?]
q  Is there are way we can test whether our
instruments are okay?

n  Related readings… see syllabus

82
Assign papers for next week…
n  Khwaja and Mian (AER 2008)
q  Bank liquidity shocks

n  Paravisini, et al. (ReStud 2014)


q  Impact of credit supply on trade

n  Becker, Ivkovic, and Weisbenner (JF 2011)


q  Local dividend clienteles

83
Break Time
n  Let's take our 10 minute break
n  We'll do presentations when we get back

84
FNCE 926
Empirical Methods in CF
Lecture 5 – Instrumental Variables

Professor Todd Gormley


Announcements
n  Exercise #2 due; should have uploaded
it to Canvas already

2
Background readings
n  Roberts and Whited
q  Section 3
n  Angrist and Pischke
q  Sections 4.1, 4.4, and 4.6
n  Wooldridge
q  Chapter 5
n  Greene
q  Sections 8.2-8.5

3
Outline for Today
n  Quick review of panel regressions
n  Discuss IV estimation
q  How does it help?
q  What assumptions are needed?
q  What are the weaknesses?

n  Student presentations of “Panel Data”

4
Quick Review [Part 1]
n  What type of omitted variable does panel
data and FE help mitigate, and how?
q  Answer #1 = It can help eliminate omitted
variables that don’t vary within panel groups
q  Answer #2 = It does this by transforming the
data to remove this group-level heterogeneity
[or equivalently, directly controls for it using
indicator variables as in LSDV]

5
Quick Review [Part 2]
n  Why is random effects pretty useless
[at least in corporate finance settings]?
q  Answer = It assumes that unobserved
heterogeneity is uncorrelated with x’s; this is
likely not going to be true in finance

6
Quick Review [Part 3]
n  What are three limitations of FE?
#1 – Can’t estimate coefficient on variables that
don’t vary within groups
#2 – Could amplify any measurement error
n  For this reason, be cautious interpreting zero or small
coefficients on possibly mismeasured variables

#3 – Can’t be used in models with lagged values


of the dependent variable

7
Outline for Instrumental Variables
n  Motivation and intuition
n  Required assumptions
n  Implementation and 2SLS
q  Weak instruments problem
q  Multiple IVs and overidentification tests

n  Miscellaneous IV issues


n  Limitations of IV

8
Motivating IV [Part 1]
n  Consider the following estimation
y = β 0 + β1 x1 + ... + β k xk + u

where cov( x1 , u ) = ... = cov( xk −1 , u ) = 0


cov( xk , u ) ≠ 0

n  If we estimate this model, will we get a


consistent estimate of βk?
n  When would we get a consistent estimate
of the other β’s, and is this likely?

9
Motivation [Part 2]
q  Answer #1: No. We will not get a
consistent estimate of βk
q  Answer #2: Very unlikely. We will only
get consistent estimate of other β if xk is
uncorrelated with all other x

n  Instrumental variables provide a


potential solution to this problem…

10
Instrumental variables – Intuition
q  Think of xk as having ‘good’ and ‘bad’ variation
n  Good variation is not correlated with u
n  Bad variation is correlated with u

q  An IV (let’s call it z) is a variable that explains


variation in xk, but doesn’t explain y
n  I.e. it only explains the “good” variation in xk

q  Can use the IV to extract the “good” variation


and replace xk with only that component!

11
Outline for Instrumental Variables
n  Motivation and intuition
n  Required assumptions
n  Implementation and 2SLS
q  Weak instruments problem
q  Multiple IVs and overidentification tests

n  Miscellaneous IV issues


n  Limitations of IV

12
Instrumental variables – Formally
n  IVs must satisfy two conditions
q  Relevance condition
q  Exclusion condition
n  What are these two conditions?
n  Which is harder to satisfy?
n  Can we test whether they are true?

To illustrate these conditions, let’s start with the


simplest case, where we have one instrument, z,
for the problematic regressor, xk

13
Relevance condition [Part 1]
[How can we test this condition?]
n  The following must be true…

q  In the following model

x_k = α_0 + α_1 x_1 + ... + α_k-1 x_k-1 + γ z + v

z satisfies the relevance condition if γ ≠ 0
q  What does this mean in words?
n  Answer: z is relevant to explaining the problematic
regressor, xk, after partialling out the effect of all of
the other regressors in the original model

14
Relevance condition [Part 2]
n  Easy to test the relevance condition!
q  Just run the regression of xk on all the other
x’s and the instrument z to see if z explains xk
q  As we see later, this is what people call the
‘first stage’ of the IV estimation

15
Exclusion condition [Part 1]
[How can we test this condition?]
n  The following must be true…

q  In the original model, where

y = β_0 + β_1 x_1 + ... + β_k x_k + u

z satisfies the exclusion condition if cov(z, u) = 0
q  What does this mean in words?
n  Answer: z is uncorrelated with the disturbance, u…
i.e. z has no explanatory power with respect to y after
conditioning on the other x’s;

16
Exclusion condition [Part 2]

n  Trick question! You cannot test the


exclusion restriction [Why?]
q  Answer: You can’t test it because u is unobservable
q  You must find a convincing economic argument as to
why the exclusion restriction is not violated

17
Side note – What’s wrong with this?
n  I’ve seen many people try to use the below
argument as support for the exclusion
restriction… what’s wrong with it?

q  Estimate the below regression…


y = β0 + β1 x1 + ... + βk xk + γ z + u

q  If γ=0, then exclusion restriction likely holds...


i.e. they argue that z doesn’t explain y after
conditioning on the other x’s

18
Side note – Answer
n  If the original regression doesn’t give
consistent estimates, then neither will this one!
q  cov(xk, u)≠0, so the estimates are still biased
q  Moreover, if we believe the relevance condition,
then the coefficient on z is certainly biased because
z is correlated with xk

19
What makes a good instrument?
n  Bottom line, an instrument must be justified
largely on economic arguments
q  Relevance condition can be shown formally, but
you should have an economic argument for why
q  Exclusion restriction cannot be tested… you need
to provide a convincing economic argument as to
why it explains y, but only through its effect on xk

20
Outline for Instrumental Variables
n  Motivation and intuition
n  Required assumptions
n  Implementation and 2SLS
q  Weak instruments problem
q  Multiple IVs and overidentification tests

n  Miscellaneous IV issues


n  Limitations of IV

21
Implementing IV estimation
n  You’ve found a good IV, now what?
n  One can think of the IV estimation as
being done in two steps
q  First stage: regress xk on other x’s & z
q  Second stage: take predicted xk from first
stage and use it in original model instead of xk
This is why we also call IV estimations
two stage least squares (2SLS)

22
First stage of 2SLS
n  Estimate the following
x_k = α_0 + α_1 x_1 + ... + α_k-1 x_k-1 + γ z + v

[x_k is the problematic regressor, i.e. cov(x_k, u) ≠ 0; z is the instrumental variable; x_1, …, x_k-1 are all the other non-problematic variables that explain y]

q  Get estimates for the α's and γ

q  Calculate predicted values, x̂_k, where
x̂_k = α̂_0 + α̂_1 x_1 + ... + α̂_k-1 x_k-1 + γ̂ z

23
Second stage of 2SLS
n  Use predicted values to estimate
y = β0 + β1 x1 + ... + βk xˆk + u

Predicted values replace the problematic regressor

q  Can be shown (see textbook for math) that


this 2SLS estimation yields consistent
estimates of all the β when both the relevance
and exclusion conditions are satisfied

24
Intuition behind 2SLS
n  Predicted values represent variation in xk
that is ‘good’ in that it is driven only by
factors that are uncorrelated with u
q  Specifically, predicted value is linear function of
variables that are uncorrelated with u

n  Why not just use other x’s? Why need z?


q  Answer: Can’t just use other x’s to generate
predicted value because then predicted value
would be collinear in the second stage

25
Reduced Form Estimates [Part 1]
n  The “reduced form” estimation is when
you regress y directly onto the instrument,
z, and other non-problematic x’s
y = β0 + β1 x1 + ... + βk −1 xk −1 + δ z + u

q  It is an unbiased and consistent estimate of the


effect of z on y (presumably through the
channel of z’s effect on xk)

26
Reduced Form Estimates [Part 2]
n  It can be shown that the IV estimate for
xk , βˆkIV, is simply given by…
Reduced form coefficient
δˆ estimate for z
βˆ IV
=
k
γˆ First stage coefficient
estimate for z

q  I.e. if you don’t find effect of z on y in


reduced form, then IV is unlikely to work
n  IV estimate is just scaled version of reduced form

27
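Side note – to see this relationship numerically, a minimal Stata sketch for the just-identified case with a single instrument and no other controls (y, x, and z are placeholder names):

    ivregress 2sls y (x = z)                 // IV estimate of beta

    reg x z                                  // first stage
    scalar gamma_hat = _b[z]

    reg y z                                  // reduced form
    scalar delta_hat = _b[z]

    display "ratio = " delta_hat/gamma_hat   // matches the ivregress coefficient on x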
Practical advice [Part 1]
n  Don’t state in your paper’s intro that
you use an IV to resolve an identification
problem, unless…
q  You also state what the IV you use is
q  And, provide a strong economic argument as
to why it satisfies the necessary conditions

Don’t bury the explanation of your IV! Researchers that


do this almost always have a bad IV. If you really have a
good IV, you’ll be willing to defend it in the intro!

28
Practical advice [Part 2]
n  Don’t forget to justify why we should be
believe the exclusion restriction holds
q  Too many researchers only talk
about the relevance condition
q  Exclusion restriction is equally important

29
Practical Advice [Part 3]
n  Do not do two stages on your own!
q  Let the software do it; e.g. in Stata, use the
IVREG or XTIVREG (for panel data) commands

n  Three ways people will mess up when


trying to do 2SLS on their own…
#1 – Standard errors will be wrong
#2 – They try using nonlinear models in first stage
#3 – They will use the fitted values incorrectly

30
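Side note – a minimal Stata sketch of letting the software run both stages (the IVREG/XTIVREG advice above; the current command for the cross-sectional case is ivregress 2sls). The names y, x1, x2, xk, z, firm, and year are placeholders:

    * xk is the problematic regressor, z the excluded instrument,
    * x1 and x2 the non-problematic controls
    ivregress 2sls y x1 x2 (xk = z), vce(robust) first   // 'first' also reports the first stage

    * Panel version with firm fixed effects; the FE enter the first stage automatically
    xtset firm year
    xtivreg y x1 x2 (xk = z), fe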
Practical Advice [Part 3-1]

n  Why will standard errors be wrong if you


try to do 2SLS on your own?
q  Answer: Because the second stage uses
‘estimated’ values that have their own
estimation error. This error needs to be taken
into account when calculating standard errors!

31
Practical Advice [Part 3-2]
n  People will try using predicted values
from non-linear model, e.g. Probit or
Logit, in a ‘second stage’ IV regression
q  But, only linear OLS in first stage guarantees
covariates and fitted values in second stage
will be uncorrelated with the error
n  I.e. this approach is NOT consistent
n  This is what we call the “forbidden regression”

32
Practical Advice [Part 3-3]
n  In models with quadratic terms, e.g.
y = β_0 + β_1 x + β_2 x² + u
people often try to calculate one fitted
value x̂ using one instrument, z, and then
plug in x̂ and x̂² into second stage…
q  Seems intuitive, but it is NOT consistent!
q  Instead, you should just use z and z2 as IVs!

33
Practical Advice [Part 3]
n  Bottom line… if you find yourself plugging
in fitted values when doing an IV, you are
probably doing something wrong!
q  Let the software do it for you; it will prevent
you from doing incorrect things

34
Practical Advice [Part 4]
n  All x’s that are not problematic, need to be
included in the first stage!!!
q  You’re not doing 2SLS, and you’re not getting
consistent estimates if this isn’t done
q  This includes things like firm and year FE!

n  Yet another reason to let statistical


software do the 2SLS estimation for you!

35
Practical Advice [Part 5]
n  Always report your first stage results & R2
n  There are two good reasons for this…
[What are they?]
q  Answer #1: It is direct test of relevance
condition… i.e. we need to see γ≠0!
q  Answer #2: It helps us determine whether
there might be a weak IV problem…

36
Outline for Instrumental Variables
n  Motivation and intuition
n  Required assumptions
n  Implementation and 2SLS
q  Weak instruments problem
q  Multiple IVs and overidentification tests

n  Miscellaneous IV issues


n  Limitations of IV

37
Consistent, but biased
n  IV is a consistent, but biased, estimator
q  For any finite number of observations, N,
the IV estimates are biased toward the
biased OLS estimate
q  But, as N approaches infinity, the IV
estimates converge to the true coefficients

n  This feature of IV leads to what we call


the weak instrument problem…

38
Weak instruments problem
n  A weak instrument is an IV that doesn’t
explain very much of the variation in the
problematic regressor
n  Why is this an issue?
q  Small sample bias of estimator is greater
when the instrument is weak; i.e. our estimates,
which use a finite sample, might be misleading…
q  t-stats in finite sample can also be wrong

39
Weak IV bias can be severe [Part 1]
n  Hahn and Hausman (2005) show that
finite sample bias of 2SLS is ≈
j ρ (1 − r 2 )
Nr 2
q  j = number of IVs [we’ll talk about
multiple IVs in a second]
q  ρ = correlation between xk and u
q  r2 = R2 from first-stage regression
q  N = sample size

40
Weak IV bias can be severe [Part 2]

bias ≈ jρ(1 − r²)/(Nr²)

[A low explanatory power in the first stage can result in large bias even if N is large]
[More instruments, which we'll talk about later, need not help; they help increase r², but if they are weak (i.e. don't increase r² much), they can still increase finite sample bias]

41
Detecting weak instruments
n  Number of warning flags to watch for…
q  Large standard errors in IV estimates
n  You’ll get large SEs when covariance between
instrument and problematic regressor is low

q  Low F statistic from first stage


n  The higher F statistic for excluded IVs, the better
n  Stock, Wright, and Yogo (2002) find that an F
statistic above 10 likely means you’re okay…

42
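Side note – a minimal Stata sketch of checking these warning flags after a 2SLS run (placeholder variable names as before):

    ivregress 2sls y x1 x2 (xk = z), vce(robust)

    * Reports the first-stage R2, partial R2, and the F statistic on the excluded IV;
    * an F below roughly 10 (Stock, Wright, and Yogo 2002) is a weak-IV warning flag
    estat firststage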
Excluded IVs – Tangent
n  Just some terminology…
q  In some ways, can think of all non-
problematic x’s as IVs; they all appear in first
stage and are used to get predicted values
q  But, when people refer to excluded IVs, they
refer to the IVs (i.e. z’s) that are excluded
from the second stage

43
Outline for Instrumental Variables
n  Motivation and intuition
n  Required assumptions
n  Implementation and 2SLS
q  Weak instruments problem
q  Multiple IVs and overidentification tests

n  Miscellaneous IV issues


n  Limitations of IV

44
More than one problematic regressor
n  Now, consider the following…
y = β 0 + β1 x1 + ... + β k xk + u

where cov( x1 , u ) = ... = cov( xk − 2 , u ) = 0


cov( xk −1 , u ) ≠ 0
cov( xk , u ) ≠ 0

n  There are two problematic regressors, xk-1 and xk


n  Easy to show that IVs can solve this as well

45
Multiple IVs [Part 1]
n  Just need one IV for each
problematic regressor, e.g. z1 and z2
n  Then, estimate 2SLS in similar way…
q  Regress xk on all other x’s (except xk-1)
and both instruments, z1 and z2
q  Regress xk-1 on all other x’s (except xk)
and both instruments, z1 and z2
q  Get predicted values, do second stage

46
Multiple IVs [Part 2]
n  Need at least as many IVs as problematic
regressors to ensure predicted values are not
collinear with the non-problematic x’s
q  If # of IVs match # of problematic x’s,
model is said to be “Just Identified”

47
“Overidentified” Models

n  Can also have models with more IVs


than # of problematic regressors
q  E.g. m instruments for h problematic
regressors, where m > h
q  This is what we call an overidentified model

n  Can implement 2SLS just as before…

48
Overidentified model conditions
n  Necessary conditions very similar
q  Exclusion restriction = none of the
instruments are correlated with u
q  Relevance condition
E.g. you can’t
just have one IV n  Each first stage (there will be h of them) must
that is correlated have at least one IV with non-zero coefficient
with all the
problematic
n  Of the m instruments, there must be at least h of
regressors and them that are partially correlated with problematic
all the other IVs regressors [otherwise, model isn’t identified]
are not

49
Benefit of Overidentified Model

n  Assuming you satisfy the relevance and


exclusion conditions, you will get more
asymptotic efficiency with more IVs
q  Intuition: you are able to extract more ‘good’
variation from the first stage of the estimation

50
But, Overidentification Dilemma
n  Suppose you are a very clever researcher…
q  You find not just h instruments for h
problematic regressors, you find m > h
n  First, you should consider yourself very clever
[a good instrument is hard to come by]!
n  But, why might you not want to use the m-h
extra instruments?

51
Answer – Weak instruments
n  Again, as we saw earlier, a weak
instrument will increase likelihood of finite
sample bias and misleading inferences!
q  If have one really good IV, not clear you want
to add some extra (less good) IVs...

52
Practical Advice – Overidentified IV

n  Helpful to always show results using “just


identified” model with your best IVs
q  It is least likely to suffer small sample bias
q  In fact, the just identified model is median-
unbiased making weak instruments critique
less of a concern

53
Overidentification “Tests” [Part 1]

n  When model is overidentified, you can


supposedly “test” the quality of your IVs
n  The logic of the tests is as follows…
q  If all IVs are valid, then we can get consistent
estimates using any subset of the IVs
q  So, compare IV estimates from different subsets; if
find they are similar, this suggests the IVs okay

54
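Side note – a minimal Stata sketch of how such a test is usually run when there are more excluded IVs than problematic regressors (z1 and z2 are placeholder instruments); as the next slides stress, the test maintains that some subset of the IVs is valid, so it cannot prove validity:

    * Overidentified model: one problematic regressor xk, two excluded IVs
    ivregress 2sls y x1 x2 (xk = z1 z2), vce(robust)

    * Test of the overidentifying restrictions (a robust score test here)
    estat overid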
Overidentification “Tests” [Part 2]

n  But, I see the following all the time…


q  Researcher has overidentified IV model
q  All the IVs are highly questionable in that
they lack convincing economic arguments
q  But, authors argue that because their model
passes some “overidentification test” that
the IVs must be okay

n  What is wrong with this logic?

55
Overidentification “Tests” [Part 3]

n  Answer = All the IVs could be junk!


q  The “test” implicitly assumes that some
subset of instruments is valid
q  This may not be the case!

n  To reiterate my earlier point…


q  There is no test to prove an IV is valid! Can
only motivate that the IV satisfies exclusion
restriction using economic theory

56
“Informal” checks – Tangent
n  It is useful, however, to try some
“informal” checks on validity of IV
q  E.g. One could show the IV is uncorrelated
with other non-problematic regressors or with
y that pre-dates the instrument
n  Could help bolster economic argument that IV
isn’t related to outcome y for other reasons
n  But, don’t do this for your actual outcome, y, why?
Answer = It would suggest a weak IV (at best)

57
Outline for Instrumental Variables
n  Motivation and intuition
n  Required assumptions
n  Implementation and 2SLS
q  Weak instruments problem
q  Multiple IVs and overidentification tests

n  Miscellaneous IV issues


n  Limitations of IV

58
Miscellaneous IV issues
n  IVs with interactions
n  Constructing additional IVs
n  Using lagged y or lagged x as IVs
n  Using group average of x as IV for x
n  Using IV with FE
n  Using IV with measurement error

59
IVs with interactions
n  Suppose you want to estimate
y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_1 x_2 + u
where cov(x_1, u) = 0
cov(x_2, u) ≠ 0

q  Now, both x2 and x1x2 are problematic


q  Suppose you can only find one IV, z.
Is there a way to get consistent estimates?

60
IVs with interactions [Part 2]
n  Answer = Yes! In this case, one can
construct other instruments from the one IV
q  Use z as IV for x2
q  Use x1z as IV for x1x2

n  Same economic argument used to


support z as IV for x2 will carry
through to using x1z as IV for x1x2

61
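Side note – a minimal Stata sketch of this construction (y, x1, x2, and z are placeholder names); the single instrument z generates a second instrument x1*z for the interaction:

    gen x1x2 = x1*x2       // problematic interaction term
    gen x1z  = x1*z        // constructed instrument for the interaction

    * Two endogenous regressors (x2 and x1x2), two excluded IVs (z and x1z)
    ivregress 2sls y x1 (x2 x1x2 = z x1z), vce(robust)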
Constructing additional IV
n  Now, suppose you want to estimate
y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + u
where cov(x_1, u) = 0
cov(x_2, u) ≠ 0
cov(x_3, u) ≠ 0
[Now, both x_2 and x_3 are problematic]

q  Suppose you can only find one IV, z, and you
think z is correlated with both x2 and x3…
Can you use z and z2 as IVs?

62
Constructing additional IV [Part 2]
n  Answer = Technically, yes. But
probably not advisable…
q  Absent an economic reason for why z2 is
correlated with either x2 or x3 after
partialling out z, it’s probably not a good IV
n  Even if it satisfies the relevance condition, it
might be a ‘weak’ instrument, which can be very
problematic [as seen earlier]

63
Lagged instruments
n  It has become common in CF to use
lagged variables as instruments
n  This usually takes two forms
q  Instrumenting for a lagged y in dynamic
panel model with FE using a lagged lagged y
q  Instrumenting for problematic x or lagged y
using lagged version of the same x

64
Example where lagged IVs are used
n  As noted last week, we cannot estimate
models with both a lagged dep. var. and
unobserved FE
yi ,t = α + ρ yi ,t −1 + β xi ,t + fi + ui ,t , ρ <1

q  The lagged y independent variable will be


correlated with the error, u
q  One proposed solution is to use lagged values
of y as IV for problematic yi,t-1

65
Using lagged y as IV in panel models

n  Specifically, papers propose using first


differences combined with lagged values,
like yi,t-2 , as instrument for yi,t-1
q  Could work in theory, …
n  Lagged y will likely satisfy relevance criteria
n  But, exclusion restriction requires lagged values of y to
be uncorrelated with differenced residual, ui,t – ui,t-1

Is this plausible in corporate finance?

66
Lagged y values as instruments?
n  Probably not…
q  Lagged values of y will be correlated with
changes in errors if errors are serially correlated
q  This is common in corporate finance,
suggesting this approach is not helpful

[ See Holtz-Eakin, Newey, and Rosen (1988), Arellano


and Bond (1991), Blundell and Bond (1998) for more
details on these type of IV strategies]

67
Lagged x values as instruments? [Part 1]

n  Another approach is to make assumptions


about how xi,t is correlated with ui,t
q  Idea behind relevance condition is x is
persistent and predictive of future x or future y
[depends on what you’re trying to instrument]
q  And exclusion restriction is satisfied if we assume
xi,t is uncorrelated with future shocks, u

68
Lagged x values as instruments? [Part 2]

n  Just not clear how plausible this is…


q  Again, serial correlation in u (which is very common in
CF) all but guarantees the IV is invalid
q  An economic argument is generally lacking,
[and for this reason, I’m very skeptical of these strategies]
[ See Arellano and Bond (1991), Arellano and Bover (1995)
for more details on these type of IV strategies]

69
Using group averages as IVs [Part 1]
n  Will often see the following…
yi , j = α + β xi , j + ui , j
q  yi,j is outcome for observation i (e.g., firm)
in group j (e.g., industry)
q  Researcher worries that cov(x,u)≠0
q  So, they use group average, x−i , j , as IV
1
x− i , j = ∑
J − 1 i∈ j
xk , j J is # of observations
in the group
k ≠i

70
Using group averages as IVs [Part 2]
n  They say…
q  “group average of x is likely correlated with
own x” – i.e. relevance condition holds
q  “but, group average doesn’t directly affect y”
– i.e., exclusion restriction holds

n  Anyone see a problem?

71
Using group averages as IVs [Part 3]
n  Answer =
q  Relevance condition implicitly assumes
some common group-level heterogeneity,
fj , that is correlated with xij
q  But, if model has fj (i.e. group fixed effect),
then x̄_-i,j must violate exclusion restriction!

n  This is a really bad IV [see Gormley and


Matsa (2014) for more details]

72
Other Miscellaneous IVs
n  As noted last week, IVs can also be useful
in panel estimations
#1 – Can help identify effect of variables that
don’t vary within groups [which we can’t
estimate directly in FE model]
#2 – Can help with measurement error

73
#1 – IV and FE models [Part 1]
n  Use the following three steps to identify
variables that don’t vary within groups…
#1 – Estimate the FE model
#2 – Take group-averaged residuals, regress them
onto variable(s), x’, that don’t vary in groups
(i.e. the variables you couldn’t estimate in FE model)
n  Why is this second step (on its own) problematic?
n  Answer: because unobserved heterogeneity (which
is still collinear with x’) will still be in error (because
it partly explains group-average residuals)

74
#1 – IV and FE models [Part 2]
n  Solution in second step is to use IV!
#3 – Use covariates that do vary in group (from
first step) as instruments in second step
n  Which x’s from first step are valid IVs?
n  Answer = those that don’t co-vary with unobserved
heterogeneity but do co-vary with variables that don’t
vary within groups [again, economic argument needed here]

q  See Hausman and Taylor (1981) for details


q  Done in Stata using XTHTAYLOR

75
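Side note – a minimal sketch of the XTHTAYLOR syntax mentioned above (all names are placeholders): x1 and x2 vary within firm, g1 does not, and the researcher assumes x2 and g1 are the regressors correlated with the firm effect:

    xtset firm year

    * endog() lists regressors allowed to correlate with the unobserved firm effect;
    * the remaining exogenous regressors supply the internal instruments
    xthtaylor y x1 x2 g1, endog(x2 g1)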
#2 – IV and measurement error [Part 1]
n  As discussed last week, measurement
error can be a problem in FE models
n  IVs provide a potential solutions
q  Pretty simple idea…
q  Find z correlated to mismeasured variable,
but not correlated with u; use IV

76
#2 – IV and measurement error [Part 2]
n  But easier said then done!
q  Identifying a valid instrument requires researcher
to understand exact source of measurement error
n  This is because the disturbance, u, will include the
measurement error; hence, how can you make an
economic argument that z is uncorrelated with it if you
don’t understand the measurement error?

[See Biorn (2000) and Almeida, Campello, and Galvao


(RFS 2010) for examples of this strategy]

77
Outline for Instrumental Variables
n  Motivation and intuition
n  Required assumptions
n  Implementation and 2SLS
q  Weak instruments problem
q  Multiple IVs and overidentification tests

n  Miscellaneous IV issues


n  Limitations of IV

78
Limitations of IV
n  There are two main limitations to discuss
q  Finding a good instrument is really hard; even
the seemingly best IVs can have problems
q  External validity can be a concern

79
Subtle violations of exclusion restriction
n  Even the seemingly best IVs can violate
the exclusion restriction
q  Roberts and Whited (pg. 31, 2011) provide a
good example of this in description of
Bennedsen et al. (2007) paper
q  Whatever group is discussing this paper
next week should take a look… :)

80
Bennedsen et al. (2007) example [Part 1]
n  Paper studies effect of family CEO
succession on firm performance
q  IVs for family CEO succession using
gender of first-born child
n  Families where the first child was a boy are
more likely to have a family CEO succession
n  Obviously, gender of first-born is totally
random; seems like a great IV…

Any guesses as to what might be wrong?

81
Bennedsen et al. (2007) example [Part 2]
n  Problem is that first-born gender may
be correlated with disturbance u
q  Girl-first families may only turnover firm
to a daughter when she is very talented
q  Therefore, effect of family CEO turnover
might depend on gender of first born
q  I.e. gender of first born is correlated with
u because it includes interaction between
problematic x and the instrument, z!

82
External vs. Internal validity
n  External validity is another concern of IV
[and other identification strategies]
q  Internal validity is when the estimation
strategy successfully uncovers a causal effect
q  External validity is when those estimates are
predictive of outcomes in other scenarios
n  IV (done correctly) gives us internal validity
n  But, it doesn’t necessarily give us external validity

83
External validity [Part 1]
n  Issue is that IV estimates only tell us about
subsample where the instrument is predictive
q  Remember, you’re only making use
of variation in x driven by z
q  So, we aren’t learning effect of x for
observations where z doesn’t explain x!

n  It’s a version of LATE (local average


treatment effect) and affects interpretation

84
External validity [Part 2]
n  Again, consider Bennedsen et al (2007)
q  Gender of first born may only predict likelihood
of family turnover in certain firms…
n  I.e. family firms where CEO thinks females (including
daughters) are less suitable for leadership positions

q  Thus, we only learn about effect of family


succession for these firms
q  Why might this matter?

85
External validity [Part 3]
n  Answer: These firms might be different in
other dimensions, which limits the external
validity of our findings
q  E.g. Could be that these are poorly run firms…
n  If so, then we only identify effect for such
poorly run firms using the IV
n  And, effect of family succession in well-run
firms might be quite different…

86
External validity [Part 4]
n  Possible test for external validity problems
q  Size of residual from first stage tells us something
about importance of IV for certain observations
n  Large residual means IV didn’t explain much
n  Small residual means it did

q  Compare characteristics (i.e. other x’s) of


observations of groups with small and large
residuals to make sure they don’t differ much

87
Summary of Today [Part 1]

n  IV estimation is one possible way to


overcome identification challenges
n  A good IV needs to satisfy two conditions
q  Relevance condition
q  Exclusion condition

n  Exclusion condition cannot be tested; must


use economic argument to support it

88
Summary of Today [Part 2]

n  IV estimations have their limits


q  Really hard to come up with good IV
q  Weak instruments can be a problem,
particularly when you have more IVs
than problematic regressors
q  External validity can be an concern

89
In First Half of Next Class
n  Natural experiments [Part 1]
q  How do they help with identification?
q  What assumptions are necessary to
make causal inferences?
q  What are their limitations?

n  Related readings… see syllabus

90
Assign papers for next week…
n  Gormley (JFI 2010)
q  Foreign bank entry and credit access

n  Bennedsen, et al. (QJE 2007)


q  CEO family succession and performance

n  Giroud, et al (RFS 2012)


q  Debt overhang and performance

91
Break Time
n  Let’s take our 10 minute break
n  We’ll do presentations when we get back

92
FNCE 926
Empirical Methods in CF
Lecture 6 – Natural Experiment [P1]

Professor Todd Gormley


Announcements
n  Exercise #3 is due next week
q  You can download it from Canvas
q  Largely just has you do some initial work on
natural experiments (from today's lecture);
but also has a bit of IV in it
n  Remember, please upload completed DO file and
typed answers to Canvas [don't e-mail them]
n  Just let me know if you have any questions or
difficulty doing this

2
Background readings
n  Roberts and Whited
q  Sections 2.2, 4
n  Angrist and Pischke
q  Section 5.2

3
Outline for Today
n  Quick review of IV regressions
n  Discuss natural experiments
q  How do they help?
q  What assumptions are needed?
q  What are their weaknesses?

n  Student presentations of “IV” papers

4
Quick Review [Part 1]

n  Two necessary conditions for an IV


q  Relevance condition – IV explains problematic
regressor after conditioning on other x's
q  Exclusion restriction – IV does not explain y
after conditioning on other x's

n  We can only test relevance condition

5
Quick Review [Part 2]
n  Angrist (1990) used randomness of
Vietnam draft to study effect of military
service on Veterans' earnings
q  Person's draft number (which was random)
predicted likelihood of serving in Vietnam
q  He found, using draft # as IV, that serving in
military reduced future earnings
Question: What might be a concern about the
external validity of his findings, and why?

6
Quick Review [Part 3]

n  Answer = IV only identifies effect of serving


on those that served because of being drafted
q  I.e. His finding doesn't necessarily tell us what the
effect of serving is for people that would serve
regardless of whether they are drafted or not
q  Must keep this local average treatment effect
(LATE) in mind when interpreting IV

7
Quick Review [Part 4]
n  Question: Are more instruments
necessarily a good thing? If not, why not?
q  Answer = Not necessarily. Weak instrument
problem (i.e. bias in finite sample) can be much
worse with more instruments, particularly if
they are weaker instruments

8
Quick Review [Part 5]
n  Question: How can overidentification tests
be used to prove the IV is valid?
q  Answer = Trick question! They cannot be
used in such a way. They rely on the
assumption that at least one IV is good. You
must provide a convincing economic argument
as to why your IVs make sense!

9
Natural Experiments – Outline
n  Motivation and definition
n  Understanding treatment effects
n  Two types of simple differences
n  Difference-in-differences

10
Recall… CMI assumption is key

n  A violation of conditional mean independence


(CMI), such that E(u|x)≠E(u) precludes our
ability to make causal inferences
y = β 0 + β1 x + u
q  Cov(x,u)≠0 implies CMI is violated

11
CMI violation implies non-randomness
n  Another way to think about CMI is that
it indicates that our x is non-random
q  I.e. the distribution of x (or the
distribution of x after controlling for
other observable covariates) isn't random
n  E.g. firms with high x might have higher y
(beyond just the effect of x on y) because high x
is more likely for firms with some omitted
variable contained in u…

12
Randomized experiments are great…
n  In many of the “hard” sciences, the
researcher can simply design an experiment to
achieve the necessary randomness
q  Ex. #1 – To determine effect of new drug, you
randomly give it to certain patients
q  Ex. #2 – To determine effect of certain gene,
you modify it in a random sample of mice

13
But, we simply can't do them :(
n  We can't do this in corporate finance!
q  E.g. we can't randomly assign a firm's leverage
to determine its effect on investment
q  And, we can't randomly assign CEOs' # of
options to determine their effect on risk-taking

n  Therefore, we need to rely on what we call


“Natural experiments”

14
Defining a Natural Experiment
n  Natural experiment is basically when
some event causes a random assignment
of (or change in) a variable of interest, x
q  Ex. #1 – Some weather event increases
leverage for a random subset of firms
q  Ex. #2 – Some change in regulation reduces
usage of options at a random subset of firms

15
Nat. Experiments Provide Randomness

n  We can use such “natural” experiments


to ensure that randomness (i.e. CMI)
holds and make causal inferences!
q  E.g., we use the randomness introduced
into x by the natural experiment to
uncover the causal effect of x on y

16
NEs can be used in many ways
n  Technically, natural experiments can be
used in many different ways
q  Use them to construct IV
n  E.g. gender of first child being a boy used in
Bennedsen, et al. (2007) is an example NE

q  Use them to construct regression discontinuity


n  E.g. cutoff for securitizing loans at credit score of 620
used in Keys, et al. (2010) is a NE

17
And, the Difference-in-Differences…
n  But admittedly, when most people refer to
natural experiment, they are talking about a
difference-in-difference (D-i-D) estimator
q  Basically, compares outcome y for a “treated” group
to outcome y for “untreated” group where treatment
is randomly assigned by the natural experiment
q  This is how I'll use NE in this class

18
Natural Experiments – Outline
n  Motivation and definition
n  Understanding treatment effects
q  Notation and definitions
q  Selection bias and why randomization matters
q  Regression for treatment effects

n  Two types of simple differences


n  Difference-in-differences

19
Treatment Effects
n  Before getting into natural experiments in
context of difference-in-difference, it is first
helpful to describe “treatment effects”

20
Notation and Framework
n  Let d equal a treatment indicator from the
experiment we will study
q  d = 0 à untreated by experiment (i.e. control group)
q  d = 1 à treated by experiment (i.e. treated group)

n  Let y be the potential outcome of interest


q  y = y(0) for untreated group
q  y = y(1) for treated group
q  Easy to show that y = y(0) + d[y(1) – y(0)]

21
Example treatments in corp. fin…
n  Ex. #1 – Treatment might be that your
firm's state passed anti-takeover law
q  d = 1 for firms incorporated in those states
q  y could be a number of things, e.g. ROA

n  Ex. #2 – Treatment is that your firm


discovers workers exposed to carcinogen
q  d = 1 if have exposed workers
q  y could be a number of things, like M&A

22
Average Treatment Effect (ATE)
n  Can now define some useful things
q  Average Treatment Effect (ATE) is given by

E[y(1) – y(0)]

n  What does this mean in words?


n  Answer: The expected change in y from being
treated by the experiment; this is the causal effect
we are typically interested in uncovering!

23
But, ATE is unobservable
E[y(1) – y(0)]

n  Why can't we actually directly observe ATE?


q  Answer = We only observe one outcome…
n  If treated, we observe y(1); if untreated, we
observe y(0). We never observe both.
n  I.e. we cannot observe the counterfactual of what
your y would have been absent treatment

24
Defining ATT
q  Average Treatment Effect if Treated (ATT)
is given by E[y(1) – y(0)|d =1]
n  This is the effect of treatment on those that are treated;
i.e change in y we'd expect to find if treated random
sample from population of observations that are treated

n  What don't we observe here?


n  Answer = y(0)|d = 1

25
Defining ATU
q  Average Treatment Effect if Untreated (ATU)
is given by E[y(1) – y(0)|d =0]
n  This is what the effect of treatment would have been on
those that are not treated by the experiment
n  We don't observe y(1) | d = 0

26
Uncovering ATE [Part 1]

n  How do we estimate ATE, E[y(1) – y(0)]?


q  Answer = We instead rely on E[y(1)|d =1]–
E[(y(0)|d =0] as our way to infer the ATE

In words, what are we doing


& what are we assuming?

27
Uncovering ATE [Part 2]
n  In words, we compare average y of treated
to average y of untreated observations
q  If we interpret this as the ATE, we are
assuming that absent the treatment, the treated
group would, on average, have had same
outcome y as the untreated group
q  We can show this formally by simply working
out E[y(1)|d =1]–E[y(0)|d =0]…

28
Uncovering ATE [Part 3]
E[y(1)|d=1] − E[y(0)|d=0]
= {E[y(1)|d=1] − E[y(0)|d=1]} + {E[y(0)|d=1] − E[y(0)|d=0]}

[We just added and subtracted the same term; the first bracket is the ATT, and the second bracket is what we call the "selection bias"]

n  Simple comparison doesn't give us the ATE!
In fact, the comparison is rather meaningless!
n  What is the "selection bias" in words?

29
Natural Experiments – Outline
n  Motivation and definition
n  Understanding treatment effects
q  Notation and definitions
q  Selection bias and why randomization matters
q  Regression for treatment effects

n  Two types of simple differences


n  Difference-in-differences

30
Selection bias defined
n  Selection bias: E[ y(0) | d = 1] − E[ y(0) | d = 0]
q  Definition = What the difference in average y
would have been for treated and untreated
observations absent any treatment
q  We do not observe this counterfactual!

n  Now let's see why randomness is key!

31
Introducing random treatment
n  A random treatment, d, implies that d is
independent of potential outcomes; i.e.
In words, the
E[ y (0) | d = 1] = E[ y (0) | d = 0] = E[ y(0)] expected value
of y is the same
and for treated and
E[ y (1) | d = 1] = E[ y (1) | d = 0] = E[ y (1)] untreated absent
treatment

q  With this, easy to see that selection bias = 0


q  And, remaining ATT is equal to ATE!

32
Random treatment makes life easy

n  I.e. with random assignment of treatment, our


simple comparison gives us the ATE!
q  This is why we like randomness!
q  But, absent randomness, we must worry that our
comparison is driven by selection bias

33
Natural Experiments – Outline
n  Motivation and definition
n  Understanding treatment effects
q  Notation and definitions
q  Selection bias and why randomization matters
q  Regression for treatment effects

n  Two types of simple differences


n  Difference-in-differences

34
ATE in Regression Format [Part 1]
n  Can re-express everything in regression format
y = β 0 + β1d + u This regression will only give
consistent estimate of β1 if
β 0 = E[ y (0)] cov(d, u) = 0; i.e. treatment, d,
where β1 = y (1) − y(0) is random, and hence,
uncorrelated with y(0)!
u = y (0) − E[ y (0)]

q  If you plug-in, it will get you back to what the


true model, y = y(0) + d[y(1) – y(0)]

35
ATE in Regression Format [Part 2]
n  We are interested in E[y|d =1]–E[y|d =0]
q  But, can easily show that this expression is equal to

β1 + E[ y(0) | d = 1] − E[ y(0) | d = 0]

Note: Selection bias


Our estimate will
term occurs only if
equal true effect plus
CMI isn't true!
selection bias term

36
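Side note – a small simulation makes the selection bias term concrete; this sketch (all numbers made up) builds in a true treatment effect of 1 but makes treatment more likely when y(0) is high, so the simple comparison overstates the effect:

    clear all
    set seed 2024
    set obs 5000
    gen y0 = rnormal()               // potential outcome if untreated
    gen y1 = y0 + 1                  // potential outcome if treated; true ATE = 1
    gen d  = (y0 + rnormal() > 0)    // treatment more likely when y0 is high (non-random)
    gen y  = y0 + d*(y1 - y0)        // observed outcome

    reg y d, robust                  // coefficient on d = 1 + positive selection bias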
Adding additional controls [Part 1]
n  Regression format also allows us to easily
put in additional controls, X
q  Intuitively, comparison of treated and untreated
just becomes E[y(1)|d =1, X]–E[y(0)|d =0,X]
q  Same selection bias term will appear if treatment,
d, isn't random after conditioning on X
q  Regression version just becomes Why might
y = β0 + β1d + ΓX + u there still be a
selection bias?

37
Adding additional controls [Part 2]
n  Selection bias can still be present if treatment
is correlated with unobserved variables
q  As we saw earlier, it is what we can't observe
(and control for) that can be a problem!

Question: If we had truly randomized


experiment, are controls necessary?

38
Adding additional controls [Part 3]
n  Answer: No, controls are not necessary in
truly randomized experiment
q  But, they can be helpful in making the estimates
more precise by absorbing residual variation…
we'll talk more about this later

39
Treatment effect – Example
n  Suppose compare leverage of firms with and
without a credit rating [or equivalently, regress
leverage on indicator for rating]
q  Treatment is having a credit rating
q  Outcome of interest is leverage

Why might our estimate not equal ATE of rating?


Why might controls not help us much?

40
Treatment effect – Example Answer
n  Answer #1: Having a rating isn't random
q  Firms with rating likely would have had higher
leverage anyway because they are larger, more
profitable, etc.; selection bias will be positive
q  Selection bias is basically an omitted var.!

n  Answer #2: Even adding controls might


not help if firms also differ in unobservable
ways, like investment opportunities

41
Heterogeneous Effects
n  Allowing the effect of treatment to vary
across individuals doesn't affect much
q  Just introduces additional bias term
q  Will still get ATE if treatment is random…
broadly speaking, randomness is key

42
Natural Experiments – Outline
n  Motivation and definition
n  Understanding treatment effects
n  Two types of simple differences
q  Cross-sectional difference & assumptions
q  Time-series difference & assumptions
q  Miscellaneous issues & advice
We actually just
n  Difference-in-differences did this one!

43
Cross-sectional Simple Difference

n  Very intuitive idea


q  Compare post-treatment outcome, y, for
treated group to the untreated group
q  I.e. just run following regression…

44
In regression format…
n  Cross-section simple difference
yi ,t = β 0 + β1di + ui ,t

q  d = 1 if observation i is in treatment


group and equals zero otherwise
q  Regression only contains post-
treatment time periods

What is needed for β1 to capture the


true (i.e. causal) treatment effect?

45
Identification Assumption

n  Answer: E(u|d) = 0; i.e. treatment, d, is


uncorrelated with the error
q  In words… after accounting for effect of
treatment, the expected level of y in post-
treatment period isn't related to whether you're
in the treated or untreated group
q  I.e., expected y of treated group would have been
same as untreated group absent treatment

46
Another way to see the assumption…
E[y|d=1] − E[y|d=0]          [this is the causal interpretation of the coefficient on d]
= (β_0 + β_1 + E[u|d=1]) − (β_0 + E[u|d=0])
= β_1 + E[u|d=1] − E[u|d=0]          [the CMI assumption ensures these last two terms cancel, such that our interpretation matches the causal β_1]

q  Then, plugging in for u = y(0) – E[y(0)], which is
what the true error is (see earlier slides)…

= β_1 + E[y(0)|d=1] − E[y(0)|d=0]          [i.e. we must assume no selection bias]

47
Multiple time periods & SEs

n  If have multiple post-treatment periods,


need to be careful with standard errors
q  Errors ui,t and ui,t+1 likely correlated if
dependent variable exhibits serial correlation
n  E.g. we observe each firm (treated and untreated) for
five years after treatment (e.g. regulatory change), and
our post-treatment observations are not independent

48
Multiple time periods & SEs – Solution

n  Should do one of two things


q  Collapse data to one post-treatment per
unit; e.g. for each firm, use average of the
firm's post-treatment observations
q  Or, cluster standard errors at firm level
[We will come back to clustering in later lecture]

49
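Side note – a minimal Stata sketch of the two fixes for a cross-sectional simple difference estimated on several post-treatment years (firm, year, y, and the treatment dummy d are placeholders):

    * Option 1: keep the panel and cluster standard errors by firm
    reg y d, vce(cluster firm)

    * Option 2: first collapse to one post-treatment observation per firm
    preserve
    collapse (mean) y d, by(firm)
    reg y d, robust
    restore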
Natural Experiments – Outline
n  Motivation and definition
n  Understanding treatment effects
n  Two types of simple differences
q  Cross-sectional difference & assumptions
q  Time-series difference & assumptions
q  Miscellaneous issues & advice

n  Difference-in-differences

50
Time-series Simple Difference
n  Very intuitive idea
q  Compare pre- and post-treatment
outcomes, y, for just the treated group
[i.e. pre-treatment period acts as 'control' group]
q  I.e. run following regression…

51
In Regression Format
n  Time-series simple difference
yi ,t = β 0 + β1 pt + ui ,t

q  pt = 1 if period t occurs after treatment and


equals zero otherwise
q  Regression contains only observations that
are treated by “experiment”

What is needed for β1 to capture the true


(i.e. causal) treatment effect?

52
Identification Assumption

n  Answer: E(u|p) = 0; i.e. post-treatment


indicator, p, is uncorrelated with the error
q  I.e., after accounting for effect of treatment, p,
the expected level of y in post-treatment
period wouldn't have been any different than
expected y in pre-treatment period

53
Showing the assumption math…
E[y|p=1] − E[y|p=0]          [this would be the causal interpretation of the coefficient on p]
= (β_0 + β_1 + E[u|p=1]) − (β_0 + E[u|p=0])
= β_1 + E[u|p=1] − E[u|p=0]
= β_1 + E[y(0)|p=1] − E[y(0)|p=0]          [same selection bias term… our estimated coefficient on p only matches true causal effect if this is zero]

54
Again, be careful about SEs

n  Again, if have multiple pre- and post-treatment


periods, need to be careful with standard errors
q  Either cluster SEs at level of each unit
q  Or, collapse data down to one pre- and one post-
treatment observation for each cross-section

55
Using a First-Difference (FD) Approach
n  Could also run regression using first-
differences specification
yi ,t − yi ,t −1 = β1 ( pt − pt −1 ) + (ui ,t − ui ,t −1 )

q  If just one pre- and one post-treatment period


(i.e. t-1 and t ), then will get identical results
q  But, if more than one pre- and post-treatment
period, the results will differ…

56
FD versus Standard Approach [Part 1]
n  Why might these two models give different
estimates of β1 when there are more than
one pre- and post-treatment periods?

yi ,t = β 0 + β1 pt + ui ,t

versus
yi ,t − yi ,t −1 = β1 ( pt − pt −1 ) + (ui ,t − ui ,t −1 )

57
FD versus Standard Approach [Part 2]
n  Answer:
q  In 1st regression, β1 captures difference between
avg. y pre-treatment versus avg. y post-treatment
q  In 2nd regression, β1 captures difference in Δy
immediately after treatment versus Δy in all
other pre- and post-treatment periods
n  I.e. the Δp variable equals 1 only in immediate post-
treatment period, and 0 for all other periods
[How might this matter in practice?]

58
FD versus Standard Approach [Part 3]
n  Both approaches assume the effect of
treatment is immediate and persistent, e.g.
[Figure: outcome y plotted against period (-5 to +5); y jumps up at treatment (t = 0) and stays at the higher level. In this scenario, both approaches give the same estimate]

59
FD versus Standard Approach [Part 4]
n  But, suppose the following is true...
[Figure: outcome y plotted against period (-5 to +5); here y rises only gradually after treatment. In this scenario, the FD approach gives a much smaller estimate: the 1st approach compares avg. y pre- versus post-treatment, while FD compares Δy between t = -1 and t = 0 against Δy elsewhere (which isn't always zero!)]

60
Correct way to do difference
n  Correct way to get a 'differencing'
approach to match up with the more
standard simple diff specification in
multi-period setting is to instead use
yi , post − yi , pre = β1 + (ui , post − ui , pre )

q  This is exactly the same as simple difference

61
Natural Experiments – Outline
n  Motivation and definition
n  Understanding treatment effects
n  Two types of simple differences
q  Cross-sectional difference & assumptions
q  Time-series difference & assumptions
q  Miscellaneous issues & advice

n  Difference-in-differences

62
Treatment effect isn't always immediate
n  In prior example, the specification is
wrong because the treatment effect only
slowly shows up over time
q  Why might such a scenario be plausible?
q  Answer = Many reasons. E.g. firms might
only slowly respond to change in regulation,
or CEO might only slowly change policy in
response to compensation shock

63
Accounting for a delay…
n  Simple-difference misses this subtlety; it
assumes effect was immediate
n  For this reason, it is always helpful to run
regression that allows effect to vary by period
q  How can you do this?
q  Answer = Insert indicators for each year relative
to the treatment year [see next slide]

64
Non-parametric approach
n  If have 5 pre- and 5 post-treatment
5
obs.;
could estimate : yi ,t = β 0 + ∑ βt pt + ui ,t
t =−4

q  pt is now an indicator that equals 1 if year = t and


zero otherwise; e.g.
n  t = 0 is the period treatment occurs
n  t = -1 is period before treatment

q  βt estimates change in y relative to excluded


periods; you then plot these in graph

65
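Side note – a minimal Stata sketch of this specification for a treated-only panel (firm, year, y, and event_year, the year treatment hits the firm, are placeholder names); the earliest event time is the omitted baseline:

    gen etime  = year - event_year   // runs from -5 to +5 here
    gen etime0 = etime + 5           // shift so factor-variable levels are nonnegative

    * ib0 makes event time -5 (level 0) the omitted baseline period
    reg y ib0.etime0, vce(cluster firm)

    * The coefficients on 1.etime0 ... 10.etime0 trace out y relative to t = -5;
    * plot them (e.g., with the user-written coefplot command) to see the dynamics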
Non-parametric approach – Graph
n  Plot estimates to trace out effect of treatment
[Figure: estimated β_t plotted against period (-4 to +5). The approach allows the effect of treatment to vary by year! Estimates capture the change relative to the excluded period (t-5); the pre-treatment estimates equal zero because y was the same as y in the excluded period (t-5). Could easily plot confidence intervals as well]

66
Simple Differences – Advice
n  In general, simple differences are not that
convincing in practice…
q  Cross-sectional difference requires us to
assume the average y of treated and untreated
would have been same absent treatment
q  Time-series difference requires us to assume
the average y would have been same in post-
and pre-treatment periods absent treatment

n  Is there a better way?

67
Natural Experiments – Outline
n  Motivation and definition
n  Understanding treatment effects
n  Two types of simple differences
n  Difference-in-differences
q  Intuition & implementation
q  “Parallel trends” assumption

68
Difference-in-differences
n  Yes, we can do better!
n  We can do a difference-in-differences that
combines the two simple differences
q  Intuition = compare change in y pre- versus
post-treatment for treated group [1st difference]
to change in y pre- versus post-treatment for
untreated group [2nd difference]

69
Implementing diff-in-diff
n  Difference-in-differences estimator
yi ,t = β 0 + β1 pt + β 2 di + β3 ( di × pt ) + ui ,t

q  pt = 1 if period t occurs after treatment


and equals zero otherwise
q  di = 1 if unit is in treated group and
equals zero otherwise

What do β1, β2, and β3 capture?

70
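Side note – a minimal Stata sketch of this regression (firm, year, and y are placeholders; treat = d and post = p from the slide); the coefficient on the interaction is the diff-in-diff estimate β3:

    gen treat_post = treat*post
    reg y post treat treat_post, vce(cluster firm)

    * Equivalent factor-variable syntax: the 1.treat#1.post coefficient is β3
    reg y i.treat##i.post, vce(cluster firm)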
Interpreting the estimates [Part 1]

n  Here is how to interpret everything…


q  β1 captures the average change in y from the pre-
to post-treatment periods that is common to both
treated and untreated groups
q  β2 captures the average difference in level of y
between treated and untreated groups that is
common to both pre- and post-treatment periods

71
Interpreting the estimates [Part 2]

q  β3 captures the average differential change in y


from the pre- to post-treatment period for the
treatment group relative to the change in y for the
untreated group

n  β3 is what we call the diff-in-diff estimate


When does β3 capture the causal effect
of the treatment?

72
Natural Experiments – Outline
n  Motivation and definition
n  Understanding treatment effects
n  Two types of simple differences
n  Difference-in-differences
q  Intuition & implementation
q  “Parallel trends” assumption

73
“Parallel trends” assumption

n  Identification assumption is what we call


the parallel trends assumption
q  Absent treatment, the change in y for treated
would not have been different than the change
in y for the untreated observations
n  To see why this is the underlying identification
assumption, it is helpful to re-express the diff-in-diff…

74
Differences estimation
n  Equivalent way to do difference-in-differences
is to instead estimate the following:

y_i,post − y_i,pre = β0 + β1·d_i + (u_i,post − u_i,pre)

q  β1 gives the difference-in-differences estimate


n  In practice, don't do this because an adjustment to
standard errors is necessary to get right t-stat
n  And remember! This is not the same as taking first-
differences; FD will give misleading results

75
Difference-in-differences – Visually
n  Looking at what difference-in-differences
estimate is doing in graphs will also help you
see why the parallel trends assumption is key
y_it = β0 + β1·p_t + β2·d_i + β3·(d_i × p_t) + u_it

76
Diff-in-diffs – Visual Example #1
[Figure: outcome y vs. period (-5 to 5) for treated and untreated groups, with the unobserved counterfactual as a dashed line; β2 is the pre-treatment level gap between groups, β1 is the post-treatment shift common to both groups, and β3 is the treated group's additional jump relative to the counterfactual]

77
Diff-in-diff – Visual Example #2
[Figure: same setup as Example #1, but the untreated group's y also rises after the event; β1 now takes out the average pre- vs. post-treatment difference common to both groups, β2 the level gap, and β3 the treated group's differential jump relative to the unobserved counterfactual]

78
Violation of parallel trends – Visual
[Figure: outcome y vs. period for treated and untreated; the treated group was already trending up faster before the event. There is no treatment effect, but β3 > 0 because the parallel trends assumption was violated]

79
Why we like diff-in-diff [Part 1]
n  With simple difference, any of the below
arguments would prevent causal inference
q  Cross-sectional diff – “Treatment and
untreated avg. y could be different for reasons
a, b, and c, that just happen to be correlated
with whether you are treated or not”
q  Time-series diff – “Treatment group's avg. y
could change post- treatment for reasons a, b,
and c, that just happen to be correlated with
the timing of treatment”

80
Why we like diff-in-diff [Part 2]
n  But, now the required argument to suggest
the estimate isn't causal is…
q  “The change in y for treated observations after
treatment would have been different than
change in y for untreated observations for
reasons a, b, and c, that just happen to be
correlated with both whether you are treated
and when the treatment occurs”
This is (usually) a much
harder story to tell

81
Example…
n  Bertrand & Mullainathan (JPE 2003) uses
state-by-state changes in regulations that
made it harder for firms to do M&A
q  They compare wages at firms pre- versus post-
regulation in treated versus untreated states
q  Are the below valid concerns about their
difference-in-differences…

82
Are these concerns for internal validity?
n  The regulations were passed during a time
period of rapid growth of wages nationally…
q  Answer = No. Indicator for post-treatment
accounts for common growth in wages

n  States that implement regulation are more likely


have unions, and hence, higher wages…
q  Answer = No. Indicator for treatment
accounts for this average difference in wages

83
Example continued…
n  However, ex-ante average differences is
troublesome in some regard…
q  Suggests treatment wasn't random
q  And, ex-ante differences can be problematic if
we think they their effect may vary with time…
n  Time-varying omitted variables are problematic
because they can cause violation of “parallel trends”
n  E.g. states with more unions were trending differently
at that time because of changes in union power

84
Summary of Today [Part 1]

n  Natural experiment provides random


variation in x that allows causal inference
q  Can be used in IV, regression discontinuity, but
most often associated with “treatment” effects

n  Two types of simple differences


q  Post-treatment comparison of treated & untreated
q  Pre- and post-treatment comparison of treated

85
Summary of Today [Part 2]

n  Simple differences require strong


assumptions; typically not plausible
n  Difference-in-differences helps with this
q  Compares change in y pre- versus post-treatment
for treated to change in y for untreated
q  Requires “parallel trends” assumption

86
In First Half of Next Class
n  Natural experiments [Part 2]
q  How to handle multiple events
q  Triple differences
q  Common robustness tests that can be used to
test whether internal validity is likely to hold

n  Related readings… see syllabus

87
Assign papers for next week…
n  Jayaratne and Strahan (QJE 1996)
q  Bank deregulation and economic growth

n  Bertrand and Mullainathan (JPE 2003)


q  Governance and managerial preferences

n  Hayes, Lemmon, and Qiu (JFE 2012)


q  Stock options and managerial incentives

88
Break Time
n  Let's take our 10 minute break
n  We'll do presentations when we get back

89
FNCE 926
Empirical Methods in CF
Lecture 7 – Natural Experiment [P2]

Professor Todd Gormley


Announcements & Informal Survey
n  Exercise #3 is due
n  Please fill out informal survey
q  Helps me figure out what changes I can
make to improve the course for the
second half and for future years
n  For example, what topic should I have
spent more time on? What topic did you
find the most interesting? Is there too
much, or too little work? Etc.

2
Background readings
n  Roberts and Whited
q  Sections 2.2, 4
n  Angrist and Pischke
q  Section 5.2

3
Outline for Today
n  Quick review of last lecture
n  Continue to discuss natural experiments
q  How to handle multiple events
q  Triple differences
q  Common robustness tests that can be used to
test whether internal validity is likely to hold

n  Student presentations of “NE #1” papers

4
Quick Review[Part 1]

n  Natural experiment provides random


variation in x that allows causal inference
q  Can be used in IV, regression discontinuity, but
most often associated with “treatment” effects

n  Two types of simple differences


q  Post-treatment comparison of treated & untreated
q  Pre- and post-treatment comparison of treated

5
Quick Review [Part 2]
n  Difference-in-differences is estimated with…
y_it = β0 + β1·p_t + β2·d_i + β3·(d_i × p_t) + u_it
q  Compares change in y pre- versus post-treatment
for treated to change in y for untreated
q  Requires “parallel trends” assumption

n  Let’s test your ability to identify a violation of


the necessary assumptions for simple diffs
and diff-in-diffs…

6
Quick Review [Part 3]
n  Suppose Spain exits the Euro. And, Ann
compares profitability of Spanish firms after
the exit to profitability before…
n  What is necessary for the comparison to
have any causal interpretation?
q  Answer = We must assume profitability after
Spain’s exit would have been same as profitability
prior to exit absent exit… Highly implausible

7
Quick Review [Part 4]
n  Now, suppose Bob compares profitability of
Spanish firms after the exit to profitability of
German firms after exit…
n  What is necessary for the comparison to
have any causal interpretation?
q  Answer = We must assume profitability of
Spanish firm would have been same as
profitability of German firms absent exit…
Again, this is highly implausible

8
Quick Review [Part 5]
n  Lastly, suppose Charlie compares change in
profitability of Spanish firms after exit to
change in profitability of German firms
n  What is necessary for the comparison to
have any causal interpretation?
q  Answer = We must assume change in profitability
of Spanish firm would have been same as change
for German firms absent exit… I.e. parallel
trends assumption

9
Natural Experiment [P2] – Outline
n  Difference-in-difference continued…
q  Using group means to get an estimate
q  When additional controls are appropriate
n  How to handle multiple events
n  Falsification tests
n  Triple differences

10
Standard Regression Format
n  Difference-in-differences estimator
y_it = β0 + β1·p_t + β2·d_i + β3·(d_i × p_t) + u_it

q  pt = 1 if period t occurs after treatment


and equals zero otherwise
q  di = 1 if unit is in treated group and
equals zero otherwise

But, there is another way that just involves


comparing four sample means…

11
Comparing group means approach
n  To see how we can get the same estimate,
β3, by just comparing sample means, first
calculate expected y under four possible
combinations of p and d indicators

12
Comparing group means approach [P1]
n  Again, the regression is…
y_it = β0 + β1·p_t + β2·d_i + β3·(d_i × p_t) + u_it

q  And, the four possible combinations are:

E(y | d=1, p=1) = β0 + β1 + β2 + β3
E(y | d=1, p=0) = β0 + β2
E(y | d=0, p=1) = β0 + β1
E(y | d=0, p=0) = β0

q  What assumption did I make in doing this?
q  Answer: E(u|d,p) = 0; i.e. the "experiment" is random

13
Comparing group means approach [P2]
E(y | d=1, p=1) = β0 + β1 + β2 + β3
E(y | d=1, p=0) = β0 + β2
E(y | d=0, p=1) = β0 + β1
E(y | d=0, p=0) = β0

n  These can be arranged in two-by-two table

                        Post-Treatment (1)   Pre-Treatment (2)   Difference, (1)-(2)
Treatment, (a)          β0+β1+β2+β3          β0+β2               β1+β3
Control, (b)            β0+β1                β0                  β1
Difference, (a)-(b)     β2+β3                β2                  β3

14
Comparing group means approach [P3]
n  Now take the simple differences
                        Post-Treatment (1)   Pre-Treatment (2)   Difference, (1)-(2)
Treatment, (a)          β0+β1+β2+β3          β0+β2               β1+β3
Control, (b)            β0+β1                β0                  β1
Difference, (a)-(b)     β2+β3                β2                  β3

15
Comparing group means approach [P4]
n  Then, take difference-in-differences!
                        Post-Treatment (1)   Pre-Treatment (2)   Difference, (1)-(2)
Treatment, (a)          β0+β1+β2+β3          β0+β2               β1+β3
Control, (b)            β0+β1                β0                  β1
Difference, (a)-(b)     β2+β3                β2                  β3

This is why they call it the difference-in-differences


estimate; regression gives you same estimate as if you
took differences in the group averages

16
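n  As a sanity check, the same number can be computed directly from the four group means; a short Stata sketch (variable names y, treat, and post are hypothetical placeholders):

  * variable names are hypothetical placeholders
  * compute the four cell means and form the difference-in-differences by hand
  summarize y if treat==1 & post==1, meanonly
  scalar m11 = r(mean)
  summarize y if treat==1 & post==0, meanonly
  scalar m10 = r(mean)
  summarize y if treat==0 & post==1, meanonly
  scalar m01 = r(mean)
  summarize y if treat==0 & post==0, meanonly
  scalar m00 = r(mean)
  display "diff-in-diff = " (m11 - m10) - (m01 - m00)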
Simple difference – Revisited [Part 1]
n  Useful to look at simple differences
                        Post-Treatment (1)   Pre-Treatment (2)   Difference, (1)-(2)
Treatment, (a)          β0+β1+β2+β3          β0+β2               β1+β3
Control, (b)            β0+β1                β0                  β1
Difference, (a)-(b)     β2+β3                β2                  β3

q  The difference in column (1), β2+β3, was the cross-sectional simple difference
q  When does that simple diff give effect of treatment, β3?
q  Answer = when β2 equals zero; i.e. no difference in level of y absent treatment

17
Simple difference – Revisited [Part 2]
n  Now, look at time-series simple diff…
                        Post-Treatment (1)   Pre-Treatment (2)   Difference, (1)-(2)
Treatment, (a)          β0+β1+β2+β3          β0+β2               β1+β3
Control, (b)            β0+β1                β0                  β1
Difference, (a)-(b)     β2+β3                β2                  β3

q  The difference in row (a), β1+β3, was the time-series simple difference
q  When does that simple diff give effect of treatment, β3?
q  Answer = when β1 equals zero; i.e. no change in y absent treatment

18
Why the regression is helpful
n  Some papers will just report this simple
two-by-two table as their estimate
n  But, there are advantages to the regression
q  Can modify it to test timing of treatment
[we will talk about this in robustness section]
q  Can add additional controls, X

19
Natural Experiment [P2] – Outline
n  Difference-in-difference continued…
q  Using group means to get an estimate
q  When additional controls are appropriate
n  How to handle multiple events
n  Falsification tests
n  Triple differences

20
Adding controls to diff-in-diff
n  Easy to add controls to regression
y_it = β0 + β1·p_t + β2·d_i + β3·(d_i × p_t) + Γ·X_it + u_it

q  X is some vector of controls


q  Г is vector of coefficients

n  E[y|d,p] in prior proofs just


becomes E[y|d,p,X]
From earlier lecture,
what type of controls
should you NEVER add?

21
When controls are inappropriate
n  Remember! You should never add controls
that might themselves be affected by treatment
q  Angrist-Pischke call this a “bad control”
q  You won’t be able to get a consistent estimate of β3
from estimating the equation

22
A Pet Peeve of TG – Refined
n  If you have a treatment that is truly random, do
not put in controls affected by the treatment!
q  I’ve had many referees force me to add controls
that are likely to be affected by the treatment…
q  If this happens to you, put in both regressions (with
and without controls), and at a minimum, add a
caveat as to why adding controls is inappropriate

23
When controls are appropriate

n  Two main reasons to add controls


q  Improve precision (i.e. lower standard errors)
q  Restore ‘random’ assignment of treatment

24
#1 – To improve precision
n  Adding controls can soak up some of
residual variation (i.e. noise) allowing you
to better isolate the treatment effect
q  Should the controls change the estimate?
n  NO! If treatment is truly random, adding
controls shouldn’t affect actual estimate; they
should only help lower the standard errors!

q  If adding controls changes estimates, you


might have 'bad controls' or worse, non-random treatment

25
Example – Improving precision
n  Suppose you have firm-level panel data
n  Some natural experiment ‘treats’ some
firms but not other firms
q  Could just estimate the standard diff-in-diff
y_it = β0 + β1·p_t + β2·d_i + β3·(d_i × p_t) + u_it

q  Or, could add fixed effects (like firm and year
FE) to get more precise estimate…

26
Example – Improving precision [Part 2]
n  So, suppose you estimate…
y_it = β0 + β1·p_t + β2·d_i + β3·(d_i × p_t) + α_i + δ_t + u_it

q  α_i are firm fixed effects; δ_t are year fixed effects

q  What meaning does β1 have now?


q  What meaning does β2 have now?

27
Example – Improving precision [Part 3]

n  Trick question! They have no meaning!


q  pt is perfectly collinear with year FE
[because it doesn’t vary across firms]
q  di is perfectly collinear with firm FE
[because it doesn’t vary across time for each firm]

n  Stata just randomly drops a couple of the FE


q  The estimates on pt and di are just random
intercepts with no meaning

28
Example – Improving precision [Part 4]
n  Instead, you should estimate…
y_it = β0 + β3·(d_i × p_t) + α_i + δ_t + u_it

q  Firm fixed effects (α_i) control for treatment; year fixed effects (δ_t) control for post-treatment

q  This is what some call the generalized


difference-in-differences estimator

29
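n  A minimal Stata sketch of the generalized diff-in-diff (hypothetical names; treat and post are 0/1 variables, firm_id and year identify the panel):

  * variable names are hypothetical placeholders
  * firm FE via the fe option, year FE via year dummies; only the interaction enters
  gen treat_post = treat*post
  xtset firm_id year
  xtreg y treat_post i.year, fe vce(cluster firm_id)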
Generalized Difference-in-differences

n  Advantage of generalized differences-in-


differences is that it can improve precision
and provide better fit of model
q  It doesn’t assume all firms in treatment (or
untreated) group have same average y; it allows
intercept to vary for each firm
q  It doesn’t assume that common change in y
around event is a simple change in level; it
allows common change in y to vary by year

30
Generalized D-i-D – Example [Part 1]
q  To see how Generalized D-i-D can be
helpful, consider the example from last week
[Figure: outcome y vs. period (-5 to 5); the post dummy p_t only takes out the average jump (β1) at treatment]

31
Generalized D-i-D – Example [Part 2]
q  Year dummies will better fit actual trend
[Figure: same example; the year dummies track the actual year-by-year path of y]

32
When controls are appropriate

n  Two main reasons to add controls


q  Improve precision (i.e. lower standard errors)
q  Restore ‘random’ assignment of treatment

33
#2 – Restore randomness of treatment
n  Suppose the following is true… I.e. treatment
isn’t random
q  Observations of certain characteristic,
e.g. high x, are more likely to be treated
q  And, firms with this characteristic are likely
to have differential trend in outcome y And, non-
randomness is
n  Adding control for x could restore problematic for
identification
‘randomness’; i.e. being treated is
random after controlling for x!

34
Restoring randomness – Example
n  Natural experiment is change in regulation
q  Firms affected by regulation is random, except
that it is more likely to hit firms that are larger
q  And, we think larger firms might have different
trend in outcome y afterwards for other reasons
q  And, firm size is not going to be affected by the
change in regulation in any way

n  If all true, adding size as control would be an


appropriate and desirable thing to do

35
Controls continued…
n  In prior example, suppose size is potentially
effected by the change in regulation…
q  What would be another approach that won’t run
afoul of the ‘bad control’ problem?
n  Answer: Use firm size in year prior to treatment and
it’s interaction with post-treatment dummy
n  This will control for non-random assignment (based
on size) and differential trend (based on size)

36
Restoring randomness – Caution!
n  In practice, don’t often see use of controls
to restore randomness
q  Requires assumption that non-random
assignment isn’t also correlated with
unobservable variables…
q  So, not that plausible unless there are very
specific reasons for non-randomness

n  But, regression discontinuity is one


example of this; we’ll see it next week

37
One last note… be careful about SEs

n  Again, if have multiple pre- and post-treatment


periods, need to be careful with standard errors
q  Either cluster SEs at level of each unit
q  Or, collapse data down to one pre- and one post-
treatment observation for each cross-section

n  We will discuss more about standard errors in


lecture on “standard errors”

38
Natural Experiment [P2] – Outline
n  Difference-in-difference continued…
n  How to handle multiple events
q  Why they are useful
q  Two similar estimation approaches

n  Falsification tests


n  Triple differences

39
Motivating example…
n  Gormley and Matsa (2011) looked at
firms’ responses to increased left-tail risk
q  Used discovery that workers were exposed to
harmful chemical as exogenous increase in risk
q  One discovery occurred in 2000; a chemical
heavily used by firms producing
semiconductors was found to be harmful

n  Can you think of any concerns about


parallel trends assumption of this setting?

40
Motivating Example – Answer

n  Answer: Yes… This coincides with


bursting of technology bubble; technology
firms might arguably trend differently after
2000 for this reasons unrelated to chemical
q  How might multiple treatment events,
occurring at different times (which is what
Gormley and Matsa used), help?

41
Multiple treatment events
n  Sometimes, the natural experiment is
repeated a multiple points in times for
multiple groups of observations
q  E.g. U.S. states make a particular regulatory
change at different points in time

n  These settings are particularly useful


in mitigating concerns about violation
of parallel trends assumption…

42
How multiple events are helpful

n  Can show that effect of treatment is


similar across different time periods
n  Can show effect of treatment isn’t driven
by a particular set of treated firms
q  I.e. now the “identification police” would
need to come up with story as to why parallel
trends is violated for each unique event

43
Natural Experiment [P2] – Outline
n  Difference-in-difference continued…
n  How to handle multiple events
q  Why they are useful
q  Two similar estimation approaches

n  Falsification tests


n  Triple differences

44
Estimation with Multiple Events

n  Estimating model with multiple


events is still relatively easy to do
q  Use approach of Bertrand and
Mullainathan (JPE 2003)
q  Or, used “stacked” approach of
Gormley and Matsa (RFS 2011)

45
Multiple Events – Approach #1 [P1]
n  Just estimate the following estimation
y_ict = β·d_ict + p_t + m_c + u_ict

q  yict is outcome for unit i (e.g. firm) in period t


(e.g. year) and cohort c, where “cohort” indexes
the different sets of firms treated by each event
n  E.g. different firms might be affected by a change in
regulation at different points in time; firms affected
at one point in time are a ‘cohort’

46
Multiple Events – Approach #1 [P2]

y_ict = β·d_ict + p_t + m_c + u_ict

q  d_ict = indicator on whether cohort c is affected by time t; this is the interaction between treatment & post
q  p_t = time period fixed effects; they will control for the post dummy in each event
q  m_c = cohort fixed effects; they are the control for the treatment dummy in each event

47
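n  A minimal Stata sketch of this specification (hypothetical names: event_year is the year a firm's cohort is treated, missing if never treated):

  * variable names are hypothetical placeholders
  * d_ict turns on once a firm's cohort has been treated
  gen byte d_ict = !missing(event_year) & (year >= event_year)
  * cohort = set of firms sharing the same event year
  egen cohort = group(event_year), missing
  * year FE play the role of p_t, cohort FE the role of m_c
  areg y d_ict i.year, absorb(cohort) vce(cluster firm_id)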
Multiple Events – Approach #1 [P3]
n  Intuition of this approach…
q  Every untreated observation at a particular
point in time acts as control for treated
observations in that time period
n  E.g. a firm treated in 1999 by some event will
act as a control for a firm treated in 1994 until
itself becomes treated in 1999

q  β will capture average treatment effect


across the multiple events

48
Multiple Events – Approach #2 [P1]
n  Now, think of running generalized diff-in-
diff for just one of the multiple events…
y_it = β·(d_i × p_t) + α_i + δ_t + u_it

q  di = indicator for unit i (e.g. firm) being a


treated firm in that particular event
q  pt = indicator for treatment having occurred
by period t (e.g. year)
q  Unit i and period t FE control for the
independent effects of di and pt

49
Multiple Events – Approach #2 [P2]
q  But, contrary to standard difference-in-
difference, your sample is…
n  Restricted to a small window around event;
e.g. 5 years pre- and post- event
n  And, drops any observations that are
treated by another event
q  I.e. your sample starts only with previously
untreated observations, and if a ‘control’
observation later gets treated by a different event,
those post-event observations are dropped

50
Multiple Events – Approach #2 [P3]
n  Now, create a similar sample for each
“event” being analyzed
n  Then, “stack” the samples into one dataset
and create a variable that identifies the event
(i.e. ‘cohort’) each observation belongs to
q  Note: some observation units will appear
multiple times in the data [e.g. firm 123 might be
a control in event year 1999 but a treated firm in
a later event in 2005]

51
Multiple Events – Approach #2 [P4]
n  Then, estimate the following on the
stacked dataset you’ve created
y_ict = β·d_ict + δ_tc + α_ic + u_ict

q  d_ict = indicator on whether cohort c is affected by time t; this is the interaction between treatment & post
q  δ_tc = time-by-cohort fixed effects; they control for the post dummy in each event (i.e. for each 'stack')
q  α_ic = unit-by-cohort fixed effects; they control for the treatment dummy in each cohort (i.e. in each 'stack')

52
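n  A minimal Stata sketch of the stacked estimation, assuming a stacked dataset has already been built as described on the prior slides (hypothetical variables: cohort identifies the stack, plus firm_id, year, treat, post):

  * variable names are hypothetical; the stacked data are assumed already built
  gen treat_post = treat*post
  egen cohort_year = group(cohort year)       // time-by-cohort fixed effects
  egen cohort_firm = group(cohort firm_id)    // unit-by-cohort fixed effects
  areg y treat_post i.cohort_year, absorb(cohort_firm) vce(cluster firm_id)
  * with many cohorts, a high-dimensional FE command such as the user-written
  * reghdfe (if installed) can absorb both sets of FE more conveniently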
Multiple Events – Approach #2 [P5]
n  This approach has same intuition of the
first approach, but has a couple advantages
q  Can more easily isolate a particular window of
interest around each event
n  Prior approach compared all pre- versus post-
treatment observations against each other

q  Can more easily extend this into a triple-


difference type specification [more on that later]

53
Natural Experiment [P2] – Outline
n  Difference-in-difference continued…
n  How to handle multiple events
n  Falsification tests
n  Triple differences

54
Falsification Tests for D-i-D
n  Can never directly test underlying
identification assumption, but can do some
falsification tests to support its validity
#1 – Compare pre-treatment observables
#2 – Check that timing of observed change in y
coincides with timing of event [i.e. no pre-trend]
#3 – Check for treatment reversal
#4 – Check variables that shouldn’t be affected
#5 – Add a triple-difference

55
#1 – Pre-treatment comparison [Part 1]
n  Idea is that experiment ‘randomly’ treats
some subset of observations
q  If true, then ex-ante characteristics of ‘treated’
observations should be similar to ex-ante
characteristics of ‘untreated’ observations
q  Showing treated and untreated observations are
comparable in dimensions thought to affect y
can help ensure assignment was random

56
#1 – Pre-treatment comparison [Part 2]
n  If find ex-ante difference in some variable
z, is difference-in-difference is invalid?
q  Answer = Not necessarily.
n  We need some story as to why units are expected to
have differential trend in y after treatment (for
reasons unrelated to treatment) that is correlated with
z for this to actually be a problem for identification
n  And, even with this story, we could just control for z
and it’s interaction with time
n  But, what would be the lingering concern?

57
#1 – Pre-treatment comparison [Part 3]
n  Answer = unobservables!
q  If the treated and control differ ex-ante in
observable ways, we worry they might differ in
unobservable ways that related to some
violation of the parallel trends assumption

58
#2 – Check for pre-trend [Part 1]
n  Similar to last lecture, can just allow effect
of treatment to vary by period to non-
parametrically map out the timing
q  “Parallel trends” suggest we shouldn’t observe
any differential trend prior to treatment for the
observations that are eventually treated

59
#2 – Check for pre-trend [Part 2]
n  Estimate the following:
y_it = β0 + β1·d_i + β2·p_t + Σ_t γ_t·(d_i × λ_t) + u_it

q  di and pt are defined just as before


q  λt is indicator that equals 1 if event time = t
and zero otherwise, where
n  t = 0 is the period treatment occurs
n  t = -1 is period before treatment

60
#2 – Check for pre-trend [Part 3]

q  γt estimates change in y relative to excluded


periods; you then plot these in graph
n  Easiest to fully saturate the model (i.e. include
λt for every period but the very first one); then
all estimates γt are relative to this period
n  Can also plot confidence interval for each γt

61
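n  A minimal Stata sketch of this fully saturated check (hypothetical names; event_time runs from -5 to +5, and the earliest period, t = -5, is the omitted one):

  * variable names are hypothetical placeholders
  * interact treatment with an indicator for each event period except the first
  forvalues k = -4/5 {
      local j = `k' + 4
      gen byte dXev`j' = treat * (event_time == `k')
  }
  regress y i.treat i.post dXev0-dXev9, vce(cluster firm_id)
  * plot the coefficients on dXev0-dXev9 (and their CIs) against event time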
#2 – Check for pre-trend [Part 4]
n  Something like this is ideal…
[Figure: γt estimates plotted by period (-4 to 5): no differential pre-trend, and tight confidence intervals]

62
#2 – Check for pre-trend [Part 5]
n  Something like this is very bad
[Figure: γt estimates plotted by period; y for treated firms was already going up at a faster rate prior to the event!]

63
#2 – Check for pre-trend [Part 6]
n  Should we make much of wide confidence
intervals in these graphs? E.g.
[Figure: example plot of γt estimates with wide confidence intervals]
q  Answer: Not too much… Each period's point estimate might be noisy; the diff-in-diff will tell us whether the post-treatment average y is significantly different than the pre-treatment average y

64
#2 – Check for pre-trend [Part 7]
n  Another type of pre-trend check done is to
do the diff-in-diff in some "random" pre-
treatment period to show no effect
q  I’m not a big fan of this… Why?
n  Answer #1 – It is subject to gaming; researcher might
choose a particular pre-period to look at that works
n  Answer #2 – Prior approach allows us to see what the
timing was and determine whether it is plausible

65
#3 – Treatment reversal
n  In some cases, the “natural experiment”
is subsequently reversed
q  E.g. regulation is subsequently undone

n  If we expect the reversal should have the


opposite effect, it is good to confirm this

66
#4 – Unaffected variables
n  In some cases, theory provides guidance
on what variables should be unaffected
by the “natural experiment”
q  If natural experiment is what we think it is,
we should see this in the data… so check

67
#5 – Add Triple difference
n  If theory tells us treatment effect should
be larger for one subset of observations,
we can check this with triple difference
q  Pre- versus post-treatment
q  Untreated versus treated
q  Less sensitive versus More sensitive

This is the third


difference

68
Natural Experiment Outline – Part 2
n  Difference-in-difference continued…
n  How to handle multiple events
n  Falsification tests
n  Triple differences
q  How to estimate & interpret it
q  Using the popular subsample approach

69
Diff-in-diff-in-diff – Regression
y_it = β0 + β1·p_t + β2·d_i + β3·h_i + β4·(p_t × h_i)
       + β5·(d_i × h_i) + β6·(p_t × d_i) + β7·(p_t × d_i × h_i) + u_it

q  pt = 1 if period t occurs after


treatment and equals zero otherwise
q  di = 1 if unit is in treated group and
equals zero otherwise
q  hi = 1 if unit is group that is expected
to be more sensitive to treatment

70
Diff-in-diff-in-diff – Regression [Part 2]
n  How to choose and set hi
q  E.g. If theory says effect is bigger for larger
firms; could set hi = 1 if assets of firm in year
prior to treatment is above the median size
q  Note: Remember to use ex-ante measures to
construct indicator if you think underlying
variable (that determines sensitivity) might be
affected by treatment… Why?
q  Answer = To avoid bad controls!

71
Diff-in-diff-in-diff – Regression [Part 3]
y_it = β0 + β1·p_t + β2·d_i + β3·h_i + β4·(p_t × h_i)
       + β5·(d_i × h_i) + β6·(p_t × d_i) + β7·(p_t × d_i × h_i) + u_it

n  Easy way to check if done correctly…


q  Should have 8 coefficients (including constant)
to capture the 2×2×2=8 different combinations
q  Likewise, a double difference has 4 coefficients
(including constant) for the 2×2=4 combinations

n  What do β6 and β7 capture?

72
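n  A minimal Stata sketch of the triple difference (hypothetical 0/1 variables post, treat, and high for the more-sensitive group):

  * variable names are hypothetical placeholders
  * ## expands to all main effects plus all two- and three-way interactions,
  * giving the 8 coefficients (including the constant) noted above
  regress y i.post##i.treat##i.high, vce(cluster firm_id)
  * 1.post#1.treat is beta_6; 1.post#1.treat#1.high is beta_7, the triple-diff estimate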
Interpreting the estimates [Part 1]

n  β6 diff-in-diff estimate


for the less-sensitive obs.
q  Captures average differential change in y from
the pre- to post-treatment period for the less
sensitive observations in the treatment group
relative to the change in y for the less sensitive
observations in the untreated group

73
Interpreting the estimates [Part 2]

n  β7 is the triple diff estimate; it tells us how


much larger effect is for the more sensitive obs.
q  β7 captures how different the difference-in-
difference estimate is for observations considered
more sensitive to the treatment
q  What is total treatment effect for these firms?
q  Answer = β6+β7

74
Tangent – Continuous vs. Indicator?

n  Can also do the triple difference replacing hi


with a continuous measure instead of indicator
q  E.g. suppose we expect treatment effect is bigger for
larger firms; rather than constructing indicator based
on ex-ante size, could just use ex-ante size
q  What are the advantages, disadvantages of this?

75
Tangent – Continuous vs. Indicator?
n  Advantages
q  Makes better use of variation available in data
q  Provides estimate on magnitude of sensitivity

n  Disadvantages
q  Makes linear functional form assumption;
indicator imposes less structure on the data
q  More easily influenced by outliers

76
Generalized Triple-Difference

n  Similar to diff-in-diff, can add in FE to soak


up the various terms and improve precision
n  E.g. in firm-level panel regression with firm
and year fixed effects, you’d estimate
y_it = β1·(p_t × h_i) + β2·(p_t × d_i) + β3·(p_t × d_i × h_i) + δ_t + α_i + u_it

q  The other terms (including the constant) all drop


out; they are collinear with the FE

77
Natural Experiment [P2] – Outline
n  Difference-in-difference continued…
n  How to handle multiple events
n  Falsification tests
n  Triple differences
q  How to estimate & interpret it
q  Using the popular subsample approach

78
Subsample Approach

n  Instead of doing full-blown triple-difference, you


can also just estimate the double-difference in the
two separate subsamples
q  Double-difference for low sensitive obs. (i.e. hi = 0)
q  Double-difference for more sensitive obs. (i.e. hi = 1)

n  Note: the estimates won’t directly match the β2,


β2+β3 effects in prior estimation… Why?

79
Subsample Approach Differences…

n  Answer = In subsample approach year FE


are allowed to differ by sub-sample
q  Therefore, subsample approach is actually
controlling for more things
q  However, one can easily recover the subsample
estimates in one regression (and test the statistical
difference) between subsamples by estimating…

80
Matching Subsample to Combined [P1]
y_it = β2·(p_t × d_i) + β3·(p_t × d_i × h_i) + δ_t + (δ_t × h_i) + α_i + u_it

q  (δ_t × h_i) are the year FE interacted with the sensitivity indicator

q  Just add interaction between year FE and


indicator for being more sensitivity…
n  This allows for different year FE for each subsample,
which is what happened when we estimated the
subsamples in two separate regressions

81
Matching Subsample to Combined [P2]

n  In prior regression…


q  β2 will equal coefficient from diff-in-diff using just
the subsample of less sensitive observations
q  β2+β3 will equal coefficient from diff-in-diff using
just the subsample of more sensitive observations
q  t-test on β3 tells you whether effect for more
sensitive subsample is statistically different from
that of the less sensitive subsample

82
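n  A minimal Stata sketch of this combined specification (hypothetical names; high is the sensitivity indicator h_i):

  * variable names are hypothetical placeholders
  gen treat_post      = treat*post
  gen treat_post_high = treat*post*high
  * firm FE absorbed; year FE plus year FE interacted with the sensitivity indicator
  areg y treat_post treat_post_high i.year i.year#i.high, absorb(firm_id) vce(cluster firm_id)
  * treat_post recovers the less-sensitive subsample diff-in-diff; adding the
  * coefficient on treat_post_high gives the more-sensitive subsample estimate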
Triple Diff – Stacked Regression [Part 1]
n  Another advantage of stacked regression
approach to multiple events is ability to
more easily incorporate a triple diff
q  Can simply run stacked regression in separate
subsamples to create triple-diff or run it in
one regression as shown previously

83
Triple Diff – Stacked Regression [Part 2]
n  Can’t easily do either of these in approach
of Bertrand and Mullainathan (2003)
q  Some observations act as both ‘control’ and
‘treated’ at different points in sample; not clear
how create subsamples in such a setting

84
External Validity – Final Note
n  While randomization ensures internal
validity (i.e. causal inferences), external
validity might still be an issue
q  Is the experimental setting representative of
other settings of interest to researchers?
n  I.e. can we extrapolate the finding to other settings?
n  A careful argument that the setting isn’t unique or
that the underlying theory (for why you observe what
you observe) is likely to apply elsewhere is necessary

85
Summary of Today [Part 1]

n  Diff-in-diff & control variables


q  Don’t add controls affected by treatment
q  Controls shouldn’t affect estimates, but can
help improve precision

n  Multiple events are helpful in mitigating


concerns about parallel trends assumption

86
Summary of Today [Part 2]

n  Many falsification tests one should do to


help assess internal validity
q  Ex. #1 – Compare ex-ante characteristics
q  Ex. #2 – Check timing of observed effect

n  Triple difference is yet another way to


check internal validity and mitigate
concerns about identification

87
In First Half of Next Class
n  Regression discontinuity
q  What are they?
q  How are they useful?
q  How do we implement them?

n  Related readings… see syllabus

88
Assign papers for next week…
n  Gormley and Matsa (RFS 2011)
q  Risk & CEO agency conflicts

n  Becker and Stromberg (RFS 2012)


q  Agency conflicts between equity & debt

n  Ashwini (JFE 2012)


q  Investor protection laws & corporate policies

89
Break Time
n  Let’s take our 10 minute break
n  We’ll do presentations when we get back

90
FNCE 926
Empirical Methods in CF
Lecture 8 – Regression Discontinuity

Professor Todd Gormley


Announcements
n  Rough draft of research proposal
due next week…
q  Just 1-3 page (single-spaced) sketch of
your proposal is fine…
n  Should clearly state your question
n  Should give me idea of where you’re going
with the identification strategy
n  See grading template on Canvas
q  Upload it to Canvas by noon next week
q  I will read and then send brief feedback

2
Background readings for today
n  Roberts and Whited
q  Section 5
n  Angrist and Pischke
q  Chapter 6

3
Outline for Today
n  Quick review of last lecture on NE
n  Discuss regression discontinuity
q  What is it? How is it useful?
q  How do we implement it?
q  What are underlying assumptions?

n  Student presentations of “NE #2” papers

13
Quick Review[Part 1]

n  Will adding controls affect diff-in-diff estimates


if treatment assignment was random?
q  Answer = Not unless you’ve added ‘bad controls’,
which are controls also affected by treatment.
When you’ve done this, you’re no longer
estimating the causal effect of treatment
q  Controls (that are exogenous) will just improve
precision, but shouldn’t affect estimates

14
Quick Review [Part 2]
n  What are some standard falsification tests you
might want to run with diff-in-diff?
q  Answers:
n  Compare ex-ante characteristics of treated & untreated
n  Check timing of treatment effect
n  Run regression using dep. variables that shouldn’t be
affected by treatment (if it is what we think it is)
n  Check whether reversal of treatment has opposite effect
n  Triple-difference estimation

15
Quick Review [Part 3]
n  If you find ex-ante differences in treated and
treated, is internal validity gone?
q  Answer = Not necessarily but it could suggest
non-random assignment of treatment that is
problematic… E.g. observations with
characteristic ‘z’ are more likely to be treated and
observations with this characteristic are also
likely to be trending differently for other reasons

16
Quick Review [Part 4]
n  Does the absence of a pre-trend in diff-in-diff
ensure that differential trends assumption
holds and causal inferences can be made?
q  Answer = Sadly, no. We can never prove
causality with 100% confidence. It could be that
trend was going to change after treatment for
reasons unrelated to treatment

17
Quick Review [Part 5]
n  How are multiple events that affect
multiple groups helpful?
q  Answer = Can check that treatment effect is
similar across events; helps reduce concerns
about violation of parallel trends since there
would need to be violation for each event

18
Quick Review [Part 6]
n  How are triple differences helpful and
reducing concerns about violation of
parallel trends assumption?
q  Answer = Before, an “identification
policeman” would just need a story about why
treated might be trending differently after
event for other reasons… Now, he/she would
need story about why that different trend
would be particularly true for subset of firms
that are more sensitive to treatment

19
Regression Discontinuity – Outline
n  Basic idea of regression discontinuity
n  Sharp versus fuzzy discontinuities
n  Estimating regression discontinuity
n  Checks on internal validity
n  Heterogeneous effects & external validity

20
Basic idea of RDD
n  The basic idea of regression discontinuity
(RDD) is the following:
q  Observations (e.g. firm, individual, etc.) are
‘treated’ based on known cutoff rule
n  E.g. for some observable variable, x, an
observation is treated if x ≥ x’
n  This cutoff is what creates the discontinuity

q  Researcher is interested in how this treatment


affects outcome variable of interest, y

21
Examples of RDD settings
n  If you think about it, these type of cutoff
rules are commonplace in finance
q  A borrower FICO score > 620 makes
securitization of the loan more likely
n  Keys, et al (QJE 2010)

q  Accounting variable x exceeding some


threshold causes loan covenant violation
n  Roberts and Sufi (JF 2009)

22
RDD is like difference-in-difference…

n  Has similar flavor to diff-in-diff natural


experiment setting in that you can
illustrate identification with a figure
q  Plot outcome y against independent variable
that determines treatment assignment, x
q  Should observe sharp, discontinuous change in
y at the cutoff value of x’

23
But, RDD is different…
n  RDD has some key differences…
q  Assignment to treatment is NOT random;
assignment is based on value of x
q  When treatment only depends on x (what I’ll
later call “sharp RDD”, there is no overlap in
treatment & controls; i.e. we never observe the
same x for a treatment and a control

24
RDD randomization assumption
n  Assignment to treatment and control isn’t
random, but whether individual observation is
treated is assumed to be random
q  I.e. researcher assumes that observations (e.g. firm,
person, etc.) can’t perfectly manipulate their x value
q  Therefore, whether an observation’s x falls
immediately above or below key cutoff x’ is random!

25
Regression Discontinuity – Outline
n  Basic idea of regression discontinuity
n  Sharp versus fuzzy discontinuities
q  Notation & ‘sharp’ vs. fuzzy assumption
q  Assumption about local continuity

n  Estimating regression discontinuity


n  Checks on internal validity
n  Heterogeneous effects & external validity

26
RDD terminology

q  x is called the “forcing variable”


n  Can be a single variable or multiple
variables; but for simplicity, we’ll work with
a single variable

q  x’ is called the “threshold”


q  y(0) is outcome absent treatment

q  y(1) is outcome with treatment

27
Two types of RDD
n  Sharp RDD
q  Assignment to treatment only depends on x; i.e.
if x ≥ x’ you are treated with probability 1

n  Fuzzy RDD


q  Having x ≥ x’ only increases probability of
treatment; i.e. other factors (besides x) will
influence whether you are actually treated or not

28
Sharp RDD assumption #1
n  Assignment to treatment occurs through
known and deterministic decision rule:
d = d(x) = 1 if x ≥ x', and 0 otherwise
q  Weak inequality and direction of treatment is
unimportant [i.e. could easily have x < x’]
q  But, it is important that there exists x’s
around the threshold value

29
Sharp RDD assumption #1 – Visually

Probability of treatment
moves from 0 to 1 around
threshold value x’

No untreated for x > x’


and no treated for x < x’

Only x determines
treatment

Figure is from Roberts and Whited (2010)

30
Sharp RDD – Examples
n  Ex. #1 – PSAT score > x’ means student
receives national merit scholarship
q  Receiving scholarship was determined solely
based on PSAT scores in the past
q  Thistlewaithe and Campbell (1960) used this to
study effect of scholarship on career plans

31
Fuzzy RDD assumption #1
n  Assignment to treatment is stochastic in
that only the probability of treatment has
known discontinuity at x’
0 < lim_{x↓x'} Pr(d=1|x) − lim_{x↑x'} Pr(d=1|x) < 1

q  Can also go other way, i.e. probability of


treatment drops at x’; all that is needed is
jump in the probability of treatment at x’

32
Fuzzy RDD assumption #1 – Visually
Treatment probability
increases at x’

Some untreated for x > x’


and some treated for x < x’

Treatment is not
purely driven by x

Figure is from Roberts and Whited (2010)

33
Fuzzy RDD – Example
n  Ex. #1 – FICO score > 620 increases
likelihood of loan being securitized
q  But, extent of loan documentation, lender,
etc., will matter as well…

34
Sharp versus Fuzzy RDD
n  This subtle distinction affects exactly how
you estimate the causal effect of treatment
q  With Sharp RDD, we will basically compare
average y immediate above and below x’
q  With fuzzy RDD, the average change in y around
threshold understates causal effect [Why?]
n  Answer = Comparison assumes all observations were
treated, but this isn’t true; if all observations had been
treated, observed change in y would be even larger; we
will need to rescale based on the change in probability

35
Regression Discontinuity – Outline
n  Basic idea of regression discontinuity
n  Sharp versus fuzzy discontinuities
q  Notation & ‘sharp’ vs. fuzzy assumption
q  Assumption about local continuity

n  Estimating regression discontinuity


n  Checks on internal validity
n  Heterogeneous effects & external validity

36
RDD assumption #2
n  But, both RDDs share the following
assumption about local continuity
n  Potential outcomes, y(0) and y(1),
conditional on forcing variable, x, are
continuous at threshold x’
q  In words: y would be a smooth function around
threshold absent treatment; i.e. don’t expect any
jump in y at threshold x’ absent treatment

37
RDD assumption #2 – Visually
If all obs. had been treated, y
would be smooth around x’;
other lines says equivalent thing
for if none had been treated

Dashed lines
represent unobserved
counterfactuals

Figure is from Roberts and Whited (2010)

38
Regression Discontinuity – Outline
n  Basic idea of regression discontinuity
n  Sharp versus fuzzy discontinuities
n  Estimating regression discontinuity
q  Sharp regression discontinuity
q  Fuzzy regression discontinuity

n  Checks on internal validity


n  Heterogeneous effects & external validity

39
How not to do Sharp RDD…
n  Given this setting, will the below estimation
reveal causal effect of treatment, d, on y?
y_i = β0 + β1·d_i + u_i

q  Answer = Unlikely! d is correlated with x, and


if x affects y, then there will be omitted variable!
n  E.g. Borrowers FICO score, used in Keys, et al
(2010) affects likelihood of default… therefore,
above regression can NOT be used to determine
effect of securitization on default risk

40
How not to do Sharp RDD… [Part 2]
n  How can we modify previous regression
to account for this omitted variable?
q  Answer: Control for x!
q  So, we could estimate: yi = β0 + β1di + β 2 xi + ui

q  But, why might this still be problematic?


n  Answer: (1) Assumes effect of x is linear, and (2)
doesn’t really make use of random assignment,
which is really occurring near the threshold

41
Bias versus Noise
n  Ideally, we’d like to compare average x right
below and right above x’; what is tradeoff?
q  Answer: We won’t have many observations and
estimate will be very noisy. A wider range of x
on each side reduces this noise, but increases risk
of bias that observations further from threshold
might vary for other reasons (including because
of the direct effect of x on y)

42
Bias versus Noise – Visual
[Figure: scatter of y against x around the cutoff x']
q  Only a couple points near cutoff, x'… if just use them, get very noisy estimate
q  But, if compared average of y using wider bins, I'd pick up 'discontinuity' where none might exist because I'd incorrectly capture effect of x on y

43
Estimating Sharp RDD
n  There are generally two ways to do RDD
that weigh that try to balance this tradeoff
between bias and noise
q  Use all data, but control for effect of x on y
in a very general and rigorous way
q  Use less rigorous controls for effect of x, but
only use data in small window around threshold

44
Estimating Sharp RDD, Using all data
n  First approach uses all the data available
and estimates two separate regressions
y_i = β^b + f(x_i − x') + u_i^b    [estimate using only data below x']
y_i = β^a + g(x_i − x') + u_i^a    [estimate using only data above x']

q  Just let f( ) and g( ) be any continuous function of x_i − x', where f(0) = g(0) = 0
q  Treatment effect = β^a − β^b

45
Interpreting the Estimates…
y_i = β^b + f(x_i − x') + u_i^b
y_i = β^a + g(x_i − x') + u_i^a

n  Why are f( ) and g( ) included?
q  Answer = They are there to control for the underlying effect of x on y

n  What do β^b and β^a estimate?
q  Answer = β^b is E[y|x=x'] from below, and β^a is E[y|x=x'] from above

46
Easier way to do this estimation
n  Can do all in one step; just use all the
data at once and estimate:
y_i = α + β·d_i + f(x_i − x') + d_i × g(x_i − x') + u_i

q  d_i = indicator for x ≥ x'
q  f( ) and d_i × g( ) control for the relationship between x and y both below and above x'
q  Estimate for β will equal β^a − β^b
q  Recall: What would we be assuming if we drop d_i × g( )?

47
Tangent about dropping g( )
n  Answer: If you drop di × g ( xi − x ') , you
assume functional form between x and y
is same above and below x’
q  Can be strong assumption, which is probably
why it shouldn’t be only specification used
q  But, Angrist and Pischke argue it usually
doesn’t make a big difference in practice

48
What should we use for f( ) and g( )?
n  In practice, a high-order polynomial
function is used for both f( ) and g( )
q  E.g. You might use a cubic polynomial

y_i = α + β·d_i + Σ_{s=1}^{3} γ_s^b (x_i − x')^s + Σ_{s=1}^{3} γ_s^a d_i (x_i − x')^s + u_i

q  How might you determine the correct


order of polynomial to use in practice?

49
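n  A minimal Stata sketch of the one-step sharp RDD with a cubic on each side of the threshold (hypothetical names: x is the forcing variable; 620 is a hypothetical cutoff value, echoing the FICO example):

  * variable names and the cutoff value (620) are hypothetical placeholders
  gen xc  = x - 620                 // forcing variable centered at the threshold
  gen xc2 = xc^2
  gen xc3 = xc^3
  gen byte d = (xc >= 0)            // treatment indicator
  gen d_xc  = d*xc                  // allow the polynomial to differ above the cutoff
  gen d_xc2 = d*xc2
  gen d_xc3 = d*xc3
  regress y d xc xc2 xc3 d_xc d_xc2 d_xc3, vce(robust)
  * the coefficient on d estimates the jump in y at the threshold (beta_a - beta_b)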
Sharp RDD – Robustness Check
n  Ultimately, correct order of polynomial is
unknown; so, best to show robustness
q  Should try to illustrate that findings are robust
to different polynomial orders
q  Can do graphical analysis to provide a visual
inspection that polynomial order is correct
[I will cover graphical analysis in a second]

50
Estimating Sharp RDD, Using Window
n  Do same RDD estimate as before, but…
q  Restrict analysis to smaller window around x’
q  Use lower polynomial order controls

n  E.g. estimate below model in window


x’ – Δ ≤ x ≤ x’ + Δ for some Δ > 0
y_i = α + β·d_i + γ^b (x_i − x') + γ^a d_i (x_i − x') + u_i

q  Controls are now just linear in this example

51
Practical issues with this approach
n  What is appropriate window width and
appropriate order of polynomial?
q  Answer = There is no right answer! But, it
probably isn’t as necessary to have as
complicated of polynomial in smaller window
q  But, best to just show robustness to choice
of window width, Δ, and polynomial order

52
Tradeoff between two approaches
n  Approach with smaller window can be
subject to greater noise, but advantage is…
q  Doesn’t assume constant effect of treatment for
all values of x in the sample; in essence you are
estimating local avg. treatment effect
q  Less subject to risk of bias because correctly
controlling for relationship between x and y is
less important in the smaller window

53
Regression Discontinuity – Outline
n  Basic idea of regression discontinuity
n  Sharp versus fuzzy discontinuities
n  Estimating regression discontinuity
q  Sharp regression discontinuity
q  Graphical analysis
q  Fuzzy regression discontinuity

n  Checks on internal validity


n  Heterogeneous effects & external validity

54
Graphical Analysis of RDD
n  Can construct a graph to visually
inspect whether a discontinuity exists
and whether chosen polynomial order
seems to fit the data well
q  Always good idea to do this graph with
RDD; provides sanity check and visual
illustration of variation driving estimate

55
How to do RDD graphical analysis [P1]

n  First, divide x into bins, making sure no


bin contains x’ as an interior point
q  E.g., if x ranges between 0 and 10 and
treatment occurs for x ≥ x’ = 5, you could
construct 10 bins, [0,1), [1,2),…[9,10]
q  Or, if x’ = 4.5, could use something like
[0,0.5), [0.5,1.5), [1.5, 2.5), etc.

56
How to do RDD graphical analysis [P2]

n  Second, calculate average y in each bin,


and plot this above midpoint for each bin
q  Plotted averages represent a non-parametric
estimate of E[y|x]

n  Third, estimate your RDD and plot


predicted values of y from the estimation

57
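n  A minimal Stata sketch of the binned plot (hypothetical: x runs from 0 to 10 with a cutoff at x' = 5, as in the example above; yhat comes from the RDD regression estimated just before this step):

  * variable names, the cutoff (5), and the bin width (0.5) are hypothetical
  egen bin  = cut(x), at(0(0.5)10.5)      // bin edges line up with the cutoff x' = 5
  egen ybar = mean(y), by(bin)            // non-parametric estimate of E[y|x] per bin
  egen xmid = mean(x), by(bin)            // plotting position for each bin
  predict yhat                            // fitted values from the RDD regression just estimated
  twoway (scatter ybar xmid) (line yhat x if x < 5, sort) (line yhat x if x >= 5, sort)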
Example of supportive graph
Each dot is average y for corresponding bin

Solid line is predicted


values of y from
RDD regression

Fifth-order polynomial
was needed to fit the
non-parametric plot

Discontinuity is
apparent in both
estimation and non-
parametric plot

Figure is from Roberts and Whited (2010)

58
Example of non-supportive graph

Dash lines would have


been predicted values
from linear RDD [i.e.
polynomial of order 1]

But, looking at non-


parametric graph
would make clear
that a cubic version
(which is plotted as
solid line) would
show no effect!
Figure is from Roberts and Whited (2010)

59
RDD Graphs – Miscellaneous Issues
n  Non-parametric plot shouldn’t suggest jump in
y at other points besides x’ [Why?]
q  Answer = Calls into question internal validity of
RDD; possible that jump at x’ is driven by
something else that is unrelated to treatment

60
Bin Width in RDD graphs
n  What is optimal # of bins (i.e. bin width)?
What is the tradeoff with smaller bins?
q  Answer = Choice of bin width is subjective because
of tradeoff between precision and bias
n  By including more data points in each average, wider bins
give us more precise estimate of E[y|x] in that region of x
n  But, wider bins might be biased if E[y|x] is not constant
(i.e. has non-zero slope) within each of the wide bins

61
Test of overly wide graph bins
1. Construct indicator for each bin
2. Regress y on these indicators and their
interaction with forcing variable, x
3. Do joint F-test of interaction terms
q  If fails, that suggests there is a slope in
some of the bins… i.e. bins are too wide
q  See Lee and Lemieux (JEL 2010) for
more details and another test

62
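n  One way to implement this check in Stata (a sketch under the assumption that bin is the bin variable built for the graph and x the forcing variable; names are hypothetical):

  * variable names are hypothetical placeholders
  egen bin_id = group(bin)                       // integer id for each bin
  * bin indicators plus a separate slope in x for every bin (no base level, no constant)
  regress y ibn.bin_id ibn.bin_id#c.x, noconstant vce(robust)
  * joint F-test that the within-bin slopes are all zero; rejection suggests bins are too wide
  testparm ibn.bin_id#c.x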
Regression Discontinuity – Outline
n  Basic idea of regression discontinuity
n  Sharp versus fuzzy discontinuities
n  Estimating regression discontinuity
q  Sharp regression discontinuity
q  Graphical analysis
q  Fuzzy regression discontinuity

n  Checks on internal validity


n  Heterogeneous effects & external validity

63
Intuition for Fuzzy RDD
n  As noted earlier, comparison of average y
immediately above and below threshold (as
done in Sharp RDD) won’t work
q  Again, not all observations above threshold are
treated and not all below are untreated; x > x’
just increases probability of treatment…

n  So, what can we do?


q  Answer = use x ≥ x’ as IV for treatment!!!

64
Fuzzy RDD Notation
n  Need to relabel a few variables
q  di = 1 if treated by event of interest; 0 otherwise
q  And, define new threshold indicator, Ti
T = T(x) = 1 if x ≥ x', and 0 otherwise
n  E.g. di = 1 if loan is securitized, Ti = 1 if
FICO score is greater than 620, which
increases probability loan is securitized

65
Estimating Fuzzy RDD [Part 1]
n  Estimate the below 2SLS model
y_i = α + β·d_i + f(x_i − x') + u_i

q  Where you use Ti as IV for di


q  What are necessary assumptions of IV?
n  Answer = Ti affects probability of di = 1
[relevance condition] but is unrelated to y conditional
on di and controls f( ) [exclusion condition]
n  These will be satisfied under earlier assumptions!

66
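n  A minimal Stata sketch of the fuzzy RDD (hypothetical names: d is the actual treatment indicator, x the forcing variable, and 620 a hypothetical cutoff):

  * variable names and the cutoff value (620) are hypothetical placeholders
  gen xc  = x - 620                 // center the forcing variable at the threshold
  gen xc2 = xc^2
  gen byte T = (xc >= 0)            // threshold indicator used as the instrument
  * 2SLS: instrument treatment d with T, controlling for a quadratic in xc
  ivregress 2sls y xc xc2 (d = T), vce(robust)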
Estimating Fuzzy RDD [Part 2]
n  Again, f( ) is typically a polynomial function
n  Unlike sharp RDD, it isn’t as easy to allow
functional form to vary above & below
q  So, if worried about different functional forms,
what can you do to mitigate this concern?
q  Answer = Use a tighter window around event;
this is less sensitive to functional form, f(x)

67
Fuzzy RDD – Practical Issues

n  Exactly same practical issues arise


q  Correct polynomial order is unknown
q  Can also use small bandwidth (rather than all
the data) with lower order polynomial order

n  In general, show robustness to different


specifications and show graphs!

68
Fuzzy RDD Graphs
n  Do same graph of y on x as with sharp RDD
q  Again, should see discontinuity in y at x’
q  Should get sense that polynomial fit is good

n  In fuzzy RDD, should also plot similar graph


for treatment dummy, d, on x [Why?]
q  Answer = Helps make sure there is discontinuity
of treatment probability at the threshold value

69
Regression Discontinuity – Outline
n  Basic idea of regression discontinuity
n  Sharp versus fuzzy discontinuities
n  Estimating regression discontinuity
n  Checks on internal validity
n  Heterogeneous effects & external validity

70
Robustness Tests for Internal Validity

n  Already discussed a few…


q  Show graphical analysis [picture is helpful!]
q  Make sure finding robust to chosen polynomial
q  Make sure finding robust to chosen bandwidth

n  Here are some others worth checking…

71
Additional check #1 – No manipulation

n  Researcher should ask the following…


q  Is there any reason to believe threshold x’ was
chosen because of some pre-existing discontinuity in
y or lack of comparability above and below x’ ?
n  If so… a clear violation of local continuity assumption

q  Is there any way or reason why subjects might


manipulate their x around threshold?
[Why ask this?]

72
Why manipulation can be problematic…
n  Answer = Again, subjects’ ability to
manipulate x can cause violation of local
continuity assumption
q  I.e. with manipulation, y might exhibit jump around
x’ absent treatment because of manipulation
n  E.g. in Keys, et al. (QJE 2010) default rate of loans at
FICO = 620 might jump regardless if weak borrowers
manipulate their FICO to get the lower interest rates that
one gets immediately with FICO above 620

73
And, why it isn’t always a problem
n  Why isn’t subjects’ ability to
manipulate x always a problem?
q  Answer = If they can’t perfectly manipulate it,
then there will still be randomness in treatment
n  I.e. in small enough bandwidth around x’, there will
still be randomness because idiosyncratic shocks will
push some above and some below threshold even if
they are trying to manipulate the x

74
An informal test for manipulation
n  Look for bunching of observations
immediately above or below threshold
q  Any bunching would suggest manipulation
q  But, why is this not a perfect test?
n  Answer = It assumes manipulation is monotonic;
i.e. all subjects either try to get above or below x’.
This need not be true in all scenarios

75
Additional check #2 – Balance tests
n  RDD assumes observations near but on
opposite sides of cutoff are comparable…
so, check this!
q  I.e. using graphical analysis or RDD, make
sure other observable factors that might
affect y don’t exhibit jump at threshold x’
q  Why doesn’t this test prove validity of RDD?
n  Answer: There could be discontinuity in unobservables!
Again, there is no way to prove causality

76
Using covariates instead…
n  You could also just add these other variables
that might affect y as controls
q  If RDD is internally valid, will these additional
controls effect estimate, and if so, how?
q  Answer: Similar to NE, they should only affect
precision of estimate. If they affect the estimated
treatment effect, you’ve got bigger problems; Why?
n  You might have ‘bad controls’
n  Or, observations around threshold aren’t comparable L

77
Additional check #3 – Falsification Tests
n  If threshold x’ only existed in certain years
or for certain types of observations…
q  E.g. law that created discontinuity was passed in
a given year, but didn’t exist before that, or
maybe the law didn’t apply to some firms

n  Then, what is a good falsification test?


q  Answer = Make sure no effect in years where
there was no discontinuity or for firms where
there isn’t supposed to be an effect!

78
Regression Discontinuity – Outline
n  Basic idea of regression discontinuity
n  Sharp versus fuzzy discontinuities
n  Estimating regression discontinuity
n  Checks on internal validity
n  Heterogeneous effects & external validity

79
Heterogeneous effects (HE)
n  If think treatment might differentially affect
observations based on their x, then need a
few additional assumptions for RDD to
identify the local average treatment effect
1. Effect of treatment is locally continuous at x’
2. Likelihood of treatment is always weakly greater above threshold value x’
3. Effect of treatment and whether observation is treated is independent of x near x’
[Note: The latter two assumptions only apply to fuzzy RDD]

80
HE assumption #1
n  Assumption that treatment effect is locally
continuous at x’ is typically not problem
q  It basically just says that there isn’t any jump in
treatment’s effect at x’; i.e. just again assuming
observations on either side of x’ are comparable
n  Note: This might violated if x’ was chosen because
effect of treatment was thought to be higher for x>x’
[E.g. law and/or regulation that creates discontinuity created
threshold at that point because effect was known to be biggest there]

81
HE assumption #2
n  Monotonic effect on likelihood of treatment
usually not a problem either
q  Just says that having x > x’ doesn’t make some
observations less likely to be treated and others
more likely to be treated
q  This is typically the case, but make sure that it
makes sense in your setting as well

82
HE assumption #3
n  Basically is saying ‘no manipulation’
q  In practice, it means that observations where
treatment effect is going to be larger aren’t
manipulating x to be above the threshold or that
likelihood of treatment for individual
observation depends on some variable that is
correlated with magnitude of treatment effect

83
HE affects interpretation of estimate
n  Key with heterogeneity is that you’re only
estimating a local average treatment effect
q  Assuming above assumptions hold, estimate
only reveals effect of treatment around
threshold, and for Fuzzy RDD, it only reveals
effect on observations that change treatment
status because of discontinuity
q  This limits external validity… How?

84
External validity and RDD [Part 1]
n  Answer #1: Identification relies on
observations close to the cutoff threshold
q  Effect of treatment might be different for
observations further away from this threshold
q  I.e. don’t make broad statements about how
the effect would hold for observations further
from the threshold value of x

85
External validity and RDD [Part 2]
n  Answer #2: In fuzzy RDD, treatment is
estimated using only “compliers”
q  I.e. we only pick up effect of those where
discontinuity is what pushes them into treatment
n  E.g. Suppose you study effect of PhD on wages using
GRE score > x’ with a fuzzy RDD. If discontinuity
only matters for students with mediocre GPA, then you
only estimate effect of PhD for those students
q  Same as with IV… be careful to not extrapolate
too much from the findings

86
Summary of Today [Part 1]

n  RDD is yet another way to identify causal


effect of some treatment on outcome y
q  Makes use of treatment assignment that isn’t
random, but where process follows some
known and arbitrary cutoff rule
q  Very common scenario in practice, and
estimator likely to be of increasing use

87
Summary of Today [Part 2]

n  Two types of RDD: “sharp” and “fuzzy”


q  Sharp RDD is when treatment is
deterministic and only depends on x
q  Fuzzy RDD is when treatment is stochastic and
probability of treatment has discontinuity at x’

n  Formal estimators are similar but different;


‘fuzzy’ RDD is really just an IV

88
Summary of Today [Part 3]

n  Many checks for internal validity; e.g.


q  Graphical analysis with non-parametric plots
q  Check whether observations around cutoff
appear to be comparable

n  If treatment effect is heterogeneous,


estimator’s interpretation is a LATE

89
In First Half of Next Class
n  Miscellaneous Issues
q  Common data problems
q  Industry-adjusting
q  High-dimensional FE

n  Related readings… see syllabus

90
Assign papers for next week…
n  Malenko and Shen (working paper 2015)
q  Role of proxy advisory firms

n  Keys, et al. (QJE 2010)


q  Securitization and screening of loans

n  Almeida, et al. (JFE Forthcoming)


q  Impact of share repurchases

91
Break Time
n  Let’s take our 10 minute break
n  We’ll do presentations when we get back

92
FNCE 926
Empirical Methods in CF
Lecture 9 – Common Limits & Errors

Professor Todd Gormley


Announcements
n  Rough draft of research proposal due
q  Should have uploaded to Canvas
q  I’ll try to e-mail feedback by next week

2
Background readings for today
n  Ali, Klasa, and Yeung (RFS 2009)
n  Gormley and Matsa (RFS 2014)

3
Outline for Today
n  Quick review of last lecture on RDD
n  Discuss common limitations and errors
q  Our data isn’t perfect
q  Hypothesis testing mistakes
q  How not to control for unobserved heterogeneity

n  Student presentations of “RDD” papers

4
Quick Review [Part 1]

n  What is difference between “sharp” and


“fuzzy” regression discontinuity design (RDD)?
q  Answer = With “sharp”, change in treatment status
only depends on x and cutoff x’; with “fuzzy”, only
probability of treatment varies at cutoff

5
Quick Review [Part 2]
n  Estimation of sharp RDD
y_i = α + β d_i + f(x_i − x') + d_i × g(x_i − x') + u_i

q  di = indicator for x ≥ x’


q  f( ) and g( ) are polynomial functions that
control for effect of x on y
q  Can also do analysis in tighter window
around threshold value x’

6
Quick Review [Part 3]
n  Estimation of fuzzy RDD is similar…
y_i = α + β d_i + f(x_i − x') + u_i

q  But use T_i as an instrument for d_i, where
q  T_i is indicator for x ≥ x’
q  And, d_i is indicator for treatment

7
Quick Review [Part 4]
n  What are some standard internal validity tests
you might want to run with RDD?
q  Answers:
n  Check robustness to different polynomial orders
n  Check robustness to bandwidth
n  Graphical analysis to show discontinuity in y
n  Compare other characteristics of firms around cutoff
threshold to make sure no other discontinuities
n  And more…

8
Quick Review [Part 5]
n  If effect of treatment is heterogeneous,
how does this affect interpretation of RDD
estimates?
q  Answer = They take on local average
treatment effect interpretation, and fuzzy RDD
captures only effect of compliers. Neither is
problem for internal validity, but can
sometimes limit external validity of finding

9
Common Limitations & Errors – Outline
n  Data limitations
n  Hypothesis testing mistakes
n  How to control for unobserved heterogeneity

10
Data limitations
n  The data we use is almost never perfect
q  Variables are often reported with error
q  Exit and entry into dataset typically not random
q  Datasets only cover certain types of firms

11
Measurement error – Examples
n  Variables are often reported with error
q  Sometimes it is just innocent noise
n  E.g. Survey respondents self report past income
with error [because memory isn’t perfect]

q  Sometimes, it is more systematic


n  E.g. survey might ask teenagers # of times
smoked marijuana, but teenagers that have
smoked and have high GPA might say zero

n  How will these errors affect analysis?

12
Measurement error – Why it matters
n  Answer = Depends; but in general, hard
to know exactly how this will matter
q  If y is mismeasured…
n  If only random noise, just makes SEs larger
n  But if systematic in some way [as in second example],
can cause bias if error is correlated with x’s
q  If x is mismeasured…
n  Even simple CEV causes attenuation bias on
mismeasured x and biases on all other variables

13
Measurement error – Solution
n  Standard measurement error solutions
apply [see “Causality” lecture]
q  Though admittedly, measurement error is
difficult to deal with unless know exactly
source and nature of the error

14
Survivorship Issues – Examples
n  In other cases, observations are included
or missing for systematic reasons; e.g.
q  Ex. #1 – Firms that do an IPO and are
added to datasets that cover public firms may
be different than firms that do not do an IPO
q  Ex. #2 – Firms adversely affected by some
event might subsequently drop out of data
because of distress or outright bankruptcy
n  How can these issues be problematic?

15
Survivorship Issues – Why it matters
n  Answer = There is a selection bias,
which can lead to incorrect inferences
q  Ex. #1 – E.g. going public may not cause
high growth; it’s just that the firms going
public were going to grow faster anyway
q  Ex. #2 – Might not find adverse effect of
event (or might understate its effect) if some
affected firms go bankrupt and are dropped

16
Survivorship Issues – Solution
n  Again, no easy solutions; but, if worried
about survivorship bias…
q  Check whether treatment (in diff-in-diff) is
associated with observations being more or
less likely to drop from data
q  In other analysis, check whether covariates
of observations that drop are systematically
different in a way that might be important

17
Sample is limited – Examples

n  Observations in commonly used datasets


are often limited to certain firms
q  Ex. #1 – Compustat covers largest public firms
q  Ex. #2 – Execucomp only provides incentives
on CEOs of firms listed on S&P 1500

n  How this might affect our analysis?

18
Sample is limited – Why it matters

n  Answer = Need to be careful when


making claims about external validity
q  Ex. #1 – Might find no effect of treatment in
Compustat because treatment effect is greatest
for unobserved, smaller, private firms
q  Ex. #2 – Observed correlations between
incentives and risk-taking in Execucomp
might not hold for smaller firms

19
Sample is limited – Solution

n  Be careful with inferences to avoid


making claims that lack external validity
n  Argue that your sample is representative
of economically important group
n  Hand-collect your own data if theory
your interested in testing requires it!
q  This can actually make for a great paper and
is becoming increasingly important in finance

20
Interesting Example of Data Problem

n  Ali, Klasa, and Yeung (RFS 2009) provide


interesting example of data problem
q  They note the following…
n  Many theories argue that “industry concentration”
is important factor in many finance settings
n  But researchers measure industry concentration
(i.e. herfindahl index) using Compustat

q  How might this be problematic?

21
Ali, et al. – Example data problem [P1]

n  Answer = Systematic measurement error!


q  Compustat doesn’t include private firms; so
using it causes you to mismeasure concentration
q  Ali, et al. find evidence of this by calculating
concentration using U.S. Census data
n  Correlation between measures is just 13%
n  Moreover, error in Compustat measure is
systematically related to some key variables, like
turnover of firms in the industry

22
Ali, et al. – Example data problem [P2]
n  Ali, et al. (RFS 2009) found it mattered;
using Census measure overturns four
previously published results
q  E.g. Concentration is positively related to R&D,
not negatively related as previously argued
q  See paper for more details...

23
Common Limitations & Errors – Outline
n  Data limitations
n  Hypothesis testing mistakes
n  How to control for unobserved heterogeneity

24
Hypothesis testing mistakes
n  As noted in lecture on natural experiments,
triple-difference can be done by running
double-diff in two separate subsamples
q  E.g. estimate effect of treatment on small firms;
then estimate effect of treatment on large firms

25
Example inference from such analysis
                         Sample =
                    Small      Large      Low D/E    High D/E
                    Firms      Firms      Firms      Firms
Treatment * Post    0.031      0.104**    0.056      0.081***
                    (0.121)    (0.051)    (0.045)    (0.032)
N                   2,334      3,098      2,989      2,876
R-squared           0.11       0.15       0.08       0.21
Firm dummies        X          X          X          X
Year dummies        X          X          X          X

q  From above results, researcher often concludes…
n  “Treatment effect is larger for bigger firms”
n  “High D/E firms respond more to treatment”
[Do you see any problem with either claim?]

26
Be careful making such claims!

n  Answer = Yes! The difference across subsamples


may not actually be statistically significant!
q  Hard to know if different just eyeballing it because
whether difference is significant depends on
covariance of the two separate estimates

n  How can you properly test these claims?

27
Example triple interaction result

                             Sample = All Firms
Treatment * Post                 0.031
                                 (0.121)
Treatment * Post * Large         0.073
                                 (0.065)
N                                5,432
R-squared                        0.12
Firm dummies                     X
Year dummies                     X
Year * Large dummies             X

[The difference is not actually statistically significant.
Remember to interact the year dummies with the triple difference;
otherwise, the estimates won’t match the earlier subsamples.]
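[Sketch: how the triple difference and the test of the cross-subsample claim might be coded in Stata; the variable names (y, treat, post, large, firm_id, year) are hypothetical.]
   * Firm FE absorbed; year and year-by-large dummies included directly
   areg y i.treat##i.post##i.large i.year i.year#i.large, absorb(firm_id) vce(cluster firm_id)
   * The p-value on the triple interaction (treat#post#large) is the actual test
   * of whether the treatment effect differs for large firms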

28
Practical Advice
n  Don’t make claims you haven’t tested;
they could easily be wrong!
q  Best to show relevant p-values in text or tables
for any statistical significance claim you make
q  If difference isn’t statistically significant
[e.g. p-value = 0.15], can just say so; triple-diffs
are noisy, so this isn’t uncommon
q  Or, be more careful in your wording…
n  I.e. you could instead say, “we found an effect for large
firms, but didn’t find much evidence for small firms”

29
Common Limitations & Errors – Outline
n  Data limitations
n  Hypothesis testing mistakes
n  How to control for unobserved heterogeneity
q  How not to control for it
q  General implications
q  Estimating high-dimensional FE models

30
Unobserved Heterogeneity – Motivation
n  Controlling for unobserved heterogeneity is a
fundamental challenge in empirical finance
q  Unobservable factors affect corporate policies and prices
q  These factors may be correlated with variables of interest

n  Important sources of unobserved heterogeneity are


often common across groups of observations
q  Demand shocks across firms in an industry,
differences in local economic environments, etc.

31
Many different strategies are used
n  As we saw earlier, FE can control for unobserved
heterogeneities and provide consistent estimates
n  But, there are other strategies also used to control
for unobserved group-level heterogeneity…
q  “Adjusted-Y” (AdjY) – dependent variable is
demeaned within groups [e.g. ‘industry-adjust’]
q  “Average effects” (AvgE) – uses group mean of
dependent variable as control [e.g. ‘state-year’ control]

32
AdjY and AvgE are widely used
n  In JF, JFE, and RFS…
q  Used since at least the late 1980s
q  Still used, 60+ papers published in 2008-2010
q  Variety of subfields; asset pricing, banking,
capital structure, governance, M&A, etc.

n  Also been used in papers published in


the AER, JPE, and QJE and top
accounting journals, JAR, JAE, and TAR

33
But, AdjY and AvgE are inconsistent

n  As Gormley and Matsa (2014) shows…


q  Both can be more biased than OLS
q  Both can get opposite sign as true coefficient
q  In practice, bias is likely and trying to predict its
sign or magnitude will typically be impractical

n  Now, let’s see why they are wrong…

34
The underlying model [Part 1]
n  Recall model with unobserved heterogeneity
y_ij = β X_ij + f_i + ε_ij

q  i indexes groups of observations (e.g. industry);


j indexes observations within each group (e.g. firm)
n  yi,j = dependent variable
n  Xi,j = independent variable of interest
n  fi = unobserved group heterogeneity
n  ε i , j = error term

35
The underlying model [Part 2]

n  Make the standard assumptions:


N groups, J observations per group,
where J is small and N is large
X and ε are i.i.d. across groups, but
not necessarily i.i.d. within groups
var( f ) = σ 2f , µ f = 0
var( X ) = σ X2 , µ X = 0
Simplifies some expressions,
var(ε ) = σ ε , µε = 0
2
but doesn’t change any results

36
The underlying model [Part 3]

n  Finally, the following assumptions are made:


cov( f i , ε i , j ) = 0
co v( X i , j , ε i , j ) = co v( X i , j , ε i , − j ) = 0
cov( X i , j , f i ) = σ Xf ≠ 0
What do these imply?
Answer = Model is correct in that
if we can control for f, we’ll
properly identify effect of X; but
if we don’t control for f there will
be omitted variable bias

37
We already know that OLS is biased
True model is:       y_ij = β X_ij + f_i + ε_ij
But OLS estimates:   y_ij = β^OLS X_ij + u^OLS_ij

q  By failing to control for the group effect, f_i, OLS
suffers from standard omitted variable bias:

β̂^OLS = β + σ_Xf / σ_X²

Alternative estimation strategies are required…

38
Adjusted-Y (AdjY)
n  Tries to remove unobserved group heterogeneity by
demeaning the dependent variable within groups

AdjY estimates:   y_ij − ȳ_i = β^AdjY X_ij + u^AdjY_ij

where  ȳ_i = (1/J) Σ_{k ∈ group i} (β X_ik + f_i + ε_ik)

Note: Researchers often exclude the observation at hand when
calculating the group mean or use a group median, but both
modifications will yield similarly inconsistent estimates

39
Example AdjY estimation

n  One example – firm value regression:


Qi , j ,t − Qi ,t = α + β ' Xi, j ,t + ε i , j ,t
Qi , j ,t
q  = Tobin’s Q for firm j, industry i, year t
q  Qi ,t = mean of Tobin’s Q for industry i in year t
q  Xijt = vector of variables thought to affect value
q  Researchers might also include firm & year FE

Anyone know why AdjY is going to be inconsistent?

40
Here is why…
n  Rewriting the group mean, we have:
yi = fi + β X i + ε i ,

n  Therefore, AdjY transforms the true data to:


yi , j − yi = β X i , j − β X i + ε i , j − ε i

What is the AdjY estimation forgetting?

41
AdjY can have omitted variable bias
n  β̂ adjY can be inconsistent when β ≠ 0
True model: yi , j − yi = β X i , j − β X i + ε i , j − ε i
y
But, AdjY estimates: i , j − yi = β AdjY
X i, j + u AdjY
i, j

q  By failing to control for X i , AdjY suffers


from omitted variable bias when σ XX ≠ 0
σ XX In practice, a positive
βˆ AdjY = β − β
σ X2 covariance between X
and X will be common;
e.g. industry shocks

42
Now, add a second variable, Z

n  Suppose, there are instead two RHS variables


True model: yi , j = β X i , j + γ Zi , j + fi + ε i , j

n  Use same assumptions as before, but add:


cov( Z i , j , ε i , j ) = cov( Z i , j , ε i , − j ) = 0
var( Z ) = σ Z2 , µ Z = 0
cov( X i , j , Z i , j ) = σ XZ
cov( Z i , j , f i ) = σ Zf

43
AdjY estimates with 2 variables
n  With a bit of algebra, it is shown that:

⎡ β (σ XZ σ ZX − σ Z2σ XX ) + γ (σ XZ σ ZZ − σ Z2σ XZ ) ⎤
⎢β + ⎥
⎡ βˆ ⎤ ⎢
AdjY
σ Z σ X − σ XZ
2 2 2

⎢ AdjY ⎥ = ⎢ ⎥
⎢⎣γˆ ⎥⎦ ⎢ β (σ σ
XZ XX − σ X ZX )
2
σ + γ (σ σ
XZ XZ − σ X ZZ ) ⎥
2
σ
⎢γ + σ σ
2 2
− σ 2 ⎥
⎣ Z X XZ ⎦

Estimates of both Determining sign and


β and γ can be magnitude of bias will
inconsistent typically be difficult

44
Average Effects (AvgE)

n  AvgE uses group mean of dependent variable


as control for unobserved heterogeneity
AvgE estimates:  y_ij = β^AvgE X_ij + γ^AvgE ȳ_i + u^AvgE_ij

45
Average Effects (AvgE)

n  Following profit regression is an AvgE example:


ROAi ,s,t = α + β ' Xi,s,t + γ ROAs,t + ε i ,s,t

q  ROAs,t = mean of ROA for state s in year t


q  Xist = vector of variables thought to profits
q  Researchers might also include firm & year FE

Anyone know why AvgE is going to be inconsistent?

46
AvgE has measurement error bias

n  AvgE uses group mean of dependent variable


as control for unobserved heterogeneity
AvgE estimates: yi , j = β AvgE X i , j + γ AvgE yi + uiAvgE
,j

Recall, true model: yi , j = β X i , j + fi + ε i , j

Problem is that yi measures fi with error

47
AvgE has measurement error bias
n  Recall that group mean is given by yi = fi + β X i + ε i ,
q  Therefore, yi measures fi with error − β X i − ε i
q  As is well known, even classical measurement error
causes all estimated coefficients to be inconsistent

n  Bias here is complicated because error can be


correlated with both mismeasured variable, fi ,
and with Xi,j when σ XX ≠ 0

48
AvgE estimate of β with one variable

n  With a bit of algebra, it is shown that:

βˆ AvgE = β +
( ) (
σ Xf βσ fX + β 2σ X2 + σ ε2 − σ εε − βσ XX σ 2f + βσ fX + σ εε )
( )
σ X2 σ 2f + 2βσ fX + β 2σ X2 + σ ε2 − (σ Xf + βσ XX )
2

Determining
magnitude and Covariance between X and X
again problematic, but not Even non-i.i.d.
direction of bias
needed for AvgE estimate to nature of errors
is difficult
be inconsistent can affect bias!

49
Comparing OLS, AdjY, and AvgE
n  Can use analytical solutions to compare
relative performance of OLS, AdjY, and AvgE
n  To do this, we re-express solutions…
q  We use correlations (e.g. solve bias in terms of
correlation between X and f, ρ Xf , instead of σ Xf )
q  We also assume i.i.d. errors [just makes bias of
AvgE less complicated]
q  And, we exclude the observation-at-hand when
calculating the group mean, X i , …

50
Why excluding Xi doesn’t help
n  Quite common for researchers to exclude
observation at hand when calculating group mean
q  It does remove mechanical correlation between X and
omitted variable, X i , but it does not eliminate the bias
q  In general, correlation between X and omitted variable, X i ,
is non-zero whenever X i is not the same for every group i
q  This variation in means across group is almost
assuredly true in practice; see paper for details

51
ρXf has large effect on performance
[Figure: Estimated β̂ (true β = 1) plotted against ρ_Xf for OLS, AdjY, and AvgE,
holding other parameters constant (σ_f/σ_X = σ_ε/σ_X = 1, J = 10, ρ_{X_i,X_−i} = 0.5).
AdjY is more biased than OLS, except for large values of ρ_Xf;
AvgE gives the wrong sign for low values of ρ_Xf.]
52
More observations need not help!
[Figure: Estimated β̂ plotted against group size J for OLS, AdjY, and AvgE,
holding other parameters constant (σ_f/σ_X = σ_ε/σ_X = 1, ρ_{X_i,X_−i} = 0.5, ρ_Xf = 0.25).
The biases do not disappear as J grows.]
53
Summary of OLS, AdjY, and AvgE

n  In general, all three estimators are inconsistent


in presence of unobserved group heterogeneity
n  AdjY and AvgE may not be an improvement
over OLS; depends on various parameter values

n  AdjY and AvgE can yield estimates with


opposite sign of the true coefficient

54
Fixed effects (FE) estimation
n  Recall: FE adds dummies for each group to OLS
estimation and is consistent because it directly
controls for unobserved group-level heterogeneity
n  Can also do FE by demeaning all variables with respect
to group [i.e. do ‘within transformation’] and use OLS

y
FE estimates: i , j − yi = β FE
( i, j i ) i, j
X − X + u FE

True model: yi , j − yi = β ( X i , j − X i ) + (ε i , j − ε i )

55
Comparing FE to AdjY and AvgE
n  To estimate effect of X on Y controlling for Z
q  One could regress Y onto both X and Z… Add group FE
q  Or, regress residuals from regression of Y on Z
onto residuals from regression of X on Z Within-group
transformation!
§  AdjY and AvgE aren’t the same as finding the
effect of X on Y controlling for Z because...
§  AdjY only partials Z out from Y
§  AvgE uses fitted values of Y on Z as control

56
The differences will matter! Example #1
n  Consider the following capital structure regression:
( D / A)i ,t = α + βXi,t + fi + ε i ,t
q  (D/A)it = book leverage for firm i, year t
q  Xit = vector of variables thought to affect leverage
q  fi = firm fixed effect

n  We now run this regression for each approach to


deal with firm fixed effects, using 1950-2010 data,
winsorizing at 1% tails…
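[Sketch: how each approach might be coded in Stata for this regression; the variable names (lev, tang, lnsale, roa, zscore, mtb, firm_id, year) are hypothetical stand-ins for the variables above.]
   * OLS: ignore the firm effect
   regress lev tang lnsale roa zscore mtb, vce(cluster firm_id)

   * AdjY: demean only the dependent variable within firm (inconsistent)
   egen lev_bar = mean(lev), by(firm_id)
   gen  lev_adj = lev - lev_bar
   regress lev_adj tang lnsale roa zscore mtb, vce(cluster firm_id)

   * AvgE: add the firm mean of y as a control (inconsistent)
   regress lev lev_bar tang lnsale roa zscore mtb, vce(cluster firm_id)

   * FE: within transformation of y AND the X's (consistent)
   xtset firm_id year
   xtreg lev tang lnsale roa zscore mtb, fe vce(cluster firm_id)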

57
Estimates vary considerably
Dependent variable = book leverage
                               OLS         AdjY        AvgE        FE
Fixed Assets/Total Assets      0.270***    0.066***    0.103***    0.248***
                               (0.008)     (0.004)     (0.004)     (0.014)
Ln(sales)                      0.011***    0.011***    0.011***    0.017***
                               (0.001)     (0.000)     (0.000)     (0.001)
Return on Assets              -0.015***    0.051***    0.039***   -0.028***
                               (0.005)     (0.004)     (0.004)     (0.005)
Z-score                       -0.017***   -0.010***   -0.011***   -0.017***
                               (0.000)     (0.000)     (0.000)     (0.001)
Market-to-book Ratio          -0.006***   -0.004***   -0.004***   -0.003***
                               (0.000)     (0.000)     (0.000)     (0.000)
Observations                   166,974     166,974     166,974     166,974
R-squared                      0.29        0.14        0.56        0.66

58
The differences will matter! Example #2

n  Consider the following firm value regression:


Qi , j ,t = α + β ' Xi, j ,t + f j ,t + ε i , j ,t

q  Q = Tobin’s Q for firm i, industry j, year t


q  Xijt = vector of variables thought to affect value
q  fj,t = industry-year fixed effect

n  We now run this regression for each approach


to deal with industry-year fixed effects…

59
Estimates vary considerably
Dependent Variable = Tobin's Q
                           OLS         AdjY        AvgE        FE
Delaware Incorporation     0.100***    0.019       0.040       0.086**
                           (0.036)     (0.032)     (0.032)     (0.039)
Ln(sales)                 -0.125***   -0.054***   -0.072***   -0.131***
                           (0.009)     (0.008)     (0.008)     (0.011)
R&D Expenses / Assets      6.724***    3.022***    3.968***    5.541***
                           (0.260)     (0.242)     (0.256)     (0.318)
Return on Assets          -0.559***   -0.526***   -0.535***   -0.436***
                           (0.108)     (0.095)     (0.097)     (0.117)
Observations               55,792      55,792      55,792      55,792
R-squared                  0.22        0.08        0.34        0.37

60
Common Limitations & Errors – Outline
n  Data limitations
n  Hypothesis testing mistakes
n  How to control for unobserved heterogeneity
q  How not to control for it
q  General implications
q  Estimating high-dimensional FE models

61
General implications

n  With this framework, easy to see that other


commonly used estimators will be biased

62
Other AdjY estimators are problematic
n  Same problem arises with other AdjY estimators
q  Subtracting off median or value-weighted mean
q  Subtracting off mean of matched control sample
[as is customary in studies of the diversification “discount”]
q  Comparing “adjusted” outcomes for treated firms pre-
versus post-event [as often done in M&A studies]
q  Characteristically adjusted returns [as used in asset pricing]

63
AdjY-type estimators in asset pricing
n  Common to sort and compare stock returns across
portfolios based on a variable thought to affect returns
n  But, returns are often first “characteristically adjusted”
q  I.e. researcher subtracts the average return of a benchmark
portfolio containing stocks of similar characteristics
q  This is equivalent to AdjY, where “adjusted returns” are
regressed onto indicators for each portfolio

n  Approach fails to control for how avg. independent


variable varies across benchmark portfolios

64
Asset Pricing AdjY – Example
n  Asset pricing example; sorting returns based
on R&D expenses / market value of equity
Characteristically adjusted returns by R&D quintile (i.e., AdjY)
             Missing      Q1           Q2           Q3         Q4        Q5
             -0.012***    -0.033***    -0.023***    -0.002     0.008     0.020***
             (0.003)      (0.009)      (0.008)      (0.007)    (0.013)   (0.006)

[We use industry-size benchmark portfolios and sort on R&D/market value;
the difference between Q5 and Q1 is 5.3 percentage points]

65
Estimates vary considerably
Dependent Variable = Yearly Stock Return
                      AdjY         FE
R&D Missing           0.021**      0.030***
                      (0.009)      (0.010)
R&D Quintile 2        0.01         0.019
                      (0.013)      (0.014)
R&D Quintile 3        0.032***     0.051***
                      (0.012)      (0.018)
R&D Quintile 4        0.041***     0.068***
                      (0.015)      (0.020)
R&D Quintile 5        0.053***     0.094***
                      (0.011)      (0.019)
Observations          144,592      144,592
R²                    0.00         0.47

[Same AdjY result, but in regression format; quintile 1 is excluded.
The FE column uses benchmark-period FE to transform both returns and
R&D; this is equivalent to a double sort]
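[Sketch: the two columns above might be produced roughly as follows in Stata; the variable names (ret, adjret, rd_q, bench_port, period, firm_id) are hypothetical.]
   * AdjY: characteristically adjusted returns regressed on the sort indicators
   regress adjret i.rd_q, vce(cluster firm_id)

   * FE: raw returns with benchmark portfolio-by-period FE (equivalent to a double sort)
   reghdfe ret i.rd_q, absorb(bench_port#period) vce(cluster firm_id)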

66
What if AdjY or AvgE is true model?
§  If data exhibited structure of AvgE estimator,
this would be a peer effects model
[i.e. group mean affects outcome of other members]
§  In this case, none of the estimators (OLS, AdjY,
AvgE, or FE) reveal the true β [Manski 1993;
Leary and Roberts 2010]

§  Even if interested in studying y_ij − ȳ_i, AdjY is
only consistent if X_ij does not affect y_ij

67
Common Limitations & Errors – Outline
n  Data limitations
n  Hypothesis testing mistakes
n  How to control for unobserved heterogeneity
q  How not to control for it
q  General implications
q  Estimating high-dimensional FE models

68
Multiple high-dimensional FE

n  Researchers occasionally motivate using


AdjY and AvgE because FE estimator is
computationally difficult to do when there
are more than one FE of high-dimension

Now, let’s see why this is


(and isn’t) a problem…

69
LSDV is usually needed with two FE

n  Consider the below model with two FE


Two separate
yi , j ,k = β X i , j ,k + fi + δ k + ε i , j ,k group effects

q  Unless panel is balanced, within transformation can


only be used to remove one of the fixed effects
q  For other FE, you need to add dummy variables
[e.g. add time dummies and demean within firm]

70
Why such models can be problematic

n  Estimating FE model with many dummies


can require a lot of computer memory
q  E.g., estimation with both firm and 4-digit
industry-year FE requires ≈ 40 GB of memory

71
This is growing problem

n  Multiple unobserved heterogeneities


increasingly argued to be important
q  Manager and firm fixed effects in executive
compensation and other CF applications
[Graham, Li, and Qiu 2011; Coles and Li 2011]
q  Firm and industry×year FE to control for
industry-level shocks [Matsa 2010]

72
But, there are solutions!

n  There exist two techniques that can be


used to arrive at consistent FE estimates
without requiring as much memory
#1 – Interacted fixed effects
#2 – Memory saving procedures

73
#1 – Interacted fixed effects
n  Combine multiple fixed effects into one-
dimensional set of fixed effect, and remove
using within transformation
q  E.g. firm and industry-year FE could be
replaced with firm-industry-year FE

But, there are limitations…


q  Can severely limit parameters you can estimate
q  Could have serious attenuation bias

74
#2 – Memory-saving procedures
n  Use properties of sparse matrices to reduce
required memory, e.g. Cornelissen (2008)
n  Or, instead iterate to a solution, which
eliminates memory issue entirely, e.g.
Guimaraes and Portugal (2010)
q  See paper for details of how each works
q  Both can be done in Stata using user-written
commands FELSDVREG and REGHDFE
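[Sketch: the iterative approach via the user-written reghdfe command, re-using the hypothetical variable names from the earlier leverage example.]
   ssc install reghdfe
   reghdfe lev tang lnsale roa zscore mtb, absorb(firm_id industry#year) vce(cluster firm_id)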

75
These methods work…

n  Estimated typical capital structure


regression with firm and 4-digit
industry×year dummies
q  Standard FE approach would not work; my
computer did not have enough memory…
q  Sparse matrix procedure took 8 hours…
q  Iterative procedure took 5 minutes

76
Summary of Today [Part 1]

n  Our data isn’t perfect…


q  Watch for measurement error
q  Watch for survivorship bias
q  Be careful about external validity claims

n  Make sure to test that estimates across


subsamples are actually statistically different

77
Summary of Today [Part 2]

n  Don’t use AdjY or AvgE!


n  But, do use fixed effects
q  Should use benchmark portfolio-period FE in
asset pricing rather than char-adjusted returns
q  Use iteration techniques to estimate models with
multiple high-dimensional FE

78
In First Half of Next Class
n  Matching
q  What it does…
q  And, what it doesn’t do

n  Related readings… see syllabus

79
Assign papers for next week…
n  Gormley and Matsa (working paper, 2015)
q  Corporate governance & playing it safe preferences

n  Ljungqvist, Malloy, Marston (JF 2009) No comments


needed from
q  Data issues in I/B/E/S other groups

n  Bennedsen, et al. (working paper, 2012)


q  CEO hospitalization events

80
Break Time
n  Let’s take our 10 minute break
n  We’ll do presentations when we get back

81
FNCE 926
Empirical Methods in CF
Lecture 10 – Matching

Professor Todd Gormley


Announcements -- Research Proposal
q  You can find my detailed comments
about your rough draft on Canvas
q  Try to come see me before starting final
draft if have questions about comments
q  See six example proposals on Canvas
q  Final proposal due on May 3

2
Announcements – Exercise #4
n  Exercise #4 is due next week
q  Please upload to Canvas, Thanks!

3
Background readings for today
n  Roberts-Whited, Section 6
n  Angrist-Pischke, Sections 3.3.1-3.3.3
n  Wooldridge, Section 21.3.5

4
Outline for Today
n  Quick review of last lecture on “errors”
n  Discuss matching
q  What it does…
q  And what it doesn’t do

n  Discuss Heckman selection model


n  Student presentations of “Error” papers

5
Quick Review [Part 1]
n  What are 3 data limitations to keep in mind?
q  #1 – Measurement error; some variables may
be measured with error [e.g. industry concentration
using Compustat] leading to incorrect inferences
q  #2 – Survivorship bias; entry and exit of obs.
isn’t random and this can affect inference
q  #3 – External validity; our data often only
covers certain types of firms and need to keep
this in mind when making inferences

6
Quick Review [Part 2]
n  What is AdjY estimator, and why is it
inconsistent with unobserved heterogeneity?
q  Answer = AdjY demeans y with respect to
group; it is inconsistent because it fails to account
for how group mean of X’s affect adjusted-Y
n  E.g. “industry-adjust”
n  Diversification discount lit. has similar problem
n  Asset pricing has examples of this [What?]

7
Quick Review [Part 3]
n  Comparing characteristically-adjusted stock
returns across portfolios sorted on some
other X is example of AdjY in AP
q  What is proper way to control for unobserved
characteristic-linked risk factors?
q  Answer = Add benchmark portfolio-period FE
[See Gormley & Matsa (2014)]

8
Quick Review [Part 4]
n  What is AvgE estimator; why is it biased?
q  Answer = Uses group mean of y as control for
unobserved group-level heterogeneity; biased
because of measurement error problem

9
Quick Review [Part 5]
n  What are two ways to estimate model with
two, high-dimensional FE [e.g. firm and
industry-year FE]?
q  Answer #1: Create interacted FE and sweep it
away with usual within transformation
q  Answer #2: Use iterations to solve FE estimates

10
Matching – Outline
n  Introduction to matching
q  Comparison to OLS regression
q  Key limitations and uses
n  How to do matching
n  Practical considerations
n  Testing the assumptions
n  Key weaknesses and uses of matching

11
Matching Methods – Basic Idea [Part 1]
n  Matching approach to estimate treatment
effect is very intuitive and simple
q  For each treated observation, you find a
“matching” untreated observation that
serves as the de facto counterfactual
q  Then, compare outcome, y, of treated
observations to outcome of matched obs.

12
Matching Methods – Basic Idea [Part 2]
n  A bit more formally…
q  For each value of X, where there is both a
treated and untreated observation…
n  Match treated observations with X=X’ to
untreated observations with same X=X’
n  Take difference in their outcomes, y

q  Then, use average difference across all the


X’s as estimate of treatment effect

13
Matching Methods – Intuition
n  What two things is matching approach
basically assuming about the treatment?
q  Answer #1 = Treatment isn’t random; if it
were, would not need to match on X before
taking average difference in outcomes
q  Answer #2 = Treatment is random conditional
on X; i.e. controlling for X, untreated outcome
captures the unobserved treated counterfactual

14
Matching is a “Control Strategy”
n  Can think of matching as just a way to
control for necessary X’s to ensure CMI
strategy necessary for causality holds

What is another control strategy we


could use to estimate treatment effect?

15
Matching and OLS; not that different

n  Answer = Regression!


q  I.e. could just regress y onto indicator for
treatment with necessary controls for X to
ensure CMI assumption holds
n  E.g. to mirror matching estimator, you could just
put in indicators for each value of X as the set of
controls in the regression

So, how are matching & regression different?

16
Matching versus Regression
n  Basically, can think of OLS estimate as
particular weighted matching estimator
q  Demonstrating this difference in
weighting can be a bit technical…
n  See Angrist-Pischke Section 3.3.1 for more
details on this issue, but following example will
help illustrate this…

17
Matching vs Regression – Example [P1]
n  Example of difference in weighting…
q  First, do simple matching estimate
q  Then, do OLS where regress y on
treatment indicator and you control for X’s
by adding indicators for each value of X
n  This is very nonparametric and general way to
control for covariates X
n  If think about it, this is very similar to
matching; OLS will be comparing outcomes for
treated and untreated with same X’s

18
Matching vs Regression – Example [P2]
n  But, even in this example, you’ll get different
estimates from OLS and matching
q  Matching gives more weight to obs. with X=X’
when there are more treated with that X’
q  OLS gives more weight to obs. with X=X’ when
there is more variation in treatment [i.e. we observe
a more equal ratio of treated & untreated]

19
Matching vs Regression – Bottom Line
n  Angrist-Pischke argue that, in general,
differences between matching and OLS
are not of much empirical importance

n  Moreover, similar to OLS, matching


has a serious limitation…

20
Matching – Key Limitation [Part 1]

n  What sets matching estimator apart from


other estimators like IV, natural
experiments, and regression discontinuity?
q  Answer = It does not rely on any clear
source of exogenous variation!
n  I.e. If OLS estimate of treatment effect is biased, so
is a matching estimator of treatment effect!

21
Matching – Key Limitation [Part 2]

n  And, we abandoned OLS for a reason…


q  If original treatment isn’t random (i.e. exogenous),
it is often difficult to believe that controlling for
some X’s will somehow restore randomness
n  E.g. there could be problematic, unobserved heterogeneity
n  Note: regression discontinuity design is exception

q  Matching estimator suffers same problem!

22
Matching – Key Limitation [Part 3]

n  Please remember this!


n  Matching does NOT and cannot be used…
q  To fix simultaneity bias problem
q  To eliminate measurement error bias…
q  To fix omitted variable bias from unobservable
variables [can’t match on what you can’t observe!]

23
Matching – So, what good is it? [Part 1]

n  Prior slides would seem to suggest


matching isn’t that useful…
q  Basically just another control strategy that
is less dependent on functional form of X
q  Doesn’t resolve identification concerns

n  But, there are some uses…

24
Matching – So, what good is it? [Part 2]

n  Can be used…


q  To do robustness check on OLS estimate
q  To better screen the data used in OLS

n  Can sometimes have better finite-


sample properties than OLS

More about these later…

25
Matching – Outline
n  Introduction to matching
n  How to do matching
q  Notation & assumptions
q  Matching on covariates
q  Matching on propensity score
n  Practical considerations
n  Testing the assumptions
n  Key weaknesses and uses of matching

26
First some notation…
n  Suppose want to know effect of treatment,
d, where d = 1 if treated, d = 0 if not treated
n  Outcome y is given by…
q  y(1) = outcome if d = 1
q  y(0) = outcome if d = 0

n  Observable covariates are X = (x1,…,xk)

27
Identification Assumptions

n  Matching requires two assumptions in


order to estimate treatment effect
q  “Unconfoundedness”
q  “Overlap”

28
Assumption #1 – Unconfoundedness
n  Outcomes y(0) and y(1) are statistically
independent of treatment, d, conditional
on the observable covariates, X
q  I.e. you can think of assignment to treatment
as random once you control for X

29
“Unconfoundedness” explained…
n  This assumption is stronger version of
typical CMI assumption that we make
q  It is equivalent to saying treatment, d, is
independent of error u, in following regression
y = β0 + β1 x1 + ... + βk xk + γ d + u
n  Note: This stronger assumption needed in certain
matching estimators, like propensity score

30
Assumption #2 – Overlap

n  For each value of covariates, there is a


positive probability of being in the
treatment group and in the control group
q  I.e. There will be both treatment and control
observations available when match on X
q  Why do we need this assumption?
n  Answer = It would be problematic to do a matching
estimator if we didn’t have both treated and
untreated observations with the same X!

31
“Overlap” in practice

n  In reality, we don’t have “overlap”


q  E.g. think about continuous variables;
observations won’t have exact same X
q  As we’ll see shortly, we end instead use
observations with “similar” X in matching
n  This actually causes matching estimator to be
biased and inconsistent; but there are ways to
correct for this [see Abadie and Imbens (2008)]

32
Average Treatment Effect (ATE)

n  With both assumptions, easy to show that


ATE for subsample with X = X’ is equal
to difference in outcome between treated
and control observations with X = X’
q  See Roberts and Whited page 68 for proof
q  To get ATE for population, just integrate over
distribution X (i.e. take average ATE over all
the X’s weighting based on probability of X)

33
Difficulty with exact matching

n  In practice, difficult to use exact matches


when matching on # of X’s (i.e. k) is large
q  May not have both treated and control for
each possible combination of X’s
q  This is surely true when any x is continuous
(i.e. it doesn’t just take on discrete values)

34
Matching – Outline
n  Introduction to matching
n  How to do matching
q  Notation & assumptions
q  Matching on covariates
q  Matching on propensity score
n  Practical considerations
n  Testing the assumptions
n  Key weaknesses and uses of matching

35
Matching on Covariates – Step #1
n  Select a distance metric, ||Xi – Xj||
q  It tells us how far apart the vector of X’s for
observation i are from X’s for observation j
q  One example would be Euclidean distance

||X_i − X_j|| = sqrt( (X_i − X_j)′ (X_i − X_j) )

36
Matching on Covariates – Step #2
n  For each observation, i, find M closest
matches (based on chosen distance metric)
among observations where d ≠ di
q  I.e. for a treated observation (i.e. d = 1) find the
M closest matches among untreated observations
q  For an untreated observation (i.e. d = 0), find the
M closest matches among treated observations

37
Before Step #3… some notation
n  Define lm(i) as mth closest match to
observation i among obs. where d ≠ di
q  E.g. suppose obs. i =4 is treated [i.e. d =1]
n  l1(4) would represent the closest
untreated observation to observation i = 4
n  l2(4) would be the second closest, and so on

n  Define LM(i) = {lm(i),…, lM(i)}


Just way of labeling M
closest obs. to obs. i

38
Matching on Covariates – Step #3
n  Create imputed untreated outcome, yˆi (0) ,
and treated outcome, yˆi (1) , for each obs. i

⎧ yi if di = 0

yˆi (0) = ⎨ 1
⎪⎩ M ∑ j∈LM ( i ) y j if di = 1 In words, what
is this doing?
⎧1
⎪ ∑ j∈LM ( i ) y j if di = 0
yˆi (1) = ⎨ M
⎪⎩ yi if di = 1

39
Interpretation…

ŷ_i(0) = y_i                          if d_i = 0
       = (1/M) Σ_{j ∈ L_M(i)} y_j     if d_i = 1

n  If obs. i was treated, we observe the actual outcome, y_i(1); but we don’t
observe the counterfactual, y_i(0), so we estimate it using the average
outcome of the M closest untreated observations!
40
Interpretation…

ŷ_i(1) = (1/M) Σ_{j ∈ L_M(i)} y_j     if d_i = 0
       = y_i                          if d_i = 1

n  And vice versa, if obs. i had been untreated: we impute the unobserved
counterfactual, y_i(1), using the average outcome of the M closest treated obs.
41
Matching on Covariates – Step #4
n  With assumptions #1 and #2, average
treatment effect (ATE) is given by:
ATE = (1/N) Σ_{i=1}^{N} [ŷ_i(1) − ŷ_i(0)]

In words, what is this doing?


Answer = Taking simple average of difference
between observed outcome and constructed
counterfactual for each observation
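[Sketch: covariate matching with Stata's built-in teffects command; the outcome y, treatment d, and covariates x1–x3 are hypothetical names.]
   * 1-nearest-neighbor matching on covariates; the ATE is reported by default
   teffects nnmatch (y x1 x2 x3) (d), nneighbor(1)
   * add the 'atet' option instead to estimate the effect on the treated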

42
Matching – Outline
n  Introduction to matching
n  How to do matching
q  Notation & assumptions
q  Matching on covariates
q  Matching on propensity score
n  Practical considerations
n  Testing the assumptions
n  Key weaknesses and uses of matching

43
Matching on propensity score
n  Another way to do matching is to first
estimate a propensity score using
covariates, X, and then match on it…

44
Propensity Score, ps(x) [Part 1]

n  Propensity score, ps(x), is probability


of treatment given X [i.e. Pr(d = 1|X),
which is equal to CEF E[d|X]]
q  Intuitive measure…
n  Basically collapses your k-dimensional vector
X into a 1-dimensional measure of the
probability of treatment given the X’s
n  Can estimate this in many ways including
discrete choice models like Probit and Logit

45
Propensity Score, ps(x) [Part 2]

n  With unconfoundedness assumption,


conditioning on ps(X) is sufficient to
identify average treatment effect; i.e.
q  I.e. controlling for probability of treatment
(as predicted by X) is sufficient
n  Can do matching using just ps(X)
n  Or, can regress y on treatment indicator, d, and
add propensity score as control

46
Matching on ps(X) – Step #1
n  Estimate propensity score, ps(X), for
each observation i
q  For example, estimate d = β0 + β1 x1 + ... + β k xk + ui
using OLS, Probit, or Logit
n  Common practice is to use Logit with few
polynomial terms for any continuous covariates
q  Predicted value for observation i is it’s
propensity score, ps(Xi)
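[Sketch: step #1 in Stata, assuming a treatment dummy d and covariates x1–x3, with x3 continuous; all names are hypothetical.]
   logit d x1 x2 c.x3##c.x3        // logit with a squared term for the continuous covariate
   predict pscore, pr              // ps(X_i) = predicted probability of treatment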

47
Tangent about Step #1
n  Note: You only need to include X’s
that predict treatment, d
q  This may be less than full set of X’s
q  In fact, being able to exclude some X’s
(because economic logic suggests they
shouldn’t predict d) can improve finite
sample properties of the matching estimate

48
Matching on ps(X) – Remaining Steps…
n  Now, use same steps as before, but choose
M closest matches using observations with
closest propensity score
q  E.g. if obs. i is untreated, choose M treated
observations with closest propensity scores

49
Propensity score – Advantage # 1
n  Propensity score helps avoid concerns about
subjective choices we make with matching
q  As we’ll see next, there are a lot of subjective
choices you need to make [e.g. distance metric,
matching method, etc.] when matching on covariates

50
Propensity score – Advantage # 2
n  Can skip matching entirely, and estimate
ATE using sample analog of
E[ (d_i − ps(X_i)) y_i / ( ps(X_i)(1 − ps(X_i)) ) ]
q  See Angrist-Pischke, Section 3.3.2 for more
details about why this works
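[Sketch: the sample analog of this expression, using the pscore variable from the earlier logit sketch.]
   gen w = (d - pscore) * y / (pscore * (1 - pscore))
   summarize w
   display "ATE estimate = " r(mean)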

51
But, there is a disadvantage (sort of?)
n  Can get lower standard errors by instead
matching on covariates if add more variables
that explain y, but don’t necessarily explain d
q  Same as with OLS; more covariates can increase
precision even if not needed for identification
q  But, Angrist and Hahn (2004) show that using
ps(X) and ignoring these covariates can actually
result in better finite sample properties

52
Matching – Outline
n  Introduction to matching
n  How to do matching
n  Practical considerations
n  Testing the assumptions
n  Key weaknesses and uses of matching

53
Practical Considerations

n  There are a lot of practical considerations


and choices to make with matching; e.g.,
q  Which distance metric to use?
q  How many matches for each observation?
q  Match with or without replacement?
q  Which covariates X should be used?
q  Use propensity score, and if so, how measure it?

54
Choice of distance metric [Part 1]

n  What is downside to simple Euclideun


distance metric from earlier?
(X − X j ) ( Xi − X j )
'
Xi − X j = i

q  Answer = It ignores the potentially


different scales of each variable [which is
why it typically isn’t used in practice]
n  Which variables will have more effect in
determining best matches with this metric?

55
Choice of distance metric [Part 2]
n  Two other possible distance metrics
standardize distances using inverse of
covariates’ variances and covariances
q  Abadie and Imbens (2006)

( X i − X j ) diag (Σ−X1 ) ( X i − X j ) Inverse of


'
Xi − X j =
variance-
q  Mahalanobis [probably most popular] covariance
matrix for
covariates
( X i − X j ) (Σ−X1 ) ( X i − X j )
'
Xi − X j =

56
Choice of matching approach
n  Should you match based on covariates, or
instead match using a propensity score?
q  And, if use propensity score, should you use
Probit, Logit, OLS, or nonparametric approach?

n  Unfortunately, no clear answer


q  Want whichever is going to be most accurate…
q  But, probably should show robustness to
several different approaches

57
And, how many matches? [Part 1]
n  Again, no clear answer…
n  Tradeoff is between bias and precision
q  Using single best match will be least biased
estimate of counterfactual, but least precise
q  Using more matches increases precision, but
worsens quality of match and potential bias

58
And, how many matches? [Part 2]
n  Two ways used to choose matches
q  “Nearest neighbor matching”
n  This is what we saw earlier; you choose the m matches
that are closest using your distance metric

q  “Caliper matching”


n  Choose all matches that fall within some radius
n  E.g. if using propensity score, could choose all matches
within 1% of observation’s propensity score

Question: What is intuitive advantage of caliper approach?

59
And, how many matches? [Part 3]
n  Bottom line advice
q  Best to try multiple approaches to ensure
robustness of the findings
n  If adding more matches (or expanding radius
in caliper approach) changes estimates, then
bias is potential issue and should probably stick
to smaller number of potential matches
n  If not, and only precision increases, then okay
to use a larger set of matches

60
With or without replacement? [Part 1]
n  Matching with replacement
q  Each observation can serve as a match
for multiple observations
q  Produces better matches, reducing
potential bias, but at loss of precision

n  Matching without replacement

61
With or without replacement? [Part 2]
n  Bottom line advice…
q  Roberts-Whited recommend to do
matching with replacement…
n  Our goal should be to reduce bias
n  In matching without replacement, the order in
which you match can affect estimates

62
Which covariates?
n  Need all X’s that affect outcome, y, and
are correlated with treatment, d [Why?]
q  Otherwise, you’ll have omitted variables!

n  But, do not include any covariates


that might be affected by treatment
q  Again, same “bad control” problem

Question: What might be a way to control for an X
that could be a “bad control”?   Answer: Use lagged X

63
Matches for whom?
n  If use matches for all observations
(as done earlier), you estimate ATE
q  But, if only use and find matches for
treated observations, you estimate average
treatment effect on treated (ATT)
q  If only use and find matches for untreated,
you estimate average treatment effect on
untreated (ATU)

64
Matching – Outline
n  Introduction to matching
n  How to do matching
n  Practical considerations
n  Testing the assumptions
n  Key weaknesses and uses of matching

65
Testing “Overlap” Assumption
n  If only one X or using ps(X), can just
plot distribution for treated & untreated
n  If using multiple X, identify and inspect
worst matches for each x in X
q  If difference between match and
observation is large relative to standard
deviation of x, might have problem

66
If there is lack of “Overlap”
n  Approach is very subjective…
q  Could try discarding observations with
bad matches to ensure robustness
q  Could try switching to caliper matching
with propensity score

67
Testing “Unconfoundedness”
n  How might you try to test
unconfoundedness assumption?
q  Answer = Trick question; you can’t! We
do not observe error, u, and therefore can’t
know if treatment, d, is independent of it!
q  Again, we cannot test whether the
equations we estimate are causal!

68
But, there are other things to try…
n  Similar to natural experiment, can
do various robustness checks; e.g.
q  Test to make sure timing of observed
treatment effect is correct
q  Test to make sure treatment doesn’t
affect other outcomes that should,
theoretically, be unaffected
n  Or, look at subsamples where treatment
effect should either be larger or smaller

69
Matching – Outline
n  Introduction to matching
n  How to do matching
n  Practical considerations
n  Testing the assumptions
n  Key weaknesses and uses of matching

70
Weaknesses Reiterated [Part 1]
n  As we’ve just seen, there isn’t clear
guidance on how to do matching
q  Choices on distance metric, matching
approach, # of matches, etc. are subjective
q  Or, what is best way to estimate propensity
score? Logit, probit, nonparametric?

n  Different researchers, using different


methods might get different answers!

71
Weaknesses Reiterated [Part 2]

n  And, as noted earlier, matching is not a


way to deal with identification problem
q  Does NOT help with simultaneity, unobserved
omitted variables, or measurement error
q  Original OLS estimate of regressing y on
treatment, d, and X’s is similar but weighting
observations in particular way

72
Tangent – Related Problem
What is wrong
with this claim?
n  Often see a researcher estimate:
y = β0 + β1d + ps( X ) + u
q  d = indicator for some non-random event
q  ps(X) = prop. score for likelihood of treatment
estimated using some fancy, complicated Logit

n  Then, researcher will claim:


“Because ps(X) controls for any selection bias,
I estimate causal effect of treatment”

73
Tangent – Related Problem [Part 2]
n  Researcher assumes that observable X
captures ALL relevant omitted variables
q  I.e. there aren’t any unobserved variables
that affect y and are correlated with d
q  This is often not true… Remember long
list of unobserved omitted factors discussed
in lecture on panel data

n  Just because it seems fancy or


complicated doesn’t mean it’s identified!

74
Another Weakness – Inference

n  There isn’t always consensus or formal


method for calculating SE and doing
inference based on estimates

n  So, what good is it, and when


should we bother using it?

75
Use as a robustness check

n  Can use as robustness check to OLS


estimation of treatment effect
q  It avoids functional form assumptions
imposed by the regression; so, provides a
nice sanity check on OLS estimates
n  Angrist-Pischke argue, however, that it won’t
find much difference in practice if have right
covariates, particularly if researcher uses
regression with flexible controls for X

76
Use as precursor to regression [Part 1]

n  Can use matching to screen sample


used in later regression
q  Ex. #1 – Could estimate propensity score;
then do estimation using only sample
where the score lies between 10% and 90%
n  Helps ensure estimation is done only using obs.
with sufficient # of controls and treated
n  Think of it as ensuring sufficient overlap

77
Use as precursor to regression [Part 2]

q  Ex. #2 – Could estimate effect of


treatment using only control observations
that match characteristics of treated obs.
n  E.g. If industry X is hit by shock, select control
sample to firms matched to similar industry

78
Matching – Practical Advice

n  User-written program, “psmatch2,” in


Stata can be used to do matching and
obtain estimates of standard errors
q  Program is flexible and can do variety of
different matching techniques
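[Sketch: a hypothetical psmatch2 call — one-nearest-neighbor propensity-score matching; the options shown are a subset of what the command supports, and the variable names are illustrative.]
   ssc install psmatch2
   psmatch2 d x1 x2 x3, outcome(y) neighbor(1) logit common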

79
Summary of Today [Part 1]

n  “Matching” is another control method


q  Use to estimate treatment effect in cases where
treatment is random after controlling for X
q  Comparable to OLS estimation of treatment
effect, just without functional form assumptions

n  Besides controlling for X, matching does


NOT resolve or fix identification problems

80
Summary of Today [Part 2]

n  Many different ways to do matching; e.g.


q  Match on covariates or propensity scores
q  Nearest neighbor or caliper matching

n  Primarily used as robustness test


q  If have right covariates, X, and relatively
flexible OLS model, matching estimate of
ATE will typically be quite similar to OLS

81
In First Half of Next Class
n  Standard errors & clustering
q  Should you use “robust” or “classic” SE?
q  “Clustering” and when to use it

n  Limited dependent variables…


are Probit, Logit, or Tobit needed?
n  Related readings… see syllabus

82
Assign papers for next week…
n  Morse (JFE 2011)
q  Payday lenders

n  Colak and Whited (RFS 2007)


q  Spin-offs, divestitures, and investment

n  Almeida, et al (working paper, 2014)


q  Credit ratings & sovereign credit ceiling

83
Break Time
n  Let’s take our 10 minute break
n  We’ll quickly cover Heckman selection models
and then do presentations when we get back

84
Heckman selection models
n  Motivation
n  How to implement
n  Limitations [i.e., why I don’t like them]

85
Motivation [Part 1]
n  You want to estimate something like…
Yi = bXi + εi
q  Yi = post-IPO outcome for firm i
q  Xi = vector of covariates that explain Y
q  εi,t = error term
q  Sample = all firms that did IPO in that year

n  What is a potential concern?

86
Motivation [Part 2]
n  Answer = certain firms ‘self-select’ to
do an IPO, and the factors that drive
that choice might cause X to be
correlated with εi
q  It’s basically an omitted variable problem!
q  If willing to make some assumptions,
can use Heckman two-step selection
model to control for this selection bias

87
How to implement [Part 1]
n  Assume to choice to ‘self-select’ [in this
case, do an IPO] has following form…
⎧⎪1 if γ Zi + ηi > 0 ⎫⎪
IPOi = ⎨ ⎬
⎪⎩0 if γ Zi + ηi ≤ 0 ⎪⎭
q  Zi = factors that drive choice [i.e., IPO]
q  ηi,t = error term for this choice

88
How to implement [Part 2]
n  Regress choice variable (i.e., IPO) onto
Z using a Probit model
n  Then, use predicted values to calculate
the Inverse Mills Ratio for each
observation, λi = ϕ(γZi)/Φ(γZi)
n  Then, estimate original regression of Yi
onto Xi, but add λi as a control!

Basically, controls directly for omitted


variable; e.g. choice to do IPO
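A hedged Stata sketch of these two steps (ipo, z1, z2, x1, x2, and y are placeholder names); Stata’s built-in heckman command with the twostep option packages the same logic and also corrects the second-step standard errors:

    * Step 1: Probit of the IPO choice on Z, then the inverse Mills ratio
    probit ipo z1 z2
    predict zindex, xb
    gen lambda = normalden(zindex)/normal(zindex)
    * Step 2: outcome regression on X plus lambda, on the selected sample
    regress y x1 x2 lambda if ipo == 1
    * Built-in equivalent:
    * heckman y x1 x2, select(ipo = z1 z2) twostep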

89
Limitations [Part 1]
n  Model for choice [i.e., first step of the estimation]
must be correct; otherwise inconsistent!
n  Requires assumption that the errors, ε and η,
have a bivariate normal distribution
q  Can’t test, and no reason to believe this is true
[i.e., what is the economic story behind this?]
q  And, if wrong… estimates are inconsistent!

90
Limitations [Part 2]
n  Can technically work if Z is just a subset of
the X variables [which is commonly what people
seem to do], but…
q  But, in this case, all identification relies on non-
linearity of the inverse mills ratio [otherwise, it
would be collinear with the X in the second step]
q  But again, this is entirely dependent on the
bivariate normality assumption and lacks
any economic intuition!

91
Limitations [Part 3]
n  When Z has variables not in X [i.e., excluded
instruments], then could just do IV instead!
q  I.e., estimate Yi = bX i + IPOi + ε i on full sample
using excluded IVs as instruments for IPO
q  Avoids unintuitive, untestable assumption of
bivariate normal error distribution!
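A sketch of that IV alternative, assuming z1 is the excluded instrument and the other variable names are placeholders:

    * 2SLS on the full sample, instrumenting the IPO indicator with z1
    ivregress 2sls y x1 x2 (ipo = z1), vce(robust)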

92
FNCE 926
Empirical Methods in CF
Lecture 11 – Standard Errors & Misc.

Professor Todd Gormley


Announcements
n  Exercise #4 is due
n  Final exam will be in-class on April 26
q  After today, only two more classes of new
material, structural & experiments
q  Practice exam available on Canvas

2
Background readings for today
n  Readings for standard errors
q  Angrist-Pischke, Chapter 8
q  Bertrand, Duflo, Mullainathan (QJE 2004)
q  Petersen (RFS 2009)
n  Readings for limited dependent variables
q  Angrist-Pischke, Sections 3.4.2 and 4.6.3
q  Greene, Section 17.3

3
Outline for Today
n  Quick review of last lecture on matching
n  Discuss standard errors and clustering
q  “Robust” or “Classical”?
q  Clustering: when to do it and how

n  Discuss limited dependent variables


n  Student presentations of “Matching” papers

4
Quick Review [Part 1]
n  Matching is intuitive method
q  For each treated observation, find comparable
untreated observations with similar covariates, X
n  They will act as estimate of unobserved counterfactual
n  Do the same thing for each untreated observation

q  Take average difference in outcome, y, of


interest across all X to estimate ATE

5
Quick Review [Part 2]
n  But, what are necessary assumptions for
this approach to estimate ATE?
q  Answer #1 = Overlap… Need both treated
and control observations for X’s
q  Answer #2 = Unconfoundedness… Treatment is
as good as random after controlling for X

6
Quick Review [Part 3]
n  Matching is just a control strategy!
q  It does NOT control for unobserved variables
that might pose identification problems
q  It is NOT useful in dealing with other problems
like simultaneity and measurement error biases

n  Typically used as robustness check on OLS


or way to screen data before doing OLS

7
Quick Review [Part 4]
n  Relative to OLS estimate of treatment effect…
q  Matching basically just weights differently
q  And, doesn’t make functional form assumption
n  Angrist-Pischke argue you typically won’t find large
difference between two estimates if you have right
X’s and flexible controls for them in OLS

8
Quick Review [Part 5]
n  Many choices to make when matching
q  Match on covariates or propensity score?
q  What distance metric to use?
q  What # of observations?

n  Will want to show robustness of estimate to


various different approaches

9
Standard Errors & LDVs – Outline
n  Getting your standard errors correct
q  “Classical” versus “Robust” SE
q  Clustered SE

n  Limited dependent variables

10
Getting our standard errors correct
n  It is important to make sure we get our
standard errors correct so as to avoid
misleading or incorrect inferences
q  E.g. standard errors that are too small will cause
us to reject the null hypothesis that our
estimated β’s are equal to zero too often
n  I.e. we might erroneously claim to found a
“statistically significant” effect when none exists

11
Homoskedastic or Heteroskedastic?

n  One question that typically comes up when


trying to figure out the appropriate SE is
homoskedasticity versus heteroskedasticity
q  Homoskedasticity assumes the variance
of the residuals, u, around the CEF, does
not depend on the covariates, X
q  Heteroskedasticity doesn’t assume this

12
“Classical” versus “Robust” SEs [Part 1]
n  What do the default standard errors
reported by programs like Stata assume?
q  Answer = Homoskedasticity! This is what
we refer to as “classical” standard errors
n  As we discussed in earlier lecture, this is typically
not a reasonable assumption to make
n  “Robust” standard errors allow for
heteroskedasticity and don’t make this assumption

13
“Classical” versus “Robust” SEs [Part 2]
n  Putting aside possible “clustering” (which
we’ll discuss shortly), should you always
use robust standard errors?
q  Answer = Not necessarily! Why?
n  Asymptotically, “classical” and “robust” SE are
correct, but both suffer from finite-sample bias that
will tend to make them too small in small samples
n  “Robust” can sometimes be smaller than “classical”
SE because of this bias or simple noise!

14
Finite sample bias in standard errors
n  Finite sample bias is easily corrected in
“classical” standard errors
[Note: this is done automatically by Stata]
n  This is not so easy with “robust” SEs…
q  Small sample bias can be worse with
“robust” standard errors, and while finite
sample corrections help, they typically don’t
fully remove the bias in small samples

15
Many different corrections are available
n  Number of methods developed to try and
correct for this finite-sample bias
q  By default, Stata automatically does one of these
when use vce(robust) to calculate SE
q  But, there are other ways as well; e.g.,
n  regress y x, vce(hc2)
Developed by Davidson
n  regress y x, vce(hc3) and MacKinnon (1993);
works better when
heterogeneity is worse

16
Classical vs. Robust – Practical Advice
n  Compare the robust SE to the classical SE
and take maximum of the two
q  Angrist-Pischke argue that this will tend to be
closer to the true SE in small samples that
exhibit heteroskedasticity
n  If small sample bias is real concern, might want to
use HC2 or HC3 instead of typical “robust” option
n  While SE using this approach might be too large if
data is actually homoskedastic, this is less of concern
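In Stata this just means running the same regression with different vce() options and keeping whichever SE is larger (variable names are placeholders):

    regress y x1 x2                 // classical SEs
    regress y x1 x2, vce(robust)    // heteroskedasticity-robust SEs
    regress y x1 x2, vce(hc3)       // HC3 variant for small samples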

17
Standard Errors & LDVs – Outline
n  Getting your standard errors correct
q  “Classical” versus “Robust” SE
q  Clustered SE
n  Violation of independence and implications
n  How big of a problem is it? And, when?
n  How do we correct for it with clustered SE?
n  When might clustering not be appropriate?

n  Limited dependent variables

18
Clustered SE – Motivation [Part 1]
n  “Classical” and “robust” SE depend
on assumption of independence
q  i.e. our observations of y are random
draws from some population and are
hence uncorrelated with other draws
q  Can you give some examples where this
is likely an unrealistic assumption in CF? [E.g. think
of firm-level capital structure panel regression]

19
Clustered SE – Motivation [Part 2]
n  Example Answers
q  Firm’s outcome (e.g. leverage) is likely
correlated with other firms in same industry
q  Firm’s outcome in year t is likely correlated
to outcome in year t-1, t-2, etc.

n  In practice, independence assumption is


often unrealistic in corporate finance

20
Clustered SE – Motivation [Part 3]
n  Moreover, this non-independence can
cause significant downward biases in
our estimated standard errors
q  E.g. standard errors can easily double,
triple, etc. once we correct for this!
q  This is different than correcting for
heterogeneity (i.e. “Classical” vs. “robust”)
tends to increase SE, at most, by about
30% according to Angrist-Pischke

21
Example violations of independence
n  Violations tend to come in two forms
#1 – Cross-sectional “Clustering”
n  E.g. outcome, y, [e.g. ROA] for a firm tends to be
correlated with y of other firms in same industry
because they are subject to same demand shocks

#2 – “Time series correlation”


n  E.g. outcome, y, [e.g. Ln(assets)]for firm in year t
tends to be correlated with the firm’s y in other
years because there serial correlation over time

22
Violation means non-i.i.d. errors
n  Such violations basically mean that our
errors, u, are not i.i.d. as assumed
q  Specifically, you can think of the errors as
being correlated in groups, where
yig = β0 + β1xig + uig,  where uig is the error for
observation i, which is in group g
n  var(uig) = σu² > 0
n  corr(uig, ujg) = ρuσu² > 0
[“Robust” and “classical” SEs assume this correlation is
zero; ρu is called the “intra-class correlation coefficient”]

23
“Cluster” terminology
n  Key idea: errors are correlated within groups
(i.e. clusters), but not correlated across them
q  In cross-sectional setting with one time period,
cluster might be industry; i.e. obs. within industry
correlated but obs. in different industries are not
q  In time series correlation, you can think of the
“cluster” as the multiple observations for each
cross-section [e.g. obs. on firm over time are the cluster]

24
Why are classical SE too low?
n  Intuition…
q  Broadly speaking, you don’t have as much
random variation as you really think you do
when calculating your standard errors; hence,
your standard errors are too small
n  E.g. if double # of observations by just replicating
existing data, your classical SE will go down even
though there is no new information; Stata does
not realize the observations are not independent

25
Standard Errors & LDVs – Outline
n  Getting your standard errors correct
q  “Classical” versus “Robust” SE
q  Clustered SE
n  Violation of independence and implications
n  How big of a problem is it? And, when?
n  How do we correct for it with clustered SE?
n  When might clustering not be appropriate?

n  Limited dependent variables

26
How large, and what’s important?
n  By assuming a structure for the non-i.i.d.
nature of the errors, we can derive a
formula for how large the bias will be
n  Can also see that two factors are key
q  Magnitude of intra-class correlation in u
q  Magnitude of intra-class correlation in x

27
Random effect version of violation
n  To do this, we will assume the within-group
correlation is driven by a random effect
yig = β0 + β1xig + vg + ηig,  where uig = vg + ηig

All within-group correlation is captured by the random
effect vg, with corr(ηig, ηjg) = 0. In this case, the
intra-class correlation coefficient is ρu = σv² / (σv² + ση²)

28
Moulton Factor
n  With this setting and a constant # of
observations per group, n, we can show that
SE(β̂1) / SEc(β̂1) = [1 + (n − 1)ρu]^(1/2)

q  SE(β̂1) = correct SE of the estimate
q  SEc(β̂1) = “classical” SE you get when you don’t
account for the within-group correlation
q  This ratio is called the “Moulton Factor”; it tells
you how much larger the corrected SE will be

29
Moulton Factor – Interpretation
SE(β̂1) / SEc(β̂1) = [1 + (n − 1)ρu]^(1/2)

q  Interpretation = If you correct for this non-i.i.d.
structure within groups (i.e. clustering), classical SE
will be larger by a factor equal to the Moulton Factor
n  E.g. Moulton Factor = 3 implies your standard errors
will triple in size once you correctly account for correlation!
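A quick back-of-the-envelope check in Stata, using made-up values of n = 20 observations per group and ρu = 0.1 (numbers are purely illustrative):

    * Moulton Factor = [1 + (n - 1)*rho_u]^(1/2)
    display sqrt(1 + (20 - 1)*0.1)    // ≈ 1.70, so clustered SEs ~70% larger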

30
What affects the Moulton Factor?
SE(β̂1) / SEc(β̂1) = [1 + (n − 1)ρu]^(1/2)

q  Formula highlights importance of n and ρu
n  There is no bias if ρu = 0 or if n = 1 [Why?]
n  If ρu rises, the magnitude of bias rises [Why?]
n  If observations per group, n, rises, bias is greater [Why?]

31
Answers about Moulton Factor

n  Answer #1: ρu  =0 implies each additional obs. provides new
info. (as if they are i.i.d.), and (2) n=1 implies there aren’t
multiple obs. per cluster, so correlation is meaningless
n  Answer #2 = Higher intra-class correlation ρu    means that
new observations within groups provide even less new
information, but classical standard errors don’t realize this
n  Answer #3 = Classical SE thinks each additional obs. adds
information, when in reality, it isn’t adding that much. So,
bias is worse with more observations per group.

32
Bottom line…
n  Moultan Factor basically shows that
downward bias is greatest when…
q  Dependent variable is highly correlated
across observations within group
[e.g. high time series correlation in panel]
q  And, we have a large # of observations per
group [e.g. large # of years in panel data]

Expanding to uneven group sizes, we see that one


other factor will be important as well…

33
Moulton Factor with uneven group sizes
SE(β̂1) / SEc(β̂1) = [1 + (V(ng)/n̄ + n̄ − 1)ρxρu]^(1/2)
n  ng = size of group g
n  V(ng) = variance of group sizes
n  n = average group size
n  ρu  = intra-class correlation of errors, u
n  ρx  = intra-class correlation of covariate, x

34
Importance of non-i.i.d. x’s [Part 1]
SE(β̂1) / SEc(β̂1) = [1 + (V(ng)/n̄ + n̄ − 1)ρxρu]^(1/2)

q  Now we see that a non-zero correlation


between x’s within groups is also important
n  Question: For what type of covariates will this
correlation be high? [i.e. when is clustering important?]

35
Importance of non-i.i.d. x’s [Part 2]

n  Prior formula shows that downward


bias will also be bigger when…
q  Covariate only varies at group level; px will
be exactly equal to 1 in those cases!
q  When covariate likely has a lot of time
series dependence [e.g. Ln(assets) of firm]

36
Standard Errors & LDVs – Outline
n  Getting your standard errors correct
q  “Classical” versus “Robust” SE
q  Clustered SE
n  Violation of independence and implications
n  How big of a problem is it? And, when?
n  How do we correct for it with clustered SE?
n  When might clustering not be appropriate?

n  Limited dependent variables

37
How do we correct for this?
n  There are many possible ways
q  If think error structure is random effects, as
modeled earlier, then you could just multiply
SEs by Moulton Factor…
q  But, more common way, which allows for
any type of within-group correlation, is to
“cluster” your standard errors
n  Implemented in Stata using vce(cluster variable)
option in estimation command
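For example, with industry as the assumed cluster variable (variable names are placeholders):

    * OLS with standard errors clustered by industry
    regress y x1 x2, vce(cluster industry)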

38
Clustered Standard Errors
n  Basic idea is that it allows for any type
of correlation of errors within group
q  E.g. if “cluster” was a firm’s observations
for years 1, 2, …, T, then it would allow
corr(ui1, ui2) to be different than corr(ui1, ui3)
n  Moultan factor approach would assume these
are all the same which may be wrong

n  Then, use independence across groups


and asymptotics to estimate SEs

39
Clustering – Cross-Sectional Example #1
n  Cross-sectional firm-level regression
yij = β0 + β1 x j + β2 zij + uij
q  yij is outcome for firm i in industry j
q  xj only varies at industry level
q  zij varies within industry
q  How should you cluster?
n  Answer = Cluster at the industry level. Observations
might be correlated within industries and one of the
covariates, x, is perfectly correlated within industries

40
Clustering – Cross-Sectional Example #2
n  Panel firm-level regression
yijt = β0 + β1 x jt + β2 zijt + uijt
q  yijt is outcome for firm i in industry j in year t
q  If you think firms are subject to similar industry
shocks over time, how might you cluster?
n  Answer = Cluster at the industry-year level. Obs.
might be correlated within industries in a given year
n  But, what is probably even more appropriate?

41
Clustering – Time-series example
n  Answer = cluster at industry level!
q  This allows errors to be correlated over time
within industries, which is very likely to be the
true nature of the data structure in CF
n  E.g. Shock to y (and error u) in industry j in year t is
likely to be persistent and still partially present in
year t+1 for many variables we analyze. So,
corr(uijt, uijt+1) is not equal to zero. Clustering at
industry level would account for this; clustering at
industry-year level does NOT allow for any
correlation across time

42
Time-series correlation
n  Such time-series correlation is very
common in corporate finance
q  E.g. leverage, size, etc. are all persistent over time
q  Clustering at industry, firm, or state level is a non-
parametric and robust way to account for this!

43
Such serial correlation matters…
n  When non-i.i.d. structure comes from serial
correlation, the number of obs. per group,
n, is the number of years for each panel
q  Thus, downward bias of classical or robust SE
will be greater when we have more years of data!
q  This can matter a lot in diff-in-diff… [Why?
Hint… there are three potential reasons]

44
Serial correlation in diff-in-diff [Part 1]
n  Serial correlation is particularly important in
difference-in-differences because…
#1 – Treatment indicator is highly correlated over
time! [E.g. for untreated firms it stays zero the entire time,
and for treated firms it stays equal to 1 after treatment]
#2 – We often have multiple pre- and post-treatment
observations [i.e. many observations per group]
#3 – And, dependent variables typically used often
have a high time-series dependence to them

45
Serial correlation in diff-in-diff [Part 2]
n  Bertrand, Duflo, and Mullainathan (QJE
2004) shows how bad this SE bias can be…
q  In standard type of diff-in-diff where true β=0,
you’ll find significant effect at 5% level in as
much as 45 percent of the cases!
n  Remember… you should only reject null hypothesis
5% of time when the true effect is actually zero!

46
Firm FE vs. firm clusters

n  Whether to use both FE and clustering often


causes confusion for researchers
q  E.g. should you have both firm FE and clustering
at firm level, and if so, what is it doing?

Easiest to understand why both might be


appropriate with a few quick questions…

47
Firm FE vs. firm clusters [Part 1]

n  Consider the following regression


yit = β0 + β1xit + fi + vit,  where uit = fi + vit

q  yit = outcome for firm i in year t


q  fi = time-invariant unobserved heterogeneity
q  uit is estimation error term if don’t control for fi
q  vit is estimation error term if do control for fi

Now answer the following questions…

48
Firm FE vs. firm clusters [Part 2]

n  Why is it probably not a good idea to just


use firm clusters with no firm FE?
q  Answer = Clustering only corrects standard
errors; it doesn’t deal with potential omitted
variable bias if corr(x,f ) ≠ 0!

49
Firm FE vs. firm clusters [Part 3]
n  Why should we still cluster at firm level if
even if we already have firm FE?
q  Answer = Firm FE removes time-invariant
heterogeneity, fi, from error term, but it doesn’t
account for possible serial correlation!
n  I.e. vit might still be correlated with vit-1, vit-2, etc.
n  E.g. firm might get hit by shock in year t, and effect
of that shock only slowly fades over time
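A sketch of the standard combination, firm fixed effects plus firm-level clusters, assuming a panel identified by firm and year (variable names are placeholders):

    xtset firm year
    * Firm fixed effects with SEs clustered at the firm level
    xtreg y x1 x2, fe vce(cluster firm)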

50
Firm FE vs. firm clusters [Part 4]
n  Will we get consistent estimates with both
firm FE and firm clusters if serial dependence
in error is driven by time-varying omitted
variable that is correlated with x?
q  Answer = No!
n  Clustering only corrects SEs; it doesn’t deal with potential
bias in estimates because of an omitted variable problem!
n  And, Firm FE isn’t sufficient in this case either because
omitted variable isn’t time-invariant

51
Clustering – Practical Advice [Part 1]
n  Cluster at most aggregate level of
variation in your covariates
q  E.g. if one of your covariates only varies at
industry or state level, cluster at that level

n  Always assume serial correlation


q  Don’t cluster at state-year, industry-year,
firm-year; cluster at state, industry, or firm
[this is particularly true in diff-in-diff]

52
Clustering – Practical Advice [Part 2]
n  Clustering is not a substitute for FE
q  Should use both FE to control for
unobserved heterogeneity across groups
and clustered SE to account for remaining
serial correlation in y

n  Be careful when # of clusters is small…

53
Standard Errors & LDVs – Outline
n  Getting your standard errors correct
q  “Classical” versus “Robust” SE
q  Clustered SE
n  Violation of independence and implications
n  How big of a problem is it? And, when?
n  How do we correct for it with clustered SE?
n  When might clustering not be appropriate?

n  Limited dependent variables

54
Need enough clusters…
n  Asymptotic consistency of estimated
clustered standard errors depends on #
of clusters, not # of observations
q  I.e. only guaranteed to get precise estimate of
correct SE if we have a lot of clusters
q  If too few clusters, SE will be too low!
n  This leads to practical questions like… “If I do
firm-level panel regression with 50 states and cluster
at state level, are there enough clusters?”

55
How important is this in practice?
n  Unclear, but probably not a big problem
q  Simulations of Bertrand, et al (QJE 2004)
suggest 50 clusters was plenty in their setting
n  In fact, bias wasn’t that bad with 10 states
n  This is consistent with Hansen (JoE 2007), which
finds that 10 clusters is enough when using
clusters to account for serial correlation

q  But, can’t guarantee this is always true,


particularly in cross-sectional settings

56
If worried about # of clusters…
n  You can try aggregating the data to
remove time-series variation
q  E.g. in diff-in-diff, you would collapse data
into one pre- and one post-treatment
observation for each firm, state, or industry
[depending on what level you think is non-i.i.d], and
then run the estimation
n  See Bertrand, Duflo, and Mullainathan (QJE 2004)
for more details on how to do this
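A minimal sketch of that collapse for a firm-level diff-in-diff, assuming 0/1 variables treated and post (placeholder names):

    * Average into one pre- and one post-treatment observation per firm
    collapse (mean) y treated, by(firm post)
    * Diff-in-diff on the collapsed data
    regress y i.treated##i.post, vce(cluster firm)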

57
Cautionary Note on aggregating
n  Can have very low power
q  Even if true β≠0, aggregating approach can
often fail to reject the null hypothesis

n  Not as straightforward (but still doable)


when have multiple events at different
times or additional covariates
q  See Bertrand, et al (QJE 2004) for details

58
Double-clustering
n  Petersen (2009) emphasized idea of
potentially clustering in second dimension
q  E.g. cluster for firm and cluster for year
[Note: this is not the same as a firm-year cluster!]
q  Additional year cluster allows errors within
year to be correlated in arbitrary ways
n  Year FE removes common error each year
n  Year clusters allows for things like when Firm A
and B are highly correlated within years, but Firm
A and C are not [I.e. it isn’t a common year error]
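Two-way clustering is not an option in plain regress; one common route (my assumption, not from the slides) is the user-written reghdfe package:

    * ssc install reghdfe
    * Year FE plus SEs clustered by both firm and year
    reghdfe y x1 x2, absorb(year) vce(cluster firm year)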

59
But is double-clustering necessary?
n  In asset pricing, YES; in corporate
finance… unclear, but probably not
q  In asset pricing, makes sense… some firms
respond more to systematic shocks across
years [i.e. high equity beta firms!]
q  But, harder to think why correlation or
errors in a year would consistently differ
across firms for CF variables
n  Petersen (2009) finds evidence consistent with
this; adding year FE is probably sufficient in CF

60
Clustering in Panels – More Advice
n  Within Stata, two commands can do the
fixed effects estimation for you
q  xtreg, fe
q  areg
n  They are identical, except when it comes
to the cluster-robust standard errors
q  xtreg, fe cluster-robust SE are smaller because
it doesn’t adjust degrees of freedom (DoF) when clustering!

61
Clustering – xtreg, fe versus areg
n  xtreg, fe are appropriate when FE are nested
within clusters, which is commonly the case
[See Wooldridge 2010, Chapter 20]
q  E.g. firm fixed effects are nested within firm,
industry or state clusters. So, if you have firm FE
and cluster at firm, industry, or state, use xtreg, fe
q  Note: xtreg, fe will give you an error if FE aren’t
nested in clusters; then you should use areg
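The two variants side by side (placeholder names); coefficients are identical, but the clustered SEs differ because of the degrees-of-freedom adjustment:

    xtreg y x1 x2, fe vce(cluster firm)           // use when FE nested in clusters
    areg  y x1 x2, absorb(firm) vce(cluster firm)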

62
Standard Errors & LDVs – Outline
n  Getting your standard errors correct
q  “Classical” versus “Robust” SE
q  Clustered SE

n  Limited dependent variables

63
Limited dependent variables (LDV)

n  LDV occurs whenever outcome y is


zero-one indicator or non-negative
q  If think about it, it is very common
n  Firm-level indicator for issuing equity, doing
acquisition, paying dividend, etc.
n  Manager’s salary [b/c it is non-negative]

q  Zero-one outcomes are also called


discrete choice models

64
Common misperception about LDVs
n  It is often thought that LDVs
shouldn’t be estimated with OLS
q  I.e. can’t get causal effect with OLS
q  Instead, people argue you need to use
estimators like Probit, Logit, or Tobit

n  But, this is wrong!


To see this, let’s compare linear
probability model to Probit & Logit

65
Linear probability model (LPM)

n  LPM is when you use OLS to estimate


model where outcome, y, is an indicator
q  Intuitive and very few assumptions
q  But admittedly, there are issues…
n  Predicted values can be outside [0,1]
n  Error will be heteroskedastic [Does this cause bias?]
Answer = No! Just need to correct SEs

66
Logit & Probit [Part 1]
n  Basically, they assume latent model
y* = x’β + u
[x is a vector of controls, including a constant]

q  y* is unobserved latent variable


q  And, we assume observed outcome, y,
equals 1 if y*>0, and zero otherwise
q  And, make assumption about error, u
n  Probit assumes u distributed normally
n  Logit assumes u is logistic distribution

67
What are Logit & Probit? [Part 2]

n  With those assumptions, can show…


q  Prob(y* > 0|x) = Prob(u < x’β|x) = F(x’ β)
q  And, thus Prob(y = 1|x) = F(x’ β), where F(x’
β) is cumulative distribution function of u

n  Because this is nonlinear, we use maximum


likelihood estimator to estimate β
q  See Greene, Section 17.3 for details

68
What are Logit & Probit? [Part 3]

n  Note: reported estimates in Stata are not


marginal effects of interest!
q  I.e. you can’t easily interpret them or compare
them to what you’d get with LPM
q  Need to use post-estimation command
“margins” to get marginal effects at average x
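A sketch comparing the LPM to probit marginal effects for a 0/1 outcome (variable names are placeholders):

    * Linear probability model: coefficients are marginal effects directly
    regress d_issue x1 x2, vce(robust)
    * Probit: raw coefficients are NOT marginal effects
    probit d_issue x1 x2
    * Marginal effects evaluated at the covariate means
    margins, dydx(*) atmeans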

69
Logit, Probit versus LPM

n  Benefits of Logit & Probit


q  Predicted probabilities from Logit &
Probit will be between 0 and 1…

n  But, are they needed to estimate casual


effect of some random treatment, d?

70
NO! LPM is okay to use

n  Just think back to natural experiments,


where treatment, d, is exogenously assigned
q  Difference-in-differences estimators were shown
to estimate average treatment effects
q  Nothing in those proofs required assumption
that outcome y is continuous with full support!

n  Same is true of non-negative y


[I.e. Using Tobit isn’t necessary either]

71
Instrumental variables and LDV

n  Prior conclusions also hold in 2SLS


estimations with exogenous instrument
q  2SLS still estimates local average treatment
effect with limited dependent variables

72
Caveat – Treatment with covariates

n  There is, however, an issue when


estimating treatment effects when
including other covariates
q  CEF almost certainly won’t be linear
if there are additional covariates, x
n  It is linear if just have treatment, d, and no X’s

n  But, Angrist-Pischke say not to worry…

73
Angrist-Pischke view on OLS [Part 1]
n  OLS still gives best linear approx. of CEF
under less restrictive assumptions
q  If non-linear CEF has causal interpretation, then
OLS estimate has causal interpretation as well
q  If assumptions about distribution of error are
correct, non-linear models (e.g. Logit, Probit, and
Tobit) basically just provide efficiency gain

74
Angrist-Pischke view on OLS [Part 2]

n  But this efficiency gain (from using something


like Probit or Logit) comes with cost…
q  Assumptions of Probit, Logit, and Tobit are not
testable [can’t observe u]
q  Theory gives little guidance on right assumption,
and if assumption wrong, estimates biased!

75
Angrist-Pischke view on OLS [Part 3]
n  Lastly, in practice, marginal effects from
Probit, Logit, etc. will be similar to OLS
q  True even when average y is close to either 0 or 1
(i.e. there are a lot of zeros or lot of ones)

76
One other problem…

n  Nonlinear estimators like Logit, Probit, and


Tobit can’t easily estimate interaction effects
q  E.g. can’t have y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + u
q  Marginal effects reported by statistical programs
will be wrong; need to take additional steps to
get correct interacted effects; See Ai and Norton
(Economic Letters 2003)

77
One last thing to mention…
n  With non-negative outcome y and
random treatment indicator, d
q  OLS still correctly estimates ATE
q  But, don’t condition on y > 0 when selecting
your sample; that messes things up!
n  This is equivalent to “bad control” in that you’re
implicitly controlling for whether y > 0, which is
also outcome of treatment!
n  See Angrist-Pischke, pages 99-100

78
Summary of Today [Part 1]

n  Getting your SEs correct is important


q  If clustering isn’t important, run both
“classical” and “robust” SE; choose higher
q  But, use clustering when…
n  One of key independent variables only varies at
aggregate level (e.g. industry, state, etc)
n  Or, dependent variable or independent variables
likely exhibit time series dependence

79
Summary of Today [Part 2]

n  Miscellaneous advice on clustering


q  Best to assume time series dependence; e.g.
cluster at group level, not group-year
q  Firm FE and firm clusters are not substitutes
q  Use clustered SE produced by xtreg not areg

80
Summary of Today [Part 3]

n  Can use OLS with LDVs


q  Still gives ATE when estimating treatment effect
q  In other settings (i.e. have more covariates), still
gives best linear approx. of non-linear causal CEF

n  Estimators like Probit, Logit, Tobit have


their own problems

81
In First Half of Next Class
n  Randomized experiments
q  Benefits…
q  Limitations

n  Related readings; see syllabus

82
In Second Half of Next Class
n  Papers are not necessarily connected
to today’s lecture on standard errors

83
Assign papers for next week…
n  Heider and Ljungqvist (JFE Forthcoming)
q  Capital structure and taxes

n  Iliev (JF 2010)


q  Effect of SOX on accounting costs

n  Appel, Gormley, Keim (working paper, 2015)


q  Impact of passive investors on governance

84
Break Time
n  Let’s take our 10 minute break
n  We’ll do presentations when we get back

85
