Empirical Methods in CF
Lecture 1 – Linear Regression I
My expectation is that you've seen most of this before; but it is helpful to review the key ideas that are useful in practice (without all the math)
Despite trying to do much of it without math, today's lecture is likely to be long and tedious… (sorry)
Linear Regression – Outline
n The CEF and causality (very brief)
n Linear OLS model
n Multivariate estimation
n Hypothesis testing
n Miscellaneous issues
y = E ( y | x) + ε
where (y, x, ε) are random variables and E(ε|x)=0
Our goal is to learn about the CEF
q Now, they will be…
α̂ = 963,200
β̂ = 18,500
Scaling y continued…
n Scaling y by an amount c just causes all the
estimates to be scaled by the same amount
q Mathematically, easy to see why…
y = α + βx + u
cy = (cα) + (cβ)x + cu
q Now, they will be…
α̂ = 963.2
β̂ = 1,850
Scaling x continued…
n Scaling x by an amount k just causes the
slope on x to be scaled by 1/k
q Mathematically, easy to see why…
Will interpretation of
estimates change?
y = α + βx + u
Answer: Again, no!
y = α + (β/k)(kx) + u
[New slope = β/k]
Scaling both x and y
n If scale y by an amount c and x by
amount k , then we get…
q Intercept scaled by c
q Slope scaled by c/k
y = α + βx + u
cy = (cα) + (cβ/k)(kx) + cu
n When is scaling useful?
Practical application of scaling #1
n No one wants to see a coefficient of
0.000000456 or 1,234,567,890
n Just scale the variables for cosmetic purposes!
q It will affect coefficients & SEs
q But, it won’t affect t-stats or inference
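A minimal Stata sketch of this point, using the built-in auto dataset as a stand-in for the lecture's examples: rescaling y changes the coefficient and SE by the same factor but leaves the t-statistic unchanged.
```stata
* Rescaling y for cosmetic purposes (auto data as a stand-in)
sysuse auto, clear
regress price mpg              // price measured in dollars
gen price_k = price/1000       // rescale y: price in thousands of dollars
regress price_k mpg            // coefficient and SE are 1/1000 as large; t-stat identical
```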
Practical application of scaling #2 [P1]
n To improve interpretation of magnitudes, it is helpful to scale the variables by their sample standard deviations
q Let σx and σy be sample standard deviations of
x and y respectively
q Let c, the scalar for y, be equal to 1/σy
q Let k, the scalar for x, be equal to 1/σx
q I.e. unit of x and y is now standard deviations
Practical application of scaling #2 [P2]
n With the prior rescaling, how would we
interpret a slope coefficient of 0.25?
q Answer = a 1 s.d. increase in x is associated
with ¼ s.d. increase in y
q The slope tells us how many standard
deviations y changes, on average, for a
standard deviation change in x
q Is 0.25 large in magnitude? What about 0.01?
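A minimal Stata sketch of this standard-deviation scaling, again with the auto data as a stand-in; note that egen's std() also demeans, which only shifts the intercept.
```stata
* Standard-deviation scaling (auto data as a stand-in)
sysuse auto, clear
egen z_price = std(price)      // (price - mean)/sd; demeaning only affects the intercept
egen z_mpg   = std(mpg)
regress z_price z_mpg          // slope = s.d. change in y per 1 s.d. change in x
regress price mpg, beta        // the reported "beta" matches the standardized slope
```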
Shifting the variables
wage = α+ βeducation + u
ln(wage) = α+ βeducation + u
ln( y ) = α + β x + u
y = exp(α + β x + u )
Example interpretation
n Suppose you estimated the wage equation (where
wages are $/hour) and got…
ln( y) = α + β ln( x) + u
y = α + β ln( x) + u
Model         y var    x var    Interpretation of β
Level-Level   y        x        dy = βdx
Level-Log     y        ln(x)    dy = (β/100)%dx
Log-Level     ln(y)    x        %dy = (100β)dx
Log-Log       ln(y)    ln(x)    %dy = β%dx
n See syllabus…
n Now, let's talk about what happens if you change units (i.e. scale) for either y or x in these regressions…
Rescaling logs doesn’t matter [Part 1]
n What happens to intercept & slope if rescale
(i.e. change units) of y when in log form?
n Answer = Only intercept changes; slope
unaffected because it measures proportional
change in y in Log-Level model
log( y ) = α + β x + u
log(c) + log( y ) = log(c) + α + β x + u
log(cy ) = ( log(c) + α ) + β x + u
Rescaling logs doesn’t matter [Part 2]
n Same logic applies to changing scale of x in
level-log models… only intercept changes
y = α + β log( x) + u
y + β log(c) = α + β log( x) + β log(c) + u
y = (α − β log(c) ) + β log(cx) + u
Rescaling logs doesn’t matter [Part 3]
n Basic message – If you rescale a logged variable,
it will not affect the slope coefficient because you
are only looking at proportionate changes
Log approximation problems
n I once discussed a paper where author
argued that allowing capital inflows into
country caused -120% change in stock
prices during crisis periods…
q Do you see a problem with this?
n Of course! A 120% drop in stock prices isn’t
possible. The true percentage change was -70%.
Here is where that author went wrong…
Log approximation problems [Part 1]
n Approximation error occurs because as true
%Δy becomes larger, 100Δln(y)≈%Δy
becomes a worse approximation
n To see this, consider a change from y to y’…
q Ex. #1: (y′ − y)/y = 5%, and 100Δln(y) = 4.9%
q Ex. #2: (y′ − y)/y = 75%, but 100Δln(y) = 56%
Log approximation problems [Part 2]
Log approximation problems [Part 3]
n Problem also occurs for negative changes
q Ex. #1: (y′ − y)/y = −5%, and 100Δln(y) = −5.1%
q Ex. #2: (y′ − y)/y = −75%, but 100Δln(y) = −139%
Log approximation problems [Part 4]
q So, if implied percent change is large, better to convert
it to true % change before interpreting the estimate
ln(y) = α + βx + u
ln(y′) − ln(y) = β(x′ − x)
ln(y′/y) = β(x′ − x)
y′/y = exp(β(x′ − x))
%[(y′ − y)/y] = 100[exp(β(x′ − x)) − 1]
Log approximation problems [Part 5]
n We can now use this formula to see what
true % change in y is for x’–x = 1
%[(y′ − y)/y] = 100[exp(β(x′ − x)) − 1]
%[(y′ − y)/y] = 100[exp(β) − 1]
q If β = 0.56, the percent change isn't 56%; it is 100[exp(0.56) − 1] ≈ 75%
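A quick check of this conversion in Stata; these two numbers reproduce the examples above.
```stata
* Converting log-approximation coefficients into true percent changes
display 100*(exp(0.56) - 1)    // ~75%, not 56%
display 100*(exp(-1.20) - 1)   // ~-70%, not -120% (the stock-price example above)
```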
SSE = Σi=1..N (ŷi − ȳ)²   SSE is total variation in predicted y [mean of predicted y = mean of y]
SSR = Σi=1..N ûi²   SSR is total variation in residuals [mean of residual = 0]
SSR, SST, and SSE continued…
n The total variation, SST, can be broken
into two pieces… the explained part,
SSE and unexplained part, SSR
R2 = SSE/SST = 1 – SSR/SST
More about R2
n As seen on last slide, R2 must be
between 0 and 1
n It can also be shown that R2 is equal
to the square of the correlation
between y and predicted y
n If you add an independent variable,
R2 will never go down
Adjusted R2
n Because R2 always goes up, we often use
what is called Adjusted R2
Adj. R² = 1 − (1 − R²) × (N − 1)/(N − 1 − k)
2
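A minimal Stata sketch verifying the adjusted R² formula from stored results; the auto data is a stand-in, and e(df_m) plays the role of k.
```stata
* Adjusted R2 from R2, N, and k (auto data as a stand-in)
sysuse auto, clear
regress price mpg weight
display 1 - (1 - e(r2))*(e(N) - 1)/(e(N) - 1 - e(df_m))   // matches e(r2_a)
display e(r2_a)
```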
Background readings
n Angrist and Pischke
q Sections 3.1-3.2, 3.4.1
n Wooldridge
q Sections 4.1 & 4.2
n Greene
q Chapter 3 and Sections 4.1-4.4, 5.7-5.9, 6.1-6.2
3
Announcements
n Exercise #1 is due next week
q You can download it from Canvas
q If any questions, please e-mail TA, or if
necessary, feel free to e-mail me
q When finished, upload both typed
answers and DO file to Canvas
4
Quick Review [Part 1]
n When does the CEF, E(y|x), that we approximate with OLS give causal inferences?
q Answer = If correlation between error, u, and
independent variables, x's, is zero
5
Quick Review [Part 2]
n What is interpretation of coefficients
in a log-log regression?
q Answer = Elasticity. It captures the percent
change in y for a percent change in x
6
Quick Review [Part 3]
n How should I interpret coefficient on x1 in a
multivariate regression? And, what two steps
could I use to get this?
q Answer = Effect of x1 holding other x's constant
q Can get same estimates in two steps by first
partialing out some variables and regressing
residuals on residuals in second step
7
Linear Regression – Outline
n The CEF and causality (very brief)
n Linear OLS model
n Multivariate estimation
n Hypothesis testing
q Heteroskedastic versus Homoskedastic errors
q Hypothesis tests
q Economic versus statistical significance
8
Hypothesis testing
n Before getting to hypothesis testing, which
allows us to say something like "our
estimate is statistically significant", it is
helpful to first look at OLS variance
q Understanding it and the assumptions made to
get it can help us get the right standard errors
for our later hypothesis tests
9
Variance of OLS Estimators
n Homoskedasticity implies Var(u|x) = σ2
q I.e. Variance of disturbances, u, doesn't
depend on level of observed x
n Heteroskedasticity implies Var(u|x) = f(x)
q I.e. Variance of disturbances, u, does depend
on level of x in some way
10
Variance visually…
Homoskedasticity Heteroskedasticity
11
Which assumption is more realistic?
n In investment regression, which is more realistic,
homoskedasticity or heteroskedasticity?
Investment = α + βQ + u
12
Heteroskedasticity (HEK) and bias
n Does heteroskedasticity cause bias?
q Answer = No! E(u|x)=0 (which is what we need
for consistent estimates) is something entirely
different. Heteroskedasticity just affects SEs!
q Heteroskedasticity just means that the OLS
estimate may no longer be the most efficient (i.e.
precise) linear estimator
13
Default is homoskedastic (HOK) SEs
n Default standard errors reported by
programs like Stata assume HOK
q If standard errors are heteroskedastic,
statistical inferences made from these
standard errors might be incorrect…
14
Robust standard errors (SEs)
n Use "robust" option to get standard
errors (for hypothesis testing ) that are
robust to heteroskedasticity
q Typically increases SE, but usually won't
make that big of a deal in practice
q If standard errors go down, could have
problem; use the larger standard errors!
q We will talk about clustering later…
15
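A minimal Stata sketch of default versus heteroskedasticity-robust standard errors (auto data as a stand-in for the lecture's settings).
```stata
* Default vs. robust SEs (auto data as a stand-in)
sysuse auto, clear
regress price mpg weight                 // default SEs assume homoskedasticity
estat hettest                            // Breusch-Pagan test for heteroskedasticity
regress price mpg weight, vce(robust)    // same coefficients; heteroskedasticity-robust SEs
```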
Using WLS to deal with HEK
n Weighted least squares (WLS) is sometimes
used when worried about heteroskedasticity
q WLS basically weights the observation of x using
an estimate of the variance at that value of x
q Done correctly, can improve precision of estimates
16
WLS continued… a recommendation
n Recommendation of Angrist-Pischke
[See Section 3.4.1]: don't bother with WLS
q OLS is consistent, so why bother?
Can just use robust standard errors
q Finite sample properties can be bad [and it may
not actually be more efficient]
q Harder to interpret than just using OLS [which
is still best linear approx. of CEF]
17
Linear Regression – Outline
n The CEF and causality (very brief)
n Linear OLS model
n Multivariate estimation
n Hypothesis testing
q Heteroskedastic versus Homoskedastic errors
q Hypothesis tests
q Economic versus statistical significance
18
Hypothesis tests
n Phrases like this are common: "The estimate, β̂, is statistically significant"
q What does this mean?
q Answer = "Statistical significance" is
generally meant to imply an estimate is
statistically different than zero
19
Hypothesis tests[Part 2]
n When thinking about significance, it is
helpful to remember a few things…
q Estimates of β1, β2, etc. are functions of random
variables; thus, they are random variables with
variances and covariances with each other
q These variances & covariances can be estimated
[See textbooks for various derivations]
q Standard error is just the square root of an
estimate's estimated variance
20
Hypothesis tests[Part 3]
n Reported t-stat is just telling us how
many standard deviations our sample
estimate, βˆ , is from zero
q I.e. it is testing the null hypothesis: β = 0
q p-value is just the likelihood that we would
get an estimate βˆ standard deviations away
from zero by luck if the true β = 0
21
Hypothesis tests[Part 4]
n See textbooks for more details on how to
do other hypothesis tests; E.g.
q β1 = β 2
q β1 = β 2 = β 3 = 0
22
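A minimal Stata sketch of these two hypothesis tests after OLS (auto data as a stand-in).
```stata
* Linear hypothesis tests after OLS (auto data as a stand-in)
sysuse auto, clear
regress price mpg weight length
test mpg = weight            // H0: the two slopes are equal
test mpg weight length       // H0: all three slopes are jointly zero
```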
Linear Regression – Outline
n The CEF and causality (very brief)
n Linear OLS model
n Multivariate estimation
n Hypothesis testing
q Heteroskedastic versus Homoskedastic errors
q Hypothesis tests
q Economic versus statistical significance
23
Statistical vs. Economic Significance
n These are not the same!
q Coefficient might be statistically
significant, but economically small
n You can get this in large samples, or when
you have a lot of variation in x (or outliers)
24
Economic Significance
n You should always check economic
significance of coefficients
q E.g. how large is the implied change in y
for a standard deviation change in x?
q And importantly, is that plausible? If not,
you might have a specification problem
25
Linear Regression – Outline
n The CEF and causality (very brief)
n Linear OLS model
n Multivariate estimation
n Hypothesis testing
n Miscellaneous issues
q Irrelevant regressors & multicollinearity
q Binary models and interactions
q Reporting regressions
26
Irrelevant regressors
27
Variance of OLS estimators
n Greater variance in your estimates, βˆ j ,
increases your standard errors, making it
harder to find statistically significant estimates
28
Variance formula
n Sampling variance of OLS slope is…
Var(β̂j) = σ² / [ Σi=1..N (xij − x̄j)² (1 − Rj²) ]
29
Variance formula – Interpretation
n How will more variation in x affect SE? Why?
n How will higher σ2 affect SE? Why?
n How will higher Rj2 affect SE? Why?
Var(β̂j) = σ² / [ Σi=1..N (xij − x̄j)² (1 − Rj²) ]
30
Variance formula – Variation in xj
n More variation in xj is good; smaller SE!
q Intuitive; more variation in xj helps us
identify its effect on y!
q This is why we always want larger samples;
it will give us more variation in xj
31
Variance formula – Effect of σ2
n More error variance means bigger SE
q Intuitive; a lot of the variation in y is
explained by things you didn't model
q Can add variables that affect y (even if not
necessary for identification) to improve fit!
32
Variance formula – Effect of Rj2
n But, more variables can also be bad if
they are highly collinear
q Gets harder to disentangle effect of the
variables that are highly collinear
q This is why we don't want to add variables
that are "irrelevant" (i.e. they don't affect y)
33
Multicollinearity [Part 1]
n Highly collinear variables can inflate SEs
q But, it does not cause a bias or inconsistency!
q Problem is really just one of having too small a sample; with a larger sample, one could get
more variation in the independent variables
and get more precise estimates
34
Multicollinearity [Part 2]
n Consider the following model
y = β0 + β1 x1 + β 2 x2 + β3 x3 + u
where x2 and x3 are highly correlated
q Var(β̂2) and Var(β̂3) may be large, but correlation between x2 and x3 has no direct effect on Var(β̂1)
q If x1 is uncorrelated with x2 and x3, then R1² = 0 and Var(β̂1) is unaffected
35
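A minimal simulated example in Stata (hypothetical data-generating process) of the point above: collinearity between x2 and x3 inflates their SEs but leaves the SE on x1 essentially unaffected.
```stata
* Multicollinearity inflates SEs only for the collinear regressors (simulated data)
clear
set seed 12345
set obs 1000
gen x1 = rnormal()
gen x2 = rnormal()
gen x3 = x2 + 0.1*rnormal()                        // x3 is nearly collinear with x2
gen y  = 1 + 0.5*x1 + 0.5*x2 + 0.5*x3 + rnormal()
regress y x1 x2 x3        // large SEs on x2 and x3; x1 is still estimated precisely
```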
Multicollinearity – Key Takeaways
n It doesn't cause bias
n Don't include controls that are highly
correlated with independent variable of
interest if they aren't needed for
identification [i.e. E(u|x) = 0 without them]
q But obviously, if E(u|x) ≠ 0 without these
controls, you need them!
q A larger sample will help increase precision
36
Linear Regression – Outline
n The CEF and causality (very brief)
n Linear OLS model
n Multivariate estimation
n Hypothesis testing
n Miscellaneous issues
q Irrelevant regressors & multicollinearity
q Binary models and interactions
q Reporting regressions
37
Models with interactions
n Sometimes, it is helpful for identification to add interactions between x's
q Ex. – theory suggests firms with a high value of x1
should be more affected by some change in x2
q E.g. see Rajan and Zingales (1998)
38
Interactions – Interpretation [Part 1]
n According to this model, what is the effect of
increasing x1 on y, holding all else equal?
y = β 0 + β1 x1 + β 2 x2 + β3 x1 x2 + u
q Answer:
Δy = (β1 + β3x2)Δx1
dy/dx1 = β1 + β3x2
39
Interactions – Interpretation [Part 2]
n If β3 < 0, how does a higher x2 affect the
partial effect of x1 on y?
dy/dx1 = β1 + β3x2
q Answer: The increase in y for a given change in
x1 will be smaller in levels (not necessarily in
absolute magnitude) for firms with a higher x2
40
Interactions – Interpretation [Part 3]
n Suppose, β1 > 0 and β3 < 0 … what is
the sign of the effect of an increase in x1
for the average firm in the population?
dy/dx1 = β1 + β3x2
q Answer: It is the sign of dy/dx1 evaluated at x2 = x̄2, i.e. β1 + β3x̄2
41
A very common mistake! [Part 1]
42
A very common mistake! [Part 2]
n To improve interpretation of β1, you can
reparameterize the model by demeaning
each variable in the model, and estimate
ỹ = δ0 + δ1x̃1 + δ2x̃2 + δ3x̃1x̃2 + u
where ỹ = y − μy,  x̃1 = x1 − μx1,  x̃2 = x2 − μx2
43
A very common mistake! [Part 3]
n You can then show… Δy = (δ1 + δ3x̃2)Δx1
and thus, dy/dx1 = δ1 + δ3(x2 − μ2)
and, evaluated at x2 = μ2,  dy/dx1 = δ1
44
The main takeaway – Summary
n If you want the coefficients on non-interacted variables to reflect the effect
of that variable for the "average" firm,
demean all your variables before
running the specification
45
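A minimal Stata sketch of the demeaning reparameterization; the auto data is a stand-in, with price, weight, and mpg substituting for y, x1, and x2.
```stata
* Demean before interacting so non-interacted coefficients are effects "at the mean"
sysuse auto, clear
summarize weight, meanonly
gen weight_c = weight - r(mean)
summarize mpg, meanonly
gen mpg_c = mpg - r(mean)
gen inter_c = weight_c*mpg_c
regress price weight_c mpg_c inter_c   // coef. on weight_c = effect of weight at the mean of mpg
```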
Indicator (binary) variables
46
Motivation
n Indicator variables, also known as binary
variables, are quite popular these days
q Ex. #1 – Sex of CEO (male, female)
q Ex. #2 – Employment status (employed, unemployed)
q Also see in many diff-in-diff specifications
n Ex. #1 – Size of firm (above vs. below median)
n Ex. #2 – Pay of CEO (above vs. below median)
47
How they work
n Code the information using dummy variable
q Ex. #1: Malei = 1 if person i is male, 0 otherwise
q Ex. #2: Largei = 1 if Ln(assets) of firm i > median, 0 otherwise
48
Single dummy variable model
n Consider wage = β0 + δ 0 female + β1educ + u
n δ0 measures difference in wage between male
and female given same level of education
q E(wage|female = 0, educ) = β0 + β1educ
q E(wage|female = 1, educ) = β0 + δ0 + β1educ
q Thus, E(wage|f = 1, educ) – E(wage|f = 0, educ) = δ0
49
Single dummy just shifts intercept!
n When δ0 < 0, we have visually… [figure: two parallel lines with common slope β1; the female line's intercept is shifted down by δ0]
50
Single dummy example – Wages
n Suppose we estimate the following wage model
51
Log dependent variable & indicators
n Nothing new; coefficient on indicator has %
interpretation. Consider following example…
ln( price) = −1.35 + 0.17ln(lotsize) + 0.71ln( sqrft )
+0.03bdrms + 0.054colonial
q Again, negative intercept meaningless; all other
variables are never all equal to zero
q Interpretation = colonial style home costs about
5.4% more than "otherwise similar" homes
52
Multiple indicator variables
n Suppose you want to know how much lower
wages are for married and single females
q Now have 4 possible outcomes
n Single & male
n Married & male
n Single & female
n Married & female
53
But, which to exclude?
n We have to exclude one of the four
because they are perfectly collinear with
the intercept, but does it matter which?
q Answer: No, not really. It just affects the interpretation. Estimates of included indicators will be relative to the excluded indicator
q For example, if we exclude "single & male",
we are estimating partial change in wage
relative to that of single males
54
But, which to exclude? [Part 2]
55
Multiple indicators – Example
n Consider the following estimation results…
ln(wage) = 0.3 + 0.21 marriedMale − 0.20 marriedFemale − 0.11 singleFemale + 0.08 education
56
Interactions with Indicators
n We could also do prior regression instead
using interactions between indicators
q I.e. construct just two indicators, 'female' and
'married' and estimate the following
57
Interactions with Indicators [Part 2]
n Before we had,
ln(wage) = 0.3 + 0.21 marriedMale − 0.20 marriedFemale − 0.11 singleFemale + 0.08 education
n Now, we will have,
ln( wage) = 0.3 − 0.11 female + 0.21married
−0.30 ( female × married ) + 0.08education
58
Interactions with Indicators [Part 3]
n Answer: It will be the same!
ln( wage) = 0.32 − 0.11 female + 0.21married
−0.30 ( female × married ) + ...
59
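A minimal Stata sketch of estimating indicator interactions with factor-variable notation; the auto data and a constructed 'heavy' indicator are stand-ins for the lecture's female/married wage example.
```stata
* Indicator interactions with factor variables (auto data as a stand-in)
sysuse auto, clear
gen lnprice = ln(price)
gen heavy = (weight > 3000)                    // constructed indicator for illustration
regress lnprice i.foreign##i.heavy mpg         // main effects plus the interaction
* effect of being both foreign and heavy, relative to the omitted (neither) group:
lincom 1.foreign + 1.heavy + 1.foreign#1.heavy
```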
Indicator Interactions – Example
n Krueger (1993) found…
ln( wage) = βˆ0 + 0.18compwork + 0.07comphome
+0.02 ( compwork × comphome ) + ...
60
Indicator Interactions – Example [part 2]
n Remember, these are just approximate percent
changes… To get true change, need to convert
q E.g. % change in wages for having computers at both
home and work is given by
100*[exp(0.18+0.07+0.02) – 1] = 31%
61
Interacting Indicators w/ Continuous
n Adding dummies alone will only shift
intercepts for different groups
n But, if we interact these dummies with
continuous variables, we can get different
slopes for different groups as well
q See next slide for an example of this
62
Continuous Interactions – Example
n Consider the following
ln( wage) = β0 + δ 0 female + β1educ + δ1 ( female × educ ) + u
63
Visual #1 of Example
ln( wage) = β0 + δ 0 female + β1educ + δ1 ( female × educ ) + u
In this example…
q Females earn lower wages
at all levels of education
q Avg. increase per unit of
education is also lower
64
Visual #2 of Example
ln( wage) = β0 + δ 0 female + β1educ + δ1 ( female × educ ) + u
In this example…
q Wage is lower for females
but only for lower levels
of education because their
slope is larger
65
Cautionary Note on Different Slopes!
n Crossing point (where women earn higher
wages) might occur outside the data
(i.e. at education levels that don't exist)
q Need to solve for crossing point before
making this claim about the data
Women : ln( wage) = β 0 + δ 0 + ( β1 + δ1 ) educ + u
Men : ln( wage) = β 0 + β1educ + u
66
Cautionary Note on Interpretation!
n Interpretation of non-interacted terms when
using continuous variables is tricky
n E.g., consider the following estimates
ln( wage) = 0.39 − 0.23 female + 0.08educ − .01( female × educ )
67
Cautionary Note [Part 2]
n Again, interpretation of non-interacted
variables does not equal average effect unless
you demean the continuous variables
q In prior example estimate the following:
ln( wage) = β 0 + δ 0 female + β1 ( educ − µeduc )
+δ1 female × ( educ − µeduc )
q Now, δ0 tells us how much lower the wage is of
women at the average education level
68
Cautionary Note [Part 3]
n Recall! As we discussed in prior lecture, the
slopes won't change because of the shift
q Only the intercepts, β0 and β0 + δ0 , and their
standard errors will change
69
Ordinal Variables
n Consider credit ratings: CR ∈ ( AAA, AA,..., C , D)
n If want to explain interest rate, IR, with
ratings, we could convert CR to numeric scale,
e.g. AAA = 1, AA = 2, … and estimate
IRi = β 0 + β1CRi + ui
q But, what are we implicitly assuming, and how
might it be a problematic assumption?
70
Ordinal Variables continued…
n Answer: We assumed a constant linear
relation between interest rates and CR
q I.e. Moving from AAA to AA produces same
change as moving from BBB to BB
q Could take log interest rate, but is a constant proportional relationship much better? Not really…
71
Convert ordinal to indicator variables
n E.g. let CRAAA = 1 if CR = AAA, 0 otherwise;
CRAA = 1 if CR = AA, 0 otherwise, etc.
n Then, run this regression
IRi = β 0 + β1CRAAA + β 2CRAA + ... + β m−1CRC + ui
q Remember to exclude one (e.g. "D")
72
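A minimal Stata sketch of the ordinal-to-indicators idea; the auto data's 1-5 repair record stands in for credit ratings.
```stata
* Ordinal variable treated as linear vs. as separate indicators (auto data as a stand-in)
sysuse auto, clear
regress price rep78              // assumes each one-step change in the rating has the same effect
regress price i.rep78            // separate indicator per category; one omitted automatically
regress price ib(last).rep78     // choose which category to omit as the base
```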
Linear Regression – Outline
n The CEF and causality (very brief)
n Linear OLS model
n Multivariate estimation
n Hypothesis testing
n Miscellaneous issues
q Irrelevant regressors & multicollinearity
q Binary models and interactions
q Reporting regressions
73
Reporting regressions
n Table of OLS outputs should
generally show the following…
q Dependent variable [clearly labeled]
q Independent variables
q Est. coefficients, their corresponding
standard errors (or t-stat), and stars
indicating level of stat. significance
q Adjusted R2
q # of observations in each regression
74
Reporting regressions [Part 2]
n In body of paper…
q Focus only on variable(s) of interest
n Tell us their sign, magnitude, statistical &
economic significance, interpretation, etc.
75
Reporting regressions [Part 3]
n And last, but not least, don't report
regressions in tables that you aren't going to
discuss and/or mention in the paper's body
q If it's not important enough to mention in the
paper, it's not important enough to be in a table
76
Summary of Today [Part 1]
n Irrelevant regressors and multi-
collinearity do not cause bias
q But, can inflate standard errors
q So, avoid adding unnecessary controls
77
Summary of Today [Part 2]
n Interactions and binary variables
can help us get a causal CEF
q But, if you want to interpret non-interacted
indicators it is helpful to demean continuous var.
78
In First Half of Next Class
n Discuss causality and potential biases
q Omitted variable bias
q Measurement error bias
q Simultaneity bias
79
Assign papers for next week…
n Fazzari, et al (BPEA 1988)
q Finance constraints & investment
n Morck, et al (BPEA 1990)
q Stock market & investment
[These classic papers in finance use rather simple estimations, and 'identification' was not a foremost concern]
80
Break Time
n Let's take our 10 minute break
n We'll do presentations when we get back
81
FNCE 926
Empirical Methods in CF
Lecture 3 – Causality
2
Background readings for today
n Roberts-Whited
q Section 2
n Angrist and Pischke
q Section 3.2
n Wooldridge
q Sections 4.3 & 4.4
n Greene
q Sections 5.8-5.9
3
Outline for Today
n Quick review
n Motivate why we care about causality
n Describe three possible biases & some
potential solutions
q Omitted variable bias
q Measurement error bias
q Simultaneity bias
4
Quick Review [Part 1]
n Why is adding irrelevant regressors a
potential problem?
q Answer = It can inflate standard errors if the
irrelevant regressors are highly collinear with
variable of interest
5
Quick Review [Part 2]
n Suppose, β1 < 0 and β3 > 0 … what is the
sign of the effect of an increase in x1 for the
average firm in the below estimation?
y = β 0 + β1 x1 + β 2 x2 + β3 x1 x2 + u
q Answer: It is the sign of
dy x2 = x2
| = β1 + β3 x2
dx1
6
Quick Review [Part 3]
n How could we make the coefficients easier
to interpret in the prior example?
q Shift all the variables by subtracting out their
sample mean before doing the estimation
q It will allow the non-interacted coefficients to be
interpreted as effect for average firm
7
Quick Review [Part 4]
n Consider the following estimate:
ln( wage) = 0.32 − 0.11 female + 0.21married
−0.30 ( female × married ) + 0.08education
8
Outline for Today
n Quick review
n Motivate why we care about causality
n Describe three possible biases & some
potential solutions
q Omitted variable bias
q Measurement error bias
q Simultaneity bias
9
Motivation
n As researchers, we are interested in
making causal statements
q Ex. #1 – what is the effect of a change in
corporate taxes on firms' leverage choice?
q Ex. #2 – what is the effect of giving a CEO
more stock ownership in the firm on the
CEO's desire to take on risky investments?
10
What do we mean by causality?
n Recall from earlier lecture, that if our linear
model is the following…
y = β 0 + β1 x1 + ... + β k xk + u
11
The basic assumptions
n Assumption #1: E(u) = 0
n Assumption #2: E(u|x1,…,xk) = E(u)
q In words, average of u (i.e. unexplained portion
of y) does not depend on value of x
q This is "conditional mean independence" (CMI)
n Generally speaking, you need the estimation
error to be uncorrelated with all the x's
12
Tangent – CMI versus correlation
n CMI (which implies x and u are
uncorrelated) is needed for unbiasedness
[which is again a finite sample property]
n But, we only need to assume a zero
correlation between x and u for consistency
[which is a large sample property]
q This is why I'll typically just refer to whether
u and x are correlated in my test of whether
we can make causal inferences
13
Three main ways this will be violated
n Omitted variable bias
n Measurement error bias
n Simultaneity bias
14
Omitted variable bias (OVB)
n Probably the most common concern you
will hear researchers worry about
n Basic idea = the estimation error, u,
contains other variable, e.g. z, that affects
y and is correlated with an x
q Please note! The omitted variable is only
problematic if correlated with an x
15
OVB more formally, with one variable
n You estimate: y = β 0 + β1 x + u
n But, true model is: y = β 0 + β1 x + β 2 z + v
δxz = cov(x, z) / var(x)
16
Interpreting the OVB formula
β̂1 = β1 + β2 × cov(x, z)/var(x)
[β1 = effect of x on y; β2 = effect of z on y; cov(x, z)/var(x) = regression of z on x; the second term is the bias]
17
Direction and magnitude of the bias
β̂1 = β1 + β2 × cov(x, z)/var(x)
18
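A minimal simulated check of the OVB formula in Stata (hypothetical data-generating process).
```stata
* Omitted variable bias and the bias formula (simulated data)
clear
set seed 12345
set obs 10000
gen z = rnormal()
gen x = 0.5*z + rnormal()                // x and z positively correlated
gen y = 1 + 1*x + 2*z + rnormal()        // true model: beta1 = 1, beta2 = 2
regress y x z                            // with z included: beta1_hat ~ 1
regress y x                              // z omitted: beta1_hat is biased upward
regress z x                              // slope = cov(x,z)/var(x) = delta_xz
display 1 + 2*_b[x]                      // beta1 + beta2*delta_xz matches the biased estimate
```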
Example – One variable case
n Suppose we estimate: ln( wage) = β 0 + β1educ + w
n But, true model is:
ln( wage) = β 0 + β1educ + β 2 ability + u
β̂1 = β1 + β2 × cov(educ, ability)/var(educ)
19
Example – Answer
q Ability & wages likely positively correlated, so β2 > 0
q Ability & education likely positive correlated, so
cov(educ, ability) > 0
q Thus, the bias is likely to be positive! β̂1 is too big!
20
OVB – General Form
n Once move away from simple case of just one
omitted variable, determining sign (and
magnitude) of bias will be a lot harder
q Let β be vector of coefficients on k included variables
q Let γ be vector of coefficient on l excluded variables
q Let X be matrix of observations of included variables
q Let Z be matrix of observations of excluded variables
β̂ = β + E[X'X]⁻¹ E[X'Z] γ
21
OVB – General Form, Intuition
β̂ = β + E[X'X]⁻¹ E[X'Z] γ
[E[X'X]⁻¹E[X'Z] is the matrix of coefficients from regressing the excluded variables on the included variables; γ is the vector of partial effects of the excluded variables]
22
Eliminating Omitted Variable Bias
23
Observable omitted variables
n This is easy! Just add them as controls
q E.g. if the omitted variable, z, in my simple case
was 'leverage', then add leverage to regression
n A functional form misspecification is a special
case of an observable omitted variable
Let's now talk about this…
24
Functional form misspecification
n Assume true model is…
y = β0 + β1x1 + β2x2 + β3x2² + u
n But, we omit the squared term, x2²
q Just like any OVB, bias on (β0, β1, β2) will depend on β3 and correlations among (x1, x2, x2²)
q You get same type of problem if have incorrect functional form for y [e.g. it should be ln(y) not y]
n In some sense, this is a minor problem… Why?
25
Tests for correct functional form
n You could add additional squared and
cubed terms and look to see whether
they make a difference and/or have
non-zero coefficients
n This isn't as easy when the possible
models are not nested…
26
Non-nested functional form issues…
n Two non-nested examples are:
y = β0 + β1x1 + β2x2 + u
versus
y = β0 + β1ln(x1) + β2ln(x2) + u
[Let's use this example and see how we can try to figure out which is right]

y = β0 + β1x1 + β2x2 + u
versus
y = β0 + β1x1 + β2z + u
27
Davidson-MacKinnon Test [Part 1]
n To test which is correct, you can try this…
q Take fitted values, ŷ , from 1st model and add them
as a control in 2nd model
y = β0 + β1ln(x1) + β2ln(x2) + θ1ŷ + u
q Look at the t-stat on θ1; if significant, it rejects the 2nd model!
q Then, do the reverse, and look at the t-stat on θ1 in
y = β0 + β1x1 + β2x2 + θ1ŷ′ + u
where ŷ′ is the predicted value from the 2nd model… if significant, then the 1st model is also rejected
28
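A minimal Stata sketch of the test's mechanics; the auto data and the levels-vs-logs comparison here are stand-ins for the lecture's example.
```stata
* Davidson-MacKinnon test of non-nested functional forms (auto data as a stand-in)
sysuse auto, clear
gen ln_mpg = ln(mpg)
gen ln_weight = ln(weight)
regress price mpg weight                    // model 1: levels
predict yhat1, xb
regress price ln_mpg ln_weight yhat1        // if yhat1 is significant, model 2 is rejected
regress price ln_mpg ln_weight              // model 2: logs
predict yhat2, xb
regress price mpg weight yhat2              // if yhat2 is significant, model 1 is rejected
```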
Davidson-MacKinnon Test [Part 2]
n Number of weaknesses to this test…
q A clear winner may not emerge
n Both might be rejected
n Both might be accepted [If this happens, you can
use the R2 to choose which model is a better fit]
q And, rejecting one model does NOT imply that the other model is correct
29
Bottom line advice on functional form
n Practically speaking, you hope that changes
in functional form won't affect coefficients
on key variables very much…
q But, if it does… You need to think hard about
why this is and what the correct form should be
q The prior test might help with that…
30
Eliminating Omitted Variable Bias
31
Unobserved omitted variables
n Again, consider earlier estimation
32
Proxy variables [Part 1]
n Consider the following model…
y = β 0 + β1 x1 + β 2 x2 + β3 x3* + u
where x3* is unobserved, but we have proxy x3
n Then, suppose x3* = δ0 + δ1x3 + v
q v is error associated with proxy's imperfect
representation of unobservable x3
q Intercept just accounts for different scales
[e.g. ability has different average value than IQ]
33
Proxy variables [Part 2]
n If we are only interested in β1 or β2, we can just
replace x3* with x3 and estimate
y = β 0 + β1 x1 + β 2 x2 + β3 x3 + u
34
Proxy variables – Assumptions
#1 – E(u | x1, x2, x3*) = 0; i.e. we have the right model, and x3 would be irrelevant if we could control for x1, x2, x3*, such that E(u | x3) = 0
q This is a common assumption; not controversial
35
Why the proxy works…
n Recall true model: y = β0 + β1x1 + β2x2 + β3x3* + u
36
Proxy assumptions are key [Part 1]
n Suppose assumption #2 is wrong such that
x3* = δ0 + δ1x3 + γ1x1 + γ2x2 + w   [where γ1x1 + γ2x2 + w plays the role of v]
where E(w | x1, x2, x3) = 0
37
Proxy assumptions are key [Part 2]
n Plugging in for x3*, you'd get
y = α0 + α1x1 + α2x2 + α3x3 + e
where α0 = β0 + β3δ0,  α1 = β1 + β3γ1,  α2 = β2 + β3γ2,  α3 = β3δ1
[E.g. α1 captures the effect of x1 on y, β1, but also its correlation with the unobserved variable]
38
Proxy variables – Example #1
n Consider earlier wage estimation
ln( wage) = β 0 + β1educ + β 2 ability + u
39
Proxy variables – Example #2
n Consider Q-theory of investment
investment = β 0 + β1Q + u
40
Proxy variables – Example #2 [Part 2]
n Even if assumptions held, we'd only be getting
consistent estimates of
investment = α 0 + α1Q + e
where α 0 = β0 + β1δ 0
α1 = β1δ1
41
Proxy variables – Summary
42
Random Coefficient Model
n So far, we've assumed that the effect of x on y
(i.e. β) was the same for all observations
q In reality, this is unlikely true; model might look
more like yi = αi + βixi + ui, where αi = α + ci, βi = β + di, and E(ci) = E(di) = 0
[I.e. each observation's relationship between x and y is slightly different]
43
Random Coefficient Model [Part 2]
n Regression would seem to be incorrectly
specified, but if willing to make assumptions,
we can identify the APE
q Plug in for α and β:  yi = α + βxi + (ci + dixi + ui)
[If you like, can think of the unobserved differential intercepts and slopes as omitted variables]
q Identification requires E(ci + dixi + ui | x) = 0
44
Random Coefficient Model [Part 3]
n This amounts to requiring
E ( ci | x ) = E ( ci ) = 0 ⇒ E (α i | x ) = E (α i )
E ( di | x ) = E ( di ) = 0 ⇒ E ( βi | x ) = E ( βi )
45
Random Coefficient Model [Part 4]
n Implications of APE
q Be careful interpreting coefficients when
you are implicitly arguing elsewhere in paper
that effect of x varies across observations
n Keep in mind the assumption this requires
n And, describe results using something like…
"we find that, on average, an increase in x
causes a β change in y"
46
Three main ways this will be violated
n Omitted variable bias
n Measurement error bias
n Simultaneity bias
47
Measurement error (ME) bias
n Estimation will have measurement error whenever
we measure the variable of interest imprecisely
q Ex. #1: Altman-z-score is noisy measure of default risk
q Ex. #2: Avg. tax rate is noisy measure of marg. tax rate
48
Measurement error vs. proxies
n Measurement error is similar to proxy variable,
but very different conceptually
q Proxy is used for something that is entirely unobservable or unmeasurable (e.g. ability)
q With measurement error, the variable we don't
observe is well-defined and can be quantified… it's
just that our measure of it contains error
49
ME of Dep. Variable [Part 1]
n Usually not a problem (in terms of bias); just
causes our standard errors to be larger. E.g. …
q Let y* = β0 + β1x1 + ... + βkxk + u, where y* is the true value and the observed y = y* + e
50
ME of Dep. Variable [Part 2]
n As long as E(e|x)=0, the OLS estimates
are consistent and unbiased
q I.e. as long as the measurement error of y is
uncorrelated with the x's, we're okay
q Only issue is that we get larger standard errors
when e and u are uncorrelated [which is what
we typically assume] because Var(u+e)>Var(u)
51
ME of Dep. Variable [Part 3]
n Some common examples
q Market leverage – typically use book value
of debt because market value hard to observe
q Firm value – again, hard to observe market
value of debt, so we use book value
q CEO compensation – value of options are
approximated using Black-Scholes
52
ME of Dep. Variable [Part 4]
n Answer = Maybe… maybe not
q Ex. – Firm leverage is measured with error; hard to
observe market value of debt, so we use book value
n But, the measurement error is likely to be larger when firms
are in distress… Market value of debt falls; book value doesn't
n This error could be correlated with x's if it includes things like
profitability (i.e. ME larger for low profit firms)
n This type of ME will cause inconsistent estimates
53
ME of Independent Variable [Part 1]
n Let's assume the model is y = β0 + β1x* + u
n But, we observe x* with error, e = x − x*
q We assume that E(y|x*, x) = E(y|x*) [i.e. x
doesn't affect y after controlling for x*; this is
standard and uncontroversial because it is just
stating that we've written the correct model]
54
ME of Independent Variable [Part 2]
n There are lots of examples!
q Average Q measures marginal Q with error
q Altman-z score measures default prob. with error
q GIM, takeover provisions, etc. are all just noisy
measures of the nebulous "governance" of firm
55
ME of Independent Variable [Part 2]
n Answer depends crucially on what we assume
about the measurement error, e
n Literature focuses on two extreme assumptions
#1 – Measurement error, e, is uncorrelated
with the observed measure, x
#2 – Measurement error, e, is uncorrelated
with the unobserved measure, x*
56
Assumption #1: e uncorrelated with x
n Substituting x* with what we actually
observe, x* = x – e, into true model, we have
y = β 0 + β1 x + u − β1e
q Is there a bias?
n Answer = No. x is uncorrelated with e by assumption,
and x is uncorrelated with u by earlier assumptions
57
Assumption #2: e uncorrelated with x*
n We are still estimating y = β 0 + β1 x + u − β1e ,
but now, x is correlated with e
q e uncorrelated with x* guarantees e is correlated
with x; cov( x, e) = E ( xe) = E ( x*e) + E (e2 ) = σ e2
q I.e. an independent variable will be correlated with
the error… we will get biased estimates!
58
CEV with 1 variable = attenuation bias
n If work out math, one can show that the
estimate of β1, βˆ1 , in prior example (which
had just one independent variable) is…
plim(β̂1) = β1 × σ²x* / (σ²x* + σ²e)
[This scaling factor is always between 0 and 1]
q The estimate is always biased towards zero; i.e. it
is an attenuation bias
n And, if variance of error, σ e2, is small, then attenuation
bias won't be that bad
59
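A minimal simulated example of classical measurement error and the attenuation factor (hypothetical data-generating process).
```stata
* Attenuation bias under classical measurement error (simulated data)
clear
set seed 12345
set obs 10000
gen xstar = rnormal()                   // true regressor, var = 1
gen e     = rnormal(0, 0.7)             // measurement error, var = 0.49, uncorrelated with xstar
gen x     = xstar + e                   // observed, mismeasured regressor
gen y     = 1 + 2*xstar + rnormal()     // true slope = 2
regress y xstar                         // ~2 when the true regressor is observed
regress y x                             // ~2*(1/1.49) ~ 1.34: attenuated toward zero
```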
Measurement error… not so bad?
n Under current setup, measurement error
doesn't seem so bad…
q If error uncorrelated with observed x, no bias
q If error uncorrelated with unobserved x*, we
get an attenuation bias… so at least the sign
on our coefficient of interest is still correct
60
Nope, measurement error is bad news
n Truth is, measurement error is
probably correlated a bit with both the
observed x and unobserved x*
q I.e… some attenuation bias is likely
61
ME with more than one variable
n If estimating y = β 0 + β1 x1 + ... + β k xk + u , and
just one of the x's is mismeasured, then…
q ALL the β's will be biased if the mismeasured
variable is correlated with any other x
[which presumably is true since it was included!]
q Sign and magnitude of biases will depend on all
the correlations between x's; i.e. big mess!
n See Gormley and Matsa (2014) math for AvgE
estimator to see how bad this can be
62
ME example
n Fazzari, Hubbard, and Petersen (1988) is
classic example of a paper with ME problem
q Regresses investment on Tobin's Q (its measure of investment opportunities) and cash
q Finds positive coefficient on cash; argues there
must be financial constraints present
q But Q is noisy measure; all coefficients are biased!
63
Three main ways this will be violated
n Omitted variable bias
n Measurement error bias
n Simultaneity bias
64
Simultaneity bias
n This will occur whenever any of the supposedly
independent variables (i.e. the x's) can be
affected by changes in the y variable; E.g.
y = β 0 + β1 x + u
x = δ 0 + δ1 y + v
q I.e. changes in x affect y, and changes in y affect x;
this is the simplest case of reverse causality
q An estimate of y = β0 + β1 x + u will be biased…
65
Simultaneity bias continued…
n To see why estimating y = β 0 + β1 x + u won't
reveal the true β1, solve for x
x = δ0 + δ1y + v
x = δ0 + δ1(β0 + β1x + u) + v
x = (δ0 + δ1β0)/(1 − δ1β1) + v/(1 − δ1β1) + [δ1/(1 − δ1β1)]u
⇒ x is correlated with u, so OLS will not recover the true β1
66
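A minimal simulated example of simultaneity bias (hypothetical parameter values); x is generated from its reduced form so that it embeds u, and OLS overstates β1.
```stata
* Simultaneity (reverse causality) bias (simulated data)
* true system: y = 1 + 0.5*x + u  and  x = 1 + 0.5*y + v
clear
set seed 12345
set obs 10000
gen u = rnormal()
gen v = rnormal()
* reduced form: x = (d0 + d1*b0)/(1 - d1*b1) + (v + d1*u)/(1 - d1*b1)
gen x = (1 + 0.5*1)/(1 - 0.5*0.5) + (v + 0.5*u)/(1 - 0.5*0.5)
gen y = 1 + 0.5*x + u
regress y x        // slope estimate is biased upward (roughly 0.8 here), not 0.5
```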
Simultaneity bias in other regressors
n Prior example is case of reverse causality; the
variable of interest is also affected by y
n But, if y affects any x, there will be a bias; E.g.
y = β 0 + β1 x1 + β 2 x2 + u
x2 = γ 0 + γ 1 y + w
q Easy to show that x2 is correlated with u; and there
will be a bias on all coefficients
q This is why people use lagged x's
67
"Endogeneity" problem – Tangent
n In my opinion, the prior example is
what it means to have an "endogeneity" problem or an "endogenous" variable
q But, as I mentioned earlier, there is a lot of
misusage of the word "endogeneity" in
finance… So, it might be better just saying
"simultaneity bias"
68
Simultaneity Bias – Summary
n If your x might also be affected by the y
(i.e. reverse causality), you won't be able to
make causal inferences using OLS
q Instrumental variables or natural experiments
will be helpful with this problem
69
"Bad controls"
n Similar to simultaneity bias… this is when
one x is affected by another x; e.g.
y = β 0 + β1 x1 + β 2 x2 + u
x2 = γ 0 + γ 1 x1 + v
q Angrist-Pischke call this a "bad control", and it
can introduce a subtle selection bias when
working with natural experiments
[we will come back to this in later lecture]
70
"Bad Controls" – TG's Pet Peeve
n But just to preview it… If you have an x
that is truly exogenous (i.e. random) [as you
might have in a natural experiment], do not put in controls that are also affected by x!
q Only add controls unaffected by x, or just
regress your various y's on x, and x alone!
71
What is Selection Bias?
n Easiest to think of it just as an omitted
variable problem, where the omitted
variable is the unobserved counterfactual
q Specifically, error, u, contains some unobserved
counterfactual that is correlated with whether
we observe certain values of x
q I.e. it is a violation of the CMI assumption
72
Selection Bias – Example
n Mean health of hospital visitors = 3.21
n Mean health of non-visitors = 3.93
q Can we conclude that going to the hospital
(i.e. the x) makes you less healthy?
n Answer = No. People going to the hospital are
inherently less healthy [this is the selection bias]
n Another way to say this: we fail to control for what
health outcomes would be absent the visit, and this
unobserved counterfactual is correlated with going
to hospital or not [i.e. omitted variable]
73
Selection Bias – More later
n We'll treat it more formally later when
we get to natural experiments
74
Summary of Today [Part 1]
n We need conditional mean independence
(CMI), to make causal statements
n CMI is violated whenever an independent
variable, x, is correlated with the error, u
n Three main ways this can be violated
q Omitted variable bias
q Measurement error bias
q Simultaneity bias
75
Summary of Today [Part 2]
n The biases can be very complex
q If more than one omitted variable, or omitted
variable is correlated with more than one
regressor, sign of bias hard to determine
q Measurement error of an independent
variable can (and likely does) bias all
coefficients in ways that are hard to determine
q Simultaneity bias can also be complicated
76
Summary of Today [Part 3]
n To deal with these problems, there are
some tools we can use
q E.g. Proxy variables [discussed today]
q We will talk about other tools later, e.g.
n Instrumental variables
n Natural experiments
n Regression discontinuity
77
In First Half of Next Class
n Before getting to these other tools, will first
discuss panel data & unobserved heterogeneity
q Using fixed effects to deal with unobserved variables
n What are the benefits? [There are many!]
n What are the costs? [There are some…]
78
Assign papers for next week…
n Rajan and Zingales (AER 1998)
q Financial development & growth
79
Break Time
n Let's take our 10 minute break
n We'll do presentations when we get back
80
FNCE 926
Empirical Methods in CF
Lecture 4 – Panel Data
2
Background readings
n Angrist and Pischke
q Sections 5.1, 5.3
n Wooldridge
q Chapter 10 and Sections 13.9.1, 15.8.2, 15.8.3
n Greene
q Chapter 11
3
Outline for Today
n Quick review
n Motivate how panel data is helpful
q Fixed effects model
q Random effects model
q First differences
q Lagged y models
4
Quick Review [Part 1]
n What is the key assumption needed for us
to make causal inferences? And what are
the ways in which it can be violated?
q Answer = CMI is violated whenever an
independent variable, x, is correlated with the
error, u. This occurs when there is…
n Omitted variable bias
n Measurement error bias
n Simultaneity bias
5
Quick Review [Part 2]
n When is it possible to determine the sign of
an omitted variable bias?
q Answer = Basically, when there is just one
OMV that is correlated with just one of the x's;
other scenarios are much more complicated
6
Quick Review [Part 3]
n When is measurement error of the
dependent variable problematic (for
identifying the causal CEF)?
q Answer = If error is correlated with any x.
7
Quick Review [Part 4]
n What is the bias on the coefficient of x,
and on other coefficients when an indep-
endent variable, x, is measured with error?
q Answer = Hard to know!
n If ME is uncorrelated with observed x, no bias
n If ME is uncorrelated with unobserved x*, the
coefficient on x has an attenuation bias, but the
sign of the bias on all other coefficients is unclear
8
Quick Review [Part 5]
n When will an estimation suffer from
simultaneity bias?
q Answer = If we can think of any x as a
potential outcome variable; i.e. we think y
might directly affect an x
9
Outline for Panel Data
n Motivate how panel data is helpful
n Fixed effects model
q Benefits [There are many]
q Costs [There are some…]
10
Motivation [Part 1]
n As noted in prior lecture, omitted
variables pose a substantial hurdle in
our ability to make causal inferences
n What's worse… many of them are
inherently unobservable to researchers
11
Motivation [Part 2]
n E.g. consider the firm-level estimation
leveragei , j ,t = β 0 + β1 profiti , j ,t −1 + ui , j ,t
where leverage is debt/assets for firm i,
operating in industry j in year t, and profit is
the firm's net income/assets
12
Motivation [Part 3]
n Oh, there are so, so many…
q Managerial talent and/or risk aversion
q Industry supply and/or demand shock
q Cost of capital
q Investment opportunities
q And so on…
[Sadly, this is easy to do with other dependent or independent variables…]
13
Motivation [Part 4]
n Using observations from various
geographical regions (e.g. state or country)
opens up even more possibilities…
q Can you think of some unobserved variables
that might be related to a firm's location?
n Answer: any unobserved differences in local economic
environment, e.g. institutions, protection of property
rights, financial development, investor sentiment,
regional demand shocks, etc.
14
Motivation [Part 5]
n Sometimes, we can control for these
unobservable variables using proxy variables
q But, what assumption was required for a
proxy variable to provide consistent
estimates on the other parameters?
n Answer: It needs to be a sufficiently good proxy such
that the unobserved variable can't be correlated with
the other explanatory variables after we control for
the proxy variable… This might be hard to find
15
Panel data to the rescue…
n Thankfully, panel data can help us with a
particular type of unobserved variable…
16
Outline for Panel Data
n Motivate how panel data is helpful
n Fixed effects model
q Benefits [There are many]
q Costs [There are some…]
17
Panel data
n Panel data = whenever you have multiple
observations per unit of observation i (e.g.
you observe each firm over multiple years)
q Let's assume N units i
q And, T observations per unit i [i.e. balanced panel]
n Ex. #1 – You observe 5,000 firms in Compustat
over a twenty year period [i.e. N=5,000, T=20]
n Ex. #2 – You observe 1,000 CEOs in Execucomp
over a 10 year period [i.e. N=1,000, T=10]
18
Time-invariant unobserved variable
n Consider the following model…
yi,t = α + βxi,t + δfi + ui,t
where f is an unobserved, time-invariant variable
19
If we ignore f, we get OVB
n If we estimate the model…
yi,t = α + βxi,t + vi,t,  where vi,t = δfi + ui,t
20
Can solve this by transforming data
n First, notice that if you take the population
mean of the dependent variable for each
unit of observation, i, you get…
ȳi = α + βx̄i + δfi + ūi   [Again, I assumed there are T obs. per unit i]
where ȳi = (1/T)Σt yi,t,  x̄i = (1/T)Σt xi,t,  ūi = (1/T)Σt ui,t
21
Transforming data [Part 2]
n Now, if we subtract ȳi from yi,t, we have
yi,t − ȳi = β(xi,t − x̄i) + (ui,t − ūi)
22
Fixed Effects (or Within) Estimator
n Answer: OLS estimation of transformed
model will yield a consistent estimate of β
n The prior transformation is called the
“within transformation” because it
demeans all variables within their group
q In this case, the “group” was each cross-section
of observations over time for each firm
q This is also called the FE estimator
23
Unobserved heterogeneity – Tangent
n Unobserved variable, f, is very general
q Doesn't just capture one unobserved
variable; captures all unobserved variables
that don't vary within the group
q This is why we often just call it
“unobserved heterogeneity”
24
FE Estimator – Practical Advice
n When you use the fixed effects (FE)
estimator in programs like Stata, it does
the within transformation for you
n Don't do it on your own because…
q The degrees of freedom (DoF) (which are used
to get the standard errors) sometimes need to be
adjusted down by the number of panels, N
q What adjustment is necessary depends on
whether you cluster, etc.
25
Least Squares Dummy Variable (LSDV)
n Another way to do the FE estimation is
by adding indicator (dummy) variables
q Notice that the coefficient on fi, δ, doesn't
really have any meaning; so, can just rescale
the unobserved fi to make it equal to 1
yi ,t = α + β xi ,t + fi + ui ,t
q Now, to estimate this, we can just treat each
fi as a parameter to be estimated
26
LSDV continued…
n I.e. create a dummy variable for each
group i, and add it to the regression
q This is least squares dummy variable model
q Now, our estimation equation exactly matches
the true underlying model
yi ,t = α + β xi ,t + fi + ui ,t
q We get consistent estimates and SE that are
identical to what we'd get with within estimator
27
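A minimal Stata sketch showing that the within (FE) estimator, LSDV, and a manual within transformation give the same slopes; the built-in Grunfeld panel is a stand-in, and (per the practical advice above) the manual version's SEs lack the needed degrees-of-freedom adjustment.
```stata
* FE via xtreg, LSDV, and the manual within transformation (Grunfeld panel as a stand-in)
webuse grunfeld, clear
xtset company year
xtreg invest mvalue kstock, fe           // within (FE) estimator
regress invest mvalue kstock i.company   // LSDV: identical slopes
* manual within transformation (SEs here are not DoF-adjusted; prefer xtreg/areg)
foreach v in invest mvalue kstock {
    egen mean_`v' = mean(`v'), by(company)
    gen dm_`v' = `v' - mean_`v'
}
regress dm_invest dm_mvalue dm_kstock
areg invest mvalue kstock, absorb(company)   // same slopes; reports overall R2
```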
LSDV – Practical Advice
n Because the dummy variables will be
collinear with the constant, one of them
will be dropped in the estimation
q Therefore, don't try to interpret the intercept;
it is just the average y when all the x's are
equal to zero for the group corresponding to
the dropped dummy variable
q In xtreg, fe, the reported intercept is just
average of individual specific intercepts
28
LSDV versus FE [Part 1]
n Can show that LSDV and FE are identical,
using partial regression results [How?]
q Remember, to control for some variable z, we can
regress y onto both x and z, or we can just partial
z out from both y and x before regressing y on x
(i.e. regress residuals from regression of y on z
onto residual from regression of x on z)
q The demeaned variables are the residuals from a
regression of them onto the group dummies!
29
LSDV versus FE [Part 2]
n Reported R2 will be larger with LSDV
q All the dummy variables will explain a lot of the
variation in y, driving up R2
q Within R2 reported for FE estimator just reports
what proportion of the within variation in y that is
explained by the within variation in x
q The within R2 is usually of more interest to us
30
R-squared with FE – Practical Advice
n The within R2 is usually of more interest
since it describes explanatory power of x's
[after partialling out the FE]
q To get the within R2, use xtreg, fe
n Reporting overall adjusted-R2 is also useful
q To get overall adjusted-R2, use areg command
instead of xtreg, fe. The “overall R2” reported
by xtreg does not include variation explained
by FE, but the R2 reported by areg does
31
Outline for Panel Data
n Motivate how panel data is helpful
n Fixed effects model
q Benefits [There are many]
q Costs [There are some…]
32
FE Estimator – Benefits [Part 1]
n There are many benefits of FE estimator
q Allows for arbitrary correlation between each
fixed effect, fi, and each x within group i
n I.e. its very general and not imposing much structure on
what the underlying data must look like
33
FE Estimator – Benefits [Part 2]
q It is also very flexible and can help us control for
many types of unobserved heterogeneities
n Can add year FE if worried about unobserved
heterogeneity across time [e.g. macroeconomic shocks]
n Can add CEO FE if worried about unobserved
heterogeneity across CEOs [e.g. talent, risk aversion]
n Add industry-by-year FE if worried about unobserved
heterogeneity across industries over time [e.g. investment
opportunities, demand shocks]
34
FE Estimator – Tangent [Part 1]
35
FE Estimator – Tangent [Part 2]
36
Outline for Panel Data
n Motivate how panel data is helpful
n Fixed effects model
q Benefits [There are many]
q Costs [There are some…]
37
FE Estimator – Costs
38
FE Cost #1 – Can't estimate some var.
n If no within-group variation in the
independent variable, x, of interest, can't
disentangle it from group FE
q It is collinear with group FE; and will be
dropped by computer or swept out in the
within transformation
39
FE Cost #1 – Example
q Consider following CEO-level estimation
ln(totalpay)ijt = α + β1 ln(firmsize)ijt + β2 volatilityijt + β3 femalei + δt + fi + λj + uijt
n Ln(totalpay) is for CEO i, firm j, year t
n Estimation includes year, CEO, and firm FE
40
FE Cost #1 – Practical Advice
n Be careful of this!
q Programs like xtreg are good about dropping the
female variable and not reporting an estimate…
q But, if you create dummy variables yourself and
input them yourself, the estimation might drop one
of them rather than the female indicator
n I.e. you'll get an estimate for β3, but it has no
meaning! It's just a random intercept value that
depends entirely on the random FE dropped by Stata
41
FE Cost #1 – Any Solution?
n Instrumental variables can provide a
possible solution for this problem
q See Hausman and Taylor (Econometrica 1981)
q We will discuss this next week
42
FE Cost #2 – Measurement error [P1]
n Measurement error of independent variable
(and resulting biases) can be amplified
q Think of there being two types of variation
n Good (meaningful) variation
n Noise variation because we don't perfectly
measure the underlying variable of interest
43
FE Cost #2 – Measurement error [P2]
n Answer: Attenuation bias on
mismeasured variable will go up!
q Practical advice: Be careful in interpreting 'zero'
coefficients on potentially mismeasured
regressors; might just be attenuation bias!
q And remember, sign of bias on other
coefficients will be generally difficult to know
44
FE Cost #2 – Measurement error [P3]
n Problem can also apply even when all
variables are perfectly measured [How?]
n Answer: Adding FE might throw out relevant
variation; e.g. y in firm FE model might respond to
sustained changes in x, rather than transitory
changes [see McKinnish 2008 for more details]
n With FE you'd only have the transitory variation
leftover; might find x uncorrelated with y in FE
estimation even though sustained changes in x is
most important determinant of y
45
FE Cost #2 – Example
n Difficult to identify causal effect of credit
shocks on firm output because credit shocks
coincide with demand shocks [i.e. OVB]
q Paravisini, Rappoport, Schnabl, Wolfenzon
(2014) used product-level export data & shock to
some Peru banks to address this
n Basically regressed product output on total firm credit,
and added firm, bank, and product×destination FE (i.e.
dummy for selling a product to a particular country!)
n Found small effect… [Concern?]
46
FE Cost #2 – Example continued
n Concern = Credit extended to firms may
be measured with error!
q E.g. some loan originations and payoffs may
not be recorded in timely fashion
q Need to be careful interpreting a coefficient
from a model with so many FE as “small”
n Note: This paper is actually very good (and does
IV as well), and the authors are very careful to not
interpret their findings as evidence that financial
constraints only have a “small” effect
47
FE Cost #2 – Any solution?
n Admittedly, measurement error, in
general, is difficult to address
n For examples on how to deal with
measurement error, see following papers
q Griliches and Hausman (JoE 1986)
q Biorn (Econometric Reviews 2000)
q Erickson and Whited (JPE 2000, RFS 2012)
q Almeida, Campello, and Galvao (RFS 2010)
48
FE Cost #3 – Computation issues [P1]
n Estimating a model with multiple types of
FE can be computationally difficult
q When more than one type of FE, you cannot
remove both using within-transformation
n Generally, you can only sweep one away with
within-transformation; other FE dealt with by
adding dummy variable to model
n E.g. firm and year fixed effects [See next slide]
49
FE Cost #3 – Computation issues [P2]
Year FE
n Consider below model: Firm FE
yi ,t = α + β xi ,t + δ t + fi + ui ,t
50
FE Cost #3 – Computation issues [P3]
51
FE Cost #3 – Example
n But, computational issues is becoming
increasingly more problematic
q Researchers using larger datasets with many
more complicated FE structures
q E.g. if you try adding both firm and
industry×year FE, you'll have a problem
n Estimating 4-digit SIC×year and firm FE in
Compustat requires ≈ 40 GB memory
n No one has this; hence, no one does it…
52
FE Cost #3 – Any Solution?
n Yes, there are some potential solutions
q Gormley and Matsa (2014) discusses some
of these solutions in Section 4
q We will come back to this in “Common
Limitations and Errors” lecture
53
FE – Some Remaining Issues
54
Predicted values of FE [Part 1]
n Sometimes, predicted value of
unobserved FE is of interest
n Can get predicted value using
f̂i = ȳi − β̂x̄i, for all i = 1,..., N
q E.g. Bertrand and Schoar (QJE 2003) did
this to back out CEO fixed effects
n They show that the CEO FE are jointly
statistically different from zero, suggesting
CEOs have 'styles' that affect their firms
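q A hedged Stata sketch of recovering these predicted FE, using the same hypothetical y, x, firm, year panel as above:

* Estimate the FE model, then back out the predicted group effects
xtset firm year
xtreg y x, fe
predict fhat, u    // predicted fixed effect for each group (up to a constant)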
55
Predicted values of FE [Part 2]
n But, be careful with using these predicted
values of the FE
q They are unbiased, but inconsistent
n As sample size increases (and we get more
groups), we have more parameters to estimate…
never get the necessary asymptotics
n We call this the Incidental Parameters Problem
56
Predicted values of FE [Part 3]
57
Nonlinear models with FE [Part 1]
n Because we don't get consistent estimates
of the FE, we can't estimate nonlinear
panel data models with FE
q In practice, Logit, Tobit, Probit should not be
estimated with many fixed effects
q They only give consistent estimates under
rather strong and unrealistic assumptions
58
Nonlinear models with FE [Part 2]
q E.g. Probit with FE requires… [Why should we believe this to be true?]
n Unobserved fi to be distributed normally
n fi and xi,t to be independent [Almost surely not true in CF]
q And, Logit with FE requires…
n No serial correlation of y after conditioning on the observable x and unobserved f [Probably unlikely in many CF settings]
q For more details, see…
n Wooldridge (2010), Sections 13.9.1, 15.8.2-3
n Greene (2004) – uses simulation to show how bad the bias can be
59
Outline for Panel Data
n Motivate how panel data is helpful
n Fixed effects model
q Benefits [There are many]
q Costs [There are some…]
60
Random effects (RE) model [Part 1]
n Very similar model as FE…
yi ,t = α + β xi ,t + fi + ui ,t
61
Random effects (RE) model [Part 2]
62
Random effects (RE) model [Part 3]
63
Random effects – My Take
64
Outline for Panel Data
n Motivate how panel data is helpful
n Fixed effects model
q Benefits [There are many]
q Costs [There are some…]
65
First differencing (FD) [Part 1]
n First differencing is another way to
remove unobserved heterogeneities
q Rather than subtracting off the group
mean of the variable from each variable,
you instead subtract the lagged observation
q Easy to see why this also works…
66
First differencing (FD) [Part 2]
n Notice that
yi,t = α + β xi,t + fi + ui,t
yi,t−1 = α + β xi,t−1 + fi + ui,t−1
[Note: we'll lose one observation per cross-section because there won't be a lag for the first period]
n From this, we can see that
yi,t − yi,t−1 = β (xi,t − xi,t−1) + (ui,t − ui,t−1)
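q A minimal Stata sketch of the FD estimator with the same hypothetical panel variables:

* First-difference both sides; the group effect fi drops out
xtset firm year
regress D.y D.x, vce(cluster firm)
* (a constant here would simply pick up a common trend in y)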
67
First differences (without time)
68
FD versus FE [Part 1]
n When just two observations per group,
they are identical to each other
n In other cases, both are consistent;
difference is generally about efficiency
q FE is more efficient if disturbances, ui,t, are serially uncorrelated
q FD is more efficient if disturbances, ui,t, follow a random walk
[Which is true? Unclear. The truth is probably something in between]
69
FD versus FE [Part 2]
n If strict exogeneity is violated (i.e. xi,t is
correlated with ui,s for s≠t), FE might be better
q As long as we believe xi,t and ui,t are uncorrelated,
the FE's inconsistency shrinks to 0 at rate 1/T;
but, FD gets no better with larger T
q Remember: T is the # of observations per group
70
FD versus FE [Part 3]
n Bottom line: not a bad idea to try both…
q If different, you should try to understand why
q With an omitted variable or measurement
error, you’ll get diff. answers with FD and FE
n In fact, Griliches and Hausman (1986) shows that
because measurement error causes predictably
different biases in FD and FE, you can (under
certain circumstances) use the biased estimates to
back out the true parameter
71
Outline for Panel Data
n Motivate how panel data is helpful
n Fixed effects model
q Benefits [There are many]
q Costs [There are some…]
72
Lagged dependent variables with FE
n We cannot easily estimate models with both
a lagged dep. var. and unobserved FE
yi ,t = α + ρ yi ,t −1 + β xi ,t + fi + ui ,t , ρ <1
73
Lagged y & FE – Problem with OLS
n To see the problem with OLS, suppose
you estimate the following:
yi,t = α + ρ yi,t−1 + β xi,t + vi,t , where vi,t = fi + ui,t
q But, yi ,t −1 = α + ρ yi ,t − 2 + β xi ,t −1 + fi + ui ,t −1
q Thus, yi,t-1 and composite error, vi,t are positively
correlated because they both contain fi
q I.e. you get omitted variable bias
74
Lagged y & FE – Problem with FE
n Will skip the math, but it is always biased
q Basic idea is that after the within
transformation, the demeaned lagged y (which
contains the group mean of y) will always be
negatively correlated with the demeaned error, u
n Note #1 – This is true even if there was no unobserved
heterogeneity, f; FE with lagged values is always bad idea
n Note #2: Same problem applies to FD
75
How do we estimate this? IV?
n Basically, you're going to need instrument;
we will come back to this next week….
76
Lagged y versus FE – Bracketing
n Suppose you don't know which is correct
q Lagged value model: yi ,t = α + γ yi ,t −1 + β xi ,t + ui ,t
q Or, FE model: yi ,t = α + β xi ,t + fi + ui ,t
77
Bracketing continued…
n Use this to 'bracket' where true β is…
q But sometimes, you won't observe bracketing
q Likely means your model is incorrect in other
ways, or there is some severe finite sample bias
78
Summary of Today [Part 1]
79
Summary of Today [Part 2]
n FE estimator, however, has weaknesses
q Can't estimate variables that don't vary within
groups [or at least, not without an instrument]
q Could amplify any measurement error
n For this reason, be cautious interpreting zero or small
coefficients on possibly mismeasured variables
80
Summary of Today [Part 3]
81
In First Half of Next Class
n Instrumental variables
q What are the necessary assumptions? [E.g.
what is the exclusion restriction?]
q Is there a way we can test whether our
instruments are okay?
82
Assign papers for next week…
n Khwaja and Mian (AER 2008)
q Bank liquidity shocks
83
Break Time
n Let's take our 10 minute break
n We'll do presentations when we get back
84
FNCE 926
Empirical Methods in CF
Lecture 5 – Instrumental Variables
2
Background readings
n Roberts and Whited
q Section 3
n Angrist and Pischke
q Sections 4.1, 4.4, and 4.6
n Wooldridge
q Chapter 5
n Greene
q Sections 8.2-8.5
3
Outline for Today
n Quick review of panel regressions
n Discuss IV estimation
q How does it help?
q What assumptions are needed?
q What are the weaknesses?
4
Quick Review [Part 1]
n What type of omitted variable does panel
data and FE help mitigate, and how?
q Answer #1 = It can help eliminate omitted
variables that don’t vary within panel groups
q Answer #2 = It does this by transforming the
data to remove this group-level heterogeneity
[or equivalently, directly controls for it using
indicator variables as in LSDV]
5
Quick Review [Part 2]
n Why is random effects pretty useless
[at least in corporate finance settings]?
q Answer = It assumes that unobserved
heterogeneity is uncorrelated with x’s; this is
likely not going to be true in finance
6
Quick Review [Part 3]
n What are three limitations of FE?
#1 – Can’t estimate coefficient on variables that
don’t vary within groups
#2 – Could amplify any measurement error
n For this reason, be cautious interpreting zero or small
coefficients on possibly mismeasured variables
7
Outline for Instrumental Variables
n Motivation and intuition
n Required assumptions
n Implementation and 2SLS
q Weak instruments problem
q Multiple IVs and overidentification tests
8
Motivating IV [Part 1]
n Consider the following estimation
y = β 0 + β1 x1 + ... + β k xk + u
9
Motivation [Part 2]
q Answer #1: No. We will not get a
consistent estimate of βk
q Answer #2: Very unlikely. We will only
get consistent estimate of other β if xk is
uncorrelated with all other x
10
Instrumental variables – Intuition
q Think of xk as having ‘good’ and ‘bad’ variation
n Good variation is not correlated with u
n Bad variation is correlated with u
11
Outline for Instrumental Variables
n Motivation and intuition
n Required assumptions
n Implementation and 2SLS
q Weak instruments problem
q Multiple IVs and overidentification tests
12
Instrumental variables – Formally
n IVs must satisfy two conditions
q Relevance condition
q Exclusion condition
n What are these two conditions?
n Which is harder to satisfy?
n Can we test whether they are true?
13
Relevance condition [Part 1]
n The following must be true… [How can we test this condition?]
14
Relevance condition [Part 2]
n Easy to test the relevance condition!
q Just run the regression of xk on all the other
x’s and the instrument z to see if z explains xk
q As we see later, this is what people call the
‘first stage’ of the IV estimation
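q For instance, a hedged Stata sketch with hypothetical variables xk (the problematic regressor), exogenous controls x1 and x2, and instrument z:

* 'First stage': does z explain xk after controlling for the other x's?
regress xk x1 x2 z, vce(robust)
test z    // F-test that the coefficient on z is zero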
15
Exclusion condition [Part 1]
n The following must be true… [How can we test this condition?]
16
Exclusion condition [Part 2]
17
Side note – What’s wrong with this?
n I’ve seen many people try to use the below
argument as support for the exclusion
restriction… what’s wrong with it?
18
Side note – Answer
n If the original regression doesn’t give
consistent estimates, then neither will this one!
q cov(xk, u)≠0, so the estimates are still biased
q Moreover, if we believe the relevance condition,
then the coefficient on z is certainly biased because
z is correlated with xk
19
What makes a good instrument?
n Bottom line, an instrument must be justified
largely on economic arguments
q Relevance condition can be shown formally, but
you should have an economic argument for why
q Exclusion restriction cannot be tested… you need
to provide a convincing economic argument as to
why it explains y, but only through its effect on xk
20
Outline for Instrumental Variables
n Motivation and intuition
n Required assumptions
n Implementation and 2SLS
q Weak instruments problem
q Multiple IVs and overidentification tests
21
Implementing IV estimation
n You’ve found a good IV, now what?
n One can think of the IV estimation as
being done in two steps
q First stage: regress xk on other x’s & z
q Second stage: take predicted xk from first
stage and use it in original model instead of xk
This is why we also call IV estimations
two stage least squares (2SLS)
22
First stage of 2SLS
n Estimate the following
xk = α 0 + α1 x1 + ... + α k −1 xk −1 + γ z + v
23
Second stage of 2SLS
n Use predicted values to estimate
y = β0 + β1 x1 + ... + βk xˆk + u
24
Intuition behind 2SLS
n Predicted values represent variation in xk
that is ‘good’ in that it is driven only by
factors that are uncorrelated with u
q Specifically, predicted value is linear function of
variables that are uncorrelated with u
25
Reduced Form Estimates [Part 1]
n The “reduced form” estimation is when
you regress y directly onto the instrument,
z, and other non-problematic x’s
y = β0 + β1 x1 + ... + βk −1 xk −1 + δ z + u
26
Reduced Form Estimates [Part 2]
n It can be shown that the IV estimate for
xk, β̂kIV, is simply given by…
β̂kIV = δ̂ / γ̂
where δ̂ is the reduced form coefficient estimate for z
and γ̂ is the first stage coefficient estimate for z
27
Practical advice [Part 1]
n Don’t state in your paper’s intro that
you use an IV to resolve an identification
problem, unless…
q You also state what the IV you use is
q And, provide a strong economic argument as
to why it satisfies the necessary conditions
28
Practical advice [Part 2]
n Don’t forget to justify why we should be
believe the exclusion restriction holds
q Too many researchers only talk
about the relevance condition
q Exclusion restriction is equally important
29
Practical Advice [Part 3]
n Do not do two stages on your own!
q Let the software do it; e.g. in Stata, use the
IVREG or XTIVREG (for panel data) commands
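q A hedged sketch of the one-step approach in Stata, with the same hypothetical y, x1, x2, xk, and z (ivregress 2sls is the current counterpart of the IVREG syntax, and xtivreg handles panel data):

* 2SLS in one command; the endogenous regressor and its instrument go in parentheses
ivregress 2sls y x1 x2 (xk = z), vce(robust)
estat firststage    // reports first-stage R2, partial R2, and F statistic

* Panel version with firm fixed effects
xtset firm year
xtivreg y x1 x2 (xk = z), fe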
30
Practical Advice [Part 3-1]
31
Practical Advice [Part 3-2]
n People will try using predicted values
from non-linear model, e.g. Probit or
Logit, in a ‘second stage’ IV regression
q But, only linear OLS in first stage guarantees
covariates and fitted values in second stage
will be uncorrelated with the error
n I.e. this approach is NOT consistent
n This is what we call the “forbidden regression”
32
Practical Advice [Part 3-3]
n In models with quadratic terms, e.g.
y = β 0 + β1 x + β 2 x 2 + u
people often try to calculate one fitted
value x̂ using one instrument, z, and then
plug in x̂ and x̂ 2 into second stage…
q Seems intuitive, but it is NOT consistent!
q Instead, you should just use z and z2 as IVs!
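q A hedged Stata sketch of this, treating both x and x² as endogenous and using z and z² as the excluded instruments (hypothetical variable names):

gen x_sq = x^2
gen z_sq = z^2
ivregress 2sls y (x x_sq = z z_sq), vce(robust)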
33
Practical Advice [Part 3]
n Bottom line… if you find yourself plugging
in fitted values when doing an IV, you are
probably doing something wrong!
q Let the software do it for you; it will prevent
you from doing incorrect things
34
Practical Advice [Part 4]
n All x’s that are not problematic, need to be
included in the first stage!!!
q You’re not doing 2SLS, and you’re not getting
consistent estimates if this isn’t done
q This includes things like firm and year FE!
35
Practical Advice [Part 5]
n Always report your first stage results & R2
n There are two good reasons for this…
[What are they?]
q Answer #1: It is direct test of relevance
condition… i.e. we need to see γ≠0!
q Answer #2: It helps us determine whether
there might be a weak IV problem…
36
Outline for Instrumental Variables
n Motivation and intuition
n Required assumptions
n Implementation and 2SLS
q Weak instruments problem
q Multiple IVs and overidentification tests
37
Consistent, but biased
n IV is a consistent, but biased, estimator
q For any finite number of observations, N,
the IV estimates are biased toward the
biased OLS estimate
q But, as N approaches infinity, the IV
estimates converge to the true coefficients
38
Weak instruments problem
n A weak instrument is an IV that doesn’t
explain very much of the variation in the
problematic regressor
n Why is this an issue?
q Small sample bias of estimator is greater
when the instrument is weak; i.e. our estimates,
which use a finite sample, might be misleading…
q t-stats in finite sample can also be wrong
39
Weak IV bias can be severe [Part 1]
n Hahn and Hausman (2005) show that
finite sample bias of 2SLS is ≈
jρ(1 − r²) / (Nr²)
q j = number of IVs [we’ll talk about
multiple IVs in a second]
q ρ = correlation between xk and u
q r2 = R2 from first-stage regression
q N = sample size
40
Weak IV bias can be severe [Part 2]
jρ(1 − r²) / (Nr²)
q Low explanatory power in the first stage can result in large bias even if N is large
q More instruments, which we'll talk about later, need not help; they help increase r², but if they are weak (i.e. don't increase r² much), they can still increase finite sample bias
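q As a purely illustrative calculation (hypothetical numbers plugged into the approximation above), shrinking the first-stage R² by a factor of 10 raises the approximate bias by roughly a factor of 11:

* j = 1 instrument, rho = 0.5, N = 5,000
display (1*0.5*(1-0.10))/(5000*0.10)    // r2 = 0.10  ->  0.0009
display (1*0.5*(1-0.01))/(5000*0.01)    // r2 = 0.01  ->  0.0099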
41
Detecting weak instruments
n Number of warning flags to watch for…
q Large standard errors in IV estimates
n You’ll get large SEs when covariance between
instrument and problematic regressor is low
42
Excluded IVs – Tangent
n Just some terminology…
q In some ways, can think of all non-
problematic x’s as IVs; they all appear in first
stage and are used to get predicted values
q But, when people refer to excluded IVs, they
refer to the IVs (i.e. z’s) that are excluded
from the second stage
43
Outline for Instrumental Variables
n Motivation and intuition
n Required assumptions
n Implementation and 2SLS
q Weak instruments problem
q Multiple IVs and overidentification tests
44
More than one problematic regressor
n Now, consider the following…
y = β 0 + β1 x1 + ... + β k xk + u
45
Multiple IVs [Part 1]
n Just need one IV for each
problematic regressor, e.g. z1 and z2
n Then, estimate 2SLS in similar way…
q Regress xk on all other x’s (except xk-1)
and both instruments, z1 and z2
q Regress xk-1 on all other x’s (except xk)
and both instruments, z1 and z2
q Get predicted values, do second stage
46
Multiple IVs [Part 2]
n Need at least as many IVs as problematic
regressors to ensure predicted values are not
collinear with the non-problematic x’s
q If # of IVs match # of problematic x’s,
model is said to be “Just Identified”
47
“Overidentified” Models
48
Overidentified model conditions
n Necessary conditions very similar
q Exclusion restriction = none of the
instruments are correlated with u
q Relevance condition
n Each first stage (there will be h of them) must have at least one IV with a non-zero coefficient
n Of the m instruments, there must be at least h of them that are partially correlated with the problematic regressors [otherwise, model isn't identified]
[E.g. you can't just have one IV that is correlated with all the problematic regressors while all the other IVs are not]
49
Benefit of Overidentified Model
50
But, Overidentification Dilemma
n Suppose you are a very clever researcher…
q You find not just h instruments for h
problematic regressors, you find m > h
n First, you should consider yourself very clever
[a good instrument is hard to come by]!
n But, why might you not want to use the m-h
extra instruments?
51
Answer – Weak instruments
n Again, as we saw earlier, a weak
instrument will increase likelihood of finite
sample bias and misleading inferences!
q If have one really good IV, not clear you want
to add some extra (less good) IVs...
52
Practical Advice – Overidentified IV
53
Overidentification “Tests” [Part 1]
54
Overidentification “Tests” [Part 2]
55
Overidentification “Tests” [Part 3]
56
“Informal” checks – Tangent
n It is useful, however, to try some
“informal” checks on validity of IV
q E.g. One could show the IV is uncorrelated
with other non-problematic regressors or with
y that pre-dates the instrument
n Could help bolster economic argument that IV
isn’t related to outcome y for other reasons
n But, don’t do this for your actual outcome, y, why?
Answer = It would suggest a weak IV (at best)
57
Outline for Instrumental Variables
n Motivation and intuition
n Required assumptions
n Implementation and 2SLS
q Weak instruments problem
q Multiple IVs and overidentification tests
58
Miscellaneous IV issues
n IVs with interactions
n Constructing additional IVs
n Using lagged y or lagged x as IVs
n Using group average of x as IV for x
n Using IV with FE
n Using IV with measurement error
59
IVs with interactions
n Suppose you want to estimate
y = β 0 + β1 x1 + β 2 x2 + β3 x1 x2 + u
where cov(x1, u) = 0 and cov(x2, u) ≠ 0
60
IVs with interactions [Part 2]
n Answer = Yes! In this case, one can
construct other instruments from the one IV
q Use z as IV for x2
q Use x1z as IV for x1x2
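q A hedged Stata sketch (hypothetical variable names), treating x2 and x1·x2 as the two problematic regressors with z and x1·z as their excluded instruments:

gen x1x2 = x1*x2
gen x1z  = x1*z
ivregress 2sls y x1 (x2 x1x2 = z x1z), vce(robust)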
61
Constructing additional IV
n Now, suppose you want to estimate
y = β 0 + β1 x1 + β 2 x2 + β3 x3 + u
where cov(x1, u) = 0, cov(x2, u) ≠ 0, and cov(x3, u) ≠ 0
[Now, both x2 and x3 are problematic]
q Suppose you can only find one IV, z, and you
think z is correlated with both x2 and x3…
Can you use z and z2 as IVs?
62
Constructing additional IV [Part 2]
n Answer = Technically, yes. But
probably not advisable…
q Absent an economic reason for why z2 is
correlated with either x2 or x3 after
partialling out z, it’s probably not a good IV
n Even if it satisfies the relevance condition, it
might be a ‘weak’ instrument, which can be very
problematic [as seen earlier]
63
Lagged instruments
n It has become common in CF to use
lagged variables as instruments
n This usually takes two forms
q Instrumenting for a lagged y in a dynamic
panel model with FE using a further lag of y (i.e. a lagged lagged y)
q Instrumenting for problematic x or lagged y
using lagged version of the same x
64
Example where lagged IVs are used
n As noted last week, we cannot estimate
models with both a lagged dep. var. and
unobserved FE
yi ,t = α + ρ yi ,t −1 + β xi ,t + fi + ui ,t , ρ <1
65
Using lagged y as IV in panel models
66
Lagged y values as instruments?
n Probably not…
q Lagged values of y will be correlated with
changes in errors if errors are serially correlated
q This is common in corporate finance,
suggesting this approach is not helpful
67
Lagged x values as instruments? [Part 1]
68
Lagged x values as instruments? [Part 2]
69
Using group averages as IVs [Part 1]
n Will often see the following…
yi , j = α + β xi , j + ui , j
q yi,j is outcome for observation i (e.g., firm)
in group j (e.g., industry)
q Researcher worries that cov(x,u)≠0
q So, they use the group average, x̄−i,j, as an IV, where
x̄−i,j = [1/(J−1)] Σ_{k∈j, k≠i} xk,j    and J is the # of observations in the group
70
Using group averages as IVs [Part 2]
n They say…
q “group average of x is likely correlated with
own x” – i.e. relevance condition holds
q “but, group average doesn’t directly affect y”
– i.e., exclusion restriction holds
71
Using group averages as IVs [Part 3]
n Answer =
q Relevance condition implicitly assumes
some common group-level heterogeneity,
fj , that is correlated with xij
q But, if model has fj (i.e. group fixed effect),
then x−i , j must violate exclusion restriction!
72
Other Miscellaneous IVs
n As noted last week, IVs can also be useful
in panel estimations
#1 – Can help identify effect of variables that
don’t vary within groups [which we can’t
estimate directly in FE model]
#2 – Can help with measurement error
73
#1 – IV and FE models [Part 1]
n Use the following three steps to identify
variables that don’t vary within groups…
#1 – Estimate the FE model
#2 – Take group-averaged residuals, regress them
onto variable(s), x’, that don’t vary in groups
(i.e. the variables you couldn’t estimate in FE model)
n Why is this second step (on its own) problematic?
n Answer: because unobserved heterogeneity (which
is still collinear with x’) will still be in error (because
it partly explains group-average residuals)
74
#1 – IV and FE models [Part 2]
n Solution in second step is to use IV!
#3 – Use covariates that do vary in group (from
first step) as instruments in second step
n Which x’s from first step are valid IVs?
n Answer = those that don’t co-vary with unobserved
heterogeneity but do co-vary with variables that don’t
vary within groups [again, economic argument needed here]
75
#2 – IV and measurement error [Part 1]
n As discussed last week, measurement
error can be a problem in FE models
n IVs provide a potential solution
q Pretty simple idea…
q Find z correlated to mismeasured variable,
but not correlated with u; use IV
76
#2 – IV and measurement error [Part 2]
n But easier said than done!
q Identifying a valid instrument requires researcher
to understand exact source of measurement error
n This is because the disturbance, u, will include the
measurement error; hence, how can you make an
economic argument that z is uncorrelated with it if you
don’t understand the measurement error?
77
Outline for Instrumental Variables
n Motivation and intuition
n Required assumptions
n Implementation and 2SLS
q Weak instruments problem
q Multiple IVs and overidentification tests
78
Limitations of IV
n There are two main limitations to discuss
q Finding a good instrument is really hard; even
the seemingly best IVs can have problems
q External validity can be a concern
79
Subtle violations of exclusion restriction
n Even the seemingly best IVs can violate
the exclusion restriction
q Roberts and Whited (pg. 31, 2011) provide a
good example of this in description of
Bennedsen et al. (2007) paper
q Whatever group is discussing this paper
next week should take a look…
80
Bennedsen et al. (2007) example [Part 1]
n Paper studies effect of family CEO
succession on firm performance
q IVs for family CEO succession using
gender of first-born child
n Families where the first child was a boy are
more likely to have a family CEO succession
n Obviously, gender of first-born is totally
random; seems like a great IV…
81
Bennedsen et al. (2007) example [Part 2]
n Problem is that first-born gender may
be correlated with disturbance u
q Girl-first families may only turnover firm
to a daughter when she is very talented
q Therefore, effect of family CEO turnover
might depend on gender of first born
q I.e. gender of first born is correlated with
u because it includes interaction between
problematic x and the instrument, z!
82
External vs. Internal validity
n External validity is another concern of IV
[and other identification strategies]
q Internal validity is when the estimation
strategy successfully uncovers a causal effect
q External validity is when those estimates are
predictive of outcomes in other scenarios
n IV (done correctly) gives us internal validity
n But, it doesn’t necessarily give us external validity
83
External validity [Part 1]
n Issue is that IV estimates only tell us about
subsample where the instrument is predictive
q Remember, you’re only making use
of variation in x driven by z
q So, we aren’t learning effect of x for
observations where z doesn’t explain x!
84
External validity [Part 2]
n Again, consider Bennedsen et al (2007)
q Gender of first born may only predict likelihood
of family turnover in certain firms…
n I.e. family firms where CEO thinks females (including
daughters) are less suitable for leadership positions
85
External validity [Part 3]
n Answer: These firms might be different in
other dimensions, which limits the external
validity of our findings
q E.g. Could be that these are poorly run firms…
n If so, then we only identify effect for such
poorly run firms using the IV
n And, effect of family succession in well-run
firms might be quite different…
86
External validity [Part 4]
n Possible test for external validity problems
q Size of residual from first stage tells us something
about importance of IV for certain observations
n Large residual means IV didn’t explain much
n Small residual means it did
87
Summary of Today [Part 1]
88
Summary of Today [Part 2]
89
In First Half of Next Class
n Natural experiments [Part 1]
q How do they help with identification?
q What assumptions are necessary to
make causal inferences?
q What are their limitations?
90
Assign papers for next week…
n Gormley (JFI 2010)
q Foreign bank entry and credit access
91
Break Time
n Let’s take our 10 minute break
n We’ll do presentations when we get back
92
FNCE 926
Empirical Methods in CF
Lecture 6 – Natural Experiment [P1]
2
Background readings
n Roberts and Whited
q Sections 2.2, 4
n Angrist and Pischke
q Section 5.2
3
Outline for Today
n Quick review of IV regressions
n Discuss natural experiments
q How do they help?
q What assumptions are needed?
q What are their weaknesses?
4
Quick Review [Part 1]
5
Quick Review [Part 2]
n Angrist (1990) used randomness of
Vietnam draft to study effect of military
service on Veterans' earnings
q Person's draft number (which was random)
predicted likelihood of serving in Vietnam
q He found, using draft # as IV, that serving in
military reduced future earnings
Question: What might be a concern about the
external validity of his findings, and why?
6
Quick Review [Part 3]
7
Quick Review [Part 4]
n Question: Are more instruments
necessarily a good thing? If not, why not?
q Answer = Not necessarily. Weak instrument
problem (i.e. bias in finite sample) can be much
worse with more instruments, particularly if
they are weaker instruments
8
Quick Review [Part 5]
n Question: How can overidentification tests
be used to prove the IV is valid?
q Answer = Trick question! They cannot be
used in such a way. They rely on the
assumption that at least one IV is good. You
must provide a convincing economic argument
as to why your IVs make sense!
9
Natural Experiments – Outline
n Motivation and definition
n Understanding treatment effects
n Two types of simple differences
n Difference-in-differences
10
Recall… CMI assumption is key
11
CMI violation implies non-randomness
n Another way to think about CMI is that
it indicates that our x is non-random
q I.e. the distribution of x (or the
distribution of x after controlling for
other observable covariates) isn't random
n E.g. firms with high x might have higher y
(beyond just the effect of x on y) because high x
is more likely for firms with some omitted
variable contained in u…
12
Randomized experiments are great…
n In many of the “hard” sciences, the
researcher can simply design an experiment to
achieve the necessary randomness
q Ex. #1 – To determine effect of new drug, you
randomly give it to certain patients
q Ex. #2 – To determine effect of certain gene,
you modify it in a random sample of mice
13
But, we simply can't do them
n We can't do this in corporate finance!
q E.g. we can't randomly assign a firm's leverage
to determine its effect on investment
q And, we can't randomly assign CEOs' # of
options to determine their effect on risk-taking
14
Defining a Natural Experiment
n Natural experiment is basically when
some event causes a random assignment
of (or change in) a variable of interest, x
q Ex. #1 – Some weather event increases
leverage for a random subset of firms
q Ex. #2 – Some change in regulation reduces
usage of options at a random subset of firms
15
Nat. Experiments Provide Randomness
16
NEs can be used in many ways
n Technically, natural experiments can be
used in many different ways
q Use them to construct IV
n E.g. gender of first child being a boy used in
Bennedsen, et al. (2007) is an example NE
17
And, the Difference-in-Differences…
n But admittedly, when most people refer to
natural experiment, they are talking about a
difference-in-difference (D-i-D) estimator
q Basically, compares outcome y for a “treated” group
to outcome y for “untreated” group where treatment
is randomly assigned by the natural experiment
q This is how I'll use NE in this class
18
Natural Experiments – Outline
n Motivation and definition
n Understanding treatment effects
q Notation and definitions
q Selection bias and why randomization matters
q Regression for treatment effects
19
Treatment Effects
n Before getting into natural experiments in
context of difference-in-difference, it is first
helpful to describe “treatment effects”
20
Notation and Framework
n Let d equal a treatment indicator from the
experiment we will study
q d = 0 → untreated by experiment (i.e. control group)
q d = 1 → treated by experiment (i.e. treated group)
21
Example treatments in corp. fin…
n Ex. #1 – Treatment might be that your
firm's state passed anti-takeover law
q d = 1 for firms incorporated in those states
q y could be a number of things, e.g. ROA
22
Average Treatment Effect (ATE)
n Can now define some useful things
q Average Treatment Effect (ATE) is given by
E[y(1) – y(0)]
23
But, ATE is unobservable
E[y(1) – y(0)]
24
Defining ATT
q Average Treatment Effect if Treated (ATT)
is given by E[y(1) – y(0)|d =1]
n This is the effect of treatment on those that are treated;
i.e. the change in y we'd expect to find if we treated a random
sample from the population of observations that are treated
25
Defining ATU
q Average Treatment Effect if Untreated (ATU)
is given by E[y(1) – y(0)|d =0]
n This is what the effect of treatment would have been on
those that are not treated by the experiment
n We don't observe y(1) | d = 0
26
Uncovering ATE [Part 1]
27
Uncovering ATE [Part 2]
n In words, we compare average y of treated
to average y of untreated observations
q If we interpret this as the ATE, we are
assuming that absent the treatment, the treated
group would, on average, have had same
outcome y as the untreated group
q We can show this formally by simply working
out E[y(1)|d =1]–E[y(0)|d =0]…
28
Uncovering ATE [Part 3]
{E[ y(1) | d = 1] − E[ y(0) | d = 1]} + {E[ y(0) | d = 1] − E[ y(0) | d = 0]}
29
Natural Experiments – Outline
n Motivation and definition
n Understanding treatment effects
q Notation and definitions
q Selection bias and why randomization matters
q Regression for treatment effects
30
Selection bias defined
n Selection bias: E[ y(0) | d = 1] − E[ y(0) | d = 0]
q Definition = What the difference in average y
would have been for treated and untreated
observations absent any treatment
q We do not observe this counterfactual!
31
Introducing random treatment
n A random treatment, d, implies that d is
independent of potential outcomes; i.e.
E[y(0) | d = 1] = E[y(0) | d = 0] = E[y(0)]
and
E[y(1) | d = 1] = E[y(1) | d = 0] = E[y(1)]
[In words, the expected value of y is the same for treated and untreated absent treatment]
32
Random treatment makes life easy
33
Natural Experiments – Outline
n Motivation and definition
n Understanding treatment effects
q Notation and definitions
q Selection bias and why randomization matters
q Regression for treatment effects
34
ATE in Regression Format [Part 1]
n Can re-express everything in regression format
y = β0 + β1d + u
where β0 = E[y(0)], β1 = y(1) − y(0), and u = y(0) − E[y(0)]
[This regression will only give a consistent estimate of β1 if cov(d, u) = 0; i.e. treatment, d, is random, and hence, uncorrelated with y(0)!]
35
ATE in Regression Format [Part 2]
n We are interested in E[y|d =1]–E[y|d =0]
q But, can easily show that this expression is equal to
β1 + E[ y(0) | d = 1] − E[ y(0) | d = 0]
36
Adding additional controls [Part 1]
n Regression format also allows us to easily
put in additional controls, X
q Intuitively, comparison of treated and untreated
just becomes E[y(1)|d =1, X]–E[y(0)|d =0,X]
q Same selection bias term will appear if treatment,
d, isn't random after conditioning on X
q Regression version just becomes
y = β0 + β1d + ΓX + u
[Why might there still be a selection bias?]
37
Adding additional controls [Part 2]
n Selection bias can still be present if treatment
is correlated with unobserved variables
q As we saw earlier, it is what we can't observe
(and control for) that can be a problem!
38
Adding additional controls [Part 3]
n Answer: No, controls are not necessary in
truly randomized experiment
q But, they can be helpful in making the estimates
more precise by absorbing residual variation…
we'll talk more about this later
39
Treatment effect – Example
n Suppose we compare leverage of firms with and
without a credit rating [or equivalently, regress
leverage on an indicator for having a rating]
q Treatment is having a credit rating
q Outcome of interest is leverage
40
Treatment effect – Example Answer
n Answer #1: Having a rating isn't random
q Firms with rating likely would have had higher
leverage anyway because they are larger, more
profitable, etc.; selection bias will be positive
q Selection bias is basically an omitted var.!
41
Heterogeneous Effects
n Allowing the effect of treatment to vary
across individuals doesn't affect much
q Just introduces additional bias term
q Will still get ATE if treatment is random…
broadly speaking, randomness is key
42
Natural Experiments – Outline
n Motivation and definition
n Understanding treatment effects
n Two types of simple differences
q Cross-sectional difference & assumptions
q Time-series difference & assumptions
q Miscellaneous issues & advice
We actually just
n Difference-in-differences did this one!
43
Cross-sectional Simple Difference
44
In regression format…
n Cross-section simple difference
yi ,t = β 0 + β1di + ui ,t
45
Identification Assumption
46
Another way to see the assumption…
E[y | d = 1] − E[y | d = 0]    [this is the causal interpretation of the coefficient on d]
= (β0 + β1 + E[u | d = 1]) − (β0 + E[u | d = 0])
= β1 + E[u | d = 1] − E[u | d = 0]
[CMI assumption ensures these last two terms cancel such that our interpretation matches the causal β1]
47
Multiple time periods & SEs
48
Multiple time periods & SEs – Solution
49
Natural Experiments – Outline
n Motivation and definition
n Understanding treatment effects
n Two types of simple differences
q Cross-sectional difference & assumptions
q Time-series difference & assumptions
q Miscellaneous issues & advice
n Difference-in-differences
50
Time-series Simple Difference
n Very intuitive idea
q Compare pre- and post-treatment
outcomes, y, for just the treated group
[i.e. pre-treatment period acts as 'control' group]
q I.e. run following regression…
51
In Regression Format
n Time-series simple difference
yi ,t = β 0 + β1 pt + ui ,t
52
Identification Assumption
53
Showing the assumption math…
E[y | p = 1] − E[y | p = 0]    [this would be the causal interpretation of the coefficient on p]
= (β0 + β1 + E[u | p = 1]) − (β0 + E[u | p = 0])
= β1 + E[u | p = 1] − E[u | p = 0]
= β1 + E[y(0) | p = 1] − E[y(0) | p = 0]
[Same selection bias term… our estimated coefficient on p only matches the true causal effect if this is zero]
54
Again, be careful about SEs
55
Using a First-Difference (FD) Approach
n Could also run regression using first-
differences specification
yi ,t − yi ,t −1 = β1 ( pt − pt −1 ) + (ui ,t − ui ,t −1 )
56
FD versus Standard Approach [Part 1]
n Why might these two models give different
estimates of β1 when there are more than
one pre- and post-treatment periods?
yi ,t = β 0 + β1 pt + ui ,t
versus
yi ,t − yi ,t −1 = β1 ( pt − pt −1 ) + (ui ,t − ui ,t −1 )
57
FD versus Standard Approach [Part 2]
n Answer: [How might this matter in practice?]
58
FD versus Standard Approach [Part 3]
n Both approaches assume the effect of
treatment is immediate and persistent, e.g.
[Figure: outcome y by period (−5 to 5), showing an immediate and persistent jump at treatment]
59
FD versus Standard Approach [Part 4]
n But, suppose the following is true...
In this scenario, the FD approach gives a much smaller estimate
[Figure: outcome y rises only gradually after treatment rather than jumping immediately]
60
Correct way to do difference
n Correct way to get a 'differencing'
approach to match up with the more
standard simple diff specification in
multi-period setting is to instead use
yi , post − yi , pre = β1 + (ui , post − ui , pre )
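q One hedged Stata sketch of this (hypothetical y, firm, and a 0/1 post indicator): collapse to one pre- and one post-treatment mean per unit, difference them, and the constant is β1:

collapse (mean) y, by(firm post)
reshape wide y, i(firm) j(post)
gen dy = y1 - y0          // post-mean minus pre-mean for each firm
regress dy, vce(robust)   // the constant estimates beta1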
61
Natural Experiments – Outline
n Motivation and definition
n Understanding treatment effects
n Two types of simple differences
q Cross-sectional difference & assumptions
q Time-series difference & assumptions
q Miscellaneous issues & advice
n Difference-in-differences
62
Treatment effect isn't always immediate
n In prior example, the specification is
wrong because the treatment effect only
slowly shows up over time
q Why might such a scenario be plausible?
q Answer = Many reasons. E.g. firms might
only slowly respond to change in regulation,
or CEO might only slowly change policy in
response to compensation shock
63
Accounting for a delay…
n Simple-difference misses this subtlety; it
assumes effect was immediate
n For this reason, it is always helpful to run
regression that allows effect to vary by period
q How can you do this?
q Answer = Insert indicators for each year relative
to the treatment year [see next slide]
64
Non-parametric approach
n If we have 5 pre- and 5 post-treatment obs., we could estimate:
yi,t = β0 + Σ_{t=−4}^{5} βt pt + ui,t
65
Non-parametric approach – Graph
n Plot estimates to trace out effect of treatment
[Figure: estimated treatment effects by period (−4 to 5)]
q Approach allows effect of treatment to vary by year!
q Estimates capture change relative to excluded period (t−5)
66
Simple Differences – Advice
n In general, simple differences are not that
convincing in practice…
q Cross-sectional difference requires us to
assume the average y of treated and untreated
would have been same absent treatment
q Time-series difference requires us to assume
the average y would have been same in post-
and pre-treatment periods absent treatment
67
Natural Experiments – Outline
n Motivation and definition
n Understanding treatment effects
n Two types of simple differences
n Difference-in-differences
q Intuition & implementation
q “Parallel trends” assumption
68
Difference-in-differences
n Yes, we can do better!
n We can do a difference-in-differences that
combines the two simple differences
q Intuition = compare change in y pre- versus
post-treatment for treated group [1st difference]
to change in y pre- versus post-treatment for
untreated group [2nd difference]
69
Implementing diff-in-diff
n Difference-in-differences estimator
yi ,t = β 0 + β1 pt + β 2 di + β3 ( di × pt ) + ui ,t
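q A hedged Stata sketch, assuming 0/1 indicators treat (= di) and post (= pt) and a firm identifier for clustering:

* i.post##i.treat expands to both main effects plus the interaction;
* the diff-in-diff estimate (beta3) is the coefficient on 1.post#1.treat
regress y i.post##i.treat, vce(cluster firm)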
70
Interpreting the estimates [Part 1]
71
Interpreting the estimates [Part 2]
72
Natural Experiments – Outline
n Motivation and definition
n Understanding treatment effects
n Two types of simple differences
n Difference-in-differences
q Intuition & implementation
q “Parallel trends” assumption
73
“Parallel trends” assumption
74
Differences estimation
n Equivalent way to do difference-in-differences
is to instead estimate the following:
75
Difference-in-differences – Visually
n Looking at what difference-in-differences
estimate is doing in graphs will also help you
see why the parallel trends assumption is key
yi ,t = β 0 + β1 pt + β 2 di + β3 ( di × pt ) + ui ,t
76
Diff-in-diffs – Visual Example #1
[Figure: outcome y by period (−5 to 5) for treated and untreated groups, with the unobserved counterfactual for the treated group; β2 is the treated–untreated gap, β1 the post-treatment change for the untreated, and β3 the treated group's additional change relative to the counterfactual]
77
Diff-in-diff – Visual Example #2
[Figure: same setup as Example #1; β1 now takes out the average pre- vs. post-treatment difference, β2 the treated–untreated gap, and β3 the treated group's additional change relative to the unobserved counterfactual]
78
Violation of parallel trends – Visual
There is no effect of treatment, but β3 > 0 because the parallel trends assumption was violated
[Figure: outcome y by period (−5 to 5) for treated and untreated groups with non-parallel trends]
79
Why we like diff-in-diff [Part 1]
n With simple difference, any of the below
arguments would prevent causal inference
q Cross-sectional diff – “Treatment and
untreated avg. y could be different for reasons
a, b, and c, that just happen to be correlated
with whether you are treated or not”
q Time-series diff – “Treatment group's avg. y
could change post- treatment for reasons a, b,
and c, that just happen to be correlated with
the timing of treatment”
80
Why we like diff-in-diff [Part 2]
n But, now the required argument to suggest
the estimate isn't causal is…
q “The change in y for treated observations after
treatment would have been different than
change in y for untreated observations for
reasons a, b, and c, that just happen to be
correlated with both whether you are treated
and when the treatment occurs”
This is (usually) a much
harder story to tell
81
Example…
n Bertrand & Mullainathan (JPE 2003) uses
state-by-state changes in regulations that
made it harder for firms to do M&A
q They compare wages at firms pre- versus post-
regulation in treated versus untreated states
q Are the below valid concerns about their
difference-in-differences…
82
Are these concerns for internal validity?
n The regulations were passed during a time
period of rapid growth of wages nationally…
q Answer = No. Indicator for post-treatment
accounts for common growth in wages
83
Example continued…
n However, ex-ante average differences are
troublesome in some regard…
q Suggests treatment wasn't random
q And, ex-ante differences can be problematic if
we think their effect may vary with time…
n Time-varying omitted variables are problematic
because they can cause violation of “parallel trends”
n E.g. states with more unions were trending differently
at that time because of changes in union power
84
Summary of Today [Part 1]
85
Summary of Today [Part 2]
86
In First Half of Next Class
n Natural experiments [Part 2]
q How to handle multiple events
q Triple differences
q Common robustness tests that can be used to
test whether internal validity is likely to hold
87
Assign papers for next week…
n Jayaratne and Strahan (QJE 1996)
q Bank deregulation and economic growth
88
Break Time
n Let's take our 10 minute break
n We'll do presentations when we get back
89
FNCE 926
Empirical Methods in CF
Lecture 7 – Natural Experiment [P2]
2
Background readings
n Roberts and Whited
q Sections 2.2, 4
n Angrist and Pischke
q Section 5.2
3
Outline for Today
n Quick review of last lecture
n Continue to discuss natural experiments
q How to handle multiple events
q Triple differences
q Common robustness tests that can be used to
test whether internal validity is likely to hold
4
Quick Review[Part 1]
5
Quick Review [Part 2]
n Difference-in-differences is estimated with…
yi ,t = β 0 + β1 pt + β 2 di + β3 ( di × pt ) + ui ,t
q Compares change in y pre- versus post-treatment
for treated to change in y for untreated
q Requires “parallel trends” assumption
6
Quick Review [Part 3]
n Suppose Spain exits the Euro. And, Ann
compares profitability of Spanish firms after
the exit to profitability before…
n What is necessary for the comparison to
have any causal interpretation?
q Answer = We must assume profitability after
Spain’s exit would have been same as profitability
prior to exit absent exit… Highly implausible
7
Quick Review [Part 4]
n Now, suppose Bob compares profitability of
Spanish firms after the exit to profitability of
German firms after exit…
n What is necessary for the comparison to
have any causal interpretation?
q Answer = We must assume profitability of
Spanish firm would have been same as
profitability of German firms absent exit…
Again, this is highly implausible
8
Quick Review [Part 5]
n Lastly, suppose Charlie compares change in
profitability of Spanish firms after exit to
change in profitability of German firms
n What is necessary for the comparison to
have any causal interpretation?
q Answer = We must assume change in profitability
of Spanish firm would have been same as change
for German firms absent exit… I.e. parallel
trends assumption
9
Natural Experiment [P2] – Outline
n Difference-in-difference continued…
q Using group means to get an estimate
q When additional controls are appropriate
n How to handle multiple events
n Falsification tests
n Triple differences
10
Standard Regression Format
n Difference-in-differences estimator
yi ,t = β 0 + β1 pt + β 2 di + β3 ( di × pt ) + ui ,t
11
Comparing group means approach
n To see how we can get the same estimate,
β3, by just comparing sample means, first
calculate expected y under four possible
combinations of p and d indicators
12
Comparing group means approach [P1]
n Again, the regression is…
yi ,t = β 0 + β1 pt + β 2 di + β3 ( di × pt ) + ui ,t
13
Comparing group means approach [P2]
E ( y | d = 1, p = 1) = β 0 + β1 + β 2 + β3
E ( y | d = 1, p = 0) = β 0 + β 2
E ( y | d = 0, p = 1) = β 0 + β1
E ( y | d = 0, p = 0) = β 0
14
Comparing group means approach [P3]
n Now take the simple differences
                     Post-Treatment (1)    Pre-Treatment (2)    Difference (1)-(2)
Treatment (a)        β0+β1+β2+β3           β0+β2                β1+β3
Control (b)          β0+β1                 β0                   β1
15
Comparing group means approach [P4]
n Then, take difference-in-differences!
                     Post-Treatment (1)    Pre-Treatment (2)    Difference (1)-(2)
Treatment (a)        β0+β1+β2+β3           β0+β2                β1+β3
Control (b)          β0+β1                 β0                   β1
16
Simple difference – Revisited [Part 1]
n Useful to look at simple differences
                     Post-Treatment (1)    Pre-Treatment (2)    Difference (1)-(2)
Treatment (a)        β0+β1+β2+β3           β0+β2                β1+β3
Control (b)          β0+β1                 β0                   β1
17
Simple difference – Revisited [Part 2]
n Now, look at time-series simple diff…
                     Post-Treatment (1)    Pre-Treatment (2)    Difference (1)-(2)
Treatment (a)        β0+β1+β2+β3           β0+β2                β1+β3
Control (b)          β0+β1                 β0                   β1
18
Why the regression is helpful
n Some papers will just report this simple
two-by-two table as their estimate
n But, there are advantages to the regression
q Can modify it to test timing of treatment
[we will talk about this in robustness section]
q Can add additional controls, X
19
Natural Experiment [P2] – Outline
n Difference-in-difference continued…
q Using group means to get an estimate
q When additional controls are appropriate
n How to handle multiple events
n Falsification tests
n Triple differences
20
Adding controls to diff-in-diff
n Easy to add controls to regression
yi ,t = β 0 + β1 pt + β 2 di + β3 ( di × pt ) + ΓX i ,t + ui ,t
21
When controls are inappropriate
n Remember! You should never add controls
that might themselves be affected by treatment
q Angrist-Pischke call this a “bad control”
q You won’t be able to get a consistent estimate of β3
from estimating the equation
22
A Pet Peeve of TG – Refined
n If you have a treatment that is truly random, do
not put in controls affected by the treatment!
q I’ve had many referees force me to add controls
that are likely to be affected by the treatment…
q If this happens to you, put in both regressions (with
and without controls), and at a minimum, add a
caveat as to why adding controls is inappropriate
23
When controls are appropriate
24
#1 – To improve precision
n Adding controls can soak up some of
residual variation (i.e. noise) allowing you
to better isolate the treatment effect
q Should the controls change the estimate?
n NO! If treatment is truly random, adding
controls shouldn’t affect actual estimate; they
should only help lower the standard errors!
25
Example – Improving precision
n Suppose you have firm-level panel data
n Some natural experiment ‘treats’ some
firms but not other firms
q Could just estimate the standard diff-in-diff
yi ,t = β 0 + β1 pt + β 2 di + β3 ( di × pt ) + ui ,t
q Or, could add fixed effects (like firm and year
FE) to get more precise estimate…
26
Example – Improving precision [Part 2]
n So, suppose you estimate…
yi ,t = β 0 + β1 pt + β 2 di + β3 ( di × pt ) + α i + δ t + ui ,t
27
Example – Improving precision [Part 3]
28
Example – Improving precision [Part 4]
n Instead, you should estimate…
yi ,t = β 0 + β3 ( di × pt ) + α i + δ t + ui ,t
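q A hedged Stata sketch of this specification, assuming a treat×post interaction and the user-written reghdfe command (ssc install reghdfe); the firm and year FE absorb the di and pt main effects:

gen treat_post = treat*post
reghdfe y treat_post, absorb(firm year) vce(cluster firm)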
29
Generalized Difference-in-differences
30
Generalized D-i-D – Example [Part 1]
q To see how Generalized D-i-D can be
helpful, consider the example from last week
[Figure: outcome y by period (−5 to 5) from last week's example, with β1 marking the average post-treatment change]
31
Generalized D-i-D – Example [Part 2]
q Year dummies will better fit actual trend
[Figure: outcome y by period (−5 to 5), with year dummies tracking the actual trend]
32
When controls are appropriate
33
#2 – Restore randomness of treatment
n Suppose the following is true… [i.e. treatment isn't random]
q Observations with a certain characteristic, e.g. high x, are more likely to be treated
q And, firms with this characteristic are likely to have a differential trend in outcome y [and, non-randomness is problematic for identification]
n Adding a control for x could restore 'randomness'; i.e. being treated is random after controlling for x!
34
Restoring randomness – Example
n Natural experiment is change in regulation
q Firms affected by regulation is random, except
that it is more likely to hit firms that are larger
q And, we think larger firms might have different
trend in outcome y afterwards for other reasons
q And, firm size is not going to be affected by the
change in regulation in any way
35
Controls continued…
n In the prior example, suppose size is potentially
affected by the change in regulation…
q What would be another approach that won’t run
afoul of the ‘bad control’ problem?
n Answer: Use firm size in the year prior to treatment and
its interaction with the post-treatment dummy
n This will control for non-random assignment (based
on size) and differential trend (based on size)
36
Restoring randomness – Caution!
n In practice, don’t often see use of controls
to restore randomness
q Requires assumption that non-random
assignment isn’t also correlated with
unobservable variables…
q So, not that plausible unless there are very
specific reasons for non-randomness
37
One last note… be careful about SEs
38
Natural Experiment [P2] – Outline
n Difference-in-difference continued…
n How to handle multiple events
q Why they are useful
q Two similar estimation approaches
39
Motivating example…
n Gormley and Matsa (2011) looked at
firms’ responses to increased left-tail risk
q Used discovery that workers were exposed to
harmful chemical as exogenous increase in risk
q One discovery occurred in 2000; a chemical
heavily used by firms producing
semiconductors was found to be harmful
40
Motivating Example – Answer
41
Multiple treatment events
n Sometimes, the natural experiment is
repeated at multiple points in time for
multiple groups of observations
q E.g. U.S. states make a particular regulatory
change at different points in time
42
How multiple events are helpful
43
Natural Experiment [P2] – Outline
n Difference-in-difference continued…
n How to handle multiple events
q Why they are useful
q Two similar estimation approaches
44
Estimation with Multiple Events
45
Multiple Events – Approach #1 [P1]
n Just estimate the following estimation
yict = β dict + pt + mc + uict
46
Multiple Events – Approach #1 [P2]
47
Multiple Events – Approach #1 [P3]
n Intuition of this approach…
q Every untreated observation at a particular
point in time acts as control for treated
observations in that time period
n E.g. a firm treated in 1999 by some event will
act as a control for a firm treated in 1994 until
it itself becomes treated in 1999
48
Multiple Events – Approach #2 [P1]
n Now, think of running generalized diff-in-
diff for just one of the multiple events…
yit = β ( di × pt ) + α i + δ t + uit
49
Multiple Events – Approach #2 [P2]
q But, contrary to standard difference-in-
difference, your sample is…
n Restricted to a small window around event;
e.g. 5 years pre- and post- event
n And, drops any observations that are
treated by another event
q I.e. your sample starts only with previously
untreated observations, and if a ‘control’
observation later gets treated by a different event,
those post-event observations are dropped
50
Multiple Events – Approach #2 [P3]
n Now, create a similar sample for each
“event” being analyzed
n Then, “stack” the samples into one dataset
and create a variable that identifies the event
(i.e. ‘cohort’) each observation belongs to
q Note: some observation units will appear
multiple times in the data [e.g. firm 123 might be
a control in event year 1999 but a treated firm in
a later event in 2005]
51
Multiple Events – Approach #2 [P4]
n Then, estimate the following on the
stacked dataset you’ve created
yict = β dict + δ tc + α ic + uict
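q A hedged Stata sketch, assuming the cohort-specific samples have already been appended into one dataset with a cohort identifier and a treat_post indicator (reghdfe is user-written; ssc install reghdfe):

* Cohort-specific firm and year FE, as in the stacked specification above
reghdfe y treat_post, absorb(firm#cohort year#cohort) vce(cluster firm)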
52
Multiple Events – Approach #2 [P5]
n This approach has same intuition of the
first approach, but has a couple advantages
q Can more easily isolate a particular window of
interest around each event
n Prior approach compared all pre- versus post-
treatment observations against each other
53
Natural Experiment [P2] – Outline
n Difference-in-difference continued…
n How to handle multiple events
n Falsification tests
n Triple differences
54
Falsification Tests for D-i-D
n Can never directly test underlying
identification assumption, but can do some
falsification tests to support its validity
#1 – Compare pre-treatment observables
#2 – Check that timing of observed change in y
coincides with timing of event [i.e. no pre-trend]
#3 – Check for treatment reversal
#4 – Check variables that shouldn’t be affected
#5 – Add a triple-difference
55
#1 – Pre-treatment comparison [Part 1]
n Idea is that experiment ‘randomly’ treats
some subset of observations
q If true, then ex-ante characteristics of ‘treated’
observations should be similar to ex-ante
characteristics of ‘untreated’ observations
q Showing treated and untreated observations are
comparable in dimensions thought to affect y
can help ensure assignment was random
56
#1 – Pre-treatment comparison [Part 2]
n If we find an ex-ante difference in some variable
z, is the difference-in-differences invalid?
q Answer = Not necessarily.
n We need some story as to why units are expected to
have differential trend in y after treatment (for
reasons unrelated to treatment) that is correlated with
z for this to actually be a problem for identification
n And, even with this story, we could just control for z
and its interaction with time
n But, what would be the lingering concern?
57
#1 – Pre-treatment comparison [Part 3]
n Answer = unobservables!
q If the treated and control differ ex-ante in
observable ways, we worry they might differ in
unobservable ways that relate to some
violation of the parallel trends assumption
58
#2 – Check for pre-trend [Part 1]
n Similar to last lecture, can just allow effect
of treatment to vary by period to non-
parametrically map out the timing
q “Parallel trends” suggest we shouldn’t observe
any differential trend prior to treatment for the
observations that are eventually treated
59
#2 – Check for pre-trend [Part 2]
n Estimate the following:
yi,t = β0 + β1di + β2 pt + Σt γt (di × λt) + ui,t
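q A hedged Stata sketch (hypothetical variables year, event_year, treat, post, firm); relative time is shifted so it is a nonnegative integer, as Stata factor variables require, and the earliest period is the excluded base:

gen rel = year - event_year + 5
regress y i.treat i.post i.rel#i.treat, vce(cluster firm)
* The i.rel#i.treat coefficients are the gamma_t's; plot them
* (e.g. with the user-written coefplot) to check for a pre-trend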
60
#2 – Check for pre-trend [Part 3]
61
#2 – Check for pre-trend [Part 4]
n Something like this is ideal…
[Figure: estimated coefficients by period (−4 to 5): no differential pre-trend and tight confidence intervals]
62
#2 – Check for pre-trend [Part 5]
n Something like this is very bad
[Figure: estimated coefficients by period (−4 to 5): y for treated firms was already going up at a faster rate prior to the event!]
63
#2 – Check for pre-trend [Part 6]
n Should we make much of wide confidence
intervals in these graphs? E.g.
[Figure: estimated coefficients by period with wide confidence intervals]
q Answer: Not too much… each period's point estimate might be noisy on its own
64
#2 – Check for pre-trend [Part 7]
n Another type of pre-trend check is to run the
diff-in-diff in some "random" pre-treatment
period to show no effect
q I’m not a big fan of this… Why?
n Answer #1 – It is subject to gaming; researcher might
choose a particular pre-period to look at that works
n Answer #2 – Prior approach allows us to see what the
timing was and determine whether it is plausible
65
#3 – Treatment reversal
n In some cases, the “natural experiment”
is subsequently reversed
q E.g. regulation is subsequently undone
66
#4 – Unaffected variables
n In some cases, theory provides guidance
on what variables should be unaffected
by the “natural experiment”
q If natural experiment is what we think it is,
we should see this in the data… so check
67
#5 – Add Triple difference
n If theory tells us treatment effect should
be larger for one subset of observations,
we can check this with triple difference
q Pre- versus post-treatment
q Untreated versus treated
q Less sensitive versus More sensitive
68
Natural Experiment Outline – Part 2
n Difference-in-difference continued…
n How to handle multiple events
n Falsification tests
n Triple differences
q How to estimate & interpret it
q Using the popular subsample approach
69
Diff-in-diff-in-diff – Regression
$y_{i,t} = \beta_0 + \beta_1 p_t + \beta_2 d_i + \beta_3 h_i + \beta_4 (p_t \times h_i) + \beta_5 (d_i \times h_i) + \beta_6 (p_t \times d_i) + \beta_7 (p_t \times d_i \times h_i) + u_{i,t}$
70
Diff-in-diff-in-diff – Regression [Part 2]
n How to choose and set hi
q E.g. If theory says the effect is bigger for larger
firms, could set hi = 1 if assets of the firm in the year
prior to treatment are above the median
q Note: Remember to use ex-ante measures to
construct indicator if you think underlying
variable (that determines sensitivity) might be
affected by treatment… Why?
q Answer = To avoid bad controls!
71
Diff-in-diff-in-diff – Regression [Part 3]
$y_{i,t} = \beta_0 + \beta_1 p_t + \beta_2 d_i + \beta_3 h_i + \beta_4 (p_t \times h_i) + \beta_5 (d_i \times h_i) + \beta_6 (p_t \times d_i) + \beta_7 (p_t \times d_i \times h_i) + u_{i,t}$
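q A minimal Stata sketch of this regression (hypothetical names: y = outcome, p = post indicator, d = treated indicator, h = high-sensitivity indicator, firm = firm identifier):

    * i.p##i.d##i.h expands to all main effects, two-way interactions, and
    * the three-way interaction; the coefficient on 1.p#1.d#1.h is beta_7
    reg y i.p##i.d##i.h, vce(cluster firm)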
72
Interpreting the estimates [Part 1]
73
Interpreting the estimates [Part 2]
74
Tangent – Continuous vs. Indicator?
75
Tangent – Continuous vs. Indicator?
n Advantages
q Makes better use of variation available in data
q Provides estimate on magnitude of sensitivity
n Disadvantages
q Makes linear functional form assumption;
indicator imposes less structure on the data
q More easily influenced by outliers
76
Generalized Triple-Difference
77
Natural Experiment [P2] – Outline
n Difference-in-difference continued…
n How to handle multiple events
n Falsification tests
n Triple differences
q How to estimate & interpret it
q Using the popular subsample approach
78
Subsample Approach
79
Subsample Approach Differences…
80
Matching Subsample to Combined [P1]
$y_{i,t} = \beta_2 (p_t \times d_i) + \beta_3 (p_t \times d_i \times h_i) + \delta_t + (\delta_t \times h_i) + \alpha_i + u_{i,t}$
81
Matching Subsample to Combined [P2]
82
Triple Diff – Stacked Regression [Part 1]
n Another advantage of stacked regression
approach to multiple events is ability to
more easily incorporate a triple diff
q Can simply run stacked regression in separate
subsamples to create triple-diff or run it in
one regression as shown previously
83
Triple Diff – Stacked Regression [Part 2]
n Can’t easily do either of these in approach
of Bertrand and Mullainathan (2003)
q Some observations act as both ‘control’ and
‘treated’ at different points in the sample; not clear
how to create subsamples in such a setting
84
External Validity – Final Note
n While randomization ensures internal
validity (i.e. causal inferences), external
validity might still be an issue
q Is the experimental setting representative of
other settings of interest to researchers?
n I.e. can we extrapolate the finding to other settings?
n A careful argument that the setting isn’t unique or
that the underlying theory (for why you observe what
you observe) is likely to apply elsewhere is necessary
85
Summary of Today [Part 1]
86
Summary of Today [Part 2]
87
In First Half of Next Class
n Regression discontinuity
q What are they?
q How are they useful?
q How do we implement them?
88
Assign papers for next week…
n Gormley and Matsa (RFS 2011)
q Risk & CEO agency conflicts
89
Break Time
n Let’s take our 10 minute break
n We’ll do presentations when we get back
90
FNCE 926
Empirical Methods in CF
Lecture 8 – Regression Discontinuity
2
Background readings for today
n Roberts and Whited
q Section 5
n Angrist and Pischke
q Chapter 6
3
Outline for Today
n Quick review of last lecture on NE
n Discuss regression discontinuity
q What is it? How is it useful?
q How do we implement it?
q What are underlying assumptions?
13
Quick Review[Part 1]
14
Quick Review [Part 2]
n What are some standard falsification tests you
might want to run with diff-in-diff?
q Answers:
n Compare ex-ante characteristics of treated & untreated
n Check timing of treatment effect
n Run regression using dep. variables that shouldn’t be
affected by treatment (if it is what we think it is)
n Check whether reversal of treatment has opposite effect
n Triple-difference estimation
15
Quick Review [Part 3]
n If you find ex-ante differences in treated and
untreated, is internal validity gone?
q Answer = Not necessarily but it could suggest
non-random assignment of treatment that is
problematic… E.g. observations with
characteristic ‘z’ are more likely to be treated and
observations with this characteristic are also
likely to be trending differently for other reasons
16
Quick Review [Part 4]
n Does the absence of a pre-trend in diff-in-diff
ensure that the parallel trends assumption
holds and causal inferences can be made?
q Answer = Sadly, no. We can never prove
causality with 100% confidence. It could be that
trend was going to change after treatment for
reasons unrelated to treatment
17
Quick Review [Part 5]
n How are multiple events that affect
multiple groups helpful?
q Answer = Can check that treatment effect is
similar across events; helps reduce concerns
about violation of parallel trends since there
would need to be violation for each event
18
Quick Review [Part 6]
n How are triple differences helpful in
reducing concerns about violation of the
parallel trends assumption?
q Answer = Before, an “identification
policeman” would just need a story about why
treated might be trending differently after
event for other reasons… Now, he/she would
need story about why that different trend
would be particularly true for subset of firms
that are more sensitive to treatment
19
Regression Discontinuity – Outline
n Basic idea of regression discontinuity
n Sharp versus fuzzy discontinuities
n Estimating regression discontinuity
n Checks on internal validity
n Heterogeneous effects & external validity
20
Basic idea of RDD
n The basic idea of regression discontinuity
(RDD) is the following:
q Observations (e.g. firm, individual, etc.) are
‘treated’ based on known cutoff rule
n E.g. for some observable variable, x, an
observation is treated if x ≥ x’
n This cutoff is what creates the discontinuity
21
Examples of RDD settings
n If you think about it, these types of cutoff
rules are commonplace in finance
q A borrower FICO score > 620 makes
securitization of the loan more likely
n Keys, et al (QJE 2010)
22
RDD is like difference-in-difference…
23
But, RDD is different…
n RDD has some key differences…
q Assignment to treatment is NOT random;
assignment is based on value of x
q When treatment only depends on x (what I’ll
later call “sharp RDD”), there is no overlap in
treatment & controls; i.e. we never observe the
same x for a treatment and a control
24
RDD randomization assumption
n Assignment to treatment and control isn’t
random, but whether individual observation is
treated is assumed to be random
q I.e. researcher assumes that observations (e.g. firm,
person, etc.) can’t perfectly manipulate their x value
q Therefore, whether an observation’s x falls
immediately above or below key cutoff x’ is random!
25
Regression Discontinuity – Outline
n Basic idea of regression discontinuity
n Sharp versus fuzzy discontinuities
q Notation & ‘sharp’ vs. fuzzy assumption
q Assumption about local continuity
26
RDD terminology
27
Two types of RDD
n Sharp RDD
q Assignment to treatment only depends on x; i.e.
if x ≥ x’ you are treated with probability 1
28
Sharp RDD assumption #1
n Assignment to treatment occurs through
known and deterministic decision rule:
$d = d(x) = \begin{cases} 1 & \text{if } x \ge x' \\ 0 & \text{otherwise} \end{cases}$
q Weak inequality and direction of treatment is
unimportant [i.e. could easily have x < x’]
q But, it is important that there exists x’s
around the threshold value
29
Sharp RDD assumption #1 – Visually
[Figure: probability of treatment as a function of x; only x determines treatment, and the probability moves from 0 to 1 around the threshold value x’]
30
Sharp RDD – Examples
n Ex. #1 – PSAT score > x’ means student
receives national merit scholarship
q Receiving scholarship was determined solely
based on PSAT scores in the past
q Thistlethwaite and Campbell (1960) used this to
study effect of scholarship on career plans
31
Fuzzy RDD assumption #1
n Assignment to treatment is stochastic in
that only the probability of treatment has
known discontinuity at x’
$0 < \lim_{x \downarrow x'} \Pr(d = 1 \mid x) - \lim_{x \uparrow x'} \Pr(d = 1 \mid x) < 1$
32
Fuzzy RDD assumption #1 – Visually
[Figure: probability of treatment as a function of x; treatment is not purely driven by x, but the treatment probability jumps up at x’]
33
Fuzzy RDD – Example
n Ex. #1 – FICO score > 620 increases
likelihood of loan being securitized
q But, extent of loan documentation, lender,
etc., will matter as well…
34
Sharp versus Fuzzy RDD
n This subtle distinction affects exactly how
you estimate the causal effect of treatment
q With Sharp RDD, we will basically compare
average y immediately above and below x’
q With fuzzy RDD, the average change in y around
threshold understates causal effect [Why?]
n Answer = Comparison assumes all observations were
treated, but this isn’t true; if all observations had been
treated, observed change in y would be even larger; we
will need to rescale based on the change in probability
35
Regression Discontinuity – Outline
n Basic idea of regression discontinuity
n Sharp versus fuzzy discontinuities
q Notation & ‘sharp’ vs. fuzzy assumption
q Assumption about local continuity
36
RDD assumption #2
n But, both RDDs share the following
assumption about local continuity
n Potential outcomes, y(0) and y(1),
conditional on forcing variable, x, are
continuous at threshold x’
q In words: y would be a smooth function around
threshold absent treatment; i.e. don’t expect any
jump in y at threshold x’ absent treatment
37
RDD assumption #2 – Visually
[Figure: y plotted against x around x’; dashed lines represent unobserved counterfactuals. If all obs. had been treated, y would be smooth around x’; the other line says the equivalent thing for if none had been treated]
38
Regression Discontinuity – Outline
n Basic idea of regression discontinuity
n Sharp versus fuzzy discontinuities
n Estimating regression discontinuity
q Sharp regression discontinuity
q Fuzzy regression discontinuity
39
How not to do Sharp RDD…
n Given this setting, will the below estimation
reveal causal effect of treatment, d, on y?
$y_i = \beta_0 + \beta_1 d_i + u_i$
40
How not to do Sharp RDD… [Part 2]
n How can we modify previous regression
to account for this omitted variable?
q Answer: Control for x!
q So, we could estimate: $y_i = \beta_0 + \beta_1 d_i + \beta_2 x_i + u_i$
41
Bias versus Noise
n Ideally, we’d like to compare average y right
below and right above x’; what is the tradeoff?
q Answer: We won’t have many observations and
estimate will be very noisy. A wider range of x
on each side reduces this noise, but increases risk
of bias that observations further from threshold
might vary for other reasons (including because
of the direct effect of x on y)
42
Bias versus Noise – Visual
[Figure: scatter of y against x around the cutoff x’; only a couple of points lie near the cutoff, so using them alone gives a very noisy estimate]
43
Estimating Sharp RDD
n There are generally two ways to do RDD
that try to balance this tradeoff
between bias and noise
q Use all data, but control for effect of x on y
in a very general and rigorous way
q Use less rigorous controls for effect of x, but
only use data in small window around threshold
44
Estimating Sharp RDD, Using all data
n First approach uses all the data available
and estimates two separate regressions
$y_i = \beta^b + f(x_i - x') + u_i^b$   [estimate using only data below x’]
$y_i = \beta^a + g(x_i - x') + u_i^a$   [estimate using only data above x’]
q Just let f( ) and g( ) be any continuous
function of xi – x’, where f(0)=g(0)=0
q Treatment effect = $\beta^a - \beta^b$
45
Interpreting the Estimates…
$y_i = \beta^b + f(x_i - x') + u_i^b$
$y_i = \beta^a + g(x_i - x') + u_i^a$
46
Easier way to do this estimation
n Can do all in one step; just use all the
data at once and estimate:
$y_i = \alpha + \beta d_i + f(x_i - x') + d_i \times g(x_i - x') + u_i$
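q A minimal Stata sketch of this one-step estimation with a cubic for f( ) and g( ) (hypothetical names: y = outcome, x = forcing variable; the placeholder cutoff stands in for the known threshold x’):

    local cutoff = 0                    // replace with the known threshold x'
    gen xc  = x - `cutoff'              // re-centered forcing variable
    gen d   = (xc >= 0) if !missing(xc) // sharp RDD: treated iff x >= x'
    gen xc2 = xc^2
    gen xc3 = xc^3

    * The coefficient on 1.d is the estimated jump in y at the threshold;
    * the interactions let the polynomial differ above and below x'
    reg y i.d c.xc c.xc2 c.xc3 i.d#c.xc i.d#c.xc2 i.d#c.xc3, vce(robust)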
47
Tangent about dropping g( )
n Answer: If you drop $d_i \times g(x_i - x')$, you
assume functional form between x and y
is same above and below x’
q Can be strong assumption, which is probably
why it shouldn’t be only specification used
q But, Angrist and Pischke argue it usually
doesn’t make a big difference in practice
48
What should we use for f( ) and g( )?
n In practice, a high-order polynomial
function is used for both f( ) and g( )
q E.g. You might use a cubic polynomial
49
Sharp RDD – Robustness Check
n Ultimately, correct order of polynomial is
unknown; so, best to show robustness
q Should try to illustrate that findings are robust
to different polynomial orders
q Can do graphical analysis to provide a visual
inspection that polynomial order is correct
[I will cover graphical analysis in a second]
50
Estimating Sharp RDD, Using Window
n Do same RDD estimate as before, but…
q Restrict analysis to smaller window around x’
q Use lower polynomial order controls
51
Practical issues with this approach
n What is appropriate window width and
appropriate order of polynomial?
q Answer = There is no right answer! But, it
probably isn’t as necessary to have as
complicated of polynomial in smaller window
q But, best to just show robustness to choice
of window width, Δ, and polynomial order
52
Tradeoff between two approaches
n Approach with smaller window can be
subject to greater noise, but advantage is…
q Doesn’t assume constant effect of treatment for
all values of x in the sample; in essence you are
estimating local avg. treatment effect
q Less subject to risk of bias because correctly
controlling for relationship between x and y is
less important in the smaller window
53
Regression Discontinuity – Outline
n Basic idea of regression discontinuity
n Sharp versus fuzzy discontinuities
n Estimating regression discontinuity
q Sharp regression discontinuity
q Graphical analysis
q Fuzzy regression discontinuity
54
Graphical Analysis of RDD
n Can construct a graph to visually
inspect whether a discontinuity exists
and whether chosen polynomial order
seems to fit the data well
q Always good idea to do this graph with
RDD; provides sanity check and visual
illustration of variation driving estimate
55
How to do RDD graphical analysis [P1]
56
How to do RDD graphical analysis [P2]
57
Example of supportive graph
[Figure: each dot is the average y for the corresponding bin; a fifth-order polynomial was needed to fit the non-parametric plot, and the discontinuity is apparent in both the estimation and the non-parametric plot]
58
Example of non-supportive graph
59
RDD Graphs – Miscellaneous Issues
n Non-parametric plot shouldn’t suggest jump in
y at other points besides x’ [Why?]
q Answer = Calls into question internal validity of
RDD; possible that jump at x’ is driven by
something else that is unrelated to treatment
60
Bin Width in RDD graphs
n What is optimal # of bins (i.e. bin width)?
What is the tradeoff with smaller bins?
q Answer = Choice of bin width is subjective because
of tradeoff between precision and bias
n By including more data points in each average, wider bins
give us more precise estimate of E[y|x] in that region of x
n But, wider bins might be biased if E[y|x] is not constant
(i.e. has non-zero slope) within each of the wide bins
61
Test of overly wide graph bins
1. Construct indicator for each bin
2. Regress y on these indicators and their
interaction with forcing variable, x
3. Do joint F-test of interaction terms
q If the joint test rejects, that suggests there is a slope in
some of the bins… i.e. the bins are too wide
[see the sketch below]
q See Lee and Lemieux (JEL 2010) for
more details and another test
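q A minimal Stata sketch of this test (hypothetical names: y = outcome, xc = forcing variable centered at the cutoff; the 0.1 bin width is arbitrary):

    gen bin = floor(xc/0.1)        // assign each observation to a bin
    egen bin_id = group(bin)       // sequential bin identifiers

    * y on bin dummies plus bin-specific slopes in xc; a rejection of the
    * joint test on the slopes suggests the bins are too wide
    reg y i.bin_id i.bin_id#c.xc, vce(robust)
    testparm i.bin_id#c.xc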
62
Regression Discontinuity – Outline
n Basic idea of regression discontinuity
n Sharp versus fuzzy discontinuities
n Estimating regression discontinuity
q Sharp regression discontinuity
q Graphical analysis
q Fuzzy regression discontinuity
63
Intuition for Fuzzy RDD
n As noted earlier, comparison of average y
immediately above and below threshold (as
done in Sharp RDD) won’t work
q Again, not all observations above threshold are
treated and not all below are untreated; x > x’
just increases probability of treatment…
64
Fuzzy RDD Notation
n Need to relabel a few variables
q di = 1 if treated by event of interest; 0 otherwise
q And, define new threshold indicator, Ti
$T = T(x) = \begin{cases} 1 & \text{if } x \ge x' \\ 0 & \text{otherwise} \end{cases}$
n E.g. di = 1 if loan is securitized, Ti = 1 if
FICO score is greater than 620, which
increases probability loan is securitized
65
Estimating Fuzzy RDD [Part 1]
n Estimate the below model by 2SLS, instrumenting
the treatment indicator, di, with the threshold
indicator, Ti
$y_i = \alpha + \beta d_i + f(x_i - x') + u_i$
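q A minimal Stata sketch of this estimation, using the FICO example and a cubic for f( ) (hypothetical names: y = loan outcome, d = securitized indicator, x = FICO score):

    gen xc  = x - 620                    // forcing variable centered at x'
    gen T   = (xc >= 0) if !missing(xc)  // threshold indicator = the instrument
    gen xc2 = xc^2
    gen xc3 = xc^3

    * Instrument actual treatment, d, with the threshold indicator, T;
    * the polynomial in (x - x') enters as an exogenous control
    ivregress 2sls y xc xc2 xc3 (d = T), vce(robust)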
66
Estimating Fuzzy RDD [Part 2]
n Again, f( ) is typically a polynomial function
n Unlike sharp RDD, it isn’t as easy to allow
functional form to vary above & below
q So, if worried about different functional forms,
what can you do to mitigate this concern?
q Answer = Use a tighter window around event;
this is less sensitive to functional form, f(x)
67
Fuzzy RDD – Practical Issues
68
Fuzzy RDD Graphs
n Do same graph of y on x as with sharp RDD
q Again, should see discontinuity in y at x’
q Should get sense that polynomial fit is good
69
Regression Discontinuity – Outline
n Basic idea of regression discontinuity
n Sharp versus fuzzy discontinuities
n Estimating regression discontinuity
n Checks on internal validity
n Heterogeneous effects & external validity
70
Robustness Tests for Internal Validity
71
Additional check #1 – No manipulation
72
Why manipulation can be problematic…
n Answer = Again, subjects’ ability to
manipulate x can cause violation of local
continuity assumption
q I.e. with manipulation, y might exhibit jump around
x’ absent treatment because of manipulation
n E.g. in Keys, et al. (QJE 2010), the default rate of loans at
FICO = 620 might jump even absent treatment if weak borrowers
manipulate their FICO scores to get the lower interest rates that
one gets immediately with a FICO above 620
73
And, why it isn’t always a problem
n Why isn’t subjects’ ability to
manipulate x always a problem?
q Answer = If they can’t perfectly manipulate it,
then there will still be randomness in treatment
n I.e. in small enough bandwidth around x’, there will
still be randomness because idiosyncratic shocks will
push some above and some below threshold even if
they are trying to manipulate the x
74
An informal test for manipulation
n Look for bunching of observations
immediately above or below threshold
q Any bunching would suggest manipulation
q But, why is this not a perfect test?
n Answer = It assumes manipulation is monotonic;
i.e. all subjects either try to get above or below x’.
This need not be true in all scenarios
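q A minimal Stata sketch of this informal bunching check, again using the FICO example (hypothetical name: x = FICO score):

    * Plot the distribution of the forcing variable in a window around the
    * cutoff; a spike just above (or below) 620 would suggest manipulation
    histogram x if inrange(x, 600, 640), width(1) xline(620)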
75
Additional check #2 – Balance tests
n RDD assumes observations near but on
opposite sides of cutoff are comparable…
so, check this!
q I.e. using graphical analysis or RDD, make
sure other observable factors that might
affect y don’t exhibit jump at threshold x’
q Why doesn’t this test prove validity of RDD?
n Answer: There could be discontinuity in unobservables!
Again, there is no way to prove causality
76
Using covariates instead…
n You could also just add these other variables
that might affect y as controls
q If RDD is internally valid, will these additional
controls effect estimate, and if so, how?
q Answer: Similar to NE, they should only affect
precision of estimate. If they affect the estimated
treatment effect, you’ve got bigger problems; Why?
n You might have ‘bad controls’
n Or, observations around threshold aren’t comparable
77
Additional check #3 – Falsification Tests
n If threshold x’ only existed in certain years
or for certain types of observations, can check
that there is no jump in y at x’ in the years or
for the observations where the threshold did not apply
q E.g. law that created discontinuity was passed in
a given year, but didn’t exist before that, or
maybe the law didn’t apply to some firms
78
Regression Discontinuity – Outline
n Basic idea of regression discontinuity
n Sharp versus fuzzy discontinuities
n Estimating regression discontinuity
n Checks on internal validity
n Heterogeneous effects & external validity
79
Heterogeneous effects (HE)
n If think treatment might differentially affect
observations based on their x, then need a
few additional assumptions for RDD to
identify the local average treatment effect
1. Effect of treatment is locally continuous at x’
2. Likelihood of treatment is always weakly
greater above threshold value x’
3. Effect of treatment and whether observation is
treated is independent of x near x’
[Note: the latter two assumptions only apply to Fuzzy RDD]
80
HE assumption #1
n Assumption that treatment effect is locally
continuous at x’ is typically not problem
q It basically just says that there isn’t any jump in
treatment’s effect at x’; i.e. just again assuming
observations on either side of x’ are comparable
n Note: This might be violated if x’ was chosen because
effect of treatment was thought to be higher for x>x’
[E.g. law and/or regulation that creates discontinuity created
threshold at that point because effect was known to be biggest there]
81
HE assumption #2
n Monotonic effect on likelihood of treatment
usually not a problem either
q Just says that having x > x’ doesn’t make some
observations less likely to be treated and others
more likely to be treated
q This is typically the case, but make sure that it
makes sense in your setting as well
82
HE assumption #3
n Basically is saying ‘no manipulation’
q In practice, it means that observations where the
treatment effect is going to be larger aren’t
manipulating x to be above the threshold, and that the
likelihood of treatment for an individual
observation doesn’t depend on some variable that is
correlated with the magnitude of the treatment effect
83
HE affects interpretation of estimate
n Key with heterogeneity is that you’re only
estimating a local average treatment effect
q Assuming above assumptions hold, estimate
only reveals effect of treatment around
threshold, and for Fuzzy RDD, it only reveals
effect on observations that change treatment
status because of discontinuity
q This limits external validity… How?
84
External validity and RDD [Part 1]
n Answer #1: Identification relies on
observations close to the cutoff threshold
q Effect of treatment might be different for
observations further away from this threshold
q I.e. don’t make broad statements about how
the effect would hold for observations further
from the threshold value of x
85
External validity and RDD [Part 2]
n Answer #2: In fuzzy RDD, treatment is
estimated using only “compliers”
q I.e. we only pick up effect of those where
discontinuity is what pushes them into treatment
n E.g. Suppose you study effect of PhD on wages using
GRE score > x’ with a fuzzy RDD. If discontinuity
only matters for students with mediocre GPA, then you
only estimate effect of PhD for those students
q Same as with IV… be careful to not extrapolate
too much from the findings
86
Summary of Today [Part 1]
87
Summary of Today [Part 2]
88
Summary of Today [Part 3]
89
In First Half of Next Class
n Miscellaneous Issues
q Common data problems
q Industry-adjusting
q High-dimensional FE
90
Assign papers for next week…
n Malenko and Shen (working paper 2015)
q Role of proxy advisory firms
91
Break Time
n Let’s take our 10 minute break
n We’ll do presentations when we get back
92
FNCE 926
Empirical Methods in CF
Lecture 9 – Common Limits & Errors
2
Background readings for today
n Ali, Klasa, and Yeung (RFS 2009)
n Gormley and Matsa (RFS 2014)
3
Outline for Today
n Quick review of last lecture on RDD
n Discuss common limitations and errors
q Our data isn’t perfect
q Hypothesis testing mistakes
q How not to control for unobserved heterogeneity
4
Quick Review[Part 1]
5
Quick Review[Part 2]
n Estimation of sharp RDD
$y_i = \alpha + \beta d_i + f(x_i - x') + d_i \times g(x_i - x') + u_i$
6
Quick Review[Part 3]
n Estimation of fuzzy RDD is similar…
$y_i = \alpha + \beta d_i + f(x_i - x') + u_i$
7
Quick Review [Part 4]
n What are some standard internal validity tests
you might want to run with RDD?
q Answers:
n Check robustness to different polynomial orders
n Check robustness to bandwidth
n Graphical analysis to show discontinuity in y
n Compare other characteristics of firms around cutoff
threshold to make sure no other discontinuities
n And more…
8
Quick Review [Part 5]
n If effect of treatment is heterogeneous,
how does this affect interpretation of RDD
estimates?
q Answer = They take on local average
treatment effect interpretation, and fuzzy RDD
captures only effect of compliers. Neither is
problem for internal validity, but can
sometimes limit external validity of finding
9
Common Limitations & Errors – Outline
n Data limitations
n Hypothesis testing mistakes
n How to control for unobserved heterogeneity
10
Data limitations
n The data we use is almost never perfect
q Variables are often reported with error
q Exit and entry into dataset typically not random
q Datasets only cover certain types of firms
11
Measurement error – Examples
n Variables are often reported with error
q Sometimes it is just innocent noise
n E.g. Survey respondents self report past income
with error [because memory isn’t perfect]
12
Measurement error – Why it matters
n Answer = Depends; but in general, hard
to know exactly how this will matter
q If y is mismeasured…
n If only random noise, just makes SEs larger
n But if systematic in some way [as in second example],
can cause bias if error is correlated with x’s
q If x is mismeasured…
n Even simple CEV causes attenuation bias on
mismeasured x and biases on all other variables
13
Measurement error – Solution
n Standard measurement error solutions
apply [see “Causality” lecture]
q Though admittedly, measurement error is
difficult to deal with unless know exactly
source and nature of the error
14
Survivorship Issues – Examples
n In other cases, observations are included
or missing for systematic reasons; e.g.
q Ex. #1 – Firms that do an IPO and are
added to datasets that cover public firms may
be different than firms that do not do an IPO
q Ex. #2 – Firms adversely affected by some
event might subsequently drop out of data
because of distress or outright bankruptcy
n How can these issues be problematic?
15
Survivorship Issues – Why it matters
n Answer = There is a selection bias,
which can lead to incorrect inferences
q Ex. #1 – E.g. going public may not cause
high growth; it’s just that the firms going
public were going to grow faster anyway
q Ex. #2 – Might not find adverse effect of
event (or might understate its effect) if some
affected firms go bankrupt and are dropped
16
Survivorship Issues – Solution
n Again, no easy solutions; but, if worried
about survivorship bias…
q Check whether treatment (in diff-in-diff) is
associated with observations being more or
less likely to drop from data
q In other analysis, check whether covariates
of observations that drop are systematically
different in a way that might be important
17
Sample is limited – Examples
18
Sample is limited – Why it matters
19
Sample is limited – Solution
20
Interesting Example of Data Problem
21
Ali, et al. – Example data problem [P1]
22
Ali, et al. – Example data problem [P2]
n Ali, et al. (RFS 2009) found it mattered;
using Census measure overturns four
previously published results
q E.g. Concentration is positively related to R&D,
not negatively related as previously argued
q See paper for more details...
23
Common Limitations & Errors – Outline
n Data limitations
n Hypothesis testing mistakes
n How to control for unobserved heterogeneity
24
Hypothesis testing mistakes
n As noted in lecture on natural experiments,
triple-difference can be done by running
double-diff in two separate subsamples
q E.g. estimate effect of treatment on small firms;
then estimate effect of treatment on large firms
25
Example inference from such analysis
Sample =               Small Firms    Large Firms    Low D/E Firms    High D/E Firms
Treatment × Post       0.031          0.104**        0.056            0.081***
                       (0.121)        (0.051)        (0.045)          (0.032)
26
Be careful making such claims!
27
Example triple interaction result
Sample =                        All Firms
Treatment × Post                0.031
                                (0.121)
Treatment × Post × Large        0.073
                                (0.065)
N                               5,432
R-squared                       0.12
Firm dummies                    X
Year dummies                    X
Year × Large dummies            X

[Notes: the difference is not actually statistically significant; remember to interact the year dummies with the triple difference, otherwise the estimates won’t match the earlier subsamples]
28
Practical Advice
n Don’t make claims you haven’t tested;
they could easily be wrong!
q Best to show relevant p-values in text or tables
for any statistical significance claim you make
q If difference isn’t statistically significant
[e.g. p-value = 0.15], can just say so; triple-diffs
are noisy, so this isn’t uncommon
q Or, be more careful in your wording…
n I.e. you could instead say, “we found an effect for large
firms, but didn’t find much evidence for small firms”
29
Common Limitations & Errors – Outline
n Data limitations
n Hypothesis testing mistakes
n How to control for unobserved heterogeneity
q How not to control for it
q General implications
q Estimating high-dimensional FE models
30
Unobserved Heterogeneity – Motivation
n Controlling for unobserved heterogeneity is a
fundamental challenge in empirical finance
q Unobservable factors affect corporate policies and prices
q These factors may be correlated with variables of interest
31
Many different strategies are used
n As we saw earlier, FE can control for unobserved
heterogeneities and provide consistent estimates
n But, there are other strategies also used to control
for unobserved group-level heterogeneity…
q “Adjusted-Y” (AdjY) – dependent variable is
demeaned within groups [e.g. ‘industry-adjust’]
q “Average effects” (AvgE) – uses group mean of
dependent variable as control [e.g. ‘state-year’ control]
32
AdjY and AvgE are widely used
n In JF, JFE, and RFS…
q Used since at least the late 1980s
q Still used, 60+ papers published in 2008-2010
q Variety of subfields; asset pricing, banking,
capital structure, governance, M&A, etc.
33
But, AdjY and AvgE are inconsistent
34
The underlying model [Part 1]
n Recall model with unobserved heterogeneity
$y_{i,j} = \beta X_{i,j} + f_i + \varepsilon_{i,j}$
35
The underlying model [Part 2]
36
The underlying model [Part 3]
37
We already know that OLS is biased
True model is: $y_{i,j} = \beta X_{i,j} + f_i + \varepsilon_{i,j}$
38
Adjusted-Y (AdjY)
n Tries to remove unobserved group heterogeneity by
demeaning the dependent variable within groups
where $\bar{y}_i = \frac{1}{J}\sum_{k \in \text{group } i} \left(\beta X_{i,k} + f_i + \varepsilon_{i,k}\right)$
39
Example AdjY estimation
40
Here is why…
n Rewriting the group mean, we have:
$\bar{y}_i = f_i + \beta \bar{X}_i + \bar{\varepsilon}_i$
41
AdjY can have omitted variable bias
n $\hat{\beta}^{AdjY}$ can be inconsistent when β ≠ 0
True model: $y_{i,j} - \bar{y}_i = \beta X_{i,j} - \beta \bar{X}_i + \varepsilon_{i,j} - \bar{\varepsilon}_i$
But, AdjY estimates: $y_{i,j} - \bar{y}_i = \beta^{AdjY} X_{i,j} + u_{i,j}^{AdjY}$
42
Now, add a second variable, Z
43
AdjY estimates with 2 variables
n With a bit of algebra, it can be shown that (bars denote group means):
$\begin{bmatrix} \hat{\beta}^{AdjY} \\ \hat{\gamma}^{AdjY} \end{bmatrix} = \begin{bmatrix} \beta + \dfrac{\beta\left(\sigma_{XZ}\sigma_{Z\bar{X}} - \sigma_Z^2\sigma_{X\bar{X}}\right) + \gamma\left(\sigma_{XZ}\sigma_{Z\bar{Z}} - \sigma_Z^2\sigma_{X\bar{Z}}\right)}{\sigma_Z^2\sigma_X^2 - \sigma_{XZ}^2} \\ \gamma + \dfrac{\beta\left(\sigma_{XZ}\sigma_{X\bar{X}} - \sigma_X^2\sigma_{Z\bar{X}}\right) + \gamma\left(\sigma_{XZ}\sigma_{X\bar{Z}} - \sigma_X^2\sigma_{Z\bar{Z}}\right)}{\sigma_Z^2\sigma_X^2 - \sigma_{XZ}^2} \end{bmatrix}$
44
Average Effects (AvgE)
45
Average Effects (AvgE)
46
AvgE has measurement error bias
47
AvgE has measurement error bias
n Recall that the group mean is given by $\bar{y}_i = f_i + \beta \bar{X}_i + \bar{\varepsilon}_i$
q Therefore, $\bar{y}_i$ measures $f_i$ with error $-\beta \bar{X}_i - \bar{\varepsilon}_i$
q As is well known, even classical measurement error
causes all estimated coefficients to be inconsistent
48
AvgE estimate of β with one variable
$\hat{\beta}^{AvgE} = \beta + \dfrac{\sigma_{Xf}\left(\beta\sigma_{f\bar{X}} + \beta^2\sigma_{\bar{X}}^2 + \sigma_{\bar{\varepsilon}}^2 - \sigma_{\varepsilon\bar{\varepsilon}}\right) - \beta\sigma_{X\bar{X}}\left(\sigma_f^2 + \beta\sigma_{f\bar{X}} + \sigma_{\varepsilon\bar{\varepsilon}}\right)}{\sigma_X^2\left(\sigma_f^2 + 2\beta\sigma_{f\bar{X}} + \beta^2\sigma_{\bar{X}}^2 + \sigma_{\bar{\varepsilon}}^2\right) - \left(\sigma_{Xf} + \beta\sigma_{X\bar{X}}\right)^2}$
(bars denote group means)
q Determining the magnitude and direction of the bias is difficult
q The covariance between X and $\bar{X}$ is again problematic, but it is not needed for the AvgE estimate to be inconsistent
q Even the non-i.i.d. nature of errors can affect the bias!
49
Comparing OLS, AdjY, and AvgE
n Can use analytical solutions to compare
relative performance of OLS, AdjY, and AvgE
n To do this, we re-express solutions…
q We use correlations (e.g. solve bias in terms of
correlation between X and f, ρ Xf , instead of σ Xf )
q We also assume i.i.d. errors [just makes bias of
AvgE less complicated]
q And, we exclude the observation-at-hand when
calculating the group mean, $\bar{X}_i$…
50
Why excluding Xi doesn’t help
n Quite common for researchers to exclude
observation at hand when calculating group mean
q It does remove the mechanical correlation between X and the
omitted variable, $\bar{X}_i$, but it does not eliminate the bias
q In general, the correlation between X and the omitted variable, $\bar{X}_i$,
is non-zero whenever $\bar{X}_i$ is not the same for every group i
q This variation in means across groups is almost
assuredly true in practice; see paper for details
51
ρXf has large effect on performance
[Figure: simulated estimates of β̂ (true β = 1) for OLS, AdjY, and AvgE as ρXf varies, other parameters held constant (σf/σX = σε/σX = 1, J = 10, ρXi,X−i = 0.5); AdjY is more biased than OLS over much of the range, and AvgE gives the wrong sign for low values of ρXf. A second panel varies the group size J from 0 to 25, holding ρXi,X−i = 0.5 and ρXf = 0.25]
53
Summary of OLS, AdjY, and AvgE
54
Fixed effects (FE) estimation
n Recall: FE adds dummies for each group to OLS
estimation and is consistent because it directly
controls for unobserved group-level heterogeneity
n Can also do FE by demeaning all variables with respect
to group [i.e. do ‘within transformation’] and use OLS
FE estimates: $y_{i,j} - \bar{y}_i = \beta^{FE}\left(X_{i,j} - \bar{X}_i\right) + u_{i,j}^{FE}$
True model: $y_{i,j} - \bar{y}_i = \beta\left(X_{i,j} - \bar{X}_i\right) + \left(\varepsilon_{i,j} - \bar{\varepsilon}_i\right)$
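q A minimal Stata sketch contrasting the three approaches (hypothetical names: y = outcome, x = covariate, grp = group identifier); per the above, only the last is consistent under the true model:

    egen ybar = mean(y), by(grp)   // group mean of the dependent variable

    * AdjY: demean only y within groups, then regress on x
    gen y_adj = y - ybar
    reg y_adj x

    * AvgE: use the group mean of y as a control
    reg y x ybar

    * FE: sweep out the group effect itself (within transformation)
    areg y x, absorb(grp)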
55
Comparing FE to AdjY and AvgE
n To estimate effect of X on Y controlling for Z
q One could regress Y onto both X and Z… Add group FE
q Or, regress residuals from regression of Y on Z
onto residuals from regression of X on Z
[this is the within-group transformation]
§ AdjY and AvgE aren’t the same as finding the
effect of X on Y controlling for Z because...
§ AdjY only partials Z out from Y
§ AvgE uses fitted values of Y on Z as control
56
The differences will matter! Example #1
n Consider the following capital structure regression:
$(D/A)_{i,t} = \alpha + \beta X_{i,t} + f_i + \varepsilon_{i,t}$
q (D/A)it = book leverage for firm i, year t
q Xit = vector of variables thought to affect leverage
q fi = firm fixed effect
57
Estimates vary considerably
[Table: dependent variable = book leverage]
58
The differences will matter! Example #2
59
Estimates vary considerably
[Table: dependent variable = Tobin’s Q; columns report OLS, AdjY, AvgE, and FE estimates]
60
Common Limitations & Errors – Outline
n Data limitations
n Hypothesis testing mistakes
n How to control for unobserved heterogeneity
q How not to control for it
q General implications
q Estimating high-dimensional FE models
61
General implications
62
Other AdjY estimators are problematic
n Same problem arises with other AdjY estimators
q Subtracting off median or value-weighted mean
q Subtracting off mean of matched control sample
[as is customary in studies of the diversification “discount”]
q Comparing “adjusted” outcomes for treated firms pre-
versus post-event [as often done in M&A studies]
q Characteristically adjusted returns [as used in asset pricing]
63
AdjY-type estimators in asset pricing
n Common to sort and compare stock returns across
portfolios based on a variable thought to affect returns
n But, returns are often first “characteristically adjusted”
q I.e. researcher subtracts the average return of a benchmark
portfolio containing stocks of similar characteristics
q This is equivalent to AdjY, where “adjusted returns” are
regressed onto indicators for each portfolio
64
Asset Pricing AdjY – Example
n Asset pricing example; sorting returns based
on R&D expenses / market value of equity
Characteristically adjusted returns by R&D quintile (i.e., AdjY):

            Missing      Q1          Q2          Q3        Q4        Q5
  Return   -0.012***   -0.033***   -0.023***   -0.002     0.008     0.020***
           (0.003)     (0.009)     (0.008)     (0.007)    (0.013)   (0.006)

[Notes: industry-size benchmark portfolios, sorted using R&D/market value; the difference between Q5 and Q1 is 5.3 percentage points]
65
Estimates vary considerably
Dependent variable = yearly stock return

                        AdjY        FE
  R&D missing           0.021**     0.030***
                        (0.009)     (0.010)
  R&D quintile 2        0.01        0.019
                        (0.013)     (0.014)
  R&D quintile 3        0.032***    0.051***
                        (0.012)     (0.018)
  R&D quintile 4        0.041***    0.068***
                        (0.015)     (0.020)
  R&D quintile 5        0.053***    0.094***
                        (0.011)     (0.019)
  Observations          144,592     144,592
  R-squared             0.00        0.47

[Notes: the AdjY column is the same result as the prior sort, but in regression format with quintile 1 excluded; the FE column uses benchmark-period FE to transform both returns and R&D, which is equivalent to a double sort]
66
What if AdjY or AvgE is true model?
§ If data exhibited structure of AvgE estimator,
this would be a peer effects model
[i.e. group mean affects outcome of other members]
§ In this case, none of the estimators (OLS, AdjY,
AvgE, or FE) reveal the true β [Manski 1993;
Leary and Roberts 2010]
67
Common Limitations & Errors – Outline
n Data limitations
n Hypothesis testing mistakes
n How to control for unobserved heterogeneity
q How not to control for it
q General implications
q Estimating high-dimensional FE models
68
Multiple high-dimensional FE
69
LSDV is usually needed with two FE
70
Why such models can be problematic
71
This is growing problem
72
But, there are solutions!
73
#1 – Interacted fixed effects
n Combine multiple fixed effects into a one-
dimensional set of fixed effects, and remove
them using the within transformation
q E.g. firm and industry-year FE could be
replaced with firm-industry-year FE
74
#2 – Memory-saving procedures
n Use properties of sparse matrices to reduce
required memory, e.g. Cornelissen (2008)
n Or, instead iterate to a solution, which
eliminates memory issue entirely, e.g.
Guimaraes and Portugal (2010)
q See paper for details of how each works
q Both can be done in Stata using user-written
commands FELSDVREG and REGHDFE
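q A minimal sketch of the REGHDFE approach (hypothetical names: y = outcome, x = covariate, with firm, industry, and year identifiers):

    * ssc install reghdfe                        // one-time installation
    egen industry_year = group(industry year)    // combined industry-year FE
    reghdfe y x, absorb(firm industry_year) vce(cluster firm)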
75
These methods work…
76
Summary of Today [Part 1]
77
Summary of Today [Part 2]
78
In First Half of Next Class
n Matching
q What it does…
q And, what it doesn’t do
79
Assign papers for next week…
n Gormley and Matsa (working paper, 2015)
q Corporate governance & playing it safe preferences
80
Break Time
n Let’s take our 10 minute break
n We’ll do presentations when we get back
81
FNCE 926
Empirical Methods in CF
Lecture 10 – Matching
2
Announcements – Exercise #4
n Exercise #4 is due next week
q Please upload to Canvas, Thanks!
3
Background readings for today
n Roberts-Whited, Section 6
n Angrist-Pischke, Sections 3.3.1-3.3.3
n Wooldridge, Section 21.3.5
4
Outline for Today
n Quick review of last lecture on “errors”
n Discuss matching
q What it does…
q And what it doesn’t do
5
Quick Review [Part 1]
n What are 3 data limitations to keep in mind?
q #1 – Measurement error; some variables may
be measured with error [e.g. industry concentration
using Compustat] leading to incorrect inferences
q #2 – Survivorship bias; entry and exit of obs.
isn’t random and this can affect inference
q #3 – External validity; our data often only
covers certain types of firms and need to keep
this in mind when making inferences
6
Quick Review [Part 2]
n What is AdjY estimator, and why is it
inconsistent with unobserved heterogeneity?
q Answer = AdjY demeans y with respect to
group; it is inconsistent because it fails to account
for how the group mean of the X’s affects adjusted-Y
n E.g. “industry-adjust”
n Diversification discount lit. has similar problem
n Asset pricing has examples of this [What?]
7
Quick Review [Part 3]
n Comparing characteristically-adjusted stock
returns across portfolios sorted on some
other X is example of AdjY in AP
q What is proper way to control for unobserved
characteristic-linked risk factors?
q Answer = Add benchmark portfolio-period FE
[See Gormley & Matsa (2014)]
8
Quick Review [Part 4]
n What is AvgE estimator; why is it biased?
q Answer = Uses group mean of y as control for
unobserved group-level heterogeneity; biased
because of measurement error problem
9
Quick Review [Part 5]
n What are two ways to estimate model with
two, high-dimensional FE [e.g. firm and
industry-year FE]?
q Answer #1: Create interacted FE and sweep it
away with usual within transformation
q Answer #2: Use iterations to solve FE estimates
10
Matching – Outline
n Introduction to matching
q Comparison to OLS regression
q Key limitations and uses
n How to do matching
n Practical considerations
n Testing the assumptions
n Key weaknesses and uses of matching
11
Matching Methods – Basic Idea [Part 1]
n Matching approach to estimate treatment
effect is very intuitive and simple
q For each treated observation, you find a
“matching” untreated observation that
serves as the de facto counterfactual
q Then, compare outcome, y, of treated
observations to outcome of matched obs.
12
Matching Methods – Basic Idea [Part 2]
n A bit more formally…
q For each value of X, where there is both a
treated and untreated observation…
n Match treated observations with X=X’ to
untreated observations with same X=X’
n Take difference in their outcomes, y
13
Matching Methods – Intuition
n What two things is matching approach
basically assuming about the treatment?
q Answer #1 = Treatment isn’t random; if it
were, would not need to match on X before
taking average difference in outcomes
q Answer #2 = Treatment is random conditional
on X; i.e. controlling for X, untreated outcome
captures the unobserved treated counterfactual
14
Matching is a “Control Strategy”
n Can think of matching as just a way to
control for necessary X’s to ensure CMI
strategy necessary for causality holds
15
Matching and OLS; not that different
16
Matching versus Regression
n Basically, can think of OLS estimate as
particular weighted matching estimator
q Demonstrating this difference in
weighting can be a bit technical…
n See Angrist-Pischke Section 3.3.1 for more
details on this issue, but following example will
help illustrate this…
17
Matching vs Regression – Example [P1]
n Example of difference in weighting…
q First, do simple matching estimate
q Then, do OLS where regress y on
treatment indicator and you control for X’s
by adding indicators for each value of X
n This is very nonparametric and general way to
control for covariates X
n If think about it, this is very similar to
matching; OLS will be comparing outcomes for
treated and untreated with same X’s
18
Matching vs Regression – Example [P2]
n But, even in this example, you’ll get different
estimates from OLS and matching
q Matching gives more weight to obs. with X=X’
when there are more treated with that X’
q OLS gives more weight to obs. with X=X’ when
there is more variation in treatment [i.e. we observe
a more equal ratio of treated & untreated]
19
Matching vs Regression – Bottom Line
n Angrist-Pischke argue that, in general,
differences between matching and OLS
are not of much empirical importance
20
Matching – Key Limitation [Part 1]
21
Matching – Key Limitation [Part 2]
22
Matching – Key Limitation [Part 3]
23
Matching – So, what good is it? [Part 1]
24
Matching – So, what good is it? [Part 2]
25
Matching – Outline
n Introduction to matching
n How to do matching
q Notation & assumptions
q Matching on covariates
q Matching on propensity score
n Practical considerations
n Testing the assumptions
n Key weaknesses and uses of matching
26
First some notation…
n Suppose want to know effect of treatment,
d, where d = 1 if treated, d = 0 if not treated
n Outcome y is given by…
q y(1) = outcome if d = 1
q y(0) = outcome if d = 0
27
Identification Assumptions
28
Assumption #1 – Unconfoundedness
n Outcomes y(0) and y(1) are statistically
independent of treatment, d, conditional
on the observable covariates, X
q I.e. you can think of assignment to treatment
as random once you control for X
29
“Unconfoundedness” explained…
n This assumption is stronger version of
typical CMI assumption that we make
q It is equivalent to saying treatment, d, is
independent of error u, in following regression
y = β0 + β1 x1 + ... + βk xk + γ d + u
n Note: This stronger assumption needed in certain
matching estimators, like propensity score
30
Assumption #2 – Overlap
31
“Overlap” in practice
32
Average Treatment Effect (ATE)
33
Difficulty with exact matching
34
Matching – Outline
n Introduction to matching
n How to do matching
q Notation & assumptions
q Matching on covariates
q Matching on propensity score
n Practical considerations
n Testing the assumptions
n Key weaknesses and uses of matching
35
Matching on Covariates – Step #1
n Select a distance metric, ||Xi – Xj||
q It tells us how far apart the vector of X’s for
observation i are from X’s for observation j
q One example would be Euclidean distance
$\|X_i - X_j\| = \sqrt{\left(X_i - X_j\right)'\left(X_i - X_j\right)}$
36
Matching on Covariates – Step #2
n For each observation, i, find M closest
matches (based on chosen distance metric)
among observations where d ≠ di
q I.e. for a treated observation (i.e. d = 1) find the
M closest matches among untreated observations
q For an untreated observation (i.e. d = 0), find the
M closest matches among treated observations
37
Before Step #3… some notation
n Define lm(i) as mth closest match to
observation i among obs. where d ≠ di
q E.g. suppose obs. i =4 is treated [i.e. d =1]
n l1(4) would represent the closest
untreated observation to observation i = 4
n l2(4) would be the second closest, and so on
38
Matching on Covariates – Step #3
n Create imputed untreated outcome, $\hat{y}_i(0)$,
and treated outcome, $\hat{y}_i(1)$, for each obs. i
$\hat{y}_i(0) = \begin{cases} y_i & \text{if } d_i = 0 \\ \frac{1}{M}\sum_{j \in L_M(i)} y_j & \text{if } d_i = 1 \end{cases}$
$\hat{y}_i(1) = \begin{cases} \frac{1}{M}\sum_{j \in L_M(i)} y_j & \text{if } d_i = 0 \\ y_i & \text{if } d_i = 1 \end{cases}$
In words, what is this doing?
39
Interpretation…
But, we don’t observe the counterfactual, y(0); so, we
estimate it using the average outcome of the M closest
untreated observations!
$\hat{y}_i(0) = \begin{cases} y_i & \text{if } d_i = 0 \\ \frac{1}{M}\sum_{j \in L_M(i)} y_j & \text{if } d_i = 1 \end{cases}$
$\hat{y}_i(1) = \begin{cases} \frac{1}{M}\sum_{j \in L_M(i)} y_j & \text{if } d_i = 0 \\ y_i & \text{if } d_i = 1 \end{cases}$
40
Interpretation…
And vice versa: if obs. i had been untreated, we impute the
unobserved counterfactual using the average outcome of the
M closest treated observations
$\hat{y}_i(0) = \begin{cases} y_i & \text{if } d_i = 0 \\ \frac{1}{M}\sum_{j \in L_M(i)} y_j & \text{if } d_i = 1 \end{cases}$
$\hat{y}_i(1) = \begin{cases} \frac{1}{M}\sum_{j \in L_M(i)} y_j & \text{if } d_i = 0 \\ y_i & \text{if } d_i = 1 \end{cases}$
41
Matching on Covariates – Step #4
n With assumptions #1 and #2, average
treatment effect (ATE) is given by:
$\frac{1}{N}\sum_{i=1}^{N}\left[\hat{y}_i(1) - \hat{y}_i(0)\right]$
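q A minimal Stata sketch of this estimator using the built-in Abadie-Imbens routine (hypothetical names: y = outcome, d = treatment indicator, x1 x2 = covariates):

    * Nearest-neighbor matching on covariates with M = 3 matches (with
    * replacement); reports the ATE by default (add the atet option for ATT)
    teffects nnmatch (y x1 x2) (d), nneighbor(3)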
42
Matching – Outline
n Introduction to matching
n How to do matching
q Notation & assumptions
q Matching on covariates
q Matching on propensity score
n Practical considerations
n Testing the assumptions
n Key weaknesses and uses of matching
43
Matching on propensity score
n Another way to do matching is to first
estimate a propensity score using
covariates, X, and then match on it…
44
Propensity Score, ps(x) [Part 1]
45
Propensity Score, ps(x) [Part 2]
46
Matching on ps(X) – Step #1
n Estimate propensity score, ps(X), for
each observation i
q For example, estimate $d = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u_i$
using OLS, Probit, or Logit
n Common practice is to use Logit with few
polynomial terms for any continuous covariates
q Predicted value for observation i is its
propensity score, ps(Xi) [see the sketch below]
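q A minimal Stata sketch of this step (hypothetical names: y = outcome, d = treatment indicator, x1 = continuous covariate, x2 = another covariate):

    * Logit of treatment on covariates, with a squared term for the
    * continuous covariate; save the predicted probability as ps(X)
    logit d c.x1 c.x1#c.x1 x2
    predict pscore, pr

    * Or let Stata match on the estimated propensity score directly
    * (the treatment model defaults to a logit)
    teffects psmatch (y) (d c.x1 c.x1#c.x1 x2), nneighbor(1)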
47
Tangent about Step #1
n Note: You only need to include X’s
that predict treatment, d
q This may be less than full set of X’s
q In fact, being able to exclude some X’s
(because economic logic suggests they
shouldn’t predict d) can improve finite
sample properties of the matching estimate
48
Matching on ps(X) – Remaining Steps…
n Now, use same steps as before, but choose
M closest matches using observations with
closest propensity score
q E.g. if obs. i is untreated, choose M treated
observations with closest propensity scores
49
Propensity score – Advantage # 1
n Propensity score helps avoid concerns about
subjective choices we make with matching
q As we’ll see next, there are a lot of subjective
choices you need to make [e.g. distance metric,
matching method, etc.] when matching on covariates
50
Propensity score – Advantage # 2
n Can skip matching entirely, and estimate
ATE using sample analog of
$E\left[\dfrac{\left(d_i - ps(X_i)\right) y_i}{ps(X_i)\left(1 - ps(X_i)\right)}\right]$
q See Angrist-Pischke, Section 3.3.2 for more
details about why this works
51
But, there is a disadvantage (sort of)
n Can get lower standard errors by instead
matching on covariates if add more variables
that explain y, but don’t necessarily explain d
q Same as with OLS; more covariates can increase
precision even if not needed for identification
q But, Angrist and Hahn (2004) show that using
ps(X) and ignoring these covariates can actually
result in better finite sample properties
52
Matching – Outline
n Introduction to matching
n How to do matching
n Practical considerations
n Testing the assumptions
n Key weaknesses and uses of matching
53
Practical Considerations
54
Choice of distance metric [Part 1]
55
Choice of distance metric [Part 2]
n Two other possible distance metrics
standardize distances using inverse of
covariates’ variances and covariances
q Abadie and Imbens (2006)
56
Choice of matching approach
n Should you match based on covariates, or
instead match using a propensity score?
q And, if use propensity score, should you use
Probit, Logit, OLS, or nonparametric approach?
57
And, how many matches? [Part 1]
n Again, no clear answer…
n Tradeoff is between bias and precision
q Using single best match will be least biased
estimate of counterfactual, but least precise
q Using more matches increases precision, but
worsens quality of match and potential bias
58
And, how many matches? [Part 2]
n Two ways used to choose matches
q “Nearest neighbor matching”
n This is what we saw earlier; you choose the m matches
that are closest using your distance metric
59
And, how many matches? [Part 3]
n Bottom line advice
q Best to try multiple approaches to ensure
robustness of the findings
n If adding more matches (or expanding radius
in caliper approach) changes estimates, then
bias is potential issue and should probably stick
to smaller number of potential matches
n If not, and only precision increases, then okay
to use a larger set of matches
60
With or without replacement? [Part 1]
n Matching with replacement
q Each observation can serve as a match
for multiple observations
q Produces better matches, reducing
potential bias, but at loss of precision
61
With or without replacement? [Part 2]
n Bottom line advice…
q Roberts-Whited recommend to do
matching with replacement…
n Our goal should be to reduce bias
n In matching without replacement, the order in
which you match can affect estimates
62
Which covariates?
n Need all X’s that affect outcome, y, and
are correlated with treatment, d [Why?]
q Otherwise, you’ll have omitted variables!
63
Matches for whom?
n If use matches for all observations
(as done earlier), you estimate ATE
q But, if only use and find matches for
treated observations, you estimate average
treatment effect on treated (ATT)
q If only use and find matches for untreated,
you estimate average treatment effect on
untreated (ATU)
64
Matching – Outline
n Introduction to matching
n How to do matching
n Practical considerations
n Testing the assumptions
n Key weaknesses and uses of matching
65
Testing “Overlap” Assumption
n If only one X or using ps(X), can just
plot distribution for treated & untreated
n If using multiple X, identify and inspect
worst matches for each x in X
q If difference between match and
observation is large relative to standard
deviation of x, might have problem
66
If there is lack of “Overlap”
n Approach is very subjective…
q Could try discarding observations with
bad matches to ensure robustness
q Could try switching to caliper matching
with propensity score
67
Testing “Unconfoundedness”
n How might you try to test
unconfoundedness assumption?
q Answer = Trick question; you can’t! We
do not observe error, u, and therefore can’t
know if treatment, d, is independent of it!
q Again, we cannot test whether the
equations we estimate are causal!
68
But, there are other things to try…
n Similar to natural experiment, can
do various robustness checks; e.g.
q Test to make sure timing of observed
treatment effect is correct
q Test to make sure treatment doesn’t
affect other outcomes that should,
theoretically, be unaffected
n Or, look at subsamples where treatment
effect should either be larger or smaller
69
Matching – Outline
n Introduction to matching
n How to do matching
n Practical considerations
n Testing the assumptions
n Key weaknesses and uses of matching
70
Weaknesses Reiterated [Part 1]
n As we’ve just seen, there isn’t clear
guidance on how to do matching
q Choices on distance metric, matching
approach, # of matches, etc. are subjective
q Or, what is best way to estimate propensity
score? Logit, probit, nonparametric?
71
Weaknesses Reiterated [Part 2]
72
Tangent – Related Problem
What is wrong
with this claim?
n Often see a researcher estimate:
y = β0 + β1d + ps( X ) + u
q d = indicator for some non-random event
q ps(X) = prop. score for likelihood of treatment
estimated using some fancy, complicated Logit
73
Tangent – Related Problem [Part 2]
n Researcher assumes that observable X
captures ALL relevant omitted variables
q I.e. there aren’t any unobserved variables
that affect y and are correlated with d
q This is often not true… Remember long
list of unobserved omitted factors discussed
in lecture on panel data
74
Another Weakness – Inference
75
Use as a robustness check
76
Use as precursor to regression [Part 1]
77
Use as precursor to regression [Part 2]
78
Matching – Practical Advice
79
Summary of Today [Part 1]
80
Summary of Today [Part 2]
81
In First Half of Next Class
n Standard errors & clustering
q Should you use “robust” or “classic” SE?
q “Clustering” and when to use it
82
Assign papers for next week…
n Morse (JFE 2011)
q Payday lenders
83
Break Time
n Let’s take our 10 minute break
n We’ll quickly cover Heckman selection models
and then do presentations when we get back
84
Heckman selection models
n Motivation
n How to implement
n Limitations [i.e., why I don’t like them]
85
Motivation [Part 1]
n You want to estimate something like…
Yi = bX i + ε i
q Yi = post-IPO outcome for firm i
q Xi = vector of covariates that explain Y
q εi,t = error term
q Sample = all firms that did IPO in that year
86
Motivation [Part 2]
n Answer = certain firms ‘self-select’ to
do an IPO, and the factors that drive
that choice might cause X to be
correlated with εi,t
q It’s basically an omitted variable problem!
q If willing to make some assumptions,
can use Heckman two-step selection
model to control for this selection bias
87
How to implement [Part 1]
n Assume to choice to ‘self-select’ [in this
case, do an IPO] has following form…
$IPO_i = \begin{cases} 1 & \text{if } \gamma Z_i + \eta_i > 0 \\ 0 & \text{if } \gamma Z_i + \eta_i \le 0 \end{cases}$
q Zi = factors that drive choice [i.e., IPO]
q ηi,t = error term for this choice
88
How to implement [Part 2]
n Regress choice variable (i.e., IPO) onto
Z using a Probit model
n Then, use predicted values to calculate
the Inverse Mills Ratio for each
observation, λi = ϕ(γZi)/Φ(γZi)
n Then, estimate original regression of Yi
onto Xi, but add λi as a control!
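q A minimal Stata sketch of the two steps just described (hypothetical names: y = post-IPO outcome, x1 x2 = the X’s, ipo = selection indicator, z = a variable in Z):

    * Step 1: Probit of the selection choice, then the inverse Mills ratio
    probit ipo x1 x2 z
    predict zg, xb
    gen mills = normalden(zg)/normal(zg)

    * Step 2: outcome regression on the selected sample, adding the ratio
    * (note: these second-step SEs ignore the generated regressor)
    reg y x1 x2 mills if ipo == 1

    * Built-in equivalent, which adjusts the second-step standard errors
    heckman y x1 x2, select(ipo = x1 x2 z) twostep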
89
Limitations [Part 1]
n Model for choice [i.e., first step of the estimation]
must be correct; otherwise inconsistent!
n Requires assumption that the errors, ε and η,
have a bivariate normal distribution
q Can’t test, and no reason to believe this is true
[i.e., what is the economic story behind this?]
q And, if wrong… estimates are inconsistent!
90
Limitations [Part 2]
n Can technically work if Z is just a subset of
the X variables [which is commonly what people
seem to do], but…
q But, in this case, all identification relies on non-
linearity of the inverse mills ratio [otherwise, it
would be collinear with the X in the second step]
q But again, this is entirely dependent on the
bivariate normality assumption and lacks
any economic intuition!
91
Limitations [Part 3]
n When Z has variables not in X [i.e., excluded
instruments], then could just do IV instead!
q I.e., estimate Yi = bX i + IPOi + ε i on full sample
using excluded IVs as instruments for IPO
q Avoids unintuitive, untestable assumption of
bivariate normal error distribution!
92
FNCE 926
Empirical Methods in CF
Lecture 11 – Standard Errors & Misc.
2
Background readings for today
n Readings for standard errors
q Angrist-Pischke, Chapter 8
q Bertrand, Duflo, Mullainathan (QJE 2004)
q Petersen (RFS 2009)
n Readings for limited dependent variables
q Angrist-Pischke, Sections 3.4.2 and 4.6.3
q Greene, Section 17.3
3
Outline for Today
n Quick review of last lecture on matching
n Discuss standard errors and clustering
q “Robust” or “Classical”?
q Clustering: when to do it and how
4
Quick Review [Part 1]
n Matching is intuitive method
q For each treated observation, find comparable
untreated observations with similar covariates, X
n They will act as estimate of unobserved counterfactual
n Do the same thing for each untreated observation
5
Quick Review [Part 2]
n But, what are necessary assumptions for
this approach to estimate ATE?
q Answer #1 = Overlap… Need both treated
and control observations for X’s
q Answer #2 = Unconfoundedness… Treatment is
as good as random after controlling for X
6
Quick Review [Part 3]
n Matching is just a control strategy!
q It does NOT control for unobserved variables
that might pose identification problems
q It is NOT useful in dealing with other problems
like simultaneity and measurement error biases
7
Quick Review [Part 4]
n Relative to OLS estimate of treatment effect…
q Matching basically just weights differently
q And, doesn’t make functional form assumption
n Angrist-Pischke argue you typically won’t find large
difference between two estimates if you have right
X’s and flexible controls for them in OLS
8
Quick Review [Part 5]
n Many choices to make when matching
q Match on covariates or propensity score?
q What distance metric to use?
q What # of observations?
9
Standard Errors & LDVs – Outline
n Getting your standard errors correct
q “Classical” versus “Robust” SE
q Clustered SE
10
Getting our standard errors correct
n It is important to make sure we get our
standard errors correct so as to avoid
misleading or incorrect inferences
q E.g. standard errors that are too small will cause
us to reject the null hypothesis that our
estimated β’s are equal to zero too often
n I.e. we might erroneously claim to have found a
“statistically significant” effect when none exists
11
Homoskedastic or Heteroskedastic?
12
“Classical” versus “Robust” SEs [Part 1]
n What do the default standard errors
reported by programs like Stata assume?
q Answer = Homoskedasticity! This is what
we refer to as “classical” standard errors
n As we discussed in earlier lecture, this is typically
not a reasonable assumption to make
n “Robust” standard errors allow for
heteroskedasticity and don’t make this assumption
13
“Classical” versus “Robust” SEs [Part 2]
n Putting aside possible “clustering” (which
we’ll discuss shortly), should you always
use robust standard errors?
q Answer = Not necessarily! Why?
n Asymptotically, “classical” and “robust” SE are
correct, but both suffer from a finite sample bias that
will tend to make them too small in small samples
n “Robust” can sometimes be smaller than “classical”
SE because of this bias or simple noise!
14
Finite sample bias in standard errors
n Finite sample bias is easily corrected in
“classical” standard errors
[Note: this is done automatically by Stata]
n This is not so easy with “robust” SEs…
q Small sample bias can be worse with
“robust” standard errors, and while finite
sample corrections help, they typically don’t
fully remove the bias in small samples
15
Many different corrections are available
n Number of methods developed to try and
correct for this finite-sample bias
q By default, Stata automatically does one of these
when you use vce(robust) to calculate SE
q But, there are other ways as well; e.g.,
n regress y x, vce(hc2)
n regress y x, vce(hc3)
[Developed by Davidson and MacKinnon (1993);
these work better when heteroskedasticity is worse]
16
Classical vs. Robust – Practical Advice
n Compare the robust SE to the classical SE
and take maximum of the two
q Angrist-Pischke argue that this will tend to be
closer to the true SE in small samples that
exhibit heteroskedasticity
n If small sample bias is real concern, might want to
use HC2 or HC3 instead of typical “robust” option
n While SE using this approach might be too large if the
data is actually homoskedastic, this is less of a concern
17
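For example, a simple way to run this comparison yourself (variable names hypothetical):

regress y x                  // classical SEs (assume homoskedasticity)
regress y x, vce(robust)     // heteroskedasticity-robust SEs
regress y x, vce(hc3)        // HC3 correction; more conservative in small samples
* Report whichever of the classical and robust SEs is larger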
Standard Errors & LDVs – Outline
n Getting your standard errors correct
q “Classical” versus “Robust” SE
q Clustered SE
n Violation of independence and implications
n How big of a problem is it? And, when?
n How do we correct for it with clustered SE?
n When might clustering not be appropriate?
18
Clustered SE – Motivation [Part 1]
n “Classical” and “robust” SE depend
on assumption of independence
q i.e. our observations of y are random
draws from some population and are
hence uncorrelated with other draws
q Can you give some examples where this
is likely an unrealistic assumption in CF? [E.g. think
of a firm-level capital structure panel regression]
19
Clustered SE – Motivation [Part 2]
n Example Answers
q Firm’s outcome (e.g. leverage) is likely
correlated with other firms in same industry
q Firm’s outcome in year t is likely correlated
to outcome in year t-1, t-2, etc.
20
Clustered SE – Motivation [Part 3]
n Moreover, this non-independence can
cause significant downward biases in
our estimated standard errors
q E.g. standard errors can easily double,
triple, etc. once we correct for this!
q This is different from correcting for
heteroskedasticity (i.e. “classical” vs. “robust”),
which tends to increase SE by, at most, about
30% according to Angrist-Pischke
21
Example violations of independence
n Violations tend to come in two forms
#1 – Cross-sectional “Clustering”
n E.g. outcome, y, [e.g. ROA] for a firm tends to be
correlated with y of other firms in same industry
because they are subject to same demand shocks
#2 – Time-series “Clustering”
n E.g. a firm’s outcome in year t tends to be correlated
with its own outcome in years t–1, t–2, etc.
22
Violation means non-i.i.d. errors
n Such violations basically mean that our
errors, u, are not i.i.d. as assumed
q Specifically, you can think of the errors as
being correlated in groups, where
yig = β0 + β1xig + uig
[i indexes the observation, g its group]
n var(uig) = σu² > 0
n corr(uig, ujg) = ρuσu² > 0
q ρu is called the “intra-class correlation coefficient”;
“robust” and “classical” SEs assume it is zero
23
“Cluster” terminology
n Key idea: errors are correlated within groups
(i.e. clusters), but not correlated across them
q In cross-sectional setting with one time period,
cluster might be industry; i.e. obs. within industry
correlated but obs. in different industries are not
q In time series correlation, you can think of the
“cluster” as the multiple observations for each
cross-section [e.g. obs. on firm over time are the cluster]
24
Why are classical SE too low?
n Intuition…
q Broadly speaking, you don’t have as much
random variation as you really think you do
when calculating your standard errors; hence,
your standard errors are too small
n E.g. if you double the # of observations by just
replicating existing data, your classical SE will go
down even though there is no new information; Stata
does not realize the observations are not independent
[see the simulation sketch below]
25
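A small illustrative simulation of this intuition (all numbers made up): errors share a group-level component, so classical SEs overstate precision while clustered SEs do not.

clear
set seed 12345
set obs 50
gen g  = _n                       // 50 groups
gen vg = rnormal()                // group-level error component
expand 20                         // 20 observations per group
gen x = rnormal() + vg            // x is also correlated within groups
gen y = 1 + 0*x + vg + rnormal()  // true coefficient on x is zero
regress y x                       // classical SEs are too small here
regress y x, vce(cluster g)       // clustered SEs are appropriately larger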
Standard Errors & LDVs – Outline
n Getting your standard errors correct
q “Classical” versus “Robust” SE
q Clustered SE
n Violation of independence and implications
n How big of a problem is it? And, when?
n How do we correct for it with clustered SE?
n When might clustering not be appropriate?
26
How large, and what’s important?
n By assuming a structure for the non-i.i.d.
nature of the errors, we can derive a
formula for how large the bias will be
n Can also see that two factors are key
q Magnitude of intra-class correlation in u
q Magnitude of intra-class correlation in x
27
Random effect version of violation
n To do this, we will assume the within-group
correlation is driven by a random effect
yig = β0 + β1xig + vg + ηig, where uig = vg + ηig
q All within-group correlation is captured by the
random effect vg, and corr(ηig, ηjg) = 0
q In this case, the intra-class correlation
coefficient is ρu = σv² / (σv² + ση²)
28
Moulton Factor
n With this setting and a constant # of
observations per group, n, we can show that
SE(β̂1) / SEc(β̂1) = [1 + (n − 1)ρu]^(1/2)
q SE(β̂1) = correct SE of the estimate
q SEc(β̂1) = “classical” SE you get when you
don’t account for the correlation
q This ratio is called the “Moulton Factor”; it tells
you how much larger the corrected SE will be
29
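A quick worked example (numbers made up): with n = 20 observations per group and ρu = 0.3, the correct SEs are roughly 2.6 times the classical ones.

display sqrt(1 + (20 - 1)*0.3)    // Moulton Factor ≈ 2.59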
Moulton Factor – Interpretation
SE(β̂1) / SEc(β̂1) = [1 + (n − 1)ρu]^(1/2)
30
What affects the Moulton Factor?
SE(β̂1) / SEc(β̂1) = [1 + (n − 1)ρu]^(1/2)
q When does the Moulton Factor equal 1 [i.e., no bias]?
q What happens as the intra-class correlation, ρu, rises?
q What happens as the # of observations per group, n, rises?
31
Answers about Moulton Factor
n Answer #1 = (1) ρu = 0 implies each additional obs. provides new
info. (as if they are i.i.d.), and (2) n = 1 implies there aren’t
multiple obs. per cluster, so correlation is meaningless
n Answer #2 = Higher intra-class correlation ρu means that
new observations within groups provide even less new
information, but classical standard errors don’t realize this
n Answer #3 = Classical SE thinks each additional obs. adds
information, when in reality, it isn’t adding that much. So,
bias is worse with more observations per group.
32
Bottom line…
n Moulton Factor basically shows that
downward bias is greatest when…
q Dependent variable is highly correlated
across observations within group
[e.g. high time series correlation in panel]
q And, we have a large # of observations per
group [e.g. large # of years in panel data]
33
Moulton Factor with uneven group sizes
SE(β̂1) / SEc(β̂1) = [1 + ( V(ng)/n̄ + n̄ − 1 ) ρxρu]^(1/2)
n ng = size of group g
n V(ng) = variance of group sizes
n n̄ = average group size
n ρu = intra-class correlation of errors, u
n ρx = intra-class correlation of covariate, x
34
Importance of non-i.i.d. x’s [Part 1]
SE(β̂1) / SEc(β̂1) = [1 + ( V(ng)/n̄ + n̄ − 1 ) ρxρu]^(1/2)
q Note that the downward bias also depends on ρx, the
intra-class correlation of the covariate x; if x is i.i.d.
across observations (ρx = 0), the factor equals 1
35
Importance of non-i.i.d. x’s [Part 2]
36
Standard Errors & LDVs – Outline
n Getting your standard errors correct
q “Classical” versus “Robust” SE
q Clustered SE
n Violation of independence and implications
n How big of a problem is it? And, when?
n How do we correct for it with clustered SE?
n When might clustering not be appropriate?
37
How do we correct for this?
n There are many possible ways
q If you think the error structure is random effects,
as modeled earlier, then you could just multiply
SEs by the Moulton Factor…
q But, more common way, which allows for
any type of within-group correlation, is to
“cluster” your standard errors
n Implemented in Stata using vce(cluster variable)
option in estimation command
38
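For example, in a firm-level panel where you think errors are correlated within firms over time (firm is a hypothetical identifier variable):

regress y x, vce(cluster firm)    // allows arbitrary within-firm error correlation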
Clustered Standard Errors
n Basic idea is that it allows for any type
of correlation of errors within group
q E.g. if “cluster” was a firm’s observations
for years 1, 2, …, T, then it would allow
corr(ui1, ui2) to be different than corr(ui1, ui3)
n The Moulton factor approach would assume these
are all the same, which may be wrong
39
Clustering – Cross-Sectional Example #1
n Cross-sectional firm-level regression
yij = β0 + β1 x j + β2 zij + uij
q yij is outcome for firm i in industry j
q xj only varies at industry level
q zij varies within industry
q How should you cluster?
n Answer = Cluster at the industry level. Observations
might be correlated within industries and one of the
covariates, x, is perfectly correlated within industries
40
Clustering – Cross-Sectional Example #2
n Panel firm-level regression
yijt = β0 + β1 x jt + β2 zijt + uijt
q yijt is outcome for firm i in industry j in year t
q If you think firms are subject to similar industry
shocks over time, how might you cluster?
n Answer = Cluster at the industry-year level. Obs.
might be correlated within industries in a given year
n But, what is probably even more appropriate?
41
Clustering – Time-series example
n Answer = cluster at industry level!
q This allows errors to be correlated over time
within industries, which is very likely to be the
true nature of the data structure in CF
n E.g. Shock to y (and error u) in industry j in year t is
likely to be persistent and still partially present in
year t+1 for many variables we analyze. So,
corr(uijt, uijt+1) is not equal to zero. Clustering at
industry level would account for this; clustering at
industry-year level does NOT allow for any
correlation across time
42
Time-series correlation
n Such time-series correlation is very
common in corporate finance
q E.g. leverage, size, etc. are all persistent over time
q Clustering at industry, firm, or state level is a non-
parametric and robust way to account for this!
43
Such serial correlation matters…
n When the non-i.i.d. structure comes from serial
correlation, the number of obs. per group,
n, is the number of years for each unit of the panel
q Thus, the downward bias of classical or robust SE
will be greater when we have more years of data!
q This can matter a lot in diff-in-diff… [Why?
Hint… there are three potential reasons]
44
Serial correlation in diff-in-diff [Part 1]
n Serial correlation is particularly important in
difference-in-differences because…
#1 – Treatment indicator is highly correlated over
time! [E.g. for untreated firms it stays zero the entire
time, and for treated firms it stays equal to 1 after treatment]
#2 – We often have multiple pre- and post-treatment
observations [i.e. many observations per group]
#3 – And, dependent variables typically used often
have a high time-series dependence to them
45
Serial correlation in diff-in-diff [Part 2]
n Bertrand, Duflo, and Mullainathan (QJE
2004) shows how bad this SE bias can be…
q In a standard type of diff-in-diff where the true β=0,
you’ll find a significant effect at the 5% level in as
many as 45 percent of the cases!
n Remember… you should only reject null hypothesis
5% of time when the true effect is actually zero!
46
Firm FE vs. firm clusters
47
Firm FE vs. firm clusters [Part 1]
48
Firm FE vs. firm clusters [Part 2]
49
Firm FE vs. firm clusters [Part 3]
n Why should we still cluster at the firm level
even if we already have firm FE?
q Answer = Firm FE removes time-invariant
heterogeneity, fi, from error term, but it doesn’t
account for possible serial correlation!
n I.e. vit might still be correlated with vit-1, vit-2, etc.
n E.g. firm might get hit by shock in year t, and effect
of that shock only slowly fades over time
50
Firm FE vs. firm clusters [Part 4]
n Will we get consistent estimates with both
firm FE and firm clusters if serial dependence
in error is driven by time-varying omitted
variable that is correlated with x?
q Answer = No!
n Clustering only corrects SEs; it doesn’t deal with potential
bias in estimates because of an omitted variable problem!
n And, Firm FE isn’t sufficient in this case either because
omitted variable isn’t time-invariant
51
Clustering – Practical Advice [Part 1]
n Cluster at most aggregate level of
variation in your covariates
q E.g. if one of your covariates only varies at
industry or state level, cluster at that level
52
Clustering – Practical Advice [Part 2]
n Clustering is not a substitute for FE
q Should use both FE to control for
unobserved heterogeneity across groups
and clustered SE to account for remaining
serial correlation in y
53
Standard Errors & LDVs – Outline
n Getting your standard errors correct
q “Classical” versus “Robust” SE
q Clustered SE
n Violation of independence and implications
n How big of a problem is it? And, when?
n How do we correct for it with clustered SE?
n When might clustering not be appropriate?
54
Need enough clusters…
n Asymptotic consistency of estimated
clustered standard errors depends on #
of clusters, not # of observations
q I.e. only guaranteed to get precise estimate of
correct SE if we have a lot of clusters
q If too few clusters, SE will be too low!
n This leads to practical questions like… “If I do
firm-level panel regression with 50 states and cluster
at state level, are there enough clusters?”
55
How important is this in practice?
n Unclear, but probably not a big problem
q Simulations of Bertrand, et al (QJE 2004)
suggest 50 clusters was plenty in their setting
n In fact, bias wasn’t that bad with 10 states
n This is consistent with Hansen (JoE 2007), which
finds that 10 clusters is enough when using
clusters to account for serial correlation
56
If worried about # of clusters…
n You can try aggregating the data to
remove time-series variation
q E.g. in diff-in-diff, you would collapse data
into one pre- and one post-treatment
observation for each firm, state, or industry
[depending on what level you think is non-i.i.d], and
then run the estimation
n See Bertrand, Duflo, and Mullainathan (QJE 2004)
for more details on how to do this
57
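A minimal sketch of this collapse approach for a firm-level diff-in-diff, with hypothetical variables y, treated, post, and firm:

* Average into one pre- and one post-treatment observation per firm
collapse (mean) y, by(firm treated post)
* Then estimate the diff-in-diff on the collapsed data
regress y i.treated##i.post, vce(cluster firm)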
Cautionary Note on aggregating
n Can have very low power
q Even if true β≠0, aggregating approach can
often fail to reject the null hypothesis
58
Double-clustering
n Petersen (2009) emphasized idea of
potentially clustering in second dimension
q E.g. cluster for firm and cluster for year
[Note: this is not the same as a firm-year cluster!]
q Additional year cluster allows errors within
year to be correlated in arbitrary ways
n Year FE removes common error each year
n Year clusters allows for things like when Firm A
and B are highly correlated within years, but Firm
A and C are not [I.e. it isn’t a common year error]
59
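Two-way clustering is not built into regress; one common route (a sketch only, with hypothetical names) is the user-written reghdfe command:

* ssc install reghdfe               // user-written command; install once
reghdfe y x, absorb(year) vce(cluster firm year)   // year FE plus firm and year clusters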
But is double-clustering necessary?
n In asset pricing, YES; in corporate
finance… unclear, but probably not
q In asset pricing, makes sense… some firms
respond more to systematic shocks across
years [i.e. high equity beta firms!]
q But, it is harder to think why the correlation of
errors in a year would consistently differ
across firms for CF variables
n Petersen (2009) finds evidence consistent with
this; adding year FE is probably sufficient in CF
60
Clustering in Panels – More Advice
n Within Stata, two commands can do the
fixed effects estimation for you
q xtreg, fe
q areg
n They are identical, except when it comes
to the cluster-robust standard errors
q xtreg, fe cluster-robust SE are smaller because
it doesn’t adjust the degrees of freedom (DoF)
when clustering!
61
Clustering – xtreg, fe versus areg
n xtreg, fe is appropriate when FE are nested
within clusters, which is commonly the case
[See Wooldridge 2010, Chapter 20]
q E.g. firm fixed effects are nested within firm,
industry or state clusters. So, if you have firm FE
and cluster at firm, industry, or state, use xtreg, fe
q Note: xtreg, fe will give you an error if FE aren’t
nested in clusters; then you should use areg
62
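A side-by-side sketch with a hypothetical firm panel (identifiers firm and year); the point estimates match, but the clustered SEs differ because of the DoF adjustment:

xtset firm year
xtreg y x i.year, fe vce(cluster firm)             // FE nested within the firm clusters
areg  y x i.year, absorb(firm) vce(cluster firm)   // same betas; larger clustered SEs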
Standard Errors & LDVs – Outline
n Getting your standard errors correct
q “Classical” versus “Robust” SE
q Clustered SE
63
Limited dependent variables (LDV)
64
Common misperception about LDVs
n It is often thought that LDVs
shouldn’t be estimated with OLS
q I.e. can’t get causal effect with OLS
q Instead, people argue you need to use
estimators like Probit, Logit, or Tobit
65
Linear probability model (LPM)
66
Logit & Probit [Part 1]
n Basically, they assume a latent model
y* = x′β + u
where x is a vector of controls, including a constant
67
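One standard way to complete this latent-index setup (the usual textbook formulation):
y = 1 if y* > 0, and y = 0 otherwise, so that P(y = 1 | x) = G(x′β)
where G is the standard normal CDF for Probit and the logistic CDF, G(z) = e^z/(1 + e^z), for Logit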
What are Logit & Probit? [Part 2]
68
What are Logit & Probit? [Part 3]
69
Logit, Probit versus LPM
70
NO! LPM is okay to use
71
Instrumental variables and LDV
72
Caveat – Treatment with covariates
73
Angrist-Pischke view on OLS [Part 1]
n OLS still gives best linear approx. of CEF
under less restrictive assumptions
q If non-linear CEF has causal interpretation, then
OLS estimate has causal interpretation as well
q If assumptions about distribution of error are
correct, non-linear models (e.g. Logit, Probit, and
Tobit) basically just provide efficiency gain
74
Angrist-Pischke view on OLS [Part 2]
75
Angrist-Pischke view on OLS [Part 3]
n Lastly, in practice, marginal effects from
Probit, Logit, etc. will be similar to OLS
q True even when the average y is close to either 0 or 1
(i.e. there are a lot of zeros or a lot of ones)
76
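A quick way to see this in practice (hypothetical binary outcome d and covariates x1 x2): compare the LPM coefficients to the average marginal effects from Probit and Logit.

regress d x1 x2, vce(robust)    // linear probability model
probit  d x1 x2
margins, dydx(*)                // average marginal effects, usually close to the LPM
logit   d x1 x2
margins, dydx(*)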
One other problem…
77
One last thing to mention…
n With non-negative outcome y and
random treatment indicator, d
q OLS still correctly estimates ATE
q But, don’t condition on y > 0 when selecting
your sample; that messes things up!
n This is equivalent to “bad control” in that you’re
implicitly controlling for whether y > 0, which is
also outcome of treatment!
n See Angrist-Pischke, pages 99-100
78
Summary of Today [Part 1]
79
Summary of Today [Part 2]
80
Summary of Today [Part 3]
81
In First Half of Next Class
n Randomized experiments
q Benefits…
q Limitations
82
In Second Half of Next Class
n Papers are not necessarily connected
to today’s lecture on standard errors
83
Assign papers for next week…
n Heider and Ljungqvist (JFE Forthcoming)
q Capital structure and taxes
84
Break Time
n Let’s take our 10 minute break
n We’ll do presentations when we get back
85