Handout 6 Causality


EC226 (Term 2: Handout 1) 1 SELECTION PROBLEM

Estimating Causal Effects


Readings:
• Stock and Watson (2003): Chapter 4 + 5

• Dougherty (2016): Chapter 2

• Wooldridge (2013): Chapter 2

1 Selection Problem
Consider the problem considered by Angrist and Pischke (2009) of whether hospitals make people
healthier. The 2005 National Health Interview Survey (NHIS) asks two relevant questions: “During
the past 12 months, was the respondent a patient in a hospital overnight?” and “Would you say your
health in general is excellent, very good, good, fair, poor?” (coded on a scale from 1 (excellent) to
5 (poor)). The responses are summarised as:
Group          Sample Size   Mean health status   Std. Error
Hospital              7774                 2.79        0.014
No Hospital          90049                 2.07        0.003
Now assume we have the model:

$$ Y_i = \alpha + \beta D_i + \varepsilon_i \tag{1} $$

where Yi is the response to the health question and Di = 1 if the individual went to hospital and 0
otherwise. The OLS estimates would be:

$$ \hat{Y}_i = 2.07 + \underset{(49.2)}{0.72}\, D_i $$

with the t-ratio in parentheses. Since a higher score means poorer health, this implies that going to
hospital significantly worsens one’s health! But people who go to hospital are almost certainly less
healthy to begin with than those who do not go to hospital.
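What we want to observe is both potential outcomes for each individual:

$$ \text{Potential outcome} = \begin{cases} Y_{1i} & \text{if } D_i = 1 \\ Y_{0i} & \text{if } D_i = 0 \end{cases} $$

where Y0i is the health status of the individual if they did not go to hospital and Y1i is the health
status of the individual if they did go to hospital. We want to measure E(Y1i − Y0i), the Average
Treatment Effect (ATE). Unfortunately, we only observe one of the two outcomes for each individual:

$$ Y_i = \begin{cases} Y_{1i} & \text{if } D_i = 1 \\ Y_{0i} & \text{if } D_i = 0 \end{cases} \quad\Rightarrow\quad Y_i = Y_{0i} + (Y_{1i} - Y_{0i}) D_i $$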

So we must try and learn about the effects of hospitalization by comparing the average health of
those who were and were not hospitalized.

$$
\underbrace{E(Y_i \mid D_i = 1) - E(Y_i \mid D_i = 0)}_{\text{Observed difference}}
= \underbrace{E(Y_{1i} \mid D_i = 1) - E(Y_{0i} \mid D_i = 1)}_{\text{ATT (ave. treatment effect on treated)}}
+ \underbrace{E(Y_{0i} \mid D_i = 1) - E(Y_{0i} \mid D_i = 0)}_{\text{Selection bias}}
$$


We want the first term on the right-hand side of this equation, the ATT, as this is our best estimate
of the ATE: it is the health difference for the hospitalised relative to what we would have observed
had they not gone to hospital. The difference we observe also includes the selection bias term, which
is the difference in health without hospitalisation (Y0i) between those who did and those who did not
go to hospital. The bias arises in the regression above because we believe
E(Y0i | Di = 1) − E(Y0i | Di = 0) > 0: those who go to hospital would have had poorer health (a
higher score) in any case.
So we need solutions to the bias in OLS estimation of the above equation.
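The decomposition can be checked numerically. The sketch below uses six invented individuals for whom, unlike in real data, both potential outcomes are known; the values are purely illustrative and are not taken from the NHIS. It shows that the observed difference in means equals the ATT plus the selection bias.

```python
from fractions import Fraction

# Hypothetical individuals: (Y0, Y1, D). Health is on the 1 (excellent) to
# 5 (poor) scale, so a lower number is better. All values are invented.
people = [
    (4, 3, 1), (5, 4, 1), (4, 3, 1),   # went to hospital (D = 1)
    (2, 2, 0), (2, 1, 0), (1, 1, 0),   # did not go to hospital (D = 0)
]

def mean(values):
    """Exact mean of a sequence of integers."""
    values = list(values)
    return Fraction(sum(values), len(values))

treated = [p for p in people if p[2] == 1]
untreated = [p for p in people if p[2] == 0]

# What a survey sees: Y1 for the treated, Y0 for the untreated.
observed_diff = mean(y1 for _, y1, _ in treated) - mean(y0 for y0, _, _ in untreated)

# ATT: E(Y1 | D = 1) - E(Y0 | D = 1), only computable because Y0 is known here.
att = mean(y1 - y0 for y0, y1, _ in treated)

# Selection bias: E(Y0 | D = 1) - E(Y0 | D = 0).
selection_bias = mean(y0 for y0, _, _ in treated) - mean(y0 for y0, _, _ in untreated)

print("observed difference:", observed_diff)   # 5/3
print("ATT:", att)                             # -1
print("selection bias:", selection_bias)       # 8/3
```

In this invented population hospital improves health by one point for everyone treated (ATT = −1), yet the observed difference is positive because the selection bias (8/3) outweighs it.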


2 Random Assignment
If Di is randomly assigned then E(Y0i | Di = 1) = E(Y0i | Di = 0), hence the selection bias is zero
and:

$$
\underbrace{E(Y_i \mid D_i = 1) - E(Y_i \mid D_i = 0)}_{\text{Observed difference}}
= \underbrace{E(Y_{1i} \mid D_i = 1) - E(Y_{0i} \mid D_i = 1)}_{\text{ATT}}
$$

In which case we can estimate equation (1) by OLS.
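As a short worked step (standard potential-outcomes algebra, added here as a sketch rather than taken from the handout): random assignment makes Di independent of both potential outcomes, so the conditioning on Di drops out and the observed difference equals the ATE as well as the ATT.

```latex
\begin{aligned}
E(Y_i \mid D_i = 1) - E(Y_i \mid D_i = 0)
  &= E(Y_{1i} \mid D_i = 1) - E(Y_{0i} \mid D_i = 0) \\
  &= E(Y_{1i}) - E(Y_{0i}) \\
  &= E(Y_{1i} - Y_{0i}) = \text{ATE}
\end{aligned}
```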


3 IV estimation
Find a suitable instrument (or instruments) for Di. Assume this is a binary variable Zi which is
relevant and exogenous: εi ⇏ Zi (e.g. Z is randomly assigned) and Zi ⇏ εi (Z only affects Y through
D). In this case the IV estimator can be written as:

$$ b_{IV} = \frac{\bar{Y}_1 - \bar{Y}_0}{\bar{D}_1 - \bar{D}_0} $$

where Ȳ1, D̄1 are the means of the appropriate variable when Z = 1 and Ȳ0, D̄0 are the means of
the appropriate variable when Z = 0.
In this case, there are a series of different types of individuals we have to consider, defined by the
manner in which they react to the instrument:

                 Z = 0          Z = 1
Compliers        Not treated    Treated
Always Takers    Treated        Treated
Never Takers     Not treated    Not treated
Defiers          Treated        Not treated

We assume that there are no Defiers (this is known as the monotonicity assumption). If one estimates
the effect on all that are treated then we are likely to have a bias in the coefficient estimates, as the
Never Takers and the Always Takers are unlikely to be a random sample of the population.
One can estimate the effect on Y of Z = 0/1 directly, but this gives an Intention to Treat (ITT)
estimator, which is not the same as the ATT. The IV estimator gives the Local Average Treatment
Effect (LATE), which is based only on the Compliers, as treatment status does not vary with the
instrument for Always Takers and Never Takers.
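To make the ITT, first stage and IV (Wald) estimator concrete, here is a minimal sketch on an invented population. The types, baseline outcomes and treatment effects are all hypothetical, and each type appears equally often in both arms to mimic random assignment of Z exactly. The effect given to the never taker is arbitrary, since it is never realised.

```python
from fractions import Fraction

def make_arm(z):
    """One arm of the hypothetical population: (type, Z, Y0, effect of D on Y)."""
    return [
        ("complier", z, 50, 4),
        ("complier", z, 54, 4),
        ("always_taker", z, 70, 10),
        ("never_taker", z, 40, 0),
    ]

population = make_arm(0) + make_arm(1)

def treatment(kind, z):
    """Treatment actually taken by each type (no defiers, i.e. monotonicity)."""
    if kind == "always_taker":
        return 1
    if kind == "never_taker":
        return 0
    return z  # compliers do what the instrument tells them

def mean(values):
    values = list(values)
    return Fraction(sum(values), len(values))

# Observed data: (Z, D, Y) with Y = Y0 + effect * D.
rows = []
for kind, z, y0, effect in population:
    d = treatment(kind, z)
    rows.append((z, d, y0 + effect * d))

itt = mean(y for z, _, y in rows if z == 1) - mean(y for z, _, y in rows if z == 0)
first_stage = mean(d for z, d, _ in rows if z == 1) - mean(d for z, d, _ in rows if z == 0)
b_iv = itt / first_stage

# Naive comparison of treated and untreated, ignoring the instrument.
naive = mean(y for _, d, y in rows if d == 1) - mean(y for _, d, y in rows if d == 0)

print("ITT:", itt)                  # 2
print("first stage:", first_stage)  # 1/2, the share of compliers
print("IV (LATE):", b_iv)           # 4, the effect for compliers
print("naive:", naive)              # 22
```

The IV estimate recovers the compliers' effect of 4, while the naive treated-versus-untreated comparison (22) is dominated by the always takers and never takers having very different baseline outcomes.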
Consider the example from last term: Y=Performance; D=Attend; and Z=9am class (when classes
were allocated by the Department on a random basis rather than through a process whereby students
select their own classes), then our individuals are:

                 9am class       Not 9am class
Compliers        Miss class      Attend class
Always Takers    Attend class    Attend class
Never Takers     Miss class      Miss class
Defiers          Attend class    Miss class


4 Difference-in-Difference Estimator
Consider the general pooled cross-section model:

$$ y_{it} = \alpha + d_t + \beta_1 x_{1it} + \beta_2 x_{2it} + \dots + \beta_k x_{kit} + \varepsilon_{it}, \qquad i = 1, \dots, N, \quad t = 1, \dots, T \tag{2} $$

where $d_t = \sum_{s=2}^{T} \delta_s D_{sit}$ are the time effects and Dsit = 1 if period t = s and 0
otherwise (dummies identifying each period’s cross-section). Assume there are n1 individuals in period
t = 1, n2 individuals in period t = 2 and ultimately nT individuals in period t = T.
With pooled cross-section data a powerful estimation technique for estimating causal effects is differ-
ence in differences (DD). Good DD techniques rely on good natural experiments, such as a change
of government policy. In a typical set-up there is: A treatment group: a set of observations which
receive the new policy. A control group: a set of observations which does not receive the new policy.
Both the treatment and control groups are observed both before and after the policy change.
The crucial assumption is the common trends assumption: in absence of the policy both the treat-
ment and control groups would have followed the same trends. The greater the similarity between
the treatment and control groups the more likely this assumption is to be satisfied.
Example: Card and Krueger (2000) observed that in February 1992 New Jersey (NJ) and
Pennsylvania (PA), neighbouring states in the US, both had a state minimum wage of 4.25
dollars. On April 1st 1992 NJ raised its state minimum wage from 4.25 to 5.05 dollars, while the
state minimum wage in PA remained at 4.25. They use data on the number of people employed (among
other things) in the same type of fast-food restaurants in NJ and PA before and after the policy
change, where NJ is the treatment group and PA the control group.
This is a natural experiment: naturally occurring quasi-random variation in the minimum wage. The
data was collected both before and after the policy change. There is no need for the restaurants to
be the same in each period, as long as they are collected in the relevant geographic areas.
The estimating regression takes the form:

$$ y_{ist} = \alpha + \gamma NJ_{ist} + \lambda d_{ist} + \delta (NJ_{ist} \times d_{ist}) + \varepsilon_{ist} \tag{3} $$

where yist is FTE (full-time equivalent) employment in restaurant i, in state s, in period t, with
S = 2 states (NJ, PA) and T = 2 periods (February, November). NJist = 1 if the restaurant is in New
Jersey (treatment state), and zero if in Pennsylvania (control state). dist = 1 for observations in
November (post-treatment) and zero for observations in February (pre-treatment). NJist × dist is an
interaction term between the state and time dummies. The associated assumptions are:

1. E(εist) = E(εist | x) = 0. Breaking down this assumption: E(εist | NJist) = 0, E(εist | dist) = 0
and, importantly, E(εist | NJist × dist) = 0, which is the common trends assumption.

2. V(εist | x) = σ²

3. cov(εist, εjst | x) = 0

4. εist | x ∼ N(0, σ²)

Turning dummy variables on and off we get:

• PA pre-treatment: E(yist | NJist = 0, dist = 0) = α

• NJ pre-treatment: E(yist | NJist = 1, dist = 0) = α + γ

• PA post-treatment: E(yist | NJist = 0, dist = 1) = α + λ

• NJ post-treatment: E(yist | NJist = 1, dist = 1) = α + γ + λ + δ

Such that the DD estimate is:
(NJpost − NJpre) − (PApost − PApre) = [(α + γ + λ + δ) − (α + γ)] − [(α + λ) − α] = δ


The difference-in-differences estimate is calculated using average employment in fast-food
restaurants, as given in the table below (standard errors in parentheses):

                           PA        NJ        NJ − PA
FTE employment before      23.33     20.44     −2.89
                           (1.35)    (0.41)    (1.44)
FTE employment after       21.17     21.03     −0.14
                           (0.94)    (0.52)    (1.07)
Change in FTE              −2.16     0.59      2.76
                           (1.25)    (0.54)    (1.36)

The DD estimate is 2.76 with a standard error of 1.36 and the t-statistic is approximately 2, implying
we reject no effect at the 5% level.
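The estimated regression in the current example would be:

$$ \hat{y}_{ist} = 23.33 - 2.89\, NJ_{ist} - 2.16\, d_{ist} + 2.76\, (NJ_{ist} \times d_{ist}) $$

where the coefficient on NJist is negative because NJ restaurants employed fewer people than PA
restaurants before the change (20.44 against 23.33). The difference-in-difference coefficient of 2.76
is significant at the 5% level. Diagrammatically we can see all of these coefficients on the following
diagrams.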
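The mapping from the four cell means in the table to the regression coefficients can be reproduced with a few lines of arithmetic. This is a sketch using only the rounded figures printed in the table; the standard-error line additionally assumes that the PA and NJ changes are independent, which is an assumption made here for illustration.

```python
import math

# Mean FTE employment per restaurant, copied from the table above.
pa_before, pa_after = 23.33, 21.17
nj_before, nj_after = 20.44, 21.03

# Coefficients of the saturated regression implied by the four cell means.
alpha = pa_before                                          # PA, pre-treatment
gamma = nj_before - pa_before                              # NJ - PA gap before
lam = pa_after - pa_before                                 # change in PA
delta = (nj_after - nj_before) - (pa_after - pa_before)    # DD estimate

# Standard error of the DD from the reported standard errors of the two
# changes, ASSUMING the two changes are independent.
se_dd = math.sqrt(1.25 ** 2 + 0.54 ** 2)

# t-statistic using the figures reported in the table.
t_stat = 2.76 / 1.36

print(f"alpha={alpha:.2f} gamma={gamma:.2f} lambda={lam:.2f} delta={delta:.2f}")
print(f"se(DD)={se_dd:.2f} t={t_stat:.2f}")
```

From the rounded cell means the double difference comes out at 2.75 rather than the reported 2.76; the small gap is rounding in the published means.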


The important assumption in all DD strategies is the common trends assumption, which is that
in the absence of the treatment, the treated and control groups would have followed a similar trend.
The DD estimate will have a causal interpretation if:

1. The employment trends would be the same in both states in the absence of the treatment
(common trends assumption).

2. Treatment induces a deviation from the trend, as seen in the figures above. (Note this is not the
same as stating that the control and treatment groups have the same level of the outcome.)¹

This is a difficult assumption to test, as by definition you do not know what would have happened
to the treatment group had they not entered the treatment.
Even if pre-treatment trends were the same (as below, the common trends assumption), the concern is
whether anything else occurred between pre- and post-treatment in the treatment state but not in the
control state, for instance other policy changes which asymmetrically shifted average employment. This
might therefore require the presence of other control variables in the regression equation.

¹ It is common to use regression to obtain DD estimates since we can control for other variables and it
is easy to calculate standard errors. We can extend the framework to: i) include multiple
treatment/control states as well as time periods; ii) look at extended pre- and post-treatment effects.


Alternatively one might think about extending the analysis. Extending the regression to more
than two states and periods:

$$ y_{ist} = \gamma_s + \lambda_t + \delta D_{ist} + \beta_1 x_{1ist} + \beta_2 x_{2ist} + \dots + \beta_k x_{kist} + \varepsilon_{ist} $$

where $\gamma_s = \alpha + \sum_{j=2}^{S} \gamma_j S_{ijt}$ (state effects); $\lambda_t = \sum_{j=2}^{T} \lambda_j d_{ist}$
(period effects); and Dist = 1 if the treated state is in the treatment period, 0 otherwise.
Notes on control group and control variables:

1. Selection of control group: the DD set-up can be made much more general. For instance,
instead of states one could use demographic groups, some of which are affected by a policy
change and some are not.

2. Selection of control variables: it is important to only include exogenous controls (i.e. not
affected by the policy), otherwise you will remove variation caused by the policy. That is, you
will remove the effect you are trying to estimate.

It is possible to extend the DD idea to DDD (a triple difference). For instance, suppose USA State A
implements a health care policy, between t = 0 and t = 1, aimed at people aged 65 and over, and the
health outcome is measured by some variable, Y.
In State A we can carry out a standard DD analysis, using the under 65s as the control group (maybe
restricted to people aged 60-65, say):

$$ y_{it} = \alpha + \beta_1 O_{it} + \beta_2 d_t + \beta_3 (d_t \times O_{it}) + \varepsilon_{it} $$

where Oit = 1 if i is 65 or over, and zero otherwise, dt = 1 if t = 1 and zero otherwise, and the final
term is the interaction, such that β3 gives the policy effect.
The OLS estimator of β3 will be unbiased if OLD and NOT OLD would have followed parallel health
trends in the absence of the policy.
The problem with this is that other factors unrelated to the programme might affect health in the
younger generation relative to the elderly, policy changes at the national level for instance.
Potential solution: extend the analysis to include USA State B, which did not implement the health
care policy but which would have been affected by policy changes at the national level.


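For State B:

$$ y_{it} = \delta_0 + \delta_1 O_{it} + \delta_2 d_t + \delta_3 (d_t \times O_{it}) + \varepsilon_{it} $$

where Oit = 1 if i is 65 or over, and zero otherwise, dt = 1 if t = 1 and zero otherwise, and the final
term is the interaction, such that δ3 gives the corresponding DD effect in State B. As State B did not
implement the policy, δ3 picks up any differential trend between OLD and NOT OLD that is unrelated
to the policy.
To get the DDD estimator, pool the two states:

$$
\begin{aligned}
y_{ijt} = {} & \delta_0 + \delta_1 O_{ijt} + \delta_2 d_{ijt} + \delta_3 (d_{ijt} \times O_{ijt}) \\
& + \gamma_0 S_{ijt} + \gamma_1 (O_{ijt} \times S_{ijt}) + \gamma_2 (d_{ijt} \times S_{ijt}) + \gamma_3 (d_{ijt} \times O_{ijt} \times S_{ijt}) + \varepsilon_{ijt}
\end{aligned}
$$

where Sijt = 1 if i is in State A and zero otherwise. In this situation γ3, the coefficient on the triple
interaction, gives the DDD policy parameter.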
• State=B, Control pre-treatment: E(yijt | Sijt = 0, Oijt = 0, dijt = 0) = δ0

• State=B, Treated pre-treatment: E(yijt | Sijt = 0, Oijt = 1, dijt = 0) = δ0 + δ1

• State=B, Control post-treatment: E(yijt | Sijt = 0, Oijt = 0, dijt = 1) = δ0 + δ2

• State=B, Treated post-treatment: E(yijt | Sijt = 0, Oijt = 1, dijt = 1) = δ0 + δ1 + δ2 + δ3

In which case:

$$
\begin{aligned}
\delta_3 = {} & [E(y_{ijt} \mid S_{ijt} = 0, O_{ijt} = 1, d_{ijt} = 1) - E(y_{ijt} \mid S_{ijt} = 0, O_{ijt} = 0, d_{ijt} = 1)] \\
& - [E(y_{ijt} \mid S_{ijt} = 0, O_{ijt} = 1, d_{ijt} = 0) - E(y_{ijt} \mid S_{ijt} = 0, O_{ijt} = 0, d_{ijt} = 0)]
\end{aligned}
$$

• State=A, Control pre-treatment: E(yijt | Sijt = 1, Oijt = 0, dijt = 0) = δ0 + γ0

• State=A, Treated pre-treatment: E(yijt | Sijt = 1, Oijt = 1, dijt = 0) = δ0 + δ1 + γ0 + γ1

• State=A, Control post-treatment: E(yijt | Sijt = 1, Oijt = 0, dijt = 1) = δ0 + δ2 + γ0 + γ2

• State=A, Treated post-treatment: E(yijt | Sijt = 1, Oijt = 1, dijt = 1) = δ0 + δ1 + δ2 + δ3 + γ0 + γ1 + γ2 + γ3

In which case:

$$
\begin{aligned}
\delta_3 + \gamma_3 = {} & [E(y_{ijt} \mid S_{ijt} = 1, O_{ijt} = 1, d_{ijt} = 1) - E(y_{ijt} \mid S_{ijt} = 1, O_{ijt} = 0, d_{ijt} = 1)] \\
& - [E(y_{ijt} \mid S_{ijt} = 1, O_{ijt} = 1, d_{ijt} = 0) - E(y_{ijt} \mid S_{ijt} = 1, O_{ijt} = 0, d_{ijt} = 0)]
\end{aligned}
$$

Such that the DDD estimate is:

$$ \gamma_3 = [(\bar{y}_{A;O;1} - \bar{y}_{A;O;0}) - (\bar{y}_{A;NO;1} - \bar{y}_{A;NO;0})] - [(\bar{y}_{B;O;1} - \bar{y}_{B;O;0}) - (\bar{y}_{B;NO;1} - \bar{y}_{B;NO;0})] $$

where State A is the policy state and B is the non-policy state, O is old and NO is not old, and 1
is the period after the policy and 0 is the period before the policy.
The DDD estimator has 4 terms:

1. (ȳA;O;1 − ȳA;O;0): Difference in mean health in OLD between t = 0 and t = 1 in State A.

2. (ȳA;NO;1 − ȳA;NO;0): Difference in mean health in NOT OLD between t = 0 and t = 1 in State A.

3. (ȳB;O;1 − ȳB;O;0): Difference in mean health in OLD between t = 0 and t = 1 in State B.

4. (ȳB;NO;1 − ȳB;NO;0): Difference in mean health in NOT OLD between t = 0 and t = 1 in State B.
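The algebra above can be verified with a small sketch: build the eight cell means from the pooled regression for some made-up parameter values, then take the difference of the two state-level DDs. The parameter values below are hypothetical and chosen only so the arithmetic is easy to follow.

```python
# Hypothetical parameter values for the pooled DDD regression.
params = {"d0": 50, "d1": -8, "d2": 1, "d3": -2,   # delta_0 ... delta_3
          "g0": 3, "g1": 1, "g2": 2, "g3": 5}      # gamma_0 ... gamma_3

def cell_mean(S, O, d, p=params):
    """E(y | S, O, d) implied by the pooled regression (error term has mean zero)."""
    return (p["d0"] + p["d1"] * O + p["d2"] * d + p["d3"] * d * O
            + p["g0"] * S + p["g1"] * O * S + p["g2"] * d * S + p["g3"] * d * O * S)

def dd(S):
    """DD within one state: change for OLD minus change for NOT OLD."""
    old_change = cell_mean(S, 1, 1) - cell_mean(S, 1, 0)
    not_old_change = cell_mean(S, 0, 1) - cell_mean(S, 0, 0)
    return old_change - not_old_change

dd_b = dd(0)        # State B (S = 0): equals delta_3
dd_a = dd(1)        # State A (S = 1): equals delta_3 + gamma_3
ddd = dd_a - dd_b   # equals gamma_3

print(dd_b, dd_a, ddd)   # -2 3 5
```

The differential trend between OLD and NOT OLD that is common to both states (δ3 = −2 here) cancels, leaving only γ3.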


5 Propensity Score Matching (PSM)


PSM methods are used to try to balance the treated group and the control group by modelling
selection into treatment. There are many examples of people using PSM methods; see, for example,
Nguyen et al. (2006) (please note that, as with most things, there are also many examples of
inappropriate use of PSM methods).
Consider the example in which we are looking at the effect of introducing clinics in villages on infant
mortality. We have data on 9 villages, given below:

T    imrate    povrate    pcdocs
1        10        0.5       0.1
1        15        0.6       0.2
1        22        0.7       0.1
1        19        0.6       0.2
0        25        0.6       0.1
0        19        0.5       0.2
0         4        0.1       0.4
0         8        0.3       0.5
0         6        0.2       0.4

where T = 1 if the village has a clinic and 0 otherwise, imrate is the infant mortality rate (per 1000
births), povrate is a measure of poverty and pcdocs is the number of doctors per capita. If we test
for a difference in mean imrate between the two groups, we find that having a clinic appears to
increase infant mortality by 4.1 per 1000 births (16.5 with a clinic against 12.4 without). However,
presumably clinics were put in those villages that are poorer and that we would expect to have a
higher mortality rate. If we observed data before the clinics were introduced we could estimate
a diff-in-diff model and obtain the true effect of the clinic. In the absence of this, an alternative
is to construct a model for treatment and then use it to create a modified control
group which looks much more like the treated group, and then to compare this reconstructed control
group with the treated group.
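The raw comparison, and the reason to distrust it, can be reproduced directly from the nine rows of the table. The sketch below uses only the Python standard library and computes group means; it does not attempt the significance test.

```python
from statistics import mean

# (T, imrate, povrate, pcdocs) for the nine villages in the table above.
villages = [
    (1, 10, 0.5, 0.1), (1, 15, 0.6, 0.2), (1, 22, 0.7, 0.1), (1, 19, 0.6, 0.2),
    (0, 25, 0.6, 0.1), (0, 19, 0.5, 0.2), (0, 4, 0.1, 0.4), (0, 8, 0.3, 0.5),
    (0, 6, 0.2, 0.4),
]

def group_mean(column, t):
    """Mean of one column (1 = imrate, 2 = povrate, 3 = pcdocs) for group T = t."""
    return mean(row[column] for row in villages if row[0] == t)

# Raw difference in infant mortality: clinic villages minus non-clinic villages.
diff_imrate = group_mean(1, 1) - group_mean(1, 0)

# The covariates show how different the two groups are before any matching.
diff_povrate = group_mean(2, 1) - group_mean(2, 0)
diff_pcdocs = group_mean(3, 1) - group_mean(3, 0)

print(f"imrate:  {group_mean(1, 1):.2f} vs {group_mean(1, 0):.2f}, diff {diff_imrate:+.2f}")
print(f"povrate: {group_mean(2, 1):.2f} vs {group_mean(2, 0):.2f}, diff {diff_povrate:+.2f}")
print(f"pcdocs:  {group_mean(3, 1):.2f} vs {group_mean(3, 0):.2f}, diff {diff_pcdocs:+.2f}")
```

Villages with a clinic are on average poorer (0.60 against 0.34) and have fewer doctors per capita (0.15 against 0.32), which is exactly the imbalance that matching tries to remove.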
PSM essentially has two steps:

1. Estimate a probit/logit model of
   P(T = 1) = F(β0 + β1 x1 + . . . + βk xk)
   and use the predicted probabilities to match cases from the treated group to similar cases in the
   control group.

2. Do a difference in means test between the treated group and the artificially constructed control
group.
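One common way to write the two steps is single nearest-neighbour matching with replacement on the estimated propensity score. This is a sketch of that particular variant only; the symbols p̂, j(i) and N_T are notation introduced here for illustration, and other matching schemes (several neighbours, calipers, kernel weights) are also used in practice.

```latex
\hat{p}_i = F(\hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \dots + \hat{\beta}_k x_{ki}), \qquad
j(i) = \arg\min_{j:\, T_j = 0} \lvert \hat{p}_i - \hat{p}_j \rvert, \qquad
\widehat{\text{ATT}} = \frac{1}{N_T} \sum_{i:\, T_i = 1} \bigl( y_i - y_{j(i)} \bigr)
```

Here N_T is the number of treated units and j(i) is the control unit whose estimated propensity score is closest to that of treated unit i.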

In our example we use psmatch2 in Stata, and we see a negative estimated effect of the treatment
(ATT = −7, although this is marginally insignificant at the 5% level), compared with the positive
effect of 4.1 that we saw previously.
It is important to compare the means of the explanatory variables (covariates) for the treated group
and the new control group to check balance between the two groups (in our case we see no significant
difference in povrate or pcdocs). This check can also be carried out in Stata.


An alternative estimator uses the teffects command in Stata. In this case the estimate is again
negative, but the effect is insignificant. To assess the extent to which there is balance between the
treatment and control groups, you want the Matched differences to be close to zero and the Matched
Variance Ratio to be close to unity.


6 Regression Discontinuity Design (RDD)


RDDs are examples of quasi-experimental designs used to estimate causal effects. There are many
examples of RDDs in the literature; see, for example, Mealli and Rampichini (2012). The basic
principle behind RDD is to use an arbitrary rule in an attempt to satisfy the assumption
E(εi | x) = 0. For example:
Question: What is the impact of achieving a merit in a math test (during the year) on a child’s
final math exam score (at the end of the year)? A merit is awarded for a mark ≥ 70. The final exam
score is measured on the range (0, 100). Intuitively, comparing the final exam scores of children
just above and just below the 70 boundary will give the causal impact of achieving a merit, under the
assumption that children just above and just below 70 are similar in all other respects.
The general set-up is the following. We are interested in estimating the impact of some treatment (D)
on some outcome variable (y). You observe a continuous variable, x say, for a group of individuals.
There is a cut-off along x, X^C say, where if x ≥ X^C the individual receives the treatment, such that
D = 1. If x < X^C the individual does not receive the treatment, such that D = 0 (see Figure 4).

The causal parameter can be estimated using a regression of the form:

$$ y_i = \alpha + \delta x_i + \beta D_i + \varepsilon_i $$

where yi is the outcome variable; xi is known as the assignment variable (or forcing variable or
running variable); X^C is the cut-off along the assignment variable which assigns the treatment; Di
is a dummy variable equal to one if xi ≥ X^C and zero if xi < X^C; and εi is the error term. The
model can be estimated using OLS and will give the causal estimate if E(εi | Di, xi) = 0.
Note:
the closer the observations are to the cut-off, the greater our trust in the causal estimate is likely
to be; however, we are often restricted in terms of sample size. In the above diagram it is clear that:

1. You do not observe treated and control individuals at the same level of x.

2. As such the causal (treatment) effect is based on extending (extrapolating) the regression
function.
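The trade-off in the note can be illustrated with a noise-free sketch of the merit example. The data-generating process below (final score = 20 + 0.5x + 5 × merit) is invented for illustration: the true jump at the cut-off is 5, and a simple comparison of means within a bandwidth h either side of 70 overstates it because the final score also rises with x.

```python
CUTOFF = 70
TRUE_JUMP = 5.0   # hypothetical causal effect of achieving a merit

def final_score(x):
    """Hypothetical noise-free final exam score for a child with test mark x."""
    merit = 1 if x >= CUTOFF else 0
    return 20 + 0.5 * x + TRUE_JUMP * merit

def local_difference(h):
    """Mean final score for marks in [70, 70 + h) minus the mean for [70 - h, 70)."""
    above = [final_score(x) for x in range(CUTOFF, CUTOFF + h)]
    below = [final_score(x) for x in range(CUTOFF - h, CUTOFF)]
    return sum(above) / len(above) - sum(below) / len(below)

for h in (10, 5, 1):
    print(f"bandwidth {h:2d}: estimated jump = {local_difference(h)}")
# bandwidth 10: estimated jump = 10.0
# bandwidth  5: estimated jump = 7.5
# bandwidth  1: estimated jump = 5.5
```

Narrowing the bandwidth moves the estimate towards the true jump, but with real, noisy data it also leaves fewer observations, which is why the regression controls for x rather than relying on a raw comparison.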

For this reason it is very important to try different specifications of the regression function. The
above graph is an example of a linear assignment mechanism. Other commonly tested specifications
are higher-order polynomial forms, in which case the causal parameter can be estimated using a
regression of the form:

$$ y_i = \alpha + \delta_1 x_i + \delta_2 x_i^2 + \delta_3 x_i^3 + \beta D_i + \varepsilon_i $$

where yi is the outcome variable; (xi, xi², xi³) are the terms in the assignment variable (which now
enters as a 3rd order polynomial); X^C is the cut-off along the assignment variable which assigns the
treatment; Di is a dummy variable equal to one if xi ≥ X^C and zero if xi < X^C; and εi is the error
term (see Figure 5). The model can be estimated using OLS and will give the causal estimate if
E(εi | Di, xi, xi², xi³) = 0.
Note:
there is no need for control variables as Di is random; however, controls may improve precision.


Beware: using a linear trend (solid line) when the true trend is non-linear (dashed line) can lead to
misspecification (see Figure 6).
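The danger can be shown with a small sketch in which the true relationship is a smooth cubic with no jump at all. The data are invented and noise-free, the running variable is centred at the cut-off (u = x − 70), and OLS is solved exactly with a hand-written normal-equations routine so that only the standard library is needed; in practice one would use a regression package.

```python
from fractions import Fraction

def ols(rows, y):
    """Exact OLS: solve (X'X) b = X'y by Gauss-Jordan elimination on fractions."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(Fraction(r[i] * r[j]) for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot_row = next(r for r in range(col, k) if a[r][col] != 0)
        a[col], a[pivot_row] = a[pivot_row], a[col]
        piv = a[col][col]
        a[col] = [v / piv for v in a[col]]
        for r in range(k):
            if r != col:
                factor = a[r][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][k] for i in range(k)]

# Hypothetical data: marks x = 40..99, centred so that u = x - 70.
u_values = range(-30, 30)
# True outcome is a smooth cubic in u with NO discontinuity at the cut-off.
y = [Fraction(u ** 3, 1000) for u in u_values]

# Columns: constant, polynomial terms in u, treatment dummy D = 1 if u >= 0.
linear_rows = [[1, u, int(u >= 0)] for u in u_values]
cubic_rows = [[1, u, u ** 2, u ** 3, int(u >= 0)] for u in u_values]

beta_linear = ols(linear_rows, y)[-1]   # coefficient on D, linear specification
beta_cubic = ols(cubic_rows, y)[-1]     # coefficient on D, cubic specification

print("linear spec, estimated jump:", float(beta_linear))   # -10.797
print("cubic spec, estimated jump:", float(beta_cubic))     # 0.0
```

The linear specification reports a jump of about −10.8 where none exists, because the dummy absorbs the curvature the straight line cannot fit; the cubic specification correctly finds no jump.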
Note:
rdplot is a nice command for plotting RDD graphics in Stata.
In any of the above designs, to capture the causal effect we require E(εi | Di, xi) = 0. As we have seen
in other cases, this is an impossible assumption to test. However, it is common practice to check the
balancing of covariates, i.e. that observable characteristics such as gender, age and past test scores
are all balanced, so that there are no statistically significant differences (t-tests) between those
individuals just below and just above the cut-off point. The distribution of the running variable (x)
should also be smooth over the range of points close to the cut-off, suggesting there has been no
manipulation.
Regression discontinuity designs will fail if individuals can precisely manipulate the assignment
variable, for example if students could precisely choose their test score through effort. Those who
choose a score of X^C or just above will be systematically different from those who choose a score
just below the cut-off.


References
Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion.
Number 8769 in Economics Books. Princeton University Press.

Card, D. and Krueger, A. B. (2000). Minimum wages and employment: A case study of the fast-food
industry in New Jersey and Pennsylvania: Reply. The American Economic Review, 90(5):1397–1420.

Dougherty, C. (2016). Introduction to Econometrics. OUP Catalogue. Oxford University Press.

Mealli, F. and Rampichini, C. (2012). Evaluating the effects of university grants by using regression
discontinuity designs. Journal of the Royal Statistical Society: Series A (Statistics in Society),
175(3):775–798.

Nguyen, A. N., Taylor, J., and Bradley, S. (2006). The estimated effect of Catholic schooling on
educational outcomes using propensity score matching. Bulletin of Economic Research, 58(4):285–
307.

Stock, J. and Watson, M. W. (2003). Introduction to Econometrics. Prentice Hall, New York.

Wooldridge, J. M. (2013). Introduction to Econometrics. Cengage.
