
江艇+IRID metrics 2016 slides

The document discusses the potential outcomes framework for causal inference in econometrics. It introduces key concepts like potential outcomes, treatment effects, unconfoundedness assumption, propensity scores, and causal estimands. Regular observational studies satisfy the unconfoundedness and overlap assumptions, while irregular studies may violate these assumptions but can still identify causal effects using techniques like instrumental variables, difference-in-differences, and regression discontinuity designs.


The following slides are under constant revision. Please do not distribute without permission.

Applied Microeconometrics

Ting JIANG (江艇)
National Academy of Development and Strategy and School of Economics, Renmin University of China
August 8-10, 2016, Shanghai

© Ting JIANG, 2016 Summer, Renmin Univ of China.


Lecture 0 Prologue

A word on teaching philosophy.

Example 1. Language and savings rates (Chen, 2013, AER).

Correlation is causation!

Example 2. “Determinants” of Nobel laureates (Messerli, 2012, N Engl J Med).


• Econometric results either agree with intuition or contradict it: those that agree are pointless, and those that contradict it are wrong.
• Reaction to the first criticism: A statistically significant relationship may not be economically important. It could be the many “other things” that matter more.
• Reaction to the second criticism: In what sense does economics explain reality when its proposed explanation is beyond intuition? Because many “other things” are entangled, econometrics may not only explain the visible facts, but also reveal the hidden truth.
• Caveat: Overly and improperly used econometric techniques are no more than wicked crafts.[1]

[1] Stuart Firestein: “Sometimes science is like looking for a black cat in a dark room. It’s difficult, especially when there is no cat.”


Evolution of economic research. A popular view of star economists
from The Economist.

Challenge to observational studies. Sample selection vs. self-selection.


The figure below is from Ramsey and Schafer (1997), The Statistical Sleuth: A Course in Methods of Data Analysis.

[Figure from Ramsey and Schafer (1997) not reproduced.]

What econometrics CANNOT do?
• One cannot test non-falsifiable theories.
• One cannot test non-identifiable theories.
Internal validity vs. external validity.
• Internal validity: The statistical inferences about causal effects are
valid for the population being studied.
• External validity: The inferences and conclusions can be generalized from the population and setting studied to other populations and settings. Potential threats to external validity include
– Differences in populations.
Example 3. The weirdest people in the world? (Henrich et al., 2010, Behav Brain Sci).
– Differences in settings. The importance of a well-spelled-out economic theory.


Criteria for good empirical research.
• Richard Freeman on the three laws for econometrics you can trust:
1. It had better be there in the ordinary-least-squares regression.
2. It had better still be there in the econometrically sophisticated high-tech instrument procedures.
3. It had better still be there for small technical tweaks to the econometrically sophisticated procedures.
• My view: Good empirical research is like a detective story. Often you don’t have strong evidence, but a collection of mutually compatible weak evidence speaks for itself. Similar in spirit to the famous saying:
“Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.” (Arthur Conan Doyle)
Example 4. Corruption in sumo wrestling (Duggan and Levitt, 2002, AER).


Golden age for applied microeconomics?
• Raj Chetty: It is undergoing a major methodological transformation, a data-driven approach that combines a broad set of skills to answer important policy questions. Availability of large and high quality data sets has made compelling implementation of modern econometric methods possible.
• Peter Klenow:
\[ \frac{\partial\,\text{Welfare}}{\partial\,\text{Research}} = \frac{\partial\,\text{Welfare}}{\partial\,\text{Knowledge}} \times \frac{\partial\,\text{Knowledge}}{\partial\,\text{Research}} \]
         ∂Welfare/∂Research   ∂Welfare/∂Knowledge   ∂Knowledge/∂Research
Micro    > 0                  High                  Low
Macro    0?                   Low                   High
• Andrew Gelman: Randomized experiments give you accurate estimates of things you don’t care about; observational studies give you biased estimates of things that actually matter.


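Lecture 1 The Potential Outcomes Framework and Causal Inference

1.1 Taking error terms seriously

• An example: we want to study the relationship between D and Y.
D: some binary treatment (whether one attends college)
Y: some outcome (wage at age 40)
• Assume the data $(D_i, Y_i)$ are i.i.d., so we write down a linear model:
\[ Y_i = \gamma_0 + \gamma_1 D_i + \varepsilon_i \]
• So here is the question: what on earth is $\varepsilon_i$? Clearly, only once we know what $\varepsilon_i$ means can we know what $\gamma_0, \gamma_1$ mean.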


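• First interpretation: let
\[ \gamma_0 := E[Y_i \mid D_i = 0] \]
\[ \gamma_1 := E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] \]
Then naturally
\[ E[Y_i \mid D_i] = \gamma_0 + \gamma_1 D_i \]
(Note that linearity is not an assumption here: this is a so-called saturated model.)
Then define $\varepsilon_i := Y_i - \gamma_0 - \gamma_1 D_i$. By construction
\[ E[\varepsilon_i \mid D_i] = 0 \]
so the OLS estimator is unbiased and consistent.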


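• Second interpretation: $u_i$ denotes unobserved ability.
\[ Y_i = \beta_0 + \beta_1 D_i + u_i \]
Then, if ability is correlated with treatment,
\[ \mathrm{Cov}(D_i, u_i) \neq 0 \]
the OLS estimates of $\beta_0, \beta_1$ are biased and inconsistent.
• We call parameters like $\beta_0, \beta_1$ structural or causal parameters. In this case the functional-form assumption has substantive content (even for a binary D).
• Conclusion: we cannot simply jot down $Y_i = \gamma_0 + \gamma_1 D_i + \varepsilon_i$ and forget to state explicitly what we assume about $\varepsilon$: is it merely an expectational residual, or a “structural” but unobserved explanatory factor?
• The next question is: why does the second interpretation matter?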


1.2 Potential outcomes framework
• Answer: because we care about the effect of the treatment (cause) on the response (outcome). In our minds, $(\gamma_0, \gamma_1)$ should reflect the causal or structural relationship between D and Y, not merely their correlation.
• We need a general framework for defining cause-and-effect relationships: the potential outcomes framework.
• Definition of the treatment effect (causal effect).
$Y_i^0$ = potential outcome when individual i receives control treatment 0 by intervention
$Y_i^1$ = potential outcome when individual i receives active treatment 1 by intervention
Individual treatment effect: $Y_i^1 - Y_i^0$


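• A few remarks on this definition:
– The definition of a causal effect depends only on the potential outcomes, not on which potential outcome is actually observed (realized).
– A causal effect is always a comparison between different outcomes at the same point in time, and the treatment must occur before the outcome.
– The fundamental problem of causal inference is missing data: we can observe at most one potential outcome. The unobserved potential outcome is called the counterfactual.
– The key to causal analysis is constructing the counterfactual, which is an imputation problem. Moreover, the counterfactual can only be sought among units that were actually untreated, so one can also say that the key to causal analysis is finding a good control group.
– Estimation and inference in practice can rely only on observed outcomes, and must involve multiple units or the same unit in different periods. But neither a cross-sectional comparison nor a before-after comparison is the definition of a causal effect, even though both are important for estimation and inference.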


• Observed treatment
\[ D_i = \begin{cases} 1 & \text{treatment group} \\ 0 & \text{control group} \end{cases} \]
Observed outcome
\[ Y_i = (1 - D_i) \cdot Y_i^0 + D_i \cdot Y_i^1 \]
• The presence of multiple units does not by itself resolve the fundamental problem of causal inference: the number of treatment levels and potential outcomes explodes.
This is why we need:
• Stable unit treatment value assumption (SUTVA). Each unit's potential outcomes do not vary with the treatments received by other units, and for each unit each treatment leads to only one potential outcome.


This has two implications:
– No interference. No peer effects and no general equilibrium effects.
– No hidden variations in treatment. The way the treatment is administered does not matter.
SUTVA alone is not enough. We also need to know:
• Assignment mechanism: how each unit comes to receive the treatment.
– The importance of covariates X
▷ Controlling for part of the outcome variation makes the estimates more precise.
▷ We may be interested in subpopulations defined by covariates.
▷ Covariates may affect the assignment mechanism.


– The assignment mechanism is defined by the propensity score:
\[ \Pr(D_i = 1 \mid Y_i^1, Y_i^0, X_i) \]
– Overlap assumption:
\[ 0 < \Pr(D_i = 1 \mid Y_i^1, Y_i^0, X_i) < 1 \]
– Unconfoundedness (strong ignorability) assumption:
\[ \Pr(D_i = 1 \mid Y_i^1, Y_i^0, X_i) = \Pr(D_i = 1 \mid X_i) \]
Equivalently,
\[ D_i \perp (Y_i^1, Y_i^0) \mid X_i \]
– Classical randomized experiments satisfy both assumptions. Imbens and Rubin (2015) call observational studies that satisfy these two assumptions regular observational studies.


– What are irregular observational studies? Three main cases:
▷ Unconfounded given latent variables that are only partially observed.
⟹ instrumental variables
▷ Unconfounded given pre data when pre and post data for both the treatment and control groups are available.
⟹ difference in differences
▷ Violation of overlap when assignment to treatment is determined by a threshold eligibility rule.
⟹ regression discontinuity


• Causal estimands.
– Average treatment effect (ATE).
\[ \tau := E[Y_i^1 - Y_i^0] \]
– Subgroup average effect.
▷ Average treatment effect on the treated (ATT).
\[ \tau_1 := E[Y_i^1 - Y_i^0 \mid D_i = 1] \]
In general, ATT is not equal to ATE.
▷ Average treatment effect on the untreated (ATU).
\[ \tau_0 := E[Y_i^1 - Y_i^0 \mid D_i = 0] \]
▷ Defined over covariates.
\[ \tau(x) := E[Y_i^1 - Y_i^0 \mid X_i = x] \]


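• If potential outcomes are mean independent of treatment, the group-mean difference identifies the average treatment effect.
\[ E[Y^d \mid D] = E[Y^d] \;\Longrightarrow \]
\begin{align*}
E(Y \mid D = 1) - E(Y \mid D = 0) &= E[Y^1 \mid D = 1] - E[Y^0 \mid D = 0] \\
&= E[Y^1] - E[Y^0] \\
&= E[Y^1 - Y^0]
\end{align*}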

• Similarly, to identify the conditional treatment effect $E[Y^1 - Y^0 \mid X]$, the key assumption is selection on observables:
\[ E[Y^d \mid D, X] = E[Y^d \mid X] \]
(implied by the unconfoundedness assumption)
Furthermore,
\begin{align*}
E[Y^1 - Y^0] &= E\{E[Y^1 - Y^0 \mid X]\} \\
&= \int E[Y^1 - Y^0 \mid X = x]\, dF_X(x)
\end{align*}
Selection on unobservables, by contrast, means
\[ E[Y^d \mid D, X, \varepsilon] = E[Y^d \mid X, \varepsilon] \]

• Randomization guarantees mean independence. If mean independence fails, the group-mean difference contains selection bias.
\begin{align*}
E[Y \mid D = 1] - E[Y \mid D = 0]
&= \underbrace{E[Y^1 \mid D = 1] - E[Y^0 \mid D = 1]}_{\text{average treatment effect on the treated}} \\
&\quad + \underbrace{E[Y^0 \mid D = 1] - E[Y^0 \mid D = 0]}_{\text{selection bias}}
\end{align*}
This also shows that identifying ATT requires assuming $E[Y^0 \mid D] = E[Y^0]$.
Identifying ATU requires assuming $E[Y^1 \mid D] = E[Y^1]$.
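The decomposition above can be checked numerically. The following is a minimal sketch under an assumed data-generating process (the variable `ability` and all coefficients are hypothetical, chosen only for illustration): because both potential outcomes are simulated, the ATT and the selection bias can be computed directly and compared with the naive group-mean difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical DGP (an assumption for illustration): unobserved ability
# raises Y^0, raises the gain from treatment, and raises take-up.
ability = rng.normal(size=n)
y0 = 1.0 + ability + rng.normal(size=n)          # potential outcome without treatment
y1 = y0 + 2.0 + 0.5 * ability                    # potential outcome with treatment
d = (ability + rng.normal(size=n) > 0).astype(int)
y = np.where(d == 1, y1, y0)                     # observed outcome

naive = y[d == 1].mean() - y[d == 0].mean()      # group-mean difference
att = (y1 - y0)[d == 1].mean()                   # E[Y^1 - Y^0 | D = 1]
selection_bias = y0[d == 1].mean() - y0[d == 0].mean()
ate = (y1 - y0).mean()                           # E[Y^1 - Y^0]

print(f"naive={naive:.3f}  ATT={att:.3f}  bias={selection_bias:.3f}  ATE={ate:.3f}")
```

In the sample, the naive difference equals ATT plus selection bias exactly; here both ATT > ATE and a positive selection bias arise because high-ability units select into treatment.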


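• Assuming that potential outcomes are linear helps us better understand the identification conditions.
\[ Y_i^d = \alpha_d + X_i'\beta_d + U_i^d, \quad d = 0, 1 \]
\[ E[Y^1 - Y^0] = \alpha_1 - \alpha_0 + E(X)'(\beta_1 - \beta_0) \]
\begin{align*}
& E(Y \mid D = 1) - E(Y \mid D = 0) \\
&= \alpha_1 - \alpha_0 + E(X \mid D = 1)'\beta_1 - E(X \mid D = 0)'\beta_0 + E[U^1 \mid D = 1] - E[U^0 \mid D = 0] \\
&= \underbrace{\alpha_1 - \alpha_0 + E(X)'(\beta_1 - \beta_0)}_{\text{average treatment effect}} \\
&\quad + \underbrace{[E(X \mid D = 1) - E(X)]'\beta_1 - [E(X \mid D = 0) - E(X)]'\beta_0}_{\text{selection bias due to observables}} \\
&\quad + \underbrace{E[U^1 \mid D = 1] - E[U^0 \mid D = 0]}_{\text{selection bias due to unobservables}}
\end{align*}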


Similarly,
\begin{align*}
& E(Y \mid D = 1, X) - E(Y \mid D = 0, X) \\
&= \alpha_1 - \alpha_0 + X'(\beta_1 - \beta_0) + E[U^1 \mid D = 1, X] - E[U^0 \mid D = 0, X]
\end{align*}
Note that in this model,
\[ E[U^d \mid D, X] = E[U^d \mid X] \iff E[Y^d \mid D, X] = E[Y^d \mid X] \]


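• Conversely, when we write down a linear model, what does it mean from the perspective of the potential outcomes framework?
\[ Y_i = \alpha + \beta D_i + U_i \]
\[ Y_i = (1 - D_i) Y_i^0 + D_i Y_i^1 = Y_i^0 + (Y_i^1 - Y_i^0) D_i \equiv Y_i^0 + \beta_i D_i \]
\[ Y_i^0 = E[Y_i^0] + (Y_i^0 - E[Y_i^0]) \equiv \alpha + \varepsilon_i \]
\[ \beta_i = E(\beta_i) + (\beta_i - E(\beta_i)) \equiv \beta + \nu_i \]
Therefore
\[ U_i = \varepsilon_i + D_i \nu_i \]
Consistency of OLS requires
\[ E(DU) = E(D\varepsilon) + E(D\nu) = 0 \]
that is,
\[ E[D \tilde{Y}^0] = 0, \quad E[D \tilde{Y}^1] = 0 \]
where $\tilde{Y}^d := Y^d - E[Y^d]$ denotes the demeaned potential outcome.
No selection bias and exogeneity of the error term in the linear regression model are one and the same thing.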


• Simpson’s Paradox: the effect within subgroups and the effect in the aggregate may diverge.
\[ \mathrm{Sign}[E(Y \mid D = 1) - E(Y \mid D = 0)] \neq \mathrm{Sign}[E(Y \mid X, D = 1) - E(Y \mid X, D = 0)] \]


                 Drug (D = 1)      No drug (D = 0)
Men (X = 0)      81/87 (93%)       234/270 (87%)
Women (X = 1)    192/263 (73%)     55/80 (69%)
Combined data    273/350 (78%)     289/350 (83%)

\[ P(X = 1) = \frac{263 + 80}{700} = 49\% \]
\[ P(X = 0) = \frac{87 + 270}{700} = 51\% \]
\[ P(X = 1 \mid D = 1) = \frac{263}{350} = 75\% \]
\[ P(X = 1 \mid D = 0) = \frac{80}{350} = 23\% \]

\begin{align*}
\tau(X = 1) &= E(Y \mid X = 1, D = 1) - E(Y \mid X = 1, D = 0) = 73\% - 69\% = 4\% \\
\tau(X = 0) &= E(Y \mid X = 0, D = 1) - E(Y \mid X = 0, D = 0) = 93\% - 87\% = 6\%
\end{align*}

\begin{align*}
E(Y \mid D = 1) &= E(Y \mid X = 1, D = 1) P(X = 1 \mid D = 1) + E(Y \mid X = 0, D = 1) P(X = 0 \mid D = 1) \\
&= 73\% \times 75\% + 93\% \times 25\% = 78\% \\
E(Y \mid D = 0) &= E(Y \mid X = 1, D = 0) P(X = 1 \mid D = 0) + E(Y \mid X = 0, D = 0) P(X = 0 \mid D = 0) \\
&= 69\% \times 23\% + 87\% \times 77\% = 83\% \\
\Longrightarrow\; & E(Y \mid D = 1) - E(Y \mid D = 0) = -5\%
\end{align*}


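Intuition: with or without the drug, women have a lower cure rate, and women are more likely to take the drug. The drug therefore appears ineffective in the aggregate because a randomly drawn drug user is more likely to be a woman, and so has a lower cure rate on average. In other words, the negative female effect offsets the positive drug effect.

\begin{align*}
\tau &= E[E(Y \mid X, D = 1) - E(Y \mid X, D = 0)] \\
&= \tau(X = 1) P(X = 1) + \tau(X = 0) P(X = 0) \\
&= 4\% \times 49\% + 6\% \times 51\% = 5\% \\
\tau_1 &= E[E(Y \mid X, D = 1) - E(Y \mid X, D = 0) \mid D = 1] \\
&= \tau(X = 1) P(X = 1 \mid D = 1) + \tau(X = 0) P(X = 0 \mid D = 1) \\
&= 4\% \times 75\% + 6\% \times 25\% = 4.5\% \\
\tau_0 &= E[E(Y \mid X, D = 1) - E(Y \mid X, D = 0) \mid D = 0] \\
&= \tau(X = 1) P(X = 1 \mid D = 0) + \tau(X = 0) P(X = 0 \mid D = 0) \\
&= 4\% \times 23\% + 6\% \times 77\% = 5.5\% \\
\tau &= \tau_1 \cdot P(D = 1) + \tau_0 \cdot P(D = 0)
\end{align*}
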
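The arithmetic of this example can be reproduced directly from the cell counts in the table. The sketch below is illustrative only (the dictionary layout and helper name `rate` are assumptions); it uses unrounded rates, so the weighted effects differ slightly from the rounded percentages used in the hand calculation above.

```python
# (cured, total) by sex and treatment status, taken from the table above.
counts = {
    ("men", 1): (81, 87),
    ("men", 0): (234, 270),
    ("women", 1): (192, 263),
    ("women", 0): (55, 80),
}
sexes = ("men", "women")

def rate(cells):
    """Pooled cure rate over a list of (sex, treatment) cells."""
    cured = sum(counts[c][0] for c in cells)
    total = sum(counts[c][1] for c in cells)
    return cured / total

# Stratum-specific effects tau(x): positive for both men and women.
tau = {x: rate([(x, 1)]) - rate([(x, 0)]) for x in sexes}

# Raw group-mean difference: negative in the combined data.
raw_diff = rate([(x, 1) for x in sexes]) - rate([(x, 0) for x in sexes])

n_treated = sum(counts[(x, 1)][1] for x in sexes)   # 350
n_control = sum(counts[(x, 0)][1] for x in sexes)   # 350
n_total = n_treated + n_control                      # 700

# Weight tau(x) by the distribution of X in the relevant population.
ate = sum(tau[x] * (counts[(x, 1)][1] + counts[(x, 0)][1]) / n_total for x in sexes)
att = sum(tau[x] * counts[(x, 1)][1] / n_treated for x in sexes)
atu = sum(tau[x] * counts[(x, 0)][1] / n_control for x in sexes)

print(tau, raw_diff, ate, att, atu)
```

The sign reversal comes entirely from the weights: the raw comparison weights the treated by $P(X \mid D = 1)$ and the controls by $P(X \mid D = 0)$, whereas each causal estimand applies one common set of weights to both arms.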
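Think about this problem from the perspective of a linear model.
\[ E(Y \mid X, D = d) = \alpha_d + \beta_d X \]
One can run separate regressions by group, or use a model with an interaction term:
\[ E(Y \mid D, X) = \alpha_0 + (\alpha_1 - \alpha_0) D + \beta_0 X + (\beta_1 - \beta_0) D \cdot X \]
\[ \begin{cases}
E(Y \mid X = 1, D = 1) = \alpha_1 + \beta_1 = 73\% \\
E(Y \mid X = 0, D = 1) = \alpha_1 = 93\% \\
E(Y \mid X = 1, D = 0) = \alpha_0 + \beta_0 = 69\% \\
E(Y \mid X = 0, D = 0) = \alpha_0 = 87\%
\end{cases} \]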


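\begin{align*}
\tau_1 &= \frac{1}{N_T} \sum_{t \in T} [(\alpha_1 - \alpha_0) + (\beta_1 - \beta_0) X_t] \\
\tau_0 &= \frac{1}{N_C} \sum_{c \in C} [(\alpha_1 - \alpha_0) + (\beta_1 - \beta_0) X_c] \\
\tau &= \frac{1}{N} \sum_{i=1}^{N} [(\alpha_1 - \alpha_0) + (\beta_1 - \beta_0) X_i] \\
&= \frac{1}{N} \sum_{i=1}^{N} (\alpha_1 + \beta_1 X_i) - \frac{1}{N} \sum_{i=1}^{N} (\alpha_0 + \beta_0 X_i)
\end{align*}
What mistake does Simpson's paradox amount to?
\[ \tau \neq \frac{1}{N_T} \sum_{t \in T} (\alpha_1 + \beta_1 X_t) - \frac{1}{N_C} \sum_{c \in C} (\alpha_0 + \beta_0 X_c) \]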


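Lecture 2 What I Talk About When I Talk About Least Squares

2.1 The algebra of OLS

\[ \min_{\tilde{\beta}} S(\tilde{\beta}) = \sum_{i=1}^{n} (y_i - x_i'\tilde{\beta})^2 \]
\[ x_i = (1, x_{i2}, \ldots, x_{iK})', \quad \tilde{\beta} = (\tilde{\beta}_1, \ldots, \tilde{\beta}_K)' \]
First-order conditions:
\[ \sum_{i=1}^{n} x_i (y_i - x_i'\tilde{\beta}) = 0 \]
\[ b = \left( \sum_{i=1}^{n} x_i x_i' \right)^{-1} \sum_{i=1}^{n} x_i y_i \]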


• The linear approximation does not depend on any economic or statistical theory, and it holds irrespective of how the data are generated. Moreover, the linear approximation is an in-sample result. It does not give information about observations outside the sample, and there is no direct interpretation of the coefficients.
• Special case. Simple linear regression (K = 2).
\[ y_i = b_0 + b_1 x_i + e_i \]
\[ b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\widehat{\mathrm{Cov}}(x, y)}{\widehat{\mathrm{Var}}(x)} \]
\[ b_0 = \bar{y} - b_1 \bar{x} \]
Let $x_i = 1$ if individual i is male and zero otherwise. If we define
\[ \bar{y}_m = \frac{\sum x_i y_i}{\sum x_i}, \quad \bar{y}_f = \frac{\sum (1 - x_i) y_i}{\sum (1 - x_i)} \]
then
\[ b_0 = \bar{y}_f, \quad b_1 = \bar{y}_m - \bar{y}_f \]

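A quick numerical check of the special case above, as a sketch on simulated data (the sample size and coefficients are arbitrary assumptions): with a single dummy regressor, the OLS intercept is the female mean, the slope is the male-female difference in means, and the slope also equals the sample covariance-to-variance ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=500)              # 1 = male, 0 = female (simulated)
y = 10 + 3 * x + rng.normal(size=500)

# OLS of y on a constant and the dummy.
X = np.column_stack([np.ones(x.size), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

# Group means and the covariance-to-variance formula.
y_m = y[x == 1].mean()
y_f = y[x == 0].mean()
cov_over_var = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(b0, y_f)
print(b1, y_m - y_f, cov_over_var)
```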
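• Matrix notation.
\[ X_{n \times K} = \begin{pmatrix} x_1' \\ \vdots \\ x_n' \end{pmatrix} = \begin{pmatrix} 1 & x_{12} & \cdots & x_{1K} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n2} & \cdots & x_{nK} \end{pmatrix} = (\vec{x}_1 \; \ldots \; \vec{x}_K) \]
\[ y_{n \times 1} = (y_1, \ldots, y_n)' \]
\[ \min_{\tilde{\beta}} S(\tilde{\beta}) = (y - X\tilde{\beta})'(y - X\tilde{\beta}) \]
\[ b = (X'X)^{-1} X'y \]
\[ y = Xb + e \]
\[ X'e = 0 \]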


• Frisch-Waugh-Lovell Theorem.
– Suppose we regress y on $X_1$ and $X_2$ and get
\[ y = X_1 b_1 + X_2 b_2 + u \]
– We can get the same $b_1$ and $u$ in the following way: Regress y and $X_1$ on $X_2$ separately and store the corresponding residuals $e_y$ and $e_1$, and then regress $e_y$ on $e_1$.
\[ e_y = e_1 b_1 + u \]
– We don’t have to net out the effect of $X_2$ on y to get $b_1$ (regressing y itself on $e_1$ gives the same coefficient) unless we also wish to have $u$.
– What’s so good about FWL? Once we purge the regression of less interesting variables, we can reduce it to the bivariate case and make use of the “covariance-to-variance” formula for subsequent analysis.
– Especially useful in bivariate correlation plots.
Example 5. Women and the plough (Alesina et al., 2013, QJE).
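The theorem is easy to verify numerically. The following is a minimal sketch on simulated data (the helper `ols` and the data-generating process are assumptions for illustration): the coefficient on $X_1$ and the residuals from the full regression are reproduced by the residual-on-residual regression, and regressing the raw y on $e_1$ recovers the same coefficient but not the same residuals.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and residuals."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return b, y - X @ b

rng = np.random.default_rng(2)
n = 1_000
X2 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # constant + two controls
x1 = X2[:, 1] + rng.normal(size=n)                            # regressor of interest
y = 1.0 + 2.0 * x1 + X2[:, 1] - X2[:, 2] + rng.normal(size=n)

# Full regression: y on [x1, X2]; the first coefficient is b1.
b_full, u_full = ols(np.column_stack([x1, X2]), y)

# FWL: partial X2 out of both y and x1, then regress residual on residual.
_, e_y = ols(X2, y)
_, e_1 = ols(X2, x1)
b_fwl, u_fwl = ols(e_1.reshape(-1, 1), e_y)

# Shortcut: regress the raw y on e_1 (same slope, different residuals).
b_short, u_short = ols(e_1.reshape(-1, 1), y)

print(b_full[0], b_fwl[0], b_short[0])
```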


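2.2 Finite-sample theory: assumptions, properties, and inference
• Linear statistical model:
\[ y = X\beta + \varepsilon \]
• Key assumption 1: Strict exogeneity.
\[ E(\varepsilon \mid X) = 0 \]
\[ E(y \mid X) = X\beta \]
• Unbiasedness of OLS estimator.
\[ E(b \mid X) = \beta \]
• In a regression that includes control variables, what we implicitly assume is often conditional mean independence, provided we do not care about the causal effects of the controls themselves. (What exactly do control variables control for?)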


• Key assumption 2: Spherical error variance.
\[ \mathrm{Var}(\varepsilon \mid X) = E(\varepsilon\varepsilon' \mid X) = \sigma^2 I_n \]
1. Conditional homoskedasticity.
\[ \mathrm{Var}(\varepsilon_i \mid X) = \sigma^2 \]
2. No correlation between observations.
\[ \mathrm{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0 \]
• Conditional variance of b.
\[ \mathrm{Var}(b \mid X) = \sigma^2 (X'X)^{-1} \]
In the simple linear regression,
\[ \mathrm{Var}(b_1 \mid X) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]


\[ \widehat{\mathrm{Var}}(b_k \mid X) = \frac{s^2}{(n - 1)\,(1 - R_{c,k}^2)\,\widehat{\mathrm{Var}}(x_k)} \]
where $R_{c,k}^2$ is the centered $R^2$ in the regression of $x_k$ on all other variables.
– The larger the sample size,
– The greater the variation in $x_k$,
– The less the correlation of $x_k$ with the other variables,
– The better the overall fit of the regression,
the lower the variance of $b_k$ will be.
An important implication is that the presence of two highly collinear variables does not affect the precision of estimating the coefficient on a third variable of interest.


• Don’t make too much of $R^2$.
– Varies with different samples.
– Sensitive to the definition of y.
– No absolute benchmark to judge whether it is high or low.
– One can easily manipulate a high $R^2$.
– Measures the quality of the linear approximation, but says nothing about whether the model is true, whether the assumptions are valid, or whether the estimator has good statistical properties.


• Key assumption 3: Normality of error distribution.
\[ \varepsilon \mid X \sim N(0, \sigma^2 I_n) \]
\[ (b - \beta) \mid X \sim N(0, \sigma^2 (X'X)^{-1}) \]
• t-test of individual regression coefficients.
\[ H_0: \beta_k = \bar{\beta}_k \]
\[ t_k \triangleq \frac{b_k - \bar{\beta}_k}{\widehat{SE}(b_k)} = \frac{b_k - \bar{\beta}_k}{\sqrt{s^2 [(X'X)^{-1}]_{kk}}} \sim t(n - K) \]


• Decision rules for t-test.
– Test based on critical value.
\[ \mathrm{Prob}\left( -t_{\alpha/2}(n - K) < t_k < t_{\alpha/2}(n - K) \right) = 1 - \alpha \]
Accept $H_0$ if $t_k$ falls within this range. Reject otherwise.
– Test based on p-value.
\[ p = \mathrm{Prob}(t > |t_k|) \times 2 \]
Accept $H_0$ if $p > \alpha$, reject otherwise.
– Test based on confidence interval.
\[ b_k - \widehat{SE}(b_k)\, t_{\alpha/2}(n - K) < \beta_k < b_k + \widehat{SE}(b_k)\, t_{\alpha/2}(n - K) \]
Accept $H_0$ if $\bar{\beta}_k$ falls within this range. Reject otherwise.


• t-test of one linear restriction.
\[ H_0: r_1\beta_1 + \ldots + r_K\beta_K = r'\beta = q \]
\[ t_{stat} \triangleq \frac{r'b - q}{\widehat{SE}(r'b)} = \frac{r'b - q}{\sqrt{s^2\, r'(X'X)^{-1} r}} \sim t(n - K) \]
• F-test of general linear hypotheses.
\[ H_0: R\beta = q \]
\[ F_{stat} = (Rb - q)' \left[ s^2 R (X'X)^{-1} R' \right]^{-1} (Rb - q) / \dim(q) \sim F(\dim(q), n - K) \]
where $\dim(q)$ denotes the number of linear restrictions.


• Decision rules for F -test.
– Test based on critical value.
\[ \mathrm{Prob}\left( F < F_\alpha(\dim(q), n - K) \right) = 1 - \alpha \]
Accept $H_0$ if $F_{stat} < F_\alpha(\dim(q), n - K)$, reject otherwise.
– Test based on p-value.
\[ p = \mathrm{Prob}(F > F_{stat}) \]
Accept $H_0$ if $p > \alpha$, reject otherwise.


2.3 Large-sample theory: assumptions, properties, and inference
• The key assumption for finite-sample theory to hold is normal disturbances. Large-sample theory establishes that, when the sample size is sufficiently large ($n \to \infty$), OLS estimators will have desirable properties even if this assumption is relaxed. It turns out that we can also relax the strict exogeneity assumption.
• Law of large numbers. $\{x_i\}$ i.i.d., $E(x_i) = \mu$, then
\[ \bar{x}_n \to_p \mu \]
A weaker version of the LLN also holds when $\{x_i\}$ is identically distributed and has only weak persistence. That is to say, it allows for (serial or cross-sectional) correlation.


• Central limit theorem. $\{x_i\}$ i.i.d., $E(x_i) = \mu$, $\mathrm{Var}(x_i) = \Sigma$, then
\[ \bar{x}_n \to_d N(\mu, \mathrm{AVar}(\bar{x}_n)) \]
where the asymptotic variance of $\bar{x}_n$ is
\[ \mathrm{AVar}(\bar{x}_n) = \frac{\Sigma}{n} \]
There are weaker versions of the CLT when $\{x_i\}$ is non-independent but uncorrelated and when $\{x_i\}$ is correlated. In the latter case $\mathrm{AVar}(\bar{x}_n)$ involves not only the variance of $x_i$ but also its (auto)covariances.
• Key assumption: Predetermined regressors.
\[ E(x_i \varepsilon_i) = E[x_i (y_i - x_i'\beta)] = 0 \]
This condition is called a moment condition or orthogonality condition.
• Consistency of b.
\[ b - \beta = \left( \frac{1}{n} X'X \right)^{-1} \frac{1}{n} X'\varepsilon = \left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i' \right)^{-1} \left( \frac{1}{n} \sum_{i=1}^{n} x_i \varepsilon_i \right) \to_p 0 \]

• Asymptotic normality of b when $x_i \varepsilon_i$ is uncorrelated.
\[ \frac{1}{n} \sum_{i=1}^{n} x_i \varepsilon_i \to_d N\left( 0, \frac{\mathrm{Var}(x_i \varepsilon_i)}{n} \right) = N\left( 0, \frac{E(\varepsilon_i^2 x_i x_i')}{n} \right) \]
\[ b \to_d N\left( \beta, \frac{1}{n} (E(x_i x_i'))^{-1} E(\varepsilon_i^2 x_i x_i') (E(x_i x_i'))^{-1} \right) \]
– Importantly, the error term can be conditionally heteroskedastic.
– Call $\widehat{SE}(b_k)$ the heteroskedasticity-robust (Huber-White) standard error.
Implementation: robust
– Robust standard errors can be more misleading than homoskedasticity-only standard errors in situations where heteroskedasticity is modest. A conservative rule of thumb (but rarely adopted!) is to use the maximum of the conventional standard errors and the robust standard errors.
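The sandwich formula above can be computed by hand in a few lines. This is a sketch on simulated heteroskedastic data (the data-generating process is an assumption); it implements the plain HC0 version of the estimator, without the finite-sample degrees-of-freedom scaling that statistical packages typically apply, and also forms the conservative "maximum of the two" standard error mentioned in the rule of thumb.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
x = rng.normal(size=n)
eps = rng.normal(size=n) * (0.5 + np.abs(x))     # error variance increases with |x|
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b                                    # OLS residuals

# Conventional (homoskedasticity-only) standard errors: s^2 (X'X)^{-1}.
s2 = e @ e / (n - X.shape[1])
se_conv = np.sqrt(np.diag(s2 * XtX_inv))

# HC0 sandwich: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}.
meat = (X * e[:, None] ** 2).T @ X
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

# Conservative rule of thumb: take the larger of the two.
se_max = np.maximum(se_conv, se_robust)

print(se_conv, se_robust, se_max)
```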


Econometrica, Vol. 53, No. 2 (March, 1985)

MISCELLANEA

ON HETEROS*EDASTICITY

BY J. HUSTON MCCULLOCH*

THE MOST PRESSING ISSUE in econometric orthography today is whether heteros*edasticity should be spelled with a k or with a c. Heteroskedasticity is used in their texts by Dhrymes (1970), Goldberger (1964), Intriligator (1978), Kmenta (1971) and Valavanis (1959), while heteroscedasticity is preferred by Champernowne (1969), Chow (1983), Goldfeld and Quandt (1972), Johnston (1963), Maddala (1979), Malinvaud (1970), and Theil (1971).
Our word is a modern coinage, derived from the two Greek roots hetero- (ἑτερο-), meaning "other" or "different," and skedannumi (σκεδάννυμι), meaning "to scatter." The letter in question is therefore the transliteration of Greek kappa (κ).
In scientific words which scholars have lifted directly from Greek into English, the letter kappa is always transliterated as k. Examples are skeptic (σκεπτικός) and skeleton (σκελετός).
Greek kappa does sometimes make its way into English as c, but only in common words which entered English through French and old scientific words that entered through Latin. Examples are sceptre (σκῆπτρον), scene (σκηνή) and cyclic (κυκλικός). Kappa becomes c in French or Latin, simply because k is not used in these languages except to spell foreign proper names. When such a c is followed by e, i, or y, however, it is always sibilant. The only way a kappa taken into French can retain its "k" sound before one of these vowels is in the rare event that it becomes "qu" (as in squelette).
In English as in French and Latin, c before e is always soft. Examples include ceiling, celery, ceremony, cease, cedar, celestial, celibacy, cell, cement, cent, center, necessary, scent, etc., any of which would sound ridiculous with a hard c.
If heteros*edasticity were spelled with a c, it would thus have had to have entered the English language either in 1066 with the Norman invaders or else in the middle ages from Latin, neither of which was the case. Furthermore, it would have to be pronounced "heterossedasticity," which it is not.
Heteroskedasticity is therefore the proper English spelling.

Ohio State University

Manuscript received February, 1984; revision received May, 1984.

* The author is indebted to Evangelos Falaris and Jerry Thursby for invaluable technical assistance. Any errors or omissions remain the responsibility of the author.
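In summary: the word heteroskedasticity comes from two Greek roots, hetero- (ἑτερο-), meaning "other" or "different", and skedannumi (σκεδάννυμι), meaning "to scatter". The crux of the matter is therefore how the Greek letter κ should be transliterated.
The author finds that in scientific words taken directly from Greek into English, κ is always rendered as k. Only in everyday words that entered English via French, and in scientific words that entered via Latin, is κ sometimes rendered as c, because French and Latin generally do not use the letter k. Moreover, when such a c precedes the vowels e, i, or y, it is pronounced as a sibilant. In addition, in English, c before e is always soft.
For the word to be spelled heteroscedasticity, there are only two possibilities: either it entered English with the Norman invasion of England in 1066, or it entered from Latin in the Middle Ages. Neither is the case. Moreover, heteroscedasticity would then have to be pronounced "heterossedasticity", which it is not.
Therefore, the correct spelling is heteroskedasticity.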


• Asymptotic normality of b when $x_i \varepsilon_i$ is correlated within clusters.
Clustering arises when sampling is of independent groups but errors for individuals within the group are correlated.
Let $X_g$ ($n_g \times K$), $\varepsilon_g$ ($n_g \times 1$), and $e_g$ ($n_g \times 1$) denote the regressor matrix, the error vector, and the residual vector for group g, respectively; $n_g$ is the number of observations in group g; G is the number of groups.
\[ b - \beta = \left( \frac{1}{G} \sum_{g=1}^{G} X_g' X_g \right)^{-1} \left( \frac{1}{G} \sum_{g=1}^{G} X_g' \varepsilon_g \right) \]
\[ b \to_d N\left( \beta, \frac{1}{G} \left[ E(X_g' X_g) \right]^{-1} E(X_g' \varepsilon_g \varepsilon_g' X_g) \left[ E(X_g' X_g) \right]^{-1} \right) \]


– Implementation: cluster(clustvar)
– It is especially necessary to use cluster-robust standard errors when an aggregated or macro variable is included while data are on an individual basis.
– Two high-profile papers advocate the use of cluster-robust standard errors (Bertrand et al., 2004, QJE; Petersen, 2009, RFS).
– Question: clustering at the province level vs. clustering at the city level: which is more robust (that is, imposes weaker implicit assumptions on the error term)?
– Complex clustering can be implemented easily by redefining the cluster variable.
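To make the cluster-robust formula concrete, here is a minimal sketch on simulated data (the cluster sizes and data-generating process are assumptions). The regressor varies only at the cluster level and the errors share a cluster component, which is the "aggregated variable with individual data" situation described above. The estimator shown is the plain sandwich without small-sample corrections.

```python
import numpy as np

rng = np.random.default_rng(4)
G, n_g = 50, 20                                   # 50 clusters of 20 observations
cluster = np.repeat(np.arange(G), n_g)
x = rng.normal(size=G)[cluster]                   # regressor varies only across clusters
eps = rng.normal(size=G)[cluster] + rng.normal(size=G * n_g)   # common cluster shock
y = 1.0 + 0.5 * x + eps

X = np.column_stack([np.ones(G * n_g), x])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b

# Cluster-robust "meat": sum over clusters of (X_g' e_g)(X_g' e_g)'.
meat_cluster = np.zeros((2, 2))
for g in range(G):
    score_g = X[cluster == g].T @ e[cluster == g]
    meat_cluster += np.outer(score_g, score_g)
se_cluster = np.sqrt(np.diag(XtX_inv @ meat_cluster @ XtX_inv))

# Heteroskedasticity-robust (HC0) version, which ignores within-cluster correlation.
meat_hc0 = (X * e[:, None] ** 2).T @ X
se_hc0 = np.sqrt(np.diag(XtX_inv @ meat_hc0 @ XtX_inv))

print(se_hc0, se_cluster)
```

In this design the cluster-robust standard error on the slope is several times larger than the HC0 one, because the effective number of independent observations is closer to G than to the total sample size.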


Estimating Standard Errors in Finance Panel Data Sets

              Firm 1                    Firm 2                    Firm 3
Firm 1   ε11²    ε11ε12  ε11ε13    0       0       0         0       0       0
         ε12ε11  ε12²    ε12ε13    0       0       0         0       0       0
         ε13ε11  ε13ε12  ε13²      0       0       0         0       0       0
Firm 2   0       0       0         ε21²    ε21ε22  ε21ε23    0       0       0
         0       0       0         ε22ε21  ε22²    ε22ε23    0       0       0
         0       0       0         ε23ε21  ε23ε22  ε23²      0       0       0
Firm 3   0       0       0         0       0       0         ε31²    ε31ε32  ε31ε33
         0       0       0         0       0       0         ε32ε31  ε32²    ε32ε33
         0       0       0         0       0       0         ε33ε31  ε33ε32  ε33²

Figure 1
Residual cross product matrix: Assumptions about zero covariances
The figure shows a sample covariance matrix of the residuals. Assumptions about the elements of this matrix and which are zero is the source of difference in the various standard error estimates. The standard OLS assumption is that only the diagonal terms are nonzero. Standard errors clustered by firm assume that the correlation of the residuals within the cluster may be nonzero (these elements are shaded). This cluster assumption assumes that …

• z-test of individual regression coefficients.
\[ H_0: \beta_k = \bar{\beta}_k \]
Robust t-statistic
\[ t_k \triangleq \frac{b_k - \bar{\beta}_k}{\widehat{SE}(b_k)} \to_d N(0, 1) \]
• $\chi^2$-test of linear restrictions.
\[ H_0: R\beta = q \]
Wald statistic
\[ W = (Rb - q)' \left[ R\, \widehat{\mathrm{Var}}(b)\, R' \right]^{-1} (Rb - q) \to_d \chi^2(\dim(q)) \]

• Stata still reports results of the t-test and F-test, which are asymptotically valid and may work better for moderate sample sizes. (But there is no guarantee.)


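2.4 Those “pseudo” nonlinear models: polynomials, logarithms, interactions
Polynomial models.
\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon \]
• Marginal effect.
\[ \frac{dE(y \mid x)}{dx} = \beta_1 + 2\beta_2 x \]
Implementation: lincom.
• Computing the turning point
\[ -\frac{\beta_1}{2\beta_2} \]
Implementation: nlcom.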


Logarithmic models.
1. Linear-log model.
\[ y = \beta_0 + \beta_1 \log x + \varepsilon, \quad \beta_1 = \frac{dy}{d \log x} = \frac{dy}{dx/x} \]
2. Log-linear model.
\[ \log y = \beta_0 + \beta_1 x + \varepsilon, \quad \beta_1 = \frac{d \log y}{dx} = \frac{dy/y}{dx} \]
3. Log-log model.
\[ \log y = \beta_0 + \beta_1 \log x + \varepsilon, \quad \beta_1 = \frac{d \log y}{d \log x} = \frac{dy/y}{dx/x} \]
Rules of thumb for taking logs.
• For positive pecuniary amounts and head counts, logs are often taken.
• Variables measured in years and variables representing a proportion or a percentage usually appear in level form.

Interaction effects.

1. Difference in differences in means.
\[ y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 (D_1 \times D_2) + \varepsilon \]
2. Heterogeneity in slopes.
\[ y = \beta_0 + \beta_1 x + \beta_2 D + \beta_3 (x \times D) + \varepsilon \]
3. Substitutes vs. complements.
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \times x_2) + \varepsilon \]


• Calculate the marginal effect at the mean.
\[ \left. \frac{\partial y}{\partial x_1} \right|_{x_2 = \bar{x}_2} = \hat{\beta}_1 + \hat{\beta}_3 \bar{x}_2 \]
Use lincom or transform the regression model
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 - \bar{x}_1) \times (x_2 - \bar{x}_2) + \varepsilon \]
and the new $\hat{\beta}_1$ is what we want.
• One may want to check robustness to misspecification by considering a full second-order expansion (Balli and Sorensen, 2013, Empir Econ).
• Subsample regressions or interaction terms?
• Insignificance may indicate not the end of the world, but the light of hope: countervailing mechanisms or group-specific heterogeneities are waiting for your call (LU Ming).


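2.5 Selection on Observables vs. Unobservables
• Key reference: Altonji et al. (2005, JPE).
• Once the most important control variables have been included, if the coefficient estimate on the key explanatory variable no longer changes much as further controls are added, then the potential omitted variable bias is probably small.
• Omitted variable bias formula. Suppose the structural model is
\[ y = \beta_0 + \beta_1 x + \beta_2 z + \varepsilon \]
If we regress y on x only, then
\[ \hat{\beta}_1 \to_p \beta_1 + \beta_2 \cdot \frac{\mathrm{Cov}(x, z)}{\mathrm{Var}(x)} \]
Example 6. The economic return to private schools (Dale and Krueger, 2002, QJE).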
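The omitted variable bias formula can be illustrated with a short simulation. This is a sketch under an assumed data-generating process (all coefficients are hypothetical): the short regression of y on x alone converges to $\beta_1 + \beta_2 \mathrm{Cov}(x, z)/\mathrm{Var}(x)$ rather than to $\beta_1$.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
beta1, beta2 = 1.0, 2.0

z = rng.normal(size=n)                       # omitted variable
x = 0.8 * z + rng.normal(size=n)             # x is correlated with z
y = 0.5 + beta1 * x + beta2 * z + rng.normal(size=n)

# Short regression: y on a constant and x only.
X_short = np.column_stack([np.ones(n), x])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0][1]

# Long regression: y on a constant, x, and z.
X_long = np.column_stack([np.ones(n), x, z])
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0][1]

# Probability limit predicted by the formula, using sample moments.
predicted = beta1 + beta2 * np.cov(x, z, ddof=1)[0, 1] / np.var(x, ddof=1)

print(b_short, predicted, b_long)
```

With these assumed parameters, $\mathrm{Cov}(x, z) = 0.8$ and $\mathrm{Var}(x) = 1.64$, so the short-regression coefficient is close to $1 + 2 \times 0.8 / 1.64 \approx 1.98$, while the long regression recovers a value near 1.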


• A working measure. Consider two regressions: one with a restricted set of control variables, and one with a full set of controls. Denote the estimated coefficient for the variable of interest from the first regression $\hat{\beta}^R$ and the estimated coefficient from the second regression $\hat{\beta}^F$. The ratio can be calculated as
\[ \frac{\hat{\beta}^F}{\hat{\beta}^R - \hat{\beta}^F} \]
Example 7. The slave trade and trust (Nunn and Wantchekon, 2011, AER).
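As a small worked illustration of the ratio, the sketch below uses made-up coefficient values (the function name `selection_ratio` and both numbers are hypothetical, not taken from any paper). A large ratio means that adding the observed controls moves the estimate very little relative to its size, so selection on unobservables would have to be correspondingly stronger than selection on observables to explain away the whole effect.

```python
def selection_ratio(beta_restricted, beta_full):
    """Return beta_F / (beta_R - beta_F) for the coefficient of interest."""
    return beta_full / (beta_restricted - beta_full)

# Hypothetical estimates, for illustration only.
ratio = selection_ratio(beta_restricted=-0.20, beta_full=-0.16)
print(round(ratio, 2))
```

Here the estimate shrinks from -0.20 to -0.16 when the full controls are added, giving a ratio of about 4.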


2.6 Panel data models
• Time-constant omitted variables.
\[ y_{it} = x_{it}'\beta + v_{it} = x_{it}'\beta + u_i + \varepsilon_{it}, \quad t = 1, 2, \ldots, T \]
• Pooled OLS. View the panel as a “cross section” with $n \times T$ observations.
\[ b_{POLS} = \left( \sum_{i=1}^{n} \sum_{t=1}^{T} x_{it} x_{it}' \right)^{-1} \left( \sum_{i=1}^{n} \sum_{t=1}^{T} x_{it} y_{it} \right) \]
When is pooled OLS valid?
\[ E(x_{it} v_{it}) = 0, \text{ i.e. } E(x_{it} \varepsilon_{it}) = 0 \text{ and } E(x_{it} u_i) = 0 \]
• Key assumption for fixed-effects models:
\[ E(x_{is} \varepsilon_{it}) = 0, \quad s, t = 1, \ldots, T \]
$u_i$ and $X_i$ can be correlated in any arbitrary form.


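• Within transformation.
\[ \bar{y}_i = \bar{x}_i'\beta + u_i + \bar{\varepsilon}_i \]
where $\bar{y}_i = \frac{1}{T} \sum_{t=1}^{T} y_{it}$, $\bar{x}_i = \frac{1}{T} \sum_{t=1}^{T} x_{it}$, $\bar{\varepsilon}_i = \frac{1}{T} \sum_{t=1}^{T} \varepsilon_{it}$
\[ y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)'\beta + (\varepsilon_{it} - \bar{\varepsilon}_i) \]
\[ \ddot{y}_{it} = \ddot{x}_{it}'\beta + \ddot{\varepsilon}_{it} \]
• Fixed-effects estimator.
\[ b_{FE} = \left( \sum_{i=1}^{n} \ddot{X}_i' \ddot{X}_i \right)^{-1} \left( \sum_{i=1}^{n} \ddot{X}_i' \ddot{y}_i \right) = \left( \sum_{i=1}^{n} \sum_{t=1}^{T} \ddot{x}_{it} \ddot{x}_{it}' \right)^{-1} \left( \sum_{i=1}^{n} \sum_{t=1}^{T} \ddot{x}_{it} \ddot{y}_{it} \right) \]
\[ \widehat{\mathrm{Var}}(b_{FE}) = \left( \sum_{i=1}^{n} \ddot{X}_i' \ddot{X}_i \right)^{-1} \left( \sum_{i=1}^{n} \ddot{X}_i' \hat{\ddot{\varepsilon}}_i \hat{\ddot{\varepsilon}}_i' \ddot{X}_i \right) \left( \sum_{i=1}^{n} \ddot{X}_i' \ddot{X}_i \right)^{-1} \]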


• Stata’s robust option for panel data is automatically cluster(id).
• Stock and Watson (2008, ECMA) suggest that the t-statistic testing one element of $\beta$ should be compared with $\sqrt{\frac{n}{n-1}}\, t_{n-1}$, and the F-statistic testing p elements of $\beta$ with $\frac{n}{n-p} F_{p, n-p}$. Stata does not seem to have adjusted for this.


• Least squares dummy variable (LSDV) regression.
Traditionally we view $u_i$ as parameters rather than random variables.
\[ y_{it} = x_{it}'\beta + \sum_{j=1}^{n} \delta_j D_{it}^j + \varepsilon_{it}, \quad D_{it}^j = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases} \]
– One can easily show that $b_{LSDV} = b_{FE}$.
– From the FOCs of the LSDV regression, we have
\[ \hat{u}_i = \bar{y}_i - \bar{x}_i' b_{FE} \]
The “estimates” of the fixed effects are inconsistent (because consistency would rely on large-T asymptotics, and LSDV regression masks this fact).
– Estimate the constant in the fixed effects model.
\[ b_0 = \frac{1}{n} \sum_{i=1}^{n} \hat{u}_i \]
– How do we make in-sample and out-of-sample predictions?

• First-differencing transformation.
\[ \Delta y_{it} = \Delta x_{it}'\beta + \Delta \varepsilon_{it}, \quad t = 2, \ldots, T \]
\[ b_{FD} = \left( \sum_{i=1}^{n} \Delta X_i' \Delta X_i \right)^{-1} \left( \sum_{i=1}^{n} \Delta X_i' \Delta y_i \right) = \left( \sum_{i=1}^{n} \sum_{t=2}^{T} \Delta x_{it} \Delta x_{it}' \right)^{-1} \left( \sum_{i=1}^{n} \sum_{t=2}^{T} \Delta x_{it} \Delta y_{it} \right) \]

• Comments.
– In both FE and FD, parameters are identified off the variation within groups over time, not between groups. This variation may require justification.
– FE and FD are equivalent when T = 2.
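Three of the claims above (FE equals LSDV, FE equals FD when T = 2, and pooled OLS is inconsistent when $u_i$ is correlated with $x_{it}$) can be checked on a simulated two-period panel. This is a sketch with a single regressor and an assumed data-generating process; the true slope of 1.5 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n, T = 300, 2
u = rng.normal(size=n)                               # individual effects u_i
x = rng.normal(size=(n, T)) + u[:, None]             # x_it correlated with u_i
y = 1.5 * x + u[:, None] + rng.normal(size=(n, T))

# Fixed effects via the within transformation.
x_dm = x - x.mean(axis=1, keepdims=True)
y_dm = y - y.mean(axis=1, keepdims=True)
b_fe = (x_dm * y_dm).sum() / (x_dm ** 2).sum()

# First differences (T = 2, so one difference per individual).
dx = x[:, 1] - x[:, 0]
dy = y[:, 1] - y[:, 0]
b_fd = (dx * dy).sum() / (dx ** 2).sum()

# LSDV: regress stacked y on stacked x and a full set of individual dummies.
dummies = np.kron(np.eye(n), np.ones((T, 1)))        # (n*T, n), rows ordered by i then t
X_lsdv = np.column_stack([x.reshape(-1), dummies])
b_lsdv = np.linalg.lstsq(X_lsdv, y.reshape(-1), rcond=None)[0][0]

# Pooled OLS with a constant, which ignores u_i.
X_pols = np.column_stack([np.ones(n * T), x.reshape(-1)])
b_pols = np.linalg.lstsq(X_pols, y.reshape(-1), rcond=None)[0][1]

print(b_fe, b_fd, b_lsdv, b_pols)
```

FE, FD, and LSDV coincide up to floating-point error, while pooled OLS is biased upward here because the individual effect enters both x and y with a positive sign.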


– Both FE and FD require $\varepsilon_{it}$ to be uncorrelated with lagged, contemporaneous, and future x. If we maintain only contemporaneous exogeneity, i.e., $E(x_{it} \varepsilon_{it}) = 0$, FE is generally less inconsistent than FD when T is large. That’s why FE is more popular in estimating static panel data models.
– It does not make sense to use lagged x to solve an endogeneity problem.
– One shortcoming of fixed-effects models is that, if the variable of primary interest has little variation across time, FE or FD will be very imprecise. In this case, we may want to estimate time-specific slopes for this variable.
• In practice we usually include in the equation both individual-specific effects and time-specific effects, i.e., two-way models.
• How to construct rich controls of fixed effects?


2.7 Standard errors! Standard errors!
• Two-way clustering: Cameron, Gelbach, and Miller (2009, JBES).
Implementation: cgmreg. This feature is also incorporated into ivreg2.

Estimating Standard Errors in Finance Panel Data Sets

              Firm 1                    Firm 2                    Firm 3
Firm 1   ε11²    ε11ε12  ε11ε13    ε11ε21  0       0         ε11ε31  0       0
         ε12ε11  ε12²    ε12ε13    0       ε12ε22  0         0       ε12ε32  0
         ε13ε11  ε13ε12  ε13²      0       0       ε13ε23    0       0       ε13ε33
Firm 2   ε21ε11  0       0         ε21²    ε21ε22  ε21ε23    ε21ε31  0       0
         0       ε22ε12  0         ε22ε21  ε22²    ε22ε23    0       ε22ε32  0
         0       0       ε23ε13    ε23ε21  ε23ε22  ε23²      0       0       ε23ε33
Firm 3   ε31ε11  0       0         ε31ε21  0       0         ε31²    ε31ε32  ε31ε33
         0       ε32ε12  0         0       ε32ε22  0         ε32ε31  ε32²    ε32ε33
         0       0       ε33ε13    0       0       ε33ε23    ε33ε31  ε33ε32  ε33²

Figure 6
Residual cross product matrix: Firm and time effects

• Asymptotic theory holds only when the number of clusters goes to infinity. But in practice we may have as few as 4 clusters. Fewer than 30 clusters is considered too few.
• Finite-sample refinement of (one-way) cluster-robust standard errors based on the bootstrap method (Cameron, Gelbach, and Miller, 2008, ReStat).
Implementation: cgmwildboot
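To make the two-way clustering idea concrete, here is a rough sketch of the "firm + time − intersection" covariance construction for OLS. It is an illustration only, not the code behind cgmreg: the function names are hypothetical, cluster identifiers are assumed to be non-negative integer codes, and no finite-sample correction is applied.

```python
import numpy as np

def cluster_meat(X, u, groups):
    """Sum over clusters g of (X_g' u_g)(X_g' u_g)'."""
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        score = X[groups == g].T @ u[groups == g]
        meat += np.outer(score, score)
    return meat

def twoway_cluster_vcov(y, X, firm, time):
    """OLS coefficients with a two-way (firm and time) clustered covariance."""
    bread = np.linalg.inv(X.T @ X)
    b = bread @ X.T @ y
    u = y - X @ b
    # One id per firm-time intersection (assumes non-negative integer codes)
    pair = firm.astype(np.int64) * (int(time.max()) + 1) + time
    meat = (cluster_meat(X, u, firm) + cluster_meat(X, u, time)
            - cluster_meat(X, u, pair))
    return b, bread @ meat @ bread

# Simulated panel with firm and time components in the error (assumed design)
rng = np.random.default_rng(2)
n_firm, n_time = 50, 10
firm = np.repeat(np.arange(n_firm), n_time)
time = np.tile(np.arange(n_time), n_firm)
x = rng.normal(size=firm.size) + rng.normal(size=n_firm)[firm]
e = (rng.normal(size=firm.size) + rng.normal(size=n_firm)[firm]
     + rng.normal(size=n_time)[time])
y = 1.0 + 0.5 * x + e
X = np.column_stack([np.ones(firm.size), x])
b, V = twoway_cluster_vcov(y, X, firm, time)
print(b)
print(V)
```

If every observation is its own "time" cluster, the time and intersection terms cancel and the formula collapses to one-way clustering by firm, which is a convenient sanity check.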



Lecture 3 Matching
• Key reference: Imbens (2015, JHR), “Matching Methods in Practice: Three Examples.”
3.1 How Matching Works
• Definition of the treatment effect and its identifying assumption. If
$D \perp Y^0 \mid X$
(selection on observables; D is ignorable given X), then
$E(Y \mid X, D=1) - E(Y \mid X, D=0)$
$= E(Y^1 \mid X, D=1) - E(Y^0 \mid X, D=0)$
$= E(Y^1 \mid X, D=1) - E(Y^0 \mid X, D=1)$
$= E(Y^1 - Y^0 \mid X, D=1)$
$= \tau_1(X)$

$\tau_1 = E\{E(Y^1 - Y^0 \mid X, D=1) \mid D=1\} = \int \tau_1(x)\, dF_{X \mid D=1}(x)$
$= E\{Y - E(Y \mid X, D=0) \mid D=1\}$
(What does $E\{E(Y \mid X, D=0) \mid D=1\}$ mean?)
$\tau_0$ and $\tau$ can be defined analogously, with the corresponding identifying assumptions.
• Matching is a nonparametric way of controlling for X; there is no need to assume a homogeneous treatment effect.
• The identifying assumption of matching is the same as that of OLS.
So here comes the question:


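3.1.1 Why Not Just OLS?
• Idea: we need to estimate the conditional expectation functions of the treated and control groups, $E(Y \mid X, D=1)$ and $E(Y \mid X, D=0)$. OLS assumes these functions are linear. We can run separate regressions on the two subsamples, or one regression on the full sample:
$Y = \beta_0 + \beta_1 D + \beta_2 X + \beta_3 D \cdot X + U$
$E(Y \mid X, D=0) = \beta_0 + \beta_2 X$
$E(Y \mid X, D=1) = \beta_0 + \beta_1 + \beta_2 X + \beta_3 X$
$\hat\tau_1 = (\hat\beta_0 + \hat\beta_1 + \hat\beta_2 \bar X_T + \hat\beta_3 \bar X_T) - (\hat\beta_0 + \hat\beta_2 \bar X_T) = \hat\beta_1 + \hat\beta_3 \bar X_T$
$\hat\tau_0 = (\hat\beta_0 + \hat\beta_1 + \hat\beta_2 \bar X_C + \hat\beta_3 \bar X_C) - (\hat\beta_0 + \hat\beta_2 \bar X_C) = \hat\beta_1 + \hat\beta_3 \bar X_C$
$\hat\tau = \frac{N_T}{N}\hat\tau_1 + \frac{N_C}{N}\hat\tau_0 = \hat\beta_1 + \hat\beta_3 \bar X$
(Fortunately, nothing here goes beyond what we already know.)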
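The algebra above can be verified on simulated data. The sketch below (simulated design, seed, and names are assumptions for illustration) fits the fully interacted regression, computes the implied effect estimates, and confirms that the treated-group estimate equals what one gets from a separate control-sample regression used to impute counterfactuals for the treated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
D = rng.integers(0, 2, size=n)
X = rng.normal(loc=0.5 * D, scale=1.0)          # covariate shifted for the treated
Y = 1.0 + 2.0 * D + 0.5 * X + 0.8 * D * X + rng.normal(size=n)

# Full-sample regression with the D*X interaction
W = np.column_stack([np.ones(n), D, X, D * X])
b0, b1, b2, b3 = np.linalg.lstsq(W, Y, rcond=None)[0]
tau1_hat = b1 + b3 * X[D == 1].mean()           # effect on the treated
tau0_hat = b1 + b3 * X[D == 0].mean()           # effect on the controls
tau_hat = b1 + b3 * X.mean()                    # average effect

# Same tau1 from a control-sample regression used to impute Y^0 for the treated
A_c = np.column_stack([np.ones((D == 0).sum()), X[D == 0]])
a0, a2 = np.linalg.lstsq(A_c, Y[D == 0], rcond=None)[0]
tau1_sub = Y[D == 1].mean() - (a0 + a2 * X[D == 1].mean())

print(tau1_hat, tau1_sub, tau0_hat, tau_hat)
```

Because the regression is saturated in D, the full-sample and subsample approaches give identical fitted lines, so the two versions of the treated-group estimate agree exactly.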



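• But for the OLS estimate of the treatment effect to be consistent, at least one of the following two conditions must hold (proof omitted):
– the conditional expectation function is truly linear;
– the distribution of X is the same in the treated and control groups.
Otherwise there is extrapolation bias caused by misspecification. Consider an example.
• Another way to see this: how much does each observation contribute to the OLS estimate? Note that
$\hat\tau_1 = \bar Y_T - (\hat\beta_0 + \hat\beta_2 \bar X_T)$
$= \bar Y_T - \bar Y_C - \hat\beta_2 (\bar X_T - \bar X_C)$
$= \bar Y_T - \bar Y_C - \frac{\sum_{c \in C}(X_c - \bar X_C)(Y_c - \bar Y_C)}{\sum_{c \in C}(X_c - \bar X_C)^2}(\bar X_T - \bar X_C)$
$= \bar Y_T - \frac{1}{N_C}\sum_{c \in C}\left[1 - \frac{(X_c - \bar X_C)(\bar X_C - \bar X_T)}{\overline{X^2}_C - \bar X_C^2}\right] Y_c$
$\omega_c = 2.8084 - 0.0949 \cdot X_c$
The individual with re75 = 156.7 receives the largest negative weight, −12.05, yet that individual's outcome is uninformative for the treated individuals, whose re75 ∈ [0, 25.14].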


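3.2 Preparing for Matching
3.2.1 Choosing X
• Basic principle: “pre but not post”.
• Must be controlled for:
[Causal diagram; nodes: D, Y]
• Must not be controlled for:
[Causal diagrams; nodes: X1, X2, D, Y and X, D, Y]
• Control only when the direct effect is of interest:
[Causal diagram; nodes: D, Y]
• Example: a dilemma
[Causal diagram; nodes: D, Y, X1, X2]
X1: unobserved ability
D: taking part in primary-school math olympiad competitions
X2: educational attainment
Y: income level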
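3.2.2 Assessing Overlap
• Compute the normalized difference
$\Delta = \frac{\bar X_T - \bar X_C}{\sqrt{(s_C^2 + s_T^2)/2}}$
$s_C^2 = \frac{1}{N_C - 1}\sum_{c \in C}(X_c - \bar X_C)^2, \quad s_T^2 = \frac{1}{N_T - 1}\sum_{t \in T}(X_t - \bar X_T)^2$
• This statistic looks very much like the t statistic for testing whether two sample means are equal, but using the latter is not recommended. (Why?)
$t\text{-stat} = \frac{\bar X_T - \bar X_C}{\sqrt{s_C^2/N_C + s_T^2/N_T}}$
• If the normalized difference is large, consider trimming the sample.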
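One way to see why the t statistic is a poor measure of covariate imbalance is to compute both statistics on a sample and on the same sample replicated many times. The sketch below uses made-up data and hypothetical helper names; the point is only that the normalized difference barely moves while the t statistic grows with the sample size.

```python
import numpy as np

def normalized_difference(x_t, x_c):
    """(mean_T - mean_C) / sqrt((s_C^2 + s_T^2) / 2)."""
    s2_t, s2_c = x_t.var(ddof=1), x_c.var(ddof=1)
    return (x_t.mean() - x_c.mean()) / np.sqrt((s2_c + s2_t) / 2)

def t_statistic(x_t, x_c):
    """(mean_T - mean_C) / sqrt(s_C^2 / N_C + s_T^2 / N_T)."""
    s2_t, s2_c = x_t.var(ddof=1), x_c.var(ddof=1)
    return (x_t.mean() - x_c.mean()) / np.sqrt(s2_c / x_c.size + s2_t / x_t.size)

rng = np.random.default_rng(0)
x_t = rng.normal(loc=0.5, size=50)   # treated covariate values (simulated)
x_c = rng.normal(loc=0.0, size=50)   # control covariate values (simulated)

nd_small, t_small = normalized_difference(x_t, x_c), t_statistic(x_t, x_c)

# Same covariate distributions, 100 times as many observations
x_t_big, x_c_big = np.tile(x_t, 100), np.tile(x_c, 100)
nd_big, t_big = normalized_difference(x_t_big, x_c_big), t_statistic(x_t_big, x_c_big)

print(nd_small, nd_big)   # essentially unchanged
print(t_small, t_big)     # grows roughly with sqrt(sample size)
```

The difficulty of adjusting for a covariate depends on how different its distributions are, not on how many observations we happen to have, which is what the normalized difference captures.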



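3.2.3 Estimating the Propensity Score
• The functional form of h(·) is not chosen to provide a causal interpretation, but only to better approximate the conditional expectation function.
$\pi(X) = \frac{\exp(h(X)'\gamma)}{1 + \exp(h(X)'\gamma)}$
• Specification search (based on stepwise regression):
1. First decide, based on experience, which variables must be included (with no prior information, include only the intercept).
2. Then add each of the remaining linear terms one at a time, run a likelihood ratio test on the significance of the added term's coefficient, and add to h(·) the linear term with the largest test statistic.
3. Repeat step 2 for the remaining linear terms until the largest test statistic among the candidate terms in a round falls below the critical value $C_{lin} = 1$.
4. Then add the second-order terms (squares and interactions) one at a time, proceeding as in steps 2-3, with critical value $C_{qua} = 2.71$.
• Estimate the propensity score using the final h(·).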
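3.2.4 Trimming the Sample
• Matching on $\pi(X)$, based on the log odds ratio
$l(X) = \ln\left(\frac{\pi(X)}{1 - \pi(X)}\right)$
Matching without replacement (starting from the largest $l$) yields a balanced sample.
Implementation: psmatch2
• Truncating on $\pi(X)$ (Crump et al., 2009, Biometrika).
Keep the observations with $\alpha \le \pi(X) \le 1 - \alpha$, where $\alpha$ satisfies
$\frac{1}{\alpha(1-\alpha)} = 2E\left[\frac{1}{\pi(X)(1-\pi(X))} \,\middle|\, \frac{1}{\pi(X)(1-\pi(X))} \le \frac{1}{\alpha(1-\alpha)}\right]$
• Rule of thumb: $\alpha = 0.1$.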
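As a small illustration of the rule of thumb, the sketch below trims a vector of estimated propensity scores at 0.1 and 0.9 and computes the log odds ratio. The function names and the example scores are hypothetical; solving the Crump et al. condition for the optimal cutoff is not attempted here.

```python
import numpy as np

def log_odds(pscore):
    """Log odds ratio l(X) = ln(p / (1 - p))."""
    return np.log(pscore / (1.0 - pscore))

def trim_on_pscore(pscore, alpha=0.1):
    """Boolean mask keeping observations with alpha <= p <= 1 - alpha."""
    return (pscore >= alpha) & (pscore <= 1.0 - alpha)

pscore = np.array([0.05, 0.2, 0.5, 0.8, 0.97])   # made-up estimated scores
keep = trim_on_pscore(pscore, alpha=0.1)
print(keep)
print(log_odds(pscore[keep]))
```

The mask can then be applied to the outcome, treatment, and covariate arrays before matching or weighting.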

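3.3 Nearest-Neighbor Matching
• Matching metric (to address the dimensionality problem): Mahalanobis distance (quadratic distance)
$(X_c - X_t)'\,\Omega(X)^{-1}(X_c - X_t)$
where $\Omega(X)$ is the sample covariance matrix of X.
• For both NNM and the PSM discussed below, it may be necessary to carry out exact matching first, e.g., on gender or industry.
• Estimator
$\hat\tau_1 = \frac{1}{N_T}\sum_{t \in T}\left\{Y_t - \hat E(Y \mid X_t, D=0)\right\}$
$\hat E(Y \mid X_t, D=0) = \frac{1}{M}\sum_{c \in C_t} Y_c$
(Note: $\hat E(Y \mid X_t, D=0)$ is not a consistent estimator of $E(Y \mid X_t, D=0)$, because M is a fixed constant. But what we want to estimate is $\tau_1$, not $E(Y \mid X_t, D=0)$ itself, so this is not a problem.)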
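The estimator above can be sketched in a few lines. This is a bare-bones illustration, not what teffects nnmatch does internally: the function name is hypothetical, matching is with replacement, ties are broken by sort order, and no bias correction or standard error is computed.

```python
import numpy as np

def nn_match_att(Y, D, X, M=1):
    """ATT by M-nearest-neighbor matching on the Mahalanobis distance."""
    Y = np.asarray(Y, dtype=float)
    D = np.asarray(D)
    X = np.asarray(X, dtype=float).reshape(len(Y), -1)
    # Inverse of the sample covariance matrix of X
    omega_inv = np.linalg.inv(np.atleast_2d(np.cov(X, rowvar=False)))
    X_t, Y_t = X[D == 1], Y[D == 1]
    X_c, Y_c = X[D == 0], Y[D == 0]
    effects = []
    for x_t, y_t in zip(X_t, Y_t):
        diff = X_c - x_t
        dist = np.einsum("ij,jk,ik->i", diff, omega_inv, diff)
        nearest = np.argsort(dist)[:M]           # the M closest controls
        effects.append(y_t - Y_c[nearest].mean())
    return float(np.mean(effects))

# Tiny example: the treated unit (X = 0) is matched to the control with X = 0.1
Y = [5.0, 1.0, 2.0]
D = [1, 0, 0]
X = [[0.0], [0.1], [3.0]]
print(nn_match_att(Y, D, X, M=1))
```

In the toy example the single treated unit is compared with its closest control, so the estimate is the outcome difference between those two units.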
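• The sample variance cannot be used as an estimate of the asymptotic variance, and neither can the bootstrap variance (Abadie and Imbens (2008, ECMA)).
• Abadie and Imbens (2006, ECMA) prove the consistency of the NNMatch estimator and give the correct standard error formula.
• Abadie and Imbens (2011, JBES) bias-correct NNMatch (because matching is not exact) and show that the standard error formula of Abadie and Imbens (2006) remains valid. Below we take the ATE as an example.
– Estimator
$\hat\tau^{BC} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat Y_i^1 - \hat Y_i^0\right)$
$\hat Y_i^1 = D_i Y_i + (1 - D_i)\frac{1}{M}\sum_{t \in T_i}\left\{Y_t + \hat\mu_1(X_i) - \hat\mu_1(X_t)\right\}$
$\hat Y_i^0 = (1 - D_i) Y_i + D_i \frac{1}{M}\sum_{c \in C_i}\left\{Y_c + \hat\mu_0(X_i) - \hat\mu_0(X_c)\right\}$
where $\hat\mu_d(X)$ is the OLS estimate of $\mu_d(X) = E(Y \mid X, D=d)$.
– Asymptotic variance
$\frac{1}{N}\sum_i \left(Y_i^{s1} - Y_i^{s0} - \hat\tau^{BC}\right)^2 + \frac{1}{N}\sum_i \left\{\left(\frac{K_M(i)}{M}\right)^2 + \frac{2M-1}{M}\,\frac{K_M(i)}{M}\right\}\frac{J}{J+1}\left(Y_i - \frac{1}{J}\sum_{j=1}^{J} Y_{l_j(i)}\right)^2$
where
$Y_i^{s1} = D_i Y_i + (1 - D_i)\frac{1}{M}\sum_{t \in T_i} Y_t$
$Y_i^{s0} = (1 - D_i) Y_i + D_i \frac{1}{M}\sum_{c \in C_i} Y_c$
$K_M(i)$ is the number of times observation i is used as a match; $l_j(i)$ is the j-th closest observation to observation i within the same (treated or control) group.
• Implementation: teffects nnmatch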


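3.4 Propensity Score Matching

• Why does PSM work? We first show that, given $\pi(X)$, the distribution of X is the same in the treated and control groups. In other words, in the distributional sense, to balance X it suffices to balance $\pi(X)$.
$\pi(X) = P(D=1 \mid X) = E(D \mid X)$
$\pi(X) = E\{\pi(X) \mid \pi(X)\}$
$= E\{E(D \mid X) \mid \pi(X)\}$
$= E\{D \mid \pi(X)\}$
$= P\{D=1 \mid \pi(X)\}$

$P(D=1, X \le t \mid \pi(X)) = E\{D \cdot 1(X \le t) \mid \pi(X)\}$
$= E\{E[D \cdot 1(X \le t) \mid X] \mid \pi(X)\}$
$= E\{E(D \mid X) \cdot 1(X \le t) \mid \pi(X)\}$
$= E\{\pi(X) \cdot 1(X \le t) \mid \pi(X)\}$
$= \pi(X)\, P\{X \le t \mid \pi(X)\}$

$P\{X \le t \mid D=1, \pi(X)\} = \frac{P(D=1, X \le t \mid \pi(X))}{P\{D=1 \mid \pi(X)\}} = P\{X \le t \mid \pi(X)\}$
• Compare:
– Matching on X: makes X identical (or at least close) between the treated and control groups.
– Randomized experiment: makes the distributions of X and $\varepsilon$ the same.
– Matching on $\pi(X)$: makes the distribution of X the same.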


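• Identification of PSM
$D \perp (Y^0, Y^1) \mid X \;\Rightarrow\; D \perp (Y^0, Y^1) \mid \pi(X)$
$E\{Y \mid \pi(X), D=1\} - E\{Y \mid \pi(X), D=0\} = E\{Y^1 - Y^0 \mid \pi(X)\}$
$\Rightarrow\; E\{E\{Y^1 - Y^0 \mid \pi(X)\}\} = E\{Y^1 - Y^0\}$
• Estimator
$\hat\tau_1^{PSM} = \frac{1}{N_T}\sum_{t \in T}\left(Y_t - \frac{1}{M}\sum_{c \in C_t} Y_c\right)$
• Abadie and Imbens (2016, ECMA) derive the asymptotic distribution of the PSM estimator.
• Implementation: teffects psmatch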


3.4.1 Estimation Based on Subclassification

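Besides nearest-neighbor matching, the second estimation method recommended by Imbens (2015) works as follows:

• Partition the sample into blocks according to $\pi(X)$, run OLS separately within each block, then take a weighted average.
• Blocks are based on the log odds ratio $l(X)$, ensuring that the treated and control groups within each block have similar $\bar l$ (checked by a t test).
Implementation: ttest + xtile
• Standard errors must be computed manually.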


3.5 Non-Binary Treatment

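• Weighting.
$E\left[\frac{DY}{\pi(X)}\right] = E\left[\frac{D\{(1-D)Y^0 + DY^1\}}{\pi(X)}\right] = E\left[\frac{DY^1}{\pi(X)}\right]$
$= E\left[\frac{E(DY^1 \mid X)}{\pi(X)}\right] = E\left[\frac{E(D \mid X)\,E(Y^1 \mid X)}{\pi(X)}\right]$
$= E\{E(Y^1 \mid X)\} = E(Y^1)$
$E\left[\frac{(1-D)Y}{1 - \pi(X)}\right] = E(Y^0)$
$\hat\tau^{W} = \frac{1}{N}\sum_i \left[\frac{D_i}{\hat\pi(X_i)} - \frac{1 - D_i}{1 - \hat\pi(X_i)}\right] Y_i$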
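The weighting estimator is short enough to sketch directly. In the simulation below the true propensity score is used in place of an estimate, which is a simplifying assumption; the design, seed, and function name are also made up for illustration. The true average effect in this design is 1.5, while the naive difference in means is contaminated by selection on X.

```python
import numpy as np

def ipw_ate(Y, D, pscore):
    """Inverse-probability-weighting estimate of the average treatment effect."""
    return float(np.mean((D / pscore - (1 - D) / (1 - pscore)) * Y))

rng = np.random.default_rng(0)
n = 200_000
X = rng.uniform(size=n)
pscore = 0.2 + 0.6 * X                      # true propensity score (assumed known)
D = (rng.uniform(size=n) < pscore).astype(float)
Y0 = X + rng.normal(size=n)
Y1 = Y0 + 1.0 + X                           # heterogeneous effect, mean 1.5
Y = D * Y1 + (1 - D) * Y0

ate_ipw = ipw_ate(Y, D, pscore)
naive = Y[D == 1].mean() - Y[D == 0].mean()
print(ate_ipw, naive)
```

In applied work the propensity score would be estimated first (for example with the stepwise logit described earlier) and plugged into the same formula.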



• Regression adjustment.
$E(Y \mid X, D=d) = X'\beta_d, \quad d = 0, 1$
$\hat\tau = \frac{1}{N}\sum_i X_i'\hat\beta_1 - \frac{1}{N}\sum_i X_i'\hat\beta_0 = \frac{1}{N}\sum_i X_i'\left(\hat\beta_1 - \hat\beta_0\right)$
• Multivalued and continuous treatments.
– For a multivalued treatment, define analogously
$\pi_d(X) = P(D = d \mid X)$
– For a continuous treatment, define $f_{D \mid X}(d \mid X)$ as the generalized propensity score.
– Implementation: other options of teffects. See SJ14-3, SJ14-1, SJ13-3, SJ8-3.


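Lecture 4 Instrumental Variables
• An explanatory variable is said to be endogenous if it is correlated with the error term. Endogeneity causes the OLS estimator to be inconsistent.
$b = \beta + \left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} x_i \varepsilon_i\right)$
$\to_p \beta + \left(E(x_i x_i')\right)^{-1} E(x_i \varepsilon_i) \ne \beta$
if $E(x_i \varepsilon_i) \ne 0$.
In the simple regression $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$,
$b_1^{OLS} = \frac{\widehat{\mathrm{Cov}}(y_i, x_i)}{\widehat{\mathrm{Var}}(x_i)} \to_p \frac{\mathrm{Cov}(y_i, x_i)}{\mathrm{Var}(x_i)} = \beta_1 + \frac{\mathrm{Cov}(\varepsilon_i, x_i)}{\mathrm{Var}(x_i)} \ne \beta_1$
if $E(x_i \varepsilon_i) \ne 0$.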


• Instrumental variable solution. A predetermined variable that is correlated with the endogenous regressor is called an instrumental variable, denoted $z_i$.
$b_1^{IV} \triangleq \frac{\widehat{\mathrm{Cov}}(y_i, z_i)}{\widehat{\mathrm{Cov}}(x_i, z_i)} \to_p \frac{\mathrm{Cov}(y_i, z_i)}{\mathrm{Cov}(x_i, z_i)} = \beta_1 + \frac{\mathrm{Cov}(\varepsilon_i, z_i)}{\mathrm{Cov}(x_i, z_i)} = \beta_1$
if $\mathrm{Cov}(\varepsilon_i, z_i) = 0$ and $\mathrm{Cov}(x_i, z_i) \ne 0$.
• Two-stage least squares.
– Stage 1: Regress $x_i$ on $z_i$ using OLS, and obtain $\hat x_i$.
$x_i = \pi_0 + \pi_1 z_i + \omega_i = \hat x_i + \omega_i$
– Stage 2: Regress $y_i$ on $\hat x_i$ using OLS.
$y_i = \beta_0 + \beta_1 \hat x_i + [\varepsilon_i + \beta_1 (x_i - \hat x_i)]$
$b_1^{2SLS} \triangleq \frac{\widehat{\mathrm{Cov}}(y_i, \hat x_i)}{\widehat{\mathrm{Var}}(\hat x_i)}$


– The second-stage regressor satisfies the orthogonality condition:
$\mathrm{Cov}(\hat x_i, \varepsilon_i + \beta_1 (x_i - \hat x_i)) = \mathrm{Cov}(\hat x_i, \varepsilon_i) + \beta_1 \mathrm{Cov}(\hat x_i, \omega_i) = 0$
• The equivalence between $b_1^{IV}$ and $b_1^{2SLS}$.
• Summary of the basic idea.
[Causal diagrams; nodes: x, y, ε and z, x, y, ε]
– Relevance: IV should be correlated with the endogenous explanatory variable.
$\mathrm{Cov}(z, x) \ne 0$
– Exclusion: IV should not appear on the right hand side of the structural equation.
– Exogeneity (independence): IV should be uncorrelated with the error term.
$\mathrm{Cov}(z, \varepsilon) = 0$
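The equivalence between the covariance-ratio IV estimator and two-stage least squares in the single-instrument case can be confirmed numerically. The following sketch uses a simulated design (coefficients, seed, and names are assumptions for illustration) in which the true slope is 2 and OLS is biased upward.

```python
import numpy as np

def cov(a, b):
    """Sample covariance (divisor n)."""
    return np.mean((a - a.mean()) * (b - b.mean()))

rng = np.random.default_rng(3)
n = 5000
z = rng.normal(size=n)                       # instrument
e = rng.normal(size=n)                       # structural error
x = 0.8 * z + 0.6 * e + rng.normal(size=n)   # endogenous regressor
y = 1.0 + 2.0 * x + e

b_ols = cov(y, x) / cov(x, x)
b_iv = cov(y, z) / cov(x, z)

# Two-stage least squares done by hand
pi1 = cov(x, z) / cov(z, z)
pi0 = x.mean() - pi1 * z.mean()
x_hat = pi0 + pi1 * z                        # first-stage fitted values
b_2sls = cov(y, x_hat) / cov(x_hat, x_hat)   # second-stage slope

print(b_ols, b_iv, b_2sls)
```

The two IV numbers agree to machine precision because the fitted value is just a linear function of z, so the first-stage slope cancels from the ratio.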
4.1 Sources of Endogeneity
4.1.1 Simultaneity Bias
... arising from equilibrium conditions.
• Demand curve.
$q_i^d = \alpha_0 + \alpha_1 p_i + u_i$
• Supply curve.
$q_i^s = \beta_0 + \beta_1 p_i + v_i$
• Market equilibrium.
$q_i^d = q_i^s$
• Simplifying assumptions.
$E(u_i) = E(v_i) = \mathrm{Cov}(u_i, v_i) = 0$
• Price is endogenous in both the demand and supply equations.
• Regressing $q_i$ on $p_i$ (and a constant) does not consistently estimate either the demand curve or the supply curve.
• In order to consistently estimate (for example) the demand slope, we need to find an IV for $p_i$ in the demand equation: an observable factor that is predetermined w.r.t. the demand curve, but correlated with $p_i$. A natural candidate is a pure supply shifter.

... arising from reverse causality.

• “Institutions affect economic performance.”
$g_i = \alpha_0 + \alpha_1 d_i + u_i$
• “Rich economies choose or can afford better institutions.”
$d_i = \beta_0 + \beta_1 g_i + v_i$


4.1.2 Omitted Variables Bias
• The true model.
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_K x_K + \gamma q + \varepsilon$
• q is unobservable. What we estimate is
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_K x_K + u, \quad u \triangleq \gamma q + \varepsilon$
• Write the linear projection of q onto the observable regressors,
$q = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + \ldots + \delta_K x_K + v$
$E(v) = 0, \quad \mathrm{Cov}(x_k, v) = 0, \quad k = 1, 2, \ldots, K$
$y = (\beta_0 + \gamma\delta_0) + (\beta_1 + \gamma\delta_1) x_1 + (\beta_2 + \gamma\delta_2) x_2 + \ldots + (\beta_K + \gamma\delta_K) x_K + (\gamma v + \varepsilon)$
• It is easy to see
$\mathrm{plim}_{n \to \infty}\, b_k = \beta_k + \gamma\delta_k$
• Suppose all $\delta_k$ except $\delta_K$ are zero, then
$\mathrm{plim}_{n \to \infty}\, b_k = \beta_k, \quad k = 1, 2, \ldots, K-1$
$\mathrm{plim}_{n \to \infty}\, b_K = \beta_K + \gamma\,\frac{\mathrm{Cov}(x_K, q)}{\mathrm{Var}(x_K)}$
• For example, $x_K$ denotes years of schooling and q denotes unobserved ability. A more able person tends to have a higher wage ($\gamma > 0$), and is also likely to receive more education ($\mathrm{Cov}(x_K, q) > 0$), therefore OLS will overestimate the return to schooling.


OLS Solution: Find a proxy.
• Use IQ, denoted by z, as a proxy variable for unobserved ability.
$q = \theta_0 + \theta_1 z + \omega, \quad \mathrm{Cov}(z, \omega) = 0$
• Qualifications for a valid proxy.
1. Redundancy: z is irrelevant for explaining y once x and q have been controlled for. (For example, we don’t need to control for IQ if ability were observable and hence controlled.)
$\mathrm{Cov}(z, \varepsilon) = 0$
2. The correlation between q and x is zero once z is partialled out.
$\mathrm{Cov}(x, \omega) = 0$
• Consistent OLS estimator with proxy.
$y = (\beta_0 + \gamma\theta_0) + \beta_1 x_1 + \ldots + \beta_K x_K + \gamma\theta_1 z + (\gamma\omega + \varepsilon)$
$\mathrm{Cov}(x, \gamma\omega + \varepsilon) = 0, \quad \mathrm{Cov}(z, \gamma\omega + \varepsilon) = 0$
Proxy vs. IV. If we are unable to find a valid proxy (for ability), then
we try to find a valid instrument (for years of schooling). Both a proxy
variable and an instrument variable must be redundant (do not appear in
the true model that explicitly contains the omitted variable). However,
a proxy is with regard to the omitted variable, while an IV is with regard
to the endogenous explanatory variable. In other words, a proxy should
be highly correlated with the omitted variable, while an IV should be
uncorrelated with the omitted variable. Therefore, a proxy makes a poor
IV, and an IV makes a poor proxy.



4.1.3 Measurement Error
• The true model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_K x_K^* + \varepsilon$
• Measurement error.
$x_K = x_K^* + e_K$
$E(e_K) = 0, \quad \mathrm{Cov}(x_k, e_K) = 0 \;\; \forall\, k \ne K, \quad \mathrm{Cov}(\varepsilon, e_K) = 0$
• The classical errors-in-variables (CEV) assumption.
$\mathrm{Cov}(x_K^*, e_K) = 0$
In some cases it is clear that the CEV assumption cannot be true.
• What we estimate is
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_K x_K + u, \quad u \triangleq \varepsilon - \beta_K e_K$
$\mathrm{Cov}(x_K, u) = -\beta_K \mathrm{Cov}(x_K, e_K) = -\beta_K \sigma_{e_K}^2 \ne 0$
• OLS estimators of all $\beta_k$ are inconsistent.
• Attenuation bias (bias toward zero). Easy to show in cases where there is a single regressor (K = 1) measured with error or $x_K$ ($x_K^*$) is uncorrelated with all other $x_k$ ($k \ne K$),
$b_K \to_p \beta_K \left(1 - \frac{\mathrm{Var}(e_K)}{\mathrm{Var}(x_K)}\right)$
• Again, we shall try to find an instrument for the mismeasured variable.
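A quick simulation makes the attenuation factor tangible. In the sketch below (simulated design with a single regressor; the variances, seed, and names are assumptions for illustration), the true slope is 2, the measurement-error variance is 0.25 and the variance of the observed regressor is 1.25, so the formula predicts an OLS limit of 2 × (1 − 0.25/1.25) = 1.6. A second, independently mismeasured copy of the regressor is then used as an instrument.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
x_star = rng.normal(scale=1.0, size=n)          # true regressor, variance 1
x = x_star + rng.normal(scale=0.5, size=n)      # observed with CEV error
x2 = x_star + rng.normal(scale=0.5, size=n)     # second noisy measurement
y = 1.0 + 2.0 * x_star + rng.normal(size=n)

# OLS on the mismeasured regressor: attenuated toward zero
b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
predicted = 2.0 * (1 - 0.25 / 1.25)

# IV using the second measurement as an instrument for the first
b_iv = np.cov(x2, y)[0, 1] / np.cov(x2, x)[0, 1]

print(b_ols, predicted, b_iv)
```

The IV step works because the two measurement errors are independent of each other and of the structural error, so the second measurement is correlated with the first only through the true regressor.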



Measurement error vs. proxy. The measurement error problem has
a statistical structure similar to the proxy variable problem, but they are
conceptually very different. In the proxy variable case, we are looking
for a variable that is somehow associated with the omitted variable
(unobservable and usually not well-defined) in order to cope with the
endogeneity of other explanatory variables. We cannot estimate the effect
of the omitted variable per se. In the measurement error case, the
variable that we do not observe has a well-defined quantitative meaning
but our measure of it may contain error. The mismeasured explanatory
variable is the very one whose effect is of primary interest and its own
endogeneity is what we are addressing.


Multiple indicators (repeated measurement) solution.
• The true model
$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_K x_K + \gamma q + \varepsilon$
• Indicator
$z_1 = \theta_0 + \theta_1 q + \omega_1$
$\mathrm{Cov}(q, \omega_1) = \mathrm{Cov}(\varepsilon, \omega_1) = \mathrm{Cov}(x, \omega_1) = 0$
• OLS estimator is inconsistent.
$y = -\frac{\gamma\theta_0}{\theta_1} + x'\beta + \frac{\gamma}{\theta_1} z_1 + \left(\varepsilon - \frac{\gamma}{\theta_1}\omega_1\right)$
• Assume there exists a second indicator $z_2$ that satisfies the same assumptions as $z_1$,
$z_2 = \rho_0 + \rho_1 q + \omega_2$
If $\mathrm{Cov}(\omega_1, \omega_2) = 0$ then $z_2$ can be used as an IV for $z_1$.
• This approach nests the CEV model as a special case.
• This approach is very different from the IV solution that leaves q in the error term. In the latter case, we must decide which elements of x are correlated with q and find IVs for them. In the former case, we need not know such information, and the elements of x serve as their own instruments.


4.2 Two-Stage Least Squares
• Key assumption 1 (moment condition): $z_i$ is predetermined.
$E(z_i \varepsilon_i) = E[z_i (y_i - x_i'\beta)] = 0$
• Key assumption 2: $z_i$ and $x_i$ are sufficiently linearly correlated.
• Method of moments: A method of estimating population moments (expectations of functions of random vectors) with their sample analogues (sample means). OLS estimation is a method of moments.
• IV estimation as a method of moments.
$E(z_i \varepsilon_i) = E[z_i (y_i - x_i'\beta)] = 0$
$E(z_i x_i')\beta = E(z_i y_i)$
$\beta = \left(E(z_i x_i')\right)^{-1} E(z_i y_i)$
$b_{IV} = \left(\frac{1}{n}\sum_{i=1}^{n} z_i x_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} z_i y_i\right) = (Z'X)^{-1} Z'y$


• $b_{IV}$ is a consistent estimator of $\beta$ and is asymptotically normally distributed.
• IV estimation applies to just-identified cases only.
• Linear combinations of IVs are still valid IVs. In overidentified cases we can construct K combinations of all L available IVs to make it just identified. 2SLS offers such a way of construction.
$\hat X = Z(Z'Z)^{-1} Z'X$
$b_{2SLS} = (\hat X'\hat X)^{-1}\hat X'y = (\hat X'X)^{-1}\hat X'y$
$= \left[X'Z(Z'Z)^{-1}Z'X\right]^{-1} X'Z(Z'Z)^{-1}Z'y$
• $b_{2SLS}$ is a consistent estimator of $\beta$ and is asymptotically normally distributed.
• What goes wrong when we do the two stages manually?
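One answer to the question above is that the point estimates are fine but the second-stage standard errors are not: a manual second stage computes residuals with the fitted regressors, whereas the correct 2SLS residuals use the original regressors. The sketch below illustrates this on simulated data under homoskedasticity (design, seed, and names are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
z = rng.normal(size=n)
e = rng.normal(size=n)
x = 0.5 * z + 0.8 * e + rng.normal(size=n)     # endogenous regressor
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)  # first-stage fitted values
b = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
XhXh_inv = np.linalg.inv(X_hat.T @ X_hat)

# Correct 2SLS residuals use the ORIGINAL regressors X
resid_correct = y - X @ b
# A manual second-stage OLS would use the FITTED regressors X_hat
resid_manual = y - X_hat @ b

sigma2_correct = resid_correct @ resid_correct / (n - 2)
sigma2_manual = resid_manual @ resid_manual / (n - 2)
se_correct = np.sqrt(sigma2_correct * np.diag(XhXh_inv))
se_manual = np.sqrt(sigma2_manual * np.diag(XhXh_inv))

print(b)
print(se_correct, se_manual)
```

In this particular design the manual standard errors are far too large; in general they can err in either direction, which is why a dedicated IV routine is preferable to running the two stages by hand.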



• The bias of 2SLS estimator.
– bias is increasing in the number of instruments;
– bias is increasing in the “degree of endogeneity”;
– bias is decreasing in the fit of the first-stage;
– bias is increasing in the number of exogenous covariates;
– bias is decreasing in the variance of the endogenous variable;
– as instruments become weak, $E(\hat\beta_{2SLS} - \beta) \to E(\hat\beta_{OLS} - \beta)$: weak instruments bias 2SLS towards OLS;
– as n gets large, the bias decreases: 2SLS is consistent.


4.3 Forbidden Regression
• Definition: improper use of instruments leads to inconsistency.
– Case 1: Linear projection is incomplete.
– Case 2: The expectation operator does not pass through nonlinear functions (of either the first or the second stage).
• Define a modified 2SLS estimator as $\hat\beta_{M2SLS} = (\tilde X'\tilde X)^{-1}\tilde X'y$, where $\tilde X$ is an estimator of $E(X \mid Z)$. Define an indirect least squares estimator as $\hat\beta_{ILS} = (\tilde X'X)^{-1}\tilde X'y$.
– Result 1: If $\tilde X$ is a consistent estimator of $E(X \mid Z)$, then $\hat\beta_{M2SLS}$ is a consistent estimator of $\beta$.
– Result 2: If the first stage is misspecified, $\hat\beta_{M2SLS}$ is inconsistent. The standard 2SLS estimator is not subject to this restriction (the OLS first stage is often misspecified).
– Result 3: Even if the first stage is misspecified, the ILS estimator remains consistent.


• Implications:
– Forbidden regression using a subset of the included exogenous variables in the first stage. Suppose $X = (x_1 \; x_2 \; y_2)$, $Z = (x_1 \; z)$. Modified 2SLS is consistent only if $E(y_2 \mid x_1, x_2, z) = E(y_2 \mid x_1, z)$ and $E(y_2 \mid x_1, z)$ is linear. ILS, by contrast, is always consistent.
– Using a probit/logit in the first stage when $y_2$ is binary. Modified 2SLS is consistent only if the first-stage probit/logit is correctly specified. ILS, by contrast, is always consistent, and is more efficient than standard 2SLS.
– Forbidden regression using nonlinear functions of predicted values from the first stage rather than taking the nonlinear transformation before estimating the first stage. For example, $y = \beta_0 + \beta_1 y_2 + \beta_2 y_2^2 + u$: $(\hat y_2)^2$ cannot be a consistent estimator of $E(y_2^2 \mid Z)$. The correct 2SLS approach is to estimate two first-stage equations using the same set of Z. ILS remains consistent.


– Wooldridge’s (2010) example.
$hours = \gamma_{12}\log(wage) + \gamma_{13}[\log(wage)]^2 + \delta_{10} + \delta_{11} educ + \delta_{12} age + \delta_{13} kidslt6 + \delta_{14} kidsge6 + \delta_{15} nwifeinc + u_1$
$\log(wage) = \delta_{20} + \delta_{21} educ + \delta_{22} exper + \delta_{23} exper^2 + u_2$
▷ Use some squares and cross products of the exogenous variables as instruments and apply the IV procedures directly. Wooldridge adds three quadratic terms as additional instruments:
$age^2, \; educ^2, \; nwifeinc^2$
▷ You might wonder: once the new IVs are added, the first stage for log(wage) in effect also uses them. Is that a problem? First, even if the CEF of log(wage) does not contain these new IVs, adding them does not reduce the efficiency of the estimates. Second, if we wanted to use different regressors in the first stages for log(wage) and $[\log(wage)]^2$, we would have to do it manually. But ...
▷ We might use $(\hat y_2)^2$ as a single IV for $y_2^2$.


– When the structural model contains interaction terms, the common practice is to use the interacted instruments as instruments.
– One cannot apply a two-step procedure to
$y_1 = 1[z_1\delta_1 + \alpha_1 y_2 + u_1 > 0]$
$y_2 = 1[z\delta_2 + v_2 > 0]$
Since $E(y_2 \mid z) = \Phi(z\delta_2)$ and $\delta_2$ is consistently estimated, it is tempting to run a probit of $y_1$ on $z_1$ and $\hat\Phi(z\hat\delta_2)$. For this procedure to work, we have to have $\Pr(y_1 = 1 \mid z) = \Phi[z_1\delta_1 + \alpha_1\Phi(z\delta_2)]$. But $\Pr(y_1 = 1 \mid z) = E(y_1 \mid z) = E(1[z_1\delta_1 + \alpha_1 y_2 + u_1 > 0] \mid z)$. Since the indicator function is nonlinear, we cannot pass the expected value through.
– For more complicated nonlinear models, it is advisable to treat them as linear models anyway (Angrist, 2001, JBES).


4.4 A First Look at the Generalized Method of Moments
• The overidentified case corresponds to a system of linear equations which may not have a solution.
$E\left(z_i (y_i - x_i'\beta)\right) = 0$
$\frac{1}{n}\sum_{i=1}^{n} z_i (y_i - x_i'\tilde\beta) = 0$
• Sample moments should be close to zero if population moments hold.
$\min_{\tilde\beta} J_n(\tilde\beta, W_n) = n\left(\frac{1}{n}\sum_{i=1}^{n} z_i (y_i - x_i'\tilde\beta)\right)' W_n \left(\frac{1}{n}\sum_{i=1}^{n} z_i (y_i - x_i'\tilde\beta)\right)$
where $W_n$ is a symmetric and positive definite weighting matrix.
• For any weighting matrix, there exists a consistent and asymptotically normally distributed GMM estimator.
$b_{GMM} = (X'Z W_n Z'X)^{-1} X'Z W_n Z'y$
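The closed form above is easy to code. The sketch below (simulated overidentified design with two instruments; names and seed are assumptions for illustration) evaluates the linear GMM estimator for two weighting matrices and checks that choosing the inverse of Z'Z reproduces 2SLS.

```python
import numpy as np

def gmm_linear(y, X, Z, W):
    """Linear GMM estimator b = (X'Z W Z'X)^{-1} X'Z W Z'y."""
    XZW = X.T @ Z @ W
    return np.linalg.solve(XZW @ Z.T @ X, XZW @ Z.T @ y)

rng = np.random.default_rng(8)
n = 5000
z = rng.normal(size=(n, 2))                   # two excluded instruments
e = rng.normal(size=n)
x = z @ np.array([0.7, 0.5]) + 0.6 * e + rng.normal(size=n)
y = 1.0 + 2.0 * x + e
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

b_identity = gmm_linear(y, X, Z, np.eye(Z.shape[1]))      # W = I
b_w2sls = gmm_linear(y, X, Z, np.linalg.inv(Z.T @ Z))     # W = (Z'Z)^{-1}

# 2SLS via first-stage fitted values, for comparison
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

print(b_identity, b_w2sls, b_2sls)
```

Both weighting matrices give estimates near the true slope of 2; they differ only in efficiency, which is what the choice of weighting matrix is about.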



• The optimal (efficient) GMM estimator is the one in the bGMM family
that has the smallest asymptotic variance.
• The 2SLS estimator also belongs to the $b_{GMM}$ family. It coincides with the optimal GMM estimator under conditional homoskedasticity.


4.5 Related Tests

• Key reference: Baum et al. (2007, SJ7-4). “Enhanced Routines for Instrumental Variables/Generalized Method of Moments Estimation and Testing.”

Testing the relevance of instruments.

• Test the joint significance of the excluded instruments in the first-stage regression. Rule of thumb when there is only one endogenous regressor: $F_{stat} > 10$. The F-statistic might be misleading when there are multiple endogenous regressors.
• Stock-Yogo test of weak instruments.


Testing overidentifying restrictions.

• When we have more instruments than needed to identify an equation,


we can test whether the instruments are valid in the sense that they are
uncorrelated with the error term.
• These are tests of the joint hypotheses of correct model specification
and the orthogonality conditions. Rejection may be either because
instruments are not truly exogenous, or because they are incorrectly
excluded from the regression. Moreover, it may be either because
the excluded instruments are not good, or because the predetermined
regressors are actually endogenous.
• Hansen’s J (Sargan) test.
$J_n\left(b_{EGMM}, \hat S^{-1}\right) \to_d \chi^2(L - K)$


• Testing a subset of overidentifying restrictions: Hayashi’s C (difference-in-Sargan) test.
Suppose we can divide the L instruments into two groups: $L_1$ variables that are known to satisfy the moment conditions, and $L_2$ variables that are suspect. The moment conditions regarding $L_2$ are testable if $L_1 \ge K$. The idea is to compare two $J_n$ from two separate GMM estimators of the same regression, one using only the $L_1$ instruments, and the other using the full set of L instruments. If the inclusion of the $L_2$ suspect instruments significantly increases $J_n$, that is a good reason for doubting the predeterminedness of the $L_2$ instruments.
$C \triangleq J - J_1 \to_d \chi^2(L - L_1)$
• Even if partially testable, the exogeneity of instruments has to be justified mainly on theoretical grounds.


Testing for endogeneity of the regressors.

• H0: The regressors of interest are exogenous.


• Hayashi’s C test.

Indirect test of the exclusion restriction. In samples where the first


stage is zero, the reduced form should be zero as well. On the other
hand, a statistically significant reduced-form estimate with no evidence
of a corresponding first stage is cause for worry, because this suggests
some channel other than the treatment variable links instruments with
outcomes. We can construct “no-first-stage samples” and check whether
they generate no evidence of significant reduced-form effects (Angrist
and Pischke, 2014).



Two important options in ivreg2.
• Two-way clustering: cluster(varname1 varname2).
• Partialling out some exogenous regressors: partial(varlist), especially useful when using cluster() and the number of clusters is less than L, or when requesting a robust covariance matrix and the regressors include dummies.
Why can't we manufacture our own IV?
$y = \beta_1 x_1 + \beta_2 z_2 + u$
$x_1 = \pi_1 z_1 + \pi_2 z_2 + v$
• Endogeneity comes from the correlation between u and v.
• Question: generate a random variable that is sufficiently correlated with $x_1$. Since it is generated arbitrarily, it certainly does not affect y (it satisfies the exclusion restriction). Would such a variable make a good IV?
• Monte-Carlo experiment: Draw u and v from a multivariate standard normal distribution with correlation 0.5. $\beta_1 = 5$, $\beta_2 = \pi_1 = \pi_2 = 1$.
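A version of this Monte-Carlo experiment can be sketched as follows. How the manufactured instrument is generated is not specified in the slides; here it is assumed to be x1 plus independent noise, and the instruments z1 and z2 are assumed to be independent standard normals. Under these assumptions the manufactured IV inherits the endogeneity of x1 and reproduces the OLS bias (probability limit 5.25 rather than 5).

```python
import numpy as np

def tsls(y, X, Z):
    """2SLS: b = (X_hat'X)^{-1} X_hat'y with X_hat = Z(Z'Z)^{-1}Z'X."""
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

rng = np.random.default_rng(6)
n = 20_000
u, v = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=n).T
z1, z2 = rng.normal(size=n), rng.normal(size=n)
x1 = z1 + z2 + v                      # pi_1 = pi_2 = 1
y = 5 * x1 + z2 + u                   # beta_1 = 5, beta_2 = 1
fake = x1 + rng.normal(size=n)        # hypothetical manufactured "instrument"

X = np.column_stack([x1, z2])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0][0]
b_true_iv = tsls(y, X, np.column_stack([z1, z2]))[0]
b_fake_iv = tsls(y, X, np.column_stack([fake, z2]))[0]

print(b_ols, b_true_iv, b_fake_iv)
```

The manufactured variable is highly relevant, but it is correlated with u through v: not entering the structural equation is not enough, since an instrument must also be uncorrelated with the error term.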
What should we do in practice?

• Report the first stage result and think about whether it (magnitude and
sign) makes sense.
• Report the reduced-form regression of the dependent variable on instruments.
• Pick your best single instrument and report just-identified estimates.
• Check over-identified 2SLS and GMM (CUE) estimates. Worry if they
are very different.
• Carry out specification tests.


4.6 Some Magical Ways to Conjure Instruments out of Thin Air
4.6.1 Exploiting Higher-Order Moments
Generated instruments using heteroskedasticity.
• Key reference: Lewbel (2013, JBES).
$y_1 = x'\beta_1 + \gamma y_2 + \varepsilon_1$
$y_2 = x'\beta_2 + \varepsilon_2$
• Let z be a vector of exogenous variables; in particular, z could be a subvector of x, or z could equal x. Under the following assumptions, all the parameters are identified:
1. $\mathrm{Cov}(z, \varepsilon_1\varepsilon_2) = 0$
2. $\mathrm{Cov}(z, \varepsilon_2^2) \ne 0$
• The model can be estimated by 2SLS of $y_1$ on x and $y_2$ using x and $(z - \bar z)\hat\varepsilon_2$ as instruments.
• Implementation: ivreg2; ivreg2h.
A control function approach.

• Key reference: Klein and Vella (2009, JHR; 2010, JoE).
• KV construct a nonlinear control function that can be estimated with minimal distributional assumptions (conditional heteroskedasticity of $\varepsilon_1$ and $\varepsilon_2$).
$y_1 = x'\beta_1 + \gamma y_2 + \rho\left[\frac{\sqrt{\mathrm{Var}(\varepsilon_1 \mid x)}}{\sqrt{\mathrm{Var}(\varepsilon_2 \mid x)}}\right]\varepsilon_2 + u$


4.6.2 Giving Up on Point Estimation
• Key reference: Conley et al. (2012, REStat).
$y = X\beta + Z\gamma + \varepsilon$
$X = Z\Pi + V$
$\hat\beta = (Z'X)^{-1} Z'Y \to_p \beta + \gamma/\Pi$
There is typically a trade-off between instrument strength and degree of violation of the exclusion restriction.
• Union of confidence intervals with support assumption.
$(Y - Z\gamma_0) = X\beta + \varepsilon$
Estimate $\beta$ and construct a symmetric $(1 - \alpha)$ confidence interval in the usual way:
$CI(1 - \alpha, \gamma_0) = \left[\hat\beta(\gamma_0) \pm z_{1-\alpha/2}\,\widehat{SE}(\hat\beta(\gamma_0))\right]$
Do so for each $\gamma_0$ and construct the union of the confidence intervals.
$CI(1 - \alpha) = \bigcup_{\gamma_0 \in \Gamma} CI(1 - \alpha, \gamma_0)$
• Local-to-zero approximation. Suppose
$\sqrt{n}\,\gamma \sim G$
$\hat\beta = (X'P_Z X)^{-1} X'P_Z y$
$= (X'P_Z X)^{-1} X'P_Z (X\beta + Z\gamma + \varepsilon)$
$\sqrt{n}(\hat\beta - \beta) = (X'P_Z X)^{-1} X'Z \sqrt{n}\,\gamma + \sqrt{n}\,(X'P_Z X)^{-1} X'P_Z \varepsilon$
For example, if we choose a Gaussian prior,
$\gamma \sim N(\mu, \Omega)$
$\hat\beta \sim N(\beta + A\mu, \; \mathrm{Var}(\hat\beta) + A\Omega A')$
where $A = (X'P_Z X)^{-1} X'Z$.
• Implementation: plausexog.
• Applications: Nunn and Wantchekon (2011, AER); Cosar and Demir (2016, JDE); Liu and Lu (2015, JIE); Dincecco and Prado (2012, JEG); just to name a few.
4.7 LATE

• Wald estimator: IV estimator with a binary instrument Z for a binary regressor D
$\beta = \frac{E(Y \mid Z=1) - E(Y \mid Z=0)}{E(D \mid Z=1) - E(D \mid Z=0)}$
$= \frac{\mathrm{Cov}(Y, Z)/\mathrm{Var}(Z)}{\mathrm{Cov}(D, Z)/\mathrm{Var}(Z)}$
$= \frac{\mathrm{Cov}(Y, Z)}{\mathrm{Cov}(D, Z)}$
The numerator is called the intent-to-treat effect: the effect of treatment intention (effect of assignment), not the effect of actual treatment received.
(How can this be explained intuitively?)


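• What happens with a heterogeneous treatment effect?
$Y_i^1 - Y_i^0 = \underbrace{E\left(Y_i^1 - Y_i^0\right)}_{\beta} + \underbrace{\left[Y_i^1 - Y_i^0 - E\left(Y_i^1 - Y_i^0\right)\right]}_{U_i}$
$Y = Y^0 + (Y^1 - Y^0) D$
$= Y^0 + (\beta + U) D$
$= \beta D + (Y^0 + UD)$
– In general, $\mathrm{Cov}(Z, UD) \ne 0$ because $\mathrm{Cov}(Z, D) \ne 0$, unless U is mean independent of (Z, D).
– But this assumption is unreasonable, because U is part of the treatment effect.
– Therefore the IV estimator of $\beta$ is not a consistent estimator of $E\left(Y_i^1 - Y_i^0\right)$.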


• So what does the IV estimator actually estimate?
Define the potential treatments $D^0$ and $D^1$ (treatment status when Z = 0 and when Z = 1):
– $D^0 = 0, D^1 = 0$: never-takers
– $D^0 = 0, D^1 = 1$: compliers
– $D^0 = 1, D^1 = 0$: defiers
– $D^0 = 1, D^1 = 1$: always-takers
Imbens and Angrist (1994) show that if
1. $P(D=1 \mid Z=1) \ne P(D=1 \mid Z=0)$: compliers exist
2. $D_i^1 \ge D_i^0 \;\; \forall i$: no defiers
3. $(Y^0, Y^1, D^0, D^1) \perp Z$
then
$E(Y \mid Z=1) - E(Y \mid Z=0)$
$= E\{DY^1 + (1-D)Y^0 \mid Z=1\} - E\{DY^1 + (1-D)Y^0 \mid Z=0\}$
$= E\{D^1 Y^1 + (1-D^1)Y^0 \mid Z=1\} - E\{D^0 Y^1 + (1-D^0)Y^0 \mid Z=0\}$
$= E\{D^1 Y^1 + (1-D^1)Y^0\} - E\{D^0 Y^1 + (1-D^0)Y^0\}$
$= E\{(D^1 - D^0)(Y^1 - Y^0)\}$
$= E(Y^1 - Y^0 \mid D^1 - D^0 = 1)\, P(D^1 - D^0 = 1)$

$D^1 - D^0 = 1 \iff D^1 = 1, D^0 = 0$ (compliers)

$E(Y^1 - Y^0 \mid \text{compliers}) = \frac{E(Y \mid Z=1) - E(Y \mid Z=0)}{P(\text{compliers})}$
$= \frac{E(Y \mid Z=1) - E(Y \mid Z=0)}{E(D \mid Z=1) - E(D \mid Z=0)}$
because
$E(D \mid Z=1) - E(D \mid Z=0)$
$= P(D=1 \mid Z=1) - P(D=1 \mid Z=0)$
$= P(\text{always-takers or compliers}) - P(\text{always-takers})$
$= P(\text{compliers})$
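The LATE result can be illustrated with a simulation in which the three compliance types have different treatment effects. In the sketch below (type shares, effect sizes, seed, and names are all assumptions for illustration) compliers have an effect of 2, always-takers 6, and never-takers 0, so the population average effect is 2.2, yet the Wald ratio recovers the complier effect.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400_000
# Compliance types: 0 = always-taker, 1 = complier, 2 = never-taker (no defiers)
types = rng.choice(3, size=n, p=[0.2, 0.5, 0.3])
Z = rng.integers(0, 2, size=n)                 # randomly assigned instrument

D0 = (types == 0).astype(int)                  # treatment status if Z = 0
D1 = (types != 2).astype(int)                  # treatment status if Z = 1
D = np.where(Z == 1, D1, D0)

effect = np.select([types == 0, types == 1], [6.0, 2.0], default=0.0)
Y0 = rng.normal(size=n)
Y1 = Y0 + effect
Y = np.where(D == 1, Y1, Y0)

itt = Y[Z == 1].mean() - Y[Z == 0].mean()      # intent-to-treat effect
first_stage = D[Z == 1].mean() - D[Z == 0].mean()
wald = itt / first_stage
ate = effect.mean()

print(itt, first_stage, wald, ate)
```

The first stage estimates the complier share (0.5 here), and dividing the intent-to-treat effect by it rescales the assignment effect to the subpopulation whose treatment status actually responds to Z.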



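4.7.1 Intuition of LATE
Observable:
          D = 1    D = 0
Z = 1     n11      n10
Z = 0     n01      n00

Unobservable (assuming no defiers):
                     D = 0 when Z = 0                  D = 1 when Z = 0
D = 0 when Z = 1     Never-taker ($n_n^o$, $n_n^u$)    Defier
D = 1 when Z = 1     Complier ($n_c^t$, $n_c^c$)       Always-taker ($n_a^o$, $n_a^u$)

          D = 1                                         D = 0
Z = 1     Complier ($n_c^t$) & Always-taker ($n_a^u$)   Never-taker ($n_n^o$)
Z = 0     Always-taker ($n_a^o$)                        Complier ($n_c^c$) & Never-taker ($n_n^u$)

Because Z is random, the proportions within the Z = 0 and Z = 1 subsets represent the population proportions. Writing $n_a$, $n_n$, $n_c$ (no superscript) for the population shares of always-takers, never-takers, and compliers:
$n_a = \frac{n_{01}}{n_{01} + n_{00}}, \quad n_n = \frac{n_{10}}{n_{11} + n_{10}}$
$n_c = 1 - n_a - n_n = \frac{n_{11} n_{00} - n_{10} n_{01}}{(n_{11} + n_{10})(n_{01} + n_{00})}$
$n_a^u = n_{01} \cdot \frac{n_{11} + n_{10}}{n_{01} + n_{00}}, \quad n_n^u = n_{10} \cdot \frac{n_{01} + n_{00}}{n_{11} + n_{10}}$
$n_c^t = \frac{n_{11} n_{00} - n_{10} n_{01}}{n_{01} + n_{00}}, \quad n_c^c = \frac{n_{11} n_{00} - n_{10} n_{01}}{n_{11} + n_{10}}$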
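These formulas translate directly into a small helper. The function name, the dictionary keys, and the example counts below are made up for illustration; the calculation assumes random assignment of Z and no defiers, as in the text.

```python
def compliance_breakdown(n11, n10, n01, n00):
    """Shares and implied counts of compliance types from the Z-by-D table.

    n_zd is the number of observations with Z = z and D = d.
    """
    share_always = n01 / (n01 + n00)
    share_never = n10 / (n11 + n10)
    share_complier = 1 - share_always - share_never
    return {
        "share_always": share_always,
        "share_never": share_never,
        "share_complier": share_complier,
        # always-takers hidden among the Z = 1, D = 1 cell
        "always_unobserved": n01 * (n11 + n10) / (n01 + n00),
        # never-takers hidden among the Z = 0, D = 0 cell
        "never_unobserved": n10 * (n01 + n00) / (n11 + n10),
        # compliers in the treated cell and in the control cell
        "complier_treated": (n11 * n00 - n10 * n01) / (n01 + n00),
        "complier_control": (n11 * n00 - n10 * n01) / (n11 + n10),
    }

result = compliance_breakdown(n11=70, n10=30, n01=20, n00=80)
print(result)
```

As a consistency check, the implied complier and always-taker counts in the Z = 1, D = 1 cell should add up to n11, and similarly for the Z = 0, D = 0 cell.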

Type            Z = 0      Z = 1
Always-taker    $n_a^o$    $n_a^u$
Complier        $n_c^c$    $n_c^t$
Never-taker     $n_n^u$    $n_n^o$


• LATE concerns the compliers: $n_c^c + n_c^t$
• ATT concerns the always-takers and a subset of the compliers: $n_a^o + n_a^u + n_c^t$
• If $n_a$ is small, we may take it that there are no always-takers, $P(D=1 \mid Z=0) = 0$. ATT then concerns $n_c^t$, and since Z is randomly assigned, LATE equals ATT.
• The first stage is exactly the share of compliers; when everyone is a complier, the first stage equals 1.

Example 8. Adams et al. (2009, Journal of Empirical Finance).

$E(Y \mid Z=1, D=1) = \frac{n_c}{n_c + n_a} E(Y^1 \mid \text{complier}) + \frac{n_a}{n_c + n_a} E(Y^1 \mid \text{always-taker})$

$E(Y \mid Z=0, D=1) = E(Y^1 \mid \text{always-taker})$

$E(Y \mid Z=0, D=0) = \frac{n_c}{n_c + n_n} E(Y^0 \mid \text{complier}) + \frac{n_n}{n_c + n_n} E(Y^0 \mid \text{never-taker})$

$E(Y \mid Z=1, D=0) = E(Y^0 \mid \text{never-taker})$


We can compare
$E(Y^1 \mid \text{complier})$ vs. $E(Y^1 \mid \text{always-taker})$
and
$E(Y^0 \mid \text{complier})$ vs. $E(Y^0 \mid \text{never-taker})$
If there is little difference, it is plausible that the average effect for compliers is indicative of average effects for other compliance types.


Lecture 5 Regression Discontinuity

• Key reference: Lee and Lemieux (2010, JEL), “Regression Discontinuity Designs in Economics.”


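5.1 How Regression Discontinuity Works
• The before-after comparison is a special case of regression discontinuity.
$\hat\tau = \frac{1}{N}\sum_i (Y_{i1} - Y_{i0}) \to_p E(Y_1 - Y_0)$
$= E(Y_1^1 - Y_0^0)$
$= \underbrace{E(Y_1^1 - Y_1^0)}_{\text{treatment effect on the post-treated}} + E(Y_1^0 - Y_0^0)$
– Define $D_t = 1(t \ge 1)$. When $D_t$ jumps, $E(Y_t)$ jumps with it.
– Identifying assumption: $E(Y_1^0 - Y_0^0) = 0$, to be read as: $E(Y_t^0)$ is smooth (continuous) at t = 1. The shorter the time interval, the more plausible this assumption.
• More generally, if entry into the treatment group is determined by some clear-cut threshold rule
$D_i = \begin{cases} 0 & X < c \\ 1 & X \ge c \end{cases}$
then the difference in outcomes on the two sides of the threshold can be regarded as the treatment effect.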


[Figure 1 from Lee and Lemieux (2010), "Regression Discontinuity Designs in Economics": Simple Linear RD Setup. Outcome variable (Y) plotted against the assignment variable (X), with a jump τ at the cutoff c.]


• Formally, let $D_i$ indicate receipt of treatment, so that $E(D \mid X) = P(D=1 \mid X)$ is the probability of receiving treatment. If $E(D \mid X)$ has a discontinuity and D affects Y, then $E(Y \mid X)$ also has a discontinuity. X is called the assignment/forcing/running variable.
• Define the rule of assignment to treatment: $\delta_i = 1(X_i \ge c)$
– If $D = \delta$, then $E(D \mid X) = D$: this is called sharp RD.
– If $D \ne \delta$, then $E(D \mid X) \ne D$: this is called fuzzy RD.
[Figure from Lee and Lemieux (2010), panels: A. Randomized Experiment; B. Regression Discontinuity Design; C. Matching on Observables. Vertical axes: E[D|X], E[W|X]; horizontal axis: x.]
• Intuitively, the ratio of the jump in $E(Y \mid X)$ to the jump in $E(D \mid X)$ is the treatment effect.
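[Causal diagrams; nodes: X, δ, D, Y, ε]

• As the diagrams show, to study the effect of D on Y we should control for X, but comparing treated and control units with the same X is difficult because there is (almost) no overlap. We can therefore only study the part of the sample with $X \approx c$. This is analogous to the before-after comparison, where all individuals enter the treatment group at t = 1, so each individual's own history is the only available control, and that control is valid only when the time interval is short.
• Identification of RD. Consider the following structural model:
$Y_i = \tau D_i + error_i$
$= \tau D_i + \underbrace{E(error_i \mid X = X_i)}_{m(X_i)} + \underbrace{error_i - E(error_i \mid X = X_i)}_{U_i}$
$E(Y \mid X) = \tau E(D \mid X) + m(X)$
$\lim_{X \downarrow c} E(Y \mid X) = \tau \lim_{X \downarrow c} E(D \mid X) + \lim_{X \downarrow c} m(X)$
$\lim_{X \uparrow c} E(Y \mid X) = \tau \lim_{X \uparrow c} E(D \mid X) + \lim_{X \uparrow c} m(X)$
If
1. $E(D \mid X)$ has a discontinuity at X = c: $\lim_{X \downarrow c} E(D \mid X) \ne \lim_{X \uparrow c} E(D \mid X)$
2. $m(X)$ is continuous at X = c: $\lim_{X \downarrow c} m(X) = \lim_{X \uparrow c} m(X)$
then
$\tau^{RD} = \frac{\lim_{X \downarrow c} E(Y \mid X) - \lim_{X \uparrow c} E(Y \mid X)}{\lim_{X \downarrow c} E(D \mid X) - \lim_{X \uparrow c} E(D \mid X)}$  (5.1)
• How can this structural model be justified from the perspective of the potential outcomes framework? Take sharp RD as an example.
$\lim_{X \downarrow c} E(D \mid X) - \lim_{X \uparrow c} E(D \mid X) = 1$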
$\tau = \lim_{X \downarrow c} E(Y^1 \mid X) - \lim_{X \uparrow c} E(Y^0 \mid X)$
$= \lim_{X \downarrow c} E(Y^1 \mid X) - \lim_{X \downarrow c} E(Y^0 \mid X)$
(if $E(Y^0 \mid X)$ is continuous at X = c)
$= \underbrace{\lim_{X \downarrow c} E(Y^1 - Y^0 \mid X)}_{\text{treatment effect on the just treated}}$

$Y = (1 - D)Y^0 + DY^1$
$= (Y^1 - Y^0) D + Y^0$
$= \Big\{\underbrace{E(Y^1 - Y^0 \mid X)}_{\tau} + \underbrace{Y^1 - Y^0 - E(Y^1 - Y^0 \mid X)}_{\eta}\Big\} D + Y^0$
$E(Y \mid X) = \tau D + E(\eta D + Y^0 \mid X) = \tau D + E(Y^0 \mid X)$
(because $E(\eta D \mid X) = D\,E(\eta \mid X) = 0$.)

So continuity of $m(X)$ at X = c simply means continuity of $E(Y^0 \mid X)$ at X = c.


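• Equation (5.1) can also be written as
$\tau = \frac{E(Y \mid \delta=1) - E(Y \mid \delta=0)}{E(D \mid \delta=1) - E(D \mid \delta=0)}$
In fuzzy RD this is the Wald estimator of $\tau$, and $\delta$ automatically serves as an IV for D, satisfying the three conditions an IV requires:
– Exclusion: not testable, but should hold.
– Independence: holds automatically.
– Relevance: testable.
Everything said about IV estimation applies here.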


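• Three features of RD:
– Local randomization.
– In sharp RD, endogeneity of D is not a concern; in fuzzy RD, an IV is readily at hand.
– Variables that are continuous at X = c need not be controlled for.
• We can view RD as a special selection-on-observables model.
– The unconfoundedness condition is easily satisfied.
– The overlap condition is never satisfied, but it is replaced by a continuity condition (of potential outcomes conditional on X = c).
So instead of comparing treated and control units with the same X, we compare treated and control units sufficiently close to X = c on either side.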



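• Substantializing the identification assumptions
– Treatment is (at least partly) determined by the observable X, not the other way round: treatment cannot affect X. That is, X is a pre-treatment variable, or a variable that cannot be changed.
– Treatment is not random, and units may influence treatment assignment by influencing X, but that influence cannot be precise: units near the cutoff cannot manipulate which side of the cutoff they fall on. In other words, c is independent of X (exogenous), and $\delta$ is fully determined by X and c. Only then can units near the cutoff be regarded as exchangeable or otherwise identical.
– Apart from treatment, all other variables are smooth functions of X, i.e. nothing else makes units just on either side of the cutoff differ. Only then can the whole jump in the outcome be attributed to the jump in treatment. And this is testable!
• Strong internal validity, weak external validity? Look up and see: heaven spares no one!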



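5.2 Estimation Methods for Regression Discontinuity
We can never really use only the observations with X = c, and how far from c the estimation sample reaches involves a trade-off between bias and efficiency. Trying several estimation methods to confirm that the results are robust is therefore especially important.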

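5.2.1 OLS/2SLS
• OLS for sharp RD
$$Y = \tau D + m(X) + U$$
$$m(X) = \gamma_0 + \gamma_1^{+}\delta(X - c) + \gamma_1^{-}(1 - \delta)(X - c) + \gamma_2^{+}\delta(X - c)^2 + \gamma_2^{-}(1 - \delta)(X - c)^2$$
• 2SLS for fuzzy RD
$$Y = \tau D + m(X) + U$$
$$D = \alpha\delta + m(X) + \varepsilon$$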
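A minimal sketch of the sharp-RD OLS specification above, assuming hypothetical noise-free data and using only the Python standard library. The function names `ols` and `sharp_rd` and the `bandwidth` argument are introduced here for illustration; in applied work a regression package would normally do this step.

```python
def ols(rows, y):
    """Solve the normal equations X'X b = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def sharp_rd(x, y, c, bandwidth):
    """Coefficient on D in Y = tau*D + m(X) + U, where m(X) is quadratic
    with side-specific slopes and only |X - c| <= bandwidth is used."""
    rows, ys = [], []
    for xi, yi in zip(x, y):
        z = xi - c
        if abs(z) > bandwidth:
            continue
        delta = 1.0 if z >= 0 else 0.0  # sharp RD: D = delta
        rows.append([1.0, delta, delta * z, (1 - delta) * z,
                     delta * z ** 2, (1 - delta) * z ** 2])
        ys.append(yi)
    return ols(rows, ys)[1]


# Hypothetical data-generating process with a true jump of 2 at c = 0.
c = 0.0
x = [k / 20 for k in range(-20, 20)]


def true_y(xi):
    z = xi - c
    delta = 1.0 if z >= 0 else 0.0
    return (2.0 * delta + 1.0
            + 0.5 * delta * z - 0.3 * (1 - delta) * z
            + 0.2 * delta * z ** 2 + 0.1 * (1 - delta) * z ** 2)


y = [true_y(xi) for xi in x]
tau_full = sharp_rd(x, y, c, bandwidth=1.0)    # whole sample
tau_narrow = sharp_rd(x, y, c, bandwidth=0.5)  # narrower neighborhood of c
print(tau_full, tau_narrow)
```

Running the same specification over several bandwidths, as here, is one simple way to see whether the estimate depends on how far from the cutoff the sample reaches.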



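• Practical notes
– Try the full sample as well as subsamples within different neighborhoods of c.
– Try linear and quadratic forms of m(X), but go no higher! Higher-order polynomials give too much weight to observations far from the cutoff (Gelman and Imbens, 2014, WP).
– In fuzzy RD, use the same polynomial order for m(X) in the first and second stages.
– In fuzzy RD, choose the bandwidth based on the second stage, then use the same bandwidth in the first stage.
• Many IV papers can be read as RD/RK designs, and vice versa.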



Lecture 6 Difference-in-Differences

6.1 How Difference-in-Differences Works
• Identification in DID
$$\begin{aligned}
\tau(X) &= \left[E(Y|X, D=1, T=1) - E(Y|X, D=1, T=0)\right] \\
&\quad - \left[E(Y|X, D=0, T=1) - E(Y|X, D=0, T=0)\right] \\
&= \left[E(Y^1|X, D=1, T=1) - E(Y^0|X, D=1, T=0)\right] \\
&\quad - \left[E(Y^0|X, D=0, T=1) - E(Y^0|X, D=0, T=0)\right] \\
&= \left[E(Y^1|X, D=1, T=1) - E(Y^0|X, D=1, T=1)\right] \\
&\quad + \left[E(Y^0|X, D=1, T=1) - E(Y^0|X, D=1, T=0)\right] \\
&\quad - \left[E(Y^0|X, D=0, T=1) - E(Y^0|X, D=0, T=0)\right]
\end{aligned}$$
The identification condition is therefore
$$E(Y^0|X, D=1, T=1) - E(Y^0|X, D=1, T=0) = E(Y^0|X, D=0, T=1) - E(Y^0|X, D=0, T=0)$$



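• Does DID require random assignment?
The identification assumption can be written compactly as
$$E(\Delta Y^0|X, D=1) = E(\Delta Y^0|X, D=0)$$
That is, given X, $\Delta Y^0$ is mean independent of D: assignment is as good as random with respect to $\Delta Y^0$!
• Under this condition, the DID estimator identifies the ATT at the post-treatment period.
$$\tau^{DID} = \int \tau(X)\,dF(X|D=1, T=1) = E(Y^1 - Y^0 \mid D=1, T=1)$$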

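• Identification from the linear-model perspective. Consider the following linear model (omitting X for simplicity):
$$Y^0_{d0} = \beta_0 + \beta_1 D + U_0$$
$$Y^0_{d1} = \beta_0 + \beta_1 D + \beta_2 + U_1$$
$$Y^1_{d1} = \beta_0 + \beta_1 D + \beta_2 + \beta_3 + U_1$$
where $\beta_3$ is the treatment effect. Then
$$\begin{aligned}
Y_{dt} &= (1 - T)Y^0_{d0} + T\left[(1 - D)Y^0_{01} + DY^1_{11}\right] \\
&= \beta_0 + \beta_1 D + \beta_2 T + \beta_3 D \cdot T + (1 - T)U_0 + TU_1
\end{aligned}$$
The earlier identification condition can be written as
$$E(U_1|D=1, T=1) - E(U_0|D=1, T=0) = E(U_1|D=0, T=1) - E(U_0|D=0, T=0)$$
in which case
$$\begin{aligned}
\tau(X) &= \left[E(Y|X, D=1, T=1) - E(Y|X, D=1, T=0)\right] \\
&\quad - \left[E(Y|X, D=0, T=1) - E(Y|X, D=0, T=0)\right] \\
&= \beta_3
\end{aligned}$$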
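The role of the identification condition in the linear model can be checked on cell means. The sketch below uses made-up coefficient values and made-up conditional means of the errors; `did` is a helper name introduced here, not something from the slides.

```python
def did(cell_mean):
    """2x2 DID from a dict mapping (d, t) to the mean outcome in that cell."""
    return ((cell_mean[(1, 1)] - cell_mean[(1, 0)])
            - (cell_mean[(0, 1)] - cell_mean[(0, 0)]))


# Assumed coefficients of Y = b0 + b1*D + b2*T + b3*D*T + error.
b0, b1, b2, b3 = 1.0, 0.5, 0.3, 2.0


def outcome_means(u):
    """Cell means of Y given the cell means u[(d, t)] of the error term."""
    return {(d, t): b0 + b1 * d + b2 * t + b3 * d * t + u[(d, t)]
            for d in (0, 1) for t in (0, 1)}


# Error means differ in level across groups (0.4 vs 0.0), but their change
# over time is the same in both groups (+0.1), so the condition holds.
u_parallel = {(0, 0): 0.0, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.5}
did_ok = did(outcome_means(u_parallel))

# Here the treated group's error mean rises by 0.4 instead of 0.1, so the
# condition fails and DID picks up the extra 0.3.
u_diverging = {(0, 0): 0.0, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.8}
did_biased = did(outcome_means(u_diverging))

print(did_ok, did_biased)
```

The first case returns $\beta_3$ even though the two groups differ in levels; the second shows the bias equals the gap between the two groups' changes in the error mean.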
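• DID is just before-after comparison + matching.
• Every interaction-term model can be understood from the DID perspective.
• Threats to DID
– $D \cdot T$ may pick up some other treatment effect (cannot be resolved).
– A bad control group. Testing parallel trends is important!
– D is unstable, i.e. the treatment and control groups undergo compositional changes. Put differently, entry into the treatment group can be manipulated, e.g. through program application, policy thresholds, migration and so on. (We usually describe this as policy endogeneity.)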
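Example 9. The price of tea and missing women (Qian, 2008, QJE).
• Bertrand et al. (2004, QJE) on estimating standard errors: use clustered standard errors.
• DID estimates may not be robust to how the dependent variable is defined. (What kind of trend is robust?)
• The DID regression equation for panel data is
$$Y_{idt} = \beta_0 + \tau D_i \cdot T_t + u_i + \lambda_t + \varepsilon_{idt}$$
• The DID regression equation for repeated cross-sections is
$$Y_{idt} = \beta_0 + \tau D_i \cdot T_t + \gamma D_i + \lambda_t + \varepsilon_{idt}$$
• With more than two periods, a more flexible specification is (taking panel data as the example)
$$Y_{idt} = \beta_0 + \sum_{l=2}^{T} \tau_l (D_i \cdot T_{tl}) + u_i + \lambda_t + \varepsilon_{idt}$$
where
$$T_{tl} = \begin{cases} 1 & t = l \\ 0 & \text{otherwise} \end{cases}$$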



Outcome in levels:
         T = 0            T = 1            Diff
D = 0    1                2                1
D = 1    3                4                1
DID                                        0

Outcome in logs:
         T = 0            T = 1            Diff
D = 0    ln(1) = 0        ln(2) = 0.693    0.693
D = 1    ln(3) = 1.099    ln(4) = 1.386    0.287
DID                                        -0.406

Outcome exponentiated:
         T = 0            T = 1            Diff
D = 0    e^1 = 2.718      e^2 = 7.389      4.671
D = 1    e^3 = 20.086     e^4 = 54.598     34.512
DID                                        29.841
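The three tables can be reproduced in a few lines. This sketch simply recomputes the DID of the same four cell values after each transformation; the helper name `did` is introduced here for illustration.

```python
import math


def did(cell):
    """2x2 DID from a dict mapping (d, t) to a cell value."""
    return (cell[(1, 1)] - cell[(1, 0)]) - (cell[(0, 1)] - cell[(0, 0)])


# The four cell values from the tables, keyed by (D, T).
levels = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}

did_levels = did(levels)
did_logs = did({key: math.log(value) for key, value in levels.items()})
did_exp = did({key: math.exp(value) for key, value in levels.items()})

print(did_levels, did_logs, did_exp)
```

Parallel trends in levels (DID of zero) turn into a negative DID in logs and a large positive DID after exponentiating, which is why the choice of functional form for the outcome matters.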



What should we do in practice?
• Smart use of tables and graphs.
• Establish the similarity between treatment and control groups.
• Verify parallel trends.
• Try as many specifications as possible, e.g., controlling for time trends.
• Results need to show consistency across different choices of control groups and different regression specifications when there are noticeable differences between treatment and control groups.
• Robust inference with a small number of units.
• See if the intensity of treatment plays a role.
• Consider possibilities of heterogeneous treatment effects.
• Consider different (maybe conceptually close) grouping variables.
• Placebo tests: counterfactual treatment group and counterfactual treatment time.
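6.2 Difference-in-Differences in Nonlinear Models
• Consider the following model:
$$Y_{it} = 1(Y^*_{it} > 0)$$
$$Y^*_{it} = \beta_0 + \beta_1 D + \beta_2 T + \beta_3 D \cdot T + U_{it}, \quad U \sim N(0, \sigma_U^2)$$
• $\beta_3$ is not the treatment effect.
$$\left[\Phi\left(\frac{\beta_0}{\sigma_U} + \frac{\beta_1}{\sigma_U} + \frac{\beta_2}{\sigma_U} + \frac{\beta_3}{\sigma_U}\right) - \Phi\left(\frac{\beta_0}{\sigma_U} + \frac{\beta_1}{\sigma_U}\right)\right] - \left[\Phi\left(\frac{\beta_0}{\sigma_U} + \frac{\beta_2}{\sigma_U}\right) - \Phi\left(\frac{\beta_0}{\sigma_U}\right)\right] \ne \frac{\beta_3}{\sigma_U}$$
• Computing the treatment effect: within the (D = 1, T = 1) subsample, compute two predicted values (probabilities), one setting $D \cdot T = 1$ and the other setting $D \cdot T = 0$, then average the difference between the two predictions.
Implementation: margins, a very powerful and useful command that can replace mfx, dprobit, dlogit2 and the like.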
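A small numerical illustration of the point above, assuming the probit coefficients are known rather than estimated; the values of `b0` to `b3` and `sigma` are made up. Without covariates, the subsample average described in the last bullet reduces to a single difference of two probabilities.

```python
from math import erf, sqrt


def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Assumed (not estimated) parameters of the latent-index model.
b0, b1, b2, b3, sigma = -1.0, 0.5, 0.5, 0.5, 1.0


def prob(d, t, dt):
    """P(Y = 1) for given D, T and a freely chosen value of the D*T term."""
    return Phi((b0 + b1 * d + b2 * t + b3 * dt) / sigma)


# DID of the four cell probabilities, as in the displayed inequality.
did_of_probs = ((prob(1, 1, 1) - prob(1, 0, 0))
                - (prob(0, 1, 0) - prob(0, 0, 0)))

# Treatment effect in the (D = 1, T = 1) cell: switch D*T from 1 to 0
# while holding D = 1 and T = 1 fixed.
effect_on_treated = prob(1, 1, 1) - prob(1, 1, 0)

scaled_coefficient = b3 / sigma
print(did_of_probs, effect_on_treated, scaled_coefficient)
```

With these values the three quantities are all different, so neither the interaction coefficient nor the raw DID of probabilities should be read as the treatment effect.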



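6.3 Fuzzy DID
• In sharp DID there are no always-takers or never-takers. But it may happen that some units in both the treatment group and the control group actually receive treatment, with a higher treated share in the treatment group than in the control group.
• In that case, use the interaction term as an IV for actual treatment receipt:
$$\hat{\tau} = \frac{\text{DID of the outcome}}{\text{DID of the treatment}}$$
• Any paper that uses an interaction term as an IV for an endogenous regressor can be read this way. In fact, though, for this ratio estimator to identify a causal effect, some additional restrictions on treatment effect heterogeneity are needed (Clement de Chaisemartin and Xavier D'Haultfoeuille, 2015).
Example 10. Large-scale school construction and the returns to education (Duflo, 2001, AER).
Example 11. Oil price shocks, income and health spending (Acemoglu et al., 2013, REStat).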
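The ratio estimator above is a Wald estimator built from two DIDs. A minimal sketch with hypothetical cell means (all numbers below are invented for illustration):

```python
def did(cell):
    """2x2 DID from a dict mapping (group, period) to a cell mean."""
    return (cell[(1, 1)] - cell[(1, 0)]) - (cell[(0, 1)] - cell[(0, 0)])


# Hypothetical mean outcomes by (group, period).
mean_outcome = {(0, 0): 10.0, (0, 1): 11.0, (1, 0): 12.0, (1, 1): 14.5}

# Hypothetical share actually treated by (group, period): take-up rises
# in both groups, but by much more in the treatment group.
treated_share = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.7}

did_outcome = did(mean_outcome)     # reduced form
did_treatment = did(treated_share)  # first stage
tau_hat = did_outcome / did_treatment
print(did_outcome, did_treatment, tau_hat)
```

The reduced-form DID is scaled up by the first-stage DID because only part of the treatment group changes treatment status; whether the ratio has a causal interpretation depends on the heterogeneity restrictions mentioned above.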



6.4 DID Matching

• Construct the control group by matching first, then run DID.
• Difference first, then apply a matching estimator to the differenced outcome.



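6.5 Triple Differences

• Suppose a province implements a public policy targeting people aged 70 and over, and the outcome of interest is health status. One approach compares the before-after change in health between those aged 70 and over and those aged 60-70 in that province. The problem is that factors unrelated to the policy may also affect the health gap between the two age groups, for example a central government policy rolled out at the same time. Another approach compares the before-after change in health of those aged 70 and over in that province with that in other provinces. The problem here is that the time path of health for the over-70s in that province may differ systematically from the path in other provinces, because of differences in economic growth across provinces rather than because of the policy. A more robust approach is to compare the DID estimate for that province with the DID estimate for the other provinces.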



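• Identification in DDD
$$E(\Delta Y^0|X, G=1, D=1) - E(\Delta Y^0|X, G=0, D=1) = E(\Delta Y^0|X, G=1, D=0) - E(\Delta Y^0|X, G=0, D=0)$$
$$\tau^{DDD} = E(Y^1 - Y^0 \mid G=1, D=1, T=1)$$
• This identification assumption is weaker than that of DID.
How can we see this? Suppose there are three periods, with the third difference also taken along the time dimension.
$$E(\Delta Y^0|X, T=2, D=1) - E(\Delta Y^0|X, T=2, D=0) = E(\Delta Y^0|X, T=1, D=1) - E(\Delta Y^0|X, T=1, D=0)$$
DID requires the right-hand side to equal zero; DDD only requires the two sides to be equal.
• Regression equation in interaction form:
$$\begin{aligned}
Y_{igdt} &= \beta_0 + \gamma_1 G_i + \gamma_2 D_i + \gamma_3 T_t \\
&\quad + \theta_1 G_i \cdot D_i + \theta_2 G_i \cdot T_t + \theta_3 D_i \cdot T_t \\
&\quad + \tau G_i \cdot D_i \cdot T_t + \varepsilon_{igdt}
\end{aligned}$$
Example 12. Overtime pay and labor demand (Hamermesh and Trejo, 2000, REStat).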
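To see why the triple-interaction coefficient is the DDD, the sketch below builds the eight cell means from the interaction-form equation with made-up coefficients and differences them three times. Reading G as the province indicator and D as the age-group indicator follows the elderly-policy example and is an interpretation for this illustration.

```python
# Assumed coefficients of the interaction-form equation (all hypothetical).
b0 = 1.0
g1, g2, g3 = 0.2, 0.4, 0.6     # main effects of G, D, T
th1, th2, th3 = 0.3, 0.5, 0.7  # two-way interactions G*D, G*T, D*T
tau = 1.5                      # triple interaction G*D*T


def mean_y(g, d, t):
    """Cell mean of Y implied by the equation with a mean-zero error."""
    return (b0 + g1 * g + g2 * d + g3 * t
            + th1 * g * d + th2 * g * t + th3 * d * t
            + tau * g * d * t)


def did_within(g):
    """DID over (D, T) inside group g, e.g. inside one province."""
    return ((mean_y(g, 1, 1) - mean_y(g, 1, 0))
            - (mean_y(g, 0, 1) - mean_y(g, 0, 0)))


did_policy_province = did_within(1)  # equals th3 + tau
did_other_provinces = did_within(0)  # equals th3
ddd = did_policy_province - did_other_provinces
print(did_policy_province, did_other_provinces, ddd)
```

The DID within each province still contains the age-specific trend `th3`; only the third difference removes it and leaves `tau`.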
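6.6 Synthetic Control

• Key references: Abadie and Gardeazabal (2003, AER); Abadie et al. (2010, JASA; 2015, AJPS)
• The most important innovation in program evaluation in the past fifteen years (Athey and Imbens, 2016, WP).
• Basic idea: construct the counterfactual as a convex combination of control units
– $i = 0, 1, \ldots, N$, where $i = 0$ is the treated unit
– $t = 1, \ldots, T_0, T_0 + 1, \ldots, T$, where periods from $T_0 + 1$ onward are the post-treatment period
– $Y$ is the outcome
– $X$ is a vector of covariates, which may include pre-treatment means of unit characteristics as well as several pre-treatment values of $Y$ or their mean.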

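$$w(\Omega) = \arg\min_{w} \Big(X_0 - \sum_{i=1}^{N} w_i X_i\Big)' \,\Omega\, \Big(X_0 - \sum_{i=1}^{N} w_i X_i\Big) \quad \text{s.t. } w_i \ge 0 \ \forall i, \ \sum_i w_i = 1$$
$$\Omega^* = \arg\min_{\Omega} \sum_{t=1}^{T_0} \Big(Y_{0t} - \sum_{i=1}^{N} w_i(\Omega)\, Y_{it}\Big)^2$$
Note: if $X$ contains all pre-treatment $Y$, this method is equivalent to
$$w^* = \arg\min_{w} \sum_{t=1}^{T_0} \Big(Y_{0t} - \sum_{i=1}^{N} w_i Y_{it}\Big)^2$$
Treatment effect on the treated:
$$\hat{\tau}_t^{SC} = Y_{0t} - \sum_{i=1}^{N} w_i(\Omega^*)\, Y_{it}, \quad \forall t \ge T_0 + 1$$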
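• Non-standard statistical inference: placebo tests
– Run the same estimation for each control unit and compute a p-value from the empirical distribution of the placebo effects.
– Run the same estimation for the pre-treatment period.
• Implementation: synth

Example 13. California's tobacco control program (Abadie et al., 2010, JASA).

Example 14. The economic consequences of German reunification (Abadie et al., 2015, AJPS).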

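• A few comments:
– Mainly suited to treatment at an aggregate level (where control units with similar characteristics are hard to find), but SUTVA is hard to satisfy.
– The weight on each control unit is known, so the method can be combined with qualitative research.
– Applies only to the case of a single treated unit, and requires $T_0$ to be fairly large and $N$ to be neither too large nor too small.
– Not applicable when the treated unit and the control units differ greatly.
– Compared with linear regression: linear regression is also a weighting scheme, but it places no restrictions on the weights, so extrapolation bias may arise. Compare with Hsiao et al. (2011, JAE).
– Compared with DID: it relaxes the identification assumption by allowing unit effects to vary over time ($u_i \lambda_t$), but when the treated unit and the control units differ greatly, the bias may be larger than under DID.
– Compared with (DID-) matching: it offers a way around the dimensionality problem, but is less robust to general model specifications.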
