HD - Machine Learnind and Econometrics
HD - Machine Learnind and Econometrics
HD - Machine Learnind and Econometrics
Draft. Christophe Gaillac, Toulouse School of Economics and CREST, ENSAE Paris,
[email protected]. Jérémy L’Hour, CREST, ENSAE Paris and INSEE, [email protected].
We thank Pierre Alquier, Xavier D’Haultfœuille and Anna Simoni for their help and comments. Com-
ments welcome.
Summary
These are the lecture notes for the course Machine Learning for Economet-
rics (High-Dimensional Econometrics, previously) taught in the third year
of ENSAE Paris and the second year of the Master in Economics of Institut
Polytechnique de Paris. They cover recent applications of high-dimensional
statistics and machine learning to econometrics, including variable selection,
inference with high-dimensional nuisance parameters in different settings, het-
erogeneity, networks and analysis of text data. The focus will be on policy
evaluation problems. Recent advances in causal inference such as the synthetic
controls method will be reviewed.
The goal of the course is to give insights about these new methods, their
implementation, their benefits and their limitations. The course is a bridge
between econometrics and machine learning, and will mostly benefit students
who are highly curious about recent advances in econometrics, whether they
want to study the theory or use them in applied work. Students are expected
to be familiar with Econometrics 2 (2A) and Statistical Learning (3A).
1
Contents
1
4.5.1 Logit Demand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5.2 Instrument Selection to estimate returns to schooling . . . . . . . 58
2
7.3.1 Application to heterogeneity in the effect of subsidized training on
trainee earnings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.4 Estimating Features of Heterogeneous Effects with Selection on Observables129
7.4.1 Estimation of Key Features of the CATE . . . . . . . . . . . . . . 130
7.4.2 Inference About Key Features of the CATE . . . . . . . . . . . . 133
7.4.3 Algorithm: Inference about key features of the CATE . . . . . . . 136
7.4.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.4.5 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
9 Appendix 155
A Exam 2018 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
B Exam 2019 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
C Exam 2020 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
3
4
Chapter 1
A first quantity of interest is the average treatment effect τ0 := E [Yi (1) − Yi (0)], which is
5
the average impact of the intervention among the population. When the treatment assign-
ment is random conditional on some observables (i.e. the assumption that E[εi |Di , Xi ] =
0 in the model below) and under the assumption that there exist only a limited number
of significant covariates (sparsity), Chapters 2 give tools to handle the estimation of τ0
in model
0
Yi = Di τ0 + Xi β0 + εi , with E[εi ] = 0 and E[εi |Di , Xi ] = 0,
where Xi is a vector of p exogenous control variables, p being possibly larger than the
number of observations. The large dimension of Xi , in combination with the sparsity
assumption, opens the door to use selection methods such as the Lasso, that this chapter
reviews in details. Chapter 3 uses the intuition explained in the preceding chapter but
presents a more general framework and introduces sample-splitting, a crucial device when
using non-standard tools such as ML estimators.
Chapter 4 then explains how to adapt these tools when the econometrician relaxes the
exogeneity assumption, i.e. now assumes E[εi |Di , Xi ] 6= 0, but possesses a (possibly large)
number of instrumental variables Zi , all satisfying the exogeneity assumption E[εi |Zi ] = 0.
Going further, Chapter 5 develops the theoretical refinements of the tools presented so
far, with the aim of using weaker assumptions. It specifically deals with non-gaussian
errors, sample-splitting and panel data.
However, the average treatment effect (τ0 ) does not allow to describe heterogeneity in
the reactions to the intervention – some people might benefit a lot from the intervention,
while some other may not respond at all or see their outcome worsening. Chapter 7 thus
deals with a more complex parameter of interest, which is the average treatment effect
conditional on some (observed) covariates τ : x 7→ E [Yi (1) − Yi (0)|Xi = x]. Causal
random forests are tools adapted from machine learning that allow to make inference
about the function τ (·), i.e. to test for significant effect of the treatment conditionally on
the covariates taking the value x. However, theory requires strong hypotheses to obtain
such tests. The end of Chapter 7 lowers our expectations to only perform inference about
features of the conditional average treatment effect. This allows to use ML methods, with
few hypotheses, to test for the heterogeneity of the treatment or to obtain information
about its shape.
Thus, the researcher has to choose the relevant methods among those presented in
6
this course according to the parameter of interest or the available data. This choice is
summarized in Figure 1.1.
Figure 1.1: A brief road-map to use some methods presented through the chapters
7
1.2 Resources and Reading List
These notes should be self-contained. Due to the nature of the course, no textbook
currently covers the same material. We provide general references in each chapter. A
GitHub repository for the class is available at github.com/jlhourENSAE/hdmetrics and
contains (mostly) R code. That being said, we list below a limited number of general
references. We encourage you to read them before the class.
Introduction
Athey, S. and Imbens, G. W. (2019). Machine learning methods that economists should
know about. Annual Review of Economics, 11
Mullainathan, S. and Spiess, J. (2017). Machine learning: An applied econometric ap-
proach. Journal of Economic Perspectives, 31(2):87–106
Belloni, A., Chernozhukov, V., and Hansen, C. (2014). High-Dimensional Methods and In-
ference on Structural and Treatment Effects. Journal of Economic Perspectives, 28(2):29–
50
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W.,
and Robins, J. (2018). Double/debiased machine learning for treatment and structural
parameters. The Econometrics Journal, 21(1):C1–C68
Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012a). Sparse models and
methods for optimal instruments with an application to eminent domain. Econometrica,
80(6):2369–2429
Chernozhukov, V., Hansen, C., and Spindler, M. (2015). Valid post-selection and post-
regularization inference: An elementary, general approach. Annu. Rev. Econ., 7(1):649–
688
8
The Synthetic Control Method
Abadie, A. (2019). Using synthetic controls: Feasibility, data requirements, and method-
ological aspects. Working Paper
Abadie, A., Diamond, A., and Hainmueller, J. (2010). Synthetic control methods for
comparative case studies: Estimating the effect of california’s tobacco control program.
Journal of the American Statistical Association, 105(490):493–505
Chapter 5 in Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, So-
cial, and Biomedical Sciences. Number 9780521885881 in Cambridge Books. Cambridge
University Press
Network Data
9
10
Chapter 2
Model selection and parsimony among explanatory variables are traditional scientific
problems that have a particular echo in statistics and econometrics. They have received
growing attention over the past two decades, as high-dimensional datasets have become
increasingly available to statisticians in various fields. But even with a small dataset,
high-dimensional problems can occur, for example when doing series estimation of a non-
parametric model. In practice, applied econometricians often select variables by trial
and error, guided by their intuition and report results based on the assumption that the
selected model is the true. These results are often backed by further sensitivity analysis
and robustness checks. However, the variable selection step of empirical work is rarely
fully acknowledged although it is not innocuous. Leamer (1983) was one of the first
econometric papers to address this problem. For a modern presentation, see Leeb and
Pötscher (2005) and, in the context of policy evaluation, Belloni et al. (2014).
Section 2.1 serves as a larger introduction and describes the problem posed by post-
selection inference. Sections 2.2 and 2.3 introduce the Lasso estimator as it is often
used as a selection device. Section 2.4 builds on the intuition of section 2.1 to deal with
the regularization bias. Section 2.5 exposes the key theoretical concepts to deal with
post-selection inference and Section 2.6 considers its application in simple cases.
Notations a . b means that a ≤ cb for some constant c > 0 that does not depend on the
sample size n. ϕ and Φ respectively denote the pdf and the cdf of the standard Gaussian
distribution. For a vector δ ∈ Rp , kδk0 := Card {1 ≤ j ≤ p, δj 6= 0}, kδk∞ := max |δj |.
j=1,...,p
11
The m-sparse-norm of matrix Q is defined as:
p
δ T Qδ
kQksp(m) := sup .
kδk0 ≤m kδk2
kδk2 >0
CLT = Central Limit Theorem; LLN = Law of Large Numbers; CMT = Continuous
Mapping Theorem.
We begin by analyzing the two-step inference method described in the introduction (se-
lecting the model first, then reporting results from that model as if it were the truth).
This Section is based on the work of Leeb and Pötscher (2005).
Assumption 2.1 (Possibly Sparse Gaussian Linear Model). Consider the iid sequence
of random variables (Yi , Xi )i=1,...,n such that:
Yi = Xi,1 τ0 + Xi,2 β0 + εi ,
The most sparse true model is coded by M0 , a random variable taking value R (“re-
stricted”) if β0 = 0 and U (“unrestricted”) otherwise.
Everything in Section 2.1 will be conditional on the covariates (Xi )1≤i≤n but we leave
that dependency hidden. In particular, conditional on the covariates, the unrestricted
estimator is normally distributed:
√ β(U
2
b ) − β0 0 σβ ρσβ στ
n ∼N , ,
τb(U ) − τ0 0 ρσβ στ στ2
12
Consistent Model Selection Procedure. The econometrician is interested in per-
forming inference over the parameter τ0 and wonders whether he should include Xi,2 in
the regression. At the end, he reports the result from model M
c he has selected in a first
M0 .
Proof of Lemma 2.1 Considering the selection rule (2.2) and the Gaussian distributional
assumption in model (2.1):
√
P M = R = P | nβ(U )/σβ | ≤ cn
c b
√ √ b √
= P −cn − nβ0 /σβ ≤ n(β(U ) − β0 )/σβ ≤ cn − nβ0 /σβ
√ √
= Φ cn − nβ0 /σβ − Φ −cn − nβ0 /σβ
√ √
= Φ nβ0 /σβ + cn − Φ nβ0 /σβ − cn
√
= ∆ nβ0 /σβ , cn ,
with ∆(a, b) := Φ(a + b) − Φ(a − b) and the fourth equality uses the symmetry of the
Gaussian distribution, Φ(−x) = 1 − Φ(x). From this equation and the restrictions on
cn , the probability that M
c = R tends to one if β0 = 0 (M0 = R) and to zero otherwise
(M0 = U ).
13
Remark 2.1. Since the probability of selecting the true model tends to one with the
sample size, Lemma 2.1 might induce you to think that a consistent model selection
procedure allows inference to be performed “as usual”, i.e. that the model selection step
can be overlooked. However, for any given sample size n, the probability of selecting
the true model can be very small if β0 is close to zero without exactly being zero. For
√ √
example, assume that β0 = δσβ cn / n with |δ| < 1 then: nβ0 /σβ = δcn and the
probability of selecting the unrestricted model from the proof of Lemma 2.1 is equal to
1 − Φ(cn (1 + δ)) + Φ((δ − 1)cn ), and tends to zero although the true model is U because
β0 6= 0! This quick analysis tells us that the model selection procedure is blind to small
√
deviations from the restricted model (β0 = 0) that are of the order of cn / n. Statisticians
say that in that case, the model selection procedure is not uniformly consistent with
respect to β0 . For the econometrician, it means that the classical inference procedure,
i.e. the procedure that assumes that the selected model is the true, or that is conditional
on the selected model being the true, and uses the asymptotic normality to perform tests
and construct confidence intervals may require very large sample sizes to be accurate.
Furthermore, this required sample size depends on the unknown parameter β0 (see the
numerical evidence in Leeb and Pötscher (2005)).
τ̃ := τb(M
c) = τb(R)1 c + τb(U )1 c .
M =R M =U (2.2)
Bearing in mind the caveat issued in the previous paragraph, is a consistent model se-
lection procedure sufficient to waive concerns over the post-selection approach? Indeed,
using Lemma 2.1, it is tempting to think that, τ̃ will be asymptotically distributed as a
Gaussian and that standard inference also applies. However, we will show that the finite
sample distribution of the post-selection estimator can be very different from a standard
Gaussian distribution. The result displayed here can be found in Leeb (2006). The next
Lemma will be useful when computing the distribution of the post-selection estimator.
τb(R) ⊥ β(U
b ).
14
Proof of Lemma 2.2 We use the following matrix notations: Xj = (Xi,j )1≤i≤n for any
j = 1, 2, y = (Yi )1≤i≤n and X = (Xi0 )1≤i≤n . Notice that τb(R) = [X1 0 X1 ]−1 X1 0 y, and
define MX1 := In − X1 [X1 0 X1 ]−1 X1 0 , the projector on the orthogonal complement of the
column space of X1 . Form the matrix XO := [X1 : MX1 X2 ] and define βbO the coefficient
0 −1 O0
obtained from the regression of y on XO , that is: βbO = XO XO X y . We can show
that:
τb(R)
βbO = −1 ,
[X2 0 MX1 X2 ] X2 0 MX1 y
−1
and that τb(R) and [X2 0 MX1 X2 ] X2 0 MX1 y are uncorrelated, using Cochran’s theo-
rem. Using Frisch–Waugh–Lovell Theorem (Theorem 2.3 at the end of this chapter),
−1
[X2 0 MX1 X2 ] X2 0 MX1 y = β(U
b ), which completes the proof.
Lemma 2.3 (Density of the Post-Selection estimator, from Leeb (2006)). The finite-
√
sample (conditional on (Xi )i=1,...,n ) density of n(τ̃ − τ0 ) is given by:
√ !
√
β0 1 x ρ nβ0
f√n(τ̃ −τ0 ) (x) = ∆ n , cn p ϕ p +p
σβ στ 1 − ρ 2 στ 1 − ρ 2 1 − ρ σβ
2
" √ !#
nβ0 /σβ + ρx/στ cn 1 x
+ 1−∆ p ,p ϕ ,
1 − ρ2 1 − ρ2 στ στ
We consider the first term in the sum. From Lemma 2.2, for any real number x, we have:
√ √
P x ≤ n(b τ (R) − τ0 ) ≤ x + dx | M
c = R = P x ≤ n(b τ (R) − τ0 ) ≤ x + dx .
So as dx → 0, the first part in the sum (times 1/dx) is the probability of selecting
√
model R times the density of n(b τ (R) − τ0 ). The probability of selecting model R
√
is P Mc = R = ∆ ( nβ0 /σβ , cn ). Before continuing, notice the relation between the
moments of Xi and those of the OLS estimators in model U , from Assumption 2.1:
" n #
σ2 σβ2
1X 0 −ρσβ στ
X i Xi = 2 2 .
n i=1 σβ στ (1 − ρ2 ) −ρσβ στ στ2
15
√
In order to compute the density of τ (R) − τ0 ), we use the usual OLS formula and
n(b
substitute Yi by the model of Assumption 2.1:
n
!
√ √ στ √ σ 2 1X
τ (R) − τ0 ) = − nβ0 ρ + n τ2 (1 − ρ2 )
n(b Xi,1 εi .
σβ σ n i=1
Now, we focus on the second part in the sum and reverse the events:
√ √
P x ≤ n(b
τ (U ) − τ0 ) ≤ x + dx | M = U P M = U = P M = U | x ≤ n(b
c c c τ (U ) − τ0 ) ≤ x + dx
√
× P x ≤ n(b
τ (U ) − τ0 ) ≤ x + dx .
Recall that √ 2
n( b ) − β0 )
β(U 0 σβ ρσβ στ
√ ∼N , .
n(bτ (U ) − τ0 ) 0 ρσβ στ στ2
so, we have directly
√
P (x ≤ τ (U ) − τ0 ) ≤ x + dx)
n(b 1 x
→ ϕ ,
dx στ στ
On the other:
! !
√ β(U √ √ β0
b )
1 x
P n < −cn | n(b
τ (U ) − τ0 ) = x =1−Φ p n + ρ + cn ,
σβ 1 − ρ2 σβ στ
16
√
Figure 2.1: Finite-sample density of n(τ̃ − τ0 ), ρ = .4
0.4
.5
.3
.2
0.3
.1
0.2
0.1
0.0
−3 −2 −1 0 1 2 3
Note: Density
√ of the post-selection estimator τ̃ for different values of β0 /σβ , see legend. Other parameters
are: cn = log n, n = 100, στ = 1 and ρ = .4. See Lemma 2.3 for the mathematical formula.
√
Figure 2.2: Finite-sample density of n(τ̃ − τ0 ), ρ = .7
0.5
.5
.3
0.4
.2
.1
0.3
0.2
0.1
0.0
−3 −2 −1 0 1 2 3
Note: See Figure 2.1. ρ = .7. This chart is similar to the one in Leeb and Pötscher (2005).
17
Remark 2.2. Lemma 2.3 gives the finite-sample density of the post-selection estimator.
There is be an omitted variable bias that the post-selection estimator cannot overcome
√
unless β0 = 0 or ρ = 0. Indeed, when ρ = 0, n(τ̃ − τ0 ) ∼ N (0, στ2 ); while when β0 = 0,
√
n(τ̃ − τ0 ) ∼ N (0, στ2 /(1 − ρ2 )) (approximately), because ∆(0, cn ) ≥ 1 − exp(−c2n /2) -
the probability of selecting the restricted model - is large. Figures 2.1 and 2.2 plot the
finite-sample density of the post-selection estimator for several values of β0 /σβ in the
cases ρ = .4 and ρ = .7, respectively. Figure 2.1 shows a mild albeit significant distortion
from a standard Gaussian distribution. The post-selection estimator clearly exhibits a
bias. As the correlation between the two covariates intensifies, Figure 2.2, the density of
the post-selection estimator becomes highly non-Gaussian, even exhibiting two modes.
See Leeb and Pötscher (2005) for further discussion. Following this analysis, it is clear
that inference (i.e. tests and confidence intervals) based on standard Gaussian quantiles
will in general give a picture very different from true distribution depicted in Figure 2.2.
Assumption 2.3 (Sparse Gaussian Linear Model). Let the iid sequence of random vari-
ables (Yi , Xi )i=1,...,n . The dimension of the vector Xi is denoted p. p is assumed to be
larger than 1 and allowed to be much larger than n. We assume the following linear
relation:
Yi = Xi0 β0 + εi ,
with εi ∼ N (0, σ 2 ), εi ⊥ Xi , kβ0 k0 ≤ s < p. The covariates are bounded almost surely
max kXi k∞ ≤ M .
i=1,...,n
Remark 2.3 (Key Concept: Sparsity). One particular assumption in the model dis-
played in Assumption 2.3 deserves special attention. The sparsity assumption, kβ0 k0 =
Pp
j=1 1 {βj 6= 0} ≤ s, means that we assume at most s components of β0 are different
18
from zero. The notion of sparsity, i.e. the assumption that although we consider many
variables, only a small number of elements in the vector of parameters is different from
zero, is an inherent element of the high-dimensional literature. It amounts to recasting
the high-dimensional problem in a variable selection framework where a good estimator
should be able to correctly select the relevant variables or to estimate the quantities of in-
√
terest consistently at a rate close to n, only paying a price depending on s and p. Before
continuing further, let’s introduce the sparsity set, i.e. the set of indices that correspond
to non-zero elements of β0 : S0 := {j ∈ {1, ..., p}, β0j 6= 0}. A less restrictive concept has
been introduced by Belloni et al. (2012b). Called approximate sparsity, it assumes that
the high-dimensional parameter can be decomposed into a sparse component, which has
a lot of zero entries and some large entries, and a small component for which all entries
are small and decaying towards zero without never exactly being zero, see Assumption
4.3 in Chapter 4. Although more general, this assumption complicates the proof.
Pn
Denote by L(β) = n−1 i=1 (Yi − Xi0 β)2 the mean-square loss function. The Lasso
estimator is defined as:
The Lasso minimizes the sum of the empirical mean-square loss and a penalty or reg-
ularization term λn kβk1 . Notice that the solution to (2.3) is not necessarily unique.
Because the `1 -norm has a kink at zero, the resulting solution of the program, β,
b will
be sparse. λn sets the trade-off between fit and sparsity. It has been shown that in a
sparsity context (see the remark above), Lasso-type estimators can provide a good ap-
proximation of the relevant quantities that are subject to a sparse structure, be it finite
or infinite dimensional parameters. In presence of a high-dimensional β0 for which the
sparsity assumption is not assumed to hold, using the Lasso estimator is not a good idea.
If instead β0 is supposed to be dense (i.e. many small entries but no true zeros), using a
`2 -regularization (the Ridge estimator) performs better. For more on when to use which
type of regularization, see Abadie and Kasy (2017).
The Lasso and related techniques to deal with high-dimension have spurred a vast lit-
erature since the seminal paper of Tibshirani (1994). Good Statistics textbook references
are Bühlmann and van de Geer (2011) and Giraud (2014). Other key papers are Candes
19
and Tao (2007); van de Geer (2008); Bickel et al. (2009).
To show consistency of the Lasso estimator, another ingredient is needed: the re-
b := n−1 Pn Xi Xi0 , the empirical Gram ma-
stricted eigenvalue condition. Denote by Σ i=1
trix. In a high-dimensional settings, we are specifically worried about cases where the
number of covariates is larger than the sample size (p > n), because then Σ
b is degenerate
δ 0 Σδ
b
minp = 0.
δ∈R kδS k22
δ6=0
In this case, OLS cannot be computed. This is why the restricted eigenvalue is needed:
all square sub-matrices contained in the empirical Gram matrix of dimension no larger
than s should have a positive minimal eigenvalue. Let’s make it clearer. For a non-empty
subset S ⊂ {1, ..., p} and α > 0, we define the set:
δ 0 Σδ
b
κ2α (Σ)
b := min min > 0.
S⊂{1,...,p}δ∈C[S,α] kδS k22
|S|≤s
This condition appears and is discussed in particular in Bickel et al. (2009); Rudelson
and Zhou (2013). We make this assumption directly on the empirical Gram matrix,
instead of making it on the population Gram matrix E(XX 0 ), in order to simplify the
proof. For a probabilistic link between population and empirical Gram matrices under
fairly weak conditions, see e.g. Oliveira (2013). Conditions that fulfill the same purpose as
the restricted eigenvalue conditions have been used before, most notably the compatibility
condition, coherence condition and the restricted isometry condition. See also (Bühlmann
and van de Geer, 2011, p. 106)
Theorem 2.1 (`1 consistency of the Lasso). Under Assumption 2.3 and a restricted
eigenvalue condition 2.4 with C[S0 , 3], the Lasso estimator defined in 2.3 with tuning
20
p
parameter λn = (4σM/α) 2 log(2p)/n, where α ∈ (0, 1), verifies with probability greater
than 1 − α:
r
42 σM 2s2 log(2p)
kβb − β0 k1 ≤ . (2.5)
ακ23 (Σ)
b n
Remark 2.4. The main take-away from Theorem 2.1 is that the Lasso converges in `1
p
to the true value β0 at rate s log(p)/n. This is to be compared to the OLS rate under
√
full knowledge of the sparsity pattern which is s/ n. The conclusion is that there is a
p
price to pay for ignorance which manifests itself by this log(p) term. This rate is called
fast compared to a slower rate that exists without Assumption 2.4.
By adding a modified version of Assumption 2.4, an `2 rate can be obtain: kβ0 − βk
b 2.
p
s log(p)/n. Prediction rates (i.e. kY − X 0 βk
b 2 ) have also been largely dealt with in the
literature, see e.g. Bickel et al. (2009), but are less of interest in this course.
Moreover, notice that the Lasso is NOT asymptotically Gaussian: the event βbj = 0
has a non-zero probability of occurring. Consequently, it is not possible to construct the
usual confidence sets on β0 (i.e. Gaussian, asymptotic-based).
b + λn kβk
L(β) b 1 ≤ L(β0 ) + λn kβ0 k1 . (2.6)
Step 1: Difference in Square Losses. Decompose the difference between the two
loss functions in two elements and replace Yi :
n
1 X 2
2
L(β) − L(β0 ) =
b Yi − Xi0 βb − (Yi − Xi0 β0 )
n i=1
n
1 X 0 2
= Xi (β0 − β) + εi − ε2i
b
n i=1
" n # " n #
1 X 1 X
= (βb − β0 )0 Xi Xi0 (βb − β0 ) + 2(βb − β0 )0 εi X i .
n i=1 n i=1
| {z }
=Σ
b
21
Step 2: Concentration Inequality. It is time to apply the concentration inequality
of Lemma 2.6 to k n1 ni=1 εi Xi k∞ . Using Markov’s inequality:
P
n ! 4E max 1 Pn ε X X , ..., X
1 X λ j=1,...,p n i=1 i ij 1 n
n
P max εi Xij ≥ X1 , ..., Xn ≤
j=1,...,p n 4 λn
i=1
p
4σM 2 log(2p)
≤ √
n λn
≤ α,
q
4σM 2 log(2p)
since λn = α n
. Since the right-hand side is non-probabilistic, we obtain:
!
1 X n λ
n
P max εi Xij ≥ ≤ α.
j=1,...,p n 4
i=1
max n2 ni=1 εi Xij <
P λn
On the event 2
that occurs with probability greater than
j=1,...,p
1 − α:
λ
0b b n
(β − β0 ) Σ(β − β0 ) ≤ λn kβ0 k1 − kβk1 + kβb − β0 k1 .
b b (2.7)
2
Step 3: Decompose the `1 -norms. Now, we will use βS0 to denote the vector β
of dimension p for which elements that are not in S0 are replaced by 0. Notice that
β = βS0 + βS0C . By the reverse triangular inequality:
Also, notice that β0,S0C = 0, so : kβ0,S0C k1 − kβbS0C k1 = −kβ0,S0C − βbS0C k1 . So from (2.7), we
obtain:
b βb − β0 ) ≤ 3λn kβ0,S0 − βbS0 k1 − λn kβ0,S C − βbS C k1 .
(βb − β0 )0 Σ( (2.8)
2 2 0 0
Step 4: Cone Condition and Restricted Eigenvalues. It means that we have the
following cone condition:
so βb − β0 ∈ C[S0 , 3]. Using Assumption 2.4 on the restricted eigenvalue of the empirical
√
Gram matrix and Cauchy-Schwarz inequality kδS0 k1 ≤ skδS0 k2 we have:
2(βb − β0 )0 Σ(
b βb − β0 ) + λn kβ0 − βk
b 1 ≤ 4λn kβ0,S − βbS k1
0 0
√ q
s
≤ 4λn (βb − β0 )0 Σ(
b βb − β0 )
κ3 (Σ)
b
s
≤ 4λ2n + (βb − β0 )0 Σ(
b βb − β0 ),
2 b
κ3 (Σ)
where the final inequality uses 4uv ≤ u2 + 4v 2 . We finally obtain:
s
(βb − β0 )0 Σ( b 1 ≤ 4λ2
b βb − β0 ) + λn kβ0 − βk
n .
κ23 (Σ)
b
All in all, with probability greater than 1 − α:
r
2
b 1 ≤ 4 σM 2s2 log(2p)
kβ0 − βk .
ακ23 (Σ)
b n
Remark 2.5 (The Post-Lasso). Before moving on, we should mention the Post-Lasso, a
close cousin of the Lasso that has been studied in particular in the chapter by Belloni and
Chernozhukov (2011) in the book by Alquier et al. (2011) and Belloni and Chernozhukov
(2013). It is a two-step estimator in which a second step is added to the Lasso procedure in
order to remove the bias that comes from shrinkage. That second step consists in running
an OLS regression using only the covariates associated with a non-zero coefficient in the
Lasso step. More precisely, the procedure is:
1. Run the Lasso regression as in Equation (2.3), denote Sb the set of non-zero Lasso
coefficients.
2. Run an OLS regression including only the covariates corresponding to the non-zero
coefficients from above:
βbP L = arg min L(β)
β∈Rp ,βSbC =0
The performance is comparable to that of the Lasso in theory, albeit the bias appears to
be smaller in applications since undue shrinkage on non-zero coefficients is removed. To
stress the lessons of this chapter, the Post-Lasso estimator is still NOT asymptotically
Normal: it is subject to the post-selection inference problem highlighted by Leeb and
Potscher in Section 2.1!
23
2.4 The Regularization Bias
In this section, we discuss the regularization bias which is nothing more than an omitted
variable bias arising from the same mechanism described in section 2.1.
estimator, it is easy to trust the Lasso too much. Indeed, one may think that the Lasso
can function as both a device to recover the support of β0 and to estimate that same
quantity precisely or at a fast rate. However, Yang (2005) shows that for any model
selection procedure to be consistent, it must behave sub-optimally for estimating the
regression function and vice-versa. Indeed, the condition on the penalty parameter λn in
p
Zhao and Yu (2006) is quite different from our requirement of λn = 4σM 2 log(2p)/n/α
in Theorem 2.1. The moral of the story is that even when using the Lasso estimator,
selecting relevant covariates and estimating well their effect are two objectives that cannot
be pursued at the same time. Furthermore, the warnings issued in Section 2.1 on post-
selection inference still apply for the Lasso.
In presence of a high-dimensional parameter to estimate, the econometric literature
has chosen to pursue high-quality estimation of β0 . Indeed, because most economic
applications are concerned with a well-posed causal question of the type “What is the
effect of A on B?”, the identity of the relevant regressors matters less than estimating well
some nuisance parameters: think about estimating a control function or the first-stage of
an IV regression for example.
But even when focusing only on precise estimation of β0 , the Lasso is not the only
thing you need. Indeed, high-dimensional statistics poses a different challenge because p
is assumed to grow with the sample size. Indeed, assume p is constant, then as n → ∞,
the problem boils down to a small dimensional one where n >> p. In a high-dimensional
setting, there is an asymptotic bias or a regularization bias.
24
Assumption 2.5 (Linear Model with Controls). Consider the iid sequence of random
variables (Yi , Di , Xi )i=1,...,n such that:
Yi = Di τ0 + Xi0 β0 + εi ,
with εi such that Eεi = 0, Eε2i = σ 2 < ∞ and εi ⊥ (Di , Xi ). Di ∈ {0, 1}. Xi is of
dimension p > 1. p is allowed to be much larger than n and to grow with n. Denote by
µd := E(X|D = d) for d ∈ {0, 1} and π0 := EDi .
Denote βb the corresponding estimator for β0 obtained in step 2. Notice that for j ∈
b := n−1 ni=1 Di .
{1, ..., p}, if βbjL = 0 then βbj = 0. Also denote by π
P
Pn
1
Di (Yi − Xi0 β)
b 1 X
τb := n i=1
= (Yi − Xi0 β),
b
πb n1 D =1
i
Pn
Where nd := i=1 1 {Di = d}, d ∈ {0, 1} is a random quantity.
√
Lemma 2.4 (Regularization Bias of τb). Under Assumption 2.5, if µ1 6= 0: τ − τ0 | →
n|b
∞.
From the above equation, one would hope that because βb is `1 consistent, the first term
converges to zero in probability and we would only be left with the second term. This is
25
p
b −→ π0 and Slutsky’s
not the case. By the CLT - using also the LLN and CMT to prove π
theorem:
" n
#
σ2
1 X d
b−1
π √ Di εi −→ N 0, .
n i=1 π0
Now in general, we can show:
" n #0
1X √ p
Di Xi n β0 − βb ≈ s log p → ∞,
n
i=1
Remark 2.6 (The Regularization Bias is an Omitted Variable Bias). Lemma 2.4 is
disappointing: in the high-dimensional case, the naive plug-in strategy does not work
well. This is because of two ingredients: µ1 6= 0 and p → ∞. If we were in a small-
√
dimensional case and had for example an OLS estimator for β0 , n β0 − β would
b
be asymptotically normal and the problem would disappear. Notice that in this small-
dimensional case, there is no selection step. What is the problem with the approach
proposed in this section? In a nutshell: it is a single-equation procedure. Recall that
the selection step is using only the outcome equation, i.e. the elements of X tend to be
selected if they have a large entry in coefficient β0 . As a consequence, that procedure will
tend to miss variables that have a moderate effect on Y but a strong effect on D, thereby
creating an omitted variable bias in the estimator of τ0 . As Belloni et al. (2014) put it:
“Intuitively, any such variable has a moderate direct effect on the outcome that will be
incorrectly misattributed to the effect of the treatment when this variable is strongly related
to the treatment and the variable is not included in the regression”. We call the omitted-
variable bias arising from non-orthogonalized procedures that uses machine-learning tools
such as the Lasso in a first step the regularization bias.
Now, focus on a nice case. For that, we make two assumptions. A first one limits the
growth rate of p. It is fairly common and technical. A second one gives an intuition for
more general results in the next section.
26
Assumption 2.6 (Growth Condition).
s log p
√ → 0.
n
Assumption 2.7 (Balanced Design). Assume:
1. µ1 = 0,
The second part of Assumption 2.7 is fairly technical but can be proven under lower-
level assumptions such as normality or sub-Gaussianity of Xi and application of Lemma
2.6, recalling that E(Di Xi ) = 0 under the first part of Assumption 2.7.
√ d
Lemma 2.5 (A Favorable Case). Under Assumptions 2.5, 2.6 and 2.7: τ − τ0 ) −→
n (b
N (0, σ 2 /π0 ).
Proof of Lemma 2.5. Go over the proof of Lemma 2.4. Now, because of Assumption
2.7, we obtain:
n
1 X
p
√ Di Xi
. log p.
n
i=1 ∞
0
Using Theorem 2.1 and |a b| ≤ kak∞ kbk1 , we have:
"
n
#0
n
1 X h i
1 X
√ D i Xi β0 − β ≤
√ Di Xi
β0 − β
b
b
n i=1
n
i=1
1
∞
s log p
. √ → 0,
n
by the growth condition of Assumption 2.6. So the quantity on the left-hand side of the
above inequality converges to zero in probability (`1 consistency implies consistency in
probability). Using Slutsky’s Theorem and Equation (2.10), we obtain the result.
Notice that Assumption 2.7 (1) implies
√ n
∂ n(bτ − τ0 ) 1 1 X p
= √ Di Xi −→ 0, as n → ∞.
∂β0 − β
b π
b n i=1
Under this assumption, the estimator τb is first-order insensitive to small deviations from
the true value β0 . This is what we are going to exploit in the next section.
27
2.5 Theory: Immunization/Orthogonalization Pro-
cedure
Condition (ORT) means that the moment condition for estimating τ0 is not affected by
small perturbation around the true value of the nuisance parameter η0 . This is exactly the
intuition behind the double selection or immunized or Neyman-orthogonalized procedure
Chernozhukov et al. (2017a); Belloni et al. (2017a); Chernozhukov et al. (2018). Changing
the estimating moment can neutralize the effect of the first step estimation and suppress
the regularization bias. We say that any function ψ that satisfies Condition (ORT) is an
orthogonal score, or Neyman-orthogonal.
28
in technical details. We defer technical details to e.g. Lemma 2 and 3 in Chernozhukov
et al. (2015) and Belloni et al. (2017a).
Eψ(Zi , τ0 , η0 ) = 0,
for some known real-valued function ψ(.) satisfying the orthogonality condition (ORT), a
vector of observables Zi and a high-dimensional sparse nuisance parameter η0 such that
kη0 k0 ≤ s. The design respects the growth condition of Assumption 2.6.
kb
η k0 . s,
p
kb
η − η0 k1 . s2 log p/n,
p
kb
η − η0 k2 . s log p/n.
We take this estimator as given. It does not have to be a Lasso, but the Lasso or Post-
Lasso clearly verify these assumptions, in a sparsity or approximate sparsity scenario, i.e.
in cases where you are confident that only a few control variables matter. Chernozhukov
et al. (2018) extend these conditions to any machine-learning procedure of sufficient
quality. They will be discussed in Section 3.1. Notice that the ML procedure you are
going to use depend on the assumptions you are willing to make about η0 because they
will condition the performance of this tool. For example, if you believe η0 to be sparse,
a Lasso should work well. The estimator we are going to consider is τ̌ such that:
n
1X
ψ(Zi , τ̌ , ηb) = 0. (IMMUNIZED)
n i=1
For clearer exposition, we consider the simple case of Assumption 2.10 below.
29
Assumption 2.10 (Affine-Quadratic Model). The function ψ(.) is such that:
where Γj , j = 1, 2, are functions with all their second order derivatives with respect to η
constant over the convex parameter space of η.
The class of models above may seem restrictive but include many usual parameters
of interests such as the Average Treatment Effect (ATE), Average Treatment Effect on
the Treated (ATET), the Local Average Treatment Effect (LATE), any linear regression
coefficient.
√ d
n (τ̌ − τ0 ) −→ N (0, σΓ2 ),
30
" n #
1 0 1
X
η − η0 )
+ (b η − η0 ) .
∂η ∂η0 Γ1 (Zi , η0 ) (b
2 n i=1
| {z }
:=I3
p
Under regularity assumptions, by the LLN, I1 −→ E[Γ1 (Zi , η0 )]. Then:
n
r
1 X
s2 log p
|I2 | ≤
∂η Γ1 (Zi , η0 )
kb
η − η0 k1 . → 0,
n
n
i=1 ∞
and finally:
n
1
1 X
s log p
η − η0 k22
|I3 | ≤ kb ∂η ∂η0 Γ1 (Zi , η0 )
. → 0,
2
n i=1
n
sp(s log n)
if we assume that the sparse-norm of the second-order derivatives matrix (which does not
depend on ηb) is bounded:
1 Xn
∂ ∂ Γ (Z , η ) . 1,
0
η η 1 i 0
n
i=1 sp(s log n)
which occurs under reasonable conditions, see Rudelson and Zhou (2013). Secondly, we
d
need to show √1n ni=1 ψ(Zi , τ0 , ηb) −→ N (0, E[ψ 2 (Zi , τ0 , η0 )]). Decompose similarly:
P
n n
1 X 1 X
√ ψ(Zi , τ0 , ηb) = √ ψ(Zi , τ0 , η0 )
n i=1 n i=1
| {z }
:=I10
" n
#0
1 X
+ √ ∂η ψ(Zi , τ0 , η0 ) (bη − η0 )
n i=1
| {z }
:=I20
√ n
" #
n 0 1
X
+ η − η0 )
(b η − η0 ) .
∂η ∂η0 ψ(Zi , τ0 , η0 ) (b
2 n i=1
| {z }
:=I30
d
Typically, a standard CLT ensures that I10 −→ N (0, E[ψ 2 (Zi , τ0 , η0 )]) as long as E[ψ 2 (Zi , τ0 , η0 )] <
∞.
1 Xn
s log p
|I20 | ≤
√ η − η0 k1 . √
∂η ψ(Zi , τ0 , η0 )
kb → 0,
n
n
i=1 ∞
31
which occurs under mild conditions using more general version of Lemma 2.6, thanks to
the (ORT) condition in Assumption 2.8. And finally:
√
n
n
1 X
s log p
|I30 | ≤ η − η0 k22
kb ∂η ∂η0 ψ(Zi , τ0 , η0 )
. √ → 0,
2
n
i=1
n
sp(s log n)
Remark 2.7 (Importance of Theorem 2.2). Theorem 2.2 is powerful because if you can
find an estimator which is defined as a root of an orthogonal moment condition, i.e. that
satisfies condition (ORT), this estimator is going to be asymptotically Gaussian with
a variance that you can estimate to perform inference. A word should be said on the
assumptions to emphasize the ones that matter the most. Assumption 2.8 is extremely
important and is the point of this whole section. The key part is that ψ should satisfy
the (ORT) condition, other wise we cannot control I20 . Assumption 2.9 deals with the
quality of the nuisance parameter estimation: although it can be made more general to
accommodate other machine learning type methods, the nuisance parameter estimator
should have good performances. Assumption 2.6 on the growth condition is necessary
α
but is not very restrictive: p can grow as quickly as en for an α ∈]0, 1/2[! Assumption
2.10 does not matter: it is a simplification in the context of this course to make the
proof easier. Moreover, it is not so restrictive: many parameters of interest fit into that
framework.
The reason is that n−1 ni=1 ψ(Zi , τ, ηb) = 0 will not have a solution in general.
P
32
2.6 The Double Selection Method
For a given score function m(.) which does not satisfy condition (ORT), how can we find
a ψ(.) that does? Notice that we denote the nuisance parameter by β0 in the first case
and by η0 in the second. This different notation signifies that most of the time η0 is
different from β0 and is in general of larger dimension. We are going to cover Belloni
et al. (2014) which deals with the linear case. Chernozhukov et al. (2015) covers the
Maximum Likelihood and GMM cases. Section 2.2 of Chernozhukov et al. (2018) covers
an even wider range of models.
Recall our estimation strategy in Model 2.5. Assume further that the treatment
equation is given by:
Di = Xi0 δ0 + ξi ,
where Xi ⊥ ξi and ξi ⊥ εi . Denote η := (β, δ)0 .We are going to show that the following
moment condition:
From the orthogonality condition in the treatment equation above, δ0 is such that:
So we have indeed that ψ satisfies (ORT), i.e. E∂η ψ(Zi , τ0 , η0 ) = 0. Equation (2.13) has
a Frish-Waugh-Lovell flavour:
33
or even more clearly, because of Equation (2.14):
Intuitively, the third step may come as a surprise since we perform a regression of Y on
D and the selected X instead of performing the regression of Y − X 0 βbL on D − X 0 δbL . By
Frisch–Waugh–Lovell’s Theorem (Theorem 2.3), this is equivalent, up to the difference
between the Lasso and the Post-Lasso estimator (the third step amounts to using a
Post-Lasso instead of the Lasso). Define the post-double selection estimators βb and δb as:
n
X
βb = arg min (Yi − Di τb − Xi0 β)2 , (2.15)
β:βj =0,∀j ∈
/SbD ∪S
bY i=1
X n
δb = arg min (Di − Xi0 δ)2 . (2.16)
δ:δj =0,∀j ∈
/SbD ∪S
bY i=1
34
The resulting estimator verifies Theorem 2.2 and allows you to perform inference on the
parameter of interest. For the intuition behind this result, see again Section 2.4 on the
regularization bias. The selection procedure advocated here is based on a two-equation
approach. By selecting the elements of X in relation with both D and Y , it does not
miss any confounder as it was the case with the more naive approach.
Remark 2.9 (Computing Standard Errors). When both the outcome and the treatment
equations are linear, the asymptotic variance in Theorem 2.2 writes:
E[ξi2 ε2i ]
σΓ2 = ,
E[ξi2 ]2
which can be consistently estimated by:
" n #−2 n
1 X 1 X
bΓ2 =
σ ξbi2 ξb2 εb2 ,
n i=1 n − sb − 1 i=1 i i
Moving beyond the linear case, Farrell (2015) presents a more general method to
estimate treatment effect parameters (ATE, ATET) using similar ideas.
cξ c2 σj2
i.e. E e j
≤e 2 .
1
E max |ξj | = E log exp max c|ξj |
j=1,...,p c j=1,...,p
35
" ( p )#
1 X
c|ξj |
≤ E log e
c j=1
" ( p )#
1 X
≤ E log ecξj + e−cξj
c j=1
( p )
1 X
E ecξj + E e−cξj
≤ log
c j=1
1 n c2 L2
o
≤ log 2pe 2
c
log(2p) cL2
= + .
c 2
where the third inequality uses Jensen inequality and the fourth the remark at the begin-
p
ning of the proof. The bound is minimized for the value c∗ = 2 log(2p)/L and is equal
p
to L 2 log(2p) which completes the proof.
Theorem 2.3 (Frisch–Waugh–Lovell’s Theorem, Frisch and Waugh 1933, Lovell 1963).
Consider the regression of the vector of dimension n, y, on the full-rank matrix of
dimension n × p, X. Consider the partition: X = [X1 : X2 ], and define PX1 :=
X1 [X1 0 X1 ]−1 X1 0 , MX1 := In − PX1 , and PX and MX the same quantities for X. Con-
sider the two quantities:
36
Chapter 3
37
3.1 Double Machine Learning and Sample-Splitting
The Method: Cross-fitting Double ML. We present the DML1 method of Cher-
nozhukov et al. (2017a). We assume that we have a sample of n copies of the random
vector Zi where n is divisible by an integer K to simplify notations.
1. Take a K-fold random partition (Ik )k=1,...,K of observation indices {1, . . . , n} such
that each fold Ik has size n/k. For each k ∈ {1, . . . , K} define IkC := {1, . . . , n} \Ik .
2. For each k ∈ {1, . . . , K}, construct a ML estimator of η0 using only the auxiliary
sample IkC :
η̂k = η̂ (Zi )i∈IkC .
3. For each k ∈ {1, . . . , K}, using the main sample Ik , construct the estimator τ̌k as
the solution of:
1 X
ψ(Zi , τ̌k , η̂k ) = 0.
n/K i∈I
k
38
is necessary for technical reasons: it helps controlling remainder terms without using as-
sumptions that are too strong, allowing for many types of machine learning estimators
to be used for estimating the nuisance parameters. Intuitively, sample-splitting removes
the bias stemming from over-fitting by using an hold-out sample to estimate the nuisance
parameter η0 and then predict over the main sample.
Notice that the sample-splitting technique advocated here introduces more uncer-
tainty, which should be taken into account when reporting the results. Chernozhukov
et al. (2017a) also proposes to split the sample in S different random partitions and re-
port the mean cross-fitting estimator and a corrected standard error. We do not explore
these refinements.
This GitHub repository https://github.com/VC2015/DMLonGitHub/ contains very
clear files to perform cross-fitting procedure described above with many different ML
techniques.
Y = g0 (D, X) + ε, E [ε | D, X] = 0,
D = m0 (X) + ξ, E [ξ | X] = 0.
Section 2.6 is a particular case where we had g0 (D, X) = Dτ0 + X 0 β0 and m0 (X) = X 0 δ0 .
We have two standard target parameters of interest, the Average Treatment Effect (ATE)
and the Average Treatment Effect on the Treated (ATET):
Notice that in Section 2.6, τ0AT E = τ0AT ET = τ0 because there was no treatment effect
heterogeneity. For the ATE, the orthogonal score of Hahn (1998) is given by:
D(Y − g(1, X)) (1 − D)(Y − g(0, X))
ψ AT E (Zi , τ, η) = [g(1, X) − g(0, X)] + − − τ.
m(X) 1 − m(X)
39
The nuisance parameter true value is η0 = (g0 , m0 ). For the ATET, the orthogonal score
is:
AT ET 1 m(X) D
ψ (Zi , τ, η) = D− (1 − D) (Y − g(0, X)) − τ.
π0 1 − m(X) π0
The nuisance parameter true value is η0 = (g0 , m0 , π0 ) with π0 = P (D = 1). These or-
thogonal scores are the basis of Farrell (2015); Bléhaut et al. (2017). Similar expressions
exist for the Local Average Treatment Effect (LATE) and can be found in Chernozhukov
et al. (2018).
Question: Verify that E ψ AT E (Zi , τ0AT T , η0 ) = 0. Do the same for τ0AT ET .
Remark 3.2 (Affine-quadratic models and trick to compute the variance). Notice that
these orthogonal scores fall under the affine-quadratic type of Assumption 2.10 so com-
puting standard errors will simply follow from the expression in Theorem 2.2. Moreover,
in both cases, E [Γ1 (Zi , η0 )] = −1 which implies that τ0 = E [Γ2 (Zi , η0 )]. As a conse-
quence, from Theorem 2.2, σΓ2 = V [Γ2 (Zi , η0 )]. This observation makes computation of
the standard error fairly simple: for each observation, store Γ̂2 (Zi , η̂) in a vector gamma
and compute the standard error using sd(gamma) / sqrt(n).
In practice. Suppose you observe the outcome, treatment status and a set of covariates
(Zi )i=1,...,n = (Yi , Di , Xi )i=1,...,n from a population of interest and wants to estimate the
treatment effect for the treated τ0AT ET . Here is a strategy you could use:
1. Partition the observation indices {1, . . . , n} in two, such that each fold (I1 , I2 ) has
size n/2.
2. Using only the sample I1 , construct a ML estimator of g(0, X) and m(X). For
example, g(0, X) can be estimated by running a feedforward neural network of Yi
on Xi for the non-treated in this sample. Let us denote this estimator by gbI1 (x).
Similarly, m(X) could be estimated by running a Logit-Lasso of Di on Xi in this
sample. Let us denote this estimator by m
b I1 (x).
3. Now, use these estimators on the sample I2 to compute the treatment effect
1 X b I1 (Xi )
m
τ̌I2 := P Di − (1 − Di ) (Yi − gbI1 (Xi ))
i∈I2 Di i∈I 1−m b I1 (Xi )
2
40
4. Repeat steps 2-3, swapping the roles of I1 and I2 to get τ̌I1 .
DGP. File: DataSim.R. The outcome equation is linear and given by: Yi = Di τ0 +
Xi0 β0 + εi , where τ0 = .5, εi ⊥ Xi , and εi ∼ N (0, 1). The treatment equation follows
a Probit model, Di |Xi ∼ Probit (Xi0 δ0 ). The covariates are simulated as Xi ∼ N (0, Σ),
where each entry of the variance-covariance matrix is set as follows: Σj,k = .5|j−k| . Every
other element of Xi is replaced by 1 if Xi,j > 0 and 0 otherwise. The most interesting
part of the DGP is the form of the coefficients δ0 and β0 :
ρd (−1)j /j 2 , j < p/2 ρy (−1)j /j 2 , j < p/2
( (
β0j = , δ0j =
0, elsewhere ρy (−1)j+1 /(p − j + 1)2 , elsewhere
We are in an approximately sparse setting for both equations. ρy and ρd are constants
that are set to fix the signal-to-noise ratio, in the sense that a larger constant ρy will
mean that the covariates play a larger role. The trick here is that some variables that
matter a lot in the treatment assignment are irrelevant in the outcome equation. The
fact that the one-equation selection procedure will miss some relevant variables for the
outcome should create a bias and non-Normal behavior.
Model and Estimators. We estimate a model based on linear equations for both the
outcome and the treatment as in Section 2.6, although it does not corresponds to the
DGP. We compare three estimators:
41
3. A double-selection estimator based on the Lasso with cross-fitting (K = 5) as
described in Section 3.1.
You can play with these simulations using the file DoubleML Simulation.R. Notice that
this file makes every step very clear and makes very little use of any package so you can
follow easily what is going on. In particular, the Lasso regression is coded from scratch
functions/LassoFISTA.R. Table 3.1 and Figure 3.1 display the result in one particular
high-dimensional setting.
Estimator:
Naive Post-Selec Double Selec Double Selec
w. Cross-fitting
(1) (2) (3)
Bias .748 .020 .024
Root MSE .766 .200 .183
Parameters are set to: R = 10, 000, n = 200, p = 150, K = 5,
Note:
τ0 = .5, ρy = .3, ρd = .7.
density
density
1.0
1.0
1.0
0.5 0.5
0.5
−1 0 1 −1 0 1 −1 0 1
Treatment effect Treatment effect Treatment effect
42
quantity of interest is the ATET, defined as the impact of the participation in the pro-
gram on 1978 yearly earnings in dollars. The treated group comprises people who were
randomly assigned to this program from the population at risk (n1 = 185). Two con-
trol groups are available. The first one is experimental: it is directly comparable to the
treated group as it has been generated by a random control trial (sample size n0 = 260).
The second one comes from the Panel Study of Income Dynamics (PSID) (sample size
n0 = 2490). The presence of the experimental sample allows to obtain a benchmark for
the ATET obtained with observational data. We use these datasets to illustrate the tools
seen in the chapter.
To allow for a flexible specification, we consider the setting of Farrell (2015) and take
the raw covariates of the dataset (age, education, black, hispanic, married, no degree,
income in 1974, income in 1975, no earnings in 1974, no earnings in 1975), two-by-two-
interactions between the four continuous variables and the dummies, two-by-two inter-
actions between the dummies and up to a degree of order 5 polynomial transformations
of continuous variables. Continuous variables are linearly rescaled to [0, 1]. All in all, we
end up with 172 variables to select from. The experimental benchmark for the ATET
estimate is $1,794 (633). We use the package hdm to implement the Lasso and Logit-Lasso
and the package randomForest to use a random forest of 500 trees. We partition the
sample in 5 folds.
Estimator:
Experimental Cross-fitting Cross-fitting
w. 20 partitions
(1) (2) (3)
OLS 1,794
(633)
Lasso 2,305 2,403
(676) (685)
Random Forest 7,509 1,732
(6,711) (1,953)
The file DoubleML Lalonde.R details each step and compute a ATET estimate where
the propensity score and outcome functions are estimated using (i) a Lasso procedure and
(ii) a random forest. We compute standard errors and confidence intervals. Table 3.2
43
displays the results. With or without considering many data splits, the Lasso procedure
ends up pretty close to the experimental estimate. The results are more mixed for the
random forest: the simple cross-fitting procedure gives very imprecise results. They
might be due to a particularly unfortunate split or a particularly bad performance of the
off-the-shelf random forest algorithm in this case. When considering many partitions of
the data, the point estimate is reasonable but the standard-error is still very high. All
in all, the message is to be cautious and test several ML algorithms when possible and
consider many data splits so the results do not depend so much on the partitions. For
a comparison between a wide range of ML tools, see Section 6.1 in Chernozhukov et al.
(2018).
44
Chapter 4
This chapter reviews some important results addressing model selection in the linear
instrumental variables (IV) model. Namely, we remove the exogeneity assumption, ε ⊥
(D, X), from model (2.5), but assume the econometrician possesses several instruments
verifying an exogeneity assumption, while allowing the identity of these instruments to
be unknown and the number of potential candidates to be larger than the sample size.
We distinguish two different cases of high-dimension in the IV model:
– The (very)-many-instruments case, i.e. pzn > n where the number of instruments p
is allowed to grow with the sample size n,
– The many endogenous variables case, i.e. pdn > n where pdn is the number of
endogenous variables, but still pzn > pdn .
Those two frameworks are natural in empirical applications, but inference in the second
case is more complicated and will not be treated here1 . Instrumental variable techniques
to handle endogenous variables are widespread but often lead to imprecise inference.
With few instruments and controls, following Amemiya (1974), Chamberlain (1987), and
Newey (1990), one can try to improve the precision of IV techniques by estimating optimal
instruments. Consider the model 4.1 below:
Assumption 4.1 (IV Model). Consider the i.i.d sequence (Yi , Di , Xi , Zi )i=1,...,n satisfying
0
Yi = Di τ0 + Xi β0 + εi , E[εi ] = 0, E[εi |Zi , Xi ] = 0, (4.1)
where
1
see Remark 4.2 below.
45
1. Xi is a vector of pxn exogenous control variables, including the constant 1.
0 0 0
The moment condition E[ε|W ] = 0, where W = Z , X , implies a sequence of un-
conditional moment conditions E[εA(W )] = 0 indexed by a vector of instruments: the
function A(·) such that E [A(W )2 ] < ∞. This legitimately raises the question of the
choice of A(·) in order to minimize the asymptotic variance and obtain more precise esti-
mates. In Section 4.1, we briefly summarize results on the optimal instruments problem:
which is the optimal transformation A? We refer to Newey and McFadden (1994) for
more details and the methodology in the low dimensional case of model 4.1 (without
assumption 4).
However, even with one instrument Z, considering a high number of transformations of
0
the initial instrument (f1 (Z), . . . , fp (Z)) using series estimators, using B-Splines, poly-
nomials, ect... makes the problem high dimensional. Then in Section 4.2, we present
tools from Belloni et al. (2012a) who use Lasso and Post-Lasso to estimate the first-stage
regression of the endogenous variables on the instruments. As described in Chernozhukov
et al. (2015), this problem fits the double ML structure described in Section 2.6. Estima-
tion of the parameters of interest uses orthogonal or immunized estimating equations that
are robust to small perturbations in the nuisance parameter, similarly to what was used
in Chapter 2.5. For simplicity, we restrict ourselves to the conditionally homoscedastic
case
E ε2 |Z, X = σ 2 .
In this section we remind results about the optimal choice of A(·) in the moment equation
E[εA(W )] = 0 such that E [A(W )2 ] < ∞ to obtain more precise estimates. Define
46
0 0 0 0
θ0 := τ0 , β0 and S := D, X ∈ Rp+1 . We study the Generalized Method of Moments
estimator (GMM) based on the moments conditions
h 0
i
M (θ0 , A) := E A(W ) Y − S θ0 = 0,
2
Define
M
c
c(θ, A)0 M
(θ, A)
:= M c(θ, A) and:
2
2
θ̂n := argmaxθ∈Θ −
M (θ, A)
, (4.2)
c
2
The set of assumptions 4.2 below ensure identification of θ0 in the set Θ, namely that
M (θ, A) vanishes only at θ0 :
∀θ ∈ Θ, M (θ, A) = 0 ⇒ θ = θ0
0
4. G(A) G(A) is non singular.
c(θ, A)0 M
c(θ, A) →P −E ψ(U, θ, A)0 E [ψ(U, θ, A)]
1. convergence of the objective function −M
which admits a unique maximum at θ = θ0 ;
−1
1 Pn 0 1 Pn
2. convergence θ̂n →P θ0 , where θ̂n = i=1 A(Wi )Si [A(Wi )Yi ] (see
n n i=1
Theorem 5.7 in Van der Vaart (1998));
47
3. asymptotic normality, namely that
−1 0
!
√ −1 0
0 0
n θ̂n − θ0 →d N 0, G G G ΣG GG . (4.3)
Indeed, to prove the asymptotic normality, consider the first order condition
0
∇θ M
c θ̂n , A Mc θ̂n , A = 0
which is satisfied with probability approaching one. Then, using second order Taylor’s
h i
theorem for Mc θ̂n , A at θ0 yields that there exists θ ∈ θ0 , θ̂n such that
0 0 0
∇θ M θ̂n , A M θ̂n , A = ∇θ M θ̂n , A M (θ0 , A) + ∇θ M θ̂n , A ∇θ M θ θ̂n − θ0
c c c c c c
√ 0 −1 0 √
n θ̂n − θ0 = ∇θ M c θ̂n , A ∇θ M
c θ −∇θ M
c θ̂n , A nM
c(θ0 , A) .
Then, using condition (3) we have ∇θ Mc θ, A →P G(A) and ∇θ M c θ̂n , A →P G(A),
√ c 0
using condition (4), nM (θ0 , A) →d N (0, Σ), where Σ = E ψ(U, θ0 , A)ψ(U, θ0 , A) , and
using the Slutsky’s theorem we obtain (4.3).
0
Thus, in (4.3), the asymptotic variance simplifies to G−1 Σ(G−1 ) and takes a specific form
−1
h 0
i 0
V (A) := G(A) E ψ(U, θ0 , A)ψ(U, θ0 , A) G(A)−1 .
Theorem 4.1 (Necessary condition for optimal instruments, Theorem 5.3 in Newey and
McFadden (1994) p. 2166). If an efficient choice A of A exists for the estimator (4.2),
then it has to solve
h 0 i
G(A) = E ψ(U, θ0 , A)ψ U, θ0 , A , for all A such that E A(W )2 < ∞.
48
0
This is satisfied, using the homoscedasticity assumption E (Y − S θ0 )2 |W = σ 2 , when
E [S|W ]
A(W ) = .
σ2
Being invariant to a multiplication by a constant matrix, the function A(W ) = E [S|W ]
minimizes the asymptotic variance, which becomes
0 −1
h i
Λ∗ = σ 2 E E [S|W ] E [S|W ] , (4.4)
which is the semi-parametric efficiency bound (see Chapter 25 in Van der Vaart (1998)).
Here, A(W ) is the optimal instrument. In practice, the optimal instrument is the regres-
sion function of S on W , w 7→ E [S|W = w], which is naturally a high dimensional object
(see, e.g. Tsybakov (2009)). It has to be estimated and we now describe how to use the
sparsity assumption to perform efficient estimation in a high dimensional setting.
Note that these are plenty of ways to estimate E [S|W ] under different assumptions.
Lately we use machine learning tools to allow for W to be high dimensional under spar-
sity assumptions.
where, as described in Assumption 4.3 below, δ0 has only few “important” components
(approximately sparse), and because instruments Z may be correlated with the controls
X, we use the equation:
where ρd ⊥ X and
0 0 0
Y = X (ν0 τ0 ) + X β0 + ε + τ0 ρd = X (ν0 τ0 + β0 ) + ε + τ0 ρd . (4.7)
| {z } | {z }
:=θ0 :=ρy ⊥X
49
We make three preliminary remarks.
First, the following two cases naturally arise in practice:
1. either the list of available and possible instruments is large, while the econometrician
knows that only few of them are relevant;
2. or, from a small list of regressors Z, the optimal instruments can be approximated
using a basis of functions (series estimators, using B-Splines, polynomials, ect). This
case is treated using non-sparse methods in Newey (1990). In this decomposition,
z
the potential number pz of needed functions {fj }pj=1 is allowed to be larger than
n. Note that instead of Z, one could also consider transformations of the initial
instruments
0 0
f = (f1 , . . . , fp ) = (f1 (Z), . . . , fp (Z)) .
Second, like in section 2.6, the key assumption made on the nuisance component is
approximate sparsity: namely A(W ) = E [S|W ] (remember that here pd = 1) is assumed
to be well approximated by few (s n) of these pz instruments. Denote the nuisance
component by η0 = (θ0 , ν0 , δ0 , γ0 ) and assume that it can be decomposed into a sparse
component η0m and relatively small non-sparse component η0r :
Third, like in Section 2.6, we have to choose the moment equations carefully so that
model selection errors in the estimation of the nuisance component (here, the optimal
instrument) have limited impact on the estimation of the parameter of interest τ0 . We
now show how the optimal instrument problem can be cast in the framework of the im-
munization procedure developed in Section 2.6, and in particular in the Affine-Quadratic
model (2.10).
50
4.3 Immunization Procedure for Instrumental Vari-
able Models
Starting from (4.1), Chernozhukov et al. (2015) propose to base estimation using or-
thononalised moments like in Frisch–Waugh–Lovell theorem. Consider the space of ran-
dom variables that are square integrable on the canonical probability space (Ω, A, P ),
which we denote by L2 (P ). This is an Hilbert space associated with the scalar product
1/2
< X, W >= E [XW ] and norm kXk = E [X 2 ] . Define pX (W ) = E [W |X], which is
the orthogonal projection of W on the subspace of L2 (P ), {ξ = h(X), E [h(X)2 ] < ∞}
of square integrable random variables that are measurable with respect to X. Applying
mX (W ) = W − pX (W ) = W − E [W |X] to (4.1) yields the equations:
where
0
mX D = D − E [D|X] = D − X ν0 ,
0
mX Y = Y − E [Y |X] = Y − X (ν0 τ0 + β0 ).
where
0 0 0
mX pX,Z D = mX E[D|X, Z] = E[D|X, Z] − E[E[D|X, Z]|X] = X γ0 + Z δ0 − X ν0 .
Note that if D where exogenous, (4.9) would simply be E [mX εmX D] = 0. In the present
context, in the same spirit as the optimal instrument, we should use with p n
but to handle the errors coming from the selection in the estimation of covariates X, we
have to subtract the term E [D|X] to obtain a robust estimator which yields
51
The moment condition (4.8) can be rewritten as
E [ψ(W, τ0 , η)] = 0
where
0
0 0 0
ψ(W, τ0 , η) = Y − τ0 D − X β0 Z δ0 + X γ0 − X ν0 (4.12)
0
0 0 0
= Y − τ0 D − X (θ0 − ν0 τ0 ) Z δ0 + X γ0 − X ν0
0 0
0 0
T
= Y − X θ0 − (D − X ν0 )τ0 Z δ0 + X γ0 − X ν0 (4.13)
0 0 0
ψ(W, τ0 , η) = ρy − ρd τ0 Z δ0 + X γ0 − X ν0
0 0 0
= ε Z δ0 + X γ0 − X ν0 . (4.14)
The instruments for D, controlling for the correlation between Z and X, are
=ζ 0 δ0 .
Equation (4.13) can be rewritten under the form of the Affine-Quadratic model (2.10)
52
0 0
3. Do Lasso or Post-Lasso regression of D̂ = X γ̂ + Z δ̂ on X to get ν̂;
T
The estimator of η0 is η̂ = θ̂, ν̂, γ̂, δ̂ ;
4. Then
√
2 h i−1
τ̌ = argminτ ∈R
nM
c(τ, η̂)
= Γc1 (η̂)0 Γ
c1 (η̂) c1 (η̂)0 Γ
Γ c2 (η̂).
Note that Step 4 amounts to perform 2SLS using the residuals Y − θ̂0 X from Step
2 as running variable, the residuals D − D̂ from Step 1 as covariate, and the residuals
D̂ − ν̂ 0 X as instruments.
Remark 4.1. In the case of “small number” of controls (see Belloni et al. (2012a)), θ0
is no longer a “nuisance” parameter in the sense that there is no selection to be done on
X. In this case, we can take A(W ) = E [D|Z, X], as (ORT) does not have to hold with
respect to θ0 .
Using the formulation (4.17) of the model as an Affine-Quadratic model (see assump-
tion 2.10) and if assumptions of Theorem 2.2 are satisfied, namely if
√
4. the growth condition s log(p)/ n → 0 holds;
√
n (τ̌ − τ0 ) → N (0, σΓ2 ), (4.20)
53
Question: Show that, when these are no controls X (take Z = ζ), Λ∗ = σΓ2 , where
−1
from (4.4), Λ∗ = σ 2 E E [D|Z]2 , and thus that this estimator of τ0 with Optimal
Moreover, Theorem 3 in Belloni et al. (2012a) shows that the result continues to hold
h i−1 P 2
0
with Λ∗ replaced by Λ̂∗ = σ̂ 2 E D̂2 , where σ̂ 2 := ni=1 Yi − Di τ̂ − Xi β̂ /n.
Remark 4.2. If we use approximate sparsity assumption 4.3, then we have to impose the
following assumption, and the result (4.20) also holds (see Chernozhukov et al. (2015))
kη̂k0 ≤s,
r
s
kη̂ − η0m k2 ≤ log p,
n
r
s2
kη̂ − η0m k1 ≤ log p,
n
Remark 4.3 (Estimation and inference with many endogenous regressors). In this course,
we limit ourselves to the case where the number pd of endogenous regressors is fixed. How-
ever, several recent papers Gautier and Tsybakov (2011), Gautier and Tsybakov (2013)
and Belloni et al. (2017b) consider inference of an high dimensional parameter τ0 with a
high dimensional nuisance parameter.
This goes beyond this course, but this can be useful is the following situations:
- Economic theory is not explicit enough about which variables belong to the true
model. Here searching for the good “small” set of potentially endogenous variables
to put into the outcome equation may not be possible.
d
where {fk }pk=1 is a family of function (ex: basis) that capture nonlinearities.
54
4.4 Simulation Study
Similarly to the simulation exercise of the previous section, this illustrates two points:
2. the cross-fitting estimator trades off a large bias for a smaller MSE compared to
the immunized estimator that uses the whole sample.
DGP. We use a DGP close to the one in Chernozhukov et al. (2015): namely i.i.d
observations (Yi , Di , Zi , Xi )ni=1 from
0
Yi =τ0 Di + Xi β0 + 2εi
0 0
Di =Xi γ0 + Zi δ0 + Ui
Zi =ΠXi + αζi ,
where
- The number of controls is set to 200, the number of instruments to 150, the number
of observations to 202.
- The most interesting part of the DGP is the form of the coefficients β0 , γ0 , and δ0 :
(
1/4, j < 4
β0j = ,
0, elsewhere
55
1. An “oracle” estimator, where the coefficients of the nuisance parameters are known,
0
and we run standard IV regression of Yi − E [Yi |Xi ] on Di − E [Di |Xi ] using ζi δ0 as
instruments;
Estimator:
Naive Post-Selec Double Selec Oracle
(1) (2) (3)
Bias 0.04 0.01 0.00
Root MSE 0.36 0.39 0.06
MAD 0.24 0.25 0.04
Parameters are set to: n = 202, px = 200, px = 150, K = 3,
Note:
τ0 = 1.5
56
4.5 Empirical Applications
4.5.1 Logit Demand
We briefly introduce the logit demand model in the context where we only observe market
share data (see the seminal papers by Berry et al. (1995), Berry (1994) and Nevo (2001),
and the datasets provided in the Github). The model describes demand for a product in
the “characteristic space”, namely a product can be characterized by a number of features
(for a car: efficiency, type of oil, power, ect) and consumers value those characteristics.
The consumer can choose among J products and maximizes his utility of consuming this
good. Individual random utility for choosing good j ∈ {0, . . . , J} is modeled as
0
uij = Xj β0 − τ0 Pj + ζj + εij , (εij , ζj ) ⊥ Xj
exp (δj )
Pij = PJ , δj = XjT β0 − τ0 Pj + ζj .
1 + k=1 exp (δk )
Moreover, the econometrician does not observe individual choices, but only market
shares of product j: sjt = Qjt /Mt at period t, where Mt is the total number of households
in the market, and Qjt the number choosing the product j in period t. This yields
0
exp Xjt β0 − τ0 Pjt + ζjt
sjt = ,
1 + Jk=1 Xkt β0 − τ0 Pkt + ζkt
P 0
thus, using sj /s0 and assuming that market shares are non zero, we get
0
log(sj ) − log(s0 ) = Xjt β0 − τ0 Pjt + ζjt . (4.21)
However, price may be correlated with unobserved component ζjt such that OLS would
lead to an estimate of τ0 which is biased towards zero. We use the instrumental equation:
0 0
Pjt = Zjt δ0 + Xjt γ0 + ujt . (4.22)
Here, controls include a constant and several covariates. In Berry et al. (1995), they
suggest to use the so-called “BLP instruments” namely characteristics of other products,
which may satisfy an exclusion restriction: for any j 0 6= j and t0 , as well as any function of
57
those characteristics. The justification is that, if a product is close in the “characteristics
space” to its competitors, it may impact the markups, then the price (however, one
should prefer cost based instruments, rarely available). Thus, we are left with a very-
high dimensional set of potential instruments for Pjt .
Originally, Berry et al. (1995) solve this problem of dimension taking sums of product
characteristics formed by summing over products excluding product j
X X
Zk,jt = Xj 0 ,jt , Xj 0 ,jt ,
j 0 6=j,j 0 ∈If j 0 6=j,j 0 ∈I
/ f
Not to mention the classical problems with those specific forms (own-price elasticities
quasi proportional to prices, symmetry of cross price elasticity with respect to products),
facing inelastic demand is inconsistent with profit maximizing price choice in this frame-
work, thus theory would predict that demand should be elastic for all products, which is
not the case of estimates without selection in Table 4.2. Estimators with selection give
in that sense much more plausible estimates.
We now replicate the analysis of the returns to schooling done in Card (1993), and see how
results are changed if we enlarge the set of possible instruments. David Card considers
58
Table 4.2: Estimation of τ0
Price Coefficient Standard Error Number Inelastic
Estimates Without Selection
Baseline OLS -9.63 0.84 586.00
Baseline 2SLS -9.48 0.87 990.00
2SLS Estimates With “Double Selection”
Baseline 2SLS Selection -11.29 0.93 224.00
Augmented 2SLS Selection -11.44 0.91 212.00
0
Y =τ0 D + X β0 + ε, ε⊥X
0 0
D =Z δ0 + X δ0 + u, u ⊥ (Z, X)
where Yi is the weekly log wage of individual i, Di denotes education (in years), Xi is
a vector of controls, possibly high dimensional, and Zi denotes a vector of instrumental
variables for education.
In this example, the instruments are the two indicator for whether a subject grew up
near a two-year college or a four-year college. He also proposes to use IQ as instruments
for the Knowledge of the World of Work (KWW) test scores, which could be added as
control. The control variables Xi are: age and work experience at the time of the survey,
subject’s father’s and mother’s years of education, indicator for family situation at the
subject age 14 (whether subject lived with both mother and father, single mom, step-
parent), 9 indicators for living region, a dummy for living in a Standard Metropolitan
Statistical Area (SMSA) and another for living in the south, an indicator for whether
subject’s race is black, marital status at the time of the survey, indicator for whether
subject had library card at age 14, KWW test scores and interactions. Two options are
possible without using selection: either using those four instruments, or interacting those
instruments with their interactions with the controls, leading to 48 instruments. In this
second case, we have to correct for the use of many instruments with the Fuller (1977)
estimator, implemented in the R package ivmodel.
Results presented in Table 4.3, using the code shows that:
- The Post-Lasso selects 5 among all the potential 64 instruments, and so does provide
some pertinent selection without prior knowledge.
59
- Comparison of standard errors in the Lasso case with the Fuller estimates with 64
instruments shows a small improvement.
Finally, we also refer to the github of the course where we replicate the application
of the instrument selection techniques developed in the previous sections to the Angrist
and Krueger (1991) dataset from Belloni et al. (2010). The dataset NEW7080.dta can be
found at https://economics.mit.edu/faculty/angrist/data1/data/angkru1991.
60
Chapter 5
where
p
X
Γ̂δ
= Γ̂j δj .
1
j=1
The penalty Γ̂ ∈ Mp,p (R) is an estimator of the “ideal” penalty loadings Γ̂0 = diag(γ̂10 , . . . , γ̂p0 ),
qP
n
where γ̂j0 = 2 2
i=1 Zi,j εi /n. This is an “ideal” penalty loadings as εi is not observed.
In practice:
√
1. We set λ = 2c nΦ−1 (1 − 0.1/(2p log(p ∨ n))), where c = 1.1 and Φ−1 (·) denotes
the inverse of the standard normal cumulative distribution function.
61
2. We estimate the ideal loadings 1) using “conservative” penalty loadings and 2)
plugging in the resulting estimated residuals in place of εi to obtain the refined
loadings.
2
Assumption 5.1 (Moment conditions). Assume that (i) maxj=1,...,p E [Di2 ]+E Di2 Zj,i +
2 2 3 3
1/ E Zj,i εi ≤ K1 (ii) maxj=1,...,p E Zj,i εi ≤ K2 , where K1 , K2 < ∞.
Under theses moment conditions, Theorem 5.1 below gives rates of convergence for
Lasso under non-Gaussian and heteroscedastic errors, which relaxes the assumptions
made in Theorem 2.1. Of course this is more realistic in most applications.
Theorem 5.1 (Rates for Lasso Under Non-Gaussian and Heteroscedastic Errors, Theo-
rem 1 in Belloni et al. (2012a)). Consider model (5.1), the sparsity assumption |δ0 |0 ≤ s,
assumptions 2.4 and 5.1. Take ε > 0, there exist C1 and C2 such that the LASSO es-
√
timator defined in (5.2) with the tuning parameter λ = 2c nΦ−1 (1 − α/(2p)), where
α → 0, log(1/α) ≤ c1 log(max(p, n)), c1 > 0, and with asymptotically valid penalty load-
ings lΓ̂0 ≤ Γ̂ ≤ uΓ̂0 where l →P 1, u →P 1 satisfies, with probability 1 − ε
r
C1 s2 log(max(p, n))
δ̂ − δ0
≤ 2 (5.3)
1 κC n
n
1 X 0 0
2 C2 s log(max(p, n))
Zi δ̂ − Zi δ0 ≤ , (5.4)
n i=1 κC n
1 Pn 0
where κC := κC Zi Zi and
n i=1
kγ̂ 0 k∞
uc + 1
C= .
k1/γ̂ 0 k∞ lc − 1
Intuition for Theorem 5.1: Regularizing event and concentration inequality.
The proof of the LASSO with Gaussian errors, Step 2 in Theorem 2.1, is based on the
fact that with high probability 1 − α we have the following “regularizing event”
( )
1 X n λ
n
max εi Xij ≤ .
j=1,...,p n 4
i=1
To ensures this, we used the Markov inequality, conditional on X1 , . . . , Xn and the fol-
lowing concentration inequality (see Lemma 2.6) which for p gaussian random variables
ξj ∼ N (0, σj2 ) ensures that
p
E max |ξj | ≤ max σj 2 log(2p).
j=1,...,p j=1,...,p
62
This shows how to choose λ. In the general case of the LASSO with non Gaussian and
heteroscedastic errors, to choose λ and the penalty loadings γ 0 , we use the same ideas.
We ensure that we have the regularizing event with high probability
Pn √
i=1 Zi,j εi / n λ
max ≤ √ (5.5)
j=1,...,p γ̂j0 2c n
using the following concentration inequality applied to Ui,j := Zi,j εi . This ensures that
there exists finite constant A > 0 such that
√
Pn
i=1 Ui,j / n −1 α A
P max qP ≤Φ 1− ≥1−α 1+ (5.6)
j=1,...,p n 2
2p ln
i=1 Ui,j /n
where ln → ∞. This comes from moderate deviation theorems for self-normalized sums
(see Lemma 5 in Belloni et al. (2012a) and Belloni et al. (2018)). The idea is that, if the
loadings Γ̂0 are chosen so that the term
Pn √
i=1 Zi,j εi / n
γ̂j0
behaves like a standard normal random variable, then we could get the desired condition
√
(5.5) taking λ/(2c n) large enough to dominate the maximum of p standard normal
random variables with high probability. Belloni et al. (2012a) show that taking (γj0 )2 =
Var (Zi,j εi ) allows to fulfill this idea, even if the εi are not i.i.d Gaussian. This yields
(5.6). Then, the Lemma 5.1 below ensures that on this regularizing event, we have the
desired inequalities.
Lemma 5.1 (Lemma 6 in Belloni et al. (2012a)). Consider model (5.1), the sparsity
assumption |δ0 |0 ≤ s, assumptions 2.4 and 5.1. If the penalty dominates the score in the
sense that n
γ̂j0 λ 1 X
≥ max 2c Zi,j εi
n j≤p n
i=1
or equivalently (5.5), then, we have
√
v
u n 2
u1 X 0 0 1 λ s
t Zi δ̂ − Zi δ0 ≤ u + (5.7)
n i=1 c nκc0
0
(1 + c0 ) 1 λs
Γ̂ δ̂ − δ0
≤ u+ , (5.8)
1 κc0 c nκc0
where c0 = (uc + 1)/(lc − 1).
63
Pn 0 0 2
Proof of Lemma 5.1. Denote by L(δ) = i=1 Di − Zi δ /n. Because δ̂ is solution of
the minimisation program, we have
λ
L δ̂ − L (δ0 ) ≤
Γ̂δ0
−
Γ̂δ̂
. (5.9)
n 1 1
√ u
v
n
0
Γ̂ δ̂ − δ0
≤
s u 1 X 0
2
Zi δ̂ − δ0
t
κc0 n i=1
S0 1
64
√ X n
s 1 0
2
≤ (1 + c0 ) Zi δ̂ − δ0
κc0 n i=1
Elements for the proof of Theorem 5.1. The proof is based on these three steps:
First, the fact that for this choice of λ, using Lemma 5 in Belloni et al. (2012a), we have
as α → 0 and n → ∞
1 Pn
√ √n i=1 Zi,j εi
P 2c n
0
> λ = o(1).
γj
Thus for n large enough and α small enough we can consider the regularising event,
1 Pn
√
n i=1 Zi,j εi
λ
E := ≤ 2c√n
γj0
√
v
u n 2
u1 X 0 0 1λ s
t Zi δ̂ − Zi δ0 ≤ u +
n i=1 cnκc0
r
1 2c −1 α s
≤ u+ Φ 1−
c κc0 2p n
r
1 2cC3 s log(p/α)
≤ u+
c κc0 n
and
0
(1 + c0 ) 1 λs
Γ̂ δ̂ − δ0
≤ u+
1 κc0 c nκc0
(1 + c0 ) 1 −1 α 2cs
≤ u+ Φ 1− √
κc0 c 2p nκc0
65
p
(1 + c0 ) 1 2cC3 s log(p/α)
≤ u+ √
κc0 c κc0 n
(5.11)
based on
" nj
#−1 nj
1 X c 1 X c
τ̌j = Γ1 Wij , η̂ j Γ2 Wij , η̂ j for j ∈ {a, b}.
nj i=1 nj i=1
This estimator combines two estimators of the treatment effect based on each sample,
where each one uses a preliminary estimator of the nuisance function based on the other
sample only.
66
Proof of Theorem 5.2. The proof mainly consists of modifying the proof in Theorem
n c
j
2.2 to use independence between (εi )i=1 and ηbj for j ∈ {a, b}.
We use that E εj |Xij , ζij = 0 and that {εji , 1 ≤ i ≤ nj } are independent from the sample
j c , to get
c
E ψ Wij , τ0 , η̂ j − ψ Wij , τ0 , η0
c
= E E ψ Wij , τ0 , η̂ j − ψ(Wij , τ0 , η0 )|Xij , ζij , j c
h j 0 j c 0 c 0 c
i
= E E εj |Xij , ζij Zi (δ̂ − δ0 ) + Xij (γ̂ j − γ0 ) − Xij (ν̂ j − ν0 ) = 0
c c
Then, using the Chebyshev inequality, the fact that η̂ j − η j are independent of {εji , 1 ≤
i ≤ nj } by independence of the two subsamples j and j c , and that {εji , 1 ≤ i ≤ nj } have
conditional variance on Xij , ζij bounded from above by K, we obtain
√ nj !
nX
c
ψ Wij , τ0 , η̂ j − ψ Wij , τ0 , η0 > ε
P
nj
i=1
√ nj 2
1 n X j jc
j
≤ 2E ψ Wi , τ0 , η̂ − ψ Wi , τ0 , η0
ε nj
i=1
nj nj
" #
1 n X j 0 j c
j 0 jc
j
0 j c 2 X j 2
≤ 2E 2 Zi δ̂ − δ0 + (Xi ) γ̂ − γ0 − Xi ν̂ − ν0 (εi )
ε nj i=1 i=1
nj nj
" " ##
1 n X j
0
jc
j 0 jc
j 0 jc
2 X
j 2 j j c
≤ 2E E 2 Zi δ̂ − δ0 + (Xi ) γ̂ − γ0 − (Xi ) ν̂ − ν0 (εi ) Xi , ζi , j
ε nj i=1 i=1
nj nj
" # " #
1 n X j 0
jc
j 0 jc
j 0 jc
2 X j 2
j j c
≤ 2E 2 (Zi ) δ̂ − δ0 + (Xi ) γ̂ − γ0 − (Xi ) ν̂ − ν0 E (εi ) Xi , ζi , j
ε nj i=1 i=1
nj
" #
K n X j 0 jc
j 0 jc
j 0 jc
2
≤ 2E 2 (Zi ) δ̂ − δ0 + (Xi ) γ̂ − γ0 − (Xi ) ν̂ − ν0
ε nj i=1
K n C2 s log(max(p, nj ))
≤ .
ε2 nj κC nj
67
Using Theorem 5.1, we obtain that
" nj
#−1 nj
√ 1 X j 1 X
ψ Wij , τ0 , η0 + oP (1).
nj (τ̌j − τ0 ) = Γ1 Wi , η0 √ (5.14)
nj i=1 nj i=1
Step 2: on the aggregated estimator τ̌ . Putting the two results together, we get
√
the asymptotic representation of n (τ̌ − τ0 )
" n nb
#−1
√ 1X a
1 X
n (τ̌ − τ0 ) = Γ1 (Wia , η0 ) + Γ1 (Wib , η0 )
n i=1 n i=1
na
! nb
! !
1 X √ 1 X √
× Γ1 (Wia , η0 ) n(τ̌a − τ0 ) + Γ1 (Wib , η0 ) n(τ̌b − τ0 ) + oP (1)
n i=1 n i=1
" n #−1 na nb
!
1X 1 X 1 X
= Γ1 (Wi , η0 ) √ ψ(Wia , τ0 , η0 ) + √ ψ(Wib , τ0 , η0 ) + oP (1),
n i=1 n i=1 n i=1
which concludes the proof.
where E [εit uit ] 6= 0 but E [εit |Zi1 , . . . , ZiT ] = E [uit |Zi1 , . . . , ZiT ] = 0, where we have
a high dimensional number pz of instruments Zit satisfying pz nT the number of
individuals and time periods observed. For simplicity we do not consider cases of fixed
or high dimensional number of controls, but the ideas of the double selection of Section
4.2 can be directly extended here. We use the classical “within” transformation
T
1X
Ÿit = Yit − Yit
T t=1
68
(and respectively Z̈it and ε̈it the “within” transformation of Zit and εit ) to partial out
the fixed effect in both equations, which reduces the model to
We then use the sparsity assumption kδ0 k0 ≤ s and the following Cluster-Lasso regression
of D̈it on Z̈it to estimate δ0 , and use to estimate τ0 the orthogonal moment condition
" T
#
1 X 0
0
E Ÿit − τ0 D̈it − Z̈it δ0 Z̈it δ0 = 0
T i=1
where
0
0
Γ1 (D̈it , Z̈it , δ) = − D̈it − Z̈it δ Z̈it δ
0
Γ2 (Ÿit , Z̈it , δ) = − Ÿit Z̈it δ.
h i
0
Note that the moment condition E Ÿit − τ0 D̈it Z̈it δ0 satisfies also (ORT) because
these are no controls here.
The Cluster Lasso, Intuition. We consider the regression (5.18). The Cluster-Lasso
coefficient estimate is based on
n T pZ
1 XX 0
2 λ X
δ̂ ∈ arg min D̈it − Z̈it δ + γ̂j |δj | . (5.19)
δ∈Rpz nT i=1 i=1 nT j=1
happens with high probability. Using like in (5.6) moderate deviations theorems from the
self-normalized theory (see Lemma 5 in Belloni et al. (2012a)) we have that for α → 0
69
PT
as n → ∞, the variables Uij := t=1 Z̈itj ε̈it /T , which are independent random variables
across i with mean zero, satisfy (if finite third-order moments),
√
Pn
i=1 Uij / n
P max qP > Φ−1 1 − α = o(1),
1≤j≤pz n 2
2p
i=1 Uij /n
Define hP i
T 2 2
E Z̈
t=1 itj itε̈ /T
iZT = T min 2 .
1≤j≤p PT
E t=1 Z̈itj ε̈it /T
Belloni et al. (2016) call the quantity iZT the “index of information”, in the sense that is
inversely related to the strength of within-individual dependence and can vary between
iZT = 1 (perfect dependence within cluster i) and izT = T (perfect independence within i).
Theorem 5.3 show that, through this quantity, this dependence impacts the rates.
n op
, where Σ̈jk = ni=1 Tt=1 Z̈itj Z̈itk /(nT ).
P P
Define the Empirical Gram matrix Σ̈ = Σ̈jk
j,k=1
70
Note that in the above theorem, the effective sample size niZT is intuitively related to
the time dependence structure: when observations are totally independent across time
(iZT = T ), the size is nT whereas if observations are perfectly dependent (iZT = 1), the
size is n. Theorem 5.4 is an extension of Theorem 5.1, and the proofs share the same key
elements.
Theorem 5.4 (Asymptotic normality of the Cluster-Lasso estimator for treatment effect
in panels, Theorem 2 in Belloni et al. (2016)). Assume that conditions of Theorem 5.3
hold, that the growth condition s2 log(max(p, nT ))2 /(niD
T ) = o(1) holds. Assume that the
moments conditions given in SMIV p16 in Belloni et al. (2016). Then the IV estimator
τ̌ satisfies
q
−1/2 d
niD
TV (τ̌ − τ0 ) → N (0, 1),
where 2
P T
E ψ Ÿit , D̈it , Z̈it , δ0
t=1 /T
V := iD
T hP i2 .
T
TE t=1 Γ1 D̈it , Z̈it , δ0 /T
Codes for the cluster LASSO is given in clusterlasso.R (with slight modifications of
the code rlasso.R from the package hdm, to implement the clustered penalty loadings).
We refer to Belloni et al. (2016) for simulations and application to gun control.
71
county land area) (see Baltagi (2008) for a full data description). To handle the potential
endogenity of the number of police per capita, we use the same instruments as Cornwell
and Trumbull (1994), namely offense mix (ratio of crimes involving face-to-face contact to
those that do not) and per capita tax revenue (again, see Baltagi (2008) for motivations).
The variable selection method introduced in this chapter allow to solve part of the trade-
off that the researcher otherwise faces: including many covariates to include all the
potential confounders while not lowering estimation precision. To illustrate this point,
we consider equations (5.15)-(5.16) with: the same set of controls (16) and instruments
(2) as in Cornwell and Trumbull (1994) and Baltagi (2008) or a “large set” of controls
(i.e. including interactions and polynomial transformations up to order 2, which yields
544 control variables) and IV (98). The idea is that one might not be sure of the exact
identity of the controls that enter the equation. Table 5.1 focuses on the effect of number
of police per capita on crime rates. The estimate for the Cluster LASSO is roughly equal
to the one of the within estimator with few controls and IV (first column). However
the Cluster-LASSO estimator does not require apriori selection (and it selects different
controls and IV than in the one included in the baseline). The within “large” estimates
appears to be biased as the number of controls is in out case close to the number of
observations.
Table 5.1: Economics of Crime Estimates using Cluster Post-Double Selection
72
Chapter 6
The synthetic control method has been viewed as “the most important innovation in
the policy evaluation literature in the last 15 years” by Athey and Imbens (2017) and
its popularity in applied work keeps growing with applications in fields ranging from the
link between taxation and migration of football players (Kleven et al., 2013), immigration
(Bohn et al., 2014), health policy (Hackmann et al., 2015); minimum wage (Allegretto
et al., 2013), regional policies (Gobillon and Magnac, 2016); prostitution laws (Cunning-
ham and Shah, 2017), financial value of connections to policy-makers (Acemoglu et al.,
2016), and many more.
Sometimes considered as an alternative to difference-in-differences when only aggre-
gate data are available (Angrist and Pischke, 2009, Section 5.2), the synthetic controls
method offers a data-driven procedure to select a comparison unit, called a “synthetic
unit”, in comparative case studies. The synthetic unit is constructed as a convex com-
bination of control units that best reproduces the treated unit during the pre-treatment
period. In consequence, some units in the control group (also referred to as the “donor
pool”) will be assigned a weight of zero. In contrast, the difference-in-differences estima-
tor would take any control unit and give it a weight of 1/n0 where n0 is the control group
size. This remark will be detailed below but it hints at the flexibility that the synthetic
controls method offers by providing a data-driven way of weighting each control unit.
While the link to the difference-in-differences approach is direct, the synthetic controls
method is also related to matching estimators (Abadie and Cattaneo, 2018, Section 4, for a
short introduction) because solving the synthetic control program amounts to minimizing
a type of matching discrepancy. For more on that subject, see Abadie and L’Hour (2019).
73
References. The original papers that developed the method are Abadie and Gardeaz-
abal (2003); Abadie et al. (2010, 2015). Abadie et al. (2010), where the authors study the
effect of a large-scale tobacco control program in California, is the most iconic. Abadie
(2019) describes the methodology when applying the synthetic control method. Doud-
chenko and Imbens (2016) makes the connection between synthetic control, difference-in-
differences, regression and balancing. We should also mention a YouTube video on the
topic by Alberto Abadie: https://youtu.be/2jzL0DZfr_Y, and R, STATA and Matlab
packages, available at http://web.stanford.edu/~jhain/synthpage.html.
A very good textbook treatment of causal inference methods is given in Imbens and
Rubin (2015). A clear and concise presentation of the main tools of the field is given
in Abadie and Cattaneo (2018) which we strongly encourage you to read. The research
frontier is described in Athey and Imbens (2017). Angrist and Pischke (2010) reviews
progress in empirical economic research. A competing causal framework to Rubin (1974)
is known as Directed Acyclic Graphs (DAG) and is developed in Pearl (2000). Wasserman
(2010) has an introductory chapter on DAG.
The synthetic controls method makes explicit use of the panel data framework. The
presentation is inspired by Doudchenko and Imbens (2016). We observe n0 + 1 units in
periods t = 1, ..., T . Unit 1 is treated starting from period T0 + 1 onward, while units
2 to n0 + 1 are never treated. Units 2 to n0 + 1 form what is called the “donor pool”
because these units may or may not be selected to take part in the synthetic unit (see
Remark 6.5). Let Yi,t (0) the potential outcome for unit i at time t if it is not treated and
Yi,t (1) the potential outcome if it is exposed to the intervention. We observe exposure to
treatment (Di,t ) and realized outcome Yi,tobs defined by:
Yi,t (0) if Di,t = 0
Yi,tobs = Yi,t (Di,t ) =
Yi,t (1) if Di,t = 1
74
Remark 6.1 (A note on the dimension). Most of the applied papers that use the synthetic
control method are dealing with long panel data where T is relatively large or proportional
to n0 , and there are at most a dozen treated units. For example, in Abadie et al. (2010),
T = 40, n0 = 38 and only one treated unit; in Acemoglu et al. (2016), T ≈ 300, n0 = 513
and a dozen treated units. This is in sharp contrast with standard panel data or repeated
cross-section settings where n0 is very large while T ranges from two to to a dozen. In
most applications where the method is used, a “unit” is a city, a region or even a country,
hence the limited sample size. T0 is necessarily larger than 1 but it is usually located
after the middle of the period, i.e. T0 > (T − 1)/2, so as to have a long pre-treatment
period that allows to “train” the synthetic unit (see Theorem 6.1 for a justification).
This particular setting implies that the asymptotic framework where the number of units
grows tends to infinity less relevant.
75
Let Xtreat be the p × 1 vector of pre-intervention characteristics for the treated unit.
Let Xc be the p × n0 matrix containing the same variables for control units. In most
applications, the p pre-intervention characteristics will only contain pre-treatment out-
comes (in which case p = T0 ) but one might want to add other predictors of the outcome
observed during the pre-treatment period that may or may not be time invariant, we
collect them in a vector Zi such that for the treated unit:
obs
Y1,1
Y obs
1,2
Xtreat := ... .
(p×1) obs
Y1,T 0
Z1
Xc is defined similarly. For some p × p symmetric and positive semidefinite matrix V ,
√
we adopt the notation kXkV = X 0 V X. Consider ω = (ω2 , . . . , ωn0 +1 ) a vector of n0
parameters verifying the following constraints:
ωi ≥ 0, i = 2, ..., n0 + 1, (NON-NEGATIVITY)
X
ωi = 1. (ADDING-UP)
i≥2
These constraints prevent interpolation outside of the support of the data, i.e. the coun-
terfactuel cannot take a value larger than the maximal value or smaller than the minimal
value observed for a control unit. The synthetic control solution ω ∗ solves the program:
76
Remark 6.2 (A note on the choice of Xtreat and Xc ). They should contain pre-treatment
variables that are good predictors of the outcome of interest. In the Mariel Boatlift
example of Card (1990) where the outcomes of interest are wages and unemployment, it
is aggregate demographic indicators (gender, race, age), education levels, median income,
GDP per capita. Due to the time series nature of the problem, including pre-treatment
outcomes is strongly advised by Theorem 6.1. e.g. 1975-1979 unemployment rates, as
it is a way to create a control unit that verifies the Common Trend Assumption (CTA).
Note that the synthetic control method implicitly uses the Conditional Independence
Assumption (CIA).
Remark 6.3 (A note on the choice of V ). V is a diagonal matrix with each element along
the diagonal reflecting prior knowledge from the researcher about the importance of each
variable for the intervention under study. The synthetic control program (SYNTH) writes
in this case: #2
p
" nX
X 0 +1
T0
" nX
#2
X 0 +1
M SP E(V ) := obs
Y1,t − ωi∗ (V )Yi,tobs .
t=1 i=2
We can also use a form of cross-validation (Abadie and L’Hour (2019)). To simplify the
exposition and because it is the most natural choice, we will assume that the validation
period is at the end of the pre-intervention period, although other choices are possible.
The procedure is as follows:
1. Split the pre-intervention period that contains T0 dates into T0 − k initial training
dates and k subsequent validation dates.
τbt (V ) = obs
Y1,t − ωi∗ (V )Yi,tobs ,
i=2
77
where ωi∗ (V ) solves (SYNTH) with X measured in the training period {1, . . . , T0 −
k − 1}.
3. Choose V to minimize the mean squared prediction error over the validation period,
T0
1 X
MSPE(V ) = τbt (V )2 .
k t=T −k
0
Notice that over the validation period, the computed treatment effect must be zero.
Assumption 6.1 (IID Transitory Shocks). (εi,t )i=1,...,n0 +1,t=1,...,T are iid, across both i and
t, random variables with mean zero and variance σ 2 . Moreover, for some even integer
m > 2, E|εi,t |m < ∞.
kXtreat − Xc ω ∗ k2V = 0.
It is a crucial point to prove the next theorem. This is not an abnormal case in many
applications because of over-fitting n0 > p. However, the curse of dimensionality entails
78
that as p grows, this assumption is less and less likely to be verified. See the precise
discussion is given in Ferman and Pinto (2016). Let ξ(M ) be the smallest eigenvalue of:
T0
1 X
λt λ0t .
M t=T0 −M +1
Denote by λP the T0 × F matrix with the t-th row equal to λ0t and assume:
Assumption 6.3 (Nonsingularity of Factor Matrix). ξ(M ) ≥ cξ > 0 for any positive
0
integer M . As a consequence, λP λP is non-singular. Moreover, assume |λt |∞ ≤ λ̄, for
1 ≤ t ≤ T.
Theorem 6.1 (Bias of the Synthetic Controls Estimator). Under Assumptions 6.1, 6.2
and 6.3, for t ∈ {T0 + 1, . . . , T }:
|Eb
τt − τt | → 0.
T0 →+∞
Remark 6.4. It shows that the bias of the synthetic controls estimator goes to zero as
the number of pre-treatment period increases. It says nothing about, for example, the `1
or `2 -consistency, especially because in the proof below E(|R3,t |) = E(|ε1,t − ε2,t |) does not
decrease with T0 . Indeed, we only observe one treated unit, hence there is a non-vanishing
variance.
Proof of Theorem 6.1. Using the factor specification, for any t = 1, ..., T :
" nX
#
0 +1
τbt = Y1,t (1) − Y1,t (0) + Y1,t (0) − ωi∗ Yi,t (0)
i=2
" nX
# " #0 " # " #
0 +1 nX
0 +1 nX
0 +1 nX
0 +1
where the last line comes from (ADDING-UP) and perfect matching of the synthetic unit,
Assumption 6.2. Now, consider the pre-treatment outcomes written in matrix notations.
YiP is the T0 × 1 vector of pre-treatment outcomes for unit i with t-th element equal to
Yi,tobs . εP
i is the T0 × 1 vector of pre-treatment transitory shocks. Notice that because
79
From equation (6.2), using Assumption 6.3:
" nX
# " # " #
0 +1 −1 nX0 +1 −1 nX
0 +1
0 0 0 0
µ1 − ωi∗ µi = λP λP λP Y1P − ωi∗ YiP − λP λP λ P εP 1 − ωi∗ εP
i .
i=2 i=2 i=2
(6.3)
Equation (6.3) above helps understanding the nature of the synthetic controls method-
ology: the quality of the approximation of the factor loadings of the treated, µ1 , by the
synthetic unit depends on the distance between the treated pre-treatment outcomes and
the synthetic unit’s. This observation advocates for including the pre-treatment outcomes
in the (SYNTH) program, and constitutes the crucial point of the theorem. Furthermore,
because Assumption 6.2 holds, the first term in equation (6.3) vanishes, so we have a nice
decomposition of the bias for t > T0 by plugging equation (6.3) inside equation (6.1):
nX
" #
−1 0 +1 −1 nX0 +1
0 0 P0 P 0
τbt − τt = λ0t λP λP λP ωi∗ εP 0
i − λt λ λ λ P εP
1 + ε1,t − ωi∗ εi,t .
i=2 | {z } i=2
| {z } :=R2,t | {z }
:=R1,t :=R3,t
When t > T0 , R2,t and R3,t have mean zero thanks to Assumption 6.1. This is not the
case for R1,t because there is no reason to think that εi,t and ωi∗ are independent for t ≤ T0
since ωi∗ depends on Y1P , . . . , YnP0 +1 , and as consequence on εP P
1 , . . . , εn0 +1 . Rewrite:
nX
!−1
0 +1 −1 nX
0 +1 T0 T0
P0 P 0
X X
R1,t = ωi∗ λ0t λ λ λP εP
i = ωi∗ λ0t λt λ0t λs εi,s .
i=2 i=2 s=1 t=1
P −1
T0
By Cauchy-Schwarz inequality, since t=1 λt λ0t is symmetric and positive-definite,
and using Assumption 6.3:
T0
!−1 2 T0
!−1
T0
!−1
2
F λ̄2
X X X
λ0t λt λ0t λs ≤ λ0t λt λ0t λt λ0s λt λ0t λs ≤ .
t=1 t=1 t=1
T0 cξ
PT0 P −1
0 T0
Define ε̃i := s=1 λt t=1 λt λ0t λs εi,s . Using Assumption 6.1 and Holder’s inequality:
nX
!1/m !1/m
0 +1 nX
0 +1 nX
0 +1
80
And by Rosenthal’s inequality (Lemma 6.2), with some constant C(m) defined in the
statement of the inequality:
!m/2
2
m T0 T0
F λ̄ X X
E|ε̃i |m ≤ C(m) max E|εj,t |m , E|εj,t |2 .
T0 cξ t=1 t=1
From the equation above and (6.4), and using Assumption 6.1:
!
F λ̄2 (E|εi,t |m )1/m
1/m σ
E|R1,t | ≤ C(m)1/m n0 max 1−1/m
,√ .
cξ T0 T0
81
Figure 6.1: Voter Turnout in the US and EDR Laws
Treated
Pen. Synthetic Control
.95 Confidence Interval
Other States
80
70
Turnout %
60
50
40
Note: The .95 confidence intervals are computed by inverting Fisher Tests. 10,000 permutations are used.
The dashed purple line is the average turnout per election for the 38 nontreated States.
82
the nine synthetic treated states, computed using the synthetic control method state-by-
state. The variables taken into account are the pre-treatment outcomes, that is to say
all the turnout rates in the presidential elections from 1920 to 1972. It is interesting to
have a look at how many untreated states receive a positive weight in the synthetic units.
For the first wave, 6 to 8 non-zero untreated units per synthetic unit receive a positive
weight, while 3 to 4 non-zero untreated units per synthetic unit for the second and 2
untreated units per synthetic unit for the third.
The synthetic unit closely reproduces the behavior of the treated states turnout rates
before the treatment. The treatment effect is given by the difference between the solid
black line and the dashed red line. Using the methodology we will see in the next section,
the impact is positive and significant at 5% for every post-treatment election, as the black
line is outside of the confidence bands.
Remark 6.5 (Sparsity of the synthetic control solution). In most cases, n0 , the number
of control units, is larger than p, the number of pre-treatment characteristics. As a
consequence of this observation and of the constrained optimization problem (SYNTH)
s.t. (NON-NEGATIVITY) and (ADDING-UP), the solution found often happens to be
sparse, i.e. kω ∗ k0 << n0 . It is the case in this example. Theorem 1 in Abadie and L’Hour
(2019) shows that under mild regularity conditions kω ∗ k0 ≤ p + 1. We also show that
a necessary condition for ωi∗ > 0 to happen, is that control unit i is connected to the
treated unit in a particular tessellation of the cloud of points defined by the columns of
(Xtreat , Xc ), called the Delaunay triangulation.
When is using the synthetic control method a good idea? Abadie (2019) gives a few
guidelines that helps ruling out the use of the synthetic control method. First of all, the
treatment effect should be large enough to be distinguishable from the volatility in the
outcome. Notice that an increasing volatility in the outcome also increases the risk of over-
fitting. In cases where the volatility in the outcome is too large, it can be a good idea to
filter the time series beforehand. Second, a donor pool should be available and comprised
of units that have not undergone the same treatment or any other idiosyncratic shock that
83
may bias the results. They should also be similar enough in terms of characteristics to
the treated so as to warrant a comparison. Third, there should be no anticipation about
the policy from the agents. The effect of the policy may take place before its formal
implementation if forward-looking agents react in anticipation, when it is announced.
In that case, it is possible to back-date the intervention. Fourth, no spillover effects.
If spillovers from a policy are substantial and effect units that are geographically close,
selecting them as part of the donor pool may not be a good idea. Fifth, the “convex
hull condition”. The synthetic unit is a convex combination of the units belonging to
the donor pool. Hence, the synthetic outcome lies within the convex hull of the donor
pool’s outcomes. Once constructed, the researcher should check that the synthetic unit
characteristics are close enough to that of the treated unit. In some cases, the treated
unit is so peculiar that a credible counterfactual may not be constructed. Sixth, there
should be enough post-intervention dates so as to detect an effect that may take time to
appear.
When these conditions are met, the synthetic control method offers a few advantages.
The first is the absence of extrapolation: the counterfactual cannot take a value larger
than the maximum or lower than the minimum value observed in the donor pool at each
date. Synthetic control estimators preclude extrapolation, because the weights are non-
negative and sum to one. The second is the transparency of the fit. Echoing the “convex
hull condition” from the previous paragraph, it can be checked whether the treated can
or cannot be reproduced by looking at the discrepancy kXtreat − Xc ω ∗ k2V . Third, it offers
a safeguard against specification search. Indeed, only the pre-treatment characteristics
are needed to compute the weights, so they can be computed even before the treatment
takes place so as to not indulge in specification search to obtain desired results. Fourth,
the sparsity of the solution facilitate the qualitative interpretation of the counterfactual.
Notice that when defining the quantity of interest, τt , we have not used any expectation
as is done e.g. in Abadie et al. (2010). This is because the synthetic controls method is
developed in a framework where we observe not a random sample of individuals from a
84
super-population but aggregate data. Hence uncertainty does not come from sampling (or
at least, it is negligible), but comes from treatment assignment. An illustrative example
of this point is Card (1990) which uses the Mariel Boatlift as a natural experiment to
measure the effect of a large and sudden influx of migrants on wages and employment
of less-skilled native workers in the Miami labor market. Between April and October
1980, around 125,000 Cubans ran away from Fidel Castro and sought asylum in Florida,
which increased Miami labor force by 7%. Card uses individual level data from the
Current Population Survey (CPS) for Miami and four comparison cities (Atlanta, Los
Angeles, Houston and Tampa-St. Petersburg) to run an analysis similar to difference-in-
differences. Here is a shortened version of one table from the paper:
Do we really believe that there is this much uncertainty around the unemployment
rate at the city level? Anecdotal evidence suggests the standard-error for the French
unemployment rate is close to .2. The moral of the story is that if the aggregate data
we are dealing with are expressed per capita, Yi,tobs is probably already an average over
a sufficiently large sample to apply the LLN. For example, Yi,tobs = Ūi,t ≈ E(Uk,it ) where
Uk,it is a Bernoulli variable equal to one when individual k is unemployed at time t in city
i. For more on this subject, see e.g. Abadie et al. (2018). However, this observation does
not necessarily rule out the use of the asymptotic framework, since we can always assume
that the “realized” unemployment rate of a city comes from a super-population anyway,
but it justifies the use of another type of inference in the synthetic control methodology.
85
6.5.1 Permutation Tests in a Simple Framework
For this paragraph, we leave the synthetic controls framework and introduce we is refered
to as Fisher Exact P-Values which are permutation tests (see also Chapter 5 in Imbens
and Rubin, 2015). Recall a simple one-period RCT framework where we observe an iid
sequence (Di , Yiobs )i=1,...,N with:
Yi (0) if Di = 0
Yiobs = Yi (Di ) =
Yi (1) if Di = 1
(Yi (0), Yi (1)) ⊥ Di . Denote the missing outcome by Yimis := Yi (1 − Di ). Fisher tests
allow to test the sharp null hypothesis of a constant treatment effect for everybody. The
sharp null hypothesis writes for a constant C:
1 X nb b obs )
o
p(C) := 1 θ(D π ) ≥ θ(D
|Π| π∈Π
86
1. For b = 1, ..., B, draw a new permutation of the treatment assignment D b , compute
the statistics θ(D
b b ) using H0 (C).
ϕα = 1 {b
p(C) ≤ α} . (6.5)
Lemma 6.1 (Level of the Test). Suppose that D obs = (D1 , . . . , DN ) is exchangeable with
respect to Π under H0 (C). Then, for α ∈ (0, 1), the test in equation (6.5) is of level α,
i.e. under H0 (C):
P [p(C) ≤ α] ≤ α.
see that:
n o
b obs ) > θb(k) ,
1 {p(C) ≤ α} = 1 θ(D
for k = d(1 − α) × |Π|e. Because Π forms a group, the randomization quantiles are
invariant:
b π )(k) = θb(k) , for all π ∈ Π,
θ(D
therefore
X n o X n o
1 θ(D b π )(k) =
b π ) > θ(D b obs ) > θb(k) ≤ α|Π|.
1 θ(D
π∈Π π∈Π
n o n o
b obs ) > θb(k) has the same distribution as 1 θ(D
By exchangeability, 1 θ(D b π )(k)
b π ) > θ(D
87
Let us illustrate this intuition by a short simulation exercise. Let π := P(D = 1) = .2,
N = 200 and simulate Y | D ∼ N (τ0 D, 1) for τ0 ∈ {0, .75}. The statistics under scrutiny
is the absolute value of the difference between the ATE estimator and C:
1 X 1 X
θb = Yiobs − Yiobs − C .
N1
D =1
N − N1 D =0
i i
N = 2 0 0 ; p i = . 2 ; B = 1 0 0 0 0 ; mu = 0 . 7 5
Figure 6.2 displays a case for which H0 (0) is false on the left-hand panel, and true on
the right-hand panel. In the first case, the observed statistics is located in the tail of the
distribution of the estimates computed under a random permutation. In that sense, the
observed effect is abnormally large. In the second case, the observed effect is in the belly
of the distribution, making it insignificant. P-values given below the chart quantify the
conclusion associated with Fisher tests.
From the tests presented in the previous section, we can construct confidence intervals by
exploiting the duality between confidence intervals and tests. The intuition to construct
88
Figure 6.2: A Very Simple Fisher Test: H0 : C = 0 false vs. H0 : C = 0 true
τ=.5 τ=0
2 2
density
density
1 1
0 0
Note: This is the histogram over all simulated permutations. The solid purple line is the observed statistics.
On the left-hand-side, pb = 0.006 (θb = .671), on the right-hand-side, pb = 0.745 (θb = −0.50) for the sharp
null hypothesis of no treatment effect.
a confidence interval of level 1 − α is the following: perform the test of the hypothesis
H0 (C) at the level α ∈ (0, 1) and include in the confidence interval any value of C for
which H0 (C) is not rejected. More formally, the confidence interval will be defined as:
Denote by C0 the true treatment effect, we have that the probability that this interval
contains C0 is given by:
P [IC1−α 3 C0 ] = P [b
p(C0 ) > α] = 1 − P [b
p(C0 ) ≤ α] ≥ 1 − α,
Figure 6.3 shows the p-value as a function of C for both cases τ0 ∈ {0, .75}. In the first
case, τ0 = .75, the 95% confidence interval is approximately [.3, 2.7]. In the second case,
τ0 = .5, the 95% confidence interval is approximately [−.75, .55]. Notice that contrary to
asymptotic or Normal distribution-based confidence intervals, they are not symmetric.
89
Figure 6.3: p-value as a function of C
0.75 ● ●
●
●
0.50 ●
p−value
●
●
●
●
●
● ●
0.25 ●
●
● ●
●
●
● ●
● ●
● ●
● ●
● ●
● ●
● ●
● ●
●
● ●
● ● ●
● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ●
0.00 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
−1 0 1 2
C
Note: The blue curve is the first case, τ = .75, the red is the second τ = 0. The horizontal purple line is
located at y = .05.
We adapt the previous tests to the synthetic control framework where there are multiple
periods of time. We want to test the assumption:
n1
!2 , n1
!2
X X X X
τbit τbit , (6.6)
t∈T1 i=1 t∈T0 i=1
Pn ∗ ∗
where τbit = Yi,t − j=n1 +1 ωi,j Yj,t , for i = 1, . . . , n1 and ωi,j is the weight given to the
untreated state j in the synthetic unit that reproduces treated state i. How much larger
is that ratio when the nine treated states are indeed treated compared to the same
90
ratio when we consider any othe nine states ? The p-values obtained from randomized
inference (B = 10, 000) using the MSPE ratio 5 × 10−3 . Figure 6.1 also displays the
pointwise confidence interval based on inverting the Fisher test of no effect for each date.
− ni=2
obs
P 0 +1 ∗ obs
Y1,t ω Y if t ≤ T0 ,
u
bt = obs 0
Pni0 +1i,t ∗ obs
Y1,t − τt − i=2 ωi Yi,t if t ≥ T0 + 1.
1 X
pb = 1 {S(b
uπ ) ≥ S(b
u)} .
|Π| π∈Π
91
L’Hour (2019) introduce a penalty term in (SYNTH) and find a synthetic unit for each
treated (indexed by treat here). So if there are n1 treated, for each treat = 1, . . . , n1 , the
synthetic control weights solve:
nX
1 +n0
∗
ωtreat (λ) = arg min kXtreat − Xc ωk2V +λ ωtreat,i kXtreat − Xi k2V (PENSYNTH)
i=n1 +1
n1
" nX
#
0 +n1
1 X
τbt (λ) := Y obs − ω∗ (λ)Yi,tobs .
n1 treat=1 treat,t i=n +1 treat,i
1
Theory and ways to choose λ are developed in Abadie and L’Hour (2019). The resulting
estimator also reduces the risk of large interpolation bias by not averaging units that are
very far away from each other.
On the other hand, Ben-Michael et al. (2019) propose a so-called partially pooled
synthetic control
n1
ν 1 X
ω1∗ (ν), . . . , ωn∗ 1 (ν) kXtreat − Xc ωtreat k2V
= arg min
2 n1 treat=1
p
" n1
#2
1−ν1X 1 X
+ Xtreat,j − Xc,j ωtreat ,
2 p j=1 n1 treat=1
(PART-POOLEDSYNTH)
92
6.7 Conclusion and Extensions
The synthetic controls method:
– The treatment effect (ATET) is estimated as the difference of the outcomes between
the treated and the synthetic unit,
93
6.8 Supplementary Application: The California To-
bacco Control Program
In January 1989, the state of California enforced Proposition 99, one of the first large-scale
tobacco control programs, by increasing cigarette tax by 25c/pack, funneling tax revenues
to anti-smoking budgets, etc. What was its effect on per capita tobacco consumption?
This is the iconic example of the synthetic controls method where the goal is to create a
synthetic California by reweighting states that did not change tobacco legislation. Xtreat
and Xc contain retail price of cigarettes, log income per capita, percentage of population
between 15-24, per capita beer consumption (all 1980-1988 averages), 1970 to 1974, 1980
and 1988 cigarette consumption. All details about this application are given in Abadie
et al. (2010). All the tables and figures in this section are taken directly from the article.
Figure 6.4 is taken from Abadie et al. (2010) and represents cigarette consumption in
California and in 38 other states that did not implement any anti-tobacco reform. It is
obvious that the CTA does not hold. Figure 6.5 compares the same data for California
and its synthetic doppelganger which has not implemented the treatment. The synthetic
unit is composed of Utah (33.4%), Nevada (23.4%), Montana (19.9%), Colorado (16.4%)
and Connecticut (6.9%). And Table 6.6 compares the characteristics of California, its
synthetic unit and the 38 other states. The synthetic unit closely reproduces the behavior
of California tobacco consumption before the treatment. The treatment effect is given
94
Figure 6.5: Proposition 99: California vs. Synthetic California
by the difference between the solid black line and the dashed line. Proposition 99 has
entailed a consumption decrease of an estimated 30 packs per capita in 2000.
We adapt the previous tests to the synthetic control framework. How much larger is
the RMSPE when California is treated compared to the RMSPE when another state is
treated ?
95
Figure 6.7: Treatment estimate for CA and 38 control states
Note: Some extreme states cannot be reproduced by a synthetic control. Source: Abadie et al. (2010).
Note: The most extreme states have been discarded. Source: Abadie et al. (2010).
" n #m/2
n
X X
E|Sn |m ≤ C(m) max E|ξi |m , E|ξi |2 ,
i=1 i=1
96
Figure 6.9: Treatment estimate for CA and 19 control states
Note: Among States that are well reproduced, CA is the most extreme. p-value is equal to .05! (1/(19 + 1))
Source: Abadie et al. (2010).
97
6.10 Exercises and Questions
The purpose of this exercise is to show that regression adjustment to estimate treatment
effect is using a weighted mean of the control group outcomes to construct the counter-
factual. The setting is an iid sample of the vector (Yiobs , Di , Xi ) for i = 1, ..., n. Yiobs is the
outcome variable which value depends on whether the unit i is treated or not. If unit i is
treated (Di = 1), then Yiobs = Yi (1). If unit i is not treated (Di = 0), then Yiobs = Yi (0).
Xi is a vector of covariates of dimension p which includes an intercept. The index i is
dropped when unnecessary. Define π := P(D = 1). The quantity of interest is the ATET
defined by:
τ AT ET = E [Y (1) − Y (0)|D = 1] .
τ AT ET = E Y obs |D = 1 − E W0 Y obs |D = 0 .
where W0 is random variable that depends on both X and D. Assume that the outcome
without the treatment follows a linear model: Y (0) = X 0 β0 +ε with (D, X) ⊥ ε and Eε =
0. Assume E ((1 − D)XX 0 ) is non-singular. The Oaxaca-Blinder procedure estimates the
ATET in two steps. The first one consists in estimating β0 . The second step estimates the
i = Yiobs −Xi0 β,
ATET as a simple mean of the residuals computed on the treated group as b b
98
5. From the previous questions, suggest an estimator of the ATET of the form:
1 X obs X
τbOB := Yi − ωi Yiobs
n1 i:D =1 i:D =0
i i
Give the expression of the weights in that case. Do the weights ωi sum to one?
99
100
Chapter 7
To motivate the analysis of heterogeneous treatment effects (TE), we consider the optimal
policy learning problem of Bhattacharya and Dupas (2012). This is only an introduction,
with a simplified set-up to the topic of optimal policy learning and we refer to Kitagawa
and Tetenov (2017a), Kitagawa and Tetenov (2017b), and Athey and Wager (2017) for a
more general case and precise results.
Assume the government has limited budget resources to subsidize health and educa-
tion. Thus, they want to determine a fixed allocation of treatment resources to a target
population, while maximizing the population mean outcome.
Let Y be the observed individual level outcome and D a binary treatment, which
should be determined by a specific decision rule. The researcher also have information X
about individuals which should lead the allocation decision. Following the classical set
up, we denote by Y = DY1 + (1 − D)Y0 , where Y0 and Y1 are the values of the potential
outcome Y without and with treatment. We make the uncounfoundeness Assumption
7.1, which states that all variation in D is random once the covariates X are fixed. In
particular, this prevents the case where those with the largest anticipated effects based
on some unobservables are more likely to be treated.
D ⊥ (Y0 , Y1 ) | X, (7.1)
101
Namely, we assume that the treatment status is independent of the potential out-
comes conditional on observed individual characteristics, like in a randomized control
trial (RCT). A simple model of the planner’s problem consists in determining a map
p : X → [0, 1], where X is the support of X, which denotes the probability that a
household with characteristics X is assigned to the treatment, such that the welfare is
maximized
Z Z
max (µ1 (x)p(x) + µ0 (x)(1 − p(x))) dFX (x) = (µ0 (x) + τ (x)p(x)) dFX (x)
p(·) x∈X x∈X
(7.2)
Z
s.c. c ≥ p(x)dFX (x),
x∈X
where τ (x) := E [Y1 − Y0 | X = x], µj (x) := E [Yj | X = x], j ∈ {0, 1}, and c is the
fraction of individuals that can be treated (which is assumed to be proportional to the
budget constraint). Under 7.1, the conditional average treatment effect (CATE) τ (·) can
be decomposed
τ (x) = µ1 (x) − µ0 (x).
Assume that (i) for some δ ≥ 0, we have P(τ (X) > δ) > c (the constraint is relevant)
and (ii) that Fτ (X) is increasing and τ (X) has bounded support with Lebesgue den-
sity bounded away from zero. Assumption (ii) yields that γ ∈ X 7→ P (τ (X) ≥ γ) is
decreasing, which yields that a solution for (7.2) takes the form
Z
p(X) = 1 {τ (X) ≥ γ} , γ s.t. c = p(x)dFX (x).
x∈X
Here, γ is the (1 − c) quantile of the random variable τ (X), which is unique when Fτ (X)
is increasing. This problem illustrates simply that the knowledge of the heterogeneous
treatment effect τ (·) would ideally allows one to assign the treatment in an efficient way.
102
We consider here a “plug-in” approach: estimating the CATE to find the optimal
policy. They are several alternatives and extensions. In particular, Kitagawa and Tetenov
(2017a), Kitagawa and Tetenov (2017b), and Athey and Wager (2017)) consider the
welfare maximization under the constraint that the policy should belong to a restricted
class with limited complexity, which is closer to what is needed in applications. They
show that one can estimate the optimal policy at a parametric rate given restriction on
the complexity of the class of policies whereas the CATE estimators does not reach this
rate.
Remark 7.1. Note that the dual formulation of this problem is also of interest: one
aims at estimating the minimum cost to achieve a given average outcome value in the
population
Z
min p(x)dFX (x) (7.3)
p(·) x∈X
Z
s.c. (µ1 (x)p(x) + µ0 (x)(1 − p(x))) dFX (x) = b,
x∈X
Remark 7.2. Note also that this set up can handle heterogeneous cost of treatment
across the population h(·) which yields to the following modified budget constraint
Z
c= h(x)p(x)dFX (x),
x∈X
103
The problem here comes from the fact that the researcher (i) often does not know all
subgroups (i.e. the sets of interactions among covariates) that are of interest (ii) would
want to do several tests of this type to identify the subpopulation of interest. But suppose
that each test is conducted at level α, then the chance that at least one rejection occurs is
much higher that α. This limits the use of inference. Despite existence of corrections for
this multiple testing problem (see Bonferroni correction, that do not take into account
correlation between the events, or List et al. (2016) and Hsu (2017) for two different
types of less conservative strategies), they still require to specify the hypotheses of the
test, thus limiting the potential characterization of heterogeneous treatment effect.
One way to circumvent this problem is to use a nonparametric estimator for the
heterogeneous treatment effect. In particular, the use of a special kind of random forests,
developed in Section 7.2.1 allows to handle non-linear relationship between covariates
without a priori.
In our econometric context, we face the fundamental problem of causal inference: we never observe both Y0 and Y1 for the same individual. Thus, one cannot directly use machine learning methods to estimate the treatment effect.
Different Methods. To handle this potential outcome problem, under the selection on observables assumption, there are mainly two kinds of methods:
1. Transformed outcome: the first type of methods uses the transformed outcome
\[
H = \frac{DY}{p(X)} - \frac{(1-D)Y}{1 - p(X)},
\]
which satisfies E[H | X = x] = τ(x) under selection on observables, so that τ(·) can be estimated by regressing H on X with any machine learning method; this, however, requires an estimate of the propensity score p(·).
2. Conditional mean regression: the second type of methods uses the fact that, under the selection on observables assumption, τ(x) = E[Y | D = 1, X = x] − E[Y | D = 0, X = x] = µ1(x) − µ0(x). Thus an estimator of τ(·) based on separate machine learning estimators of µ1(·) and µ0(·) does not require estimating the propensity score. The problem here is that biases in the separate estimations of µ1(·) and µ0(·) can accumulate and lead to a large and unpredictable bias in the estimation of τ(·). In the next section, we show how to estimate them jointly, using a criterion adapted to our object τ(·).
We now recall the main properties of random trees and forests, then describe causal random forests.
One sample trees. We first describe how to grow a decision tree to estimate µ(x) = E[Y | X = x] from an i.i.d. sample (Wi)_{i=1}^n := (Yi, Xi)_{i=1}^n using recursive partitioning. For a more general introduction, please refer to the lecture notes of Statistical Learning by Arnak Dalalyan, to Hastie et al. (2009), Chapter 15, for a broader reference, or to Biau and Scornet (2016) for a more advanced survey. A decision tree method produces an
and Scornet (2016) for a more advanced survey. A decision tree method produces an
adaptive weighting αi (x) to quantify the importance of the i-th training sample Wi for
understanding the test point x:
\[
\hat{\mu}(x) = \sum_{i=1}^{n} \alpha_i(x)\, Y_i, \qquad \text{where } \alpha_i(x) := \frac{\mathbf{1}\{X_i \in L(x)\}}{|\{i : X_i \in L(x)\}|}, \quad (7.4)
\]
where L(x) is the “leaf” into which the point x falls. In other words, (7.4) is a local average of all the Yi corresponding to an Xi falling “close” to x (in the same leaf). The leaves constitute a partition of the feature space X that maximizes an overall (infeasible) segmentation criterion. Given a cell A ⊆ X, each node of the tree partitions the cell into two child nodes A1, A2. For a random-split tree, this is done in the following way:
(a) draw a covariate j ∈ {1, . . . , p} according to some distribution; (b) use a segmentation test of the type X^j ≥ s, where s is chosen in order to maximize the heterogeneity between the two child nodes A1, A2 (see below). This recursive algorithm can thus be described in the following way:
1. Initialization: initialize the list of cells to be split as the root of the tree, A = (X), and the final tree A_final as an empty list.
2. While A is not empty, take a node N in A: IF N satisfies the stopping criterion (splitting it would leave fewer than a minimal number n0 of observations in a child node), remove N from A and add it to A_final; ELSE choose randomly a coordinate in {1, . . . , p}, choose the best split s in the segmentation test and create the two child nodes by cutting N. Then remove the parent node N and add the child nodes to the list: A = A − {N} + {child nodes A1 and A2}.
This algorithm (similar to the original one by Breiman (2001)) is illustrated on an example with n0 = 2 in Figures 7.1 and 7.2. To compute the prediction in a “one sample” decision tree, we simply take the average of the outcomes Yi of the observations that fall into the leaf L(x). Note that an additional regularization step, called pruning, can be added to the above algorithm to cut leaves according to some criterion. This is not necessary in our context, and was not used in the original formulation of Breiman (2001).
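As an illustration, here is a minimal R sketch of a single regression tree computing leaf averages as in (7.4), using the rpart package; the data-generating process below is a hypothetical example:

library(rpart)

# Hypothetical example data
set.seed(1)
n <- 500
X <- data.frame(x1 = runif(n), x2 = runif(n))
y <- sin(4 * X$x1) + X$x2 + rnorm(n, sd = 0.3)

# Grow a deep tree with a minimal leaf size of 2 observations and no pruning
tree <- rpart(y ~ x1 + x2, data = cbind(y, X),
              control = rpart.control(minbucket = 2, cp = 0))

# Predictions are leaf averages of the training outcomes, as in (7.4)
x_new <- data.frame(x1 = 0.5, x2 = 0.5)
predict(tree, newdata = x_new)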
Figure 7.1: Decision tree algorithm: phases 1 and 2
Note: Example of a random-split tree with stopping criterion the minimal number of observations equal to 2 in each leaf.
Figure 7.2: Decision tree algorithm: phases 3, 4 and evaluation
Note: Example of a random-split tree with stopping criterion the minimal number of observations k = 2 in each leaf.
Double sample trees (or “honest” trees). Here we follow Athey and Imbens (2016) and Athey and Wager (2017), and refer to these papers for more details. To perform causal inference, we rely on the “honesty” property of the tree: an honest tree does not use the same sample to place the splits and to evaluate the value of the estimator on the leaves.
We now study the properties of double sample trees, built in this way:
1. Draw a subsample of size s from {1, . . . , n} without replacement and divide it into two disjoint sets I and J with sizes |I| = ⌊s/2⌋ and |J| = ⌈s/2⌉.
2. Grow the tree via recursive partitioning, with splits chosen using the J sample (i.e. without using the Y-observations from the I sample), and compute the leaf averages using the I sample (see the R sketch below).
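A minimal R sketch of an honest regression forest using the grf package, which implements this subsample splitting internally; the simulated data are a hypothetical example:

library(grf)

set.seed(1)
n <- 2000; p <- 5
X <- matrix(runif(n * p), n, p)
Y <- sin(4 * X[, 1]) + X[, 2] + rnorm(n, sd = 0.3)

# honesty = TRUE (the default) splits each subsample into two halves:
# one to place the splits, one to compute the leaf averages.
forest <- regression_forest(X, Y, honesty = TRUE, sample.fraction = 0.5)
x_new  <- matrix(0.5, 1, p)
predict(forest, x_new)$predictions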
Double sample random forests (bagging). As a final step, we aggregate trees trained over all possible subsamples of size s of the training data:
\[
\hat{\mu}(x; Z_1, \ldots, Z_n) = \binom{n}{s}^{-1} \sum_{1 \le i_1 < \cdots < i_s \le n} \mathbb{E}_{\xi \sim \Xi}\big[ T(x; \xi, Z_{i_1}, \ldots, Z_{i_s}) \big], \quad (7.5)
\]
where ξ summarizes the randomness in the selection of the variables when growing the tree, Z_i := (D_i, X_i, Y_i), and \binom{n}{s} is the number of subsets of s elements among n. The estimator in equation (7.5) is evaluated using Monte Carlo methods: we draw without replacement B samples of size s, (Z^*_{i_1}, \ldots, Z^*_{i_s}), and consider the approximation (7.6) of (7.5):
\[
\hat{\mu}(x; Z_1, \ldots, Z_n) \approx \frac{1}{B}\sum_{b=1}^{B} T(x; \xi_b^*, Z^*_{b,1}, \ldots, Z^*_{b,s}), \quad (7.6)
\]
where the base learner is
\[
T(x; \xi_b^*, Z^*_{b,1}, \ldots, Z^*_{b,s}) = \sum_{i \in \{i_{b,1},\ldots,i_{b,s}\}} \alpha^*_{i,b}(x)\, Y^*_{i,b},
\qquad
\alpha^*_{i,b}(x) = \frac{\mathbf{1}\{X^*_{i,b} \in L^*_b(x)\}}{|\{i : X^*_{i,b} \in L^*_b(x)\}|}. \quad (7.7)
\]
This aggregation strategy, called bagging, reduces the variance of the estimator of µ (see e.g. Bühlmann et al. (2002) for an analysis). Note that the “honesty” property consists in making the weights α^*_{i,b}(x) in (7.7) independent of Y^*_{i,b}.
Bias and honesty of the regression random forest We still consider i.i.d observa-
tions (Yi , X i )ni=1 and show the consistency of an estimator µ̂(·) of µ(·) = E [Yi |X i = ·].
Definition 7.1 (Diameter of a leaf). The diameter of the leaf L(x) is the length of the
longest segment contained inside L(x), which we denote by Diam(L(x)).
The diameter of the leaf L(x) parallel to the j-th axis is the length of the longest segment
contained inside L(x) parallel to the j-th axis, which we denote by Diamj (L(x)).
To ensure consistency, we need to enforce that the leaves become small in all direc-
tions of the feature space X as n (thus s) gets large: Diam(L(x)) → 0 as s → ∞ (see
Lemma 7.1). To do so, we enforce randomness in the selection of variables at each step
(random-split tree). We need the following assumptions.
Assumption 7.2. - Random-split tree: the probability that the next split occurs
along the j-th feature is bounded from below by δ/p, 0 < δ ≤ 1.
- α−regular: from the I sample, each split leaves at least a fraction α of the observations of the parent node on each side of the split.
- Honest tree: the samples used to build the nodes and to evaluate the estimator on
the leaves are different.
The minimum leaf size k is a regularisation parameter that has to be fixed by the
researcher. In practice she can use cross-validation to choose k. The following lemma is
key, but relies on a very strong assumption for the distribution of the covariates.
Lemma 7.1 (Control in probability of the leaf diameter in uniform random forests, Lemma 2 in Wager and Athey (2017)). Let T satisfy Assumption 7.2 and X_1, . . . , X_s ∼ U([0, 1]^p) independently. Let 0 < η < 1; then, for s large enough,
\[
\mathbb{P}\left( \mathrm{Diam}_j(L(x)) \ge \left(\frac{s}{2k-1}\right)^{-\alpha_1 \delta/p} \right) \le \left(\frac{s}{2k-1}\right)^{-\alpha_2 \delta/p},
\]
where α_1 and α_2 are positive constants depending only on α and η, and where we denote by
- c_j(x) the number of splits leading to L(x) along the j-th axis.
Then, using that T is α-regular, the number of observations in L(x) is at least s·α^{c(x)} (with 0 < α < 1), which must be less than or equal to 2k − 1. This yields
\[
c(x) \ge c_0 := \frac{\log\big(s/(2k-1)\big)}{\log(1/\alpha)},
\]
so the minimal total number of splits leading to L(x) is c_0, and at each of these nodes the probability of drawing the j-th coordinate is bounded from below by δ/p. Then, we use the multiplicative Chernoff bound
\[
\mathbb{P}\big( c_j(x) \le (1-\eta)\mu_0 \big) \le e^{-\eta^2 \mu_0/2}, \quad (7.8)
\]
where µ_0 := δ c_0/p is a lower bound on the expected number of splits along the j-th axis. Finally, Wager and Walther (2015) show that, when the covariates are uniformly distributed, the diameter of the leaf L(x) along the j-th axis is related to the number of observations in the leaf.
Lemma 7.2 (Consistency of the double sample random forests, Theorem 3 in Wager
and Athey (2017)). Consider T satisfying Assumption 7.2, x 7→ µ(x) that is Lipschitz
continuous, α ≤ 0.2, then the bias of the random forest at x ∈ X is bounded by
Proof of Lemma 7.2. Start from the definition (7.5) which yields E [µ̂(x)] = E [T (x; Z i )].
Then, define µ̃(x) like µ̂(x), replacing α_i(x) by
\[
\tilde{\alpha}_i(x) = \frac{\mathbf{1}\{X_i \in L(x)\}}{s\,|L(x)|},
\]
where |L(x)| is the Lebesgue measure of the leaf L(x). Using the honesty assumption (i.e. that Y is independent of L(x)) for the third equality and that P(X_i ∈ L(x) | L(x)) = |L(x)| for the fourth equality, we obtain
\[
\mathbb{E}[\hat{\mu}(x)] - \mu(x) = \mathbb{E}\big[\, \mathbb{E}[Y \mid X \in L(x)] - \mathbb{E}[Y \mid X = x] \,\big].
\]
Then, using the fact that the diagonal length of a unit hyper-cube in dimension p is √p,
\[
\left\{ \mathrm{Diam}(L(x)) \ge \sqrt{p} \left(\frac{s}{2k-1}\right)^{-\alpha_1 \delta/p} \right\} \subset \bigcup_{j=1}^{p} \left\{ \mathrm{Diam}_j(L(x)) \ge \left(\frac{s}{2k-1}\right)^{-\alpha_1 \delta/p} \right\}.
\]
Double sample causal trees. The algorithm for double sample causal trees is similar to that of double sample trees, but the splits are chosen (segmentation criterion) to maximize the variance of
\[
\hat{\tau}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x)
= \frac{1}{|\{i : D_i = 1, X_i \in L(x)\}|}\sum_{i:\, D_i = 1,\, X_i \in L(x)} Y_i
- \frac{1}{|\{i : D_i = 0, X_i \in L(x)\}|}\sum_{i:\, D_i = 0,\, X_i \in L(x)} Y_i. \quad (7.10)
\]
Assumption 7.3. - α−regular: each split leaves at least a fraction α of the available training examples on each side of the split;
- Minimum leaf size k: there are between k and 2k − 1 observations from each treatment group (with D_i = 1 or with D_i = 0) in each terminal leaf of the tree;
- Honest trees: the sample J used to place the splits is different from the sample
I used to evaluate the estimator through (7.10).
Without the honesty property, the splits would be chosen precisely where the estimated treatment effect is high, which would make the treatment effect estimates in those leaves biased.
Remark 7.3 (On the segmentation criterion for double causal forests). If the outcome of the regression τ_i were observed and without splitting the training sample S^tr (as in the regression CART algorithm), then the splits should minimize the empirical counterpart of the squared-error loss (7.11). But, as τ_i is not directly observed, Athey and Imbens (2016) use a criterion that mimics what is done in the CART algorithm. Here, using that the estimators µ̂(X_i) are constant on each leaf L_m by definition, we have, for x ∈ L_m,
\[
\sum_{i \in S:\, X_i \in L_m} \hat{\mu}(X_i)^2
= \sum_{i \in S:\, X_i \in L_m} \hat{\mu}(X_i)\, \frac{1}{|L_m|} \sum_{k \in S:\, X_k \in L_m} Y_k
= \sum_{k \in S:\, X_k \in L_m} \left( \frac{1}{|L_m|} \sum_{i \in S:\, X_i \in L_m} \hat{\mu}(X_i) \right) Y_k
= \sum_{k \in S:\, X_k \in L_m} \hat{\mu}(X_k)\, Y_k.
\]
Thus, (7.11) yields that in the regression context we want to minimize the unbiased estimator of MSE_µ̂(S^te, S^tr, T), which is
\[
\widehat{\mathrm{MSE}}_{\hat{\mu}}\big(S^{tr}, S^{tr}, T\big) = -\frac{1}{|S^{tr}|} \sum_{i \in S^{tr}} \hat{\mu}(X_i; S^{tr}, T)^2.
\]
In the treatment effect context, this leads Athey and Imbens (2016) to consider, by analogy, the maximization of the feasible criterion
\[
-\widehat{\mathrm{MSE}}_{\hat{\tau}}\big(S^{tr}, S^{eval}, T\big) = \frac{1}{|S^{eval}|} \sum_{i \in S^{eval}} \hat{\tau}(X_i; S^{tr}, T)^2,
\]
where the training sample is split into an evaluation sample S^eval and a proper training sample S^tr.
Assumption 7.4 (Regularity conditions for asymptotic normality). The potential outcome samples (X_i, Y_{1,i}) and (X_i, Y_{0,i}) satisfy, for j ∈ {0, 1},
- Var(Y_j | X = x) > 0 and E[ |Y_j − E[Y_j | X = x]|^{2+δ_1} ] ≤ M for some constants δ_1, M > 0, uniformly over all x ∈ [0, 1]^p;
Denote the infinitesimal jackknife estimator (see Efron (2014); Wager et al. (2014))
\[
\hat{V}_{IJ}(x) = \frac{n-1}{n}\left(\frac{n}{n-s}\right)^{2} \sum_{i=1}^{n}\left( \frac{1}{B-1}\sum_{b=1}^{B}\big(\hat{\tau}^*_b(x) - \bar{\hat{\tau}}^*(x)\big)\big(N^*_{i,b} - \bar{N}^*_b\big) \right)^{2}, \quad (7.12)
\]
where N^*_{i,b} indicates whether or not the i-th training example was used for the b-th bootstrap tree, and \bar{N}^*_b and \bar{\hat{\tau}}^*(x) are averages over the B bootstrap trees.
Theorem 7.1 (Asymptotic normality of double sample causal random forests, Theorem 1 in Wager and Athey (2017)). Assume that we have i.i.d. samples Z_i = (X_i, Y_i, D_i) ∈ [0, 1]^p × R × {0, 1}, that the selection on observables assumption (7.1) holds, and that there exists ε > 0 such that ε ≤ P(D = 1 | X) ≤ 1 − ε (overlap condition). Suppose Assumption 7.4 holds and consider a double sample causal random forest satisfying Assumption 7.3 with α ≤ 0.2. Assume that
\[
s = \lfloor n^{\beta} \rfloor, \quad \text{for some } \beta_{\min} := 1 - \left(1 + \frac{p}{\delta}\,\frac{\log(\alpha^{-1})}{\log((1-\alpha)^{-1})}\right)^{-1} < \beta < 1. \quad (7.13)
\]
Then, there exist C(·) and γ > 0 such that the random forest predictions are asymptotically Gaussian:
\[
\frac{\hat{\tau}(x) - \tau(x)}{\sigma_n(x)} \to_d \mathcal{N}(0, 1), \qquad \text{where } \sigma_n^2(x) := \frac{s\, C(x)}{n \log(n/s)^{\gamma}}.
\]
The asymptotic variance σ_n^2(x) can be consistently estimated using the infinitesimal jackknife (7.12).
Several remarks are in order. First, from the restrictions on β, one can make more precise the rate of convergence n^{−1/(1+pα_3/δ)}, which does not allow for a “high-dimensional” case in the sense of the previous sections (p much larger than log(n)). Second, Theorem 7.1 allows one to perform inference, i.e. to test the significance of the treatment effect for a population with covariates x, without a priori knowledge of the specific regions of the feature space to test. Third, Theorem 7.1 uses a very strong assumption on the distribution of the covariates, which is what makes inference possible here.
Remark 7.4 (Key ideas for the proof of Theorem 7.1; can be skipped on first reading). A random forest is a U-statistic, which means that it can be written as
\[
\mu = \binom{n}{s}^{-1} \sum_{1 \le i_1 < \cdots < i_s \le n} T(Z_{i_1}, \ldots, Z_{i_s})
\]
for a bounded function T (see Chapter 12, p. 162, of Van der Vaart (1998) for a detailed exposition). The usual way to prove asymptotic normality of U-statistics is to use the
projection \mathring{\hat{\mu}}(x) of µ̂(x) onto the class S of all statistics of the form
\[
\sum_{i=1}^{n} g_i^x(Z_i), \qquad \text{with } \mathbb{E}\big[(g_i^x(Z_i))^2\big] < \infty,
\]
which is called the Hájek projection. Then, writing µ̂(x) = (µ̂(x) − \mathring{\hat{\mu}}(x)) + \mathring{\hat{\mu}}(x), the proof amounts to showing that µ̂(x) − \mathring{\hat{\mu}}(x) →_p 0 and to applying the CLT to the projection \mathring{\hat{\mu}}(x). More precisely, we use Proposition 7.1 (which is Lemma 11.10 in Chapter 11 of Van der Vaart (1998)).
Proof of Proposition 7.1. The proof follows from the fact that \mathring{T} belongs to S and because we can verify that, for all S ∈ S,
\[
\mathbb{E}\big[(T - \mathring{T})\,S\big] = 0.
\]
Then, an important result (see Theorem 11.2 in Van der Vaart (1998)) states that if the projection \mathring{T} satisfies
\[
\lim_{n\to\infty} \frac{\mathrm{Var}(T)}{\mathrm{Var}(\mathring{T})} = 1, \quad (7.14)
\]
then
\[
\frac{T - \mathbb{E}[T]}{\mathrm{Sd}(T)} - \frac{\mathring{T} - \mathbb{E}[\mathring{T}]}{\mathrm{Sd}(\mathring{T})} \to_P 0.
\]
This is due to the fact that
\[
\mathrm{Var}\left( \frac{T - \mathbb{E}[T]}{\mathrm{Sd}(T)} - \frac{\mathring{T} - \mathbb{E}[\mathring{T}]}{\mathrm{Sd}(\mathring{T})} \right) = 2 - 2\,\frac{\mathrm{Cov}(T, \mathring{T})}{\mathrm{Sd}(T)\,\mathrm{Sd}(\mathring{T})},
\]
and that, using orthogonality,
\[
\mathbb{E}\big[T\,\mathring{T}\big] = \mathbb{E}\big[(T - \mathring{T})\,\mathring{T}\big] + \mathbb{E}\big[\mathring{T}^2\big] = \mathbb{E}\big[\mathring{T}^2\big],
\]
so that Cov(T, \mathring{T}) = Var(\mathring{T}).
Thus, to prove asymptotic normality of the random forest, one could try to show (7.14). However, this does not hold for regression trees. Hence Wager and Athey (2017) rather show a close adaptation of this property (namely that regression trees are ν-incremental), which states that, under the conditions of Theorem 7.1, there exists C_1(·) such that
\[
\liminf_{s\to\infty}\ \log(s)^{p}\, \frac{\mathrm{Var}_s(\mathring{T}^x)}{\mathrm{Var}_s(T^x)} \ge C_1(x). \quad (7.15)
\]
Then, they use that, by independence of the observations and symmetry of the trees T under permutation of their arguments,
\[
\mathbb{E}\big[T(x; Z_{i_1}, \ldots, Z_{i_s}) \mid Z_j = z\big] - \mathbb{E}\big[T(x)\big] =
\begin{cases}
\mathbb{E}\big[T(x; z, Z_2, \ldots, Z_s)\big] - \mathbb{E}\big[T(x)\big] & \text{if } j \in \{i_1,\ldots,i_s\}, \\
0 & \text{if } j \notin \{i_1,\ldots,i_s\};
\end{cases}
\]
hence, denoting for simplicity T(x) := E_{ξ∼Ξ}[T(x; ξ, Z_1, Z_2, . . . , Z_s)] (where the expectation is only over the randomness of ξ, so T(x) depends on Z_1, Z_2, . . . , Z_s), we have
\[
\mathring{\hat{\mu}}(x) = \mathbb{E}[\hat{\mu}(x)] + \sum_{i=1}^{n} \big( \mathbb{E}[\hat{\mu}(x) \mid Z_i] - \mathbb{E}[\hat{\mu}(x)] \big)
= \mathbb{E}[T(x)] + \frac{s}{n} \sum_{i=1}^{n} \big( \mathbb{E}[T(x) \mid Z_i] - \mathbb{E}[T(x)] \big). \quad (7.16)
\]
Moreover, we have
\[
\mathring{T}(x) = \mathbb{E}[T(x)] + \sum_{i=1}^{s} \big( \mathbb{E}[T(x) \mid Z_i] - \mathbb{E}[T(x)] \big). \quad (7.17)
\]
Question: Show that (7.16)-(7.17) yield σ_n^2(x) = s\,\mathrm{Var}_s(\mathring{T}(x))/n.
Let σ_n^2(x) be the variance of \mathring{\hat{\mu}}(x). Lemma 7 in Wager and Athey (2017) shows that
\[
\frac{\mathbb{E}\big[(\hat{\mu}(x) - \mathring{\hat{\mu}}(x))^2\big]}{\sigma_n^2(x)}
\le \left(\frac{s}{n}\right)^{2} \frac{\mathrm{Var}_s(T(x))}{\sigma_n^2(x)}
= \frac{s}{n}\,\frac{\mathrm{Var}_s(T(x))}{\mathrm{Var}_s(\mathring{T}(x))}
\qquad \Big(\text{using } \sigma_n^2(x) = \tfrac{s}{n}\mathrm{Var}_s(\mathring{T}(x))\Big),
\]
which tends to zero by the incrementality property (7.15) and the choice of s in (7.13).
Remark 7.5 (Local centering). Insights from the literature considered in the first two sections (namely Chernozhukov et al. (2017)) led Athey and Wager (2017) to consider a local centering pre-treatment step before estimating causal random forests. Specifically, they show in simulations that estimating the above double sample causal forests with the orthogonalized outcomes
\[
\tilde{Y}_i = Y_i - \mathbb{E}[Y_i \mid X_i], \qquad \tilde{D}_i = D_i - \mathbb{E}[D_i \mid X_i]
\]
(where we use the second equation only if the propensity score is used) improves the performance of the algorithm. In practice, they propose to use random forest estimators for the regression functions in the above equations, that is
\[
\tilde{Y}_i = Y_i - \hat{Y}^{(-i)}(X_i), \qquad \tilde{D}_i = D_i - \hat{D}^{(-i)}(X_i),
\]
where Ŷ^{(−i)}(X_i) and D̂^{(−i)}(X_i) are leave-one-out estimators (random forests evaluated without the i-th observation, which is essentially computationally free). Section 7.2.3 makes these statements more precise, based on Nie and Wager (2017) and Athey et al. (2019).
7.2.2 Simulations
We consider a simulation setup from Wager and Athey (2017), where X_i ∼ U([0, 1]^p), D_i | X_i ∼ Bernoulli(p(X_i)), and the outcome is generated with baseline effect m(·) and treatment effect τ(·). There is no confounding, m(x) = 0 and p(x) = 0.5, but there is heterogeneity in the treatment effect:
\[
\tau(x) = \zeta(x_1)\zeta(x_2), \qquad \text{where } \zeta(u) = 1 + \frac{1}{1 + e^{-20(u - 1/3)}}.
\]
This setting can be implemented via the package causalForest (and randomForestCI for confidence intervals). However, one can also use the package grf (generalized random forest), which uses a gradient-based algorithm and a slightly different criterion (see Section 7.3 for more details) but is more efficient and stable. For the code, see CausalForests.R on the github page of the course.
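A minimal R sketch of this simulation with the grf package, with pointwise variance estimates obtained from its built-in infinitesimal-jackknife-type estimator; the outcome equation below is an illustrative assumption consistent with m(x) = 0:

library(grf)

set.seed(1)
n <- 5000; p <- 3
zeta <- function(u) 1 + 1 / (1 + exp(-20 * (u - 1/3)))

X   <- matrix(runif(n * p), n, p)
tau <- zeta(X[, 1]) * zeta(X[, 2])           # true CATE
D   <- rbinom(n, 1, 0.5)                     # p(x) = 0.5, no confounding
Y   <- (D - 0.5) * tau + rnorm(n)            # m(x) = 0 (illustrative outcome equation)

cf   <- causal_forest(X, Y, D)
x0   <- matrix(0.5, 1, p)
pred <- predict(cf, x0, estimate.variance = TRUE)
pred$predictions                                                    # estimated tau(x0)
pred$predictions + c(-1, 1) * 1.96 * sqrt(pred$variance.estimates)  # pointwise 95% CI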
We only report here the results from Wager and Athey (2017), in Figure 7.3, for the case p = 3. The comparison is done with the k-nearest-neighbours (kNN) estimator
\[
\hat{\tau}_{kNN}(x) = \frac{1}{k} \sum_{i \in S_1(x)} Y_i - \frac{1}{k} \sum_{i \in S_0(x)} Y_i,
\]
where k is taken to be 10 or 100, and S_1(x) and S_0(x) are the k nearest neighbours of x among treated and control observations, respectively. Table 3 in Wager and Athey (2017) shows in detail that the mean squared error of the causal random forest is more robust than that of the kNN estimator to an increase in the number of covariates (which remains small). Starting from dimension 3, the mean squared error of the causal random forest estimate is lower than that of the 100-nearest-neighbours estimator.
Figure 7.3: Comparison of the true treatment effect (left), the causal random forest estimate (center), and the kNN estimate (right) in dimension 3, adapted from Wager and Athey (2017)
Thus, τ satisfies a minimization problem (7.19) in which the penalization Λ_n takes into account the complexity of the class Θ to which τ belongs. Nie and Wager (2017) thus propose the following two-step estimator:
1. Fit m̂ and p̂ via any method tuned for optimal predictive accuracy (random forests, deep neural networks, the Lasso, etc.).
2. Estimate treatment effects via a plug-in version of (7.19), using the leave-one-out estimators Ŷ^{(−i)}(X_i) := m̂(X_i) and D̂^{(−i)}(X_i) := p̂(X_i).
Allowing for a two-step estimation, in contrast to the causal forest formulation of Section 7.2.1, makes it possible to choose first-step methods better adapted to the profiles of m and p. Nie and Wager (2017) derive bounds on the regret according to the complexity of the class Θ, and this formulation is used in the package grf.
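A minimal R sketch of this two-step logic with grf, where the first-step nuisance estimates come from regression forests and are passed to the causal forest (out-of-bag predictions playing the role of the leave-one-out estimators); the objects X, Y, D are those simulated in the previous sketch:

library(grf)

# Step 1: fit m-hat = E[Y|X] and p-hat = E[D|X] with any well-tuned method;
# here regression forests, using out-of-bag predictions as leave-one-out analogues.
forest_Y <- regression_forest(X, Y)
forest_D <- regression_forest(X, D)
Y_hat <- predict(forest_Y)$predictions
D_hat <- predict(forest_D)$predictions

# Step 2: plug the first-step estimates into the causal forest (local centering).
cf_centered <- causal_forest(X, Y, D, Y.hat = Y_hat, W.hat = D_hat)
head(predict(cf_centered)$predictions)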
7.2.4 Applications
We focus on the applications of Davis and Heller (2017a) and Davis and Heller (2017b), which estimate the benefits of two youth employment programs in Chicago. These two randomized controlled trials (RCTs) concern the same summer job program, run in 2012 and 2013. They have relatively large sample sizes (1,634 and 5,216 observations, respectively) and observe a large set of covariates. The program provides disadvantaged youth aged 14 to 22 with 25 hours a week of employment and an adult mentor. Participants are paid Chicago's minimum wage. The authors focus on two outcomes: violent-crime arrests within two years of random assignment and an indicator for ever being employed during the six quarters after the program.
They ask the following question: if we divide the sample into a group predicted to respond positively to the program and one that is not, do we successfully identify youth with larger treatment effects? To do so, they train the causal forest on half of the sample, then use the treatment effect predictions on the other half. They then regress the outcomes on the indicators 1{τ̂(x) > 0}, D_i·1{τ̂(x) > 0}, and D_i·(1 − 1{τ̂(x) > 0}), and test the null hypothesis that the treatment effect is equal across the two groups. Figure 7.4 shows their results: in-sample, the test rejects for both outcomes, whereas out-of-sample it detects significant heterogeneity only for the return to employment. This could be a sign of overfitting. Note that changing the splitting rule does not seem to alter these results much. They conclude that sampling error may hide the treatment effect heterogeneity.
Hussam et al. (2017) use causal random forests to evaluate the impact on returns of giving a $100 grant to randomly selected entrepreneurs in India. Moreover, they compare the treatment effect predicted by causal forests based on entrepreneurs' characteristics to the treatment effect when the grant is awarded based on community rankings (private or public). They find that peer reports are predictive over and above observable traits, but that making the rankings public creates incentives to lie that reduce the accuracy of the community's reports.
Finally, we note that Athey and Wager (2019) report an application of causal random forests and the grf package, carried out during a data challenge, to estimating, from the dataset of the National Study of Learning Mindsets, the effect on student achievement of a nudge-like intervention designed to instill in students the belief that intelligence can be developed. We recommend having a look at the code and datasets, which are provided at https://github.com/grf-labs.
Figure 7.4: Comparison of average treatment effects among the two groups identified using causal forests as having positive or negative treatment effect (from Davis and Heller (2017a))
Athey et al. (2019) generalize the above heterogeneous treatment effect estimation to the case of an endogenous treatment. The aim is to measure the causal effect of a treatment on an outcome while acknowledging that the intervention and the outcome are tied by non-causal factors, so that selection on observables no longer holds. For example, one may be interested in the causal effect of childbearing on female labor force participation (the classical IV used in the literature is the “mixed children” dummy, namely having children of different sexes). They consider the instrumental variable model (7.20).
The idea is then to extend the previous random forests to learn the weights α_i in a data-driven way, and to obtain an asymptotically normal estimator τ̂ of τ. The difficulty here, compared to the case where the unconfoundedness assumption holds, is that both τ and µ are only implicitly defined.
The Gradient tree algorithm. The algorithm computes the splits (hence the weights and the estimator) recursively. Start from a parent node P that we seek to divide into two children C_1, C_2 using an axis-aligned cut chosen to yield the best improvement in the accuracy of our estimator τ̂, namely minimizing
\[
\mathrm{err}(C_1, C_2) = \sum_{j=1}^{2} \mathbb{P}(X_i \in C_j \mid X_i \in P)\, \mathbb{E}\Big[ \big( \hat{\tau}_{C_j}(\mathcal{J}) - \tau(X_i) \big)^2 \,\Big|\, X_i \in C_j \Big],
\]
where the τ̂_{C_j}(J) are fit over the children C_j in the first part of the training sample J. However, here we do not have access to a direct unbiased estimate of err(C_1, C_2), which leads Athey et al. (2019) to propose a new procedure. First, solve the local estimating equation in the parent node P to obtain (τ̂_P(J), µ̂_P(J)); then compute
\[
A_P := \frac{1}{|\{i : X_i \in P\}|} \sum_{\{i:\, X_i \in P\}} \nabla \psi_{\hat{\tau}_P, \hat{\mu}_P}(W_i)
= \frac{1}{|\{i : X_i \in P\}|} \sum_{\{i:\, X_i \in P\}} \begin{pmatrix} -D_i Z_i & -Z_i \\ -D_i & -1 \end{pmatrix}.
\]
Note that in the IV model (7.20), the minimization problem (7.22) has the solution
\[
\hat{\tau}_P(\mathcal{J}) = \frac{\sum_{\{i:\, X_i \in P\}} Z_i \big(Y_i - \bar{Y}_P\big)}{\sum_{\{i:\, X_i \in P\}} Z_i \big(D_i - \bar{D}_P\big)},
\qquad
\hat{\mu}_P(\mathcal{J}) = \frac{1}{|\{i : X_i \in P\}|}\sum_{\{i:\, X_i \in P\}} \big(Y_i - D_i \hat{\tau}_P(\mathcal{J})\big),
\]
where \bar{Y}_P = \sum_{\{i:\, X_i \in P\}} Y_i / |\{i : X_i \in P\}| and \bar{D}_P = \sum_{\{i:\, X_i \in P\}} D_i / |\{i : X_i \in P\}|. Then compute the pseudo-outcomes
\[
\rho_i := -(1, 0)\, A_P^{-1}\, \psi_{\hat{\tau}_P, \hat{\mu}_P}(W_i) \in \mathbb{R}. \quad (7.23)
\]
Then, relabel the observations in each child by solving the estimating equation.
The justification of this algorithmic way of estimating an optimal partition is given in Proposition 1 of Athey et al. (2019), which states that if (1) A_P is a consistent estimator of ∇E[ψ_{τ̂_P,µ̂_P}(W_i) | X_i ∈ P], (2) the parent node has a radius smaller than r > 0, and (3) the regularity assumptions of Theorem 7.2 hold, then, treating the number of observations in the child nodes as fixed and large relative to 1/r^2,
\[
\mathrm{err}(C_1, C_2) = K(P) - \mathbb{E}\big[\tilde{\Delta}(C_1, C_2)\big] + o(r^2),
\]
where K(P) does not depend on the split, so that maximizing the gradient-based proxy criterion \tilde{\Delta}(C_1, C_2) computed from the pseudo-outcomes ρ_i approximately minimizes err(C_1, C_2).
Remark 7.6 (Influence function). The intuition for the use of ρ_i comes from the proof of asymptotic normality for Z-estimators, which are estimators of θ_0 based on the moment condition
\[
\mathbb{E}\big[\psi_{\theta_0}(X_i)\big] = 0.
\]
Using the asymptotic representation of the Z-estimator θ̂_n in Theorem 5.21, page 52, of Van der Vaart (1998),
\[
\hat{\theta}_n = \theta_0 + \frac{1}{n}\sum_{i=1}^{n} \nabla\psi_{\theta_0}^{-1}\,\psi_{\theta_0}(X_i) + o_p\!\left(\frac{1}{\sqrt{n}}\right),
\]
we see that the influence of the i-th observation on the estimator is given by ∇ψ_{θ_0}^{-1}ψ_{θ_0}(X_i)/n, which motivates the pseudo-outcomes (7.23).
Central Limit Theorem for Generalized Random Forests (GRF) in the instrumental variable model. We only study the case of model (7.20) and refer to Athey et al. (2019) for the more general case of asymptotic normality for GRF. Denote by V(x) the population analogue of A_P at x,
\[
V(x) := \mathbb{E}\big[\nabla\psi_{\tau(x),\mu(x)}(W_i) \mid X_i = x\big] = -\begin{pmatrix} \mathbb{E}[D_i Z_i \mid X_i = x] & \mathbb{E}[Z_i \mid X_i = x] \\ \mathbb{E}[D_i \mid X_i = x] & 1 \end{pmatrix},
\]
by ρ^*_i(x) := −(1, 0) V(x)^{-1} ψ_{τ(x),µ(x)}(W_i) the corresponding population pseudo-outcomes, and by τ̃^*(x) := τ(x) + Σ_i α_i(x) ρ^*_i(x) the associated infeasible forest. τ̃^*(x) is useful because it has the same form as the base learner of the U-statistic studied in Wager and Athey (2017). Thus, the tools developed in Wager and Athey (2017) and in Section 7.2.1 can be applied, which yields the asymptotic normality of τ̂(x) provided that τ̂(x) and τ̃^*(x) are asymptotically close, which is ensured by Assumption 7.5. Indeed, τ̃^*(x) is exactly the output of an infeasible regression forest trained with outcomes τ(x) + ρ^*_i(x).
is Lipschitz continuous in x;
Theorem 7.2 (Asymptotic normality of GRF for the instrumental variable model (7.20), Theorem 5 in Athey et al. (2019)). Assume that we have i.i.d. samples W_i = (X_i, Z_i, Y_i, D_i) ∈ [0, 1]^p × R × R × {0, 1}, and that there exists ε > 0 such that ε ≤ P(D = 1 | X) ≤ 1 − ε (overlap condition). Make Assumptions 7.4 and 7.5, and consider a double sample causal random forest satisfying Assumption 7.3 with α ≤ 0.2. Assume that β satisfies (7.13). Then, there exist C(·) and γ > 0 such that the random forest predictions are asymptotically Gaussian:
\[
\frac{\hat{\tau}(x) - \tau(x)}{\sigma_n(x)} \to_d \mathcal{N}(0, 1), \qquad \text{where } \sigma_n^2(x) := \frac{s\,C(x)}{n \log(n/s)^{\gamma}}.
\]
Intuitions for the proof of Theorem 7.2. The proof follows from the fact that τ̃^*(x) is formally equivalent to the output of a regression forest; thus, using Theorem 7.1, we have
\[
\frac{\tilde{\tau}^*(x) - \tau(x)}{\sigma_n(x)} \to_d \mathcal{N}(0, 1).
\]
Then, from Theorem 3 (consistency of (τ̂, µ̂)) and Lemma 4 in Athey et al. (2019), we have
\[
\frac{n}{s}\big(\tilde{\tau}^*(x) - \hat{\tau}(x)\big)^2 = O_p\!\left( \left(\frac{s}{n}\right)^{2/3} \right), \quad (7.24)
\]
so (τ̃^*(x) − τ̂(x))/σ_n(x) →_p 0, which yields the result. The technical part of the theorem is the proof of Lemma 4, which yields (7.24).
Athey et al. (2019), similarly to Nie and Wager (2017), recommend orthogonalizing the variables Y_i, D_i, Z_i, using preliminary leave-one-out estimators m̂^{(−i)}, p̂^{(−i)}, and ẑ^{(−i)} of E[Y_i | X_i], E[D_i | X_i], and E[Z_i | X_i], which yields
\[
\hat{\tau}(x) = \frac{\sum_{i=1}^{n} \alpha_i(x)\big(Y_i - \hat{m}^{(-i)}(X_i)\big)\big(Z_i - \hat{z}^{(-i)}(X_i)\big)}{\sum_{i=1}^{n} \alpha_i(x)\big(D_i - \hat{p}^{(-i)}(X_i)\big)\big(Z_i - \hat{z}^{(-i)}(X_i)\big)}. \quad (7.25)
\]
This option is implemented in the grf package. One can then build pointwise confidence intervals, using the fact that Var[τ̃^*(x)]/σ_n^2(x) →_p 1 and the definition of ρ^*_i(x) to build σ̂_n^2(x).
First, we strongly recommend having a look at the simulations used in Athey and Wager (2019) with the grf package, available at https://github.com/grf-labs, which nicely illustrate the performance of the IV forest.
We consider here an application of generalised random forests to estimating heterogeneity in the effect of subsidized training on trainee earnings. We use data from Abadie et al. (2002), which can be downloaded at https://economics.mit.edu/faculty/angrist/data1/data/abangim02. We re-analyse data from the Job Training Partnership Act (JTPA), a large publicly-funded training program. Individuals are randomly assigned to the JTPA treatment and control groups, the treatment consisting in offering training. Only 60 percent of the treatment group actually took up the training, but the randomized treatment assignment provides an instrument for the treatment status. Moreover, because only 2 percent of individuals in the control group received JTPA services, the effect for compliers can be interpreted as the effect on the treated. See Abadie et al. (2002) for more details and an alternative estimation method of the effects of this training on the distribution of earnings, based on quantile regression handling the endogeneity of the treatment. We focus on the heterogeneity of the effect of the training according to interactions of the baseline characteristics: age, a high school graduate indicator, marital status, Black and Hispanic indicators, Aid to Families with Dependent Children (AFDC) receipt, and a dummy for whether one worked less than 13 weeks in the past year. We denote by Y the 30-month earnings, D the enrollment in JTPA services, and Z the offer of services.
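A minimal R sketch of this estimation with the grf package, assuming a data frame jtpa with the covariates above and columns Y (30-month earnings), D (enrollment) and Z (offer of services); the column names and the 80/20 split are illustrative assumptions:

library(grf)

covariate_names <- c("age", "hsged", "married", "black", "hispanic", "afdc", "wkless13")
X <- as.matrix(jtpa[, covariate_names])   # hypothetical column names
Y <- jtpa$Y; D <- jtpa$D; Z <- jtpa$Z

set.seed(1)
train <- sample(nrow(jtpa), 0.8 * nrow(jtpa))

# IV forest ("GRF") versus causal forest ignoring endogeneity ("CRF")
ivf <- instrumental_forest(X[train, ], Y[train], D[train], Z[train])
crf <- causal_forest(X[train, ], Y[train], D[train])

tau_iv  <- predict(ivf, X[-train, ])$predictions
tau_crf <- predict(crf, X[-train, ])$predictions
summary(tau_iv); summary(tau_crf)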
We train a generalised random forest on 80% of the sample, for men and women separately, using the instrument (denoted “GRF”) or not (denoted “CRF”). We draw several comparisons. First, Figure 7.5 shows the distributions of the predicted treatment effects on the remaining 20% of the sample (our test sample), using “GRF” or “CRF”. Of course, a deeper analysis of the results could be carried out by reporting the precise treatment effect estimates for subgroups of the sample (here all the covariates are binary, so we cannot “plot” the estimated treatment effect as in the simulations of the grf package). Similarly to Abadie et al. (2002), we observe an important difference between the two, as illustrated in Table 7.1, which underlines the importance of using an instrumental variable in this context. The estimated quantiles of the treatment effects on the test sample reported in Table 7.1 can also be compared to the quantile treatment effects (QTE) estimated by Abadie et al. (2002) using quantile regression handling the endogeneity of the treatment. Both are very close, which is coherent, but the lack of uniform confidence bands for GRF prevents us from drawing further comparisons.
Figure 7.5: GRF (blue) and RF (red) predicted treatment effect distributions for 30-month earnings, for men (left) and women (right), among the test sample.
          ATE, OLS      ATE, 2SLS     ATE, RF   ATE, GRF   Q, 0.25   Q, 0.50   Q, 0.75
Men       3,754 (536)   1,593 (895)   3,456     1,908      863       1,836     2,986
Women     2,215 (334)   1,780 (532)   1,972     1,828      626       1,648     2,746
Table 7.1: Estimated treatment effects on 30-month earnings (standard errors in parentheses); the "Q, q" columns report quantiles of the predicted treatment effects on the test sample.
This section is based on Chernozhukov et al. (2017b), a paper we encourage you to read. The previous sections have shown that, in order to perform inference on the Conditional Average Treatment Effect (CATE), one often has to make strong assumptions on the underlying true CATE that might not always hold (e.g. covariates uniformly distributed in causal forests) or are untestable. Furthermore, the practical implementation of these methods very often differs substantially from their theoretical counterparts (e.g. tuning parameters are chosen via cross-validation). There is thus a trade-off between the assumptions we are willing to make and the amount of knowledge about the targeted object that we can get. Following a strand of the statistical literature (e.g. Lei et al. (2017)), Chernozhukov et al. (2017b) propose to change the point of view on the CATE, and to estimate and perform inference on key features of the CATE rather than on the true object itself. Changing our objective allows the use of plenty of ML methods, viewed as proxies of the CATE, and limits the assumptions we have to make (in particular, the ML methods do not have to be consistent). The idea consists in post-processing those estimators to get consistent succinct summaries of the CATE.
The model is similar to the one of Section 7.2: we observe the outcome variable Y = DY_1 + (1 − D)Y_0, the treatment dummy D, covariates X ∈ R^p, and make the selection on observables (7.1) and overlap (∃ ε > 0 s.t. ε ≤ P(D = 1 | X) ≤ 1 − ε) assumptions.
Chernozhukov et al. (2017b) propose partitioning the initial data (Y_i, D_i, X_i)_{i=1,...,n} into an auxiliary sample (denoted Data_A) and a main sample (denoted Data_M). The first step is to estimate x ↦ µ_0(x) and x ↦ τ(x) on the auxiliary sample. The following maps are the estimators (“ML proxies”) of µ_0 and τ, respectively, that result from a machine learning algorithm (any algorithm can be considered):
\[
m_0: x \mapsto m_0(x \mid \mathrm{Data}_A), \quad (7.28)
\]
\[
T: x \mapsto T(x \mid \mathrm{Data}_A). \quad (7.29)
\]
The two key features considered are:
1. the Best Linear Predictor (BLP) of the CATE τ(·) based on the ML proxy predictor T(·);
2. the Sorted Group Average Treatment Effects (GATES), which are the averages of τ(·) over heterogeneity groups induced by T(·).
Best linear predictor (BLP) of the CATE. The first key feature, the best linear predictor of the CATE using the proxy T, is defined as the linear projection of the CATE on the linear span of 1 and this proxy in the space L^2(P):
\[
\mathrm{BLP}[\tau(X) \mid T(X)] = \arg\min_{f(X) \in \mathrm{Span}(1, T(X))} \mathbb{E}\big[(\tau(X) - f(X))^2\big] = b_1 + b_2 T(X),
\]
where
\[
(b_1, b_2) \in \arg\min_{(B_1, B_2) \in \mathbb{R}^2} \mathbb{E}\big[(\tau(X) - B_1 - B_2 T(X))^2\big]. \quad (7.31)
\]
From the auxiliary sample A we can estimate an ML proxy T(·) of τ(·); thus we can estimate Cov(τ(X), T(X))/Var(T(X)) using (7.32), by running on the main sample the weighted regression (7.33) (see Step 2.2 of the algorithm in Section 7.4.3).
Theorem 7.3 (Consistency of the best linear predictor estimator, Theorem 2.2 in Chernozhukov et al. (2017b)). Consider the maps x ↦ T(x) and x ↦ m_0(x) as fixed. Assume that Y and X have finite second-order moments and that E[X̃X̃'] is of full rank, where X̃ denotes the regressor vector of (7.33). Then (β_1, β_2), defined by (7.33), also solves problem (7.31), hence β_1 = b_1 and β_2 = b_2.
Several remarks are in order. First, this identification result is constructive and simple: the weighted OLS estimation procedure is described in Section 7.4.3. Second, this strategy does not assume that T(X) is a consistent estimator of τ(X), allowing for the very high-dimensional setting p ≫ n. However, we only learn about the best linear projection of τ onto (1, T(X)), which implies that if T(X) is a bad proxy we learn nothing about the truth. Finally, note two interesting extreme cases:
- If T(X) is a perfect proxy for τ(X) and τ(X) is not constant, then β_2 = 1;
- If T(X) is pure noise, uncorrelated with τ(X), then β_2 = 0.
Testing β_2 = 0 therefore provides a very simple test of the joint hypothesis that there is heterogeneity and that T(X) is a relevant proxy for it (with the caveat that, if we do not reject, we cannot separate the two hypotheses).
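For concreteness, here is a minimal R sketch of the weighted OLS regression used for the BLP (the regression of Step 2.2 below), assuming main-sample vectors y, d, a known propensity score pscore, a proxy T_hat fitted on the auxiliary sample, and a matrix Z1 of controls; all these names are placeholders:

# Horvitz-Thompson-type weights w(X) = 1 / (p(X)(1 - p(X)))
w <- 1 / (pscore * (1 - pscore))

# BLP regression: beta2 estimates Cov(tau(X), T(X)) / Var(T(X))
dd   <- d - pscore                    # D - p(X)
Tdev <- T_hat - mean(T_hat)           # T(X) minus its main-sample average
fit  <- lm(y ~ Z1 + dd + dd:Tdev, weights = w)
coef(summary(fit))[c("dd", "dd:Tdev"), ]   # beta1 and beta2 (heterogeneity loading)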
Proof of Theorem 7.3. We only show that β_2 = Cov(τ(X), T(X))/Var(T(X)), since the proof for β_1 follows a similar reasoning. The normal equations (7.34) that identify (β_1, β_2) give, for β_2,
\[
\beta_2 = \frac{\mathrm{Cov}\big(w(X)(D - p(X))Y,\; T(X) - \mathbb{E}[T(X)]\big)}{\mathrm{Var}\big(T(X) - \mathbb{E}[T(X)]\big)}. \quad (7.35)
\]
The denominator is equal to Var(T(X)). Since T(X) − E[T(X)] has mean zero, the numerator of (7.35) can be computed term by term from the decomposition of Y. For the term involving the residual U, we have
\[
\mathbb{E}\big[w(X)(D - p(X))\,U\,(T(X) - \mathbb{E}[T(X)])\big] = \mathbb{E}\big[w(X)(D - p(X))\,(T(X) - \mathbb{E}[T(X)])\underbrace{\mathbb{E}[U \mid D, X]}_{=0}\big] = 0,
\]
and the remaining terms yield Cov(τ(X), T(X)), which gives the result.
Remark 7.7. To reduce the noise generated by the Horvitz-Thompson-type weight H := (D − p(X))/(p(X)(1 − p(X))) in (7.34), Chernozhukov et al. (2017b) recommend using an alternative regression instead of (7.33).
Sorted Group Average Treatment Effects. We can also divide the support of the proxy predictor T(X) into non-overlapping regions to define groups with similar treatment response and perform inference on their expected treatment effects.
The innovation of Chernozhukov et al. (2017b), for inference about the key features of the CATE described above, is to provide methods that handle the two sources of uncertainty that appear when using sample-splitting methods. Specifically, using different partitions of the initial sample into two parts {A, M} and aggregating the different estimates θ̂_A brings additional randomness on top of the usual sampling uncertainty. To make inference with methods using data splitting, one therefore has to adjust the usual confidence levels in a specific way.
Denote by
- the lower median (usual median) Med(X) := inf {x ∈ R : PX (X ≤ x) ≥ 1/2};
Let us first make more precise those two sources of uncertainty, which arise from the repeated use of partitions {A, M} of the initial sample {Y_i, D_i, X_i}_{i=1}^n: (i) the estimation uncertainty conditional on a given split, and (ii) the randomness induced by the choice of the split itself. Conditional on Data_A, the estimator θ̂_A is asymptotically normal, which yields the conditional confidence intervals
\[
\mathbb{P}\big(\theta_A \in [L_A, U_A] \mid \mathrm{Data}_A\big) = 1 - \alpha + o_P(1), \qquad [L_A, U_A] := \big[\hat{\theta}_A \pm \Phi^{-1}(1 - \alpha/2)\,\hat{\sigma}_A\big],
\]
where Φ is the c.d.f. of the standard normal. Consider testing
\[
H_0: \theta_A = \theta_0, \qquad H_1: \theta_A < \theta_0. \quad (7.37)
\]
Adjusted sample-splitting p-values. First note that, under H_0, p_A ∼ U(0, 1) conditional on Data_A, but that conditional on the whole data there is still randomness coming from the split. Thus, Chernozhukov et al. (2017b) define the test of the null hypothesis (7.37) with significance level α, based on the p-values p_A that are random conditional on the data, as: reject H_0 if Med(p_A | Data) ≤ α/2. (7.38)
This means that H_0 is rejected if, for at least 50% of the random data splits, the realized p-value p_A falls below the level α/2. The construction (7.38) is based on the fact that the median M of J uniformly distributed random variables (not necessarily independent) satisfies P(M ≤ α/2) ≤ α. Theorem 7.4 shows the uniform validity (over the distributions P in P, all the possible distributions of the data satisfying H_0) of the sample-splitting-adjusted p-values.
Assumption 7.6 (Uniform asymptotic size for the conditional test). Assume that all partitions Data_A of the data are “regular” in the sense that, under H_0, for
\[
p_A = \Phi\!\left(\frac{\hat{\theta}_A - \theta_A}{\hat{\sigma}_A}\right) \quad \text{and} \quad p_A = 1 - \Phi\!\left(\frac{\hat{\theta}_A - \theta_A}{\hat{\sigma}_A}\right),
\]
and for all x ∈ [0, 1], sup_{P∈P} |P_P(p_A ≤ x | Data_A) − x| ≤ δ = o(1).
Theorem 7.4 (Uniform asymptotic size for the unconditional test with sample splitting, Theorem 3.1 in Chernozhukov et al. (2017b)). If Assumption 7.6 holds, then, under H_0, the test (7.38) has uniform asymptotic size at most α.
Proof of Theorem 7.4. We have that Med(p_A | Data) ≤ α/2 is equivalent to E[1{p_A ≤ α/2} | Data] ≥ 1/2, which yields, by Markov's inequality and Assumption 7.6,
\[
\mathbb{P}_P(\text{reject } H_0) = \mathbb{P}_P\big(\mathbb{E}[\mathbf{1}\{p_A \le \alpha/2\} \mid \mathrm{Data}] \ge 1/2\big) \le 2\,\mathbb{E}_P\big[\mathbf{1}\{p_A \le \alpha/2\}\big] \le 2(\alpha/2 + \delta) = \alpha + o(1).
\]
From the sample-splitting-adjusted p-values, by inverting the test, one can deduce the confidence interval CI
for α < 0.25, where, for σ̂_A > 0,
\[
p_l(\theta) := \mathrm{Med}\left( 1 - \Phi\!\left(\frac{\hat{\theta}_A - \theta}{\hat{\sigma}_A}\right) \,\Big|\, \mathrm{Data} \right), \qquad
p_u(\theta) := \mathrm{Med}\left( \Phi\!\left(\frac{\hat{\theta}_A - \theta}{\hat{\sigma}_A}\right) \,\Big|\, \mathrm{Data} \right).
\]
A simpler alternative is to report the following confidence interval with nominal level 1 − 2α:
\[
[l; u] := \big[\mathrm{Med}(L_A \mid \mathrm{Data});\ \mathrm{Med}(U_A \mid \mathrm{Data})\big].
\]
The following theorem shows the uniform validity of this type of confidence interval, using the validity of the confidence interval CI introduced above, which is tighter but more difficult to compute.
Theorem 7.5 (Uniform validity of the confidence interval CI with sample splitting, Theorem 3.2 in Chernozhukov et al. (2017b)). If Assumption 7.6 holds, then CI ⊆ [l; u] and
\[
\inf_{P \in \mathcal{P}} \mathbb{P}_P\big(\theta_A \in \mathrm{CI}\big) \ge 1 - 2\alpha - 2\delta = 1 - 2\alpha + o(1).
\]
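A minimal R sketch of this aggregation over splits, assuming vectors theta_hat and sigma_hat of per-split estimates and standard errors (placeholders for the θ̂_A and σ̂_A obtained over S splits):

adjusted_inference <- function(theta_hat, sigma_hat, alpha = 0.05, theta0 = 0) {
  # Per-split (1 - alpha) confidence bounds
  L <- theta_hat - qnorm(1 - alpha / 2) * sigma_hat
  U <- theta_hat + qnorm(1 - alpha / 2) * sigma_hat
  # Per-split one-sided p-values for H0: theta = theta0 vs H1: theta < theta0
  p <- pnorm((theta_hat - theta0) / sigma_hat)
  list(
    ci_adjusted = c(median(L), median(U)),     # nominal level 1 - 2*alpha
    reject_H0   = (median(p) <= alpha / 2)     # adjusted sample-splitting test
  )
}

# Example with hypothetical estimates from S = 100 splits
set.seed(1)
adjusted_inference(theta_hat = rnorm(100, 1.5, 0.3), sigma_hat = rep(0.3, 100))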
Step 2. Consider S splits in half of the indices i ∈ {1, . . . , n} into the main sample M and the auxiliary sample A. For each split s ∈ {1, . . . , S}, do:
Step 2.1 Tune and train each ML method separately to learn m_0 and T using A. For each observation i in M, compute the predicted baseline effect m_0(X_i) and the predicted treatment effect T(X_i).
Step 2.2 Estimate the BLP parameters by the weighted OLS regression on M, with weights w(X_i) = 1/(p(X_i)(1 − p(X_i))),
\[
Y_i = \hat{\alpha}' Z_{1,i} + \hat{\beta}_1 \big(D_i - p(X_i)\big) + \hat{\beta}_2 \big(D_i - p(X_i)\big)\big(T(X_i) - \bar{T}^M\big) + \hat{\varepsilon}_i, \quad i \in M,
\]
where \bar{T}^M is the average of T(X_i) in M, and such that
\[
\frac{1}{|M|} \sum_{i \in M} w(X_i)\,\hat{\varepsilon}_i\, Z_i = 0, \qquad \text{where } Z_i = \Big[Z_{1,i}',\ D_i - p(X_i),\ \big(D_i - p(X_i)\big)\big(T(X_i) - \bar{T}^M\big)\Big]'.
\]
Step 2.3 Estimate the GATES parameters (γ̂_k)_{k=1}^K by a weighted OLS regression of Y_i on Z_{1,i} and the interactions (D_i − p(X_i))1{T(X_i) ∈ I_k}, k = 1, . . . , K, on M, such that
\[
\frac{1}{|M|} \sum_{i \in M} w(X_i)\,\hat{\varepsilon}_i\, W_i = 0, \qquad \text{where } W_i = \Big[Z_{1,i}',\ \big\{(D_i - p(X_i))\,\mathbf{1}\{T(X_i) \in I_k\}\big\}_{k=1}^{K}\Big]'.
\]
Step 4. Compute the estimates, the (1 − α)-level conditional confidence intervals and the conditional p-values for all the parameters of interest.
Step 5. Compute the adjusted (1 − 2α) confidence intervals and adjusted p-values using the variational method described in Section 7.4.2.
Several remarks are in order. First, note that maximizing Λ̂ in Step 2.4 is equivalent to maximizing the correlation between the ML proxy predictor and the true τ, while maximizing the second performance measure of Step 2.4 is equivalent to maximizing the part of the variation of τ explained by Σ_{k=1}^K γ̂_k (D_i − p(X_i)) 1{T(X_i) ∈ I_k}.
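A minimal R sketch of the GATES regression of Step 2.3, reusing the placeholder objects y, d, pscore, T_hat and Z1 from the BLP sketch above, with K = 5 groups:

K  <- 5
w  <- 1 / (pscore * (1 - pscore))
dd <- d - pscore

# Group indicators based on quantiles of the proxy T(X) in the main sample
breaks <- quantile(T_hat, probs = seq(0, 1, length.out = K + 1))
groups <- cut(T_hat, breaks = breaks, include.lowest = TRUE, labels = paste0("G", 1:K))
G  <- model.matrix(~ groups - 1)        # K dummy columns, one per group
DG <- dd * G                            # (D - p(X)) 1{T(X) in I_k}, k = 1, ..., K

# GATES: the coefficients on the columns of DG estimate E[tau(X) | group k]
fit_gates <- lm(y ~ Z1 + DG, weights = w)
coef(fit_gates)[grep("DG", names(coef(fit_gates)))]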
7.4.4 Simulations
We implement this strategy in the simulation setting of Case 1 of Section 7.2.2, where
\[
\tau(x) = \zeta(x_1)\zeta(x_2), \qquad \text{where } \zeta(u) = 1 + \frac{1}{1 + e^{-20(u - 1/3)}}.
\]
We use an adaptation of the code from MLInference (this GitHub repository also contains the dataset and code to replicate the application in Chernozhukov et al. (2017b)). The neural network model is able to fit the heterogeneity well (β_2 is close to 1). This is reassuring, as the shape of ζ is a sigmoid, which is the base (“activation”) function of the neural network used here. A look at the so-called “CLAN” (see Chernozhukov et al. (2017b)) in Table 7.4, i.e. the average characteristics of the most and least affected groups, E[X_k | G_5] and E[X_k | G_1], for the two variables X_1 and X_2, shows that the ML methods properly detect that observations in the upper-right square (resp. lower-left square) of the (X_1, X_2) space are those which benefit the most (resp. the least) from the treatment.
7.4.5 Application
We refer to the exam in Section B of the Appendix for an application to the heterogeneity of the gender wage gap.
                 Elastic Net   Boosting   Nnet    Random Forest
GATES measure    8.359         8.444      8.507   8.379
BLP measure      0.882         0.941      0.968   0.892
Table 7.2: Performance measures for GATES and BLP for the four ML methods used, based on 100 splits.
[Figure 7.6: four panels of estimated GATES, plotting the treatment effect against the group by heterogeneity score (groups 1 to 5), one panel per ML method.]
Figure 7.6: Estimated GATES with robust confidence intervals based on 100 splits for the four ML methods used. Quantiles of the true treatment effect are min: 1.00, 25%: 2.00, 50%: 2.54, 75%: 3.92, max: 3.99.
        Nnet                                                 Boosting
        Most Affected   Least Affected   Difference          Most Affected   Least Affected   Difference
X1      0.777           0.235            0.539               0.720           0.248            0.475
        (0.762,0.793)   (0.219,0.252)    (0.517,0.561)       (0.703,0.737)   (0.231,0.264)    (0.451,0.498)
X2      0.768           0.238            0.529               0.715           0.268            0.453
        (0.752,0.785)   (0.221,0.256)    (0.504,0.553)       (0.698,0.734)   (0.250,0.285)    (0.427,0.478)
Table 7.4: Estimated average characteristics of the most and least affected groups, E[Xk | G5] and E[Xk | G1], based on 100 splits (see CLAN in Chernozhukov et al. (2017b)), for the two variables X1 and X2, with robust confidence intervals. The “least affected” group is G1 and the “most affected” group is G5.
Chapter 8
A growing empirical literature studies social and economic networks, either for their own sake or to assess the importance of peer effects in many fields of the economic discipline. Recent examples include development and policy evaluation (Banerjee et al., 2014), welfare participation (Bertrand et al., 2000), criminal activities (Patacchini and Zenou, 2008), education (Sacerdote, 2011), etc.
Networks are by nature high-dimensional, in the sense that \binom{n}{2} undirected links can potentially be formed between n individuals (twice as many if we consider directed links). As a direct consequence, standard, usually low-dimensional, methods such as the MLE can quickly require special tools to accommodate the high-dimensionality of networks (see Section 8.2). Moreover, networks constitute fertile ground for high-dimensional tools such as the Lasso, as we will illustrate in Section 8.3.
Broadly speaking, empirical questions involving networks can be divided into two categories that we will review separately: network formation (i.e. what factors explain the existence of the observed network rather than another one?) and network spillovers or peer effects (i.e. what is the impact that individuals linked through a network have on each other?). We will first introduce the vocabulary and statistics specific to networks. The series of NBER video lectures by Matthew Jackson and Daron Acemoglu, www.nber.org/econometrics_minicourse_2014, presents key network concepts and their use in economics. We also recommend Graham (2019).
8.1 Vocabulary and Concepts
For an undirected graph g with set of nodes N_g and set of links E_g, the adjacency matrix W is defined by
\[
W_{ij} = W_{ji} = \begin{cases} 1 & \text{if } \{i, j\} \in E_g, \\ 0 & \text{otherwise.} \end{cases}
\]
A first set of statistics is related to the density of links in a graph and a second is
related to the correlation between the presence of links (clustering).
Figure 8.1: Examples of graphs
Density of a network. From a set of n nodes, \binom{n}{2} = n(n − 1)/2 undirected links can be constructed. The density of a graph is the share of all possible links that do exist:
\[
\mathrm{density}(g) := \frac{1}{\binom{n}{2}} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} W_{ij}.
\]
The degree of a node is the number of neighbors it has (an isolated node has degree zero). An important statistic, both to describe a graph and to perform estimation in many models, is the degree sequence of a graph, defined as the vector of dimension n that collects the degree of each node:
\[
d(g) := \left( \sum_{j=1}^{n} W_{ij} \right)_{i=1,\ldots,n}.
\]
The average degree is a commonly used measure of how well-connected a graph is:
\[
\bar{d}(g) := \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij} = (n-1)\,\mathrm{density}(g).
\]
Graphs are classified according to their density into two categories: sparse graphs, for which the density goes to zero as n → ∞, and dense graphs, for which the average degree is proportional to n, i.e. the density converges to a constant. The first case occurs, for example, when the average degree of a graph is constant.
Clustering. Several statistics are used to measure the degree of clustering of a graph. These metrics answer the question: “If node i is linked to both j and k, what is the probability that j and k are also linked?”. Directly related is the clustering coefficient of node i:
\[
c_i(g) := \frac{1}{\binom{d_i(g)}{2}} \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} W_{ij} W_{ik} W_{jk},
\]
the proportion of connected pairs among all possible pairs of neighbors of i. The clustering coefficient of the graph is given by the average clustering coefficient.
The global clustering coefficient,
\[
c(g)_{\mathrm{global}} := \frac{\sum_{i<j<k} W_{ij} W_{ik} W_{jk}}{\sum_{i<j<k} \mathbf{1}\{W_{ij}W_{jk} + W_{ij}W_{ik} + W_{jk}W_{ik} > 0\}},
\]
measures the share of all triples (i, j, k) with ij and ik linked that also have jk linked.
Empirically, we observe that many social networks (e.g. friendships at university, social relationships in villages) exhibit both sparsity and a high degree of clustering.
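A minimal R sketch computing these statistics directly from an adjacency matrix W (here a small simulated matrix as a hypothetical example; the last quantity is the standard transitivity index, a common variant of c(g)_global):

set.seed(1)
n <- 50
W <- matrix(rbinom(n * n, 1, 0.1), n, n)
W[lower.tri(W, diag = TRUE)] <- 0
W <- W + t(W)                       # symmetric, zero diagonal: undirected graph

density_g <- sum(W) / (n * (n - 1))          # share of existing links
degrees   <- rowSums(W)                      # degree sequence d(g)
avg_deg   <- mean(degrees)                   # average degree

# Transitivity: 3 x (number of triangles) / (number of connected triples)
A2 <- W %*% W
triangles <- sum(diag(W %*% A2)) / 6
triples   <- (sum(A2) - sum(diag(A2))) / 2
c_global  <- 3 * triangles / triples
c(density_g, avg_deg, c_global)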
where ε_{ij} is i.i.d. across the \binom{n}{2} pairs. A link forms if and only if the marginal utility is large enough:
\[
W_{ij} = \mathbf{1}\{ f(x_i, x_j) - \varepsilon_{ij} > 0 \}.
\]
Assuming a Type I extreme value distribution, we get a Logit model for the conditional probability of forming a link:
\[
\mathbb{P}(W_{ij} = 1 \mid X_i = x_i, X_j = x_j) = \frac{e^{f(x_i, x_j)}}{1 + e^{f(x_i, x_j)}}.
\]
The Erdös-Rényi model. Also known as the Bernoulli random graph model, this very simple model helps understand the difficulty of jointly reproducing the two stylized facts of empirical social networks (sparsity and clustering). Assume that the function f is constant: f(x_i, x_j) = α_0. Then:
\[
p := \mathbb{P}(W_{ij} = 1 \mid X_i = x_i, X_j = x_j) = \frac{e^{\alpha_0}}{1 + e^{\alpha_0}}.
\]
The first implication of the model is that, to recover a sparse structure, we need to consider a sequence p = p_n that decreases with n, for example p_n = d/(n − 1), for which the expected degree does not change with the number of nodes. So sparsity means p_n → 0. The second implication is that the clustering coefficient is p_n.
Here is the underlying tension: if such a model is sparse, there can be no clustering in the limit, since p_n → 0. On the other hand, if we assume a lower bound on the probability of link formation, p_n → p > 0, so that clustering does not vanish, then the expected degree becomes at least (n − 1)p, meaning the network becomes dense.
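A minimal R sketch illustrating this tension by simulating Erdös-Rényi graphs with constant expected degree (the chosen values of n and d are illustrative):

# Erdos-Renyi graphs with p_n = d / (n - 1): constant expected degree d
er_stats <- function(n, d) {
  p <- d / (n - 1)
  W <- matrix(rbinom(n * n, 1, p), n, n)
  W[lower.tri(W, diag = TRUE)] <- 0
  W <- W + t(W)
  A2 <- W %*% W
  clustering <- 3 * (sum(diag(W %*% A2)) / 6) / ((sum(A2) - sum(diag(A2))) / 2)
  c(avg_degree = mean(rowSums(W)), clustering = clustering)
}

set.seed(1)
sapply(c(100, 500, 2000), er_stats, d = 5)
# The average degree stays around d = 5 while the clustering vanishes as n grows.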
Question: If p_n → 0, what does it mean for α_0? How do you interpret that in terms of the agents' behavior?
\[
\mathbb{P}(W_{ij} = 1 \mid X_i = x_i, X_j = x_j) = \frac{\exp(\beta_0 \|x_i - x_j\| + \nu_i + \nu_j)}{1 + \exp(\beta_0 \|x_i - x_j\| + \nu_i + \nu_j)}.
\]
Let D denote the set of graphs with degree sequence equal to d(g):
\[
\mathcal{D} = \{ v \in \mathcal{G} :\ d(v) = d(g) \}.
\]
Following the ideas developed by Cox (1958) and Chamberlain (1980) for conditional maximum likelihood models, estimation of β_0 can be performed using:
\[
\mathbb{P}\big(W = w \mid X, d(g); \beta, \nu\big) = \frac{\exp\big(\beta_0 \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} w_{ij}\|x_i - x_j\|\big)}{\sum_{v \in \mathcal{D}} \exp\big(\beta_0 \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} v_{ij}\|x_i - x_j\|\big)},
\]
which does not depend on ν. The techniques to compute an estimator of β_0 are developed in Graham (2017) and in the associated Python code, available at www.github.com/bryangraham/netrics. The resulting estimator is consistent and asymptotically normal, albeit at a rate that depends on whether the graph is dense (n^{-1/2}) or sparse (n^{-1/4}).
where:
– C is the normalizing constant, C = \sum_{w} \exp\big(\sum_{H} \theta_{0H}\, g_H(w)\big).
Question: Show that the Erdös-Rényi model is an ERGM. Hint: take H = {i, j} and θ_{0H} constant over all H.
We will slightly change the definition of the adjacency matrix from the previous section. Before, we were only concerned with whether a link existed or not, hence the binary nature of the elements of W. Here, we may also care about the strength of such links, although as a simplification we will weight all existing links attached to a particular node equally. Consequently, we assume that we observe the set F_i ⊂ N_g of individual i's friends. In that case, we will have W_{ij} = 1/|F_i| if j ∈ F_i and W_{ij} = 0 otherwise. We say that an individual i is isolated if F_i is empty, i.e. W_{ij} = 0 for all j.
8.3.1 The Linear Model of Social Interactions
The canonical model in network econometrics to assess social interactions is based on the linear specification of Manski (1993):
\[
Y_i = \alpha + \beta \sum_{j=1}^{n} W_{ij} Y_j + \eta X_i + \gamma \sum_{j=1}^{n} W_{ij} X_j + \varepsilon_i, \quad (8.1)
\]
where Y_i is the outcome observed for node i (i.e. individual i), X_i is the observed characteristic of dimension 1 (for simplification) and the W_{ij} are the entries of the adjacency matrix that codes the social structure.
From our definition of the adjacency matrix in this context, model (8.1) is a regression of the individual outcome on his characteristics, the mean of his peers' outcomes and the mean of his peers' characteristics. Before further interpretation, consider model (8.1) stacked in matrix form:
\[
y = \alpha \mathbf{1} + \beta W y + \eta X + \gamma W X + \varepsilon.
\]
Because y appears on both sides, we can solve the model to obtain a reduced form, under the condition that I_n − βW is non-singular. Notice that this condition is equivalent to det(I_n − βW) ≠ 0, i.e. zero is not an eigenvalue of I_n − βW, which (for β ≠ 0) is equivalent to det((1/β)I_n − W) ≠ 0, meaning that 1/β is not in the spectrum of W. Under this assumption:
\[
y = (I_n - \beta W)^{-1}\big(\alpha \mathbf{1} + \eta X + \gamma W X + \varepsilon\big).
\]
In model (8.1), β captures the endogenous social effect and γ the exogenous social effect. The quantity (I_n − βW)^{-1} is referred to as the social multiplier, because influence through peers' outcomes propagates shocks. This phenomenon does not occur if environmental or contextual effects (influence through peers' characteristics) are the main social influence mechanism, i.e. if β = 0.
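A minimal R sketch of the reduced form and the social multiplier for a simulated row-normalized network (all parameter values are illustrative assumptions):

set.seed(1)
n <- 30
A <- matrix(rbinom(n * n, 1, 0.2), n, n); diag(A) <- 0
A <- pmax(A, t(A))                         # undirected friendship links
W <- A / pmax(rowSums(A), 1)               # row-normalized adjacency (peer means)

alpha <- 1; beta <- 0.4; eta <- 2; gamma <- 1
X   <- rnorm(n); eps <- rnorm(n)

# Reduced form: y = (I - beta W)^{-1} (alpha 1 + eta X + gamma W X + eps)
M <- solve(diag(n) - beta * W)             # social multiplier matrix
y <- M %*% (alpha + eta * X + gamma * W %*% X + eps)

# Effect of a unit increase in X_1 on all outcomes, propagated by the multiplier
round(M %*% (eta * diag(n)[, 1] + gamma * W[, 1]), 2)[1:5]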
The key issue in the peer effects literature is to distinguish between the social multiplier and contextual or environmental effects. Model (8.1) is not identified in itself and requires further assumptions. The next theorem is due to Bramoullé et al. (2009).
Theorem 8.1 (Identification of Peer Effects, (α, β, η, γ)). Suppose that |β| < 1 and ηβ + γ ≠ 0. Also assume that W is such that, for any non-isolated i, Σ_j W_{ij} = 1. If the matrices I_n, W and W^2 are linearly independent, the social effects are identified. Otherwise, and if no individual is isolated, the social effects are not identified.
Proof of Theorem 8.1. Assume that (α, β, η, γ) and (α′, β′, η′, γ′) lead to the same reduced form, i.e. we have, almost surely,
\[
\alpha (I_n - \beta W)^{-1}\mathbf{1} = \alpha' (I_n - \beta' W)^{-1}\mathbf{1}, \quad (8.2)
\]
\[
(I_n - \beta W)^{-1}(\eta I_n + \gamma W) = (I_n - \beta' W)^{-1}(\eta' I_n + \gamma' W). \quad (8.3)
\]
Multiply equation (8.3) by (I_n − β′W)(I_n − βW) and use that, for any real b, (I_n − bW)^{-1} and W commute. Developing then yields:
\[
(\eta - \eta')I_n + (\eta'\beta - \eta\beta' + \gamma - \gamma')\,W + (\gamma'\beta - \beta'\gamma)\,W^2 = 0.
\]
If I_n, W and W^2 are linearly independent, all three coefficients must be zero: the first gives η = η′, and combining the other two with ηβ + γ ≠ 0 then gives β = β′ and γ = γ′. Coming back to equation (8.2), β = β′ also yields α = α′, so the social effects are identified.
Next, suppose that I_n, W and W^2 are linearly dependent and that no individual is isolated. This last assumption implies that 1 is in the spectrum of W, since W1 = 1 follows from Σ_j W_{ij} = 1 for any i. So 1/(1 − β) is an eigenvalue of (I_n − βW)^{-1} associated with the eigenvector 1. As a consequence, the very first equation of the proof becomes α(1 − β′) = α′(1 − β). Since I_n, W and W^2 are linearly dependent, there exist, for example, two scalars a and b such that W^2 = aI_n + bW. Plugging this into equation (8.3) shows that only three equations need to be satisfied for (α, β, η, γ) and (α′, β′, η′, γ′) to yield the same reduced form, so that the social effects are not identified.
Remark 8.1 (Identification in groups). Imagine that, instead of interacting in an arbitrary network, individuals interact in groups (e.g. classrooms), and that the individual himself is included when computing the mean. This is equivalent to saying that there is a partition of the population into (non-overlapping) subsets. In that case, the second part of Theorem 8.1 applies and peer effects are not identified, since W^2 = W. More generally, Proposition 2 in Bramoullé et al. (2009) states that if individuals interact in groups and all groups have the same size, social effects are not identified. Conversely, if at least two groups have different sizes and if ηβ + γ ≠ 0, social effects are identified.
Brock and Durlauf (2001) have established identification for binary models with social
interactions.
We will not deal with the estimation of such models, but note that model (8.1) requires instruments because of endogeneity: the outcome of my friend is correlated with my outcome (which includes the error term).
The Perils of Peer Effects. Angrist (2014) warns against mistaking generic clustering (outcomes tend to be correlated within groups) for causal peer effects, a warning already issued by Manski (1993) as the reflection problem. From Manski (1993):
This paper examines the “reflection” problem that arises when a researcher
observing the distribution of behaviour in a population tries to infer whether
the average behaviour in some group influences the behaviour of the individ-
uals that comprise the group. The term reflection is appropriate because the
problem is similar to that of interpreting the almost simultaneous movements
of a person and his reflection in a mirror. Does the mirror image cause the
person’s movements or reflect them? An observer who does not understand
something of optics and human behaviour would not be able to tell.
Or, again: “observed behavior is always consistent with the hypothesis that individual
behavior reflects mean reference-group behavior”. Following this line of thought, Angrist
(2014) argues that many significant peer effects are, in fact, spurious.
We close this chapter on networks with Manresa (2016), on recovering the structure of social interactions using panel data. The paper is relevant for this class for two reasons: (i) the goal of the proposed method is to estimate the strength of the social interactions in a network without observing the adjacency matrix of said network; (ii) it uses both the Lasso and the double-selection procedure studied in Chapter 2.
Assumption 8.2 (Linear model with unknown social interactions). Manresa (2016) considers the following model, similar to (8.1) but without the peers' outcomes:
\[
Y_{it} = \alpha_i + \eta_i X_{it} + \sum_{j \ne i} \gamma_{ij} X_{jt} + Z_{it}'\theta + \varepsilon_{it}, \quad (8.4)
\]
where E[ε_{it} | X_{j1}, . . . , X_{jt}, Z_{i1}, . . . , Z_{it}] = 0 for any i, j, t. α_i and η_i are the own intercept and own effect, while γ_{ij} measures the influence of peer j's characteristics on individual i's outcome. Z_{it} are other characteristics whose effect does not depend on the individual. By definition, the parameter γ := (γ_{ij})_{i≠j} is high-dimensional, since there are n(n − 1) entries in this vector. Consistent with the observation that many empirical social networks are sparse, it is assumed that Σ_{j≠i} 1{γ_{ij} ≠ 0} ≤ s_i ≪ T for all i.
Target Parameters. Many parameters can be of interest in this model. The average private effect is defined as n^{-1} Σ_{i=1}^n η_i. Directly related to the structure of social interactions is the social influence of individual i,
\[
M_i := \frac{1}{n-1} \sum_{j \ne i} \gamma_{ji} + \frac{1}{n}\,\eta_i,
\]
that is, the average impact of individual i's characteristics over the whole network.
If estimating θ is of interest, one can use the double-selection method presented in
Chapter 2, treating ηi and the γij as the nuisance parameters. Here, the method consists
in (i) regressing Zit on X1t , ..., Xnt using a pooled Lasso, (ii) regressing Yit on X1t , ..., Xnt
using a pooled Lasso, (iii) regressing Yit on Zit and the set of the X1t , ..., Xnt corresponding
to a non-zero coefficient either in step (i) or (ii).
Estimation of η_i and the γ_{ij} in model (8.4) is done with a pooled Lasso regression of Y_{it} − Z_{it}'θ̂ on X_{1t}, . . . , X_{nt}. A careful analysis of the properties of these estimators is available in the original paper.
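A minimal R sketch of this pooled Lasso step for a given individual i, using the glmnet package; y_tilde_i (the T-vector of Y_{it} − Z_{it}'θ̂), the T × n matrix X of characteristics, and the index i are hypothetical placeholders:

library(glmnet)

# Pooled Lasso for individual i: regress (Y_it - Z_it' theta_hat), t = 1, ..., T,
# on the characteristics of all individuals X_1t, ..., X_nt.
fit_i <- cv.glmnet(X, y_tilde_i)
b     <- as.numeric(coef(fit_i, s = "lambda.min"))[-1]   # drop the intercept
eta_i   <- b[i]      # own effect
gamma_i <- b[-i]     # sparse vector of peer effects gamma_ij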
Going Further. This section gave a few ideas about modeling network formation and peer effects, with no claim to exhaustiveness. We point out several contributions in the econometrics of networks, driven mainly by our readings rather than by our expertise in the domain. Kolaczyk (2009) is a textbook reference that mainly belongs to the statistical literature, while Jackson (2008) views networks from the economist's point of view. Chandrasekhar and Lewis (2011) study the properties of GMM estimators based on partial network data, that is, data that have been sampled and may be incomplete regarding the nodes and links that constitute a network. Such sampled network data, they argue, can be very misleading when studying peer effects, as they contain non-trivial measurement errors; they use statistical techniques to predict the full network and mitigate the bias. Boucher and Fortin (2015) review econometric problems related to inference with networks.
Chapter 9
Appendix
A Exam 2018
Class documents authorized. Calculator forbidden. Duration: 2 hours.
Part I contains 6 questions; Part II contains 4 questions (8 sub-questions); Part III contains 5 questions.
Part I: Questions
1. We want to perform inference on a parameter θ using only one sample splitting of the data into two subsamples A and M. We use the subsample M to train an ML estimator θ̂. We then use the subsample A to evaluate it, leading to the estimator θ̂_A. What type of inference can we make? What problem does it raise?
2. Is the use of the Lasso in the first step in Section 1.5 the key ingredient to solve the post-selection
inference problem?
3. Explain intuitively why sample splitting is useful in Causal Random Forest estimation to obtain consistency of the estimator of the heterogeneous treatment effect. What do we call this property?
4. “The synthetic control estimator does not use the full sample of control units”. Explain and
criticize.
5. What is Leeb and Pötscher's point? Does the result in Theorem 1.2 (asymptotic normality of the immunized estimator) contradict Leeb and Pötscher's analysis? Why?
τ0 = E [Y1 − Y0 |D = 1] . (ATET)
Define π = P(D = 1) and the propensity score p(X) = P(D = 1|X). We make the following two
assumptions. The Conditional Independence Assumption:
Y0 ⊥ D|X, (CIA)
1. (a) Define:

m(W_i, τ, p) = (D_i − (1 − D_i) p(X_i)/(1 − p(X_i))) Y_i^obs − D_i τ.
where λ > 0 is a tuning parameter. What would you call such a method? Will the estimator of τ_0 be asymptotically normal in that case?
3. Assume that the outcome under no treatment is given by Y_0 = X'γ_0 + ε with ε ⊥ X and E[ε] = 0.
(a) Show that E[DX(Y_0 − X'γ_0)] = 0.
(b) Suggest a moment condition ψ which is orthogonal. Prove that it is.
4. Based on the previous questions, give an estimator τ̌ of τ_0 which is asymptotically normal even in the high-dimensional case. Which theorem do you use?
U_{R,i,t} = 0. X_t ∈ R^{p_X} is a random vector measuring the characteristics of the party's candidate in district t, D_t the amount of advertising spent by the party in district t, ξ_{L,t} is a district-specific unobserved shock (e.g., the candidate's reputation), and ε_{i,t,L} is an idiosyncratic unobserved shock distributed with cdf F(t) = [1 + e^{−t}]^{−1}. X_t is considered exogenous while D_t is endogenous, and Z_t is an instrumental variable. g(·) is an infinitely differentiable function on R of the index X_t'β_0.
where

f_0 ∈ F_{p,q} := { f : f(x, z) = Σ_{i=1}^p γ_{0,i} 1{x ∈ C_{a_i,r}} + Σ_{i=1}^q δ_{0,i} 1{z ∈ C_{b_i,r}},  a_i ∈ R^{d_X}, b_i ∈ R^{d_Z} },

where C_{a_i,r} and C_{b_i,r} are hypercubes in R^{d_X} and R^{d_Z}, respectively, with centers a_i and b_i and sides of length r.
We observe an i.i.d. sample (W_t)_{t=1}^n = (S_t, X_t, D_t, Z_t)_{t=1}^n over the n electoral districts, where S_t ∈ (0, 1) is the observed share of votes for candidate L in district t.
1. Assume that p < n and q < n and that the true function f_0 has only few zero coefficients {γ_{0,i}}_{i=1}^p and {δ_{0,i}}_{i=1}^q in its decomposition. Propose a consistent estimator of the regression function E(D|X = x, Z = z) that is well suited to this setting. Can we use it in the case where the assumption p < n and q < n does not hold? Explain.
2. Assume now that p > n and q > n and sparsity of the coefficients {γ_{0,i}}_{i=1}^p and {δ_{0,i}}_{i=1}^q in the decomposition of f_0. Give a consistent estimator of the regression function E(D|X = x, Z = z) that is well suited to this setting, together with the estimating equation.
B. Estimation of τ0 .
3. Write the estimating equation, starting from (A.1), using the dependent variable S̃_t := ln(S_t/(1 − S_t)).
4. Find two functions Q_1 and Q_2 such that

m(W_t, η, τ_0) = S̃_t − Q_1(η, Y_t, D_t, X_t) Q_2(η, Z_t, X_t),
E[m(W_t, η, τ_0)] = 0,    (A.2)
E[∂_η m(W_t, η, τ)] = 0, ∀τ ∈ Θ,    (A.3)

where Θ is a compact neighborhood of τ_0. As in the course, you should use (A.1), (FS), and an additional linear equation of your choice specifying the correlation structure between the instruments and the regressors.
5. Give the conditions on the function g under which the estimator τ̂ defined using (A.2) is asymptotically normal, using only a theorem from the course.
Exam 2018: Elements of Correction
Part I: Questions
1. In this case, we only have confidence intervals that are conditional on the particular split of the data.
(a) For b = 1, ..., B, reshuffle the treatment assignment at random, compute the OLS estimator of τ_0, denoted τ̂_b, and compare it with the observed statistic τ̂^obs.
(b) Compute Fisher's p-value:

p̂ := (1/B) Σ_{b=1}^B 1{|τ̂_b| ≥ |τ̂^obs|}.

(c) Reject H_0 if p̂ is below a pre-determined threshold: the observed treatment allocation then gives an effect which is abnormally large.
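As an illustration, here is a minimal sketch of this randomization (Fisher) test in Python; the OLS-with-controls specification, the function name and the default number of permutations are illustrative assumptions, not part of the exam solution.

```python
import numpy as np

def fisher_permutation_pvalue(Y, D, X, B=999, seed=None):
    """Fisher randomization test of the sharp null of no treatment effect.
    Y: outcomes; D: binary treatment; X: controls. tau is the OLS coefficient on D."""
    rng = np.random.default_rng(seed)

    def tau_ols(d):
        # OLS of Y on (1, d, X); return the coefficient on d
        W = np.column_stack([np.ones_like(d), d, X])
        coef, *_ = np.linalg.lstsq(W, Y, rcond=None)
        return coef[1]

    tau_obs = tau_ols(D)
    tau_perm = np.array([tau_ols(rng.permutation(D)) for _ in range(B)])
    # p-value: share of reshuffled estimates at least as large in absolute value
    return np.mean(np.abs(tau_perm) >= np.abs(tau_obs))
```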
Question 2: In this case, use the LASSO with the transformed regressors

X̃_{t,i} = 1{X_t ∈ C_{a_i,r}},

where

Ŷ'(γ_0, δ_0) = Σ_{j=1}^p Ŷ_j γ_{0,j} + Σ_{j=1}^q Ŷ_{j+p} δ_{0,j}.

Thus, we have

D_t = X̃_t'γ_0 + X̃_t'Π'δ_0 + u_t + ζ_t'δ_0 = X̃_t'(γ_0 + Π'δ_0) + ρ_t^d,

and thus

D_t = X̃_t'ν_0 + ρ_t^d,  with  ρ_t^d ⊥ X_t.    (A.4)
Question 5: In this course, asymptotic normality of the estimator τ̂ is proved for an Affine-Quadratic model, which imposes that:
1. either g(·) is affine, which makes the model identical to the one in the course;
2. or g(·) is quadratic, which is allowed by the theorem, but requires that in the first stage we have an estimator of β_0 in a sparse nonlinear index model.
B Exam 2019
Class documents authorized. Calculator forbidden. Duration: 2 hours.
Part I contains 6 questions; Part II contains 10 questions (not counting sub-questions).
Part I: Questions
1. Considering the setup of Chapter 4, is the synthetic control estimator a consistent estimator of
the treatment effect on the treated? Explain why or why not.
2. What is a sparse graph? Explain using concepts seen in class and give an economic or sociological
example.
3. What are the main differences between random forests and causal random forests? How are the latter implemented in practice?
4. How would you modify the standard LASSO estimation procedure when errors are non-Gaussian
and heteroscedastic, if you want to obtain the same rates of convergence (up to a constant)?
5. Describe the best linear predictor of the CATE using two machine learning proxies m0 and T of,
respectively, E [Y |X, D = 0] and the CATE.
6. In which case(s) do you prefer using a random forest instead of a LASSO and vice-versa?
(a) Considering the problem we are studying, what would you like to include in Xi ?
(b) Give a (simple) consistent estimator of θ in the case where p is a small integer (for example
p = 6), as n → ∞.
(c) Is it still a consistent estimator if p > n and/or p → ∞ ? If you answer no, propose a
consistent estimator in that case.
(d) Show that E[ln Wi |Xi , Fi = 1]−E[ln Wi |Xi , Fi = 0] = θ. Do you think that it is a reasonable
assumption?
3. In order to deepen the analysis, we consider the model

ln W_i = α + θ(Z_i)F_i + X_i'β + ε_i,  with E[ε_i|X_i, F_i] = 0 and ‖β‖_0 ≤ s ≪ p,    (A.7)

where θ(Z_i) measures an effect that depends on some covariates Z_i ⊂ X_i. Specifically, we assume that

θ(z) = Σ_{k=1}^K θ_k z_k.
(a) “Model (A.7) allows one to study a heterogeneous wage gap”. Do you agree or disagree? Justify (a formula or two would be welcome).
(b) Rewrite model (A.7) as a linear regression model. What are the corresponding Normal
equations?
(c) Assuming that p > n and p → ∞ but K and s are small integers, how could you estimate
consistently (θ1 , . . . , θK )? In your answer, you will explicitly write down an immunized
moment condition ψ for (θ1 , . . . , θK ) and add the necessary assumptions.
4. Tables 7-10 in Appendix A are extracted from Bach, Chernozhukov and Spindler (2018). They display estimates of (θ_1, . . . , θ_K) based on Model (A.7), obtained by the method in Q3, on a US
sample of college graduates. Interpret three rows of your choice among these four tables.
5. From these four tables, what do you see as the main problem to perform inference in this context?
6. One other way to model heterogeneity in the wage gap is to use causal random forests. We assume in the next two questions that (X_i)_{i=1}^n are i.i.d. and uniformly distributed, X_i ∼ U([0, 1]^p). Then, at some point x in the support of X_i, we define the causal random forest as

μ̂(x; X_1, . . . , X_n) = \binom{n}{s}^{-1} Σ_{1≤i_1<···<i_s≤n} T(x; X_{i_1}, . . . , X_{i_s}),

where

T(x; X_{i_1}, . . . , X_{i_s}) = Σ_{i∈{i_1,...,i_s}} α_i(x) ln W_i,   α_i(x) = 1{X_i ∈ L(x)} / (s|L(x)|),

L(x) is the leaf of the tree T containing x, |L(x)| its Lebesgue measure, and s ∈ [n/2, n) is the fixed size of the subsamples.
Assuming that the regression function μ : x → E[ln W_i|X_i = x] is Lipschitz with constant C and that the construction of the leaves L is independent of the sample (X_i)_{i=1}^n, show the following inequality:

|E[μ̂(x; X_1, . . . , X_n)] − μ(x)| ≤ C Diam(L(x)),    (A.8)

where Diam(L(x)) is the diameter of the leaf containing x.
7. Explain from (A.8) what high-level condition we may enforce on Diam(L(x)) to obtain consistency. Do standard random forests achieve this condition, and why? How is this implemented in practice in the causal random forest of Athey and Wager?
8. For any given ML proxy, we form five groups G_k, for k ∈ {1, . . . , 5}, among the population, based on the predicted outcome T(X_i), using the splits I_k based on the quantiles

I_k := [ℓ_{k−1}, ℓ_k], where ℓ_k = F_{T(X_i)}^{-1}(k/5),

and F_{T(X_i)}^{-1} is the quantile function of T(X_i). Using Figure 9.1 and Table 9.3, give your interpretation of the heterogeneity in the wage gap and compare it with the interpretation made in Question 4 from Tables 7-10 in the Appendix. Describe explicitly the differences in the nature of the parameter of interest and their consequences for the interpretation.
9. (Bonus) We want to take into account selection into labour market participation. Explain how you would model it and give a potential estimation procedure if the selection equation depends on a high-dimensional, sparse, a priori unknown set of variables.
Appendix A: Tables for Part II, Questions 4 and 5
Appendix B: Results for Part II, Questions 8-10
We denote by

Λ̂ = β̂_2^2 V̂ar(T(X))   and   Λ̄̂ = Σ_{k=1}^K γ̂_k^2 P(T(X) ∈ I_k),

where β̂_2 is the estimator of the slope of the best linear predictor and γ̂_k the estimator of the sorted group average treatment effects (GATES). For any given ML proxy, we form five groups G_k, for k ∈ {1, . . . , 5}, among the population, based on the predicted outcome T(X_i), using the splits I_k based on the quantiles

I_k := [ℓ_{k−1}, ℓ_k], where ℓ_k = F_{T(X_i)}^{-1}(k/5),

and F_{T(X_i)}^{-1} is the quantile function of T(X_i).
[Figure 9.1 here: two panels plotting the estimated Treatment Effect against Group by Het Score (groups 1 to 5), with values ranging roughly from 0.0 to −0.4.]
Figure 9.1: Estimated GATES (sorted group average treatment effects) with robust confidence intervals at 90% for the two best ML methods used, based on 100 splits.
                           Random Forest                                         Elastic Net
               Most Affected    Least Affected   Difference        Most Affected    Least Affected   Difference
On log wage
Age            31.47            34.36            -2.826            31.49            33.54            -2.044
               (31.21,31.73)    (34.10,34.62)    (-3.196,-2.456)   (31.22,31.75)    (33.27,33.81)    (-2.427,-1.660)
Nb. Ch.-19y.   0.263            0.831            -0.566            0.237            0.814            -0.586
               (0.238,0.287)    (0.807,0.856)    (-0.602,-0.530)   (0.212,0.262)    (0.790,0.838)    (-0.621,-0.551)
Exper.         9.060            14.70            -5.634            9.238            14.06            -4.771
               (8.793,9.328)    (14.43,14.96)    (-6.004,-5.258)   (8.948,9.528)    (13.78,14.34)    (-5.185,-4.358)

Table 9.3: Estimated average characteristics of the most and least affected groups, E[X_k|G_5] and E[X_k|G_1], based on 100 splits, for the variables age (Age), number of children under 19 years old (Nb. Ch.-19y.), and years of work experience (Exper.), with robust confidence intervals at 90% for the ML methods used. The "least affected" correspond to group G_1 and the "most affected" to group G_5.
(c) No, it is not. Given the sparse structure, you want to use the double-selection procedure seen in class, using a LASSO in the first two steps; a brief description of the procedure was expected here.
(d) E[ln W_i|X_i, F_i = f] = θf + X_i'β. It means that the wage gap is constant across the support of X_i, which is probably unreasonable.
3. (a) It is true. Indeed, in that case E[ln W_i|X_i, F_i = 1] − E[ln W_i|X_i, F_i = 0] = θ(Z_i) = Σ_{k=1}^K θ_k Z_{i,k}. So the wage gap varies by θ_k percentage points when Z_{i,k} varies by one unit. Since the overall wage gap is negative, a positive value of θ_k means that (for example) the wage gap is smaller than at baseline in the population for which Z_k = 1.
(b) Using the notation θ = (θ_k)_{k=1,...,K}, we have:

ln W_i = α + F_i Z_i'θ + X_i'β + ε_i,

so we have a linear model with p + K covariates (F_i Z_i', X_i'), and the p + K Normal equations are E[F_i Z_{i,k} ε_i] = 0 for k = 1, . . . , K and E[X_{i,j} ε_i] = 0 for j = 1, . . . , p.
(c) An immunized moment condition is one with derivatives with respect to β equal to zero. It means using the "double-selection" procedure but with K parameters of interest. It requires K + 2 steps:
i. The first K steps consist in regressing each element of Z_i on X_i for the sub-sample of women, using a Lasso;
ii. the (K + 1)-th step is a LASSO regression of ln W_i on X_i;
iii. the last step is a regression of ln W_i on F_i Z_i and the union of all the elements of X_i selected previously.
4. Example: Compared to the baseline, having a child of 18 years old or younger increases the wage gap by 5 pp (the wage gap is more negative for them). So women who have a child of 18 or younger earn 5 pp less relative to men than other women do.
5. There is a multiple testing problem. (These tables already correct for multiple testing, but you
could not know that).
6. We have

|E[μ̂(x; X_1, . . . , X_n)] − μ(x)|
= |E[T(x; X_1, . . . , X_n)] − μ(x)|
= |E[Σ_{i∈{i_1,...,i_s}} (1{X_i ∈ L(x)} / (s|L(x)|)) ln W_i] − μ(x)|
= |E[ln W_i | X_i ∈ L(x)] E[Σ_{i∈{i_1,...,i_s}} 1{X_i ∈ L(x)} / (s|L(x)|)] − μ(x)|
= |E[ln W_i | X_i ∈ L(x)] − E[ln W_i | X_i = x]| ≤ C Diam(L(x)),

where we used that, with X_i ∼ U([0, 1]^p) and leaves built independently of the sample, E[1{X_i ∈ L(x)}] = |L(x)|, so that the expected sum of the weights equals one; the final inequality uses that μ is Lipschitz with constant C and that both X_i and x lie in L(x).
7. We choose these two methods based on the performance indicators Λ and Λ̄, which measure how much heterogeneity is captured by the procedure. The table indicates clearly that the Random Forest and the Elastic Net are the best ones. From Table 2 we can see that the average of E[ln W_i|X_i, F_i = 1] − E[ln W_i|X_i, F_i = 0], which is β_1, is negative, so women earn on average less than men, but the slope of the BLP is significantly positive and close to 1; thus there is heterogeneity and its profile is quite well described by the Elastic Net and Random Forest proxies.
8. From Figure 9.1 and Table 9.3 we see that there is a group of women for which there is no wage gap. They are women with fewer children and less experience than average. Here, the parameter of interest depends on the accuracy of the ML proxy, as well as on the splits. We can thus only learn about features of E[ln W_i|X_i = ·, F_i = 1] − E[ln W_i|X_i = ·, F_i = 0] (heterogeneity, the subgroups that benefit the least and the most and their characteristics) and not about this quantity itself, as was done previously. As this Generic ML procedure depends on sample splitting, the p-values should be adapted to take into account this additional randomness.
9. (Bonus) We want to take into account selection into labour market participation. Explain how you would model it and give a potential estimation procedure if the selection equation depends on a high-dimensional, sparse, a priori unknown set of variables.
C Exam 2020
Expected duration: around 2.5 hours. Turn in before: May 6 2020, 11pm.
Part I contains 3 questions; Part II contains 12 questions (not counting sub-questions); Part III contains
3 questions.
Part I: Questions
1. Give three advantages of using the synthetic control method when it is appropriate. Briefly explain these advantages.
2. Justify why the very-many-instruments case in treatment effect estimation with endogeneity is frequently encountered. What is the solution that we described during the course?
3. What is the regularization bias? Can it exist in a small-dimensional case (p < n)?
1. T(X) denotes the machine learning proxy resulting from a given algorithm, i.e. the prediction of the conditional average treatment effect τ(X) for a household with characteristics X. Consider the following regression on the main sample:
For questions 2-5, your answers must be backed by statistical evidence (p-values, etc.) whenever possible.
2. We train four different algorithms: an elastic net, a gradient boosting machine, a neural network and a random forest. Table 9.4 reports the statistic Λ = |β̂_2|^2 V̂(T(X)), where β̂_2 has been estimated from the above regression, for each algorithm.
(a) Explain how and why the statistic Λ can help you choose the best of the four algorithms.
(b) According to Table 9.4, which algorithm is the best?
3. Table 9.5 reports the results (estimator, 90% confidence interval and p-values) from the regression
in question (1) for the two best algorithms.
4. For a given ML proxy and k = 1, . . . , 5, we define the group G_k = 1{ℓ_{k−1} ≤ T(X) < ℓ_k} with quantiles −∞ = ℓ_0 ≤ ℓ_1 ≤ . . . ≤ ℓ_5 = +∞ such that the population is split into five groups of 20% on the basis of a ranking of households using the ML predictor. If for a household G_1 = 1, it is deemed "most affected". If for a household G_5 = 1, it is deemed "least affected". Table 9.6 reports estimates of the treatment effect for the least and most affected populations, as well as the difference.
(a) Write down the regression equation that allowed one to obtain these results. Explain how it has been estimated.
(b) Does the treatment have an effect on every household?
(c) Is there a difference in treatment effect between the most and least affected households?
5. We want to see whether the most and least affected households have different characteristics, in order to answer the initial question. Table 9.7 reports the results.
6. We now focus on estimating the Conditional Average Treatment Effect (CATE) function, where X is high-dimensional (of dimension p ≫ n, the number of observations). Give the formula for the LASSO estimators of α_1 and α_0. Propose an estimator of the CATE based on these estimators. Justify intuitively why, in practice, it does not have good properties.
7. What is the “solution” that has been proposed to handle this issue, when p < n, in the Causal
Random Forest estimator (CRF hereafter)?
8. We consider the model of this randomized control trial (RCT), where the treatment allocation is random, D ⊥ X, and

Y = X'γ + Dτ(X) + ε,  ε ⊥ (D, X),    (A.10)

where τ(X) is assumed to be linear in X, which is high-dimensional. We base our estimator of τ on

(β̂, δ̂) = argmin_{β,δ} { (1/n) Σ_{i=1}^n (Y_i − X_i'β − (D_i − E[D_i])X_i'δ)^2 + λ_β ‖β‖_1 + λ_δ ‖δ‖_1 }.    (A.11)

Identify γ and τ in terms of β and δ and give the estimator of τ based on (β̂, δ̂). Write down the estimating moment equations that we use in (A.11).
9. What do we call β in (A.11)? In this RCT context, is the estimator based on (A.11) immunized? Prove it.
10. Justify intuitively why such an estimator solves the problem mentioned in questions 6 and 7.
11. Give a context where this estimator of the CATE is better suited than the CRF estimator, and another context where the CRF is better suited.
12. Coming back to the application, one problem is that randomisation has been done at the water-meter-route level and not at the household level. This could potentially yield selection insofar as households in the same neighbourhood (sharing a water-meter route) have potentially similar water-consumption behaviours and reactions to the treatment.
We want to control for that using the following model

D = Z'γ + ζ,  Z ⊥ ζ,  Z ⊥ ε,

where Z are available auxiliary variables (e.g., median/average income at the neighbourhood level, median/average water consumption, rate of owner-occupied houses, etc.) and ε is the residual in (A.10). Write down how you would modify (A.11) to take that into account in such a way that your estimator is immunized. Show this last point.
We define S(X) = P(S = 1|X). For each individual we define a risk index at level s as R(s, X) = 1{S(X) > s}. There are special measures (such as a lockdown), denoted L = 1, which can be taken such that if L = 1 people will not get the disease, while if L = 0 they do.
1. Define the true positive rate TPR(s), the true negative rate TNR(s), the false positive rate FPR(s) and the false negative rate FNR(s) as functions of s.
Appendix: Tables
Table 9.7: Group average characteristics

                          Algo. 1                                           Algo. 2
               Least Affected   Most Affected   Difference        Least Affected   Most Affected   Difference
Voting         0.098            0.120           -0.017            0.096            0.121           -0.024
Frequency      (0.096,0.100)    (0.118,0.122)   (-0.020,-0.014)   (0.094,0.098)    (0.119,0.123)   (-0.027,-0.021)
                                                [0.000]                                            [0.000]
Democrat       0.166            0.204           -0.044            0.147            0.242           -0.087
               (0.159,0.174)    (0.197,0.212)   (-0.054,-0.033)   (0.139,0.154)    (0.234,0.249)   (-0.097,-0.077)
                                                [0.000]                                            [0.000]
Republican     0.408            0.448           -0.012            0.409            0.390           0.008
               (0.399,0.418)    (0.438,0.458)   (-0.025,0.001)    (0.400,0.419)    (0.380,0.399)   (-0.006,0.021)
                                                [0.159]                                            [0.531]

Confidence intervals in parentheses; p-values for the difference in brackets.
2. Any transformation of the instruments is itself a valid instrument, and the optimal instrument can be approximated by a family of functions (f_i)_{i=1}^p, hence coming back to the very-many-instruments case. During the course we used a sparse model for IV, assuming that only a few instruments are indeed useful, and provided a LASSO-based method to estimate the treatment effect while controlling for the estimation of the nuisance parameter.
3. The regularization bias is a bias that occurs because using ML tools in a first step yields estimators that do not converge fast enough. In the case of the Lasso, it amounts to an omitted variable bias. It can exist in a small-dimensional case if a non-conventional estimator is used in a first step or if there is a selection step (think about Leeb and Pötscher's model).
3. (a) Yes, the test of H_0: "β_1 = 0" is rejected at any conventional level. On average, households who received a water-saving incentive consume about 952 gallons (about 3,604 litres) less water than non-treated households, everything else being equal.
(b) The hypothesis that β_2 = 0 is rejected for Algo 1. For Algo 2, it is not rejected (for a test at the 5% significance level, for example), which means either that there is no heterogeneity or that Algo 2 is too weak. Since Algo 2 is dominated by Algo 1 in terms of Λ, the evidence given by Algo 1 should be given more weight. Notice that, as explained in the course, for a test of significance level α, the p-value displayed here must be below α/2 for the null to be rejected, in order to take into account the random splitting of the data.
4. (a) See the regression for the GATES:

w(X)(D − p(X))Y = Σ_{k=1}^5 γ_k G_k + ε,

where the treatment effect for the most affected is γ_1 and the treatment effect for the least affected is estimated by γ_5. Indeed, a quick calculation shows that γ_k = E[Y_1 − Y_0|G_k]. This regression can be estimated by OLS: it simply computes the mean of w(X)(D − p(X))Y over each group (a minimal sketch of this computation is given after this answer).
(b) This question cannot be answered rigorously because we test by groups, so even if the average treatment effect is significantly different from zero in the least affected group, the treatment effect (as measured by the CATE) can be zero for some individuals in this group. Furthermore, groups are based on values of the proxy predictor, which is not perfectly correlated with the true CATE. So even in the most affected group (as defined by values of the proxy predictor), the true CATE can be zero for some individuals. However, we accepted the conclusion that, with algorithm 1, even the least affected have a non-zero treatment effect, which implies that the treatment effect is significant for most people.
(c) Not really: the test for the difference is not rejected at any commonly accepted level (5%, 10%).
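As an illustration, here is a minimal sketch of this computation in Python for a randomized experiment with a constant propensity score p(X) = p and weights w(X) = 1/(p(1 − p)); the function name and this simplification are assumptions made for the example, not part of the exam solution.

```python
import numpy as np

def gates_group_means(Y, D, groups, p):
    """GATES in a randomized experiment with constant propensity p.
    gamma_k is the mean of the transformed outcome w*(D - p)*Y within group k,
    which coincides with the OLS coefficient on the group-k dummy."""
    H = (D - p) / (p * (1 - p)) * Y   # Horvitz-Thompson-type transformed outcome
    return np.array([H[groups == k].mean() for k in np.unique(groups)])
```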
5. (a) See the regression for the CLAN in Chernozhukov et al. (2017b):

X = Σ_{k=1}^5 θ_k G_k + ε,

where the mean characteristic of the most affected is θ_1 and the mean characteristic of the least affected is estimated by θ_5. It can be estimated by OLS.
(b) First of all, notice that these coefficients estimate P[VOTE = 1|G_1 = 1] and P[VOTE = 1|G_5 = 1], but since P[G_1] = P[G_5] = 0.2, Bayes' theorem implies that P[VOTE = 1|G_1 = 1]/P[VOTE = 1|G_5 = 1] = P[G_1 = 1|VOTE = 1]/P[G_5 = 1|VOTE = 1], so we can interpret the result as saying that households where voting is more frequent are more likely to be amongst the most affected by the pro-social campaign than amongst the least affected. The difference is significant.
(c) The same reasoning applies: Democrats are more likely to be amongst the most affected than amongst the least affected (the difference is significant). That difference is not significant for Republican households.
Notice that in all three cases, the test compares most and least affected, not each group to
the general population. So it could be the case that Democrats are relatively more likely to
be amongst the most affected than amongst the least affected but could be under-represented
in both of these groups as compared to the general population. [It’s not the case in this
application, but you cannot tell from the tables only].
6. For j = 0, 1:

α̂_j ∈ arg min_α (1/n) Σ_{i=1}^n 1{D_i = j}(Y_i − X_i'α)^2 + λ‖α‖_1.

The CATE can then be estimated by:

τ̂(X) = X'(α̂_1 − α̂_0).

The problem is that this estimator is composed of two estimators that are computed separately, which need not be optimal for estimating the CATE.
7. The proposed solution in the CRF is to split according to a joint criterion (not one for D = 1 and another for D = 0) specially designed to target the treatment effect, and not the outcome conditional on D.
8. Here, γ = β − E[D]δ, τ(X) = X'δ and τ̂(X) = X'δ̂, and we use the moment equations

E[(Y − X'β − (D − E[D])X'δ) X] = 0   and   E[(Y − X'β − (D − E[D])X'δ) X(D − E[D])] = 0.
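As an illustration of this joint estimation, here is a minimal sketch in Python; using scikit-learn's Lasso with a single common penalty for β and δ (instead of separate λ_β and λ_δ) and the function name are simplifying assumptions for the example, not the exam's prescription.

```python
import numpy as np
from sklearn.linear_model import Lasso

def joint_lasso_cate(Y, D, X, alpha=0.1):
    """Joint penalized regression in the spirit of (A.11):
    regress Y on [X, (D - mean(D)) * X] with an l1 penalty.
    Returns (beta_hat, delta_hat); the CATE estimate at x is x @ delta_hat."""
    Dc = D - D.mean()                    # centered treatment, sample analogue of D - E[D]
    W = np.hstack([X, Dc[:, None] * X])  # stacked design [X, (D - E[D])X]
    coef = Lasso(alpha=alpha).fit(W, Y).coef_
    p = X.shape[1]
    return coef[:p], coef[p:]
```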
9. β is a nuisance parameter. The estimator is immunized since, for the equation related to δ_j, the derivative with respect to β is −E[X X_j (D − E[D])] = −E[X X_j] E[D − E[D]] = 0, using D ⊥ X.
10. Here, the two coefficients are estimated simultaneously and are likely to yield a better estimate
of the CATE.
11. This estimator is relevant if there is sparsity in both β_0 and δ_0, i.e. only a few components of X are relevant to predict the baseline outcome and the treatment effect heterogeneity, and if both the CATE and the regression function for the baseline outcome are linear. The CRF is better suited in a context where the regression function for the baseline outcome is piecewise constant.
12. We replace E[D] by Z'γ and add a penalty, together with a potential sparsity assumption, to handle the potential high dimensionality of γ, which is a nuisance parameter. This yields

(β̂, γ̂, δ̂) = argmin_{β,γ,δ} { (1/n) Σ_{i=1}^n (Y_i − X_i'β − (D_i − Z_i'γ)X_i'δ)^2 + λ_β ‖β‖_1 + λ_γ ‖γ‖_1 + λ_δ ‖δ‖_1 }.

It is immunized (in the sense of the definition of the course) since, for the equation related to δ_j, the derivative with respect to β is

−E[X X_j (D − Z'γ)] = −E[X X_j] E[D − Z'γ] = 0.
List of Theorems and Lemmas
4.1 Theorem (Necessary condition for optimal instruments, Theorem 5.3 in Newey and Mc-
Fadden (1994) p. 2166) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.1 Theorem (Rates for Lasso Under Non-Gaussian and Heteroscedastic Errors, Theorem 1
in Belloni et al. (2012a)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.1 Lemma (Lemma 6 in Belloni et al. (2012a)) . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Theorem (Asymptotic Normality of the Split-Sample Immunized Estimator, Theorem 7
in Belloni et al. (2012a)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Theorem (Cluster-Lasso convergence rates in panels, Theorem 1 in Belloni et al. (2016)) . 70
5.4 Theorem (Asymptotic normality of the Cluster-Lasso estimator for treatment effect in
panels, Theorem 2 in Belloni et al. (2016)) . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.1 Lemma (Control in probability of the leaf diameter in uniform random forests, Lemma 2
in Wager and Athey (2017)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2 Lemma (Consistency of the double sample random forests, Theorem 3 in Wager and Athey
(2017)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.1 Theorem (Asymptotic normality of double sample causal random forests, Theorem 1 in
Wager and Athey (2017)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Theorem (Asymptotic normality of GRF for the instrumental variable model (7.20), The-
orem 5 in Athey et al. (2019)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.3 Theorem (Consistency of the Best linear predictor estimator, Theorem 2.2 in Chernozhukov
et al. (2017b)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.4 Theorem (Uniform asymptotic size for the unconditional test with sample splitting, The-
orem 3.1 in Chernozhukov et al. (2017b)) . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.5 Theorem (Uniform validity of the confidence interval CI with sample splitting, Theorem
3.2, in Chernozhukov et al. (2017b)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Bibliography
Abadie, A. (2019). Using synthetic controls: Feasibility, data requirements, and methodological aspects.
Working Paper.
Abadie, A., Angrist, J., and Imbens, G. (2002). Instrumental variables estimates of the effect of subsidized
training on the quantiles of trainee earnings. Econometrica, 70(1):91–117.
Abadie, A., Athey, S., Imbens, G. W., and Wooldridge, J. M. (2018). Sampling-based vs. design-based
uncertainty in regression analysis. Working Paper.
Abadie, A. and Cattaneo, M. D. (2018). Econometric methods for program evaluation. Annual Review
of Economics, 10(1):465–503.
Abadie, A., Diamond, A., and Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490):493–505.
Abadie, A., Diamond, A., and Hainmueller, J. (2015). Comparative politics and the synthetic control
method. American Journal of Political Science, 59(2):495–510.
Abadie, A. and Gardeazabal, J. (2003). The Economic Costs of Conflict: A Case Study of the Basque
Country. American Economic Review, 93(1):113–132.
Abadie, A. and Kasy, M. (2017). The Risk of Machine Learning. arXiv e-prints, page arXiv:1703.10935.
Abadie, A. and L’Hour, J. (2019). A penalized synthetic control estimator for disaggregated data.
Working Paper.
Acemoglu, D., Johnson, S., Kermani, A., Kwak, J., and Mitton, T. (2016). The value of connections in turbulent times: Evidence from the United States. Journal of Financial Economics, 121:368–391.
Allegretto, S., Dube, A., Reich, M., and Zipperer, B. (2013). Credible Research Designs for Minimum
Wage Studies. IZA Discussion Papers 7638, Institute for the Study of Labor (IZA).
Alquier, P., Gautier, E., and Stoltz, G. (2011). Inverse Problems and High-Dimensional Estimation:
Stats in the Château Summer School, August 31 - September 4, 2009. Springer Berlin Heidelberg,
Berlin, Heidelberg.
Amemiya, T. (1974). The nonlinear two-stage least-squares estimator. Journal of econometrics, 2(2):105–
110.
Angrist, J. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion.
Princeton University Press, 1 edition.
Angrist, J. D. and Krueger, A. B. (1991). Does compulsory school attendance affect schooling and
earnings? The Quarterly Journal of Economics, 106(4):979–1014.
Angrist, J. D. and Pischke, J.-S. (2010). The credibility revolution in empirical economics: How better
research design is taking the con out of econometrics. Journal of Economic Perspectives, 24(2):3–30.
Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of
the National Academy of Sciences, 113(27):7353–7360.
Athey, S. and Imbens, G. W. (2017). The state of applied econometrics: Causality and policy evaluation.
Journal of Economic Perspectives, 31(2):3–32.
Athey, S. and Imbens, G. W. (2019). Machine learning methods that economists should know about.
Annual Review of Economics, 11.
Athey, S., Tibshirani, J., and Wager, S. (2019). Generalized random forests. The Annals of Statistics,
47(2):1148–1178.
Athey, S. and Wager, S. (2017). Efficient policy learning. arXiv preprint arXiv:1702.02896.
Athey, S. and Wager, S. (2019). Estimating treatment effects with causal forests: An application. arXiv
preprint arXiv:1902.07409.
Baltagi, B. (2008). Econometric analysis of panel data. John Wiley & Sons.
Banerjee, A., Chandrasekhar, A. G., Duflo, E., and Jackson, M. O. (2014). Using Gossips to Spread
Information: Theory and Evidence from a Randomized Controlled Trial. ArXiv e-prints.
Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012a). Sparse models and methods for
optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429.
Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012b). Sparse models and methods for
optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429.
Belloni, A. and Chernozhukov, V. (2011). High Dimensional Sparse Econometric Models: An Introduc-
tion, pages 121–156. Springer Berlin Heidelberg, Berlin, Heidelberg.
Belloni, A. and Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse
models. Bernoulli, 19(2):521–547.
Belloni, A., Chernozhukov, V., Chetverikov, D., and Hansen, C. (2018). High dimensional econometrics
and regularized gmm. arXiv:1806.01888, Contributed chapter for Handbook of Econometrics.
Belloni, A., Chernozhukov, V., Fernández-Val, I., and Hansen, C. (2017a). Program evaluation and
causal inference with high-dimensional data. Econometrica, 85(1):233–298.
Belloni, A., Chernozhukov, V., and Hansen, C. (2010). LASSO Methods for Gaussian Instrumental
Variables Models. ArXiv e-prints.
Belloni, A., Chernozhukov, V., and Hansen, C. (2014). High-Dimensional Methods and Inference on
Structural and Treatment Effects. Journal of Economic Perspectives, 28(2):29–50.
Belloni, A., Chernozhukov, V., Hansen, C., and Kozbur, D. (2016). Inference in high-dimensional panel
models with an application to gun control. Journal of Business & Economic Statistics, 34(4):590–605.
Belloni, A., Chernozhukov, V., Hansen, C., and Newey, W. (2017b). Simultaneous confidence intervals
for high-dimensional linear models with many endogenous variables. arXiv preprint arXiv:1712.08102.
Ben-Michael, E., Feller, A., and Rothstein, J. (2019). Synthetic controls and weighted event studies with
staggered adoption.
Berry, S., Levinsohn, J., and Pakes, A. (1995). Automobile prices in market equilibrium. Econometrica:
Journal of the Econometric Society, pages 841–890.
Berry, S. T. (1994). Estimating discrete-choice models of product differentiation. The RAND Journal of
Economics, pages 242–262.
Bertrand, M., Luttmer, E. F. P., and Mullainathan, S. (2000). Network effects and welfare cultures*.
The Quarterly Journal of Economics, 115(3):1019–1055.
Bhattacharya, D. and Dupas, P. (2012). Inferring welfare maximizing treatment assignment under budget
constraints. Journal of Econometrics, 167(1):168–196.
Biau, G. and Scornet, E. (2016). A random forest guided tour. Test, 25(2):197–227.
Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and dantzig selector.
Ann. Statist., 37(4):1705–1732.
Bléhaut, M., D’Haultfœuille, X., L’Hour, J., and Tsybakov, A. (2017). A parametric generalization of
the synthetic control method, with high dimension. Working Paper.
Bohn, S., Lofstrom, M., and Raphael, S. (2014). Did the 2007 legal arizona workers act reduce the state’s
unauthorized immigrant population? Review of Economics and Statistics, 96(2):258–269.
Boucher, V. and Fortin, B. (2015). Some Challenges in the Empirics of the Effects of Networks. IZA
Discussion Papers 8896, Institute for the Study of Labor (IZA).
Bramoullé, Y., Djebbari, H., and Fortin, B. (2009). Identification of peer effects through social networks.
Journal of Econometrics, 150(1):41–55.
Brock, W. A. and Durlauf, S. N. (2001). Discrete choice with social interactions. The Review of Economic
Studies, 68(2):235–260.
Bühlmann, P., Yu, B., et al. (2002). Analyzing bagging. The Annals of Statistics, 30(4):927–961.
Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and
Applications. Springer-Verlag Berlin Heidelberg.
Candes, E. and Tao, T. (2007). The dantzig selector: Statistical estimation when p is much larger than
n. Ann. Statist., 35(6):2313–2351.
Card, D. (1990). The impact of the Mariel boatlift on the Miami labor market. ILR Review, 43(2):245–257.
Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling.
NBER working paper, (w4483).
Chamberlain, G. (1980). Analysis of covariance with qualitative data. The Review of Economic Studies,
47(1):225–238.
Chandrasekhar, A. (2016). Econometrics of network formation. The Oxford Handbook of the Economics
of Networks, pages 303–357.
Chatterjee, S., Diaconis, P., and Sly, A. (2010). Random graphs with a given degree sequence. ArXiv
e-prints.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. (2017a). Dou-
ble/debiased/neyman machine learning of treatment effects. American Economic Review, 107(5):261–
65.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J.
(2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics
Journal, 21(1):C1–C68.
Chernozhukov, V., Demirer, M., Duflo, E., and Fernandez-Val, I. (2017b). Generic machine learning infer-
ence on heterogenous treatment effects in randomized experiments. arXiv preprint arXiv:1712.04802.
Chernozhukov, V., Hansen, C., and Spindler, M. (2015). Post-Selection and Post-Regularization Inference
in Linear Models with Many Controls and Instruments. American Economic Review, 105(5):486–90.
Chernozhukov, V., Hansen, C., and Spindler, M. (2015). Valid post-selection and post-regularization
inference: An elementary, general approach. Annu. Rev. Econ., 7(1):649–688.
Chernozhukov, V., Wuthrich, K., and Zhu, Y. (2017). An Exact and Robust Conformal Inference Method
for Counterfactual and Synthetic Controls. arXiv e-prints, page arXiv:1712.09089.
Cornwell, C. and Trumbull, W. N. (1994). Estimating the economic model of crime with panel data.
The Review of economics and Statistics, pages 360–366.
Cox, D. R. (1958). The regression analysis of binary sequences (with discussion). J Roy Stat Soc B,
20:215–242.
Crépon, B., Duflo, E., Gurgand, M., Rathelot, R., and Zamora, P. (2013). Do labor market policies have
displacement effects? evidence from a clustered randomized experiment. The Quarterly Journal of
Economics, 128(2):531–580.
Cunningham, S. and Shah, M. (2017). Decriminalizing indoor prostitution: Implications for sexual
violence and public health. NBER Working Papers, 20281.
Darolles, S., Fan, Y., Florens, J.-P., and Renault, E. (2011). Nonparametric instrumental regression.
Econometrica, 79(5):1541–1565.
Davis, J. and Heller, S. B. (2017a). Rethinking the benefits of youth employment programs: The
heterogeneous effects of summer jobs. Technical report, National Bureau of Economic Research.
Davis, J. and Heller, S. B. (2017b). Using causal forests to predict treatment heterogeneity: An appli-
cation to summer jobs. American Economic Review, 107(5):546–50.
de Paula, A. (2015). Econometrics of network models. CeMMAP working papers CWP52/15, Centre
for Microdata Methods and Practice, Institute for Fiscal Studies.
Efron, B. (2014). Estimation and accuracy after model selection. Journal of the American Statistical
Association, 109(507):991–1007.
Erdös, P. and Rényi, A. (1959). On random graphs I. Publicationes Mathematicae Debrecen, 6:290.
Erdös, P. and Rényi, A. (1960). On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences, pages 17–61.
Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than
observations. Journal of Econometrics, 189(1):1 – 23.
Ferman, B. and Pinto, C. (2016). Revisiting the synthetic control estimator. Working Paper.
Fuller, W. A. (1977). Some properties of a modification of the limited information estimator. Economet-
rica, pages 939–953.
Gautier, E. and Tsybakov, A. (2011). High-dimensional instrumental variables regression and confidence
sets. arXiv preprint arXiv:1105.2454.
Gautier, E. and Tsybakov, A. B. (2013). Pivotal estimation in high-dimensional regression via linear
programming. In Empirical inference, pages 195–204. Springer.
Gobillon, L. and Magnac, T. (2016). Regional policy evaluation: Interactive fixed effects and synthetic
controls. The Review of Economics and Statistics, 98(3):535–551.
Graham, B. S. (2017). An econometric model of network formation with degree heterogeneity. Econo-
metrica, 85(4):1033–1063.
Graham, B. S. (2019). Network data. Working Paper 26577, National Bureau of Economic Research.
Hackmann, M. B., Kolstad, J. T., and Kowalski, A. E. (2015). Adverse selection and an individual
mandate: When theory meets practice. American Economic Review, 105(3):1030–1066.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average
treatment effects. Econometrica, 66(2):315–332.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data mining,
Inference and Prediction. Springer, 2nd edition.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association,
81(396):945–960.
Hsu, Y.-C. (2017). Consistent tests for conditional treatment effects. The Econometrics Journal, 20(1):1–
22.
Hussam, R., Rigol, N., and Roth, B. (2017). Targeting high ability entrepreneurs using community
information: Mechanism design in the field. Unpublished Manuscript.
Ibragimov, R. and Sharakhmetov, S. (2002). The exact constant in the rosenthal inequality for random
variables with mean zero. Theory of Probability & Its Applications, 46(1):127–132.
Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences.
Number 9780521885881 in Cambridge Books. Cambridge University Press.
Jackson, M. O. (2008). Social and Economic Networks. Princeton University Press, Princeton, NJ, USA.
Kitagawa, T. and Tetenov, A. (2017a). Equality-minded treatment choice. Technical report, Centre for
Microdata Methods and Practice, Institute for Fiscal Studies.
Kitagawa, T. and Tetenov, A. (2017b). Who should be treated? empirical welfare maximization methods
for treatment choice. Working Paper.
Kleven, H. J., Landais, C., and Saez, E. (2013). Taxation and international migration of superstars:
Evidence from the european football market. American Economic Review, 103(5):1892–1924.
Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models. Springer Publishing
Company, Incorporated, 1st edition.
LaLonde, R. J. (1986). Evaluating the Econometric Evaluations of Training Programs with Experimental
Data. American Economic Review, 76(4):604–20.
Leamer, E. E. (1983). Let’s take the con out of econometrics. The American Economic Review, 73(1):31–
43.
Leeb, H. (2006). The distribution of a linear predictor after model selection: Unconditional finite-sample
distributions and asymptotic approximations, volume Number 49 of Lecture Notes–Monograph Series,
pages 291–311. Institute of Mathematical Statistics, Beachwood, Ohio, USA.
Leeb, H. and Pötscher, B. M. (2005). Model selection and inference: Facts and fiction. Econometric
Theory, null:21–59.
Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2017). Distribution-free predictive
inference for regression. Journal of the American Statistical Association.
List, J. A., Shaikh, A. M., and Xu, Y. (2016). Multiple hypothesis testing in experimental economics.
Technical report, National Bureau of Economic Research.
Manresa, E. (2016). Estimating the structure of social interactions using panel data. Working Paper.
Manski, C. F. (1993). Identification of endogenous social effects: The reflection problem. The Review of
Economic Studies, 60(3):531–542.
Mullainathan, S. and Spiess, J. (2017). Machine learning: An applied econometric approach. Journal of
Economic Perspectives, 31(2):87–106.
Nevo, A. (2001). Measuring market power in the ready-to-eat cereal industry. Econometrica, 69(2):307–
342.
Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of
econometrics, 4:2111–2245.
Nie, X. and Wager, S. (2017). Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint
arXiv:1712.04912.
Oliveira, R. (2013). The lower tail of random quadratic forms, with applications to ordinary least squares
and restricted eigenvalue properties. ArXiv e-prints.
Patacchini, E. and Zenou, Y. (2008). The strength of weak ties in crime. European Economic Review,
52(2):209–236.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press, New York,
NY, USA.
Powers, S., Qian, J., Jung, K., Schuler, A., Shah, N. H., Hastie, T., and Tibshirani, R. (2017).
Some methods for heterogeneous treatment effect estimation in high-dimensions. arXiv preprint
arXiv:1707.00102.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies.
Journal of educational Psychology, 66(5):688.
Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements. Information
Theory, IEEE Transactions on, 59(6):3434–3447.
Sacerdote, B. (2011). Peer Effects in Education: How Might They Work, How Big Are They and How
Much Do We Know Thus Far?, volume 3 of Handbook of the Economics of Education, chapter 4,
pages 249–277. Elsevier.
Tibshirani, R. (1994). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society, Series B, 58:267–288.
van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist.,
36(2):614–645.
Van der Vaart, A. W. (1998). Asymptotic statistics, volume 3. Cambridge university press.
Wager, S. and Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using
random forests. Journal of the American Statistical Association.
Wager, S., Hastie, T., and Efron, B. (2014). Confidence intervals for random forests: The jackknife and
the infinitesimal jackknife. The Journal of Machine Learning Research, 15(1):1625–1651.
Wager, S. and Walther, G. (2015). Adaptive Concentration of Regression Trees, with Application to
Random Forests. ArXiv e-prints.
Wasserman, L. (2010). All of statistics : a concise course in statistical inference. Springer, New York.
Xu, Y. (2017). Generalized synthetic control method: Causal inference with interactive fixed effects
models. Political Analysis, 25(01):57–76.
Xu, Y. and Liu, L. (2018). gsynth: Generalized Synthetic Control Method. R package version 1.0.9.
Yang, Y. (2005). Can the strengths of aic and bic be shared? a conflict between model indentification
and regression estimation. Biometrika, 92(4):937–950.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res., 7:2541–2563.