
Machine Learning for Econometrics

ENSAE Paris – IP Paris

Christophe Gaillac (TSE and CREST)          Jérémy L’Hour (CREST and Insee)

This Version: 2019-2020 class
Last Edited: June 4, 2020

Draft. Christophe Gaillac, Toulouse School of Economics and CREST, ENSAE Paris,
[email protected]. Jérémy L’Hour, CREST, ENSAE Paris and INSEE, [email protected].
We thank Pierre Alquier, Xavier D’Haultfœuille and Anna Simoni for their help and comments. Comments welcome.
Summary

These are the lecture notes for the course Machine Learning for Econometrics
(previously High-Dimensional Econometrics) taught in the third year
of ENSAE Paris and the second year of the Master in Economics of Institut
Polytechnique de Paris. They cover recent applications of high-dimensional
statistics and machine learning to econometrics, including variable selection,
inference with high-dimensional nuisance parameters in different settings, het-
erogeneity, networks and analysis of text data. The focus will be on policy
evaluation problems. Recent advances in causal inference such as the synthetic
controls method will be reviewed.
The goal of the course is to give insights about these new methods, their
implementation, their benefits and their limitations. The course is a bridge
between econometrics and machine learning, and will mostly benefit students
who are highly curious about recent advances in econometrics, whether they
want to study the theory or use them in applied work. Students are expected
to be familiar with Econometrics 2 (2A) and Statistical Learning (3A).

Contents

1 Introduction and Roadmap 5


1.1 Brief Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Resources and Reading List . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 High-Dimension, Variable Selection and Post-Selection Inference 11


2.1 The Post-Selection Inference Problem . . . . . . . . . . . . . . . . . . . . 12
2.2 High-Dimension, Sparsity and the Lasso . . . . . . . . . . . . . . . . . . 18
2.3 Some Theory on the Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 The Regularization Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 Theory: Immunization/Orthogonalization Procedure . . . . . . . . . . . 28
2.6 The Double Selection Method . . . . . . . . . . . . . . . . . . . . . . . . 33
2.7 Auxiliary Mathematical Results . . . . . . . . . . . . . . . . . . . . . . . 35

3 Methodology: Using Machine Learning Tools in Econometrics 37


3.1 Double Machine Learning and Sample-Splitting . . . . . . . . . . . . . . 38
3.2 Orthogonal Scores to Estimate Treatment Effects . . . . . . . . . . . . . 39
3.3 Simulation Study: The Regularization Bias . . . . . . . . . . . . . . . . . 41
3.4 Empirical Application: Job Training Program . . . . . . . . . . . . . . . 42

4 High-Dimension and Endogeneity 45


4.1 The Optimal Instruments Problem . . . . . . . . . . . . . . . . . . . . . 46
4.2 Sparse Model for Instrumental Variables . . . . . . . . . . . . . . . . . . 49
4.3 Immunization Procedure for Instrumental Variable Models . . . . . . . . 51
4.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 Empirical Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.5.1 Logit Demand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5.2 Instrument Selection to estimate returns to schooling . . . . . . . 58

5 Further Developments: Variable Selection with Non-Gaussian Errors, Sample-Splitting and Panel Data 61
5.1 High-Quality Nuisance Estimation with Non-Gaussian Errors . . . . . . . 61
5.2 Sample-Splitting to Relax the Growth Rate Assumption . . . . . . . . . 66
5.3 Regularization and Selection of Instruments in Panels . . . . . . . . . . . 68

6 The Synthetic Control Method 73


6.1 Setting and Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2 A Result on the Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.3 The Impact of Election Day Registration on Voter Turnout . . . . . . . . 81
6.4 When and Why Using the Synthetic Control Method . . . . . . . . . . . 83
6.5 Inference Using Permutation Tests . . . . . . . . . . . . . . . . . . . . . 84
6.5.1 Permutation Tests in a Simple Framework . . . . . . . . . . . . . 86
6.5.2 Confidence Intervals by Inversion of Tests . . . . . . . . . . . . . 88
6.5.3 Application to the Election Day Registration . . . . . . . . . . . . 90
6.6 Multiple Treated Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.7 Conclusion and Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.8 Supplementary Application: The California Tobacco Control Program . . 94
6.9 Auxiliary Mathematical Results . . . . . . . . . . . . . . . . . . . . . . . 96
6.10 Exercises and Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

7 Machine Learning Methods for Heterogeneous Treatment Effects 101


7.1 Motivation: Optimal Policy Learning . . . . . . . . . . . . . . . . . . . . 101
7.2 Heterogeneous Effects with Selection on Observables . . . . . . . . . . . 103
7.2.1 Nonparametric Estimation of the Treatment Effects . . . . . . . . 104
7.2.2 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.2.3 Complements and Alternative Algorithm . . . . . . . . . . . . . . 119
7.2.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.3 Heterogeneous Effects with Instrumental Variables . . . . . . . . . . . . . 122

7.3.1 Application to heterogeneity in the effect of subsidized training on
trainee earnings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.4 Estimating Features of Heterogeneous Effects with Selection on Observables 129
7.4.1 Estimation of Key Features of the CATE . . . . . . . . . . . . . . 130
7.4.2 Inference About Key Features of the CATE . . . . . . . . . . . . 133
7.4.3 Algorithm: Inference about key features of the CATE . . . . . . . 136
7.4.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.4.5 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

8 Network Data and Peer Effects 141


8.1 Vocabulary and Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 142
8.2 Network Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.3 Estimating Peer Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.3.1 The Linear Model of Social Interactions . . . . . . . . . . . . . . 149
8.3.2 Estimating the Structure of Social Interactions . . . . . . . . . . . 152

9 Appendix 155
A Exam 2018 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
B Exam 2019 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
C Exam 2020 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

Chapter 1

Introduction and Roadmap

1.1 Brief Introduction


Doing empirical research involves, most of the time, crucial choices, such as the func-
tional form of the equation to be estimated (e.g. linear, quadratic, distribution of the
error terms), the number and the identity of the control variables, or the choice of the
instruments. These choices leave room for arbitrariness and, more dangerously, for p-
hacking (i.e. making these choices so as to obtain results that suit the researcher’s prior
beliefs or expected results). In any event, they open the way for criticism of the results.
The increasing availability of large datasets and the advances of Machine Learning (ML)
have both made this problem more acute (e.g. more covariates to include, traditional
methods not working, ML algorithms to train and select), while providing potential solu-
tions. In this course, we will focus on high-dimensional methods (i.e. which can handle a
large number of covariates and/or instrumental variables) and on some ML methods with
the goal of performing causal inference in mind – Machine Learning for Econometrics.
Let us now briefly describe the problems that this course deals with – we will be mainly
focusing on policy evaluation or causal inference, although these tools apply more broadly.
We want to emphasize that the tools should be selected according to the parameters of
interest. We introduce the potential outcome model of Rubin (1974). Let Yi (0) be the
potential outcome for individual i if not treated and Yi (1) the potential outcome if treated.
We only observe the treatment status Di ∈ {0, 1} and the realized outcome Yi^obs defined by:
$$Y_i^{obs} = Y_i(D_i) = \begin{cases} Y_i(0) & \text{if } D_i = 0, \\ Y_i(1) & \text{if } D_i = 1. \end{cases}$$

A first quantity of interest is the average treatment effect τ0 := E [Yi (1) − Yi (0)], which is

the average impact of the intervention among the population. When the treatment assign-
ment is random conditional on some observables (i.e. the assumption that E[εi |Di , Xi ] =
0 in the model below) and under the assumption that there exists only a limited number
of significant covariates (sparsity), Chapter 2 gives tools to handle the estimation of τ0
in the model
$$Y_i = D_i\tau_0 + X_i'\beta_0 + \varepsilon_i, \quad \text{with } \mathbb{E}[\varepsilon_i] = 0 \text{ and } \mathbb{E}[\varepsilon_i \mid D_i, X_i] = 0,$$
where Xi is a vector of p exogenous control variables, p being possibly larger than the
number of observations. The large dimension of Xi, in combination with the sparsity
assumption, opens the door to using selection methods such as the Lasso, which that chapter
reviews in detail. Chapter 3 uses the intuition explained in the preceding chapter but
presents a more general framework and introduces sample-splitting, a crucial device when
using non-standard tools such as ML estimators.
Chapter 4 then explains how to adapt these tools when the econometrician relaxes the
exogeneity assumption, i.e. now assumes E[εi | Di, Xi] ≠ 0, but possesses a (possibly large)
number of instrumental variables Zi, all satisfying the exogeneity assumption E[εi | Zi] = 0.
Going further, Chapter 5 develops the theoretical refinements of the tools presented so
far, with the aim of using weaker assumptions. It specifically deals with non-Gaussian
errors, sample-splitting and panel data.
However, the average treatment effect (τ0) does not describe heterogeneity in
the reactions to the intervention – some people might benefit a lot from the intervention,
while others may not respond at all or see their outcome worsen. Chapter 7 thus
deals with a more complex parameter of interest, which is the average treatment effect
conditional on some (observed) covariates, τ : x ↦ E[Yi(1) − Yi(0) | Xi = x]. Causal
random forests are tools adapted from machine learning that allow one to make inference
about the function τ(·), i.e. to test for a significant effect of the treatment conditional on
the covariates taking the value x. However, the theory requires strong hypotheses to obtain
such tests. The end of Chapter 7 lowers our expectations to only performing inference about
features of the conditional average treatment effect. This allows one to use ML methods, under
few hypotheses, to test for heterogeneity of the treatment effect or to obtain information
about its shape.
Thus, the researcher has to choose the relevant methods among those presented in

this course according to the parameter of interest or the available data. This choice is
summarized in Figure 1.1.

Figure 1.1: A brief road map to the methods presented throughout the chapters

Furthermore, Chapter 6 presents the synthetic control method, an inherently high-


dimensional tool particularly useful for policy evaluation with aggregate data. It also
takes the opportunity to introduce permutation inference. Chapter 8 is a short introduc-
tion to econometric methods to perform inference in network data, i.e. data stemming
from observations linked to each other in a particular fashion.
Each chapter starts by introducing the problem. Besides the theory, we develop
empirical examples taken from the relevant literature or Monte Carlo simulations.

1.2 Resources and Reading List

These notes should be self-contained. Due to the nature of the course, no textbook
currently covers the same material. We provide general references in each chapter. A
GitHub repository for the class is available at github.com/jlhourENSAE/hdmetrics and
contains (mostly) R code. That being said, we list below a limited number of general
references. We encourage you to read them before the class.

Introduction

Athey, S. and Imbens, G. W. (2019). Machine learning methods that economists should
know about. Annual Review of Economics, 11
Mullainathan, S. and Spiess, J. (2017). Machine learning: An applied econometric ap-
proach. Journal of Economic Perspectives, 31(2):87–106

High-Dimension, Variable Selection and Post-Selection Inference

Belloni, A., Chernozhukov, V., and Hansen, C. (2014). High-Dimensional Methods and In-
ference on Structural and Treatment Effects. Journal of Economic Perspectives, 28(2):29–
50

Using Machine Learning Tools in Econometrics

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W.,
and Robins, J. (2018). Double/debiased machine learning for treatment and structural
parameters. The Econometrics Journal, 21(1):C1–C68

High-Dimension and Endogeneity

Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012a). Sparse models and
methods for optimal instruments with an application to eminent domain. Econometrica,
80(6):2369–2429
Chernozhukov, V., Hansen, C., and Spindler, M. (2015). Valid post-selection and post-
regularization inference: An elementary, general approach. Annu. Rev. Econ., 7(1):649–
688

The Synthetic Control Method

Abadie, A. (2019). Using synthetic controls: Feasibility, data requirements, and method-
ological aspects. Working Paper
Abadie, A., Diamond, A., and Hainmueller, J. (2010). Synthetic control methods for
comparative case studies: Estimating the effect of california’s tobacco control program.
Journal of the American Statistical Association, 105(490):493–505
Chapter 5 in Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, So-
cial, and Biomedical Sciences. Number 9780521885881 in Cambridge Books. Cambridge
University Press

Heterogeneous Treatment Effects

Wager, S. and Athey, S. (2017). Estimation and inference of heterogeneous treatment


effects using random forests. Journal of the American Statistical Association
Athey, S., Tibshirani, J., and Wager, S. (2019). Generalized random forests. The Annals
of Statistics, 47(2):1148–1178
Chernozhukov, V., Demirer, M., Duflo, E., and Fernandez-Val, I. (2017b). Generic ma-
chine learning inference on heterogenous treatment effects in randomized experiments.
arXiv preprint arXiv:1712.04802

Network Data

Chandrasekhar, A. (2016). Econometrics of network formation. The Oxford Handbook of


the Economics of Networks, pages 303–357
de Paula, A. (2015). Econometrics of network models. CeMMAP working papers CWP52/15,
Centre for Microdata Methods and Practice, Institute for Fiscal Studies

Chapter 2

High-Dimension, Variable Selection


and Post-Selection Inference

Model selection and parsimony among explanatory variables are traditional scientific
problems that have a particular echo in statistics and econometrics. They have received
growing attention over the past two decades, as high-dimensional datasets have become
increasingly available to statisticians in various fields. But even with a small dataset,
high-dimensional problems can occur, for example when doing series estimation of a non-
parametric model. In practice, applied econometricians often select variables by trial
and error, guided by their intuition, and report results based on the assumption that the
selected model is the true one. These results are often backed by further sensitivity analysis
and robustness checks. However, the variable selection step of empirical work is rarely
fully acknowledged although it is not innocuous. Leamer (1983) was one of the first
econometric papers to address this problem. For a modern presentation, see Leeb and
Pötscher (2005) and, in the context of policy evaluation, Belloni et al. (2014).
Section 2.1 serves as a larger introduction and describes the problem posed by post-
selection inference. Sections 2.2 and 2.3 introduce the Lasso estimator as it is often
used as a selection device. Section 2.4 builds on the intuition of section 2.1 to deal with
the regularization bias. Section 2.5 exposes the key theoretical concepts to deal with
post-selection inference and Section 2.6 considers its application in simple cases.

Notations. a ≲ b means that a ≤ cb for some constant c > 0 that does not depend on the
sample size n. ϕ and Φ respectively denote the pdf and the cdf of the standard Gaussian
distribution. For a vector δ ∈ R^p, ‖δ‖_0 := Card{1 ≤ j ≤ p, δ_j ≠ 0} and ‖δ‖_∞ := max_{j=1,...,p} |δ_j|.

The m-sparse norm of a matrix Q is defined as:
$$\|Q\|_{sp(m)} := \sup_{\|\delta\|_0 \le m,\ \|\delta\|_2 > 0} \frac{\sqrt{\delta^T Q \delta}}{\|\delta\|_2}.$$

CLT = Central Limit Theorem; LLN = Law of Large Numbers; CMT = Continuous
Mapping Theorem.

2.1 The Post-Selection Inference Problem

We begin by analyzing the two-step inference method described in the introduction (se-
lecting the model first, then reporting results from that model as if it were the truth).
This Section is based on the work of Leeb and Pötscher (2005).

Assumption 2.1 (Possibly Sparse Gaussian Linear Model). Consider the iid sequence
of random variables (Yi , Xi )i=1,...,n such that:

Yi = Xi,1 τ0 + Xi,2 β0 + εi ,

where εi ∼ N(0, σ²), σ² is known, Xi = (X_{i,1}, X_{i,2}) is of dimension 2, εi ⊥ Xi, and
E(X_i X_i') is non-singular. We use the following shorthand notation for the OLS variance-covariance matrix elements:
$$\begin{pmatrix} \sigma_\tau^2 & \sigma_{\tau,\beta} \\ \sigma_{\tau,\beta} & \sigma_\beta^2 \end{pmatrix} := \sigma^2 \left[ \frac{1}{n}\sum_{i=1}^n X_i X_i' \right]^{-1}.$$

The most sparse true model is coded by M0 , a random variable taking value R (“re-
stricted”) if β0 = 0 and U (“unrestricted”) otherwise.

Everything in Section 2.1 will be conditional on the covariates (Xi )1≤i≤n but we leave
that dependency hidden. In particular, conditional on the covariates, the unrestricted
estimator is normally distributed:
$$\sqrt{n}\begin{pmatrix} \hat\beta(U) - \beta_0 \\ \hat\tau(U) - \tau_0 \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_\beta^2 & \rho\sigma_\beta\sigma_\tau \\ \rho\sigma_\beta\sigma_\tau & \sigma_\tau^2 \end{pmatrix} \right),$$
where ρ = σ_{τ,β}/(σ_τ σ_β).

Consistent Model Selection Procedure. The econometrician is interested in performing inference on the parameter τ0 and wonders whether he should include X_{i,2} in the regression. At the end, he reports the result from the model M̂ he has selected in a first step. Denote by τ̂(U) and β̂(U) the OLS estimators in the unrestricted model (model U), and by τ̂(R) and β̂(R) = 0 the restricted OLS estimators (model R). The econometrician includes X_{i,2} in the model if its Student statistic is large enough:

Assumption 2.2 (Decision Rule).
$$\hat M = \begin{cases} U & \text{if } \left|\sqrt{n}\,\hat\beta(U)/\sigma_\beta\right| > c_n, \\ R & \text{otherwise}, \end{cases} \tag{2.1}$$
with c_n → ∞ and c_n/√n → 0 as n → ∞.

The AIC criterion corresponds to c_n = √2 and the BIC to c_n = √(log n). How does this selection method perform asymptotically?

Lemma 2.1 (Model Selection Consistency of (2.1)). For M0 ∈ {U, R},
$$P_{M_0}\left(\hat M = M_0\right) \to 1,$$
as n → ∞, where P_{M_0} indicates the probability distribution of M̂ under the true model M0.

Proof of Lemma 2.1 Considering the selection rule (2.1) and the Gaussian distribution of β̂(U) implied by Assumption 2.1:
$$\begin{aligned}
P\left(\hat M = R\right) &= P\left(\left|\sqrt{n}\,\hat\beta(U)/\sigma_\beta\right| \le c_n\right) \\
&= P\left(-c_n - \sqrt{n}\beta_0/\sigma_\beta \le \sqrt{n}(\hat\beta(U) - \beta_0)/\sigma_\beta \le c_n - \sqrt{n}\beta_0/\sigma_\beta\right) \\
&= \Phi\left(c_n - \sqrt{n}\beta_0/\sigma_\beta\right) - \Phi\left(-c_n - \sqrt{n}\beta_0/\sigma_\beta\right) \\
&= \Phi\left(\sqrt{n}\beta_0/\sigma_\beta + c_n\right) - \Phi\left(\sqrt{n}\beta_0/\sigma_\beta - c_n\right) \\
&= \Delta\left(\sqrt{n}\beta_0/\sigma_\beta,\ c_n\right),
\end{aligned}$$
with Δ(a, b) := Φ(a + b) − Φ(a − b), where the fourth equality uses the symmetry of the Gaussian distribution, Φ(−x) = 1 − Φ(x). From this equation and the restrictions on c_n, the probability that M̂ = R tends to one if β0 = 0 (M0 = R) and to zero otherwise (M0 = U). □

Remark 2.1. Since the probability of selecting the true model tends to one with the sample size, Lemma 2.1 might induce you to think that a consistent model selection procedure allows inference to be performed “as usual”, i.e. that the model selection step can be overlooked. However, for any given sample size n, the probability of selecting the true model can be very small if β0 is close to zero without being exactly zero. For example, assume that β0 = δσ_β c_n/√n with |δ| < 1; then √n β0/σ_β = δc_n and the probability of selecting the unrestricted model, from the proof of Lemma 2.1, is equal to 1 − Φ(c_n(1 + δ)) + Φ((δ − 1)c_n), which tends to zero although the true model is U because β0 ≠ 0! This quick analysis tells us that the model selection procedure is blind to small deviations from the restricted model (β0 = 0) that are of the order of c_n/√n. Statisticians say that, in that case, the model selection procedure is not uniformly consistent with respect to β0. For the econometrician, it means that the classical inference procedure, i.e. the procedure that assumes that the selected model is the true one (or that is conditional on the selected model being the true one) and uses asymptotic normality to perform tests and construct confidence intervals, may require very large sample sizes to be accurate. Furthermore, this required sample size depends on the unknown parameter β0 (see the numerical evidence in Leeb and Pötscher (2005)).
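
The non-uniformity described above is easy to see numerically. The minimal R sketch below evaluates Δ(√n β0/σ_β, c_n) from the proof of Lemma 2.1 under a local alternative β0 = δσ_β c_n/√n; the parameter values (n = 100, δ = 0.5, σ_β = 1, BIC-type c_n) are illustrative assumptions.

```r
# Probability of selecting the restricted model, Delta(sqrt(n)*beta0/sigma_beta, c_n).
prob_restricted <- function(beta0, n, sigma_beta = 1, cn = sqrt(log(n))) {
  a <- sqrt(n) * beta0 / sigma_beta
  pnorm(a + cn) - pnorm(a - cn)            # Delta(a, c_n)
}

n <- 100
delta <- 0.5                                # |delta| < 1, as in Remark 2.1
beta0 <- delta * sqrt(log(n)) / sqrt(n)     # local alternative of order c_n / sqrt(n)
prob_restricted(beta0, n)                   # large (~0.86): the wrong restricted model is usually kept
prob_restricted(0.5, n)                     # a "large" beta0: close to 0, U is correctly selected
```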

The Distribution of the Post-Selection Estimator. Leeb and Pötscher (2005)


analyze the distribution of the post-selection estimator τ̃ defined by

$$\tilde\tau := \hat\tau(\hat M) = \hat\tau(R)\,\mathbf{1}_{\hat M = R} + \hat\tau(U)\,\mathbf{1}_{\hat M = U}. \tag{2.2}$$

Bearing in mind the caveat issued in the previous paragraph, is a consistent model se-
lection procedure sufficient to waive concerns over the post-selection approach? Indeed,
using Lemma 2.1, it is tempting to think that τ̃ will be asymptotically distributed as a
Gaussian and that standard inference also applies. However, we will show that the finite
sample distribution of the post-selection estimator can be very different from a standard
Gaussian distribution. The result displayed here can be found in Leeb (2006). The next
Lemma will be useful when computing the distribution of the post-selection estimator.

Lemma 2.2 (Independence, from Leeb (2006)).

$$\hat\tau(R) \perp \hat\beta(U).$$

Proof of Lemma 2.2 We use the following matrix notation: X_j = (X_{i,j})_{1≤i≤n} for j = 1, 2, y = (Y_i)_{1≤i≤n} and X = (X_i')_{1≤i≤n}. Notice that τ̂(R) = (X_1'X_1)^{-1}X_1'y, and define M_{X_1} := I_n − X_1(X_1'X_1)^{-1}X_1', the projector on the orthogonal complement of the column space of X_1. Form the matrix X^O := [X_1 : M_{X_1}X_2] and define β̂^O as the coefficient obtained from the regression of y on X^O, that is β̂^O = (X^{O\prime}X^O)^{-1}X^{O\prime}y. We can show that:
$$\hat\beta^O = \begin{pmatrix} \hat\tau(R) \\ (X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}y \end{pmatrix},$$
and that τ̂(R) and (X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}y are uncorrelated, using Cochran's theorem. By the Frisch–Waugh–Lovell Theorem (Theorem 2.3 at the end of this chapter), (X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}y = β̂(U), which completes the proof. □

Lemma 2.3 (Density of the Post-Selection Estimator, from Leeb (2006)). The finite-sample (conditional on (X_i)_{i=1,...,n}) density of √n(τ̃ − τ0) is given by:
$$f_{\sqrt{n}(\tilde\tau - \tau_0)}(x) = \Delta\!\left(\sqrt{n}\frac{\beta_0}{\sigma_\beta},\ c_n\right) \frac{1}{\sigma_\tau\sqrt{1-\rho^2}}\,\varphi\!\left(\frac{x}{\sigma_\tau\sqrt{1-\rho^2}} + \frac{\rho}{\sqrt{1-\rho^2}}\frac{\sqrt{n}\beta_0}{\sigma_\beta}\right) + \left[1 - \Delta\!\left(\frac{\sqrt{n}\beta_0/\sigma_\beta + \rho x/\sigma_\tau}{\sqrt{1-\rho^2}},\ \frac{c_n}{\sqrt{1-\rho^2}}\right)\right]\frac{1}{\sigma_\tau}\,\varphi\!\left(\frac{x}{\sigma_\tau}\right),$$
where ρ = σ_{τ,β}/(σ_τ σ_β).

Proof of Lemma 2.3 Start from Equation (2.2):
$$P\left(x \le \sqrt{n}(\tilde\tau - \tau_0) \le x + dx\right) = P\left(x \le \sqrt{n}(\hat\tau(R) - \tau_0) \le x + dx \mid \hat M = R\right)P\left(\hat M = R\right) + P\left(x \le \sqrt{n}(\hat\tau(U) - \tau_0) \le x + dx \mid \hat M = U\right)P\left(\hat M = U\right).$$
We consider the first term in the sum. Since M̂ depends on the data only through β̂(U), Lemma 2.2 implies that, for any real number x:
$$P\left(x \le \sqrt{n}(\hat\tau(R) - \tau_0) \le x + dx \mid \hat M = R\right) = P\left(x \le \sqrt{n}(\hat\tau(R) - \tau_0) \le x + dx\right).$$

So as dx → 0, the first part in the sum (times 1/dx) is the probability of selecting model R times the density of √n(τ̂(R) − τ0). The probability of selecting model R is P(M̂ = R) = Δ(√n β0/σ_β, c_n). Before continuing, notice the relation between the moments of Xi and those of the OLS estimators in model U, from Assumption 2.1:
$$\frac{1}{n}\sum_{i=1}^n X_i X_i' = \frac{\sigma^2}{\sigma_\beta^2\sigma_\tau^2(1-\rho^2)}\begin{pmatrix} \sigma_\beta^2 & -\rho\sigma_\beta\sigma_\tau \\ -\rho\sigma_\beta\sigma_\tau & \sigma_\tau^2 \end{pmatrix}.$$


In order to compute the density of √n(τ̂(R) − τ0), we use the usual OLS formula and substitute Yi by the model of Assumption 2.1:
$$\sqrt{n}(\hat\tau(R) - \tau_0) = -\sqrt{n}\beta_0\rho\frac{\sigma_\tau}{\sigma_\beta} + \sqrt{n}\frac{\sigma_\tau^2}{\sigma^2}(1-\rho^2)\left(\frac{1}{n}\sum_{i=1}^n X_{i,1}\varepsilon_i\right).$$
Because the εi are iid Gaussian and conditionally on the Xi, we obtain:
$$\sqrt{n}(\hat\tau(R) - \tau_0) \sim N\left(-\sqrt{n}\beta_0\rho\frac{\sigma_\tau}{\sigma_\beta},\ \sigma_\tau^2(1-\rho^2)\right).$$
The bias −√n β0 ρσ_τ/σ_β corresponds to the usual omitted-variable bias (OVB) (Angrist and Pischke, 2009, p. 59):
$$-\sqrt{n}\beta_0\frac{\rho\sigma_\tau\sigma_\beta}{\sigma_\beta^2} = \sqrt{n}\beta_0\frac{\mathrm{Cov}(X_{i,1}, X_{i,2})}{\mathbb{V}(X_{i,1})}.$$

Now, we focus on the second part in the sum and reverse the events:
$$P\left(x \le \sqrt{n}(\hat\tau(U) - \tau_0) \le x + dx \mid \hat M = U\right)P\left(\hat M = U\right) = P\left(\hat M = U \mid x \le \sqrt{n}(\hat\tau(U) - \tau_0) \le x + dx\right) \times P\left(x \le \sqrt{n}(\hat\tau(U) - \tau_0) \le x + dx\right).$$
Recall that
$$\sqrt{n}\begin{pmatrix} \hat\beta(U) - \beta_0 \\ \hat\tau(U) - \tau_0 \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_\beta^2 & \rho\sigma_\beta\sigma_\tau \\ \rho\sigma_\beta\sigma_\tau & \sigma_\tau^2 \end{pmatrix}\right),$$
so we have directly
$$\frac{P\left(x \le \sqrt{n}(\hat\tau(U) - \tau_0) \le x + dx\right)}{dx} \to \frac{1}{\sigma_\tau}\varphi\left(\frac{x}{\sigma_\tau}\right),$$
as dx → 0. Due to the properties of Gaussian vectors, we get:
$$\sqrt{n}(\hat\beta(U) - \beta_0) \mid \sqrt{n}(\hat\tau(U) - \tau_0) \sim N\left(\rho\frac{\sigma_\beta}{\sigma_\tau}\sqrt{n}(\hat\tau(U) - \tau_0),\ \sigma_\beta^2(1-\rho^2)\right).$$
Now compute P(|√n β̂(U)/σ_β| > c_n | √n(τ̂(U) − τ0) = x). On the one hand:
$$P\left(\sqrt{n}\frac{\hat\beta(U)}{\sigma_\beta} > c_n \,\Big|\, \sqrt{n}(\hat\tau(U) - \tau_0) = x\right) = \Phi\left(\frac{1}{\sqrt{1-\rho^2}}\left(\sqrt{n}\frac{\beta_0}{\sigma_\beta} + \rho\frac{x}{\sigma_\tau} - c_n\right)\right).$$
On the other:
$$P\left(\sqrt{n}\frac{\hat\beta(U)}{\sigma_\beta} < -c_n \,\Big|\, \sqrt{n}(\hat\tau(U) - \tau_0) = x\right) = 1 - \Phi\left(\frac{1}{\sqrt{1-\rho^2}}\left(\sqrt{n}\frac{\beta_0}{\sigma_\beta} + \rho\frac{x}{\sigma_\tau} + c_n\right)\right),$$
which yields the result. □


Figure 2.1: Finite-sample density of √n(τ̃ − τ0), ρ = .4

[Figure: densities for several values of β0/σ_β.]

Note: Density of the post-selection estimator τ̃ for different values of β0/σ_β, see legend. Other parameters are: c_n = √(log n), n = 100, σ_τ = 1 and ρ = .4. See Lemma 2.3 for the mathematical formula.

Figure 2.2: Finite-sample density of √n(τ̃ − τ0), ρ = .7

[Figure: densities for several values of β0/σ_β.]

Note: See Figure 2.1. ρ = .7. This chart is similar to the one in Leeb and Pötscher (2005).

Remark 2.2. Lemma 2.3 gives the finite-sample density of the post-selection estimator. There is an omitted-variable bias that the post-selection estimator cannot overcome unless β0 = 0 or ρ = 0. Indeed, when ρ = 0, √n(τ̃ − τ0) ∼ N(0, σ_τ²); while when β0 = 0, √n(τ̃ − τ0) ∼ N(0, σ_τ²(1 − ρ²)) (approximately), because Δ(0, c_n) ≥ 1 − exp(−c_n²/2) (the probability of selecting the restricted model) is large. Figures 2.1 and 2.2 plot the finite-sample density of the post-selection estimator for several values of β0/σ_β in the cases ρ = .4 and ρ = .7, respectively. Figure 2.1 shows a mild albeit significant distortion from a standard Gaussian distribution. The post-selection estimator clearly exhibits a bias. As the correlation between the two covariates intensifies (Figure 2.2), the density of the post-selection estimator becomes highly non-Gaussian, even exhibiting two modes. See Leeb and Pötscher (2005) for further discussion. Following this analysis, it is clear that inference (i.e. tests and confidence intervals) based on standard Gaussian quantiles will in general give a picture very different from the true distribution depicted in Figure 2.2.
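
The finite-sample distortion can also be reproduced by simulation. The R sketch below is a simplified illustration with assumed parameter values (n = 100, ρ between the covariates equal to .7, β0 = 0.15, unit error variance); the selection rule uses the estimated standard error of β̂(U) rather than the known-variance statistic of Assumption 2.2.

```r
# Simulated distribution of the post-selection estimator (Equation (2.2)).
set.seed(1)
n <- 100; R <- 5000
rho_x <- 0.7                       # correlation between X_{i,1} and X_{i,2}
tau0 <- 1; beta0 <- 0.15
cn <- sqrt(log(n))                 # BIC-type threshold
stat <- numeric(R)
for (r in 1:R) {
  x1 <- rnorm(n)
  x2 <- rho_x * x1 + sqrt(1 - rho_x^2) * rnorm(n)
  y  <- tau0 * x1 + beta0 * x2 + rnorm(n)
  fitU <- lm(y ~ x1 + x2 - 1)                       # unrestricted model
  fitR <- lm(y ~ x1 - 1)                            # restricted model (beta = 0)
  tstat <- coef(summary(fitU))["x2", "t value"]
  tau_tilde <- if (abs(tstat) > cn) coef(fitU)["x1"] else coef(fitR)["x1"]
  stat[r] <- sqrt(n) * (tau_tilde - tau0)
}
plot(density(stat), main = "Post-selection estimator (simulated)")
curve(dnorm(x, sd = sd(stat)), add = TRUE, lty = 2)  # Gaussian benchmark
```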

2.2 High-Dimension, Sparsity and the Lasso


A popular device to perform the variable selection step described in the previous section
is the Lasso of Tibshirani (1994). In this section, we make a detour from the post-
selection inference problem to prove a consistency result for the Lasso in the linear model, relying on
strong assumptions. These assumptions can be and are relaxed in the Lasso literature, but
they allow us to simplify the proof nonetheless. Let's start with a Gaussian linear model.

Assumption 2.3 (Sparse Gaussian Linear Model). Consider the iid sequence of random variables (Yi, Xi)_{i=1,...,n}. The dimension of the vector Xi is denoted p; p is assumed to be larger than 1 and is allowed to be much larger than n. We assume the following linear relation:
$$Y_i = X_i'\beta_0 + \varepsilon_i,$$
with εi ∼ N(0, σ²), εi ⊥ Xi, ‖β0‖_0 ≤ s < p. The covariates are bounded almost surely: max_{i=1,...,n} ‖Xi‖_∞ ≤ M.

Remark 2.3 (Key Concept: Sparsity). One particular assumption in the model displayed in Assumption 2.3 deserves special attention. The sparsity assumption, ‖β0‖_0 = Σ_{j=1}^p 1{β_{0j} ≠ 0} ≤ s, means that we assume at most s components of β0 are different from zero. The notion of sparsity, i.e. the assumption that although we consider many variables, only a small number of elements in the vector of parameters is different from zero, is an inherent element of the high-dimensional literature. It amounts to recasting the high-dimensional problem in a variable selection framework where a good estimator should be able to correctly select the relevant variables or to estimate the quantities of interest consistently at a rate close to √n, only paying a price depending on s and p. Before continuing further, let's introduce the sparsity set, i.e. the set of indices that correspond to non-zero elements of β0: S0 := {j ∈ {1, ..., p}, β_{0j} ≠ 0}. A less restrictive concept has been introduced by Belloni et al. (2012b). Called approximate sparsity, it assumes that the high-dimensional parameter can be decomposed into a sparse component, which has a lot of zero entries and some large entries, and a small component for which all entries are small and decay towards zero without ever being exactly zero; see Assumption 4.3 in Chapter 4. Although more general, this assumption complicates the proof.

Denote by L(β) = n^{-1}Σ_{i=1}^n (Y_i − X_i'β)² the mean-square loss function. The Lasso estimator is defined as:
$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p}\ L(\beta) + \lambda_n\|\beta\|_1. \tag{2.3}$$
The Lasso minimizes the sum of the empirical mean-square loss and a penalty or regularization term λ_n‖β‖_1. Notice that the solution to (2.3) is not necessarily unique. Because the ℓ1-norm has a kink at zero, the resulting solution of the program, β̂, will be sparse. λ_n sets the trade-off between fit and sparsity. It has been shown that in a sparsity context (see the remark above), Lasso-type estimators can provide a good approximation of the relevant quantities that are subject to a sparse structure, be they finite- or infinite-dimensional parameters. In the presence of a high-dimensional β0 for which the sparsity assumption is not assumed to hold, using the Lasso estimator is not a good idea. If instead β0 is supposed to be dense (i.e. many small entries but no true zeros), an ℓ2-regularization (the Ridge estimator) performs better. For more on when to use which type of regularization, see Abadie and Kasy (2017).
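
In practice, the Lasso (2.3) can be computed with the glmnet R package. The sketch below is an illustration on a simulated sparse design; the data-generating process and the particular value of λ are assumptions chosen for the example, not prescriptions.

```r
# A small sketch of the Lasso (2.3) on simulated sparse data.
library(glmnet)
set.seed(1)
n <- 200; p <- 500; s <- 5
X <- matrix(rnorm(n * p), n, p)
beta0 <- c(rep(1, s), rep(0, p - s))            # sparse truth: s non-zero entries
y <- X %*% beta0 + rnorm(n)
lambda <- 2 * sqrt(2 * log(2 * p) / n)          # same order as the theoretical choice below
fit <- glmnet(X, y, alpha = 1, lambda = lambda, intercept = FALSE)
beta_hat <- as.matrix(coef(fit))[-1, 1]         # drop the (zero) intercept
sum(beta_hat != 0)                              # number of selected covariates
sum(abs(beta_hat - beta0))                      # l1 estimation error
```
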
The Lasso and related techniques to deal with high-dimension have spurred a vast lit-
erature since the seminal paper of Tibshirani (1994). Good Statistics textbook references
are Bühlmann and van de Geer (2011) and Giraud (2014). Other key papers are Candes

and Tao (2007); van de Geer (2008); Bickel et al. (2009).
To show consistency of the Lasso estimator, another ingredient is needed: the restricted eigenvalue condition. Denote by Σ̂ := n^{-1}Σ_{i=1}^n X_iX_i' the empirical Gram matrix. In high-dimensional settings, we are specifically worried about cases where the number of covariates is larger than the sample size (p > n), because then Σ̂ is degenerate in the sense that it is not positive definite:
$$\min_{\delta \in \mathbb{R}^p,\ \delta \neq 0} \frac{\delta'\hat\Sigma\delta}{\|\delta\|_2^2} = 0.$$
In this case, OLS cannot be computed. This is why the restricted eigenvalue is needed: all square sub-matrices contained in the empirical Gram matrix of dimension no larger than s should have a positive minimal eigenvalue. Let's make it clearer. For a non-empty subset S ⊂ {1, ..., p} and α > 0, we define the set:
$$C[S, \alpha] := \{v \in \mathbb{R}^p : \|v_{S^C}\|_1 \le \alpha\|v_S\|_1,\ v \neq 0\}. \tag{2.4}$$

Assumption 2.4 (Restricted Eigenvalue). The empirical Gram matrix Σ̂ satisfies:
$$\kappa_\alpha^2(\hat\Sigma) := \min_{\substack{S \subset \{1,...,p\} \\ |S| \le s}}\ \min_{\delta \in C[S,\alpha]} \frac{\delta'\hat\Sigma\delta}{\|\delta_S\|_2^2} > 0.$$

This condition appears and is discussed in particular in Bickel et al. (2009); Rudelson
and Zhou (2013). We make this assumption directly on the empirical Gram matrix,
instead of making it on the population Gram matrix E(XX 0 ), in order to simplify the
proof. For a probabilistic link between population and empirical Gram matrices under
fairly weak conditions, see e.g. Oliveira (2013). Conditions that fulfill the same purpose as
the restricted eigenvalue condition have been used before, most notably the compatibility
condition, the coherence condition and the restricted isometry condition. See also (Bühlmann
and van de Geer, 2011, p. 106).

2.3 Some Theory on the Lasso


This Section gives a simple proof of the ℓ1 consistency of the Lasso and introduces the Post-Lasso estimator.

Theorem 2.1 (ℓ1 consistency of the Lasso). Under Assumption 2.3 and the restricted eigenvalue condition of Assumption 2.4 with C[S0, 3], the Lasso estimator defined in (2.3) with tuning parameter λ_n = (4σM/α)√(2 log(2p)/n), where α ∈ (0, 1), verifies with probability greater than 1 − α:
$$\|\hat\beta - \beta_0\|_1 \le \frac{4^2\sigma M}{\alpha\kappa_3^2(\hat\Sigma)}\sqrt{\frac{2s^2\log(2p)}{n}}. \tag{2.5}$$
Remark 2.4. The main take-away from Theorem 2.1 is that the Lasso converges in ℓ1 to the true value β0 at rate s√(log(p)/n). This is to be compared to the OLS rate under full knowledge of the sparsity pattern, which is s/√n. The conclusion is that there is a price to pay for ignorance, which manifests itself in this √(log(p)) term. This rate is called fast, compared to a slower rate that holds without Assumption 2.4.
By adding a modified version of Assumption 2.4, an ℓ2 rate can be obtained: ‖β0 − β̂‖_2 ≲ √(s log(p)/n). Prediction rates (i.e. on ‖Y − X'β̂‖_2) have also been largely dealt with in the literature, see e.g. Bickel et al. (2009), but are less of interest in this course.
Moreover, notice that the Lasso is NOT asymptotically Gaussian: the event β̂_j = 0 has a non-zero probability of occurring. Consequently, it is not possible to construct the usual confidence sets on β0 (i.e. Gaussian, asymptotic-based).
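
The rate in Theorem 2.1 can be checked numerically. The following R sketch reruns the simulated design used above for several sample sizes and compares the ℓ1 error with √(s² log(p)/n) up to a constant; the design (p = 500, s = 5, unit coefficients) is an assumption made for illustration.

```r
# Rough numerical check that the l1 error of the Lasso scales like sqrt(s^2 log(p)/n).
library(glmnet)
set.seed(1)
l1_error <- function(n, p = 500, s = 5) {
  X <- matrix(rnorm(n * p), n, p)
  beta0 <- c(rep(1, s), rep(0, p - s))
  y <- X %*% beta0 + rnorm(n)
  fit <- glmnet(X, y, alpha = 1, lambda = 2 * sqrt(2 * log(2 * p) / n),
                intercept = FALSE)
  sum(abs(as.matrix(coef(fit))[-1, 1] - beta0))
}
ns <- c(200, 400, 800, 1600)
errs <- sapply(ns, l1_error)
# compare the empirical error with the theoretical rate (up to a constant)
cbind(n = ns, error = round(errs, 3), rate = round(sqrt(5^2 * log(500) / ns), 3))
```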

Proof of Theorem 2.1 Since β̂ is a solution of the minimization program:
$$L(\hat\beta) + \lambda_n\|\hat\beta\|_1 \le L(\beta_0) + \lambda_n\|\beta_0\|_1. \tag{2.6}$$

Step 1: Difference in Square Losses. Decompose the difference between the two loss functions and substitute Yi:
$$\begin{aligned}
L(\hat\beta) - L(\beta_0) &= \frac{1}{n}\sum_{i=1}^n\left[\left(Y_i - X_i'\hat\beta\right)^2 - \left(Y_i - X_i'\beta_0\right)^2\right] \\
&= \frac{1}{n}\sum_{i=1}^n\left[\left(X_i'(\beta_0 - \hat\beta) + \varepsilon_i\right)^2 - \varepsilon_i^2\right] \\
&= (\hat\beta - \beta_0)'\underbrace{\left[\frac{1}{n}\sum_{i=1}^n X_iX_i'\right]}_{\hat\Sigma}(\hat\beta - \beta_0) - 2(\hat\beta - \beta_0)'\left[\frac{1}{n}\sum_{i=1}^n \varepsilon_iX_i\right].
\end{aligned}$$
Consequently, from Equation (2.6) we obtain:
$$\begin{aligned}
(\hat\beta - \beta_0)'\hat\Sigma(\hat\beta - \beta_0) &\le \lambda_n\left(\|\beta_0\|_1 - \|\hat\beta\|_1\right) + 2(\hat\beta - \beta_0)'\left[\frac{1}{n}\sum_{i=1}^n \varepsilon_iX_i\right] \\
&\le \lambda_n\left(\|\beta_0\|_1 - \|\hat\beta\|_1\right) + 2\|\hat\beta - \beta_0\|_1\left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_iX_i\right\|_\infty,
\end{aligned}$$

Step 2: Concentration Inequality. It is time to apply the concentration inequality of Lemma 2.6 to ‖(1/n)Σ_{i=1}^n ε_iX_i‖_∞. Using Markov's inequality:
$$\begin{aligned}
P\left(\max_{j=1,...,p}\left|\frac{1}{n}\sum_{i=1}^n \varepsilon_iX_{ij}\right| \ge \frac{\lambda_n}{4}\ \Big|\ X_1,...,X_n\right) &\le \frac{4\,\mathbb{E}\left[\max_{j=1,...,p}\left|\frac{1}{n}\sum_{i=1}^n \varepsilon_iX_{ij}\right|\ \big|\ X_1,...,X_n\right]}{\lambda_n} \\
&\le \frac{4\sigma M\sqrt{2\log(2p)}}{\sqrt{n}\,\lambda_n} \\
&\le \alpha,
\end{aligned}$$
since λ_n = (4σM/α)√(2 log(2p)/n). Since the right-hand side is non-probabilistic, we obtain:
$$P\left(\max_{j=1,...,p}\left|\frac{1}{n}\sum_{i=1}^n \varepsilon_iX_{ij}\right| \ge \frac{\lambda_n}{4}\right) \le \alpha.$$
On the event {max_{j=1,...,p} |(2/n)Σ_{i=1}^n ε_iX_{ij}| < λ_n/2}, which occurs with probability greater than 1 − α:
$$(\hat\beta - \beta_0)'\hat\Sigma(\hat\beta - \beta_0) \le \lambda_n\left(\|\beta_0\|_1 - \|\hat\beta\|_1\right) + \frac{\lambda_n}{2}\|\hat\beta - \beta_0\|_1. \tag{2.7}$$

Step 3: Decompose the ℓ1-norms. Now, we will use β_{S_0} to denote the vector β of dimension p for which the elements that are not in S_0 are replaced by 0. Notice that β = β_{S_0} + β_{S_0^C}. By the reverse triangle inequality:
$$\|\beta_{0,S_0}\|_1 - \|\hat\beta_{S_0}\|_1 \le \|\beta_{0,S_0} - \hat\beta_{S_0}\|_1.$$
Also, notice that β_{0,S_0^C} = 0, so ‖β_{0,S_0^C}‖_1 − ‖β̂_{S_0^C}‖_1 = −‖β_{0,S_0^C} − β̂_{S_0^C}‖_1. So from (2.7), we obtain:
$$(\hat\beta - \beta_0)'\hat\Sigma(\hat\beta - \beta_0) \le \frac{3\lambda_n}{2}\|\beta_{0,S_0} - \hat\beta_{S_0}\|_1 - \frac{\lambda_n}{2}\|\beta_{0,S_0^C} - \hat\beta_{S_0^C}\|_1. \tag{2.8}$$

Step 4: Cone Condition and Restricted Eigenvalues. It means that we have the following cone condition:
$$\|\beta_{0,S_0^C} - \hat\beta_{S_0^C}\|_1 \le 3\|\beta_{0,S_0} - \hat\beta_{S_0}\|_1,$$
so β̂ − β0 ∈ C[S_0, 3]. Using Assumption 2.4 on the restricted eigenvalue of the empirical Gram matrix and the Cauchy-Schwarz inequality ‖δ_{S_0}‖_1 ≤ √s‖δ_{S_0}‖_2, we have:
$$(\hat\beta - \beta_0)'\hat\Sigma(\hat\beta - \beta_0) \ge \kappa_3^2(\hat\Sigma)\|\beta_{0,S_0} - \hat\beta_{S_0}\|_2^2 \ge \kappa_3^2(\hat\Sigma)\frac{\|\beta_{0,S_0} - \hat\beta_{S_0}\|_1^2}{s}. \tag{2.9}$$
Step 5: Finishing the Proof. We are now ready to get through the final step of the proof. Using inequalities (2.8) and (2.9), notice that:
$$\begin{aligned}
2(\hat\beta - \beta_0)'\hat\Sigma(\hat\beta - \beta_0) + \lambda_n\|\beta_0 - \hat\beta\|_1 &\le 4\lambda_n\|\beta_{0,S_0} - \hat\beta_{S_0}\|_1 \\
&\le 4\lambda_n\frac{\sqrt{s}}{\kappa_3(\hat\Sigma)}\sqrt{(\hat\beta - \beta_0)'\hat\Sigma(\hat\beta - \beta_0)} \\
&\le 4\lambda_n^2\frac{s}{\kappa_3^2(\hat\Sigma)} + (\hat\beta - \beta_0)'\hat\Sigma(\hat\beta - \beta_0),
\end{aligned}$$
where the final inequality uses 4uv ≤ u² + 4v². We finally obtain:
$$(\hat\beta - \beta_0)'\hat\Sigma(\hat\beta - \beta_0) + \lambda_n\|\beta_0 - \hat\beta\|_1 \le 4\lambda_n^2\frac{s}{\kappa_3^2(\hat\Sigma)}.$$
All in all, with probability greater than 1 − α:
$$\|\beta_0 - \hat\beta\|_1 \le \frac{4^2\sigma M}{\alpha\kappa_3^2(\hat\Sigma)}\sqrt{\frac{2s^2\log(2p)}{n}}. \qquad\square$$


Remark 2.5 (The Post-Lasso). Before moving on, we should mention the Post-Lasso, a close cousin of the Lasso that has been studied in particular in the chapter by Belloni and Chernozhukov (2011) in the book by Alquier et al. (2011), and in Belloni and Chernozhukov (2013). It is a two-step estimator in which a second step is added to the Lasso procedure in order to remove the bias that comes from shrinkage. That second step consists in running an OLS regression using only the covariates associated with a non-zero coefficient in the Lasso step. More precisely, the procedure is:

1. Run the Lasso regression as in Equation (2.3), and denote by Ŝ the set of non-zero Lasso coefficients.

2. Run an OLS regression including only the covariates corresponding to the non-zero coefficients from above:
$$\hat\beta^{PL} = \arg\min_{\beta \in \mathbb{R}^p,\ \beta_{\hat S^C} = 0} L(\beta).$$

The performance is comparable to that of the Lasso in theory, although the bias appears to be smaller in applications since undue shrinkage of the non-zero coefficients is removed. To stress the lessons of this chapter: the Post-Lasso estimator is still NOT asymptotically normal; it is subject to the post-selection inference problem of Leeb and Pötscher highlighted in Section 2.1!
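
The two Post-Lasso steps are easy to code. The R sketch below continues the simulated design used earlier in this section (the design and the value of λ are illustrative assumptions).

```r
# Post-Lasso sketch: refit OLS on the covariates selected by the Lasso.
library(glmnet)
set.seed(1)
n <- 200; p <- 500; s <- 5
X <- matrix(rnorm(n * p), n, p)
beta0 <- c(rep(1, s), rep(0, p - s))
y <- X %*% beta0 + rnorm(n)
# Step 1: Lasso, keep the support S-hat
fit_lasso <- glmnet(X, y, alpha = 1, lambda = 2 * sqrt(2 * log(2 * p) / n),
                    intercept = FALSE)
S_hat <- which(as.matrix(coef(fit_lasso))[-1, 1] != 0)
# Step 2: OLS on the selected columns only
fit_pl <- lm(y ~ X[, S_hat] - 1)
beta_pl <- numeric(p); beta_pl[S_hat] <- coef(fit_pl)
sum(abs(beta_pl - beta0))   # typically smaller than the Lasso l1 error (less shrinkage)
```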

2.4 The Regularization Bias
In this section, we discuss the regularization bias which is nothing more than an omitted
variable bias arising from the same mechanism described in section 2.1.

Model Selection and Estimation Cannot Be Achieved Optimally at the Same


Time. The previous section was concerned with defining and understanding the Lasso
estimator. Because of the sparsity property of the resulting solution (e.g. Belloni and Chernozhukov (2013) show that ‖β̂‖_0 ≲ s) and the natural appeal of the Post-Lasso estimator, it is easy to trust the Lasso too much. Indeed, one may think that the Lasso can function both as a device to recover the support of β0 and as a way to estimate that same quantity precisely, i.e. at a fast rate. However, Yang (2005) shows that any model selection procedure that is consistent must behave sub-optimally for estimating the regression function, and vice-versa. Indeed, the condition on the penalty parameter λ_n in Zhao and Yu (2006) is quite different from our requirement of λ_n = (4σM/α)√(2 log(2p)/n) in Theorem 2.1. The moral of the story is that, even when using the Lasso estimator, selecting the relevant covariates and estimating their effect well are two objectives that cannot be pursued at the same time. Furthermore, the warnings issued in Section 2.1 on post-selection inference still apply to the Lasso.
In presence of a high-dimensional parameter to estimate, the econometric literature
has chosen to pursue high-quality estimation of β0 . Indeed, because most economic
applications are concerned with a well-posed causal question of the type “What is the
effect of A on B?”, the identity of the relevant regressors matters less than estimating well
some nuisance parameters: think about estimating a control function or the first-stage of
an IV regression for example.
But even when focusing only on precise estimation of β0, the Lasso is not the only thing you need. High-dimensional statistics poses a different challenge because p is assumed to grow with the sample size. Indeed, if p were constant, then as n → ∞ the problem would boil down to a small-dimensional one where n ≫ p. In a high-dimensional setting, there is an asymptotic bias, or regularization bias.

The Bias of Simple Plug-In Estimators: Illustration in a Simple Linear Model.

Assumption 2.5 (Linear Model with Controls). Consider the iid sequence of random
variables (Yi , Di , Xi )i=1,...,n such that:

Yi = Di τ0 + Xi′β0 + εi ,

with εi such that Eεi = 0, Eε2i = σ 2 < ∞ and εi ⊥ (Di , Xi ). Di ∈ {0, 1}. Xi is of
dimension p > 1. p is allowed to be much larger than n and to grow with n. Denote by
µd := E(X|D = d) for d ∈ {0, 1} and π0 := EDi .

Suppose the econometrician is interested in estimating the treatment effect τ0 of Di


on Yi in Model 2.5 above, while β0 is merely a nuisance parameter. In the presence of a
high-dimensional set of controls Xi , a naive post-selection procedure (e.g. Belloni et al.,
2014, p. 36) follows two steps:

1. (Selection) Run a Lasso regression of Y on D and X, forcing D to remain in the


model by excluding τ from the penalty part in the Lasso, Equation (2.3). Obtain
βbL . Exclude all the elements in X that correspond to a zero coefficient in βbL ,

2. (Estimation) Run an OLS regression of Y on D and the set of selected X to obtain


the post-selection estimator τb.

Denote by β̂ the corresponding estimator of β0 obtained in step 2. Notice that for j ∈ {1, ..., p}, if β̂_j^L = 0 then β̂_j = 0. Also denote by π̂ := n^{-1}Σ_{i=1}^n D_i. The naive plug-in estimator is:
$$\hat\tau := \frac{\frac{1}{n}\sum_{i=1}^n D_i(Y_i - X_i'\hat\beta)}{\hat\pi} = \frac{1}{n_1}\sum_{D_i=1}(Y_i - X_i'\hat\beta),$$
where n_d := Σ_{i=1}^n 1\{D_i = d\}, d ∈ {0, 1}, is a random quantity.

Lemma 2.4 (Regularization Bias of τ̂). Under Assumption 2.5, if μ1 ≠ 0: √n|τ̂ − τ0| → ∞.

Proof of Lemma 2.4. Substituting Model 2.5, we obtain:
$$\sqrt{n}(\hat\tau - \tau_0) = \hat\pi^{-1}\left[\frac{1}{n}\sum_{i=1}^n D_iX_i\right]'\sqrt{n}\left(\beta_0 - \hat\beta\right) + \hat\pi^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n D_i\varepsilon_i. \tag{2.10}$$
From the above equation, one would hope that, because β̂ is ℓ1 consistent, the first term converges to zero in probability and we would only be left with the second term. This is not the case. By the CLT (using also the LLN and the CMT to prove π̂ →^p π0) and Slutsky's theorem:
$$\hat\pi^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n D_i\varepsilon_i \stackrel{d}{\longrightarrow} N\left(0,\ \frac{\sigma^2}{\pi_0}\right).$$
Now, in general, we can show:
$$\left|\left[\frac{1}{n}\sum_{i=1}^n D_iX_i\right]'\sqrt{n}\left(\beta_0 - \hat\beta\right)\right| \approx s\sqrt{\log p} \to \infty.$$
The intuition behind this is twofold. By the LLN one can show:
$$\frac{1}{n}\sum_{i=1}^n D_iX_i \stackrel{p}{\longrightarrow} \pi_0\mu_1.$$
In general μ1 ≠ 0 and this term does not go to zero. Moreover, since p → ∞, we have ‖√n(β̂ − β0)‖_1 ≈ s√(log p), which does not go to zero, hence proving the result. □

Remark 2.6 (The Regularization Bias is an Omitted Variable Bias). Lemma 2.4 is
disappointing: in the high-dimensional case, the naive plug-in strategy does not work
well. This is because of two ingredients: μ1 ≠ 0 and p → ∞. If we were in a small-dimensional case and had, for example, an OLS estimator of β0, then √n(β0 − β̂) would be asymptotically normal and the problem would disappear. Notice that in this small-
be asymptotically normal and the problem would disappear. Notice that in this small-
dimensional case, there is no selection step. What is the problem with the approach
proposed in this section? In a nutshell: it is a single-equation procedure. Recall that
the selection step is using only the outcome equation, i.e. the elements of X tend to be
selected if they have a large entry in coefficient β0 . As a consequence, that procedure will
tend to miss variables that have a moderate effect on Y but a strong effect on D, thereby
creating an omitted variable bias in the estimator of τ0 . As Belloni et al. (2014) put it:
“Intuitively, any such variable has a moderate direct effect on the outcome that will be
incorrectly misattributed to the effect of the treatment when this variable is strongly related
to the treatment and the variable is not included in the regression”. We call the omitted-variable bias arising from non-orthogonalized procedures that use machine-learning tools such as the Lasso in a first step the regularization bias.
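
The omitted-variable mechanism is easy to reproduce by simulation. The R sketch below uses an assumed DGP (not taken from the text) in which X1 strongly drives the treatment but only moderately drives the outcome, so the naive single-equation selection of this section tends to drop it; the value of λ is also an assumption for illustration.

```r
# Illustration of the regularization bias of the naive single-selection procedure.
library(glmnet)
set.seed(1)
n <- 200; p <- 200; tau0 <- 1
lam <- 2 * sqrt(2 * log(2 * p) / n)
one_run <- function() {
  X <- matrix(rnorm(n * p), n, p)
  D <- as.numeric(X[, 1] + rnorm(n) > 0)        # X1 strongly drives treatment
  y <- tau0 * D + 0.3 * X[, 1] + rnorm(n)       # ...but only mildly drives Y
  # Step 1 (selection): Lasso of Y on (D, X), D left unpenalized
  fit <- glmnet(cbind(D, X), y, lambda = lam, penalty.factor = c(0, rep(1, p)))
  sel <- which(as.matrix(coef(fit))[-(1:2), 1] != 0)
  # Step 2 (estimation): OLS of Y on D and the selected covariates
  if (length(sel) == 0) coef(lm(y ~ D))["D"]
  else coef(lm(y ~ D + X[, sel, drop = FALSE]))["D"]
}
taus <- replicate(200, one_run())
round(mean(taus) - tau0, 3)   # typically a clear upward bias, as in Lemma 2.4
```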

Now, focus on a nice case. For that, we make two assumptions. A first one limits the
growth rate of p. It is fairly common and technical. A second one gives an intuition for
more general results in the next section.

Assumption 2.6 (Growth Condition).
$$\frac{s\log p}{\sqrt{n}} \to 0.$$

Assumption 2.7 (Balanced Design). Assume:

1. μ1 = 0,

2. The concentration bound:
$$\left\|\frac{1}{\sqrt{n}}\sum_{i=1}^n D_iX_i\right\|_\infty \lesssim \sqrt{\log p}. \tag{2.11}$$

The second part of Assumption 2.7 is fairly technical but can be proven under lower-
level assumptions such as normality or sub-Gaussianity of Xi and application of Lemma
2.6, recalling that E(Di Xi ) = 0 under the first part of Assumption 2.7.
Lemma 2.5 (A Favorable Case). Under Assumptions 2.5, 2.6 and 2.7: √n(τ̂ − τ0) →^d N(0, σ²/π0).

Proof of Lemma 2.5. Go over the proof of Lemma 2.4. Now, because of Assumption 2.7, we obtain:
$$\left\|\frac{1}{\sqrt{n}}\sum_{i=1}^n D_iX_i\right\|_\infty \lesssim \sqrt{\log p}.$$
Using Theorem 2.1 and |a'b| ≤ ‖a‖_∞‖b‖_1, we have:
$$\left|\left[\frac{1}{\sqrt{n}}\sum_{i=1}^n D_iX_i\right]'\left(\beta_0 - \hat\beta\right)\right| \le \left\|\frac{1}{\sqrt{n}}\sum_{i=1}^n D_iX_i\right\|_\infty \left\|\beta_0 - \hat\beta\right\|_1 \lesssim \frac{s\log p}{\sqrt{n}} \to 0,$$
by the growth condition of Assumption 2.6. So the quantity on the left-hand side of the above inequality converges to zero in probability (ℓ1 consistency implies consistency in probability). Using Slutsky's Theorem and Equation (2.10), we obtain the result. □
Notice that Assumption 2.7 (1) implies
$$\frac{\partial\sqrt{n}(\hat\tau - \tau_0)}{\partial(\beta_0 - \hat\beta)} = \frac{1}{\hat\pi}\frac{1}{\sqrt{n}}\sum_{i=1}^n D_iX_i \stackrel{p}{\longrightarrow} 0, \quad \text{as } n \to \infty.$$

Under this assumption, the estimator τb is first-order insensitive to small deviations from
the true value β0 . This is what we are going to exploit in the next section.

2.5 Theory: Immunization/Orthogonalization Procedure

This presentation by Victor Chernozhukov is a good introduction to this section: https://youtu.be/eHOjmyoPCFU.
Lemma 2.5 displayed a happy case, where estimation of the high-dimensional nuisance
parameter does not affect the asymptotic distribution of the estimator of the parameter of
interest. To give the intuition of the general case, assume that the parameter of interest,
τ0 solves the equation Em(Zi , τ0 , β0 ) = 0 for some known score function m(.), a vector of
observables Zi and nuisance parameter β0 . In case you find this setting abstract, think of
the fully parametric case where m(.) is the derivative of the log-likelihood (Wasserman,
2010, Chapter 9). In the previous section, we had Zi = (Yi , Di , Xi ), and m(Zi , τ, β) :=
(Yi − Di τ − Xi0 β)Di . In Lemma 2.4, the derivative of the estimating moment with respect
to nuisance parameter was not zero:

E∂β m(Zi , τ0 , β0 ) = −π0 µ1 6= 0.
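
For concreteness, here is the short computation behind this display, using only the definitions above and the fact that Di ∈ {0, 1}:

```latex
\partial_\beta m(Z_i,\tau,\beta)
  = \partial_\beta\big[(Y_i - D_i\tau - X_i'\beta)D_i\big]
  = -D_i X_i,
\qquad\text{so}\qquad
\mathbb{E}\,\partial_\beta m(Z_i,\tau_0,\beta_0)
  = -\mathbb{E}[D_i X_i]
  = -\Pr(D_i=1)\,\mathbb{E}[X_i \mid D_i=1]
  = -\pi_0\mu_1 .
```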

What if we could replace m by another score function or estimating moment ψ and


potentially different nuisance parameter η0 such that:

E∂η ψ(Zi , τ0 , η0 ) = 0 (ORT)

Condition (ORT) means that the moment condition for estimating τ0 is not affected by
small perturbation around the true value of the nuisance parameter η0 . This is exactly the
intuition behind the double selection or immunized or Neyman-orthogonalized procedure
Chernozhukov et al. (2017a); Belloni et al. (2017a); Chernozhukov et al. (2018). Changing
the estimating moment can neutralize the effect of the first step estimation and suppress
the regularization bias. We say that any function ψ that satisfies Condition (ORT) is an
orthogonal score, or Neyman-orthogonal.

Immunization Procedure. This paragraph develops the ideas of Chernozhukov et al.


(2015); Chernozhukov et al. (2015, 2017a); Belloni et al. (2017a); Chernozhukov et al.
(2018). The proofs are intentionally less rigorous, to give the intuition without getting lost

in technical details. We defer technical details to e.g. Lemma 2 and 3 in Chernozhukov
et al. (2015) and Belloni et al. (2017a).

Assumption 2.8 (Orthogonalized Moment Condition). The (scalar) parameter of in-


terest, τ0 is given by:

Eψ(Zi , τ0 , η0 ) = 0,

for some known real-valued function ψ(.) satisfying the orthogonality condition (ORT), a
vector of observables Zi and a high-dimensional sparse nuisance parameter η0 such that
kη0 k0 ≤ s. The design respects the growth condition of Assumption 2.6.

Further assume that we have a first-step estimator η̂ of η0 which is of sufficient quality:

Assumption 2.9 (High-Quality Nuisance Estimation). We have a first-step estimator η̂ such that, with high probability:
$$\|\hat\eta\|_0 \lesssim s, \qquad \|\hat\eta - \eta_0\|_1 \lesssim \sqrt{s^2\log p/n}, \qquad \|\hat\eta - \eta_0\|_2 \lesssim \sqrt{s\log p/n}.$$

We take this estimator as given. It does not have to be a Lasso, but the Lasso or Post-
Lasso clearly verify these assumptions, in a sparsity or approximate sparsity scenario, i.e.
in cases where you are confident that only a few control variables matter. Chernozhukov
et al. (2018) extend these conditions to any machine-learning procedure of sufficient
quality. They will be discussed in Section 3.1. Notice that the ML procedure you are going to use depends on the assumptions you are willing to make about η0, because they determine the performance of this tool. For example, if you believe η0 to be sparse, a Lasso should work well. The estimator we are going to consider is τ̌ such that:
$$\frac{1}{n}\sum_{i=1}^n \psi(Z_i, \check\tau, \hat\eta) = 0. \tag{IMMUNIZED}$$

For clearer exposition, we consider the simple case of Assumption 2.10 below.

Assumption 2.10 (Affine-Quadratic Model). The function ψ(.) is such that:

ψ(Zi , τ, η) = Γ1 (Zi , η)τ − Γ2 (Zi , η),

where Γj , j = 1, 2, are functions with all their second order derivatives with respect to η
constant over the convex parameter space of η.

The class of models above may seem restrictive but includes many usual parameters of interest, such as the Average Treatment Effect (ATE), the Average Treatment Effect on the Treated (ATET), the Local Average Treatment Effect (LATE), and any linear regression coefficient.

Theorem 2.2 (Asymptotic Normality of the Immunized Estimator). The immunized estimator τ̌ defined by (IMMUNIZED) in the affine-quadratic model of Assumption 2.10, under Assumptions 2.8 and 2.6 and using a first-stage nuisance estimator satisfying Assumption 2.9, is such that:
$$\sqrt{n}(\check\tau - \tau_0) \stackrel{d}{\longrightarrow} N(0, \sigma_\Gamma^2),$$
with σ_Γ² := E[ψ(Z_i, τ_0, η_0)²]/E[Γ_1(Z_i, η_0)]².

Proof of Theorem 2.2. τ̌ defined by (IMMUNIZED) is such that:
$$\check\tau = \left[\frac{1}{n}\sum_{i=1}^n \Gamma_1(Z_i, \hat\eta)\right]^{-1}\frac{1}{n}\sum_{i=1}^n \Gamma_2(Z_i, \hat\eta).$$
From Assumption 2.10, we can verify:
$$\sqrt{n}(\tau_0 - \check\tau) = \left[\frac{1}{n}\sum_{i=1}^n \Gamma_1(Z_i, \hat\eta)\right]^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n \psi(Z_i, \tau_0, \hat\eta). \tag{2.12}$$
Firstly, we need to show that $n^{-1}\sum_{i=1}^n \Gamma_1(Z_i,\hat\eta) \stackrel{p}{\longrightarrow} \mathbb{E}\Gamma_1(Z_i,\eta_0)$. By the affine-quadratic Assumption 2.10:
$$\frac{1}{n}\sum_{i=1}^n \Gamma_1(Z_i,\hat\eta) = \underbrace{\frac{1}{n}\sum_{i=1}^n \Gamma_1(Z_i,\eta_0)}_{:=I_1} + \underbrace{\left[\frac{1}{n}\sum_{i=1}^n \partial_\eta\Gamma_1(Z_i,\eta_0)\right]'(\hat\eta-\eta_0)}_{:=I_2} + \underbrace{\frac{1}{2}(\hat\eta-\eta_0)'\left[\frac{1}{n}\sum_{i=1}^n \partial_\eta\partial_{\eta'}\Gamma_1(Z_i,\eta_0)\right](\hat\eta-\eta_0)}_{:=I_3}.$$

Under regularity assumptions, by the LLN, $I_1 \stackrel{p}{\longrightarrow} \mathbb{E}[\Gamma_1(Z_i,\eta_0)]$. Then:
$$|I_2| \le \left\|\frac{1}{n}\sum_{i=1}^n \partial_\eta\Gamma_1(Z_i,\eta_0)\right\|_\infty \|\hat\eta-\eta_0\|_1 \lesssim \sqrt{\frac{s^2\log p}{n}} \to 0,$$
and finally:
$$|I_3| \le \frac{1}{2}\|\hat\eta-\eta_0\|_2^2 \left\|\frac{1}{n}\sum_{i=1}^n \partial_\eta\partial_{\eta'}\Gamma_1(Z_i,\eta_0)\right\|_{sp(s\log n)} \lesssim \frac{s\log p}{n} \to 0,$$
if we assume that the sparse norm of the second-order derivatives matrix (which does not depend on η̂) is bounded:
$$\left\|\frac{1}{n}\sum_{i=1}^n \partial_\eta\partial_{\eta'}\Gamma_1(Z_i,\eta_0)\right\|_{sp(s\log n)} \lesssim 1,$$

which occurs under reasonable conditions, see Rudelson and Zhou (2013). Secondly, we
need to show that $\frac{1}{\sqrt n}\sum_{i=1}^n \psi(Z_i,\tau_0,\hat\eta) \stackrel{d}{\longrightarrow} N(0, \mathbb{E}[\psi^2(Z_i,\tau_0,\eta_0)])$. Decompose similarly:
$$\frac{1}{\sqrt n}\sum_{i=1}^n \psi(Z_i,\tau_0,\hat\eta) = \underbrace{\frac{1}{\sqrt n}\sum_{i=1}^n \psi(Z_i,\tau_0,\eta_0)}_{:=I_1'} + \underbrace{\left[\frac{1}{\sqrt n}\sum_{i=1}^n \partial_\eta\psi(Z_i,\tau_0,\eta_0)\right]'(\hat\eta-\eta_0)}_{:=I_2'} + \underbrace{\frac{\sqrt n}{2}(\hat\eta-\eta_0)'\left[\frac{1}{n}\sum_{i=1}^n \partial_\eta\partial_{\eta'}\psi(Z_i,\tau_0,\eta_0)\right](\hat\eta-\eta_0)}_{:=I_3'}.$$

Typically, a standard CLT ensures that $I_1' \stackrel{d}{\longrightarrow} N(0,\mathbb{E}[\psi^2(Z_i,\tau_0,\eta_0)])$ as long as $\mathbb{E}[\psi^2(Z_i,\tau_0,\eta_0)] < \infty$. Then:
$$|I_2'| \le \left\|\frac{1}{\sqrt n}\sum_{i=1}^n \partial_\eta\psi(Z_i,\tau_0,\eta_0)\right\|_\infty \|\hat\eta-\eta_0\|_1 \lesssim \frac{s\log p}{\sqrt n} \to 0,$$
provided that we have a moderate deviation bound:
$$\left\|\frac{1}{\sqrt n}\sum_{i=1}^n \partial_\eta\psi(Z_i,\tau_0,\eta_0)\right\|_\infty \lesssim \sqrt{\log p},$$
which occurs under mild conditions using a more general version of Lemma 2.6, thanks to the (ORT) condition in Assumption 2.8. And finally:
$$|I_3'| \le \frac{\sqrt n}{2}\|\hat\eta-\eta_0\|_2^2 \left\|\frac{1}{n}\sum_{i=1}^n \partial_\eta\partial_{\eta'}\psi(Z_i,\tau_0,\eta_0)\right\|_{sp(s\log n)} \lesssim \frac{s\log p}{\sqrt n} \to 0,$$
if we assume that the sparse norm of the matrix is bounded:
$$\left\|\frac{1}{n}\sum_{i=1}^n \partial_\eta\partial_{\eta'}\psi(Z_i,\tau_0,\eta_0)\right\|_{sp(s\log n)} \lesssim 1.$$

These steps give the desired result in light of Equation (2.12). 

Remark 2.7 (Importance of Theorem 2.2). Theorem 2.2 is powerful because if you can
find an estimator which is defined as a root of an orthogonal moment condition, i.e. that
satisfies condition (ORT), this estimator is going to be asymptotically Gaussian with
a variance that you can estimate to perform inference. A word should be said on the
assumptions to emphasize the ones that matter the most. Assumption 2.8 is extremely
important and is the point of this whole section. The key part is that ψ should satisfy
the (ORT) condition, otherwise we cannot control I2'. Assumption 2.9 deals with the
quality of the nuisance parameter estimation: although it can be made more general to
accommodate other machine-learning-type methods, the nuisance parameter estimator
should have good performance. Assumption 2.6 on the growth condition is necessary
but is not very restrictive: p can grow as quickly as $e^{n^\alpha}$ for an α ∈ ]0, 1/2[! Assumption
2.10 does not matter: it is a simplification in the context of this course to make the
proof easier. Moreover, it is not so restrictive: many parameters of interest fit into that
framework.

Remark 2.8 (The Over-Identified Case). Because we consider a scalar parameter of


interest τ0 identified by a single equation, Equation (IMMUNIZED) was fine as a defini-
tion. In general, when τ0 is identified by a set of equations of larger dimension, the GMM
estimator will take the form:
$$\hat\tau = \arg\min_{\tau \in \mathbb{R}^d}\left\|\frac{1}{n}\sum_{i=1}^n \psi(Z_i, \tau, \hat\eta)\right\|_2^2.$$
The reason is that $n^{-1}\sum_{i=1}^n \psi(Z_i, \tau, \hat\eta) = 0$ will not have a solution in general.

2.6 The Double Selection Method
For a given score function m(.) which does not satisfy condition (ORT), how can we find
a ψ(.) that does? Notice that we denote the nuisance parameter by β0 in the first case
and by η0 in the second. This different notation signifies that most of the time η0 is
different from β0 and is in general of larger dimension. We are going to cover Belloni
et al. (2014) which deals with the linear case. Chernozhukov et al. (2015) covers the
Maximum Likelihood and GMM cases. Section 2.2 of Chernozhukov et al. (2018) covers
an even wider range of models.
Recall our estimation strategy in Model 2.5. Assume further that the treatment
equation is given by:
$$D_i = X_i'\delta_0 + \xi_i,$$

where Xi ⊥ ξi and ξi ⊥ εi. Denote η := (β, δ)′. We are going to show that the following moment condition:
$$\mathbb{E}\psi(Z_i, \tau_0, \eta_0) := \mathbb{E}[(Y_i - D_i\tau_0 - X_i'\beta_0)(D_i - X_i'\delta_0)] = 0, \tag{2.13}$$

is Neyman-orthogonalized, i.e. satisfies (ORT) and allows one to estimate τ0. ψ is a known function of the observables (the data) Zi := (Yi, Di, Xi), the parameter of interest τ0 and the nuisance parameter η0 = (β0, δ0)′. See Example 2.1 in Chernozhukov et al. (2018) for the wider framework that underpins this choice. Firstly:
$$\partial_\eta\psi(Z_i, \tau, \eta) = \begin{pmatrix} \partial_\beta\psi(Z_i, \tau, \eta) \\ \partial_\delta\psi(Z_i, \tau, \eta) \end{pmatrix} = \begin{pmatrix} -(D_i - X_i'\delta)X_i \\ -(Y_i - D_i\tau - X_i'\beta)X_i \end{pmatrix}.$$
In Model 2.5, β0 was implicitly defined by the orthogonality condition (i.e. the normal equations, or the theoretical FOC of the least squares program):
$$\mathbb{E}[(Y_i - D_i\tau_0 - X_i'\beta_0)X_i] = \mathbb{E}[\varepsilon_iX_i] = 0.$$

From the orthogonality condition in the treatment equation above, δ0 is such that:
$$\mathbb{E}[(D_i - X_i'\delta_0)X_i] = \mathbb{E}[\xi_iX_i] = 0. \tag{2.14}$$

So we have indeed that ψ satisfies (ORT), i.e. E∂_η ψ(Z_i, τ_0, η_0) = 0. Equation (2.13) has a Frisch–Waugh–Lovell flavour:
$$\mathbb{E}[\underbrace{(Y_i - D_i\tau_0 - X_i'\beta_0)}_{\substack{\text{Residual from}\\\text{Outcome Regression}}}\underbrace{(D_i - X_i'\delta_0)}_{\substack{\text{Residual from}\\\text{Treatment Regression}}}] = \mathbb{E}(\xi_i\varepsilon_i) = 0,$$

or even more clearly, because of Equation (2.14):
$$\mathbb{E}[\underbrace{(Y_i - X_i'\beta_0 - (D_i - X_i'\delta_0)\tau_0)}_{\substack{\text{Residual from}\\\text{Outcome Regression}}}\underbrace{(D_i - X_i'\delta_0)}_{\substack{\text{Residual from}\\\text{Treatment Regression}}}] = 0,$$

which is the orthogonality condition in a problem where Y is regressed on the residual from the treatment regression and X. Another formulation is given by:
$$\tau_0 = \frac{\mathrm{Cov}[D_i - X_i'\delta_0,\ Y_i - X_i'\beta_0]}{\mathbb{E}[(D_i - X_i'\delta_0)^2]}.$$
τ0 is the coefficient from the regression of the residual of the regression of Y on X on the residual of the regression of D on X. This observation gives rise to the Double Selection method of Belloni et al. (2014) and the Double Machine Learning estimator of Chernozhukov et al. (2017a):

1. (Treatment Selection) Regress D on X using a Lasso, obtain δ̂^L. Define Ŝ_D := {j = 1, ..., p : δ̂_j^L ≠ 0}, the set of selected variables,

2. (Outcome Selection) Regress Y on X using a Lasso, obtain β̂^L. Define Ŝ_Y := {j = 1, ..., p : β̂_j^L ≠ 0},

3. (Estimation) Regress Y on D and the ŝ = |Ŝ_D ∪ Ŝ_Y| elements of X which correspond to the indices j ∈ Ŝ_D ∪ Ŝ_Y, using OLS.

Intuitively, the third step may come as a surprise, since we perform a regression of Y on D and the selected X instead of performing the regression of Y − X'β̂^L on D − X'δ̂^L. By the Frisch–Waugh–Lovell Theorem (Theorem 2.3), this is equivalent, up to the difference between the Lasso and the Post-Lasso estimator (the third step amounts to using a Post-Lasso instead of the Lasso). Define the post-double-selection estimators β̂ and δ̂ as:
$$\hat\beta = \arg\min_{\beta:\ \beta_j = 0\ \forall j \notin \hat S_D \cup \hat S_Y}\ \sum_{i=1}^n (Y_i - D_i\hat\tau - X_i'\beta)^2, \tag{2.15}$$
$$\hat\delta = \arg\min_{\delta:\ \delta_j = 0\ \forall j \notin \hat S_D \cup \hat S_Y}\ \sum_{i=1}^n (D_i - X_i'\delta)^2. \tag{2.16}$$

Based on Equation (2.13), the post-double-selection estimator τ̌ has explicit form:

n−1 ni=1 (Yi − Xi0 β)(D 0b


P
i − Xi δ)
b
τ̌ = . (2.17)
n−1 ni=1 Di (Di − Xi0 δ)
P b

34
The resulting estimator verifies Theorem 2.2 and allows you to perform inference on the
parameter of interest. For the intuition behind this result, see again Section 2.4 on the
regularization bias. The selection procedure advocated here is based on a two-equation
approach. By selecting the elements of X in relation with both D and Y , it does not
miss any confounder as it was the case with the more naive approach.

Remark 2.9 (Computing Standard Errors). When both the outcome and the treatment
equations are linear, the asymptotic variance in Theorem 2.2 writes:
E[ξi2 ε2i ]
σΓ2 = ,
E[ξi2 ]2
which can be consistently estimated by:
" n #−2 n
1 X 1 X
bΓ2 =
σ ξbi2 ξb2 εb2 ,
n i=1 n − sb − 1 i=1 i i

τ Di −Xi0 βb and ξbi = Di −Xi0 δ.


with εbi = Yi −b b And post-double-selection estimators defined

in (2.15) and (2.16).

Moving beyond the linear case, Farrell (2015) presents a more general method to
estimate treatment effect parameters (ATE, ATET) using similar ideas.

2.7 Auxiliary Mathematical Results


Lemma 2.6, extracted from Chatterjee (2013), gives a bound on the tail of the maximum
of Gaussian random variables. This lemma is actually more general and applies to sub-
gaussian random variables. We also recall the Frisch-Waugh-Lovell Theorem 2.3.

Lemma 2.6 (Concentration Inequality for Gaussian Random Variables). Consider p


gaussian random variables such that for j = 1, ..., p, ξj ∼ N (0, σj2 ) and set L = max σj .
j=1,...,p
Then:  
p
E max |ξj | ≤ L 2 log(2p).
j=1,...,p
 
Proof of Lemma 2.6 Since ξj ∼ N (0, σj2 ), a direct calculation shows that E ecξj =
c2 σj2
e for any c ∈ R. As an aside, note that the proof will only use that ξj is sub-gaussian,
2

 cξ  c2 σj2
i.e. E e j
≤e 2 .
     
1
E max |ξj | = E log exp max c|ξj |
j=1,...,p c j=1,...,p

35
" ( p )#
1 X
c|ξj |
≤ E log e
c j=1
" ( p )#
1 X
≤ E log ecξj + e−cξj
c j=1
( p )
1 X  
E ecξj + E e−cξj
 
≤ log
c j=1
1 n c2 L2
o
≤ log 2pe 2
c
log(2p) cL2
= + .
c 2

where the third inequality uses Jensen inequality and the fourth the remark at the begin-
p
ning of the proof. The bound is minimized for the value c∗ = 2 log(2p)/L and is equal
p
to L 2 log(2p) which completes the proof. 

Theorem 2.3 (Frisch–Waugh–Lovell’s Theorem, Frisch and Waugh 1933, Lovell 1963).
Consider the regression of the vector of dimension n, y, on the full-rank matrix of
dimension n × p, X. Consider the partition: X = [X1 : X2 ], and define PX1 :=
X1 [X1 0 X1 ]−1 X1 0 , MX1 := In − PX1 , and PX and MX the same quantities for X. Con-
sider the two quantities:

1. βb = (βb10 , βb20 )0 = (X0 X)−1 X0 y,

2. β̃2 = (X2 0 MX1 X2 )−1 X2 0 MX1 y,

then β̃2 = βb2 .

Proof of Theorem 2.3. Consider the decomposition:

y = PX y + MX y = Xβb + MX y = X1 βb1 + X2 βb2 + MX y.

Pre-multiply everything by X2 0 MX1 :

X2 0 MX1 y = X2 0 MX1 X1 βb1 + X2 0 MX1 X2 βb2 + X2 0 MX1 MX y = 0 + X2 0 MX1 X2 βb2 + 0.

As a consequence, βb2 = (X2 0 MX1 X2 )−1 X2 0 MX1 y = β̃2 . 


See also the regression anatomy formula, in (Angrist and Pischke, 2009, pp 35-36).

36
Chapter 3

Methodology: Using Machine


Learning Tools in Econometrics

This chapter is a methodological extension of the concepts seen previously. In particular,


the setting is the one of Section 2.5, where we are interested in performing inference over
a low-dimensional parameter τ0 in the presence of a high-dimensional nuisance parameter
η0 using an orthogonal score function ψ(Z, τ, η), i.e. that respects (ORT). Here, variable
selection may or may not be the starting point but at least η0 is a complex object that
will be estimated using Machine Learning (ML) tools. We do not survey these tools
as they are studied in other classes at ENSAE but one can think of Lasso, Post-Lasso
and Ridge regressions, Elastic Nets, Regression Trees and Random Forests, Neural Net-
works, Aggregated Methods, etc. See Hastie et al. (2009) for a textbook treatment. We
also strongly recommend the survey by Athey and Imbens (2019) that targets empirical
economists.
Theoretical arguments will not be developed in this chapter. The most complete
and clear reference is Chernozhukov et al. (2017a). Other similar references albeit not
necessarily as complete are Chernozhukov et al. (2015); Chernozhukov et al. (2015). For
computational aspects, hdm is an R package developed by Victor Chernozhukov, Christian
Hansen and Martin Spindler available at https://cran.r-project.org/web/packages/
hdm/index.html. See also https://github.com/demirermert/MLInference. It mostly
focuses on the Lasso in the IV and unconfoundedness cases. It is very simple to use
for most simple applications of the Lasso, including the Logit-Lasso. A Stata package,
LASSOPACK, has been recently developed too, available at https://ideas.repec.org/c/
boc/bocode/s458458.html.

37
3.1 Double Machine Learning and Sample-Splitting

In this section, we take as given a low-dimensional parameter of interest τ0 , the high-


dimensional nuisance parameter η0 and especially the orthogonal score function ψ(Z, τ, η).
It is necessary that ψ satisfies (ORT) as highlighted in Section 2.5. Besides the orthogonal
score, sample-splitting is the other necessary ingredient that makes the method works in
more complex problems without making strong assumptions.

The Method: Cross-fitting Double ML. We present the DML1 method of Cher-
nozhukov et al. (2017a). We assume that we have a sample of n copies of the random
vector Zi where n is divisible by an integer K to simplify notations.

1. Take a K-fold random partition (Ik )k=1,...,K of observation indices {1, . . . , n} such
that each fold Ik has size n/k. For each k ∈ {1, . . . , K} define IkC := {1, . . . , n} \Ik .

2. For each k ∈ {1, . . . , K}, construct a ML estimator of η0 using only the auxiliary
sample IkC :
 
η̂k = η̂ (Zi )i∈IkC .

3. For each k ∈ {1, . . . , K}, using the main sample Ik , construct the estimator τ̌k as
the solution of:
1 X
ψ(Zi , τ̌k , η̂k ) = 0.
n/K i∈I
k

4. Aggregate the estimators τ̌k on each main sample:


K
1 X
τ̌ = τ̌k .
K k=1

Remark 3.1 (Sample-splitting). Theorem 3.1 in Chernozhukov et al. (2017a) shows


that the cross-fitting estimator τ̌ is asymptotically Gaussian under reasonable condi-
tions. There is no theory to choose K but reasonable values are K = 2, 4, 5. The name
cross-fitting comes from the particular sample-splitting technique adopted here where the
auxiliary sample IkC used to estimate η0 and the main sample Ik used to estimate τ0 are
swapped in order to not lose efficiency (the sample Ik is shorter than n). Sample-splitting

38
is necessary for technical reasons: it helps controlling remainder terms without using as-
sumptions that are too strong, allowing for many types of machine learning estimators
to be used for estimating the nuisance parameters. Intuitively, sample-splitting removes
the bias stemming from over-fitting by using an hold-out sample to estimate the nuisance
parameter η0 and then predict over the main sample.
Notice that the sample-splitting technique advocated here introduces more uncer-
tainty, which should be taken into account when reporting the results. Chernozhukov
et al. (2017a) also proposes to split the sample in S different random partitions and re-
port the mean cross-fitting estimator and a corrected standard error. We do not explore
these refinements.
This GitHub repository https://github.com/VC2015/DMLonGitHub/ contains very
clear files to perform cross-fitting procedure described above with many different ML
techniques.

3.2 Orthogonal Scores to Estimate Treatment Ef-


fects
We consider the model in Section 5.1 of Chernozhukov et al. (2017a) which is a more
flexible model where treatment effects are allowed to be heterogeneous. Consider the
vector (Y, D, X) such that D ∈ {0, 1} and:

Y = g0 (D, X) + ε, E [ε | D, X] = 0,

D = m0 (X) + ξ, E [ξ | X] = 0.

Section 2.6 is a particular case where we had g0 (D, X) = Dτ0 + X 0 β0 and m0 (X) = X 0 δ0 .
We have two standard target parameters of interest, the Average Treatment Effect (ATE)
and the Average Treatment Effect on the Treated (ATET):

τ0AT E = E [g0 (1, X) − g0 (0, X)] ,

τ0AT ET = E [g0 (1, X) − g0 (0, X) | D = 1] .

Notice that in Section 2.6, τ0AT E = τ0AT ET = τ0 because there was no treatment effect
heterogeneity. For the ATE, the orthogonal score of Hahn (1998) is given by:
D(Y − g(1, X)) (1 − D)(Y − g(0, X))
ψ AT E (Zi , τ, η) = [g(1, X) − g(0, X)] + − − τ.
m(X) 1 − m(X)

39
The nuisance parameter true value is η0 = (g0 , m0 ). For the ATET, the orthogonal score
is:
 
AT ET 1 m(X) D
ψ (Zi , τ, η) = D− (1 − D) (Y − g(0, X)) − τ.
π0 1 − m(X) π0
The nuisance parameter true value is η0 = (g0 , m0 , π0 ) with π0 = P (D = 1). These or-
thogonal scores are the basis of Farrell (2015); Bléhaut et al. (2017). Similar expressions
exist for the Local Average Treatment Effect (LATE) and can be found in Chernozhukov
et al. (2018).

 
Question: Verify that E ψ AT E (Zi , τ0AT T , η0 ) = 0. Do the same for τ0AT ET .

Remark 3.2 (Affine-quadratic models and trick to compute the variance). Notice that
these orthogonal scores fall under the affine-quadratic type of Assumption 2.10 so com-
puting standard errors will simply follow from the expression in Theorem 2.2. Moreover,
in both cases, E [Γ1 (Zi , η0 )] = −1 which implies that τ0 = E [Γ2 (Zi , η0 )]. As a conse-
quence, from Theorem 2.2, σΓ2 = V [Γ2 (Zi , η0 )]. This observation makes computation of
the standard error fairly simple: for each observation, store Γ̂2 (Zi , η̂) in a vector gamma
and compute the standard error using sd(gamma) / sqrt(n).

In practice. Suppose you observe the outcome, treatment status and a set of covariates
(Zi )i=1,...,n = (Yi , Di , Xi )i=1,...,n from a population of interest and wants to estimate the
treatment effect for the treated τ0AT ET . Here is a strategy you could use:

1. Partition the observation indices {1, . . . , n} in two, such that each fold (I1 , I2 ) has
size n/2.

2. Using only the sample I1 , construct a ML estimator of g(0, X) and m(X). For
example, g(0, X) can be estimated by running a feedforward neural network of Yi
on Xi for the non-treated in this sample. Let us denote this estimator by gbI1 (x).
Similarly, m(X) could be estimated by running a Logit-Lasso of Di on Xi in this
sample. Let us denote this estimator by m
b I1 (x).

3. Now, use these estimators on the sample I2 to compute the treatment effect
1 X b I1 (Xi )
m

τ̌I2 := P Di − (1 − Di ) (Yi − gbI1 (Xi ))
i∈I2 Di i∈I 1−m b I1 (Xi )
2

40
4. Repeat steps 2-3, swapping the roles of I1 and I2 to get τ̌I1 .

5. Aggregate the estimators:


τ̌I1 + τ̌I2
τ̌ = .
2

3.3 Simulation Study: The Regularization Bias


This simulation exercise illustrates two points: (i) the naive post-selection estimator
suffers from a large regularization bias, (ii) the cross-fitting estimator trade-off a larger
bias for a smaller MSE compared to the immunized estimator that uses the whole sample.

DGP. File: DataSim.R. The outcome equation is linear and given by: Yi = Di τ0 +
Xi0 β0 + εi , where τ0 = .5, εi ⊥ Xi , and εi ∼ N (0, 1). The treatment equation follows
a Probit model, Di |Xi ∼ Probit (Xi0 δ0 ). The covariates are simulated as Xi ∼ N (0, Σ),
where each entry of the variance-covariance matrix is set as follows: Σj,k = .5|j−k| . Every
other element of Xi is replaced by 1 if Xi,j > 0 and 0 otherwise. The most interesting
part of the DGP is the form of the coefficients δ0 and β0 :
ρd (−1)j /j 2 , j < p/2 ρy (−1)j /j 2 , j < p/2
( (
β0j = , δ0j =
0, elsewhere ρy (−1)j+1 /(p − j + 1)2 , elsewhere
We are in an approximately sparse setting for both equations. ρy and ρd are constants
that are set to fix the signal-to-noise ratio, in the sense that a larger constant ρy will
mean that the covariates play a larger role. The trick here is that some variables that
matter a lot in the treatment assignment are irrelevant in the outcome equation. The
fact that the one-equation selection procedure will miss some relevant variables for the
outcome should create a bias and non-Normal behavior.

Model and Estimators. We estimate a model based on linear equations for both the
outcome and the treatment as in Section 2.6, although it does not corresponds to the
DGP. We compare three estimators:

1. A simple post-selection estimator where a Lasso-selection step is performed using


the outcome equation as described in Section 2.4,

2. A double-selection estimator based on the Lasso as described in Section 2.6,

41
3. A double-selection estimator based on the Lasso with cross-fitting (K = 5) as
described in Section 3.1.

You can play with these simulations using the file DoubleML Simulation.R. Notice that
this file makes every step very clear and makes very little use of any package so you can
follow easily what is going on. In particular, the Lasso regression is coded from scratch
functions/LassoFISTA.R. Table 3.1 and Figure 3.1 display the result in one particular
high-dimensional setting.

Table 3.1: Estimation of τ0

Estimator:
Naive Post-Selec Double Selec Double Selec
w. Cross-fitting
(1) (2) (3)
Bias .748 .020 .024
Root MSE .766 .200 .183
Parameters are set to: R = 10, 000, n = 200, p = 150, K = 5,
Note:
τ0 = .5, ρy = .3, ρd = .7.

Figure 3.1: Distribution of τ̂ − τ0


Naive Post−Selec Double−Selec Double−Selec, Cross−fitting
2.5
2.0
2.0
2.0
1.5
1.5
1.5
density

density

density

1.0
1.0
1.0

0.5 0.5
0.5

0.0 0.0 0.0

−1 0 1 −1 0 1 −1 0 1
Treatment effect Treatment effect Treatment effect

Note: See table above.

3.4 Empirical Application: Job Training Program


We revisit LaLonde (1986)’s dataset using the exercise in Bléhaut et al. (2017). This
dataset was first built to evaluate the impact of the National Supported Work (NSW)
program. The NSW is a transitional, subsidized work experience program targeted to-
wards people with longstanding employment problems: ex-offenders, former drug addicts,
women who were long-term recipients of welfare benefits and school dropouts. Here, the

42
quantity of interest is the ATET, defined as the impact of the participation in the pro-
gram on 1978 yearly earnings in dollars. The treated group comprises people who were
randomly assigned to this program from the population at risk (n1 = 185). Two con-
trol groups are available. The first one is experimental: it is directly comparable to the
treated group as it has been generated by a random control trial (sample size n0 = 260).
The second one comes from the Panel Study of Income Dynamics (PSID) (sample size
n0 = 2490). The presence of the experimental sample allows to obtain a benchmark for
the ATET obtained with observational data. We use these datasets to illustrate the tools
seen in the chapter.
To allow for a flexible specification, we consider the setting of Farrell (2015) and take
the raw covariates of the dataset (age, education, black, hispanic, married, no degree,
income in 1974, income in 1975, no earnings in 1974, no earnings in 1975), two-by-two-
interactions between the four continuous variables and the dummies, two-by-two inter-
actions between the dummies and up to a degree of order 5 polynomial transformations
of continuous variables. Continuous variables are linearly rescaled to [0, 1]. All in all, we
end up with 172 variables to select from. The experimental benchmark for the ATET
estimate is $1,794 (633). We use the package hdm to implement the Lasso and Logit-Lasso
and the package randomForest to use a random forest of 500 trees. We partition the
sample in 5 folds.

Table 3.2: Treatment Effect on LaLonde (1986)

Estimator:
Experimental Cross-fitting Cross-fitting
w. 20 partitions
(1) (2) (3)
OLS 1,794
(633)
Lasso 2,305 2,403
(676) (685)
Random Forest 7,509 1,732
(6,711) (1,953)

The file DoubleML Lalonde.R details each step and compute a ATET estimate where
the propensity score and outcome functions are estimated using (i) a Lasso procedure and
(ii) a random forest. We compute standard errors and confidence intervals. Table 3.2

43
displays the results. With or without considering many data splits, the Lasso procedure
ends up pretty close to the experimental estimate. The results are more mixed for the
random forest: the simple cross-fitting procedure gives very imprecise results. They
might be due to a particularly unfortunate split or a particularly bad performance of the
off-the-shelf random forest algorithm in this case. When considering many partitions of
the data, the point estimate is reasonable but the standard-error is still very high. All
in all, the message is to be cautious and test several ML algorithms when possible and
consider many data splits so the results do not depend so much on the partitions. For
a comparison between a wide range of ML tools, see Section 6.1 in Chernozhukov et al.
(2018).

44
Chapter 4

High-Dimension and Endogeneity

This chapter reviews some important results addressing model selection in the linear
instrumental variables (IV) model. Namely, we remove the exogeneity assumption, ε ⊥
(D, X), from model (2.5), but assume the econometrician possesses several instruments
verifying an exogeneity assumption, while allowing the identity of these instruments to
be unknown and the number of potential candidates to be larger than the sample size.
We distinguish two different cases of high-dimension in the IV model:

– The (very)-many-instruments case, i.e. pzn > n where the number of instruments p
is allowed to grow with the sample size n,

– The many endogenous variables case, i.e. pdn > n where pdn is the number of
endogenous variables, but still pzn > pdn .

Those two frameworks are natural in empirical applications, but inference in the second
case is more complicated and will not be treated here1 . Instrumental variable techniques
to handle endogenous variables are widespread but often lead to imprecise inference.
With few instruments and controls, following Amemiya (1974), Chamberlain (1987), and
Newey (1990), one can try to improve the precision of IV techniques by estimating optimal
instruments. Consider the model 4.1 below:

Assumption 4.1 (IV Model). Consider the i.i.d sequence (Yi , Di , Xi , Zi )i=1,...,n satisfying

0
Yi = Di τ0 + Xi β0 + εi , E[εi ] = 0, E[εi |Zi , Xi ] = 0, (4.1)

where
1
see Remark 4.2 below.

45
1. Xi is a vector of pxn exogenous control variables, including the constant 1.

2. Zi is a vector of pzn instrumental variables

3. Di is an endogenous variable, E [εi |Di ] 6= 0

4. we have pxn  n and pzn  n (we denote by px := pxn and pz := pzn ).

0 0 0
The moment condition E[ε|W ] = 0, where W = Z , X , implies a sequence of un-
conditional moment conditions E[εA(W )] = 0 indexed by a vector of instruments: the
function A(·) such that E [A(W )2 ] < ∞. This legitimately raises the question of the
choice of A(·) in order to minimize the asymptotic variance and obtain more precise esti-
mates. In Section 4.1, we briefly summarize results on the optimal instruments problem:
which is the optimal transformation A? We refer to Newey and McFadden (1994) for
more details and the methodology in the low dimensional case of model 4.1 (without
assumption 4).
However, even with one instrument Z, considering a high number of transformations of
0
the initial instrument (f1 (Z), . . . , fp (Z)) using series estimators, using B-Splines, poly-
nomials, ect... makes the problem high dimensional. Then in Section 4.2, we present
tools from Belloni et al. (2012a) who use Lasso and Post-Lasso to estimate the first-stage
regression of the endogenous variables on the instruments. As described in Chernozhukov
et al. (2015), this problem fits the double ML structure described in Section 2.6. Estima-
tion of the parameters of interest uses orthogonal or immunized estimating equations that
are robust to small perturbations in the nuisance parameter, similarly to what was used
in Chapter 2.5. For simplicity, we restrict ourselves to the conditionally homoscedastic
case
E ε2 |Z, X = σ 2 .
 

We refer to Belloni et al. (2012a) for the general case.

4.1 The Optimal Instruments Problem

In this section we remind results about the optimal choice of A(·) in the moment equation
E[εA(W )] = 0 such that E [A(W )2 ] < ∞ to obtain more precise estimates. Define

46
0 0 0 0
θ0 := τ0 , β0 and S := D, X ∈ Rp+1 . We study the Generalized Method of Moments
estimator (GMM) based on the moments conditions
h  0
i
M (θ0 , A) := E A(W ) Y − S θ0 = 0,
2
Define M
c c(θ, A)0 M
(θ, A) := M c(θ, A) and:
2

2
θ̂n := argmaxθ∈Θ − M (θ, A) , (4.2)
c
2

c(θ, A) := En [ψ(U, θ, A)], ψ(U, θ, A) = A(W ) Y − S 0 θ0 , and U = (Y, D, X, Z).



where M
Denote by
h 0
i
G(A) := E [∇θ ψ(U, θ0 , A)] = E A(W )S .

The set of assumptions 4.2 below ensure identification of θ0 in the set Θ, namely that
M (θ, A) vanishes only at θ0 :

∀θ ∈ Θ, M (θ, A) = 0 ⇒ θ = θ0

and allow to prove asymptotic normality of the GMM estimator θ̂n .

Assumption 4.2 (Regularity Conditions). 1. θ0 is an interior point of Θ, which is


compact;

2. ψ(u, ·, A) is continuously differentiable in a neighborhood N of θ0 with probability


approaching one;

3. E kψ(U, θ0 , A)k2 is finite and E [supθ∈N k∇θ ψ(U, θ, A)k] < ∞,


 

0
4. G(A) G(A) is non singular.

Under Assumption 4.2, we have

c(θ, A)0 M
c(θ, A) →P −E ψ(U, θ, A)0 E [ψ(U, θ, A)]
 
1. convergence of the objective function −M
which admits a unique maximum at θ = θ0 ;
 −1
1 Pn  0 1 Pn
2. convergence θ̂n →P θ0 , where θ̂n = i=1 A(Wi )Si [A(Wi )Yi ] (see
n n i=1
Theorem 5.7 in Van der Vaart (1998));

47
3. asymptotic normality, namely that
−1 0
!
√    −1 0 
0 0
n θ̂n − θ0 →d N 0, G G G ΣG GG . (4.3)

Indeed, to prove the asymptotic normality, consider the first order condition
 0  
∇θ M
c θ̂n , A Mc θ̂n , A = 0

which is satisfied with probability approaching one. Then, using second order Taylor’s
  h i
theorem for Mc θ̂n , A at θ0 yields that there exists θ ∈ θ0 , θ̂n such that

 0    0  0  
∇θ M θ̂n , A M θ̂n , A = ∇θ M θ̂n , A M (θ0 , A) + ∇θ M θ̂n , A ∇θ M θ θ̂n − θ0
c c c c c c

thus, with high probability,

√    0  −1 0 √
    
n θ̂n − θ0 = ∇θ M c θ̂n , A ∇θ M
c θ −∇θ M
c θ̂n , A nM
c(θ0 , A) .

  
Then, using condition (3) we have ∇θ Mc θ, A →P G(A) and ∇θ M c θ̂n , A →P G(A),
√ c  0
using condition (4), nM (θ0 , A) →d N (0, Σ), where Σ = E ψ(U, θ0 , A)ψ(U, θ0 , A) , and
using the Slutsky’s theorem we obtain (4.3).
0
Thus, in (4.3), the asymptotic variance simplifies to G−1 Σ(G−1 ) and takes a specific form

−1
h 0
i 0
V (A) := G(A) E ψ(U, θ0 , A)ψ(U, θ0 , A) G(A)−1 .

Theorem 4.1 (Necessary condition for optimal instruments, Theorem 5.3 in Newey and
McFadden (1994) p. 2166). If an efficient choice A of A exists for the estimator (4.2),
then it has to solve
h 0 i
G(A) = E ψ(U, θ0 , A)ψ U, θ0 , A , for all A such that E A(W )2 < ∞.
 

This can be restated as:


h 0 0
i
E [A(W )S 0 ] = E A(W )(Y − S θ0 )2 A(W ) .

Thus, using iterated expectations, we obtain


h  h 0
i i
E A(W ) E [S 0 |W ] − E (Y − S θ0 )2 |W A(W )0 = 0

48
 0 
This is satisfied, using the homoscedasticity assumption E (Y − S θ0 )2 |W = σ 2 , when
E [S|W ]
A(W ) = .
σ2
Being invariant to a multiplication by a constant matrix, the function A(W ) = E [S|W ]
minimizes the asymptotic variance, which becomes
0 −1
h i
Λ∗ = σ 2 E E [S|W ] E [S|W ] , (4.4)

which is the semi-parametric efficiency bound (see Chapter 25 in Van der Vaart (1998)).
Here, A(W ) is the optimal instrument. In practice, the optimal instrument is the regres-
sion function of S on W , w 7→ E [S|W = w], which is naturally a high dimensional object
(see, e.g. Tsybakov (2009)). It has to be estimated and we now describe how to use the
sparsity assumption to perform efficient estimation in a high dimensional setting.
Note that these are plenty of ways to estimate E [S|W ] under different assumptions.
Lately we use machine learning tools to allow for W to be high dimensional under spar-
sity assumptions.

4.2 Sparse Model for Instrumental Variables


We now assume the linear first stage equation:
0 0
D = X γ0 + Z δ0 + u, u ⊥ (Z, X), (FS)

where, as described in Assumption 4.3 below, δ0 has only few “important” components
(approximately sparse), and because instruments Z may be correlated with the controls
X, we use the equation:

Z = ΠX + ζ, X ⊥ ζ, Π ∈ Mpxn ,pxn (R), (4.5)

which yields the following two equations:


0 0 0 0 0 0 0
D = X γ0 + X Π δ0 + u + ζ δ0 = X (γ0 + Π δ0 ) + u + ζ δ0 , (4.6)
| {z } | {z }
:=ν0 :=ρd

where ρd ⊥ X and
0 0 0
Y = X (ν0 τ0 ) + X β0 + ε + τ0 ρd = X (ν0 τ0 + β0 ) + ε + τ0 ρd . (4.7)
| {z } | {z }
:=θ0 :=ρy ⊥X

49
We make three preliminary remarks.
First, the following two cases naturally arise in practice:

1. either the list of available and possible instruments is large, while the econometrician
knows that only few of them are relevant;

2. or, from a small list of regressors Z, the optimal instruments can be approximated
using a basis of functions (series estimators, using B-Splines, polynomials, ect). This
case is treated using non-sparse methods in Newey (1990). In this decomposition,
z
the potential number pz of needed functions {fj }pj=1 is allowed to be larger than
n. Note that instead of Z, one could also consider transformations of the initial
instruments
0 0
f = (f1 , . . . , fp ) = (f1 (Z), . . . , fp (Z)) .

Second, like in section 2.6, the key assumption made on the nuisance component is
approximate sparsity: namely A(W ) = E [S|W ] (remember that here pd = 1) is assumed
to be well approximated by few (s  n) of these pz instruments. Denote the nuisance
component by η0 = (θ0 , ν0 , δ0 , γ0 ) and assume that it can be decomposed into a sparse
component η0m and relatively small non-sparse component η0r :

Assumption 4.3 (Approximate Sparsity). There exists c > 0 such that

η0 =η0m + η0r , supp(η0m ) ∩ supp(η0r ) = ∅


p p
kη0m k0 ≤ s, kη0r k2 ≤ c s/n, kη0r k1 ≤ c s2 /n.

Third, like in Section 2.6, we have to choose the moment equations carefully so that
model selection errors in the estimation of the nuisance component (here, the optimal
instrument) have limited impact on the estimation of the parameter of interest τ0 . We
now show how the optimal instrument problem can be cast in the framework of the im-
munization procedure developed in Section 2.6, and in particular in the Affine-Quadratic
model (2.10).

50
4.3 Immunization Procedure for Instrumental Vari-
able Models
Starting from (4.1), Chernozhukov et al. (2015) propose to base estimation using or-
thononalised moments like in Frisch–Waugh–Lovell theorem. Consider the space of ran-
dom variables that are square integrable on the canonical probability space (Ω, A, P ),
which we denote by L2 (P ). This is an Hilbert space associated with the scalar product
1/2
< X, W >= E [XW ] and norm kXk = E [X 2 ] . Define pX (W ) = E [W |X], which is
the orthogonal projection of W on the subspace of L2 (P ), {ξ = h(X), E [h(X)2 ] < ∞}
of square integrable random variables that are measurable with respect to X. Applying
mX (W ) = W − pX (W ) = W − E [W |X] to (4.1) yields the equations:

mX Y =mX Dτ0 + mX ε, E [ε|X, Z] = 0 (4.8)

where

0
mX D = D − E [D|X] = D − X ν0 ,
0
mX Y = Y − E [Y |X] = Y − X (ν0 τ0 + β0 ).

For estimation, we use the following implication of (4.1),

E [mX ε (mX pX,Z D)] = 0, (4.9)

where

0 0 0
mX pX,Z D = mX E[D|X, Z] = E[D|X, Z] − E[E[D|X, Z]|X] = X γ0 + Z δ0 − X ν0 .

Note that if D where exogenous, (4.9) would simply be E [mX εmX D] = 0. In the present
context, in the same spirit as the optimal instrument, we should use with p  n

E [εE [D|X, Z]] = 0 (4.10)

but to handle the errors coming from the selection in the estimation of covariates X, we
have to subtract the term E [D|X] to obtain a robust estimator which yields

E [(ε − E [ε|X]) (E [D|X, Z] − E [D|X])] = 0. (4.11)

51
The moment condition (4.8) can be rewritten as

E [ψ(W, τ0 , η)] = 0

where
 0
 0 0 0

ψ(W, τ0 , η) = Y − τ0 D − X β0 Z δ0 + X γ0 − X ν0 (4.12)
 0
 0 0 0

= Y − τ0 D − X (θ0 − ν0 τ0 ) Z δ0 + X γ0 − X ν0
 0 0
 0 0

T
= Y − X θ0 − (D − X ν0 )τ0 Z δ0 + X γ0 − X ν0 (4.13)

= ((Y − E [Y |X]) − (D − E [D|X])τ0 ) (E [D|X, Z] − E [D|X])

and we also have

 0 0 0

ψ(W, τ0 , η) = ρy − ρd τ0 Z δ0 + X γ0 − X ν0
 0 0 0

= ε Z δ0 + X γ0 − X ν0 . (4.14)

The instruments for D, controlling for the correlation between Z and X, are

A(Z) = E [D|Z, X] − E [D|X] (4.15)


0 0 0 0
=Z δ0 + X γ0 − X (γ0 + Π δ0 )
0
=(Z − ΠX) δ0 (4.16)

=ζ 0 δ0 .

Equation (4.13) can be rewritten under the form of the Affine-Quadratic model (2.10)

M (τ0 , η) = E [τ0 ψ1 (W, η) + ψ2 (W, η)] = Γ1 (η)τ0 − Γ2 (η) (4.17)


h 0
 0 0 0
i
Γ1 (η) := E [Γ1 (W, η)] := −E (D − X ν0 ) Z δ0 + X γ0 − X ν0 , (4.18)
h 0
 0 0 0
i
Γ2 (η) := E [Γ2 (W, η)] := −E Y − X θ0 Z δ0 + X γ0 − X ν0 . (4.19)

We summarize the estimation algorithm proposed in Chernozhukov et al. (2015) before


studying its theoretical properties:

1. Do Lasso or Post-Lasso Regression of D on (X, Z) to obtain γ̂ and δ̂;

2. Do Lasso or Post-Lasso regression of Y on X to get θ̂;

52
0 0
3. Do Lasso or Post-Lasso regression of D̂ = X γ̂ + Z δ̂ on X to get ν̂;
 T
The estimator of η0 is η̂ = θ̂, ν̂, γ̂, δ̂ ;

4. Then
√ 2 h i−1
τ̌ = argminτ ∈R nM
c(τ, η̂)
= Γc1 (η̂)0 Γ
c1 (η̂) c1 (η̂)0 Γ
Γ c2 (η̂).

Note that Step 4 amounts to perform 2SLS using the residuals Y − θ̂0 X from Step
2 as running variable, the residuals D − D̂ from Step 1 as covariate, and the residuals
D̂ − ν̂ 0 X as instruments.

Question: Verify that Assumption (ORT) holds: ∂η M (τ, η) = 0.


Question: Show that the orthogonality (ORT) condition does not hold if there is no
term E [D|X] in (4.15).

Remark 4.1. In the case of “small number” of controls (see Belloni et al. (2012a)), θ0
is no longer a “nuisance” parameter in the sense that there is no selection to be done on
X. In this case, we can take A(W ) = E [D|Z, X], as (ORT) does not have to hold with
respect to θ0 .

Using the formulation (4.17) of the model as an Affine-Quadratic model (see assump-
tion 2.10) and if assumptions of Theorem 2.2 are satisfied, namely if

1. we assume sparsity, i.e assumption 4.3 with η0 = η0m ;

2. assumption (ORT) is satisfied;

3. the High-Quality Nuisance Estimation Assumption with sparsity;


4. the growth condition s log(p)/ n → 0 holds;

then we can apply Theorem 2.2 and asymptotic normality holds


n (τ̌ − τ0 ) → N (0, σΓ2 ), (4.20)

with σΓ2 := E[ψ(W, τ0 , η0 )2 ]/E[Γ1 (W, η0 )]2 .

53
Question: Show that, when these are no controls X (take Z = ζ), Λ∗ = σΓ2 , where
−1
from (4.4), Λ∗ = σ 2 E E [D|Z]2 , and thus that this estimator of τ0 with Optimal


IV Estimated by Lasso or Post-Lasso achieves the efficiency bound. This extends to


“small number of controls” (see Remark 4.1).

Moreover, Theorem 3 in Belloni et al. (2012a) shows that the result continues to hold
h i−1 P  2
0
with Λ∗ replaced by Λ̂∗ = σ̂ 2 E D̂2 , where σ̂ 2 := ni=1 Yi − Di τ̂ − Xi β̂ /n.

Remark 4.2. If we use approximate sparsity assumption 4.3, then we have to impose the
following assumption, and the result (4.20) also holds (see Chernozhukov et al. (2015))

Assumption 4.4 (High-Quality Nuisance Estimation with Approximate sparsity). Make


assumption 4.3 and assume that ηb satisfies, with high probability

kη̂k0 ≤s,
r
s
kη̂ − η0m k2 ≤ log p,
n
r
s2
kη̂ − η0m k1 ≤ log p,
n

Remark 4.3 (Estimation and inference with many endogenous regressors). In this course,
we limit ourselves to the case where the number pd of endogenous regressors is fixed. How-
ever, several recent papers Gautier and Tsybakov (2011), Gautier and Tsybakov (2013)
and Belloni et al. (2017b) consider inference of an high dimensional parameter τ0 with a
high dimensional nuisance parameter.
This goes beyond this course, but this can be useful is the following situations:

- Economic theory is not explicit enough about which variables belong to the true
model. Here searching for the good “small” set of potentially endogenous variables
to put into the outcome equation may not be possible.

- Many nonlinear functions of an endogenous regressor: namely when the outcome


equation is of the form
p d
0
X
Y = τ0,k fk (D) + X β0 + ε,
k=1

d
where {fk }pk=1 is a family of function (ex: basis) that capture nonlinearities.

54
4.4 Simulation Study
Similarly to the simulation exercise of the previous section, this illustrates two points:

1. the naive post-selection estimator suffers from a large regularization bias;

2. the cross-fitting estimator trades off a large bias for a smaller MSE compared to
the immunized estimator that uses the whole sample.

DGP. We use a DGP close to the one in Chernozhukov et al. (2015): namely i.i.d
observations (Yi , Di , Zi , Xi )ni=1 from

0
Yi =τ0 Di + Xi β0 + 2εi
0 0
Di =Xi γ0 + Zi δ0 + Ui

Zi =ΠXi + αζi ,

where α = 0.125, 0.25, 0.5 and


    
εi 1 0.6 0 0
 ui    0.6 1 0 0  
 ζi  ∼ N
  0,  
  0 0 Ipz 0 
xi 0 0 0 Σ

where

- Σ is a px × px matrix with Σkj = (0.5)|j−k| and Ipz the pz × pz identity matrix.

- The number of controls is set to 200, the number of instruments to 150, the number
of observations to 202.

- The most interesting part of the DGP is the form of the coefficients β0 , γ0 , and δ0 :
(
1/4, j < 4
β0j = ,
0, elsewhere

γ0 = β0 , and δ0j = 3/j 2 . We are in an approximately sparse setting for both


equations.
 
- Π = Ipz , 0pz ×(px −pz ) and τ0 = 1.5

We compare three estimators:

55
1. An “oracle” estimator, where the coefficients of the nuisance parameters are known,
0
and we run standard IV regression of Yi − E [Yi |Xi ] on Di − E [Di |Xi ] using ζi δ0 as
instruments;

2. A naive non-orthogonal estimator, where we use Lasso regression of D on (X, Z)


to obtain the identities of the controls and instruments that enter the instrumental
D
equation: IX = {j : δ̂j 6= 0}, IZD = {j : δ̂j 6= 0}. We run Lasso regression of
Y on X to obtain the identities of the controls that enter the outcome equation:
Y
IX = {j : δ̂j 6= 0}. We then run 2SLS estimator of Y on D and the selected controls
D Y
and instruments IX ∪ IX and IZD .

3. A double-selection estimator based on the Lasso as described in the previous section;

Table 4.1: Estimation of τ0

Estimator:
Naive Post-Selec Double Selec Oracle
(1) (2) (3)
Bias 0.04 0.01 0.00
Root MSE 0.36 0.39 0.06
MAD 0.24 0.25 0.04
Parameters are set to: n = 202, px = 200, px = 150, K = 3,
Note:
τ0 = 1.5

Figure 4.1: Distribution of τ̂ − τ0

Note: See table above.

56
4.5 Empirical Applications
4.5.1 Logit Demand

We briefly introduce the logit demand model in the context where we only observe market
share data (see the seminal papers by Berry et al. (1995), Berry (1994) and Nevo (2001),
and the datasets provided in the Github). The model describes demand for a product in
the “characteristic space”, namely a product can be characterized by a number of features
(for a car: efficiency, type of oil, power, ect) and consumers value those characteristics.
The consumer can choose among J products and maximizes his utility of consuming this
good. Individual random utility for choosing good j ∈ {0, . . . , J} is modeled as

0
uij = Xj β0 − τ0 Pj + ζj + εij , (εij , ζj ) ⊥ Xj

and εij ∼ F (·) = exp(− exp(−·)) type I extreme value.


Question: Show that this yields the expression for the choice probabilities

exp (δj )
Pij = PJ , δj = XjT β0 − τ0 Pj + ζj .
1 + k=1 exp (δk )

Moreover, the econometrician does not observe individual choices, but only market
shares of product j: sjt = Qjt /Mt at period t, where Mt is the total number of households
in the market, and Qjt the number choosing the product j in period t. This yields
0 
exp Xjt β0 − τ0 Pjt + ζjt
sjt = ,
1 + Jk=1 Xkt β0 − τ0 Pkt + ζkt
P 0

thus, using sj /s0 and assuming that market shares are non zero, we get

0
log(sj ) − log(s0 ) = Xjt β0 − τ0 Pjt + ζjt . (4.21)

However, price may be correlated with unobserved component ζjt such that OLS would
lead to an estimate of τ0 which is biased towards zero. We use the instrumental equation:

0 0
Pjt = Zjt δ0 + Xjt γ0 + ujt . (4.22)

Here, controls include a constant and several covariates. In Berry et al. (1995), they
suggest to use the so-called “BLP instruments” namely characteristics of other products,
which may satisfy an exclusion restriction: for any j 0 6= j and t0 , as well as any function of

57
those characteristics. The justification is that, if a product is close in the “characteristics
space” to its competitors, it may impact the markups, then the price (however, one
should prefer cost based instruments, rarely available). Thus, we are left with a very-
high dimensional set of potential instruments for Pjt .
Originally, Berry et al. (1995) solve this problem of dimension taking sums of product
characteristics formed by summing over products excluding product j
 
X X
Zk,jt =  Xj 0 ,jt , Xj 0 ,jt  ,
j 0 6=j,j 0 ∈If j 0 6=j,j 0 ∈I
/ f

where If is the set of products produced by firm f .


But tools developed in the previous section allow to consider wider possibilities. Cher-
nozhukov et al. (2015) apply these techniques to revisit results from Berry et al. (1995).
We apply the same tools to a dataset (semifrabricated) from Nevo (2001) on the ready
to eat cereal industry (see dataset cerealps3.csv).
Table 4.2 presents results using the set of constructed instruments (labelled “z1-z20”),
and in “Augmented 2SLS Selection”, quadratics and cubics in all these instruments. The
identity of the controls and instruments selected in the “augmented” set reveals that
these are important nonlinearities missed by the baseline set of variables. Moreover, the
selection method give more plausible estimates with respect to the important quantities
of the model that are price elasticities:

∂sj Pk −τ0 Pj (1 − Sj ) if j = k
=
∂Pk sj τ0 Pk sk otherwise

Not to mention the classical problems with those specific forms (own-price elasticities
quasi proportional to prices, symmetry of cross price elasticity with respect to products),
facing inelastic demand is inconsistent with profit maximizing price choice in this frame-
work, thus theory would predict that demand should be elastic for all products, which is
not the case of estimates without selection in Table 4.2. Estimators with selection give
in that sense much more plausible estimates.

4.5.2 Instrument Selection to estimate returns to schooling

We now replicate the analysis of the returns to schooling done in Card (1993), and see how
results are changed if we enlarge the set of possible instruments. David Card considers

58
Table 4.2: Estimation of τ0
Price Coefficient Standard Error Number Inelastic
Estimates Without Selection
Baseline OLS -9.63 0.84 586.00
Baseline 2SLS -9.48 0.87 990.00
2SLS Estimates With “Double Selection”
Baseline 2SLS Selection -11.29 0.93 224.00
Augmented 2SLS Selection -11.44 0.91 212.00

first the instrumental variable model:

0
Y =τ0 D + X β0 + ε, ε⊥X
0 0
D =Z δ0 + X δ0 + u, u ⊥ (Z, X)

where Yi is the weekly log wage of individual i, Di denotes education (in years), Xi is
a vector of controls, possibly high dimensional, and Zi denotes a vector of instrumental
variables for education.
In this example, the instruments are the two indicator for whether a subject grew up
near a two-year college or a four-year college. He also proposes to use IQ as instruments
for the Knowledge of the World of Work (KWW) test scores, which could be added as
control. The control variables Xi are: age and work experience at the time of the survey,
subject’s father’s and mother’s years of education, indicator for family situation at the
subject age 14 (whether subject lived with both mother and father, single mom, step-
parent), 9 indicators for living region, a dummy for living in a Standard Metropolitan
Statistical Area (SMSA) and another for living in the south, an indicator for whether
subject’s race is black, marital status at the time of the survey, indicator for whether
subject had library card at age 14, KWW test scores and interactions. Two options are
possible without using selection: either using those four instruments, or interacting those
instruments with their interactions with the controls, leading to 48 instruments. In this
second case, we have to correct for the use of many instruments with the Fuller (1977)
estimator, implemented in the R package ivmodel.
Results presented in Table 4.3, using the code shows that:

- The Post-Lasso selects 5 among all the potential 64 instruments, and so does provide
some pertinent selection without prior knowledge.

59
- Comparison of standard errors in the Lasso case with the Fuller estimates with 64
instruments shows a small improvement.

Table 4.3: Estimation of τ0


Nb. of IV Estimate Std. Er. Fuller Estimate Fuller Std. Er.
OLS 0.074 0.0053
2SLS 3 0.114 0.0160 0.115 0.0163
2SLS 64 0.094 0.0135 0.103 0.0176
DML, Post-Lasso 64 0.105 0.0170
Table 4.4: Note: Number of observations 1619 and number of covariates 17.

Finally, we also refer to the github of the course where we replicate the application
of the instrument selection techniques developed in the previous sections to the Angrist
and Krueger (1991) dataset from Belloni et al. (2010). The dataset NEW7080.dta can be
found at https://economics.mit.edu/faculty/angrist/data1/data/angkru1991.

60
Chapter 5

Further Developments: Variable


Selection with Non-Gaussian Errors,
Sample-Splitting and Panel Data

5.1 High-Quality Nuisance Estimation with Non-Gaussian


Errors
We have already described the properties of the LASSO, assuming that the error term
is Gaussian. Belloni et al. (2012a) relaxed this assumption while also describing how
to choose parameter λ in a more general case. We now give insights about this choice.
Consider the selection equation:
0
Di = Zi δ0 + εi , E [εi |Zi ] = 0 (5.1)

and consider the LASSO estimator


n
1 X 0
2 λ
δ̂ ∈ argmin Di − Zi δ + Γ̂δ , (5.2)

δ∈Rp n i=1 n 1

where
p
X
Γ̂δ = Γ̂j δj .

1
j=1

The penalty Γ̂ ∈ Mp,p (R) is an estimator of the “ideal” penalty loadings Γ̂0 = diag(γ̂10 , . . . , γ̂p0 ),
qP
n
where γ̂j0 = 2 2
i=1 Zi,j εi /n. This is an “ideal” penalty loadings as εi is not observed.

In practice:

1. We set λ = 2c nΦ−1 (1 − 0.1/(2p log(p ∨ n))), where c = 1.1 and Φ−1 (·) denotes
the inverse of the standard normal cumulative distribution function.

61
2. We estimate the ideal loadings 1) using “conservative” penalty loadings and 2)
plugging in the resulting estimated residuals in place of εi to obtain the refined
loadings.
 
2
Assumption 5.1 (Moment conditions). Assume that (i) maxj=1,...,p E [Di2 ]+E Di2 Zj,i +
 2 2  3 3
1/ E Zj,i εi ≤ K1 (ii) maxj=1,...,p E Zj,i εi ≤ K2 , where K1 , K2 < ∞.

Under theses moment conditions, Theorem 5.1 below gives rates of convergence for
Lasso under non-Gaussian and heteroscedastic errors, which relaxes the assumptions
made in Theorem 2.1. Of course this is more realistic in most applications.

Theorem 5.1 (Rates for Lasso Under Non-Gaussian and Heteroscedastic Errors, Theo-
rem 1 in Belloni et al. (2012a)). Consider model (5.1), the sparsity assumption |δ0 |0 ≤ s,
assumptions 2.4 and 5.1. Take ε > 0, there exist C1 and C2 such that the LASSO es-

timator defined in (5.2) with the tuning parameter λ = 2c nΦ−1 (1 − α/(2p)), where
α → 0, log(1/α) ≤ c1 log(max(p, n)), c1 > 0, and with asymptotically valid penalty load-
ings lΓ̂0 ≤ Γ̂ ≤ uΓ̂0 where l →P 1, u →P 1 satisfies, with probability 1 − ε
r
C1 s2 log(max(p, n))
δ̂ − δ0 ≤ 2 (5.3)

1 κC n
n
1 X 0 0
2 C2 s log(max(p, n))
Zi δ̂ − Zi δ0 ≤ , (5.4)
n i=1 κC n
 
1 Pn 0
where κC := κC Zi Zi and
n i=1
kγ̂ 0 k∞
 
uc + 1
C= .
k1/γ̂ 0 k∞ lc − 1
Intuition for Theorem 5.1: Regularizing event and concentration inequality.
The proof of the LASSO with Gaussian errors, Step 2 in Theorem 2.1, is based on the
fact that with high probability 1 − α we have the following “regularizing event”
( )
1 X n λ
n
max εi Xij ≤ .

j=1,...,p n 4
i=1

To ensures this, we used the Markov inequality, conditional on X1 , . . . , Xn and the fol-
lowing concentration inequality (see Lemma 2.6) which for p gaussian random variables
ξj ∼ N (0, σj2 ) ensures that
 
p
E max |ξj | ≤ max σj 2 log(2p).
j=1,...,p j=1,...,p

62
This shows how to choose λ. In the general case of the LASSO with non Gaussian and
heteroscedastic errors, to choose λ and the penalty loadings γ 0 , we use the same ideas.
We ensure that we have the regularizing event with high probability
 Pn √ 
i=1 Zi,j εi / n λ
max ≤ √ (5.5)
j=1,...,p γ̂j0 2c n

using the following concentration inequality applied to Ui,j := Zi,j εi . This ensures that
there exists finite constant A > 0 such that

 
Pn    
i=1 Ui,j / n −1 α A
P  max qP ≤Φ 1− ≥1−α 1+ (5.6)
j=1,...,p n 2
2p ln
i=1 Ui,j /n

where ln → ∞. This comes from moderate deviation theorems for self-normalized sums
(see Lemma 5 in Belloni et al. (2012a) and Belloni et al. (2018)). The idea is that, if the
loadings Γ̂0 are chosen so that the term
Pn √
i=1 Zi,j εi / n
γ̂j0
behaves like a standard normal random variable, then we could get the desired condition

(5.5) taking λ/(2c n) large enough to dominate the maximum of p standard normal
random variables with high probability. Belloni et al. (2012a) show that taking (γj0 )2 =
Var (Zi,j εi ) allows to fulfill this idea, even if the εi are not i.i.d Gaussian. This yields
(5.6). Then, the Lemma 5.1 below ensures that on this regularizing event, we have the
desired inequalities.

Lemma 5.1 (Lemma 6 in Belloni et al. (2012a)). Consider model (5.1), the sparsity
assumption |δ0 |0 ≤ s, assumptions 2.4 and 5.1. If the penalty dominates the score in the
sense that n
γ̂j0 λ 1 X
≥ max 2c Zi,j εi

n j≤p n
i=1

or equivalently (5.5), then, we have
 √
v
u n  2 
u1 X 0 0 1 λ s
t Zi δ̂ − Zi δ0 ≤ u + (5.7)
n i=1 c nκc0
 

0
 (1 + c0 ) 1 λs
Γ̂ δ̂ − δ0 ≤ u+ , (5.8)

1 κc0 c nκc0
where c0 = (uc + 1)/(lc − 1).

63
Pn 0 0 2
Proof of Lemma 5.1. Denote by L(δ) = i=1 Di − Zi δ /n. Because δ̂ is solution of
the minimisation program, we have
  λ  
L δ̂ − L (δ0 ) ≤ Γ̂δ0 − Γ̂δ̂ . (5.9)

n 1 1

Then, expanding the quadratic function L (·) we have


 0
2  0 0
2
Di − Zi δ̂ = Di − Zi δ − Zi (δ̂ − δ)
 0
2
= εi − Zi (δ̂ − δ)
0
 0 2
2
= εi − 2εi Zi (δ̂ − δ) + Zi (δ̂ − δ) .
 −1 P
n
Thus, using Holder inequality and S := 2 Γ̂0 i=1 Zi εi /n for the first inequality
n 
n
  1 X 0
 2 2 X 0
 
L δ̂ − L (δ0 ) − Zi δ̂ − δ0 = εi Zi δ̂ − δ0

n i=1 n
i=1

 
≤ kSk∞ Γ̂0 δ̂ − δ0 .

1

Thus, together with (5.9) and λ/n ≥ c kSk∞ we obtain



1 X n   2
0
Zi δ̂ − δ0

n


i=1
λ    
≤ Γ̂δ0 − Γ̂δ̂ + kSk∞ Γ̂0 δ̂ − δ0

n  1
1
  1
λ    
0

≤ Γ̂ δ̂ − δ0 − Γ̂ δ̂ − δ0 c + kSk∞ Γ̂ δ̂ − δ0

n S0 S 1
   1   0 1 
1 λ 0 1 λ 0 
≤ u+ Γ̂ δ̂ − δ0 − l− Γ̂ δ̂ − δ0 c . (5.10)
c n S0 1
c n S0 1
   
0 0
Then, Assumption 2.4 yields Γ̂ δ̂ − δ0 c ≤ c0 Γ̂ δ̂ − δ0

. By definition of
S0 1 S0 1
 s
0  1 Pn  0  2
κc0 we obtain κc0 Γ̂ δ̂ − δ0 S0 ≤ n i=1 Zi δ̂ − δ0
and with the Cauchy-

 2
≥ Γ̂0 δ̂ − δ0 /√s thus
0   
Schwarz inequality Γ̂ δ̂ − δ 0
S0 2 S0 1

√ u
v
 n 
0 
Γ̂ δ̂ − δ0 ≤
s u 1 X 0
 2
Zi δ̂ − δ0
t
κc0 n i=1

S0 1

which, in (5.10), yields (5.7).


The second statement of the lemma follows by
   
0 0
Γ̂ δ̂ − δ0 ≤ (1 + c0 ) Γ̂ δ̂ − δ

0

1 S0

1

64
√ X n 

s 1 0
 2
≤ (1 + c0 ) Zi δ̂ − δ0

κc0 n i=1

and the result follows applying (5.7). 

Elements for the proof of Theorem 5.1. The proof is based on these three steps:
First, the fact that for this choice of λ, using Lemma 5 in Belloni et al. (2012a), we have
as α → 0 and n → ∞

1 Pn
 
 √ √n i=1 Zi,j εi



P 2c n

0
> λ = o(1).
γj


Thus for n large enough and α small enough we can consider the regularising event,

1 Pn

 √

n i=1 Zi,j εi
 

 λ 
E := ≤ 2c√n 


 γj0 
 

which occurs with probability greater that 1 − .


Second, using the fact that
n
!
1 1X 0
κc0 ≥ 0 κ(kγ̂ 0 k /k1/γ̂ 0 k )c0 Zi Zi > 0.
kγ̂ k∞ ∞ ∞ n i=1

Third, on E we use Lemma 5.1, with λ = 2c nΦ−1 (1 − α/(2p)), and using that there
p
exists C3 such that Φ−1 (1 − α/(2p)) ≤ C3 log(p/α) we obtain

 √
v
u n  2 
u1 X 0 0 1λ s
t Zi δ̂ − Zi δ0 ≤ u +
n i=1 cnκc0
   r
1 2c −1 α s
≤ u+ Φ 1−
c κc0 2p n
  r
1 2cC3 s log(p/α)
≤ u+
c κc0 n

and
 

0
 (1 + c0 ) 1 λs
Γ̂ δ̂ − δ0 ≤ u+

1 κc0 c nκc0
   
(1 + c0 ) 1 −1 α 2cs
≤ u+ Φ 1− √
κc0 c 2p nκc0

65
  p
(1 + c0 ) 1 2cC3 s log(p/α)
≤ u+ √
κc0 c κc0 n
(5.11)

which yields the result. 

5.2 Sample-Splitting to Relax the Growth Rate As-


sumption
In this section, we analyze how to use sample splitting to relax the growth rate assumption
4. We replace it by the weaker condition
s log(max(p, n))
→ 0. (5.12)
n
Similarly to Belloni et al. (2012a), we restrict ourselves to the two samples case, but this
easily generalizes to the K-samples case.
Denote by a and b the two samples of sizes na = bn/2c and nb = n − na , j c = {a, b} \ j
for j ∈ {a, b}, and define the following sample splitting estimator
"n nb
#−1
X a X
Γ1 Wia , η̂ b + Γ1 Wib , η̂ a
 
τ̌ =
i=1 i=1
na
! nb
! !
X X
Γ1 Wia , η̂ b
Wib , η̂ a τ̌b
 
× τ̌a + Γ1 (5.13)
i=1 i=1

based on
" nj
#−1 nj
1 X c 1 X c
τ̌j = Γ1 Wij , η̂ j Γ2 Wij , η̂ j for j ∈ {a, b}.
nj i=1 nj i=1
This estimator combines two estimators of the treatment effect based on each sample,
where each one uses a preliminary estimator of the nuisance function based on the other
sample only.

Theorem 5.2 (Asymptotic Normality of the Split-Sample Immunized Estimator, The-


orem 7 in Belloni et al. (2012a)). The immunized estimator τ̌ defined by (5.13) in the
affine-quadratic model (2.10) under Assumptions 2.8, the growth condition (5.12) and
using a first-stage nuisance estimator satisfying Assumption 2.9 is such that:

n (τ̌ − τ0 ) → N (0, σΓ2 ),

with σΓ2 := E[ψ(Wi , τ0 , η0 )2 ]/E[Γ1 (Wi , η0 )]2 .

66
Proof of Theorem 5.2. The proof mainly consists of modifying the proof in Theorem
n c
j
2.2 to use independence between (εi )i=1 and ηbj for j ∈ {a, b}.

Step 1: analysis of τ̌j . Take j ∈ {a, b}.


Pnj c
Like in Theorem 2.2, the growth condition (5.12) suffices to get n−1 Γ1 Wij , η̂ j

j i=1 →
EΓ1 (Wi , η0 ). Then, we have to show that under the weaker condition (5.12), we still have
nj
1 X c
√ ψ Wij , τ0 , η̂ j → N (0, Var[ψ(Wi , τ0 , η0 )]).
nj i=1

We use that E εj |Xij , ζij = 0 and that {εji , 1 ≤ i ≤ nj } are independent from the sample
 

j c , to get

c
E ψ Wij , τ0 , η̂ j − ψ Wij , τ0 , η0
 

c
= E E ψ Wij , τ0 , η̂ j − ψ(Wij , τ0 , η0 )|Xij , ζij , j c
  
h    j 0 j c 0 c 0 c
i
= E E εj |Xij , ζij Zi (δ̂ − δ0 ) + Xij (γ̂ j − γ0 ) − Xij (ν̂ j − ν0 ) = 0

c c
Then, using the Chebyshev inequality, the fact that η̂ j − η j are independent of {εji , 1 ≤
i ≤ nj } by independence of the two subsamples j and j c , and that {εji , 1 ≤ i ≤ nj } have
conditional variance on Xij , ζij bounded from above by K, we obtain


√ nj !
nX
c
ψ Wij , τ0 , η̂ j − ψ Wij , τ0 , η0 > ε
 
P
nj
i=1
√ nj 2 
1  n X j jc
 j

≤ 2E ψ Wi , τ0 , η̂ − ψ Wi , τ0 , η0 

ε nj
i=1

nj nj
" #
1 n X  j 0  j c 
j 0 jc
 j
0 j c 2 X j 2
≤ 2E 2 Zi δ̂ − δ0 + (Xi ) γ̂ − γ0 − Xi ν̂ − ν0 (εi )
ε nj i=1 i=1
nj  nj
" " ##
1 n X j
 0 
jc

j 0 jc
 j 0 jc
 2 X
j 2 j j c
≤ 2E E 2 Zi δ̂ − δ0 + (Xi ) γ̂ − γ0 − (Xi ) ν̂ − ν0 (εi ) Xi , ζi , j
ε nj i=1 i=1

nj  nj
" # " #
1 n X j 0

jc

j 0 jc
 j 0 jc
  2 X j 2

j j c
≤ 2E 2 (Zi ) δ̂ − δ0 + (Xi ) γ̂ − γ0 − (Xi ) ν̂ − ν0 E (εi ) Xi , ζi , j
ε nj i=1 i=1

nj
" #
K n X  j 0  jc 
j 0 jc
 j 0 jc
2
≤ 2E 2 (Zi ) δ̂ − δ0 + (Xi ) γ̂ − γ0 − (Xi ) ν̂ − ν0
ε nj i=1
K n C2 s log(max(p, nj ))
≤ .
ε2 nj κC nj

67
Using Theorem 5.1, we obtain that
" nj
#−1 nj
√ 1 X j 1 X
ψ Wij , τ0 , η0 + oP (1).
 
nj (τ̌j − τ0 ) = Γ1 Wi , η0 √ (5.14)
nj i=1 nj i=1
Step 2: on the aggregated estimator τ̌ . Putting the two results together, we get

the asymptotic representation of n (τ̌ − τ0 )
" n nb
#−1
√ 1X a
1 X
n (τ̌ − τ0 ) = Γ1 (Wia , η0 ) + Γ1 (Wib , η0 )
n i=1 n i=1
na
! nb
! !
1 X √ 1 X √
× Γ1 (Wia , η0 ) n(τ̌a − τ0 ) + Γ1 (Wib , η0 ) n(τ̌b − τ0 ) + oP (1)
n i=1 n i=1
" n #−1 na nb
!
1X 1 X 1 X
= Γ1 (Wi , η0 ) √ ψ(Wia , τ0 , η0 ) + √ ψ(Wib , τ0 , η0 ) + oP (1),
n i=1 n i=1 n i=1
which concludes the proof. 

5.3 Regularization and Selection of Instruments in


Panels
In this section, we briefly show following Belloni et al. (2016) how to use the LASSO and
the regularization procedure of Section 4.3 when observations are identically distributed
among individuals but correlated across time. We refer to Belloni et al. (2016) for more
details. In the following, results holds for n → ∞ and fixed T (more realistic in a macroe-
conomic context) and n → ∞, T → ∞ joint asymptotics.

Consider the following panel data model

Yit =τ0 Dit + ei + εit (5.15)


0
Dit =Zit δ0 + fi + uit , (5.16)

where E [εit uit ] 6= 0 but E [εit |Zi1 , . . . , ZiT ] = E [uit |Zi1 , . . . , ZiT ] = 0, where we have
a high dimensional number pz of instruments Zit satisfying pz  nT the number of
individuals and time periods observed. For simplicity we do not consider cases of fixed
or high dimensional number of controls, but the ideas of the double selection of Section
4.2 can be directly extended here. We use the classical “within” transformation
T
1X
Ÿit = Yit − Yit
T t=1

68
(and respectively Z̈it and ε̈it the “within” transformation of Zit and εit ) to partial out
the fixed effect in both equations, which reduces the model to

Ÿit =τ0 D̈it + ε̈it (5.17)


0
D̈it =Z̈it δ0 + üit . (5.18)

We then use the sparsity assumption kδ0 k0 ≤ s and the following Cluster-Lasso regression
of D̈it on Z̈it to estimate δ0 , and use to estimate τ0 the orthogonal moment condition
" T
#
1 X  0
 0
E Ÿit − τ0 D̈it − Z̈it δ0 Z̈it δ0 = 0
T i=1

thus the estimator


" n T
#−1 n T
1 XX   1 XX  
τ̌ = Γ1 D̈it , Z̈it , δ̂ Γ2 Ÿit , Z̈it , δ̂ ,
nT i=1 t=1 nT i=1 t=1

where
 0
 0
Γ1 (D̈it , Z̈it , δ) = − D̈it − Z̈it δ Z̈it δ
0
Γ2 (Ÿit , Z̈it , δ) = − Ÿit Z̈it δ.
h  i
0
Note that the moment condition E Ÿit − τ0 D̈it Z̈it δ0 satisfies also (ORT) because
these are no controls here.

The Cluster Lasso, Intuition. We consider the regression (5.18). The Cluster-Lasso
coefficient estimate is based on
n T pZ
1 XX 0
2 λ X
δ̂ ∈ arg min D̈it − Z̈it δ + γ̂j |δj | . (5.19)
δ∈Rpz nT i=1 i=1 nT j=1

Similarly to Section 4.3, the penalty loadings γ


bj are chosen such that the “regularisation
event”
n X T
λφ̂j 1
X
≥ 2c Z̈itj ε̈it

nT nT i=1 t=1

happens with high probability. Using like in (5.6) moderate deviations theorems from the
self-normalized theory (see Lemma 5 in Belloni et al. (2012a)) we have that for α → 0

69
PT
as n → ∞, the variables Uij := t=1 Z̈itj ε̈it /T , which are independent random variables
across i with mean zero, satisfy (if finite third-order moments),

 
Pn  
i=1 Uij / n
P  max qP > Φ−1 1 − α  = o(1),
1≤j≤pz n 2
2p
i=1 Uij /n

which yields the choice of the “ideal” penalty loadings are


n T T
2 1 XXX
γ̂j0 = Z̈itj Z̈it0 j ε̈it ε̈it0 .
nT i=1 t=1 t0 =1
P 
T
Because ε̈it is unknown, we start with a conservative penalty (an estimator of Var i=1 Z̈itj D̈it /T )

then plug-in the estimated ε̈it and iterate. Similarly to Section 4.3, we take λ = 2c nT Φ−1 (1−
α/(2pz )).

Define hP i
T 2 2
E Z̈
t=1 itj itε̈ /T
iZT = T min  2  .
1≤j≤p PT
E t=1 Z̈itj ε̈it /T

Belloni et al. (2016) call the quantity iZT the “index of information”, in the sense that is
inversely related to the strength of within-individual dependence and can vary between
iZT = 1 (perfect dependence within cluster i) and izT = T (perfect independence within i).
Theorem 5.3 show that, through this quantity, this dependence impacts the rates.
n op
, where Σ̈jk = ni=1 Tt=1 Z̈itj Z̈itk /(nT ).
P P
Define the Empirical Gram matrix Σ̈ = Σ̈jk
j,k=1

Theorem 5.3 (Cluster-Lasso convergence rates in panels, Theorem 1 in Belloni et al.


(2016)). Take ε > 0. Let {(Dit , Zit } be an i.i.d sample accross i for which n, T → ∞
jointly. Assume that s = o(niZT ), s log(max(p, nT )) = o(niZT ), and the other Regularity
Conditions (RE) and the sparse eigenvalue condition SE p12 in Belloni et al. (2016) based

on Σ̈. Consider a feasible Cluster-Lasso estimator δ̂ with penalty level λ = 2c nT Φ−1 (1−
α/(2pZ )) and loadings {γ̂j }pj=1
z
, lγ̂j0 ≤ γ̂j ≤ uγ̂j0 where l →P 1, u →P 1. Then, there exist
C1 and C2 such that with probability 1 − ε,
s
s log(max(p, nT ))
δ0 − δ̂ ≤ C1

2 niZT
s
s2 log(max(p, nT ))
δ0 − δ̂ ≤ C2 ,

1 niZT

70
Note that in the above theorem, the effective sample size niZT is intuitively related to
the time dependence structure: when observations are totally independent across time
(iZT = T ), the size is nT whereas if observations are perfectly dependent (iZT = 1), the
size is n. Theorem 5.4 is an extension of Theorem 5.1, and the proofs share the same key
elements.

Theorem 5.4 (Asymptotic normality of the Cluster-Lasso estimator for treatment effect
in panels, Theorem 2 in Belloni et al. (2016)). Assume that conditions of Theorem 5.3
hold, that the growth condition s2 log(max(p, nT ))2 /(niD
T ) = o(1) holds. Assume that the

moments conditions given in SMIV p16 in Belloni et al. (2016). Then the IV estimator
τ̌ satisfies
q
−1/2 d
niD
TV (τ̌ − τ0 ) → N (0, 1),

where   2 
P T
E ψ Ÿit , D̈it , Z̈it , δ0
t=1 /T
V := iD
T hP   i2 .
T
TE t=1 Γ1 D̈it , Z̈it , δ0 /T

Codes for the cluster LASSO is given in clusterlasso.R (with slight modifications of
the code rlasso.R from the package hdm, to implement the clustered penalty loadings).
We refer to Belloni et al. (2016) for simulations and application to gun control.

Application of IV Variable Selection in Panel Data. We consider an application


to the economic model of crime using data from Baltagi (2008) who replicates Cornwell
and Trumbull (1994). Data is direclty accessible as the “Crime” dataset in the plm R
package (using the command data(”Crime”, package = ”plm”)) and consists of a panel
data of 90 counties in North Carolina over the period 1981–87. All variables are in logs
except for the regional and time dummies. The main explanatory variables consist of the
probability of arrest (which is measured by the ratio of arrests to offenses), probability
of conviction given arrest (which is measured by the ratio of convictions to arrests),
probability of a prison sentence given a conviction (measured by the proportion of total
convictions resulting in prison sentences), average prison sentence in days as a proxy for
sanction severity, the number of police per capita as a measure of the county’s ability
to detect crime, and the population density (which is the county population divided by

71
county land area) (see Baltagi (2008) for a full data description). To handle the potential
endogenity of the number of police per capita, we use the same instruments as Cornwell
and Trumbull (1994), namely offense mix (ratio of crimes involving face-to-face contact to
those that do not) and per capita tax revenue (again, see Baltagi (2008) for motivations).
The variable selection method introduced in this chapter allow to solve part of the trade-
off that the researcher otherwise faces: including many covariates to include all the
potential confounders while not lowering estimation precision. To illustrate this point,
we consider equations (5.15)-(5.16) with: the same set of controls (16) and instruments
(2) as in Cornwell and Trumbull (1994) and Baltagi (2008) or a “large set” of controls
(i.e. including interactions and polynomial transformations up to order 2, which yields
544 control variables) and IV (98). The idea is that one might not be sure of the exact
identity of the controls that enter the equation. Table 5.1 focuses on the effect of number
of police per capita on crime rates. The estimate for the Cluster LASSO is roughly equal
to the one of the within estimator with few controls and IV (first column). However
the Cluster-LASSO estimator does not require apriori selection (and it selects different
controls and IV than in the one included in the baseline). The within “large” estimates
appears to be biased as the number of controls is in out case close to the number of
observations.
Table 5.1: Economics of Crime Estimates using Cluster Post-Double Selection

Estimator of effect of the number of police per capita


Within Within “large” LASSO Cluster
(1) (2) (3)
Estimate 0.477∗∗∗ 0.306∗∗∗ 0.459∗
Sd. error 0.168 0.054 0.200

72
Chapter 6

The Synthetic Control Method

The synthetic control method has been viewed as “the most important innovation in
the policy evaluation literature in the last 15 years” by Athey and Imbens (2017) and
its popularity in applied work keeps growing with applications in fields ranging from the
link between taxation and migration of football players (Kleven et al., 2013), immigration
(Bohn et al., 2014), health policy (Hackmann et al., 2015); minimum wage (Allegretto
et al., 2013), regional policies (Gobillon and Magnac, 2016); prostitution laws (Cunning-
ham and Shah, 2017), financial value of connections to policy-makers (Acemoglu et al.,
2016), and many more.
Sometimes considered as an alternative to difference-in-differences when only aggre-
gate data are available (Angrist and Pischke, 2009, Section 5.2), the synthetic controls
method offers a data-driven procedure to select a comparison unit, called a “synthetic
unit”, in comparative case studies. The synthetic unit is constructed as a convex com-
bination of control units that best reproduces the treated unit during the pre-treatment
period. In consequence, some units in the control group (also referred to as the “donor
pool”) will be assigned a weight of zero. In contrast, the difference-in-differences estima-
tor would take any control unit and give it a weight of 1/n0 where n0 is the control group
size. This remark will be detailed below but it hints at the flexibility that the synthetic
controls method offers by providing a data-driven way of weighting each control unit.
While the link to the difference-in-differences approach is direct, the synthetic controls
method is also related to matching estimators (Abadie and Cattaneo, 2018, Section 4, for a
short introduction) because solving the synthetic control program amounts to minimizing
a type of matching discrepancy. For more on that subject, see Abadie and L’Hour (2019).

73
References. The original papers that developed the method are Abadie and Gardeaz-
abal (2003); Abadie et al. (2010, 2015). Abadie et al. (2010), where the authors study the
effect of a large-scale tobacco control program in California, is the most iconic. Abadie
(2019) describes the methodology when applying the synthetic control method. Doud-
chenko and Imbens (2016) makes the connection between synthetic control, difference-in-
differences, regression and balancing. We should also mention a YouTube video on the
topic by Alberto Abadie: https://youtu.be/2jzL0DZfr_Y, and R, STATA and Matlab
packages, available at http://web.stanford.edu/~jhain/synthpage.html.
A very good textbook treatment of causal inference methods is given in Imbens and
Rubin (2015). A clear and concise presentation of the main tools of the field is given
in Abadie and Cattaneo (2018) which we strongly encourage you to read. The research
frontier is described in Athey and Imbens (2017). Angrist and Pischke (2010) reviews
progress in empirical economic research. A competing causal framework to Rubin (1974)
is known as Directed Acyclic Graphs (DAG) and is developed in Pearl (2000). Wasserman
(2010) has an introductory chapter on DAG.

6.1 Setting and Estimation

The synthetic controls method makes explicit use of the panel data framework. The
presentation is inspired by Doudchenko and Imbens (2016). We observe n0 + 1 units in
periods t = 1, ..., T . Unit 1 is treated starting from period T0 + 1 onward, while units
2 to n0 + 1 are never treated. Units 2 to n0 + 1 form what is called the “donor pool”
because these units may or may not be selected to take part in the synthetic unit (see
Remark 6.5). Let Yi,t (0) the potential outcome for unit i at time t if it is not treated and
Yi,t (1) the potential outcome if it is exposed to the intervention. We observe exposure to
treatment (Di,t ) and realized outcome Yi,tobs defined by:

Yi,t (0) if Di,t = 0
Yi,tobs = Yi,t (Di,t ) =
Yi,t (1) if Di,t = 1

The quantity of interest is the intervention effect on unit 1 from dates T0 + 1 to T :

τt := Y1,t (1) − Y1,t (0), t = T0 + 1, ..., T

74
Remark 6.1 (A note on the dimension). Most of the applied papers that use the synthetic
control method are dealing with long panel data where T is relatively large or proportional
to n0 , and there are at most a dozen treated units. For example, in Abadie et al. (2010),
T = 40, n0 = 38 and only one treated unit; in Acemoglu et al. (2016), T ≈ 300, n0 = 513
and a dozen treated units. This is in sharp contrast with standard panel data or repeated
cross-section settings where n0 is very large while T ranges from two to to a dozen. In
most applications where the method is used, a “unit” is a city, a region or even a country,
hence the limited sample size. T0 is necessarily larger than 1 but it is usually located
after the middle of the period, i.e. T0 > (T − 1)/2, so as to have a long pre-treatment
period that allows to “train” the synthetic unit (see Theorem 6.1 for a justification).
This particular setting implies that the asymptotic framework where the number of units
grows tends to infinity less relevant.

We observe the following matrix:


···
 
Y1,T (1) Y2,T (0) Yn0 +1,T (0)
.. .. ..
. . .
 
 
Y (1) Y 2,T0 +1 (0) · · · Yn0 +1,T0 +1 (0) 
 
Y obs := Yi,tobs =  1,T0 +1

.

t=T,...,1
i=1,...,n0 +1  Y1,T0 (0) Y2,T0 (0) · · · Yn0 +1,T0 (0) 
 .
.. .. .. 
 . . 
Y1,1 (0) Y2,1 (0) · · · Yn0 +1,1 (0)
Thus, we have the following missing variable problem which is the Fundamental Problem
of Causal Inference (Holland, 1986):
···
 
? Y2,T (0) YN +1,T (0)
.
.. .. ..
. .
 
 
? Y2,T0 +1 (0) · · · Yn0 +1,T0 +1 (0) 
 
Y (0) :=  .

 Y1,T0 (0) Y2,T0 (0) · · · Yn0 +1,T0 (0) 
 .. .. .. 
 . . . 
Y1,1 (0) Y2,1 (0) ··· Yn0 +1,1 (0)
The synthetic controls method aims to recover the T − T0 − 1 missing elements by re-
weighting the n0 observed elements at the end of each line and produce a counterfactual:
Y2,T (0) · · · Yn0 +1,T (0)
 
?
.. .. ..
. . .
 
 
? Y2,T0 +1 (0) · · · Yn0 +1,T0 +1 (0) 
 
Y (0) =  .

 Y1,T0 (0) Y2,T0 (0) · · · Yn0 +1,T0 (0) 
 .. .. .. 
 . . . 
Y1,1 (0) Y2,1 (0) · · · Yn0 +1,1 (0)

75
Let Xtreat be the p × 1 vector of pre-intervention characteristics for the treated unit.
Let Xc be the p × n0 matrix containing the same variables for control units. In most
applications, the p pre-intervention characteristics will only contain pre-treatment out-
comes (in which case p = T0 ) but one might want to add other predictors of the outcome
observed during the pre-treatment period that may or may not be time invariant, we
collect them in a vector Zi such that for the treated unit:
 obs 
Y1,1
 Y obs 
 1,2 
Xtreat :=  ...  .
 
(p×1)  obs 

 Y1,T 0
Z1
Xc is defined similarly. For some p × p symmetric and positive semidefinite matrix V ,

we adopt the notation kXkV = X 0 V X. Consider ω = (ω2 , . . . , ωn0 +1 ) a vector of n0
parameters verifying the following constraints:

ωi ≥ 0, i = 2, ..., n0 + 1, (NON-NEGATIVITY)
X
ωi = 1. (ADDING-UP)
i≥2

These constraints prevent interpolation outside of the support of the data, i.e. the coun-
terfactuel cannot take a value larger than the maximal value or smaller than the minimal
value observed for a control unit. The synthetic control solution ω ∗ solves the program:

min kXtreat − Xc ωk2V , (SYNTH)


ω

subject to (NON-NEGATIVITY) and (ADDING-UP).The synthetic unit is a projection


of the treated unit onto the convex hull defined by the control units. The synthetic
control estimator is then defined as the difference between the observed outcome for the
treated and the synthetic outcome:
nX
0 +1
obs
τbt := Y1,t − ωi∗ Yi,tobs .
i=2

As an aside, the difference-in-differences estimator would give:


n0 +1
!
1 X
τbtDID := Y1,t
obs
− obs
Y1,T + Y obs − Yi,T
obs
,
0
n0 i=2 i,t 0

weighting equally each member of the donor pool.

76
Remark 6.2 (A note on the choice of Xtreat and Xc ). They should contain pre-treatment
variables that are good predictors of the outcome of interest. In the Mariel Boatlift
example of Card (1990) where the outcomes of interest are wages and unemployment, it
is aggregate demographic indicators (gender, race, age), education levels, median income,
GDP per capita. Due to the time series nature of the problem, including pre-treatment
outcomes is strongly advised by Theorem 6.1. e.g. 1975-1979 unemployment rates, as
it is a way to create a control unit that verifies the Common Trend Assumption (CTA).
Note that the synthetic control method implicitly uses the Conditional Independence
Assumption (CIA).

Remark 6.3 (A note on the choice of V ). V is a diagonal matrix with each element along
the diagonal reflecting prior knowledge from the researcher about the importance of each
variable for the intervention under study. The synthetic control program (SYNTH) writes
in this case: #2
p
" nX
X 0 +1

arg min vj,j Xtreat,j − ωi Xc,ij .


ω
j=1 i=2

ω ∗ depends on the choice of V , so we use the notation ω ∗ (V ). Abadie et al. (2010)


proposes setting v1 , · · · , vp using a nested minimization of the Mean Square Prediction
Error (MSPE) over pre-treatment period:

T0
" nX
#2
X 0 +1

M SP E(V ) := obs
Y1,t − ωi∗ (V )Yi,tobs .
t=1 i=2

We can also use a form of cross-validation (Abadie and L’Hour (2019)). To simplify the
exposition and because it is the most natural choice, we will assume that the validation
period is at the end of the pre-intervention period, although other choices are possible.
The procedure is as follows:

1. Split the pre-intervention period that contains T0 dates into T0 − k initial training
dates and k subsequent validation dates.

2. For each validation period, t ∈ {T0 − k, . . . , T0 }, compute


nX
0 +1

τbt (V ) = obs
Y1,t − ωi∗ (V )Yi,tobs ,
i=2

77
where ωi∗ (V ) solves (SYNTH) with X measured in the training period {1, . . . , T0 −
k − 1}.

3. Choose V to minimize the mean squared prediction error over the validation period,
T0
1 X
MSPE(V ) = τbt (V )2 .
k t=T −k
0

Notice that over the validation period, the computed treatment effect must be zero.

6.2 A Result on the Bias


Why does it work? This section details the result given by Abadie et al. (2010) on the bias
of the estimator. Suppose the outcome under no-treatment is given by a factor model:

Yi,t (0) = δt + Zi0 θt + λ0t µi + εi,t ,

where δt is a time fixed-effect, θt is a vector of time-varying parameters, Zi are observed


covariates, λt are unobserved common factors of dimension F , µi are unobserved fac-
tors loadings (dimension F ) and εi,t are unobserved transitory shocks. In case you are
unfamiliar with factor models, think of the vector λt as the underlying macroeconomic
dynamics that affect differently each unit through µi . Instead of taking this into account
by using several macroeconomic variables, one might want to capture them with a small
number of factors, similarly to what is done in a principal component analysis.

Assumption 6.1 (IID Transitory Shocks). (εi,t )i=1,...,n0 +1,t=1,...,T are iid, across both i and
t, random variables with mean zero and variance σ 2 . Moreover, for some even integer
m > 2, E|εi,t |m < ∞.

Assumption 6.2 (Perfect Synthetic Match). The matching discrepancy is zero:

kXtreat − Xc ω ∗ k2V = 0.

The synthetic unit reproduces perfectly the treated unit.

It is a crucial point to prove the next theorem. This is not an abnormal case in many
applications because of over-fitting n0 > p. However, the curse of dimensionality entails

78
that as p grows, this assumption is less and less likely to be verified. See the precise
discussion is given in Ferman and Pinto (2016). Let ξ(M ) be the smallest eigenvalue of:
T0
1 X
λt λ0t .
M t=T0 −M +1

Denote by λP the T0 × F matrix with the t-th row equal to λ0t and assume:

Assumption 6.3 (Nonsingularity of Factor Matrix). ξ(M ) ≥ cξ > 0 for any positive
0
integer M . As a consequence, λP λP is non-singular. Moreover, assume |λt |∞ ≤ λ̄, for
1 ≤ t ≤ T.

Theorem 6.1 (Bias of the Synthetic Controls Estimator). Under Assumptions 6.1, 6.2
and 6.3, for t ∈ {T0 + 1, . . . , T }:

|Eb
τt − τt | → 0.
T0 →+∞

Remark 6.4. It shows that the bias of the synthetic controls estimator goes to zero as
the number of pre-treatment period increases. It says nothing about, for example, the `1
or `2 -consistency, especially because in the proof below E(|R3,t |) = E(|ε1,t − ε2,t |) does not
decrease with T0 . Indeed, we only observe one treated unit, hence there is a non-vanishing
variance.

Proof of Theorem 6.1. Using the factor specification, for any t = 1, ..., T :
" nX
#
0 +1

τbt = Y1,t (1) − Y1,t (0) + Y1,t (0) − ωi∗ Yi,t (0)
i=2
" nX
# " #0 " # " #
0 +1 nX
0 +1 nX
0 +1 nX
0 +1

= τ t + δt 1 − ωi∗ + Z1 − ωi∗ Zi θt + λ0t µ1 − ωi∗ µi + ε1,t − ωi∗ εi,t


i=2 i=2 i=2 i=2
" nX
# " #
0 +1 nX
0 +1

= τt + λ0t µ1 − ωi∗ µi + ε1,t − ωi∗ εi,t , (6.1)


i=2 i=2

where the last line comes from (ADDING-UP) and perfect matching of the synthetic unit,
Assumption 6.2. Now, consider the pre-treatment outcomes written in matrix notations.
YiP is the T0 × 1 vector of pre-treatment outcomes for unit i with t-th element equal to
Yi,tobs . εP
i is the T0 × 1 vector of pre-treatment transitory shocks. Notice that because

Y1P = (Y1,t (0))t=1,...,T0 , following the same steps as above:


nX
" # " #
0 +1 nX
0 +1 nX
0 +1

Y1P − ωi∗ YiP = λP µ1 − ωi∗ µi + εP


1 − ωi∗ εP
i . (6.2)
i=2 i=2 i=2

79
From equation (6.2), using Assumption 6.3:
" nX
# " # " #
0 +1  −1 nX0 +1  −1 nX
0 +1
0 0 0 0
µ1 − ωi∗ µi = λP λP λP Y1P − ωi∗ YiP − λP λP λ P εP 1 − ωi∗ εP
i .
i=2 i=2 i=2
(6.3)
Equation (6.3) above helps understanding the nature of the synthetic controls method-
ology: the quality of the approximation of the factor loadings of the treated, µ1 , by the
synthetic unit depends on the distance between the treated pre-treatment outcomes and
the synthetic unit’s. This observation advocates for including the pre-treatment outcomes
in the (SYNTH) program, and constitutes the crucial point of the theorem. Furthermore,
because Assumption 6.2 holds, the first term in equation (6.3) vanishes, so we have a nice
decomposition of the bias for t > T0 by plugging equation (6.3) inside equation (6.1):
nX
" #
 −1 0 +1  −1 nX0 +1
0 0 P0 P 0
τbt − τt = λ0t λP λP λP ωi∗ εP 0
i − λt λ λ λ P εP
1 + ε1,t − ωi∗ εi,t .
i=2 | {z } i=2
| {z } :=R2,t | {z }
:=R1,t :=R3,t

When t > T0 , R2,t and R3,t have mean zero thanks to Assumption 6.1. This is not the
case for R1,t because there is no reason to think that εi,t and ωi∗ are independent for t ≤ T0
since ωi∗ depends on Y1P , . . . , YnP0 +1 , and as consequence on εP P
1 , . . . , εn0 +1 . Rewrite:

nX
!−1
0 +1  −1 nX
0 +1 T0 T0
P0 P 0
X X
R1,t = ωi∗ λ0t λ λ λP εP
i = ωi∗ λ0t λt λ0t λs εi,s .
i=2 i=2 s=1 t=1
P −1
T0
By Cauchy-Schwarz inequality, since t=1 λt λ0t is symmetric and positive-definite,
and using Assumption 6.3:

T0
!−1 2  T0
!−1 
T0
!−1 
2
F λ̄2
X X X 
λ0t λt λ0t λs  ≤ λ0t λt λ0t λt  λ0s λt λ0t λs  ≤ .
t=1 t=1 t=1
T0 cξ

PT0 P −1
0 T0
Define ε̃i := s=1 λt t=1 λt λ0t λs εi,s . Using Assumption 6.1 and Holder’s inequality:

nX
!1/m !1/m
0 +1 nX
0 +1 nX
0 +1

|R1,t | ≤ ωi∗ |ε̃i | ≤ ωi∗ |ε̃i |m ≤ |ε̃i |m .


i=2 i=2 i=2

Applying again Holder’s inequality:


nX
! "n +1 #!1/m
0 +1 X0

E ωi∗ |ε̃i | ≤ E |ε̃i |m . (6.4)


i=2 i=2

80
And by Rosenthal’s inequality (Lemma 6.2), with some constant C(m) defined in the
statement of the inequality:
 !m/2 
 2
m T0 T0
F λ̄ X X
E|ε̃i |m ≤ C(m) max  E|εj,t |m , E|εj,t |2 .
T0 cξ t=1 t=1

From the equation above and (6.4), and using Assumption 6.1:
!
F λ̄2 (E|εi,t |m )1/m
 
1/m σ
E|R1,t | ≤ C(m)1/m n0 max 1−1/m
,√ .
cξ T0 T0

From the decomposition above and using Jensen’s inequality:


!
F λ̄2 (E|εi,t |m )1/m
 
1/m 1/m σ
|Eb
τt − τt | ≤ E|R1,t | ≤ C(m) n0 max 1−1/m
,√ .
cξ T0 T0

6.3 The Impact of Election Day Registration on Voter


Turnout
We revisit Xu (2017) which analyzes the impact of Election Day Registration (EDR)
on voter turnout in the United States. In most US states, eligible voters must register
on a separate day before casting their votes, which entails an extra cost of voting and
has been perceived as a cause of low turnout rates. With the objective of raising voter
turnout, EDR laws were first implemented in Maine, Minnesota and Wisconsin in 1976;
Idaho, New Hampshire and Wyoming followed suit in 1996; finally, Montana, Iowa and
Connecticut adopted the legislation as well in 2012. We refer the reader to the original
article for further details on the policy. The dataset, available in the R package gsynth
(Xu and Liu, 2018), comprises turnout rates measured during 24 elections (from 1920 to
2012), for 47 US states among which 9 are treated (i.e. adopted the EDR) and 38 are
non-treated. Since the adoption of the treatment was staggered, we consider the nine
treated states together with treatment starting in 1976.
Figure 6.1 represents voter turnout rates in the nine treated states (solid black line)
and in the 38 other states that did not implement any such reform (dashed purple line).
It is obvious that the CTA does not hold. The dashed red line, is a simple average of

81
Figure 6.1: Voter Turnout in the US and EDR Laws

Treated
Pen. Synthetic Control
.95 Confidence Interval
Other States
80
70
Turnout %

60
50
40

1920 1940 1960 1980 2000

Note: The .95 confidence intervals are computed by inverting Fisher Tests. 10,000 permutations are used.
The dashed purple line is the average turnout per election for the 38 nontreated States.

82
the nine synthetic treated states, computed using the synthetic control method state-by-
state. The variables taken into account are the pre-treatment outcomes, that is to say
all the turnout rates in the presidential elections from 1920 to 1972. It is interesting to
have a look at how many untreated states receive a positive weight in the synthetic units.
For the first wave, 6 to 8 non-zero untreated units per synthetic unit receive a positive
weight, while 3 to 4 non-zero untreated units per synthetic unit for the second and 2
untreated units per synthetic unit for the third.
The synthetic unit closely reproduces the behavior of the treated states turnout rates
before the treatment. The treatment effect is given by the difference between the solid
black line and the dashed red line. Using the methodology we will see in the next section,
the impact is positive and significant at 5% for every post-treatment election, as the black
line is outside of the confidence bands.

Remark 6.5 (Sparsity of the synthetic control solution). In most cases, n0 , the number
of control units, is larger than p, the number of pre-treatment characteristics. As a
consequence of this observation and of the constrained optimization problem (SYNTH)
s.t. (NON-NEGATIVITY) and (ADDING-UP), the solution found often happens to be
sparse, i.e. kω ∗ k0 << n0 . It is the case in this example. Theorem 1 in Abadie and L’Hour
(2019) shows that under mild regularity conditions kω ∗ k0 ≤ p + 1. We also show that
a necessary condition for ωi∗ > 0 to happen, is that control unit i is connected to the
treated unit in a particular tessellation of the cloud of points defined by the columns of
(Xtreat , Xc ), called the Delaunay triangulation.

6.4 When and Why Using the Synthetic Control Method

When is using the synthetic control method a good idea? Abadie (2019) gives a few
guidelines that helps ruling out the use of the synthetic control method. First of all, the
treatment effect should be large enough to be distinguishable from the volatility in the
outcome. Notice that an increasing volatility in the outcome also increases the risk of over-
fitting. In cases where the volatility in the outcome is too large, it can be a good idea to
filter the time series beforehand. Second, a donor pool should be available and comprised
of units that have not undergone the same treatment or any other idiosyncratic shock that

83
may bias the results. They should also be similar enough in terms of characteristics to
the treated so as to warrant a comparison. Third, there should be no anticipation about
the policy from the agents. The effect of the policy may take place before its formal
implementation if forward-looking agents react in anticipation, when it is announced.
In that case, it is possible to back-date the intervention. Fourth, no spillover effects.
If spillovers from a policy are substantial and effect units that are geographically close,
selecting them as part of the donor pool may not be a good idea. Fifth, the “convex
hull condition”. The synthetic unit is a convex combination of the units belonging to
the donor pool. Hence, the synthetic outcome lies within the convex hull of the donor
pool’s outcomes. Once constructed, the researcher should check that the synthetic unit
characteristics are close enough to that of the treated unit. In some cases, the treated
unit is so peculiar that a credible counterfactual may not be constructed. Sixth, there
should be enough post-intervention dates so as to detect an effect that may take time to
appear.
When these conditions are met, the synthetic control method offers a few advantages.
The first is the absence of extrapolation: the counterfactual cannot take a value larger
than the maximum or lower than the minimum value observed in the donor pool at each
date. Synthetic control estimators preclude extrapolation, because the weights are non-
negative and sum to one. The second is the transparency of the fit. Echoing the “convex
hull condition” from the previous paragraph, it can be checked whether the treated can
or cannot be reproduced by looking at the discrepancy kXtreat − Xc ω ∗ k2V . Third, it offers
a safeguard against specification search. Indeed, only the pre-treatment characteristics
are needed to compute the weights, so they can be computed even before the treatment
takes place so as to not indulge in specification search to obtain desired results. Fourth,
the sparsity of the solution facilitate the qualitative interpretation of the counterfactual.

6.5 Inference Using Permutation Tests

Notice that when defining the quantity of interest, τt , we have not used any expectation
as is done e.g. in Abadie et al. (2010). This is because the synthetic controls method is
developed in a framework where we observe not a random sample of individuals from a

84
super-population but aggregate data. Hence uncertainty does not come from sampling (or
at least, it is negligible), but comes from treatment assignment. An illustrative example
of this point is Card (1990) which uses the Mariel Boatlift as a natural experiment to
measure the effect of a large and sudden influx of migrants on wages and employment
of less-skilled native workers in the Miami labor market. Between April and October
1980, around 125,000 Cubans ran away from Fidel Castro and sought asylum in Florida,
which increased Miami labor force by 7%. Card uses individual level data from the
Current Population Survey (CPS) for Miami and four comparison cities (Atlanta, Los
Angeles, Houston and Tampa-St. Petersburg) to run an analysis similar to difference-in-
differences. Here is a shortened version of one table from the paper:

Table 6.1: Difference-in-differences estimate on unemployment rates


Year
1979 1981 1981-1979

Miami 8.3 (1.7) 9.6 (1.8) 1.3 (2.5)


Comparison cities 10.3 (.8) 12.6 (.9) 2.3 (1.2)
Miami-Comparison -2.0 (1.9) -3.0 (2.0) -1.0 (2.8)
From Table 4 in Card (1990).
Note: African-American workers. Stan-
dard errors in parentheses.

Do we really believe that there is this much uncertainty around the unemployment
rate at the city level? Anecdotal evidence suggests the standard-error for the French
unemployment rate is close to .2. The moral of the story is that if the aggregate data
we are dealing with are expressed per capita, Yi,tobs is probably already an average over
a sufficiently large sample to apply the LLN. For example, Yi,tobs = Ūi,t ≈ E(Uk,it ) where
Uk,it is a Bernoulli variable equal to one when individual k is unemployed at time t in city
i. For more on this subject, see e.g. Abadie et al. (2018). However, this observation does
not necessarily rule out the use of the asymptotic framework, since we can always assume
that the “realized” unemployment rate of a city comes from a super-population anyway,
but it justifies the use of another type of inference in the synthetic control methodology.

85
6.5.1 Permutation Tests in a Simple Framework

For this paragraph, we leave the synthetic controls framework and introduce we is refered
to as Fisher Exact P-Values which are permutation tests (see also Chapter 5 in Imbens
and Rubin, 2015). Recall a simple one-period RCT framework where we observe an iid
sequence (Di , Yiobs )i=1,...,N with:

Yi (0) if Di = 0
Yiobs = Yi (Di ) =
Yi (1) if Di = 1

(Yi (0), Yi (1)) ⊥ Di . Denote the missing outcome by Yimis := Yi (1 − Di ). Fisher tests
allow to test the sharp null hypothesis of a constant treatment effect for everybody. The
sharp null hypothesis writes for a constant C:

H0 (C) : “Yi (1) = Yi (0) + C, i = 1, ..., N 00 .

This hypothesis is not to be confused with a constant treatment effect on average. It


appears to be much stronger and logically implies an assumption about a constant average
treatment effect. However, the paradox discussed in Ding (2014) shows that when the
null hypothesis does not hold, there are cases where the Neymanian hypothesis of no
treatment effect on average is rejected while Fisher’s sharp null hypothesis is not, which
contradicts the logical links between the two assumptions. This surprising result can be
explained by a the lack of power of Fisher tests.
Denote D obs := (D1 , . . . , DN ) the observed vector of assignments. Under H0 (C):
Yimis = Yiobs − (2Di − 1)C so the missing variable problem is solved by imputing the
constant treatment effect. In the case C = 0, it means that the treatment assignment
should not matter. Recall than a permutation π is a one-to-one mapping π : {1, . . . , N } →
{1, . . . , N }. We can compute any regular treatment effect statistics θ(D
b π ) under any per-
n o
mutation π ∈ Π where D π belongs to d ∈ {0, 1}N | N and N1 = N
P P
i=1 di = N1 i=1 Di .

And compute Fisher’s p-value defined as:

1 X nb b obs )
o
p(C) := 1 θ(D π ) ≥ θ(D
|Π| π∈Π

In practice, because Π can be very large, the distribution is approximated by Monte-


Carlo, using the procedure:

86
1. For b = 1, ..., B, draw a new permutation of the treatment assignment D b , compute
the statistics θ(D
b b ) using H0 (C).

2. Approximate Fisher’s p-value using:


B
!
1 X n
b obs )
b b ) ≥ θ(D
o
pb(C) := 1+ 1 θ(D .
B+1 b=1

3. Reject H0 (C) if pb(C) is below a pre-determined threshold: the observed treatment


allocation gives an effect which is abnormally large when compared to the random-
ized distribution. For α ∈ (0, 1), the test is:

ϕα = 1 {b
p(C) ≤ α} . (6.5)

Lemma 6.1 (Level of the Test). Suppose that D obs = (D1 , . . . , DN ) is exchangeable with
respect to Π under H0 (C). Then, for α ∈ (0, 1), the test in equation (6.5) is of level α,
i.e. under H0 (C):
P [p(C) ≤ α] ≤ α.

D obs = (D1 , ...DN ) is exchangeable with respect to Π if D1 , . . . , DN are iid random


variables. The proof is inspired by Chernozhukov et al. (2017).
Proof of Lemma 6.1
|Π|
Let {θb(i) }i=1 denote the non-decreasing rearrangement of {θ(D
b π ) : π ∈ Π}. We can

see that:
n o
b obs ) > θb(k) ,
1 {p(C) ≤ α} = 1 θ(D

for k = d(1 − α) × |Π|e. Because Π forms a group, the randomization quantiles are
invariant:
b π )(k) = θb(k) , for all π ∈ Π,
θ(D

therefore
X n o X n o
1 θ(D b π )(k) =
b π ) > θ(D b obs ) > θb(k) ≤ α|Π|.
1 θ(D
π∈Π π∈Π
n o n o
b obs ) > θb(k) has the same distribution as 1 θ(D
By exchangeability, 1 θ(D b π )(k)
b π ) > θ(D

for any π ∈ Π. So we have that:


1 X nb o h n oi
b obs ) > θb(k) = E[1 {p(C) ≤ α}].
b π )(k) = E 1 θ(D
α≥ 1 θ(D π ) > θ(D
|Π| π∈Π

87

Let us illustrate this intuition by a short simulation exercise. Let π := P(D = 1) = .2,
N = 200 and simulate Y | D ∼ N (τ0 D, 1) for τ0 ∈ {0, .75}. The statistics under scrutiny
is the absolute value of the difference between the ATE estimator and C:

1 X 1 X
θb = Yiobs − Yiobs − C .

N1
D =1
N − N1 D =0
i i

Here is the code:


t h e t a . i t e r = function (D, Y, Dstar ,C=0){
ChangeTreat = ( Dstar==1−D)
Ystar = Y
Ystar [ ChangeTreat ] = Y −(2∗Dstar −1)∗C
return ( abs (mean( Ystar [ Dstar ==1]) − mean( Ystar [ Dstar ==0] − C) ) )
}

N = 2 0 0 ; p i = . 2 ; B = 1 0 0 0 0 ; mu = 0 . 7 5

d = i f e l s e ( runif (N) < pi , 1 , 0 )


y = d∗mu + rnorm(N)

t h e t a . hat = abs (mean( y [ d==1]) − mean( y [ d==0]))


print ( t h e t a . hat ) # Theta

Dpermut = r e p l i c a t e (B, sample ( d ) ) # each column i s


# a random p e r m u t a t i o n o f d
t h e t a . r e s h u f f l e d = mapply ( function ( b )
t h e t a . i t e r ( d , y , Dpermut [ , b ] ,C=0) , b=1:B)
p . v a l 1 = ( 1 + sum( t h e t a . sim>t h e t a . obs ) ) / (B+1)

Figure 6.2 displays a case for which H0 (0) is false on the left-hand panel, and true on
the right-hand panel. In the first case, the observed statistics is located in the tail of the
distribution of the estimates computed under a random permutation. In that sense, the
observed effect is abnormally large. In the second case, the observed effect is in the belly
of the distribution, making it insignificant. P-values given below the chart quantify the
conclusion associated with Fisher tests.

6.5.2 Confidence Intervals by Inversion of Tests

From the tests presented in the previous section, we can construct confidence intervals by
exploiting the duality between confidence intervals and tests. The intuition to construct

88
Figure 6.2: A Very Simple Fisher Test: H0 : C = 0 false vs. H0 : C = 0 true
τ=.5 τ=0

2 2
density

density
1 1

0 0

0.0 0.3 0.6 −0.6 −0.3 0.0 0.3


Estimated ATE Estimated ATE

Note: This is the histogram over all simulated permutations. The solid purple line is the observed statistics.
On the left-hand-side, pb = 0.006 (θb = .671), on the right-hand-side, pb = 0.745 (θb = −0.50) for the sharp
null hypothesis of no treatment effect.

a confidence interval of level 1 − α is the following: perform the test of the hypothesis
H0 (C) at the level α ∈ (0, 1) and include in the confidence interval any value of C for
which H0 (C) is not rejected. More formally, the confidence interval will be defined as:

IC1−α = {C ∈ R | pb(C) > α} .

Denote by C0 the true treatment effect, we have that the probability that this interval
contains C0 is given by:

P [IC1−α 3 C0 ] = P [b
p(C0 ) > α] = 1 − P [b
p(C0 ) ≤ α] ≥ 1 − α,

using Lemma 6.1. As a consequence, IC1−α is indeed of level 1 − α.

Figure 6.3 shows the p-value as a function of C for both cases τ0 ∈ {0, .75}. In the first
case, τ0 = .75, the 95% confidence interval is approximately [.3, 2.7]. In the second case,
τ0 = .5, the 95% confidence interval is approximately [−.75, .55]. Notice that contrary to
asymptotic or Normal distribution-based confidence intervals, they are not symmetric.

89
Figure 6.3: p-value as a function of C

0.75 ● ●


0.50 ●
p−value




● ●
0.25 ●

● ●


● ●
● ●
● ●
● ●
● ●
● ●
● ●
● ●

● ●
● ● ●
● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ●
0.00 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

−1 0 1 2
C

Note: The blue curve is the first case, τ = .75, the red is the second τ = 0. The horizontal purple line is
located at y = .05.

6.5.3 Application to the Election Day Registration

We adapt the previous tests to the synthetic control framework where there are multiple
periods of time. We want to test the assumption:

H0 : “Yi,t (1) = Yi,t (0), t = T0 + 1, . . . , T , i = 1, . . . , n0 + 100 .

The procedure to perform inference is based on reassigning the treatment at random to


any nine states and compare it with the synthetic control estimator observed with the
original treatment assignment. Because we have multiple dates and multiple treated in
this example, we need to define a summary statistics. Abadie and L’Hour (2019) suggest
using the ratio between the aggregate mean square prediction error in a post-intervention
period T1 ⊆ {T0 + 1, . . . , T } and a pre-intervention period T0 ⊆ {1, . . . , T0 },

n1
!2 , n1
!2
X X X X
τbit τbit , (6.6)
t∈T1 i=1 t∈T0 i=1

Pn ∗ ∗
where τbit = Yi,t − j=n1 +1 ωi,j Yj,t , for i = 1, . . . , n1 and ωi,j is the weight given to the
untreated state j in the synthetic unit that reproduces treated state i. How much larger
is that ratio when the nine treated states are indeed treated compared to the same

90
ratio when we consider any othe nine states ? The p-values obtained from randomized
inference (B = 10, 000) using the MSPE ratio 5 × 10−3 . Figure 6.1 also displays the
pointwise confidence interval based on inverting the Fisher test of no effect for each date.

Testing Using Permutations with Respect to Time. An alternative way of testing


the hypothesis H0 : “τt = τt0 , t = T0 + 1, . . . , T 00 for some user-specified path (τt0 )t=T0 +1,...,T
relies on comparing the distribution of (b
ut )t=1,...,T before and after the treatment where
u
bt is defined as:

− ni=2
 obs
P 0 +1 ∗ obs
Y1,t ω Y if t ≤ T0 ,
u
bt = obs 0
Pni0 +1i,t ∗ obs
Y1,t − τt − i=2 ωi Yi,t if t ≥ T0 + 1.

For this purpose, consider the statistic:



T
1 X
u) = √
S(b u
bt ,

T − T0


t=T0 +1

under all the permutations π ∈ Π of {1, . . . , T } and compute the p-value:

1 X
pb = 1 {S(b
uπ ) ≥ S(b
u)} .
|Π| π∈Π

The validity of this approach is developed in Chernozhukov et al. (2017). It is computa-


tionally very appealing since the error terms are defined once and for all and don’t need
to be re-computed under each permutation.

6.6 Multiple Treated Units


So far, we have considered designs with a unique treated unit for which we used the
synthetic control method to create a counterfactual. How does this method extend to
designs with multiple treated units? Two potential solutions with their advantages and
benefits can be considered: (i) creating a synthetic unit for each treated and averaging
them or (ii) creating a synthetic unit for the average of the treated units.
In Abadie and L’Hour (2019), we consider the first solution which comes with its
potential set of problems to solve. In particular, considering many treated units increases
the chance that at least one of them falls into the convex hull defined by the untreated
units, and so that the synthetic control solution (SYNTH) is not unique. Abadie and

91
L’Hour (2019) introduce a penalty term in (SYNTH) and find a synthetic unit for each
treated (indexed by treat here). So if there are n1 treated, for each treat = 1, . . . , n1 , the
synthetic control weights solve:

nX
1 +n0

ωtreat (λ) = arg min kXtreat − Xc ωk2V +λ ωtreat,i kXtreat − Xi k2V (PENSYNTH)
i=n1 +1

subject to (NON-NEGATIVITY) and (ADDING-UP). λ > 0 is a tuning parameter


(think: Lasso). λ sets the trade-off between reproducing well the treated and the sum of
the pairwise distance between the treated and any single control unit. λ → 0 : pure syn-
thetic control, λ → ∞ : one-match nearest-neighbor matching. At the end, the difference
between the treated units and heir respective synthetic units is averaged:

n1
" nX
#
0 +n1
1 X
τbt (λ) := Y obs − ω∗ (λ)Yi,tobs .
n1 treat=1 treat,t i=n +1 treat,i
1

Theory and ways to choose λ are developed in Abadie and L’Hour (2019). The resulting
estimator also reduces the risk of large interpolation bias by not averaging units that are
very far away from each other.
On the other hand, Ben-Michael et al. (2019) propose a so-called partially pooled
synthetic control

n1
ν 1 X
ω1∗ (ν), . . . , ωn∗ 1 (ν) kXtreat − Xc ωtreat k2V

= arg min
2 n1 treat=1
p
" n1
#2
1−ν1X 1 X
+ Xtreat,j − Xc,j ωtreat ,
2 p j=1 n1 treat=1
(PART-POOLEDSYNTH)

subject to (NON-NEGATIVITY) and (ADDING-UP) for ν ∈ [0, 1]. This estimator


balances two objectives: reproducing well each treated individually (the first part, similar
to the vanilla synthetic control method) and reproducing well the average of the treated
for each characteristics (the second part). The authors argue that constructing a synthetic
unit for each treated unit separately and then averaging can result in a less-than-optimal
fit for the average of the treated unit and can lead to a potential bias. They show that
their estimator is a solution to that issue.

92
6.7 Conclusion and Extensions
The synthetic controls method:

– Uses aggregate data to construct a conterfactual of an aggregate treated unit (syn-


thetic unit),

– The synthetic unit is a convex combination of control units so as to match some


policy-invariant features and pre-intervention outcomes of the treated,

– The treatment effect (ATET) is estimated as the difference of the outcomes between
the treated and the synthetic unit,

– Inference is based on permutation tests (Fisher’s exact procedure).

93
6.8 Supplementary Application: The California To-
bacco Control Program
In January 1989, the state of California enforced Proposition 99, one of the first large-scale
tobacco control programs, by increasing cigarette tax by 25c/pack, funneling tax revenues
to anti-smoking budgets, etc. What was its effect on per capita tobacco consumption?
This is the iconic example of the synthetic controls method where the goal is to create a
synthetic California by reweighting states that did not change tobacco legislation. Xtreat
and Xc contain retail price of cigarettes, log income per capita, percentage of population
between 15-24, per capita beer consumption (all 1980-1988 averages), 1970 to 1974, 1980
and 1988 cigarette consumption. All details about this application are given in Abadie
et al. (2010). All the tables and figures in this section are taken directly from the article.

Figure 6.4: Proposition 99: California vs. Rest of the US

Source: Abadie et al. (2010).

Figure 6.4 is taken from Abadie et al. (2010) and represents cigarette consumption in
California and in 38 other states that did not implement any anti-tobacco reform. It is
obvious that the CTA does not hold. Figure 6.5 compares the same data for California
and its synthetic doppelganger which has not implemented the treatment. The synthetic
unit is composed of Utah (33.4%), Nevada (23.4%), Montana (19.9%), Colorado (16.4%)
and Connecticut (6.9%). And Table 6.6 compares the characteristics of California, its
synthetic unit and the 38 other states. The synthetic unit closely reproduces the behavior
of California tobacco consumption before the treatment. The treatment effect is given

94
Figure 6.5: Proposition 99: California vs. Synthetic California

Source: Abadie et al. (2010).

by the difference between the solid black line and the dashed line. Proposition 99 has
entailed a consumption decrease of an estimated 30 packs per capita in 2000.

Figure 6.6: Proposition 99: Balance check table

Source: Abadie et al. (2010).

We adapt the previous tests to the synthetic control framework. How much larger is
the RMSPE when California is treated compared to the RMSPE when another state is
treated ?

95
Figure 6.7: Treatment estimate for CA and 38 control states

Note: Some extreme states cannot be reproduced by a synthetic control. Source: Abadie et al. (2010).

Figure 6.8: Treatment estimate for CA and 34 control states

Note: The most extreme states have been discarded. Source: Abadie et al. (2010).

6.9 Auxiliary Mathematical Results


Lemma 6.2 (Rosenthal’s Inequality). Consider ξ1 , . . . , ξn , n independent random vari-
ables with mean zero, E|ξi |m < ∞ for some even m > 2 and let Sn = ni=1 ξi . Then:
P

 " n #m/2 
n
X X
E|Sn |m ≤ C(m) max  E|ξi |m , E|ξi |2 ,
i=1 i=1

96
Figure 6.9: Treatment estimate for CA and 19 control states

Note: Among States that are well reproduced, CA is the most extreme. p-value is equal to .05! (1/(19 + 1))
Source: Abadie et al. (2010).

Figure 6.10: Ratio post-treatment MSPE to pre-treatment MSPE

Source: Abadie et al. (2010).

where C(m) := E(X − 1)m where X ∼ P(1).

Proof of Lemma 6.2 See Ibragimov and Sharakhmetov (2002). 

97
6.10 Exercises and Questions
The purpose of this exercise is to show that regression adjustment to estimate treatment
effect is using a weighted mean of the control group outcomes to construct the counter-
factual. The setting is an iid sample of the vector (Yiobs , Di , Xi ) for i = 1, ..., n. Yiobs is the
outcome variable which value depends on whether the unit i is treated or not. If unit i is
treated (Di = 1), then Yiobs = Yi (1). If unit i is not treated (Di = 0), then Yiobs = Yi (0).
Xi is a vector of covariates of dimension p which includes an intercept. The index i is
dropped when unnecessary. Define π := P(D = 1). The quantity of interest is the ATET
defined by:

τ AT ET = E [Y (1) − Y (0)|D = 1] .

In many applications, the ATET is estimated by considering an estimand of the following


form:

τ AT ET = E Y obs |D = 1 − E W0 Y obs |D = 0 .
   

where W0 is random variable that depends on both X and D. Assume that the outcome
without the treatment follows a linear model: Y (0) = X 0 β0 +ε with (D, X) ⊥ ε and Eε =
0. Assume E ((1 − D)XX 0 ) is non-singular. The Oaxaca-Blinder procedure estimates the
ATET in two steps. The first one consists in estimating β0 . The second step estimates the
i = Yiobs −Xi0 β,
ATET as a simple mean of the residuals computed on the treated group as b b

for i such that Di = 1.

1. Show that E [Y (1) − Y (0)|D = 1] = E Y obs |D = 1 − E [X|D = 1]0 β0 .


 

2. Define β0 = arg min E (1 − D)(Y obs − X 0 β)2 . Express β0 as a function of some


 
β∈Rp
moments. What is its empirical counterpart ? What kind of regression would you
do to estimate β0 ?

3. Show that in this case W0 = E [X|D = 1]0 E [XX 0 |D = 0]−1 X.

4. Show that this weight satisfies:

E[X|D = 1] = E[W0 X|D = 0].

Interpret this condition.

98
5. From the previous questions, suggest an estimator of the ATET of the form:

1 X obs X
τbOB := Yi − ωi Yiobs
n1 i:D =1 i:D =0
i i

Give the expression of the weights in that case. Do the weights ωi sum to one?

6. Compare with the synthetic controls estimator.

99
100
Chapter 7

Machine Learning Methods for


Heterogeneous Treatment Effects

7.1 Motivation: Optimal Policy Learning

To motivate the analysis of heterogeneous treatment effects (TE), we consider the optimal
policy learning problem of Bhattacharya and Dupas (2012). This is only an introduction,
with a simplified set-up to the topic of optimal policy learning and we refer to Kitagawa
and Tetenov (2017a), Kitagawa and Tetenov (2017b), and Athey and Wager (2017) for a
more general case and precise results.
Assume the government has limited budget resources to subsidize health and educa-
tion. Thus, they want to determine a fixed allocation of treatment resources to a target
population, while maximizing the population mean outcome.
Let Y be the observed individual level outcome and D a binary treatment, which
should be determined by a specific decision rule. The researcher also have information X
about individuals which should lead the allocation decision. Following the classical set
up, we denote by Y = DY1 + (1 − D)Y0 , where Y0 and Y1 are the values of the potential
outcome Y without and with treatment. We make the uncounfoundeness Assumption
7.1, which states that all variation in D is random once the covariates X are fixed. In
particular, this prevents the case where those with the largest anticipated effects based
on some unobservables are more likely to be treated.

Assumption 7.1 (Selection on the observables/Unconfoundedness).

D ⊥ (Y0 , Y1 ) | X, (7.1)

101
Namely, we assume that the treatment status is independent of the potential out-
comes conditional on observed individual characteristics, like in a randomized control
trial (RCT). A simple model of the planner’s problem consists in determining a map
p : X → [0, 1], where X is the support of X, which denotes the probability that a
household with characteristics X is assigned to the treatment, such that the welfare is
maximized
Z Z
max (µ1 (x)p(x) + µ0 (x)(1 − p(x))) dFX (x) = (µ0 (x) + τ (x)p(x)) dFX (x)
p(·) x∈X x∈X
(7.2)
Z
s.c. c ≥ p(x)dFX (x),
x∈X

where τ (x) := E [Y1 − Y0 | X = x], µj (x) := E [Yj | X = x], j ∈ {0, 1}, and c is the
fraction of individuals that can be treated (which is assumed to be proportional to the
budget constraint). Under 7.1, the conditional average treatment effect (CATE) τ (·) can
be decomposed
τ (x) = µ1 (x) − µ0 (x).

Assume that (i) for some δ ≥ 0, we have P(τ (X) > δ) > c (the constraint is relevant)
and (ii) that Fτ (X) is increasing and τ (X) has bounded support with Lebesgue den-
sity bounded away from zero. Assumption (ii) yields that γ ∈ X 7→ P (τ (X) ≥ γ) is
decreasing, which yields that a solution for (7.2) takes the form
Z
p(X) = 1 {τ (X) ≥ γ} , γ s.t. c = p(x)dFX (x).
x∈X

Here, γ is the (1 − c) quantile of the random variable τ (X), which is unique when Fτ (X)
is increasing. This problem illustrates simply that the knowledge of the heterogeneous
treatment effect τ (·) would ideally allows one to assign the treatment in an efficient way.

In a broader perspective, it is thus of interest to identify the populations that benefit


the most/the least from the treatment, in order to maximize the social welfare. One
limitation that is not addressed in the above setting is the fact that there can be negative
externalities: the treated agents are going to benefit at the cost of untreated ones. For
example, Crépon et al. (2013) estimate these effects in the labor market.

102
We consider here a “plug-in” approach: estimating the CATE to find the optimal
policy. They are several alternatives and extensions. In particular, Kitagawa and Tetenov
(2017a), Kitagawa and Tetenov (2017b), and Athey and Wager (2017)) consider the
welfare maximization under the constraint that the policy should belong to a restricted
class with limited complexity, which is closer to what is needed in applications. They
show that one can estimate the optimal policy at a parametric rate given restriction on
the complexity of the class of policies whereas the CATE estimators does not reach this
rate.

Remark 7.1. Note that the dual formulation of this problem is also of interest: one
aims at estimating the minimum cost to achieve a given average outcome value in the
population
Z
min p(x)dFX (x) (7.3)
p(·) x∈X
Z
s.c. (µ1 (x)p(x) + µ0 (x)(1 − p(x))) dFX (x) = b,
x∈X

Remark 7.2. Note also that this set up can handle heterogeneous cost of treatment
across the population h(·) which yields to the following modified budget constraint
Z
c= h(x)p(x)dFX (x),
x∈X

and solution for the modified problem (7.2)


Z
p(X) = 1 {τ (X) − h(X) ≥ γ} , γ s.t. c = h(x)p(x)dFX (x).
x∈X

7.2 Heterogeneous Effects with Selection on Observ-


ables
In this section we use the same setting as above, maintaining the selection on the observ-
able assumption 7.1. We aim at identifying the subpopulations that are going to benefit
the most/least from the treatment in order to treat them in a more efficient way.
A first idea would be to do clustering of the population into groups based on observed
characteristics to perform the following test:

H0 : ∀x ∈ X τ (x) = τ0 , against H1 : ∃x ∈ X τ (x) 6= τ0 .

103
The problem here comes from the fact that the researcher (i) often does not know all
subgroups (i.e. the sets of interactions among covariates) that are of interest (ii) would
want to do several tests of this type to identify the subpopulation of interest. But suppose
that each test is conducted at level α, then the chance that at least one rejection occurs is
much higher that α. This limits the use of inference. Despite existence of corrections for
this multiple testing problem (see Bonferroni correction, that do not take into account
correlation between the events, or List et al. (2016) and Hsu (2017) for two different
types of less conservative strategies), they still require to specify the hypotheses of the
test, thus limiting the potential characterization of heterogeneous treatment effect.

One way to circumvent this problem is to use a nonparametric estimator for the
heterogeneous treatment effect. In particular, the use of a special kind of random forests,
developed in Section 7.2.1 allows to handle non-linear relationship between covariates
without a priori.

7.2.1 Nonparametric Estimation of the Treatment Effects

In our econometric context, we face the fundamental problem of causal inference, where
we never observe both Y0 and Y1 for the same individual. Thus, one can not directly
use machine learning methods to estimate the treatment effect.

Different Methods. To handle this potential outcome problem, under the selection
on the observables assumption, there is mainly two kind of methods:

1. Transformed outcome regression: using the transformation of the outcome variable


with the propensity score p(x) = E [D|X = x],

DY (1 − D)Y
H= − .
p(X) 1 − p(X)

Question: Show that τ (x) = E [H|X = x].

Thus, a nonparametric estimator using as outcome H is unbiased.

104
2. Conditional mean regression: the second type of methods uses the fact that under
the assumption of selection on the unobservables,

τ (x) = E [Y1 |X = x] − E [Y0 |X = x] = µ1 (x) − µ0 (x).

Thus an estimator of τ (·) using machine learning estimators of µ1 (·) and µ0 (·)
separately does not require to estimate the propensity score. The problem here is
that biases in the separate estimations of µ1 (·) and µ0 (·) can accumulate and lead
to important and unpredictable bias in the estimation of τ (·). In the next section,
we show how to estimate them jointly, using a criterion adapted to our object τ (·).

An estimator based on H often has a larger variance compared to a conditional mean


regression estimator due to the division by the propensity score (see the interesting ex-
ample in Section 3 of Powers et al. (2017)). This explains why we first concentrate on
the later type of estimator.

We now remind the main properties of random trees and forests, then describe causal
random forests.

One sample trees. We first describe how to grow a decision tree to estimate µ(x) =
E [Y |X = x] from an i.i.d sample (Wi )ni=1 := (Yi , X i )ni=1 using recursive partitioning. For
an more general introduction, please refer to the lecture notes of Statistical Learning by
Arnak Dalayan, to Hastie et al. (2009) Chapter 15 for a broader reference, or to Biau
and Scornet (2016) for a more advanced survey. A decision tree method produces an
adaptive weighting αi (x) to quantify the importance of the i-th training sample Wi for
understanding the test point x:
n
X 1 {X i ∈ L(x)}
µ̂(x) = αi (x)Yi , where αi (x) := , (7.4)
i=1
|{i : X i ∈ L(x)}|

where L(x) is the “leaf” the point x is falling into. In other words, (7.4) is a local average
of all the Yi corresponding to an X i falling “close” (in the same leaf) to x. The leaves
constitute a partition of the feature space X , that maximizes an overall (infeasible) seg-
mentation criterion. Given a set A ∈ X, each node of the tree partition the feature space
into two child nodes A1 , A2 . For a random-split tree, this is done in the following way:

105
(a) draw a covariate j ∈ {1, . . . , p} according to some distribution (b) Use a segmentation
test of the type X j ≥ s where s is chosen in order to maximize the heterogeneity between
the two child nodes A1 , A2 (see below). Thus, this recursive algorithm can be described
in the following way:

1. Initialization: initialize the list containing the cells associated withe the root of the
tree A = (X ) and the tree Af inal as an empty list.

2. Expansion: for each node A ∈ A:

IF A satisfies the stopping criterion (number of observation in the leaf less


that n0 ),

Remove A from the list A


Concatenate Af inal = Af inal + {A}.

ELSE choose randomly a coordinate in {1, . . . , p}, choose the best spit s in
the segmentation test and create the two child nodes by cutting N .
Then remove the parent node A and add the child nodes to the list A =
A − {A} + {child nodes A1 and A2 }.

This algorithm (similar to the original one by Breiman (2001)) is described on an example
with n0 = 2 in Figures 7.1 and 7.2. To compute the prediction, in a “one sample” decision
tree, we simply take the average of the outcomes Yi of the corresponding observations
that fall into the leaf L(x). We note that an additional regularization step can be added
to the above algorithm to cut leaves according to some cutting criterion (called pruning).
This is not necessary in our context, and was not used in the original formulation in
Breiman (2001).

More on the segmentation test. For classification (when Yi is binary), it is classical


to use the initial CART criterion (ClAssification and Regression Tree) which consists in
choosing s to maximize the gain in homogeneity

I(A1 , A2 ) = G(A) − qG(A1 ) − (1 − q)G(A2 ),


2 2
where q = NA1 /NA , and G(A) = 1 − Y A − 1 − Y A is the Gini index, that is a measure
of the homogeneity in the node. A good split generates two “heterogeneous” children

106
Figure 7.1: Decision tree algorithm: phases 1 and 2

Note: Example of random-spit tree with stopping criterion the minimal number of observation equal to 2 in
each leaf.

containing “homogenenous” observations (G(A) = 0 if Y A equals 0 or 1).


For regression, we consider a similar structure, where we maximise the decrease of the
variance (see Remark 7.3 below), which can be rewritten

I(A1 , A2 ) = E(A) − (qE(A1 ) + (1 − q)E(A2 )),


1 P 2
where E(A) = Yi − Y A . Thus the segmentation criterion consists finding
NA i,Xi ∈A
the splits that decrease the most the variance, thus minimising qE(A1 ) + (1 − q)E(A2 )
over all possible splits.

107
Figure 7.2: Decision tree algorithm: phases 3, 4 and evaluation

Note: Example of random-spit tree with stopping criterion the minimal number of observation k = 2 in each
leaf.

Double sample trees (or “honest” trees). Here, we follow here Athey and Imbens
(2016) and Athey and Wager (2017), and refer to these papers for more details. To make
causal inference, we should rely on the “honest” property of the tree. An honest tree does
not use the same sample to place the splits and to evaluate the value of the estimator on
the leaves.

We now study the properties of double sample trees, built in this way:

108
1. Draw a subsample of size s from {1, . . . , n} with replacement and divide it into 2
disjoint sets I, J with size |I| = bs/2c and |I| = ds/2e.

2. Grow tree via recursive partitioning, with splits chosen from the J sample (i.e
without using Y -observations from the I sample).

3. Estimate leaf responses using only the I sample.

Double sample random forests (bagging ) As a final step we aggregate trees trained
over all possible subsamples of size s of the training data:
 −1
n X
µ̂(x; Z 1 , . . . , Z n ) = Eξ∼Ξ [T (x; ξ, Z i1 , . . . , Z is )] , (7.5)
s
1≤i1 <···<is ≤n

where ξ summarizes the randomness   in the selection of the variable when growing the
n
tree, Z i := (Di , X i , Yi ), and is the number of permutation of s elements among
s
n. The estimator in equation (7.5) is evaluated using Monte Carlo methods: we draw
without replacement B samples of size s (Zi∗1 , . . . , Zi∗s ) and consider the approximation
(7.6) of (7.5)
B
1 X
µ̂(x; Z 1 , . . . , Z n ) ≈ T (x; ξb∗ , Z ∗b,1 , . . . , Z ∗b,s ), (7.6)
B b=1
where the base learner is
X
T (x; ξb∗ , Z ∗b,1 , . . . , Z ∗b,s ) = ∗
αi,b ∗
(x)Yi,b , (7.7)
i∈{ib,1 ,...,ib,s }
 ∗ ∗


1 X i,b ∈ Lb (x)
αi,b (x) = ,
|{i : X ∗i,b ∈ L∗b (x)}|
This aggregation strategy, called bagging, reduces the variance of the estimator of µ (see
e.g. Bühlmann et al. (2002) for an analysis). Note that the “honesty” property consists
∗ ∗
in making the weights αi,b (x) in (7.7) independent of Yi,b .

Bias and honesty of the regression random forest We still consider i.i.d observa-
tions (Yi , X i )ni=1 and show the consistency of an estimator µ̂(·) of µ(·) = E [Yi |X i = ·].

Definition 7.1 (Diameter of a leaf). The diameter of the leaf L(x) is the length of the
longest segment contained inside L(x), which we denote by Diam(L(x)).
The diameter of the leaf L(x) parallel to the j-th axis is the length of the longest segment
contained inside L(x) parallel to the j-th axis, which we denote by Diamj (L(x)).

109
To ensure consistency, we need to enforce that the leaves become small in all direc-
tions of the feature space X as n (thus s) gets large: Diam(L(x)) → 0 as s → ∞ (see
Lemma 7.1). To do so, we enforce randomness in the selection of variables at each step
(random-split tree). We need the following assumptions.

Assumption 7.2. - Random-split tree: the probability that the next split occurs
along the j-th feature is bounded from below by δ/p, 0 < δ ≤ 1.

- α−regular: from the I sample, each split leaves at least a fraction α of the obser-
vations of the training example on each side of the split.

- Minimum leaf size k: there are between k and 2k − 1 observations in each


terminal leaf of the tree.

- Honest tree: the samples used to build the nodes and to evaluate the estimator on
the leaves are different.

The minimum leaf size k is a regularisation parameter that has to be fixed by the
researcher. In practice she can use cross-validation to choose k. The following lemma is
key, but relies on a very strong assumption for the distribution of the covariates.

Lemma 7.1 (Control in probability of the leaf diameter in uniform random forests,
Lemma 2 in Wager and Athey (2017)). Let T satisfies Assumption 7.2 and X 1 , . . . , X s ∼
U([0, 1]p ) independently. Let 0 < η < 1, then for s large enough,
 −α1 δ/p !  −α2 δ/p
s s
P Diamj (L(x)) ≥ ≤ ,
2k − 1 2k − 1

where

α1 =0.99(1 − η) log((1 − α)−1 )/ log(α−1 )

α2 =η 2 /(2 log(α−1 )).

Proof of Lemma 7.1. Take 0 < η < 1 and denote by

- c(x) the number of splits leading to L(x).

- cj (x) the number of splits leading to L(x) along the j-th axis.

110
Then, using that T is α-regular, the minimal number of observations in L(x) is sαc(x) ,
α > 1, which is thus inferior or equal to 2k − 1. This yields

log (s/(2k − 1))


c(x) ≥ c0 := .
log(α−1 )

Using that T is a random-split tree, with large probability we have


 
δ
cj (x) ≥ Z, Z ∼ Binomial c0 ,
p

as the minimal total number of split that lead to L(x) is c0 and that at each of these
nodes, the probability to draw the j-th coordinate is bounded from below by δ/p. Then,
we use the multiplicative Chernoff bound


P (cj (x) ≤ (1 − η)µ0 ) ≤ e−η 0 /2
, (7.8)

where µ0 = E[Z] = δc0 /p. Finally, Wager and Walther (2015) show that the diameter
of the leaf along the axis j is related the number of observations in the leaf, when
covariates are uniformly distributed, L(x) via

Diam(L(x)) ≤ (1 − α)0.99cj (x) .

Combining this inequality with (7.8) yields the result. 

Finally, a key assumption to obtain consistency is that x 7→ µ(x) is Lipschitz contin-


uous. This limits the use of such random forests to smooth regression functions µj .

Lemma 7.2 (Consistency of the double sample random forests, Theorem 3 in Wager
and Athey (2017)). Consider T satisfying Assumption 7.2, x 7→ µ(x) that is Lipschitz
continuous, α ≤ 0.2, then the bias of the random forest at x ∈ X is bounded by

|E [µ̂(x)] − µ(x)| = O s−α3 δ/p ,




where α3 = log((1 − α)−1 )/(2 log(α−1 )).

Proof of Lemma 7.2. Start from the definition (7.5) which yields E [µ̂(x)] = E [T (x; Z i )].
Then, define µ̃(x) like µ̂(x) replacing αi (x) by

1 {X i ∈ L(x)}
α̃i (x) = ,
s|L(x)|

111
where |L(x)| is the Lebesgue measure of the leaf L(x). Using the honesty assumption
(i.e. that Y is independent of L(x)) for the third equality and that P (X i ∈ L(x)|L(x)) =
|L(x)| for the fourth equality, we have

E [µ̃(x)] − µ(x) =E [T (x; Z)] − E [Y |X = x]


  

X 1 {X i ∈ L(x)}
=E E  Yi L(x), X i ∈ L(x) − E [Y |X = x]
s|L(x)|
i∈{i1 ,...,is }
 
1 X 1 {X i ∈ L(x)}
=
s
E
|L(x)| L(x) E [E [Yi |X i ∈ L(x)]] − E [Y |X = x]
i∈{i1 ,...,is }

=E [E [Y |X ∈ L(x)] − E [Y |X = x]] .

Then, using that x 7→ E [Y |X = x] is Lipschitz continuous, there exists a constant C > 0,

|E [Y |X ∈ L(x)] − E [Y |X = x]| ≤ CDiam(L(x)). (7.9)


Then, using the fact that the diagonal length of an unit hyper-cube in dimension p is p
p
(  −α1 δ/p ) [ (  −α1 δ/p )
√ s s
Diam(L(x)) ≥ p ⊂ Diamj (L(x)) ≥
2k − 1 j=1
2k − 1

and using Lemma 7.1 yields


 −α1 δ/p ! X p  −α1 δ/p !
√ s s
P Diam(L(x)) ≥ p ≤ P Diamj (L(x)) ≥
2k − 1 j=1
2k − 1
 −α2 δ/p
s
≤p ,
2k − 1
p
We conclude taking η = log((1 − α)−1 ) ≤ 0.49 (thus 0.99(1 − η) ≥ 0.51, hence the 1/2
in α3 ) and using that E [µ̃(x) − µ̂(x)] = O n−1/2



Asymptotic normality of regression random forest is shown in Theorem 8 in Wager and


Athey (2017). Here, for simplicity we directly state the asymptotic normality for causal
forests (Theorem 11 in Wager and Athey (2017)). Causal forests are random forests for
treatment effect estimation, namely adapted to the potential outcome problem, where
the outcome of the regression is not directly observed.

Double sample causal trees. The algorithm for double sample causal trees is similar
to the algorithm of the double sample trees but splits are chosen (segmentation criterion)

112
to maximize the variance of
1 X
τ̂ (x) = µ̂1 (x) − µ̂0 (x) = Yi
|{i, Di = 1, X i ∈ L(x)}|
i: Di =1, X i ∈L(x)
1 X
− Yi . (7.10)
|{i : Di = 0, X i ∈ L(x)}|
i, Di =0, X i ∈L(x)

and Assumption 7.2 is replaced by

Assumption 7.3. - α−regular: each leaf L(·) leaves at least a fraction α of the
available training examples on each side of the split

- Minimum leaf size k: there are between k and 2k − 1 observations from each
treatment group in each terminal leaf of the tree (with Di = 1 or with Di = 0).

- Honest trees: the sample J used to place the splits is different from the sample
I used to evaluate the estimator through (7.10).

Without the honesty property, the treatment effect estimator would be based on splits
that lead to high treatment effect, but this probably means that the treatment effect in
those leaves is biased.

Remark 7.3 (On the segmentation criterion for Double Causal forests.). If the outcome
of the regression τi were observed and without split of the train sample S tr (like for the
regression CART algorithm), then the splits should minimize the empirical counterpart
of the squared-error loss

E (τi − τ̂ (X i ))2 = E τi2 − 2E [τ̂ (X i )τi ] + E τ̂ (X i )2


     

on the test sample S te which amounts to minimizing


1 X 2 
MSEτ̂ S te , S tr , T := te τi − τ̂ (X i ; S tr , T ) − τi2

|S | te
i∈S
2 X 1
= − te τi τ̂ (X i ; S tr , T ) + te τ̂ (X i ; S tr , T )2 (7.11)
|S | te |S |
i∈S

But, as τi is not directly observed Athey and Imbens (2016) use a criterion that mimics
what is done in the CART algorithm. Here, using that the estimators µ̂(X i ) are constant
on each leaf Lm by definition, we have, for x ∈ Lm
X X 1 X
µ̂(X i )2 = µ̂(X i ) Yk
i ∈S, i s.t. X ∈L i ∈S, i s.t. X ∈L
|Lm | k ∈S, k s.t. X k ∈Lm
i m i m

113
!
X 1 X
= µ̂(X i ) Yk .
k ∈S, k s.t. X k ∈Lm
|Lm | i ∈S, i s.t. X i ∈Lm
X
= µ̂(X k )Yk .
k ∈S, k s.t. X k ∈Lm

Thus, (7.11) yields that in the regression context we want to minimize the unbiased
estimator of MSEµ̂ (S te , S tr , T ) which is
1 X
MSEµ̂ S tr , S tr , T = − tr τ̂ (X i ; S tr , T )2 .

|S | tr
i∈S

In the treatment effect context, this leads Athey and Imbens (2016) to consider by analogy
the maximization of the feasible criterion
1 X
−MSEτ̂ S tr , S eval , T = τ̂ (X i ; S tr , T )2 ,

|S eval |
i∈S eval

where the train sample is split in an evaluation sample S eval and real train sample S tr .

Another interesting criterion analyzed by Athey and Imbens (2016) is based on


|τ̂1 − τ̂2 |
T =p ,
Var(τ̂1 ) + Var(τ̂2 )
where τ̂1 and τ̂2 are the estimated treatment effects in each child node, with estimated
variances Var(τ̂1 ) and Var(τ̂2 ), respectively. This criterion is a T-statistic like criterion,
that tests the equality of treatment effects between the two potential child nodes, and
aims at selecting the more different ones.

Assumption 7.4 (Regularity conditions for asymptotic normality). The potential sam-
ples (X i , Y1,i ) and (X i , Y0,i ) satisfy, for j ∈ {0, 1},

- X i ∼ U ([0, 1]p ) independently;

- µj : x 7→ E [Yj |X = x] and µj,2 : x 7→ E (Yj )2 X = x are Lipschitz continuous;


 

h i
- Var (Yj |X = x) > 0 and E |Yj − E [Yj |X = x]|2+δ1 ≤ M for some constants
δ1 , M > 0 uniformly over all x ∈ [0, 1]p ;

Denote the infinitesimal jackknife estimator (see Efron (2014); Wager et al. (2014))
 2 Xn B
!
n−1 n 1 X ∗  ∗

V̂IJ (x) = τ̂b (x) − τ̂b∗ (x) Ni,b − Nb∗ , (7.12)
n n−s i=1
B − 1 b=1

114

where Ni,b indicates whether or not the i-th training example was used for the b-th
bootstrap tree and Nb∗ , τ̂b∗ are averages over the B bootstrap trees.

Theorem 7.1 (Asymptotic normality of double sample causal random forests, Theorem
1 in Wager and Athey (2017)). Assume that we have i.i.d samples Z i = (X i , Yi , Di )ni=1 ∈
[0, 1]p × R × {0, 1}, that the selection on observables (7.1) holds, and that there exists
 > 0 such that  ≤ P (D = 1|X) ≤ 1 −  (overlap condition). Suppose Assumption 7.4
holds and consider a double sample causal random forests satisfying Assumption 7.3 with
α ≤ 0.2. Assume that
−1
log(α−1 )

β p
s = bn c, for some βmin := 1 − 1 + < β < 1. (7.13)
δ log ((1 − α)−1 )
Then, there exists C(·) and γ > 0, such that random forest predictions are asymptotically
Gaussian:
τ̂ (x) − τ (x)
→d N (0, 1) ,
σn (x)
s C(x)
where σn (x) := . The asymptotic variance σn (x) can be consistently estimated
n log(n/s)γ
using the infinitesimal jackknife

V̂IJ (x)/σn2 (x) →p 1

Several remarks are in order. First, from the restrictions on β, one can make more
precise the rate of convergence n−1/(1+pα3 /δ) , which does not allow for a “high dimensional”
case in the sense of the previous section (p  log(n)). Theorem 7.1 allows to perform
inference, to test the significativity of the treatment effect for a population with covariates
x without apriori on the specific spaces in the feature space to test. Third, Theorem 7.1
uses a very strong assumption on the distribution of the covariates, which allows here to
make inference.

Remark 7.4 (Key ideas for the proof of Theorem 7.1, can be skipped for the
first lecture). A random forest is an U-statistic, which means that can be written as
 −1
n X
µ= T (Z i1 , . . . , Z is )
s s
i∈{1,...,n} , with i1 <···<is

for a bounded function T (see for a detailed exposition Chapter 12 p162 in Van der
Vaart (1998)). The usual way to prove asymptotic normality for U-statistics is to use the

115

projection µ̂(x) of µ̂(x) onto the class S of all statistics of the form
n
X
gix (Z i )
i=1

where E (gix (Z i ))2 < ∞, which is called the Hájek projection. Then, from µ̂(x) =
 
◦ ◦ ◦
(µ̂(x) − µ̂(x)) + µ̂(x), the proof amounts to show that µ̂(x) − µ̂(x) →p 0 and to apply

the CLT to the projection µ̂(x). More precisely, we use Proposition 7.1 (which is Lemma
11.10 in Chapter 11 in Van der Vaart (1998)).

Proposition 7.1. Let Z 1 , . . . , Z n be independent random vectors, then the projection of


a random variable T with finite second moment onto the class S is given by
n
◦ X
T = E [T ] + (E [T |Z i ] − E [T ]) .
i=1


Proof of Proposition 7.1. The proof follows from the fact that T belongs to S and
because we can verify that for all S ∈ S,
 

E (T − T )S = 0.

Indeed, for all i ∈ {1, . . . , n},


 
   
◦  ◦ 
E (T − T )gi (Z i ) = E  E [T |Z i ] − E[T |Z i ] gi (Z i )


| {z }
=0

because for all j 6= i, by independence E [E [T |Z i ] |Z j ] = E [T ]. 

Then, an important result (see Theorem 11.2 in Van der Vaart (1998)) states that if

the projection T statisfies
Var(T )
lim ◦ →1 (7.14)
n→∞
Var(T )
then ◦ ◦
T − E [T ] T − E[T ]
− ◦ →P 0.
Sd (T ) Sd(T )
This is due to the fact that
◦ ◦ ◦
 
T − E [T ] T − E[ T ] =2−2 Cov(T, T )
Cov  , ◦ ◦
Sd (T ) Sd(T ) Sd (T ) Sd(T )

116
and that using orthogonality
      2
◦ ◦ ◦ ◦
E TT = E T − T T + E T

which proves convergence in second mean, hence in probability.

Thus, to prove asymptotic normality of the random forest, one could try to show
(7.14). However, this does not hold for regression trees. Hence Wager and Athey (2017)
rather show a close adaptation of this property (7.14) (namely that regression trees are
ν-incremental) which states that under the conditions of Theorem 7.1, there exists C1 (·)
such that ◦
Vars (T x )
lim inf log(s)p ≥ C1 (x). (7.15)
s→∞ Vars (T x )

Then, they use that by independence of the observations and symmetry permutation of
the trees T ,

E [T (x; z, Z 2 , . . . , Z s )] if j ∈ i
E [T (x; Z i1 , . . . , Z is )|Z j = z] =
0 if j ∈
/i

which yields in (7.5) that


 −1
n X
E [µ̂(x)|Z i = z] = E [Eξ∼Ξ [T (x; ξ, Z i1 , . . . , Z is )]|Z i = z] ,
s
1≤i1 <···<is ≤n
 −1  
n n−1
= Eξ∼Ξ [T (x; ξ, z, Z 2 , . . . , Z s )]
s s−1
s
= Eξ∼Ξ [T (x; ξ, z, Z 2 , . . . , Z s )] ,
n

hence, denoting for simplicity by T (x) := Eξ∼Ξ [T (x; ξ, Z 1 , Z 2 , . . . , Z s )] (where the ex-
pectation is only on the randomness of ξ, so T (x) depends on Z 1 , Z 2 , . . . , Z s ), we have
n
◦ X
µ̂(x) = E [µ̂(x)] + (E [µ̂(x)|Z i ] − E [µ̂(x)])
i=1
n
sX
= E [T (x)] + (E [T (x)|Z i ] − E [T (x)]) . (7.16)
n i=1

Moreover, we have
s
◦ X
T (x) = E [T (x)] + (E [T (x)|Z i ] − E [T (x)]) . (7.17)
i=1

117

Question: Show that (7.16)-(7.17) yields σn2 (x) = sVars (T (x))/n.

Let σn2 (x) be the variance of µ̂(x). Lemma 7 in Wager and Athey (2017) shows that
" 2 #

E µ̂(x) − µ̂(x)
 s 2 Var (T (x))
s

σn2 (x) n σn2 (x)
s Vars (T (x)) s ◦
= ◦ (using σn2 (x) = Vars (T (x))),
n Var (T (x)) n
s

which yields with (7.15) that


" 2 #

E µ̂(x) − µ̂(x)
→0
σn2 (x)
It remains to show that the right hand side of equation (7.16) satisfies the conditions for
the Lyapunov central limit Theorem.

Remark 7.5 (Local centering). Insights from the literature considered in the first two
sections (namely Chernozhukov et al. (2017)) made Athey and Wager (2017) consider
a local centering pre-treatment before estimating causal random forests. Specifically,
they show on simulations that estimating the above double sample causal forests with
orthogonalized outcomes

Yei = Yi − E [Yi |X i = x] ,
e i = Di − E [Di |X i = x]
D

(where we use the second equation only if the propensity score is used), improves the
performances of the algorithm. In practice, they propose to use random forest estimators
for the regression function in the above equations and to use

Yei = Yi − Ŷ (−i) (X i )
e i = Di − D̂(−i) (X i ),
D

where Ŷ (−i) (X i ) and D̂(−i) (X i ) are leave-one-out estimators (random forests evaluated
without the i-th observation, which is quasi-computationally free). Section 7.2.3 makes
these statements more precise based on Nie and Wager (2017) and Athey et al. (2019).

118
7.2.2 Simulations

We consider a simulation setups from Wager and Athey (2017), where X i ∼ U ([0, 1]p ),
Di |Xi ∼ Bernoulli(p(Xi )), and

Yi |X i , ∼ N (m(X i ) + (Di − 0.5)τ (X i ), 1) ,

where there is no confounding, m(x) = 0 and p(x) = 0.5, but there is heterogeneity in
the treatment

1
τ (x) = ζ(x1 )ζ(x2 ), where ζ(u) = 1 + .
1 + e−20(u−1/3)

This setting can be implemented via the package causalForest (and randomForestCI
for confidence intervals). However, one can also use the package grf (generalized random
forest), which uses a gradient based algorithm and a slightly different criterion (see Section
7.3 for more details) but is more efficient and stable. For the codes, see CausalForests.R
on the github page of the course.
We only report here the results from Wager and Athey (2017) on Figure 7.3 of the
case p = 3. The comparison is done with the k-nearest-neighbours (kNN)

1 X 1 X
τ̂kN N = Yi − Yi
k k
i∈S1 (x) i∈S0 (x)

where k is taken to be 10 and 100, and S1 (x) and S0 (x) are the neighbours of x. It is
shown in details in Table 3 in Wager and Athey (2017) that the mean squared error for
the causal random forest are more robust than for the kNN to the increase in the number
of covariates (which still is small). Starting from dimension 3, the mean squared error of
the causal random forest estimate is lower than the 100-nearest-neighbours estimator.

7.2.3 Complements and Alternative Algorithm

Denote by p(x) := E [D|X = x] and by m(x) := E [Y |X = x] = µ0 (x) + p(x)τ (x).


Nie and Wager (2017) based their R-learner estimator of the treatment effects τ on the
following representation proposed by Robinson (1988)

Yi − m(X i ) = (Di − p(X i ))τ (X i ) + i , E [i |X i , Di ] = 0.

119
Figure 7.3: Figure comparing the true treatment effect (left) to the causal random forest estimate (center),
and the kNN estimate (right) in dimension 3, adapted from Wager and Athey (2017)

Thus, τ satisfies

τ (·) = arg min E ((Yi − m(X i ) − (Di − p(X i ))τ (X i ))2 .


  
(7.18)
τ

Hence, with a preliminary knowledge m̂ and p̂ of the nuisance functions m and p, we


could estimate τ by the minimisation of the penalized empirical loss
( n )
1X
τ̂ (·) = arg min ((Yi − m̂(X i ) − (Di − p̂(X i ))τ (X i ))2 + Λn (τ (X i )) , (7.19)
τ ∈Θ n i=1

where the penalization Λn takes into account the complexity of the class Θ where τ
belongs. Thus, Nie and Wager (2017) propose the two steps estimator

1. Fit m̂ and p̂ via any method tuned for optimal predictive accuracy (random forests,
deep neural networks, LASSO, ect)

2. Estimate treatment effects via a plug-in version of (7.19), using the leave-one-out
estimators Ŷ (−i) (X i ) := m̂(X i ) and D̂(−i) (X i ) := p̂(X i ).

Allowing for a two-steps estimation, contrary to the causal forest formulation of Section
7.2.1 allows to choose more adapted methods in the first steps to the profiles of m and
p. Nie and Wager (2017) show bounds on the regret according to the complexity of the
class Θ and this formulation is used in the package grf.

7.2.4 Applications

We focus on the applications from Davis and Heller (2017a) and Davis and Heller (2017b)
which estimate the benefits from two youth employment program in Chicago. These two

120
randomized controlled trials (RCTs) are the same summer job program in 2012 and 2013.
They have relatively large sample size (1,634 and 5,216 observations, respectively) and
observe a large set of covariates. The program provides disadvantaged youth aged 14
to 22 with 25 hours a week of employment and an adult mentor. Participants are paid
at Chicago’s minimum wage. They focus on two outcomes: violent-crime arrests within
two years of random assignment and an indicator for ever being employed during the six
quarters after the program.

They ask the following question: if we divide the sample into a group predicted to
respond positively to the program and one that is not, would we successfully identify
youth with larger treatment effects? To do so they train the causal forest on half of the
sample, then use the treatment effect predictions on the other part. Then they regress the
outcomes on the indicators 1{τ̂ (x > 0}, Di 1{τ̂ (x > 0}, and Di (1 − 1{τ̂ (x > 0}). They
test the null hypothesis that the treatment effect is equal among the two groups. Figure
7.4 shows their results: in the in-sample, the test is rejected for both outcomes whereas
it detects significant heterogeneity only for the return to employment in the out-sample.
This could be the sign of some overfitting. Note that changing the splitting rule does
seems to alter much these results. They conclude that the sampling error may hide the
treatment effect.
Hussam et al. (2017) uses causal random forest to evaluate the impact of giving a 100$
grant to randomly selected entrepreneurs in India on their returns. More, they compare
the predicted treatment using causal forests based on entrepreneurs characteristics to the
treatment effect when the grant is accorded based on community rankings (private or
public). They find that peer reports are predictive over and above observable traits but
that making the rankings public create incentives to lie that reduces community’s reports
accuracy.
Finally, we underline that Athey and Wager (2019) report the application of causal
random forests and the package grf made during a challenge to estimate from the dataset
of the National Study of Learning Mindsets, the treatment effect of giving a nudge-like
intervention to instill students the belief that intelligence can be developed on student
achievement. We recommend to have a look at the code and datasets which are provided

121
Figure 7.4: Comparison of average treatment effects among the two groups identified using causal forests
as having positive or negative treatment effect (from Davis and Heller (2017a))

at https://github.com/grf-labs.

7.3 Heterogeneous Effects with Instrumental Vari-


ables

Athey et al. (2019) generalize the above heterogeneous treatment estimation to the case

122
of endogenous treatment. The aim is to measure the causal effect of an outcome while
acknowledging that the intervention and the outcome are tied by non causal factors. The
selection on the observables no longer holds. For example, one may be interested by the
causal effect of childbearing on female labor force participation (the classical IV used in
the literature is the “mixed children” dummy, namely having children of different sexes).
They consider the following model

Yi = τ (X i )Di + µ0 (X i ) + i , E [Zi i |X i = x] = 0, E [i |X i = x] = 0 ∀x ∈ X .


(7.20)
Note that this is a very particular case of the so-called nonparametric instrumental vari-
able (NPIV) model (see e.g. Newey and Powell (2003); Darolles et al. (2011)), which
takes the form

Yi = ϕ(Di , X i ) + , E [Zi i |X i = x] = 0, E [i |X i = x] = 0,

with ϕ(Di , X i ) = τ (X i )Di + µ0 (X i ). τ (·) is the heterogeneous causal impact of Di on


Yi using Zi as an instrument. Estimation is based on the following moments conditions,
∀x ∈ X ,
   
Zi
E [ψτ,µ0 (W i )|X i = x] := E (Yi − τ (X i )Di − µ(X i )) X i = x = 0, (7.21)
1

where W i = (X i , Yi , Di , Zi ). Here, τ is our parameter of interest and µ is the nuisance


parameter. The idea is to use random forests to compute weights αi (x) that measure the
importance of the i-th training sample in estimating τ (·) through the moment equation
( n )
X 1 {X i ∈ L(x)}
(τ̂ , µ̂) ∈ arg min αi (·)ψτ,µ (W i ) , where αi (x) := .


i=1
|{i : X i ∈ L(x)}|
2

If there is a unique solution, (τ (x), µ(x)) solves the equation


n
X
αi (x)ψτ (x),µ(x) (W i ) = 0.
i=1

The idea is then to extend previous random forests to learn in a data-driven way the
weights αi , and get an asymptotically normal estimator τ̂ of τ . The problem here com-
pared to the case where the unconfoundedness assumption holds is that both τ and µ are
implicitly defined.

123
The Gradient tree algorithm. The algorithm computes the splits (hence the weights
and the estimator) recursively. Start from a parent note P that we seek to divide in
two children C1 , C2 using an axis-aligned cut such has find the best improvement of the
accuracy of our estimator τ̂ , namely minimizing
2 h i
X 2
err(C1 , C2 ) = P (X i ∈ Cj |X i ∈ P ) E τ̂Cj (J ) − τ (X i ) |X i ∈ Cj ,
j=1

where τ̂Cj (J ) are fit over the children Cj in the first part of the train sample J . However
here we do not have access to a direct unbiased estimate of err(C1 , C2 ), which yields
Athey et al. (2019) to propose a new procedure:

1. In a labeling step, compute (τ̂P (J ), µ̂P (J )) in the parent node using


 

X


(τ̂P (J ), µ̂P (J )) ∈ arg min
ψτ,µ (W i )
, (7.22)

{i∈J , X i ∈P }
2

then compute
1 X
AP := ∇ψτ̂P ,µ̂P (W i )
|{i : X i ∈ P }|
{i: X i ∈P }
 
1 X −Di Zi −Zi
= .
|{i : X i ∈ P }| −Di −1
{i: X i ∈P }

Note that in the IV model (7.20), the minimization problem of (7.22) has a solution
 P  P  
 
{i: X ∈P } Z i Y i − Y P / {i: X ∈P } Zi D i − D P
τ̂P (J ) i i
= 1 P ,
µ̂P (J ) (Y i − D i τ̂ P (J ))
|{i : X i ∈ P }| {i: X i ∈P }
P P
where Y P = {i: X i ∈P } Yi /|{i : X i ∈ P }| and DP = {i: X i ∈P } Di /|{i : X i ∈
P }|. Then compute the pseudo-outcomes

ρi := −(1, 0)A−1
P ψτ̂P ,µ̂P (W i ) ∈ R, (7.23)

where (1, 0) is the vector in R2 that picks the coordinate of τ .

2. In a regression step, run a CART regression split on the pseudo-outcomes, namely


finding the split that maximizes the criterion
 2
2
˜ 1 , C2 ) =
X 1 X
∆(C  ρi  .
j=1
|{i : X i ∈ Cj }|
{i: X i ∈Cj }

Then, relabel the observations in each child by solving the estimating equation.

124
The justification of this algorithmic way to find estimate an optimal partition is given
in Proposition 1 in Athey et al. (2019) which states that (1) if AP is a consistent estimator
of ∇E [ψτ̂P ,µ̂P (W i )|X i ∈ P ], (2) that the parent node has a radius smaller than r > 0 (3)
the regularity assumptions of Theorem 7.2, and considering the number of observation in
the child nodes as fixed and large in front of 1/r2 , then
h i
˜ 1 , C2 ) + o(r2 ),
err(C1 , C2 ) = K(P ) − E ∆(C

where K(P ) is a deterministic term linked to the purity of P .

Remark 7.6 (Influence function). The intuition for the use of ρi comes from the proof
of the asymptotic normality for Z-estimators, which are estimators of θ0 based on the
moment condition
E [ψθ0 (Xi )] = 0.

Using the asymptotic representation of the Z-estimator θ̂n in Theorem 5.21 page 52 in
Van der Vaart (1998)
n  
1 X 1
θ̂n = θ0 + ∇ψθ−1 ψθ0 (Xi ) + op √ ,
n 0
i=1
n
we have that the influence of the n-th observation on the estimator is given by

θ̂(X1 , . . . , Xn ) − θ̂(X1 , . . . , Xn−1 )


n−1 n−1
1 1 X 1 X
= ∇ψθ−1 ψθ0 (Xn ) + ∇ψθ−1 ψθ0 (Xi ) − ∇ψθ−1 ψθ0 (Xi )
n 0
n 0
i=1
n − 1 0
i=1
n−1
1 1 X
= ∇ψθ−1 ψθ 0 (X n ) − ∇ψθ
−1
ψθ0 (Xi )
n 0
n(n − 1) 0
i=1
1
= ∇ψθ−1 ψθ0 (Xn ) + oP (1)
n 0

which shares the same form as ρi .

Central Limit Theorem for Generalized Random Forests (GRF) in the instru-
mental variable model. We only study the case of model (7.20) and refer to Athey
et al. (2019) for the more general case of asymptotic normality for GRF. Denote by

V (x) = ∇E [ψτ,µ (W i )|X i = x]

125
 
E [Di Zi |X i = x] E [Zi |X i = x]
=− ,
E [Di |X i = x] 1

the infeasible pseudo outcomes ρ∗i by

ρ∗i (x) = −(1, 0)V (x)−1 ψτ (x),µ(x) (W i ) ∈ R

and denote the pseudo-forest by


n
X n
X

τ̃ (x) := τ (x) + αi (x)ρ∗i (x) = αi (x) (τ (x) + ρ∗i (x)) .
i=1 i=1

τ̃ ∗ (x) is useful because it has the same form as the base learner in the U-statistic studied
in Wager and Athey (2017). Thus, tools developed in Wager and Athey (2017) and in
Section 7.2.1 can be applied, which yields the asymptotic normality of τ̂ (x) provided that
τ̂ (x) and τ̃ ∗ (x) are close asymptotically, which is ensured via Assumption 7.5. τ̃ ∗ (x) is
exactly the output of an infeasible regression forest trained with outcomes τ (x) + ρ∗i (x).

Assumption 7.5 (Regularity conditions for asymptotic normality of GRF). Assume

- a Lipschitz-x signal: for fixed values of (τ, µ0 ),

Mτ,µ (x) := E [ψτ,µ (W i )|X i = x]

is Lipschitz continuous in x;

- Smooth identification: V (x) is invertible for all x ∈ X . This is implied by the


fact that the instrument is valid, i.e. correlated with Di .

Theorem 7.2 (Asymptotic normality of GRF for the instrumental variable model (7.20),
Theorem 5 in Athey et al. (2019)). Assume that we have i.i.d samples W i = (X i , Zi , Yi , Di )ni=1 ∈
[0, 1]p × R × R × {0, 1}, and that there exists  > 0 such that  ≤ P (D = 1|X) ≤ 1 − 
(overlap condition). Make Assumption 7.4, 7.5 and consider a double sample causal ran-
dom forests satisfying Assumption 7.3 with α ≤ 0.2. Assume that β satisfies (7.13).
Then, there exists C(·) and γ > 0, such that random forest predictions are asymptotically
Gaussian:
τ̂ (x) − τ (x)
→d N (0, 1) ,
σn (x)
s C(x)
where σn (x) := .
n log(n/s)γ

126
Intuitions for the proof of Theorem 7.2. The proof follows from the fact that τ̃ ∗ (x)
is formally equivalent to the output of a regression forest, thus using Theorem 7.1 we
have
τ̃ ∗ (x) − τ (x)
→d N (0, 1)
σn (x)
Then, from Theorem 3 (consistency of (τ̂ , µ̂)) and Lemma 4 in Athey et al. (2019) we
have   
n ∗ 2 s 2/3
(τ̃ (x) − τ̂ (x)) = Op (7.24)
s n
so (τ̃ ∗ (x) − τ̂ (x))/σn →p 0 which yields the result. The technical part of the Theorem is
the proof of Lemma 4 which yields (7.24). 

Athey et al. (2019), similarly to Nie and Wager (2017), recommend to orthogonalize
the variables Yi , Di , Zi , using preliminary leave-one-out estimators m̂(−i) , p̂(−i) , and ẑ (−i)
of E [Yi |X i = x], E [Di |X i = x], and E [Zi |X i = x] which yields
Pn (−i)
 
i=1 αi (x) Yi − m̂ (X i ) Zi − ẑ (−i) (X i )
τ̂ (·) = Pn (−i) (X )) (Z − ẑ (−i) (X ))
. (7.25)
i=1 αi (x) (Di − p̂ i i i

This option is implemented in the grf package. One can build pointwise confidence
intervals such that

lim E τ (x) ∈ τ̂ (x) ± Φ−1 (1 − α/2)σ̂n2 (x) = 1 − α


 
n→∞

from the fact that Var[τ̃ ∗ (x)]/σn2 →p 1 and using the definition of ρ∗i (x) to build σ̂n2 (x).

7.3.1 Application to heterogeneity in the effect of subsidized


training on trainee earnings

First, we strongly recommend to have a look at the simulations used in Athey and Wager
(2019) using the grf package and available at https://github.com/grf-labs which nicely
illustrate the performances of the IV Forest.
We consider here an application of the generalised random forests to estimating het-
erogeneity in the effect of subsidized training on trainee earnings. We use data from
Abadie et al. (2002) which can be dowloaded at https://economics.mit.edu/faculty/
angrist/data1/data/abangim02. We re-analyse data from the Job Training Partner-
ship Act (JTPA), which is a large publicly-funded training program. Individuals are

127
randomly assigned to the JTPA treatment and control groups, the treatment consisting
in offering training. Only 60 percent of the treatment group accepted to effectively be
trained, but the randomized treatment assignment provides an instrument for the treat-
ment status. More, because there is only 2 percent of individuals receiving JTPA services
in the control group, they interpret the effect for compliers as the effect on those who are
treated. See Abadie et al. (2002) for more details and an alternative estimation method
of the effects of this training on the distribution of earnings based on quantile regression
handling the endogeneity of the treatment. We focus on the heterogeneity of the effect
of the training according to the baseline characteristics interactions: age, High school
graduate indicator, marital status, Black and Hispanic indicators, Aid to Families with
Dependent Children (AFDC) receipt, and a dummy for whether one worked less that 13
weeks in past year. We denote by Y the 30-month earning, D the enrollment in JTPA
services, and Z the offer of services.

We train a generalised random forest on 80% of the sample for man and woman sep-
arately, and using IV (denoted by “GRF”) or not (denoted by “CRF”). We draw several
comparisons. First, Figure 7.5 represents the distributions of the predicted treatment ef-
fect on 20% of the sample, using “GRF” or “CRF”, which is our test sample. Of course,
a deeper analysis of the results could be done by reporting the precise estimate for the
treatment effect for subgroups of the sample (here all the covariates are binary variables,
so we can not “plot” the estimated treatment as in the simulation in the grf package).
Similarly to Abadie et al. (2002), we observe that there is a important difference between
the two, as illustrated on Table 7.1. This underlines the importance of using an instru-
mental variable in that context. Results from Table 7.1 can also be compared to the
estimates for quantile of the treatment effects on the test sample to the quantile treat-
ment effects (QTE) estimated by Abadie et al. (2002) using quantile regression handling
the endogeneity of the treatment. Both are very close, which is coherent, but the lack of
uniform confidence band for GRF prevent us from doing more comparisons.

128
Figure 7.5: GRF (blue) and RF (red) predicted treatment effect distribution for 30-month earning
for men (left) and women (right) among the test sample.

ATE, OLS ATE, 2SLS ATE, RF ATE, GRF Q, 0.25 Q, 0.50 Q, 0.75
Men 3,754 (536) 1,593 (895) 3,456 1,908 863 1,836 2,986
Women 2,215 (334) 1,780 (532) 1,972 1,828 626 1,648 2,746
Table 7.1: Table of the estimated treatment effect on the 30-month earnings.

7.4 Estimating Features of Heterogeneous Effects with


Selection on Observables

This section is based on Chernozhukov et al. (2017b), a paper we encourage you to read.
The previous sections have shown that in order to perform inference on the Conditional
Average Treatment Effect (CATE), one often has to make strong assumptions on the
underlying true CATE that might not always hold (e.g. covariates uniformly distributed
in causal forests) or un-testable. Furthermore, the practical application of these methods
very often substantially differs from their theoretical counterparts (e.g. tuning parameters
are chosen via cross-validation). There might be a trade-off between the assumptions that
we are willing to make and the amount of knowledge about the targeted object that we
want to get. Following a strand in the statistical literature (e.g. Lei et al. (2017)),
Chernozhukov et al. (2017b) propose to change the point of view about the CATE, and
to estimate/perform inference regarding key features of the CATE rather than the
true object itself. Changing our objectives allows to use plenty of ML methods, viewed
as proxies of the CATE, and to limit the amount of assumptions we have to make (in
particular, the ML methods do not have to be consistent). The idea consists in post-

129
processing those estimators to get consistent succinct summaries of the CATE.
The model is similar the one of Section 7.2: we observe the outcome variable Y =
DY1 + (1 − D)Y0 , the treatment dummy D, covariates X ∈ Rp , and make the selection on
the observables (7.1) and overlap (∃ > 0, s.t.  ≤ P (D = 1|X) ≤ 1 − ) assumptions.
The following equation hold

Y = µ0 (X) + Dτ (X) + U, E [U |X, D] = 0 (7.26)

τ (X) = µ1 (X) − µ0 (X) = E [Y1 |X] − E [Y0 |X] . (7.27)

7.4.1 Estimation of Key Features of the CATE

Chernozhukov et al. (2017b) propose partitioning the initial data (Yi , Di , X i )i=1,...,n be-
tween an auxiliary sample (denoted DataA ) and a main sample (denoted DataM ). The
first step is to estimate x → µ0 (x) and x → τ (x) on the auxiliary sample. The following
estimators are the estimators of µ0 and τ respectively that result from a machine learning
algorithm (any algorithm can be considered):

m0 : x → m0 (x|DataA ) , (7.28)

T : x → T (x|DataA ) . (7.29)

m0 and T are called ML proxy predictors because no performance assumption is imposed


on them, they can merely be proxies of the true functions. Then, they propose to post-
process these estimators to make inference of about the following key features of the
CATE on the main sample:

1. the Best Linear Predictor (BLP) of the CATE τ (·) based on the ML proxy predictor
T (·);

2. the Sorted Group Average Treatment Effects (GATES), which is the average of the
τ (·) by heterogeneity groups induced by T (·).

3. the CLassification ANalysis (CLAN), which is the average of the characteristics X


on groups induced by quantiles of T (·).

Best linear predictor (BLP) of the CATE. The first key feature (1), the Best linear
predictor (BLP) of the CATE using the proxy T is defined as the linear projection of the

130
CATE on the linear span of 1 and this proxy in the space L2 (P ):

E (τ (X) − f (X))2
 
BLP [τ (X)|T (X)] = arg min
f (X)∈Span(1,T (X))

Cov(τ (X), T (X))


= E [τ (X)] + (T (X) − E[T (X)]) (7.30)
Var(T (X))
= b1 + b2 (T (X) − E[T (X)]) ,

where
(b1 , b2 ) ∈ arg min E (τ (X) − B1 − B2 T (X))2 .
 
(7.31)
(B1 ,B2 )∈R2

We do not observe τ (·), but we actually observe an unbiased signal of τ (·) as


 
D − p(X)
E Y X = τ (X) (7.32)
p(X)(1 − p(X)

and from the auxiliary sample A we can estimate a ML proxy T (·) of τ (·). Thus we can
estimate Cov(τ (X), T (X))/Var(T (X)) using (7.32) running the regression

w(X)(D − p(X))Y = β1 + β2 (T (X) − E[T (X)]) + , (7.33)


    
1 0
E  = , (7.34)
T (X) − E[T (X)] 0

where w(X) = ((1 − p(X))p(X))−1 .

Theorem 7.3 (Consistency of the Best linear predictor estimator, Theorem 2.2 in Cher-
nozhukov et al. (2017b)). Consider the maps x 7→ T (x) and x 7→ m0 (x) as fixed. Assume
h 0 i
that Y and X have finite second order moments and that E X X is full rank, where
Then, (β1 , β2 ) defined as (7.33), where also solves the problem (7.31), hence

Cov(τ (X), T (X))


β1 = E [τ (X)] and β2 = .
Var(T (X))

Several remarks are in order. First, this identification is constructive and simple: the
weighted OLS estimation procedure is described in 7.4.3. Second, this strategy does not
assume that the estimator T (X) is a consistent estimate of τ (X), allowing the very high
dimensional setting p  n. However, we only learn about the best linear projection of τ
onto (T (X), 1) which yields that if T (X) is a bad predictor we learn noting about the
truth. Finally, note two extreme interesting cases:

- If T (X) is a perfect proxy for τ (X) and τ (X) is not a constant, then β2 = 1

131
- If T (X) is complete noise, uncorrelated to τ (X), then β2 = 0.

If there is no heterogeneity, then β2 = 0 yields a very simple test of the joint hypothesis
that there is heterogeneity and that T (X) is relevant (which is a problem if we do not
reject, as one can not separate the two hypotheses).
Proof of Theorem 7.3. We only show that β2 = Cov(τ (X), T (X))/Var(T (X)), since
the proof for β1 follows a similar reasoning. The normal equations (7.34) that identify
(β1 , β2 ) give for β2
Cov(w(X)(D − p(X))Y, T (X) − E [T (X)])
β2 = . (7.35)
Var(T (X) − E [T (X)])
The denominator is equal to Var(T (X)). Now, since T (X) − E [T (X)] has mean zero,
the numerator of (7.35) is

Cov(w(X)(D − p(X))Y, T (X) − E [T (X)]) = E [w(X)(D − p(X))Y (T (X) − E [T (X))]] .


(7.36)
Recall that Y = µ0 (X) + Dτ (X) + U hence the result comes from (7.36) and the three
following moments. First, using the law of iterated expectations we have

E [w(X)(D − p(X))µ0 (X)(T (X) − E [(X)])]


 

= E w(X)µ0 (X)(T (X) − E [(X)]) E [D − p(X)|X]


| {z }
=0
= 0.

Second, since D|X ∼ B(p(X)) then

E [w(X)(D − p(X))D|X] = E w(X)(D − p(X))2 |X = 1,


 

thus

E [w(X)(D − p(X))Dτ (X)(T (X) − E [T (X))]] = E [τ (X)(T (X) − E [T (X))]]

= Cov (τ (X), T (X)) .

Third, we have
 

E [w(X)(D − p(X))U (T (X) − E [T (X)])] = E w(X)(D − p(X)) E [U |D, X](T (X) − E [T (X))]
| {z }
=0
= 0,

which yields the result. 

132
Remark 7.7. To reduce the noise generated by the Horvitz-Thompson like weight H :=
(D − p(X))/(p(X)(1 − p(X))) in (7.34), Chernozhukov et al. (2017b) recommend to use
the following regression instead of (7.33)

w(X)(D − p(X))Y = µ0 Z 1 H + β0 + β1 (T (X) − E[T (X)]) + ,

where Z 1 = (1, m0 (X), T (X)).

Sorted Group Average Treatment Effects. We can also divide the support of the
proxy predictor T (X) into over-lapping regions to define groups of similar treatment
response an perform inference over their expected treatment effect

GAT ES : E [τ (X)|G1 ] ≤ · · · ≤ E [τ (X)|GK ] ,

for Gk = 1{lk−1 ≤ T (X) ≤ lk } with −∞ = l0 ≤ l1 · · · ≤ lK = ∞. For that, Chernozhukov


et al. (2017b) consider the regression of the unbiased signal w(X)(D − p(X)Y on the
indicators 1 {i ∈ G1 } , . . . , 1{i ∈ GK } and the projection parameters are the GATES.

7.4.2 Inference About Key Features of the CATE

The innovation of Chernozhukov et al. (2017b) in inference about the following key feature
of the CATE:

- θ = β2 , the heterogeneity predictor loading parameter (based on the BLP);

- θ = β1 + β2 (T (x) − E [T ]), the individual prediction of τ ,

is to provide methods to do inference handling the two sources of uncertainty that appear
when using sample splitting methods. Specifically, using different partitions in two parts
{A, M } of the initial sample and aggregating the different estimates θ̂A brings:

1. Conditional uncertainty, which is estimation uncertainty regarding the parameter


θ, conditional on the data split;

2. “Variational” uncertainty, which is induced by the data splitting.

To make inference with methods using data splitting, one has to adjust the normal con-
fidence level in a specific way.

Denote by

133
- the lower median (usual median) Med(X) := inf {x ∈ R : PX (X ≤ x) ≥ 1/2};

- the upper median Med(X) := sup {x ∈ R : PX (X ≥ x) ≥ 1/2} (next distinct quan-


tile of a random variable) ;

- Med(X) := (Med(X) + Med(X))/2,

where for continuous variable the two notions coincide.

Let us first make more precise those two sources of uncertainty, that arise from the
repeated use of partitions {A, M } of the initial sample {Yi , Di , Xi }ni=1 :

1. Conditional uncertainty: conditional on the data of sample A (hereafter DataA )


the ML estimators of the previous sections yield that, as the cardinal of the set M
(denoted by |M |) goes to infinity, with high probability and under assumptions of
Theorem 7.3: !
θ̂A − θA
≤ z DataA → Φ(z),

P
σ̂A

where Φ is the c.d.f of the standard normal. This yields the conditional confidence
intervals
P (θA ∈ [LA , UA ]| DataA ) = 1 − α + oP (1),
h i
−1
where [LA , UA ] := θ̂A ± Φ (1 − α/2)σ̂A .

2. “Variational” uncertainty: to make inference unconditionally on the data-split,


Chernozhukov et al. (2017b) propose to
!
θ̂A − θA
(a) either to adjust the p-values pA = Φ for the test
σ̂A

H0 : θA = θ0 , H1 : θA < θ0 (7.37)

to take into account the sample splitting;

(b) or to aggregate the estimators and adjust the confidence intervals.

Note that conditional on DataA , θ̂A is a random variable.

134
Adjusted sample splitting P-values. First note that under H0 , pA ∼ U(0, 1) con-
ditional on data A, but that conditional on the whole data, there is still randomness.
Thus, Chernozhukov et al. (2017b) define testing the null hypothesis (7.37) with signifi-
cance level α, based on the p-values pA that are random conditional on the data as

p0.5 = Med (pA |Data) ≤ α/2. (7.38)

This means that for at least 50% of the random data splits, the realized p-values pA
falls below the level α/2. The construction (7.38) is based on the fact that the median
M of J uniformly distributed random variables (not necessarily independent) satisfies
P (M ≤ α/2) ≤ α. Theorem 7.4 show the uniform (over the distributions P in P, all the
possible distributions of the data satisfying H0 ) validity of the sample-splitting adjusted
p-values.

Assumption 7.6 (Uniform asymptotic size for the conditional test). Assume that all
partitions DataA of the data are “regular” in the sense that under H0 , for
! !
θ̂A − θA θ̂A − θA
pA = Φ and pA = 1 − Φ ,
σ̂A σ̂A

and for all x ∈ [0, 1], supP ∈P |PP (pA ≤ x|DataA ) − x| ≤ δ = o(1).

Theorem 7.4 (Uniform asymptotic size for the unconditional test with sample splitting,
Theorem 3.1 in Chernozhukov et al. (2017b)). If Assumption 7.6 holds, then under H0 ,

sup PP (p0.5 ≤ α/2) ≤ α + 2δ = α + o(1).


P ∈P

Proof of Theorem 7.4. We have that p0.5 ≤ α/2 which is equivalent to E [1 {pA ≤ α/2}|Data] ≥
1/2, which yields

PP (p0.5 ≤ α/2) =EP [1 {EP [1 {pA ≤ α/2}|Data] ≥ 1/2}]

≤PP (pA ≤ α/2) /(1/2) (using Markov inequality)

≤2 (α/2 + δ) (using Assumption 7.6).


From the sample-splitting adjusted p-values, inverting the test, one can deduce the
following confidence interval

CI := {θ ∈ R, pu (θ) > α/2, pl (θ) > α/2, } ,

135
for α < 0.25, where for σ̂A > 0
! !
θ̂A − θA
pl (θ) :=Med 1 − Φ Data
σ̂A
! !
θ̂A − θA
pu (θ) :=Med Φ Data .
σ̂A

Adjusted point estimator and confidence intervals. Chernozhukov et al. (2017b)


propose to aggregate the estimators obtained from several partitions of the initial sample
through
 
θ̂ := Med θ̂A Data

and to report the following confidence intervals with the nominal level 1 − 2α:

 
[l; u] := Med(LA | Data); Med(UA | Data) .

The following theorem shows the uniform validity of this type of confidence interval,
using the validity of the confidence interval CI introduced above, which is tighter but
more difficult to compute.

Theorem 7.5 (Uniform validity of the confidence interval CI with sample splitting,
Theorem 3.2, in Chernozhukov et al. (2017b)). If Assumption 7.6 holds, then CI ⊆ [l; u]
and
sup PP (θA ∈ CI) ≥ 1 − 2α − 2δ = 1 − 2α + o(1).
P ∈P

7.4.3 Algorithm: Inference about key features of the CATE

We describe the algorithm used in Chernozhukov et al. (2017b) to do inference about


key features of the CATE, before considering simulation and applications in the next two
sections. Consider the i.i.d sample (Yi , X i , Di )ni=1 . These algorithms based on alternative
estimators for the BLP and the GATES are more stable (see Chernozhukov et al. (2017b)
for a formal proof of the theoretical identification equivalence).

Step 0. Fix the number of splits S, the significance level α

Step 1. Compute the propensity scores p(X i )

136
Step 2. Consider S splits in half of the indices i ∈ {1, . . . , n} into the main sample M , and
the auxiliary sample A. For each split s ∈ {1, . . . , S}, do

Step 2.1 Tune and train each ML method separately to learn m0 and T using A. For
each observation i in M , compute the predicted baseline effect m0 (X i ) and
the predicted treatment effect T (X i )

Step 2.2 Estimate the BLP parameters by weighted OLS in M via

0
 M

Yi = α̂ Z 1,i + β̂1 (Di − p(X i )) + β̂2 (Di − p(X i )) T (X i ) − T + ε̂i , i∈M

M
where T is the average of T (X i ) in M, and such that

1 X
w(X i )ε̂i Z i = 0,
|M | i∈M

where
h 0
 M
i0
Z i = Z 1,i , Di − p(X i ), (Di − p(X i )) T (X i ) − T ,

w(X i ) = (p(X i )(1 − p(X i )))−1


0
Z 1,i = (1, m0 (X i ), T (X i )) .

Step 2.2 Estimate the GATES parameters by weighted OLS in M via


K
0
X
Yi = α̂ Z 1,i + γ̂k (Di − p(X i )) 1 {T (X i ) ∈ Ik } + ε̂i , i∈M
k=1

such that
1 X
w(X i )ε̂i W i = 0,
|M | i∈M

h 0 i0
W i = Z 1,i , {(Di − p(X i ))1 {T (X i ) ∈ Ik }}K
k=1 ,

Ik = [lk−1 , lk ], lk = qk/K (T (X i )).

Step 2.4 Compute the performance measures for each ML methods


2 K
ˆ
Λ̂ = β̂2 Var (T (X)) ˆ = X γ̂ 2 P (T (X) ∈ I ) .
and Λ k k
k=1

ˆ over the splits.


Step 3. Choose the ML methods that maximize Λ̂ and Λ

137
Step 4. Compute the estimates, (1 − α) confidence level and conditional p-values for all the
parameters of interest.

Step 5. Compute the adjusted (1 − 2α)-confidence intervals and adjusted p-values using the
variational method described in section 7.4.2.

Several remarks are in order. First, note note that maximizing Λ̂ in Step 2.4 is
equivalent to maximizing the correlation between the ML proxy predictor and the true
τ . Maximizing Λˆ in Step 2.4 is equivalent to maximizing the part of the variation of τ

which is explained by K
P
k=1 γ̂k (Di − p(X i )) 1 {T (X i ) ∈ Ik }.

7.4.4 Simulations

We implement this strategy in the simulation setting Case 1 of section 7.2.2 where

1
τ (x) = ζ(x1 )ζ(x2 ), where ζ(u) = 1 + .
1+ e−20(u−1/3)

We use adaptation of the code from MLInference (this Github code also contains the
dataset and code to replicate the application in Chernozhukov et al. (2017b)). The neural
network model is able to fit well the heterogeneity (β2 is close to 1). This is reassuring as
the shape of ζ is a sigmoid, which is the base function for the neural network here (i.e.
“activation function”). A look at the so called “CLAN” (see Chernozhukov et al. (2017b))
on Table 7.4, which are the average characteristics for the most and least affected groups
E [Xk |G5 ] and E [Xk |G1 ] for the two variables X1 and X2 shows that the ML methods
are able to detect properly that those in the upper-right square (resp. lower-left square)
in the (X1 , X2 ) space are those which benefit the most (resp. benefit the less) from the
treatment.

7.4.5 Application

We refer to the exam in section B in the Appendix for an application to the gender wage
gap heterogeneity.

138
Elastic Net Boosting Nnet Random Forest
ˆ
Λ 8.359 8.444 8.507 8.379
Λ̂ 0.882 0.941 0.968 0.892
Table 7.2: Table of performance measures for GATES and BLP for the four ML methods used based on
100 splits.

Nnet Nnet Boosting Boosting True


ATE HET ATE HET ATE
2.758 1.003 2.759 0.910 2.753
90% CI (2.679,2.836) (0.921,1.085) (2.680,2.838) (0.832,0.986)
Table 7.3: Table of the estimated ATE and parameter β2 of the BLP based on 100 splits for the two best
methods according to the selection procedure based on Λ: neural networks and boosting trees.

Elastic Net Boosting

ATE 90% CB(ATE) ATE 90% CB(ATE)


4 GATES 90% CB(GATES)
4 GATES 90% CB(GATES)
Treatment Effect

Treatment Effect

3 3

2 2

1 2 3 4 5 1 2 3 4 5
Group by Het Score Group by Het Score

Nnet Random Forest

ATE 90% CB(ATE) ATE 90% CB(ATE)


4 GATES 90% CB(GATES)
4 GATES 90% CB(GATES)
Treatment Effect

Treatment Effect

3 3

2 2

1 2 3 4 5 1 2 3 4 5
Group by Het Score Group by Het Score

Figure 7.6: Estimated GATES with robust confidence intervals based on 100 splits for the four ML Methods
used. Quantiles of the true treatment effect are min : 1.00, 25%: 2.00, 50%: 2.54, 75%: 3.92, max: 3.99.

139
Nnet Boosting
Most Affected Least Affected Difference Most Affected Least Affected Difference
X1 0.777 0.235 0.539 0.720 0.248 0.475
(0.762,0.793) (0.219,0.252) (0.517,0.561) (0.703,0.737) (0.231,0.264) (0.451,0.498)
X2 0.768 0.238 0.529 0.715 0.268 0.453
(0.752,0.785) (0.221,0.256) (0.504,0.553) (0.698,0.734) (0.250,0.285) (0.427,0.478)
Table 7.4: Estimated average characteristics for the most and least affected groups E [Xk |G5 ] and E [Xk |G1 ]
based on 100 splits (see CLAN in Chernozhukov et al. (2017b)) for the two variables X1 and X2 with robust
confidence intervals for the four ML Methods used. The “least affected” correspond to the group G1 and
“most affected” correspond to the group G5 .

140
Chapter 8

Network Data and Peer Effects

A growing empirical literature studies social and economic networks, either for their
own sake or to asses the importance of peer effects in many fields of the economic dis-
cipline. Recent examples include development and policy evaluation (Banerjee et al.,
2014), welfare participation (Bertrand et al., 2000), criminal activities (Patacchini and
Zenou, 2008), education (Sacerdote, 2011), etc.
 
n
Networks are by nature high-dimensional in the sense that links can potentially
2
be formed between n individuals, that is if we consider undirected links, twice as many if
we consider directed links. As a direct consequence, standard, usually low-dimensional,
methods such as the MLE can quickly turn out to require special tools to accommodate
the high-dimensionality of networks (see Section 8.2). Moreover, they constitute a fertile
ground to make use of high-dimensional tools such as the Lasso, as we will illustrate in
Section 8.3.

Broadly speaking, empirical questions involving networks can be divided into two
categories that we will review separately: network formations (i.e. what factors explain
the existence of an observed network and not another one?) and network spillovers or
peer effects (i.e. what is the impact that individuals linked through a network have on
each other?). We will first introduce the vocabulary and statistics specific to networks.

This series of NBER video lectures by Matthew Jackson and Daron Acemoglu presents
key network concepts and their use in Economics, www.nber.org/econometrics_minicourse_
2014. We also recommend Graham (2019).

141
8.1 Vocabulary and Concepts

We follow the surveys by de Paula (2015) and Chandrasekhar (2016).

Vocabulary. A network is usually represented by a graph g which is a pair of sets


(Ng , Eg ) of nodes or vertices Ng and edges, links or ties Eg . The cardinality of these sets
is denoted by n := |Ng | and |Eg | respectively. In our applications, vertices are economic
agents: students, firms or households for example. An edge represents a connection
between two nodes, e.g. friendship, business link, geographic distance. A graph is said to
be undirected when Eg is the set of unordered pairs with elements in Ng , for example {i, j}
with i, j ∈ Ng . Such a graph represents reciprocal connections between two nodes, like
friendship between two pupils. A graph is said to be directed when Eg is the set of ordered
pairs with elements in Ng × Ng and is used when the connections are asymmetrical such
as a buyer-supplier relationship in a production network. Links can also be weighted to
represent the intensity of a particular relationship. Figure 8.1 illustrates these definitions.

A useful representation of graphs is called the adjacency or incidence matrix W .


The incidence matrix is a square matrix of dimension n. A non-zero element Wij of W
represents an edge from node i to node j. Notice that when the graph is undirected, W is
symmetric. The incidence matrix allows to translate combinatorial operations into linear
algebraic ones. For an adjacency matrix W representing a simple graph (non-weighted, no
self-link and at most one link between any pair of nodes), the ij element of W k produces
the number of paths of length k between i and j.
Question: Write down the incidence matrix W of the graphs in Figure 8.1.

In the first part of this chapter (network formation), we consider a non-weighted


simple graph g with adjacency matrix W :


1 if {i, j} ∈ Eg
Wij = Wji = .
0 otherwise

A first set of statistics is related to the density of links in a graph and a second is
related to the correlation between the presence of links (clustering).

142
Figure 8.1: Examples of graphs

2 5

1 1 2 3 1 2

5 3 4 4

 
n
Density of a network. From a set of n nodes, = n(n − 1)/2 non-directed links
2
can be constructed. The density of a graph is the share of all possible links that do exist:
n−1 n
1 X X
density(g) :=   Wij .
n i=1 j=i+1
2
The degree of a node is the number of neighbors it has (i.e. an isolated node has degree
zero). An important statistic, both to describe a graph and to perform estimation in
many models, is the degree sequence of a graph, defined as the vector of dimension n that
collects the degree of each node:
n
!
X
d(g) := Wij .
j=1 i=1,...,n

The average degree is a commonly used measure of how well-connected a graph is:
n n
¯ := 1
XX n−1
d(g) Wij = density(g).
n i=1 j=1 2

Graphs are characterized along their density into two categories: sparse graphs for which
the density goes to zero as n → ∞ and dense graphs for which the average degree is
proportional to n, i.e. the density converges to a constant. The first case occurs, for
example, when the average degree of a graph is constant.

Clustering. Several statistics are used to measure the degree of clustering of a graph.
These metrics answer the question: “If node i is linked to both j and k, what is the

143
probability that j and k are also linked?”. Directly related is the clustering coefficient of node i,
$$c_i(g) := \frac{1}{\binom{d_i(g)}{2}} \sum_{j=1}^{n-1}\sum_{k=j+1}^{n} W_{ij} W_{ik} W_{jk},$$
the proportion of linked pairs among all possible pairs of neighbors of i. The clustering coefficient of the graph is given by the average clustering coefficient.
The global clustering coefficient,
$$c(g)^{\mathrm{global}} := \frac{\sum_{i<j<k} W_{ij} W_{ik} W_{jk}}{\sum_{i<j<k} \mathbf{1}\left\{W_{ij}W_{jk} + W_{ij}W_{ik} + W_{jk}W_{ik} > 0\right\}},$$
measures the share of all triples (i, j, k) with ij and ik linked that also have jk linked.
Empirically, we observe that many social networks (e.g. friendship at university, so-
cial relationships in villages) exhibit both sparsity and a high degree of clustering.

Question: Give an explanation for both these stylized facts.

8.2 Network Formation


This part of the chapter focuses on network formation, asking the question: “how and why, among all possible $2^{\binom{n}{2}}$ networks, do we observe this particular one?”. In statistical terms, it amounts to studying the probability distribution of the adjacency matrix W. See Chandrasekhar (2016) for a survey on network formation.
We will review a few network models, in particular the Erdös-Rényi model (Erdös
and Rényi (1959, 1960)), the model of Graham (2017) with homophily and degree het-
erogeneity (see also Chatterjee et al. (2010)), and Exponential Random Graph Models
(ERGMs, Kolaczyk (2009)), having in mind economic or sociological applications.

An Economic Starting Point. Assume that we observe characteristics Xi for node i and that, conditionally on both Xi and Xj, the event that a link between i and j forms is independent of the same event for any other pair of nodes. Furthermore, assume that the marginal utility of forming a link is given by:
$$u_i(g + ij) - u_i(g - ij) = f(x_i, x_j) - \varepsilon_{ij},$$

where $\varepsilon_{ij}$ is iid across the $\binom{n}{2}$ pairs. A link forms if and only if the marginal utility is large enough:
$$W_{ij} = \mathbf{1}\left\{f(x_i, x_j) - \varepsilon_{ij} > 0\right\}.$$

Assuming a Type I extreme value distribution, we get a Logit model for the conditional probability of forming a link:
$$\mathbb{P}(W_{ij} = 1 \mid X_i = x_i, X_j = x_j) = \frac{e^{f(x_i, x_j)}}{1 + e^{f(x_i, x_j)}}.$$

If the econometrician is interested in testing homophily, i.e. the propensity of people to associate and bond with similar individuals (“birds of a feather flock together”), a standard model would assume the functional form $f(x_i, x_j) = \beta_0 \|x_i - x_j\|$: a negative $\beta_0$ then implies a higher probability of link formation between two similar nodes.
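As an illustration, here is a minimal simulation sketch of this dyadic Logit model (all parameter values, variable names and the scalar characteristic are illustrative choices of ours): links are drawn with the Logit probability above, and β0 is recovered by a logistic regression of the dyad indicators on the pairwise distances.

# Python sketch (simulated data, illustrative parameters)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, alpha0, beta0 = 200, -1.0, -2.0           # beta0 < 0 corresponds to homophily
x = rng.normal(size=n)                        # scalar characteristic of each node

# Build the n(n-1)/2 dyads: regressor is |x_i - x_j|, outcome is the link indicator.
i, j = np.triu_indices(n, k=1)
dist = np.abs(x[i] - x[j]).reshape(-1, 1)
p_link = 1 / (1 + np.exp(-(alpha0 + beta0 * dist.ravel())))
w = rng.binomial(1, p_link)

# Dyadic logistic regression (large C, i.e. essentially no penalty) recovers (alpha0, beta0).
logit = LogisticRegression(C=1e6).fit(dist, w)
print("estimated alpha0:", logit.intercept_[0], "estimated beta0:", logit.coef_[0, 0])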

The Erdös-Rényi model. Also known as the Bernoulli random graph model, this very simple model helps to understand the difficulty of reproducing the two stylized facts of empirical social networks (sparsity and clustering). Assume that the function is constant: $f(x_i, x_j) = \alpha_0$. Then:
$$p := \mathbb{P}(W_{ij} = 1 \mid X_i = x_i, X_j = x_j) = \frac{e^{\alpha_0}}{1 + e^{\alpha_0}}.$$

Chandrasekhar (2016) focuses on the following three implications of this model:

1. The expected degree of a node is p(n − 1),

2. The probability that two neighbors of i (say j and k) are linked is p,

3. The probability that nodes i, j and k are mutually linked is p³.

The first implication of the model means that to recover a sparse structure, we need to
think about a sequence p = pn that decreases with n, for example pn = d/(n − 1) where
the expected degree does not change with the number of nodes. So, sparsity means
pn → 0. The second implication means that the clustering coefficient is pn .
Here is the underlying tension. If such a model is sparse, there can be no clustering at
the limit since pn → 0. On the other hand, if we assume a lower bound for the probability
of link formation pn → p > 0 so that clustering does not vanish, the expected degree

becomes at least (n − 1)p, meaning the network becomes dense.

Question: If pn → 0, what does that imply for α0? How do you interpret it in terms of the agents’ behavior?
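The tension can be checked numerically. Below is a minimal sketch (with an illustrative constant expected degree d = 5 of our own choosing) that simulates Erdös-Rényi graphs of increasing size with pn = d/(n − 1) and computes their density and global clustering coefficient; both go to zero as n grows.

# Python sketch (simulated graphs, illustrative expected degree)
import numpy as np

rng = np.random.default_rng(0)

def erdos_renyi(n, p):
    """Symmetric adjacency matrix of an Erdos-Renyi graph G(n, p), no self-links."""
    upper = np.triu(rng.binomial(1, p, size=(n, n)), k=1)
    return upper + upper.T

def global_clustering(W):
    """trace(W^3) over the number of ordered two-paths: share of closed connected triples."""
    W2 = W @ W
    wedges = W2.sum() - np.trace(W2)
    return np.trace(W2 @ W) / wedges if wedges > 0 else 0.0

d = 5                                         # constant expected degree
for n in [50, 200, 800]:
    W = erdos_renyi(n, d / (n - 1))
    density = W.sum() / (n * (n - 1))
    print(n, round(density, 4), round(global_clustering(W), 4))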

Homophily and Degree Heterogeneity. Graham (2017) assumes a model of the form $f(x_i, x_j) = \beta_0 \|x_i - x_j\| + \nu_i + \nu_j$, reflecting both homophily, through the first term, and gregariousness (or degree heterogeneity, i.e. the propensity to form links) of individuals i and j, through the second and third terms. Notice that each individual has a fixed effect νi that reflects its preference to form links and shifts the probability of a given link being formed. The similarity of this fixed-effect formulation with panel data is no coincidence: since each individual participates in n − 1 link-formation decisions, each link decision can be thought of as the analog of a date in the time dimension of a panel. With a Logit structure, the link probability becomes:

$$\mathbb{P}(W_{ij} = 1 \mid X_i = x_i, X_j = x_j) = \frac{\exp(\beta_0\|x_i - x_j\| + \nu_i + \nu_j)}{1 + \exp(\beta_0\|x_i - x_j\| + \nu_i + \nu_j)}.$$

Estimation of β0 in this model is complicated by the existence of the nuisance parameter of dimension n, ν := (νi)i=1,...,n. On the other hand, notice that we have $\binom{n}{2}$ link-formation events. Graham (2017) develops a conditional maximum likelihood method (CMLE) to overcome this difficulty. Denote w the realized adjacency matrix. Write the likelihood of the model:
$$P(W = w \mid X; \beta, \nu) = \prod_{i=1}^{n-1}\prod_{j=i+1}^{n} \frac{\exp(\beta_0\|x_i - x_j\| + \nu_i + \nu_j)^{w_{ij}}}{1 + \exp(\beta_0\|x_i - x_j\| + \nu_i + \nu_j)}.$$

A few manipulations yield:
$$P(W = w \mid X; \beta, \nu) = \exp\left(\beta_0 \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} w_{ij}\|x_i - x_j\|\right)\exp\left(d(g)^T\nu\right)\times\left[\prod_{i=1}^{n-1}\prod_{j=i+1}^{n}\left(1 + \exp(\beta_0\|x_i - x_j\| + \nu_i + \nu_j)\right)\right]^{-1},$$

where d(g) is the degree sequence of the graph represented by the adjacency matrix W. Denote by D the set of all feasible graphs, among G the set of the $2^{\binom{n}{2}}$ undirected graphs, with degree sequence equal to d(g):
$$D = \{v \in G,\; d(v) = d(g)\}.$$

Following the ideas developed by Cox (1958) and Chamberlain (1980) in conditional maximum likelihood models, estimation of β0 can be performed using:
$$P(W = w \mid X, d(g); \beta, \nu) = \frac{\exp\left(\beta_0\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} w_{ij}\|x_i - x_j\|\right)}{\sum_{v\in D}\exp\left(\beta_0\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} v_{ij}\|x_i - x_j\|\right)},$$

which does not depend on ν. The techniques used to compute an estimator of β0 are developed in Graham (2017) and in the associated Python code, available at www.github.com/bryangraham/netrics. The resulting estimator is consistent and asymptotically Normal, albeit at a rate that depends on whether the graph is dense (n−1/2) or sparse (n−1/4).
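For intuition only, here is a minimal sketch (simulated data, illustrative parameter values) of the joint dyadic Logit with node fixed effects, i.e. a plain MLE of (β0, ν). This is not Graham’s conditional MLE, which instead conditions on the degree sequence d(g) and is implemented in the netrics package; the joint MLE can suffer from an incidental-parameter bias when n is small.

# Python sketch (joint MLE with node dummies, not the CMLE)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, beta0 = 100, -1.5                          # illustrative values
x = rng.normal(size=n)
nu = rng.normal(scale=0.5, size=n)            # degree-heterogeneity fixed effects

i, j = np.triu_indices(n, k=1)
dist = np.abs(x[i] - x[j])
p = 1 / (1 + np.exp(-(beta0 * dist + nu[i] + nu[j])))
w = rng.binomial(1, p)

# Design: homophily regressor + one dummy per node (1 if the node belongs to the dyad).
dummies = np.zeros((len(i), n))
dummies[np.arange(len(i)), i] = 1.0
dummies[np.arange(len(i)), j] = 1.0
X = np.column_stack([dist, dummies])

fit = LogisticRegression(C=1e6, fit_intercept=False, max_iter=5000).fit(X, w)
print("estimated beta0:", fit.coef_[0, 0])    # first coefficient = homophily parameter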

Exponential Random Graph Models. Exponential Random Graph Models, or ERGMs, are one of the most popular classes of network models; see Section 6.5 in Kolaczyk (2009) for a statistical approach. Their popularity is explained by the fact that ERGMs are formulated in a manner that allows standard statistical principles to be applied. The name ERGM refers to their similarity with the exponential family in Statistics. An ERGM models the probability of observing a particular graph g, defined by its observed adjacency matrix w, as:
$$P(W = w \mid \theta_0) = \frac{1}{C}\exp\left(\sum_{H}\theta_{0H}\, g_H(w)\right), \qquad \text{(ERGM)}$$

where:

– Each H is a configuration, defined as a set of possible edges among a subset of the nodes in g. Such a subset typically consists of a pair or a triple of nodes, and the sum then runs over all possible pairs and triples of nodes in g.

– $g_H(w) = \prod_{ij\in H} w_{ij}$, so it is one if the configuration H occurs in g and zero otherwise. In the case where H only includes the link between a pair, gH(w) = 1 if there is a link between the two nodes and zero otherwise.

– θ0 = (θ0H)H is the vector of parameters.

– C is the normalizing constant, $C = \sum_{w}\exp\left(\sum_{H}\theta_{0H}\, g_H(w)\right)$.

Question: Show that the Erdös-Rényi model is an ERGM. Hint: take H = {i, j}
and θ0H constant over any H.

Looking at equation (ERGM) suggests estimating θ0 using the MLE. Although well-defined in theory, the computation can be quite complicated in practice due to the prohibitive cost of evaluating the normalizing constant C, which is a sum over $2^{\binom{n}{2}}$ terms. Consequently, two broad approaches have been developed to compute the MLE. The first approach, based on Monte Carlo methods, maximizes a stochastic approximation of the log-likelihood; it can be quite computationally intensive. The second approach estimates θ0 by maximizing a pseudo-likelihood, similar in spirit to the conditional-likelihood technique developed in Graham (2017) to estimate β0.
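To fix ideas, here is a minimal sketch of the pseudo-likelihood (MPLE) approach for an illustrative ERGM with an edge term and a triangle term (the specification and the simulated adjacency matrix are our own examples, not taken from the text): each dyad is modeled by a Logit whose regressor is the change in the triangle count when the dyad is toggled, and the pseudo-likelihood is maximized by an ordinary logistic regression.

# Python sketch (MPLE for an edges + triangles ERGM, illustrative data)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Illustrative "observed" graph: an Erdos-Renyi draw stands in for the data w.
n = 60
upper = np.triu(rng.binomial(1, 0.1, size=(n, n)), k=1)
w = upper + upper.T

i, j = np.triu_indices(n, k=1)
y = w[i, j]

# Change statistics when toggling dyad (i, j):
#  - edge term: always 1 (absorbed by the intercept);
#  - triangle term: number of common neighbors of i and j, i.e. (w @ w)[i, j].
common_neighbors = (w @ w)[i, j].reshape(-1, 1)

# MPLE: logistic regression of the dyad indicators on the change statistics.
mple = LogisticRegression(C=1e6).fit(common_neighbors, y)
print("theta_edge (intercept):", mple.intercept_[0])
print("theta_triangle:", mple.coef_[0, 0])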

8.3 Estimating Peer Effects

A growing body of research in Economics is interested in measuring peer effects, i.e. the influence of a given individual’s reference group on his outcome or behavior. This literature is older than the network-formation literature in Economics, but relatively recent compared to the same literature in Sociology. The underlying idea is that, instead of only maximizing his private utility, the individual may seek to adopt a behavior in conformity with his peers’ behavior. Conformity to the peer group is a powerful explanation for many behaviors, ranging from education to fertility and career choices.

We will slightly change the definition of the adjacency matrix from the previous section. Before, we were only concerned with whether a link existed or not, hence the binary nature of the elements in W. Here, we may also care about the strength of such links, although, as a simplification, we will weight all existing links attached to a particular node equally. Consequently, we assume that we observe the set Fi ⊂ Ng of individual i’s friends. In that case, we will have Wij = 1/|Fi| if j ∈ Fi and Wij = 0 otherwise. We say that an individual i is isolated if Fi is empty, i.e. Wij = 0 for all j.

8.3.1 The Linear Model of Social Interactions

The canonical model in network econometrics to assess social interactions is based on the
linear specification of Manski (1993):

Assumption 8.1 (Linear model of social interactions).
$$Y_i = \alpha + \beta\sum_{j=1}^{n} W_{ij}Y_j + \eta X_i + \gamma\sum_{j=1}^{n} W_{ij}X_j + \varepsilon_i, \quad \text{with } \mathbb{E}(\varepsilon_i \mid X, W) = 0, \tag{8.1}$$
where Yi is the outcome observed for node i (i.e. individual i), Xi is an observed characteristic of dimension 1 (for simplicity), and Wij are the entries of the adjacency matrix that encode the social structure.

From our definition of the adjacency matrix in this context, model (8.1) is a regression
of the individual outcome over his characteristics, the mean of his peers’ outcomes and
the mean of his peers’ characteristics. Before further interpretation, consider model (8.1)
stacked in a matrix:
y = α1 + βW y + ηX + γW X + ε.

Because y appears on both sides, we can solve for it and obtain a reduced form, under the condition that In − βW is non-singular. Notice that this condition is equivalent to det(In − βW) ≠ 0, i.e. zero is not an eigenvalue of In − βW, which (for β ≠ 0) is equivalent to det((1/β)In − W) ≠ 0, meaning that 1/β is not in the spectrum of W. Under this assumption:
$$y = \alpha(I_n - \beta W)^{-1}\mathbf{1} + (I_n - \beta W)^{-1}(\eta I_n + \gamma W)X + (I_n - \beta W)^{-1}\varepsilon.$$

In model (8.1), β captures the endogenous social effect and γ the exogenous social effect. The quantity (In − βW)−1 is referred to as the social multiplier, because the influence of peers’ outcomes propagates shocks through the network. This phenomenon does not occur if environmental or contextual effects (influence through peers’ characteristics) are the main social influence mechanism, i.e. if β = 0.
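A minimal numerical sketch of the social multiplier (toy network and illustrative parameter values of our own choosing): with β ≠ 0 a shock to one individual's characteristic moves every outcome through (In − βW)−1, whereas with β = 0 only that individual and his direct neighbors are affected.

# Python sketch (toy network, illustrative parameters)
import numpy as np

# Toy friendship network on 4 nodes, row-normalized so that rows sum to one.
F = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = F / F.sum(axis=1, keepdims=True)

eta, gamma = 1.0, 0.5                          # illustrative private and contextual effects
n = W.shape[0]

def outcome_response(beta):
    """d y / d X_1: effect on all outcomes of a unit shock to node 1's characteristic."""
    multiplier = np.linalg.inv(np.eye(n) - beta * W)
    return multiplier @ (eta * np.eye(n) + gamma * W) @ np.array([1.0, 0, 0, 0])

print("beta = 0   :", np.round(outcome_response(0.0), 3))   # only node 1 and its neighbors move
print("beta = 0.6 :", np.round(outcome_response(0.6), 3))   # the shock propagates to everyone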
The key subject in the peer effect literature is to distinguish between the social mul-
tiplier and contextual or environmental effects. Model (8.1) is not identified in itself and
requires further assumptions. The next theorem is due to Bramoullé et al. (2009).

Theorem 8.1 (Identification of Peer Effects, (α, β, η, γ)). Suppose that |β| < 1 and ηβ + γ ≠ 0. Also assume that W is such that $\sum_j W_{ij} = 1$ for any non-isolated i. If the matrices In, W and W² are linearly independent, the social effects are identified. Otherwise, and if no individual is isolated, the social effects are not identified.

Proof of Theorem 8.1. Assume that (α, β, η, γ) and (α′, β′, η′, γ′) lead to the same reduced form, i.e. we have, almost surely:
$$\alpha(I_n - \beta W)^{-1}\mathbf{1} = \alpha'(I_n - \beta' W)^{-1}\mathbf{1} \tag{8.2}$$
$$(I_n - \beta W)^{-1}(\eta I_n + \gamma W) = (I_n - \beta' W)^{-1}(\eta' I_n + \gamma' W) \tag{8.3}$$
Multiply equation (8.3) by (In − β′W)(In − βW) and use the fact that, for any real b, (In − bW)−1W = W(In − bW)−1, since W commutes with In − bW and hence with its inverse. Developing yields:
$$(\eta - \eta')I_n + (\eta'\beta - \eta\beta' + \gamma - \gamma')W + (\gamma'\beta - \beta'\gamma)W^2 = 0.$$

As a consequence, if In, W and W² are linearly independent, then:
$$\begin{cases} \eta = \eta' \\ \eta'\beta + \gamma = \eta\beta' + \gamma' \\ \gamma'\beta = \gamma\beta' \end{cases}$$

Consider two subcases:

1. Suppose γ′β ≠ 0. As a consequence of β/γ = β′/γ′, there exists λ ≠ 0 such that (β′, γ′) = λ(β, γ). Thus, the second equation yields ηβ + γ = ηβ′ + γ′ = λ(ηβ + γ), which means λ = 1 because we have assumed ηβ + γ ≠ 0. Therefore we obtain β′ = β and γ′ = γ.

2. Suppose γ′β = 0. Since ηβ + γ ≠ 0, we cannot have β = γ = 0 or β′ = γ′ = 0. So either β = β′ = 0, in which case γ = γ′ by the second equation; or γ = γ′ = 0, in which case β = β′, again by the second equation.

Coming back to equation (8.2) of the reduced form at the beginning, both cases also yield α′ = α since β′ = β.
Next, suppose that In, W and W² are linearly dependent and that no individual is isolated. This last assumption yields that 1 is in the spectrum of W, since W1 = 1 follows from $\sum_j W_{ij} = 1$ for any i. So 1/(1 − β) is an eigenvalue of (In − βW)−1 associated with eigenvector 1. As a consequence, the very first equation of the proof becomes
α(1 − β′) = α′(1 − β). Since In, W and W² are linearly dependent, there exist, for example, two scalars a and b such that W² = aIn + bW. Plugging this into equation (8.3) shows that only three equations need to be satisfied for (α, β, η, γ) and (α′, β′, η′, γ′) to yield the same reduced form. □

Remark 8.1 (Identification in groups). Imagine that, instead of interacting in an arbitrary network, individuals interact in groups (e.g. classrooms) and that each individual is also included when computing the group mean. This is equivalent to saying that there is a partition of the population into (non-overlapping) subsets. In that case, the second part of Theorem 8.1 applies and peer effects are not identified, since W² = W. More generally, Proposition 2 in Bramoullé et al. (2009) states that if individuals interact in groups and all groups have the same size, social effects are not identified. Conversely, if at least two groups have different sizes and if ηβ + γ ≠ 0, social effects are identified.

Brock and Durlauf (2001) have established identification for binary-outcome models with social interactions. We will not deal with the estimation of such models, but notice that model (8.1) requires instruments because of endogeneity: the outcomes of my friends are correlated with my own outcome (which includes my error term).
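For concreteness, here is a minimal simulation sketch of an instrumental-variable estimation of (8.1) in the spirit of Bramoullé et al. (2009): the endogenous regressor Wy is instrumented by the characteristics of friends of friends, W²X. The data-generating process, network, parameter values and the bare-hands 2SLS implementation are our own illustrative choices.

# Python sketch (simulated data, 2SLS with W^2 X as instrument)
import numpy as np

rng = np.random.default_rng(3)
n = 500
alpha, beta, eta, gamma = 1.0, 0.4, 1.0, 0.8       # illustrative true parameters

# Random friendship network, row-normalized; isolated nodes are given one friend.
F = rng.binomial(1, 5 / n, size=(n, n)).astype(float)
np.fill_diagonal(F, 0)
isolated = np.where(F.sum(axis=1) == 0)[0]
F[isolated, (isolated + 1) % n] = 1
W = F / F.sum(axis=1, keepdims=True)

X = rng.normal(size=n)
eps = rng.normal(size=n)
y = np.linalg.solve(np.eye(n) - beta * W, alpha + eta * X + gamma * W @ X + eps)

# 2SLS: regressors R = [1, Wy, X, WX]; instruments Z = [1, X, WX, W^2 X].
R = np.column_stack([np.ones(n), W @ y, X, W @ X])
Z = np.column_stack([np.ones(n), X, W @ X, W @ W @ X])
PZ_R = Z @ np.linalg.solve(Z.T @ Z, Z.T @ R)        # first-stage fitted regressors
coef_2sls = np.linalg.solve(PZ_R.T @ R, PZ_R.T @ y)
print("(alpha, beta, eta, gamma) 2SLS:", np.round(coef_2sls, 3))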

The Perils of Peer Effects. Angrist (2014) warns against mistaking generic clustering (outcomes tend to be correlated within groups) for causal peer effects, a warning issued by Manski (1993) under the name of the reflection problem. From Manski (1993):

This paper examines the “reflection” problem that arises when a researcher
observing the distribution of behaviour in a population tries to infer whether
the average behaviour in some group influences the behaviour of the individ-
uals that comprise the group. The term reflection is appropriate because the
problem is similar to that of interpreting the almost simultaneous movements
of a person and his reflection in a mirror. Does the mirror image cause the
person’s movements or reflect them? An observer who does not understand
something of optics and human behaviour would not be able to tell.

Or, again: “observed behavior is always consistent with the hypothesis that individual
behavior reflects mean reference-group behavior”. Following this line of thought, Angrist
(2014) argues that many significant peer effects are, in fact, spurious.

8.3.2 Estimating the Structure of Social Interactions

We close this chapter on networks with Manresa (2016) on recovering the structure of
social interactions using panel data. The paper is relevant for the class for two reasons:
(i) the goal of the proposed method is to estimate the strength of the social interactions
in a network while not observing the adjacency matrix of said network, (ii) it uses both
the Lasso and the double-selection procedure studied in Chapter 2.

Assumption 8.2 (Linear model with unknown social interactions). Manresa (2016) considers the following model, similar to (8.1) without the peers’ outcomes:
$$Y_{it} = \alpha_i + \eta_i X_{it} + \sum_{j\neq i}\gamma_{ij}X_{jt} + Z_{it}\theta + \varepsilon_{it}, \tag{8.4}$$
where $\mathbb{E}[\varepsilon_{it} \mid X_{j1},\dots,X_{jt}, Z_{i1},\dots,Z_{it}] = 0$ for any i, j, t. αi and ηi are the own intercept and own effect, while γij measures the influence of peer j’s characteristics on individual i’s outcome. Zit are other characteristics whose effect does not depend on the individual. By definition, the parameter γ := (γij)i≠j is high-dimensional, since there are n(n − 1) entries in this vector. Consistent with the observation that many empirical social networks are sparse, it is assumed that $\sum_{j\neq i}\mathbf{1}\{\gamma_{ij}\neq 0\}\le s_i \ll T$ for all i.

Target Parameters. Many parameters can be of interest in this model. The average private effect is defined as $n^{-1}\sum_{i=1}^{n}\eta_i$. Directly related to the structure of social interaction, one can be interested in the average social effect of individual i,
$$M_i := \frac{1}{n-1}\sum_{j\neq i}\gamma_{ji} + \frac{1}{n}\eta_i,$$
that is, the average impact of individual i’s characteristics over the whole network.
If estimating θ is of interest, one can use the double-selection method presented in Chapter 2, treating ηi and the γij as the nuisance parameters. Here, the method consists in (i) regressing Zit on X1t, ..., Xnt using a pooled Lasso, (ii) regressing Yit on X1t, ..., Xnt using a pooled Lasso, and (iii) regressing Yit on Zit and the set of the X1t, ..., Xnt corresponding to a non-zero coefficient in either step (i) or (ii). Estimation of ηi and γij in model (8.4) is then done with a pooled Lasso regression of Yit − Zitθ̂ on X1t, ..., Xnt. A careful analysis of the properties of these estimators is available in the original paper.
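Below is a minimal sketch of steps (i)–(iii) on simulated data. The data-generating process, the pooling construction, the variable names and the fixed penalty levels are illustrative choices of ours, not Manresa's actual implementation (which also uses data-driven penalties).

# Python sketch (double selection with pooled Lassos, simulated data)
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(4)
n, T, theta = 30, 40, 1.0                        # n individuals, T periods (illustrative)

X = rng.normal(size=(T, n))                      # X[t, j] = characteristic of individual j at date t
Z = 0.5 * X[:, 0] + rng.normal(size=T)           # common control, correlated with X_1
gamma = np.zeros((n, n))                         # sparse social interactions
gamma[1, 0], gamma[2, 5] = 0.8, -0.6
Y = X * 1.0 + X @ gamma.T + Z[:, None] * theta + rng.normal(size=(T, n))

# Pooled data: stack the n individual time series; regressors X_{1t},...,X_{nt} are common to all i.
Y_pool = Y.T.reshape(-1)
Z_pool = np.tile(Z, n)
X_pool = np.tile(X, (n, 1))

# Steps (i) and (ii): pooled Lassos of Z and of Y on X_{1t},...,X_{nt}.
sel_Z = Lasso(alpha=0.1).fit(X_pool, Z_pool).coef_ != 0
sel_Y = Lasso(alpha=0.1).fit(X_pool, Y_pool).coef_ != 0
selected = sel_Z | sel_Y

# Step (iii): OLS of Y on Z and the union of the selected columns of X.
design = np.column_stack([Z_pool, X_pool[:, selected]])
theta_hat = LinearRegression().fit(design, Y_pool).coef_[0]
print("theta_hat:", round(theta_hat, 3))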

Application. In Manresa (2016), model (8.4) is a regression of firm i’s log-output on its own lagged R&D spending and its competitors’ (Xit = rdit−1), controlling for the log of lagged labour, the log of lagged capital and other controls (Zit = (lit−1, kit−1, z̃it)). Spillovers (i.e. the network of interest) between firms are not observable, because they are not necessarily related to an input market or a geographic position.

Going Further. This section gave a few ideas about modeling network formation with no claim to exhaustivity. Likewise, we point out several contributions in the econometrics of networks, driven mainly by our readings rather than by our expertise in the domain. Kolaczyk (2009) is a textbook reference that mainly belongs to the statistical literature, while Jackson (2008) views networks from the economist’s point of view. Chandrasekhar and Lewis (2011) study the properties of GMM estimators based on partial network data, that is, data that have been sampled and may be incomplete regarding the nodes and links that constitute a network. Such sampled network data, they argue, can be very misleading when studying peer effects, as they contain non-trivial measurement errors; they use statistical techniques to predict the full network and mitigate the bias. Boucher and Fortin (2015) review econometric problems related to inference with networks.

Chapter 9

Appendix

A Exam 2018
Class documents authorized. Calculator forbidden. Duration: 2 hours.
Documents de cours autorisés. Calculatrice interdite. Durée: 2 heures.
Part I contains 6 questions; Part II contains 4 questions (8 sub-questions); Part III contains 5 questions.

Part I: Questions
1. We want to perform inference on a parameter θ using only one sample splitting of the data into two subsamples A and M. We use the subsample M to train an ML estimator θ̂. We then use the subsample A to evaluate it, leading to the estimator θ̂A. What type of inference can we make? What problem does it raise?

2. Is the use of the Lasso in the first step in Section 1.5 the key ingredient to solve the post-selection
inference problem?

3. Explain intuitively why sample splitting is useful in Causal Random Forest estimation to get consistency of the estimator of the heterogeneous treatment effect. What do we call this property?

4. “The synthetic control estimator does not use the full sample of control units”. Explain and
criticize.

5. What is Leeb and Potscher’s point? Does the result in Theorem 1.2 (asymptotic normality of the immunized estimator) contradict Leeb and Potscher’s analysis? Why?

6. Consider the model: Yi = Di τ0 + Xi> β0 + εi where Di is a binary variable and εi ⊥ (Xi , Di ).


Describe a methodology based on Fisher tests to perform inference over τ0 .

Part II: Exercise 1


We observe an iid sample of the random vector Wi = (Yiobs, Di, Xi⊤)⊤ for i = 1, ..., n. Yiobs is the outcome variable, whose value depends on whether unit i is treated or not. If unit i is treated (Di = 1), then Yiobs = Y1i. If unit i is not treated (Di = 0), then Yiobs = Y0i. Xi is a vector of covariates of dimension pX > 1 which includes an intercept. The index i is dropped when unnecessary. The quantity of interest is the Average Treatment on the Treated, defined as:

τ0 = E [Y1 − Y0 |D = 1] . (ATET)

Define π = P(D = 1) and the propensity score p(X) = P(D = 1|X). We make the following two
assumptions. The Conditional Independence Assumption:

Y0 ⊥ D|X, (CIA)

and the Common Support Assumption:

0 < p(X) < 1. (CommonSup)

1. (a) Define:
$$m(W_i, \tau, p) = \left(D_i - (1 - D_i)\frac{p(X_i)}{1 - p(X_i)}\right)Y_i^{obs} - D_i\tau.$$
Verify that E[m(Wi, τ0, p)] = 0.

(b) Suppose you have an estimate of the propensity score p̂. Suggest an estimator τ̂ of τ0 .
2. (a) Assume the propensity score is given by a Logit, i.e. p(X) = exp(X > β0 )/(1 + exp(X > β0 )).
The moment condition from question 1 is denoted m(Wi , τ, β) from now on. Compute
E [∂β m(Wi , τ0 , β0 )].
(b) Consider a small dimensional case where pX , the dimension of X, is small and fixed. What is
the most efficient method to estimate β0 ? Give the formula for the corresponding estimator.
Will the resulting estimator of τ0 , τ̂ be asymptotically Normal?
(c) Consider the high-dimensional case where β0 is estimated using:
$$\min_{\beta\in\mathbb{R}^{p_X}}\; -\frac{1}{n}\sum_{i=1}^{n}\left[D_i X_i^\top\beta - \log\left(1 + \exp(X_i^\top\beta)\right)\right] + \lambda\|\beta\|_1,$$
where λ > 0 is a tuning parameter. How would you call such a method? Will the estimator of τ0 be asymptotically normal in that case?
3. Assume that the outcome under no treatment is given by Y0 = X > γ0 + ε with ε ⊥ X and Eε = 0.
(a) Show that E(DX(Y0 − X > γ0 )) = 0.
(b) Suggest a moment condition ψ which is orthogonal. Prove that it is.
4. Based on the previous questions, give an estimator τ̌ of τ0 which is asymptotically Normal even in the high-dimensional case. Which theorem do you use?

Part III: Exercise 2


Consider the following utility model of an individual i in electoral district t who makes a choice between two parties L and R:
$$U_{L,i,t} = g(X_t^\top\beta_0) + \tau_0 D_t + \xi_{L,t} + \epsilon_{i,t,L}, \tag{A.1}$$
$$\mathbb{E}(\epsilon_{i,t,L}) = 0, \quad \xi_{L,t}\perp(X_t, Z_t), \quad \text{and } \mathbb{E}(\xi_{L,t}^2 \mid X_t) = \sigma^2,$$

UR,i,t = 0. Xt ∈ RpX is a random vector measuring the characteristics of the party’s candidate in district t, Dt is the amount of advertising spent by the party in district t, ξL,t is a district-specific unobserved shock (e.g. the candidate’s reputation), and εi,t,L is an idiosyncratic unobserved shock distributed with cdf F(t) = [1 + e−t]−1. Xt is considered exogenous while Dt is endogenous, and Zt is an instrumental variable. g(·) is an infinitely differentiable function on R of the index Xt⊤β0.

Assume the following model for the first-stage equation:
$$D_t = f_0(X_t, Z_t) + u_t, \quad u_t\perp(X_t, Z_t), \tag{FS}$$
where
$$f_0 \in F_{p,q} := \left\{f : f(x,z) = \sum_{i=1}^{p}\gamma_{0,i}\mathbf{1}\{x\in C_{a_i,r}\} + \sum_{i=1}^{q}\delta_{0,i}\mathbf{1}\{z\in C_{b_i,r}\},\ a_i\in\mathbb{R}^{d_X},\ b_i\in\mathbb{R}^{d_Z}\right\},$$
and where $C_{a_i,r}$ and $C_{b_i,r}$ are hypercubes in $\mathbb{R}^{d_X}$ and $\mathbb{R}^{d_Z}$ respectively, with centers {ai} and {bi} and sides of length r.

We observe an i.i.d. sample $(W_t)_{t=1}^n = (S_t, X_t, D_t, Z_t)_{t=1}^n$ over the n electoral districts, where St ∈ (0, 1) is the observed share of votes for candidate L in district t.

A. Estimation of the First Stage Equation.

1. Assume that p < n and q < n and that the true function f0 has only few zero coefficients {γ0,i }pi=1
and {δ0,i }qi=1 in its decomposition. Propose a particularly adapted consistent estimator of the
regression function E (D|X = x, Z = z). Can we use it in the case where the assumption p < n
and q < n does not hold? Explain.
2. Assume now that p > n and q > n and sparsity of the coefficients {γ0,i }pi=1 and {δ0,i }qi=1 in the
decomposition of f0 . Give a particularly adapted consistent estimator of the regression function
E (D|X = x, Z = z) and the estimating equation.

B. Estimation of τ0 .

3. Write the estimating equation, starting from (A.1), using the dependent variable S̃t := ln(St /(1 −
St )).
4. Find two functions Q1 and Q2 such that
$$m(W_t, \eta, \tau_0) = \left(\tilde S_t - Q_1(\eta, Y_t, D_t, X_t)\right)Q_2(\eta, Z_t, X_t),$$
where η is a nuisance parameter that has to be defined, satisfies:
$$\mathbb{E}\left(m(W_t, \eta, \tau_0)\right) = 0 \tag{A.2}$$
$$\mathbb{E}\left(\partial_\eta m(W_t, \eta, \tau)\right) = 0, \quad \forall \tau\in\Theta, \tag{A.3}$$
where Θ is a compact neighborhood of τ0. Similarly to the course, you should use (A.1), (FS), and an additional linear equation of your choice making more precise the correlation structure between the instruments and the regressors.
5. Give the conditions on the function g under which the estimator τ̂ defined using (A.2) is asymptotically normal, using only a theorem from the course.

Exam 2018: Elements of Correction
Part I: Questions
1. In this case, we only have conditional confidence intervals:
$$P\left(\theta_A \in [L_A, U_A] \mid \text{Data}_A\right) = 1 - \alpha + o_P(1),$$
where $[L_A, U_A] := \left[\hat\theta_A \pm \Phi^{-1}(1-\alpha/2)\,\hat\sigma_A\right]$. This does not take into account the variability introduced by the sample splitting, which prevents generalization to the distribution of the whole data.
2. No, the double-selection procedure is the key. Although notice that the Lasso is “good enough”
to be used in the first steps.
3. If the same sample were used to estimate the split positions and the values on the leaves, the algorithm would tend to separate two leaves that have heterogeneous (relatively high and low) treatment effects in this sample, thus leading to a biased estimate if we use the same sample to evaluate it. If we use another sample for evaluation, it limits over-fitting and ensures consistency.
4. The vector of synthetic control weights is sparse in general, meaning that only a few entries are non-zero. As a consequence, the corresponding control units do not take part in the counterfactual. On the one hand, it does not use the full sample (a possible loss of efficiency?), but on the other, it discards units that do not help reproduce the treated unit.
5. Leeb and Potscher’s point is that performing inference after a selection step is more complicated than it looks, because post-selection estimators do not enjoy “nice properties” (asymptotic normality for example). Theorem 1.2 is not in contradiction: it is a solution to the problem. It shows that in cases where the estimator is immunized, inference based on the Normal distribution is still possible. In many cases, it means adding another selection step (hence the name “double selection”).
6. Denote Dobs = (D1 , . . . , Dn ) the observed vector of treatment assignment and τ̂ obs the corre-
sponding OLS estimator. The Fisher procedure is as follows:

(a) For b = 1, ..., B, reshuffle the treatment assignment at random, compute the OLS estimator
of τ0 , τ̂b and compare it with the observed statistics τ̂ obs .
(b) Compute Fisher’s p-value:
$$\hat p := \frac{1}{B}\sum_{b=1}^{B}\mathbf{1}\left\{|\hat\tau_b| \ge |\hat\tau^{obs}|\right\}.$$

(c) Reject H0 if p̂ is below a pre-determined threshold: the observed treatment allocation gives
an effect which is abnormally large.
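A minimal sketch of this randomization (Fisher) test on simulated data (the data-generating process and the number of reshuffles B are illustrative choices, not part of the exam):

# Python sketch (Fisher permutation test, simulated data under H0)
import numpy as np

rng = np.random.default_rng(5)
n, p, tau0 = 200, 5, 0.0                        # simulate under H0: no treatment effect
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
D = rng.binomial(1, 0.5, size=n)
y = D * tau0 + X @ rng.normal(size=p + 1) + rng.normal(size=n)

def ols_tau(d):
    """OLS coefficient on the treatment dummy, controlling for X."""
    R = np.column_stack([d, X])
    return np.linalg.lstsq(R, y, rcond=None)[0][0]

tau_obs = ols_tau(D)
B = 999
tau_perm = np.array([ols_tau(rng.permutation(D)) for _ in range(B)])
p_value = np.mean(np.abs(tau_perm) >= np.abs(tau_obs))
print("tau_obs:", round(tau_obs, 3), "Fisher p-value:", round(p_value, 3))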

Part II: Exercise 1


cf. Computer Session 1.

Part III: Exercise 2


Question 1: Regression honest random forests (as D is continuous), with random splits, are particularly adapted to the shape of the basis functions of Fp,q (hypercubes). This estimator can only be used in the low-dimensional case, because the rate of convergence n−1/(1+pα2δ) does not allow for p + q ≫ log(n).

Question 2: In this case the LASSO using the transformed regressors
$$\tilde X_{t,i} = \mathbf{1}\{X_t \in C_{a_i,r}\}, \qquad \tilde Z_{t,i} = \mathbf{1}\{Z_t \in C_{b_i,r}\},$$
and thus the linear model
$$D_t = \gamma_0^\top \tilde X_t + \delta_0^\top \tilde Z_t + u_t, \qquad \tilde X_t\in\mathbb{R}^{p},\ \tilde Z_t\in\mathbb{R}^{q},$$
is particularly adapted. This can be done using the estimator
$$(\hat\gamma_0, \hat\delta_0) \in \operatorname*{argmin}_{(\gamma_0,\delta_0)\in\mathbb{R}^{p+q}}\; \frac{1}{n}\sum_{t=1}^{n}\left(D_t - \tilde X_t^\top\gamma_0 - \tilde Z_t^\top\delta_0\right)^2 + \frac{\lambda}{n}\left\|\hat\Upsilon(\gamma_0,\delta_0)\right\|_1,$$
where
$$\hat\Upsilon(\gamma_0,\delta_0) = \sum_{j=1}^{p}\hat\Upsilon_j\gamma_{0,j} + \sum_{j=1}^{q}\hat\Upsilon_{j+p}\delta_{0,j},$$
and the columns $\hat\Upsilon_j$ of $\hat\Upsilon\in\mathcal{M}_{p+q,p+q}(\mathbb{R})$ are penalty loadings.

Question 3: Classically, using the logistic distribution, we have
$$\tilde S_t := \ln\left(\frac{S_t}{1 - S_t}\right) = g(X_t^\top\beta_0) + \tau_0 D_t + \xi_{L,t}.$$

Question 4: We add the following linear equation:
$$\tilde Z_t = \Pi\tilde X_t + \zeta_t, \quad \zeta_t\perp X_t, \quad \Pi\in\mathcal{M}_{q,p}(\mathbb{R}).$$
Thus, we have
$$D_t = \tilde X_t^\top\gamma_0 + \tilde X_t^\top\Pi^\top\delta_0 + u_t + \zeta_t^\top\delta_0 = \tilde X_t^\top(\gamma_0 + \Pi^\top\delta_0) + \rho_t^d,$$
thus
$$D_t = \tilde X_t^\top\nu_0 + \rho_t^d, \quad \rho_t^d\perp X_t. \tag{A.4}$$
Denote the nuisance parameter η = (β0, ν0, δ0, γ0). Consider
$$m(W_t, \eta, \tau_0) = \left(\tilde S_t - \tilde X_t^\top(\tau_0\nu_0) - g(\tilde X_t^\top\beta_0) - \tau_0\left(D_t - \tilde X_t^\top\nu_0\right)\right)\left(\tilde X_t^\top\gamma_0 + \tilde Z_t^\top\delta_0 - \tilde X_t^\top\nu_0\right);$$
then both equations are satisfied, and in particular
$$\mathbb{E}\left[\partial_{\beta_0} m(W_t, \eta, \tau_0)\right] = -\mathbb{E}\left[\zeta_t^\top\delta_0\, \tilde X_t\, g'(\tilde X_t^\top\beta_0)\right] = 0 \quad (\text{using } \zeta_t\perp X_t),$$
$$\mathbb{E}\left[\partial_{\nu_0} m(W_t, \eta, \tau_0)\right] = -\mathbb{E}\left[\tilde X_t\, \xi_{L,t}\right] = 0.$$

Question 5: In this course, asymptotic normality of the estimator τ̂ is proved for an Affine-Quadratic model, which imposes that:
1. either g(·) is affine, which makes the model identical to the one in the course;
2. or g(·) is quadratic, which is allowed by the theorem, but requires that in the first stage we have an estimator of β0 in a sparse non-linear index model.

B Exam 2019
Class documents authorized. Calculator forbidden. Duration: 2 hours.
Documents de cours autorisés. Calculatrice interdite. Durée: 2 heures.
Part I contains 6 questions; Part II contains 10 questions (not counting sub-questions).

Part I: Questions
1. Considering the setup of Chapter 4, is the synthetic control estimator a consistent estimator of
the treatment effect on the treated? Explain why or why not.
2. What is a sparse graph? Explain using concepts seen in class and give an economic or sociological
example.
3. What are the main differences between random forests and causal random forests? How is it
implemented in practice?

4. How would you modify the standard LASSO estimation procedure when errors are non-Gaussian
and heteroscedastic, if you want to obtain the same rates of convergence (up to a constant)?
5. Describe the best linear predictor of the CATE using two machine learning proxies m0 and T of,
respectively, E [Y |X, D = 0] and the CATE.

6. In which case(s) do you prefer using a random forest instead of a LASSO and vice-versa?

Part II: Gender Wage Gap Heterogeneity


We are interested in measuring the gender wage gap as defined by the relative difference in pay that
emerges between men and women if one controls for the effects of observable characteristics. We observe
an iid sample of the random vector (ln Wi , Fi , Xi )i=1,...,n where ln Wi is the log-weekly wage, Fi is a
dummy variable equal to one if individual i is a female, Xi is a vector of observed characteristics of
dimension p which can be (very) large.

1. How do you interpret the quantity E[ln Wi |Xi , Fi = 1] − E[ln Wi |Xi , Fi = 0] ?


2. Consider the model

ln Wi = α + θFi + Xi0 β + εi , with E[εi |Xi , Fi ] = 0 and kβk0 ≤ s << p. (A.6)

(a) Considering the problem we are studying, what would you like to include in Xi ?
(b) Give a (simple) consistent estimator of θ in the case where p is a small integer (for example
p = 6), as n → ∞.
(c) Is it still a consistent estimator if p > n and/or p → ∞ ? If you answer no, propose a
consistent estimator in that case.
(d) Show that E[ln Wi |Xi , Fi = 1]−E[ln Wi |Xi , Fi = 0] = θ. Do you think that it is a reasonable
assumption?
3. In order to deepen analysis, we consider the model

ln Wi = α + θ(Zi )Fi + Xi0 β + εi , with E[εi |Xi , Fi ] = 0 and kβk0 ≤ s << p, (A.7)

where θ(Zi) measures an effect that depends on some covariates Zi ⊂ Xi. Specifically, we assume that
$$\theta(z) = \sum_{k=1}^{K}\theta_k z_k.$$

(a) “Model (A.7) allows to study an heterogeneous wage gap”. Do you agree or disagree? Justify
(a formula or two would be welcome).
(b) Rewrite model (A.7) as a linear regression model. What are the corresponding Normal
equations?
(c) Assuming that p > n and p → ∞ but K and s are small integers, how could you estimate
consistently (θ1 , . . . , θK )? In your answer, you will explicitly write down an immunized
moment condition ψ for (θ1 , . . . , θK ) and add the necessary assumptions.
4. Tables 7-10 in the Appendix A are extracted from Bach, Chernozhukov and Spindler (2018). They
display estimates of (θ1 , . . . , θK ) based on Model (A.7) obtained by the method in Q3 on a US
sample of college graduates. Interpret three rows of your choice among these four tables.
5. From these four tables, what do you see as the main problem to perform inference in this context?
6. One other way to model heterogeneity in the wage gap is to use causal random forests. We assume in the next two questions that (Xi)ni=1 are i.i.d. and distributed uniformly, Xi ∼ U([0, 1]p). Then, at some point x in the support of Xi, we define the causal random forest as
$$\hat\mu(x; X_1,\dots,X_n) = \binom{n}{s}^{-1}\sum_{1\le i_1<\dots<i_s\le n} T(x; X_{i_1},\dots,X_{i_s}),$$
where
$$T(x; X_{i_1},\dots,X_{i_s}) = \sum_{i\in\{i_1,\dots,i_s\}}\alpha_i(x)\ln W_i, \qquad \alpha_i(x) = \frac{\mathbf{1}\{X_i\in L(x)\}}{s\,|L(x)|},$$
L(x) is the leaf of the tree T containing x, |L(x)| its Lebesgue measure, and s ∈ [n/2, n) is the fixed size of the subsamples.
Assuming that the regression function µ : x → E[ln Wi | Xi = x] is Lipschitz with constant C and that the construction of the leaves L is independent of the sample (Xi)ni=1, show the following inequality:
$$\left|\mathbb{E}\left[\hat\mu(x; X_1,\dots,X_n)\right] - \mu(x)\right| \le C\,\mathrm{Diam}(L(x)), \tag{A.8}$$
where Diam(L(x)) is the diameter of the leaf containing x.
7. Explain, from (A.8), what high-level condition we may enforce on Diam(L(x)) to obtain consistency. Do standard random forests achieve this condition, and why? How is this implemented in practice in the causal random forest of Athey and Wager?

8. For any given ML proxy, we form five groups Gk, for k ∈ {1, . . . , 5}, among the population, based on the predicted outcome T(Xi), using the splits Ik based on the quantiles
$$I_k := [\ell_{k-1}, \ell_k], \quad \text{where } \ell_k = F_{T(X_i)}^{-1}\left(\frac{k}{5}\right),$$
and $F_{T(X_i)}^{-1}$ is the quantile function of T(Xi). Using Figure 9.1 and Table 9.3, give your interpretation of the heterogeneity in the wage gap and compare it with the interpretation made in Question 4 from Tables 7-10 in the Appendix. Describe explicitly the differences in the nature of the parameter of interest and their consequences for the interpretation.
9. (Bonus) We want to take into account selection in the labour market participation. Explain how
would you model it and give a potential estimation procedure if the selection equation depends
upon a high dimensional sparse a priori unknown set of variables.

Appendix A: Tables for Part II, Questions 4 and 5

Appendix B: Results for Part II, Questions 8-10
We denote by
$$\hat\Lambda = \hat\beta_2^2\,\widehat{\mathrm{Var}}(T(X)) \quad \text{and} \quad \bar\Lambda = \sum_{k=1}^{K}\hat\gamma_k^2\,\mathbb{P}(T(X)\in I_k),$$
where β̂2 is the estimator of the slope of the best linear predictor and γ̂k the estimator of the sorted group average treatment effects (GATES). For any given ML proxy, we form five groups Gk, for k ∈ {1, . . . , 5}, among the population based on the predicted outcome T(Xi), using the splits Ik based on the quantiles
$$I_k := [\ell_{k-1}, \ell_k], \quad \text{where } \ell_k = F_{T(X_i)}^{-1}\left(\frac{k}{5}\right),$$
and $F_{T(X_i)}^{-1}$ is the quantile function of T(Xi).

        Elastic Net   Boosting   Nnet    Random Forest
Λ̂       0.046         0.040      0.043   0.055
Λ̄       0.120         0.104      0.108   0.109

Table 9.1: Performance measures for the GATES and the best linear predictor, for the four ML methods used on log wages, based on 100 splits.

            Random Forest                       Elastic Net
            β1                β2                β1                β2
Log wage    -0.207            0.810             -0.181            0.686
90% CI      (-0.234,-0.181)   (0.609,1.010)     (-0.208,-0.155)   (0.538,0.838)

Table 9.2: Estimated constant β1 and slope β2 of the best linear predictor for the two best methods, based on 100 splits, according to the selection procedure based on Λ: Random Forest and Elastic Net.

Figure 9.1: Estimated GATES (sorted group average treatment effects) with robust confidence intervals at 90% for the two best ML methods used, based on 100 splits. [Two panels, Random Forest and Elastic Net, plot the treatment effect against the group by heterogeneity score (1 to 5), together with the ATE and GATES estimates and their 90% confidence bands; the plot itself is not reproduced here.]

Random Forest Elastic Net
Most Affected Least Affected Difference Most Affected Least Affected Difference
On log wage
Age 31.47 34.36 -2.826 31.49 33.54 -2.044
(31.21,31.73) (34.10,34.62) (-3.196,-2.456) (31.22,31.75) (33.27,33.81) (-2.427,-1.660)
Nb. Ch-19y. 0.263 0.831 -0.566 0.237 0.814 -0.586
(0.238,0.287) (0.807,0.856) (-0.602,-0.530) (0.212,0.262) (0.790,0.838) (-0.621,-0.551)
Exper. 9.060 14.70 -5.634 9.238 14.06 -4.771
(8.793,9.328) (14.43,14.96) (-6.004,-5.258) (8.948,9.528) (13.78,14.34) (-5.185,-4.358)
Table 9.3: Estimated average characteristics for the most and least affected groups E[Xk |G5 ] and
E[Xk |G1 ] based on 100 splits for the two variables of age (Age), number of children under 19 years
old (Nb. Ch.-19y), and years of work experience (Exper.) with robust confidence intervals at 90%
for the ML Methods used. The “least affected” correspond to the group G1 and “most affected”
correspond to the group G5 .

Exam 2019: Elements of Correction


Part I: Questions
1. No, it is not. This is because there is only one treated unit, so no LLN type result applies.
2. A sparse graph is a graph for which its density goes to zero as its number of nodes grows. Example:
friendship network.
3. Causal random forests aim to estimate a treatment effect consistently, whereas random forests estimate a regression function and aim at minimising the prediction error (often in ℓ2 norm, or MSE). This has consequences on the form of the estimator, with causal random forests requiring the honesty property to be consistent.
4. To take into account non-Gaussian and heteroscedastic errors, the ℓ1 penalty in the standard LASSO estimation is modified using penalty loadings that are designed so that the results of moderate deviation theory can be applied. This yields a two-step estimation procedure where these loadings are initialised with the identity matrix, then updated using the estimated error terms of the first step.
5. The best linear predictor of the CATE is the linear projection of an unbiased signal of the CATE on the linear vector space generated by T. The meaning of this BLP thus depends on the performance of T: if T fits the CATE well, then the BLP slope coefficient will be close to one and we learn about features of the CATE by looking at T.
6. You want to use a LASSO if you are willing to assume a linear structure for which coefficients
are sparse (only a small number of non-zeros). You would rather use a random forest if you are
willing to assume that the regression function is piece-wise constant.

Part II: Gender Wage Gap Heterogeneity


1. E[ln Wi |Xi , Fi = 1] − E[ln Wi |Xi , Fi = 0] is the average wage gap between men and women for the
population with characteristics Xi .
2. (a) Xi can contain many things: hours worked, experience, experience squared, age, type of
education, number of years of education, geographic localisation, nationality, marital status,
number of young kids, number of kids in general, industry, psychological traits such as
conscientiousness and openness etc.
(b) In that case, a simple OLS estimator works thanks to the exogeneity assumption.

(c) No, it is not. Considering the sparse structure, you want to use the double-selection procedure seen in class, using a LASSO in the first two steps – a brief description of the procedure was needed here.
(d) E[ln Wi |Xi, Fi = f] = θf + Xi′β. It means that the wage gap is constant across the support of Xi, which is probably unreasonable.

3. (a) It is true. Indeed, in that case E[ln Wi |Xi, Fi = 1] − E[ln Wi |Xi, Fi = 0] = θ(Zi) = $\sum_{k=1}^{K}\theta_k Z_{i,k}$. So the wage gap varies by θk percentage points when Zi,k varies by one unit. Since the overall wage gap is negative, a positive value of θk means that (for example) the wage gap is smaller than baseline in the population for which Zk = 1.
(b) Using the notation θ = (θk)k=1,...,K, we have:
$$\ln W_i = \alpha + F_i Z_i'\theta + X_i'\beta + \varepsilon_i,$$
so we have a linear model with p + K covariates (FiZi′, Xi′), and the p + K Normal equations are
$$\mathbb{E}\left[(\ln W_i - \alpha - F_i Z_i'\theta - X_i'\beta)(F_i Z_i', X_i')'\right] = 0.$$
If we only care about θ, we only need
$$\mathbb{E}\left[(\ln W_i - \alpha - F_i Z_i'\theta - X_i'\beta)F_i Z_i\right] = 0,$$
considering β as a nuisance parameter.


(c) An immunized moment condition ψ for (θ1, . . . , θK) will take the form
$$\mathbb{E}\left[(\ln W_i - \alpha - F_i Z_i'\theta - X_i'\beta)(F_i Z_i - \gamma' X_i)\right] = 0,$$
with the derivatives with respect to β equal to zero. It means using the “double-selection” procedure, but with K parameters of interest. It requires K + 2 steps:
i. The first K steps consist in regressing each element of Zi on Xi, for the sub-sample of women, using a Lasso;
ii. The (K + 1)-th step is a LASSO regression of ln Wi on Xi;
iii. The last step is a regression of ln Wi on FiZi and the union of all the elements of Xi selected previously.

4. Example: Compared to baseline, having a child aged 18 or younger increases the wage gap by 5 pp (the wage gap is more negative for them). So women who have a child aged 18 or younger earn 5 pp less relative to men than other women do.

5. There is a multiple testing problem. (These tables already correct for multiple testing, but you
could not know that).

6. Use that E[µ̂(x; X1, . . . , Xn)] = E[T(x; X1, . . . , Xs)]. Thus, we have
$$\begin{aligned}
\left|\mathbb{E}\left[\hat\mu(x; X_1,\dots,X_n)\right] - \mu(x)\right| &= \left|\mathbb{E}\left[T(x; X_1,\dots,X_s)\right] - \mu(x)\right| \\
&= \left|\mathbb{E}\left[\sum_{i\in\{i_1,\dots,i_s\}}\frac{\mathbf{1}\{X_i\in L(x)\}}{s|L(x)|}\ln W_i\right] - \mu(x)\right| \\
&= \left|\mathbb{E}\left[\ln W_i \mid X_i\in L(x)\right]\,\mathbb{E}\left[\sum_{i\in\{i_1,\dots,i_s\}}\frac{\mathbf{1}\{X_i\in L(x)\}}{s|L(x)|}\right] - \mu(x)\right| \\
&= \left|\mathbb{E}\left[\ln W_i \mid X_i\in L(x)\right] - \mathbb{E}\left[\ln W_i \mid X_i = x\right]\right| \le C\,\mathrm{Diam}(L(x)),
\end{aligned}$$
where the third equality uses the independence of the leaves from the sample, the fourth uses that Xi ∼ U([0, 1]p), so that P(Xi ∈ L(x)) = |L(x)| and the expected sum of the weights equals one, and the final inequality uses the Lipschitz property of µ.

7. We choose these two methods based on the performance indicators Λ̂ and Λ̄, which measure how much heterogeneity is captured by the procedure. The table indicates clearly that Random Forest and Elastic Net are the best ones. From Table 9.2 we can see that the average of E[ln Wi |Xi, Fi = 1] − E[ln Wi |Xi, Fi = 0], which is β1, is negative, so women earn on average less than men, but the slope of the BLP is significantly positive and close to 1: there is heterogeneity, and its profile is quite well described by the Elastic Net and Random Forest proxies.
8. From Figure 9.1 and Table 9.3 we see that there is a group of women for which there is no wage gap; they are women with fewer children and less experience than average. Here, the parameter of interest depends on the accuracy of the ML proxy, as well as on the splits. We can thus only learn about features of E[ln Wi |Xi = ·, Fi = 1] − E[ln Wi |Xi = ·, Fi = 0] (heterogeneity, the subgroups that benefit the least and the most and their characteristics) and not about this quantity itself, as previously done. As this Generic ML procedure depends on sample splitting, the p-values should be adapted to take into account this additional randomness.
9. (Bonus) We want to take into account selection in the labour market participation. Explain how
would you model it and give a potential estimation procedure if the selection equation depends
upon a high dimensional sparse a priori unknown set of variables.

C Exam 2020
Expected duration: around 2.5 hours. Turn in before: May 6 2020, 11pm.
Durée indicative: 2 heures 30 mins. Date de rendu: 6 mai 2020, 23h.
Part I contains 3 questions; Part II contains 12 questions (not counting sub-questions); Part III contains
3 questions.

Part I: Questions
1. Give three advantages to using the synthetic control method when using it is appropriate. Explain
briefly these advantages.
2. Justify why the very-many-instruments case in treatment effect estimation with endogeneity is a
frequent context. What is the solution that we described during the course?
3. What is the regularization bias? Can it exist in a small-dimensional case (p < n)?

Part II: Problem 1


During a drought in the southeastern United States in 2007, pro-social leaflets encouraging water conservation were randomly sent by mail to 35,000 out of 106,000 households. The outcome of interest is water consumption between June and September 2007 (after the pro-social campaign), in thousands of gallons. The goal of the analysis is to study whether there is heterogeneity in the treatment effect and whether the treatment has a higher impact on households who (i) vote more often or (ii) are considered either Democrats or Republicans.
D is a dummy variable that takes the value 1 if the household has received a water conservation message
and 0 otherwise. Recall that Y1 and Y0 denotes the two random variables representing the potential
water consumption between June and September 2007 respectively with and without treatment. Y =
Y0 + D(Y1 − Y0 ) denotes the observed water consumption. X denotes a set of characteristics, such as past
water consumption, an indicator of being registered to vote, whether the property is rented or owned,
the age and value of the property, the age of the owner, etc. All variables are measured at the household
level.
p(X) denotes the probability of being treated and we use the notation w(X) = 1/(p(X)(1 − p(X))).
In order to estimate the relevant effects, we use the Generic Machine Learning methodology. For all the
reported results, 30 different splits of the data between a main and auxiliary sample are considered.

1. T(X) denotes the machine learning proxy resulting from a given algorithm, i.e. the prediction of the conditional average treatment effect τ(X) for a household with characteristics X. Consider the following regression on the main sample:

w(X)(D − p(X))Y = β1 + β2 (T (X) − E[T (X)]) + ε. (A.9)

(a) In this regression, what is β1 ?


(b) In this regression, what is β2 ? Explain how it can help solve the question of treatment effect
heterogeneity.

For questions 2-5, your answers must be backed by statistical evidence (p-values, etc.) when
possible.

2. We train four different algorithms: an elastic net, a gradient boosting machine, a neural network and a random forest. Table 9.4 reports the statistic Λ = |β̂2|² V̂(T(X)), where β̂2 has been estimated from the above regression, for each algorithm.
(a) Explain how and why the statistics Λ can help you choose the best of all four algorithms.

(b) According to Table 9.4, which algorithm is the best?

3. Table 9.5 reports the results (estimator, 90% confidence interval and p-values) from the regression
in question (1) for the two best algorithms.

(a) Does the treatment have an effect?


(b) Is there heterogeneity in that effect?

4. For a given ML proxy and k = 1, . . . , 5, we define the group Gk = 1 {`k−1 ≤ T (X) < `k } with
quantiles −∞ = `0 ≤ `1 ≤ . . . ≤ `5 = +∞ such that the population is split in five folds of 20%
on the basis of a ranking of households using the ML predictor. If for a household G1 = 1, it
is deemed “most affected”. If for a household G5 = 1, it is deemed “least affected”. Table 9.6
reports estimates for the treatment effect of the least and most affected population, as well as the
difference.

(a) Write down the regression equation that allowed to obtained these results. Explain how it
has been estimated.
(b) Does the treatment have an effect on every household?
(c) Is there a difference in treatment effect between the most and least affected households?

5. We want to see if most and least affected households have different characteristics, in order to
answer the initial question. Table 9.7 reports the result.

(a) How do you think this table has been obtained?


(b) Are households who vote more often more likely to respond to water-saving incentives?
(c) Are households who vote for Democrat or Republican candidates more often more likely to
respond to water-saving incentives?

We now focus on estimating the Conditional Average Treatment Effect (CATE) function

τ (x) = E[Y1 − Y0 |X = x] = µ1 (x) − µ0 (x),

where µj (x) = E[Yj |X = x], j = 0, 1.

6. Assume that you have the following model, for j = 0, 1:
$$Y_j = X^\top\alpha_j + \epsilon_j, \quad \mathbb{E}[\epsilon_j \mid X] = 0,$$
where X is high-dimensional (dimension p ≫ n, the number of observations). Give the formula for the LASSO estimators of α1 and α0. Propose an estimator of the CATE based on these estimators. Justify intuitively why, in practice, it does not have good properties.
7. What is the “solution” that has been proposed to handle this issue, when p < n, in the Causal
Random Forest estimator (CRF hereafter)?
8. We consider the model of this randomized control trial (RCT), where the treatment allocation is random, D ⊥ X, and
$$Y = X^\top\gamma + D\tau(X) + \epsilon, \quad \epsilon\perp(D, X), \tag{A.10}$$
where τ(X) is assumed to be linear in X, which is high-dimensional. We base our estimator for τ on
$$(\hat\beta, \hat\delta) = \operatorname*{argmin}_{\beta,\delta}\left\{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - X_i^\top\beta - (D_i - \mathbb{E}[D_i])X_i^\top\delta\right)^2 + \lambda_\beta\|\beta\|_1 + \lambda_\delta\|\delta\|_1\right\}. \tag{A.11}$$
Identify γ and τ in terms of β and δ and give the estimator of τ based on (β̂, δ̂). Write down the estimating moment equations that we use in (A.11).

9. What do we call β in (A.11)? In this RCT context, is the estimator based on (A.11) immunized?
Prove it.
10. Justify intuitively why such an estimator solves the problem mentioned in questions 6 and 7.
11. Give a context where this estimator of the CATE is more adapted to the CRF estimator and
another context where the CRF is more suited.
12. Coming back to the application, one problem is that randomisation has been done at the water-
meter routes level and not at the household level. This could potentially yield selection insofar
as households in the same neighbourhood (sharing a water-meter route) have potentially similar
water-consumption behaviours and reactions to the treatment.
We want to control for that using the following model:
$$D = Z^\top\gamma + \zeta, \quad Z\perp\zeta, \quad Z\perp\epsilon,$$
where Z are available auxiliary variables (e.g. median/average income at the neighbourhood level, median/average water consumption, the rate of owner-occupied houses, etc.) and ε is the residual in (A.10). Write down how you would modify (A.11) to take that into account in a way that your estimator is immunized. Show this last point.

Part III: Problem 2


There is a disease in a country which affects the individuals who get it in different ways: some are unaffected, but some develop a severe form, S = 1. We know Ps = P(S = 1) in the population. Developing a severe form is a characteristic of the individual: for a given individual, there is nothing random in developing a severe form conditional on getting the disease. Those who get the severe form need access to a special unit service. ηa is the number of pieces of equipment per person in the population. There is a large sample of people who got the disease in the past, on which we observe whether they developed a severe form, together with a large set of health covariates X.

We define S(X) = P(S = 1|X). For each individual we define a risk index at level s as R(s, X) = 1(S(X) > s). There are special measures (such as a lockdown), L = 1, which can be taken such that if L = 1 people will not get the disease, but if L = 0 people get the disease.

1. Define the true positive rate T P R(s), the true negative rate T N R(s), the false positive rate
F P R(s) and false negative rate F N R(s) as a function of s.

2. The government seeks to define an assignment to L conditional on observable characteristics, L(X), consistent with the available equipment. Show that it is possible to define s∗ such that deciding L(X) = 1 for R(s∗, X) = 1 (so that those people will not get the disease) and L(X) = 0 for the others (all of whom will get the disease) leads to a share of people getting a severe form of the disease consistent with the number of available equipment, as a function of Ps, ηa and the false positive rate function.
3. People with L(X) = 1 are safe, but L(X) = 1 is painful. There is in the population a category of people O = 1 (O stands for old) complaining that the rule L∗(X) = 1 if R(s∗, X) = 1 is unfair. More generally, consider a decision rule based on X, L(X). Define demographic parity, predictive parity and odds parity, and comment on the differences between all these notions. Under which conditions are predictive and odds parity equivalent?

Appendix: Tables

Table 9.4: Algorithm ranking – Λ

Λ                        Elastic Net   Gradient Boosting Machine   Neural Network   Random Forest
Water Cons. (T3 2007)    1.137         1.165                       1.000            0.933

Table 9.5: Regression results

                 Algo. 1                               Algo. 2
                 β1                 β2                 β1                 β2
Water Cons.      -0.952             0.116              -0.902             0.058
(T3 2007)        (-1.278,-0.631)    (0.068,0.167)      (-1.233,-0.576)    (-0.031,0.146)
                 [0.000]            [0.000]            [0.000]            [0.441]

Table 9.6: Treatment effect by groups

                 Algo. 1                                                       Algo. 2
                 Least Affected     Most Affected      Difference              Least Affected     Most Affected      Difference
Water Cons.      -0.953             -1.688             0.700                   -0.707             -1.483             0.780
(T3 2007)        (-1.685,-0.217)    (-2.417,-0.960)    (-0.302,1.722)          (-1.459,0.050)     (-2.235,-0.730)    (-0.290,1.858)
                 [0.023]            [0.000]            [0.342]                 [0.135]            [0.000]            [0.307]

Table 9.7: Group average characteristics

                 Algo. 1                                                       Algo. 2
                 Least Affected     Most Affected      Difference              Least Affected     Most Affected      Difference
Voting           0.098              0.120              -0.017                  0.096              0.121              -0.024
Frequency        (0.096,0.100)      (0.118,0.122)      (-0.020,-0.014)         (0.094,0.098)      (0.119,0.123)      (-0.027,-0.021)
                 -                  -                  [0.000]                 -                  -                  [0.000]
Democrat         0.166              0.204              -0.044                  0.147              0.242              -0.087
                 (0.159,0.174)      (0.197,0.212)      (-0.054,-0.033)         (0.139,0.154)      (0.234,0.249)      (-0.097,-0.077)
                 -                  -                  [0.000]                 -                  -                  [0.000]
Republican       0.408              0.448              -0.012                  0.409              0.390              0.008
                 (0.399,0.418)      (0.438,0.458)      (-0.025,0.001)          (0.400,0.419)      (0.380,0.399)      (-0.006,0.021)
                 -                  -                  [0.159]                 -                  -                  [0.531]

Exam 2020: Elements of Correction


Part I: Questions
1. Should be three of: (1) no extrapolation (weights are non-negative and sum to one), (2) trans-
parency of the fit (the fit before the treatment can be assessed), (3) prevents specification search
(weights can be computed independently of the post-treatment outcome), (4) sparsity/qualitative
interpretation (at most p + 1 are strictly positive). See the corresponding chapter. Answers that
were too generic / also true for other standard estimators / not explained with a precise argument
were rejected.
2. Either the list of available and possible instruments is large, while the researcher knows that only a few of them are relevant; or, more importantly, even when one has only one instrument Z, one can also consider transformations of the initial instrument, (f1(Z), . . . , fp(Z)), by a family of functions (fi)pi=1, hence coming back to the very-many-instruments case. During the course we use a sparse model for IV, assuming that only a few instruments are indeed useful, and provide a LASSO-based method to estimate the treatment effect while controlling the estimation of the nuisance parameter.
3. The regularization bias is a bias that occurs because using ML tools in a first step yields estimators that do not converge fast enough. In the case of the Lasso, it amounts to an omitted variable bias. It can exist in a small-dimensional case if a non-conventional estimator is used in a first step or if there is a selection step (think about Leeb and Potscher’s model).

Part II: Problem 1


1. (a) β1 = E[τ (X)] = E[Y1 − Y0 ], the average treatment effect.
(b) β2 is the best linear predictor coefficient (see course for the formula), testing H0 : “β2 = 0”
offers a test for heterogeneity. When this hypothesis is rejected, we know that there is BOTH
heterogeneity and that the ML proxy is relevant. When this hypothesis is NOT rejected,
the conclusion is unclear: it can be either because there is no heterogeneity, or because the
proxy predictor is weak (not correlated with the CATE).
2. (a) It can be expressed as a function of the squared correlation between the ML proxy and the
true CATE times the variance of the CATE. So maximizing this quantity ensures we select
the algorithm which is the most correlated to the true CATE.
(b) Gradient Boosting Machine, as its corresponding Λ is the largest.

3. (a) Yes, the test of H0: ”β1 = 0” is rejected at any conventional significance level. On average, households who received a water-saving incentive consume about 952 gallons (about 3,604 L) of water less than non-treated households, everything else being equal.
(b) The hypothesis that β2 = 0 is rejected for Algo 1. For Algo 2, it is not rejected (at the 5% significance level, for example), which means either that there is no heterogeneity or that Algo 2 is too weak. Since Algo 2 is dominated by Algo 1 in terms of Λ, the evidence given by Algo 1 should be given more weight. Notice that, as explained in the course, for a test of significance level α, the p-value displayed here must be below α/2 for the test to reject, in order to account for the random splitting of the data.
4. (a) See the regression for the GATES:
\[
w(X)(D - p(X))\,Y = \sum_{k=1}^{5} \gamma_k G_k + \varepsilon,
\]
where the treatment effect for the most affected is γ1 and the treatment effect for the least affected is estimated by γ5. Indeed, a quick calculation shows that:
\[
\gamma_k = E[Y_1 - Y_0 \mid G_k].
\]
This regression can be estimated by OLS: it is just a mean of w(X)(D − p(X))Y over each group (see the sketch below).
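A minimal numerical sketch (simulated data; it assumes a randomized design with known p(X) = 0.5 and a hypothetical ML proxy of the CATE, as if it came from an auxiliary sample):

```python
# Sketch of the GATES as group means of w(X)(D - p(X))Y, with w(X) = 1 / (p(X)(1 - p(X))).
import numpy as np

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 3))
D = rng.binomial(1, 0.5, size=n)                    # randomized treatment, p(X) = 0.5
tau_true = 1 + X[:, 0]                              # heterogeneous treatment effect
Y = X @ np.array([1.0, -1.0, 0.5]) + D * tau_true + rng.normal(size=n)
tau_proxy = tau_true + rng.normal(scale=0.5, size=n)  # hypothetical (noisy) ML proxy

p = 0.5
H = (D - p) / (p * (1 - p)) * Y                     # w(X)(D - p(X))Y
groups = np.digitize(tau_proxy, np.quantile(tau_proxy, [0.2, 0.4, 0.6, 0.8]))  # quintiles 0,...,4
gates = np.array([H[groups == k].mean() for k in range(5)])  # group means = OLS on group dummies
print("GATES from least to most affected quintile:", np.round(gates, 2))
```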
(b) This question cannot be answered rigorously because we test by groups, so even if the
average treatment effect is significantly different from zero in the least affected group, the
treatment effect (as measured by the CATE) can be zero for some individuals in this group.
Furthermore, groups are based on the values of the proxy predictor, which is not perfectly correlated with the true CATE. So even in the most-affected group (as defined by the values of the proxy predictor), the true CATE can be zero for some individuals. However, we accepted the conclusion that, with algorithm 1, even the least affected have a non-zero average treatment effect, which suggests that the treatment effect is significant for most people.
(c) Not really: the hypothesis of no difference between the two groups is not rejected at any commonly used level (5%, 10%).
5. (a) See the regression for the CLAN in Chernozhukov et al. (2017b):
\[
X = \sum_{k=1}^{5} \theta_k G_k + \varepsilon,
\]
where the mean characteristic of the most affected is θ1 and the mean characteristic of the least affected is estimated by θ5. It can be estimated by OLS (see the sketch below).
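A short sketch (again on simulated, stand-in variables): the CLAN coefficients are just the means of a characteristic within the least and most affected quintiles, compared through a difference in means.

```python
# Sketch of the CLAN: group means of a characteristic for the least vs. most affected quintiles.
import numpy as np

rng = np.random.default_rng(3)
n = 5000
tau_proxy = rng.normal(size=n)                                   # hypothetical ML proxy of the CATE
vote = rng.binomial(1, np.clip(0.10 + 0.02 * tau_proxy, 0, 1))   # stand-in binary characteristic
groups = np.digitize(tau_proxy, np.quantile(tau_proxy, [0.2, 0.4, 0.6, 0.8]))

least, most = vote[groups == 0], vote[groups == 4]               # least vs. most affected quintiles
diff = most.mean() - least.mean()
se = np.sqrt(least.var(ddof=1) / least.size + most.var(ddof=1) / most.size)
print(f"least: {least.mean():.3f}  most: {most.mean():.3f}  difference: {diff:.3f} (s.e. {se:.3f})")
```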
(b) First of all, notice that these coefficients estimate P[VOTE = 1 | G1 = 1] and P[VOTE = 1 | G5 = 1], but since P[G1] = P[G5] = 0.2, Bayes' theorem implies that P[VOTE = 1 | G1 = 1]/P[VOTE = 1 | G5 = 1] = P[G1 = 1 | VOTE = 1]/P[G5 = 1 | VOTE = 1], so we can interpret the result as follows: households where voting is more frequent are more likely to be amongst the most affected by the pro-social campaign than amongst the least affected. The difference is significant.
(c) The same reasoning applies: Democrats are more likely to be amongst the most affected than amongst the least affected (the difference is significant). That difference is not significant for Republican households.
Notice that in all three cases, the test compares the most and least affected groups, not each group to the general population. So it could be the case that Democrats are relatively more likely to be amongst the most affected than amongst the least affected, while being under-represented in both of these groups compared to the general population. [This is not the case in this application, but you cannot tell from the tables alone.]

6. For j = 0, 1:
\[
\hat{\alpha}_j \in \arg\min_{\alpha} \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{D_i = j\}\,\big(Y_i - X_i^{\top}\alpha\big)^2 + \lambda\,\|\alpha\|_1.
\]
The CATE can then be estimated by:
\[
\hat{\tau}(X) = X^{\top}\big(\hat{\alpha}_1 - \hat{\alpha}_0\big).
\]
The problem is that this estimator combines two estimators that are computed separately, which may not be optimal for estimating the CATE (see the sketch below).
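A minimal sketch of this two-separate-Lassos estimator on simulated data (a fixed penalty level is used for simplicity; in practice it would be chosen by cross-validation or by the plug-in rule from the course):

```python
# Sketch: fit one Lasso per treatment arm, then tau_hat(x) = x'(alpha_1_hat - alpha_0_hat).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 2000, 50
X = rng.normal(size=(n, p))
D = rng.binomial(1, 0.5, size=n)
alpha0 = np.zeros(p); alpha0[:3] = [1.0, -1.0, 0.5]     # sparse baseline coefficients
delta = np.zeros(p); delta[0] = 2.0                     # sparse heterogeneity
Y = X @ alpha0 + D * (X @ delta) + rng.normal(size=n)

lasso0 = Lasso(alpha=0.05, fit_intercept=False).fit(X[D == 0], Y[D == 0])
lasso1 = Lasso(alpha=0.05, fit_intercept=False).fit(X[D == 1], Y[D == 1])
tau_hat = X @ (lasso1.coef_ - lasso0.coef_)             # estimated CATE at each observation
print("corr(tau_hat, true CATE):", round(float(np.corrcoef(tau_hat, X @ delta)[0, 1]), 3))
```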
7. The proposed solution for the CRF is to split according to a joint criterion (not one criterion for D = 1 and another for D = 0) that is specifically designed to target heterogeneity in the treatment effect, rather than in the outcome conditional on D.
8. Here, $\gamma = \beta - E(D)\delta$, $\tau(X) = X^{\top}\delta$, $\hat{\tau}(X) = X^{\top}\hat{\delta}$, and we use
\[
E\left[\big(Y - X^{\top}\beta - (D - E[D])X^{\top}\delta\big)\begin{pmatrix} X \\ X(D - E[D]) \end{pmatrix}\right] = 0.
\]
9. β is a nuisance parameter. The estimator is immunized since, for the equation related to δj,
\[
\partial_{\beta} E\left[\big(Y - X^{\top}\beta - (D - E[D])X^{\top}\delta\big)\,X_j (D - E[D])\right] = -E\left[X X_j (D - E[D])\right] = -E\left[X X_j\right] E\left[D - E[D]\right] = 0.
\]

10. Here, the two coefficient vectors are estimated simultaneously, which is likely to yield a better estimate of the CATE; a minimal sketch is given below.
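A minimal sketch of this joint estimation (here E[D] is replaced by the sample mean and a single common penalty is used for simplicity; the course's version may penalize β and δ separately):

```python
# Sketch: one Lasso of Y on the augmented design [X, (D - Dbar) * X]; the first p coefficients
# estimate beta, the last p estimate delta, and tau_hat(x) = x'delta_hat.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 2000, 50
X = rng.normal(size=(n, p))
D = rng.binomial(1, 0.5, size=n)
beta = np.zeros(p); beta[:3] = [1.0, -1.0, 0.5]
delta = np.zeros(p); delta[0] = 2.0
Y = X @ beta + (D - 0.5) * (X @ delta) + rng.normal(size=n)

W = np.hstack([X, (D - D.mean())[:, None] * X])          # augmented design
fit = Lasso(alpha=0.05, fit_intercept=False).fit(W, Y)
beta_hat, delta_hat = fit.coef_[:p], fit.coef_[p:]
tau_hat = X @ delta_hat
print("corr(tau_hat, true CATE):", round(float(np.corrcoef(tau_hat, X @ delta)[0, 1]), 3))
```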
11. This estimator is relevant if there is sparsity in both β0 and δ0, i.e. if only a few components of X are relevant for predicting the baseline outcome and the treatment effect heterogeneity, and if both the CATE and the regression function for the baseline outcome are linear. The CRF is better suited to a context where the regression function for the baseline outcome is piecewise constant.
12. We replace E[D] by $Z^{\top}\gamma$ and add a penalty, together with a potential sparsity assumption, to handle the potential high dimensionality of γ, which is a nuisance parameter. This yields
\[
\big(\hat{\beta}, \hat{\gamma}, \hat{\delta}\big) = \arg\min_{\beta,\gamma,\delta} \frac{1}{n}\sum_{i=1}^{n}\Big(Y_i - X_i^{\top}\beta - (D_i - Z_i^{\top}\gamma)X_i^{\top}\delta\Big)^2 + \lambda_{\beta}\|\beta\|_1 + \lambda_{\gamma}\|\gamma\|_1 + \lambda_{\delta}\|\delta\|_1.
\]

It is immunized (in the sense of the definition given in the course) since, for the equation related to δj,
\[
\partial_{\beta} E\left[\big(Y - X^{\top}\beta - (D - Z^{\top}\gamma)X^{\top}\delta\big)\,X_j (D - Z^{\top}\gamma)\right] = -E\left[X X_j (D - Z^{\top}\gamma)\right] = -E\left[X X_j\right] E\left[D - Z^{\top}\gamma\right] = 0,
\]
and
\[
\partial_{\gamma} E\left[\big(Y - X^{\top}\beta - (D - Z^{\top}\gamma)X^{\top}\delta\big)\,X_j (D - Z^{\top}\gamma)\right]
\]
is the sum of two terms: the first,
\[
E\left[Z (X^{\top}\delta)\,X_j (D - Z^{\top}\gamma)\right] = E\left[Z (X^{\top}\delta)\,X_j\right] E[\zeta] = 0,
\]
where $\zeta = D - Z^{\top}\gamma$, and the second,
\[
-E\left[\big(Y - X^{\top}\beta - (D - Z^{\top}\gamma)X^{\top}\delta\big)\,X_j Z\right] = -E\left[\varepsilon X_j Z\right] = 0,
\]
where $\varepsilon$ denotes the structural error of the outcome equation. A sketch of a simple two-step approximation of this estimator is given below.
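The joint problem above is not convex in (β, γ, δ) because of the product between γ and δ, so one simple two-step shortcut (illustrative, not the course's exact procedure) is to first run a Lasso of D on Z to estimate γ, and then a Lasso of Y on the augmented design built with the residual D − Z⊤γ̂:

```python
# Sketch of a two-step approximation: (i) Lasso of D on Z for gamma; (ii) Lasso of Y on
# [X, (D - Z'gamma_hat) * X] for (beta, delta). Illustrative simulated data, continuous D.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p, q = 2000, 30, 30
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, q))
gamma = np.zeros(q); gamma[:2] = [0.7, -0.4]             # sparse model for D given Z
D = Z @ gamma + rng.normal(size=n)
beta = np.zeros(p); beta[:3] = [1.0, -1.0, 0.5]
delta = np.zeros(p); delta[0] = 2.0
Y = X @ beta + (D - Z @ gamma) * (X @ delta) + rng.normal(size=n)

gamma_hat = Lasso(alpha=0.05, fit_intercept=False).fit(Z, D).coef_
W = np.hstack([X, (D - Z @ gamma_hat)[:, None] * X])
fit = Lasso(alpha=0.05, fit_intercept=False).fit(W, Y)
delta_hat = fit.coef_[p:]
print("corr(X delta_hat, true CATE):", round(float(np.corrcoef(X @ delta_hat, X @ delta)[0, 1]), 3))
```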


 

List of Theorems and Lemmas

2.1 Lemma (Model Selection Consistency of (2.2)) . . . . . . . . . . . . . . . . . . . . . . . . 13


2.2 Lemma (Independence, from Leeb (2006)) . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Lemma (Density of the Post-Selection estimator, from Leeb (2006)) . . . . . . . . . . . . . 15
2.1 Theorem (`1 consistency of the Lasso) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Lemma (Regularization Bias of τb) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Lemma (A Favorable Case) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Theorem (Asymptotic Normality of the Immunized Estimator) . . . . . . . . . . . . . . . 30
2.6 Lemma (Concentration Inequality for Gaussian Random Variables) . . . . . . . . . . . . . 35
2.3 Theorem (Frisch–Waugh–Lovell’s Theorem, Frisch and Waugh 1933, Lovell 1963) . . . . . 36

4.1 Theorem (Necessary condition for optimal instruments, Theorem 5.3 in Newey and Mc-
Fadden (1994) p. 2166) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.1 Theorem (Rates for Lasso Under Non-Gaussian and Heteroscedastic Errors, Theorem 1
in Belloni et al. (2012a)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.1 Lemma (Lemma 6 in Belloni et al. (2012a)) . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Theorem (Asymptotic Normality of the Split-Sample Immunized Estimator, Theorem 7
in Belloni et al. (2012a)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Theorem (Cluster-Lasso convergence rates in panels, Theorem 1 in Belloni et al. (2016)) . 70
5.4 Theorem (Asymptotic normality of the Cluster-Lasso estimator for treatment effect in
panels, Theorem 2 in Belloni et al. (2016)) . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.1 Theorem (Bias of the Synthetic Controls Estimator) . . . . . . . . . . . . . . . . . . . . . 79


6.1 Lemma (Level of the Test) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Lemma (Rosenthal’s Inequality) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

7.1 Lemma (Control in probability of the leaf diameter in uniform random forests, Lemma 2
in Wager and Athey (2017)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2 Lemma (Consistency of the double sample random forests, Theorem 3 in Wager and Athey
(2017)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.1 Theorem (Asymptotic normality of double sample causal random forests, Theorem 1 in
Wager and Athey (2017)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Theorem (Asymptotic normality of GRF for the instrumental variable model (7.20), The-
orem 5 in Athey et al. (2019)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

7.3 Theorem (Consistency of the Best linear predictor estimator, Theorem 2.2 in Chernozhukov
et al. (2017b)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.4 Theorem (Uniform asymptotic size for the unconditional test with sample splitting, The-
orem 3.1 in Chernozhukov et al. (2017b)) . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.5 Theorem (Uniform validity of the confidence interval CI with sample splitting, Theorem
3.2, in Chernozhukov et al. (2017b)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

8.1 Theorem (Identification of Peer Effects, (α, β, η, γ)) . . . . . . . . . . . . . . . . . . . . . . 150

Bibliography

Abadie, A. (2019). Using synthetic controls: Feasibility, data requirements, and methodological aspects.
Working Paper.

Abadie, A., Angrist, J., and Imbens, G. (2002). Instrumental variables estimates of the effect of subsidized
training on the quantiles of trainee earnings. Econometrica, 70(1):91–117.

Abadie, A., Athey, S., Imbens, G. W., and Wooldridge, J. M. (2018). Sampling-based vs. design-based
uncertainty in regression analysis. Working Paper.

Abadie, A. and Cattaneo, M. D. (2018). Econometric methods for program evaluation. Annual Review
of Economics, 10(1):465–503.

Abadie, A., Diamond, A., and Hainmueller, J. (2010). Synthetic control methods for comparative case
studies: Estimating the effect of California's tobacco control program. Journal of the American Sta-
tistical Association, 105(490):493–505.

Abadie, A., Diamond, A., and Hainmueller, J. (2015). Comparative politics and the synthetic control
method. American Journal of Political Science, 59(2):495–510.

Abadie, A. and Gardeazabal, J. (2003). The Economic Costs of Conflict: A Case Study of the Basque
Country. American Economic Review, 93(1):113–132.

Abadie, A. and Kasy, M. (2017). The Risk of Machine Learning. arXiv e-prints, page arXiv:1703.10935.

Abadie, A. and L’Hour, J. (2019). A penalized synthetic control estimator for disaggregated data.
Working Paper.

Acemoglu, D., Johnson, S., Kermani, A., Kwak, J., and Mitton, T. (2016). The value of connections in
turbulent times: Evidence from the United States. Journal of Financial Economics, 121:368–391.

Allegretto, S., Dube, A., Reich, M., and Zipperer, B. (2013). Credible Research Designs for Minimum
Wage Studies. IZA Discussion Papers 7638, Institute for the Study of Labor (IZA).

Alquier, P., Gautier, E., and Stoltz, G. (2011). Inverse Problems and High-Dimensional Estimation:
Stats in the Château Summer School, August 31 - September 4, 2009. Springer Berlin Heidelberg,
Berlin, Heidelberg.

Amemiya, T. (1974). The nonlinear two-stage least-squares estimator. Journal of econometrics, 2(2):105–
110.

Angrist, J. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion.
Princeton University Press, 1 edition.

Angrist, J. D. (2014). The perils of peer effects. Labour Economics, 30(C):98–108.

Angrist, J. D. and Krueger, A. B. (1991). Does compulsory school attendance affect schooling and
earnings? The Quarterly Journal of Economics, 106(4):979–1014.

Angrist, J. D. and Pischke, J.-S. (2010). The credibility revolution in empirical economics: How better
research design is taking the con out of econometrics. Journal of Economic Perspectives, 24(2):3–30.

Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of
the National Academy of Sciences, 113(27):7353–7360.

Athey, S. and Imbens, G. W. (2017). The state of applied econometrics: Causality and policy evaluation.
Journal of Economic Perspectives, 31(2):3–32.

Athey, S. and Imbens, G. W. (2019). Machine learning methods that economists should know about.
Annual Review of Economics, 11.

Athey, S., Tibshirani, J., and Wager, S. (2019). Generalized random forests. The Annals of Statistics,
47(2):1148–1178.

Athey, S. and Wager, S. (2017). Efficient policy learning. arXiv preprint arXiv:1702.02896.

Athey, S. and Wager, S. (2019). Estimating treatment effects with causal forests: An application. arXiv
preprint arXiv:1902.07409.

Baltagi, B. (2008). Econometric analysis of panel data. John Wiley & Sons.

Banerjee, A., Chandrasekhar, A. G., Duflo, E., and Jackson, M. O. (2014). Using Gossips to Spread
Information: Theory and Evidence from a Randomized Controlled Trial. ArXiv e-prints.

Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012a). Sparse models and methods for
optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429.

Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012b). Sparse models and methods for
optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429.

Belloni, A. and Chernozhukov, V. (2011). High Dimensional Sparse Econometric Models: An Introduc-
tion, pages 121–156. Springer Berlin Heidelberg, Berlin, Heidelberg.

Belloni, A. and Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse
models. Bernoulli, 19(2):521–547.

Belloni, A., Chernozhukov, V., Chetverikov, D., and Hansen, C. (2018). High dimensional econometrics
and regularized gmm. arXiv:1806.01888, Contributed chapter for Handbook of Econometrics.

Belloni, A., Chernozhukov, V., Fernández-Val, I., and Hansen, C. (2017a). Program evaluation and
causal inference with high-dimensional data. Econometrica, 85(1):233–298.

Belloni, A., Chernozhukov, V., and Hansen, C. (2010). LASSO Methods for Gaussian Instrumental
Variables Models. ArXiv e-prints.

Belloni, A., Chernozhukov, V., and Hansen, C. (2014). High-Dimensional Methods and Inference on
Structural and Treatment Effects. Journal of Economic Perspectives, 28(2):29–50.

Belloni, A., Chernozhukov, V., Hansen, C., and Kozbur, D. (2016). Inference in high-dimensional panel
models with an application to gun control. Journal of Business & Economic Statistics, 34(4):590–605.

Belloni, A., Chernozhukov, V., Hansen, C., and Newey, W. (2017b). Simultaneous confidence intervals
for high-dimensional linear models with many endogenous variables. arXiv preprint arXiv:1712.08102.

Ben-Michael, E., Feller, A., and Rothstein, J. (2019). Synthetic controls and weighted event studies with
staggered adoption.

Berry, S., Levinsohn, J., and Pakes, A. (1995). Automobile prices in market equilibrium. Econometrica:
Journal of the Econometric Society, pages 841–890.

Berry, S. T. (1994). Estimating discrete-choice models of product differentiation. The RAND Journal of
Economics, pages 242–262.

Bertrand, M., Luttmer, E. F. P., and Mullainathan, S. (2000). Network effects and welfare cultures*.
The Quarterly Journal of Economics, 115(3):1019–1055.

Bhattacharya, D. and Dupas, P. (2012). Inferring welfare maximizing treatment assignment under budget
constraints. Journal of Econometrics, 167(1):168–196.

Biau, G. and Scornet, E. (2016). A random forest guided tour. Test, 25(2):197–227.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and dantzig selector.
Ann. Statist., 37(4):1705–1732.

Bléhaut, M., D’Haultfœuille, X., L’Hour, J., and Tsybakov, A. (2017). A parametric generalization of
the synthetic control method, with high dimension. Working Paper.

Bohn, S., Lofstrom, M., and Raphael, S. (2014). Did the 2007 Legal Arizona Workers Act reduce the state's
unauthorized immigrant population? Review of Economics and Statistics, 96(2):258–269.

Boucher, V. and Fortin, B. (2015). Some Challenges in the Empirics of the Effects of Networks. IZA
Discussion Papers 8896, Institute for the Study of Labor (IZA).

Bramoullé, Y., Djebbari, H., and Fortin, B. (2009). Identification of peer effects through social networks.
Journal of Econometrics, 150(1):41–55.

Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.

Brock, W. A. and Durlauf, S. N. (2001). Discrete choice with social interactions. The Review of Economic
Studies, 68(2):235–260.

Bühlmann, P., Yu, B., et al. (2002). Analyzing bagging. The Annals of Statistics, 30(4):927–961.

Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and
Applications. Springer-Verlag Berlin Heidelberg.

Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than
n. Ann. Statist., 35(6):2313–2351.

Card, D. (1990). The impact of the Mariel Boatlift on the Miami labor market. ILR Review, 43(2):245–
257.

Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling.
NBER working paper, (w4483).

Chamberlain, G. (1980). Analysis of covariance with qualitative data. The Review of Economic Studies,
47(1):225–238.

Chamberlain, G. (1987). Asymptotic efficiency in estimation with conditional moment restrictions.
Journal of Econometrics, 34(3):305–334.

Chandrasekhar, A. (2016). Econometrics of network formation. The Oxford Handbook of the Economics
of Networks, pages 303–357.

Chandrasekhar, A. G. and Lewis, R. (2011). Econometrics of sampled networks. Working Paper.

Chatterjee, S. (2013). Assumptionless consistency of the Lasso. ArXiv e-prints.

Chatterjee, S., Diaconis, P., and Sly, A. (2010). Random graphs with a given degree sequence. ArXiv
e-prints.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. (2017a). Dou-
ble/debiased/neyman machine learning of treatment effects. American Economic Review, 107(5):261–
65.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J.
(2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics
Journal, 21(1):C1–C68.

Chernozhukov, V., Demirer, M., Duflo, E., and Fernandez-Val, I. (2017b). Generic machine learning infer-
ence on heterogenous treatment effects in randomized experiments. arXiv preprint arXiv:1712.04802.

Chernozhukov, V., Hansen, C., and Spindler, M. (2015). Post-Selection and Post-Regularization Inference
in Linear Models with Many Controls and Instruments. American Economic Review, 105(5):486–90.

Chernozhukov, V., Hansen, C., and Spindler, M. (2015). Valid post-selection and post-regularization
inference: An elementary, general approach. Annu. Rev. Econ., 7(1):649–688.

Chernozhukov, V., Wuthrich, K., and Zhu, Y. (2017). An Exact and Robust Conformal Inference Method
for Counterfactual and Synthetic Controls. arXiv e-prints, page arXiv:1712.09089.

Cornwell, C. and Trumbull, W. N. (1994). Estimating the economic model of crime with panel data.
The Review of economics and Statistics, pages 360–366.

Cox, D. R. (1958). The regression analysis of binary sequences (with discussion). Journal of the Royal
Statistical Society, Series B, 20:215–242.

Crépon, B., Duflo, E., Gurgand, M., Rathelot, R., and Zamora, P. (2013). Do labor market policies have
displacement effects? evidence from a clustered randomized experiment. The Quarterly Journal of
Economics, 128(2):531–580.

Cunningham, S. and Shah, M. (2017). Decriminalizing indoor prostitution: Implications for sexual
violence and public health. NBER Working Papers, 20281.

Darolles, S., Fan, Y., Florens, J.-P., and Renault, E. (2011). Nonparametric instrumental regression.
Econometrica, 79(5):1541–1565.

Davis, J. and Heller, S. B. (2017a). Rethinking the benefits of youth employment programs: The
heterogeneous effects of summer jobs. Technical report, National Bureau of Economic Research.

Davis, J. and Heller, S. B. (2017b). Using causal forests to predict treatment heterogeneity: An appli-
cation to summer jobs. American Economic Review, 107(5):546–50.

de Paula, A. (2015). Econometrics of network models. CeMMAP working papers CWP52/15, Centre
for Microdata Methods and Practice, Institute for Fiscal Studies.

Ding, P. (2014). A paradox from randomization-based causal inference. ArXiv e-prints.

Doudchenko, N. and Imbens, G. W. (2016). Balancing, regression, difference-in-differences and synthetic
control methods: A synthesis. NBER Working Papers, 22791.

Efron, B. (2014). Estimation and accuracy after model selection. Journal of the American Statistical
Association, 109(507):991–1007.

Erdös, P. and Rényi, A. (1959). On random graphs i. Publicationes Mathematicae Debrecen, 6:290.

Erdös, P. and Rényi, A. (1960). On the evolution of random graphs. Publication of the Mathematical
Institute of the Hungarian Academy of Sciences, pages 17–61.

Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than
observations. Journal of Econometrics, 189(1):1 – 23.

Ferman, B. and Pinto, C. (2016). Revisiting the synthetic control estimator. Working Paper.

Fuller, W. A. (1977). Some properties of a modification of the limited information estimator. Economet-
rica, pages 939–953.

Gautier, E. and Tsybakov, A. (2011). High-dimensional instrumental variables regression and confidence
sets. arXiv preprint arXiv:1105.2454.

Gautier, E. and Tsybakov, A. B. (2013). Pivotal estimation in high-dimensional regression via linear
programming. In Empirical inference, pages 195–204. Springer.

Giraud, C. (2014). Introduction to High-Dimensional Statistics. Chapman and Hall, 1 edition.

Gobillon, L. and Magnac, T. (2016). Regional policy evaluation: Interactive fixed effects and synthetic
controls. The Review of Economics and Statistics, 98(3):535–551.

Graham, B. S. (2017). An econometric model of network formation with degree heterogeneity. Econo-
metrica, 85(4):1033–1063.

Graham, B. S. (2019). Network data. Working Paper 26577, National Bureau of Economic Research.

Hackmann, M. B., Kolstad, J. T., and Kowalski, A. E. (2015). Adverse selection and an individual
mandate: When theory meets practice. American Economic Review, 105(3):1030–1066.

Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average
treatment effects. Econometrica, 66(2):315–332.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data mining,
Inference and Prediction. Springer, 2nd edition.

Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association,
81(396):945–960.

Hsu, Y.-C. (2017). Consistent tests for conditional treatment effects. The Econometrics Journal, 20(1):1–
22.

Hussam, R., Rigol, N., and Roth, B. (2017). Targeting high ability entrepreneurs using community
information: Mechanism design in the field. Unpublished Manuscript.

Ibragimov, R. and Sharakhmetov, S. (2002). The exact constant in the rosenthal inequality for random
variables with mean zero. Theory of Probability & Its Applications, 46(1):127–132.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences.
Number 9780521885881 in Cambridge Books. Cambridge University Press.

Jackson, M. O. (2008). Social and Economic Networks. Princeton University Press, Princeton, NJ, USA.

Kitagawa, T. and Tetenov, A. (2017a). Equality-minded treatment choice. Technical report, Centre for
Microdata Methods and Practice, Institute for Fiscal Studies.

Kitagawa, T. and Tetenov, A. (2017b). Who should be treated? empirical welfare maximization methods
for treatment choice. Working Paper.

Kleven, H. J., Landais, C., and Saez, E. (2013). Taxation and international migration of superstars:
Evidence from the European football market. American Economic Review, 103(5):1892–1924.

Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models. Springer Publishing
Company, Incorporated, 1st edition.

LaLonde, R. J. (1986). Evaluating the Econometric Evaluations of Training Programs with Experimental
Data. American Economic Review, 76(4):604–20.

Leamer, E. E. (1983). Let’s take the con out of econometrics. The American Economic Review, 73(1):31–
43.

Leeb, H. (2006). The distribution of a linear predictor after model selection: Unconditional finite-sample
distributions and asymptotic approximations, volume Number 49 of Lecture Notes–Monograph Series,
pages 291–311. Institute of Mathematical Statistics, Beachwood, Ohio, USA.

Leeb, H. and Pötscher, B. M. (2005). Model selection and inference: Facts and fiction. Econometric
Theory, null:21–59.

Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2017). Distribution-free predictive
inference for regression. Journal of the American Statistical Association.

List, J. A., Shaikh, A. M., and Xu, Y. (2016). Multiple hypothesis testing in experimental economics.
Technical report, National Bureau of Economic Research.

Manresa, E. (2016). Estimating the structure of social interactions using panel data. Working Paper.

Manski, C. F. (1993). Identification of endogenous social effects: The reflection problem. The Review of
Economic Studies, 60(3):531–542.

Mullainathan, S. and Spiess, J. (2017). Machine learning: An applied econometric approach. Journal of
Economic Perspectives, 31(2):87–106.

Nevo, A. (2001). Measuring market power in the ready-to-eat cereal industry. Econometrica, 69(2):307–
342.

Newey, W. K. (1990). Efficient instrumental variables estimation of nonlinear models. Econometrica,
pages 809–837.

Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of
econometrics, 4:2111–2245.

Newey, W. K. and Powell, J. L. (2003). Instrumental variable estimation of nonparametric models.
Econometrica, 71(5):1565–1578.

Nie, X. and Wager, S. (2017). Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint
arXiv:1712.04912.

Oliveira, R. (2013). The lower tail of random quadratic forms, with applications to ordinary least squares
and restricted eigenvalue properties. ArXiv e-prints.

Patacchini, E. and Zenou, Y. (2008). The strength of weak ties in crime. European Economic Review,
52(2):209–236.

Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press, New York,
NY, USA.

Powers, S., Qian, J., Jung, K., Schuler, A., Shah, N. H., Hastie, T., and Tibshirani, R. (2017).
Some methods for heterogeneous treatment effect estimation in high-dimensions. arXiv preprint
arXiv:1707.00102.

Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica, pages 931–954.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies.
Journal of educational Psychology, 66(5):688.

Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements. Information
Theory, IEEE Transactions on, 59(6):3434–3447.

Sacerdote, B. (2011). Peer Effects in Education: How Might They Work, How Big Are They and How
Much Do We Know Thus Far?, volume 3 of Handbook of the Economics of Education, chapter 4,
pages 249–277. Elsevier.

Tibshirani, R. (1994). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society, Series B, 58:267–288.

Tsybakov, A. B. (2009). Introduction to nonparametric estimation. Springer.

van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist.,
36(2):614–645.

Van der Vaart, A. W. (1998). Asymptotic statistics, volume 3. Cambridge university press.

Wager, S. and Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using
random forests. Journal of the American Statistical Association.

Wager, S., Hastie, T., and Efron, B. (2014). Confidence intervals for random forests: The jackknife and
the infinitesimal jackknife. The Journal of Machine Learning Research, 15(1):1625–1651.

Wager, S. and Walther, G. (2015). Adaptive Concentration of Regression Trees, with Application to
Random Forests. ArXiv e-prints.

Wasserman, L. (2010). All of statistics : a concise course in statistical inference. Springer, New York.

Xu, Y. (2017). Generalized synthetic control method: Causal inference with interactive fixed effects
models. Political Analysis, 25(01):57–76.

Xu, Y. and Liu, L. (2018). gsynth: Generalized Synthetic Control Method. R package version 1.0.9.

Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification
and regression estimation. Biometrika, 92(4):937–950.

Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res., 7:2541–2563.

