
Machine Learning for Econometrics

I. Statistics and Econometrics Prerequisites

Christophe Gaillac

Autumn 2024
Motivation

To put everybody on the same page, we briefly review some useful tools.


1 A refresher on a number of statistical tools and techniques:
Singular value decomposition
High dimension and penalized regressions
Generalized method of moments
Factor models
Random forests
Neural networks

2 A refresher on the framework underlying causal inference:


Starting from randomized controlled trials
Conditional independence and the propensity score
Instrumental variables/Optimal IV.



Outline

1 Statistical tools

2 Causal inference

3 Additional references



Singular value decomposition

The rank of an n × p matrix X, denoted r(X) ≤ min(n, p), is the dimension of the vector space spanned by its column vectors. Any such real matrix admits a singular value decomposition (SVD):

    X = U S V′,                                                         (1)

where U and V are square matrices of dimension n and p respectively, with U′U = I_n and V′V = I_p, and S is a rectangular diagonal matrix.

This can also be rewritten with S a square matrix of dimension r(X), the second dimension of both U and V also reduced to r(X):

    X = ∑_{j=1}^{r(X)} s_j u_j v_j′,

where u_j and v_j are the j-th columns of U and V respectively, and s_j is the j-th singular value.



Singular value decomposition
Useful for:

1 Detecting multicollinearity or anticipating numerical problems when inverting the Gram matrix: if s_j = 0 for some j ≤ p, then X′X is singular and the OLS estimator cannot be computed.

2 Connection to PCA:

    X′X = ∑_{j=1}^{r(X)} s_j² v_j v_j′,

and v_1 is the first principal component (or mode) of X′X, associated with the highest variance.

3 Computing the pseudo-inverse, also called the Moore-Penrose inverse, of the square matrix X′X (when r(X) < p):

    (X′X)⁺ := ∑_{j=1}^{r(X)} s_j⁻² v_j v_j′.

4 A natural way to approximate X by a lower-rank matrix.


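As an illustration, here is a minimal numpy sketch of these four uses; the simulated matrix and the rank tolerance are assumptions for the example, not taken from the slides.

```python
import numpy as np

# Illustrative design matrix with an exact collinearity (the last column duplicates the first).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = np.column_stack([X, X[:, 0]])                   # n = 100, p = 5, rank 4

U, s, Vt = np.linalg.svd(X, full_matrices=False)    # thin SVD: X = U diag(s) V'

# 1) Multicollinearity shows up as (near-)zero singular values.
rank = int(np.sum(s > 1e-10))
print("singular values:", np.round(s, 3), "  rank:", rank)

# 2) PCA connection: the rows of Vt (columns of V) are the eigenvectors of X'X,
#    with eigenvalues s_j^2; Vt[0] is the first principal direction.

# 3) Moore-Penrose pseudo-inverse of X'X built from the SVD: sum_j s_j^{-2} v_j v_j'.
XtX_pinv = (Vt[:rank].T * s[:rank] ** -2) @ Vt[:rank]
print("matches np.linalg.pinv:", np.allclose(XtX_pinv, np.linalg.pinv(X.T @ X)))

# 4) Best rank-k approximation of X: keep only the k largest singular values.
k = 2
X_k = (U[:, :k] * s[:k]) @ Vt[:k]
print("relative approximation error:", np.linalg.norm(X - X_k) / np.linalg.norm(X))
```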
High dimension and penalized regressions

Consider a linear regression with normally distributed errors in the i.i.d. case,

    Y_i = X_i′β_0 + ε_i,   ε_i ∼ N(0, σ²),

where the X_i are p × 1 vectors and Y_i and ε_i are scalars. The number of covariates p may be larger than the number of observations n...

Two problems then arise:
(i) the accuracy of the OLS estimator deteriorates (increased variance) due to multicollinearity;
(ii) the estimator may become impossible to compute, if the Gram matrix ∑_{i=1}^n X_i X_i′ / n is no longer invertible.



Reminder Ridge
For a penalty level λ ≥ 0, the Ridge estimator is defined as the solution to the minimization program

    β̂^R(λ) = argmin_{β ∈ R^p} (1/n) ∑_{i=1}^n (Y_i − X_i′β)² + λ‖β‖₂²,            (2)

where ‖β‖₂ = ( ∑_{j=1}^p β_j² )^{1/2}.

Solving the previous program, we find

    β̂^R(λ) = [ (1/n) ∑_{i=1}^n X_i X_i′ + λ I_p ]⁻¹ (1/n) ∑_{i=1}^n X_i Y_i.

When λ → 0, the solution becomes the OLS estimator computed with the pseudo-inverse of the covariance matrix:

    β̂^R(λ) → [ (1/n) ∑_{i=1}^n X_i X_i′ ]⁺ (1/n) ∑_{i=1}^n X_i Y_i.

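A minimal numpy sketch of this closed form; the simulated data, dimensions, and penalty grid below are illustrative assumptions.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form Ridge estimator: [X'X/n + lam*I_p]^{-1} X'y/n."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

# Illustrative data with p close to n, where regularization matters.
rng = np.random.default_rng(1)
n, p = 60, 50
X = rng.normal(size=(n, p))
beta0 = np.r_[np.ones(5), np.zeros(p - 5)]
y = X @ beta0 + rng.normal(size=n)

for lam in (1e-6, 0.1, 1.0):
    b = ridge(X, y, lam)
    print(f"lambda = {lam:<6} estimation error ||b - beta0|| = {np.linalg.norm(b - beta0):.2f}")
```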


Reminder Lasso

Other idea: β_0 has at most s non-zero coefficients (sparsity): ‖β_0‖₀ ≤ s < p.

A naive way to select the model is to penalize the number of non-zero coefficients (ℓ₀ penalty): ‖β_0‖₀ = ∑_{k=1}^p 1(β_{0,k} ≠ 0). This is non-convex and infeasible for large p ("NP-hard").

Idea (Tibshirani, 1994): replace the ℓ₀ penalty by a convex one, the ℓ₁ norm:

    ‖β‖₁ = ∑_{k=1}^p |β_k|.



Reminder Lasso

Lasso estimator ("Least Absolute Shrinkage and Selection Operator"):

    β̂(λ) ∈ argmin_{β ∈ R^p} (1/n) ∑_{i=1}^n (Y_i − X_i′β)² + λ‖β‖₁.

The Lasso minimizes the sum of the empirical average quadratic loss and a penalty or regularization term λ‖β‖₁.

Choice of λ: usually by cross-validation. The idea is to split the data into disjoint folds, compute the estimator for a given λ on some folds, and choose λ to minimize the out-of-sample (OOS) error on the held-out fold (to avoid overfitting).



Reminder Lasso: choosing λ by cross-validation

The procedure (sketched in code below), writing n = K × n₀ for two integers K and n₀:

1 Randomly draw a partition of {1, . . . , n} into K groups (folds) of equal size n₀. Let G_i ∈ {1, . . . , K} be the group to which observation i belongs.

2 For each k = 1, . . . , K, using only the data not belonging to group k, compute the Lasso estimator

    β̂_k^L(λ) = argmin_{β ∈ R^p} ( 1 / ((K−1)n₀) ) ∑_{i : G_i ≠ k} (Y_i − X_i′β)² + λ‖β‖₁.

3 For each k = 1, . . . , K, compute the error on group k:

    (1/n₀) ∑_{i : G_i = k} (Y_i − X_i′ β̂_k^L(λ))².

4 Aggregate the errors from the previous step and then minimize with respect to λ:

    λ̂^L = argmin_{λ ≥ 0} (1/K) ∑_{k=1}^K (1/n₀) ∑_{i : G_i = k} (Y_i − X_i′ β̂_k^L(λ))².

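A minimal sketch of this K-fold loop using scikit-learn's Lasso; the data-generating process, the fold count, and the λ grid are illustrative assumptions (note that sklearn's Lasso scales the quadratic loss by 1/(2m), so its alpha matches the slide's λ only up to a factor of 2).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, K = 200, 50, 5                                   # n = K * n0 with n0 = 40
X = rng.normal(size=(n, p))
beta0 = np.r_[np.ones(5), np.zeros(p - 5)]             # sparse truth, s = 5
y = X @ beta0 + rng.normal(size=n)

folds = rng.permutation(n) % K                         # G_i in {0, ..., K-1}, balanced folds
lambdas = np.logspace(-3, 0, 20)

cv_error = []
for lam in lambdas:
    fold_errors = []
    for k in range(K):
        train, test = folds != k, folds == k
        model = Lasso(alpha=lam).fit(X[train], y[train])     # Lasso on the K-1 other folds
        fold_errors.append(np.mean((y[test] - model.predict(X[test])) ** 2))
    cv_error.append(np.mean(fold_errors))              # step 4: aggregate over folds

lam_hat = lambdas[int(np.argmin(cv_error))]
print("cross-validated lambda:", round(float(lam_hat), 4))
```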


Generalized method of moments

The generalized method of moments (GMM) is a generalization of OLS. In the latter, E[ε|X] = 0 implies the following orthogonality condition:

    E[X(Y − X′β_0)] = 0.

More generally, one may define a model from a vector of random variables U, a vector of coefficients θ, and a vector of moments M that we want to set to zero:

    M(θ) := E[ψ(U, θ)] = 0.

Denote the empirical counterpart of this vector of moments by

    M̂(θ) := (1/n) ∑_{i=1}^n ψ(U_i, θ).



Generalized method of moments

The GMM estimator θ̂_n of the parameter θ_0 is then obtained by minimizing the squared Euclidean norm of this vector, ‖M̂(θ)‖₂² := M̂(θ)′M̂(θ), which corresponds to the following optimization program:

    θ̂_n := argmax_{θ ∈ Θ} −‖M̂(θ)‖₂².                                  (3)

Under an identifying assumption, M(θ) takes the value 0 only at θ_0: for all θ ∈ Θ, M(θ) = 0 ⟹ θ = θ_0, and therefore the parameter θ_0 is identified in the set Θ.

These assumptions allow us to prove the asymptotic normality of θ̂_n:

    √n (θ̂_n − θ_0) →_d N( 0, (G′G)⁻¹ G′ΣG ((G′G)⁻¹)′ ),

where G := E[∇_θ ψ(U, θ_0)] and Σ := E[ψ(U, θ_0) ψ(U, θ_0)′].

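A minimal sketch of program (3) for the OLS moment function ψ(U, θ) = X(Y − X′θ), minimizing the squared norm of the empirical moments numerically; the simulated data and the use of scipy's BFGS routine are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, p = 500, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -0.5, 2.0])
y = X @ theta_true + rng.normal(size=n)

def M_hat(theta):
    """Empirical moment vector (1/n) sum_i psi(U_i, theta), with psi = X_i (Y_i - X_i' theta)."""
    return X.T @ (y - X @ theta) / n

def objective(theta):
    """Squared Euclidean norm of the empirical moments, ||M_hat(theta)||_2^2."""
    m = M_hat(theta)
    return m @ m

res = minimize(objective, x0=np.zeros(p), method="BFGS")
print("GMM estimate:", np.round(res.x, 3))             # close to theta_true (and to OLS here)
```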


Factor models
Factor models are a dimension reduction technique assuming that the dependence between explanatory variables can be well approximated by a lower-dimensional (latent) common underlying structure (see, e.g., Stock and Watson (2002); Bai (2003); Bai and Ng (2006), or Chapter 11.13 in Hansen (2022)). ⟹ related to PCA

Consider a random vector x_t of dimension p and the approximate factor model

    x_t = Λ f_t + ε_t,   t = 1, . . . , T,

where
    Λ is a p × r matrix of factor loadings;
    f_t is an r × 1 vector of factors;
    ε_t are idiosyncratic errors.

In matrix form:

    X = F Λ′ + E,   with X of dimension T × p, F of dimension T × r, Λ′ of dimension r × p, and E of dimension T × p.



Factor models

One way to estimate the factors is least squares, minimizing in (Λ, F)

    ∑_{t=1}^T (x_t − Λ f_t)′ (x_t − Λ f_t) = ‖X − F Λ′‖²_F,

where ‖A‖_F = ( ∑_t ∑_k A_{t,k}² )^{1/2} is the Frobenius norm of the matrix A.

The solution for the individual components Λ and F is not unique; a normalization is needed.



Factor models
The most judicious normalization in terms of computational cost depends on p and T:

N1  F′F/T = I_r, preferably when T < p;
N2  Λ′Λ = I_r, preferably when p < T.

Using (N2), for fixed Λ, F̂ is the ordinary least squares solution, i.e.

    f̂_t(Λ) = (Λ′Λ)⁻¹ Λ′ x_t = Λ′ x_t.

Plugging this expression into the least squares objective function gives

    (1/T) ∑_{t=1}^T (x_t − ΛΛ′x_t)′ (x_t − ΛΛ′x_t) = tr(Σ̂) − tr(Λ′Σ̂Λ),

where Σ̂ = ∑_t x_t x_t′ / T is the empirical covariance matrix.

We have max_{Λ : Λ′Λ = I_r} tr(Λ′Σ̂Λ) = ∑_{k=1}^r s_k, and Λ̂ is the matrix formed by the first r singular vectors of the empirical covariance matrix. ⟹ estimation using PCA, as sketched below.
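A minimal numpy sketch of this PCA estimation under normalization (N2); the simulated factor model (dimensions, noise level) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
T, p, r = 300, 20, 2
F_true = rng.normal(size=(T, r))                       # latent factors
Lam_true = rng.normal(size=(p, r))                     # true loadings
X = F_true @ Lam_true.T + 0.5 * rng.normal(size=(T, p))    # X = F Lambda' + E

Sigma_hat = X.T @ X / T                                # empirical covariance matrix (p x p)

# Lambda_hat: the first r eigenvectors of Sigma_hat, so that Lambda_hat' Lambda_hat = I_r.
eigval, eigvec = np.linalg.eigh(Sigma_hat)             # eigenvalues in ascending order
Lam_hat = eigvec[:, ::-1][:, :r]

# Under (N2), f_t = Lambda' x_t, so the estimated factors are F_hat = X Lambda_hat.
F_hat = X @ Lam_hat

# The estimated factors span (approximately) the same space as the true ones:
# regress each true factor on F_hat and look at the residual sum of squares.
rss = np.linalg.lstsq(F_hat, F_true, rcond=None)[1]
print("residual SS of true factors on estimated factors:", np.round(rss, 2))
```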
Factor models

One can use factor models to reduce the dimension of some of the variables while keeping the interpretation of others, using factor-augmented regression (see Bai (2003)).

Consider an i.i.d. sample (x_t, y_t, z_t)_{t=1}^T satisfying the model

    y_t = f_t′γ + z_t′β + ε_t,                                         (4)
    x_t = Λ f_t + ν_t,                                                 (5)
    E(f_t ε_t) = E(z_t ε_t) = E(f_t ν_t′) = E(ν_t ε_t) = 0,            (6)

where Λ is a p × r matrix and f_t is an r × 1 vector of factors.

In this model, x_t affects y_t only through the latent factors.



Random trees
The objective is to estimate the conditional expectation µ(x) = E[Y | X = x] from an i.i.d. sample (U_i)_{i=1,...,n} = (Y_i, X_i)_{i=1,...,n} using recursive partitioning.

The construction of a decision tree produces an adaptive weighting α_i(x) that quantifies the importance of the i-th training observation U_i at the evaluation point x:

    µ̂(x) = ∑_{i=1}^n α_i(x) Y_i,   with   α_i(x) := 1{X_i ∈ L(x)} / |{i : X_i ∈ L(x)}|,        (7)

where L(x) is the "leaf" in which the point x falls.

(7) is a locally weighted average of all the Y_i whose X_i fall in a neighborhood of the point x (the same leaf).

The leaves form a partition of the feature space X. This partition is built to maximize a global segmentation criterion, which usually consists in choosing splits that maximize the homogeneity gain.
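A minimal sketch checking the weighting in equation (7) with scikit-learn's DecisionTreeRegressor; the simulated data, leaf size, and evaluation point are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
n = 200
X = rng.uniform(-1, 1, size=(n, 2))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=n)

tree = DecisionTreeRegressor(min_samples_leaf=10).fit(X, y)

x0 = np.array([[0.2, -0.4]])                 # evaluation point x
leaf_x0 = tree.apply(x0)[0]                  # id of the leaf L(x)
in_leaf = tree.apply(X) == leaf_x0           # indicator 1{X_i in L(x)}
alpha = in_leaf / in_leaf.sum()              # weights alpha_i(x) of equation (7)

# The tree prediction is exactly the weighted average sum_i alpha_i(x) Y_i.
print("weighted average:", alpha @ y, "  tree prediction:", tree.predict(x0)[0])
```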
Random trees

1 Initialization: initialize the list of active cells with the root of the tree, A = (X), and the final tree A_final as an empty list.

2 Expansion: for each node A ∈ A:
  IF A satisfies the stopping criterion (number of observations in the leaf less than n₀):
    - remove A from the list A;
    - concatenate A_final = A_final + {A}.
  ELSE randomly choose a coordinate from {1, . . . , p}, choose the best split s according to the segmentation criterion, and create the two child nodes A₁ and A₂ by splitting A.
    Then remove the parent node A and add the child nodes to the list: A = A − {A} + {A₁, A₂}.



Random trees

Figure: Decision tree algorithm, steps 1 and 2. (Illustration omitted.)

Figure: Decision tree algorithm, steps 3, 4 and evaluation. (Illustration omitted.)
Random forests
We aggregate the trees formed on all possible subsamples of size s from the training data U_1, . . . , U_n:

    µ̂(x; U_1, . . . , U_n) = C(n, s)⁻¹ ∑_{1 ≤ i_1 < ··· < i_s ≤ n} T(x; U_{i_1}, . . . , U_{i_s}),        (8)

where C(n, s) is the binomial coefficient.

The estimator in equation (8) is evaluated using Monte Carlo methods:

    µ̂(x; U_1, . . . , U_n) ≈ (1/B) ∑_{b=1}^B T(x; U*_{b,1}, . . . , U*_{b,s}),                            (9)

where each tree prediction is

    T(x; U*_{b,1}, . . . , U*_{b,s}) = ∑_{i ∈ I_b} α*_{b,i}(x) Y*_{b,i},                                   (10)

    α*_{b,i}(x) = 1{X*_{b,i} ∈ L*_b(x)} / |{i : X*_{b,i} ∈ L*_b(x)}|.

This aggregation strategy is known as bagging.


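A minimal sketch of the Monte Carlo aggregation in (9), bagging scikit-learn regression trees grown on random subsamples of size s with random coordinate selection; all tuning values and the simulated data are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
n, s, B = 500, 200, 100                                # sample size, subsample size, number of trees
X = rng.uniform(-1, 1, size=(n, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=n)

def bagged_forest_predict(x_new):
    """Average of B trees, each grown on a random subsample of size s (equation (9))."""
    preds = []
    for _ in range(B):
        idx = rng.choice(n, size=s, replace=False)     # subsample U*_{b,1}, ..., U*_{b,s}
        tree = DecisionTreeRegressor(min_samples_leaf=5, max_features=1).fit(X[idx], y[idx])
        preds.append(tree.predict(x_new))              # T(x; U*_{b,1}, ..., U*_{b,s})
    return np.mean(preds, axis=0)

x0 = np.array([[0.2, -0.4]])
print("bagged prediction at x0:", bagged_forest_predict(x0)[0])
print("true regression value:  ", np.sin(3 * 0.2) + (-0.4) ** 2)
```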
Neural networks
A neural network can be seen as a function µ(·) that, given an input vector X, associates with it a representation or an output µ(X).

Feed-forward neural network. A network with two layers can be written as

    µ(x) = g₂(b₂ + Θ₂ g₁(b₁ + Θ₁ x)),                                  (11)

where h₁(·) = g₁(b₁ + Θ₁ ·) is the first layer, consisting of the parameter matrix Θ₁ of dimension l₁ × p (also called weights in the terminology of neural networks), the parameter vector b₁ of dimension l₁ (also called bias), and the activation function g₁(·), applied element-wise; h₂(·) = g₂(b₂ + Θ₂ ·) is the second (final) layer.

Linear regression is a shallow neural network.

Some activation functions: (a) sigmoid, g(x) = exp(x)/(1 + exp(x)); (b) rectified linear unit (ReLU), g(x) = max(0, x).
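A minimal numpy sketch of the forward pass in (11) with a ReLU first layer and a linear output layer; the layer width and the random parameters are illustrative assumptions, and no training step is shown.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)                # element-wise activation g1

rng = np.random.default_rng(7)
p, l1 = 4, 8                                 # input dimension and width of the first layer
Theta1, b1 = rng.normal(size=(l1, p)), np.zeros(l1)
Theta2, b2 = rng.normal(size=(1, l1)), np.zeros(1)

def mu(x):
    """Two-layer feed-forward network g2(b2 + Theta2 g1(b1 + Theta1 x)), with g2 the identity."""
    h1 = relu(b1 + Theta1 @ x)               # first layer, dimension l1
    return b2 + Theta2 @ h1                  # second (final) layer, scalar output

x = rng.normal(size=p)
print("mu(x) =", mu(x)[0])
```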
Outline

1 Statistical tools

2 Causal inference

3 Additional references



Definitions

We consider the potential outcome model of Rubin (1974).

For a given individual, only the treatment status D_i ∈ {0, 1} is observed, as well as the realized outcome Y_i defined by

    Y_i := Y_i(D_i) = Y_i(0) if D_i = 0,   and   Y_i(1) if D_i = 1.

The treatment effect for an individual i is then

    ∆_i = Y_i(1) − Y_i(0).



Quantities of interest

Average treatment effect (ATE):

    τ₀ := E[Y_i(1) − Y_i(0)].

Average treatment effect on the treated (ATT), defined as

    τ₀^ATT := E[Y_i(1) − Y_i(0) | D_i = 1].

When we observe characteristics X_i, our attention shifts towards the conditional average treatment effect (CATE), defined as the function

    τ : x ↦ E[Y_i(1) − Y_i(0) | X_i = x].



Randomized controlled trials

Suppose we observe an i.i.d. sample of random variables (Y_i, D_i, X_i), i = 1, . . . , n, and assume:

(SUTVA): Y_i := Y_i(D_i).
(Conditional independence, or unconfoundedness): (Y_i(0), Y_i(1)) ⊥⊥ D_i | X_i.
(Overlap): there exists η > 0 such that, for all x ∈ X, η ≤ p(x) ≤ 1 − η, where p : x ∈ X ↦ P(D = 1 | X = x) is the propensity score.

Rosenbaum and Rubin (1983) show that:

    (Y_i(0), Y_i(1)) ⊥⊥ D_i | p(X_i).                                  (12)



Randomized controlled trials

Two characterizations of the ATE:

1 Using the propensity score (inverse propensity weighting):

    τ̂_IPW = (1/n) ∑_{i=1}^n [ D_i Y_i / p(X_i) − (1 − D_i) Y_i / (1 − p(X_i)) ].

2 Using the difference between regression functions:

    E[Y_i(1) − Y_i(0) | X_i = x] = µ₁(x) − µ₀(x),

where µ_j(x) = E[Y_i | X_i = x, D_i = j] for j = 0, 1.



Efficient estimation of treatment effect
The augmented inverse propensity weighting (AIPW) estimator, defined by Robins et al. (1994) and Hahn (1998), is designed to correct the bias due to the estimation of the regression functions µ_j:

    τ̂_AIPW = (1/n) ∑_{i=1}^n [ µ₁(X_i) − µ₀(X_i) + D_i (Y_i − µ₁(X_i)) / p(X_i) − (1 − D_i)(Y_i − µ₀(X_i)) / (1 − p(X_i)) ].

This AIPW estimator has two important properties:

1 It achieves the semi-parametric efficiency bound, with

    √n (τ̂_AIPW − τ₀) →_d N( 0, Var(τ(X)) + E[ σ₀²(X) / (1 − p(X)) ] + E[ σ₁²(X) / p(X) ] ),

where σ_j²(x) = Var(Y(j) | X = x) for j = 0, 1.

2 It is doubly robust, meaning that it is consistent if either the estimators µ̂_j for j = 0, 1 are consistent, or the estimator p̂ is consistent.

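A minimal sketch of the AIPW estimator, plugging in logistic-regression and OLS estimates of the nuisance functions; the simulated data and the choice of nuisance models are illustrative assumptions, and no cross-fitting is used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(8)
n = 2000
X = rng.normal(size=(n, 3))
p_true = 1 / (1 + np.exp(-X[:, 0]))                    # true propensity score
D = rng.binomial(1, p_true)
tau0 = 1.0                                             # true (constant) treatment effect
Y = tau0 * D + X @ np.array([0.5, -0.5, 0.2]) + rng.normal(size=n)

# Nuisance estimates: propensity score and the two regression functions mu_0, mu_1.
p_hat = LogisticRegression().fit(X, D).predict_proba(X)[:, 1]
mu1_hat = LinearRegression().fit(X[D == 1], Y[D == 1]).predict(X)
mu0_hat = LinearRegression().fit(X[D == 0], Y[D == 0]).predict(X)

# AIPW score and estimator.
psi = (mu1_hat - mu0_hat
       + D * (Y - mu1_hat) / p_hat
       - (1 - D) * (Y - mu0_hat) / (1 - p_hat))
tau_aipw = psi.mean()
se = psi.std(ddof=1) / np.sqrt(n)
print(f"AIPW estimate of the ATE: {tau_aipw:.3f} (standard error {se:.3f})")
```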


Instrumental variables

Consider a linear model to estimate the effect of a treatment D ∈ R (discrete or continuous, of dimension 1 here) while controlling for variables X ∈ R^{p_x} (which include the intercept):

    Y = D τ₀ + X′β₀ + ε,   with E[ε | X] = 0.                          (13)

But E[Dε] ≠ 0... A possible strategy is to find instrumental variables Z, giving W = (Z′, X′)′ ∈ R^{p + p_x}, where p is at least as large as the number of endogenous variables (here 1).



Instrumental variables

An instrument is a variable that
(i) is correlated with the endogenous variable,
(ii) is uncorrelated with the residuals.

We define D* as the best linear prediction of D using W, i.e. D* = W′γ, and denote W* = (D*, X′)′.

If E[W*(W*)′] is non-singular and E[εW] = 0, the true parameters take the form

    (τ₀, β₀′)′ = E[W*(W*)′]⁻¹ E[W* Y].                                 (14)

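A minimal numpy sketch of the sample analogue of (14), i.e. two-stage least squares with a single excluded instrument; the simulated data-generating process is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])        # controls X, including the intercept
Z = rng.normal(size=(n, 1))                                  # excluded instrument
u = rng.normal(size=n)                                       # unobserved confounder
D = Z[:, 0] + X[:, 1] + u + rng.normal(size=n)               # endogenous treatment
Y = 1.0 * D + X @ np.array([0.5, -0.5]) + u + rng.normal(size=n)   # tau0 = 1

W = np.column_stack([Z, X])                                  # W = (Z', X')'

# First stage: D* = W'gamma, the best linear prediction of D using W.
gamma_hat = np.linalg.lstsq(W, D, rcond=None)[0]
D_star = W @ gamma_hat

# Second stage: regress Y on W* = (D*, X')', the sample analogue of (14).
W_star = np.column_stack([D_star, X])
coef = np.linalg.lstsq(W_star, Y, rcond=None)[0]
print("2SLS estimate of tau0:", round(float(coef[0]), 3))    # close to 1, unlike naive OLS
```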


Optimal Instrumental variables

The researcher may have access to multiple instruments, or may want to consider transformations A(W) of the initial instruments, since they all remain valid: E[ε A(W)] = 0.

This raises the legitimate question of choosing the function A(·) so as to minimize the asymptotic variance of the GMM estimator of θ₀ := (τ₀, β₀′)′: optimal IV.

Denote S := (D, X′)′ ∈ R^{p_x + 1}. One can show that the optimal instrument is the regression function of S on W,

    w ↦ E[S | W = w].

Without further restrictions, this is naturally a high-dimensional object (see, e.g., Tsybakov, 2009). With few instruments, Newey and McFadden (1994) propose nonparametric estimators in the form of series.



Outline

1 Statistical tools

2 Causal inference

3 Additional references



Additional references

Athey, S. and G. W. Imbens (2019). Machine learning methods that


economists should know about. Annual Review of Economics 11.

Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of


Statistical Learning: Data mining, Inference and Prediction. Springer

Abadie, A. and M. D. Cattaneo (2018). Econometric methods for


program evaluation. Annual Review of Economics 10(1), 465–503.



Appendix



Bibliography

Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica, 71(1):135–171.

Bai, J. and Ng, S. (2006). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica, 74(4):1133–1150.

Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2):315–332.

Hansen, B. E. (2022). Econometrics. Princeton University Press.

Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245.

Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.

Stock, J. H. and Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97(460):1167–1179.

Tibshirani, R. (1994). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288.

Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer.
