
Econometrics 2

4. Binary data methods

Laurent Davezies & Elia Lapenta

ENSAE, 2024/2025

1 / 47
Outline

Introduction

Modelization and parameters of interest

Identification and estimation

Model quality, selection of variables

Linear probability models

Potential Outcomes

Application

2 / 47
Motivation

▶ The objective is to explain a binary variable Y by


X = (1, X1 , ..., XK −1 )′ ∈ RK .
▶ The two possible values that Y can take are arbitrary; we assume
without loss of generality that Y ∈ {0, 1}.
▶ There are many contexts where the dependent variable Y is binary:
▶ In micro-economics: participant or non-participant in the labor
market, employment vs unemployment, purchase or non-purchase of
a durable consumption good.
▶ In credit risk analysis: borrower defaults or not.
▶ In insurance: an insurance claim is filed or not.
▶ In biostatistics: individual recovers from a disease or remains ill,
medical treatment is effective or not.
▶ In social sciences: graduation or not, living in a couple or being
single, voting or abstaining, etc.

3 / 47
General ideas

▶ Linear models are not well adapted to analyze binary dependent


variables ...
▶ ... But we will assume that a latent (unobserved) variable Y ∗ is
linear in the parameters and specify how Y is determined by
this latent variable.
▶ Since the resulting models for Y are non-linear, it is crucial to think
carefully about what the parameters of interest are.
▶ The fact that Y ∗ is not observed implies that certain parameters
need to be normalized.
▶ Parameters are estimated using the maximum likelihood method.
▶ Some of these ideas reappear in chapter 5.

4 / 47
Outline

Introduction

Modelization and parameters of interest

Identification and estimation

Model quality, selection of variables

Linear probability models

Potential Outcomes

Application

5 / 47
Modelization
▶ Linear models are not well suited to analyze binary dependent
variables. Indeed, if Y ∈ {0, 1}, then
E (Y |X ) = P(Y = 1|X ) ∈ [0, 1]. (1)
In a linear model (under the exogeneity assumption E (ε|X ) = 0), it
follows that E (Y |X ) = X ′ β0 . But nothing guarantees that
X ′ β0 ∈ [0, 1].
▶ In order to satisfy the restriction (1), the following assumption is
made:
E (Y |X ) = F (X ′ β0 ), (2)
where F (.) is a strictly increasing bijective (and known) function
from R to ]0, 1[, that is to say F satisfies the properties of a
distribution function.
▶ N.B.: equation (2) corresponds to the equation of a so-called
Generalized Linear Model (GLM), i.e., a model of the form:
h(E (Y |X )) = X ′ β0 ,
where h is a known function called the link function (here h = F −1 ).
6 / 47
Non-linear models and latent variables

▶ Model (2) can be interpreted in terms of latent variables.


▶ Suppose there exists a continuous variable Y ∗ ∈ R such that

Y = 1{Y ∗ ≥ 0}.

▶ Suppose in addition that Y ∗ is generated by a linear model:

Y ∗ = X ′ β0 + ε, (3)

where −ε is independent from X and has the distribution function


F . Then

P(Y = 1|X ) = P(X ′ β0 + ε ≥ 0|X ) = P(−ε ≤ X ′ β0 |X ) = F (X ′ β0 ).

▶ So we obtain again equation (2).


▶ The latent variable interpretation is, often, quite natural.
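
A minimal R sketch of this construction, assuming a probit specification (so F = Φ) and arbitrary illustrative parameter values: simulate Y ∗ = X ′ β0 + ε, set Y = 1{Y ∗ ≥ 0}, and check that the empirical P(Y = 1|X ) matches F (X ′ β0 ).

# Illustrative simulation of the latent-variable model (probit case, F = pnorm)
set.seed(1)
n     <- 1e5
beta0 <- c(0.5, -1)                       # arbitrary (intercept, slope)
X     <- cbind(1, rnorm(n))               # X = (1, X1)'
eps   <- rnorm(n)                         # eps ~ N(0,1), so -eps also has cdf pnorm
ystar <- as.vector(X %*% beta0) + eps     # latent variable Y* = X'beta0 + eps
y     <- as.numeric(ystar >= 0)           # observed Y = 1{Y* >= 0}
mean(y[abs(X[, 2] - 1) < 0.05])           # empirical P(Y = 1 | X1 close to 1)
pnorm(0.5 - 1 * 1)                        # theoretical F(x'beta0) = Phi(-0.5) ~ 0.31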

7 / 47
Latent variable interpretation

Example 1: microeconomics.
▶ Y = agent’s choice between two options. Let U1 be the (expected)
utility of this agent when choosing option Y = 1, and U0 the utility
when choosing Y = 0.
▶ Then define the difference between the utilities of the two choices as
Y ∗ = U1 − U0 .
▶ If the agent is rational the chosen option is the one that generates
the highest utility:

Y = 1{U1 ≥ U0 } = 1{Y ∗ ≥ 0}.

Example 2: corporate finance.


▶ The default (Y = 1) of a company occurs if its debt, denoted D, is
larger than some threshold value S (possibly random).
▶ The latent variable is then Y ∗ = D − S.

8 / 47
Latent variable interpretation

Example 3: biostatistics.
▶ Y = 1 if an individual is ill, 0 otherwise.
▶ An individual has recovered if the number of bacteria N (for
example) has fallen below a certain threshold S, which may
be individual-specific.
▶ We then have Y ∗ = N − S.
Example 4: education.
▶ Y = 1 if a student graduates and obtains a diploma, 0 otherwise.
▶ The diploma is obtained if the student’s average grade M exceeds a
fixed threshold s.
▶ Then Y ∗ = M − s.

9 / 47
Two important examples: the probit and logit models

▶ A priori any choice for F is possible.


▶ The two most common choices are
▶ F = Φ, distribution function of a N (0, 1) random variable: probit
model;
▶ F (x ) = Λ(x ) = 1/(1 + exp(−x )), the logistic distribution function:
logit model.
▶ The difference between the two distribution functions is quite small.
When |x | → +∞, φ(x ) = Φ′ (x ) ∝ exp(−x ²/2) while
Λ′ (x ) = Λ(x )(1 − Λ(x )) = O(exp(−|x |))
⇒ the logistic density function has fatter tails.

10 / 47
Parameters and marginal effects

▶ Qualitatively, the j−th component of X , Xj , has a positive effect


on P(Y = 1|X ) iff β0j > 0.
▶ Quantitatively, the interpretation of β0j is more subtle.
▶ In a standard linear model E (Y |X ) = X ′ β0 , the parameter β0j
corresponding to Xj can be interpreted as the marginal effect of Xj :

∂E (Y |X1 = x1 , ..., XK −1 = xK −1 )/∂xj = β0j .

▶ This effect is independent of x = (x1 , ..., xK −1 ).

11 / 47
Parameters and marginal effects

▶ But in binary models (as generally in non-linear models), the


marginal effect of Xj is no longer β0j , and it depends on x :

∂E (Y |X = x )/∂xj = [∂F (u)/∂u]|u=x ′ β0 · ∂(x ′ β0 )/∂xj = f (x ′ β0 ) β0j , with f = F ′ .

▶ Remark 1 : if f is symmetric and unimodal, the effect of a variable


on P(Y = 1|X ) is larger the closer x ′ β0 is to 0, that is
P(Y = 1|X ) ≃ 0.5.
▶ Remark 2 : we always have

β0l / β0j = [∂E (Y |X = x )/∂xl ] / [∂E (Y |X = x )/∂xj ]

The ratio of parameters can thus be interpreted as in a standard


linear model.

12 / 47
Parameters and marginal effects
▶ Besides estimating β0j, it is of interest to estimate the average
marginal effect:
∆j = E [f (X ′ β0 )] β0j .
This is the expected marginal effect of Xj across the whole
population.
▶ It is also possible to focus on marginal effects calculated across
sub-populations, by calculating the expected marginal effect for units
verifying X ∈ A (for example): E [f (X ′ β0 )|X ∈ A] β0j , or the
marginal effect for the average (representative) unit, f (E (X )′ β0 ) β0j .
▶ When an explanatory variable Xj is discrete (dichotomous), it is
more appropriate to replace marginal effects by
F (x−j′ β0−j + β0j ) − F (x−j′ β0−j ),
where x−j = (1, x1 , ..., xj−1 , xj+1 , ..., xK −1 )′ . The average effect of
globally switching Xj from 0 to 1 is then:
E [F (X−j′ β0−j + β0j ) − F (X−j′ β0−j )] .
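
As an illustration, a minimal R sketch computing the average marginal effect of a continuous regressor and the average effect of switching a dummy from 0 to 1, after a logit fit. It assumes a data frame df with a binary outcome y, a continuous regressor x1 and a dummy d; all names are hypothetical.

# Sketch: average marginal effects after a logit fit (variable names are illustrative)
fit <- glm(y ~ x1 + d, family = binomial(link = "logit"), data = df)
b   <- coef(fit)
eta <- predict(fit, type = "link")                  # Xi' beta_hat for each i
ame_x1 <- mean(dlogis(eta)) * b["x1"]               # E[f(X'b)] * b_j, f = logistic density
p1 <- predict(fit, newdata = transform(df, d = 1), type = "response")
p0 <- predict(fit, newdata = transform(df, d = 0), type = "response")
ame_d <- mean(p1 - p0)                              # average effect of switching d: 0 -> 1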
13 / 47
A specificity of the logit model: the odds-ratios.

▶ Define the odds as the ratio of probabilities
P(Y = 1|X )/P(Y = 0|X ).
▶ In the case of a logit model:

P(Y = 1|X = x )/P(Y = 0|X = x ) = [1/(1 + e^{−x ′ β0 })] / [e^{−x ′ β0 }/(1 + e^{−x ′ β0 })] = e^{x ′ β0 }

▶ Consider now a binary explanatory variable Xj ∈ {0, 1}. We then get:

e^{β0j } = [P(Y = 1|X−j = x−j , Xj = 1)/P(Y = 0|X−j = x−j , Xj = 1)] /
[P(Y = 1|X−j = x−j , Xj = 0)/P(Y = 0|X−j = x−j , Xj = 0)].

▶ Therefore e β0j equals the ratio of risks (odds-ratio) corresponding to


Xj = 1 and Xj = 0. This odds-ratio is independent of the value of
X−j .
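
A small R sketch of this property, with hypothetical variable names: exp() of the logit coefficient of a dummy d is the estimated odds-ratio, and the ratio of fitted odds does not depend on the other covariates.

# Sketch: odds-ratio associated with a binary regressor d in a logit model
fit <- glm(y ~ x1 + d, family = binomial(link = "logit"), data = df)   # d in {0,1}
exp(coef(fit)["d"])                              # estimated odds-ratio for d = 1 vs d = 0
odds <- function(x1, d) {                        # fitted odds at (x1, d)
  p <- plogis(sum(coef(fit) * c(1, x1, d)))
  p / (1 - p)
}
odds(0, 1) / odds(0, 0)                          # equals exp(coef(fit)["d"]) ...
odds(2, 1) / odds(2, 0)                          # ... whatever the value of x1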

14 / 47
Outline

Introduction

Modelization and parameters of interest

Identification and estimation

Model quality, selection of variables

Linear probability models

Potential Outcomes

Application

15 / 47
Identification
▶ Let us return to the equation:
Y = 1{X ′ β0 + ε ≥ 0}.
▶ Two questions: (i) why is the threshold fixed at 0? (ii) why is the
variance of ε fixed (at 1 for the probit model, at π²/3 for the logit
model)?
▶ Reason: the model parameters are not identified otherwise. Indeed,
we have:
Y = 1{β01 + X−1′ β0−1 + ε ≥ s} ⇐⇒ Y = 1{β01 − s + X−1′ β0−1 + ε ≥ 0}.
▶ Put differently, it is not possible to identify separately the constant
β01 and the threshold s. The threshold s is therefore (arbitrarily)
fixed at 0.
▶ Similarly, it is not possible to identify separately β0 and the variance
σ0² of the error term ε. Indeed,
Y = 1{X ′ β0 + ε ≥ 0} ⇐⇒ Y = 1{X ′ (β0 /σ0 ) + ε/σ0 ≥ 0}.
▶ The variance σ02 is therefore arbitrarily fixed.
16 / 47
Identification
When s and σ0 are fixed and if E (XX ′ ) is an invertible matrix, the model
is identified.
Proof: letting Pβ be the distribution of the observations when the true
parameter is β, it is necessary to show that the function β 7→ Pβ is
injective. In our conditional binary model, the identifiability of the model
is equivalent to

Pβ (Y = 1|X ) = Pβ ′ (Y = 1|X ) ⇒ β = β ′ ∀(β, β ′ ).

But
(E (XX ′ ) is invertible) ⇐⇒ (X ′ λ = 0 a.s. =⇒ λ = 0).
Consequently,

Pβ (Y = 1|X ) = Pβ ′ (Y = 1|X ) ⇐⇒ F (X ′ β) = F (X ′ β ′ )
⇐⇒ X ′ β = X ′ β ′ (a.s.)
⇐⇒ β = β ′ . □

17 / 47
Estimation of the model: the maximum likelihood method

▶ Let us now consider the estimation of β0 using a sample of i.i.d.


observations ((Y1 , X1 ), ..., (Yn , Xn )).
▶ Since the model is fully parametric, the unknown parameters can be
estimated by the method of maximum likelihood.
▶ The density of Y conditional on X is
P(Y = y |X = x ) = [P(Y = 1|X = x )]^y [P(Y = 0|X = x )]^{1−y}
= F (x ′ β)^y (1 − F (x ′ β))^{1−y} .

▶ The likelihood function of the i.i.d. observations
(Y, X) = ((Y1 , X1 ), ..., (Yn , Xn )) conditional on X can then be
written as

Ln (Y|X; β) = ∏_{i=1}^n F (Xi′ β)^{Yi} (1 − F (Xi′ β))^{1−Yi} .

18 / 47
Estimation of the model: the maximum likelihood method

▶ The maximum likelihood estimator is then defined as:

β̂ ∈ arg max_{β∈R^K} Ln (Y|X; β).

▶ Note that this estimator is not necessarily unique, and it might not
even exist.
▶ Also note that Ln (Y|X; β) is the likelihood conditional on X.
Denoting g (Xi ) the density of Xi , the unconditional likelihood can
be written

Ln (Y, X; β) = ∏_{i=1}^n F (Xi′ β)^{Yi} (1 − F (Xi′ β))^{1−Yi} g (Xi ).

Since in practice the distribution of X is typically of no particular


interest, the focus is on the conditional likelihood.

19 / 47
First order conditions
▶ It is easier to maximize the log-likelihood function:
ℓn (Y|X; β) = ∑_{i=1}^n [Yi ln F (Xi′ β) + (1 − Yi ) ln(1 − F (Xi′ β))]

▶ We have ∂F (Xi′ β)/∂β = f (Xi′ β)Xi . Therefore:

∂ℓn /∂β (Y|X; β) = ∑_{i=1}^n [Yi f (Xi′ β)/F (Xi′ β) − (1 − Yi ) f (Xi′ β)/(1 − F (Xi′ β))] Xi .

Reorganizing terms gives

∂ℓn /∂β (Y|X; β) = ∑_{i=1}^n f (Xi′ β)/[F (Xi′ β)(1 − F (Xi′ β))] [Yi − F (Xi′ β)] Xi . (4)

▶ The first order conditions can hence be written as:

∑_{i=1}^n f (Xi′ β̂)/[F (Xi′ β̂)(1 − F (Xi′ β̂))] [Yi − F (Xi′ β̂)] Xi = 0 (5)

which do not admit, in general, a simple analytical solution.
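
A direct R transcription of the log-likelihood and of the score (4), for a generic link F (probit by default); at the MLE the score (5) should be numerically close to zero. Function and argument names are illustrative.

# Sketch: log-likelihood and score of the binary model, as in (4)-(5)
loglik <- function(beta, y, X, F = pnorm) {
  p <- F(as.vector(X %*% beta))
  sum(y * log(p) + (1 - y) * log(1 - p))
}
score <- function(beta, y, X, F = pnorm, f = dnorm) {
  eta <- as.vector(X %*% beta)
  w   <- f(eta) / (F(eta) * (1 - F(eta)))     # weight appearing in (4)
  colSums(w * (y - F(eta)) * X)               # should be ~0 at the MLE, cf. (5)
}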


20 / 47
Existence and unicity of the solution

▶ If a dichotomous variable Xj is such that xij = 1 ⇒ yi = 1 for all
i (or xij = 1 ⇒ yi = 0 for all i), then the estimator does not exist.
▶ Indeed, ∂ℓn /∂βj can then be written as (considering the case xij = 1
⇒ yi = 1 for all i)

∑_{i=1}^n f (xi′ β)/[F (xi′ β)(1 − F (xi′ β))] [yi − F (xi′ β)] xij = ∑_{i: xij =1} f (xi′ β)/F (xi′ β) > 0 ∀β

▶ This example shows that, among observations such that xij = 1, the
variable yi should vary across i in order to estimate β0j . In the
absence of variation, econometrics/statistics software packages like
Stata indicate that β0j is not identified and will automatically
“expel” xj from the list of explanatory variables.

21 / 47
Existence and unicity of the solution

▶ In the case of a logit model, we have Λ′ = Λ(1 − Λ), so


∂²ℓn /∂β∂β ′ (Y|X; β) = − ∑_{i=1}^n Λ′ (Xi′ β) Xi Xi′ << 0.

The matrix of second derivatives is negative definite. The
log-likelihood is then strictly concave ⇒ the first order conditions
admit at most one solution, and this solution corresponds to a
global maximum.
▶ In the case of a probit model, it can also be shown that the
log-likelihood is strictly concave.
▶ In general, the maximisation program is not necessarily concave and
multiple solutions may exist. Ideally it must then be verified that the
solution corresponds to a global maximum.

22 / 47
Remarks on the optimisation procedure

▶ Unlike the OLS estimator, the maximum likelihood estimator


cannot, in general, be expressed explicitly.
▶ The estimator can be obtained numerically using optimisation
algorithms such as the Newton-Raphson algorithm (there are other
algorithms as well). Starting with an initial value β (0) , define the
sequence (β (m) )m∈N by:

β (m+1) = β (m) − [∂²ℓn /∂β∂β ′ (Y|X; β (m) )]^{−1} ∂ℓn /∂β (Y|X; β (m) )

▶ Under strict concavity of ℓn (Y|X; β), the sequence β (m) necessarily
converges to the maximum likelihood estimator.
▶ In the cases of the logit and probit models the iterations typically
converge very quickly to the optimum.
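
A minimal Newton-Raphson sketch for the logit case (in practice glm() performs an equivalent iteratively reweighted least squares); X is assumed to be the n × K design matrix including the constant.

# Sketch: Newton-Raphson iterations for the logit log-likelihood (strictly concave)
newton_logit <- function(y, X, tol = 1e-10, maxit = 25) {
  beta <- rep(0, ncol(X))                            # initial value beta^(0)
  for (m in seq_len(maxit)) {
    p    <- as.vector(plogis(X %*% beta))
    grad <- colSums((y - p) * X)                     # score of the logit model
    H    <- -t(X) %*% ((p * (1 - p)) * X)            # Hessian = -sum Lambda'(Xi'b) Xi Xi'
    step <- solve(H, grad)
    beta <- beta - as.vector(step)                   # beta^(m+1) = beta^(m) - H^{-1} grad
    if (max(abs(step)) < tol) break
  }
  beta
}
# Should agree with coef(glm(y ~ X - 1, family = binomial("logit")))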

23 / 47
Asymptotic properties

Under various technical conditions (see the Statistics 1 course), it can be
shown that β̂ →p β0 and √n(β̂ − β0 ) →d N (0, I1^{−1}(β0 )), where I1 (β0 )
is the Fisher information associated with one observation. Moreover,

I1 (β0 ) = E [ f ²(X ′ β0 ) / (F (X ′ β0 )(1 − F (X ′ β0 ))) · XX ′ ] .

This Fisher information can be consistently estimated by:

Î1 (β0 ) = (1/n) ∑_{i=1}^n f ²(Xi′ β̂) / (F (Xi′ β̂)(1 − F (Xi′ β̂))) · Xi Xi′ .

Recall that the maximum likelihood estimator is asymptotically the best
“regular” estimator: if another estimator β̃ verifies
√n(β̃ − β0 ) →d N (0, V ), then V >> I1^{−1}(β0 ).
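
A sketch estimating Î1 (β0 ) and the asymptotic standard errors sqrt([Î1^{−1}]jj / n) after a logit fit; it should essentially reproduce the standard errors reported by glm(). Data frame and variable names are hypothetical.

# Sketch: estimated Fisher information and asymptotic standard errors (logit case)
fit <- glm(y ~ x1 + d, family = binomial("logit"), data = df)
X   <- model.matrix(fit)
eta <- as.vector(X %*% coef(fit))
w   <- dlogis(eta)^2 / (plogis(eta) * (1 - plogis(eta)))   # f^2 / [F(1-F)] at beta_hat
I1  <- crossprod(sqrt(w) * X) / nrow(X)                    # (1/n) sum_i w_i Xi Xi'
se  <- sqrt(diag(solve(I1)) / nrow(X))                     # se(beta_hat_j)
cbind(se, summary(fit)$coefficients[, "Std. Error"])       # should be nearly identical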

24 / 47
Asymptotic properties
Proof of the formula for I1 (β0 ): we have

I1 (β0 ) = V (∂ℓ1 /∂β (Y |X ; β0 ))

where ℓ1 (Y |X ; β0 ) is the log-likelihood (evaluated at β0 ) of one
observation:

ℓ1 (Y |X ; β0 ) = Y ln (F (X ′ β0 )) + (1 − Y ) ln (1 − F (X ′ β0 ))

Using the law of total variance gives:

I1 (β0 ) = E [V (∂ℓ1 /∂β (Y |X ; β0 ) | X )] + V (E [∂ℓ1 /∂β (Y |X ; β0 ) | X ]) .

Using equation (4) gives

∂ℓ1 /∂β (Y |X ; β0 ) = f (X ′ β0 )/[F (X ′ β0 )(1 − F (X ′ β0 ))] [Y − F (X ′ β0 )] X .

25 / 47
Asymptotic properties

Therefore

E [∂ℓ1 /∂β (Y |X ; β0 ) | X ] = 0

because E (Y − F (X ′ β0 )|X ) = 0, and

V (∂ℓ1 /∂β (Y |X ; β0 ) | X ) = E [ (∂ℓ1 /∂β (Y |X ; β0 )) (∂ℓ1 /∂β (Y |X ; β0 ))′ | X ]
= E [ f (X ′ β0 )² (Y − F (X ′ β0 ))² XX ′ / (F (X ′ β0 )² (1 − F (X ′ β0 ))²) | X ]
= f ²(X ′ β0 ) XX ′ / (F (X ′ β0 )(1 − F (X ′ β0 )))

since E [(Y − F (X ′ β0 ))² | X ] = F (X ′ β0 )(1 − F (X ′ β0 )). The result is
then obtained.

26 / 47
Hypothesis testing

▶ We wish to test a hypothesis of the form

H0 : Rβ0 = 0 against H1 : Rβ0 ̸= 0 (R a p × K matrix, p ≤ K ).

▶ For example, β0j = 0 or β02 = ... = β0K −1 = 0 (that is β0−1 = 0).
▶ We use one of the following three maximum-likelihood based tests:
the Wald test, the score test, and the likelihood ratio test. The
test statistics associated with the three tests are:

ξnW = n β̂′ R ′ [R Î1 (β0 )^{−1} R ′ ]^{−1} R β̂
ξnS = (1/n) ∂ℓn /∂β ′ (Y|X; β̂C ) Î1 (β0 )^{−1} ∂ℓn /∂β (Y|X; β̂C )
ξnR = 2 [ℓn (Y|X; β̂) − ℓn (Y|X; β̂C )]

where β̂C is the constrained maximum likelihood estimator, i.e., the
one obtained under H0 .

27 / 47
Hypothesis testing
▶ Concerning the statistic ξnW , Î1 (β0 ) corresponds to the formula on
page 24.
▶ Concerning the statistic ξnS , it is the same formula except that β̂ is
replaced by β̂C .
▶ Note that the three statistics tend to be “small” under the
hypothesis H0 .
▶ Under H0 , ξnT →d χ2 (p) (T = W , S or R).
▶ The critical region of a test of asymptotic level α therefore takes the
form {ξnT > qχ2 (p) (1 − α)} where qχ2 (p) (y ) is the quantile of order y
of a χ2 (p) distribution.
▶ To test H0 : β0j = 0 against H1 : β0j ̸= 0, the usual t-test is
mostly used. This test produces the same result as the Wald test,
because (you should verify this) ξnW = (β̂j /se(β̂j ))² ≡ tj² and
|tj | > qN(0,1) (1 − α/2) ⇔ ξnW > qχ2 (1) (1 − α), where
qN(0,1) (1 − α/2) is the quantile of order 1 − α/2 of a N (0, 1).
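
A sketch of the Wald and likelihood-ratio tests of H0 : β0−1 = 0 (all slopes equal to zero) after a logit fit; since vcov() essentially returns Î1^{−1}/n, the quadratic form below coincides with ξnW . Variable names are hypothetical.

# Sketch: Wald and likelihood-ratio tests of H0: all slope coefficients are zero
fit_u <- glm(y ~ x1 + d, family = binomial("logit"), data = df)   # unrestricted
fit_c <- glm(y ~ 1,      family = binomial("logit"), data = df)   # restricted (under H0)
b     <- coef(fit_u)[-1]                                 # slopes
V     <- vcov(fit_u)[-1, -1]                             # ~ (I1_hat)^{-1}/n for the slopes
xi_W  <- as.numeric(t(b) %*% solve(V, b))                # Wald statistic
xi_R  <- as.numeric(2 * (logLik(fit_u) - logLik(fit_c))) # likelihood-ratio statistic
qchisq(0.95, df = length(b))                             # chi^2(p) critical value, alpha = 5%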
28 / 47
Outline

Introduction

Modelization and parameters of interest

Identification and estimation

Model quality, selection of variables

Linear probability models

Potential Outcomes

Application

29 / 47
Explanatory power of the model

▶ Let us define, similarly to the R 2 in the linear model, the pseudo-R 2 as:

pseudo-R 2 = 1 − ℓn (Y|X; β̂)/ℓn (Y|X; β̂C )

where β̂C is the parameter estimate obtained under the null
hypothesis β0−1 = 0.
▶ Since 0 > ℓn (Y|X; β̂) ≥ ℓn (Y|X; β̂C ), the pseudo-R 2 belongs to
[0, 1[. It is close to 1 when

Yi = 1 and F (Xi′ β̂) ≃ 1, or Yi = 0 and F (Xi′ β̂) ≃ 0, for most observations.

▶ Like the R 2 , the pseudo-R 2 increases mechanically with the number
of variables in the model.
▶ Other indicators: concordance table, score, percentage of concordant
pairs...
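
A sketch of the pseudo-R 2 computed from the fitted model and the intercept-only model (hypothetical variable names).

# Sketch: pseudo-R^2 = 1 - l_n(beta_hat) / l_n(beta_hat_C)
fit   <- glm(y ~ x1 + d, family = binomial("logit"), data = df)
fit_c <- glm(y ~ 1,      family = binomial("logit"), data = df)   # constant only (H0)
1 - as.numeric(logLik(fit)) / as.numeric(logLik(fit_c))           # pseudo-R^2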

30 / 47
Choice of variables
▶ Trade-off between:
▶ increase in the explanatory power of the model;
▶ loss of precision due to estimating a large number of parameters.
▶ One can test the null hypothesis that variables have no effect,
possibly through sequential procedures (forward, backward, ...).
▶ Drawback: when n tends to infinity, such procedures entail that the
null is rejected for most of the explanatory variables.
▶ One can also use the information criteria AIC (Akaike Information
Criterion, Akaike, 1973) or BIC (Bayesian Information Criterion,
Schwarz, 1978).
▶ Such criteria are used to determine which model to select. Suppose
there are J possible parametric models:

(Pβ (1) )β (1) ∈B (1) , ..., (Pβ (J) )β (J) ∈B (J) .

The objective is to select the “true” data-generating model.


31 / 47
Choice of variables

▶ Akaike’s criterion for model j having pj parameters:

AIC(j) = ℓn (Y|X; βb(j) ) − pj

The chosen model is then model j0 = arg maxj AIC(j).


▶ This criterion does not necessarily lead to a correct selection when n
tends to infinity. Indeed, it does not sufficiently penalize the number
of parameters of each model.
▶ To account for that drawback, Schwarz (1978) proposed the
following criterion:

BIC(j) = ℓn (Y|X; β̂(j) ) − (pj /2) ln(n)
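
A sketch computing the two criteria exactly as defined above (to be maximized); note that R's built-in AIC()/BIC() use the −2 × log-likelihood convention, so they rank models in the reverse order but select the same one. Model formulas are illustrative.

# Sketch: AIC and BIC as defined above, to be maximized over candidate models
crit <- function(fit) {
  ll <- as.numeric(logLik(fit)); p <- length(coef(fit)); n <- nobs(fit)
  c(AIC = ll - p, BIC = ll - (p / 2) * log(n))
}
m1 <- glm(y ~ x1,     family = binomial("logit"), data = df)
m2 <- glm(y ~ x1 + d, family = binomial("logit"), data = df)
rbind(m1 = crit(m1), m2 = crit(m2))          # keep the model with the largest criterion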

32 / 47
Outline

Introduction

Modelization and parameters of interest

Identification and estimation

Model quality, selection of variables

Linear probability models

Potential Outcomes

Application

33 / 47
Advantage of a linear model
▶ Sometimes, for simplicity reasons, a linear probability model is
estimated instead of a logit or probit model:
E (Y |X ) = X ′ β0 .
▶ Example: panel data. Suppose that
E (Yit |Xit , αi ) = Xit′ β0 + αi ,
where αi is a unit-specific fixed effect which is possibly correlated
with Xit .
▶ This unobserved fixed effect can be eliminated by applying the first
difference or within operator:
E (Yit − Yit−1 |Xit , Xit−1 ) = (Xit − Xit−1 )′ β0 .
▶ In non-linear models, this trick does not work since
E (Yit − Yit−1 |Xit , Xit−1 , αi ) = F (Xit′ β + αi ) − F (Xit−1′ β + αi ).
▶ Furthermore, maximum likelihood estimation of (β, α1 , ..., αn ) does
not produce consistent estimators of the parameters because of the
so-called incidental parameter problem: the number of parameters
tends to infinity with n.
34 / 47
Modelization and estimation
▶ The linear probability model can be rewritten as Y = X ′ β0 + ε, with

ε = 1 − X ′ β0 with (conditional) probability X ′ β0 ,
ε = −X ′ β0 with probability 1 − X ′ β0 .
▶ Therefore:

V (ε|X ) = E (ε2 |X ) = X ′ β0 (1 − X ′ β0 )2 + (1 − X ′ β0 )(X ′ β0 )2


= X ′ β0 (1 − X ′ β0 ).
▶ The model is heteroscedastic. It can be estimated by OLS but also
by Generalized Least Squares (GLS):
▶ First estimate β0 by OLS: βbOLS .
▶ Then reestimate β0 by
β̂GLS = arg min_β ∑_{i=1}^n [1/(Xi′ β̂OLS (1 − Xi′ β̂OLS ))] (Yi − Xi′ β)²

▶ The GLS estimator is more precise in theory, but not necessarily in
practice if Xi′ β̂OLS ≃ 0 or ≃ 1 for certain observations i.
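
A sketch of the two-step (feasible) GLS described above, dropping observations whose OLS fitted probabilities fall outside ]0, 1[ so that the weights are well defined (hypothetical variable names).

# Sketch: two-step GLS for the linear probability model
ols  <- lm(y ~ x1 + d, data = df)                        # step 1: OLS
phat <- fitted(ols)                                      # Xi' beta_hat_OLS
ok   <- phat > 0 & phat < 1                              # weights only defined on ]0,1[
gls  <- lm(y ~ x1 + d, data = df[ok, ],
           weights = 1 / (phat[ok] * (1 - phat[ok])))    # step 2: weighted least squares
coef(gls)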
35 / 47
Comparison logit/probit/linear model

▶ The difference between the logit, probit and linear models lies in the
distribution function F (.) chosen in E (Y |X ) = F (X ′ β0 ):
F = Λ, Φ, or the identity function, depending on the type of model.
▶ There are semi-parametric models in which it is assumed that
P(Y = 1|X ) = F (X ′ β0 ) with F and β0 unknown. This is equivalent
to considering a latent model Y ∗ = X ′ β0 + ε where ε ⊥⊥ X and with
an unknown distribution function.
▶ Such models are less restrictive but are harder to estimate.
▶ The results obtained using a logit model, a probit model, or a linear
model, are often quite similar in terms of their marginal effects.
▶ In terms of the coefficients, we have in general

β̂logit ≃ 1.6 β̂probit ≃ 4 β̂linear .

36 / 47
Outline

Introduction

Modelization and parameters of interest

Identification and estimation

Model quality, selection of variables

Linear probability models

Potential Outcomes

Application

37 / 47
Potential Outcomes

Consider a binary outcome Y and an experimental framework where
(Y (0), Y (1), X ) ⊥⊥ D, and a logit regression of Y on (1, D, X ):
▶ The experimental framework ensures E (Y (d)|X ) = E (Y |D = d, X ).
▶ The first order conditions of the MLE are

∑_{i=1}^n [Yi − Λ(β̂0 + β̂D Di + Xi′ β̂X )] = 0
∑_{i=1}^n Di [Yi − Λ(β̂0 + β̂D Di + Xi′ β̂X )] = 0
∑_{i=1}^n Xi [Yi − Λ(β̂0 + β̂D Di + Xi′ β̂X )] = 0

38 / 47
Logit and Potential Outcomes

The first two equations are equivalent to:

(1/∑_{i=1}^n (1 − Di )) ∑_{i=1}^n (1 − Di ) [Yi (0) − Λ(β̂0 + Xi′ β̂X )] = 0

(1/∑_{i=1}^n Di ) ∑_{i=1}^n Di [Yi (1) − Λ(β̂0 + β̂D + Xi′ β̂X )] = 0

Let Zi = (1, Di , Xi′ )′ and assume E (Zi Zi′ ) is non-singular. Then
(b0 , bD , bX ) 7→ E (ℓ1 (Y|X; (b0 , bD , bX′ )′ )) admits a unique maximum,
denoted (β0 , βD , βX ).
We can show that (β̂0 , β̂D , β̂X ) converges to (β0 , βD , βX ), and the first
order conditions ensure the non-random (population) equations:

E (Yi (0) − Λ(β0 + Xi′ βX )|Di = 0) = 0, (6)
E (Yi (1) − Λ(β0 + βD + Xi′ βX )|Di = 1) = 0. (7)

39 / 47
Robust estimation of the Average Treatment Effect
Even if E (Y (d)|X ) ̸= Λ(β0 + βD d + X ′ βX ) (misspecification):
Independence of D from (Y (0), Y (1), X ) ensures that the average
treatment effect is (!):

δ := E (Y (1) − Y (0))
  =(1) E (Y (1)|D = 1) − E (Y (0)|D = 0)
  =(2) E (Λ(β0 + βD + Xi′ βX )|D = 1) − E (Λ(β0 + Xi′ βX )|D = 0)
  =(3) E (Λ(β0 + βD + Xi′ βX ) − Λ(β0 + Xi′ βX )) ,

where (1) and (3) are consequences of D ⊥⊥ (Y (0), Y (1), X ), and (2) is a
consequence of (6) and (7).
δ can be estimated consistently by:

δ̂ = (1/n) ∑_{i=1}^n [Λ(β̂0 + β̂D + Xi′ β̂X ) − Λ(β̂0 + Xi′ β̂X )]
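
A sketch of this estimator: average, over the sample, the difference of fitted logit probabilities with D set to 1 and to 0 for every unit (hypothetical variable names, d denoting the treatment).

# Sketch: ATE estimator delta_hat from a logit of Y on (1, D, X)
fit <- glm(y ~ d + x1, family = binomial("logit"), data = df)
p1  <- predict(fit, newdata = transform(df, d = 1), type = "response")  # Lambda(b0+bD+Xi'bX)
p0  <- predict(fit, newdata = transform(df, d = 0), type = "response")  # Lambda(b0+Xi'bX)
mean(p1 - p0)                                                           # delta_hat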

40 / 47
Robust estimation of the Average Treatment Effect
In an experimental design, with a binary outcome:
1. The naive estimator of the ATE,
δ̂ = ∑_{i=1}^n Yi Di / ∑_{i=1}^n Di − ∑_{i=1}^n (1 − Di )Yi / ∑_{i=1}^n (1 − Di ),
is obtained by (check this!):
▶ Linear regression of Y on (1, D): δ̂ = β̂D
▶ Logit regression of Y on (1, D): δ̂ = Λ(β̂0 + β̂D ) − Λ(β̂0 )
▶ Probit regression of Y on (1, D): δ̂ = Φ(β̂0 + β̂D ) − Φ(β̂0 )
▶ Considering any MLE based on P(Y = 1|D) = F (β0 + βD D) for F
continuous with continuous inverse: δ̂ = F (β̂0 + β̂D ) − F (β̂0 )
2. If some X also affect the potential outcomes Y (0), Y (1), more efficient
estimators to consider:
▶ Linear regression on (1, D, X ): δ̃ = β̂D
▶ Logit regression on (1, D, X ) (robust to misspecification):
δ̆ = (1/n) ∑_{i=1}^n Λ(β̂0 + β̂D + Xi′ β̂X ) − Λ(β̂0 + Xi′ β̂X )
▶ but estimators derived from a probit or another model for a binary
outcome are not robust to misspecification!
▶ The choice between δ̃ and δ̆ depends on their respective asymptotic
variances (use a robust variance estimator based on the sandwich formula!).

41 / 47
Conditional Average Treatment Effect in Experiments

Moreover, if E (Y (d)|X ) = Λ(β0 + βD d + X ′ βX ) (no misspecification):

δ̂(x ) = Λ(β̂0 + β̂D + x ′ β̂X ) − Λ(β̂0 + x ′ β̂X )
converges to the conditional average treatment effect:

δ(x ) := E (Y (1) − Y (0)|X = x )

This estimator is not robust to misspecification.


Comparison with a linear regression on (1, D, X ): E (Y (d)|X = x ) cannot
be equal to β0 + x ′ βX if x ′ βX has a support larger than [−β0 ; 1 − β0 ],
whatever β0 ∈ R.
A similar consistency result holds for the probit (or a model based on
another link function), but again these estimators of the CATE are not
robust to misspecification.

42 / 47
Outline

Introduction

Modelization and parameters of interest

Identification and estimation

Model quality, selection of variables

Linear probability models

Potential Outcomes

Application

43 / 47
Example: labor market participation of women

▶ The objective is to explain the labor market participation of women
(Y = 1 if the woman participates, Y = 0 otherwise) according to their
age, their diploma and their family situation (living in a couple or
not, number of children below the age of 3).
▶ The 2023 French labor force survey (Enquête Emploi) is used,
focusing on women aged less than 65 who have completed
their studies.
▶ Modalities of the variable measuring education level (DIP7):
1 Advanced diploma (Master level university or Grande Ecole)
2 Baccalauréat + 3/4 years
3 Baccalauréat + 2 years
4 Baccalauréat or brevet professionnel or equivalent diploma
5 CAP, BEP or equivalent diploma
6 Brevet des collèges
7 No diploma or CEP

44 / 47
Code R
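
The R code shown on this slide is not reproduced here; the sketch below only illustrates what such a specification could look like. The data-frame and variable names (enquete_emploi_2023, participation, age, dip7, couple, enfants_moins3, ...) are hypothetical and may differ from those of the Enquête Emploi files.

# Illustrative sketch only: data-frame and variable names are hypothetical
ee <- subset(enquete_emploi_2023, sexe == "F" & age < 65 & etudes_terminees == 1)
ee$dip7 <- factor(ee$dip7)                                # education level, 7 modalities
logit  <- glm(participation ~ age + dip7 + couple + enfants_moins3,
              family = binomial(link = "logit"), data = ee)
probit <- glm(participation ~ age + dip7 + couple + enfants_moins3,
              family = binomial(link = "probit"), data = ee)
summary(logit); summary(probit)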

45 / 47
Results: logit model coefficients

46 / 47
Results: probit model coefficients

47 / 47
