Binary
ENSAE, 2024/2025
1 / 47
Outline
Introduction
Potential Outcomes
Application
2 / 47
Motivation
3 / 47
General ideas
4 / 47
Outline
Introduction
Potential Outcomes
Application
5 / 47
Modeling
▶ Linear models are not well suited to analyze binary dependent
variables. Indeed, if Y ∈ {0, 1}, then
E (Y |X ) = P(Y = 1|X ) ∈ [0, 1]. (1)
In a linear model (under the exogeneity assumption E (ε|X ) = 0), it
follows that E (Y |X ) = X ′ β0 . But nothing guarantees that
X ′ β0 ∈ [0, 1].
▶ In order to satisfy the restriction (1), the following assumption is
made:
E (Y |X ) = F (X ′ β0 ), (2)
where F (.) is a strictly increasing bijective (and known) function
from R to ]0, 1[, that is to say F satisfies the properties of a
distribution function.
▶ N.B.: equation (2) corresponds to the equation of a so-called
Generalized Linear Model (GLM), i.e., a model of the form:
h(E (Y |X )) = X ′ β0 ,
where h is a known function called the link function.
6 / 47
Non-linear models and latent variables
Y∗ = X′β0 + ε,    (3)
Y = 1{Y∗ ≥ 0}.
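As a quick illustration, here is a minimal simulation sketch of the latent-variable model (3) with Gaussian errors, fitted by a probit regression (all names, such as n and beta0, are illustrative, not from the slides):

```r
# Sketch: simulate the latent-variable model (3) and fit a probit.
set.seed(1)
n     <- 1000
X     <- cbind(1, rnorm(n))              # intercept + one regressor
beta0 <- c(-0.5, 1)
ystar <- drop(X %*% beta0) + rnorm(n)    # latent Y* = X'beta0 + eps, eps ~ N(0,1)
Y     <- as.integer(ystar >= 0)          # observed Y = 1{Y* >= 0}
fit   <- glm(Y ~ X[, 2], family = binomial(link = "probit"))
coef(fit)                                # should be close to beta0
```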
7 / 47
Latent variable interpretation
Example 1: microeconomics.
▶ Y = agent’s choice between two options. Let U1 be the (expected)
utility of this agent when choosing option Y = 1, and U0 the utility
when choosing Y = 0.
▶ Then define the difference between the utilities of the two choices as
Y ∗ = U1 − U0 .
▶ If the agent is rational, the chosen option is the one that generates
the highest utility: Y = 1{U1 ≥ U0} = 1{Y∗ ≥ 0}.
8 / 47
Latent variable interpretation
Example 3: biostatistics.
▶ Y = 1 if an individual is ill, 0 otherwise.
▶ An individual has recovered if the number of bacteria N (for
example) has fallen below a certain threshold S, which may be
individual-specific.
▶ Then Y∗ = N − S.
Example 4: education.
▶ Y = 1 if a student graduates and obtains a diploma, 0 otherwise.
▶ The diploma is obtained if the student’s average grade M exceeds a
fixed threshold s.
▶ Then Y ∗ = M − s.
9 / 47
Two important examples: the probit and logit models
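As a reminder (a sketch, assuming the standard definitions used later in these slides: Λ(u) = exp(u)/(1 + exp(u)) for the logit and Φ, the standard normal cdf, for the probit), the two link functions can be compared in R:

```r
# Sketch: the two standard distribution functions.
# Lambda is the logistic cdf (logit); pnorm is the standard normal cdf (probit).
Lambda <- function(u) exp(u) / (1 + exp(u))
u <- seq(-4, 4, by = 0.01)
plot(u, Lambda(u), type = "l", ylab = "F(u)")   # logit curve
lines(u, pnorm(u), lty = 2)                     # probit curve, steeper near 0
```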
10 / 47
Parameters and marginal effects
In the linear model, the marginal effect of a regressor is constant:
$$\frac{\partial E(Y \mid X_1 = x_1, \ldots, X_{K-1} = x_{K-1})}{\partial x_j} = \beta_{0j}.$$
11 / 47
Parameters and marginal effects
In model (2), by the chain rule,
$$\frac{\partial E(Y \mid X = x)}{\partial x_j} = \left.\frac{\partial F(u)}{\partial u}\right|_{u = x'\beta_0} \times \frac{\partial x'\beta_0}{\partial x_j} = f(x'\beta_0)\,\beta_{0j}, \quad \text{with } f = F'.$$
Hence ratios of coefficients equal ratios of marginal effects:
$$\frac{\beta_{0l}}{\beta_{0j}} = \frac{\partial E(Y \mid X = x)/\partial x_l}{\partial E(Y \mid X = x)/\partial x_j}.$$
12 / 47
Parameters and marginal effects
▶ Besides estimating β0j, it is of interest to estimate the average
marginal effect:
∆j = E [f (X ′ β0 )] β0j .
This is the expected marginal effect of Xj across the whole
population.
▶ It is also possible to focus on marginal effects for
sub-populations, by computing the expected marginal effect for units
satisfying X ∈ A (for example), E [f (X ′ β0 )|X ∈ A] β0j , or the
marginal effect for the average (representative) unit, f (E (X )′ β0 ) β0j .
▶ When an explanatory variable Xj is discrete (dichotomous), it is
more appropriate to replace marginal effects by
$$F(x_{-j}'\beta_{0,-j} + \beta_{0j}) - F(x_{-j}'\beta_{0,-j}),$$
where x−j = (1, x1 , ..., xj−1 , xj+1 , ..., xK−1 )′. The average effect of
globally switching Xj from 0 to 1 is then:
$$E\left[F(X_{-j}'\beta_{0,-j} + \beta_{0j}) - F(X_{-j}'\beta_{0,-j})\right].$$
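A minimal sketch of how these quantities can be estimated by their sample analogues in R (simulated data; all variable names are illustrative):

```r
# Sketch: average marginal effects Delta_j = E[f(X'beta0)] beta0j
# estimated by sample analogues after a logit fit.
set.seed(1)
n  <- 2000
df <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.5))
df$y <- rbinom(n, 1, plogis(-0.2 + 0.8 * df$x1 + 0.5 * df$x2))

fit   <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = df)
betah <- coef(fit)
Xmat  <- model.matrix(fit)
dens  <- dlogis(drop(Xmat %*% betah))        # f = F' (logistic density)
mean(dens) * betah["x1"]                     # AME of the continuous x1

# For the binary x2, compare F at x2 = 1 vs x2 = 0 (discrete effect):
X1 <- X0 <- Xmat; X1[, "x2"] <- 1; X0[, "x2"] <- 0
mean(plogis(drop(X1 %*% betah)) - plogis(drop(X0 %*% betah)))
```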
13 / 47
A specificity of the logit model: the odds-ratios.
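Since the odds P(Y = 1|X)/P(Y = 0|X) equal exp(X′β0) in the logit model, exp(β0j) is the odds-ratio associated with a one-unit increase in Xj. A minimal sketch (simulated data, illustrative names):

```r
# Sketch: odds-ratios exp(beta_j) from a logit fit;
# in the logit model, odds P(Y=1|X)/P(Y=0|X) = exp(X'beta).
set.seed(1)
df  <- data.frame(x = rnorm(500))
df$y <- rbinom(500, 1, plogis(0.5 * df$x))
fit <- glm(y ~ x, family = binomial, data = df)
exp(cbind(OR = coef(fit), confint.default(fit)))  # ORs with Wald 95% CIs
```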
14 / 47
Outline
Introduction
Potential Outcomes
Application
15 / 47
Identification
▶ Let us return to the equation:
Y = 1{X ′ β0 + ε ≥ 0}.
▶ Two questions: (i) why is the threshold fixed at 0? (ii) why is the
variance of ε fixed (at 1 for the probit model, at π²/3 for the logit
model)?
▶ Reason: the model parameters are not identified otherwise. Indeed,
we have:
$$Y = 1\{\beta_{01} + X_{-1}'\beta_{0,-1} + \varepsilon \geq s\} \iff Y = 1\{\beta_{01} - s + X_{-1}'\beta_{0,-1} + \varepsilon \geq 0\}.$$
▶ Put differently, it is not possible to identify separately the constant
β01 and the threshold s. The threshold s is therefore (arbitrarily)
fixed at 0.
▶ Similarly, it is not possible to identify separately β0 and the variance
σ0² of the error term ε. Indeed,
Y = 1{X ′ β0 + ε ≥ 0} ⇐⇒ Y = 1{X ′ (β0 /σ0 ) + ε/σ0 ≥ 0}.
▶ The variance σ02 is therefore arbitrarily fixed.
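A small simulation sketch of this scale normalization (illustrative names): with ε ∼ N(0, σ0²), a probit fit recovers β0/σ0, not β0.

```r
# Sketch: scale non-identification. The probit fit estimates beta0/sigma0.
set.seed(1)
n <- 1e5
x <- rnorm(n); beta0 <- 2; sigma0 <- 4
y <- as.integer(beta0 * x + sigma0 * rnorm(n) >= 0)
coef(glm(y ~ x, family = binomial(link = "probit")))["x"]  # ~ beta0/sigma0 = 0.5
```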
16 / 47
Identification
When s and σ0 are fixed and if E (XX ′ ) is an invertible matrix, the model
is identified.
Proof: letting Pβ denote the distribution of the observations when the true
parameter is β, we must show that the map β ↦ Pβ is
injective. In our conditional binary model, identifiability
is equivalent to:
$$P_\beta(Y = 1 \mid X) = P_{\beta'}(Y = 1 \mid X) \text{ a.s.} \implies \beta = \beta'.$$
But
$$\big(E(XX') \text{ is invertible}\big) \iff \big(X'\lambda = 0 \text{ a.s.} \implies \lambda = 0\big).$$
Consequently,
$$P_\beta(Y = 1 \mid X) = P_{\beta'}(Y = 1 \mid X) \iff F(X'\beta) = F(X'\beta') \iff X'\beta = X'\beta' \iff \beta = \beta'. \quad \square$$
17 / 47
Estimation of the model: the maximum likelihood method
18 / 47
Estimation of the model: the maximum likelihood method
▶ Note that this estimator is not necessarily unique, and it might not
even exist.
▶ Also note that Ln (Y|X; β) is the likelihood conditional on X.
Denoting by g(Xi) the density of Xi, the unconditional likelihood can
be written
$$L_n(\mathbf{Y}, \mathbf{X}; \beta) = \prod_{i=1}^{n} F(X_i'\beta)^{Y_i}\,\big(1 - F(X_i'\beta)\big)^{1 - Y_i}\, g(X_i).$$
19 / 47
First order conditions
▶ It is easier to maximize the log-likelihood function:
$$\ell_n(\mathbf{Y} \mid \mathbf{X}; \beta) = \sum_{i=1}^{n} Y_i \ln F(X_i'\beta) + (1 - Y_i)\ln\big(1 - F(X_i'\beta)\big).$$
▶ This example shows that, among observations such that xij = 1, the
variable yi must vary across i in order to estimate β0j . In the
absence of such variation, econometrics/statistics software packages like
Stata indicate that β0j is not identified and automatically
“expel” xj from the list of explanatory variables.
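To make the maximization of ℓn concrete, here is a minimal sketch (simulated data, illustrative names) that maximizes the logit log-likelihood with optim() and checks the result against glm():

```r
# Sketch: direct maximization of the logit log-likelihood vs. glm().
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
X <- cbind(1, x)

negloglik <- function(beta) {
  p <- plogis(drop(X %*% beta))
  -sum(y * log(p) + (1 - y) * log(1 - p))   # minus ell_n
}
opt <- optim(c(0, 0), negloglik, method = "BFGS")
cbind(optim = opt$par, glm = coef(glm(y ~ x, family = binomial)))
```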
21 / 47
Existence and uniqueness of the solution
22 / 47
Remarks on the optimisation procedure
23 / 47
Asymptotic properties
The Fisher information matrix for one observation is
$$I_1(\beta_0) = E\left[\frac{f^2(X'\beta_0)}{F(X'\beta_0)\big(1 - F(X'\beta_0)\big)}\, XX'\right].$$
24 / 47
Asymptotic properties
Proof of the formula for I1(β0): we have
$$I_1(\beta_0) = V\!\left(\frac{\partial \ell_1}{\partial \beta}(Y \mid X; \beta_0)\right),$$
with
$$\ell_1(Y \mid X; \beta_0) = Y \ln F(X'\beta_0) + (1 - Y)\ln\big(1 - F(X'\beta_0)\big),$$
$$\frac{\partial \ell_1}{\partial \beta}(Y \mid X; \beta_0) = \frac{f(X'\beta_0)}{F(X'\beta_0)\big(1 - F(X'\beta_0)\big)}\,\big[Y - F(X'\beta_0)\big]\, X.$$
25 / 47
Asymptotic properties
Therefore
$$E\!\left(\frac{\partial \ell_1}{\partial \beta}(Y \mid X; \beta_0)\,\Big|\, X\right) = 0$$
because E(Y − F(X′β0)|X) = 0, and
$$\begin{aligned}
V\!\left(\frac{\partial \ell_1}{\partial \beta}(Y \mid X; \beta_0)\,\Big|\, X\right)
&= E\!\left(\frac{\partial \ell_1}{\partial \beta}(Y \mid X; \beta_0)\,\frac{\partial \ell_1}{\partial \beta}(Y \mid X; \beta_0)'\,\Big|\, X\right)\\
&= E\!\left(\frac{f(X'\beta_0)^2}{F(X'\beta_0)^2\big(1 - F(X'\beta_0)\big)^2}\,\big(Y - F(X'\beta_0)\big)^2\, XX'\,\Big|\, X\right)\\
&= \frac{f^2(X'\beta_0)\, XX'}{F(X'\beta_0)\big(1 - F(X'\beta_0)\big)}
\end{aligned}$$
since E((Y − F(X′β0))² | X) = F(X′β0)(1 − F(X′β0)). As the conditional
expectation of the score is zero, the unconditional variance equals the
expectation of this conditional variance, and the result is then obtained.
26 / 47
Hypothesis testing
27 / 47
Hypothesis testing
▶ Concerning the statistic ξnW, Î1(β0) corresponds to the formula on
page 24.
▶ Concerning the statistic ξnS, the formula is the same except that β̂ is
replaced by β̂C.
▶ Note that the three statistics tend to be “small” under the
hypothesis H0.
▶ Under H0, ξnT →d χ²(p) (T = W, S or R).
▶ The critical region of a test of asymptotic level α therefore takes the
form {ξnT > q_{χ²(p)}(1 − α)}, where q_{χ²(p)}(y) is the quantile of order y
of a χ²(p) distribution.
▶ To test H0: β0j = 0 against H1: β0j ≠ 0, the usual t-test is
mostly used. This test produces the same result as the Wald test,
because (you should verify this) ξnW = (β̂j /se(β̂j))² ≡ tj² and
|tj| > q_{N(0,1)}(1 − α/2) ⇔ ξnW > q_{χ²(1)}(1 − α), where
q_{N(0,1)}(1 − α/2) is the quantile of order 1 − α/2 of a N(0, 1).
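A minimal sketch of the Wald/t equivalence for H0: β0j = 0 (simulated data, illustrative names):

```r
# Sketch: Wald statistic for H0: beta_j = 0 and its equivalence
# with the squared z (t) statistic reported by glm.
set.seed(1)
x <- rnorm(1000)
y <- rbinom(1000, 1, plogis(0.3 * x))
fit <- glm(y ~ x, family = binomial)
z <- coef(summary(fit))["x", "z value"]
W <- z^2                                      # xi_n^W = t_j^2
c(W = W, crit = qchisq(0.95, 1), reject = W > qchisq(0.95, 1))
```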
28 / 47
Outline
Introduction
Potential Outcomes
Application
29 / 47
Explanatory power of the model
$$\text{pseudo-}R^2 = 1 - \frac{\ell_n(\mathbf{Y} \mid \mathbf{X}; \hat\beta)}{\ell_n(\mathbf{Y} \mid \mathbf{X}; \hat\beta_C)}$$
The pseudo-R² is close to 1 when the model predicts well, i.e., when for
each observation either Yi = 1 and F(Xi′β̂) ≃ 1, or Yi = 0 and F(Xi′β̂) ≃ 0.
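A minimal sketch of the pseudo-R² (McFadden) computation, taking β̂C as the intercept-only fit (simulated data, illustrative names):

```r
# Sketch: McFadden's pseudo-R^2, comparing the fitted log-likelihood
# with that of the constrained (intercept-only) model.
set.seed(1)
x <- rnorm(1000)
y <- rbinom(1000, 1, plogis(1.5 * x))
fit  <- glm(y ~ x, family = binomial)
fit0 <- glm(y ~ 1, family = binomial)       # constrained model
1 - as.numeric(logLik(fit) / logLik(fit0))  # pseudo-R^2
```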
30 / 47
Choice of variables
▶ Trade off between:
▶ increase of explanatory power of model;
▶ loss of precision due to estimating a large number of parameters.
▶ One can test the null hypothesis that variables have no effect,
possibly through sequential procedures (forward, backward, ...).
▶ Drawback: when n tends to infinity, such procedures entail that the
null is rejected for most of the explanatory variables.
▶ One can also use information criteria such as the AIC (Akaike Information
Criterion, Akaike, 1973) or the BIC (Bayesian Information Criterion,
Schwarz, 1978), as illustrated below.
▶ Such criteria are used to determine which model to select. Suppose
there are J candidate parametric models:
$$(P_{\beta^{(1)}})_{\beta^{(1)} \in B^{(1)}}, \ldots, (P_{\beta^{(J)}})_{\beta^{(J)} \in B^{(J)}}.$$
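A minimal sketch of AIC/BIC-based model choice between two logit specifications (simulated data, illustrative names; smaller values are preferred):

```r
# Sketch: comparing candidate logit models by AIC and BIC.
set.seed(1)
df <- data.frame(x1 = rnorm(800), x2 = rnorm(800))
df$y <- rbinom(800, 1, plogis(0.8 * df$x1))      # x2 is irrelevant here
m1 <- glm(y ~ x1,      family = binomial, data = df)
m2 <- glm(y ~ x1 + x2, family = binomial, data = df)
cbind(AIC = c(m1 = AIC(m1), m2 = AIC(m2)),
      BIC = c(m1 = BIC(m1), m2 = BIC(m2)))
```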
32 / 47
Outline
Introduction
Potential Outcomes
Application
33 / 47
Advantage of a linear model
▶ Sometimes, for simplicity, a linear probability model is
estimated instead of a logit or probit model:
E (Y |X ) = X ′ β0 .
▶ Example: panel data. Suppose that
E (Yit |Xit , αi ) = Xit′ β0 + αi ,
where αi is a unit-specific fixed effect, possibly correlated with
Xit .
▶ This unobserved fixed effect can be eliminated by applying the first
difference or within operator:
E (Yit − Yit−1 |Xit , Xit−1 ) = (Xit − Xit−1 )′ β0 .
▶ In non-linear models, this trick does not work, since
$$E(Y_{it} - Y_{it-1} \mid X_{it}, X_{it-1}, \alpha_i) = F(X_{it}'\beta + \alpha_i) - F(X_{it-1}'\beta + \alpha_i).$$
▶ Furthermore, maximum likelihood estimation of (β, α1 , ..., αn ) does
not produce consistent estimators of the parameters, because of the
so-called incidental parameter problem: the number of parameters
tends to infinity with n.
34 / 47
Modeling and estimation
▶ The linear probability model can be rewritten as Y = X ′ β0 + ε, with
ε := Y − X ′ β0 , so that E (ε|X ) = 0; note that ε is then heteroskedastic,
with V (ε|X ) = X ′ β0 (1 − X ′ β0 ) when the model is correct.
▶ The logit, probit, and linear models differ only in the distribution
function F (.) chosen in E (Y |X ) = F (X ′ β0 ):
F = Λ, Φ, or the identity function, respectively.
▶ There are semi-parametric models in which it is assumed that
P(Y = 1|X ) = F (X ′ β0 ) with both F and β0 unknown. This is equivalent
to considering a latent model Y ∗ = X ′ β0 + ε where ε ⊥⊥ X and the
distribution function of ε is unknown.
▶ Such models are less restrictive but are harder to estimate.
▶ The results obtained using a logit model, a probit model, or a linear
model are often quite similar in terms of their marginal effects, as the
sketch after this list illustrates.
▶ In terms of the coefficients themselves, however, the three models in
general yield different magnitudes, since the implied scale
normalizations of ε differ.
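A simulation sketch of this similarity of average marginal effects across the three models (illustrative names):

```r
# Sketch: average marginal effects under LPM, logit and probit
# estimated on the same simulated data.
set.seed(1)
n  <- 5000
df <- data.frame(x = rnorm(n))
df$y <- rbinom(n, 1, plogis(0.3 + 0.7 * df$x))

lpm <- lm(y ~ x, data = df)
lgt <- glm(y ~ x, family = binomial(link = "logit"),  data = df)
prb <- glm(y ~ x, family = binomial(link = "probit"), data = df)

c(lpm    = coef(lpm)["x"],                              # constant effect
  logit  = mean(dlogis(predict(lgt))) * coef(lgt)["x"], # AME, logit
  probit = mean(dnorm(predict(prb)))  * coef(prb)["x"]) # AME, probit
```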
36 / 47
Outline
Introduction
Potential Outcomes
Application
37 / 47
Potential Outcomes
38 / 47
Logit and Potential Outcomes
$$\frac{1}{\sum_{i=1}^{n} D_i} \sum_{i=1}^{n} D_i \left[ Y_i(1) - \Lambda\big(\hat\beta_0 + \hat\beta_D + X_i'\hat\beta_X\big) \right] = 0$$
Let Zi = (1, Di , Xi′ )′ and assume E (Zi Zi′ ) is non-singular. Then
(b0 , bD , bX ) ↦ E (ℓ1 (Y|X; (b0 , bD , bX′ )′ )) admits a unique maximizer,
denoted (β0 , βD , βX ).
We can show that (β̂0 , β̂D , β̂X ) converges to (β0 , βD , βX ), and the first
order conditions ensure the non-random equations:
39 / 47
Robust estimation of the Average Treatment Effect
Even if E (Y (d)|X ) ≠ Λ(β0 + βD d + X ′ βX ) (misspecification),
independence of D with (Y (0), Y (1), X ) ensures that the average
treatment effect satisfies (!):
$$\begin{aligned}
\delta := E(Y(1) - Y(0)) &= E(Y(1) \mid D = 1) - E(Y(0) \mid D = 0) && (1)\\
&= E\big(\Lambda(\beta_0 + \beta_D + X'\beta_X) \mid D = 1\big) - E\big(\Lambda(\beta_0 + X'\beta_X) \mid D = 0\big) && (2)\\
&= E\big(\Lambda(\beta_0 + \beta_D + X'\beta_X) - \Lambda(\beta_0 + X'\beta_X)\big). && (3)
\end{aligned}$$
40 / 47
Robust estimation of the Average Treatment Effect
In an experimental design, with a binary outcome:
1. The naive estimator of the ATE,
$$\hat\delta = \frac{\sum_{i=1}^{n} Y_i D_i}{\sum_{i=1}^{n} D_i} - \frac{\sum_{i=1}^{n} (1 - D_i) Y_i}{\sum_{i=1}^{n} (1 - D_i)},$$
is obtained by (check this!):
▶ Linear regression of Y on (1, D): δ̂ = β̂D
▶ Logit regression of Y on (1, D): δ̂ = Λ(β̂0 + β̂D ) − Λ(β̂0 )
▶ Probit regression of Y on (1, D): δ̂ = Φ(β̂0 + β̂D ) − Φ(β̂0 )
▶ Any MLE based on P(Y = 1|D) = F (β0 + βD D), for F
continuous with continuous inverse: δ̂ = F (β̂0 + β̂D ) − F (β̂0 )
2. If some X also affect the potential outcomes Y (0), Y (1), more efficient
estimators can be considered (see the sketch after this list):
▶ Linear regression on (1, D, X ): δ̃ = β̂D
▶ Logit regression on (1, D, X ) (robust to misspecification):
$$\breve\delta = \frac{1}{n}\sum_{i=1}^{n} \Lambda(\hat\beta_0 + \hat\beta_D + X_i'\hat\beta_X) - \Lambda(\hat\beta_0 + X_i'\hat\beta_X)$$
▶ But estimators derived from a probit or another model for the binary
outcome are not robust to misspecification!
▶ The choice between δ̃ and δ̆ depends on their respective asymptotic
variances (use a robust variance estimator based on the sandwich formula!).
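A minimal sketch of the naive and logit-adjusted (δ̆) estimators in a simulated experiment (illustrative names; D is randomized independently of X and the potential outcomes):

```r
# Sketch: naive vs. logit-adjusted ATE estimators in an experiment.
set.seed(1)
n <- 5000
X <- rnorm(n)
D <- rbinom(n, 1, 0.5)
Y <- rbinom(n, 1, plogis(-0.5 + D + 0.8 * X))

naive <- mean(Y[D == 1]) - mean(Y[D == 0])   # difference in means

fit <- glm(Y ~ D + X, family = binomial)     # logit on (1, D, X)
b   <- coef(fit)
breve <- mean(plogis(b[1] + b[2] + b[3] * X) - plogis(b[1] + b[3] * X))
c(naive = naive, adjusted = breve)
```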
41 / 47
Conditional Average Treatment Effect in Expe.
$$\hat\delta(x) = \frac{1}{n}\sum_{i=1}^{n}\left[\Lambda(\hat\beta_0 + \hat\beta_D + x'\hat\beta_X) - \Lambda(\hat\beta_0 + x'\hat\beta_X)\right]$$
converges to the conditional average treatment effect,
δ(x ) = E (Y (1) − Y (0) | X = x ).
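A minimal sketch of δ̂(x) at a given covariate value x0, from a logit of Y on (1, D, X) (simulated data, illustrative names):

```r
# Sketch: CATE estimate delta_hat(x0) from a logit fit.
set.seed(1)
n <- 5000; X <- rnorm(n); D <- rbinom(n, 1, 0.5)
Y <- rbinom(n, 1, plogis(-0.5 + D + 0.8 * X))
b <- coef(glm(Y ~ D + X, family = binomial))
x0 <- 1
plogis(b["(Intercept)"] + b["D"] + b["X"] * x0) -
  plogis(b["(Intercept)"] + b["X"] * x0)
```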
42 / 47
Outline
Introduction
Potential Outcomes
Application
43 / 47
Example: labor market participation of women
44 / 47
Code R
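A sketch of the kind of estimation involved, assuming the Mroz (1987) labor-force participation data as packaged in carData (variable names from that package, not necessarily those of the course):

```r
# Sketch: logit and probit fits for female labor-force participation,
# using the Mroz data from the 'carData' package (assumed installed);
# lfp = in labor force (yes/no), k5/k618 = numbers of young/older children.
library(carData)
data(Mroz)
logit_fit  <- glm(lfp ~ k5 + k618 + age + wc + lwg + inc,
                  family = binomial(link = "logit"),  data = Mroz)
probit_fit <- glm(lfp ~ k5 + k618 + age + wc + lwg + inc,
                  family = binomial(link = "probit"), data = Mroz)
summary(logit_fit)$coefficients
summary(probit_fit)$coefficients
```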
45 / 47
Results: logit model coefficients
46 / 47
Results: probit model coefficients
47 / 47