
Maximum Likelihood Methods

• Some of the models used in econometrics specify the complete


probability distribution of the outcomes of interest rather than
just a regression function.
• Sometimes this is because of special features of the outcomes
under study - for example because they are discrete or censored,
or because there is serial dependence of a complex form.
• When the complete probability distribution of outcomes given
covariates is specified we can develop an expression for the
probability of observation of the responses we see as a function
of the unknown parameters embedded in the specification.
• We can then ask what values of these parameters maximise
this probability for the data we have. The resulting statistics,
functions of the observed data, are called maximum likelihood
estimators. They possess important optimality properties and
have the advantage that they can be produced in a rule directed
fashion.

G023. III
Estimating a Probability

• Suppose Y1 , . . . Yn are binary independently and identically dis-


tributed random variables with P [Yi = 1] = p, P [Yi = 0] = 1−p
for all i.
• We might use such a model for data recording the occurrence
or otherwise of an event for n individuals, for example being
in work or not, buying a good or service or not, etc.
• Let y1 , . . . , yn indicate the data values obtained and note that
in this model
P[Y_1 = y_1 \cap \cdots \cap Y_n = y_n] = \prod_{i=1}^n p^{y_i} (1 - p)^{(1 - y_i)}
                                        = p^{\sum_{i=1}^n y_i} (1 - p)^{\sum_{i=1}^n (1 - y_i)}
                                        = L(p; y).

With any set of data L(p; y) can be calculated for any value of
p between 0 and 1. The result is the probability of observing
the data to hand for each chosen value of p.
• One strategy for estimating p is to use that value that max-
imises this probability. The resulting estimator is called the
maximum likelihood estimator (MLE) and the maximand, L(p; y),
is called the likelihood function.

G023. III
Log Likelihood Function

• The maximum of the log likelihood function, l(p; y) = log L(p, y),
is at the same value of p as is the maximum of the likelihood
function (because the log function is monotonic).
• It is often easier to maximise the log likelihood function (LLF).
For the problem considered here the LLF is
à n ! n
X X
l(p; y) = yi log p + (1 − yi ) log(1 − p).
i=1 i=1

Let
p̂ = arg maxL(p; y) = arg maxl(p; y).
p p

On differentiating we have the following:

l_p(p; y) = \frac{1}{p} \sum_{i=1}^n y_i - \frac{1}{1 - p} \sum_{i=1}^n (1 - y_i)

l_{pp}(p; y) = -\frac{1}{p^2} \sum_{i=1}^n y_i - \frac{1}{(1 - p)^2} \sum_{i=1}^n (1 - y_i).

Note that l_{pp}(p; y) is always negative for admissible p, so the
optimisation problem has a unique solution corresponding to a
maximum. The solution to l_p(\hat{p}; y) = 0 is

\hat{p} = \frac{1}{n} \sum_{i=1}^n y_i

just the mean of the observed values of the binary indicators,
equivalently the proportion of 1’s observed in the data.
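
As a numerical illustration (a sketch added here, not part of the original slides, using simulated data), the Python code below computes the sample proportion and checks that a numerical maximiser of l(p; y) returns the same value.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=200)       # simulated binary data with true p = 0.3

def neg_loglik(p, y):
    # negative of l(p; y) = sum(y) log p + sum(1 - y) log(1 - p)
    return -(y.sum() * np.log(p) + (1 - y).sum() * np.log(1 - p))

p_hat_analytic = y.mean()                # the MLE derived above: the sample proportion
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), args=(y,), method="bounded")
print(p_hat_analytic, res.x)             # the two values agree up to optimiser tolerance
```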

G023. III
Likelihood Functions and Estimation in General

• Let Yi , i = 1, . . . , n be continuously distributed random vari-


ables with joint probability density function f (y1 , . . . , yn , θ).
• The probability that Y falls in infinitesimal intervals of width
dy1 , . . . dyn centred on values y1 , . . . , yn is

A = f (y1 , . . . , yn , θ)dy1 dy2 . . . dyn

Here only the joint density function depends upon θ and the
value of θ that maximises f (y1 , . . . , yn , θ) also maximises A.
• In this case the likelihood function is defined to be the joint
density function of the Yi ’s.
• When the Yi ’s are discrete random variables the likelihood func-
tion is the joint probability mass function of the Yi ’s, and in
cases in which there are discrete and continuous elements the
likelihood function is a combination of probability density ele-
ments and probability mass elements.
• In all cases the likelihood function is a function of the observed
data values that is equal to, or proportional to, the probability
of observing these particular values, where the constant of pro-
portionality does not depend upon the parameters which are
to be estimated.

G023. III
Likelihood Functions and Estimation in General

• When Yi , i = 1, . . . , n are independently distributed the joint


density (mass) function is the product of the marginal density
(mass) functions of each Yi , the likelihood function is
L(y; \theta) = \prod_{i=1}^n f_i(y_i; \theta),

and the log likelihood function is the sum:

l(y; \theta) = \sum_{i=1}^n \log f_i(y_i; \theta).

There is a subscript i on f to allow for the possibility that each
Yi has a distinct probability distribution.
• This situation arises when modelling conditional distributions
of Y given some covariates x. In particular, fi (yi ; θ) = fi (yi |xi ; θ),
and often fi (yi |xi ; θ) = f (yi |xi ; θ).
• In time series and panel data problems there is often dependence
among the Yi ’s. For any list of random variables Y = \{Y_1, \dots, Y_n\},
define the (i-1)-element list Y_{i-} = \{Y_1, \dots, Y_{i-1}\}.
The joint density (mass) function of Y can be written as

f(y) = \Big( \prod_{i=2}^n f_{y_i | y_{i-}}(y_i | y_{i-}) \Big) f_{y_1}(y_1).

G023. III
Invariance

• Note that (parameter-free) monotonic transformations of the
Yi ’s (for example, a change of units of measurement, or use of
logs rather than the original y data) usually lead to a change
in the value of the maximised likelihood function when we work
with continuous distributions.
• If we transform from y to z where y = h(z) and the joint
density function of y is f_y(y; \theta), then the joint density function
of z is

f_z(z; \theta) = \left| \frac{\partial h(z)}{\partial z} \right| f_y(h(z); \theta).
• For any given set of values, y ∗ , the value of θ that maximises
the likelihood function fy (y ∗ , θ) also maximises the likelihood
function fz (z ∗ ; θ) where y ∗ = h(z ∗ ), so the maximum likelihood
estimator is invariant with respect to such changes in the way
the data are presented.
• However the maximised likelihood functions will differ by a
factor equal to \left| \frac{\partial h(z)}{\partial z} \right| evaluated at z = z^*.

• The reason for this is that we omit the infinitesimals dy1 , . . . dyn
from the likelihood function for continuous variates and these
change when we move from y to z because they are denomi-
nated in the units in which y or z are measured.

G023. III
Maximum Likelihood: Properties

• Maximum likelihood estimators possess another important in-


variance property. Suppose two researchers choose different
ways in which to parameterise the same model. One uses θ,
and the other uses λ = h(θ) where this function is one-to-one.
Then faced with the same data and producing estimators θ̂ and
λ̂, it will always be the case that λ̂ = h(θ̂).
• There are a number of important consequences of this:
– For instance, if we are interested in the ratio of two para-
meters, the MLE of the ratio will be the ratio of the ML
estimators.
– Sometimes a re-parameterisation can improve the numeri-
cal properties of the likelihood function. Newton’s method
and its variants may in practice work better if parameters
are rescaled.
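
A small sketch of this invariance property (simulated Bernoulli data, assumptions mine): maximising the likelihood over the log-odds ψ = log(p/(1−p)) and transforming back gives the same answer as the direct MLE of p.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.6, size=500)       # simulated binary data

def neg_loglik_psi(psi, y):
    # Bernoulli log likelihood re-parameterised in the log-odds psi = log(p / (1 - p))
    p = 1.0 / (1.0 + np.exp(-psi))
    return -(y.sum() * np.log(p) + (1 - y).sum() * np.log(1 - p))

psi_hat = minimize_scalar(neg_loglik_psi, bounds=(-10, 10), args=(y,), method="bounded").x
p_hat = y.mean()                         # MLE in the original parameterisation
# invariance: the MLE of psi equals h(p_hat) = log(p_hat / (1 - p_hat))
print(psi_hat, np.log(p_hat / (1 - p_hat)))
```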

G023. III
Maximum Likelihood: Improving Numerical Properties

• An example of this often arises when, in index models, elements


of x involve squares, cubes, etc., of some covariate, say x_1.
Then maximisation of the likelihood function may be easier
if instead of x_1^2, x_1^3, etc., you use x_1^2/10, x_1^3/100, etc., with
consequent rescaling of the coefficients on these covariates. You
can always recover the MLEs you would have obtained without
the rescaling by rescaling the estimates.
• There are some cases in which a re-parameterisation can pro-
duce a globally concave likelihood function where in the origi-
nal parameterisation there was not global concavity.
• An example of this arises in the “Tobit” model.
– This is a model in which each Y_i is N(x_i'\beta, \sigma^2) with negative
realisations replaced by zeros. The model is sometimes
used to model expenditures and hours worked, which are
necessarily non-negative.
– In this model the likelihood as parameterised here is not
globally concave, but re-parameterising to λ = β/σ, and
γ = 1/σ, produces a globally concave likelihood function.
– The invariance property tells us that having maximised the
“easy” likelihood function and obtained estimates λ̂ and γ̂,
we can recover the maximum likelihood estimates we might
have had difficulty finding in the original parameterisation
by calculating β̂ = λ̂/γ̂ and σ̂ = 1/γ̂.
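
A sketch of the re-parameterised Tobit likelihood (simulated data and the Nelder-Mead optimiser are my choices, not part of the slides): the log likelihood is coded in (λ, γ) = (β/σ, 1/σ) and the original parameters are recovered afterwards via the invariance property.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 1000
x = np.column_stack([np.ones(n), rng.normal(size=n)])    # intercept + one covariate
beta_true, sigma_true = np.array([0.5, 1.0]), 1.5
y_star = x @ beta_true + sigma_true * rng.normal(size=n)
y = np.maximum(y_star, 0.0)                              # negative realisations replaced by zeros

def neg_loglik(par, y, x):
    # parameterisation: lam = beta / sigma, gam = 1 / sigma
    lam, gam = par[:-1], par[-1]
    if gam <= 0:
        return np.inf
    xb = x @ lam
    pos = y > 0
    ll_pos = np.log(gam) + norm.logpdf(gam * y[pos] - xb[pos])   # uncensored observations
    ll_zero = norm.logcdf(-xb[~pos])                             # censored observations
    return -(ll_pos.sum() + ll_zero.sum())

start = np.concatenate([np.zeros(x.shape[1]), [1.0]])
res = minimize(neg_loglik, start, args=(y, x), method="Nelder-Mead")
lam_hat, gam_hat = res.x[:-1], res.x[-1]
beta_hat, sigma_hat = lam_hat / gam_hat, 1.0 / gam_hat   # invariance: recover the original parameters
print(beta_hat, sigma_hat)
```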

G023. III
Properties Of Maximum Likelihood Estimators

• First we just sketch the main results:

– Let l(θ; Y ) be the log likelihood function now regarded as


a random variable, a function of a set of (possibly vector)
random variables Y = {Y1 , . . . , Yn }.
– Let lθ (θ; Y ) be the gradient of this function, itself a vector
of random variables (scalar if θ is scalar) and let lθθ (θ; Y )
be the matrix of second derivatives of this function (also a
scalar if θ is a scalar).
– Let
θ̂ = arg max l(θ; Y ).
θ

In order to make inferences about θ using θ̂ we need to


determine the distribution of θ̂. We consider developing a
large sample approximation. The limiting distribution for
a quite wide class of maximum likelihood problems is as
follows:

n^{1/2}(\hat\theta - \theta_0) \xrightarrow{d} N(0, V_0)

where

V_0 = -\operatorname{plim}_{n \to \infty} \big( n^{-1} l_{\theta\theta}(\theta_0; Y) \big)^{-1}

and θ0 is the unknown parameter value. To get an ap-
proximate distribution that can be used in practice we use
-(n^{-1} l_{\theta\theta}(\hat\theta; Y))^{-1} or some other consistent estimator of V0
in place of V0.

G023. III
Properties Of Maximum Likelihood Estimators

• We apply the method for dealing with M-estimators.


• Suppose θ̂ is uniquely determined as the solution to the first
order condition
lθ (θ̂; Y ) = 0
and that θ̂ is a consistent estimator of the unknown value of
the parameter, θ0 . Weak conditions required for consistency
are quite complicated and will not be given here.
• Taking a Taylor series expansion around θ = θ0 and then eval-
uating this at θ = θ̂ gives

0 \simeq l_\theta(\theta_0; Y) + l_{\theta\theta'}(\theta_0; Y)(\hat\theta - \theta_0)

and rearranging and scaling by powers of the sample size n,

n^{1/2}(\hat\theta - \theta_0) \simeq -\big( n^{-1} l_{\theta\theta'}(\theta_0; Y) \big)^{-1} n^{-1/2} l_\theta(\theta_0; Y).

As in our general treatment of M-estimators, if we can show that

n^{-1} l_{\theta\theta'}(\theta_0; Y) \xrightarrow{p} A(\theta_0)
and
n^{-1/2} l_\theta(\theta_0; Y) \xrightarrow{d} N(0, B(\theta_0))

then

n^{1/2}(\hat\theta - \theta_0) \xrightarrow{d} N(0, A(\theta_0)^{-1} B(\theta_0) A(\theta_0)^{-1\prime}).

G023. III
Maximum Likelihood: Limiting Distribution

• What is the limiting distribution of n−1/2 lθ (θ0 ; Y )?


• First note that in problems for which the Yi ’s are indepen-
dently distributed, n−1/2 lθ (θ0 ; Y ) is a scaled mean of random
variables and we may be able to find conditions under which
a central limit theorem applies, indicating a limiting normal
distribution.
• We must now find the mean and variance of this distribution.
Since L(θ; Y ) is a joint probability density function (we just
consider the continuous distribution case here),
\int L(\theta; y) \, dy = 1

where multiple integration is over the support of Y. If this
support does not depend upon θ, then

\frac{\partial}{\partial \theta} \int L(\theta; y) \, dy = \int L_\theta(\theta; y) \, dy = 0.

But, because l(θ; y) = log L(θ; y), and lθ(θ; y) = Lθ(θ; y)/L(θ; y),
we have

\int L_\theta(\theta; y) \, dy = \int l_\theta(\theta; y) L(\theta; y) \, dy = E[l_\theta(\theta; Y)]

and so E[lθ(θ; Y)] = 0.


• This holds for any value of θ, in particular for θ0 above. If the
variance of lθ (θ0 ; Y ) converges to zero as n becomes large then
lθ (θ0 ; Y ) will converge in probability to zero and the mean of
the limiting distribution of n−1/2 lθ (θ0 ; Y ) will be zero.

G023. III
Maximum Likelihood: Limiting Distribution

• We turn now to the variance of the limiting distribution. We
have just shown that

\int l_\theta(\theta; y) L(\theta; y) \, dy = 0.

Differentiating again,

\frac{\partial}{\partial \theta'} \int l_\theta(\theta; y) L(\theta; y) \, dy
  = \int \big( l_{\theta\theta'}(\theta; y) L(\theta; y) + l_\theta(\theta; y) L_{\theta'}(\theta; y) \big) \, dy
  = \int \big( l_{\theta\theta'}(\theta; y) + l_\theta(\theta; y) l_\theta(\theta; y)' \big) L(\theta; y) \, dy
  = E\big[ l_{\theta\theta'}(\theta; Y) + l_\theta(\theta; Y) l_\theta(\theta; Y)' \big]
  = 0.

Separating the two terms in the penultimate line,

E[l_\theta(\theta; Y) l_\theta(\theta; Y)'] = -E[l_{\theta\theta'}(\theta; Y)]          (4)

and note that, since E[lθ(θ; Y)] = 0,

Var[l_\theta(\theta; Y)] = E[l_\theta(\theta; Y) l_\theta(\theta; Y)']

and so

Var[l_\theta(\theta; Y)] = -E[l_{\theta\theta'}(\theta; Y)]

\Rightarrow Var[n^{-1/2} l_\theta(\theta; Y)] = -E[n^{-1} l_{\theta\theta'}(\theta; Y)]

giving

B(\theta_0) = -\operatorname{plim}_{n \to \infty} n^{-1} l_{\theta\theta'}(\theta_0; Y).

The matrix

I(\theta) = -E[l_{\theta\theta'}(\theta; Y)]

plays a central role in likelihood theory - it is called the Infor-
mation Matrix.
Finally, because B(θ0) = -A(θ0),

A(\theta)^{-1} B(\theta) A(\theta)^{-1\prime} = \Big( -\operatorname{plim}_{n \to \infty} n^{-1} l_{\theta\theta'}(\theta; Y) \Big)^{-1}.
Of course a number of conditions are required for the results
above to hold. These include the boundedness of third order
derivatives of the log likelihood function, independence or
at most weak dependence of the Yi ’s, existence of moments of
derivatives of the log likelihood, or at least of probability limits
of suitably scaled versions of them, and lack of dependence of
the support of the Yi ’s on θ.
The result in equation (4) above leads, under suitable condi-
tions concerning convergence, to

\operatorname{plim}_{n \to \infty} \big( n^{-1} l_\theta(\theta; Y) l_\theta(\theta; Y)' \big) = -\operatorname{plim}_{n \to \infty} \big( n^{-1} l_{\theta\theta'}(\theta; Y) \big).

This gives an alternative way of “estimating” V0, namely

\hat{V}_0^o = \big\{ n^{-1} l_\theta(\hat\theta; Y) l_\theta(\hat\theta; Y)' \big\}^{-1}

which, compared with

\tilde{V}_0^o = \big\{ -n^{-1} l_{\theta\theta'}(\hat\theta; Y) \big\}^{-1},

has the advantage that only first derivatives of the log like-
lihood function need to be calculated. Sometimes V̂0o is re-
ferred to as the “outer product of gradient” (OPG) estimator.
Both these estimators use the “observed” values of functions
of derivatives of the LLF. It may be possible to derive
explicit expressions for the expected values of these functions.
Then one can estimate V0 by

\hat{V}_0^e = \big\{ E[n^{-1} l_\theta(\theta; Y) l_\theta(\theta; Y)'] \big|_{\theta=\hat\theta} \big\}^{-1}
            = \big\{ -E[n^{-1} l_{\theta\theta'}(\theta; Y)] \big|_{\theta=\hat\theta} \big\}^{-1}.

These two sorts of estimators are sometimes referred to as “ob-
served information” (V̂0o, Ṽ0o) and “expected information” (V̂0e)
estimators.
Maximum likelihood estimators possess an optimality property,
namely that, among the class of consistent and asymptotically
normally distributed estimators, the variance matrix of their
limiting distribution is the smallest that can be achieved, in the
sense that other estimators in the class have limiting distribu-
tions with variance matrices exceeding the MLE’s by a positive
semidefinite matrix.
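
To make the distinction concrete, here is a sketch (mine, using a simple intercept-only Poisson model rather than anything from the slides) comparing the Hessian-based, OPG and expected-information estimates of the variance of the MLE.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.poisson(2.5, size=400)                  # Y_i ~ Po(lambda); the MLE is the sample mean
n, lam_hat = y.size, y.mean()

scores = y / lam_hat - 1.0                      # per-observation scores evaluated at the MLE
opg_info = np.sum(scores**2)                    # "outer product of gradient" estimate of the information
obs_info = np.sum(y) / lam_hat**2               # minus the Hessian of the LLF at the MLE
exp_info = n / lam_hat                          # expected information n / lambda, evaluated at the MLE

# three estimates of Var(lambda_hat); obs_info equals exp_info here because the MLE
# is the sample mean, while the OPG estimate differs in finite samples
print(1 / opg_info, 1 / obs_info, 1 / exp_info)
```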
G023. III
Estimating a Conditional Probability

• Suppose Y1, . . . , Yn are independently distributed binary random
variables with

P[Y_i = 1 | X = x_i] = p(x_i, \theta)

P[Y_i = 0 | X = x_i] = 1 - p(x_i, \theta).

This is an obvious extension of the model in the previous sec-


tion.
• The likelihood function for this problem is

P[Y_1 = y_1 \cap \cdots \cap Y_n = y_n | x] = \prod_{i=1}^n p(x_i, \theta)^{y_i} (1 - p(x_i, \theta))^{(1 - y_i)}
                                            = L(\theta; y).

where y denotes the complete set of values of yi and dependence


on x is suppressed in the notation. The log likelihood function
is
l(\theta; y) = \sum_{i=1}^n y_i \log p(x_i, \theta) + \sum_{i=1}^n (1 - y_i) \log(1 - p(x_i, \theta))

and the maximum likelihood estimator of θ is

\hat\theta = \arg\max_\theta l(\theta; y).

So far this is an obvious generalisation of the simple problem


met in the last section.

G023. III
Estimating a Conditional Probability

• To implement the model we choose a form for the function


p(x, θ), which must of course lie between zero and one.

– One common choice is

p(x, \theta) = \frac{\exp(x'\theta)}{1 + \exp(x'\theta)}

which produces what is commonly called a logit model.
– Another common choice is

p(x, \theta) = \Phi(x'\theta) = \int_{-\infty}^{x'\theta} \phi(w) \, dw,
\qquad \phi(w) = (2\pi)^{-1/2} \exp(-w^2/2),

in which Φ is the standard normal distribution function.
This produces what is known as a probit model.

• Both models are widely used. Note that in both cases a single
index model is specified, the probability functions are monotonic
increasing, probabilities arbitrarily close to zero or one are ob-
tained when x0 θ is sufficiently large or small, and there is a
symmetry in both of the models in the sense that p(−x, θ) =
1 − p(x, θ). Any or all of these properties might be inappropri-
ate in a particular application but there is rarely discussion of
this in the applied econometrics literature.

G023. III
More on Logit and Probit

• Both models can also be written as a linear model involving a


latent variable.
• We define a latent variable Y_i^*, which is unobserved, but
determined by the following model:

Y_i^* = X_i'\theta + \varepsilon_i

We observe the variable Y_i which is linked to Y_i^* as:

Y_i = 0  if  Y_i^* < 0
Y_i = 1  if  Y_i^* \geq 0

• The probability of observing Y_i = 1 is:

p_i = P(Y_i = 1) = P(Y_i^* \geq 0)
    = P(X_i'\theta + \varepsilon_i \geq 0)
    = P(\varepsilon_i \geq -X_i'\theta)
    = 1 - F_\varepsilon(-X_i'\theta)

where F_ε is the cumulative distribution function of the random
variable ε.

G023. III
Odds-Ratio

• Define the ratio p_i/(1 - p_i) as the odds-ratio. This is the ratio
of the probability of outcome 1 over the probability of outcome
0. If this ratio is equal to 1, then both outcomes have equal
probability (p_i = 0.5). If this ratio is equal to 2, say, then
outcome 1 is twice as likely as outcome 0 (p_i = 2/3).
• In the logit model, the log odds-ratio is linear in the parame-
ters:

\ln \frac{p_i}{1 - p_i} = X_i'\theta

• In the logit model, θ is the marginal effect of X on the log
odds-ratio. A unit increase in X multiplies the odds-ratio by
exp(θ), which is approximately a 100·θ % increase when θ is small.

G023. III
Marginal Effects

• Logit model:

\frac{\partial p_i}{\partial X_i} = \frac{\theta \exp(X_i'\theta)(1 + \exp(X_i'\theta)) - \theta \exp(X_i'\theta)^2}{(1 + \exp(X_i'\theta))^2}
                                  = \frac{\theta \exp(X_i'\theta)}{(1 + \exp(X_i'\theta))^2}
                                  = \theta p_i (1 - p_i)

A one unit increase in X leads to an increase of approximately θ p_i(1 - p_i) in the probability.

• Probit model:

\frac{\partial p_i}{\partial X_i} = \theta \phi(X_i'\theta)

A one unit increase in X leads to an increase of approximately θ φ(X_i'θ) in the probability.
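
A short sketch evaluating these marginal-effect formulae at hypothetical values of θ and X (the numbers are illustrative only):

```python
import numpy as np
from scipy.stats import norm

# hypothetical coefficients and covariate vector (intercept plus two regressors)
theta = np.array([-0.5, 0.8, -0.3])
x = np.array([1.0, 2.0, 1.5])
w = x @ theta                                   # the single index x'theta

# logit: dp/dx = theta * p * (1 - p), with p = exp(w) / (1 + exp(w))
p_logit = np.exp(w) / (1 + np.exp(w))
me_logit = theta * p_logit * (1 - p_logit)

# probit: dp/dx = theta * phi(x'theta)
me_probit = theta * norm.pdf(w)

print(me_logit, me_probit)
```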

G023. III
ML in Single Index Models

• We can cover both cases by considering general single index
models, so for the moment rewrite p(x, θ) as g(w) where w = x'θ.
• The first derivative of the log likelihood function is:

l_\theta(\theta; y) = \sum_{i=1}^n y_i \frac{g_w(x_i'\theta) x_i}{g(x_i'\theta)} - (1 - y_i) \frac{g_w(x_i'\theta) x_i}{1 - g(x_i'\theta)}
                    = \sum_{i=1}^n (y_i - g(x_i'\theta)) \frac{g_w(x_i'\theta)}{g(x_i'\theta)(1 - g(x_i'\theta))} x_i

Here g_w(w) is the derivative of g(w) with respect to w.

• The expression for the second derivative is rather messy. Here
we just note that its expected value given x is quite simple,
namely

E[l_{\theta\theta'}(\theta; y) | x] = -\sum_{i=1}^n \frac{g_w(x_i'\theta)^2}{g(x_i'\theta)(1 - g(x_i'\theta))} x_i x_i',

the negative of which is the Information Matrix for general
single index binary data models.

G023. III
Asymptotic Properties of the Logit Model

• For the logit model there is a major simplification:

g(w) = \frac{\exp(w)}{1 + \exp(w)}

g_w(w) = \frac{\exp(w)}{(1 + \exp(w))^2}

\Rightarrow \frac{g_w(w)}{g(w)(1 - g(w))} = 1.

Therefore in the logit model the MLE satisfies

\sum_{i=1}^n \left( y_i - \frac{\exp(x_i'\hat\theta)}{1 + \exp(x_i'\hat\theta)} \right) x_i = 0,

the Information Matrix is

I(\theta) = \sum_{i=1}^n \frac{\exp(x_i'\theta)}{(1 + \exp(x_i'\theta))^2} x_i x_i',

the MLE has the limiting distribution

n^{1/2}(\hat\theta_n - \theta) \xrightarrow{d} N(0, V_0),
\qquad V_0 = \operatorname{plim}_{n \to \infty} \left( n^{-1} \sum_{i=1}^n \frac{\exp(x_i'\theta)}{(1 + \exp(x_i'\theta))^2} x_i x_i' \right)^{-1},

and we can conduct approximate inference using the following
approximation

n^{1/2}(\hat\theta_n - \theta) \simeq N(0, V_0)

using the estimator

\hat{V}_0 = \left( n^{-1} \sum_{i=1}^n \frac{\exp(x_i'\hat\theta)}{(1 + \exp(x_i'\hat\theta))^2} x_i x_i' \right)^{-1}

when producing approximate hypothesis tests and confidence
intervals.
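
A sketch of logit estimation built directly from these expressions (simulated data; assumptions mine): Newton steps use the score Σ(y_i − p_i)x_i and the information Σ p_i(1 − p_i)x_i x_i', and standard errors come from the inverse information at θ̂.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([0.5, 1.0, -0.7])
y = rng.binomial(1, 1 / (1 + np.exp(-x @ theta_true)))   # simulated logit outcomes

theta = np.zeros(x.shape[1])
for _ in range(25):                              # Newton-Raphson iterations
    p = 1 / (1 + np.exp(-x @ theta))
    score = x.T @ (y - p)                        # sum_i (y_i - p_i) x_i
    info = (x * (p * (1 - p))[:, None]).T @ x    # sum_i p_i (1 - p_i) x_i x_i'
    step = np.linalg.solve(info, score)
    theta = theta + step
    if np.max(np.abs(step)) < 1e-10:
        break

se = np.sqrt(np.diag(np.linalg.inv(info)))       # standard errors from the inverse information
print(theta, se)
```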
G023. III
Asymptotic Properties of the Probit Model

• In the probit model

g(w) = \Phi(w)

g_w(w) = \phi(w)

\Rightarrow \frac{g_w(w)}{g(w)(1 - g(w))} = \frac{\phi(w)}{\Phi(w)(1 - \Phi(w))}.

Therefore in the probit model the MLE satisfies

\sum_{i=1}^n \left( y_i - \Phi(x_i'\hat\theta) \right) \frac{\phi(x_i'\hat\theta)}{\Phi(x_i'\hat\theta)(1 - \Phi(x_i'\hat\theta))} x_i = 0,

the Information Matrix is

I(\theta) = \sum_{i=1}^n \frac{\phi(x_i'\theta)^2}{\Phi(x_i'\theta)(1 - \Phi(x_i'\theta))} x_i x_i',

the MLE has the limiting distribution

n^{1/2}(\hat\theta_n - \theta) \xrightarrow{d} N(0, V_0),
\qquad V_0 = \operatorname{plim}_{n \to \infty} \left( n^{-1} \sum_{i=1}^n \frac{\phi(x_i'\theta)^2}{\Phi(x_i'\theta)(1 - \Phi(x_i'\theta))} x_i x_i' \right)^{-1},

and we can conduct approximate inference using the following
approximation

n^{1/2}(\hat\theta_n - \theta) \simeq N(0, V_0)

using the estimator

\hat{V}_0 = \left( n^{-1} \sum_{i=1}^n \frac{\phi(x_i'\hat\theta)^2}{\Phi(x_i'\hat\theta)(1 - \Phi(x_i'\hat\theta))} x_i x_i' \right)^{-1}

when producing approximate tests and confidence intervals.
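
A parallel sketch for the probit model (again with simulated data; here a generic quasi-Newton optimiser is used instead of hand-coded Newton steps), with standard errors taken from the inverse of the information matrix above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([0.3, 0.8, -0.5])
y = rng.binomial(1, norm.cdf(x @ theta_true))    # simulated probit outcomes

def neg_loglik(theta, y, x):
    # l(theta; y) = sum y log Phi(x'theta) + (1 - y) log(1 - Phi(x'theta))
    w = x @ theta
    return -np.sum(y * norm.logcdf(w) + (1 - y) * norm.logcdf(-w))

res = minimize(neg_loglik, np.zeros(x.shape[1]), args=(y, x), method="BFGS")
theta_hat = res.x

# information matrix: sum_i phi(x_i'theta)^2 / (Phi (1 - Phi)) x_i x_i' at theta_hat
w = x @ theta_hat
wt = norm.pdf(w) ** 2 / (norm.cdf(w) * norm.cdf(-w))
info = (x * wt[:, None]).T @ x
se = np.sqrt(np.diag(np.linalg.inv(info)))
print(theta_hat, se)
```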

G023. III
Example: Logit and Probit

• We have data from households in Kuala Lumpur (Malaysia)


describing household characteristics and their concern about
the environment. The question is
”Are you concerned about the environment? Yes / No”.
We also observe their age, sex (coded as 1 for men, 0 for women),
income and quality of the neighborhood measured as air quality.
The latter is coded with a dummy variable smell, equal to 1 if
there is a bad smell in the neighborhood. The model is:

Concerni = β0 +β1 agei +β2 sexi +β3 log incomei +β4 smelli +ui

• We estimate this model with three specifications, LPM, logit


and probit:

Probability of being concerned by Environment


Variable LPM Logit Probit
Est. t-stat Est. t-stat Est. t-stat
age .0074536 3.9 .0321385 3.77 .0198273 3.84
sex .0149649 0.3 .06458 0.31 .0395197 0.31
log income .1120876 3.7 .480128 3.63 .2994516 3.69
smell .1302265 2.5 .5564473 2.48 .3492112 2.52
constant -.683376 -2.6 -5.072543 -4.37 -3.157095 -4.46
Some Marginal Effects
Age .0074536 .0077372 .0082191
log income .1120876 .110528 .1185926
smell .1302265 .1338664 .1429596

G023. III
Multinomial Logit

• The logit model was dealing with two qualitative outcomes.


This can be generalized to multiple outcomes:
– choice of transportation: car, bus, train...
– choice of dwelling: house, apartment, social housing.
• The multinomial logit: Denote the outcomes as j = 1, . . . , J
and p_j the probability of outcome j.

p_j = \frac{\exp(X'\theta^j)}{\sum_{k=1}^J \exp(X'\theta^k)}

where θ^j is a vector of parameters associated with outcome j.
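
A minimal sketch of the multinomial logit probabilities with hypothetical coefficients (the last outcome's coefficient vector is set to zero, anticipating the normalisation discussed on the next slide):

```python
import numpy as np

def mnl_probabilities(x, thetas):
    """Multinomial logit: p_j = exp(x'theta_j) / sum_k exp(x'theta_k).

    `thetas` is a (J, K) array, one coefficient vector per outcome."""
    v = thetas @ x                         # index x'theta_j for each outcome j
    e = np.exp(v - v.max())                # subtract the max for numerical stability
    return e / e.sum()

# hypothetical example: J = 3 outcomes, K = 2 covariates (intercept, income),
# with the last outcome's coefficients normalised to zero
thetas = np.array([[0.2, 0.5],
                   [-0.1, 1.0],
                   [0.0, 0.0]])
x = np.array([1.0, 0.8])
print(mnl_probabilities(x, thetas))        # the probabilities sum to one
```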

G023. III
Identification

• If we add the same vector of constants to the coefficients θ^k of
every outcome, this does not change the probabilities p_j, as the
common factor cancels out of the numerator and denominator.
This means that the parameters are under-identified. We have to
normalise the coefficients of one outcome, say J, to zero. All the
results are then interpreted as deviations from the baseline choice.
• We write the probability of choosing outcome j = 1, . . . , J − 1
as:

p_j = \frac{\exp(X'\theta^j)}{1 + \sum_{k=1}^{J-1} \exp(X'\theta^k)}

• We can express the log odds-ratio as:

\ln \frac{p_j}{p_J} = X'\theta^j

• The odds-ratio of choice j versus J is expressed only as a
function of the parameters of choice j, and not of those of the
other choices: this is the Independence of Irrelevant Alternatives
(IIA) property.

G023. III
Independence of Irrelevant Alternatives

An anecdote which illustrates a violation of this property has


been attributed to Sidney Morgenbesser:

After finishing dinner, Sidney Morgenbesser decides to order


dessert. The waitress tells him he has two choices: apple pie and
blueberry pie. Sidney orders the apple pie.

After a few minutes the waitress returns and says that they also
have cherry pie at which point Morgenbesser says ”In that case I’ll
have the blueberry pie.”

G023. III
Independence of Irrelevant Alternatives

• Consider travelling choices, by car or with a red bus. Assume
for simplicity that the choice probabilities are equal:

P(car) = P(red bus) = 0.5  =⇒  P(car)/P(red bus) = 1

• Suppose we introduce a blue bus, (almost) identical to the red
bus. The probability that individuals will choose the blue bus
is therefore the same as for the red bus and the odds ratio is:

P(blue bus) = P(red bus)  =⇒  P(blue bus)/P(red bus) = 1

• However, the IIA implies that odds ratios are the same whether
or not another alternative exists. The only probabilities for
which the three odds ratios are equal to one are:

P(car) = P(blue bus) = P(red bus) = 1/3

However, the prediction we ought to obtain is:

P(red bus) = P(blue bus) = 1/4,  P(car) = 0.5

G023. III
Marginal Effects: Multinomial Logit

• θj can be interpreted as the marginal effect of X on the log


odds-ratio of choice j to the baseline choice.
• The marginal effect of X on the probability of choosing out-
come j can be expressed as:

\frac{\partial p_j}{\partial X} = p_j \Big[ \theta^j - \sum_{k=1}^J p_k \theta^k \Big]

Hence, the marginal effect on choice j involves not only the
coefficients relative to j but also the coefficients relative to the
other choices.
• Note that we can have θ^j < 0 and ∂p_j/∂X > 0, or vice versa.
Due to the non-linearity of the model, the sign of a coefficient
indicates neither the direction nor the magnitude of the effect
of a variable on the probability of choosing a given outcome.
One has to compute the marginal effects.
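
A sketch of this marginal-effect formula with hypothetical coefficients; it also shows that the effects of a given covariate sum to zero across outcomes, since the probabilities sum to one.

```python
import numpy as np

def mnl_probabilities(x, thetas):
    v = thetas @ x
    e = np.exp(v - v.max())
    return e / e.sum()

def mnl_marginal_effects(x, thetas):
    # dp_j/dX = p_j * (theta_j - sum_k p_k theta_k), one row per outcome j
    p = mnl_probabilities(x, thetas)
    avg = p @ thetas                       # sum_k p_k theta_k
    return p[:, None] * (thetas - avg)

thetas = np.array([[0.2, 0.5],
                   [-0.1, 1.0],
                   [0.0, 0.0]])            # baseline outcome normalised to zero
x = np.array([1.0, 0.8])
me = mnl_marginal_effects(x, thetas)       # rows: outcomes, columns: covariates
print(me, me.sum(axis=0))                  # each column sums to zero
```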

G023. III
Example

• We analyze here the choice of dwelling: house, apartment or


low cost flat, the latter being the baseline choice. We include as
explanatory variables the age, sex and log income of the head
of household:

Variable Estimate Std. Err. Marginal Effect


Choice of House
age .0118092 .0103547 -0.002
sex -.3057774 .2493981 -0.007
log income 1.382504 .1794587 0.18
constant -10.17516 1.498192
Choice of Apartment
age .0682479 .0151806 0.005
sex -.89881 .399947 -0.05
log income 1.618621 .2857743 0.05
constant -15.90391 2.483205

G023. III
Ordered Models

• In the multinomial logit, the choices were not ordered. For


instance, we cannot rank cars, buses or trains in a meaningful
way. In some instances, we have a natural ordering of the out-
comes even if we cannot express them as a continuous variable:
– Yes / Somehow / No.
– Low / Medium / High
• We can analyze these answers with ordered models.

G023. III
Ordered Probit

• We code the answers by arbitrarily assigning values:

Y_i = 0 if No,  Y_i = 1 if Somehow,  Y_i = 2 if Yes

• We define a latent variable Y_i^* which is linked to the explana-
tory variables:

Y_i^* = X_i'\theta + \varepsilon_i

Y_i = 0  if  Y_i^* < 0
Y_i = 1  if  Y_i^* \in [0, \mu)
Y_i = 2  if  Y_i^* \geq \mu

µ is a threshold and an auxiliary parameter which is estimated
along with θ.
• We assume that εi is distributed normally.
• The probability of each outcome is derived from the normal
cdf:

P(Y_i = 0) = \Phi(-X_i'\theta)
P(Y_i = 1) = \Phi(\mu - X_i'\theta) - \Phi(-X_i'\theta)
P(Y_i = 2) = 1 - \Phi(\mu - X_i'\theta)
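
A small sketch evaluating these three probabilities at hypothetical values of θ, µ and X:

```python
import numpy as np
from scipy.stats import norm

theta = np.array([0.4, -0.2])           # hypothetical coefficients
mu = 1.2                                # hypothetical threshold
x = np.array([1.0, 0.5])                # covariates (including an intercept)
w = x @ theta

p0 = norm.cdf(-w)                       # P(Y = 0)
p1 = norm.cdf(mu - w) - norm.cdf(-w)    # P(Y = 1)
p2 = 1 - norm.cdf(mu - w)               # P(Y = 2)
print(p0, p1, p2, p0 + p1 + p2)         # the three probabilities sum to one
```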

G023. III
Ordered Probit

• Marginal Effects:

\frac{\partial P(Y_i = 0)}{\partial X_i} = -\theta \phi(-X_i'\theta)

\frac{\partial P(Y_i = 1)}{\partial X_i} = \theta \big( \phi(X_i'\theta) - \phi(\mu - X_i'\theta) \big)

\frac{\partial P(Y_i = 2)}{\partial X_i} = \theta \phi(\mu - X_i'\theta)

• Note that if θ > 0, ∂P (Yi = 0)/∂Xi < 0 and ∂P (Yi = 2)/∂Xi >
0:
– If Xi has a positive effect on the latent variable, then by
increasing Xi , fewer individuals will stay in category 0.
– Similarly, more individuals will be in category 2.
– In the intermediate category, the fraction of individuals will
either increase or decrease, depending on the relative size
of the inflow from category 0 and the outflow to category 2.

G023. III
Ordered Probit: Example

• We want to investigate the determinants of health.


• Individuals are asked to report their health status in three cat-
egories: poor, fair or good.
• We estimate an ordered probit and calculate the marginal ef-
fects at the mean of the sample.

Variable Coeff sd. err. Marginal Effects Sample


Poor Fair Good Mean
Age 18-30 -1.09** .031 -.051** -.196** .248** .25
Age 30-50 -.523** .031 -.031** -.109** .141** .32
Age 50-70 -.217** .026 -.013** -.046** .060** .24
Male -.130** .018 -.008** -.028** .037** .48
Income low third .428** .027 .038** .098** -.136** .33
Income medium third .264** .022 .020** .059** -.080** .33
Education low .40** .028 .031** .091** -.122** .43
Education Medium .257** .026 .018** .057** -.076** .37
Year of interview -.028 .018 -.001 -.006 .008 1.9
Household size -.098** .008 -.006** -.021** .028** 2.5
Alcohol consumed .043** .041 .002** .009** -.012** .04
Current smoker .160** .018 .011** .035** -.046** .49
cut1 .3992** .058
cut2 1.477** .059

Age group Proportion


Poor Health Fair Health Good Health
Age 18-30 .01 .08 .90
Age 30-50 .03 .13 .83
Age 50-70 .07 .28 .64
Age 70 + .15 .37 .46

G023. III
Ordered Probit: Example

• Marginal Effects differ by individual characteristics.


• Below, we compare the marginal effects from an ordered probit
and a multinomial logit.

Marginal Effects for Good Health


Variable Ordered X Ordered Multinomial
Probit at mean Probit at X Logit at X
Age 18-30 .248** 1 .375** .403**
Age 30-50 .141** 0 .093** .077**
Age 50-70 .060** 0 .046** .035**
Male .037** 1 .033** .031**
Income low third -.136** 1 -.080** -.066**
Income medium third -.080** 0 -.071** -.067**
Education low -.122** 1 -.077** -.067**
Education Medium -.076** 0 -.069** -.064**
Year of interview .008 1 .006 .003
Household size .028** 2 .023** .020**
Alcohol consumed -.012** 0 -.010** -.011**
Current smoker -.046** 0 -.041** -.038**

G023. III
Models for Count Data

• The methods developed above are useful when we want to


model the occurrence or otherwise of an event. Sometimes
we want to model the number of times an event occurs. In
general it might be any nonnegative integer. Count data are
being used increasingly in econometrics.
• An interesting application is to the modelling of the returns to
R&D investment in which data on numbers of patents filed in a
series of years by a sample of companies is studied and related
to data on R&D investments.
• Binomial and Poisson probability models provide common start-
ing points in the development of count data models.
• If Z1 , . . . , Zm are identically and independently distributed bi-
nary random variables with P [Zi = 1] = p, P [Zi = 0] = 1 − p,
then the sum of the Zi ’s has a Binomial distribution,

Y = \sum_{i=1}^m Z_i \sim Bi(m, p)

and

P[Y = j] = \frac{m!}{j!(m - j)!} p^j (1 - p)^{m - j}, \qquad j \in \{0, 1, 2, \dots, m\}

G023. III
Models for Count Data

• As m becomes large, m^{1/2}(m^{-1}Y - p) becomes approximately
normally distributed, N(0, p(1 - p)), and as m becomes large
while mp = λ remains constant, Y comes to have a Poisson
distribution,

Y \sim Po(\lambda)

and

P[Y = j] = \frac{\lambda^j}{j!} \exp(-\lambda), \qquad j \in \{0, 1, 2, \dots\}.

• In each case letting p or λ be functions of covariates creates


a model for the conditional distribution of a count of events
given covariate values.
• The Poisson model is much more widely used, in part because
there is no need to specify or estimate the parameter m.
• In the application to R&D investment one might imagine that
a firm seeds a large number of research projects in a period
of time, each of which has only a small probability of produc-
ing a patent. This is consonant with the Poisson probability
model but note that one might be concerned about the under-
lying assumption of independence across projects built into the
Poisson model.

G023. III
Models for Count Data

• The estimation of the model proceeds by maximum likelihood.
The Poisson model is used as an example. Suppose that we
specify a single index model:

P[Y_i = y_i | x_i] = \frac{\lambda(x_i'\theta)^{y_i}}{y_i!} \exp(-\lambda(x_i'\theta)), \qquad y_i \in \{0, 1, 2, \dots\}.

• The log likelihood function is

l(\theta, y) = \sum_{i=1}^n \big( y_i \log \lambda(x_i'\theta) - \lambda(x_i'\theta) - \log y_i! \big)

with first derivative

l_\theta(\theta, y) = \sum_{i=1}^n \left( y_i \frac{\lambda_w(x_i'\theta)}{\lambda(x_i'\theta)} - \lambda_w(x_i'\theta) \right) x_i
                    = \sum_{i=1}^n (y_i - \lambda(x_i'\theta)) \frac{\lambda_w(x_i'\theta)}{\lambda(x_i'\theta)} x_i

where λ_w(w) is the derivative of λ(w) with respect to w.

• The MLE satisfies

\sum_{i=1}^n \left( y_i - \lambda(x_i'\hat\theta) \right) \frac{\lambda_w(x_i'\hat\theta)}{\lambda(x_i'\hat\theta)} x_i = 0.

G023. III
Models for Count Data

• The second derivative matrix is

l_{\theta\theta'}(\theta, y) = \sum_{i=1}^n (y_i - \lambda(x_i'\theta)) \left( \frac{\lambda_{ww}(x_i'\theta)}{\lambda(x_i'\theta)} - \left( \frac{\lambda_w(x_i'\theta)}{\lambda(x_i'\theta)} \right)^2 \right) x_i x_i'
                              - \sum_{i=1}^n \frac{\lambda_w(x_i'\theta)^2}{\lambda(x_i'\theta)} x_i x_i'

where, note, the first term has expected value zero. Therefore
the Information Matrix for this conditional Poisson model is

I(\theta) = \sum_{i=1}^n \frac{\lambda_w(x_i'\theta)^2}{\lambda(x_i'\theta)} x_i x_i'.

The limiting distribution of the MLE is (under suitable condi-
tions)

n^{1/2}(\hat\theta - \theta_0) \xrightarrow{d} N(0, V_0),
\qquad V_0 = \operatorname{plim}_{n \to \infty} \left( n^{-1} \sum_{i=1}^n \frac{\lambda_w(x_i'\theta)^2}{\lambda(x_i'\theta)} x_i x_i' \right)^{-1}

and we can make approximate inference about θ0 using

(\hat\theta - \theta_0) \simeq N(0, n^{-1} V_0)

with V0 estimated by

\hat{V}_0 = \left( n^{-1} \sum_{i=1}^n \frac{\lambda_w(x_i'\hat\theta)^2}{\lambda(x_i'\hat\theta)} x_i x_i' \right)^{-1}.

• In applied work a common choice is λ(w) = exp(w), for which

\frac{\lambda_w(w)}{\lambda(w)} = 1, \qquad \frac{\lambda_w(w)^2}{\lambda(w)} = \exp(w).
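
A sketch of Poisson estimation with λ(w) = exp(w) on simulated data (assumptions mine): Newton steps use the score Σ(y_i − λ_i)x_i and the information Σ λ_i x_i x_i', and standard errors come from the inverse information at θ̂.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(x @ theta_true))    # simulated Poisson counts with exponential mean

theta = np.zeros(x.shape[1])
for _ in range(50):                        # Newton-Raphson for the Poisson MLE
    lam = np.exp(x @ theta)
    score = x.T @ (y - lam)                # sum_i (y_i - lambda_i) x_i
    info = (x * lam[:, None]).T @ x        # sum_i lambda_i x_i x_i'
    step = np.linalg.solve(info, score)
    theta = theta + step
    if np.max(np.abs(step)) < 1e-10:
        break

se = np.sqrt(np.diag(np.linalg.inv(info)))  # standard errors from the inverse information
print(theta, se)
```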

G023. III
