
Maximum Likelihood Methods

• Some of the models used in econometrics specify the complete


probability distribution of the outcomes of interest rather than
just a regression function.
• Sometimes this is because of special features of the outcomes
under study - for example because they are discrete or censored,
or because there is serial dependence of a complex form.
• When the complete probability distribution of outcomes given
covariates is specified we can develop an expression for the
probability of observation of the responses we see as a function
of the unknown parameters embedded in the specification.
• We can then ask what values of these parameters maximise
this probability for the data we have. The resulting statistics,
functions of the observed data, are called maximum likelihood
estimators. They possess important optimality properties and
have the advantage that they can be produced in a rule directed
fashion.

G023. III
Estimating a Probability

• Suppose Y1 , . . . Yn are binary independently and identically dis-


tributed random variables with P [Yi = 1] = p, P [Yi = 0] = 1−p
for all i.
• We might use such a model for data recording the occurrence
or otherwise of an event for n individuals, for example being
in work or not, buying a good or service or not, etc.
• Let y1 , . . . , yn indicate the data values obtained and note that
in this model
P[Y_1 = y_1 \cap \cdots \cap Y_n = y_n] = \prod_{i=1}^n p^{y_i} (1 - p)^{(1 - y_i)}
                                        = p^{\sum_{i=1}^n y_i} (1 - p)^{\sum_{i=1}^n (1 - y_i)}
                                        = L(p; y).

With any set of data L(p; y) can be calculated for any value of
p between 0 and 1. The result is the probability of observing
the data to hand for each chosen value of p.
• One strategy for estimating p is to use that value that max-
imises this probability. The resulting estimator is called the
maximum likelihood estimator (MLE) and the maximand, L(p; y),
is called the likelihood function.

G023. III
Log Likelihood Function

• The maximum of the log likelihood function, l(p; y) = log L(p, y),
is at the same value of p as is the maximum of the likelihood
function (because the log function is monotonic).
• It is often easier to maximise the log likelihood function (LLF).
For the problem considered here the LLF is
à n ! n
X X
l(p; y) = yi log p + (1 − yi ) log(1 − p).
i=1 i=1

Let
p̂ = arg maxL(p; y) = arg maxl(p; y).
p p

On differentiating we have the following:

l_p(p; y) = \frac{1}{p} \sum_{i=1}^n y_i - \frac{1}{1 - p} \sum_{i=1}^n (1 - y_i)

l_{pp}(p; y) = -\frac{1}{p^2} \sum_{i=1}^n y_i - \frac{1}{(1 - p)^2} \sum_{i=1}^n (1 - y_i).

Note that l_{pp}(p; y) is always negative for admissible p, so the
optimisation problem has a unique solution corresponding to a
maximum. The solution to l_p(\hat{p}; y) = 0 is

\hat{p} = \frac{1}{n} \sum_{i=1}^n y_i

just the mean of the observed values of the binary indicators,
equivalently the proportion of 1’s observed in the data.
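
As a numerical illustration (a sketch added here, not part of the original slides, using simulated data), the Python code below computes the sample proportion and checks that a numerical maximiser of l(p; y) returns the same value.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=200)       # simulated binary data with true p = 0.3

def neg_loglik(p, y):
    # negative of l(p; y) = sum(y) log p + sum(1 - y) log(1 - p)
    return -(y.sum() * np.log(p) + (1 - y).sum() * np.log(1 - p))

p_hat_analytic = y.mean()                # the MLE derived above: the sample proportion
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), args=(y,), method="bounded")
print(p_hat_analytic, res.x)             # the two values agree up to optimiser tolerance
```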

G023. III
Likelihood Functions and Estimation in General

• Let Yi , i = 1, . . . , n be continuously distributed random vari-


ables with joint probability density function f (y1 , . . . , yn , θ).
• The probability that Y falls in infinitesimal intervals of width
dy1 , . . . dyn centred on values y1 , . . . , yn is

A = f (y1 , . . . , yn , θ)dy1 dy2 . . . dyn

Here only the joint density function depends upon θ and the
value of θ that maximises f (y1 , . . . , yn , θ) also maximises A.
• In this case the likelihood function is defined to be the joint
density function of the Yi ’s.
• When the Yi ’s are discrete random variables the likelihood func-
tion is the joint probability mass function of the Yi ’s, and in
cases in which there are discrete and continuous elements the
likelihood function is a combination of probability density ele-
ments and probability mass elements.
• In all cases the likelihood function is a function of the observed
data values that is equal to, or proportional to, the probability
of observing these particular values, where the constant of pro-
portionality does not depend upon the parameters which are
to be estimated.

G023. III
Likelihood Functions and Estimation in General

• When Yi , i = 1, . . . , n are independently distributed the joint


density (mass) function is the product of the marginal density
(mass) functions of each Yi , the likelihood function is
L(y; \theta) = \prod_{i=1}^n f_i(y_i; \theta),

and the log likelihood function is the sum:

l(y; \theta) = \sum_{i=1}^n \log f_i(y_i; \theta).

There is a subscript i on f to allow for the possibility that each
Yi has a distinct probability distribution.
• This situation arises when modelling conditional distributions
of Y given some covariates x. In particular, fi (yi ; θ) = fi (yi |xi ; θ),
and often fi (yi |xi ; θ) = f (yi |xi ; θ).
• In time series and panel data problems there is often dependence
among the Yi ’s. For any list of random variables Y = \{Y_1, \dots, Y_n\},
define the (i-1)-element list Y_{i-} = \{Y_1, \dots, Y_{i-1}\}.
The joint density (mass) function of Y can be written as

f(y) = \Big( \prod_{i=2}^n f_{y_i | y_{i-}}(y_i | y_{i-}) \Big) f_{y_1}(y_1).

G023. III
Invariance

• Note that (parameter-free) monotonic transformations of the
Yi ’s (for example, a change of units of measurement, or use of
logs rather than the original y data) usually lead to a change
in the value of the maximised likelihood function when we work
with continuous distributions.
• If we transform from y to z where y = h(z) and the joint
density function of y is f_y(y; \theta), then the joint density function
of z is

f_z(z; \theta) = \left| \frac{\partial h(z)}{\partial z} \right| f_y(h(z); \theta).
• For any given set of values, y ∗ , the value of θ that maximises
the likelihood function fy (y ∗ , θ) also maximises the likelihood
function fz (z ∗ ; θ) where y ∗ = h(z ∗ ), so the maximum likelihood
estimator is invariant with respect to such changes in the way
the data are presented.
• However the maximised likelihood functions will differ by a
factor equal to \left| \frac{\partial h(z)}{\partial z} \right| evaluated at z = z^*.

• The reason for this is that we omit the infinitesimals dy1 , . . . dyn
from the likelihood function for continuous variates and these
change when we move from y to z because they are denomi-
nated in the units in which y or z are measured.

G023. III
Maximum Likelihood: Properties

• Maximum likelihood estimators possess another important in-


variance property. Suppose two researchers choose different
ways in which to parameterise the same model. One uses θ,
and the other uses λ = h(θ) where this function is one-to-one.
Then faced with the same data and producing estimators θ̂ and
λ̂, it will always be the case that λ̂ = h(θ̂).
• There are a number of important consequences of this:
– For instance, if we are interested in the ratio of two para-
meters, the MLE of the ratio will be the ratio of the ML
estimators.
– Sometimes a re-parameterisation can improve the numeri-
cal properties of the likelihood function. Newton’s method
and its variants may in practice work better if parameters
are rescaled.
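
A small sketch of this invariance property (simulated Bernoulli data, assumptions mine): maximising the likelihood over the log-odds ψ = log(p/(1−p)) and transforming back gives the same answer as the direct MLE of p.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.6, size=500)       # simulated binary data

def neg_loglik_psi(psi, y):
    # Bernoulli log likelihood re-parameterised in the log-odds psi = log(p / (1 - p))
    p = 1.0 / (1.0 + np.exp(-psi))
    return -(y.sum() * np.log(p) + (1 - y).sum() * np.log(1 - p))

psi_hat = minimize_scalar(neg_loglik_psi, bounds=(-10, 10), args=(y,), method="bounded").x
p_hat = y.mean()                         # MLE in the original parameterisation
# invariance: the MLE of psi equals h(p_hat) = log(p_hat / (1 - p_hat))
print(psi_hat, np.log(p_hat / (1 - p_hat)))
```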

G023. III
Maximum Likelihood: Improving Numerical Properties

• An example of this often arises when, in index models, elements


of x involve squares, cubes, etc., of some covariate, say x_1.
Then maximisation of the likelihood function may be easier
if instead of x_1^2, x_1^3, etc., you use x_1^2/10, x_1^3/100, etc., with
consequent rescaling of the coefficients on these covariates. You
can always recover the MLEs you would have obtained without
the rescaling by rescaling the estimates.
• There are some cases in which a re-parameterisation can pro-
duce a globally concave likelihood function where in the origi-
nal parameterisation there was not global concavity.
• An example of this arises in the “Tobit” model.
– This is a model in which each Y_i is N(x_i'\beta, \sigma^2) with negative
realisations replaced by zeros. The model is sometimes
used to model expenditures and hours worked, which are
necessarily non-negative.
– In this model the likelihood as parameterised here is not
globally concave, but re-parameterising to λ = β/σ, and
γ = 1/σ, produces a globally concave likelihood function.
– The invariance property tells us that having maximised the
“easy” likelihood function and obtained estimates λ̂ and γ̂,
we can recover the maximum likelihood estimates we might
have had difficulty finding in the original parameterisation
by calculating β̂ = λ̂/γ̂ and σ̂ = 1/γ̂.
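
A sketch of the re-parameterised Tobit likelihood (simulated data and the Nelder-Mead optimiser are my choices, not part of the slides): the log likelihood is coded in (λ, γ) = (β/σ, 1/σ) and the original parameters are recovered afterwards via the invariance property.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 1000
x = np.column_stack([np.ones(n), rng.normal(size=n)])    # intercept + one covariate
beta_true, sigma_true = np.array([0.5, 1.0]), 1.5
y_star = x @ beta_true + sigma_true * rng.normal(size=n)
y = np.maximum(y_star, 0.0)                              # negative realisations replaced by zeros

def neg_loglik(par, y, x):
    # parameterisation: lam = beta / sigma, gam = 1 / sigma
    lam, gam = par[:-1], par[-1]
    if gam <= 0:
        return np.inf
    xb = x @ lam
    pos = y > 0
    ll_pos = np.log(gam) + norm.logpdf(gam * y[pos] - xb[pos])   # uncensored observations
    ll_zero = norm.logcdf(-xb[~pos])                             # censored observations
    return -(ll_pos.sum() + ll_zero.sum())

start = np.concatenate([np.zeros(x.shape[1]), [1.0]])
res = minimize(neg_loglik, start, args=(y, x), method="Nelder-Mead")
lam_hat, gam_hat = res.x[:-1], res.x[-1]
beta_hat, sigma_hat = lam_hat / gam_hat, 1.0 / gam_hat   # invariance: recover the original parameters
print(beta_hat, sigma_hat)
```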

G023. III
Properties Of Maximum Likelihood Estimators

• First we just sketch the main results:

– Let l(θ; Y ) be the log likelihood function now regarded as


a random variable, a function of a set of (possibly vector)
random variables Y = {Y1 , . . . , Yn }.
– Let lθ (θ; Y ) be the gradient of this function, itself a vector
of random variables (scalar if θ is scalar) and let lθθ (θ; Y )
be the matrix of second derivatives of this function (also a
scalar if θ is a scalar).
– Let
θ̂ = arg max l(θ; Y ).
θ

In order to make inferences about θ using θ̂ we need to


determine the distribution of θ̂. We consider developing a
large sample approximation. The limiting distribution for
a quite wide class of maximum likelihood problems is as
follows:

n^{1/2}(\hat\theta - \theta_0) \xrightarrow{d} N(0, V_0)

where

V_0 = -\operatorname{plim}_{n \to \infty} \big( n^{-1} l_{\theta\theta}(\theta_0; Y) \big)^{-1}

and θ0 is the unknown parameter value. To get an ap-
proximate distribution that can be used in practice we use
-(n^{-1} l_{\theta\theta}(\hat\theta; Y))^{-1} or some other consistent estimator of V0
in place of V0.

G023. III
Properties Of Maximum Likelihood Estimators

• We apply the method for dealing with M-estimators.


• Suppose θ̂ is uniquely determined as the solution to the first
order condition
lθ (θ̂; Y ) = 0
and that θ̂ is a consistent estimator of the unknown value of
the parameter, θ0 . Weak conditions required for consistency
are quite complicated and will not be given here.
• Taking a Taylor series expansion around θ = θ0 and then eval-
uating this at θ = θ̂ gives

0 \simeq l_\theta(\theta_0; Y) + l_{\theta\theta'}(\theta_0; Y)(\hat\theta - \theta_0)

and rearranging and scaling by powers of the sample size n,

n^{1/2}(\hat\theta - \theta_0) \simeq -\big( n^{-1} l_{\theta\theta'}(\theta_0; Y) \big)^{-1} n^{-1/2} l_\theta(\theta_0; Y).

As in our general treatment of M-estimators, if we can show that

n^{-1} l_{\theta\theta'}(\theta_0; Y) \xrightarrow{p} A(\theta_0)
and
n^{-1/2} l_\theta(\theta_0; Y) \xrightarrow{d} N(0, B(\theta_0))

then

n^{1/2}(\hat\theta - \theta_0) \xrightarrow{d} N(0, A(\theta_0)^{-1} B(\theta_0) A(\theta_0)^{-1\prime}).

G023. III
Maximum Likelihood: Limiting Distribution

• What is the limiting distribution of n−1/2 lθ (θ0 ; Y )?


• First note that in problems for which the Yi ’s are indepen-
dently distributed, n−1/2 lθ (θ0 ; Y ) is a scaled mean of random
variables and we may be able to find conditions under which
a central limit theorem applies, indicating a limiting normal
distribution.
• We must now find the mean and variance of this distribution.
Since L(θ; Y ) is a joint probability density function (we just
consider the continuous distribution case here),
\int L(\theta; y) \, dy = 1

where multiple integration is over the support of Y. If this
support does not depend upon θ, then

\frac{\partial}{\partial \theta} \int L(\theta; y) \, dy = \int L_\theta(\theta; y) \, dy = 0.

But, because l(θ; y) = log L(θ; y), and lθ(θ; y) = Lθ(θ; y)/L(θ; y),
we have

\int L_\theta(\theta; y) \, dy = \int l_\theta(\theta; y) L(\theta; y) \, dy = E[l_\theta(\theta; Y)]

and so E[lθ(θ; Y)] = 0.


• This holds for any value of θ, in particular for θ0 above. If the
variance of lθ (θ0 ; Y ) converges to zero as n becomes large then
lθ (θ0 ; Y ) will converge in probability to zero and the mean of
the limiting distribution of n−1/2 lθ (θ0 ; Y ) will be zero.

G023. III
Maximum Likelihood: Limiting Distribution

• We turn now to the variance of the limiting distribution. We
have just shown that

\int l_\theta(\theta; y) L(\theta; y) \, dy = 0.

Differentiating again,

\frac{\partial}{\partial \theta'} \int l_\theta(\theta; y) L(\theta; y) \, dy
  = \int \big( l_{\theta\theta'}(\theta; y) L(\theta; y) + l_\theta(\theta; y) L_{\theta'}(\theta; y) \big) \, dy
  = \int \big( l_{\theta\theta'}(\theta; y) + l_\theta(\theta; y) l_\theta(\theta; y)' \big) L(\theta; y) \, dy
  = E\big[ l_{\theta\theta'}(\theta; Y) + l_\theta(\theta; Y) l_\theta(\theta; Y)' \big]
  = 0.

Separating the two terms in the penultimate line,

E[l_\theta(\theta; Y) l_\theta(\theta; Y)'] = -E[l_{\theta\theta'}(\theta; Y)]          (4)

and note that, since E[lθ(θ; Y)] = 0,

Var[l_\theta(\theta; Y)] = E[l_\theta(\theta; Y) l_\theta(\theta; Y)']

and so

Var[l_\theta(\theta; Y)] = -E[l_{\theta\theta'}(\theta; Y)]

\Rightarrow Var[n^{-1/2} l_\theta(\theta; Y)] = -E[n^{-1} l_{\theta\theta'}(\theta; Y)]

giving

B(\theta_0) = -\operatorname{plim}_{n \to \infty} n^{-1} l_{\theta\theta'}(\theta_0; Y).

The matrix

I(\theta) = -E[l_{\theta\theta'}(\theta; Y)]

plays a central role in likelihood theory - it is called the Infor-
mation Matrix.
Finally, because B(θ0) = -A(θ0),

A(\theta)^{-1} B(\theta) A(\theta)^{-1\prime} = \Big( -\operatorname{plim}_{n \to \infty} n^{-1} l_{\theta\theta'}(\theta; Y) \Big)^{-1}.
Of course a number of conditions are required for the results
above to hold. These include the boundedness of third order
derivatives of the log likelihood function, independence or
at most weak dependence of the Yi ’s, existence of moments of
derivatives of the log likelihood, or at least of probability limits
of suitably scaled versions of them, and lack of dependence of
the support of the Yi ’s on θ.
The result in equation (4) above leads, under suitable condi-
tions concerning convergence, to

\operatorname{plim}_{n \to \infty} \big( n^{-1} l_\theta(\theta; Y) l_\theta(\theta; Y)' \big) = -\operatorname{plim}_{n \to \infty} \big( n^{-1} l_{\theta\theta'}(\theta; Y) \big).

This gives an alternative way of “estimating” V0, namely

\hat{V}_0^o = \big\{ n^{-1} l_\theta(\hat\theta; Y) l_\theta(\hat\theta; Y)' \big\}^{-1}

which, compared with

\tilde{V}_0^o = \big\{ -n^{-1} l_{\theta\theta'}(\hat\theta; Y) \big\}^{-1},

has the advantage that only first derivatives of the log like-
lihood function need to be calculated. Sometimes V̂0o is re-
ferred to as the “outer product of gradient” (OPG) estimator.
Both these estimators use the “observed” values of functions
of derivatives of the LLF. It may be possible to derive
explicit expressions for the expected values of these functions.
Then one can estimate V0 by

\hat{V}_0^e = \big\{ E[n^{-1} l_\theta(\theta; Y) l_\theta(\theta; Y)'] \big|_{\theta=\hat\theta} \big\}^{-1}
            = \big\{ -E[n^{-1} l_{\theta\theta'}(\theta; Y)] \big|_{\theta=\hat\theta} \big\}^{-1}.

These two sorts of estimators are sometimes referred to as “ob-
served information” (V̂0o, Ṽ0o) and “expected information” (V̂0e)
estimators.
Maximum likelihood estimators possess an optimality property,
namely that, among the class of consistent and asymptotically
normally distributed estimators, the variance matrix of their
limiting distribution is the smallest that can be achieved, in the
sense that other estimators in the class have limiting distribu-
tions with variance matrices exceeding the MLE’s by a positive
semidefinite matrix.
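
To make the distinction concrete, here is a sketch (mine, using a simple intercept-only Poisson model rather than anything from the slides) comparing the Hessian-based, OPG and expected-information estimates of the variance of the MLE.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.poisson(2.5, size=400)                  # Y_i ~ Po(lambda); the MLE is the sample mean
n, lam_hat = y.size, y.mean()

scores = y / lam_hat - 1.0                      # per-observation scores evaluated at the MLE
opg_info = np.sum(scores**2)                    # "outer product of gradient" estimate of the information
obs_info = np.sum(y) / lam_hat**2               # minus the Hessian of the LLF at the MLE
exp_info = n / lam_hat                          # expected information n / lambda, evaluated at the MLE

# three estimates of Var(lambda_hat); obs_info equals exp_info here because the MLE
# is the sample mean, while the OPG estimate differs in finite samples
print(1 / opg_info, 1 / obs_info, 1 / exp_info)
```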
G023. III
Estimating a Conditional Probability

• Suppose Y1, . . . , Yn are independently distributed binary random
variables with

P[Y_i = 1 | X = x_i] = p(x_i, \theta)

P[Y_i = 0 | X = x_i] = 1 - p(x_i, \theta).

This is an obvious extension of the model in the previous sec-


tion.
• The likelihood function for this problem is

P[Y_1 = y_1 \cap \cdots \cap Y_n = y_n | x] = \prod_{i=1}^n p(x_i, \theta)^{y_i} (1 - p(x_i, \theta))^{(1 - y_i)}
                                            = L(\theta; y).

where y denotes the complete set of values of yi and dependence


on x is suppressed in the notation. The log likelihood function
is
l(\theta; y) = \sum_{i=1}^n y_i \log p(x_i, \theta) + \sum_{i=1}^n (1 - y_i) \log(1 - p(x_i, \theta))

and the maximum likelihood estimator of θ is

\hat\theta = \arg\max_\theta l(\theta; y).

So far this is an obvious generalisation of the simple problem


met in the last section.

G023. III
Estimating a Conditional Probability

• To implement the model we choose a form for the function


p(x, θ), which must of course lie between zero and one.

– One common choice is

p(x, \theta) = \frac{\exp(x'\theta)}{1 + \exp(x'\theta)}

which produces what is commonly called a logit model.
– Another common choice is

p(x, \theta) = \Phi(x'\theta) = \int_{-\infty}^{x'\theta} \phi(w) \, dw,
\qquad \phi(w) = (2\pi)^{-1/2} \exp(-w^2/2),

in which Φ is the standard normal distribution function.
This produces what is known as a probit model.

• Both models are widely used. Note that in both cases a single
index model is specified, the probability functions are monotonic
increasing, probabilities arbitrarily close to zero or one are ob-
tained when x0 θ is sufficiently large or small, and there is a
symmetry in both of the models in the sense that p(−x, θ) =
1 − p(x, θ). Any or all of these properties might be inappropri-
ate in a particular application but there is rarely discussion of
this in the applied econometrics literature.

G023. III
More on Logit and Probit

• Both models can also be written as a linear model involving a


latent variable.
• We define a latent variable Y_i^*, which is unobserved, but
determined by the following model:

Y_i^* = X_i'\theta + \varepsilon_i

We observe the variable Y_i which is linked to Y_i^* as:

Y_i = 0  if  Y_i^* < 0
Y_i = 1  if  Y_i^* \geq 0

• The probability of observing Y_i = 1 is:

p_i = P(Y_i = 1) = P(Y_i^* \geq 0)
    = P(X_i'\theta + \varepsilon_i \geq 0)
    = P(\varepsilon_i \geq -X_i'\theta)
    = 1 - F_\varepsilon(-X_i'\theta)

where F_ε is the cumulative distribution function of the random
variable ε.

G023. III
Odds-Ratio

• Define the ratio p_i/(1 - p_i) as the odds-ratio. This is the ratio
of the probability of outcome 1 over the probability of outcome
0. If this ratio is equal to 1, then both outcomes have equal
probability (p_i = 0.5). If this ratio is equal to 2, say, then
outcome 1 is twice as likely as outcome 0 (p_i = 2/3).
• In the logit model, the log odds-ratio is linear in the parame-
ters:

\ln \frac{p_i}{1 - p_i} = X_i'\theta

• In the logit model, θ is the marginal effect of X on the log
odds-ratio. A unit increase in X multiplies the odds-ratio by
exp(θ), which is approximately a 100·θ % increase when θ is small.

G023. III
Marginal Effects

• Logit model:

\frac{\partial p_i}{\partial X_i} = \frac{\theta \exp(X_i'\theta)(1 + \exp(X_i'\theta)) - \theta \exp(X_i'\theta)^2}{(1 + \exp(X_i'\theta))^2}
                                  = \frac{\theta \exp(X_i'\theta)}{(1 + \exp(X_i'\theta))^2}
                                  = \theta p_i (1 - p_i)

A one unit increase in X leads to an increase of approximately θ p_i(1 - p_i) in the probability.

• Probit model:

\frac{\partial p_i}{\partial X_i} = \theta \phi(X_i'\theta)

A one unit increase in X leads to an increase of approximately θ φ(X_i'θ) in the probability.
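
A short sketch evaluating these marginal-effect formulae at hypothetical values of θ and X (the numbers are illustrative only):

```python
import numpy as np
from scipy.stats import norm

# hypothetical coefficients and covariate vector (intercept plus two regressors)
theta = np.array([-0.5, 0.8, -0.3])
x = np.array([1.0, 2.0, 1.5])
w = x @ theta                                   # the single index x'theta

# logit: dp/dx = theta * p * (1 - p), with p = exp(w) / (1 + exp(w))
p_logit = np.exp(w) / (1 + np.exp(w))
me_logit = theta * p_logit * (1 - p_logit)

# probit: dp/dx = theta * phi(x'theta)
me_probit = theta * norm.pdf(w)

print(me_logit, me_probit)
```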

G023. III
ML in Single Index Models

• We can cover both cases by considering general single index
models, so for the moment rewrite p(x, θ) as g(w) where w = x'θ.
• The first derivative of the log likelihood function is:

l_\theta(\theta; y) = \sum_{i=1}^n y_i \frac{g_w(x_i'\theta) x_i}{g(x_i'\theta)} - (1 - y_i) \frac{g_w(x_i'\theta) x_i}{1 - g(x_i'\theta)}
                    = \sum_{i=1}^n (y_i - g(x_i'\theta)) \frac{g_w(x_i'\theta)}{g(x_i'\theta)(1 - g(x_i'\theta))} x_i

Here g_w(w) is the derivative of g(w) with respect to w.

• The expression for the second derivative is rather messy. Here
we just note that its expected value given x is quite simple,
namely

E[l_{\theta\theta'}(\theta; y) | x] = -\sum_{i=1}^n \frac{g_w(x_i'\theta)^2}{g(x_i'\theta)(1 - g(x_i'\theta))} x_i x_i',

the negative of which is the Information Matrix for general
single index binary data models.

G023. III
Asymptotic Properties of the Logit Model

• For the logit model there is a major simplification:

g(w) = \frac{\exp(w)}{1 + \exp(w)}

g_w(w) = \frac{\exp(w)}{(1 + \exp(w))^2}

\Rightarrow \frac{g_w(w)}{g(w)(1 - g(w))} = 1.

Therefore in the logit model the MLE satisfies

\sum_{i=1}^n \left( y_i - \frac{\exp(x_i'\hat\theta)}{1 + \exp(x_i'\hat\theta)} \right) x_i = 0,

the Information Matrix is

I(\theta) = \sum_{i=1}^n \frac{\exp(x_i'\theta)}{(1 + \exp(x_i'\theta))^2} x_i x_i',

the MLE has the limiting distribution

n^{1/2}(\hat\theta_n - \theta) \xrightarrow{d} N(0, V_0),
\qquad V_0 = \operatorname{plim}_{n \to \infty} \left( n^{-1} \sum_{i=1}^n \frac{\exp(x_i'\theta)}{(1 + \exp(x_i'\theta))^2} x_i x_i' \right)^{-1},

and we can conduct approximate inference using the following
approximation

n^{1/2}(\hat\theta_n - \theta) \simeq N(0, V_0)

using the estimator

\hat{V}_0 = \left( n^{-1} \sum_{i=1}^n \frac{\exp(x_i'\hat\theta)}{(1 + \exp(x_i'\hat\theta))^2} x_i x_i' \right)^{-1}

when producing approximate hypothesis tests and confidence
intervals.
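
A sketch of logit estimation built directly from these expressions (simulated data; assumptions mine): Newton steps use the score Σ(y_i − p_i)x_i and the information Σ p_i(1 − p_i)x_i x_i', and standard errors come from the inverse information at θ̂.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([0.5, 1.0, -0.7])
y = rng.binomial(1, 1 / (1 + np.exp(-x @ theta_true)))   # simulated logit outcomes

theta = np.zeros(x.shape[1])
for _ in range(25):                              # Newton-Raphson iterations
    p = 1 / (1 + np.exp(-x @ theta))
    score = x.T @ (y - p)                        # sum_i (y_i - p_i) x_i
    info = (x * (p * (1 - p))[:, None]).T @ x    # sum_i p_i (1 - p_i) x_i x_i'
    step = np.linalg.solve(info, score)
    theta = theta + step
    if np.max(np.abs(step)) < 1e-10:
        break

se = np.sqrt(np.diag(np.linalg.inv(info)))       # standard errors from the inverse information
print(theta, se)
```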
G023. III
Asymptotic Properties of the Probit Model

• In the probit model

g(w) = \Phi(w)

g_w(w) = \phi(w)

\Rightarrow \frac{g_w(w)}{g(w)(1 - g(w))} = \frac{\phi(w)}{\Phi(w)(1 - \Phi(w))}.

Therefore in the probit model the MLE satisfies

\sum_{i=1}^n \left( y_i - \Phi(x_i'\hat\theta) \right) \frac{\phi(x_i'\hat\theta)}{\Phi(x_i'\hat\theta)(1 - \Phi(x_i'\hat\theta))} x_i = 0,

the Information Matrix is

I(\theta) = \sum_{i=1}^n \frac{\phi(x_i'\theta)^2}{\Phi(x_i'\theta)(1 - \Phi(x_i'\theta))} x_i x_i',

the MLE has the limiting distribution

n^{1/2}(\hat\theta_n - \theta) \xrightarrow{d} N(0, V_0),
\qquad V_0 = \operatorname{plim}_{n \to \infty} \left( n^{-1} \sum_{i=1}^n \frac{\phi(x_i'\theta)^2}{\Phi(x_i'\theta)(1 - \Phi(x_i'\theta))} x_i x_i' \right)^{-1},

and we can conduct approximate inference using the following
approximation

n^{1/2}(\hat\theta_n - \theta) \simeq N(0, V_0)

using the estimator

\hat{V}_0 = \left( n^{-1} \sum_{i=1}^n \frac{\phi(x_i'\hat\theta)^2}{\Phi(x_i'\hat\theta)(1 - \Phi(x_i'\hat\theta))} x_i x_i' \right)^{-1}

when producing approximate tests and confidence intervals.
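
A parallel sketch for the probit model (again with simulated data; here a generic quasi-Newton optimiser is used instead of hand-coded Newton steps), with standard errors taken from the inverse of the information matrix above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([0.3, 0.8, -0.5])
y = rng.binomial(1, norm.cdf(x @ theta_true))    # simulated probit outcomes

def neg_loglik(theta, y, x):
    # l(theta; y) = sum y log Phi(x'theta) + (1 - y) log(1 - Phi(x'theta))
    w = x @ theta
    return -np.sum(y * norm.logcdf(w) + (1 - y) * norm.logcdf(-w))

res = minimize(neg_loglik, np.zeros(x.shape[1]), args=(y, x), method="BFGS")
theta_hat = res.x

# information matrix: sum_i phi(x_i'theta)^2 / (Phi (1 - Phi)) x_i x_i' at theta_hat
w = x @ theta_hat
wt = norm.pdf(w) ** 2 / (norm.cdf(w) * norm.cdf(-w))
info = (x * wt[:, None]).T @ x
se = np.sqrt(np.diag(np.linalg.inv(info)))
print(theta_hat, se)
```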

G023. III
Example: Logit and Probit

• We have data from households in Kuala Lumpur (Malaysia)


describing household characteristics and their concern about
the environment. The question is
”Are you concerned about the environment? Yes / No”.
We also observe their age, sex (coded as 1 for men, 0 for women),
income and quality of the neighborhood measured as air quality.
The latter is coded with a dummy variable smell, equal to 1 if
there is a bad smell in the neighborhood. The model is:

Concerni = β0 +β1 agei +β2 sexi +β3 log incomei +β4 smelli +ui

• We estimate this model with three specifications, LPM, logit


and probit:

Probability of being concerned by Environment


Variable LPM Logit Probit
Est. t-stat Est. t-stat Est. t-stat
age .0074536 3.9 .0321385 3.77 .0198273 3.84
sex .0149649 0.3 .06458 0.31 .0395197 0.31
log income .1120876 3.7 .480128 3.63 .2994516 3.69
smell .1302265 2.5 .5564473 2.48 .3492112 2.52
constant -.683376 -2.6 -5.072543 -4.37 -3.157095 -4.46
Some Marginal Effects
Age .0074536 .0077372 .0082191
log income .1120876 .110528 .1185926
smell .1302265 .1338664 .1429596

G023. III
Multinomial Logit

• The logit model was dealing with two qualitative outcomes.


This can be generalized to multiple outcomes:
– choice of transportation: car, bus, train...
– choice of dwelling: house, apartment, social housing.
• The multinomial logit: Denote the outcomes as j = 1, . . . , J
and p_j the probability of outcome j.

p_j = \frac{\exp(X'\theta^j)}{\sum_{k=1}^J \exp(X'\theta^k)}

where θ^j is a vector of parameters associated with outcome j.
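
A minimal sketch of the multinomial logit probabilities with hypothetical coefficients (the last outcome's coefficient vector is set to zero, anticipating the normalisation discussed on the next slide):

```python
import numpy as np

def mnl_probabilities(x, thetas):
    """Multinomial logit: p_j = exp(x'theta_j) / sum_k exp(x'theta_k).

    `thetas` is a (J, K) array, one coefficient vector per outcome."""
    v = thetas @ x                         # index x'theta_j for each outcome j
    e = np.exp(v - v.max())                # subtract the max for numerical stability
    return e / e.sum()

# hypothetical example: J = 3 outcomes, K = 2 covariates (intercept, income),
# with the last outcome's coefficients normalised to zero
thetas = np.array([[0.2, 0.5],
                   [-0.1, 1.0],
                   [0.0, 0.0]])
x = np.array([1.0, 0.8])
print(mnl_probabilities(x, thetas))        # the probabilities sum to one
```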

G023. III
Identification

• If we add the same vector of constants to the coefficients θ^k of
every outcome, this does not change the probabilities p_j, as the
common factor cancels out of the numerator and denominator.
This means that the parameters are under-identified. We have to
normalise the coefficients of one outcome, say J, to zero. All the
results are then interpreted as deviations from the baseline choice.
• We write the probability of choosing outcome j = 1, . . . , J − 1
as:

p_j = \frac{\exp(X'\theta^j)}{1 + \sum_{k=1}^{J-1} \exp(X'\theta^k)}

• We can express the log odds-ratio as:

\ln \frac{p_j}{p_J} = X'\theta^j

• The odds-ratio of choice j versus J is expressed only as a
function of the parameters of choice j, and not of those of the
other choices: this is the Independence of Irrelevant Alternatives
(IIA) property.

G023. III
Independence of Irrelevant Alternatives

An anecdote which illustrates a violation of this property has


been attributed to Sidney Morgenbesser:

After finishing dinner, Sidney Morgenbesser decides to order


dessert. The waitress tells him he has two choices: apple pie and
blueberry pie. Sidney orders the apple pie.

After a few minutes the waitress returns and says that they also
have cherry pie at which point Morgenbesser says ”In that case I’ll
have the blueberry pie.”

G023. III
Independence of Irrelevant Alternatives

• Consider travelling choices, by car or with a red bus. Assume
for simplicity that the choice probabilities are equal:

P(car) = P(red bus) = 0.5  =⇒  P(car)/P(red bus) = 1

• Suppose we introduce a blue bus, (almost) identical to the red
bus. The probability that individuals will choose the blue bus
is therefore the same as for the red bus and the odds ratio is:

P(blue bus) = P(red bus)  =⇒  P(blue bus)/P(red bus) = 1

• However, the IIA implies that odds ratios are the same whether
or not another alternative exists. The only probabilities for
which the three odds ratios are equal to one are:

P(car) = P(blue bus) = P(red bus) = 1/3

However, the prediction we ought to obtain is:

P(red bus) = P(blue bus) = 1/4,  P(car) = 0.5

G023. III
Marginal Effects: Multinomial Logit

• θj can be interpreted as the marginal effect of X on the log


odds-ratio of choice j to the baseline choice.
• The marginal effect of X on the probability of choosing out-
come j can be expressed as:

\frac{\partial p_j}{\partial X} = p_j \Big[ \theta^j - \sum_{k=1}^J p_k \theta^k \Big]

Hence, the marginal effect on choice j involves not only the
coefficients relative to j but also the coefficients relative to the
other choices.
• Note that we can have θ^j < 0 and ∂p_j/∂X > 0, or vice versa.
Due to the non-linearity of the model, the sign of a coefficient
indicates neither the direction nor the magnitude of the effect
of a variable on the probability of choosing a given outcome.
One has to compute the marginal effects.
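
A sketch of this marginal-effect formula with hypothetical coefficients; it also shows that the effects of a given covariate sum to zero across outcomes, since the probabilities sum to one.

```python
import numpy as np

def mnl_probabilities(x, thetas):
    v = thetas @ x
    e = np.exp(v - v.max())
    return e / e.sum()

def mnl_marginal_effects(x, thetas):
    # dp_j/dX = p_j * (theta_j - sum_k p_k theta_k), one row per outcome j
    p = mnl_probabilities(x, thetas)
    avg = p @ thetas                       # sum_k p_k theta_k
    return p[:, None] * (thetas - avg)

thetas = np.array([[0.2, 0.5],
                   [-0.1, 1.0],
                   [0.0, 0.0]])            # baseline outcome normalised to zero
x = np.array([1.0, 0.8])
me = mnl_marginal_effects(x, thetas)       # rows: outcomes, columns: covariates
print(me, me.sum(axis=0))                  # each column sums to zero
```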

G023. III
Example

• We analyze here the choice of dwelling: house, apartment or


low cost flat, the latter being the baseline choice. We include as
explanatory variables the age, sex and log income of the head
of household:

Variable Estimate Std. Err. Marginal Effect


Choice of House
age .0118092 .0103547 -0.002
sex -.3057774 .2493981 -0.007
log income 1.382504 .1794587 0.18
constant -10.17516 1.498192
Choice of Apartment
age .0682479 .0151806 0.005
sex -.89881 .399947 -0.05
log income 1.618621 .2857743 0.05
constant -15.90391 2.483205

G023. III
Ordered Models

• In the multinomial logit, the choices were not ordered. For


instance, we cannot rank cars, buses or trains in a meaningful
way. In some instances, we have a natural ordering of the out-
comes even if we cannot express them as a continuous variable:
– Yes / Somehow / No.
– Low / Medium / High
• We can analyze these answers with ordered models.

G023. III
Ordered Probit

• We code the answers by arbitrarily assigning values:

Y_i = 0 if No,  Y_i = 1 if Somehow,  Y_i = 2 if Yes

• We define a latent variable Y_i^* which is linked to the explana-
tory variables:

Y_i^* = X_i'\theta + \varepsilon_i

Y_i = 0  if  Y_i^* < 0
Y_i = 1  if  Y_i^* \in [0, \mu)
Y_i = 2  if  Y_i^* \geq \mu

µ is a threshold and an auxiliary parameter which is estimated
along with θ.
• We assume that εi is distributed normally.
• The probability of each outcome is derived from the normal
cdf:

P(Y_i = 0) = \Phi(-X_i'\theta)
P(Y_i = 1) = \Phi(\mu - X_i'\theta) - \Phi(-X_i'\theta)
P(Y_i = 2) = 1 - \Phi(\mu - X_i'\theta)
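
A small sketch evaluating these three probabilities at hypothetical values of θ, µ and X:

```python
import numpy as np
from scipy.stats import norm

theta = np.array([0.4, -0.2])           # hypothetical coefficients
mu = 1.2                                # hypothetical threshold
x = np.array([1.0, 0.5])                # covariates (including an intercept)
w = x @ theta

p0 = norm.cdf(-w)                       # P(Y = 0)
p1 = norm.cdf(mu - w) - norm.cdf(-w)    # P(Y = 1)
p2 = 1 - norm.cdf(mu - w)               # P(Y = 2)
print(p0, p1, p2, p0 + p1 + p2)         # the three probabilities sum to one
```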

G023. III
Ordered Probit

• Marginal Effects:

\frac{\partial P(Y_i = 0)}{\partial X_i} = -\theta \phi(-X_i'\theta)

\frac{\partial P(Y_i = 1)}{\partial X_i} = \theta \big( \phi(X_i'\theta) - \phi(\mu - X_i'\theta) \big)

\frac{\partial P(Y_i = 2)}{\partial X_i} = \theta \phi(\mu - X_i'\theta)

• Note that if θ > 0, ∂P (Yi = 0)/∂Xi < 0 and ∂P (Yi = 2)/∂Xi >
0:
– If Xi has a positive effect on the latent variable, then by
increasing Xi , fewer individuals will stay in category 0.
– Similarly, more individuals will be in category 2.
– In the intermediate category, the fraction of individuals will
either increase or decrease, depending on the relative size
of the inflow from category 0 and the outflow to category 2.

G023. III
Ordered Probit: Example

• We want to investigate the determinants of health.


• Individuals are asked to report their health status in three cat-
egories: poor, fair or good.
• We estimate an ordered probit and calculate the marginal ef-
fects at the mean of the sample.

Variable Coeff sd. err. Marginal Effects Sample


Poor Fair Good Mean
Age 18-30 -1.09** .031 -.051** -.196** .248** .25
Age 30-50 -.523** .031 -.031** -.109** .141** .32
Age 50-70 -.217** .026 -.013** -.046** .060** .24
Male -.130** .018 -.008** -.028** .037** .48
Income low third .428** .027 .038** .098** -.136** .33
Income medium third .264** .022 .020** .059** -.080** .33
Education low .40** .028 .031** .091** -.122** .43
Education Medium .257** .026 .018** .057** -.076** .37
Year of interview -.028 .018 -.001 -.006 .008 1.9
Household size -.098** .008 -.006** -.021** .028** 2.5
Alcohol consumed .043** .041 .002** .009** -.012** .04
Current smoker .160** .018 .011** .035** -.046** .49
cut1 .3992** .058
cut2 1.477** .059

Age group Proportion


Poor Health Fair Health Good Health
Age 18-30 .01 .08 .90
Age 30-50 .03 .13 .83
Age 50-70 .07 .28 .64
Age 70 + .15 .37 .46

G023. III
Ordered Probit: Example

• Marginal Effects differ by individual characteristics.


• Below, we compare the marginal effects from an ordered probit
and a multinomial logit.

Marginal Effects for Good Health


Variable Ordered X Ordered Multinomial
Probit at mean Probit at X Logit at X
Age 18-30 .248** 1 .375** .403**
Age 30-50 .141** 0 .093** .077**
Age 50-70 .060** 0 .046** .035**
Male .037** 1 .033** .031**
Income low third -.136** 1 -.080** -.066**
Income medium third -.080** 0 -.071** -.067**
Education low -.122** 1 -.077** -.067**
Education Medium -.076** 0 -.069** -.064**
Year of interview .008 1 .006 .003
Household size .028** 2 .023** .020**
Alcohol consumed -.012** 0 -.010** -.011**
Current smoker -.046** 0 -.041** -.038**

G023. III
Models for Count Data

• The methods developed above are useful when we want to


model the occurrence or otherwise of an event. Sometimes
we want to model the number of times an event occurs. In
general it might be any nonnegative integer. Count data are
being used increasingly in econometrics.
• An interesting application is to the modelling of the returns to
R&D investment in which data on numbers of patents filed in a
series of years by a sample of companies is studied and related
to data on R&D investments.
• Binomial and Poisson probability models provide common start-
ing points in the development of count data models.
• If Z1 , . . . , Zm are identically and independently distributed bi-
nary random variables with P [Zi = 1] = p, P [Zi = 0] = 1 − p,
then the sum of the Zi ’s has a Binomial distribution,

Y = \sum_{i=1}^m Z_i \sim Bi(m, p)

and

P[Y = j] = \frac{m!}{j!(m - j)!} p^j (1 - p)^{m - j}, \qquad j \in \{0, 1, 2, \dots, m\}

G023. III
Models for Count Data

• As m becomes large, m^{1/2}(m^{-1}Y - p) becomes approximately
normally distributed, N(0, p(1 - p)), and as m becomes large
while mp = λ remains constant, Y comes to have a Poisson
distribution,

Y \sim Po(\lambda)

and

P[Y = j] = \frac{\lambda^j}{j!} \exp(-\lambda), \qquad j \in \{0, 1, 2, \dots\}.

• In each case letting p or λ be functions of covariates creates


a model for the conditional distribution of a count of events
given covariate values.
• The Poisson model is much more widely used, in part because
there is no need to specify or estimate the parameter m.
• In the application to R&D investment one might imagine that
a firm seeds a large number of research projects in a period
of time, each of which has only a small probability of produc-
ing a patent. This is consonant with the Poisson probability
model but note that one might be concerned about the under-
lying assumption of independence across projects built into the
Poisson model.

G023. III
Models for Count Data

• The estimation of the model proceeds by maximum likelihood.
The Poisson model is used as an example. Suppose that we
specify a single index model:

P[Y_i = y_i | x_i] = \frac{\lambda(x_i'\theta)^{y_i}}{y_i!} \exp(-\lambda(x_i'\theta)), \qquad y_i \in \{0, 1, 2, \dots\}.

• The log likelihood function is

l(\theta, y) = \sum_{i=1}^n \big( y_i \log \lambda(x_i'\theta) - \lambda(x_i'\theta) - \log y_i! \big)

with first derivative

l_\theta(\theta, y) = \sum_{i=1}^n \left( y_i \frac{\lambda_w(x_i'\theta)}{\lambda(x_i'\theta)} - \lambda_w(x_i'\theta) \right) x_i
                    = \sum_{i=1}^n (y_i - \lambda(x_i'\theta)) \frac{\lambda_w(x_i'\theta)}{\lambda(x_i'\theta)} x_i

where λ_w(w) is the derivative of λ(w) with respect to w.

• The MLE satisfies

\sum_{i=1}^n \left( y_i - \lambda(x_i'\hat\theta) \right) \frac{\lambda_w(x_i'\hat\theta)}{\lambda(x_i'\hat\theta)} x_i = 0.

G023. III
Models for Count Data

• The second derivative matrix is

l_{\theta\theta'}(\theta, y) = \sum_{i=1}^n (y_i - \lambda(x_i'\theta)) \left( \frac{\lambda_{ww}(x_i'\theta)}{\lambda(x_i'\theta)} - \left( \frac{\lambda_w(x_i'\theta)}{\lambda(x_i'\theta)} \right)^2 \right) x_i x_i'
                              - \sum_{i=1}^n \frac{\lambda_w(x_i'\theta)^2}{\lambda(x_i'\theta)} x_i x_i'

where, note, the first term has expected value zero. Therefore
the Information Matrix for this conditional Poisson model is

I(\theta) = \sum_{i=1}^n \frac{\lambda_w(x_i'\theta)^2}{\lambda(x_i'\theta)} x_i x_i'.

The limiting distribution of the MLE is (under suitable condi-
tions)

n^{1/2}(\hat\theta - \theta_0) \xrightarrow{d} N(0, V_0),
\qquad V_0 = \operatorname{plim}_{n \to \infty} \left( n^{-1} \sum_{i=1}^n \frac{\lambda_w(x_i'\theta)^2}{\lambda(x_i'\theta)} x_i x_i' \right)^{-1}

and we can make approximate inference about θ0 using

(\hat\theta - \theta_0) \simeq N(0, n^{-1} V_0)

with V0 estimated by

\hat{V}_0 = \left( n^{-1} \sum_{i=1}^n \frac{\lambda_w(x_i'\hat\theta)^2}{\lambda(x_i'\hat\theta)} x_i x_i' \right)^{-1}.

• In applied work a common choice is λ(w) = exp(w), for which

\frac{\lambda_w(w)}{\lambda(w)} = 1, \qquad \frac{\lambda_w(w)^2}{\lambda(w)} = \exp(w).
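
A sketch of Poisson estimation with λ(w) = exp(w) on simulated data (assumptions mine): Newton steps use the score Σ(y_i − λ_i)x_i and the information Σ λ_i x_i x_i', and standard errors come from the inverse information at θ̂.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(x @ theta_true))    # simulated Poisson counts with exponential mean

theta = np.zeros(x.shape[1])
for _ in range(50):                        # Newton-Raphson for the Poisson MLE
    lam = np.exp(x @ theta)
    score = x.T @ (y - lam)                # sum_i (y_i - lambda_i) x_i
    info = (x * lam[:, None]).T @ x        # sum_i lambda_i x_i x_i'
    step = np.linalg.solve(info, score)
    theta = theta + step
    if np.max(np.abs(step)) < 1e-10:
        break

se = np.sqrt(np.diag(np.linalg.inv(info)))  # standard errors from the inverse information
print(theta, se)
```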

G023. III
