
Econ 710 – Economic Statistics and Econometrics II Spring 2018

Limited dependent variables

Introduction

During the next few lectures we will talk about limited dependent variable models, which means
that the outcome variable is restricted in some sense. Specific examples we will talk about are

• binary outcomes, where Y ∈ {0, 1},

• multinomial choice models, where Y ∈ {1, . . . , J}, and

• censored outcomes, where Y ≥ c for some constant c.

So far, you have mainly seen linear models, where the moment conditions are linear functions of
the parameters. Important examples include the linear regression model

Y = X′β + U,

where X is a K-vector of covariates, Y is an observed scalar outcome, and U is an unobserved
random variable. Here the moment conditions are

E[(Y − X′β)X] = 0.

You have also seen models with endogenous regressors, which means that E[UX] ≠ 0, but where
we have an instrument vector Z such that

E[(Y − X′β)Z] = 0.

Linear models are particularly easy to deal with because the estimators are usually available in
closed forms and they are linear functions of Y . Hence, proving properties such as unbiasedness,
consistency, or asymptotic normality is particularly easy. The models we will discuss here are
nonlinear models and some of the analysis will therefore be a little bit more complicated. For
example, many estimators will not be available in closed form, but are defined as solutions to
optimization problems.
In the next few sections we will largely focus on the features of the different models, why they
are useful, and the interpretation of the parameters of interest. We will focus less on estimation,
which is often based on maximizing a likelihood function. In a few weeks, we will talk about general
extremum estimation, which covers most of the limited dependent variable models as special cases.
We will then see under which conditions the estimated parameters are consistent and asymptotically
normally distributed and how we can use these results for inference.
So far, a lot of the econometrics material you have seen was based on Bruce Hansen’s book “Econo-
metrics”. This book covers linear models in much more detail than nonlinear models. Nevertheless,


chapter 20 of this book is a good additional reference for the limited dependent variable models.
The notation I use in my lecture notes is slightly different from that in Bruce's book. First, I follow
the notation of many other textbooks and denote random variables by upper case letters, such
as Y or X, and I denote realizations by lower case letters, such as y and x. Second, I denote
unobserved random variables by U, rather than e or ε. The main reason is that we will start thinking
more about structural interpretations and identification. The difference between X and U is then
simply that X is observed but U is not, and we do not want to think about U as an “error term”
or a “disturbance” or a “residual”. This notation highlights this distinction. More on this later.
Hopefully these differences in the notation will not be too confusing.

Binary choice

In this section, we will discuss binary choice models, where the scalar outcome variable Y can only
take values 0 or 1. These models are often used when agents can only make a binary decision, such
as whether or not to buy a product. In this example, we could set Yi = 1 if person i buys the
product and Yi = 0 if person i does not buy the product. This decision might depend on individual
and product characteristics, which could be contained in a vector of covariates Xi .
Binary outcome models have the feature that

E[Y | X] = P (Y = 1 | X) · 1 + P (Y = 0 | X) · 0
= P (Y = 1 | X).

Hence, modeling the conditional expectation E[Y | X] is the same as modeling the conditional
probability P (Y = 1 | X).
Since we already know the linear regression model, a first idea is to assume that

E[Y | X] = P(Y = 1 | X) = X′β

or, equivalently, that


Y = X′β + U
where E[U | X] = 0. This model is called the linear probability model (LPM). As we know, it is
straightforward to estimate β by OLS. Moreover, the parameter β has the standard interpretation
as a marginal effect. That is, for continuous regressors X we have
∂E[Y | X]/∂X = ∂P(Y = 1 | X)/∂X = β.

One main shortcoming of the LPM is that the estimated conditional probabilities

P̂(Y = 1 | X = x) = x′β̂,

where β̂ is the OLS estimator, are not necessarily between 0 and 1. For example, if a regressor can
take any value in (−∞, ∞) and if the corresponding slope coefficient is not equal to 0, then x′β will


be outside the unit interval for large positive and negative values of the regressor. The problem
in this example is that a linear model for the conditional probability is naturally misspecified.
Even though this model might still be useful for a descriptive analysis, especially if the estimated
probabilities are all between 0 and 1, it is much more common to use a nonlinear model with binary
outcomes.
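To make this point concrete, here is a small simulated illustration (not part of the original notes; the data-generating process and all numbers are made up): an LPM fitted by OLS produces some fitted "probabilities" outside the unit interval.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)                     # a continuous regressor with unbounded support
p = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))     # true conditional probability (logistic here)
y = rng.binomial(1, p)                     # binary outcome

X = np.column_stack([np.ones(n), x])       # add an intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS = linear probability model
fitted = X @ beta_hat

print("share of fitted 'probabilities' outside [0, 1]:",
      np.mean((fitted < 0) | (fitted > 1)))
```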

Nonlinear models

As we just discussed, the main drawback of the LPM is that the estimated probabilities are not
necessarily between 0 and 1. Therefore, one idea is to simply write

P(Y = 1 | X) = F(X′β),

where F : R → [0, 1] is a known function, which is strictly increasing and symmetric about 0. By
symmetric about 0 we mean that F (u) = 1 − F (−u). A class of functions with these properties
is the class of distribution functions of continuously distributed random variables on R with a
symmetric distribution. While symmetry is a useful property for some of the calculations below,
none of the main results rely on it.
One specific example of such a function F is the cdf of a normally distributed random variable with
a mean of 0 and a variance of 1. With this choice, the model is called the probit model. When F
is the cdf corresponding to the standard logistic distribution (with a location parameter of 0 and a
scale parameter of 1), the model is called the logit model.
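As a minimal sketch of these two choices of F (using scipy, which is my choice of tool rather than anything prescribed by the notes), note that both are strictly increasing, map into (0, 1), and satisfy the symmetry property F(u) = 1 − F(−u):

```python
import numpy as np
from scipy.stats import norm, logistic

u = np.linspace(-3, 3, 7)

F_probit = norm.cdf(u)        # probit: standard normal cdf
F_logit = logistic.cdf(u)     # logit: standard logistic cdf, 1 / (1 + exp(-u))

# both links are symmetric about 0: F(u) = 1 - F(-u)
assert np.allclose(F_probit, 1 - norm.cdf(-u))
assert np.allclose(F_logit, 1 - logistic.cdf(-u))
```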
One way of interpreting the model is to write

Y = 1(X′β + U ≥ 0),

where U is a random variable, which is independent of X and has a distribution function F . Here
1(·) denotes the indicator function, which is 1 if the argument is true and 0 otherwise.
Notice that since F is symmetric about 0, the distribution function of −U is also F . Therefore

E[Y | X = x] = P(Y = 1 | X = x)
= P(X′β + U ≥ 0 | X = x)
= P(x′β ≥ −U | X = x)
= P(x′β ≥ −U)
= F(x′β).

Interpretation

We have seen above that in the LPM with continuous regressors


∂P(Y = 1 | X)/∂X = β


and therefore, we can interpret β as marginal effects. This is not true in the nonlinear models
because we now have
∂P(Y = 1 | X)/∂X = f(X′β) · β,
where f is the derivative of F . Hence, the marginal effects are not equal to β and the marginal
effects depend on the value of X. However, notice that since f (x) > 0 for all x, the signs of the
elements of β are the same as the signs of the marginal effects. Finally, the ratios of the elements
of β equal the ratios of the marginal effects:
(∂P(Y = 1 | X)/∂X1) / (∂P(Y = 1 | X)/∂X2) = (f(X′β) · β1) / (f(X′β) · β2) = β1/β2.

To conclude, even though β contains some useful information, it is of limited interest on its own.
But if we knew β, we could easily calculate objects of interest, such as marginal effects
∂P(Y = 1 | X)/∂X = f(X′β) · β
as well as conditional probabilities

P(Y = 1 | X) = F(X′β).
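For concreteness, a small probit sketch of these plug-in calculations (the values of β and x below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

beta = np.array([0.2, -0.5, 1.0])   # hypothetical parameter vector
x = np.array([1.0, 2.0, 0.5])       # hypothetical covariate value (first entry = intercept)

index = x @ beta
prob = norm.cdf(index)              # P(Y = 1 | X = x) = F(x'beta)
marg_eff = norm.pdf(index) * beta   # marginal effects f(x'beta) * beta

print("conditional probability:", prob)
print("marginal effects:", marg_eff)
# ratios of marginal effects equal ratios of coefficients
print(marg_eff[1] / marg_eff[2], beta[1] / beta[2])
```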

Estimation

We can estimate β by maximum likelihood. To do so, recall that

P(Y = 1 | X = x) = F(x′β)

and thus,
P(Y = 0 | X = x) = 1 − F(x′β).
Conditional on X = x, Y has a Bernoulli distribution and therefore, the conditional probability
mass function is
fY|X(y | x) = F(x′β)^y (1 − F(x′β))^{1−y},
where y ∈ {0, 1}.
Hence, the log-likelihood function is
log L(β) = Σ_{i=1}^n log fY|X(Yi | Xi)
= Σ_{i=1}^n log[ F(Xi′β)^{Yi} (1 − F(Xi′β))^{1−Yi} ]
= Σ_{i=1}^n [ Yi log F(Xi′β) + (1 − Yi) log(1 − F(Xi′β)) ]
= Σ_{Yi=1} log F(Xi′β) + Σ_{Yi=0} log(1 − F(Xi′β)).


Maximizing the likelihood with respect to β yields the maximum likelihood estimator, denoted by
β̂ML. Assuming that the model is correctly specified, and under standard regularity conditions,
the resulting estimator β̂ML is a consistent estimator of β and √n(β̂ML − β) is asymptotically
normally distributed. Hence, you can calculate standard errors, obtain confidence intervals, and
test hypotheses just as before. However, be careful, because β might not be the parameter of
interest!
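As a sketch of how β̂ML could be computed in practice, the following minimizes the negative of the log-likelihood above for the probit case on simulated data (all names and numbers are illustrative, and scipy's optimizer is just one possible choice):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n, K = 2000, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta_true = np.array([0.5, -1.0])
Y = (X @ beta_true + rng.normal(size=n) >= 0).astype(float)   # probit data

def neg_loglik(beta):
    # -log L(beta) = -sum_i [ Y_i log F(X_i'beta) + (1 - Y_i) log(1 - F(X_i'beta)) ]
    F = norm.cdf(X @ beta)
    F = np.clip(F, 1e-10, 1 - 1e-10)          # guard against log(0)
    return -np.sum(Y * np.log(F) + (1 - Y) * np.log(1 - F))

res = minimize(neg_loglik, x0=np.zeros(K), method="BFGS")
print("beta_ML:", res.x)
```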
We can also estimate conditional choice probabilities

F(x′β) by F(x′β̂ML),

marginal effects for a fixed value of x

f(x′β)β by f(x′β̂ML)β̂ML

or average marginal effects

∫ f(x′β)β dFX(x) by (1/n) Σ_{i=1}^n f(Xi′β̂ML)β̂ML,

where FX denotes the distribution of X, so the estimator averages the estimated marginal effects
over the observed values of the covariates.

These estimated parameters are also asymptotically normally distributed, which follows from an
application of the delta method. Recall that if
√n(β̂ML − β) →d N(0, V),

then for any continuously differentiable function g : R^K → R,

√n(g(β̂ML) − g(β)) →d N(0, ∇g(β)′ V ∇g(β)),

where ∇g(β) is the gradient of g at β.
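As an illustration, here is a sketch of a delta-method standard error for the predicted probability g(β) = Φ(x′β) in a probit model, where ∇g(β) = φ(x′β)·x; the inputs β̂, V̂, and x below are hypothetical placeholders rather than the output of any particular estimation:

```python
import numpy as np
from scipy.stats import norm

def delta_method_se(beta_hat, V_hat, x, n):
    """SE of g(beta_hat) = Phi(x'beta_hat) when sqrt(n)(beta_hat - beta) -> N(0, V)."""
    index = x @ beta_hat
    grad = norm.pdf(index) * x      # gradient of Phi(x'beta) with respect to beta
    avar = grad @ V_hat @ grad      # asymptotic variance of sqrt(n)(g(beta_hat) - g(beta))
    return np.sqrt(avar / n)

# hypothetical inputs, for illustration only
beta_hat = np.array([0.5, -1.0])
V_hat = np.array([[2.0, 0.3], [0.3, 1.5]])
x = np.array([1.0, 0.2])
print("estimate:", norm.cdf(x @ beta_hat),
      "se:", delta_method_se(beta_hat, V_hat, x, n=2000))
```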


In a few weeks we will talk about general extremum estimation, which covers most of the limited
dependent variable models as special cases. We will then see under which conditions the estimated
parameters are consistent and asymptotically normally distributed and how we can use these results
for inference.

Identification issues and normalizations

You might think that using a known function F is restrictive. This is definitely true and it is also
restrictive to assume linearity inside the function F . One idea to generalize the probit model

Y = 1(X′β + U ≥ 0), U ∼ N(0, 1),

is to instead assume that U ∼ N(µ, σ²). We could then try to estimate µ and σ² along with β.


To simplify the analysis, suppose that X = (1, X̃), where X̃ is a scalar, and write β = (β1, β2)′.
Also suppose that U ∼ N(µ, σ²). Then

Y = 1(β1 + X̃ · β2 + U ≥ 0)
= 1((β1 + µ)/σ + X̃ · (β2/σ) + (U − µ)/σ ≥ 0).

Now let
β̃1 = (β1 + µ)/σ,
β̃2 = β2/σ,
Ũ = (U − µ)/σ.
Then

Y = 1(X′β + U ≥ 0)
= 1(X′β̃ + Ũ ≥ 0),

where Ũ ∼ N (0, 1). Since both models yield the exact same distribution of (Y, X), there is no
way that we can distinguish between the two models using our data (which is just a sample from
the distribution of (Y, X)). In particular, the data might be generated from a model with β and
U ∼ N(µ, σ²) or from a model with β̃ and Ũ ∼ N(0, 1). There is no way for us to tell them apart
even if we knew the joint distribution of Y and X. We therefore say that the two models are
observationally equivalent and that (β, µ, σ²) is not identified without additional assumptions. We
will talk more about observational equivalence and identification later.
For now, since the two models with (1) β and U ∼ N(µ, σ²) and (2) β̃ and Ũ ∼ N(0, 1) yield the
same distribution of the data, they also imply the same conditional probabilities P(Y = 1 | X) and
the same marginal effects. Thus, if these parameters are the objects we mainly care about, we can
use the normalization µ = 0 and σ² = 1. Again, a different normalization, such as µ = −5 and
σ² = 7, will give us different estimates of β, but identical estimated conditional probabilities and
marginal effects! These normalizations are another reason why we are not primarily interested in
β, but rather conditional probabilities and marginal effects, which are invariant to these types of
normalizations.
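A quick numerical check of this observational equivalence (with purely illustrative parameter values), using β̃1 = (β1 + µ)/σ and β̃2 = β2/σ as above:

```python
import numpy as np
from scipy.stats import norm

# hypothetical original parameters: Y = 1(beta1 + X*beta2 + U >= 0), U ~ N(mu, sigma^2)
beta1, beta2, mu, sigma = 0.4, -0.8, -5.0, np.sqrt(7.0)

# reparameterized model with U_tilde ~ N(0, 1)
beta1_t = (beta1 + mu) / sigma
beta2_t = beta2 / sigma

x_grid = np.linspace(-3, 3, 13)
p_original = norm.cdf((beta1 + x_grid * beta2 + mu) / sigma)   # P(Y = 1 | X = x) in model (1)
p_normalized = norm.cdf(beta1_t + x_grid * beta2_t)            # P(Y = 1 | X = x) in model (2)

assert np.allclose(p_original, p_normalized)   # identical conditional probabilities
```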

Random utility models

We will now derive probit and logit models for the conditional probabilities P(Y = 1 | X) from a simple
underlying economic model of decision making. These models are useful to justify probit and logit
models from economic theory, and this way of thinking about choice models will be useful when we
talk about multinomial choice later.


Suppose people have to make a choice between two options, 0 and 1. Individual i receives utility
Vij from alternative j. We assume that all individuals maximize utility. That is, they choose
alternative 1 if Vi1 > Vi0 and they choose alternative 0 if Vi0 > Vi1 . Don’t worry about what
happens when Vi1 = Vi0 for now. We observe Yi = 1 if alternative 1 was chosen and 0 otherwise.
Thus
Yi = 1(Vi1 > Vi0 ).

We will now impose additional assumptions on the utility Vij . Specifically, we will assume that

Vij = Xi′βj + Uij.

Here Xi is a K−vector of observable characteristics. These characteristics may differ for different
individuals. For example Xi could contain the age of person i. Also notice that Xi could contain
characteristics of both options (e.g. the price of products). Since βj can be different for the two
different options, the characteristics of option 1 might only affect the utility of option 1, and vice
versa. Uij is the unobserved part of the utility. We will think about Xi′βj as the deterministic part
of the utility, because Xi is observed and βj is not random, while Uij is the random part.
With this additional structure we get

Yi = 1(Vi1 > Vi0)
= 1(Xi′β1 + Ui1 > Xi′β0 + Ui0)
= 1(Xi′(β1 − β0) > Ui0 − Ui1).

Now assume that Xi ⊥⊥ (Ui0, Ui1). Then

P(Yi = 1 | Xi = x) = P(Xi′(β1 − β0) > Ui0 − Ui1 | Xi = x)
= P(x′(β1 − β0) > Ui0 − Ui1)
= FUi0−Ui1(x′(β1 − β0)),

where FUi0 −Ui1 is the distribution function of Ui0 − Ui1 . The probability

P (Yi = 1 | Xi = x)

is called the choice probability. It denotes the percentage of people with covariates x who choose
option 1. We showed that this choice probability equals the cdf of Ui0 − Ui1 evaluated at the
difference in the deterministic components. First, note that from our data we can only learn about
the difference β1 − β0 and not about β1 and β0 separately. The reason is that only differences in
utility matter for decision making. If we increase the utility of both options by some fixed constant,
then behavior does not change.
Similarly, as discussed above, we have to normalize parts of the distribution of Ui0 − Ui1 because
different combinations of distribution functions and parameter vectors are observationally equiva-
lent. Again, these normalizations affect the estimated difference β1 − β0 , but they do not affect the
estimated choice probabilities or the estimated marginal effects.


Now we could assume that

(Ui0, Ui1)′ ∼ N( (µ0, µ1)′ , [[σ0², σ01], [σ01, σ1²]] ).

Then

Ui0 − Ui1 ∼ N(µ0 − µ1, σ0² + σ1² − 2σ01).

Normalizing

µ0 − µ1 = 0 and σ0² + σ1² − 2σ01 = 1

yields a probit model for the conditional choice probabilities.


Another approach is to assume that Ui0 and Ui1 are independent and have an extreme value type
1 (EVT1) distribution. In particular, if U has an EVT1 distribution, then

P(U ≤ u) = exp(−exp(−u)).

The reason why this is a useful assumption is that if Ui0 and Ui1 are independent and have an
EVT1 distribution, then Ui0 − Ui1 has a standard logistic distribution. Hence, we get a logit model
for the conditional choice probabilities.
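This fact is easy to verify by simulation; the following sketch (not from the notes) draws two independent EVT1 (Gumbel) samples and compares the empirical cdf of their difference with the standard logistic cdf:

```python
import numpy as np
from scipy.stats import logistic

rng = np.random.default_rng(2)
n = 200_000
u0 = rng.gumbel(loc=0.0, scale=1.0, size=n)   # EVT1: P(U <= u) = exp(-exp(-u))
u1 = rng.gumbel(loc=0.0, scale=1.0, size=n)
diff = u0 - u1

grid = np.linspace(-4, 4, 9)
empirical_cdf = np.array([np.mean(diff <= g) for g in grid])
print(np.round(empirical_cdf, 3))
print(np.round(logistic.cdf(grid), 3))        # should be close to the empirical cdf
```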
Two more remarks. First, notice that in both of these cases P (Vi1 = Vi0 | Xi ) = 0 and therefore,
we can ignore ties. Second, if Xij was different for different choices, we could estimate β1 and β0
separately. For example, we might observe the price of both product 0 and product 1 and we might
want to assume that the price of product 1 only affects utility of product 1 and vice versa. Then
Xij could be the price consumer i faced for product j. In this case we get

P(Yi = 1 | Xi0 = x0, Xi1 = x1) = FUi0−Ui1(x1′β1 − x0′β0).

Hence, changes in x1, while holding x0 constant, tell us about β1.

Multinomial logit

Previously we looked at binary response models. We now move on to the case where there are
J unordered values that the outcome variable Y may take. For the most part, we will continue
thinking within the discrete choice framework, although much of this can be applied to cases where
Y is not a variable chosen by some agent.
The traditional theory of demand starts out with a set of goods, say, 1, . . . , J, and then assumes
each person i has a utility function Ui (c1 , . . . , cJ ), where cj is the quantity of good j consumed. A
major problem with this approach is: how do we think about the demand for a new good? This
is precisely the problem faced by the local government in the San Francisco Bay area in the 1960’s
and early 1970’s as they planned a new heavy rail system, BART, the Bay Area Rapid Transit.
This rail line would be a new option for people commuting to work. Before it opened, people could
(1) drive alone, (2) carpool, or (3) take the bus. If we think about the demand for transportation,


the classical approach says that people have utility functions for each of these three options. But
what does that tell us about what their demand will be for BART, once it opens?
To answer this question, we will think about each good as a bundle of characteristics. In the
transportation choice question, we do not think of each travel option j as something like “drive
alone”, but rather as a vector of characteristics, such as how much it costs, how long it takes,
how long it requires you to walk, etc. We then suppose that people have preferences over these
characteristics, rather than over the goods directly. Their preferences over the characteristics then
define what their utility for any given good j with characteristics xj will be. A new good then
is just a new bundle of characteristics. Consequently, once we know consumer preferences over
characteristics, predicting demand for a new good is as simple as adding a new option to the choice
set.

Random utility models

Before we considered a random utility model for the choice between two options. That framework
extends immediately to the choice between J options. Each individual i receives utility Vij from
option j. We decompose this utility into two pieces: a deterministic piece Xij′βj and an unobservable
piece Uij so that
Vij = Xij′βj + Uij.

Notice that here, we allow Xij to vary across j. For example, rather than including the price of each
product as a separate characteristic, we might want to directly assume that the price of product j
only affects Vij , but not Vik for k 6= j. We still allow βj to be different for different options, but it
can also be useful to restrict this further. Whether or not we will be able to estimate all parameters
or only differences of parameters will depend on whether we have sufficient variation of Xij across
j, as in the binary choice model. We will not discuss this issue in any detail for now, but
in an application you definitely want to revisit this issue and think about which parameters you
can estimate and which ones you cannot.
Similar to the binary choice model, we now assume that agent i chooses exactly one of the J
products and the agent chooses the product which maximizes her utility. That is

Yij = 1 if and only if Vij > Vik for all k ≠ j.

Let Xi = (Xi1 , . . . , XiJ ). Then

P(Yij = 1 | Xi = x) = P(Vij > Vik for all k ≠ j | Xi = x)
= P(Xij′βj + Uij > Xik′βk + Uik for all k ≠ j | Xi = x)
= P(Xij′βj − Xik′βk > Uik − Uij for all k ≠ j | Xi = x).

With most parametric assumptions on the distribution of (Ui1 , . . . , UiJ ) we would get a compli-
cated expression for the choice probabilities and calculating them in practice can be numerically


complicated. However, if we assume that Uij are iid EVT1 distributed and independent of Xi , then
after some algebra, it can be shown that

P(Yij = 1 | Xi = x) = exp(Xij′βj) / Σ_{k=1}^J exp(Xik′βk).

This is extremely useful because the log-likelihood is simply

Σ_{i=1}^n Σ_{j=1}^J 1(Yij = 1) · log(P(Yij = 1 | Xi)),

which we can easily maximize to get estimates for the parameters, given the closed form expression
for the choice probabilities.
Coming back to the original question: what is the choice probability of a new product? To answer
this question assume that βj is the same for all products. That is, βj = β for all j = 1, . . . , J. As
an example, suppose Xij is the price consumer i faces for product j. Then βj = β means that if
we increase the price of all products by 1, then consumers would make the same choices. With this
assumption we can use an existing set of J options and estimate β. The choice probability of a
new product, say J + 1, is then simply

P(Yi(J+1) = 1 | Xi = x) = exp(Xi(J+1)′β) / Σ_{k=1}^{J+1} exp(Xik′β).
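A small sketch of these choice probabilities with a common coefficient vector β (all numbers below are hypothetical): the softmax formula gives the shares of the existing options, and predicting the share of a new option simply means adding one more row of characteristics.

```python
import numpy as np

def choice_probs(X_alt, beta):
    """Multinomial logit probabilities: X_alt is (J, K), one row of characteristics per option."""
    v = X_alt @ beta                       # deterministic utilities X_ij' beta
    e = np.exp(v - v.max())                # subtract max for numerical stability
    return e / e.sum()

beta = np.array([-1.5, 0.8])               # hypothetical coefficients on price and on comfort
X_existing = np.array([[1.0, 0.2],         # drive alone
                       [0.6, 0.1],         # carpool
                       [0.4, 0.0]])        # bus
print("shares before:", choice_probs(X_existing, beta))

X_new = np.vstack([X_existing, [0.7, 0.5]])   # add BART as a new bundle of characteristics
print("shares after: ", choice_probs(X_new, beta))
```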

Problems and extensions

A very nice feature of the multinomial logit model is that we get a simple closed form expression
for the choice probabilities. It is therefore possible to estimate the parameters even if J is very
large.
One problem with the multinomial logit model is that it can yield strange substitution patterns.
The specific problem we discuss below by means of a simple example is usually referred to as
independence of irrelevant alternatives (IIA). A famous example is the following. Suppose that
there are two ways to get to work: by car or by blue bus (bb). Suppose that currently their
market shares are 50% each. That is,

P(Yi = blue bus | Xi(bb), Xi(car)) = exp(Xi(bb)′β) / (exp(Xi(bb)′β) + exp(Xi(car)′β)) = 1/2

and

P(Yi = car | Xi(bb), Xi(car)) = exp(Xi(car)′β) / (exp(Xi(bb)′β) + exp(Xi(car)′β)) = 1/2.
Here Xi(bb) contains the observed characteristics associated with taking the blue bus that affect the
utility of agent i (such as the price, the time it takes etc.). Likewise Xi(car) contains the observed


characteristics associated with taking the car that affect the utility of agent i. Notice that the two
equations above imply that

1 = P(Yi = blue bus) / P(Yi = car) = exp(Xi(bb)′β) / exp(Xi(car)′β).

Now suppose that we give people a third option: they can now take a red bus (rb) to work if they
want. What substitution patterns do we expect? Well, probably people do not care about what
color bus they take. They only care about whether it is a car or a bus. So we might expect that
people who took the car will still take the car, even with this additional option. That is,

P(Yi = car | Xi(bb), Xi(car), Xi(rb)) = 1/2,
where Xi(rb) contains the observed characteristics associated with taking the red bus.
However, there is a problem. Notice that if people do not care about the color of the bus, then

P (Yi = blue bus | Xi(bb) , Xi(car) , Xi(rb) ) = P (Yi = red bus | Xi(bb) , Xi(car) , Xi(rb) ).

Given our logit expression, this implies that

exp(Xi(bb)′β) = exp(Xi(rb)′β).

But we also assumed before that

exp(Xi(bb)′β) = exp(Xi(car)′β),

which then implies that (conditional on Xi(bb), Xi(car), Xi(rb))

P(Yi = car) = P(Yi = blue bus) = P(Yi = red bus) = 1/3.
Thus, the logit model predicts that once the red bus is introduced, people shift away from the car
and the blue bus equally to the red bus. This is completely contrary to what we think should
happen. This illustrates an important fact about the multinomial logit model: it places strong
restrictions on the substitution patterns. Indeed, continuing the logic above, we can introduce a
rainbow fleet of buses and thereby drive the market share of cars to zero!
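The red bus / blue bus logic is easy to reproduce numerically; in the sketch below (hypothetical utilities), the blue bus and the car have equal deterministic utility and the red bus is an exact copy of the blue bus, so the logit shares move from 1/2 each to 1/3 each.

```python
import numpy as np

def logit_shares(v):
    e = np.exp(v - v.max())
    return e / e.sum()

v_car, v_blue_bus = 1.0, 1.0               # equal deterministic utilities -> 50/50 split
print(logit_shares(np.array([v_car, v_blue_bus])))              # [0.5, 0.5]

v_red_bus = v_blue_bus                      # red bus identical to blue bus
print(logit_shares(np.array([v_car, v_blue_bus, v_red_bus])))   # [1/3, 1/3, 1/3]
```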
There are alternative and more complicated models, which yield more reasonable substitution
patterns. A popular example is the so called nested logit model. The basic idea is that we put each
product in one of several nests. For example, one nest could be carpool and driving, and another
nest could be blue bus and red bus. If we then add a new option, say a green bus, we could add it
to one of the nests. We then get the IIA property within the nest: for example, people shift away
from the blue bus and the red bus equally to the green bus. However, substitution across nests is
more flexible, so the total market share of the car nest need not be much affected by the new option.
Another example is the so called random coefficients logit model, which is extremely popular in IO
(and in this context often referred to as the BLP model). The basic idea is that each consumer i


gets her own coefficient in the utility function. That is β becomes βi (or βj becomes βij ). Since we
then have as many coefficients as observations, we can only estimate the distribution of βi rather
than each individual coefficient.
We will not discuss these models in more detail here, but you will probably come across them in
second year classes. We will talk a little bit about random coefficients more generally towards the
end of the semester.

Censored data

In this section we will discuss censored outcomes, which means that our outcome variable Y is only
observed if Y ≥ c and we observe c otherwise. We can assume without loss of generality that c = 0
as we could always transform the outcome to Y − c ≥ 0. It also does not matter if the outcome is
bounded from above or below as we could always look at −Y. Thus, we assume that

Yi = Yi∗ if Yi∗ ≥ 0 and Yi = 0 if Yi∗ < 0.

We will also impose assumptions on Yi∗ , namely that

Yi∗ = Xi′β + Ui,   E(Ui | Xi) = 0.

Hence, we have a standard regression model, except that we do not observe all outcomes. We
do, however, observe all the regressors of all individuals and we also know if Yi∗ ≥ 0. A standard
example of a censored variable is wages, which might be top coded. Another example is wage
data, where wages are only observed if people work. This example fits better into the selection models
discussed in the next section.
There are also models with truncated outcomes. The difference between censored and truncated
outcomes is that with truncated outcomes, we have no information on individuals with Yi∗ <
0. These observations are simply missing. Hence with truncated outcomes, we cannot estimate
P (Yi∗ ≥ 0 | Xi ), which will play a key role in estimating models with censored outcomes. We will
not consider truncated outcomes in this class.
Simply regressing Yi on Xi to estimate β does not yield a consistent estimator as figure 1 illustrates.
If we were able to use all the observations on Yi∗ and Xi (the latent unobserved outcomes and the
observed regressors), we would get the gray line. Using only the observed outcomes yields the blue
line, which underestimates the slope coefficient.
Without assumptions on the distribution of Ui , it is very hard to get a consistent estimator of β, and
it would require assumptions on the support of the regressors. Instead, we use a more traditional
approach and we will assume that Ui ⊥⊥ Xi and that Ui ∼ N(0, σ²). Notice that, as opposed to the
probit model, we do not have to standardize the variance to 1. We can then estimate all parameters
using a maximum likelihood approach.


Figure 1: Regression with censored data

Before we derive the likelihood, it is useful to think about a two step procedure. First, notice that

P(Yi > 0 | Xi = x) = P(Yi∗ > 0 | Xi = x)
= P(Xi′β + Ui > 0 | Xi = x)
= P(Xi′β > −Ui | Xi = x)
= P(Xi′β/σ > −Ui/σ | Xi = x)
= Φ(x′β/σ),
where Φ is the cdf of a standard normal random variable. Hence, just as in the probit model, we
can estimate β/σ.
For the second step consider

E[Yi | Yi > 0, Xi] = E[Yi∗ | Yi∗ > 0, Xi]
= E[Xi′β + Ui | Xi′β + Ui > 0, Xi]
= Xi′β + E[Ui | Ui > −Xi′β, Xi]
= Xi′β + σ E[Ui/σ | Ui/σ > −Xi′β/σ, Xi].
It turns out that for a standard normally distributed random variable Z,

E[Z | Z > a] = φ(a) / (1 − Φ(a)),

where φ is the standard normal pdf. Define

λ(a) = φ(a)/Φ(a) = φ(−a)/(1 − Φ(−a)).


The function λ(·) is called the inverse Mills ratio. Then

E[Yi | Yi > 0, Xi] = Xi′β + σλ(Xi′β/σ).
But if we already know, or have an estimate of, β/σ from the first step, we can use all observations
with Yi > 0 and regress Yi on Xi and λ(Xi′β/σ), with β/σ replaced by its first-step estimate, to get
estimates of β and σ.
We can estimate all parameters in one step by writing down the likelihood. First, we have obser-
vations with Yi = 0 and

P(Yi = 0 | Xi = x) = 1 − P(Yi > 0 | Xi = x) = 1 − Φ(x′β/σ) = Φ(−x′β/σ).

Moreover, for all y > 0

P(Yi ≤ y | Xi = x) = P(Yi∗ ≤ y | Xi = x)
= P(Xi′β + Ui ≤ y | Xi = x)
= P(Ui/σ ≤ (y − Xi′β)/σ | Xi = x)
= Φ((y − x′β)/σ).

It follows that

∂P(Yi ≤ y | Xi = x)/∂y = (1/σ) φ((y − x′β)/σ).
The log-likelihood is therefore

log L(β, σ) = Σ_{i=1}^n log[ Φ(−Xi′β/σ)^{1(Yi=0)} · ((1/σ) φ((Yi − Xi′β)/σ))^{1(Yi>0)} ]
= Σ_{Yi=0} log Φ(−Xi′β/σ) + Σ_{Yi>0} log((1/σ) φ((Yi − Xi′β)/σ)).

Maximizing the likelihood with respect to β and σ yields consistent and asymptotically normally
distributed estimators.
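A minimal sketch of maximizing this censored-regression (Tobit) likelihood on simulated data (all names and values are illustrative; the variance is parameterized through log σ so the optimizer works on an unrestricted scale):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([1.0, 2.0]), 1.5
y_star = X @ beta_true + sigma_true * rng.normal(size=n)   # latent outcome
Y = np.maximum(y_star, 0.0)                                 # censoring at 0

def neg_loglik(theta):
    beta, sigma = theta[:-1], np.exp(theta[-1])             # last parameter is log(sigma)
    index = X @ beta
    censored = Y == 0
    ll = np.sum(norm.logcdf(-index[censored] / sigma))      # Y_i = 0 terms: log Phi(-X_i'beta/sigma)
    ll += np.sum(norm.logpdf((Y[~censored] - index[~censored]) / sigma)
                 - np.log(sigma))                           # Y_i > 0 terms
    return -ll

res = minimize(neg_loglik, x0=np.zeros(X.shape[1] + 1), method="BFGS")
print("beta_hat:", res.x[:-1], "sigma_hat:", np.exp(res.x[-1]))
```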

Sample selection

In this last section we will talk about sample selection. A good example is wage data, where we
only observe wages for people who work. However, not everyone has the same reservation wage
and hence, we only observe Yi if
Yi ≥ Ri .

If everyone had the same reservation wage, then we would be in the censored outcome framework
discussed in the previous section. Here, on the other hand, we now allow for a different cutoff for


different people and Ri is unknown. We do observe whether or not somebody works. In summary,
we observe
Yi if Yi ≥ Ri

as well as
1(Yi ≥ Ri ) and Xi

for each individual in the sample.


We will now impose some additional structure. We will start by assuming that the outcome Yi and
the unobserved variable Ri are linear functions of Xi . That is

Yi = Xi′β + Ui

and
Ri = Xi′δ + Vi.

Now define

Ti = 1(Yi ≥ Ri)
= 1(Xi′β + Ui ≥ Xi′δ + Vi)
= 1(Xi′(β − δ) ≥ Vi − Ui).

This equation is called the selection equation. The model described above is called the selection
model. The idea is that people are purposefully choosing whether to work, and hence they select
into the workforce. While the model discussed here is traditionally called ‘the’ selection model, this
idea of selection is very broad and goes far beyond this specific model.
More generally, we can allow for different regressors in the outcome equation

Yi = Xi′β + Ui

and in the selection equation


Ti = 1(Xi′(β − δ) ≥ Vi − Ui).

Hence, we will consider the more general model based on the two equations

Yi = Xi′β + Ui
Ti = 1(Zi′γ ≥ Wi).

We always observe Ti , Xi , and Zi and we observe Yi if Ti = 1.


We will also impose restrictions on the joint distribution of Ui and Wi, namely that (Ui, Wi) ⊥⊥
(Xi, Zi) and

(Ui, Wi)′ ∼ N( (0, 0)′ , [[σu², σuw], [σuw, 1]] ).


We normalize the means of Ui and Wi to 0 because Xi and Zi may contain constants and we
cannot estimate the intercepts and means of Ui and Wi separately. Moreover, similarly to the
probit model,

Ti = 1(Zi′γ ≥ Wi)
= 1(Zi′γ/σw ≥ Wi/σw),

and hence we can always standardize Wi to have a variance of 1. Again, this means that the scale
of γ does not have a meaningful interpretation on its own.
Let's now think about how to estimate the parameters. The first idea is to simply use all observed
outcomes and regress Yi on Xi. To see that such a regression would yield a biased estimator of β,
write

E[Yi | Xi = x, Zi = z, Ti = 1] = x′β + E[Ui | Xi = x, Zi = z, Ti = 1]
= x′β + E[Ui | Xi = x, Zi = z, Zi′γ ≥ Wi]
= x′β + E[Ui | z′γ ≥ Wi].
Now it turns out that a property of the bivariate normal distribution is that

E[Ui | z′γ ≥ Wi] = −σuw φ(z′γ)/Φ(z′γ),

where φ and Φ are the pdf and cdf of a standard normal random variable, respectively. Again, let

λ(a) = φ(a)/Φ(a).
Then

E[Yi | Xi = x, Zi = z, Ti = 1] = x′β − σuw λ(z′γ).

Thus, regressing Yi on Xi for all observations with Ti = 1 generally yields a biased estimator.
However, if γ were known, we could simply regress Yi on Xi and λ(Zi′γ) for all observations with
Ti = 1. Even though γ is not known, we have
P(Ti = 1 | Zi = z) = Φ(z′γ)
and we can therefore estimate γ using our probit estimator. This yields a two-step estimator:

1. Estimate γ using a probit model and obtain γ̂.

2. Regress Yi on Xi and λ(Zi′γ̂) for all observations with Ti = 1.

This estimator is known as the Heckman two-stage estimator or the Heckit estimator. An advantage
of the estimator is that it is very simple to implement. However, the usual OLS standard errors
you get from the second step are incorrect because they do not account for the fact that γ was
estimated in a first step. You can obtain correct standard errors using the generalized method of
moments, which we will discuss in a few weeks. An alternative is to use a maximum likelihood
estimator, which is more complicated to implement.
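To make the two steps concrete, here is a sketch on simulated data (the data-generating values, variable names, and use of scipy are all illustrative choices; the second-step standard errors would still need the correction discussed above):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
n = 5000
x = rng.normal(size=n)
z_extra = rng.normal(size=n)                     # a variable in Z excluded from X
X = np.column_stack([np.ones(n), x])             # outcome-equation regressors
Z = np.column_stack([np.ones(n), x, z_extra])    # selection-equation regressors

beta_true, gamma_true = np.array([1.0, 0.5]), np.array([0.2, 0.3, 1.0])
sigma_u, rho = 1.0, 0.6
W = rng.normal(size=n)                                               # Var(W) normalized to 1
U = sigma_u * (rho * W + np.sqrt(1 - rho**2) * rng.normal(size=n))   # Cov(U, W) = sigma_uw
Y_latent = X @ beta_true + U
T = (Z @ gamma_true >= W).astype(float)          # selection indicator
Y = np.where(T == 1, Y_latent, np.nan)           # Y observed only when T = 1

# Step 1: probit of T on Z to estimate gamma
def probit_neg_loglik(g):
    p = np.clip(norm.cdf(Z @ g), 1e-10, 1 - 1e-10)
    return -np.sum(T * np.log(p) + (1 - T) * np.log(1 - p))

gamma_hat = minimize(probit_neg_loglik, np.zeros(Z.shape[1]), method="BFGS").x

# Step 2: OLS of Y on X and the inverse Mills ratio lambda(Z'gamma_hat), selected sample only
sel = T == 1
mills = norm.pdf(Z[sel] @ gamma_hat) / norm.cdf(Z[sel] @ gamma_hat)
X_aug = np.column_stack([X[sel], mills])
coef, *_ = np.linalg.lstsq(X_aug, Y[sel], rcond=None)
print("beta_hat:", coef[:-1], "coefficient on lambda (about -sigma_uw):", coef[-1])
```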


Other thoughts

We have discussed several nonlinear models with limited dependent variables, including probit,
logit, multinomial choice, censored outcomes, and sample selection models. There are many other
interesting models, which we will not have time to discuss, but which you might see in your second
year. One example is the Roy model. The original motivating example for this model was: how do
we analyze what careers people choose? In this example, suppose there are just two options: fishing
or hunting. More generally, we think of two sectors, the fishing sector and the hunting sector. Also,
in this model everyone must work and the only question is what sector to work in. You could
also think about allowing people to choose both whether to work at all, and if they work, what
sector to work in, but that is much more complicated. Consider person i, who earns the wage Wfi
if she works in fishing and earns the wage Whi if she works in hunting. The classical Roy model
assumes that people choose the sector that maximizes their wage. That is, person i chooses fishing if
Wfi > Whi and hunting if Whi > Wfi. We observe their wage in the chosen sector only,

Yi = max{Wfi, Whi}.

The Roy model has been very influential, and is used to study many different settings, not just
career choice. For example it has been used to study local labor markets, whether firms adopt a
new technology, or whether workers choose to join a union.
The models we discussed above all impose very strong assumptions on the unobservables, namely
that they have a known distribution, such as a normal distribution. Of course, if this assumption
is incorrect (and there is in general no reason to believe that the unobservables have the assumed
distribution) then the estimation results are misleading. An alternative is to use so called nonpara-
metric or semiparametric methods, which do not rely on these strong functional form assumptions.
You can learn about these methods in your second year and we might talk about them briefly at
the end of the semester.
