MCMC Methods For Fitting and Comparing Multinomial Response Models
Abstract
This paper is concerned with statistical inference in multinomial probit, multinomial-t and multinomial logit models. New Markov chain Monte Carlo (MCMC) algorithms for fitting these models are introduced and compared with existing MCMC methods. The question of parameter identification in the multinomial probit model is readdressed. Model comparison issues are also discussed and the method of Chib (1995) is utilized to find Bayes factors for competing multinomial probit and multinomial logit models. The methods and ideas are illustrated in detail with an example.
Keywords: Bayes factor; Gibbs sampling; Monte Carlo EM algorithm; Marginal likelihood; Metropolis-Hastings algorithm; Multinomial logit; Multinomial probit; Multinomial-t; Model comparison.
1 Introduction
The fitting of multinomial probit models has been viewed as a challenge for over twenty-five years. One major difficulty is the problem of evaluating the likelihood function, while another, somewhat neglected one, is the problem of estimating the covariance parameters of the model given that only one outcome per subject is observed. As a result of this missingness, which is inherent in multinomial data, it is possible that different combinations of regression and covariance parameters can produce virtually identical outcome probabilities.
Recently, developments in simulation-based Bayesian and classical methods have given rise to reasonably effective methods for estimating this model [McFadden (1989), Albert and Chib (1993), McCulloch, Polson, and Rossi (1994) and Stern (1997)]. Despite these developments, further improvements in the fitting of the model are possible, based on Markov chain Monte Carlo methods [Gelfand and Smith (1990), Chib and Greenberg (1996)].
In general terms, Markov chain simulation methods provide a rather attractive framework for dealing with the MNP and related multinomial models. The use of these methods in the context of probit models was initiated by Albert and Chib (1993). A central reason for studying these methods is that they are easy to implement and can be applied from both a classical and Bayesian perspective. One version of these methods can be used to sample the posterior distribution of the parameters, while another can be used to search for the maximum-likelihood estimate. As a bonus, these methods can be extended for the fitting of more general multinomial models than the MNP. One such model that is introduced in this paper, the multinomial-t, relies on a multivariate-t assumption for the latent data. It turns out that the basic algorithms have to be modified only slightly to apply to this model.
In addition to tackling the question of fitting the MNP model, another purpose of this paper is to develop a framework within which alternative multinomial models can be compared. This framework is important because there is a paucity of discussion in the literature on the practical benefits of the MNP model over the much simpler multinomial logit (MNL) model. Although it is well known that the MNL model suffers from a weakness not shared with the MNP model, namely that the ratio of probabilities of any two outcomes does not depend on the presence or absence of other outcomes, it appears that the importance of this weakness has not been assessed in empirical settings. One reason for this may be that the comparison of these non-nested models is difficult from a classical perspective. From a Bayesian viewpoint, however, such comparisons can be handled more conveniently. For a specified set of priors, a method due to Chib (1995) can be used to calculate the marginal likelihood of the model and the Bayes factor, which is used in the Bayesian context to compare models. We apply this technique to a data set and find that the support for the MNL model over both the MNP and MNT models is decisive. Moreover, the Bayes factor supports the MNL model in another example, which we do not report on, in which the data are artificially generated from the MNP model. This result is possibly an artifact of the covariates and our design, but it nonetheless emphasizes the important point that support for the MNP model over the MNL model is not guaranteed once model complexity is taken into account.
The rest of the paper is organized as follows. In Section 2 the various multinomial models are described, and in Section 3 two new MCMC algorithms for fitting the MNP and related models are presented. This section also discusses the issues related to identification and points out why the parameters of the MNP model are likely to be weakly identified. Section 4 explains the computation of the marginal likelihood and Bayes factors and considers an application that involves the comparison of MNP, MNT, and MNL models. Results from a real data set are introduced at various places in the text to illustrate the methods. Concluding remarks are contained in Section 5.
or in vector notation as Z_i = X_i β + ε_i, where β = (δ', β_1', β_2', ..., β_J')',

\[
X_i = \begin{pmatrix}
(v_{i1} - v_{i,J+1})' & w_i' & 0' & \cdots & 0' \\
(v_{i2} - v_{i,J+1})' & 0' & w_i' & \cdots & 0' \\
\vdots & \vdots & & \ddots & \vdots \\
(v_{iJ} - v_{i,J+1})' & 0' & 0' & \cdots & w_i'
\end{pmatrix},
\]

and ε_i = (ε_{i1}, ..., ε_{iJ})' ~ N_J(0, Σ). For identifiability reasons the (1,1) element of Σ, σ_11, is constrained to equal one and β_{J+1} is normalized to zero. In terms of the latent values z_ij, the observed outcome is given by the conditions

\[
Y_i = \begin{cases} j & \text{if } z_{ij} = \max\{Z_i, 0\} \\ J+1 & \text{if } \max_l\{z_{il}\} \le 0, \end{cases} \qquad (1)
\]

and the probability mass function of Y_i is

\[
\Pr(Y_i = j \mid \beta, \Sigma) = \int_{A_j} \phi_J(Z_i \mid X_i\beta, \Sigma)\, dZ_i, \qquad j \le J+1,
\]

where φ_J is the density function of the J-variate normal distribution and

\[
A_j = \begin{cases}
\{Z_i : z_{i1} < z_{ij}, \ldots, 0 < z_{ij}, z_{i,j+1} < z_{ij}, \ldots, z_{iJ} < z_{ij}\}, & j \le J \\
\{Z_i : z_{i1} < 0, \ldots, z_{ij} < 0, \ldots, z_{iJ} < 0\}, & j = J+1.
\end{cases}
\]
The multinomial probabilities thus require the computation of a complicated multivariate integral. One way to compute the integral is by the Monte Carlo importance sampling method developed by Geweke (1991), Hajivassiliou (1990), and Keane (1994), and known as the GHK method (see Appendix A.1 for further details). For estimation purposes, it is not necessary to compute this probability, as is discussed below.
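To make the mapping in (1) concrete, the following minimal sketch (our own illustration, not part of the original paper; the function name observed_choice is ours) classifies a differenced latent vector into one of the J + 1 outcomes.

```python
import numpy as np

def observed_choice(Z_i):
    """Map the differenced latent vector Z_i (length J) to the observed outcome
    as in equation (1): outcome j if z_ij is the positive maximum, else J + 1."""
    j = int(np.argmax(Z_i))               # index (0-based) of the largest latent value
    return j + 1 if Z_i[j] > 0 else len(Z_i) + 1

# Example with J = 3 differenced latent utilities
print(observed_choice(np.array([-0.2, 1.3, 0.4])))    # 2
print(observed_choice(np.array([-0.2, -1.3, -0.4])))  # 4, i.e. outcome J + 1
```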
2.2 Multinomial-t
Now suppose that the distribution F on the underlying undifferenced latent values is multivariate-t with specified degrees of freedom ν. This gives rise to a model that we call the multinomial-t model. Albert and Chib (1993) extended the probit link to the t-link in the binary response case and provided a simple approach for estimating the resulting model. As in the MNP case, the MNT model can be expressed in terms of the differenced latent values Z_i, where now Z_i | β, Σ ~ MVT_J(X_iβ, Σ, ν) with density

\[
f(Z_i \mid \beta, \Sigma) \propto |\Sigma|^{-1/2}\left[1 + \frac{1}{\nu}(Z_i - X_i\beta)'\Sigma^{-1}(Z_i - X_i\beta)\right]^{-(\nu+J)/2}.
\]
As before, σ_11 = 1 and the observed outcomes Y_i are defined by (1). Following Albert and Chib (1993), the model for the latent Z_i may be expressed as a scale mixture of normals by introducing a random variable λ_i ~ Gamma(ν/2, ν/2) and letting Z_i | λ_i ~ N_J(X_iβ, λ_i^{-1}Σ).
The restriction on σ_11 makes it difficult to sample Σ. To solve this problem, McCulloch and Rossi (1994) propose an algorithm that ignores the restriction on σ_11 in the sampling. Their algorithm simulates the non-identified parameters of the model, obtaining draws of the identified parameters ex post from the draws of the non-identified parameters. Nobile (1995) has pointed out that, as a consequence of sampling the non-identified parameters, this method is sensitive to the prior distribution.
To sample the identified parameters in an MCMC simulation with data augmentation, one iterates on the following steps a large number of times.
where

\[
R_{ij} = \begin{cases}
\left(\max\{0, \max\{Z_i^{(-j)}\}\},\ \infty\right) & \text{if } y_i = j,\ j = 1, \ldots, J \\
\left(-\infty,\ \max\{Z_i^{(-j)}\}\right) & \text{if } y_i \ne j,\ j = 1, \ldots, J \\
(-\infty,\ 0] & \text{if } y_i = J+1,
\end{cases}
\]
which follows from the set-valued inverse of the mapping in (1). The density f(z_ij | Z_i^{(-j)}, β, Σ) is obtained by the usual multivariate normal theory. Instead of sampling the z_ij in this manner, the entire vector Z_i can be sampled from (Z_i | y_i, β, Σ) by the accept-reject method [Albert and Chib (1993)]. In this approach the vector Z_i is drawn from N(X_iβ, Σ) and accepted as a valid draw if the vector falls in the region implied by y_i. The advantages of this method are that it requires little coding and that it tends to improve the serial correlation of the sampled output because the Z_i are drawn in one block. A disadvantage is that several sampled vectors may have to be discarded before one is accepted. Nonetheless, because the accept-reject method is not a Markov chain sampler, the method is useful in initializing the Markov chain simulations for the latent data.
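The accept-reject step just described can be sketched as follows (a minimal illustration under our own naming; observed_choice implements the mapping (1) as in the earlier sketch).

```python
import numpy as np

def observed_choice(Z_i):
    """Outcome implied by the differenced latent vector Z_i, equation (1)."""
    j = int(np.argmax(Z_i))
    return j + 1 if Z_i[j] > 0 else len(Z_i) + 1

def draw_Z_accept_reject(mean_i, Sigma, y_i, rng, max_tries=100000):
    """Draw Z_i from N(X_i beta, Sigma) and accept the draw only if it lies in the
    region implied by the observed choice y_i (the accept-reject step in the text)."""
    for _ in range(max_tries):
        Z = rng.multivariate_normal(mean_i, Sigma)
        if observed_choice(Z) == y_i:
            return Z
    raise RuntimeError("acceptance rate too low; revert to one-at-a-time truncated draws")

rng = np.random.default_rng(1)
Z_draw = draw_Z_accept_reject(np.array([0.3, -0.2, 0.1]), np.eye(3), y_i=1, rng=rng)
```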
The next two distributions are proportional to the complete data density

\[
f(Z \mid \beta, \Sigma) = \prod_{i=1}^n f(Z_i \mid \beta, \Sigma)
\propto |\Sigma|^{-n/2} \exp\!\left(-\frac{1}{2}\sum_{i=1}^n (Z_i - X_i\beta)'\Sigma^{-1}(Z_i - X_i\beta)\right). \qquad (3)
\]
The mapping between Σ and γ is one-to-one. This parameterization of Σ leaves the vector γ entirely unrestricted. Any γ ∈ R^p leads to a matrix Σ that is symmetric, positive definite, and has σ_11 = 1.
To understand the nature of this parameterization, consider the case J = 2, where

\[
L = \begin{pmatrix} 1 & 0 \\ l_{21} & l_{22} \end{pmatrix}.
\]

From Σ = LL' it follows that σ_21 = l_21 and σ_22 = l_21^2 + l_22^2. These imply that l_22^2 = σ_22 − σ_21^2, which is the determinant of Σ and is positive if Σ is positive definite. Thus, the parameterization γ = (l_21, log(l_22)) imposes the required properties of positive definiteness along with the condition that σ_11 = 1.
A major advantage of the γ parameterization from a Bayesian perspective is that it permits a straightforward use of MCMC methods. Furthermore, a prior distribution on γ can be assigned by specifying a prior distribution on each σ_ij and then using this prior distribution to infer the required distribution of γ. To illustrate this idea, suppose that our prior beliefs about vech(Σ) are proportional to a normal distribution with mean vector s_0 and covariance matrix S_0, as in Chib and Greenberg (1995b). The required prior on γ can be determined by the following Monte Carlo procedure (a sketch follows the list):
1. Set i = 1.
   (a) While i is less than I (a prespecified quantity), sample a vector vech(Σ)_i ~ N(s_0, S_0), form the matrix Σ_i, and compute its Cholesky factorization Σ_i = L_i L_i'. From L_i compute and store the vector γ_i.
   (b) Increment i and go to (1a).
2. Compute v_0 = I^{-1} Σ_{i=1}^I γ_i and G_0 = I^{-1} Σ_{i=1}^I (γ_i − v_0)(γ_i − v_0)', the mean and covariance of {γ_i}. Let the prior distribution of γ be N(v_0, G_0).
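The sketch below illustrates the procedure for the J = 2 case with γ = (l_21, log l_22); it is our own illustration, and the handling of draws of vech(Σ) that are not positive definite (they are simply discarded here) is an assumption.

```python
import numpy as np

def induced_prior_on_gamma(s0, S0, I=50000, seed=0):
    """Monte Carlo approximation of the prior on gamma = (l21, log l22) implied by
    a N(s0, S0) prior on vech(Sigma) = (sigma21, sigma22), with sigma11 fixed at 1."""
    rng = np.random.default_rng(seed)
    gammas = []
    while len(gammas) < I:
        s21, s22 = rng.multivariate_normal(s0, S0)
        if s22 - s21 ** 2 <= 0:            # discard draws that are not positive definite
            continue
        Sigma = np.array([[1.0, s21], [s21, s22]])
        L = np.linalg.cholesky(Sigma)      # L = [[1, 0], [l21, l22]]
        gammas.append([L[1, 0], np.log(L[1, 1])])
    gammas = np.asarray(gammas)
    v0 = gammas.mean(axis=0)               # prior mean of gamma
    G0 = np.cov(gammas, rowvar=False)      # prior covariance of gamma
    return v0, G0

v0, G0 = induced_prior_on_gamma(np.array([0.0, 1.0]), np.diag([1.0, 0.5]))
```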
Note that the above prior on {σ_ij} overcomes the well known limitation of the Wishart distribution wherein the spread of the distribution is controlled by a single scalar degrees of freedom parameter. A notable advantage of working in the γ parameterization is that it leads to an unrestricted posterior density. In contrast, the posterior density of vech(Σ) is restricted to the region that produces a positive-definite matrix.
Now consider the sampling of Σ (equivalently, the sampling of γ) from the density π(γ | Z, β). By definition the full conditional density is

\[
\pi(\gamma \mid Z, \beta) \propto \pi(\gamma) \prod_{i=1}^n \phi_J(Z_i \mid X_i\beta, \Sigma(\gamma))
\propto \pi(\gamma)\, f(Z \mid \beta, \Sigma(\gamma)), \qquad \gamma \in R^p, \qquad (4)
\]

where π(γ) is the unnormalized Gaussian prior density for γ and the value of the normalizing constant is not required. This posterior density can be sampled by the MH algorithm with a tailored proposal density. Tailoring is achieved by finding the mode and curvature of log f(Z | β, Σ(γ)) from a few Newton-Raphson steps. The mode and curvature are then used to create a multivariate-t proposal density f_T(γ | m, τV, ν), where m is the mode, V is the inverse of the negative Hessian at the mode, and τ and ν are adjustable parameters. With γ denoting the current point in the iterations, the MCMC algorithm proceeds by iterating on the following steps.
Algorithm MNP 1
Sample Z as in the basic algorithm for sampling the MNP posterior distribution;
Sample β as in the basic algorithm for sampling the MNP posterior distribution;
Sample γ† from f_T(· | m, τV, ν) and compute

\[
\alpha(\gamma, \gamma^{\dagger}) = \min\left\{1,\ \frac{f(Z \mid \beta, \Sigma(\gamma^{\dagger}))\, \pi(\gamma^{\dagger})\, f_T(\gamma \mid m, \tau V, \nu)}{f(Z \mid \beta, \Sigma(\gamma))\, \pi(\gamma)\, f_T(\gamma^{\dagger} \mid m, \tau V, \nu)}\right\};
\]

Move to γ† with probability α(γ, γ†) and stay at γ with probability 1 − α(γ, γ†).
It should be noted that this algorithm is easily modified if the covariance matrix has more constraints than σ_11 = 1. In that case one can operate directly on the unique elements of Σ, as in Chib and Greenberg (1995b) in a different but related context. This point is illustrated in one of the examples considered below.
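As an illustration of the tailored MH step for γ, the following sketch uses a generic numerical optimizer in place of the few Newton-Raphson steps mentioned above and SciPy's multivariate-t distribution for the proposal; all function names (Sigma_from_gamma, loglik_Z, mh_step_gamma) and the J = 2 specialization are our own assumptions, not the paper's code.

```python
import numpy as np
from scipy import optimize, stats

def Sigma_from_gamma(gamma):
    """J = 2 case: gamma = (l21, log l22) and Sigma = L L' with l11 = 1."""
    L = np.array([[1.0, 0.0], [gamma[0], np.exp(gamma[1])]])
    return L @ L.T

def loglik_Z(gamma, Z, M):
    """Complete-data log density log f(Z | beta, Sigma(gamma)) of equation (3);
    Z and M (= X_i beta stacked by rows) are (n, J) arrays."""
    Sigma = Sigma_from_gamma(gamma)
    R = Z - M
    Sinv = np.linalg.inv(Sigma)
    return -0.5 * Z.shape[0] * np.log(np.linalg.det(Sigma)) - 0.5 * np.sum((R @ Sinv) * R)

def mh_step_gamma(gamma, Z, M, prior_mean, prior_cov, tau=1.0, df=20, rng=None):
    """One tailored MH update of gamma: build a multivariate-t proposal at the mode
    and curvature of the complete-data log likelihood, then accept or reject."""
    rng = np.random.default_rng() if rng is None else rng
    opt = optimize.minimize(lambda g: -loglik_Z(g, Z, M), gamma, method="BFGS")
    prop = stats.multivariate_t(loc=opt.x, shape=tau * opt.hess_inv, df=df)
    log_prior = stats.multivariate_normal(prior_mean, prior_cov).logpdf
    cand = prop.rvs(random_state=rng)
    log_alpha = (loglik_Z(cand, Z, M) + log_prior(cand) + prop.logpdf(gamma)
                 - loglik_Z(gamma, Z, M) - log_prior(gamma) - prop.logpdf(cand))
    return cand if np.log(rng.uniform()) < log_alpha else gamma
```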
Posterior sampling without augmentation
Algorithm 1 exploits the simplification that arises from data augmentation. One question is whether it is possible to sample the posterior distribution without augmentation. The main problem (one that is avoided by data augmentation) is that it is necessary to compute the likelihood function at least once in each iteration. This can be a prohibitive computational burden if the sample size and the number of alternatives are large. In the case of smaller models, however, one may proceed as follows.
Let ψ = (β, γ) denote the parameters of the model and consider sampling ψ in one block with the MH algorithm. To find the proposal density for ψ one can utilize the output of Algorithm 1. Specifically, one can run Algorithm 1 for G = 5000 iterations (say) to find the mean vector ψ̄ = G^{-1} Σ_{g=1}^G ψ^{(g)} and the sample covariance matrix V = G^{-1} Σ_{g=1}^G (ψ^{(g)} − ψ̄)(ψ^{(g)} − ψ̄)'. Based on these quantities, the proposal density can be specified as f_T(ψ | ψ̄, τV, ν), where f_T is the multivariate-t density with ν degrees of freedom. A sample of draws from the posterior distribution can then be obtained by repeating the following step.
Algorithm MNP 2
Sample (β†, γ†) from f_T(· | ψ̄, τV, ν) and let

\[
\alpha[(\beta, \gamma), (\beta^{\dagger}, \gamma^{\dagger})] = \min\left\{1,\ \frac{p(y \mid \beta^{\dagger}, \gamma^{\dagger})\, \pi(\beta^{\dagger}, \gamma^{\dagger})\, f_T(\beta, \gamma \mid \bar\psi, \tau V, \nu)}{p(y \mid \beta, \gamma)\, \pi(\beta, \gamma)\, f_T(\beta^{\dagger}, \gamma^{\dagger} \mid \bar\psi, \tau V, \nu)}\right\}
\]

denote the probability of move. Then move to (β†, γ†) with probability α[(β, γ), (β†, γ†)] and stay at (β, γ) with probability 1 − α[(β, γ), (β†, γ†)].
It should be noted that if J is large it may be necessary to sample β and γ in two blocks. In that case, however, the likelihood function p(y | β, γ) must be evaluated twice within each iteration and the proposal density for each block must also be defined in a different way. At this point, therefore, it does not seem feasible to implement this algorithm in general without incurring an enormous computational cost.
Starting values for algorithms
It is often useful to initialize posterior sampling algorithms in regions that have high mass under the posterior distribution. This seems to be particularly important in the fitting of MNP models. One way to compute a high density point is by the Monte Carlo EM (MCEM) algorithm, which also relies on data augmentation and delivers the approximate maximum likelihood estimate [Natarajan, Kiefer, and McCulloch (1995)]. Let (β^{(t)}, Σ^{(t)}) denote the current value of the parameters and (β̂, Σ̂) the estimates obtained at convergence. The algorithm is implemented by iterating on the following steps.
Algorithm MCEM
Sample Z^{(j)} as in the basic algorithm for sampling the posterior distribution. Repeat this step N times.
Update β through the expression

\[
\beta^{(t+1)} = \left(\sum_{i=1}^n X_i' \Sigma^{-1} X_i\right)^{-1} \left(\sum_{i=1}^n X_i' \Sigma^{-1} \bar{Z}_i\right),
\]

where Z̄_i = N^{-1} Σ_{j=1}^N Z_i^{(j)}.
In implementing this algorithm N is initially chosen to be a small number, and its value is steadily increased as the maximizer is approached. In the examples below, N is set equal to ten for the first twenty iterations and is increased to four hundred close to convergence.
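A minimal sketch of the β update (our own illustration; the array layout is an assumption):

```python
import numpy as np

def mcem_beta_update(X, Zbar, Sigma):
    """MCEM update
        beta = (sum_i X_i' Sigma^{-1} X_i)^{-1} (sum_i X_i' Sigma^{-1} Zbar_i),
    where Zbar_i averages the N latent draws for subject i.
    X has shape (n, J, k) and Zbar has shape (n, J)."""
    Sinv = np.linalg.inv(Sigma)
    A = sum(Xi.T @ Sinv @ Xi for Xi in X)
    b = sum(Xi.T @ Sinv @ Zi for Xi, Zi in zip(X, Zbar))
    return np.linalg.solve(A, b)
```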
A well known problem with the EM algorithm is that it does not automatically provide an estimate of the observed information matrix at convergence. This is not a problem if one is using the MCEM algorithm to supply starting values for the full posterior sampling algorithms. If standard errors are required, then one can compute the observed information matrix using the Louis (1982) formula

\[
E\left[-\nabla^2 \log f(Z \mid \beta, \Sigma)\right] - \mathrm{Var}\left\{\nabla \log f(Z \mid \beta, \Sigma)\right\},
\]

where the expectation and variance are with respect to the distribution Z | y, β̂, Σ̂ and ∇ denotes differentiation with respect to the parameters. Each of these terms can be estimated by taking M additional draws {Z^{(1)}, ..., Z^{(M)}} from Z | y, β̂, Σ̂ and computing the expectation and variance as corresponding sample averages.
3.2 MCMC sampling of the MNT and MNL models
Consider now the fitting of the MNT model by MCMC methods. In this case, Algorithm 1 is easily modified because of the fundamental connection between the multivariate-t and multivariate normal distributions. The general idea is to conduct the sampling with λ_i (i ≤ n) as additional parameters of the model. Then, conditional on λ_i, the latent data Z_i follow the distribution

\[
Z_i \sim N\!\left(X_i\beta,\ \lambda_i^{-1}\Sigma\right).
\]

Accordingly, the full conditional distributions of z_ij and β are obtained by replacing Σ by λ_i^{-1}Σ in the expressions presented above. To sample γ, the MH approach given in the context of Algorithm 1 can again be applied by noting that λ_i^{1/2} Z_i is distributed as normal with mean λ_i^{1/2} X_iβ and variance Σ. Finally, the mixing variable λ_i (i ≤ n) is sampled from the gamma distribution

\[
\lambda_i \mid Z_i, \beta, \Sigma \sim \mathrm{Gamma}\!\left(\frac{\nu + J}{2},\ \frac{\nu + (Z_i - X_i\beta)'\Sigma^{-1}(Z_i - X_i\beta)}{2}\right), \qquad i \le n.
\]
Algorithm 2 can also be modified by making use of the GHK algorithm to evaluate Pr(y_i = j | β, Σ), but now under the assumption that the distribution of the latent data is multivariate-t. The GHK algorithm in this case requires simulation from univariate Student-t distributions, as discussed in Appendix A.1.
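A short sketch of the λ_i step (our own illustration, with an assumed array layout):

```python
import numpy as np

def draw_lambda(Z, M, Sigma, nu, rng):
    """Draw each mixing variable from its gamma full conditional,
    lambda_i ~ Gamma((nu + J)/2, rate = (nu + (Z_i - X_i beta)' Sigma^{-1} (Z_i - X_i beta)) / 2),
    where Z and M (= X_i beta stacked by rows) are (n, J) arrays."""
    J = Z.shape[1]
    R = Z - M
    quad = np.einsum("ij,jk,ik->i", R, np.linalg.inv(Sigma), R)  # quadratic forms per subject
    return rng.gamma(0.5 * (nu + J), 2.0 / (nu + quad))           # numpy gamma uses scale = 1/rate
```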
To conduct MCMC sampling of β in the MNL model we note that the posterior density of β is proportional to

\[
\pi(\beta \mid y) \propto \prod_{i,j}\left(\frac{\exp(v_{ij}'\delta + w_i'\beta_j)}{\sum_{l=1}^{J+1}\exp(v_{il}'\delta + w_i'\beta_l)}\right)^{d_{ij}} \pi(\beta),
\]

where

\[
d_{ij} = \begin{cases} 1 & \text{if } y_i = j \\ 0 & \text{otherwise} \end{cases}
\]

and π(β) is the prior distribution for β, assumed to be multivariate normal with known mean vector and covariance matrix. This density can be sampled by the MH algorithm in which the proposal density q(β) is taken to be multivariate-t with mean vector equal to the mode of the posterior distribution and scale matrix equal to the curvature at the mode of the posterior distribution. The algorithm is then implemented by iterating on the following steps.
Algorithm MNL
Let β be the current value and draw β† from q(β).
Accept β† as the next value in the sample with probability

\[
\alpha(\beta, \beta^{\dagger}) = \min\left\{1,\ \frac{\pi(\beta^{\dagger} \mid y)\, q(\beta)}{\pi(\beta \mid y)\, q(\beta^{\dagger})}\right\}.
\]

Otherwise, retain β as the next value in the sample with probability 1 − α(β, β†).
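A sketch of this MNL sampler appears below; it is our own illustration (the design-array layout and helper names are assumptions), with the posterior mode and curvature obtained from a generic numerical optimizer.

```python
import numpy as np
from scipy import optimize, stats
from scipy.special import logsumexp

def mnl_log_posterior(beta, U, y, b0, B0inv):
    """Log posterior of the MNL model. U has shape (n, J+1, k) with U[i, j] the covariate
    vector of alternative j for subject i, y holds 0-based observed choices, and the
    prior on beta is N(b0, B0) with precision matrix B0inv."""
    eta = U @ beta                                             # (n, J+1) linear indices
    loglik = eta[np.arange(len(y)), y].sum() - logsumexp(eta, axis=1).sum()
    d = beta - b0
    return loglik - 0.5 * d @ B0inv @ d

def mnl_mh_sampler(U, y, b0, B0, n_draws=5000, df=20, seed=0):
    """MH sampler with a multivariate-t proposal located at the posterior mode and
    scaled by the inverse curvature at the mode (the tailored proposal in the text)."""
    rng = np.random.default_rng(seed)
    B0inv = np.linalg.inv(B0)
    opt = optimize.minimize(lambda b: -mnl_log_posterior(b, U, y, b0, B0inv),
                            np.zeros(U.shape[2]), method="BFGS")
    prop = stats.multivariate_t(loc=opt.x, shape=opt.hess_inv, df=df)
    beta, draws = opt.x.copy(), []
    for _ in range(n_draws):
        cand = prop.rvs(random_state=rng)
        log_alpha = (mnl_log_posterior(cand, U, y, b0, B0inv) + prop.logpdf(beta)
                     - mnl_log_posterior(beta, U, y, b0, B0inv) - prop.logpdf(cand))
        if np.log(rng.uniform()) < log_alpha:
            beta = cand
        draws.append(beta.copy())
    return np.asarray(draws)
```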
3.3 Comparison of algorithms for the MNP model
The algorithms for the MNP model are now compared with data on four multinomial choices. The results are similar for the MNT model and are suppressed. The data consist of 210 observations on highway and transit usage between Sydney, Melbourne, and New South Wales, Australia, that were collected by David Hensher and are contained in the Limdep computer package. The choices are whether to travel by air (A), train (T), bus (B), or car (C), with car treated as the base choice. The covariates are terminal waiting time (TTME), in-vehicle time (INVT), in-vehicle cost (INVC), a generalized cost measure (GC), indicator variables for the first three choices (IND1, IND2, IND3), household income times A (HA), and traveling party size times A (PA). Data for the first two observations are presented in Table 1. The covariates are in their undifferenced form (v_ij).
One model that is useful for these data consists of the seven covariates TTME, GC, IND1, IND2, IND3, HA, and PA. On the assumption that the prior information on the covariance parameters is represented by a normal distribution on vech(Σ) with mean s_0 = (0, 1, 0, 0, 0.75) and covariance S_0 = diag(1, 0.51, 1, 1, 0.51), we find (via the simulation method described in Section 3) that the prior mean of γ is (0.01, 0.057, 0.006, 0.006, 0.383) and that the prior variance is approximately 0.28 for each component of γ. The prior on γ is taken to be Gaussian with these moments. Algorithms 1 and 2 are run for 10,000 cycles, and the two adjustable parameters of the MH proposal density in Algorithms 1 and 2 are set at 1 and 20. Each of the posterior sampling algorithms is initialized by the point estimate from the MCEM algorithm.
y_i  TTME  INVT  INVC  GC  IND1  IND2  IND3  HA  PA
4    69    59    100   70   1     0     0    35   1
4    34    31    372   71   0     1     0     0   0
4    35    25    417   70   0     0     1     0   0
4     0    10    180   30   0     0     0     0   0
4    64    58     68   68   1     0     0    30   2
4    44    31    354   84   0     1     0     0   0
4    53    25    399   85   0     0     1     0   0
4     0    11    255   50   0     0     0     0   0
Table 1: Data for the first two observations (one row per mode: air, train, bus, car); the covariates are in undifferenced form.
Results are summarized in Tables 2 and 3. Point estimates of β and Σ (MLE and posterior means) are fairly close across the various algorithms. Some of the differences may be attributable to identification problems inherent in this model that are discussed below. Differences between the MLE and the posterior means may also reflect asymmetries in the posterior distribution, with a resulting difference between modes and means. It is also interesting to compare the serial correlation of the sampled output from Algorithms 1 and 2. Figures 1 and 2 reproduce the sampled output for the elements of Σ. It is seen that the serial correlation of the output from Algorithm 2 dissipates quickly relative to Algorithm 1. However, the point estimates from the two algorithms are very close (as are the predicted probabilities computed below) and, therefore, one may conclude that the benefits that accrue from adopting Algorithm 2 are outweighed by the computational burden.
Finally, we compare the posterior predicted probability of the observed choice of each individual from each of the three algorithms. This probability is computed from the posterior sample of the parameters generated by each of the algorithms as

\[
\Pr(Y_i = j_i \mid a) = G^{-1} \sum_{g=1}^{G} \Pr\!\left(Y_i = j_i \mid \beta_a^{(g)}, \Sigma_a^{(g)}\right), \qquad (5)
\]

where j_i is the choice made by the ith subject, (β_a^{(g)}, Σ_a^{(g)}) are draws from the posterior distribution, and a = 1, 2 indexes, respectively, the MNP Algorithms 1 and 2.
           MCEM                  Algorithm 1          Algorithm 2
Variable   MLE      Std Error    Mean     Std Dev     Mean     Std Dev
TTME      -0.030    0.007       -0.040    0.007      -0.039    0.007
GC        -0.011    0.002       -0.012    0.002      -0.012    0.002
IND1       2.096    0.743        2.807    0.601       2.666    0.601
IND2       1.474    0.317        1.786    0.271       1.714    0.273
IND3       1.272    0.316        1.511    0.269       1.477    0.266
HA         0.013    0.005        0.013    0.006       0.014    0.006
PA        -0.471    0.125       -0.523    0.125      -0.512    0.122
Table 2: Estimates of β: MLE and standard error from the MCEM algorithm, and posterior means and standard deviations from Algorithms 1 and 2.
[Figure 1: sampled output and autocorrelations for σ_21, σ_22, σ_31, σ_32, σ_33.]
Figure 3 displays the scatter plot of the probabilities for two pairs of the algorithms. It will be seen that the points lie on or very close to the 45-degree line. The correlations between the predicted probabilities are over 0.999, indicating that the results from the algorithms are indistinguishable.
3.4 Identification issues
In fitting MNP models it is important to keep in mind the issue of parameter identification. Keane (1992) points out that the parameters of the MNP model are weakly identified and attributes this problem to the lack of exclusion restrictions. He argues that "movements in the regressor coefficients can effectively mimic the effects of changes in the covariance parameters," thus leading to a flat likelihood surface. We attribute the problem of fragile identification to the large number of free parameters in the model rather than to the lack of exclusion restrictions.
[Figure 2: sampled output and autocorrelations for σ_21, σ_22, σ_31, σ_32, σ_33.]
It is possible to obtain very similar likelihood functions for quite different sets of parameter values whether or not there are exclusion restrictions. The same problem arises in the MNT version of the model, but it is less serious in the MNL model because there are no covariance parameters to estimate.
The case J = 2 is examined. Figure 4 displays in the (z_1, z_2) space those regions that lead, respectively, to choices 1 (lightly shaded), 2 (medium shaded), or 3 (heavily shaded). The distribution of a (z_{i1}, z_{i2}) pair depends on its mean X_iβ and the covariance matrix Σ. To see how fragile identification may arise, consider an observation for which the mean is located deep in the region where y_i = j (i.e., the covariates are very effective in predicting choice). In that case, the probability that the person chooses j is very high. If the covariates are effective predictors for most of the observations in a sample, the observed choice is consistent with quite different covariance matrices, and the resulting likelihood function is flat.
[Figure 3: scatter plot of the predicted probabilities of the observed choices from pairs of algorithms (axes pr1 and pr2).]
Note that the likelihood contribution of the ith subject is based only on the actual choice made. Figure 4 illustrates the problem in a less extreme case. The dashed 99% contour is plotted around a mean of (0, -0.5) with vech(Σ) = (0, 1), and the solid contour is around mean (0.39, -0.22) with vech(Σ) = (1.68, 3.00). The correlation is zero for the first of these and 0.97 for the second. Although the two sets of parameters are very different, they yield the same probabilities of choices to two decimal places: 0.43, 0.22, and 0.35. Thus, even for observations that are not deep in one of the regions, the parameters may not be well identified, and the extent of the problem would vary for different data sets.
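The two parameter configurations above can be checked by direct simulation; the sketch below (our own, not from the paper) estimates the three choice probabilities by Monte Carlo and reproduces their near equality.

```python
import numpy as np

def choice_probs(mean, Sigma, n=1_000_000, seed=0):
    """Monte Carlo choice probabilities for J = 2: outcome 1 or 2 when the
    corresponding latent value is the positive maximum, outcome 3 otherwise."""
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(mean, Sigma, size=n)
    zmax = Z.max(axis=1)
    p3 = np.mean(zmax <= 0)
    p1 = np.mean((zmax > 0) & (Z[:, 0] >= Z[:, 1]))
    return p1, 1.0 - p1 - p3, p3

# vech(Sigma) = (sigma21, sigma22) with sigma11 = 1
print(choice_probs(np.array([0.0, -0.5]), np.array([[1.0, 0.0], [0.0, 1.0]])))
print(choice_probs(np.array([0.39, -0.22]), np.array([[1.0, 1.68], [1.68, 3.0]])))
```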
In view of this discussion, we support Keane's idea that identification may be fragile but believe that for some data sets this fragility will persist even in the presence of exclusion restrictions.
Figure 4: Means at (0, -0.50) and (0.39, -0.22), vech(Σ) at (0, 1) and (1.68, 3.00), and two 99% contours (axes z_1 and z_2).
(Our model implies several restrictions; for example, the variable IND1 is not contained in the T and B equations.) The problem seems less severe for estimates of β than for Σ, and although the coefficients are somewhat different, the predicted probabilities are very close.
of model M_k be given by

\[
m(y \mid M_k) = \int p(y \mid M_k, \beta, \Sigma(\gamma))\, \pi(\beta, \gamma \mid M_k)\, d\beta\, d\gamma, \qquad (6)
\]

where we have adopted the γ parameterization for Σ and suppressed the dependence of the parameters on M_k. The MNL marginal likelihood has the same form except for the integration over γ. Given the marginal likelihood of each model, model evidence in favor of M_k over M_r is measured by the Bayes factor B_kr, which is given by the ratio m(y | M_k)/m(y | M_r).
4.1 Computation of marginal likelihood
A straightforward way to estimate the integral (6) is by the method of Chib (1995) [see DiCiccio, Kass, Raftery, and Wasserman (1997) for this and other methods of computing the marginal likelihood]. The Chib method utilizes Bayes theorem to obtain

\[
m(y \mid M_k) = \frac{p(y \mid M_k, \beta^*, \Sigma(\gamma^*))\, \pi(\beta^*, \gamma^* \mid M_k)}{\pi(\beta^*, \gamma^* \mid M_k, y)},
\]

where all normalizing constants are included and β* and γ* are arbitrary points, taken to be high density values such as the posterior means. Transforming to the log scale and utilizing conditional/marginal decompositions yields

\[
\log m(y \mid M_k) = \log p(y \mid M_k, \beta^*, \Sigma(\gamma^*)) + \log \pi(\beta^* \mid M_k) + \log \pi(\gamma^* \mid M_k)
- \log \pi(\gamma^* \mid M_k, y, \beta^*) - \log \pi(\beta^* \mid M_k, y). \qquad (7)
\]

A key desirable feature of this approach is that the likelihood function p(y | M_k, β*, Σ(γ*)) needs to be computed only once. In the appendix we explain how each term in (7) is computed.
In order to implement this Bayesian model selection approach it is necessary to think carefully about the prior inputs. One criterion is that the prior distributions lead a priori to the same distribution of observable responses across models. Another possible requirement on the prior is that the choice between different models depends primarily on the data and only slightly on the details of the prior. We offer two suggestions for choosing such priors for the parameters.
One approach to specifying a prior on θ_k, where θ_k denotes the parameters of M_k, is based on the preposterior distribution of the data under M_k:

\[
\Pr(y \mid M_k) = \int f_k(z \mid \beta_k, \Sigma_k)\, \pi_k(\beta_k)\, \pi_k(\Sigma_k)\, dz\, d\beta_k\, d\Sigma_k,
\]

where β_k ~ N(0, cI) and vech(Σ_k) ∝ N(s_0, S_0). In this approach, c, s_0, and S_0 are chosen to make Pr(y | M_k) approximately equal for the models to be compared and approximately equal to what is known about Pr(y | M_k). For example, for the travel data discussed in the example, the approximate percentage breakdown of people traveling by the various modes may be known from previous studies, or information may be available for trips between comparable destinations. Under this approach, the priors for the two models are comparable in the sense that they produce the same probabilities of choice.
An alternative prior can be based on a method that uses a training sample. For model M_k, assume that the prior distribution is π_k(β_k, Σ_k | c_k), where c_k is a vector of hyperparameters. Let y_t be a vector of n_1 observations selected at random from y, and let y_r be the remainder of the sample. The training prior distribution is defined as

\[
\pi_k(\beta_k, \Sigma_k \mid y_t) \propto \pi_k(y_t \mid \beta_k, \Sigma_k)\, \pi_k(\beta_k, \Sigma_k \mid c_k).
\]

The ratio of marginal likelihoods for M_k and M_j based on y_t is

\[
B_{kj} = \frac{m_k(y_t)}{m_j(y_t)} = \frac{\int p_k(y_t \mid \beta_k, \Sigma_k)\, \pi_k(\beta_k, \Sigma_k \mid c_k)\, d\beta_k\, d\Sigma_k}{\int p_j(y_t \mid \beta_j, \Sigma_j)\, \pi_j(\beta_j, \Sigma_j \mid c_j)\, d\beta_j\, d\Sigma_j}.
\]

This expression represents the Bayes factor before seeing the data in y_r. Our suggestion is to choose c_k and c_j so that B_kj = 1. This choice makes the first-stage priors π_k(β_k, Σ_k | c_k) and π_j(β_j, Σ_j | c_j) comparable in the sense that the Bayes factor based on them and the training sample does not favor either model.
Bayes factors are now computed for the data of our example. For the purpose of this illustration we have chosen proper priors for each model that imply approximately the same prior probability distribution on the outcomes. The consequences of a particular prior for the outcomes are determined by simulation. This requires the simulation of parameters from the prior distribution followed by a simulation of the outcomes given the parameters. These two steps are repeated a large number of times and the hyperparameters are adjusted until the implied empirical distribution of the outcomes is roughly similar across models.
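The following sketch outlines one way to carry out this prior-predictive calibration for an MNP model (our own illustration; the J = 2 specialization, names, and the discarding of non-positive-definite draws are assumptions).

```python
import numpy as np

def prior_predictive_shares(X, c, s0, S0, n_rep=2000, seed=0):
    """Simulate (beta, Sigma) from the prior, then outcomes given the parameters, and
    tabulate the implied outcome shares; c, s0 and S0 are adjusted until the shares
    are roughly similar across the models being compared."""
    rng = np.random.default_rng(seed)
    n, J, k = X.shape                       # X[i] is the J x k design matrix X_i
    counts = np.zeros(J + 1)
    for _ in range(n_rep):
        beta = rng.normal(0.0, np.sqrt(c), size=k)           # beta ~ N(0, c I)
        s21, s22 = rng.multivariate_normal(s0, S0)            # vech(Sigma), J = 2 case
        if s22 - s21 ** 2 <= 0:                                # keep positive definite draws
            continue
        Sigma = np.array([[1.0, s21], [s21, s22]])
        Z = X @ beta + rng.multivariate_normal(np.zeros(J), Sigma, size=n)
        chosen = np.where(Z.max(axis=1) > 0, Z.argmax(axis=1), J)   # J codes outcome J+1
        counts += np.bincount(chosen, minlength=J + 1)
    return counts / counts.sum()
```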
4.2 Example (cont.)
Let the model fitted in Section 3.3 be denoted M_1, let M_2 denote the MNP model that adds two covariates, in-vehicle cost for all stages (INVC) and in-vehicle time for all stages (INVT), to model M_1, and let M_3 denote the MNP model in which Σ has equal covariances. This patterned covariance arises from the assumption that the original set of four latent variables are independent. Finally, let M_4 denote the MNL model and let M_5 denote the MNT model with ν = 10 (both with the same covariates as M_1).
We begin with the posterior distribution of Σ in model M_3. Due to the restriction on the covariances, Algorithm 2 cannot be applied in this case, but one can use a version of Algorithm 1 in which the unique elements of Σ are sampled directly through an MH step. The posterior distribution is summarized in Table 4. The posterior distribution of β in this model is close to that of M_1 and is not reported.
              Algorithm 1
Covariance    Mean     Std Dev
σ_ij, i ≠ j   0.267    0.112
σ_22          0.800    0.310
σ_33          0.445    0.184
Table 4: Posterior distribution of the covariance parameters in model M_3.
Model   M_1      M_2      M_3     M_4
M_2     -4.63    --       --      --
M_3      0.54    5.17     --      --
M_4      7.81    12.43    7.27    --
M_5      2.23    6.86     1.69    -5.58
Table 5: Log (base 10) of Bayes factors for row model against column model.
Model   Data Likelihood   Prior Ordinate   Posterior Ordinate   Marginal Likelihood   S.E.
M_1     -83.37            -10.92           -9.55                -103.72               0.05
M_2     -79.76            -13.65           -14.94               -108.35               0.06
M_3     -83.11            -10.54           -9.53                -103.18               0.04
M_4     -80.75            -9.99            -5.17                -95.91                0.04
M_5     -83.82            -8.78            -8.89                -101.49               0.05
Table 6: Log (base 10) of the marginal likelihood and its components. The numerical standard error in the last column is computed as in Chib (1995).
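The entries of Table 5 follow directly from the last column of Table 6; for example, the entry for M_4 against M_1 is

\[
\log_{10} B_{41} = \log_{10} m(y \mid M_4) - \log_{10} m(y \mid M_1) = -95.91 - (-103.72) = 7.81.
\]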
Figure 5 displays a scatter plot of the predicted probabilities from models M_1 and M_4 of the choice made by each subject. In the case of the MNL model, the probability of the observed outcome is larger than that from the MNP model in about 80% of the observations. Thus, in this case, the MNL model is more successful than the MNP model in predicting the choices made by the individuals. The MNT model was included because it represents a compromise between MNP and MNL in the sense that it allows for correlated errors but has thicker tails than the normal. Interestingly, Greene (1997) obtains a parallel result with a classical nested models test. He compares the MNL model and the nested logit model (a model that is similar to the MNP in that both relax the independence of irrelevant alternatives property) and finds that the MNL model cannot be rejected for these data.
5 Conclusions
This paper has presented a set of new MCMC-based algorithms and inference procedures for the Bayesian analysis of the MNP model. One contribution is the comparison, for the first time, of different MCMC algorithms for simulating the posterior distribution of the parameters.
Figure 5: Predicted posterior mean of observed choices from the MNL model (vertical axis) and the MNP model (horizontal axis).
Another contribution is the study of the MNT model and its analysis by MCMC methods. A general comment based on our experience is that the fitting of these models requires some care and that the covariance parameters can be particularly difficult to estimate, regardless of the algorithm that may be used in the fitting.
An important concern of this paper is the question of comparing the fit of alternative MNP models and the fit of the MNP model with that of the MNT and the simpler MNL model. We show that the Bayes factor framework is quite useful for this purpose and that the marginal likelihood of competing models can be computed from the MCMC output as a by-product of the simulation procedure. One interesting result is that the MNP model is not guaranteed to fit better than the MNL model once model complexity is taken into account. Finally, the paper reports on a probability plot for comparing the fit of alternative multinomial response models that should be useful in the practical fitting of these models.
A Appendix
A.1 Computing p(y_i | β, Σ) with the GHK algorithm
If y_i = j (j < J + 1), reorder the variables so that outcome j appears in the first row, and write μ_i = X_iβ (reordered accordingly) and L = (l_rs) for the lower-triangular Cholesky factor of the reordered Σ. Let

\[
Q_{i1}^{(r)} = 1 - \Phi(A_{i1}), \qquad A_{i1} = -\mu_{i1}/l_{11},
\]

where Φ(·) is the cdf of the standard normal distribution. Draw η_{i1}^{(r)} from TN(A_{i1}, ∞), where TN(B, U) is the standard normal distribution truncated to (B, U).

For j = 2, ..., J, let

\[
Q_{ij}^{(r)} = \Phi(B_{ij}^{(r)}), \qquad
B_{ij}^{(r)} = \frac{\mu_{i1} - \mu_{ij} + l_{11}\eta_{i1}^{(r)} - \sum_{m=1}^{j-1} l_{jm}\eta_{im}^{(r)}}{l_{jj}},
\]

and draw η_{ij}^{(r)} from TN(−∞, B_{ij}^{(r)}).

Compute

\[
Q_i^{(r)} = \prod_{j=1}^{J} Q_{ij}^{(r)}.
\]

The GHK estimate of the probability p(y_i | β, Σ) is then given by

\[
\hat{Q}_i = R^{-1} \sum_{r=1}^{R} Q_i^{(r)}.
\]

To apply this method to the MNT model, one draws the η_{ij}^{(r)} from the standard univariate-t distribution, truncated as above, and replaces the cdf of the normal in the above calculations by the cdf of the t distribution.
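A compact sketch of the GHK recursion above (our own illustration; it assumes the mean vector and covariance have already been reordered so that the chosen outcome is first, and it uses SciPy's truncated-normal sampler).

```python
import numpy as np
from scipy.stats import norm, truncnorm

def ghk_probability(mu, Sigma, R=1000, seed=0):
    """GHK estimate of Pr(y_i = j | beta, Sigma) for the reordered system, where
    mu is the reordered mean vector X_i beta and Sigma the reordered covariance."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    J = len(mu)
    Q = np.ones(R)
    eta = np.zeros((R, J))
    # first component: z_1 > 0  <=>  eta_1 > A_1 = -mu_1 / l_11
    A1 = -mu[0] / L[0, 0]
    Q *= 1.0 - norm.cdf(A1)
    eta[:, 0] = truncnorm.rvs(A1, np.inf, size=R, random_state=rng)
    # remaining components: z_j < z_1  <=>  eta_j < B_j
    for j in range(1, J):
        B = (mu[0] - mu[j] + L[0, 0] * eta[:, 0] - eta[:, :j] @ L[j, :j]) / L[j, j]
        Q *= norm.cdf(B)
        eta[:, j] = truncnorm.rvs(-np.inf, B, size=R, random_state=rng)
    return Q.mean()
```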
A.2 Computing the marginal likelihood using Chib's method
Details for computing the marginal likelihood for the MNP model of equation (7) follow, and the necessary modifications for the MNT model are obvious. Calculations for the MNL model are discussed at the end of this subsection. Note that in this section β* and γ* refer to values of β and γ at high density points. The dependence of the parameters on M_k is suppressed.
5. Kernel smoothing may be applied to the sample of γ generated by the original MCMC run to obtain the ordinate at γ*. If γ is high-dimensional it may be desirable to find the ordinate by applying the kernel smoothing to several blocks of its elements [Chib and Greenberg (1995b)]. Note that the kernel smoothing steps suggested here and above can be made as accurate as desired by increasing the number of simulated values. This option is, of course, not available when kernel smoothing is employed on data for which the sample size is fixed.
Finally, the calculation of the marginal likelihood for the MNL model proceeds in a similar fashion. The likelihood function at β* is available in closed form. The prior ordinate for β* is that of a normal distribution, and the posterior ordinate is computed by kernel smoothing.
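For concreteness, a kernel-smoothed posterior ordinate of the sort used in these calculations can be sketched as follows (our own illustration using SciPy's Gaussian KDE).

```python
import numpy as np
from scipy.stats import gaussian_kde

def posterior_ordinate(draws, theta_star):
    """Kernel-smoothed estimate of a posterior ordinate pi(theta* | y) from MCMC draws
    (one row per draw), as needed for the ordinate terms in equation (7)."""
    kde = gaussian_kde(np.asarray(draws).T)       # gaussian_kde expects a (dim, n) array
    return float(kde(np.atleast_2d(theta_star).T)[0])
```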
References
Albert, J. and S. Chib (1993), Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669–679.
Chib, S. (1995), Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90, 1313–1321.
Chib, S. and E. Greenberg (1995a), Understanding the Metropolis-Hastings algorithm. The American Statistician, 49, 327–335.
Chib, S. and E. Greenberg (1995b), Analysis of multivariate probit models. Biometrika, forthcoming.
Chib, S. and E. Greenberg (1996), Markov chain Monte Carlo simulation methods in econometrics. Econometric Theory, 12, 409–431.
DiCiccio, T., R. Kass, A. Raftery, and L. Wasserman (1997), Computing Bayes factors by combining simulation and asymptotic approximations. Journal of the American Statistical Association, 92, 903–915.
Gelfand, A. E. and A. F. M. Smith (1990), Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409.
Geweke, J. (1991), Efficient simulation from the multivariate normal and Student-t distributions subject to linear constraints. In Computer Science and Statistics: Proceedings of the Twenty-Third Symposium on the Interface, 571–578.
Geweke, J., M. Keane, and D. Runkle (1994), Alternative computational approaches to inference in the multinomial probit model. Review of Economics and Statistics, 76, 609–632.
Greene, W. (1997), Econometric Analysis, 3rd ed., Upper Saddle River, NJ: Prentice-Hall.
Hajivassiliou, V. A. (1990), Smooth simulation estimation of panel LDV models. Manuscript.
Hausman, J. A. and D. A. Wise (1978), A conditional probit model for qualitative choice: Discrete decisions recognizing interdependence and heterogenous preferences. Econometrica, 46, 403–426.
Keane, M. P. (1992), A note on identification in the multinomial probit model. Journal of Business & Economic Statistics, 10, 193–200.
Keane, M. P. (1994), A computationally practical simulation estimator for panel data. Econometrica, 62, 95–116.
Louis, T. A. (1982), Finding the observed information matrix using the EM algorithm. Journal of the Royal Statistical Society B, 44, 226–233.
McCulloch, R. E. and P. E. Rossi (1994), Exact likelihood analysis of the multinomial probit model. Journal of Econometrics, 64, 207–240.
McFadden, D. (1989), A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica, 57, 995–1026.
Natarajan, R., C. E. McCulloch, and N. M. Kiefer (1995), Maximum likelihood for the multinomial probit model. Manuscript.
Nobile, A. (1995), A hybrid Markov chain for the Bayesian analysis of the multinomial probit model. Manuscript.
Pinheiro, J. C. and D. M. Bates (1996), Unconstrained parametrizations for variance-covariance matrices. Statistics and Computing, 6, 289–296.
Stern, S. (1997), Simulation-based estimation. Journal of Economic Literature, 35, 2006–2039.