DCA Disaggregate 1
They are mainly used to provide a detailed representation of the complex aspects of transportation demand, based on strong theoretical justifications. Moreover, several packages and tools are available to help practitioners use these models in real applications, making discrete choice models more and more popular. Discrete choice models are powerful but complex. The art of finding the appropriate model for a particular application requires from the analyst both a close familiarity with the reality of interest and a strong understanding of the methodological and theoretical background of the model. The main theoretical aspects of discrete choice models are reviewed in this paper. The main assumptions used to derive discrete choice models in general, and random utility models in particular, are covered in detail. The Multinomial Logit Model, the Nested Logit Model and the Generalized Extreme Value model are also discussed. In the context of transportation demand analysis, disaggregate models have played an important role these last 25 years. These models consider that the demand is the result of several decisions of each individual in the population under consideration. These decisions usually consist of a choice made among a finite set of alternatives. An example of a sequence of choices in the context of transportation demand is described in Figure 1: choice of an activity (play-yard), choice of destination (6th street), choice of departure time (early), choice of transportation mode (bike) and choice of itinerary (local streets). For this reason, discrete choice models have been extensively used in this context.
Figure 1: A sequence of choices

A model, as a simplified description of the reality, provides a better understanding of complex systems. Moreover, it allows for obtaining predictions of future states of the considered system, controlling or influencing its behavior, and optimizing its performance. The complex system under consideration here is a specific aspect of human behavior dedicated to choice decisions. The complexity of this ``system'' clearly requires many simplifying assumptions in order to obtain operational models. A specific model will correspond to a specific set of assumptions, and it is important from a practical point of view to be aware of these assumptions when prediction, control or optimization is performed.
The assumptions associated with discrete choice models in general are detailed in Section 2. Section 3 focuses specifically on assumptions related to random utility models. Some of the most used models, the Multinomial Logit Model (Section 4), the Nested Logit Model (Section 5) and the Generalized Extreme Value Model (Section 6), are then introduced, with special emphasis on the Nested Logit model. Among the many publications that can be found in the literature, we refer the reader to Ben-Akiva and Lerman (1985), Anderson, de Palma and Thisse (1992), Hensher and Johnson (1981) and Horowitz, Koppelman and Lerman (1986) for more comprehensive developments.
In order to develop models capturing how individuals make choices, we have to make specific assumptions. We will distinguish here among assumptions about:

1. the decision-maker: these assumptions define who the decision-maker is, and what his/her characteristics are;
2. the alternatives: these assumptions determine what the possible options of the decision-maker are;
3. the attributes: these assumptions identify the attributes of each potential alternative that the decision-maker takes into account to make his/her decision;
4. the decision rules: they describe the process used by the decision-maker to reach his/her choice.

In order to narrow down the huge number of potential models, we will consider some of these assumptions as fixed throughout the paper. This does not mean that no other assumption is valid, but we cannot cover everything in this context. For example, even if continuous models are briefly described, discrete models are the primary focus of this paper.
Decision-maker
As mentioned in the introduction, choice models are referred to as disaggregate models. This means that the decision-maker is assumed to be an individual. In general, for most practical applications, this assumption is not restrictive. The concept of ``individual'' may easily be extended, depending on the particular application. We may consider that a group of persons (a household or a government, for example) is the decision-maker. In doing so, we decide to ignore all internal decisions within the group, and to consider only the decision of the group as a whole. The example described in Figure 1 reflects the decisions of a household, without accounting for all potential negotiations among the parents and the children. We will refer to ``decision-maker'' and ``individual'' interchangeably throughout the rest of the paper.
Because of its disaggregate nature, the model has to include the characteristics, or attributes, of the individual. Many attributes, like age, gender, income, eye color or social security number, may be considered in the model.
The analyst has to identify those that are likely to explain the choice of the individual. There is no automatic process to perform this identification. The knowledge of the actual application and the data availability play an important role in this process.
Alternatives
Analyzing the choice of an individual requires the knowledge of what has been chosen, but also of what has not been chosen. Therefore, assumptions must be made about the options, or alternatives, that the individual considered when performing the choice. The set containing these alternatives, called the choice set, must be characterized. The characterization of the choice set depends on the context of the application. If we consider the example described in Figure 2, the time spent on each Internet site may be anything, as long as the total time does not exceed two hours. The resulting choice set is represented in Figure 3, and is defined by

$C = \{ (t_1, t_2) \mid t_1 + t_2 \leq 2, \; t_1 \geq 0, \; t_2 \geq 0 \},$

where $t_1$ and $t_2$ denote the time (in hours) spent on each site.
It is a typical example of a continuous choice set, where the alternatives are defined by some constraints and cannot be enumerated.
Figure 3: Example of a continuous choice set

In this paper, we focus on discrete choice sets. A discrete choice set contains a finite number of alternatives that can be explicitly listed. The corresponding choice models are called discrete choice models. The choice of a transportation mode is a typical application leading to a discrete choice set. In this context, the characterization of the choice set consists in the identification of the list of alternatives. To perform this task, two concepts of choice set are considered: the universal choice set and the reduced choice set. The universal choice set contains all potential alternatives in the context of the application. Considering the mode choice in the example of Figure 1, the universal choice set may contain all potential transportation modes, like walk, bike, bus, car, etc. The alternative plane, which is also a transportation mode, is clearly not an option in this context and, therefore, is not included in the universal choice set. The reduced choice set is the subset of the universal choice set considered by a particular individual. Alternatives in the universal choice set that are not available to the individual under consideration are excluded (for example, the alternative car may not be an option for individuals without a driver license). The awareness of the availability of the alternative by the decision-maker should be considered as well. The reader is referred to Swait (1984) for more details on choice set generation. In the following, ``choice set'' will refer to the reduced choice set, except when explicitly mentioned.
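The derivation of a reduced choice set from the universal choice set can be sketched in a few lines. The availability rules below (driver license, awareness of the bus service) are illustrative assumptions in the spirit of the examples above, not rules taken from the paper.

```python
# Sketch: deriving a reduced choice set from the universal choice set.
# The availability rules are illustrative assumptions.

UNIVERSAL_CHOICE_SET = ["walk", "bike", "bus", "car"]

def reduced_choice_set(individual):
    """Keep only the alternatives available to (and known by) this individual."""
    available = []
    for mode in UNIVERSAL_CHOICE_SET:
        if mode == "car" and not individual.get("has_license", False):
            continue  # car is not an option without a driver license
        if mode == "bus" and not individual.get("aware_of_bus", True):
            continue  # the decision-maker must be aware of the alternative
        available.append(mode)
    return available

no_license = reduced_choice_set({"has_license": False})
unaware = reduced_choice_set({"has_license": True, "aware_of_bus": False})
```

With these rules, an individual without a driver license faces the reduced choice set {walk, bike, bus}, and one unaware of the bus service faces {walk, bike, car}.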
Attributes
Each alternative in the choice set must be characterized by a set of attributes. Similarly to the characterization of the decision-maker described in Section 2.1, the analyst has to identify the attributes of each alternative that are likely to affect the choice of the individual. In the context of a transportation mode choice, the list of attributes for the mode car could include the travel time, the out-of-pocket cost and the comfort. The list for bus could include the travel time, the out-of-pocket cost, the comfort and the bus frequency. Note that some attributes may be generic to all alternatives, and some may be specific to an alternative (bus frequency is specific to bus). Also, qualitative attributes, like comfort, may be considered. An attribute is not necessarily a directly observed quantity. It can be any function of available data. For example, instead of considering travel time as an attribute, the logarithm of the travel time may be considered. The out-of-pocket cost may be replaced by the ratio between the out-of-pocket cost and the income of the individual. The definition of attributes as a function of available data depends on the problem. Several definitions must usually be tested to identify the most appropriate.
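The idea of attributes as functions of raw data can be sketched as follows; the field names and the numeric values are illustrative assumptions.

```python
import math

# Sketch: attributes defined as functions of raw data, as discussed above.
# Field names ("travel_time", "cost", "income") are illustrative assumptions.
def car_attributes(row):
    return {
        "log_travel_time": math.log(row["travel_time"]),   # nonlinear transform
        "cost_income_ratio": row["cost"] / row["income"],  # interaction with income
    }

x = car_attributes({"travel_time": 20.0, "cost": 3.0, "income": 60.0})
```

In an estimation exercise, several such definitions (raw travel time vs. its logarithm, cost vs. cost/income) would be tried and compared.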
Decision rules
At this point, we have identified and characterized both the decision-maker and all available alternatives. We will now focus on the assumptions about the rules used by the decision-maker to come up with the actual choice. Different sets of assumptions can be considered, leading to different families of models. We will describe here three theories on decision rules, and the corresponding models. The neoclassical economic theory, described in Section 2.4.1, introduces the concept of utility. The Luce model (Section 2.4.2) and the random utility models (introduced in Section 2.4.3 and developed in Section 3) are designed to capture uncertainty.
The neoclassical economic theory assumes that each decision-maker is able to compare two alternatives $a$ and $b$ in the choice set $C$ using a preference-indifference operator $\succeq$. If $a \succeq b$, the decision-maker either prefers $a$ to $b$, or is indifferent. The preference-indifference operator is supposed to have the following properties:

1. Reflexivity: $a \succeq a, \quad \forall a \in C.$

2. Transitivity: $a \succeq b \text{ and } b \succeq c \implies a \succeq c, \quad \forall a, b, c \in C.$

3. Comparability: $a \succeq b \text{ or } b \succeq a, \quad \forall a, b \in C.$

Because the choice set is finite, the existence of an alternative which is preferred to all others is guaranteed, that is

$\exists \, a^\star \in C \text{ such that } a^\star \succeq b, \quad \forall b \in C.$

More interestingly, and because of the three properties listed above, it can be shown that there exists a function

$U : C \longrightarrow \mathbb{R}$

such that

$a \succeq b \iff U(a) \geq U(b).$
It results that using the preference-indifference operator to make a choice is equivalent to assigning a value, called utility, to each alternative, and selecting the alternative associated with the highest utility. The concept of utility associated with the alternatives plays an important role in the context of discrete choice models. However, the assumptions of neoclassical economic theory present strong limitations for practical applications. Indeed, the complexity of human behavior suggests that a choice model should explicitly capture some level of uncertainty. The neoclassical economic theory fails to do so. The exact source of uncertainty is an open question. Some models assume that the decision rules are intrinsically stochastic, and even a complete knowledge of the problem would not overcome the uncertainty. Others consider that the decision rules are deterministic, and motivate the uncertainty from the impossibility of the analyst to observe and capture all dimensions of the problem, due to its high complexity. Anderson et al. (1992) compare this debate with the one between Einstein and Bohr, about the uncertainty principle in theoretical physics. Bohr argued for the intrinsic stochasticity of nature and Einstein claimed that ``Nature does not play dice''.
Two families of models can be derived, depending on the assumptions about the source of uncertainty. Models with stochastic decision rules, like the model proposed by Luce (1959), described in Section 2.4.2, or the ``elimination by aspects'' approach, proposed by Tversky (1972), assume a deterministic utility and a probabilistic decision process. Random Utility Models, introduced in Section 2.4.3 and developed in Section 3, are based on the deterministic decision rules from the neoclassical economic theory, where uncertainty is captured by random variables representing utilities.
An important characteristic of models dealing with uncertainty is that, instead of identifying one alternative as the chosen option, they assign to each alternative a probability to be chosen. Luce (1959) proposed the choice axiom to characterize a choice probability law. The choice axiom can be stated as follows. Denoting $P(a \mid C)$ the probability of choosing $a$ in the choice set $C$, and $P(A \mid C)$ the probability of choosing one element of the subset $A$ within $C$, the two following properties hold for any choice sets $A$ and $C$ such that $A \subseteq C$:

1. If an alternative $a$ is dominated, that is if there exists $b \in C$ such that $b$ is always preferred to $a$ or, equivalently, $P(a \mid \{a, b\}) = 0$, then removing $a$ from $C$ does not modify the probability of any other alternative to be chosen, that is

$P(b \mid C) = P(b \mid C \setminus \{a\}), \quad \forall b \in C, \; b \neq a. \qquad (6)$

2. If no alternative is dominated, that is if $P(a \mid \{a, b\}) \neq 0$ for all $a, b \in C$, then the choice probability is independent from the sequence of decisions, that is

$P(a \mid C) = P(A \mid C) \, P(a \mid A). \qquad (7)$
The independence described by (7) can be illustrated using an example of transportation mode choice, where we consider $C = \{\text{Car}, \text{Bike}, \text{Bus}\}$. We apply two different assumptions to compute the probability of choosing ``car'' as a transportation mode.

1. The decision-maker may decide first to use a motorized mode (car or bus, in this case). The probability of choosing ``car'' is then given by

$P(\{\text{Car}, \text{Bus}\} \mid C) \, P(\text{Car} \mid \{\text{Car}, \text{Bus}\}).$

2. Alternatively, the decision-maker may decide first to use a private transportation mode (car or bike, in this case). The probability of choosing ``car'' is then given by

$P(\{\text{Car}, \text{Bike}\} \mid C) \, P(\text{Car} \mid \{\text{Car}, \text{Bike}\}).$

Equation (7) of the choice axiom imposes that both assumptions produce the same probability, that is

$P(\text{Car} \mid C) = P(\{\text{Car}, \text{Bus}\} \mid C) \, P(\text{Car} \mid \{\text{Car}, \text{Bus}\}) = P(\{\text{Car}, \text{Bike}\} \mid C) \, P(\text{Car} \mid \{\text{Car}, \text{Bike}\}).$
The second part of the choice axiom can be interpreted in a different way. Luce (1959) has shown that (7) is a sufficient and necessary condition for the existence of a function $u : C \to \mathbb{R}^+$ such that, for all $a \in C$, we have

$P(a \mid C) = \frac{u(a)}{\sum_{b \in C} u(b)}. \qquad (11)$
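The Luce model, and the independence from the sequence of decisions it implies, can be checked numerically. The sketch below uses the mode choice example above; the utility values are illustrative assumptions.

```python
# Sketch of the Luce (1959) model: P(a | C) = u(a) / sum_b u(b).
def luce_probability(a, C, u):
    """Probability of choosing alternative a within choice set C."""
    return u[a] / sum(u[b] for b in C)

def subset_probability(A, C, u):
    """Probability of choosing some element of subset A within C."""
    return sum(u[b] for b in A) / sum(u[b] for b in C)

# Illustrative utility values for the mode choice example.
u = {"car": 2.0, "bus": 1.0, "bike": 1.0}
C = ["car", "bus", "bike"]

p_direct = luce_probability("car", C, u)
# Decide "motorized" first, then "car":
p_motor = subset_probability(["car", "bus"], C, u) * luce_probability("car", ["car", "bus"], u)
# Decide "private" first, then "car":
p_private = subset_probability(["car", "bike"], C, u) * luce_probability("car", ["car", "bike"], u)
```

All three computations yield the same probability of choosing car, as required by the choice axiom.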
Random utility models assume, as neoclassical economic theory, that the decision-maker has a perfect discrimination capability. In this context, however, the analyst is supposed to have incomplete information and, therefore, uncertainty must be taken into account. Manski (1977) identifies four different sources of uncertainty: unobserved alternative attributes, unobserved individual attributes (called ``unobserved taste variations'' by Manski), measurement errors and proxy, or instrumental, variables. The utility is modeled as a random variable in order to reflect this uncertainty. More specifically, the utility that individual $i$ associates with alternative $a$ is given by

$U_{ia} = V_{ia} + \varepsilon_{ia},$

where $V_{ia}$ is the deterministic part of the utility, and $\varepsilon_{ia}$ is the stochastic part, capturing the uncertainty. Similarly to the neoclassical economic theory, the alternative with the highest utility is supposed to be chosen. Therefore, the probability that alternative $a$ is chosen by decision-maker $i$ within choice set $C_i$ is

$P(a \mid C_i) = \Pr\left( U_{ia} \geq U_{ib}, \; \forall b \in C_i \right).$
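This maximum-utility probability can be approximated by simulation: draw the error terms, add them to the deterministic utilities, and count how often each alternative wins. The sketch below assumes i.i.d. Gumbel errors (an assumption that leads to the logit model discussed later) and illustrative utility values.

```python
import math
import random

def mc_choice_probabilities(V, n_draws=100_000, seed=1):
    """Monte Carlo estimate of P(a|C) = Pr(U_a >= U_b for all b),
    with U_a = V_a + eps_a and eps_a i.i.d. standard Gumbel (assumed)."""
    rng = random.Random(seed)
    counts = {a: 0 for a in V}
    for _ in range(n_draws):
        # Standard Gumbel draw via inverse CDF: eps = -ln(-ln(u)).
        U = {a: V[a] - math.log(-math.log(rng.random())) for a in V}
        counts[max(U, key=U.get)] += 1
    return {a: counts[a] / n_draws for a in V}

V = {"car": 1.0, "bus": 0.0}          # illustrative deterministic utilities
p = mc_choice_probabilities(V)
exact = math.exp(1.0) / (math.exp(1.0) + 1.0)  # binary logit value for comparison
```

The simulated frequency of "car" converges to the closed-form binary logit probability as the number of draws grows.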
Random utility models are the most used discrete choice models for transportation applications. Therefore, the rest of the paper is devoted to them.
For all practical purposes, the mean of the random term is usually supposed to be zero. It can be shown that this assumption is not restrictive. We illustrate it here on a simple example. Considering the binary example described in Figure 4, we denote the mean of the error term of each alternative by $m_1$ and $m_2$, respectively. Then, the error terms can be specified as

$\varepsilon_1 = m_1 + \varepsilon_1'$ and $\varepsilon_2 = m_2 + \varepsilon_2',$

where

$E[\varepsilon_1'] = 0$ and $E[\varepsilon_2'] = 0.$

The terms $m_1$ and $m_2$, called Alternative Specific Constants (ASC), capture the mean of the error terms. Therefore, it can be assumed, without loss of generality, that the error terms have zero mean if the model specification includes these ASCs. In practice, it is impossible to estimate the value of all ASCs from observed data. Considering again the example of Figure 4, the probability of choosing alternative 1, say, is not modified if an arbitrary constant $K$ is added to both utilities. Therefore, only the difference between the two ASCs can be identified. Indeed, from (17), we have

$P(1) = \Pr\left( V_1 + m_1 + \varepsilon_1' \geq V_2 + m_2 + \varepsilon_2' \right) = \Pr\left( V_1 + (m_1 + K) + \varepsilon_1' \geq V_2 + (m_2 + K) + \varepsilon_2' \right)$

for any $K$. If $K = -m_2$, we obtain a specification where the ASC of alternative 2 is zero and the ASC of alternative 1 is $m_1 - m_2$. Defining $K = -m_1$ produces the same result, with the roles of the two alternatives exchanged. This property can be generalized easily to models with more than two alternatives, where only differences between ASCs can be identified.
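The identification argument can be verified numerically: adding the same constant to both utilities leaves the choice probability unchanged, so only the difference of the ASCs matters. The sketch below uses a binary logit probability and illustrative values.

```python
import math

def binary_choice_probability(V1, V2):
    """Probability of alternative 1 in a binary logit (scale mu = 1),
    used here only to illustrate ASC identification."""
    return 1.0 / (1.0 + math.exp(-(V1 - V2)))

# Illustrative utilities with ASCs m1 = 0.3 and m2 = 0.7.
p = binary_choice_probability(1.5 + 0.3, 0.8 + 0.7)
# Shift both ASCs by K = -0.7: the second ASC becomes 0,
# the first becomes m1 - m2 = -0.4, and the probability is unchanged.
p_shifted = binary_choice_probability(1.5 - 0.4, 0.8 + 0.0)
```

Since the probability depends only on the utility difference, any common shift of the ASCs cancels out, which is why one ASC is conventionally constrained to zero.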
It is common practice to constrain one ASC in the model to zero. From a modeling viewpoint, the choice of the particular alternative whose ASC is constrained is purely arbitrary. However, Bierlaire, Lotan and Toint (1997) have shown that the estimation process is influenced by this choice. They propose a different technique of ASC specification which is optimal from an estimation perspective.

To derive assumptions about the variance of the random term, we observe that the scale of the utility may be arbitrarily specified. Indeed, for any $\alpha > 0$, we have

$\Pr(U_1 \geq U_2) = \Pr(\alpha U_1 \geq \alpha U_2).$

The arbitrary decision about $\alpha$ is equivalent to assuming a particular variance $v$ of the distribution of the error term. Indeed, if $\operatorname{Var}(\varepsilon) = v$, we have also

$\operatorname{Var}(\alpha \varepsilon) = \alpha^2 v. \qquad (21)$
We will illustrate this relationship with several examples in the remainder of this section. Once assumptions about the mean and the variance of the error term distribution have been made, the focus is now on the actual functional form of this distribution. We will consider here three different distributions, yielding three different families of models: linear, probit and logit models. The linear model is obtained from the assumption that the density function of the error term is given by

$f(\varepsilon) = \begin{cases} \frac{1}{2L} & \text{if } -L \leq \varepsilon \leq L, \\ 0 & \text{otherwise}, \end{cases}$

where $L > 0$ is an arbitrary constant. This density function is used to derive the probability of choosing one particular alternative. Considering the example presented in Figure 4, the probability is given by (23) (see Figure 5):

$P(1) = \begin{cases} 0 & \text{if } V_1 - V_2 \leq -L, \\ \dfrac{V_1 - V_2 + L}{2L} & \text{if } -L \leq V_1 - V_2 \leq L, \\ 1 & \text{if } V_1 - V_2 \geq L. \end{cases} \qquad (23)$
Figure 5: Linear model

The linear model presents some problems for real applications. First, the probability associated with extreme values of the error term ($|\varepsilon| > L$ in the example) is exactly zero. Therefore, if any extreme event happens in reality, the model will never capture it. Second, the discontinuity of the derivatives at $-L$ and $L$ causes problems for most estimation procedures. We conclude the presentation of the linear model by emphasizing that the constant $L$ determines the scale of the distribution. A common value for $L$ is $1/2$. Using (21), assuming $L = 1/2$ is equivalent to assuming that the variance of the error term is $v = L^2/3 = 1/12$.
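The piecewise-linear probability (23) is straightforward to implement; the sketch below follows the binary formula above with the common value $L = 1/2$ as default.

```python
def linear_binary_probability(V1, V2, L=0.5):
    """Binary 'linear' model: probability of alternative 1 as a
    piecewise-linear function of the utility difference (equation (23))."""
    d = V1 - V2
    if d <= -L:
        return 0.0   # extreme events get probability exactly zero
    if d >= L:
        return 1.0
    return (d + L) / (2.0 * L)
```

Note the two weaknesses discussed above: the probability saturates at exactly 0 or 1 outside $[-L, L]$, and the function has kinks at $\pm L$ where its derivative is discontinuous.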
The Normal Probability Unit, or Probit, model is derived from the assumption that the error terms are normally distributed, that is

$f(\varepsilon) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{\varepsilon}{\sigma} \right)^2},$

where $\sigma$ is an arbitrary constant. This density function is used to derive the probability of choosing one particular alternative. Considering the example presented in Figure 4, if $\varepsilon_1$ and $\varepsilon_2$ are normally distributed with zero mean, variances $\sigma_1^2$ and $\sigma_2^2$ and covariance $\sigma_{12}$, the probability is given by (25) (see Figure 6):

$P(1) = \Phi\left( \frac{V_1 - V_2}{\sigma} \right), \qquad (25)$

where $\Phi$ is the cumulative distribution function of the standard normal distribution and $\sigma^2 = \sigma_1^2 + \sigma_2^2 - 2\sigma_{12}$ is the variance of $\varepsilon_2 - \varepsilon_1$.

Figure 6: Probit model

The probit model is motivated by the Central Limit Theorem, assuming that the error terms are the sum of many independent unobserved quantities. Unfortunately, the probability function (25) has no closed analytical form, which limits the practical use of this model. We refer the reader to Daganzo (1979) for a comprehensive development of probit models. We conclude this short introduction of the probit model by looking at the scale parameter. Considering again the binary example of Figure 4 in the probit context, it is common practice to arbitrarily define $\sigma = 1$ which, using (21), is equivalent to assuming $v = 1$.
Despite its complexity, the probit model has been applied to many practical problems (see Whynes, Reed and Newbold, 1996, Bolduc, Fortin and Fournier, 1996, Yai, Iwakura and Morichi, 1997 among recent publications). However, the most widely used model in practical applications is probably the Logistic Probability Unit, or Logit, model. The error terms are now assumed to be independent and identically Gumbel distributed. The density function of the Gumbel distribution is given by (26) (see Figure 7):

$f(\varepsilon) = \mu \, e^{-\mu (\varepsilon - \eta)} \, e^{-e^{-\mu (\varepsilon - \eta)}}, \qquad (26)$

where $\eta$ is a location parameter and $\mu > 0$ is a scale parameter.
The Gumbel distribution is an approximation of the Normal law, as shown in Figure 8, where the plain line represents the Normal distribution, and the dotted line the Gumbel distribution.
Figure 8: Comparison between Normal and Gumbel distribution

We derive the probability function for the binary example of Figure 4 from the following property of the Gumbel distribution. If $\varepsilon_1$ is Gumbel distributed with location parameter $\eta_1$ and scale parameter $\mu$, and $\varepsilon_2$ is Gumbel distributed with location parameter $\eta_2$ and scale parameter $\mu$, then $\varepsilon = \varepsilon_2 - \varepsilon_1$ follows a Logistic distribution with location parameter $\eta_2 - \eta_1$ and scale parameter $\mu$ (the name of the Logit model comes from this property). The density function of the Logistic distribution is given by

$f(\varepsilon) = \frac{\mu \, e^{-\mu (\varepsilon - \eta)}}{\left( 1 + e^{-\mu (\varepsilon - \eta)} \right)^2},$

with $\eta = \eta_2 - \eta_1$. The probability of choosing alternative 1 is then

$P(1) = \frac{e^{\mu V_1}}{e^{\mu V_1} + e^{\mu V_2}},$

or, equivalently,

$P(1) = \frac{1}{1 + e^{-\mu (V_1 - V_2)}}.$
In order to determine the relationship between the scale parameter and the variance of the distribution, we compute the variance of the Logistic distribution, which is $\pi^2 / (3 \mu^2)$. It is common practice to arbitrarily define $\mu = 1$ which, using (21), is equivalent to assuming $v = \pi^2 / 3$.
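The binary logit probability and the scale-variance relationship can be written down directly:

```python
import math

def binary_logit(V1, V2, mu=1.0):
    """Binary logit: P(1) = 1 / (1 + exp(-mu * (V1 - V2)))."""
    return 1.0 / (1.0 + math.exp(-mu * (V1 - V2)))

def logistic_variance(mu):
    """Variance of the Logistic error difference with scale mu: pi^2 / (3 mu^2)."""
    return math.pi ** 2 / (3.0 * mu ** 2)

p = binary_logit(math.log(2.0), 0.0)   # utility difference ln 2 gives P(1) = 2/3
v = logistic_variance(1.0)             # mu = 1 corresponds to v = pi^2 / 3
```

Choosing $\mu = 1$ is thus not innocuous: it fixes the variance of the error difference at $\pi^2/3$, not at 1 as in the standard probit normalization.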
In most cases, the arbitrary decision about the scale parameter does not matter and can be safely ignored. But it is important not to completely forget its existence. Indeed, it may sometimes play an important role. For example, utilities derived from different models can be compared only if the scale is the same for all of them. This is usually not the case with the scale parameters commonly used in practice, as shown in Table 1. Namely, a utility estimated with a logit model has to be divided by $\pi / \sqrt{3}$ before being compared with a utility estimated with a probit model.
Table 1: Model comparison

The list of models presented above is not exhaustive. Other assumptions about the distribution of the error term lead to other families of models. For instance, Ben-Akiva and Lerman (1985) cite the arctan and the truncated exponential models. These models are not often used in practice and we will not consider them here.
The Multinomial Logit Model

The deterministic part of the utility is assumed to be a function of the attributes,

$V_{ia} = V(x_{ia}), \qquad (33)$

where $x_{ia}$ is a vector containing all attributes, both of individual $i$ and alternative $a$. The function defined in (33) is commonly assumed to be linear in the parameters, that is, if $n$ attributes are considered,

$V_{ia} = \sum_{j=1}^{n} \beta_j \, x_{ia,j},$

where $\beta_1, \dots, \beta_n$ are parameters to be estimated. This assumption simplifies the formulation and the estimation of the model, and is not as restrictive as it may seem. Indeed, nonlinear effects can still be captured in the attribute definitions, as mentioned in Section 2.3. Assuming that the error terms are independent and identically Gumbel distributed with scale parameter $\mu$, the probability that individual $i$ chooses alternative $a$ within the choice set $C_i$ is given by the multinomial logit model

$P(a \mid C_i) = \frac{e^{\mu V_{ia}}}{\sum_{b \in C_i} e^{\mu V_{ib}}}. \qquad (35)$
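The linear-in-parameters utility is a simple inner product; the coefficient and attribute values below are illustrative assumptions.

```python
def utility(beta, x):
    """Linear-in-parameters deterministic utility: V = sum_j beta_j * x_j."""
    return sum(b * xj for b, xj in zip(beta, x))

beta = [-0.05, -0.2]                 # illustrative coefficients for time and cost
V_car = utility(beta, [20.0, 3.0])   # 20 min travel time, cost 3
```

Nonlinear effects are obtained not by changing this formula but by transforming the attributes themselves (for example, passing the logarithm of travel time as an entry of `x`).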
The derivation of this result is attributed to Holman and Marley by Luce and Suppes (1965). We refer the reader to Ben-Akiva and Lerman (1985) and Anderson et al. (1992) for additional details. It is interesting to note that the multinomial logit model can also be derived from the choice axiom defined by (6) and (7). Indeed, defining $u(a) = e^{\mu V_{ia}}$, we have that (11) is equivalent to (35).
An important property of the multinomial logit model is the Independence from Irrelevant Alternatives (IIA). This property can be stated as follows: the ratio of the probabilities of any two alternatives is independent from the choice set. That is, for any choice sets $C_1$ and $C_2$ such that $C_1 \subseteq C_2$, and for any alternatives $a$ and $b$ in $C_1$, we have

$\frac{P(a \mid C_1)}{P(b \mid C_1)} = \frac{P(a \mid C_2)}{P(b \mid C_2)}.$
This result can be proven easily using (35). Ben-Akiva and Lerman (1985) propose an equivalent definition: The ratio of the choice probabilities of any two alternatives is entirely unaffected by the systematic utilities of any other alternatives. The IIA property of multinomial logit models is a limitation for some practical applications. This limitation is often illustrated by the red bus/blue bus paradox (see, for example, Ben-Akiva and Lerman, 1985) in the modal choice context. We prefer here the path choice example presented in Figure 9.
Figure 9: A path choice example

The probabilities provided by the multinomial logit model (35) for this example are

$P(1) = P(2a) = P(2b) = \frac{1}{3},$

which is not consistent with the intuitive result that $P(1)$ should be close to $1/2$, and $P(2a)$ and $P(2b)$ close to $1/4$. This situation appears in choice problems with significantly correlated alternatives, as is clearly the case in the example. Indeed, alternatives 2a and 2b are so similar that their utilities share many unobserved attributes of the path and, therefore, the assumption of independence of the random parts of these utilities is not valid in this context. The Nested Logit Model, presented in the next section, partly overcomes this limitation of the multinomial logit model.
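The counter-intuitive result can be reproduced with a few lines of multinomial logit code; path 1 and the two near-duplicate paths 2a and 2b are given equal deterministic utilities, as in the example.

```python
import math

def mnl(V, mu=1.0):
    """Multinomial logit probabilities, equation (35)."""
    e = {a: math.exp(mu * v) for a, v in V.items()}
    s = sum(e.values())
    return {a: e[a] / s for a in V}

# Path 1 versus two nearly identical variants 2a and 2b, all with equal utility.
p = mnl({"1": 0.0, "2a": 0.0, "2b": 0.0})
```

The model spreads the probability as 1/3, 1/3, 1/3, because it treats 2a and 2b as if they were as distinct from each other as they are from path 1.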
The nested logit model, first derived by Ben-Akiva (1973), is an extension of the multinomial logit model designed to capture correlations among alternatives. It is based on the partitioning of the choice set $C$ into $M$ nests $C_m$ such that

$C = \bigcup_{m=1}^{M} C_m$ and $C_m \cap C_n = \emptyset, \quad \forall m \neq n.$
The utility function of each alternative is composed of a term specific to the alternative, and a term associated with the nest. If $a \in C_m$, we have

$U_{ia} = V_{ia} + \varepsilon_{ia} + V_{C_m} + \varepsilon_{C_m}.$

The error terms $\varepsilon_{ia}$ and $\varepsilon_{C_m}$ are supposed to be independent. As for the multinomial logit model, the error terms $\varepsilon_{ia}$ are supposed to be independent and identically Gumbel distributed, with scale parameter $\mu_m$. The distribution of $\varepsilon_{C_m}$ is such that the random variable $\max_{b \in C_m} U_{ib}$ is Gumbel distributed with scale parameter $\mu$.
Each nest within the choice set is associated with a pseudo-utility, called composite utility, expected maximum utility, inclusive value or accessibility in the literature. The composite utility for nest $C_m$ is defined as

$V'_{C_m} = V_{C_m} + \frac{1}{\mu_m} \ln \sum_{b \in C_m} e^{\mu_m V_{ib}},$

where $V_{C_m}$ is the component of the utility which is common to all alternatives in the nest $C_m$. The probability of individual $i$ choosing alternative $a \in C_m$ is then given by

$P(a \mid C) = P(C_m \mid C) \, P(a \mid C_m),$

where

$P(C_m \mid C) = \frac{e^{\mu V'_{C_m}}}{\sum_{n=1}^{M} e^{\mu V'_{C_n}}}$

and

$P(a \mid C_m) = \frac{e^{\mu_m V_{ia}}}{\sum_{b \in C_m} e^{\mu_m V_{ib}}}.$

The scale parameters must verify condition (46), that is

$0 \leq \frac{\mu}{\mu_m} \leq 1. \qquad (46)$

Ben-Akiva and Lerman (1985) derive condition (46) directly from utility theory. Note also that if $\mu = \mu_m$ for all $m$, the nested logit model reduces to the multinomial logit model.
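The two-level probability decomposition above can be sketched directly. For simplicity, the sketch assumes the nest-specific utility components $V_{C_m}$ are zero; the nest structure and scale values below are illustrative.

```python
import math

def nested_logit(V, nests, mu=1.0, mu_m=None):
    """Nested logit probabilities: P(a) = P(C_m | C) * P(a | C_m).
    V: alternative -> deterministic utility; nests: nest -> list of alternatives;
    mu_m: nest -> within-nest scale. Assumes nest components V_{C_m} = 0."""
    if mu_m is None:
        mu_m = {m: 1.0 for m in nests}
    # Composite utility (inclusive value) of each nest.
    comp = {m: math.log(sum(math.exp(mu_m[m] * V[a]) for a in alts)) / mu_m[m]
            for m, alts in nests.items()}
    denom = sum(math.exp(mu * comp[m]) for m in nests)
    probs = {}
    for m, alts in nests.items():
        p_nest = math.exp(mu * comp[m]) / denom          # P(C_m | C)
        within = sum(math.exp(mu_m[m] * V[a]) for a in alts)
        for a in alts:
            probs[a] = p_nest * math.exp(mu_m[m] * V[a]) / within  # * P(a | C_m)
    return probs

# Path choice example: nest {1} and nest {2a, 2b}, equal utilities, mu/mu_m = 1/4.
p = nested_logit({"1": 0.0, "2a": 0.0, "2b": 0.0},
                 {"A": ["1"], "B": ["2a", "2b"]},
                 mu=1.0, mu_m={"A": 1.0, "B": 4.0})
```

With $\mu/\mu_m = 1/4$ for the correlated nest, path 1 receives probability $1/(1 + 2^{1/4}) \approx 0.457$, closer to the intuitive $1/2$ than the multinomial logit value $1/3$, while condition (46) is respected.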
The parameters $\mu$ and $\mu_m$ are closely related in the model. Actually, only their ratio is meaningful, and it is not possible to identify them separately. A common practice is to arbitrarily constrain one of them to a value (usually 1). The impacts of this arbitrary decision on the model are briefly discussed in Section 5.1.

We illustrate here the Nested Logit Model with the path choice example described in Figure 9. First, the choice set $C = \{1, 2a, 2b\}$ is divided into $C_1 = \{1\}$ and $C_2 = \{2a, 2b\}$. The deterministic components of the utilities are all equal, say $V_1 = V_{2a} = V_{2b} = V$. The composite utilities of each nest are

$V'_{C_1} = V$ and $V'_{C_2} = V + \frac{\ln 2}{\mu_m},$

where the value of $\mu$ has been assumed to be 1, without loss of generality. The probability of each alternative is then computed. We obtain

$P(1) = \frac{1}{1 + 2^{\mu/\mu_m}}$ and $P(2a) = P(2b) = \frac{1}{2} \cdot \frac{2^{\mu/\mu_m}}{1 + 2^{\mu/\mu_m}}.$

When the ratio $\mu/\mu_m$ equals 1, the nested logit model produces the same results as the multinomial logit model (37), and all probabilities are $1/3$. On the other hand, when $\mu_m$ goes to infinity, the ratio $\mu/\mu_m$ goes to 0, and the probability of each nest gets closer and closer to $1/2$. At the limit, the model becomes a binary choice model, where the small detours a and b are ignored in the choice process.
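The behavior of the path choice example between the two limiting cases can be checked numerically; the function below assumes equal deterministic utilities, as in the example.

```python
def p_path1(ratio):
    """Probability of path 1 in the path choice example, as a function of
    the ratio mu / mu_m, under equal deterministic utilities:
    P(1) = 1 / (1 + 2 ** ratio)."""
    return 1.0 / (1.0 + 2.0 ** ratio)

p_mnl_limit = p_path1(1.0)     # ratio 1: the multinomial logit result, 1/3
p_binary_limit = p_path1(0.0)  # ratio -> 0: the binary result, 1/2
```

Between these limits, the ratio $\mu/\mu_m$ acts as a dial controlling how strongly the model treats 2a and 2b as a single option.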
A model where the scale parameter $\mu$ is arbitrarily constrained to 1 is said to be ``normalized from the top''. A model where one of the parameters $\mu_m$ is constrained to 1 is said to be ``normalized from the bottom''. The latter may produce a simpler formulation of the model, as illustrated by the example of Figure 11, where the nest parameters are rescaled to obtain an equivalent specification normalized from the bottom.
This formulation, proposed by Daly (1987), simplifies the estimation process. For this reason, it has been adopted in estimation packages like ALOGIT (Daly, 1987) or HieLoW (Bierlaire, 1995, Bierlaire and Vandevyvere, 1995). We emphasize here that this formulation should be used with caution when the same parameters are present in more than one nest. In this case, specific techniques, inspired from the artificial trees proposed by Bradley and Daly (1991), must be used to obtain a correct specification of the model. The description of these techniques is outside the scope of this paper.

A direct extension of the nested logit model consists in partitioning some or all nests into sub-nests, which can, in turn, be divided into sub-nests. Because of the complexity of these models, their structure is usually represented as a tree, as suggested by Daly (1987). Clearly, the number of potential structures, reflecting the correlation among alternatives, can be very large. No technique has been proposed thus far to identify the most appropriate correlation structure directly from the data.

We conclude our introduction of nested logit models by mentioning their limitations. These models are designed to capture choice problems where alternatives within each nest are correlated. No correlation across nests can be captured by the Nested Logit Model. When alternatives cannot be partitioned into well separated nests to reflect their correlation, Nested Logit Models are not applicable. This is the case for most route choice problems. Several models within the ``logit family'' have been designed to capture specific correlation structures. For example, Cascetta (1996) captures overlapping paths in a route choice context using commonality factors, Koppelman and Wen (1997) capture correlation between pairs of alternatives, and Vovsha (1997) proposes a cross-nested model allowing alternatives to belong to more than one nest.
The two last models are derived from the Generalized Extreme Value model, presented in the next section.
The Generalized Extreme Value Model

The Generalized Extreme Value (GEV) model, proposed by McFadden (1978), is derived from a generating function $G : \mathbb{R}_+^J \to \mathbb{R}_+$, where $J$ is the number of alternatives, with the following properties:

1. $G(y) \geq 0$ for all $y \in \mathbb{R}_+^J$;

2. $G$ is homogeneous of degree $\mu > 0$;

3. $\lim_{y_i \to +\infty} G(y_1, \dots, y_i, \dots, y_J) = +\infty$, for all $i = 1, \dots, J$;

4. the $k$th partial derivative of $G$ with respect to $k$ distinct $y_i$ is non-negative if $k$ is odd, and non-positive if $k$ is even.

The probability of choosing alternative $a$ within the choice set is then

$P(a \mid C) = \frac{y_a \, \dfrac{\partial G}{\partial y_a}(y_1, \dots, y_J)}{\mu \, G(y_1, \dots, y_J)}, \quad \text{with } y_b = e^{V_b}.$

As an example, we consider

$G(y) = \sum_{i=1}^{J} y_i^{\mu},$

which yields the multinomial logit model. Similarly, the nested logit model can be derived with

$G(y) = \sum_{m=1}^{M} \left( \sum_{i \in C_m} y_i^{\mu_m} \right)^{\mu / \mu_m}.$
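The GEV probability formula can be checked on the multinomial logit generating function; the utility values below are illustrative.

```python
import math

def gev_probability(V, G, dG, mu):
    """GEV choice probabilities:
    P(a) = y_a * dG/dy_a (y) / (mu * G(y)), with y_a = exp(V_a)."""
    y = [math.exp(v) for v in V]
    g = G(y)
    return [y[i] * dG(y, i) / (mu * g) for i in range(len(y))]

# The MNL generating function G(y) = sum_i y_i ** mu (homogeneous of degree mu).
mu = 2.0
G = lambda y: sum(yi ** mu for yi in y)
dG = lambda y, i: mu * y[i] ** (mu - 1)

p = gev_probability([1.0, 0.0], G, dG, mu)
```

For this $G$, the formula reduces analytically to $e^{\mu V_a} / \sum_b e^{\mu V_b}$, so the numerical output coincides with the logit probability with scale $\mu$.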
The Generalized Extreme Value model provides a nice theoretical framework for the development of new discrete choice models, such as those of Koppelman and Wen (1997) and Vovsha (1997).
Conclusion
We have covered in this paper the main theoretical aspects of discrete choice models in general, and random utility models in particular. A good awareness of the underlying assumptions is necessary for an efficient use of these models in practical applications. In particular, we have focused on the location parameters and the scale parameters in multinomial and nested logit models. Despite their importance, the role of these parameters tends to be underestimated by practitioners. This may lead to incorrect specifications of the models, or incorrect interpretation of the results.
Acknowledgments
This paper is based on a lecture given at the NATO Advanced Studies Institute on Operations Research and Decision Aid Methodologies in Traffic and Transportation Management, Balatonfüred, Hungary, March 1997. Comments from the students and other lecturers of the ASI have been very useful in writing this paper. Moreover, I am very grateful to Moshe Ben-Akiva and John Bowman for their valuable discussions and comments.
References
1. Simon P. Anderson, André de Palma, and Jacques-François Thisse. Discrete Choice Theory of Product Differentiation. MIT Press, Cambridge, Ma, 1992.
2. M. E. Ben-Akiva. Structure of passenger travel demand models. PhD thesis, Department of Civil Engineering, MIT, Cambridge, Ma, 1973.
3. M. E. Ben-Akiva and S. R. Lerman. Discrete Choice Analysis: Theory and Application to Travel Demand. MIT Press, Cambridge, Ma, 1985.
4. Moshe Ben-Akiva and B. François. Homogeneous generalized extreme value model. Working paper, Department of Civil Engineering, MIT, Cambridge, Ma, 1983.
5. M. Bierlaire. A robust algorithm for the simultaneous estimation of hierarchical logit models. GRT Report 95/3, Department of Mathematics, FUNDP, 1995.
6. M. Bierlaire, T. Lotan, and Ph. L. Toint. On the overspecification of multinomial and nested logit models due to alternative specific constants. Transportation Science, 1997 (forthcoming).
7. M. Bierlaire and Y. Vandevyvere. HieLoW: the interactive user's guide. Transportation Research Group, FUNDP, Namur, 1995.
8. Denis Bolduc, Bernard Fortin, and Marc-André Fournier. The effect of incentive policies on the practice location of doctors: a multinomial probit analysis. Journal of Labor Economics, 14(4):703, 1996.
9. M. A. Bradley and A. J. Daly. Estimation of logit choice models using mixed stated preferences and revealed preferences information. In Methods for Understanding Travel Behaviour in the 1990's, pages 116-133, Québec, May 1991. International Association for Travel Behaviour, 6th international conference on travel behaviour.
10. Ennio Cascetta. A modified logit route choice model overcoming path overlapping problems: specification and some calibration results for interurban networks. In Proceedings of the 13th International Symposium on the Theory of Road Traffic Flow (Lyon, France), 1996.
11. C. F. Daganzo. Multinomial Probit: The Theory and its Application to Demand Forecasting. Academic Press, New York, 1979.
12. A. Daly. Estimating ``tree'' logit models. Transportation Research B, 21(4):251-268, 1987.
13. D. A. Hensher and L. W. Johnson. Applied Discrete Choice Modelling. Croom Helm, London, 1981.
14. J. L. Horowitz, F. S. Koppelman, and S. R. Lerman. A self-instructing course in disaggregate mode choice modeling. Technology Sharing Program, US Department of Transportation, Washington, D.C., 1986.
15. F. S. Koppelman and Chieh-Hua Wen. The paired combinatorial logit model: properties, estimation and application. Transportation Research Board, 76th Annual Meeting, Washington, DC, January 1997. Paper #970953.
16. R. Luce. Individual Choice Behavior: A Theoretical Analysis. J. Wiley and Sons, New York, 1959.
17. R. D. Luce and P. Suppes. Preference, utility and subjective probability. In R. D. Luce, R. R. Bush, and E. Galanter, editors, Handbook of Mathematical Psychology. J. Wiley and Sons, New York, 1965.
18. C. Manski. The structure of random utility models. Theory and Decision, 8:229-254, 1977.
19. Andrey Andreyevich Markov. Calculation of Probabilities. Tip. Imperatorskoi Akademii Nauk, St. Petersburg, 1900 (in Russian).
20. D. McFadden. Modelling the choice of residential location. In A. Karlquist et al., editors, Spatial Interaction Theory and Residential Location, pages 75-96. North-Holland, Amsterdam, 1978.
21. J. Swait. Probabilistic choice set formation in transportation demand models. PhD thesis, Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Ma, 1984.
22. A. Tversky. Elimination by aspects: a theory of choice. Psychological Review, 79:281-299, 1972.
23. Peter Vovsha. Cross-nested logit model: an application to mode choice in the Tel-Aviv metropolitan area. Transportation Research Board, 76th Annual Meeting, Washington, DC, January 1997. Paper #970387.
24. D. K. Whynes, G. Reed, and P. Newbold. General practitioners' choice of referral destination: a probit analysis. Managerial and Decision Economics, 17(6):587, 1996.
25. T. Yai, S. Iwakura, and S. Morichi. Multinomial probit with structured covariance for route choice behavior. Transportation Research B, 31(3):195-208, June 1997.
Chapter 5
Discrete Dependent Variable Models
Examples: An analyst wants to model:
1. The effect of household member characteristics, transportation network characteristics, and alternative mode characteristics on choice of transportation mode: bus, walk, auto, carpool, single-occupant auto, rail, or bicycle.
2. The effect of consumer characteristics on choice of vehicle purchase: sport utility vehicle, van, auto, light pickup truck, or motorcycle.
3. The effect of traveler characteristics and employment characteristics on airline carrier choice: Delta, United Airlines, Southwest, etc.
4. The effect of involved vehicle types, pre-crash conditions, and environmental factors on vehicle crash outcome: property damage only, mild injury, severe injury, fatality.
What methods are used for fixing serially correlated errors? What can be done to deal with multicollinearity? What is endogeneity and how can it be fixed? How does one know if the errors are Gumbel distributed?
2. Characteristics of the journey or activity: journey or activity purpose (work, grocery shopping, school, etc.), time of day, accessibility and proximity of activity destination.
3. Characteristics of the transport facility:
   - Qualitative factors: comfort and convenience, reliability and regularity, protection, security.
   - Quantitative factors: in-vehicle travel times, waiting and walking times, out-of-pocket monetary costs, availability and cost of parking, proximity/accessibility of transport mode.
1. The set of choices or classifications must be finite.
2. The set of choices or classifications must be mutually exclusive; that is, a particular outcome can be represented by only one choice or classification.
3. The set of choices or classifications must be collectively exhaustive; that is, all possible outcomes must be represented by the choice set or classification scheme.
Even when the 2nd and 3rd criteria are not met, the analyst can usually re-define the set of alternatives or classifications so that the criteria are satisfied. Planning Example: An analyst wishing to model mode choice for commute decisions defines the choice set as AUTO, BUS, RAIL, WALK, and BIKE. The modeler observed a person in the database who drove her personal vehicle to the transit station and then took a bus, violating the second criterion. To remedy this modeling problem and similar problems that might arise, the analyst introduces some new choices (or classifications) into the modeling process: AUTO-BUS, AUTO-RAIL, WALK-BUS, WALK-RAIL, BIKE-BUS, BIKE-RAIL. By introducing these new categories the analyst has made the discrete choice data comply with the stated modeling requirements.
1. An individual is faced with a finite set of choices from which only one can be chosen.
2. Individuals belong to a homogeneous population, act rationally, possess perfect information, and always select the option that maximizes their net personal utility.
3. If C is defined as the universal choice set of discrete alternatives, and J the number of elements in C, then each member of the population has some subset of C as his or her choice set. Most decision-makers, however, face some subset Cn that is considerably smaller than C. It should be recognized that defining the feasible choice set Cn for an individual is not a trivial task; however, it is assumed that it can be determined.
4. Decision-makers are endowed with a subset of attributes xn ∈ X, the set of all measured attributes relevant to the decision-making process.
Planning Example: In identifying the choice set of travel mode, the analyst identifies the universal choice set C to consist of the following:
1. driving alone
2. sharing a ride
3. taxi
4. motorcycle
5. bicycle
6. walking
7. transit bus
8. light rail transit
The analyst identifies a family whose choice set is fairly restricted because they do not own a vehicle, so their choice set Cn is given by:
1. sharing a ride
2. taxi
3. bicycle
4. walking
5. transit bus
6. light rail transit
The modeler, who is an OBSERVER of the system, does not possess complete information about all elements considered important in the decision-making process by all individuals making a choice, so utility is broken down into two components, V and ε:
Uin = Vin + εin

where Uin is the overall utility of choice i for individual n; Vin is the systematic or measurable utility, a function of xn and i, for individual n and choice i; and εin is the random utility component, which includes idiosyncrasies and taste variations, combined with measurement or observation errors made by the modeler.
The error term allows for a couple of important cases: 1) two persons with the same measured attributes and facing the same choice set make different decisions; 2) some individuals do not select the best alternative (which, from the modeler's point of view, appears to be irrational behavior).
The decision-maker n chooses the alternative from which he derives the greatest utility. In the binomial or two-alternative case, the decision-maker chooses alternative 1 if and only if:

U1n ≥ U2n, that is, V1n + ε1n ≥ V2n + ε2n

If ε = ε2 − ε1, the difference in unobserved utilities between alternatives 2 and 1 for travelers 1 through N (subscript not shown), then the probability distribution or density of ε, f(ε), can be specified to form specific classes of models.
A couple of important observations about the choice probability given by F(V1 − V2) can be made.
1. Errors are unlikely when there are large differences in systematic utility between alternatives one and two.
2. Large errors are likely when differences in utility are small; thus decision-makers are more likely to choose an alternative on the "wrong" side of the indifference line (V1 − V2 = 0). Alternative 1 is more likely to be chosen when V1 − V2 > 0 (i.e., when V1 > V2), and alternative 2 when V1 − V2 < 0.
Thus, for binomial models of discrete choice:

Pn(1) = P(ε ≤ V1 − V2) = F(V1 − V2)

where F is the cumulative distribution function of ε = ε2 − ε1.
This structure for the error term is a general result for binomial choice models. By making assumptions about the probability density of the residuals, the modeler can choose between several different binomial choice model formulations. Two types of binomial choice models are most common in practice: the logit and the probit models. The logit model assumes a logistic distribution of errors, and the probit model assumes normally distributed errors. These models in their binomial form, however, are not practical when there are more than two alternatives, and the probit model is not easy to estimate (mathematically) for more than 4 to 5 choices.
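The two binomial forms can be sketched directly from their assumed error distributions. This is a minimal illustration, not part of the text's example; the utilities passed in are hypothetical:

```python
import math

def binary_logit_prob(v1, v2):
    """P(choose 1) under a logistic distribution for the error difference: F(V1 - V2)."""
    return 1.0 / (1.0 + math.exp(-(v1 - v2)))

def binary_probit_prob(v1, v2):
    """P(choose 1) under a standard normal error difference: Phi(V1 - V2)."""
    return 0.5 * (1.0 + math.erf((v1 - v2) / math.sqrt(2.0)))

# When systematic utilities are equal, both models predict indifference (0.5);
# as V1 - V2 grows, both probabilities move toward 1.
p_logit = binary_logit_prob(1.2, 1.2)
p_probit = binary_probit_prob(1.2, 1.2)
```

Both functions evaluate the cumulative distribution of the error difference at V1 − V2, which is exactly the F(V1 − V2) expression above.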
The maximum of L is found by differentiating the function with respect to each of the betas and setting the partial derivatives equal to zero, yielding the values of β1, ..., βK that maximize L. In many cases the log likelihood function is globally concave, so that if a solution to the first-order conditions exists, it is unique. This does not always have to be the case, however. Under general conditions the maximum likelihood estimators can be shown to be consistent, asymptotically efficient, and asymptotically normal. In more complex and realistic models, the likelihood function is evaluated as before, but instead of estimating one parameter there are many parameters associated with the Xs that must be estimated, and there are as many first-order equations as there are parameters to solve. In practice the probabilities that maximize the likelihood function are likely to differ across individuals (unlike the simplified example below, where all individuals have the same probability). Because the likelihood function lies between 0 and 1, the log likelihood function is negative. The maximum of the log likelihood function, therefore, is the negative value smallest in magnitude attainable given the data and specified probability functions.
Planning Example. Suppose 10 individuals making travel choices between auto (A) and transit (T) were observed. All travelers are assumed to possess identical attributes (a really poor assumption), and so the probabilities are not functions of betas but simply a function of p, the probability of choosing auto. The analyst also does not have any alternative specific attributes: a very naive model that doesn't reflect reality. The likelihood function will be: L* = p^x (1 − p)^(n−x) = p^7 (1 − p)^3, where p = probability that a traveler chooses A; 1 − p = probability that a traveler chooses T; n = number of travelers = 10; and x = number of travelers choosing A.
Recall that the analyst is trying to estimate p, the probability that a traveler chooses A. If 7 travelers were observed taking A and 3 taking T, then it can be shown that the maximum likelihood estimate of p is 0.7; in other words, the value of L* is maximized when p = 0.7 and 1 − p = 0.3. All other combinations of p and 1 − p result in lower values of L*. To see this, the analyst plots values of L* for values of p ranging from 0.0 to 1.0; the resulting curve peaks at p = 0.7.
Similarly (and in practice), one could use the log likelihood function to derive the maximum likelihood estimates, where L = log(L*) = log[p^7 (1 − p)^3] = log p^7 + log (1 − p)^3 = 7 log p + 3 log(1 − p).
Note that in this simple model p is the only parameter being estimated, so maximizing the likelihood function L* or log(L*) only requires one first-order condition: the derivative of log(L*) with respect to p, set equal to zero.
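The first-order condition dL/dp = 7/p − 3/(1 − p) = 0 gives p = 0.7 analytically; a simple grid search over the log likelihood of the example confirms it numerically:

```python
import math

def log_likelihood(p, x=7, n=10):
    # L = x log p + (n - x) log(1 - p) for the sample of 10 travelers, 7 choosing auto
    return x * math.log(p) + (n - x) * math.log(1.0 - p)

# Evaluate L on a grid of candidate probabilities and pick the maximizer.
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=log_likelihood)
```

The grid maximizer agrees with the analytic solution p = 0.7.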
Where:
1. Utility for traveler n and mode i is Uin = Vin + εin, and Pn(i) is the probability that traveler n chooses mode i.
2. Pn(i) = exp(Vin) / Σ exp(Vjn), with the sum taken over all j in Cn.
3. The numerator is the exponentiated utility of mode i for traveler n; the denominator is the sum of exponentiated utilities over all alternative modes in Cn for traveler n.
4. The disturbances εin are independently distributed.
5. The disturbances εin are identically distributed.
6. The disturbances εin are Gumbel distributed with location parameter η and scale parameter μ > 0.
The MNL model expresses the probability that a specific alternative is chosen as the exponentiated utility of that alternative divided by the sum of the exponentiated utilities of all alternatives (chosen and not chosen). The predicted probabilities are bounded by zero and one. There are several assumptions embedded in the estimation of MNL models.
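The MNL probability expression can be sketched as a small function; the systematic utilities below are hypothetical:

```python
import math

def mnl_probs(utilities):
    """MNL choice probabilities: exp(Vi) / sum_j exp(Vj)."""
    m = max(utilities)                       # subtract the max for numerical stability
    expu = [math.exp(v - m) for v in utilities]
    s = sum(expu)
    return [e / s for e in expu]

# Hypothetical systematic utilities for, say, auto, bus, and rail.
probs = mnl_probs([-0.5, -1.2, -1.8])
```

The probabilities always sum to one and preserve the ordering of the utilities, consistent with the properties described above.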
Pn(i) / Pn(j) = exp(Vin) / exp(Vjn) = exp(Vin − Vjn) = exp[β′(xin − xjn)]

where xin and xjn are vectors describing the attributes of alternatives i and j as well as attributes of traveler n.
Note that the ratio of probabilities of modes i and j for individual n is unaffected by "irrelevant" alternatives in Cn. One way to pose the IIA problem is to explain the red bus/blue bus paradox. Assume that the initial choice probabilities for an individual are as follows:
P(auto) = P(A) = 70%
P(blue bus) = P(BB) = 20%
P(rail) = P(R) = 10%
By the IIA assumption: P(A)/P(BB) = 70/20 = 3.5, and P(R)/P(BB) = 10/20 = 0.5. Assume that a red bus is introduced with all the same attributes as those of the blue bus (i.e., it is indistinguishable from the blue bus except for color, an unobserved attribute). Since the probabilities of the red bus and blue bus must be equal, the original ratios of alternatives must be retained (IIA), and the total probability of all choices must sum to one, the following is obtained: P(A) = 58.3%, P(BB) = 16.7%, P(RB) = 16.7%, P(R) = 8.3%. If one attempts an alternate solution where the original bus share is split between RB and BB and the correct ratios are retained, one obtains the same answer as previously.
This is an unrealistic forecast by the logit model, since the individual is forecast to use buses more than before, and auto and rail less, despite the fact that a new mode with new attributes has not been introduced. In reality, one would not expect the probability of auto to decline, because for traveler n a new alternative has not been introduced. In estimating MNL models, the analyst must be cautious of cases similar to the red-bus/blue-bus problem, in which mode share should decrease by a factor for each of the similar alternatives.
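The red bus/blue bus arithmetic can be verified directly. Letting x = P(BB) and imposing the IIA ratios plus P(RB) = P(BB):

```python
# Initial shares: auto 0.70, blue bus 0.20, rail 0.10.
# IIA forces the ratios P(A)/P(BB) = 3.5 and P(R)/P(BB) = 0.5 to survive,
# and the identical red bus must satisfy P(RB) = P(BB).
# With x = P(BB): 3.5x + x + x + 0.5x = 1  =>  x = 1/6.
x = 1.0 / 6.0
p_auto, p_bb, p_rb, p_rail = 3.5 * x, x, x, 0.5 * x

# Combined bus share rises from 0.20 to 1/3, illustrating the paradox.
total_bus = p_bb + p_rb
```

The computed shares (about 58.3%, 16.7%, 16.7%, 8.3%) show auto and rail losing probability to a "new" mode that offers nothing new.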
The IIA restriction does not apply to the population as a whole. That is, it does not restrict the shares of the population choosing any two alternatives to be unaffected by the utilities of other alternatives. The key to understanding this distinction is that IIA holds within homogeneous market segments, but across market segments unobserved attributes vary, and thus the IIA property does not hold for a population of individuals. An MNL model therefore is appropriate if the systematic component of utility accounts for heterogeneity across individuals. In general, models with many socio-economic variables have a better chance of not violating IIA. When IIA does not hold, there are various methods that can be used to get around the problem, such as nested logit and probit models.
Elasticities of MNL
The analyst can use coefficients estimated in logit models to determine both disaggregate and aggregate elasticities, as well as cross-elasticities.
For example, assume that individual 18 (an observation in observed data) has an auto travel time of 51.0 minutes and transit travel time of 85.0 minutes. For this individual, the probability of choosing auto is given by plugging auto travel time and transit travel time into the MNL estimated on the sample of data:
This individual's direct elasticity of auto choice probability with respect to auto travel time is calculated to obtain:
Thus for an additional minute of travel time in the auto, there would be a decrease of 0.39% in auto usage for this individual. Of course this is a statistical result, which suggests that over repeated choice occasions the decision-maker would use the auto about one time less in 100 for every 3 minutes of additional travel time.
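The calculation follows the standard MNL direct point elasticity, E = βk · xk · (1 − Pn(i)). The coefficient and choice probability below are hypothetical stand-ins (the actual estimates are not shown in this excerpt), chosen only to land near the quoted −0.39:

```python
def mnl_direct_elasticity(beta_k, x_k, p_i):
    """Direct point elasticity of P(i) with respect to attribute x_k:
    E = beta_k * x_k * (1 - P(i))."""
    return beta_k * x_k * (1.0 - p_i)

# Hypothetical values: time coefficient -0.012/min, auto time 51.0 min, P(auto) = 0.36.
e = mnl_direct_elasticity(-0.012, 51.0, 0.36)
```

With these illustrative inputs the elasticity is about −0.39: a 1% increase in auto travel time reduces the auto choice probability by about 0.39%.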
Disaggregate Cross-Elasticities
If the analyst were instead interested in the effect that travel time to the bus stop had on choosing another mode j, say auto, where Pn(Auto) = 0.45, the analyst could simply set up a cross-elasticity as follows:

which suggests that the deterministic utility of auto for traveler n increases by 22.5 given a unit increase of 1 minute of travel time to the bus stop.
The disaggregate elasticity is the elasticity of the probability that a subgroup will choose mode i with respect to a unit or incremental change in variable k. For example, this could be used to predict the change in mode share for the transit-riding group if transit fare increased by one unit.
Estimation of MNL models leads to fairly standard output from estimation programs. In general program output can be obtained showing coefficient estimates, model goodness of fit, elasticities, and various other aspects of model fitting.
Model Coefficients
There are several rules to consider when interpreting the coefficients in the MNL model.
1. Alternative specific constants can only be included in MNL models for n − 1 alternatives. Characteristics of decision-makers, such as socio-economic variables, must be entered as alternative specific. Characteristics of the alternatives themselves, such as costs of different choices, can be entered in MNL models as generic or as alternative specific.
2. Variable coefficients only have meaning in relation to each other; there is no absolute interpretation of coefficients. In other words, the absolute magnitude of coefficients is not interpretable as it is in ordinary least squares regression models.
3. Alternative specific constants, like the intercept in regression, allow some flexibility in the estimation process and generally should be left in the model, even if they are not significant.
4. Like model coefficients in regression, the probability statements made in relation to t-statistics are conditional. For example, most computer programs provide t-statistics giving the probability that the data were observed given that the true coefficient value is zero. This is different from the probability that the coefficient is zero given the data.

Planning Example: A binary logit model was estimated on data from Washington, D.C. (see Ben-Akiva and Lerman, 1985). The following table (adapted from Ben-Akiva and Lerman) shows the model results: coefficient estimates, asymptotic standard errors, and asymptotic t-statistics. These t-statistics represent the t-value (which corresponds to some probability) for testing whether the true model parameter is equal to zero. Recall that the critical values of t for a two-tailed test are 1.65 and 1.96 for the 0.90 and 0.95 confidence levels, respectively.

Variable Name                        Coef. Estimate   Standard Error   t-statistic
Auto constant                        1.45             0.393            3.70
In-vehicle time (min)                -0.00897         0.0063           -1.42
Out-of-vehicle time (min)            -0.0308          0.0106           -2.90
Out-of-pocket cost*                  -0.0115          0.00262          -4.39
Transit fare**                       -0.00708         0.00378          -1.87
Auto ownership*                      0.770            0.213            3.16
Downtown workplace* (indicator)      -0.561           0.306            -1.84
* auto specific variable; ** transit specific variable

Inspection of the estimation results suggests that, all else being equal, the auto is the preferred alternative, since the alternative specific constant for auto is positive. Note that only one alternative specific constant is entered in the model. Also, all but one of the variables are statistically significant at the 10% level of significance. The model shows that for an additional minute of in-vehicle travel time, the utility of that mode decreases. Since the variable is entered as generic, it reflects the effect of in-vehicle travel time in either transit or auto. It might be believed that travelers do not have the same response to travel time by mode, and so this variable could be entered as alternative specific. The model shows that out-of-vehicle time, entered as generic, is about three times more influential on utility than in-vehicle time. The model shows that travelers are sensitive to travel costs: utility for transit decreases as transit fare increases, and utility for auto decreases as out-of-pocket costs increase. Notice that auto riders are approximately twice as sensitive to travel costs as transit riders. Owning a vehicle provides greater utility for taking the auto, as one would expect. Working downtown actually reduces the utility of the auto; presumably the downtown is easily accessed via transit, and the impedance to downtown via auto is great. Although part of the impedance may be cost and travel time, which have already been captured, there may be additional impedance due to availability and cost of parking, safety, and other factors.
Refine models
Assess Goodness of Fit
There are several goodness-of-fit measures available for testing how well an MNL model fits the data on which it was estimated. The likelihood ratio test is a generic test that can be used to compare models with different levels of complexity. Let L(β) be the maximum log likelihood attained with the estimated parameter vector β, on which no constraints have been imposed. Let L(βc) be the maximum log likelihood attained with constraints applied to a subset of the coefficients in β. Then, asymptotically (i.e., for large samples), −2[L(βc) − L(β)] has a chi-square distribution with degrees of freedom equal to the number of constrained coefficients. Thus the above statistic, called the likelihood ratio, can be used to test the null hypothesis that two different models perform approximately the same (in explaining the data). If there is insufficient evidence to support the more complex model, then the simpler model is preferred. For large differences in log likelihood there is evidence to support preferring the more complex model to the simpler one. In the context of discrete choice analysis, two standard tests are often provided. The first test compares a model estimated with all variables suspected of influencing the choice process to a model that has no coefficients whatsoever, a model that predicts equal probability for all choices. The test statistic is given by:

−2[L(0) − L(β)] ~ χ², df = total number of coefficients

The null hypothesis H0 is that all coefficients are equal to 0, i.e., that all alternatives are equally likely to be chosen. L(0) is the log likelihood computed when all coefficients, including alternative specific constants, are constrained to be zero, and L(β) is the log likelihood computed with no constraints on the model.
A second test compares a complex model with another naive model; this naive model contains alternative specific constants for n − 1 alternatives and predicts choice probabilities based on the observed market shares of the respective alternatives. The test statistic is given by:

−2[L(C) − L(β)] ~ χ², df = number of constrained (slope) coefficients

The null hypothesis H0 is that all coefficients are zero except the alternative specific constants. L(C) is the log likelihood computed when all slope coefficients are constrained to be zero except the alternative specific constants. In general, the analyst can conduct specification tests to compare full models versus reduced models using the chi-square test as follows:

−2[L(R) − L(F)] ~ χ², df = total number of restrictions = KF − KR

In logit models there is a statistic similar to R-squared in regression, called the pseudo coefficient of determination, ρ². The pseudo coefficient of determination is meant to convey information similar to R-squared in regression: the larger ρ² is, the larger the proportion of the log likelihood "explained" by the parameterized model, where ρ² = 1 − L(β)/L(0).
The ρ²c statistic is the ρ² statistic corrected for the number of parameters estimated, K, and is given by:

ρ²c = 1 − [L(β) − K] / L(0)

Planning Example Continued: Consider again the binary logit model estimated on Washington, D.C., data. The summary statistics for the model are provided below (adapted from Ben-Akiva and Lerman, 1985).

Summary Statistics for Washington, D.C., Binary Logit Model
Number of model parameters = 7
L(0) = -1023
L(β) = -374.4
-2[L(0) - L(β)] = 1371.7
χ²(7, 0.95) = 14.07
ρ² = 0.660
ρ²c = 0.654

The summary statistics show the log likelihood values for the naive model with zero parameters and for the model with the seven parameters discussed previously. Clearly, the model with seven parameters has a larger log likelihood than the naive model, and in fact the likelihood ratio test suggests that this difference is statistically significant. The ρ² and ρ²c suggest that about 65% of the log likelihood is explained by the seven-parameter model. This interpretation of ρ² should be used loosely, as it is not strictly correct. A more useful application of ρ²c would be to compare it to that of a competing model estimated on the same data. This would provide one piece of objective criterion for comparing alternative models.
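The likelihood ratio and pseudo-R² computations can be sketched from the reported log likelihoods. Note that with the rounded values shown here the arithmetic gives a ratio near 1297 and ρ² near 0.63, somewhat different from the tabled figures, which were presumably computed from unrounded estimates:

```python
ll_0 = -1023.0      # log likelihood with all coefficients zero (from the example)
ll_b = -374.4       # log likelihood of the full seven-parameter model
k = 7               # number of estimated parameters

lr_stat = -2.0 * (ll_0 - ll_b)          # likelihood ratio statistic, compare to chi-square(7)
rho2 = 1.0 - ll_b / ll_0                # pseudo coefficient of determination
rho2_c = 1.0 - (ll_b - k) / ll_0        # corrected for the number of parameters
```

Since the likelihood ratio statistic far exceeds the critical value χ²(7, 0.95) = 14.07, the null hypothesis that all coefficients are zero is rejected.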
Variable Selection
There are asymptotic t-statistics that are evaluated similarly to t-statistics in regression, with the caveat that the results hold only asymptotically: as the sample size grows, the sampling distribution of the estimated parameters approaches the normal distribution. Thus, variables entered into the logit model can be evaluated statistically using t-statistics, and jointly using the log likelihood ratio test (see goodness of fit). Of course, variables should be selected a priori based upon their theoretical or material role in the decision process being modeled.
Since the ratio of choice probabilities between two modes is expected to remain unchanged relative to other choices, other choices could feasibly be added to the choice set and the original choice probability ratios should remain unchanged. A test proposed by Hausman and McFadden (1984) compares a restricted choice set model (r), which is the model estimated without one of the choice alternatives, with the unrestricted model (u), estimated on the full set of alternatives. If IIA is not violated, then the coefficient estimates should be affected only by random fluctuation caused by statistical sampling. The test statistic

q = [bu − br]' [Vr − Vu]^(-1) [bu − br]

is asymptotically chi-square distributed with Kr degrees of freedom, where Kr is the number of coefficients in the restricted choice set model, bu and br are the coefficient vectors estimated for the unrestricted and restricted choice sets respectively, and Vu and Vr are the variance-covariance matrices for the unrestricted and restricted choice sets respectively. This test can be found in textbooks on discrete choice and in some software programs.
Uncorrelated errors
Correlated errors occur when unobserved attributes of choices are shared across alternatives (violating the IIA assumption), or when panel data are used and choices are correlated over time. Violation of the IIA assumption was dealt with in a previous section. Panel data need to be handled with more sophisticated methods incorporating both cross-sectional and time-series structure.
Outlier Analysis
Similar to regression, the analyst should perform outlier analysis. In doing so, the analyst should inspect the predicted choice probabilities against the chosen alternative. An outlier can arbitrarily be defined as a case where an alternative was chosen even though it had only a 1 in 100 chance of being selected. When such cases are identified, the analyst looks for miscoding and measurement errors in the variables. If an observation is influential but not erroneous, then the analyst should investigate its effect, for instance by estimating the model with and without the observation included and comparing the results.
Nested logit
One may think of a multi-dimensional choice context as one with inherent structure, or hierarchy. This notion helps the analyst to visualize the nested logit model, although the nested logit model is not inherently a hierarchical model. Consider the case of four travel alternatives: auto, bus with walk access, bus with auto access, and carpool. This might be thought of as a nested choice structure: the first decision is made between public transit and auto, and then between the alternatives within the public or private branch that was selected. Mathematically, this nested structure allows subsets of alternatives to share unobserved components of utility, relaxing the strict IIA property of the MNL model. For example, if transit alternatives are nested together, then it is feasible that these alternatives share unobserved utility components such as comfort, ride quality, safety, and other attributes of transit that were omitted from the systematic utility functions. This work-around of the IIA assumption in the MNL model is a feasible and relatively easy solution. In a nutshell, the analyst groups alternatives that share unobserved attributes at different levels of a nest, so as to allow error terms within a nest to be correlated. For more detailed discussions of nested logit models, consult the references listed in this chapter.
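A two-level nested structure can be sketched numerically. This is a minimal illustration, not an estimation routine: the utilities and the nesting parameter mu are hypothetical, and each nest's value is taken to be simply the logsum of its alternatives (mu = 1 collapses the nest back to an MNL):

```python
import math

def nested_logit_probs(nests, mu=0.5):
    """Two-level nested logit sketch. `nests` maps nest name -> {alternative: V}.
    mu is the nesting parameter, 0 < mu <= 1; smaller mu means more within-nest
    correlation of the unobserved utility components."""
    # Logsum (inclusive value) of each nest.
    logsums = {m: mu * math.log(sum(math.exp(v / mu) for v in alts.values()))
               for m, alts in nests.items()}
    denom = sum(math.exp(ls) for ls in logsums.values())
    probs = {}
    for m, alts in nests.items():
        p_nest = math.exp(logsums[m]) / denom            # marginal: P(nest)
        inner = sum(math.exp(v / mu) for v in alts.values())
        for a, v in alts.items():
            probs[a] = p_nest * math.exp(v / mu) / inner  # P(alt) = P(nest) * P(alt|nest)
    return probs

# Hypothetical utilities: the two transit alternatives share a nest.
p = nested_logit_probs({"auto": {"drive": -0.5, "carpool": -1.0},
                        "transit": {"bus_walk": -1.2, "bus_auto": -1.4}})
```

Because the bus alternatives sit in one nest, adding a near-duplicate bus mode would draw share mostly from its nest-mates rather than proportionally from auto, avoiding the red bus/blue bus result.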
Multinomial Probit
Multinomial probit models extend the probit model to more than two alternatives. Unfortunately, they are difficult to estimate for more than 4 or 5 alternatives because the mathematical complexity of the likelihood function grows with the number of alternatives. As computers become faster and computational methods improve, multinomial probit models may become practical for reasonably sized choice sets.
Model Documentation
Once a model has been estimated and selected as the best among competing models, it needs to be thoroughly documented, so that others may learn from the modeler's efforts. It is important to recognize that a model that performs below expectations is still a model worth reporting, because the accumulation of knowledge is based on objective reporting of findings, and only presenting success stories is not objective reporting. It is just as valuable to learn that certain variables don't appear to influence a certain response as it is to learn that they do. When reporting the results of models, enough information should be provided so that another researcher could replicate the results. Not reporting things like sample sizes, manipulations to the data, estimated variance, etc., could render follow-on studies difficult. Perhaps the most important aspect of model documentation is the theory behind the model. That is, all the variables in the model should be accompanied by a material explanation for being there. Why are the Xs important in their influence on Y? What is the mechanism by which X influences Y? Would one suspect an underlying causal relation, or is the relationship merely associative? These are the types of questions that should be answered in the documentation accompanying a modeling effort. In addition, the model equations, the t-statistics, R-square, MSE, and F-ratio test results should be reported. Thorough model documentation will allow for future enhancements to an existing model.
Model Implementation
Model implementation, of course, should have been considered early on in the planning stages of any research investigation. There are a number of considerations to take into account during implementation stages:
1) Are the variables needed to run the model easily accessible?
2) Is the model going to be used within the domain for which it was intended?
3) Has the model been validated?
4) Will the passage of time render model predictions invalid?
5) Will transferring the model to another geographical location jeopardize model accuracy?
These questions and other carefully targeted questions about the particular phenomenon under study will aid in an efficient and scientifically sound implementation plan.
made, per unit increase in the multiplicative exponent of X1. Thus, the interpretation is not straightforward as is the interpretation for linear regression.
Degrees of freedom are associated with sample size. Every time a statistical parameter is estimated on a sample of data, the ability to compute additional parameters decreases. Degrees of freedom are the number of independent data points used to estimate a particular parameter.
What methods can be used to specify the relation between choice and the Xs?
Unlike linear regression, which represents a linear relation between a continuous variable and one or more independent variables, it is difficult to develop a useful plot between explanatory variables and the response used in discrete choice models. An exception to this occurs when repeated observations are made on an individual, or data are grouped (aggregate). For instance, a plot of proportion choosing alternative A by group (or individual across repeat observations) may reveal some differences across experimental groups (or individuals).
The Gumbel distribution is the assumed error distribution of the MNL and nested logit models. The Gumbel distribution has two parameters, a scale parameter μ and a location parameter η. It is conveniently assumed in MNL and nested logit models that the scale parameter is equal to 1, since it is not directly estimable. Unlike linear regression, it is not easy to determine whether model errors are Gumbel distributed. In the case of grouped (aggregate) or repeated individual observation data, an analyst could inspect the distribution of errors by computing the choice probabilities and comparing them to observed proportions of choices. However, this is often impractical, and given the lack of alternative distributional forms offered in discrete choice models, it is often not performed.
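The role of the Gumbel assumption can be illustrated by simulation: adding independent Gumbel(0, 1) draws to fixed systematic utilities and letting each simulated traveler pick the maximum-utility mode reproduces the closed-form MNL shares. The utilities here are hypothetical:

```python
import math
import random

random.seed(0)
V = [1.0, 0.5, 0.0]            # hypothetical systematic utilities for three modes
n_draws = 100_000

# Sample Gumbel(0, 1) errors via the inverse CDF: eps = -ln(-ln(U)), U ~ Uniform(0, 1).
wins = [0, 0, 0]
for _ in range(n_draws):
    u = [v - math.log(-math.log(random.random())) for v in V]
    wins[u.index(max(u))] += 1
sim_shares = [w / n_draws for w in wins]

# Analytic MNL shares for comparison: exp(Vi) / sum_j exp(Vj).
denom = sum(math.exp(v) for v in V)
mnl_shares = [math.exp(v) / denom for v in V]
```

The simulated choice frequencies match the logit formula to within sampling noise, which is exactly the McFadden result linking Gumbel errors to the MNL form.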
There are several characteristics of the modeling of count data that warrant mention here. Note that modeling count data is not covered in detail in this manuscript, and so only some of the main highlights are provided. The references provided at the end of this chapter should be consulted for detailed guidance on estimating models based on count data.
1) A random variable Y that follows a Poisson process is characterized by a Poisson mean, where E[Y] = Var[Y]. When Var[Y] > E[Y], the Poisson process is said to be overdispersed.
2) Over-dispersion of the Poisson process arises in a number of ways. It can occur when a Poisson process is observed over time intervals that are random instead of fixed. It can also occur when the means of Poisson random variables are thought to vary across subjects, such that the distribution of the means is gamma distributed. This occurs often in crash investigations, as the sites being investigated have different underlying safety as reflected by their Poisson process means.
3) The Poisson regression model produces probabilities of an event occurring Y = y times, given a host of covariates thought to affect the Poisson mean. Thus, the model is used to compute the long-run probability that an experimental unit with covariates X has Y = y occurrences of the event of interest. For instance, the Poisson regression model can be used to model crash occurrence on two-lane rural roads with differing crash probabilities as a function of site characteristics.
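Points 1) and 3) above can be checked on a sample of counts with a short sketch; the crash counts below are invented for illustration:

```python
import math

# Hypothetical crash counts at a set of sites (illustrative data).
counts = [0, 1, 0, 2, 1, 0, 3, 1, 0, 0, 2, 5, 0, 1, 4]

n = len(counts)
mean = sum(counts) / n
variance = sum((y - mean) ** 2 for y in counts) / (n - 1)

# Dispersion ratio: values well above 1 suggest overdispersion
# (Var[Y] > E[Y]), so a plain Poisson model may be inadequate.
dispersion = variance / mean

def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

print(round(mean, 3), round(variance, 3), round(dispersion, 3))
print(round(poisson_pmf(0, mean), 3))  # long-run probability of zero events
```

In a Poisson regression, the mean `lam` would itself be a function of the covariates X rather than a sample constant as here.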
McCullagh, P. and J.A. Nelder (1983). Generalized Linear Models. Second Edition. Chapman and Hall, New York, New York.
Venables, W.N. and B.D. Ripley (1994). Modern Applied Statistics with S-Plus. Second Edition.
1) Logistic regression models can be used to model two fundamentally different response variables: those that are truly binomial, taking values 0 or 1, and proportions data, which are continuous within the interval (0, 1). Binomial data are individual-level observations on a binomial outcome, whereas proportions data could be obtained from grouped data (multiple experimental units observed on the binary outcome variable) or panel data (multiple observations on the same experimental unit over time).
2) Logistic regression models make use of the logistic transformation, given by log[p/(1 − p)], where p is the binomial probability of success. The logistic transformation, employed as the response variable in the logistic regression model, ensures that the model cannot predict outside the range (0, 1).
3) Graphical plots of logistic regression data are only useful for grouped or panel data, since the predicted response lies in the interval (0, 1), whereas the binomial outcome takes on 0 or 1 discretely.
4) Outliers in logistic regression are detected by observing too many 0s or 1s where the model predicts an extremely low probability of that outcome.
5) Similar to linear regression, the probability of observing a 0 or 1 is influenced by an exogenous set of predictor variables, or Xs.
6) Standard statistics such as parameter estimates, standard errors, and t-statistics are provided in the modeling of binary outcome data via logistic regression models.
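The logistic transformation and its inverse from point 2) can be sketched as follows; the linear-predictor coefficients are illustrative values, not estimates from any model in the text:

```python
import math

def logit(p):
    """Logistic transformation log[p / (1 - p)] of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    """Map any real-valued linear predictor back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# A hypothetical fitted linear predictor beta0 + beta1 * x (illustrative values).
beta0, beta1 = -2.0, 0.8

for x in (-5.0, 0.0, 5.0):
    p = inverse_logit(beta0 + beta1 * x)
    assert 0.0 < p < 1.0  # predictions can never leave (0, 1)
    print(round(p, 4))
```

However large or small the linear predictor becomes, the inverse transformation confines the predicted probability to (0, 1), which is the guarantee point 2) describes.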