Scott Long (2006) Testing For IIA in The Multinomial Model
Sociological Methods & Research
Volume 35 Number 4
May 2007 583-600
© 2007 Sage Publications
10.1177/0049124106292361
The multinomial logit model is perhaps the most commonly used regression
model for nominal outcomes in the social sciences. A concern raised by
many researchers, however, is the assumption of the independence of irrele-
vant alternatives (IIA) that is implicit in the model. In this article, the
authors undertake a series of Monte Carlo simulations to evaluate the three
most commonly discussed tests of IIA. Results suggest that the size proper-
ties of the most common IIA tests depend on the data structure for the inde-
pendent variables. These findings are consistent with an earlier impression
that, even in well-specified models, IIA tests often reject the assumption
when the alternatives seem distinct and often fail to reject IIA when the
alternatives can reasonably be viewed as close substitutes. The authors con-
clude that tests of the IIA assumption that are based on the estimation of a
restricted choice set are unsatisfactory for applied work.
584 Sociological Methods & Research
Small and Hsiao (1985) demonstrated that the McFadden, Train, and Tye (1981)
test, or MTT test, is asymptotically biased and
proposed an alternative likelihood ratio test, known as the Small and Hsiao
test, that eliminates this bias. A third IIA test, proposed by Hausman and
McFadden (1984), compares the estimates from the full and restricted
models. The most commonly used tests are the Hausman and McFadden
(HM) test and the Small and Hsiao (SH) test, which are frequently dis-
cussed in econometrics texts (e.g., Greene 2003; Train 2003) and can be
easily computed using standard software (Zhang and Hoffman 1993).
Model-based tests are computed by estimating a more general model that
does not impose the IIA assumption and testing constraints that lead to
IIA. The most commonly discussed alternative models are multinomial
probit, nested logit, and mixed logit (see Train 2003 for an excellent dis-
cussion of these models). When these alternative models are used, IIA can
be tested by comparing the unrestricted model to a model that imposes
constraints leading to IIA. Unfortunately, these models are computation-
ally more difficult and are less familiar to applied researchers. As a conse-
quence, these tests are rarely seen in substantive applications. In the case
of the multinomial probit, issues of identification also make application
difficult (Keane 1992). These tests are not considered further in our article.
Evaluation of statistical tests typically involves assessment of their size
and power properties. In assessing size properties, the nominal significance
level of a test (e.g., .05, .10) is compared with the empirical significance
level in a data structure that does not violate the assumption being
evaluated. The empirical significance level is defined as the proportion
of times that the correct null hypothesis is rejected over a large number of
replications. If the size properties of a test are appropriate, the power of
the test is evaluated by assessing the proportion of times that the test
rejects the null hypothesis using a data structure that violates the
assumption. The more powerful the test, the higher the proportion of
replications in which it detects a violation of the assumption.
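The definitions of empirical size and power above can be sketched in a few lines. This is only an illustration: the function name `rejection_rate` and the two lists of p values are hypothetical and are not taken from the simulations reported in this article.

```python
def rejection_rate(p_values, alpha=0.05):
    """Proportion of replications whose p value falls below alpha."""
    if not p_values:
        raise ValueError("need at least one replication")
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical p values from replications in which the null hypothesis is TRUE.
null_true = [0.50, 0.03, 0.72, 0.20, 0.64, 0.91, 0.08, 0.33, 0.47, 0.12]
# Hypothetical p values from replications in which the null hypothesis is FALSE.
null_false = [0.01, 0.04, 0.20, 0.002, 0.03, 0.60, 0.01, 0.049, 0.07, 0.02]

# Empirical size: rejection rate when the assumption holds (compare with .05).
empirical_size = rejection_rate(null_true)
# Empirical power: rejection rate when the assumption is violated.
empirical_power = rejection_rate(null_false)
```

The same function serves for both quantities; only the data structure that generated the replications differs.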
In two recent articles, Fry and Harris (1996, 1998) use Monte Carlo
simulations to evaluate six choice set partitioning tests of IIA, including the
MTT, SH, and HM tests. The first article provided evidence that these tests
have poor size properties and that critical values based on asymptotic theory
may be inappropriate. In their second article, they find that the SH test is
oversized and that the HM test is reasonably well sized. Although the MTT
test is found to be undersized, it has the greatest power when using empiri-
cal critical values. These values are the 95th percentile of the test statistics
from 1,000 simulations on samples from a population in which IIA is not
violated. Fry and Harris conclude that multiple tests should be used, that
$$\Pr(y = m \mid x) = \frac{\exp(x\beta_m)}{\sum_{j=1}^{J} \exp(x\beta_j)} \quad \text{for } m = 1, \ldots, J. \qquad (1)$$

The vector $\beta_m = (\beta_{0m} \cdots \beta_{km} \cdots \beta_{Km})'$ includes
the intercept $\beta_{0m}$ and coefficients $\beta_{km}$ for the effect of
$x_k$ on outcome $m$. To identify the model, we assume without loss of
generality that $\beta_1 = 0$. The model can also be written in terms of the
odds for each pair of options $m$ and $n$:

$$\Omega_{m \mid n} = \frac{\Pr(y = m \mid x)}{\Pr(y = n \mid x)} = \exp(x[\beta_m - \beta_n]). \qquad (2)$$

Equation (2) shows that the odds of choosing $m$ versus $n$ do not depend on
which other outcomes are possible. That is, the odds are determined only
by the coefficient vectors for $m$ and $n$, namely $\beta_m$ and $\beta_n$.
This is the independence of irrelevant alternatives property, or simply IIA.
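A small numerical check may make the IIA property concrete. The sketch below evaluates equation (1) for a three-outcome model and confirms that the odds of outcome 2 versus outcome 1 are unchanged when outcome 3 is dropped from the choice set. The coefficient values, the single covariate, and the function name `mnl_probs` are hypothetical choices made for illustration.

```python
import math


def mnl_probs(x, betas):
    """Equation (1): betas maps outcome -> [intercept, slope, ...]."""
    design = [1.0] + list(x)  # prepend the constant for the intercept
    xb = {m: sum(b * v for b, v in zip(beta, design)) for m, beta in betas.items()}
    denom = sum(math.exp(v) for v in xb.values())
    return {m: math.exp(v) / denom for m, v in xb.items()}


# Hypothetical coefficients; outcome 1 is the base category (beta_1 = 0).
betas = {1: [0.0, 0.0], 2: [0.5, -1.0], 3: [-0.3, 0.8]}
x = [1.2]

full = mnl_probs(x, betas)
# Restricted choice set: outcome 3 is no longer available.
restricted = mnl_probs(x, {m: b for m, b in betas.items() if m != 3})

odds_full = full[2] / full[1]
odds_restricted = restricted[2] / restricted[1]
# Equation (2): both should equal exp(x[beta_2 - beta_1]).
odds_from_eq2 = math.exp(0.5 - 1.0 * 1.2)
```

The probabilities themselves change when outcome 3 is removed, but their ratio for outcomes 2 and 1 does not, which is exactly what the choice set partitioning tests exploit.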
Testing IIA
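estimation. To define these tests, the estimates from the restricted choice
set are stacked in the vector
$\hat\beta^r = (\hat\beta_2^{r\prime} \cdots \hat\beta_{J-1}^{r\prime})'$, with
the corresponding estimates from the full model
$\hat\beta^f = (\hat\beta_2^{f\prime} \cdots \hat\beta_{J-1}^{f\prime})'$. Note
that $\hat\beta^f$ does not include $\hat\beta_J^f$ since it was not estimated
in the restricted estimation. The Hausman-McFadden statistic is

$$HM = (\hat\beta^r - \hat\beta^f)' \left[\widehat{\mathrm{Var}}(\hat\beta^r) - \widehat{\mathrm{Var}}(\hat\beta^f)\right]^{-1} (\hat\beta^r - \hat\beta^f),$$

where $\widehat{\mathrm{Var}}(\hat\beta^r)$ and
$\widehat{\mathrm{Var}}(\hat\beta^f)$ are the estimated covariance matrices.
If IIA holds, HM is asymptotically distributed as chi-square with df equal to
the number of rows in $\hat\beta^r$. Significant values of HM indicate that
the IIA assumption has been violated. Hausman and McFadden (1984:1226) note
that HM can be negative if
$\widehat{\mathrm{Var}}(\hat\beta^r) - \widehat{\mathrm{Var}}(\hat\beta^f)$ is
not positive semidefinite, but they conclude that this is evidence that IIA
holds. We use this decision rule in the
results we present below.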
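As an illustration of the HM computation and of the decision rule for negative values, the sketch below evaluates the usual quadratic form, the difference between restricted and full estimates weighted by the inverse of the difference between their estimated covariance matrices. The helper names (`solve`, `hm_statistic`, `hm_rejects`) and all numbers are hypothetical; the critical value 5.991 is the .05 chi-square value for 2 df.

```python
def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance difference is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def hm_statistic(b_r, b_f, var_r, var_f):
    """Quadratic form d' [Var_r - Var_f]^{-1} d with d = b_r - b_f."""
    d = [r - f for r, f in zip(b_r, b_f)]
    v = [[vr - vf for vr, vf in zip(row_r, row_f)]
         for row_r, row_f in zip(var_r, var_f)]
    return sum(di * zi for di, zi in zip(d, solve(v, d)))


def hm_rejects(hm, critical_value):
    """Negative HM is treated as evidence that IIA holds (recorded as 0)."""
    return max(hm, 0.0) > critical_value


# Hypothetical estimates and covariance matrices for two coefficients.
b_r, b_f = [0.60, -0.90], [0.50, -1.00]
var_r = [[0.05, 0.0], [0.0, 0.08]]
var_f = [[0.04, 0.0], [0.0, 0.06]]

hm = hm_statistic(b_r, b_f, var_r, var_f)
# Swapping the covariance matrices makes the difference negative definite.
hm_negative = hm_statistic(b_r, b_f, var_f, var_r)
```

With these made-up inputs the first statistic is positive and well below 5.991, while the second is negative and is therefore never counted as a rejection under the rule described above.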
Generation of Data
To examine the size properties of the MTT, HM, and SH tests, we con-
ducted Monte Carlo simulations using eight artificial data sets in which the
IIA assumption was not violated. These artificial data sets were constructed
to reflect scenarios that might occur in real survey data with both continuous
and categorical covariates, with different degrees of collinearity among the
covariates, different values of the βs, and small cells in the cross-tabulation
between the outcome variable and dichotomous covariates. For each data
In Data Sets 1, 2, and 3, all of the xs are continuous. These data sets
differ in the degree of collinearity among the xs, with the maximum corre-
lations ranging from .62 in Data Set 1 to .82 in Data Set 3. Data Set 4 was
created by dichotomizing x2 with 47 percent of the cases equal to 1. Data
Sets 5 through 8 are discussed in the ‘‘IIA Tests in Data With Sparse
Cells’’ section. Table 1 summarizes the data sets used in our simulations.
Design of Simulations
For each data set, simulations were run for sample sizes of n = 150,
250, 350, 500, 1,000, and 2,000 (see Note 2). The simulations involved these steps:
Cheng, Long / Testing for IIA 591
Table 1
Descriptive Statistics for Data Sets Used in Simulations
[Table body not recoverable; column groups: Percentage in, Means, Correlations.]
These steps were repeated 500 times for each sample size in each data set.
To determine the empirical size for each test, we computed the percentage
of times that each test rejected the null hypothesis that IIA held in the
population at the .05 and .10 levels of significance. Since the results at the
.10 level are consistent with those at the .05 level, they are not reported.
For the HM test, we used Hausman and McFadden’s (1984) suggestion
that negative chi-squared values be recorded as 0 with the corresponding
p value of 1 (see Note 3).
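When reading the rejection percentages, it can help to know how much they could vary by chance alone with 500 replications. The sketch below is an interpretive aid rather than part of the original analysis: it assumes replications are independent and uses the normal approximation to the binomial, and the function name `mc_standard_error` is hypothetical.

```python
import math


def mc_standard_error(rate, replications):
    """Binomial standard error of a rejection proportion."""
    return math.sqrt(rate * (1.0 - rate) / replications)


# Nominal .05 level with the 500 replications used for each sample size.
se = mc_standard_error(0.05, 500)
# Approximate 95 percent band around .05 for a correctly sized test.
lower = 0.05 - 1.96 * se
upper = 0.05 + 1.96 * se
```

Under these assumptions the standard error is a little under .01, so empirical rejection rates between roughly 3 and 7 percent are compatible with a correctly sized test, while rates of 15 percent or more are not.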
Our analysis begins by examining the three IIA tests using the first four
data structures and shows that the size properties are affected by the
amount of collinearity and depend on which version of the test is used.
Because the undersized properties of the MTT test are highly consistent
with those suggested in earlier research, we only present the results for the
HM and SH tests. While the SH test has seemingly reasonable size proper-
ties with samples of 500 or more in data structures with different degrees
of collinearity, we show that the presence of sparse cells can lead to severe
size distortion for sample sizes up to 2,000, the largest we present. Using
these findings as a guide, we consider the MTT and illustrate the practical
problems with using empirical critical values.
The results of the simulations for the HM test are presented in Figure 1,
which shows the percentage of times the HM test rejected the null hypothesis
of no violation of the IIA assumption using the .05 level (see Note 4). For each data
structure, three versions of the HM test were computed, excluding either the
first, second, or third outcome category. The percentage listed in the title for
the graph using Data Set 4 indicates that 10.6 percent of the cases were
found in the smallest cell of the cross-tabulation between y and x2 . The
numbers on the lines within each graph indicate the deleted category for the
test being presented. The results illustrate two problems. First, the HM test
does not reliably converge to its appropriate size even when the sample is
2,000. Second, the
properties of the test depend on which outcome category is deleted in the
restricted estimation. For example, in Data Set 2, the test approaches its
nominal .05 level when Category 2 is excluded but levels off around .15
when Category 1 is excluded. In a substantial proportion of the samples, the
resulting HM test statistic was negative. Even with a sample size of 1,000, 21 to 49
percent of the test statistics were negative. Overall, our results indicate that
the HM test is not a viable test for assessing IIA.
As shown in Figure 2, the SH test approximates its nominal size as the
sample increases to 500 or 1,000. The magnitude of departures from the
nominal size and the sample size at which these distortions are largely
removed depend on the degree of collinearity in the data. For example,
with high collinearity, the size properties are quite poor with samples
smaller than 500, and a sample of at least 1,000 is required before the
distortions are nearly eliminated. We also found evidence of a practical problem that is
often encountered when applying these tests with real-world data. There
are six ways to compute the SH test in our example. Each outcome cate-
gory can be the base category in the MNLM used to compute the test. For
each base category, there are two variations of the test, depending on
which nonbase category is removed. While using Category 1 as the base
category when excluding Category 3 is the same model as using Category
2 as the base when excluding Category 3, the results from the SH test will
differ due to their dependence on a particular draw of random numbers. In
more than 33 percent of samples of 500, at least one of the six possible SH
tests gave a conclusion that was inconsistent with the other tests. Even in
Figure 1
Size Properties of the Hausman-McFadden Test
of the Independence of Irrelevant Alternatives
[Four panels; vertical axis: percent rejected (0 to 30); horizontal axis: sample size (0 to 2,000); plotted numbers 1, 2, and 3 mark the excluded category.]

Figure 2
Size Properties of the Small-Hsiao Test
of the Independence of Irrelevant Alternatives
[Four panels; vertical axis: percent rejected (0 to 30); horizontal axis: sample size (0 to 2,000); plotted numbers 1, 2, and 3 mark the excluded category.]

Figure 3
Illustration of Severe Size Distortion
[Two panels, HM test and SH test; vertical axis: percent rejected (0 to 50); horizontal axis: sample size (0 to 2,000); plotted numbers 1, 2, and 3 mark the excluded category.]

of y and x2 varied from 1.8 percent to less than .1 percent. These small
percentages could easily occur in data where one of the independent vari-
ables indicates membership in an underrepresented group, with an out-
come category with few cases, or when a combination of multiple binary
variables would lead to the rare occurrence of some outcome category for
some combination of independent characteristics.
In drawing small samples from data structures in which there were
sparse cells, it was common to draw a sample in which there was a zero in
the y × x2 table. In such cases, the MNLM can be estimated, but a
singularity occurs in $\widehat{\mathrm{Cov}}(\hat\beta)$. A researcher who
encounters this situation when
building a model is likely to respecify the model to remove the singularity,
either dropping one of the independent variables or collapsing categories.
We adapted our simulations to reflect this scenario. If a zero cell was
encountered, we drew a replacement sample. Figure 4 presents the results
of our simulations for the SH test in data sets with sparse cells. Again, the
percentages listed in the title for each graph indicate the percentage of
cases in the smallest y × x2 cell in the population data structure. The per-
centage of tests that reject the null depends greatly on the excluded cate-
gory. In some cases, the tests have extreme size distortion, rejecting the
correct null 50 percent of the time, even with samples of 1,000.

Figure 4
Size Properties of the Small-Hsiao Test of the Independence
of Irrelevant Alternatives in the Presence of Sparse Cells
[Four panels; vertical axis: percent rejected; horizontal axis: sample size (0 to 2,000); plotted numbers 1, 2, and 3 mark the excluded category.]

In supplementary analyses, we extended the simulations to larger sample sizes
(6,000, 8,000, and 10,000) and restricted analysis to random samples with
at least five observations in the smallest cell. In both cases, the size
distortion persisted, again confirming our earlier finding that the size properties
of the IIA tests are highly dependent on the data structure for the indepen-
dent variables. The results for the HM test (not shown) are similar to those
for Data Structures 1 through 4: The size of the test does not converge as
the sample size increases, and the percentage rejected depends on the
category excluded in the restricted model. These findings could explain why
Fry and Harris (1996, 1998) found contradictory results for the size
properties of the HM and SH tests in their two simulations.

Table 2
Empirical Critical Values by Data Structure for n = 500
[Table body not recoverable; columns: Data Set; Excluded Category 1, 2, 3.]

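The replacement rule for samples with an empty cell in the y × x2 table, described above, can be sketched as follows. This is an assumption-laden illustration: the population counts are invented (chosen so that the smallest cell holds 1.8 percent of cases), sampling is done with replacement, and the names `has_zero_cell` and `draw_sample` are hypothetical.

```python
import random


def has_zero_cell(ys, x2s, outcomes=(1, 2, 3), levels=(0, 1)):
    """True if any outcome-by-x2 combination is absent from the sample."""
    seen = set(zip(ys, x2s))
    return any((y, x) not in seen for y in outcomes for x in levels)


def draw_sample(population, n, rng, max_tries=1000):
    """Draw n (y, x2) pairs, redrawing whenever a cell of the table is empty."""
    for _ in range(max_tries):
        sample = [rng.choice(population) for _ in range(n)]
        ys, x2s = zip(*sample)
        if not has_zero_cell(ys, x2s):
            return sample
    raise RuntimeError("no sample without a zero cell was found")


# Hypothetical population of 1,000 cases; the (3, 1) cell is sparse.
population = (
    [(1, 0)] * 400 + [(2, 0)] * 300 + [(3, 0)] * 100
    + [(1, 1)] * 100 + [(2, 1)] * 82 + [(3, 1)] * 18
)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = draw_sample(population, 150, rng)
```

Note that discarding samples in this way conditions the simulation on the absence of empty cells, which mirrors what a researcher who respecifies a singular model would effectively do.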
Fry and Harris (1998) explored the use of size-adjusted tests. For these
tests, the critical value is set to be the 95th percentile of the test statistic
computed in the simulation. Based on power, they recommend the MTT as
the preferred test. They state, ‘‘Furthermore, where possible, we would
recommend that a simulation experiment be conducted to obtain empirical
(size-corrected) critical values for use in inference concerning the IIA
property’’ (p. 419). Our results suggest that the sampling distribution of
MTT is highly dependent on the data structure. For example, Table 2
shows the empirical critical values generated for the MTT test in our eight
data structures. Even though Structures 1 through 3 are very similar, dif-
fering only in their degree of collinearity, the values differ substantially
relative to the small variances in the distributions of the MTT tests (e.g.,
for Data Set 1, the standard deviations for the three tests are .03, .36, and
.21). The variability in the computed empirical critical values suggests
that the MTT test may not be effective even with size adjustments.
Furthermore, even if a researcher decides that the size-adjusted MTT test
is appropriate, we caution that Fry and Harris’s advice requires that
researchers obtain the empirical critical values from their own simulations
using their data. We believe that this makes the size-adjusted MTT imprac-
tical in most substantive applications.
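To show what obtaining an empirical critical value involves, the sketch below takes the 95th percentile of test statistics simulated under a true null. It assumes the nearest-rank percentile convention, which is only one of several in use; the function name `empirical_critical_value` and the stand-in statistics are hypothetical.

```python
def empirical_critical_value(null_statistics, percentile=95):
    """Nearest-rank percentile of statistics simulated with IIA holding."""
    if not null_statistics:
        raise ValueError("need at least one simulated statistic")
    ordered = sorted(null_statistics)
    # Integer arithmetic for ceil(percentile * n / 100) avoids float rounding.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Stand-in for 1,000 simulated statistics (any order; values are made up).
simulated = list(range(1000, 0, -1))
critical = empirical_critical_value(simulated)
# By construction, 5 percent of the simulated statistics exceed the cutoff.
share_rejected = sum(s > critical for s in simulated) / len(simulated)
```

The cutoff is only as good as the match between the simulated data structure and the data at hand, which is why each researcher would need to rerun the simulation for their own covariates.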
Conclusion
Our overall conclusion, based on the simulations shown above and our
evaluation of other data structures, is that tests of the IIA assumption that
are based on the estimation of a restricted choice set are unsatisfactory for
applied work. The Hausman-McFadden test shows substantial size distor-
tion that is unaffected by sample size in our simulations. The Small-Hsiao
test has reasonable size properties in some data sets but has severe size dis-
tortion even in large samples when there are sparse cells in the table of the
outcome variable with a binary independent variable. While our simula-
tions are based on relatively simple models with three outcomes and three
independent variables, we suspect that simulations using more complex
models that more closely approximate real-world models would uncover
additional problems with these tests. Furthermore, even if a researcher
decided to use these tests, the problem of inconsistent results based on dif-
ferent variations of the test is likely. The MTT test with empirically based
critical values, as suggested by Fry and Harris (1996, 1998), also has lim-
itations that make its use impractical in substantive applications.
Overall, it appears that the best advice regarding concern about IIA
goes back to an early statement by McFadden (1974), who wrote that the
multinomial and conditional logit models should only be used in cases
where the outcome categories ‘‘can plausibly be assumed to be distinct
and weighed independently in the eyes of each decision maker.’’ Similarly,
Amemiya (1981:1517) suggests that the MNLM works well when the
alternatives are dissimilar. Care in specifying the model to involve distinct
outcomes that are not substitutes for one another seems to be reasonable,
albeit unfortunately ambiguous, advice. The generalized extreme value
(e.g., nested logit, paired combinatorial logit, etc.) and mixed logit models
(see Train 2003) show great promise for models that do not impose the IIA
assumption but require intensive calculation to estimate and involve more
complicated data structures.
Notes
1. While our simulations are based on the multinomial logit model, the results for the IIA
tests should also apply to the conditional logit model.
2. We also ran simulations with sample sizes of 200, 300, 400, and 450. The results were
consistent with those presented in our figures.
3. Supplementary analyses suggest that 20 to 60 percent of the resulting chi-square values
from the Hausman-McFadden test were negative, but the incidence decreases as sample size
increases. There is no clear relationship between the type of data structure and the percentage
of tests with negative chi-square values.
4. The scales of the figures are fixed to make comparisons across figures easier.
References
Alvarez, R. Michael and Jonathan Nagler. 1995. ‘‘Economics, Issues and the Perot
Candidacy: Voter Choice in the 1992 Presidential Election.’’ American Journal of Political
Science 39:714-44.
Amemiya, Takeshi. 1981. ‘‘Qualitative Response Models: A Survey.’’ Journal of Economic
Literature 19:1483-1536.
Begg, Colin B. and Robert Gray. 1984. ‘‘Calculation of Polychotomous Logistic Regression
Parameters Using Individualized Regressions.’’ Biometrika 71:11-8.
Brooks, Robert D., Tim R. L. Fry, and Mark N. Harris. 1997. ‘‘The Size and Power Properties
of Combining Choice Set Partition Tests for the IIA Property in the Logit Model.’’ Journal
of Quantitative Economics 13:45-61.
———. 1998. ‘‘Combining Choice Set Partition Tests for IIA: Some Results in the Four
Alternative Setting.’’ Journal of Quantitative Economics 14:1-9.
Dow, Jay K. and James W. Endersby. 2004. ‘‘Multinomial Probit and Multinomial Logit: A
Comparison of Choice Models for Voting Research.’’ Electoral Studies 23:107-22.
Fry, Tim R. L. and Mark N. Harris. 1996. ‘‘A Monte Carlo Study of Tests for the Indepen-
dence of Irrelevant Alternatives Property.’’ Transportation Research Part B: Methodolo-
gical 30:19-30.
———. 1998. ‘‘Testing for Independence of Irrelevant Alternatives: Some Empirical
Results.’’ Sociological Methods & Research 26:401-23.
Greene, William H. 2003. Econometric Analysis. 5th ed. New York: Prentice Hall.
Hausman, Jerry A. 1978. ‘‘Specification Tests in Econometrics.’’ Econometrica 46:1251-71.
Hausman, Jerry A. and Daniel McFadden. 1984. ‘‘Specification Tests for the Multinomial
Logit Model.’’ Econometrica 52:1219-40.
Keane, Michael P. 1992. ‘‘A Note on Identification in the Multinomial Probit Model.’’ Journal
of Business and Economic Statistics 10:193-200.
Lacy, Dean and Barry C. Burden. 1999. ‘‘The Vote-Stealing and Turnout Effects of Ross
Perot in the 1992 U.S. Presidential Election.’’ American Journal of Political Science
43:233-55.
Larntz, Kinley. 1978. ‘‘Small Sample Comparisons of Exact Levels of Chi-Squared Goodness-
of-Fit Statistics.’’ Journal of the American Statistical Association 73:253-63.
Long, J. Scott and Jeremy Freese. 2005. Regression Models for Categorical Dependent Vari-
ables Using Stata. 2nd ed. College Station, TX: Stata Press.
McFadden, Daniel. 1974. ‘‘Conditional Logit Analysis of Qualitative Choice Behavior.’’
Pp. 105-42 in Frontiers of Econometrics, edited by P. Zarembka. New York: Academic Press.
McFadden, Daniel, Kenneth Train, and William B. Tye. 1981. ‘‘An Application of Diagnos-
tic Tests for the Independence From Irrelevant Alternatives Property of the Multinomial
Logit Model.’’ Transportation Research Board Record 637:39-46.
Mokhtarian, Patricia L. and Michael N. Bagley. 2000. ‘‘Modeling Employees’ Perceptions
and Proportional Preferences of Work Locations: The Regular Workplace and Telecom-
muting Alternatives.’’ Transportation Research Part A-Policy and Practice 34:223-42.
Pels, Eric, Peter Nijkamp, and Piet Rietveld. 2001. ‘‘Airport and Airline Choice in a Multiple
Airport Region: An Empirical Analysis for the San Francisco Bay Area.’’ Regional Studies
35:1-9.
Small, Kenneth A. and Cheng Hsiao. 1985. ‘‘Multinomial Logit Specification Tests.’’ Inter-
national Economic Review 26:619-27.
Train, Kenneth. 2003. Discrete Choice Methods With Simulation. New York: Cambridge
University Press.
Zhang, Junsen and Saul D. Hoffman. 1993. ‘‘Discrete-Choice Logit Models: Testing the IIA
Property.’’ Sociological Methods & Research 22:193-213.