Is Neglected Heterogeneity Really An Issue in Binary and Fractional Regression Models? A Simulation Exercise For Logit, Probit and Loglog Models
2009/10
1
Departamento de Economia, Universidade de Évora and CEFAGE – UE
2
Departamento de Economia, Universidade de Évora and CEFAGE – UE
Abstract
∗
The authors thank João M.C. Santos Silva for very helpful comments. Financial support from Fundação
para a Ciência e a Tecnologia is also gratefully acknowledged (grant PTDC/ECO/64693/2006). Address
for correspondence: Joaquim J.S. Ramalho, Department of Economics, Universidade de Évora, Largo dos
Colegiais, 7000-803 ÉVORA, Portugal (e-mail: [email protected]).
1 Introduction
In economics, researchers are often interested in explaining a limited dependent variable, $Y$,
as a function of a set of explanatory variables, $X$. Due to the bounded nature of the variable
of interest, linear specifications often provide an inadequate description of the conditional
mean of $Y$, $E(Y|X)$, since no restriction is imposed on the range of values taken by the
predicted outcome. Moreover, when interest lies in the conditional probability of $Y$, $\Pr(Y|X)$,
nonlinear models are typically used. While the omission of relevant explanatory variables that
are independent of the included regressors is relatively innocuous in linear models, it generally
causes inconsistency in the estimation of the parameters of interest in nonlinear models (see
inter alia Gourieroux 2000, pp. 32-33). In this paper we examine the consequences of the
presence of that type of unobserved heterogeneity in logit, probit and loglog models for binary
and fractional or proportionate data.
To the best of our knowledge, there are very few studies examining the consequences
of unobserved heterogeneity in binary and fractional regression models. Moreover, the few
studies undertaken have assumed very restrictive conditions or were only concerned with
the effects of neglected heterogeneity on particular aspects of those models. For example,
Lee (1982) derived conditions under which omission of an orthogonal explanatory variable
would not cause bias in the estimation of the remaining parameters of a binary logit model.
However, those conditions are too stringent to be of practical use. Yatchew and Griliches
(1985) showed that for a binary probit model with a normally distributed omitted variable,
the estimators for the parameters of the included variables suffer from attenuation bias.
Wooldridge (2002, 2005), under similar assumptions, demonstrated that this bias does not
affect the consistent estimation of the partial effect of the observed regressors on the outcome.
Finally, Cramer (2003, 2007) considered the binary logit model and proved formally that
the same attenuation bias occurs in this context if the distribution of the omitted
variables is such that their relegation to the disturbance term of the latent regression equation
that originates the logit model does not change its logistic distribution, which is also a
very strong assumption. However, Cramer (2007) also presents a small simulation study
which reveals that a particular partial effect, the average sample effect, is quite insensitive
to the inconsistency of the parameters of interest, even in cases where the logit shape of the
conditional distribution is severely affected.1
Given that calculation of partial effects is often the main aim of empirical work and that
in nonlinear models the analysis of the magnitude of regression coefficients is not relevant
per se, both Wooldridge (2002) and Cramer (2007) suggest that, similarly to what happens
in linear models, unobserved heterogeneity is not an important issue in, respectively, binary
probit and logit models. However, it is not clear whether the robustness of the binary logit
model revealed by the simulation study of Cramer (2007) extends to the binary probit model
(or, in fact, to any other binary or fractional model) since no similar analysis has been carried
out for the latter model. Moreover, there are other quantities of interest in empirical work
that have not been considered by those authors. One example is outcome prediction, which
is relevant not only for the analysis of binary and fractional data but also in the estimation
of multi-part models which require binary outcome prediction in the first stage. Testing the
significance of the observed covariates is clearly another relevant issue for practitioners.
In order to examine these questions, we consider the theoretical framework of Wooldridge
(2002) and Cramer (2007) and extend their results to other quantities of interest and models.
However, given that a more general theoretical approach does not seem to be feasible, in this
paper we also conduct an extensive Monte Carlo study that extends the findings of the cited
papers in several directions. On the one hand, in addition to the binary logit and probit
models, we also consider an alternative asymmetric specification, the loglog model, and, in
each case, both binary outcomes, where interest lies in modelling $\Pr(Y|X)$, and fractional
responses, where the main purpose is modelling $E(Y|X)$.2 On the other hand, we examine
the consequences of neglected heterogeneity for the performance of standard estimators
for those models at various levels: (i) the magnitude and direction of the parameters of
interest; (ii) the two common forms of calculating partial effects considered separately by
Wooldridge (2002) and Cramer (2007); (iii) the prediction of outcomes; and (iv) the size and
power of Wald tests for the significance of the included regressors. In all cases, we consider
several patterns of neglected heterogeneity by assuming various alternative distributions for
the omitted variables and assigning different weights to their relative importance.
1
See also the work by Neuhaus and Jewell (1993) in the area of generalized linear models, which include
the models analyzed in this paper as particular cases. However, their analysis was restricted to the case of a
single observed covariate.
2
See Papke and Wooldridge (1996) for a seminal paper on the so-called fractional regression model and
Ramalho, Ramalho and Murteira (2009) for a comprehensive survey on this subject.
The paper is organized as follows. In Section 2 we establish the framework of the paper,
discussing analytically the consequences of neglected heterogeneity in binary regression mod-
els. The Monte Carlo simulation study to assess the performance of naive estimators in both
binary and fractional regression models is carried out in Section 3. Section 4 concludes.
2 Framework
Consider a random sample of $i = 1, \ldots, N$ individuals and let $Y$ be the binary or fractional
variable of interest, defined, respectively, as $Y \in \{0, 1\}$ and $Y \in [0, 1]$, and $X_1$ and $X_2$ be,
respectively, $k_1$- and $k_2$-vectors of explanatory variables. Denote by $\theta_1$ and $\theta_2$ the $k_1$- and
$k_2$-vectors of parameters associated with $X_1$ and $X_2$, respectively, and assume that there are
no relevant explanatory variables other than those included in $X_1$ and $X_2$. Assume also that
$X_1$ contains an intercept term, $X_2$ is not observed and $X_1$ and $X_2$ are independent. Finally,
assume that
\[ E(Y | X_1 = x_1, X_2 = x_2) = G(x_1\theta_1 + x_2\theta_2), \tag{1} \]
where $G(z)$ is defined as $e^{z}/(1 + e^{z})$, $\Phi(z)$, and $e^{-e^{-z}}$ for, respectively, logit, probit, and
loglog models. Note that in the binary case $G(\cdot)$ also equals $\Pr(Y = 1 | X_1 = x_1, X_2 = x_2)$.
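The three functional forms in (1) can be written down directly. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and the overflow guards are an implementation convenience, not part of the model.

```python
import math


def logit_link(z: float) -> float:
    """Logit form e^z / (1 + e^z), arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def probit_link(z: float) -> float:
    """Probit form: standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def loglog_link(z: float) -> float:
    """Loglog form exp(-exp(-z)); unlike the other two it is asymmetric about z = 0."""
    if z < -700.0:  # exp(-z) would overflow; the function tends to 0 here
        return 0.0
    return math.exp(-math.exp(-z))


# Evaluate each form at a few index values to see the asymmetry of the loglog.
for name, link in [("logit", logit_link), ("probit", probit_link), ("loglog", loglog_link)]:
    print(name, [round(link(z), 4) for z in (-2.0, 0.0, 2.0)])
```

At a zero index the logit and probit forms both return one half, while the loglog form returns $e^{-1}$, which is what makes it the asymmetric alternative.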
When $X_2$ is not observed, the relevant conditional mean is obtained by integrating $X_2$ out of (1):
\[ E(Y | X_1 = x_1) = E_{X_2}[G(x_1\theta_1 + x_2\theta_2)] = \int_{\mathcal{X}_2} G(x_1\theta_1 + x_2\theta_2)\, f_2(x_2)\, dx_2, \tag{2} \]
where $\mathcal{X}_2$ and $f_2(x_2)$ denote, respectively, the sample space and the marginal distribution of
$X_2$. As, in general, $E(Y|X_1) \neq G(x_1\theta_1)$, naive estimation based on $G(x_1\theta_1)$ will not produce
consistent estimators for $\theta_1$. In fact, it seems that omission of $X_2$ will bias the estimator of $\theta_1$
towards zero, as shown by Yatchew and Griliches (1985) and Wooldridge (2002) for a particular binary probit
model, by Cramer (2007) for a specific binary logit model, and by Neuhaus and Jewell
(1993) for any generalized linear model based on a log-concave density function (which is
the case of the binary and fractional logit, probit and loglog models) with a single observed
covariate. However, as we show next, retracing the arguments of Yatchew and Griliches
(1985), Wooldridge (2002) and Cramer (2007), it is not possible to prove formally that this
attenuation effect will be the consequence of neglected heterogeneity under all circumstances.
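For simplicity, consider the following latent regression equation:
\[ Y^{*} = x_1\beta_1 + x_2\beta_2 + u, \tag{3} \]
where $Y^{*}$ is not observed, $x_1$ includes a unit variable, $x_2$ contains a single explanatory variable
that is uncorrelated with $x_1$ and $u$ is a random disturbance that is uncorrelated with the
regressors. Instead of $Y^{*}$, we observe the binary variable $Y$, which takes the value 1 if $Y^{*} > 0$
and the value 0 otherwise. Assume that $u$ has mean zero and variance $\sigma^{2}$ and denote its
standardized distribution by $F$. When $X_2$ is observed, it follows that:
\[
\begin{aligned}
E(Y | X_1, X_2) &= \Pr(Y = 1 | X_1, X_2) \\
&= \Pr(u > -x_1\beta_1 - x_2\beta_2 | X_1, X_2) \\
&= 1 - \Pr(u \le -x_1\beta_1 - x_2\beta_2 | X_1, X_2) \\
&= 1 - F\left(-x_1\frac{\beta_1}{\sigma} - x_2\frac{\beta_2}{\sigma}\right) \\
&= G\left(x_1\frac{\beta_1}{\sigma} + x_2\frac{\beta_2}{\sigma}\right),
\end{aligned}
\tag{4}
\]
where $G(\cdot)$ is the complementary function of $F(\cdot)$, i.e. $G(z) = 1 - F(-z)$. When $u$ has a symmetric
distribution, $G(\cdot) \equiv F(\cdot)$. As is well known, the parameters $\beta_1$ and $\beta_2$ are not separately identified
from $\sigma$. Let $\theta_1 = \beta_1/\sigma$.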
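Assume now that $X_2$ is not observed and has mean zero and variance $\sigma_{x_2}^{2}$. Then, the
composite error $u^{*} = x_2\beta_2 + u$ is independent of $X_1$ and has variance $\sigma_{*}^{2} = \beta_2^{2}\sigma_{x_2}^{2} + \sigma^{2}$.
Denote the standardized distribution of $u^{*}$ by $F^{*}$. In this setting, it follows that:
\[
\begin{aligned}
E(Y | X_1) &= \Pr(Y = 1 | X_1) \\
&= \Pr(u^{*} > -x_1\beta_1 | X_1) \\
&= 1 - F^{*}\left(-x_1\frac{\beta_1}{\sigma_{*}}\right) \\
&= G^{*}\left(x_1\frac{\beta_1}{\sigma_{*}}\right),
\end{aligned}
\tag{5}
\]
where $G^{*}(\cdot)$ is the complementary function of $F^{*}(\cdot)$.
Let $\theta_1^{*} = \beta_1/\sigma_{*}$. Clearly, we cannot evaluate the effects of omitting $X_2$ on parameter
estimation unless we assume that $F = F^{*}$, i.e. the distribution of $X_2$ must be such that its
inclusion in the error term does not change the distribution of the disturbance. If we make
this assumption, then $G = G^{*}$ and, comparing (4) and (5), we find that
\[ \theta_1^{*} = \frac{\sigma}{\sigma_{*}}\,\theta_1. \tag{6} \]
As $\sigma < \sigma_{*}$ (unless $\beta_2 = 0$ or $\sigma_{x_2} = 0$), in general $|\theta_1^{*}| < |\theta_1|$, which implies that, under the
assumptions made, omission of an explanatory variable produces an attenuation bias in the
estimation of the parameters of the observed covariates.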
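As a numerical illustration of the attenuation factor in (6), the sketch below computes the ratio of the disturbance standard deviation to that of the composite error. The function name and the input values are hypothetical; the only ingredient taken from the text is that the composite error variance equals the squared coefficient of the omitted variable times its variance, plus the variance of the original disturbance.

```python
import math


def attenuation_factor(beta2: float, sd_x2: float, sigma: float) -> float:
    """Ratio sigma / sigma_star appearing in equation (6).

    sigma_star is the standard deviation of the composite error
    beta2 * x2 + u, where x2 (standard deviation sd_x2) is independent
    of the disturbance u (standard deviation sigma > 0).
    """
    sigma_star = math.sqrt(beta2 ** 2 * sd_x2 ** 2 + sigma ** 2)
    return sigma / sigma_star


# Hypothetical values: unit-variance disturbance and unit-variance omitted variable.
for beta2 in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(beta2, round(attenuation_factor(beta2, sd_x2=1.0, sigma=1.0), 4))
```

The factor equals one when the omitted variable has no effect and falls monotonically towards zero as its coefficient grows, which is the attenuation pattern described above. It only describes the actual bias when the distributional assumption behind (6) holds.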
In this proof, the crucial assumption is that $F = F^{*}$. Actually, most of the papers cited
above made this assumption.3 Indeed, both Yatchew and Griliches (1985) and Wooldridge
(2002, 2005) assumed that both $u$ and $X_2$ are normally distributed, which implies that $u^{*}$ also has
a normal distribution. On the other hand, in his proof of the existence of an attenuation
bias in the logit model, Cramer (2007) did not specify the distribution of $X_2$ but assumed
that both $u$ and $u^{*}$ had a logistic distribution. However, in practice, it is extremely unlikely
that $F = F^{*}$. Moreover, for fractional regression models, which cannot be written in latent
form, no similar proof seems to be feasible. Therefore, in the Monte Carlo simulation study
carried out in the next section, we investigate whether equation (6), which applies only to
very specific binary regression models, also holds approximately for cases where $F \neq F^{*}$ and
for fractional regression models.
For empirical analysis based on nonlinear models, the focus is not so much on the
magnitude of the regression coefficients as on the consistent estimation of partial effects.
The two most usual forms of measuring partial effects in nonlinear models in applied work
are the average sample effect ($ASE$), which is the mean of the partial effects calculated
independently for each individual in the sample, and the population partial effect ($PPE$),
which is calculated for specific values of the covariates. As discussed in detail by Wooldridge
(2002), in the presence of neglected heterogeneity we are usually interested in calculating partial
effects averaged across the population distribution of the omitted variables.
Consider again the model described by (1) and assume that $X_2$ is not observed. In this
setting, for the covariate $X_{1j}$, those partial effects are defined by
\[ ASE_j = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial E(Y | X_1 = x_{1i})}{\partial x_{1j}} = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial E_{X_2}[G(x_{1i}\theta_1 + x_2\theta_2)]}{\partial x_{1j}} \tag{7} \]
and, considering evaluation at a given point $X_1 = \bar{x}_1$ (e.g. the mean of the observed regressors), by:
\[ PPE_j = \frac{\partial E(Y | X_1 = \bar{x}_1)}{\partial x_{1j}} = \frac{\partial E_{X_2}[G(\bar{x}_1\theta_1 + x_2\theta_2)]}{\partial x_{1j}}. \tag{8} \]
3
The exception is Neuhaus and Jewell (1993). However, their geometric approach applies only to models
with a single observed covariate.
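As both effects depend on $X_2$, the naive estimators
\[ \widehat{ASE}_j = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial G(x_{1i}\hat{\theta}_1)}{\partial x_{1j}} \tag{9} \]
and
\[ \widehat{PPE}_j = \frac{\partial G(\bar{x}_1\hat{\theta}_1)}{\partial x_{1j}}, \tag{10} \]
where $\hat{\theta}_1$ denotes the naive estimator of $\theta_1$, should be inconsistent, since $\hat{\theta}_1$ is inconsistent
and $G(\cdot)$ is in general misspecified. However, when $F = F^{*}$ both $\widehat{ASE}_j$ and $\widehat{PPE}_j$ provide
consistent estimates for $ASE_j$ and $PPE_j$, respectively. Indeed, consider again the example
discussed in the previous section. Using (2) and (5), we know that for binary regression
models:
\[ E(Y | X_1) = E_{X_2}[G(x_1\theta_1 + x_2\theta_2)] = G^{*}\left(x_1\frac{\beta_1}{\sigma_{*}}\right). \tag{11} \]
Hence,
\[ PPE_j = \frac{\partial E_{X_2}[G(\bar{x}_1\theta_1 + x_2\theta_2)]}{\partial x_{1j}} = \frac{\partial G^{*}(\bar{x}_1\theta_1^{*})}{\partial x_{1j}}. \tag{12} \]
Therefore, as when $F = F^{*}$, $G = G^{*}$ and $\hat{\theta}_1$ converges to $\theta_1^{*} = \beta_1/\sigma_{*}$, it follows that under
this assumption $\widehat{PPE}_j$ is a consistent estimator for $PPE_j$. A similar proof may be performed
for $ASE_j$.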
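The two naive estimators can be computed in a few lines once a coefficient estimate is available. The sketch below assumes a logit functional form, for which the derivative of the mean function with respect to a covariate is the coefficient times the logistic density; the function names and the toy data are hypothetical and serve only to show the difference between averaging the individual effects and evaluating the effect at the sample mean.

```python
import math
from typing import List, Sequence


def logistic(z: float) -> float:
    """Logistic CDF, arranged to avoid overflow."""
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))


def logistic_density(z: float) -> float:
    """Derivative of the logistic CDF: G(z) * (1 - G(z))."""
    p = logistic(z)
    return p * (1.0 - p)


def index(row: Sequence[float], theta: Sequence[float]) -> float:
    """Linear index x * theta for one observation."""
    return sum(x * t for x, t in zip(row, theta))


def naive_average_sample_effect(rows: List[Sequence[float]], theta_hat: Sequence[float], j: int) -> float:
    """Mean over the sample of the individual partial effects of covariate j."""
    return sum(theta_hat[j] * logistic_density(index(r, theta_hat)) for r in rows) / len(rows)


def naive_effect_at_mean(rows: List[Sequence[float]], theta_hat: Sequence[float], j: int) -> float:
    """Partial effect of covariate j evaluated at the sample mean of the covariates."""
    n = len(rows)
    x_bar = [sum(r[k] for r in rows) / n for k in range(len(theta_hat))]
    return theta_hat[j] * logistic_density(index(x_bar, theta_hat))


# Toy data: an intercept column and one covariate; coefficients chosen for illustration.
rows = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]
theta_hat = [0.0, 1.0]
print(naive_average_sample_effect(rows, theta_hat, j=1))
print(naive_effect_at_mean(rows, theta_hat, j=1))
```

Because the logistic density peaks at a zero index, the effect evaluated at the mean exceeds the averaged effect in this toy example; the two measures generally differ whenever the index varies across observations.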
Wooldridge (2002), using similar arguments, was the first to demonstrate that in the
binary probit model with a normally distributed omitted variable the bias in the estimation
of $\theta_1$ does not carry over to the estimation of the $PPE$. Cramer (2007) showed that the same
conclusion holds for logit models in the particular case where the logit shape of $E(Y|X_1, X_2)$
of (1) is preserved in $E(Y|X_1)$ of (2).4 Cramer (2007) also shows by simulation that, for
logit models, even in cases where $E(Y|X_1)$ deviates significantly from the logit functional
form assumed for $E(Y|X_1, X_2)$, the $ASE$ is relatively robust to neglected heterogeneity. In
Section 3 we investigate whether this robustness of naive partial effects may be extended to
other models and more general settings.
In this paper we also examine whether naive predictions of $E(Y|X_1)$ or $\Pr(Y|X_1)$, based
on the misspecified functional form $G(x_1\theta_1)$ evaluated at the inconsistent estimator $\hat{\theta}_1$, are
reliable. So far, the literature has been silent about this issue. However, outcome prediction,
besides being a relevant matter per se, is also the basis for the estimation of partial effects
in multi-part models where the first stage usually requires the estimation of a binary model.
Because $X_2$ is not observed, the main interest is outcome prediction averaged across the
population distribution of the omitted variables, as discussed above for partial effects.
From (11), it is clear that the same assumptions required above for consistent estimation of
partial effects are still needed: only if $F = F^{*}$ does $G(\bar{x}_1\hat{\theta}_1)$ consistently predict $E(Y|X_1)$.
Therefore, in a probit model with normally distributed heterogeneity or in the very special logit
model considered by Cramer (2007), neglected heterogeneity is also not a problem for outcome
prediction. In our Monte Carlo study we focus on cases where $F \neq F^{*}$.
Finally, as testing the significance of the impact of a particular covariate on the outcome
variable is one of the main aims of any empirical study, we next evaluate the effects of
neglected heterogeneity on significance tests. In particular, we examine the application of
the widely used Wald test to assess the individual significance of the parameters associated
with the observed regressors in the presence of unobserved heterogeneity.
When there are no omitted variables, the Wald statistic for assessing $H_0: \theta_{1j} = 0$ is
given by $W = \tilde{\theta}_{1j} \big/ \sqrt{\widehat{Var}(\tilde{\theta}_{1j})}$, where $\tilde{\theta}_{1j}$ denotes the full-model estimator of $\theta_{1j}$
and $\widehat{Var}(\tilde{\theta}_{1j})$ denotes an estimate of the variance of $\tilde{\theta}_{1j}$,
and converges to a standard normal distribution. For binary data, considering again model
(1), it follows that
\[ \widehat{Var}(\tilde{\theta}_{1j}) = \left[\frac{1}{c_j}\sum_{i=1}^{N}\frac{g(x_{1i}\tilde{\theta}_1 + x_{2i}\tilde{\theta}_2)^{2}}{G(x_{1i}\tilde{\theta}_1 + x_{2i}\tilde{\theta}_2)\,[1 - G(x_{1i}\tilde{\theta}_1 + x_{2i}\tilde{\theta}_2)]}\right]^{-1}, \tag{13} \]
where $g(z) = \partial G(z)/\partial z$ and $c_j$ is the relevant element of $X'X$. Hence,
\[ W = \sqrt{\frac{1}{c_j}\sum_{i=1}^{N}\frac{\tilde{\theta}_{1j}^{2}\, g(x_{1i}\tilde{\theta}_1 + x_{2i}\tilde{\theta}_2)^{2}}{G(x_{1i}\tilde{\theta}_1 + x_{2i}\tilde{\theta}_2)\,[1 - G(x_{1i}\tilde{\theta}_1 + x_{2i}\tilde{\theta}_2)]}}. \tag{14} \]
4
These findings are supported by earlier work by Stoker (1986), who showed that misspecification of
the functional form in single index models does not affect the estimation of average behavioral derivatives.
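When $X_2$ is omitted, the naive Wald statistic, based on the naive estimator $\hat{\theta}_1$, is given by
\[ W^{*} = \sqrt{\frac{1}{c_j}\sum_{i=1}^{N}\frac{\hat{\theta}_{1j}^{2}\, g(x_{1i}\hat{\theta}_1)^{2}}{G(x_{1i}\hat{\theta}_1)\,[1 - G(x_{1i}\hat{\theta}_1)]}}. \tag{15} \]
For the logit model it is well known that $g(z) = G(z)[1 - G(z)]$, which implies that statistics (14) and
(15) may be simplified to
\[ W = \sqrt{\frac{1}{c_j}\sum_{i=1}^{N}\tilde{\theta}_{1j}^{2}\, g(x_{1i}\tilde{\theta}_1 + x_{2i}\tilde{\theta}_2)} = \sqrt{\tilde{\theta}_{1j}}\,\sqrt{\frac{1}{c_j}\sum_{i=1}^{N}\frac{\partial G(x_{1i}\tilde{\theta}_1 + x_{2i}\tilde{\theta}_2)}{\partial x_{1j}}} \tag{16} \]
and
\[ W^{*} = \sqrt{\frac{1}{c_j}\sum_{i=1}^{N}\hat{\theta}_{1j}^{2}\, g(x_{1i}\hat{\theta}_1)} = \sqrt{\hat{\theta}_{1j}}\,\sqrt{\frac{1}{c_j}\sum_{i=1}^{N}\frac{\partial G(x_{1i}\hat{\theta}_1)}{\partial x_{1j}}}, \tag{17} \]
respectively, since $\partial G(z)/\partial x_{1j} = \theta_{1j}\, g(z)$. Thus, as both $\partial G(x_1\tilde{\theta}_1 + x_2\tilde{\theta}_2)/\partial x_{1j}$ and
$\partial G(x_1\hat{\theta}_1)/\partial x_{1j}$ converge to the same quantity, $\partial E_{X_2}[G(x_1\theta_1 + x_2\theta_2)]/\partial x_{1j}$, see (12),
and $\tilde{\theta}_{1j}$ and $\hat{\theta}_{1j}$ converge to $\theta_{1j}$ and $\theta_{1j}^{*}$, respectively, it follows from (6), (16) and (17) that
\[ \frac{W^{*}}{W} = \sqrt{\frac{\hat{\theta}_{1j}}{\tilde{\theta}_{1j}}} \to \sqrt{\frac{\sigma}{\sigma_{*}}}. \tag{18} \]
Hence, assuming $F = F^{*}$, in a logit model the naive Wald statistic $W^{*}$ is depressed relative to $W$
by the square root of the attenuation factor that relates $\hat{\theta}_{1j}$ to $\tilde{\theta}_{1j}$.5 This implies that, in fact,
in small samples unobserved heterogeneity may reduce the power of Wald tests. However,
from (18) it is also evident that under neglected heterogeneity the Wald test retains its
consistency.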
In the Monte Carlo study that follows we investigate the size and power properties of
naive Wald statistics under general patterns of heterogeneity.
The data are generated from model (1) with linear index $\alpha_0 + \alpha_1 X_1 + \alpha_2 X_2$,
where $\alpha_0 = 0$, $\alpha_2$ ranges from 0 to 4 in steps of 0.25 and $\alpha_1$ takes different values across the
different experiments. Our aim is to analyze the effects of omitting $X_2$ on the estimation
of $\alpha_1$ and related statistics. Note that $\alpha_2 = 0$ corresponds to the case where there is no
neglected heterogeneity and that larger values of $\alpha_2$ imply a larger amount of heterogeneity.
5
Note that this implies that the same relationship holds for the ratio of the standard errors of the naive
and full-model estimators.
In all experiments, $X_1$ is generated from a mixture of normal distributions, where the
variate is $N(-1, 1)$ with probability 0.7 and $N(2.333, 1)$ with probability 0.3, and $X_2$ is
generated from the $N(0, 1)$, $t_5$, $\mathrm{Exp}(1)$ and $\chi^{2}(1)$ distributions. Both variables are
scaled to have mean zero and variance one. The choice of an asymmetric distribution for $X_1$
was made to avoid the reflection property about the origin that would affect the sampling
distribution of the estimators of $\alpha_1$; see Chesher and Peters (1994) and Chesher (1995) for a
discussion on the design of Monte Carlo simulation studies.
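The regressor design just described can be sketched with the Python standard library as follows. Two points are assumptions made here and not stated in the text: the variables are standardized with their population (not sample) moments, and the exponential distribution has unit rate. The function names are hypothetical.

```python
import math
import random

# Population moments of the mixture 0.7 * N(-1, 1) + 0.3 * N(2.333, 1).
MIX_MEAN = 0.7 * (-1.0) + 0.3 * 2.333
MIX_VAR = 1.0 + 0.7 * (-1.0) ** 2 + 0.3 * 2.333 ** 2 - MIX_MEAN ** 2


def draw_x1(rng: random.Random) -> float:
    """Observed regressor: asymmetric normal mixture, standardized (assumed: population moments)."""
    mu = -1.0 if rng.random() < 0.7 else 2.333
    return (rng.gauss(mu, 1.0) - MIX_MEAN) / math.sqrt(MIX_VAR)


def draw_x2(rng: random.Random, dist: str) -> float:
    """Omitted regressor from one of the four distributions, standardized to mean 0, variance 1."""
    if dist == "normal":
        return rng.gauss(0.0, 1.0)
    if dist == "t5":
        chi2_5 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(5))
        t = rng.gauss(0.0, 1.0) / math.sqrt(chi2_5 / 5.0)
        return t / math.sqrt(5.0 / 3.0)  # Var(t_5) = 5 / 3
    if dist == "exponential":
        return rng.expovariate(1.0) - 1.0  # unit rate (assumed): mean 1, variance 1
    if dist == "chi2":
        return (rng.gauss(0.0, 1.0) ** 2 - 1.0) / math.sqrt(2.0)  # chi2(1): mean 1, variance 2
    raise ValueError(f"unknown distribution: {dist}")


rng = random.Random(2009)
sample = [draw_x1(rng) for _ in range(20000)]
mean = sum(sample) / len(sample)
var = sum((v - mean) ** 2 for v in sample) / len(sample)
print(round(mean, 3), round(var, 3))
```

With a large number of draws the sample mean and variance of the observed regressor should be close to zero and one, which is a quick check that the standardization constants are right.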
We generate $Y$ as a Bernoulli (binary case) or a beta (fractional case) variate with mean
given by the logit, probit, or loglog functional form and the shape parameter of the beta
distribution fixed at one.6 In the former case, the parameters of interest are estimated by
maximum likelihood (ML), while in the latter we use the quasi-maximum likelihood (QML)
method; these are the standard ways of dealing with each type of data. In both cases, we
estimate full and curtailed versions of the models, i.e. models with and without $X_2$. The full
version of the model yields consistent estimators for all the quantities of interest and, hence,
it will be used as a reference to evaluate the consequences of neglected heterogeneity.
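To make the binary logit case concrete, the sketch below generates one artificial sample with a normally distributed omitted variable, fits the curtailed logit model by a hand-written Newton-Raphson routine, and reports the ratio of the naive slope estimate to the true slope. It is only an illustration under stated assumptions: a single large sample (5000 observations) is used to approximate the limit of the naive estimator instead of repeating small-sample experiments, the seed and parameter values are arbitrary, and all function names are hypothetical.

```python
import math
import random

MIX_MEAN = 0.7 * (-1.0) + 0.3 * 2.333
MIX_SD = math.sqrt(1.0 + 0.7 + 0.3 * 2.333 ** 2 - MIX_MEAN ** 2)


def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))


def fit_curtailed_logit(x, y, max_iter=25, tol=1e-10):
    """ML for Pr(y = 1 | x) = logistic(a0 + a1 * x) by Newton-Raphson."""
    a0 = a1 = 0.0
    for _ in range(max_iter):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = logistic(a0 + a1 * xi)
            w = p * (1.0 - p)
            r = yi - p
            s0 += r           # score, intercept
            s1 += r * xi      # score, slope
            h00 += w          # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * s0 - h01 * s1) / det   # Newton step = information^{-1} * score
        d1 = (h00 * s1 - h01 * s0) / det
        a0 += d0
        a1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return a0, a1


def naive_slope_ratio(alpha1=1.0, alpha2=2.0, n=5000, seed=2009):
    """Ratio of the naive slope estimate to alpha1 when x2 is omitted."""
    rng = random.Random(seed)
    x1, y = [], []
    for _ in range(n):
        mu = -1.0 if rng.random() < 0.7 else 2.333
        x1_i = (rng.gauss(mu, 1.0) - MIX_MEAN) / MIX_SD
        x2_i = rng.gauss(0.0, 1.0)                   # omitted variable (normal case only)
        p = logistic(alpha1 * x1_i + alpha2 * x2_i)  # alpha0 = 0
        x1.append(x1_i)
        y.append(1.0 if rng.random() < p else 0.0)
    _, a1_hat = fit_curtailed_logit(x1, y)
    return a1_hat / alpha1


print("no heterogeneity:", round(naive_slope_ratio(alpha2=0.0), 3))
print("alpha2 = 2:      ", round(naive_slope_ratio(alpha2=2.0), 3))
```

Without heterogeneity the ratio should be close to one, and with a sizeable coefficient on the omitted variable it should fall clearly below one, in line with the attenuation discussed in Section 2.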
All experiments were repeated 5000 times using the statistical package and, given the
substantial amount of results produced in each experiment, we summarized them in figures.
In most cases (the only exceptions are the experiments regarding the Wald tests), given the
similarity of the results obtained, only those relative to binary models are reported.7 Apart
from the last experiment, where several sample sizes were considered, in all the remaining
cases the sample size is $N = 200$.
Under some special conditions, we proved above that an attenuation bias is imposed by
neglected heterogeneity on naive estimation of the parameters of the observed regressors.
As in this Monte Carlo study we consider only one observed covariate, according to the
findings of Neuhaus and Jewell (1993), we know for sure that an attenuation bias will be
present in all the models simulated. However, this bias may differ substantially from that
predicted by (6), since the assumptions made in its derivation are not met in 11 out of the
12 models simulated. Therefore, the main aim of our first set of experiments is to examine
whether equation (6) measures appropriately the extent of the bias caused by neglected
heterogeneity when $F \neq F^{*}$. Figure 1 displays the values of the ratio $\hat{\alpha}_1/\alpha_1$ for two different
values of $\alpha_1$ (-1 and 1) for each one of the 17 values of $\alpha_2$ simulated. In this figure we also display
(solid line) the value of the ratio $\alpha_1^{*}/\alpha_1$, obtained from (6).
6
See inter alia Ramalho, Ramalho and Murteira (2009) for the mean-dispersion parametrization of the
beta distribution used in the generation of data.
7
Full results are available from the authors upon request.
Clearly, in all cases, $\hat{\alpha}_1$ is depressed towards zero, its absolute bias increasing as $\alpha_2$ (i.e.
the extent of heterogeneity) increases. Equation (6) often gives a very good approximation
to the attenuation bias (e.g. loglog and, obviously, probit models with normally distributed
heterogeneity and logit model with $t_5$-distributed heterogeneity) but in some cases there
are some important deviations. For example, when $X_2$ has an exponential or chi-square
distribution $\hat{\alpha}_1$ is not, in general, as biased as predicted by (6) in the logit and probit
models, while for the loglog model the attenuation effect is amplified relative to (6). Note
also that in some cases the actual bias depends on the value of $\alpha_1$, while (6) is not a function
of that parameter. Therefore, as the extent of that bias is not perfectly approximated by
(6) in many cases, next we investigate the consequences of this fact for the calculation of
marginal effects and prediction of outcomes when $F \neq F^{*}$.
Using the same setup as in the previous section, in Figure 2 we display the mean across the
replications of the estimated $ASE$ for the case $\alpha_1 = 1$. For the curtailed model we estimate
the $ASE$ as in (9), while for the full model we use (7), where the expectation $E(Y|X_1)$ is
calculated by integration as in (2) with $f_2(x_2)$ replaced by the density used to generate $X_2$.
This figure shows clearly that in the logit case ML estimation based on the full (MLf) or the
curtailed (MLc) equations leads to very similar results (the largest bias is 3.6% for $\alpha_2 = 2.75$
in the chi-square case). Thus, as already noted by Cramer (2007), logit analysis of the $ASE$
is very robust to neglected heterogeneity.
Figure 2 about here
When we consider the evaluation of the $PPE$ at the mean of $X_1$, the bias of the various
estimators is much smaller. For example, for the symmetric $X_2$ case, the maximum bias in
the loglog model is now 4.2%. Nevertheless, for the chi-square case the bias may still be
substantial: the maximum bias for the logit, probit and loglog models is, respectively, 9.8%,
21.4% and 25.1%.
Overall, the results obtained in this section allow us to draw three main conclusions.
First, the logit model produces more robust estimates of partial effects than probit or loglog
models. Second, when our interest is the calculation of average partial effects, which is usually
the case in empirical work (in most cases, practitioners report only average partial effects), it
is preferable to compute $ASE$s instead of $PPE$s evaluated at the mean of the regressors, since
the former appear to be much more robust to neglected heterogeneity.8 Finally, under
neglected heterogeneity, computation of $PPE$s for an individual with specific characteristics
may be very unreliable.
Figure 4 illustrates the effects of the omission of $X_2$ on the prediction of $E(Y|X_1)$ through
a simulation design similar to that used for the $PPE$s. For the full model the prediction is
based on (19), while for the curtailed equation we used the naive estimator $G(\hat{\alpha}_0 + \hat{\alpha}_1 x_1)$.
Clearly, unobserved heterogeneity is relatively harmless in logit models: the maximum
bias in the 0.05-0.95 quantile range is 5.0% ($\alpha_2 = 2$). The probit model is also robust
to the omission of variables when the distribution of $X_2$ is symmetric, but displays more
important distortions in cases where $X_2$ is asymmetric (maximum bias: 15.8% for $\alpha_2 = 2$).
Finally, the loglog model is relatively robust to unobserved heterogeneity when $X_2$ has a
normal distribution but displays some bias in the other cases, achieving a maximum bias of
23.7% ($\alpha_2 = 2$). Hence, for outcome prediction, unobserved heterogeneity resulting from the
omission of independent explanatory variables seems to be a negligible issue only in
logit models. Nevertheless, note that our results suggest that when $F \neq F^{*}$, the consequences
of using a misspecified model for $G^{*}$ are much more serious for calculation of $PPE$s (which require
the computation of derivatives of $G^{*}$) than for outcome prediction.
8
A similar finding was reported by Ramalho, Ramalho and Murteira (2009), who found that computation
of $ASE$s is relatively robust to functional form misspecification in the framework of fractional regression
models, while estimation of $PPE$s evaluated at the mean of the covariates may be severely biased.
Figure 4 about here
3.4 Size and power of Wald tests for the significance of observed
regressors
In our final set of experiments we investigate the size and power of naive (Q)ML-based Wald
tests for assessing the statistical significance of observed regressors, i.e. we examine their
ability to test the null hypothesis $H_0: \alpha_1 = 0$ both when it is true and when it is false. Figures
5-6 display the percentage of rejections of $H_0$ for a nominal level of 5% when this hypothesis
is indeed true (the horizontal lines represent the limits of a 95% confidence interval for the
nominal size). This percentage is very similar for the curtailed and full models in the binary
case, being always very near to the nominal level of 5%. For fractional data, where we use
robust estimation of standard errors since we are performing QML estimation, the empirical
size of the Wald test based on the naive estimator is even closer to the nominal size than
that based on the full equation. Therefore, these results show clearly that the size properties
of the Wald test for $\alpha_1 = 0$ are very robust to the presence of neglected heterogeneity.
With regard to the power properties of the Wald test, Figures 7-8 illustrate a very different
scenario. In this case, we observe an important decay in the percentage of rejections of the
false $H_0$ as the level of heterogeneity increases. This decay seems to be more substantial, in
relative terms, in the probit and loglog models, in cases where $\alpha_1$ is larger, and with fractional
data.
In order to check whether equation (18), which was derived for binary logit models under
the assumption $F = F^{*}$, also provides a good approximation for other models, in Figures
9-10 we represent three ratios: that given by (18) (solid line) and two others that are
given by the mean across replications of that ratio for the two values of $\alpha_1$ simulated.
Figure 9 about here
Figure 10 about here
For binary models, see Figure 9, equation (18) seems to be a reasonable approximation.
In fact, comparing Figures 1 and 9, a very similar pattern was obtained. In contrast, for
fractional regression models the attenuation bias in the estimation of the Wald statistic is
much larger, which explains why the loss of power detected in Figure 8 is more substantial for
these models. Clearly, equation (18) is not a good approximation when robust sandwich-type
variance estimators are used.
A further investigation of the power of naive Wald tests was conducted. For the
chi-square distribution and $\alpha_1 = 0.15$ only, the combination that led to the poorest power
performance of all the cases illustrated in the previous figures, we ran experiments for
$N \in \{200, 500, 1000, 2500, 5000\}$. Figure 11 shows that in all cases the power of the test increases
substantially as $N$ increases, which confirms that the Wald test is still consistent in the presence
of omitted variables, as discussed in Section 2. Given these results, it seems that we can trust
the outcome of a naive Wald test that reveals that a given explanatory variable is significant.
The opposite conclusion may be simply the consequence of the omission of relevant variables,
unless the sample size is large and/or the amount of heterogeneity is small.
4 Conclusion
It is well known that the omission of orthogonal relevant variables in nonlinear models causes
inconsistency in the estimation of the parameters of interest associated with the included
regressors. However, some recent work on the probit and logit models by Wooldridge (2002,
2005) and Cramer (2003, 2007), respectively, shows that, in some cases, the bias does not
carry over to the marginal effect of those regressors on the outcome and that, hence, neglected
heterogeneity may not be really an issue in, at least, binary logit and probit models. In this
paper, we demonstrated analytically that, under similar assumptions to those imposed by
those authors, their results can be extended to any other model for binary data. Moreover,
we showed that, while other features like outcome prediction are also robust to neglected
heterogeneity, Wald tests for the individual significance of an included covariate are biased
towards the non-rejection of the null hypothesis of non-significance.
Given that the theoretical analysis undertaken in this paper requires strong assumptions,
we also performed an extensive Monte Carlo simulation study considering more general forms
of heterogeneity. We found that, in general, unobserved heterogeneity independent of the
included covariates: (i) produces an attenuation bias in the estimation of regression coefficients;
(ii) is relatively innocuous for logit estimation of the $ASE$, while in the probit and
loglog cases there may be important biases in its estimation; (iii) has much more destructive
effects on the estimation of $PPE$s than $ASE$s; (iv) leaves the prediction of outcomes
substantially unaffected only for logit models; and (v) is innocuous for the size and the consistency
of Wald tests for the significance of the observed regressors but, in small samples, reduces
their power substantially.
Overall, our results imply that unobserved heterogeneity is not a relevant problem in
any of the nonlinear models considered in this paper if the aim of the analysis is simply
obtaining the direction of the partial effects of the covariates. In addition, in the logit
case, neglected heterogeneity is also relatively innocuous for outcome prediction and the
calculation of $ASE$s.9 These are, we think, very comforting and useful results for practitioners
since the usual ways of dealing with unobserved heterogeneity are not entirely satisfactory:
they either require strong distributional assumptions for the unobservables, which often give rise to
a model that does not describe the data properly, or are too complex to be widely used
by applied economists, often requiring nonparametric techniques which
frequently cannot be computed without substantial programming experience.
Another important implication of our results is that it is extremely important to test
the general specification of the functional form adopted for the model.10 Indeed, if the test
indicates that the functional form of our binary regression model is correctly specified (which
means that $F = F^{*}$), then we know that calculation of partial effects and outcome prediction
is not affected by the presence of neglected heterogeneity. In such a case, the only relevant
problem that remains is the poor power of the Wald test in small samples. However, if all
variables are statistically significant or the sample is very large, then even that is not really
a problem.
9
Note that this unique property of robustness of the logit model is not totally unexpected. In fact, this
model is also robust to other problems, like endogenous stratification and nonignorable missing data, that,
in general, cause the inconsistency of the estimators based on other models; see, respectively, Hsieh, Manski
and McFadden (1985) and Ramalho and Smith (2003).
10
For a comparison of various functional form tests for binary and fractional regression models see Ramalho,
Ramalho and Murteira (2009) and Ramalho and Ramalho (2009).
References
Chesher, A. (1995), “A mirror image invariance for m-estimators”, Econometrica, 63(1),
207-211.
Chesher, A. and Peters, S. (1994), “Symmetry, regression design, and sampling distribu-
tions”, Econometric Theory, 10, 116-129.
Cramer, J.S. (2003), Logit Models from Economics and Other Fields, Cambridge, Cambridge
University Press.
Cramer, J.S. (2007), “Robustness of logit analysis: unobserved heterogeneity and misspecified
disturbances”, Oxford Bulletin of Economics and Statistics, 69(4), 545-555.
Hsieh, D.A., Manski, C.F. and McFadden, D. (1985), “Estimation of response probabilities
from augmented retrospective observations”, Journal of the American Statistical
Association, 80, 651-662.
Lagakos, S.W. and Schoenfeld, D.A. (1984), “Properties of proportional-hazards score tests
under misspecified regression models”, Biometrics, 40, 1037-1048.
Lee, L.F. (1982), “Specification error in multinomial logit models”, Journal of Econometrics,
20, 197-209.
Neuhaus, J.M. and Jewell, N.P. (1993), “A geometric approach to assess bias due to omitted
covariates in generalized linear models”, Biometrika, 80, 807-815.
Papke, L.E. and Wooldridge, J.M. (1996), “Econometric methods for fractional response
variables with an application to 401(k) plan participation rates”, Journal of Applied
Econometrics, 11(6), 619-632.
Ramalho, E.A., and Ramalho, J.J.S. (2009), “Alternative versions of the RESET test for
binary response index models: a comparative study”, mimeo.
Ramalho, E.A., Ramalho, J.J.S. and Murteira, J. (2009), “Alternative estimating and testing
empirical strategies for fractional regression models”, Journal of Economic Surveys,
forthcoming.
Ramalho, E.A., and Smith, R.J. (2003), “Discrete Choice Nonresponse”, Centre for Microdata
Methods and Practice, I.F.S. and U.C.L., http://cemmap.ifs.org.uk/wps/cwp0307.pdf
Wooldridge, J.M. (2002), Econometric Analysis of Cross Section and Panel Data, Cambridge,
MIT Press.
Wooldridge, J.M. (2005), “Unobserved heterogeneity and estimation of average partial effects”,
in D.W.K. Andrews and J.H. Stock (eds.) Identification and Inference for Econometric
Models, Cambridge, Cambridge University Press, 27-55.
Figure 1: Attenuation bias of parameter estimates in binary regression models
[Plot not reproduced. Panel rows: Logit model, Probit model, Loglog model, four panels each. Horizontal axis: α2 (0 to 4). Vertical axis: attenuation ratio for the α1 estimates (0.2 to 1.0). Curves: α1 = −1 and α1 = 1.]
Figure 2: Average sample effects for binary regression models (α1 = 1)
[Plot not reproduced. Four panels. Horizontal axis: α2 (0 to 4). Vertical axis: ASE. Curves: probit, loglog and logit models, each under MLc and MLf.]
Figure 3: Population partial effects for binary regression models (α1 = 1)
[Plot not reproduced. Panel rows: Normal-distributed heterogeneity, Chi-square-distributed heterogeneity. Panel columns: Logit, Probit, Loglog. Horizontal axis: X1 quantiles. Vertical axis: PPE. Curves: MLc and MLf for α2 = 0.5, 1, 2 and 4.]
Figure 4: Predicted outcomes for binary regression models (α1 = 1)
[Plot not reproduced. Panel rows: Normal-distributed heterogeneity, Chi-square-distributed heterogeneity. Panel columns: Logit, Probit, Loglog. Horizontal axis: X1 quantiles. Vertical axis: predicted outcomes. Curves: MLc and MLf for α2 = 0.5, 2 and 4.]
Figure 5: Empirical size for binary regression models (N = 200)
[Plot not reproduced. Panel rows: Logit model, Probit model, Loglog model, four panels each. Horizontal axis: α2 (0 to 4). Vertical axis: empirical size. Curves: MLc and MLf.]
Figure 6: Empirical size for fractional regression models (N = 200)
[Plot not reproduced. Panel rows: Logit model, Probit model, Loglog model, four panels each. Horizontal axis: α2 (0 to 4). Vertical axis: empirical size. Curves: QMLc and QMLf.]
Figure 7: Empirical power for binary regression models (N = 200)
[Plot not reproduced. Panel rows: Logit model, Probit model, Loglog model, four panels each. Horizontal axis: α2 (0 to 4). Vertical axis: empirical power. Curves: MLc and MLf for α1 = 0.15 and α1 = 0.3.]
Figure 8: Empirical power for fractional regression models (N = 200)
[Plot not reproduced. Panel rows: Logit model, Probit model, Loglog model, four panels each. Horizontal axis: α2 (0 to 4). Vertical axis: empirical power. Curves: QMLc and QMLf for α1 = 0.15 and α1 = 0.3.]
Figure 9: Attenuation bias of Wald statistics in binary regression models
[Plot not reproduced. Panel rows: Logit model, Probit model, Loglog model, four panels each. Horizontal axis: α2 (0 to 4). Vertical axis: ratio Wn/W (0.2 to 1.0). Curves: α1 = 0.15 and α1 = 0.3.]
Figure 10: Attenuation bias of Wald statistics in fractional regression models
[Plot not reproduced. Panel rows: Logit model, Probit model, Loglog model, four panels each. Horizontal axis: α2 (0 to 4). Vertical axis: ratio Wn/W (0.2 to 1.0). Curves: α1 = 0.15 and α1 = 0.3.]
Figure 11: Empirical power, different sample sizes (chi-square-distributed heterogeneity; α1 = 0.15)
[Plot not reproduced. Two rows of three panels. Horizontal axis: α2 (0 to 4). Vertical axis: empirical power (0.0 to 1.0). Curves: N = 200, 500, 1000, 2500 and 5000.]